Scalar Arithmetic Multiple Data Customizable Precision for Deep Neural Networks
Andrew Anderson and Michael Doyle and David Gregg
Lero, Trinity College Dublin {aanderso,mjdoyle,dgregg}@tcd.ie
ARITH, Kyoto, June 2019
DNN Convolution
Figure: Multi-channel multi-kernel convolution
for (unsigned m = 0; m < kernels; m++)
  for (unsigned h = 0; h < img_h / stride_h; h++)
    for (unsigned w = 0; w < img_w / stride_w; w++)
      for (unsigned c = 0; c < channels; c++)
        for (unsigned y = 0; y < k; y++)
          for (unsigned x = 0; x < k; x++)
            /* The accumulation target was lost in extraction; output[m][h][w] is assumed. */
            output[m][h][w] += (input[c][((h * stride_h) + y) - (k / 2)][((w * stride_w) + x) - (k / 2)] * kernel[m][c][y][x]);
Figure: But we want to use them on this!
Figure: uint4_t expanded to uint8_t (diagram labels: 4xuint8_t, 4xuint4_t)
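The widening shown in this figure can be sketched in plain C. This is a minimal illustration, not code from the talk: the function name expand_u4x4_to_u8x4 and the lane order (element 0 in the lowest nibble) are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical helper: expand four 4-bit values packed in a uint16_t
   (element 0 in the low nibble) into four 8-bit lanes of a uint32_t. */
uint32_t expand_u4x4_to_u8x4(uint16_t packed)
{
    uint32_t x = packed;
    x = (x | (x << 8)) & 0x00FF00FFu;  /* move the upper byte 16 bits up */
    x = (x | (x << 4)) & 0x0F0F0F0Fu;  /* move each upper nibble into its own byte */
    return x;
}
```

For instance, the packed value 0xABCD expands to 0x0A0B0C0D: each 4-bit value now occupies a full byte, which is the storage overhead the figure illustrates.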
Figure: Example SWAR operation (diagram labels: uint16_t, 4xuint4_t)
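One common way to write a SWAR (SIMD within a register) addition on four 4-bit lanes held in a uint16_t is a masked add followed by an xor. The sketch below is a generic formulation and may differ in detail from the operation drawn on the slide; the name swar_add_u4x4 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: lane-wise addition (modulo 16) of four 4-bit
   values packed in a uint16_t, without carries crossing lanes. */
uint16_t swar_add_u4x4(uint16_t a, uint16_t b)
{
    const uint16_t top = 0x8888;   /* most significant bit of each lane */
    const uint16_t low = 0x7777;   /* remaining three bits of each lane */

    /* Masked add: with the top bits cleared, a carry out of the low three
       bits lands in the lane's own top bit and goes no further. */
    uint16_t sum = (uint16_t)((a & low) + (b & low));

    /* Xor: fold the original top bits back in (addition without carry-out). */
    return (uint16_t)(sum ^ ((a ^ b) & top));
}
```

With a = 0x00FF and b = 0x0011 the two low lanes each wrap from 15 + 1 to 0, so the result is 0x0000, whereas an ordinary 16-bit add would give 0x0110 because the carries leak into neighbouring lanes.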
Figure: Spacer bits (diagram labels: masked add, xor)
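The effect of a spacer bit can be sketched as follows: if one zero bit is kept between adjacent lanes, an ordinary integer add can be used, because a carry out of a lane lands in the spacer bit and is then masked away. The layout here (three 4-bit lanes in a uint16_t with a spacer above each lane) and the names LANE_BITS, pack3_u4_spaced and add_spaced are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Assumed layout: value bits at [0..3], [5..8], [10..13];
   spacer bits at 4, 9 and 14. */
#define LANE_BITS 0x3DEFu

/* Hypothetical helper: pack three 4-bit values with a spacer bit above each. */
uint16_t pack3_u4_spaced(unsigned v0, unsigned v1, unsigned v2)
{
    return (uint16_t)((v0 & 0xFu) | ((v1 & 0xFu) << 5) | ((v2 & 0xFu) << 10));
}

/* Lane-wise addition (modulo 16): a plain add, then clear the spacer bits
   that absorbed any carries. */
uint16_t add_spaced(uint16_t a, uint16_t b)
{
    return (uint16_t)((a + b) & LANE_BITS);
}
```

Adding the lanes (15, 1, 9) and (1, 2, 9) this way yields (0, 3, 2). The trade-off is density: the spacer bits cost one extra bit per lane in exchange for a cheaper add.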
Figure: Convolutional substructure in scalar integer multiplication (diagram: an unsigned integer multiply of kernel values k1, k2, k3 with inputs i0, i1, i2, i3 forms the partial products k1i0 ... k3i3, which are added; labels: uint32_t 4xuint4_t, uint64_t 8xuint8_t)
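The idea in this figure can be checked with a small sketch: packing values into lanes turns each operand into a polynomial in 2^LANE, so a single integer multiply produces every output of a 1D full convolution at once. The 10-bit lane width is an assumption made here so that the worst-case sum of three products of 4-bit values (3 * 15 * 15 = 675) fits in one lane; the figure itself labels 8-bit lanes. All function names are hypothetical.

```c
#include <stdint.h>

enum { LANE = 10 };  /* assumed lane width: 675 < 2^10, and 6 lanes fit in 64 bits */

/* Pack n small unsigned values into LANE-bit lanes, element 0 lowest. */
uint64_t pack_lanes(const unsigned *v, int n)
{
    uint64_t word = 0;
    for (int j = 0; j < n; j++)
        word |= (uint64_t)v[j] << (LANE * j);
    return word;
}

/* Read lane n of a packed word. */
unsigned lane(uint64_t word, int n)
{
    return (unsigned)((word >> (LANE * n)) & ((1u << LANE) - 1u));
}

/* One scalar multiply: lane n of the result holds the sum over j of
   in[j] * k[n - j], for a 4-element input and a 3-tap kernel. */
uint64_t conv_by_multiply(const unsigned in[4], const unsigned k[3])
{
    return pack_lanes(in, 4) * pack_lanes(k, 3);
}
```

With inputs (1, 2, 3, 4) and kernel (5, 6, 7), lanes 0 to 5 of the product hold 5, 16, 34, 52, 45 and 28. The multiply computes a true convolution, so a DNN-style correlation would pack the kernel taps in reverse order.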
Figure: Convolution (diagram labels repeat those of the previous figure)
Figure: Performance with Temporary Spacer bits (plot: SAMD Convolution with Temporary Spacer Bits, ARM Cortex A-57; y-axis: Execution Time (ns), 5x10^8 to 3x10^9; layers: conv3-1, conv3-2, conv4-1, conv4-2, conv4-3; series: direct-sum2d, SAMD8, SAMD7, SAMD6, SAMD5, SAMD4, SAMD3, SAMD2)
Figure: Performance with Permanent Spacer bits (plot: SAMD Convolution with Permanent Spacer Bits, ARM Cortex A-57; y-axis: Execution Time (ns), 5x10^8 to 2.5x10^9; layers: conv3-1, conv3-2, conv4-1, conv4-2, conv4-3; series: direct-sum2d, SAMD8, SAMD7, SAMD6, SAMD5, SAMD4, SAMD3, SAMD2)