Scalar Arithmetic Multiple Data Customizable Precision for Deep - - PowerPoint PPT Presentation





SLIDE 1

Scalar Arithmetic Multiple Data Customizable Precision for Deep Neural Networks

Andrew Anderson and Michael Doyle and David Gregg

Lero, Trinity College Dublin {aanderso,mjdoyle,dgregg}@tcd.ie

ARITH, Kyoto June 2019

SLIDE 2

DNN Convolution

Figure: Multi-channel multi-kernel convolution

SLIDE 3

DNN Convolution

for (unsigned m = 0; m < kernels; m++)
  for (unsigned h = 0; h < img_h / stride_h; h++)
    for (unsigned w = 0; w < img_w / stride_w; w++)
      for (unsigned c = 0; c < channels; c++)
        for (unsigned y = 0; y < k; y++)
          for (unsigned x = 0; x < k; x++)
            output[m][h][w] +=
              input[c][((h * stride_h) + y) - (k / 2)]
                      [((w * stride_w) + x) - (k / 2)]
              * kernel[m][c][y][x];

SLIDE 4

Quantized Arithmetic

DNN weights occupy huge amounts of space in FP32

VGG-19 Network: 548 MB

Figure: But we want to use them on this!

OpenMV Cam – 512 KB RAM, 2 MB ROM, 216 MHz Cortex-M7
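As a rough sketch of why narrower weights matter here, the following scales the 548 MB figure by the number of bits per weight. It assumes storage is simply proportional to weight width (no packing overhead or metadata), and `scaled_size_mb` is a hypothetical helper name used only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: estimate model size at `bits` bits per weight,
   assuming size scales linearly from the 32-bit floating point size. */
static double scaled_size_mb(double fp32_mb, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    return fp32_mb * (double)bits / 32.0;
}
```

Under this linear assumption, `scaled_size_mb(548.0, 8)` gives 137 MB and `scaled_size_mb(548.0, 4)` gives 68.5 MB; even 1-bit weights come to about 17 MB, still more than the 2 MB of ROM listed above.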

SLIDE 5

Quantized Arithmetic

In Deep Learning we have it very easy!

◮ Network training compensates for arithmetic error
◮ Often, noisy arithmetic actually helps! (with overfitting)

Lots of research about how harshly DNN weights can be quantized

◮ Can go to integer (eventually!)
◮ Can go down to one (1) bit (’binarized’ nets)
◮ But we don’t want to do all our work on FPGA...
◮ In fact, commodity hardware is ideal.

SLIDE 6

The Simple Approach

Convert to native arithmetic

4 x uint4_t expanded to 4 x uint8_t:

00000010 00001110 00001001 00000101
00000101 00001101 00000010 00001011

Figure: uint4_t expanded to uint8_t

◮ Can use native SIMD
◮ Space overhead only in registers (not memory)
◮ Extra precision in intermediate results (for free)
◮ Easy to mix and match number formats (e.g. uint6_t + uint4_t)
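A minimal sketch of this simple approach, assuming four 4-bit values are stored in a `uint16_t` with lane 0 in the low nibble and are widened into the four bytes of a `uint32_t` before arithmetic. The function names are hypothetical, and a real implementation would likely use native SIMD unpack instructions rather than a scalar loop.

```c
#include <assert.h>
#include <stdint.h>

/* Read one 4-bit lane of the packed storage format. */
static uint32_t u4_lane(uint16_t packed, unsigned lane) {
    assert(lane < 4);
    return (uint32_t)(packed >> (4 * lane)) & 0xFu;
}

/* Expand four packed 4-bit values into four 8-bit lanes. */
static uint32_t expand_u4x4_to_u8x4(uint16_t packed) {
    uint32_t wide = 0;
    for (unsigned lane = 0; lane < 4; lane++)
        wide |= u4_lane(packed, lane) << (8 * lane);
    return wide;
}

/* Narrow four 8-bit lanes back to 4-bit storage, keeping the low 4 bits. */
static uint16_t narrow_u8x4_to_u4x4(uint32_t wide) {
    uint16_t packed = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        uint32_t low4 = (wide >> (8 * lane)) & 0xFu;
        packed |= (uint16_t)(low4 << (4 * lane));
    }
    return packed;
}
```

Once widened, two such words can be added with an ordinary 32-bit add: each lane has four spare bits, so sums of 4-bit values cannot carry into a neighbouring lane. The extra width exists only in the register; the narrowing step restores the 4-bit memory format.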

SLIDE 7

Quantized Arithmetic

uint16_t holding 4 x uint4_t:

0010 1110 1001 0101
0101 1101 0010 1011

1 0 1 (carries between adjacent lanes when the two words are added)

Figure: Example SWAR operation.

4 × 4-bit words packed into a 16-bit scalar register
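The packing can be sketched as follows, assuming lane 0 sits in the low nibble; the helper names are hypothetical. The last function is a plain scalar add, included to show that packing alone is not enough: a carry out of one lane changes the lane above it.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 4-bit values into one 16-bit scalar; v[0] goes in the low nibble. */
static uint16_t pack_u4x4(const uint8_t v[4]) {
    uint16_t packed = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        assert(v[lane] <= 0xF);              /* each value must fit in 4 bits */
        packed |= (uint16_t)(v[lane] << (4 * lane));
    }
    return packed;
}

/* Read one 4-bit lane back out. */
static uint8_t get_u4(uint16_t packed, unsigned lane) {
    assert(lane < 4);
    return (uint8_t)((packed >> (4 * lane)) & 0xFu);
}

/* Plain scalar add: carries out of a lane leak into the next lane up. */
static uint16_t naive_add_u4x4(uint16_t a, uint16_t b) {
    return (uint16_t)(a + b);
}
```

For example, packing {5, 9, 14, 2} and {11, 2, 13, 5} and adding them naively leaves lane 1 holding 12 rather than 9 + 2 = 11, because 5 + 11 overflowed lane 0 and carried upward.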

SLIDE 8

SIMD Within A Register (SWAR)

Dealing with overflow

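X010 X110 X001 X101   first operand, top bit of each lane masked off
X101 X101 X010 X011   second operand, top bit of each lane masked off
0XXX 1XXX 1XXX 0XXX   top bits of the first operand
0XXX 1XXX 0XXX 1XXX   top bits of the second operand
0XXX 0XXX 1XXX 1XXX   xor of the top bits
0111 1011 0011 1000   masked add
0111 1011 1011 0000   xor of the masked add with the top-bit xor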

Figure: Spacer bits

Temporary spacer bits are spacer bits in intermediate values that don’t get written to the data format in memory.
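A minimal sketch of a lane-wise add that uses the top bit of each lane as a temporary spacer, assuming four 4-bit lanes in a `uint16_t`. This is the standard SWAR masked-add formulation, not necessarily the authors' exact code; `swar_add_u4x4`, `get_u4` and `LANE_MSBS` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define LANE_MSBS 0x8888u   /* top bit of each 4-bit lane */

/* Lane-wise add modulo 16. Clearing the top bit of every lane gives each
   3-bit partial sum room to carry without crossing into the next lane;
   the top bits are then restored with xor. */
static uint16_t swar_add_u4x4(uint16_t a, uint16_t b) {
    uint16_t low_sum = (uint16_t)((a & ~LANE_MSBS) + (b & ~LANE_MSBS));
    uint16_t msb_xor = (uint16_t)((a ^ b) & LANE_MSBS);
    return (uint16_t)(low_sum ^ msb_xor);
}

/* Read one 4-bit lane of a packed word. */
static uint8_t get_u4(uint16_t packed, unsigned lane) {
    assert(lane < 4);
    return (uint8_t)((packed >> (4 * lane)) & 0xFu);
}
```

Every lane of the result equals the sum of the corresponding input lanes modulo 16, and the spacer positions hold ordinary data again once the final xor has run, so nothing extra is written to memory.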

SLIDE 9

SIMD Within A Register (SWAR)

unsigned integer multiply

Operands: i0 i1 i2 i3 and k1 k2 k3 (labels: uint32_t 4 x uint4_t, uint64_t 8 x uint8_t)

Partial products (summed with +):
k3i0 k3i1 k3i2 k3i3
k2i0 k2i1 k2i2 k2i3
k1i0 k1i1 k1i2 k1i3

Figure: Convolutional substructure in scalar integer multiplication

Long multiplication is discrete convolution over digit sequences
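To make this concrete, here is a sketch that computes a full 1-D convolution of four 4-bit inputs with three 4-bit kernel taps using a single 64-bit multiply. The lane width of 10 bits is an assumption of this sketch, chosen so that a column of three products (at most 3 x 15 x 15 = 675) cannot carry into the next lane; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LANE_BITS 10u                       /* assumed lane width */
#define LANE_MASK ((1u << LANE_BITS) - 1u)

/* Place n 4-bit values into consecutive LANE_BITS-wide lanes, v[0] lowest. */
static uint64_t pack_lanes(const uint8_t *v, unsigned n) {
    uint64_t packed = 0;
    assert(n <= 6);                         /* 6 lanes x 10 bits fit in 64 bits */
    for (unsigned j = 0; j < n; j++) {
        assert(v[j] <= 0xF);                /* 4-bit operands only */
        packed |= (uint64_t)v[j] << (LANE_BITS * j);
    }
    return packed;
}

/* Full convolution via one multiply:
   out[n] = sum over a + b == n of in[a] * k[b]. */
static void conv_by_multiply(const uint8_t in[4], const uint8_t k[3],
                             uint16_t out[6]) {
    uint64_t product = pack_lanes(in, 4) * pack_lanes(k, 3);
    for (unsigned n = 0; n < 6; n++)
        out[n] = (uint16_t)((product >> (LANE_BITS * n)) & LANE_MASK);
}
```

With inputs {1, 2, 3, 4} and kernel {1, 1, 1}, the six output lanes hold 1, 3, 6, 9, 7, 4. Note that the multiply yields a true convolution; the sliding dot product used in DNN layers corresponds to packing the kernel taps in reverse order.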

SLIDE 10

SIMD Within A Register (SWAR)

unsigned integer multiply

Operands: i0 i1 i2 i3 and k1 k2 k3 (labels: uint32_t 4 x uint4_t, uint64_t 8 x uint8_t)

Partial products (summed with +):
k3i0 k3i1 k3i2 k3i3
k2i0 k2i1 k2i2 k2i3
k1i0 k1i1 k1i2 k1i3

Figure: Convolution

k × i subword multiplies and (k − 1) × (i − 1) additions with a single instruction
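The count can be checked with a short derivation. The symbols $x_a$ (input subwords), $w_b$ (kernel subwords) and $B$ (two raised to the lane width) are notation introduced here, and the derivation assumes no column sum overflows its lane.

```latex
\Big(\sum_{a=0}^{i-1} x_a B^{a}\Big)\Big(\sum_{b=0}^{k-1} w_b B^{b}\Big)
  = \sum_{n=0}^{i+k-2} \Big(\sum_{a+b=n} x_a w_b\Big) B^{n}

% k i products are spread over i + k - 1 output columns.
% A column holding m products needs m - 1 additions, so in total:
k\,i - (i + k - 1) = (k-1)(i-1)
```

For the figure's case of $k = 3$ kernel values and $i = 4$ inputs, this gives 12 subword multiplies and 6 additions from one multiply instruction.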

SLIDE 11

Results

Plot: SAMD Convolution with Temporary Spacer Bits (ARM Cortex A-57)
Execution Time (ns) axis: 5x10^8 to 3x10^9
Layers: conv3-1, conv3-2, conv4-1, conv4-2, conv4-3
Series: direct-sum2d, SAMD8, SAMD7, SAMD6, SAMD5, SAMD4, SAMD3, SAMD2

Figure: Performance with Temporary Spacer bits

SLIDE 12

Results

Plot: SAMD Convolution with Permanent Spacer Bits (ARM Cortex A-57)
Execution Time (ns) axis: 5x10^8 to 2.5x10^9
Layers: conv3-1, conv3-2, conv4-1, conv4-2, conv4-3
Series: direct-sum2d, SAMD8, SAMD7, SAMD6, SAMD5, SAMD4, SAMD3, SAMD2

Figure: Performance with Permanent Spacer bits

SLIDE 13

Future Work

◮ All-SAMD network (nonlinearities & utility ops)
◮ Codesign HW Integer Support Instructions
◮ GPU (but microcontrollers don’t have GPUs (yet!))

SLIDE 14

Thanks for listening!