SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration



SLIDE 1

SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration

Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, Mark Oskin

SLIDE 2

Approximate Computing

Expose quality-performance trade-offs

SLIDE 3

Approximate Computing

Expose quality-performance trade-offs

❌ Approximate / ✅ Cheap   vs.   ✅ Accurate / ❌ Expensive

SLIDE 4

Approximate Computing

Expose quality-performance trade-offs

Domains include image processing, machine learning, search, physical simulation, multimedia, and more.

✅ Approximate / ✅ Cheap   vs.   ✅ Accurate / ❌ Expensive

SLIDE 5

Neural Acceleration

float foo (float a, float b) { … return val; }

[Diagram: the approximable function is transformed (approximation) and offloaded (acceleration) from the ARM core to an NPU on the FPGA]

SLIDE 6

Neural Acceleration

float foo (float a, float b) { … return val; }

[Diagram: two options for neural acceleration — prior work (Esmaeilzadeh et al. [MICRO 2012]) integrates the NPU into the CPU pipeline stages, while SNNAP places the NPU on the FPGA fabric of a CPU+FPGA SoC]
SLIDE 7

SNNAP

A neural processing unit on off-the-shelf programmable SoCs

3.8x speedup and 2.8x efficiency gains

Offers an alternative to HLS tools for neural acceleration

float foo (float a, float b) { … return val; }

[Diagram: the approximable function is offloaded from the ARM core to the NPU on the FPGA]

SLIDE 8

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 9

Background: Compilation

1. Region detection — code annotation, then region detection & program instrumentation
2. ANN training — training data collected from the instrumented program drives back-propagation & topology search
3. Code generation — binary generation targeting the CPU and SNNAP

SLIDE 10

Programming Model

float sobel (float* p);
. . .
Image src;
Image dst;
while (true) {
    src = read_from_camera();
    for (y = 0; y < h; ++y) {
        for (x = 0; x < w; ++x) {
            dst.p[y][x] = sobel(&src.p[y][x]);
        }
    }
    display(dst);
}

SLIDE 11

Programming Model

APPROX float sobel (APPROX float* p);
. . .
APPROX Image src;
APPROX Image dst;
while (true) {
    src = read_from_camera();
    for (y = 0; y < h; ++y) {
        for (x = 0; x < w; ++x) {
            dst.p[y][x] = sobel(&src.p[y][x]);
        }
    }
    display(dst);
}

sobel is a good target: ✅ no side effects, ✅ executes often

ACCEPT: compilation framework for approximate programs

SLIDE 12

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 13

Background: Multi-Layer Perceptrons

[Figure: a multi-layer perceptron with inputs x0–x9, two hidden layers, and outputs y0, y1]

Computing one neuron of a layer (f is the activation function):

    x7 = f( Σ i=4..6  w_i7 · x_i )

A whole layer is a matrix-vector product followed by f:

    [x7]       [w47 w57 w67]   [x4]
    [x8] = f ( [w48 w58 w68] · [x5] )
    [x9]       [w49 w59 w69]   [x6]

SLIDE 14

Background: Systolic Arrays

systolic array computing a single layer

[Figure: the layer equation [x7 x8 x9]ᵀ = f( W · [x4 x5 x6]ᵀ ) being mapped onto a systolic array]

SLIDE 15

Background: Systolic Arrays

[Figure: systolic array detail — inputs x4–x6 stream through processing elements holding weights w47…w69; the accumulated sums for x7–x9 pass through the activation unit f]

SLIDE 16

PU Micro-Architecture

[Figure: the systolic array realized as a Processing Unit (PU): control logic, a chain of PEs, local weight storage, and a sigmoid unit f]

1. Processing elements (PEs) in DSP logic
2. Local storage for synaptic weights
3. Sigmoid unit implements non-linear activation functions
4. Vertically micro-coded sequencer

SLIDE 17

Multi-Processing Units

[Figure: four Processing Units, each with control logic, PEs, a sigmoid unit, and storage, connected through a bus scheduler to an AXI master]

SLIDE 18

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 19

CPU-SNNAP Integration

[Diagram: CPU with L1/L2 caches; four PUs behind a bus and scheduler; SEV/WFE signaling; a DMA master attached to the ACP]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 20

CPU-SNNAP Integration

[Diagram: as above, highlighting coherent reads & writes through the Accelerator Coherency Port (ACP)]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 21

CPU-SNNAP Integration

[Diagram: as above, highlighting the custom mastering interface that drives coherent reads & writes through the ACP]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 22

CPU-SNNAP Integration

[Diagram: as above, highlighting low-latency event signaling, sleep & wakeup via the ARM SEV/WFE instructions]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 23

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 24

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

SLIDE 25

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

application     domain            error metric
blackscholes    option pricing    MSE
fft             DSP               MSE
inversek2j      robotics          MSE
jmeint          3D modeling       miss rate
jpeg            compression       image diff
kmeans          machine learning  image diff
sobel           vision            image diff

SLIDE 26

Speedup

[Chart: whole-application speedup per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 3.8, 2.4, 1.3, 2.3, 1.5, 38.1, 2.7, 10.8]

Factors:

  • Amdahl’s speedup
  • Cost of instructions on CPU vs. cost of NN on SNNAP
SLIDE 27

Speedup

[Chart: whole-application speedup per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 3.8, 2.4, 1.3, 2.3, 1.5, 38.1, 2.7, 10.8]

Factors:

  • Amdahl’s speedup
  • Cost of instructions on CPU vs. cost of NN on SNNAP

                    inversek2j    kmeans
Amdahl’s speedup    >100x         1.47x
CPU cost            1660 cycles   29 cycles
NN hidden layers    1             2

SLIDE 28

Energy Savings

[Chart: energy savings per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 2.8, 1.8, 0.9, 1.7, 1.1, 28.0, 2.2, 7.8]

Energy = Power(DRAM + SoC) × Runtime

SNNAP adds +36% to DRAM + SoC power.

SLIDE 29

HW Acceleration

Neural acceleration with SNNAP vs. High-Level Synthesis compilers — which one should you use?

SLIDE 30

HLS Comparison Study

[Diagram: two paths from the same source function — HLS compiles it down to a netlist for a custom FPGA design, while the neural transform compiles it down to an NN that SNNAP executes]

SLIDE 31

HLS Comparison Study

Resource-normalized throughput:

  • pipeline invocation interval
  • maximum frequency
  • resource utilization

[Diagram: the same two compilation paths — HLS to a custom netlist, and the neural transform to an NN that SNNAP executes]

SLIDE 32

HLS Comparison Study

[Chart, log scale 0.1–10: normalized throughput improvement over HLS per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 1.6, 0.5, 0.2, 43.7, 7.9, 1.3, 0.4, 1.6. Below 1.0, HLS is better; above, neural acceleration is better.]

SLIDE 33

HLS Comparison Study

[Chart, log scale: normalized throughput improvement over HLS per benchmark; extracted values: 1.6, 0.5, 0.2, 43.7, 7.9, 1.3, 0.4, 1.6]

[Table, partially built: Neural Accel. vs. HLS compared on Precision, Virtualization, Performance, and Programmability]

SLIDE 34

HLS Comparison Study

[Chart, log scale: normalized throughput improvement over HLS per benchmark; extracted values: 1.6, 0.5, 0.2, 43.7, 7.9, 1.3, 0.4, 1.6]

[Table: Neural Accel. vs. HLS compared on Precision, Virtualization, Performance, and Programmability — checkmarks and tildes mark relative strengths]

SLIDE 35

Conclusion

3.8x speedup & 2.8x energy savings

SNNAP: apply approximate computing on programmable SoCs through neural acceleration

Neural acceleration is a viable alternative to HLS

float foo (float a, float b) { … return r; }
SLIDE 36

Mark Wyse Jacob Nelson Adrian Sampson

SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration

Hadi Esmaeilzadeh Luis Ceze Mark Oskin

http://sampa.cs.washington.edu/

Thierry Moreau: moreau@uw.edu