SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration



SLIDE 1

SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration

Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, Mark Oskin

SLIDE 2

Approximate Computing

Expose quality-performance trade-offs

SLIDE 3

Approximate Computing

Expose quality-performance trade-offs

❌ Approximate / ✅ Cheap   vs.   ✅ Accurate / ❌ Expensive

SLIDE 4

Approximate Computing

Expose quality-performance trade-offs

Domains include image processing, machine learning, search, physical simulation, multimedia, and more.

✅ Approximate / ✅ Cheap   vs.   ✅ Accurate / ❌ Expensive

SLIDE 5

Neural Acceleration

float foo (float a, float b) { … return val; }

[Diagram: the approximable function is transformed (approximation) and offloaded (acceleration) from the ARM core to an NPU on the FPGA]

SLIDE 6

Neural Acceleration

float foo (float a, float b) { … return val; }

[Diagram: two options for neural acceleration — prior work (Esmaeilzadeh et al. [MICRO 2012]) integrates the NPU into the CPU pipeline stages, while SNNAP places the NPU on the FPGA fabric of a CPU+FPGA SoC]
SLIDE 7

SNNAP

A neural processing unit on off-the-shelf programmable SoCs

3.8x speedup and 2.8x efficiency gains

Offers an alternative to HLS tools for neural acceleration

float foo (float a, float b) { … return val; }

[Diagram: the approximable function is offloaded from the ARM core to the NPU on the FPGA]

SLIDE 8

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 9

Background: Compilation

1. Region detection — code annotation, then region detection & program instrumentation
2. ANN training — training data collected from the instrumented program drives back-propagation & topology search
3. Code generation — binary generation targeting the CPU and SNNAP

SLIDE 10

Programming Model

float sobel (float* p);
. . .
Image src;
Image dst;
while (true) {
    src = read_from_camera();
    for (y = 0; y < h; ++y) {
        for (x = 0; x < w; ++x) {
            dst.p[y][x] = sobel(&src.p[y][x]);
        }
    }
    display(dst);
}

SLIDE 11

Programming Model

APPROX float sobel (APPROX float* p);
. . .
APPROX Image src;
APPROX Image dst;
while (true) {
    src = read_from_camera();
    for (y = 0; y < h; ++y) {
        for (x = 0; x < w; ++x) {
            dst.p[y][x] = sobel(&src.p[y][x]);
        }
    }
    display(dst);
}

sobel is a good target: ✅ no side effects, ✅ executes often

ACCEPT: compilation framework for approximate programs

SLIDE 12

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 13

Background: Multi-Layer Perceptrons

[Figure: a multi-layer perceptron with inputs x0–x9, two hidden layers, and outputs y0, y1]

Computing one neuron of a layer (f is the activation function):

    x7 = f( Σ i=4..6  w_i7 · x_i )

A whole layer is a matrix-vector product followed by f:

    [x7]       [w47 w57 w67]   [x4]
    [x8] = f ( [w48 w58 w68] · [x5] )
    [x9]       [w49 w59 w69]   [x6]

SLIDE 14

Background: Systolic Arrays

systolic array computing a single layer

[Figure: the layer equation [x7 x8 x9]ᵀ = f( W · [x4 x5 x6]ᵀ ) being mapped onto a systolic array]

SLIDE 15

Background: Systolic Arrays

[Figure: systolic array detail — inputs x4–x6 stream through processing elements holding weights w47…w69; the accumulated sums for x7–x9 pass through the activation unit f]

SLIDE 16

PU Micro-Architecture

[Figure: the systolic array realized as a Processing Unit (PU): control logic, a chain of PEs, local weight storage, and a sigmoid unit f]

1. Processing elements (PEs) in DSP logic
2. Local storage for synaptic weights
3. Sigmoid unit implements non-linear activation functions
4. Vertically micro-coded sequencer

SLIDE 17

Multi-Processing Units

[Figure: four Processing Units, each with control logic, PEs, a sigmoid unit, and storage, connected through a bus scheduler to an AXI master]

SLIDE 18

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 19

CPU-SNNAP Integration

[Diagram: CPU with L1/L2 caches; four PUs behind a bus and scheduler; SEV/WFE signaling; a DMA master attached to the ACP]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 20

CPU-SNNAP Integration

[Diagram: as above, highlighting coherent reads & writes through the Accelerator Coherency Port (ACP)]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 21

CPU-SNNAP Integration

[Diagram: as above, highlighting the custom mastering interface that drives coherent reads & writes through the ACP]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 22

CPU-SNNAP Integration

[Diagram: as above, highlighting low-latency event signaling, sleep & wakeup via the ARM SEV/WFE instructions]

Interface requirements:

  • Low-latency data transfer
  • Fast signaling
SLIDE 23

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 24

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

SLIDE 25

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

application     domain            error metric
blackscholes    option pricing    MSE
fft             DSP               MSE
inversek2j      robotics          MSE
jmeint          3D modeling       miss rate
jpeg            compression       image diff
kmeans          machine learning  image diff
sobel           vision            image diff

SLIDE 26

Speedup

[Chart: whole-application speedup per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 3.8, 2.4, 1.3, 2.3, 1.5, 38.1, 2.7, 10.8]

Factors:

  • Amdahl’s speedup
  • Cost of instructions on CPU vs. cost of NN on SNNAP
SLIDE 27

Speedup

[Chart: whole-application speedup per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 3.8, 2.4, 1.3, 2.3, 1.5, 38.1, 2.7, 10.8]

Factors:

  • Amdahl’s speedup
  • Cost of instructions on CPU vs. cost of NN on SNNAP

                    inversek2j    kmeans
Amdahl’s speedup    >100x         1.47x
CPU cost            1660 cycles   29 cycles
NN hidden layers    1             2

SLIDE 28

Energy Savings

[Chart: energy savings per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 2.8, 1.8, 0.9, 1.7, 1.1, 28.0, 2.2, 7.8]

Energy = Power(DRAM + SoC) × Runtime

SNNAP adds +36% to DRAM + SoC power.

SLIDE 29

HW Acceleration

Neural acceleration with SNNAP vs. High-Level Synthesis compilers — which one should you use?

SLIDE 30

HLS Comparison Study

[Diagram: two paths from the same source function — HLS compiles it down to a netlist for a custom FPGA design, while the neural transform compiles it down to an NN that SNNAP executes]

SLIDE 31

HLS Comparison Study

Resource-normalized throughput:

  • pipeline invocation interval
  • maximum frequency
  • resource utilization

[Diagram: the same two compilation paths — HLS to a custom netlist, and the neural transform to an NN that SNNAP executes]

SLIDE 32

HLS Comparison Study

[Chart, log scale 0.1–10: normalized throughput improvement over HLS per benchmark (blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, sobel, GEOMEAN); extracted values: 1.6, 0.5, 0.2, 43.7, 7.9, 1.3, 0.4, 1.6. Below 1.0, HLS is better; above, neural acceleration is better.]

SLIDE 33

HLS Comparison Study

[Chart, log scale: normalized throughput improvement over HLS per benchmark; extracted values: 1.6, 0.5, 0.2, 43.7, 7.9, 1.3, 0.4, 1.6]

[Table, partially built: Neural Accel. vs. HLS compared on Precision, Virtualization, Performance, and Programmability]

SLIDE 34

HLS Comparison Study

[Chart, log scale: normalized throughput improvement over HLS per benchmark; extracted values: 1.6, 0.5, 0.2, 43.7, 7.9, 1.3, 0.4, 1.6]

[Table: Neural Accel. vs. HLS compared on Precision, Virtualization, Performance, and Programmability — checkmarks and tildes mark relative strengths]

SLIDE 35

Conclusion

3.8x speedup & 2.8x energy savings

SNNAP: apply approximate computing on programmable SoCs through neural acceleration

Neural acceleration is a viable alternative to HLS

float foo (float a, float b) { … return r; }
SLIDE 36

Mark Wyse Jacob Nelson Adrian Sampson

SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration

Hadi Esmaeilzadeh Luis Ceze Mark Oskin

http://sampa.cs.washington.edu/

Thierry Moreau: moreau@uw.edu