Compilation and Hardware Support for Approximate Acceleration - PowerPoint PPT Presentation




SLIDE 1

Compilation and Hardware Support for Approximate Acceleration

Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Hadi Esmaeilzadeh (Georgia Tech), Luis Ceze and Mark Oskin
University of Washington
moreau@uw.edu

Theme: 2384.004

1 Thierry Moreau

SLIDE 2

Approximate Computing

Aims to exploit application resilience to trade off quality for efficiency

SLIDE 3

Approximate Computing

(figure only)

SLIDE 4

Approximate Computing

Approximate and cheap ✅ vs. accurate but expensive ❌

SLIDE 5-7

(figures only)

SLIDE 8

Neural Networks as Approximate Accelerators

(figure: a CPU offloading an approximate code region to a neural network)

Esmaeilzadeh et al. [MICRO 2012]

SLIDE 9-12

Neural Acceleration

float foo (float a, float b) { … return val; }

An approximable function like foo is trained into a neural network and offloaded from the ARM core to an NPU implemented on the FPGA.

compiler support: ACCEPT (Sampson et al. [UW-TR])
HW support: SNNAP (Moreau et al. [HPCA 2015])

Result: 3.8x speedup and 2.8x energy efficiency at 10% error

SLIDE 13

Talk Outline

Introduction
Compiler Support with ACCEPT
SNNAP Accelerator Design
Evaluation & Comparison with HLS

SLIDE 14-19

Compilation Overview

code annotation

  • 1. Region detection: ACCEPT detects candidate regions and instruments the program
  • 2. ANN Training: back-propagation & topology search over the collected [training.data]
  • 3. Code Generation: ACCEPT transforms the code, which the CPU then executes with SNNAP
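Conceptually, the code-generation step rewrites calls to the approximable function into an invocation of the trained network on the NPU. The sketch below is illustrative only: npu_invoke and its linear stand-in computation are hypothetical, and the real SNNAP interface is batched and queue-based.

```c
#include <stddef.h>

/* Hypothetical NPU entry point -- a stand-in for dispatching inputs to
 * the accelerator and reading results back. It computes a fixed linear
 * combination so the sketch is runnable without hardware. */
static void npu_invoke(const float *in, size_t n_in, float *out, size_t n_out)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n_in; ++i)
        acc += 0.5f * in[i];            /* pretend "network" output */
    for (size_t j = 0; j < n_out; ++j)
        out[j] = acc;
}

/* Original: float foo(float a, float b) { ... return val; }
 * After transformation, each call site marshals its arguments into an
 * input vector, invokes the NPU, and unmarshals the result: */
float foo_approx(float a, float b)
{
    float in[2] = { a, b };
    float out[1];
    npu_invoke(in, 2, out, 1);
    return out[0];
}
```

In the real system the inputs are batched and DMA'd to SNNAP, and the CPU can sleep until the results are written back.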

SLIDE 20

Programming Model

float sobel(float* p);
. . .
float** src;
float** dst;
while (true) {
  src = read_from_camera();
  for (y = 0; y < h; ++y) {
    for (x = 0; x < w; ++x) {
      dst[y][x] = sobel(&src[y][x]);
    }
  }
  display(dst);
}

SLIDE 21-22

Programming Model

APPROX float sobel(APPROX float* p);
. . .
APPROX float** src;
APPROX float** dst;
while (true) {
  src = read_from_camera();
  for (y = 0; y < h; ++y) {
    for (x = 0; x < w; ++x) {
      dst[y][x] = sobel(&src[y][x]);
    }
  }
  display(ENDORSE(dst));
}

sobel is a good acceleration target: ✅ no side effects ✅ executes often

SLIDE 23-27

Checking for Quality

annotated program: sobel.c
quality metric: d(y, y0)
input data: split into training and test sets
measured outcomes: performance and output quality
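The quality metric d(y, y0) compares the approximate output y against the precise output y0 and is supplied per application; for the image benchmarks it is an image diff. One plausible instantiation (an assumption, not ACCEPT's exact definition) is the mean absolute per-pixel difference:

```c
#include <math.h>
#include <stddef.h>

/* Sketch of an "image diff" quality metric d(y, y0): mean absolute
 * per-pixel difference between approximate output y and precise
 * output y0 over n pixels. Smaller is better; 0 means identical. */
double image_diff(const float *y, const float *y0, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += fabs((double)y[i] - (double)y0[i]);
    return sum / (double)n;
}
```

Evaluating this metric on the held-out test inputs, rather than the training inputs, is what makes the reported output quality honest.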

SLIDE 28

Talk Outline

Introduction
Compiler Support with ACCEPT
SNNAP Accelerator Design
Evaluation & Comparison with HLS

SLIDE 29

Background: Multi-Layer Perceptrons

(figure: fully connected network with inputs x0..x9, hidden layers 0 and 1, and outputs y0, y1; the edges into x7 carry weights w47, w57, w67)

A single neuron applies the activation function f to a weighted sum of the previous layer's outputs:

x_7 = f\left( \sum_{i=4}^{6} w_{i7}\, x_i \right)

Computing a whole layer is a matrix-vector product followed by the element-wise activation:

\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left( \begin{bmatrix} w_{47} & w_{57} & w_{67} \\ w_{48} & w_{58} & w_{68} \\ w_{49} & w_{59} & w_{69} \end{bmatrix} \begin{bmatrix} x_4 \\ x_5 \\ x_6 \end{bmatrix} \right)

29 Thierry Moreau

SLIDE 30

Background: Systolic Arrays

The same layer computation maps onto a systolic array: the weights stay resident in a grid of multiply-accumulate cells, the inputs x4, x5, x6 stream through, and the accumulated sums pass through the activation unit f.

(figure: layer equation mapped onto a systolic array holding weights w47..w69, feeding the activation function f)

30 Thierry Moreau
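A minimal software model of this mapping, keeping only the multiply-accumulate structure (a real systolic array also pipelines inputs diagonally so that all rows work in parallel, one PE firing per cycle):

```c
#define N 3

/* One row of the systolic array: each processing element holds one
 * weight and performs a multiply-accumulate as the input values stream
 * past it, so the row computes one dot product of the layer's
 * matrix-vector product. */
void systolic_row(const float w_row[N], const float x_stream[N], float *acc_out)
{
    float acc = 0.0f;                        /* partial sum carried PE to PE */
    for (int cycle = 0; cycle < N; ++cycle)
        acc += w_row[cycle] * x_stream[cycle];  /* one PE fires per cycle */
    *acc_out = acc;
}
```

Stacking N such rows and applying the activation to each row's accumulator reproduces the layer computation of the previous slide.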

SLIDE 31-35

PU Micro-Architecture

(figure: processing unit containing a chain of PEs, weight storage, a sigmoid unit f, and control, implementing the systolic array)

1 - processing elements in DSP logic
2 - local storage for synaptic weights
3 - sigmoid unit implements non-linear activation functions
4 - vertically micro-coded sequencer

SLIDE 36

Multi-Processing Units

(figure: four processing units, each with its PE chain, weight storage, sigmoid unit f, and control, sharing a bus with a scheduler and a DMA master)

36 Thierry Moreau

SLIDE 37

CPU-SNNAP Integration

(figure: CPU with L1/L2 caches connected over the ACP to the PUs, bus, scheduler, and DMA master)

custom mastering interface: coherent reads & writes through the accelerator coherency port (ACP)
low-latency event signaling, sleep & wakeup (SEV/WFE)

SLIDE 38

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 39

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

application    domain          error metric
blackscholes   option pricing  MSE
fft            DSP             MSE
inversek2j     robotics        MSE
jmeint         3D modeling     miss rate
jpeg           compression     image diff
kmeans         ML              image diff
sobel          vision          image diff

SLIDE 40

Whole-Application Speedup

(bar chart: whole-application speedup for blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, and sobel; per-benchmark values of 2.4, 1.3, 2.3, 1.5, 38.1, 2.7, and 10.8; geometric mean 3.8x)

SLIDE 41

Energy Savings

(bar chart: energy savings for the same benchmarks; per-benchmark values of 1.8, 0.9, 1.7, 1.1, 28.0, 2.2, and 7.8; geometric mean 2.8x)

Energy = P(DRAM + SoC) × Runtime, with SNNAP adding +36% to average power

41 Thierry Moreau
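The two headline numbers are consistent with this model: runtime shrinks by the 3.8x mean speedup while average power grows by 36%, so energy drops by a factor of 3.8 / 1.36, roughly 2.8x. A one-line sketch of that arithmetic:

```c
/* Energy ratio under the model Energy = Power * Runtime:
 * runtime divided by `speedup`, power multiplied by (1 + overhead). */
double energy_savings(double speedup, double power_overhead)
{
    return speedup / (1.0 + power_overhead);
}
```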

SLIDE 42-44

Conclusion

float foo (float a, float b) { … return val; }

Approximable functions are offloaded from the ARM core to an NPU on the FPGA, with compiler support from ACCEPT and hardware support from SNNAP.

3.8x speedup & 2.8x energy savings

SLIDE 45

Compilation and Hardware Support for Approximate Acceleration

Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Luis Ceze and Mark Oskin
University of Washington
moreau@uw.edu

ACCEPT: http://accept.rocks
SNNAP: upon request