Neural Acceleration for General-Purpose Approximate Programs


SLIDE 1

Neural Acceleration for General-Purpose Approximate Programs

Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, Doug Burger
University of Washington and Microsoft Research
MICRO 2012

SLIDE 2

Program

CPU

SLIDE 3


computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering

SLIDE 4

Approximate computing: computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering

EnerJ programming language [PLDI 2011]
Truffle dual-voltage architecture [ASPLOS 2012]
Relax software fault recovery [de Kruijf et al., ISCA 2010]
Code perforation transformations [MIT]
Green runtime system [Baek and Chilimbi, PLDI 2010]
Flikker approximate DRAM [Liu et al., ASPLOS 2011]
Stochastic processors [Illinois]
Probabilistic CMOS designs [Rice, NTU, Georgia Tech, …]

SLIDE 5

Accelerators

CPU, GPU, FPGA, vector unit
DySER [Wisconsin], BERET [Michigan], Conservation Cores [UCSD]

SLIDE 6


Approximate computing: computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering

SLIDE 7

An accelerator for approximate computations

"Approximate Accelerator 1.0"

√ Mimics functions written in traditional languages
√ Runs more efficiently than a CPU or a precise accelerator
√ May introduce small errors

SLIDE 8

Neural networks are function approximators

Trainable: can implement many functions
Very efficient hardware implementations [Temam, ISCA 2012]
Highly parallel
Fault tolerant
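As a minimal sketch of what "function approximator" means here, the C code below evaluates a tiny fully connected network with sigmoid activations, layer by layer. The 9 → 8 → 1 topology and the weight arrays are placeholders, not the paper's trained networks.

#include <math.h>

/* One fully connected layer: out[j] = sigmoid(bias[j] + sum_i w[j][i] * in[i]) */
static void layer(const float *w, const float *bias,
                  const float *in, int n_in,
                  float *out, int n_out)
{
    for (int j = 0; j < n_out; j++) {
        float acc = bias[j];
        for (int i = 0; i < n_in; i++)
            acc += w[j * n_in + i] * in[i];
        out[j] = 1.0f / (1.0f + expf(-acc));  /* sigmoid activation */
    }
}

/* A 9 -> 8 -> 1 network approximating grad(); weights come from training */
float nn_grad(const float p[9],
              const float w1[8*9], const float b1[8],
              const float w2[1*8], const float b2[1])
{
    float h[8], out[1];
    layer(w1, b1, p, 9, h, 8);    /* hidden layer */
    layer(w2, b2, h, 8, out, 1);  /* output layer */
    return out[0];
}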

SLIDE 13

Neural acceleration:

1. Annotate an approximate program component
2. Compile the program and train a neural network
3. Execute on a fast Neural Processing Unit (NPU)
4. Improve performance by 2.3x and energy by 3.0x on average

SLIDE 14

Programming model

[[transform]] float grad(float[3][3] p) { … }

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}
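The slide leaves grad()'s body elided. For concreteness, here is a plausible exact implementation: the standard Sobel gradient magnitude over the 3x3 window. This is a sketch of the textbook formulation; the benchmark's actual constants and types may differ.

#include <math.h>

/* Plausible exact grad(): Sobel gradient magnitude over a 3x3 window */
float grad(const float p[3][3])
{
    /* Horizontal and vertical Sobel kernels applied to the window */
    float gx = (p[0][2] + 2*p[1][2] + p[2][2]) - (p[0][0] + 2*p[1][0] + p[2][0]);
    float gy = (p[2][0] + 2*p[2][1] + p[2][2]) - (p[0][0] + 2*p[0][1] + p[0][2]);
    return sqrtf(gx * gx + gy * gy);
}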

SLIDE 15

Code region criteria

grad() meets all three:

√ Hot code: runs on every 3x3 pixel window
√ Approximable: small errors do not corrupt output
√ Well-defined inputs and outputs: takes 9 pixel values; returns a scalar

SLIDE 16

Empirically selecting target functions

Run the original program and the accelerated program on test inputs and compare outputs: some candidate functions pass (√ √) and others fail (✗).

SLIDE 17

Compiling and transforming

Annotated Source Code + Training Inputs → Trained Neural Network → Augmented Binary

1. Code Observation
2. Training
3. Code Generation

SLIDE 18

Code observation

[[NPU]] float grad(float[3][3] p) { … }

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}

instrumented program + test cases = sample arguments & outputs

p ➝ grad(p)
323, 231, 122, 93, 321, 49 ➝ 53.2
49, 423, 293, 293, 23, 2 ➝ 94.2
34, 129, 493, 49, 31, 11 ➝ 1.2
21, 85, 47, 62, 21, 577 ➝ 64.2
7, 55, 28, 96, 552, 921 ➝ 18.1
5, 129, 493, 49, 31, 11 ➝ 92.2
49, 423, 293, 293, 23, 2 ➝ 6.5
34, 129, 72, 49, 5, 2 ➝ 120
323, 231, 122, 93, 321, 49 ➝ 53.2
6, 423, 293, 293, 23, 2 ➝ 49.7

record(p); record(result);
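A sketch of what the compiler-emitted instrumentation might look like. The wrapper name, file name, and plain-text layout (alternating input/output lines, roughly FANN's training format minus its header) are assumptions, not the paper's implementation.

#include <stdio.h>

float grad(const float p[3][3]);   /* the original, exact function */

static FILE *trace;

/* Hypothetical wrapper emitted during code observation: it calls the
 * exact grad() and logs each (arguments, result) pair for training. */
float grad_observed(const float p[3][3])
{
    float result = grad(p);
    if (!trace) trace = fopen("grad_samples.data", "a");
    for (int i = 0; i < 3; i++)                /* record(p) */
        for (int j = 0; j < 3; j++)
            fprintf(trace, "%g ", p[i][j]);
    fprintf(trace, "\n%g\n", result);          /* record(result) */
    return result;
}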

SLIDE 19

Training

Training Inputs → Backpropagation Training

SLIDE 20

Training

Several candidate topologies are trained on the same training inputs: smaller networks are faster but less robust (70% accuracy here); larger ones are slower but more accurate (98%, 99%).
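The paper trains with backpropagation, and the FANN library (which the deck later uses for the all-software baseline) exposes exactly that. A sketch, assuming the observed samples were written in FANN's training-file format and assuming a 9 → 8 → 1 topology:

#include <fann.h>

int main(void)
{
    /* 3 layers: 9 inputs, one 8-neuron hidden layer, 1 output */
    struct fann *ann = fann_create_standard(3, 9, 8, 1);
    fann_set_activation_function_hidden(ann, FANN_SIGMOID);
    fann_set_activation_function_output(ann, FANN_SIGMOID);

    /* Backpropagation: up to 500 epochs, report every 10, stop at MSE 0.001 */
    fann_train_on_file(ann, "grad_samples.data", 500, 10, 0.001f);

    fann_save(ann, "grad.net");   /* this becomes the NPU configuration */
    fann_destroy(ann);
    return 0;
}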

SLIDE 21

Code generation

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      p = window(src, x, y);
      NPU_SEND(p[0][0]);
      NPU_SEND(p[0][1]);
      NPU_SEND(p[0][2]);
      …
      dst[x][y] = NPU_RECEIVE();
    }
  }
}
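To make the transformed loop runnable without hardware, NPU_SEND/NPU_RECEIVE can be modeled in software. This stub builds on the nn_grad sketch from Slide 8; the trained-weight arrays w1/b1/w2/b2 are hypothetical names. On real hardware these macros would lower to the enq.d/deq.d queue instructions described on the next slides.

/* Software stand-in for the NPU queue interface, assuming a 9-input,
 * 1-output network that matches grad(). */
extern const float w1[8*9], b1[8], w2[1*8], b2[1];  /* trained offline (hypothetical) */

float nn_grad(const float[9], const float[8*9], const float[8],
              const float[1*8], const float[1]);    /* from the Slide 8 sketch */

static float npu_in[9];
static int   npu_count;

#define NPU_SEND(v)   (npu_in[npu_count++] = (v))
#define NPU_RECEIVE() (npu_count = 0, nn_grad(npu_in, w1, b1, w2, b2))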

SLIDE 22

Neural Processing Unit (NPU)

Core NPU

SLIDE 23

Software interface: ISA extensions

The core talks to the NPU through four queue instructions:

enq.c / deq.c: configuration
enq.d / deq.d: input and output data
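The slide names the configuration queue but not its protocol. This stub sketches one plausible ordering: stream the topology, then the trained weights, through enq.c before any data moves. ENQ_C is a hypothetical software stand-in for the enq.c instruction, not a real intrinsic.

/* Hypothetical stand-in for the enq.c configuration instruction */
#define ENQ_C(v) ((void)(v))

/* Plausible one-time configuration: topology first, then trained weights */
static void npu_configure(const int *layers, int n_layers,
                          const float *weights, int n_weights)
{
    for (int i = 0; i < n_layers; i++)
        ENQ_C((float)layers[i]);   /* neurons per layer */
    for (int i = 0; i < n_weights; i++)
        ENQ_C(weights[i]);         /* edge weights, in schedule order */
}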

SLIDE 24

Microarchitectural interface

[Diagram: the enq.c/deq.c (configuration) and enq.d/deq.d (data) queues, each with speculative (S) and non-speculative (NS) entries, attached to the core pipeline: Fetch, Decode, Issue, Execute, Memory, Commit]

SLIDE 25

A digital NPU

[Diagram: a bus scheduler connects several processing engines; inputs, outputs, and scheduling commands travel over the shared bus]

SLIDE 26

A digital NPU

Inside each processing engine: multiply-add unit, accumulator, sigmoid LUT, neuron weights, input/output
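A software rendering of one processing-engine evaluation as drawn on the slide: the multiply-add unit accumulates weight-times-input products, and a lookup table stands in for the sigmoid. The LUT size and the [-8, 8] input range are assumptions.

#include <math.h>

#define LUT_SIZE 1024                 /* assumed table size */
static float sigmoid_lut[LUT_SIZE];

/* Fill the LUT once at configuration time, covering [-8, 8] */
static void lut_init(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = -8.0f + 16.0f * i / (LUT_SIZE - 1);
        sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

/* One neuron on one processing engine: multiply-add, accumulate, LUT */
static float pe_neuron(const float *weights, const float *inputs, int n)
{
    float acc = 0.0f;                           /* accumulator */
    for (int i = 0; i < n; i++)
        acc += weights[i] * inputs[i];          /* multiply-add unit */
    int idx = (int)((acc + 8.0f) * (LUT_SIZE - 1) / 16.0f);
    if (idx < 0) idx = 0;                       /* clamp saturated inputs */
    if (idx >= LUT_SIZE) idx = LUT_SIZE - 1;
    return sigmoid_lut[idx];                    /* sigmoid LUT */
}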
SLIDE 27

Experiments

Several benchmarks; annotated one hot function in each: FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel
Simulated full programs on MARSSx86
Energy modeled with McPAT and CACTI
Microarchitecture like Intel Penryn: 4-wide, 6-issue; 45 nm, 2080 MHz, 0.9 V

SLIDE 28

Two benchmarks:

triangle intersection: 1,079 static x86-64 instructions; 56% of dynamic instructions; a 60-neuron network with 2 hidden layers
edge detection: 88 static instructions; 97% of dynamic instructions; an 18-neuron network

SLIDE 29

Speedup with NPU acceleration

2.3x average speedup; ranges from 0.8x to 11.1x across benchmarks

[Bar chart: speedup over all-CPU execution (0x to 12x) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean]

SLIDE 30

Energy savings with NPU acceleration

3.0x average energy reduction; all benchmarks benefit

[Bar chart: energy reduction over all-CPU execution (0x to 12x) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean; inversek2j reaches 21.1x]

SLIDE 31

Application quality loss

Quality loss is below 10% in all cases, based on application-specific quality metrics

[Bar chart: quality degradation (0% to 100%) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean]

SLIDE 32

Edge detection with gradient calculation on NPU

SLIDE 33

Also in the paper

- Sensitivity to communication latency
- Sensitivity to NN evaluation efficiency
- Sensitivity to PE count
- Benchmark statistics
- All-software NN slowdown

SLIDE 36

Program

Neural networks can efficiently approximate functions from programs written in conventional languages.

SLIDE 37

CPU

low power, parallel, regular, fault-tolerant, analog, flexible

SLIDE 38

SLIDE 39

Normalized dynamic instructions

[Bar chart: dynamic instruction count normalized to the original program (0% to 100%) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean; bars split into NPU queue instructions and other instructions]

SLIDE 40

Slowdown with software NN

20x average slowdown, using the off-the-shelf FANN library

[Bar chart: slowdown over the original program (0x to 75x) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean]