  1. Neural Acceleration for General-Purpose Approximate Programs — Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze (University of Washington); Doug Burger (Microsoft Research) — MICRO 2012

  2. CPU Program

  3. Application domains: computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering (image credits: JPL & Rob Hogg/NASA, thefrugalgirl.com)

  4. Approximate computing, related work: Probabilistic CMOS designs [Rice, NTU, Georgia Tech, …]; Stochastic processors [Illinois]; Code perforation transformations [MIT]; Relax software fault recovery [de Kruijf et al., ISCA 2010]; Green runtime system [Baek and Chilimbi, PLDI 2010]; Flikker approximate DRAM [Liu et al., ASPLOS 2011]; EnerJ programming language [PLDI 2011]; Truffle dual-voltage architecture [ASPLOS 2012]

  5. Accelerators around the CPU: BERET [Michigan], Conservation Cores [UCSD], DySER [Wisconsin], GPU, FPGA, vector unit

  6. Accelerators meet approximate computing: the accelerator landscape (BERET [Michigan], Conservation Cores [UCSD], DySER [Wisconsin], GPU, FPGA, vector unit) intersects the approximable domains (computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering)

  7. An accelerator for approximate computations: √ Mimics functions written in traditional languages; √ Runs more efficiently than a CPU or a precise accelerator; √ May introduce small errors

  8. Neural networks are function approximators: trainable (implements many functions); highly parallel; very efficient hardware implementations; fault tolerant [Temam, ISCA 2012]
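A one-hidden-layer sigmoid network of the kind an NPU evaluates can be sketched in a few lines; the function names and weight layout here are illustrative, not taken from the paper:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Logistic sigmoid, the activation used throughout this sketch.
double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Forward pass of one hidden layer: y = w2 . sigmoid(W1*x + b1) + b2.
// W1[j][i] is the weight from input i to hidden neuron j.
double mlp_forward(const std::vector<double>& x,
                   const std::vector<std::vector<double>>& W1,
                   const std::vector<double>& b1,
                   const std::vector<double>& w2, double b2) {
    std::vector<double> h(W1.size());
    for (size_t j = 0; j < W1.size(); ++j) {
        double acc = b1[j];
        for (size_t i = 0; i < x.size(); ++i) acc += W1[j][i] * x[i];
        h[j] = sigmoid(acc);               // hidden activation
    }
    double y = b2;                          // linear output neuron
    for (size_t j = 0; j < h.size(); ++j) y += w2[j] * h[j];
    return y;
}
```

Training (next slides) chooses the weights so this forward pass mimics the recorded function.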

  9. Neural acceleration Program

  10. Neural acceleration Annotate an approximate program component Program

  11. Neural acceleration Annotate an approximate program component Compile the program and train a neural network Program

  12. Neural acceleration Annotate an approximate program component Compile the program and train a neural network Program Execute on a fast Neural Processing Unit (NPU)

  13. Neural acceleration 1 Annotate an approximate program component 2 Compile the program and train a neural network 3 Execute on a fast Neural Processing Unit (NPU) 4 Improve performance 2.3x and energy 3.0x on average

  14. Programming model

[[transform]] float grad(float [3][3] p) { … }

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}

  15. Code region criteria: √ Hot code — grad() runs on every 3x3 pixel window; √ Approximable — small errors do not corrupt output; √ Well-defined inputs and outputs — takes 9 pixel values, returns a scalar

  16. Empirically selecting target functions (diagram: each candidate's accelerated program is compared against the original; candidates are kept √ or rejected ✗)

  17. Compiling and transforming: Annotated Source Code → 1. Code Observation (with Training Inputs) → 2. Training → Trained Neural Network → 3. Code Generation → Augmented Binary

  18. Code observation: instrumented sample program, with test-case arguments & outputs

[[NPU]] float grad(float [3][3] p) {
  record(p);
  …
  record(result);
}

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}

Recorded arguments ➝ outputs:
  323, 231, 122, 93, 321, 49 ➝ 53.2
  49, 423, 293, 293, 23, 2 ➝ 94.2
  34, 129, 493, 49, 31, 11 ➝ 1.2
  21, 85, 47, 62, 21, 577 ➝ 64.2
  7, 55, 28, 96, 552, 921 ➝ 18.1
  5, 129, 493, 49, 31, 11 ➝ 92.2
  49, 423, 293, 293, 23, 2 ➝ 6.5
  34, 129, 72, 49, 5, 2 ➝ 120
  323, 231, 122, 93, 321, 49 ➝ 53.2
  6, 423, 293, 293, 23, 2 ➝ 49.7
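The record() instrumentation can be pictured as logging each (arguments, output) pair into a training set. In this sketch the stand-in grad body, the Pixels alias, and the training_log name are my own illustrative assumptions, not the paper's:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

using Pixels = std::array<double, 9>;   // flattened 3x3 pixel window
using Sample = std::pair<Pixels, double>;

std::vector<Sample> training_log;       // accumulated (arguments, output) pairs

// Stand-in gradient: |horizontal difference| + |vertical difference|.
double grad(const Pixels& p) {
    double dx = p[2] + p[5] + p[8] - p[0] - p[3] - p[6];
    double dy = p[6] + p[7] + p[8] - p[0] - p[1] - p[2];
    return std::fabs(dx) + std::fabs(dy);
}

// Compiler-inserted wrapper: run the original code, then record the pair.
double grad_instrumented(const Pixels& p) {
    double result = grad(p);            // original computation
    training_log.push_back({p, result}); // record(p); record(result);
    return result;
}
```

Running the program over its test cases fills training_log with the samples the trainer consumes.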

  19. Training: the recorded training inputs drive backpropagation training of the network

  20. Training explores multiple network topologies on the training inputs: accuracy grows from 70% to 98% to 99% as the network grows, trading faster (but less robust) networks for slower, more accurate ones
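The backpropagation update behind this training phase can be sketched for a single sigmoid neuron; the learning rate, toy data, and helper names are illustrative assumptions, not the paper's training pipeline:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Neuron { std::vector<double> w; double b; };

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

double predict(const Neuron& n, const std::vector<double>& x) {
    double z = n.b;
    for (size_t i = 0; i < x.size(); ++i) z += n.w[i] * x[i];
    return sigmoid(z);
}

// One gradient-descent step on squared error 0.5*(y - t)^2.
// Uses sigmoid'(z) = y*(1 - y).
void backprop_step(Neuron& n, const std::vector<double>& x, double t,
                   double lr) {
    double y = predict(n, x);
    double delta = (y - t) * y * (1.0 - y);
    for (size_t i = 0; i < x.size(); ++i) n.w[i] -= lr * delta * x[i];
    n.b -= lr * delta;
}

// Train on one toy sample (x = (1,1), target 1) and return remaining error.
double train_toy(int steps) {
    Neuron n{{0.0, 0.0}, 0.0};
    for (int s = 0; s < steps; ++s) backprop_step(n, {1.0, 1.0}, 1.0, 0.5);
    return std::fabs(1.0 - predict(n, {1.0, 1.0}));
}
```

Repeating this step over all recorded samples, layer by layer, is what the backpropagation trainer does.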

  21. Code generation

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      p = window(src, x, y);
      NPU_SEND(p[0][0]);
      NPU_SEND(p[0][1]);
      NPU_SEND(p[0][2]);
      …
      dst[x][y] = NPU_RECEIVE();
    }
  }
}

  22. Neural Processing Unit (NPU) (diagram: the NPU alongside the core)

  23. Software interface, ISA extensions: enq.d / deq.d move input and output data between the core and the NPU; enq.c / deq.c move configuration
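A software model of the four queue instructions might look like the following; the struct and method names are hypothetical stand-ins for the enq.d/deq.d/enq.c/deq.c semantics described on the slide:

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Toy model of the core<->NPU queue interface (not RTL).
struct NpuQueues {
    std::queue<double>   input;   // core -> NPU data      (enq.d)
    std::queue<double>   output;  // NPU  -> core data     (deq.d)
    std::queue<uint32_t> config;  // core -> NPU topology/weights (enq.c)

    void enq_d(double v)   { input.push(v); }
    double deq_d()         { double v = output.front(); output.pop(); return v; }
    void enq_c(uint32_t v) { config.push(v); }
};

// Demo: configure, enqueue one input, let the "NPU" echo it back.
bool queues_roundtrip() {
    NpuQueues q;
    q.enq_c(0x2A);                       // send a configuration word
    q.enq_d(3.5);                        // core enqueues an input
    q.output.push(q.input.front());      // NPU computes (identity, for the demo)
    q.input.pop();
    return q.deq_d() == 3.5 && q.config.front() == 0x2A;
}
```

The point of the queue design is that the core only ever blocks on deq.d, leaving NPU evaluation overlapped with other work.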

  24. Microarchitectural interface: the enq.d/deq.d data queues and enq.c/deq.c configuration queues connect the NPU to the pipeline (Fetch, Decode, Issue, Execute, Memory, Commit), with speculative (S) and non-speculative (NS) queue state

  25. A digital NPU: a scheduler with scheduling state dispatches inputs over a bus to the processing engines and collects their outputs

  26. A digital NPU, processing-engine detail: each engine holds neuron weights and a multiply-add unit feeding an accumulator and a sigmoid LUT, with input and output buffers and per-neuron scheduling
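One processing engine — multiply-add unit, accumulator, and sigmoid lookup table — can be sketched as follows; the 256-entry table over [-8, 8) is an assumed sizing, not the paper's:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct ProcessingEngine {
    std::vector<double> lut;                 // sigmoid LUT over [-8, 8)

    ProcessingEngine() : lut(256) {
        for (int i = 0; i < 256; ++i) {      // precompute the activation table
            double x = -8.0 + 16.0 * i / 256.0;
            lut[i] = 1.0 / (1.0 + std::exp(-x));
        }
    }

    // Clamp to the table range, then index: a cheap activation.
    double sigmoid_lut(double x) const {
        int i = static_cast<int>((x + 8.0) * 256.0 / 16.0);
        if (i < 0) i = 0;
        if (i > 255) i = 255;
        return lut[i];
    }

    // One neuron: multiply-add into the accumulator, then activate via LUT.
    double neuron(const std::vector<double>& w,
                  const std::vector<double>& in, double bias) const {
        double acc = bias;                   // accumulator
        for (size_t i = 0; i < w.size(); ++i) acc += w[i] * in[i];
        return sigmoid_lut(acc);
    }
};
```

The LUT trades a small quantization error (one table step) for avoiding an exponential unit, which suits an approximate accelerator.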

  27. Experiments: several benchmarks, one hot function annotated in each (FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel). Full programs simulated on MARSSx86; energy modeled with McPAT and CACTI. Microarchitecture like Intel Penryn: 4-wide, 6-issue, 45 nm, 2080 MHz, 0.9 V

  28. Two benchmarks:
edge detection — 88 static x86-64 instructions, 56% of dynamic instructions; 18 neurons
triangle intersection — 1,079 static x86-64 instructions, 97% of dynamic instructions; 60 neurons, 2 hidden layers

  29. Speedup with NPU acceleration (chart: speedup over all-CPU execution for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean). 2.3x average speedup; ranges from 0.8x to 11.1x

  30. Energy savings with NPU acceleration (chart: energy reduction over all-CPU execution for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean; highest bar reaches 21.1x). 3.0x average energy reduction; all benchmarks benefit

  31. Application quality loss (chart: quality degradation, 0–100%, for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean). Quality loss below 10% in all cases, based on application-specific quality metrics
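An application-specific quality metric typically compares precise and approximate outputs element by element. As a hedged example (an illustrative metric, not one from the paper), average relative error:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Average relative error between precise and approximate outputs.
// The epsilon floor guards against division by a zero reference value.
double avg_relative_error(const std::vector<double>& precise,
                          const std::vector<double>& approx) {
    double sum = 0.0;
    for (size_t i = 0; i < precise.size(); ++i)
        sum += std::fabs(precise[i] - approx[i]) /
               std::max(std::fabs(precise[i]), 1e-9);
    return sum / precise.size();
}
```

A benchmark would accept the NPU version only if this score stays under its quality threshold (here, 10%).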

  32. Edge detection with gradient calculation on NPU

  33. Also in the paper: sensitivity to communication latency; sensitivity to NN evaluation efficiency; sensitivity to PE count; benchmark statistics; all-software NN slowdown

  36. Neural networks can efficiently approximate functions from programs written in conventional languages.

  37. (word cloud around the CPU: low power, parallel, flexible, regular, fault-tolerant, analog)

  38. Normalized dynamic instructions (chart: dynamic instruction count normalized to the original, 0–100%, for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean; bars split into NPU queue instructions and other instructions)

  39. Slowdown with software NN (chart: slowdown over the original program, up to 75x, for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean). 20x average slowdown using the off-the-shelf FANN library
