Neural Acceleration for General-Purpose Approximate Programs


SLIDE 1

Neural Acceleration for General-Purpose Approximate Programs

Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, Doug Burger
University of Washington and Microsoft Research
MICRO 2012

SLIDE 2

Program

CPU

SLIDE 3


computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering

SLIDE 4

Approximate computing: computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering

EnerJ programming language [PLDI 2011]
Truffle dual-voltage architecture [ASPLOS 2012]
Relax software fault recovery [de Kruijf et al., ISCA 2010]
Code perforation transformations [MIT]
Green runtime system [Baek and Chilimbi, PLDI 2010]
Flikker approximate DRAM [Liu et al., ASPLOS 2011]
Stochastic processors [Illinois]
Probabilistic CMOS designs [Rice, NTU, Georgia Tech, …]

SLIDE 5

Accelerators

CPU, GPU, FPGA, vector unit
DySER [Wisconsin], BERET [Michigan], Conservation Cores [UCSD]

SLIDE 6


Approximate computing: computer vision, machine learning, sensory data, physical simulation, information retrieval, augmented reality, image rendering

SLIDE 7

An accelerator for approximate computations

"Approximate Accelerator 1.0"

√ Mimics functions written in traditional languages
√ Runs more efficiently than a CPU or a precise accelerator
√ May introduce small errors

SLIDE 8

Neural networks are function approximators

Trainable: can implement many functions
Very efficient hardware implementations [Temam, ISCA 2012]
Highly parallel
Fault tolerant
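As a minimal sketch of what "function approximator" means here, the C code below evaluates a tiny fully connected network with sigmoid activations, layer by layer. The 9 → 8 → 1 topology and the weight arrays are placeholders, not the paper's trained networks.

#include <math.h>

/* One fully connected layer: out[j] = sigmoid(bias[j] + sum_i w[j][i] * in[i]) */
static void layer(const float *w, const float *bias,
                  const float *in, int n_in,
                  float *out, int n_out)
{
    for (int j = 0; j < n_out; j++) {
        float acc = bias[j];
        for (int i = 0; i < n_in; i++)
            acc += w[j * n_in + i] * in[i];
        out[j] = 1.0f / (1.0f + expf(-acc));  /* sigmoid activation */
    }
}

/* A 9 -> 8 -> 1 network approximating grad(); weights come from training */
float nn_grad(const float p[9],
              const float w1[8*9], const float b1[8],
              const float w2[1*8], const float b2[1])
{
    float h[8], out[1];
    layer(w1, b1, p, 9, h, 8);    /* hidden layer */
    layer(w2, b2, h, 8, out, 1);  /* output layer */
    return out[0];
}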

SLIDE 13

Neural acceleration:

1. Annotate an approximate program component
2. Compile the program and train a neural network
3. Execute on a fast Neural Processing Unit (NPU)
4. Improve performance by 2.3x and energy by 3.0x on average

SLIDE 14

Programming model

[[transform]] float grad(float[3][3] p) { … }

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}
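The slide leaves grad()'s body elided. For concreteness, here is a plausible exact implementation: the standard Sobel gradient magnitude over the 3x3 window. This is a sketch of the textbook formulation; the benchmark's actual constants and types may differ.

#include <math.h>

/* Plausible exact grad(): Sobel gradient magnitude over a 3x3 window */
float grad(const float p[3][3])
{
    /* Horizontal and vertical Sobel kernels applied to the window */
    float gx = (p[0][2] + 2*p[1][2] + p[2][2]) - (p[0][0] + 2*p[1][0] + p[2][0]);
    float gy = (p[2][0] + 2*p[2][1] + p[2][2]) - (p[0][0] + 2*p[0][1] + p[0][2]);
    return sqrtf(gx * gx + gy * gy);
}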

SLIDE 15

Code region criteria

grad() meets all three:

√ Hot code: runs on every 3x3 pixel window
√ Approximable: small errors do not corrupt output
√ Well-defined inputs and outputs: takes 9 pixel values; returns a scalar

SLIDE 16

Empirically selecting target functions

Run the original program and the accelerated program on test inputs and compare outputs: some candidate functions pass (√ √) and others fail (✗).

SLIDE 17

Compiling and transforming

Annotated Source Code + Training Inputs → Trained Neural Network → Augmented Binary

1. Code Observation
2. Training
3. Code Generation

SLIDE 18

Code observation

[[NPU]] float grad(float[3][3] p) { … }

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      dst[x][y] = grad(window(src, x, y));
    }
  }
}

instrumented program + test cases = sample arguments & outputs

p ➝ grad(p)
323, 231, 122, 93, 321, 49 ➝ 53.2
49, 423, 293, 293, 23, 2 ➝ 94.2
34, 129, 493, 49, 31, 11 ➝ 1.2
21, 85, 47, 62, 21, 577 ➝ 64.2
7, 55, 28, 96, 552, 921 ➝ 18.1
5, 129, 493, 49, 31, 11 ➝ 92.2
49, 423, 293, 293, 23, 2 ➝ 6.5
34, 129, 72, 49, 5, 2 ➝ 120
323, 231, 122, 93, 321, 49 ➝ 53.2
6, 423, 293, 293, 23, 2 ➝ 49.7

record(p); record(result);
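A sketch of what the compiler-emitted instrumentation might look like. The wrapper name, file name, and plain-text layout (alternating input/output lines, roughly FANN's training format minus its header) are assumptions, not the paper's implementation.

#include <stdio.h>

float grad(const float p[3][3]);   /* the original, exact function */

static FILE *trace;

/* Hypothetical wrapper emitted during code observation: it calls the
 * exact grad() and logs each (arguments, result) pair for training. */
float grad_observed(const float p[3][3])
{
    float result = grad(p);
    if (!trace) trace = fopen("grad_samples.data", "a");
    for (int i = 0; i < 3; i++)                /* record(p) */
        for (int j = 0; j < 3; j++)
            fprintf(trace, "%g ", p[i][j]);
    fprintf(trace, "\n%g\n", result);          /* record(result) */
    return result;
}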

SLIDE 19

Training

Training Inputs → Backpropagation Training

SLIDE 20

Training

Several candidate topologies are trained on the same training inputs: smaller networks are faster but less robust (70% accuracy here); larger ones are slower but more accurate (98%, 99%).
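The paper trains with backpropagation, and the FANN library (which the deck later uses for the all-software baseline) exposes exactly that. A sketch, assuming the observed samples were written in FANN's training-file format and assuming a 9 → 8 → 1 topology:

#include <fann.h>

int main(void)
{
    /* 3 layers: 9 inputs, one 8-neuron hidden layer, 1 output */
    struct fann *ann = fann_create_standard(3, 9, 8, 1);
    fann_set_activation_function_hidden(ann, FANN_SIGMOID);
    fann_set_activation_function_output(ann, FANN_SIGMOID);

    /* Backpropagation: up to 500 epochs, report every 10, stop at MSE 0.001 */
    fann_train_on_file(ann, "grad_samples.data", 500, 10, 0.001f);

    fann_save(ann, "grad.net");   /* this becomes the NPU configuration */
    fann_destroy(ann);
    return 0;
}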

SLIDE 21

Code generation

void edgeDetection(Image &src, Image &dst) {
  for (int y = …) {
    for (int x = …) {
      p = window(src, x, y);
      NPU_SEND(p[0][0]);
      NPU_SEND(p[0][1]);
      NPU_SEND(p[0][2]);
      …
      dst[x][y] = NPU_RECEIVE();
    }
  }
}
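To make the transformed loop runnable without hardware, NPU_SEND/NPU_RECEIVE can be modeled in software. This stub builds on the nn_grad sketch from Slide 8; the trained-weight arrays w1/b1/w2/b2 are hypothetical names. On real hardware these macros would lower to the enq.d/deq.d queue instructions described on the next slides.

/* Software stand-in for the NPU queue interface, assuming a 9-input,
 * 1-output network that matches grad(). */
extern const float w1[8*9], b1[8], w2[1*8], b2[1];  /* trained offline (hypothetical) */

float nn_grad(const float[9], const float[8*9], const float[8],
              const float[1*8], const float[1]);    /* from the Slide 8 sketch */

static float npu_in[9];
static int   npu_count;

#define NPU_SEND(v)   (npu_in[npu_count++] = (v))
#define NPU_RECEIVE() (npu_count = 0, nn_grad(npu_in, w1, b1, w2, b2))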

SLIDE 22

Neural Processing Unit (NPU)

Core NPU

SLIDE 23

Software interface: ISA extensions

The core talks to the NPU through four queue instructions:

enq.c / deq.c: configuration
enq.d / deq.d: input and output data
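The slide names the configuration queue but not its protocol. This stub sketches one plausible ordering: stream the topology, then the trained weights, through enq.c before any data moves. ENQ_C is a hypothetical software stand-in for the enq.c instruction, not a real intrinsic.

/* Hypothetical stand-in for the enq.c configuration instruction */
#define ENQ_C(v) ((void)(v))

/* Plausible one-time configuration: topology first, then trained weights */
static void npu_configure(const int *layers, int n_layers,
                          const float *weights, int n_weights)
{
    for (int i = 0; i < n_layers; i++)
        ENQ_C((float)layers[i]);   /* neurons per layer */
    for (int i = 0; i < n_weights; i++)
        ENQ_C(weights[i]);         /* edge weights, in schedule order */
}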

SLIDE 24

Microarchitectural interface

[Diagram: the enq.c/deq.c (configuration) and enq.d/deq.d (data) queues, each with speculative (S) and non-speculative (NS) entries, attached to the core pipeline: Fetch, Decode, Issue, Execute, Memory, Commit]

SLIDE 25

A digital NPU

[Diagram: a bus scheduler connects several processing engines; inputs, outputs, and scheduling commands travel over the shared bus]

SLIDE 26

A digital NPU

Inside each processing engine: multiply-add unit, accumulator, sigmoid LUT, neuron weights, input/output
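A software rendering of one processing-engine evaluation as drawn on the slide: the multiply-add unit accumulates weight-times-input products, and a lookup table stands in for the sigmoid. The LUT size and the [-8, 8] input range are assumptions.

#include <math.h>

#define LUT_SIZE 1024                 /* assumed table size */
static float sigmoid_lut[LUT_SIZE];

/* Fill the LUT once at configuration time, covering [-8, 8] */
static void lut_init(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = -8.0f + 16.0f * i / (LUT_SIZE - 1);
        sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

/* One neuron on one processing engine: multiply-add, accumulate, LUT */
static float pe_neuron(const float *weights, const float *inputs, int n)
{
    float acc = 0.0f;                           /* accumulator */
    for (int i = 0; i < n; i++)
        acc += weights[i] * inputs[i];          /* multiply-add unit */
    int idx = (int)((acc + 8.0f) * (LUT_SIZE - 1) / 16.0f);
    if (idx < 0) idx = 0;                       /* clamp saturated inputs */
    if (idx >= LUT_SIZE) idx = LUT_SIZE - 1;
    return sigmoid_lut[idx];                    /* sigmoid LUT */
}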
SLIDE 27

Experiments

Several benchmarks; annotated one hot function in each: FFT, inverse kinematics, triangle intersection, JPEG, K-means, Sobel
Simulated full programs on MARSSx86
Energy modeled with McPAT and CACTI
Microarchitecture like Intel Penryn: 4-wide, 6-issue; 45 nm, 2080 MHz, 0.9 V

SLIDE 28

Two benchmarks:

triangle intersection: 1,079 static x86-64 instructions; 56% of dynamic instructions; a 60-neuron network with 2 hidden layers
edge detection: 88 static instructions; 97% of dynamic instructions; an 18-neuron network

SLIDE 29

Speedup with NPU acceleration

2.3x average speedup; ranges from 0.8x to 11.1x across benchmarks

[Bar chart: speedup over all-CPU execution (0x to 12x) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean]

SLIDE 30

Energy savings with NPU acceleration

3.0x average energy reduction; all benchmarks benefit

[Bar chart: energy reduction over all-CPU execution (0x to 12x) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean; inversek2j reaches 21.1x]

SLIDE 31

Application quality loss

Quality loss is below 10% in all cases, based on application-specific quality metrics

[Bar chart: quality degradation (0% to 100%) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean]

SLIDE 32

Edge detection with gradient calculation on NPU

SLIDE 33

Also in the paper

- Sensitivity to communication latency
- Sensitivity to NN evaluation efficiency
- Sensitivity to PE count
- Benchmark statistics
- All-software NN slowdown

SLIDE 36

Program

Neural networks can efficiently approximate functions from programs written in conventional languages.

SLIDE 37

CPU

low power, parallel, regular, fault-tolerant, analog, flexible

SLIDE 38

SLIDE 39

Normalized dynamic instructions

[Bar chart: dynamic instruction count normalized to the original program (0% to 100%) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean; bars split into NPU queue instructions and other instructions]

SLIDE 40

Slowdown with software NN

20x average slowdown, using the off-the-shelf FANN library

[Bar chart: slowdown over the original program (0x to 75x) for fft, inversek2j, jmeint, jpeg, kmeans, sobel, and the geometric mean]