Compilation and Hardware Support for Approximate Acceleration - PowerPoint PPT Presentation




SLIDE 1

Compilation and Hardware Support for Approximate Acceleration

Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Hadi Esmaeilzadeh (Georgia Tech), Luis Ceze and Mark Oskin
University of Washington
moreau@uw.edu

Theme: 2384.004

1 Thierry Moreau

SLIDE 2

Approximate Computing

Aims to exploit application resilience to trade off quality for efficiency

SLIDE 3

Approximate Computing

(figure only)

SLIDE 4

Approximate Computing

Approximate and cheap ✅ vs. accurate but expensive ❌

SLIDE 5-7

(figures only)

SLIDE 8

Neural Networks as Approximate Accelerators

(figure: a CPU offloading an approximate code region to a neural network)

Esmaeilzadeh et al. [MICRO 2012]

SLIDE 9-12

Neural Acceleration

float foo (float a, float b) { … return val; }

An approximable function like foo is trained into a neural network and offloaded from the ARM core to an NPU implemented on the FPGA.

compiler support: ACCEPT (Sampson et al. [UW-TR])
HW support: SNNAP (Moreau et al. [HPCA 2015])

Result: 3.8x speedup and 2.8x energy efficiency at 10% error

SLIDE 13

Talk Outline

Introduction
Compiler Support with ACCEPT
SNNAP Accelerator Design
Evaluation & Comparison with HLS

SLIDE 14-19

Compilation Overview

code annotation

  • 1. Region detection: ACCEPT detects candidate regions and instruments the program
  • 2. ANN Training: back-propagation & topology search over the collected [training.data]
  • 3. Code Generation: ACCEPT transforms the code, which the CPU then executes with SNNAP
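Conceptually, the code-generation step rewrites calls to the approximable function into an invocation of the trained network on the NPU. The sketch below is illustrative only: npu_invoke and its linear stand-in computation are hypothetical, and the real SNNAP interface is batched and queue-based.

```c
#include <stddef.h>

/* Hypothetical NPU entry point -- a stand-in for dispatching inputs to
 * the accelerator and reading results back. It computes a fixed linear
 * combination so the sketch is runnable without hardware. */
static void npu_invoke(const float *in, size_t n_in, float *out, size_t n_out)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n_in; ++i)
        acc += 0.5f * in[i];            /* pretend "network" output */
    for (size_t j = 0; j < n_out; ++j)
        out[j] = acc;
}

/* Original: float foo(float a, float b) { ... return val; }
 * After transformation, each call site marshals its arguments into an
 * input vector, invokes the NPU, and unmarshals the result: */
float foo_approx(float a, float b)
{
    float in[2] = { a, b };
    float out[1];
    npu_invoke(in, 2, out, 1);
    return out[0];
}
```

In the real system the inputs are batched and DMA'd to SNNAP, and the CPU can sleep until the results are written back.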

SLIDE 20

Programming Model

float sobel(float* p);
. . .
float** src;
float** dst;
while (true) {
  src = read_from_camera();
  for (y = 0; y < h; ++y) {
    for (x = 0; x < w; ++x) {
      dst[y][x] = sobel(&src[y][x]);
    }
  }
  display(dst);
}

SLIDE 21-22

Programming Model

APPROX float sobel(APPROX float* p);
. . .
APPROX float** src;
APPROX float** dst;
while (true) {
  src = read_from_camera();
  for (y = 0; y < h; ++y) {
    for (x = 0; x < w; ++x) {
      dst[y][x] = sobel(&src[y][x]);
    }
  }
  display(ENDORSE(dst));
}

sobel is a good acceleration target: ✅ no side effects ✅ executes often

SLIDE 23-27

Checking for Quality

annotated program: sobel.c
quality metric: d(y, y0)
input data: split into training and test sets
measured outcomes: performance and output quality
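The quality metric d(y, y0) compares the approximate output y against the precise output y0 and is supplied per application; for the image benchmarks it is an image diff. One plausible instantiation (an assumption, not ACCEPT's exact definition) is the mean absolute per-pixel difference:

```c
#include <math.h>
#include <stddef.h>

/* Sketch of an "image diff" quality metric d(y, y0): mean absolute
 * per-pixel difference between approximate output y and precise
 * output y0 over n pixels. Smaller is better; 0 means identical. */
double image_diff(const float *y, const float *y0, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += fabs((double)y[i] - (double)y0[i]);
    return sum / (double)n;
}
```

Evaluating this metric on the held-out test inputs, rather than the training inputs, is what makes the reported output quality honest.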

SLIDE 28

Talk Outline

Introduction
Compiler Support with ACCEPT
SNNAP Accelerator Design
Evaluation & Comparison with HLS

SLIDE 29

Background: Multi-Layer Perceptrons

(figure: fully connected network with inputs x0..x9, hidden layers 0 and 1, and outputs y0, y1; the edges into x7 carry weights w47, w57, w67)

A single neuron applies the activation function f to a weighted sum of the previous layer's outputs:

x_7 = f\left( \sum_{i=4}^{6} w_{i7}\, x_i \right)

Computing a whole layer is a matrix-vector product followed by the element-wise activation:

\begin{bmatrix} x_7 \\ x_8 \\ x_9 \end{bmatrix} = f\left( \begin{bmatrix} w_{47} & w_{57} & w_{67} \\ w_{48} & w_{58} & w_{68} \\ w_{49} & w_{59} & w_{69} \end{bmatrix} \begin{bmatrix} x_4 \\ x_5 \\ x_6 \end{bmatrix} \right)

29 Thierry Moreau

SLIDE 30

Background: Systolic Arrays

The same layer computation maps onto a systolic array: the weights stay resident in a grid of multiply-accumulate cells, the inputs x4, x5, x6 stream through, and the accumulated sums pass through the activation unit f.

(figure: layer equation mapped onto a systolic array holding weights w47..w69, feeding the activation function f)

30 Thierry Moreau
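A minimal software model of this mapping, keeping only the multiply-accumulate structure (a real systolic array also pipelines inputs diagonally so that all rows work in parallel, one PE firing per cycle):

```c
#define N 3

/* One row of the systolic array: each processing element holds one
 * weight and performs a multiply-accumulate as the input values stream
 * past it, so the row computes one dot product of the layer's
 * matrix-vector product. */
void systolic_row(const float w_row[N], const float x_stream[N], float *acc_out)
{
    float acc = 0.0f;                        /* partial sum carried PE to PE */
    for (int cycle = 0; cycle < N; ++cycle)
        acc += w_row[cycle] * x_stream[cycle];  /* one PE fires per cycle */
    *acc_out = acc;
}
```

Stacking N such rows and applying the activation to each row's accumulator reproduces the layer computation of the previous slide.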

SLIDE 31-35

PU Micro-Architecture

(figure: processing unit containing a chain of PEs, weight storage, a sigmoid unit f, and control, implementing the systolic array)

1 - processing elements in DSP logic
2 - local storage for synaptic weights
3 - sigmoid unit implements non-linear activation functions
4 - vertically micro-coded sequencer

SLIDE 36

Multi-Processing Units

(figure: four processing units, each with its PE chain, weight storage, sigmoid unit f, and control, sharing a bus with a scheduler and a DMA master)

36 Thierry Moreau

SLIDE 37

CPU-SNNAP Integration

(figure: CPU with L1/L2 caches connected over the ACP to the PUs, bus, scheduler, and DMA master)

custom mastering interface: coherent reads & writes through the accelerator coherency port (ACP)
low-latency event signaling, sleep & wakeup (SEV/WFE)

SLIDE 38

Talk Outline

Introduction
Programming model
SNNAP design:

  • Efficient neural network evaluation
  • Low-latency communication

Evaluation & Comparison with HLS

SLIDE 39

Evaluation

Neural acceleration on SNNAP (8x8 configuration, clocked at 1/4 of fCPU) vs. precise CPU execution

application    domain          error metric
blackscholes   option pricing  MSE
fft            DSP             MSE
inversek2j     robotics        MSE
jmeint         3D modeling     miss rate
jpeg           compression     image diff
kmeans         ML              image diff
sobel          vision          image diff

SLIDE 40

Whole-Application Speedup

(bar chart: whole-application speedup for blackscholes, fft, inversek2j, jmeint, jpeg, kmeans, and sobel; per-benchmark values of 2.4, 1.3, 2.3, 1.5, 38.1, 2.7, and 10.8; geometric mean 3.8x)

SLIDE 41

Energy Savings

(bar chart: energy savings for the same benchmarks; per-benchmark values of 1.8, 0.9, 1.7, 1.1, 28.0, 2.2, and 7.8; geometric mean 2.8x)

Energy = P(DRAM + SoC) × Runtime, with SNNAP adding +36% to average power

41 Thierry Moreau
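The two headline numbers are consistent with this model: runtime shrinks by the 3.8x mean speedup while average power grows by 36%, so energy drops by a factor of 3.8 / 1.36, roughly 2.8x. A one-line sketch of that arithmetic:

```c
/* Energy ratio under the model Energy = Power * Runtime:
 * runtime divided by `speedup`, power multiplied by (1 + overhead). */
double energy_savings(double speedup, double power_overhead)
{
    return speedup / (1.0 + power_overhead);
}
```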

SLIDE 42-44

Conclusion

float foo (float a, float b) { … return val; }

Approximable functions are offloaded from the ARM core to an NPU on the FPGA, with compiler support from ACCEPT and hardware support from SNNAP.

3.8x speedup & 2.8x energy savings

SLIDE 45

Compilation and Hardware Support for Approximate Acceleration

Thierry Moreau, Adrian Sampson, Andre Baixo, Mark Wyse, Ben Ransford, Jacob Nelson, Luis Ceze and Mark Oskin
University of Washington
moreau@uw.edu

ACCEPT: http://accept.rocks
SNNAP: upon request