Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization

Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization
Special Session - CODES+ISSS
Thierry Moreau, Felipe Augusto, Patrick Howe, Armin Alaghi, Luis Ceze

Internet of Things Revolution
[Figure: sensory input (noisy, real world); processing (aggregate, analytics); consumed by human etc.]
    …
    double temp = sensor_acquire();
    …
Approximate computing: eliminate inefficiencies in systems by producing just-the-right quality

Quantization: going back to basics
[Figure: sensory input (noisy, real world); processing (aggregate, analytics); consumed by human etc.; SRAM and ALU blocks]

This Talk: A “Limit Study” on Precision Scaling
Assumption: hardware that can dynamically and arbitrarily scale its precision
[Figure: float and double next to a precision range from 1 to n]
SW Scope: compute-heavy, regular applications
HW Scope: hardware accelerators

Talk Overview
1. How much precision is needed at different stages of a program?
2. How much energy can be saved (upper bound)?
3. How does this inform approximate computing research?

Talk Overview
1. How much precision is needed at different stages of a program?
   QAPPA - Precision Autotuner
2. How much energy can be saved?
3. How does this inform approximate computing research?

QAPPA: Quality Autotuner for Precision-Programmable Accelerators
Goal: Minimize instruction-level precision requirements given a quality target
[Figure: kernel.c and a desired quality target feed the QAPPA framework, which reports quality & energy savings and instruction-level precision requirements]
Built on top of ACCEPT, the approximate C/C++ compiler: http://accept.rocks

QAPPA Autotuner Overview
Default (no savings)
[Figure: per-instruction savings for instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, next to an application quality gauge running from bad to OK]

QAPPA Autotuner Overview
Optimized: extraneous precision is shaved off
[Figure: per-instruction savings for instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, next to an application quality gauge running from bad to OK]

QAPPA 5-Step Description
[Figure: Annotated Program -> ACCEPT static analysis -> ILPC* -> ACCEPT error injection & instrumentation -> Approximate Binary -> Execution & Quality Assessment (using Program Inputs & Quality Metrics) -> Output Quality Results -> Quality Autotuner -> Configuration & Bit Savings]
* Instruction-level Precision Configuration

1. Program Annotation
Key: use the APPROX type qualifier [*]

    void conv2d(APPROX pix *in, APPROX pix *out, APPROX flt *filter)
    {
        for (row) {
            for (col) {
                APPROX flt sum = 0
                int dstPos = …
                for (row_offset) {
                    for (col_offset) {
                        int srcPos = …
                        int fltPos = …
                        sum += in[srcPos] * filter[fltPos]
                    }
                }
                out[dstPos] = sum / normFactor
            }
        }
    }

[*] EnerJ, Sampson et al., PLDI’11

2. Static Analysis
ACCEPT identifies safe-to-approximate instructions from data annotations using flow analysis

    void conv2d(APPROX pix *in, APPROX pix *out, APPROX flt *filter)
    {
        for (row) {
            for (col) {
                APPROX flt sum = 0
                int dstPos = …
                for (row_offset) {
                    for (col_offset) {
                        int srcPos = …
                        int fltPos = …
                        sum += in[srcPos] * filter[fltPos]
                    }
                }
                out[dstPos] = sum / normFactor
            }
        }
    }

Instruction-Level Precision Configuration (ILPC) produced by ACCEPT:

    conv2d:13:7:load:Int32
    conv2d:13:10:load:Float
    conv2d:13:11:fmul:Float
    conv2d:13:12:fadd:Float
    conv2d:15:1:fdiv:Float
    conv2d:15:7:store:Int32

3. Error Injection
Instruction-Level Precision Configuration (ILPC):

    conv2d:13:7:load:Int4
    conv2d:13:10:load:Fix2.3
    conv2d:13:11:fmul:Fix2.3
    conv2d:13:12:fadd:Fix4.5
    conv2d:15:1:fdiv:Fix2.3
    conv2d:15:7:store:Int4

[Figure: ILPC -> Instrumentation & Compilation -> Approximate Binary]
Each instruction in the ILPC acts as a quality knob that the autotuner can use to maximize bit-savings
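To make the effect of an entry such as `fmul:Fix2.3` concrete, the sketch below rounds a value onto a fixed-point grid in software. It is only an illustration: the reading of "FixI.F" as I integer bits (sign included) plus F fractional bits, the saturating overflow behaviour, and the names `quantize_fixed` and `approx_fmul` are assumptions, not something the slides or ACCEPT specify.

```python
def quantize_fixed(value, int_bits, frac_bits):
    """Round value to a signed fixed-point grid and saturate on overflow.

    Assumption: int_bits counts the sign bit, so the format holds
    int_bits + frac_bits bits in total (two's complement).
    """
    scale = 1 << frac_bits                      # grid step is 1 / scale
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))           # most negative raw code
    highest = (1 << (total_bits - 1)) - 1       # most positive raw code
    raw = int(round(value * scale))             # nearest grid point
    raw = max(lowest, min(highest, raw))        # saturate instead of wrapping
    return raw / scale


def approx_fmul(a, b, int_bits, frac_bits):
    """Multiply precisely, then quantize the product (hypothetical error model)."""
    return quantize_fixed(a * b, int_bits, frac_bits)


if __name__ == "__main__":
    print(quantize_fixed(0.3, 2, 3))
    print(approx_fmul(0.7, 0.9, 2, 3))
```

Under these assumptions a Fix2.3 value has a step of 1/8 and a range of -2.0 to 1.875, so small operands lose fractional detail and large ones clip.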

4. Quality Assessment
The programmer provides a quality assessment script to evaluate quality on the program output
[Figure: the outputs of the Reference Binary and the Approximate Binary are passed to eval.py, which reports e.g. 10dB SNR]
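A quality assessment script of this kind could look like the following minimal sketch. The slides do not show the contents of eval.py, so the function names, the list-of-floats output format, and the SNR definition used here (signal power over error power, in dB) are all assumptions.

```python
import math


def snr_db(reference, approximate):
    """SNR in dB between a reference output and an approximate output.

    Assumed definition: 10 * log10(sum(ref^2) / sum((ref - approx)^2)).
    """
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approximate))
    if noise == 0:
        return math.inf       # outputs are identical
    if signal == 0:
        return -math.inf      # all-zero reference, nonzero error
    return 10.0 * math.log10(signal / noise)


def meets_target(reference, approximate, target_db):
    """True when the approximate output reaches the quality target."""
    return snr_db(reference, approximate) >= target_db


if __name__ == "__main__":
    ref = [1.0, 2.0, 2.0]
    approx = [1.0, 2.0, 2.5]
    print(snr_db(ref, approx), meets_target(ref, approx, 10.0))
```

Keeping the metric in a separate script means the autotuner only needs a pass/fail answer against the target, whatever the application-specific metric is.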

5. Autotuning Algorithm
Greedy iterative algorithm [*]: reduces precision requirement of the instruction that impacts quality the least

    config k: error = 0.10%
      … config [k+1, i-1]: error = 5.91%
        config [k+1, i]:   error = 0.30%
        config [k+1, i+1]: error = 0.12% …
      … config [k+2, i-1]: error = 5.91%
        config [k+2, i]:   error = 0.33%
        config [k+2, i+1]: error = 1.6% …

Finds solution in O(m^2 n) worst case, where m is the number of static safe-to-approximate instructions and n is the number of precision levels for all instructions
[*] Precimonious, Rubio-Gonzalez et al., SC’13
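The greedy search described on this slide can be sketched roughly as follows. This is a simplified reading rather than QAPPA's actual implementation: precision is lowered one bit at a time, the floor of 1 bit is assumed, and `evaluate` is a hypothetical callback standing in for the compile, run, and quality-assessment loop.

```python
def greedy_autotune(num_instructions, max_bits, evaluate, max_error):
    """Greedily lower per-instruction precision while the error stays acceptable.

    evaluate(config) is a caller-supplied (hypothetical) function returning the
    application error for a list of per-instruction bit widths.
    """
    config = [max_bits] * num_instructions       # start fully precise
    while True:
        best = None                               # (error, candidate config)
        for i in range(num_instructions):
            if config[i] <= 1:                    # assumed minimum of 1 bit
                continue
            candidate = list(config)
            candidate[i] -= 1                     # shave one bit off instruction i
            error = evaluate(candidate)
            if error <= max_error and (best is None or error < best[0]):
                best = (error, candidate)
        if best is None:                          # no single reduction is acceptable
            return config
        config = best[1]                          # keep the least harmful reduction


if __name__ == "__main__":
    weights = [1.0, 0.01]                         # toy sensitivity per instruction

    def toy_error(bits):
        return sum(w * 2.0 ** -b for w, b in zip(weights, bits))

    print(greedy_autotune(2, 8, toy_error, 0.05))
```

Each pass evaluates at most m candidates and there are at most m(n-1) passes, which is where an O(m^2 n) bound on evaluations comes from. In the toy run the low-sensitivity instruction is driven down to 1 bit before the sensitive one loses any precision.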

5. Autotuning Algorithm
The autotuner greedily maximizes bit-savings as the quality target is lowered
[Figure: program outputs at quality targets of 10dB, 20dB, 40dB, 60dB and precise]

PERFECT Application Study
Application Domain: Kernels
  PERFECT Application 1: Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
  Space Time Adaptive Processing: Outer Product, System Solve, Inner Product
  Synthetic Aperture Radar: Interpolation 1, Interpolation 2, Back Projection
  Wide Area Motion Imaging: Debayer, Image Registration, Change Detection
  Required Kernels: FFT 1D, FFT 2D
Metric: Signal to Noise Ratio (SNR), [120dB to 10dB] (0.0001% to 31.6% MSE)
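The two ends of the metric range line up if the percentage is read as an error-to-signal amplitude ratio, error = 10^(-SNR/20). That reading is an inference from the numbers on the slide, not something the slide states, and the function name below is made up for illustration.

```python
def snr_db_to_error_percent(snr_db):
    """Relative error (percent) implied by an SNR in dB.

    Assumes the error is an amplitude ratio: error = 10 ** (-SNR / 20).
    """
    return 100.0 * 10.0 ** (-snr_db / 20.0)


if __name__ == "__main__":
    for target in (120, 10):
        print(target, "dB ->", snr_db_to_error_percent(target), "%")
```

Under the same assumption, an error of 0.001% would correspond to 100 dB and an error of 10% to 20 dB.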

Opportunity of Approximations
QAPPA Analyzes PERFECT Dynamic Instruction Mix
[Pie chart, slices marked either Safe to approximate or Precise: control 11%, load/store 27%, int arith 25%, int arith 4%, math 1%, fp arith 31%]

Average Precision Reduction Achieved Across PERFECT Kernels
Dynamic precision reduction on safe-to-approximate instructions, by Target Application SNR (dB). Lower targets are more approximate and give more savings; higher targets are high quality.
   10 dB: 83%
   20 dB: 74%
   40 dB: 57%
   60 dB: 48%
   80 dB: 40%
  100 dB: 32%
  120 dB: 26%

Average Precision Reduction Achieved Across PERFECT Kernels
Dynamic precision reduction on safe-to-approximate instructions, by Average SNR (dB): 83% at 10, 74% at 20, 57% at 40, 48% at 60, 40% at 80, 32% at 100, 26% at 120
[Chart annotation: PERFECT Manual, 0.001% MSE]

Average Precision Reduction Achieved Across PERFECT Kernels
Dynamic precision reduction on safe-to-approximate instructions, by Average SNR (dB): 83% at 10, 74% at 20, 57% at 40, 48% at 60, 40% at 80, 32% at 100, 26% at 120
[Chart annotation: Approximate Computing, 10% MSE]

Talk Overview
1. How much precision is needed at different stages of a program?
   QAPPA - Precision Autotuner
2. How much energy can be saved (upper bound)?
   Case Study of Precision Scaling Hardware Mechanisms
3. How does this inform approximate computing research?

Translating Precision Reduction into Energy Savings (Compute)
Baseline ALU: No savings
[Figure: bit-level datapath diagram with quantize, serialize, compute and de-serialize stages]

Translating Precision Reduction into Energy Savings (Compute)
Baseline ALU: No savings
Value Truncation, QUORA [MICRO’13]: Less Power
[Figure: bit-level datapath diagram with quantize, serialize, compute and de-serialize stages]

Translating Precision Reduction into Energy Savings (Compute)
Baseline ALU: No savings
Value Truncation, QUORA [MICRO’13]: Less Power
Bit-Sliced, Stripes [MICRO’16]: Higher Throughput
[Figure: bit-level datapath diagram with quantize, serialize, compute and de-serialize stages]
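One way to see why a serialized datapath can turn lower precision into higher throughput is a toy bit-serial multiplier, where the cycle count follows the operand width. The sketch below is a generic shift-and-add illustration with a made-up function name; it is not a model of the Stripes or QUORA hardware, and it assumes unsigned operands with `b` already quantized to `precision_bits`.

```python
def bit_serial_multiply(a, b, precision_bits):
    """Shift-and-add multiply that consumes one bit of b per cycle.

    Returns (product, cycles). Assumes a and b are non-negative integers
    and that b fits in precision_bits.
    """
    accumulator = 0
    cycles = 0
    for bit in range(precision_bits):
        if (b >> bit) & 1:              # this cycle's bit of b
            accumulator += a << bit     # add the shifted partial product
        cycles += 1                     # one cycle per processed bit
    return accumulator, cycles


if __name__ == "__main__":
    print(bit_serial_multiply(6, 5, 4))   # 4-bit operand
    print(bit_serial_multiply(6, 1, 2))   # 2-bit operand, half the cycles
```

In this toy model halving the precision halves the cycles per multiply, which is the intuition behind the "Higher Throughput" label; a truncating ALU would instead keep the cycle count and save power on the unused bits.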
