Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization
Special Session - CODES+ISSS
Thierry Moreau, Felipe Augusto, Patrick Howe, Armin Alaghi, Luis Ceze
Internet of Things Revolution: noisy, real-world sensory input; processing; aggregate analytics, consumed by humans, etc.
… double temp = sensor_acquire(); …
Approximate computing: eliminate inefficiencies in systems by producing just-the-right quality
[Figure: SRAM and ALU whose precision scales across double, float, n bits, and 1 bit]
Assumption: hardware that can dynamically and arbitrarily scale its precision
SW Scope: compute-heavy, regular applications
HW Scope: hardware accelerators
QAPPA - Precision Autotuner
Goal: Minimize instruction-level precision requirements given a quality target
Inputs: kernel.c, desired quality target
QAPPA framework
Outputs: instruction-level precision requirements; quality & energy savings
Built on top of ACCEPT, the approximate C/C++ compiler http://accept.rocks
[Figure: precision of instruction 0 … instruction n against application quality (bad to OK) and savings]
Default (no savings)
Optimized: extraneous precision is shaved off
QAPPA flow:
Inputs: Annotated Program; Program Inputs & Quality Metrics
Stages: ACCEPT static analysis, ILPC*, ACCEPT error injection & instrumentation, Approximate Binary, Execution & Quality Assessment, Quality Autotuner
Outputs: Output Configuration; Quality Results & Bit Savings
* Instruction-level Precision Configuration
void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter) {
  for (row) {
    for (col) {
      APPROX flt sum = 0
      int dstPos = …
      for (row_offset) {
        for (col_offset) {
          int srcPos = …
          int fltPos = …
          sum += in[srcPos] * filter[fltPos]
        }
      }
    }
  }
}
Key: use the APPROX type qualifier [*]
[*] EnerJ, Sampson et al., PLDI’11
Instruction-Level Precision Configuration (ILPC)
conv2d:13:7:load:Int32
conv2d:13:10:load:Float
conv2d:13:11:fmul:Float
conv2d:13:12:fadd:Float
conv2d:15:1:fdiv:Float
conv2d:15:7:store:Int32
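A minimal sketch of how ILPC entries like the ones above could be read programmatically. The field meanings (function, source line, instruction index, opcode, precision) are assumptions inferred from the example entries, not a documented ACCEPT format, and `parse_ilpc` is a hypothetical helper.

```python
from typing import List, NamedTuple


class IlpcEntry(NamedTuple):
    function: str   # enclosing function, e.g. "conv2d"
    line: int       # assumed: source line number
    index: int      # assumed: instruction index within that line
    opcode: str     # e.g. "load", "fmul"
    precision: str  # e.g. "Float", "Int32", "Fix2.3"


def parse_ilpc(text: str) -> List[IlpcEntry]:
    """Parse whitespace-separated entries of the form func:line:index:opcode:type."""
    entries = []
    for raw in text.split():
        function, line, index, opcode, precision = raw.split(":")
        entries.append(IlpcEntry(function, int(line), int(index), opcode, precision))
    return entries


config = parse_ilpc("conv2d:13:7:load:Int32 conv2d:13:11:fmul:Float")
```

Each parsed entry then corresponds to one static instruction whose precision can be adjusted independently.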
ACCEPT identifies safe-to-approximate instructions from data annotations using flow analysis
Instruction-Level Precision Configuration (ILPC)
conv2d:13:7:load:Int4
conv2d:13:10:load:Fix2.3
conv2d:13:11:fmul:Fix2.3
conv2d:13:12:fadd:Fix4.5
conv2d:15:1:fdiv:Fix2.3
conv2d:15:7:store:Int4
Instrumentation & Compilation produces the Approximate Binary
Each instruction in the ILPC acts as a quality knob that the autotuner can use to maximize bit-savings
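To make the precision knob concrete, here is a rough sketch of quantizing a value to a fixed-point type such as Fix2.3. It assumes that FixI.F means I integer bits (including the sign bit) and F fractional bits, with round-to-nearest and saturation; the slides do not define the encoding, so this interpretation and the `quantize_fixed` helper are hypothetical.

```python
def quantize_fixed(x: float, int_bits: int, frac_bits: int) -> float:
    """Round x onto a signed fixed-point grid and saturate on overflow.

    Assumed layout: int_bits integer bits (sign included), frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    lowest = -float(1 << (int_bits - 1))               # most negative representable value
    highest = float(1 << (int_bits - 1)) - 1.0 / scale  # most positive representable value
    rounded = round(x * scale) / scale                  # snap to the nearest grid point
    return min(max(rounded, lowest), highest)
```

Under these assumptions, a Fix2.3 value lies on a grid of step 1/8 between -2 and 1.875, so 0.3 becomes 0.25 and anything above the range saturates.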
The programmer provides a quality assessment script to evaluate quality on the program output
[Figure: the outputs of the Reference Binary and the Approximate Binary are passed to eval.py, which reports a quality score, e.g. 10dB SNR]
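A quality assessment script of this kind might look like the sketch below. It uses the common definition SNR = 10 log10(signal power / noise power); the slides do not spell out the exact formula or the interface of eval.py, so the function names and signatures here are assumptions.

```python
import math
from typing import Sequence


def snr_db(reference: Sequence[float], approximate: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB of an approximate output against a reference."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approximate))
    if noise == 0.0:
        return math.inf  # outputs are identical
    return 10.0 * math.log10(signal / noise)


def meets_target(reference: Sequence[float], approximate: Sequence[float],
                 target_db: float) -> bool:
    """True when the approximate output reaches the requested quality target."""
    return snr_db(reference, approximate) >= target_db
```

For example, an output that is off by 10% of the only nonzero sample scores about 20dB, which passes a 10dB target and fails a 40dB one.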
config k: error = 0.10%
config [k+1, i-1]: error = 5.91%
config [k+1, i]: error = 0.30%
config [k+1, i+1]: error = 0.12%
config [k+2, i-1]: error = 5.91%
config [k+2, i]: error = 0.33%
config [k+2, i+1]: error = 1.6%
…
Greedy iterative algorithm [*]: reduces precision requirement of the instruction that impacts quality the least
Finds a solution in O(m^2 n) worst case, where m is the number of static safe-to-approximate instructions and n is the number of precision levels per instruction
[*] Precimonious, Rubio-Gonzalez et al., SC’13
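The greedy search described above can be sketched as follows. This is an illustration under stated assumptions, not QAPPA's implementation: a configuration is modeled as a hypothetical mapping from instruction name to bits of precision, and `toy_error` is a made-up stand-in for running the approximate binary and scoring its output.

```python
from typing import Callable, Dict

Config = Dict[str, int]  # instruction name -> bits of precision (hypothetical encoding)


def greedy_tune(start: Config, evaluate: Callable[[Config], float],
                max_error: float, min_bits: int = 1) -> Config:
    """Repeatedly lower the precision of the instruction that hurts quality the least."""
    config = dict(start)
    while True:
        best_name, best_error = None, None
        for name, bits in config.items():
            if bits <= min_bits:
                continue
            trial = dict(config)
            trial[name] = bits - 1          # try one level less precision
            error = evaluate(trial)
            if error <= max_error and (best_error is None or error < best_error):
                best_name, best_error = name, error
        if best_name is None:               # no instruction can be lowered any further
            return config
        config[best_name] -= 1


# Toy error model (an assumption): each instruction contributes weight * 2^-bits.
WEIGHTS = {"load": 1.0, "fmul": 4.0}


def toy_error(config: Config) -> float:
    return sum(w * 2.0 ** -config[name] for name, w in WEIGHTS.items())


tuned = greedy_tune({"load": 8, "fmul": 8}, toy_error, max_error=0.1)
```

Each round evaluates at most m candidate configurations and there are at most m·n rounds, which is where the O(m^2 n) worst case comes from.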
[Figure: per-instruction precision at decreasing quality targets: precise, 60dB, 40dB, 20dB, 10dB]
The autotuner greedily maximizes bit-savings as the quality target is lowered
Application Domain: Kernels
PERFECT Application 1: Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
Space Time Adaptive Processing: Outer Product, System Solve, Inner Product
Synthetic Aperture Radar: Interpolation 1, Interpolation 2, Back Projection
Wide Area Motion Imaging: Debayer, Image Registration, Change Detection
Required Kernels: FFT 1D, FFT 2D
Metric (all kernels): Signal to Noise Ratio (SNR) [120dB to 10dB] (0.0001% to 31.6% MSE)
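The two endpoints quoted for the metric (120dB for 0.0001% and 10dB for 31.6%) are consistent with the relation error = 10^(-SNR/20), that is, with reading the percentage as an amplitude ratio. The sketch below encodes that relation; it is inferred from the quoted numbers rather than stated in the slides, and the function names are hypothetical.

```python
import math


def snr_db_to_error_fraction(snr_db: float) -> float:
    """Relative error implied by an SNR target, assuming error = 10^(-SNR/20)."""
    return 10.0 ** (-snr_db / 20.0)


def error_fraction_to_snr_db(error: float) -> float:
    """Inverse mapping: SNR in dB for a given relative error."""
    return -20.0 * math.log10(error)
```

Under this reading, a 10% error corresponds to 20dB and a 0.001% error to 100dB.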
QAPPA Analyzes PERFECT: Dynamic Instruction Mix
load/store 27%, int arith 4%, fp arith 31%, math 1%, int arith 25%, control 11%
[Figure: pie chart; slices are colored as either safe to approximate or precise]
Dynamic precision reduction on safe-to-approximate instructions
[Figure: precision reduction (0% to 100%) vs. target application SNR (10, 20, 40, 60, 80, 100, 120 dB); bar labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%; the axis runs from high quality to approximate, with more savings toward the approximate end]
Reference points marked on the plot: PERFECT Manual, 0.001% MSE; Approximate Computing, 10% MSE
Case Study of Precision Scaling Hardware Mechanisms
[Figure: operands are quantized and then processed by three ALU designs]
Baseline ALU: no savings
Value Truncation, QUORA [MICRO’13]: less power
Bit-Sliced, Stripes [MICRO’16]: higher throughput
Methodology: Post-place-and-route PrimeTime power analysis on a 65nm TSMC library
Goal: Design a precision-scalable adder that can elegantly trade lower precision for energy savings
Exploration: Combine value truncation and bit slicing techniques, and vary the slice width in powers of 2
[Figure: energy cost (pJ, 0.00 to 1.80) vs. input bit-width (8 to 64), one curve per slice width: 1, 2, 4, 8, 16, 32, 64]
technique 1: value truncation
technique 2: bit slicing
we look at different slice widths in power-of-2 increments
a 2-bit slice seems to be the energy-optimal design point
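The combination of value truncation and bit slicing can be illustrated with a small functional model of an adder. This is a software sketch of the idea only, not the evaluated hardware design: it drops low-order bits, then adds the remaining bits one slice per cycle with a carry between slices. The function name, the unsigned-operand assumption, and the cycle count as a cost proxy are all assumptions; energy is not modeled.

```python
from typing import Tuple


def sliced_add(a: int, b: int, precision: int, slice_width: int,
               word_bits: int = 64) -> Tuple[int, int]:
    """Add the top `precision` bits of two unsigned words, one slice per cycle.

    Returns (sum aligned to the original word, number of slice cycles used).
    """
    drop = word_bits - precision             # low-order bits removed by value truncation
    a_t, b_t = a >> drop, b >> drop
    mask = (1 << slice_width) - 1
    result, carry, cycles = 0, 0, 0
    for shift in range(0, precision, slice_width):
        s = ((a_t >> shift) & mask) + ((b_t >> shift) & mask) + carry
        result |= (s & mask) << shift        # keep this slice of the sum
        carry = s >> slice_width             # carry into the next slice
        cycles += 1
    result &= (1 << precision) - 1           # wrap around at the chosen precision
    return result << drop, cycles
```

With an 8-bit word, 4 bits of precision, and 2-bit slices, adding 208 and 32 takes two cycles; lowering the precision shortens the loop, which is the lever the exploration above varies.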
Average Compute Energy Savings vs. Application SNR
[Figure: energy savings (x, higher is better, 1 to 10) vs. application SNR (dB, higher is better, 20 to 60) for slice widths 1, 2, 4, 8, 16, 32, with QUORA and Stripes marked; data labels: 2.5 2.6 2.9 3.1 3.8, 3.6 4.3 4.8 5.6 7.1, 2.8 3.0 3.2 3.6 7.7, 1.4 1.5 1.7 2.0 2.5]
At 40dB a 16-bit sliced ALU can achieve a 4.8x energy reduction!
At 20dB the optimal point shifts to an 8-bit slice
Comparative Study of Approximation Techniques
Many papers on approximate computing state: “Our technique provided n times speedup at x% error”
Problem: This gives us a data point but says little about the merits of the technique at trading accuracy for efficiency
Solution: Use QAPPA to produce quick comparison results to assess the effectiveness of a technique
Methodology (1/2): SPICE simulation of ALU/FPU design under different voltage overscaling factors.
[Figure, fp adder example: error probability (%, 10 to 40) per bit position (1 to 31) for overscaling factors from 0.8 to 1]
Methodology (2/2): Then we feed the error model into QAPPA’s error injection framework to assess application error.
[Figure: SNR (dB, higher is better, 10 to 40) per kernel (2dconv, dwt, histeq, systemsolve, inner, interp1, interp2, bp, debayer, lucaskanade, changedet, fft1d, fft2d) at VOF=0.95, VOF=0.90, VOF=0.84]
Results: Precision scaling always produces a better quality/efficiency tradeoff than voltage overscaling
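Feeding a per-bit error model into an error injection step, as in Methodology (2/2), could look roughly like the sketch below. It assumes the error model is a list of independent flip probabilities, one per bit position; the actual format of QAPPA's error model is not given in the slides, and `inject_bit_errors` is a hypothetical helper.

```python
import random
from typing import Sequence


def inject_bit_errors(value: int, flip_prob: Sequence[float],
                      rng: random.Random) -> int:
    """Flip bit i of `value` with probability flip_prob[i] (index 0 = lowest bit)."""
    for bit, p in enumerate(flip_prob):
        if rng.random() < p:
            value ^= 1 << bit
    return value


rng = random.Random(0)  # fixed seed so runs are repeatable
# Hypothetical model for an 8-bit result: only the two highest bits can fail.
noisy = inject_bit_errors(0b00001111, [0.0] * 6 + [0.2, 0.4], rng)
```

Applying such a function to the result of every safe-to-approximate instruction, then scoring the output with the quality script, gives an application-level error for each overscaling factor.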
Precision Scaling Architectures: Need to see more precision-scaled accelerators for more applications
CAD tools with Quality Awareness: Need to see more tools that can leverage quantization, especially in the FPGA community, along the lines of AHLS [DATE’17]