Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization


Special Session - CODES+ISSS. Thierry Moreau, Felipe Augusto, Patrick Howe, Armin Alaghi, Luis Ceze


slide-1
SLIDE 1

Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization

Special Session - CODES+ISSS

Thierry Moreau, Felipe Augusto, Patrick Howe, Armin Alaghi, Luis Ceze

slide-2
SLIDE 2

Internet of Things Revolution

[Figure: noisy, real-world sensory input -> processing -> aggregate analytics, consumed by humans, etc.]

… double temp = sensor_acquire(); …

Approximate computing: eliminate inefficiencies in systems by producing just-the-right quality

slide-3
SLIDE 3

Quantization: going back to basics

[Figure: noisy, real-world sensory input -> processing -> aggregate analytics, consumed by humans, etc.; hardware blocks shown: SRAM, ALU]

slide-4
SLIDE 4

This Talk: A “Limit Study” on Precision Scaling

[Figure: precision scale labeled double, float, n, 1]

Assumption: hardware that can dynamically and arbitrarily scale its precision

SW Scope: compute-heavy, regular applications

HW Scope: hardware accelerators

slide-5
SLIDE 5

Talk Overview

  • 1. How much precision is needed at different stages of a program?
  • 2. How much energy can be saved (upper bound)?
  • 3. How does this inform approximate computing research?
slide-6
SLIDE 6

Talk Overview

  • 1. How much precision is needed at different stages of a program?

QAPPA - Precision Autotuner

  • 2. How much energy can be saved?
  • 3. How does this inform approximate computing research?
slide-7
SLIDE 7

QAPPA: Quality Autotuner for Precision-Programmable Accelerators

Goal: Minimize instruction-level precision requirements given a quality target

[Figure: kernel.c and a desired quality target go into the QAPPA framework, which outputs instruction-level precision requirements and quality & energy savings]

Built on top of ACCEPT, the approximate C/C++ compiler http://accept.rocks

slide-8
SLIDE 8

QAPPA Autotuner Overview

[Figure: per-instruction precision bars for instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, with gauges for savings and application quality (bad to OK)]

Default (no savings)

slide-9
SLIDE 9

QAPPA Autotuner Overview

[Figure: per-instruction precision bars for instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, with gauges for savings and application quality (bad to OK)]

Optimized: extraneous precision is shaved off

slide-10
SLIDE 10

QAPPA 5-Step Description

[Flow diagram; boxes in order:]

  • Annotated Program
  • Program Inputs & Quality Metrics
  • ACCEPT static analysis
  • ILPC*
  • ACCEPT error injection & instrumentation
  • Approximate Binary
  • Execution & Quality Assessment
  • Quality Autotuner
  • Output Configuration
  • Quality Results & Bit Savings

* ILPC: Instruction-Level Precision Configuration

slide-11
SLIDE 11
  • 1. Program Annotation


    void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter) {
      for (row) {
        for (col) {
          APPROX flt sum = 0
          int dstPos = …
          for (row_offset) {
            for (col_offset) {
              int srcPos = …
              int fltPos = …
              sum += in[srcPos] * filter[fltPos]
            }
          }
          out[dstPos] = sum / normFactor
        }
      }
    }

Key: use the APPROX type qualifier [*]

[*] EnerJ, Sampson et al., PLDI’11

slide-12
SLIDE 12
  • 2. Static Analysis

Instruction-Level Precision Configuration (ILPC):

    conv2d:13:7:load:Int32
    conv2d:13:10:load:Float
    conv2d:13:11:fmul:Float
    conv2d:13:12:fadd:Float
    conv2d:15:1:fdiv:Float
    conv2d:15:7:store:Int32


ACCEPT identifies safe-to-approximate instructions from data annotations using flow analysis

slide-13
SLIDE 13
  • 3. Error Injection


Instruction-Level Precision Configuration (ILPC):

    conv2d:13:7:load:Int4
    conv2d:13:10:load:Fix2.3
    conv2d:13:11:fmul:Fix2.3
    conv2d:13:12:fadd:Fix4.5
    conv2d:15:1:fdiv:Fix2.3
    conv2d:15:7:store:Int4

[Figure: Instrumentation & Compilation turns the ILPC into the Approximate Binary]

Each instruction in the ILPC acts as a quality knob that the autotuner can use to maximize bit-savings
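As a minimal sketch of what turning one of these knobs does to a value, the Python below rounds a number onto a signed fixed-point grid. It assumes that a label such as Fix2.3 means 2 integer bits and 3 fractional bits plus a sign bit; the slides do not define the notation, and `quantize_fixed` is a hypothetical helper, not part of ACCEPT or QAPPA.

```python
def quantize_fixed(x, int_bits, frac_bits):
    """Round x to a signed fixed-point grid and saturate at the range limits.

    Assumption (not from the slides): the format holds a sign bit plus
    int_bits integer bits and frac_bits fractional bits.
    """
    scale = 1 << frac_bits                        # grid steps per unit
    max_code = (1 << (int_bits + frac_bits)) - 1  # largest positive code
    min_code = -(1 << (int_bits + frac_bits))     # most negative code
    code = round(x * scale)                       # nearest grid point
    code = max(min_code, min(max_code, code))     # saturate instead of wrapping
    return code / scale
```

Under that reading, `quantize_fixed(1.3, 2, 3)` lands on 1.25, the nearest multiple of 1/8, and out-of-range inputs saturate at 3.875 or -4.0.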

slide-14
SLIDE 14
  • 4. Quality Assessment


The programmer provides a quality assessment script to evaluate quality on the program output

[Figure: the outputs of the Reference Binary and the Approximate Binary are fed to eval.py, which reports a quality score, e.g. 10dB SNR]
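A quality script of this kind could look roughly like the sketch below. It is an illustration only: the slides do not show the contents of eval.py, `snr_db` is a hypothetical function name, and the outputs are assumed to be flat sequences of numbers of equal length.

```python
import math

def snr_db(reference, approximate):
    """Signal-to-noise ratio in dB of an approximate output against a reference."""
    signal = sum(r * r for r in reference)                             # signal energy
    noise = sum((r - a) ** 2 for r, a in zip(reference, approximate))  # error energy
    if noise == 0:
        return math.inf    # outputs are identical
    if signal == 0:
        return -math.inf   # all-zero reference, non-zero error
    return 10.0 * math.log10(signal / noise)
```

For example, a reference of `[1.0, 1.0]` against an approximate output of `[0.9, 1.1]` gives about 20 dB.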

slide-15
SLIDE 15
  • 5. Autotuning Algorithm


    config k:          error = 0.10%
    config [k+1, i-1]: error = 5.91%
    config [k+1, i]:   error = 0.30%
    config [k+1, i+1]: error = 0.12%
    config [k+2, i-1]: error = 5.91%
    config [k+2, i]:   error = 0.33%
    config [k+2, i+1]: error = 1.6%
    …

Greedy iterative algorithm [*]: reduces the precision requirement of the instruction that impacts quality the least

Finds a solution in O(m^2 n) worst case, where m is the number of static safe-to-approximate instructions and n is the number of precision levels for all instructions

[*] Precimonious, Rubio-Gonzalez et al., SC’13
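The greedy loop can be sketched in a few lines of Python. This is an illustrative reading of the slide, not QAPPA's code: `greedy_autotune` and `measure_error` are hypothetical names, precision is modeled as a bit count per instruction, and `measure_error` stands in for running the approximate binary and the quality script.

```python
def greedy_autotune(num_instructions, max_bits, measure_error, max_error):
    """Repeatedly remove one bit from the instruction whose reduction
    hurts quality the least, until no reduction meets the quality target."""
    config = [max_bits] * num_instructions   # start from full precision
    while True:
        best = None                          # (error, instruction index)
        for i in range(num_instructions):
            if config[i] <= 1:
                continue                     # cannot go below one bit
            trial = list(config)
            trial[i] -= 1
            err = measure_error(trial)       # one quality evaluation
            if err <= max_error and (best is None or err < best[0]):
                best = (err, i)
        if best is None:
            return config                    # no feasible reduction left
        config[best[1]] -= 1
```

Each pass evaluates at most m candidates and there are at most m·n passes, which is consistent with the O(m^2 n) bound if n is read as the number of precision levels per instruction.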

slide-16
SLIDE 16

  • 5. Autotuning Algorithm

[Figure: results at quality targets precise, 60dB, 40dB, 20dB, 10dB]

The autotuner greedily maximizes bit-savings as the quality target is lowered
slide-17
SLIDE 17

PERFECT Application Study

Application Domain               Kernels
PERFECT Application 1            Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
Space Time Adaptive Processing   Outer Product, System Solve, Inner Product
Synthetic Aperture Radar         Interpolation 1, Interpolation 2, Back Projection
Wide Area Motion Imaging         Debayer, Image Registration, Change Detection
Required Kernels                 FFT 1D, FFT 2D

Metric: Signal to Noise Ratio (SNR) [120dB to 10dB] (0.0001% to 31.6% MSE)
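The percentages quoted next to the SNR range are consistent with reading them as the relative error implied by an amplitude ratio, 10^(-SNR/20). That reading is an assumption; the slides do not give the formula. A small sketch, with `snr_db_to_error_percent` as a hypothetical helper name:

```python
def snr_db_to_error_percent(snr_db):
    """Relative error (in percent) implied by an SNR in dB.

    Assumption: error = 10 ** (-SNR / 20), which reproduces the
    endpoints quoted on the slide.
    """
    return 100.0 * 10.0 ** (-snr_db / 20.0)
```

With this formula 120 dB maps to 0.0001% and 10 dB to about 31.6%, matching the range on the slide.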

slide-18
SLIDE 18

Opportunity of Approximations

QAPPA Analyzes PERFECT Dynamic Instruction Mix

[Pie chart of the dynamic instruction mix, with each segment marked as safe to approximate or precise: load/store 27%, int arith 4%, fp arith 31%, math 1%, int arith 25%, control 11%]

slide-19
SLIDE 19

Average Precision Reduction Achieved Across PERFECT Kernels

Dynamic precision reduction on safe-to-approximate instructions

[Bar chart: dynamic precision reduction (0% to 100%) versus target application SNR (10, 20, 40, 60, 80, 100, 120 dB). Bar labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%. Axis annotations: High Quality, Approximate, More savings]

slide-20
SLIDE 20

Average Precision Reduction Achieved Across PERFECT Kernels

Dynamic precision reduction on safe-to-approximate instructions

[Bar chart: dynamic precision reduction (0% to 100%) versus average SNR (10, 20, 40, 60, 80, 100, 120 dB). Bar labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%. Annotation: PERFECT Manual, 0.001% MSE]

slide-21
SLIDE 21

Average Precision Reduction Achieved Across PERFECT Kernels

Dynamic precision reduction on safe-to-approximate instructions

[Bar chart: dynamic precision reduction (0% to 100%) versus average SNR (10, 20, 40, 60, 80, 100, 120 dB). Bar labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%. Annotation: Approximate Computing, 10% MSE]

slide-22
SLIDE 22

Talk Overview

  • 1. How much precision is needed at different stages of a program?

QAPPA - Precision Autotuner

  • 2. How much energy can be saved (upper bound)?

Case Study of Precision Scaling Hardware Mechanisms

  • 3. How does this inform approximate computing research?
slide-23
SLIDE 23

Translating Precision Reduction into Energy Savings (Compute)

[Figure: bit-level datapath diagram with quantization (quant), serialization (ser) and de-serialization (de-ser) stages]

Baseline ALU: no savings

slide-24
SLIDE 24

Translating Precision Reduction into Energy Savings (Compute)

[Figure: bit-level datapath diagram with quantization (quant), serialization (ser) and de-serialization (de-ser) stages]

Baseline ALU: no savings

Value Truncation (QUORA [MICRO’13]): less power

slide-25
SLIDE 25

Translating Precision Reduction into Energy Savings (Compute)

[Figure: bit-level datapath diagram with quantization (quant), serialization (ser) and de-serialization (de-ser) stages]

Baseline ALU: no savings

Value Truncation (QUORA [MICRO’13]): less power

Bit-Sliced (Stripes [MICRO’16]): higher throughput
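To make the two mechanisms concrete, here is a minimal Python sketch. It is not a model of the QUORA or Stripes hardware: `truncate` and `bit_sliced_add` are hypothetical helper names, and the sketch assumes unsigned operands and a slice-serial adder that processes the least significant slice first.

```python
def truncate(value, width, keep_bits):
    """Value truncation: zero the (width - keep_bits) least significant bits."""
    drop = width - keep_bits
    return (value >> drop) << drop

def bit_sliced_add(a, b, bits, slice_width):
    """Add two unsigned operands slice_width bits at a time.

    Returns (sum modulo 2**bits, number of slice steps taken).
    """
    mask = (1 << slice_width) - 1
    result, carry, steps = 0, 0, 0
    for shift in range(0, bits, slice_width):
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift   # keep this slice of the sum
        carry = s >> slice_width        # carry into the next slice
        steps += 1
    return result & ((1 << bits) - 1), steps
```

With 8-bit operands and 4-bit slices the add takes 2 slice steps; a 16-bit operand at the same slice width takes 4, which is the intuition behind trading precision for throughput.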

slide-26
SLIDE 26

Case Study: Precision Scaled Adder

Methodology: Post-place-and-route PrimeTime power analysis on a 65nm TSMC library

Goal: Design a precision-scalable adder that can elegantly trade lower precision for energy savings

Exploration: Combine value truncation and bit slicing techniques, and vary the slice width in increments of powers of 2

slide-27
SLIDE 27

Precision Scaled Adder Study

[Plot: energy cost (pJ, 0.00 to 1.80) versus input bit-width (8 to 64); curve label: 64]

technique 1: value truncation

offset due to static power

slide-28
SLIDE 28

Precision Scaled Adder Study

[Plot: energy cost (pJ, 0.00 to 1.80) versus input bit-width (8 to 64); curve labels: 1, 64]

technique 2: bit slicing

slide-29
SLIDE 29

Case Study: Precision-Scaled Adder

[Plot: energy cost (pJ, 0.00 to 1.80) versus input bit-width (8 to 64); one curve per slice width: 1, 2, 4, 8, 16, 32, 64]

We look at different slice widths in power-of-2 increments

A 2-bit slice seems to be the energy-optimal design point

slide-30
SLIDE 30

PERFECT Study: Compute Energy Savings

Average Compute Energy Savings vs. Application SNR

[Chart: energy savings (x, higher is better, 1 to 10) versus application SNR (dB, higher is better, 20 to 60). Legend: quora, stripes; slice widths 1, 2, 4, 8, 16, 32. Data labels: 2.5, 2.6, 2.9, 3.1, 3.8, 3.6, 4.3, 4.8, 5.6, 7.1, 2.8, 3.0, 3.2, 3.6, 7.7, 1.4, 1.5, 1.7, 2.0, 2.5]

slide-31
SLIDE 31

PERFECT Study: Compute Energy Savings

Average Compute Energy Savings vs. Application SNR

[Chart: energy savings (x, higher is better, 1 to 10) versus application SNR (dB, higher is better, 20 to 60) for slice widths 1, 2, 4, 8, 16, 32. Data labels: 2.5, 2.6, 2.9, 3.1, 3.8, 3.6, 4.3, 4.8, 5.6, 7.1, 2.8, 3.0, 3.2, 3.6, 7.7, 1.4, 1.5, 1.7, 2.0, 2.5]

At 40dB a 16b sliced ALU can achieve a 4.8x energy reduction!

slide-32
SLIDE 32

PERFECT Study: Compute Energy Savings

Average Compute Energy Savings vs. Application SNR

[Chart: energy savings (x, higher is better, 1 to 10) versus application SNR (dB, higher is better, 20 to 60) for slice widths 1, 2, 4, 8, 16, 32. Data labels: 2.5, 2.6, 2.9, 3.1, 3.8, 3.6, 4.3, 4.8, 5.6, 7.1, 2.8, 3.0, 3.2, 3.6, 7.7, 1.4, 1.5, 1.7, 2.0, 2.5]

At 20dB the optimal design point shifts to an 8-bit slice

slide-33
SLIDE 33

Talk Overview

  • 1. How much precision is needed at different stages of a program?

QAPPA - Precision Autotuner

  • 2. How much energy can be saved (upper bound)?

Case Study of Precision Scaling Hardware Mechanisms

  • 3. How does this inform approximate computing research?

Comparative Study of Approximation Techniques

slide-34
SLIDE 34

Comparative Study

Many papers on approximate computing state: “Our technique provided n times speedup at x% error”

Problem: This gives us a data point but says little about the merits of the technique at trading accuracy for efficiency

Solution: Use QAPPA to produce quick comparison results to assess the effectiveness of a technique

slide-35
SLIDE 35

Comparative Study - Voltage Overscaling

[3D plot, fp adder example: error probability (%, axis 10 to 40) as a function of bit position (1 to 31) and overscaling factor (0.8 to 1)]

Methodology (1/2): SPICE simulation of ALU/FPU design under different voltage overscaling factors.

slide-36
SLIDE 36

Comparative Study - Voltage Overscaling

Methodology (2/2): We then feed the error model into QAPPA’s error injection framework to assess application error.

Results: Precision scaling always produces a better quality/efficiency tradeoff than voltage overscaling

[Chart: SNR (dB), higher is better (axis 10 to 40), per kernel (2dconv, dwt, histeq, outer, systemsolve, inner, interp1, interp2, bp, debayer, lucaskanade, changedet, fft1d, fft2d) for VOF=0.95, VOF=0.90, VOF=0.84]
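Feeding a per-bit error model into an injection step can be sketched as below. This is an illustration under stated assumptions, not QAPPA's injection framework: `inject_bit_errors` is a hypothetical name, the error model is taken to be one flip probability per bit position, and in the talk those probabilities come from SPICE simulation rather than from made-up numbers.

```python
import random

def inject_bit_errors(value, flip_probs, rng):
    """Flip bit i of an integer value with probability flip_probs[i].

    Index 0 is the least significant bit. rng is a random.Random
    instance so that runs are reproducible.
    """
    for bit, p in enumerate(flip_probs):
        if rng.random() < p:
            value ^= 1 << bit
    return value
```

Applying this to the result of each affected instruction and then running the quality script gives an application-level SNR for a given overscaling factor.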

slide-37
SLIDE 37

Future Directions in Architecture/CAD

Precision Scaling Architectures: Need to see more precision-scaled accelerators for more applications, of the likes of QUORA [MICRO’13], Stripes [MICRO’16]

CAD tools with Quality Awareness: Need to see more tools that can leverage quantization, especially in the FPGA community, of the likes of AHLS [DATE’17]

slide-38
SLIDE 38

Conclusion

  • 1. How much precision is needed at different stages of a program?

QAPPA - Precision Autotuner

  • 2. How much energy can be saved (upper bound)?

Case Study of Precision Scaling Hardware Mechanisms

  • 3. How does this inform approximate computing research?

Comparative Study of Approximation Techniques

slide-39
SLIDE 39

Special Session - CODES+ISSS

Thierry Moreau, Felipe Augusto, Patrick Howe, Armin Alaghi, Luis Ceze

Exploiting Quality-Efficiency Tradeoffs with Arbitrary Quantization

Thank you!