Approximating to the Last Bit (Thierry Moreau, Adrian Sampson, Luis Ceze)

SLIDE 1

Approximating to the Last Bit

Thierry Moreau, Adrian Sampson, Luis Ceze

{moreau, luisceze}@cs.washington.edu, asampson@cornell.edu

WAX 2016 co-located with ASPLOS 2016 April 3rd 2016

SLIDE 2

What this Talk is About

How many bits in a program are really that important?

1 - AXE: Quality Tuning Framework
2 - PERFECT Benchmark Study

SLIDE 3

Precision Tuning

More precision means a larger memory footprint, more data movement, and more energy used in computation

SLIDE 4

Precision Tuning

More precision means a larger memory footprint, more data movement, and more energy used in computation

[Figure: precision scale with labels double, float]

SLIDE 5

Precision Tuning

More precision means a larger memory footprint, more data movement, and more energy used in computation

[Figure: precision scale with labels n, double, float, 1]

SLIDE 6

AXE Precision Tuning Framework

Goal: Maximize Bit-Savings given a Quality Target

SLIDE 7

AXE Precision Tuning Framework

Inputs: kernel.c, quality target
AXE framework
Outputs: instruction-level precision requirements, quality & bit-savings

Built on top of ACCEPT, the approximate C/C++ compiler: http://accept.rocks

SLIDE 8

AXE Precision Tuning Framework

Default (no bit-savings)

[Figure: precision of instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, with indicators for quality (bad to OK) and bit-savings]

SLIDE 9

AXE Precision Tuning Framework

Coarse-Grained Precision Reduction

[Figure: precision of instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, with indicators for quality (bad to OK) and bit-savings]

SLIDE 10

AXE Precision Tuning Framework

Fine-Grained Precision Reduction

[Figure: precision of instruction 0, instruction 1, instruction 2, …, instruction n-1, instruction n, with indicators for quality (bad to OK) and bit-savings]

SLIDE 11

PERFECT Benchmark Suite

Application Domain: Kernels
PERFECT Application 1: Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
Space Time Adaptive Processing: Outer Product, System Solve, Inner Product
Synthetic Aperture Radar: Interpolation 1, Interpolation 2, Back Projection
Wide Area Motion Imaging: Debayer, Image Registration, Change Detection
Required Kernels: FFT 1D, FFT 2D

Metric: Signal to Noise Ratio (SNR) [120dB to 10dB] (0.0001% to 31.6% MSE)
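The SNR range and the error percentages quoted with the metric can be related numerically. The sketch below assumes the percentages follow 100 × 10^(−SNR/20), which reproduces the slide's endpoints (120dB and 0.0001%, 10dB and 31.6%); under the SNR formula given in the backup slides, that quantity is a relative root-mean-square error. The function names are illustrative and not part of AXE or ACCEPT.

```python
import math


def snr_db_to_error_percent(snr_db: float) -> float:
    """Convert an SNR in dB to a relative error in percent.

    Assumption: error = 10^(-SNR/20), chosen because it matches the
    endpoints quoted on the slide.
    """
    return 100.0 * 10.0 ** (-snr_db / 20.0)


def error_percent_to_snr_db(error_percent: float) -> float:
    """Inverse conversion: relative error in percent to SNR in dB."""
    if error_percent <= 0.0:
        raise ValueError("error_percent must be positive")
    return -20.0 * math.log10(error_percent / 100.0)


if __name__ == "__main__":
    # Print the error percentage for each SNR target used in the deck.
    for snr in (120, 100, 80, 60, 40, 20, 10):
        print(f"{snr:>3} dB -> {snr_db_to_error_percent(snr):.4g}%")
```

Under the same assumption, the 0.001% and 10% annotations on the bit-savings chart correspond to 100dB and 20dB.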

SLIDE 12

1 - PERFECT Dynamic Instruction Mix

Safe to approximate: load/store 27%, int arith 4%, fp arith 31%, math 1%
Precise: int arith 25%, control 11%
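As a rough tally of the instruction mix, the shares can be summed per group. This is a sketch: the assignment of each slice to "safe to approximate" or "precise" is a reading of the slide's legend and of the three slides that follow, and the names `MIX` and `share` are illustrative.

```python
# Dynamic instruction mix from the slide, in percent of executed instructions.
# The group assigned to each slice is an interpretation of the slide's legend.
MIX = [
    ("load/store", "approx", 27),
    ("int arith", "approx", 4),
    ("fp arith", "approx", 31),
    ("math", "approx", 1),
    ("int arith", "precise", 25),
    ("control", "precise", 11),
]


def share(group: str) -> int:
    """Sum the percentages of all slices that belong to `group`."""
    return sum(pct for _, g, pct in MIX if g == group)


approx_share = share("approx")
precise_share = share("precise")

if __name__ == "__main__":
    print(f"safe to approximate: {approx_share}%")
    print(f"precise: {precise_share}%")
```

The slices add up to 99% rather than 100%, presumably because the slide rounds each value.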

SLIDE 13

1 - PERFECT Dynamic Instruction Mix

Long-latency ops are all safe to approximate

Safe to approximate: fp arith 31%, math 1%

SLIDE 14

1 - PERFECT Dynamic Instruction Mix

Memory ops are mostly safe to approximate (they mostly move data rather than pointers)

Safe to approximate: load/store 27%

SLIDE 15

1 - PERFECT Dynamic Instruction Mix

Control and address computation must remain precise

Precise: int arith 25%, control 11%

SLIDE 16

2 - Bit-Savings over Approximate Instructions

[Figure: bit-savings (0% to 100%) versus average SNR (10, 20, 40, 60, 80, 100, 120 dB), ranging from High Quality to Approximate; data labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%]

SLIDE 17

2 - Bit-Savings over Approximate Instructions

[Figure: bit-savings (0% to 100%) versus average SNR (10, 20, 40, 60, 80, 100, 120 dB); data labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%]

Annotation: PERFECT Manual, 0.001% MSE

SLIDE 18

2 - Bit-Savings over Approximate Instructions

[Figure: bit-savings (0% to 100%) versus average SNR (10, 20, 40, 60, 80, 100, 120 dB); data labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%]

Annotations: PERFECT Manual, 0.001% MSE; Approximate Computing, 10% MSE

SLIDE 19

Future Architectural Challenges

Mechanisms to translate bit-savings into energy savings?
New data types/representations?
ISA extensions?

SLIDE 20

Thank You!

Thierry Moreau, Luis Ceze, Adrian Sampson

{moreau, luisceze}@cs.washington.edu, asampson@cornell.edu

WAX 2016 co-located with ASPLOS 2016 April 3rd 2016

Approximating to the Last Bit

SLIDE 21

Backup Slides

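SLIDE 22

Bit Savings

Explore the opportunity for precision reduction in a hardware-agnostic way

$$\mathrm{BitSavings} = \sum_{\mathrm{insn}_{\mathrm{static}}} \frac{\mathrm{precision}_{\mathrm{ref}} - \mathrm{precision}_{\mathrm{approx}}}{\mathrm{precision}_{\mathrm{ref}}} \times \frac{\mathrm{execs}}{\mathrm{execs}_{\mathrm{total}}}$$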
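The bit-savings metric is a sum over static instructions of the fraction of bits dropped, weighted by how often each instruction executes. A minimal sketch follows, assuming execs_total is the total execution count of the instructions in the sum; the `InsnProfile` type and its field names are hypothetical and not taken from AXE.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class InsnProfile:
    """Profile of one static instruction (hypothetical structure)."""
    precision_ref: int     # bits used by the reference (precise) version
    precision_approx: int  # bits kept by the approximate version
    execs: int             # dynamic execution count


def bit_savings(profiles: Iterable[InsnProfile]) -> float:
    """Execution-weighted fraction of bits saved, between 0.0 and 1.0."""
    profiles = list(profiles)
    execs_total = sum(p.execs for p in profiles)
    if execs_total == 0:
        return 0.0
    return sum(
        (p.precision_ref - p.precision_approx) / p.precision_ref
        * p.execs / execs_total
        for p in profiles
    )


if __name__ == "__main__":
    example = [
        InsnProfile(precision_ref=32, precision_approx=16, execs=300),
        InsnProfile(precision_ref=32, precision_approx=32, execs=100),
    ]
    print(bit_savings(example))
```

In the example, one instruction drops half of its bits and accounts for three quarters of the executions, while the other stays precise, so the result is 0.5 × 0.75 = 0.375.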

SLIDE 23

Framework Overview

Built on top of ACCEPT, the approximate C/C++ compiler: http://accept.rocks

[Diagram: framework pipeline]
Inputs: Annotated Program; Program Inputs & Quality Metrics
Stages: ACCEPT static analysis; ILPC*; ACCEPT error injection & instrumentation; Approximate Binary; Execution & Quality Assessment; Quality Autotuner
Outputs: Output Configuration; Quality Results & Bit Savings

* Instruction-level Precision Configuration

SLIDE 24

Program Annotation

void conv2d (pix *in, pix *out, flt *filter) {
  for (row) {
    for (col) {
      flt sum = 0
      int dstPos = …
      for (row_offset) {
        for (col_offset) {
          int srcPos = …
          int fltPos = …
          sum += in[srcPos] * filter[fltPos]
        }
      }
      out[dstPos] = sum / normFactor
    }
  }
}

SLIDE 25

Program Annotation

Key: use the APPROX type qualifier

void conv2d (APPROX pix *in, APPROX pix *out, APPROX flt *filter) {
  for (row) {
    for (col) {
      APPROX flt sum = 0
      int dstPos = …
      for (row_offset) {
        for (col_offset) {
          int srcPos = …
          int fltPos = …
          sum += in[srcPos] * filter[fltPos]
        }
      }
      out[dstPos] = sum / normFactor
    }
  }
}

SLIDE 26

Program Annotation

Takeaways:

Annotating data is intuitive (~10 mins to annotate a kernel)
Variables used to index arrays cannot be safely approximated

Tip for annotating programs faster: put the APPROX qualifier in the typedefs

Before:
  typedef float flt
  typedef int pix
After:
  typedef APPROX float flt
  typedef APPROX int pix

SLIDE 27

Static Analysis

ACCEPT identifies safe-to-approximate instructions from data annotations using flow analysis

Input: the APPROX-annotated conv2d kernel from the Program Annotation slide

Instruction-Level Precision Configuration (ILPC):
  conv2d:13:7:load:Int32
  conv2d:13:10:load:Float
  conv2d:13:11:fmul:Float
  conv2d:13:12:fadd:Float
  conv2d:15:1:fdiv:Float
  conv2d:15:7:store:Int32
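The ILPC entries shown on the slide are colon-separated records, which makes them easy to load into a tuning script. The sketch below parses them under an assumed field layout (function, source line, position within the line, opcode, data type); the slide does not name the fields, so `IlpcEntry` and its field names are guesses made for illustration.

```python
from typing import List, NamedTuple


class IlpcEntry(NamedTuple):
    """One ILPC record. Field meanings are assumed, not documented on the slide."""
    function: str  # enclosing function, e.g. "conv2d"
    line: int      # assumed: source line number
    index: int     # assumed: position of the instruction on that line
    opcode: str    # e.g. "load", "fmul", "store"
    dtype: str     # e.g. "Int32", "Float"


def parse_ilpc_entry(text: str) -> IlpcEntry:
    """Parse a single 'function:line:index:opcode:dtype' record."""
    parts = text.strip().split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields, got {text!r}")
    function, line, index, opcode, dtype = parts
    return IlpcEntry(function, int(line), int(index), opcode, dtype)


def parse_ilpc(text: str) -> List[IlpcEntry]:
    """Parse whitespace-separated ILPC records."""
    return [parse_ilpc_entry(token) for token in text.split()]


# The six records shown on the slide.
SLIDE_ILPC = """
conv2d:13:7:load:Int32
conv2d:13:10:load:Float
conv2d:13:11:fmul:Float
conv2d:13:12:fadd:Float
conv2d:15:1:fdiv:Float
conv2d:15:7:store:Int32
"""

ENTRIES = parse_ilpc(SLIDE_ILPC)

if __name__ == "__main__":
    for entry in ENTRIES:
        print(entry)
```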

SLIDE 28

Approximate Binary

Error Injection

Instruction-Level Precision Configuration (ILPC):
  conv2d:13:7:load:Int32
  conv2d:13:10:load:Float
  conv2d:13:11:fmul:Float
  conv2d:13:12:fadd:Float
  conv2d:15:1:fdiv:Float
  conv2d:15:7:store:Int32

Each instruction in the ILPC acts as a quality knob that the autotuner can use to maximize bit-savings
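The slides do not say how reduced precision is emulated during error injection. One common way to model it, shown here purely as an assumption, is to zero the low-order mantissa bits of an IEEE 754 single-precision value. The function `truncate_float32` is illustrative and is not ACCEPT's implementation.

```python
import struct


def truncate_float32(value: float, mantissa_bits: int) -> float:
    """Keep only the top `mantissa_bits` of the 23-bit float32 mantissa.

    Assumption: precision reduction is modeled as truncation toward zero
    of the mantissa; sign and exponent are left untouched.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("mantissa_bits must be between 0 and 23")
    # Reinterpret the float32 encoding of `value` as a 32-bit unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    dropped = 23 - mantissa_bits
    mask = (0xFFFFFFFF << dropped) & 0xFFFFFFFF
    # Clear the dropped bits and reinterpret the result as a float32.
    (result,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return result


if __name__ == "__main__":
    for kept in (23, 8, 4, 1, 0):
        print(kept, truncate_float32(3.14159, kept))
```

With this model, each ILPC entry would carry its own `mantissa_bits` setting, which is the knob the autotuner turns.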

SLIDE 29

Quality Assessment

The programmer provides a quality assessment script to evaluate the quality of the program output

[Diagram: the outputs of the Reference Binary and the Approximate Binary are compared by eval.py, which reports a quality score, e.g. 10dB SNR]

SLIDE 30

Autotuner

Greedy iterative algorithm: reduces the precision requirement of the instruction that impacts quality the least

Finds a solution in O(m^2 n) in the worst case, where m is the number of static safe-to-approximate instructions and n is the number of precision levels per instruction

config k: error = 0.10%
config [k+1, i-1]: error = 5.91%
config [k+1, i]: error = 0.30%
config [k+1, i+1]: error = 0.12%
config [k+2, i-1]: error = 5.91%
config [k+2, i]: error = 0.33%
config [k+2, i+1]: error = 1.6%
…
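The greedy loop described on the slide can be sketched as follows. At each step every instruction is tried one precision level lower, the candidate with the smallest error is kept, and the search stops once even the best candidate exceeds the error budget. With m instructions and n levels there are at most m·n steps of m trials each, which is consistent with the O(m^2 n) bound. The function `greedy_tune`, its parameters, and the toy error model are assumptions made for illustration; in the real framework the evaluation would run the approximate binary and the quality script.

```python
from typing import Callable, Dict, Iterable

Config = Dict[str, int]  # instruction name -> precision in bits


def greedy_tune(
    insns: Iterable[str],
    max_bits: int,
    evaluate_error: Callable[[Config], float],
    error_budget: float,
    min_bits: int = 1,
) -> Config:
    """Greedily lower per-instruction precision while error stays in budget."""
    insns = list(insns)
    config: Config = {insn: max_bits for insn in insns}
    while True:
        best_insn, best_error = None, None
        # Try lowering each instruction by one level and measure the error.
        for insn in insns:
            if config[insn] <= min_bits:
                continue
            trial = dict(config)
            trial[insn] -= 1
            error = evaluate_error(trial)
            if best_error is None or error < best_error:
                best_insn, best_error = insn, error
        # Stop when nothing can be lowered or the best candidate is too lossy.
        if best_insn is None or best_error > error_budget:
            return config
        config[best_insn] -= 1


# Toy error model (invented): each dropped bit costs a fixed amount of error.
SENSITIVITY = {"fmul": 0.5, "fadd": 1.0, "load": 4.0}


def toy_error(config: Config) -> float:
    return sum(SENSITIVITY[i] * (8 - bits) for i, bits in config.items())


TUNED = greedy_tune(SENSITIVITY, max_bits=8, evaluate_error=toy_error, error_budget=10.0)

if __name__ == "__main__":
    print(TUNED, toy_error(TUNED))
```

In the toy run the least sensitive instruction is reduced first, then the next one, and the most sensitive instruction keeps full precision because lowering it would exceed the budget.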

SLIDE 31

Autotuner

The autotuner greedily maximizes bit-savings as the quality target is lowered

[Figure: results for quality targets: precise, 60dB, 40dB, 20dB, 10dB]

SLIDE 32

Precision “Guarantees”

Currently empirically derived and input dependent

Future work would extend the current infrastructure to incorporate data-dependence information in order to derive formal error guarantees

SLIDE 33

1 - PERFECT Dynamic Instruction Mix

[Figure: stacked bars (0% to 100%) showing the dynamic instruction mix of each kernel]
Kernels: 2d-conv, dwt, hist-eq, outer, system, inner, interp1, interp2, bp, debayer, lucas-kanade, change-det, fft-1d, fft-2d, AVERAGE
Categories: prec_control, prec_int_arith, prec_mem, appr_math, appr_fp_arith, appr_int_arith, appr_mem

SLIDE 34

PERFECT Benchmark Suite

Application Domain: Kernels
PERFECT Application 1: Discrete Wavelet Transform, 2D Convolution, Histogram Equalization
Space Time Adaptive Processing: Outer Product, System Solve, Inner Product
Synthetic Aperture Radar: Interpolation 1, Interpolation 2, Back Projection
Wide Area Motion Imaging: Debayer, Image Registration, Change Detection
Required Kernels: FFT 1D, FFT 2D

Metric: SNR [120dB to 10dB]

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_{k=1}^{N} |r_k|^2}{\sum_{k=1}^{N} |r_k - a_k|^2} \right)$$

N: number of output elements
r_k: reference value of element k
a_k: approximate value of element k
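The SNR metric can be computed directly from the reference and approximate outputs. The sketch below follows the formula on this slide; the function name `snr_db` and the handling of the degenerate cases (identical outputs, all-zero reference) are choices made for illustration.

```python
import math
from typing import Sequence


def snr_db(reference: Sequence[complex], approx: Sequence[complex]) -> float:
    """SNR in dB: 10 * log10(sum |r_k|^2 / sum |r_k - a_k|^2)."""
    if len(reference) != len(approx):
        raise ValueError("reference and approx must have the same length")
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - a) ** 2 for r, a in zip(reference, approx))
    if noise == 0:
        return math.inf   # outputs are identical
    if signal == 0:
        return -math.inf  # all-zero reference with a nonzero error
    return 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    # Signal energy 25, noise energy 0.25, so the ratio is 100.
    print(snr_db([3.0, 4.0], [3.0, 4.5]))
```

Using `abs` keeps the sketch valid for complex outputs such as those of the FFT kernels.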

SLIDE 35

2 - Bit-Savings over Approximate Instructions

[Figure: aggregate bit savings (0% to 100%) per kernel]
Kernels: 2d-conv, dwt, hist-eq, outer, system-solve, inner, interp1, interp2, bp, debayer, lucas-kanade, change-det, fft-1d, fft-2d, AVERAGE
Legend: 10, 20, 40, 60, 80, 100, 120

SLIDE 36

2 - Bit-Savings over Approximate Instructions

You don’t need a lot of bits to obtain an acceptable output!

[Figure: bit-savings (0% to 100%) versus average SNR (10, 20, 40, 60, 80, 100, 120 dB); series: int arith, fp arith, mem ops, math, AGGREGATE; data labels: 26%, 32%, 40%, 48%, 57%, 74%, 83%]

SLIDE 37

Architectural Target

Core Energy Breakdown

[Figure: core energy split between overheads and compute for a General Purpose CPU, a Vector Processor*, and Accelerators, arranged along a specialization axis]

The smaller the overheads, the larger the potential gains

* [Quora, Venkataramani et al., MICRO2013]

SLIDE 38

Precision Scaling

Mechanisms for precision scalability:

  • Fine-grained ALU power gating*
  • Bit-sliced ALU units
  • Lossy Compression

[Figure: how do Bit-Savings translate into Energy Savings?]

* [Quora, Venkataramani et al., MICRO2013]