Architecture Support for Disciplined Approximate Programming
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze (University of Washington); Doug Burger (Microsoft Research)
ASPLOS 2012


SLIDE 1

Architecture Support for Disciplined Approximate Programming

Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze (University of Washington)
Doug Burger (Microsoft Research)

ASPLOS 2012

SLIDE 2

mobile devices: battery usage
data centers: power & cooling costs
dark silicon: utilization wall

SLIDE 3

Disciplined approximate programming

Precise (✗ no errors allowed): references, jump targets, JPEG header
Approximate (✓ errors tolerable): pixel data, neuron weights, audio samples, video frames

The EnerJ programming language: safely interleave approximate and precise operations

SLIDE 4

SLIDE 5

[Figure: trading energy for errors]

SLIDE 6

Perfect correctness is not required

information retrieval, machine learning, sensory data, scientific computing, physical simulation, games, augmented reality, computer vision

SLIDE 7

@Approx float[] nums;
⋮
@Approx float total = 0.0f;
for (@Precise int i = 0; i < nums.length; ++i)
  total += nums[i];
return total / nums.length;

Disciplined approximate programming The EnerJ programming language

SLIDE 8

(EnerJ example code from slide 7)

Disciplined approximate programming The EnerJ programming language

approximate data storage

SLIDE 9

(EnerJ example code from slide 7)

Disciplined approximate programming The EnerJ programming language

approximate operations
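The annotated fragment above can be imitated in plain Java. A minimal sketch, assuming a hypothetical noisy() helper that stands in for the errors approximate storage and operations may introduce (the @Approx/@Precise qualifiers themselves are enforced by EnerJ's type checker, not by Java):

```java
import java.util.Random;

public class ApproxMean {
    static final Random rng = new Random(42);

    // Hypothetical error model: occasionally perturb a value slightly,
    // the way an approximate register or ALU might.
    static float noisy(float x) {
        return rng.nextInt(1000) == 0 ? x * 1.01f : x;
    }

    // Plain-Java analogue of the @Approx mean loop from the slide.
    static float mean(float[] nums) {
        float total = 0.0f;                   // @Approx in EnerJ
        for (int i = 0; i < nums.length; ++i) // loop index stays @Precise
            total = noisy(total + nums[i]);   // approximate add and store
        return total / nums.length;
    }

    public static void main(String[] args) {
        System.out.println(mean(new float[] {1f, 2f, 3f, 4f})); // near 2.5
    }
}
```

The point of the sketch: only the data marked approximate carries error; the control flow (the loop index) remains exact.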

SLIDE 10

Hardware support for disciplined approximate programming

[Diagram: EnerJ code → Compiler → Truffle Core]

(EnerJ example code from slide 7)

SLIDE 11

Hardware support for disciplined approximate programming

[Diagram: EnerJ code → Compiler → Truffle Core]

Compiler-directed approximation
Simplifies the hardware implementation
Safety checks at compile time
No expensive checks at run time

SLIDE 12

Approximation-aware ISA
Dual-voltage microarchitecture
Energy savings results

Hardware support for disciplined approximate programming

SLIDE 13

Approximation-aware ISA
Dual-voltage microarchitecture
Energy savings results

Hardware support for disciplined approximate programming

SLIDE 14

Approximation-aware languages need:

Approximate operations
Approximate data
Fine-grained interleaving

[Figure: ALU with operations + − × ÷ & |; storage: registers, caches, main memory; a finely interleaved instruction stream: ADD R1 R2 R3, MOV R3 R4, JMP 0x01234, STL R1 0xABCD, LDF R2 0xBCDE, …]

SLIDE 15

Approximation-aware languages need:

Approximate operations
Approximate data

[Figure: the same ALU and storage diagram]

per instruction (operations) / per cache line (data)

SLIDE 16

Traditional, precise semantics

ADD r1 r2 r3: writes the sum of r1 and r2 to r3

SLIDE 17

Approximate semantics

ADD r1 r2 r3: writes some value to r3 (not necessarily the exact sum of r1 and r2)

Informally: r3 gets something that approximates the sum r1 + r2. The actual error pattern depends on microarchitecture, voltage, process variation, …

SLIDE 18

Undefined behavior

ADD r1 r2 r3:

???

SLIDE 19

Approximate semantics

ADD r1 r2 r3: writes some value to r3

Informally: r3 gets something that approximates the sum r1 + r2. No other register is modified. Does not jump to an arbitrary address. No floating-point division exception is raised. No missiles are launched. ⋮
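The contract can be stated as code. A sketch in plain Java, where the bit-flip error model is invented for illustration; the only guarantee mirrored here is that nothing but the result is affected:

```java
import java.util.Random;

// Sketch of the ADD.a contract: the destination receives "some value" that
// approximates a + b, but no other state changes -- no trap, no jump, no
// other register. The specific error model below is hypothetical.
public class ApproxAdd {
    static final Random rng = new Random(7);

    static int addApprox(int a, int b) {
        int sum = a + b;
        if (rng.nextInt(100) == 0)       // rare low-voltage timing error
            sum ^= 1 << rng.nextInt(8);  // perturbs low-order result bits only
        return sum;                      // side-effect free: nothing else touched
    }

    public static void main(String[] args) {
        System.out.println(addApprox(40, 2)); // close to 42
    }
}
```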

SLIDE 20

An ISA extension with approximate semantics

Operations (ALU): ADD.a MUL.a CMPLE.a AND.a XNOR.a SRL.a ADDF.a DIVF.a …

Storage (registers, caches, main memory): LDL.a STL.a STF.a LDF.a …

SLIDE 21

Dual-voltage pipeline

[Figure: pipeline stages Fetch, Decode, Reg Read, Execute, Memory, Write Back; structures include the Branch Predictor, Instruction Cache, ITLB, Decoder, Register File, Integer FU, FP FU, Data Cache, and DTLB, divided into a data movement & processing plane and a control plane]

SLIDE 22

Dual-voltage pipeline

[Figure: dual-voltage structures highlighted — Register File, Integer FU, FP FU, Data Cache]

SLIDE 23

Dual-voltage pipeline

[Figure: Register File — switch (dynamic); Integer FU, FP FU — replicate (dynamic); Data Cache — switch (static)]

SLIDE 24

Dual-voltage functional units: shadow structures

[Figure: Execute stage with shadow functional units; operands feed both units, one structure is active at a time, and the active one drives the result]

SLIDE 25

Dual-voltage functional units: shadow structures

Issue width unchanged (the scheduler is unaware of shadowing)
The inactive unit is power-gated
No voltage-change latency
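Functionally, the shadow arrangement is a mux on the instruction's .a bit. A sketch under that assumption; the low-voltage unit's error behavior (a stuck low bit) is purely illustrative:

```java
// Sketch of shadow functional units: a precise adder and a low-voltage
// "shadow" adder share one slot in the execute stage. The instruction's
// .a bit selects which unit is active, so the scheduler never sees two
// units and no voltage transition is needed. Names are illustrative.
public class ShadowFu {
    interface Adder { int add(int a, int b); }

    static final Adder precise = (a, b) -> a + b;       // VDDH unit
    static final Adder shadow  = (a, b) -> (a + b) | 1; // VDDL unit: toy
                                                        // error, low bit stuck

    // Mux on the .a bit; the inactive unit would be power-gated.
    static int execute(int a, int b, boolean approxBit) {
        return (approxBit ? shadow : precise).add(a, b);
    }

    public static void main(String[] args) {
        System.out.println(execute(2, 2, false)); // exact: 4
        System.out.println(execute(2, 2, true));  // approximate result
    }
}
```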

SLIDE 26

Approximate storage: register modes

[Figure: register file r1–r8; r4 is in approximate mode]

precise mode / approximate mode
Reads from registers in approximate mode may return any value.

SLIDE 27

Approximate storage: register modes

[Figure: register file r1–r8]

⋮
ADD r1 r2 r3

SLIDE 28

Approximate storage: register modes

[Figure: register file r1–r8; r3 switches to approximate mode]

⋮
ADD.a r1 r2 r3

The destination register’s mode is set to match the writing instruction.

SLIDE 29

Approximate storage: register modes

[Figure: register file r1–r8; r3 approximate, r4 precise]

ADD r2 r3.a r4

Register operands must be marked with the register’s mode. (Otherwise, read garbage.)
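The register-mode rules on the last few slides amount to a small state machine. A sketch, with a fixed constant standing in for "may return any value":

```java
// Sketch of Truffle's per-register precision modes: each write sets the
// destination's mode to match the writing instruction, and a read whose
// marked mode disagrees with the register's mode yields unspecified data.
// Names and the garbage constant are illustrative, not the real design.
public class RegModes {
    final long[] regs = new long[8];
    final boolean[] approx = new boolean[8]; // one mode bit per register

    // A write sets the destination's mode to match the instruction.
    void write(int r, long value, boolean approxWrite) {
        regs[r] = value;
        approx[r] = approxWrite;
    }

    // An operand marked with the wrong mode reads garbage.
    long read(int r, boolean approxRead) {
        if (approx[r] != approxRead)
            return 0xDEADBEEFL; // stand-in for "any value"
        return regs[r];
    }

    public static void main(String[] args) {
        RegModes rf = new RegModes();
        rf.write(3, 42L, true);                // e.g. ADD.a ... r3
        System.out.println(rf.read(3, true));  // r3.a: reads 42
        System.out.println(rf.read(3, false)); // unmarked r3: garbage
    }
}
```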

SLIDE 30

Registers and caches: dual-voltage SRAMs

[Figure: DV-SRAM subarray. A precision column stores one bit per row; row selection drives data (read), data (write), and precision signals; the precision bit selects VDDH or VDDL per access (for sense amplifiers and precharge).]

SLIDE 31

Registers and caches: dual-voltage SRAMs

Mixture of precise and approximate data
The instruction stream gives the access precision levels (compiler-specified)

SLIDE 32

Approximate storage: caches

[Figure: register file r1–r8 and the cache; an approximate load fills a cache line at low precision]

⋮
LDL.a 0x…

Data enters the cache with the precision of the access.
Compiler: consistently treat data as approximate or precise. (Otherwise, read garbage.)

SLIDE 33

Also in the paper:
Approximate main memory
Detailed DV-SRAM design
Voltage level-shifter and mux circuits
Replicated pipeline registers
Broadcast network details

[Level-shifter and mux schematics with VddH/VddL signal annotations omitted]
SLIDE 34

Approximation-aware ISA
Dual-voltage microarchitecture
Energy savings results

Hardware support for disciplined approximate programming

SLIDE 35

Energy savings results

Simulated EnerJ programs: precision-annotated Java [PLDI’11]
  Scientific kernels, mobile app, game engine, imaging, raytracer
Modified McPAT models for OoO (Alpha 21264) and in-order cores [Li, Ahn, Strong, Brockman, Tullsen, Jouppi; MICRO’09]
  65 nm process, 1666 MHz, 1.5 V nominal (VDDH)
  4-wide (OoO) and 2-wide (in-order)
  Includes overhead of additional muxing, shadow FUs, etc.
Extended CACTI for DV-SRAM structures [Muralimanohar, Balasubramonian, and Jouppi; MICRO’07]
  64 KB (OoO) and 32 KB (in-order) L1 cache
  Line size: 16 bytes
  Includes precision column overhead

SLIDE 36

Energy savings on in-order core

7–24% energy saved on average
Raytracer saves 14–43% energy

[Chart: energy reduction over a non-Truffle baseline (−10% to 50%) for fft, imagefill, jmeint, lu, mc, raytracer, smm, sor, zxing, and the average, at VDDL = 0.75 V, 0.94 V, 1.13 V, and 1.31 V]

SLIDE 37

Energy savings on OoO core

Energy savings up to 17%
Efficiency loss up to 5% in the worst case

[Chart: energy reduction over a non-Truffle baseline (−10% to 50%) for the same benchmarks at VDDL = 0.75 V, 0.94 V, 1.13 V, and 1.31 V]

SLIDE 38

Application accuracy trade-off

[Chart: output quality-of-service loss (0–100%) across error rates 10⁻⁸–10⁻² for fft, imagefill, jmeint, lu, mc, raytracer, smm, sor, zxing]

Application-specific output quality metrics
Error resilience varies across applications

SLIDE 39

Hardware support for disciplined approximate programming

[Diagram: EnerJ code → Compiler → Truffle Core]

int p = 5;
@Approx int a = 7;
for (int x = 0..) {
  a += func(2);
  @Approx int z;
  z = p * 2;
  p += 4;
}
a /= 9;
func2(p);
a += func(2);
@Approx int y;
z = p * 22 + z;
p += 10;

[Code colored by operating voltage: VDDH (precise) vs. VDDL (approximate)]

SLIDE 40

Hardware support for disciplined approximate programming

Approximation-aware ISA: tightly coupled with language-level precision information

Dual-voltage microarchitecture: the data plane can run at a lower voltage; low-complexity design relying on compiler support

Significant energy savings: up to 43% vs. a baseline in-order core

SLIDE 41

Future work on disciplined approximate programming

Approximate accelerators
Precision-aware programmer tools
Non-voltage approximation techniques

SLIDE 42