Architecture Support for Disciplined Approximate Programming
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze (University of Washington); Doug Burger (Microsoft Research)
ASPLOS 2012
mobile devices: battery usage
data centers: power & cooling costs
dark silicon: the utilization wall
Disciplined approximate programming: the EnerJ programming language
✓ OK to approximate: pixel data, neuron weights, audio samples, video frames
✗ Must stay precise: references, jump targets, JPEG headers
Safely interleave approximate and precise operations.
(Figure: trading errors for energy savings.)
Perfect correctness is not required: computer vision, machine learning, augmented reality, sensory data, games, scientific computing, information retrieval, physical simulation.
Disciplined approximate programming: the EnerJ programming language

    @Approx float[] nums;
    ⋮
    @Approx float total = 0.0f;
    for (@Precise int i = 0; i < nums.length; ++i)
        total += nums[i];
    return total / nums.length;

@Approx marks approximate data storage (nums, total); the operations on those approximate operands, such as the += and the final division, become approximate operations.
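The mean kernel above can be modeled in plain Java (hypothetical class and method names; EnerJ's annotations appear as comments). Approximate storage is simulated here by clearing the lowest mantissa bit of each loaded float, a deterministic stand-in for low-voltage storage errors; the real error pattern is hardware-dependent.

```java
// Hypothetical plain-Java model of the EnerJ mean kernel.
// @Approx/@Precise appear as comments; "approximate" loads are
// simulated by clearing the float's lowest mantissa bit.
public class ApproxMean {
    // Stand-in for an approximate load: drop the lowest mantissa bit.
    static float approxLoad(float v) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(v) & ~1);
    }

    static float mean(float[] nums) {          // @Approx float[] nums
        float total = 0.0f;                    // @Approx float total
        for (int i = 0; i < nums.length; ++i)  // @Precise int i
            total += approxLoad(nums[i]);      // approximate add
        return total / nums.length;
    }

    public static void main(String[] args) {
        // These values have zero low mantissa bits, so they pass through unchanged.
        System.out.println(mean(new float[]{1f, 2f, 3f, 4f}));  // 2.5
    }
}
```

Note that only the loop counter and the bound check stay precise, so control flow is never corrupted even when the data is.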
Hardware support for disciplined approximate programming: EnerJ code (the mean kernel above) is compiled by the Truffle compiler and runs on a Truffle core.
Hardware support for disciplined approximate programming
Compiler-directed approximation: safety checks happen at compile time (Truffle compiler), so no expensive checks are needed at run time, which simplifies the hardware implementation (Truffle core).
Hardware support for disciplined approximate programming Approximation-aware ISA Dual-voltage microarchitecture Energy savings results
Approximation-aware languages need:
- approximate operations (ALU: + − × ÷ & | …)
- approximate data (registers, caches, main memory)
- fine-grained interleaving of approximate and precise instructions (e.g., ADD R1 R2 R3 / MOV R3 R4 / JMP 0x01234 / STL R1 0xABCD / LDF R2 0xBCDE)
Approximation-aware languages need:
- approximate operations (ALU), selected per instruction
- approximate data (registers, caches, main memory), marked per cache line
Traditional, precise semantics — ADD r1 r2 r3: writes the sum of r1 and r2 to r3.
Approximate semantics — ADD r1 r2 r3: writes some value to r3. Informally: r3 gets something that approximates the sum r1 + r2. The actual error pattern depends on microarchitecture, voltage, process variation, …
Undefined behavior — ADD r1 r2 r3: ???
Approximate semantics — ADD r1 r2 r3: writes some value to r3. Informally: r3 gets something that approximates the sum r1 + r2. Guarantees:
- No other register is modified.
- No floating-point division exception is raised.
- Control does not jump to an arbitrary address.
- No missiles are launched.
⋮
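These guarantees amount to a contract: an approximate ADD may return a wrong value, but it never traps and never touches any state other than its destination. A minimal sketch (hypothetical names; the error is made deterministic purely for illustration):

```java
// Sketch of ADD.a's contract: the result may be approximate, but no
// exception is raised and nothing outside the destination changes.
// The "error" here is deterministic for illustration: flip the LSB.
public class ApproxAdd {
    static int addApprox(int a, int b) {
        int exact = a + b;
        return exact ^ 1;  // off by at most 1, never traps
    }

    public static void main(String[] args) {
        System.out.println(addApprox(2, 3));  // 4 (exact sum is 5)
    }
}
```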
An ISA extension with approximate semantics
Operations (ALU): ADD.a, MUL.a, CMPLE.a, AND.a, XNOR.a, SRL.a, ADDF.a, DIVF.a, …
Storage (registers, caches, main memory): LDL.a, LDF.a, STL.a, STF.a, …
Dual-voltage pipeline
Stages: Fetch, Decode, Reg Read, Execute, Memory, Write Back
Data movement & control plane: branch predictor, instruction cache, decoder, ITLB, DTLB
Processing plane: register file, integer FU, FP FU, data cache
Dual-voltage pipeline: each processing-plane structure gets dual-voltage support
- integer FU and FP FU: replicated (static)
- register file: voltage switched (dynamic)
- data cache: voltage switched (dynamic)
Dual-voltage functional units: shadow structures. The operands feed both copies of the unit in the Execute stage; only one structure is active at a time, and it produces the result.
Dual-voltage functional units: shadow structures
- Issue width is not changed (the scheduler is unaware of shadowing).
- The inactive unit is power-gated.
- No voltage-change latency.
Approximate storage: register modes
Each register is in either precise mode or approximate mode. Reads from registers in approximate mode may return any value.
- ADD r1 r2 r3 — a precise write puts the destination r3 in precise mode.
- ADD.a r1 r2 r3 — the destination register's mode is set to match the writing instruction, so r3 becomes approximate.
- ADD r2 r3.a r4 — register operands must be marked with the register's current mode. (Otherwise, the read returns garbage.)
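The per-register mode bit can be sketched as a small software model (hypothetical class and names): writes set the mode from the writing instruction, and a mode-mismatched read models the "read garbage" case.

```java
// Sketch of register modes: each register carries a mode bit set by
// the last writing instruction (.a => approximate). Reading with the
// wrong operand mark models the "read garbage" case.
public class RegFileModel {
    private final long[] regs = new long[8];
    private final boolean[] approxMode = new boolean[8];

    // ADD r1 r2 r3 vs. ADD.a r1 r2 r3: destination mode follows the instruction.
    void write(int r, long value, boolean approxInsn) {
        regs[r] = value;
        approxMode[r] = approxInsn;
    }

    // Operand marks (r3 vs. r3.a) must match the register's current mode.
    long read(int r, boolean markedApprox) {
        if (approxMode[r] != markedApprox)
            return 0xDEADBEEFL;  // arbitrary stand-in for garbage
        return regs[r];
    }
}
```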
Registers and caches: dual-voltage SRAMs
(Figure: DV-SRAM subarray — each row has a precision-column bit that selects VddH or VddL for that row's cells and for the sense amplifiers and precharge.)
Registers and caches: dual-voltage SRAMs hold a mixture of precise and approximate data; the instruction stream gives the access precision levels (compiler-specified).
Approximate storage: caches
LDL.a 0x… r1 — data enters the cache with the precision of the access. The compiler must consistently treat each datum as approximate or precise. (Otherwise, reads return garbage.)
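The same discipline can be sketched per cache line (hypothetical class and names), with a precision bit carried by each line:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a cache whose lines carry a precision bit set by the
// filling access (LDL vs. LDL.a). Accessing a line with the other
// precision models the "read garbage" case.
public class DVCacheModel {
    private static final long GARBAGE = 0xDEADBEEFL;

    private static final class Line { long data; boolean approx; }
    private final Map<Long, Line> lines = new HashMap<>();

    void fill(long addr, long data, boolean approxAccess) {
        Line l = new Line();
        l.data = data;
        l.approx = approxAccess;  // line precision = access precision
        lines.put(addr, l);
    }

    long load(long addr, boolean approxAccess) {
        Line l = lines.get(addr);
        if (l == null || l.approx != approxAccess) return GARBAGE;
        return l.data;
    }
}
```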
Also in the paper:
- approximate main memory
- detailed DV-SRAM design
- voltage level-shifter and mux circuits (selecting between VddH and VddL by precision)
- replicated pipeline registers
- broadcast network details
Hardware support for disciplined approximate programming Approximation-aware ISA Dual-voltage microarchitecture Energy savings results
Energy savings results: methodology
- Simulated EnerJ programs: precision-annotated Java [PLDI'11]; scientific kernels, mobile app, game engine, imaging, raytracer.
- Modified McPAT models for OoO (Alpha 21264) and in-order cores [Li, Ahn, Strong, Brockman, Tullsen, Jouppi; MICRO'09]: 65 nm process, 1666 MHz, 1.5 V nominal (VddH); 4-wide (OoO) and 2-wide (in-order); includes overhead of additional muxing, shadow FUs, etc.
- Extended CACTI for DV-SRAM structures [Muralimanohar, Balasubramonian, and Jouppi; MICRO'07]: 64 KB (OoO) and 32 KB (in-order) L1 cache, 16-byte lines; includes precision-column overhead.
Energy savings on in-order core
(Chart: energy reduction over the non-Truffle baseline, −10% to 50%, for fft, imagefill, jmeint, lu, mc, raytracer, smm, sor, zxing, and the average, at VddL = 0.75 V, 0.94 V, 1.13 V, and 1.31 V.)
7–24% energy saved on average; raytracer saves 14–43% energy.
Energy savings on OoO core
(Chart: energy reduction over the non-Truffle baseline for the same nine benchmarks at VddL = 0.75–1.31 V.)
Energy savings up to 17%; efficiency loss up to 5% in the worst case.
Application accuracy trade-off
(Chart: output quality-of-service loss, 0–100%, vs. error probability, 10⁻⁸ to 10⁻², for each benchmark.)
Application-specific output quality metrics; error resilience varies across applications.
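Quality metrics are application-specific (the paper defines one per benchmark). Purely as a generic illustration (hypothetical class and metric choice), mean absolute relative error is a common quality-loss measure for numeric kernels:

```java
// Hypothetical example of an output quality-of-service metric:
// mean absolute relative error between precise and approximate outputs.
// Real metrics are application-specific (e.g., pixel difference for
// imaging, decode success for zxing).
public class QualityMetric {
    static double qosLoss(double[] precise, double[] approx) {
        double sum = 0.0;
        for (int i = 0; i < precise.length; i++)
            sum += Math.abs((approx[i] - precise[i]) / precise[i]);
        return sum / precise.length;
    }

    public static void main(String[] args) {
        System.out.println(qosLoss(new double[]{2, 4}, new double[]{2, 3}));  // 0.125
    }
}
```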
Hardware support for disciplined approximate programming: the Truffle compiler takes mixed EnerJ code —

    int p = 5;
    @Approx int a = 7;
    for (int x = 0..) {
        a += func(2);
        @Approx int z;
        z = p * 2;
        p += 4;
    }
    a /= 9;
    func2(p);

— and the core executes the precise statements at VddH while the approximate statements (a += func(2); @Approx int z; z = p * 2; …) execute at VddL, finely interleaved within the same loop.
Hardware support for disciplined approximate programming
- Approximation-aware ISA, tightly coupled with language-level precision information.
- Dual-voltage microarchitecture: the data plane can run at lower voltage; low-complexity design relying on compiler support.
- Significant energy savings: up to 43% vs. a baseline in-order core.
Future work on disciplined approximate programming: approximate accelerators, precision-aware programmer tools, non-voltage approximation techniques.