Architecture Support for Disciplined Approximate Programming



  1. Architecture Support for Disciplined Approximate Programming. Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze (University of Washington); Doug Burger (Microsoft Research). ASPLOS 2012.

  2. Why energy matters: mobile devices (battery usage), data centers (power and cooling costs), and dark silicon (the utilization wall).

  3. Disciplined approximate programming: the EnerJ programming language. Precise data: references, jump targets, JPEG headers. Approximate data: pixel data, neuron weights, audio samples, video frames. The language safely interleaves approximate and precise operations.

  4. [Figure: trading errors for energy savings.]

  5. Perfect correctness is not required for many workloads: computer vision, machine learning, augmented reality, sensory data, games, scientific computing, information retrieval, physical simulation.

  6. Disciplined approximate programming: the EnerJ programming language.
     @Approx float[] nums;
     ⋮
     @Approx float total = 0.0f;
     for (@Precise int i = 0; i < nums.length; ++i)
       total += nums[i];
     return total / nums.length;

  7. The same code, highlighting approximate data storage: the @Approx array and accumulator.

  8. The same code, highlighting approximate operations: the += accumulation and the final division.
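In plain Java (without the EnerJ checker), the mean computation on these slides can be sketched as follows; the comments mark which parts EnerJ would allow to be approximate. This is a hand-written sketch, not the authors' code.

```java
// Sketch of the EnerJ mean example from the slides, in plain Java.
// In EnerJ, @Approx data may be stored and computed approximately,
// while the @Precise loop counter must stay exact so control flow is safe.
public class EnerJMean {
    static float mean(float[] nums) {
        float total = 0.0f;                    // @Approx in EnerJ
        for (int i = 0; i < nums.length; ++i)  // @Precise: controls the loop
            total += nums[i];                  // approximate operation
        return total / nums.length;            // approximate operation
    }
}
```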

  9. Hardware support for disciplined approximate programming: EnerJ code flows through the Truffle compiler to the Truffle core.

  10. Compiler-directed approximation: safety checks happen at compile time, so no expensive checks are needed at run time, which simplifies the hardware implementation.

  11. Outline: the approximation-aware ISA; the dual-voltage microarchitecture; energy savings results.

  12. First topic: the approximation-aware ISA.

  13. Approximation-aware languages need three things: approximate operations (an ALU supporting approximate +, -, ×, ÷, &, |), approximate data (registers, caches, main memory), and fine-grained interleaving of approximate and precise instructions within one stream (ADD, MOV, JMP, STL, LDF, …).

  14. Approximate operations (the ALU) are selected per instruction; approximate data (registers, caches, main memory) is selected per cache line.

  15. Traditional, precise semantics. ADD r1 r2 r3: writes the sum of r1 and r2 to r3.

  16. Approximate semantics. ADD r1 r2 r3: writes some value that approximates the sum of r1 and r2 to r3. Informally: r3 gets something that approximates r1 + r2. The actual error pattern depends on the microarchitecture, voltage, process variation, …

  17. Undefined behavior ADD r1 r2 r3: ???

  18. Approximate semantics, continued. ADD r1 r2 r3: writes some value that approximates the sum of r1 and r2 to r3. No other register is modified. No floating-point division exception is raised. Control does not jump to an arbitrary address. No missiles are launched. ⋮
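One way to make these semantics concrete is a software model of an approximate adder: it may perturb low-order result bits, but it never raises an exception, never touches other state, and always returns some value. The error model below (random low-bit flips with probability errProb) is purely illustrative and is not the error pattern of any real Truffle core.

```java
import java.util.Random;

// Model of ADD with approximate semantics: the result approximates
// a + b; nothing else in the machine is affected.
public class ApproxAlu {
    // errProb is the chance of flipping each of the low 8 result bits.
    static int approxAdd(int a, int b, double errProb, Random rng) {
        int sum = a + b;                       // the ideal result
        for (int bit = 0; bit < 8; bit++)      // only low-order bits may err
            if (rng.nextDouble() < errProb)
                sum ^= (1 << bit);             // silent bit flip, no exception
        return sum;                            // always returns *some* value
    }
}
```

With errProb = 0 the adder is exact; with errProb = 1 every modeled bit flips, yet the call still returns normally, mirroring the "no undefined behavior" guarantee.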

  19. An ISA extension with approximate semantics. ALU operations: ADD.a, MUL.a, CMPLE.a, AND.a, XNOR.a, SRL.a, ADDF.a, DIVF.a, … Storage (registers, caches, main memory): LDL.a, LDF.a, STL.a, STF.a, …

  20. Dual-voltage pipeline. Stages: Fetch, Decode, Reg Read, Execute, Memory, Write Back. The data movement and control plane (branch predictor, instruction cache, decoder, ITLB, DTLB) stays precise; the processing plane (register file, integer FU, FP FU, data cache) may run approximately.

  21. The processing plane: integer FU, FP FU, register file, data cache.

  22. Making the processing plane dual-voltage: the register file and data cache switch voltage dynamically, while the integer and FP functional units are replicated statically (one copy per voltage).

  23. Dual-voltage functional units use shadow structures in the Execute stage: two copies share the operand and result paths, and only one structure is active at a time.

  24. With shadow structures, the issue width is unchanged (the scheduler is unaware of the shadowing), the inactive unit is power-gated, and there is no voltage-change latency.

  25. Approximate storage: register modes. Each register (r1, r2, …, r8) is in either precise mode or approximate mode. Reads from a register in approximate mode may return any value.

  26. A precise write (ADD r1 r2 r3) puts the destination register in precise mode.

  27. An approximate write (ADD.a r1 r2 r3) puts the destination in approximate mode: the destination register's mode is set to match the writing instruction.

  28. Register operands must be marked with the register's current mode (as in ADD r2 r3.a r4); otherwise, the read returns garbage.
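The mode rules on slides 25 to 28 can be modeled as a tiny register file where each register carries a mode bit: writes set the mode from the writing instruction, and a read whose mode annotation disagrees with the register's mode yields an unspecified value. This is a sketch; the garbage constant below is an arbitrary stand-in for "any value".

```java
// Model of per-register precision modes (slides 25-28).
public class RegFile {
    static final int NUM_REGS = 8;
    long[] value = new long[NUM_REGS];
    boolean[] approxMode = new boolean[NUM_REGS];

    // A write sets the register's mode to match the writing
    // instruction (ADD vs. ADD.a).
    void write(int r, long v, boolean approxInstr) {
        value[r] = v;
        approxMode[r] = approxInstr;
    }

    // Operands must be marked with the register's current mode;
    // otherwise the read returns garbage (an arbitrary value here).
    long read(int r, boolean markedApprox) {
        if (markedApprox != approxMode[r])
            return 0xDEADBEEFL; // stand-in for "any value"
        return value[r];
    }
}
```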

  29. Registers and caches are built from dual-voltage SRAMs: a per-row precision column selects VddH or VddL through the row-selection logic, and the precision bit also drives the sense amplifiers and precharge. [Figure: DV-SRAM subarray.]

  30. A DV-SRAM holds a mixture of precise and approximate data; the instruction stream gives the access level for each access (compiler-specified).

  31. Approximate storage: caches. A load such as LDL.a 0x… r1 brings data into the cache with the precision of the access. The compiler must consistently treat each datum as approximate or precise; otherwise, reads return garbage.
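Analogously to the register modes, each cache line can carry one precision bit set by the access that fills it; the compiler must keep every access to a line at one precision, and a mismatched access reads garbage. The toy model below illustrates this rule; the map-based cache, the garbage constant, and the direct-mapped-by-address layout are illustrative simplifications.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-cache-line precision (slide 31): data enters the
// cache with the precision of the access that brought it in.
public class ApproxCache {
    static final int LINE_BYTES = 16; // the line size evaluated in the paper
    static class Line { long data; boolean approx; }
    Map<Long, Line> lines = new HashMap<>();

    // LDL vs. LDL.a: a fill tags the whole line with the access's precision.
    void fill(long addr, long data, boolean approxAccess) {
        Line l = new Line();
        l.data = data;
        l.approx = approxAccess;
        lines.put(addr / LINE_BYTES, l);
    }

    // A later access at the wrong precision reads garbage.
    long load(long addr, boolean approxAccess) {
        Line l = lines.get(addr / LINE_BYTES);
        if (l == null || l.approx != approxAccess)
            return 0xBAD; // stand-in for "any value"
        return l.data;
    }
}
```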

  32. Also in the paper: approximate main memory, a detailed DV-SRAM design, voltage level-shifter and mux circuits, replicated pipeline registers, and broadcast network details.

  33. Third topic: energy savings results.

  34. Methodology. Simulated EnerJ programs: precision-annotated Java [PLDI'11]; scientific kernels, a mobile app, a game engine, imaging, a raytracer. Modified McPAT models for OoO (Alpha 21264) and in-order cores [Li, Ahn, Strong, Brockman, Tullsen, Jouppi; MICRO'09]: 65 nm process, 1666 MHz, 1.5 V nominal (VddH); 4-wide (OoO) and 2-wide (in-order); includes the overhead of additional muxing, shadow FUs, etc. Extended CACTI for DV-SRAM structures [Muralimanohar, Balasubramonian, and Jouppi; MICRO'07]: 64 KB (OoO) and 32 KB (in-order) L1 caches with 16-byte lines; includes the precision-column overhead.

  35. Energy savings on the in-order core (reduction relative to a non-Truffle baseline; VddL swept over 0.75 V, 0.94 V, 1.13 V, and 1.31 V; benchmarks: fft, imagefill, jmeint, lu, mc, raytracer, smm, sor, zxing): 7 to 24% energy saved on average; the raytracer saves 14 to 43%.

  36. Energy savings on the OoO core (same benchmarks and VddL settings): savings up to 17%, with an efficiency loss of up to 5% in the worst case.

  37. Application accuracy trade-off: output quality-of-service loss versus error probability (10^-8 to 10^-2) for each benchmark, measured with application-specific output quality metrics. Error resilience varies widely across applications.
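An "application-specific output quality metric" can be as simple as mean relative error against a precise reference output. The metric below is only a hypothetical example for illustration; each benchmark in the paper defines its own metric.

```java
// A hypothetical output quality-of-service metric: mean relative error
// of an approximate output against the precise reference output.
public class Qos {
    static double qosLoss(double[] precise, double[] approx) {
        double total = 0.0;
        for (int i = 0; i < precise.length; i++)
            total += Math.abs(approx[i] - precise[i])
                   / Math.max(Math.abs(precise[i]), 1e-12); // guard div by 0
        return total / precise.length; // 0.0 means identical outputs
    }
}
```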

  38. End to end: the Truffle compiler splits an EnerJ program so that precise code (the plain int variables and control flow) runs at VddH while approximate code (the @Approx variables and their operations) runs at VddL on the Truffle core.

  39. Summary. Approximation-aware ISA: tightly coupled with language-level precision information. Dual-voltage microarchitecture: the data plane can run at a lower voltage, in a low-complexity design that relies on compiler support. Significant energy savings: up to 43% vs. a baseline in-order core.

  40. Future work on disciplined approximate programming: approximate accelerators, precision-aware programmer tools, and non-voltage approximation techniques.
