
Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters (ARITH 2016, Silicon Valley)

Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters. ARITH 2016, Silicon Valley, July 10-13, 2016. Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2). (1) Integrated Systems Laboratory, (2) Università di Bologna.


  1. Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters
     ARITH 2016, Silicon Valley, July 10-13, 2016
     Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2)
     (1) Integrated Systems Laboratory, (2) Università di Bologna

  2. Advanced Processing in IoT
     [Figure: low-power processing system with the stages Sense, Analyze and Classify, Transmit]
     • Complex preprocessing close to the sensor, e.g. feature extraction, regression, classification, compression, sensor fusion

  3. Arithmetic with High Dynamic Range (HDR) Desirable
     [Figure: low-power processing system with power annotations: Idle ~1 µW, Active ~50 mW, 100 µW - 2 mW, 1 - 10 mW; processing labeled "Fixed-Point"]
     • Fixed-point: labor-intensive, error-prone, quality losses

  4. Arithmetic with High Dynamic Range (HDR) Desirable
     [Figure: low-power processing system with power annotations: Idle ~1 µW, Active ~50 mW, 100 µW - 2 mW, 1 - 10 mW; processing labeled "HDR Arithmetic"]
     • Fixed-point: labor-intensive, error-prone, quality losses
     • Energy-efficient, low-cost HDR arithmetic desirable

  5. Logarithmic Number System (LNS)
     [Figure: number formats with field widths 1 | 8 | 23. FP: integer exponent and integer mantissa. LNS: fixed-point exponent.]
     • Efficient MUL, DIV, SQRT
     • $c = \log_2(2^a \cdot 2^b) = \log_2(2^{a+b}) = a + b$ for MUL, and $c = \log_2(2^a / 2^b) = \log_2(2^{a-b}) = a - b$ for DIV
     • $c = \log_2(\sqrt{2^a}) = \log_2(2^{0.5a}) = 0.5a = a \gg 1$
       → Simple integer operations!
     • Nonlinear ADD, SUB, I2F, F2I
     • Function interpolator → large LNS unit (LNU)
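
The integer-only nature of LNS MUL, DIV and SQRT can be illustrated with a minimal Python sketch. It assumes a fixed-point exponent with 23 fractional bits and ignores the sign bit and zero/overflow handling; the helper names (`to_lns`, `from_lns`, `lns_mul`, ...) are hypothetical and not taken from the presentation.

```python
import math

FRAC_BITS = 23            # assumed fractional bits of the fixed-point exponent (8.23)
SCALE = 1 << FRAC_BITS

def to_lns(x: float) -> int:
    """Encode a positive real as a fixed-point log2 value (sign bit omitted)."""
    return round(math.log2(x) * SCALE)

def from_lns(e: int) -> float:
    """Decode a fixed-point log2 value back to a real number."""
    return 2.0 ** (e / SCALE)

def lns_mul(a: int, b: int) -> int:
    return a + b          # log2(A * B) = a + b

def lns_div(a: int, b: int) -> int:
    return a - b          # log2(A / B) = a - b

def lns_sqrt(a: int) -> int:
    return a >> 1         # log2(sqrt(A)) = a / 2, an arithmetic right shift

product = from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))
quotient = from_lns(lns_div(to_lns(8.0), to_lns(2.0)))
root = from_lns(lns_sqrt(to_lns(16.0)))
print(product, quotient, root)
```

The product is only approximately 15 because 3 and 5 are not powers of two and their logarithms are rounded to the fixed-point grid.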

  6. Precision & Approximation
     • Bilateral filter example: LNS 8.23 (0.5 ulp, “precise”) vs. LNS 8.17 (16 ulp, “approximate”)
     • Error-tolerant applications
     • Full precision not always required → additional tuning knob

  7. Contributions
     • Generator framework for automatic generation of “precise” (0.5 ulp) and “approximate” (> 0.5 ulp) LNU instances.
     • Design space exploration of precise / approximate LNUs.
     • 33%-71% smaller LNU (precise) with more functionality than previous designs [8,9,27].
     • Case study: accuracy/performance tradeoffs of a shared LNU in a 65nm CMOS multicore cluster.
     [8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
     [9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
     [27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

  8. Problematic LNS Additions/Subtractions
     • $C = A \pm B$ with $A = 2^a$, $B = 2^b$, $C = 2^c$
     • Easy case (ADD): $c = \log_2(2^a + 2^b) = \max(a,b) + f_+(|a-b|)$
     • Hard case (SUB): $c = \log_2(2^a - 2^b) = \max(a,b) + f_-(|a-b|)$
     [Figure: critical region marked]
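
A floating-point Python sketch of the two cases, assuming the standard definitions $f_+(r) = \log_2(1 + 2^{-r})$ and $f_-(r) = \log_2(1 - 2^{-r})$ that follow from the identities on this slide. The function names are hypothetical, and a real LNU would interpolate these functions instead of calling `log2`.

```python
import math

def f_plus(r: float) -> float:
    """f+(r) = log2(1 + 2^-r), well behaved for all r >= 0."""
    return math.log2(1.0 + 2.0 ** (-r))

def f_minus(r: float) -> float:
    """f-(r) = log2(1 - 2^-r), defined for r > 0 and singular as r -> 0."""
    return math.log2(1.0 - 2.0 ** (-r))

def lns_add(a: float, b: float) -> float:
    """Return log2(2^a + 2^b)."""
    return max(a, b) + f_plus(abs(a - b))

def lns_sub(a: float, b: float) -> float:
    """Return log2(|2^a - 2^b|) for a != b."""
    return max(a, b) + f_minus(abs(a - b))

a, b = math.log2(5.0), math.log2(3.0)
print(2.0 ** lns_add(a, b))   # approximately 8
print(2.0 ** lns_sub(a, b))   # approximately 2
print(f_minus(1e-6))          # large negative value: the critical region near r = 0
```

The last line shows why subtraction is the hard case: for nearly equal operands $f_-$ diverges towards minus infinity, so a plain lookup or low-order interpolation of $f_-$ loses accuracy there.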

  9. Critical Region Decomposition
     • Analytic transformation of $f_-$ into subfunctions
     • Literature (ASIC complexity for 8.23 bit, 0.5 ulp, synthesis):
       – Coleman (1995) [5]
       – Arnold (1998) [4]
       – Vouzis (2007) [7]
       – Coleman (2008) [8]: 94 kGE
       – Ismail (2011) [9]: 63 kGE
       – Gautschi, Popoff (2016) [27,11]: 40 kGE
       – This work, using Paliouras (1996) [3]: 27 kGE
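
The relative area savings implied by these complexity figures can be checked with a few lines of Python. This is only arithmetic on the kGE numbers listed on this slide; the dictionary and function names are hypothetical.

```python
# kGE figures as listed on the slide (8.23 bit, 0.5 ulp, synthesis)
previous_kge = {
    "Coleman (2008) [8]": 94,
    "Ismail (2011) [9]": 63,
    "Gautschi, Popoff (2016) [27,11]": 40,
}
this_work_kge = 27

def reduction_percent(old_kge: float, new_kge: float) -> float:
    """Relative complexity reduction of new_kge with respect to old_kge, in percent."""
    return 100.0 * (old_kge - new_kge) / old_kge

reductions = {name: reduction_percent(kge, this_work_kge)
              for name, kge in previous_kge.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% smaller")
```

The results are about 71.3%, 57.1% and 32.5%, which is close to the 33%-71% range quoted in the Contributions slide.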

  10. Critical Region Decomposition
      • $c = \max(a,b) + f_-(r)$
      • $c = \max(a,b) + \underbrace{\log_2\big((1 - 2^{-r}) / r\big)}_{\mathrm{cotrans}(r)} + \log_2(r)$
      [Figure: critical region marked]
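
A numerical check of this decomposition, as a sketch: it assumes $f_-(r) = \log_2(1 - 2^{-r})$ and takes $\mathrm{cotrans}(r) = \log_2((1 - 2^{-r})/r)$, so that $f_-(r) = \mathrm{cotrans}(r) + \log_2(r)$. Using `math.expm1` for $1 - 2^{-r}$ is an implementation choice made here for numerical stability, not something stated on the slides.

```python
import math

LN2 = math.log(2.0)

def one_minus_pow2(r: float) -> float:
    """Compute 1 - 2^-r accurately for small r via expm1."""
    return -math.expm1(-r * LN2)

def f_minus(r: float) -> float:
    return math.log2(one_minus_pow2(r))

def cotrans(r: float) -> float:
    """Smooth part of f- in the critical region."""
    return math.log2(one_minus_pow2(r) / r)

for r in (1e-6, 1e-3, 0.1, 1.0):
    lhs = f_minus(r)
    rhs = cotrans(r) + math.log2(r)
    print(f"r={r:g}  f-={lhs:.6f}  cotrans+log2(r)={rhs:.6f}  cotrans={cotrans(r):.6f}")
```

While $f_-(r)$ diverges as $r \to 0$, `cotrans(r)` stays bounded and approaches $\log_2(\ln 2) \approx -0.529$, which makes it far easier to interpolate. The singular behavior is moved into the $\log_2(r)$ term.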

  11. Function Approximation
      Different methods (e.g., for 8.23 LNS):
      1) LUT only (very large!)
      2) High-order polynomial
         • Often high order required
         • Large interpolator delay
      3) LUT + piecewise polynomial
         • Tradeoff: precomputation vs. interpolation
         • Half precision to single precision: 1st to 2nd order
      [Figures: f(r) over r; 1st-, 2nd- and 3rd-order fits with interpolation error; piecewise segments of width d]
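
Method 3 can be sketched in Python as a table of per-segment coefficients plus a first-order evaluation. The sketch approximates $f_+(r) = \log_2(1 + 2^{-r})$ on an assumed range $[0, 8)$ with uniform segments, and it derives the coefficients from segment endpoints instead of the minimax fits a real generator would use. All names are hypothetical.

```python
import math

def f_plus(r: float) -> float:
    return math.log2(1.0 + 2.0 ** (-r))

def build_lut(func, r_max: float, n_segments: int):
    """Precompute (c0, c1) per uniform segment from the segment endpoints."""
    width = r_max / n_segments
    lut = []
    for i in range(n_segments):
        lo, hi = i * width, (i + 1) * width
        c0 = func(lo)
        c1 = (func(hi) - func(lo)) / width
        lut.append((c0, c1))
    return lut, width

def evaluate(lut, width: float, r: float) -> float:
    """Look up the segment, then evaluate c0 + c1 * d with d the offset in the segment."""
    i = min(int(r / width), len(lut) - 1)
    d = r - i * width
    c0, c1 = lut[i]
    return c0 + c1 * d

def max_error(func, lut, width: float, r_max: float, samples: int = 10000) -> float:
    return max(abs(func(k * r_max / samples) - evaluate(lut, width, k * r_max / samples))
               for k in range(samples))

lut64, w64 = build_lut(f_plus, 8.0, 64)
lut256, w256 = build_lut(f_plus, 8.0, 256)
err64 = max_error(f_plus, lut64, w64, 8.0)
err256 = max_error(f_plus, lut256, w256, 8.0)
print(err64, err256)
```

Quadrupling the table size shrinks the error noticeably, which is the precomputation vs. interpolation tradeoff named above: a larger LUT allows a lower polynomial order for the same accuracy.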

  12. LNU Generator Framework
      • Specs: bitwidth, accuracy, order
      • Iterative fitting heuristic (similar to [30])
      • Piecewise minimax polynomials (using Sollya [29])
      [30] De Dinechin et al., “Automatic Generation of Polynomial-Based Hardware Architectures for Function Evaluation”, ASAP 2010
      [29] Chevillard et al., “Sollya: An Environment for the Development of Numerical Codes”, ICMS 2010
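
The slides do not spell out the fitting heuristic, so the following Python sketch is only a stand-in for the general idea: refine the segmentation until every segment meets an accuracy target. It uses recursive bisection and endpoint-based linear fits rather than the minimax polynomials from Sollya, and all names and the target values are assumptions.

```python
import math

def f_plus(r: float) -> float:
    return math.log2(1.0 + 2.0 ** (-r))

def segment_error(func, lo: float, hi: float, samples: int = 64) -> float:
    """Sampled maximum error of the endpoint-based linear fit on [lo, hi]."""
    slope = (func(hi) - func(lo)) / (hi - lo)
    worst = 0.0
    for k in range(samples + 1):
        d = (hi - lo) * k / samples
        worst = max(worst, abs(func(lo + d) - (func(lo) + slope * d)))
    return worst

def fit_segments(func, lo: float, hi: float, target: float):
    """Bisect segments until each one meets the error target."""
    pending = [(lo, hi)]
    accepted = []
    while pending:
        a, b = pending.pop()
        if segment_error(func, a, b) <= target:
            accepted.append((a, b))
        else:
            mid = (a + b) / 2.0
            pending.extend([(a, mid), (mid, b)])
    return sorted(accepted)

loose = fit_segments(f_plus, 0.0, 8.0, 2.0 ** -10)
tight = fit_segments(f_plus, 0.0, 8.0, 2.0 ** -14)
print(len(loose), len(tight))
```

A tighter accuracy target yields more segments, hence a larger LUT. This is the knob that separates "precise" from "approximate" instances in a generator of this kind.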

  13. Architecture Template
      [Block diagram: Preprocessing Block, Main Interpolator Block, Log/Exp Block, Postprocessing Block]

  14. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
      [Block diagram: Main Interpolator Block, Log/Exp Block, Postprocessing Block]

  15. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
      [Block diagram: LUTs, N-th order interpolator, Log/Exp Block, Postprocessing Block]

  16. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
      [Block diagram: Postprocessing Block]

  17. LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

  18. “Precise” 32bit LNU: Features & Comparison

                              ELM [8]     ROM-less [9]  ISSCC’16 [27]                 This Work
      Functionality           ADD, SUB    ADD, SUB      F2I, I2F, EXP, LOG, ADD, SUB  F2I, I2F, EXP, LOG, ADD, SUB
      Max error [ulp]         0.454       0.498         0.479                         0.45
      LUT size [Kbit]         256.4       183.3         113.1                         64.2
      Technology              180 nm      180 nm        65 nm                         65 nm
      Area [µm²]              904’943     589’357       57’264                        38’592
      Post-synthesis [kGE]    97          63            40                            26.8
      Min delay [ns]          11.74       7.10          6                             4.5
      Max delay [ns]          13.15       14.79         6                             4.5

      [8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
      [9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
      [27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

  19. Design Space: Precision vs. Area
      @4.5 ns delay in umc65, post-synthesis
      [Figure: area vs. precision (ulp in the LNS domain); annotations: "-40%", "Tipping point 1st → 2nd order"]

  20. Case Study: HW Platform
      • Parallel Ultra-Low-Power (PULP) Platform [31], www.pulp-platform.org
        – 4x 32b OpenRISC cores (in-order)
        – 16 kByte shared L1 (TCDM), 16 kByte L2 memory
      • Configurations:
        – 1 shared LNU (Precise, Approx1, Approx2)
          • 4, 3 or 2 pipeline registers
          • Fair round-robin arbiter
        – 4 private FPUs (reference)
          • Directly integrated into cores
          • 2 pipeline registers
      [Figures: cluster with PE0 to PE3 sharing one LNU; cluster with PE0 to PE3 and one FPU per core]
      [31] M. Gautschi et al., “Tailoring Instruction-Set Extensions for an Ultra-Low Power Tightly-Coupled Cluster of OpenRISC Cores,” in VLSI-SoC, 2015
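
The fair round-robin arbitration of the shared LNU can be modeled behaviorally in a few lines of Python. This is an illustrative sketch of a generic round-robin policy for four requesters, not the arbiter implemented on the chip; the class and method names are hypothetical.

```python
class RoundRobinArbiter:
    """Grant one requester per cycle, rotating priority after each grant."""

    def __init__(self, n_ports: int = 4):
        self.n_ports = n_ports
        self.last = n_ports - 1   # port 0 gets the highest priority first

    def grant(self, requests):
        """requests: list of bools, one per core. Returns the granted port or None."""
        for offset in range(1, self.n_ports + 1):
            port = (self.last + offset) % self.n_ports
            if requests[port]:
                self.last = port  # the granted port becomes lowest priority
                return port
        return None

arbiter = RoundRobinArbiter(4)
all_requesting = [arbiter.grant([True] * 4) for _ in range(5)]
print(all_requesting)                                  # every core is served in turn
print(arbiter.grant([True, False, True, False]))       # next requester after port 0
```

Because priority rotates past the most recently served core, no core can starve another, which matters when four cores contend for a single pipelined unit.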

  21. Chip Complexities

      Name                      FPU       Precise   Approx1   Approx2
      Format                    IEEE754   LNS       LNS       LNS
      Bitwidth                  8.23      8.23      8.20      8.17
      Precision                 0.5 ulp   0.5 ulp   4 ulp*    16 ulp*
      Order                     -         2         2         1
      Pipeline stages           2         4         3         2
      FPU/LNU [kGE]             4x11      36        27        23
      Total complexity [kGE]    720       718       708       704

      * In the LNS domain

  22. Kernel Level Results (umc65, post-layout)
      • Pipeline depth is the relevant factor!
      • Energy efficiency gains mainly due to corresponding speedup!

  23. Conclusions
      • Generator framework for precise and approximate LNUs
      • Very compact 8.23bit LNU (33%-71% smaller)
      • Shared setting attractive for LNU
        – Up to 4.2x more energy efficient than private FPU baseline
      • Approximation:
        – Additional gains in area, speedup and energy efficiency
      • Energy-efficiency gains mainly due to lower latency and speedup
        – Less time is needed to complete a task → lower system energy consumption

  24. Outlook
      • Vectorization and trigonometric extensions
      • Optimization opportunities for many algorithms to leverage LNS and approximation
      PULP Platform: Looking for Collaborators!
      • OpenRISC / RISC-V ISA
      • Open source, silicon proven
      • Extending DSP capabilities…
      • www.pulp-platform.org, pulp@pulp.ethz.ch

  25. Q&A
      Acknowledgements: Nano Tera IcySoC project

  26. Backup Slides

  27. Outline
      • Motivation
      • Preliminaries: LNS Add/Sub and Interpolation
      • LNU Architecture and Generator Framework
      • Multicore Hardware Platform
      • Results
      • Conclusion
      • Q&A

  28. Private FPUs
      [Diagram labels: INT operations; Core 0 to Core 3; one FPU per core; HDR-ADD/SUB/MUL; annotation "50%"]

  29. Private LNUs
      [Diagram labels: INT operations; Core 0 to Core 3; FPUs and LNUs, one per core; HDR MUL/DIV/SQRT; ADD/SUB]
      • Area: 1 LNU < 4 × standard IEEE-compliant FPU (no DIV)
      • Poor LNU utilization ~ 0.2

  30. Shared LNU
      [Diagram labels: INT operations and HDR-MUL/DIV/SQRT in Core 0 to Core 3; Interconnect; Arbiter; one LNU for HDR-ADD/SUB/I2F/F2I]

  31. Design Space Exploration
      • Bitwidth:
        – Half to single precision: 5.10 – 8.23
      • Accuracy:
        – Precise (0.5 ulp) and approximate (up to 16 ulp)
      • Order:
        – 1st/2nd order interpolation

  32. Design Space: Area vs. Delay
      [Figure: area vs. delay, with the Precise, Approx1 and Approx2 design points marked (*)]
      * Required # pipeline stages for 500MHz target

  33. Kernels
      • Linear Algebra: AXPY, GEMM, GEMV, DotP
      • Matrix Factorizations: Chol, QR
      • Geometry: Homographies, Distances, Projection Errors
      • Image: Gradient Magnitude, Bilateral, FIR
      • Audio: Butterworth, Sine, DCT-II
      • Other: Radial Basis Functions
