
Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters
ARITH 2016, Silicon Valley, July 10-13, 2016
Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2)
(1) Integrated Systems Laboratory
(2) Università di Bologna

Advanced Processing in IoT
• Processing chain: Sense → Analyze and Classify → Transmit
• Low Power Processing System: complex preprocessing close to the sensor, e.g., feature extraction, regression, classification, compression, sensor fusion

Arithmetic with High Dynamic Range (HDR) Desirable
[Figure: Low Power Processing System with power labels "Idle: ~1 µW", "Active: ~50 mW", "100 µW - 2 mW" and "1 - 10 mW"; the processing block is labeled "Fixed-Point" in the first build and "HDR Arithmetic" in the second]
• Fixed-point: labor intensive, error-prone, quality losses
• Energy-efficient, low-cost HDR arithmetic desirable

Logarithmic Number System (LNS)
• FP: integer exponent, integer mantissa (field widths 1 | 8 | 23)
• LNS: fixed-point exponent
• Efficient MUL, DIV, SQRT
  – MUL: $c = \log_2(2^a \cdot 2^b) = \log_2(2^{a+b}) = a + b$
  – DIV: $c = \log_2(2^a / 2^b) = \log_2(2^{a-b}) = a - b$
  – SQRT: $c = \log_2(\sqrt{2^a}) = \log_2(2^{0.5a}) = 0.5a$, i.e., a >> 1
  → Simple integer operations!
• Nonlinear ADD, SUB, I2F, F2I
  – Function interpolator → large LNS unit (LNU)
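The "simple integer operations" claim can be illustrated with a minimal Python sketch. It assumes positive values only (the sign bit is omitted) and an 8.23 fixed-point exponent. The function names are hypothetical and not taken from the slides.

```python
import math

FRAC_BITS = 23  # fractional bits of the fixed-point exponent (8.23 format, assumed)

def lns_encode(x: float) -> int:
    """Encode a positive real as a fixed-point log2 exponent (sign bit omitted)."""
    return round(math.log2(x) * (1 << FRAC_BITS))

def lns_decode(e: int) -> float:
    """Convert a fixed-point exponent back to a real value."""
    return 2.0 ** (e / (1 << FRAC_BITS))

def lns_mul(a: int, b: int) -> int:
    return a + b          # multiplication is an integer addition of exponents

def lns_div(a: int, b: int) -> int:
    return a - b          # division is an integer subtraction of exponents

def lns_sqrt(a: int) -> int:
    return a >> 1         # square root halves the exponent (arithmetic shift)

if __name__ == "__main__":
    a, b = lns_encode(3.0), lns_encode(5.0)
    print(lns_decode(lns_mul(a, b)), lns_decode(lns_div(a, b)), lns_decode(lns_sqrt(a)))
```

Only the encode and decode helpers use floating point here. The three arithmetic operations touch nothing but integers, which is the property the slide highlights.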

Precision & Approximation
• Bilateral filter example: LNS 8.23 (0.5 ulp, “precise”) vs. LNS 8.17 (16 ulp, “approximate”)
• Error-tolerant applications
• Full precision not always required → additional tuning knob

Contributions
• Generator framework for automatic generation of “precise” (0.5 ulp) and “approximate” (> 0.5 ulp) LNU instances.
• Design space exploration of precise / approximate LNUs.
• 33%-71% smaller LNU (precise) with more functionality than previous designs [8,9,27].
• Case study: accuracy/performance tradeoffs of a shared LNU in a 65 nm CMOS multicore cluster.
[8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
[9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
[27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

Problematic LNS Additions/Subtractions
• $C = A \pm B$ with $A = 2^a$, $B = 2^b$, $C = 2^c$
• Easy case (ADD): $c = \log_2(2^a + 2^b) = \max(a,b) + f^+(|a-b|)$
• Hard case (SUB): $c = \log_2(2^a - 2^b) = \max(a,b) + f^-(|a-b|)$
[Figure annotation on $f^-$: critical region]
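The two cases can be sketched in Python as follows. The sketch assumes f+(r) = log2(1 + 2^-r) and f-(r) = log2(1 - 2^-r), which is what the identities on this slide give after factoring out 2^max(a,b). Exponents are plain floats rather than fixed-point words, and sign handling is left out.

```python
import math

def f_plus(r: float) -> float:
    """f+(r) = log2(1 + 2^-r); smooth and bounded for r >= 0."""
    return math.log2(1.0 + 2.0 ** (-r))

def f_minus(r: float) -> float:
    """f-(r) = log2(1 - 2^-r); diverges to -inf as r -> 0 (critical region)."""
    return math.log2(1.0 - 2.0 ** (-r))

def lns_add(a: float, b: float) -> float:
    """Exponent of 2^a + 2^b."""
    return max(a, b) + f_plus(abs(a - b))

def lns_sub(a: float, b: float) -> float:
    """Exponent of |2^a - 2^b|; requires a != b, sign is handled separately."""
    return max(a, b) + f_minus(abs(a - b))

if __name__ == "__main__":
    a, b = math.log2(5.0), math.log2(3.0)
    print(2.0 ** lns_add(a, b), 2.0 ** lns_sub(a, b))
```

Calling `f_minus` with arguments close to zero shows why subtraction is the hard case: the function value changes very quickly there, so a plain table or low-order fit needs many more entries than for `f_plus`.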

Critical Region Decomposition
• Analytic transformation of $f^-$ into subfunctions
• Literature, with ASIC complexity (synthesis) for 8.23 bit, 0.5 ulp where given:
  – Coleman (1995) [5]
  – Arnold (1998) [4]
  – Vouzis (2007) [7]
  – Coleman (2008) [8]: 94 kGE
  – Ismail (2011) [9]: 63 kGE
  – Gautschi, Popoff (2016) [27,11]: 40 kGE
  – This work, using Paliouras (1996) [3]: 27 kGE

Critical Region Decomposition
• $c = \max(a,b) + f^-(r)$
• $c = \max(a,b) + \log_2\big((1-2^{-r})/r\big) + \log_2(r)$, where the first logarithm is labeled $\mathrm{cotrans}(r)$
[Figure annotation: critical region]
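The decomposition can be written out as a short derivation. It assumes $f^-(r) = \log_2(1-2^{-r})$ for $r = |a-b| > 0$ and takes $\mathrm{cotrans}(r)$ to mean the term that remains after splitting off $\log_2(r)$, which is how the later slides use the name.

\[
f^-(r) = \log_2\!\left(1-2^{-r}\right)
       = \log_2\!\left(\frac{1-2^{-r}}{r}\cdot r\right)
       = \underbrace{\log_2\!\left(\frac{1-2^{-r}}{r}\right)}_{\mathrm{cotrans}(r)} + \log_2(r)
\]

\[
\lim_{r\to 0^+}\frac{1-2^{-r}}{r} = \ln 2
\quad\Longrightarrow\quad
\lim_{r\to 0^+}\mathrm{cotrans}(r) = \log_2(\ln 2) \approx -0.529
\]

The singularity of $f^-$ at $r \to 0$ therefore sits entirely in $\log_2(r)$, while $\mathrm{cotrans}(r)$ stays bounded in the critical region and is much easier to interpolate.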

Function Approximation f(r)
• E.g., 8.23 LNS. Different methods:
  1) LUT only (very large!)
  2) High-order polynomial
    – Often high order required
    – Large interpolator delay
  3) LUT + piecewise polynomial
    – Tradeoff: precomputation vs. interpolation
    – Half precision to single precision: 1st to 2nd order
[Figure: f(r) over r with 1st-, 2nd- and 3rd-order fits and the resulting interpolation error; f(r) over r split into segments of width d]
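Method 3 (LUT + piecewise polynomial) can be sketched with a first-order fit per segment. This is an illustration under assumptions. It approximates f+(r) = log2(1 + 2^-r) on [0, 16) with uniform segments of width d. Each segment uses a chord through its endpoints instead of a minimax polynomial, and coefficient quantization is ignored.

```python
import math

def f_plus(r: float) -> float:
    return math.log2(1.0 + 2.0 ** (-r))

def build_lut(f, r_max: float, n_seg: int):
    """Return (d, lut) where lut[i] = (c0, c1) and f(r) ~ c0 + c1 * (r - i*d)."""
    d = r_max / n_seg
    lut = []
    for i in range(n_seg):
        lo, hi = i * d, (i + 1) * d
        lut.append((f(lo), (f(hi) - f(lo)) / d))  # offset and slope of the chord
    return d, lut

def evaluate(lut, d: float, r: float) -> float:
    """Table lookup by segment index, then a first-order interpolation step."""
    i = min(int(r / d), len(lut) - 1)
    c0, c1 = lut[i]
    return c0 + c1 * (r - i * d)

def max_error(f, lut, d: float, r_max: float, samples: int = 20000) -> float:
    """Largest sampled absolute error of the piecewise fit on [0, r_max)."""
    points = (k * r_max / samples for k in range(samples))
    return max(abs(f(r) - evaluate(lut, d, r)) for r in points)

if __name__ == "__main__":
    for n_seg in (16, 64, 256):
        d, lut = build_lut(f_plus, 16.0, n_seg)
        print(n_seg, max_error(f_plus, lut, d, 16.0))
```

More segments mean a larger table but a smaller interpolation error, which is the precomputation vs. interpolation tradeoff named on the slide.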

LNU Generator Framework
• Specs: bitwidth, accuracy, order
• Iterative fitting heuristic (similar to [30])
• Piecewise minimax polynomials (using Sollya [29])
[30] De Dinechin et al., “Automatic Generation of Polynomial-Based Hardware Architectures for Function Evaluation”, ASAP 2010
[29] Chevillard et al., “Sollya: An Environment for the Development of Numerical Codes”, ICMS 2010
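The idea of refining a segmentation until an accuracy spec is met can be sketched as below. This reproduces neither the heuristic of [30] nor the authors' generator. It assumes uniform segments whose count is doubled each round, uses chord fits instead of Sollya's minimax polynomials, and measures the error by sampling. All names are hypothetical.

```python
import math

def f_plus(r: float) -> float:
    return math.log2(1.0 + 2.0 ** (-r))

def chord_error(f, lo: float, hi: float, probes: int = 16) -> float:
    """Max sampled |f - chord| on [lo, hi]; the chord passes through both endpoints."""
    slope = (f(hi) - f(lo)) / (hi - lo)
    worst = 0.0
    for k in range(1, probes):
        r = lo + (hi - lo) * k / probes
        worst = max(worst, abs(f(r) - (f(lo) + slope * (r - lo))))
    return worst

def fit_segments(f, r_max: float, frac_bits: int, target_ulp: float,
                 max_segments: int = 1 << 20):
    """Double the segment count until the sampled error meets the ulp target."""
    ulp = 2.0 ** (-frac_bits)  # one unit in the last place of the exponent
    n = 1
    while n <= max_segments:
        d = r_max / n
        err = max(chord_error(f, i * d, (i + 1) * d) for i in range(n))
        if err <= target_ulp * ulp:
            return n, err / ulp
        n *= 2
    raise ValueError("accuracy target not reached")

if __name__ == "__main__":
    # Small bitwidth chosen only to keep the search fast.
    print(fit_segments(f_plus, 16.0, frac_bits=10, target_ulp=0.5))
    print(fit_segments(f_plus, 16.0, frac_bits=10, target_ulp=16.0))
```

Relaxing the accuracy target lets the search stop at a coarser segmentation, which is where the area savings of approximate instances would come from in a table-based design.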

Architecture Template
[Figure: Preprocessing Block, Main Interpolator Block, Log/Exp Block, Postprocessing Block]

LNS Sub (critical): $c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$
[Figure, shown in four builds: data path through the Main Interpolator Block (LUTs, N-th order interpolator), the Log/Exp Block and the Postprocessing Block]
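The critical-region subtraction can be traced through the block names of the architecture template with a small Python sketch. The assignment of each term to a block is an assumption made for illustration, since the slides only name the blocks. The interpolator is replaced by a direct evaluation, and cotrans(r) is taken to be log2((1 - 2^-r) / r).

```python
import math

def preprocess(a: float, b: float):
    """Preprocessing: select the larger exponent and form r = |a - b|."""
    return max(a, b), abs(a - b)

def main_interpolator(r: float) -> float:
    """Stand-in for the LUT + N-th order interpolator evaluating cotrans(r)."""
    return math.log2((1.0 - 2.0 ** (-r)) / r)

def log_block(r: float) -> float:
    """Stand-in for the Log/Exp block evaluating log2(r)."""
    return math.log2(r)

def postprocess(m: float, cotrans_r: float, log_r: float) -> float:
    """Postprocessing: sum the three terms of the critical-region formula."""
    return m + cotrans_r + log_r

def lns_sub_critical(a: float, b: float) -> float:
    """Exponent of |2^a - 2^b| for a != b, via c = max(a,b) + cotrans(r) + log2(r)."""
    m, r = preprocess(a, b)
    return postprocess(m, main_interpolator(r), log_block(r))

if __name__ == "__main__":
    # Two nearly equal operands: 1.001 - 1.0
    print(2.0 ** lns_sub_critical(math.log2(1.001), 0.0))
```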

“Precise” 32bit LNU: Features & Comparison

| | ELM [8] | ROM-less [9] | ISSCC’16 [27] | This Work |
|---|---|---|---|---|
| Functionality | ADD, SUB | ADD, SUB | F2I, I2F, EXP, LOG, ADD, SUB | F2I, I2F, EXP, LOG, ADD, SUB |
| Max error [ulp] | 0.454 | 0.498 | 0.479 | 0.45 |
| LUT size [Kbit] | 256.4 | 183.3 | 113.1 | 64.2 |
| Technology | 180 nm | 180 nm | 65 nm | 65 nm |
| Area [µm²] | 904’943 | 589’357 | 57’264 | 38’592 |
| Post-synthesis [kGE] | 97 | 63 | 40 | 26.8 |
| Min delay [ns] | 11.74 | 7.10 | 6 | 4.5 |
| Max delay [ns] | 13.15 | 14.79 | 6 | 4.5 |

[8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
[9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
[27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

Design Space: Precision vs. Area
• @4.5 ns delay in umc65, post-synthesis
• ulp in the LNS domain
[Figure: precision vs. area; annotations: "-40%", "Tipping point 1st → 2nd order"]

Case Study: HW Platform
• Parallel Ultra-Low-Power (PULP) Platform [31], www.pulp-platform.org
  – 4x 32b OpenRISC cores (in-order)
  – 16 kByte shared L1 (TCDM), 16 kByte L2 memory
• Configurations:
  – 1 shared LNU (Precise, Approx1, Approx2)
    • 4, 3 or 2 pipeline registers
    • Fair round robin arbiter
  – 4 private FPUs (reference)
    • Directly integrated into cores
    • 2 pipeline registers
[Figure: PE0 to PE3 around one shared LNU; PE0 to PE3 each with a private FPU]
[31] M. Gautschi et al., “Tailoring Instruction-Set Extensions for an Ultra-Low Power Tightly-Coupled Cluster of OpenRISC Cores,” in VLSI-SoC, 2015
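A fair round robin arbiter for a shared unit can be sketched behaviorally as below. The slides do not describe the actual PULP arbiter, so the class, its interface and the one-grant-per-cycle policy are assumptions for illustration only.

```python
from typing import List, Optional

class RoundRobinArbiter:
    """Grants at most one requesting port per cycle, rotating the priority."""

    def __init__(self, n_ports: int) -> None:
        self.n_ports = n_ports
        self.last = n_ports - 1  # so that port 0 has the highest priority first

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the granted port index, or None if nobody requests."""
        # Search starts right after the previous winner, which makes it fair.
        for offset in range(1, self.n_ports + 1):
            port = (self.last + offset) % self.n_ports
            if requests[port]:
                self.last = port
                return port
        return None

if __name__ == "__main__":
    arbiter = RoundRobinArbiter(4)          # four cores sharing one LNU
    for cycle in range(6):
        print(cycle, arbiter.grant([True, False, True, True]))
```

Because the search always starts after the last winner, a core that keeps requesting waits at most three grants before it is served in a four-core cluster.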

Chip Complexities

| Name | FPU | Precise | Approx1 | Approx2 |
|---|---|---|---|---|
| Format | IEEE754 | LNS | LNS | LNS |
| Bitwidth | 8.23 | 8.23 | 8.20 | 8.17 |
| Precision | 0.5 ulp | 0.5 ulp | 4 ulp* | 16 ulp* |
| Order | - | 2 | 2 | 1 |
| Pipeline Stages | 2 | 4 | 3 | 2 |
| FPU/LNU [kGE] | 4x11 | 36 | 27 | 23 |
| Total Complexity [kGE] | 720 | 718 | 708 | 704 |

* In the LNS domain
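Since the precision figures are given in the LNS domain, a short Python sketch can translate them into linear-domain terms. It assumes that one ulp equals 2^-F for a format with F fractional exponent bits, and that an absolute exponent error e corresponds to a relative error of 2^e - 1 in the represented value. The helper name is hypothetical.

```python
import math

def lns_error_bounds(frac_bits: int, err_ulp: float):
    """Return (absolute exponent error, resulting relative error of 2**exponent)."""
    log_err = err_ulp * 2.0 ** (-frac_bits)
    # expm1 avoids cancellation when computing 2**log_err - 1 for tiny log_err.
    rel_err = math.expm1(log_err * math.log(2.0))
    return log_err, rel_err

# (fractional bits, error in ulp) taken from the table above
CONFIGS = {"Precise": (23, 0.5), "Approx1": (20, 4.0), "Approx2": (17, 16.0)}

if __name__ == "__main__":
    for name, (frac_bits, err_ulp) in CONFIGS.items():
        log_err, rel_err = lns_error_bounds(frac_bits, err_ulp)
        print(f"{name}: exponent error <= {log_err:.3e}, relative error <= {rel_err:.3e}")
```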

Kernel Level Results
[Figure: kernel-level results, umc65, post-layout]
• Pipeline depth is the relevant factor!
• Energy efficiency gains mainly due to corresponding speedup!

Conclusions
• Generator framework for precise and approximate LNUs
• Very compact 8.23 bit LNU (33%-71% smaller)
• Shared setting attractive for LNU
• Up to 4.2x more energy efficient than private FPU baseline
• Approximation:
  – Additional gains in area, speedup and energy efficiency
  – Energy-efficiency gains mainly due to lower latency and speedup
  – Less time is needed to complete a task → lower system energy consumption

Outlook
• Vectorization and trigonometric extensions
• Optimization opportunities for many algorithms to leverage LNS and approximation
PULP Platform: Looking for Collaborators!
• OpenRISC / RISC-V ISA
• Open source, silicon proven
• Extending DSP capabilities…
• www.pulp-platform.org
• pulp@pulp.ethz.ch

Q&A
Acknowledgements: Nano Tera IcySoC project

Backup Slides

Outline
• Motivation
• Preliminaries: LNS Add/Sub and Interpolation
• LNU Architecture and Generator Framework
• Multicore Hardware Platform
• Results
• Conclusion
• Q&A

Private FPUs
[Figure: Core 0 to Core 3 executing INT operations, each with a private FPU for HDR-ADD/SUB/MUL; annotation: 50%]

Private LNUs
[Figure: Core 0 to Core 3 executing INT operations; block labels FPU (x4) and LNU (x4); HDR MUL/DIV/SQRT and ADD/SUB]
• Area: 1 LNU < 4 × standard IEEE compliant FPU (no DIV)
• Poor LNU utilization ~ 0.2

Shared LNU
[Figure: Core 0 to Core 3 with INT operations and HDR-MUL/DIV/SQRT; Interconnect and Arbiter; one shared LNU with HDR-ADD/SUB/I2F/F2I]

Design Space Exploration
• Bitwidth:
  – Half to single precision: 5.10 to 8.23
• Accuracy:
  – Precise (0.5 ulp) and approximate (up to 16 ulp)
• Order:
  – 1st/2nd order interpolation

Design Space: Area vs. Delay
[Figure: area vs. delay for the Precise, Approx1 and Approx2 LNUs]
* Required # pipeline stages for 500 MHz target

Kernels
• Linear Algebra: AXPY, GEMM, GEMV, DotP
• Matrix Factorizations: Chol, QR
• Geometry: Homographies, Distances, Projection Errors
• Image: Gradient Magnitude, Bilateral, FIR
• Audio: Butterworth, Sine, DCT-II
• Other: Radial Basis Functions
[Figure annotations: 50%, 25%]
