Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters. ARITH 2016, Silicon Valley, July 10-13, 2016. Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2). (1) Integrated Systems Laboratory, (2) Università di Bologna.


SLIDE 1

Accuracy and Performance Trade-offs of Logarithmic Number Units in Multi-Core Clusters

ARITH 2016, Silicon Valley, July 10-13, 2016

Michael Schaffner (1), Michael Gautschi (1), Frank K. Gürkaynak (1), Prof. Luca Benini (1,2)

(1) Integrated Systems Laboratory
(2) Università di Bologna

SLIDE 2

Advanced Processing in IoT

Sense → Analyze and Classify → Transmit

  • Complex preprocessing close to the sensor, e.g., feature extraction, regression, classification, compression, sensor fusion
  • Low Power Processing System
SLIDE 3

Arithmetic with High Dynamic Range (HDR) Desirable

  • Fixed-point: labor intensive, error-prone, quality losses

[Figure labels: Low Power Processing System; Fixed-Point: 1 - 10 mW; 100 µW - 2 mW; Idle: ~1 µW; Active: ~50 mW]

SLIDE 4

Arithmetic with High Dynamic Range (HDR) Desirable

  • Fixed-point: labor intensive, error-prone, quality losses
  • Energy-efficient, low-cost HDR arithmetic desirable

[Figure labels: Low Power Processing System; HDR Arithmetic: 1 - 10 mW; 100 µW - 2 mW; Idle: ~1 µW; Active: ~50 mW]

SLIDE 5

Logarithmic Number System (LNS)

  • Efficient MUL, DIV, SQRT
  • MUL/DIV: $c = \log_2(2^a \cdot 2^b) = \log_2(2^{a+b}) = a + b$ and $c = \log_2(2^a / 2^b) = \log_2(2^{a-b}) = a - b$
  • SQRT: $c = \log_2(\sqrt{2^a}) = \log_2(2^{0.5a}) = 0.5a = a \gg 1$

→ Simple integer operations!

  • Nonlinear ADD, SUB, I2F, F2I
  • function interpolator → large LNS unit (LNU)

Bit layout (1 | 8 | 23 bits): FP uses an integer exponent (8 bits) and an integer mantissa (23 bits); LNS uses a single fixed-point exponent (8.23).
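The MUL, DIV and SQRT rules above can be sketched in a few lines of Python. This is a minimal illustration, not the LNU hardware: the helper names (`lns_encode`, `lns_decode`, `lns_mul`, `lns_div`, `lns_sqrt`) are hypothetical, only positive values are handled (no sign bit, zero or special values), and the exponent is assumed to be stored as an integer with 23 fractional bits.

```python
import math

FRAC_BITS = 23  # assumed 8.23 fixed-point exponent


def lns_encode(x: float) -> int:
    """Encode a positive real as round(log2(x) * 2**FRAC_BITS)."""
    return round(math.log2(x) * (1 << FRAC_BITS))


def lns_decode(e: int) -> float:
    """Convert a fixed-point exponent back to a real value."""
    return 2.0 ** (e / (1 << FRAC_BITS))


def lns_mul(a: int, b: int) -> int:
    return a + b  # log2(2^a * 2^b) = a + b


def lns_div(a: int, b: int) -> int:
    return a - b  # log2(2^a / 2^b) = a - b


def lns_sqrt(a: int) -> int:
    return a >> 1  # log2(sqrt(2^a)) = a / 2, an arithmetic right shift


three, seven = lns_encode(3.0), lns_encode(7.0)
print(lns_decode(lns_mul(three, seven)))        # close to 21
print(lns_decode(lns_div(three, seven)))        # close to 3/7
print(lns_decode(lns_sqrt(lns_encode(16.0))))   # 4.0
```

Every operation is a single integer add, subtract or shift; the only rounding happens when values are encoded.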

SLIDE 6

Precision & Approximation

  • Bilateral filter example: LNS 8.23 (0.5 ulp) “precise” vs. LNS 8.17 (16 ulp) “approximate”
  • Error tolerant applications
  • Full precision not always required

→ Additional tuning knob

SLIDE 7

Contributions

  • Generator framework for automatic generation of “precise” (0.5 ulp) and “approximate” (> 0.5 ulp) LNU instances.
  • Design space exploration of precise / approximate LNUs.
  • 33%-71% smaller LNU (precise) with more functionality than previous designs [8,9,27].
  • Case study: accuracy/performance tradeoffs of a shared LNU in a 65nm CMOS multicore cluster.

[8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
[9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
[27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

SLIDE 8

Problematic LNS Additions/Subtractions

  • $C = A \pm B$ with $A = 2^a$, $B = 2^b$, $C = 2^c$
  • Easy case (ADD): $c = \log_2(2^a + 2^b) = \max(a,b) + f^+(|a-b|)$
  • Hard case (SUB): $c = \log_2(2^a - 2^b) = \max(a,b) + f^-(|a-b|)$

[Figure annotation: critical region]
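Factoring out the larger operand in the two cases above gives $f^\pm(r) = \log_2(1 \pm 2^{-r})$ with $r = |a-b|$. The sketch below evaluates both functions directly in floating point to show why subtraction is the hard case; the function names are illustrative assumptions, exponents are plain floats rather than fixed-point words, and sign handling is left out.

```python
import math


def f_plus(r: float) -> float:
    """f+(r) = log2(1 + 2^-r), smooth and bounded by 1 for r >= 0."""
    return math.log2(1.0 + 2.0 ** -r)


def f_minus(r: float) -> float:
    """f-(r) = log2(1 - 2^-r), defined for r > 0, singular as r -> 0."""
    return math.log2(1.0 - 2.0 ** -r)


def lns_add(a: float, b: float) -> float:
    return max(a, b) + f_plus(abs(a - b))


def lns_sub(a: float, b: float) -> float:
    # Magnitude of the difference only; the result sign is tracked separately.
    return max(a, b) + f_minus(abs(a - b))


a, b = math.log2(5.0), math.log2(3.0)
print(2.0 ** lns_add(a, b))   # close to 8
print(2.0 ** lns_sub(a, b))   # close to 2

# Near r = 0, f+ stays just below 1 while f- drops steeply (the critical region).
print(f_plus(2.0 ** -20), f_minus(2.0 ** -20))
```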

SLIDE 9

Critical Region Decomposition

  • Analytic transformation of $f^-$ into subfunctions
  • Literature:

– Coleman (1995) [5]
– Arnold (1998) [4]
– Vouzis (2007) [7]
– Coleman (2008) [8]
– Ismail (2011) [9]
– Gautschi, Popoff (2016) [27,11]
– This work, using Paliouras (1996) [3]

ASIC complexity, 8.23 bit, 0.5 ulp (synthesis): 94 kGE, 63 kGE, 40 kGE, 27 kGE

SLIDE 10

Critical Region Decomposition

With $r = |a-b|$:

$c = \max(a,b) + f^-(r)$

$c = \max(a,b) + \underbrace{\log_2\left(\frac{1-2^{-r}}{r}\right)}_{\mathrm{cotrans}(r)} + \log_2(r)$

[Figure annotation: critical region]
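Taking $\mathrm{cotrans}(r) = \log_2((1-2^{-r})/r)$, the identity $f^-(r) = \mathrm{cotrans}(r) + \log_2(r)$ can be checked numerically. The sketch below is only a floating-point sanity check with hypothetical helper names; it shows that cotrans(r) stays bounded near $r = 0$ (it approaches $\log_2(\ln 2)$), so the singular part is carried entirely by $\log_2(r)$.

```python
import math

LN2 = math.log(2.0)


def one_minus_pow2_neg(r: float) -> float:
    """1 - 2^-r, computed with expm1 to avoid cancellation for small r."""
    return -math.expm1(-r * LN2)


def f_minus(r: float) -> float:
    return math.log2(one_minus_pow2_neg(r))


def cotrans(r: float) -> float:
    return math.log2(one_minus_pow2_neg(r) / r)


for r in (1e-6, 1e-3, 0.1, 1.0):
    direct = f_minus(r)
    decomposed = cotrans(r) + math.log2(r)
    print(r, direct, decomposed, cotrans(r))
```

A bounded, smooth function is far cheaper to interpolate than one with a singularity, which is the motivation for splitting off $\log_2(r)$.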

SLIDE 11

Function Approximation

  • E.g., 8.23 LNS
  • Different methods:

1) LUT only (very large!)
2) High order polynomial

  • Often high order required
  • Large interpolator delay

3) LUT + piecewise poly

  • Tradeoff: precomputation vs. interpolation
  • Half precision to single precision: 1st-2nd order

[Figure: f(r) plotted over r with intervals labeled d; interpolation error for 1st, 2nd and 3rd order]
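Method 3 (LUT + piecewise polynomial) can be illustrated with its simplest instance, a uniform table combined with 1st-order (linear) interpolation. This is a sketch under assumptions: $f^+(r) = \log_2(1 + 2^{-r})$ on $[0, 16)$ stands in for the functions to be approximated, the table is uniform, and the segment polynomials are chords rather than fitted polynomials. All names are illustrative.

```python
import math


def f_plus(r: float) -> float:
    return math.log2(1.0 + 2.0 ** -r)


def build_lut(func, lo: float, hi: float, segments: int):
    """Precompute func at the segment boundaries (the 'precomputation' side)."""
    step = (hi - lo) / segments
    return [func(lo + i * step) for i in range(segments + 1)], step


def interp1(lut, step: float, lo: float, r: float) -> float:
    """1st-order interpolation between two table entries (the 'interpolation' side)."""
    i = min(int((r - lo) / step), len(lut) - 2)
    t = (r - lo) / step - i
    return lut[i] + t * (lut[i + 1] - lut[i])


def max_error(func, lo: float, hi: float, segments: int, probes: int = 4096) -> float:
    """Largest observed deviation on a uniform probe grid."""
    lut, step = build_lut(func, lo, hi, segments)
    points = (lo + (hi - lo) * k / probes for k in range(probes + 1))
    return max(abs(func(r) - interp1(lut, step, lo, r)) for r in points)


for segments in (16, 64, 256):
    print(segments, max_error(f_plus, 0.0, 16.0, segments))
```

For linear interpolation the error shrinks roughly with the square of the segment width, so each factor of four in table size buys about four more bits of accuracy; higher orders trade table size for interpolator logic and delay.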

SLIDE 12

LNU Generator Framework

  • Specs: bitwidth, accuracy, order
  • Iterative fitting heuristic (similar to [30])
  • Piecewise minimax polynomials (using Sollya [29])

[29] Chevillard et al., “Sollya: An Environment for the Development of Numerical Codes”, ICMS 2010
[30] De Dinechin et al., “Automatic Generation of Polynomial-Based Hardware Architectures for Function Evaluation”, ASAP 2010
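The idea of fitting iteratively against a spec can be sketched as a loop that refines the segmentation until the accuracy target is met. This is not the framework's heuristic: it swaps in chord (endpoint) interpolation for the minimax polynomials, uses uniform segments that are simply doubled, supports only order 1, and takes the ulp as $2^{-\text{frac\_bits}}$ in the LNS domain. The function names and the choice of $f^+$ on $[0, 16)$ are assumptions for illustration.

```python
import math


def f_plus(r: float) -> float:
    return math.log2(1.0 + 2.0 ** -r)


def segment_error(func, x0: float, x1: float, probes: int = 8) -> float:
    """Largest probed deviation between func and the chord through the endpoints."""
    y0, y1 = func(x0), func(x1)
    worst = 0.0
    for k in range(1, probes):
        t = k / probes
        chord = y0 + t * (y1 - y0)
        worst = max(worst, abs(func(x0 + t * (x1 - x0)) - chord))
    return worst


def fit_segments(func, lo: float, hi: float, frac_bits: int, max_ulp: float,
                 limit: int = 1 << 20):
    """Double the number of uniform segments until the error target is met."""
    target = max_ulp * 2.0 ** -frac_bits
    segments = 1
    while segments <= limit:
        step = (hi - lo) / segments
        err = max(segment_error(func, lo + i * step, lo + (i + 1) * step)
                  for i in range(segments))
        if err <= target:
            return segments, err
        segments *= 2
    raise ValueError("accuracy target not reached within the segment limit")


# Half-precision-like fraction width (10 bits), precise vs. approximate target.
print(fit_segments(f_plus, 0.0, 16.0, frac_bits=10, max_ulp=0.5))
print(fit_segments(f_plus, 0.0, 16.0, frac_bits=10, max_ulp=16.0))
```

Relaxing the accuracy target shrinks the table, which is the kind of area saving the approximate LNU instances exploit.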

SLIDE 13

Architecture Template

[Figure labels: Preprocessing Block; Main Interpolator Block; Log/Exp Block; Postprocessing Block]

SLIDE 14

LNS Sub (critical):

$c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

[Figure labels: Main Interpolator Block; Log/Exp Block; Postprocessing Block]

SLIDE 15

LNS Sub (critical):

$c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

[Figure labels: LUTs; Nth order interpolator; Log/Exp Block; Postprocessing Block]

SLIDE 16

LNS Sub (critical):

$c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

[Figure labels: Postprocessing Block]

SLIDE 17

LNS Sub (critical):

$c = \max(a,b) + \mathrm{cotrans}(r) + \log_2(r)$

SLIDE 18

“Precise” 32bit LNU: Features & Comparison

Metric                 ELM [8]    ROM-less [9]   ISSCC’16 [27]                  This Work
Functionality          ADD, SUB   ADD, SUB       F2I, I2F, EXP, LOG, ADD, SUB   F2I, I2F, EXP, LOG, ADD, SUB
Max error [ulp]        0.454      0.498          0.479                          0.45
LUT size [Kbit]        256.4      183.3          113.1                          64.2
Technology             180 nm     180 nm         65 nm                          65 nm
Area [µm²]             904’943    589’357        57’264                         38’592
Post-synthesis [kGE]   97         63             40                             26.8
Min delay [ns]         11.74      7.10           6                              4.5
Max delay [ns]         13.15      14.79          6                              4.5

[8] J.N. Coleman et al., "The European Logarithmic Microprocessor", IEEE TC, 2008
[9] R.C. Ismail et al., "ROM-less LNS", IEEE ARITH, 2011
[27] M. Gautschi, M. Schaffner, F.K. Gürkaynak, L. Benini, ISSCC 2016

SLIDE 19

Design Space: Precision vs. Area

[Figure: area vs. precision (ulp in the LNS domain) at 4.5 ns delay in umc65, post-synthesis; annotations: tipping point 1st → 2nd order; 40%]

SLIDE 20

Case Study: HW Platform

  • Parallel Ultra-Low-Power (PULP) Platform [31]

– 4x 32b OpenRISC cores (in-order)
– 16 kByte shared L1 (TCDM), 16 kByte L2 memory

  • Configurations:

– 1 Shared LNU (Precise, Approx1, Approx2)

  • 4, 3 or 2 pipeline registers
  • Fair round robin arbiter

– 4 Private FPUs (reference)

  • Directly integrated into cores
  • 2 pipeline registers

[Figure labels: PE0 to PE3 with one LNU; PE0 to PE3 with four FPUs]

[31] M. Gautschi et al., “Tailoring Instruction-Set Extensions for an Ultra-Low Power Tightly-Coupled Cluster of OpenRISC Cores,” in VLSI-SoC, 2015

www.pulp-platform.org
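A fair round-robin arbiter for one shared unit can be modeled behaviorally as follows. This is a generic textbook sketch, not the PULP RTL: the class name, the one-grant-per-cycle policy and the rotating priority pointer are assumptions made for illustration.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants one requester per cycle; priority rotates past the last winner."""

    def __init__(self, num_requesters: int) -> None:
        self.n = num_requesters
        self.last = num_requesters - 1  # so requester 0 has priority first

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Search starts right after the previous winner, so nobody can starve.
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None  # no core requested the shared unit this cycle


arbiter = RoundRobinArbiter(4)
print([arbiter.grant([True, True, True, True]) for _ in range(5)])  # [0, 1, 2, 3, 0]
```

With four cores contending every cycle, each one is served once every four cycles; when only some cores request, the idle ones are skipped without wasting a slot.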

SLIDE 21

Chip Complexities

Name      Format    Bitwidth   Precision   Order   Pipeline Stages   FPU/LNU [kGE]   Total Complexity [kGE]
FPU       IEEE754   8.23       0.5 ulp     -       2                 4x11            720
Precise   LNS       8.23       0.5 ulp     2       4                 36              718
Approx1   LNS       8.20       4 ulp*      2       3                 27              708
Approx2   LNS       8.17       16 ulp*     1       2                 23              704

* In the LNS domain
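Because an LNS value is $2^e$, an error of $\delta$ in the exponent scales the value by $2^\delta$, so the relative error is $2^\delta - 1 \approx \delta \ln 2$. The sketch below converts the ulp figures of the three LNU configurations into worst-case relative errors; it assumes one ulp equals $2^{-F}$ for $F$ fractional exponent bits, and the helper name is hypothetical.

```python
import math


def max_relative_error(frac_bits: int, ulps: float) -> float:
    """Relative error of 2**e when e is off by `ulps` units in the last place."""
    delta = ulps * 2.0 ** -frac_bits            # absolute exponent error
    return math.expm1(delta * math.log(2.0))    # 2**delta - 1, computed accurately


configs = {
    "Precise": (23, 0.5),
    "Approx1": (20, 4.0),
    "Approx2": (17, 16.0),
}

for name, (frac_bits, ulps) in configs.items():
    print(f"{name}: {max_relative_error(frac_bits, ulps):.3e}")
```

Under these assumptions the three configurations span roughly three orders of magnitude in relative error, from about 4e-8 to about 8e-5.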

SLIDE 22

Kernel Level Results

  • Pipeline depth is the relevant factor!
  • Energy efficiency gains mainly due to corresponding speedup!

umc65, post-layout

SLIDE 23

Conclusions

  • Generator Framework for precise and approximate LNUs
  • Very compact 8.23bit LNU (33%-71% smaller)
  • Shared setting attractive for LNU
  • Up to 4.2x more energy efficient than private FPU baseline
  • Approximation:

– Additional gains in area, speedup and energy efficiency
– Energy-efficiency gains mainly due to lower latency and speedup
– Less time is needed to complete a task → lower system energy consumption

SLIDE 24

Outlook

  • Vectorization and trigonometric extensions
  • Optimization opportunities for many algorithms to leverage LNS and approximation

PULP Platform: Looking for Collaborators!

  • OpenRISC / RISC-V ISA
  • Open source, silicon proven
  • Extending DSP capabilities…
  • www.pulp-platform.org
  • pulp@pulp.ethz.ch

SLIDE 25

Q&A

Acknowledgements: Nano Tera IcySoC project

SLIDE 26

Backup Slides

SLIDE 27

Outline

  • Motivation
  • Preliminaries: LNS Add/Sub and Interpolation
  • LNU Architecture and Generator Framework
  • Multicore Hardware Platform
  • Results
  • Conclusion
  • Q&A

SLIDE 28

Private FPUs

[Figure labels: Core 0 to Core 3, each with an FPU; INT operations; HDR-ADD/SUB/MUL; 50%]

SLIDE 29

Private LNUs

  • Area: 1 LNU < 4× standard IEEE compliant FPU (no DIV)
  • Poor LNU utilization ~ 0.2

[Figure labels: Core 0 to Core 3; FPU; LNU; INT operations; HDR MUL/DIV/SQRT; ADD/SUB]

SLIDE 30

Shared LNU

[Figure labels: Core 0 to Core 3; Interconnect; Arbiter; LNU; INT operations; HDR-MUL/DIV/SQRT; HDR-ADD/SUB/I2F/F2I]

SLIDE 31

Design Space Exploration

  • Bitwidth:

– Half to Single Precision: 5.10 to 8.23

  • Accuracy:

– Precise (0.5 ulp) and Approximate (up to 16 ulp)

  • Order:

– 1st/2nd Order Interpolation

SLIDE 32

Design Space: Area vs. Delay

[Figure: area vs. delay for Precise, Approx1 and Approx2; * marks the required number of pipeline stages for the 500 MHz target]

SLIDE 33

Kernels

  • Linear Algebra: AXPY, GEMM, GEMV, DotP
  • Matrix Factorizations: Chol, QR
  • Geometry: Homographies, Distances, Projection Errors
  • Image: Gradient Magnitude, Bilateral, FIR
  • Audio: Butterworth, Sine, DCT-II
  • Other: Radial Basis Functions

[Figure annotations: 50%, 25%]

SLIDE 34

LNU PULP Chips

  • Selene (ISSCC’16 [27]): UMC 65nm, 4 OpenRISC Cores, 1 shared 32bit LNU
  • Phoebe: UMC 65nm, 4 OpenRISC Cores, 1 shared 32bit LNUv2, 1 shared 2x16bit LNUv2

SLIDE 35

Comparison with SFU

Caro et al. 2009:

  • Functionality: SQRT, INVSQRT, INV, LOG, EXP, SQRT2, INVSQRT2
  • Format: IEEE754, 8.23
  • Precision: 1.0 ulp
  • Order: 2
  • NaN, INF support: no
  • Post-layout [kGE]: 36.3

LNU:

  • Functionality: ADD, SUB, F2I, I2F, LOG, EXP, INV*, INVSQRT*, SQRT*
  • Format: LNS, 8.23
  • Precision: 0 - 0.75 ulp
  • Order: 2
  • NaN, INF support: yes
  • Post-layout [kGE]: 36

* Evaluated in integer cores

Reference: D. D. Caro, N. Petra, and A. G. M. Strollo, “High-Performance Special Function Unit for Programmable 3-D Graphics Processors,” IEEE TCAS I, vol. 56, no. 9, pp. 1968–1978, Sept 2009.

SLIDE 36

PULP Architecture with shared LNU

[Figure labels: 4 Core Cluster and L1 Memory; Periphery and L2 Memory]

SLIDE 37

PULP Architecture with shared LNU

SLIDE 38

LNS Example

  • IEEE 754 float: $33.3333\ldots = (-1)^0 \cdot (1 + 0.0416666) \cdot 2^5$, bit pattern 01000010000001010101010101010101
  • LNS: $33.3333\ldots = (-1)^0 \cdot 2^{5.0588936}$, bit pattern 00000010100001111000100111010100
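Both bit patterns in this example can be reproduced with a short script. The LNS layout used here (one sign bit followed by an unbiased 8.23 fixed-point value of $\log_2|x|$, rounded to nearest) is an assumption inferred from the example itself; zero and special values are not covered, and the function names are hypothetical.

```python
import math
import struct


def float32_bits(x: float) -> str:
    """IEEE 754 single-precision bit pattern of x."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return format(word, "032b")


def lns_bits(x: float, frac_bits: int = 23) -> str:
    """Assumed LNS word: sign bit, then log2|x| as a 31-bit fixed-point number."""
    sign = 1 if x < 0 else 0
    exponent = round(math.log2(abs(x)) * (1 << frac_bits))
    word = (sign << 31) | (exponent & 0x7FFFFFFF)  # two's-complement wrap for x < 1
    return format(word, "032b")


x = 100.0 / 3.0  # 33.3333...
print(float32_bits(x))  # 01000010000001010101010101010101
print(lns_bits(x))      # 00000010100001111000100111010100
```

The upper bits after the sign read 00000101 (= 5) in both formats, while the lower 23 bits differ: a linear mantissa fraction 0.0416666 in the float versus the fractional exponent 0.0588936 in LNS.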

SLIDE 39

Accuracy Impact (1)

SLIDE 40

Accuracy Impact (2)