slide-1
SLIDE 1

A general-purpose method for faithfully rounded floating-point function approximation in FPGAs

David B. Thomas Imperial College London

David Thomas, Imperial College, dt10@ic.ac.uk 1

slide-2
SLIDE 2

FloPoCo : Parameterised primitives


slide-5
SLIDE 5

FloatApprox : Parameterised anything

[Figure labels: input format, approximation interval, output format]


slide-7
SLIDE 7

FloatApprox

  • Architecture for FPGA function approximation

– Deeply pipelined
– Floating-point in and out
– Faithfully rounded

  • Method and tool for approximating functions

– Handles most twice-differentiable functions
– Completely automated: expression to VHDL
– Designed for reliability rather than optimality

slide-8
SLIDE 8
  • 1. Motivation
  • 2. The FloatApprox approach

– 2.1. Range reduction and approximation method
– 2.2. Evaluation architecture

  • 3. Evaluation in hardware


slide-10
SLIDE 10

Context: FPGA accelerators

  • Mathematical or algorithmic specification
  • Convert to HLS or VHDL implementation

– Rely on optimised IP for floating-point
– Integrated at link-time into the final design

  • Intellectual challenges for accelerator design

– Managing memory accesses and bandwidth
– Rewriting to tolerate latency of operators
– Keeping pipelines occupied
– Not: designing low-level IP cores

slide-11
SLIDE 11

Floating-point IP: Requirements

  • Faithfully rounded

– Make every bit count
– Tractable error analysis

  • Pipelined for 150 MHz+ clock rate

– Must be pipelined: RAM and DSPs are multi-cycle
– Synthesis tools have limited retiming capability

  • Working RTL (circuit) implementation

– A paper can’t be synthesised
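As a rough illustration of what "faithfully rounded" demands: a result is faithful if it equals the exact value when that value is representable, and is otherwise one of the two floating-point numbers on either side of it. The sketch below checks this for double precision. The helper name `is_faithful` is an assumption made for illustration and is not part of FloatApprox; it needs Python 3.9+ for `math.nextafter`.

```python
import math
from fractions import Fraction

def is_faithful(approx: float, exact: Fraction) -> bool:
    """True if the finite double `approx` is a faithful rounding of `exact`."""
    a = Fraction(approx)              # exact rational value of the double
    if a == exact:
        return True
    # Step one double towards the exact value. The result is faithful only
    # if the exact value lies strictly between approx and that neighbour.
    toward = math.inf if exact > a else -math.inf
    neighbour = Fraction(math.nextafter(approx, toward))
    if exact > a:
        return a < exact < neighbour
    return neighbour < exact < a

third = Fraction(1, 3)
nearest = 1 / 3                       # correctly rounded, hence also faithful
print(is_faithful(nearest, third))
```

Either neighbour of 1/3 passes the check, but a value two steps away does not, which is the sense in which faithful rounding is weaker than correct rounding yet still bounds the error below one unit in the last place.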


slide-13
SLIDE 13

A fable...

Subject: Floating-point log1p?
To: dt10@ic.ac.uk
From: phd-slash-industry-bod@somewhere.com
Body:
I’m converting some code for an accelerator, and it uses log1p. Can I use your core from that PoC you did a while back?

slide-14
SLIDE 14

A fable...

Subject: Re: Floating-point log1p?
To: phd-slash-industry-bod@somewhere.com
From: dt10@ic.ac.uk
Body:
Afraid that was written in Handel-C, I don’t have any VHDL. You could recreate it using the attached Maple script, plus write a code gen.
> I’m converting some code for an accelerator, and
> it uses log1p. Can I use your core from that
> PoC you did a while back?

slide-15
SLIDE 15

A fable...

Subject: Re: Floating-point log1p?
To: phd-slash-industry-bod@somewhere.com
From: dt10@ic.ac.uk
Body:
Any luck?
> Afraid that was written in Handel-C, I don’t
> have any VHDL. You could recreate it using
> the attached Maple script, plus write a code gen.
> > I’m converting some code for an accelerator, and
> > it uses log1p. Can I use your core from that
> > PoC you did a while back?

slide-16
SLIDE 16

... becomes a nightmare

Subject: Re: Floating-point log1p?
To: phd-slash-industry-bod@somewhere.com
From: dt10@ic.ic.ac.uk
Body:
Oh, we don’t have Maple. It’s ok, I found out log1p(x)=log(1+x), and just did that. Works fine.
> Any luck?
> > Afraid that was written in Handel-C, I don’t
> > have any VHDL. You could recreate it using
> > the attached Maple script, plus write a code gen.
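The slides leave the reason for the nightmare implicit. One plausible reading is the familiar cancellation problem: for small x, the sum 1+x is rounded before the logarithm is taken, so log(1+x) loses most of its significant digits. A minimal check in double precision, using only Python's `math` module (the test value `1e-10` is an arbitrary choice):

```python
import math

x = 1e-10
naive = math.log(1.0 + x)    # 1.0 + x is rounded before the log is applied
accurate = math.log1p(x)     # library routine designed for small x

rel_err = abs(naive - accurate) / accurate
print(f"log(1+x) = {naive!r}")
print(f"log1p(x) = {accurate!r}")
print(f"relative error of the naive form: {rel_err:.1e}")
```

The relative error of the naive form is many orders of magnitude larger than double-precision rounding error, so a core built this way is nowhere near faithfully rounded for small inputs, even though it "works fine" in casual testing.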

slide-17
SLIDE 17

What IP is available?

Function        Source      Pipelined   Faithful   RTL
add, mul, div   FloPoCo     Yes         Yes        Yes
log, exp        FloPoCo     Yes         Yes        Yes
sin, cos        FPLibrary   No          Yes        Yes
                Altera      Yes         Yes        Altera flow only
                Xilinx      Yes         ?          Vivado HLS only
log1p           Altera      Yes         Yes        Altera flow only
expm1           Altera      Yes         No         OpenCL only
erf             Altera      Yes         No         OpenCL only



slide-22
SLIDE 22

Motivation for FloatApprox

  • We currently have: +, -, *, /, log, exp

– Use existing IP: FloPoCo, Xilinx, Altera, ...

  • We should have: log1p, expm1, erf, sin, acos, ...

– What FloatApprox does badly...
... but better than anything else available

  • What I want: sqrt(-2 log(x)), 1/(1+exp(-x))

– What FloatApprox does well


slide-24
SLIDE 24

Goals of FloatApprox

  • As a tool

– Convert any function f(x) to RTL
– Able to handle most smooth functions

  • Smooth = twice differentiable for our purposes

– Suitable for automated use

  • Input: data-types, range, function
  • Output: faithfully rounded circuit

  • As generated IP

– Pipelined
– Faithfully rounded
– Working RTL

slide-25
SLIDE 25

FloatApprox: requirements

  • User can specify any target function
  • Parameterised floating-point representation

– Input and output formats can be distinct

  • Portable between platforms
  • Usable from many languages
  • Open-source
  • Low latency
  • Minimal resource

slide-26
SLIDE 26

Architecture and Approximation

  • Architecture:

– General template for creating any approximator

  • Approximation:

– Configuring the template for a given function


slide-32
SLIDE 32

FloatApprox : Approximation

  • Given a target function f_t, how do we create the approximation f_a?
  • Segment the function so that segments are:
  • 1. Contained in one input binade
  • 2. Monotonically increasing or decreasing in range
  • 3. Contained in one output binade
  • 4. FaithfulReal: can be faithfully approximated with a real-coefficient polynomial of degree d
  • 5. FaithfulFixed: can be faithfully approximated with a fixed-point polynomial of degree d

slide-33
SLIDE 33

Example: Input function over reals

[Figure: plot of the example function y(x), with x axis ticks from 2 to 16 and y axis ticks from 0.5 to 4.0]

slide-34
SLIDE 34

Move to float representation

[Figure: log-log plot with both axes marked at powers of two from 2^-4 to 2^4]

slide-35
SLIDE 35

1 : Segment using input binades

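A minimal sketch of this first step, assuming a strictly positive input interval and using double precision in place of the tool's parameterised formats. A binade here means an interval [2^e, 2^(e+1)); the function name `input_binades` is hypothetical.

```python
import math

def input_binades(lo: float, hi: float):
    """Split [lo, hi), with 0 < lo < hi, into pieces that each lie in one binade."""
    assert 0.0 < lo < hi
    pieces = []
    start = lo
    while start < hi:
        # frexp gives start = m * 2**e with 0.5 <= m < 1, so the binade
        # containing start ends at 2**e.
        _, e = math.frexp(start)
        binade_end = math.ldexp(1.0, e)
        end = min(binade_end, hi)
        pieces.append((start, end))
        start = end
    return pieces

print(input_binades(0.75, 5.0))
# [(0.75, 1.0), (1.0, 2.0), (2.0, 4.0), (4.0, 5.0)]
```

Within each piece the input exponent is constant, so only the significand varies, which is what lets the later stages treat the input as a fixed-point number.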

slide-36
SLIDE 36

2 : Make segments monotonic


slide-37
SLIDE 37

3 : Segment using output binades

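Once a segment is monotonic, the points where the output crosses a power of two can be located by bisection. The sketch below is an assumption about how such a search might look in plain double precision; the function names are hypothetical, the function is assumed positive on the segment, and the real tool works in multi-precision arithmetic.

```python
import math

def crossing(f, lo, hi, target, iters=200):
    """Bisection for the x in [lo, hi] where monotonic f crosses target."""
    increasing = f(lo) < f(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:      # lo and hi are adjacent doubles
            break
        if (f(mid) < target) == increasing:
            lo = mid                    # crossing lies to the right of mid
        else:
            hi = mid                    # crossing lies at or to the left of mid
    return hi

def output_binade_cuts(f, lo, hi):
    """Cut points in (lo, hi) where a positive monotonic f crosses a power of two."""
    ya, yb = sorted((f(lo), f(hi)))
    assert ya > 0.0
    _, e = math.frexp(ya)               # ya lies in [2**(e-1), 2**e)
    cuts = []
    p = math.ldexp(1.0, e)              # first power of two above ya
    while p < yb:
        cuts.append(crossing(f, lo, hi, p))
        p *= 2.0
    return sorted(cuts)

print(output_binade_cuts(math.exp, 0.0, 2.0))   # cuts near ln 2 and ln 4
```

Between consecutive cuts the output exponent is constant, so each resulting segment only has to produce a significand.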


slide-41
SLIDE 41

4 : Split to degree d polynomials

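The splitting step can be pictured as a recursive bisection: try a degree-d polynomial on the segment, and split in half if it is not accurate enough. The sketch below is a simplified stand-in, not the tool's method: it uses interpolation at Chebyshev nodes instead of a minimax fit, and it checks the error at a finite number of sample points instead of proving a bound with interval arithmetic. All names and the `samples` parameter are assumptions.

```python
import math

def cheb_interp(f, lo, hi, d):
    """Degree-d interpolant of f at d+1 Chebyshev nodes on [lo, hi]."""
    xs = [0.5 * (lo + hi)
          + 0.5 * (hi - lo) * math.cos(math.pi * (2 * k + 1) / (2 * d + 2))
          for k in range(d + 1)]
    ys = [f(x) for x in xs]

    def p(x):
        # Lagrange form: adequate for the small degrees used here.
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            term = yj
            for m, xm in enumerate(xs):
                if m != j:
                    term *= (x - xm) / (xj - xm)
            total += term
        return total
    return p

def split_segments(f, lo, hi, d, rel_tol, samples=64):
    """Bisect [lo, hi] until a degree-d interpolant meets rel_tol at every sample."""
    p = cheb_interp(f, lo, hi, d)
    pts = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    if all(abs(p(x) - f(x)) <= rel_tol * abs(f(x)) for x in pts):
        return [(lo, hi)]
    mid = 0.5 * (lo + hi)
    return (split_segments(f, lo, mid, d, rel_tol, samples)
            + split_segments(f, mid, hi, d, rel_tol, samples))

segs = split_segments(math.sqrt, 1.0, 2.0, d=2, rel_tol=2.0 ** -20)
print(len(segs), "segments")
```

A sampled check like this can miss error peaks between samples, which is presumably why the slides stress that the real method should be faithful by construction.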


slide-44
SLIDE 44

  • Segments form a partition of the input domain
  • Segment domains and ranges each cover one binade
  • All segments can be faithfully calculated as a degree-d fixed-point polynomial


slide-47
SLIDE 47

Real-world issues

  • Lots of corner cases to worry about

– Crossing from negative to positive to NaN is fun
– Method should be faithful by construction

  • Calculations performed using MPFR and Sollya

– Mostly interval arithmetic via Sollya
– Occasionally bisection search in MPFR

  • Speed of approximation is an issue

– Single precision takes minutes
– Double precision takes hours

slide-48
SLIDE 48

FloatApprox : Architecture

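[Figure: block diagram with three stages: Segmentation, Table-Lookup and Fixed-Point Polynomial. The input and output words are split into the fields s, flags, expnt and significand. Segmentation finds the segment i with ⌊S_i⌋ ≤ x ≤ ⌈S_i⌉, the table supplies the coefficients c_0, c_1, ..., c_d, and the polynomial stage computes y = ∑ c_i x^i.]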

slide-49
SLIDE 49

Compile-time configuration


slide-50
SLIDE 50

Evaluation: Input

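To make the input fields concrete, the sketch below splits a value into sign, exponent and fraction. It assumes IEEE 754 binary32 purely for illustration: the slides show a separate flags field, whereas binary32 encodes special values in the exponent, so the `flags` helper here derives an equivalent classification. Both function names are hypothetical.

```python
import struct

def unpack_float32(x: float):
    """Return (sign, biased_exponent, fraction) of x encoded as IEEE 754 binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

def flags(exponent: int, fraction: int) -> str:
    """Classify a binary32 value from its exponent and fraction fields."""
    if exponent == 0xFF:
        return "nan" if fraction else "inf"
    if exponent == 0:
        return "subnormal" if fraction else "zero"
    return "normal"

print(unpack_float32(1.0))    # (0, 127, 0)
print(unpack_float32(-6.0))   # (1, 129, 4194304)
```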

slide-51
SLIDE 51

Evaluation: Segmentation

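In software terms, the segmentation stage maps an input to the index of the segment that contains it. A minimal sketch using binary search over sorted segment start points follows; the boundary values and the table contents are made up for illustration, and how the hardware implements the comparison is not stated in the slides.

```python
import bisect

def segment_index(starts, end, x):
    """Index i such that starts[i] <= x < starts[i+1] (or < end for the last segment)."""
    if not (starts[0] <= x < end):
        raise ValueError("x is outside the approximated interval")
    return bisect.bisect_right(starts, x) - 1

# Hypothetical segment boundaries and one table entry per segment.
starts = [0.75, 1.0, 2.0, 4.0]
end = 5.0
table = ["coefficients for segment %d" % i for i in range(len(starts))]

print(table[segment_index(starts, end, 3.0)])   # coefficients for segment 2
```

The index is then all that the table-lookup stage needs to fetch the coefficients for that segment.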


slide-57
SLIDE 57

Evaluation: Table Lookup


slide-58
SLIDE 58

Evaluation: Fraction

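The polynomial y = ∑ c_i x^i can be evaluated in fixed point with Horner's rule. The sketch below uses a single uniform format with `frac_bits` fractional bits and truncation after each multiply; that uniform format is an assumption for illustration, since a generated circuit would normally size each stage separately. The helper names are hypothetical.

```python
def to_fixed(v: float, frac_bits: int) -> int:
    """Round a real value to an integer scaled by 2**frac_bits."""
    return round(v * (1 << frac_bits))

def horner_fixed(coeffs, x, frac_bits):
    """Evaluate sum(coeffs[i] * x**i) on integers scaled by 2**frac_bits."""
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        # Multiply, truncate the product back to frac_bits, then add.
        acc = ((acc * x) >> frac_bits) + c
    return acc

F = 16
coeffs = [to_fixed(1.0, F), to_fixed(0.5, F), to_fixed(0.25, F)]
y = horner_fixed(coeffs, to_fixed(0.5, F), F)
print(y / (1 << F))   # 1.3125, i.e. 1 + 0.5*0.5 + 0.25*0.5**2
```

Every truncation contributes a small error, so the coefficient and intermediate widths have to be chosen with those errors in mind if the final result is to stay faithful.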

slide-59
SLIDE 59

Evaluation: Flags and Exponent



slide-62
SLIDE 62

Architectural pros and cons

  • Key strengths of the architecture:

– Simplicity: building and verifying is very simple
– Generality: it can handle any function
– Speed: very easy to make it fast

  • Weaknesses of the architecture:

– Table size: exponential in exponent width
– Table blow-up: periodic functions are impractical
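Both weaknesses follow from the segmentation rules and can be estimated with a little arithmetic. Every input binade needs at least one segment, and the number of binades doubles with each extra exponent bit; a periodic function such as sin(x) adds a turning point every π, so a binade of width 2^k needs roughly 2^k/π monotonic pieces. The sketch below assumes an IEEE-style format where the all-zeros and all-ones exponent codes are reserved; the function names are hypothetical.

```python
import math

def binades_per_sign(exponent_bits: int) -> int:
    """Normal binades per sign when two exponent codes are reserved."""
    return 2 ** exponent_bits - 2

def sin_monotonic_pieces(k: int) -> int:
    """Approximate count of monotonic pieces of sin(x) on [2**k, 2**(k+1))."""
    return max(1, round(2.0 ** k / math.pi))

print(binades_per_sign(8), binades_per_sign(11))   # 254 2046
print(sin_monotonic_pieces(10))                    # 326
```

The second figure keeps doubling with k, which is why approximating a periodic function over a wide input range is impractical with this architecture.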


slide-66
SLIDE 66

Evaluation: Test Method

  • Three classes of function

– Primitives: faithfully rounded IP available
– Composite: can express in terms of available IP
– Approximate: no direct method for evaluation

  • Source of reference IP is FloPoCo

– Open source; good performance; portable

  • Approximations are “found on the internet”

– i.e. Abramowitz and Stegun

  • All results are post place-and-route in Virtex-6

slide-67
SLIDE 67

Class         Name      Method                        Interval
Primitive     log       IP core                       [0,∞]
              exp       IP core                       [-∞, ∞]
Composite     normpdf   exp(-x^2)/sqrt(2pi)           [-16,16]
              sigmoid   1/(1+exp(-x))                 [-∞,+∞]
              log1p     log(x+1)                      [-1,+∞]
              expm1     exp(x)-1                      [-∞,+∞]
Approximate   sin       mul:7, add:5                  [-π,+π]
              cos       mul:6, add:5                  [-π,+π]
              erf       mul:7, add:7, inv:1, exp:1    [-32,+32]

slide-68
SLIDE 68

Name      LUTs          BRAM       DSP         Latency
log       1376 (1.7x)   12 (12x)   10 (2.0x)   181 (2.2x)
exp       1372 (3.4x)   24 (24x)   9 (9.0x)    216 (5.9x)
normpdf   1498 (1.8x)   25 (25x)   11 (2.8x)   190 (2.2x)
sigmoid   1259 (0.8x)   5 (5x)     9 (9.0x)    165 (1.1x)
log1p     2203 (1.9x)   10 (10x)   17 (3.4x)   214 (1.3x)
expm1     1304 (1.8x)   12 (12x)   9 (9.0x)    173 (2.5x)
sin       1366 (0.5x)   5 (∞x)     10 (0.8x)   189 (0.8x)
cos       1220 (0.5x)   5 (∞x)     9 (0.8x)    200 (0.8x)
erf       881 (0.2x)    4 (4x)     6 (0.6x)    149 (0.4x)



slide-74
SLIDE 74

Name                             LUTs   BRAM   DSP    Latency   Comment
log, exp                         2.5x   18x    10x    4x        Worse in every way
normpdf, sigmoid, log1p, expm1   2x     10x    5x     1.8x      Poor resource utilisation; much better accuracy
sin, cos, erf                    0.4x   ~5x    0.8x   0.7x      More accurate and smaller except for RAMs


slide-78
SLIDE 78

Current and Future Work

  • Optimisation of final segments

– Many segments contain 0 or +Inf
– Segments could be merged, coefficients shared

  • Improve fixed-point polynomial evaluation

– Range of intermediate polynomial stages is large
– Can apply cheap pre-scaling to normalise

  • Export as packaged IP cores

– Add AXI and Avalon stream interfaces

  • Support generation of HLS-compatible C code


slide-81
SLIDE 81

Conclusion

  • General method for function approximation

– Parameterisable template architecture
– Method for generating parameters from function

  • Faithfully rounded by construction

– OK method for creating primitives that don’t exist
– Good method for creating complex functions

  • Currently in FloPoCo on an obscure branch

– Hopefully going to roll into trunk some time soon