Hardware and FPGAs computing just right (ICERM 2020, Florent de Dinechin). PowerPoint presentation transcript.


slide-1
SLIDE 1

Hardware and FPGAs computing just right

ICERM 2020

Florent de Dinechin

slide-2
SLIDE 2

Everybody is wondering what I’m doing here

Indeed, they probably invited me because I used to write software, too... But today I want to talk about hardware. I want to show you that

  • 1. variable precision is a core issue in hardware design (for some meaning of “variable”)
  • 2. hardware design can be fun
  • F. de Dinechin

Hardware and FPGAs computing just right 2 / 33

slide-3
SLIDE 3

Hardware and FPGAs?

Hardware is, well, hardware. I assume that you have a vague idea...

bits, gates, wires, etc.

FPGAs (Field-Programmable Gate Arrays) are a kind of reconfigurable hardware

you can program it, just like your processor, but the programming model is the circuit

This talk is really about designing circuits. Circuits that compute. Just right.


slide-4
SLIDES 4-5

Computing just right?

This is the pathetic logo of the FloPoCo project:

[Logo: a collage of expressions such as e^x, √(x²+y²+z²), πx, sin(x+y), Σᵢ₌₀ⁿ xᵢ, √x, log x]

(the proper term is probably allogory)

This is the kind of thing FloPoCo does → It is a floating-point exponential operator where each wire, each component is tailored to its context with love and care.

(not a very good logo either)

[Figure: datapath of the floating-point exponential. X is unpacked from flp(wE, wF) and shifted to fixed point; E is computed via ×1/log(2); Y = X − E × log(2); Y is split into A and Z; e^A is tabulated and e^Z − Z − 1 is evaluated from a truncated Z; products and sums feed a normalize-round-pack stage producing the result R. Every wire is annotated with its fixed-point format, e.g. ufix(0, −wF), sfix(−1, −wF − g), ufix(−k − 1, −wF − g).]

slide-6
SLIDES 6-8

Save power! Don’t move useless bits around!

What is true for transatlantic cat videos is also true inside a circuit.

In software, if your result is correct, it is probably wasteful

Did you really need bits 18 to 31 of this 32-bit word? If they carry useless noise, you don’t want to compute them... ... and you want even less to compute on them. But in software, you don’t really have the choice (it’s 32 bits or 64 bits).

Hardware is fun, it is all about freedom

In a circuit, we may choose, for each variable, how many bits are computed/stored/transmitted! → the opportunities
Overwhelming freedom! Help! → the challenges
Note that this is variable precision in space, but not in time.

slide-9
SLIDE 9

Introduction: not your PC’s exponential

Introduction: not your PC’s exponential
Some opportunities of hardware computing just right
Error analysis for dummies (and other proof assistants)
Conclusions
Backup: crash-introduction to FPGAs

slide-10
SLIDE 10

First, a math proficiency test

Three identities to remember from our happy school days

2^X = e^(X log 2)        (1)
e^(A+B) = e^A × e^B        (2)
e^Z ≈ 1 + Z + Z²/2  if Z is small        (3)
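These three identities drive the whole range reduction that follows; here is a quick numerical sanity check (plain Python, not from the talk):

```python
import math

# (1) 2^X = e^(X·log 2)
X = 1.75
assert abs(2**X - math.exp(X * math.log(2))) < 1e-12

# (2) e^(A+B) = e^A · e^B
A, B = 0.3, -0.2
assert abs(math.exp(A + B) - math.exp(A) * math.exp(B)) < 1e-15

# (3) e^Z ≈ 1 + Z + Z²/2 when Z is small: the residual is about Z³/6
Z = 2.0**-10
approx = 1 + Z + Z * Z / 2
assert abs(math.exp(Z) - approx) < Z**3
```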

slide-11
SLIDES 11-27

[Figure: the exponential datapath of the previous slide; each step below highlights one of its boxes.]

We want to obtain e^X as e^X = 2^E · 1.F

Compute E ≈ X / log 2, then Y ≈ X − E × log 2.
Now e^X = e^(E log 2 + Y) = e^(E log 2) · e^Y = 2^E · e^Y

Now we have to compute e^Y with Y ∈ (−1/2, 1/2).

Split Y: write Y = A + Z with Z < 2^−k (A keeps the bits of weight 2^−1 down to 2^−k, Z the bits of weight 2^−k−1 down to 2^−wF−g), so e^Y = e^A × e^Z

Tabulate e^A in a ROM.

Evaluation of e^Z: Z < 2^−k, so e^Z ≈ 1 + Z + Z²/2.
Notice that e^Z − 1 − Z ≈ Z²/2 < 2^−2k.
Evaluate e^Z − Z − 1 somehow (out of Z truncated to its higher bits only), then add Z to obtain e^Z − 1.

Also notice that e^Z = 1.000...000zzzz (with k−1 zeroes after the point).
Evaluate e^A × e^Z as e^A + e^A × (e^Z − 1)
(before the product, truncate e^A to the precision of e^Z − 1).

And that’s it, we have E and e^Y (using only fixed-point computations).

Don’t worry if you got lost: this talk is not about this algorithm, but about its details.

  • F. de Dinechin
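The chain of steps above can be mimicked in software. Below is a minimal sketch (the function name `exp_just_right` is made up; the names E, Y, A, Z follow the slides, and the fixed-point truncations of the real operator are not modeled):

```python
import math

def exp_just_right(X, k=6):
    """Sketch of the slides' range reduction: e^X = 2^E * e^A * e^Z."""
    # E ≈ X / log 2, so that Y = X - E*log 2 lies in (-1/2, 1/2)
    E = round(X / math.log(2))
    Y = X - E * math.log(2)
    # Split Y = A + Z, where A keeps the k leading fractional bits
    A = math.floor(Y * 2**k) / 2**k
    Z = Y - A                          # 0 <= Z < 2^-k
    eA = math.exp(A)                   # tabulated in a ROM in hardware
    eZm1 = Z + (math.exp(Z) - Z - 1)   # e^Z - 1 = Z + (e^Z - Z - 1)
    M = eA + eA * eZm1                 # e^A * e^Z as e^A + e^A*(e^Z - 1)
    return 2.0**E * M

for x in (-3.2, 0.0, 0.5, 4.7):
    assert abs(exp_just_right(x) - math.exp(x)) < 1e-12 * math.exp(x)
```

In the hardware version, `math.exp(A)` is a table lookup and `math.exp(Z) - Z - 1` is the small tabulated/approximated term; everything else is fixed-point adds and one truncated multiply.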

slide-28
SLIDE 28

Some opportunities of hardware computing just right

Introduction: not your PC’s exponential
Some opportunities of hardware computing just right
Error analysis for dummies (and other proof assistants)
Conclusions
Backup: crash-introduction to FPGAs

slide-29
SLIDES 29-33

Opportunity #1: Over-parameterization

[Figure: the exponential datapath, zooming in on its two multipliers and their fixed-point input/output formats; the annotated sizes are 14, 12 and 56 bits.]

Example: Multipliers of all shapes and sizes

In a double-precision exponential, wE = 11, wF = 52:
first multiplier 14 bits in, 12 bits out
second multiplier 12 bits in, 56 bits out ... and truncated left and right
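A "truncated left and right" multiplier can be sketched at the bit level as follows (an illustration, not FloPoCo's actual generator: here only the result is truncated, whereas real truncated multipliers also drop low-order partial products to save area):

```python
def truncated_mult(x, y, wx, wy, wout):
    """Multiply two wx- and wy-bit unsigned integers but keep only the
    wout most-significant bits of the full (wx+wy)-bit product."""
    full = x * y                  # exact product, wx+wy bits
    drop = wx + wy - wout         # low-order bits discarded
    return full >> drop

# A 14-bit x 14-bit product truncated to a 12-bit output:
x, y = 0x3FFF, 0x2AAA
exact = x * y
approx = truncated_mult(x, y, 14, 14, 12) << (14 + 14 - 12)
# The truncation error is bounded by one ulp of the output format
assert 0 <= exact - approx < 2**(14 + 14 - 12)
```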

slide-34
SLIDES 34-36

Over-parameterization is cool

⊖ OK, there is a bit more work involved in designing a parametric operator
To start with, it must be a hardware-generating program

⊕ Direct benefit to end-users: freedom of choice
People used to publish “An exponential architecture for single precision”; the standard is now “A family of exponential architectures for each precision”. Application-specific optimal, future-proof, etc.

⊕ It actually simplifies the design of composite operators (e.g. the exponential)
No need to take any dramatic decision in the design phase: you don’t know how many bits on this wire make sense? Keep it open as a parameter. Then estimate cost and accuracy as a function of the parameters. Then find the optimal values of the parameters, e.g. using ILP or common sense (whichever gives the best results).
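The "keep it open as a parameter, then optimize" workflow can be sketched with a toy model: pick the table-address width k that satisfies an accuracy constraint at minimal cost. Both the cost formulas and the constraint below are invented for illustration; real generators use measured cost models and ILP:

```python
def best_k(wF, g=3, k_range=range(2, 20)):
    """Toy parameter search: a table addressed by k bits costs 2^k entries;
    a larger k shrinks the residual e^Z - Z - 1 (about 2^-2k) but blows up
    the table.  Return the cheapest k meeting a toy accuracy constraint."""
    candidates = []
    for k in k_range:
        table_cost = 2**k * (wF + g)     # ROM bits (made-up model)
        mult_cost = (wF + g - k)**2      # truncated multiplier area (made-up)
        residual = 2.0**(-2 * k)         # magnitude of e^Z - Z - 1
        if residual < 2.0**-(wF // 2):   # toy accuracy constraint
            candidates.append((table_cost + mult_cost, k))
    return min(candidates)[1]

# Single precision (wF = 23): the search trades table size against multiplier size
assert best_k(23) == 6
```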

slide-37
SLIDES 37-40

Opportunity #2: Operator specialization

Ha, that’s something you software people don’t get!

Multiplication by a constant
multiplication by integers: 17X = (X ≪ 4) + X
but also by reals such as log(2) or sin(42π/256) for the FFT
Two main techniques, tens of papers

Division by 3 (for various values of 3)
in floating point for Jacobi and other stencils
integer (quotient and remainder) for addressing in 3 memory banks

A squarer is a multiplier specialization

[Figure: a squarer x → x², illustrated by the pencil-and-paper multiplication 321 × 321 = 103041, whose partial products 321, 642, 963 exhibit the symmetry a squarer exploits.]

Specialization of elementary functions to specific domains
...
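The 17X trick generalizes: a constant can be recoded into a few additions and subtractions of shifted copies of X. A small sketch using canonical signed-digit (CSD) recoding, which is one classical technique in this family (the slides do not name it, so this is an illustrative choice):

```python
def mult_by_17(x):
    # 17·X = (X << 4) + X : one shift, one adder, no multiplier
    return (x << 4) + x

def csd_digits(c):
    """Canonical signed-digit recoding of a positive integer constant:
    digits in {-1, 0, +1}, with no two adjacent nonzero digits, which
    minimizes the number of adders/subtractors."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)      # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits                # little-endian: digit i has weight 2^i

assert mult_by_17(1000) == 17000
# Rebuilding the constant from its CSD digits recovers it exactly
assert sum(d << i for i, d in enumerate(csd_digits(17))) == 17
assert sum(d << i for i, d in enumerate(csd_digits(45))) == 45
```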

slide-41
SLIDES 41-45

Opportunity #3: target-specific optimizations

[Figure: the exponential datapath again.]

Modern FPGAs also have small multipliers with pre-adders and post-adders ... and dual-ported small memories

Single-precision accurate exponential on Xilinx:
  • one block RAM (0.1% of the chip)
  • one DSP block (0.1%)
  • < 400 LUTs (0.1%, ≈ one FP adder)

to compute one exponential per cycle at 500 MHz (∼ one AVX512 core thrashing on its 16 FP32 lanes)

For one specific value only of the architectural parameter k! (over-parameterization is cool)

slide-46
SLIDES 46-50

Opportunity #4: Tabulation

[Figure: the exponential datapath again.]

Being unable to trust my reasoning, I learnt by heart the results of all the possible multiplications (E. Ionesco)
... and all the possible exponentials
... and all the possible values of e^Z − Z − 1
... and indeed, all the possible multiplications

Reading a tabulated value is very efficient when the table is close to the consumer.
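"Learning by heart" is literal: for k address bits, e^A for every possible A fits in a 2^k-entry ROM. A sketch of how such a table could be generated (the addressing over [0, 1) and the precision p are illustrative choices, not the operator's actual formats):

```python
import math

def make_exp_table(k, p):
    """ROM of e^A for A = i/2^k, i = 0..2^k-1, each entry stored as a
    p-fractional-bit fixed-point integer (rounded to nearest)."""
    return [round(math.exp(i / 2**k) * 2**p) for i in range(2**k)]

k, p = 6, 20
rom = make_exp_table(k, p)
assert len(rom) == 2**k
# Each stored entry is within half an ulp (2^-(p+1)) of the exact value
for i, entry in enumerate(rom):
    assert abs(entry / 2**p - math.exp(i / 2**k)) <= 2.0**-(p + 1)
```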

slide-51
SLIDE 51

Opportunity #5: Generic approximators (when tabulation won’t scale)

[Figure: the exponential datapath again.]

[Figure: piecewise polynomial evaluator. The input X is split into an α-bit address A and a (w − α)-bit reduced argument Y; A addresses a table of polynomial coefficients C0, C1, C2, C3; a chain of truncated multipliers and adders (partial sums S1, S2) evaluates P(Y), followed by a final round producing R.]

The FloPoCo FixFunctionByPiecewisePoly operator
state-of-the-art polynomial approximation
each multiplier tailored with love and care
Also multipartite tables, filter approximators, and more to come.
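The FixFunctionByPiecewisePoly idea in miniature: split the input into an α-bit segment address and a reduced argument, store one degree-3 polynomial per segment, evaluate with a Horner chain. Here the coefficients come from a simple Taylor fit around each segment center, rather than FloPoCo's minimax machinery, and the names are made up:

```python
import math

ALPHA = 4                      # segment address bits
SEG = 2**ALPHA

def coeffs(f_derivs, x0):
    """Degree-3 Taylor coefficients of f around x0 (illustrative fit)."""
    return [f_derivs[n](x0) / math.factorial(n) for n in range(4)]

# Approximate exp on [0,1): exp is its own derivative at every order
table = [coeffs([math.exp] * 4, (i + 0.5) / SEG) for i in range(SEG)]

def piecewise_exp(x):
    i = int(x * SEG)                   # alpha leading bits: segment index
    y = x - (i + 0.5) / SEG            # reduced argument, |y| <= 2^-alpha/2
    c0, c1, c2, c3 = table[i]
    return c0 + y * (c1 + y * (c2 + y * c3))   # Horner evaluation

for x in (0.0, 0.123, 0.5, 0.999):
    assert abs(piecewise_exp(x) - math.exp(x)) < 1e-6
```

In the hardware operator, each of the three multipliers in the Horner chain is truncated to exactly the precision the final accuracy requires.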

slide-52
SLIDES 52-54

Opportunity #6: merged arithmetic in bit heaps

One data-structure to rule them all... and in the hardware to bind them

[Figure: many operators (adder, multi-adder, multiplier, constant multiplier, complex product, polynomial, multipartite, ...) share one algorithmic description as a bit heap Σ bᵢ·2^wᵢ, from which architectures are generated for many targets (Spartan 5, Spartan 6, Zynq 7000, Virtex-4, Virtex-5, Virtex-6, Kintex-7, ..., Stratix III, Stratix IV, Stratix V, Stratix 10).]

The sum of weighted bits as a first-class arithmetic object

A very wide class of operators: multi-valued polynomials, and more
Captures the true bit-level parallelism, enables bit-level optimization opportunities
Bit-array compressor trees can be optimized for each target ... and optimally so for practical sizes, thanks to M. Kumm
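A bit heap is just a multiset of (weight, bit) pairs whose value is Σ bᵢ·2^wᵢ. A toy model showing how an adder and a multiplier throw their bits onto the same heap, so that a single compressor tree can sum everything at once (function names are illustrative):

```python
def heap_value(heap):
    """A bit heap: list of (weight, bit) pairs, value = sum of b * 2^w."""
    return sum(b << w for w, b in heap)

def add_integer(heap, x, shift=0):
    """Throw the bits of x (shifted by `shift` positions) onto the heap."""
    w = shift
    while x:
        heap.append((w, x & 1))
        x >>= 1
        w += 1

def add_product(heap, x, y):
    """A multiplier is just its partial products thrown onto the heap."""
    j = 0
    while y:
        if y & 1:
            add_integer(heap, x, shift=j)
        y >>= 1
        j += 1

# Merged operator: a + x*y computed with a single final compression
heap = []
add_integer(heap, 100)
add_product(heap, 13, 21)
assert heap_value(heap) == 100 + 13 * 21
```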

slide-55
SLIDES 55-56

When you have a good hammer, you see nails everywhere

A sine/cosine architecture (Istoan, HEART 2013): 5 bit heaps

[Figure: sin/cos datapath. The input is reduced to a sign s, a quadrant q, a table address A and a small residual; tables give sin πA and cos πA; terms Z²/2, Z³/6 and ×π feed the products sinA·cosZ, cosA·cosZ, sinA·sinZ, cosA·sinZ; a Swap/negate stage produces sinPiX and cosPiX.]
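The same table-plus-small-argument pattern as the exponential: write πx = πA + Z, tabulate sin πA and cos πA, and use sin Z ≈ Z − Z³/6, cos Z ≈ 1 − Z²/2. A software sketch of this structure, without the fixed-point truncations of the actual architecture (the function name and the restriction to [0, 1/4) are illustrative):

```python
import math

def sincos_pi(x, k=8):
    """sin(pi*x) and cos(pi*x) for x in [0, 1/4), via a 2^k-entry table."""
    A = math.floor(x * 2**k) / 2**k       # k leading bits: table address
    Z = (x - A) * math.pi                 # small residual angle, Z < pi/2^k
    sA, cA = math.sin(math.pi * A), math.cos(math.pi * A)  # tabulated pair
    sZ = Z - Z**3 / 6                     # small-angle approximations
    cZ = 1 - Z**2 / 2
    # Angle-addition formulas: the four products of the datapath
    return sA * cZ + cA * sZ, cA * cZ - sA * sZ

for x in (0.0, 0.1, 0.2, 0.249):
    s, c = sincos_pi(x)
    assert abs(s - math.sin(math.pi * x)) < 2e-9
    assert abs(c - math.cos(math.pi * x)) < 2e-9
```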

slide-57
SLIDE 57

Bit heaps for some operators and filters

[Figure: bit-heap plots for several operators and filters, at w = 16 bits.]

Why are some people still insisting I should call this “bit arrays”?

slide-58
SLIDE 58

Error analysis for dummies (and other proof assistants)

Introduction: not your PC’s exponential
Some opportunities of hardware computing just right
Error analysis for dummies (and other proof assistants)
Conclusions
Backup: crash-introduction to FPGAs

slide-59
SLIDES 59-60

Computing just right

Error analysis used to be to ensure the operator works. This is sooooo nineties.
Here, error analysis is for optimization.
(the fact that the operators work is an appreciable bonus)

slide-61
SLIDE 61

Error analysis method in my early papers: handwaving

The typical error analysis used to look like this:

“This term contributes at most 1 ulp (unit in the last place) to the overall error.”
“This operation contributes at most one half-ulp to the error.”
...
“Altogether we have 6 ulps of error,”
“so if we add ⌈log2(6)⌉ bits to the whole datapath, it should be accurate enough.”

slide-66
SLIDE 66

And then I saw the light

  • G. Melquiond, the creator of Gappa (the proof assistant for the rest of us)

An error is a difference between a less accurate value and a more accurate value. For instance, to bound some error, first write it δAC = A − C, then:

  • look for some intermediate value B (more accurate than A but less accurate than C)
  • write A − C = (A − B) + (B − C)  (4)

i.e. δAC = δAB + δBC. By the triangular inequality, |δAC| ≤ |δAB| + |δBC|. Therefore, the error bounds (noted δ̄ = max |δ|) verify

δ̄AC = δ̄AB + δ̄BC  (5)

A divide-and-conquer method, to use when approximations and rounding errors pile up...

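Numerically, the decomposition behaves exactly as advertised. A quick check (the values are chosen for illustration: C plays the role of the exact exponential, B a Taylor approximation of it, and A the rounding of B to 8 fractional bits):

```python
import math

# A - C splits exactly through the intermediate value B:
# delta_AC = delta_AB (rounding error) + delta_BC (approximation error).
z = 0.5
C = math.exp(z)                                        # more accurate value
B = sum(z**i / math.factorial(i) for i in range(8))    # Taylor approximation
A = round(B * 2**8) / 2**8                             # B rounded to 8 bits

d_AB, d_BC, d_AC = A - B, B - C, A - C
assert abs(d_AC - (d_AB + d_BC)) < 1e-15               # equation (4)
assert abs(d_AC) <= abs(d_AB) + abs(d_BC) + 1e-15      # triangular inequality
```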

slide-67
SLIDE 67

A big mess of rounding errors piled over approximation

[Datapath figure of the floating-point exponential: unpack X into sign sX, exponent EX and mantissa 1.FX; shift to fixed point; multiply by 1/log(2) and by −log(2); add/subtract to get Y; a table e^A and a polynomial e^Z − Z − 1 feed a multiplier and adders producing M ≈ e^Y; normalize-round-pack with exception bits yields the result R. Intermediate signals: |Xfix|, |E|, A, Z, Ztrunc, C, Ttrunc, H, P, T, M.]


slide-73
SLIDE 73

A big mess of rounding errors piled over approximation

[Same datapath figure, now annotated with a fixed-point format on every signal: Y: sfix(−1, −wF−g), A: sfix(−1, −k), Z: ufix(−k−1, −wF−g), Ztrunc: ufix(−k, −wF−g), with further formats ufix(0, −wF−g), ufix(0, −wF−g+k), ufix(−k−1, −wF+k−g), ufix(−2k−1, −wF−g), ufix(−k+1, −wF−g) on C, Ttrunc, H, P, T, and M ≈ e^Y in ufix(0, −wF−g).]

δtotal = M − e^Y
       = M − e^A · e^Z  (since Y = A + Z exactly)
       = (M − T·e^Z) + (T − e^A)·e^Z,  where δT = (T − e^A)·e^Z

Now we can bound this first source of error:

|δT| = |T − e^A| · |e^Z| < 2^(−wF−g) · (1 + 2^(−k+1))  (6)

Keep it parametric!

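Bound (6) can be sanity-checked by brute force for small parameters. A sketch: T is e^A truncated to wF + g fractional bits, A ranges over the k-bit table inputs in sfix(−1, −k), and e^Z is maximized over Z ∈ [0, 2^−k); the parameter values below are arbitrary:

```python
import math

# Brute-force check of the parametric bound (6):
# |delta_T| = |T - e^A| * |e^Z| < 2^-(wF+g) * (1 + 2^(-k+1)).
wF, g, k = 8, 3, 4
scale = 2 ** (wF + g)
bound = 2 ** -(wF + g) * (1 + 2 ** (-k + 1))

eZ_max = math.exp(2 ** -k)          # sup of e^Z over [0, 2^-k)
worst = 0.0
for i in range(-2 ** (k - 1), 2 ** (k - 1)):
    A = i * 2 ** -k                 # multiples of 2^-k in [-1/2, 1/2)
    T = math.floor(math.exp(A) * scale) / scale   # truncated table entry
    worst = max(worst, abs(T - math.exp(A)) * eZ_max)

assert worst < bound
```

The measured worst case stays below the parametric bound, and the same check passes for other small parameter choices, since the bound was derived rather than fitted.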

slide-77
SLIDE 77

That was the first step

[Annotated datapath figure, as before, with the product P singled out.]

δtotal = (M − T·e^Z) + δT. Where can we go from here? The last addition is exact (that’s fixed-point for you), so M = T + P, hence:

M − T·e^Z = T + P − T·e^Z = T + (P − Ttrunc·C) + (Ttrunc·C − T·e^Z),  where δP = P − Ttrunc·C

The bound on δP depends on the technology used for the multiplier (at most δ̄P = 2^(−wF−g)); anyway it is under control.

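The claim that the last addition is exact is easy to see once fixed-point values are viewed as scaled integers (a sketch; the numbers are arbitrary):

```python
# Fixed point = integer * 2^LSB. With a common LSB weight, adding the
# underlying integers is exact: M = T + P holds with no rounding term
# (provided the result format has enough MSBs, which the designer ensures).
LSB = -12                       # weight of the last bit: 2^-12

def to_fix(x):  return round(x * 2 ** -LSB)   # real -> integer mantissa
def to_real(n): return n * 2 ** LSB           # integer mantissa -> real

T = to_fix(1.0546875)
P = to_fix(0.000244140625)
M = T + P                                     # plain integer addition
assert to_real(M) == to_real(T) + to_real(P)  # exactly, no delta term
print(to_real(M))  # 1.054931640625
```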

slide-81
SLIDE 81

More of the same

[Datapath figure with fixed-point format annotations, as before.]

δtotal = (T + Ttrunc·C − T·e^Z) + δT + δP. Where can we go from here? This Ttrunc is annoying, so let’s get it out of the way:

T + Ttrunc·C − T·e^Z = T + (Ttrunc − T)·C + (T·C − T·e^Z),  where δTtrunc = (Ttrunc − T)·C

|Ttrunc − T| < 2^(−wF−g+k), then we need a bound on C; C ≈ e^Z − 1, so Taylor is our friend again.

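Taylor indeed delivers the bound on C: from e^Z − 1 = Z + Z²/2 + ... and Z < 2^−k, we get C < 2^−k + 2^−2k < 2^(−k+1). A brute-force check on a fine grid (illustrative sketch):

```python
import math

# Check the Taylor-derived bound on C = e^Z - 1 for Z in [0, 2^-k):
# C < 2^-k + 2^-2k < 2^(-k+1).
k = 6
zmax = 2 ** -k
grid = [i * zmax / 1000 for i in range(1000)]
C_max = max(math.exp(z) - 1 for z in grid)

assert C_max < 2 ** -k + 2 ** (-2 * k)
assert C_max < 2 ** (-k + 1)
```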

slide-84
SLIDE 84

You’ll get your lunch only after I get to an approximation error

[Core datapath figure, as before.]

δtotal = (T + T·C − T·e^Z) + δT + δP + δTtrunc. Where can we go from here? C = H + Z exactly (fixed-point additions are exact), so

T + T·C − T·e^Z = T · (1 + H + Z − e^Z)
               = T · (H − h(Z))  with h(Z) = e^Z − Z − 1
               = T · (H − h(Ztrunc)) + T · (h(Ztrunc) − h(Z)),  where δH = T·(H − h(Ztrunc)) and δZtrunc = T·(h(Ztrunc) − h(Z))

δH includes the approximation error H − h(Ztrunc).


slide-86
SLIDE 86

Finally, scientific precision sabotaging

δtotal = δT + δP + δTtrunc + δH + δZtrunc, hence

δ̄total = δ̄T + δ̄P + δ̄Ttrunc + δ̄H + δ̄Ztrunc

If any of these terms is much smaller than the others, useless bits are being computed: I’ll hack at the hardware to make this error worse!

by moving a parameter up or down, maybe adding a truncation somewhere...

Oh, yes, I will also make sure that δ̄total is small enough to guarantee last-bit accuracy.

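The "make it worse, but just accurate enough" game becomes a one-line search once every bound is parametric. Everything below is illustrative: the five coefficients stand for hypothetical per-term bounds of the shape c·2^(−wF−g), and the budget targets last-bit accuracy:

```python
def total_bound(wF, g, k):
    # hypothetical per-term bounds, each of the shape c * 2^-(wF+g)
    coefficients = [1 + 2 ** (-k + 1), 1, 2, 1, 1]
    return sum(c * 2 ** -(wF + g) for c in coefficients)

wF, k = 23, 9
budget = 2 ** -(wF + 1)     # illustrative error budget for last-bit accuracy
g = next(g for g in range(1, 32) if total_bound(wF, g, k) < budget)
print(g)  # 4: the smallest number of guard bits that fits the budget
```

Any term whose coefficient could shrink without changing g is a candidate for sabotage: make it cheaper until it matters again.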

slide-87
SLIDE 87

Take away messages

  • Error analysis for performance, not only for accuracy
  • Straightforward engineering based on additions and multiplications
  • Strict and accurate worst-case analysis (amenable to formal proof)
  • Perfectly captures how an early rounding error is amplified in the algorithm

And for you floating-point people, there exists a relative-error version. If A approximates B and B approximates C, then

(A − C)/C = (A − B)/B + (B − C)/C + (A − B)/B × (B − C)/C  (7)

i.e. εAC = εAB + εBC + εAB · εBC

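Equation (7) composes relative errors exactly, as a quick numeric check confirms (the values are arbitrary):

```python
# With eps_XY = (X - Y)/Y, equation (7) gives
# eps_AC = eps_AB + eps_BC + eps_AB * eps_BC, exactly.
C = 3.141592653589793
B = C * (1 + 1e-4)        # B approximates C with relative error ~1e-4
A = B * (1 - 3e-5)        # A approximates B with relative error ~-3e-5

eps_AB = (A - B) / B
eps_BC = (B - C) / C
eps_AC = (A - C) / C
assert abs(eps_AC - (eps_AB + eps_BC + eps_AB * eps_BC)) < 1e-15
```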

slide-88
SLIDE 88

Conclusions

  • Introduction: not your PC’s exponential
  • Some opportunities of hardware computing just right
  • Error analysis for dummies (and other proof assistants)
  • Conclusions
  • Backup: crash-introduction to FPGAs


slide-89
SLIDE 89

First conclusion

There seems to be a recurrent pattern in talks by French people: they sell you a very generic title, and then all you see is an implementation of the exponential function.


slide-91
SLIDE 91

A look forward, beyond the horizon

[Full annotated datapath: the flp(wE, wF) input X is unpacked into 1.FX in ufix(0, −wF) and EX on wE bits; every internal signal, down to the flp(wE, wF) result R, carries an explicit ufix/sfix format, e.g. Y: sfix(−1, −wF−g), A: sfix(−1, −k), Z: ufix(−k−1, −wF−g), M ≈ e^Y: ufix(0, −wF−g).]

Generating parametric hardware was the easy part! The difficult part of the problem is: what precision is needed at this point of this application? (Overwhelming freedom! Help!) I know, because this is what I’ve been doing for the exponential.

Current challenges:

  • Precimonious-like tools at the bit level, for my users: my beautiful parametric FPExp is only used for single and double precision!
  • and in software: replace some floats with fixed point (hidden in elementary functions, reductions...)

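To make the format annotations concrete, here is how a few signal widths come out for single precision with one hypothetical choice of g and k. The (msb, lsb) pairs follow the figure's annotations (the pairing of some formats to signals is my reading); the width rule is msb − lsb + 1, plus a sign bit for sfix:

```python
wF, g, k = 23, 3, 9         # single precision, illustrative g and k

formats = {                  # name: (signed, msb, lsb), read off the figure
    "Y":      (True,  -1,     -wF - g),
    "A":      (True,  -1,     -k),
    "Z":      (False, -k - 1, -wF - g),
    "Ztrunc": (False, -k,     -wF - g),
    "C":      (False, 0,      -wF - g),
    "M":      (False, 0,      -wF - g),
}

widths = {name: msb - lsb + 1 + signed
          for name, (signed, msb, lsb) in formats.items()}
for name, w in widths.items():
    s, msb, lsb = formats[name]
    print(f"{name:7s} {'sfix' if s else 'ufix'}({msb},{lsb})  {w} bits")
```

For these parameters the table input A needs only 10 bits while the accumulation signals carry 27: exactly the kind of per-wire tailoring a compiler should derive automatically.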

slide-94
SLIDE 94

“You’ve found yourself a comfortable niche, but...”

  • “You optimize computations, but power is dissipated in data transfers.” Answer: don’t move useless bits around, including to memory/storage.
  • “Who cares about your exponential, scientific computing is all sums of products.” Answer: Dura Amdahl lex, sed lex. [Figure: SPICE (electrical circuit simulation) inner loop, cut from Kapre and DeHon (FPL 2009)]


slide-95
SLIDE 95

Time for a bit of advertising

A Springer book: Application-specific arithmetic by F. de Dinechin and M. Kumm, to appear somewhere in 2020. Hopefully.

Thanks to all FloPoCo contributors

  • H. Abdoli, S. Banescu, L. Besème, A. Böttcher, N. Bonfante, M. Christ, N. Brunie, S. Collange, J. Detrey, P. Echeverría, F. Ferrandi, N. Fiege, L. Forget, M. Grad, K. Illyes, M. Istoan, M. Joldes, J. Kappauf, C. Klein, M. Kleinlein, K. Klug, M. Kumm, K. Kullmann, D. Mastrandrea, K. Moeller, B. Pasca, B. Popa, X. Pujol, G. Sergent, D. Thomas, R. Tudoran, A. Vasquez.

and the authors of NVC, MPFR, Sollya, FPLLL, ScaLP, ...

[FloPoCo logo: e^x, √(x²+y²+z²), π·x, sin e^(x+y), Σ_{i=0..n} x_i, √x, log x]

http://flopoco.gforge.inria.fr/


slide-96
SLIDE 96

Backup: crash-introduction to FPGAs

  • Introduction: not your PC’s exponential
  • Some opportunities of hardware computing just right
  • Error analysis for dummies (and other proof assistants)
  • Conclusions
  • Backup: crash-introduction to FPGAs


slide-100
SLIDE 100

Basic FPGA structure

Overview: a grid of configurable cells

  • ... to build arbitrary logic
  • ... and sequential circuits

Configurable wiring: routing channels, with configurable switches where wires cross → random access to distant cells

Inside a cell: a Look-Up Table (LUT) F with 4 inputs (x0, x1, x2, x3) and one output, holding any truth table, plus 1 bit of run-time memory R.

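The LUT model is worth spelling out: configuration is just writing a 16-bit truth table, and run time is just indexing it (an illustrative sketch):

```python
# A 4-input LUT is nothing but a 16-bit truth table: configuration writes
# the 16 bits, run time indexes them with the four input bits.
def make_lut(config_bits):
    """config_bits: 16-bit integer holding the truth table."""
    def lut(x0, x1, x2, x3):
        index = x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)
        return (config_bits >> index) & 1
    return lut

# Configure the LUT as a 4-input XOR (parity): bit i of the config word
# is the parity of the bits of i.
parity_config = sum((bin(i).count("1") % 2) << i for i in range(16))
xor4 = make_lut(parity_config)
assert xor4(1, 1, 0, 0) == 0
assert xor4(1, 0, 1, 1) == 1
```

Loading a different 16-bit word reconfigures the same cell into any other 4-input function, which is exactly what configuration time does for every LUT on the chip.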

slide-101
SLIDE 101

A configured FPGA

Also known as reconfigurable circuits, used for reconfigurable computing.


slide-104
SLIDE 104

Two moments in the life of an FPGA

Configuration time (1-1000 ms):

  • the LUTs are filled with truth tables
  • the state (on/off) of each switch in each switch box is defined
  • a program == a lot of configuration bits

Run time (forever if needed):

  • data is processed by each LUT according to its truth table
  • data moves from LUT to LUT along the (static) connections
  • the FPGA behaves as a circuit of gates

The programming model of FPGAs is the digital circuit. You don’t program an FPGA, you configure it as a circuit.


slide-106
SLIDE 106

No free lunch, of course

Most FPGA silicon is dedicated to programmable routing. “Customers buy logic, but they pay for routing” (Langhammer, Intel). [Die picture from 1999; it got much worse since then.]

A circuit that would fit in 1 mm² of ASIC silicon will only fit in a 50 mm² FPGA... and the configured FPGA will run at 1/10th the frequency: there are transistors on all the wires!


slide-108
SLIDE 108

It was too simple so far, people would complain

  • coarser cells, optimized for additions
  • many small (18-24 bit) hard multipliers (“DSP blocks”)
  • many small (~20 kbit) memories
  • (and a galore of I/Os and clocks)

[Illustration: the Altera/Intel Stratix IV FPGA]

Numbers for a high-end FPGA: 500,000 LUTs + 2,000 DSP blocks + 1,000 36-kbit memory blocks... running at 500 MHz.
