Hardware and FPGAs: computing just right
ICERM 2020
Florent de Dinechin
Everybody is wondering what I’m doing here
Indeed, they probably invited me because I used to write software, too... But today I want to talk about hardware. I want to show you that
- 1. variable precision is a core issue in hardware design
(for some meaning of “variable”)
- 2. hardware design can be fun
- F. de Dinechin
Hardware and FPGAs computing just right 2 / 33
Hardware and FPGAs ?
Hardware is, well, hardware. I assume that you have a vague idea...
bits, gates, wires, etc.
FPGAs (Field-Programmable Gate Arrays) are a kind of reconfigurable hardware
you can program it, just like your processors but the programming model is the circuit
This talk is really about designing circuits. Circuits that compute. Just right.
Computing just right?

This is the pathetic logo of the FloPoCo project: a collage of formulas such as e^x, √(x² + y² + z²), π^x, sin(e^(x+y)), Σ_{i=0}^{n} x_i, √x and log x.

(the proper term is probably allogory)
This is the kind of thing FloPoCo does → a floating-point exponential operator where each wire, each component is tailored to its context with love and care.

(not a very good logo either)
[Figure: the exponential datapath, every signal annotated with its format: input X and result R in flp(wE, wF), 1.FX in ufix(0, −wF), Y in sfix(−1, −wF − g), A in sfix(−1, −k), Z in ufix(−k − 1, −wF − g), and so on down to M ≈ e^Y in ufix(0, −wF − g).]
Save power! Don’t move useless bits around!

What is true for transatlantic cat videos is also true inside a circuit.

In software, if your result is correct, it is probably wasteful

Did you really need bits 18 to 31 of this 32-bit word? If they carry useless noise, you don’t want to compute them... and you want even less to compute on them. But in software, you don’t really have the choice (it’s 32 bits or 64 bits).

Hardware is fun, it is all about freedom

In a circuit, we may choose, for each variable, how many bits are computed/stored/transmitted! → the opportunities
Overwhelming freedom! Help! → the challenges
Note that this is variable precision in space, but not in time.
Introduction: not your PC’s exponential

Outline:
- Introduction: not your PC’s exponential
- Some opportunities of hardware computing just right
- Error analysis for dummies (and other proof assistants)
- Conclusions
- Backup: crash-introduction to FPGAs
First, a math proficiency test

Three identities to remember from our happy school days:

    2^X = e^(X log 2)                       (1)
    e^(A+B) = e^A × e^B                     (2)
    e^Z ≈ 1 + Z + Z²/2   if Z is small      (3)
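These three identities are all the architecture exploits. A minimal Python check (double-precision floats standing in for the fixed-point hardware) makes the approximation in (3) concrete:

```python
import math

x, a, b = 1.7, 0.3, 0.2

# (1) 2^X = e^(X log 2)
assert abs(2**x - math.exp(x * math.log(2))) < 1e-12

# (2) e^(A+B) = e^A * e^B
assert abs(math.exp(a + b) - math.exp(a) * math.exp(b)) < 1e-12

# (3) e^Z ~ 1 + Z + Z^2/2 for small Z: the neglected term is Z^3/6,
# so halving Z divides the error by roughly 8
for z in (2**-8, 2**-9):
    err = abs(math.exp(z) - (1 + z + z * z / 2))
    assert err < z**3  # err ~ z^3/6
```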
[Figure: the exponential datapath, progressively highlighted at each step below.]

We want to obtain e^X as e^X = 2^E · 1.F

- Compute E ≈ X / log 2, then Y ≈ X − E × log 2.
  Now e^X = e^(E log 2 + Y) = e^(E log 2) · e^Y = 2^E · e^Y
- We have to compute e^Y with Y ∈ (−1/2, 1/2).
- Split Y: write Y = A + Z with Z < 2^(−k), where A keeps the bits of weight 2^(−1) down to 2^(−k), and Z the bits of weight 2^(−k−1) down to 2^(−wF−g), so e^Y = e^A × e^Z.
- Tabulate e^A in a ROM.
- Evaluation of e^Z: Z < 2^(−k), so e^Z ≈ 1 + Z + Z²/2. Notice that e^Z − 1 − Z ≈ Z²/2 < 2^(−2k): evaluate e^Z − Z − 1 somehow (out of Z truncated to its higher bits only), then add Z to obtain e^Z − 1.
- Also notice that e^Z = 1.000...000zzzz (with k − 1 zeroes after the point): evaluate e^A × e^Z as e^A + e^A × (e^Z − 1) (before the product, truncate e^A to the precision of e^Z − 1).

And that’s it, we have E and e^Y (using only fixed-point computations).

Don’t worry if you got lost: this talk is not about this algorithm, but about its details.
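The whole recipe fits in a dozen lines. Here is a toy Python model of the range reduction above, with floats standing in for the fixed-point signals and math.exp standing in for both the e^A ROM and the e^Z − Z − 1 evaluator (the parameter name k follows the slides; everything else is a sketch, not FloPoCo’s datapath):

```python
import math

def toy_exp(x, k=9):
    """Toy model of the range reduction above (not FloPoCo's datapath)."""
    # e^x = 2^E * e^Y with E = round(x * 1/log 2), Y = x - E*log(2)
    E = round(x / math.log(2))
    Y = x - E * math.log(2)          # Y in (-1/2, 1/2)
    # Split Y = A + Z: A keeps the bits down to weight 2^-k, 0 <= Z < 2^-k
    A = math.floor(Y * 2**k) / 2**k
    Z = Y - A
    eA = math.exp(A)                 # in hardware: a ROM indexed by A
    T = (Z * Z / 2) + Z              # e^Z - 1 = (e^Z - Z - 1) + Z
    eY = eA + eA * T                 # e^A * e^Z = e^A + e^A * (e^Z - 1)
    return E, eY                     # e^x ~ 2^E * eY

E, eY = toy_exp(1.0)
assert abs(2**E * eY - math.e) < 1e-8
```

The only approximation left is the dropped Taylor term, of order Z³/6 < 2^(−3k−2), which is why the accuracy is controlled by the single parameter k.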
Some opportunities of hardware computing just right

Outline:
- Introduction: not your PC’s exponential
- Some opportunities of hardware computing just right
- Error analysis for dummies (and other proof assistants)
- Conclusions
- Backup: crash-introduction to FPGAs
Opportunity #1: Over-parameterization

[Figure: the exponential datapath with every signal annotated with its parametric format, e.g. Y : sfix(−1, −wF − g), A : sfix(−1, −k), Z : ufix(−k − 1, −wF − g); the two multipliers are highlighted with the sizes 14, 12 and 56.]

Example: multipliers of all shapes and sizes.
In a double-precision exponential, wE = 11, wF = 52:
- first multiplier: 14 bits in, 12 bits out
- second multiplier: 12 bits in, 56 bits out
- ... and truncated left and right
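What “truncated left and right” buys can be seen on a toy truncated multiplier (a sketch: the sizes 14 and 12 are the ones quoted above, the operand values are arbitrary):

```python
def trunc_mul(a, b, win=14, wout=12):
    """Multiply two win-bit fractions, keep only the wout most
    significant result bits (toy model of a truncated multiplier)."""
    p = a * b                        # exact 2*win-bit product
    return p >> (2 * win - wout)     # drop the low bits: smaller hardware

win, wout = 14, 12
a, b = 12345, 9876                   # two arbitrary 14-bit operands
exact = (a * b) / 2**(2 * win)
approx = trunc_mul(a, b) / 2**wout
# the truncation error is less than one ulp of the wout-bit result
assert 0 <= exact - approx < 2**-wout
```

In hardware, the dropped low-order partial products need not be computed at all, which is where the area saving comes from.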
Over-parameterization is cool

⊖ OK, there is a bit more work involved in designing a parametric operator: to start with, it must be a hardware-generating program.

⊕ Direct benefit to end-users: freedom of choice. People used to publish “An exponential architecture for single precision”; the standard is now “A family of exponential architectures, one for each precision”. Application-specific optimality, future-proofing, etc.

⊕ It actually simplifies the design of composite operators (e.g. the exponential). No need to take any dramatic decision in the design phase: you don’t know how many bits on this wire make sense? Keep it open as a parameter. Then estimate cost and accuracy as a function of the parameters, and find the optimal parameter values, e.g. using ILP or common sense (whichever gives the best results).
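A caricature of that workflow, with a made-up cost model (the e^A table grows as 2^k while the e^Z evaluator shrinks as k grows; all the constants here are invented for illustration only):

```python
def cost(k, wF=23, g=4):
    # toy cost model: ROM bits for the e^A table, plus a rough cubic
    # cost for the e^Z - Z - 1 evaluator on wF + g - 2k significant bits
    table = 2**k * (wF + g + 1)
    evaluator = max(wF + g - 2 * k, 0)**3 // 4
    return table + evaluator

# "find the optimal values of the parameters": here, a plain sweep
best_k = min(range(2, 16), key=cost)
assert cost(best_k) < cost(2) and cost(best_k) < cost(15)
```

The real tools do this with actual synthesis cost functions (and ILP when the design space is too large to sweep), but the shape of the problem is exactly this one.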
Opportunity #2: Operator specialization

Ha, that’s something you software people don’t get!

Multiplication by a constant
- multiplication by integers: 17X = (X ≪ 4) + X
- but also by reals such as log(2), or sin(42π/256) for the FFT
- two main techniques, tens of papers

Division by 3 (for various values of 3)
- in floating point, for Jacobi and other stencils
- integer (quotient and remainder), for addressing in 3 memory banks

A squarer is a multiplier specialization
(the worked example on the slide: 321 × 321 = 103041, via the partial products 321, 642, 963)

Specialization of elementary functions to specific domains
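The 17X trick generalizes: any constant multiplication unfolds into shifts and adds over the set bits of the constant. This is only the naive recoding (the “two main techniques” above do much better, e.g. with subtractions and shared subexpressions), but it already shows the idea:

```python
def times_const(x, c):
    # one shift-and-add per set bit of c; e.g. 17X = (X << 4) + X
    return sum(x << i for i in range(c.bit_length()) if (c >> i) & 1)

for x in (0, 1, 12345):
    assert times_const(x, 17) == 17 * x          # (x << 4) + x
    assert times_const(x, 3) == (x << 1) + x

# a real constant such as log(2) is first scaled to an integer, e.g. 16 bits
K = round(0.6931471805599453 * 2**16)            # the constant as an integer
assert times_const(12345, K) == 12345 * K
```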
Opportunity #3: target-specific optimizations

[Figure: the exponential datapath again.]

Modern FPGAs also have
- small multipliers with pre-adders and post-adders
- ... and dual-ported small memories

Single-precision accurate exponential on Xilinx:
- one block RAM (0.1% of the chip)
- one DSP block (0.1%)
- < 400 LUTs (0.1%, ≈ one FP adder)
to compute one exponential per cycle at 500 MHz (∼ one AVX512 core thrashing on its 16 FP32 lanes).

For one specific value only of the architectural parameter k! (over-parameterization is cool)
Opportunity #4: Tabulation

[Figure: the exponential datapath again.]

Being unable to trust my reasoning, I learnt by heart the results of all the possible multiplications (E. Ionesco)
... and all the possible exponentials
... and all the possible values of e^Z − Z − 1
... and indeed, all the possible multiplications

Reading a tabulated value is very efficient when the table is close to the consumer.
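Generating such a table is trivial at circuit-generation time; a sketch with toy sizes (unsigned A in [0, 1) for simplicity, whereas the datapath above uses a signed A):

```python
import math

k, frac = 4, 20                       # k address bits, frac fraction bits
rom = [round(math.exp(a / 2**k) * 2**frac) for a in range(2**k)]

# reading the table is one lookup, and each entry is correctly rounded:
# within half an ulp of the exact exponential
for a in range(2**k):
    assert abs(rom[a] / 2**frac - math.exp(a / 2**k)) <= 2**-(frac + 1)
```

The table holds 2^k words, which is exactly why k appears in every cost/accuracy trade-off on these slides.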
Opportunity #5: Generic approximators (when tabulation won’t scale)

[Figure: the FixFunctionByPiecewisePoly datapath: the α high bits A of the input X address a table of polynomial coefficients C0 ... C3, the remaining w − α bits Y feed a Horner evaluation of P(Y), followed by a final round to R.]

The FloPoCo FixFunctionByPiecewisePoly operator
- state-of-the-art polynomial approximation
- each multiplier tailored with love and care

Also multipartite tables, filter approximators, and more to come.
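The structure can be mimicked in a few lines. This sketch uses float coefficients and a degree-2 Taylor polynomial per subinterval of [0, 1) (the real operator uses minimax coefficients and fixed-point multipliers sized just right):

```python
import math

ALPHA = 6                                # address bits: 2^ALPHA subintervals

def piecewise_exp(x, alpha=ALPHA):
    """Piecewise-polynomial e^x on [0, 1), Horner evaluation (toy model)."""
    i = int(x * 2**alpha)                # the alpha high bits of x
    x0 = (i + 0.5) / 2**alpha            # subinterval center
    y = x - x0                           # |y| <= 2^-(alpha+1)
    e = math.exp(x0)                     # the "coefficient table" entry
    c0, c1, c2 = e, e, e / 2             # Taylor coefficients of e^(x0+y)
    return c0 + y * (c1 + y * c2)        # Horner: (c2*y + c1)*y + c0

for x in (0.0, 0.3, 0.777, 0.999):
    assert abs(piecewise_exp(x) - math.exp(x)) < 2**-20
```

The error is dominated by the dropped y³/6 term, about 2^(−3(α+1))/6, so α directly trades table size against accuracy.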
Opportunity #6: merged arithmetic in bit heaps

One data-structure to rule them all... and in the hardware to bind them

[Figure: many operators (adder, multi-adder, multiplier, constant multiplier, complex product, polynomial, multipartite, ...) share one algorithmic description, the bit heap Σ b_i 2^(w_i), from which architectures are generated for each target (Spartan, Virtex, Kintex, Zynq, Stratix, ...).]

The sum of weighted bits as a first-class arithmetic object
- a very wide class of operators: multi-valued polynomials, and more
- captures the true bit-level parallelism, enables bit-level optimization opportunities
- bit-array compressor trees can be optimized for each target
- ... and optimally so for practical sizes, thanks to M. Kumm
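A bit heap is conceptually just a multiset of weighted bits. A minimal sketch merging the partial products of a product and an addend into one heap (real compressor trees then sum it with target-specific counters, which is where the optimization happens):

```python
from collections import Counter

heap = Counter()                 # weight -> number of bits at that weight
a, b, c = 5, 3, 4                # compute a*b + c in a single bit heap

for i in range(a.bit_length()):
    for j in range(b.bit_length()):
        if (a >> i) & 1 and (b >> j) & 1:
            heap[i + j] += 1     # one partial-product bit of weight 2^(i+j)
for i in range(c.bit_length()):
    if (c >> i) & 1:
        heap[i] += 1             # the addend's bits join the same heap

value = sum(count << w for w, count in heap.items())
assert value == a * b + c        # 5*3 + 4 = 19
```

Because everything lands in one heap, the final carry-propagating addition is performed once, instead of once per operator.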
When you have a good hammer, you see nails everywhere

A sine/cosine architecture (Istoan, HEART 2013): 5 bit heaps

[Figure: the sin/cos datapath: tables for sin πA and cos πA, evaluation of Z, Z²/2 and Z³/6 from ×π and the reduced argument Y_red, then reconstruction of sin A cos Z, cos A cos Z, sin A sin Z and cos A sin Z, with a final swap/negate driven by the sign and quadrant bits.]
Bit heaps for some operators and filters

[Figure: bit-heap plots for several operators and filters, at w = 16 bits.]

Why are some people still insisting I should call this “bit arrays”?
Error analysis for dummies (and other proof assistants)

Outline:
- Introduction: not your PC’s exponential
- Some opportunities of hardware computing just right
- Error analysis for dummies (and other proof assistants)
- Conclusions
- Backup: crash-introduction to FPGAs
Computing just right

Error analysis used to be to ensure the operator works. This is sooooo nineties.
Here, error analysis is for optimization.
(the fact that the operators work is an appreciable bonus)
Error analysis method in my early papers: handwaving

The typical error analysis used to look like this:
- “This term contributes at most 1 ulp (unit in the last place) to the overall error”
- “This operation contributes at most one half-ulp to the error”
- ...
- “Altogether we have 6 ulps of error”
- “So if we add ⌈log2(6)⌉ bits to all the datapath, it should be accurate enough.”
And then I saw the light

G. Melquiond, the creator of Gappa (the proof assistant for the rest of us):
An error is a difference between a less accurate value and a more accurate value.

For instance, to bound some error, first write it δ_AC = A − C, then
- look for some intermediate value B (more accurate than A but less accurate than C)
- write

      A − C = (A − B) + (B − C)    (4)
      i.e.  δ_AC = δ_AB + δ_BC

- by the triangle inequality, |δ_AC| ≤ |δ_AB| + |δ_BC|
- therefore the error bounds (noting δ̄ = max |δ|) verify

      δ̄_AC = δ̄_AB + δ̄_BC    (5)

A divide-and-conquer method, to use when approximations and rounding errors pile up...
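The decomposition at work on a tiny numeric example, with A a rounded Taylor approximation of C = e^z, and B (the exact Taylor value) as the intermediate (toy values):

```python
import math

z = 0.01
C = math.exp(z)                      # the "more accurate" value
B = 1 + z + z * z / 2                # approximation: |B - C| ~ z^3/6
A = round(B * 2**16) / 2**16         # rounding:      |A - B| <= 2^-17

# the triangle inequality splits the total error into the two pieces...
assert abs(A - C) <= abs(A - B) + abs(B - C)
# ...and each piece has its own easy bound
assert abs(A - B) <= 2**-17
assert abs(B - C) <= z**3            # indeed ~ z^3/6
```

Each intermediate value adds one term to the sum of bounds, which is exactly how the error budget of the whole datapath is assembled below.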
A big mess of rounding errors piled over approximation
[Architecture diagram: the exponential datapath through signals Y, A, Z, Ztrunc, C, Ttrunc, H, P, T, ending in M ≈ e^Y, each signal annotated with its fixed-point format: Y sfix(−1, −wF − g); A sfix(−1, −k); Z ufix(−k − 1, −wF − g); Ztrunc ufix(−k, −wF − g); the remaining signals in ufix formats down to weight 2^(−wF−g)]

δtotal = M − e^Y
       = M − e^A e^Z                        since Y = A + Z exactly
       = (M − T e^Z) + (T e^Z − e^A e^Z)
       = (M − T e^Z) + (T − e^A) e^Z        the last term is noted δT

Now we can bound this first source of error:

|δT| = |T − e^A| · |e^Z| < 2^(−wF−g) · (1 + 2^(−k+1))    (6)

Keep it parametric!
- F. de Dinechin
Hardware and FPGAs computing just right 23 / 33
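Bound (6) stays symbolic in wF, g and k, which is exactly how a generator wants it. A throwaway sketch of keeping such a bound parametric (function name and the sample parameter values are illustrative, not FloPoCo's actual choices):

```python
def delta_T_bound(wF, g, k):
    """Bound (6): |delta_T| = |T - e^A| * |e^Z| < 2^(-wF-g) * (1 + 2^(-k+1)).
    T is e^A rounded to wF+g fractional bits, and e^Z < 1 + 2^(-k+1)
    on the reduced range, hence the product of the two factors."""
    return 2.0 ** (-wF - g) * (1.0 + 2.0 ** (-k + 1))

# Single-precision-ish parameters: wF = 23 fraction bits, g = 3 guard bits,
# k = 9 reduced-argument bits.
assert delta_T_bound(23, 3, 9) < 2.0 ** -25
```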
That was the first step
[Same architecture diagram, with the multiplier output P highlighted]

δtotal = (M − T e^Z) + δT

Where can we go from here? The last addition is exact (that's fixed-point for you), so M = T + P, hence:

M − T e^Z = T + P − T e^Z
          = T + (P − Ttrunc C) + (Ttrunc C − T e^Z)

where P − Ttrunc C is noted δP. The bound on δP depends on the technology used for the multiplier (at most δ̄P = 2^(−wF−g)); anyway it is under control.
- F. de Dinechin
Hardware and FPGAs computing just right 24 / 33
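The "under control" claim is easy to check numerically. A hedged sketch of the simplest multiplier option, plain product-then-truncate (real DSP-block or truncated-multiplier bounds are technology-specific and often tighter):

```python
import math

def trunc_mul(x, y, lsb):
    """Multiply two fixed-point values, then truncate the product to
    weight-2^lsb; the truncation error lies in [0, 2^lsb)."""
    scale = 2.0 ** (-lsb)
    return math.floor(x * y * scale) / scale

# With wF = 23 and g = 3, keeping wF + g = 26 fractional bits:
x, y = 0.7071067811865476, 0.0009765625
err = abs(x * y - trunc_mul(x, y, -26))
assert err < 2.0 ** -26   # the generic delta_P bound of the slide
```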
More of the same
[Same architecture diagram, with Ttrunc highlighted]

δtotal = (T + Ttrunc C − T e^Z) + δT + δP

Where can we go from here? This Ttrunc is annoying, so let's get it out of the way:

T + Ttrunc C − T e^Z = T + (Ttrunc C − T C) + (T C − T e^Z)
                     = T + (Ttrunc − T) C + (T C − T e^Z)

where (Ttrunc − T) C is noted δTtrunc. We have |Ttrunc − T| < 2^(−wF−g+k); then we need a bound on C; C ≈ e^Z − 1, so Taylor is our friend again.
- F. de Dinechin
Hardware and FPGAs computing just right 25 / 33
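Following the slide: |Ttrunc − T| < 2^(−wF−g+k), and with 0 ≤ Z < 2^(−k) the factor C ≈ e^Z − 1 is below e^(2^(−k)) − 1 by monotonicity (a safe stand-in here for the Taylor argument). A sketch with hypothetical names:

```python
import math

def delta_Ttrunc_bound(wF, g, k):
    """|delta_Ttrunc| = |Ttrunc - T| * |C| < 2^(-wF-g+k) * (e^(2^-k) - 1),
    assuming 0 <= Z < 2^-k so that C ~ e^Z - 1 stays below e^(2^-k) - 1."""
    C_bound = math.expm1(2.0 ** -k)   # e^(2^-k) - 1, barely above 2^-k
    return 2.0 ** (-wF - g + k) * C_bound

# The k lost to truncation and the 2^-k gained from C nearly cancel:
# the bound stays close to 2^(-wF-g), almost independently of k.
assert delta_Ttrunc_bound(23, 3, 9) < 2.0 ** -25
```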
You’ll get your lunch only after I get to an approximation error
[Same architecture diagram, with H and the e^Z − Z − 1 box highlighted]

δtotal = (T + T C − T e^Z) + δT + δP + δTtrunc

Where can we go from here? C = H + Z exactly (fixed-point additions are exact), so:

T + T C − T e^Z = T · (1 + H + Z − e^Z)
                = T · (H − h(Z))                              with h(Z) = e^Z − Z − 1
                = T · (H − h(Ztrunc)) + T · (h(Ztrunc) − h(Z))

where the first term is noted δH and the second δZtrunc. δH includes the approximation error H − h(Ztrunc).
- F. de Dinechin
Hardware and FPGAs computing just right 26 / 33
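δH is the error of tabulating or approximating h(Z) = e^Z − Z − 1. A toy check on a hypothetical coarse grid of Ztrunc values (the real table is addressed by more bits and over the reduced range): rounding each entry of h to a given number of fractional bits errs by at most half an ulp per entry.

```python
import math

def h(z):
    """h(Z) = e^Z - Z - 1, the function approximated by H."""
    return math.expm1(z) - z

def max_table_error(addr_bits, frac_bits):
    """Worst-case rounding error of a toy 2^addr_bits-entry table of h
    over [0, 1), each entry rounded to nearest on frac_bits fractional bits."""
    scale = 2.0 ** frac_bits
    worst = 0.0
    for i in range(2 ** addr_bits):
        z = i / 2.0 ** addr_bits
        rounded = round(h(z) * scale) / scale
        worst = max(worst, abs(rounded - h(z)))
    return worst

# Round-to-nearest never errs by more than half an ulp of the table format:
assert max_table_error(8, 26) < 1.01 * 2.0 ** -27
```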
Finally, scientific precision sabotaging
δtotal = δT + δP + δTtrunc + δH + δZtrunc, hence

δ̄total = δ̄T + δ̄P + δ̄Ttrunc + δ̄H + δ̄Ztrunc

If any of these terms is much smaller than the others, useless bits are being computed: I'll hack at the hardware to make this error worse! (by moving a parameter up or down, maybe adding a truncation somewhere...)

Oh, yes, I will also make sure that δ̄total is small enough to guarantee last-bit accuracy.
- F. de Dinechin
Hardware and FPGAs computing just right 27 / 33
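Putting the five bounds together, one can search for the smallest guard-bit count g that meets an accuracy budget. A hedged aggregation (the individual bounds below are simplified stand-ins, not FloPoCo's exact ones; δP, δH and δZtrunc are each assumed to cost one truncation ulp):

```python
def total_bound(wF, g, k):
    """Sum of the five error bounds of the slide, in simplified form."""
    ulp = 2.0 ** (-wF - g)
    dT = ulp * (1 + 2.0 ** (-k + 1))                   # bound (6)
    dP = ulp                                           # multiplier truncation
    dTtrunc = 2.0 ** (-wF - g + k) * 2.0 ** (-k + 1)   # about 2 ulp
    dH = ulp                                           # table rounding
    dZtrunc = ulp                                      # assumed same order
    return dT + dP + dTtrunc + dH + dZtrunc

def guard_bits(wF, k, budget_ulps=0.25):
    """Smallest g with total error below budget_ulps * 2^-wF, leaving
    the rest of the half-ulp budget to the final rounding step."""
    g = 0
    while total_bound(wF, g, k) > budget_ulps * 2.0 ** (-wF):
        g += 1
    return g

# About 6 ulps of weight 2^(-wF-g) accumulate, so g = 5 meets a 1/4-ulp budget:
assert guard_bits(23, 9) == 5
```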
Take away messages
Error analysis for performance, not only for accuracy:
- Straightforward engineering based on additions and multiplications
- Strict and accurate worst-case analysis (amenable to formal proof)
- Perfectly captures how an early rounding error is amplified in the algorithm
And for you floating-point people, there exists a relative-error version
If A approximates B and B approximates C, then

(A − C)/C = (A − B)/B + (B − C)/C + ((A − B)/B) × ((B − C)/C)    (7)

i.e.  εAC = εAB + εBC + εAB · εBC
- F. de Dinechin
Hardware and FPGAs computing just right 28 / 33
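Identity (7) is exact, not just a bound, which makes it trivial to sanity-check numerically (names hypothetical):

```python
def compose_rel(eps_ab, eps_bc):
    """Relative errors compose as eps_AC = eps_AB + eps_BC + eps_AB*eps_BC;
    follows from A = B*(1 + eps_AB) and B = C*(1 + eps_BC)."""
    return eps_ab + eps_bc + eps_ab * eps_bc

# Sanity check against the definition eps_XY = (X - Y)/Y:
C = 3.0
B = C * (1 + 2.0 ** -20)   # eps_BC = 2^-20
A = B * (1 + 2.0 ** -18)   # eps_AB = 2^-18
assert abs((A - C) / C - compose_rel(2.0 ** -18, 2.0 ** -20)) < 1e-15
```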
Conclusions
Introduction: not your PC’s exponential
Some opportunities of hardware computing just right
Error analysis for dummies (and other proof assistants)
Conclusions
Backup: crash-introduction to FPGAs
- F. de Dinechin
Hardware and FPGAs computing just right 29 / 33
First conclusion
There seems to be a recurrent pattern in talks by French people: they sell you a very generic title, and then all you see is an implementation of the exponential function.
- F. de Dinechin
Hardware and FPGAs computing just right 30 / 33
A look forward, beyond the horizon
[Full FPExp architecture diagram: unpack X (flp(wE, wF)) → shift to fixed point → range reduction via ×1/log(2) and ×(−log(2)) → Y (sfix(−1, −wF − g)) → table/multiply/add datapath through A, Z, Ztrunc, C, Ttrunc, H, P, T to M ≈ e^Y → normalize-round-pack → R (flp(wE, wF)), every signal annotated with its fixed-point or floating-point format]

Generating parametric hardware was the easy part!

The difficult part of the problem is: what precision is needed at this point of this application? (Overwhelming freedom! Help!)

I know, because this is what I’ve been doing for the exponential. Current challenges:
- precimonious-like tools at the bit level for my users (my beautiful parametric FPExp is only used for single and double precision!)
- and in software: replace some floats with fixed-point (hidden in elementary functions, reductions...)
- F. de Dinechin
Hardware and FPGAs computing just right 31 / 33
“You’ve found yourself a comfortable niche, but...”
You optimize computations, but power is dissipated in data transfers.
Answer: don’t move useless bits around, including to memory/storage.

Who cares about your exponential, scientific computing is all sums of products.
Answer: Dura Amdahl lex, sed lex.
(SPICE (electrical circuit simulation) inner loop, cut from Kapre and DeHon, FPL 2009)

Please add your question here.
- F. de Dinechin
Hardware and FPGAs computing just right 32 / 33
Time for a bit of advertising
A Springer book: Application-specific arithmetic by F. de Dinechin and M. Kumm, to appear somewhere in 2020. Hopefully.
Thanks to all FloPoCo contributors
- H. Abdoli, S. Banescu, L. Besème, A. Böttcher, N. Bonfante, M. Christ, N. Brunie, S. Collange, J. Detrey, P. Echeverría, F. Ferrandi, N. Fiege, L. Forget, M. Grad, K. Illyes, M. Istoan, M. Joldes, J. Kappauf, C. Klein, M. Kleinlein, K. Klug, M. Kumm, K. Kullmann, D. Mastrandrea, K. Moeller, B. Pasca, B. Popa, X. Pujol, G. Sergent, D. Thomas, R. Tudoran, A. Vasquez.
and the authors of NVC, MPFR, Sollya, FPLLL, ScaLP, ...
[FloPoCo logo formulas: e^x, √(x² + y² + z²), πx, sin e^(x+y), Σᵢ₌₀ⁿ xᵢ, √x, log x]

http://flopoco.gforge.inria.fr/
- F. de Dinechin
Hardware and FPGAs computing just right 33 / 33
Backup: crash-introduction to FPGAs
Introduction: not your PC’s exponential
Some opportunities of hardware computing just right
Error analysis for dummies (and other proof assistants)
Conclusions
Backup: crash-introduction to FPGAs
- F. de Dinechin
Hardware and FPGAs computing just right 34 / 33
Basic FPGA structure
Overview: a grid of configurable cells
... to build arbitrary logic
... and sequential circuits

Configurable wiring:
- routing channels
- configurable switches where wires cross
→ random access to distant cells

Inside a cell:
- a Look-Up Table (LUT) F: 4 inputs, one output, holds any truth table
- 1 bit of run-time memory R
- F. de Dinechin
Hardware and FPGAs computing just right 35 / 33
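The LUT model above fits in a few lines: a 4-input LUT is nothing but 16 configuration bits, and the inputs select which bit comes out. A hedged Python sketch of that behavioral model:

```python
def make_lut(truth_table):
    """Model a 4-input FPGA LUT: 16 configuration bits, one per input
    combination; the four inputs select which bit is output."""
    assert len(truth_table) == 16
    def lut(x0, x1, x2, x3):
        index = x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)
        return truth_table[index]
    return lut

# "Configure" the cell as a 4-input XOR: entry i is the parity of i.
xor4 = make_lut([bin(i).count("1") & 1 for i in range(16)])
assert xor4(1, 0, 1, 1) == 1
assert xor4(1, 1, 0, 0) == 0
```

Any 4-input boolean function is just another 16-bit configuration word, which is exactly why the cell can "hold any truth table".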
A configured FPGA
Also known as reconfigurable circuits used for reconfigurable computing
- F. de Dinechin
Hardware and FPGAs computing just right 36 / 33
Two moments in the life of an FPGA
Configuration time (1-1000 ms):
- the LUTs are filled with truth tables
- the state (on/off) of each switch in each switch box is defined
- a program == a lot of configuration bits

Run time (forever if needed):
- data is processed by each LUT according to its truth table
- data moves from LUT to LUT along the (static) connections
- the FPGA behaves as a circuit of gates

The programming model of FPGAs is the digital circuit. You don’t program an FPGA, you configure it as a circuit.
- F. de Dinechin
Hardware and FPGAs computing just right 37 / 33
No free lunch, of course
Most FPGA silicon is dedicated to programmable routing. “Customers buy logic, but they pay for routing” (Langhammer, Intel). A picture from 1999 → (it got much worse since then).

A circuit that would fit in 1 mm² of ASIC silicon will only fit in a 50 mm² FPGA... and the configured FPGA will run at 1/10th the frequency: there are transistors on all the wires!
- F. de Dinechin
Hardware and FPGAs computing just right 38 / 33
It was too simple so far, people would complain
- coarser cells, optimized for additions
- many small (18-24 bit) hard multipliers (“DSP blocks”)
- many small (∼20 kbit) memories
- (and a galore of IOs and clocks)
(the Altera/Intel Stratix IV FPGA)

Numbers for a high-end FPGA: 500,000 LUTs + 2000 DSP blocks + 1000 36-kbit memory blocks ... running at 500 MHz
- F. de Dinechin
Hardware and FPGAs computing just right 39 / 33