Hardware and FPGAs computing just right ICERM 2020 Florent de - PowerPoint PPT Presentation

X We want to obtain e X as s X unpack exception bits E X 1 . F X e X = 2 E · e Y shift to fixed point uo | X fix | e Y = e A × e Z × 1 / log(2) | E | negate Evaluation of e Z : Z < 2 − k , so × ( − log(2)) + / − e Z ≈ 1 + Z + Z 2 / 2 Y Z A Z trunc Notice that e Z − 1 − Z ≈ Z 2 / 2 < 2 − 2 k e Z − Z − 1 e A H Evaluate e Z − Z − 1 somewhow + (out of Z truncated to its higher bits only) C T trunc × T P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

X We want to obtain e X as s X unpack exception bits E X 1 . F X e X = 2 E · e Y shift to fixed point uo | X fix | e Y = e A × e Z × 1 / log(2) | E | negate Evaluation of e Z : Z < 2 − k , so × ( − log(2)) + / − e Z ≈ 1 + Z + Z 2 / 2 Y Z A Z trunc Notice that e Z − 1 − Z ≈ Z 2 / 2 < 2 − 2 k e Z − Z − 1 e A H Evaluate e Z − Z − 1 somewhow + (out of Z truncated to its higher bits only) C T trunc then add Z to obtain e Z − 1 × T P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

X We want to obtain e X as s X unpack exception bits E X 1 . F X e X = 2 E · e Y shift to fixed point uo | X fix | e Y = e A × e Z × 1 / log(2) | E | negate × ( − log(2)) Also notice that + / − Y Z k − 1 zeroes A � �� Z trunc e Z = 1 . e Z − Z − 1 000 ... 000 zzzz e A H + Evaluate e A × e Z as C T trunc × e A + e A × ( e Z − 1) T P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

X We want to obtain e X as s X unpack exception bits E X 1 . F X e X = 2 E · e Y shift to fixed point uo | X fix | e Y = e A × e Z × 1 / log(2) | E | negate × ( − log(2)) Also notice that + / − Y Z k − 1 zeroes A � �� Z trunc e Z = 1 . e Z − Z − 1 000 ... 000 zzzz e A H + Evaluate e A × e Z as C T trunc × e A + e A × ( e Z − 1) T P + M ≈ e Y (before the product, truncate e A to precision normalize-round-pack of e Z − 1) R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

X We want to obtain e X as s X unpack exception bits E X 1 . F X e X = 2 E · e Y shift to fixed point uo | X fix | e Y = e A × e Z × 1 / log(2) | E | negate × ( − log(2)) And that’s it, we have E and e Y + / − Y Z A Z trunc e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

X We want to obtain e X as s X unpack exception bits E X 1 . F X e X = 2 E · e Y shift to fixed point uo | X fix | e Y = e A × e Z × 1 / log(2) | E | negate × ( − log(2)) And that’s it, we have E and e Y + / − (using only fixed-point computations) Y Z A Z trunc e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

X flp( w E , w F ) We want to obtain e X as s X unpack exception bits E X 1 . F X ufix(0 , − w F ) w E e X = 2 E · e Y shift to fixed point uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) e Y = e A × e Z × 1 / log(2) ufix( − 1 , − w F − g ) | E | ufix( w E , 0) negate × ( − log(2)) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) And that’s it, we have E and e Y + / − (using only fixed-point computations) Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) C T trunc × T ufix( − k + 1 , − w F − g ) P Don’t worry if you got lost + M ≈ e Y this talk is not about this algorithm, ufix(0 , − w F − g ) normalize-round-pack but about its details flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 8 / 33

Some opportunities of hardware computing just right Introduction: not your PC’s exponential Some opportunities of hardware computing just right Error analysis for dummies (and other proof assistants) Conclusions Backup: crash-introduction to FPGAs F. de Dinechin Hardware and FPGAs computing just right 9 / 33

Opportunity #1: Over-parameterization X flp( w E , w F ) Example: s X unpack exception bits E X 1 . F X ufix(0 , − w F ) w E shift to fixed point uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) × 1 / log(2) ufix( − 1 , − w F − g ) | E | ufix( w E , 0) × ( − log(2)) negate sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) + / − Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) T trunc C × T P ufix( − k + 1 , − w F − g ) + M ≈ e Y ufix(0 , − w F − g ) normalize-round-pack flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 10 / 33

Opportunity #1: Over-parameterization bits uo | X fix | ufix( w X flp( w E , w F ) Example: s X ufix( w E − 2 , − 4) unpack exception bits E X 1 . F X ufix(0 , − w F ) w E shift to fixed point uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) × 1 / log(2) × 1 / log(2) ufix( − 1 , − w F − g ) | E | ufix( w E , 0) | E | × ( − log(2)) negate ufix( w E , 0) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) + / − Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc negate × ( − log(2)) ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + sfix( − 1 , − w F − g ) ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) T trunc C × T + / − P ufix( − k + 1 , − w F − g ) + M ≈ e Y ufix(0 , − w F − g ) normalize-round-pack Y sfix( − 1 , sfix( − 1 , − k ) flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 10 / 33

Opportunity #1: Over-parameterization bits uo | X fix | ufix( w X flp( w E , w F ) Example: s X ufix( w E − 2 , − 4) ufix( w E − 2 , − 4) unpack exception bits E X 1 . F X ufix(0 , − w F ) w E shift to fixed point Multipliers of all shapes and sizes uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) × 1 / log(2) × 1 / log(2) ufix( − 1 , − w F − g ) | E | ufix( w E , 0) | E | × ( − log(2)) negate ufix( w E , 0) ufix( w E , 0) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) + / − Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc negate × ( − log(2)) ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) T trunc C × T + / − P ufix( − k + 1 , − w F − g ) + M ≈ e Y ufix(0 , − w F − g ) normalize-round-pack Y sfix( − 1 , sfix( − 1 , − k ) flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 10 / 33

Opportunity #1: Over-parameterization bits uo | X fix | ufix( w X 14 flp( w E , w F ) Example: s X ufix( w E − 2 , − 4) ufix( w E − 2 , − 4) unpack exception bits E X 1 . F X ufix(0 , − w F ) w E shift to fixed point Multipliers of all shapes and sizes uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) × 1 / log(2) In a double-precision exponential, × 1 / log(2) ufix( − 1 , − w F − g ) | E | ufix( w E , 0) | E | × ( − log(2)) negate w E = 11, w F = 52, 12 ufix( w E , 0) ufix( w E , 0) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) + / − first multiplier 14-bits in, 12 bits out Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc negate × ( − log(2)) second multiplier 12-bits in, 56 bits out ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) ... and truncated left and right + 56 sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) T trunc C × T + / − P ufix( − k + 1 , − w F − g ) + M ≈ e Y ufix(0 , − w F − g ) normalize-round-pack Y sfix( − 1 , sfix( − 1 , − k ) flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 10 / 33

Over-parameterization is cool ⊖ OK, there is a bit more work involved in designing a parametric operator To start with, it must be a hardware-generating program F. de Dinechin Hardware and FPGAs computing just right 11 / 33

Over-parameterization is cool ⊖ OK, there is a bit more work involved in designing a parametric operator To start with, it must be a hardware-generating program ⊕ Direct benefit to end-users: freedom of choice People used to publish “An exponential architecture for single-precision”, standard is now “A family of exponential architectures for each precision” Application-specific optimal, future-proof, etc. F. de Dinechin Hardware and FPGAs computing just right 11 / 33

Over-parameterization is cool ⊖ OK, there is a bit more work involved in designing a parametric operator To start with, it must be a hardware-generating program ⊕ Direct benefit to end-users: freedom of choice People used to publish “An exponential architecture for single-precision”, standard is now “A family of exponential architectures for each precision” Application-specific optimal, future-proof, etc. ⊕ It actually simplifies design of composite operators (e.g. the exponential) No need to take any dramatic decision in the design phase: You don’t know how many bits on this wire make sense? Keep it open as a parameter. Then estimate cost and accuracy as a function of the parameters Then find the optimal values of the parameters, e.g. using ILP or common sense (whichever gives the best results) F. de Dinechin Hardware and FPGAs computing just right 11 / 33

Opportunity #2: Operator specialization Ha, that’s something you sofware people don’t get! Multiplication by a constant multiplication by integers: 17 X = ( X ≪ 4) + X but also by reals such as log(2) or sin(42 π/ 256) for the FFT Two main techniques, tens of papers ... F. de Dinechin Hardware and FPGAs computing just right 12 / 33

Opportunity #2: Operator specialization Ha, that’s something you sofware people don’t get! Multiplication by a constant multiplication by integers: 17 X = ( X ≪ 4) + X but also by reals such as log(2) or sin(42 π/ 256) for the FFT Two main techniques, tens of papers Division by 3 (for various values of 3) in floating point for Jacobi and other stencils integer (quotient and remainder) for addressing in 3 memory banks ... F. de Dinechin Hardware and FPGAs computing just right 12 / 33

Opportunity #2: Operator specialization Ha, that’s something you sofware people don’t get! Multiplication by a constant multiplication by integers: 17 X = ( X ≪ 4) + X but also by reals such as log(2) or sin(42 π/ 256) for the FFT 321 Two main techniques, tens of papers × 321 Division by 3 (for various values of 3) 321 in floating point for Jacobi and other stencils 642 integer (quotient and remainder) for addressing in 3 memory banks 963 A squarer is a multiplier specialization 103041 x × x 2 ... F. de Dinechin Hardware and FPGAs computing just right 12 / 33

Opportunity #2: Operator specialization Ha, that’s something you sofware people don’t get! Multiplication by a constant multiplication by integers: 17 X = ( X ≪ 4) + X but also by reals such as log(2) or sin(42 π/ 256) for the FFT 321 Two main techniques, tens of papers × 321 Division by 3 (for various values of 3) 321 in floating point for Jacobi and other stencils 642 integer (quotient and remainder) for addressing in 3 memory banks 963 A squarer is a multiplier specialization 103041 x × x 2 Specialization of elementary functions to specific domains ... F. de Dinechin Hardware and FPGAs computing just right 12 / 33

Opportunity #3: target-specific optimizations X Modern FPGAs also have s X unpack exception bits E X 1 . F X shift to fixed point uo | X fix | × 1 / log(2) | E | negate × ( − log(2)) + / − Y Z A Z trunc e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 13 / 33

Opportunity #3: target-specific optimizations X Modern FPGAs also have s X unpack exception bits E X 1 . F X small multipliers with pre-adders and post-adders shift to fixed point uo | X fix | × 1 / log(2) | E | negate × ( − log(2)) + / − Y Z A Z trunc e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 13 / 33

Opportunity #3: target-specific optimizations X Modern FPGAs also have s X unpack exception bits E X 1 . F X small multipliers with pre-adders and post-adders shift to fixed point uo | X fix | ... and dual-ported small memories × 1 / log(2) | E | negate × ( − log(2)) + / − Y Z A Z trunc e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 13 / 33

Opportunity #3: target-specific optimizations X Modern FPGAs also have s X unpack exception bits E X 1 . F X small multipliers with pre-adders and post-adders shift to fixed point uo | X fix | ... and dual-ported small memories × 1 / log(2) | E | Single-precision accurate exponential on Xilinx negate × ( − log(2)) one block RAM (0.1% of the chip) + / − Y Z one DSP block (0.1%) A Z trunc e Z − Z − 1 e A < 400 LUTs (0.1%, ≈ one FP adder) H + to compute one exponential per cycle at 500MHz C T trunc ( ∼ one AVX512 core trashing on its 16 FP32 lanes) T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 13 / 33

Opportunity #3: target-specific optimizations X flp( w E , w F ) Modern FPGAs also have s X unpack exception bits E X 1 . F X ufix(0 , − w F ) w E small multipliers with pre-adders and post-adders shift to fixed point uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) ... and dual-ported small memories × 1 / log(2) ufix( − 1 , − w F − g ) | E | ufix( w E , 0) Single-precision accurate exponential on Xilinx negate × ( − log(2)) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) one block RAM (0.1% of the chip) + / − Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z one DSP block (0.1%) ufix( − k − 1 , − w F − g ) A Z trunc ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A < 400 LUTs (0.1%, ≈ one FP adder) ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + to compute one exponential per cycle at 500MHz ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) C T trunc ( ∼ one AVX512 core trashing on its 16 FP32 lanes) T × P ufix( − k + 1 , − w F − g ) + For one specific value only of the architectural parameter k! M ≈ e Y ufix(0 , − w F − g ) (over-parameterization is cool) normalize-round-pack flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 13 / 33

Opportunity #4: Tabulation X s X unpack exception bits E X 1 . F X shift to fixed point uo | X fix | Being unable to trust my reasoning, I learnt by heart × 1 / log(2) the results of all the possible multiplications | E | (E. Ionesco) negate × ( − log(2)) ... and all the possible exponentials + / − Y Z A Z trunc e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 14 / 33

Opportunity #4: Tabulation X s X unpack exception bits E X 1 . F X shift to fixed point uo | X fix | Being unable to trust my reasoning, I learnt by heart × 1 / log(2) the results of all the possible multiplications | E | (E. Ionesco) negate × ( − log(2)) ... and all the possible exponentials + / − ... and all the possible values of e Z − Z − 1 Y Z A Z trunc e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 14 / 33

Opportunity #4: Tabulation X s X unpack exception bits E X 1 . F X shift to fixed point uo | X fix | Being unable to trust my reasoning, I learnt by heart × 1 / log(2) the results of all the possible multiplications | E | (E. Ionesco) negate × ( − log(2)) ... and all the possible exponentials + / − ... and all the possible values of e Z − Z − 1 Y Z A Z trunc ... and indeed, all the possible multiplications e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 14 / 33

Opportunity #4: Tabulation X s X unpack exception bits E X 1 . F X shift to fixed point uo | X fix | Being unable to trust my reasoning, I learnt by heart × 1 / log(2) the results of all the possible multiplications | E | (E. Ionesco) negate × ( − log(2)) ... and all the possible exponentials + / − ... and all the possible values of e Z − Z − 1 Y Z A Z trunc ... and indeed, all the possible multiplications e Z − Z − 1 e A H + Reading a tabulated value is very efficient C T trunc T × when the table is close to the consumer. P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 14 / 33

Opportunity #5: Generic approximators (when tabulation won’t scale) X flp( w E , w F ) s X unpack exception bits E X 1 . F X ufix(0 , − w F ) w E shift to fixed point Polynomial Coefficient Table uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) A C 1 C 0 C 2 C 3 × 1 / log(2) ufix( − 1 , − w F − g ) final round | E | α S 2 S 1 ufix( w E , 0) w × + + + R × X × × ( − log(2)) negate � P ( Y ) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) w − α � � Y 3 Y 2 + / − � Y 3 = X Y Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A The FloPoCo FixFunctionByPiecewisePoly operator ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + state-of-the-art polynomial approximation ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) T trunc C × each multiplier tailored with love and care T ufix( − k + 1 , − w F − g ) P + Also multipartite tables, filter approximators, and more to come. M ≈ e Y ufix(0 , − w F − g ) normalize-round-pack flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 15 / 33

Opportunity #6: merged arithmetic in bit heaps Complex product ...... Multiplier Polynomial Adder Multipartite Constant multiplier Multi-adder Algorithmic description � b i 2 w i Architecture generation Spartan 5 Virtex-4 ...... Virtex-5 Stratix III Spartan6 Virtex-6 Stratix IV Zynq 7000 Kintex-7 Stratix V Stratix 10 F. de Dinechin Hardware and FPGAs computing just right 16 / 33

Opportunity #6: merged arithmetic in bit heaps One data-structure to rule them all... Complex product ...... Multiplier Polynomial Adder Multipartite Constant multiplier Multi-adder Algorithmic description � b i 2 w i Architecture generation Spartan 5 Virtex-4 ...... Virtex-5 Stratix III Spartan6 Virtex-6 Stratix IV Zynq 7000 Kintex-7 Stratix V Stratix 10 The sum of weighted bits as a first-class arithmetic object A very wide class of operators: multi-valued polynomials, and more Captures the true bit-level parallelism, enables bit-level optimization opportunities F. de Dinechin Hardware and FPGAs computing just right 16 / 33

Opportunity #6: merged arithmetic in bit heaps One data-structure to rule them all... and in the hardware to bind them Complex product ...... Multiplier Polynomial Adder Multipartite Constant multiplier Multi-adder Algorithmic description � b i 2 w i Architecture generation Spartan 5 Virtex-4 ...... Virtex-5 Stratix III Spartan6 Virtex-6 Stratix IV Zynq 7000 Kintex-7 Stratix V Stratix 10 The sum of weighted bits as a first-class arithmetic object A very wide class of operators: multi-valued polynomials, and more Captures the true bit-level parallelism, enables bit-level optimization opportunities Bit-array compressor trees can be optimized for each target ... and optimally so for practical sizes, thanks to M. Kumm F. de Dinechin Hardware and FPGAs computing just right 16 / 33

When you have a good hammer, you see nails everywhere A sine/cosine architecture (I¸ stoan, HEART 2013): s q o A Yred × π Sin/Cos table T Z T sinPiA Z 2 / 2 cosPiA Z 3 / 6 T T T T T sinZ T sinAsinZ cosAsinZ sinAcosZ cosAcosZ Swap/negate sinPiX cosPiX F. de Dinechin Hardware and FPGAs computing just right 17 / 33

When you have a good hammer, you see nails everywhere A sine/cosine architecture (I¸ stoan, HEART 2013): 5 bit heaps s q o A Yred × π Sin/Cos table T Z T sinPiA Z 2 / 2 cosPiA Z 3 / 6 T T T T T sinZ T sinAsinZ cosAsinZ sinAcosZ cosAcosZ Swap/negate sinPiX cosPiX F. de Dinechin Hardware and FPGAs computing just right 17 / 33

Bit heaps for some operators and filters w=16 bits Why are some people still insisting I should call this “bit arrays”? F. de Dinechin Hardware and FPGAs computing just right 18 / 33

Error analysis for dummies (and other proof assistants) Introduction: not your PC’s exponential Some opportunities of hardware computing just right Error analysis for dummies (and other proof assistants) Conclusions Backup: crash-introduction to FPGAs F. de Dinechin Hardware and FPGAs computing just right 19 / 33

Computing just right Error analysis used to be to ensure the operator works. F. de Dinechin Hardware and FPGAs computing just right 20 / 33

Computing just right Error analysis used to be to ensure the operator works. This is sooooo nineties. Here, error analysis is for optimization . (the fact that the operators work is an appreciable bonus) F. de Dinechin Hardware and FPGAs computing just right 20 / 33

Error analysis method in my early papers: handwaving The typical error analysis used to look like this: “This term contributes at most 1 ulp (unit in the last place) to the overall error” “This operation contributes at most one half-ulp to the error” ... “Altogether we have 6 ulps of error” “so if we add ⌈ log 2 (6) ⌉ bits to all the datapath, it should be accurate enough. F. de Dinechin Hardware and FPGAs computing just right 21 / 33

And then I saw the light G. Melquiond, the creator of Gappa (the proof assistant for the rest of us) An error is a difference between a less accurate value and a more accurate value. F. de Dinechin Hardware and FPGAs computing just right 22 / 33

And then I saw the light G. Melquiond, the creator of Gappa (the proof assistant for the rest of us) An error is a difference between a less accurate value and a more accurate value. For instance, to bound some error, first write it δ AC = A − C , then F. de Dinechin Hardware and FPGAs computing just right 22 / 33

And then I saw the light G. Melquiond, the creator of Gappa (the proof assistant for the rest of us) An error is a difference between a less accurate value and a more accurate value. For instance, to bound some error, first write it δ AC = A − C , then look for some intermediate value B (more accurate than A but less accurate than C ) F. de Dinechin Hardware and FPGAs computing just right 22 / 33

And then I saw the light G. Melquiond, the creator of Gappa (the proof assistant for the rest of us) An error is a difference between a less accurate value and a more accurate value. For instance, to bound some error, first write it δ AC = A − C , then look for some intermediate value B (more accurate than A but less accurate than C ) write A − C = A − B + B − C (4) or δ AC = δ AB + δ BC F. de Dinechin Hardware and FPGAs computing just right 22 / 33

And then I saw the light G. Melquiond, the creator of Gappa (the proof assistant for the rest of us) An error is a difference between a less accurate value and a more accurate value. For instance, to bound some error, first write it δ AC = A − C , then look for some intermediate value B (more accurate than A but less accurate than C ) write A − C = A − B + B − C (4) or δ AC = δ AB + δ BC By triangular inequality, | δ AC | ≤ | δ AB | + | δ BC | Therefore, the error bounds (noted δ = max | δ | ) verify δ AC = δ AB + δ BC (5) A divide-and-conquer method, to use when approximations and rounding errors pile up... F. de Dinechin Hardware and FPGAs computing just right 22 / 33

A big mess of rounding errors piled over approximation X s X unpack exception bits E X 1 . F X shift to fixed point uo | X fix | × 1 / log(2) | E | negate × ( − log(2)) + / − Y Z A Z trunc e Z − Z − 1 e A H + C T trunc T × P + M ≈ e Y normalize-round-pack R F. de Dinechin Hardware and FPGAs computing just right 23 / 33

A big mess of rounding errors piled over approximation − M − e Y Y δ total = Z A Z trunc e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 23 / 33

A big mess of rounding errors piled over approximation − M − e Y Y δ total = Z M − e A e Z A = since Y = A + Z exactly Z trunc e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 23 / 33

A big mess of rounding errors piled over approximation − M − e Y Y δ total = Z M − e A e Z A = since Y = A + Z exactly Z trunc Te Z − e A e Z M − Te Z = + e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 23 / 33

A big mess of rounding errors piled over approximation − M − e Y δ total = Y Z M − e A e Z = since Y = A + Z exactly A Z trunc Te Z − e A e Z M − Te Z = + e Z − Z − 1 � �� e A || H ( T − e A ) e Z + || C T trunc δ T × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 23 / 33

A big mess of rounding errors piled over approximation − M − e Y δ total = Y sfix( − 1 , − w F − g ) Z sfix( − 1 , − k ) ufix( − k − 1 , − w F − g ) M − e A e Z = since Y = A + Z exactly A Z trunc Te Z − e A e Z M − Te Z = + ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 � �� e A ufix( − 2 k − 1 , − w F − g ) || H ufix(0 , − w F − g ) ( T − e A ) e Z + ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) || C T trunc δ T × T Now we can bound this first source of error: ufix( − k + 1 , − w F − g ) P + | T − e A | · | e Z | | δ T | = M ≈ e Y ufix(0 , − w F − g ) 2 − w F − g · (1 + 2 − k +1 ) < (6) Keep it parametric! F. de Dinechin Hardware and FPGAs computing just right 23 / 33

That was the first step − M − Te Z Y δ total = + δ T Z A Z trunc Where can we go from here? e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 24 / 33

That was the first step − M − Te Z Y δ total = + δ T Z A Z trunc Where can we go from here? e Z − Z − 1 e A Last addition is exact (that’s fixed-point for you) so M = T + P , hence : H + M − Te Z T + P − Te Z = C T trunc T + P − T trunc C + T trunc C − Te Z = × T P P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 24 / 33

That was the first step − M − Te Z δ total = + δ T Y Z A Z trunc Where can we go from here? e Z − Z − 1 Last addition is exact (that’s fixed-point for you) so e A M = T + P , hence : H + M − Te Z T + P − Te Z = � + T trunc C − Te Z C T trunc = T + P − T trunc C � �� × T || δ P P P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 24 / 33

That was the first step − M − Te Z δ total = + δ T Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc Where can we go from here? ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 Last addition is exact (that’s fixed-point for you) so e A M = T + P , hence : ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + M − Te Z T + P − Te Z = ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) � + T trunc C − Te Z C T trunc = T + P − T trunc C � �� × T || δ P ufix( − k + 1 , − w F − g ) P P + The bound on δ P depends on the technology used for M ≈ e Y ufix(0 , − w F − g ) the multiplier (at most δ P = 2 − w F − g ) anyway it is under control F. de Dinechin Hardware and FPGAs computing just right 24 / 33

More of the same − Y Z T + T trunc C − Te Z δ total = + δ T + δ P A Z trunc e Z − Z − 1 Where can we go from here? e A H + C T trunc × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 25 / 33

More of the same − Y Z T + T trunc C − Te Z δ total = + δ T + δ P A Z trunc e Z − Z − 1 Where can we go from here? e A This T trunc is annoying, so let’s get it out of the way H + T + T trunc C − Te Z � + TC − Te Z = T + T trunc C − TC � �� C T trunc || × T ( T trunc − T ) C || P δ T trunc + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 25 / 33

More of the same − Y Z T + T trunc C − Te Z δ total = + δ T + δ P A Z trunc e Z − Z − 1 Where can we go from here? e A This T trunc is annoying, so let’s get it out of the way H + T + T trunc C − Te Z � + TC − Te Z = T + T trunc C − TC � �� C T trunc || × T ( T trunc − T ) C || P δ T trunc + M ≈ e Y T trunc − T < 2 − w F − g + k , then we need a bound on C ; C ≈ e Z − 1 so Taylor is our friend again F. de Dinechin Hardware and FPGAs computing just right 25 / 33

More of the same − Y sfix( − 1 , − w F − g ) Z sfix( − 1 , − k ) T + T trunc C − Te Z δ total = + δ T + δ P ufix( − k − 1 , − w F − g ) A Z trunc ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 Where can we go from here? e A This T trunc is annoying, so let’s get it out of the way ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + T + T trunc C − Te Z � + TC − Te Z = T + T trunc C − TC ufix(0 , − w F − g + k ) � �� ufix( − k , − w F − g ) C T trunc || × T ( T trunc − T ) C || ufix( − k + 1 , − w F − g ) P δ T trunc + M ≈ e Y ufix(0 , − w F − g ) T trunc − T < 2 − w F − g + k , then we need a bound on C ; C ≈ e Z − 1 so Taylor is our friend again F. de Dinechin Hardware and FPGAs computing just right 25 / 33

You’ll get your lunch only after I get to an approximation error − Y Z T + TC − Te Z δ total = + δ T + δ P + δ T trunc A Z trunc Where can we go from here? e Z − Z − 1 e A H + C T trunc × T P + M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 26 / 33

You’ll get your lunch only after I get to an approximation error − Y Z T + TC − Te Z δ total = + δ T + δ P + δ T trunc A Z trunc Where can we go from here? e Z − Z − 1 e A C = H + Z exactly (fixed-point additions are exact) H + T + TC − Te Z T · (1 + H + Z − e Z ) = h ( Z ) = e Z − Z − 1 = T · ( H − h ( Z )) with C T trunc = T · ( H − h ( Z trunc ) + h ( Z trunc ) − h ( Z )) × T = T · ( H − h ( Z trunc )) � + T · ( h ( Z trunc ) − h ( Z )) � �� P + || || δ H δ Z trunc M ≈ e Y F. de Dinechin Hardware and FPGAs computing just right 26 / 33

You’ll get your lunch only after I get to an approximation error − Y Z T + TC − Te Z δ total = + δ T + δ P + δ T trunc A Z trunc Where can we go from here? e Z − Z − 1 e A C = H + Z exactly (fixed-point additions are exact) H + T + TC − Te Z T · (1 + H + Z − e Z ) = h ( Z ) = e Z − Z − 1 = T · ( H − h ( Z )) with C T trunc = T · ( H − h ( Z trunc ) + h ( Z trunc ) − h ( Z )) × T = T · ( H − h ( Z trunc )) � + T · ( h ( Z trunc ) − h ( Z )) � �� P + || || δ H δ Z trunc M ≈ e Y δ H includes the approximation error H − h ( Z trunc ) F. de Dinechin Hardware and FPGAs computing just right 26 / 33

Finally, scientific precision sabotaging δ total = δ T + δ P + δ T trunc + δ H + δ Z trunc hence δ total = δ T + δ P + δ T trunc + δ H + δ Z trunc If any of these terms is much smaller than the others, useless bits are being computed I’ll hack at the hardware to make this error worse! by moving a parameter up or down, maybe adding a truncation somewhere... F. de Dinechin Hardware and FPGAs computing just right 27 / 33

Finally, scientific precision sabotaging δ total = δ T + δ P + δ T trunc + δ H + δ Z trunc hence δ total = δ T + δ P + δ T trunc + δ H + δ Z trunc If any of these terms is much smaller than the others, useless bits are being computed I’ll hack at the hardware to make this error worse! by moving a parameter up or down, maybe adding a truncation somewhere... Oh, yes, I will also make sure that δ total is small enough to guarantee last-bit accuracy. F. de Dinechin Hardware and FPGAs computing just right 27 / 33

Take away messages Error analysis for performance, not only for accuracy Straightforward engineering based on additions and multiplications Strict and accurate worst-case analysis (amenable to formal proof) Perfectly captures how an early rounding error is amplified in the algorithm And for you floating-point people, there exists a relative-error version If A approximates B and B approximates C, then A − C A − B B − C A − B × B − C = + + (7) C B C B C or ε AC = ε AB + ε BC + ε AB · ε BC F. de Dinechin Hardware and FPGAs computing just right 28 / 33

Conclusions Introduction: not your PC’s exponential Some opportunities of hardware computing just right Error analysis for dummies (and other proof assistants) Conclusions Backup: crash-introduction to FPGAs F. de Dinechin Hardware and FPGAs computing just right 29 / 33

First conclusion There seems to be a recurrent pattern in talks by French people: They sell you a very generic title and then all you see is an implementation of the exponential function F. de Dinechin Hardware and FPGAs computing just right 30 / 33

A look forward, beyond the horizon Generating parametric hardware was the easy part! The difficult part of the problem is: What precision is needed at this point of this application ? (Overwhelming freedom! Help!) F. de Dinechin Hardware and FPGAs computing just right 31 / 33

A look forward, beyond the horizon X flp( w E , w F ) s X unpack exception bits E X 1 . F X ufix(0 , − w F ) w E Generating parametric hardware was the easy part! shift to fixed point uo | X fix | ufix( w E − 2 , − w F − g ) ufix( w E − 2 , − 4) The difficult part of the problem is: × 1 / log(2) ufix( − 1 , − w F − g ) What precision is needed at this point of this application ? | E | ufix( w E , 0) (Overwhelming freedom! Help!) negate × ( − log(2)) sfix( − 1 , − w F − g ) sfix( − 1 , − w F − g ) + / − I know, because this is what I’ve been doing for the exponential. Y sfix( − 1 , − w F − g ) sfix( − 1 , − k ) Z ufix( − k − 1 , − w F − g ) A Z trunc ufix( − k − 1 , − w F + k − g ) e Z − Z − 1 e A Current challenges: ufix( − 2 k − 1 , − w F − g ) H ufix(0 , − w F − g ) + precimonious-like tools at the bit level for my users ufix(0 , − w F − g + k ) ufix( − k , − w F − g ) C T trunc My beautiful parametric FPExp is only used for single and T × double precision! P ufix( − k + 1 , − w F − g ) + and in software: replace some floats with fixed-point M ≈ e Y ufix(0 , − w F − g ) normalize-round-pack (hidden in elementary functions, reductions...) flp( w E , w F ) R F. de Dinechin Hardware and FPGAs computing just right 31 / 33

“You’ve found yourself a comfortable niche, but...” You optimize computations, but power is dissipated in data transfers. Answer: Don’t move useless bits around, including to memory/storage F. de Dinechin Hardware and FPGAs computing just right 32 / 33

“You’ve found yourself a comfortable niche, but...” You optimize computations, but power is dissipated in data transfers. Answer: Don’t move useless bits around, including to memory/storage Who cares about your exponential, scientific computing is all sums of products. Answer: Dura Amdahl lex, sed lex SPICE (electrical circuit simulation) inner loop, cut from Kapre and DeHon (FPL 2009) F. de Dinechin Hardware and FPGAs computing just right 32 / 33

“You’ve found yourself a comfortable niche, but...” You optimize computations, but power is dissipated in data transfers. Answer: Don’t move useless bits around, including to memory/storage Who cares about your exponential, scientific computing is all sums of products. Answer: Dura Amdahl lex, sed lex SPICE (electrical circuit simulation) inner loop, cut from Kapre and DeHon (FPL 2009) Please add your question here. F. de Dinechin Hardware and FPGAs computing just right 32 / 33

Time for a bit of advertising A Springer book: Application-specific arithmetic by F. de Dinechin and M. Kumm, to appear somewhere in 2020. Hopefully. Thanks to all FloPoCo contributors √ x x x 2+ y 2+ z 2 log x √ H. Abdoli, S. Banescu, L. Bes‘eme, A. B¨ ottcher, N. Bonfante, s i x i n e � π x M. Christ, N. Brunie, S. Collange, J. Detrey, P. Echeverr´ ıa, n e x + i =0 F. Ferrandi, N. Fiege, L. Forget, M. Grad, K. Illyes, M. Istoan, y M. Joldes, J. Kappauf, C. Klein, M. Kleinlein, K. Klug, M. Kumm, K. Kullmann, D. Mastrandrea, K. Moeller, B. Pasca, B. Popa, X. Pujol, G. Sergent, D. Thomas, R. Tudoran, A. Vasquez. and the authors of NVC, MPFR, Sollya, FPLLL, ScaLP, ... http://flopoco.gforge.inria.fr/ F. de Dinechin Hardware and FPGAs computing just right 33 / 33

Backup: crash-introduction to FPGAs Introduction: not your PC’s exponential Some opportunities of hardware computing just right Error analysis for dummies (and other proof assistants) Conclusions Backup: crash-introduction to FPGAs F. de Dinechin Hardware and FPGAs computing just right 34 / 33

Basic FPGA structure Overview A grid of configurable cells F. de Dinechin Hardware and FPGAs computing just right 35 / 33

Basic FPGA structure Overview Inside a cell x 0 x 1 F x 2 x 3 A grid of configurable cells ... to build arbitrary logic A Look-Up Table (LUT) F 4 inputs, one output holds any truth table F. de Dinechin Hardware and FPGAs computing just right 35 / 33

Basic FPGA structure Overview Inside a cell x 0 x 1 y r F R x 2 x 3 y A grid of configurable cells ... to build arbitrary logic ... and sequential circuits A Look-Up Table (LUT) F 4 inputs, one output holds any truth table 1 bit of run-time memory R F. de Dinechin Hardware and FPGAs computing just right 35 / 33

Basic FPGA structure Overview Inside a cell x 0 x 1 y r F R x 2 x 3 y A grid of configurable cells ... to build arbitrary logic ... and sequential circuits A Look-Up Table (LUT) F Configurable wiring 4 inputs, one output routing channels holds any truth table configurable switches where wires cross 1 bit of run-time memory R → random access to distant cells F. de Dinechin Hardware and FPGAs computing just right 35 / 33

Hardware and FPGAs computing just right ICERM 2020 Florent de - PowerPoint PPT Presentation

Hardware and FPGAs computing just right ICERM 2020 Florent de Dinechin Everybody is wondering what Im doing here Indeed, they probably invited me because I used to write software, too... But today I want to talk about hardware. I want to

The BIST History of FPGAs FPGAs The BIST History of The BISTory BISTory of of FPGAs FPGAs

FPGAs 1 CMPE691/491: Advanced FPGA Design FPGAs Large array of configurable logic blocks

Physical Design For FPGAs Rajeev Jayaraman Physical Implementation Tools Xilinx Inc. ISPD-2001

HW/SW Codesign w/ FPGAs The Nature of HW/SW I ECE 522 Hardware Software Codesign with FPGAs

Hardware Observability Framework Hardware Observability Framework Hardware Observability

Efficient Multi-Ported Memories for FPGAs Eric LaForest Greg Steffan University of Toronto

High-Speed Computing & Co-Processing with FPGAs FPGAs (Field Programmable Gate Arrays) are

Fast dynamic and partial reconfiguration Data Path with low Hardware overhead on Xilinx FPGAs

Finding the Right Target Audience Defining the Right Audience Right Visitors Right Time

Linux and FPGAs Chad D. Kersey chad@cdkersey.com cdkersey@gatech.edu Linux and FPGAs - p. 1/9

FPGAs as Tools and Architectures at ETH Systems FPGAs as Tools and Architectures at ETH Systems

FPGAs 1 To read more This days papers: Brown and Rose, Architecture of FPGAs and

Virtex-7 FPGAs Target Software Virtex-7 FPGAs Target Software Defined Radio Applications Defined

Physical optimization for Physical optimization for FPGAs using post- FPGAs using post-

Hybrid Dot-Product Design for FP-Enabled FPGAs Bogdan Pasca Intel ARITH 2019, June 10-12, 2019

A Network of Time Division Multiplexing for FPGAs Rosemary Francis Motivation FPGAs are

Linear Regression & Gradient Descent Many slides attributable to: Prof. Mike Hughes Erik

Bosonization in 3d and 2d David Tong ICTP, April 2018 Work with Andreas Karch and Carl Turner

A solution to the PoplMark challenge in Isabelle/HOL Stefan Berghofer Technische Universit at

MON: MISSION-OPTIMIZED OVERLAY NETWORKS Bruce Spang , Anirudh Sabnis, Ramesh Sitaraman, Don

Training DNNs: Tricks Ju Sun Computer Science & Engineering University of Minnesota, Twin

Constructing Generalized Bent Functions from Trace Forms over Galois Rings Xiaoming Zhang Key

WRAP-UP Joint IS-ENES Workshop on Workflows and Metadata Generation Costa da Caparica,

-4 -2 0 in 2 r, E 10 Erro 15 x 0 In-sample 20 x 1 25 s W eights, w ( )

Hardware and FPGAs computing just right ICERM 2020 Florent de - PowerPoint PPT Presentation

Hardware and FPGAs computing just right ICERM 2020 Florent de Dinechin Everybody is wondering what Im doing here Indeed, they probably invited me because I used to write software, too... But today I want to talk about hardware. I want to

The BIST History of FPGAs FPGAs The BIST History of The BISTory BISTory of of FPGAs FPGAs

FPGAs 1 CMPE691/491: Advanced FPGA Design FPGAs Large array of configurable logic blocks

Physical Design For FPGAs Rajeev Jayaraman Physical Implementation Tools Xilinx Inc. ISPD-2001

HW/SW Codesign w/ FPGAs The Nature of HW/SW I ECE 522 Hardware Software Codesign with FPGAs

Hardware Observability Framework Hardware Observability Framework Hardware Observability

Efficient Multi-Ported Memories for FPGAs Eric LaForest Greg Steffan University of Toronto

High-Speed Computing &amp; Co-Processing with FPGAs FPGAs (Field Programmable Gate Arrays) are

Fast dynamic and partial reconfiguration Data Path with low Hardware overhead on Xilinx FPGAs

Finding the Right Target Audience Defining the Right Audience Right Visitors Right Time

Linux and FPGAs Chad D. Kersey chad@cdkersey.com cdkersey@gatech.edu Linux and FPGAs - p. 1/9

FPGAs as Tools and Architectures at ETH Systems FPGAs as Tools and Architectures at ETH Systems

FPGAs 1 To read more This days papers: Brown and Rose, Architecture of FPGAs and

Virtex-7 FPGAs Target Software Virtex-7 FPGAs Target Software Defined Radio Applications Defined

Physical optimization for Physical optimization for FPGAs using post- FPGAs using post-

Hybrid Dot-Product Design for FP-Enabled FPGAs Bogdan Pasca Intel ARITH 2019, June 10-12, 2019

A Network of Time Division Multiplexing for FPGAs Rosemary Francis Motivation FPGAs are

Linear Regression &amp; Gradient Descent Many slides attributable to: Prof. Mike Hughes Erik

Bosonization in 3d and 2d David Tong ICTP, April 2018 Work with Andreas Karch and Carl Turner

A solution to the PoplMark challenge in Isabelle/HOL Stefan Berghofer Technische Universit at

MON: MISSION-OPTIMIZED OVERLAY NETWORKS Bruce Spang , Anirudh Sabnis, Ramesh Sitaraman, Don

Training DNNs: Tricks Ju Sun Computer Science &amp; Engineering University of Minnesota, Twin

Constructing Generalized Bent Functions from Trace Forms over Galois Rings Xiaoming Zhang Key

WRAP-UP Joint IS-ENES Workshop on Workflows and Metadata Generation Costa da Caparica,

-4 -2 0 in 2 r, E 10 Erro 15 x 0 In-sample 20 x 1 25 s W eights, w ( )

High-Speed Computing & Co-Processing with FPGAs FPGAs (Field Programmable Gate Arrays) are

Linear Regression & Gradient Descent Many slides attributable to: Prof. Mike Hughes Erik

Training DNNs: Tricks Ju Sun Computer Science & Engineering University of Minnesota, Twin