FPGA Multipliers
Bogdan PASCA
projet Arénaire, ENS-Lyon/INRIA/CNRS/Université de Lyon, France
RAIM’11 February 7-10, 2011
Outline
Background & Context
Algorithmic techniques for reducing DSP count of large multipliers
  Karatsuba-Ofman algorithm
  Non-standard tilings
  Squarers
  Truncated multipliers
Conclusions
Field Programmable Gate Array
integrated circuit
has a regular architecture (hence array)
logic elements can be programmed to perform various functions
a set of configurable logic elements
digital signal processing (DSP) blocks (including multipliers)
connected by a configurable wire network
all connected to the outside world by I/O pins
[Figure: FPGA layout with columns of RAM and DSP blocks among LUT-based logic elements; the DSP detail shows 18-bit multiplier inputs and a 17-bit shift]
[Figure: 3 × 2-bit multiplier built in logic; six LUTs compute the AND terms and three full adders (FA) sum the two rows into p4..p0]
      x2 x1 x0
   ×     y1 y0
      l2 l1 l0
 + u2 u1 u0
 p4 p3 p2 p1 p0
l0 = y0 ∧ x0, l1 = y0 ∧ x1, l2 = y0 ∧ x2
u0 = y1 ∧ x0, u1 = y1 ∧ x1, u2 = y1 ∧ x2
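The 3 × 2-bit example can be sketched in software as follows. This is only an illustrative model: the function names are hypothetical, and the integer addition stands in for the full-adder chain of the figure.

```python
def to_bits(value, width):
    """Return the bits of value, least-significant first."""
    return tuple((value >> i) & 1 for i in range(width))


def and_array_multiply(x_bits, y_bits):
    """Multiply a 3-bit x by a 2-bit y from AND terms and one addition.

    x_bits = (x0, x1, x2) and y_bits = (y0, y1), least-significant bit first.
    """
    x0, x1, x2 = x_bits
    y0, y1 = y_bits
    # Partial-product rows: every bit is one AND gate (one LUT each).
    l = (y0 & x0, y0 & x1, y0 & x2)  # row generated by y0
    u = (y1 & x0, y1 & x1, y1 & x2)  # row generated by y1, weighted by 2
    row_l = l[0] | (l[1] << 1) | (l[2] << 2)
    row_u = u[0] | (u[1] << 1) | (u[2] << 2)
    # The integer addition models the carry chain of full adders.
    return row_l + (row_u << 1)
```

For instance, `and_array_multiply(to_bits(5, 3), to_bits(3, 2))` models 5 × 3 with six AND terms and one addition.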
Multiplication in logic is expensive
n × n bit ≈ $n^2 + n(n-1)$ LUTs
18 × 18 bit ≈ 324 LUTs + 306 LUTs = 630 LUTs
1 DSP block = 8 LEs (size on FPGA layout)
DSP blocks are a necessity in modern FPGAs
[Figure: DSP blocks with 18-bit multiplier inputs, 48-bit adder operands (C, P) and 17-bit shifts between cascaded blocks]
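The cost estimate above is easy to evaluate for other operand sizes. The helper below is a minimal sketch of that formula only (the function name is hypothetical); real synthesis results depend on the device and the tools.

```python
def lut_multiplier_cost(n):
    """Estimated LUT count of an n x n-bit multiplier built in logic."""
    and_luts = n * n          # one AND term per partial-product bit
    adder_luts = n * (n - 1)  # n - 1 row additions of n bits each
    return and_luts + adder_luts


for width in (8, 18, 24):
    print(width, lut_multiplier_cost(width))
```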
FPGA floating point performance – a pencil and paper evaluation ¹
  → DSP blocks are a scarce resource for accelerating DP apps.
Efficient reconfigurable design for pricing Asian options ²
  → LUTs 46%, RAM 4%, DSP 100% (192)
Implementation and evaluation of an arithmetic pipeline on FLOPS-2D: multi-FPGA system ³
  → a) LE 30%, DSP 86%, b) LE 52%, DSP 88%, c) LE 63%, DSP 100%
A temporal coding hardware implementation for spiking neural networks ⁴
  → 16 PE: LE 22%, RAM 3%, DSP 74% (100/136)
² Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART’10)
⁴ Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART’10)
classical binary multiplication
all sub-products can be properly located inside the diamond
rotate the diamond so as to obtain a rectangle
[Figure: partial-product diamond of X × Y rotated into a rectangle; bit indices 0 to 5 on both axes, operands split at bit 3 into chunks X0, X1 and Y0, Y1]
$XY = X_0Y_0 + 2^{3}X_0Y_1 + 2^{3}X_1Y_0 + 2^{3+3}X_1Y_1$
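The chunked decomposition can be checked exhaustively for small operands. The sketch below assumes 6-bit operands split at bit k = 3, as the figure suggests; the function name is hypothetical.

```python
def split_multiply(x, y, k=3):
    """Multiply via the four sub-products of a two-chunk split at bit k."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k  # low and high chunks of X
    y0, y1 = y & mask, y >> k  # low and high chunks of Y
    # Each sub-product is shifted by the sum of its chunk offsets.
    return x0 * y0 + ((x0 * y1) << k) + ((x1 * y0) << k) + ((x1 * y1) << (2 * k))
```

Each of the four sub-products corresponds to one rectangle of the rotated diamond.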
trading multiplications for additions
Basic principle for two-way splitting
split X and Y into two chunks: $X = 2^kX_1 + X_0$ and $Y = 2^kY_1 + Y_0$
computation goal: $XY = 2^{2k}X_1Y_1 + 2^k(X_1Y_0 + X_0Y_1) + X_0Y_0$
precompute $D_X = X_1 - X_0$ and $D_Y = Y_1 - Y_0$
make the observation: $X_1Y_0 + X_0Y_1 = X_1Y_1 + X_0Y_0 - D_XD_Y$
XY requires only 3 DSP blocks ($X_1Y_1$, $X_0Y_0$, $D_XD_Y$)
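The two-way splitting can be sketched as below. This is a software model rather than a hardware description: the function name is hypothetical, k = 17 is assumed to match the chunk size used later for Virtex-4, and Python integers hide the signed-operand handling a DSP block would need for the differences.

```python
def karatsuba_two_way(x, y, k=17):
    """Two-way Karatsuba-Ofman product using three multiplications."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    y0, y1 = y & mask, y >> k
    dx, dy = x1 - x0, y1 - y0  # signed differences D_X and D_Y
    p_hi = x1 * y1             # multiplication 1
    p_lo = x0 * y0             # multiplication 2
    p_d = dx * dy              # multiplication 3
    middle = p_hi + p_lo - p_d # equals x1*y0 + x0*y1
    return (p_hi << (2 * k)) + (middle << k) + p_lo
```

The fourth multiplication of the classical scheme is replaced by two small subtractions before the products and two additions after them.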
[Figure: Karatsuba-Ofman sub-product grids for two-way (X0..X1 by Y0..Y1), three-way (X0..X2 by Y0..Y2) and four-way (X0..X3 by Y0..Y3) splitting]
fairly trivial starting from the equation: $XY = 2^{2k}X_1Y_1 + 2^k(X_1Y_1 + X_0Y_0 - D_XD_Y) + X_0Y_0$
[Figure: 34x34-bit multiplier using Virtex-4 DSP48 blocks; 17-bit chunks X0, X1, Y0, Y1 and 18-bit differences feed three DSP48s that produce X0Y0, X0Y0 − DXDY and X1Y1 + X0Y0 − DXDY; the 68-bit result P is assembled from X1Y1 (34 bits), the middle term and the low bits (16:0) and (33:17)]
X1Y1 + X0Y0 − DXDY is implemented inside the DSPs
need to recover X1Y1 with a subtraction
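One way to read the figure is as a three-stage accumulation chain. The model below is an assumption based on the intermediate values named on the slide, not a description of the actual netlist; the function name and stage variables are hypothetical.

```python
def karatsuba_dsp_chain(x, y, k=17):
    """Model a 3-stage DSP cascade for a 2k x 2k-bit Karatsuba product."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    y0, y1 = y & mask, y >> k
    dx, dy = x1 - x0, y1 - y0
    # Stage 1: plain product.
    p1 = x0 * y0
    # Stage 2: product dx*dy subtracted from the cascaded value of stage 1.
    p2 = p1 - dx * dy
    # Stage 3: product x1*y1 added to the cascaded value of stage 2.
    p3 = p2 + x1 * y1
    # x1*y1 never appears alone at a chain output, so it is recovered
    # outside the chain with one subtraction.
    x1y1 = p3 - p2
    return (x1y1 << (2 * k)) + (p3 << k) + p1
```

In this reading, the accumulations inside the chain cost no logic, and the only extra logic after the chain is the subtraction that recovers X1Y1 plus the final additions.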
          latency  frequency (MHz)  slices⁵  DSPs
LogiCore  6        447              26       4
LogiCore  3        176              34       4
K-O-2     3        317              95       3
Table: 34x34-bit multipliers on Virtex-4
trade off 1 DSP (> 630 Logic Elements) for 138 Logic Elements

          latency  frequency (MHz)  slices   DSPs
LogiCore  11       353              185      9
LogiCore  6        264              122      9
K-O-3     6        317              331      6
Table: 51x51-bit multipliers on Virtex-4
trade off 3 DSPs (> 1890 Logic Elements) for 292 Logic Elements
⁵ On Virtex-4 devices 1 slice = 2 Logic Elements
new multiplication algorithms
classical decomposition may produce suboptimal results
  chunk size for X is 24
  chunk size for Y is 17
translate the operand decomposition into a tiling problem
[Figure: 6 × 6-bit example; the X × Y rectangle is covered by five non-overlapping tiles]
$XY = X_{3:0}Y_{2:0} + 2^{3}X_{0}Y_{5:3} + 2^{3+1}X_{3:1}Y_{4:3} + 2^{1+5}X_{3:1}Y_{5} + 2^{4}X_{5:4}Y_{5:0}$
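A tiling is valid when every cell of the X × Y rectangle is covered exactly once. The sketch below checks this for the five tiles of the 6 × 6-bit example and evaluates the corresponding sum; the tile encoding and function names are assumptions made for illustration.

```python
def bits(value, hi, lo):
    """Extract bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


# Each tile is (x_hi, x_lo, y_hi, y_lo), following the X_{hi:lo} notation.
TILES_6X6 = [
    (3, 0, 2, 0),  # X3:0 * Y2:0
    (0, 0, 5, 3),  # X0   * Y5:3
    (3, 1, 4, 3),  # X3:1 * Y4:3
    (3, 1, 5, 5),  # X3:1 * Y5
    (5, 4, 5, 0),  # X5:4 * Y5:0
]


def tiled_multiply(x, y, tiles):
    """Sum the tile products, each shifted by the tile's lower-left corner."""
    return sum(
        (bits(x, xh, xl) * bits(y, yh, yl)) << (xl + yl)
        for xh, xl, yh, yl in tiles
    )


def is_exact_tiling(tiles, n_x, n_y):
    """True when every (i, j) cell is covered by exactly one tile."""
    covered = [[0] * n_y for _ in range(n_x)]
    for xh, xl, yh, yl in tiles:
        for i in range(xl, xh + 1):
            for j in range(yl, yh + 1):
                covered[i][j] += 1
    return all(count == 1 for row in covered for count in row)
```

The shift of each tile is the sum of the low indices of its two operand slices, which is where exponents such as 3 + 1 and 1 + 5 come from.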
Performing a 53 × 53-bit multiplication on Virtex-5
[Figure: three tilings of the multiplication rectangle: (a) standard tiling, (b) LogiCore tiling, (c) proposed tiling with DSP tiles M1 to M8]
standard tiling ≡ classical decomposition (12 DSPs)
LogiCore 11.1 tiling uses 10 DSPs (4 DSPs used as 17x17-bit)
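The 12-DSP figure for the standard tiling is consistent with a simple count of 24 × 17-bit chunks. The helper below is a sketch of that count only; reading the slide's number this way is an assumption, and the function name is hypothetical.

```python
def classical_dsp_count(n_x, n_y, chunk_x=24, chunk_y=17):
    """DSP blocks needed when X is cut into chunk_x-bit pieces and Y into
    chunk_y-bit pieces, with one DSP per pair of pieces."""
    pieces_x = -(-n_x // chunk_x)  # ceiling division
    pieces_y = -(-n_y // chunk_y)
    return pieces_x * pieces_y


print(classical_dsp_count(53, 53))
```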
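[Figure: proposed tiling of the 58 × 58 rectangle with DSP tiles M1 to M8 around a central 10 × 10 square]
$$
\begin{aligned}
XY ={} & X_{0:23}Y_{0:16} && (M1) \\
 & + 2^{17}(X_{0:23}Y_{17:33} && (M2) \\
 & \quad + 2^{17}(X_{0:16}Y_{34:57} && (M3) \\
 & \quad\quad + 2^{17}X_{17:33}Y_{34:57})) && (M4) \\
 & + 2^{24}(X_{24:40}Y_{0:23} && (M8) \\
 & \quad + 2^{17}(X_{41:57}Y_{0:23} && (M7) \\
 & \quad\quad + 2^{17}(X_{34:57}Y_{24:40} && (M6) \\
 & \quad\quad\quad + 2^{17}X_{34:57}Y_{41:57}))) && (M5) \\
 & + 2^{48}X_{24:33}Y_{24:33}
\end{aligned}
$$
$X_{24:33}Y_{24:33}$ (10x10 multiplier) is probably best implemented in LUTs.
the parenthesization makes best use of the DSP48E internal adders (17-bit shifts)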
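The nested form can be checked in software. The sketch below evaluates the eight tile products and the central 10 × 10 product exactly as parenthesized on the slide; the function names are hypothetical and the lo:hi index order follows this slide.

```python
def field(value, lo, hi):
    """Bits lo..hi (inclusive) of value, in the slide's lo:hi order."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def tiled_multiply_58(x, y):
    """58 x 58-bit product from eight 24x17 tiles plus a 10x10 centre."""
    m1 = field(x, 0, 23) * field(y, 0, 16)
    m2 = field(x, 0, 23) * field(y, 17, 33)
    m3 = field(x, 0, 16) * field(y, 34, 57)
    m4 = field(x, 17, 33) * field(y, 34, 57)
    m5 = field(x, 34, 57) * field(y, 41, 57)
    m6 = field(x, 34, 57) * field(y, 24, 40)
    m7 = field(x, 41, 57) * field(y, 0, 23)
    m8 = field(x, 24, 40) * field(y, 0, 23)
    centre = field(x, 24, 33) * field(y, 24, 33)  # LUT-based on the slide
    # Two chains in which consecutive terms differ by a 17-bit shift.
    chain_a = m1 + ((m2 + ((m3 + (m4 << 17)) << 17)) << 17)
    chain_b = m8 + ((m7 + ((m6 + (m5 << 17)) << 17)) << 17)
    return chain_a + (chain_b << 24) + (centre << 48)
```

Within each chain only 17-bit shifts appear, which is what allows the additions to be mapped onto the adders inside the DSP blocks.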
58x58 multipliers on Virtex-5 (5vlx50ff676-3)⁶
          latency  Freq. (MHz)  REGs  LUTs  DSPs
LogiCore  14       440          300   249   10
LogiCore  8        338          208   133   10
LogiCore  4        95           208   17    10
Tiling    4        366          247   388   8
Remarks
save 2 DSP48E for a few LUTs/REGs
huge latency saving at a comparable frequency
good use of internal adders due to the 17-bit shifts
⁶ Results for 53 bits are almost identical
simple methods to save resources
appear in norms, statistical computations, polynomial evaluation...
dedicated squarer saves as many DSP blocks as the Karatsuba-Ofman algorithm, but without its overhead∗.
Squaring with k = 17 on a Virtex-4
[Figure: squarer tilings for two and three chunks; X0², X1², X2² on the diagonal, each cross product X0X1, X0X2, X1X2 appearing twice symmetrically]
$(2^kX_1 + X_0)^2 = 2^{2k}X_1^2 + 2 \cdot 2^kX_1X_0 + X_0^2$
$(2^{2k}X_2 + 2^kX_1 + X_0)^2 = 2^{4k}X_2^2 + 2^{2k}X_1^2 + X_0^2 + 2 \cdot 2^{3k}X_2X_1 + 2 \cdot 2^{2k}X_2X_0 + 2 \cdot 2^kX_1X_0$
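Both identities can be sketched directly. The functions below are an illustrative software model (hypothetical names, k = 17 assumed) that uses 3 multiplications for two chunks and 6 for three chunks, instead of 4 and 9 for a general multiplier.

```python
def square_two_way(x, k=17):
    """Square a 2k-bit value with three multiplications."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    # The cross product is computed once and doubled by one extra shift.
    return ((x1 * x1) << (2 * k)) + ((x1 * x0) << (k + 1)) + x0 * x0


def square_three_way(x, k=17):
    """Square a 3k-bit value with six multiplications."""
    mask = (1 << k) - 1
    x0, x1, x2 = x & mask, (x >> k) & mask, x >> (2 * k)
    squares = ((x2 * x2) << (4 * k)) + ((x1 * x1) << (2 * k)) + x0 * x0
    cross = (
        ((x2 * x1) << (3 * k + 1))
        + ((x2 * x0) << (2 * k + 1))
        + ((x1 * x0) << (k + 1))
    )
    return squares + cross
```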
$(2^kX_1 + X_0)^2 = 2^{34}X_1^2 + 2^{18}X_1X_0 + X_0^2$
the previous equation requires shifts of 0, 18, 34
the DSP48 of Virtex-4 only allows shifts of 17, so the internal adders are unused
Workaround for ≤ 33-bit multiplications
rewrite equation: $(2^{17}X_1 + X_0)^2 = 2^{34}X_1^2 + 2^{17}(2X_1)X_0 + X_0^2$
compute $2X_1$ by shifting $X_1$ by one bit before inputting it into the DSP48 block
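The workaround can be sketched as follows. The 33-bit limit is read here as the condition that the doubled high chunk still fits in a 17-bit unsigned DSP input; that interpretation and the function name are assumptions.

```python
def square_33bit(x, k=17):
    """Square a value of at most 33 bits using only shifts of 17 and 34."""
    assert 0 <= x < (1 << 33)
    mask = (1 << k) - 1
    x0 = x & mask            # low 17 bits
    x1 = x >> k              # high chunk, at most 16 bits
    x1_doubled = x1 << 1     # pre-shifted operand, still at most 17 bits
    assert x1_doubled <= mask
    # Shifts are now 0, 17 and 34 = 2 * 17, matching the cascade shift.
    return ((x1 * x1) << 34) + ((x1_doubled * x0) << 17) + x0 * x0
```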
          latency  frequency  slices  DSPs  bits
LogiCore  6        489        59      4     32
LogiCore  3        176        34      4     32
Squarer   3        317        18      3     32
LogiCore  18       380        279     16    53
LogiCore  7        176        207     16    53
Squarer   7        317        332     6     53
DSPs saved without any overhead
impressive 10 DSPs saved for double precision squarer
the tiling technique can be extended to squaring
[Figure: two squarer tilings with axis marks up to 53 bits, one using tiles M1 to M6 and one using tiles M1 to M5]
Issues
darker squares are computed twice and thus need to be removed.
thanks to symmetry, a diagonal multiplication of size n should consume only $n(n+1)/2$ LUTs instead of $n^2$.
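The n(n + 1)/2 count follows from the symmetry of the partial-product array: the terms x_i x_j and x_j x_i are equal, so one of them is kept with doubled weight. A bit-level sketch, with a hypothetical function name, that counts one term per kept cell:

```python
def diagonal_square(x, n):
    """Square an n-bit value using only the upper triangle of partial
    products; returns (square, number_of_terms)."""
    bit = [(x >> i) & 1 for i in range(n)]
    total = 0
    terms = 0
    for i in range(n):
        # Diagonal cell: x_i AND x_i is just x_i, weight 2^(2i).
        total += bit[i] << (2 * i)
        terms += 1
        for j in range(i + 1, n):
            # Off-diagonal pair merged into one term of doubled weight.
            total += (bit[i] & bit[j]) << (i + j + 1)
            terms += 1
    return total, terms
```

For n = 10 this uses 55 terms instead of 100.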
Classical technique
reduce resources, delay, or power consumption
controlled accuracy degradation
[Figure: partial-product array of a truncated multiplier, marking the n − k result bits and the k and d column boundaries]
remove some of the least-significant d columns
keep the error smaller than $2^k$
$E_{total} = E_{approx} + E_{round} \le 2^k$
$E_{round}$: caused by rounding the (n − d)-bit result to n − k bits
  use compensation bit to center the error
  round to nearest bounds $E_{round} \le 2^{k-1}$
$E_{approx}$: caused by the truncation of the d columns
  $E_{approx} \le \sum_{i=1}^{d} i\,2^{i-1}$
  $E_{approx} < 2^{k-1} \rightarrow d = f(k)$
[Table: Precision k | Discarded (d)]
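The relation d = f(k) can be explored numerically. The sketch below assumes the worst case in which every discarded partial-product bit is 1 (column i holds i bits of weight 2^(i−1)) and ignores any compensation; the function names are hypothetical and the numbers are not taken from the slide's table.

```python
def truncation_error_bound(d):
    """Worst-case value of the d discarded least-significant columns."""
    return sum(i * 2 ** (i - 1) for i in range(1, d + 1))


def max_discarded_columns(k):
    """Largest d whose worst-case truncation error stays below 2^(k-1)."""
    d = 0
    while truncation_error_bound(d + 1) < 2 ** (k - 1):
        d += 1
    return d


for k in (5, 8, 12, 16):
    print(k, max_discarded_columns(k))
```

The sum has the closed form (d − 1)·2^d + 1, so under this model d grows roughly like k minus the base-2 logarithm of k.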
[Figure: three tilings of a truncated multiplier with DSP tiles M1 to M4 (M1 to M3 in the last one), showing the k and d column boundaries]
Sol 1: tile and discard columns (save additions)
  waste DSPs
Sol 2: use softcore multiplier (trade a DSP for logic)
Best: tile with softcore multipliers so that $E_{approx} \le 2^{k-1}$
  use the extra precision for free
[Figure: mantissa multipliers for SP, DP, QP on Virtex-4 (left) and Virtex-5 (right)]
FPGA      Prec.  Latency, Freq.        Resources
Virtex-5  DP     6 cycles @ 414MHz     320LUT 302REG 5DSP
Virtex-5  QP     20 cycles @ 334MHz    2497LUT 2321REG 19DSP
Virtex-5  QP     14 cycles @ 245MHz    2249LUT 1576REG 19DSP
Virtex-4  DP     11 cycles @ 368MHz
Virtex-4  QP     21 cycles @ 368MHz
Virtex-4
  DP: reduce DSPs from 10 to 7 while also reducing slice count
  QP: reduce DSPs from 49 to 26 without any slice penalty
Virtex-5
  DP: reduce DSPs from 6 to 5, with roughly half the LUTs and REGs
  QP: reduce DSPs from 34 to 19 at a small increase in logic resources.
(wE, wF) correctly rounded has the same accuracy as (wE, wF + 1) faithfully rounded
  → in FPGAs the extra bit comes for free∗
truncate multipliers when IEEE-754 compliance is not needed
  function approximation by polynomial evaluation
  log2(1 + x) (53-bit)
    default: 27 DSPs
    23 DSPs
    11* DSPs
save DSPs by exploiting the flexibility of the FPGA
Karatsuba-Ofman reduces DSP cost at a small price in logic elements
tiling techniques adapt better to asymmetric DSPs
dedicated squarers significantly reduce DSP count
control accuracy and save DSPs using truncated multipliers
http://flopoco.gforge.inria.fr/
Questions?