Resource Optimal Design of Large Multipliers for FPGAs
Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf*
*University of Kassel, Germany †University Lyon, France
24th IEEE Symposium on Computer Arithmetic
[Figure: 32 × 32 board covered by four multipliers M1 to M4 with n = m = 16 bit each]
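The board view corresponds to splitting both operands into 16-bit halves and summing four shifted partial products. The following is a minimal sketch of that decomposition; the function name `tiled_multiply` and its parameters are illustrative and do not come from the slides.

```python
def tiled_multiply(x, y, width=32, tile=16):
    """Multiply two unsigned `width`-bit integers using tile-by-tile sub-products.

    Each (x_off, y_off) pair plays the role of one square tile on the board;
    its sub-product is shifted by x_off + y_off before accumulation.
    """
    mask = (1 << tile) - 1
    total = 0
    for x_off in range(0, width, tile):
        for y_off in range(0, width, tile):
            x_part = (x >> x_off) & mask   # bits of x covered by this tile
            y_part = (y >> y_off) & mask   # bits of y covered by this tile
            total += (x_part * y_part) << (x_off + y_off)
    return total


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(tiled_multiply(a, b) == a * b)
```

With the defaults, the two nested loops visit exactly four tiles, matching M1 to M4 on the board.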
[Figure: tiling board with x increasing to the left and y increasing upward]
[Figure: tile shapes]
(a) 3 × 3
(b) 3 × 2 / 2 × 3
(c) 2 × 1 / 1 × 2
(d) k × 2
(e) 2 × k
Shape   Tile area   Word size (w_s)   #LUTm   Total cost (cost_s)   Efficiency (E_s)
1 × 1   1           1                 1       1.65                  0.625
1 × 2   2           2                 1       2.3                   0.87
2 × 3   6           5                 3       6.25                  0.96
3 × 3   9           6                 6       9.9                   0.91
2 × k   2k          k + 2             k + 1   1.65k + 2.3           2k / (1.65k + 2.3) (= 1.21 for k → ∞)
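The 1 × 2, 2 × 3, 3 × 3, and 2 × k rows of the table are consistent with a total cost of #LUT + 0.65 · word size and an efficiency of tile area divided by total cost. The weight 0.65 is inferred here from those rows and is not stated on the slides; the function names are illustrative. A small sketch under that assumption:

```python
# Assumption: weight per output bit, inferred from the table (not stated in the slides).
LUT_WEIGHT_PER_OUTPUT_BIT = 0.65


def tile_cost(num_luts, word_size):
    """Total cost of a tile: its LUT count plus a weighted share per output bit."""
    return num_luts + LUT_WEIGHT_PER_OUTPUT_BIT * word_size


def tile_efficiency(area, num_luts, word_size):
    """Covered board area per unit of cost."""
    return area / tile_cost(num_luts, word_size)


def two_by_k_efficiency(k):
    """Efficiency of the 2 x k tile: area 2k, k + 1 LUTs, word size k + 2."""
    return tile_efficiency(2 * k, k + 1, k + 2)


if __name__ == "__main__":
    print(tile_cost(3, 5))                      # cost of the 2 x 3 tile
    print(round(tile_efficiency(9, 6, 6), 2))   # efficiency of the 3 x 3 tile
    print(round(two_by_k_efficiency(10**6), 2)) # approaches the k -> infinity limit
```

For large k the 2 × k efficiency tends to 2 / 1.65, which is the 1.21 limit given in the last row.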
Constant/Variable            Meaning
x, y ∈ N0                    Coordinates
X, Y ∈ N0                    Outer bounds of the multiplier to be designed
M_{x,y} ∈ {0, 1}             Shape of the multiplier to be designed; true when (x, y) is within the area of the multiplier
𝒮                            Set of small multipliers with different shapes
S = |𝒮|                      Number of available smaller multipliers
s ∈ {0, 1, ..., S − 1}       Shape index of a smaller multiplier
m^s_{x,y} ∈ {0, 1}           Boolean constant describing each small multiplier; true when (x, y) is within the area of the multiplier of shape s
cost_s ∈ R                   Cost of a small multiplier of shape s
d^s_{x,y} ∈ {0, 1}           Decision variable, which is true when a multiplier of shape s is placed at coordinate (x, y)
[Figure: example shape s = 0 on a grid with x = 1, 2 and y = 1, 2, 3]
m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1
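Objective:
\[ \min \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \mathrm{cost}_s \, d^s_{x,y} \]
subject to
\[ \sum_{s=0}^{S-1} \sum_{x_0=0}^{X-1} \sum_{y_0=0}^{Y-1} m^s_{x-x_0,\,y-y_0} \, d^s_{x_0,y_0} = 1 \]
for every (x, y) with M_{x,y} = 1.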
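The coverage constraint can be read as "every cell of the board is hit by exactly one placed tile". The sketch below evaluates its left-hand side directly for a given set of placements. It assumes rectangular tiles and a board with M_{x,y} = 1 everywhere; the helper names (`rect_mask`, `coverage`, `is_valid_tiling`) are illustrative and not part of the slides.

```python
def rect_mask(width, height):
    """m^s for a rectangular tile, stored as the set of offsets where m^s is 1."""
    return {(i, j) for i in range(width) for j in range(height)}


def coverage(x, y, masks, placements):
    """Sum over s, x0, y0 of m^s_{x-x0, y-y0} * d^s_{x0, y0} at cell (x, y).

    placements is the set of (s, x0, y0) triples whose decision variable is 1.
    """
    return sum(1 for (s, x0, y0) in placements if (x - x0, y - y0) in masks[s])


def is_valid_tiling(X, Y, masks, placements):
    """True when every cell of the X-by-Y board is covered exactly once."""
    return all(coverage(x, y, masks, placements) == 1
               for x in range(X) for y in range(Y))


if __name__ == "__main__":
    masks = [rect_mask(1, 2)]           # one shape: 1 wide, 2 high
    good = {(0, 0, 0), (0, 1, 0)}       # two tiles side by side on a 2 x 2 board
    bad = {(0, 0, 0)}                   # right column left uncovered
    print(is_valid_tiling(2, 2, masks, good), is_valid_tiling(2, 2, masks, bad))
```

Storing each m^s as a set of offsets keeps the check close to the formula: a placement contributes to cell (x, y) exactly when (x − x0, y − y0) lies inside the tile.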
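[Figure: grid with x and y from 1 to 5, x increasing to the left and y increasing upward, showing shape s = 0 selected by d^0_{1,2}]
m^0_{0,3} d^0_{1,2} = 0
m^0_{0,2} d^0_{1,2} = 1
m^0_{0,1} d^0_{1,2} = 1
m^0_{0,0} d^0_{1,2} = 1
m^0_{1,1} d^0_{1,2} = 1
m^0_{1,0} d^0_{1,2} = 1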
DSP constraint:
\[ \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \ldots \, d^s_{x,y} = \#\mathrm{DSP} \]
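To make the interplay of cost minimization, exact coverage, and a DSP budget concrete, here is a tiny exhaustive search. It is a sketch only and not the ILP solver flow from the talk. Several things are assumptions: tiles are rectangles that must fit inside the board, each DSP-type tile counts as one DSP block, the DSP budget is treated as an upper limit, and the 4 × 4 "DSP" tile with zero LUT cost is purely hypothetical, chosen so the search stays small. The LUT tile costs are taken from the cost table.

```python
def best_tiling(X, Y, shapes, max_dsp):
    """Cheapest exact cover of an X-by-Y board, found by backtracking.

    shapes: list of (width, height, lut_cost, is_dsp) tuples.
    Returns (total LUT cost, list of (shape index, x0, y0) placements).
    """
    covered = [[False] * Y for _ in range(X)]
    best = {"cost": float("inf"), "placed": None}

    def first_free():
        for y in range(Y):
            for x in range(X):
                if not covered[x][y]:
                    return x, y
        return None

    def search(cost, dsp_used, placed):
        if cost >= best["cost"]:
            return  # cannot improve on the best tiling found so far
        cell = first_free()
        if cell is None:  # board fully covered
            best["cost"], best["placed"] = cost, list(placed)
            return
        # For rectangles, the first free cell must be the origin of its tile.
        x0, y0 = cell
        for s, (w, h, lut_cost, is_dsp) in enumerate(shapes):
            if x0 + w > X or y0 + h > Y:
                continue  # tile would leave the board
            if is_dsp and dsp_used >= max_dsp:
                continue  # DSP budget exhausted
            cells = [(x0 + i, y0 + j) for i in range(w) for j in range(h)]
            if any(covered[x][y] for x, y in cells):
                continue  # would overlap an already placed tile
            for x, y in cells:
                covered[x][y] = True
            placed.append((s, x0, y0))
            search(cost + lut_cost, dsp_used + int(is_dsp), placed)
            placed.pop()
            for x, y in cells:
                covered[x][y] = False

    search(0.0, 0, [])
    return best["cost"], best["placed"]


SHAPES = [
    (1, 1, 1.65, False), (1, 2, 2.3, False), (2, 1, 2.3, False),
    (2, 3, 6.25, False), (3, 2, 6.25, False), (3, 3, 9.9, False),
    (4, 4, 0.0, True),  # hypothetical DSP-like tile, not from the slides
]

if __name__ == "__main__":
    for budget in (0, 1):
        print(budget, best_tiling(4, 4, SHAPES, budget))
```

On this toy 4 × 4 board, a budget of zero DSP tiles forces a LUT-only cover (two 2 × 3 tiles plus two dominoes, cost 17.1), while a budget of one lets the hypothetical DSP tile cover the whole board at zero LUT cost. This approach only scales to very small boards.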
[Figure: tiling results, one panel per configuration]
24 × 24: 0 DSP, 1 DSP, 2 DSP
32 × 32: 0 DSP, 1 DSP, 2 DSP, 3 DSP, 4 DSP
Baugh-Wooley multiplier [Parandeh-Afshar 2011]
2 × k and 1:2 perform best for LUT-based multiplication
efficient solution utilizing two super-tiles
[Figure: tiling results for 53 × 53, one panel each for 5, 6, 7, 8, and 9 DSP]
Pinwheel inside of a pinwheel
Logic-based multiplier consumes 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]
[Figure: tiling results for 64 × 64, one panel each for 7, 8, 9, 10, and 11 DSP]
Mult.    Method          #DSP  Area (geom.) logic mult.  Slices  Slice red.  fclk [MHz]
24 × 24  [Brunie 2013]   1     216                       65      -           212.4
         proposed        1     168                       58      10.8%       287.4
         [Brunie 2013]   2     -                         -       -           418.9
         proposed        2     -                         -       0.0%        418.9
32 × 32  [Banescu 2010]  -     1024                      339     -           275.8
         proposed        -     1024                      276     18.6%       304.4
         [Brunie 2013]   1     648                       205     -           192.8
         [Banescu 2010]  1     616                       234     -           352.6
         proposed        1     616                       180     12.2%       302.5
         [Brunie 2013]   2     288                       94      -           270.1
         proposed        2     256                       82      12.8%       338.0
         [Brunie 2013]   3     135                       75      -           194.0
         [Banescu 2010]  3     176                       75      -           426.6
         proposed        3     64                        44      41.3%       314.5
         [Brunie 2013]   4     -                         17      -           314.7
         [Banescu 2010]  4     40                        38      -           379.4
         proposed        4     -                         13      23.5%       181.7
(- marks a cell that is empty in the table)
Fewer slices because of a better logic-based multiplier/compressor tree
Fewer slices because of better super-tile usage
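The "Slice red." percentages in the results tables are consistent with comparing the proposed design against the best (lowest-slice) previous design with the same DSP count. That reading is inferred from the numbers rather than stated on the slides, and the function name is illustrative. A short sketch of the arithmetic:

```python
def slice_reduction(previous_slices, proposed_slices):
    """Slice reduction in percent relative to the best previous design.

    previous_slices: slice counts of the reference designs at the same #DSP.
    """
    best_previous = min(previous_slices)
    return 100.0 * (best_previous - proposed_slices) / best_previous


if __name__ == "__main__":
    # 32 x 32 with 1 DSP: references use 205 and 234 slices, proposed uses 180.
    print(round(slice_reduction([205, 234], 180), 1))
    # 32 x 32 with 3 DSP: both references use 75 slices, proposed uses 44.
    print(round(slice_reduction([75, 75], 44), 1))
```

Both calls reproduce the corresponding table entries (12.2% and 41.3%).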
Mult.    Method          #DSP  Area (geom.) logic mult.  Slices  Slice red.  fclk [MHz]
53 × 53  [Banescu 2010]  5     1029                      350     -           298.2
         proposed        5     769                       295     15.7%       313.2
         [Brunie 2013]   6     468                       196     -           214.1
         [Banescu 2010]  6     721                       220     -           298.2
         proposed        6     361                       180     8.2%        263.2
         [Banescu 2010]  7     313                       223     -           378.9
         proposed        7     193                       137     38.6%       290.2
         [Banescu 2010]  8     265                       145     -           356.4
         proposed        8     25                        81      44.1%       272.7
         [Brunie 2013]   9     162                       125     -           195.6
         [Banescu 2010]  9     215                       174     -           255.8
         proposed        9     -                         72      42.4%       348.8
64 × 64  [Banescu 2010]  7     1504                      614     -           245.0
         proposed        7     1191                      430     30.0%       270.5
         [Brunie 2013]   8     1188                      420     -           194.2
         [Banescu 2010]  8     1096                      449     -           280.7
         proposed        8     652                       348     17.1%       261.2
         [Banescu 2010]  9     864                       413     -           262.9
         proposed        9     475                       217     47.5%       249.6
         [Banescu 2010]  10    592                       341     -           250.7
         proposed        10    187                       179     47.5%       267.7
         [Brunie 2013]   11    270                       196     -           162.8
         [Banescu 2010]  11    592                       268     -           225.3
         proposed        11    -                         108     44.9%       265.4
(- marks a cell that is empty in the table)
DSP-only solutions with fewer DSPs found
[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009
[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015
[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010
[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013
Multiplier                                   #DSP  LUT cost  ΔLUT    CPU [s]
24 × 24 (single precision floating point)    2     31.2      -       22.7
                                             1     179.95    148.75  129
                                             0     502.8     322.85  8
32 × 32 (unsigned)                           4     57.85     -       146
                                             3     119.2     61.35   320
                                             2     256.8     137.6   187
                                             1     567.95    311.15  382
                                             0     881.6     313.65  19
53 × 53 (double precision floating point)    9     144.3     -       1433
                                             8     164.45    20.15   701
                                             7     307       142.55  4331
                                             6     450.5     143.5   2112
                                             5     759.7     309.2   27215
64 × 64 (unsigned)                           11    198.25    -       43031
                                             10    354.8     156.55  81149
                                             9     570.7     215.9   21382
                                             8     862.5     291.15  54001
                                             7     1192.35   329.9   TO
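The ΔLUT entries read as the extra LUT cost incurred each time one DSP block is removed from the budget, that is, the difference between consecutive LUT cost values. A minimal sketch of that arithmetic follows; the function name is illustrative, and the input lists are the LUT cost values for 24 × 24 and 53 × 53.

```python
def lut_deltas(lut_costs):
    """Differences between consecutive LUT costs, ordered by decreasing #DSP."""
    return [round(after - before, 2)
            for before, after in zip(lut_costs, lut_costs[1:])]


if __name__ == "__main__":
    print(lut_deltas([31.2, 179.95, 502.8]))                # 24 x 24
    print(lut_deltas([144.3, 164.45, 307, 450.5, 759.7]))   # 53 x 53
```

For 53 × 53, the step from 9 to 8 DSP costs only about 20 LUT-cost units, whereas the step from 6 to 5 DSP costs over 300, which shows how unevenly the DSP blocks pay off.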
[Figure: (a) Multi-input addition of 10 numbers with 10 bit each; (b) x^3 operation for an input word size of 6 bit]
[Figure: DSP block diagram with a 25 × 18 multiplier, dual B register, dual A, D, and pre-adder, 17-bit shift paths, and 48-bit P output]
[Figure: slice with four LUTs and carry logic]
[Walters 2014] Partial-Product Generation and Addition for Multiplication in FPGAs with 6-Input LUTs, ASILOMAR 2014
[Kumm 2015] An Efficient Softcore Multiplier Architecture for Xilinx FPGAs, ARITH 2015
[Walters 2016] Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs, Computers, MDPI
[Parandeh-Afshar 2011] Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs, FPL 2011
[de Dinechin 2009] Large Multipliers with Fewer DSP Blocks, FPL 2009
[Banescu 2010] Multipliers for Floating-Point Double Precision and Beyond on FPGAs, SIGARCH 2010
[Brunie 2013] Arithmetic Core Generation Using Bit Heaps, FPL 2013