Resource Optimal Design of Large Multipliers for FPGAs, Martin Kumm

Resource Optimal Design of Large Multipliers for FPGAs
Martin Kumm*, Johannes Kappauf*, Matei Istoan† and Peter Zipf*
*University of Kassel, Germany; †University Lyon, France
24th IEEE Symposium on Computer Arithmetic, 25.07.2017

Motivation
Multiplication is a fundamental arithmetic operation. Embedded multipliers available in the FPGA fabric are limited in size and quantity. Larger multipliers can be decomposed into smaller multipliers realized by DSP blocks or logic resources.
Question of interest: how can the decomposition be done in a resource-optimal way?

Outline
1. How can the problem be formulated as a tiling problem?
2. What do the tiles look like?
3. How can the problem be solved?

Multiplier Decomposition
A large multiplier can be decomposed into several smaller multipliers:
$$A \times B = (A_H 2^n + A_L)(B_H 2^m + B_L) = \underbrace{A_H B_H 2^{n+m}}_{M4} + \underbrace{A_H B_L 2^n}_{M3} + \underbrace{A_L B_H 2^m}_{M2} + \underbrace{A_L B_L}_{M1}$$
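As a quick sanity check of this identity, the following Python sketch splits both operands and sums the four partial products M1 to M4. The function name `decompose_multiply` and the operands in the usage lines are illustrative assumptions, not taken from the presentation.

```python
def decompose_multiply(a: int, b: int, n: int, m: int) -> int:
    """Multiply two unsigned integers via the four partial products M1..M4."""
    a_h, a_l = a >> n, a & ((1 << n) - 1)  # A = A_H * 2^n + A_L
    b_h, b_l = b >> m, b & ((1 << m) - 1)  # B = B_H * 2^m + B_L
    m4 = (a_h * b_h) << (n + m)
    m3 = (a_h * b_l) << n
    m2 = (a_l * b_h) << m
    m1 = a_l * b_l
    return m4 + m3 + m2 + m1


# Usage: a 32 x 32 bit product built from four 16 x 16 bit products.
a, b = 0xDEADBEEF, 0x12345678
product_matches = decompose_multiply(a, b, 16, 16) == a * b
```

The split points n and m do not have to be equal, which is what later allows tiles of different sizes.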

Multiplier Tiling
The multiplier can be graphically represented as an X × Y board which is tiled by smaller multipliers, represented as rectangles [de Dinechin 2009]. The required left shift can be obtained from the sum of the tile coordinates (x, y).
[Figure: 32 × 32 board tiled with n = m = 16 bit multipliers. The x axis runs right to left and the y axis bottom to top, each with ticks at 0, 16 and 32. M1 sits at (0, 0), M2 at (0, 16), M3 at (16, 0) and M4 at (16, 16).]
$$A \times B = (A_H 2^{16} + A_L)(B_H 2^{16} + B_L) = \underbrace{A_H B_H 2^{32}}_{M4} + \underbrace{A_H B_L 2^{16}}_{M3} + \underbrace{A_L B_H 2^{16}}_{M2} + \underbrace{A_L B_L}_{M1}$$
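The rule that the shift equals the sum of the tile coordinates can be sketched for arbitrary rectangular tiles. In this minimal Python example, a tile is assumed to be a tuple `(x, y, w, h)`; that representation and the name `tiled_product` are illustrative choices, not part of the slides.

```python
def tiled_product(a: int, b: int, tiles) -> int:
    """Sum the partial products of all tiles, each shifted left by x + y."""
    total = 0
    for x, y, w, h in tiles:
        a_part = (a >> x) & ((1 << w) - 1)  # w bits of A starting at bit x
        b_part = (b >> y) & ((1 << h) - 1)  # h bits of B starting at bit y
        total += (a_part * b_part) << (x + y)
    return total


# The 32 x 32 board from the slide, tiled by four 16 x 16 multipliers.
TILES_32 = [(0, 0, 16, 16), (16, 0, 16, 16), (0, 16, 16, 16), (16, 16, 16, 16)]
a, b = 0xDEADBEEF, 0x12345678
result = tiled_product(a, b, TILES_32)
```

A tile that extends past the board simply reads zero bits beyond the operand width, so the sum remains correct.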

Multiplier Tiling
A valid multiplier tiling is as follows: the board must be completely covered without overlaps of the tiles. Overlaps with the border of the board are allowed.
[Figure: tiling of a 53 × 53 multiplier [de Dinechin 2009]; x and y axes with ticks at 0, 17, 24, 34, 41, 53 and 58.]
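The two validity rules can be checked mechanically. The sketch below assumes rectangular tiles given as `(x, y, w, h)` with non-negative coordinates; the helper name `is_valid_tiling` is hypothetical.

```python
def is_valid_tiling(X: int, Y: int, tiles) -> bool:
    """Return True when every cell of the X x Y board is covered exactly once."""
    cover = [[0] * Y for _ in range(X)]
    for x, y, w, h in tiles:
        # Clip the tile to the board: overlaps with the border are allowed.
        for i in range(x, min(x + w, X)):
            for j in range(y, min(y + h, Y)):
                cover[i][j] += 1
    return all(count == 1 for column in cover for count in column)


# Two 17 x 24 tiles cover a 24 x 24 board; the second one overlaps the border.
ok = is_valid_tiling(24, 24, [(0, 0, 17, 24), (17, 0, 17, 24)])
```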

Logic-based Tiles
Several LUT-based multipliers can be used:
3 × 3 mult., which can be mapped to six 6-input LUTs (LUT6) [Brunie 2013]
2 × 3 mult., which can be mapped to three LUT6 (realizing five LUT5) [Kumm 2015]
1 × 2 mult., which uses a single LUT6 (realizing two LUT5)
In addition, LUT/carry-chain multipliers are used: a single row of an FPGA-optimized Baugh-Wooley multiplier [Parandeh-Afshar 2011].

Shapes of the Logic-based Tiles
[Figure: tile shapes. Panels: (a) 3 × 3, (b) 3 × 2 / 2 × 3, (c) 2 × 1 / 1 × 2, (d) k × 2, (e) 2 × k.]

LUT Requirements in the Compressor Tree
[Figure: number of LUTs (#LUTs, 0 to 1,000) over the number of input bits (#bits, 0 to 1,600). Legend: multi-input addition, x^3 operation, and the line 0.65 × #bits.]

Logic-based Multipliers
The cost of a tile of shape s is composed of the LUTs of the multiplier itself and the compressor-tree LUTs for its $w_s$ output bits:
$$\text{cost}_s = \#\text{LUT}_m + 0.65\, w_s$$
To get the "quality" of a multiplier, an efficiency metric is defined as the benefit/cost ratio:
$$E_s = \frac{\text{area}_s}{\text{cost}_s}$$

Shape   Tile area   Word size (w_s)   #LUT_m   Total cost (cost_s)   Efficiency (E_s)
1 × 1   1           1                 1        1.65                  0.61
1 × 2   2           2                 1        2.3                   0.87
2 × 3   6           5                 3        6.25                  0.96
3 × 3   9           6                 6        9.9                   0.91
2 × k   2k          k + 2             k + 1    1.65k + 2.3           2k/(1.65k + 2.3) (= 1.21 for k → ∞)
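The cost and efficiency formulas are easy to evaluate directly. The following Python sketch recomputes both metrics from the tile area, the LUT count and the word size; the names `tile_cost`, `efficiency` and `two_by_k` are illustrative, and 0.65 is the LUT-per-bit factor from the compressor-tree slide.

```python
LUT_PER_BIT = 0.65  # compressor-tree LUTs per input bit


def tile_cost(num_luts: int, word_size: int) -> float:
    """cost_s = #LUT_m + 0.65 * w_s"""
    return num_luts + LUT_PER_BIT * word_size


def efficiency(area: int, num_luts: int, word_size: int) -> float:
    """E_s = area_s / cost_s"""
    return area / tile_cost(num_luts, word_size)


def two_by_k(k: int):
    """(area, #LUT_m, w_s) of the 2 x k carry-chain tile."""
    return 2 * k, k + 1, k + 2


# (area, #LUT_m, w_s) for the fixed-size LUT tiles.
FIXED_TILES = {"1x1": (1, 1, 1), "1x2": (2, 1, 2), "2x3": (6, 3, 5), "3x3": (9, 6, 6)}
for name, (area, luts, width) in FIXED_TILES.items():
    print(name, round(tile_cost(luts, width), 2), round(efficiency(area, luts, width), 2))
```

For growing k the efficiency of the 2 × k tile approaches 2/1.65, which is why long carry-chain rows are attractive for LUT-based multiplication.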

DSP-based Tiles
Xilinx DSP blocks contain 18 × 25 bit (signed) / 17 × 24 bit (unsigned) multipliers. They contain additional post-adders, which can be used to add a multiplier result already obtained. This reduces the size of the compressor tree. Graphically, this can be represented as a so-called super-tile [Banescu 2010].

Super-Tiles of Xilinx FPGAs
[Figure: twelve super-tile shapes, panels (a) to (l).]

Formalizing the Problem

Constant/Variable                 Meaning
$x, y \in \mathbb{N}_0$           Coordinates
$X, Y \in \mathbb{N}_0$           Outer bounds of the multiplier to be designed
$M_{x,y} \in \{0, 1\}$            Shape of the multiplier to be designed; true when (x, y) is within the area of the multiplier
$\mathcal{S}$                     Set of small multipliers with different shapes
$S = |\mathcal{S}|$               Number of available smaller multipliers
$s \in \{0, 1, \ldots, S-1\}$     Shape index of a smaller multiplier
$m^s_{x,y} \in \{0, 1\}$          Boolean constant describing each small multiplier; true when (x, y) is within the area of the multiplier of shape s
$\text{cost}_s \in \mathbb{R}$    Cost of a small multiplier of shape s
$d^s_{x,y} \in \{0, 1\}$          Decision variable, which is true when a multiplier of shape s is placed at coordinate (x, y)

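Specification of a Tile
Setting $m^0_{0,0} = m^0_{0,1} = m^0_{0,2} = m^0_{1,0} = m^0_{1,1} = 1$ with all other $m$'s zero would define the following tile:
[Figure: tile of shape 0 covering the cells (0, 0), (0, 1), (0, 2), (1, 0) and (1, 1); x axis right to left with ticks 0 to 2, y axis bottom to top with ticks 0 to 3.]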
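To make the tile specification concrete, the sketch below stores shape 0 as the set of coordinates where its Boolean constant is 1 and renders it with the slide's axis convention (x grows to the left, y grows upward). The set-based representation and the names `m` and `render` are assumptions made for illustration.

```python
# Cells (x, y) of shape 0 for which the constant m is 1.
TILE_0 = {(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)}


def m(tile, x: int, y: int) -> int:
    """Boolean constant: 1 when (x, y) lies within the tile, else 0."""
    return 1 if (x, y) in tile else 0


def render(tile) -> str:
    """Draw the tile as text, with x growing leftwards and y growing upwards."""
    width = max(x for x, _ in tile) + 1
    height = max(y for _, y in tile) + 1
    rows = []
    for y in reversed(range(height)):
        rows.append("".join("#" if m(tile, x, y) else "." for x in reversed(range(width))))
    return "\n".join(rows)


print(render(TILE_0))
```

Non-rectangular shapes such as this one are what allow super-tiles to be described with the same constants.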

ILP Formulation
The multiplier tiling problem can be formulated as an integer linear programming (ILP) problem as follows:
$$\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \text{cost}_s\, d^s_{x,y}$$
subject to
$$\sum_{s=0}^{S-1} \sum_{x'=0}^{X-1} \sum_{y'=0}^{Y-1} m^s_{x-x',\,y-y'}\, d^s_{x',y'} = 1 \quad \text{for } 0 \le x \le X,\ 0 \le y \le Y \text{ with } M_{x,y} = 1$$
The ILP problem can be solved by using standard solvers.
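For very small boards, the same objective and coverage constraint can be explored without an ILP solver. The sketch below is not the ILP approach of the presentation: it swaps in a plain exhaustive branch-and-bound search, assumes a fully occupied board (M = 1 everywhere), and uses hypothetical names (`rect`, `solve_tiling`, `SHAPES`). Costs are the LUT tile costs from the efficiency table.

```python
from itertools import product


def rect(w: int, h: int):
    """Cells (dx, dy) of a w x h rectangular tile, i.e. where m is 1."""
    return {(dx, dy) for dx in range(w) for dy in range(h)}


def solve_tiling(X: int, Y: int, shapes):
    """Exhaustively find a minimum-cost tiling of a full X x Y board.

    shapes is a list of (cost_s, cells_s). Returns (cost, placements), where
    each placement (s, x0, y0) corresponds to a decision variable set to 1.
    """
    board = frozenset(product(range(X), range(Y)))
    candidates = []
    for s, (cost, cells) in enumerate(shapes):
        for x0, y0 in product(range(X), range(Y)):
            # Cells beyond the border are dropped: border overlap is allowed.
            covered = frozenset((x0 + dx, y0 + dy) for dx, dy in cells) & board
            if covered:
                candidates.append((cost, s, x0, y0, covered))

    best = {"cost": float("inf"), "placements": None}

    def search(uncovered, cost_so_far, chosen):
        if cost_so_far >= best["cost"]:
            return  # cannot improve on the best tiling found so far
        if not uncovered:
            best["cost"], best["placements"] = cost_so_far, list(chosen)
            return
        target = min(uncovered)  # this cell must be covered exactly once
        for cost, s, x0, y0, covered in candidates:
            if target in covered and covered <= uncovered:
                chosen.append((s, x0, y0))
                search(uncovered - covered, cost_so_far + cost, chosen)
                chosen.pop()

    search(board, 0.0, [])
    return best["cost"], best["placements"]


# Usage: tile a 2 x 2 board with 1 x 1, 1 x 2 and 2 x 1 LUT tiles.
SHAPES = [(1.65, rect(1, 1)), (2.3, rect(1, 2)), (2.3, rect(2, 1))]
best_cost, placed = solve_tiling(2, 2, SHAPES)
```

The search grows exponentially with the board size, so realistic sizes such as 53 × 53 call for the ILP formulation and a standard solver.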

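ILP Formulation
Graphical representation of the left-hand side of the ILP constraint:
[Figure: board with ticks 0 to 5 on both axes (x axis right to left, y axis bottom to top) and tile 0 placed at (1, 2). The cells are labelled with the products $m^0_{0,0} d^0_{1,2} = 1$, $m^0_{1,0} d^0_{1,2} = 1$, $m^0_{0,1} d^0_{1,2} = 1$, $m^0_{1,1} d^0_{1,2} = 1$, $m^0_{0,2} d^0_{1,2} = 1$ and $m^0_{0,3} d^0_{1,2} = 0$.]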

Additional DSP Constraint
The cost of DSP blocks is hard to compare with the cost of LUTs. It is better to constrain the DSP count for a given application. A single additional constraint can be used to specify the number of DSPs (#DSP):
$$\sum_{s=0}^{S-1} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} D_s\, d^s_{x,y} = \#\text{DSP}$$
where $D_s$ specifies the number of DSPs in multiplier shape s.
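The left-hand side of the DSP constraint is just a weighted count over the placed tiles. The following minimal sketch evaluates it for a list of placements `(s, x, y)`; the shape set, the placements and the name `dsp_count` are hypothetical values chosen for illustration.

```python
def dsp_count(placements, dsps_per_shape) -> int:
    """Sum of D_s over all placed tiles (the left-hand side of the constraint)."""
    return sum(dsps_per_shape[s] for s, _x, _y in placements)


# Hypothetical shape set: 0 = LUT-based tile, 1 = single DSP tile,
# 2 = super-tile built from two DSP blocks.
DSPS_PER_SHAPE = [0, 1, 2]
placements = [(2, 0, 0), (1, 17, 24), (0, 34, 0), (0, 34, 2)]

required_dsps = 3
constraint_holds = dsp_count(placements, DSPS_PER_SHAPE) == required_dsps
```

Because the constraint is an equality, each DSP budget yields its own optimization run, which matches the per-DSP-count results reported in the deck.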

Results
Four important cases were considered: 24 × 24 (single precision), 32 × 32, 53 × 53 (double precision) and 64 × 64. Each was evaluated for varying DSP counts up to a DSP-only implementation.

Resulting Tilings 24/32 Bit
[Figure: resulting tilings. Panels: 24 × 24 with 0, 1 and 2 DSPs; 32 × 32 with 0, 1, 2, 3 and 4 DSPs.]
Callouts on the figure:
Baugh-Wooley multiplier [Parandeh-Afshar 2011]
2 × k and 1 × 2 tiles perform best for LUT-based multiplication
Efficient solution utilizing two super-tiles

Resulting Tilings 53 Bit
[Figure: resulting tilings. Panels: 53 × 53 with 5, 6, 7, 8 and 9 DSPs.]
Callouts on the figure:
Pinwheel inside of a pinwheel
Logic multipliers consume 1/4 of the area compared to the previous hand-optimized design [de Dinechin 2009]

Resulting Tilings 64 Bit
[Figure: resulting tilings. Panels: 64 × 64 with 7, 8, 9, 10 and 11 DSPs.]

Optimization & Synthesis Results

Mult.     Method           #DSP   Area (geom.) logic mult.   Slices   Slice red.   f_clk [MHz]
24 × 24   [Brunie 2013]    1      216                        65       -            212.4
24 × 24   proposed         1      168                        58       10.8%        287.4
24 × 24   [Brunie 2013]    2      0                          0        -            418.9
24 × 24   proposed         2      0                          0        0.0%         418.9
32 × 32   [Banescu 2010]   0      1024                       339      -            275.8
32 × 32   proposed         0      1024                       276      18.6%        304.4
32 × 32   [Brunie 2013]    1      648                        205      -            192.8
32 × 32   [Banescu 2010]   1      616                        234      -            352.6
32 × 32   proposed         1      616                        180      12.2%        302.5
32 × 32   [Brunie 2013]    2      288                        94       -            270.1
32 × 32   proposed         2      256                        82       12.8%        338.0
32 × 32   [Brunie 2013]    3      135                        75       -            194.0
32 × 32   [Banescu 2010]   3      176                        75       -            426.6
32 × 32   proposed         3      64                         44       41.3%        314.5
32 × 32   [Brunie 2013]    4      0                          17       -            314.7
32 × 32   [Banescu 2010]   4      40                         38       -            379.4
32 × 32   proposed         4      0                          13       23.5%        181.7
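The "Slice red." figures appear to be measured against the smallest prior design with the same DSP count. That reading is inferred from the numbers rather than stated on the slide, and the helper name `slice_reduction` is hypothetical. A short Python sketch of the calculation:

```python
def slice_reduction(prior_slices, proposed_slices: int) -> float:
    """Percentage of slices saved relative to the best (smallest) prior design."""
    best_prior = min(prior_slices)
    if best_prior == 0:
        return 0.0  # nothing left to save, as in the 24 x 24, 2 DSP case
    return 100.0 * (best_prior - proposed_slices) / best_prior


# 32 x 32 with 1 DSP: 205 slices [Brunie 2013], 234 [Banescu 2010], 180 proposed.
reduction = round(slice_reduction([205, 234], 180), 1)
```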
