
FPGA Multipliers. Bogdan PASCA, projet Arénaire, ENS-Lyon/INRIA/CNRS/Université de Lyon, France. RAIM'11, February 7-10, 2011. Outline: Background & Context; Algorithmic techniques for reducing DSP count of large multipliers.


slide-1
SLIDE 1

FPGA Multipliers

Bogdan PASCA

projet Arénaire, ENS-Lyon/INRIA/CNRS/Université de Lyon, France

RAIM’11 February 7-10, 2011

slide-2
SLIDE 2

Outline

  • Background & Context
  • Algorithmic techniques for reducing DSP count of large multipliers
      • Karatsuba-Ofman algorithm
      • Non-standard tilings
      • Squarers
      • Truncated multipliers
  • Conclusions

Bogdan PASCA FPGA Multipliers 1

slide-3
SLIDE 3

What’s an FPGA?

  • Field Programmable Gate Array
  • integrated circuit
  • has a regular architecture (hence "array")
  • logic elements can be programmed to perform various functions

slide-10
SLIDE 10

Modern FPGA Architecture

[Figure: FPGA floorplan with columns of RAM and DSP blocks among LUT-based logic elements; the DSP detail shows 18-bit multiplier inputs and a 17-bit shift]

  • a set of configurable logic elements
  • on-chip memory blocks
  • digital signal processing (DSP) blocks (including multipliers)
  • connected by a configurable wire network
  • all connected to the outside world by I/O pins

slide-14
SLIDE 14

What can we compute?

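[Figure: a 3-bit by 2-bit multiplier mapped onto logic: LUTs compute the partial-product bits l0..l2 and u0..u2 from x2..x0 and y1..y0, and full adders (FA) produce the product bits p4..p0]

$x_2x_1x_0 \times y_1y_0 = p_4p_3p_2p_1p_0$, obtained as the sum of the rows $l_2l_1l_0$ and $u_2u_1u_0$ (the latter shifted left by one bit), with

$l_0 = y_0 \wedge x_0$, $l_1 = y_0 \wedge x_1$, $l_2 = y_0 \wedge x_2$
$u_0 = y_1 \wedge x_0$, $u_1 = y_1 \wedge x_1$, $u_2 = y_1 \wedge x_2$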
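The circuit on this slide can be mimicked in software as a sanity check. The sketch below is only an illustration: the function name `and_array_multiply` is hypothetical, and the adder stage is modelled with ordinary integer addition rather than explicit full adders.

```python
def and_array_multiply(x: int, y: int) -> int:
    """Multiply a 3-bit x by a 2-bit y from AND-ed partial products."""
    assert 0 <= x < 8 and 0 <= y < 4
    xb = [(x >> i) & 1 for i in range(3)]  # x0, x1, x2
    yb = [(y >> i) & 1 for i in range(2)]  # y0, y1
    # Each partial-product bit is a 2-input AND (one LUT on the slide).
    l = [yb[0] & xi for xi in xb]  # l0, l1, l2
    u = [yb[1] & xi for xi in xb]  # u0, u1, u2
    lower = sum(bit << i for i, bit in enumerate(l))
    upper = sum(bit << i for i, bit in enumerate(u))
    # The u row comes from y1, so it carries weight 2.
    return lower + (upper << 1)


# Exhaustive check over all 3-bit by 2-bit operand pairs.
for x in range(8):
    for y in range(4):
        assert and_array_multiply(x, y) == x * y
```

The largest product is 7 × 3 = 21, which fits in the five bits p4..p0.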

slide-16
SLIDE 16

Need of DSP blocks

Multiplication in logic is expensive

  • n × n bit ≈ $n^2$ LUTs (partial products) + $n(n-1)$ LUTs (adder tree)
  • 18 × 18 bit ≈ 324 LUTs + 306 LUTs = 630 LUTs
  • 1 DSP block = 8 LEs (size on FPGA layout)

DSP blocks are a necessity in modern FPGAs

[Figure: DSP48 block diagram with 18-bit inputs A and B, 48-bit C and P ports, and two 17-bit shifts]
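The cost estimate above is easy to reproduce. This is a minimal sketch of the slide's rule of thumb only; the function name `soft_multiplier_luts` is an assumption made for illustration, and real synthesis results will differ.

```python
def soft_multiplier_luts(n: int) -> tuple[int, int, int]:
    """Estimate LUTs for an n x n multiplier built in logic.

    Returns (partial-product LUTs, adder-tree LUTs, total),
    following the n^2 + n(n-1) estimate from the slide.
    """
    partial_products = n * n
    adder_tree = n * (n - 1)
    return partial_products, adder_tree, partial_products + adder_tree


print(soft_multiplier_luts(18))  # (324, 306, 630)
```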

slide-17
SLIDE 17

DSP-Hungry Applications

  • "FPGA floating point performance – a pencil and paper evaluation" [1] → DSP blocks are a scarce resource for accelerating DP apps.
  • "Efficient reconfigurable design for pricing Asian options" [2] → LUTs 46%, RAM 4%, DSP 100% (192)
  • "Implementation and evaluation of an arithmetic pipeline on FLOPS-2D: multi-FPGA system" [3] → a) LE 30%, DSP 86%; b) LE 52%, DSP 88%; c) LE 63%, DSP 100%
  • "A temporal coding hardware implementation for spiking neural networks" [4] → 16 PE: LE 22%, RAM 3%, DSP 74% (100/136)

Four recipes for saving DSPs

[1] D. Strenski (HPCWire, 2007)
[2] Anson H.T. Tse, David B. Thomas, K. H. Tsoi, Wayne Luk (HEART'10)
[3] H. Morisita, K. Inakagata, Y. Osana, N. Fujita, H. Amano (HEART'10)
[4] Marco Nuno-Maganda, Cesar Torres-Huitzil (HEART'10)

slide-28
SLIDE 28

Perceiving Multiplications Visually

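[Figure: the multiplication of two 6-bit operands drawn as a board, with X and Y each split at bit 3 into chunks X0, X1 and Y0, Y1; example sub-products such as X2:0·Y2:0, X5:3·Y5:3 and X3:1·Y4:3 appear as rectangles]

$XY = X_0Y_0 + 2^{3}X_0Y_1 + 2^{3}X_1Y_0 + 2^{3+3}X_1Y_1$

  • classical binary multiplication
  • all sub-products can be properly located inside the diamond
  • rotate the diamond so as to obtain a rectangle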
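The four-rectangle decomposition of the 6-bit example can be checked exhaustively. The sketch below assumes 3-bit chunks as in the figure; the helper names `split` and `multiply_by_chunks` are illustrative, not taken from the talk.

```python
def split(v: int, k: int) -> tuple[int, int]:
    """Return (low, high) chunks such that v == (high << k) + low."""
    return v & ((1 << k) - 1), v >> k


def multiply_by_chunks(x: int, y: int, k: int = 3) -> int:
    """Sum the four sub-products, each shifted to its place on the board."""
    x0, x1 = split(x, k)
    y0, y1 = split(y, k)
    return (
        x0 * y0
        + ((x0 * y1) << k)
        + ((x1 * y0) << k)
        + ((x1 * y1) << (2 * k))
    )


# Every pair of 6-bit operands gives the exact product.
assert all(
    multiply_by_chunks(x, y) == x * y for x in range(64) for y in range(64)
)
```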

slide-29
SLIDE 29

Karatsuba-Ofman algorithm

trading multiplications for additions


slide-30
SLIDE 30

The Karatsuba-Ofman algorithm

Basic principle for two way splitting

  • split X and Y into two chunks: $X = 2^kX_1 + X_0$ and $Y = 2^kY_1 + Y_0$
  • computation goal: $XY = 2^{2k}X_1Y_1 + 2^k(X_1Y_0 + X_0Y_1) + X_0Y_0$
  • precompute $D_X = X_1 - X_0$ and $D_Y = Y_1 - Y_0$
  • make the observation: $X_1Y_0 + X_0Y_1 = X_1Y_1 + X_0Y_0 - D_XD_Y$
  • XY requires only 3 DSP blocks ($X_1Y_1$, $X_0Y_0$, $D_XD_Y$)
  • overhead: two k-bit and one 2k-bit subtraction
  • overhead ≪ DSP-block emulation
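The two-way splitting above can be sketched as follows. This is a software model under stated assumptions, not the hardware design: the name `karatsuba_two_way` is hypothetical, Python integers stand in for fixed-width signals, and the signed differences $D_X$, $D_Y$ are handled by Python's arbitrary-precision arithmetic rather than by sign-extension logic.

```python
def karatsuba_two_way(x: int, y: int, k: int) -> int:
    """Multiply x by y using three sub-multiplications instead of four."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    y0, y1 = y & mask, y >> k
    dx = x1 - x0  # may be negative
    dy = y1 - y0  # may be negative
    p_hi = x1 * y1  # multiplication 1
    p_lo = x0 * y0  # multiplication 2
    p_d = dx * dy   # multiplication 3
    # X1*Y0 + X0*Y1 recovered without a fourth multiplication.
    middle = p_hi + p_lo - p_d
    return (p_hi << (2 * k)) + (middle << k) + p_lo


# 34-bit operands with k = 17, including cases where DX or DY is negative.
for x, y in [(0, 0), (1, (1 << 34) - 1), ((1 << 34) - 1, (1 << 34) - 1),
             (0x1FFFF, 0x3FFFE0000), (123456789, 9876543210)]:
    assert karatsuba_two_way(x, y, 17) == x * y
```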

slide-45
SLIDE 45

Visual Interpretation

[Figure: Karatsuba-Ofman on the multiplication board, for operands split two ways (X0..X1, Y0..Y1), three ways (X0..X2, Y0..Y2) and four ways (X0..X3, Y0..Y3)]

slide-46
SLIDE 46

Implementation

fairly trivial starting from the equation: $XY = 2^{2k}X_1Y_1 + 2^k(X_1Y_1 + X_0Y_0 - D_XD_Y) + X_0Y_0$

[Figure: datapath of the 34x34-bit multiplier: the 17-bit chunks X0, X1, Y0, Y1 feed DSP48 blocks (18-bit inputs) that produce X0Y0, X0Y0 − DXDY and X1Y1 + X0Y0 − DXDY, from which the 68-bit product P is assembled]

  • 34x34-bit multiplier using Virtex-4 DSP48
  • $X_1Y_1 + X_0Y_0 - D_XD_Y$ is implemented inside the DSPs
  • need to recover $X_1Y_1$ with a subtraction

slide-47
SLIDE 47

Results

            latency  frequency (MHz)  slices [5]  DSPs
  LogiCore     6          447             26        4
  LogiCore     3          176             34        4
  K-O-2        3          317             95        3

Table: 34x34-bit multipliers on Virtex-4

trade-off: 1 DSP (>630 Logic Elements) for 138 Logic Elements

            latency  frequency (MHz)  slices  DSPs
  LogiCore    11          353           185     9
  LogiCore     6          264           122     9
  K-O-3        6          317           331     6

Table: 51x51-bit multipliers on Virtex-4

trade-off: 3 DSPs (>1890 Logic Elements) for 292 Logic Elements

[5] On Virtex-4 devices, 1 slice = 2 Logic Elements
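The trade-off figures can be re-derived from the tables. The sketch below rests on an assumption the slide does not state: the quoted Logic Element counts are reproduced when the Karatsuba-Ofman design is compared with the higher-latency LogiCore row of each table. The function name `trade_off` and both constants' names are illustrative.

```python
LE_PER_SLICE = 2             # footnote: 1 Virtex-4 slice = 2 Logic Elements
LE_PER_DSP_MULTIPLIER = 630  # 18x18 logic multiplier estimate quoted on the slide


def trade_off(slices_ref: int, dsps_ref: int,
              slices_new: int, dsps_new: int) -> tuple[int, int, int]:
    """Return (DSPs saved, their logic equivalent in LEs, extra LEs spent)."""
    dsps_saved = dsps_ref - dsps_new
    extra_le = LE_PER_SLICE * (slices_new - slices_ref)
    return dsps_saved, dsps_saved * LE_PER_DSP_MULTIPLIER, extra_le


print(trade_off(26, 4, 95, 3))    # 34x34: (1, 630, 138)
print(trade_off(185, 9, 331, 6))  # 51x51: (3, 1890, 292)
```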

slide-48
SLIDE 48

Non-Standard tilings

new multiplication algorithms


slide-56
SLIDE 56

Non-standard tilings

  • optimize use of rectangular multipliers on Virtex-5 (25x18 signed)
  • classical decomposition may produce suboptimal results
      • chunk size for X is 24
      • chunk size for Y is 17
  • translate the operand decomposition into a tiling problem

[Figure: a 6x6-bit multiplication board covered by five rectangular tiles]

$XY = X_{3:0}Y_{2:0} + 2^{3}X_{0}Y_{5:3} + 2^{3+1}X_{3:1}Y_{4:3} + 2^{1+5}X_{3:1}Y_{5} + 2^{4}X_{5:4}Y_{5:0}$
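A tiling is valid when its rectangles cover every cell of the board exactly once; the product is then the sum of the tile products, each weighted by the position of its lower corner. The sketch below checks both properties for the five-tile 6x6 example. The helper names (`bits`, `TILES`, `tiled_product`, `covers_board_exactly`) are illustrative assumptions.

```python
def bits(v: int, hi: int, lo: int) -> int:
    """Extract the field v[hi:lo], both bounds inclusive."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


# One entry per tile: (x_hi, x_lo, y_hi, y_lo), read off the equation.
TILES = [
    (3, 0, 2, 0),
    (0, 0, 5, 3),
    (3, 1, 4, 3),
    (3, 1, 5, 5),
    (5, 4, 5, 0),
]


def tiled_product(x: int, y: int) -> int:
    """Sum the tile products, each shifted by x_lo + y_lo."""
    return sum(
        (bits(x, xh, xl) * bits(y, yh, yl)) << (xl + yl)
        for xh, xl, yh, yl in TILES
    )


def covers_board_exactly(n: int = 6) -> bool:
    """True when every (i, j) cell of the n x n board lies in exactly one tile."""
    seen = [[0] * n for _ in range(n)]
    for xh, xl, yh, yl in TILES:
        for i in range(xl, xh + 1):
            for j in range(yl, yh + 1):
                seen[i][j] += 1
    return all(count == 1 for row in seen for count in row)


assert covers_board_exactly()
assert all(tiled_product(x, y) == x * y for x in range(64) for y in range(64))
```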

slide-57
SLIDE 57

Tilings

Performing a 53 × 53-bit multiplication on Virtex5

[Figure: three tilings of the multiplication board: (a) standard tiling, (b) LogiCore tiling, (c) proposed tiling with tiles M1 to M8]

  • standard tiling ≡ classical decomposition (12 DSPs)
  • LogiCore 11.1 tiling uses 10 DSPs (4 DSPs used as 17x17-bit)
  • our proposed tiling does it in 8 DSPs and a few LUTs
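The 12-DSP figure for the standard tiling follows from cutting one operand into 24-bit chunks and the other into 17-bit chunks, the unsigned sizes of the 25x18 signed multiplier. A minimal sketch, with the function name `standard_tiling_dsps` assumed for illustration:

```python
def standard_tiling_dsps(wx: int, wy: int,
                         chunk_x: int = 24, chunk_y: int = 17) -> int:
    """Count the tiles of a classical grid decomposition of a wx x wy board."""
    tiles_x = -(-wx // chunk_x)  # ceiling division
    tiles_y = -(-wy // chunk_y)
    return tiles_x * tiles_y


print(standard_tiling_dsps(53, 53))  # 12
```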

slide-58
SLIDE 58

Tiling Architecture - 53x53bit

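[Figure: the proposed tiling of the 58x58 board, tiles M1 to M8]

$$
\begin{aligned}
XY ={} & \underbrace{X_{0:23}Y_{0:16}}_{M_1} + 2^{17}\Big(\underbrace{X_{0:23}Y_{17:33}}_{M_2} + 2^{17}\big(\underbrace{X_{0:16}Y_{34:57}}_{M_3} + 2^{17}\underbrace{X_{17:33}Y_{34:57}}_{M_4}\big)\Big) \\
& + 2^{24}\bigg(\underbrace{X_{24:40}Y_{0:23}}_{M_8} + 2^{17}\Big(\underbrace{X_{41:57}Y_{0:23}}_{M_7} + 2^{17}\big(\underbrace{X_{34:57}Y_{24:40}}_{M_6} + 2^{17}\underbrace{X_{34:57}Y_{41:57}}_{M_5}\big)\Big)\bigg) \\
& + 2^{48}X_{24:33}Y_{24:33}
\end{aligned}
$$

  • $X_{24:33}Y_{24:33}$ (10x10 multiplier) is probably best implemented in LUTs
  • the parenthesization makes best use of the DSP48E internal adders (17-bit shifts)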
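The nested form of the equation can be checked in software. The sketch below mirrors the two chains of 17-bit shifts and the central 10x10 term; it models arithmetic only and says nothing about pipelining. The names `field` and `tiled_58x58` are assumptions, and `field(v, lo, hi)` follows the slide's low:high index order.

```python
import random


def field(v: int, lo: int, hi: int) -> int:
    """Extract bits lo..hi of v (inclusive), in the slide's low:high order."""
    return (v >> lo) & ((1 << (hi - lo + 1)) - 1)


def tiled_58x58(x: int, y: int) -> int:
    m1 = field(x, 0, 23) * field(y, 0, 16)
    m2 = field(x, 0, 23) * field(y, 17, 33)
    m3 = field(x, 0, 16) * field(y, 34, 57)
    m4 = field(x, 17, 33) * field(y, 34, 57)
    m5 = field(x, 34, 57) * field(y, 41, 57)
    m6 = field(x, 34, 57) * field(y, 24, 40)
    m7 = field(x, 41, 57) * field(y, 0, 23)
    m8 = field(x, 24, 40) * field(y, 0, 23)
    centre = field(x, 24, 33) * field(y, 24, 33)  # 10x10, LUT candidate
    # Each chain adds one tile, then shifts the running sum by 17 bits.
    chain_a = m1 + ((m2 + ((m3 + (m4 << 17)) << 17)) << 17)
    chain_b = (m8 + ((m7 + ((m6 + (m5 << 17)) << 17)) << 17)) << 24
    return chain_a + chain_b + (centre << 48)


rng = random.Random(0)
top = (1 << 58) - 1
assert tiled_58x58(top, top) == top * top
for _ in range(1000):
    x, y = rng.getrandbits(58), rng.getrandbits(58)
    assert tiled_58x58(x, y) == x * y
```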

slide-59
SLIDE 59

Tiling Results

58x58 multipliers on Virtex-5 (5vlx50ff676-3) [6]

            latency  Freq.  REGs  LUTs  DSPs
  LogiCore    14      440    300   249   10
  LogiCore     8      338    208   133   10
  LogiCore     4       95    208    17   10
  Tiling       4      366    247   388    8

Remarks

  • save 2 DSP48E for a few LUTs/REGs
  • huge latency saving at a comparable frequency
  • good use of internal adders due to the 17-bit shifts

[6] Results for 53 bits are almost identical

slide-60
SLIDE 60

Squarers

simple methods to save resources


slide-63
SLIDE 63

Squarers

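  • appear in norms, statistical computations, polynomial evaluation...
  • a dedicated squarer saves as many DSP blocks as the Karatsuba-Ofman algorithm, but without its overhead*.

Squaring with k = 17 on a Virtex-4

[Figure: squaring board for two-way splitting, with diagonal tiles X0², X1² and the off-diagonal tile X0X1 appearing twice]

$(2^kX_1 + X_0)^2 = 2^{2k}X_1^2 + 2 \cdot 2^kX_1X_0 + X_0^2$

[Figure: squaring board for three-way splitting, with diagonal tiles X0², X1², X2² and the off-diagonal tiles X0X1, X0X2, X1X2 each appearing twice]

$(2^{2k}X_2 + 2^kX_1 + X_0)^2 = 2^{4k}X_2^2 + 2^{2k}X_1^2 + X_0^2 + 2 \cdot 2^{3k}X_2X_1 + 2 \cdot 2^{2k}X_2X_0 + 2 \cdot 2^kX_1X_0$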
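Both identities can be sketched directly: each off-diagonal product is computed once and its factor of 2 is folded into the shift. The function names `square_two_way` and `square_three_way` are illustrative assumptions; the code models the arithmetic only.

```python
def square_two_way(x: int, k: int = 17) -> int:
    """Square x with 3 multiplications (a plain 2x2 split needs 4)."""
    mask = (1 << k) - 1
    x0, x1 = x & mask, x >> k
    return ((x1 * x1) << (2 * k)) + ((x1 * x0) << (k + 1)) + x0 * x0


def square_three_way(x: int, k: int = 17) -> int:
    """Square x with 6 multiplications (a plain 3x3 split needs 9)."""
    mask = (1 << k) - 1
    x0, x1, x2 = x & mask, (x >> k) & mask, x >> (2 * k)
    squares = ((x2 * x2) << (4 * k)) + ((x1 * x1) << (2 * k)) + x0 * x0
    # The "+ 1" in each shift is the factor 2 of the doubled cross terms.
    cross = (
        ((x2 * x1) << (3 * k + 1))
        + ((x2 * x0) << (2 * k + 1))
        + ((x1 * x0) << (k + 1))
    )
    return squares + cross


assert square_two_way((1 << 34) - 1) == ((1 << 34) - 1) ** 2
assert square_three_way((1 << 51) - 1) == ((1 << 51) - 1) ** 2
```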

slide-65
SLIDE 65

However ...

With k = 17: $(2^kX_1 + X_0)^2 = 2^{34}X_1^2 + 2^{18}X_1X_0 + X_0^2$

  • the previous equation requires shifts of 0, 18 and 34
  • the DSP48 blocks of the Virtex-4 allow shifts of 17, so the internal adders remain unused

Workaround for ≤ 33-bit multiplications

  • rewrite the equation: $(2^{17}X_1 + X_0)^2 = 2^{34}X_1^2 + 2^{17}(2X_1)X_0 + X_0^2$
  • compute $2X_1$ by shifting $X_1$ by one bit before inputting it into the DSP48 block
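The rewritten equation uses only shifts of 0, 17 and 34. The sketch below models it for operands of at most 33 bits. One reading of that limit, which the slide does not spell out, is that $2X_1$ must still fit a 17-bit unsigned operand, so $X_1$ can have at most 16 bits; that reading, and the name `square_33bit`, are assumptions.

```python
def square_33bit(x: int) -> int:
    """Square x (at most 33 bits) using shifts that are multiples of 17."""
    assert 0 <= x < (1 << 33)
    x0 = x & ((1 << 17) - 1)  # low 17 bits
    x1 = x >> 17              # at most 16 bits
    two_x1 = x1 << 1          # at most 17 bits (assumed operand limit)
    assert two_x1 < (1 << 17)
    return ((x1 * x1) << 34) + ((two_x1 * x0) << 17) + x0 * x0


assert square_33bit((1 << 33) - 1) == ((1 << 33) - 1) ** 2
```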

slide-66
SLIDE 66

Results – 32-bit and 53-bit squarers on Virtex-4

  bits  design    latency  frequency  slices  DSPs
   32   LogiCore     6       489        59      4
   32   LogiCore     3       176        34      4
   32   Squarer      3       317        18      3
   53   LogiCore    18       380       279     16
   53   LogiCore     7       176       207     16
   53   Squarer      7       317       332      6

  • DSPs saved without any overhead
  • an impressive 10 DSPs saved for the double-precision squarer

slide-67
SLIDE 67

Squarers on Virtex5 using tilings

the tiling technique can be extended to squaring

[Figure: two tilings of a 53x53 squaring board, one with tiles M1 to M6 and one with tiles M1 to M5; darker squares mark cells covered twice]

Issues

  • darker squares are computed twice and thus need to be removed
  • thanks to symmetry, a diagonal multiplication of size n should consume only $n(n+1)/2$ LUTs instead of $n^2$

slide-68
SLIDE 68

Truncated multipliers


slide-71
SLIDE 71

Truncated multipliers

Classical technique

  • reduce resources, delay, or power consumption
  • controlled accuracy degradation

[Figure: partial-product array of A × B with labels u, v, k, d and n − k]

  • remove d of the least-significant columns
  • keep the error smaller than $2^k$

slide-73
SLIDE 73

Error budget

[Figure: partial-product array of A × B with labels u, v, k, d and n − k]

$E_{total} = E_{approx} + E_{round} \le 2^k$

$E_{round}$: caused by rounding the (n − d)-bit result to n − k bits

  • use a compensation bit to center the error
  • round to nearest bounds $E_{round} \le 2^{k-1}$

$E_{approx}$: caused by the truncation of the d columns

  • $0 \le E_{approx} \le \sum_{i=1}^{d} i\,2^{i-1}$
  • $E_{approx} < 2^{k-1} \rightarrow d = f(k)$

[Table or plot of discarded columns (d) against precision k; values not recoverable]
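The relation d = f(k) can be tabulated from the bound above. This sketch assumes d does not exceed the operand width n, so that column i really holds i partial-product bits; the function names are illustrative, and since the slide's own values are not recoverable the numbers produced here are a derivation from the stated bound, not a reproduction of the slide.

```python
def approx_error_bound(d: int) -> int:
    """Worst case of dropping the d lowest columns: sum of i * 2^(i-1)."""
    return sum(i << (i - 1) for i in range(1, d + 1))


def max_discarded_columns(k: int) -> int:
    """Largest d whose worst-case truncation error stays below 2^(k-1)."""
    d = 0
    while approx_error_bound(d + 1) < (1 << (k - 1)):
        d += 1
    return d


# The sum has the closed form (d - 1) * 2^d + 1.
assert all(approx_error_bound(d) == (d - 1) * (1 << d) + 1 for d in range(1, 40))

for k in (4, 10, 17):
    print(k, max_discarded_columns(k))
```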

slide-76
SLIDE 76

Tiling the truncated board

[Figure: three tilings of the truncated board with DSP tiles M1 to M4 (M1 to M3 in the last one); k and d mark the truncation columns]

  • Sol 1: tile and discard columns (save additions)
      • wastes DSPs
  • Sol 2: use a softcore multiplier (trade a DSP for logic)
  • Best: tile with softcore multipliers so that $E_{approx} \le 2^{k-1}$
      • use the extra precision for free

slide-77
SLIDE 77

Reality Check – faithfully rounding

[Figure: mantissa multipliers for SP, DP, QP on Virtex-4 (left) and Virtex-5 (right)]

  FPGA      Prec.  Latency, Freq.        Resources
  Virtex-5  DP      6 cycles @ 414 MHz   320 LUT, 302 REG, 5 DSP
            QP     20 cycles @ 334 MHz   2497 LUT, 2321 REG, 19 DSP
            QP     14 cycles @ 245 MHz   2249 LUT, 1576 REG, 19 DSP
  Virtex-4  DP     11 cycles @ 368 MHz   358 slices, 7 DSP
            QP     21 cycles @ 368 MHz   1735 slices, 26 DSP

Virtex-4

  • DP: reduce DSPs from 10 to 7 while also reducing slice count
  • QP: reduce DSPs from 49 to 26 without any slice penalty

Virtex-5

  • DP: reduce DSPs from 6 to 5, with roughly half the LUTs and REGs
  • QP: reduce DSPs from 34 to 19 at a small increase in logic resources

slide-78
SLIDE 78

Another point of view

  • (wE, wF) correctly rounded has the same accuracy as (wE, wF + 1) faithfully rounded → in FPGAs the extra bit comes for free*
  • truncate multipliers when IEEE-754 compliance is not needed
  • function approximation by polynomial evaluation

$\log_2(1 + x)$ (53-bit)

  default                                    27 DSPs
  optimized Horner                           23 DSPs
  optimized Horner + truncated multipliers   11* DSPs

slide-80
SLIDE 80

Conclusion

  • save DSPs by exploiting the flexibility of the FPGA
  • Karatsuba-Ofman reduces DSP cost at a small price in logic elements
  • tiling techniques adapt better to asymmetric DSPs
  • dedicated squarers significantly reduce DSP count
  • control accuracy and save DSPs using truncated multipliers

slide-81
SLIDE 81

Thank you for your attention !

http://flopoco.gforge.inria.fr/

Questions?