An Efficient Softcore Multiplier Architecture for Xilinx FPGAs



slide-1
SLIDE 1

An Efficient Softcore Multiplier Architecture for Xilinx FPGAs

22nd IEEE Symposium on Computer Arithmetic

Martin Kumm, Shahid Abbas and Peter Zipf

University of Kassel, Germany

slide-2
SLIDE 2

CONTENTS

1. State-of-the-art
2. Proposed multiplier
3. Results

slide-3
SLIDE 3

WHY FPGA SOFTCORE MULTIPLIERS?

The need for efficient multipliers led FPGA vendors to embed hard multiplier blocks.
FPGA softcore multipliers are still required for:
  • Small word sizes (poor mapping to embedded multipliers)
  • Large word sizes ("fill gaps")
  • Replacing embedded multipliers on small/low-cost FPGAs

slide-4
SLIDE 4

WHY THEY ARE DIFFERENT?

Research on efficient multipliers has been ongoing for more than 50 years.
Multipliers that are efficient in terms of gates may not be efficient on FPGAs.
FPGA-optimized structures are relatively rare.

slide-5
SLIDE 5

WHY THEY ARE DIFFERENT?

[Figure: Xilinx slice, 6/7 series]

slide-6
SLIDE 6

PREVIOUS WORK

[Figure: slice with four LUTs and carry logic]

A Baugh-Wooley-like multiplier was proposed in [Parandeh-Afshar 2011].
Two partial products are generated and added using the carry chain.
A compression tree is still necessary to add the already reduced partial products.

slide-7
SLIDE 7

PREVIOUS WORK

[Figure: the slice diagram from the previous slide, now annotated "full adder"]

slide-8
SLIDE 8

PREVIOUS WORK

Another idea was discussed in [Brunie 2013]:
  • Decompose the multiplication into small multipliers that fit into single LUTs, e.g., 3x3, 2x3, 1x4
  • Use a compression tree to add the partial results

\begin{align*}
p = {} & M_1 + 2^3 M_2 + 2^6 M_3 + \dots \\
       & \dots + 2^3 M_4 + 2^6 M_5 + 2^9 M_6 + \dots \\
       & \dots + 2^6 M_7 + 2^9 M_8 + 2^{12} M_9
\end{align*}
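The decomposition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FloPoCo implementation: the function names `split_chunks` and `tiled_multiply` are hypothetical, the operands are assumed unsigned, and the plain integer sum stands in for the compression tree.

```python
def split_chunks(x, width, chunk):
    """Split an unsigned `width`-bit integer into `chunk`-bit pieces, LSB first."""
    mask = (1 << chunk) - 1
    return [(x >> s) & mask for s in range(0, width, chunk)]


def tiled_multiply(a, b, width=9, chunk=3):
    """Multiply two unsigned integers by summing small chunk x chunk products.

    Each small product plays the role of one M_k; the running sum replaces
    the compression tree of the hardware design (an assumption of this sketch).
    """
    total = 0
    for i, a_i in enumerate(split_chunks(a, width, chunk)):
        for j, b_j in enumerate(split_chunks(b, width, chunk)):
            tile = a_i * b_j                      # small 3x3 product for chunk=3
            total += tile << (chunk * (i + j))    # weights 2^0, 2^3, 2^6, ...
    return total
```

For 9-bit operands and 3x3 tiles this yields nine small products with weights from 2^0 to 2^12, matching the sum above.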


slide-9
SLIDE 9

BOOTH RECODING

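\[ a \cdot b = \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} a \cdot BE_m 2^m \]

b_{m+1}  b_m  b_{m-1}  BE_m  z_m  c_m  s_m
   0      0      0       0    1    0    0
   0      0      1       1    0    0    0
   0      1      0       1    0    0    0
   0      1      1       2    0    1    0
   1      0      0      -2    0    1    1
   1      0      1      -1    0    0    1
   1      1      0      -1    0    0    1
   1      1      1       0    1    0    0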
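The recoding can be checked numerically with a short Python sketch. It assumes an unsigned M-bit operand b, with b_{-1} = 0 and all bits at positions M and above equal to 0; the function names `booth_digits` and `booth_multiply` are illustrative and do not come from the presented design.

```python
def booth_digits(b, M):
    """Radix-4 Booth digits BE_m for m = 0, 2, ..., M of an unsigned M-bit b."""
    # Bits outside 0..M-1 are treated as 0 (assumption: unsigned operand).
    bit = lambda k: (b >> k) & 1 if 0 <= k < M else 0
    return [-2 * bit(m + 1) + bit(m) + bit(m - 1) for m in range(0, M + 1, 2)]


def booth_multiply(a, b, M):
    """Compute a * b as the sum of a * BE_m * 2^m over even m."""
    return sum((a * d) << (2 * i) for i, d in enumerate(booth_digits(b, M)))
```

Every digit lies in {-2, -1, 0, 1, 2}, so each partial product a · BE_m needs only a shift and an optional negation of a.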


slide-10
SLIDE 10

BOOTH MULTIPLIER

[Figure: partial-product array of the Booth multiplier, LSB to MSB, with repeated entries c0, c2, c4, c6]

slide-11
SLIDE 11

BOOTH MULTIPLIER

[Figure: partial-product array of the Booth multiplier, LSB to MSB, with fewer c0, c2, c4, c6 entries and constant 1 entries]

slide-12
SLIDE 12

PROPOSED ARCHITECTURE

[Figure: proposed slice mapping with four LUTs and carry logic]

slide-13
SLIDE 13

PROPOSED ARCHITECTURE

[Figure: proposed slice mapping with four LUTs and carry logic, annotated "full adder"]

slide-14
SLIDE 14

PROPOSED ARCHITECTURE


slide-15
SLIDE 15

RESULTS

The number of slices can be predicted precisely:

\[ \#\text{slices}(M, N) = \underbrace{\lceil N/4 + 1 \rceil}_{\text{slices per row}} \cdot \underbrace{\lfloor M/2 + 1 \rfloor}_{\text{no. of rows}} \]

The design was implemented as generic VHDL.
A pipelined multiplier can be obtained by using the (otherwise unused) slice FFs at little additional cost.
Reference circuits (Parandeh-Afshar and LUT-based) were designed with the FloPoCo library [de Dinechin 2012].
Xilinx Coregen was used as a commercial reference.
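The slice-count formula can be evaluated directly. The helper below is a minimal sketch; the name `predicted_slices` is hypothetical, and M and N are the operand word sizes as used in the formula.

```python
def predicted_slices(M, N):
    """Slice count ceil(N/4 + 1) * floor(M/2 + 1), using integer arithmetic only."""
    slices_per_row = -(-N // 4) + 1   # ceil(N/4) + 1 == ceil(N/4 + 1)
    rows = M // 2 + 1                 # floor(M/2) + 1 == floor(M/2 + 1)
    return slices_per_row * rows
```

For M = N = 16, for example, the formula gives 5 slices per row and 9 rows, i.e. 45 slices.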


slide-16
SLIDE 16

RESULTS VIRTEX 6 COMBINATORIAL, SLICES

[Plot: #Slices (up to 2,000) versus input word size N (8 to 64). Legend: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]

slide-17
SLIDE 17

RESULTS VIRTEX 6 COMBINATORIAL, SLICE RED.

[Plot: Slice reduction (%) (up to 80) versus input word size N (8 to 64). Legend: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]

slide-18
SLIDE 18

RESULTS VIRTEX 6 COMBINATORIAL, FREQ.

[Plot: Frequency [MHz] (up to 700) versus input word size N (8 to 64). Legend: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]

slide-19
SLIDE 19

RESULTS VIRTEX 6 PIPELINED, SLICES

[Plot: #Slices (up to 2,000) versus input word size N (8 to 64). Legend: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]

slide-20
SLIDE 20

RESULTS VIRTEX 6 PIPELINED, SLICE RED.

[Plot: Slice reduction (%) (−10 to 80) versus input word size N (8 to 60). Legend: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed)]

slide-21
SLIDE 21

RESULTS VIRTEX 6 PIPELINED, FREQ.

[Plot: Frequency [MHz] (up to 700) versus input word size N (8 to 64). Legend: 1x4 LUT Multiplier, 3x2 LUT Multiplier, 3x3 LUT Multiplier, Parandeh-Afshar Multiplier, Coregen (area), Coregen (speed), proposed]

slide-22
SLIDE 22

UNFORTUNATELY NOT POSSIBLE ON ALTERA FPGAS

[Figure: Altera ALM]

slide-23
SLIDE 23

MAYBE POSSIBLE NEXT?


slide-24
SLIDE 24

CONCLUSION

Compared to the best known design, up to 50% of the slices can be saved for the combinatorial multiplier and 30% for the pipelined multiplier.
Portable to FPGAs that provide a 5-input LUT at one full-adder input.
"Free addition" supports the multiply-accumulate (MAC) operation.

slide-25
SLIDE 25

LITERATURE

[Parandeh-Afshar 2011]: Parandeh-Afshar & Ienne, "Measuring and Reducing the Performance Gap between Embedded and Soft Multipliers on FPGAs", FPL 2011
[Brunie 2013]: Brunie, de Dinechin, Istoan, Sergent, Illyes & Popa, "Arithmetic Core Generation Using Bit Heaps", FPL 2013
[de Dinechin 2012]: de Dinechin & Pasca, "Designing Custom Arithmetic Data Paths with FloPoCo", IEEE Design & Test of Computers, 2012

THANK YOU!


slide-26
SLIDE 26
slide-27
SLIDE 27

BOOTH RECODING

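\begin{align*}
b &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + b_1 2^1 + b_0 \\
  &= b_{M-1} 2^{M-1} + \dots + b_2 2^2 + 2 b_1 2^1 + \underbrace{-b_1 2^1 + b_0}_{BE_0 = -2b_1 + b_0} \\
  &= b_{M-1} 2^{M-1} + \dots + 2 b_3 2^3 + \underbrace{-b_3 2^3 + b_2 2^2 + 2 b_1 2^1}_{BE_2 \cdot 2^2 = (-2b_3 + b_2 + b_1) 2^2} + BE_0 \\
  &= \sum_{\substack{m=0 \\ m\ \text{even}}}^{M} BE_m 2^m
     \quad \text{with } BE_m = -2b_{m+1} + b_m + b_{m-1} \text{ and } b_{-1} = 0
\end{align*}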
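As a worked example of the recoding, take M = 4 and b = 11 = 1011 in binary, so b_3 = 1, b_2 = 0, b_1 = 1, b_0 = 1. The example assumes an unsigned operand, i.e. b_{-1} = 0 and b_4 = b_5 = 0:

\begin{align*}
BE_0 &= -2b_1 + b_0 + b_{-1} = -2 + 1 + 0 = -1 \\
BE_2 &= -2b_3 + b_2 + b_1 = -2 + 0 + 1 = -1 \\
BE_4 &= -2b_5 + b_4 + b_3 = 0 + 0 + 1 = 1 \\
\sum_{\substack{m=0 \\ m\ \text{even}}}^{4} BE_m 2^m &= -1 \cdot 2^0 - 1 \cdot 2^2 + 1 \cdot 2^4 = -1 - 4 + 16 = 11
\end{align*}

The digit at m = M is needed here because the top bit b_3 is set; without it the sum would be −5 instead of 11.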


slide-28
SLIDE 28

WHY THEY ARE DIFFERENT?

[Figure: Altera ALM]

slide-29
SLIDE 29

WHY THEY ARE DIFFERENT?

[Figure: slice schematic detail showing the D flip-flop/latch elements with pins D, CE, CK, SR, Q and the options INIT0/INIT1 and SRLO/SRHI]