A New Family of High-Performance Parallel Decimal Multipliers*


SLIDE 1

ARITH 18 - Montpellier, France. June 25-27, 2007

A New Family of High-Performance Parallel Decimal Multipliers*

Alvaro Vázquez, Elisardo Antelo

  • Dept. of Electronic and Computer Science

University of Santiago de Compostela Spain alvaro@dec.usc.es elisardo@dec.usc.es

Paolo Montuschi

  • Dept. of Computer Engineering

Politecnico di Torino Italy montuschi@polito.it

*A. Vázquez and E. Antelo supported in part by the Ministry of Science and Technology of Spain under contract TIN2004-07797-C02 and Xunta de Galicia under contract PGIDT03TIC10502PR.

SLIDE 2

Outline

  • Introduction. Previous work.
  • Implementation of decimal parallel multiplication:

    – Fast carry-save addition using non-conventional BCD.
    – Design of high-performance decimal p:2 CSAs.
    – Parallel partial product generation.

  • Architectures.

    – Signed-digit (SD) radix-10.
    – SD radix-4/radix-5 (combined binary/decimal).

  • Evaluation and Comparison.
  • Conclusions.

SLIDE 3

Introduction

  • High-performance decimal floating-point units.
  • Parallel multiplier: scaling performance by pipelining.
  • Multiplication stages:
    1. Generation of partial products (PPG).
    2. Reduction of partial products (PPR).
    3. Conversion to non-redundant representation.
  • Problems of decimal implementation:

    – High value range of decimal digits (0-9): affects PPG.
    – Inefficiency of conventional BCD coding: affects PPG and PPR.

SLIDE 4

Previous Work on Decimal Multiplication

  • Previous proposals for PPG

    1. Direct generation of partial products (digit-by-digit).
    2. Using multiplicand multiples (X, 2X, 3X, 4X, …, 9X).

        – Direct implementation.
        – SD multiplier. [Ex. 2 radix-5 digits: (-5X, 5X) and (-2X, -X, X, 2X)]

  • Previous proposals for PPR

    1. Carry-save BCD-8421.

        a. Full BCD operands (3:2 CSAs + correction).
        b. Carry operand of 1 bit for each 4-bit digit (4-bit decimal CPAs).

    2. Signed-digit representation for decimal digits.

        – SD adders are more complex than CSA-based implementations.

SLIDE 5

Proposed techniques

1. Decimal carry-save addition using BCD-4221.
2. Implementation of decimal CSAs for PPR.
3. Implementation of PPG using multiplier recoding:

    – SD radix-10.
    – SD radix-4.
    – SD radix-5.

  • X multiplicand, Y multiplier: BCD integer words.
  • BCD digit represented as:

$Z_i = \sum_{j=0}^{3} z_{i,j}\, r_j$

    – BCD-8421: $r_j = 2^j$
    – BCD-4221: $(r_3, r_2, r_1, r_0) = (4, 2, 2, 1)$
    – BCD-5211: $(r_3, r_2, r_1, r_0) = (5, 2, 1, 1)$
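The weighted-digit definition above can be tried out with a short Python sketch. The function names `digit_value` and `codes_for` are illustrative and not from the slides; only the weight vectors are taken from the text.

```python
# Weight vectors (r3, r2, r1, r0) as listed on the slide.
WEIGHTS = {
    "BCD-8421": (8, 4, 2, 1),
    "BCD-4221": (4, 2, 2, 1),
    "BCD-5211": (5, 2, 1, 1),
}

def digit_value(bits, coding):
    """Return Z_i = sum_j z_ij * r_j for a 4-bit string such as '1001' (z3 first)."""
    return sum(int(z) * r for z, r in zip(bits, WEIGHTS[coding]))

def codes_for(value, coding):
    """List every 4-bit pattern that represents `value` under `coding`."""
    patterns = (format(n, "04b") for n in range(16))
    return [p for p in patterns if digit_value(p, coding) == value]

print(digit_value("1001", "BCD-8421"))  # 9
print(digit_value("1001", "BCD-4221"))  # 5
print(digit_value("1001", "BCD-5211"))  # 6
print(codes_for(6, "BCD-4221"))         # ['1010', '1100']
```

Under BCD-4221 and BCD-5211 the weights sum to 9, so all 16 bit patterns are valid digits, and some digits have more than one code.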

SLIDE 6

Decimal carry-save addition (BCD-8421)

3:2 CSA

  • Add 3 decimal digits to produce 2 decimal digits (sum and carry digits).

$A_i + B_i + C_i = S_i + 2H_i$

$s_{i,j} = a_{i,j} \oplus b_{i,j} \oplus c_{i,j}$
$h_{i,j} = a_{i,j} b_{i,j} + (a_{i,j} + b_{i,j})\, c_{i,j}$

Desired range: $A_i, B_i, C_i, S_i, H_i \in [0, 9]$

Example (bit weights 8 4 2 1):

    A_i  =  5 = 0101
    B_i  =  6 = 0110
    C_i  =  9 = 1001
    S_i  = 10 = 1010
    H_i  =  5 = 0101
    2H_i = 10 (H_i shifted left by 1 bit)

$A_i + B_i + C_i = S_i + 2H_i = 20$

PROBLEM WITH BCD-8421

  • Input digits are in [0,9], BUT the sum digit can fall outside the decimal range: $S_i \in [0, 15]$.
  • $2H_i \in [0, 18]$ and is even.
  • Sum digits require correction.
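The problem can be reproduced with a few lines of Python. This is a minimal sketch of the bitwise 3:2 step using the two equations above. The helper name `csa_bits` is illustrative.

```python
def csa_bits(a, b, c):
    """Bitwise 3:2 carry-save step on 4-bit integers: returns (sum, carry)."""
    s = a ^ b ^ c                      # s_j = a_j xor b_j xor c_j
    h = (a & b) | ((a | b) & c)        # h_j = a_j b_j + (a_j + b_j) c_j
    return s, h

# BCD-8421 codes coincide with the plain binary values of the digits.
s, h = csa_bits(5, 6, 9)
print(s, h, s + 2 * h)                 # 10 5 20

# Sum digits that occur over all valid BCD-8421 input triples.
sums = {csa_bits(a, b, c)[0]
        for a in range(10) for b in range(10) for c in range(10)}
print(max(sums))                       # 15
```

The identity A + B + C = S + 2H always holds, but S is not confined to [0,9], which is why a BCD-8421 carry-save adder needs a correction step.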

SLIDE 7

Decimal carry-save addition (BCD-4221)

3:2 CSA

  • Add 3 decimal digits to produce 2 decimal digits (sum and carry digits).

$A_i + B_i + C_i = S_i + 2H_i = S_i + \text{L1-shift}(W_i)$

$s_{i,j} = a_{i,j} \oplus b_{i,j} \oplus c_{i,j}$
$h_{i,j} = a_{i,j} b_{i,j} + (a_{i,j} + b_{i,j})\, c_{i,j}$

$A_i, B_i, C_i, S_i, H_i, W_i \in [0, 9]$

SOLUTION WITH BCD-4221

  • Input digits are in [0,9] and the sum digit is always in [0,9].
  • $W_i$ is $H_i$ recoded to BCD-5211; a 1-bit left shift of $W_i$, read as BCD-4221, gives $2H_i$.

Example (bit weights 4 2 2 1):

    A_i  =  5 = 1001
    B_i  =  6 = 1100
    C_i  =  9 = 1111
    S_i  =  6 = 1010
    H_i  =  7 = 1101
    W_i  =  7 = 1100 (BCD-5211)
    2H_i = 14 = 1100- (L1-shift of W_i; the leading bit moves into the next digit)

$A_i + B_i + C_i = S_i + 2H_i = 20$
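A minimal Python sketch of one BCD-4221 digit step, under stated assumptions: the table `TO_5211` picks one valid BCD-5211 code per digit value (several digits have two valid codes, so other choices work equally well), and all function names are illustrative.

```python
W4221 = (4, 2, 2, 1)

def value(code, weights):
    """Decode a 4-bit code (int 0..15) with the given weights, MSB first."""
    return sum(w for k, w in enumerate(weights) if code >> (3 - k) & 1)

# One valid BCD-5211 code per digit value (an arbitrary choice for this sketch).
TO_5211 = {0: 0b0000, 1: 0b0001, 2: 0b0100, 3: 0b0101, 4: 0b0111,
           5: 0b1000, 6: 0b1010, 7: 0b1011, 8: 0b1110, 9: 0b1111}

def csa_4221(a, b, c):
    """3:2 step on BCD-4221 codes. Returns (S_i, H_i, W_i, value of 2*H_i)."""
    s = a ^ b ^ c
    h = (a & b) | ((a | b) & c)
    w = TO_5211[value(h, W4221)]       # recode H_i into BCD-5211
    shifted = w << 1                   # L1-shift: a 5-bit result
    # Bit 4 spills into the next digit (worth 10); the low 4 bits are BCD-4221.
    doubled = 10 * (shifted >> 4) + value(shifted & 0b1111, W4221)
    return s, h, w, doubled

s, h, w, doubled = csa_4221(0b1001, 0b1100, 0b1111)   # 5 + 6 + 9
print(value(s, W4221), value(h, W4221), format(w, "04b"), doubled)  # 6 7 1011 14
```

Here the recoder returns 1011 for the digit 7 where the slide example shows 1100; both are BCD-5211 codes for 7 and both shift to a value of 14.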

SLIDE 8

Decimal carry-save addition (BCD-5211)

3:2 CSA

  • Add 3 decimal digits to produce 2 decimal digits (sum and carry digits).

$A_i + B_i + C_i = S_i + 2H_i = S_i + [\text{L1-shift}(H_i)]_{\text{BCD-4221}}$

$s_{i,j} = a_{i,j} \oplus b_{i,j} \oplus c_{i,j}$
$h_{i,j} = a_{i,j} b_{i,j} + (a_{i,j} + b_{i,j})\, c_{i,j}$

$A_i, B_i, C_i, S_i, H_i \in [0, 9]$

SOLUTION WITH BCD-5211

  • Input digits are in [0,9] and the sum digit is always in [0,9].

Example (bit weights 5 2 1 1):

    A_i  =  5 = 1000
    B_i  =  6 = 1001
    C_i  =  9 = 1111
    S_i  =  8 = 1110
    H_i  =  6 = 1001
    2H_i = 12 = 1001- (L1-shift of H_i, coded in BCD-4221)
    2H_i = 12 = 1010- (recoded to BCD-5211)

$A_i + B_i + C_i = S_i + 2H_i = 20$

SLIDE 9

Decimal multiplication by $\pm 2^n$ and $\pm 5^n$

  • Multiplication by 2: digit recoding BCD-4221 to BCD-5211, followed by an L1-shift (1-bit wired left shift). The shifted word is coded in BCD-4221.

    Example: 25 x 2 = 50.

  • Multiplication by 5: L3-shift (3-bit wired left shift) of the BCD-4221 operand. The shifted word is coded in BCD-5211; digit recoding returns it to BCD-4221.

    Example: 25 x 5 = 125.

  • Multiplication by 10 and 100: wired shifts by whole digits (4 bits per digit).

  • Negative operands (10's complement) by bit inversion, as in 2's complement: bit-complement every BCD-4221 digit and add a hot one (+1).

    Example: 0596 = 0000 1001 1111 1100 (BCD-4221); its bit-complement is 1111 0110 0000 0011 = 9403 (BCD-4221), so -596 = -10000 + 9403 + 1.
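The shift rules on this slide can be checked with a small functional model. This is a sketch, not the hardware: words are Python integers holding 4 bits per decimal digit, `recode` is modelled by decoding and re-encoding (a real recoder is a small combinational circuit), and all helper names are illustrative.

```python
W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)

def value(code, weights):
    """Decode one 4-bit code with the given weights, MSB first."""
    return sum(w for k, w in enumerate(weights) if code >> (3 - k) & 1)

def encode(n, weights, digits):
    """Pack `digits` decimal digits of n into a word, 4 bits per digit."""
    word = 0
    for pos in range(digits):
        d = n // 10**pos % 10
        code = next(c for c in range(16) if value(c, weights) == d)
        word |= code << (4 * pos)
    return word

def decode(word, weights, digits):
    """Read `digits` 4-bit fields of `word` as decimal digits."""
    return sum(value(word >> (4 * pos) & 0xF, weights) * 10**pos
               for pos in range(digits))

def times2(x, digits):
    """2X: recode BCD-4221 to BCD-5211, shift left 1 bit; read as BCD-4221."""
    return encode(decode(x, W4221, digits), W5211, digits) << 1

def times5(x):
    """5X: shift the BCD-4221 word left 3 bits; read as BCD-5211."""
    return x << 3

def nines_complement(x, digits):
    """Invert every bit of a BCD-4221 word: each digit d becomes 9 - d."""
    return ~x & ((1 << (4 * digits)) - 1)

x = 0b0100_1001                                   # 25 in BCD-4221, as on the slide
print(decode(times2(x, 2), W4221, 3))             # 50
print(decode(times5(x), W5211, 3))                # 125
y = encode(596, W4221, 4)
print(decode(nines_complement(y, 4), W4221, 4))   # 9403
```

The doubling works because the BCD-5211 weights (5, 2, 1, 1) times 2 are (10, 4, 2, 2), which are exactly the BCD-4221 weights one bit position to the left. Likewise (4, 2, 2, 1) times 5 gives (20, 10, 10, 5), the BCD-5211 weights three positions to the left.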

SLIDE 10

Proposed decimal 3:2 CSA (BCD-4221)

$A_i + B_i + C_i = S_i + 2H_i = S_i + \text{L1-shift}(W_i)$

SLIDE 11

Proposed decimal 3:2 CSA (BCD-4221)

Digit recoder BCD-4221 to BCD-5211

  • AREA: 18 NAND2 (0.35 times 4-bit 3:2 CSA area).
  • DELAY: 4 FO4 (0.9 times binary 3:2 CSA delay).

    Digit   BCD-4221      BCD-5211
    0       0000          0000
    1       0001          0001
    2       0010, 0100    0100
    3       0011, 0101    0101
    4       0110, 1000    0111
    5       0111, 1001    1000
    6       1010, 1100    1010
    7       1011, 1101    1011
    8       1110          1110
    9       1111          1111

Decimal (digit) 3:2 CSA

  • AREA: 66 NAND2 (1.35 times 4-bit 3:2 CSA area).
  • DELAY*: 1.4 times on the carry path; same on the sum path.

*Ratio with respect to the sum path (critical path) delay of a binary 3:2 CSA.

SLIDE 12

Decimal CSA tree (BCD-4221)

  • Example: 9:2 decimal CSA (digit slice).
  • 1.35 area ratio with respect to binary CSA.
  • 1.40 delay ratio with respect to binary CSA.
  • Hardware complexity (1 digit):

    – 4-bit 3:2 CSA: 7 x 48 NAND2.
    – Digit recoder (x2): 7 x 18 NAND2.

  • Critical path delay:

    – 1-bit 3:2 CSA: 4.5/2.2 FO4 (2/1 XOR).
    – Recoder: 4 FO4 (1.75 XOR).
    – 9:2 decimal CSA: 25 FO4.
    – 9:2 binary CSA: 18 FO4.

[Figure: digit slice of the 9:2 decimal CSA tree, built from seven 4-bit 3:2 CSAs and seven x2 blocks, with the critical path marked. A 2:1 mux is added for the combined decimal/binary CSA.]
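A functional sketch of a 9:2 reduction in Python, assuming the straightforward arrangement in which every 3:2 CSA is followed by its own x2 block. The order in which rows are combined here is a simple queue and is not the tree wiring of the figure; the helper names are illustrative.

```python
W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)

def value(code, weights):
    return sum(w for k, w in enumerate(weights) if code >> (3 - k) & 1)

def encode(n, weights, digits):
    word = 0
    for pos in range(digits):
        d = n // 10**pos % 10
        code = next(c for c in range(16) if value(c, weights) == d)
        word |= code << (4 * pos)
    return word

def decode(word, weights, digits):
    return sum(value(word >> (4 * pos) & 0xF, weights) * 10**pos
               for pos in range(digits))

def times2(word, digits):
    """x2 block: recode BCD-4221 to BCD-5211, then 1-bit left shift."""
    return encode(decode(word, W4221, digits), W5211, digits) << 1

def csa_3to2(a, b, c, digits):
    """Word-wide decimal 3:2 CSA on BCD-4221 rows: returns (S, 2H)."""
    s = a ^ b ^ c
    h = (a & b) | ((a | b) & c)
    return s, times2(h, digits)

def reduce_to_two(rows, digits):
    """Apply 3:2 CSAs until two rows remain; also count the CSAs used."""
    rows, n_csa = list(rows), 0
    while len(rows) > 2:
        a, b, c = rows[:3]
        rows = rows[3:] + list(csa_3to2(a, b, c, digits))
        n_csa += 1
    return rows[0], rows[1], n_csa

values = [9999, 1234, 5678, 4321, 8765, 1, 9090, 7777, 2468]
DIGITS = 5                       # wide enough to hold the total of all rows
r0, r1, n_csa = reduce_to_two([encode(v, W4221, DIGITS) for v in values], DIGITS)
total = decode(r0, W4221, DIGITS) + decode(r1, W4221, DIGITS)
print(n_csa, total == sum(values))   # 7 True
```

Nine rows need seven 3:2 CSAs, which matches the 7 x 48 NAND2 count on the slide.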

SLIDE 13

Decimal CSA tree BCD-4221 (area-optimized)

  • Example: 9:2 decimal CSA (digit slice).
  • Area optimization: group inputs with similar multiplicative factor.
  • 1.20 area ratio with respect to binary CSA.
  • 1.40 delay ratio with respect to binary CSA.
  • Hardware complexity (1 digit):

    – 4-bit 3:2 CSA: 7 x 48 NAND2.
    – Digit recoder (x2): 5 x 18 NAND2.

  • Critical path delay:

    – 9:2 decimal CSA: 25 FO4.
    – 9:2 binary CSA: 18 FO4.

[Figure: digit slice of the area-optimized 9:2 decimal CSA tree, built from seven 4-bit 3:2 CSAs with x1 and x2 factors on the grouped inputs, with the critical path marked.]

SLIDE 14

SD radix-10 multiplier recoding

Integer d-digit precision operands.

  • Multiplicand X (BCD-4221), 4d bits.
  • Multiplier Y (BCD-8421): the SD radix-10 digit recoder maps each digit $Y_i \in [0,9]$ to $Yb_i \in [-5,5]$, output as a recoded sign and a 5-bit hot-one code.
  • Multiplicand multiples generation: X, 2X, 3X, 4X, 5X (x2 and x5 blocks, plus a 4d-bit decimal adder, used for 3X).
  • A Mux-5 selects the multiple for each recoded digit.
  • 1 SD radix-10 digit per BCD digit of the multiplier.
  • d+1 partial products (additional encoded SD radix-10 digit).
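A minimal Python sketch of an SD radix-10 recoding. The slide gives only the digit ranges; the transfer rule used here (a transfer of 1 into the next digit whenever $Y_i \ge 5$) is an assumption, chosen because it keeps every recoded digit in [-5,5] with no carry propagation. Function names are illustrative.

```python
def sd_radix10(y_digits):
    """Recode BCD digits (least significant first) into d+1 digits in [-5, 5]."""
    # Assumed rule: digit i sends a transfer of 1 upward when Y_i >= 5.
    transfers = [1 if y >= 5 else 0 for y in y_digits]
    recoded = []
    for i, y in enumerate(y_digits):
        transfer_in = transfers[i - 1] if i > 0 else 0
        recoded.append(y - 10 * transfers[i] + transfer_in)
    recoded.append(transfers[-1])      # the additional (d+1)-th digit
    return recoded

def multiply(x, y_digits):
    """Sum the d+1 partial products Yb_i * X * 10^i."""
    return sum(yb * x * 10**i for i, yb in enumerate(sd_radix10(y_digits)))

y_digits = [6, 9, 5]                   # Y = 596, least significant digit first
print(sd_radix10(y_digits))            # [-4, 0, -4, 1]
print(multiply(25, y_digits))          # 14900
```

Each recoded digit selects one of the multiples X to 5X (or zero), and its sign decides whether the multiple is complemented, so only five positive multiples have to be generated.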

SLIDE 15

SD radix-4 multiplier recoding

Integer d-digit precision operands.

  • Multiplicand X (BCD-4221), 4d bits.
  • Multiplier Y (BCD-8421): the SD radix-4 digit recoder maps each digit $Y_i \in [0,9]$ to two digits, $Y^U_i \in [0,2]$ and $Y^L_i \in [-2,2]$ (hot-one code plus recoded sign), with

$Yb_i = 4\,Y^U_i + Y^L_i$

  • Multiplicand multiples generation (chained x2 blocks): X, 2X, 4X, 8X.
  • One Mux-2 selects between 8X and 4X; another Mux-2 selects between 2X and X.
  • 2 SD radix-4 digits per BCD digit of the multiplier.
  • 2d partial products.

SLIDE 16

SD radix-5 multiplier recoding

Integer d-digit precision operands.

  • Multiplicand X (BCD-4221), 4d bits.
  • Multiplier Y (BCD-8421): the SD radix-5 digit recoder maps each digit $Y_i \in [0,9]$ to two digits, $Y^U_i \in [0,2]$ and $Y^L_i \in [-2,2]$ (hot-one code plus recoded sign), with

$Yb_i = 5\,Y^U_i + Y^L_i$

  • Multiplicand multiples generation: X, 2X (x2 block), 5X (x5 block), 10X (x10 block: 4-bit left wired shift).
  • One Mux-2 selects between 10X and 5X; another Mux-2 selects between 2X and X.
  • 2 SD radix-5 digits per BCD digit of the multiplier.
  • 2d partial products.
  • Simple PPG: area/latency figures similar to Booth radix-4.
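The radix-5 split, and the radix-4 split of the same form ($Yb_i = 4\,Y^U_i + Y^L_i$), can be sketched in Python as follows. The rounding rule in `split_digit` is an assumption: it is one way to satisfy the stated ranges, and for radix 4 other valid splits exist. Function names are illustrative.

```python
def split_digit(y, radix):
    """Return (YU, YL) with y == YU * radix + YL, YU in [0, 2], YL in [-2, 2]."""
    yu = (y + 2) // radix              # nearest multiple of the radix (assumed rule)
    return yu, y - yu * radix

def partial_products(x, y_digits, radix):
    """Two partial products per multiplier digit (least significant first)."""
    pps = []
    for i, y in enumerate(y_digits):
        yu, yl = split_digit(y, radix)
        pps.append(yu * radix * x * 10**i)   # 0, 5X or 10X (radix 5); 0, 4X or 8X (radix 4)
        pps.append(yl * x * 10**i)           # -2X, -X, 0, X or 2X
    return pps

print([split_digit(y, 5) for y in range(10)])
# [(0, 0), (0, 1), (0, 2), (1, -2), (1, -1), (1, 0), (1, 1), (1, 2), (2, -2), (2, -1)]

pps = partial_products(25, [6, 9, 5], radix=5)    # 25 x 596
print(len(pps), sum(pps))                          # 6 14900
```

No transfer passes between multiplier digits, so every BCD digit is recoded independently, at the price of 2d partial products instead of d+1.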

SLIDE 17

Radix-10 architecture

  • Z = X x Y: only decimal multiplications.
  • 16 BCD-digit (64-bit) significands (IEEE-754r Decimal64 format).
  • SD radix-10 multiplier recoding.
  • 17 partial products generated.
  • Easily pipelined.

[Figure: radix-10 datapath. The 64-bit multiplier Y feeds the SD radix-10 recoder (outputs labelled 17x5 and 16 recoded signs). The 64-bit multiplicand X feeds the multiplicand multiples generator (X, 2X, 3X, 4X, 5X) and the Mux-5 blocks, which produce 17 partial products of 64 bits each. A decimal 17:2 CSA tree reduces them to two 128-bit operands, and a 128-bit decimal adder produces Z.]

SLIDE 18

Radix-4/5 architecture

  • Can perform binary/decimal multiplications Z = X x Y.
  • SD radix-5/4 multiplier recoding (2 SD digits per BCD digit).
  • 32 partial products generated.
  • Easily pipelined.

[Figure: radix-4/5 datapath. The 64-bit multiplier Y feeds the SD radix-4/5 recoder (outputs labelled 32x5 and 16 recoded signs). The 64-bit multiplicand X feeds the multiplicand multiples generator; one set of Mux-2 blocks selects between 10X/8X and 5X/4X, another between 2X and X, each set producing 16 partial products of 64 bits, 32 in total. A decimal 32:2 CSA tree reduces them to two 128-bit operands, and a 128-bit decimal adder produces Z.]

SLIDE 19

Evaluation results

  • Area-delay model based on logical effort (delay in FO4; area in NAND2).

    Architecture (64-bit)    Delay (FO4)   Ratio     Area (NAND2)   Ratio
    Bin. radix-4             50            1.0       43000          1.0
    Bin. radix-8             57            1.15      39500          0.90
    Dec. radix-10            72            1.45      40000          0.90
    Dec. radix-4             70            1.4       49500          1.15
    Dec. radix-5             65            1.3       49000          1.10
    Bin/dec. radix-4         59/75         1.2/1.5   54000          1.25
    Bin/dec. radix-4/5       61/71         1.2/1.4   53500          1.25
    Proposed in [8]          92            1.85      69000          1.60

[8] T. Lang and A. Nannarelli. A radix-10 combinational multiplier. Proc. 40th Asilomar Conf. on Signals, Systems, and Computers, pp. 313–317, Oct. 2006.

SLIDE 20

Comparison of decimal carry-free trees

    Architecture (carry-free adder)          Delay ratio   Area ratio
    Binary
      Binary 16:2 CSA                        0.70          0.85
    Our proposal
      Decimal 16:2 CSA (area-optimized)      1.00          1.00
    Other proposals
      4-bit CLA tree [4,7]                   1.45          1.40
      BCD-8421 CSA [11]                      1.50          2.60
      Non-spec. CSA [6]                      1.30          1.45
      SD tree [5,14]                         2.00          2.90

[4] M. A. Erle and M. J. Schulte. Decimal multiplication via carry-save addition. In Proc. IEEE Int'l Conference on Application-Specific Systems, Architectures, and Processors, pp. 348–358, June 2003.
[5] M. A. Erle, E. M. Schwarz, and M. J. Schulte. Decimal multiplication with efficient partial product generation. In Proc. IEEE 17th Symposium on Computer Arithmetic, pp. 21–28, June 2005.
[6] R. D. Kenney and M. J. Schulte. High-speed multioperand decimal adders. IEEE Trans. on Computers, 54(8):953–963, Aug. 2005.
[7] R. D. Kenney, M. J. Schulte, and M. A. Erle. High-frequency decimal multiplier. In Proc. IEEE Int'l Conference on Computer Design: VLSI in Computers and Processors, pp. 26–29, Oct. 2004.
[11] T. Ohtsuki. Apparatus for decimal multiplication. U.S. Patent No. 4,677,583, June 1987.
[14] B. Shirazi, D. Y. Y. Yun, and C. N. Zhang. RBCD: Redundant binary coded decimal adder. IEE Proc. - Computers and Digital Techniques, 136(2):156–160, Mar. 1989.

SLIDE 21

Conclusions

  • New family of parallel decimal multipliers: decimal radix-10 and combined radix-4/5 architectures.
  • Decimal carry-save addition algorithm using BCD-4221 (also valid for BCD-5211).
  • Efficient designs of decimal p:2 CSA trees for PPR.
  • Parallel PPG using multiplicand multiples and three different SD recodings of the multiplier.
  • Area-delay figures outperform other proposals and are comparable to those of binary parallel multipliers (1.3/1.1 latency/area ratios for decimal SD radix-5 with respect to binary Booth radix-4).
  • Future work: decimal floating-point VLSI implementations.