SLIDE 1

Pipelined Compressor Tree Optimization using Integer Linear Programming

International Conference on Field Programmable Logic, 03.09.2014

Martin Kumm, Peter Zipf

University of Kassel, Germany

SLIDE 2

CONTENTS

1. Introduction to Compressor Trees
2. Compressor Trees on FPGAs
3. Optimal Compressor Tree Synthesis

SLIDE 3

COMPRESSOR TREES

A compressor tree realizes the addition of many (>2) bit-shifted numbers.

The applications are versatile:
- Multipliers (real, complex, squarer)
- Evaluation of polynomials (e.g., for function approximation)
- Linear transforms (e.g., FFT, DCT)
- Digital filters
- …

SLIDE 4

EXAMPLE 1: MULTI-INPUT ADDITION

5 bit, 5-input addition:

Formula: $S = \sum_i X_i$

Dot representation: [Figure: the input vectors drawn as rows of dots in the columns $2^4\ 2^3\ 2^2\ 2^1\ 2^0$]

SLIDE 5

EXAMPLE 1: MULTI-INPUT ADDITION

5 bit, 5-input addition:

Formula: $S = \sum_i X_i$

Dot representation: [Figure: the set bits of the input vectors 22, 7, 13, 27 and 21, drawn as 16 dots in the columns $2^4\ 2^3\ 2^2\ 2^1\ 2^0$]

Column-wise: $3\cdot 2^4 + 2\cdot 2^3 + 4\cdot 2^2 + 3\cdot 2^1 + 4\cdot 2^0 = 90$

Row-wise: $22 + 7 + 13 + 27 + 21 = 90$
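The column view of this example can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `column_heights` is an assumption and does not come from the slides.

```python
def column_heights(values, width):
    """Count the set bits per column (index 0 = LSB) of unsigned integers."""
    return [sum((v >> c) & 1 for v in values) for c in range(width)]


inputs = [22, 7, 13, 27, 21]          # the five 5-bit input vectors
heights = column_heights(inputs, 5)   # LSB first

# Weighted sum of the column heights: sum over c of heights[c] * 2^c
weighted = sum(h << c for c, h in enumerate(heights))

print(heights[::-1])          # MSB first, as on the slide
print(weighted, sum(inputs))  # both views give the same total
```

The column heights are all a compressor tree needs to know about its input; the actual bit values only matter at run time.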

SLIDE 6

EXAMPLE 2: MULTIPLIER

Dot Representation  5x5 Multiplication:   Formula: 
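As a rough sketch of the bit heap behind a multiplier, the snippet below assumes a plain unsigned 5x5 multiplier whose partial products are the AND terms a_i·b_j with weight 2^(i+j). Both function names are hypothetical, and signed or Booth-recoded multipliers would produce a different dot pattern.

```python
def partial_product_heights(n_a, n_b):
    """Bits per column (index 0 = LSB) of an unsigned n_a x n_b AND array."""
    heights = [0] * (n_a + n_b - 1)
    for i in range(n_a):
        for j in range(n_b):
            heights[i + j] += 1   # a_i AND b_j has weight 2^(i+j)
    return heights


def multiply_via_heap(a, b, n_a=5, n_b=5):
    """Sum all weighted partial-product bits; equals a * b for unsigned inputs."""
    total = 0
    for i in range(n_a):
        for j in range(n_b):
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            total += bit << (i + j)
    return total


print(partial_product_heights(5, 5))
print(multiply_via_heap(22, 27))
```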

SLIDE 7

EXAMPLE 3: ADVANCED ARITHMETIC

Sine/cosine computation: dot representation for $Z - Z^3/6$ [Dinechin HEART’13]

SLIDE 8

BASIC COMPRESSION

Full adder / (3;2) counter: [Figure]

Ripple carry adder: [Figure: chain of full adders (FA)]
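A minimal bit-level sketch of the two building blocks named here: a full adder acting as a (3;2) counter, and a ripple carry adder built from a chain of them. The function names are chosen for illustration only.

```python
def full_adder(a, b, c):
    """(3;2) counter: three bits of equal weight in, (sum, carry) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def ripple_carry_add(x, y, width):
    """Add two unsigned width-bit numbers with a chain of full adders."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)   # final carry becomes the top bit


print(full_adder(1, 1, 1))
print(ripple_carry_add(22, 27, 5))
```

The full adder keeps the arithmetic value (a + b + c = s + 2·carry) while reducing three bits to two, which is why it counts as compression.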

SLIDE 9

FLOW OF COMPRESSION

[Figure: flow of compression shown as dot diagrams]

SLIDE 10

TABULAR REPRESENTATION

    5 5 5 5 5   bits in stage 0
−   3, + 1 1    (3;2) counter, applied five times
= 1 4 4 4 4 3   bits in stage 1
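The step from stage 0 to stage 1 can be reproduced on the column heights alone. The sketch below assumes one (3;2) counter per column of stage 0; `full_adder_stage` is a hypothetical helper, not part of the presented tool flow.

```python
def full_adder_stage(heights, columns):
    """Place one (3;2) counter in each listed column (index 0 = LSB).

    Each counter removes 3 bits from its column and produces a sum bit in
    the same column and a carry bit in the next higher column.
    """
    remaining = list(heights) + [0]       # bits passed on unchanged
    produced = [0] * (len(heights) + 1)   # bits produced by the counters
    for c in columns:
        if remaining[c] < 3:
            raise ValueError("column %d holds fewer than 3 bits" % c)
        remaining[c] -= 3
        produced[c] += 1        # sum bit
        produced[c + 1] += 1    # carry bit
    out = [r + p for r, p in zip(remaining, produced)]
    while len(out) > 1 and out[-1] == 0:  # drop unused top columns
        out.pop()
    return out


stage0 = [5, 5, 5, 5, 5]
stage1 = full_adder_stage(stage0, range(5))
print(stage1[::-1])   # MSB first, as in the table
```

Every (3;2) counter lowers the total bit count by exactly one, so five counters turn 25 bits into 20.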

SLIDE 11

TABULAR REPRESENTATION

  1 4 4 4 4 3   bits in stage 1
− 3, + 1 1      (3;2) counter, applied five times
= 1 3 3 3 3 1   bits in stage 2

SLIDE 12

TABULAR REPRESENTATION

  1 3 3 3 3 1   bits in stage 2
− 3, + 1 1      (3;2) counter, applied four times
= 2 2 2 2 1 1   bits in stage 3

SLIDE 13

TABULAR REPRESENTATION

    2 2 2 2 1 1   bits in stage 3
−   2 2 2 2       ripple carry adder
+ 1 1 1 1 1
= 1 1 1 1 1 1 1   bits in final stage

SLIDE 14

APPLICATION TO FPGAS

The compression using full adders is unsuitable for FPGAs:
- Mapping a full adder to FPGA LUTs is inefficient and slow (➯ large routing delays)
- The fast carry chain is not exploited

Conventional solution: ripple-carry adder tree

Delay reduction is possible by using Generalized Parallel Counters (GPCs) [Parandeh-Afshar TRETS’11]

SLIDE 15

(1,5;3) GPC ON FPGA

[Figure: dot transform (⇓) and realization with full adders (FA)]

SLIDE 16

(1,5;3) GPC ON FPGA

(1,5;3) GPC mapping [Parandeh-Afshar TRETS’11]:

Efficiency = bits reduced / #LUTs = (1+5−3)/3 = 1.0 [Dinechin FPL’13]

[Figure: mapping to slice LUTs and carry logic]
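The efficiency metric is easy to reproduce. The sketch below also adds a sanity check that a GPC's largest possible input sum fits into its output bits. Both function names are assumptions; the LUT counts are the ones quoted in this deck.

```python
def gpc_efficiency(inputs, outputs, luts):
    """Efficiency = (input bits - output bits) / number of LUTs."""
    return (sum(inputs) - outputs) / luts


def gpc_is_valid(inputs, outputs):
    """Check that the maximum input sum fits into the output word.

    inputs lists the bits per column, MSB column first, as in (1,4,1,5;5).
    """
    max_sum = sum(n << w for w, n in enumerate(reversed(inputs)))
    return max_sum <= (1 << outputs) - 1


print(gpc_efficiency((1, 5), 3, 3))          # (1,5;3) GPC with 3 LUTs
print(gpc_efficiency((1, 4, 1, 5), 5, 4))    # (1,4,1,5;5) GPC with 4 LUTs
print(gpc_efficiency((6, 0, 6), 5, 4))       # (6,0,6;5) GPC with 4 LUTs
print(gpc_is_valid((6, 0, 6), 5))
```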

SLIDE 17

EFFICIENT GPCS ON FPGAS

(1,4,1,5;5) GPC [Kumm MBMV’14]: Efficiency = 1.5

[Figure: mapping to slice LUTs and carry logic]

SLIDE 18

EFFICIENT GPCS ON FPGAS

(1,4,0,6;5) GPC [Kumm MBMV’14]: Efficiency = 1.5

[Figure: mapping to slice LUTs and carry logic]

SLIDE 19

EFFICIENT GPCS ON FPGAS

(1,3,2,5;5) GPC (proposed): Efficiency = 1.5

[Figure: mapping to slice LUTs and carry logic]

SLIDE 20

EFFICIENT GPCS ON FPGAS

(6,0,6;5) GPC (proposed): Efficiency = 1.75

[Figure: mapping to slice LUTs and carry logic]

SLIDE 21

COMPRESSOR TREE OPTIMIZATION

Problem 1: The presented GPCs have irregular input patterns. How to select them to get the least LUT resources?

Problem 2: Pipelining is important on FPGAs to obtain a high throughput. How to select them to get the least LUT/FF resources (least pipeline balancing FFs)?

SLIDE 22

EXAMPLE FOR PROBLEM 1

      5 5 5 5 5   bits in stage 0
−       1 4 1 5   (1,4,1,5;5) GPC
+     1 1 1 1 1
−     1 4 1 4     (1,4,1,5;5) GPC
+   1 1 1 1 1
=   1 6 2 2 2 1   bits in stage 1

    1 6 2 2 2 1   bits in stage 1
−     6           (6;3) GPC
+ 1 1 1
= 1 2 1 2 2 2 1   bits in stage 2
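The two stages of this example can be replayed on the column heights. The sketch below assumes the GPCs sit in the columns that make the table's arithmetic work out (first GPC at the LSB, second one column higher, the (6;3) GPC on the six-bit column). `apply_gpcs` and its placement tuples are illustrative names only.

```python
def apply_gpcs(heights, placements):
    """Apply one compression stage to column heights (index 0 = LSB).

    placements -- list of (used_inputs, n_outputs, lsb_column); used_inputs
                  lists the bits taken per column, MSB column first, as in
                  the rows of the table.
    Bits that no GPC consumes are passed on unchanged.
    """
    width = max([len(heights)] +
                [lsb + max(len(used), n_out) for used, n_out, lsb in placements])
    remaining = list(heights) + [0] * (width - len(heights))
    produced = [0] * width
    for used, n_out, lsb in placements:
        for offset, n in enumerate(reversed(used)):
            if remaining[lsb + offset] < n:
                raise ValueError("not enough bits in column %d" % (lsb + offset))
            remaining[lsb + offset] -= n
        for offset in range(n_out):
            produced[lsb + offset] += 1   # one output bit per output column
    return [r + p for r, p in zip(remaining, produced)]


stage0 = [5, 5, 5, 5, 5]
stage1 = apply_gpcs(stage0, [((1, 4, 1, 5), 5, 0), ((1, 4, 1, 4), 5, 1)])
stage2 = apply_gpcs(stage1, [((6,), 3, 4)])
print(stage1[::-1])   # MSB first, as in the table
print(stage2[::-1])
```

Note that the second (1,4,1,5;5) GPC only uses four of its five LSB inputs, which is why its row reads 1 4 1 4.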

SLIDE 23

EXAMPLE FOR PROBLEM 2

        5 5 5 5 5   bits in stage 0
−         2   4 5   (2,0,4,5;5) GPC
+       1 1 1 1 1
−       5   5       (6,0,6;5) GPC
+   1 1 1 1 1
−         3   1     4 FF for pipeline balancing
+         3   1
=   1 1 2 5 2 2 1   bits in stage 1

    1 1 2 5 2 2 1   bits in stage 1
−   1 1 2 5         (1,3,2,5;5) GPC
+ 1 1 1 1 1
−           2 2 1   5 FF for pipeline balancing
+           2 2 1
= 1 1 1 1 1 2 2 1   bits in stage 2
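The flip-flop counts in this example are simply the bits that no GPC consumes in a stage. A small sketch, assuming the GPC positions implied by the table; the function name and tuple layout are hypothetical.

```python
def pipeline_balancing_ffs(heights, placements):
    """Return the bits per column (index 0 = LSB) that no GPC consumes.

    placements -- list of (used_inputs, lsb_column); used_inputs lists the
                  bits taken per column, MSB column first.
    In a pipelined tree each of these bits needs one flip-flop.
    """
    remaining = list(heights)
    for used, lsb in placements:
        for offset, n in enumerate(reversed(used)):
            remaining[lsb + offset] -= n
    if min(remaining) < 0:
        raise ValueError("a GPC consumes more bits than its column holds")
    return remaining


# Stage 0: (2,0,4,5;5) GPC at the LSB, (6,0,6;5) GPC (using 5 and 5 bits) two columns up
ffs_stage0 = pipeline_balancing_ffs([5, 5, 5, 5, 5],
                                    [((2, 0, 4, 5), 0), ((5, 0, 5), 2)])
# Stage 1: (1,3,2,5;5) GPC (using 1, 1, 2 and 5 bits) three columns up
ffs_stage1 = pipeline_balancing_ffs([1, 2, 2, 5, 2, 1, 1],
                                    [((1, 1, 2, 5), 3)])
print(sum(ffs_stage0), sum(ffs_stage1))
```

In the combinatorial case (problem 1) the same leftover bits are plain wires and cost nothing.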

SLIDE 24

PROPOSED OPTIMIZATION

- A generic ILP optimizer was used.
- The main idea of the ILP formulation is to count GPCs for each column [Matsunaga’13] and to "cover" all bits in each stage by GPCs.
- For that, a "pseudo compressor" with one input and one output is introduced (no compression).
- To optimize a combinatorial compressor tree (problem 1), its cost is set to zero (a wire).
- To optimize a pipelined compressor tree (problem 2), its cost is set to the flip-flop cost.

SLIDE 25

ILP FORMULATION

ILP variables:

- No. of bits in stage s and column c: $N_{s,c}$
- No. of GPCs in stage s, of type e and column c: $k_{s,e,c}$
- No. of inputs and outputs of GPC (type e) in column c: $M_{e,c}$ and $K_{e,c}$, respectively
- LUT cost of GPC e: $c_e$
- Binary variable to select the active stage: $D_s = \begin{cases} 1 & \text{if stage } s \text{ is used} \\ 0 & \text{otherwise} \end{cases}$

SLIDE 26

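ILP FORMULATION

$$\text{minimize} \quad \sum_{s=0}^{S-1} \sum_{c=0}^{C-1} \sum_{e=0}^{E-1} c_e \, k_{s,e,c}$$

subject to

C1: $N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'}$ for $s = 1 \ldots S-1$, $c = 0 \ldots C-1$, if $D_s = 0$

C2: $N_{s,c} = \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} K_{e,c+c'} \, k_{s-1,e,c+c'}$ for $s = 1 \ldots S-1$, $c = 0 \ldots C-1$

C3: $N_{s,c} \le \begin{cases} 2 & \text{for two-input VMA} \\ 3 & \text{for ternary VMA} \end{cases}$ if $D_s = 1$

C4: $\sum_{s=1}^{S-1} D_s = 1$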

SLIDE 27

ILP FORMULATION

C1 and C3 have to be linearized:

C1’: $N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} + I \, D_s$

C3’: $N_{s,c} \le \begin{cases} 2 + (1 - D_s) \, I & \text{for two-input VMA} \\ 3 + (1 - D_s) \, I & \text{for ternary VMA} \end{cases}$

$I$ must be a sufficiently large integer.
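The linearization is the usual big-constant trick: adding I to the right-hand side switches a constraint off. The sketch below checks by brute force that the linearized forms agree with the conditional ones for small values. The constant `BIG_I` and all function names are assumptions made for illustration, and `covered` stands for the double sum on the right-hand side of C1.

```python
BIG_I = 10**6   # assumed value for the "sufficiently large integer" I


def c1_conditional(n_prev, covered, d_s):
    """C1: all bits of the previous stage are covered, required only if D_s = 0."""
    return n_prev <= covered if d_s == 0 else True


def c1_linearized(n_prev, covered, d_s, big_i=BIG_I):
    """C1': the term I * D_s relaxes the constraint when D_s = 1."""
    return n_prev <= covered + big_i * d_s


def c3_conditional(n_bits, d_s, vma_inputs):
    """C3: at most vma_inputs bits per column, required only if D_s = 1."""
    return n_bits <= vma_inputs if d_s == 1 else True


def c3_linearized(n_bits, d_s, vma_inputs, big_i=BIG_I):
    """C3': the term (1 - D_s) * I relaxes the constraint when D_s = 0."""
    return n_bits <= vma_inputs + (1 - d_s) * big_i


same = all(
    c1_conditional(n, m, d) == c1_linearized(n, m, d)
    and c3_conditional(n, d, v) == c3_linearized(n, d, v)
    for n in range(40) for m in range(40) for d in (0, 1) for v in (2, 3)
)
print(same)
```

The equivalence only holds while every column height stays below I, which is what "sufficiently large" means here.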

SLIDE 28

RESULTS

[Figure: #LUT versus compressed bits for the heuristic [8] and the proposed ILP; panel (a): Virtex 4 FPGA, second panel: Virtex 6 FPGA]

- The required LUTs could be reduced by 23% (Virtex 4) and 30% (Virtex 6) compared to Dinechin (FPL’13) [8].
- The slice reduction was 12.5% (Virtex 4) and 19.5% (Virtex 6) after synthesis.

SLIDE 29

EXAMPLE COMPRESSION TREE WITH 16 INPUTS, 16 BIT EACH

[Figure: compressor trees generated by FloPoCo [Dinechin FPL’13] and by the proposed ILP]

SLIDE 30

CONCLUSION & OUTLOOK

- A novel ILP formulation for the optimization of pipelined compressor trees was presented.
- There is a notable gap between the former state-of-the-art heuristic and our optimal solution.
- Extensions are proposed for minimal stage count or variable column counters like 4:2 compressors.
- Good heuristics are still required for problem sizes >100 bit due to the runtime of the ILP solver.
- So far there is no heuristic considering pipelining.

SLIDE 31

THANK YOU!

SLIDE 32

LITERATURE

[Parandeh-Afshar TRETS’11]: H. Parandeh-Afshar, A. Neogy, P. Brisk, and P. Ienne, “Compressor Tree Synthesis on Commercial High-Performance FPGAs,” ACM TRETS, 2011.

[Dinechin HEART’13]: F. de Dinechin, M. Istoan, and G. Sergent, “Fixed-Point Trigonometric Functions on FPGAs,” HEART 2013, Jun. 2013.

[Dinechin FPL’13]: N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa, “Arithmetic Core Generation Using Bit Heaps,” FPL 2013.

[Matsunaga’13]: T. Matsunaga, S. Kimura, and Y. Matsunaga, “An Exact Approach for GPC-Based Compressor Tree Synthesis,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Dec. 2013.

SLIDE 33

SLIDE 34

ATTACHMENTS

SLIDE 35

DETAILED RESULTS VIRTEX 4

             Heuristic [Dinechin FPL’13]          Proposed ILP
Size [bits]  LUT4   FF     Slices  fmax [MHz]     LUT4   FF     Slices  fmax [MHz]
16           34     20     25      501.5          28     21     25      562.4
25           45     39     29      455.2          46     45     39      562.1
36           78     63     59      489.5          54     56     35      491.4
49           123    86     73      444.8          79     78     46      481.9
64           181    108    109     412.9          123    120    100     471.5
81           209    132    117     420.7          141    135    106     477.8
100          267    173    174     414.8          181    178    109     454.6
121          332    182    181     332.6          242    247    211     435.4
144          395    243    255     376.2          272    273    223     441.1
169          492    283    277     344.8          309    317    197     428.3
196          582    328    368     355.0          407    416    340     423.2
225          622    345    410     333.9          444    451    349     424.3
256          706    386    459     343.3          506    518    438     410.3
Avg.:        312.8  183.7  195.1   401.9          217.8  219.6  170.6   466.5
Imp.:        –      –      –       –              30.3%  19.6%  12.5%   16.1%
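The summary row can be recomputed from the table columns. The sketch below assumes that each percentage compares the column averages of the heuristic and of the ILP, so a negative value means the count went up; `reduction_percent` is a hypothetical helper.

```python
def reduction_percent(baseline, proposed):
    """Relative reduction of the column average in percent (negative = increase)."""
    return round(100.0 * (1.0 - sum(proposed) / sum(baseline)), 1)


# Columns of the Virtex 4 table
heur_lut = [34, 45, 78, 123, 181, 209, 267, 332, 395, 492, 582, 622, 706]
ilp_lut = [28, 46, 54, 79, 123, 141, 181, 242, 272, 309, 407, 444, 506]
heur_ff = [20, 39, 63, 86, 108, 132, 173, 182, 243, 283, 328, 345, 386]
ilp_ff = [21, 45, 56, 78, 120, 135, 178, 247, 273, 317, 416, 451, 518]
heur_slices = [25, 29, 59, 73, 109, 117, 174, 181, 255, 277, 368, 410, 459]
ilp_slices = [25, 39, 35, 46, 100, 106, 109, 211, 223, 197, 340, 349, 438]

print(reduction_percent(heur_lut, ilp_lut))
print(reduction_percent(heur_ff, ilp_ff))
print(reduction_percent(heur_slices, ilp_slices))
```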

SLIDE 36

DETAILED RESULTS VIRTEX 6

             Heuristic [Dinechin FPL’13]          Proposed ILP
Size [bits]  LUT6   FF     Slices  fmax [MHz]     LUT6   FF     Slices  fmax [MHz]
16           12     7      3       478.0          10     9      3       639.4
25           24     11     6       636.5          26     25     7       452.9
36           32     13     9       595.6          27     36     7       603.1
49           44     15     12      492.4          35     40     10      407.7
64           59     19     16      407.7          47     48     13      506.8
81           76     21     20      442.9          56     59     15      480.1
100          96     47     26      435.9          77     98     20      437.5
121          116    26     32      401.6          89     112    25      438.6
144          134    28     35      383.9          94     121    24      469.0
169          161    60     43      396.8          119    155    30      470.6
196          189    76     50      358.0          131    160    35      408.0
225          216    81     56      327.2          192    236    57      364.0
256          251    74     66      338.3          204    251    55      372.3
Avg.:        108.5  36.8   28.8    438.1          85.2   103.8  23.2    465.4
Imp.:        –      –      –       –              21.5%  182.4% 19.5%   6.2%

SLIDE 37

EFFICIENT GPCS ON FPGAS

GPC / Compressor   #LUT6 (k)   Efficiency (E = δ/k)   Delay

LUT-based GPCs from [Dinechin FPL’13]:
(3;2) GPC          1           1                      τ_L ≈ τ
(6;3) GPC          3           1                      τ_L ≈ τ
(1,5;3) GPC        3           1                      τ_L ≈ τ

Improved GPC mappings from [Parandeh-Afshar TRETS’11]:
(6;3) GPC          3           1                      2τ_L + τ_R + 3τ_CC ≈ 3τ
(1,5;3) GPC        2           1.5                    τ_L + 2τ_CC ≈ τ
(2,3;3) GPC        2           1                      τ_L + 2τ_CC ≈ τ
(7;3) GPC          3           1.33                   2τ_L + τ_R + 3τ_CC ≈ 3τ
(5,3;4) GPC        3           1.33                   2τ_L + τ_R + 3τ_CC ≈ 3τ
(6,2;4) GPC        3           1.33                   2τ_L + τ_R + 3τ_CC ≈ 3τ

SLIDE 38

EFFICIENT GPCS ON FPGAS

GPC / Compressor   #LUT6 (k)   Efficiency (E = δ/k)   Delay

GPCs and 4:2 compressor from [Kumm MBMV’14]:
(5,0,6;5) GPC      4           1.5                    τ_L + 4τ_CC ≈ τ
(1,4,1,5;5) GPC    4           1.5                    τ_L + 4τ_CC ≈ τ
(1,4,0,6;5) GPC    4           1.5                    τ_L + 4τ_CC ≈ τ
(2,0,4,5;5) GPC    4           1.5                    2τ_L + τ_R + 4τ_CC ≈ 3τ
4:2 compressor     k           2 − 2/k                τ_L + kτ_CC

Adder with k BLE:
2-input adder      k           1                      τ_L + kτ_CC
3-input adder      k           2 − 2/k                2τ_L + τ_R + kτ_CC ≈ 3τ + kτ_CC

Proposed GPCs:
(6,0,6;5) GPC      4           1.75                   τ_L + 4τ_CC ≈ τ
(1,3,2,5;5) GPC    4           1.5                    τ_L ≈ τ

SLIDE 39

EFFICIENT GPCS ON FPGAS

(2,0,4,5;5) GPC [Kumm MBMV’14]: Efficiency = 1.5

[Figure: mapping to slice LUTs and carry logic]

SLIDE 40

4:2 COMPRESSOR

[Figure: 4:2 compressor mapped to slice LUTs and carry logic] [Kumm MBMV’14]

SLIDE 41

PROPOSED OPTIMIZATION

- We developed an ILP optimizer.
- The main idea of the ILP formulation is to "cover" all bits in each stage by GPCs.
- For that, a "pseudo element" $e'$ is introduced for which $M_{e',c} = 1$ and $K_{e',c} = 1$ (no compression).
- In case of a combinatorial compressor tree (problem 1), we set its cost to $c_{e'} = 0$ (wire).
- In case of a pipelined compressor tree (problem 2), $c_{e'}$ corresponds to the flip-flop cost.

SLIDE 42

(7;3) COMPRESSOR

[Figure: (7;3) compressor built from full adders (FA), mapped to slice LUTs and carry logic]

SLIDE 43

TERNARY ADDERS

A ternary adder realizes the operation $s = x + y + z$.

It can be realized as a cascade of two ripple carry adders:

[Figure: two chained rows of full adders (FA)]
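A behavioural sketch of the cascade described here, with the word growing by one bit per adder. The function names are illustrative, and the model says nothing about how the adders are mapped to LUTs and carry chains.

```python
def ripple_carry_add(x, y, width):
    """Add two unsigned numbers bit by bit; the result has width + 1 bits."""
    carry = 0
    result = 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= (a ^ b ^ carry) << i
        carry = (a & b) | (carry & (a ^ b))
    return result | (carry << width)


def ternary_add(x, y, z, width):
    """s = x + y + z as a cascade of two ripple carry adders."""
    partial = ripple_carry_add(x, y, width)          # width + 1 bits
    return ripple_carry_add(partial, z, width + 1)   # width + 2 bits


print(ternary_add(5, 6, 7, 3))
```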
