Application to Post-Placement Multi-Bit Flip-Flop Merging Chang Xu 1 - - PowerPoint PPT Presentation


slide-1
SLIDE 1

1

Analytical Clustering Score with Application to Post-Placement Multi-Bit Flip-Flop Merging

Chang Xu1, Peixin Li1, Guojie Luo1, Yiyu Shi2, and Iris Hui-Ru Jiang3

{changxu, gluo} @ pku.edu.cn

slide-2
SLIDE 2

2

Background

  • Multi-bit flip-flop
  • Previous works and limitation

Our method

  • Analytical score
  • Discrete refinement
  • Efficient implementation

Experimental Results Conclusion

Outline

slide-3
SLIDE 3

3

Clock power predominates dynamic power

  • $P_{clk} = \alpha C_{clk} V_{dd}^2 f_{clk}$

Clock power optimization

  • Reduce $\alpha$
      • Clock gating technique
  • Reduce $V_{dd}$
      • Sub-threshold voltage
      • Multi-supply-voltage
  • Reduce $C_{clk}$
      • Multi-bit flip-flop
      • Resonance clock

Clock Power Optimization

slide-4
SLIDE 4

4

What’s MBFF

  • Several SBFFs share common inverters in MBFF cell

Multi-Bit Flip-Flop(MBFF)

2-Bit Flip-Flop Source: ICCAD’10 Chang et al.

slide-5
SLIDE 5

5

Power saving comes from

  • MBFF library
  • Simplified clock tree

Multi-Bit Flip-Flop(MBFF)

(a) Common clock tree (b) Simplified clock tree with MBFF

UMC 55nm process Faraday cell library

slide-6
SLIDE 6

6

Pre-placement MBFF

  • SNUG’10 Chen et al.,

In-placement MBFF

  • ISPD’13 Tsai et al.,
  • ICCAD’13 Hsu et al.,

Post-placement MBFF

  • ICGCS’10 Yan and Chen
  • ICCAD’10 Chang et al.,
  • ISPD’11 Jiang et al., INTEGRA

Using MBFF at Different Stages

[Figure: design flow Logic Synthesis → Placement → Timing Analysis → Post-placement Optimization → CTS → Routing, with MBFF clustering marked at three points in the flow]

slide-7
SLIDE 7

7

Input

  • Placement of FFs and other gates
  • Timing slacks
  • MBFF library

Output

  • FF clusters (MBFF)

Constraint

  • Timing constraint

Post-Placement MBFF Clustering

[Figure: FFs with their input and output pins and the resulting TVFR]

slide-8
SLIDE 8

8

Timing violation free region (TVFR)

Post-Placement MBFF Clustering

[Figure: FFs with input and output pins, their regions TVFR1 and TVFR2, and a merged 2-bit FF]
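The TVFR idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the construction used in the talk: each connected pin is assumed to allow the FF to sit anywhere within a Manhattan-distance budget (a diamond), the budget values are hypothetical stand-ins for slack-derived limits, and all names are invented for the example. Rotating coordinates by 45 degrees turns every diamond into an axis-aligned square, so intersections become simple interval tests.

```python
from typing import List, Optional, Tuple

# Rectangle in rotated coordinates u = x + y, v = y - x: (u_min, v_min, u_max, v_max)
Rect = Tuple[float, float, float, float]

def diamond_to_rect(cx: float, cy: float, radius: float) -> Rect:
    """Manhattan ball |x-cx| + |y-cy| <= radius becomes a square in (u, v)."""
    u, v = cx + cy, cy - cx
    return (u - radius, v - radius, u + radius, v + radius)

def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Intersection of two rotated rectangles, or None if they are disjoint."""
    lo_u, lo_v = max(a[0], b[0]), max(a[1], b[1])
    hi_u, hi_v = min(a[2], b[2]), min(a[3], b[3])
    if lo_u > hi_u or lo_v > hi_v:
        return None
    return (lo_u, lo_v, hi_u, hi_v)

def tvfr(pins: List[Tuple[float, float, float]]) -> Optional[Rect]:
    """pins: (x, y, distance budget) per connected pin; the budgets are assumed inputs."""
    region: Optional[Rect] = None
    for cx, cy, radius in pins:
        rect = diamond_to_rect(cx, cy, radius)
        region = rect if region is None else intersect(region, rect)
        if region is None:
            return None
    return region

def can_merge(region_a: Optional[Rect], region_b: Optional[Rect]) -> bool:
    """Two FFs are merge candidates only if their TVFRs overlap."""
    if region_a is None or region_b is None:
        return False
    return intersect(region_a, region_b) is not None
```

Under these assumptions, two FFs whose regions overlap could be replaced by one 2-bit FF placed inside the common region.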

slide-9
SLIDE 9

9

Intersection graph-based searching [ICCAD’10 ]

  • Time consuming: $O(N^3)$
  • Window-based acceleration affects power reduction

Previous Works and Limitation

[Figure: TVFRs and their intersection graph; TVFRs and a complete graph]

slide-10
SLIDE 10

10

Interval graph-based searching [ISPD’11]

  • Efficient: sub-quadratic time complexity
  • Effective: best power reduction
  • Simple: signal wirelength degradation

Previous Works and Limitation

Random Choice!

Illustration to Interval Graph Source: ISPD’11 Jiang et al.

slide-11
SLIDE 11

11

Difference

  • TVFD/AFFD: roughly estimates the number of FFs that can be covered within a TVFR
  • IWLS benchmarks have many more MBFF candidates!

Signal wirelength degradation (for Integra)

  • C1-C6: Avg. 3%
  • IWLS: Avg. 932%

Benchmarks: C1-C6 Vs. IWLS 2005

[Figure: TVFD/AFFD and FF ratio for C1-C6 versus Vga (IWLS 2005)]

slide-12
SLIDE 12

12

Efficient and great scalability

  • Sub-quadratic time complexity

Robust performance

  • Power reduction: comparable to Integra
  • Signal wirelength: much better than Integra, especially for real designs

Analytical fashion

  • Potential integration in analytical global placement
  • Potential usage for clustering algorithms

Our Contribution

slide-13
SLIDE 13

13

Optimization Flow

slide-14
SLIDE 14

14

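Optimization Problem

  • $\min\ \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$
  • $\text{s.t.}\ t(\mathbf{x}, \mathbf{y}) \le T$

$f_l(\mathbf{x}, \mathbf{y})$: signal wirelength

  • weighted-average WL [DAC’11]

$f_c(\mathbf{x}, \mathbf{y})$: #FF groups

  • nontrivial to formulate

Timing constraint

  • feasible region

Analytical Step: Basic Idea

[Figure: TVFRs containing a 2-bit group and a 3-bit group]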
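For readers unfamiliar with the weighted-average wirelength model cited for the signal wirelength term, the following is a minimal sketch of that model for a single net. The smoothing parameter `gamma`, the function names, and the sample pins are assumptions for illustration; the talk does not give its parameter values.

```python
import math
from typing import List, Sequence, Tuple

def wa_extent(coords: Sequence[float], gamma: float) -> float:
    """Smooth approximation of max(coords) - min(coords) using exponential weights."""
    hi, lo = max(coords), min(coords)
    # Shift the exponents by hi / lo so that exp() never overflows.
    w_max = [math.exp((c - hi) / gamma) for c in coords]
    w_min = [math.exp((lo - c) / gamma) for c in coords]
    smooth_max = sum(c * w for c, w in zip(coords, w_max)) / sum(w_max)
    smooth_min = sum(c * w for c, w in zip(coords, w_min)) / sum(w_min)
    return smooth_max - smooth_min

def wa_wirelength(pins: List[Tuple[float, float]], gamma: float = 1.0) -> float:
    """Weighted-average wirelength of one net: x extent plus y extent."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return wa_extent(xs, gamma) + wa_extent(ys, gamma)

def hpwl(pins: List[Tuple[float, float]]) -> float:
    """Half-perimeter wirelength, the non-smooth quantity being approximated."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

net = [(0.0, 0.0), (10.0, 0.0), (3.0, 4.0)]
print(hpwl(net), wa_wirelength(net, gamma=0.1), wa_wirelength(net, gamma=5.0))
```

A small `gamma` tracks the half-perimeter value closely, while a larger `gamma` gives a smoother function that underestimates it; being differentiable is what makes the model usable inside a gradient-based solver.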

slide-15
SLIDE 15

15

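Cluster size

  • $N_i(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{N} \delta(\|(x_i, y_i) - (x_j, y_j)\|, 0)$

Dirac delta function

  • $\delta(w, z) = 1$ if $w = z$, and $0$ if $w \neq z$
  • $\delta(\|(x_i, y_i) - (x_j, y_j)\|, 0) = 1$ if $\|(x_i, y_i) - (x_j, y_j)\| = 0$, and $0$ otherwise

Analytical Step: Def. of Clustering Score

[Figure: TVFRs with $FF_i$ in a cluster of size $N_i = 3$ and $FF_j$ in a cluster of size $N_j = 2$]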
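The cluster-size definition counts, for each FF, how many FFs (itself included) sit at exactly the same location. A minimal sketch, assuming locations are stored as hashable tuples so that exact equality is meaningful; the function name and sample data are illustrative only:

```python
from collections import Counter
from typing import List, Tuple

def cluster_sizes(locs: List[Tuple[float, float]]) -> List[int]:
    """N_i for every FF: the number of FFs whose location equals that of FF i."""
    counts = Counter(locs)
    return [counts[p] for p in locs]

# Three FFs stacked at one point and two at another, as in the figure labels.
locs = [(1.0, 2.0)] * 3 + [(5.0, 5.0)] * 2
print(cluster_sizes(locs))  # [3, 3, 3, 2, 2]
```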

slide-16
SLIDE 16

16

Objective function: $f_c$ term

  • 4-bit group is most efficient

Analytical Step: Def. of Clustering Score

$\min\ -f_c = -\max f_c = -\max \sum_{i=1}^{N} \delta(N_i(\mathbf{x}, \mathbf{y}), 4)$
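Before smoothing, the clustering term simply counts the FFs that belong to a cluster of exactly four. The sketch below evaluates that discrete count on two hypothetical arrangements of eight FFs; the function name and the data are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

def clustering_objective(locs: List[Tuple[float, float]], target: int = 4) -> int:
    """f_c: number of FFs whose cluster size N_i equals the target size."""
    counts = Counter(locs)
    return sum(1 for p in locs if counts[p] == target)

two_quads = [(0.0, 0.0)] * 4 + [(9.0, 9.0)] * 4   # two 4-bit groups
three_five = [(0.0, 0.0)] * 3 + [(9.0, 9.0)] * 5  # a 3-group and an oversized 5-group

print(clustering_objective(two_quads))   # 8
print(clustering_objective(three_five))  # 0
```

The second arrangement scores zero even though the FFs are tightly grouped, which shows that the term rewards exact 4-bit groups rather than clustering in general.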

slide-17
SLIDE 17

17

Gaussian function

Analytical Step: Smoothing

$\delta(w, z) \approx D(w, z) = \exp\left(\frac{(w - z)^2 \ln \epsilon}{d_0^2}\right)$

  • $D(w, z) = 1$ when $w = z$
  • $D(w, z) < \epsilon$ when $|w - z| > d_0$

[Figure: Dirac delta function versus Gaussian function]
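The Gaussian replacement for the delta function can be checked numerically. In this sketch `eps` and `d0` are placeholder values (the talk does not state the ones it used), and `smooth_cluster_size` is an illustrative helper that applies the smoothed delta to pairwise distances.

```python
import math
from typing import List, Tuple

def smooth_delta(w: float, z: float, eps: float = 0.01, d0: float = 1.0) -> float:
    """Gaussian stand-in for delta(w, z): 1 at w == z, eps at |w - z| == d0."""
    return math.exp(((w - z) ** 2) * math.log(eps) / d0 ** 2)

def smooth_cluster_size(i: int, locs: List[Tuple[float, float]],
                        eps: float = 0.01, d0: float = 1.0) -> float:
    """Smoothed N_i: sum of smooth_delta(distance(i, j), 0) over all FFs j."""
    xi, yi = locs[i]
    return sum(smooth_delta(math.hypot(xi - xj, yi - yj), 0.0, eps, d0)
               for xj, yj in locs)

locs = [(0.0, 0.0), (0.0, 0.0), (0.1, 0.0), (50.0, 50.0)]
print(smooth_delta(2.0, 2.0))        # 1.0
print(smooth_cluster_size(0, locs))  # slightly below 3: the far FF contributes almost nothing
```

Unlike the exact count, the smoothed cluster size changes continuously as FFs move, so it has a usable gradient.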

slide-18
SLIDE 18

18

Attractive force & repelling force

Analytical Step: Effectiveness

[Figure: PULL and PUSH forces acting on $FF_i$]
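One plausible reading of the attractive and repelling forces, offered here as an interpretation rather than the authors' derivation, is that they follow from the sign of the slope of the smoothed score term with respect to the cluster size: below the target of four the score rises as more FFs join (pull), above it the score falls (push). The parameter values are placeholders.

```python
import math

def score_term(n: float, target: float = 4.0, eps: float = 0.01, d0: float = 1.0) -> float:
    """Smoothed delta(N_i, target) as a function of the (smoothed) cluster size n."""
    return math.exp(((n - target) ** 2) * math.log(eps) / d0 ** 2)

def score_slope(n: float, target: float = 4.0, eps: float = 0.01, d0: float = 1.0) -> float:
    """Derivative of score_term with respect to n."""
    return score_term(n, target, eps, d0) * 2.0 * (n - target) * math.log(eps) / d0 ** 2

for n in (3.0, 4.0, 5.0):
    slope = score_slope(n)
    action = "pull" if slope > 0 else "push" if slope < 0 else "balanced"
    print(n, action)
```

With the default parameters this prints pull for a cluster of three, balanced for four, and push for five.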

slide-19
SLIDE 19

19

Analytical Step: Preliminary Clusters

[Figure: (a) Initial FFs’ distribution (Init. Loc.); (b) FFs’ distribution after analytical clustering (NLP Loc.); axis ticks run from 500 to 3500]

  • $f_c$: maximizes the number of MBFF groups
  • $f_l$: pulls FFs towards their “optimal locations” in terms of WL

slide-20
SLIDE 20

20

Two-pass best-choice clustering

  • First-pass: discretization
  • Second-pass: refinement

Discrete Step: Basic Idea

[Figure: FFs A-I in four panels: (a) Proximity relation after analytical step; (b) Discrete clustering (first pass); (c) Discrete refinement (second pass); (d) Final MBFF groups]

slide-21
SLIDE 21

21

First-pass: extract proximity relation

  • Bottom-up merging
  • Priority queue
  • Tuple: $(FF_i, FF_j, d)$, where $d = \mathrm{dist}(FF_i, FF_j)$
  • Capacity constraint: 4-bit

Second-pass: further refinement

  • Improve the ratio of 4-bit groups

Discrete Step: Two-Pass Best-Choice Clustering
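The first pass can be approximated by a short greedy routine. This is a simplified stand-in, not the talk's algorithm: it merges the closest pairs first using a priority queue of (distance, i, j) tuples and a union-find structure, enforces only the 4-bit capacity, and omits both the TVFR feasibility check and the second-pass refinement. All names are hypothetical.

```python
import heapq
import math
from itertools import combinations
from typing import List, Tuple

def first_pass(locs: List[Tuple[float, float]], capacity: int = 4,
               max_dist: float = float("inf")) -> List[List[int]]:
    """Bottom-up merging of FF indices, closest pairs first, groups capped at `capacity`."""
    parent = list(range(len(locs)))
    size = [1] * len(locs)

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Priority queue of (d, i, j) with d = dist(FF_i, FF_j).
    heap = [(math.dist(locs[i], locs[j]), i, j)
            for i, j in combinations(range(len(locs)), 2)]
    heap = [t for t in heap if t[0] <= max_dist]
    heapq.heapify(heap)

    while heap:
        _, i, j = heapq.heappop(heap)
        ri, rj = find(i), find(j)
        if ri == rj or size[ri] + size[rj] > capacity:
            continue  # already together, or the merge would exceed 4 bits
        parent[rj] = ri
        size[ri] += size[rj]

    groups = {}
    for i in range(len(locs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

print(first_pass([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11)]))
# [[0, 1, 2, 3], [4, 5]]
```

The all-pairs queue here is quadratic and is kept only for brevity; it would be replaced by a neighbor search to match the complexity claimed in the talk.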

slide-22
SLIDE 22

22

Discrete Step: Two-Pass Best-Choice Clustering

[Figure: FFs A-I in four panels: (a) Proximity relation after analytical step, with pair scores S(C,D), S(G,F), S(E,G), S(I,H), S(A,C), S(A,B), S(I,E); (b) First-pass clustering; (c) Second-pass clustering; (d) Final MBFF groups. Further labels in the figure: a cross mark, S(I,E), S(H,F)]

slide-23
SLIDE 23

23

MBFF Clusters

[Figure: FF distribution plots for Init. Loc., NLP Loc., and Final Loc.; axis ticks run from 500 to 3500]

slide-24
SLIDE 24

24

Efficient Implementation

Analytical Step

  • $\min\ \alpha f_l(\mathbf{x}, \mathbf{y}) - f_c(\mathbf{x}, \mathbf{y})$, $\text{s.t.}\ t(\mathbf{x}, \mathbf{y}) \le T$
  • Gradient calculation: fast Gauss transform (FGT), $O(N^2) \Rightarrow O(N)$
  • Nonlinear programming solver: Nesterov method; placement-like problem, $O(N^{1.18})$

Discrete Refinement

  • FF-pair distance: bin-structure searching, $O(N^2) \Rightarrow O(N)$

Sub-quadratic time complexity
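Bin-structure searching for nearby FF pairs can be sketched with a uniform grid: each FF is hashed into a square bin whose side equals the search radius, and only the 3x3 block of bins around it is examined. The radius, names, and data are assumptions; the near-linear behavior also assumes that the number of FFs per bin stays bounded.

```python
import math
from collections import defaultdict
from typing import List, Tuple

def near_pairs(locs: List[Tuple[float, float]], radius: float) -> List[Tuple[int, int]]:
    """All index pairs (i, j), i < j, whose Euclidean distance is at most `radius`."""
    bins = defaultdict(list)
    for idx, (x, y) in enumerate(locs):
        bins[(math.floor(x / radius), math.floor(y / radius))].append(idx)

    pairs = []
    for (bx, by), members in bins.items():
        # Any FF within `radius` must lie in the same bin or one of its 8 neighbors.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in bins.get((bx + dx, by + dy), ()):
                    for i in members:
                        if i < j and math.dist(locs[i], locs[j]) <= radius:
                            pairs.append((i, j))
    return sorted(pairs)

locs = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.4, 3.3), (10.0, 10.0)]
print(near_pairs(locs, radius=1.0))  # [(0, 1), (2, 3)]
```

The `i < j` test ensures each pair is reported once even though both of its bins are visited.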

slide-25
SLIDE 25

25

Setup:

  • G++ 4.5.1 -O3
  • Intel Xeon CPU @ 2.4GHz with 16 logical threads
  • Benchmarks: C1-C6, IWLS-2005 suite
  • Synthesis flow for real designs
      • Synopsys DC
      • Cadence Encounter SOC

Experiment Results:

slide-26
SLIDE 26

26

Comparable power reduction 33% WL reduction

Experimental Results: C1-C6

Circuit Integra Ours PWR WLR RT (s) PWR WLR RT (s) C1 82.8 96 0.01 83.5 77.4 0.42 C2 80.9 102 0.01 82.3 76.4 0.97 C3 80.8 104 0.01 82.3 74.9 3.14 C4 81.0 104 0.02 82.4 75.6 10.59 C5 80.7 105 0.05 82.1 76.4 16.66 C6 80.7 105 1.11 82.3 82 217.4 Avg. 1 1.33 1 1.02 1 252
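The Avg. row for C1-C6 does not say how it was normalized. One reading that happens to reproduce the printed figures, stated here as an assumption rather than the authors' documented procedure, is the mean of per-circuit ratios with the smaller side taken as 1. The numbers below are copied from the C1-C6 results.

```python
# (PWR, WLR, RT in seconds) per circuit, copied from the C1-C6 results.
integra = {
    "C1": (82.8, 96, 0.01), "C2": (80.9, 102, 0.01), "C3": (80.8, 104, 0.01),
    "C4": (81.0, 104, 0.02), "C5": (80.7, 105, 0.05), "C6": (80.7, 105, 1.11),
}
ours = {
    "C1": (83.5, 77.4, 0.42), "C2": (82.3, 76.4, 0.97), "C3": (82.3, 74.9, 3.14),
    "C4": (82.4, 75.6, 10.59), "C5": (82.1, 76.4, 16.66), "C6": (82.3, 82, 217.4),
}

def mean_ratio(numer: dict, denom: dict, col: int) -> float:
    """Average over circuits of numer[circuit][col] / denom[circuit][col]."""
    return sum(numer[c][col] / denom[c][col] for c in numer) / len(numer)

pwr_ours_vs_integra = mean_ratio(ours, integra, 0)
wlr_integra_vs_ours = mean_ratio(integra, ours, 1)
rt_ours_vs_integra = mean_ratio(ours, integra, 2)

print(round(pwr_ours_vs_integra, 2))  # 1.02
print(round(wlr_integra_vs_ours, 2))  # 1.33
print(round(rt_ours_vs_integra))      # 252
```

Under this reading, the 33% figure means Integra's wirelength ratio is on average 1.33 times the proposed method's.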

slide-27
SLIDE 27

27

Bound-Integra

Experimental Results: Real Designs

Effect of Different Bound Factors on Power Ratio and WL Ratio

slide-28
SLIDE 28

28

  • Comparable power reduction
  • 43% WL reduction compared with Bound-Integra

Experimental Results: Real Designs

Circuit    Bound-Integra             Ours
           PWR     WLR     RT (s)    PWR     WLR    RT (s)
Tv80       78.11   109.2   0.01      78.10   95.7   0.94
Wbconmax   78.26   128     0.03      78.02   105    2.3
Pairing    78.00   132     0.03      78.00   109    6.61
Dma        78.04   124     0.05      78.02   96     5.43
Ac97       78.02   120     0.02      78.02   96     4.88
Ethernet   78.00   217     0.63      78.00   88     24.5
Avg.       1       1.43    1         0.99    1      84

slide-29
SLIDE 29

29

We propose an analytical clustering score for MBFF merging

  • The time complexity is sub-quadratic
  • We get power reduction comparable to Integra
  • We reduce wirelength by about 25% compared with the original placement

Potential usage:

  • Integrated in global placement
  • Clustering algorithms

Conclusion

slide-30
SLIDE 30

30

Thanks

{changxu, gluo} @pku.edu.cn

Q&A