Fast Buffer Insertion Considering Process Variation Jinjun Xiong, - - PowerPoint PPT Presentation



SLIDE 1

Fast Buffer Insertion Considering Process Variation

Jinjun Xiong, Lei He

EE Department, University of California, Los Angeles

Sponsors: NSF, UC MICRO, Actel, Mindspeed

SLIDE 2

Agenda

Introduction and motivation
Modeling
Problem formulation
Detailed algorithms with complexity analysis
Experimental results
Conclusion

SLIDE 3

Buffer Insertion Flashback

Buffer insertion and sizing is a commonly used technique for high-performance chip designs to minimize delay

Classic results on buffer insertion
– Two-pin nets: closed form for optimal solution [Bakoglu 90]
– Multi-pin nets: dynamic-programming based algorithm to find the optimal solution [Van Ginneken 90]

Extensions
– Multiple buffer libraries considering power minimization [Lillis 96]
– Wire segmentation [Alpert DAC97]
– Simultaneous buffer insertion and wire sizing [Chu, ISPD97]
– Simultaneous tree construction and buffer insertion [Okamoto DAC96]
– Simultaneous dual Vdd assignment and buffered tree construction [Tam DAC05]
– …

SLIDE 4

Design Optimization in Nanometer Manufacturing

Probabilistic design approaches have shown great promise for achieving better design quality
– Compared to deterministic approaches, statistical circuit tuning achieved
  • 20% area reduction [Choi DAC04]
  • 17% power reduction [Mani DAC05]

Buffer insertion considering process variation has also been gaining attention recently, but existing work has limitations
– Limited consideration of process variation
  • Wire-length variation [Khandelwal ICCAD03]
– Independence assumption on process variation
  • Ignores global and spatial correlation [Xiong DATE05], [He ISPD05]
– High complexity
  • Numerical integration to obtain JPDF [Xiong DATE05]
– Applicable only to special routing structures
  • Two-pin nets only [Deng ICCAD05]

Our major contributions: theoretical foundations that lift these limitations

SLIDE 5

Agenda

Introduction and motivation
Modeling
Problem formulation
Detailed algorithms with complexity analysis
Experimental results
Conclusion

SLIDE 6

Modeling

Linear delay model for buffer
– Input capacitance (Cb), output resistance (Rb), and intrinsic delay (Tb)

π-model for interconnect
– Wire capacitance (Cw) and wire resistance (Rw)

How to model these quantities with correlated process variation?

SLIDE 7

First-order Canonical Form for Variation Modeling

$$A = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n + a_R X_{Ra}$$

Mean value: $E(A) = a_0$

Random variables $X_1, X_2, \ldots, X_n$ model
– Die-to-die global variation: instances are affected in the same way
– Within-die spatial correlation: instances physically nearby are more likely to be similar [Agarwal ASPDAC03, Chang ICCAD03, Khandelwal DAC05]

Random variable $X_{Ra}$ models
– Independent variation: instances next to each other are different

All $X_i$ follow independent normal distributions
– Well accepted practice in SSTA [Chang ICCAD03, Visweswariah DAC04]

In vector form, write device and interconnect with process variation
– Device: $T_b = T_{b0} + \gamma_b^T X$, $C_b = C_{b0} + \eta_b^T X$, $R_b = R_{b0} + \zeta_b^T X$
– Interconnect: $C_w = C_{w0} + \eta_w^T X$, $R_w = R_{w0} + \zeta_w^T X$
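To make the canonical form on this slide concrete, here is a minimal Python sketch that stores a quantity as a mean plus a list of sensitivities to independent standard normal variables. The class name `Canonical`, the helper `covariance`, and all numbers are hypothetical; treating each independent term as one more entry of the sensitivity list is an assumption made here for brevity, not something the slides prescribe.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Canonical:
    """a0 + sum_i a_i * X_i, with all X_i independent standard normals."""
    mean: float        # a0, the mean value E(A)
    sens: List[float]  # sensitivities; private (independent) terms get their own slot

    def variance(self) -> float:
        # Independent unit-variance sources: variances of the terms simply add.
        return sum(a * a for a in self.sens)

    def sigma(self) -> float:
        return math.sqrt(self.variance())


def covariance(p: Canonical, q: Canonical) -> float:
    """Correlation arises only through shared sources (same index in sens)."""
    return sum(a * b for a, b in zip(p.sens, q.sens))


# Hypothetical sources: X = (global, spatial, private to cb, private to rb)
cb = Canonical(mean=10.0, sens=[0.5, 0.3, 0.4, 0.0])
rb = Canonical(mean=200.0, sens=[8.0, 6.0, 0.0, 5.0])
print(cb.variance(), rb.sigma(), covariance(cb, rb))
```

Because both quantities share the global and spatial entries, their covariance is nonzero, which is the correlation the canonical form is meant to preserve.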

SLIDE 8

Buffer Insertion Considering Process Variation

  • Given: a routing tree with required arrival time (RAT) and loading capacitance specified at sinks, and N possible buffer locations
  • Considering: both FEOL device and BEOL interconnect process variations
  • Find: locations to insert buffers
  • So that: the timing slack at the root is maximized

– Timing slack: $\min_i (\mathrm{RAT}_i - \mathrm{delay}_i)$

[Figure: routing tree with nodes s0 to s4; labels: root, possible buffer locations, sinks]

SLIDE 9

Agenda

Introduction and motivation
Modeling
Problem formulation
Detailed algorithms
– Key operations for buffering solutions
– Transitive-closure pruning rule
– Complexity analysis
Experimental results
Conclusion

SLIDE 10

Key Operations in Van Ginneken Algorithm

Associate each node with two metrics (Ct, Tt)
– Downstream loading capacitance (Ct) and RAT (Tt)
– DP-based algorithm propagates potential solutions bottom-up [Van Ginneken, 90]

Add a wire (Cw, Rw) to solution (Cn, Tn)
$$C_t = C_n + C_w, \qquad T_t = T_n - R_w \cdot L_n - \tfrac{1}{2} R_w \cdot C_w$$

Add a buffer to solution (Cn, Tn)
$$C_t = C_b, \qquad T_t = T_n - T_b - R_b \cdot L_n$$

Merge two solutions (Cn, Tn) and (Cm, Tm)
$$C_t = C_n + C_m, \qquad T_t = \min(T_n, T_m)$$

Here $L_n$ denotes the downstream load at node n, that is, the loading capacitance $C_n$ of the solution being extended.

How to define these operations in a statistical sense?
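The three deterministic key operations can be sketched in a few lines of Python. This is only an illustration of the formulas on this slide: the function names, the `(load, rat)` tuple layout, and the numbers are hypothetical, and the load term in the delay is taken to be the downstream capacitance of the solution being extended.

```python
from typing import Tuple

Solution = Tuple[float, float]  # (downstream load C, required arrival time T)


def add_wire(sol: Solution, rw: float, cw: float) -> Solution:
    """C_t = C_n + C_w,  T_t = T_n - R_w * C_n - 0.5 * R_w * C_w."""
    load, rat = sol
    return load + cw, rat - rw * load - 0.5 * rw * cw


def add_buffer(sol: Solution, cb: float, rb: float, tb: float) -> Solution:
    """C_t = C_b,  T_t = T_n - T_b - R_b * C_n (the buffer hides the load)."""
    load, rat = sol
    return cb, rat - tb - rb * load


def merge(s1: Solution, s2: Solution) -> Solution:
    """C_t = C_n + C_m,  T_t = min(T_n, T_m) at a branching point."""
    return s1[0] + s2[0], min(s1[1], s2[1])


# Hypothetical values, arbitrary but consistent units
wired = add_wire((2.0, 100.0), rw=4.0, cw=1.0)
buffered = add_buffer(wired, cb=0.5, rb=2.0, tb=1.0)
joined = merge((1.0, 50.0), (2.0, 40.0))
print(wired, buffered, joined)
```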

SLIDE 11

Atomic Operations

Keep all quantities in canonical form after operations
– Maintain correlation w.r.t. sources of variation
– Updated solutions can still be handled by the same set of operations

Key operations
– Add a wire: $C_t = C_n + C_w$, $T_t = T_n - R_w \cdot L_n - \tfrac{1}{2} R_w \cdot C_w$
– Add a buffer: $C_t = C_b$, $T_t = T_n - T_b - R_b \cdot L_n$
– Merge two solutions: $C_t = C_n + C_m$, $T_t = \min(T_n, T_m)$

Atomic operations
– Addition/subtraction of two canonical forms is another canonical form
$$A + B = (a_0 + \alpha^T X) + (b_0 + \beta^T X) = (a_0 + b_0) + (\alpha + \beta)^T X$$
– Multiplication: the result is no longer a canonical form
– Minimum: the result is no longer a canonical form

SLIDE 12

Approximate Multiplication as Canonical Form

Multiplication of two canonical forms results in a quadratic term
$$C = A \cdot B = (a_0 + \alpha^T X)(b_0 + \beta^T X) = a_0 b_0 + (a_0 \beta + b_0 \alpha)^T X + X^T \alpha \beta^T X = c + \gamma^T X + X^T \Gamma X$$
– Constant $c = a_0 b_0$, vector $\gamma = a_0 \beta + b_0 \alpha$, matrix $\Gamma = \alpha \beta^T$

Approximate it as a canonical form by matching the mean and variance with those of the exact solution
$$C' = E(C) + \sqrt{\frac{E(C^2) - E(C)^2}{\gamma^T \gamma}}\; \gamma^T X = c' + \eta^T X$$
– $E(C)$ is the mean value (first moment) of C
– $E(C^2)$ is the second moment of C, so $E(C^2) - E(C)^2$ is the variance
– $C'$ is a new canonical form with the same mean and variance as C

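SLIDE 13

Closed Form for Moment Computation

1st moment
$$E(C) = c + E(\gamma^T X) + E(X^T \Gamma X)$$

2nd moment
$$E(C^2) = E\big(c^2 + 2c\,\gamma^T X + 2\,\gamma^T X \cdot X^T \Gamma X + X^T (2c\Gamma + \gamma\gamma^T) X + (X^T \Gamma X)^2\big)$$
$$= c^2 + 2c\,E(\gamma^T X) + 2\,E(\gamma^T X \cdot X^T \Gamma X) + E\big(X^T (2c\Gamma + \gamma\gamma^T) X\big) + E\big((X^T \Gamma X)^2\big)$$

Theorem: If X follows an independent multivariate normal distribution, X ~ N(0, I), then for any vector γ and symmetric matrix Γ
$$E(X^T \Gamma X) = \mathrm{tr}(\Gamma), \qquad E(\gamma^T X \cdot X^T \Gamma X) = 0, \qquad E\big((X^T \Gamma X)^2\big) = 2\,\mathrm{tr}(\Gamma^2) + \mathrm{tr}(\Gamma)^2$$
– The trace of a matrix (tr) equals the sum of all diagonal elements
– A non-symmetric Γ such as $\alpha\beta^T$ can be replaced by its symmetric part $(\Gamma + \Gamma^T)/2$ without changing $X^T \Gamma X$

In general, tr(Γ) and tr(Γ²) are expensive to compute, but if $\Gamma = \alpha\beta^T + \varepsilon\eta^T$ (a low-rank matrix), we can show
$$\mathrm{tr}(\Gamma) = \beta^T \alpha + \eta^T \varepsilon, \qquad \mathrm{tr}(\Gamma^2) = (\beta^T \alpha)^2 + (\eta^T \varepsilon)^2 + 2\,(\beta^T \varepsilon)(\eta^T \alpha)$$

Resulting approximation
$$A \cdot B \approx C' = E(C) + \sqrt{\frac{E(C^2) - E(C)^2}{\gamma^T \gamma}}\; \gamma^T X = c' + \eta^T X$$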
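The moment matching for a product can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `approx_multiply` and the `(mean, sensitivities)` tuple layout are hypothetical, X is taken as N(0, I), and the trace formula is applied to the symmetric part of Γ = αβᵀ, namely (αβᵀ + βαᵀ)/2, which gives tr(Γ) = αᵀβ and tr(Γ²) = ((αᵀβ)² + (αᵀα)(βᵀβ))/2.

```python
import math
from typing import List, Tuple

Canon = Tuple[float, List[float]]  # (mean, sensitivities to X ~ N(0, I))


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def approx_multiply(a: Canon, b: Canon) -> Canon:
    """Canonical form with the same mean and variance as the exact product A*B."""
    a0, alpha = a
    b0, beta = b
    c = a0 * b0
    gamma = [a0 * bi + b0 * ai for ai, bi in zip(alpha, beta)]
    tr = dot(alpha, beta)                                  # tr(Gamma)
    tr_sq = 0.5 * (tr * tr + dot(alpha, alpha) * dot(beta, beta))  # tr(Gamma_sym^2)
    mean = c + tr                                          # E(C)
    gg = dot(gamma, gamma)
    # E(C^2) = c^2 + 2c*tr(Gamma) + gamma.gamma + 2*tr(Gamma^2) + tr(Gamma)^2
    second = c * c + 2.0 * c * tr + gg + 2.0 * tr_sq + tr * tr
    var = max(second - mean * mean, 0.0)
    # Rescale gamma so the linear part carries the full variance.
    # If gamma is the zero vector (both means zero) no direction is available.
    scale = math.sqrt(var / gg) if gg > 0.0 else 0.0
    return mean, [scale * g for g in gamma]


# Hypothetical inputs: A and B depend on different sources, so they are independent
prod_mean, prod_sens = approx_multiply((2.0, [1.0, 0.0]), (3.0, [0.0, 1.0]))
print(prod_mean, dot(prod_sens, prod_sens))
```

For independent A and B the exact variance of the product is a0²σ_B² + b0²σ_A² + σ_A²σ_B² = 4 + 9 + 1 = 14 in this example, which the sketch reproduces.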

SLIDE 14

Approximate Minimum as Canonical Form

The minimum of two canonical forms is also not a canonical form

Approximate it as a canonical form by matching the exact mean and variance
– Tightness probability of A:
$$T_A = P(A < B) = \Phi\!\left(\frac{b_0 - a_0}{\theta}\right), \qquad T_B = 1 - T_A$$
  • Φ is the CDF of a standard normal distribution
  • θ is given by
$$\theta = \sqrt{\sigma_A^2 + \sigma_B^2 - 2\,\mathrm{cov}(A, B)}$$
– Exact mean and variance can be computed in closed form [Clark 61]
  • Well known for statistical timing analysis
– Resulting canonical form, with $c$ and $c_R$ chosen to match the exact mean and variance:
$$\min(A, B) \approx c + (T_A \alpha + T_B \beta)^T X + c_R X_R$$

Design for mean value ≠ design for nominal value because of mean shift
$$\underbrace{E(\min(A, B))}_{\text{design for mean value}} = T_A a_0 + T_B b_0 - \theta\,\varphi\!\left(\frac{b_0 - a_0}{\theta}\right) \neq \underbrace{\min(a_0, b_0)}_{\text{design for nominal value}}$$
– φ is the PDF of a standard normal distribution
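A minimal Python sketch of the minimum approximation, using Clark's closed-form moments, is given here for illustration. The names `approx_min` and `Canon` and the three-field tuple layout are hypothetical. Putting the leftover variance into a private independent term follows the c_R X_R term in the slide's formula, but the exact bookkeeping (including the clamping at zero) is an assumption of this sketch.

```python
import math
from statistics import NormalDist
from typing import List, Tuple

# (mean, sensitivities to shared X ~ N(0, I), coefficient of a private X_R)
Canon = Tuple[float, List[float], float]
STD = NormalDist()


def dot(u: List[float], v: List[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def approx_min(a: Canon, b: Canon) -> Canon:
    a0, alpha, ar = a
    b0, beta, br = b
    var_a = dot(alpha, alpha) + ar * ar
    var_b = dot(beta, beta) + br * br
    cov = dot(alpha, beta)                 # private terms are uncorrelated
    theta = math.sqrt(max(var_a + var_b - 2.0 * cov, 0.0))
    if theta == 0.0:                       # A - B is a constant: no approximation needed
        return a if a0 <= b0 else b
    d = (b0 - a0) / theta
    t_a = STD.cdf(d)                       # tightness probability P(A < B)
    t_b = 1.0 - t_a
    shift = theta * STD.pdf(d)             # mean shift below the tightness-weighted mean
    mean = t_a * a0 + t_b * b0 - shift
    second = t_a * (a0 * a0 + var_a) + t_b * (b0 * b0 + var_b) - (a0 + b0) * shift
    var = max(second - mean * mean, 0.0)
    sens = [t_a * x + t_b * y for x, y in zip(alpha, beta)]
    # Remaining variance goes to the private term; clamped if none remains.
    c_r = math.sqrt(max(var - dot(sens, sens), 0.0))
    return mean, sens, c_r


# Hypothetical inputs: two independent quantities with equal means
m_mean, m_sens, m_cr = approx_min((10.0, [1.0, 0.0], 0.0), (10.0, [0.0, 1.0], 0.0))
print(m_mean, m_sens, m_cr)
```

In this example both nominal values are 10, yet the mean of the minimum is 10 − 1/√π, which illustrates the mean shift mentioned on the slide.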

SLIDE 15

Agenda

Introduction and motivation
Modeling
Problem formulation
Detailed algorithms
– Key operations for buffering solutions
– Transitive-closure pruning rule
– Complexity analysis
Experimental results
Conclusion

SLIDE 16

Deterministic Pruning Rule

If T1 > T2 and C1 < C2, then (C1, T1) dominates (C2, T2)
– Dominated solution (C2, T2) is redundant

Deterministic pruning has linear time complexity because of the following two desired properties
– Ordering property
  • Either A > B or A < B holds
– Transitive ordering (transitive-closure) property
  • A > B and B > C imply A > C
– These properties make it possible to sort solutions in order
  • Assuming solutions are sorted by load, pruning redundant solutions takes linear time

[Figure: solutions plotted as RAT versus load, with redundant solutions marked]

Can we achieve the same time complexity for statistical pruning?
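The deterministic pruning rule can be sketched as a single pass over solutions sorted by load. The function name `prune` and the sample values are hypothetical. The sort is included only to make the sketch self-contained; the linear-time claim on the slide applies to the pass itself when the list is already sorted.

```python
from typing import List, Tuple

Solution = Tuple[float, float]  # (load C, required arrival time T)


def prune(solutions: List[Solution]) -> List[Solution]:
    """Drop every solution that has no RAT advantage over a lighter-loaded one."""
    # Sort by load; for equal loads, visit the larger RAT first.
    ordered = sorted(solutions, key=lambda s: (s[0], -s[1]))
    kept: List[Solution] = []
    best_rat = float("-inf")
    for load, rat in ordered:
        # Everything already kept has load <= this load, so this solution is
        # useful only if it improves the RAT. Ties are dropped as well, which
        # is slightly more aggressive than the strict rule on the slide.
        if rat > best_rat:
            kept.append((load, rat))
            best_rat = rat
    return kept


pruned = prune([(3, 9), (1, 5), (2, 4), (4, 8), (2, 7)])
print(pruned)
```

After pruning, the surviving list is increasing in both load and RAT, which is what makes later merge steps on sorted lists possible.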

SLIDE 17

Statistical Pruning Rule

(C1, T1) dominates (C2, T2) if P(C1 < C2) ≥ 0.5 and P(T1 > T2) ≥ 0.5

Properties of this statistical pruning rule
– Ordering property
  • Given: T1 and T2 as two dependent random variables
  • Then: either P(T1 > T2) ≥ 0.5 or P(T1 < T2) ≥ 0.5 holds
– Transitive-closure ordering property
  • Given: T1, T2, and T3 as three dependent random variables with a joint normal distribution
  • If: P(T1 > T2) ≥ 0.5 and P(T2 > T3) ≥ 0.5
  • Then: P(T1 > T3) ≥ 0.5
– Transitive-closure property can be extended to the more general case
  • P(T1 > T2) ≥ p and P(T2 > T3) ≥ p imply P(T1 > T3) ≥ p for any p ∈ [0.5, 1]

Statistical pruning has the same linear time complexity as deterministic pruning
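For quantities in canonical form, the probabilities in the statistical pruning rule have a simple closed form, because the difference of two canonical forms is again normal. The sketch below is illustrative only: `prob_greater`, `dominates`, and the sample solutions are hypothetical names and values, and every quantity is assumed to be a `(mean, sensitivities)` pair over the same X ~ N(0, I).

```python
import math
from statistics import NormalDist
from typing import List, Tuple

Canon = Tuple[float, List[float]]   # (mean, sensitivities to shared X ~ N(0, I))
StatSolution = Tuple[Canon, Canon]  # (load C, required arrival time T)


def prob_greater(a: Canon, b: Canon) -> float:
    """P(A > B); A - B is normal with the subtracted means and sensitivities."""
    diff_mean = a[0] - b[0]
    diff_sens = [x - y for x, y in zip(a[1], b[1])]
    sigma = math.sqrt(sum(d * d for d in diff_sens))
    if sigma == 0.0:  # the difference is a constant
        return 1.0 if diff_mean > 0 else (0.5 if diff_mean == 0 else 0.0)
    return NormalDist().cdf(diff_mean / sigma)


def dominates(s1: StatSolution, s2: StatSolution, p: float = 0.5) -> bool:
    """s1 dominates s2 if P(C1 < C2) >= p and P(T1 > T2) >= p."""
    (c1, t1), (c2, t2) = s1, s2
    return prob_greater(c2, c1) >= p and prob_greater(t1, t2) >= p


# Hypothetical solutions: s1 has the smaller load and the larger RAT on average
s1 = ((1.0, [0.1, 0.0]), (100.0, [2.0, 1.0]))
s2 = ((1.2, [0.1, 0.1]), (95.0, [1.0, 3.0]))
print(dominates(s1, s2), dominates(s2, s1), dominates(s1, s2, p=0.99))
```

With p = 0.5 the test reduces to comparing means, since Φ(x) ≥ 0.5 exactly when x ≥ 0. Larger p also takes the spread of the difference into account.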

SLIDE 18

Deterministic vs Statistical Buffering

Same O(N²) complexity as the classic deterministic buffering algorithm

Deterministic merge and pruning operations can be combined into one linear time operation
– New complexity result: O(N·log²(N)) [Wei, DAC 03]

Statistical merge and pruning cannot be combined
– The statistical version's complexity is higher than this improved deterministic result

Deterministic (all quantities are deterministic values):

    For solution (Cn, Tn) in node t
        Z1 = ADD-WIRE(Cn, Tn);
        Z2 = ADD-BUFFER(Z1);
    ……
    For solution (Cm, Tm) from subtree m
        For solution (Cn, Tn) from subtree n
            (Ct, Tt) = MERGE((Cm, Tm), (Cn, Tn));
    ……
    Z = PRUNE(Z);

Statistical (all quantities are canonical forms):

    For solution (Cn, Tn) in node t
        Z1 = STAT-ADD-WIRE(Cn, Tn);
        Z2 = STAT-ADD-BUFFER(Z1);
    ……
    For solution (Cm, Tm) from subtree m
        For solution (Cn, Tn) from subtree n
            (Ct, Tt) = STAT-MERGE((Cm, Tm), (Cn, Tn));
    ……
    Z = STAT-PRUNE(Z);

SLIDE 19

When Can Merge and Prune Be Combined?

Made possible via a merge-sort-like operation in the deterministic case
– Because of the following property: Min(A1, B1) ≤ Min(A2, B1) if A1 ≤ A2
  • Min(A, B) ≤ Min(A + δA, B), so Min(A, B) is a nondecreasing function of its inputs

[Figure: merging two solution lists sorted by load and RAT]

In the statistical case, such a property does not hold (even for the mean)
$$\frac{\partial E(\min(A, B))}{\partial E(A)} > 0, \qquad \frac{\partial E(\min(A, B))}{\partial \sigma_A} < 0, \qquad \frac{\partial E(\min(A, B))}{\partial \rho(A, B)} > 0$$
– The mean of the minimum grows with E(A) but shrinks as σ_A grows, so ordering the inputs by mean does not order the merged results
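A small numeric check illustrates why the monotonic property fails for the mean. The sketch evaluates Clark's closed-form mean of min(A, B) for two normal variables; the function name `mean_of_min` and the numbers are hypothetical, chosen so that A2 has the larger mean but also a much larger standard deviation than A1.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def mean_of_min(a: float, sigma_a: float, b: float, sigma_b: float,
                rho: float = 0.0) -> float:
    """E(min(A, B)) for jointly normal A, B; assumes theta > 0."""
    theta = math.sqrt(sigma_a ** 2 + sigma_b ** 2 - 2.0 * rho * sigma_a * sigma_b)
    d = (b - a) / theta
    return a * STD.cdf(d) + b * STD.cdf(-d) - theta * STD.pdf(d)


# B is fixed; A2 has a larger mean than A1 but is far more uncertain.
m1 = mean_of_min(10.0, 0.1, 10.0, 1.0)   # E(min(A1, B))
m2 = mean_of_min(10.2, 3.0, 10.0, 1.0)   # E(min(A2, B))
print(m1, m2)
```

Here E(A1) < E(A2) while E(min(A1, B)) > E(min(A2, B)), so sorting the inputs by mean does not sort the merged results.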

SLIDE 20

Agenda

Introduction and motivation
Modeling
Problem formulation
Detailed algorithms
– Key operations for buffering solutions
– Transitive-closure pruning rule
– Complexity analysis
Experimental results
Conclusion

SLIDE 21

Experimental Setting

Variation setting
– Global, spatial, and independent variations are each set to 5% of the nominal value
– Spatial variation uses a grid model similar to [Chang, ICCAD03]
  • Grid size 500um
  • Correlation distance about 2mm (beyond that, no spatial correlation)

Benchmarks
– Two sets of benchmarks from the public domain [Shi, DAC03]

Deterministic design for worst case (WORST)
– All parameters projected to their respective 3-sigma values

SLIDE 22

Runtime Comparison

Compared with T2P proposed in [Xiong DATE05]
– Only known work that considered both device and interconnect variations
– JPDF computed via expensive numerical integration
– No global and spatial correlation considered
– Heuristic pruning rules (T2P)

T2P is re-implemented under the same first-order variation model, but still uses its heuristic pruning rule

Bench | Sink | Buf Loc | WORST | T2P  | This work
p1    | 269  | 537     |       | 25.4 | 1.0 (25.4x)
p2    | 603  | 1205    | 0.01  | -    | 4.3
r1    | 267  | 533     |       | -    | 3.6
r2    | 598  | 1195    |       | -    | 15.0
r3    | 862  | 1723    | 0.02  | -    | 27.5
r4    | 1903 | 3805    | 0.04  | -    | 88.9
r5    | 3101 | 6201    | 0.08  | -    | 195.8

(Blank and - entries carry no value in the source.)

SLIDE 23

Monte Carlo Simulation Results

For a given buffered routing tree, 10K Monte Carlo (MC) runs give the delay PDF at the root
– PDF from Monte Carlo roughly follows a normal distribution
– Our approximation technique captures the PDF well

Figures of merit: 3-sigma delay vs yield loss

[Figure, first panel: delay PDF at the root, Monte Carlo vs normal approximation]
[Figure, second panel: delay PDFs of this work and WORST, annotated with the yield loss and the 3-sigma delay for the red PDF]

SLIDE 24

Timing Optimization Comparison based on MC

Buffer insertion considering process variation improves timing yield by 15% on average
– More effective for large benchmarks
– Relative mean (or 3-sigma) delay improvement is small because the mean values are large
– More buffers are inserted in order to achieve this gain

Bench | WORST: Buffer | WORST: Mean | WORST: 3-sigma Delay | WORST: Yield Loss | This Work: Buffer | This Work: Mean | This Work: 3-sigma Delay | This Work: Yield
p1    | 58            | 2374        | 2403                 | 0%                | 60 (3.3%)         | 2375            | 2403                     | 100%
p2    | 149           | 3161        | 3203                 | 0%                | 156 (4.5%)        | 3161            | 3204                     | 100%
r1    | 59            | 772         | 790                  | 0%                | 65 (9.2%)         | 771             | 790                      | 100%
r2    | 112           | 1109 (1.7%) | 1128 (1.5%)          | 35.3%             | 135 (17%)         | 1090            | 1111                     | 100%
r3    | 173           | 1127 (0.7%) | 1147 (0.5%)          | 1.6%              | 188 (8%)          | 1119            | 1142                     | 100%
r4    | 320           | 1700 (1.5%) | 1723 (1.4%)          | 54.9%             | 374 (14.4%)       | 1674            | 1699                     | 100%
r5    | 544           | 1958 (1%)   | 1986 (1.0%)          | 17.9%             | 608 (10.5%)       | 1938            | 1966                     | 100%
Avg   |               | 0.7%        | 0.6%                 | 15.7%             | 9.6%              |                 |                          |

SLIDE 25

Conclusion and Future Work

Developed a novel algorithm for buffer insertion considering process variation

Two major theoretical contributions
– An effective approximation technique to handle the nonlinear multiplication operation, all through closed-form computation
– A provably transitive-closure pruning rule
– May be useful for other applications

Timing optimization showed that considering process variation can improve timing yield by more than 15%

Future work
– Theoretically examine the impact of process variation on buffering
– Apply the theories in this work to other design applications

SLIDE 26

Questions?