Efficient Generation of Short and Fast Repeater Tree Topologies



SLIDE 1

Efficient Generation of Short and Fast Repeater Tree Topologies

Christoph Bartoschek, Dieter Rautenbach, Jens Vygen, Stephan Held

Research Institute for Discrete Mathematics, University of Bonn

Aussois, 2006

SLIDE 2

The Repeater Tree Problem

Figure: a source and its sinks.

◮ A signal has to be distributed from a source to a set of sinks.
◮ The delay on a source-sink path increases
  ◮ quadratically in the path length within the tree.


SLIDE 4

The Repeater Tree Problem

Figure: a source and its sinks.

◮ A signal has to be distributed from a source to a set of sinks.
◮ The delay on a source-sink path increases
  ◮ linearly in path length (assuming ideal repeater insertion),
  ◮ with every bifurcation on the path.
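The step from quadratic to linear growth can be made plausible with a back-of-the-envelope Elmore-style estimate. This is only a sketch under assumptions that are not stated on the slides: a uniform wire with resistance r' and capacitance c' per unit length, a fixed delay d_rep per repeater, and driver resistance and repeater input capacitance ignored.

```latex
% Unbuffered wire of length \ell, treated as a distributed RC line:
\[ t_{\text{wire}}(\ell) \approx \tfrac{1}{2}\, r' c'\, \ell^{2} \]
% Same wire cut into \ell / \ell_0 segments of fixed length \ell_0,
% with one repeater driving each segment:
\[ t_{\text{buffered}}(\ell) \approx \frac{\ell}{\ell_0}
   \Bigl( d_{\text{rep}} + \tfrac{1}{2}\, r' c'\, \ell_0^{2} \Bigr) \]
```

For a fixed segment length the bracket is a constant, so the buffered delay grows linearly in the path length.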

SLIDE 5

Importance of Repeater Trees

◮ As feature sizes decrease, wire resistances increase.
◮ More and more repeaters are needed:
  ◮ 10-20% repeaters in 130 nm technology
  ◮ 20-30% repeaters in 90 nm technology
  ◮ 30-40% repeaters in 65 nm technology
◮ The speed, robustness and power consumption depend heavily on repeater insertion algorithms.
◮ Up to 30 million instances are solved during timing closure.

⇒ Routines must be fast.

SLIDE 6

The Repeater Tree Problem

Input

◮ Repeater tree root-pin r with location Pl(r) ∈ ℝ².
◮ Set S of sink-pins s ∈ S with
  ◮ locations Pl(s) ∈ ℝ²,
  ◮ required signal arrival times RAT_s (w.l.o.g. AT_r = 0),
  ◮ required signal parities + or − and
  ◮ input pin capacitances.
◮ A library L of repeaters (inverters and buffers of varying sizes).

Output

A repeater tree that connects r with all s ∈ S using wires and legally placed repeaters from L, such that the signal arrives with the correct parity at all s ∈ S.

SLIDE 7

The Repeater Tree Problem

Objectives

◮ Minimize power consumption
◮ Minimize wiring
◮ Maximize worst slack σ_r, where

\[ \sigma_r := \min_{s \in S} \{ RAT_s - \text{signal delay}(r, s) \} \]


SLIDE 9

Previous Work

◮ Repeater insertion into a given topology with a finite number of admissible locations L.
  ◮ Dynamic programming with O(|L|²) running time (van Ginneken 1990).
  ◮ Running time was improved to O(|L| log |L|) (Shi and Li 2003, 2005).
◮ No satisfying solution exists for topology generation:
  ◮ Steiner minimum trees: minimum power but poor delays due to long paths.
  ◮ Bounded-radius Steiner trees.
  ◮ Heuristic splitting into critical and non-critical subtrees.

SLIDE 10

Our Contribution

◮ New topology generation:
  ◮ Balance between power and performance. A parameter ξ ∈ [0, 1] allows scaling between power (ξ = 0) and performance (ξ = 1).
  ◮ Extremely fast.
◮ A linear time repeater insertion routine.

Both parts are integrated into our delay optimization environment.

SLIDE 11

Definition (Topology)

A topology T is an arborescence rooted at r with δ+(r) = 1 and δ+(u) = 2 for all internal nodes u. The set of leaves is a subset of S. All internal nodes u are assigned placement coordinates Pl(u).

Figure: Example of a topology

SLIDE 12

Delay Model

The delay from r to a sink s is modeled as

\[ c_{node} \cdot \bigl( |E(T_{[r,s]})| - 1 \bigr) + \sum_{(u,v) \in E(T_{[r,s]})} c_{wire} \cdot \mathrm{dist}(Pl(u), Pl(v)) \]

where T_[r,s] is the path from r to s in T.

◮ c_node: delay penalty for a bifurcation
◮ c_wire: delay per unit length
◮ Typical values are c_node = 20 ps and c_wire = 220 ps/mm.
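The delay model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the representation of a path as a list of placements, and the choice of the rectilinear (l1) distance for dist are assumptions; the default constants are the typical values quoted on the slide.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # placement coordinates in mm (assumed unit)


def l1_dist(a: Point, b: Point) -> float:
    """Rectilinear distance; the slide only writes dist(., .), l1 is assumed here."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def path_delay(path: List[Point], c_node: float = 20.0, c_wire: float = 220.0) -> float:
    """Delay in ps along a root-to-sink path given as node placements, root first.

    Implements c_node * (|E| - 1) + sum over edges of c_wire * dist(u, v).
    """
    num_edges = len(path) - 1
    wire_length = sum(l1_dist(u, v) for u, v in zip(path, path[1:]))
    return c_node * (num_edges - 1) + c_wire * wire_length
```

For example, a path root (0, 0) → branching node (1, 0) → sink (1, 0.5) has two edges, hence one bifurcation penalty, and 1.5 mm of wire: 20 + 220 · 1.5 = 350 ps.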

SLIDE 13

Delay Model - Example

c_wire = 1, c_node = 2.

SLIDE 14

Justification of Delay Model

Relation between critical path delays in our model (estimated delay) and after repeater insertion and exact timing analysis.

Figure: estimated delay (ns) versus exact delay after buffering and sizing (ns); both axes are ticked from 0.5 to 2 ns.

SLIDE 15

Bounds on Slack & Wire Length

Lower Wire Length Bound

A lower bound on the wire length is given by the length of a Steiner minimum tree (SMT).

Upper Slack Bound - Theorem

The maximum possible slack σ_max with respect to our delay model is at most

\[ -c_{node} \cdot \log_2 \Biggl( \sum_{s \in S} 2^{-\frac{RAT_s - c_{wire} \cdot \mathrm{dist}(Pl(r), Pl(s))}{c_{node}}} \Biggr). \]
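The closed-form bound is easy to evaluate, but the powers of two underflow quickly when required arrival times are large compared with c_node. The sketch below shows one way to compute it stably by factoring out the largest exponent. The function name, the input layout (a list of position/RAT pairs) and the use of the l1 distance are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def l1_dist(a: Point, b: Point) -> float:
    """Rectilinear distance (assumed; the slides only write dist)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def slack_upper_bound(root: Point, sinks: List[Tuple[Point, float]],
                      c_node: float = 20.0, c_wire: float = 220.0) -> float:
    """Evaluate -c_node * log2(sum_s 2^(-(RAT_s - c_wire * dist(r, s)) / c_node))."""
    exponents = [-(rat - c_wire * l1_dist(root, pos)) / c_node for pos, rat in sinks]
    shift = max(exponents)  # factor out the dominant term to avoid underflow
    total = sum(2.0 ** (e - shift) for e in exponents)
    return -c_node * (shift + math.log2(total))
```

With a single sink the bound reduces to RAT_s minus the wire delay, for instance 300 − 220 · 1 = 80 ps for a sink 1 mm from the root. Without the shift, a RAT of 100 000 ps would already turn 2^(−5000) into 0.0 in double precision.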

SLIDE 16

Proof.

The maximum possible slack can be obtained by a topology T in which all internal nodes share the root location: Pl(u) = Pl(r) for all internal nodes u.

Figure: topology with all internal nodes at the source.

All distance delays are then minimum: c_wire · dist(Pl(r), Pl(s)) for all s ∈ S.

SLIDE 17
Proof (continued).

◮ The problem reduces to: find a topology that maximizes the worst slack with
  ◮ new sink locations Pl′(s) := Pl(r) (⇔ c_wire = 0) and
  ◮ new required arrival times RAT′_s := RAT_s − c_wire · dist(Pl(r), Pl(s))
  for all s ∈ S.

SLIDE 18

Lemma

For c_wire = 0, c_node = 1 and integer values for RAT_s, s ∈ S, the maximum possible slack with respect to our delay model is at most

\[ -\Bigl\lceil \log_2 \sum_{s \in S} 2^{-RAT_s} \Bigr\rceil. \]
SLIDE 19

Proof of Lemma.

◮ Kraft's inequality: There exists a rooted binary tree with n leaves at depths l_1, l_2, ..., l_n ⇔

\[ \sum_{i=1}^{n} 2^{-l_i} \le 1. \]

◮ The slack at the root σ_r is the minimum over all sink slacks ⇒

delay(r, s) = c_node · (|E(T_[r,s])| − 1) ≤ RAT_s − σ_r  ∀ s ∈ S.

⇒ The maximum slack achievable by any topology is bounded by

\[ \sigma_{max} = \max \Bigl\{ \sigma \in \mathbb{N} \Bigm| \sum_{s \in S} 2^{\frac{-RAT_s + \sigma}{c_{node}}} \le 1 \Bigr\} = -c_{node} \Bigl\lceil \log_2 \sum_{s \in S} 2^{-\frac{RAT_s}{c_{node}}} \Bigr\rceil. \]
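The argument can be checked on small instances with exact rational arithmetic. The sketch below, for the lemma's setting c_wire = 0 and c_node = 1, computes the Kraft sum for a list of leaf depths and searches for the largest integer σ with Σ 2^(−RAT_s + σ) ≤ 1. The function names are hypothetical, and searching over all integers (so that negative slacks are covered) is an assumption; the slide writes σ ∈ ℕ.

```python
from fractions import Fraction
from typing import List


def kraft_sum(depths: List[int]) -> Fraction:
    """Exact value of sum_i 2^(-l_i) for non-negative integer leaf depths."""
    return sum((Fraction(1, 2 ** d) for d in depths), Fraction(0))


def max_integer_slack(rats: List[int]) -> int:
    """Largest integer sigma with sum_s 2^(-RAT_s + sigma) <= 1 (c_wire = 0, c_node = 1)."""
    total = sum((Fraction(2) ** (-rat) for rat in rats), Fraction(0))
    sigma = 0
    while Fraction(2) ** sigma * total > 1:         # too optimistic: lower sigma
        sigma -= 1
    while Fraction(2) ** (sigma + 1) * total <= 1:  # still feasible one step up: raise sigma
        sigma += 1
    return sigma
```

For RATs (2, 1, 1) the result is σ = −1: the sinks may then sit at depths up to (3, 2, 2), whose Kraft sum 5/8 is at most 1, whereas σ = 0 would need depths (2, 1, 1) with Kraft sum 5/4.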
SLIDE 20

Improving the Upper Slack Bound

Drawbacks of closed formula

◮ Closed formula ignores the discrete structure of the problem.
◮ Computation creates numerical problems.

Huffman Coding

◮ No closed formula.
◮ Slightly better bounds.
◮ Numerically stable and linear time computation.
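The slides do not spell out the Huffman-style procedure. One plausible reading, sketched here as an assumption, is the greedy merge known from min-max tree construction: repeatedly replace the two sinks with the largest adjusted required arrival times a ≥ b by a single node with value b − c_node, until one value remains. The inputs are the adjusted times RAT′_s = RAT_s − c_wire · dist(Pl(r), Pl(s)); the function name is hypothetical.

```python
import heapq
from typing import List


def huffman_slack_bound(adjusted_rats: List[float], c_node: float = 20.0) -> float:
    """Greedy bound on the worst slack when all wire delays are already subtracted.

    Merging two subtrees with required times a >= b yields a node that must
    receive the signal by min(a, b) - c_node = b - c_node.
    """
    if not adjusted_rats:
        raise ValueError("need at least one sink")
    heap = [-r for r in adjusted_rats]  # heapq is a min-heap, so store negated values
    heapq.heapify(heap)
    while len(heap) > 1:
        heapq.heappop(heap)             # largest value a; it does not limit the merge
        b = -heapq.heappop(heap)        # second largest value, b <= a
        heapq.heappush(heap, -(b - c_node))
    return -heap[0]
```

For adjusted RATs (2, 1, 1) with c_node = 1 this returns −1, while the closed formula without rounding gives −log2(5/4) ≈ −0.32, which illustrates the slightly tighter bound. This heap-based version runs in O(|S| log |S|) rather than the linear time quoted on the slide, and it only uses comparisons and subtractions, so no powers of two can underflow.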


SLIDE 22

Topology Generation Algorithm

Define criticality of s ∈ S by RAT_s − c_wire · dist(Pl(r), Pl(s));
Start with partial topology T′ = ({r}, ∅);
Connect most critical sink s ∈ S to r;
while unconnected sinks exist do
    Choose most critical unconnected sink s ∈ S \ V(T′);
    Connect s to an arc e = (u, v) ∈ E(T′) such that
        1. ξ · σ_e + (ξ − 1) · c_wire · dist(Pl(s), Area(e)) and
        2. −c_wire · dist(Pl(s), Area(e)) (iff ξ = 1)
    is maximized;
end

σ_e is the slack at the root after connecting s to e. Area(e) is the area covered by the union of all shortest u-v-paths.
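The greedy loop can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: distances are l1, so Area(e) is the bounding box of the arc's endpoints; the branching node is placed at the point of that box closest to the new sink; "most critical" means the smallest criticality value; the second criterion is used as a tie-breaker for every ξ; and the root slack is recomputed from scratch for each candidate arc, which is slower than the running time stated on the slides. All names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def l1(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_in_box(p: Point, a: Point, b: Point) -> Point:
    """Closest point to p inside the bounding box of a and b (Area of the arc under l1)."""
    x = min(max(p[0], min(a[0], b[0])), max(a[0], b[0]))
    y = min(max(p[1], min(a[1], b[1])), max(a[1], b[1]))
    return (x, y)


def build_topology(root: Point, sinks: List[Tuple[Point, float]], xi: float,
                   c_node: float = 20.0, c_wire: float = 220.0):
    pos: List[Point] = [root]    # node 0 is the root r
    parent: Dict[int, int] = {}  # child -> parent, one entry per arc
    rat: Dict[int, float] = {}   # leaf node -> required arrival time of its sink

    def root_slack() -> float:
        worst = math.inf
        for leaf, required in rat.items():
            edges, wire, v = 0, 0.0, leaf
            while v != 0:        # walk up to the root
                wire += l1(pos[parent[v]], pos[v])
                edges += 1
                v = parent[v]
            worst = min(worst, required - c_node * (edges - 1) - c_wire * wire)
        return worst

    def attach(p, required, u, v, branch):
        """Split arc (u, v) at `branch` and hang a new leaf at p below it."""
        w = len(pos)
        pos.append(branch)
        s = len(pos)
        pos.append(p)
        parent[w], parent[v], parent[s] = u, w, w
        rat[s] = required
        return w, s

    def detach(u, v, w, s):
        """Undo the most recent attach."""
        del parent[w], parent[s], rat[s]
        parent[v] = u
        del pos[-2:]

    # Most critical sink first: smallest RAT_s - c_wire * dist(Pl(r), Pl(s)).
    order = sorted(sinks, key=lambda s: s[1] - c_wire * l1(root, s[0]))
    pos.append(order[0][0])
    parent[1] = 0
    rat[1] = order[0][1]
    for p, required in order[1:]:
        best = None
        for v, u in list(parent.items()):  # every arc (u, v) of the partial topology
            branch = nearest_in_box(p, pos[u], pos[v])
            detour = c_wire * l1(p, branch)
            w, s = attach(p, required, u, v, branch)
            score = (xi * root_slack() + (xi - 1.0) * detour, -detour)
            detach(u, v, w, s)
            if best is None or score > best[0]:
                best = (score, u, v, branch)
        _, u, v, branch = best
        attach(p, required, u, v, branch)
    return pos, parent, root_slack()
```

With root (0, 0), sinks at (1, 0) and (1, 1) that both require the signal by 500 ps, and ξ = 1, the sink at (1, 1) is connected first and the second sink branches off at (1, 0). The resulting worst slack is 500 − 20 − 440 = 40 ps.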


SLIDE 24

Lemma

For c_wire = 0, c_node = 1, ξ > 0 and integer values for RAT_s, s ∈ S, the algorithm generates a topology that realizes the maximum possible slack.

Proof.

Assume the sinks in S′ ⊂ S are already connected optimally in T′. Let s′ ∈ S \ S′.

◮ If all s ∈ S′ have the same slack σ_S′ in T′:
  ◮ They are connected at maximum possible slack.
  ◮ The best possible slack for the set S′ ∪ {s′} equals σ_S′ + 1.
  ◮ s′ can be connected to any existing edge in T′ such that its slack is ≤ σ_S′ + 1.
◮ Otherwise s′ can be connected to any non-critical edge.

SLIDE 25

Prim-Heuristic for Steiner Trees

Wire Length Minimization (ξ = 0):

◮ Instead of choosing the next critical sink:
  ◮ Choose the sink that is closest to the preliminary topology T′.
◮ Well-known heuristic existing in many variants.

Hwang ⇒ 3/2-approximation algorithm for SMT.
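The Prim-style selection rule can be illustrated with a short helper that picks the unconnected sink nearest to the current topology. As a sketch under assumptions: distances are l1, each arc is represented by the placements of its two endpoints, and the distance to an arc is measured to the bounding box of those endpoints. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Arc = Tuple[Point, Point]


def dist_to_arc(p: Point, arc: Arc) -> float:
    """l1 distance from p to the bounding box spanned by the arc's endpoints."""
    (ax, ay), (bx, by) = arc
    x = min(max(p[0], min(ax, bx)), max(ax, bx))
    y = min(max(p[1], min(ay, by)), max(ay, by))
    return abs(p[0] - x) + abs(p[1] - y)


def closest_sink(unconnected: List[Point], arcs: List[Arc]) -> Point:
    """Return the unconnected sink with the smallest distance to any arc."""
    return min(unconnected, key=lambda p: min(dist_to_arc(p, a) for a in arcs))
```

With a single arc from (0, 0) to (2, 0) and candidate sinks (1, 3), (3, 0) and (1, 0.5), the distances are 3, 1 and 0.5, so (1, 0.5) is connected next.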


SLIDE 27

Running Time

The running time is O(|S|² · Ψ), where Ψ is the running time for computing all shortest paths between a sink and a union of paths (Ψ = 1 for ℓ1-distances).

Handling Large Instances

◮ Pre-clustering if |S| > 10 000
◮ Facility location approximation [Massberg, Vygen 2005]
◮ Runtime: O(|S| log |S|)

SLIDE 28

Experimental Results

◮ 2.3 million instances with up to 10 000 sinks were taken from current 90 nm designs.
◮ The extreme cases ξ ∈ {0, 1} are compared against
  1. Length bound (SMT for |S| ≤ 30, heuristics for |S| > 30).
  2. Slack bound (Huffman coding).
◮ 4.6 million topologies were computed in ≤ 100 seconds on a 2.6 GHz Opteron.

SLIDE 29

Results Topology Generation

                             Wire Length Optimization (ξ = 0)              Slack Optimization (ξ = 1)
                      Wirelength dev. (%)     Slack dev. (ps) Wirelength dev. (%)     Slack dev. (ps)
# Sinks   # Instances      avg.     worst      avg.     worst      avg.     worst      avg.     worst
1             1547517      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
2              319759      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
3              165448      0.00      0.00     13.89     82.72     12.19     99.60      0.12     20.00
4               86377      0.16     19.65     23.72    312.98     10.93    190.27      0.27     40.00
5               44301      0.16     21.51     33.40    174.51     14.01    188.15      0.34     52.45
6               27854      0.28     23.84     41.92    118.27     14.38    268.06      1.04     52.93
7               20523      0.45     22.24     52.19    285.43     22.26    248.77      0.42     52.51
8               19300      0.44     30.73     64.01    332.29     19.39    268.49      2.08     69.13
9               11085      0.81     26.26     71.11    465.77     29.58    250.04      3.36     60.00
10              11942      0.74     28.68     76.46    367.39     23.61    296.47      1.45     54.87
11-20           38184      1.60     28.00    101.16    427.25     32.57    426.68      1.73     76.80
21-30           11104      3.20     30.80    144.27    520.00     35.86    805.45      2.51     84.18
31-50            8647      2.99     33.16    226.05    793.70     70.29   1091.17      6.55    161.81
51-100           6621      4.06     26.34    344.88   1486.06    105.90   1782.56     12.23    203.48
101-200          1863      5.82     16.91    606.26   2019.90    135.84   1498.34     19.78    351.25
201-500           824      6.22     24.00    920.37   3711.47    209.77   2127.34     26.91    304.92
501-1000          205      7.62     19.40   1686.15   3563.61    569.58   2242.49     48.57    257.65
> 1000             31      6.99     14.74   2929.08   7872.96    211.40   1124.99     17.78     89.88
Total         2321585      0.66     33.16      9.92   7872.96     19.35   2242.49      0.21    351.25
> 2 sinks      774068      1.31     33.16     50.69   7872.96     38.34   2242.49      1.08    351.25

Table: Deviation from known bounds, c_node = 20 ps

SLIDE 30

Repeater Insertion

◮ Repeaters are inserted bottom-up whenever the optimum load capacitance is reached.
◮ Special care has to be taken for merging branches that require different parity.
◮ 4.4 million trees are constructed in less than 10 min.
◮ Repeater sizes are refined by global gate sizing.

SLIDE 31

Practical Application

◮ The routine is one core component of our timing optimization (inverter trees + gate sizing).
◮ It has already been used on released chips.
◮ On the largest designs the timing optimization runtime was reduced from 3 days with an industrial tool to 6 hours.
◮ The runtime for timing closure (several iterations of placement + optimization) was reduced from more than a week to 26 hours.

SLIDE 32

Future Work

◮ Improve repeater insertion and add wire sizing.
◮ Improve topologies (i.e. in case of huge deviation from the upper slack bound).
◮ Consider blockages and wiring congestion (shortest path computations in grid graphs).