[PPT] - Grid-to-Ports Clock Routing PowerPoint Presentation

SLIDE 1

Grid-to-Ports Clock Routing for High Performance Microprocessor Designs

Haitong Tian#, Wai-Chung Tang#, Evangeline F.Y. Young# and C.N. Sze*

# Department of Computer Science and Engineering

The Chinese University of Hong Kong

* IBM Austin Research Laboratory

ISPD ’11, Santa Barbara, USA, March 28, 2011

SLIDE 2

Outline

 Introduction  Problem Formulation  Routing Algorithm  Experimental Results  Conclusion

ISPD 2011

SLIDE 3

Clock Distribution Categories

 Clock distribution is a very important issue

 Buffered and unbuffered trees

 Used in various ASICs  Supported by many physical design tools  See Tsay TCAD’93, Xi DAC’95

 Non-tree structure with crosslinks

 Intended for reducing clock skews  See Rajaram DAC’04, TCAD’06

 Grid and buffered trees

 High performance processors  Clock structures are sometimes designed manually  See Shelar ISPD’09, TCAD’10, Guru VLSI Circuits’10

SLIDE 4

High Performance Clock Distribution

 Clock network in high performance microprocessors

 Distributed as global grid followed by buffered trees  See Shelar ISPD’09, TCAD’10, Guru VLSI Circuits’10  This paper focuses on the post-grid clock distribution area

[Figure: clock network diagram; labels: External Clock, PLL, Grid buffers, Clock Grid, Regional Clock buffers, Local Clock buffers, Local Clock Network, Post-grid Clock Distribution]

SLIDE 5

Post-grid Clock Distribution

 In our modeling

 Entire chip divided into several layout areas  Each layout area contains many blocks  Each block contains standard cells and/or macros

 Each layout area contains

 100s-1000s clock ports  Grid wires reserved for clock routing  Typically upper metal layers

[Figure: layout diagram; labels: Global Grid, Grid Buffer, Blocks, Layout Region, Local Clock Buffer, Sequential Port, Reserved Tracks, Reserved multilayer tracks]

SLIDE 6

Motivations

 Clock distribution of microprocessor:

 Crucial importance  Major source of power dissipation

 High capacitance usage

 18.1% of total clock capacitance [1]  See Pham Solid State Circuits’06

 Manually designed in practice

 Hard to satisfy delay/slew constraints  Time to market  See Shelar ISPD’09, TCAD’10

[1]: D. Pham, T. Aipperspach, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. Harvey, H. Hofstee, C. Johns, et al. Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor. IEEE Journal of Solid-State Circuits, 41(1):179–196, Jan. 2006.

SLIDE 7

Outline

 Introduction  Problem Formulation  Routing Algorithm  Experimental Results  Conclusion

SLIDE 8

Problem Formulation

 Input

 A set of reserved tracks  Locations and capacitances of ports P  Different types of wires on each metal layer  Delay limit D. Slew limit S

 Output

 A clock network (may be non-tree structures)

 Objective

 Connecting every port to the source  Satisfying delay and slew constraints  Minimizing capacitance usage
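The formulation above can be read as a constrained optimization. The block below is one possible way to write it down, not notation taken from the slides: the symbols $N$ (the routed network), $c(e)$ (capacitance of wire segment $e$), and $d_N(p)$, $s_N(p)$ (delay and slew observed at port $p$ in $N$) are introduced here for illustration only. $P$, $D$, and $S$ are the port set, delay limit, and slew limit from the input list. It is also an assumption that only wire capacitance is being minimized, since port capacitances are fixed inputs.

```latex
\begin{aligned}
\min_{N} \quad & \sum_{e \in N} c(e) \\
\text{s.t.} \quad & d_N(p) \le D \quad \text{and} \quad s_N(p) \le S \qquad \forall\, p \in P, \\
& \text{every } p \in P \text{ is connected to the source in } N .
\end{aligned}
```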

SLIDE 9

Post-grid Clock Routing

 0

7 7 6 6 5

Layer

5

Layer

4 4 500 1400 1600 1800 3 500 1400 1600 1800 3

9 I SPD 2011

1000 1500 200 400 600 800 1000 1200 1400 1600 1000 1500 200 400 600 800 1000 1200 1400 1600

SLIDE 10

Outline

 Introduction  Problem Formulation  Routing Algorithm  Experimental Results  Conclusion

SLIDE 11

Overall Algorithm

 Critical ports

 Ports with large capacitance or far away from the source

 Path expansion algorithm

 Elmore-delay driven  Expanding in some selected directions

 Post-processing

 Wire replacement  Topology refinement

 Iterations

 The overall algorithm is repeatedly invoked  May fail when number of iterations > K (user specified)

SLIDE 12

Delay-driven Path Expansion Algorithm

 Basic steps

 Simultaneously expand from all ports  Select the path with the minimum Elmore delay to further expand  Connect the ports to the source once the path reaches the source grid  Check delay/slew constraints
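Since the expansion is driven by Elmore delay, a small reference computation may help. The sketch below evaluates the standard Elmore delay on an RC tree, assuming each wire is lumped as a pi-model (half of its capacitance at each end). The function name `elmore_delays` and its dictionary-based inputs are illustrative choices made here, not something the slides specify.

```python
from collections import defaultdict


def elmore_delays(parent, res, wire_cap, load_cap, root):
    """Elmore delay from `root` to every node of an RC tree.

    parent[v]   -> parent of node v (the root has no entry)
    res[v]      -> resistance of the wire from parent[v] to v
    wire_cap[v] -> capacitance of that wire (pi-model: half at each end)
    load_cap[v] -> load capacitance at node v, e.g. a clock port (default 0)
    """
    children = defaultdict(list)
    for v, p in parent.items():
        children[p].append(v)

    # Preorder traversal: every parent appears before its children.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children[u])

    # Bottom-up pass: total capacitance hanging below each node.
    down = {}
    for u in reversed(order):
        down[u] = load_cap.get(u, 0.0) + sum(
            wire_cap[c] + down[c] for c in children[u]
        )

    # Top-down pass: each wire's resistance charges half of its own
    # capacitance plus everything downstream of it.
    delay = {root: 0.0}
    for u in order:
        for c in children[u]:
            delay[c] = delay[u] + res[c] * (wire_cap[c] / 2.0 + down[c])
    return delay
```

The two passes keep the cost linear in the number of nodes, which matters when delays are re-evaluated after every change to the network.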

SLIDE 13

A Routing Example

 Initially, the heap is empty  First iteration (simultaneously expand from all ports)

 Heap={(P1,P2);(P1,C1);(P2,P3);(P2,P1);(P3,P2);(P3,C2)}

 Second iteration (P1,P2)

 Heap={(P3,C2);(P1,P2,P3);(P1,C1);(P2,P3);(P2,P1);(P3,P2)}

 Third iteration (P3,C2)

 Heap ={(P3,C2,S2);(P1,P2,P3);(P1,C1);(P2,P3);(P2,P1);(P3,P2)}

SLIDE 14

A Routing Example

 Fourth iteration (identify chain paths)

 Heap ={(P1,P2,P3);(P1,C1);(P2,P3);(P2,P1)}  Chain path={(P1,P2,P3);(P2,P3)}

 Fifth iteration (P2,P3)

 Heap={(P1,P2,P3);(P1,C1)}  Chain path={(P1,P2,P3);(P1,P2)}

 Sixth iteration (P1,P2)

 Heap={}, chain path={}  Final result
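The heap-driven expansion in this example can be approximated with a short sketch. It is a simplification under stated assumptions, not the authors' implementation: each port's path is evaluated in isolation, chain paths are not merged or shared, slew is ignored, and no pruning is done beyond avoiding loops, so it is only suitable for small graphs. The function name `expand_from_ports` and the `(u, v, r, c)` edge format are hypothetical.

```python
import heapq


def expand_from_ports(edges, port_cap, sources):
    """Return {port: (elmore_delay, path)} with each path running port -> source.

    edges    : iterable of (u, v, r, c) undirected wire segments
    port_cap : {port: load capacitance}
    sources  : set of nodes lying on the source grid
    """
    adj = {}
    for u, v, r, c in edges:
        adj.setdefault(u, []).append((v, r, c))
        adj.setdefault(v, []).append((u, r, c))

    # Simultaneously start one path at every port.
    heap = [(0.0, (p,), cap) for p, cap in port_cap.items()]
    heapq.heapify(heap)

    routed = {}
    while heap and len(routed) < len(port_cap):
        delay, path, cap = heapq.heappop(heap)  # minimum-delay path so far
        origin, tip = path[0], path[-1]
        if origin in routed:
            continue  # stale entry: this port is already connected
        if tip in sources:
            routed[origin] = (delay, path)  # path reached the source grid
            continue
        for nxt, r, c in adj.get(tip, []):
            if nxt in path:
                continue  # keep paths simple (no loops)
            # A new upstream segment charges half of its own capacitance
            # (pi-model) plus all capacitance already on the path.
            new_delay = delay + r * (c / 2.0 + cap)
            new_cap = cap + c + port_cap.get(nxt, 0.0)
            heapq.heappush(heap, (new_delay, path + (nxt,), new_cap))
    return routed
```

Because the delay never decreases as a path grows, the first path of a given port that is popped at a source node is that port's minimum-delay simple path in this simplified model.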

SLIDE 15

Post-processing Techniques

 Wire replacement

 Two types of wires

 capacitance/resistance tradeoff

 Procedures

 Identify port Pl with the largest Elmore delay  Replace wires in a bottom-up style  Check delay/slew constraints

 Wire replacement example

 Port with largest delay: P5  Replace edge P1C1  Replace edge P4C2  Replace edge P2P3, P3C1  Replace P5C3, C3C2, C2C1, C1S1

[Figure: wire replacement steps on a network with nodes S1, C1-C3, P1-P5]
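The capacitance/resistance tradeoff between two wire types can be illustrated with a one-segment calculation. The wire parameters below are made up purely to show the direction of the tradeoff and are not values from the slides. `WireType` and `segment_cost` are hypothetical names, and the delay uses the usual pi-model Elmore expression for a single segment driving a lumped load.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireType:
    name: str
    r_per_um: float  # resistance per unit length (hypothetical units)
    c_per_um: float  # capacitance per unit length (hypothetical units)


def segment_cost(wire, length_um, load_cap):
    """Return (elmore_delay, wire_capacitance) of one segment driving load_cap."""
    r = wire.r_per_um * length_um
    c = wire.c_per_um * length_um
    return r * (c / 2.0 + load_cap), c


# Made-up parameters: the wide wire has lower resistance but higher capacitance.
NARROW = WireType("narrow", r_per_um=2.0, c_per_um=1.0)
WIDE = WireType("wide", r_per_um=1.0, c_per_um=1.5)
```

With these illustrative numbers, a 10 um segment driving a load of 5 has lower delay but higher capacitance on the wide wire, which is why each replacement has to be followed by a delay/slew check.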

SLIDE 16

Post-processing Techniques

 Topology refinement

 Procedures

 Disconnect a port P  Expand P towards all directions  Select paths with smaller capacitance  Check delay/slew constraints

 Topology refinement example

 Elmore delay:

 P5>P4>P6>P2>P1>P3>P7

 Sequentially process all the ports

[Figure: topology refinement steps on a network with nodes S1, S2, C1-C5, P1-P7]

SLIDE 17

Non-tree Extensions

 A small number of ports have exceptionally large capacitances  For such a port p, the delay of its shortest path exceeds the delay limit D

 Procedures

 Establish a shortest path for p  Find a second shortest path  Add crosslinks

 Target delay not met? Add all useful crosslinks  Target delay not met? Do the same thing for the parent node of p

 Non-tree extensions example

 Connect p to S1  Find a second source S2  Add crosslinks  Find a third source S3

SLIDE 18

Outline

 Introduction  Problem Formulation  Routing Algorithm  Experimental Results  Conclusion

SLIDE 19

Experiment Setup

 Environment

 Implemented in C++  Run on Linux server

 Intel Pentium 4 3.2GHz  2GB RAM

 Delay setup: 5ps  Slew setup: input: 10ps; output: 15 ps

 Benchmarks

 3 test cases are provided by industry  11 test cases are from ISPD 2010 Clock Network Synthesis Contest

 Comparisons

 Compared with TG, which was proposed by Shelar in ISPD’09, TCAD’10

SLIDE 20

Tree Growing Algorithm

 Proposed in R. Shelar ISPD’09, TCAD’10

 Delay/Slew constraints  Greedy expansion from the source  Edges with the smallest capacitance will be added into the network

 Tree Growing Algorithm example

 Expand from the source  Add S1C1, S2C2  Add C2P3  Add C1P1  Add P3P2

[Figure: tree growing steps on a network with nodes S1, S2, C1, C2, P1-P3]
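The greedy growth described on this slide resembles Prim-style expansion from the sources. The sketch below is written under that assumption and is not Shelar's implementation: it only orders candidate edges by capacitance, leaves out the delay/slew checks, and does not prune branches that end at non-port nodes. The name `grow_tree` and the `(u, v, cap)` edge format are hypothetical.

```python
import heapq


def grow_tree(edges, sources, ports):
    """Greedily grow a forest from `sources` until every port is attached.

    edges : iterable of (u, v, cap) undirected candidate segments
    Returns the accepted edges as (parent, child) pairs in insertion order.
    """
    adj = {}
    for u, v, cap in edges:
        adj.setdefault(u, []).append((cap, v))
        adj.setdefault(v, []).append((cap, u))

    in_tree = set(sources)
    frontier = [(cap, s, v) for s in sources for cap, v in adj.get(s, [])]
    heapq.heapify(frontier)

    added = []
    remaining = set(ports) - in_tree
    while frontier and remaining:
        cap, u, v = heapq.heappop(frontier)  # smallest-capacitance frontier edge
        if v in in_tree:
            continue
        # A fuller version would verify delay/slew here before accepting.
        in_tree.add(v)
        remaining.discard(v)
        added.append((u, v))
        for next_cap, w in adj.get(v, []):
            if w not in in_tree:
                heapq.heappush(frontier, (next_cap, v, w))
    return added
```

With suitably chosen (hypothetical) capacitances on the slide's example graph, this reproduces the insertion order S1C1, S2C2, C2P3, C1P1, P3P2.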

SLIDE 21

Comparisons: capacitance

 Without post-processing: 18.3% improvement

[Figure: Capacitance (without post-processing techniques); capacitance (fF) vs. test cases 1-14; series: TG, Ours]

 With topology refinement: 24.6% improvement

[Figure: Capacitance (with topology refinement); capacitance (fF) vs. test cases 1-14; series: TG, Ours]

SLIDE 22

Comparisons: wire length

 Without post-processing: 16.8% improvement

[Figure: Wire Length (without post-processing techniques); wire length (um) vs. test cases 1-14; series: TG, Ours]

 With topology refinement: 23.6% improvement

[Figure: Wire Length (with topology refinement); wire length (um) vs. test cases 1-14; series: TG, Ours]

SLIDE 23

Comparisons: run time

 TG

 Average run time 0.14s

 Ours

 Without any post-processing techniques: 1.49s on average  With all techniques: 1.78s on average  Practical runtime

[Figure: Run time (without any post-processing techniques); time (s) vs. test cases 1-14; series: TG, Ours]

[Figure: Run time (with all post-processing techniques); time (s) vs. test cases 1-14; series: TG, Ours]

SLIDE 24

Non-tree extension

 We also did some experiments to see the results of our non-tree extension

 A 23.4% improvement on the minimum port delay  Minimum port delay is the lower bound one can achieve using tree structures

[Figure: Non-tree delay; delay (fs) vs. test cases 1-14; series: Non-tree delay, Lower bound delay]

SLIDE 25

Simulation Results

 Simulation tool: Hspice  Delay

 Correlation coefficient

 Tree: 99%  Non-tree: 99%

[Figure: Tree delay; delay (fs) vs. test cases 1-14; series: Calculated delay, Simulated delay]

[Figure: Non-tree delay; delay (fs) vs. test cases 1-14; series: Calculated delay, Simulated delay]

SLIDE 26

Simulation Results

 Slew

 Correlation coefficient

 Tree: 96%  Non-tree: 94%

[Figure: Tree slew; slew (ps) vs. test cases 1-14; series: Simulated slew, Calculated slew]

[Figure: Non-tree slew; slew (ps) vs. test cases 1-14; series: Simulated slew, Calculated slew]

SLIDE 27

Outline

 Introduction  Problem Formulation  Routing Algorithm  Experimental Results  Conclusion

SLIDE 28

Conclusion

 Proposed an efficient algorithm to construct a post-grid clock network on reserved multi-layer metal tracks  Extended the algorithm to allow non-tree structures to further bring down the delay  Verified our results using Hspice simulation  Expected to reduce energy consumption and improve grid-to-port delay in real post-grid clock networks
