Large-Scale Circuit Placement:
The Gap and Promise
Contributors: Chin-Chih Chang, Kenton Sze, Tim Kong, Michail Romesis, Joe Shinnerl, Min Xie, Xin Yuan
Jason Cong
Computer Science Department
University of California, Los Angeles cong@cs.ucla.edu
2
Optimality and scalability study of placement
Research on multilevel large-scale placement
Our research plan
3
Lack of significant progress in wirelength
Rate of reduction is about 5-10% every 2-3 years
Latest developments in placement differ mainly in
Where do we stand?
How much room for further improvement?
Will existing placement engines scale well to 10+M gate
Need to quantify the optimality and scalability of
4
Most work compares only with existing heuristics
Use benchmarks based on real designs
  ISPD98 [C. Alpert 1998]
Use synthetic benchmarks
  circ and gen [M. D. Hutton et al, 1998]
  gnl [D. Stroobandt et al, 2000]
Little understanding about the gap from the
5
6
Optimality and Scalability Study of Existing
7
Input
  Desired number of placeable modules t
  Net Distribution Vector (NDV) D = (d2, d3, …, dp), dk
  t and D are extracted from a real circuit
Output
  Cell library L
  Netlist N with known optimal wirelength
Constraint
  N has D as its NDV
8
All the modules are of equal size, and there is
For 2-pin nets , connect any two adjacent
For each n-pin net , connect the n modules in a
The wirelength of each n-pin net is given by $\lceil \sqrt{n} \rceil + \lceil n / \lceil \sqrt{n} \rceil \rceil - 2$
9
Input: t = 64, D = {d2=34, d3=20, d4=7, d5=4, d6=2, d7=1}
#2-pin nets = 34, WL = 34
#3-pin nets = 20, WL = 40
#4-pin nets = 7, WL = 14
#5-pin nets = 4, WL = 12
#6-pin nets = 2, WL = 6
#7-pin nets = 1, WL = 4
Total WL = 110
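The per-degree wirelengths in this example can be checked with a short script. This is a minimal sketch, assuming the optimal wirelength of an n-pin net is ⌈√n⌉ + ⌈n/⌈√n⌉⌉ − 2 (unit-size modules packed into the most compact rectangle); the function and variable names are illustrative and not taken from the slides.

```python
import math


def optimal_net_wirelength(n: int) -> int:
    """Assumed optimal wirelength of an n-pin net on unit-size modules."""
    side = math.isqrt(n - 1) + 1      # integer ceil(sqrt(n)) for n >= 1
    other = -(-n // side)             # integer ceil(n / side)
    return side + other - 2


# Net distribution vector from the example: degree -> number of nets.
ndv = {2: 34, 3: 20, 4: 7, 5: 4, 6: 2, 7: 1}

# Wirelength contributed by all nets of each degree, and the grand total.
per_degree = {k: count * optimal_net_wirelength(k) for k, count in ndv.items()}
total_wl = sum(per_degree.values())

for degree, wl in per_degree.items():
    print(f"#{degree}-pin nets = {ndv[degree]}, WL = {wl}")
print("Total WL =", total_wl)
```

Under that assumption the script reproduces the per-degree values and the total listed in the example.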
10
Option 1: expanding one dimension
Option 2: removing some of the nets
Need for white space
  Mimic real designs
  Ease of legalization
11
Module number t and NDV extracted from
Two suites without pads (suite1 and suite2)
suite2 is derived by scaling t and NDV by a factor of
Two suites with pads (suite3 and suite4)
suite4 is derived by scaling t and NDV by a factor of
15% white space by expanding one dimension of
12
ckt        #cell     #net      #row   Optimal WL
Peko01     12506     13865     113    8.14E+05
Peko02     19342     19325     140    1.26E+06
Peko03     22853     27118     152    1.50E+06
Peko04     27220     31683     166    1.75E+06
Peko05     28146     27777     169    1.91E+06
Peko06     32332     34660     181    2.06E+06
Peko07     45639     47830     215    2.88E+06
Peko08     51023     50227     227    3.14E+06
Peko09     53110     60617     231    3.64E+06
Peko10     68685     74452     263    4.73E+06
Peko11     70152     81048     266    4.71E+06
Peko12     70439     76603     266    5.00E+06
Peko13     83709     99176     290    5.87E+06
Peko14     147088    152255    385    9.01E+06
Peko15     161187    186225    402    1.15E+07
Peko16     182980    189544    429    1.25E+07
Peko17     184752    188838    431    1.34E+07
Peko18     210341    201648    460    1.32E+07

ckt        #cell     #net      #row   Optimal WL
Peko01x10  125060    138650    335    8.14E+06
Peko02x10  193420    193250    441    1.26E+07
Peko03x10  228530    271180    479    1.50E+07
Peko04x10  272200    316830    523    1.75E+07
Peko05x10  281460    277770    532    1.91E+07
Peko06x10  323320    346600    570    2.06E+07
Peko07x10  456390    478300    677    2.88E+07
Peko08x10  510230    502270    715    3.14E+07
Peko09x10  531100    606170    730    3.64E+07
Peko10x10  686850    744520    830    4.73E+07
Peko11x10  701520    810480    839    4.71E+07
Peko12x10  704390    766030    840    5.00E+07
Peko13x10  837090    991760    916    5.87E+07
Peko14x10  1470880   1522550   1214   9.01E+07
Peko15x10  1611870   1862250   1271   1.15E+08
Peko16x10  1829800   1895440   1354   1.25E+08
Peko17x10  1847520   1888380   1360   1.34E+08
Peko18x10  2103410   2016480   1451   1.32E+08
13
Capo [A. E. Caldwell et al, 2000]
  Based on multilevel partitioner
  Aims to enhance the routability
Dragon [M. Wang et al, 2000]
  Uses hMetis for initial partition
  SA with bin-based swapping
mPL [T. Chan et al, 2000]
  Nonlinear programming on the coarsest level
  Goto-based relaxation
QPlace [Cadence Inc.]
  Quadratic programming
  Component of Silicon Ensemble
14
[Figure: solution quality (multiple of optimal) versus #cells for Dragon v.2.20, Capo v.8.0, mPL v.1.2 and QPlace v.5.1.55]
[Figure: runtime (s) versus #cells for the same four placers]
Existing algorithms are 66-153% away from the optimal on PEKO
On examples with pads
  mPL and QPlace show improvement of 12% and 10% respectively
  Dragon and Capo do not benefit much from the additional information
There is significant room for improvement in placement algorithms!
15
[Figure: runtime (s) versus #cells (log scale, 10000 to 10000000) for Dragon v.2.20, Capo v.8.0, mPL v.1.2 and QPlace v.5.1.55]
[Figure: solution quality (multiple of optimal) versus #cells (log scale) for the same four placers]
Capo, QPlace and mPL scale well in runtime
Average solution quality of each tool deteriorates by an additional 4% to 25% when the problem size increases by a factor of 10
QoR of the existing placement algorithms can be 80% - 180% away from the optimal for large designs
16
17
Unlikely for real designs
Timing and routability are important objectives
18
circuit  height  width   WL of longest net  WL contribution
ibm01    8158    4530    7148               51%
ibm02    8158    6430    14224              46%
ibm03    8158    6740    10624              58%
ibm04    8158    9140    15171              53%
ibm05    8158    11055   19064              47%
ibm06    8158    8715    13966              61%
ibm07    8158    14605   14051              51%
ibm08    8158    15895   16142              60%
ibm09    8158    16395   13780              55%
ibm10    8158    27890   30755              53%
ibm11    16350   10925   19234              59%
ibm12    16350   15545   26748              52%
ibm13    16350   12230   19539              59%
ibm14    16350   25475   26370              61%
ibm15    16350   23785   27284              63%
ibm16    16350   34015   42860              59%
ibm17    16283   38895   45686              56%
ibm18    16350   37065   52846              64%
Produced by Dragon
The wirelength
Need to consider the
19
All the modules are of equal size, and there is
For nets of degree i
20
Input: t = 64, D = {d2=34, d3=20, d4=7, d5=4, d6=2, d7=1}, α = 0.2
Generate 28 2-pin nets optimally and 6 randomly
Generate 16 3-pin nets optimally and 4 randomly
Generate 6 4-pin nets optimally and 1 randomly
Generate 4 5-pin nets optimally
Generate 2 6-pin nets optimally
Generate 1 7-pin net optimally
Total WL = 160
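The counts in this example are consistent with taking ⌊α · dk⌋ nets of each degree as randomly generated (non-local) nets and generating the rest optimally. The sketch below works under that assumption, which is inferred from the numbers on the slide and not stated there; `split_nets` is a hypothetical helper name.

```python
from fractions import Fraction
from math import floor


def split_nets(ndv, alpha):
    """Return {degree: (optimal_count, random_count)}.

    Assumption: floor(alpha * d_k) nets of degree k are generated randomly.
    """
    split = {}
    for degree, count in ndv.items():
        n_random = floor(alpha * count)
        split[degree] = (count - n_random, n_random)
    return split


ndv = {2: 34, 3: 20, 4: 7, 5: 4, 6: 2, 7: 1}
# An exact fraction avoids floating-point surprises in the floor.
split = split_nets(ndv, Fraction(1, 5))

for degree, (n_opt, n_rand) in split.items():
    print(f"{degree}-pin: {n_opt} optimal, {n_rand} random")
```

The random nets have no guaranteed length, so only the optimally generated part of the total wirelength can be computed in advance.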
21
Sum the length of each
Input : t = 64
22
23
% non-local nets  circuit  #cell    #net     #row  Row utilization  LB        UB
0%                Peku01   12506    14111    113   85%              8.14E+05  8.14E+05
                  Peku05   28146    28446    169   85%              1.91E+06  1.91E+06
                  Peku10   68685    75196    263   85%              4.73E+06  4.73E+06
                  Peku15   161187   186608   402   85%              1.15E+07  1.15E+07
                  Peku18   210341   201920   460   85%              1.32E+07  1.32E+07
0.25%             Peku01   12506    14111    113   85%              8.14E+05  9.23E+05
                  Peku05   28146    28446    169   85%              1.91E+06  2.24E+06
                  Peku10   68685    75196    263   85%              4.73E+06  6.17E+06
                  Peku15   161187   186608   402   85%              1.15E+07  1.71E+07
                  Peku18   210341   201920   460   85%              1.32E+07  2.01E+07
0.50%             Peku01   12506    14111    113   85%              8.14E+05  1.02E+06
                  Peku05   28146    28446    169   85%              1.91E+06  2.63E+06
                  Peku10   68685    75196    263   85%              4.73E+06  7.52E+06
                  Peku15   161187   186608   402   85%              1.15E+07  2.30E+07
                  Peku18   210341   201920   460   85%              1.32E+07  2.75E+07
… (up to 10%)
24
circuit   #cell    #net  #row  UB
GPeku01   12506    224   113   7.93E+05
GPeku05   28146    336   169   1.79E+06
GPeku10   68685    525   263   4.38E+06
GPeku15   161187   803   402   1.03E+07
GPeku18   210341   918   460   1.34E+07
25
Capo [A. Caldwell et al, 2000]
  Based on multilevel partitioner
  Aims to enhance the routability
Dragon [M. Wang et al, 2000]
  Uses hMetis for initial partition
  SA with bin-based swapping
mPL [T. Chan et al, 2000]
  Nonlinear programming on the coarsest level
  Goto-based relaxation
mPG [C. Chang et al, 2002]
  Uses FC clustering and hierarchical density control
  Incremental A-tree for routability
26
[Figure: quality ratio versus % of non-local nets (0.00%, 0.25%, 0.50%, 0.75%, 1.00%, 2.00%, 5.00%, 10.00%) for Capo v.8.5, Dragon v.2.20, mPG v.1.0 and mPL v.2.0]
mPL’s QR increases when α is increased from 0 to 0.75%,
Absolute value of the QRs may not be meaningful, but it
27
The gap between their solutions and the upper bound
Another validation that there is significant room for
circuit   Dragon v.2.20 QR  Capo v.8.5 QR  mPG v.1.0 QR  mPL v.2.0 QR
GPeku01   1.98              1.56           1.91          1.69
GPeku05   2.01              1.69           1.97          1.83
GPeku10   2.02              1.72           1.98          1.94
GPeku15   1.99              1.79           1.97          1.97
GPeku18   2.02              1.78           1.98          1.98
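To summarize the G-PEKU quality ratios, the sketch below averages the QR values per tool and converts each average into a percentage gap above the upper bound. It assumes QR means placed wirelength divided by the upper-bound wirelength; the numbers are copied from the table, and the variable names are illustrative.

```python
# Quality ratios per tool for GPeku01, 05, 10, 15, 18 (from the table).
qr = {
    "Dragon v.2.20": [1.98, 2.01, 2.02, 1.99, 2.02],
    "Capo v.8.5": [1.56, 1.69, 1.72, 1.79, 1.78],
    "mPG v.1.0": [1.91, 1.97, 1.98, 1.97, 1.98],
    "mPL v.2.0": [1.69, 1.83, 1.94, 1.97, 1.98],
}

# Mean quality ratio per tool over the five circuits.
mean_qr = {tool: sum(values) / len(values) for tool, values in qr.items()}

# Gap above the upper bound, in percent: (QR - 1) * 100.
gap_percent = {tool: (m - 1.0) * 100.0 for tool, m in mean_qr.items()}

for tool in qr:
    print(f"{tool}: mean QR {mean_qr[tool]:.3f}, {gap_percent[tool]:.1f}% above UB")
```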
28
Best available placement algorithms can be
A significant research and business opportunity
Use of copper interconnect is equivalent to 30%
One process generation (e.g. from 0.13um to
Both require multi-billion dollar investment!
Better placement may extend/accelerate
29
Optimality and scalability study of placement
Our research on large-scale placement problem
Our research plan
30
Multilevel Optimization --- a highly scalable framework
  Hierarchy construction by recursive aggregation
  Intralevel optimization by various techniques
  Interpolation to transfer from level to level
Two Parallel Efforts
  mPL: multilevel multiheuristic/hybrid optimization
  mPG: multilevel simulated annealing with congestion control and mixed block support
31
[Figure: multilevel flow starting from the initial fine-grain problem, with relaxation (refinement) at each intermediate level and interpolation between levels]
32
Originally developed to solve boundary-value PDE
Discretized elliptic PDE is a structured, positive-definite system of linear equations
Rapidly extended to problems without physical grids during
Algebraic Multigrid (AMG)
Most research has been for continuous models, but recent
Graph/Hypergraph Partitioning, traveling salesman, VLSI Physical Design
33
Recursive Clustering
  Version 1.0: Edge-Separability (CapForest)
  Versions 1.1 – 2.0: FirstChoice
Nonlinear Programming at coarsest level(s)
Slot assignment and discrete refinement at all levels
Recent Enhancements to Version 2.0
  AMG-based weighted disaggregation
  Quadratic relaxation on subsets (QRS) at all levels
  Distance-based reaggregation for iterated multilevel flow
34
Coarsening
  ESC, AMG-based, First Choice Clustering (FC)
Relaxation (Intralevel Optimization)
  Interior-point nonlinear programming
  k-cycle Goto-style discrete exchange
  Quadratic relaxation on subsets (QRS)
Interpolation
  Declustering + partitioning + slot assignment
  AMG-style weighted averaging
Iterated Multilevel Flow
  Repeated/Recursive V-cycles with distance-based reaggregation
35
Edge-Separability Clustering (ESC) [Cong and Lim, 2000]
  Use CAPFOREST [Ibaraki and Nagamochi, 1992] to estimate all-pairs min-cut q(x,y) in N log N time.
  Rank pairs by connectivity and area.
First-Choice Clustering (FC) [Karypis, 1999]
  Match each vertex with a neighboring vertex with which it shares the most total hyperedge weight, subject to area-balance constraints.
  Clusters are connected components.
Weighted Aggregation (AMG)
  Split the nodes into C-points and F-points.
  Associate each F-point with several C-points by weighted average.
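The first-choice rule described above can be sketched in a few lines. This is an illustrative toy, not the mPL implementation: affinity is taken as the plain sum of the weights of shared hyperedges, the area-balance constraint is modeled as a hypothetical cap `max_area` on cluster area, and clusters are tracked as connected components with union-find.

```python
from collections import defaultdict


def first_choice_clusters(areas, hyperedges, max_area):
    """areas: {vertex: area}; hyperedges: list of (weight, [vertices])."""
    # Total hyperedge weight shared by each pair of vertices.
    shared = defaultdict(lambda: defaultdict(float))
    for weight, pins in hyperedges:
        for u in pins:
            for v in pins:
                if u != v:
                    shared[u][v] += weight

    parent = {v: v for v in areas}
    cluster_area = dict(areas)

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u in areas:
        # Try neighbors from most to least shared weight (ties by name).
        for v in sorted(shared[u], key=lambda n: (-shared[u][n], n)):
            ru, rv = find(u), find(v)
            if ru == rv:
                break  # already in the same cluster as its first choice
            if cluster_area[ru] + cluster_area[rv] <= max_area:
                parent[rv] = ru
                cluster_area[ru] += cluster_area[rv]
                break

    clusters = defaultdict(list)
    for v in areas:
        clusters[find(v)].append(v)
    return sorted(sorted(c) for c in clusters.values())


areas = {"a": 1, "b": 1, "c": 1, "d": 1}
nets = [(3, ["a", "b"]), (1, ["b", "c"]), (2, ["c", "d"]), (1, ["a", "c", "d"])]
clusters = first_choice_clusters(areas, nets, max_area=2)
print(clusters)
```

In this toy netlist, a and b share the heaviest connection, as do c and d, so the area cap of 2 yields two clusters of two cells each.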
36
Coarsening
  ESC, AMG, FC, FC+opt, PD-FC
Relaxation (Intralevel Optimization)
  Interior-point nonlinear programming
  k-cycle Goto-style discrete exchange
  Quadratic relaxation on subsets (QRS)
Interpolation
  Declustering + partitioning + slot assignment
  AMG-style weighted averaging
Iterated Multilevel Flow
  Repeated/Recursive V-cycles with reaggregation
37
Nonlinear-Programming Formulation
  Direct formulation for the coarse placement problem
  Cells are modeled as circular disks for smoothness
  Quadratic wirelength objective on a clique-model
  Pairwise nonoverlap constraints
Can accelerate evaluation with adaptation of Fast Multipole Method
Nonuniform sizes are OK, but reshaping is difficult to incorporate efficiently
Reasonable performance for coarse-level sizes N <=
38
Analogous to force-directed methods, but with direct
Effective but relatively expensive; affordable only at the
Net impact: 15% wirelength improvement overall
Global cell movement is not scalable to finer levels
Plan: Restrict to cell subsets to produce improved, scalable
To date, we have implemented Linear-Programming-based and Quadratic-Programming-based subset relaxations
39
Each cell’s optimal location is readily calculated when
Compute a chain A, B, C, D, E, where
Examine all permutations of the chain and take the
Problem: the chain is not closed (A is not necessarily
40
Select a subset M of cells to move. M is obtained as segments of length 3 along a DFS
Identify other cells and pads, F, connected to M by
Decouple the horizontal and vertical problems.
41
Problem formulation (horizontal case):
Iteratively solve the weighted quadratic minimization
[Equation: weighted quadratic objective over nets e ∈ E and cells v, k ∈ e, with the positions of the cells in M as variables]
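A subset relaxation of this kind can be illustrated on a one-dimensional toy problem. The sketch below assumes the objective is the weighted sum of squared distances over clique-model edges, Σ w·(x_u − x_v)², with the cells in M movable and all other cells and pads fixed. It uses Gauss-Seidel sweeps (each movable cell moves to the weighted average of its neighbors) as a stand-in solver; the slides do not say which solver mPL uses, and `relax_subset` is a hypothetical name.

```python
def relax_subset(positions, movable, edges, sweeps=100):
    """Minimize sum of w * (x_u - x_v)**2 over the movable cells only.

    positions: {cell: x}; movable: set of cells (the subset M);
    edges: list of (u, v, weight). Cells outside `movable` stay fixed.
    """
    x = dict(positions)
    neighbors = {c: [] for c in movable}
    for u, v, w in edges:
        if u in neighbors:
            neighbors[u].append((v, w))
        if v in neighbors:
            neighbors[v].append((u, w))

    for _ in range(sweeps):
        for c in sorted(movable):
            total = sum(w for _, w in neighbors[c])
            if total > 0:
                # Optimal position of c with all of its neighbors held fixed.
                x[c] = sum(w * x[n] for n, w in neighbors[c]) / total
    return x


# Two movable cells between two fixed pads at x = 0 and x = 3.
start = {"pad0": 0.0, "a": 5.0, "b": -1.0, "pad1": 3.0}
chain = [("pad0", "a", 1.0), ("a", "b", 1.0), ("b", "pad1", 1.0)]
result = relax_subset(start, {"a", "b"}, chain)
print(result)
```

For the chain above the minimizer spaces the two movable cells evenly between the pads, at x = 1 and x = 2. The vertical problem would be handled the same way on the y coordinates, since the two directions are decoupled.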
42
43
mPL 1.2: 2 V-cycles; AMG; No QRS
mPL 2.0 (with QRS): 2 V-cycles; AMG; QRS

Circuit  mPL 1.2 WL  mPL 1.2 Time  mPL 2.0 WL  %impro  mPL 2.0 Time  Rel. Time  Bin size
ibm04    7.20E+06    506           6.83E+06    5.14%   866           1.71       2x2
ibm07    1.11E+07    749           1.02E+07    8.11%   1302          1.74       2x2
ibm09    1.22E+07    860           1.13E+07    7.38%   1569          1.82       3x3
ibm10    1.97E+07    1285          1.91E+07    3.05%   2419          1.88       3x3
ibm14    4.22E+07    2524          4.03E+07    4.50%   5846          2.32       3x3
ibm16    5.72E+07    4018          5.25E+07    8.22%   16760         4.17       4x4
ibm17    7.19E+07    5051          6.78E+07    5.70%   12240         2.42       4x4
ibm18    5.63E+07    4743          5.45E+07    3.20%   13507         2.85       4x4
44
45
46
Use the clique-model (graph) to define connectivity
For each FC-cluster, select one node of maximal
Each C-point is placed at its cluster’s position. Each F-point is placed at the weighted average of the
The F-points’ positions can be iteratively improved.
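The interpolation step can be sketched as follows: C-points keep the positions computed on the coarser level, and each F-point is placed at a weighted average of the C-points it is associated with. The weights here are assumed to be connectivity weights supplied by the caller, and `interpolate` is a hypothetical name, not a function from mPL.

```python
def interpolate(c_positions, f_weights):
    """Place F-points by weighted averaging of C-point positions.

    c_positions: {c_point: (x, y)} from the coarser level.
    f_weights: {f_point: {c_point: weight}} with positive weights (assumed).
    """
    placed = dict(c_positions)
    for f, weights in f_weights.items():
        total = sum(weights.values())
        x = sum(w * c_positions[c][0] for c, w in weights.items()) / total
        y = sum(w * c_positions[c][1] for c, w in weights.items()) / total
        placed[f] = (x, y)
    return placed


c_positions = {"c1": (0.0, 0.0), "c2": (4.0, 2.0)}
f_weights = {"f": {"c1": 1.0, "c2": 3.0}}
placed = interpolate(c_positions, f_weights)
print(placed["f"])
```

Because "f" is three times more strongly tied to "c2" than to "c1", it lands three quarters of the way from "c1" to "c2". The iterative improvement of F-point positions mentioned above is not part of this sketch.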
47
mPL1.1 and mPL1.1 AMG, 1 V-cycle each

Circuit  mPL1.1 WL  mPL1.1 Time  mPL1.1 AMG WL  mPL1.1 AMG Time  Wirelength %improved by AMG  Runtime %incr
ibm04    7.71E+06   261          7.33E+06       427              4.93%                        63.60%
ibm07    1.18E+07   396          1.12E+07       437              5.08%                        10.35%
ibm09    1.29E+07   455          1.28E+07       563              0.78%                        23.74%
ibm10    2.11E+07   661          2.08E+07       744              1.42%                        12.56%
ibm14    4.54E+07   1982         4.28E+07       2226             5.73%                        12.31%
ibm16    5.88E+07   3187         5.73E+07       3615             2.55%                        13.43%
ibm17    8.17E+07   4173         8.08E+07       4416             1.10%                        5.82%
ibm18    5.81E+07   4051         5.78E+07       4496             0.52%                        10.98%
48
Coarsening
  ESC, AMG, FC, FC+opt, PD-FC
Relaxation (Intralevel Optimization)
  Interior-point nonlinear programming
  k-cycle Goto-style discrete exchange
  Quadratic relaxation on subsets (QRS)
Interpolation
  Declustering + partitioning + slot assignment
  AMG-style weighted averaging
Iterated Multilevel Flow
  Repeated/Recursive V-cycles with distance-based reaggregation
49
50
Initially, affinity is connectivity + area balancing
Subsequently, distance
51
using proximity and connectivity in the 2nd+ coarsening pass(es)
52
Coarsening by area-balanced first-choice clustering
Relaxation (Intralevel Optimization)
  Interior-point nonlinear programming at coarsest level
  k-cycle Goto-style discrete exchange at every level
  Quadratic relaxation on subsets (QRS) at every level
Interpolation by AMG-style weighted averaging
Iterated V-cycles with distance-based reaggregation
53
Uniform-Cell-Size IBM/ISPD 98 Circuits

          Wirelength/mPL2                    CPU time/mPL2
Circuit   mPL1.0  Capo8.5  Dragon  Gor-L     mPL1.0  Capo8.5  Dragon  Gor-L
ibm04     1.12    1.07     0.93    1.00      0.30    0.51     2.91    1.82
ibm07     1.16    1.14     0.97    1.07      0.30    0.54     2.98    3.37
ibm09     1.14    1.12     1.01    1.04      0.29    0.59     4.75    4.31
ibm10     1.11    1.09     0.98    0.98      0.27    0.49     4.20    5.84
ibm14     1.13    1.05     0.92    1.01      0.34    0.47     2.48    6.78
ibm16     1.12    1.12     0.95    1.05      0.19    0.22     2.84    4.88
ibm17     1.21    1.13     1.00    1.00      0.34    0.33     5.27    8.04
ibm18     1.07    1.06     0.92    0.99      0.30    0.31     4.34    9.56
Averages  1.13    1.10     0.96    1.02      0.29    0.43     3.72    5.58
54
[Figure: scaled runtime versus scaled wirelength for mPL2.0, mPL1.1, Capo8.5, Dragon and Gordian-L]
55
56
Comparison with the optimal
[Figure: multiple of optimal versus #cells for mPL v.1.2 and mPL v.2.0]
Total runtime
[Figure: runtime (s) versus #cells for mPL v.1.2 and mPL v.2.0]
mPL v.2.0 improves the Quality Ratio by 8% on the average
mPL v.2.0 increases the runtime by 131% on the average
57
Multi-level coarse placement for physical
A multi-level SA-based framework
Support mixed-size large scale global placement
Support routing congestion control
Integrate retiming with placement
58
Possibly equal to several process generation
59
Aim at 20-30% improvement (= 1 technology
Support
  Large-scale mixed-size placement problem
  Constraint-driven placement for delay, routing etc.
  Incremental placement to support dynamic netlist
60
Optimization for Large-scale Circuit Placement,” ICCAD 2000.
Tim Kong. Novel Techniques for Large-Scale Circuit Placement. Ph.D. Thesis, CS Dept., UCLA, 2002.
Chin-Chih Chang, Jason Cong, David Pan, Xin Yuan. “Physical Hierarchy Generation with Routing Congestion Control,” ISPD 2002.
Chin-Chih Chang, Jason Cong, Min Xie. “Optimality and Scalability Study of Existing Placement Algorithms,” ASP-DAC 2003.
Chin-Chih Chang, Jason Cong, Xin Yuan. “Multi-level Placement for Large-Scale Mixed-Size IC Designs,” ASP-DAC 2003.
Cong and Shinnerl (editors). Multilevel Optimization and
61