Physical Hierarchy Generation
Jason Cong UCLA Computer Science Department Email: cong@cs.ucla.edu Tel: 310-206-2775 http://cadlab.cs.ucla.edu/~cong
Jason Cong 08/15/2001 2
Outline
Global interconnects in nanometer technologies
Interconnect-centric design
[Figure: delay (ns, log scale from 0.01 to 10) versus technology generation (0.25, 0.18, 0.15, 0.13, 0.1, 0.07 um). Curves: 1 mm wire, 2 cm wire un-optimized, 2 cm wire optimized, intrinsic gate delay.]
Optimization is obtained by buffer insertion/sizing and wire sizing
[Figure axis: clock cycle(s)]
Estimated by IPEM on NTRS’97 technology
Driver size: 100x min gate
Receiver size: 100x min gate
Buffer size: 100x min gate
[Figure: a 24.9 mm x 24.9 mm die with distance markers at 7.52, 15.04, 22.56 and 24.9 mm, and regions labeled 1 clock through 7 clock]
NTRS’97 0.07 um technology
5 GHz across-chip clock
620 mm2 die (24.9 mm x 24.9 mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x
From corner to corner: 7 clock cycles
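The 7-cycle corner-to-corner figure can be checked with simple arithmetic. The sketch below assumes, which the slide does not state explicitly, that the 7.52 mm marker is the distance a signal covers in one clock cycle and that a corner-to-corner route is a Manhattan path of length 2 x 24.9 mm. The function and variable names are illustrative only.

```python
import math

def cycles_to_cross(distance_mm: float, reach_per_cycle_mm: float) -> int:
    """Number of whole clock cycles needed to cover distance_mm."""
    return math.ceil(distance_mm / reach_per_cycle_mm)

DIE_EDGE_MM = 24.9          # die edge length from the slide
REACH_PER_CYCLE_MM = 7.52   # assumed: first distance marker = reach in one cycle

# Assumed Manhattan (rectilinear) corner-to-corner route: two die edges.
corner_to_corner_mm = 2 * DIE_EDGE_MM

print(cycles_to_cross(DIE_EDGE_MM, REACH_PER_CYCLE_MM))          # one die edge
print(cycles_to_cross(corner_to_corner_mm, REACH_PER_CYCLE_MM))  # corner to corner
```

Under these assumptions the second call agrees with the 7 clock cycles quoted above.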
[Figure: proposed transition from device/function-centric design to interconnect/communication-centric design. Analogy: programs and data/objects.]
[Figure: interconnect-centric design flow from Design Specification, through Architecture/Conceptual-level Design, to Final Layout]
Abstraction: Structure view, Functional view, Physical view, Timing view
HDM
Interconnect Planning
– Synthesis and Placement under Physical Hierarchy
Interconnect Synthesis
– Performance-driven Global Routing
– Pseudo Pin Assignment under Noise Control
Interconnect Layout
– Route Planning
– Point-to-Point Gridless Routing
Interconnect Optimization (TRIO) with Buffer Insertion and Wire Sizing
Interconnect Performance Estimation Models (IPEM)
module cpu(pj_su, pj_boot8, …);
  input …;
  fpu fpu(.fpain(iu_rs2_e), .fpbin(iu_rs1_e), .fpop(fpop), .fpbusyn(fp_rdy_e), .fpkill(iu_kill_fpu), .fpout(fpu_data_e), .clk(clk), …);
  pcsu pcsu(.pj_clk_out(pj_clk_out), …);
  smu smu(.iu_optop_in(iu_optop_din), …);
  dtag_shell dtag_shell(.tag_in(dcu_tag_in), …);
  dcram_shell dcram_shell(.data_in({dcu_din_e[31], …);
  dcu dcu(.biu_data(pj_datain), …);
  itag_shell itag_shell(.icu_tag_in(icu_tag_in), …);
  icram_shell icram_shell(.icu_din(icu_din), …);
  icu icu(.biu_data(pj_datain), …);
  iu iu(.iu_data_vld(iu_data_vld), …);
endmodule
[Figure: logical hierarchy of the Verilog cpu module. Blocks: Integer Unit (IU), FPU, SMU, PCSU, ICU, DCU, ICRAM, DCRAM, itag, dtag. Legend: MEMORY, SRAM, latches.]
By courtesy of IBM (Tony Drumm)
[Figure: logical hierarchy mapped to a physical hierarchy. Legend: hard IP, soft module; same color for modules of the same logic hierarchy.]
Assign modules to the physical hierarchy with interconnect estimation and optimization
Example: global interconnects defined by two different physical hierarchies
[Figure labels: critical path, latch]
[Figure: alternative architecture block selection (A=3, D=4 versus A=4, D=3), re-synthesis, and retiming. Labels: latch, critical path.]
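Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to hide global interconnect delays
Example ($\phi$ is the clock period; d(v) = 1, WL = 6, d(e) $\propto$ WL for both placements):
Placement 1 (a b c d): before retiming, $\phi$ = 5.0; after retiming, $\phi$ = 3.0
Placement 2 (a c b d): before retiming, $\phi$ = 4.0; after retiming, $\phi$ = 4.0
Placement 1 is the better initial placement for retiming (3.0 versus 4.0 after retiming), although Placement 2 is faster before retiming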
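l(v) = max delay from PIs to v after optimal retiming under a target clock period $\phi$
$l(v) = \max_{(u,v) \in E} \{\, l(u) - \phi \cdot w(u,v) + d(u,v) + d(v) \,\}$
where w(u,v) is the number of registers on edge (u,v), d(u,v) is the edge delay, and d(v) is the node delay
Relation to retiming: $r(v) = \lceil l(v) / \phi \rceil - 1$
Theorem: P can be retimed to $\phi + \max_e d(e)$ iff $l(\mathrm{POs}) \le \phi$
Example: node v has fanins u and w, with $l(u) = 7$, $w(u,v) = 1$, $l(w) = 3$, $w(w,v) = 0$, $d(v) = 1$, $d(e) = 2$, $\phi = 5$:
$l(v) = \max\{7 - 5 \cdot 1 + 2 + 1,\; 3 + 2 + 1\} = 6$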
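The l-value recurrence and the retiming relation above can be evaluated directly. The following is a minimal sketch that reproduces the slide's example; the function names and the tuple layout for fanin edges are assumptions made for illustration, not part of the original material.

```python
import math

def l_value(fanins, d_v, phi):
    """Evaluate l(v) = max over fanin edges of l(u) - phi*w(u,v) + d(u,v) + d(v).

    fanins: iterable of (l_u, w_uv, d_uv) triples, one per fanin edge (u, v).
    """
    return max(l_u - phi * w_uv + d_uv + d_v for l_u, w_uv, d_uv in fanins)

def retiming_value(l_v, phi):
    """r(v) = ceil(l(v) / phi) - 1."""
    return math.ceil(l_v / phi) - 1

# Slide example: l(u) = 7 with one register on (u, v); l(w) = 3 with none;
# d(v) = 1, d(e) = 2, phi = 5.
l_v = l_value([(7, 1, 2), (3, 0, 2)], d_v=1, phi=5)
print(l_v)                     # 6
print(retiming_value(l_v, 5))  # 1
```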
Topological order does not exist (the circuit graph contains cycles)!
Start with a min l-value for each node and iteratively improve it
Convergence is guaranteed in O(n) iterations if the circuit can be retimed to the target cycle time
Initialization: SAT(PI) = 0, SAT(others) = $-\infty$
Relax one vertex at a time and update l-values
Complexity is O(VE)
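The iterative scheme above resembles a Bellman-Ford style longest-path relaxation. The sketch below is one possible rendering under stated assumptions: it relaxes all edges in passes (rather than one vertex at a time), treats primary inputs as having no fanins, and declares the target period infeasible if labels are still changing after |V| passes. The circuit, the data layout, and the function names are hypothetical.

```python
import math

def compute_l_values(nodes, edges, d, pis, phi):
    """Iteratively compute l-values for clock period phi.

    edges: list of (u, v, w_uv, d_uv); d: node delays; pis: set of primary inputs.
    Returns a dict of l-values, or None if the labels do not converge.
    """
    l = {v: (0.0 if v in pis else -math.inf) for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, w_uv, d_uv in edges:
            if l[u] == -math.inf:
                continue  # u has not been reached yet
            candidate = l[u] - phi * w_uv + d_uv + d[v]
            if candidate > l[v]:
                l[v] = candidate
                changed = True
        if not changed:
            return l
    return None  # still changing after |V| passes: a positive-gain cycle exists

def is_feasible(nodes, edges, d, pis, pos, phi):
    """Check the condition l(POs) <= phi after convergence."""
    l = compute_l_values(nodes, edges, d, pis, phi)
    return l is not None and all(l[p] <= phi for p in pos)

# Hypothetical circuit: a -> b -> c, with feedback c -> b through one register.
nodes = ["a", "b", "c"]
edges = [("a", "b", 0, 2.0), ("b", "c", 0, 2.0), ("c", "b", 1, 2.0)]
d = {"a": 1.0, "b": 1.0, "c": 1.0}

print(is_feasible(nodes, edges, d, {"a"}, {"c"}, phi=6.0))
print(is_feasible(nodes, edges, d, {"a"}, {"c"}, phi=5.0))
```

In this toy circuit the loop b -> c -> b accumulates a delay of 6 per register, so a period of 6.0 passes the check and 5.0 does not.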
Example: d(v) = 1, d(e) = 2. Is $\phi$ = 4.5 possible?
[Table: l-values of nodes a, b, c, d, e, f, g over iterations 1 to 5]
[Figure: example circuit with nodes a, b, c, d, e, f, g]
Cycle time 4.5 is possible as l(g) $\le$ 4.5
[Figure: multilevel V-cycle, problem sizes versus levels: coarsening, then uncoarsening and refinement (optimization)]
Faster optimization on top levels
Simple, with good solution quality
– Best partitioner for cut-size minimization
– 30-40% delay reduction compared to hMetis
– 10x speed-up over GordianL
[Figure: multilevel flow in which timing analysis and cell moves are performed at each cluster level before moving to the next cluster level]
Insufficient information at higher levels
A mistake at a higher level is impossible or costly to correct
Converge to better solution as more details are
hMetis [Karypis et al, DAC’97]
Hyper-edge coarsening
ESC [Cong and Lim, ICCAD’00]
Global edge separability based clustering
TLC [Cong and Romesis, DAC’01]
Edge separability [Cong & Lim, ASPDAC00]
Edge separability of e = (x, y): min # of edges to separate x and y (x-y mincut)
– Can compute a tight lower bound q(e) of the separability of e for all edges in O(n log n) time [Nagamochi & Ibaraki, Algorithmica92]
– Use q(e) for bottom-up multi-level clustering
– Produces very good cutsize, comparable to hMetis [KA+97]
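To make the definition concrete, the sketch below computes the exact x-y mincut of a small undirected graph by unit-capacity augmenting paths. This is a brute-force illustration only: it is not the Nagamochi and Ibaraki lower-bound computation of q(e) cited above, it handles plain two-pin edges rather than hyperedges, and the graph and function name are made up for the example.

```python
from collections import defaultdict, deque

def edge_separability(edges, x, y):
    """Minimum number of edges whose removal disconnects x from y."""
    cap = defaultdict(int)   # residual capacities on directed arcs
    adj = defaultdict(set)
    for a, b in edges:
        cap[(a, b)] += 1     # an undirected edge is modeled as two unit arcs
        cap[(b, a)] += 1
        adj[a].add(b)
        adj[b].add(a)
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {x: None}
        queue = deque([x])
        while queue and y not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if y not in parent:
            return flow      # max flow equals min cut
        # Push one unit of flow along the path found.
        v = y
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

# Two triangles joined by a single bridge edge (c, d).
graph = [("a", "b"), ("b", "c"), ("a", "c"),
         ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
print(edge_separability(graph, "a", "b"))  # edge inside a triangle
print(edge_separability(graph, "c", "d"))  # the bridge
```

Edges inside a tightly connected triangle have a higher separability than the bridge, which is the intuition for clustering high-separability edges first.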
[Figure: scaled cutsize by clustering algorithm (axis 0.2 to 1.4)]
ABS 1.13, DEN 1.09, REP 1.31, RTC 1.19, CLO 1.15, CON 1.31, ESC 1
[Figure: 1st-level clusters nested inside 2nd-level clusters, with inter-cluster delays D1, D2, D3]
Inputs:
– Areas and delays for all modules
– Different inter-cluster delays for different levels
– Area constraints on each level of clustering
Objective: build multi-level clusters that minimize the delay under the area constraints
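The problem statement above can be illustrated by an evaluator that scores a given two-level clustering. The sketch assumes a particular reading of D1, D2, D3 that the slide does not spell out: D1 applies to a connection inside a 1st-level cluster, D2 to a connection between 1st-level clusters of the same 2nd-level cluster, and D3 otherwise. It also assumes a combinational (acyclic) network and globally unique 1st-level cluster ids. All names and numbers are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def edge_delay(u, v, level1, level2, D):
    """Assumed delay model: D = (D1, D2, D3), chosen by the clusters of u and v."""
    if level1[u] == level1[v]:
        return D[0]
    if level2[u] == level2[v]:
        return D[1]
    return D[2]

def max_delay(node_delay, edges, level1, level2, D):
    """Longest input-to-output delay of an acyclic network under the clustering."""
    preds = {v: set() for v in node_delay}
    for u, v in edges:
        preds[v].add(u)
    arrival = {}
    for v in TopologicalSorter(preds).static_order():  # predecessors come first
        worst_input = max(
            (arrival[u] + edge_delay(u, v, level1, level2, D) for u in preds[v]),
            default=0.0,
        )
        arrival[v] = worst_input + node_delay[v]
    return max(arrival.values())

def areas_ok(area, cluster_of, limit):
    """True if every cluster's total module area is within the limit."""
    totals = {}
    for v, c in cluster_of.items():
        totals[c] = totals.get(c, 0.0) + area[v]
    return all(t <= limit for t in totals.values())

# Hypothetical chain a -> b -> c -> d with unit delays and unit areas.
node_delay = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
area = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
chain = [("a", "b"), ("b", "c"), ("c", "d")]
level1 = {"a": 0, "b": 0, "c": 1, "d": 2}   # 1st-level cluster of each module
level2 = {"a": 0, "b": 0, "c": 0, "d": 1}   # 2nd-level cluster of each module
D = (0.0, 2.0, 5.0)                          # illustrative D1, D2, D3

print(max_delay(node_delay, chain, level1, level2, D))
print(areas_ok(area, level1, 2.0), areas_ok(area, level2, 3.0))
```

A clustering algorithm would search over the `level1` and `level2` assignments to minimize `max_delay` while keeping both `areas_ok` checks true.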
Linear space and time complexity (if the network is
Two phases: labeling and clustering
First phase: labeling
– From PIs to POs, visit nodes in topological order
– Label each node with its maximum delay under the two-level delay model
Second phase: clustering
– From POs to PIs, cluster nodes
Node duplication (ND) control
– Full node duplication
– Partial node duplication (depends on node criticality)
– No node duplication
[Figure: comparison of Quartus, Quartus + TLC (no ND), Quartus + TLC (partial ND), and Quartus + TLC (full ND)]
– hMetis [DAC97] + retiming + slicing floorplan [Algo89]
– HPM [DAC00] + slicing floorplan [Algo89]
– GEO: simultaneous partitioning + coarse placement + retiming
Close to 40% delay reduction!
[Figure: delay, cutsize, wire, and runtime (axis 0.2 to 1.4) for hMetis+RT+FL, HPM+FL, and GEO]
[Figure: our wire length / GD wire length and our time / GD time (axis 0% to 140%) versus circuit size (1k-10k, 10k-50k, 50k-200k). Data labels in extraction order: 108%, 140%, 101%, 76%, 101%, 45%.]
Our multi-level engine performs well for big circuits
Circuits, listed by size group, smallest first:
– s9234, s5378, s13207, s15850, bigkey
– s38417, s38584, clma, big1, big3, big4
– big2, big5, big6
Architecture blocks with different implementations with
– Different areas
– Different delays
– Different pipeline stages
– …
Parameterized buses with different bus widths
Interconnect planning extracts area, delay, etc. for architecture evaluation
Interconnect planning uses architecture evaluation functions to explore alternative architecture blocks and buses for system performance optimization
Explore various synthesis solutions to tradeoff long
Generalized technology mapping: choosing different
Revisit and extend various re-wiring techniques
Interconnects determine system performance
Interconnect-centric design is needed
– Interconnect planning
– Interconnect synthesis
– Interconnect layout
Physical hierarchy generation is crucial for
A good combination of partitioning/placement and
Multi-level method is an effective way to cope with