An Interconnect-Centric Design Flow for Nanometer Technologies
Jason Cong UCLA Computer Science Department Email: cong@cs.ucla.edu Tel: 310-206-2775 http://cadlab.cs.ucla.edu/~cong
Jason Cong 10/17/01 2
Outline
Global interconnects in nanometer technologies
Intrinsic gate delay (ps)    71    51    49    45    39    22
                             59    49    51    44    52    42
                           2600  2500  2700  2600  3700  4700
                            900   800   770   700   770   670
Fact: Global interconnect is 15x – 20x slower than logic gates
[Figure: number of clock cycles needed to reach a given distance across the die; distance axis marked at 7.52, 15.04, 22.56, and 24.9 mm; regions labeled 1 clock through 7 clocks]
NTRS'97 0.07 um technology, 5 GHz across-chip clock, 620 mm2 die (24.9 mm x 24.9 mm)
IPEM BIWS estimations
– Buffer size: 100x
– Driver/receiver size: 100x
From corner to corner: 7 clock cycles
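The corner-to-corner figure can be checked with a few lines of arithmetic. This is a rough sketch, not the IPEM model itself: it assumes that the first axis mark (7.52 mm) is the distance an optimized wire covers in one clock cycle, and that a corner-to-corner route is a Manhattan path of two die edges. The function and constant names are illustrative only.

```python
import math

CLOCK_GHZ = 5.0             # across-chip clock frequency quoted on the slide
DIE_EDGE_MM = 24.9          # die edge length quoted on the slide
REACH_PER_CYCLE_MM = 7.52   # assumption: first axis mark = distance covered per cycle


def cycles_to_cross(distance_mm, reach_mm=REACH_PER_CYCLE_MM):
    """Number of whole clock cycles needed to cover distance_mm."""
    return math.ceil(distance_mm / reach_mm)


# Clock period in picoseconds for a 5 GHz clock.
period_ps = 1000.0 / CLOCK_GHZ

# Assumption: corner-to-corner routing follows a Manhattan path (two die edges).
corner_to_corner_mm = 2 * DIE_EDGE_MM

print(period_ps)                             # clock period in ps
print(cycles_to_cross(DIE_EDGE_MM))          # cycles to cross one die edge
print(cycles_to_cross(corner_to_corner_mm))  # cycles from corner to corner
```

Under these assumptions the corner-to-corner result agrees with the 7 clock cycles stated on the slide.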
[Figure: proposed transition from device/function-centric design to interconnect/communication-centric design; analogy: programs versus data/objects]
[Figure: interconnect-centric design flow, from Architecture/Conceptual-level Design and Design Specification to Final Layout; labels: abstraction, HDM, Structure view, Functional view, Physical view, Timing view]
Interconnect Planning
Synthesis and Placement under Physical Hierarchy
Interconnect Synthesis
– Topology generation & wire sizing for delay
– Wire ordering & spacing for noise control
Interconnect Layout
– Route Planning
– Point-to-Point Gridless Routing
Interconnect Performance Estimation Models (IPEM)
Interconnect Optimization (TRIO) with Buffer Insertion and Wire Sizing
module cpu(pj_su, pj_boot8, …);
  input …;
  fpu fpu(.fpain(iu_rs2_e), .fpbin(iu_rs1_e), .fpop(fpop), .fpbusyn(fp_rdy_e),
          .fpkill(iu_kill_fpu), .fpout(fpu_data_e), .clk(clk), …);
  pcsu pcsu(.pj_clk_out(pj_clk_out), …);
  smu smu(.iu_optop_in(iu_optop_din), …);
  dtag_shell dtag_shell(.tag_in(dcu_tag_in), …);
  dcram_shell dcram_shell(.data_in({dcu_din_e[31], …);
  dcu dcu(.biu_data(pj_datain), …);
  itag_shell itag_shell(.icu_tag_in(icu_tag_in), …);
  icram_shell icram_shell(.icu_din(icu_din), …);
  icu icu(.biu_data(pj_datain), …);
  iu iu(.iu_data_vld(iu_data_vld), …);
endmodule
[Figure: block diagram of the cpu module described in Verilog, with blocks Integer Unit (IU), FPU, SMU, PCSU, DCU, ICU, ICRAM, DCRAM, itag, dtag, memory, and SRAM latches]
resulting poor global interconnect
By courtesy of IBM (Tony Drumm)
poor global interconnect
[Figure: mapping of the logical hierarchy onto a physical hierarchy; legend: hard IP, soft module; same color for modules of the same logic hierarchy]
Assign modules to physical hierarchy with interconnect estimation and optimization
Example: Global interconnects defined by two different physical hierarchies
[Figure labels: critical path, latch]
Alternative Architecture Block Selection
Re-Synthesis and Retiming
[Figure: alternative blocks with A=3, D=4 and A=4, D=3; labels: latch, critical path]
Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to hide global interconnect delay
Example: $d(v) = 1$, $WL = 6$, $d(e) \propto WL$ for both placements
– Placement 1 (order a b c d): before retiming, $\phi = 5.0$; after retiming, $\phi = 3.0$
– Placement 2 (order a c b d), the better initial placement: before retiming, $\phi = 4.0$; after retiming, $\phi = 4.0$
$l(v)$ = max delay from PIs to $v$ after optimal retiming under a given clock period $\phi$
$l(v) = \max\{\, l(u) - \phi \cdot w(u,v) + d(u,v) + d(v) \,\}$
Relation to retiming: $r(v) = \lceil l(v)/\phi \rceil - 1$
Theorem: P can be retimed to $\phi + \max\{d(e)\}$ iff $l(\mathrm{POs}) \le \phi$
Example: $l(u) = 7$, $l(w) = 3$, $d(v) = 1$, $d(e) = 2$, $\phi = 5$, with $w(u,v) = 1$ and $w(w,v) = 0$:
$l(v) = \max\{7 - 5 \cdot 1 + 2 + 1,\ 3 + 2 + 1\} = 6$
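The worked example above can be reproduced in a few lines. This is only a sketch of the recurrence as written on the slide: `phi` stands for the target clock period, each fanin is given as a tuple of the fanin's l-value, the edge weight w(u,v), and the edge delay d(u,v), and the function names are hypothetical. Reading w(u,v) as a register count follows standard retiming notation and is an interpretation the slide does not spell out.

```python
import math


def l_value(fanins, d_v, phi):
    """Apply the slide's recurrence for one node v.

    fanins: list of (l_u, w_uv, d_uv) tuples, one per fanin edge (u, v).
    d_v:    delay of node v.
    phi:    target clock period.
    """
    return max(l_u - phi * w_uv + d_uv for l_u, w_uv, d_uv in fanins) + d_v


def retiming_value(l_v, phi):
    """r(v) = ceil(l(v) / phi) - 1, as stated on the slide."""
    return math.ceil(l_v / phi) - 1


# Slide example: l(u) = 7 with w(u,v) = 1, l(w) = 3 with w(w,v) = 0,
# d(v) = 1, d(e) = 2, phi = 5.
l_v = l_value([(7, 1, 2), (3, 0, 2)], d_v=1, phi=5)
r_v = retiming_value(l_v, phi=5)
print(l_v, r_v)
```

The computed l-value matches the slide's result of 6; the retiming value follows from applying the stated relation to that l-value.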
Difficulty
– Need to work on the entire circuit, with many cycles
– Topological order does not exist!
Basic approach
– Start with min l-value for each node and iteratively improve it
Will the computation converge?
– YES, if the circuit can be retimed to the target cycle time
– Theorem: Convergence is guaranteed in O(n) iterations if the circuit can be retimed to the target cycle time
Practical experience
– Converges in a constant number of iterations with a good DFS order
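The iterative scheme can be sketched as a relaxation loop in the style of Bellman-Ford. This is a minimal illustration under stated assumptions, not the author's implementation: primary inputs start at l = 0, all other nodes start at minus infinity (the "min l-value"), nodes are visited in list order rather than a tuned DFS order, and the cap of n + 1 passes is an assumed stand-in for the O(n) bound in the theorem. The names `compute_l_values` and `is_feasible` are hypothetical.

```python
import math


def compute_l_values(nodes, edges, node_delay, primary_inputs, phi):
    """Iteratively raise l-values until nothing changes.

    edges: list of (u, v, w_uv, d_uv) tuples.
    Returns the l-value dict, or None if the values are still changing
    after len(nodes) + 1 passes (assumed to mean phi is not achievable).
    """
    pis = set(primary_inputs)
    l = {v: (0.0 if v in pis else -math.inf) for v in nodes}
    fanin = {v: [] for v in nodes}
    for u, v, w_uv, d_uv in edges:
        fanin[v].append((u, w_uv, d_uv))

    for _ in range(len(nodes) + 1):
        changed = False
        for v in nodes:
            if v in pis:
                continue
            for u, w_uv, d_uv in fanin[v]:
                candidate = l[u] - phi * w_uv + d_uv + node_delay[v]
                if candidate > l[v]:
                    l[v] = candidate
                    changed = True
        if not changed:
            return l
    return None


def is_feasible(l, primary_outputs, phi):
    """Check the slide's condition l(POs) <= phi on converged l-values."""
    return l is not None and all(l[po] <= phi for po in primary_outputs)


# Hypothetical circuit: a -> b -> c, with a feedback edge c -> b that has w = 1.
nodes = ["a", "b", "c"]
edges = [("a", "b", 0, 2), ("b", "c", 0, 2), ("c", "b", 1, 2)]
delay = {"a": 1, "b": 1, "c": 1}

ok = compute_l_values(nodes, edges, delay, ["a"], phi=6.0)
bad = compute_l_values(nodes, edges, delay, ["a"], phi=5.0)
print(ok, is_feasible(ok, ["c"], 6.0))
print(bad)
```

In this toy circuit the loop b, c, b has total delay 6 and edge weight 1, so the l-values settle when phi is 6 and keep growing when phi is 5, which is why the second call gives up.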
Example: $d(v) = 1$, $d(e) = 2$
Is $\phi = 4.5$ possible?
[Figure: circuit with nodes a through g, and a table of l-values for each node over iterations 1 to 5]
Cycle time 4.5 is possible as $l(g) \le 4.5$
[Figure: multilevel flow plotted as problem sizes versus levels; coarsening, followed by uncoarsening & refinement (optimization)]
faster optimization on top levels
simple with good solution quality
– Best partitioner for cut-size minimization
– 30-40% delay reduction compared to hMetis
– 10x speed-up over GordianL
[Figure: multilevel refinement loop; at each cluster level, timing analysis & cell move, then proceed to the next cluster level]
Not sufficient information at higher
Mistake at higher level is impossible or costly to
Converge to better solution as more details are
hMetis [Karypis et al, DAC’97]
Hyper-edge coarsening
ESC [Cong and Lim, ICCAD’00]
Global edge separability based clustering
TLC [Cong and Romesis, DAC’01]
[Figure: two-level clustering showing 1st level clusters, 2nd level clusters, and delays D1, D2, D3]
Inputs:
– Areas and delays for all modules
– Different inter-cluster delays for different levels
– Area constraints on each level of clustering
Objective: Build multi-level clusters that minimize the delay under the area constraints
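To make the objective concrete, the sketch below evaluates the longest path delay of a given two-level clustering. It is an illustration only and does not reproduce the TLC algorithm: it assumes that D1 applies to a connection inside one first-level cluster, D2 to a connection between first-level clusters inside the same second-level cluster, and D3 to a connection between second-level clusters. The function name, argument layout, and example values are hypothetical, and area constraints are not checked.

```python
from graphlib import TopologicalSorter


def clustered_delay(fanin, node_delay, level1, level2, d1, d2, d3):
    """Longest PI-to-PO delay of a combinational network under a clustering.

    fanin:  dict mapping each node to the list of its predecessor nodes.
    level1: dict mapping each node to its first-level cluster.
    level2: dict mapping each first-level cluster to its second-level cluster.
    """
    def edge_delay(u, v):
        if level1[u] == level1[v]:
            return d1  # same first-level cluster
        if level2[level1[u]] == level2[level1[v]]:
            return d2  # same second-level cluster only
        return d3      # crosses second-level clusters

    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for v in TopologicalSorter(fanin).static_order():
        preds = fanin.get(v, [])
        latest_input = max((arrival[u] + edge_delay(u, v) for u in preds), default=0)
        arrival[v] = latest_input + node_delay[v]
    return max(arrival.values())


# Hypothetical network: a and b are PIs, c reads a and b, d reads c.
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
node_delay = {"a": 1, "b": 1, "c": 1, "d": 1}
level1 = {"a": "A", "b": "B", "c": "A", "d": "C"}
level2 = {"A": "X", "B": "X", "C": "Y"}

total = clustered_delay(fanin, node_delay, level1, level2, d1=0, d2=2, d3=5)
print(total)
```

A clustering algorithm would search over `level1` and `level2` assignments to reduce this value while keeping each cluster within its area limit.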
Linear space and time complexity (if the network is
Two phases (labeling and clustering)
First phase: labeling
– From PIs to POs, visit nodes in topological order
– Label the node with the maximum delay under the two-level delay model
Second phase: clustering
– From POs to PIs, cluster nodes
Node duplication (ND) control
– Full node duplication
– Partial node duplication (depends on node criticality)
– No node duplication
[Chart legend: Quartus; Quartus + TLC (no ND); Quartus + TLC (partial ND); Quartus + TLC (full ND)]
For Altera APEX FPGAs with 2-level hierarchy (LABs & MegaLABs)
– hMetis [DAC97] + retiming + slicing floorplan [Algo89]
– GEO: simultaneous partitioning + coarse placement + retiming
Close to 40% delay reduction!
[Chart: delay, cutsize, wire, and runtime for hMetis+RT+FL versus GEO; vertical axis from 0.2 to 1.4]
Physical Hierarchy Generation
Detailed Placement by IBM tools
270k cells, 300k nets
Technology: ibm_sa27e (0.11um copper)
Explore various synthesis solutions to tradeoff long
Select different behavior and logic synthesis
Scheduling for hiding interconnect latency ……
Architecture blocks with different implementations
– Different areas
– Different delays
– Different pipeline stages
– …
Parameterized buses with different bus widths
Interconnect planning extracts area, delay, etc. for
Interconnect planning uses architecture evaluation
Interconnects determine system performance
An interconnect-centric design flow is needed
– Interconnect planning
– Synthesis/layout under physical hierarchy
– Interconnect synthesis
– Interconnect layout
Physical hierarchy generation is crucial for
A good combination of partitioning/placement and
Multi-level method is an effective way to cope with