Retiming & Pipelining over Global Interconnects
Jason Cong
Computer Science Department
University of California, Los Angeles
Motivation: How Far Can We Go in Each Clock Cycle
[Figure: distance reachable per clock cycle across the chip; distance marks at 7.52, 15.04, 22.56 and 24.9 mm; regions labeled 1 clock to 7 clock]
- NTRS'97 0.07um technology
- 5 GHz across-chip clock
- 620 mm2 die (24.9mm x 24.9mm)
- IPEM BIWS estimations
  - Buffer size: 100x
  - Driver/receiver size: 100x
- From corner to corner: 7 clock cycles
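The 7-cycle figure can be checked with a back-of-the-envelope calculation. The sketch below assumes that "corner to corner" means a rectilinear (Manhattan) route along two chip edges; that reading is an assumption, not something the slide states.

```python
import math

CHIP_SIDE_MM = 24.9         # die edge length from the slide
REACH_PER_CYCLE_MM = 7.52   # distance covered in one clock cycle (slide figure)

# Assumption: corner-to-corner wires are routed rectilinearly (Manhattan distance).
corner_to_corner_mm = 2 * CHIP_SIDE_MM

# Number of clock cycles needed, rounded up to whole cycles.
cycles = math.ceil(corner_to_corner_mm / REACH_PER_CYCLE_MM)
print(cycles)
```

Under this assumption the estimate agrees with the slide's corner-to-corner count.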
Solutions
- Fully asynchronous designs
- GALS (globally asynchronous, locally synchronous designs)
- Latency-insensitive designs
- Synchronous designs, with multi-cycle communications
  - Much better understood
  - Supported by the current tool set
  - More energy efficient?
Interconnect-Centric IC Design Flow Under Development at UCLA
[Flow diagram labels: Design Specification; Architecture/Conceptual-level Design; Final Layout; Interconnect Performance Estimation Models (IPEM); abstraction with structure view, functional view, physical view and timing view]
HDM
Synthesis and Placement under Physical Hierarchy Interconnect Planning
- Physical Hierarchy Generation for Multi-Cycle Comm.
- Interconnect Architecture Planning
Interconnect Optimization (TRIO)
- Topology Optimization with Buffer Insertion
- Wire sizing and spacing
- Simultaneous Buffer Insertion and Wire Sizing
- Simultaneous Topology Construction
with Buffer Insertion and Wire Sizing
Interconnect Layout
Route Planning Point-to-Point Gridless Routing
- OWS, SDWS, BISWS
Interconnect Synthesis
- Topology generation & wire sizing for delay
- Wire ordering & spacing for noise control
Physical Hierarchy Generation Problem Formulation
[Figure: logical hierarchy mapped to physical hierarchy; hard IP and soft modules; same color for modules of the same logic hierarchy]
- Physical hierarchy generation
  - Assigns modules to the physical hierarchy
  - Defines global interconnects
- Optimization objectives:
  - wire length minimization
  - routing congestion minimization
  - clock period, latency, performance (with consideration of multi-cycle comm.)
- Physical Hierarchy = Placement bins + module locations
Need of Considering Retiming/Pipelining during Placement
- Retiming/pipelining on global interconnects
  - Multiple clock cycles are needed to cross the chip
  - Proper placement allows retiming to hide global interconnect delays
Example: nodes a, b, c, d; for both placements d(v)=1, WL=6, d(e) ∝ WL

Placement                       Before retiming   After retiming
Placement 1 (order a b c d)     φ = 5.0           φ = 3.0
Placement 2 (order a c b d)     φ = 4.0           φ = 4.0

Better Initial Placement !! Placement 1 starts with the larger clock period (5.0 vs. 4.0) but reaches the smaller one after retiming (3.0 vs. 4.0).
Difficulties
- How to consider retiming/pipelining over global interconnects
  - Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult
  - Answer: Use the concepts of c-retiming and sequential timing analysis (Seq-TA)
- How to handle the high complexity of the combined problem
  - Answer: Use the multi-level optimization technique
Simultaneous Coarse Placement with Retiming on Interconnects
- Our solution
  - Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  - Minimize the longest sequential path by improving the placement solution
- Alternative solution [Brayton, et al]
  - Enforcing all loop constraints during placement
Static Timing Analysis (STA)
[Figure: sequential circuit with nodes a, b, c, d, e, f, g and its DAG form]
- Transform the circuit into a DAG for static timing analysis
- Topological order: a, b, g, f, c, d, e
- Arrival time (AT) and required time (RT) of each node are computed in linear time
Sequential circuit example: PI: a, b. PO: g. Suppose d(v)=1, d(e)=2 and clock cycle φ = 11.

Node   a   b   g    f   c   d   e
AT     1   1   3    3   3   6   9
RT     9   9   11   9   3   6   9
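The linear-time AT/RT computation can be sketched as one forward and one backward sweep over a topological order. The four-node DAG below (p, q, r, s) is hypothetical, since the slide's circuit figure is not recoverable from the text; only d(v)=1, d(e)=2 and φ=11 are taken from the slide.

```python
from graphlib import TopologicalSorter

D_V, D_E, PHI = 1, 2, 11   # gate delay, edge delay, clock cycle (slide values)

# Hypothetical combinational DAG: node -> list of fanin nodes.
fanin = {"p": [], "q": [], "r": ["p", "q"], "s": ["r"]}
outputs = {"s"}            # hypothetical primary output

# Fanins come before fanouts in this order.
order = list(TopologicalSorter(fanin).static_order())

# Forward sweep: arrival time at the output of each node.
at = {}
for v in order:
    at[v] = max((at[u] + D_E for u in fanin[v]), default=0) + D_V

# Backward sweep: required time at the output of each node.
rt = {v: (PHI if v in outputs else float("inf")) for v in order}
for v in reversed(order):
    for u in fanin[v]:
        rt[u] = min(rt[u], rt[v] - D_V - D_E)

slack = {v: rt[v] - at[v] for v in order}
print(at, rt, slack)
```

Each node and edge is visited a constant number of times, which is where the linear bound comes from.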
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
- Definition [Pan et al, TCAD98]
  - Given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G
  - Label vertex v as l(v) = the weight of the longest path from PIs to v = max{l(u) - φ·w(u,v) + d(u,v) + d(v)}; l(v) is also called SAT(v)
- Theorem: C can be retimed to φ + max{d(v)} iff l(POs) ≤ φ
- Relation to retiming: r(v) = l(v)/φ - 1
- Complexity is O(VE)
Example [Figure: vertices a and b feeding vertex c; vertex delays d(a), d(b), d(c)]
- d(a) = d(b) = 1, d(a,c) = d(b,c) = 2, φ = 5
- w(a,c) = 1, w(b,c) = 0
- Edge weights in G: wl(a,c) = d(e(a,c)) - φ·w(a,c), wl(b,c) = d(e(b,c)) - φ·w(b,c)
- l(a) = 7, l(b) = 3
- l(c) = max{7+2-5·1+1, 3+2+1} = 6
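The label of c in this example can be reproduced directly from the definition. This is a minimal sketch: d(c)=1 is not listed explicitly on the slide and is assumed here from the "+1" terms in its arithmetic.

```python
PHI = 5                      # clock period from the slide
l = {"a": 7, "b": 3}         # labels of the fanin vertices (slide values)
D_C = 1                      # assumed d(c), implied by the "+1" in the slide's sum

# (u, d(u,c), w(u,c)) for every edge entering c.
edges_into_c = [("a", 2, 1), ("b", 2, 0)]

# l(c) = max over fanins of l(u) - phi*w(u,c) + d(u,c) + d(c)
l_c = max(l[u] - PHI * w + d_e + D_C for u, d_e, w in edges_into_c)
print(l_c)
```

The register on edge (a,c) subtracts one clock period from the path through a, so the path through b determines the label.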
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
[Figures: sequential circuit with nodes a to g; retiming graph (not a DAG) with edge weights -2.5, -7 and 2; retimed circuit]
d(v)=1, d(e)=2
Is φ = 4.5 possible?

Iter#   a   b   c      d     e     f    g
0       0   0   -∞     -∞    -∞    -∞   -∞
1       0   0   -1.5   -∞    -∞    -∞   -∞
2       0   0   -1.5   1.5   1.5   -∞   -∞
3       0   0   -1.5   1.5   4.5   0    0
4       0   0   -1.5   1.5   4.5   0    0
5       0   0   -1.5   1.5   4.5   0    0

Cycle time 4.5 is possible because l(g) ≤ 4.5.
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont'd)
[Figures: sequential circuit with nodes a to g; retiming graph (not a DAG)]
d(v)=1, d(e)=2
Is φ = 2.5 feasible?

Iter#   a   b   c     d     e     f    g
0       0   0   -∞    -∞    -∞    -∞   -∞
1       0   0   0.5   -∞    -∞    -∞   -∞
2       0   0   0.5   3.5   3.5   -∞   -∞
3       0   0   0.5   3.5   6.5   4    4

Cycle time 2.5 is not feasible because l(g) > 2.5.
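The iteration in these tables is a Bellman-Ford style longest-path relaxation on the retiming graph. The sketch below is one way to write it. The edge list is a hypothetical topology, because the slide's circuit figure is not recoverable; it was chosen so that the converged labels for φ = 4.5 match the last row of the slide's table, and the loop edge (f, c) is purely illustrative. Labels are updated in place, so intermediate rows will not match the slide's tables row by row.

```python
NEG_INF = float("-inf")
D_V, D_E = 1, 2   # gate and edge delays from the slide

# (u, v, w): edge u -> v carrying w flip-flops. Hypothetical topology.
EDGES = [("a", "c", 1), ("b", "c", 1), ("c", "d", 0), ("c", "e", 0),
         ("d", "e", 0), ("d", "f", 1), ("d", "g", 1), ("f", "c", 2)]
PIS, POS = ("a", "b"), ("g",)
NODES = "abcdefg"

def seq_arrival_times(phi):
    """Return the labels l(v), or None if they do not converge."""
    l = {v: (0.0 if v in PIS else NEG_INF) for v in NODES}
    for _ in range(len(NODES)):
        changed = False
        for u, v, w in EDGES:
            cand = l[u] - phi * w + D_E + D_V
            if cand > l[v]:
                l[v], changed = cand, True
        if not changed:
            return l          # converged: no label moved in this pass
    return None               # still changing after |V| passes: positive cycle

def feasible(phi):
    """Check the slide's condition: labels converge and l(PO) <= phi."""
    l = seq_arrival_times(phi)
    return l is not None and all(l[po] <= phi for po in POS)

print(feasible(4.5), feasible(2.5))
```

With |V| passes over |E| edges, the sketch has the O(VE) complexity quoted in the definition.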
Multi-Level Optimization Framework
[Figure: coarsening, then uncoarsening & refinement (optimization); axes: problem sizes, levels]
- Multi-level coarsening generates smaller problem sizes for top levels, which gives faster optimization on top levels
- May explore different aspects of the solution space at different levels
- Gradual refinement on good solutions from coarser levels is very efficient
- Successful in many applications
  - Originally developed for PDE
  - Recent success in VLSI CAD: partitioning, placement, routing
Challenges
- Previous Seq-TA can only handle single-output gates
  - In reality multi-output modules exist
    - IP block, MUX, adders
    - Clusters in the multi-level optimization process
- How to integrate Seq-TA into multi-level coarse placement efficiently
- Need to consider congestion and routability
Generalize c-retiming for Complex Combinational Modules
[Figure: complex module (combinational logic) with inputs vI0, vI1, vI2 and outputs vO0, vO1; multi-output, with non-uniform propagation delays 4, 11, 9 and 3]
l1-value labeling for each vertex (upper bound of the labeling)
- Reduce the non-uniform gate delay to a uniform gate delay by taking the max internal delay as the gate delay: d'(v) = max{d(v(i, j))}; in the figure, d'(v) = 11
- l1(v) = weight of the longest path from PIs to v using d'(v) as uniform gate delay
- Each vertex has an l1-value label
l2-value labeling for each output of a vertex (lower bound of the labeling)
- Flatten/decompose the complex module by treating each pin of the module as a vertex with zero delay
- l2(v_ot) = weight of the longest path from PIs to output o_t of v
- Each output of a vertex has an l2-value label
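The two labelings can be compared on a single module. In the sketch below the four delays are the ones in the figure, but which pin pair carries which delay is an assumption, and the input labels are made-up numbers. Edge delays and register weights into the module are taken as already folded into the input labels.

```python
# Assumed assignment of the figure's delays (4, 11, 9, 3) to pin pairs.
pin_delay = {("I0", "O0"): 4, ("I1", "O0"): 11,
             ("I1", "O1"): 9, ("I2", "O1"): 3}

# Hypothetical labels already available at the module's input pins.
l_in = {"I0": 2, "I1": 0, "I2": 5}

# l1: one label per module, using the worst pin-to-pin delay for every path.
d_prime = max(pin_delay.values())
l1 = max(l_in.values()) + d_prime

# l2: one label per output pin, using the actual pin-to-pin delays.
l2 = {}
for (i, o), d in pin_delay.items():
    l2[o] = max(l2.get(o, float("-inf")), l_in[i] + d)

print(d_prime, l1, l2)
```

Every output-pin label computed this way stays at or below the module-level label, since l1 replaces each pin-to-pin delay by the maximum one.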
Properties of Generalized c-retiming for Complex Combinational Modules
- Theorem: If ∃ a PO_t with l2(PO_t) > Φ, then the circuit cannot be retimed to a clock period of Φ.
- Theorem: If for every PO_i, l1(PO_i) ≤ Φ, then the circuit can be retimed to a clock period less than Φ+k, where k is the max input-output delay of all gates.
- Theorem: For any module v and its out-pin v_ot, l2(v_ot) ≤ l1(v).
- Theorem: Given a circuit C, let Φ be the min clock period achieved by retiming on C. If C_c is derived from C by performing clustering, and the min clock period achieved by retiming on C_c is Φ_c, then Φ ≤ Φ_c.
Integrate Seq-TA with a Multi-level SA-based Coarse Placement
In the coarsening phase, FFs can only be clustered after a certain level k.
- From level Ln to Lk+1, perform static timing analysis (where FFs are clustered)
- From level Lk to L0, perform Seq-TA (where FFs are not clustered)
[Figure: levels L0, …, Lk, …, Ln; Initial Placement; Refinement by timing-driven SA-based coarse placement]
Area Density Problems in Multi-level Coarse Placement
- Traditional area density control: cell area in each bin < bin area utilization with a small percentage of overflow
  - Does not work when cluster sizes may have significant variations and may be bigger than a bin
- How about using different grid sizes for different levels of clustering?
  - Hard to find fixed percentages that work
  - Significant placement cost jump when switching grid sizes
Hierarchical Area Density Control
- Use the same grid structure for placement for all clustering levels
- Impose hierarchy on bin structure for area density control
- Each cluster move must satisfy the area constraints on each level in the bin hierarchy
- Area constraint for moving a cell of size A
  - Allowed overflow on each level in the bin hierarchy = kA, where k is a small constant (usually 1 or 2)
- Works well in the multi-level framework: area constraints are gradually tightened during optimization
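A move check under this scheme might look like the following minimal sketch. It assumes a square grid whose side is a power of two, a quad-tree bin hierarchy (1x1, 2x2, 4x4, ... blocks of bins), and a uniform capacity per finest bin; the function name `can_move` and these structural choices are illustrative, not taken from the slides.

```python
def can_move(usage, bin_capacity, x, y, area, k=1):
    """Return True if a cell of size `area` may move into bin (x, y).

    usage[i][j] is the area already placed in finest-level bin (i, j).
    The move must fit, with allowed overflow k*area, at every hierarchy level.
    """
    n = len(usage)
    size = 1                                   # block side at the current level
    while size <= n:
        x0, y0 = (x // size) * size, (y // size) * size
        used = sum(usage[i][j]
                   for i in range(x0, x0 + size)
                   for j in range(y0, y0 + size))
        capacity = bin_capacity * size * size
        if used + area > capacity + k * area:  # overflow beyond k*A: reject
            return False
        size *= 2                              # go one level up in the hierarchy
    return True

usage = [[12, 0], [0, 0]]
print(can_move(usage, 10, 0, 0, 5))    # bin (0, 0) is already over capacity
print(can_move(usage, 10, 1, 1, 30))   # a cluster larger than one bin
```

Because the allowed overflow scales with the moved area A, a large cluster can still be placed at coarse levels, and the constraint tightens by itself as clusters shrink during uncoarsening.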
Fast Incremental A-tree Routing for Multi-pin Nets
[Figure: first quadrant; root (source pin)]
- Simple incremental A-tree
  - Recursively quad-partition grids
  - Each pin recursively connects to the lower left corner of each level of partition
- For a net with bounding box length B, at most 2·log B edge updates for each pin move, except the root
- Each edge routed by LZ-router
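One plausible reading of the recursive connection rule is sketched below: a pin links to the lower-left corner of its enclosing cell at each partition level, ending at the root. The sketch assumes the root sits at the origin of the first quadrant and that B is a power of two; `atree_chain` is a hypothetical helper name.

```python
def atree_chain(pin, box):
    """Corners visited from `pin` up to the root at (0, 0).

    `box` is the bounding box length B (assumed to be a power of two).
    """
    x, y = pin
    chain = [(x, y)]
    size = 2                                   # cell side at the finest partition
    while size <= box:
        corner = (x // size * size, y // size * size)
        if corner != chain[-1]:                # skip levels that add no new edge
            chain.append(corner)
        size *= 2
    return chain

chain = atree_chain((5, 3), 8)
print(chain)
```

Under this reading a pin owns at most log2 B tree edges, so moving it touches at most the old chain plus the new chain, which is consistent with the 2·log B bound on the slide.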
Fast LZ-routing for Two-pin Connections
[Figure: left region and right region of the bounding box; HVH and VHV routes]
- Decide HVH or VHV: select the less congested layer
- Binary search on V-stem (or H-stem)
  - Initial left region and right region to cover bounding box
  - Repeat
    - Query wire usage on both regions
    - Select region with less congestion
- Wire usage query can be done in O(log grid_size)
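The binary search on the V-stem can be sketched on a one-dimensional array of per-column wire usage. This is a simplification: the slide mentions an O(log grid_size) usage query, which suggests a tree structure that supports updates, whereas the sketch swaps in static prefix sums for brevity. The name `pick_v_stem` and the column-usage model are assumptions.

```python
from itertools import accumulate

def pick_v_stem(col_usage, lo, hi):
    """Pick a column in [lo, hi] for the V-stem by halving toward less usage."""
    prefix = [0] + list(accumulate(col_usage))

    def usage(a, b):
        # Total wire usage of columns a..b inclusive.
        return prefix[b + 1] - prefix[a]

    while lo < hi:
        mid = (lo + hi) // 2
        if usage(lo, mid) <= usage(mid + 1, hi):
            hi = mid          # left region is less congested
        else:
            lo = mid + 1      # right region is less congested
    return lo

print(pick_v_stem([5, 1, 4, 0, 3, 6, 2, 2], 0, 7))
```

The search is greedy: it follows the less congested half at each step and is not guaranteed to find the globally least-used column.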
Placement Cost Functions
- Wire length driven: summation of net bounding boxes of all nets
- Congestion driven:
  - Wire usages estimated from the fast global router
  - Cost = summation of square of wire usages in all bins
  - For fixed wire width, cost is equivalent to summation of weighted wire length, where weight on a bin = wire usage of the bin
- For a congestion driven run: only turns on congestion driven cost at the finest placement level
[Figure: 3x3 grid of bins with wire usages W1 to W9]
Congestion cost = W1² + W2² + … + W9²
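Both cost functions are short enough to write out. The sketch below reads "net bounding box" as the half-perimeter of the box, which is a common convention but an assumption here; the nets and the 3x3 usage grid are made-up inputs.

```python
def bbox_wirelength(nets):
    """Sum of half-perimeter bounding boxes over all nets.

    Each net is a list of (x, y) pin locations.
    """
    total = 0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def congestion_cost(wire_usage):
    """Sum of squared wire usages over all bins (W1^2 + W2^2 + ...)."""
    return sum(w * w for row in wire_usage for w in row)

nets = [[(0, 0), (3, 4)], [(1, 1), (2, 5), (4, 2)]]
usage = [[1, 2, 3], [0, 1, 0], [2, 2, 1]]
print(bbox_wirelength(nets), congestion_cost(usage))
```

Squaring the usage penalizes a few heavily used bins more than the same total usage spread evenly.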
Experimental Results on Wire Length Minimization
- Multi-level simulated annealing coarse placement
- Wire length comparison with GORDIAN-L:
  - Our engine only turns on wire length optimization
  - Legalized by DOMINO for wire length comparison
- Our multi-level engine performs well for big circuits
  - 20k-50k test cases: avqlarge, avqsmall, ibm04, ibm07
  - 50k-100k test cases: ibm09, ibm10
  - 100k-210k test cases: ibm14, ibm15, ibm16, ibm17, ibm18

mPG+DOM / GOR+DOM    20k-50k   50k-100k   100k-210k
Wire length          97%       100%       96%
CPU time             81%       43%        22%
Experimental Results on Congestion Control

            CPU    Total overflow   Max boundary congestion   Routed WL   BBOX WL
mPG         1      1                1                         1           1
mPG-cg.rd   6.1    0.47             0.93                      0.97        1.05
mPG-cg      18.9   0.21             0.87                      0.94        1.05

Test cases: ibm01, ibm04, ibm07, ibm11, ibm13, ibm15
- mPG: wire length driven mode
- mPG-cg: congestion driven at finest clustering level
- mPG-cg.rd: alternative congestion driven + wire length driven at finest clustering level
Initial Experimental Result on Impact of Simultaneous Retiming and Placement

                                WL-driven placement                         Simultaneous retiming and placement
circuit   #gates   Grid size    dly (before retiming)   dly (after retiming)   dly
S38584    13209    8x8          41                      38                     32
Ind1      29780    16x16        349                     349                    325
Ind2      26060    16x16        51                      39                     35
Ind3      52197    16x16        582                     577                    497
Ind4      101531   16x16        121                     115                    84
Avg.                            1                       0.93                   0.79
Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis
- The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
- Requires consideration of multi-cycle communication during architecture & behavior synthesis
Example:
- In a loop, 4 logic cells, 2 registers
- Cell delay =1ns
- Interconnect delay=1ns
- DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2=4ns
- Clock cycle >= 4ns
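The loop bound in this example is a one-line computation. The helper below is a sketch; `dr_ratio` is a hypothetical name, and the count of four interconnect segments follows the slide's own arithmetic (4+4).

```python
def dr_ratio(n_cells, cell_delay_ns, n_wires, wire_delay_ns, n_registers):
    """Delay-to-register ratio of one loop, in ns per register."""
    total_delay = n_cells * cell_delay_ns + n_wires * wire_delay_ns
    return total_delay / n_registers

# Slide example: 4 logic cells and 4 interconnect segments at 1 ns each, 2 registers.
bound_ns = dr_ratio(4, 1.0, 4, 1.0, 2)
print(bound_ns)
```

Retiming can move the two registers around the loop but cannot add registers to it, so no retiming brings the clock period below this ratio.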
Regular Distributed Register Architecture (1)
[Figure: array of islands connected by global interconnect; each island, of width Wi and height Hi, contains a function unit cluster (FUC; e.g. ADD, MUX, DIV; cluster with area constraint) and a register file]
- Distribute registers to each "island"
- Local computation and communication in each island can be done in a single clock cycle
- But registers may need to be inserted along global interconnects for multi-cycle communication (less regular)

$$D_{intra\text{-}island} = D_{logic} + D_{int\text{-}opt} \le D_{logic} + D_{int\text{-}opt}(2W_i + 2H_i) \le T$$
Regular Distributed Register Architecture (2)
[Figure: array of islands connected by global interconnect; each island, of width Wi and height Hi, contains a function unit cluster (FUC; e.g. ADD, MUX, DIV; cluster with area constraint) and a register file split into banks labeled 1 cycle, 2 cycle, …, k cycle]

$$D_{intra\text{-}island} = D_{logic} + D_{int\text{-}opt} \le D_{logic} + D_{int\text{-}opt}(2W_i + 2H_i) \le T$$

- Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island
- Highly regular
Example: Regular Distributed Register Architecture for 70nm Technology
- NTRS'97 70nm technology
- Chip dimension: 620 mm2 (24.9mm x 24.9mm)
- 5 GHz across-chip clock
  - Wire can travel up to 7.52mm within 1 clock cycle under interconnect optimization
  - Need 7 clock cycles to cross the chip
- Each island base dimension
  - Wi = Hi = 2.08mm
    - = critical length (longest length that a wire can run without buffer insertion), estimated by IPEM BIWS estimations assuming buffer size: 2x, driver/receiver size: 2x
    - ≈ 1/3 of the distance a wire can travel in 1 clock cycle
  - Logic volume: 6.76M min-size 2-NAND gates
- 12x12 island-base array
- Local registers are partitioned into 7 banks
[Figure: data flow graph with 12 operations; ALU operations (+/–) 1, 2, 5, 6, 9, 10 and multiplications (*) 3, 4, 7, 8, 11, 12]
Data flow graph extracted from discrete cosine transformation (DCT). The delay of a * operation is 2ns; the delay of + and – operations is 1ns. The resources available are 2 multipliers and 2 ALUs. The nodes with the same color are assigned to the same functional unit.
Example: Impact of Interconnect on Scheduling
Wirelength-driven Placement
[Figure: four islands, each with a register file: Alu1 (operations 1, 5, 10), Alu2 (operations 2, 6, 9), Mul2 (operations 3, 7, 12), Mul1 (operations 4, 8, 11); data flow graph annotated with interconnect delays]
Legend:
- One edge style represents long interconnect delay. The long interconnect delay is 2ns.
- The other edge style represents short interconnect delay. The short interconnect delay is 1ns.
Single-cycle vs. Multi-cycle Interconnect Communication
Single-cycle interconnect communication
- Scheduled in 6 clock cycles
- Clock period is 4ns
- Total latency is 24ns
[Figure: data flow graph scheduled over Cycle 1 to Cycle 6; legend symbol represents registers]
Multi-cycle interconnect communication
- Scheduled in 9 clock cycles
- Clock period is 2ns
- Total latency is 18ns
[Figure: data flow graph scheduled over Cycle 1 to Cycle 9]
Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
Simultaneous Placement and Scheduling
[Figure: four islands, each with a register file: Alu1 (operations 1, 5, 10), Alu2 (operations 2, 6, 9), Mul2 (operations 3, 7, 12), Mul1 (operations 4, 8, 11); data flow graph scheduled over Cycle 1 to Cycle 8]
With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns. The total latency is 16ns.
Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
Simultaneous Placement, Scheduling and Binding
[Figure: four islands, each with a register file: Alu1 (operations 1, 5, 10), Alu2 (operations 2, 6, 9), Mul2 (operations 3, 7, 11), Mul1 (operations 4, 8, 12); data flow graph scheduled over Cycle 1 to Cycle 7]
With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns. The total latency is 14ns.
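The four schedules quoted for the DCT example can be lined up in a few lines. Cycle counts and clock periods are the ones stated on the slides; only the dictionary labels are paraphrased.

```python
# name -> (clock cycles, clock period in ns), as quoted on the slides
schedules = {
    "single-cycle communication": (6, 4),
    "multi-cycle communication": (9, 2),
    "simultaneous placement + scheduling": (8, 2),
    "simultaneous placement + scheduling + binding": (7, 2),
}

# Total latency = number of cycles x clock period.
latency_ns = {name: cycles * period
              for name, (cycles, period) in schedules.items()}

for name, latency in latency_ns.items():
    print(f"{name}: {latency} ns")
```

Halving the clock period more than pays for the extra cycles, and each added degree of freedom (placement, then binding) removes one more cycle.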
Example: Multicluster Architectures of DEC Alpha 21264
Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning by Keith I. Farkas, et al
Conclusions
- Multi-cycle communication is needed for gigahertz designs
- Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
- Regular distributed register (RDR) fabric provides regularity to support
  - Multicycle communication
  - Integrated resource binding, scheduling, and physical planning