Regular Fabrics for Retiming & Pipelining over Global Interconnects
Jason Cong
Computer Science Department
University of
Overarching GSRC Research Emphasis [Jan Rabaey, June 2002]
A broadened focus on application-oriented embedded systems under tight cost, PDA, and time-to-market constraints
“From Ad-Hoc System-on-a-Chip Design to Disciplined, Platform-Based Design”
Founded on One Basic Principle
The Discipline of Platform-Based Design
Silicon Implementation Platform
Architectural Platform
Manufacturing Interface, Silicon Implementation
Basic device & interconnect structures; delay, variation, SPICE models
Microarchitecture(s), Circuit Fabric(s)
Functional blocks, interconnect; cycle-speed, power, area
Application, Architecture(s)
Kernels/benchmarks; programming model: models/estimators
Constructive Fabrics; Test, Verification, Energy & Power; Comp and Comm Based Design; Programmable Systems; Calibrating Achievable Design
From Architecture to Silicon Implementation Platform
- Different targets employ different intermediate platforms, hence different layers of regularity and design-space constraints
- Design space may actually be smaller than with large steps!
- Large-step predictions/abstractions may misguide the optimizations
[Figure: layers between Architecture and Silicon Implementation: Logic Regularity; Component Regularity and Reuse; Regular Fabrics; Geometrical Regularity]
Constructive Fabrics Theme [Source: Larry Pileggi]
Sample Work from the GSRC Fabric Theme
- Bob Brayton: Topologically Constrained Logic Synthesis
- Malgorzata Marek-Sadowska: Interconnecting Regular Fabrics
- Wojtek Maly: Geometrical Regularity
- Herman Schmit: Regular Communication Fabrics
- Jason Cong: Regular Fabrics for Retiming and Pipelining over Global Interconnects
Motivation: How Far Can We Go in Each Clock Cycle
[Figure: distance reachable across the chip per clock cycle; distance marks at 7.52, 15.04, 22.56, and 24.9 mm; regions labeled 1 clock through 7 clock]
- NTRS’97 0.07um Tech
- 5 GHz across-chip clock
- 620 mm2 (24.9mm x 24.9mm)
- IPEM BIWS estimations
- Buffer size: 100x; driver/receiver size: 100x
From corner to corner: 7 clock cycles
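The 7-cycle figure can be checked with a little arithmetic. The sketch below assumes Manhattan (rectilinear) routing, so that a corner-to-corner wire spans two chip sides; the constant and function names are illustrative and do not come from the slides.

```python
import math

REACH_PER_CYCLE_MM = 7.52   # distance a wire covers in one clock (from the slide)
CHIP_SIDE_MM = 24.9         # chip is 24.9mm x 24.9mm (from the slide)

def cycles_to_travel(distance_mm: float) -> int:
    """Clock cycles needed to cover distance_mm, rounded up to whole cycles."""
    return math.ceil(distance_mm / REACH_PER_CYCLE_MM)

# Assumption: corner-to-corner wires are routed rectilinearly (two chip sides).
corner_to_corner_mm = 2 * CHIP_SIDE_MM
print(cycles_to_travel(corner_to_corner_mm))  # 7
```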
Solutions
- Fully asynchronous designs
- GALS (globally asynchronous, locally synchronous designs)
- Latency-insensitive designs
- Synchronous designs, with multi-cycle communications
  - Much better understood
  - Supported by the current tool set
  - More energy efficient?
Need of Considering Retiming during Placement
- Retiming/pipelining on global interconnects
  - Multiple clock cycles are needed to cross the chip
- Proper placement allows retiming to hide global interconnect delays.
[Figure: two placements of nodes a, b, c, d; in both, d(v)=1, WL=6, d(e) ∝ WL]
Placement 1 (order a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (order a c b d): before retiming, φ = 4.0; after retiming, φ = 4.0
Better Initial Placement !!
Difficulties
- How to consider retiming/pipelining over global interconnects
  - Flip-flop boundaries are not fixed during placement, difficult to do static timing analysis
- How to handle the high complexity of the combined problem
Use of the concepts of c-retiming and sequential timing analysis (Seq-TA)
Use the multi-level optimization technique
Static Timing Analysis (STA)
[Figure: sequential circuit example on nodes a, b, c, d, e, f, g, and the DAG derived from it]
Sequential circuit example: PI: a, b. PO: g. Suppose d(v)=1, d(e)=2.
Transform the circuit into a DAG for static timing analysis.
Topological order: a, b, g, f, c, d, e
The arrival time (AT) and required time (RT) of each node are computed in linear time.
Suppose clock cycle φ = 11:
Node:  a   b   g   f   c   d   e
AT:    1   1   3   3   3   6   9
RT:    9   9   11  9   3   6   9
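A minimal sketch of the linear-time AT/RT computation follows. The circuit figure did not survive extraction, so the edge list is a hypothetical reconstruction chosen to be consistent with the AT/RT values on this slide. It also assumes that each flip-flop output launches at time 0 and each flip-flop input must settle by φ. All identifiers are illustrative.

```python
from collections import defaultdict, deque

# Hypothetical reconstruction: (u, v, w), w = number of flip-flops on edge u->v.
EDGES = [("a", "c", 1), ("b", "c", 1), ("c", "d", 0), ("c", "e", 0), ("d", "e", 0),
         ("d", "f", 1), ("d", "g", 1), ("e", "c", 2), ("f", "c", 1)]
NODES = list("abcdefg")
POS = {"g"}  # primary outputs

def sta(phi, d_v=1, d_e=2):
    """Return (AT, RT) dicts for the DAG obtained by cutting edges at flip-flops."""
    succ, pred = defaultdict(list), defaultdict(list)
    indeg = {v: 0 for v in NODES}
    for u, v, w in EDGES:
        if w == 0:  # only register-free edges remain in the DAG
            succ[u].append(v)
            pred[v].append(u)
            indeg[v] += 1
    # Kahn's algorithm gives a topological order in O(V + E).
    order, queue = [], deque(v for v in NODES if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    ff_fed = {v for _, v, w in EDGES if w > 0}      # nodes driven by a flip-flop
    ff_driving = {u for u, _, w in EDGES if w > 0}  # nodes that drive a flip-flop
    at, rt = {}, {}
    for v in order:  # forward pass: arrival times
        arrivals = [at[u] + d_e for u in pred[v]]
        if v in ff_fed:
            arrivals.append(0 + d_e)  # assumption: flip-flop output launches at t=0
        at[v] = d_v + max(arrivals, default=0)
    for v in reversed(order):  # backward pass: required times
        required = [rt[x] - d_v - d_e for x in succ[v]]
        if v in ff_driving:
            required.append(phi - d_e)  # assumption: flip-flop input due at phi
        if v in POS:
            required.append(phi)
        rt[v] = min(required, default=phi)
    return at, rt

at, rt = sta(phi=11)
print([at[v] for v in "abgfcde"])  # [1, 1, 3, 3, 3, 6, 9]
print([rt[v] for v in "abgfcde"])  # [9, 9, 11, 9, 3, 6, 9]
```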
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
- Definition [Pan et al, TCAD98]
  - Given a clock period φ, transfer circuit C into an edge-weighted, vertex-weighted graph G
  - Label vertex v as l(v) = the weight of the longest path from PIs to v = max{l(u) - φ·w(u,v) + d(u,v) + d(v)}; l(v) is also called SAT(v).
- Theorem: C can be retimed to φ + max{d(v)} iff l(POs) ≤ φ
- Relation to retiming: r(v) = ⌈l(v)/φ⌉ - 1
- Complexity is O(VE)
[Figure: nodes a and b feeding node c, with node delays d(a), d(b), d(c)]
Edge register counts: w(a,c)=1, w(b,c)=0
Edge weights: wl(a,c) = d(e(a,c)) - φ·w(a,c), wl(b,c) = d(e(b,c)) - φ·w(b,c)
Example: d(a)=d(b)=1, d(a,c)=d(b,c)=2, φ=5, l(a)=7, l(b)=3
l(c) = max{7+2-5·1+1, 3+2+1} = 6
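The label update in this example can be reproduced in a few lines. This is only a sketch: the function name and the tuple layout are illustrative, and d(c)=1 is assumed because the slide's arithmetic adds 1 for node c.

```python
def sat_label(fanins, d_v, phi):
    """l(v) = max over fanins u of l(u) - phi*w(u,v) + d(u,v) + d(v).

    fanins is a list of (l_u, w_uv, d_uv) tuples, one per incoming edge.
    """
    return max(l_u - phi * w_uv + d_uv + d_v for l_u, w_uv, d_uv in fanins)

# Edge (a,c): l(a)=7, w=1, d=2.  Edge (b,c): l(b)=3, w=0, d=2.
# Assumption: d(c) = 1, matching the "+1" terms in the slide's example.
l_c = sat_label([(7, 1, 2), (3, 0, 2)], d_v=1, phi=5)
print(l_c)  # 6
```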
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)
[Figure: sequential circuit on nodes a, b, c, d, e, f, g; the retimed circuit; and the retiming graph (not a DAG) with edge weights -2.5 on five edges, -7 on one edge, and 2 on three edges]
d(v)=1, d(e)=2
Is φ = 4.5 possible?
Iter#  a   b   c     d    e    f    g
0      0   0   -∞    -∞   -∞   -∞   -∞
1      0   0   -1.5  -∞   -∞   -∞   -∞
2      0   0   -1.5  1.5  1.5  -∞   -∞
3      0   0   -1.5  1.5  4.5  0    0
4      0   0   -1.5  1.5  4.5  0    0
5      0   0   -1.5  1.5  4.5  0    0
Cycle time 4.5 is possible because l(g) ≤ 4.5
Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont’d)
[Figure: sequential circuit on nodes a, b, c, d, e, f, g and its retiming graph (not a DAG)]
d(v)=1, d(e)=2
Is φ = 2.5 feasible?
Iter#  a   b   c    d    e    f    g
0      0   0   -∞   -∞   -∞   -∞   -∞
1      0   0   0.5  -∞   -∞   -∞   -∞
2      0   0   0.5  3.5  3.5  -∞   -∞
3      0   0   0.5  3.5  6.5  4    4
Cycle time 2.5 is not feasible because l(g) > 2.5
Sequential Timing Analysis (Seq-TA)
- With loops, problem is difficult
  - Topological order does not exist!
  - Start with a min l-value for each node and iteratively improve it
  - Convergence is guaranteed in O(n) iterations if the circuit can be retimed to the target cycle time
- Outline of Seq-TA
  - Binary search the min. feasible clock period
  - Given a clock period φ, check if φ is feasible
    - l(PI) = 0, l(others) = -∞
    - Relax one vertex at a time and update l-values
    - If ∃ an l(po) > φ, φ is not feasible; if relaxation converges, φ is feasible
- Complexity is O(VE)
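The outline above can be sketched as a Bellman-Ford style longest-path relaxation wrapped in a binary search. The edge list is a hypothetical reconstruction of the seven-node example circuit, chosen to be consistent with the iteration tables on the earlier slides, since the original figure is not recoverable. The sketch relaxes all edges once per round rather than one vertex at a time, and all names and the search bounds are assumptions.

```python
import math

# Hypothetical reconstruction: (u, v, w), w = number of flip-flops on edge u->v.
EDGES = [("a", "c", 1), ("b", "c", 1), ("c", "d", 0), ("c", "e", 0), ("d", "e", 0),
         ("d", "f", 1), ("d", "g", 1), ("e", "c", 2), ("f", "c", 1)]
NODES = list("abcdefg")
PIS, POS = {"a", "b"}, {"g"}

def sat_labels(phi, d_v=1.0, d_e=2.0):
    """Relax l-values for at most |V| rounds; return (converged, labels)."""
    labels = {v: (0.0 if v in PIS else -math.inf) for v in NODES}
    for _ in range(len(NODES)):
        updated = dict(labels)
        for u, v, w in EDGES:
            candidate = labels[u] - phi * w + d_e + d_v
            if candidate > updated[v] + 1e-12:  # small tolerance for float noise
                updated[v] = candidate
        if updated == labels:  # fixed point reached: no positive-weight loop
            return True, labels
        labels = updated
    return False, labels  # still changing after |V| rounds: a loop gains weight

def is_feasible(phi):
    converged, labels = sat_labels(phi)
    return converged and all(labels[po] <= phi for po in POS)

def min_clock_period(lo=0.0, hi=25.0, tol=1e-3):
    """Binary search; hi=25 is the sum of all node and edge delays (assumed bound)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi

print(sat_labels(4.5)[1])
print(is_feasible(4.5), is_feasible(2.5))  # True False
print(round(min_clock_period(), 2))        # 4.5
```

On this reconstructed graph the loop c, d, e carries a total delay of 9 over 2 registers, which is why the search settles at 4.5.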
Multi-Level Optimization Framework
[Figure: coarsening followed by uncoarsening & refinement (optimization); axes: problem sizes, levels]
- Multi-level coarsening generates smaller problem sizes for top levels, hence faster optimization on top levels
- May explore different aspects of the solution space at different levels
- Gradual refinement on good solutions from coarser levels is very efficient
- Successful in many applications
  - Originally developed for PDE
  - Recent success in VLSI CAD: partitioning, placement, routing
Challenges
- Previous Seq-TA can only handle single-output gates
  - In reality multi-output modules exist
    - IP block, MUX, adders
    - Clusters in the multi-level optimization process
- How to integrate Seq-TA into multi-level coarse placement efficiently
Generalize c-retiming for Complex Combinational Modules
[Figure: complex module (combinational logic) with inputs vI0, vI1, vI2 and outputs vO0, vO1; multi-output, with non-uniform propagation delays 4, 11, 9, 3]
l1-value labeling for each vertex
- Reduce the non-uniform gate delay to a uniform gate delay by taking the max. internal delay as the gate delay: d’(v) = max{d(v(i, j))}; here d’(v)=11
- l1(v) = weight of the longest path from PIs to v using d’(v) as uniform gate delay
- Each vertex has an l1-value label.
- Upper bound of the labeling
l2-value labeling for each output of a vertex
- Decompose the complex module by treating each pin of the module as a vertex with zero delay.
- l2(v_{o_t}) = weight of the longest path from PIs to output o_t of v
- Each output of a vertex has an l2-value label.
- Lower bound of the labeling
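A small sketch of the two labelings for a single module follows. The slide gives the four internal delays (4, 11, 9, 3) but the figure showing which input drives which output is lost, so the pin-to-pin mapping and the input labels below are hypothetical. Edge delays and register terms are assumed to be already folded into the input labels.

```python
# Hypothetical pin-to-pin delays; only the values 4, 11, 9, 3 come from the slide.
ARC_DELAY = {("I0", "O0"): 4, ("I1", "O0"): 11, ("I1", "O1"): 9, ("I2", "O1"): 3}

def l1_label(input_labels):
    """One label per module, using the uniform delay d'(v) = max internal delay."""
    d_uniform = max(ARC_DELAY.values())  # d'(v) = 11
    return max(input_labels.values()) + d_uniform

def l2_labels(input_labels):
    """One label per output pin, using the actual pin-to-pin delays."""
    labels = {}
    for (pin_in, pin_out), delay in ARC_DELAY.items():
        candidate = input_labels[pin_in] + delay
        labels[pin_out] = max(labels.get(pin_out, float("-inf")), candidate)
    return labels

inputs = {"I0": 5, "I1": 2, "I2": 6}  # hypothetical labels at the input pins
print(l1_label(inputs))   # 17
print(l2_labels(inputs))  # {'O0': 13, 'O1': 11}
```

In this instance every per-output l2 value is no larger than the single l1 value, in line with the upper-bound and lower-bound roles stated on the slide.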
Integrate Seq-TA with a Multi-level SA-based Coarse Placement
In the coarsening phase, FFs can only be clustered after a certain level k
- From level Ln to Lk+1, perform static timing analysis (where FFs are clustered)
- From level Lk to L0, perform Seq-TA (where FFs are not clustered)
[Figure: levels L0, …, Lk, …, Ln; initial placement; refinement by timing-driven SA-based coarse placement]
Initial Experimental Result on Impact of Simultaneous Retiming and Placement
circuit | #gates | Grid size | WL-driven placement: dly (before retiming) | WL-driven placement: dly (after retiming) | Simultaneous retiming and placement: dly
S38584 | 13209 | 8x8 | 41 | 38 | 32
Ind1 | 29780 | 16x16 | 349 | 349 | 325
Ind2 | 26060 | 16x16 | 51 | 39 | 35
Ind3 | 52197 | 16x16 | 582 | 577 | 497
Ind4 | 101531 | 16x16 | 121 | 115 | 84
Avg. | | | 1 | 0.93 | 0.79
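The averages in the last row appear to be per-circuit delays normalized to the WL-driven, before-retiming delay and then averaged. That reading is an assumption, but the short check below reproduces 0.93 and 0.79 from the listed delays; the variable and function names are illustrative.

```python
# (before retiming, after retiming, simultaneous retiming and placement)
DELAYS = {
    "S38584": (41, 38, 32),
    "Ind1": (349, 349, 325),
    "Ind2": (51, 39, 35),
    "Ind3": (582, 577, 497),
    "Ind4": (121, 115, 84),
}

def normalized_average(column: int) -> float:
    """Mean over circuits of delay[column] / delay[before retiming]."""
    ratios = [row[column] / row[0] for row in DELAYS.values()]
    return sum(ratios) / len(ratios)

print(round(normalized_average(1), 2))  # 0.93
print(round(normalized_average(2), 2))  # 0.79
```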
Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis
- The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuits
- Require consideration of multi-cycle communication during architecture & behavior synthesis
- In a loop, 4 logic cells, 2 registers
- Cell delay =1ns
- Interconnect delay=1ns
- DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2=4ns
- Clock cycle >= 4ns
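The bound in this example is simple enough to compute directly. The sketch below assumes the loop has four interconnect segments of 1ns each, which is what the slide's Dint = 4 implies; the function name and argument layout are illustrative.

```python
def dr_ratio(cell_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    return (sum(cell_delays_ns) + sum(wire_delays_ns)) / num_registers

# 4 logic cells at 1ns, 4 interconnect segments at 1ns (assumed), 2 registers.
loop_bound_ns = dr_ratio([1, 1, 1, 1], [1, 1, 1, 1], num_registers=2)
print(loop_bound_ns)  # 4.0
```

For a circuit with several loops, the clock-period bound is the maximum of this ratio over all loops.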
Regular Distributed Register Architecture
[Figure: array of islands linked by global interconnect; each island, of width Wi and height Hi, contains a Function Unit Cluster (FUC) and a register file; communication ranges labeled 1 cycle … k cycle]
$$D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{int-opt}} \le D_{\text{logic}} + D_{\text{int-opt}}(2W_i + 2H_i) \le T$$
[Figure labels: 2 cycle; function unit cluster with ADD, MUX, DIV; cluster with area constraint]
- Registers in each island are partitioned to k banks for 1 cycle, 2 cycle, … k cycle interconnect communication in each island
- Highly regular
Example: Regular Distributed Register Architecture for 70nm Technology
NTRS’97 70nm Tech
- Chip dimension: 620 mm2 (24.9mm x 24.9mm)
- 5 GHz across-chip clock
  - Wire can travel up to 7.52mm within 1 clock cycle under interconnect optimization
  - Need 7 clock cycles to cross the chip
Each island base dimension
- Wi = Hi = 2.08mm
  - = critical length (longest length that a wire can run without buffer insertion) estimated by IPEM BIWS estimations assuming buffer size: 2x, driver/receiver size: 2x
  - ≈ 1/3 of distance a wire can travel in 1 clock cycle
- Logic volume: 6.76M min-size 2-NAND gates
12x12 island-base array
Local registers are partitioned to 7 banks
[Figure: data flow graph (DFG) with 12 operation nodes: 1 (-), 2 (+), 3 (*), 4 (*), 5 (-), 6 (-), 7 (*), 8 (*), 9 (-), 10 (-), 11 (*), 12 (*)]
Data flow graph extracted from discrete cosine transformation (DCT). The delay of a * operation is 2ns; the delay of + and - operations is 1ns. The resources available are 2 multipliers and 2 ALUs.
The nodes with the same color are assigned to the same functional unit.
Example: Impact of Interconnect on Scheduling
Wirelength-driven Placement
[Figure: four functional units, each with a register file: Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul2 (nodes 3, 7, 12), Mul1 (nodes 4, 8, 11); shown next to the DFG]
Legend:
- Represents long interconnect delay. The long interconnect delay is 2ns.
- Represents short interconnect delay. The short interconnect delay is 1ns.
Single-cycle vs. Multi-cycle Interconnect Communication
Single-cycle interconnect communication
- Scheduled in 6 clock cycles
- Clock period is 4ns
- Total latency is 24ns
[Figure: DFG scheduled over Cycle 1 to Cycle 6; legend: symbol represents registers]
Multi-cycle interconnect communication
- Scheduled in 9 clock cycles
- Clock period is 2ns
- Total latency is 18ns
[Figure: DFG scheduled over Cycle 1 to Cycle 9]
Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization
Simultaneous Placement and Scheduling
With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns. The total latency is 16ns.
[Figure: placement with Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul2 (nodes 3, 7, 12), Mul1 (nodes 4, 8, 11), each with a register file; DFG scheduled over Cycle 1 to Cycle 8]
Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization
Simultaneous Placement, Scheduling and Binding
With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns. The total latency is 14ns.
[Figure: placement with Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul2 (nodes 3, 7, 11), Mul1 (nodes 4, 8, 12), each with a register file; DFG scheduled over Cycle 1 to Cycle 7]
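The latencies quoted for the four DCT schedules follow from cycles times clock period. The sketch below tabulates them. It also shows one reading of where the two clock periods come from: the single-cycle period must fit the slowest operation plus the long interconnect delay, while the multi-cycle period only has to fit the larger of the two. That interpretation is an assumption, not something the slides state, and the names are illustrative.

```python
MUL_DELAY_NS = 2        # delay of a * operation (from the slides)
LONG_WIRE_DELAY_NS = 2  # long interconnect delay (from the slides)

# Assumed derivation of the two clock periods.
single_cycle_period = MUL_DELAY_NS + LONG_WIRE_DELAY_NS     # 4ns
multi_cycle_period = max(MUL_DELAY_NS, LONG_WIRE_DELAY_NS)  # 2ns

# (number of clock cycles, clock period in ns), as stated on the slides.
SCHEDULES = {
    "single-cycle communication": (6, single_cycle_period),
    "multi-cycle communication": (9, multi_cycle_period),
    "placement + scheduling": (8, multi_cycle_period),
    "placement + scheduling + binding": (7, multi_cycle_period),
}

latency_ns = {name: cycles * period for name, (cycles, period) in SCHEDULES.items()}
for name, total in latency_ns.items():
    print(f"{name}: {total}ns")
```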
Example: Multicluster Architectures of DEC Alpha 21264
Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning by Keith I. Farkas, et al
Conclusions
- Multi-cycle communication is needed for gigahertz designs
- Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
- Regular distributed register (RDR) fabric provides regularity to support
  - Multicycle communication
  - Integrated resource binding, scheduling, and physical planning