Regular Fabrics for Retiming & Pipelining over Global Interconnects (PowerPoint presentation)

SLIDE 1

DUSD(Labs)

Regular Fabrics for Retiming & Pipelining over Global Interconnects

Jason Cong
Computer Science Department
University of California, Los Angeles
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong
FCRP Interconnect Workshop, June 28, 2002

SLIDE 2

Overarching GSRC Research Emphasis [Jan Rabaey, June 2002]

A broadened focus on application-oriented embedded systems under tight cost, PDA, and time-to-market constraints

“From Ad-Hoc System-on-a-Chip Design to Disciplined, Platform-Based Design”

Founded on One Basic Principle

SLIDE 3

The Discipline of Platform-Based Design

Silicon Implementation Platform

Architectural Platform

Manufacturing Interface, Silicon Implementation

Basic device & interconnect structures; Delay, variation, SPICE models

Microarchitecture(s), Circuit Fabric(s)

Functional Blocks, Interconnect; Cycle-speed, power, area

Application, Architecture(s)

Kernels/Benchmarks; Programming Model: Models/Estimators

SLIDE 4

The Discipline of Platform-Based Design

Silicon Implementation Platform

Architectural Platform

Manufacturing Interface, Silicon Implementation

Basic device & interconnect structures; Delay, variation, SPICE models

Microarchitecture(s), Circuit Fabric(s)

Functional Blocks, Interconnect; Cycle-speed, power, area

Application, Architecture(s)

Kernels/Benchmarks; Programming Model: Models/Estimators

Constructive Fabrics
Test, Verification, Energy & Power
Comp and Comm Based Design
Programmable Systems
Calibrating Achievable Design

SLIDE 5

From Architecture to Silicon Implementation Platform

Different targets employ different intermediate platforms, hence different layers of regularity and design-space constraints

Design space may actually be smaller than with large steps!

Large-step predictions/abstractions may misguide the optimizations

Architecture
Logic Regularity
Component Regularity and Reuse
Regular Fabrics
Geometrical Regularity
Silicon Implementation

Constructive FabricsTh [Source: Larry Pileggi]

SLIDE 6

Sample Work from the GSRC Fabric Theme

Bob Brayton: Topologically Constrained Logic Synthesis
Malgorzata Marek-Sadowska: Interconnecting Regular Fabrics
Wojtek Maly: Geometrical Regularity
Herman Schmit: Regular Communication Fabrics
Jason Cong: Regular Fabrics for Retiming and Pipelining over Global Interconnects

SLIDE 7

Motivation: How Far Can We Go in Each Clock Cycle

[Figure: chip die with distance marks at 7.52, 15.04, 22.56, and 24.9 mm and regions reachable in 1 to 7 clock cycles]

NTRS’97 0.07um Tech
5 GHz across-chip clock
620 mm2 (24.9mm x 24.9mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x

From corner to corner: 7 clock cycles
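
The 7-cycle figure can be checked with a few lines of arithmetic. The sketch below assumes the corner-to-corner route is Manhattan (one chip width plus one chip height); that routing assumption and the helper name `cycles_to_cross` are illustrative and not stated on the slide.

```python
import math

# Figures quoted on the slide (NTRS'97, 0.07 um, 5 GHz across-chip clock).
CHIP_SIDE_MM = 24.9
REACH_PER_CYCLE_MM = 7.52


def cycles_to_cross(distance_mm: float, reach_mm: float = REACH_PER_CYCLE_MM) -> int:
    """Whole clock cycles needed for a signal to cover distance_mm."""
    return math.ceil(distance_mm / reach_mm)


# Assumption: corner-to-corner routing is Manhattan (full width plus full height).
corner_to_corner_mm = 2 * CHIP_SIDE_MM
print(cycles_to_cross(corner_to_corner_mm))  # 7
```

Under a straight-line (diagonal) assumption the count would be lower, so the Manhattan reading is the one that matches the slide.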

SLIDE 8

Solutions

Fully asynchronous designs
GALS (globally asynchronous, locally synchronous) designs
Latency-insensitive designs
Synchronous designs, with multi-cycle communications
Much better understood
Supported by the current tool set
More energy efficient?

SLIDE 9

Need of Considering Retiming during Placement

Retiming/pipelining on global interconnects
Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d with d(v)=1, WL=6, d(e) ∝ WL]
Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (a c b d): before retiming, φ = 4.0
Better Initial Placement !!

SLIDE 10

Need of Considering Retiming during Placement

Retiming/pipelining on global interconnects
Multiple clock cycles are needed to cross the chip
Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d with d(v)=1, WL=6, d(e) ∝ WL]
Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (a c b d): before retiming, φ = 4.0; after retiming, φ = 4.0
Better Initial Placement !!

SLIDE 11

Difficulties

How to consider retiming/pipelining over global interconnects
Flip-flop boundaries are not fixed during placement, difficult to do static timing analysis
How to handle the high complexity of the combined problem
Use of the concepts of c-retiming and sequential timing analysis (Seq-TA)
Use the multi-level optimization technique

SLIDE 12

Static Timing Analysis (STA)

[Figure: sequential circuit example over nodes a to g, and the DAG derived from it]

Sequential circuit example: PI: a, b. PO: g. Suppose d(v)=1, d(e)=2

Transform the circuit into a DAG for static timing analysis
Topological order: a, b, g, f, c, d, e
Arrival time (AT) and required time (RT) of each node are computed in linear time.

Suppose clock cycle φ = 11

Node  a  b  g   f  c  d  e
AT    1  1  3   3  3  6  9
RT    9  9  11  9  3  6  9
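
The forward and backward passes of STA can be sketched in a few lines. The DAG below is hypothetical (the slide's circuit is in a figure that did not survive extraction), so the numbers it produces are not the AT/RT values tabulated above; only d(v)=1, d(e)=2, and φ=11 are taken from the slide.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG (not the slide's circuit): node -> list of fan-in nodes.
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["d"]}
D_V, D_E = 1, 2   # uniform node and edge delays, as on the slide
PHI = 11          # target clock period, as on the slide

# static_order() yields every node after all of its fan-in nodes.
order = list(TopologicalSorter(fanin).static_order())

# Forward pass: arrival time at the output of each node.
at = {}
for v in order:
    at[v] = D_V + max((at[u] + D_E for u in fanin[v]), default=0)

# Backward pass: required time at the output of each node.
fanout = {v: [] for v in fanin}
for v, preds in fanin.items():
    for u in preds:
        fanout[u].append(v)
rt = {}
for v in reversed(order):
    rt[v] = min((rt[s] - D_V - D_E for s in fanout[v]), default=PHI)

slack = {v: rt[v] - at[v] for v in order}
print(at, rt, slack)
```

Each node and edge is visited once per pass, which is where the linear-time claim comes from.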

SLIDE 13

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)

Definition [Pan et al, TCAD98]

Given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G

Label vertex v as l(v) = the weight of the longest path from PIs to v = max{ l(u) - φ · w(u,v) + d(u,v) + d(v) }; l(v) is also called SAT(v).

Theorem: C can be retimed to φ + max{d(v)} iff l(POs) ≤ φ

Relation to retiming: r(v) = ⌈l(v) / φ⌉ - 1

Complexity is O(VE)

Example (nodes a and b feeding c):
d(a) = d(b) = 1, d(a,c) = d(b,c) = 2, φ = 5
w(a,c) = 1, w(b,c) = 0
l(a) = 7, l(b) = 3
l(c) = max{7+2-5·1+1, 3+2+1} = 6
Edge weights: wl(a,c) = d(e(a,c)) - φ · w(a,c), wl(b,c) = d(e(b,c)) - φ · w(b,c)
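
The worked example for l(c) can be reproduced directly from the labeling rule. This is a minimal sketch; the function name `l_value` and the dictionary layout are illustrative, and d(c) = 1 is inferred from the "+1" terms in the slide's arithmetic rather than stated explicitly.

```python
def l_value(v, fanin, l, d_node, d_edge, w, phi):
    """l(v) = max over fan-in u of l(u) - phi*w(u,v) + d(u,v) + d(v)."""
    return max(
        l[u] - phi * w[(u, v)] + d_edge[(u, v)] + d_node[v] for u in fanin[v]
    )


phi = 5
l = {"a": 7, "b": 3}
d_node = {"a": 1, "b": 1, "c": 1}  # d(c) = 1 is inferred from the slide's "+1" terms
d_edge = {("a", "c"): 2, ("b", "c"): 2}
w = {("a", "c"): 1, ("b", "c"): 0}  # number of flip-flops on each edge
fanin = {"c": ["a", "b"]}

l["c"] = l_value("c", fanin, l, d_node, d_edge, w, phi)
print(l["c"])  # 6
```

The path through a contributes 7 - 5 + 2 + 1 = 5 and the path through b contributes 3 + 2 + 1 = 6, so the flip-flop on edge (a, c) is what keeps a's larger label from dominating.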

SLIDE 14

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)

[Figure: sequential circuit, retimed circuit, and retiming graph (not a DAG) over nodes a to g; extracted edge labels: 2.5, 7, 2]

d(v)=1, d(e)=2

Is φ = 4.5 possible?

Iter#  a  b  c     d    e    f   g
0      0  0  -∞    -∞   -∞   -∞  -∞
1      0  0  -1.5  -∞   -∞   -∞  -∞
2      0  0  -1.5  1.5  1.5  -∞  -∞
3      0  0  -1.5  1.5  4.5  0   0
4      0  0  -1.5  1.5  4.5  0   0
5      0  0  -1.5  1.5  4.5  0   0

Cycle time 4.5 is possible because l(g) ≤ 4.5

SLIDE 15

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont’d)

[Figure: sequential circuit and retiming graph (not a DAG) over nodes a to g]

d(v)=1, d(e)=2

Is φ = 2.5 feasible?

Iter#  a  b  c    d    e    f   g
0      0  0  -∞   -∞   -∞   -∞  -∞
1      0  0  0.5  -∞   -∞   -∞  -∞
2      0  0  0.5  3.5  3.5  -∞  -∞
3      0  0  0.5  3.5  6.5  4   4

Cycle time 2.5 is not feasible because l(g) > 2.5

SLIDE 16

Sequential Timing Analysis (Seq-TA)

With loops, problem is difficult
Topological order does not exist!
Start with a min l-value for each node and iteratively improve it
Convergence is guaranteed in O(n) iterations if the circuit can be retimed to the target cycle time

Outline of Seq-TA
Binary search the min. feasible clock period
Given a clock period φ, check if φ is feasible
l(PI) = 0, l(others) = -∞
Relax one vertex at a time and update l-values
If ∃ an l(po) > φ, φ is not feasible; if relaxation converges, φ is feasible

Complexity is O(VE)
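
The outline above can be sketched as a Bellman-Ford style longest-path relaxation wrapped in a binary search. The graph below is hypothetical (the topology of the slides' seven-node example is in a lost figure); `EDGES`, `is_feasible`, `min_period`, and the search bounds are illustrative names and choices, and relaxation is done edge by edge rather than vertex by vertex.

```python
import math

D_V, D_E = 1.0, 2.0  # uniform gate and edge delays, as in the slides' examples
EPS = 1e-9

# Hypothetical retiming graph: (u, v, w) with w = flip-flop count on edge u -> v.
EDGES = [("a", "c", 1), ("b", "c", 1), ("c", "d", 0), ("c", "e", 0),
         ("d", "e", 0), ("d", "f", 1), ("d", "g", 1), ("g", "c", 2)]
PIS, POS = ["a", "b"], ["g"]
NODES = sorted({n for u, v, _ in EDGES for n in (u, v)})


def is_feasible(phi: float) -> bool:
    """Return True if relaxation converges with every l(PO) <= phi."""
    l = {v: -math.inf for v in NODES}
    for pi in PIS:
        l[pi] = 0.0
    for _ in range(len(NODES)):          # O(V) rounds, each costing O(E)
        changed = False
        for u, v, w in EDGES:
            cand = l[u] - phi * w + D_E + D_V
            if cand > l[v] + EPS:
                l[v] = cand
                changed = True
        if any(l[po] > phi + EPS for po in POS):
            return False
        if not changed:
            return True
    return False                          # labels still growing: some loop is too slow


def min_period(lo: float = 0.0, hi: float = 100.0, tol: float = 1e-6) -> float:
    """Binary search for the smallest feasible phi, assuming hi is feasible."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


print(is_feasible(4.5), is_feasible(2.5))
```

With this particular graph the check accepts φ = 4.5 and rejects φ = 2.5. The search relies on feasibility being monotone in φ: a larger period only lowers the l-values.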

SLIDE 17

Multi-Level Optimization Framework

[Figure: V-cycle of coarsening followed by uncoarsening & refinement (optimization); axes: problem sizes, levels]

Multi-level coarsening generates smaller problem sizes for top levels
faster optimization on top levels
May explore different aspects of the solution space at different levels
Gradual refinement on good solutions from coarser levels is very efficient
Successful in many applications
Originally developed for PDE
Recent success in VLSICAD: partitioning, placement, routing

SLIDE 18

Challenges

Previous Seq-TA can only handle single-output gates
In reality multi-output modules exist
IP block, MUX, adders
Clusters in the multi-level optimization process
How to integrate Seq-TA into multi-level coarse placement efficiently

SLIDE 19

Generalize c-retiming for Complex Combinational Modules

[Figure: complex module (combinational logic) with inputs vI0, vI1, vI2 and outputs vO0, vO1, multi-output and with non-uniform propagation delays 4, 11, 9, 3]

Upper bound of the labeling: l1-value labeling for each vertex
Reduce the non-uniform gate delay to a uniform gate delay by taking the max. internal delay as the gate delay: d’(v) = max { d(v(i, j)) }
In the example, d’(v)=11
l1(v) = weight of the longest path from PIs to v using d’(v) as uniform gate delay
Each vertex has an l1-value label.

Lower bound of the labeling: l2-value labeling for each output of a vertex
Decompose the complex module by treating each pin of the module as a vertex with zero delay.
l2(v_{o_t}) = weight of the longest path from PIs to output o_t of v
Each output of a vertex has an l2-value label.
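
The difference between the two labelings can be illustrated on a single module. In this sketch the delays 4, 11, 9, 3 come from the slide, but which input/output pin pair each delay belongs to is an assumption, as are the input labels in `l_in`; edge delays and register weights are ignored to keep the example small.

```python
# Hypothetical pin-to-pin delays for the module; the values 4, 11, 9, 3 come from
# the slide, but the pin pair assigned to each value is an assumption.
pin_delay = {("vI0", "vO0"): 4, ("vI1", "vO0"): 11,
             ("vI1", "vO1"): 9, ("vI2", "vO1"): 3}

# Hypothetical labels already computed at the module inputs.
l_in = {"vI0": 5, "vI1": 2, "vI2": 6}

# l1: one label per module, using the uniform delay d'(v) = max internal delay.
d_prime = max(pin_delay.values())
l1 = max(l_in.values()) + d_prime

# l2: one label per output pin, using the actual pin-to-pin delays.
l2 = {}
for (i, o), d in pin_delay.items():
    l2[o] = max(l2.get(o, float("-inf")), l_in[i] + d)

print(d_prime, l1, l2)
```

Every l2 value is at most l1 here, since each pin-to-pin delay is at most d’(v) and each input label is at most the largest input label. The l2 labeling is tighter but needs one label per output pin.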

SLIDE 20

Integrate Seq-TA with a Multi-level SA-based Coarse Placement

In coarsening phase, FFs can only be clustered after a certain level k
From level Ln to Lk+1 perform static timing analysis (where FFs are clustered)
From level Lk to L0 perform Seq-TA (where FFs are not clustered)

[Figure: multi-level hierarchy with Level L0, ..., Level Lk, ..., Level Ln; initial placement, then refinement by timing-driven SA-based coarse placement]

SLIDE 21

Initial Experimental Result on Impact of Simultaneous Retiming and Placement

                              WL-driven placement                      Simultaneous retiming and placement
circuit  #gates  Grid size    dly (before retiming)  dly (after retiming)  dly
S38584   13209   8x8          41                     38                    32
Ind1     29780   16x16        349                    349                   325
Ind2     26060   16x16        51                     39                    35
Ind3     52197   16x16        582                    577                   497
Ind4     101531  16x16        121                    115                   84
Avg.                          1                      0.93                  0.79
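
The Avg. row can be reproduced from the per-circuit delays. The sketch assumes the average is the arithmetic mean of each circuit's delay normalized to its before-retiming delay under WL-driven placement; the slide does not state how the row was computed, but this reading matches the printed 0.93 and 0.79.

```python
# (circuit, dly before retiming, dly after retiming, dly simultaneous), from the table.
rows = [
    ("S38584", 41, 38, 32),
    ("Ind1", 349, 349, 325),
    ("Ind2", 51, 39, 35),
    ("Ind3", 582, 577, 497),
    ("Ind4", 121, 115, 84),
]


def mean_ratio(col: int) -> float:
    """Mean of per-circuit delays normalized to the before-retiming delay."""
    return sum(r[col] / r[1] for r in rows) / len(rows)


print(round(mean_ratio(2), 2), round(mean_ratio(3), 2))  # 0.93 0.79
```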

SLIDE 22

Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis

The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuits

Require consideration of multi-cycle communication during architecture & behavior synthesis


In a loop, 4 logic cells, 2 registers
Cell delay =1ns
Interconnect delay=1ns
DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2=4ns
Clock cycle >= 4ns
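
The loop example works out as follows. This is a minimal sketch; the helper name `dr_ratio` is illustrative, and reading the slide's Dint = 4 as four 1 ns interconnect segments is an assumption.

```python
def dr_ratio(cell_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay divided by register count."""
    return (sum(cell_delays_ns) + sum(wire_delays_ns)) / num_registers


# Loop from the slide: 4 cells at 1 ns, 4 interconnect segments at 1 ns, 2 registers.
loop = dr_ratio([1, 1, 1, 1], [1, 1, 1, 1], 2)
print(loop)  # 4.0
```

For a circuit with several loops, the bound on the clock period is the largest such ratio over all loops.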

SLIDE 23

Regular Distributed Register Architecture

[Figure: regular array of islands connected by global interconnect. Each island (width Wi, height Hi) contains a Function Unit Cluster (FUC), e.g. ADD, MUX, DIV, and a register file; communication distances are marked 1 cycle, 2 cycle, ..., k cycle.]

Cluster with area constraint:

$D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{int-opt}} \le D_{\text{logic}} + D_{\text{int-opt}}(2W_i + 2H_i) \le T$

Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, ..., k-cycle interconnect communication in each island

Highly regular

SLIDE 24

Example: Regular Distributed Register Architecture for 70nm Technology

NTRS’97 70nm Tech
Chip dimension: 620 mm2 (24.9mm x 24.9mm)
5 GHz across-chip clock
Wire can travel up to 7.52mm within 1 clock cycle under interconnect optimization
Need 7 clock cycles to cross the chip

Each island base dimension Wi = Hi = 2.08mm
= critical length (longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size: 2x, driver/receiver size: 2x
≈ 1/3 of the distance a wire can travel in 1 clock cycle

Logic volume: 6.76M min-size 2-NAND gates
12x12 island-base array
Local registers are partitioned into 7 banks

SLIDE 25

Example: Impact of Interconnect on Scheduling

Data flow graph extracted from discrete cosine transformation (DCT). The delay of a * operation is 2ns; the delay of + and - operations is 1ns. The resources available are 2 multipliers and 2 ALUs.

[Figure: data flow graph with nodes 1 to 12; nodes 3, 4, 7, 8, 11, 12 are * operations. The nodes with the same color are assigned to the same functional unit.]

Wirelength-driven Placement

[Figure: placement of functional units with their register files. Alu1: 1,5,10. Alu2: 2,6,9. Mul1: 4,8,11. Mul2: 3,7,12. Legend: the long interconnect delay is 2ns; the short interconnect delay is 1ns.]

SLIDE 26

Single-cycle vs. Multi-cycle Interconnect Communication

Single-cycle interconnect communication
Scheduled in 6 clock cycles
Clock period is 4ns
Total latency is 24ns

[Figure: DFG schedule over Cycle 1 to Cycle 6; the legend marks registers]

Multi-cycle interconnect communication
Scheduled in 9 clock cycles
Clock period is 2ns
Total latency is 18ns

[Figure: DFG schedule over Cycle 1 to Cycle 9]

SLIDE 27

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

Simultaneous Placement and Scheduling

With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns. The total latency is 16ns.

[Figure: placement of functional units with their register files (Alu1: 1,5,10. Alu2: 2,6,9. Mul1: 4,8,11. Mul2: 3,7,12.) and DFG schedule over Cycle 1 to Cycle 8]

SLIDE 28

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

Simultaneous Placement, Scheduling and Binding

With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns. The total latency is 14ns.

[Figure: placement of functional units with their register files (Alu1: 1,5,10. Alu2: 2,6,9. Mul1: 4,8,12. Mul2: 3,7,11.) and DFG schedule over Cycle 1 to Cycle 7]
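
The four schedules of the DCT example can be compared side by side, since total latency is the cycle count times the clock period. The cycle counts and periods are the ones quoted on the slides; the labels and variable names are illustrative.

```python
# (label, number of clock cycles, clock period in ns), as quoted on the slides.
schedules = [
    ("single-cycle communication", 6, 4),
    ("multi-cycle communication", 9, 2),
    ("simultaneous placement + scheduling", 8, 2),
    ("simultaneous placement + scheduling + binding", 7, 2),
]

latency_ns = {name: cycles * period for name, cycles, period in schedules}
for name, ns in latency_ns.items():
    print(f"{name}: {ns} ns")
```

The multi-cycle schedules use more cycles than the single-cycle one, but the halved clock period more than compensates.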

SLIDE 29

Example: Multicluster Architectures of DEC Alpha 21264

Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning by Keith I. Farkas, et al

SLIDE 30

Conclusions

Multi-cycle communication is needed for gigahertz designs
Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
Regular distributed register (RDR) fabric provides regularity to support
Multicycle communication
Integrated resource binding, scheduling, and physical planning

SLIDE 31

From Architecture to Silicon Implementation Platform

Different targets employ different intermediate platforms, hence different layers of regularity and design-space constraints

Design space may actually be smaller than with large steps!

Large-step predictions/abstractions may misguide the optimizations

Architecture
Logic Regularity
Component Regularity and Reuse
Regular Fabrics
Geometrical Regularity
Silicon Implementation