
SLIDE 1

Retiming & Pipelining over Global Interconnects

Jason Cong
Computer Science Department
University of California, Los Angeles
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong
Joint work with C. C. Chang, D. Pan, and X. Yuan
* IBM Research

SLIDE 2

Motivation: How Far Can We Go in Each Clock Cycle

[Figure: chip map of the distance reachable per clock cycle, with marks at 7.52, 15.04, 22.56, and 24.9 mm and regions labeled 1 clock through 7 clock]

NTRS’97 0.07um Tech
5 GHz across-chip clock
620 mm2 (24.9mm x 24.9mm)
IPEM BIWS estimations
Buffer size: 100x
Driver/receiver size: 100x

From corner to corner: 7 clock cycles
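
The corner-to-corner figure can be checked with a short calculation. This is only a sketch: it assumes the wire is routed rectilinearly (so the corner-to-corner distance is twice the chip edge) and that each clock covers the 7.52 mm read from the figure. The function name `cycles_to_cross` is hypothetical.

```python
import math

CHIP_EDGE_MM = 24.9        # chip is 24.9 mm x 24.9 mm (from the slide)
REACH_PER_CLOCK_MM = 7.52  # distance covered in one clock (from the slide)

def cycles_to_cross(distance_mm, reach_mm=REACH_PER_CLOCK_MM):
    """Number of whole clock cycles needed to cover distance_mm."""
    return math.ceil(distance_mm / reach_mm)

# Assumption: rectilinear (Manhattan) routing from one corner to the opposite one.
corner_to_corner_mm = 2 * CHIP_EDGE_MM
print(cycles_to_cross(corner_to_corner_mm))
```

Under these assumptions the result agrees with the 7 clock cycles quoted on the slide.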

SLIDE 3

Solutions

Fully asynchronous designs
GALS (globally asynchronous, locally synchronous designs)
Latency-insensitive designs
Synchronous designs, with multi-cycle communications
  Much better understood
  Supported by the current tool set
  More energy efficient?

SLIDE 4

Interconnect-Centric IC Design Flow Under Development at UCLA

[Figure: design flow from Design Specification through Architecture/Conceptual-level Design to Final Layout; abstraction views: structure view, functional view, physical view, timing view; supported by Interconnect Performance Estimation Models (IPEM)]

HDM

Synthesis and Placement under Physical Hierarchy

Interconnect Planning
Physical Hierarchy Generation for Multi-Cycle Comm.
Interconnect Architecture Planning

Interconnect Optimization (TRIO)
Topology Optimization with Buffer Insertion
Wire sizing and spacing
Simultaneous Buffer Insertion and Wire Sizing
Simultaneous Topology Construction with Buffer Insertion and Wire Sizing

Interconnect Layout
Route Planning
Point-to-Point Gridless Routing

OWS, SDWS, BISWS

Interconnect Synthesis
Topology generation & wire sizing for delay
Wire ordering & spacing for noise control

SLIDE 5

Physical Hierarchy Generation

[Figure: logical hierarchy of hard IP and soft modules (same color for modules of the same logic hierarchy) mapped onto the physical hierarchy]

Assign modules to physical hierarchy
Defines global interconnects

Optimization objectives:
wire length minimization
routing congestion minimization
clock period, latency, performance (with consideration of multi-cycle comm.)

Physical Hierarchy = Placement bins + module locations

Physical Hierarchy Generation Problem Formulation

SLIDE 6

Need of Considering Retiming/Pipelining during Placement

Retiming/pipelining on global interconnects
  Multiple clock cycles are needed to cross the chip
  Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d; d(v)=1, WL=6, d(e) ∝ WL]
Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (a c b d): before retiming, φ = 4.0 (better initial placement)

SLIDE 7

Need of Considering Retiming during Placement

Retiming/pipelining on global interconnects
  Multiple clock cycles are needed to cross the chip
  Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d; d(v)=1, WL=6, d(e) ∝ WL]
Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0
Placement 2 (a c b d): before retiming, φ = 4.0 (better initial placement); after retiming, φ = 4.0

SLIDE 8

Difficulties

How to consider retiming/pipelining over global interconnects
  Flip-flop boundaries are not fixed during placement, so static timing analysis is difficult
  Answer: Use the concepts of c-retiming and sequential timing analysis (Seq-TA)
How to handle the high complexity of the combined problem
  Answer: Use the multi-level optimization technique

SLIDE 9

Simultaneous Coarse Placement with Retiming on Interconnects

Our solution
  Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
  Minimize the longest sequential path by improving the placement solution
Alternative solution [Brayton, et al]
  Enforcing all loop constraints during placement

SLIDE 10

Static Timing Analysis (STA)

[Figure: sequential circuit example with nodes a, b, c, d, e, f, g, and the DAG derived from it]

Sequential circuit example: PI: a, b. PO: g. Suppose d(v)=1, d(e)=2, and clock cycle φ = 11.
Transform the circuit into a DAG for static timing analysis.
Topological order: a, b, g, f, c, d, e
The arrival time (AT) and required time (RT) of each node are computed in linear time.

Node   a   b   g    f   c   d   e
AT     1   1   3    3   3   6   9
RT     9   9   11   9   3   6   9
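
A minimal sketch of the linear-time AT/RT computation follows. The slide's circuit topology is not recoverable from the extracted text, so the four-node DAG below (`x`, `w`, `y`, `z`) is hypothetical, as is the function name `static_timing`. It assumes a uniform node delay and a uniform edge delay, with AT and RT both measured at node outputs.

```python
from collections import defaultdict

def static_timing(order, edges, pis, d_node, d_edge, phi):
    """Return (AT, RT) dicts for a DAG whose nodes are given in topological order."""
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Forward pass: arrival time at the output of each node.
    at = {}
    for v in order:
        if v in pis:
            at[v] = d_node
        else:
            at[v] = max(at[u] + d_edge for u in preds[v]) + d_node

    # Backward pass: required time at the output of each node.
    rt = {}
    for v in reversed(order):
        if not succs[v]:
            rt[v] = phi  # sinks must settle within the clock cycle
        else:
            rt[v] = min(rt[s] - d_node - d_edge for s in succs[v])
    return at, rt

# Hypothetical DAG: x -> y -> z and w -> z, with d(v)=1, d(e)=2, phi=10.
order = ["x", "w", "y", "z"]
edges = [("x", "y"), ("y", "z"), ("w", "z")]
at, rt = static_timing(order, edges, pis={"x", "w"}, d_node=1, d_edge=2, phi=10)
slack = {v: rt[v] - at[v] for v in order}
print(at, rt, slack)
```

Each node and edge is visited once per pass, which is where the linear running time comes from.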

SLIDE 11

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)

Definition [Pan et al, TCAD98]
Given a clock period φ, transform circuit C into an edge-weighted, vertex-weighted graph G.
Label vertex v as l(v) = the weight of the longest path from PIs to v:
  l(v) = max{ l(u) - φ · w(u,v) + d(u,v) + d(v) }
l(v) is also called SAT(v).
Theorem: C can be retimed to φ + max{d(v)} iff l(POs) ≤ φ
Relation to retiming: r(v) = ⌈l(v) / φ⌉ - 1
Complexity is O(VE)

Example:
[Figure: nodes a and b driving node c, with node delays d(a), d(b), d(c) and register counts w(a,c)=1, w(b,c)=0]
d(a) = d(b) = 1, d(a,c) = d(b,c) = 2, φ = 5
l(a) = 7, l(b) = 3
l(c) = max{7+2-5·1+1, 3+2+1} = 6
Edge weights: wl(a,c) = d(e(a,c)) - φ · w(a,c), wl(b,c) = d(e(b,c)) - φ · w(b,c)
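
The l(c) example can be reproduced directly from the labeling rule. This is a small sketch; the helper name `sat_label` is hypothetical, and d(c) = 1 is inferred from the trailing "+1" in the slide's arithmetic.

```python
def sat_label(fanins, d_v, phi):
    """l(v) = max over fanins u of l(u) - phi*w(u,v) + d(u,v) + d(v).

    fanins: list of (l_u, w_uv, d_uv) tuples, one per fanin edge.
    """
    return max(l_u - phi * w_uv + d_uv + d_v for l_u, w_uv, d_uv in fanins)

# Values from the slide: l(a)=7, w(a,c)=1; l(b)=3, w(b,c)=0; d(a,c)=d(b,c)=2; phi=5.
# Assumption: d(c)=1.
l_c = sat_label([(7, 1, 2), (3, 0, 2)], d_v=1, phi=5)
print(l_c)  # 6
```

The register on edge (a,c) subtracts one clock period from the path through a, so the path through b determines l(c).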

SLIDE 12

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT)

[Figure: sequential circuit with nodes a, b, c, d, e, f, g; its retiming graph (not a DAG); and the retimed circuit. Edge labels as extracted: 2.5, 7, 2.5, 2.5, 2.5, 2.5 and 2, 2, 2]

d(v)=1, d(e)=2

Is φ = 4.5 possible?

Iter#   a   b   c      d     e     f    g
0       0   0   -∞     -∞    -∞    -∞   -∞
1       0   0   -1.5   -∞    -∞    -∞   -∞
2       0   0   -1.5   1.5   1.5   -∞   -∞
3       0   0   -1.5   1.5   4.5   0    0
4       0   0   -1.5   1.5   4.5   0    0
5       0   0   -1.5   1.5   4.5   0    0

Cycle time 4.5 is possible because l(g) ≤ 4.5

SLIDE 13

Continuous Retiming (c-retiming) and Sequential Arrival Time (SAT) (cont’d)

[Figure: sequential circuit with nodes a, b, c, d, e, f, g, and its retiming graph (not a DAG)]

d(v)=1, d(e)=2

Is φ = 2.5 feasible?

Iter#   a   b   c     d     e     f    g
0       0   0   -∞    -∞    -∞    -∞   -∞
1       0   0   0.5   -∞    -∞    -∞   -∞
2       0   0   0.5   3.5   3.5   -∞   -∞
3       0   0   0.5   3.5   6.5   4    4

Cycle time 2.5 is not feasible because l(g) > 2.5
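
The iteration tables can be mimicked with a Bellman-Ford style longest-path relaxation. This is a sketch under stated assumptions, not the authors' implementation: the three-node circuit below is hypothetical (the slide's topology is not recoverable from the text), PI labels start at 0 as in the tables, and the function name `seq_arrival_times` is made up for illustration.

```python
import math

def seq_arrival_times(nodes, edges, pis, pos, d, phi):
    """Iteratively compute l(v); return (labels, feasible).

    edges: list of (u, v, w_uv, d_uv), where w_uv is the register count on the edge.
    d: dict of node delays.
    """
    l = {v: -math.inf for v in nodes}
    for p in pis:
        l[p] = 0.0
    for _ in range(len(nodes)):
        changed = False
        for u, v, w_uv, d_uv in edges:
            if l[u] == -math.inf:
                continue
            cand = l[u] - phi * w_uv + d_uv + d[v]
            if cand > l[v]:
                l[v] = cand
                changed = True
        # Labels only grow, so a PO label above phi can never recover.
        if any(l[p] > phi for p in pos):
            return l, False
        if not changed:
            return l, True
    # No convergence after |V| passes: some loop has positive weight.
    return l, False

# Hypothetical circuit: x -> y -> z, plus a feedback edge z -> y through one register.
nodes = ["x", "y", "z"]
d = {"x": 0, "y": 1, "z": 1}
edges = [("x", "y", 0, 2), ("y", "z", 0, 2), ("z", "y", 1, 2)]

labels7, ok7 = seq_arrival_times(nodes, edges, ["x"], ["z"], d, phi=7)
labels5, ok5 = seq_arrival_times(nodes, edges, ["x"], ["z"], d, phi=5)
print(ok7, labels7["z"], ok5)
```

At most |V| passes over |E| edges are made, which matches the O(VE) complexity quoted for the labeling.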

SLIDE 14

Multi-Level Optimization Framework

[Figure: coarsening followed by uncoarsening & refinement (optimization); axes: problem sizes, levels]

Multi-level coarsening generates smaller problem sizes for top levels
  faster optimization on top levels
May explore different aspects of the solution space at different levels
Gradual refinement on good solutions from coarser levels is very efficient
Successful in many applications
  Originally developed for PDE
  Recent success in VLSI CAD: partitioning, placement, routing

SLIDE 15

Challenges

Previous Seq-TA can only handle single-output gates
In reality multi-output modules exist
  IP block, MUX, adders
  Clusters in the multi-level optimization process
How to integrate Seq-TA into multi-level coarse placement efficiently
Need to consider congestion and routability

SLIDE 16

Generalize c-retiming for Complex Combinational Modules

[Figure: complex module (combinational logic) with inputs vI0, vI1, vI2 and outputs vO0, vO1, multiple outputs, and non-uniform propagation delays 4, 11, 9, 3]

l1-value labeling for each vertex
  Reduce the non-uniform gate delay to a uniform gate delay by taking the max. internal delay as the gate delay: d’(v) = max { d(v(i, j)) }; in the example, d’(v) = 11
  l1(v) = weight of the longest path from PIs to v using d’(v) as uniform gate delay
  Each vertex has an l1-value label.
  Upper bound of the labeling

l2-value labeling for each output of a vertex
  Flatten/decompose the complex module by treating each pin of the module as a vertex with zero delay.
  l2(v_ot) = weight of the longest path from PIs to output o_t of v
  Each output of a vertex has an l2-value label.
  Lower bound of the labeling
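
The two labelings can be compared on a single module. This is a sketch only: the slide gives the internal delays 4, 11, 9, 3 but not which input-output pin pair each belongs to, so the pin mapping and the input arrival values below are assumptions, and `l1_label` and `l2_labels` are hypothetical names. Registers on edges are ignored to keep the example small.

```python
def l1_label(arrivals, pin_delays):
    """Upper bound: latest input arrival plus the max internal delay d'(v)."""
    d_prime = max(pin_delays.values())
    return max(arrivals.values()) + d_prime

def l2_labels(arrivals, pin_delays):
    """Lower bound: one label per output pin, using the actual pin-to-pin delays."""
    labels = {}
    for (i, o), delay in pin_delays.items():
        labels[o] = max(labels.get(o, float("-inf")), arrivals[i] + delay)
    return labels

# Assumed pin-to-pin delays (the pairing is not given on the slide).
pin_delays = {
    ("vI0", "vO0"): 4,
    ("vI1", "vO0"): 11,
    ("vI1", "vO1"): 9,
    ("vI2", "vO1"): 3,
}
# Assumed arrival labels at the module's input pins.
arrivals = {"vI0": 5, "vI1": 2, "vI2": 6}

l1 = l1_label(arrivals, pin_delays)
l2 = l2_labels(arrivals, pin_delays)
print(l1, l2)
```

With these numbers every output label l2 is no larger than l1, which is the relation the two bounds are meant to satisfy.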

SLIDE 17

Properties of Generalized c-retiming for Complex Combinational Modules

Theorem: If ∃ a PO_t with l2(PO_t) > Φ, then the circuit cannot be retimed to a clock period of Φ.

Theorem: If for every PO_i, l1(PO_i) ≤ Φ, then the circuit can be retimed to a clock period less than Φ+k, where k is the max. input-output delay of all gates.

Theorem: For any module v and its out-pin v_ot, l2(v_ot) ≤ l1(v).

Theorem: Given a circuit C, let Φ be the min. clock period achieved by retiming on C. If Cc is derived from C by performing clustering, and the min. clock period achieved by retiming on Cc is Φc, then Φ ≤ Φc.

SLIDE 18

Integrate Seq-TA with a Multi-level SA-based Coarse Placement

In coarsening phase, FFs can only be clustered after a certain level k
From level Ln to Lk+1, perform static timing analysis (where FFs are clustered)
From level Lk to L0, perform Seq-TA (where FFs are not clustered)

[Figure: levels L0, ..., Lk, ..., Ln; Initial Placement; Refinement by timing-driven SA-based coarse placement]

SLIDE 19

Area Density Problems in Multi-level Coarse Placement

Traditional area density control: cell area in each bin < bin area utilization with a small percentage of overflow
Does not work when cluster sizes may have significant variations and may be bigger than a bin
How about using different grid sizes for different levels of clustering?
  Hard to find fixed percentages that work
  Significant placement cost jump when switching grid sizes

SLIDE 20

Hierarchical Area Density Control

Use the same grid structure for placement for all clustering levels
Impose hierarchy on bin structure for area density control
Each cluster move must satisfy the area constraints on each level in the bin hierarchy
Area constraint for moving a cell of size A
  Allowed overflow on each level in the bin hierarchy = kA, k is a small constant (usually 1 or 2)
Works well in multi-level framework:
  Area constraints gradually tightened during optimization
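
One way to picture the per-level check is sketched below. The slide does not describe the data structure, so everything here is an assumption for illustration: a quad hierarchy in which level j groups 2^j x 2^j finest bins, a class named `BinHierarchy`, and a rule that a move of area A is accepted only if every enclosing block stays within its capacity plus kA.

```python
class BinHierarchy:
    """Hypothetical quad hierarchy of placement bins with per-level area checks."""

    def __init__(self, levels, bin_capacity, k=1):
        self.levels = levels
        self.bin_capacity = bin_capacity
        self.k = k
        self.used = [dict() for _ in range(levels)]  # used area per block, per level

    def _blocks(self, x, y):
        # The block containing finest bin (x, y) at each level of the hierarchy.
        for lv in range(self.levels):
            yield lv, (x >> lv, y >> lv)

    def can_place(self, x, y, area):
        for lv, key in self._blocks(x, y):
            capacity = self.bin_capacity * (4 ** lv)
            # Allowed overflow on every level is k * area of the moved cell.
            if self.used[lv].get(key, 0.0) + area > capacity + self.k * area:
                return False
        return True

    def place(self, x, y, area):
        if not self.can_place(x, y, area):
            return False
        for lv, key in self._blocks(x, y):
            self.used[lv][key] = self.used[lv].get(key, 0.0) + area
        return True

h = BinHierarchy(levels=2, bin_capacity=10, k=1)
results = [h.place(0, 0, 8), h.place(0, 0, 8), h.place(0, 0, 8), h.place(1, 0, 8)]
print(results)
```

Because the allowance scales with A, a large cluster can still enter an empty bin, while small cells near the finest level face a tight limit, which is consistent with the constraints tightening as clusters shrink.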

SLIDE 21

Fast Incremental A-tree Routing for Multi-pin Nets

Simple incremental A-tree
Recursively quad-partition grids
Each pin recursively connects to lower left corner of each level of partition
For a net with bounding box length B, at most 2 * log B edge updates for each pin move, except the root.
Each edge routed by LZ-router

[Figure: first quadrant; root (source pin)]

SLIDE 22

Fast LZ-routing for Two-pin Connections

Decide HVH or VHV: select the less congested layer
Binary search on V-stem (or H-stem)
  Initial left region and right region to cover bounding box
  Repeat
    Query wire usage on both regions
    Select region with less congestion
Wire usage query can be done in O(log grid_size)

[Figure: left region and right region; HVH and VHV routes]
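
The binary search on the V-stem can be sketched as follows. This is an illustration, not the router from the slides: it answers region queries with prefix sums over column totals rather than the O(log grid_size) structure mentioned above, and the function name `choose_v_stem` and the `usage[y][x]` grid layout are assumptions.

```python
def choose_v_stem(usage, x1, x2, y1, y2):
    """Pick a column for the vertical stem of an HVH route by repeated halving.

    usage[y][x] is the current wire usage of grid bin (x, y).
    """
    xlo, xhi = sorted((x1, x2))
    ylo, yhi = sorted((y1, y2))

    # Column totals inside the bounding box, then prefix sums for range queries.
    cols = [sum(usage[y][x] for y in range(ylo, yhi + 1)) for x in range(xlo, xhi + 1)]
    prefix = [0]
    for c in cols:
        prefix.append(prefix[-1] + c)

    def region(a, b):
        # Total usage of columns a..b (inclusive), relative to xlo.
        return prefix[b + 1] - prefix[a]

    lo, hi = 0, len(cols) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        # Keep the half with less wire usage.
        if region(lo, mid) <= region(mid + 1, hi):
            hi = mid
        else:
            lo = mid + 1
    return xlo + lo

usage = [[5, 1, 3, 0] for _ in range(3)]
print(choose_v_stem(usage, 0, 3, 0, 2))  # 3
```

The search is a heuristic: it follows the less congested half at each step, so it does not guarantee the globally least congested column.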

SLIDE 23

Placement Cost Functions

Wire length driven: summation of net bounding boxes of all nets
Congestion driven:
  Wire usages estimated from the fast global router
  Cost = summation of squares of wire usages in all bins
  For fixed wire width, cost is equivalent to the summation of weighted wire length, where weight on a bin = wire usage of the bin
For congestion-driven run: only turns on congestion-driven cost at the finest placement level

[Figure: 3x3 grid of bins with wire usages W1 through W9]
Congestion cost = W1^2 + W2^2 + … + W9^2
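
The congestion cost is a plain sum of squared bin usages. A minimal sketch, with made-up usage values for a 3x3 grid and a hypothetical function name:

```python
def congestion_cost(wire_usage):
    """Sum of squared wire usages over all bins of a 2-D grid."""
    return sum(w * w for row in wire_usage for w in row)

# Hypothetical wire usages W1..W9 for a 3x3 grid of bins.
grid = [
    [1, 2, 3],
    [0, 1, 0],
    [2, 0, 1],
]
print(congestion_cost(grid))  # 20
```

Squaring penalizes a few heavily used bins more than the same total usage spread evenly, which is why the cost steers wires away from hot spots.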

SLIDE 24

Experimental Results on Wire Length Minimization

Multi-level simulated annealing coarse placement
Wire length comparison with GORDIAN-L:
Our engine only turns on wire length optimization
Legalized by DOMINO for wire length comparison

Our multi-level engine performs well for big circuits

20k-50k test cases: avqlarge, avqsmall, ibm04, ibm07
50k-100k test cases: ibm09, ibm10
100k-210k test cases: ibm14, ibm15, ibm16, ibm17, ibm18

mPG+DOM/GOR+DOM Comparison

Circuit size   Wire length   CPU time
20k-50k        97%           81%
50k-100k       100%          43%
100k-210k      96%           22%

SLIDE 25

Experimental Results on Congestion Control

            BBOX WL   Routed WL   Max boundary congestion   Total overflow   CPU
mPG         1         1           1                         1                1
mPG-cg      1.05      0.94        0.87                      0.21             18.9
mPG-cg.rd   1.05      0.97        0.93                      0.47             6.1

Test cases: ibm01, ibm04, ibm07, ibm11, ibm13, ibm15

mPG: wire length driven mode
mPG-cg: congestion driven at finest clustering level
mPG-cg.rd: alternative congestion driven + wire length driven at finest clustering level

SLIDE 26

Initial Experimental Result on Impact of Simultaneous Retiming and Placement

                                 WL-driven placement                        Simultaneous retiming and placement
circuit   #gates   Grid size    dly (before retiming)   dly (after retiming)   dly
S38584    13209    8x8          41                      38                     32
Ind1      29780    16x16        349                     349                    325
Ind2      26060    16x16        51                      39                     35
Ind3      52197    16x16        582                     577                    497
Ind4      101531   16x16        121                     115                    84
Avg.                            1                       0.93                   0.79

SLIDE 27

Limitation of Exploring Multi-cycle Interconnect Communication during Logic Synthesis

The minimum clock period that can be achieved by logic optimization is bounded by the max. delay-to-register (DR) ratio of the loops in the circuit
Requires consideration of multi-cycle communication during architecture & behavior synthesis

In a loop, 4 logic cells, 2 registers
Cell delay =1ns
Interconnect delay=1ns
DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2=4ns
Clock cycle >= 4ns
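
The loop bound above is simple arithmetic. A small sketch, assuming the loop has four interconnect segments (which is what the slide's Dint = 4 implies); the function name `dr_ratio` is hypothetical:

```python
def dr_ratio(n_cells, cell_delay_ns, n_wires, wire_delay_ns, n_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    d_logic = n_cells * cell_delay_ns
    d_int = n_wires * wire_delay_ns
    return (d_logic + d_int) / n_registers

# Loop from the slide: 4 logic cells and (assumed) 4 wires at 1 ns each, 2 registers.
bound_ns = dr_ratio(4, 1.0, 4, 1.0, 2)
print(bound_ns)  # 4.0
```

No retiming can beat this bound, because moving registers around a loop changes neither its total delay nor its register count.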

SLIDE 28

Regular Distributed Register Architecture (1)

[Figure: array of islands connected by global interconnect; each island (width Wi, height Hi) holds a function unit cluster (FUC, e.g. ADD, MUX, DIV) and a register file; cluster with area constraint]

D_intra-island = D_logic + D_int-opt ≤ D_logic + D_int-opt(2Wi + 2Hi) ≤ T

Distribute registers to each “island”
Local computation and communication in each island can be done in a single clock cycle
But registers may need to be inserted along global interconnects for multi-cycle communication (less regular)

SLIDE 29

Regular Distributed Register Architecture (2)

[Figure: array of islands connected by global interconnect; each island (width Wi, height Hi) holds a function unit cluster (FUC, e.g. ADD, MUX, DIV) and a register file with banks labeled 1 cycle, 2 cycle, ..., k cycle; cluster with area constraint]

D_intra-island = D_logic + D_int-opt ≤ D_logic + D_int-opt(2Wi + 2Hi) ≤ T

Registers in each island are partitioned to k banks for 1 cycle, 2 cycle, … k cycle interconnect communication in each island
Highly regular

SLIDE 30

Example: Regular Distributed Register Architecture for 70nm Technology

NTRS’97 70nm Tech
Chip dimension: 620 mm2 (24.9mm x 24.9mm)
5 GHz across-chip clock
Wire can travel up to 7.52mm within 1 clock cycle under interconnect optimization
Need 7 clock cycles to cross the chip
Each island base dimension
  Wi = Hi = 2.08mm
  = critical length (longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size: 2x, driver/receiver size: 2x
  ≈ 1/3 of distance a wire can travel in 1 clock cycle
  Logic volume: 6.76M min-size 2-NAND gates
12x12 island-base array
Local registers are partitioned to 7 banks

SLIDE 31

Example: Impact of Interconnect on Scheduling

[Figure: data flow graph with operation nodes 1 to 12; nodes 3, 4, 7, 8, 11, and 12 are multiplications (*); nodes with the same color are assigned to the same functional unit]

Data flow graph extracted from discrete cosine transformation (DCT)
The delay of a * operation is 2ns; the delay of + and – operations is 1ns.
The resources available are 2 multipliers and 2 ALUs.

Wirelength-driven Placement

[Figure: placement of functional units with register files: Alu1 (1, 5, 10), Alu2 (2, 6, 9), Mul2 (3, 7, 12), Mul1 (4, 8, 11); markers distinguish long and short interconnects]

The long interconnect delay is 2ns.
The short interconnect delay is 1ns.

SLIDE 32

Single-cycle vs. Multi-cycle Interconnect Communication

Single-cycle interconnect communication
  Scheduled in 6 clock cycles
  Clock period is 4ns
  Total latency is 24ns

[Figure: DFG nodes 1 to 12 scheduled over Cycle 1 to Cycle 6; a marker represents registers]

Multi-cycle interconnect communication
  Scheduled in 9 clock cycles
  Clock period is 2ns
  Total latency is 18ns

[Figure: DFG nodes 1 to 12 scheduled over Cycle 1 to Cycle 9]
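
The latency comparison is cycles times clock period. A tiny sketch using only the numbers stated on this slide; the function name `total_latency_ns` is hypothetical:

```python
def total_latency_ns(cycles, clock_period_ns):
    """Total schedule latency = number of clock cycles x clock period."""
    return cycles * clock_period_ns

single_cycle = total_latency_ns(6, 4)  # single-cycle interconnect communication
multi_cycle = total_latency_ns(9, 2)   # multi-cycle interconnect communication
print(single_cycle, multi_cycle)       # 24 18
```

The multi-cycle schedule uses more cycles, but the shorter clock period more than compensates, so the total latency drops.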

SLIDE 33

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

Simultaneous Placement and Scheduling

[Figure: placement of functional units with register files: Alu1 (1, 5, 10), Alu2 (2, 6, 9), Mul2 (3, 7, 12), Mul1 (4, 8, 11)]

With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns. The total latency is 16ns.

[Figure: DFG nodes 1 to 12 scheduled over Cycle 1 to Cycle 8]

SLIDE 34

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

Simultaneous Placement, Scheduling and Binding

[Figure: placement of functional units with register files: Alu1 (1, 5, 10), Alu2 (2, 6, 9), Mul2 (3, 7, 11), Mul1 (4, 8, 12)]

With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns. The total latency is 14ns.

[Figure: DFG nodes 1 to 12 scheduled over Cycle 1 to Cycle 7]

SLIDE 35

Example: Multicluster Architectures of DEC Alpha 21264

Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning by Keith I. Farkas, et al

SLIDE 36

Conclusions

Multi-cycle communication is needed for gigahertz designs
Sequential timing analysis + multilevel optimization enables efficient retiming/pipelining over global interconnects
Regular distributed register (RDR) fabric provides regularity to support
  Multicycle communication
  Integrated resource binding, scheduling, and physical planning