Regular Distributed Register Fabric and Synthesis for Multi-Cycle Communications



slide-1
SLIDE 1

Regular Distributed Register Fabric and Synthesis for Multi-Cycle Communications

Jason Cong, Yiping Fan, Xun Yang and Zhiru Zhang

{cong, fanyp, yangxun, zhiruz}@cs.ucla.edu

Department of Computer Science
University of California, Los Angeles

Partially supported by NSF under award CCR-0096383, MARCO/DARPA GSRC, and Altera Corp. under the California MICRO program.

slide-2
SLIDE 2

Outline

  • Needs for Multi-Cycle On-Chip Communication
  • Contributions
  • Regular Distributed Register (RDR) Architecture
  • MCAS: Architectural Synthesis for Multi-Cycle Communication
    • Scheduling-driven placement
    • Placement-driven rescheduling & rebinding
  • Experimental Results
  • Conclusions & Future Work

slide-3
SLIDE 3

Needs for Multi-Cycle On-Chip Communication

  • Interconnect delays dominate the timing in deep-submicron (DSM) technologies.
  • Single-cycle full-chip synchronization is no longer possible.

[Figure: reach of a signal across the chip in 1 to 7 clock cycles; marked distances: 7.52 mm, 15.04 mm, 22.56 mm, 24.9 mm]

NTRS’97 0.07 um technology
5 GHz across-chip clock
Chip area: 620 mm2 (24.9 mm x 24.9 mm)
IPEM BIWS estimations
Buffer size: 100x; driver/receiver size: 100x

From corner to corner: 7 clock cycles

  • Source: J. Cong, “Timing Closure Based on Physical Hierarchy,” ISPD’02.
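The 7-cycle corner-to-corner figure can be checked with a short calculation. The sketch below assumes Manhattan routing (one horizontal plus one vertical traversal of the 24.9 mm chip edge) and a reach of 7.52 mm per clock cycle; the function name is illustrative and not from the slides.

```python
import math

def cycles_to_cross(chip_edge_mm: float, reach_per_cycle_mm: float) -> int:
    """Clock cycles for a corner-to-corner signal, assuming Manhattan routing."""
    # Assumption: the path is one full horizontal plus one full vertical edge.
    corner_to_corner_mm = 2 * chip_edge_mm
    return math.ceil(corner_to_corner_mm / reach_per_cycle_mm)

print(cycles_to_cross(24.9, 7.52))  # 49.8 mm / 7.52 mm per cycle, rounded up
```

Under these assumptions the result agrees with the 7 clock cycles quoted on the slide.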

slide-4
SLIDE 4

Multi-Cycle Interconnect Communication at Logic / Physical Level

  • Simultaneous retiming + placement / floorplanning
    • [Cong et al, ICCAD’00] [Cong et al, DAC’03]
    • [Chong & Brayton, IWLS’01]
    • [Singh & Brown, FPGA’02]
  • Limitation:
    • The minimum clock period that can be achieved by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit
  • Example: a loop with 4 logic cells and 2 registers
    • Cell delay = 1ns
    • Interconnect delay = 4ns
    • DR ratio = (Dlogic + Dint) / #Registers = (4 + 16) / 2 = 10ns
    • Clock cycle >= 10ns
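The delay-to-register bound can be written as a small function. This is a minimal sketch using the numbers from the loop example; it assumes four interconnect segments of 4 ns each (matching Dint = 16), and the function name is hypothetical.

```python
def dr_ratio(cell_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: total delay divided by register count."""
    return (sum(cell_delays_ns) + sum(wire_delays_ns)) / num_registers

# Loop from the slide: 4 cells at 1 ns, 4 wire segments at 4 ns (assumed), 2 registers.
bound_ns = dr_ratio([1.0] * 4, [4.0] * 4, 2)
print(bound_ns)  # lower bound on the clock period reachable by retiming this loop
```

For a whole circuit, the bound is the maximum of this ratio over all loops.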
slide-5
SLIDE 5

Our Contributions

  • Regular Distributed Register (RDR) micro-architecture
    • Highly regular
    • Direct support of multi-cycle on-chip communication
  • MCAS: Architectural Synthesis for Multi-cycle Communication
    • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
    • Targets the RDR architecture

slide-6
SLIDE 6

Regular Distributed Register Architecture (1)

  • Distribute registers to each “island”
  • Choose the island size such that local computation and communication in each island can be done in a single cycle:

$$D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{int-opt}} \le D_{\text{logic}} + D_{\text{int-opt}}\bigl(2(W_i + H_i)\bigr) \le T$$

[Figure: array of islands connected by a global interconnect. Each island (width Wi, height Hi) contains a Local Computational Cluster (LCC), a register file, and an island FSM. The LCC is a cluster of functional units (ADD, MUX, MUL, …) with an area constraint.]

slide-7
SLIDE 7

Regular Distributed Register Architecture (2)

  • Use register banks:
    • Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island
  • Highly regular

[Figure: array of islands connected by a global interconnect, each with an LCC, a register file, and an island FSM; the register file of an island is split into banks labeled 1 cycle, 2 cycle, …, k cycle.]

slide-8
SLIDE 8

Example: Regular Distributed Register Architecture for 70nm Technology

NTRS’97 70nm technology
Chip dimension: 620 mm2 (24.9mm x 24.9mm)
5 GHz across-chip clock

  • Can travel up to 7.52mm within 1 clock cycle under best interconnect optimization
  • Need 7 clock cycles to cross the chip

Each island base dimension

  • Wi = Hi = 2.08mm
  • ≈ 1/3 of the distance a wire can travel in 1 clock cycle
  • Logic volume: 6.76M min-size 2-NAND gates

12x12 array of islands
Local registers are partitioned into 7 banks
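The array size and bank count follow from the chip and island dimensions. The sketch below reproduces them and adds a hypothetical rule for picking a bank: the number of cycles needed to cover the Manhattan distance between two islands. That rule, the island coordinates, and all function names are assumptions for illustration, not taken from the slides.

```python
import math

CHIP_EDGE_MM = 24.9
ISLAND_EDGE_MM = 2.08
REACH_PER_CYCLE_MM = 7.52

def islands_per_side(chip_edge_mm=CHIP_EDGE_MM, island_edge_mm=ISLAND_EDGE_MM):
    """Number of islands along one chip edge."""
    return round(chip_edge_mm / island_edge_mm)

def num_banks(chip_edge_mm=CHIP_EDGE_MM, reach_mm=REACH_PER_CYCLE_MM):
    """One bank per possible transfer length, up to a corner-to-corner transfer."""
    return math.ceil(2 * chip_edge_mm / reach_mm)

def bank_for_transfer(src, dst):
    """Assumed rule: bank k serves transfers that need k cycles (at least 1)."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])  # Manhattan distance in islands
    distance_mm = hops * ISLAND_EDGE_MM
    return max(1, math.ceil(distance_mm / REACH_PER_CYCLE_MM))

print(islands_per_side(), num_banks())
print(bank_for_transfer((0, 0), (3, 1)), bank_for_transfer((0, 0), (11, 11)))
```

With these numbers the sketch yields a 12x12 array and 7 banks, consistent with the slide.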

slide-9
SLIDE 9

RDR Architecture vs. DRA

  • Distributed Register File Architecture (DRA)
    • Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD’01]
  • Similarities:
    • Distribute registers near the local computational units
    • Supports multi-cycle communication
    • Allows concurrent computation and communication
  • Distinction:
    • The RDR architecture is highly regular
    • Facilitates interconnect delay estimation
    • Enables the systematic exploration of cycle-time/latency tradeoff by varying the size of the basic island

slide-10
SLIDE 10

Example: Impact of Interconnect on Scheduling

[Figure: data flow graph extracted from discrete cosine transformation (DCT), with ALU operations (nodes 1, 2, 5, 6, 9, 10) and multiplications (nodes 3, 4, 7, 8, 11, 12). The nodes with the same color are assigned to the same functional unit.]

resource      num   delay
alu           2     1 ns
multiplier    2     2 ns

Long interconnect: 2 ns
Short interconnect: 1 ns

Performance-driven Placement

[Figure: placement of the functional units, each with its register file, in the LCCs: Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 8, 11), Mul2 (nodes 3, 7, 12)]

slide-11
SLIDE 11

Single-cycle vs. Multi-cycle Interconnect Communication

  • Single-cycle interconnect communication
    • Scheduled in 6 clock cycles
    • Clock period is 4ns
    • Total latency is 24ns
  • Multi-cycle interconnect communication
    • Scheduled in 9 clock cycles
    • Clock period is 2ns
    • Total latency is 18ns

[Figure: the two schedules of the DCT data flow graph (nodes 1 to 12), over Cycle1 to Cycle6 and over Cycle1 to Cycle9; a marker on the edges represents registers.]
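The trade-off on this slide is that total latency is the number of clock cycles times the clock period, so more cycles can still win if the period shrinks enough. A minimal sketch with the slide's numbers (the function name is illustrative):

```python
def total_latency_ns(cycles: int, clock_period_ns: float) -> float:
    """Total schedule latency: number of clock cycles times the clock period."""
    return cycles * clock_period_ns

single_cycle = total_latency_ns(6, 4)  # every transfer must fit in one 4 ns cycle
multi_cycle = total_latency_ns(9, 2)   # long transfers take extra 2 ns cycles

print(single_cycle, multi_cycle)
```

Here the multi-cycle schedule uses three more cycles but halves the clock period, which reduces the latency from 24 ns to 18 ns.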
slide-12
SLIDE 12

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with a clock period of 2ns. The total latency is 16ns.

[Figure: schedule of the DFG (nodes 1 to 12) over Cycle1 to Cycle8, and the placement produced by simultaneous placement and scheduling, each unit with its register file: Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 8, 11), Mul2 (nodes 3, 7, 12)]

slide-13
SLIDE 13

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with a clock period of 2ns. The total latency is 14ns.

[Figure: schedule of the DFG (nodes 1 to 12) over Cycle1 to Cycle7, and the placement produced by simultaneous placement, scheduling and binding, each unit with its register file: Alu1 (nodes 1, 5, 10), Alu2 (nodes 2, 6, 9), Mul1 (nodes 4, 8, 12), Mul2 (nodes 3, 7, 11)]
slide-14
SLIDE 14

MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture

[Figure: MCAS flow, illustrated on the DCT data flow graph]

Inputs: C / VHDL; RDR arch. spec.; target clock period; resource constraints

  • CDFG generation (produces the CDFG)
  • Resource allocation
  • Functional unit binding (produces the Interconnected Component Graph, ICG)
  • Scheduling-driven placement (produces location information)
  • Placement-driven rebinding & scheduling
  • Register and port binding
  • Datapath & FSM generation

Outputs: RTL VHDL files; floorplan constraints; multi-cycle path constraints

[Figure: example ICG over Mult1, Mult2, Alu1, Alu2; the initial binding Alu1 (1,5,10), Alu2 (2,6,9), Mul1 (4,8,11), Mul2 (3,7,12); and, after rebinding, Mul1 (4,8,12), Mul2 (3,7,11) with a schedule over Cycle1 to Cycle7]

slide-15
SLIDE 15

MCAS: Scheduling-Driven Placement (1)

  • Basic approach:
    • Integrate scheduling with an SA-based coarse placement [Chang et al, ISPD’02]
    • Hide critical data transfers inside islands (intra-island) by reducing weighted wirelength
  • Distinction between our placement and conventional performance-driven placement:
    • Problem size: relatively small (<10^3) vs. huge
    • Input: ICG (general graph) vs. netlist (acyclic graph)
    • Objective: to minimize # of clock cycles vs. clock period
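A weighted-wirelength cost of the kind an SA-based placer would minimize can be sketched as follows. The island coordinates, edge weights, and function name are hypothetical; the slides do not give the exact cost function.

```python
def weighted_wirelength(edges, location):
    """Sum of weight * Manhattan distance over ICG edges.

    edges: iterable of (unit_a, unit_b, weight)
    location: unit -> (column, row) of the island it is placed in
    """
    total = 0.0
    for a, b, weight in edges:
        (x1, y1), (x2, y2) = location[a], location[b]
        total += weight * (abs(x1 - x2) + abs(y1 - y2))
    return total

# Hypothetical ICG: the Mul1-Alu1 edge is critical, so it carries a higher weight.
edges = [("Mul1", "Alu1", 3.0), ("Mul2", "Alu2", 1.0), ("Alu1", "Alu2", 1.0)]
before = {"Alu1": (0, 0), "Alu2": (1, 0), "Mul1": (1, 1), "Mul2": (0, 1)}
after = {"Alu1": (0, 0), "Alu2": (1, 0), "Mul1": (0, 1), "Mul2": (1, 1)}

print(weighted_wirelength(edges, before), weighted_wirelength(edges, after))
```

Swapping the two multipliers shortens the heavily weighted edge, so the cost drops and an annealer using this cost would favor the second placement.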

slide-16
SLIDE 16

MCAS: Scheduling-Driven Placement (2)

  • Scheduling-based timing analysis
    • Timing analysis is performed on the original CDFG instead of the ICG
    • Instead of the classical timing analysis, a fast list scheduling is performed on the CDFG at every temperature during the SA process to identify critical edges in the ICG and assign higher weights to them

[Figure: loop between "ICG Placement" and "Timing Analysis by Scheduling" on the DCT data flow graph (nodes 1 to 12), with a "Weight assignment" step feeding back into the placement of Alu1 (1,5,10), Alu2 (2,6,9), Mul1 (4,8,11), Mul2 (3,7,12)]
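The idea of timing analysis by scheduling can be sketched with a simplified scheduler. The code below is not the MCAS algorithm: it is a greedy pass that visits operations in a fixed priority order (a simplification of list scheduling), charges a hypothetical one cycle per island hop for transfers, and reports inter-unit edges with zero slack as candidates for higher weights.

```python
def schedule(ops, deps, unit_of, op_cycles, comm_cycles):
    """Greedy schedule; ops must be listed in a valid (topological) priority order."""
    preds = {op: [] for op in ops}
    for u, v in deps:
        preds[v].append(u)
    start, finish, unit_free = {}, {}, {}
    for op in ops:
        unit = unit_of[op]
        # Data is ready once every predecessor finished and its result arrived.
        ready = max(
            (finish[p] + comm_cycles(unit_of[p], unit) for p in preds[op]),
            default=0,
        )
        start[op] = max(ready, unit_free.get(unit, 0))  # wait for the unit as well
        finish[op] = start[op] + op_cycles[op]
        unit_free[unit] = finish[op]
    return start, finish

def critical_edges(deps, start, finish, unit_of, comm_cycles):
    """Inter-unit edges whose transfer delay directly determines the consumer's start."""
    return [
        (u, v) for u, v in deps
        if unit_of[u] != unit_of[v]
        and start[v] == finish[u] + comm_cycles(unit_of[u], unit_of[v])
    ]

# Hypothetical placement and delay model: one cycle per island hop.
ISLAND = {"Mul1": (0, 0), "Mul2": (2, 1), "Alu1": (0, 1)}

def comm_cycles(a, b):
    (x1, y1), (x2, y2) = ISLAND[a], ISLAND[b]
    return abs(x1 - x2) + abs(y1 - y2)

ops = ["m1", "m2", "a1"]
deps = [("m1", "a1"), ("m2", "a1")]
unit_of = {"m1": "Mul1", "m2": "Mul2", "a1": "Alu1"}
op_cycles = {"m1": 1, "m2": 1, "a1": 1}

start, finish = schedule(ops, deps, unit_of, op_cycles, comm_cycles)
print(start, critical_edges(deps, start, finish, unit_of, comm_cycles))
```

In this toy case only the transfer from Mul2 to Alu1 is critical, so a placer would raise the weight of the Mul2-Alu1 edge in the ICG and try to move those two units closer together.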

slide-17
SLIDE 17

MCAS: Simultaneous Rescheduling & Rebinding (1)

  • Simultaneous list scheduling and binding to minimize total schedule latency
  • Previous approach [Jeon et al, ASPDAC’01]
    • cpl(i, j) = critical path length of the fanout cone rooted at node i, when node i is bound to functional unit j
    • Perform list scheduling using the priority function min_j(cpl(i, j))
    • Bind the node to the functional unit j with the minimum cpl(i, j) at the earliest feasible control step

[Figure: example data flow graph (labels as extracted: X48, +3, *1, X40, +4, *2, *6, *5, 7, +8) annotated with est(i, j) and cpl(i, j)]
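The quantity cpl(i, j) can be illustrated with a small recursive computation. This sketch makes two assumptions that are not stated on the slide: successors of node i keep their current binding, and each functional unit has a fixed delay. The unit names, delays, and graph are hypothetical.

```python
from functools import lru_cache

def make_cpl(succs, unit_delay, bound_unit):
    """Build cpl(node, unit) for a DAG given as node -> list of successors."""

    @lru_cache(maxsize=None)
    def downstream(node):
        # Longest delay from the output of `node` to any sink, under current bindings.
        return max(
            (unit_delay[bound_unit[s]] + downstream(s) for s in succs.get(node, ())),
            default=0,
        )

    def cpl(node, unit):
        # Critical path length of the fanout cone when `node` runs on `unit`.
        return unit_delay[unit] + downstream(node)

    return cpl

# Hypothetical example: n1 feeds n3; n1 may run on a fast or a slow multiplier.
succs = {"n1": ["n3"], "n3": []}
unit_delay = {"mul_fast": 2, "mul_slow": 3, "alu": 1}
bound_unit = {"n3": "alu"}
candidates = {"n1": ["mul_fast", "mul_slow"]}

cpl = make_cpl(succs, unit_delay, bound_unit)
best_unit = min(candidates["n1"], key=lambda j: cpl("n1", j))
priority = cpl("n1", best_unit)  # min over j of cpl(n1, j)
print(best_unit, priority)
```

The minimum over candidate units serves both as the list-scheduling priority of the node and as the choice of unit to bind it to.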
slide-18
SLIDE 18

MCAS: Simultaneous Rescheduling & Rebinding (2)

  • Our contributions
    • Use force-directed list scheduling and binding with interconnect delay estimation
    • Consider resource constraints
      • During scheduling (for selecting deferred nodes)
      • During binding (as part of scheduling process)

slide-19
SLIDE 19

Experiment Settings

[Figure: experimental flow with three numbered variants (1, 2, 3). Steps shown: CDFG generation (from C / VHDL, producing the CDFG); functional unit allocation & binding (producing the interconnected component graph); scheduling; scheduling-driven placement (producing location information); placement-driven scheduling; placement-driven rebinding & rescheduling; register and port binding; datapath & FSM generation; Altera FPGA development system.]

Inputs: RDR arch. spec.; target clock period
Outputs: RTL VHDL files; floorplan constraints; multi-cycle path constraints

slide-20
SLIDE 20

Experimental Results (1)

Cycle number, clock period, and overall latency comparison

             Flow 1                    Flow 2                    Flow 3
Design       CS   CP (ns)  Lat (ns)    CS   CP (ns)  Lat (ns)    CS   CP (ns)  Lat (ns)
pr           27   5.79     156.33      29   3.53     102.37      29   3.66     106.14
wang         14   7.54     105.56      20   4.14     82.8        20   3.81     76.2
lee          20   6.25     125         27   3.36     90.72       26   3.38     87.88
mcm          34   7.64     259.76      39   4.81     187.59      38   4.57     173.66
honda        23   7.58     174.34      24   3.78     90.72       24   4.18     100.32
dir          50   7.03     351.5       51   4.41     224.91      51   4.33     220.83
chem         50   8.27     413.5       53   4.64     245.92      52   4.49     233.48
u5ml12       68   9.3      632.4       70   5.34     373.8       70   4.3      301
Ave Ratio    1    1        1           1.14 0.57     0.65        1.13 0.56     0.63

CS: cycle number; CP: clock period; Lat: overall latency

Flow 1: Conventional approach
Flow 2: Scheduling-driven placement
Flow 3: Scheduling-driven placement + placement-driven rebinding & rescheduling
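The "Ave Ratio" row for latency can be recomputed from the table. The sketch below assumes latency is the cycle number times the clock period, which matches the Lat column, and that the ratio is averaged per benchmark against Flow 1; the data structure and function name are illustrative.

```python
# Per benchmark: (CS, CP in ns) for Flow 1, Flow 2, Flow 3, copied from the table.
RESULTS = {
    "pr":     [(27, 5.79), (29, 3.53), (29, 3.66)],
    "wang":   [(14, 7.54), (20, 4.14), (20, 3.81)],
    "lee":    [(20, 6.25), (27, 3.36), (26, 3.38)],
    "mcm":    [(34, 7.64), (39, 4.81), (38, 4.57)],
    "honda":  [(23, 7.58), (24, 3.78), (24, 4.18)],
    "dir":    [(50, 7.03), (51, 4.41), (51, 4.33)],
    "chem":   [(50, 8.27), (53, 4.64), (52, 4.49)],
    "u5ml12": [(68, 9.30), (70, 5.34), (70, 4.30)],
}

def average_latency_ratio(flow_index: int) -> float:
    """Mean over benchmarks of latency(flow) / latency(Flow 1); index 0 is Flow 1."""
    ratios = []
    for flows in RESULTS.values():
        base_cs, base_cp = flows[0]
        cs, cp = flows[flow_index]
        ratios.append((cs * cp) / (base_cs * base_cp))
    return sum(ratios) / len(ratios)

print(round(average_latency_ratio(1), 2), round(average_latency_ratio(2), 2))
```

The recomputed averages round to the 0.65 and 0.63 reported for Flow 2 and Flow 3, i.e. roughly a 35% and 37% latency reduction over the conventional flow despite the higher cycle counts.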

slide-21
SLIDE 21

Experimental Results (2)

Total latency comparison

[Figure: bar chart of latency (ns, axis from 100 to 700) for Flow 1, Flow 2, and Flow 3 on the benchmarks pr, wang, lee, mcm, honda, dir, chem, u5ml12]

Flow 1: Conventional approach
Flow 2: Scheduling-driven placement
Flow 3: Scheduling-driven placement + placement-driven rebinding & rescheduling

slide-22
SLIDE 22

Synopsys Flow: Behavioral Compiler vs. MCAS

[Figure: comparison flow starting from an equal high-level data flow description. Tools and artifacts shown: VHDL, C, Behavioral Compiler, Design Compiler, MCAS, RTL VHDL, Stratix-mapped VHDL, Quartus, Modelsim, gcc, VHDL output for simulation, report.]

slide-23
SLIDE 23

Experimental Results (3)

MCAS basic flow vs. Synopsys’ Behavioral Compiler

Synopsys Behavioral Compiler setting: default (optimizing latency)
Average latency ratio of MCAS vs. BC: 76%

[Figure: two bar charts comparing Synopsys BC and MCAS on pr, wang, mcm, honda: "Latency" (axis from 0 to 600) and "Resource" (axis from 1000 to 7000)]

Design   Flow          Cycles  Reg   ALU  MULT  fmax (MHz)  LE     Latency (ns)  MCAS vs. BC
pr       Synopsys BC   25      28    5    8     90.31       2945   276.82
pr       MCAS          27      34    6    2     96.74       2476   279.10        100.82%
wang     Synopsys BC   29      36    7    8     83.61       3605   346.85
wang     MCAS          14      35    5    8     103.76      4242   134.93        38.90%
mcm      Synopsys BC   43      142   23   7     79.65       6253   539.86
mcm      MCAS          34      35    6    3     72.05       3876   471.89        87.41%
honda    Synopsys BC   29      44    8    14    85.14       6128   340.62
honda    MCAS          23      42    6    8     87.11       5523   264.03        77.52%
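The latency column appears to be the cycle count divided by fmax, and the last column the ratio of the MCAS latency to the BC latency. That relation is inferred from the numbers rather than stated on the slide; the sketch below recomputes it under that assumption, with illustrative names.

```python
# Per design: (cycles, fmax in MHz) for each flow, copied from the table.
ROWS = {
    "pr":    {"BC": (25, 90.31), "MCAS": (27, 96.74)},
    "wang":  {"BC": (29, 83.61), "MCAS": (14, 103.76)},
    "mcm":   {"BC": (43, 79.65), "MCAS": (34, 72.05)},
    "honda": {"BC": (29, 85.14), "MCAS": (23, 87.11)},
}

def latency_ns(cycles: int, fmax_mhz: float) -> float:
    """Latency in ns: cycles times the clock period (1000 / fmax in MHz)."""
    return cycles * 1000.0 / fmax_mhz

ratios = {
    design: latency_ns(*row["MCAS"]) / latency_ns(*row["BC"])
    for design, row in ROWS.items()
}
average_ratio = sum(ratios.values()) / len(ratios)

for design, ratio in ratios.items():
    print(f"{design}: {ratio:.2%}")
print(f"average: {average_ratio:.0%}")
```

The average of the four per-design ratios comes out at about 76%, matching the figure quoted on the slide.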

slide-24
SLIDE 24

Conclusions

  • Multi-cycle communication is needed for multi-gigahertz designs
  • Regular distributed register (RDR) architecture provides high regularity and direct support of
    • Multi-cycle communication
    • Integrated resource binding, scheduling, and physical planning
  • Experimental results demonstrate the effectiveness of MCAS synthesis algorithms

slide-25
SLIDE 25

Future Work

  • Support of control-intensive applications
    • Distributed controller generation
  • Variable renaming
    • Static Single Assignment (SSA) form
  • Steering logic optimization
    • Multiplexer input count minimization
    • Layout-driven distributed multiplexer tree generation