Architecture and Synthesis for Multi-Cycle On-Chip Communication



slide-1
SLIDE 1

Architecture and Synthesis for Multi-Cycle On-Chip Communication

Jason Cong

VLSI CAD Lab
Computer Science Department
University of California, Los Angeles
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu

Joint work with Y. Fan, G. Han, X. Yang, Z. Zhang

slide-2
SLIDE 2

Outline

  • Needs for Multi-Cycle On-Chip Communication
  • Regular Distributed Register (RDR) Architecture
  • MCAS: Multi-Cycle Communication Architectural Synthesis System
      • Scheduling-driven placement
      • Placement-driven rescheduling & rebinding
  • Experimental Results
  • Application in Pilot System: A Platform-Based HW/SW Synthesis System
  • Conclusions & Future Work

slide-3
SLIDE 3

Interconnect Bottleneck in Nanometer Designs

1st challenge: Interconnect delay exceeds gate delay

  • Source of the “timing closure” problem
  • Happened in the mid 1990s
  • Addressed by new physical synthesis/prototyping tools
slide-4
SLIDE 4

Interconnect Bottleneck in Nanometer Designs

2nd challenge: Single-cycle full-chip synchronization is no longer possible

  • Not supported by the current CAD toolset
  • About to happen soon

[Figure: die map showing the regions a signal can reach in 1, 2, 3, 4, and 5 clocks, with distance marks at 11.4, 22.8, and 28.3 mm]

ITRS’01 0.07 µm technology, 5.63 GHz across-chip clock, 800 mm² die (28.3 mm x 28.3 mm). IPEM BIWS estimations with buffer size 100x and driver/receiver size 100x.

  • On the semi-global layer (tier 3), a signal can travel up to 11.4 mm in one cycle
  • Need 5 clock cycles from corner to corner
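The five-cycle corner-to-corner figure on this slide can be reproduced from the die size and the one-cycle wire reach. The sketch below is a minimal illustration of that arithmetic; it assumes Manhattan (horizontal plus vertical) routing between opposite corners, which the slide does not state, and the function name is hypothetical.

```python
import math

def cycles_to_cross(width_mm, height_mm, reach_mm_per_cycle):
    """Clock cycles for a signal to travel between opposite die corners.

    Assumes Manhattan routing (width + height); this is an assumption of
    the sketch, not something stated on the slide.
    """
    distance_mm = width_mm + height_mm
    return math.ceil(distance_mm / reach_mm_per_cycle)

# Figures from the slide: 28.3 mm x 28.3 mm die, 11.4 mm reach per cycle.
print(cycles_to_cross(28.3, 28.3, 11.4))
```

Under this assumption the corner-to-corner distance is 56.6 mm, just under five times the 11.4 mm reach, which is consistent with the five clock cycles quoted above.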

slide-5
SLIDE 5

Single-cycle Full Chip Synchronization No Longer Possible: FPGA Example

Altera Stratix: EP1S80B-C6

  • Large size: 79,040 LEs
  • 22 DSP blocks …
  • Corner-to-corner interconnect delay: 7.154 ns
  • With a clock frequency of 300 MHz, corner-to-corner communication takes 3 clock cycles!

[Figure: Stratix floorplan showing MegaRAM blocks (9), DSP blocks (22), M4K RAM blocks (364), M512 RAM blocks (767), and logic array blocks (79,040 LEs)]
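The three-cycle figure for the Stratix device follows from dividing the corner-to-corner delay by the clock period and rounding up. A minimal sketch of that calculation, with a hypothetical helper name:

```python
import math

def cycles_for_delay(delay_ns, clock_mhz):
    """Number of whole clock cycles needed to cover a wire delay."""
    period_ns = 1000.0 / clock_mhz  # e.g. 300 MHz -> about 3.33 ns
    return math.ceil(delay_ns / period_ns)

# Figures from the slide: 7.154 ns corner-to-corner delay at 300 MHz.
print(cycles_for_delay(7.154, 300))
```

At 300 MHz the period is about 3.33 ns, so 7.154 ns spans a little over two periods and has to be budgeted as three cycles.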

slide-6
SLIDE 6

Possible Solutions

  • Asynchronous designs
      • Triggered by events instead of clocks
      • Bridging capabilities: provide interfaces between systems of different speeds
      • Greater flexibility: circuits in a system do not have to share common timing
      • Delay-insensitive
      • Reduced power consumption?
      • Improved performance?
  • Synchronous designs with multi-cycle communication
      • Much better understood
      • Can leverage existing tools/flows
      • Our current focus

slide-7
SLIDE 7

Multi-Cycle Interconnect Communication at Logic / Physical Level

  • Simultaneous retiming + placement / floorplanning
      • Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]
      • Retiming + floorplanning [Chong & Brayton, IWLS’01]
      • Retiming + placement for FPGAs [Singh & Brown, FPGA’02]

slide-8
SLIDE 8

Need of Considering Retiming during Placement

  • Retiming/pipelining on global interconnects
      • Multiple clock cycles are needed to cross the chip
      • Proper placement allows retiming to hide global interconnect delays

[Figure: two placements of nodes a, b, c, d, with d(v) = 1, WL = 6, and d(e) ∝ WL. Placement 1: φ = 5.0 before retiming, φ = 3.0 after retiming. Placement 2: φ = 4.0 before retiming. Annotation: "Better Initial Placement !!"]

slide-9
SLIDE 9

Need of Considering Retiming during Placement

  • Retiming/pipelining on global interconnects
      • Multiple clock cycles are needed to cross the chip
      • Proper placement allows retiming to hide global interconnect delays

[Figure: two placements of nodes a, b, c, d, with d(v) = 1, WL = 6, and d(e) ∝ WL. Placement 1: φ = 5.0 before retiming, φ = 3.0 after retiming. Placement 2: φ = 4.0 before retiming, φ = 4.0 after retiming. Annotation: "Better Initial Placement !!"]

slide-10
SLIDE 10

Simultaneous Coarse Placement with Retiming on Interconnects

  • Difficulties
      • How to consider retiming/pipelining over global interconnects
      • Flip-flop boundaries are not fixed during placement, which makes static timing analysis difficult
      • How to handle the high complexity of the combined problem
  • Our solution
      • Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)
      • Minimize the longest sequential path by improving the placement solution in the multilevel coarse placement framework

slide-11
SLIDE 11

Sequential Arrival Time (SAT)

  • Definition [Pan et al, TCAD98]
      • l(v) = max delay from PIs to v after optimal retiming under a given clock period φ
      • l(v) = max{l(u) − φ·w(u,v) + d(u,v) + d(v)}
  • Relation to retiming: r(v) = ⌈l(v) / φ⌉ − 1
  • Theorem: P can be retimed to φ + max{d(e)} iff l(POs) ≤ φ
  • SAT can be computed iteratively in O(VE) time (linear time in practice)

Example: nodes u and w both drive v, with l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, φ = 5, and w(u,v) = 1:

l(v) = max{7 − 5·1 + 2 + 1, 3 + 2 + 1} = 6
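The label recurrence and the slide's numeric example can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the edge-list representation, the `initial` argument for pre-set labels, and the choice to report "infeasible" when labels are still growing after |V| passes are all assumptions of this sketch.

```python
import math

NEG_INF = float("-inf")

def sat_labels(nodes, edges, d_node, phi, initial):
    """Iteratively compute sequential arrival times l(v).

    edges   : list of (u, v, w, d_edge); w = registers on the edge,
              d_edge = interconnect delay of the edge
    d_node  : dict of node delays d(v)
    initial : labels fixed up front (e.g. primary inputs)
    Returns the label dict, or None if labels are still growing after |V|
    passes (treated here as "phi is infeasible"; a sketch assumption).
    """
    label = {v: NEG_INF for v in nodes}
    label.update(initial)
    for _ in range(len(nodes)):
        changed = False
        for u, v, w, d_edge in edges:
            if label[u] == NEG_INF:
                continue  # u has no label yet, nothing to propagate
            candidate = label[u] - phi * w + d_edge + d_node[v]
            if candidate > label[v]:
                label[v] = candidate
                changed = True
        if not changed:
            return label
    return None

def retiming_value(l_v, phi):
    """r(v) = ceil(l(v) / phi) - 1, as on the slide."""
    return math.ceil(l_v / phi) - 1

# The slide's example: l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, phi = 5,
# one register on edge (u, v) and none on (w, v).
labels = sat_labels(
    nodes=["u", "w", "v"],
    edges=[("u", "v", 1, 2), ("w", "v", 0, 2)],
    d_node={"u": 0, "w": 0, "v": 1},
    phi=5,
    initial={"u": 7, "w": 3},
)
print(labels["v"], retiming_value(labels["v"], 5))
```

Each pass relaxes every edge once, so |V| passes over |E| edges give the O(VE) bound quoted on the slide; on acyclic or well-behaved inputs the loop exits after far fewer passes.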

slide-12
SLIDE 12

Limitation of Exploring Multi-cycle Interconnect Communication during Logic/Physical Synthesis

  • The minimum clock period that can be achieved by logic optimization is bounded by the maximum delay-to-register (DR) ratio of the loops in the circuit
  • Requires consideration of multi-cycle communication during architecture & behavior synthesis

Example:

  • In a loop, 4 logic cells, 2 registers
  • Cell delay = 1 ns
  • Interconnect delay = 1 ns
  • DR ratio = (Dlogic + Dint) / #Registers = (4 + 4) / 2 = 4 ns
  • Clock cycle ≥ 4 ns
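The DR-ratio bound in the example above can be written as a small calculation. The sketch assumes each loop is described by its list of cell delays, its list of interconnect delays, and its register count; that representation and the function names are illustrative, not taken from the slides.

```python
def dr_ratio(logic_delays_ns, wire_delays_ns, num_registers):
    """Delay-to-register ratio of one loop: (Dlogic + Dint) / #Registers."""
    return (sum(logic_delays_ns) + sum(wire_delays_ns)) / num_registers

def clock_period_lower_bound(loops):
    """The clock period cannot go below the largest DR ratio of any loop."""
    return max(dr_ratio(*loop) for loop in loops)

# The slide's loop: 4 cells of 1 ns, 4 wire segments of 1 ns, 2 registers.
slide_loop = ([1, 1, 1, 1], [1, 1, 1, 1], 2)
print(dr_ratio(*slide_loop))
```

Because retiming moves registers around a loop without changing how many there are, no register placement can push the period below this ratio, which is why the slides turn to architectural synthesis.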
slide-13
SLIDE 13

Our Contributions

  • Regular Distributed Register (RDR) micro-architecture
      • Highly regular
      • Direct support of multi-cycle on-chip communication
  • MCAS: Architectural Synthesis for Multi-cycle Communication
      • Integrates architectural synthesis (e.g. resource binding, scheduling) with physical planning
      • Targets RDR architectures

slide-14
SLIDE 14

Regular Distributed Register Architecture (1)

  • Distribute registers to each “island”
  • Choose the island size such that local computation and communication in each island can be done in a single cycle

[Equation: island sizing constraint that bounds the intra-island delay, expressed in terms of D_logic and D_int-opt over the island dimensions W_i and H_i, by T]

[Figure: array of islands connected by global interconnect. Each island has width Wi and height Hi and contains a Local Computational Cluster (LCC) with units such as ADD, MUX, and MUL (a cluster with an area constraint), a register file, and an island FSM]

slide-15
SLIDE 15

Regular Distributed Register Architecture (2)

  • Use register banks: registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication
  • Highly regular

[Figure: array of islands connected by global interconnect. Each island has width Wi and height Hi and contains a Local Computational Cluster (LCC) with units such as ADD, MUX, and MUL (a cluster with an area constraint), an island FSM, and a register file divided into banks labeled 1 cycle, 2 cycle, …, k cycle]

slide-16
SLIDE 16

ASIC Example: Regular Distributed Register Architecture for 70nm Technology

  • ITRS’01 70 nm technology
  • Chip dimension: 800 mm² (28.3 mm x 28.3 mm)
  • 5.63 GHz across-chip clock
      • A wire can travel up to 11.4 mm within 1 clock cycle under interconnect optimization
      • Need 5 clock cycles to cross the chip
  • Each island's base dimension
      • Wi = Hi = 3.94 mm
      • = critical length (the longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size 2x and driver/receiver size 2x
      • Logic volume: 19.63M min-size 2-NAND gates
  • 8x8 island-base array
  • Local registers are partitioned into 5 banks

slide-17
SLIDE 17

FPGA Example: Regular Distributed Register Architecture for a Stratix Device

  • Stratix: EP1S80-C6
      • Large size: 79,040 LEs
      • Corner-to-corner interconnect delay: 7.154 ns
  • To achieve a 250 MHz clock frequency
      • 4x6 island array
      • Intra-island interconnect delay: 2.616 ns
      • Logic delay of a 16-bit adder: 1.239 ns
      • Total delay < 4 ns
  • Each island contains (on average)
      • 3290 LEs (for functional units)
      • 1 DSP block (8 9x9-bit multipliers)
      • 32 M512 RAM blocks (register banks)
      • 15 M4K RAM blocks (register banks)
  • MegaRAM blocks: global resources
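The "total delay < 4 ns" claim is a single-cycle budget check: the intra-island interconnect delay plus the adder delay must fit within one 250 MHz period. A minimal sketch of that check, with a hypothetical function name:

```python
def island_slack_ns(clock_mhz, interconnect_ns, logic_ns):
    """Slack left in one clock period after intra-island wire + logic delay."""
    period_ns = 1000.0 / clock_mhz
    return period_ns - (interconnect_ns + logic_ns)

# Figures from the slide: 250 MHz target, 2.616 ns wire, 1.239 ns 16-bit adder.
slack = island_slack_ns(250, 2.616, 1.239)
print(f"slack = {slack:.3f} ns")
```

A positive slack means an island of this size supports single-cycle local computation at the target frequency; a larger island would increase the interconnect term and eventually make the slack negative.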

slide-18
SLIDE 18

Example: Multicluster Architectures of DEC Alpha 21264

Source: "The Multicluster Architecture: Reducing Cycle Time Through Partitioning" by Keith I. Farkas, et al

slide-19
SLIDE 19

RDR Architecture vs. DRA

  • Distributed Register File Architecture (DRA)
      • Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD’01]
  • Similarities:
      • Distribute registers near the local computational units
      • Support multi-cycle communication
      • Allow concurrent computation and communication
  • Distinction:
      • The RDR architecture is highly regular
      • Facilitates interconnect delay estimation
      • Enables the systematic exploration of the cycle-time/latency tradeoff by varying the size of the basic island

slide-20
SLIDE 20

Example: Impact of Interconnect on Scheduling

Data flow graph extracted from discrete cosine transformation (DCT). The nodes with the same color are assigned to the same functional unit.

FU          Delay  Num
ALU         1 ns   2
Multiplier  2 ns   2

Wirelength-driven placement (each functional unit sits in an LCC with its register file):

  • Alu1: nodes 1, 5, 10
  • Alu2: nodes 2, 6, 9
  • Mul1: nodes 4, 11, 12
  • Mul2: nodes 3, 7, 8

Interconnect delay: long interconnect 2 ns, short interconnect 1 ns.

[Figure: DFG of + and * operations with nodes numbered 1 to 12, and the four-island placement]

slide-21
SLIDE 21

Single-cycle vs. Multi-cycle Interconnect Communication

Interconnect communication  Clock cycles  Clock period  Total latency
Single-cycle                6             4 ns          24 ns
Multi-cycle                 9             2 ns          18 ns

[Figure: the 12-node DFG scheduled into cycles 1 to 6 (single-cycle) and cycles 1 to 9 (multi-cycle); a legend symbol represents registers]

slide-22
SLIDE 22

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with a clock period of 2 ns. The total latency is 16 ns.

Scheduling-driven placement:

  • Alu1: nodes 1, 5, 10
  • Alu2: nodes 2, 6, 9
  • Mul1: nodes 4, 11, 12
  • Mul2: nodes 3, 7, 8

[Figure: the 12-node DFG scheduled into cycles 1 to 8, and the four-island placement]

slide-23
SLIDE 23

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with a clock period of 2 ns. The total latency is 14 ns.

Simultaneous placement, scheduling and binding:

  • Alu1: nodes 1, 5, 10
  • Alu2: nodes 2, 6, 9
  • Mul1: nodes 4, 8, 12
  • Mul2: nodes 3, 7, 11

[Figure: the 12-node DFG scheduled into cycles 1 to 7, and the four-island placement]
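The latency figures quoted for the DCT example across the preceding slides all come from the same product, clock cycles times clock period. The short sketch below tabulates them side by side; the labels are paraphrases of the slide titles and the helper name is hypothetical.

```python
def total_latency_ns(cycles, clock_period_ns):
    """Total schedule latency = number of clock cycles x clock period."""
    return cycles * clock_period_ns

# (label, clock cycles, clock period in ns), as quoted on the slides.
schedules = [
    ("single-cycle interconnect communication", 6, 4),
    ("multi-cycle interconnect communication", 9, 2),
    ("simultaneous placement and scheduling", 8, 2),
    ("simultaneous placement, scheduling and binding", 7, 2),
]

latencies = [total_latency_ns(c, p) for _, c, p in schedules]
for (label, c, p), lat in zip(schedules, latencies):
    print(f"{label}: {c} cycles x {p} ns = {lat} ns")
```

The comparison shows why a schedule with more cycles can still win: halving the clock period outweighs the extra cycles, and the two enhancements then recover some of those cycles.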

slide-24
SLIDE 24

MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture

Inputs: C / VHDL, RDR architecture specification, target clock period

  • CDFG generation (produces the CDFG)
  • Resource allocation (produces resource constraints)
  • Functional unit binding (produces the Interconnected Component Graph, ICG)
  • Scheduling-driven placement (produces location information)
  • Placement-driven rebinding & scheduling
  • Register and port binding
  • Datapath & FSM generation

Outputs: RTL VHDL files, floorplan constraints, multi-cycle path constraints

Example on the 12-node DFG, with ICG components Alu1, Alu2, Mult1, Mult2:

  • After scheduling-driven placement: Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 11; Mul2: 3, 7, 12
  • After placement-driven rebinding & scheduling (cycles 1 to 7): Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 8, 12; Mul2: 3, 7, 11

slide-25
SLIDE 25

MCAS: Scheduling-Driven Placement (1)

  • Basic approach:
      • Integrate scheduling with an SA-based coarse placement [Chang et al, ISPD’02]
      • Overlap computation with communication
      • Hide critical data transfers into intra-island by reducing weighted wirelength
  • Distinction between our placement and conventional performance-driven placement
      • Problem size: relatively small (< 10³) vs. huge
      • Input: ICG (general graph) vs. netlist (acyclic graph)
      • Objective: to minimize the number of clock cycles vs. the clock period

slide-26
SLIDE 26
MCAS: Scheduling-Driven Placement (2)

  • Scheduling-based timing analysis
      • Timing analysis is performed on the original CDFG instead of the ICG
      • A fast list scheduling is performed on the CDFG instead of the classical static timing analysis
      • Critical edges in the ICG are assigned high weights

[Figure: loop between "Timing analysis by scheduling" and "Weight assignment", shown on the placement Alu1: 1, 5, 10; Alu2: 2, 6, 9; Mul1: 4, 11, 12; Mul2: 3, 7, 8 and the 12-node DFG scheduled into cycles 1 to 8]
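To make "timing analysis by scheduling" concrete, the sketch below runs a plain list scheduler on a bound DFG and charges extra cycles when a value moves between functional units. It is only an illustration under assumptions not taken from the slides: functional units are non-pipelined, latencies are whole cycles, the priority is the longest latency-weighted path to a sink, and `transfer` is a hypothetical callback that returns the inter-island transfer time in cycles for a given placement.

```python
def list_schedule(ops, deps, transfer):
    """List-schedule a bound DFG with multi-cycle data transfers.

    ops      : {op: (functional_unit, latency_cycles)}
    deps     : list of (pred, succ) data dependencies
    transfer : transfer(fu_a, fu_b) -> cycles to move a value (hypothetical)
    Returns (start_cycle_per_op, total_cycles). Cycles are 0-indexed.
    """
    preds = {o: [] for o in ops}
    succs = {o: [] for o in ops}
    for p, s in deps:
        preds[s].append(p)
        succs[p].append(s)

    # Priority: longest latency-weighted path from the op to any sink.
    prio = {}
    def path_len(o):
        if o not in prio:
            prio[o] = ops[o][1] + max((path_len(s) for s in succs[o]), default=0)
        return prio[o]
    for o in ops:
        path_len(o)

    start, finish, fu_free = {}, {}, {}
    cycle = 0
    while len(start) < len(ops):
        # An op is ready once every operand has been produced and transferred.
        ready = [o for o in ops if o not in start and all(
            p in finish and finish[p] + transfer(ops[p][0], ops[o][0]) <= cycle
            for p in preds[o])]
        for o in sorted(ready, key=lambda o: -prio[o]):
            fu, latency = ops[o]
            if fu_free.get(fu, 0) <= cycle:  # unit idle in this cycle
                start[o] = cycle
                finish[o] = cycle + latency
                fu_free[fu] = finish[o]
        cycle += 1
    return start, max(finish.values())

# Two dependent ops on different units, one cycle to move data between them.
ops = {"a": ("alu1", 1), "b": ("mul1", 1)}
far_start, far_total = list_schedule(ops, [("a", "b")], lambda x, y: 0 if x == y else 1)
near_start, near_total = list_schedule(ops, [("a", "b")], lambda x, y: 0)
print(far_total, near_total)
```

The schedule length returned here plays the role that the critical path delay plays in static timing analysis: a placement that shortens the transfers on the edges that determine it yields fewer cycles, which is the motivation for weighting those edges more heavily.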

slide-27
SLIDE 27

MCAS: Simultaneous Rescheduling & Rebinding (1)

  • Simultaneous list scheduling and binding to minimize total schedule latency
  • Previous approach [Jeon et al, ASPDAC’01]
      • cpl(i, j) = critical path length of the fanout cone rooted at node i, when node i is bound to functional unit j
      • Perform list scheduling using the priority function min_j(cpl(i, j))
      • Bind the node to the functional unit j with the minimum cpl(i, j) at the earliest feasible control step

[Figure: example DFG with nodes *1, *2, +3, +4, *5, *6, 7, +8 and labels X40 and X48, annotated with est(i, j) and cpl(i, j)]
slide-28
SLIDE 28

MCAS: Simultaneous Rescheduling & Rebinding (2)

  • Our contributions
      • Use force-directed list scheduling and binding with interconnect delay estimation
      • Consider resource constraints
          • During scheduling (for selecting deferred nodes)
          • During binding (as part of the scheduling process)

slide-29
SLIDE 29

Experiment Settings

[Figure: experimental flow. Inputs: C / VHDL, RDR architecture specification, target clock period. Steps: CDFG generation (CDFG); functional unit allocation & binding (interconnected component graph); then one of three paths labeled 1, 2, 3: scheduling; scheduling-driven placement (location information) with placement-driven scheduling; scheduling-driven placement with placement-driven rebinding & rescheduling; followed by register and port binding and datapath & FSM generation. Outputs: RTL VHDL files, floorplan constraints, and multi-cycle path constraints, passed to a commercial FPGA development system]

slide-30
SLIDE 30

Experimental Results (1)

Cycle number (CS), clock period (CP), and overall latency (Lat) comparison

            Flow 1                   Flow 2                   Flow 3
Design      CS    CP (ns)  Lat (ns)  CS    CP (ns)  Lat (ns)  CS    CP (ns)  Lat (ns)
pr          27    5.79     156.33    29    3.53     102.37    29    3.66     106.14
wang        14    7.54     105.56    20    4.14     82.8      20    3.81     76.2
lee         20    6.25     125       27    3.36     90.72     26    3.38     87.88
mcm         34    7.64     259.76    39    4.81     187.59    38    4.57     173.66
honda       23    7.58     174.34    24    3.78     90.72     24    4.18     100.32
dir         50    7.03     351.5     51    4.41     224.91    51    4.33     220.83
chem        50    8.27     413.5     53    4.64     245.92    52    4.49     233.48
u5ml12      68    9.3      632.4     70    5.34     373.8     70    4.3      301
Ave Ratio   1     1        1         1.14  0.57     0.65      1.13  0.56     0.63

  • Flow 1: Conventional approach
  • Flow 2: Scheduling-driven placement
  • Flow 3: Scheduling-driven placement + placement-driven rebinding & rescheduling
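The "Ave Ratio" row can be cross-checked from the per-design numbers. The sketch below assumes, as the values suggest, that each latency is the cycle count times the clock period and that the average ratio is the arithmetic mean of the per-design ratios against Flow 1; the data layout and function name are illustrative.

```python
# design -> [(CS, CP ns, Lat ns) for Flow 1, Flow 2, Flow 3], copied from the table.
results = {
    "pr":     [(27, 5.79, 156.33), (29, 3.53, 102.37), (29, 3.66, 106.14)],
    "wang":   [(14, 7.54, 105.56), (20, 4.14, 82.8),   (20, 3.81, 76.2)],
    "lee":    [(20, 6.25, 125.0),  (27, 3.36, 90.72),  (26, 3.38, 87.88)],
    "mcm":    [(34, 7.64, 259.76), (39, 4.81, 187.59), (38, 4.57, 173.66)],
    "honda":  [(23, 7.58, 174.34), (24, 3.78, 90.72),  (24, 4.18, 100.32)],
    "dir":    [(50, 7.03, 351.5),  (51, 4.41, 224.91), (51, 4.33, 220.83)],
    "chem":   [(50, 8.27, 413.5),  (53, 4.64, 245.92), (52, 4.49, 233.48)],
    "u5ml12": [(68, 9.3, 632.4),   (70, 5.34, 373.8),  (70, 4.3, 301.0)],
}

def avg_latency_ratio(results, flow):
    """Mean over designs of Lat(flow) / Lat(Flow 1); flow is 0-indexed."""
    ratios = [rows[flow][2] / rows[0][2] for rows in results.values()]
    return sum(ratios) / len(ratios)

# Every latency in the table equals cycles x clock period.
consistent = all(abs(cs * cp - lat) < 0.005
                 for rows in results.values() for cs, cp, lat in rows)

print(consistent, round(avg_latency_ratio(results, 1), 2),
      round(avg_latency_ratio(results, 2), 2))
```

Read this way, Flow 2 and Flow 3 spend about 14% and 13% more cycles than the conventional flow but run at a much shorter clock period, giving average latencies of roughly 65% and 63% of Flow 1.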

slide-31
SLIDE 31

Experimental Results (2)

Total latency comparison

[Figure: bar chart of latency (ns, axis ticks 100 to 700) for pr, wang, lee, mcm, honda, dir, chem, and u5ml12 under Flow 1, Flow 2, and Flow 3]

  • Flow 1: Conventional approach
  • Flow 2: Scheduling-driven placement
  • Flow 3: Scheduling-driven placement + placement-driven rebinding & rescheduling

slide-32
SLIDE 32

Synopsys Flow: Behavioral Compiler vs. MCAS

[Figure: comparison flow. Blocks: equivalent high-level data flow descriptions in VHDL and C; Behavioral Compiler and Design Compiler; MCAS; RTL VHDL; mapped VHDL for Stratix FPGAs; Altera Quartus-II; report; VHDL output for simulation; Modelsim; gcc]

slide-33
SLIDE 33

Experimental Results (3)

MCAS basic flow vs. Synopsys’ Behavioral Compiler

  • Synopsys Behavioral Compiler setting: default (optimizing latency)
  • Average latency ratio of MCAS vs. BC: 76%

[Figure: two bar charts comparing Synopsys BC and MCAS on pr, wang, mcm, and honda: latency (axis 0 to 600) and resource (axis ticks 1000 to 7000)]

Design  Flow         Cycles  Reg  ALU  MULT  fmax (MHz)  LE    Latency (ns)  MCAS vs. BC
pr      Synopsys BC  25      28   5    8     90.31       2945  276.82
pr      MCAS         27      34   6    2     96.74       2476  279.10        100.82%
wang    Synopsys BC  29      36   7    8     83.61       3605  346.85
wang    MCAS         14      35   5    8     103.76      4242  134.93        38.90%
mcm     Synopsys BC  43      142  23   7     79.65       6253  539.86
mcm     MCAS         34      35   6    3     72.05       3876  471.89        87.41%
honda   Synopsys BC  29      44   8    14    85.14       6128  340.62
honda   MCAS         23      42   6    8     87.11       5523  264.03        77.52%
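The latency column and the 76% average can be recomputed from the cycle counts and fmax values in the table. The sketch assumes latency is the cycle count divided by fmax and that the average is the arithmetic mean of the four per-design ratios, which is what the numbers suggest; the data layout and names are illustrative.

```python
def latency_ns(cycles, fmax_mhz):
    """Latency in ns when running `cycles` clock cycles at fmax."""
    return cycles * 1000.0 / fmax_mhz

# design -> (cycles, fmax MHz) for Synopsys BC and MCAS, copied from the table.
runs = {
    "pr":    {"bc": (25, 90.31), "mcas": (27, 96.74)},
    "wang":  {"bc": (29, 83.61), "mcas": (14, 103.76)},
    "mcm":   {"bc": (43, 79.65), "mcas": (34, 72.05)},
    "honda": {"bc": (29, 85.14), "mcas": (23, 87.11)},
}

ratios = {name: latency_ns(*r["mcas"]) / latency_ns(*r["bc"])
          for name, r in runs.items()}
average = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%}")
print(f"average: {average:.0%}")
```

The per-design ratios vary widely: MCAS is roughly on par with Behavioral Compiler on pr and reaches about 39% of its latency on wang, so the 76% average is driven largely by that one design.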

slide-34
SLIDE 34

Pilot: A Platform-based HW/SW Synthesis System

  • Platform-based Synthesis
  • Start from system level design description
  • Target to FPSoC platform
  • Automate the process as much as possible
  • System Data Model
  • MOC: Model of Computation
  • System-Level Synthesis Algorithms
  • Incorporate models such as Funstate model etc.
  • Internal Representation
  • Covers the whole life-cycle of the flow
  • SDM-API supports interoperability of CAD tools
slide-35
SLIDE 35

Platforms Used in Our Research

  • Highly Programmable Platforms
      • Xilinx Virtex II Pro, Altera Stratix, etc.
      • Concentrate on reconfigurability
      • Deliver reconfigurable processor + programmable logic

Xilinx Virtex II Pro

  • Up to 4 IBM PowerPC processors in the FPGA fabric
  • Up to 24 embedded Rocket I/O transceivers
  • Up to 556 18x18 multipliers
  • Over 10 Mb embedded block RAM
  • Up to 125,136 logic elements (LEs)

[Figure: Virtex II Pro block diagram with four PowerPC 405 cores, Rocket I/O transceivers, and programmable logic]

Altera Stratix

  • Nios embedded processor
  • High-bandwidth I/O & high-speed interfaces
  • Up to 176 embedded multipliers & up to 22 high-performance DSP blocks
  • Up to 7 Mb embedded memory
  • Up to 79,040 logic elements (LEs)
slide-36
SLIDE 36

Pilot Design Flow

Tools developed:

  • Converter: translates SpecC to SDM
  • Simulator: validates the design in SDM; simulates the design at different levels of abstraction
  • SW code generator: generates C source code from SDM for the target platform
  • HW code generator: generates VHDL source code from SDM for the target platform
  • Profiler: generates a profile based on the generated SW/HW system
  • HW synthesis: MCAS system

[Figure: Pilot design flow. Blocks: design spec.; System Data Model; simulation; synthesis (partitioning, scheduling, interface synthesis, SW synthesis, HW synthesis by the MCAS system); platform info.; estimation; SW code gen producing C code for the target SW; HW code gen producing VHDL for the target PLD]

slide-37
SLIDE 37

Work Accomplished: Jpeg Encoder

  • Jpeg Encoder: an example to validate the design flow

[Figure: input image in 116x96x8 .bmp format (12214 bytes) and output image in 116x96x8 .jpg format (1704 bytes)]

slide-38
SLIDE 38

Jpeg Example: Program Structure

BMP Image File → Image Fragmentation → DCT → Quantization → Entropy Coding → JPG Image File

  • JPEG: a standard for image compression
  • DCT: Discrete Cosine Transform (ChenDCT)
  • Four modes of operation in the JPEG standard
      • Sequential DCT-based mode
      • Progressive DCT-based mode
      • Lossless mode
      • Hierarchical mode

slide-39
SLIDE 39

Jpeg Example: Run-time Results

Run-time result of Jpeg example (time in 10^-6 s; parenthesized values as given on the slide)

                NIOS(SW)             NIOS(SW+HW1)         NIOS(SW+HW2)         NIOS(SW+HW3)
Module Name     time        rate(%)  time        rate(%)  time        rate(%)  time        rate(%)
HandleData      50.31       1.22%    50.31       1.92%    50.31       1.84%    50.31       4.59%
                (19878.67)           (19878.67)           (19878.67)           (19878.67)
DCT             3160.56     76.46%   1641.04     62.78%   1756.67     64.35%   123.51      11.26%
                (316.4)              (609.37)             (569.26)             (8096.46)
Quantization    176.42      4.27%    176.42      6.75%    176.42      6.46%    176.42      16.09%
                (5668.41)            (5668.41)            (5668.41)            (5668.41)
HuffmanEncode   746.29      18.05%   746.29      28.55%   746.29      27.34%   746.29      68.06%
                (1339.96)            (1339.96)            (1339.96)            (1339.96)
Total           4133.57     100.00%  2614.05     100.00%  2729.68     100.00%  1096.52     100.00%

  • HW1: half DCT implementation with message passing communication
  • HW2: Full DCT implementation with buffering communication
  • HW3: Full DCT implementation with shared memory communication
slide-40
SLIDE 40

Conclusions & Future Work

  • Conclusions:
      • Multi-cycle communication is needed for multi-gigahertz designs
      • Regular distributed register (RDR) architecture provides high regularity and direct support of
          • Multi-cycle communication
          • Integrated resource binding, scheduling, and physical planning
      • Experimental results demonstrate the effectiveness of the MCAS synthesis algorithms
  • Future Work:
      • Further refinement of synthesis for multi-cycle synchronous designs
      • Support of control-intensive applications, e.g. distributed controller generation
      • Steering logic optimization, e.g. layout-driven distributed MUX tree generation
      • Synthesis solutions for asynchronous designs

slide-41
SLIDE 41

Acknowledgements

  • Thanks for the support from the MARCO/DARPA Giga-Scale System Research Center (GSRC) and the Semiconductor Research Corporation (SRC)