[PPT] - Architecture and Synthesis for Multi-Cycle Architecture PowerPoint Presentation

SLIDE 1

Architecture and Synthesis for Multi-Cycle On-Chip Communication

Jason Cong

VLSI CAD Lab, Computer Science Department, University of California, Los Angeles

cong@cs.ucla.edu

http://cadlab.cs.ucla.edu

Joint work with Y. Fan, G. Han, X. Yang, Z. Zhang

SLIDE 2

Outline

Needs for Multi-Cycle On-Chip Communication

Regular Distributed Register (RDR) Architecture

MCAS: Multi-Cycle Communication Architectural Synthesis System

Scheduling-driven placement

Placement-driven rescheduling & rebinding

Experimental Results

Application in Pilot System - A Platform Based HW/SW Synthesis System

Conclusions & Future Work

SLIDE 3

Interconnect Bottleneck in Nanometer Designs

1st challenge: Interconnect delay exceeds gate delay

Source of “timing closure” problem

Happened in mid 1990s. Addressed by new physical synthesis/prototyping tools

SLIDE 4

Interconnect Bottleneck in Nanometer Designs

2nd challenge: Single-cycle full chip synchronization is no longer possible

Not supported by the current CAD toolset

About to happen soon

[Figure: chip die with distance marks at 11.4, 22.8, and 28.3 mm and regions labeled 1 clock through 5 clock]

ITRS’01 0.07um Tech; 5.63 GHz across-chip clock; 800 mm2 (28.3mm x 28.3mm); IPEM BIWS estimations

Buffer size: 100x; driver/receiver size: 100x

On semi-global layer (tier 3):

Can travel up to 11.4 mm in one cycle

Need 5 clock cycles from corner to corner
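The 5-cycle corner-to-corner figure on this slide can be checked with a short calculation. The sketch below assumes Manhattan routing, so the corner-to-corner wire length is the sum of the two die edges; that routing model and the function name are illustrative assumptions, not something stated on the slide.

```python
import math


def cycles_to_cross(die_w_mm, die_h_mm, reach_mm_per_cycle):
    """Clock cycles needed for a signal to go from corner to corner.

    Assumes Manhattan routing (wire length = width + height); this
    routing model is an assumption made for illustration.
    """
    distance_mm = die_w_mm + die_h_mm
    return math.ceil(distance_mm / reach_mm_per_cycle)


# Slide values: 28.3 mm x 28.3 mm die, 11.4 mm reach per clock cycle.
cycles = cycles_to_cross(28.3, 28.3, 11.4)
print(cycles)
```

Under this assumption the wire is 56.6 mm long, which is just under five times the 11.4 mm single-cycle reach.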

SLIDE 5

Single-cycle Full Chip Synchronization No Longer Possible - FPGA Example

Altera Stratix: EP1S80B-C6

Large Size: 79,040 LEs
22 DSP blocks …

Corner to corner interconnect delay: 7.154 ns

With clock frequency: 300 MHz

From corner to corner communication: 3 clock cycles!

[Figure: device floorplan showing MegaRAM Blocks (9), DSP Blocks (22), M4K RAM Blocks (364), M512 RAM Blocks (767), Logic Array Blocks (79,040 LEs)]

SLIDE 6

Possible Solutions

Asynchronous designs

Triggered by events instead of clocks

Bridging capabilities: provides interfaces for systems of different speeds

Greater flexibility: circuits in a system do not have to share common timing

Delay-insensitive

Reduced power consumption?

Improved performance?

Synchronous designs, with multi-cycle communications

Much better understood

Can leverage existing tools/flows

Our current focus

SLIDE 7

Multi-Cycle Interconnect Communication at Logic / Physical Level

Simultaneous retiming + placement / floorplanning

Retiming + multilevel partitioning [Cong et al, ICCAD’00] and coarse placement [Cong et al, DAC’03]

Retiming + floorplanning [Chong & Brayton, IWLS’01]

Retiming + placement for FPGAs [Singh & Brown, FPGA’02]

SLIDE 8

Need of Considering Retiming during Placement

Retiming/pipelining on global interconnects

Multiple clock cycles are needed to cross the chip

Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d, each with d(v)=1, WL=6, d(e) ∝ WL]

Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0

Placement 2 (a c b d): before retiming, φ = 4.0

Better Initial Placement !!

SLIDE 9

Need of Considering Retiming during Placement

Retiming/pipelining on global interconnects

Multiple clock cycles are needed to cross the chip

Proper placement allows retiming to hide global interconnect delays.

[Figure: two placements of nodes a, b, c, d, each with d(v)=1, WL=6, d(e) ∝ WL]

Placement 1 (a b c d): before retiming, φ = 5.0; after retiming, φ = 3.0

Placement 2 (a c b d): before retiming, φ = 4.0; after retiming, φ = 4.0

Better Initial Placement !!

SLIDE 10

Simultaneous Coarse Placement with Retiming on Interconnects

Difficulties

How to consider retiming/pipelining over global interconnects

Flip-flop boundaries are not fixed during placement, difficult to do static timing analysis

How to handle the high complexity of the combined problem

Our solution

Compute the labels of all nodes under c-retiming for a given placement solution and perform sequential timing analysis (Seq-TA)

Minimize the longest sequential path by improving the placement solution in the multilevel coarse placement framework

SLIDE 11

Sequential Arrival Time (SAT)

Definition [Pan et al, TCAD98]

$l(v)$ = max delay from PIs to $v$ after opt. retiming under a given clock period $\phi$

$l(v) = \max\{\, l(u) - \phi \cdot w(u,v) + d(u,v) + d(v) \,\}$

Relation to retiming: $r(v) = \lceil l(v) / \phi \rceil - 1$

Theorem: $P$ can be retimed to $\phi + \max\{d(e)\}$ iff $l(\mathrm{POs}) \le \phi$

SAT can be computed iteratively in O(VE) time (linear time in practice)

[Figure: nodes u and w driving node v; edge (u,v) annotated with l(u), w(u,v), d(v)]

Example: l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, φ = 5: l(v) = max{7-5·1+2+1, 3+2+1} = 6
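The label recurrence can be evaluated with a simple relaxation loop. The following is a minimal sketch, not the authors' Seq-TA implementation: it uses a Bellman-Ford-style iteration, takes the labels of source nodes as given, and reads the slide's example as having one register on edge (u,v) and none on edge (w,v). The function and parameter names are illustrative.

```python
import math


def sat_labels(edges, node_delay, phi, source_labels):
    """Relax l(v) = max{l(u) - phi*w(u,v) + d(u,v) + d(v)} to a fixed point.

    edges: list of (u, v, w, d_e), w = registers on the edge and
           d_e = interconnect delay of the edge.
    source_labels: labels of nodes treated as given (assumption).
    """
    labels = dict(source_labels)
    nodes = set(source_labels)
    for u, v, _, _ in edges:
        nodes.update((u, v))
    for n in nodes:
        labels.setdefault(n, float("-inf"))

    # At most |V| passes; more would mean labels grow without bound.
    for _ in range(len(nodes)):
        changed = False
        for u, v, w, d_e in edges:
            if v in source_labels:
                continue
            cand = labels[u] - phi * w + d_e + node_delay[v]
            if cand > labels[v]:
                labels[v] = cand
                changed = True
        if not changed:
            return labels
    raise ValueError("labels did not converge for this clock period")


def retiming_value(label, phi):
    """r(v) = ceil(l(v) / phi) - 1."""
    return math.ceil(label / phi) - 1


# Slide example: l(u) = 7, l(w) = 3, d(v) = 1, d(e) = 2, phi = 5.
labels = sat_labels(
    edges=[("u", "v", 1, 2), ("w", "v", 0, 2)],
    node_delay={"v": 1},
    phi=5,
    source_labels={"u": 7, "w": 3},
)
print(labels["v"], retiming_value(labels["v"], 5))
```

Each pass visits every edge once, which is where the O(VE) bound comes from; the early exit on an unchanged pass is what makes it close to linear on typical circuits.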

SLIDE 12

Limitation of Exploring Multi-cycle Interconnect Communication during Logic/Physical Synthesis

Minimum clock period that can be achieved by logic optimization is bounded by max. delay-to-register (DR) ratio of the loops in the circuits

Require consideration of multi-cycle communication during architecture & behavior synthesis

In a loop, 4 logic cells, 2 registers
Cell delay =1ns
Interconnect delay=1ns
DR ratio = (Dlogic+Dint)/#Registers = (4+4)/2=4ns
Clock cycle >= 4ns
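The DR-ratio bound above is a one-line calculation per loop; the clock period can be no smaller than the largest ratio over all loops. A minimal sketch with illustrative function names, using the slide's loop of 4 cells and 4 interconnect segments:

```python
def dr_ratio(logic_delays_ns, interconnect_delays_ns, num_registers):
    """Delay-to-register ratio of one loop, in ns per register."""
    total_delay = sum(logic_delays_ns) + sum(interconnect_delays_ns)
    return total_delay / num_registers


def min_clock_period(loops):
    """Lower bound on the clock period: the maximum DR ratio over loops."""
    return max(dr_ratio(*loop) for loop in loops)


# Slide example: 4 logic cells (1 ns each), 4 wires (1 ns each), 2 registers.
example_loop = ([1, 1, 1, 1], [1, 1, 1, 1], 2)
print(dr_ratio(*example_loop))           # 4.0
print(min_clock_period([example_loop]))  # 4.0
```

Retiming can move the two registers around the loop but cannot add registers to it, which is why the bound holds regardless of placement.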

SLIDE 13

Our Contributions

Regular Distributed Register (RDR) micro-architecture

Highly regular

Direct support of multi-cycle on-chip communication

MCAS: Architectural Synthesis for Multi-cycle Communication

Integrated architectural synthesis (e.g. resource binding, scheduling) with physical planning

Target at RDR architectures

SLIDE 14

Regular Distributed Register Architecture (1)

Distribute registers to each “island”

Choose the island size such that local computation and communication in each island can be done in a single cycle:

$$D_{\text{intra-island}} = D_{\text{logic}} + D_{\text{int-opt}} \le D_{\text{logic}} + 2\,D_{\text{int-opt}}(W_i + H_i) \le T$$

[Figure: array of islands connected by Global Interconnect. Each island (dimensions Wi x Hi) contains a Local Computational Cluster (LCC) with ADD, MUX, MUL, … (cluster with area constraint), a Register File, and an Island FSM]

SLIDE 15

Regular Distributed Register Architecture (2)

Use register banks:

Registers in each island are partitioned to k banks for 1 cycle, 2 cycle, … k cycle interconnect communication in each island

Highly regular

[Figure: array of islands connected by Global Interconnect. Each island (dimensions Wi x Hi) contains a Local Computational Cluster (LCC) with ADD, MUX, MUL, … (cluster with area constraint), an Island FSM, and a Register File split into banks labeled 1 cycle, 2 cycle, …, k cycle]

SLIDE 16

ASIC Example: Regular Distributed Register Architecture for 70nm Technology

ITRS’01 70nm Tech

Chip dimension: 800 mm2 (28.3mm x 28.3mm)

5.63 GHz across-chip clock

Wire can travel up to 11.4mm within 1 clock cycle under interconnect optimization

Need 5 clock cycles to cross the chip

Each island base dimension: Wi = Hi = 3.94 mm = critical length (longest length that a wire can run without buffer insertion), estimated by IPEM BIWS assuming buffer size: 2x, driver/receiver size: 2x

Logic volume: 19.63M min-size 2-NAND gates

8X8 island-base array

Local registers are partitioned to 5 banks

SLIDE 17

FPGA Example: Regular Distributed Register Architecture for a Stratix Device

To achieve 250 MHz clock frequency

4X6 island array
Intra-island interconnect delay: 2.616 ns
Logic delay of a 16 bit ADDER: 1.239 ns
Total Delay < 4 ns

Each Island contains (average)
3290 LEs (for function units)
1 DSP block (8 9X9 bit multipliers)
32 M512 RAM blocks (register banks)
15 M4K RAM blocks (register banks)
MegaRAM blocks: global resources

Stratix: EP1S80-C6
Large size: 79,040 LEs
Corner-to-corner interconnect delay: 7.154 ns
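The island timing budget can be checked directly from the numbers on this slide. The sketch below is illustrative (the function names are not from the deck): it tests whether one adder plus intra-island wiring fits in a clock period, and how many periods the corner-to-corner delay spans at a given frequency.

```python
import math


def clock_period_ns(f_mhz):
    """Clock period in ns for a frequency given in MHz."""
    return 1000.0 / f_mhz


def fits_in_one_cycle(logic_ns, wire_ns, f_mhz):
    """True if local logic plus intra-island wiring meets the clock period."""
    return logic_ns + wire_ns <= clock_period_ns(f_mhz)


def cycles_for_delay(delay_ns, f_mhz):
    """Whole clock cycles needed to cover an interconnect delay."""
    return math.ceil(delay_ns / clock_period_ns(f_mhz))


# 16-bit adder (1.239 ns) + intra-island wire (2.616 ns) at 250 MHz.
print(fits_in_one_cycle(1.239, 2.616, 250))  # 3.855 ns against a 4 ns period
# Corner-to-corner delay of 7.154 ns at 250 MHz and at 300 MHz.
print(cycles_for_delay(7.154, 250), cycles_for_delay(7.154, 300))
```

At 300 MHz the same corner-to-corner delay spans 3 periods, matching the earlier FPGA slide.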

SLIDE 18

Example: Multicluster Architectures of DEC Alpha 21264

Source: The Multicluster Architecture: Reducing Cycle Time Through Partitioning by Keith I. Farkas, et al

SLIDE 19

RDR Architecture vs. DRA

Distributed Register File Architecture (DRA)

Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [Kim, et al, ICCAD’01]

Similarities:

Distribute registers near the local computational units

Supports multi-cycle communication

Allows concurrent computation and communication

Distinction:

The RDR architecture is highly regular

Facilitates interconnect delay estimation

Enables the systematic exploration of cycle-time/latency tradeoff by varying the size of the basic island

SLIDE 20

Example: Impact of Interconnect on Scheduling

Data flow graph extracted from discrete cosine transformation (DCT)

[Figure: data flow graph with nodes numbered 1 to 12 (+ and * operations). The nodes with the same color are assigned to the same functional unit.]

FU          Delay  Num
ALU         1 ns   2
Multiplier  2 ns   2

Wirelength-driven placement

[Figure: islands, each with an LCC and register files, holding the bound functional units]

Alu1: 1,5,10
Alu2: 2,6,9
Mul1: 4,11,12
Mul2: 3,7,8

Interconnect delay labels: 2 ns, 1 ns (long interconnect, short interconnect)

SLIDE 21

Single-cycle vs. Multi-cycle Interconnect Communication

Single-cycle interconnect communication
Scheduled in 6 clock cycles
Clock period is 4ns
Total latency is 24ns

Multi-cycle interconnect communication
Scheduled in 9 clock cycles
Clock period is 2ns
Total latency is 18ns

[Figure: two schedules of the 12-node DFG, one over Cycle 1 to Cycle 6 and one over Cycle 1 to Cycle 9; a legend symbol represents registers]

SLIDE 22

Enhancement 1: Simultaneous Placement and Scheduling for Performance Optimization

With placement integrated with scheduling, the critical path is reduced. The DFG can be scheduled in 8 clock cycles, with clock period of 2ns. The total latency is 16ns.

Scheduling-driven placement

Alu1: 1,5,10
Alu2: 2,6,9
Mul1: 4,11,12
Mul2: 3,7,8

[Figure: placed islands with register files, and the schedule of the 12-node DFG over Cycle 1 to Cycle 8]

SLIDE 23

Enhancement 2: Simultaneous Placement, Scheduling and Binding for Performance Optimization

With placement integrated with scheduling and binding, the critical path is further reduced. The DFG can be scheduled in 7 clock cycles, with clock period of 2ns. The total latency is 14ns.

Simultaneous placement, scheduling and binding

Alu1: 1,5,10
Alu2: 2,6,9
Mul1: 4,8,12
Mul2: 3,7,11

[Figure: placed islands with register files, and the schedule of the 12-node DFG over Cycle 1 to Cycle 7]

SLIDE 24

MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture

[Figure: MCAS flow diagram, illustrated with the 12-node DFG]

Inputs: C / VHDL; RDR Arch. Spec.; Target clock period

Steps shown:
CDFG generation (CDFG)
Resource allocation (Resource constraints)
Functional unit binding (Interconnected Component Graph (ICG) over Mult1, Mult2, Alu1, Alu2)
Scheduling-driven placement (Location information)
Placement-driven rebinding & scheduling
Register and port binding
Datapath & FSM generation

Outputs: RTL VHDL files; Floorplan constraints; Multi-cycle path constraints

Binding shown after scheduling-driven placement: Alu1: 1,5,10; Alu2: 2,6,9; Mul1: 4,8,11; Mul2: 3,7,12

Binding shown after placement-driven rebinding & scheduling (Cycle1 to Cycle7): Alu1: 1,5,10; Alu2: 2,6,9; Mul1: 4,8,12; Mul2: 3,7,11

SLIDE 25

MCAS: Scheduling-Driven Placement (1)

Basic approach:

Integrate scheduling with an SA-based coarse placement [Chang et al, ISPD’02]

Overlap computation with communication

Hide critical data transfers into intra-island by reducing weighted wirelength.

Distinction between our placement and conventional performance-driven placement

Problem size: Relatively small (<10^3) vs. Huge

Input: ICG (general graph) vs. Netlist (acyclic graph)

Objective: To minimize: # of Clock cycles vs. Clock period

SLIDE 26

MCAS: Scheduling-Driven Placement (2)

Scheduling-based timing analysis

Timing Analysis is performed on original CDFG instead of ICG

A fast list scheduling is performed on CDFG instead of the classical static timing analysis

Critical edges in ICG are assigned high weights

[Figure: Timing Analysis by Scheduling and Weight assignment, shown with the placement Alu1: 1,5,10; Alu2: 2,6,9; Mul1: 4,11,12; Mul2: 3,7,8 and the schedule of the 12-node DFG over Cycle 1 to Cycle 8]
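Using a list scheduler as the timing engine can be sketched as follows. This is a simplified illustration, not the MCAS algorithm: the binding is fixed, the priority is "earliest data-ready operation first", and the transfer model (one clock cycle per island hop, Manhattan distance) is a hypothetical stand-in for the real interconnect estimate. All names are illustrative.

```python
def list_schedule(ops, deps, transfer_cycles):
    """Schedule a DAG of operations on bound functional units.

    ops: {op: (fu, exec_cycles)}; deps: [(pred, succ)];
    transfer_cycles(fu_a, fu_b): cycles for a result to travel between FUs.
    Returns (start cycle per op, total schedule length in cycles).
    """
    preds = {op: [] for op in ops}
    for pred, succ in deps:
        preds[succ].append(pred)

    start, finish, fu_free = {}, {}, {}

    def data_ready(op):
        # Earliest cycle at which all operands have arrived at op's unit.
        fu = ops[op][0]
        return max(
            (finish[p] + transfer_cycles(ops[p][0], fu) for p in preds[op]),
            default=0,
        )

    remaining = set(ops)
    while remaining:
        ready = [op for op in remaining if all(p in finish for p in preds[op])]
        if not ready:
            raise ValueError("dependency cycle: the graph must be acyclic")
        op = min(ready, key=lambda o: (data_ready(o), o))
        fu, exec_cycles = ops[op]
        # An op starts once its data has arrived and its unit is free.
        start[op] = max(data_ready(op), fu_free.get(fu, 0))
        finish[op] = start[op] + exec_cycles
        fu_free[fu] = finish[op]
        remaining.remove(op)
    return start, max(finish.values())


def manhattan_transfer(placement):
    """Hypothetical delay model: one cycle per island hop."""
    def cycles(fu_a, fu_b):
        (xa, ya), (xb, yb) = placement[fu_a], placement[fu_b]
        return abs(xa - xb) + abs(ya - yb)
    return cycles


ops = {"a": ("alu1", 1), "b": ("mul1", 1), "c": ("alu1", 1)}
deps = [("a", "b"), ("b", "c")]
far = {"alu1": (0, 0), "mul1": (1, 1)}
near = {"alu1": (0, 0), "mul1": (0, 1)}
_, latency_far = list_schedule(ops, deps, manhattan_transfer(far))
_, latency_near = list_schedule(ops, deps, manhattan_transfer(near))
print(latency_far, latency_near)
```

The schedule length plays the role that the critical-path delay plays in static timing analysis: a placement move is judged by how many cycles it adds or removes, and edges whose transfers lengthen the schedule are the natural candidates for higher weights.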

SLIDE 27

MCAS: Simultaneous Rescheduling & Rebinding (1)

Simultaneous list scheduling and binding to minimize total schedule latency

Previous approach [Jeon et al, ASPDAC’01]

cpl(i, j) = critical path length of fanout cone rooted at node i, when node i is bound to functional unit j.

Perform list scheduling using priority function min_j(cpl(i, j)).

Bind node to functional unit j with the minimum cpl(i, j) at the earliest feasible control step

[Figure: example DFG with node labels X48, +3, *1, X40, +4, *2, *6, *5, +8, 7, annotated with est(i, j) and cpl(i, j)]

SLIDE 28

MCAS: Simultaneous Rescheduling & Rebinding (2)

Our contributions

Use force-directed list scheduling and binding with interconnect delay estimation

Consider resource constraints

During scheduling (for selecting deferred nodes)

During binding (as part of scheduling process)

SLIDE 29

Experiment Settings

[Figure: three experimental flows, numbered 1, 2, 3, from C / VHDL input to a commercial FPGA development system]

Steps labeled in the figure: CDFG generation; Functional unit allocation & binding; Scheduling; Scheduling-driven placement; Placement-driven scheduling; Placement-driven rebinding & rescheduling; Register and port binding; Datapath & FSM generation

Data labeled in the figure: CDFG; Interconnected component graph; Location information; RDR Arch. Spec.; Target clock period; RTL VHDL files; Floorplan constraints; Multi-cycle path constraints

SLIDE 30

Experimental Results (1)

Cycle number (CS), clock period (CP), and overall latency (Lat) comparison

           | Flow 1                 | Flow 2                 | Flow 3
Design     | CS   CP (ns)  Lat (ns) | CS   CP (ns)  Lat (ns) | CS   CP (ns)  Lat (ns)
pr         | 27   5.79     156.33   | 29   3.53     102.37   | 29   3.66     106.14
wang       | 14   7.54     105.56   | 20   4.14     82.8     | 20   3.81     76.2
lee        | 20   6.25     125      | 27   3.36     90.72    | 26   3.38     87.88
mcm        | 34   7.64     259.76   | 39   4.81     187.59   | 38   4.57     173.66
honda      | 23   7.58     174.34   | 24   3.78     90.72    | 24   4.18     100.32
dir        | 50   7.03     351.5    | 51   4.41     224.91   | 51   4.33     220.83
chem       | 50   8.27     413.5    | 53   4.64     245.92   | 52   4.49     233.48
u5ml12     | 68   9.3      632.4    | 70   5.34     373.8    | 70   4.3      301
Ave Ratio  | 1    1        1        | 1.14 0.57     0.65     | 1.13 0.56     0.63

Flow1: Conventional approach
Flow2: Scheduling-driven placement
Flow3: Scheduling-driven placement + placement-driven rebinding & rescheduling
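The latency column is the product of cycle count and clock period, and the "Ave Ratio" row appears to be the arithmetic mean of the per-design ratios against Flow 1. That reading is an assumption; the sketch below recomputes both from the CS and CP values in the table and can be compared against the reported 0.65 and 0.63.

```python
# (CS, CP in ns) for Flow 1, Flow 2, Flow 3, copied from the table.
RESULTS = {
    "pr":     [(27, 5.79), (29, 3.53), (29, 3.66)],
    "wang":   [(14, 7.54), (20, 4.14), (20, 3.81)],
    "lee":    [(20, 6.25), (27, 3.36), (26, 3.38)],
    "mcm":    [(34, 7.64), (39, 4.81), (38, 4.57)],
    "honda":  [(23, 7.58), (24, 3.78), (24, 4.18)],
    "dir":    [(50, 7.03), (51, 4.41), (51, 4.33)],
    "chem":   [(50, 8.27), (53, 4.64), (52, 4.49)],
    "u5ml12": [(68, 9.30), (70, 5.34), (70, 4.30)],
}


def latency_ns(cycles, period_ns):
    """Overall latency = number of cycles x clock period."""
    return cycles * period_ns


def average_latency_ratio(results, flow_index):
    """Mean over designs of latency(flow) / latency(Flow 1).

    Averaging per-design ratios is an assumed reading of 'Ave Ratio'.
    """
    ratios = [
        latency_ns(*flows[flow_index]) / latency_ns(*flows[0])
        for flows in results.values()
    ]
    return sum(ratios) / len(ratios)


for idx in (1, 2):
    print(f"Flow {idx + 1}: {average_latency_ratio(RESULTS, idx):.2f}")
```

The trade the table shows is more cycles at a much shorter clock period: cycle counts rise by roughly 13 to 14 percent on average while the period drops by more than 40 percent.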

SLIDE 31

Experimental Results (2)

Total latency comparison

[Figure: bar chart of latency (ns), axis 100 to 700, for pr, wang, lee, mcm, honda, dir, chem, u5ml12 under Flow 1, Flow 2, Flow 3]

Flow1: Conventional approach
Flow2: Scheduling-driven placement
Flow3: Scheduling-driven placement + placement-driven rebinding & rescheduling

SLIDE 32

Synopsys Flow - Behavioral Compiler vs. MCAS

[Figure: comparison flow with the labels VHDL, C, Behavioral Compiler, Design Compiler, MCAS, RTL VHDL, Mapped VHDL for Stratix FPGAs, Altera Quartus-II, Report, VHDL Output for Simulation, Modelsim, Equivalent high-level data flow description, gcc]

SLIDE 33

Experimental Results (3)

MCAS basic flow vs. Synopsys’ Behavioral Compiler

Synopsys Behavioral Compiler setting: default (optimizing latency)
Average latency ratio of MCAS vs. BC: 76%

[Figure: two bar charts comparing Synopsys BC and MCAS on pr, wang, mcm, honda: Latency (axis 0.00 to 600.00) and Resource (axis 1000 to 7000)]

Design | Flow        | Cycles | Reg | ALU | MULT | fmax (MHz) | LE   | Latency (ns) | MCAS vs. BC
pr     | Synopsys BC | 25     | 28  | 5   | 8    | 90.31      | 2945 | 276.82       |
pr     | MCAS        | 27     | 34  | 6   | 2    | 96.74      | 2476 | 279.10       | 100.82%
wang   | Synopsys BC | 29     | 36  | 7   | 8    | 83.61      | 3605 | 346.85       |
wang   | MCAS        | 14     | 35  | 5   | 8    | 103.76     | 4242 | 134.93       | 38.90%
mcm    | Synopsys BC | 43     | 142 | 23  | 7    | 79.65      | 6253 | 539.86       |
mcm    | MCAS        | 34     | 35  | 6   | 3    | 72.05      | 3876 | 471.89       | 87.41%
honda  | Synopsys BC | 29     | 44  | 8   | 14   | 85.14      | 6128 | 340.62       |
honda  | MCAS        | 23     | 42  | 6   | 8    | 87.11      | 5523 | 264.03       | 77.52%

SLIDE 34

Pilot: A Platform-based HW/SW Synthesis System

Platform-based Synthesis

Start from system level design description

Target to FPSoC platform

Automate the process as much as possible

System Data Model

MOC - Model of Computation

System-Level Synthesis Algorithms

Incorporate models such as Funstate model etc.

Internal Representation

Cover whole life-cycle of the flow

SDM-API supports inter-operability of CAD tools

SLIDE 35

Platforms Used in Our Research

Highly Programmable Platforms

Xilinx Virtex II Pro, Altera Stratix, etc.

Concentrates on reconfigurability

Delivers reconfigurable processor + programmable logic

[Figure: block diagram with Rocket I/O Transceivers, four PowerPC 405 cores, and Programmable Logic]

Xilinx Virtex II Pro

Up to 4 IBM PowerPC in FPGA fabric
Up to 24 embedded Rocket I/O transceivers
Up to 556 18*18 multipliers
Over 10 Mb embedded block RAM
Up to 125,136 logic elements (LEs)

Altera Stratix

Nios embedded processor
High-bandwidth I/O & High-Speed Interfaces
Up to 176 embedded multipliers & up to 22 high performance DSP blocks

Up to 7 Mb embedded memory
Up to 79,040 logic elements (LEs)

SLIDE 36

Pilot Design Flow

Tools Developed:

Converter: Translate SpecC to SDM

Simulator: Validate the design in SDM, simulate the design at different levels of abstraction

SW code generator: Generate C source code from SDM for target platform

HW code generator: Generate VHDL source code from SDM for target platform

Profiler: Generate profile based on generated SW/HW system

HW synthesis: MCAS system

[Figure: Pilot flow with the blocks Design Spec., Simulation, Synthesis, System Data Model, Platform Info., Estimation, Partitioning, Scheduling, Interface Synthesis, SW synthesis, HW synthesis (MCAS system), SW Code Gen, HW Code Gen, C Code, VHDL, Target SW, Target PLD]

SLIDE 37

Work Accomplished: Jpeg Encoder

Jpeg Encoder:

An example to validate the design flow

[Figure: input image, 116x96x8 .bmp format (12214 Bytes), and output image, 116x96x8 .jpg format (1704 Bytes)]

SLIDE 38

Jpeg Example: Program Structure

[Figure: program blocks BMP Image File, Image Fragmentation, DCT, Quantization, Entropy Coding, JPG Image File]

JPEG: a standard for image compression

DCT: Discrete Cosine Transform (ChenDCT)

Four modes of operation in the JPEG standard:
Sequential DCT-based mode
Progressive DCT-based mode
Lossless mode
Hierarchical mode

SLIDE 39

Jpeg Example: Run-time Results

Run-time result of Jpeg example

Each cell gives time (10 6s), the parenthesized value printed under it on the slide, and rate (%).

Module Name   | NIOS(SW)                    | NIOS(SW+HW1)                | NIOS(SW+HW2)                | NIOS(SW+HW3)
HandleData    | 50.31 (19878.67), 1.22%     | 50.31 (19878.67), 1.92%     | 50.31 (19878.67), 1.84%     | 50.31 (19878.67), 4.59%
DCT           | 3160.56 (316.4), 76.46%     | 1641.04 (609.37), 62.78%    | 1756.67 (569.26), 64.35%    | 123.51 (8096.46), 11.26%
Quantization  | 176.42 (5668.41), 4.27%     | 176.42 (5668.41), 6.75%     | 176.42 (5668.41), 6.46%     | 176.42 (5668.41), 16.09%
HuffmanEncode | 746.29 (1339.96), 18.05%    | 746.29 (1339.96), 28.55%    | 746.29 (1339.96), 27.34%    | 746.29 (1339.96), 68.06%
Total         | 4133.57, 100.00%            | 2614.05, 100.00%            | 2729.68, 100.00%            | 1096.52, 100.00%

HW1: half DCT implementation with message passing communication
HW2: Full DCT implementation with buffering communication
HW3: Full DCT implementation with shared memory communication

SLIDE 40

Conclusions & Future Work

Conclusions:

Multi-cycle communication is needed for multi-gigahertz designs

Regular distributed register (RDR) architecture provides high regularity and direct support of multi-cycle communication

Integrated resource binding, scheduling, and physical planning

Experimental results demonstrate the effectiveness of MCAS synthesis algorithms

Future Work:

Further refinement of synthesis for multi-cycle synchronous designs

Support of control-intensive applications, e.g. distributed controller generation

Steering logic optimization, e.g. layout-driven distributed MUX tree generation

Synthesis solutions for asynchronous designs

SLIDE 41

Acknowledgements

Thanks for the support from MARCO/DARPA Giga-Scale System Research Center (GSRC) and Semiconductor Research Corporation (SRC)