SLIDE 1

Platform-Based Synthesis for Field-Programmable SOCs

Prof. Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department

Outline

Motivation
xPilot system framework
Behavior-level synthesis in xPilot
  Advantages of behavioral synthesis
  Scheduling
  Resource binding
System-level synthesis in xPilot
  Synthesis for ASIP platforms
  Design exploration for heterogeneous MPSoCs
Conclusions

SLIDE 2

Field-Programmable SOCs are Here: Altera Stratix II FPGA

90nm Stratix II 2S60
  60,440 Equivalent Logic Elements
  2,544,192 Memory Bits

[Figure: Stratix II device floorplan, courtesy Altera. Labels: Adaptive Logic Modules; M512 Block; M4K Block; M-RAM Blocks; Digital Signal Processing (DSP) Blocks; Phase-Locked Loops (PLL); High-Speed I/O Channels with Dynamic Phase Alignment (DPA); I/O Channels with External Memory Interface Circuitry]

Soft core µProc: Nios II
  Nios II/f: 185 MHz, < 900 ALMs (< 1800 LEs), 218 max DMIPS
Avalon™ Bus
IP
Software defined radio (SDR) baseband data path reconfiguration

Field-Programmable SOCs are Here: Xilinx Virtex-4 FPGA

[Figure: Virtex-4 device, courtesy Xilinx]

PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
Soft core µProc: MicroBlaze, 180 MHz, < ~1300 LUTs, 166 DMIPS
IBM CoreConnect™ Bus
IP
H.264/AVC hardware blocks

SLIDE 3

What about FP-SOC Design Tools

Synthesis
  Behavior-level synthesis: from behavior specification (e.g. C, SystemC, or Matlab) to RTL or netlists
  System-level synthesis: from system specification to system implementation
Verification
  Behavior-level verification
  System-level verification

ESL Tools: A Lot of Interest …

SLIDE 4

[Figure: Gartner Dataquest's ESL Landscape, 2005]

xPilot: Platform-Based Synthesis System

[Figure: xPilot system flow]
Inputs: SystemC/C; Platform Description & Constraints
xPilot Front End
SSDM (System-Level Synthesis Data Model)
xPilot
  Profiling, Analysis, Mapping
  Behavioral Synthesis
  Interface Synthesis
  Processor & Architecture Synthesis
Outputs: Custom Logic; Drivers + Glue Logic; Processor Cores + Executables
Target: FPSoC

Uniqueness of xPilot
  Platform-based synthesis and optimization
  Communication-centric synthesis with interconnect optimization

SLIDE 5

Outline

Motivation
xPilot system framework
Behavior-level synthesis in xPilot
  Advantages of behavioral synthesis
  Scheduling
  Resource binding
System-level synthesis in xPilot
  Synthesis for ASIP platforms
  Design exploration for heterogeneous MPSoCs
Conclusions

Motivation (1)

Design complexity is outgrowing the traditional RTL method
Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction
Reasons for previous failures
  Lack of a compelling reason: design complexity was still manageable a decade ago
  Lack of a solid RTL foundation
  Lack of consideration of physical reality

SLIDE 6

Motivation (2)

Behavioral synthesis provides combined advantages
  Shorter verification/simulation cycle
    Better complexity management, faster time to market
  Rapid system exploration
    Quick evaluation of different hardware/software boundaries
    Fast exploration of multiple micro-architecture alternatives
  Higher quality of results
    Platform-based synthesis & optimization
    Full consideration of physical reality

Advantages: Better Complexity Management

Shorter verification/simulation cycle
  Simulation speed 100X faster than RTL-based method [NEC, ASPDAC04]
Significant code size reduction
  RTL design ~300KL; behavioral design 40KL [NEC, ASPDAC04]
  VHDL code generated by UCLA xPilot targeting Altera Stratix platform
  Over 10x code size reduction can be achieved

SLIDE 7

Advantages: Rapid System Exploration (1)

Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries

Example: Motion-JPEG implementation
  All HW implementation
  All SW implementation (using embedded processors)
  SW/HW co-design: optimal partitioning?
  Repeated manual RTL coding is not a solution!

Advantages: Rapid System Exploration (2)

Fast exploration of multiple micro-architecture alternatives
  Different hardware implementations can be easily obtained by varying the high-level spec. and applying different design constraints

Target cycle time  State#  Fmax (MHz)  Cycle#  Latency (µs)  LE#   DSP#
9 ns               34      123.56      4830    39.1          1777  128
7 ns               36      147.28      5211    35.4          1862  128
5.5 ns             51      183.62      6926    37.8          1926  128

Platform: Altera Stratix
RTL synthesis & place-and-route: Altera QuartusII v5.0
Simulation: Mentor ModelSim SE6.0
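The trade-off in the table above can be checked with a few lines of arithmetic. The sketch below assumes that latency is simply the cycle count divided by the achieved clock frequency; with Fmax in MHz the result is in microseconds, and it approximately reproduces the tabulated latency values. The function and variable names are illustrative only.

```python
# Rows taken from the exploration table: (target cycle time in ns, cycle count, Fmax in MHz).
designs = [
    (9.0, 4830, 123.56),
    (7.0, 5211, 147.28),
    (5.5, 6926, 183.62),
]


def latency_us(cycles: int, fmax_mhz: float) -> float:
    """Execution time in microseconds: cycles divided by clock frequency in MHz."""
    return cycles / fmax_mhz


# Latency per target cycle time, and the target that gives the shortest latency.
latencies = {target: latency_us(cycles, fmax) for target, cycles, fmax in designs}
best_target = min(latencies, key=latencies.get)
```

Under this assumption the 7 ns target gives the lowest latency: a tighter target raises Fmax but also adds states and cycles, so the fastest clock is not automatically the fastest design.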

SLIDE 8

Advantages: Higher Quality of Results (1)

Platform-based synthesis & optimization
  The quality of an RTL design is platform-dependent
  Designers often lack complete and detailed knowledge of the target platform

Resource         Area          Delay (ns)
ADDSUB-24b       25 LUTs       2.27
ADDSUB-32b       33 LUTs       2.61
MUX8to1-24b      120 LUTs      2.92
MUX16to1-24b     264 LUTs      4.658
DSPMUL-18bx18b   2 DSP Blocks  3.833
DSPMUL-24bx24b   8 DSP Blocks  7.688

Platform: Altera Stratix
RTL synthesis & place-and-route: Altera QuartusII v5.0

[Figure: 3x3 delay matrix over chip coordinates (0,0) to (95,61). Values as extracted: 4.7, 3.8, 2.8; 3.7, 2.9, 2.0; 2.8, 1.8, 0.58]

Motivation: Higher Quality of Results (2)

Communication-centric synthesis & optimization with full consideration of physical reality
  System performance & power are dominated by interconnect
  It is difficult for designers to consider physical layout at the RT level

Layout-aware performance optimization
  Overlap computation with communication
  [Figure: add1, mul1, add2, mul2 placed on the chip, with data transfer between them]

Layout-aware power optimization
  [Figure: CDFG with a conditional branch (T / F) and multiplications 2* to 6*; bindings shown: mul1 (2,5,6), mul2 (3,4) and mul1 (2,4,5), mul2 (3,6)]
  Binding solution 1: Both multipliers stay active
  Binding solution 2: mul2 can be powered off when the false branch is taken

SLIDE 9

xPilot: Behavioral-to-RTL Synthesis Flow

Behavioral spec. in C/SystemC
Frontend compiler (with platform description)
SSDM
Presynthesis optimizations
  Loop unrolling/shifting
  Strength reduction / Tree height reduction
  Bitwidth analysis
  Memory analysis
  …
Core synthesis optimizations
  Scheduling
  Resource binding, e.g., functional unit binding, register/port binding
SSDM
µArch-generation & RTL/constraints generation
  Verilog/VHDL/SystemC
  FPGAs: Altera, Xilinx
  ASICs: Magma, Synopsys, …
RTL + constraints
FPGAs/ASICs
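Tree height reduction, one of the presynthesis optimizations listed above, can be illustrated with a small sketch. It assumes an associative, commutative operator (integer addition) and rebuilds a left-to-right chain of additions as a balanced tree, which shortens the dependency depth the scheduler later sees. The helper names `chain_depth` and `balanced_sum` are hypothetical and not part of xPilot.

```python
def chain_depth(n_operands: int) -> int:
    """Adder levels when summing left to right: ((a + b) + c) + d ..."""
    return max(n_operands - 1, 0)


def balanced_sum(operands):
    """Rebuild an associative sum as a balanced tree; return (expression, depth)."""
    level = [(name, 0) for name in operands]
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; each pair costs one more adder level.
        for i in range(0, len(level) - 1, 2):
            (left, dl), (right, dr) = level[i], level[i + 1]
            nxt.append((f"({left} + {right})", max(dl, dr) + 1))
        if len(level) % 2:
            nxt.append(level[-1])  # odd operand is carried to the next level
        level = nxt
    return level[0]


expr, depth = balanced_sum(["a", "b", "c", "d"])
```

For four operands the chain needs three adder levels while the balanced form needs two, at the same operator count.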

SystemC-to-RTL Compilation Flow

[Figure: compilation flow. Labels: SystemC specification; xPilot front-end; AST; SystemC elaboration; Netlist in XML; Behavioral IR (CDFG); SSDM; Platform description; xPilot synthesis engine; Output files (Timing/Area, RT VHDL & Constraints)]

SLIDE 10

Restricted Behavioral C Subset

Data types:
  Primitive integer types: char, byte, short, int, long…
  One-dimensional arrays of primitive integer types
Operations:
  All arithmetic and logic operations: +, -, *, /, >>, &, ...
Control flow statements:
  while, for, switch-case, if-then-else, break, continue, return, ...

Restricted Behavioral C Subset (cont.)

Unsynthesizable
  Recursions
  Pointers
  Dynamic memory allocations and system calls
  Irregular jumps, e.g., gotos

SLIDE 11

System-level Synthesis Data Model

SSDM (System-level Synthesis Data Model)
  Hierarchical netlist of concurrent processes and communication channels
  Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics
    Port / IO interfaces, bit-vector manipulations, cycle-level notations

Hardware-Specific SSDM Semantics

Process port/interface semantics
  FIFO: FifoRead() / FifoWrite()
  Buffer: BuffRead() / BuffWrite()
  Memory: MemRead() / MemWrite()
Bit-vector manipulation
  Bit extraction / concatenation / insertion
  Bit-width attributes for every operation and every value
Cycle-level notation
  Clock: waitClockEvent()
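The three bit-vector manipulations named above (extraction, concatenation, insertion) have simple integer equivalents. The sketch below models them on Python integers purely for illustration; the function names and argument conventions are assumptions and do not reflect the actual SSDM operations.

```python
def bit_extract(value: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive, bit 0 is the LSB) of value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def bit_concat(high: int, low: int, low_width: int) -> int:
    """Place `high` above a `low_width`-bit `low` field."""
    return (high << low_width) | (low & ((1 << low_width) - 1))


def bit_insert(value: int, field: int, hi: int, lo: int) -> int:
    """Overwrite bits hi..lo of value with field, leaving other bits unchanged."""
    width = hi - lo + 1
    mask = ((1 << width) - 1) << lo
    return (value & ~mask) | ((field << lo) & mask)
```

In hardware each of these is pure wiring, which is one reason a synthesis IR benefits from representing them explicitly together with bit-width attributes rather than as shift-and-mask arithmetic.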

SLIDE 12

Platform Modeling & Characterization

Target platform specification
  High-level resource library with delay/latency/area/power curve for various input/bitwidth configurations
    Functional units: adders, ALUs, multipliers, comparators, etc.
    Connectors: mux, demux, etc.
    Memories: registers, synchronous memories, etc.
  Chip layout description
    On-chip resource distributions
    On-chip interconnect delay/power estimation

[Figure: 3x3 delay matrix for Stratix-EP1S40 over chip coordinates (0,0) to (95,61). Values as extracted: 4.7, 3.8, 2.8; 3.7, 2.9, 2.0; 2.8, 1.8, 0.58]
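One way to picture a high-level resource library is as a table keyed by resource kind and bitwidth. The sketch below is a hypothetical data structure, not xPilot's format; it is populated with the two ADDSUB entries quoted earlier in the deck for Altera Stratix, and the policy of falling back to the narrowest characterized width that is wide enough is an assumption made for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    area_luts: int
    delay_ns: float


# (kind, bitwidth) -> characterization; values are the ADDSUB rows from the Stratix table.
LIBRARY = {
    ("ADDSUB", 24): Entry(25, 2.27),
    ("ADDSUB", 32): Entry(33, 2.61),
}


def lookup(kind: str, bitwidth: int) -> Entry:
    """Return the narrowest characterized unit that is at least `bitwidth` wide."""
    widths = sorted(w for k, w in LIBRARY if k == kind and w >= bitwidth)
    if not widths:
        raise KeyError(f"no {kind} characterized for {bitwidth} bits")
    return LIBRARY[(kind, widths[0])]
```

A scheduler or binder can then query `lookup("ADDSUB", 20)` instead of hard-coding delays, which is what makes the synthesis platform-based.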

Scheduling: Problem Statement

Scheduling problem in behavioral synthesis
  Given:
    A control data flow graph (CDFG) which captures the behavior of the input description
    A set of scheduling constraints: resource constraints, latency constraints, frequency constraints, relative IO timing constraints, etc.
  Goal:
    Assign the operations to control states so that a particular design objective (performance / power) is optimized while all the constraints are satisfied.

Highlights of our scheduling engine
  Applicable to a wide range of application domains
    Computation-intensive, memory-intensive, control-intensive, partially timed, etc.
  Offers a variety of optimization techniques in a unified framework
    Operation chaining, behavioral template, relative scheduling, physical layout consideration, etc.

[Figure: operations assigned to control states CS0 and CS1]
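Operation chaining means placing dependent operations in the same control state when their combined delay fits in one clock cycle. The sketch below is a minimal feasibility check under the simplifying assumption that a chain's delay is the sum of its unit delays; it uses the adder (2 ns) and multiplier (5 ns) delays and the 10 ns target cycle time from the example in the overall-approach slide. The names are illustrative.

```python
# Unit delays in ns, from the platform characterization of the scheduling example.
DELAY_NS = {"add": 2.0, "sub": 2.0, "mul": 5.0}


def chain_delay(chain) -> float:
    """Combinational delay of a chain of dependent operations (sum of unit delays)."""
    return sum(DELAY_NS[op] for op in chain)


def fits_in_cycle(chain, cycle_ns: float) -> bool:
    """True if the whole chain can be placed in a single control state."""
    return chain_delay(chain) <= cycle_ns
```

With these delays, `mul -> sub -> add` takes 9 ns and can be chained in a 10 ns cycle, while `mul -> mul -> sub -> add` takes 14 ns and must be split across at least two states. The second case is the kind of situation a frequency constraint expresses.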

SLIDE 13

Scheduling: Overall Approach

Overall approach
  Current objective: high-performance
  Use a system of integer difference constraints to express all kinds of scheduling constraints
  Represent the design objective in a linear function

Example
  [Figure: data flow graph with operations v1 (+), v2 (*), v3 (*), v4 (-), v5 (+)]
  Platform characterization:
    adder (+/-): 2ns
    multiplier (*): 5ns
  Target cycle time: 10ns
  Resource constraint: Only ONE multiplier is available

Dependency constraints
  v1 → v3: $x_3 - x_1 \ge 0$
  v2 → v3: $x_3 - x_2 \ge 0$
  v3 → v4: $x_4 - x_3 \ge 0$
  v4 → v5: $x_5 - x_4 \ge 0$
Frequency constraint
  <v2, v5>: $x_5 - x_2 \ge 1$
Resource constraint
  <v2, v3>: $x_3 - x_2 \ge 1$

Matrix form $A x \le b$ (the resource constraint on <v2, v3> subsumes the dependency constraint on the same pair):

\[
\begin{bmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ -1 \end{bmatrix}
\]

Totally unimodular matrix: guarantees integral solutions
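A system of difference constraints of the form x_j - x_i >= w can also be solved without an LP solver. The sketch below is not the LP formulation described on the slide; it swaps in Bellman-Ford longest-path relaxation, which returns the componentwise smallest non-negative solution (an ASAP schedule). The function name and the 0-based encoding of v1..v5 are assumptions for illustration.

```python
def solve_difference_constraints(n, constraints):
    """Smallest non-negative integers x with x[j] - x[i] >= w for each (i, j, w).

    Runs Bellman-Ford longest-path relaxation; returns None if the
    constraints contain a positive cycle (no feasible schedule).
    """
    x = [0] * n
    for _ in range(n):
        changed = False
        for i, j, w in constraints:
            if x[i] + w > x[j]:
                x[j] = x[i] + w
                changed = True
        if not changed:
            return x
    return None


# Example from the slide, with v1..v5 mapped to indices 0..4.
constraints = [
    (0, 2, 0),  # dependency v1 -> v3
    (1, 2, 0),  # dependency v2 -> v3
    (2, 3, 0),  # dependency v3 -> v4
    (3, 4, 0),  # dependency v4 -> v5
    (1, 4, 1),  # frequency constraint <v2, v5>
    (1, 2, 1),  # resource constraint <v2, v3>: one multiplier
]
schedule = solve_difference_constraints(5, constraints)
```

For this example the result places v1 and v2 in control state 0 and v3, v4, v5 in control state 1. An LP solver becomes useful once the objective is something other than the earliest possible schedule.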

Scheduling: Design Framework

[Figure: xPilot scheduler]
Inputs
  CDFG
  User-specified design constraints & assignments
  Target platform modeling (resource library & chip layout)
Constraint equations generation: system of pairwise difference constraints
  Dependency constraints
  Frequency constraints
  Resource constraints
  Relative timing constraints
  …
Objective function generation
Linear programming solver
LP solution interpretation
Output: STG (State Transition Graph)

SLIDE 14

Unified Resource Binding

An efficient architectural exploration framework
  Simultaneous functional unit, register, and port binding
  Emphasis on the interconnect and steering logic networks
  Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
  Extendable to exploit physical layout information

[Figure: xPilot architecture exploration loop. Inputs: STG (State Transition Graph); platform info & user-specified constraints. Steps: Baseline Register Binding; FU Allocation/Binding; Register Allocation/Binding; Datapath model for estimation; "Improved?" (Yes: iterate, No: stop). Output: STG + Best Datapath Models]

Resource Binding: Problem Statement

Resource binding problem
  Given:
    (1) A scheduled control data flow graph, i.e., STG
    (2) Design constraints: performance, delay, or power, etc.
  Goal:
    Assign the operations and variables to functional units and registers, respectively, so that their executions or lifetimes do not conflict, and all of the design constraints are satisfied.

Properties of the problem
  FU and register binding are highly correlated
  Simultaneous FU and register binding considering interconnection is very difficult

Two binding solutions: which one is better?
  [Figure: two binding solutions for operations +1 and +2. Labels: MUX, ALU]
  The answer depends on:
    1. How large the MUX and ALU are (platform-dependent)
    2. Performance and area constraints
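The lifetime-conflict condition in the problem statement can be made concrete with register binding alone. The sketch below is not xPilot's simultaneous FU/register/port binding; it is a left-edge style greedy algorithm that only guarantees that two variables with overlapping lifetimes never share a register. Lifetimes are given as hypothetical (start, end) control-state intervals.

```python
def left_edge_binding(lifetimes):
    """Bind variables to registers so that overlapping lifetimes never share one.

    lifetimes: dict name -> (start, end); the value occupies states start..end-1.
    Returns dict name -> register index.
    """
    binding = {}
    reg_free_at = []  # reg_free_at[r] = first state at which register r is free
    # Visit variables in order of increasing start state (the "left edge").
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for r, free_at in enumerate(reg_free_at):
            if free_at <= start:
                reg_free_at[r] = end
                binding[name] = r
                break
        else:
            reg_free_at.append(end)  # no reusable register: allocate a new one
            binding[name] = len(reg_free_at) - 1
    return binding


example = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
regs = left_edge_binding(example)
```

This minimizes register count but ignores which functional unit produces or consumes each value, so it can create large multiplexers. That coupling is why the slide calls FU and register binding highly correlated.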

SLIDE 15

Distributed Register-File (DRF) Microarchitecture

[Figure: DRF microarchitecture. Labels: Island A, Island B, Island C; Data Import Logic; Local Register File; FU pool (MUL, ALU, ALU'); Buffers]

Regular datapath structure
Provides opportunities to hide large MUXes in register files
Computations and communications are localized
Allows replicated values among islands
Enables efficient optimizations to control interconnects among islands

Advantages of DRF Microarchitecture

[Figure: DFG (part of Chen DCT) and the scheduled DFG under the resource constraint of 1 FU]

DRF result:
  Datapath with more regularity
  Hide MUX into the register file
  Especially effective for FPGA designs
Discrete register result:
  MUX implementation may be very expensive (e.g., on FPGAs)

SLIDE 16

Platform-Based Interface Synthesis

Focus on sequential communication channels
  Data must be read and written in the same order
  Example: FIFO (FSL in VirtexII), Bus (in both Stratix and Virtex)
Order may have a dramatic impact on performance
  The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
Interface synthesis for sequential communication channels
  Consider both the behavior model and communication topology to detect the optimal transmission order
  Automatically do interface generation for sequential communication units, as well as code transformation for behavior models

Overall Approach to Interface Synthesis

Reduce the order detection problem to a min-latency scheduling problem:
  Merge the CDFGs of all processes
  Each element to be transferred on the FIFO is transformed into a special operation T
  Only one T can be scheduled at each step
Example, assuming only 1 cycle is needed for a FIFO operation
  [Figure: merged CDFG of Process 1 and Process 2 with transfer operations T1, T2, T3 and operations + and *; scheduling result, order is (1,3,2)]
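The reduction above can be sketched as list scheduling on the merged graph with a channel that accepts at most one transfer per step. The graph below is hypothetical (the slide's figure is not recoverable from the text); it was chosen so that a critical-path-first priority yields the order (1,3,2) reported on the slide. All operations are assumed to take one step, and the priority rule is an assumption, not necessarily xPilot's.

```python
def schedule_transfers(succ, transfers):
    """List-schedule a unit-latency DAG with at most one transfer per step.

    succ: dict op -> list of successor ops (merged CDFG, hypothetical encoding).
    transfers: set of ops that use the shared sequential channel.
    Returns (step per op, transfer order).
    """
    ops = list(succ)
    preds = {op: [] for op in ops}
    for op, outs in succ.items():
        for o in outs:
            preds[o].append(op)

    memo = {}

    def height(op):
        """Longest path from op to any sink, counted in operations."""
        if op not in memo:
            memo[op] = 1 + max((height(s) for s in succ[op]), default=0)
        return memo[op]

    step_of, order, step = {}, [], 0
    while len(step_of) < len(ops):
        # Ready: unscheduled ops whose predecessors all finished in earlier steps.
        ready = [op for op in ops if op not in step_of
                 and all(step_of.get(p, step) < step for p in preds[op])]
        ready.sort(key=height, reverse=True)  # most critical first
        channel_used = False
        for op in ready:
            if op in transfers:
                if channel_used:
                    continue  # channel busy this step, retry next step
                channel_used = True
                order.append(op)
            step_of[op] = step
        step += 1
    return step_of, order


graph = {"T1": ["mul"], "T2": ["out"], "T3": ["add"],
         "mul": ["add"], "add": ["out"], "out": []}
steps, order = schedule_transfers(graph, {"T1", "T2", "T3"})
```

Sending T2 before T3 in this graph would delay `add` and therefore `out` by one step, which is the effect the slide describes as a non-critical transmission delaying a critical one.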

SLIDE 17

Experimental Results: Benchmark Suite

Benchmark suite
  PR, MCM:
    DSP kernels: pure additions/subtractions and multiplications
  CACHE:
    Cache controller: control-intensive designs with cycle-accurate I/O operations
  MOTION:
    Motion compensation algorithm for MPEG-1 decoder: control-intensive with modest amount of computations
  IDCT:
    JPEG inverse discrete cosine transform: computation intensive
  DWT:
    JPEG2000 discrete wavelet transform: computation intensive with modest control flow
  EDGELOOP:
    Extracted from H.264 decoder: a very complex design, features a mix of computation, control, and memory accesses

SystemC/C-to-FPGA Design Flow (Altera)

[Figure: design flow]
SystemC/C specification; platform description & constraints
Front-end compiler
SSDM (System-Level Synthesis Data Model)
xPilot behavioral synthesis
  SSDM/CDFG
  Behavioral synthesis
  SSDM/FSMD
  RTL generation
FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints
Altera QuartusII v5.0
Stratix/StratixII device configurations

SLIDE 18

Experimental Results: Altera

Device setting: Stratix
Target frequency: 200 MHz

         Line Count   Resource Usage                              Fmax
Designs  C     VHDL   LE    COMB  Lonely-Reg  Comb-Reg  DSP       (MHz)
PR       90    600    1349  713   84          552       -         178.7
WANG     90    727    1105  527   62          516       8         166.11
LEE      141   696    1585  691   207         687       4         166.61
MCM      161   1260   2402  981   73          1348      -         152.56
DIR      190   1352   3489  1752  110         1627      4         146.8

Experimental Results: Comparison with SPARK

On Altera Stratix FPGA
SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool

Designs  Tool    LE    COMB  Lonely-Reg  Comb-Reg  DSP   Fmax (MHz)
PR       SPARK   1108  815   -           293       -     123.5
PR       xPilot  1349  713   84          552       -     178.7
WANG     SPARK   1217  942   -           275       -     118.9
WANG     xPilot  1105  527   62          516       8     166.1
LEE      SPARK   1367  1052  -           315       -     119.3
LEE      xPilot  1585  691   207         687       4     166.6
MCM      SPARK   2808  2248  -           560       -     74.87
MCM      xPilot  2402  981   73          1348      -     152.6
DIR      SPARK   2425  2034  -           391       6     69.38
DIR      xPilot  3489  1752  110         1627      4     146.8

Ave Ratio (xPilot / SPARK, with SPARK = 1): LE 1.12, COMB 0.68, Lonely-Reg n/a*, Comb-Reg 2.50, DSP n/a*, Fmax 1.68

On average, xPilot resource binding achieves designs with similar area and 1.68x higher frequency than SPARK
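The average ratios quoted for the SPARK comparison can be reproduced from the per-design numbers. The sketch below assumes that "Ave Ratio" is the arithmetic mean of the per-design xPilot/SPARK ratios, an assumption that matches the reported 1.12 (LE) and 1.68 (Fmax) after rounding. The tuple layout is illustrative.

```python
# (design, SPARK LE, SPARK Fmax in MHz, xPilot LE, xPilot Fmax in MHz)
rows = [
    ("PR",   1108, 123.5, 1349, 178.7),
    ("WANG", 1217, 118.9, 1105, 166.1),
    ("LEE",  1367, 119.3, 1585, 166.6),
    ("MCM",  2808, 74.87, 2402, 152.6),
    ("DIR",  2425, 69.38, 3489, 146.8),
]


def mean_ratio(pairs):
    """Arithmetic mean of per-design xPilot/SPARK ratios."""
    ratios = [xpilot / spark for spark, xpilot in pairs]
    return sum(ratios) / len(ratios)


le_ratio = mean_ratio((r[1], r[3]) for r in rows)
fmax_ratio = mean_ratio((r[2], r[4]) for r in rows)
```

Note that a mean of ratios weights every design equally; the two largest designs (MCM, DIR) show roughly 2x frequency gains, while the three smaller ones show about 1.4x.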

SLIDE 19

SystemC/C-to-FPGA Design Flow (Xilinx)

[Figure: design flow]
SystemC/C specification; platform description & constraints
Front-end compiler
SSDM (System-Level Synthesis Data Model)
xPilot behavioral synthesis
  SSDM/CDFG
  Behavioral synthesis
  SSDM/FSMD
  RTL generation
FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints
Xilinx ISE i7.1
VirtexII(-Pro)/Virtex-4 device configurations

Experimental Results: Xilinx

Device setting: xc2vp30-7
Target frequency: 200 MHz

         Line Count   Resource Usage               Fmax
Designs  C     VHDL   Slices  (LUT)  (FF)   DSP    (MHz)
PR       90    600    331     416    564    16     146.84
WANG     90    727    357     464    588    15     133.51
LEE      141   696    356     484    659    19     131.93
MCM      161   1260   887     1207   1282   30     110.38
DIR      190   1352   979     1002   1732   56     98.81

SLIDE 20

Synthesis from Behavior to DRF

Data import logic is the most "critical"
Operations bound to an island form a "chain" in DFG
Optimize complexity of inter-island connections
  Min-cut chain partitioning improves design quality

Design   Micro-Architecture  Slices  LUT  FF     RAM Blk#  MUL  Clock Period (ns)
CHEN-24  DRF (6 islands)     425     808  86     6         6    12.04
CHEN-24  Baseline            798     928  1,235  -         66   10.37
PR-24    DRF (5 islands)     457     860  63     5         3    10.62
PR-24    Baseline            701     858  1,076  -         48   11.69

Device: Xilinx Virtex II -6; Target clock period: 10ns

Observations: Large area (Slices and Multiplier blocks) reduction by using on-chip RAM blocks to implement register files, with small impact on Fmax

Initial Results of Interface Synthesis

Target for sequential communication channels
  In particular, FSL in VirtexII
Consider two communicating processes
20+% performance improvement on average by optimizing communication ordering

SLIDE 21

Outline

Motivation
xPilot system framework
Behavior-level synthesis in xPilot
  Advantages of behavioral synthesis
  Scheduling
  Resource binding
System-level synthesis in xPilot
  Synthesis for ASIP platforms
  Design exploration for heterogeneous MPSoCs
Conclusions

Design Exploration for Heterogeneous MPSoC Platforms

Heterogeneous MPSoCs exploration
  Processors
    Heterogeneous vs. homogeneous
    General-purpose vs. application-specific
  On-chip communication architecture (OCA)
    Bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
  Memory hierarchy

[Figure: MPSoC platform. Labels: µP (OS, Driver, tasks); IP; FPGA; DSP; Network Interface; Communication Network]

SLIDE 22

Configurable SoC Platforms

General purpose processor cores + programmable fabric
  Tight integration using extended instructions (ASIPs)
    Example: Altera Nios / Nios II
  Loose integration using FIFOs/busses for communications
    Example: Xilinx MicroBlaze, etc.

[Figure: Custom instruction logic for Nios II (source: www.altera.com); Xilinx MicroBlaze (source: www.xilinx.com)]

ASIP Compilation: Problem Statement

Given:

CDFG G(V, E)

The basic instruction set I

Pattern constraints:

Number of inputs: $|PI(p_i)| \le N_{in}$

Number of outputs: $|PO(p_i)| = 1$

Total area: $\sum_{1 \le i \le N} \mathrm{area}(p_i) < A$

Objective:

Generate a pattern library P

Map G to the extended instruction set $I \cup P$, so that the total execution time is minimized

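[Figure: data-flow graph with inputs a, b, c, d, e, three multiplications and three additions producing t4, t5 and t6]

Original code (* takes 2 clock cycles, + takes 1 clock cycle):

t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4;

Code mapped to the extended instructions ext-inst1 (MAC1: 2 cycles) and ext-inst2 (MAC2: 2 cycles):

t4 = ext-inst1(a, b, c); t5 = ext-inst2(b, c, d, e); t6 = t4 + t5;

Performance speedup = 9 / 5 = 1.8X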
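The 9-cycle and 5-cycle totals behind the 1.8X figure can be checked with a short script. This is a minimal sketch that assumes a single-issue core executing the operations one after another with no overlap; the latencies are the ones quoted on the slide, while the variable and function names are illustrative only.

```python
# Latencies quoted on the slide: multiply = 2 cycles, add = 1 cycle,
# and each extended (MAC) instruction = 2 cycles.
LATENCY = {"*": 2, "+": 1, "ext-inst1": 2, "ext-inst2": 2}

# Original code: t1..t3 are multiplies, t4..t6 are adds.
original_ops = ["*", "*", "*", "+", "+", "+"]
# Mapped code: two extended instructions followed by the final add.
mapped_ops = ["ext-inst1", "ext-inst2", "+"]


def total_cycles(ops):
    """Sum the latencies, assuming strictly sequential single-issue execution."""
    return sum(LATENCY[op] for op in ops)


before = total_cycles(original_ops)
after = total_cycles(mapped_ops)
speedup = before / after
print(before, after, speedup)
```

The script prints 9, 5 and 1.8, matching the speedup stated on the slide.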

SLIDE 23

Page 23

Target Core Processor Model

[Figure: core processor pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) with instruction cache, register file (RS1, RS2), ALU (OP1, OP2), memory, PC with +4 adder, result MUX, and custom logic]

Core processor model

Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)

The number of input and output operands of an instruction is predetermined

An instruction reads the core register file during the execute stage, and commits the result during the write-back stage

ASIP Compilation Flow

Front-end compilation

Backend compilation

1. Pattern generation: satisfying input/output constraints
2. Pattern selection: select a subset to maximize the potential speedup while satisfying the resource constraint
3. Application mapping & graph covering: graph covering to minimize the total execution time

[Figure: compilation flow with the labels C code, µArch constraint, CDFG, pattern library, optimized CDFG, optimized assembly]
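The pattern selection step can be illustrated as a small subset-selection problem. The sketch below is not the xPilot algorithm: it assumes each candidate pattern has a fixed area and an independent, additive gain (cycles saved), and it enumerates subsets exhaustively. The candidate names, areas, gains and the area budget are all hypothetical.

```python
from itertools import combinations

# Hypothetical candidate patterns: name -> (area, estimated cycles saved).
candidates = {
    "mac1": (40, 2),
    "mac2": (55, 2),
    "add3": (30, 1),
    "mul_add_mul": (90, 4),
}
AREA_BUDGET = 130  # hypothetical value of the area limit A


def select_patterns(cands, budget):
    """Return the subset with the largest total gain whose total area is below budget."""
    best_gain, best_subset = 0, ()
    names = sorted(cands)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            area = sum(cands[n][0] for n in subset)
            gain = sum(cands[n][1] for n in subset)
            # Strict inequality, following the "total area < A" constraint.
            if area < budget and gain > best_gain:
                best_gain, best_subset = gain, subset
    return best_subset, best_gain


chosen, gain = select_patterns(candidates, AREA_BUDGET)
print(chosen, gain)
```

Exhaustive enumeration is only practical for a handful of candidates. Treating gains as additive also ignores the fact that overlapping patterns cannot both cover the same CDFG nodes, which the later graph-covering step has to resolve.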

SLIDE 24

Page 24

Experimental Results on Altera Nios

Benchmark  Extended      Speedup     Speedup  Resource overhead
           instruction#  Estimation  Nios     LE            Memory           DSP Block
fft_br     9             3.28        2.65     408 (6.06%)   65,536 (9.79%)   16
iir        7             3.18        3.73     255 (3.79%)   4,736 (0.71%)    40
fir        2             2.40        2.14     51 (0.76%)    1,024 (0.15%)    8
pr         2             1.57        1.75     71 (1.05%)    0.00%            14
dir        2             3.28        3.02     54 (0.80%)    0.00%            16
mcm        4             4.75        3.22     186 (2.76%)   0.00%            56
Average                  3.08        2.75     2.54%         1.77%

Altera Nios is used for ASIP implementation

5 extended instruction formats

up to 2048 instructions for each format

Small DSP applications are taken as benchmarks
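The Average row can be cross-checked against the per-benchmark rows. The sketch below simply recomputes the column means from values transcribed from the results table; the dictionary layout and the helper name are illustrative.

```python
# benchmark -> (Estimation speedup, Nios speedup, LE overhead in %)
results = {
    "fft_br": (3.28, 2.65, 6.06),
    "iir": (3.18, 3.73, 3.79),
    "fir": (2.40, 2.14, 0.76),
    "pr": (1.57, 1.75, 1.05),
    "dir": (3.28, 3.02, 0.80),
    "mcm": (4.75, 3.22, 2.76),
}


def column_mean(index):
    """Arithmetic mean of one column over all six benchmarks."""
    return sum(row[index] for row in results.values()) / len(results)


avg_estimation = column_mean(0)
avg_nios = column_mean(1)
avg_le_percent = column_mean(2)
print(round(avg_estimation, 2), round(avg_nios, 2), round(avg_le_percent, 2))
```

The recomputed means agree with the 3.08, 2.75 and 2.54% reported in the Average row.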

Architecture Extension for ASIPs

Data bandwidth problem

Limited register file bandwidth (two read ports, one write port)

~40% of the ideal performance speedup will be lost

Shadow-register-based architectural extension

Core registers are augmented by an extra set of shadow registers

Conditionally written during the write-back stage

Low power/area overhead

Novel shadow-register binding algorithms are developed

[Figure: core processor pipeline (IF/ID, ID/EX, EX/MEM, MEM/WB) extended with custom logic, shadow registers SR1 ... SRK, and a hashing unit computing k = hash(j)]
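A toy model may help picture the conditional shadow-register write. It is only a sketch under stated assumptions: the slide gives k = hash(j) without defining the hash, so the modulo used here is a stand-in, and the class, method and flag names are hypothetical rather than taken from the actual design.

```python
NUM_SHADOW = 4  # hypothetical K, the number of shadow registers


def shadow_index(j, num_shadow=NUM_SHADOW):
    """Stand-in for the slide's k = hash(j); the modulo is an assumption."""
    return j % num_shadow


class RegisterFiles:
    """Core register file plus a small set of shadow registers."""

    def __init__(self, num_core=32, num_shadow=NUM_SHADOW):
        self.core = [0] * num_core
        self.shadow = [0] * num_shadow

    def write_back(self, j, value, copy_to_shadow=False):
        """Commit a result to core register j; optionally mirror it into a shadow register."""
        self.core[j] = value
        if copy_to_shadow:
            self.shadow[shadow_index(j, len(self.shadow))] = value


regs = RegisterFiles()
regs.write_back(5, 11, copy_to_shadow=True)  # also lands in shadow register 5 % 4 = 1
regs.write_back(6, 7)                        # core register only
print(regs.core[5], regs.core[6], regs.shadow)
```

In this model, a value mirrored into a shadow register could be supplied to custom logic without using one of the two core read ports, which is the bandwidth limit the slide describes.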

SLIDE 25

Page 25

Problem Statement: Mapping for Heterogeneous Integration with Multiple Processing Cores

Given:

A library of processing cores L

Task graph G(V, E)

For each v in V, execution time t(v, p_i) on p_i

For each (u, v) in E, communication data size s(u, v)

Cost (area/power) constraint C

Problem:

Select and instantiate the processing elements from L

Generate the on-chip communication architecture and topology

Map the tasks onto the processing elements so that

The total latency is minimized while the final implementation cost is less than C
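A brute-force version of the mapping problem on a tiny instance is sketched below. It is deliberately simplified and is not the xPilot method: it assumes a chain-shaped task graph executed sequentially, at most one instance of each core type, and a communication delay proportional to s(u, v) whenever an edge crosses two different cores. It does not synthesize a communication architecture, and every number in it is hypothetical.

```python
from itertools import product

core_cost = {"cpu": 10, "dsp": 15, "hw": 30}   # hypothetical library L with costs
tasks = ["a", "b", "c"]                        # V, a simple chain a -> b -> c
edges = {("a", "b"): 4, ("b", "c"): 2}         # E with data sizes s(u, v)
exec_time = {                                  # t(v, p_i)
    "a": {"cpu": 2, "dsp": 6, "hw": 1},
    "b": {"cpu": 9, "dsp": 3, "hw": 1},
    "c": {"cpu": 4, "dsp": 4, "hw": 1},
}
COST_LIMIT = 30            # C
COMM_TIME_PER_UNIT = 0.5   # hypothetical delay per unit of transferred data


def evaluate(mapping):
    """Return (cost, latency) of a task-to-core mapping under the assumptions above."""
    cost = sum(core_cost[p] for p in set(mapping.values()))
    latency = sum(exec_time[v][mapping[v]] for v in tasks)
    latency += sum(size * COMM_TIME_PER_UNIT
                   for (u, v), size in edges.items() if mapping[u] != mapping[v])
    return cost, latency


def best_mapping():
    """Try every assignment and keep the lowest-latency one with cost below C."""
    best = None
    for choice in product(core_cost, repeat=len(tasks)):
        mapping = dict(zip(tasks, choice))
        cost, latency = evaluate(mapping)
        if cost < COST_LIMIT and (best is None or latency < best[1]):
            best = (mapping, latency, cost)
    return best


mapping, latency, cost = best_mapping()
print(mapping, latency, cost)
```

With these numbers the search places task a on the cpu and tasks b and c on the dsp. The hw core would be fastest for every task, but its cost alone reaches the limit C, so it is excluded.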

Preliminary Results on Motion-JPEG Example

[Figure: RAW images are converted to encoded JPEG images on a Xilinx XUP board. Blocks: Preprocess, DCT, Quant, Huffman, Table Modification; in the alternative design the DCT block is replaced by HW-DCT]

Model #1: 5 MicroBlazes, FSL-based communication

Model #2: 4 MicroBlazes + DCT on FPGA fabric

System    Area (Slice#)  Cycle#         Fmax (MHz)  Exe Time (ms)
Model #1  4306           23812          126         0.189
Model #2  6345           14800 (-38%)   126         0.117
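The execution times and the 38% cycle reduction follow directly from the cycle counts and the 126 MHz clock. The short check below uses only the reported figures; the variable names are illustrative.

```python
FMAX_HZ = 126e6  # both models run at 126 MHz
cycles = {"Model #1": 23812, "Model #2": 14800}
area_slices = {"Model #1": 4306, "Model #2": 6345}

# Execution time in milliseconds = cycles / frequency * 1000.
exe_time_ms = {name: n / FMAX_HZ * 1e3 for name, n in cycles.items()}

# Relative change of Model #2 with respect to Model #1.
cycle_change = (cycles["Model #2"] - cycles["Model #1"]) / cycles["Model #1"]
area_ratio = area_slices["Model #2"] / area_slices["Model #1"]

print({name: round(t, 3) for name, t in exe_time_ms.items()})
print(f"cycle change: {cycle_change:.0%}, area ratio: {area_ratio:.2f}")
```

The computed times round to 0.189 ms and 0.117 ms and the cycle change rounds to -38%, consistent with the reported numbers. The hardware DCT in Model #2 costs roughly 1.47 times the slices of Model #1.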

SLIDE 26

Page 26

Conclusions

xPilot can automatically synthesize behavior-level C or SystemC representations to RTL code with the necessary design constraints

Platform-based synthesis with physical planning provides

Shorter verification/simulation cycle

Better complexity management, faster time to market

Rapid system exploration

Higher quality of results

xPilot can help to explore the efficient use of (multiple) on-chip processors

xPilot can efficiently optimize the software for reconfigurable processors

We are interested in engaging with selected industrial partners to further validate and enhance the technology

Acknowledgements

We would like to thank the following for their support:

National Science Foundation (NSF)

Gigascale Systems Research Center (GSRC)

Semiconductor Research Corporation (SRC)

Industrial sponsors under the California MICRO program (Altera, Xilinx)

Team members: