Platform-Based Synthesis for Field-Programmable SOCs


SLIDE 1

Platform-Based Synthesis for Field-Programmable SOCs

Prof. Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department

Outline

  • Motivation
  • xPilot system framework
  • Behavior-level synthesis in xPilot
      • Advantages of behavioral synthesis
      • Scheduling
      • Resource binding
  • System-level synthesis in xPilot
      • Synthesis for ASIP platforms
      • Design exploration for heterogeneous MPSoCs
  • Conclusions

SLIDE 2

Field-Programmable SOCs Are Here: Altera Stratix II FPGA

90nm Stratix II 2S60 (courtesy Altera)

  • Adaptive Logic Modules
  • M512 blocks, M4K blocks, and M-RAM blocks
  • Digital Signal Processing (DSP) blocks
  • Phase-Locked Loops (PLL)
  • High-speed I/O channels with Dynamic Phase Alignment (DPA)
  • I/O channels with external memory interface circuitry
  • 60,440 equivalent logic elements
  • 2,544,192 memory bits

Soft-core µProc: Nios II

  • Nios II /f: 185 MHz, < 900 ALMs (< 1800 LEs), 218 max DMIPS
  • Avalon™ Bus
  • IP blocks
  • Software defined radio (SDR) baseband data path reconfiguration

Field-Programmable SOCs Are Here: Xilinx Virtex-4 FPGA

Courtesy Xilinx

  • PowerPC 405 (PPC405) core: 450 MHz, 700+ DMIPS RISC core (32-bit Harvard architecture)
  • Soft-core µProc MicroBlaze: 180 MHz, < ~1300 LUTs, 166 DMIPS
  • IBM CoreConnect™ Bus
  • IP blocks
  • H.264/AVC hardware blocks

SLIDE 3

What about FP-SOC Design Tools

  • Synthesis
      • Behavior-level synthesis: from behavior specification (e.g., C, SystemC, or Matlab) to RTL or netlists
      • System-level synthesis: from system specification to system implementation
  • Verification
      • Behavior-level verification
      • System-level verification

ESL Tools: A Lot of Interest …

SLIDE 4

[Figure: Gartner Dataquest's ESL Landscape, 2005]

xPilot: Platform-Based Synthesis System

xPilot flow (figure):

  • Inputs: SystemC/C; platform description & constraints
  • xPilot front end
  • SSDM (System-Level Synthesis Data Model)
  • Analysis, mapping, profiling
  • Behavioral synthesis; processor & architecture synthesis; FPSoC interface synthesis
  • Outputs: processor cores + executables; drivers + glue logic; custom logic

  • Uniqueness of xPilot
      • Platform-based synthesis and optimization
      • Communication-centric synthesis with interconnect optimization

SLIDE 5

Outline

  • Motivation
  • xPilot system framework
  • Behavior-level synthesis in xPilot
      • Advantages of behavioral synthesis
      • Scheduling
      • Resource binding
  • System-level synthesis in xPilot
      • Synthesis for ASIP platforms
      • Design exploration for heterogeneous MPSoCs
  • Conclusions

Motivation (1)

  • Design complexity is outgrowing the traditional RTL method
  • Behavioral synthesis: a critical technology for enabling the move to a higher level of abstraction
  • Reasons for previous failures
      • Lack of a compelling reason: design complexity was still manageable a decade ago
      • Lack of a solid RTL foundation
      • Lack of consideration of physical reality

SLIDE 6

Motivation (2)

  • Behavioral synthesis provides combined advantages
      • Shorter verification/simulation cycle
      • Better complexity management, faster time to market
      • Rapid system exploration
          • Quick evaluation of different hardware/software boundaries
          • Fast exploration of multiple micro-architecture alternatives
      • Higher quality of results
          • Platform-based synthesis & optimization
          • Full consideration of physical reality

Advantages: Better Complexity Management

  • Shorter verification/simulation cycle
      • Simulation speed 100X faster than RTL-based method [NEC, ASPDAC04]
  • Significant code size reduction
      • RTL design ~300KL vs. behavioral design 40KL [NEC, ASPDAC04]
      • VHDL code generated by UCLA xPilot targeting the Altera Stratix platform
      • Over 10x code size reduction can be achieved

SLIDE 7

Advantages: Rapid System Exploration (1)

  • Quick evaluation of various amounts of process-level concurrency and different hardware/software boundaries

Example: Motion-JPEG implementation

  • All HW implementation
  • All SW implementation (using embedded processors)
  • SW/HW co-design: optimal partitioning?
  • Repeated manual RTL coding is not a solution!

Advantages: Rapid System Exploration (2)

  • Fast exploration of multiple micro-architecture alternatives
      • Different hardware implementations can be easily obtained by varying the high-level spec. and applying different design constraints

Target cycle time   State#   Fmax (MHz)   Latency (µs)   Cycle#   DSP#   LE#
5.5ns               51       183.62       37.8           6926     128    1926
7ns                 36       147.28       35.4           5211     128    1862
9ns                 34       123.56       39.1           4830     128    1777

  • Platform: Altera Stratix
  • RTL synthesis & place-and-route: Altera QuartusII v5.0
  • Simulation: Mentor ModelSim SE6.0
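The latency column of the exploration table can be reproduced, to within rounding, as Cycle# divided by Fmax; with Fmax in MHz the quotient comes out in microseconds. The sketch below redoes that arithmetic for the three design points and picks the one with the lowest latency. The variable and function names are illustrative only and are not part of xPilot.

```python
# Rows of the exploration table:
# (target cycle time in ns, cycle count, Fmax in MHz, latency printed on the slide)
rows = [
    (5.5, 6926, 183.62, 37.8),
    (7.0, 5211, 147.28, 35.4),
    (9.0, 4830, 123.56, 39.1),
]


def latency_us(cycles, fmax_mhz):
    """Cycles divided by a frequency in MHz gives a time in microseconds."""
    return cycles / fmax_mhz


# Recompute the latency column, rounded to one decimal like the table.
computed = [round(latency_us(cycles, fmax), 1) for _, cycles, fmax, _ in rows]

# The design point with the lowest end-to-end latency.
best = min(rows, key=lambda r: latency_us(r[1], r[2]))

for (target, cycles, fmax, printed), value in zip(rows, computed):
    print(f"target {target} ns: computed {value} us, table {printed}")
print("lowest latency at target cycle time", best[0], "ns")
```

The point the table makes is visible in the numbers: a tighter target cycle time raises Fmax but also raises the cycle count, so the 7ns target ends up with the lowest overall latency.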

SLIDE 8

Advantages: Higher Quality of Results (1)

  • Platform-based synthesis & optimization
      • The quality of an RTL design is platform-dependent
      • Designers often lack complete and detailed knowledge of the target platform

Resource          Area           Delay (ns)
ADDSUB-24b        25 LUTs        2.27
ADDSUB-32b        33 LUTs        2.61
MUX8to1-24b       120 LUTs       2.92
MUX16to1-24b      264 LUTs       4.658
DSPMUL-18bx18b    2 DSP blocks   3.833
DSPMUL-24bx24b    8 DSP blocks   7.688

  • Platform: Altera Stratix
  • RTL synthesis & place-and-route: Altera QuartusII v5.0

3X3 Delay Matrix, coordinates (0,0) to (95,61):

4.7   3.8   2.8
3.7   2.9   2.0
2.8   1.8   0.58
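One way a synthesis tool can use such a characterized library is to decide whether a chain of resources fits within a target cycle time. The sketch below uses the delay values from the Stratix characterization table on this slide. It is a simplification under stated assumptions: it adds component delays only and ignores interconnect delay, and the function names are hypothetical rather than xPilot's API.

```python
# Delay values (ns) copied from the Stratix characterization table.
DELAY_NS = {
    "ADDSUB-24b": 2.27,
    "ADDSUB-32b": 2.61,
    "MUX8to1-24b": 2.92,
    "MUX16to1-24b": 4.658,
    "DSPMUL-18bx18b": 3.833,
    "DSPMUL-24bx24b": 7.688,
}


def chain_delay(resources):
    """Sum of component delays along a combinational chain (interconnect ignored)."""
    return sum(DELAY_NS[name] for name in resources)


def fits_in_cycle(resources, cycle_ns):
    """True if the whole chain can execute within one clock cycle."""
    return chain_delay(resources) <= cycle_ns


# An 8-to-1 MUX feeding an 18x18 multiplier feeding a 24-bit adder.
small = ["MUX8to1-24b", "DSPMUL-18bx18b", "ADDSUB-24b"]
# The same chain behind a 16-to-1 MUX.
large = ["MUX16to1-24b", "DSPMUL-18bx18b", "ADDSUB-24b"]

print(fits_in_cycle(small, 10.0))  # chain delay 9.023 ns
print(fits_in_cycle(large, 10.0))  # chain delay 10.761 ns
```

Under a 10ns target the only difference between the two chains is the MUX size, which is the kind of platform-dependent detail the slide argues designers rarely have at hand.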

Motivation: Higher Quality of Results (2)

  • Communication-centric synthesis & optimization with full consideration of physical reality
      • System performance & power is dominated by interconnect
      • It is difficult for designers to consider physical layout at the RT level

Layout-aware performance optimization

  • Overlap computation with communication

[Figure: add1, mul1, add2, mul2 with data transfer between them]

Layout-aware power optimization

  • Binding solution 1: mul1 (2,5,6), mul2 (3,4). Both multipliers stay active
  • Binding solution 2: mul1 (2,4,5), mul2 (3,6). mul2 can be powered off when the false branch is taken

[Figure: conditional data flow graph with true (T) and false (F) branches and multiplications 2* to 6*]

SLIDE 9

xPilot: Behavioral-to-RTL Synthesis Flow

  • Inputs: behavioral spec. in C/SystemC; platform description
  • Frontend compiler
  • SSDM
  • Presynthesis optimizations
      • Loop unrolling/shifting
      • Strength reduction / tree height reduction
      • Bitwidth analysis
      • Memory analysis
      • …
  • Core synthesis optimizations
      • Scheduling
      • Resource binding, e.g., functional unit binding, register/port binding
  • µArch generation & RTL/constraints generation
      • Verilog/VHDL/SystemC
      • FPGAs: Altera, Xilinx
      • ASICs: Magma, Synopsys, …
  • Output: RTL + constraints for FPGAs/ASICs

SystemC-to-RTL Compilation Flow

[Figure elements: SystemC specification; xPilot front-end; AST; SystemC elaboration; netlist in XML; behavioral IR (CDFG); SSDM; platform description; xPilot synthesis engine; output files (timing/area, RT VHDL & constraints)]

SLIDE 10

Restricted Behavioral C Subset

  • Data types:
      • Primitive integer types: char, byte, short, int, long…
      • One-dimensional arrays of primitive integer types
  • Operations:
      • All arithmetic and logic operations: +, -, *, /, >>, &, ...
  • Control flow statements:
      • while, for, switch-case, if-then-else, break, continue, return, ...

Restricted Behavioral C Subset (cont.)

  • Unsynthesizable
      • Recursions
      • Pointers
      • Dynamic memory allocations and system calls
      • Irregular jumps, e.g., gotos

SLIDE 11

System-level Synthesis Data Model

  • SSDM (System-level Synthesis Data Model)
      • Hierarchical netlist of concurrent processes and communication channels
      • Each leaf process contains a sequential program which is represented by an extended LLVM IR with hardware-specific semantics
          • Port / IO interfaces, bit-vector manipulations, cycle-level notations

Hardware-Specific SSDM Semantics

  • Process port/interface semantics
      • FIFO: FifoRead() / FifoWrite()
      • Buffer: BuffRead() / BuffWrite()
      • Memory: MemRead() / MemWrite()
  • Bit-vector manipulation
      • Bit extraction / concatenation / insertion
      • Bit-width attributes for every operation and every value
  • Cycle-level notation
      • Clock: waitClockEvent()
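The three bit-vector manipulations named above (extraction, concatenation, insertion) have simple integer equivalents. The sketch below shows what each one computes on unsigned values. The function names and argument conventions are assumptions made for illustration and are not the SSDM intrinsics themselves.

```python
def bit_extract(value, hi, lo):
    """Return bits hi..lo (inclusive) of value as an unsigned integer."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def bit_concat(high, low, low_width):
    """Place `high` above the low_width-bit field `low`."""
    return (high << low_width) | (low & ((1 << low_width) - 1))


def bit_insert(value, hi, lo, field):
    """Return value with bits hi..lo replaced by `field`."""
    width = hi - lo + 1
    mask = ((1 << width) - 1) << lo
    return (value & ~mask) | ((field << lo) & mask)


print(hex(bit_extract(0xABCD, 11, 4)))     # bits 11..4 of 0xABCD
print(hex(bit_concat(0xA, 0xB, 4)))        # 0xA above a 4-bit 0xB
print(hex(bit_insert(0xABCD, 7, 4, 0xF)))  # overwrite bits 7..4 with 0xF
```

Making these operations explicit in the IR, together with per-value bit-width attributes, lets the synthesis engine size functional units and wires exactly instead of defaulting to machine word widths.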

SLIDE 12

Platform Modeling & Characterization

  • Target platform specification
      • High-level resource library with delay/latency/area/power curves for various input/bitwidth configurations
          • Functional units: adders, ALUs, multipliers, comparators, etc.
          • Connectors: mux, demux, etc.
          • Memories: registers, synchronous memories, etc.
      • Chip layout description
          • On-chip resource distributions
          • On-chip interconnect delay/power estimation

3X3 Delay Matrix for Stratix-EP1S40, coordinates (0,0) to (95,61):

4.7   3.8   2.8
3.7   2.9   2.0
2.8   1.8   0.58

Scheduling: Problem Statement

  • Scheduling problem in behavioral synthesis

Given:

  • A control data flow graph (CDFG) which captures the behavior of the input description
  • A set of scheduling constraints: resource constraints, latency constraints, frequency constraints, relative IO timing constraints, etc.

Goal:

  • Assign the operations to control states so that a particular design objective (performance / power) is optimized while all the constraints are satisfied.

  • Highlights of our scheduling engine
      • Applicable to a wide range of application domains
          • Computation-intensive, memory-intensive, control-intensive, partially timed, etc.
      • Offers a variety of optimization techniques in a unified framework
          • Operation chaining, behavioral template, relative scheduling, physical layout consideration, etc.

[Figure: operations *1, +2, +3, +4, *5 assigned to control states CS0 and CS1]

SLIDE 13

Scheduling: Overall Approach

  • Overall approach
      • Current objective: high-performance
      • Use a system of integer difference constraints to express all kinds of scheduling constraints
      • Represent the design objective in a linear function

[Figure: data flow graph with nodes v1 to v5 and operations +, *, *, −, +]

  • Platform characterization:
      • adder (+/–): 2ns
      • multiplier (*): 5ns
  • Target cycle time: 10ns
  • Resource constraint: only ONE multiplier is available

  • Dependency constraint
      • v1 → v3: x3 – x1 ≥ 0
      • v2 → v3: x3 – x2 ≥ 0
      • v3 → v4: x4 – x3 ≥ 0
      • v4 → v5: x5 – x4 ≥ 0
  • Frequency constraint
      • <v2, v5>: x5 – x2 ≥ 1
  • Resource constraint
      • <v2, v3>: x3 – x2 ≥ 1

Dependency and frequency constraints in matrix form, A x ≤ b:

\[
\begin{bmatrix}
1 & 0 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 1 & -1 \\
0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
\le
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ -1 \end{bmatrix}
\]

  • Totally unimodular matrix: guarantees integral solutions
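A system made only of pairwise difference constraints can also be solved without a general LP solver. The sketch below is a Bellman-Ford-style longest-path relaxation that returns the earliest (ASAP) control state for each operation, or reports that the constraints are contradictory. It is a swapped-in textbook technique for illustration, not the LP-based flow the slides describe, and the function name and tuple encoding are assumptions.

```python
def solve_difference_constraints(num_vars, constraints):
    """Solve x[j] - x[i] >= c for the smallest non-negative integers.

    constraints: list of (i, j, c) with 1-based variable indices.
    Returns [x1, ..., xn], or None if the system is infeasible.
    """
    x = [0] * (num_vars + 1)  # x[0] is unused; every op may start in state 0
    for _ in range(num_vars):
        changed = False
        for i, j, c in constraints:
            if x[j] < x[i] + c:  # constraint violated: push x[j] later
                x[j] = x[i] + c
                changed = True
        if not changed:
            return x[1:]
    return None  # still changing after num_vars passes: positive cycle


# Constraints listed on the slide, written as x[j] - x[i] >= c.
example = [
    (1, 3, 0),  # dependency: x3 - x1 >= 0
    (2, 3, 0),  # dependency: x3 - x2 >= 0
    (3, 4, 0),  # dependency: x4 - x3 >= 0
    (4, 5, 0),  # dependency: x5 - x4 >= 0
    (2, 5, 1),  # frequency:  x5 - x2 >= 1
    (2, 3, 1),  # resource:   x3 - x2 >= 1
]

schedule = solve_difference_constraints(5, example)
print(schedule)
```

For this example the relaxation places v1 and v2 in control state 0 and v3, v4, v5 in control state 1. An LP formulation remains the more general choice because it can optimize an arbitrary linear objective rather than only the earliest start times.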

Scheduling: Design Framework

xPilot scheduler (figure):

  • Inputs: CDFG; user-specified design constraints & assignments; target platform modeling (resource library & chip layout)
  • Constraint equations generation and objective function generation
  • System of pairwise difference constraints
      • Relative timing constraints
      • Dependency constraints
      • Frequency constraints
      • Resource constraints
      • …
  • Linear programming solver
  • LP solution interpretation
  • Output: STG (State Transition Graph)

SLIDE 14

Unified Resource Binding

  • An efficient architectural exploration framework
      • Simultaneous functional unit, register, and port binding
      • Emphasis on the interconnect and steering logic networks
      • Guided by a flexible cost evaluation engine to achieve different objectives, e.g., performance, area, power, etc.
      • Extendable to exploit physical layout information

[Figure: xPilot architecture exploration. Inputs: STG (State Transition Graph); platform info and user-specified constraints. Steps: baseline register binding, then iterations of FU allocation/binding and register allocation/binding, evaluated with a datapath model for estimation and repeated while the result improves (Improved? Yes/No). Output: STG + best datapath models.]

Resource Binding: Problem Statement

  • Resource binding problem

Given: (1) a scheduled control data flow graph, i.e., STG; (2) design constraints: performance, delay, or power, etc.

Goal: Assign the operations and variables to functional units and registers, respectively, so that their executions or lifetimes do not conflict, and all of the design constraints are satisfied.

  • Properties of the problem
      • FU and register binding are highly correlated
      • Simultaneous FU and register binding considering interconnection is very difficult

[Figure: two binding solutions for operations +1 and +2, one using a MUX and one ALU, the other using two ALUs]

  • Which one is better?
  • The answer depends on:
      1. How large the MUX and ALU are (platform-dependent)
      2. Performance and area constraints
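The "lifetimes do not conflict" condition can be illustrated on the register side alone. The sketch below is a classical left-edge-style greedy that packs variable lifetimes into the fewest registers. It is a textbook baseline shown only to make the constraint concrete: it ignores functional units, MUX cost, and interconnect, so it is not the simultaneous binding approach described in these slides. All names and the sample lifetimes are hypothetical.

```python
def left_edge_binding(lifetimes):
    """Bind variables to registers so that overlapping lifetimes never share one.

    lifetimes: dict name -> (start, end); a variable is live over [start, end).
    Returns dict name -> register index.
    """
    free_at = []  # free_at[r] = control step at which register r becomes free
    binding = {}
    # Visit variables in order of increasing start time (the "left edge").
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1]):
        for reg, available in enumerate(free_at):
            if available <= start:  # register is free again: reuse it
                free_at[reg] = end
                binding[name] = reg
                break
        else:  # no existing register is free: allocate a new one
            free_at.append(end)
            binding[name] = len(free_at) - 1
    return binding


# Hypothetical lifetimes taken from some scheduled STG.
lifetimes = {"a": (0, 2), "b": (1, 3), "c": (2, 4), "d": (3, 5)}
binding = left_edge_binding(lifetimes)
print(binding)
```

Here four variables fit into two registers. Whether that is the best datapath still depends on the MUXes such sharing introduces in front of each register and functional unit, which is why the slides treat FU and register binding together.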

SLIDE 15

Distributed Register-File (DRF) Microarchitecture

[Figure: islands A, B, and C; each island contains data import logic, a local register file, and an FU pool (MUL, ALU, ALU'), with buffers]

  • Regular datapath structure
  • Provides opportunities to hide large MUXes in register files
  • Computations and communications are localized
  • Allows replicated values among islands
  • Enables efficient optimizations to control interconnects among islands

Advantages of DRF Microarchitecture

[Figure: DFG (part of Chen DCT) and its scheduled DFG; resource constraint: 1 FU]

  • DRF result:
      • Datapath with more regularity
      • Hide MUX into the register file
      • Especially effective for FPGA designs
  • Discrete register result:
      • MUX implementation may be very expensive (e.g., on FPGAs)

SLIDE 16

Platform-Based Interface Synthesis

  • Focus on sequential communication channels
      • Data must be read and written in the same order
      • Example: FIFO (FSL in VirtexII), Bus (in both Stratix and Virtex)
  • Order may have dramatic impact on performance
      • The best order should guarantee that no data transmission on the critical path is delayed by a non-critical transmission
  • Interface synthesis for sequential communication channels
      • Consider both the behavior model and communication topology to detect the optimal transmission order
      • Automatically generate interfaces for sequential communication units, as well as code transformations for behavior models

Overall Approach to Interface Synthesis

  • Reduce the order detection problem to a min-latency scheduling problem:
      • Merge the CDFGs of all processes
      • Each element to be transferred on the FIFO is transformed into a special operation T
      • Only one T can be scheduled at each step
  • Example (figure), assuming only 1 cycle is needed for a FIFO operation

[Figure: CDFGs of Process 1 and Process 2 with transfers T1, T2, T3 and operations * and +; merged CDFG; scheduling result, order is (1,3,2)]
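The effect of transfer order on latency can be seen on a tiny example by trying every order. The sketch below uses brute-force enumeration rather than the min-latency scheduling formulation described above, and the consumer graph is hypothetical (a multiply that needs T1 and T3, followed by an add that needs the product and T2); it is not read off the garbled figure. One transfer is issued per step over a single FIFO, and each operation is assumed to take one cycle.

```python
from itertools import permutations

# Hypothetical consumer graph: op -> (inputs, delay in cycles).
# Inputs named "T<k>" are FIFO transfers; other names are earlier ops.
# Ops are listed in dependency order.
OPS = {
    "mul": (("T1", "T3"), 1),
    "add": (("mul", "T2"), 1),
}


def latency(order, ops=OPS, transfer_cycles=1):
    """Finish time of the last op when transfers cross one FIFO in `order`."""
    ready = {}
    t = 0
    for k in order:  # only one transfer T can occupy the FIFO per step
        t += transfer_cycles
        ready[f"T{k}"] = t
    for name, (inputs, delay) in ops.items():
        ready[name] = max(ready[i] for i in inputs) + delay
    return max(ready[name] for name in ops)


def best_order(transfers):
    """Exhaustively pick the transfer order with the lowest latency."""
    return min(permutations(transfers), key=latency)


print(latency((1, 2, 3)))     # T3 arrives last and delays the multiply
print(best_order((1, 2, 3)))  # the multiply's inputs are sent first
```

In this toy case sending T3 before T2 saves one cycle because T2 is not needed until the add. Enumeration grows factorially with the number of transfers, which is one reason a scheduling formulation is preferable for real designs.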

SLIDE 17

Experimental Results: Benchmark Suite

  • Benchmark suite
      • PR, MCM:
          • DSP kernels: pure additions/subtractions and multiplications
      • CACHE:
          • Cache controller: control-intensive designs with cycle-accurate I/O operations
      • MOTION:
          • Motion compensation algorithm for MPEG-1 decoder: control-intensive with modest amount of computations
      • IDCT:
          • JPEG inverse discrete cosine transform: computation-intensive
      • DWT:
          • JPEG2000 discrete wavelet transform: computation-intensive with modest control flow
      • EDGELOOP:
          • Extracted from H.264 decoder: a very complex design, features a mix of computation, control, and memory accesses

SystemC/C-to-FPGA Design Flow (Altera)

Design flow (figure):

  • Inputs: SystemC/C specification; platform description & constraints
  • Front-end compiler
  • SSDM (System-Level Synthesis Data Model): SSDM/CDFG
  • xPilot behavioral synthesis: behavioral synthesis and RTL generation (SSDM/FSMD)
  • Outputs: FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints
  • Altera QuartusII v5.0
  • Stratix/StratixII device configurations

SLIDE 18

Experimental Results: Altera

  • Device setting: Stratix
  • Target frequency: 200 MHz

           Line Count     Resource Usage                                 Fmax
Designs    C      VHDL    LE     COMB   Lonely-Reg   Comb-Reg   DSP     (MHz)
PR         90     600     1349   713    84           552        -       178.7
WANG       90     727     1105   527    62           516        8       166.11
LEE        141    696     1585   691    207          687        4       166.61
MCM        161    1260    2402   981    73           1348       -       152.56
DIR        190    1352    3489   1752   110          1627       4       146.8

Experimental Results: Comparison with SPARK

  • On Altera Stratix FPGA
  • SPARK [UCI/UCSD, 2004], a state-of-the-art academic high-level synthesis tool
  • On average, xPilot resource binding achieves designs with similar area and 1.68x higher frequency than SPARK

            xPilot                                                   SPARK
            Resource Usage                               Fmax        Resource Usage                               Fmax
Designs     LE     COMB   Lonely-Reg   Comb-Reg   DSP    (MHz)       LE     COMB   Lonely-Reg   Comb-Reg   DSP    (MHz)
PR          1349   713    84           552        -      178.7       1108   815    -            293        -      123.5
WANG        1105   527    62           516        8      166.1       1217   942    -            275        -      118.9
LEE         1585   691    207          687        4      166.6       1367   1052   -            315        -      119.3
MCM         2402   981    73           1348       -      152.6       2808   2248   -            560        -      74.87
DIR         3489   1752   110          1627       4      146.8       2425   2034   -            391        6      69.38
Ave Ratio   1.12   0.68   n/a*         2.50       n/a*   1.68        1      1      1            1          1      1
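The "Ave Ratio" row is consistent with taking the xPilot/SPARK ratio for each design and then averaging the five ratios. The sketch below redoes that arithmetic for the Fmax and LE columns using the values from the comparison table; the variable and function names are illustrative only.

```python
# (xPilot, SPARK) pairs per design, in the order PR, WANG, LEE, MCM, DIR.
fmax_mhz = [(178.7, 123.5), (166.1, 118.9), (166.6, 119.3),
            (152.6, 74.87), (146.8, 69.38)]
logic_elements = [(1349, 1108), (1105, 1217), (1585, 1367),
                  (2402, 2808), (3489, 2425)]


def average_ratio(pairs):
    """Mean of the per-design xPilot/SPARK ratios."""
    return sum(xpilot / spark for xpilot, spark in pairs) / len(pairs)


print(round(average_ratio(fmax_mhz), 2))        # Fmax column of "Ave Ratio"
print(round(average_ratio(logic_elements), 2))  # LE column of "Ave Ratio"
```

Averaging per-design ratios weights every benchmark equally, so the two largest designs (MCM and DIR), where xPilot roughly doubles Fmax, pull the average up to 1.68 even though the three smaller designs each gain about 1.4x.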

SLIDE 19

SystemC/C-to-FPGA Design Flow (Xilinx)

Design flow (figure):

  • Inputs: SystemC/C specification; platform description & constraints
  • Front-end compiler
  • SSDM (System-Level Synthesis Data Model): SSDM/CDFG
  • xPilot behavioral synthesis: behavioral synthesis and RTL generation (SSDM/FSMD)
  • Outputs: FSM with datapath in VHDL; floorplan and/or multi-cycle path constraints
  • Xilinx ISE i7.1
  • VirtexII(-Pro)/Virtex-4 device configurations

Experimental Results: Xilinx

  • Device setting: xc2vp30 -7
  • Target frequency: 200 MHz

           Line Count     Resource Usage                   Fmax
Designs    C      VHDL    Slices   (LUT)   (FF)    DSP     (MHz)
PR         90     600     331      416     564     16      146.84
WANG       90     727     357      464     588     15      133.51
LEE        141    696     356      484     659     19      131.93
MCM        161    1260    887      1207    1282    30      110.38
DIR        190    1352    979      1002    1732    56      98.81

SLIDE 20

Synthesis from Behavior to DRF

  • Data import logic is the most “critical”
  • Operations bound to an island form a “chain” in DFG
  • Optimize complexity of inter-island connections
  • Min-cut chain partitioning improves design quality

Design    Micro-Architecture   Slices   LUT   FF      RAM Blk#   MUL   Clock Period (ns)
CHEN-24   DRF (6 islands)      425      808   86      6          6     12.04
CHEN-24   Baseline             798      928   1,235   -          66    10.37
PR-24     DRF (5 islands)      457      860   63      5          3     10.62
PR-24     Baseline             701      858   1,076   -          48    11.69

  • Device: Xilinx Virtex II -6; Target clock period: 10ns
  • Observations: Large area (Slices and Multiplier blocks) reduction by using on-chip RAM blocks to implement register files, with small impact on Fmax

Initial Results of Interface Synthesis

  • Targets sequential communication channels
      • In particular, FSL in VirtexII
  • Consider two communicating processes
  • 20+% performance improvement on average by optimizing communication ordering

SLIDE 21

Outline

  • Motivation
  • xPilot system framework
  • Behavior-level synthesis in xPilot
      • Advantages of behavioral synthesis
      • Scheduling
      • Resource binding
  • System-level synthesis in xPilot
      • Synthesis for ASIP platforms
      • Design exploration for heterogeneous MPSoCs
  • Conclusions

Design Exploration for Heterogeneous MPSoC Platforms

  • Heterogeneous MPSoCs exploration
      • Processors
          • Heterogeneous vs. homogeneous
          • General-purpose vs. application-specific
      • On-chip communication architecture (OCA)
          • Bus (e.g. AMBA, CoreConnect), packet switching network (e.g. Alpha 21364)
      • Memory hierarchy

[Figure: heterogeneous MPSoC platform in which µP, DSP, IP, and FPGA blocks, some running OS and driver tasks, attach to a communication network through network interfaces]


Configurable SoC Platforms

  • General purpose processor cores + programmable fabric
      • Tight integration using extended instructions (ASIPs)
          • Example: Altera Nios / Nios II
      • Loose integration using FIFOs/busses for communications
          • Example: Xilinx MicroBlaze, etc.

[Figures: custom instruction logic for Nios II (source: www.altera.com); Xilinx MicroBlaze (source: www.xilinx.com)]

ASIP Compilation: Problem Statement

  • Given:
      • CDFG G(V, E)
      • The basic instruction set I
      • Pattern constraints:
          • Number of inputs $|PI(p_i)| \le N_{in}$;
          • Number of outputs $|PO(p_i)| = 1$;
          • Total area $\sum_{1 \le i \le N} \mathrm{area}(p_i) < A$
  • Objective:
      • Generate a pattern library P
      • Map G to the extended instruction set I ∪ P, so that the total execution time is minimized

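Example (operation latencies: * takes 2 clock cycles, + takes 1 clock cycle):

  • Original code: t1 = a * b; t2 = b * c; t3 = d * e; t4 = t1 + t2; t5 = t2 + t3; t6 = t5 + t4;
  • With extended instructions ext-inst1 (MAC1: 2 cycles) and ext-inst2 (MAC2: 2 cycles): t4 = ext-inst1(a, b, c); t5 = ext-inst2(b, c, d, e); t6 = t4 + t5;
  • Performance speedup = 9 / 5 = 1.8X

[Figure: data-flow graph of the example with inputs a, b, c, d, e, intermediate values t4 and t5, and output t6]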
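The 9 / 5 = 1.8X figure can be checked by adding up the per-operation latencies given on the slide. The short Python sketch below does this under the assumption of strictly sequential, single-issue execution; the dictionary and function names are illustrative and not part of xPilot.

```python
# Latencies stated on the slide: multiply = 2 cycles, add = 1 cycle,
# each extended MAC instruction = 2 cycles.
LATENCY = {"*": 2, "+": 1, "ext-inst1": 2, "ext-inst2": 2}

# Each entry is (destination, operation). Sequential execution is assumed.
baseline = [("t1", "*"), ("t2", "*"), ("t3", "*"),
            ("t4", "+"), ("t5", "+"), ("t6", "+")]
extended = [("t4", "ext-inst1"), ("t5", "ext-inst2"), ("t6", "+")]


def total_cycles(program):
    """Sum the latency of every operation in the program."""
    return sum(LATENCY[op] for _, op in program)


base = total_cycles(baseline)   # 3 multiplies + 3 adds
ext = total_cycles(extended)    # 2 MACs + 1 add
speedup = base / ext
print(base, ext, speedup)       # 9 5 1.8
```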


Target Core Processor Model

[Figure: core processor datapath with PC, instruction cache, register file (RS1, RS2), ALU and custom logic (OP1, OP2), memory, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]

  • Core processor model
      • Classic single-issue pipelined RISC core (fetch / decode / execute / mem / write-back)
      • The number of input and output operands of an instruction is pre-determined
      • An instruction reads the core register file during the execute stage, and commits the result during the write-back stage

ASIP Compilation Flow

  • Front-end compilation
  • 1. Pattern generation: satisfying input/output constraints
  • 2. Pattern selection: select a subset to maximize the potential speedup while satisfying the resource constraint
  • 3. Application mapping & graph covering: graph covering to minimize the total execution time
  • Backend compilation

[Figure: compilation flow with the artifacts C code, µArch constraint, CDFG, pattern library, optimized CDFG, and optimized assembly]
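The pattern selection step (choose a subset of patterns that maximizes potential speedup under a resource constraint) has the shape of a 0/1 knapsack problem. The Python sketch below illustrates that view only; it is not the selection algorithm used in xPilot. The pattern names, integer areas, gain values, and the inclusive treatment of the area budget are all assumptions made for the example.

```python
def select_patterns(patterns, area_budget):
    """0/1 knapsack over candidate patterns.

    patterns: list of (name, area, gain) with integer areas (hypothetical data).
    Returns (best_total_gain, chosen_names) with total area <= area_budget.
    """
    # best[a] holds the best (gain, names) found with total area at most a.
    best = [(0, [])] * (area_budget + 1)
    for name, area, gain in patterns:
        # Walk budgets downward so each pattern is selected at most once.
        for a in range(area_budget, area - 1, -1):
            candidate = best[a - area][0] + gain
            if candidate > best[a][0]:
                best[a] = (candidate, best[a - area][1] + [name])
    return best[area_budget]


# Hypothetical candidates: (name, area units, estimated cycles saved).
candidates = [("MAC1", 3, 4), ("MAC2", 4, 5), ("ADD3", 2, 3)]
gain, chosen = select_patterns(candidates, area_budget=5)
print(gain, chosen)  # 7 ['MAC1', 'ADD3']
```

With a budget of 5 area units, picking MAC1 and ADD3 (total gain 7) beats picking MAC2 alone (gain 5), which is the kind of trade-off the selection step has to resolve.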


Experimental Results on Altera Nios

| | Extended Instruction# | Speedup: Estimation | Speedup: Nios | Resource Overhead: LE | LE % | Memory | Memory % | DSP Block |
|---|---|---|---|---|---|---|---|---|
| fft_br | 9 | 3.28 | 2.65 | 408 | 6.06% | 65,536 | 9.79% | 16 |
| iir | 7 | 3.18 | 3.73 | 255 | 3.79% | 4,736 | 0.71% | 40 |
| fir | 2 | 2.40 | 2.14 | 51 | 0.76% | 1,024 | 0.15% | 8 |
| pr | 2 | 1.57 | 1.75 | 71 | 1.05% | | 0.00% | 14 |
| dir | 2 | 3.28 | 3.02 | 54 | 0.80% | | 0.00% | 16 |
| mcm | 4 | 4.75 | 3.22 | 186 | 2.76% | | 0.00% | 56 |
| Average | | 3.08 | 2.75 | | 2.54% | | 1.77% | |

  • Altera Nios is used for ASIP implementation
      • 5 extended instruction formats
      • Up to 2048 instructions for each format
  • Small DSP applications are taken as benchmarks

Architecture Extension for ASIPs

  • Data bandwidth problem
      • Limited register file bandwidth (two read ports, one write port)
      • ~40% of the ideal performance speedup will be lost
  • Shadow-register-based architectural extension
      • Core registers are augmented by an extra set of shadow registers
      • Conditionally written during write-back stage
      • Low power/area overhead
      • Novel shadow-register binding algorithms are developed

[Figure: core processor pipeline extended with shadow registers SR1 to SRK that feed the custom logic, selected through a hashing unit with k = hash(j)]


Problem Statement: Mapping for Heterogeneous Integration with Multiple Processing Cores

  • Given:
      • A library of processing cores L
      • Task graph G(V, E)
      • For each v in V, execution time t(v, p_i) on p_i
      • For each (u, v) in E, communication data size s(u, v)
      • Cost (area/power) constraint C
  • Problem:
      • Select and instantiate the processing elements from L
      • Generate the on-chip communication architecture and topology
      • Map the tasks onto the processing elements so that
          • The total latency is minimized while the final implementation cost is less than C
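To make the problem statement concrete, the Python sketch below enumerates every task-to-core mapping for a tiny instance and keeps the lowest-latency mapping whose cost stays below C. It is a brute-force illustration under strong simplifying assumptions, not the exploration method used in xPilot: the communication architecture is reduced to a single hypothetical `bandwidth` value, tasks on one core run in list order, and all task names, core kinds, times, and costs are invented for the example.

```python
from itertools import product


def latency(tasks, edges, exec_time, mapping, bandwidth):
    """List-schedule tasks in the given topological order; return the finish time."""
    finish, pe_free = {}, {}
    for v in tasks:
        pe = mapping[v]
        ready = 0.0
        for (u, w), size in edges.items():
            if w == v:
                # Data crossing cores pays size / bandwidth; same-core transfer is free.
                delay = 0.0 if mapping[u] == pe else size / bandwidth
                ready = max(ready, finish[u] + delay)
        start = max(ready, pe_free.get(pe, 0.0))
        finish[v] = start + exec_time[v][pe[0]]  # pe = (kind, index)
        pe_free[pe] = finish[v]
    return max(finish.values())


def explore(tasks, edges, exec_time, pe_instances, pe_cost, cost_limit, bandwidth):
    """Exhaustively search mappings; return (latency, cost, mapping) or None."""
    best = None
    for choice in product(pe_instances, repeat=len(tasks)):
        cost = sum(pe_cost[kind] for kind, _ in set(choice))
        if cost >= cost_limit:  # the cost must be less than C
            continue
        mapping = dict(zip(tasks, choice))
        t = latency(tasks, edges, exec_time, mapping, bandwidth)
        if best is None or t < best[0]:
            best = (t, cost, mapping)
    return best


# Hypothetical instance: three tasks, two CPU instances and one hardware block.
tasks = ["a", "b", "c"]                         # already in topological order
edges = {("a", "b"): 4, ("a", "c"): 4}          # s(u, v), data size per edge
exec_time = {"a": {"cpu": 4, "hw": 4},          # t(v, p_i) per core kind
             "b": {"cpu": 6, "hw": 2},
             "c": {"cpu": 6, "hw": 6}}
pe_instances = [("cpu", 0), ("cpu", 1), ("hw", 0)]
pe_cost = {"cpu": 2, "hw": 3}

best = explore(tasks, edges, exec_time, pe_instances, pe_cost,
               cost_limit=6, bandwidth=4)
print(best)
```

In this instance the best feasible design uses one CPU plus the hardware block: task b moves to hardware while a and c share the CPU. Exhaustive search grows exponentially with the number of tasks, so it only serves to show the objective and the constraint.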

Preliminary Results on Motion-JPEG Example

[Figure: Motion-JPEG pipeline on a Xilinx XUP board converting RAW images to encoded JPEG images; blocks: Preprocess, DCT, Quant, Huffman, Table Modification; in the alternative design the DCT is a hardware block (HW-DCT)]

  • Model #1: 5 MicroBlazes, FSL-based communication
  • Model #2: 4 MicroBlazes + DCT on FPGA fabric

| System | Area (Slice#) | Cycle# | Fmax (MHz) | Exe Time (ms) |
|---|---|---|---|---|
| Model #1 | 4306 | 23812 | 126 | 0.189 |
| Model #2 | 6345 | 14800 (-38%) | 126 | 0.117 |
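The execution times in these results follow from the cycle counts and the 126 MHz clock, and the -38% figure is the relative change in cycle count. The Python sketch below reproduces that arithmetic from the reported numbers; the area change it prints is derived here from the slice counts and is not stated on the slide.

```python
FMAX_MHZ = 126
models = {
    "Model #1": {"slices": 4306, "cycles": 23812},
    "Model #2": {"slices": 6345, "cycles": 14800},
}


def exe_time_ms(cycles, fmax_mhz):
    """cycles / (MHz * 1e6) gives seconds; multiply by 1e3 for milliseconds."""
    return cycles / (fmax_mhz * 1e6) * 1e3


t1 = exe_time_ms(models["Model #1"]["cycles"], FMAX_MHZ)
t2 = exe_time_ms(models["Model #2"]["cycles"], FMAX_MHZ)
cycle_change = models["Model #2"]["cycles"] / models["Model #1"]["cycles"] - 1
area_change = models["Model #2"]["slices"] / models["Model #1"]["slices"] - 1

print(f"{t1:.3f} ms, {t2:.3f} ms, cycles {cycle_change:+.0%}, area {area_change:+.0%}")
# 0.189 ms, 0.117 ms, cycles -38%, area +47%
```

Moving the DCT into FPGA fabric therefore trades roughly 47% more slices for about 38% fewer cycles at the same clock frequency.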


Conclusions

  • xPilot can automatically synthesize a behavior-level C or SystemC representation to RTL code with the necessary design constraints
  • Platform-based synthesis with physical planning provides
      • Shorter verification/simulation cycle
      • Better complexity management, faster time to market
      • Rapid system exploration
      • Higher quality of results
  • xPilot can help to explore the efficient use of (multiple) on-chip processors
  • xPilot can efficiently optimize the software for reconfigurable processors
  • We are interested in engaging with selected industrial partners to further validate and enhance the technology

Acknowledgements

  • We would like to acknowledge the support of
      • National Science Foundation (NSF)
      • Gigascale Systems Research Center (GSRC)
      • Semiconductor Research Corporation (SRC)
      • Industrial sponsors under the California MICRO programs (Altera, Xilinx)
  • Team members: Yiping Fan, Zhiru Zhang, Wei Jiang, Guoling Han