Customizable Domain-Specific Computing

Jason Cong
Center for Domain-Specific Computing
UCLA Computer Science Department
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong



The Power Barrier …

Source: Shekhar Borkar, Intel

Focus: New Transformative Approach to Power/Energy-Efficient Computing

Current solution: parallelization

Source: Shekhar Borkar, Intel

Cost and Energy Are Still a Big Issue …

Cost of computing:

  • HW acquisition
  • Energy bill
  • Heat removal
  • Space

Next Significant Opportunity - Customization

Beyond parallelization, the next opportunity is customization: adapt the architecture to the application domain.

Source: Shekhar Borkar, Intel

Motivation

  • A few facts
    We have sufficient computing power for most applications.
    Each user/enterprise needs high computing power for only selected tasks in its domain.
    Application-specific integrated circuits (ASICs) can deliver 10,000x+ better power/performance efficiency, but are too expensive to design and manufacture.

  • Our proposal: a general, customizable platform for the given domain(s)
    • Can be customized to a wide range of applications in the domain
    • Can be mass-produced with cost efficiency
    • Can be programmed efficiently with novel compilation and runtime systems

  • Goal: a "supercomputer-in-a-box" with 100x performance/power improvement via customization for the intended domain(s)

  • Analogy: the advance of civilization via specialization/customization

Example Application Domain: Healthcare

  • Medical imaging has transformed healthcare
    An in vivo method for understanding disease development and patient condition
    Estimated at $100 billion/year
    More powerful and efficient computation can help:
    • Fewer exposures, using compressive sensing
    • Better clinical assessment (e.g., for cancer), using improved registration and segmentation algorithms
  • Hemodynamic simulation
    Very useful for surgical procedures involving blood flow and vasculature
  • Both may take hours to days to construct; the clinical requirement is 1-2 minutes
  • Cloud computing won't work - communication, real-time requirements, privacy
    A megawatt datacenter for each hospital?

Figures: intracranial aneurysm reconstruction with hemodynamics; magnetic resonance (MR) angiograph of an aneurysm

Medical Image Processing Pipeline

Stages: denoising -> registration -> segmentation -> analysis -> reconstruction

  • Denoising (total variational algorithm / patch-weighted averaging): for every voxel i in the volume,
    u(i) = Σ_j w(i,j) z(j),  w(i,j) = (1/Z_i) e^{−‖S_i − S_j‖²_{2,σ} / h²}

  • Fluid registration (viscous-fluid PDE for the deformation u with velocity v):
    μ Δv + (μ + η) ∇(∇·v) = −[T(x − u) − R(x)] ∇T(x − u)
    ∂u/∂t = v − ∇u · v

  • Segmentation (level set methods): surface = {x : φ(x, t) = 0} over the voxels,
    ∂φ/∂t = |∇φ| [ div(∇φ / |∇φ|) + λ F_data(φ) ]

  • Analysis (hemodynamic simulation via the Navier-Stokes equations):
    ∂v/∂t + (v·∇)v = −∇p + υ Δv + f(x, t), i.e., componentwise
    ∂v_i/∂t + Σ_{j=1..3} v_j ∂v_i/∂x_j = −∂p/∂x_i + υ Σ_{j=1..3} ∂²v_i/∂x_j² + f_i(x, t)

  • Reconstruction (compressive sensing): medical images exhibit sparsity and can be sampled at a rate below the classical Nyquist-Shannon theory; with #sampled points << #voxels,
    min_u ‖grad u‖₁ + λ · (data fidelity on the sampled measurements S u)
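As a concrete illustration of the segmentation stage above, the following is a minimal NumPy sketch of one explicit time step of the level-set evolution ∂φ/∂t = |∇φ|(curvature + λ·F_data). The grid size, time step, and zero data force are illustrative choices, not values from the slides.

```python
import numpy as np

def level_set_step(phi, data_force, lam=1.0, dt=0.1, eps=1e-8):
    """One explicit Euler step of the level-set PDE:
    d(phi)/dt = |grad phi| * (div(grad phi / |grad phi|) + lam * data_force)."""
    gy, gx = np.gradient(phi)
    mag = np.sqrt(gx**2 + gy**2) + eps
    # curvature = divergence of the unit normal field
    curvature = np.gradient(gx / mag, axis=1) + np.gradient(gy / mag, axis=0)
    return phi + dt * mag * (curvature + lam * data_force)

# demo: a circle shrinks under pure curvature flow (data_force = 0)
y, x = np.mgrid[0:64, 0:64]
phi0 = np.sqrt((x - 32.0)**2 + (y - 32.0)**2) - 20.0  # signed distance to a circle
phi = phi0
for _ in range(50):
    phi = level_set_step(phi, data_force=np.zeros_like(phi))
```

In a real segmentation, `data_force` would be derived from the image (e.g., an edge or region term); here it is zero so only the curvature term acts.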


Application Domains: Medical Image Processing Pipeline

Stages: denoising -> registration -> segmentation -> analysis -> reconstruction (plus the Navier-Stokes equations for hemodynamics)

Computation & communication patterns across the stages include: non-iterative, highly parallel kernels with local & global communication; sparse linear algebra, structured grids, and optimization methods with parallel, global communication; dense linear algebra and optimization methods with local communication; sparse linear algebra, n-body methods, and graphical models with local communication; dense linear algebra, spectral methods, and MapReduce; and iterative dense/sparse linear algebra and optimization methods with local or global communication.

  • These algorithms have diverse computation & communication patterns
  • A single homogeneous system cannot perform very well on all of these algorithms

Need of Customization for Medical Image Processing Pipeline

  • Need architecture customization and hardware-software co-optimization
  • The pipeline includes many common computation kernels ("motifs")
  • Applicable to other domains

Bi-harmonic registration (using the same algorithm on all platforms):

  Platform             Speedup   Power
  CPU (Xeon 2.0 GHz)   1x        ~100 W
  GPU (Tesla C1060)    93x       ~150 W
  FPGA (xc4vlx100)     11x       ~5 W

3D median filter (for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels):

  Platform             Algorithm                    Speedup   Power
  CPU (Xeon 2.0 GHz)   Quickselect                  1x        ~100 W
  GPU (Tesla C1060)    Median of medians            70x       ~140 W
  FPGA (xc4vlx100)     Bit-by-bit majority voting   1200x     ~3 W
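The FPGA result in the table comes from computing the median bit-serially rather than by sorting. A minimal Python sketch of that bit-by-bit majority-voting idea (my reconstruction of the general technique, not the paper's exact circuit): the median's bits are decided MSB-first by majority vote, and candidates that disagree are frozen out.

```python
def median_majority(vals, bits=8):
    """Median of an odd-length list of unsigned ints, decided MSB-first:
    at each bit position the median's bit equals the majority bit among
    still-active candidates; disagreeing candidates are marked as known
    to lie above or below the median and contribute fixed bits after."""
    n = len(vals)
    result = 0
    lo = [False] * n   # candidate known to be below the median
    hi = [False] * n   # candidate known to be above the median
    for b in range(bits - 1, -1, -1):
        ones = sum(1 if h else 0 if l else (v >> b) & 1
                   for v, l, h in zip(vals, lo, hi))
        maj = 1 if ones > n // 2 else 0
        result |= maj << b
        for i, v in enumerate(vals):
            if lo[i] or hi[i]:
                continue
            if ((v >> b) & 1) != maj:
                # disagreeing candidate falls strictly above/below the median
                (hi if (v >> b) & 1 else lo)[i] = True
    return result
```

In hardware each bit position is a popcount-and-compare, which is why this maps so well to an FPGA: no sorting network and no data-dependent control flow.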


Center for Domain-Specific Computing (CDSC) Organization

Faculty: Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Cong (Director, UCLA), Cheng (UCSB), Chang (UCLA), Reinman (UCLA), Palsberg (UCLA), Sadayappan (Ohio State), Sarkar (Associate Director, Rice), Vese (UCLA), Potkonjak (UCLA)

  • A diversified & highly accomplished faculty team: 8 in CS&E; 1 in EE; 2 in the medical school; 1 in applied math
  • 15-20 postdocs and graduate students across four universities - UCLA, Rice, Ohio State, and UC Santa Barbara

Customizable Heterogeneous Platform (CHP)

CHP building blocks (per chip): fixed cores, custom cores, programmable fabric, and caches ($), with DRAM and I/O, connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceivers/receivers and optical interfaces); multiple CHPs compose the system.

Overview of the Proposed Research

  • Domain characterization: application modeling, domain-specific modeling (healthcare applications)
  • CHP creation: customizable computing engines, customizable interconnects, architecture modeling
  • CHP mapping: source-to-source CHP mapper, reconfiguring & optimizing backend, adaptive runtime

Design once, invoke many times.

CHP Creation - Design Space Exploration

Key questions: What is the optimal trade-off between efficiency & customizability? Which options should be fixed at CHP creation, and which should be set by the CHP mapper?

Custom instructions & accelerators:
  • Amount of programmable fabric
  • Shared vs. private accelerators
  • Custom instruction selection
  • Choice of accelerators
  • …

Core parameters:
  • Frequency & voltage
  • Datapath bit width
  • Instruction window size
  • Issue width
  • Cache size & configuration
  • Register file organization
  • # of thread contexts
  • …

NoC parameters:
  • Interconnect topology
  • # of virtual channels
  • Routing policy
  • Link bandwidth
  • Router pipeline depth
  • Number of RF-I enabled routers
  • RF-I channel and bandwidth allocation
  • …
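The parameter lists above define a combinatorial design space. A tiny sketch of exhaustive exploration over a discretized slice of such a space; the parameter values and the energy-delay-style cost model are purely illustrative stand-ins, not the project's actual models.

```python
from itertools import product

# hypothetical discretized slice of the CHP design space
space = {
    "issue_width": [2, 4, 8],
    "l2_kb":       [256, 512, 1024],
    "threads":     [1, 2, 4],
    "fabric_rows": [0, 16, 32],
}

def toy_cost(cfg):
    """Illustrative power/performance score (lower is better)."""
    perf  = cfg["issue_width"] * cfg["threads"] + cfg["fabric_rows"] / 8
    power = cfg["issue_width"] ** 2 + cfg["l2_kb"] / 128 + cfg["fabric_rows"]
    return power / perf

configs = [dict(zip(space, vals)) for vals in product(*space.values())]
best = min(configs, key=toy_cost)
```

Even this toy slice has 3^4 = 81 points; the real space (accelerators x core x NoC parameters) explodes far faster, which is why the slides frame "what to fix at creation vs. defer to the mapper" as the key question.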


Customization for Cores

Example of the core customization space:
  • Instruction queue size
  • ROB size
  • Memory hierarchy and configuration: cache sizes, cache associativity, memory latency
  • Number and type of FUs
  • Register file size
  • LSQ size
  • Branch predictor: BTB size, BTB complexity

Existing Studies on Core Customization (not domain-specific)

  • [Cong et al., IEEE Trans. on Parallel and Distributed Systems 2007] Core spilling - spill from 1 core up to 8 cores -> less than 50% worse than an ideal 8x-powerful core; up to 40% improvement for changing workloads
  • [Ipek et al., ISCA 2007] Core fusion - 2-issue cores fused to simulate 4- and 6-issue cores -> less than 30% (sequential) and 20% (parallel benchmarks) worse
  • [Folegnani & Gonzalez, ISCA 2001] Issue logic and issue queue (43/58) -> 16% total processor energy saving
  • [Ponomarev et al., MICRO 2001] Instruction queue (17/32), reorder buffer (57/128), load/store queue (18/32) -> 59% power saving for the three components
  • [Hughes et al., MICRO 2001] Issue width (8/4/2), issue queue (128/64/32), function units (4/2), dynamic voltage scaling -> up to 78% total energy saving with combined DVS and architectural adaptation
  • [Yeh et al., MICRO 2007] Reduced-precision FP arithmetic (mini FPU: mantissa 14, exponent 8), FPU sharing (2:4:8 cores per FPU), eliminating trivial FP operations, lookup table -> up to 50% power reduction and 55% performance improvement
  • [Mai et al., ISCA 2007] Memory system (streaming register files or cache hierarchy), communication (broadcast and routed), processor (SIMD or RISC superscalar) -> only 2x worse than a domain-optimized system
  • [Lee & Brooks, ASPLOS 2008] Issue queue, issue width, branch predictor, LSQ, ROB, registers; I-L1, D-L1, L2 cache size and latency; memory latency; temporal sensitivity -> 1.6x performance gain, 0.8x power reduction, 5.1x efficiency improvement

Energy-Effective Issue Logic [Folegnani & Gonzalez, ISCA'01]

Inefficiency of conventional instruction issue logic & issue queue (IQ):
  A) Energy is wasted on empty entries and already-ready operands
  B) The effectively used IQ size varies across applications
  C) The effectively used IQ size varies across periods within one application

Adaptation of Multiple Datapath Resources [Ponomarev et al., MICRO'01]

Dynamic adaptation through multi-partitioned resources:
  • Instruction queue (IQ): avg 17; max 32
  • Reorder buffer (ROB): avg 57; max 128
  • Load/store queue (LSQ): avg 18; max 32

The three resources are independently adjusted at run time:
  • Downsize a resource based on sampled statistics of its effective-usage history
  • Upsize a resource based on its resource-miss record

Total power saving for the three resized components: 59%
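The downsize/upsize policy above can be sketched in a few lines. This is an illustrative model of the sampling-window scheme, with made-up partition size, window length, and thresholds rather than the paper's exact values.

```python
class AdaptiveQueue:
    """Resize a partitioned queue: shrink toward observed occupancy,
    grow when overflow ("resource miss") events accumulate."""
    def __init__(self, max_entries=32, partition=8, window=16):
        self.max_entries = max_entries
        self.partition = partition
        self.window = window
        self.size = max_entries
        self.samples = []
        self.overflows = 0

    def sample(self, occupancy):
        self.samples.append(occupancy)
        if occupancy >= self.size:
            self.overflows += 1          # resource-miss record
        if len(self.samples) == self.window:
            avg = sum(self.samples) / self.window
            if self.overflows >= 2 and self.size < self.max_entries:
                self.size += self.partition   # upsize under pressure
            elif avg < self.size - self.partition and self.size > self.partition:
                self.size -= self.partition   # downsize to save power
            self.samples, self.overflows = [], 0

q = AdaptiveQueue()
for occ in [4] * 16:   # one window of light load shrinks the queue
    q.sample(occ)
```

Unused partitions are gated off in the real design, which is where the power saving comes from; the logic here only tracks the active size.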


Architectural and Frequency Adaptations for Multimedia Applications [Hughes et al., MICRO 2001]

Dynamic adaptation of:
  • Architecture: issue width & issue queue, # of function units
  • Dynamic voltage scaling (DVS): continuous DVS (CDVS) and discrete DVS (DDVS)

Adaptation method:
  • Initial profiling - a multimedia application has similar performance and power statistics for frames of the same type
  • Dynamic adaptation - choose the optimal configuration by table lookup, based on history statistics for the same frame type

Energy saving:
  • Arch alone: 22%
  • DDVS alone: 73%
  • CDVS alone: 75%
  • Arch + DDVS: 77%
  • Arch + CDVS: 78%
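The profile-then-lookup loop above is simple to express in code. A minimal sketch under invented assumptions: the frame types, candidate configurations, and toy energy model (dynamic energy plus a leakage-like term for unused width) are illustrative, not from the paper.

```python
from collections import namedtuple

Frame = namedtuple("Frame", "type work")

def adapt(frames, configs, measure_energy):
    """Profile each frame type once, then reuse its best config from a table."""
    table = {}
    for frame in frames:
        if frame.type not in table:
            # one-off profiling for this frame type
            table[frame.type] = min(configs,
                                    key=lambda c: measure_energy(frame, c))
        yield frame, table[frame.type]

def energy(frame, cfg):
    # toy model: cycles * V^2 dynamic energy + leakage ~ issue width
    issue, volts = cfg
    return frame.work / issue * volts ** 2 + issue * 10

configs = [(2, 0.8), (4, 1.0), (8, 1.2)]  # (issue width, voltage)
frames = [Frame("I", 800), Frame("P", 200), Frame("P", 200), Frame("I", 800)]
choices = [cfg for _, cfg in adapt(frames, configs, energy)]
```

Under this toy model, heavy I-frames pick the wide/fast configuration while light P-frames pick the narrow/low-voltage one, and repeated frames of a type reuse the cached choice without re-profiling.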

Architectural and Frequency Adaptations for Multimedia Applications (cont'd)

Important conclusions:
  • DVS gives most of the energy reduction
  • Architectural adaptation further reduces energy when layered on DVS
  • Without DVS, less aggressive architectures are more energy-efficient
  • With DVS, more aggressive architectures are often more energy-efficient: their higher IPC means they can run at a lower frequency to save energy

Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]

Examines two main questions:
  • Spatial adaptivity - which parameters to tune?
  • Temporal adaptivity - how often to tune?

Studies the effects of tuning 15 parameters at different adaptation time intervals.

Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]

Architectural parameters studied:
  • Instruction queue size
  • ROB size
  • Cache sizes and associativity
  • Memory latency
  • Number and type of FUs
  • Register file size
  • LSQ size
  • Branch predictor: BTB size, BTB complexity

Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]

Key findings:
  • Up to 5.3x improvement in efficiency through adaptation
  • Relatively frequent adaptation (80K-instruction intervals) is needed to achieve maximum efficiency

Microarchitectural Adaptivity [Lee & Brooks, ASPLOS'08]

Key findings (cont'd):
  • On average, adapting 3 parameters is sufficient to achieve 77% of the efficiency gain; however, which 3 parameters matter depends on the application and phase
  • DVFS provides relatively smaller efficiency benefits when combined with architecture adaptations



Customization of Programmable Fabrics

FPGA-based acceleration has shown a lot of promise:
  • Many applications in bioinformatics, financial engineering, image processing, scientific computing, …
  • Many publications in FCCM, FPGA, FPL, FPT, …

Two significant barriers:
  • Communication between the CPU and the FPGA accelerator - the overhead of using a peripheral bus is too high
  • Automatic compilation - real programmers do not use VHDL/Verilog

But … a lot of encouraging progress has been made recently.

Customization of Programmable Fabrics

Recent enablers:
  • Communication between the CPU and the FPGA accelerator: high-speed connections (HyperTransport bus, FSB, QPI, …) and on-chip integration
  • Automatic compilation: maturing of C/C++-to-RTL synthesis tools

Acceleration of Lithographic Simulation [FPGA'08]

Lithography simulation:
  • Simulates the optical imaging process
  • Computationally intensive; very slow for full-chip simulation

Setup: XtremeData X1000 development system (AMD Opteron + Altera Stratix II EP2S180), with the algorithm in C synthesized by the AutoPilot(TM) synthesis tool.

Imaging equation (sum over kernels k and layout rectangles):
  I(x, y) = Σ_k λ_k | Σ τ [ψ_k(x−x1, y−y1) − ψ_k(x−x2, y−y1) + ψ_k(x−x2, y−y2) − ψ_k(x−x1, y−y2)] |²

Results:
  • 15x+ performance improvement vs. an AMD Opteron 2.2 GHz processor
  • Close to 100x improvement in energy efficiency: 15 W in the FPGA compared with 86 W in the Opteron

xPilot: Behavioral-to-RTL Synthesis Flow

Flow: behavioral spec in C/C++/SystemC -> frontend compiler (with platform description) -> SSDM intermediate representation -> RTL + constraints -> Verilog/VHDL/SystemC for FPGAs (Altera, Xilinx) and ASICs (Magma, Synopsys, …)

  • Advanced transformations/optimizations: loop unrolling/shifting/pipelining, strength reduction, tree-height reduction, bitwidth analysis, memory analysis, …
  • Core behavioral synthesis optimizations: scheduling; resource binding (e.g., functional-unit binding, register/port binding)
  • μArch generation & RTL/constraints generation

Some Recent Studies -- Efficient Identification of Approximate Patterns [Cong & Wei, FPGA'08]

  • Programs may contain many recurring patterns; prior work can only identify exact patterns
  • We can efficiently identify "approximate" patterns in large programs:
    • Based on the concept of editing distance
    • Uses data-mining techniques
    • Efficient subgraph enumeration and pruning
  • Highly scalable - can handle programs with 100,000+ lines of code
  • Applications:
    • Behavioral synthesis: 20+% area reduction due to sharing of approximate patterns
    • ASIP synthesis: identify & extract customized instructions

Figure: example pattern variations - structure variation (e.g., an add replaced by another operator), bitwidth variation (16- vs. 32-bit operands), and ports variation
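The "editing distance" notion above is easiest to see on linearized operation sequences. A small sketch: the paper works on dataflow graphs with data-mining-based enumeration and pruning; reducing patterns to operator sequences and using plain Levenshtein distance is my simplification for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two operation sequences,
    computed with a single rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i          # prev = dp[i-1][j-1]
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete
                                     dp[j - 1] + 1,    # insert
                                     prev + (x != y))  # substitute
    return dp[len(b)]

# two dataflow fragments differing in one operator
# ("structure variation" in the slide's terms)
p1 = ["add", "add", "mul"]
p2 = ["add", "sub", "mul"]
assert edit_distance(p1, p2) == 1   # approximate match within distance 1
```

Patterns whose distance falls under a threshold can share one hardware resource (with muxing for the differing operator), which is where the reported area reduction comes from.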

Some Recent Studies -- Automatic Memory Partitioning [to appear in ICCAD 2009]

The memory system is critical for high-performance, low-power design:
  • The memory bottleneck limits the maximum parallelism
  • The memory system accounts for a significant portion of total power consumption

Goal: given platform information (memory ports, power, etc.), a behavioral specification, and throughput constraints:
  • Partition memories automatically
  • Meet throughput constraints
  • Minimize power consumption

Example: in the loop

  for (int i = 0; i < n; i++)
      … = A[i] + A[i+1];

the accesses A[i] and A[i+1] always land in different banks if A is split by even/odd index (A[0, 2, 4, …] and A[1, 3, 5, …] behind a decoder), so both reads can be scheduled in the same cycle.
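The even/odd split above is an instance of cyclic (modulo) partitioning. A minimal sketch of the bank-conflict check such a partitioner performs per loop iteration; the helper names are mine, not the tool's API.

```python
def bank_of(index, n_banks=2):
    """Cyclic partitioning: array element `index` lives in bank index % n_banks."""
    return index % n_banks

def conflict_free(refs, n_banks=2):
    """True if one iteration's address references hit pairwise-distinct
    banks, i.e., each single-port bank serves at most one access/cycle."""
    banks = [bank_of(r, n_banks) for r in refs]
    return len(set(banks)) == len(banks)

# loop body '... = A[i] + A[i+1]': references {i, i+1} never collide
assert all(conflict_free([i, i + 1]) for i in range(16))
# a stride-2 body '... = A[i] + A[i+2]' conflicts under 2 banks but not 3
assert not conflict_free([0, 2], n_banks=2)
assert conflict_free([0, 2], n_banks=3)
```

The stride-2 case shows why the tool must search over partition candidates rather than always splitting even/odd: the right bank count and mapping depend on the subscript patterns of the loop.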

Automatic Memory Partitioning (AMP)

Techniques:
  • Capture array access conflicts in a conflict graph for throughput optimization
  • Model the loop kernel with parametric polytopes to obtain array access frequencies

Contributions:
  • Automatic approach for design space exploration
  • Cycle-accurate
  • Handles irregular array accesses
  • Lightweight profiling for power optimization

Flow: loop nest -> array subscript analysis (with memory platform information) -> partition candidate generation; for each candidate C_i, minimize accesses on each bank and run loop pipelining and scheduling; if the port limitation is not met, try the next candidate; once it is met, apply power optimization and emit the pipeline results.

Automatic Memory Partitioning (AMP)

About 6x throughput improvement on average, with 45% area overhead. In addition, power optimization can further reduce power by 30% after throughput optimization.

  Benchmark    II (orig)  II (partitioned)  SLICES (orig)  SLICES (partitioned)  Area ratio  Power reduction
  fir          3          1                 241            510                   2.12        26.82%
  idct         4          1                 354            359                   1.01        44.23%
  litho        16         1                 1220           2066                  1.69        31.58%
  matmul       4          1                 211            406                   1.92        77.64%
  motionEst    5          1                 832            961                   1.16        10.53%
  palindrome   2          1                 84             65                    0.77        0.00%
  avg          5.67x speedup                                                     1.45        31.80%
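The summary row follows arithmetically from the per-benchmark columns; a quick check (plain means over the six benchmarks):

```python
# (II orig, II partitioned, area ratio, power reduction %)
rows = {
    "fir":        (3, 1, 2.12, 26.82),
    "idct":       (4, 1, 1.01, 44.23),
    "litho":     (16, 1, 1.69, 31.58),
    "matmul":     (4, 1, 1.92, 77.64),
    "motionEst":  (5, 1, 1.16, 10.53),
    "palindrome": (2, 1, 0.77, 0.00),
}
n = len(rows)
ii_speedup = sum(r[0] / r[1] for r in rows.values()) / n   # throughput gain
area       = sum(r[2] for r in rows.values()) / n          # area overhead
power      = sum(r[3] for r in rows.values()) / n          # power reduction
```

These reproduce the table's averages of roughly 5.67x, 1.45, and 31.80%.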


AutoPilot Compilation Tool (based on the UCLA xPilot system)

Flow: design specification in C/C++/SystemC, with user constraints (timing/power/layout) and a common testbench -> AutoPilot(TM) ESL synthesis (compilation & elaboration, presynthesis optimizations, behavioral & communication synthesis and optimizations, backed by a platform characterization library) -> RTL HDLs & RTL SystemC, with simulation, verification, and prototyping -> FPGA co-processor.

  • Platform-based C-to-FPGA synthesis
  • Synthesizes pure ANSI C and C++; GCC-compatible compilation flow
  • Full support of IEEE-754 floating-point data types & operations
  • Efficiently handles bit-accurate fixed-point arithmetic
  • More than 10x design productivity gain
  • High quality of results

Some Other Usage of AutoPilot (Microsoft)

From John Cooley's DeepChip, 6/30/09 (http://www.deepchip.com/items/0482-06.html):

"We purchased AutoESL's AutoPilot in 2008 to implement some of the time-consuming cores in our software into FPGA hardware for the runtime speed-up improvements…
  1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines…
  2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up…"


Current On-Chip Interconnect Technology

  • Optimized RC lines with repeaters: wire sizing, buffer insertion, buffer sizing, … (e.g., the UCLA TRIO and IPEM packages)
  • Reconfigurable interconnects:
    • For FPGAs: RC buses with pass-transistors or bidirectional buffers
    • For CMPs (chip multiprocessors): mesh-like networks-on-chip (NoC)
    Both pay a large penalty in performance.

Used vs. Available Bandwidth in Modern CMOS

At the 45nm CMOS node:
  • Data rate: 4 Gbit/s, i.e., a baseband signal bandwidth of only about 4 GHz
  • The fT of 45nm CMOS can be as high as 240 GHz
  • So 98.4% of the available bandwidth is wasted

Question: how can we take advantage of the full bandwidth of modern CMOS?

UCLA 90nm CMOS VCO at 324 GHz [ISSCC 2008]

Figure: output spectrum, Pout (dBm) vs. frequency over 323.0-324.0 GHz; on-wafer VCO test setup at JPL

A CMOS voltage-controlled oscillator, measured with a subharmonic mixer driven by an 80 GHz synthesizer local oscillator. The mixing relation is fVCO − 4·fLO = fIF, i.e., fVCO − 4·(80 GHz) = 3.5 GHz, yielding fVCO = 323.5 GHz.

The CMOS VCO was designed by Frank Chang's group at UCLA and fabricated in a 90nm process.

*Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), pp. 476-477, Feb. 2008, San Francisco, CA.

Multiband RF-Interconnect

  • In the TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
  • N different data streams (N = 6 in the exemplary figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
  • In the RX, individual signals are down-converted by a mixer and recovered after a low-pass filter

Figure: signal spectrum - per-channel signal power across the shared medium
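The mix/share/de-mix scheme above is ordinary frequency-division multiplexing. A minimal baseband NumPy model under toy assumptions (audio-range sample rate and carrier frequencies, a moving-average low-pass filter; the real RF-I operates at tens of GHz with analog mixers):

```python
import numpy as np

fs, n = 1000.0, 4000
t = np.arange(n) / fs
carriers = [100.0, 250.0]                       # two "channels" (Hz)
rng = np.random.default_rng(0)
# two +/-1 symbol streams, 40 symbols of 100 samples each
bits = [np.repeat(rng.integers(0, 2, 40) * 2 - 1, 100) for _ in carriers]

# TX: each mixer up-converts one stream; the medium carries the sum
line = sum(b * np.cos(2 * np.pi * f * t) for b, f in zip(bits, carriers))

def rx(signal, f, taps=50):
    """Down-convert by mixing with the carrier, then low-pass filter."""
    mixed = signal * np.cos(2 * np.pi * f * t)
    lowpassed = np.convolve(mixed, np.ones(taps) / taps, mode="same")
    return np.sign(lowpassed)

recovered = [rx(line, f) for f in carriers]
```

Both streams travel simultaneously on one wire and are separated at the receiver because mixing shifts each wanted channel to baseband while the others land at high frequencies the filter rejects.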


Tri-band On-Chip RF-I Test Results

  Process                       IBM 90nm CMOS digital process
  Channels                      3 total: 30 GHz, 50 GHz, baseband
  Data rate per channel         RF bands: 4 Gbps; baseband: 2 Gbps
  Total data rate               10 Gbps
  Bit error rate (all bands)    < 10^-9
  Latency                       6 ps/mm
  Energy per bit (RF)           0.09* pJ/bit/mm
  Energy per bit (baseband)     0.125 pJ/bit/mm

Figure: data output waveform and output spectrum of the RF bands, 30 GHz and 50 GHz

*The VCO power (5 mW) can be shared by all (many tens of) parallel RF-I links in the NoC and does not burden an individual link significantly.


Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:
  1. 32nm node; 30x repeaters, FO4 = 8 ps, Rwire = 306 Ω/mm, Cwire = 315 fF/mm, wire pitch = 0.2 μm, bus length = 2 cm, f_bus = 1 GHz, bus width = 96 bytes
  2. Repeater area = 0.022 mm²
  3. Bus physical width = 160 μm
  4. In that width we can fit 13 transmission lines, each with 7 carriers, each carrier carrying 8 Gbps
  Interconnect length = 2 cm

                                  RF-I    Repeated bus
  # of wires                      13      448
  Data rate per carrier (Gbit/s)  8       n/a
  # of carriers per wire          7       n/a
  Data rate per wire (Gbit/s)     56      1
  Aggregate data rate (Gbit/s)    728     768
  Bus physical width (μm)         160     160
  Transceiver area (mm²)          0.27    0.022
  Power (mW)                      455     6144
  Energy per bit (pJ/bit)         0.63    8
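The table's bottom rows follow from the ones above it, which makes a quick sanity check possible (note that mW divided by Gbit/s is numerically pJ/bit):

```python
# RF-I: 13 lines x 7 carriers x 8 Gbit/s; bus: 96 bytes x 8 bits x 1 GHz
rf_agg  = 13 * 7 * 8      # Gbit/s
bus_agg = 96 * 8 * 1      # Gbit/s

rf_power_mw, bus_power_mw = 455, 6144
rf_epb  = rf_power_mw / rf_agg     # pJ/bit
bus_epb = bus_power_mw / bus_agg   # pJ/bit
```

This reproduces the aggregate rates of 728 and 768 Gbit/s and energies of about 0.63 and 8 pJ/bit: comparable bandwidth in the same physical width, at roughly 13x less energy per bit for RF-I.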

Architectural Impact of Using RF-I

  • High-bandwidth communication: data distribution across many-core topologies; vital for keeping many-core designs active
  • Low-latency communication: enables users to apply parallel computing to a broader set of applications through faster synchronization and communication; faster cache-coherence protocols
  • Reconfigurability: adapt the NoC topology/bandwidth to the needs of the individual application
  • Power-efficient communication

Simple RF-I Topology

Four NoC components with tunable Tx/Rx's attached to an RF-I transmission line bundle can realize arbitrary topologies and arbitrary bandwidths: one physical topology can be configured into many virtual topologies (pipeline/ring, bus, multicast, fully connected crossbar).

Mesh Overlaid with RF-I [HPCA'08]

  • 10x10 mesh of pipelined routers; the NoC runs at 2 GHz with XY routing
  • 64 4-GHz 3-wide processor cores (labeled aqua), each with an 8KB L1 data cache and an 8KB L1 instruction cache
  • 32 L2 cache banks (labeled pink), 256KB each, organized as a shared NUCA cache
  • 4 main memory interfaces (labeled green)
  • RF-I transmission line bundle (thick black line spanning the mesh)

RF-I Logical Organization

  • Logically:
  • RF-I behaves as set of

N express channels

  • Each channel assigned

to src, dest router pair (s,d)

  • Reconfigured by:
  • remapping shortcuts to

match needs of different applications

LOGICAL A LOGICAL B

48

Power Savings [MICRO’08]

  • We can thin the baseline mesh links: from 16B to 8B to 4B
  • RF-I makes up the difference in performance while saving overall power
      - RF-I provides bandwidth where it is most necessary
      - Baseline RC wires supply the rest

[Figure: meshes with 16-byte, 8-byte, and 4-byte links; node A requires high bandwidth to communicate with node B]


49

RF-I Enabled Multicast

[Figure: a cache-fill request scenario on a conventional NoC, with hop-by-hop Get/Fill messages, versus an RF-I enabled NoC, where tuned Tx/Rx pairs deliver the fill by multicast]

50

Impact of Using RF-Interconnects [MICRO'08]

  • Adaptive RF-I enabled NoC is cost effective in terms of both power and performance

51

Customizable Heterogeneous Platform (CHP)

[Figure: a CHP combining fixed cores, custom cores, and programmable fabric, each with caches ($), connected by a reconfigurable RF-I bus and a reconfigurable optical bus (transceivers/receivers and optical interfaces) to DRAM and I/O; multiple CHPs compose a larger system]

Overview of the Proposed Research

  • CHP mapping: source-to-source CHP mapper, reconfiguring & optimizing backend, adaptive runtime
  • Domain characterization: application modeling, domain-specific modeling (healthcare applications)
  • CHP creation: customizable computing engines, customizable interconnects, architecture modeling

Design once Invoke many times

52

CHP Mapping – Compilation and Runtime Software Systems for Customization

Goals: Efficient mapping of domain-specific specification to customizable hardware – Adapt the CHP to a given application for drastic performance/power efficiency improvement

[Flow: the programmer writes domain-specific applications in a domain-specific programming model (domain-specific coordination graph and domain-specific language extensions); abstract execution supplies application characteristics, which, together with CHP architecture models, feed the source-to-source CHP mapper; its C/C++ output passes through a C/C++ front-end to the reconfiguring and optimizing back-end, which emits binary code for fixed & customized cores, analysis annotations, and, via the RTL synthesizer (xPilot) from a C/SystemC behavioral spec, RTL for the programmable fabric; an adaptive runtime (lightweight threads and adaptive configuration) executes on CHP architectural prototypes (CHP hardware testbeds, CHP simulation testbed, full CHP) and provides performance feedback]


53

FCUDA: CUDA-to-FPGA (Best Paper Award at SASP 2009)

Use CUDA in tandem with High-Level Synthesis (HLS) to:

  • enable high-level abstraction for FPGA programming
  • exploit massively parallel compute capabilities of FPGAs
  • facilitate a single interface for GPU and FPGA kernel acceleration

CUDA: C-based parallel programming model for GPUs

  • concise expression of coarse-grained parallelism
  • very popular (wide range of existing applications)
  • explicit partitioning and transfer of data between off-chip and on-chip memory

AutoPilot: advanced HLS tool (from AutoESL)

  • platform-specific (i.e., FPGA/ASIC) C-to-RTL mapping
  • fine-grained and loop-iteration parallelism extraction
  • annotated coarse-grained parallelism extraction (requires explicit expression and annotation from the programmer)

54

CUDA-to-AutoPilot C Translation

  • Identify off-chip data transfers
      - aggregate multi-thread off-chip accesses into DMA bursts
  • Split the kernel into computation and data-communication tasks
  • Use thread-block granularity for splitting kernel threads into parallel FPGA cores
  • Allocate data storage based on the following memory space mapping (GPU → FPGA):
      - Global → off-chip DRAM
      - Shared → on-chip BRAMs
      - Constant/Texture → registers
      - Registers / local memory → registers within each thread-block kernel task
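The burst-aggregation step can be sketched as follows. The `coalesce` function and its sorted word-address model are illustrative assumptions, not FCUDA's actual implementation: adjacent per-thread word accesses merge into one contiguous DMA burst, while strided accesses do not.

```python
# Illustrative sketch (assumed, simplified): merge per-thread off-chip word
# addresses into contiguous DMA bursts, as in the aggregation step above.

def coalesce(addresses, word_bytes=4):
    """Merge word addresses into (start_address, length_in_words) bursts."""
    bursts = []
    for addr in sorted(set(addresses)):
        # Extend the current burst if this word is contiguous with it
        if bursts and addr == bursts[-1][0] + bursts[-1][1] * word_bytes:
            bursts[-1] = (bursts[-1][0], bursts[-1][1] + 1)
        else:
            bursts.append((addr, 1))
    return bursts

# 8 threads reading consecutive 4-byte words -> one 8-word burst
assert coalesce([0, 4, 8, 12, 16, 20, 24, 28]) == [(0, 8)]
# a strided pattern cannot be merged -> two single-word transfers
assert coalesce([0, 64]) == [(0, 1), (64, 1)]
```

Fewer, longer bursts amortize DRAM latency across the thread block, which is the point of aggregating the accesses before generating the FPGA cores.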


55

Results

Kernel configurations:

Kernel                    Configuration             Description
Matrix Multiply (matmul)  1024x1024                 Common kernel in many imaging, simulation, and scientific applications
Coulombic potential (cp)  4000 atoms, 512x512 grid  Computation of electric potential in a volume containing charged atoms
RSA Encryption (rc5-72)   4 billion keys            Brute-force encryption key generation and matching

Resource usage and bandwidth:

Benchmark     Core #  DRAM Bandwidth  Limiting Resource
matmul 32bit  128     3.5 GB/s        DSP
matmul 16bit  176     1.6 GB/s        BRAM
matmul 8bit   176     0.8 GB/s        BRAM
cp 32bit      25      0.128 GB/s      DSP
cp 16bit      96      0.19 GB/s       DSP
cp 8bit       96      0.1 GB/s        DSP
rc5-72 32bit  80      ≈ 0 GB/s        LUT

[Chart: FPGA speedup over GPU (0.5x–2.5x) for matmul, cp, and rc5-72 at 32/16/8-bit precisions]

Benchmark     GPU (GeForce 8800)  FPGA (Virtex5 xc5vfx200t)  FPGA over GPU Benefit
matmul 32bit  ≈ 100 Watt          10.622 Watt                9.41X
matmul 16bit                      10.559 Watt                9.47X
matmul 8bit                       9.954 Watt                 10.05X

  • Speedup comparable to the GPU in several configurations
  • Much more power efficient than the GPU!
  • Assumes the FPGA has a high-bandwidth bus to off-chip DDR
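A quick check of the power-benefit figures, using the slide's ≈100 W GPU estimate (all other numbers are taken from the table):

```python
# Reproduce the FPGA-over-GPU power-benefit column.
gpu_watt = 100.0  # the slide's approximate GeForce 8800 power draw
fpga_watt = {
    "matmul 32bit": 10.622,
    "matmul 16bit": 10.559,
    "matmul 8bit": 9.954,
}
benefit = {k: round(gpu_watt / w, 2) for k, w in fpga_watt.items()}
# -> 9.41X, 9.47X, 10.05X, matching the table
```

Since the speedups are near 1x, this power ratio is also roughly the energy-efficiency advantage per kernel invocation.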

56

Concluding Remarks

  • We believe that domain-specific customization is the next transformative approach to energy-efficient computing
      - Beyond parallelization?
  • Many research opportunities and challenges
      - Domain-specific modeling/specification
      - Novel architecture & microarchitecture for customization
      - Compilation and runtime software to support intelligent customization
      - New research in testing, verification, and reliability in customizable computing
  • CDSC is taking a highly integrated effort
      - Coordinated cross-layer customization in modeling, HW, SW, & application development


57

Acknowledgements

  • A highly collaborative effort – thanks to all my co-PIs at four universities: UCLA, Rice, Ohio State, and UC Santa Barbara
  • Thanks for the support from the National Science Foundation

Aberle (UCLA), Baraniuk (Rice), Bui (UCLA), Cong (Director) (UCLA), Cheng (UCSB), Chang (UCLA), Reinman (UCLA), Palsberg (UCLA), Sadayappan (Ohio-State), Sarkar (Associate Dir) (Rice), Vese (UCLA), Potkonjak (UCLA)