From Milliwatts to Megawatts

SLIDE 1

1

DAC’2009 Panel

From Milliwatts to Megawatts: The System-Level Power Challenge

Jason Cong
UCLA Computer Science Department
cong@cs.ucla.edu
http://cadlab.cs.ucla.edu/~cong

2

The Power Barrier …

Source: Shekhar Borkar, Intel

SLIDE 2

3

Focus: New Transformative Approach to Power/Energy Efficient Computing

Current Solution: Parallelization

Source: Shekhar Borkar, Intel

4

Rise of Multi-core Processors

Nvidia's GT200 GPU (30*8 = 240 cores)
Sony-Toshiba-IBM Cell Processor (1 PPE + 8 SPE)
Sun Rock processor (4*4 = 16 cores)
Intel Larrabee (32 cores)

SLIDE 3

5

Cluster of Computers

IBM BlueGene/L: No. 1 in the Top500 list of Nov. 2007, now No. 4 in the newest Top500 list

6

Cost and Energy are Still a Big Issue …

Cost of computing

HW acquisition
Energy bill
Heat removal
Space
…

SLIDE 4

7

Next Big Idea for Controlling Megawatts: Customization

Parallelization

Customization: adapt the architecture to the application domain

Source: Shekhar Borkar, Intel

8

Motivation

A few facts

We have sufficient computing power for most applications
Each user/enterprise needs high computing power for only selected tasks in its domain
Application-specific integrated circuits (ASICs) can lead to 10,000X+ better power-performance efficiency, but are too expensive to design and manufacture

Our proposal

A general, customizable platform for the given domain(s)

Can be customized to a wide range of applications in the domain
Can be mass-produced with cost efficiency
Can be programmed efficiently with novel compilation and runtime systems

Goal:

A “supercomputer-in-a-box” with 100X performance/power improvement via customization for the intended domain(s)

Analogy:

Advance of civilization via specialization/customization

SLIDE 5

9

Example Application Domain: Healthcare

Medical imaging has transformed healthcare

An in vivo method for understanding disease development and patient condition
Estimated to be $100 billion/year
More powerful & efficient computation can help

Fewer exposures using compressive sensing
Better clinical assessment (e.g., for cancer) using improved registration and segmentation algorithms

Hemodynamic simulation

Very useful for surgical procedures involving blood flow and vasculature

Both may take hours to days to construct

Clinical requirement: 1-2 min
Cloud computing won’t work: communication, real-time requirement, privacy
A megawatt datacenter for each hospital?

Intracranial aneurysm reconstruction with hemodynamics
Magnetic resonance (MR) angiograph of an aneurysm

10

Medical Image Processing Pipeline

reconstruction: compressive sensing
denoising: total variational algorithm
registration: fluid registration
segmentation: level set methods
analysis: Navier-Stokes equations

Reconstruction (compressive sensing). Medical images exhibit sparsity, and can be sampled at a rate << classical Shannon-Nyquist theory:

\[ \min_{u} \sum_{\text{sampled points}} |ARu - S|^2 + \lambda \sum_{\forall\,\text{voxels}} |\operatorname{grad}(u)| \]

Denoising (total variational algorithm): [equation not recoverable from the extracted text]

Registration (fluid registration):

\[ v = \frac{\partial u}{\partial t} + v \cdot \nabla u \]

\[ \mu \Delta v + (\mu + \eta)\,\nabla(\nabla \cdot v) = -\left[ T(x-u) - R(x) \right] \nabla T(x-u) \]

Segmentation (level set methods):

\[ \frac{\partial \varphi}{\partial t} = |\nabla \varphi| \left[ F(\text{data}, \varphi) + \lambda \,\operatorname{div}\!\left( \frac{\nabla \varphi}{|\nabla \varphi|} \right) \right], \qquad \text{surface}(t) = \{ \text{voxels } x : \varphi(x,t) = 0 \} \]

Analysis (Navier-Stokes equations):

\[ \frac{\partial v}{\partial t} + v \cdot \nabla v = -\nabla p + \upsilon \Delta v + f(x,t) \]

\[ \frac{\partial v_i}{\partial t} + \sum_{j=1}^{3} v_j \frac{\partial v_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \upsilon \sum_{j=1}^{3} \frac{\partial^2 v_i}{\partial x_j^2} + f_i(x,t) \]

SLIDE 6

11

Application Domains: Medical Image Processing Pipeline

reconstruction: compressive sensing
denoising: total variational algorithm
registration: fluid registration
segmentation: level set methods
analysis: Navier-Stokes equations

Computation and communication characteristics of the algorithms:

non-iterative, highly parallel, local & global communication; sparse linear algebra, structured grid, optimization methods
parallel, global communication; dense linear algebra, optimization methods
local communication; sparse linear algebra, n-body methods, graphical models
local communication; dense linear algebra, spectral methods, MapReduce
iterative, local or global communication; dense and sparse linear algebra, optimization methods

These algorithms have diverse computation & communication patterns

A single homogeneous system cannot perform very well on all these algorithms

12

Need of Customization for Medical Image Processing Pipeline

These algorithms have diverse computation & communication patterns

A single, homogeneous system cannot perform very well on all of these algorithms

Need architecture customization and hardware-software co-optimization

Include many common computation kernels (“motifs”)

Applicable to other domains

Bi-harmonic registration (using the same algorithm on all platforms)

Platform              Speedup   Power
CPU (Xeon 2.0 GHz)    1x        ~100 W
GPU (Tesla C1060)     93x       ~150 W
FPGA (xc4vlx100)      11x       ~5 W
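The bi-harmonic registration figures can be turned into a performance-per-watt comparison. The sketch below only does arithmetic on the speedup and approximate power values quoted on the slide; the dictionary layout and the function name `perf_per_watt_gain` are illustrative, not part of the original material.

```python
# Speedup and approximate power for bi-harmonic registration, as quoted on the slide.
platforms = {
    "CPU": {"speedup": 1.0, "watts": 100.0},
    "GPU": {"speedup": 93.0, "watts": 150.0},
    "FPGA": {"speedup": 11.0, "watts": 5.0},
}


def perf_per_watt_gain(name, baseline="CPU"):
    """Performance per watt of `name`, relative to the baseline platform."""
    p, b = platforms[name], platforms[baseline]
    return (p["speedup"] / p["watts"]) / (b["speedup"] / b["watts"])


for name in platforms:
    print(f"{name}: {perf_per_watt_gain(name):.0f}x performance per watt vs. CPU")
```

On these numbers the GPU is the fastest platform, but the FPGA comes out ahead once power is taken into account.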

3D median filter: for each voxel, compute the median of the 3 x 3 x 3 neighboring voxels

Platform              Algorithm                    Speedup   Power
CPU (Xeon 2.0 GHz)    Quick select                 1x        ~100 W
GPU (Tesla C1060)     Median of medians            70x       ~140 W
FPGA (xc4vlx100)      Bit-by-bit majority voting   1200x     ~3 W
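To make the kernel concrete, here is a minimal Python sketch of a 3 x 3 x 3 median filter, together with one common formulation of a bit-serial majority-voting median. Both are illustrations under stated assumptions: the slide does not say how borders are handled (edge replication is assumed here), and the majority-voting routine is a textbook version of the idea, not necessarily the circuit that was built on the FPGA.

```python
import numpy as np


def median_filter_3d(volume):
    """Reference 3x3x3 median filter (borders replicated; an assumption)."""
    padded = np.pad(volume, 1, mode="edge")
    out = np.empty_like(volume)
    nz, ny, nx = volume.shape
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                out[z, y, x] = np.median(padded[z:z + 3, y:y + 3, x:x + 3])
    return out


def majority_vote_median(values, bits=8):
    """Median of an odd number of unsigned integers, one bit at a time (MSB first).

    At each bit position the median's bit is the majority bit. A value that
    disagrees with the majority is now known to lie above or below the median,
    so it votes all-ones or all-zeros for every remaining bit.
    """
    values = [int(v) for v in values]
    n = len(values)
    assert n % 2 == 1, "an odd number of inputs is required"
    state = [0] * n  # 0: undecided, +1: known above median, -1: known below
    median = 0
    for b in range(bits - 1, -1, -1):
        votes = []
        for v, s in zip(values, state):
            if s == 0:
                votes.append((v >> b) & 1)
            else:
                votes.append(1 if s > 0 else 0)
        m = 1 if sum(votes) > n // 2 else 0
        median = (median << 1) | m
        for i in range(n):
            if state[i] == 0 and votes[i] != m:
                state[i] = 1 if votes[i] == 1 else -1
    return median


rng = np.random.default_rng(0)
vol = rng.integers(0, 256, size=(4, 5, 6), dtype=np.uint8)
filtered = median_filter_3d(vol)
window = np.pad(vol, 1, mode="edge")[0:3, 0:3, 0:3].ravel()
print(filtered[0, 0, 0], majority_vote_median(window))
```

The majority-voting form needs only bit counts and per-value flags, with no sorting or data-dependent branching over whole words, which is why it maps naturally onto parallel hardware.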

SLIDE 7

13

Overview of the Proposed Research

Customizable Heterogeneous Platform (CHP)

Figure: CHP block diagram showing caches ($), fixed cores, custom cores, programmable fabric, DRAM, I/O, other CHPs, a reconfigurable RF-I bus, a reconfigurable optical bus, a transceiver/receiver, and an optical interface

Domain characterization: application modeling; domain-specific modeling (healthcare applications)
CHP creation: customizable computing engines; customizable interconnects; architecture modeling
CHP mapping: source-to-source CHP mapper; reconfiguring & optimizing backend; adaptive runtime

Design once / Invoke many times

14

CHP Creation: Design Space Exploration

Key questions:
Optimal trade-off between efficiency & customizability
Which options to fix at CHP creation? Which to be set by CHP mapper?

Custom instructions & accelerators

Amount of programmable fabric
Shared vs. private accelerators
Custom instruction selection
Choice of accelerators
…

Core parameters

Frequency & voltage
Datapath bit width
Instruction window size
Issue width
Cache size & configuration
Register file organization
# of thread contexts
…

NoC parameters

Interconnect topology
# of virtual channels
Routing policy
Link bandwidth
Router pipeline depth
Number of RF-I enabled routers
RF-I channel and bandwidth allocation
…

SLIDE 8

15

Multiband RF-Interconnect

In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)

N different data streams (N = 6 in the example figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates

In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
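The transmit/receive scheme described above is frequency-division multiplexing, and it can be sketched as an idealized NumPy simulation. Everything numeric here is an assumption made for illustration: the carrier frequencies (whole numbers of cycles per symbol), the 64 samples per symbol, the noiseless line, and the use of a per-symbol average as the low-pass filter. The function name `rf_i_link` is hypothetical.

```python
import numpy as np


def rf_i_link(bits, cycles_per_symbol, samples_per_symbol=64):
    """Send N bit streams over one shared line using N carriers, then recover them.

    bits: array of shape (N, n_symbols) holding 0/1 values.
    cycles_per_symbol: N distinct carrier frequencies, in cycles per symbol.
    """
    bits = np.asarray(bits)
    n_streams, n_symbols = bits.shape
    # Time axis measured in symbol periods.
    t = np.arange(n_symbols * samples_per_symbol) / samples_per_symbol
    carriers = np.cos(2 * np.pi * np.outer(cycles_per_symbol, t))

    # TX: each mixer up-converts one +/-1 baseband stream onto its own carrier;
    # all bands are summed onto the shared transmission line.
    baseband = np.repeat(2 * bits - 1, samples_per_symbol, axis=1)
    line = (baseband * carriers).sum(axis=0)

    # RX: mix down with the same carrier, then low-pass filter. Averaging over
    # one symbol period stands in for the low-pass filter in this sketch.
    mixed = line[None, :] * carriers
    filtered = mixed.reshape(n_streams, n_symbols, samples_per_symbol).mean(axis=2)
    return (filtered > 0).astype(int)


rng = np.random.default_rng(0)
tx_bits = rng.integers(0, 2, size=(6, 16))      # N = 6 streams
carriers = np.array([2, 4, 6, 8, 10, 12])       # assumed carrier plan
rx_bits = rf_i_link(tx_bits, carriers)
print("all streams recovered:", np.array_equal(tx_bits, rx_bits))
```

Because each carrier completes a whole number of cycles per symbol, the bands are orthogonal over a symbol period, so every stream is recovered without interference from the others in this noiseless model.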

Figure: signal spectrum (axis label: signal power)

16

Comparison between Repeated Bus and Multi-band RF-I @ 32nm

Assumptions:

1. 32nm node; 30x repeater, FO4 = 8ps, Rwire = 306Ω/mm, Cwire = 315fF/mm, wire pitch = 0.2um, bus length = 2cm, f_bus = 1GHz, bus width = 96 bytes

2. Repeater area = 0.022mm2

3. Bus physical width = 160um

4. In that width we can fit 13 transmission lines, each with 7 carriers, each carrier carrying 8Gbps

Interconnect length = 2cm

                                  RF-I    Repeated Bus
# of wires                        13      448
Data rate per carrier (Gbit/s)    8       NA
# of carriers                     7       NA
Data rate per wire (Gbit/s)       56      1
Aggregate data rate (Gbit/s)      728     768
Bus physical width (um)           160     160
Transceiver area (mm2)            0.27    0.022
Power (mW)                        455     6144
Energy per bit (pJ/bit)           0.63    8
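The aggregate data rates and energy-per-bit figures in the comparison follow from the stated assumptions. The short sketch below reproduces them; the function names are illustrative, and the repeated-bus rate is derived from the 96-byte bus width at 1 GHz given in the assumptions.

```python
def aggregate_gbps(wires, carriers_per_wire, gbps_per_carrier):
    """Total data rate of a multi-band link in Gbit/s."""
    return wires * carriers_per_wire * gbps_per_carrier


def energy_per_bit_pj(power_mw, rate_gbps):
    """mW / (Gbit/s) = 1e-3 J/s / 1e9 bit/s = 1e-12 J/bit, i.e. pJ/bit."""
    return power_mw / rate_gbps


# RF-I: 13 transmission lines x 7 carriers x 8 Gbit/s per carrier.
rf_i_rate = aggregate_gbps(13, 7, 8)
# Repeated bus: 96 bytes wide, clocked at 1 GHz.
bus_rate = 96 * 8 * 1

print(f"RF-I:         {rf_i_rate} Gbit/s, {energy_per_bit_pj(455, rf_i_rate):.3f} pJ/bit")
print(f"Repeated bus: {bus_rate} Gbit/s, {energy_per_bit_pj(6144, bus_rate):.3f} pJ/bit")
```

At nearly the same aggregate rate, the RF-I link uses roughly 13 times less energy per bit than the repeated bus on these numbers.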

SLIDE 9

17

Acceleration of Lithographic Simulation

Lithography simulation

Simulate the optical imaging process
Computationally intensive; very slow for full-chip simulation

XtremeData X1000 development system (AMD Opteron + Altera StratixII EP2S180)

AutoPilot™ Synthesis Tool

Algorithm in C

\[ I(x,y) = \sum_{k} \lambda_k \left| \sum_{\tau} \left[ \psi_k(x-x_1, y-y_1) - \psi_k(x-x_2, y-y_1) + \psi_k(x-x_2, y-y_2) - \psi_k(x-x_1, y-y_2) \right] \right|^2 \]
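The intensity formula can be read as a triple loop: over kernels k, over rectangles τ, and over the four corners of each rectangle. The sketch below shows that structure in plain Python. It is not the C code from the talk: the rectangle tuple layout `(x1, y1, x2, y2)`, the use of callables for ψ_k, and the function name `aerial_intensity` are all assumptions for illustration.

```python
def aerial_intensity(x, y, rects, kernels, weights):
    """Evaluate I(x, y) for a layout made of axis-aligned rectangles.

    rects:   iterable of (x1, y1, x2, y2) corner coordinates (assumed layout).
    kernels: list of callables psi_k(dx, dy), one per kernel.
    weights: list of lambda_k values, one per kernel.
    """
    total = 0.0
    for lam, psi in zip(weights, kernels):
        acc = 0.0
        for x1, y1, x2, y2 in rects:
            # Four-corner combination from the formula for one rectangle.
            acc += (psi(x - x1, y - y1) - psi(x - x2, y - y1)
                    + psi(x - x2, y - y2) - psi(x - x1, y - y2))
        total += lam * abs(acc) ** 2
    return total


def toy_kernel(dx, dy):
    # With psi(a, b) = a * b the four-corner sum equals the rectangle's area,
    # which makes the result easy to check by hand.
    return dx * dy


print(aerial_intensity(5.0, 5.0, [(0, 0, 2, 3)], [toy_kernel], [1.0]))
```

With the toy kernel, the single 2 x 3 rectangle contributes its area, 6, so the printed intensity is 6 squared. The independent inner sums over rectangles and kernels are the kind of regular, data-parallel work that suits an FPGA implementation.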

15X+ performance improvement vs. AMD Opteron 2.2GHz processor

Close to 100X improvement in energy efficiency

15W in FPGA compared with 86W in Opteron
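As a rough check of the energy figure, assume the speedup is exactly 15X and that both power numbers are averages over the same workload (the slide gives no measurement details). Energy is power times run time, so:

```latex
\[
\frac{E_{\text{Opteron}}}{E_{\text{FPGA}}}
  = \frac{P_{\text{Opteron}} \, t_{\text{Opteron}}}{P_{\text{FPGA}} \, t_{\text{FPGA}}}
  = \frac{86\ \text{W}}{15\ \text{W}} \times 15
  = 86
\]
```

This is consistent with "close to 100X", particularly since the speedup is quoted as 15X+.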

18

Interesting News from Microsoft

On John Cooley’s DeepChip 6/30/09

http://www.deepchip.com/items/0482-06.html
“We purchased AutoESL's AutoPilot in 2008 to implement some of the time-consuming cores in our software into FPGA hardware for the runtime speed-up improvements…

1. RankBoost - a machine-learning algorithm used in the dynamic ranking of search engines…

2. Sorting Algorithm - also several thousand lines of OO C++ code with 138 lines that needed speeding up…”

SLIDE 10

19

Conclusions and Acknowledgements

Domain-specific customized computing is key to combating megawatt requirements

FPGA is a good starting point, but there are many other opportunities to customize cores and interconnects

Acknowledgements: A lot of discussions/inputs from the members of the Center of Customizable Domain-Specific Computing (CDSC), esp. Alex Bui (UCLA Medical School), Glenn Reinman (UCLA CSD), Vivek Sarkar (Rice University)