
SLIDE 1


Customized Computing for Power Efficiency

Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department
http://cadlab.cs.ucla.edu/~cong

There Are Many Options to Improve Performance

SLIDE 2


Past Alternatives: Frequency Scaling

Source: Shekhar Borkar, Intel

Current Alternatives: Parallelization

Source: Shekhar Borkar, Intel

SLIDE 3


Multi-core Processors

Sun UltraSPARC T2 microprocessor: 8 cores, 64 threads
Tilera TILE64 multi-core processor

Warehouse of Computers

IBM BlueGene/L: No. 1 in the newest Top500

SLIDE 4


But Power Remains a Limiting Factor …

Cost of computing

HW acquisition
Energy bill
Heat removal
Space
…

Power Will Be the Driver for Acceptance of Customized Computing

Source: Shekhar Borkar, Intel

[Figure labels: Parallelization, Customization]

SLIDE 5


UCLA Experience: Lithography Simulation Acceleration

Simulation of the optical imaging process
Computationally intensive and quite slow for full-chip simulation
Synthesized into a Stratix-II FPGA on the XDI platform using AutoPilot

Experiment Results [FPGA'2008]

15X speedup using a 5 by 5 partitioning, over an Opteron (2.2 GHz, 4 GB RAM)
Logic utilization around 25K ALUTs (of which 8K are used in the interface framework rather than in the design)
Power utilization less than 15 W in the FPGA, compared with 86 W in the Opteron 248
Close to 100X (5.8 x 15) improvement in energy efficiency
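The energy-efficiency figure follows from energy = power x time: the gain is the power ratio multiplied by the speedup. The sketch below is only an illustration using the rounded numbers quoted on the slide; the variable names are not from the original. Taking the FPGA at its 15 W upper bound gives a power ratio of about 5.7 and a gain of about 86X. Because the FPGA draws less than 15 W, the actual ratio is somewhat higher, which is consistent with the slide's 5.8 x 15.

```python
# Energy = power x time, so the energy-efficiency gain is
# (CPU power / FPGA power) x (CPU time / FPGA time).
cpu_power_w = 86.0    # Opteron 248, as quoted on the slide
fpga_power_w = 15.0   # upper bound quoted for the FPGA ("less than 15 W")
speedup = 15.0        # FPGA runtime is 1/15 of the CPU runtime

power_ratio = cpu_power_w / fpga_power_w    # about 5.7 at the 15 W bound
energy_gain = power_ratio * speedup         # about 86

print(f"power ratio: {power_ratio:.2f}x")
print(f"energy-efficiency gain: {energy_gain:.0f}x")
```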

SLIDE 6


A Lot More is Needed for Power-Efficient Customized Computing

More power-efficient programmable fabrics
  Capability to do power gating, voltage and frequency scaling
A powerful, fully automated C/C++ to FPGA compiler
  Taking full advantage of various power optimization options in a transparent way
Customization beyond just FPGA fabrics
  Application-specific instruction-set processors (ASIP)
  Application-specific processor networks (ASPN)
More power-efficient programmable (global) interconnects
  E.g., RF-interconnects

RF-Interconnects: Power-Efficient Programmable (Global) Interconnect Solution

SLIDE 7


Limited RC Wire Bandwidth

@ 45nm CMOS Technology

Data rate: 4 Gbit/s
$f_T$ of 45nm CMOS can be as high as 240GHz
Baseband signal bandwidth only about 4GHz
98.4% of available bandwidth is wasted

Open Question: How to take advantage of the full bandwidth of modern CMOS?
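The wasted-bandwidth figure can be checked with a quick ratio. This is a sketch using the rounded values on the slide (4 GHz of baseband bandwidth against a 240 GHz cutoff frequency); with these inputs the result is roughly 98.3%, in line with the quoted 98.4%, so the slide presumably used slightly different unrounded values. The variable names are illustrative only.

```python
f_t_ghz = 240.0       # transistor cutoff frequency quoted for 45nm CMOS
baseband_ghz = 4.0    # bandwidth actually used by a baseband RC wire

used = baseband_ghz / f_t_ghz   # fraction of the available bandwidth in use
wasted = 1.0 - used             # fraction left unused

print(f"used: {used:.1%}, wasted: {wasted:.1%}")
```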

[Figure: plot of Pout (dBm) versus Frequency (GHz), spanning roughly 323.0 to 324.0 GHz]

UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008)

CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer and driven with an 80 GHz synthesizer local oscillator. The mixing frequency is $f_{VCO} - 4 f_{LO} = f_{IF}$, or $f_{VCO} - 4 \times (80\,\text{GHz}) = 3.5\,\text{GHz}$, yielding $f_{VCO} = 323.5\,\text{GHz}$!
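The subharmonic-mixer relation can be rearranged to recover the oscillator frequency from the measured intermediate frequency. The helper below is a minimal sketch; the function name and its parameters are hypothetical, and the only inputs are the 80 GHz local oscillator, the 4th harmonic, and the 3.5 GHz intermediate frequency stated on the slide.

```python
def vco_frequency_ghz(f_lo_ghz: float, f_if_ghz: float, harmonic: int = 4) -> float:
    """Solve f_VCO - harmonic * f_LO = f_IF for f_VCO (all values in GHz)."""
    return harmonic * f_lo_ghz + f_if_ghz


# Values from the slide: 80 GHz local oscillator, 3.5 GHz intermediate frequency.
f_vco = vco_frequency_ghz(f_lo_ghz=80.0, f_if_ghz=3.5)
print(f"f_VCO = {f_vco} GHz")
```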

On-Wafer VCO Test Setup at JPL

CMOS VCO designed by Frank Chang's group at UCLA, fabricated in a 90nm process

[Figure label: 323.5GHz VCO]

*Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), 476-477, (Feb 2008) San Francisco, CA

SLIDE 8


Multiband RF-Interconnect

In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
N different data streams (N=6 in exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
In RX, individual signals are down-converted by a mixer, and recovered after a low-pass filter
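The TX/RX steps above can be sketched as an idealized, noiseless simulation. This is not the circuit from the slide: the carrier frequencies, the sample count, and the function names are all assumptions chosen so that each carrier completes a whole number of cycles per symbol, which makes the bands orthogonal. Each bit stream is multiplied by its own carrier (up-conversion), the six results are summed on one shared line, and the receiver multiplies by the same carrier (down-conversion) and averages over a symbol, which plays the role of the low-pass filter.

```python
import math

SAMPLES_PER_SYMBOL = 64
# Hypothetical carriers, in cycles per symbol period: one band per stream (N=6).
CARRIER_CYCLES = [4, 8, 12, 16, 20, 24]


def carrier(band: int, n: int) -> float:
    """Sample n of the cosine carrier assigned to the given band."""
    return math.cos(2 * math.pi * CARRIER_CYCLES[band] * n / SAMPLES_PER_SYMBOL)


def transmit(streams: list[list[int]]) -> list[float]:
    """Up-convert every bit stream to its band and sum them on one shared line."""
    line = []
    for sym in range(len(streams[0])):
        for n in range(SAMPLES_PER_SYMBOL):
            sample = 0.0
            for band, bits in enumerate(streams):
                level = 1.0 if bits[sym] else -1.0   # baseband: +1 / -1 levels
                sample += level * carrier(band, n)    # mixer: multiply by carrier
            line.append(sample)
    return line


def receive(line: list[float], band: int) -> list[int]:
    """Down-convert one band, then low-pass by summing over each symbol period."""
    bits = []
    for start in range(0, len(line), SAMPLES_PER_SYMBOL):
        acc = sum(line[start + n] * carrier(band, n) for n in range(SAMPLES_PER_SYMBOL))
        bits.append(1 if acc > 0 else 0)
    return bits


streams = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
line = transmit(streams)
recovered = [receive(line, band) for band in range(len(streams))]
print("all streams recovered:", recovered == streams)
```

All six streams share a single list of samples, yet each one is recovered independently, which is the property that lets the aggregate data rate scale with the number of bands.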

[Figure: Signal Spectrum; axis label: Signal Power]

Advantages of RF-Interconnect (RF-I)

Latency: speed-of-light data transmission
Bandwidth: high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
Area: avoid extensive use of repeaters
Energy: low overall energy per bit
Reconfigurability: efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas

SLIDE 9


Simple RF-I Topology

Four NoC components
Tunable Tx/Rx's
Arbitrary topologies
Arbitrary bandwidths

[Figure: NoC components (C) attached through Tx/Rx to an RF-I transmission line bundle; virtual topologies shown: Pipeline/Ring, Bus, Multicast, Fully Connected, Crossbar]

One physical topology can be configured to many virtual topologies

RF-I for Multi-Core On-Chip Communication [HPCA'2008, MICRO'2008]

10x10 mesh of pipelined routers
  NoC runs at 2GHz
  XY routing
64 4GHz 3-wide processor cores
  Labeled aqua
  8KB L1 Data Cache
  8KB L1 Instruction Cache
32 L2 Cache Banks
  Labeled pink
  256KB each
  Organized as shared NUCA cache
4 Main Memory Interfaces
  Labeled green
RF-I transmission line bundle
  Black thick line spanning mesh
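XY routing on the baseline mesh can be sketched as follows: a packet first travels along the X dimension until its column matches the destination, then along Y. The coordinate scheme and function name below are assumptions for illustration. The component counts on the slide (64 cores, 32 L2 banks, 4 memory interfaces) add up to 100, which would match one component per router in a 10x10 mesh, although the slide does not state that mapping explicitly.

```python
MESH_DIM = 10  # 10x10 mesh of routers, addressed here as (x, y) with 0 <= x, y < 10


def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the routers visited from src to dst: move along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# 64 cores + 32 L2 banks + 4 memory interfaces equals the number of routers.
components = 64 + 32 + 4
print("components:", components, "routers:", MESH_DIM * MESH_DIM)

# Corner-to-corner traffic is the worst case for the plain mesh.
path = xy_route((0, 0), (9, 9))
print("hops corner to corner:", len(path) - 1)
```

The long corner-to-corner path is the kind of traffic that the RF-I transmission line bundle spanning the mesh is meant to shorten.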

SLIDE 10


RF-I Logical Organization

Logically:
  RF-I behaves as a set of N express channels
  Each channel is assigned to a src, dest router pair (s,d)
Reconfigured by:
  remapping shortcuts to match the needs of different applications
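The effect of remapping express channels can be illustrated with a small hop-count model. This is a sketch under stated assumptions, not the routing scheme from the cited papers: each shortcut is a directed (s, d) router pair that costs one hop, a packet uses at most one shortcut, and the two shortcut placements named `logical_a` and `logical_b` are made up for the example.

```python
Router = tuple[int, int]


def mesh_hops(a: Router, b: Router) -> int:
    """Hop count between two routers on the plain mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hops_with_shortcuts(src: Router, dst: Router,
                        shortcuts: list[tuple[Router, Router]]) -> int:
    """Best of the plain mesh route, or mesh to s, one RF-I hop s->d, mesh to dst."""
    best = mesh_hops(src, dst)
    for s, d in shortcuts:
        best = min(best, mesh_hops(src, s) + 1 + mesh_hops(d, dst))
    return best


# Two hypothetical channel assignments over the same physical RF-I bundle.
logical_a = [((0, 0), (9, 9)), ((9, 0), (0, 9))]   # corner-to-corner shortcuts
logical_b = [((0, 5), (9, 5)), ((5, 0), (5, 9))]   # edge-midpoint shortcuts

src, dst = (0, 1), (9, 8)
plain = mesh_hops(src, dst)
with_a = hops_with_shortcuts(src, dst, logical_a)
with_b = hops_with_shortcuts(src, dst, logical_b)
print(f"plain mesh: {plain} hops, logical A: {with_a} hops, logical B: {with_b} hops")
```

The same source and destination see different hop counts under the two assignments, which is why remapping the (s,d) pairs to match an application's traffic pattern matters.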

[Figure: two logical configurations, LOGICAL A and LOGICAL B]

Power Savings

We can thin the baseline mesh links
  From 16B…
  …to 8B
  …to 4B
RF-I makes up the difference in performance while saving overall power!
  RF-I provides bandwidth where most necessary
  Baseline RC wires supply the rest
Over 60% power reduction

[Figure: mesh link widths of 16 bytes, 8 bytes, and 4 bytes; annotation: A requires high bandwidth to communicate with B]

A lot of potential for global interconnects in programmable fabrics

SLIDE 11


Concluding Remarks: A Lot of Opportunities in Power-Efficient Customized Computing

More power-efficient programmable fabrics
  Options to do power gating, voltage and frequency scaling
A powerful, fully automated C/C++ to FPGA compiler
  Taking full advantage of power optimization options in a transparent way
Customization beyond just FPGA fabrics
  Application-specific instruction-set processors (ASIP)
  Application-specific processor networks (ASPN)
More power-efficient programmable (global) interconnects
  E.g., RF-interconnects