SLIDE 1

Customized Computing for Power Efficiency

Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department
http://cadlab.cs.ucla.edu/~cong

There are Many Options to Improve Performance

SLIDE 2

Past Alternatives -- Frequency Scaling

Source: Shekhar Borkar, Intel

Current Alternatives: Parallelization

Source: Shekhar Borkar, Intel

SLIDE 3

Multi-core Processors

Sun UltraSPARC T2 Microprocessor: 8 cores, 64 threads

Tilera TILE64 multi-core processor

Warehouse of Computers

IBM BlueGene/L: No. 1 in the newest Top500

SLIDE 4

But Power Remains a Limiting Factor …

Cost of computing

  • HW acquisition
  • Energy bill
  • Heat removal
  • Space

Power Will Be the Driver for Acceptance of Customized Computing

Source: Shekhar Borkar, Intel

[Figure labels: Parallelization, Customization]

SLIDE 5

UCLA Experience -- Lithography Simulation Acceleration

  • Simulation of the optical imaging process
  • Computationally intensive and quite slow for full-chip simulation
  • Synthesized into a Stratix-II FPGA on the XDI platform using AutoPilot

Experiment Results [FPGA'2008]

  • 15X speedup using a 5 by 5 partitioning over an Opteron at 2.2GHz with 4GB RAM
  • Logic utilization around 25K ALUTs (of which 8K are used in the interface framework rather than the design)
  • Power utilization less than 15W in the FPGA, compared with 86W in the Opteron 248
  • Close to 100X (5.8 x 15) improvement in energy efficiency
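The energy-efficiency figure follows from energy = power x time: the gain is the power ratio multiplied by the speedup. The sketch below redoes that arithmetic with the numbers quoted on the slide. The function name is hypothetical, and using exactly 15W for the FPGA is an assumption, since the slide gives only an upper bound; the slide's 5.8 power ratio presumably reflects a measured value slightly below 15W.

```python
def energy_efficiency_gain(speedup, baseline_power_w, accelerator_power_w):
    """Ratio of baseline energy to accelerator energy for the same job.

    Energy = power x time, so the gain is (power ratio) x (speedup).
    """
    power_ratio = baseline_power_w / accelerator_power_w
    return power_ratio * speedup


# Figures quoted on the slide: 15X speedup, 86W Opteron, FPGA under 15W.
# Taking 15W as the FPGA power is an assumption (the slide gives a bound only).
gain = energy_efficiency_gain(speedup=15.0, baseline_power_w=86.0,
                              accelerator_power_w=15.0)
print(f"energy-efficiency gain: about {gain:.0f}X")
```

Because 15W is an upper bound on FPGA power, the value computed this way is a lower bound on the gain, which is consistent with the slide's "close to 100X".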

SLIDE 6

A Lot More is Needed for Power-Efficient Customized Computing

  • More power-efficient programmable fabrics
      • Capability to do power gating, voltage and frequency scaling
  • A powerful, fully automated C/C++ to FPGA compiler
      • Taking full advantage of various power optimization options in a transparent way
  • Customization beyond just FPGA fabrics
      • Application-specific instruction-set processors (ASIP)
      • Application-specific processor networks (ASPN)
  • More power-efficient programmable (global) interconnects
      • E.g., RF-interconnects

RF-Interconnects -- Power Efficient Programmable (Global) Interconnect Solution

SLIDE 7

Limited RC Wires Bandwidth

  • @ 45nm CMOS Technology
  • Data Rate: 4 Gbit/s
  • fT of 45nm CMOS can be as high as 240GHz
  • Baseband signal bandwidth only about 4GHz
  • 98.4% of available bandwidth is wasted
  • Open Question: How to take advantage of full-bandwidth of modern CMOS?
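The "wasted bandwidth" figure is the share of the transistor's usable frequency range that a baseband-only RC wire leaves idle. A minimal sketch of that ratio, assuming fT is taken as the available bandwidth (the function name is hypothetical):

```python
def wasted_bandwidth_fraction(baseband_ghz, f_t_ghz):
    """Fraction of the available bandwidth not used by the baseband signal."""
    return 1.0 - baseband_ghz / f_t_ghz


# Slide figures: about 4GHz of baseband bandwidth against an fT of up to 240GHz.
wasted = wasted_bandwidth_fraction(baseband_ghz=4.0, f_t_ghz=240.0)
print(f"unused bandwidth: {wasted:.1%}")
```

With exactly 4GHz and 240GHz this gives a little over 98%, in line with the figure on the slide; the small difference presumably comes from rounding in the slide's inputs.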

[Figure: measured VCO output spectrum, Pout (dBm) vs. Frequency (GHz), 323.038 to 324.0 GHz]

UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008)

CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer and driven with an 80 GHz synthesizer local oscillator. The mixing frequency is (fVCO - 4*fLO) = fIF, or fVCO - 4*(80 GHz) = 3.5 GHz, yielding fVCO = 323.5 GHz!

On-Wafer VCO Test Setup at JPL

CMOS VCO designed by Frank Chang’s group at UCLA, fabricated in 90nm process

323.5GHz VCO

*Huang, D., LaRocca T., Chang, M.-C. F., “324GHz CMOS Frequency Generator Using Linear Superposition Technique,” IEEE International Solid-State Circuits Conference (ISSCC), 476-477, (Feb 2008) San Francisco, CA

SLIDE 8

Multiband RF-Interconnect

  • In TX, each mixer up-converts individual baseband streams into specific frequency band (or channel)
  • N different data streams (N=6 in exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
  • In RX, individual signals are down-converted by mixer, and recovered after low-pass filter
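The TX/RX steps above can be sketched as a toy frequency-division simulation: each bit stream modulates its own carrier, the carriers are summed on one shared line, and each receiver mixes with its carrier and averages over a bit period as a crude low-pass filter. This is an illustrative sketch, not the circuit from the talk. Everything here is an assumption: the three streams (the slide's figure uses N=6), the band choices, the sample rate, and the simple +1/-1 amplitude modulation.

```python
import math

SAMPLES_PER_BIT = 64  # hypothetical simulation resolution


def carrier(cycles_per_bit, sample_index):
    """Carrier sample; the carrier completes a whole number of cycles per bit."""
    return math.cos(2 * math.pi * cycles_per_bit * sample_index / SAMPLES_PER_BIT)


def transmit(streams, bands):
    """Up-convert each bit stream onto its own band and sum them on one line."""
    n_bits = len(streams[0])
    line = []
    for b in range(n_bits):
        for s in range(SAMPLES_PER_BIT):
            line.append(sum((1.0 if bits[b] else -1.0) * carrier(k, s)
                            for bits, k in zip(streams, bands)))
    return line


def receive(line, band):
    """Down-convert with a mixer at `band`, then sum over each bit period.

    The per-bit sum acts as a crude low-pass filter that rejects the
    other bands, because the carriers are orthogonal over one bit period.
    """
    n_bits = len(line) // SAMPLES_PER_BIT
    bits = []
    for b in range(n_bits):
        mixed = sum(line[b * SAMPLES_PER_BIT + s] * carrier(band, s)
                    for s in range(SAMPLES_PER_BIT))
        bits.append(1 if mixed > 0 else 0)
    return bits


streams = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0]]  # three baseband streams
bands = [4, 8, 12]  # carrier cycles per bit period, one band per stream
shared_line = transmit(streams, bands)
recovered = [receive(shared_line, k) for k in bands]
print(recovered == streams)
```

Choosing carriers with a whole number of cycles per bit keeps the bands orthogonal, so the streams separate cleanly even though they share one medium.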

[Figure: Signal Spectrum, signal power per frequency band]

Advantages of RF-Interconnect (RF-I)

  • Latency – speed-of-light data transmission
  • Bandwidth – high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
  • Area – avoid extensive use of repeaters
  • Energy – low overall energy per bit
  • Reconfigurability – efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas

SLIDE 9

Simple RF-I Topology

  • Four NoC Components
  • Tunable Tx/Rx's
  • Arbitrary topologies
  • Arbitrary bandwidths

[Figure: NoC components (C) with Tx/Rx on an RF-I transmission line bundle; virtual topologies shown: Pipeline/Ring, Bus, Multicast, Fully Connected, Crossbar]

One physical topology can be configured to many virtual topologies

RF-I for Multi-Core On-Chip Communication [HPCA'2008, MICRO'2008]

  • 10x10 mesh of pipelined routers
      • NoC runs at 2GHz
      • XY routing
  • 64 4GHz 3-wide processor cores
      • Labeled aqua
      • 8KB L1 Data Cache
      • 8KB L1 Instruction Cache
  • 32 L2 Cache Banks
      • Labeled pink
      • 256KB each
      • Organized as shared NUCA cache
  • 4 Main Memory Interfaces
      • Labeled green
  • RF-I transmission line bundle
      • Black thick line spanning mesh
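XY routing on a mesh sends a packet along the X dimension until it reaches the destination column, then along Y. A minimal sketch of that rule follows; the coordinate convention and the function name are assumptions for illustration, not details from the cited papers.

```python
def xy_route(src, dst):
    """Return the list of routers visited from src to dst under XY routing.

    Routers are (x, y) tuples; the packet first corrects x, then y.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    while x != dst_x:          # travel along the X dimension first
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:          # then along the Y dimension
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


# Corner-to-corner traffic on the 10x10 mesh (coordinates 0..9 assumed).
route = xy_route((0, 0), (9, 9))
print(len(route) - 1, "hops")
```

The hop count under XY routing always equals the Manhattan distance between the two routers, which is why long cross-chip transfers are the natural candidates for RF-I shortcuts.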

SLIDE 10

RF-I Logical Organization

  • Logically:
      • RF-I behaves as set of N express channels
      • Each channel assigned to src, dest router pair (s,d)
  • Reconfigured by:
      • remapping shortcuts to match needs of different applications

[Figure: two shortcut configurations, LOGICAL A and LOGICAL B]
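The effect of remapping shortcuts can be sketched with a simple hop-count model: a packet either uses the mesh alone or takes a detour through one express channel. The cost model (a shortcut counts as one hop, at most one shortcut per packet) and the two example configurations are assumptions for illustration only.

```python
def mesh_hops(src, dst):
    """Hop count on the baseline mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hops_with_shortcuts(src, dst, shortcuts):
    """Best hop count when at most one RF-I express channel may be used.

    `shortcuts` is a list of (s, d) router pairs; each channel is
    one-directional from s to d and is modeled as a single hop.
    """
    best = mesh_hops(src, dst)
    for s, d in shortcuts:
        via = mesh_hops(src, s) + 1 + mesh_hops(d, dst)
        best = min(best, via)
    return best


# Two hypothetical shortcut mappings on a 10x10 mesh.
config_a = [((0, 0), (9, 9))]
config_b = [((0, 9), (9, 0))]

src, dst = (0, 1), (9, 8)
print(hops_with_shortcuts(src, dst, config_a))
print(hops_with_shortcuts(src, dst, config_b))
```

The same physical transmission line serves both mappings; only the assignment of channels to router pairs changes, so a mapping that helps one traffic pattern can be swapped for another when the application changes.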

Power Savings

  • We can thin the baseline mesh links
      • From 16B…
      • …to 8B
      • …to 4B
  • RF-I makes up the difference in performance while saving overall power!
      • RF-I provides bandwidth where most necessary
      • Baseline RC wires supply the rest
  • Over 60% power reduction

[Figure: mesh link widths of 16 bytes, 8 bytes, and 4 bytes; A requires high bw to communicate w/ B]

A lot of potential for global interconnects in programmable fabrics

SLIDE 11

Concluding Remarks -- A Lot of Opportunities in Power-Efficient Customized Computing

  • More power-efficient programmable fabrics
      • Options to do power gating, voltage and frequency scaling
  • A powerful, fully automated C/C++ to FPGA compiler
      • Taking full advantage of power optimization options in a transparent way
  • Customization beyond just FPGA fabrics
      • Application-specific instruction-set processors (ASIP)
      • Application-specific processor networks (ASPN)
  • More power-efficient programmable (global) interconnects
      • E.g., RF-interconnects