Customized Computing for Power Efficiency
Jason Cong
cong@cs.ucla.edu
UCLA Computer Science Department
http://cadlab.cs.ucla.edu/~cong
Page 2
Past Alternatives: Frequency Scaling
Source : Shekhar Borkar, Intel
Current Alternatives: Parallelization
Source : Shekhar Borkar, Intel
Page 3
Multi-core Processors
- Sun UltraSPARC T2 microprocessor: 8 cores, 64 threads
- Tilera TILE64 multi-core processor
Warehouse of Computers
- IBM BlueGene/L: No. 1 in the newest Top500
Page 4
But Power Remains a Limiting Factor …
Cost of computing
- HW acquisition
- Energy bill
- Heat removal
- Space
- …
Power Will Be the Driver for Acceptance of Customized Computing
Source : Shekhar Borkar, Intel
[Figure labels: Parallelization; Customization]
Page 5
UCLA Experience: Lithography Simulation Acceleration
- Simulation of the optical imaging process
- Computationally intensive and quite slow for full-chip simulation
- Synthesized into a Stratix-II FPGA on the XDI platform using AutoPilot
Experiment Results [FPGA'2008]
- 15X speedup using a 5-by-5 partitioning, over an Opteron at 2.2 GHz with 4 GB RAM
- Logic utilization around 25K ALUTs (of which 8K are used in the interface framework rather than the design)
- Power consumption below 15 W on the FPGA, compared with 86 W on the Opteron 248
- Close to 100X (5.8 x 15) improvement in energy efficiency
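The energy-efficiency figure follows from energy = power x time: the FPGA draws less power and also finishes sooner. Below is a minimal sketch of that arithmetic using only the numbers quoted on the slide. The function name is illustrative, and treating the 15 W bound as the actual FPGA power is an assumption, so the result is a lower bound.

```python
def energy_efficiency_gain(speedup, baseline_power_w, accelerator_power_w):
    """Ratio of baseline energy to accelerator energy for the same job.

    Energy = power x time, and the accelerator finishes `speedup` times
    sooner, so the gain is speedup x (baseline power / accelerator power).
    """
    return speedup * baseline_power_w / accelerator_power_w


# Slide figures: 15X speedup, 86 W (Opteron 248), under 15 W (FPGA).
# Assumption: 15 W is used as the FPGA power, which understates the gain.
gain = energy_efficiency_gain(speedup=15, baseline_power_w=86, accelerator_power_w=15)
print(f"energy-efficiency gain of at least {gain:.0f}x")
```

With exactly 15 W the gain comes out near 86X; the slide's power ratio of 5.8 corresponds to an FPGA power slightly below 15 W and gives 87X, which the slide rounds to "close to 100X".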
Page 6
A Lot More is Needed for Power-Efficient Customized Computing
- More power-efficient programmable fabrics
  - Capability to do power gating, voltage and frequency scaling
- A powerful, fully automated C/C++ to FPGA compiler
  - Taking full advantage of various power optimization options in a transparent way
- Customization beyond just FPGA fabrics
  - Application-specific instruction-set processors (ASIP)
  - Application-specific processor networks (ASPN)
- More power-efficient programmable (global) interconnects
  - E.g., RF-interconnects
RF-Interconnects: Power-Efficient Programmable (Global) Interconnect Solution
Page 7
Limited RC Wires Bandwidth
- @ 45nm CMOS Technology
  - Data Rate: 4 Gbit/s
  - f_T of 45nm CMOS can be as high as 240GHz
  - Baseband signal bandwidth only about 4GHz
  - 98.4% of available bandwidth is wasted
- Open Question: How to take advantage of the full bandwidth of modern CMOS?
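The wasted-bandwidth figure can be checked with a one-line calculation. The sketch below assumes that the "available bandwidth" is simply the 240GHz f_T and that the "used bandwidth" is the 4GHz baseband; the slide does not state its exact inputs, and the function name is illustrative.

```python
def unused_bandwidth_fraction(used_ghz, available_ghz):
    """Fraction of the available bandwidth that carries no signal."""
    return 1.0 - used_ghz / available_ghz


# Assumption: available bandwidth = f_T (240 GHz), used = baseband (4 GHz).
fraction = unused_bandwidth_fraction(used_ghz=4.0, available_ghz=240.0)
print(f"unused bandwidth: {fraction:.1%}")
```

Under these assumptions the unused share is a little over 98%, in line with the figure quoted on the slide.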
[Figure: output power Pout (dBm) versus frequency (GHz), 323.038 to 324.0 GHz]
UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008)
CMOS Voltage Controlled Oscillator, measured with a subharmonic mixer and driven with an 80 GHz synthesizer local oscillator. The mixing frequency is $f_{VCO} - 4 f_{LO} = f_{IF}$, or $f_{VCO} - 4 \times (80\ \text{GHz}) = 3.5\ \text{GHz}$, yielding $f_{VCO} = 323.5\ \text{GHz}$!
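The measurement relation above can be rearranged to recover the oscillator frequency from the local-oscillator and intermediate frequencies. This is a small sketch of that rearrangement; the function and parameter names are illustrative, and the harmonic number 4 is taken from the relation on the slide.

```python
def vco_frequency_ghz(lo_ghz, if_ghz, harmonic=4):
    """Solve f_VCO - harmonic * f_LO = f_IF for f_VCO (all values in GHz)."""
    return harmonic * lo_ghz + if_ghz


# Slide values: 80 GHz local oscillator, 3.5 GHz intermediate frequency.
f_vco = vco_frequency_ghz(lo_ghz=80.0, if_ghz=3.5)
print(f"f_VCO = {f_vco} GHz")
```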
On-Wafer VCO Test Setup at JPL
CMOS VCO designed by Frank Chang's group at UCLA, fabricated in a 90nm process
[Figure label: 323.5GHz VCO]
*Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), 476-477, (Feb 2008) San Francisco, CA
Page 8
Multiband RF-Interconnect
- In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
- N different data streams (N=6 in the exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
- In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
[Figure: signal spectrum; axis label: Signal Power]
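The TX/RX scheme described in the bullets above is a form of frequency-division multiplexing. The following is a toy, noiseless sketch of the idea, not the circuit from the slides: the +1/-1 (BPSK-style) symbol mapping, the carrier choices in `CARRIER_CYCLES`, and the per-bit averaging used in place of an analog low-pass filter are all assumptions made for illustration.

```python
import math

SAMPLES_PER_BIT = 64
# Hypothetical carriers: whole cycles per bit period, one per band (N = 6).
CARRIER_CYCLES = [4, 6, 8, 10, 12, 14]


def carrier(cycles, n):
    """Carrier sample n within one bit period."""
    return math.cos(2.0 * math.pi * cycles * n / SAMPLES_PER_BIT)


def transmit(streams):
    """Up-convert each bit stream onto its own carrier and sum on the shared line."""
    num_bits = len(streams[0])
    line = []
    for b in range(num_bits):
        for n in range(SAMPLES_PER_BIT):
            sample = 0.0
            for cycles, bits in zip(CARRIER_CYCLES, streams):
                symbol = 1.0 if bits[b] else -1.0
                sample += symbol * carrier(cycles, n)
            line.append(sample)
    return line


def receive(line, band):
    """Down-convert one band, then sum over each bit period (acts as a low-pass filter)."""
    cycles = CARRIER_CYCLES[band]
    bits = []
    for start in range(0, len(line), SAMPLES_PER_BIT):
        acc = sum(line[start + n] * carrier(cycles, n) for n in range(SAMPLES_PER_BIT))
        bits.append(1 if acc > 0 else 0)
    return bits


streams = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
line = transmit(streams)
recovered = [receive(line, band) for band in range(len(streams))]
print(recovered == streams)
```

Because each carrier completes a whole number of cycles per bit period, the other bands sum to zero at each receiver, so all six streams share the line at once and the aggregate rate is six times the per-band rate.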
Advantages of RF-Interconnect (RF-I)
- Latency: speed-of-light data transmission
- Bandwidth: high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
- Area: avoid extensive use of repeaters
- Energy: low overall energy per bit
- Reconfigurability: efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas
Page 9
Simple RF-I Topology
- Four NoC Components
- Tunable Tx/Rx's
- Arbitrary topologies
- Arbitrary bandwidths
[Figure: NoC components (C) connected through Tx/Rx to an RF-I transmission line bundle; panel titles: Pipeline/Ring, Bus, Multicast, Fully Connected, Crossbar]
One physical topology can be configured to many virtual topologies
RF-I for Multi-Core On-Chip Communication [HPCA'2008, MICRO'2008]
- 10x10 mesh of pipelined routers
  - NoC runs at 2GHz
  - XY routing
- 64 4GHz 3-wide processor cores
  - Labeled aqua
  - 8KB L1 Data Cache
  - 8KB L1 Instruction Cache
- 32 L2 Cache Banks
  - Labeled pink
  - 256KB each
  - Organized as shared NUCA cache
- 4 Main Memory Interfaces
  - Labeled green
- RF-I transmission line bundle
  - Black thick line spanning mesh
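XY routing on the baseline mesh can be sketched as follows: a packet first travels along the X dimension until its column matches the destination, then along Y. The coordinate convention, the function name, and the idea that each router hosts exactly one component (64 cores + 32 L2 banks + 4 memory interfaces = 100 = 10x10) are assumptions made for illustration.

```python
MESH_SIZE = 10  # 10x10 mesh of routers, as on the slide


def xy_route(src, dst):
    """Return the routers visited from src to dst, moving along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# Assumption: one component per router, so the counts should fill the mesh.
components = 64 + 32 + 4
print(components == MESH_SIZE * MESH_SIZE)

# Corner-to-corner route: the longest path on the mesh.
corner_path = xy_route((0, 0), (MESH_SIZE - 1, MESH_SIZE - 1))
print(f"hops corner to corner: {len(corner_path) - 1}")
```

The corner-to-corner case shows why long mesh routes are costly: every hop passes through another pipelined router.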
Page 10
RF-I Logical Organization
- Logically:
  - RF-I behaves as a set of N express channels
  - Each channel assigned to a src, dest router pair (s,d)
- Reconfigured by:
  - remapping shortcuts to match needs of different applications
[Figure panels: LOGICAL A, LOGICAL B]
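The effect of assigning express channels to (s,d) router pairs can be illustrated with hop counts. This is a minimal sketch under several assumptions that are not in the slides: a shortcut is one-directional, it costs one hop, a route uses at most one shortcut, and the two shortcut sets below are hypothetical rather than the configurations shown in the figure.

```python
def mesh_hops(src, dst):
    """Hop count on the plain mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hops_with_shortcuts(src, dst, shortcuts):
    """Best hop count when at most one shortcut (s, d) may be used as a single hop."""
    best = mesh_hops(src, dst)
    for s, d in shortcuts:
        via = mesh_hops(src, s) + 1 + mesh_hops(d, dst)
        best = min(best, via)
    return best


# Two hypothetical mappings of the express channels onto router pairs.
config_one = [((0, 0), (9, 9)), ((9, 0), (0, 9))]
config_two = [((0, 5), (9, 5))]

src, dst = (0, 0), (9, 9)
print(mesh_hops(src, dst))
print(hops_with_shortcuts(src, dst, config_one))
print(hops_with_shortcuts(src, dst, config_two))
```

Remapping the same channels from one set of pairs to another changes which traffic benefits, which is the sense in which the shortcuts are matched to an application's needs.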
Power Savings
- We can thin the baseline mesh links
  - From 16B…
  - …to 8B
  - …to 4B
- RF-I makes up the difference in performance while saving overall power!
  - RF-I provides bandwidth where most necessary
  - Baseline RC wires supply the rest
- Over 60% power reduction
[Figure: mesh link widths of 16 bytes, 8 bytes, and 4 bytes; annotation: A requires high bandwidth to communicate with B]
A lot of potential for global interconnects in programmable fabrics
Page 11
Concluding Remarks: A Lot of Opportunities in Power-Efficient Customized Computing
- More power-efficient programmable fabrics
  - Options to do power gating, voltage and frequency scaling
- A powerful, fully automated C/C++ to FPGA compiler
  - Taking full advantage of power optimization options in a transparent way
- Customization beyond just FPGA fabrics
  - Application-specific instruction-set processors (ASIP)
  - Application-specific processor networks (ASPN)
- More power-efficient programmable (global) interconnects
  - E.g., RF-interconnects