Customized Computing for Power Efficiency

Jason Cong, cong@cs.ucla.edu
UCLA Computer Science Department
http://cadlab.cs.ucla.edu/~cong


  1. Customized Computing for Power Efficiency
     There are Many Options to Improve Performance

  2. Past Alternatives -- Frequency Scaling
     Source: Shekhar Borkar, Intel
     Current Alternatives: Parallelization
     Source: Shekhar Borkar, Intel

  3. Multi-core Processors
     • Sun UltraSPARC T2 Microprocessor: 8 cores, 64 threads
     • Tilera TILE64 multi-core Processor
     Warehouse of Computers
     • IBM BlueGene/L: No.1 in the newest Top500

  4. But Power Remains a Limiting Factor …
     Cost of computing
     • HW acquisition
     • Energy bill
     • Heat removal
     • Space
     • …
     Power Will Be the Driver for Acceptance of Customized Computing
     [Figure labels: Parallelization, Customization. Source: Shekhar Borkar, Intel]

  5. UCLA Experience -- Lithography Simulation Acceleration
     • Simulation of the optical imaging process
       - Computationally intensive and quite slow for full-chip simulation
     • Synthesized into a Stratix-II FPGA on the XDI platform using AutoPilot
     • Experiment results [FPGA'2008]
       - 15X speedup using a 5 by 5 partitioning over an Opteron (2.2 GHz, 4 GB RAM)
       - Logic utilization around 25K ALUTs (of which 8K are used in the interface framework rather than the design)
       - Power consumption less than 15 W in the FPGA, compared with 86 W in the Opteron 248
       - Close to 100X (5.8 x 15) improvement in energy efficiency
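
The "5.8 x 15" figure above follows from energy = power x time. A minimal sketch of that arithmetic, using only the numbers quoted on the slide. The function name `energy_efficiency_gain` is a hypothetical helper, and 15 W is treated as an upper bound on FPGA power, so the result is a lower bound:

```python
def energy_efficiency_gain(speedup, cpu_power_w, fpga_power_w):
    """Ratio of CPU energy to FPGA energy for the same job.

    Energy = power * time, and the FPGA finishes `speedup` times sooner,
    so the ratio is (cpu_power / fpga_power) * speedup.
    """
    return (cpu_power_w / fpga_power_w) * speedup


# Slide figures: 15X speedup, 86 W Opteron, "less than 15 W" FPGA.
gain = energy_efficiency_gain(speedup=15, cpu_power_w=86, fpga_power_w=15)
print(f"energy-efficiency gain: at least {gain:.0f}x")
```

With exactly 15 W the power ratio is about 5.7; the slide's 5.8 corresponds to an FPGA power slightly under 15 W, which is how the product lands "close to 100X".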

  6. A Lot More is Needed for Power-Efficient Customized Computing
     • More power-efficient programmable fabrics
       - Capability to do power gating, voltage and frequency scaling
     • A powerful, fully automated C/C++ to FPGA compiler
       - Taking full advantage of various power optimization options in a transparent way
     • Customization beyond just FPGA fabrics
       - Application-specific instruction-set processors (ASIP)
       - Application-specific processor networks (ASPN)
     • More power-efficient programmable (global) interconnects
       - E.g., RF-interconnects
     RF-Interconnects -- Power Efficient Programmable (Global) Interconnect Solution

  7. Limited RC Wires Bandwidth
     • @ 45nm CMOS technology
       - Data rate: 4 Gbit/s
       - f_T of 45nm CMOS can be as high as 240GHz
       - Baseband signal bandwidth only about 4GHz
       - 98.4% of available bandwidth is wasted
     • Open question: how to take advantage of the full bandwidth of modern CMOS?
     UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008)
     [Figure: measured spectrum of the 323.5GHz VCO, Pout (dBm) vs. Frequency (GHz); on-wafer VCO test setup at JPL]
     • CMOS VCO designed by Frank Chang's group at UCLA, fabricated in a 90nm process
     • CMOS voltage-controlled oscillator, measured with a subharmonic mixer and driven with an 80 GHz synthesizer local oscillator. The mixing frequency is (f_VCO - 4*f_LO) = f_IF, or f_VCO - 4*(80 GHz) = 3.5 GHz, yielding f_VCO = 323.5 GHz!
     * Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), 476-477 (Feb 2008), San Francisco, CA
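
The two calculations on this slide can be checked in a few lines. This is only a sketch of the arithmetic; the function names are hypothetical, and the inputs are the rounded figures quoted on the slide (4 GHz baseband, 240 GHz f_T, 80 GHz local oscillator, 3.5 GHz intermediate frequency, 4th-harmonic mixing):

```python
def vco_frequency_ghz(f_lo_ghz, f_if_ghz, harmonic=4):
    """Solve f_VCO - harmonic * f_LO = f_IF for f_VCO."""
    return harmonic * f_lo_ghz + f_if_ghz


def unused_bandwidth_fraction(baseband_ghz, f_t_ghz):
    """Fraction of the transistor's f_T that a baseband signal leaves unused."""
    return 1.0 - baseband_ghz / f_t_ghz


f_vco = vco_frequency_ghz(f_lo_ghz=80.0, f_if_ghz=3.5)
unused = unused_bandwidth_fraction(baseband_ghz=4.0, f_t_ghz=240.0)
print(f"f_VCO = {f_vco} GHz")
print(f"unused bandwidth = {unused:.1%}")
```

With these rounded inputs the unused fraction comes out slightly above 98%, in line with the figure quoted on the slide.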

  8. Multiband RF-Interconnect
     [Figure: signal power vs. signal spectrum for the multiplexed bands]
     • In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
     • N different data streams (N=6 in the exemplary figure) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
     • In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
     Advantages of RF-Interconnect (RF-I)
     • Latency: speed-of-light data transmission
     • Bandwidth: high aggregate data rate through simultaneous transmissions on multiple bands of RF modulated signals
     • Area: avoid extensive use of repeaters
     • Energy: low overall energy per bit
     • Reconfigurability: efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas
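
The up-convert / share / down-convert scheme described above can be illustrated with an idealized discrete-time model. This is a sketch under stated assumptions, not the RF-I circuit: bits are mapped to +1/-1 levels, the carrier frequencies (in cycles per bit period) and the sampling rate are made up for illustration, and summing over one bit period stands in for the low-pass filter. The names `transmit` and `receive` are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64  # assumed sampling rate, in samples per bit period


def transmit(streams, carriers):
    """Up-convert each bit stream onto its own carrier and sum them on one line."""
    n_bits = len(streams[0])
    line = []
    for n in range(n_bits * SAMPLES_PER_BIT):
        t = n / SAMPLES_PER_BIT            # time measured in bit periods
        bit = n // SAMPLES_PER_BIT
        sample = 0.0
        for bits, fc in zip(streams, carriers):
            level = 1.0 if bits[bit] else -1.0                 # baseband symbol
            sample += level * math.cos(2 * math.pi * fc * t)   # TX mixer
        line.append(sample)
    return line


def receive(line, fc):
    """Down-convert one band and recover its bits."""
    bits = []
    for bit in range(len(line) // SAMPLES_PER_BIT):
        acc = 0.0
        for k in range(SAMPLES_PER_BIT):
            n = bit * SAMPLES_PER_BIT + k
            t = n / SAMPLES_PER_BIT
            acc += line[n] * math.cos(2 * math.pi * fc * t)    # RX mixer
        # Summing over the bit period plays the role of the low-pass filter:
        # the other bands average out, leaving only this band's symbol.
        bits.append(1 if acc > 0 else 0)
    return bits


# N = 6 bands, as in the slide's example; carrier values are illustrative.
carriers = [4, 6, 8, 10, 12, 14]
streams = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
line = transmit(streams, carriers)
recovered = [receive(line, fc) for fc in carriers]
print("all streams recovered:", recovered == streams)
```

The carriers are chosen with a whole number of cycles per bit period so that the bands are mutually orthogonal over each bit; a real design would instead rely on band spacing and filter roll-off.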

  9. Simple RF-I Topology
     [Figure: four NoC components (C), each with a tunable Tx/Rx, attached to an RF-I transmission line bundle]
     • Four NoC components
     • Tunable Tx/Rx's
     • Arbitrary topologies
     • Arbitrary bandwidths
     One physical topology can be configured to many virtual topologies: Bus, Multicast, Pipeline/Ring, Crossbar, Fully Connected
     RF-I for Multi-Core On-Chip Communication [HPCA'2008, MICRO'2008]
     • 10x10 mesh of pipelined routers
       - NoC runs at 2GHz
       - XY routing
     • 64 4GHz 3-wide processor cores
       - Labeled aqua
       - 8KB L1 Data Cache
       - 8KB L1 Instruction Cache
     • 32 L2 Cache Banks
       - Labeled pink
       - 256KB each
       - Organized as shared NUCA cache
     • 4 Main Memory Interfaces
       - Labeled green
     • RF-I transmission line bundle
       - Black thick line spanning mesh
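
The slide names XY routing on a 10x10 mesh without spelling it out. A minimal sketch of the usual dimension-ordered scheme (travel along X until the column matches, then along Y), assuming routers are addressed as `(x, y)` with both coordinates in 0..9; the addressing and the function name `xy_route` are assumptions, not taken from the slides:

```python
MESH_SIZE = 10  # 10x10 mesh, as on the slide


def xy_route(src, dst):
    """Return the list of routers visited from src to dst under XY routing."""
    for x, y in (src, dst):
        if not (0 <= x < MESH_SIZE and 0 <= y < MESH_SIZE):
            raise ValueError(f"router {(x, y)} is outside the mesh")
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first resolve the X dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then resolve the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


path = xy_route((0, 0), (2, 1))
print(path)
print("corner-to-corner hops:", len(xy_route((0, 0), (9, 9))) - 1)
```

Because every packet turns at most once, from X to Y, the route between any pair of routers is fixed, which is what makes long corner-to-corner paths a natural target for RF-I shortcuts.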

  10. RF-I Logical Organization
     • Logically:
       - RF-I behaves as a set of N express channels
       - Each channel is assigned to a src, dest router pair (s, d)
     • Reconfigured by:
       - remapping shortcuts to match the needs of different applications
     [Figure: two shortcut mappings, LOGICAL A and LOGICAL B]
     Power Savings
     [Figure: mesh link widths of 16 bytes, 8 bytes, and 4 bytes; A requires high bw to communicate w/ B]
     • We can thin the baseline mesh links
       - From 16B…
       - …to 8B
       - …to 4B
     • RF-I makes up the difference in performance while saving overall power!
       - RF-I provides bandwidth where most necessary
       - Baseline RC wires supply the rest
     • Over 60% power reduction
     A lot of potential for global interconnects in programmable fabrics
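
To make the "express channel per (s, d) pair" idea concrete, here is a small sketch of how a shortcut mapping changes hop counts on the mesh. The cost model is an assumption for illustration only: mesh hops are counted as Manhattan distance, a shortcut counts as a single hop, shortcuts are one-directional, and a route uses at most one shortcut. The function names and the two example mappings are hypothetical.

```python
def mesh_hops(a, b):
    """Hop count between routers a and b on the plain mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hops_with_shortcuts(src, dst, shortcuts):
    """Best hop count when at most one (s, d) express channel may be used."""
    best = mesh_hops(src, dst)
    for s, d in shortcuts:
        via = mesh_hops(src, s) + 1 + mesh_hops(d, dst)  # mesh, shortcut, mesh
        best = min(best, via)
    return best


# Two hypothetical mappings of the same physical RF-I bundle.
logical_a = [((0, 0), (9, 9)), ((9, 0), (0, 9))]
logical_b = [((1, 0), (8, 9)), ((4, 4), (4, 9))]

src, dst = (0, 0), (9, 9)
print("mesh only:", hops_with_shortcuts(src, dst, []))
print("logical A:", hops_with_shortcuts(src, dst, logical_a))
print("logical B:", hops_with_shortcuts(src, dst, logical_b))
```

Reconfiguration in this model is just passing a different shortcut list, which mirrors the slide's point that the same physical channels can be remapped per application.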
