Architecture and Synthesis for Power-Efficient FPGAs
Jason Cong



  1. Architecture and Synthesis for Power-Efficient FPGAs
     - Jason Cong, University of California, Los Angeles (UCLA), cong@cs.ucla.edu
     - Joint work with Deming Chen, Lei He, Fei Li, Yan Lin
     - Partially supported by NSF Grants CCR-0096383 and CCR-0306682, and by Altera under the California MICRO program

  2. Outline
     - Introduction
     - Understanding Power Consumption in FPGAs
     - Architecture Evaluation and Power Optimization
     - Low Power Synthesis
     - Conclusions

  3. Why? FPGA is Known to be Power Inefficient!
     - FPGA consumes 50-100X more power. Source: [Zuchowski, et al, ICCAD02]
     - Why do we care about power optimization for FPGAs?

  4. FPGA Advantages
     - Short TAT (total turnaround time)
     - No or very low NRE

  5. ASICs Become Increasingly Expensive
     - Traditional ASIC designs are facing a rapid increase of NRE and mask-set costs at 90nm and below
     - [Chart: total cost for mask set ($M) and cost per mask ($K) at 250nm, 180nm, 130nm, and 100nm. Source: EETimes]
     - Values from the accompanying table, in slide order:
       - Process (um): 2.0 … 0.8 0.6 0.35 0.25 0.18 0.13 0.10
       - Single mask cost ($K): 1.5 1.5 2.5 4.5 7.5 12 40 60
       - # of masks: 12 12 12 16 20 26 30 34
       - Mask set cost ($K): 12 18 18 30 72 150 312 1,000 2,000

  6. Our Research
     - [Diagram: Power Efficient FPGAs at the center, surrounded by Circuit Design, Fabric Design, Synthesis Tools, and System Design]

  7. Outline
     - Introduction
     - Understanding Power Consumption in FPGAs
     - Architecture Evaluation and Power Optimization
     - Low Power Synthesis
     - Conclusions

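  8. FPGA Architecture
     - [Figure: FPGA fabric with programmable IO, programmable logic, and programmable routing. Each logic block has I inputs, N outputs, and a clock, and contains N BLEs (BLE #1 to BLE #N). Each BLE is a K-input LUT followed by a D flip-flop (FF) driving the BLE output.]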

  9. Evaluation Framework – fpgaEva-LP [Li, et al, FPGA’03]
     - Inputs: BLIF / SLIF netlist and Arch Spec
     - Flow: Logic Optimization (SIS) → Tech-Mapping (RASP) → Timing-Driven Packing (T-VPack) → Placement & Routing (VPR) → BC-Netlist Generator → Power Simulator
     - Outputs: Area, Delay, Power

  10. BC-Netlist Generator
     - Inputs: Mapped Netlist and Layout
     - Flow: Buffer Extraction → Netlist Generation for Logic Clusters → Capacitance Extraction → Delay Calculation → Back-annotation
     - Output: BC-Netlist

  11. Mixed-level Power Model – Overview
     - Dynamic power
       - Switching power
       - Short-circuit power
       - Related to signal transitions
         - Functional switch
         - Glitch
     - Static power
       - Sub-threshold leakage
       - Gate leakage
       - Reverse biased leakage
       - Depending on the input vector
     - Models used per power component and power source:

       | Power component | Logic block | Interconnect & clock |
       |---|---|---|
       | Dynamic | Macro-model | Switch-level model |
       | Static | Macro-model | Macro-model |

  12. Cycle-Accurate Power Simulator
     - Inputs: BC-Netlist, Random Vector Generation
     - Cycle Accurate Power Simulation with Glitch Analysis, using post-layout extracted delay & capacitance and the Mixed-level Power Model
     - Per-cycle energy: $E_{cycle} = \sum_{i \in active} E_a(n_i) + \sum_{j \in idle} E_s(n_j)$
     - All cycles finished? If no, simulate the next cycle; if yes, report Power Values
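The per-cycle energy sum on slide 12 can be illustrated with a minimal Python sketch. This is not the fpgaEva-LP implementation: the function names, node names, and energy values are hypothetical, and each node is given a fixed active and idle energy, whereas the simulator derives them from the mixed-level power model and glitch analysis.

```python
from typing import Dict, List, Set


def cycle_energy(active: Set[str], idle: Set[str],
                 e_active: Dict[str, float], e_static: Dict[str, float]) -> float:
    """Energy of one cycle: active nodes contribute E_a, idle nodes contribute E_s."""
    return sum(e_active[n] for n in active) + sum(e_static[n] for n in idle)


def average_power(cycles: List[Set[str]], nodes: Set[str],
                  e_active: Dict[str, float], e_static: Dict[str, float],
                  clock_period: float) -> float:
    """Average power = total energy over all simulated cycles / total simulated time."""
    total = 0.0
    for active in cycles:
        idle = nodes - active  # every node that does not switch in this cycle
        total += cycle_energy(active, idle, e_active, e_static)
    return total / (len(cycles) * clock_period)


# Hypothetical per-node energies (arbitrary units) and two simulated cycles.
nodes = {"lut0", "lut1", "wire0"}
e_active = {"lut0": 4.0, "lut1": 4.0, "wire0": 8.0}
e_static = {"lut0": 1.0, "lut1": 1.0, "wire0": 2.0}
cycles = [{"lut0", "wire0"}, {"lut1"}]

first = cycle_energy(cycles[0], nodes - cycles[0], e_active, e_static)
power = average_power(cycles, nodes, e_active, e_static, clock_period=10.0)
```

With these made-up numbers the first cycle costs 13 energy units and the second 7, so the average power over two cycles of period 10 is 1.0 energy unit per time unit.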

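  13. Power Breakdown
     - [Pie charts: logic block power, clock power, and interconnect power as shares of total power, for Cluster Size = 12, LUT Size = 4 and for Cluster Size = 12, LUT Size = 6. Interconnect power is 45% in one chart and 59% in the other; the remaining logic block and clock shares are 19%, 15%, 22%, and 40%.]
     - Interconnect power is dominant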

  14. Power Breakdown (cont’d)
     - [Pie charts: leakage power versus dynamic power, for Cluster Size = 12, LUT Size = 4 and for Cluster Size = 12, LUT Size = 6. Leakage shares are 42% and 52%; dynamic shares are 48% and 58%.]
     - Leakage power becomes increasingly important (100nm)

  15. Outline
     - Introduction
     - Understanding Power Consumption in FPGAs
     - Architecture Evaluation and Power Optimization
       - Architecture Parameter Selection
       - Dual-Vdd/Dual-Vt FPGA Architecture
     - Low Power Synthesis with Dual-Vdd
     - Conclusion

  16. Total Power along LUT and Cluster Size Changes
     - [Chart: total FPGA power (normalized geometric mean) versus LUT size (3 to 7), one curve each for cluster sizes 4, 6, 8, 10, and 12]
     - Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches

  17. Routing Architecture Evaluation

  18. Architecture of Low-power and High-performance Applications

       | Application | Best FPGA architecture | Energy (E) | Delay (t) | E³t | Et³ |
       |---|---|---|---|---|---|
       | Low-power (E³t) | Cluster size 10, LUT size 4, wire segment length 4, 25% buffered routing switches | 0.9653 | 0.9904 | 0.8909 | 1.0080 |
       | High-performance (Et³) | Cluster size 12, LUT size 4, wire segment length 4, 100% buffered routing switches | 1.0502 | 0.8865 | 1.0268 | 0.7865 |

     - Arch. parameter selection leads to a 10% power/delay trade-off
     - Uniform FPGA fabrics provide a limited power-performance tradeoff
     - Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
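The two selection metrics on slide 18 weight energy and delay differently: E³t favors low energy, Et³ favors low delay. A minimal sketch of picking an architecture under each metric follows. The candidate names and their normalized energy/delay values are hypothetical, not the slide's measurements.

```python
from typing import Dict, Tuple


def weighted_metric(energy: float, delay: float, e_exp: int, t_exp: int) -> float:
    """Energy-delay product with exponents, e.g. (3, 1) for E^3*t or (1, 3) for E*t^3."""
    return energy ** e_exp * delay ** t_exp


def best_architecture(candidates: Dict[str, Tuple[float, float]],
                      e_exp: int, t_exp: int) -> str:
    """Return the candidate name that minimizes the chosen metric."""
    return min(candidates,
               key=lambda name: weighted_metric(*candidates[name], e_exp, t_exp))


# Hypothetical candidates: name -> (normalized energy, normalized delay).
candidates = {
    "arch_A": (0.97, 0.99),  # slightly lower energy, nearly baseline delay
    "arch_B": (1.05, 0.89),  # higher energy, clearly lower delay
}

low_power_choice = best_architecture(candidates, e_exp=3, t_exp=1)
high_perf_choice = best_architecture(candidates, e_exp=1, t_exp=3)
```

Cubing one factor makes a small difference in that factor outweigh a larger difference in the other, which is why the same candidate set yields different winners under the two metrics.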

  19. Outline
     - Introduction
     - Understanding Power Consumption in FPGAs
     - Architecture Evaluation and Power Optimization
       - Architecture Parameter Selection
       - Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
     - Low Power Synthesis with Dual-Vdd
     - Conclusion

  20. Dual-Vdd LUT Design
     - Dual-Vdd technique makes use of the timing slack to reduce power
       - VddH devices on critical paths, for performance
       - VddL devices on non-critical paths, for power
       - Assume uniform Vdd for one LUT
     - Threshold voltage Vt should be adjusted carefully for different Vdd levels
       - To compensate for the delay increase
       - To avoid an excessive leakage power increase
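As a rough worked example of why moving non-critical logic to VddL saves power, consider the standard first-order CMOS switching-power relation (a textbook model, not taken from the slides). Here $\alpha$ is the average number of transitions per cycle, $C$ the switched capacitance, and $f$ the clock frequency, all assumed unchanged between the two supply levels:

$$P_{sw} = \tfrac{1}{2}\,\alpha\,C\,V_{dd}^{2}\,f, \qquad \frac{P_{sw}(V_{ddL})}{P_{sw}(V_{ddH})} = \left(\frac{V_{ddL}}{V_{ddH}}\right)^{2}$$

For the 1.3v/1.0v pair that appears on slide 26, the ratio is $(1.0/1.3)^{2} \approx 0.59$, so a block moved to VddL would draw roughly 41% less switching power under this model. Leakage and any level-conversion overhead are not captured by this estimate.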

  21. Vdd/Vt-Scaling for LUTs
     - Three scaling schemes
       - Constant-Vt scaling
       - Fixed-Vdd/Vt-ratio scaling
       - Constant-leakage scaling
     - Constant-leakage scaling obtains a good tradeoff
       - Useful for both single-Vdd scaling and dual-Vdd design
     - [Charts: delay (ns) and leakage power (uW) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v), with one curve each for constant Vt, fixed-Vdd/Vt-ratio, and constant leakage]

  22. Dual-Vt LUT Design
     - LUT is divided into two parts
       - Part I: configuration cells → high Vt
       - Part II: MUX tree and input buffers → normal Vt (decided by constant-leakage Vdd-scaling)
     - Configuration SRAM cells
       - Content remains unchanged after configuration
       - Read/write delay is not related to FPGA performance
     - Use high Vt ~40% of Vdd
       - Maintain signal integrity
       - Reduce SRAM leakage by 15X and LUT leakage by 2.4X
       - Increase configuration time by 13%

  23. Pre-Defined Dual-Vt Fabric
     - Power saving
       - 11.6% for combinational circuits
       - 14.6% for sequential circuits

     Table 1: Combinational circuits

       | Circuit | arch-SVST (Single Vt) power (watt) | arch-SVDT (Dual Vt) power saving |
       |---|---|---|
       | alu4 | 0.0798 | 8.5% |
       | apex2 | 0.108 | 9.3% |
       | apex4 | 0.0536 | 12.3% |
       | des | 0.234 | 10.7% |
       | ex1010 | 0.179 | 17.3% |
       | ex5p | 0.059 | 11.6% |
       | misex3 | 0.0753 | 9.4% |
       | pdc | 0.256 | 14.7% |
       | seq | 0.0927 | 9.4% |
       | spla | 0.180 | 12.4% |
       | Avg. | | 11.6% |

     Table 2: Sequential circuits

       | Circuit | arch-SVST (Single Vt) power (watt) | arch-SVDT (Dual Vt) power saving |
       |---|---|---|
       | bigkey | 0.148 | 12.3% |
       | clma | 0.632 | 14.8% |
       | diffeq | 0.0391 | 19.7% |
       | dsip | 0.134 | 14.5% |
       | elliptic | 0.140 | 16.3% |
       | frisc | 0.190 | 19.2% |
       | s298 | 0.0736 | 13.4% |
       | s38417 | 0.307 | 11.7% |
       | s38484 | 0.261 | 10.2% |
       | tseng | 0.0351 | 14.0% |
       | Avg. | | 14.6% |

  24. Dual-Vdd FPGA Fabric
     - Granularity: logic block (i.e., cluster of LUTs)
       - Smaller granularity => intuitively more power saving
       - But a larger implementation overhead
     - Layout pattern: pre-defined dual-Vdd pattern
       - Row-based or interleaved pattern
       - Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
     - Interconnect uses uniform VddH
     - [Figure legend: L-block = VddL, H-block = VddH]
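To make the two layout patterns on slide 24 concrete, here is a small sketch that builds a grid of L-blocks and H-blocks with a 2:1 VddL/VddH ratio. The slide does not give the exact placement rule, so the repeating L, L, H period used here (by row for the row-based pattern, by diagonal for the interleaved one) is an assumption, as are the function names.

```python
from typing import List, Tuple

Grid = List[List[str]]


def row_based(rows: int, cols: int) -> Grid:
    """Assumed row-based pattern: rows repeat L, L, H so every third row is VddH."""
    return [["H" if r % 3 == 2 else "L"] * cols for r in range(rows)]


def interleaved(rows: int, cols: int) -> Grid:
    """Assumed interleaved pattern: every third block along each row is VddH,
    shifted by one column per row so H-blocks fall on diagonals."""
    return [["H" if (r + c) % 3 == 2 else "L" for c in range(cols)]
            for r in range(rows)]


def count_blocks(grid: Grid) -> Tuple[int, int]:
    """Return (number of L-blocks, number of H-blocks)."""
    flat = [block for row in grid for block in row]
    return flat.count("L"), flat.count("H")


row_counts = count_blocks(row_based(6, 6))
interleaved_counts = count_blocks(interleaved(6, 6))

for row in interleaved(3, 6):
    print(" ".join(row))
```

Both patterns give 24 L-blocks and 12 H-blocks on a 6x6 grid. The interleaved variant spreads H-blocks so that every block has a VddH neighbor nearby, while the row-based variant keeps each supply level in contiguous rows.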

  25. Simple Design Flow for Dual-Vdd Fabric
     - Based on the traditional design flow, but with new steps
       - Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
       - Step II: Dual-Vdd assignment based on sensitivity
       - Step III: Timing-driven P & R considering the pre-defined dual-Vdd pattern (modified VPR)
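Step II can be pictured as a greedy pass over the logic blocks. The sketch below is an illustration under stated assumptions and not the authors' algorithm. Sensitivity is taken here to mean power saved per unit of added delay, and a block is moved to VddL only if the critical-path delay stays within a target. The netlist, delays, and power numbers are all hypothetical.

```python
from typing import Dict, List, Set


def critical_path_delay(fanin: Dict[str, List[str]], delay: Dict[str, float]) -> float:
    """Longest path delay through a DAG described by each node's fanin list."""
    arrival: Dict[str, float] = {}

    def visit(node: str) -> float:
        if node not in arrival:
            latest_input = max((visit(p) for p in fanin[node]), default=0.0)
            arrival[node] = latest_input + delay[node]
        return arrival[node]

    return max(visit(n) for n in fanin)


def assign_dual_vdd(fanin: Dict[str, List[str]],
                    delay_h: Dict[str, float], delay_l: Dict[str, float],
                    power_h: Dict[str, float], power_l: Dict[str, float],
                    target: float) -> Set[str]:
    """Greedily move blocks to VddL, most sensitive first, keeping delay <= target."""
    def sensitivity(n: str) -> float:
        added = delay_l[n] - delay_h[n]
        saved = power_h[n] - power_l[n]
        return saved / added if added > 0 else float("inf")

    delay = dict(delay_h)  # start with every block at VddH
    low: Set[str] = set()
    for n in sorted(fanin, key=sensitivity, reverse=True):
        delay[n] = delay_l[n]  # tentatively lower this block's supply
        if critical_path_delay(fanin, delay) <= target:
            low.add(n)
        else:
            delay[n] = delay_h[n]  # timing violated: revert to VddH
    return low


# Hypothetical 4-block netlist: a and b feed c, and c feeds d.
fanin = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
delay_h = {"a": 1.0, "b": 0.4, "c": 1.0, "d": 1.0}
delay_l = {n: 1.5 * d for n, d in delay_h.items()}
power_h = {n: 1.0 for n in fanin}
power_l = {n: 0.6 for n in fanin}

target = critical_path_delay(fanin, delay_h)
low_blocks = assign_dual_vdd(fanin, delay_h, delay_l, power_h, power_l, target)
```

In this example only block b has slack, so it is the only block assigned to VddL. Recomputing the full critical path after each tentative move is simple but quadratic. It also ignores the pre-defined layout pattern, which Step III of the flow handles during placement and routing.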

  26. Comparison Between Vdd-Scaling and Dual-Vdd
     - For high clock frequency, dual-Vdd achieves ~6% total power saving (~18% logic power saving)
     - For low clock frequency, single-Vdd scaling is better
     - Still a large gap between ideal dual-Vdd and the real case
       - Ideal dual-Vdd is the result without the layout pattern constraint
     - [Chart, circuit alu4: power (watt) versus max. clock frequency (MHz, 65 to 125) for arch-SVDT (Vdd scaling), arch-DVDT (ideal case), and arch-DVDT (pre-defined Vdd). Points are labeled with single-Vdd settings (1.5v, 1.3v, 1.0v, 0.9v) and dual-Vdd pairs (1.5v/1.0v, 1.3v/1.0v, 1.3v/0.9v, 1.3v/0.8v, 1.0v/0.9v, 1.0v/0.8v, 0.9v/0.8v).]

  27. Vdd-Programmable Logic Block
     - Power switches for Vdd selection and power gating
     - One-bit control is needed for Vdd selection, but two-bit control for power gating

  28. Experimental Results with Vdd-Programmable Blocks
     - Power vs. performance
     - [Chart, circuit alu4: total power (watt) versus clock frequency (MHz, 65 to 125) for arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), and arch-DV (pre-defined Vdd). Points are labeled with single-Vdd settings (1.5v, 1.3v, 1.0v) and dual-Vdd pairs (1.5v/1.0v, 1.5v/0.8v, 1.3v/0.9v, 1.3v/0.8v, 1.0v/0.9v, 1.0v/0.8v, 0.9v/0.8v).]

  29. Outline
     - Introduction
     - Understanding Power Consumption in FPGAs
     - Architecture Evaluation and Power Optimization
     - Low Power Synthesis
     - Conclusions
