SLIDE 1 Architecture and Synthesis for Power-Efficient FPGAs
Jason Cong University of California, Los Angeles cong@cs.ucla.edu
Partially supported by NSF Grants CCR-0096383 and CCR-0306682, and by Altera under the California MICRO program
UCLA
Joint work with Deming Chen, Lei He, Fei Li, Yan Lin
SLIDE 2
Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions
SLIDE 3 Why? FPGA is Known to be Power Inefficient!
- FPGA consumes 50-100X more power
- Why do we care about power optimization for FPGAs?!
Source: [Zuchowski, et al, ICCAD02]
SLIDE 4
FPGA Advantages
- Short TAT (total turnaround time)
- No or very low NRE
SLIDE 5 ASICs Become Increasingly Expensive
- Traditional ASIC designs are facing a rapid increase of NRE and mask-set costs at 90nm and below
Source: EETimes
[Figure: total cost for mask set ($M) and cost per mask ($K) at 250nm, 180nm, 130nm, and 100nm]

Process (um)            2.0   …   0.8   0.6   0.35  0.25  0.18  0.13   0.10
Single Mask cost ($K)   1.5       1.5   2.5   4.5   7.5   12    40     60
# of Masks              12        12    12    16    20    26    30     34
Mask Set cost ($K)      18        18    30    72    150   312   1,000  2,000
SLIDE 6
Our Research
Power Efficient FPGAs
- Circuit Design
- Fabric Design
- System Design
- Synthesis Tools
SLIDE 7
Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions
SLIDE 8 FPGA Architecture
- Programmable IO
- Programmable Logic
- Programmable Routing
[Figure: logic block with I inputs, N outputs, and a clock, containing BLE #1 to BLE #N; each BLE has a K-input LUT, a D flip-flop, a clock input, and an output]
SLIDE 9 Evaluation Framework – fpgaEva-LP
fpgaEva-LP [Li, et al, FPGA’03]
[Figure: fpgaEva-LP flow. Inputs: BLIF and Arch Spec. Stages: Logic Optimization (SIS), Tech-Mapping (RASP), Timing-Driven Packing (T-VPack), Placement & Routing (VPR), BC-Netlist Generator, Power Simulator. Labeled data: SLIF, BC-Netlist, Delay, Area, Power]
SLIDE 10 BC-Netlist Generator
[Figure: BC-netlist generation flow. Labels: Mapped Netlist, Layout, Buffer Extraction, Netlist Generation for Logic Clusters, Capacitance Extraction, Delay Calculation, Back-annotation, BC-Netlist]
SLIDE 11 Mixed-level Power Model – Overview
- Dynamic power
  - Switching power
  - Short-circuit power
  - Related to signal transitions
    - Functional switch
    - Glitch
Power source   Logic block    Interconnect & clock
Dynamic        Macro-model    Switch-level model
Static         Macro-model    Macro-model
- Static power
  - Sub-threshold leakage
  - Gate leakage
  - Reverse biased leakage
  - Depending on the input vector
SLIDE 12 Cycle-Accurate Power Simulator
- Mixed-level Power Model
- Post-layout extracted delay & capacitance
[Figure: simulation flow. Random Vector Generation and the BC-Netlist feed Cycle Accurate Power Simulation with Glitch Analysis, which repeats until all cycles are finished (No: continue; Yes: output Power Values)]
Energy per cycle:
$E_{cycle} = \sum_{i \in active} E_a(n_i) + \sum_{j \in idle} E_s(n_j)$
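The per-cycle energy on this slide sums an active energy over the nodes that switch in a cycle and a static energy over the nodes that stay idle. A minimal sketch of that bookkeeping follows; the `Node` fields, the energy values, and the `average_power` helper are illustrative assumptions, not the fpgaEva-LP implementation.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    e_active: float  # energy spent in a cycle where the node switches (assumed units)
    e_static: float  # energy spent in a cycle where the node is idle (assumed units)

def cycle_energy(nodes, active_names):
    """Sum active energy over switching nodes and static energy over idle nodes."""
    active = set(active_names)
    return sum(n.e_active if n.name in active else n.e_static for n in nodes)

def average_power(nodes, activity_per_cycle, clock_period):
    """Average power over a simulated trace: total energy / total time."""
    total = sum(cycle_energy(nodes, a) for a in activity_per_cycle)
    return total / (len(activity_per_cycle) * clock_period)

# Hypothetical three-node design and a three-cycle activity trace.
nodes = [Node("lut0", 4.0, 0.5), Node("wire0", 6.0, 0.25), Node("clk", 2.0, 0.25)]
trace = [{"lut0", "clk"}, {"clk"}, {"lut0", "wire0", "clk"}]
print(average_power(nodes, trace, clock_period=1.0))
```

In a real simulator the active set per cycle would come from the glitch-aware simulation of random input vectors rather than from a hand-written trace.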
SLIDE 13 Power Breakdown
- Interconnect power is dominant
[Figure: pie charts of the power breakdown]
- Cluster Size = 12, LUT Size = 4: Logic Block Power 19%, Interconnect Power 59%, Clock Power 22%
- Cluster Size = 12, LUT Size = 6: Logic Block Power 40%, Interconnect Power 45%, Clock Power 15%
SLIDE 14 Power Breakdown (cont’d)
- Leakage power becomes increasingly important (100nm)
[Figure: pie charts of leakage versus dynamic power]
- Cluster Size = 12, LUT Size = 4: Leakage Power 42%, Dynamic Power 58%
- Cluster Size = 12, LUT Size = 6: Leakage Power 52%, Dynamic Power 48%
SLIDE 15 Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture
- Low Power Synthesis with Dual-Vdd
- Conclusion
SLIDE 16 Total Power along LUT and Cluster Size Changes
[Figure: total FPGA power (normalized geometric mean, 1 to 2) versus LUT size (3 to 7), one curve each for cluster sizes 4, 6, 8, 10, and 12]
Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches
SLIDE 17
Routing Architecture Evaluation
SLIDE 18 Architecture of Low-power and High-performance
Applications              Best FPGA architecture                                                                Energy (E)  Delay (t)  E3t     Et3
Low-power (E3t)           Cluster size 10, LUT size 4, wire segment length 4, 25% buffered routing switches     0.9653      0.9904     0.8909  1.0080
High-performance (Et3)    Cluster size 12, LUT size 4, wire segment length 4, 100% buffered routing switches    1.0502      0.8865     1.0268  0.7865

- Arch. parameter selection leads to 10% power/delay trade-off
- Uniform FPGA fabrics provide limited power-performance tradeoff
- Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics
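The two figures of merit on this slide weight energy and delay differently: E3t (energy cubed times delay) favors low power, Et3 (energy times delay cubed) favors performance. A small sketch, assuming the metrics are plain products of the normalized E and t values from the slide; the dictionary keys are illustrative labels. Products of the rounded columns need not reproduce the tabulated metric columns exactly.

```python
def e3t(energy, delay):
    """Low-power metric: energy is weighted cubically."""
    return energy ** 3 * delay

def et3(energy, delay):
    """High-performance metric: delay is weighted cubically."""
    return energy * delay ** 3

# (normalized energy, normalized delay) for the two architectures on the slide.
archs = {
    "low-power": (0.9653, 0.9904),         # cluster 10, LUT 4, 25% buffered switches
    "high-performance": (1.0502, 0.8865),  # cluster 12, LUT 4, 100% buffered switches
}

best_for_power = min(archs, key=lambda name: e3t(*archs[name]))
best_for_speed = min(archs, key=lambda name: et3(*archs[name]))
print(best_for_power, best_for_speed)
```

Each architecture wins under the metric it was selected for, which is the trade-off the slide summarizes.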
SLIDE 19 Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
  - Architecture Parameter Selection
  - Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
- Low Power Synthesis with Dual-Vdd
- Conclusion
SLIDE 20
Dual-Vdd LUT Design
- Dual-Vdd technique makes use of the timing slack to reduce power
  - VddH devices on critical path: performance
  - VddL devices on non-critical paths: power
  - Assume uniform Vdd for one LUT
- Threshold voltage Vt should be adjusted carefully for different Vdd levels
  - To compensate delay increase
  - To avoid excessive leakage power increase
SLIDE 21 Vdd/Vt-Scaling for LUTs
- Three scaling schemes
  - Constant-Vt scaling
  - Fixed-Vdd/Vt-ratio scaling
  - Constant-leakage scaling
[Figure: delay (ns) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed-Vdd/Vt-ratio, and constant leakage scaling]
[Figure: leakage power (uW) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed-Vdd/Vt-ratio, and constant leakage scaling]
- Constant-leakage scaling obtains a good tradeoff
  - Useful for both single-Vdd scaling and dual-Vdd design
SLIDE 22 Dual-Vt LUT Design
- LUT is divided into two parts
  - Part I: configuration cells, high Vt
  - Part II: MUX tree and input buffers, normal Vt (decided by constant-leakage Vdd-scaling)
- Configuration SRAM cells
  - Content remains unchanged after configuration
  - Read/write delay is not related to FPGA performance
- Use high Vt (~40% of Vdd)
  - Maintain signal integrity
  - Reduce SRAM leakage by 15X and LUT leakage by 2.4X
  - Increase configuration time by 13%
SLIDE 23 Pre-Defined Dual-Vt Fabric
- Power saving
  - 11.6% for combinational circuits
  - 14.6% for sequential circuits

Table 1: Combinational circuits
Circuit   arch-SVST (Single Vt) power (watt)   arch-SVDT (Dual Vt) power saving
alu4      0.0798                               8.5%
apex2     0.108                                9.3%
apex4     0.0536                               12.3%
des       0.234                                10.7%
ex1010    0.179                                17.3%
ex5p      0.059                                11.6%
misex3    0.0753                               9.4%
pdc       0.256                                14.7%
seq       0.0927                               9.4%
spla      0.180                                12.4%
Avg.                                           11.6%

Table 2: Sequential circuits
Circuit   arch-SVST (Single Vt) power (watt)   arch-SVDT (Dual Vt) power saving
bigkey    0.148                                12.3%
clma      0.632                                14.8%
diffeq    0.0391                               19.7%
dsip      0.134                                14.5%
elliptic  0.140                                16.3%
frisc     0.190                                19.2%
s298      0.0736                               13.4%
s38417    0.307                                11.7%
s38484    0.261                                10.2%
tseng     0.0351                               14.0%
Avg.                                           14.6%
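The two reported averages can be checked as the arithmetic mean of the per-circuit savings listed on this slide. The sketch below only re-does that arithmetic; treating "Avg." as an unweighted mean is an assumption, although the numbers agree with it.

```python
# Per-circuit power savings (%) of the dual-Vt fabric, copied from the slide.
comb_saving = {"alu4": 8.5, "apex2": 9.3, "apex4": 12.3, "des": 10.7, "ex1010": 17.3,
               "ex5p": 11.6, "misex3": 9.4, "pdc": 14.7, "seq": 9.4, "spla": 12.4}
seq_saving = {"bigkey": 12.3, "clma": 14.8, "diffeq": 19.7, "dsip": 14.5, "elliptic": 16.3,
              "frisc": 19.2, "s298": 13.4, "s38417": 11.7, "s38484": 10.2, "tseng": 14.0}

def mean_saving(table):
    """Unweighted mean of the per-circuit savings."""
    return sum(table.values()) / len(table)

print(round(mean_saving(comb_saving), 1), round(mean_saving(seq_saving), 1))
```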
SLIDE 24 Dual-Vdd FPGA Fabric
- Granularity: logic block (i.e., cluster of LUTs)
  - Smaller granularity => intuitively more power saving
  - But a larger implementation overhead
- Layout pattern: pre-defined dual-Vdd pattern
  - Row-based or interleaved pattern
  - Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)
- Interconnect uses uniform VddH
[Figure legend: L-block: VddL; H-block: VddH]
SLIDE 25
Simple Design Flow for Dual-Vdd Fabric
- Based on traditional design flow, but with new steps
  - Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
  - Step II: Dual-Vdd assignment based on sensitivity
  - Step III: Timing driven P & R considering pre-defined dual-Vdd pattern (modified VPR)
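Step II can be pictured as a greedy pass that moves blocks to VddL as long as the critical-path delay does not grow. The slide does not define the sensitivity measure, so the sketch below assumes one (power saved per unit of added delay) and ignores level converters and the layout pattern; all names and numbers are hypothetical.

```python
def critical_delay(order, fanins, delay):
    """Longest path delay over a DAG given in topological order."""
    arrival = {}
    for v in order:
        arrival[v] = delay[v] + max((arrival[u] for u in fanins.get(v, ())), default=0.0)
    return max(arrival.values())

def assign_dual_vdd(order, fanins, d_high, d_low, p_save):
    """Greedily assign VddL to blocks without exceeding the all-VddH critical delay."""
    delay = dict(d_high)
    target = critical_delay(order, fanins, delay)
    vdd = {v: "H" for v in order}

    def sensitivity(v):
        # Assumed metric: power saved per unit of delay increase (requires d_low > d_high).
        return p_save[v] / (d_low[v] - d_high[v])

    for v in sorted(order, key=sensitivity, reverse=True):
        delay[v] = d_low[v]                      # tentatively lower the supply
        if critical_delay(order, fanins, delay) <= target:
            vdd[v] = "L"                         # slack absorbs the slowdown
        else:
            delay[v] = d_high[v]                 # revert: timing would be violated
    return vdd

# Hypothetical netlist: a -> c -> d is critical, b -> d has slack.
order = ["a", "b", "c", "d"]
fanins = {"c": ["a"], "d": ["c", "b"]}
d_high = {"a": 2.0, "b": 1.0, "c": 2.0, "d": 1.0}
d_low = {"a": 3.0, "b": 1.5, "c": 3.0, "d": 1.5}
p_save = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
print(assign_dual_vdd(order, fanins, d_high, d_low, p_save))
```

Only the off-critical block ends up at VddL here, which matches the intuition that dual-Vdd exploits timing slack.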
SLIDE 26 Comparison Between Vdd-Scaling and Dual-Vdd
- For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)
- For low clock frequency, single-Vdd scaling is better
- Still a large gap between ideal dual-Vdd and real case
  - Ideal dual-Vdd is the result without layout pattern constraint
[Figure: power (watt, 0.03 to 0.09) versus max. clock frequency (MHz, 65 to 125) for circuit alu4, comparing arch-SVDT (Vdd scaling), arch-DVDT (ideal case), and arch-DVDT (pre-defined Vdd); data points are labeled with Vdd settings such as 1.5v, 1.3v/0.8v, and 1.0v/0.8v]
SLIDE 27 Vdd-Programmable Logic Block
- Power switches for Vdd selection and power gating
- One-bit control is needed for Vdd selection, but two-bit control is needed for power gating
SLIDE 28 Experimental Results with Vdd-Programmable Blocks
- Power vs. performance
[Figure: total power (watt, 0.03 to 0.09) versus clock frequency (MHz, 65 to 125) for circuit alu4, comparing arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case), and arch-DV (pre-defined Vdd); data points are labeled with Vdd settings such as 1.5v, 1.3v/0.8v, and 1.0v/0.8v]
SLIDE 29
Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions
SLIDE 30
Low Power Synthesis for Dual Vdd FPGAs
- FPGA architecture with dual-Vdds adds new layout constraints for synthesis tools
- Novel synthesis tools are required to support the architecture
  - Technology mapping [Chen, et al, FPGA’04]
  - Circuit clustering [Chen, et al, ISLPED’04]
SLIDE 31 Technology Mapping for Low-Power FPGAs with Dual Vdds
- Cut Enumeration: Topological Order from PIs to POs.
[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z; each cut is annotated with its delay and power (Delay 1, Power 1; Delay 2, Power 2; Delay 2, Power 2.5; Delay 2, Power 3.2; Delay 2, Power 3.5), and each node with its optimal delay and power (Optimal Delay = 1, Power = 1; Optimal Delay = 1, Power = 1.5; Optimal Delay = 2, Power = 2.5)]
- Represents one case: the single high-Vdd case
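Cut enumeration visits nodes in topological order and builds each node's cuts by merging one cut from every fanin, keeping only merges with at most K inputs. The sketch below shows that generic K-feasible enumeration without the delay and power bookkeeping from the slide; the small network and the function name are hypothetical, not the mapper described in the talk.

```python
from itertools import product

def enumerate_cuts(order, fanins, k):
    """Return {node: set of K-feasible cuts}, visiting nodes from PIs to POs."""
    cuts = {}
    for v in order:
        ins = fanins.get(v, [])
        result = {frozenset([v])}                 # trivial cut: the node itself
        if ins:
            # Merge one cut from each fanin; keep merges that fit in a K-input LUT.
            for combo in product(*(cuts[u] for u in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    result.add(merged)
        cuts[v] = result
    return cuts

# Hypothetical network: x = f(a, b), y = f(b, c), z = f(x, y).
order = ["a", "b", "c", "x", "y", "z"]
fanins = {"x": ["a", "b"], "y": ["b", "c"], "z": ["x", "y"]}
cuts = enumerate_cuts(order, fanins, k=3)
for cut in sorted(cuts["z"], key=sorted):
    print(sorted(cut))
```

A mapper would then attach a delay and a power value to every cut and keep the best ones per node, as the annotations in the figure suggest.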
SLIDE 32 Dual-Vdd Cases
- Consider:
  - Converter delay & power
  - VddL LUT delay & power
  - VddH LUT delay & power
[Figure: example network with nodes a, b, c, d, e, w, x, y, z, marking an input LUT and the target LUT]

Cases   Input LUT   Target LUT   Converter
1       VddL        VddL         No
2       VddL        VddH         Yes
3       VddH        VddL         No
4       VddH        VddH         No

- Four extra cases for dual-Vdd consideration
- Produce these four cases for each cut and node
- More tradeoff solution points
  - Smaller power requires larger delay
  - Smaller delay requires larger power
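The four cases reduce to one rule: a level converter is needed only when a VddL LUT drives a VddH LUT. A minimal sketch that regenerates the case table from that rule; the function and variable names are illustrative.

```python
from itertools import product

def needs_converter(input_vdd, target_vdd):
    """A converter is required only on a VddL -> VddH connection."""
    return input_vdd == "VddL" and target_vdd == "VddH"

# Enumerate (input LUT, target LUT) supply combinations in the slide's order.
cases = [(i, t, needs_converter(i, t)) for i, t in product(("VddL", "VddH"), repeat=2)]

for number, (input_vdd, target_vdd, conv) in enumerate(cases, start=1):
    print(number, input_vdd, target_vdd, "Yes" if conv else "No")
```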
SLIDE 33 Mapping Solution Generation
- From POs to PIs
- Critical path driven by VddH LUT
- Non-critical paths can be driven by VddL LUT, guided by low power
[Figure: mapped example network with nodes a, b, c, d, e, f, g, w, x, y, z; legend: Low Vdd LUT, High Vdd LUT]
SLIDE 34 Two Types of Required Times
[Figure: node R (to be mapped) drives mapped LUTs x and y, one VddL and one VddH, annotated with 3 and 3.2; req'd times 1.8 and 2.0 are propagated back toward R; the critical path is marked for the cases "If R is using VddH" and "If R is using VddL", and a converter accounts for 1.7 = 2.0 - 0.3; the resulting req'd times of R are 1.7 and 1.8]
- Each node maintains two req'd times
  - Propagated separately
  - Interact with each other
SLIDE 35 Experimental Results
SVmap (single high Vdd) compared to Emap [Lamoureux, ICCAD03]
[Figure: bar chart comparing mapping area, total edges, est'ed power, and real power; one data label reads 0.56%]
- Mapping area considerably better
- Estimated power very close to the real power reported after P&R
DVmap (dual Vdd) compared to SVmap
[Figure: bar chart comparing SVmap (v1.3) with DVmap (v1.3 - v0.8, v1.3 - v0.9, v1.3 - v1.0)]
- v1.3 as VddH and v0.8 as VddL is the best combination
SLIDE 36 Circuit Clustering with Dual Vdds
Given:
- A mapped FPGA design
- An FPGA architecture with dual-Vdd configurable logic blocks
Goal:
- Cluster the LUTs into logic blocks
- Assign voltages to the logic blocks such that the design has
  - Optimal delay
  - Minimum power
Constraints:
- Logic Block Inputs ≤ K
- Logic Block Size ≤ M
- Logic Block Outputs ≤ M
- LUT delay = dL or dH
- Inter-block edge delay = D
[Figure: example logic block of LUTs with Input = 5, Size = 3, Output = 2]
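The input, size, and output constraints can be checked directly on a candidate cluster. The sketch below counts distinct external inputs, LUTs, and externally visible outputs; the graph is a hypothetical one chosen to reproduce the "Input = 5, Size = 3, Output = 2" example, and the function names are assumptions.

```python
def cluster_profile(cluster, fanins, fanouts, primary_outputs=()):
    """Return (number of external inputs, cluster size, number of outputs)."""
    cluster = set(cluster)
    # Inputs: signals read by the cluster but produced outside of it.
    inputs = {u for v in cluster for u in fanins.get(v, ()) if u not in cluster}
    # Outputs: LUTs that are primary outputs or feed something outside the cluster.
    outputs = {v for v in cluster
               if v in primary_outputs
               or any(w not in cluster for w in fanouts.get(v, ()))}
    return len(inputs), len(cluster), len(outputs)

def is_feasible(cluster, fanins, fanouts, primary_outputs, k_inputs, m_size, m_outputs):
    """Check Inputs <= K, Size <= M, and Outputs <= the output limit."""
    n_in, size, n_out = cluster_profile(cluster, fanins, fanouts, primary_outputs)
    return n_in <= k_inputs and size <= m_size and n_out <= m_outputs

# Hypothetical LUT graph: p and q feed r; p also feeds a LUT outside the cluster.
fanins = {"p": ["i1", "i2"], "q": ["i3", "i4"], "r": ["p", "q", "i5"]}
fanouts = {"p": ["r", "ext"], "q": ["r"], "r": []}
print(cluster_profile({"p", "q", "r"}, fanins, fanouts, primary_outputs={"r"}))
```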
SLIDE 37 Cluster Enumeration – An Example
[Figure: example network with nodes m, n, q, r, s, t, with common nodes marked]
- To get a cluster of size 6 on LUT t
  - Get 1 node on r, 4 on s, then merge with t
  - …, and get 2 nodes on r, 3 on s …
  - Get 3 on r …
- PIs to POs
- Dynamic Programming
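The example splits a cluster of size 6 rooted at t into one slot for t and five slots shared between the fanins r and s. A sketch of just that split enumeration follows; it assumes every fanin contributes at least one node by default and ignores the common nodes shared between the fanin cones, which the real procedure has to account for. The function name is hypothetical.

```python
def size_splits(cluster_size, n_fanins, min_per_fanin=1):
    """List the ways to distribute (cluster_size - 1) nodes over the fanins."""
    if n_fanins < 1:
        raise ValueError("need at least one fanin")
    remaining = cluster_size - 1  # one slot is taken by the root LUT itself

    def rec(left, k):
        if k == 1:
            if left >= min_per_fanin:
                yield (left,)
            return
        # Leave enough nodes for the k - 1 fanins that still need a share.
        for take in range(min_per_fanin, left - min_per_fanin * (k - 1) + 1):
            for rest in rec(left - take, k - 1):
                yield (take,) + rest

    return list(rec(remaining, n_fanins))

# Size-6 cluster on t with fanins (r, s): (1, 4), (2, 3), (3, 2), (4, 1).
print(size_splits(6, 2))
```

In a dynamic-programming formulation, each split would look up sub-clusters already computed for r and s and merge them with t.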
SLIDE 38 Solution Generation
[Figure: example network with nodes m, n, q, r, s, t and clusters s1, s2, and t1]
- Try to get solutions for Cluster t1
  - Get solutions for s
  - Get solutions for r
- Solution propagation similar to [Vaishnav, ICCAD’99]
- Delay, power and voltage (form solution points) propagate through the clusters and nodes iteratively
SLIDE 39 Solution Curve on Node r
Good solutions: for any two delay-power-vdd points (D1, P1, V1) and (D2, P2, V2)
- if D1 > D2, then P1 < P2
- if D1 < D2, then P1 > P2

Good delay-power-vdd points:
Delay   Power   Vdd
5       10      H
5.4     9.8     L
6.1     8.24    H
6.2     8       L

[Figure: the corresponding solution curve, power (2 to 12) versus delay (1 to 7), with points labeled H, L, H, L]
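The "good solutions" condition says that no kept point is dominated: along the curve, delay strictly increases while power strictly decreases. A minimal pruning sketch follows; the four curve points are from the slide, and the two extra points are hypothetical dominated candidates added for illustration.

```python
def prune(points):
    """Keep only non-dominated (delay, power, vdd) points, sorted by delay."""
    kept = []
    for d, p, v in sorted(points, key=lambda s: (s[0], s[1])):
        # With delay non-decreasing, a point is useful only if it lowers power.
        if not kept or p < kept[-1][1]:
            kept.append((d, p, v))
    return kept

slide_points = [(5, 10, "H"), (5.4, 9.8, "L"), (6.1, 8.24, "H"), (6.2, 8, "L")]
inferior = [(5.8, 9.9, "L"), (6.5, 8.5, "H")]   # hypothetical, each dominated by a slide point
print(prune(slide_points + inferior))
```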
SLIDE 40 Solution Propagation
[Figure: delay-power-vdd curve for r, power (2 to 12) versus delay (1 to 7), points labeled H, L, H, L]
[Figure: delay-power-vdd curve for s, power (2 to 12) versus delay (1 to 7), points labeled H, L, H, L]
Consider:
- Converter
- VddL LUT
- VddH LUT
- Edge delay
[Figure: delay-power-vdd curve for cluster t1, power (2 to 18) versus delay (1 to 9), points labeled L, L, L, L, H, H]
- All the good solutions are generated
- All the inferior solutions are pruned away
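Propagation combines one solution point from each fanin curve with a supply choice for the new node, adds converter, LUT, and edge costs, and prunes the inferior results. The sketch below is a simplified tree-style model with assumed cost numbers; it is not the clustering algorithm from the cited paper, and every parameter value is hypothetical.

```python
from itertools import product

def prune(points):
    """Keep non-dominated (delay, power, vdd) points."""
    kept = []
    for d, p, v in sorted(points, key=lambda s: (s[0], s[1])):
        if not kept or p < kept[-1][1]:
            kept.append((d, p, v))
    return kept

def propagate(fanin_curves, lut_delay, lut_power, conv_delay, conv_power, edge_delay=0.0):
    """Build the solution curve of a node from the curves of its fanins."""
    candidates = []
    for combo in product(*fanin_curves):        # one point per fanin
        for vt in ("H", "L"):                   # supply choice for the new node
            arrivals, power = [], lut_power[vt]
            for d, p, v in combo:
                conv = (v == "L" and vt == "H") # converter only on VddL -> VddH
                arrivals.append(d + edge_delay + (conv_delay if conv else 0.0))
                power += p + (conv_power if conv else 0.0)
            candidates.append((max(arrivals) + lut_delay[vt], power, vt))
    return prune(candidates)

# Hypothetical costs and fanin curves.
lut_delay = {"H": 1.0, "L": 1.5}
lut_power = {"H": 2.0, "L": 1.0}
r_curve = [(1.0, 2.0, "H"), (1.5, 1.0, "L")]
s_curve = [(1.0, 2.0, "H")]
t_curve = propagate([r_curve, s_curve], lut_delay, lut_power, conv_delay=0.5, conv_power=0.25)
print(t_curve)
```

Summing fanin powers is exact only when the fanin cones do not overlap, which is one reason power optimality is harder to guarantee on general DAGs than on trees.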
SLIDE 41
Two Theorems
- The algorithm gets the minimum number of solution points, W, optimally for each node
  - W is upper bounded by $\frac{(L+1)^2}{2}$, where L = level(v) for node v
- The algorithm is delay and power optimal for trees and delay optimal for directed acyclic graphs (DAGs) with dual-Vdd FPGAs
SLIDE 42 Experimental Results Summary
[Figure: bar chart comparing Single Vdd (v1.3) with Dual Vdd (v1.3 - v0.8, v1.3 - v0.9, v1.3 - v1.0)]
- Dual-Vdd clustering results compared to the single high-Vdd clustering results
- v1.3 as high Vdd and v0.8 as low Vdd is the best combination among the three
SLIDE 43
Outline
- Introduction
- Understanding Power Consumption in FPGAs
- Architecture Evaluation and Power Optimization
- Low Power Synthesis
- Conclusions
SLIDE 44 Conclusions
- FPGA power consumption
  - Majority on programmable interconnects
  - Leakage is significant
- FPGA architecture optimization for power
  - Architecture parameter tuning has a limited impact
  - Using high Vt for configuration SRAM cells is helpful
  - Using programmable dual Vdd for logic blocks is helpful
- Power-efficient FPGA architectures introduce interesting CAD problems
  - Dual-Vdd mapping
  - Dual-Vdd clustering
  - Up to 20% power saving reported using these algorithms