[PPT] - Architecture and Synthesis for Power-Efficient FPGAs Jason Cong PowerPoint Presentation

SLIDE 1

Architecture and Synthesis for Power-Efficient FPGAs

Jason Cong
University of California, Los Angeles
cong@cs.ucla.edu

Partially supported by NSF Grants CCR-0096383 and CCR-0306682, and by Altera under the California MICRO program


Joint work with Deming Chen, Lei He, Fei Li, Yan Lin

SLIDE 2

Outline

Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions

SLIDE 3

Why? FPGA is Known to be Power Inefficient!

FPGA consumes 50-100X more power

Why do we care about power optimization for FPGAs?!

Source: [Zuchowski, et al, ICCAD02]

SLIDE 4

FPGA Advantages

Short TAT (total turnaround time)
No or very low NRE

SLIDE 5

ASICs Become Increasingly Expensive

Traditional ASIC designs are facing a rapid increase of NRE and mask-set costs at 90nm and below

Source: EETimes

[Figure: total cost for mask set ($M, axis $0.0 to $2.5) and cost per mask ($K, axis $10 to $60; plotted values 7.5, 12, 40, 60) at the 250nm, 180nm, 130nm and 100nm process nodes]

Process (um)           2.0   …   0.8   0.6   0.35  0.25  0.18  0.13   0.10
Single Mask cost ($K)  1.5       1.5   2.5   4.5   7.5   12    40     60
# of Masks             12        12    12    16    20    26    30     34
Mask Set cost ($K)     18        18    30    72    150   312   1,000  2,000

SLIDE 6

Our Research

Power Efficient FPGAs
  Circuit Design
  Fabric Design
  System Design
  Synthesis Tools

SLIDE 7

Outline

Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions

SLIDE 8

FPGA Architecture

Programmable IO
Programmable Logic
Programmable Routing

[Figure: logic block with I inputs, a clock and N outputs, containing BLE #1 to BLE #N; each BLE has a K-input LUT and a D flip-flop (D FF) with clock, driving Out]

SLIDE 9

Evaluation Framework – fpgaEva-LP

fpgaEva-LP [Li, et al, FPGA’03]

[Figure: evaluation flow. Inputs: BLIF and Arch Spec. Stages: Logic Optimization (SIS), Tech-Mapping (RASP), Timing-Driven Packing (T-VPack), Placement & Routing (VPR). Reported results: SLIF, Delay, Area. The BC-Netlist Generator produces the BC-Netlist, which the Power Simulator uses to report Power.]

SLIDE 10

BC-Netlist Generator

[Figure: BC-Netlist generation flow. Starting from the Mapped Netlist and Layout, the steps are Buffer Extraction, Netlist Generation for Logic Clusters, Capacitance Extraction, Delay Calculation and Back-annotation, producing the BC-Netlist.]

SLIDE 11

Mixed-level Power Model – Overview

Dynamic power
  Switching power
  Short-circuit power
  Related to signal transitions
    Functional switch
    Glitch

Static power
  Sub-threshold leakage
  Gate leakage
  Reverse biased leakage
  Depending on the input vector

[Table: power model used for each combination of component (Logic Block; Interconnect & clock) and power source (Dynamic; Static). Three entries are macro-models and one is a switch-level model.]

SLIDE 12

Cycle-Accurate Power Simulator

Mixed-level Power Model
Post-layout extracted delay & capacitance

[Figure: simulation flow. Random Vector Generation and the BC-Netlist feed Cycle Accurate Power Simulation with Glitch Analysis; the check "All cycles finished?" loops back on No and outputs Power Values on Yes.]

$$E_{cycle} = \sum_{i \in active} E_a(n_i) + \sum_{j \in idle} E_s(n_j)$$
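The per-cycle accounting on this slide can be illustrated with a small sketch: each cycle, nodes that change value contribute an "active" energy and the remaining nodes contribute an "idle" energy, and the loop runs over random input vectors until all cycles are finished. The toy netlist, the energy tables `E_ACTIVE` and `E_STATIC`, and all function names are assumptions made for illustration; the sketch does no glitch analysis and is not the fpgaEva-LP implementation.

```python
import random

# Hypothetical toy netlist: node -> (function, fanin names). Inputs are a, b, c.
NETLIST = {
    "x": (lambda a, b: a & b, ("a", "b")),
    "y": (lambda b, c: b ^ c, ("b", "c")),
    "z": (lambda x, y: x | y, ("x", "y")),
}
# Assumed per-node energies (arbitrary units), not values from the slides.
E_ACTIVE = {"x": 2.0, "y": 2.5, "z": 3.0}  # energy when the node switches
E_STATIC = {"x": 0.2, "y": 0.2, "z": 0.3}  # energy when the node is idle


def evaluate(vector):
    """Compute all node values for one input vector (dict order is topological)."""
    values = dict(vector)
    for node, (fn, fanins) in NETLIST.items():
        values[node] = fn(*(values[f] for f in fanins))
    return values


def cycle_energy(prev, curr):
    """Sum active energy over switching nodes and static energy over idle nodes."""
    active = [n for n in NETLIST if prev[n] != curr[n]]
    idle = [n for n in NETLIST if prev[n] == curr[n]]
    return sum(E_ACTIVE[n] for n in active) + sum(E_STATIC[n] for n in idle)


def simulate(num_cycles, seed=0):
    """Average energy per cycle over random input vectors."""
    rng = random.Random(seed)
    prev = evaluate({name: 0 for name in "abc"})
    total = 0.0
    for _ in range(num_cycles):
        curr = evaluate({name: rng.randint(0, 1) for name in "abc"})
        total += cycle_energy(prev, curr)
        prev = curr
    return total / num_cycles


print(simulate(1000))
```

Dividing the average energy per cycle by the clock period would give average power; a real simulator would also account for glitches using the extracted delays.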

SLIDE 13

Power Breakdown

Interconnect power is dominant

Cluster Size = 12, LUT Size = 4
  Logic Block Power 19%
  Interconnect Power 59%
  Clock Power 22%

Cluster Size = 12, LUT Size = 6
  Logic Block Power 40%
  Interconnect Power 45%
  Clock Power 15%

SLIDE 14

Power Breakdown (cont’d)

Leakage power becomes increasingly important (100nm)

Cluster Size = 12, LUT Size = 4
  Leakage Power 42%
  Dynamic Power 58%

Cluster Size = 12, LUT Size = 6
  Leakage Power 52%
  Dynamic Power 48%

SLIDE 15

Outline

Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
  Architecture Parameter Selection
  Dual-Vdd/Dual-Vt FPGA Architecture
Low Power Synthesis with Dual-Vdd
Conclusion

SLIDE 16

Total Power along LUT and Cluster Size Changes

[Figure: total FPGA power (normalized geometric mean, axis 1 to 2) versus LUT size (3 to 7), with one curve per cluster size: 4, 6, 8, 10, 12]

Routing architecture: segmented wire with length of 4, and 50% tri-state buffers in routing switches

SLIDE 17

Routing Architecture Evaluation

SLIDE 18

Architecture of Low-power and High-performance

Application             Energy (E)  Delay (t)  E3t     Et3     Best FPGA architecture
Low-power (E3t)         0.9653      0.9904     0.8909  1.0080  Cluster size 10, LUT size 4, wire segment length 4, 25% buffered routing switches
High-performance (Et3)  1.0502      0.8865     1.0268  0.7865  Cluster size 12, LUT size 4, wire segment length 4, 100% buffered routing switches

Arch. Parameter selection leads to 10% power/delay trade-off

Uniform FPGA fabrics provide limited power-performance tradeoff
Need to explore heterogeneous FPGA fabrics, e.g. dual-Vt and dual-Vdd fabrics

SLIDE 19

Outline

Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
  Architecture Parameter Selection
  Dual-Vdd/Dual-Vt FPGA Architecture [Li, et al, FPGA’04]
Low Power Synthesis with Dual-Vdd
Conclusion

SLIDE 20

Dual-Vdd LUT Design

Dual-Vdd technique makes use of the timing slack to reduce power
  VddH devices on critical paths, for performance
  VddL devices on non-critical paths, for power
  Assume uniform Vdd for one LUT

Threshold voltage Vt should be adjusted carefully for different Vdd levels
  To compensate delay increase
  To avoid excessive leakage power increase

SLIDE 21

Vdd/Vt-Scaling for LUTs

Three scaling schemes
  Constant-Vt scaling
  Fixed-Vdd/Vt-ratio scaling
  Constant-leakage scaling

[Figure: delay (ns, axis 0.1 to 0.7) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed-Vdd/Vt-ratio and constant leakage scaling]

[Figure: leakage power (uW, axis 1 to 10) versus Vdd (1.3v, 1.0v, 0.9v, 0.8v) for constant Vt, fixed-Vdd/Vt-ratio and constant leakage scaling]

Constant-leakage scaling obtains a good tradeoff
  Useful for both single-Vdd scaling and dual-Vdd design

SLIDE 22

Dual-Vt LUT Design

LUT is divided into two parts
  Part I: configuration cells, high Vt
  Part II: MUX tree and input buffers, normal Vt (decided by constant-leakage Vdd-scaling)

Configuration SRAM cells
  Content remains unchanged after configuration
  Read/write delay is not related to FPGA performance
  Use high Vt, ~40% of Vdd
    Maintain signal integrity
    Reduce SRAM leakage by 15X and LUT leakage by 2.4X
    Increase configuration time by 13%

SLIDE 23

Pre-Defined Dual-Vt Fabric

Power saving
  11.6% for combinational circuits
  14.6% for sequential circuits

Table1 Combinational circuits

Circuit   arch-SVST (Single Vt)   arch-SVDT (Dual Vt)
          power (watt)            power saving
alu4      0.0798                  8.5%
apex2     0.108                   9.3%
apex4     0.0536                  12.3%
des       0.234                   10.7%
ex1010    0.179                   17.3%
ex5p      0.059                   11.6%
misex3    0.0753                  9.4%
pdc       0.256                   14.7%
seq       0.0927                  9.4%
spla      0.180                   12.4%
Avg.                              11.6%

Table2 Sequential circuits

Circuit   arch-SVST (Single Vt)   arch-SVDT (Dual Vt)
          power (watt)            power saving
bigkey    0.148                   12.3%
clma      0.632                   14.8%
diffeq    0.0391                  19.7%
dsip      0.134                   14.5%
elliptic  0.140                   16.3%
frisc     0.190                   19.2%
s298      0.0736                  13.4%
s38417    0.307                   11.7%
s38484    0.261                   10.2%
tseng     0.0351                  14.0%
Avg.                              14.6%

SLIDE 24

Dual-Vdd FPGA Fabric

Granularity: logic block (i.e., cluster of LUTs)
  Smaller granularity => intuitively more power saving
  But a larger implementation overhead

Layout pattern: pre-defined dual-Vdd pattern
  Row-based or interleaved pattern
  Ratio of VddL/VddH blocks is 2:1 (benchmark profiling)

Interconnect uses uniform VddH

L-block: VddL
H-block: VddH
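As a rough illustration of a pre-defined pattern with a 2:1 ratio of VddL to VddH blocks, the sketch below builds a row-based grid and an interleaved grid of L-blocks and H-blocks. The repeating period `("L", "L", "H")`, the diagonal shift used for the interleaved case, and both function names are assumptions; the slides do not specify the exact layouts.

```python
def row_based_pattern(rows, cols, period=("L", "L", "H")):
    """Every block in a row shares one Vdd; rows cycle through the period."""
    return [[period[r % len(period)]] * cols for r in range(rows)]


def interleaved_pattern(rows, cols, period=("L", "L", "H")):
    """Shift the period by one block per row so L and H blocks are mixed."""
    return [[period[(r + c) % len(period)] for c in range(cols)]
            for r in range(rows)]


def count_blocks(grid):
    """Return (number of L-blocks, number of H-blocks)."""
    flat = [block for row in grid for block in row]
    return flat.count("L"), flat.count("H")


for name, grid in (("row-based", row_based_pattern(6, 6)),
                   ("interleaved", interleaved_pattern(6, 6))):
    print(name, count_blocks(grid))
    for row in grid:
        print(" ".join(row))
```

On a 6 x 6 grid both layouts contain 24 L-blocks and 12 H-blocks, which matches the 2:1 ratio quoted on the slide.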

SLIDE 25

Simple Design Flow for Dual-Vdd Fabric

Based on traditional design flow, but with new steps
  Step I: LUT mapping (FlowMap) + P & R assuming uniform VddH (using VPR)
  Step II: Dual-Vdd assignment based on sensitivity
  Step III: Timing driven P & R considering pre-defined dual-Vdd pattern (modified VPR)

SLIDE 26

Comparison Between Vdd-Scaling and Dual-Vdd

For high clock frequency, dual Vdd achieves ~6% total power saving (~18% logic power saving)
For low clock frequency, single-Vdd scaling is better
Still a large gap between ideal dual-Vdd and real case
  Ideal dual-Vdd is the result without layout pattern constraint

[Figure: power (watt, axis 0.03 to 0.09) versus max. clock frequency (MHz, axis 65 to 125) for circuit alu4, comparing arch-SVDT (Vdd Scaling), arch-DVDT (ideal case) and arch-DVDT (pre-defined Vdd). Points are labeled with single-Vdd settings (0.9v, 1.0v, 1.3v, 1.5v) and VddH/VddL pairs (0.9v/0.8v, 1.0v/0.8v, 1.0v/0.9v, 1.3v/0.8v, 1.3v/0.9v, 1.3v/1.0v, 1.5v/1.0v).]

SLIDE 27

Vdd-Programmable Logic Block

Power switches for Vdd selection and power gating
One-bit control is needed for Vdd selection, but two-bit control is needed with power gating

SLIDE 28

Experimental Results with Vdd-Programmable Blocks

Power vs. performance

[Figure: total power (watt, axis 0.03 to 0.09) versus clock frequency (MHz, axis 65 to 125) for circuit alu4, comparing arch-SV (Vdd scaling), arch-DV (configurable Vdd), arch-DV (ideal case) and arch-DV (pre-defined Vdd). Points are labeled with single-Vdd settings (1.0v, 1.3v, 1.5v) and VddH/VddL pairs (0.9v/0.8v, 1.0v/0.8v, 1.0v/0.9v, 1.3v/0.8v, 1.3v/0.9v, 1.5v/0.8v, 1.5v/1.0v).]

SLIDE 29

Outline

Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions

SLIDE 30

Low Power Synthesis for Dual Vdd FPGAs

FPGA architecture with dual-Vdds adds new layout constraints for synthesis tools

Novel synthesis tools are required to support the architecture
  Technology mapping [Chen, et al, FPGA’04]
  Circuit clustering [Chen, et al, ISLPED’04]

SLIDE 31

Technology Mapping for Low-Power FPGAs with Dual Vdds

Cut Enumeration: topological order from PIs to POs.

[Figure: example network with nodes a, b, c, d, e, f, g, w, x, y, z. Cuts are annotated with delay and power (Delay 1, Power 1; Delay 2, Power 2; Delay 2, Power 3.2; Delay 2, Power 3.5; Delay 2, Power 2.5) and nodes with their optimal solutions (Optimal Delay = 1, Power = 1, shown twice; Optimal Delay = 1, Power = 1.5; Optimal Delay = 2, Power = 2.5).]

The figure represents one case: the single high Vdd case

SLIDE 32

Dual-Vdd Cases

Consider:
  Converter delay & power
  VddL LUT delay & power
  VddH LUT delay & power

[Figure: example network with nodes a, b, c, d, e, w, x, y, z, marking an Input LUT and the Target LUT]

Case  Input LUT  Target LUT  Converter
1     VddL       VddL        No
2     VddL       VddH        Yes
3     VddH       VddL        No
4     VddH       VddH        No

Four extra cases for dual-Vdd consideration
Produce these four cases for each cut and node
More tradeoff solution points
  Smaller power requires larger delay
  Smaller delay requires larger power
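The four input/target combinations can be enumerated mechanically. The sketch below reproduces the converter column of the table (a converter only when a VddL LUT drives a VddH LUT) and shows how each case could add delay and power to a solution point. The numeric delay and power values and the function names are hypothetical placeholders, not figures from the slides.

```python
from itertools import product

# Hypothetical per-stage figures (arbitrary units), for illustration only.
LUT_DELAY = {"VddH": 1.0, "VddL": 1.4}
LUT_POWER = {"VddH": 1.0, "VddL": 0.6}
CONVERTER_DELAY = 0.3
CONVERTER_POWER = 0.1


def needs_converter(input_vdd, target_vdd):
    """A level converter is needed only when a VddL LUT drives a VddH LUT."""
    return input_vdd == "VddL" and target_vdd == "VddH"


def stage_cost(input_vdd, target_vdd):
    """Delay and power added by the target LUT plus any converter on its input."""
    delay = LUT_DELAY[target_vdd]
    power = LUT_POWER[target_vdd]
    if needs_converter(input_vdd, target_vdd):
        delay += CONVERTER_DELAY
        power += CONVERTER_POWER
    return delay, power


# product() yields (VddL, VddL), (VddL, VddH), (VddH, VddL), (VddH, VddH),
# which is the same order as cases 1 to 4 in the table.
cases = list(product(["VddL", "VddH"], repeat=2))
for number, (input_vdd, target_vdd) in enumerate(cases, start=1):
    converter = "Yes" if needs_converter(input_vdd, target_vdd) else "No"
    print(number, input_vdd, target_vdd, converter, stage_cost(input_vdd, target_vdd))
```

With placeholder numbers like these, the VddL target cases cost less power but more delay, which is the kind of tradeoff the extra solution points capture.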

SLIDE 33

Mapping Solution Generation

From POs to PIs
Critical path driven by VddH LUT
Non-critical paths can be driven by VddL LUT, guided by low power

[Figure: mapped example network with nodes a, b, c, d, e, f, g, w, x, y, z; legend: Low Vdd LUT, High Vdd LUT]

SLIDE 34

Two Types of Required Times

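Each node maintains two req’d times:
  Propagated separately
  Interact with each other

[Figure: node R (to be mapped) with mapped LUTs x and y, one at VddL and one at VddH, annotated with req’d times 3 and 3.2, and a converter on one branch. Req’d times are propagated back along the critical path for two cases, "If R is using VddH" and "If R is using VddL". The figure shows the values 1.7 = 2.0 - 0.3 and 1.8, 2.0, and reports "Req’d time of R is 1.7" and "Req’d time of R is 1.8".]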

SLIDE 35

Experimental Results

SVmap (Single high Vdd) compared to Emap [Lamoureux, ICCAD03]
  Mapping area considerably better
  Estimated power very close to the real power reported after P&R

[Chart: SVmap versus Emap. Categories as listed: Real power, Est'ed power, Total edges, Mapping area. Values as listed: 2.10%, 1.29%, 0.56%, 4.04%.]

DVmap (dual Vdd) compared to SVmap
  v1.3 as VddH and v0.8 as VddL is the best combination

[Chart: DVmap versus SVmap (v1.3). v1.3 - v1.0: 9.44%; v1.3 - v0.9: 10.72%; v1.3 - v0.8: 11.63%.]

SLIDE 36

Circuit Clustering with Dual Vdds

Given:
  A mapped FPGA design
  An FPGA architecture with Dual-Vdd configurable logic blocks

Goal:
  Cluster the LUTs into logic blocks
  Assign voltages to the logic blocks
  such that the design has
    Optimal delay
    Minimum power

Constraints:
  Logic Block Inputs ≤ K
  Logic Block Size ≤ M
  Logic Block Outputs ≤ M
  LUT delay = dL or dH
  Inter-block edge delay = D

[Figure: LUTs grouped into a logic block, labeled Input = 5, Size = 3, Output = 2]
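The three capacity constraints (inputs, size, outputs) can be checked for any candidate cluster by counting signals that cross the cluster boundary. The sketch below does this on a small hypothetical netlist; the connectivity, the function names, and the reuse of the slide's figures (5, 3, 2) as limits are all assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple


def cluster_stats(fanins: Dict[str, List[str]],
                  primary_outputs: Set[str],
                  cluster: Set[str]) -> Tuple[int, int, int]:
    """Return (inputs, size, outputs) for a candidate cluster of LUTs."""
    # Inputs: distinct signals read by the cluster but produced outside it.
    inputs = {src for lut in cluster for src in fanins.get(lut, [])
              if src not in cluster}
    # Outputs: cluster LUTs that are primary outputs or drive a LUT outside.
    outputs = {lut for lut in cluster if lut in primary_outputs}
    for node, srcs in fanins.items():
        if node not in cluster:
            outputs.update(src for src in srcs if src in cluster)
    return len(inputs), len(cluster), len(outputs)


def is_feasible(stats: Tuple[int, int, int],
                max_inputs: int, max_size: int, max_outputs: int) -> bool:
    """Check the three logic-block capacity constraints."""
    n_in, size, n_out = stats
    return n_in <= max_inputs and size <= max_size and n_out <= max_outputs


# Hypothetical netlist: LUT -> signals it reads; a, b, c, d are primary inputs.
FANINS = {"r": ["a", "b"], "s": ["b", "c", "d"], "t": ["r", "s"]}
PRIMARY_OUTPUTS = {"t"}

stats = cluster_stats(FANINS, PRIMARY_OUTPUTS, {"r", "s", "t"})
print(stats, is_feasible(stats, max_inputs=5, max_size=3, max_outputs=2))
```

The delay constraints (dL, dH and D) are not modeled here; they would enter when solution points are computed for each feasible cluster.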

SLIDE 37

Cluster Enumeration – An Example

[Figure: example network with nodes m, n, p, q, r, s, t, with common nodes marked]

Dynamic programming from PIs to POs

To get a cluster of size 6 on LUT t:
  Get 1 node on r, 4 on s, then merge with t …
  Get 2 nodes on r, 3 on s …
  Get 3 on r …
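The size-splitting step of this example can be sketched as an enumeration of how the remaining nodes are divided among the fanins of t. The function name and the `minimum` parameter are assumptions; the sketch covers only the counting step, not the handling of common nodes or the merging of the sub-clusters themselves.

```python
def size_splits(total, parts, minimum=1):
    """Yield tuples of `parts` integers, each >= minimum, that sum to `total`."""
    if parts == 1:
        if total >= minimum:
            yield (total,)
        return
    # Leave at least `minimum` nodes for each of the remaining parts.
    for first in range(minimum, total - minimum * (parts - 1) + 1):
        for rest in size_splits(total - first, parts - 1, minimum):
            yield (first,) + rest


# A cluster of size 6 rooted at t: t itself takes one slot, leaving 5 nodes
# to divide between the sub-clusters rooted at r and s.
for on_r, on_s in size_splits(6 - 1, 2):
    print(f"get {on_r} on r, {on_s} on s, then merge with t")
```

In a dynamic-programming setting, the sub-clusters of each size on r and s would already be available when t is processed, since nodes are visited from PIs to POs.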

SLIDE 38

Solution Generation

[Figure: example network with nodes m, n, p, q, r, s, t, showing Cluster s1 and Cluster s2]

Solution propagation similar to [Vaishnav, ICCAD’99]
Delay, power and voltage (which form solution points) propagate through the clusters and nodes iteratively

Try to get solutions for Cluster t1
  Get solutions for s
  Get solutions for r

SLIDE 39

Solution Curve on Node r

Good solutions: for any two delay-power-vdd points (D1, P1, V1) and (D2, P2, V2)

if D1 > D2, then P1 < P2
if D1 < D2, then P1 > P2

Good delay-power-vdd points

Vdd  Power  Delay
H    10     5
L    9.8    5.4
H    8.24   6.1
L    8      6.2

[Figure: the corresponding solution curve for node r, power (axis 2 to 12) versus delay (axis 1 to 7), with points labeled H, L, H, L]
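The "good solution" condition says that no kept point may be both slower and more power-hungry than another. A minimal sketch of that pruning step follows, using the four points from the slide plus one made-up inferior point to show it being dropped. The function name and the extra point are assumptions, and the sketch is a generic non-dominated filter rather than the authors' code.

```python
def prune_inferior(points):
    """Keep only (delay, power, vdd) points not dominated in delay and power."""
    kept = []
    best_power = float("inf")
    # Visit points from fastest to slowest; a slower point is only worth
    # keeping if it uses strictly less power than every faster point.
    for delay, power, vdd in sorted(points, key=lambda p: (p[0], p[1])):
        if power < best_power:
            kept.append((delay, power, vdd))
            best_power = power
    return kept


# The four good points from the slide, as (delay, power, vdd).
slide_points = [(5, 10, "H"), (5.4, 9.8, "L"), (6.1, 8.24, "H"), (6.2, 8, "L")]
# A hypothetical inferior point: slower and higher power than (5, 10, "H").
candidates = slide_points + [(6.0, 10.5, "L")]

for point in prune_inferior(candidates):
    print(point)
```

After pruning, delay increases and power decreases monotonically along the list, which is the shape of the solution curve.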

SLIDE 40

Solution Propagation

[Figure: delay-power-vdd curve for r and delay-power-vdd curve for s, each plotting power (axis 2 to 12) versus delay (axis 1 to 7) with points labeled H, L, H, L]

Consider:
  Converter
  VddL LUT
  VddH LUT
  Edge delay

All the good solutions are generated
All the inferior solutions are pruned away

[Figure: delay-power-vdd curve for cluster t1, plotting power (axis 2 to 18) versus delay (axis 1 to 9) with points labeled L, L, L, L, H, H]

SLIDE 41

Two Theorems

The algorithm gets the minimum number of solution points, W, optimally for each node
  W is upper bounded by a function of L, where L = level(v) for node v
  [Bound formula garbled in extraction; the surviving terms are (L + 1), 2 and 2]

The algorithm is delay and power optimal for trees and delay optimal for directed acyclic graphs (DAGs) with dual-Vdd FPGAs

SLIDE 42

Experimental Results Summary

Dual-Vdd Clustering results compared to the Single high-Vdd Clustering results
v1.3 as high Vdd and v0.8 as low Vdd is the best combination among the three

[Chart: Dual Vdd versus Single Vdd (v1.3). v1.3 - v1.0: 18.4%; v1.3 - v0.9: 19.5%; v1.3 - v0.8: 20.3%.]

SLIDE 43

Outline

Introduction
Understanding Power Consumption in FPGAs
Architecture Evaluation and Power Optimization
Low Power Synthesis
Conclusions

SLIDE 44

Conclusions

FPGA power consumption
  Majority on programmable interconnects
  Leakage is significant

FPGA architecture optimization for power
  Architecture parameter tuning has a limited impact
  Using high Vt for configuration SRAM cells is helpful
  Using programmable dual Vdd for logic blocks is helpful

Power-efficient FPGA architectures introduce interesting CAD problems
  Dual-Vdd mapping
  Dual-Vdd clustering
  Up to 20% power saving reported using these algorithms