Becoming More Tolerant: Designing FPGAs for Variable Supply Voltage

slide-1
SLIDE 1

Becoming More Tolerant: Designing FPGAs for Variable Supply Voltage

Ibrahim Ahmed Linda Shen Vaughn Betz

slide-2
SLIDE 2

Technology Scaling: Transforming the World

  • Packing ever more computations on a single chip

2

slide-5
SLIDE 5

Technology Scaling: The Other Side

  • Huge energy demand
  • Data centers consumed 2% of total US electricity in 2014 [a]
  • ICT sector projected to consume 9-20% of global electricity by 2025 [b]

5

[a] N. Jones. How to stop data centres from gobbling up the world's electricity. Nature, 561:163-166, 09 2018.
[b] A. Shehabi et al. United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, Berkeley, California, 2016.

slide-6
SLIDE 6

Technology Scaling: The Other Side

  • Huge energy demand
  • Data centers consumed 2% of total US electricity in 2014 [a]
  • ICT sector projected to consume 9-20% of global electricity by 2025 [b]
  • Many devices are power constrained
  • Mobile/edge
  • Cellular base station, satellites, etc.

6

[a] N. Jones. How to stop data centres from gobbling up the world's electricity. Nature, 561:163-166, 09 2018.
[b] A. Shehabi et al. United States Data Center Energy Usage Report. Lawrence Berkeley National Laboratory, Berkeley, California, 2016.

slide-7
SLIDE 7

Moving Away from General-Purpose Processors

  • FPGAs: a trade-off between flexibility and efficiency
  • Users can build custom digital systems without the ASIC challenges
  • Not as power efficient as ASICs
  • Offer better performance/W than CPUs for many applications
  • Known to have lower absolute power than CPUs
  • Adopted in Microsoft, Baidu, and Amazon data centres

7

slide-8
SLIDE 8

FPGA Power Consumption Challenge

8

slide-9
SLIDE 9

FPGA Power Consumption Challenge

9

slide-10
SLIDE 10

What Happened?

10

slide-11
SLIDE 11

What Happened?

11

Nominal Vdd not scaling

slide-12
SLIDE 12

Adaptive & Dynamic Voltage Scaling (DVS)

  • Academic work on DVS
  • Set supply voltage (Vdd) dynamically; it is no longer fixed at nominal
  • Prior work has shown ~30% power reduction

12
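
A rough way to see why lowering Vdd saves power is the standard first-order switching-power model, P_dyn ≈ α·C·Vdd²·f. The sketch below applies that model only; it ignores leakage and any clock slowdown needed at lower Vdd, and the operating points are hypothetical values chosen for illustration, not measurements from the cited DVS work.

```python
def dynamic_power_ratio(vdd_new, vdd_old, f_new=1.0, f_old=1.0):
    """Ratio of switching power after scaling, using P ~ C * Vdd^2 * f.

    Activity factor and capacitance are assumed unchanged, so they cancel.
    """
    return (vdd_new / vdd_old) ** 2 * (f_new / f_old)


# Hypothetical operating points (same clock frequency), for illustration only.
ratio = dynamic_power_ratio(vdd_new=0.75, vdd_old=0.90)
print(f"dynamic power: {ratio:.2f}x of the starting point")
```

Because the model is quadratic in Vdd, a supply reduction of roughly one sixth is enough for a reduction of about 30% in switching power at constant frequency.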

slide-13
SLIDE 13

Adaptive & Dynamic Voltage Scaling (DVS)

  • Academic work on DVS
  • Set supply voltage (Vdd) dynamically; it is no longer fixed at nominal
  • Prior work has shown ~30% power reduction
  • Intel SmartVID (adaptive voltage scaling)
  • Each FPGA stores its own supply voltage value, determined during testing
  • Smart power supply sets the supply voltage based on the stored value

13

FPGA         Arria 10    Stratix 10    Agilex
Range (V)    0.85-0.9    0.8-0.94      0.6-1

slide-14
SLIDE 14

Rethinking FPGAs for Variable Supply Voltage

  • FPGAs are moving away from fixed nominal-Vdd operation
  • But FPGAs have always been designed for a fixed Vdd
  • Goals:
  • Evaluate the delay sensitivity of existing FPGA circuits to Vdd
  • Design FPGAs that are better suited for variable Vdd

14

slide-15
SLIDE 15

Outline

  • Background
  • Analyzing Existing FPGA building blocks (logic and routing)
  • VPR analysis over benchmarks
  • Designing new LUTs
  • Summary and Future Work

15

slide-16
SLIDE 16

Background: Island-style FPGA Architecture

16

[Figure: representative FPGA tile, logic cluster (LC), and basic logic element (BLE)]

slide-17
SLIDE 17

Background: Island-style FPGA Architecture

17

[Figure: representative FPGA tile, logic cluster (LC), and basic logic element (BLE), with routing and logic labeled]

slide-18
SLIDE 18

Background: Conventional FPGA Routing MUX

18

[Figure: 9-input two-stage multiplexer with inputs I0-I8; legend: SRAM cell storing 1, SRAM cell storing 0]

slide-19
SLIDE 19

Background: Conventional LUT Circuitry

19

[Figure: tree-based 6-input LUT multiplexer with SRAM cells]

slide-20
SLIDE 20

Background: Conventional LUT Circuitry

20

[Figure: tree-based 6-input LUT multiplexer with SRAM cells, plus a routing MUX that connects one of the LC inputs to a LUT input]

slide-21
SLIDE 21

Outline

  • Background
  • Analyzing Existing FPGA building blocks (logic and routing)
  • VPR analysis over benchmarks
  • Designing new LUTs
  • Summary and Future Work

21

slide-22
SLIDE 22

Analyzing Existing FPGAs: Block-level (Silicon Measurements)

22

Setup to measure path delays on a Stratix V FPGA

slide-23
SLIDE 23

Analyzing Existing FPGAs: Block-level (Silicon Measurements)

23

Measuring different types of paths on Stratix V

Setup to measure path delays on a Stratix V FPGA

slide-24
SLIDE 24

Analyzing Existing FPGAs: Block-level (Silicon Measurements)

24

Measuring different types of paths on Stratix V

LUT delay is more sensitive to Vdd

Setup to measure path delays on a Stratix V FPGA

slide-25
SLIDE 25

Analyzing Existing FPGAs: Block-level (Spice Simulations)

25

slide-26
SLIDE 26

Analyzing Existing FPGAs: Block-level (Spice Simulations)

26

Routing delay increases as Vdd rises above nominal, because of the gate-boosted pass transistors

slide-27
SLIDE 27

Analyzing Existing FPGAs: Block-level (Spice Simulations)

27

LUTs get much slower at lower Vdd

Routing delay increases as Vdd rises above nominal, because of the gate-boosted pass transistors

slide-28
SLIDE 28

Outline

  • Background
  • Analyzing Existing FPGA building blocks (logic and routing)
  • VPR analysis over benchmarks
  • Designing new LUTs
  • Summary and Future Work

28

slide-29
SLIDE 29

VTR Benchmarks' Critical-Path (CP) Delay Breakdown

29

slide-30
SLIDE 30

VTR benchmarks’ CP Delay Breakdown

30

  • Nominal: ~75% routing, ~15% LUT

slide-31
SLIDE 31

VTR benchmarks’ CP Delay Breakdown

31

  • Nominal: ~75% routing, ~15% LUT
  • 0.6 V: ~45% routing, ~50% LUT

slide-32
SLIDE 32

VTR benchmarks’ CP Delay Breakdown

32

  • Nominal: ~75% routing, ~15% LUT
  • 0.6 V: ~45% routing, ~50% LUT

Redesign LUTs
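
The two breakdowns above already imply how much faster LUT delay grows than routing delay as Vdd drops, even without the absolute delays. A small worked calculation, using only the approximate shares quoted on the slide (the function name is just for illustration):

```python
# Approximate critical-path (CP) delay shares quoted on the slide.
nominal = {"routing": 0.75, "lut": 0.15}
low_vdd = {"routing": 0.45, "lut": 0.50}  # at 0.6 V


def relative_growth(before, after, a, b):
    """How many times more component `a` grew than component `b`.

    Shares are fractions of the total CP delay at each voltage, so the
    unknown total delays cancel out of the ratio.
    """
    return (after[a] / before[a]) / (after[b] / before[b])


ratio = relative_growth(nominal, low_vdd, "lut", "routing")
print(f"LUT delay grew about {ratio:.1f}x more than routing delay")
```

Since the inputs are rounded percentages, the result is only a ballpark figure, but it supports the conclusion that the LUT is the block to redesign.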

slide-33
SLIDE 33

Outline

  • Background
  • Analyzing Existing FPGA building blocks (logic and routing)
  • VPR analysis over benchmarks
  • Designing new LUTs
  • Summary and Future Work

33

slide-34
SLIDE 34

Proposed LUTs: Decode LUT Inputs (decode LUT)

  • Decrease number of pass transistors in series
  • Reduce number of transistors in a 6-input LUT

34

[Figure: conventional LUT (baseline) vs. decode LUT]
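
To illustrate the idea of trading series pass transistors for decoded select lines, here is a purely behavioral sketch. It assumes two inputs are decoded into a one-hot select that drives a single 4:1 level (the hybrid LUTs later in the deck decode two inputs, but the exact circuit is not given in these slides). The function names and the mux-level count are illustrative, not the authors' netlist.

```python
import random
from itertools import product

K = 6  # LUT inputs


def tree_lut(mask, bits):
    """Conventional tree LUT: one 2:1 mux level per input.

    mask: list of 2**K SRAM bits; bits: K input values (bits[0] selects first).
    Returns (output, number of mux levels in series on the signal path).
    """
    level, stages = list(mask), 0
    for b in bits:
        level = [level[2 * i + b] for i in range(len(level) // 2)]
        stages += 1
    return level[0], stages


def decode_lut(mask, bits):
    """Assumed decode LUT: the first two inputs form a one-hot select for a
    single 4:1 level; the remaining inputs still use 2:1 levels."""
    sel = bits[0] + 2 * bits[1]  # index of the active one-hot select line
    level, stages = [mask[4 * i + sel] for i in range(len(mask) // 4)], 1
    for b in bits[2:]:
        level = [level[2 * i + b] for i in range(len(level) // 2)]
        stages += 1
    return level[0], stages


rng = random.Random(0)
mask = [rng.randint(0, 1) for _ in range(2 ** K)]
same = all(tree_lut(mask, b)[0] == decode_lut(mask, b)[0]
           for b in product((0, 1), repeat=K))
print(same, tree_lut(mask, (0,) * K)[1], decode_lut(mask, (0,) * K)[1])
```

Under this assumption both structures compute the same function, while the decoded version has one fewer mux level in series.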

slide-35
SLIDE 35

Proposed LUTs: Gate Boosting LUTs (GB LUT)

  • Add level shifter to local MUX
  • Shifts from the low supply voltage to the fixed 1 V SRAM supply

35

Local MUX

slide-36
SLIDE 36

Proposed LUTs: Gate Boosting LUTs (GB LUT)

  • Add level shifter to local MUX
  • Shifts from the low supply voltage to the fixed 1 V SRAM supply
  • LUT input drivers supplied by the fixed 1 V SRAM supply

36

[Figure: local MUX and level shifter; labels: Vddl, Vddh]

slide-37
SLIDE 37

Proposed LUTs: TG LUTs and Hybrid LUTs

  • TG LUTs: use transmission gates (TGs) in LUTs, while keeping pass transistors in routing MUXes
  • Hybrid LUTs:
  • Gate boosting LUTs + decoding slowest two inputs (decode-GB LUT)
  • TG LUTs + decoding slowest two inputs (decode-TG LUT)

37

slide-38
SLIDE 38

LUT Area and Delay Analysis

38

slide-39
SLIDE 39

FPGA Tile (Logic + Routing) Area-Delay Product

39

slide-40
SLIDE 40

FPGA Tile (Logic + Routing) Area-Delay Product

40

  • Proposed LUTs: better FPGAs at nominal Vdd and below
  • Decode-GB LUT: 12% lower area-delay product than baseline at nominal

slide-41
SLIDE 41

VTR Benchmarks’ CP delay (Geomean)

41

  • 14% faster at 0.8 V
slide-42
SLIDE 42

VTR Benchmarks’ CP delay (Geomean)

42

  • 14% faster at 0.8 V
  • 45% faster at 0.6 V
slide-43
SLIDE 43

LUT Power Consumption

43

slide-44
SLIDE 44

LUT Power Consumption

  • Decode-* LUTs have 28% lower power than baseline

44

slide-45
SLIDE 45

LUT Power Consumption

  • Decode-* LUTs have 28% lower power than baseline
  • At 0.8 V, decoding reduces the GB LUT and TG LUT power by 35% and 25%, respectively

45

slide-46
SLIDE 46

LUT Power Consumption: Decoding Effects

46

slide-47
SLIDE 47

LUT Power Consumption: Decoding Effects

47

  • 40% power reduction when input A toggles
  • Power reductions when B or C toggles

slide-48
SLIDE 48

Energy and Energy-Delay² Product

  • Decode-GB: slightly higher energy
  • Decode-*: 14% lower ED² at 0.8 V
  • Decode-*: 60% lower ED² at 0.6 V

48
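
For reference, the energy-delay-squared product is E·D², so delay improvements count quadratically while energy counts linearly. The sketch below shows how a design with slightly higher energy can still win on this metric; the normalized numbers are hypothetical and are not results from the talk.

```python
def ed2(energy, delay):
    """Energy-delay-squared product, E * D^2 (lower is better)."""
    return energy * delay ** 2


# Hypothetical values normalized to a baseline of 1.0, for illustration only.
baseline = ed2(energy=1.00, delay=1.00)
candidate = ed2(energy=1.05, delay=0.90)
print(f"candidate ED^2 relative to baseline: {candidate / baseline:.2f}")
```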

slide-49
SLIDE 49

Outline

  • Background
  • Analyzing Existing FPGA building blocks (logic and routing)
  • VPR analysis over benchmarks
  • Designing new LUTs
  • Summary and Future Work

49

slide-50
SLIDE 50

Summary & Future Work

  • Delay of a conventional FPGA LUT increases 7× when Vdd drops from 0.8 V to 0.6 V
  • Novel LUTs with input decoding and gate boosting
  • Reduce LUT power by 28%
  • VTR benchmarks' geomean CP delay decreases by 14% and 45% at 0.8 V and 0.6 V, respectively
  • Reduce ED² by 14% and 60% at 0.8 V and 0.6 V, respectively
  • Future work
  • Using separate voltage islands for LUTs and routing

50

slide-51
SLIDE 51

Power and Fmax at different supply voltages

  • Decode-* LUTs outperform baseline
  • Decode-GB achieves the largest Fmax

51

slide-52
SLIDE 52

Backup: Area-Delay Product

52

slide-53
SLIDE 53

Should We Rethink CAD Tools for Variable Vdd?

  • VPR limit study: Vnom- vs. Vused-optimization flows

53

[Figure: Vnom-optimization flow: BLIF → VPR with architecture file @ 0.8 V → .place/.route → STA at 0.6 V and STA at 1.0 V → CP delay at each voltage]

slide-54
SLIDE 54

Should We Rethink CAD Tools for Variable Vdd?

  • VPR limit study: Vnom- vs. Vused-optimization flows

54

[Figure: Vnom-optimization flow: BLIF → VPR with architecture file @ 0.8 V → .place/.route → STA at 0.6 V and STA at 1.0 V → CP delay at each voltage]

[Figure: Vused-optimization flow: BLIF → separate VPR runs with architecture files @ 0.6 V, 0.7 V, and 1 V → .place/.route for each run → CP delay for each run]

slide-55
SLIDE 55

Geomean CP Delay of VTR Benchmarks

  • No obvious gains from Vused-optimization
  • Better to focus on circuit optimizations

55

slide-56
SLIDE 56

Background: FPGA LUT and Routing Circuitry

56

[Figure: two-stage routing multiplexer and tree-based 6-input LUT multiplexer]

slide-57
SLIDE 57

Power Modelling

  • Single-input blocks: routing multiplexers, LUT input drivers, etc.
  • Use HSPICE to monitor the current drawn by the block during an input transition

57

[Figure: simulation setup: input source, block, load]

slide-58
SLIDE 58

Power Modelling

  • Single-input blocks: routing multiplexers, LUT input drivers, etc.
  • Use HSPICE to monitor the current drawn by the block during an input transition
  • LUTs have multiple inputs, and the current drawn depends on the LUT mask
  • Generate hundreds of random LUT masks, and for each mask:
  • Monitor the current drawn when each of the LUT inputs toggles

58

[Figure: simulation setup: input source, block, load]
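
The mask-sampling step can be sketched in a few lines. This is not the authors' flow (the current measurement itself is done in HSPICE); it only shows one way to generate random masks and enumerate the (mask, input) cases. As an assumed logical proxy for why the mask matters, it also computes how often toggling each input flips the LUT output. All names here are hypothetical.

```python
import random

K = 6  # inputs of the LUT under test (assumed 6-input, as elsewhere in the deck)


def random_masks(count, seed=0):
    """Return `count` random LUT masks, each an integer holding 2**K SRAM bits."""
    rng = random.Random(seed)
    return [rng.getrandbits(1 << K) for _ in range(count)]


def lut_output(mask, inputs):
    """Evaluate the LUT; bit i of `inputs` is the value on LUT input i."""
    return (mask >> inputs) & 1


def toggle_fraction(mask, pin):
    """Fraction of settings of the other inputs for which toggling `pin`
    flips the LUT output (a logical proxy only, not a power number)."""
    flips = 0
    for others in range(1 << K):
        if others & (1 << pin):
            continue  # visit each setting of the other inputs once
        flips += lut_output(mask, others) != lut_output(mask, others | (1 << pin))
    return flips / (1 << (K - 1))


# Enumerate the (mask, input) cases a simulation campaign would cover.
for mask in random_masks(3):
    print([round(toggle_fraction(mask, pin), 2) for pin in range(K)])
```

In a real campaign each (mask, input) pair would become one circuit simulation, and the measured currents would then be averaged over the masks.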

slide-59
SLIDE 59

VTR benchmarks’ Active Power Breakdown

59

slide-60
SLIDE 60

VTR benchmarks’ Active Power Breakdown

60

  • Routing consistently contributes ~78% of the FPGA active power.