SLIDE 1

University of Michigan EnA-HPC -- September 7, 2011

Near-Threshold Computing: Reclaiming Moore’s Law

  • Dr. Ronald G. Dreslinski

Research Fellow, University of Michigan – Ann Arbor

SLIDE 2

Motivation

[Figure: Transistors (100,000's), Power (W), Performance (GOPS), and Efficiency (GOPS/W) on a log scale from 0.001 to 1,000,000, plotted against year from 1985 to 2020]

Limits on heat extraction
Limits on energy-efficiency of operations
Stagnates performance growth

SLIDE 3

Motivation

[Figure: Transistors (100,000's), Power (W), Performance (GOPS), and Efficiency (GOPS/W) on a log scale from 0.001 to 1,000,000, plotted against year from 1985 to 2020; the timeline is split at c. 2000 into the "Era of High Performance Computing" and the "Era of Energy-Efficient Computing"]

With the help of some better thermal management…

Goal: To increase energy-efficiency of operations
Result: Continue scaling trends that fueled the computing revolution

SLIDE 4

Outline

  • Define a new region of operation, Near-Threshold Computing
  • Explore new architectures enabled by key insights of computing in the NTC region
  • Present an initial design of a 3D stacked NTC system, Centip3De

SLIDE 5

Dark Silicon—The emerging dilemma: More and more gates can fit on a die, but not all can be turned on at the same time

Environmental Concerns Form factor vs. Battery Life

Power Density Limitations

Circuit supply voltages are no longer scaling…
Power does not decrease at the same rate that transistor count increases

A = gate area, scaling 1/s²
C = capacitance, scaling < 1/s

Dynamic dominates

Stagnant Shrinking
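The scaling argument on this slide can be checked with a little arithmetic. The sketch below assumes the standard CMOS dynamic power relation P = C·V²·f per gate, which the slide does not state, and a hypothetical scaling factor s = 1.4 per generation. It takes capacitance as scaling exactly 1/s for simplicity and gate area as 1/s², then compares a stagnant supply voltage with the classic case in which voltage also scales.

```python
def power_density_ratio(s, voltage_scale=1.0, freq_scale=1.0):
    """Relative change in dynamic power density after one scaling step.

    Assumes dynamic power per gate P = C * V^2 * f (standard CMOS relation,
    not stated on the slide), capacitance scaling as 1/s, and gate area
    scaling as 1/s^2.
    """
    cap_scale = 1.0 / s          # C taken as exactly 1/s (slide: "< 1/s")
    area_scale = 1.0 / s ** 2    # A scales as 1/s^2
    power_scale = cap_scale * voltage_scale ** 2 * freq_scale
    return power_scale / area_scale


s = 1.4  # hypothetical per-generation scaling factor
stagnant = power_density_ratio(s)  # supply voltage and frequency held fixed
classic = power_density_ratio(s, voltage_scale=1 / s, freq_scale=s)
print(f"Vdd fixed:   power density x{stagnant:.2f} per generation")
print(f"Vdd scaling: power density x{classic:.2f} per generation")
```

Under these assumptions, power density stays flat only while the supply voltage scales with the geometry; once the voltage stagnates, density grows by roughly s per generation, which is the dark-silicon pressure the slide describes.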

SLIDE 6

Today: Super-Vth, High Performance, Power Constrained

[Figure: Energy / Operation and Log (Delay) versus Supply Voltage in the Super-Vth region, with Vth and Vnom marked; axis: Normalized Power, Energy, & Performance; labeled operating point: Core i7, 3+ GHz, 0.5 mW/MHz]

Energy per operation is the key metric for efficiency. Goal: same performance, low energy per operation

SLIDE 7

Subthreshold Design

[Figure: Energy / Operation and Log (Delay) versus Supply Voltage across the Sub-Vth and Super-Vth regions, with Vth and Vnom marked; annotations: 500 – 1000X, 12-16X]

Operating in the sub-threshold region gives us huge power gains at the expense of performance → OK for sensors!

SLIDE 8

Evolution of Subthreshold Designs

Phoenix 2 Design (2010)

  • 0.18 µm CMOS
  • Commercial ARM M3 Core
  • Used to investigate:
      • Energy harvesting
      • Power management
  • 37.4 µW/MHz

Subliminal 2 Design (2007)

  • 0.13 µm CMOS
  • Used to investigate process variation
  • 3.5 µW/MHz

Subliminal 1 Design (2006)

  • 0.13 µm CMOS
  • Used to investigate existence of Vmin
  • 2.60 µW/MHz

Phoenix 1 Design (2008)

  • 0.18 µm CMOS
  • Used to investigate sleep current
  • 2.8 µW/MHz / 30pW sleep power

SLIDE 9

Near-Threshold Computing (NTC)

[Figure: Energy / Operation and Log (Delay) versus Supply Voltage across the Sub-Vth and Super-Vth regions, with Vth and Vnom marked; annotations: ~10X, ~50-100X, ~2X, ~6-8X]

Near-Threshold Computing (NTC):

  • >60X power reduction
  • 6-8X energy reduction
  • Invest portion of extra transistors from scaling to overcome barriers
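The power and energy figures on this slide are linked by simple arithmetic: average power is energy per operation times operations per second. The sketch below assumes that the ~10X annotation in the figure is the delay penalty of moving from nominal voltage to NTC (an interpretation, not something the slide states outright), and shows how a 6-8X energy reduction then gives a power reduction of 60X or more.

```python
def power_reduction(energy_reduction, slowdown):
    """Power = energy per operation * operations per second.

    If energy per operation drops by `energy_reduction` and throughput
    drops by `slowdown`, average power drops by their product.
    """
    return energy_reduction * slowdown


slowdown = 10  # assumed: the ~10X annotation is the NTC delay penalty
for energy_reduction in (6, 8):  # 6-8X energy reduction from the slide
    reduction = power_reduction(energy_reduction, slowdown)
    print(f"{energy_reduction}X less energy/op at {slowdown}X lower speed "
          f"-> {reduction}X lower power")
```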

SLIDE 10

Silicon Verification of Trends

Phoenix 2 Design [Seok’11]

  • 180nm Design
  • 1.8V -> 700mV
  • ~10x NTC Performance Loss
  • ~7x NTC Energy Reduction

Seok ISSCC 2011

Phoenix 2 Processor

SLIDE 11

NTC – Opportunities and Challenges

  • Challenges:
      • Low Voltage Memory
          • New SRAM designs
          • Robustness analysis at near-threshold
      • Variation
          • Razor [Ernst’03] and other in-situ delay monitoring
          • Adaptive body biasing
      • Performance Loss
          • Many-core designs to improve parallelism
          • Core boosting to improve single thread performance
  • Opportunities:
      • New architectures
      • Optimized Processes
      • 3D Integration – less thermal restrictions

SLIDE 12

Outline

  • Define a new region of operation, Near-Threshold Computing
  • Explore new architectures enabled by key insights of computing in the NTC region
  • Present an initial design of a 3D stacked NTC system, Centip3De

SLIDE 13

Minimum Energy SRAM

  • SRAM has a lower activity rate than logic
  • VDD for minimum energy operation (VMIN) is higher for SRAM than for logic
  • Running logic at VMIN for SRAM has a small energy penalty with increased performance

[Figure: Leakage, Dynamic, and Total energy curves]
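The link between activity rate and VMIN can be illustrated with a toy energy model. The sketch below is a simplified, textbook-style subthreshold approximation and not the model behind the slide's figure: dynamic energy scales with activity × VDD², while leakage energy grows as VDD drops because the cycle time over which leakage is integrated grows exponentially. The constants `leak_factor` and `n_vt` and both activity values are hypothetical.

```python
import math


def energy_per_op(vdd, activity, leak_factor=1000.0, n_vt=0.039):
    """Normalized energy per operation in a simplified subthreshold model.

    dynamic ~ activity * Vdd^2
    leakage ~ leak_factor * Vdd^2 * exp(-Vdd / n_vt)
    leak_factor and n_vt (slope factor times thermal voltage, in volts)
    are hypothetical constants chosen only for illustration.
    """
    dynamic = activity * vdd ** 2
    leakage = leak_factor * vdd ** 2 * math.exp(-vdd / n_vt)
    return dynamic + leakage


def find_vmin(activity, v_lo=0.15, v_hi=1.2, step=0.001):
    """Grid-search the supply voltage that minimizes energy per operation."""
    steps = int(round((v_hi - v_lo) / step))
    candidates = [v_lo + i * step for i in range(steps + 1)]
    return min(candidates, key=lambda v: energy_per_op(v, activity))


logic_vmin = find_vmin(activity=0.10)  # hypothetical logic activity rate
sram_vmin = find_vmin(activity=0.01)   # hypothetical, lower SRAM activity
print(f"logic VMIN ~ {logic_vmin:.2f} V, SRAM VMIN ~ {sram_vmin:.2f} V")
```

In this model, lowering the activity rate makes leakage a larger share of the total, which pushes the minimum-energy voltage upward, in line with the slide's claim for SRAM.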

SLIDE 14

New NTC Architectures

Key Insight:

  • SRAM is run at a higher VDD than cores with little energy penalty, allowing caches to operate faster than the core

[Figure: two block diagrams. One shows cores, each with its own L1, connected through a BUS / Switched Network to Next Level Memory. The other shows clusters of cores, each cluster sharing an L1, connected through a BUS / Switched Network to Next Level Memory]

Design Levers:

  • Operating Voltage
  • L1 Size
  • Number of Cores per Cluster
  • Number of Clusters

SLIDE 15

L1 Cache Size Tradeoff

[Figure: Core, L1, and L2 shown for two L1 sizes; the larger L1 is labeled Decreased Miss Rate and Higher Energy/Access]

SLIDE 16

Results – Energy Optimal L1 Size (Single Core)

  • Energy dependency on L1 size
  • Trade-off between L1 and L2 access
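The L1 versus L2 trade-off can be sketched as an average energy per memory access. The model below is purely illustrative: it assumes that L1 access energy grows with the square root of capacity, that the miss rate falls with the inverse square root, and that each miss costs one L2 access. All constants are hypothetical and do not come from the slides.

```python
def energy_per_access(l1_kb, e_l1_base=1.0, miss_base=0.10, e_l2=50.0):
    """Average energy per memory access for a given L1 size (arbitrary units).

    Hypothetical model: L1 access energy grows with sqrt(size), the miss
    rate falls with 1/sqrt(size), and every miss pays one L2 access.
    """
    rel = (l1_kb / 8.0) ** 0.5      # size relative to an 8 kB L1
    e_l1 = e_l1_base * rel          # larger L1: higher energy per access
    miss_rate = miss_base / rel     # larger L1: decreased miss rate
    return e_l1 + miss_rate * e_l2


sizes_kb = [8, 16, 32, 64, 128, 256]
best = min(sizes_kb, key=energy_per_access)
for size in sizes_kb:
    print(f"{size:>4} kB: {energy_per_access(size):.2f}")
print("energy-optimal L1 size:", best, "kB")
```

With these made-up constants the optimum lands at an intermediate size: a smaller L1 pays for too many L2 accesses, and a larger one pays too much on every hit.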

SLIDE 17

Clustering Tradeoffs

[Figure: four CPUs with four private L1s sharing an L2, compared with four CPUs sharing two L1s and an L2]

Tradeoffs

  • + Clustered Sharing
  • Cluster Conflict
  • New Bus
  • L1 Speed

SLIDE 18

Energy Optimal Cluster-based CMP (Fixed Die Size)

SLIDE 19

Full Space Analysis

SLIDE 20

Various Scaling Methods

  • Baseline
      • Single CPU @ 233MHz
  • Simple CMP
      • One core per L1
      • Vdd scaling
  • Proposed cluster-based CMP
      • Multiple cores per L1
      • Vdd scaling

[Figure: Normalized Energy/Operation (0.2 to 1), broken down into L2, L1, and Core, for Uniprocessor, CMP w/ DVFS, and NTC; annotations: 38%, 53%, 71%; 4 Cores, 4 L1’s; 2 Cores/Cluster, 3 Clusters]

SLIDE 21

Energy Optima for SPLASH2

  • Cluster based architecture with Vdd and Vth scaling
  • Optimal cluster size is 2 for most of the apps
  • Rad chooses a non-clustered CMP
  • Average: 74% over baseline, 55% over simple CMP

app   nc   k   L1 size/kB   energy savings over baseline   energy savings over simple CMP
cho   3    2   64           70.8%                          52.8%
fft   2    2   32           72.6%                          68.5%
fmm   8    2   128          79.7%                          41.6%
luc   3    2   32           77.8%                          64.4%
lun   2    2   64           69.2%                          58.0%
rad   16   1   128          84.2%                          35.1%
ray   3    2   128          65.1%                          54.9%
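As a quick consistency check, the averages quoted on this slide can be recomputed from the seven rows listed. The sketch below copies the two savings columns from the table; for these rows the mean saving over baseline comes out at about 74%, matching the slide, while the mean saving over simple CMP comes out near 54%, slightly below the quoted 55%. The difference may reflect rounding or data not shown on the slide.

```python
# (benchmark, savings over baseline %, savings over simple CMP %)
# Values copied from the table on this slide.
rows = [
    ("cho", 70.8, 52.8),
    ("fft", 72.6, 68.5),
    ("fmm", 79.7, 41.6),
    ("luc", 77.8, 64.4),
    ("lun", 69.2, 58.0),
    ("rad", 84.2, 35.1),
    ("ray", 65.1, 54.9),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_over_baseline = mean(row[1] for row in rows)
avg_over_simple_cmp = mean(row[2] for row in rows)
print(f"average savings over baseline:   {avg_over_baseline:.1f}%")
print(f"average savings over simple CMP: {avg_over_simple_cmp:.1f}%")
```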

SLIDE 22

Energy Optima w/ Performance Requirements

  • Cluster based approach provides best savings
  • Traditional approach only saves energy at high end

[Figure annotations: 53%, 32%, 20%]

SLIDE 23

Outline

  • Define a new region of operation, Near-Threshold Computing
  • Explore new architectures enabled by key insights of computing in the NTC region
  • Present an initial design of a 3D stacked NTC system, Centip3De

SLIDE 24

A Closer Look at Wafer-Level Stacking

[Figure: wafer cross-section. Labels: Dielectric (SiO2/SiN); Gate Poly; STI (Shallow Trench Isolation); Oxide; Silicon; W (Tungsten contact & via); Al (M1 – M5); Cu (M6, Top Metal); “Super-Contact”]

Illustration from Bob Patti, Tezzaron

SLIDE 25

Next, Stack a Second Wafer & Thin:

SLIDE 26

Then, Stack a Third Wafer:

[Figure labels: 3rd wafer; 2nd wafer; 1st wafer: controller]

SLIDE 27

Centip3De – 3D NTC Prototype

[Figure: layer stack. Labels: Logic - B (two layers); Logic - A (two layers); F2F Bond (two); DRAM Sense/Logic – Bond Routing; DRAM (two layers)]

Centip3De Design

  • 130nm, 7-Layer 3D-Stacked Chip
  • 128 ARM M3 Cores
  • 150 mm²

SLIDE 28

  • 1.9 GOPS (3.8 GOPS in Boost)
  • Max 1 IPC per core
  • 128 Cores
  • 15 MHz
  • 130 mW (691mW in Boost)
  • 14.6 GOPS/W (5.5 in Boost)

Design Scaling and Power Breakdowns

NTC Centip3De System

Power breakdown (mW), Raytracing Benchmark

Mode      Cores   I-Caches   D-Caches   DRAM
NTC       42      2.9        7.0        39
Boosted   336     28         67         45

  • Naïve Scaling to 22nm yields ~200GOPS/W
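The headline numbers on this slide can be cross-checked against each other. The sketch below uses only figures quoted on the slide (128 cores, 15 MHz, at most 1 IPC per core, 130 mW, and the boost-mode values); the helper name `efficiency` is introduced here for illustration.

```python
cores = 128
clock_hz = 15e6   # 15 MHz
ipc = 1           # max 1 instruction per cycle per core

# Peak throughput in giga-operations per second.
peak_gops = cores * clock_hz * ipc / 1e9


def efficiency(gops, power_mw):
    """Return energy efficiency in GOPS per watt."""
    return gops / (power_mw / 1000.0)


ntc_eff = efficiency(1.9, 130)     # NTC mode figures from the slide
boost_eff = efficiency(3.8, 691)   # boost mode figures from the slide
print(f"peak throughput: {peak_gops:.2f} GOPS")
print(f"NTC mode:   {ntc_eff:.1f} GOPS/W")
print(f"boost mode: {boost_eff:.1f} GOPS/W")
```

The peak of 1.92 GOPS agrees with the quoted 1.9 GOPS, and the two efficiency values reproduce the slide's 14.6 and 5.5 GOPS/W: boosting doubles throughput but costs more than five times the power.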

SLIDE 29

Conclusions

  • Observed Voltage Scaling and Thermal Limits reducing the gains of Moore’s Law
  • Defined a new computational operating region: Near Threshold Computing
  • Leveraged key insights of NTC for new clustered architectures
  • Initial ideas of a 3D integrated NTC system, Centip3De

SLIDE 30

Related References

  • Ronald G. Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, Trevor Mudge, “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proceedings of the IEEE, Special Issue on Ultra-Low Power Circuit Technology, Vol. 98, No. 2, February 2010, pg. 253 – 266.

  • Bo Zhai, Ronald G. Dreslinski, Trevor Mudge, David Blaauw, Dennis Sylvester, “Energy Efficient Near-threshold Chip Multi-processing,” ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), August 2007, Best Paper Nomination.

  • Dan Ernst, Shidhartha Das, Seokwoo Lee, David Blaauw, Todd Austin, Trevor Mudge, Nam Sung Kim, Krisztian Flautner, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE Micro, Vol. 24, No. 6, November-December 2004, pg. 10 – 20.

  • Mingoo Seok, Dongsuk Jeon, Chaitali Chakrabarti, David Blaauw, Dennis Sylvester, “A 0.27V, 30MHz, 17.7nJ/transform 1024-pt complex FFT core with super-pipelining,” IEEE International Solid-State Circuits Conference (ISSCC), February 2011, to appear.

SLIDE 31

Backup

SLIDE 32

Logic vs. Memory

  • To maintain the same robustness at low voltages, SRAM cell sizes need to be increased to compensate for the effects of process variation
  • Increased size leads to higher energy consumption and longer interconnects

SLIDE 33

Proposed Parallel Architecture

SLIDE 34

Energy Optimal Vth Selection

  • Vth is very high
      • Energy optimal Vdd is independent of Vth
      • Free performance gain without consuming more energy
  • As Vth reduces
      • Circuit operates faster
      • More leakage, more energy consumption per switching
  • Choose Vth
      • Body bias
      • Dopant implant