Rethinking DRAM Power Modes for Energy Proportionality

SLIDE 1

Rethinking DRAM Power Modes for Energy Proportionality

Krishna Malladi¹, Ian Shaeffer², Liji Gopalakrishnan², David Lo¹, Benjamin Lee³, Mark Horowitz¹

¹Stanford University, ²Rambus Inc., ³Duke University

ktej@stanford.edu

SLIDE 2

Main Memory in Datacenters

Server power is the main energy bottleneck in datacenters

  • PUE of ~1.1, so the rest of the system is energy efficient

Significant main memory (DRAM) power

  • 25-40% of server power across all utilization points
  • Low dynamic range
  • No energy proportionality

SLIDE 4

Outline

Inefficiencies of DRAM interfaces

Energy-proportionality via fast DRAM interfaces

  • MemBlaze
  • MemCorrect
  • MemDrowsy

SLIDE 9

DDR3 Energy & Powermodes

DDR3 optimized for high bandwidth

  • High-speed interface with DLLs, CLKs, ODTs
  • Very high static power in active-idle

Hard to powerdown to deep states

  • Long, impractical wakeup time to power up the interface
  • Insufficient idleness in workloads
  • Significant active-idle time: 88%!

Power Mode        DIMM Idle Power (W)   Exit Latency (ns)
Active Idle       5.36                  -
Fast Powerdown    2.79                  20
Deep Powerdown    0.92                  768
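
To make the table concrete, the sketch below computes a time-weighted average DIMM idle power across the three modes. The power figures come from the table; the residency fractions are hypothetical inputs chosen only for illustration and are not measurements from the talk.

```python
# DIMM idle power per mode in watts, taken from the table above.
IDLE_POWER_W = {
    "active_idle": 5.36,
    "fast_powerdown": 2.79,
    "deep_powerdown": 0.92,
}


def average_idle_power(residency: dict[str, float]) -> float:
    """Time-weighted average idle power for the given mode residencies.

    `residency` maps a mode name to the fraction of idle time spent in it.
    The fractions must sum to 1.
    """
    total = sum(residency.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(IDLE_POWER_W[mode] * frac for mode, frac in residency.items())


# Hypothetical residency splits (assumptions, not data from the slides).
mostly_active = average_idle_power({"active_idle": 0.9, "fast_powerdown": 0.1})
mostly_deep = average_idle_power({"active_idle": 0.1, "deep_powerdown": 0.9})
print(f"mostly active-idle: {mostly_active:.2f} W")
print(f"mostly deep powerdown: {mostly_deep:.2f} W")
```

Under this simple model, average idle power only drops substantially when most idle time is spent in deep powerdown, which is the mode with the long exit latency.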

SLIDE 14

Path to Energy-Proportionality

  • Reduce active-idle power
  • Reduce time in active-idle
  • Increase time in power-down
  • Reduce power-down power

SLIDE 15

DRAM Interfaces

Bits are short

  • Sampling window is only 625ps

Data (DQ) and Clock (CLK) signals forwarded to DRAM

  • Write data aligned to Clock edges
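
As a quick check on the 625ps figure, the bit time on a data pin is the reciprocal of the per-pin transfer rate. The sketch below assumes the slide refers to a 1600 MT/s interface; the slide does not state the data rate, so that mapping is an assumption. The function name is illustrative.

```python
def bit_time_ps(transfers_per_second: float) -> float:
    """Duration of one bit on a data pin, in picoseconds."""
    return 1e12 / transfers_per_second


# Assumed data rate: 1600 MT/s per pin gives the 625 ps window on the slide.
print(bit_time_ps(1600e6))  # 625.0

# For comparison, the DDR3-1333 parts listed in the methodology.
print(round(bit_time_ps(1333e6), 1))
```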

SLIDE 16

DRAM Interfaces

Dynamic chip variations affect Reads

  • PVT variations
  • Misaligned DQS and CLK signals
  • Non-deterministic Read timing
  • Incorrect sampling

SLIDE 17

DRAM Interfaces

On-chip DLLs

  • Adjust delay to match chip temperature, voltage variations
  • Align DQS, DQ to CLK

Power hungry with long settling time, so poor powermodes

SLIDE 18

Live with Slow-Powerup

S/W mechanisms

  • Batch requests, (or) subset ranks, (or) predict idleness
  • Degrades application performance
  • Degraded device density

H/W mechanisms

  • Statically disable DLLs in BIOS
  • Statically lowers bandwidth
  • Worse performance

Use current deep powermodes

  • Long memory wake-up latency

SLIDE 19

With Wakeup = 1 µs

  • Energy-delay (E-D) curves are flat
  • Can't win with long wakeups

SLIDE 20

Faster Wakeups

Powerups should be much smaller: 100ns
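
One rough way to see why wakeup latency matters is to estimate the average penalty a request pays for finding the memory powered down. The sketch below is a simplified model and not the evaluation used in the talk: it assumes exponential inter-arrival gaps, a fixed idle timer before powerdown, and that a request arriving after the timer expires pays the full exit latency. The mean gap and timer values are hypothetical.

```python
import math


def expected_wakeup_penalty_ns(
    mean_gap_ns: float, timer_ns: float, exit_latency_ns: float
) -> float:
    """Average wakeup latency added per request under the stated assumptions.

    With exponential gaps, the chance that a gap outlasts the idle timer
    (so the memory is powered down when the next request arrives) is
    exp(-timer / mean_gap).
    """
    p_powered_down = math.exp(-timer_ns / mean_gap_ns)
    return p_powered_down * exit_latency_ns


# Hypothetical traffic: 500 ns mean gap, 20 ns idle timer.
for exit_ns in (1000.0, 100.0, 10.0):
    penalty = expected_wakeup_penalty_ns(500.0, 20.0, exit_ns)
    print(f"exit latency {exit_ns:6.0f} ns -> average penalty {penalty:7.1f} ns")
```

In this model the penalty scales linearly with exit latency, so shrinking the wakeup from about 1 µs to 100ns or less cuts the average cost of aggressive powerdown by the same factor.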

SLIDE 23

Fast DRAM Wakeups

Enabling deep powerdown needs low-latency wakeups

Rearchitect interface to reduce wakeup latency

  • Fast wakeup with MemBlaze

Retain interface but powerdown aggressively

  • Speculative wakeup with MemCorrect
  • Lazy wakeup with MemDrowsy

SLIDE 28

Fast Wakeup with MemBlaze

No DLL

  • Periodic timing reference signal stores DRAM offset in controller
  • Current-mode logic (CML) clocking has fewer variations

Fast turn-on of datapath

  • Capacitive boosting quickly restores bias values

Exit latency ~10ns

SLIDE 29

MemBlaze DRAM + Controller

  • Integrated into DRAMs
  • Fabricated and tested
  • More details in the paper

SLIDE 30

Silicon Results

SLIDE 31

Methodology

Workloads

Memcached

  • Key/value pairs with 100B and 10KB values
  • Zipf popularity distribution with exponential inter-arrival times

Yahoo! Cloud Serving Benchmark (YCSB), SPECjbb

Multiprogrammed (MP) and Multithreaded (MT)

  • SPEC CPU2006, SPEC OMP2001, PARSEC
  • High BW (HB), Medium BW (MB), Low BW (LB)

Architecture

  • 8 OoO Nehalem cores at 3GHz, 8MB shared L3 cache
  • 32 GB DRAM, 2Gb DDR3-1333 chips
  • Fast powerdown baseline, 15 cycle powerdown timer
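
The Memcached traffic described above (Zipf key popularity, exponential inter-arrival times) can be sketched with the standard library alone. This is an illustrative generator and not the authors' harness; the function names, the skew of 1.0, the key count, and the request rate are all assumptions.

```python
import random


def zipf_weights(num_keys: int, skew: float = 1.0) -> list[float]:
    """Unnormalized Zipf weights: the key of rank r gets weight 1 / r**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]


def generate_requests(
    num_requests: int, num_keys: int, rate_per_us: float, seed: int = 0
) -> list[tuple[float, int]]:
    """Return (arrival_time_us, key) pairs.

    Gaps between arrivals are exponential with mean 1 / rate_per_us, and
    keys are drawn with Zipf popularity (key 0 is the most popular).
    """
    rng = random.Random(seed)
    weights = zipf_weights(num_keys)
    now_us = 0.0
    requests = []
    for _ in range(num_requests):
        now_us += rng.expovariate(rate_per_us)
        key = rng.choices(range(num_keys), weights=weights, k=1)[0]
        requests.append((now_us, key))
    return requests


# Hypothetical parameters: 1000 requests over 100 keys, 0.5 requests per µs.
trace = generate_requests(num_requests=1000, num_keys=100, rate_per_us=0.5, seed=1)
print(trace[:3])
```

The gaps between consecutive arrival times in such a trace are the idle periods a powerdown policy has to work with.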

SLIDE 32

MemBlaze Evaluation

  • 66% lower memory energy with MemBlaze fastlock
  • No performance penalty

SLIDE 35

Speculative Wakeup with MemCorrect

Fast wakeup

  • Use deep power-down, which powers off DLL, CLK
  • Transfer speculatively before the long DLL recalibration

Error Detection/Correction

  • Detector fires if power-down period accumulated large skew
  • Corrector waits for recalibration before transfer
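
The speculate-then-correct idea can be summarized as an expected wakeup latency: with probability p the timing is still correct and the transfer proceeds right away, otherwise the corrector waits for recalibration. The sketch below is a simplified model under that reading of the slide. The 768 ns default reuses the deep powerdown exit latency from the DDR3 table as a stand-in for recalibration time, and the 20 ns fast path is a hypothetical value.

```python
def memcorrect_expected_latency_ns(
    p_correct: float, fast_ns: float = 20.0, recalibration_ns: float = 768.0
) -> float:
    """Expected wakeup latency when speculation succeeds with probability p.

    fast_ns is an assumed latency for the speculative path; recalibration_ns
    is paid only when the detector fires and the corrector must wait.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    return p_correct * fast_ns + (1.0 - p_correct) * recalibration_ns


for p in (1.0, 0.9, 0.5, 0.0):
    print(f"p = {p:.1f} -> {memcorrect_expected_latency_ns(p):.1f} ns")
```

As p shrinks, the expected latency approaches the full recalibration time, so the benefit of speculation depends on how often the timing is still valid after powerdown.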

SLIDE 36

MemCorrect Evaluation

Vary probability of correct timing (p)

  • 40% energy savings (esp. for datacenters)

Small p exposes recalibration latency

  • Degrades performance for high-BW apps
  • Increases energy/bit

SLIDE 39

Lazy Wakeup with MemDrowsy

Fast wakeup

  • Wakeup from deep-powerdown
  • Transfer at lower rate before DLL recalibration completes

Reduced Sampling Rate

  • Lower data rate for READs during calibration time (~700ns)
  • Transfer each bit multiple times
  • Wider sampling window
  • Eliminates timing uncertainty
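
The trade-off behind the reduced sampling rate can be sketched numerically: sending each bit Z times widens the sampling window by a factor of Z and stretches the transfer by the same factor. The sketch below assumes Z is an integer repetition count, uses the 625ps bit time quoted earlier in the deck, and an 8-bit burst per pin (DDR3 burst length 8); the function names are illustrative.

```python
def drowsy_window_ps(bit_time_ps: float, z: int) -> float:
    """Sampling window when each bit is held for z bit times."""
    return bit_time_ps * z


def drowsy_transfer_time_ns(num_bits: int, bit_time_ps: float, z: int) -> float:
    """Time to move num_bits over one pin when each bit is sent z times."""
    return num_bits * z * bit_time_ps / 1000.0


BIT_TIME_PS = 625.0  # sampling window quoted earlier in the deck
BURST_BITS = 8       # assumed: one DDR3 burst of length 8 per pin

for z in (1, 2, 4):
    window = drowsy_window_ps(BIT_TIME_PS, z)
    burst = drowsy_transfer_time_ns(BURST_BITS, BIT_TIME_PS, z)
    print(f"Z={z}: window {window:.0f} ps, burst {burst:.1f} ns")
```

The reduced rate applies only to reads issued during the calibration period of roughly 700ns; once the DLL has recalibrated, transfers return to Z=1.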

SLIDE 40

MemDrowsy Evaluation

Vary sampling reduction rate (Z)

  • 40% energy savings for datacenter apps

High Z harms both performance and energy/bit

  • Energy per bit increases from wake-ups, higher bus activity
  • Z=2 more realistic

SLIDE 41

MemCorrect + MemDrowsy

Combine MemCorrect and MemDrowsy

  • If error detected, halve sampling rate instead of backoff
  • ≤10% performance penalty
  • 50% energy/bit savings
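
The difference between plain MemCorrect and the combined scheme comes down to what happens when the error detector fires. The sketch below is a hypothetical controller decision function, not the authors' design: the `WakeupPlan` type, the policy names, and the 700 ns recalibration default (taken from the calibration time quoted for MemDrowsy) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WakeupPlan:
    z: int          # sampling reduction factor used for the transfer
    wait_ns: float  # time spent waiting before the transfer starts


def plan_wakeup(
    detector_fired: bool, policy: str, recalibration_ns: float = 700.0
) -> WakeupPlan:
    """Choose how to transfer right after a deep powerdown exit."""
    if not detector_fired:
        # Timing is still valid: transfer at full rate immediately.
        return WakeupPlan(z=1, wait_ns=0.0)
    if policy == "memcorrect":
        # Back off: wait for DLL recalibration, then transfer at full rate.
        return WakeupPlan(z=1, wait_ns=recalibration_ns)
    if policy == "combined":
        # Halve the sampling rate instead of waiting.
        return WakeupPlan(z=2, wait_ns=0.0)
    raise ValueError(f"unknown policy: {policy}")


print(plan_wakeup(True, "memcorrect"))
print(plan_wakeup(True, "combined"))
```

In the combined case the transfer starts immediately at half rate, which trades a short period of lower bandwidth for avoiding the full recalibration wait.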

SLIDE 42

Conclusion

DDR3 is energy-disproportional

  • DRAMs dissipate high static power

DDR3 interfaces are efficiency bottlenecks

  • High active-idle power
  • Long wake-ups from power modes

Re-architect interfaces with MemBlaze, or use MemCorrect + MemDrowsy

  • Provide fast wake-up from power modes
  • Energy efficiency improves by 40-70%
  • Performance impact is ≤ 10%

SLIDE 43

Thank you for your attention! Questions?

ktej@stanford.edu