µDPM: Dynamic Power Management for the Microsecond Era



slide-1
SLIDE 1

µDPM: Dynamic Power Management 
 for the Microsecond Era

Chih-Hsun Chou


cchou001@cs.ucr.edu

Laxmi N. Bhuyan


bhuyan@cs.ucr.edu

Daniel Wong


danwong@ucr.edu

slide-2
SLIDE 2

HPCA 2019

Computer systems efficiently support . . .

› ns-scale events: masked by microarchitectural techniques
› ms-scale events: masked by OS-level techniques

› µs-scale events: the “Killer Microsecond”

slide-3
SLIDE 3

The Killer Microseconds

“System designers can no longer ignore efficient support for microsecond-scale I/O ... Novel microsecond-optimized system stacks are needed” [1]

[1] Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. 
 Attack of the killer microseconds. Commun. ACM 60, 4 (March 2017), 48-54.

slide-4
SLIDE 4

Computer systems cannot efficiently support microsecond-scale service time

Traditional monolithic services: request → response in ~milliseconds

slide-5
SLIDE 5

Computer systems cannot efficiently support microsecond-scale service time

Emerging microservices: request → response in microseconds

slide-6
SLIDE 6

Microservice Example

Source: Adrian Cockcroft, “Monitoring Microservices and Containers: A Challenge”

~700 microservices

slide-7
SLIDE 7

What are the implications of killer-microsecond service times for Dynamic Power Management?

slide-8
SLIDE 8

Opportunity for DPM – Latency Slack

› Slow down request processing (DVFS)
› Delay request processing (Sleep)

[Figure: tail latency (% of SLA) vs. load (% of peak load). The gap between the tail-latency curve and the SLA line is the latency slack.]

slide-9
SLIDE 9

Dynamic Power Management Overview

› DVFS (Rubik [MICRO’15], Pegasus [ISCA’14])

› Rubik adjusts the frequency f per request

› DVFS + Sleep

› SleepScale [ISCA’14] finds the optimal frequency and C-state depth for 60 s epochs

[Figure: two timelines of frequency f over time t for requests R0–R4; one is divided into Epoch 0 and Epoch 1.]

slide-10
SLIDE 10

Dynamic Power Management Overview

› Sleep states (PowerNap [ASPLOS’09], DreamWeaver [ASPLOS’12], DynSleep [ISLPED’16], CARB [CAL’17])

[Animated figure, built up across slides 10–22: a timeline showing the sleep time, the arrivals of R1, R2, and R3, the target tail latency for each arrival, and the resulting wake point.]

slide-23
SLIDE 23

DPM ineffective w/ microsecond service time

› >250µs: DVFS effective at slowing down request processing
› <250µs: DPM becomes ineffective
› Surprisingly, sleep-based policies outperform DVFS-based policies

[Figure: power (W) vs. average service time (µs, 10 to 1000) for Baseline, Rubik (DVFS), SleepScale (DVFS+Sleep), and DynSleep (deep sleep). The short-service-time region is marked “DPM Ineffective”.]

slide-28
SLIDE 28

Fragmented idle periods → Lost Opportunities

› Short service times fragment idle periods
› Sleep states / request delaying can consolidate idle periods

[Figure: timelines at the same 50% utilization with longer and with shorter service times. Panels compare DVFS and Sleep at 500µs and at 200µs.]

slide-31
SLIDE 31

Significant transition overheads and idle power

Idleness and transition overhead still account for up to ~25% of energy

[Figure: normalized energy (0.6 to 1) for Baseline, Rubik, SleepScale, DynSleep, and Optimal, broken down into busy, idle, C-state transition, and VFS transition energy.]

slide-32
SLIDE 32

DPM inefficiencies

[Timeline figure: R0, then the C0, C3, and C6 states, then R1 arrival and R1. Tail service time = 78µs, C6 residency time = 300µs, tail latency target = 800µs.]

Problems annotated on the timeline:

› Wasted energy (marked twice)
› DVFS limited in closing the latency gap

Solutions:

› Aggressive deep sleep
› Request delaying
› Coordinate DVFS

* SPECjbb timing
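The remark that DVFS is limited in closing the latency gap can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes service time scales inversely with frequency, and the 2.4 GHz / 1.2 GHz range is a hypothetical example, not a figure from the talk. Only the 78µs and 800µs values come from the slide.

```python
def dvfs_latency_gap(tail_service_us, target_us, f_max_ghz, f_min_ghz):
    """Return (service time at f_min, slack left unused), both in microseconds.

    Assumes service time scales inversely with frequency, which is a
    simplification (it ignores memory-bound phases).
    """
    stretched = tail_service_us * f_max_ghz / f_min_ghz
    return stretched, target_us - stretched


# 78 us and 800 us are the SPECjbb numbers from the slide.
# The frequency range is a hypothetical assumption.
stretched, gap = dvfs_latency_gap(78.0, 800.0, f_max_ghz=2.4, f_min_ghz=1.2)
print(f"service at f_min: {stretched:.0f} us, unused slack: {gap:.0f} us")
```

Under these assumptions, even the lowest frequency only stretches the request to a fraction of the target, which is why the remaining slack has to be spent on sleep and request delaying.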

slide-34
SLIDE 34

Key Insight

Careful coordination of DVFS, sleep state, and request delaying is the key to effective DPM with microsecond service times

[Figure: power (W) vs. average service time (µs, 10 to 1000) for Baseline, Rubik (DVFS), SleepScale (DVFS+Sleep), DynSleep (deep sleep), and µDPM.]

slide-35
SLIDE 35

µDPM

› Aggressively deep sleep
› Delay and slow down request processing to finish just-in-time, even under microsecond request service times
› Carefully coordinate DVFS, sleep, and request delaying

[Timeline figure: R0, then the C0, C3, and C6 states, then R1 arrival and R1. Tail service time = 78µs, residency time = 300µs, tail latency target = 800µs. Annotated solutions: aggressive deep sleep, request delaying, coordinate DVFS.]

slide-36
SLIDE 36

Can latency-critical workloads utilize deep sleep states?

Workload     Tail service time (95th percentile)   Target tail latency (95th percentile)
Memcached    33µs                                  150µs
SPECjbb      78µs                                  800µs
Xapian       250µs                                 1100µs
Masstree     1200µs                                2100µs

The figure marks the gap between the two as the opportunity.
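As a quick worked example, the opportunity for each workload can be computed as the target tail latency minus the tail service time. The numbers are the ones listed on the slide. The function name `latency_slack` is made up for this sketch.

```python
# (tail service time, target tail latency) in microseconds, 95th percentile,
# as listed on the slide.
WORKLOADS = {
    "Memcached": (33, 150),
    "SPECjbb": (78, 800),
    "Xapian": (250, 1100),
    "Masstree": (1200, 2100),
}


def latency_slack(workloads):
    """Map each workload to target tail latency minus tail service time (us)."""
    return {name: target - service
            for name, (service, target) in workloads.items()}


slack = latency_slack(WORKLOADS)
for name, value in slack.items():
    print(f"{name}: {value} us of slack")
```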

slide-39
SLIDE 39

Aggressive deep sleep and request delaying

› Wake up after the residency time
› Wake up before the residency time if needed to meet tail latency

[Timeline figure: R0 arrival against the C0, C3, and C6 states; residency time = 300µs.]
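One way to read the two wake-up rules is as a small planning function. It is a minimal sketch under stated assumptions, not the authors' implementation: a single pending request, a fixed frequency, and a wake-up latency `t_wake`. The 89µs used in the examples is the sleep transition time from the evaluation setup, and the arrival time of 50µs is arbitrary.

```python
def plan_wakeup(sleep_start, arrival, residency, target_latency,
                tail_service, t_wake):
    """Plan when to start waking a sleeping core for a pending request.

    All arguments are in microseconds. Returns (wake_time, early), where
    `early` is True if the deadline forces a wake-up before the residency
    time has elapsed.
    """
    residency_end = sleep_start + residency
    # Latest start of the wake-up that still finishes the request on time.
    latest_wake = arrival + target_latency - tail_service - t_wake
    # The decision cannot be made before the request has arrived.
    wake_time = max(arrival, latest_wake)
    return wake_time, wake_time < residency_end


# SPECjbb-like numbers: plenty of slack, so the request is delayed.
specjbb = plan_wakeup(0, 50, 300, 800, 78, 89)
# Memcached-like numbers: the deadline forces an early wake-up.
memcached = plan_wakeup(0, 50, 300, 150, 33, 89)
print(specjbb, memcached)
```

In the first case the core stays asleep well past the 300µs residency time. In the second, the tight target forces a wake-up before the residency time is reached.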

slide-40
SLIDE 40

Estimating Service time and Latency

› Estimate tail service time

› Statistical performance model [2]
› Online periodic resampling (100ms)

› Estimate request tail latency

› $L = W + T_{wake} + T_{dvfs} + S_i / f$

[Timeline figure: R0 arrival against the C0, C3, and C6 states, with segments labeled $W$, $T_{wake} + T_{dvfs}$, and $S_i / f$.]

[2] Kasture, Harshad, Davide B. Bartolini, Nathan Beckmann, and Daniel Sanchez. "Rubik: Fast analytical power management for latency-critical systems." MICRO 2015.
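The two estimation steps can be sketched as follows. This is a simplified stand-in and not the statistical model from [2]: the tail service time is taken as a nearest-rank percentile over a window of recent samples, and the frequency `freq` is assumed to be normalized so that the nominal frequency is 1.0. The function names are hypothetical.

```python
def tail_service_time(samples_us, percentile=95):
    """Nearest-rank percentile of recent service-time samples (us)."""
    ordered = sorted(samples_us)
    # Ceiling division gives the nearest-rank index (1-based).
    rank = max(1, -(-percentile * len(ordered) // 100))
    return ordered[rank - 1]


def estimate_latency(wait_us, t_wake_us, t_dvfs_us, service_us, freq):
    """L = W + T_wake + T_dvfs + S_i / f, with f normalized to nominal = 1.0."""
    return wait_us + t_wake_us + t_dvfs_us + service_us / freq


# Resampling would rerun this on a fresh window (every 100 ms on the slide).
window = list(range(1, 101))          # hypothetical samples: 1..100 us
s_tail = tail_service_time(window)    # 95th percentile of the window
latency = estimate_latency(100, 89, 10, 78, freq=0.5)
print(s_tail, latency)
```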

slide-42
SLIDE 42

Detecting critical request arrival

› A request is critical if the inter-arrival time between two consecutive requests is no longer than the tail service time
› R1 is critical if $t_{R1} - t_{R0} \le S^{f}_{tail}$

[Timeline figure: R0 and R1 against the C0, C3, and C6 states. R1 arrives soon after R0 (marked R1*, “Critical!”), leading to a QoS violation.]
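The criticality test is a single comparison. The sketch below assumes that the superscript f means the tail service time at the current frequency, computed as the nominal tail service time divided by a normalized frequency. That reading is an assumption, and the function name is made up.

```python
def is_critical(t_prev_arrival, t_arrival, tail_service, freq=1.0):
    """True if the inter-arrival gap is no longer than the tail service
    time at the current (normalized) frequency. Times in microseconds."""
    return (t_arrival - t_prev_arrival) <= tail_service / freq


# With a 78 us tail service time, a request 100 us after the previous one
# is critical at half frequency but not at nominal frequency.
print(is_critical(0, 100, 78, freq=0.5), is_critical(0, 100, 78, freq=1.0))
```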

slide-46
SLIDE 46

Coordinate frequency, sleep, delay

› R1 is critical if $t_{R1} - t_{R0} \le S^{f}_{tail}$
› New frequency: $f' = S_{tail} / (T^{TargetCompletion}_{R_i} - T^{Completion}_{R_{i-1}})$
› Reset to lowest frequency on wake up
› Only increase frequency on reconfiguration

[Timeline figure: R0 and R1 against the C0, C3, and C6 states. Without coordination, R1 causes a QoS violation. Annotations: increase frequency on wakeup; can sleep longer due to higher frequency.]
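The frequency rule divides the tail service time by the window between the previous request's completion and the current request's target completion. The sketch below assumes the tail service time is measured at the nominal frequency (normalized to 1.0) and clamps the result to a hypothetical range of 0.5 to 1.0. The clamping and the function name are additions for illustration.

```python
def required_frequency(tail_service, target_completion, prev_completion,
                       f_min=0.5, f_max=1.0):
    """Normalized frequency needed to finish within the available window.

    Implements f' = S_tail / (target completion of R_i - completion of
    R_(i-1)), clamped to [f_min, f_max]. Times in microseconds.
    """
    window = target_completion - prev_completion
    if window <= 0:
        return f_max  # no time left: run as fast as possible
    return min(f_max, max(f_min, tail_service / window))


print(required_frequency(78, 900, 800))   # 100 us window
print(required_frequency(78, 1000, 0))    # wide window, clamped to f_min
print(required_frequency(78, 850, 800))   # narrow window, clamped to f_max
```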

slide-50
SLIDE 50

Minimize state transition overheads

› Calculate a criticality score
› Send each request to the core that is least critical
› Minimize state transitions

See paper for details!

slide-51
SLIDE 51

Evaluation

› In-house simulator (similar to BigHouse)
› Empirical power model

› 10µs DVFS transition, 89µs sleep transition time (doubled to account for cache flushing)
› Add 25µs to first request service time after idle period for cold miss penalty

› Baseline – Linux menu idle governor and intel_pstate driver
› Workloads

slide-52
SLIDE 52

Energy Savings

› µDPM typically within 2-3% of Optimal
› µDPM saves ~2X vs others

[Figure, four panels: energy saving (%) vs. load for Memcached, SPECjbb, Masstree, and Xapian, comparing Rubik, DynSleep, SleepScale, µDPM, and Optimal.]

slide-54
SLIDE 54

State transition overhead reduction

[Figure: normalized energy (0.6 to 1) for Baseline, Rubik, SleepScale, DynSleep, Optimal, and µDPM, broken down into busy, idle, C-state transition, and VFS transition energy.]

slide-55
SLIDE 55

Under Varying Load

slide-56
SLIDE 56

Sensitivity to target tail latency

[Figure, four panels: energy saving (%) vs. latency constraint (µs) for Memcached, SPECjbb, Masstree, and Xapian.]

slide-57
SLIDE 57

Sensitivity to Transition time

[Figure, two panels: energy saving (%) vs. sleep transition time (µs, 100 to 300) for µDPM and µDPM w/ criticality-awareness; energy saving (%) vs. VFS transition time (µs, 20 to 100) for Rubik, µDPM, and µDPM w/ criticality-awareness.]

slide-58
SLIDE 58

Conclusion

› Microsecond service times present challenges for Dynamic Power Management
› Careful coordination of DVFS, sleep, and request delaying can achieve energy savings with µs service times
› µDPM achieves ~2x the energy savings of state-of-the-art techniques

slide-59
SLIDE 59

µDPM: Dynamic Power Management 
 for the Microsecond Era

Thank you!

Chih-Hsun Chou


cchou001@cs.ucr.edu

Laxmi N. Bhuyan


bhuyan@cs.ucr.edu

Daniel Wong


danwong@ucr.edu