Increasing Large-Scale Data Center Capacity by Statistical Power Control



slide-1
SLIDE 1

Increasing Large-Scale Data Center Capacity by Statistical Power Control

Guosai Wang, Shuhao Wang, Bing Luo, Weisong Shi, Yinghang Zhu, Wenjun Yang, Dianming Hu, Longbo Huang, Xin Jin, Wei Xu

slide-2
SLIDE 2

Data Centers

Expensive to build and operate

Building cost (large DCs): $9,000–$13,000/kW*

High power consumption: 10–20 MW

Goal: Fully utilize the capacity of data centers to reduce the TCO.

Our Result:

  • +17% servers → +15% throughput
  • Power violations effectively avoided.
  • No performance disturbance to existing jobs.

[*L. A. Barroso et al. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. 2013]

slide-4
SLIDE 4

Underutilized Capacity in DCs

Observation: Avg power utilization < 72% at DC level

Reason: Conservative power provisioning

Provision according to rated power

Running power < Rated power

Over-provisioning of the facility power?

Increase the number of servers on each rack.

slide-6
SLIDE 6

Why People Under-provision?

[X. Fan et al. Power provisioning for a warehouse-sized computer. ISCA 2007]

Fig: Row power over time against the power limit, for servers at the row level. Annotation: Under-utilized capacity.

slide-8
SLIDE 8

Why People Under-provision?

[X. Fan et al. Power provisioning for a warehouse-sized computer. ISCA 2007]

Fig: Row power over time against the power limit, with servers over-provisioned at the row level. Annotation: Power violation!

slide-9
SLIDE 9

Power Capping Degrades Performance

Traditional approach: Power capping

Dynamic Voltage and Frequency Scaling (DVFS): Power ≈ C·V²·F

Degrades the performance of running jobs!

Violates the SLA of latency-sensitive jobs.

Fig: Row power over time.
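A small worked example of the relation Power ≈ C·V²·F may help show why capping costs performance. The operating points below are made-up, normalized values (not from the talk), and the assumption that CPU-bound work slows in proportion to frequency is a simplification.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic CPU power under the approximation P ≈ C·V²·F."""
    return capacitance * voltage ** 2 * frequency


# Hypothetical, normalized operating points (assumptions for illustration).
nominal = dynamic_power(1.0, 1.0, 1.0)
capped = dynamic_power(1.0, 0.9, 0.8)  # voltage down 10%, frequency down 20%

power_saving = 1 - capped / nominal  # fraction of dynamic power saved
slowdown = 1 - 0.8 / 1.0             # assumed: CPU-bound work scales with F

print(f"power saving: {power_saving:.3f}, slowdown: {slowdown:.3f}")
```

Under these assumed numbers the cap saves about 35% of dynamic power but slows every running job by about 20%, which is the cost the talk wants to avoid.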

slide-11
SLIDE 11

Power Control Method

Can we control the power without affecting the performance of existing jobs?

slide-12
SLIDE 12

Key Observation

Large variations in power utilization at the row level: temporal (over time) and spatial (across different rows).

Idea: Dynamically move workload out of the heavily used rows.

Fig: Two panels. Axes: Time/hour vs. Normalized Row Power; Time/hour vs. Row.

slide-15
SLIDE 15

Our Solution: Statistical Power Control

  • Minimize the interface with the scheduler. Two simple APIs: freeze/unfreeze. Decoupled from the over-complicated scheduler.
  • Statistically influence new job placement. Indirect workload balancing. Running jobs are unaffected.
  • Dynamic system control. Does not necessarily work perfectly. Tolerates noise. System identification in a production environment.
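The freeze/unfreeze interface can be pictured with a toy scheduler. This is only a sketch under stated assumptions: the class `Scheduler`, its `place` method, and the server names are hypothetical and not the production scheduler's API; the only thing taken from the talk is that the controller touches the scheduler through freeze and unfreeze alone.

```python
import random


class Scheduler:
    """Toy scheduler (hypothetical) exposing only freeze/unfreeze."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.frozen = set()

    def freeze(self, server):
        # A frozen server accepts no new jobs; running jobs are untouched.
        self.frozen.add(server)

    def unfreeze(self, server):
        self.frozen.discard(server)

    def place(self, rng=random):
        # New jobs land only on servers that are not frozen.
        candidates = [s for s in self.servers if s not in self.frozen]
        return rng.choice(candidates) if candidates else None


# Freeze the two servers of a hot row "a"; new jobs drift to row "b".
sched = Scheduler(["a1", "a2", "b1", "b2"])
sched.freeze("a1")
sched.freeze("a2")
rng = random.Random(0)
placements = [sched.place(rng) for _ in range(100)]
```

Because the controller never moves or throttles a running job, the effect on row power is indirect and statistical: it appears only as existing jobs finish and new ones land elsewhere.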

slide-21
SLIDE 21

Example: Statistical Power Control

Light workload

No control action.

Fig: Diagram labels: Scheduler, New jobs, Running Jobs, Power Controller, Aggregated real-time power.

slide-29
SLIDE 29

Example: Statistical Power Control

Heavy workload.

High row power.

Fig: Diagram labels: Scheduler, New jobs, Running Jobs, Power Controller, Aggregated real-time power, Freeze, Unused power, Jobs.

slide-30
SLIDE 30

Example: Statistical Power Control

Some jobs finished.

Fig: Diagram labels: Scheduler, Running Jobs, Power Controller, Aggregated real-time power, Freeze.

slide-31
SLIDE 31

Example: Statistical Power Control

Some jobs finished.

Fig: Diagram labels: Scheduler, Running Jobs, Power Controller, Aggregated real-time power, Unfreeze.

slide-35
SLIDE 35

Power Control Model Blueprint

  • Dynamic control at each minute.
  • No control needed when the power is low.
  • Freeze more/fewer servers when power is high/low.
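
The three blueprint rules can be sketched as a function from row power to a freezing ratio. This is a minimal illustration, not the controller from the paper: the thresholds `low` and `high` and the linear ramp between them are assumptions chosen only to make the rules concrete.

```python
def freezing_ratio(power, budget, low=0.8, high=1.0):
    """Map row power to the fraction of servers to freeze.

    `low` and `high` are hypothetical thresholds (fractions of the budget);
    the linear ramp between them is an assumption for illustration.
    """
    utilization = power / budget
    if utilization <= low:
        return 0.0  # power is low: no control needed
    if utilization >= high:
        return 1.0  # at or above the budget: freeze every server
    # In between: freeze more servers as power rises, fewer as it falls.
    return (utilization - low) / (high - low)


def servers_to_freeze(power, budget, num_servers):
    # Intended to be evaluated once per control interval (each minute).
    return round(freezing_ratio(power, budget) * num_servers)
```

For example, under these assumed thresholds a row of 40 servers drawing 90% of its budget would have about half of its servers frozen, and none frozen at 70%.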


slide-36
SLIDE 36

Effect of Freezing Servers

Two effects jointly affect the row-level power.

  • Existing jobs will finish
  • Statistically fewer jobs scheduled to the row

Fig: Average normalized power of about 80 servers after they are frozen. Axes: Time/min vs. Normalized Server Power.

slide-37
SLIDE 37

Effect of Freezing Servers

Two effects jointly affect the row-level power.

  • Existing jobs will finish
  • Statistically fewer jobs scheduled to the row

How to quantify these effects?

System identification in a production environment? We designed a controlled experiment.

slide-41
SLIDE 41

Controlled Experiment Design

Controlled experiment in a production environment. Idea: A/B testing.

Fig: Diagram labels: Row 1, Row 2, ..., Row n; Experiment Group; Control Group; Power Controller; Control Actions.

The correlation coefficient of the power of the two groups is 0.946.
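
A high correlation between the two groups is what makes the control group a usable baseline for the A/B comparison. The sketch below computes a Pearson correlation coefficient on two made-up power traces; the numbers are hypothetical and are not the measurements behind the 0.946 figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical normalized power traces, one sample per interval.
experiment = [0.70, 0.74, 0.78, 0.75, 0.72, 0.80]
control = [0.71, 0.73, 0.79, 0.76, 0.71, 0.81]
r = pearson(experiment, control)
```

When the two traces track each other this closely before any control action, a later divergence can reasonably be attributed to the controller rather than to workload drift.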

slide-42
SLIDE 42

Dynamic Control Model

How many servers do we need to freeze in a row?

Freeze too few: Risk of power violations!

Freeze too many: Reduced throughput!

Optimization problem:

Maximize: TPW (Throughput per Provisioned Watt)

s.t. No power violation

Key idea:

Use a simple system model and tolerate inaccuracy with dynamic control.

slide-43
SLIDE 43

Dynamic Control Model

Use heuristics to derive a simple control model.

Take control actions every minute. Details are in the paper.

Fig: Freezing Ratio vs. Realtime Row Power.

slide-51
SLIDE 51

How to Emulate Over-provisioning?

  • Safety: It is unacceptable to actually trigger power violations in a production environment.
  • Flexibility: How to test various over-provisioning ratios?

Solution: Emulate power violations by virtually scaling down the power budget of the row.

Actual row power budget: P

Assumed row power budget: P'

Over-provisioning ratio: (P − P')/P'
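
Rearranging the ratio definition gives the assumed budget for a target ratio: P' = P / (1 + r). A short sketch, with the function names and the 100-unit budget chosen only for illustration:

```python
def assumed_budget(actual_budget, over_provisioning_ratio):
    """Solve r = (P - P') / P' for P', giving P' = P / (1 + r)."""
    return actual_budget / (1 + over_provisioning_ratio)


def ratio_from_budgets(actual_budget, assumed):
    """Over-provisioning ratio (P - P') / P' as defined on the slide."""
    return (actual_budget - assumed) / assumed


P = 100.0  # hypothetical actual row budget, arbitrary units
p_prime = assumed_budget(P, 0.25)
print(p_prime, ratio_from_budgets(P, p_prime))
```

With a ratio of 0.25, the controller treats 80% of the real budget as the limit, so an emulated "violation" never exceeds what the row can actually supply.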

slide-52
SLIDE 52

Effectiveness

Controlled experiments in a production environment.

Over-provisioning ratio = 0.25

slide-60
SLIDE 60

How to Decide Over-provisioning Ratio?

Throughput per Provisioned Watt (TPW):

$$\mathrm{TPW} = \frac{\text{Throughput during time interval } T}{P \cdot T}$$

Gain in TPW:

$$G_{\mathrm{TPW}} = r_T \cdot (1 + r_O) - 1$$

where $P$ is the provisioned power, $r_T$ is the throughput ratio ($\le 1$), and $r_O$ is the over-provisioning ratio ($\ge 0$).

By emulations we found $G_{\mathrm{TPW}} = 0.149$ when $r_O = 0.17$.
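
The gain formula can be checked against the reported numbers. The sketch below inverts it to recover the throughput ratio implied by a gain of 0.149 at an over-provisioning ratio of 0.17; the function names are illustrative, and the implied ratio is derived arithmetic, not a figure quoted in the talk.

```python
def tpw_gain(throughput_ratio, over_provisioning_ratio):
    """Gain in TPW: r_T * (1 + r_O) - 1."""
    return throughput_ratio * (1 + over_provisioning_ratio) - 1


def implied_throughput_ratio(gain, over_provisioning_ratio):
    """Invert the gain formula: r_T = (1 + G) / (1 + r_O)."""
    return (1 + gain) / (1 + over_provisioning_ratio)


r_t = implied_throughput_ratio(0.149, 0.17)
print(f"implied throughput ratio: {r_t:.3f}")
```

The implied ratio is roughly 0.98, meaning the added servers deliver most of their throughput; with no loss at all (a ratio of 1) the gain would equal the over-provisioning ratio itself.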

slide-61
SLIDE 61

Conclusion

  • Admission control to statistically influence new job placement: avoids performance degradation.
  • Minimal APIs (freeze/unfreeze): decouple the power control module from the complicated scheduler.
  • Simple dynamic system control: tolerates inaccuracy.
  • Controlled experiment: build and evaluate the system model in a production environment without disturbing it too much.

slide-63
SLIDE 63

Outline:

  • Power over-provisioning motivation
  • Ideas of statistical power control
  • Dynamic control model
  • Controlled experiment design
  • Effectiveness
  • Deciding over-provisioning ratio
  • Conclusion

Q&A

slide-69
SLIDE 69

Backup Slides

slide-70
SLIDE 70

Ampere Architecture

Fig: Architecture diagram. Components: Scheduler, Power Monitor, Controller, Row #1 to Row #n. Edge labels: freeze/unfreeze, scheduling actions, aggregated power.

slide-71
SLIDE 71

Job Durations

slide-72
SLIDE 72

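$G_{\mathrm{TPW}}$ under Different $r_O$

Fix $r_O$: $P_{\mathrm{mean}} \nearrow \;\Rightarrow\; u_{\mathrm{mean}} \nearrow \;\Rightarrow\; r_T \searrow \;\Rightarrow\; G_{\mathrm{TPW}} \searrow$

$r_O \nearrow \;\Rightarrow\; u_{\mathrm{mean}} \nearrow$

$G_{\mathrm{TPW}} < r_O$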

slide-73
SLIDE 73

Quantify the Effect of Freezing Ratio

The effect of the freezing ratio u on the power change f(u).

slide-74
SLIDE 74

Limitations and Discussion

  • What if the workload increases in the future?
  • What if the jobs are scheduled with locality awareness?
  • What if there are few jobs and they are long-lived?
  • How to jointly optimize the control among all rows?
  • Are experiments needed before deployment?