Towards Energy-Efficient Reactive Thermal Management in Instrumented Datacenters

Energy Efficient Grids, Clouds and Clusters, Brussels, October 26, 2010

Ivan Rodero¹, Eun Kyung Lee¹, Dario Pompili¹, Manish Parashar¹, Marc Gamell², Renato J. Figueiredo³

¹ NSF Center for Autonomic Computing, Rutgers University, NJ, USA
² Open University of Catalonia, Barcelona, Spain
³ NSF Center for Autonomic Computing, University of Florida, FL, USA


- Context and Motivation
- Datacenter Thermal Management
- Energy Efficiency and Tradeoffs
- Evaluation Methodology
- Results
- Next Steps
- Conclusions

Agenda

2


Energy-Efficient Autonomic Management for High Performance Computing Workloads

3

[Diagram: each layer (Physical Environment, Resources, Virtualization, Application/Workload) is instrumented with its own sensor, observer, controller, and actuator loop; cross-layer and cross-infrastructure power management spans clouds (private, public, hybrid, etc.) over a virtualized, instrumented, application-aware infrastructure.]

- Goal: autonomic (self-monitored and self-managed) computing systems:
  - Optimizing energy efficiency while ensuring the Quality of Service delivered (performance)


Cross-Layer Architecture

4

[Architecture diagram: sensors 1-4 at the Application/Workload, Virtualization, Resources, and Physical Environment layers feed per-layer observers (application requirement profiles, VM efficiency, resource performance, environment prediction) and local controllers driving actuators 1-3; a global controller with a correlation observer coordinates the local controllers. Legend: managed environment, actuator, sensor, controller's command, observer's sensing port, information flow, request flow, resource flow.]
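To make the control loop concrete, here is a minimal sketch (with hypothetical class names and an illustrative policy, not the authors' implementation) of how per-layer observers could feed a global controller that correlates them and commands the local controllers:

```python
# Sketch of the observer/controller hierarchy (hypothetical names and policy).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Observation:
    layer: str    # "application", "virtualization", "resources", "environment"
    metric: str   # e.g., "vm_efficiency", "node_temperature"
    value: float

class LocalController:
    """Per-layer controller: turns global decisions into actuator commands."""
    def __init__(self, layer: str, actuator: Callable[[str], None]):
        self.layer, self.actuator = layer, actuator

    def apply(self, command: str) -> None:
        self.actuator(command)

class GlobalController:
    """Correlates observations across layers and picks a cross-layer action."""
    def __init__(self, local_controllers: Dict[str, LocalController]):
        self.local_controllers = local_controllers

    def step(self, observations: List[Observation]) -> None:
        temps = [o.value for o in observations
                 if o.layer == "environment" and o.metric == "node_temperature"]
        # Illustrative policy: a thermal anomaly triggers a virtualization-layer action.
        if temps and max(temps) > 50.0:   # threshold is an assumption
            self.local_controllers["virtualization"].apply("migrate_vm")
```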


- Abnormal operational state detection
  - Distributed Online Clustering (e.g., of the workload)
  - Physical sensing at the physical layer (e.g., thermal hotspots)
- Reactive and proactive approaches
  - Reacting to anomalies to return to the steady state
  - Predicting anomalies in order to avoid them

Cross-Layer Energy-Efficient Autonomic Management

5

[Diagram: from an abnormal state, different cross-layer (autonomic) actions offer different paths back to the steady operational state, each with its own QoS, energy-efficiency, and thermal-efficiency profile.]


Interactions between Autonomic Components

6

[Diagram of interactions: workload characterization (e.g., DOC), pinning, VM migration, and component-level power management act on the HPC workload; environment monitoring (temperature, etc.) and resource monitoring (load, power, etc.) feed the global controller and the correlation observer, which cooperate on proactive configuration (provisioning and mapping) and reactive configuration (reactive designs for aggressive power management), and depend on scheduling and trading with 3rd parties.]


Datacenter’s Thermal Behavior

8

[Two surface plots: Temperature [°C] as a function of Time [min] and Node Number, one per workload distribution.]

Temporal correlation of the measured temperature under different workload distributions


[Plot: Power (W) vs. Time (s).]

Reacting to Thermal Hotspots

9

[Plot: Temperature (°C) vs. Time (s) for the internal server temperature and the environment (hotspot).]

Correlation between the server's temperature and power

Reaction: VM migration, returning the server to the steady state
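A minimal sketch of triggering this reaction, assuming the Xen `xm` toolstack used in this setup; the threshold, domain name, and destination host are illustrative, not the authors' exact procedure:

```python
# Reactive live migration of a VM when an external (hotspot) temperature
# threshold is exceeded; uses Xen's xm toolstack (assumed available).
import subprocess

TEMP_THRESHOLD_C = 50.0   # illustrative trigger, not the paper's exact value

def react_to_hotspot(external_temp_c: float, domain: str, dest_host: str) -> bool:
    """Live-migrate `domain` to `dest_host` if the hotspot threshold is crossed."""
    if external_temp_c <= TEMP_THRESHOLD_C:
        return False
    subprocess.check_call(["xm", "migrate", "--live", domain, dest_host])
    return True
```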


- Assumption: the lower the power dissipated, the lower the heat generated
- Reducing the activity factor (α)
  - VM migration: move a VM to another server
    - May reduce not only CPU activity but also memory activity, etc.
    - May result in a lower CPU frequency if the OS supports it
    - Overhead (suspend, transfer data, resume, etc.)
    - Requires available capacity on another server (impact on the target server)

Thermal Management Approaches

10

P_cpu ≈ C × α × V² × f
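For intuition about the model above, a short sketch of how dynamic power scales with frequency and voltage under DVFS; the capacitance, activity factor, and voltage values are assumed for illustration, not measurements from this work:

```python
# Relative dynamic CPU power from P_cpu ≈ C * alpha * V^2 * f.
def dynamic_power(c: float, alpha: float, voltage: float, freq_hz: float) -> float:
    return c * alpha * voltage ** 2 * freq_hz

# Illustrative operating points (voltages are assumed, not from the paper).
p_high = dynamic_power(c=1e-9, alpha=0.9, voltage=1.25, freq_hz=2.4e9)
p_low  = dynamic_power(c=1e-9, alpha=0.9, voltage=1.10, freq_hz=1.6e9)
print(f"scaling 2.4 GHz -> 1.6 GHz cuts dynamic power to "
      f"{100 * p_low / p_high:.0f}% of the original")
```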


- Reducing the activity factor (α)
  - Pinning (in the Xen platform): affinity in the VCPU-to-PCPU mapping
    - Leaves some CPUs without VMs running on them
    - OS power management may then apply DVFS automatically
    - Performance penalty (resource sharing)
- Reducing the frequency/voltage of the CPUs (V² × f)
  - Processor DVFS
    - Performance penalty (in general, a higher response time)
    - Different possibilities: different frequencies/voltages, applied to all CPUs/cores or to a subset

(A minimal sketch of both knobs follows this slide.)

Thermal Management Approaches (2)

11
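A minimal sketch of both knobs described above, assuming the Xen 3.x `xm` toolstack and the Linux cpufreq sysfs interface; the domain name, PCPU list, and target frequency are illustrative:

```python
# Sketch: restrict a domain's VCPUs to a subset of PCPUs (pinning) and
# lower the frequency of a core via cpufreq (DVFS). Values are illustrative.
import subprocess

def pin_domain(domain: str, vcpu: int, pcpus: str) -> None:
    """Pin one VCPU of `domain` to the given PCPU list, e.g. pcpus="0-1"."""
    subprocess.check_call(["xm", "vcpu-pin", domain, str(vcpu), pcpus])

def set_cpu_freq_khz(cpu: int, freq_khz: int) -> None:
    """Set a fixed frequency through the userspace cpufreq governor."""
    base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"
    with open(f"{base}/scaling_governor", "w") as f:
        f.write("userspace")
    with open(f"{base}/scaling_setspeed", "w") as f:
        f.write(str(freq_khz))

# Example: keep the 4 VMs' VCPUs on PCPUs 0-1 and run those cores at 1.6 GHz.
for vcpu in range(4):
    pin_domain("hpl-vm", vcpu, "0-1")
for cpu in (0, 1):
    set_cpu_freq_khz(cpu, 1_600_000)
```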


- Goal: selection of the appropriate technique to mitigate the effects of thermal hotspots
- Energy-efficient:
  - Lower energy consumption
  - Lower maximum/average power dissipation
- Driven by optimization requirements, for example:
  - Reduce the temperature by 5 °C (based on a threshold)
  - A penalty of up to 10% on the response time is acceptable
- There are well-known tradeoffs between performance and energy efficiency
  - But we also need to consider other dimensions, such as thermal efficiency (temperature)

(A sketch of this requirement-driven selection follows this slide.)

Goals and Tradeoffs

12
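A sketch of how such requirement-driven selection could be encoded; the candidate techniques and their estimated effects are made-up numbers used only to illustrate the decision logic:

```python
# Pick the cheapest technique that meets the temperature target without
# exceeding the acceptable performance penalty. Estimates are illustrative.
CANDIDATES = [
    # (name, estimated temperature drop [C], response-time penalty, relative energy cost)
    ("pin_vms_to_2_cpus", 6.0, 0.08, 0.70),
    ("dvfs_1.6GHz",       4.0, 0.12, 0.75),
    ("migrate_1_vm",      7.0, 0.05, 0.90),
]

def select_technique(target_drop_c: float = 5.0, max_penalty: float = 0.10) -> str:
    feasible = [c for c in CANDIDATES
                if c[1] >= target_drop_c and c[2] <= max_penalty]
    if not feasible:
        raise RuntimeError("no technique satisfies the requirements")
    # Among feasible options, prefer the lowest relative energy cost.
    return min(feasible, key=lambda c: c[3])[0]

print(select_technique())   # -> "pin_vms_to_2_cpus" with these example numbers
```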


- Example: tradeoff between temperature and performance when pinning 4 VMs onto different PCPUs

Goals and Tradeoffs (2)

13


- Server configuration:
  - Two servers based on Intel quad-core Xeon processors
    - Operate at four frequencies ranging from 1.6 GHz to 2.4 GHz (but only 3 are available under Xen 3.1)
  - CentOS Linux operating system with a patched 2.6.18 kernel and Xen version 3.1
- Additional hardware:
  - A "Watts Up? .NET" power meter to empirically measure "instantaneous" power
    - Accuracy of ±1.5% of the measured power, with a sampling rate of 1 Hz
  - TelosB motes to measure both internal (not the sensors embedded in the CPU) and external temperatures
  - A Sunbeam SFH111 heater (directed at the servers) to emulate a thermal hotspot
- Workload: HPL Linpack

Evaluation Methodology


Energy Consumption “Estimation”

15

- Use case: no VMs running in the target server before the migration
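A sketch of how the energy consumption could be estimated from the 1 Hz power-meter samples described in the methodology; the trapezoidal integration and the sample values are assumptions, not the authors' exact procedure:

```python
# Estimate energy (J and Wh) from 1 Hz "instantaneous" power samples.
from typing import Sequence

def energy_joules(power_w: Sequence[float], sample_period_s: float = 1.0) -> float:
    """Trapezoidal integration of power over time."""
    if len(power_w) < 2:
        return 0.0
    return sum((a + b) / 2.0 * sample_period_s
               for a, b in zip(power_w, power_w[1:]))

samples = [180.0, 182.0, 240.0, 238.0, 190.0]   # illustrative Watts Up readings
joules = energy_joules(samples)
print(f"{joules:.0f} J ({joules / 3600:.3f} Wh)")
```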


Results

16

[Nine plots, one row each for DVFS, VM migration, and pinning: internal temperature (°C), external temperature (°C), and power (W) vs. time (s). DVFS series: 4 CPUs at 2.40 GHz, 2 CPUs at 1.60 GHz, 4 CPUs at 2.13 GHz, 4 CPUs at 1.60 GHz. Migration series: reference, migrate 1 VM, migrate 2 VMs, migrate 3 VMs. Pinning series: reference, pinning VMs to 3 CPUs, to 2 CPUs, to 1 CPU.]

- Correlation between internal and external temperature
- Correlation between temperature and power
- DVFS: using 2 CPUs at 1.6 GHz yields results similar to using 4 CPUs at 2.13 GHz
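A quick way to quantify the correlations noted above from the logged traces (NumPy's corrcoef); the arrays here are placeholders for the measured series:

```python
# Pearson correlation between two logged time series (e.g., internal vs.
# external temperature, or temperature vs. power), sampled at the same rate.
import numpy as np

internal_temp = np.array([41.0, 43.5, 47.0, 50.2, 52.8])   # placeholder trace
external_temp = np.array([33.0, 34.1, 36.0, 38.4, 40.1])   # placeholder trace

r = np.corrcoef(internal_temp, external_temp)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```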


Results (2)

17


- Autonomic VM allocation and reactive technique decision
- Cross-layer design approach
  - Examples: component-level power management, workload clustering, etc.
  - Application-aware (workload characterization into CPU-intensive, I/O-intensive, etc.)
- Optimization targets based on self-monitoring
- Models are required
  - VM migration, DVFS (the work presented in this talk)
  - VM allocation (number of VMs, workload characteristics, combinations, etc.)
    - Preliminary results based on a brute-force algorithm (a sketch follows this slide)
  - Models at the server and datacenter level

Next Steps

18
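For the VM-allocation step mentioned above, a sketch of what a brute-force search over placements could look like; the cost model and the server/VM names are illustrative assumptions, not the authors' algorithm:

```python
# Brute-force search over VM-to-server placements, scoring each with a
# user-supplied cost model (e.g., predicted power or peak temperature).
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

def best_allocation(vms: Sequence[str], servers: Sequence[str],
                    cost: Callable[[Dict[str, str]], float]) -> Tuple[Dict[str, str], float]:
    best, best_cost = None, float("inf")
    for placement in product(servers, repeat=len(vms)):   # |servers|^|vms| options
        alloc = dict(zip(vms, placement))
        c = cost(alloc)
        if c < best_cost:
            best, best_cost = alloc, c
    return best, best_cost

# Illustrative cost: penalize load imbalance between the two servers.
def imbalance(alloc: Dict[str, str]) -> float:
    counts = [list(alloc.values()).count(s) for s in ("server1", "server2")]
    return abs(counts[0] - counts[1])

alloc, c = best_allocation(["vm1", "vm2", "vm3", "vm4"], ["server1", "server2"], imbalance)
print(alloc, c)
```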


- Tradeoffs exist between the performance, energy efficiency, and thermal efficiency of reactive thermal management techniques for HPC workloads
- Pinning is an effective mechanism to react to thermal anomalies under certain conditions
  - In addition to VM migration
  - In contrast to DVFS
- Different mechanisms' behaviors were observed depending on the system characteristics and optimization goals
  - Autonomic decision making is required
  - Cross-layer designs should improve datacenter management

Conclusions

20


Thank you!

21

Energy Efficient High Performance Computing Initiative
Center for Autonomic Computing, Rutgers University

http://nsfcac.rutgers.edu/GreenHPC/