
IC2E 2017 – Wes J. Lloyd 4/6/2017

Mitigating Resource Contention and Heterogeneity in Public Clouds for Scientific Modeling Services

Wes Lloyd, Shrideep Pallickara, Olaf David, Mazdak Arabi, Ken Rojas

April 6, 2017

Institute of Technology, University of Washington, Tacoma, Washington USA

IC2E 2017: IEEE International Conference on Cloud Engineering


Outline

  • Background
  • Research Questions
  • Experimental Workloads
  • Experiments/Evaluation
  • Conclusions


Rosetta Protein Folding

  • Computational methods for accurate design of new hyperstable constrained peptides
  • In 53 hours, using 5,904 EC2 compute cores:
      • Generated 5.2 million peptide structures
      • $3,400 using spot instances
      • Upfront cost of a physical cluster to achieve the same result in ~53 hours: $857,752
  • Cloud enables ad hoc large-scale experimentation


Research Challenges

How can we improve performance and costs for hosting scientific application workloads on the cloud?

  • Resource heterogeneity
  • Resource contention

Relative to:

  • HPC
  • Compute clusters


VM-type heterogeneity - Amazon EC2

From: Is The Same Instance Type Created Equal

2013 IEEE Transactions on Cloud Computing


Trial-and-better Resource Provisioning

  • Z. Ou et al., 2013 IEEE Trans. on Cloud Computing
  • Using Amazon EC2:

  1. Provision instances
  2. Perform trial(s): VM testing
  3. Keep desired instances
  4. Replace undesirable instances

  • Test: underlying CPU type


VM-Scaler


  • Web services application
  • REST-based/JSON
  • Harnesses EC2 API
  • Manages virtual cloud infrastructure
  • Supports scientific modeling-as-a-service
  • Supports Amazon, Eucalyptus clouds


Trial and Better – VM-Scaler

  • Harness this approach for VM pools
  • Ensure every VM has the same backing CPU
  • Provide more consistent test results


Resource Utilization Data Collection

  • Profile resource utilization for scientific workloads running across many VMs
  • Sensor on every VM
  • Transmits data to VM-Scaler

CPU

  • CPU time
  • cpu_usr: CPU time in user mode
  • cpu_krn: CPU time in kernel mode
  • cpu_idle: CPU idle time
  • contextsw: # of context switches
  • cpu_io_wait: CPU time waiting for I/O
  • cpu_sint_time: CPU time serving soft interrupts
  • loadavg: (# proc / 60 secs)
  • cpuSteal: VM CPU ready, physical CPU unavailable

Disk

  • dsr: disk sector reads
  • dsreads: disk sector reads completed
  • drm: merged adjacent disk reads
  • readtime: time spent reading from disk
  • dsw: disk sector writes
  • dswrites: disk sector writes completed
  • dwm: merged adjacent disk writes
  • writetime: time spent writing to disk

Network

  • nbr: network bytes received
  • nbs: network bytes sent
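On Linux, several of the CPU counters listed above can be read from `/proc/stat`. The sketch below is one possible way a per-VM sensor could collect them; it is not the VM-Scaler sensor itself, and the mapping from `/proc/stat` fields to the slide's metric names (for example `steal` to cpuSteal, `ctxt` to contextsw) is an assumption. The sample text is illustrative, not measured data.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_totals(proc_stat_text):
    """Return the aggregate CPU tick counters as a dict."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(CPU_FIELDS, (int(v) for v in parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def parse_context_switches(proc_stat_text):
    """Return the total number of context switches since boot."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "ctxt":
            return int(parts[1])
    raise ValueError("no 'ctxt' line found")


# Illustrative sample; on a real VM use open("/proc/stat").read().
SAMPLE = """cpu  4705 150 1120 16250 520 30 45 210 0 0
cpu0 2350 75 560 8125 260 15 22 105 0 0
ctxt 123456
"""

totals = parse_cpu_totals(SAMPLE)
switches = parse_context_switches(SAMPLE)
print(totals["steal"], switches)
```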


CpuSteal

  • CpuSteal: the VM's CPU core is ready to execute, but the physical CPU core is busy
  • Symptom of over-provisioning physical servers
  • Factors which cause CpuSteal:

  1. Processors shared by too many busy VMs
  2. Hypervisor kernel (Xen dom0) is occupying the CPU
  3. VM's CPU time share is <100% for 1 or more cores, and 100% is needed for a CPU-intensive workload
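To make the definition concrete, steal time can be expressed as a share of the CPU ticks that elapsed between two counter samples. This is a minimal sketch assuming Linux-style tick counters (the field names follow `/proc/stat`); the slides do not state how cpuSteal was normalized, so the percentage form here is an assumption.

```python
# Counters that together account for all elapsed CPU ticks in this sketch.
TICK_FIELDS = ("user", "nice", "system", "idle",
               "iowait", "irq", "softirq", "steal")


def steal_percent(before, after):
    """Percentage of elapsed CPU ticks reported as steal between two samples."""
    delta = {f: after[f] - before[f] for f in TICK_FIELDS}
    total = sum(delta.values())
    if total <= 0:
        return 0.0  # no time elapsed, or counters were reset
    return 100.0 * delta["steal"] / total


# Illustrative counters: of 100 elapsed ticks, 10 were stolen.
before = dict.fromkeys(TICK_FIELDS, 0)
after = dict(before, user=50, idle=40, steal=10)
pct = steal_percent(before, after)
print(pct)
```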


Research Questions

RQ1: How common is public cloud VM-type implementation heterogeneity?

RQ2: What performance implications result from VM-type heterogeneity for hosting scientific application workloads?


Research Questions - 2

RQ3: How effective is cpuSteal at identifying VMs with high resource contention due to multi-tenancy (e.g. noisy neighbor VMs) in a public cloud? Is there a pattern to cpuSteal behavior across worker VMs over time?

RQ4: What are the performance implications of hosting scientific modeling workloads on worker VMs with consistently high cpuSteal measurements in a public cloud?


CSIP Model Services

Cloud Services Innovation Platform

  • Java-based framework to support development of scientific model services (modeling-as-a-service)
  • Increase availability and throughput of models
  • Harness scalable cloud infrastructure
  • Cloud virtualization supports the variety of legacy software required for scientific applications (e.g. FORTRAN, Visual C++ 6.0, etc.)


Scientific Application Workloads

RUSLE2

  • Soil erosion from water
  • Median runtime ~1.89s

WEPS

  • Soil erosion from wind
  • Median runtime ~55s
  • Years of weather data * years of crop rotation


Scientific Modeling Workloads - 2

WEPS / RUSLE CPU utilization:


Testing for VM Type Heterogeneity

Identified CPU by checking /proc/cpuinfo

  • Launched 50 VMs of a given type
  • If there was heterogeneity, launched 50 more

Tested 12 VM types, across 3 generations

  • 1st: m1.medium, m1.large, m1.xlarge, c1.medium, c1.xlarge
  • 2nd: m2.xlarge, m2.2xlarge, m2.4xlarge
  • 3rd: c3.large, c3.xlarge, c3.2xlarge, m3.large
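The CPU check described above could look roughly like the following: read the `model name` field from each VM's `/proc/cpuinfo` and tally what share of the pool sits on each backing CPU. The helper names and the sample `cpuinfo` strings are illustrative assumptions, and how the text is fetched from each VM (for example over SSH) is left out.

```python
from collections import Counter


def cpu_model(cpuinfo_text):
    """Return the first 'model name' value found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return None


def backing_cpu_shares(cpuinfo_by_vm):
    """Map each CPU model to the percentage of VMs backed by it."""
    counts = Counter(cpu_model(text) for text in cpuinfo_by_vm.values())
    total = sum(counts.values())
    return {model: 100.0 * n / total for model, n in counts.items()}


# Illustrative pool of four VMs (hypothetical cpuinfo excerpts).
A = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz\n"
B = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz\n"
pool = {"vm-0": A, "vm-1": A, "vm-2": B, "vm-3": A}

shares = backing_cpu_shares(pool)
print(shares)
```

A pool is heterogeneous whenever `shares` has more than one key, which is the condition under which the slides report launching 50 more VMs.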


Amazon EC2 VM Type Heterogeneity

VM type   | Region     | Backing CPUs (cores, watts, share of VMs)
m1.medium | us-east-1c | Intel Xeon E5-2650 v0 (8c, 95w, 96%); Intel Xeon E5645 (6c, 80w, 4%)
m2.xlarge | us-east-1c | Intel Xeon X5550 (4c, 95w, 48%); Intel Xeon E5-2665 v0 (8c, 115w, 42%)
m1.large  | us-east-1d | Intel Xeon E5-2650 v0 (8c, 95w, 74%); Intel Xeon E5-2651 v2 (12c, 105w, 19%); Intel Xeon E5645 (6c, 80w, 7%)
m2.xlarge | us-east-1d | Intel Xeon E5-2665 v0 (8c, 115w, 78%); Intel Xeon X5550 (4c, 95w, 22%)


VM Type Heterogeneity Performance Implications

  • Tested small 5-VM pools
  • Compared the two most abundant hardware implementations:
      • m1.large (Intel Xeon): E5-2650 v0, 8 cores, 95 w vs. E5-2651 v2, 12 cores, 105 w
      • m2.xlarge (Intel Xeon): E5-2665 v0, 8 cores, 115 w vs. X5550, 4 cores, 95 w
  • Workloads:
      • WEPS: 10 x 100 runs
      • RUSLE2: 10 x 660 runs


VM Type Heterogeneity Performance Variation


Performance Variance - m1.large

RUSLE2 8%, WEPS 9%

Performance Variance – m2.xlarge

RUSLE2 14%, WEPS 4%


Noisy Neighbor Detection Methodology (NN-Detect)

  • Noisy neighbors cause resource contention and degrade performance of worker VMs
  • Identify noisy neighbors by analyzing cpuSteal
  • Detection method:

  Step 1: Execute a processor-intensive workload across a pool of VMs.
  Step 2: Capture total cpuSteal for each VM for the workload.
  Step 3: Calculate average cpuSteal for the workload (cpuSteal_avg).

Identify NNs using statistical outliers, and by assigning application-specific thresholds through observation…
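Steps 2 and 3 and the outlier test can be sketched as follows. The slides do not give the exact outlier rule, so the cutoff used here (pool average plus `k` population standard deviations, with an optional application-specific floor `min_steal`) is an assumption chosen for illustration, as are the function name and the sample values.

```python
from statistics import mean, pstdev


def detect_noisy_neighbors(steal_by_vm, k=2.0, min_steal=0.0):
    """Flag VMs whose total cpuSteal is a high outlier within the pool.

    steal_by_vm: {vm_id: total cpuSteal captured during the workload}
    k, min_steal: hypothetical tuning knobs, not values from the slides.
    Returns (cpu_steal_avg, sorted list of flagged VM ids).
    """
    values = list(steal_by_vm.values())
    cpu_steal_avg = mean(values)                       # step 3: pool average
    cutoff = max(cpu_steal_avg + k * pstdev(values), min_steal)
    flagged = sorted(vm for vm, s in steal_by_vm.items() if s > cutoff)
    return cpu_steal_avg, flagged


# Illustrative pool: nine quiet VMs and one with heavy steal.
pool = {f"vm-{i}": 5.0 for i in range(9)}
pool["vm-9"] = 100.0

avg, noisy = detect_noisy_neighbors(pool)
print(avg, noisy)
```

Raising `min_steal` corresponds to the application-specific threshold mentioned above: it prevents flagging VMs in pools where steal is uniformly negligible.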


Amazon EC2 CpuSteal Analysis

VM type      | Host CPU (Intel Xeon) | Average R² (linear reg.) | Average cpuSteal per core | % with noisy neighbors
us-east-1c
c3.large-2c  | E5-2680v2/10c | .1753 | 2.35     | 0%
m3.large-2c  | E5-2670v2/10c |       | 1.58     | 0%
m1.large-2c  | E5-2650v0/8c  | .5568 | 7.62     | 12%
m2.xlarge-2c | X5550/4c      | .4490 | 310.25   | 18%
m1.xlarge-4c | E5-2651v2/12c | .9431 | 7.25     | 4%
m3.medium-1c | E5-2670v2/10c | .0646 | 17683.21 | n/a
c1.xlarge-8c | E5-2651v2/12c | .3658 | 1.86     | 0%
us-east-1d
m1.medium-1c | E5-2650v0/8c  | .4545 | 6.2      | 10%
m2.xlarge-2c | E5-2665v0/8c  | .0911 | 3.14     | 0%

Test Configuration:

  • Completed 4 x 1000 WEPS runs over ~5 hours
  • Pools of ~50 VMs (c1.xlarge: 25, m3/m1.medium: 60)
  • Round-robin load balancing of runs across pools


Key Result #1

4 of 9 VM types had R² > 0.44: m1.large, m2.xlarge, m1.xlarge, m1.medium

Key Result #2

Where cpuSteal could not be predicted, it did not exist. This hardware tended to be CPU-core dense (e.g. 8, 10, or 12 cores).
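For reference, the R² of a simple linear regression can be computed as below. The slides do not say which variables were regressed; this sketch assumes the per-VM cpuSteal from one batch of runs is used to predict the per-VM cpuSteal of a later batch, and the sample values are made up for illustration.

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares line fit of y on x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # no variation to explain: treat as unpredictable
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-VM cpuSteal totals from two consecutive batches.
batch_1 = [1.0, 2.0, 3.0, 4.0]
batch_2 = [2.0, 4.0, 6.0, 8.0]

print(r_squared(batch_1, batch_2))
```

Under this reading, a high R² means the same VMs keep showing high cpuSteal from batch to batch, which is what makes a noisy neighbor detectable ahead of time.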


Noisy Neighbor Performance Degradation

Compared performance of small 5-VM pools:

  • 5 noisy-neighbor VMs
  • 5 regular VMs

Workloads:

  • WEPS: 10 x 100 runs
  • RUSLE2: 10 x 660 runs

Normalized results to VM pools without NNs


EC2 Noisy Neighbor Performance Degradation

VM type   | Host CPU      | Region     | WEPS                              | RUSLE2
m1.large  | E5-2650v0/8c  | us-east-1c | 117.68%, df=9.866, p=6.847·10^-8  | 125.42%, df=9.003, p=.016
m2.xlarge | X5550/4c      | us-east-1c | 107.3%, df=19.159, p=.05232       | 102.76%, df=25.34, p=1.73·10^-11
c1.xlarge | E5-2651v2/12c | us-east-1c | 100.73%, df=9.54, p=.1456         | 102.91%, n.s.
m1.medium | E5-2650v0/8c  | us-east-1d | 111.6%, df=13.459, p=6.25·10^-8   | 104.32%, df=9.196, p=1.173·10^-5


Key Result #1

Maximum performance loss: WEPS 18%, RUSLE2 25%

Key Result #2

3 VM types with significant performance loss (p <.05)

Average performance loss: WEPS/RUSLE2 ~ 9%
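The fractional degrees of freedom in the table are consistent with an unequal-variance (Welch) t-test, although the slides do not name the test used. The sketch below, under that assumption, shows how a normalized runtime percentage and the Welch statistic with its Welch-Satterthwaite degrees of freedom could be computed; the runtimes are invented for illustration, and the p-value lookup is omitted because it needs a t-distribution CDF from a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def normalized_runtime_pct(noisy, regular):
    """Mean runtime of the noisy pool as a percentage of the regular pool."""
    return 100.0 * mean(noisy) / mean(regular)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)   # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical batch runtimes in seconds for two 5-VM pools.
regular_pool = [1.0, 2.0, 3.0, 4.0, 5.0]
noisy_pool = [2.0, 3.0, 4.0, 5.0, 6.0]

t_stat, dof = welch_t(regular_pool, noisy_pool)
slowdown = normalized_runtime_pct(noisy_pool, regular_pool)
print(t_stat, dof, slowdown)
```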


Conclusions

  • Hardware (CPU-type) heterogeneity is more prevalent for legacy instance implementations
      • Performance variance up to ~14%
  • For large VM pools, 4 of 9 instance types had noisy neighbors, which produced statistically significant performance variance
      • Performance variance up to ~25%


Future Work

VM-Scaler: Trial-and-better VM pool creation

  • Provide cpuSteal noisy neighbor detection
  • Consider: are we detecting ourselves or other users?

Extend noisy neighbor detection techniques

  • Memory, disk, network resource contention

Evaluate new instance types, public clouds, and science applications