Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters


Ignacio Cano, Srinivas Aiyar, Arvind Krishnamurthy. University of Washington, Nutanix Inc. ACM Symposium on Cloud Computing, October 2016.


SLIDE 1

Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters

Ignacio Cano, Srinivas Aiyar, Arvind Krishnamurthy
University of Washington, Nutanix Inc.

ACM Symposium on Cloud Computing, October 2016

SLIDE 2

Private Clouds

SLIDE 3

Private Clouds

  • Cloud computing that delivers service to a single organization, as opposed to public clouds, which serve many.
  • Direct control of infrastructure and data.
  • Carry management and maintenance costs.

SLIDE 4

Motivation

  • Increasing trend in the use of private clouds within companies.
  • Private cloud deployments require careful consideration of what will happen in the future:
    – Capacity
    – Failures
    – …

SLIDE 5

Motivation

  • Research Questions:
    – What are the most common failures?
    – What type of workloads are typically run?
    – How is the storage used? What about CPU usage?
    – How do additional replicas impact data durability?
    – What causes companies to expand their clusters?

Need Measurement Data!

SLIDE 6

Related Work

Setting \ Study: Hardware Failures, Storage, Compute

Desktops
  • HW Failures in PCs [Nightingale et al., EuroSys’11]
  • Metadata in Windows PCs [Agrawal et al., TOS’07]
  • I/O on Apple computers [Harter et al., SOSP’11]
  • Disk/CPU Usage and Load [Bolosky et al., SIGMETRICS’00]

Public Clouds
  • HW reliability [Vishwanath et al., SoCC’11]
  • Data Characteristics and Access Patterns [Liu et al., IEEE/ACM CCGrid’13]
  • Workload characterization [Mishra et al., SIGMETRICS’10]
  • Scheduling on Heterogeneous Clusters [Reiss et al., SoCC’12]

Limited prior work on Private Clouds!

SLIDE 7

In this talk

  • Large-Scale Measurement Study of Private Clouds
    – Lower hardware failure rates
    – Nodes overprovisioned
    – Stable storage and CPU usage
  • Modeling based on the Measurements
    – Each extra replica provides substantial durability improvements
    – Storage needs drive growth more than compute

SLIDE 8

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 9

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 10

Nutanix Clusters

Operations are interposed at the hypervisor level and redirected to CVMs.

  • Global view of cluster state
  • Integrated Compute-Storage
  • Random replication
  • VM migration
  • …

SLIDE 11

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 12

Clusters

Summary Statistics    Value
# of Clusters         2168

SLIDE 13

Clusters

Summary Statistics    Value
# of Clusters         2168
# of Nodes            13394

6.18 Nodes/Cluster

SLIDE 14

Clusters

Summary Statistics    Value
# of Clusters         2168
# of Nodes            13394
Cluster Sizes         3 - 40

SLIDE 15

Clusters

Summary Statistics    Value
# of Clusters         2168
# of Nodes            13394
Cluster Sizes         3 - 40
# of Disks            ~ 70K
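
The per-cluster average quoted earlier in the deck follows directly from these totals. A minimal Python sketch of that arithmetic; the disk count is only given as roughly 70K on the slide, so the disks-per-node figure derived here is approximate and is not a number reported in the talk.

```python
# Totals as reported on the slide.
num_clusters = 2168
num_nodes = 13394
approx_num_disks = 70_000  # the slide says "~ 70K", so this is a rounded figure

nodes_per_cluster = num_nodes / num_clusters
disks_per_node = approx_num_disks / num_nodes  # derived here, not stated in the talk

print(f"{nodes_per_cluster:.2f} nodes/cluster")  # agrees with the 6.18 on the slide
print(f"roughly {disks_per_node:.1f} disks/node")
```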

SLIDE 16

Node Configurations

                 Storage               Compute                    Memory
Configuration    SSD (TB)   HDD (TB)   Cores   Clock Rate (GHz)   (GB)
Config-1         1.6        8          24      2.5                384
Config-2         0.8        4          12      2.4                128
Config-3         0.8        30         16      2.4                256

Callouts on the table: Compute-heavy, Storage-heavy

Mostly homogeneous within a cluster

SLIDE 17

Workloads

Workload                          Example Applications                      Configuration
Virtual Desktop Infrastructure    Citrix XenDesktop, VMware Horizon/View    Config-1

SLIDE 18

Workloads

Workload                          Example Applications                      Configuration
Virtual Desktop Infrastructure    Citrix XenDesktop, VMware Horizon/View    Config-1
Server                            SQL Server, Exchange Mail Server          Config-2, Config-3

SLIDE 19

Workloads

Workload                          Example Applications                      Configuration
Virtual Desktop Infrastructure    Citrix XenDesktop, VMware Horizon/View    Config-1
Server                            SQL Server, Exchange Mail Server          Config-2, Config-3
Big Data                          Splunk, Hadoop                            Config-3

SLIDE 20

Workloads

Workload                          Example Applications                      Configuration
Virtual Desktop Infrastructure    Citrix XenDesktop, VMware Horizon/View    Config-1
Server                            SQL Server, Exchange Mail Server          Config-2, Config-3
Big Data                          Splunk, Hadoop                            Config-3
Others                            IT Infrastructure, Custom applications    Mix

SLIDE 21

Distribution of VMs per Node

[Figure: average # of VMs per node (y-axis) vs. size of cluster in # of nodes (x-axis), broken down by VM size: 1 vCPU, 2-4 vCPUs, > 4 vCPUs]

Callouts: Highest density; Lowest density; Median 21; Most 2-4 vCPUs

SLIDE 22

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 23

Failures

  • We only consider failures that require manual intervention, i.e., human operators annotate the cause of the problem.

SLIDE 24

Hardware Failures

[Figure: bar chart of % of total hardware cases per component. Components as listed on the chart: HDD, Memory, SSD, PSU, BIOS-Image, IPMI, Node, Chassis, NIC, BMC-Image, BMC-Hardware, Cables, CPU, Fan, Rail, GPU]

Top 3 account for around 50% of HW failures

SLIDE 25

Annual Return Rate

Component    ARR (%)
HDD          0.76

Callout: 2-9 % in prior studies

SLIDE 26

Annual Return Rate

Component    ARR (%)
HDD          0.76
SSD          0.72

Callout (added with the SSD row): 4-10 % in prior studies (4 years)

  • Lower return rates
  • Enterprise-grade commodity HW
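
The slides report ARR values without spelling out the formula. One common way to annualize a return rate is to divide the number of returned units by the unit-years of exposure; the sketch below assumes that definition. The function name and the counts in the usage line are hypothetical, chosen only so that the result lands on a 0.76 % figure.

```python
def annual_return_rate(returned_units: int, deployed_units: int, years_observed: float) -> float:
    """Return an annualized return rate in percent.

    ASSUMPTION: ARR = returned / (deployed * years observed) * 100.
    The talk does not state the exact formula it uses.
    """
    if deployed_units <= 0 or years_observed <= 0:
        raise ValueError("deployed_units and years_observed must be positive")
    unit_years = deployed_units * years_observed
    return 100.0 * returned_units / unit_years


# Hypothetical counts: 76 drives returned out of 10,000 deployed for one year.
hdd_arr = annual_return_rate(76, 10_000, 1.0)
print(f"HDD ARR: {hdd_arr:.2f} %")
```

Normalizing by unit-years matters when comparing against multi-year figures such as the 4-year range quoted on the slide, since a cumulative rate over several years is not directly comparable to a per-year rate.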

SLIDE 27

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 28

Workload Characteristics

  • Usage over time seems to be stable/predictable: 80% of the clusters use
    – Storage: mean <= 50%, std <= 8%
    – CPU: mean <= 20%, std <= 5%
  • SSDs can generally maintain the working set
    – 80% of nodes use <= 500 GB for the working set
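
A statement of the form "80% of the clusters have mean <= X and std <= Y" can be checked with a few lines of code once per-cluster usage samples are available. The sketch below is illustrative only: the cluster names and usage percentages are made up, and it assumes the population standard deviation, since the slide does not say which estimator was used.

```python
from statistics import mean, pstdev


def is_stable(samples, max_mean, max_std):
    """True if a cluster's usage series stays within both thresholds."""
    # ASSUMPTION: population standard deviation; the slide only says "std".
    return mean(samples) <= max_mean and pstdev(samples) <= max_std


def fraction_stable(usage_by_cluster, max_mean, max_std):
    """Fraction of clusters whose usage satisfies the mean and std thresholds."""
    flags = [is_stable(s, max_mean, max_std) for s in usage_by_cluster.values()]
    return sum(flags) / len(flags)


# Hypothetical storage usage (% of capacity) sampled over time for five clusters.
storage_usage = {
    "cluster-a": [40, 42, 41, 43, 44],
    "cluster-b": [30, 31, 29, 30, 32],
    "cluster-c": [20, 22, 21, 23, 24],
    "cluster-d": [45, 46, 47, 48, 49],
    "cluster-e": [30, 60, 35, 80, 55],  # high mean and high variance
}

# Thresholds taken from the storage bullet on the slide.
share = fraction_stable(storage_usage, max_mean=50, max_std=8)
print(f"{share:.0%} of clusters meet the storage thresholds")
```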

SLIDE 29

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 30

Durability Model

  • Estimate the probability of data loss.
  • Assumptions:
    – replication factor of 2
    – random replication (replicate to a random node)
  • The time required to create a new replica when a node goes down:

$$\Delta t = \frac{d}{(n - 1)\,v}$$

    where d = data to be replicated, v = data transfer rate, n − 1 = remaining live nodes
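
To get a feel for the magnitude of the re-replication time, the formula can be evaluated directly. The sketch below reads v as a per-node transfer rate, so that the n − 1 surviving nodes rebuild in parallel; that reading and all input values are assumptions for illustration, not figures from the study.

```python
def rebuild_time_seconds(data_bytes: float, num_nodes: int, rate_bytes_per_s: float) -> float:
    """delta_t = d / ((n - 1) * v), with the n - 1 live nodes copying in parallel."""
    if num_nodes < 2:
        raise ValueError("need at least one surviving node to re-replicate")
    return data_bytes / ((num_nodes - 1) * rate_bytes_per_s)


# Hypothetical inputs: 2 TB held by the failed node, a 6-node cluster,
# and 100 MB/s of replication bandwidth per surviving node.
dt = rebuild_time_seconds(2e12, 6, 100e6)
print(f"rebuild window: {dt:.0f} s")
```

Larger clusters shorten the window because more nodes share the copy work, which is why cluster size enters the model through the n − 1 term.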

SLIDE 31

Durability Model

  • p(∆t) = probability of node failure in ∆t time.
  • We decompose the overall period over which we want to provide the durability guarantee into a sequence of intervals, each of length ∆t.
  • Q = data loss event where two failures occur within ∆t time, i.e., data could not be replicated.

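SLIDE 32

Durability Model

  • Then the probability that there is no data loss in an interval ∆t:

$$P(\neg Q, \Delta t) \le (1 - p(\Delta t))^{n} + n\,p(\Delta t)\,(1 - p(\Delta t))^{n-1}\,(1 - p(\Delta t))^{n-1}$$

    – $(1 - p(\Delta t))^{n}$: no failures
    – $n\,p(\Delta t)\,(1 - p(\Delta t))^{n-1}$: exactly one node fails
    – $(1 - p(\Delta t))^{n-1}$: the remaining n − 1 nodes do not fail within ∆t time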

SLIDE 33

Durability Model

  • On a yearly basis, we consider all ∆t intervals in a year.
  • Probability of no data loss within a year is:

$$P_{\text{durability}} = P(\neg Q, \Delta t)^{N(\Delta t)}$$

    where N(∆t) = # of intervals of ∆t time in a year
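
The per-interval expression and the yearly exponentiation can be put together in a short sketch. The code evaluates the right-hand side of the per-interval expression as written on the slides and raises it to N(∆t). How p(∆t) is obtained is not stated in the talk; the `node_failure_prob` helper below assumes an exponential failure model driven by an annual failure rate, which is an assumption made here purely for illustration, as are all numeric inputs.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def node_failure_prob(annual_failure_rate: float, dt_seconds: float) -> float:
    """p(dt) under an ASSUMED exponential failure model (not from the talk)."""
    rate_per_second = -math.log1p(-annual_failure_rate) / SECONDS_PER_YEAR
    return -math.expm1(-rate_per_second * dt_seconds)  # 1 - exp(-rate * dt)


def no_loss_prob_interval(p: float, n: int) -> float:
    """Right-hand side of the per-interval expression for P(not Q, dt)."""
    no_failures = (1 - p) ** n
    exactly_one_fails = n * p * (1 - p) ** (n - 1)
    others_survive = (1 - p) ** (n - 1)  # remaining n - 1 nodes do not fail within dt
    return no_failures + exactly_one_fails * others_survive


def yearly_durability(p: float, n: int, dt_seconds: float) -> float:
    """P_durability = P(not Q, dt) ** N(dt), with N(dt) intervals per year."""
    intervals_per_year = SECONDS_PER_YEAR / dt_seconds
    return no_loss_prob_interval(p, n) ** intervals_per_year


# Hypothetical inputs: 6 nodes, a 4000 s rebuild window, 5 % annual node failure rate.
dt = 4000.0
p = node_failure_prob(0.05, dt)
durability = yearly_durability(p, 6, dt)
print(f"p(dt) = {p:.3e}, yearly durability = {durability:.7f}")
```

Because the loss probability per interval scales roughly with the square of p(∆t), shortening the rebuild window or lowering the failure rate improves durability much faster than linearly.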

SLIDE 34

Durability in Private Clouds

[Figure: fraction of clusters (y-axis) vs. data loss probability (x-axis, log scale from 1e-12 to 0.0001), with one curve for RF2 and one for RF3]

  • Rule of Thumb: each additional replica provides an additional 5 9’s of durability
  • Most clusters have 5 9’s with RF2, and 10 9’s with RF3
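
The "number of 9's" is just the negative base-10 logarithm of the data loss probability, which is why the plot uses a log-scale axis. A small sketch of the conversion; the two loss probabilities are round illustrative values matching the rule of thumb, not readings from the figure.

```python
import math


def nines(loss_probability: float) -> float:
    """Number of 9's of durability, e.g. a loss probability of 1e-5 gives 5."""
    if not 0.0 < loss_probability < 1.0:
        raise ValueError("loss_probability must be strictly between 0 and 1")
    return -math.log10(loss_probability)


# Illustrative values only: 1e-5 for RF2 and 1e-10 for RF3.
for label, loss in [("RF2", 1e-5), ("RF3", 1e-10)]:
    print(f"{label}: about {nines(loss):.0f} nines of durability")
```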

SLIDE 35

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 36

Cluster Growth Analysis

  • Customers periodically add nodes to their existing clusters.
  • What drives such growth?
  • We resort to machine learning
    – Binary classification problem
    – Logistic Regression with L1 regularization

SLIDE 37

Cluster Growth Analysis

  • Use 200 clusters that grew at least once in a period of 8 months.
  • 15K examples (70% train, 10% val, 20% test).
  • Train with different combinations of features to understand which are important.
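
The 70/10/20 partition of the 15K examples can be sketched as follows. The talk does not say whether the split was random, by time, or by cluster; the version below assumes a seeded random shuffle, and the function name is made up for this example.

```python
import random


def split_indices(num_examples: int, train_pct: int = 70, val_pct: int = 10, seed: int = 0):
    """Shuffle example indices and cut them into train / validation / test parts.

    ASSUMPTION: a uniformly random split; the talk does not describe the method.
    Integer percentages avoid floating-point rounding in the cut points.
    """
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    n_train = num_examples * train_pct // 100
    n_val = num_examples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains (20 % here)
    return train, val, test


train_idx, val_idx, test_idx = split_indices(15_000)
print(len(train_idx), len(val_idx), len(test_idx))  # 10500 1500 3000
```

If examples from the same cluster are correlated over time, splitting by cluster or by date would give a stricter test than a random shuffle.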

SLIDE 38

Features

Cluster Features (Fc)        Description
  n(nodes)                   discretized # of nodes
  n(vms)                     # of vms per node

Storage Features (Fs)        Description
  r(ssd)                     ssd usage to ssd capacity ratio
  r(hdd)                     hdd usage to hdd capacity ratio
  r(store)                   storage usage to total capacity ratio

Performance Features (Fp)    Description
  n(vcpus)                   # of virtual cpus
  n(iops)                    # of iops per node
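
L1 regularization is a natural fit for this kind of feature study because it tends to push the weights of uninformative features to exactly zero. The sketch below shows the idea with a small pure-Python proximal gradient solver. It is not the authors' implementation (the talk does not name one), and the data is synthetic: one feature loosely named after r(hdd) determines the label, and a second feature is pure noise.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(x: float, t: float) -> float:
    """Proximal operator of t * |x|; shrinks x toward zero and clips small values to 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def fit_l1_logistic(X, y, l1=0.1, lr=0.5, epochs=300):
    """Proximal gradient descent on mean log-loss + l1 * ||w||_1 (bias is not penalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [soft_threshold(wj - lr * g / n, lr * l1) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, xi) -> int:
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0


# Synthetic data (NOT from the study): feature 0 stands in for a centered r(hdd),
# feature 1 is noise. A cluster "grows" (label 1) when feature 0 is positive.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
y = [1 if row[0] > 0 else 0 for row in X]

w, b = fit_l1_logistic(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("weights:", [round(wj, 3) for wj in w], "bias:", round(b, 3), "accuracy:", accuracy)
```

Comparing the learned weights across feature groups (Fc, Fs, Fp) is one way to read off which features matter; training separate models on different feature combinations, as the talk describes, is a complementary check.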

SLIDE 39

What drives cluster growth?

  1. Cluster Size: upgrades from 3-4 node clusters
  2. Storage Needs: HDD usage
  3. Compute Needs: number of VMs

Storage more than compute!

SLIDE 40

Outline

  • Large-Scale Measurement Study of Private Clouds
    – Context
    – Cluster Profiles
    – Failure Analysis
    – Workload Characteristics
  • Modeling based on the Measurements
    – Durability
    – Cluster Growth

SLIDE 41

Conclusions

  • Large-Scale Measurement Study of Private Clouds
    – Lower hardware failure rates
    – Nodes overprovisioned
    – Stable storage and CPU usage
  • Modeling based on the Measurements
    – Each extra replica provides substantial durability improvements
    – Storage needs drive growth more than compute

SLIDE 42

Thanks!
