Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters

Ignacio Cano, Srinivas Aiyar, Arvind Krishnamurthy. University of Washington and Nutanix Inc. ACM Symposium on Cloud Computing, October 2016.


  1. Characterizing Private Clouds: A Large-Scale Empirical Analysis of Enterprise Clusters. Ignacio Cano, Srinivas Aiyar, Arvind Krishnamurthy. University of Washington and Nutanix Inc. ACM Symposium on Cloud Computing, October 2016.

  2. Private Clouds

  3. Private Clouds
     • Cloud computing that delivers service to a single organization, as opposed to public clouds, which serve many.
     • Direct control of infrastructure and data.
     • Carry management and maintenance costs.

  4. Motivation
     • Increasing trend in the use of private clouds within companies.
     • Private cloud deployments require careful consideration of what will happen in the future:
       – Capacity
       – Failures
       – …

  5. Motivation
     • Research questions:
       – What are the most common failures?
       – What type of workloads are typically run?
       – How is the storage used? What about CPU usage?
       – How do additional replicas impact data durability?
       – What causes companies to expand their clusters?
     • Need measurement data!

  6. Related Work (setting vs. study type)
     • Desktops
       – Hardware failures: HW failures in PCs [Nightingale et al., EuroSys’11]
       – Storage: metadata in Windows PCs [Agrawal et al., TOS’07]; I/O on Apple computers [Harter et al., SOSP’11]
       – Compute: disk/CPU usage and load [Bolosky et al., SIGMETRICS’00]
     • Public Clouds
       – Hardware failures: HW reliability [Vishwanath et al., SoCC’11]
       – Storage: data characteristics and access patterns [Liu et al., IEEE/ACM CCGrid’13]
       – Compute: workloads characterization [Mishra et al., SIGMETRICS’10]; scheduling on heterogeneous clusters [Reiss et al., SoCC’12]
     • Private Clouds
       – Limited prior work on private clouds!

  7. In this talk
     • Large-Scale Measurement Study of Private Clouds
       – Lower hardware failure rates
       – Nodes overprovisioned
       – Stable storage and CPU usage
     • Modeling based on the Measurements
       – Each extra replica provides substantial durability improvements
       – Storage needs drive growth more than compute

  8. Outline
     • Large-Scale Measurement Study of Private Clouds
       – Context
       – Cluster Profiles
       – Failure Analysis
       – Workload Characteristics
     • Modeling based on the Measurements
       – Durability
       – Cluster Growth

  10. Nutanix Clusters
      • Integrated compute-storage.
      • Operations interposed at the hypervisor level and redirected to CVMs.
      • Random replication.
      • VM migration.
      • Global view of cluster state.

  13. Clusters Summary (statistic: value)
      • # of clusters: 2168
      • # of nodes: 13394 (6.18 nodes/cluster)

  15. Clusters Summary (statistic: value)
      • # of clusters: 2168
      • # of nodes: 13394
      • Cluster sizes: 3 - 40 nodes
      • # of disks: ~70K

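  16. Node Configurations (storage: SSD, HDD; compute: cores, clock rate; memory)
      • Config-1: SSD 1.6 TB, HDD 8 TB, 24 cores, 2.5 GHz, 384 GB memory
      • Config-2: SSD 0.8 TB, HDD 4 TB, 12 cores, 2.4 GHz, 128 GB memory
      • Config-3: SSD 0.8 TB, HDD 30 TB, 16 cores, 2.4 GHz, 256 GB memory
      • Configurations are annotated as storage-heavy or compute-heavy.
      • Mostly homogeneous within a cluster.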

  20. Workloads (workload: example applications, configuration)
      • Virtual Desktop Infrastructure: Citrix XenDesktop, VMware Horizon/View (Config-1)
      • Server: SQL Server (Config-2), Exchange Mail Server (Config-3)
      • Big Data: Splunk, Hadoop (Config-3)
      • Others: IT infrastructure, custom applications (mix)

  21. Distribution of VMs per Node
      [Figure: average # of VMs per node (y-axis, 0 to 35) by size of cluster in # of nodes (x-axis, 3 to 32), broken down into 1 vCPU, 2-4 vCPUs, and > 4 vCPUs. Annotations on the plot: "Most 2-4 vCPUs", "Median 21", "Highest density", "Lowest density".]

  23. Failures
      • We only consider failures that require manual intervention, i.e., human operators annotate the cause of the problem.

  24. Hardware Failures
      [Figure: bar chart of % of total hardware cases (x-axis, 0 to 20) per component, listed top to bottom: HDD, Memory, SSD, PSU, BIOS-Image, IPMI, Node, Chassis, NIC, BMC-Image, BMC-Hardware, Cables, CPU, Fan, Rail, GPU.]
      • Top 3 account for around 50% of HW failures.

  25. Annual Return Rate (ARR)
      • HDD: 0.76% (prior studies: 2-9%)

  26. Annual Return Rate (ARR)
      • HDD: 0.76%
      • SSD: 0.72% (prior studies: 4-10% over 4 years)
      • Lower return rates: enterprise-grade vs. commodity HW.

  28. Workload Characteristics
      • Usage over time seems to be stable/predictable: 80% of the clusters use
        – Storage: mean <= 50%, std <= 8%
        – CPU: mean <= 20%, std <= 5%
      • SSDs can generally maintain the working set
        – 80% of nodes use <= 500 GB for the working set
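The thresholds on slide 28 can be checked against a utilization time series roughly as follows. This is a minimal sketch, not the authors' tooling: the function names, the sample data, and the use of the population standard deviation are all assumptions, since the slide does not say how the mean and std were computed.

```python
from statistics import mean, pstdev

def is_stable(usage_pct, max_mean, max_std):
    """True if a utilization series (in percent) meets both thresholds."""
    return mean(usage_pct) <= max_mean and pstdev(usage_pct) <= max_std

def fraction_stable(series_by_cluster, max_mean, max_std):
    """Fraction of clusters whose series passes is_stable."""
    flags = [is_stable(s, max_mean, max_std) for s in series_by_cluster.values()]
    return sum(flags) / len(flags)

# Hypothetical storage utilization samples (percent), not data from the study.
storage = {
    "cluster-a": [40, 42, 44, 41, 43],
    "cluster-b": [10, 90, 10, 90],   # acceptable mean, but far too bursty
    "cluster-c": [30, 31, 29, 30],
}

# Storage thresholds from the slide: mean <= 50%, std <= 8%.
print(fraction_stable(storage, max_mean=50, max_std=8))
```

For CPU, the same call would use `max_mean=20` and `max_std=5`.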

  30. Durability Model
      • Estimate the probability of data loss.
      • Assumptions:
        – replication factor of 2
        – random replication (replicate to a random node)
      • The time required to create a new replica when a node goes down:
        $\Delta t = \frac{d}{(n-1)\,v}$
        where d is the data to be replicated, n − 1 is the number of remaining live nodes, and v is the data transfer rate.
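As a quick illustration of the replication window on slide 30, the sketch below evaluates ∆t = d / ((n − 1) v). The function name, the units (TB and TB/hour), and the example numbers are assumptions for illustration only.

```python
def replication_window(d, n, v):
    """Return the re-replication time d / ((n - 1) * v).

    d: data to be replicated (e.g. TB held by the failed node)
    n: number of nodes in the cluster before the failure
    v: data transfer rate per live node (e.g. TB per hour)
    """
    if n < 2:
        raise ValueError("need at least one remaining live node")
    if v <= 0:
        raise ValueError("transfer rate must be positive")
    return d / ((n - 1) * v)

# Hypothetical example: 6 TB to re-replicate, 3-node cluster, 1 TB/hour per node.
print(replication_window(d=6.0, n=3, v=1.0))  # 3.0 (hours)
```

Because the work is spread over the n − 1 surviving nodes, larger clusters shorten the window during which a second failure can cause data loss.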

  31. Durability Model
      • p(∆t) = probability of node failure in ∆t time.
      • We decompose the overall period over which we want to provide the durability guarantee into a sequence of intervals, each of length ∆t.
      • Q = data loss event where two failures occur within ∆t time, i.e., data could not be replicated.

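  32. Durability Model
      • Then the probability that there is no data loss in an interval ∆t:
        $P(\neg Q, \Delta t) \le (1 - p(\Delta t))^{n} + n\,p(\Delta t)\,(1 - p(\Delta t))^{n-1}\,(1 - p(\Delta t))^{n-1}$
        – $(1 - p(\Delta t))^{n}$: no failures.
        – $n\,p(\Delta t)\,(1 - p(\Delta t))^{n-1}$: exactly one node fails.
        – the final $(1 - p(\Delta t))^{n-1}$: the remaining n-1 nodes do not fail within ∆t time.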
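The bound on slide 32 can be evaluated term by term. The sketch below is a direct transcription of that expression; the function name and the sample inputs are hypothetical.

```python
def no_loss_bound(p, n):
    """Upper bound on P(no data loss in one interval) from slide 32.

    p: probability that a given node fails within one interval of length Δt
    n: number of nodes in the cluster
    """
    no_failures = (1 - p) ** n
    # Exactly one node fails, then the remaining n-1 nodes survive the next Δt.
    one_failure = n * p * (1 - p) ** (n - 1)
    survivors_hold = (1 - p) ** (n - 1)
    return no_failures + one_failure * survivors_hold

# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
print(no_loss_bound(p=0.5, n=2))  # 0.25 + 0.25 = 0.5
```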

  33. Durability Model
      • On a yearly basis, we consider all ∆t intervals in a year.
      • Probability of no data loss within a year:
        $P_{\text{durability}} = P(\neg Q, \Delta t)^{N(\Delta t)}$
        where N(∆t) is the # of intervals of ∆t time in a year.
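To make the yearly step on slide 33 concrete, the sketch below raises the per-interval probability to the number of intervals in a year. The function name, the choice of hours as the unit, and the 365-day year are assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, Δt measured in hours

def yearly_durability(p_no_loss_interval, dt_hours):
    """P_durability = P(no loss in one interval) ** N(Δt)."""
    n_intervals = HOURS_PER_YEAR / dt_hours
    return p_no_loss_interval ** n_intervals

# Hypothetical example: two half-year intervals, each with 0.9 no-loss probability.
print(yearly_durability(0.9, dt_hours=4380.0))
```

For realistic inputs the per-interval probability is extremely close to 1, so a production version would likely work in log space (for example with `math.log1p`) to avoid losing precision.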

  34. Durability in Private Clouds
      [Figure: fraction of clusters (y-axis, 0 to 1) vs. data loss probability (x-axis, 1e-12 to 0.0001), with one curve for RF2 and one for RF3.]
      • Most clusters have 5 9’s with RF2, and 10 9’s with RF3.
      • Rule of thumb: each additional replica provides an additional 5 9’s of durability.
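The "9's of durability" on slide 34 relate to the data loss probability on the plot's x-axis by a base-10 logarithm. A minimal sketch, with a hypothetical helper name and illustrative probabilities rather than measured ones:

```python
import math

def nines(loss_probability):
    """Number of 9's of durability for a given yearly data loss probability."""
    if not 0 < loss_probability < 1:
        raise ValueError("loss probability must be strictly between 0 and 1")
    return -math.log10(loss_probability)

# Illustrative values matching the slide's summary, not per-cluster measurements.
rf2_loss = 1e-5    # about five 9's
rf3_loss = 1e-10   # about ten 9's
print(nines(rf2_loss), nines(rf3_loss))
```

Under these illustrative values the difference between RF3 and RF2 is five 9's, which is the rule of thumb the slide states.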

  36. Cluster Growth Analysis
      • Customers periodically add nodes to their existing clusters.
      • What drives such growth?
      • We resort to machine learning
        – Binary classification problem
        – Logistic Regression with L1 regularization
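Slide 36 names logistic regression with L1 regularization but gives no solver or hyperparameters. The sketch below is one dependency-free way to fit such a model, using proximal gradient descent with soft-thresholding; the solver choice, learning rate, penalty strengths, and toy data are all assumptions and are not taken from the study.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logreg(X, y, lam, lr=0.1, epochs=500):
    """Proximal gradient descent on mean logistic loss + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Gradient step on the loss, then shrinkage for the L1 term (bias unpenalized).
        w = [soft_threshold(wj - lr * gj / n, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data (hypothetical): feature 0 decides the label, feature 1 is balanced noise.
X = [[x0, s] for x0 in (-1.0, -0.5, 0.5, 1.0) for s in (1.0, -1.0)]
y = [1 if x0 > 0 else 0 for x0, _ in X]

w_mild, _ = fit_l1_logreg(X, y, lam=0.05)
w_strong, _ = fit_l1_logreg(X, y, lam=10.0)
print(w_mild, w_strong)
```

With the mild penalty only the informative feature keeps a non-zero weight, and with the strong penalty every weight is driven to exactly zero. That sparsity is what makes L1-regularized models convenient for asking which features matter.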

  37. Cluster Growth Analysis
      • Use 200 clusters that grew at least once in a period of 8 months.
      • 15K examples (70% train, 10% val, 20% test).
      • Train with different combinations of features to understand which are important.
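The 70/10/20 split on slide 37 could be produced along these lines. This is a sketch under assumptions: the slide does not say whether examples were shuffled, or whether the split was made per example or per cluster, and the function name and seed are hypothetical.

```python
import random

def split_examples(examples, train_pct=70, val_pct=10, seed=0):
    """Shuffle and split into train/val/test; test takes the remainder."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Stand-in for the roughly 15K examples mentioned on the slide.
train, val, test = split_examples(range(15000))
print(len(train), len(val), len(test))  # 10500 1500 3000
```

Integer percentages keep the split sizes exact and avoid floating-point rounding at the boundaries.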

  38. Features
      • Cluster features F_c
        – n(nodes): discretized # of nodes
        – n(vms): # of VMs per node
      • Storage features F_s
        – r(ssd): SSD usage to SSD capacity ratio
        – r(hdd): HDD usage to HDD capacity ratio
        – r(store): storage usage to total capacity ratio
      • Performance features F_p
        – n(vcpus): # of virtual CPUs
        – n(iops): # of IOPS per node
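Several of the features on slide 38 are simple ratios. The sketch below shows one way to derive them from a per-cluster snapshot. The `ClusterSnapshot` type and its field names are hypothetical, and treating "storage usage to total capacity" as combined SSD plus HDD usage over combined capacity is an assumption. The discretization of n(nodes) is not specified on the slide, so it is left out.

```python
from dataclasses import dataclass

@dataclass
class ClusterSnapshot:
    """Hypothetical per-cluster measurements at one point in time."""
    nodes: int
    vms: int
    ssd_used_tb: float
    ssd_capacity_tb: float
    hdd_used_tb: float
    hdd_capacity_tb: float

def ratio_features(s):
    """Compute n(vms), r(ssd), r(hdd), and r(store) for one snapshot."""
    total_used = s.ssd_used_tb + s.hdd_used_tb
    total_capacity = s.ssd_capacity_tb + s.hdd_capacity_tb
    return {
        "n(vms)": s.vms / s.nodes,
        "r(ssd)": s.ssd_used_tb / s.ssd_capacity_tb,
        "r(hdd)": s.hdd_used_tb / s.hdd_capacity_tb,
        "r(store)": total_used / total_capacity,  # assumed definition
    }

# Made-up numbers for illustration.
snap = ClusterSnapshot(nodes=4, vms=40, ssd_used_tb=1.0, ssd_capacity_tb=4.0,
                       hdd_used_tb=8.0, hdd_capacity_tb=16.0)
print(ratio_features(snap))
```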

  39. What drives cluster growth?
      1. Cluster size: upgrades from 3-4 node clusters
      2. Storage needs: HDD usage
      3. Compute needs: number of VMs
      • Storage more than compute!
