SLIDE 1: QUASAR: RESOURCE-EFFICIENT AND QOS-AWARE CLUSTER MANAGEMENT

Christina Delimitrou and Christos Kozyrakis
Stanford University, http://mast.stanford.edu
ASPLOS, March 3rd 2014

SLIDE 2: Executive Summary

• Problem: low datacenter utilization
  • Overprovisioned reservations by users
• Problem: high jitter on application performance
  • Interference, HW heterogeneity
• Quasar: resource-efficient cluster management
  • Users provide performance goals instead of resource reservations
  • Online analysis of resource needs using info from past apps
  • Automatic selection of number & type of resources
  • High utilization and low performance jitter

SLIDE 3: Datacenter Underutilization

• A few-thousand-server cluster at Twitter managed by Mesos
• Running mostly latency-critical, user-facing apps
• 80% of servers at <20% utilization; servers are 65% of TCO

SLIDE 4: Datacenter Underutilization

• Goal: raise utilization without introducing performance jitter
• [Figure: utilization distributions for the Twitter cluster and a Google cluster [1]]

[1] L. A. Barroso, U. Holzle. The Datacenter as a Computer, 2009.

SLIDE 5: Reserved vs. Used Resources

• Twitter: up to 5x CPU & up to 2x memory overprovisioning
• [Figure: reserved vs. used resources; 3-5x gap for CPU, 1.5-2x gap for memory]

SLIDE 6: Reserved vs. Used Resources

• ~20% of jobs under-sized, ~70% of jobs over-sized

SLIDE 7: Rightsizing Applications is Hard

SLIDE 8-13: Performance Depends on

• Scale-up (performance vs. cores)
• Heterogeneity (performance vs. cores differs across server types)
• Scale-out (performance vs. number of servers)
• Input load (performance vs. input size)
• Interference (performance vs. cores under co-located load)
• ...and all of these change when sw changes, when platforms change, etc.

SLIDE 14: Rethinking Cluster Management

• Users provide performance goals instead of resource reservations
• Joint allocation and assignment of resources
  • The right amount depends on the quality of the available resources
  • Monitor and adjust dynamically as needed
• But wait... the manager must know the resource/performance tradeoffs

SLIDE 15: Understanding Resource/Performance Tradeoffs

• Big cluster data: combine
  • Small signal from a short run of the new app
  • Large signal from previously-run apps
• Generate: detailed insights for resource management
  • Performance vs. scale-up/out, heterogeneity, ...
• Looks like a classification problem
• [Figure: small app signal + big cluster data → resource/performance tradeoffs]

SLIDE 16: Something familiar...

• Collaborative filtering, similar to the Netflix Challenge system
  • Predict the preferences of new users given the preferences of other users
  • Singular Value Decomposition (SVD) + PQ reconstruction (SGD)
  • High accuracy, low complexity, relaxed density constraints
• [Figure: a sparse users x movies utility matrix → SVD (initial decomposition) → PQ reconstruction with SGD (final decomposition) → reconstructed utility matrix]
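The SVD + PQ reconstruction pipeline on this slide can be sketched in a few lines. This is a generic illustration of the technique, not Quasar's implementation; the ratings matrix, rank `k`, learning rate, and epoch count are invented for the example.

```python
import numpy as np

def pq_reconstruct(R, mask, k=2, lr=0.02, reg=0.02, epochs=1000):
    """Complete a sparse utility matrix R (observed entries flagged by mask):
    initialize a rank-k factorization P @ Q.T from SVD, then refine the
    factors with SGD over the observed entries (PQ reconstruction)."""
    U, s, Vt = np.linalg.svd(np.where(mask, R, 0.0), full_matrices=False)
    P = U[:, :k] * np.sqrt(s[:k])   # initial row (user) factors
    Q = Vt[:k].T * np.sqrt(s[:k])   # initial column (movie) factors
    rows, cols = np.nonzero(mask)
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - P[i] @ Q[j]
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * P[i] - reg * Q[j])
    return P @ Q.T                  # dense, reconstructed matrix

# users x movies ratings; 0 marks "not rated yet"
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 4.0],
              [0.0, 1.0, 5.0, 4.0]])
dense = pq_reconstruct(R, R > 0)
```

The same machinery applies to Quasar's matrices: a short profiling run fills a few entries of an app's row, and the reconstruction predicts the rest.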

SLIDE 17: Application Analysis with Classification

• 4 parallel classifications
• Lower overheads & similar accuracy compared to exhaustive classification

                     Rows    Columns    Recommendation
  Netflix            Users   Movies     Movie ratings
  Heterogeneity
  Interference
  Scale-up
  Scale-out

SLIDE 18: Heterogeneity Classification

• Profiling on two randomly selected server types
• Predict performance on each server type

                     Rows    Columns     Recommendation
  Netflix            Users   Movies      Movie ratings
  Heterogeneity      Apps    Platforms   Server type
  Interference
  Scale-up
  Scale-out
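The prediction step can be pictured with a toy matrix completion: learn latent platform factors from previously-run apps, then solve for a new app's factor from its two profiled server types. This is our simplification (invented numbers, fixed rank, least-squares completion), not Quasar's SVD + PQ code.

```python
import numpy as np

# Illustrative only: performance of previously-run apps (rows)
# on each server type (columns)
A = np.array([[1.0, 1.4, 0.8, 1.1],
              [0.9, 1.3, 0.7, 1.0],
              [1.2, 0.8, 1.5, 0.9],
              [1.1, 0.9, 1.4, 1.0]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:k].T                     # one k-dim latent factor per server type

# New app profiled on two randomly chosen server types (here 0 and 2)
observed = {0: 1.05, 2: 1.45}
idx = list(observed)
y = np.array([observed[i] for i in idx])
w, *_ = np.linalg.lstsq(V[idx], y, rcond=None)   # the app's latent vector
pred = V @ w                     # predicted performance on every server type
best = int(np.argmax(pred))      # most appropriate server type for this app
```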

SLIDE 19: Interference Classification

• Predict sensitivity to interference
  • The interference intensity that leads to >5% performance loss
• Profiling by injecting increasing interference

                     Rows    Columns                  Recommendation
  Netflix            Users   Movies                   Movie ratings
  Heterogeneity      Apps    Platforms                Server type
  Interference       Apps    Sources of interference  Interference sensitivity
  Scale-up
  Scale-out
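The profiling bullet above might look like this in code; `find_sensitivity`, `measure_performance`, and `inject` are hypothetical names for a profiling harness, not Quasar's API.

```python
# Ramp up a synthetic source of interference and record the highest
# intensity the app tolerates before losing >5% of its baseline performance.

def find_sensitivity(measure_performance, inject, max_intensity=100, step=10):
    baseline = measure_performance(0)   # performance with no injected load
    tolerated = 0
    for intensity in range(step, max_intensity + 1, step):
        inject(intensity)               # e.g. crank up a cache-thrashing microbenchmark
        perf = measure_performance(intensity)
        if perf < 0.95 * baseline:      # >5% performance loss: too much
            break
        tolerated = intensity
    return tolerated

# Toy model: performance degrades linearly with interference intensity
sensitivity = find_sensitivity(lambda i: 100 - 0.1 * i, lambda i: None)
```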

SLIDE 20: Scale-Up Classification

• Predict speedup from scale-up
• Profiling with two allocations (cores & memory)

                     Rows    Columns                  Recommendation
  Netflix            Users   Movies                   Movie ratings
  Heterogeneity      Apps    Platforms                Server type
  Interference       Apps    Sources of interference  Interference sensitivity
  Scale-up           Apps    Resource vectors         Resources/node
  Scale-out

SLIDE 21: Scale-Out Classification

• Predict speedup from scale-out
• Profiling with two allocations (1 & N>1 nodes)

                     Rows    Columns                  Recommendation
  Netflix            Users   Movies                   Movie ratings
  Heterogeneity      Apps    Platforms                Server type
  Interference       Apps    Sources of interference  Interference sensitivity
  Scale-up           Apps    Resource vectors         Resources/node
  Scale-out          Apps    Nodes                    Number of nodes

SLIDE 22: Classification Validation

Prediction error vs. measured performance (avg / max):

                     Heterogeneity   Interference   Scale-up     Scale-out
                     avg    max      avg    max     avg    max   avg    max
  Single-node        4%     8%       5%     10%     4%     9%    -      -
  Batch distributed  4%     5%       2%     6%      5%     11%   5%     17%
  Latency-critical   5%     6%       7%     10%     6%     11%   6%     12%

SLIDE 23-28: Quasar Overview

[Figure, built up across these slides: an incoming app with a QoS target is briefly profiled, yielding one sparse signal per classification, e.g. H <a..b...>, I <..cd...>, SU <e..f..>, SO <kl....>; SVD (UΣV^T) + PQ reconstruction turn each sparse signal into a dense one, e.g. H <abcdefghi>, I <qwertyuio>, SU <esdfghjkl>, SO <kljhgfdsa>; the dense signals feed resource selection.]

• Profiling [10-60 sec]: sparse input signal
• Classification [20 msec]: dense output signal
• Resource selection [50 msec-2 sec]

SLIDE 29: Greedy Resource Selection

• Goals
  • Allocate the fewest resources needed to meet the QoS target
  • Pack together non-interfering applications
• Overview
  • Start with the most appropriate server types
  • Look for servers with interference below the critical intensity
    • Depends on which applications are already running on those servers
  • First scale up, then scale out
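The overview above can be sketched as a greedy loop. A minimal sketch under invented data structures, not Quasar's actual selector: `perf` is classifier-predicted performance per server type, `scaleup` maps a core count to a speedup, `scaleout` maps a node count to a parallel efficiency, and `tolerance` is the interference intensity the app withstands.

```python
def select_resources(servers, app, qos_target):
    # Prefer server types the heterogeneity classification ranks highest
    ranked = sorted(servers, key=lambda s: app['perf'][s['type']], reverse=True)
    allocation, node_perfs = [], []
    for server in ranked:
        # Skip servers already too noisy for this app
        if server['interference'] > app['tolerance']:
            continue
        # Scale up first: take the free cores on this server...
        cores = server['free_cores']
        node_perfs.append(app['perf'][server['type']] * app['scaleup'](cores))
        allocation.append((server, cores))
        # ...then scale out: add another node only if still short of the target
        if sum(node_perfs) * app['scaleout'](len(node_perfs)) >= qos_target:
            return allocation
    return None  # target unreachable with the available servers

# Toy cluster: one server of type A, two of type B (one too noisy)
servers = [
    {'type': 'A', 'free_cores': 4, 'interference': 2},
    {'type': 'B', 'free_cores': 8, 'interference': 9},
    {'type': 'B', 'free_cores': 8, 'interference': 1},
]
app = {'perf': {'A': 1.0, 'B': 1.5},
       'tolerance': 5,
       'scaleup': lambda c: min(c, 4),        # speedup saturates at 4 cores
       'scaleout': lambda n: 0.9 ** (n - 1)}  # efficiency drops per extra node
alloc = select_resources(servers, app, qos_target=6.0)
```

Packing non-interfering apps falls out of the tolerance check: a server only qualifies if the load already on it stays below the new app's critical interference intensity.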

SLIDE 30: Quasar Implementation

• 6,000 lines of C++ and Python
• Runs on Linux and OS X
• Supports frameworks in C/C++, Java, and Python
  • ~100-600 lines of framework-specific code
• Side-effect-free profiling using Linux containers with chroot

SLIDE 31: Evaluation: Cloud Scenario

• Cluster: 200 EC2 servers, 14 different server types
• Workloads: 1,200 apps with 1 sec inter-arrival time
  • Analytics: Hadoop, Spark, Storm
  • Latency-critical: memcached, HotCRP, Cassandra
  • Single-threaded: SPEC CPU2006
  • Multi-threaded: PARSEC, SPLASH-2, BioParallel, SPECjbb
  • Multiprogrammed: 4-app mixes of SPEC CPU2006
• Objectives: high cluster utilization and good app QoS

SLIDE 32: Demo

[Demo animation: instance sizes for memcached, Cassandra, Storm, Hadoop, single-node apps, and Spark; side-by-side core-allocation maps, performance histograms, and cluster-utilization bars (0-100%) for Quasar vs. Reservation + Least-Loaded (LL) scheduling.]


SLIDE 34: Cloud Scenario Summary

Quasar achieves:
• 88% of applications get >95% of their target performance
• ~10% overprovisioning, as opposed to up to 5x
• Up to 70% cluster utilization at steady state
• 23% shorter scenario completion time

SLIDE 35: Conclusions

• Quasar: high utilization, high app performance
• From reservation-centric to performance-centric cluster management
• Uses info from previous apps for accurate & online app analysis
• Joint resource allocation and resource assignment
• See paper for:
  • Utilization analysis of the Twitter cluster
  • Detailed validation & sensitivity analysis of classification
  • Further evaluation scenarios and features
    • E.g., setting framework parameters for Hadoop


SLIDE 37: Thank You

Questions?

SLIDE 38: Cloud Provider: Performance

SLIDE 39: Cloud Provider: Performance

• Most applications violate their QoS constraints

SLIDE 40: Cloud Provider: Performance

• 83% of the performance target when only assignment is heterogeneity- and interference-aware

SLIDE 41: Cloud Provider: Performance

• 98% of the performance target on average

SLIDE 42: Cluster Utilization

• Baseline (Reservation + LL):
  • Imbalance in server utilization
  • Per-app QoS violations + higher execution time
• Quasar increases server utilization by 47%
• High performance for the user + better utilization for the DC operator → resource efficiency
• [Figure: per-server utilization over time, Quasar vs. Least-Loaded (LL)]

SLIDE 43: Reducing Overprovisioning

• ~10% overprovisioning, compared to 40%-5x for Reservation + LL

SLIDE 44: Scheduling Overheads

• 4.1% of execution time on average, up to 15% for short-lived workloads; mostly from profiling
• Distributed decisions
• Cold-start solutions