

SLIDE 1

Surviving Failures in Bandwidth-Constrained Datacenters

Peter Bodík2, Ishai Menache2, Mosharaf Chowdhury3, Pradeepkumar Mani1, Dave Maltz1, Ion Stoica3

Microsoft1, Microsoft Research2, UC Berkeley3

SLIDE 2

How to allocate services to physical machines?

Three important metrics, considered together:

– FT: service fault tolerance
– BW: bandwidth usage
– #M: number of machine moves to reach the target allocation

[Figure: services 1–3 allocated on a tree topology of network core, agg switches, and racks]

SLIDE 3

FT: Improving fault tolerance of software services

Complex fault domains: networking, power, cooling

Worst-case survival = fraction of the service still available during the single worst-case failure

– corresponds to service throughput during the failure

[Figure: fault domains across the network core, switches, racks, containers, and power distribution]

SLIDE 4

FT: Service allocation impacts worst-case survival

Worst-case survival:

– red service: 0% -- all machines in the same container and power domain
– green service: 67% -- machines in different containers and power domains

[Figure: the two allocations on the topology of network core, switches, racks, containers, and power distribution]
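To make the metric concrete, here is a minimal sketch of computing worst-case survival, assuming a fault domain is simply the set of machines that fail together (the helper name and the example data are made up for illustration):

```python
# Sketch: worst-case survival of a service, assuming each fault domain
# is the set of machines that fail together when that domain fails.
def worst_case_survival(service_machines, fault_domains):
    """Fraction of the service still running after the single worst failure."""
    total = len(service_machines)
    worst_loss = max(len(service_machines & d) for d in fault_domains)
    return (total - worst_loss) / total

# Example: a 3-machine service spread across three containers.
machines = {"m1", "m2", "m3"}
domains = [{"m1"}, {"m2"}, {"m3", "m9"}]  # m9 belongs to some other service
print(worst_case_survival(machines, domains))  # ~0.67, like the green service
```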

SLIDE 5

BW: Reduce bandwidth usage on constrained links

BW = bandwidth usage in the network core

Goal:

– reduce the cost of the infrastructure
– consider other constraints on service location
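As a rough illustration of the BW metric, here is a sketch under stated assumptions (a simple tree topology and a pairwise traffic matrix; not necessarily the paper's exact model):

```python
# Sketch: core BW = total traffic between machines that sit under
# different aggregation subtrees, so their flows must cross the core.
def core_bandwidth(traffic, subtree_of):
    """traffic: {(m1, m2): rate}; subtree_of: machine -> agg subtree id."""
    return sum(rate for (a, b), rate in traffic.items()
               if subtree_of[a] != subtree_of[b])

traffic = {("m1", "m2"): 10.0, ("m1", "m3"): 1.0}
subtree_of = {"m1": 0, "m2": 0, "m3": 1}
print(core_bandwidth(traffic, subtree_of))  # 1.0: only m1-m3 crosses the core
```

Placing heavy talkers under the same subtree keeps their traffic out of the core, which is why BW-only optimization tends to pack communicating services together.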

SLIDE 6

#M: Need incremental allocation algorithms

High cost of a machine move:

– potentially terabytes of data to deploy
– caches need to warm up
– can take tens of minutes and itself loads the network

SLIDE 7

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 8

Trade-off between bandwidth usage and fault tolerance

[Figure: two allocations on a core/agg/racks topology. Optimizing for bandwidth: core utilization LOW, fault tolerance (for agg switches) LOW, worst-case survival 0. Optimizing for fault tolerance: core utilization HIGH, fault tolerance HIGH, worst-case survival 0.5.]

SLIDE 9

Optimizing for one metric degrades the other

Results from 6 Microsoft datacenters

[Figure: reduction in BW usage (x-axis) vs. change in average worst-case survival (y-axis), showing the initial allocation, allocations optimizing only worst-case survival, allocations optimizing only core bandwidth, and the goal region that improves both]

SLIDE 10

FT-only and BW-only are both NP-hard and hard to approximate

FT reduces to maximum independent set; BW reduces to min-cut in a graph

– considered previously in [Meng et al., INFOCOM'10]

Most algorithms are not incremental and ignore #M

SLIDE 11

Key insights

Improve FT using convex optimization

– local optimization leads to good solutions

Symmetry in the optimization space

– machines, racks, and containers are interchangeable

Communication pattern is very skewed

– low-talkers can be spread without affecting BW

SLIDE 12

Results preview

[Figure: same axes as Slide 9 (reduction in BW usage vs. change in average worst-case survival), with the goal region highlighted]

SLIDE 13

Results preview

[Figure: same plot as Slide 12]

SLIDE 14

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 15

Service communication matrix is very sparse and skewed

[Figure: communication matrix for a subset of ~1000 services; a set of services forming an application and a cluster-manager service stand out]

Only 2% of service pairs communicate; 1% of services generate 64% of traffic

(lots more in the paper)

SLIDE 16

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 17

Optimizing FT and #M

Spread machines across all fault domains

– fault tolerance cost (FTC) is negatively correlated with worst-case survival

Convex optimization: FTC combines, per service and fault domain, a service weight, a fault-domain weight, and the number of machines of service s in domain f (see the sketch below)

Advantages of a convex cost function:

– local actions improve the global metric
– directly considers #M
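A minimal sketch of such a cost, assuming a quadratic penalty as the convex function (the exact convex function and the weights used in the paper may differ):

```python
# Sketch: fault tolerance cost (FTC), assuming a quadratic convex penalty.
# Concentrating a service in one fault domain inflates the cost, so a
# lower FTC tends to mean a higher worst-case survival.
def ftc(n, service_weight, domain_weight):
    """n[s][f]: number of machines of service s in fault domain f."""
    return sum(service_weight[s] * domain_weight[f] * count ** 2
               for s, per_domain in n.items()
               for f, count in per_domain.items())

# A red/green style example (made-up machine counts, unit weights):
n = {"red": {"c1": 2}, "green": {"c1": 1, "c2": 1, "c3": 1}}
w_s = {"red": 1.0, "green": 1.0}          # service weights (made up)
w_f = {"c1": 1.0, "c2": 1.0, "c3": 1.0}   # fault-domain weights (made up)
print(ftc(n, w_s, w_f))  # 7.0: red contributes 4, green only 3; spreading is cheaper
```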

SLIDE 18

Machine swap as a basic move

Keeps the current allocation feasible

– doesn't change the number of machines per service

Steepest-descent swap = the swap with the largest reduction in cost

Only evaluate a small, random set of swaps (see the sketch below)

– symmetry => many "good" swaps exist
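A minimal sketch of this move under the same assumptions; `cost` is any allocation cost (FTC here, FTC + α·BW later), and the sample size is a made-up parameter:

```python
import random

# Sketch: try a few random machine swaps and keep the one that lowers the
# cost the most; swapping two machines' services never changes how many
# machines each service has, so the allocation stays feasible.
def best_swap(alloc, machines, cost, num_candidates=100):
    """alloc: machine -> service (dict); machines: list of machine ids."""
    best_cost, best_pair = cost(alloc), None
    for _ in range(num_candidates):
        a, b = random.sample(machines, 2)
        alloc[a], alloc[b] = alloc[b], alloc[a]   # try the swap
        c = cost(alloc)
        if c < best_cost:
            best_cost, best_pair = c, (a, b)
        alloc[a], alloc[b] = alloc[b], alloc[a]   # undo it
    return best_cost, best_pair
```

Because machines with identical fault domains are interchangeable, many of the sampled swaps are near-ties with the true steepest-descent swap, which is why a small random sample suffices.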

SLIDE 19

[Figure: path of steepest descent on FT, shown as FT improvement vs. BW reduction]

SLIDE 20

Optimizing FT, BW, and #M (the FT+BW algorithm)

Steepest descent on FTC + α·BW (see the sketch below)

– non-convex
– no guarantees on reaching the optimum

α determines the FT-BW trade-off
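A sketch of the descent loop, reusing the hypothetical `best_swap`, `ftc`, and `core_bandwidth` helpers from the earlier sketches:

```python
# Sketch: steepest descent on FTC + alpha * BW; each committed swap is one
# improving move, so capping the number of iterations directly limits #M.
def optimize(alloc, machines, alpha, ftc_of, bw_of, max_moves=1000):
    def cost(a):                       # alpha sets the FT-BW trade-off
        return ftc_of(a) + alpha * bw_of(a)
    for _ in range(max_moves):
        _, pair = best_swap(alloc, machines, cost)
        if pair is None:               # no sampled swap improves the cost
            break
        a, b = pair
        alloc[a], alloc[b] = alloc[b], alloc[a]   # commit the best swap
    return alloc
```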

SLIDE 21

[Figure: paths of steepest descent on FT+BW for α = 1 and α = 10, shown as FT improvement vs. BW reduction]

SLIDE 22

Benchmark algorithm: Cut+FT+BW

k-way minimum cut of the machine communication graph (see the sketch below)

– optimizes BW only
– ignores #M

followed by steepest descent on FT+BW
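For the cut step, an off-the-shelf graph partitioner can stand in; here is a sketch using pymetis (an assumption about tooling: the talk does not name a library), which splits a graph into k parts while minimizing the edge cut:

```python
import pymetis  # METIS bindings for k-way graph partitioning

# Sketch: 4 machines on a path 0-1-2-3 in the communication graph
# (unweighted here; a real traffic matrix would supply edge weights).
adjacency = [[1], [0, 2], [1, 3], [2]]
num_cut_edges, part_of = pymetis.part_graph(2, adjacency=adjacency)
print(num_cut_edges, part_of)  # e.g. 1 cut edge; part_of maps machine -> part
```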

SLIDE 23

Outline

Why is it difficult?
Traffic analysis
Optimization framework

– FT + #M
– FT + BW + #M

Evaluation

SLIDE 24

Evaluation setup

Simulations based on 4 production clusters

– services and machine counts
– network topology
– fault domains
– network trace from a pre-production cluster

Metrics are relative to the initial allocation

– the actual optimum is not known

Choosing the next swap takes seconds to a minute

SLIDE 25

Evaluation

[Figure: ΔFT vs. core BW reduction for the FT-only algorithm]

SLIDE 26

Evaluation

[Figure: ΔFT vs. core BW reduction for the FT, Cut, and Cut+FT+BW algorithms; spreading low-talkers improves FT with little impact on BW; the boundary traces different values of α]

SLIDE 27

Evaluation

[Figure: adds FT+BW with 2.3% of machines moved to the previous plot]

SLIDE 28

Evaluation

[Figure: FT+BW with 2.3%, 9%, and 29% of machines moved, alongside FT, Cut, and Cut+FT+BW]

SLIDE 29

α changes the FT-BW trade-off

[Figure: the same ΔFT vs. core BW reduction plot; varying α slides the FT+BW allocations along the trade-off boundary]

SLIDE 30

Summary

Trade-off between fault tolerance and bandwidth

– algorithm that achieves improvement in both

Improvements (across 4 production datacenters)

– FT: 40% – 120%
– BW: 20% – 50%
– partially deployed in Bing

Key insights

– approximate an NP-hard problem using convex optimization
– lots of symmetry in the search space
– sparse and skewed communication matrix


SLIDE 33

Extensions

Hard constraints on FT, BW, #M

– e.g., pick a few services and require FT > 80%

Hierarchical BW optimization on agg switches

Applies to fat-tree networks

SLIDE 34

Main observations

Most traffic is generated by a few services (and service pairs)

 spread low-talkers to improve fault tolerance

Complex, overlapping fault domains

– hierarchical network fault domains
– power fault domains not aligned with network fault domains

 cell: set of machines with identical fault domains

SLIDE 35

Evaluation

[Figure: comparison of moving most of the machines vs. moving only a fraction of them]

SLIDE 36

Our optimization framework

Cost function considers FT and BW

– both problems are NP-hard and hard to approximate
– non-convex

Cut+FT+BW:

1. minimum k-way cut of the communication graph

– reshuffles all machines

2. gradient-descent moves using machine swaps

FT+BW:

1. machine swaps only

– moves only a small fraction of machines

SLIDE 37

Conclusion

Study of the communication patterns of Bing.com

– sparse communication matrix
– very skewed communication pattern

Principled optimization of both BW and FT

– exploits communication patterns
– can handle arbitrary fault domains

Reduction in BW: 20 – 50%
Improvement in FT: 40 – 120%

SLIDE 38

Evaluation (1 datacenter)

[Figure: ΔFT vs. ΔBW for the initial allocation and for allocations optimizing just FT or just BW]

SLIDE 39

Evaluation

[Figure: adds Cut+FT+BW (moves all servers) to the ΔFT vs. ΔBW plot]

SLIDE 40

Evaluation

[Figure: adds FT+BW+#M with 2.3% of servers moved, alongside FT+BW]

SLIDE 41

Evaluation

[Figure: FT+BW+#M with 2.3%, 9%, and 29% of servers moved, alongside FT+BW]

SLIDE 42

k-way min graph cut

[Figure: network topology and the machine communication graph, partitioned by a k-way graph cut to minimize BW]

– ignores #M: reshuffles almost all machines
– ignores FT: can't be easily extended

SLIDE 43

[Figure: trajectory of the k-way graph cut (BW) baseline, shown as improved FT vs. reduced BW]

SLIDE 44

Scaling algorithms to large datacenters

Only evaluate a small, random set of swaps

– symmetry => many "good" swaps exist

Cell = set of machines with the same fault domains (see the sketch below)

Reduce the size of the communication graph for the cut
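A minimal sketch of grouping machines into cells (the fault-domain map below is made up); the optimizer can then treat machines within one cell as interchangeable:

```python
from collections import defaultdict

# Sketch: a cell groups machines that share exactly the same fault domains;
# swapping two machines inside one cell changes nothing for FT, which
# shrinks the search space.
def cells(domains_of):
    """domains_of: machine -> iterable of fault-domain ids."""
    groups = defaultdict(set)
    for machine, domains in domains_of.items():
        groups[frozenset(domains)].add(machine)
    return list(groups.values())

domains_of = {"m1": {"rack1", "pdu1"},
              "m2": {"rack1", "pdu1"},
              "m3": {"rack2", "pdu1"}}
print(cells(domains_of))  # [{'m1', 'm2'}, {'m3'}]
```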

SLIDE 45

Cut + steepest descent

Step 1: min-cut

– optimizes BW

Step 2: steepest descent on FTC + α·BW

– non-convex
– no guarantees on reaching the optimum
– α determines the trade-off

Reshuffles all machines

SLIDE 46

[Figure: cut + steepest descent trajectories for α = 1 and α = 10, shown as improved FT vs. reduced BW]

SLIDE 47

Properties of allocation algorithms

[Table comparing the FT, FT+BW, and FT+BW+#M algorithms on the BW and #M metrics]

SLIDE 48

Service communication matrix is very sparse and skewed

[Figure: communication matrix of the services; a set of services forming an application and a cluster-manager service stand out]

Only 2% of service pairs communicate; 1% of services generate 64% of traffic

(lots more in the paper)

SLIDE 49

Which metrics matter?

FT: fault tolerance

– services should survive infrastructure failures
– failures happen despite redundancy

BW: bandwidth usage

– reduce usage on constrained links
– lower the cost of the infrastructure

#M: number of moves

– moving some servers is expensive
– want incremental allocation