  1. Surviving Failures in Bandwidth-Constrained Datacenters. Peter Bodík², Ishai Menache², Mosharaf Chowdhury³, Pradeepkumar Mani¹, Dave Maltz¹, Ion Stoica³. Microsoft¹, Microsoft Research², UC Berkeley³

  2. How to allocate services to physical machines? [Figure: network core, aggregation switches, and racks hosting services 1–3.] Three important metrics considered together – FT: service fault tolerance – BW: bandwidth usage – #M: number of machine moves to reach the target allocation

  3. FT: Improving fault tolerance of software services [Figure: fault domains spanning the network core, switches, containers, racks, and power distribution.] Complex fault domains: networking, power, cooling. Worst-case survival = fraction of the service still available during the single worst-case failure – corresponds to service throughput during that failure
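As a concrete illustration of the worst-case-survival metric, here is a minimal sketch; the function name and data shapes are assumptions for illustration, not the paper's code:

```python
def worst_case_survival(machines, fault_domains):
    """machines: set of machine ids running a service.
    fault_domains: dict mapping fault-domain name -> set of machine ids.
    Fraction of the service still up after the single worst-case
    fault-domain failure."""
    if not machines:
        return 0.0
    worst_loss = max(len(machines & dom) for dom in fault_domains.values())
    return (len(machines) - worst_loss) / len(machines)

# Toy fault domains: three containers of three machines each.
domains = {
    "container_a": {0, 1, 2},
    "container_b": {3, 4, 5},
    "container_c": {6, 7, 8},
}
red = {0, 1, 2}    # packed into one container -> 0% worst-case survival
green = {0, 3, 6}  # spread across containers  -> ~67% worst-case survival
print(worst_case_survival(red, domains), worst_case_survival(green, domains))
```

This mirrors the example on slide 4: the packed "red" service loses everything to one container failure, while the spread "green" service keeps 2 of 3 machines.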

  4. FT: Service allocation impacts worst-case survival [Figure: two services placed across containers, racks, and power domains.] Worst-case survival – red service: 0% (same container and power domain) – green service: 67% (different containers and power domains)

  5. BW: Reduce bandwidth usage on constrained links [Figure: network core, switches, containers, racks, power distribution.] BW = bandwidth usage in the core. Goals – reduce the cost of the infrastructure – consider other service-location constraints
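A toy model of the BW metric can make this concrete. The sketch below assumes traffic between two machines crosses the core iff they sit under different aggregation subtrees; the function and data shapes are illustrative, not from the paper:

```python
def core_bw(traffic, subtree_of):
    """traffic: dict {(src_machine, dst_machine): rate}.
    subtree_of: dict machine -> aggregation-subtree id.
    Total traffic rate that must traverse the network core."""
    return sum(rate for (a, b), rate in traffic.items()
               if subtree_of[a] != subtree_of[b])

traffic = {("m1", "m2"): 10.0, ("m1", "m3"): 5.0, ("m2", "m3"): 1.0}
subtree_of = {"m1": "agg0", "m2": "agg0", "m3": "agg1"}
print(core_bw(traffic, subtree_of))  # m1-m2 stays local; only 5.0 + 1.0 crosses the core
```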

  6. #M: Need incremental allocation algorithms [Figure: network core, switches, containers, racks, power distribution.] High cost of a machine move – need to deploy potentially terabytes of data – warm up caches – could take tens of minutes and impact the network

  7. Outline: Why is it difficult? Traffic analysis. Optimization framework – FT + #M – FT + BW + #M. Evaluation

  8. Trade-off between bandwidth usage and fault tolerance [Figure: two allocations on a core/agg-switch/rack topology – optimizing for bandwidth keeps core utilization low but gives worst-case survival 0; optimizing for fault tolerance raises worst-case survival to 0.5 at the cost of high core utilization.]

  9. Optimizing for one metric degrades the other [Chart: change in average worst-case survival (−80% to 160%) vs. reduction in BW usage (−20% to 80%); allocations optimizing only worst-case survival improve FT but not BW, allocations optimizing only core bandwidth improve BW but hurt FT; the goal is the region improving both.] Results from 6 Microsoft datacenters

  10. FT-only and BW-only are both NP-hard, hard to approximate. FT reduces to max independent set. BW reduces to min-cut in a graph – considered previously in [Meng et al., INFOCOM’10]. Most existing algorithms are not incremental and ignore #M

  11. Key insights. Improve FT using convex optimization – local optimization leads to good solutions. Symmetry in the optimization space – machines, racks, containers are interchangeable. The communication pattern is very skewed – low-talkers can be spread without affecting BW

  12. Results preview [Chart: same axes as slide 9 – change in average worst-case survival vs. reduction in BW usage – with the goal region marked.]

  13. Results preview (same chart as slide 12)

  14. Outline: Why is it difficult? Traffic analysis. Optimization framework – FT + #M – FT + BW + #M. Evaluation

  15. The service communication matrix is very sparse and skewed [Figure: cluster manager, a set of services forming an application, a subset of ~1000 services.] Only 2% of service pairs communicate; 1% of services generate 64% of traffic (lots more in the paper)

  16. Outline: Why is it difficult? Traffic analysis. Optimization framework – FT + #M – FT + BW + #M. Evaluation

  17. FT: optimizing FT and #M. Spread machines across all fault domains. Convex optimization: minimize the fault-tolerance cost FTC = Σ_s Σ_f w_s · w_f · m_{s,f}², where m_{s,f} is the number of machines of service s in fault domain f, and w_s, w_f are the service and fault-domain weights; FTC is negatively correlated with worst-case survival. Advantages of a convex cost function – local actions lead to improvement of the global metric – directly considers #M
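A sketch of this convex cost follows; the exact weights and normalization used in the paper may differ, so treat this as an illustration of the shape of the objective:

```python
def ftc(alloc, fault_domains, w_service, w_domain):
    """Convex fault-tolerance cost: sum over services s and domains f of
    w_s * w_f * m_{s,f}^2, where m_{s,f} = machines of s inside f.
    Lower cost ~ machines spread more evenly across fault domains."""
    cost = 0.0
    for s, machines in alloc.items():
        for f, dom in fault_domains.items():
            m_sf = len(machines & dom)
            cost += w_service[s] * w_domain[f] * m_sf ** 2
    return cost

domains = {"rack_a": {0, 1}, "rack_b": {2, 3}}
w_s, w_f = {"svc": 1.0}, {"rack_a": 1.0, "rack_b": 1.0}
packed = ftc({"svc": {0, 1}}, domains, w_s, w_f)  # 2^2 = 4
spread = ftc({"svc": {0, 2}}, domains, w_s, w_f)  # 1^2 + 1^2 = 2
print(packed, spread)  # spreading the service lowers the convex cost
```

Because the squared term penalizes concentration, any local move that evens out a service's machine counts across domains reduces the global cost, which is why local swaps make progress.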

  18. FT: a machine swap as the basic move. A swap keeps the current allocation feasible – doesn’t change the number of machines per service. [Figure: two machines exchanged under different agg switches.] Steepest descent: the swap with the largest reduction in cost. Only evaluate a small, random set of swaps – symmetry => many “good” swaps exist
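One steepest-descent step over a random set of candidate swaps might look like the sketch below; the toy co-location cost and all names are assumptions, not the paper's implementation:

```python
import random

def best_swap(alloc, cost, candidates=100):
    """alloc: dict service -> set of machine ids (disjoint across services).
    Tries `candidates` random machine swaps between two services and returns
    (reduction, (s1, m1, s2, m2)) for the best one, or (0.0, None).
    A swap never changes the number of machines per service."""
    base = cost(alloc)
    best = (0.0, None)
    services = [s for s in alloc if alloc[s]]
    for _ in range(candidates):
        s1, s2 = random.sample(services, 2)
        m1 = random.choice(sorted(alloc[s1]))
        m2 = random.choice(sorted(alloc[s2]))
        alloc[s1].remove(m1); alloc[s1].add(m2)  # trial swap
        alloc[s2].remove(m2); alloc[s2].add(m1)
        reduction = base - cost(alloc)
        if reduction > best[0]:
            best = (reduction, (s1, m1, s2, m2))
        alloc[s1].remove(m2); alloc[s1].add(m1)  # undo the trial swap
        alloc[s2].remove(m1); alloc[s2].add(m2)
    return best

# Toy convex cost: squared co-location of each service within each rack.
racks = {"r0": {0, 1}, "r1": {2, 3}}
def colocation_cost(alloc):
    return sum(len(m & r) ** 2 for m in alloc.values() for r in racks.values())

alloc = {"svc_a": {0, 1}, "svc_b": {2, 3}}  # each service packed in one rack
reduction, swap = best_swap(alloc, colocation_cost)
print(reduction)  # any cross-service swap spreads both services: cost 8 -> 4
```

Sampling only a subset of swaps works here for the reason the slide gives: interchangeable machines make many distinct swaps equally good, so a random sample almost always contains one.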

  19. FT: path of steepest descent [Chart: FT improvement vs. BW reduction along the descent path.]

  20. FT+BW: optimizing FT, BW, and #M. Steepest descent on FTC + α·BW – non-convex – no guarantees of reaching the optimum. α determines the FT–BW trade-off
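How α steers the combined objective can be seen with two hypothetical allocations; the cost numbers below are made up purely for illustration:

```python
def combined_cost(ftc, bw, alpha):
    """Combined objective from the slide: FTC + alpha * BW."""
    return ftc + alpha * bw

# Allocation A is better on fault tolerance (lower FTC), B on bandwidth.
A = {"name": "A", "ftc": 10.0, "bw": 8.0}
B = {"name": "B", "ftc": 14.0, "bw": 4.0}

for alpha in (0.5, 2.0):
    pick = min((A, B), key=lambda x: combined_cost(x["ftc"], x["bw"], alpha))
    print(alpha, pick["name"])  # small alpha favors A (FT), large alpha favors B (BW)
```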

  21. FT+BW: path of steepest descent [Chart: descent paths for α = 1 and α = 10 – FT improvement vs. BW reduction.]

  22. Benchmark algorithm: Cut+FT+BW. k-way minimum cut of the machine communication graph – optimizes BW only – ignores #M – followed by steepest descent on FT+BW

  23. Outline: Why is it difficult? Traffic analysis. Optimization framework – FT + #M – FT + BW + #M. Evaluation

  24. Evaluation setup. Simulations based on 4 production clusters – services and machine counts – network topology – fault domains – a network trace from a pre-production cluster. Metrics are relative to the initial allocation – the actual optimum is unknown. Choosing the next swap takes seconds to a minute

  25. Evaluation [Chart: ΔFT (−40% to 160%) vs. core BW reduction (−20% to 60%) – the FT-only algorithm.]

  26. Evaluation [Chart adds the cut baseline and the FT+BW boundary for different values of α; spreading low-talkers improves FT with little impact on BW.]

  27. Evaluation [Chart adds FT+BW with 2.3% of machines moved.]

  28. Evaluation [Chart adds FT+BW points with 9% and 29% of machines moved.]

  29. α changes the FT–BW trade-off [Chart: varying α moves along the trade-off – points with 2.3%, 9%, and 29% of machines moved.]

  30. Summary. Trade-off between fault tolerance and bandwidth – an algorithm that achieves improvement in both. Improvements (across 4 production datacenters) – FT: 40–120% – BW: 20–50% – partially deployed in Bing. Key insights – approximate an NP-hard problem using convex optimization – lots of symmetry in the search space – sparse and skewed communication matrix


  33. Extensions. Hard constraints on FT, BW, #M – e.g., require a few chosen services to have FT > 80%. Hierarchical BW optimization on agg switches. Applies to fat-tree networks

  34. Main observations. Most traffic is generated by a few services (and service pairs) → spread low-talkers to improve fault tolerance. Complex, overlapping fault domains – hierarchical network fault domains – power fault domains not aligned with the network → cell: a set of machines with identical fault domains
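The cell abstraction amounts to grouping machines by their fault-domain signature; a minimal sketch (names and data shapes assumed):

```python
from collections import defaultdict

def cells(domains_of):
    """domains_of: dict machine -> set of fault-domain names.
    Returns one cell (set of machines) per distinct fault-domain signature;
    machines within a cell are interchangeable for the optimization."""
    out = defaultdict(set)
    for machine, doms in domains_of.items():
        out[frozenset(doms)].add(machine)
    return dict(out)

domains_of = {
    "m1": {"rack1", "container_a", "power_x"},
    "m2": {"rack1", "container_a", "power_x"},  # identical domains to m1
    "m3": {"rack2", "container_a", "power_y"},
}
c = cells(domains_of)
print(sorted(map(sorted, c.values())))  # m1 and m2 share a cell; m3 stands alone
```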

  35. Evaluation [Chart: moving most of the machines vs. moving only a fraction of the machines.]

  36. Our optimization framework. The cost function considers FT and BW – both problems NP-hard and hard to approximate – the combined objective is non-convex. Cut+FT+BW: 1. minimum k-way cut of the communication graph (reshuffles all machines) 2. gradient-descent moves using machine swaps. FT+BW: machine swaps only – moves only a small fraction of machines

  37. Conclusion. Study of the communication patterns of Bing.com – sparse communication matrix – very skewed communication pattern. Principled optimization of both BW and FT – exploits communication patterns – can handle arbitrary fault domains. Reduction in BW: 20–50%. Improvement in FT: 40–120%

  38. Evaluation (1 datacenter) [Chart: ΔFT (−40% to 160%) vs. ΔBW (−60% to 20%) – the initial allocation, optimizing just FT, and optimizing just BW.]

  39. Evaluation [Chart adds Cut+FT+BW, which moves all servers.]

  40. Evaluation [Chart adds FT+BW and FT+BW+#M with 2.3% of servers moved.]

  41. Evaluation [Chart adds FT+BW+#M with 9% and 29% of servers moved.]

  42. BW: k-way graph cut. Build the machine communication graph from the network topology and apply a k-way minimum cut. [Figure: core/agg topology next to the communication graph with the min cut shown.] The k-way min graph cut – ignores #M: reshuffles almost all machines – ignores FT: can’t be easily extended

  43. BW: k-way graph cut [Chart: improved FT vs. reduced BW.]

  44. Scaling the algorithms to large datacenters. Only evaluate a small, random set of swaps – symmetry => many “good” swaps exist. Cell = set of machines with the same fault domains. Reduce the size of the communication graph before the cut

  45. FT+BW: cut + steepest descent. Step 1: min cut – optimizes BW. Step 2: steepest descent on FTC + α·BW – non-convex – no guarantees of reaching the optimum – α determines the trade-off. Reshuffles all machines

  46. FT+BW: cut + steepest descent [Chart: descent paths for α = 1 and α = 10 – improved FT vs. reduced BW.]

  47. Properties of allocation algorithms [Table comparing the algorithms by which metrics they handle: FT, BW, #M.]
