
Striking the Right Utilization-Availability Balance in the WAN



  1. Striking the Right Utilization-Availability Balance in the WAN. Manya Ghobadi (MIT). Joint work with: Jeremy Bogle, Nikhil Bhatia (MIT), Ishai Menache, Nikolaj Bjorner (MSR), Asaf Valadarsky, Michael Schapira (Hebrew U)

  2. How to invest smartly in the stock market? Diagram labels: \(x_1, y_1\); \(x_2, y_2\); \(x_3, y_3\); Gain/Loss; $10,000; Uncertainty; Control. Loss will be ≤ $100 with probability 99%. Solution: financial risk theory

  3. Traffic Engineering (TE) problem. Diagram: three BOS-NYC paths labeled \(x_1\), \(x_2\), \(x_3\), each marked 10 Gbps; traffic demand: 10 Gbps; Throughput/Packet loss.
     • How to configure the allocation of traffic on network paths?
     • Goal: efficiently utilize the network to match the current traffic demand (periodic process)

  4. Extensive research on TE in a broad variety of environments
     • Wide-area networks: Kumar et al. [NSDI’18]; Liu et al. [SIGCOMM’14]; Kumar et al. [SIGCOMM’15]; Jain et al. [SIGCOMM’13]; Hong et al. [SIGCOMM’13]
     • ISP networks: Jiang et al. [SIGMETRICS’09]; Kandula et al. [SIGCOMM’05]; Fortz et al. [INFOCOM’2000]
     • Data center networks: Alizadeh et al. [SIGCOMM’14]; Akyildiz et al. [Journal of Comp. Nets.’14]; Benson et al. [CoNEXT’11]

  5. TE problem in Wide-Area Networks (figure: Microsoft WAN)

  6. TE problem in Wide-Area Networks (figure: Microsoft WAN)
     Challenging:
     • Billion-dollar infrastructure
     • High efficiency and availability
     Solution:
     • Model the network as a graph
     • Solve a Linear Program (objective, constraints)
     [B. Fortz, Internet Traffic Engineering by Optimizing OSPF Weights, INFOCOM’2000]

  7. Competing goals: high utilization and availability (diagram labels: Availability, Utilization, Failures)

  8. Competing goals: high utilization and availability (diagram labels: Availability, Utilization, Failures)

  9. Traffic engineering under failures
     • Today: optimize for the worst conceivable (potentially unlikely) failure scenarios
     • Robust against k simultaneous link failures
     • Problem: under-utilizing the network
     Example (BOS-NYC): three links of 10 Gbps (p(fail) = 10^-2), 5 Gbps (p(fail) = 10^-1), and 10 Gbps (p(fail) = 10^-2). Admissible traffic: 5 Gbps all the time (99.999% of the time)

  10. Traffic engineering under failures. Robust against k simultaneous link failures. Example (BOS-NYC): three links of 10 Gbps (p(fail) = 10^-2), 5 Gbps (p(fail) = 10^-1), and 10 Gbps (p(fail) = 10^-2). Admissible traffic: 5 Gbps, 99.999% of the time. (Diagram labels: Utilization, Availability, Failures)

  11. Our approach to traffic engineering under failures
      • Use the failure probabilities to reason about the likelihood of failure scenarios
      • Provide a mathematical probabilistic guarantee for availability
      Example (BOS-NYC): three links of 10 Gbps (p(fail) = 10^-2), 5 Gbps (p(fail) = 10^-1), and 10 Gbps (p(fail) = 10^-2).

      Admissible traffic    Availability
      5 Gbps                99.999%
      10 Gbps               99.99%
      15 Gbps               99.8%
      20 Gbps               98%
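
The availability figures on this slide can be checked by enumerating every up/down combination of the three links. The sketch below assumes that link failures are independent and that traffic can be shifted freely onto whichever links survive; neither assumption is stated on the slide, and the function name is illustrative only.

```python
from itertools import product

# (capacity in Gbps, failure probability) for the three BOS-NYC links on the slide
LINKS = [(10, 1e-2), (5, 1e-1), (10, 1e-2)]

def availability(target_gbps, links=LINKS):
    """Probability that the surviving links can carry target_gbps.

    Assumes independent failures and free rerouting onto surviving links.
    """
    total = 0.0
    # Enumerate all 2^n scenarios; True means the link is up.
    for state in product([True, False], repeat=len(links)):
        prob = 1.0
        capacity = 0
        for up, (cap, p_fail) in zip(state, links):
            prob *= (1 - p_fail) if up else p_fail
            capacity += cap if up else 0
        if capacity >= target_gbps:
            total += prob
    return total

for target in (5, 10, 15, 20):
    print(f"{target} Gbps: {availability(target):.3%}")
```

Under these assumptions the computed values round to the availability column of the slide.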

  12. Our approach to traffic engineering under failures
      • Use the failure probabilities to reason about the likelihood of failure scenarios
      • Provide a mathematical probabilistic guarantee for availability
      Diagram: flow allocation vector \(x\) and uncertainty vector \(y\); BOS-NYC links (10 Gbps, 5 Gbps, 10 Gbps) labeled \(x_1, y_1\); \(x_2, y_2\); \(x_3, y_3\); demand \(d_i\).
      For all flows, 90% of the demand is satisfied 99.9% of the time. Equivalently: for all flows, loss will be ≤ 10% of the demand 99.9% of the time.

  13. Main idea
      • Finance (\(x_1, y_1\); \(x_2, y_2\); \(x_3, y_3\)): the loss will be ≤ $100 with probability 99%. Find x that minimizes the loss with probability β.
      • Networking (\(x_1, y_1\); \(x_2, y_2\); \(x_3, y_3\); demand \(d_i\)): the loss will be ≤ 10% of the demand with probability 99%. Find x that minimizes the loss with probability β.

  14. Key technique: scenario-based formulation
      A failure scenario:
      • One link failure
      • Correlated link failures
      Loss:
      • Unsatisfied demand
      • Packet loss
      Loss ≤ 10% of demand with probability 0.95.
      [Figure: probability vs. loss (%); target probability β = 0.95; VaR_β = 10%]

  15. Key technique: scenario-based formulation
      \( \psi(x, \mathrm{VaR}) = P\big(q \mid L(x, y(q)) \le \mathrm{VaR}\big) \)
      \( \min\{\mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta\} \)
      [Figure: probability vs. loss (%); target probability β = 0.95; VaR_β = 10%]
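
To make the definition concrete, here is a minimal sketch that computes VaR for a fixed allocation from a list of (scenario probability, loss) pairs. The loss distribution is hypothetical, chosen only so that the β = 0.95 threshold lands at 10% as in the figure; the function name is not from the slides.

```python
def value_at_risk(scenarios, beta):
    """Smallest loss threshold v such that P(loss <= v) >= beta.

    scenarios: iterable of (probability, loss) pairs whose probabilities sum to 1.
    """
    cumulative = 0.0
    # Walk scenarios from smallest to largest loss, accumulating probability mass.
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= beta - 1e-12:  # small tolerance for float rounding
            return loss
    raise ValueError("scenario probabilities sum to less than beta")

# Hypothetical distribution: (probability, loss as a fraction of demand)
scenarios = [(0.90, 0.00), (0.06, 0.10), (0.03, 0.40), (0.01, 1.00)]
print(value_at_risk(scenarios, beta=0.95))
```

With this distribution, 90% of the mass has zero loss, so the 0.95 target is first reached at the 10% loss scenario.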

  16. TeaVaR: Traffic Engineering Applying Value-at-Risk
      \( \psi(x, \mathrm{VaR}) = P\big(q \mid L(x, y(q)) \le \mathrm{VaR}\big) \)
      \( \min\{\mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta\} \)
      What about the worst 5% of scenarios?
      \( \min\{E[\mathrm{Loss} \mid \mathrm{Loss} \ge \mathrm{VaR}]\} \)
      [Figure: probability vs. loss (%); β = 0.95]
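
The tail expectation on this slide can be illustrated for a fixed allocation as follows. This is a sketch that evaluates the slide's expression E[Loss | Loss ≥ VaR] literally on a hypothetical discrete distribution; it does not perform the minimization over allocations, and the function name is an assumption.

```python
def expected_tail_loss(scenarios, var):
    """Mean loss over the scenarios whose loss is at least var.

    scenarios: iterable of (probability, loss) pairs.
    """
    tail = [(prob, loss) for prob, loss in scenarios if loss >= var]
    mass = sum(prob for prob, _ in tail)
    if mass == 0:
        raise ValueError("no scenario has loss >= var")
    # Probability-weighted average of the tail losses.
    return sum(prob * loss for prob, loss in tail) / mass

# Hypothetical distribution: (probability, loss as a fraction of demand)
scenarios = [(0.90, 0.00), (0.06, 0.10), (0.03, 0.40), (0.01, 1.00)]
# For this distribution the loss threshold at beta = 0.95 is 0.10.
print(expected_tail_loss(scenarios, var=0.10))
```

Here the tail expectation is 28% of demand even though the VaR threshold is only 10%, which is why the worst scenarios deserve separate attention.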

  17. Challenges unique to networking
      • Achieving fairness across network users.
      • Enabling computational tractability as the network scales.
      • Capturing fast rerouting of traffic in the data plane.
      • Accounting for correlated failures.

  18. Achieving fairness
      Objective: find x that minimizes the loss with probability β.
      • \(R_i\): routes for flow i
      • \(x_r\): flow allocation on route \(r \in R_i\)
      • \(y_r\): binary variable indicating if route r is up
      • \(d_i\): demand of flow i
      • Satisfied demand for flow i: \( \sum_{r \in R_i} x_r y_r \)
      Starvation-aware loss function (worst-case normalized unmet demand):
      \( L(x, y) = \max_i \left[ 1 - \frac{\sum_{r \in R_i} x_r y_r}{d_i} \right]^+ \)
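
A direct transcription of this loss function, as a minimal sketch: the dictionary-based data layout, the flow and route names, and the numbers are all illustrative assumptions rather than anything shown on the slide.

```python
def loss(allocation, route_up, routes_of_flow, demand):
    """Worst-case normalized unmet demand across flows.

    allocation:     route -> allocated bandwidth (x_r)
    route_up:       route -> True if the route is up in this scenario (y_r)
    routes_of_flow: flow -> list of its routes (R_i)
    demand:         flow -> demand (d_i)
    """
    worst = 0.0
    for flow, routes in routes_of_flow.items():
        satisfied = sum(allocation[r] for r in routes if route_up[r])
        # [.]^+ : unmet demand cannot be negative
        unmet = max(0.0, 1.0 - satisfied / demand[flow])
        worst = max(worst, unmet)
    return worst

# Hypothetical single flow with three routes and a 10 Gbps demand
routes_of_flow = {"BOS-NYC": ["r1", "r2", "r3"]}
demand = {"BOS-NYC": 10.0}
allocation = {"r1": 4.0, "r2": 2.0, "r3": 4.0}

print(loss(allocation, {"r1": True, "r2": True, "r3": True}, routes_of_flow, demand))
print(loss(allocation, {"r1": True, "r2": False, "r3": True}, routes_of_flow, demand))
```

Taking the maximum over flows means a single starved flow determines the loss, which is the fairness property the slide is after.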

  19. Handling scale
      [Figure: coverage (%) for "All Scenarios" vs. "Our approach" on the B4, IBM, MSFT, and ATT topologies]
      [Figure: run time (s) on the B4, IBM, MSFT, and ATT topologies]
      [Figure: Google’s B4 topology]

      Topology    # Edges    # Scenarios
      B4          38         O(1E11)
      IBM         48         O(1E14)
      MSFT        100        O(1E30)
      ATT         112        O(1E33)
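
The scenario counts in the table are consistent with 2^(number of edges), since each edge is either up or down. The slide does not say how the scenario set is reduced, so the sketch below is only one plausible illustration: it assumes independent failures with a single hypothetical per-edge probability and keeps the scenarios with at most a few simultaneous failures, then reports how much probability mass (coverage) they hold.

```python
from math import comb

def scenario_count(num_edges):
    """Number of up/down combinations over all edges."""
    return 2 ** num_edges

def kept_scenarios(num_edges, max_failures):
    """Number of scenarios with at most max_failures failed edges."""
    return sum(comb(num_edges, k) for k in range(max_failures + 1))

def coverage(num_edges, p_fail, max_failures):
    """Probability mass of scenarios with at most max_failures failed edges.

    Assumes independent failures with the same probability on every edge.
    """
    return sum(
        comb(num_edges, k) * p_fail**k * (1 - p_fail) ** (num_edges - k)
        for k in range(max_failures + 1)
    )

P_FAIL = 0.001  # hypothetical per-edge failure probability
for name, edges in [("B4", 38), ("IBM", 48), ("MSFT", 100), ("ATT", 112)]:
    print(
        f"{name}: {scenario_count(edges):.1e} scenarios in total, "
        f"{kept_scenarios(edges, 2)} with <= 2 failures, "
        f"coverage {coverage(edges, P_FAIL, 2):.6%}"
    )
```

Under these assumptions a tiny fraction of the scenarios carries almost all of the probability mass, which is the intuition behind optimizing over a reduced scenario set.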

  20. System architecture
      • Inputs: topology; flow demands (\(d_i\)); failure probability of scenarios; target availability (0.99, 0.999, ...)
      • TeaVaR: linear optimization
      • Output: flow allocations (\(x_1\), \(x_2\), \(x_3\))

  21. Evaluations
      • Topologies: B4, IBM, ATT, and MSFT
      • Traffic matrix: four months of MSFT traffic matrices (one sample/hour); for the other topologies, 24 TMs from YATES [SOSR’18]
      • Tunnel selection: our optimization framework is orthogonal to tunnel selection; evaluated with oblivious paths, link-disjoint paths, and k-shortest paths
      • Baselines: SMORE [NSDI’18], FFC [SIGCOMM’14], B4 [SIGCOMM’13], ECMP

  22. Availability vs. demand scale
      • Availability is measured as the probability mass of scenarios in which demand is fully satisfied (“all-or-nothing” requirement)
      • If a TE scheme’s bandwidth allocation is unable to fully satisfy demand in 0.1% of scenarios, it has an availability of 99.9%
      [Figure: availability (%) vs. demand scale for SMORE, B4, FFC-1, FFC-2, ECMP, and TeaVaR]
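
The all-or-nothing availability metric described here reduces to a one-line sum once each scenario's probability and loss are known. A minimal sketch with hypothetical numbers mirroring the 0.1% example (the function name is illustrative):

```python
def all_or_nothing_availability(scenarios):
    """Probability mass of scenarios in which demand is fully satisfied.

    scenarios: iterable of (probability, loss) pairs, where a loss of 0
    means every flow's demand is met in that scenario.
    """
    return sum(prob for prob, loss in scenarios if loss == 0)

# Hypothetical: demand is fully met except in scenarios totalling 0.1% probability
scenarios = [(0.999, 0.0), (0.0007, 0.2), (0.0003, 1.0)]
print(f"{all_or_nothing_availability(scenarios):.1%}")
```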

  23. Robustness to probability estimates

      Noise in probability estimations    % error in throughput
      1%                                  1.43%
      5%                                  2.95%
      10%                                 3.07%
      15%                                 3.95%
      20%                                 6.73%

  24. Summary
      • TeaVaR uses financial risk theory for solving Traffic Engineering under failures.
      • TeaVaR’s approach is applicable to networking resource allocation problems such as capacity planning.
      • Code and demo available at: http://teavar.csail.mit.edu/
