Striking the Right Utilization-Availability Balance in the WAN




slide-1
SLIDE 1

Striking the Right Utilization-Availability Balance in the WAN

Manya Ghobadi (MIT)

Joint work with: Jeremy Bogle, Nikhil Bhatia (MIT), Ishai Menache, Nikolaj Bjorner (MSR), Asaf Valadarsky, Michael Schapira (Hebrew U)

slide-2
SLIDE 2

How to invest smartly in the stock market?

[Figure: $10,000 invested. Control: allocations x_1, x_2, x_3. Uncertainty: gain/loss outcomes y_1, y_2, y_3. Guarantee: loss will be ≤ $100 with probability 99%.]

Solution: financial risk theory

slide-3
SLIDE 3

Traffic Engineering (TE) problem

[Figure: BOS to NYC over three links, each labeled 10 Gbps, with flow allocations x_1, x_2, x_3. Traffic demand: 10 Gbps. Outcome: throughput / packet loss.]

  • How to configure the allocation of traffic on network paths?
  • Goal: efficiently utilize the network to match the current traffic demand (periodic process)

slide-4
SLIDE 4

Extensive research on TE in a broad variety of environments

  • Wide-area networks
    • Kumar et al. [NSDI’18]
    • Liu et al. [SIGCOMM’14]
    • Kumar et al. [SIGCOMM’15]
    • Jain et al. [SIGCOMM’13]
    • Hong et al. [SIGCOMM’13]
  • ISP networks
    • Jiang et al. [SIGMETRICS’09]
    • Kandula et al. [SIGCOMM’05]
    • Fortz et al. [INFOCOM’2000]
  • Data center networks
    • Alizadeh et al. [SIGCOMM’14]
    • Akyildiz et al. [Journal of Comp. Nets.’14]
    • Benson et al. [CoNEXT’11]

slide-5
SLIDE 5

TE problem in Wide-Area Networks

[Figure: Microsoft WAN]

slide-6
SLIDE 6

TE problem in Wide-Area Networks

Challenging:

  • Billion-dollar infrastructure
  • High efficiency and availability

Solution:

  • Model the network as a graph
  • Solve a Linear Program

[Figure: a linear program annotated with its objective and constraints, shown next to the Microsoft WAN]

[B. Fortz, Internet Traffic Engineering by Optimizing OSPF Weights, INFOCOM’2000]
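
The slide names the ingredients (a graph, a linear program, an objective, constraints) but the program itself did not survive extraction. As a hedged illustration only, one common path-based formulation is sketched below; it is an assumption, not a transcription of the slide. Here x_r is the bandwidth placed on tunnel r, R_i is the set of tunnels of flow i, and d_i is its demand. The symbols b_i (bandwidth granted to flow i) and c_e (capacity of link e) are introduced just for this sketch.

```latex
\begin{align*}
\text{maximize}\quad   & \sum_i b_i \\
\text{subject to}\quad & b_i \le \sum_{r \in R_i} x_r && \forall i \\
                       & b_i \le d_i                  && \forall i \\
                       & \sum_{r \ni e} x_r \le c_e   && \forall e \\
                       & x_r \ge 0                    && \forall r
\end{align*}
```

The objective rewards granted bandwidth, the first two constraints cap each flow by its allocation and its demand, and the third keeps every link within capacity.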

slide-7
SLIDE 7

Competing goals: high utilization and availability

[Figure: failures, availability, and utilization]

slide-9
SLIDE 9

Traffic engineering under failures

[Figure: BOS to NYC over three links with capacities 10 Gbps, 10 Gbps, and 5 Gbps, labeled with failure probabilities p(fail) = 10^-2, 10^-1, and 10^-2]

Robust against k simultaneous link failures. Admissible traffic: 5 Gbps all the time.

  • Today: optimize for the worst conceivable (potentially unlikely) failure scenarios
  • Problem: under-utilizing the network

slide-10
SLIDE 10

Traffic engineering under failures

[Figure: BOS to NYC over three links with capacities 10 Gbps, 10 Gbps, and 5 Gbps, labeled with failure probabilities p(fail) = 10^-2, 10^-1, and 10^-2]

Robust against k simultaneous link failures. Admissible traffic: 5 Gbps, 99.999% of the time.

[Figure: failures, availability, and utilization]

slide-11
SLIDE 11

Our approach to traffic engineering under failures

[Figure: BOS to NYC over three links with capacities 10 Gbps, 10 Gbps, and 5 Gbps, labeled with failure probabilities p(fail) = 10^-2, 10^-1, and 10^-2]

Admissible traffic    Availability
5 Gbps                99.999%
10 Gbps               99.99%
15 Gbps               99.8%
20 Gbps               98%

  • Use the failure probabilities to reason about the likelihood of failure scenarios
  • Provide a mathematical probabilistic guarantee for availability
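
The availability figures in this slide's table can be reproduced by enumerating failure scenarios. The sketch below is a minimal illustration under two assumptions that the slide does not state: link failures are independent, and the 5 Gbps link is the one with p(fail) = 10^-1 (the slide lists the three probabilities without saying which link each belongs to). The names `LINKS` and `availability` are hypothetical.

```python
from itertools import product

# Assumed mapping of the slide's labels to links: (capacity in Gbps, p(fail)).
# Which probability belongs to which link is an assumption, not from the slide.
LINKS = [(10, 1e-2), (10, 1e-2), (5, 1e-1)]


def availability(target_gbps, links=LINKS):
    """Probability that the surviving capacity is at least target_gbps,
    assuming independent link failures."""
    total = 0.0
    # Enumerate every up/down combination of the links (2**n scenarios).
    for up in product([True, False], repeat=len(links)):
        prob = 1.0
        capacity = 0
        for (cap, p_fail), is_up in zip(links, up):
            prob *= (1 - p_fail) if is_up else p_fail
            capacity += cap if is_up else 0
        if capacity >= target_gbps:
            total += prob
    return total


if __name__ == "__main__":
    for target in (5, 10, 15, 20):
        print(f"{target} Gbps: {100 * availability(target):.3f}%")
```

Under these assumptions the four targets come out at roughly 99.999%, 99.99%, 99.79%, and 98.01%, which round to the values in the table.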

slide-12
SLIDE 12

Our approach to traffic engineering under failures

[Figure: BOS to NYC over three links with capacities 10 Gbps, 10 Gbps, and 5 Gbps, carrying demand d_i]

  • Use the failure probabilities to reason about the likelihood of failure scenarios
  • Provide a mathematical probabilistic guarantee for availability

For all flows, 90% of the demand is satisfied 99.9% of the time.
For all flows, loss will be ≤ 10% of the demand 99.9% of the time.

Uncertainty vector: y = (y_1, y_2, y_3)

Flow allocation vector: x = (x_1, x_2, x_3)

slide-13
SLIDE 13

Main idea

Stock market (allocations x_1, x_2, x_3; uncertainty y_1, y_2, y_3):

  • The loss will be ≤ $100 with probability 99%
  • Find x that minimizes the loss with probability β

Traffic engineering (flow allocations x_1, x_2, x_3; uncertainty y_1, y_2, y_3; demand d_i):

  • The loss will be ≤ 10% of the demand with probability 99%
  • Find x that minimizes the loss with probability β

slide-14
SLIDE 14

Key technique: scenario-based formulation

[Figure: probability distribution of loss; x-axis: Loss (%), ticks at 0, 5, 10; y-axis: Probability]

A failure scenario:

  • One link failure
  • Correlated link failures

Loss:

  • Unsatisfied demand
  • Packet loss

Target probability β = 0.95, VaR_β = 10%: loss ≤ 10% of demand with probability 0.95

slide-15
SLIDE 15

Key technique: scenario-based formulation

[Figure: probability distribution of loss; x-axis: Loss (%), ticks at 0, 5, 10; y-axis: Probability; β = 0.95 marked]

\[ \psi(x, \mathrm{VaR}) = P\big(q \mid L(x, y(q)) \le \mathrm{VaR}\big) \]

\[ \min\{\mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta\} \]

Target probability β = 0.95, VaR_β = 10%
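
The two expressions on this slide define VaR as the smallest loss threshold whose cumulative scenario probability reaches β. A minimal sketch of that definition over a finite scenario list follows; the function name `value_at_risk` and the loss distribution `EXAMPLE` are hypothetical and not taken from the talk.

```python
def value_at_risk(scenarios, beta):
    """Smallest threshold v such that P(loss <= v) >= beta.

    scenarios: list of (probability, loss) pairs whose probabilities sum to 1.
    """
    cumulative = 0.0
    # Walk the scenarios from smallest to largest loss, accumulating mass.
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= beta - 1e-12:  # small tolerance for float rounding
            return loss
    raise ValueError("scenario probabilities sum to less than beta")


# Hypothetical distribution: (probability, loss in % of demand).
EXAMPLE = [(0.80, 0.0), (0.10, 5.0), (0.05, 10.0), (0.04, 40.0), (0.01, 100.0)]

if __name__ == "__main__":
    print(value_at_risk(EXAMPLE, 0.95))
```

With this made-up distribution, 95% of the probability mass has loss at or below 10%, so the value at risk for β = 0.95 is a loss of 10%, mirroring the slide's example.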

slide-16
SLIDE 16

[Figure: probability distribution of loss; x-axis: Loss (%), ticks at 0, 5, 10; y-axis: Probability; β = 0.95 marked]

\[ \psi(x, \mathrm{VaR}) = P\big(q \mid L(x, y(q)) \le \mathrm{VaR}\big) \]

\[ \min\{\mathrm{VaR} \mid \psi(x, \mathrm{VaR}) \ge \beta\} \]

What about the worst 5% of scenarios?

\[ \min\{E[\mathrm{Loss} \mid \mathrm{Loss} \ge \mathrm{VaR}]\} \]

TeaVaR: Traffic Engineering Applying Value-at-Risk
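
The tail expectation E[Loss | Loss ≥ VaR] on this slide can be evaluated directly on a finite scenario list. The sketch below follows the slide's expression literally; the function names and the distribution `EXAMPLE` are hypothetical. Standard conditional value-at-risk definitions for discrete distributions treat the probability atom at VaR differently, so their values can differ from this literal version.

```python
def value_at_risk(scenarios, beta):
    """Smallest threshold v such that P(loss <= v) >= beta."""
    cumulative = 0.0
    for prob, loss in sorted(scenarios, key=lambda s: s[1]):
        cumulative += prob
        if cumulative >= beta - 1e-12:  # small tolerance for float rounding
            return loss
    raise ValueError("scenario probabilities sum to less than beta")


def tail_expected_loss(scenarios, beta):
    """E[loss | loss >= VaR_beta], following the slide's expression."""
    var = value_at_risk(scenarios, beta)
    tail = [(prob, loss) for prob, loss in scenarios if loss >= var]
    tail_mass = sum(prob for prob, _ in tail)
    return sum(prob * loss for prob, loss in tail) / tail_mass


# Hypothetical distribution: (probability, loss in % of demand).
EXAMPLE = [(0.80, 0.0), (0.10, 5.0), (0.05, 10.0), (0.04, 40.0), (0.01, 100.0)]

if __name__ == "__main__":
    print(value_at_risk(EXAMPLE, 0.95), tail_expected_loss(EXAMPLE, 0.95))
```

For the made-up distribution, VaR at β = 0.95 is a 10% loss, while the expected loss across the scenarios at or beyond that threshold is about 31%. The gap is what the slide's question about the worst 5% of scenarios points at: VaR alone says nothing about how bad the tail is.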

slide-17
SLIDE 17

Challenges unique to networking

  • Achieving fairness across network users.
  • Enabling computational tractability as the network scales.
  • Capturing fast rerouting of traffic in data plane.
  • Accounting for correlated failures.

slide-18
SLIDE 18

Achieving fairness

Starvation-aware loss function:

  • Worst-case normalized unmet demand

Notation:

  • R_i: routes for flow i
  • x_r: flow allocation on route r ∈ R_i
  • y_r: binary variable indicating whether route r is up
  • d_i: demand of flow i

Objective: find x that minimizes the loss with probability β

Satisfied demand for flow i: \( \sum_{r \in R_i} x_r y_r \)

\[ L(x, y) = \max_i \left[ 1 - \frac{\sum_{r \in R_i} x_r y_r}{d_i} \right]^+ \]

[Figure: flow allocations x_1, x_2, x_3 and route states y_1, y_2, y_3 for demand d_i]
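
The loss function on this slide is the largest fraction of demand left unmet across flows, clipped at zero. A minimal sketch of evaluating it for one failure scenario follows; the function name `loss`, the dictionary layout, and the example flows and routes are hypothetical.

```python
def loss(allocations, route_up, routes, demands):
    """L(x, y) = max_i [1 - sum_{r in R_i} x_r * y_r / d_i]^+

    allocations: route -> x_r (bandwidth placed on the route)
    route_up:    route -> y_r (1 if the route is up, 0 if it is down)
    routes:      flow  -> list of routes R_i
    demands:     flow  -> d_i
    """
    worst = 0.0  # starting at 0 implements the [.]^+ clipping
    for flow, flow_routes in routes.items():
        satisfied = sum(allocations[r] * route_up[r] for r in flow_routes)
        unmet = 1.0 - satisfied / demands[flow]
        worst = max(worst, unmet)
    return worst


# Hypothetical example: two flows, three routes.
ROUTES = {"f1": ["r1", "r2"], "f2": ["r3"]}
ALLOC = {"r1": 6.0, "r2": 4.0, "r3": 5.0}
DEMANDS = {"f1": 10.0, "f2": 5.0}

if __name__ == "__main__":
    print(loss(ALLOC, {"r1": 1, "r2": 0, "r3": 1}, ROUTES, DEMANDS))
```

Taking the maximum over flows rather than a sum is what makes the function starvation-aware: a scenario counts as bad as soon as any single flow loses a large share of its demand, even if the aggregate looks fine.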

slide-19
SLIDE 19

Handling scale

[Figure: Google’s B4 topology]

Topology    # Edges    # Scenarios
B4          38         O(1E11)
IBM         48         O(1E14)
MSFT        100        O(1E30)
ATT         112        O(1E33)

[Figure: Coverage (%) per topology (B4, IBM, MSFT, ATT); y-axis from 98 to 100]

[Figure: Run time (s) per topology (B4, IBM, MSFT, ATT), comparing "All Scenarios" with "Our approach"; y-axis from 25 to 125]
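
The scenario counts in this slide's table match 2^n for n edges, since each edge is either up or down. The slide does not say how the approach picks the scenarios it covers, so the sketch below assumes one plausible rule purely for illustration: keep scenarios with at most k simultaneous failures, with independent failures and one shared failure probability. The names `scenario_count` and `coverage` and the value of `p_fail` are hypothetical.

```python
from math import comb

# Edge counts taken from the slide's table.
EDGES = {"B4": 38, "IBM": 48, "MSFT": 100, "ATT": 112}


def scenario_count(num_edges):
    """Each edge is up or down, so there are 2**n failure scenarios."""
    return 2 ** num_edges


def coverage(num_edges, p_fail, max_failures):
    """Probability mass of scenarios with at most max_failures failed edges,
    assuming independent failures with a shared probability p_fail."""
    return sum(
        comb(num_edges, k) * p_fail**k * (1 - p_fail) ** (num_edges - k)
        for k in range(max_failures + 1)
    )


if __name__ == "__main__":
    for name, n in EDGES.items():
        # p_fail = 0.001 is an illustrative value, not a measured one.
        print(name, f"{scenario_count(n):.1e}", f"{100 * coverage(n, 0.001, 2):.4f}%")
```

The point of the sketch is the asymmetry: the number of scenarios grows exponentially with the edge count, while a small set of likely scenarios can already carry almost all of the probability mass.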

slide-20
SLIDE 20

System architecture

[Figure: TeaVaR linear optimization. Inputs: topology, failure probability of scenarios, target availability (0.99, 0.999, ...), and flow demands d_i. Output: flow allocations x_1, x_2, x_3.]
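
The slide shows TeaVaR as a linear optimization but not the program itself. As a hedged sketch, the tail-expectation objective can be written as a linear program using the standard Rockafellar-Uryasev linearization; this is an assumption about what the optimization box contains, not a transcription of it. The symbols p_q (probability of scenario q), y_r(q) (state of route r in scenario q), α (auxiliary variable playing the role of VaR), s_q (excess loss of scenario q over α), and c_e (capacity of link e) are introduced for this sketch.

```latex
\begin{align*}
\text{minimize}\quad   & \alpha + \frac{1}{1-\beta} \sum_q p_q\, s_q \\
\text{subject to}\quad & s_q \ge 1 - \frac{\sum_{r \in R_i} x_r\, y_r(q)}{d_i} - \alpha && \forall i,\ \forall q \\
                       & s_q \ge 0                    && \forall q \\
                       & \alpha \ge 0 \\
                       & \sum_{r \ni e} x_r \le c_e   && \forall e \\
                       & x_r \ge 0                    && \forall r
\end{align*}
```

Writing one constraint per flow and scenario encodes the maximum over flows in the loss function, and the pair s_q ≥ 0, α ≥ 0 encodes its clipping at zero, so the whole problem stays linear in x, α, and s.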

slide-21
SLIDE 21
slide-22
SLIDE 22

Evaluations

  • Topologies: B4, IBM, ATT, and MSFT
  • Traffic matrix:
    • Four months of MSFT traffic matrices (one sample/hour); for the other topologies, 24 TMs from YATES [SOSR’18]
  • Tunnel selection:
    • Our optimization framework is orthogonal to tunnel selection
    • Oblivious paths, link-disjoint paths, and k-shortest paths
  • Baselines:
    • SMORE [NSDI’18]
    • FFC [SIGCOMM’14]
    • B4 [SIGCOMM’13]
    • ECMP

slide-23
SLIDE 23

Availability vs. demand scale

[Figure: Availability (%) versus demand scale for SMORE, B4, FFC-1, FFC-2, ECMP, and TeaVaR; x-axis: Demand Scale, from 1.0 to 3.1; y-axis: Availability (%), from 90 to 100]

  • Availability is measured as the probability mass of scenarios in which demand is fully satisfied (“all-or-nothing” requirement)
  • If a TE scheme’s bandwidth allocation is unable to fully satisfy demand in 0.1% of scenarios, it has an availability of 99.9%

slide-24
SLIDE 24

Robustness to probability estimates

Noise in probability estimations    % error in throughput
1%                                  1.43%
5%                                  2.95%
10%                                 3.07%
15%                                 3.95%
20%                                 6.73%

slide-25
SLIDE 25

Summary

  • TeaVaR uses financial risk theory for solving Traffic Engineering under failures.
  • TeaVaR’s approach is applicable to networking resource allocation problems such as capacity planning.
  • Code and demo available at: http://teavar.csail.mit.edu/