Achieving High Utilization with Software-Driven WAN
Chi-Yao Hong (UIUC) Srikanth Kandula Ratul Mahajan
Microsoft
Ming Zhang Vijay Gill Mohan Nanduri Roger Wattenhofer
Tuesday, August 13, 13
Background: Inter-DC WANs
[World map: Hong Kong, Seoul, Seattle, Los Angeles, New York, Miami, Dublin, Barcelona]
Inter-DC WANs are critical
Inter-DC WANs are highly expensive
Poor efficiency: low average utilization over time
Poor sharing: little support for flexible resource sharing
[Plot: normalized traffic rate over ~one day, with the mean marked; background vs. non-background traffic]
Adapting the rate of background traffic reduces the peak by more than 50% (peak before vs. after rate adaptation).
MPLS TE (Multiprotocol Label Switching Traffic Engineering) greedily selects the shortest path that satisfies the capacity constraints
MPLS-TE example
Flows: A 1→6, B 3→6, C 4→6; arrival order: A, B, C; each link can carry at most one flow.
[Figure: 7-node topology, nodes 1-7]
MPLS-TE routes each arriving flow on its shortest available path, which can leave a later flow with no feasible path; the optimal joint placement routes all the flows.
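The greedy-vs-optimal gap can be reproduced in a few lines. The unit-capacity topology below is a hypothetical reconstruction (the slide's figure is not in the text): greedy shortest-path routing in arrival order places only two of the three flows, while a joint placement fits all three.

```python
from collections import deque
from itertools import product

# Hypothetical unit-capacity topology, reconstructed purely for illustration.
EDGES = [(1, 2), (2, 6), (3, 2), (3, 5), (5, 6), (4, 5), (3, 7), (7, 6)]
FLOWS = {"A": (1, 6), "B": (3, 6), "C": (4, 6)}  # arrival order: A, B, C

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def shortest_path(adj, used, src, dst):
    """BFS shortest path using only links that still have spare capacity."""
    parent, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj.get(u, ())):
            if v not in parent and frozenset((u, v)) not in used:
                parent[v] = u
                queue.append(v)
    return None

def greedy_te(edges, flows):
    """MPLS-TE-style greedy: route flows in arrival order on the shortest
    path with free capacity; a later flow may find no path left."""
    adj, used, placed = adjacency(edges), set(), {}
    for name, (s, d) in flows.items():
        path = shortest_path(adj, used, s, d)
        if path is not None:
            placed[name] = path
            used |= {frozenset(e) for e in zip(path, path[1:])}
    return placed

def simple_paths(adj, src, dst, seen=()):
    if src == dst:
        yield [dst]
        return
    for v in sorted(adj.get(src, ())):
        if v not in seen:
            for rest in simple_paths(adj, v, dst, seen + (src,)):
                yield [src] + rest

def optimal_te(edges, flows):
    """Brute-force joint placement: maximize the number of routed flows."""
    adj = adjacency(edges)
    options = [list(simple_paths(adj, s, d)) + [None] for s, d in flows.values()]
    best = {}
    for combo in product(*options):
        used, placed = set(), {}
        for name, path in zip(flows, combo):
            if path is None:
                continue
            links = {frozenset(e) for e in zip(path, path[1:])}
            if used & links:
                break
            used |= links
            placed[name] = path
        else:
            if len(placed) > len(best):
                best = placed
    return best

greedy, optimal = greedy_te(EDGES, FLOWS), optimal_te(EDGES, FLOWS)
```

On this topology, greedy routing takes 1-2-6 for A and 3-5-6 for B, leaving C with no free path; the joint placement uses 3-7-6 for B so that C fits on 4-5-6.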
Higher throughput by sending faster.
Using queues in switches helps, but # services (hundreds) ≫ # queues (4-8 typically).
Borrowing the idea of edge rate limiting, we can get better sharing without many queues.
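Edge rate limiting of the kind the slide alludes to can be sketched as a per-service token bucket at the host; the class and parameter names here are illustrative assumptions, not SWAN's host agent.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter of the kind an edge host could apply per
    service class. Illustrative sketch only; names and parameters are
    assumptions."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.burst = burst_bytes
        self.tokens = burst_bytes          # start with a full bucket
        self.last = time.monotonic()

    def allow(self, packet_bytes, now=None):
        """Admit the packet if enough tokens have accumulated."""
        now = time.monotonic() if now is None else now
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False
```

Each service gets its own bucket sized from its allocation, so hundreds of services can be policed at the edge even though switches expose only a handful of priority queues.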
SWAN architecture:
Hosts → SWAN controller: traffic demand
WAN switches → SWAN controller: topology, traffic
SWAN controller: global optimization for high utilization
SWAN controller → Hosts: rate allocation [rate limiting]
SWAN controller → WAN switches: network configuration [forwarding plane update]
Challenge #1: How to compute allocation in a time-efficient manner?
Path-constrained, multi-commodity flow problem
Solving at the granularity of {DC pairs, priority class}-tuple
State-of-the-art [Danna, Mandal, Singh; INFOCOM'12] takes minutes at our target scale, as it needs to solve a long sequence of LPs: # LPs = O(# saturated edges)
Invoke the MCF solver (maximize throughput, prefer shorter paths) repeatedly with geometrically growing upper & lower bounds on demand: α, α², ..., freezing saturated flow rates between invocations.
Theoretical bound: each flow receives a bounded fraction (depending on α) of its fair share rate. Empirically efficient with α = 2.
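The bounded-iteration idea can be sketched as follows, with a toy water-filling routine standing in for the MCF solver; the function name and the fill procedure are illustrative assumptions, not SWAN's solver.

```python
TOL = 1e-9

def approx_max_min(flows, capacity, alpha=2.0, unit=1.0, eps=0.01):
    """Approximate max-min fair rates in O(log_alpha(max demand)) rounds.
    Round k caps every unfrozen flow at unit * alpha**k, fills rates toward
    that cap (a toy stand-in for an MCF solver), then freezes flows that met
    their demand or hit a saturated link. Sketch only.
    flows: {name: (demand, [links])}; capacity: {link: cap}."""
    rate = {f: 0.0 for f in flows}
    residual = dict(capacity)
    frozen = set()
    k = 0
    while len(frozen) < len(flows):
        bound = unit * alpha ** k
        progress = True
        while progress:                       # water-fill in eps increments
            progress = False
            for f, (demand, links) in flows.items():
                cap_f = min(demand, bound)
                if f in frozen or rate[f] >= cap_f - TOL:
                    continue
                step = min(eps, cap_f - rate[f], min(residual[l] for l in links))
                if step > TOL:
                    rate[f] += step
                    for l in links:
                        residual[l] -= step
                    progress = True
        for f, (demand, links) in flows.items():
            if f not in frozen and (rate[f] >= demand - TOL
                                    or any(residual[l] <= TOL for l in links)):
                frozen.add(f)
        k += 1
    return rate
```

On a single 10-unit link shared by a flow with demand 3 and a flow with demand 20, the rounds converge to the max-min allocation (3, 7) after a handful of geometrically bounded steps.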
[Plot: flow goodput (versus max-min fair rate) against flow index (increasing order of demand), comparing SWAN (α = 2) with MPLS TE]
How to update forwarding plane without causing transient congestion?
Example: in the initial state, flows A and B take different paths; the target state swaps them. Updating switches one at a time can transiently place A and B on the same link.
In fact, a congestion-free update sequence might not exist!
Idea: leave a small amount of scratch capacity on each link.
Example (target state: A = 2/3 and B = 2/3 on swapped paths): with 1/3 slack, traffic can be moved in 1/3-sized increments (e.g., B = 1/3 at an intermediate step) so that no link is overloaded at any point.
Does slack guarantee that a congestion-free update always exists?
With slack s: a congestion-free update in ⌈1/s⌉ − 1 steps, whose order can be arbitrary.
It exists, but how do we find it?
LP formulation: variables r(i, f, p) give flow f's rate on path p at step i; for every link and every pair of consecutive steps i, i+1:
Σ_{f, p: p uses the link} max(r(i, f, p), r(i+1, f, p)) ≤ link capacity
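The per-link constraint is easy to check for a candidate step sequence: between consecutive states, each flow-on-a-path entry may transiently send at the larger of its two rates, and that worst case must respect every capacity. A minimal checker, with assumed data shapes:

```python
def congestion_free(states, paths, capacity, tol=1e-9):
    """Verify that applying a sequence of rate configurations never overloads
    a link: while moving between consecutive states, a (flow, path) entry may
    transiently send at the larger of its two rates. Illustrative checker;
    each key in `paths` is one flow-on-one-path entry (assumed naming).
    states: list of {entry: rate}; paths: {entry: [links]}; capacity: {link: c}."""
    for before, after in zip(states, states[1:]):
        load = {link: 0.0 for link in capacity}
        for entry, links in paths.items():
            worst = max(before.get(entry, 0.0), after.get(entry, 0.0))
            for link in links:
                load[link] += worst
        if any(load[link] > capacity[link] + tol for link in capacity):
            return False
    return True
```

For the 2/3-rate swap example, the one-shot update fails this check (a link would carry 4/3 of its capacity), while the staged sequence with 1/3-sized moves passes.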
Non-background traffic is congestion-free, using 90% of capacity (s = 10%); background traffic has bounded congestion, using 100% of capacity (s = 0%).
[Plot: CCDF over links & updates of overloaded traffic (MB); data-driven evaluation; s = 10% for non-background]
One-shot update brings heavy packet drops for both background and non-background traffic.
SWAN: non-background is congestion-free; background is much better than one-shot.
Working with limited switch memory
How many rules do we need? [By data-driven analysis] it requires 20K rules to fully use network capacity, but commodity switches have limited memory [Broadcom Trident II].
Finding the set of paths with a given size that carries the most traffic is NP-complete
[Hartman et al., INFOCOM’12]
Observation: a small number of paths is enough to provide basic connectivity and carry most of the traffic.
Approach: path selection (install only important paths) + rule update (stage rule changes within the memory budget).
Update pipeline:
1. Compute resource allocation
2. Notify services with decreased allocation
3. Compute rule update plan, if there is enough gain (skip if not)
4. Compute congestion-free update plan
5. Update network
6. Notify services with increased allocation
Prototype: 10 switches (Arista, Cisco N3K) and blade servers.
Data-driven evaluation: WAN spanning continents, 80+ switches.
[Plot: goodput (normalized & stacked) over ~6 minutes for Interactive, Elastic, and Background classes; traffic: 125 TCP flows per class for every DC pair]
High utilization: SWAN's goodput is 98% of an optimal (impractical) method; dips are due to rate adaptation.
Flexible sharing: interactive traffic is protected; background traffic is rate-adapted.
[Plot: goodput (versus optimal) for SWAN, SWAN w/o rate control, and MPLS TE]
SWAN: 99.0%, near-optimal total goodput under a practical setting.
MPLS TE: 58.3%; SWAN carries ~60% more traffic than MPLS TE.
SWAN w/o rate control: 70.3%; still carries ~20% more traffic than MPLS TE.
Key ingredients: rate and route coordination; fairness; memory.
Achieving high utilization by itself is easy, but coupling it with flexible sharing and change management is hard:
Approximating max-min fairness with low computation time.
Keeping scratch capacity on links and in switch memory to enable quick transitions.
SWAN vs. B4:
                                                     SWAN           B4
High utilization                                     yes            yes
Scalable rate and route computation                  bounded error  heuristic
Congestion-free update in bounded steps              yes            no
Commodity switches with limited # forwarding rules   yes            no
Fault handling: controller crashes; link and switch failures (trigger recomputation); controller bugs.
Global allocation at the {SrcDC, DstDC, ServicePriority} level (background < elastic < interactive).
[Figure: Src splits traffic toward Dst 30% / 50% / 20% across tunnels]
Label-based forwarding ("tunnels") using VLAN IDs; the controller globally computes how to split traffic at ingress switches.
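One way an ingress could realize such split ratios while keeping every packet of a flow on a single tunnel is hash-based weighted selection. This sketch assumes a (vlan_id, weight) representation and an assumed function name; it is not SWAN's actual data plane.

```python
import hashlib

def pick_tunnel(flow_key, tunnels):
    """Weighted tunnel choice by flow hash: map each flow to a stable point
    in [0, total weight) and select the tunnel whose cumulative-weight band
    contains it, so aggregate traffic follows the controller's split ratios
    (e.g. 30%/50%/20%) while each flow stays on one labeled path.
    tunnels: list of (vlan_id, weight); weights need not sum to 1."""
    total = float(sum(w for _, w in tunnels))
    digest = hashlib.sha256(flow_key.encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for vlan, weight in tunnels:
        cumulative += weight
        if point < cumulative:
            return vlan
    return tunnels[-1][0]
```

Hashing on the flow key keeps packets of one flow in order on one tunnel, while the distribution over many flows approaches the configured 30/50/20 split.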
[Figure: S1-S3 sending to D1-D3] Link-level splitting applies local ratios (1/2, 1/2) at each switch; network-wide splitting applies globally computed ratios (2/3, 1/3).
[Data-driven evaluation]
slack s   max steps   99th pctl. steps   goodput
50%       1           1                  79%
30%       3           2                  91%
10%       9           3                  100%
0%        ∞           6                  100%
Solve rate allocation; if it can fit in memory, done. If not: greedily select important paths and solve again with the selected paths. This uses 10x fewer paths than static k-shortest path routing.
Option #1: fully utilize all the switch memory, but then a rule update may disrupt traffic.
Option #2: leave 50% slack [Reitblatt et al.; SIGCOMM'12], but that wastes half the switch memory.
SWAN's approach: keep a small amount of slack in switch memory and update the active rules in stages, installing new rules into the slack and then deleting rules that are no longer needed.
# stages bound: f(memory slack). When slack = 10%, 2 stages suffice for 95% of the time.
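The staged-update idea can be sketched as follows; this is an illustration of transitioning a rule set within a memory budget, not SWAN's exact algorithm, and the function name and shapes are assumptions.

```python
def staged_update(old_rules, new_rules, memory):
    """Plan a staged transition of the active rule set under a hard memory
    budget: each stage installs new rules only into currently free entries;
    after traffic shifts onto them, stale rules are deleted, freeing room
    for the next stage. Some initial slack is required to make progress.
    Returns the list of intermediate rule sets."""
    current, target = set(old_rules), set(new_rules)
    assert len(current) <= memory and len(target) <= memory
    stages = []
    while current != target:
        wanted = sorted(target - current)
        if not wanted:                          # only deletions remain
            current &= target
            stages.append(sorted(current))
            break
        free = memory - len(current)
        if free == 0:
            raise ValueError("no memory slack: update would disrupt traffic")
        batch = set(wanted[:free])
        current |= batch                        # install into the slack
        stale = sorted(current - target)
        current -= set(stale[:len(batch)])      # traffic moved; evict stale rules
        stages.append(sorted(current))
    return stages
```

With slack S and D rules to replace, this needs roughly D/S stages, which is the "# stages = f(memory slack)" bound in miniature: replacing 9 of 10 entries with one free slot takes 9 stages, while 50% slack finishes in one.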