Traffic Engineering with Forward Fault Correction - Harry Liu



slide-1
SLIDE 1

Traffic Engineering with Forward Fault Correction

Harry Liu, Microsoft Research, 06/02/2016

Joint work with Ratul Mahajan, Srikanth Kandula, Ming Zhang and David Gelernter

slide-2
SLIDE 2

Cloud services require large network capacity

  • Cloud applications: growing traffic
  • Cloud networks: expensive (e.g. cost of WAN: $100M/year)

slide-3
SLIDE 3

TE is critical to effectively utilizing networks

Traffic Engineering (centralized & SDN-based)

WAN Network

  • Microsoft SWAN (SIGCOMM’13)
  • Google B4 (SIGCOMM’13)
  • ……

Datacenter Network

  • DevoFlow (SIGCOMM’11)
  • MicroTE (CoNEXT’11)
  • ……
slide-4
SLIDE 4

Centralized TE is the key to network efficiency

[Figure: four-node example (nodes 1, 2, 3, 4), link capacity 10, two flows with demand 10 each, requirement: path length ≤ 2 hops]

  • Local view & control: sub-optimal resource allocation, total throughput 15.
  • Global view & control (TE controller): optimal resource allocation, total throughput 20.

The TE controller decides 1) how much traffic to admit and 2) how to route it.

slide-5
SLIDE 5

But, centralized TE is also vulnerable to faults

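[Figure: control loop between the TE controller and the network]

  • Network → TE controller: network view (e.g. topo, cap, traffic)
  • TE controller → Network: network configuration (e.g. routes, rate limits)
  • Frequent updates for high utilization (e.g. every 5 min)
  • Fault types: control-plane faults, data-plane faults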

slide-6
SLIDE 6

Data plane faults

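Link and switch failures

[Figure: four-switch topology (s1, s2, s3, s4), before and after a link failure]

  • Link capacity: 10
  • Flows: s2→s4 (10), s3→s4 (10)
  • Before the failure: traffic split 7 3 3 7
  • After the link failure: 10 7 3, causing congestion

Rescaling: sending traffic proportionally to residual paths.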
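
Rescaling can be sketched in a few lines of Python. This is a minimal illustration, not the switch implementation: the function name `rescale` and the path labels `direct` and `detour` are hypothetical, and the 7/3 split is read off the figure's numbers.

```python
def rescale(alloc, failed):
    """Spread a flow's total traffic over its surviving paths,
    proportionally to the traffic each surviving path carried before."""
    total = sum(alloc.values())
    residual = {p: b for p, b in alloc.items() if p not in failed}
    residual_sum = sum(residual.values())
    if residual_sum == 0:
        return {}  # no surviving path: the flow is disconnected
    return {p: total * b / residual_sum for p, b in residual.items()}


# One flow of size 10, split 7/3 over two paths (labels are hypothetical).
before = {"direct": 7.0, "detour": 3.0}
after = rescale(before, failed={"direct"})
print(after)  # all 10 units now ride the detour path
```

The flow keeps sending at its full rate after the failure, so every link on the surviving path sees the extra traffic. That is why rescaling alone can push a shared link above its capacity.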

slide-7
SLIDE 7

Control plane faults

Failures or long delays to configure a network device

[Figure: the TE controller pushes TE configurations to a switch]

Causes:

  • Memory shortage
  • RPC failure
  • Firmware bugs
  • Overloaded CPU

Control plane faults can also result in congestion.

slide-8
SLIDE 8

TE controllability is undermined by faults

[Figure: control loop between the TE controller and the network]

  • Network view: inaccuracy, incompleteness
  • Network configuration: control-plane faults, data-plane faults

slide-9
SLIDE 9

Control and data plane faults in practice

In a production WAN network (200+ routers, 6000+ links):

  • Control plane: fault rate = 0.1% to 1% per TE update.
  • Data plane: fault rate = 25% per 5 minutes.
  • Faults are common.
  • Faults cause severe congestion.
slide-10
SLIDE 10

State of the art for handling faults

  • Heavy over-provisioning
  • Reactive handling of faults:
      • Control plane faults: retry
      • Data plane faults: re-compute TE and update networks

Drawbacks:

  • Big loss in throughput
  • Cannot prevent congestion
  • Slow (seconds to minutes)
  • Blocked by control plane faults

slide-11
SLIDE 11

How about handling faults proactively?

[Figure: network and TE algorithm. The TE algorithm is not robust enough → making it robust.]

slide-12
SLIDE 12

Forward fault correction (FFC) in TE

  • [Bad News] Individual faults are unpredictable.
  • [Good News] The number of simultaneous faults is small.

Analogy with forward error correction (FEC):

  • Packet loss: FEC guarantees no information loss under up to k arbitrary packet drops, with careful data encoding.
  • Network faults: FFC guarantees no congestion under up to k arbitrary faults, with careful traffic distribution.

slide-13
SLIDE 13

Example: FFC for link failures

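[Figure: four-switch topology (s1, s2, s3, s4) with link capacity 10, shown without failure and under three single-link-failure cases]

  • Flows: s2→s4 (9), s3→s4 (9)
  • K=1 (FFC) traffic split: 8 1 1 8
  • Failure cases: the link loads shown after each single link failure (8 9 1; 9 9; 9 1 8) all stay within the link capacity.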
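
The k=1 guarantee in this example can be checked by brute force: fail each link in turn, rescale, and look at the resulting link loads. The sketch below is only an illustration under stated assumptions. The topology (each flow has a direct link to s4 and a two-hop detour through s1) is inferred from the figure's numbers and is not spelled out on the slide, and all function names are hypothetical.

```python
CAPACITY = 10.0

# Assumed topology: each flow has a direct path and a detour via s1.
# A path is a tuple of links; each link is a (switch, switch) pair.
def make_flows(direct, detour):
    return {
        "s2->s4": {(("s2", "s4"),): direct, (("s2", "s1"), ("s1", "s4")): detour},
        "s3->s4": {(("s3", "s4"),): direct, (("s3", "s1"), ("s1", "s4")): detour},
    }

def loads_after_failure(flows, failed_link):
    """Link loads after one link fails and every flow rescales proportionally."""
    loads = {}
    for paths in flows.values():
        total = sum(paths.values())
        alive = {p: b for p, b in paths.items() if failed_link not in p}
        alive_sum = sum(alive.values())
        if alive_sum == 0:
            continue  # flow has no surviving path
        for path, b in alive.items():
            rate = total * b / alive_sum
            for link in path:
                loads[link] = loads.get(link, 0.0) + rate
    return loads

def survives_single_link_failures(flows, capacity=CAPACITY):
    """True if no link exceeds capacity under any single link failure."""
    links = {link for paths in flows.values() for path in paths for link in path}
    return all(
        max(loads_after_failure(flows, f).values(), default=0.0) <= capacity
        for f in links
    )

print(survives_single_link_failures(make_flows(8.0, 1.0)))  # FFC split, flow size 9
print(survives_single_link_failures(make_flows(7.0, 3.0)))  # earlier 7/3 split, flow size 10
```

Under this assumed topology the 8/1 split leaves just enough headroom on the shared s1-s4 link to absorb one rescaled flow, while the 7/3 split does not.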

slide-14
SLIDE 14

Trade-off: network efficiency vs. robustness

[Figure: three traffic distributions on the four-switch topology (s1, s2, s3, s4), shown with and without a link failure]

  • Non-FFC (throughput: 30): traffic split 10 5 5 10; after a link failure: 15 5 10
  • FFC (k=1) (throughput: 18): traffic split 8 1 1 8
  • Non-FFC (throughput: 18): traffic split 5 4 4 5; after a link failure: 9 5 4

There exists a trade-off between throughput and robustness.

FFC does not always sacrifice efficiency for robustness.

Achieving the optimal throughput with FFC guarantee.

slide-15
SLIDE 15

Systematically realizing FFC in TE

  • Formulation: How to merge FFC into existing TE framework?
  • Computation: How to find FFC-TE efficiently?

slide-16
SLIDE 16

Basic TE linear programming formulations

TE decisions:

  • Sizes of flows: $b_f$
  • Traffic on paths: $l_{f,t}$

TE objective (maximizing throughput):

  • max. $\sum_{f} b_f$

Basic TE constraints:

  • Deliver all granted flows: $\forall f: \sum_{t} l_{f,t} \ge b_f$
  • No overloaded link: $\forall e: \sum_{f} \sum_{t \ni e} l_{f,t} \le c_e$

FFC constraints: no overloaded link under up to $k_c$ control plane faults, $k_e$ link failures, and $k_v$ switch failures.
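
To make the two basic TE constraints concrete, the sketch below checks a given allocation against them in plain Python. It is a feasibility checker, not an LP solver, and it does not cover the FFC constraints. The names `granted` (flow sizes), `path_traffic` (traffic on paths), `path_links` and `capacity` are hypothetical, and the example topology is an assumption that reuses the four-switch example with capacity 10.

```python
def check_basic_te(granted, path_traffic, path_links, capacity):
    """Return the violated basic TE constraints (empty list if feasible)."""
    violations = []
    # Deliver all granted flows: traffic summed over a flow's paths
    # must cover the granted flow size.
    for flow, size in granted.items():
        if sum(path_traffic[flow].values()) < size:
            violations.append(("undelivered", flow))
    # No overloaded link: traffic summed over all paths crossing a link
    # must not exceed the link capacity.
    load = {}
    for per_path in path_traffic.values():
        for path, rate in per_path.items():
            for link in path_links[path]:
                load[link] = load.get(link, 0.0) + rate
    for link, used in load.items():
        if used > capacity[link]:
            violations.append(("overloaded", link))
    return violations


# Assumed example: two flows, each with a direct path and a detour via s1.
path_links = {
    "s2-s4": ["s2-s4"], "s2-s1-s4": ["s2-s1", "s1-s4"],
    "s3-s4": ["s3-s4"], "s3-s1-s4": ["s3-s1", "s1-s4"],
}
capacity = {link: 10.0 for links in path_links.values() for link in links}
granted = {"A": 15.0, "B": 15.0}
path_traffic = {"A": {"s2-s4": 10.0, "s2-s1-s4": 5.0},
                "B": {"s3-s4": 10.0, "s3-s1-s4": 5.0}}

print(check_basic_te(granted, path_traffic, path_links, capacity))
print("throughput:", sum(granted.values()))
```

An LP solver would search over the flow sizes and path traffic to maximize the throughput. The checker only confirms that one candidate solution satisfies the constraints.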

slide-17
SLIDE 17

Formulating data-plane FFC

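[Figure: a flow from S to D over three paths (path-1, path-2, path-3). Paths are link-disjoint.]

Flow size: $s_f$. Bandwidth allocation: $a_1$, $a_2$, $a_3$ on path-1, path-2, path-3.

FFC k=1:

  • Fault on path-1: $s_f \le a_2 + a_3$
  • Fault on path-2: $s_f \le a_1 + a_3$
  • Fault on path-3: $s_f \le a_1 + a_2$

Lemma: FFC is achieved when path-i's weight is $a_i / (a_1 + a_2 + a_3)$.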
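
One way to see why the lemma's weights work is sketched below, writing the flow size as $s_f$ and the allocations on path-1, path-2, path-3 as $a_1$, $a_2$, $a_3$ (all positive). The rate symbol $r_i$ and the fault set $F$ are introduced here for illustration; this reasoning is a reconstruction, not taken from the slide.

```latex
% Weights before any fault: w_i = a_i / (a_1 + a_2 + a_3).
% After path-1 fails, proportional rescaling keeps the ratio a_2 : a_3,
% so each surviving path i carries
\[
  r_i \;=\; s_f \cdot \frac{a_i}{a_2 + a_3}, \qquad i \in \{2, 3\}.
\]
% The rescaled rate stays within the allocation of every surviving path iff
\[
  r_i \le a_i \;\Longleftrightarrow\; s_f \le a_2 + a_3 .
\]
% The same argument for n link-disjoint paths and any fault set F with |F| <= k gives
\[
  s_f \;\le\; \sum_{i \notin F} a_i .
\]
```

Because each path's traffic never exceeds its own allocation, a plan that keeps the allocations within link capacities stays congestion-free after the fault.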

slide-18
SLIDE 18

An efficient and precise solution to FFC

Given n paths and $A = \{a_1, a_2, \ldots, a_n\}$, FFC requires that the sum of arbitrary n-k elements in $A$ is ≥ flow size. This is a k-sum linear constraint group (k-sum group) with $O\left(\binom{n}{k}\right)$ constraints.

FFC-TE LP-formulation:

  • TE objective
  • Basic TE constraints
  • FFC constraints: k-sum group-1, …, k-sum group-N (too many)

Lossless compression of a k-sum group, from $O\left(\binom{n}{k}\right)$ constraints to:

  • O(kn): bubble sorting network (SIGCOMM 2014)
  • O(n): strong duality (MSR TR 2016)

http://www.hongqiangliu.com/publications.html
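
The reason a k-sum group can be compressed at all is that only the worst case matters: if the n-k smallest allocations cover the flow size, every other choice of n-k paths does too. The Python sketch below checks this equivalence numerically for fixed allocations. It is not the sorting-network or duality encoding from the cited papers, which must express the same condition as linear constraints over LP variables. The function names are hypothetical.

```python
from itertools import combinations

def ffc_ok_bruteforce(alloc, flow_size, k):
    """Check every subset of n-k surviving paths: one test per subset."""
    k = min(k, len(alloc))
    return all(
        sum(survivors) >= flow_size
        for survivors in combinations(alloc, len(alloc) - k)
    )

def ffc_ok_sorted(alloc, flow_size, k):
    """Check only the worst case: the k largest allocations fail."""
    k = min(k, len(alloc))
    smallest = sorted(alloc)[: len(alloc) - k]
    return sum(smallest) >= flow_size

alloc = [5.0, 4.0, 3.0]  # hypothetical allocations on three paths
for size in (7.0, 8.0):
    print(size, ffc_ok_bruteforce(alloc, size, k=1), ffc_ok_sorted(alloc, size, k=1))
```

Both functions agree on every input; the second needs one sort instead of enumerating all subsets.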

slide-19
SLIDE 19

FFC extensions

  • Differential protection for different traffic priorities
  • Minimizing congestion risks without rate limiters
  • Control plane faults on rate limiters
  • Uncertainty in current TE
  • Different TE objectives (e.g. max-min fairness)


slide-20
SLIDE 20

Implementation & evaluation highlights

  • Testbed experiment (8 switches & 30 servers)
      • FFC can be implemented in commodity switches
      • FFC has no data loss due to congestion under faults
  • Large-scale simulation
      • A WAN network with O(100) switches and O(1000) links
      • One-week traffic trace
      • Fault injection according to real failure trace
  • Results: with negligible throughput loss, FFC can reduce
      • data loss by a factor of 7-130 in well-provisioned networks
      • data loss of high priority traffic to almost zero in well-utilized networks

slide-21
SLIDE 21

Conclusion and future work

[Figure: control loop between the SDN controller and the network, exchanging network view and network configuration]

Network Properties:

  • High throughput
  • No congestion
  • Security
  • Availability
  • Connectivity
  • ……

Network Faults:

  • Data-plane
  • Control-plane
  • Misconfigurations
  • Attacks
  • Traffic spikes
  • ……

FFC

slide-22
SLIDE 22

Q&A
