Forward Fault Correction (FFC) Hongqiang Harry Liu , Srikanth - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Traffic Engineering with Forward Fault Correction (FFC)

Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University)


slide-2
SLIDE 2

Cloud services require large network capacity

  • Cloud services: growing traffic
  • Cloud networks: expensive (e.g. cost of WAN: $100M/year)

slide-3
SLIDE 3

TE is critical to effectively utilizing networks

Traffic Engineering

WAN Network

  • Microsoft SWAN
  • Google B4
  • ……

Datacenter Network

  • DevoFlow
  • MicroTE
  • ……
slide-4
SLIDE 4

But, TE is also vulnerable to faults

[Figure: TE controller and network, linked by the network view and the network configuration]

  • Frequent updates for high utilization
  • Control-plane faults
  • Data-plane faults

slide-5
SLIDE 5

Control plane faults

Failures or long delays in configuring a network device

[Figure: the TE controller sends TE configurations to a switch]

Causes:

  • RPC failure
  • Firmware bugs
  • Memory shortage
  • Overloaded CPU

slide-6
SLIDE 6

Congestion due to control plane faults

[Figure: four-switch topology (s1, s2, s3, s4); link capacity: 10]

Existing flows: s2→s4 (10), s3→s4 (10)

New flows (traffic demands): s1→s2 (10), s1→s3 (10), s1→s4 (10)

A configuration failure during the update leads to congestion.

slide-7
SLIDE 7

Data plane faults

Link and switch failures

[Figure: four-switch topology (s1, s2, s3, s4); link capacity: 10; flows: s2→s4 (10), s3→s4 (10)]

Rescaling: sending traffic proportionally to residual paths

After a link failure, rescaling leads to congestion.
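Rescaling as described on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the function name `rescale`, the path names, and the 7/3 split are all hypothetical inputs chosen for the example.

```python
def rescale(split, failed):
    """Redistribute a flow's traffic over its surviving paths.

    split  -- dict mapping path name -> traffic currently sent on that path
    failed -- set of path names that are no longer usable
    Returns a new dict in which the total traffic is unchanged and each
    surviving path keeps its share relative to the other survivors.
    """
    total = sum(split.values())
    residual = {p: v for p, v in split.items() if p not in failed}
    residual_total = sum(residual.values())
    if residual_total == 0:
        raise ValueError("no residual path carries traffic; cannot rescale")
    return {p: total * v / residual_total for p, v in residual.items()}


# Hypothetical flow of 10 units split 7/3 over two paths (assumed names).
before = {"direct": 7.0, "detour": 3.0}
after = rescale(before, failed={"direct"})
print(after)  # {'detour': 10.0}
```

The surviving path inherits the full 10 units, so any link it shares with other traffic can be pushed past its capacity.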

slide-8
SLIDE 8

Control and data plane faults in practice

In production networks:

  • Control plane: fault rate = 0.1% to 1% per TE update.
  • Data plane: fault rate = 25% per 5 minutes.
  • Faults are common.
  • Faults cause severe congestion.
slide-9
SLIDE 9

State of the art for handling faults

  • Heavy over-provisioning: big loss in throughput
  • Reactive handling of faults: cannot prevent congestion; slow (seconds to minutes); blocked by control plane faults
      • Control plane faults: retry
      • Data plane faults: re-compute TE and update networks

slide-10
SLIDE 10


How about handling congestion proactively?

slide-11
SLIDE 11

Forward fault correction (FFC) in TE

  • [Bad News] Individual faults are unpredictable.
  • [Good News] The number of simultaneous faults is small.

Analogy between packet loss and network faults:

  • FEC guarantees no information loss under up to k arbitrary packet drops, with careful data encoding.
  • FFC guarantees no congestion under up to k arbitrary faults, with careful traffic distribution.

slide-12
SLIDE 12

Example: FFC for control plane faults

[Figure: non-FFC update on the four-switch topology (s1, s2, s3, s4); link capacity: 10; a configuration failure causes congestion]

slide-13
SLIDE 13

Example: FFC for control plane faults

[Figure: update with control plane FFC (k=1) on the four-switch topology (s1, s2, s3, s4); link capacity: 10; a configuration failure causes no congestion]

slide-14
SLIDE 14

Trade-off: network utilization vs. robustness

[Figure: three configurations of the four-switch topology (s1, s2, s3, s4)]

  • Non-FFC: throughput 50
  • Control plane FFC, k=1: throughput 47
  • Control plane FFC, k=2: throughput 44

slide-15
SLIDE 15

Systematically realizing FFC in TE

  • Formulation: How to merge FFC into existing TE framework?
  • Computation: How to find FFC-TE efficiently?

slide-16
SLIDE 16

Basic TE linear programming formulations

LP formulations

TE decisions: $b_f$, the sizes of flows, and $l_{f,t}$, the traffic on paths.

TE objective (maximizing throughput):

$$\max \sum_{\forall f} b_f$$

Basic TE constraints:

  • Deliver all granted flows: $\forall f: \sum_{\forall t} l_{f,t} \ge b_f$
  • No overloaded link: $\forall e: \sum_{\forall f} \sum_{\forall t \ni e} l_{f,t} \le c_e$

FFC constraints:

  • No overloaded link under up to $k_c$ control plane faults, $k_e$ link failures, and $k_v$ switch failures
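The two basic TE constraints (deliver all granted flows, no overloaded link) can be checked for a candidate solution with a short script. This is only a sketch under assumed data structures: the function names, the dictionary layout, and the sample link names are hypothetical, and it verifies a given solution rather than solving the LP.

```python
def check_basic_te(flow_size, traffic, capacity, tol=1e-9):
    """Check the two basic TE constraints for a candidate solution.

    flow_size -- dict: flow -> granted size
    traffic   -- dict: (flow, path) -> traffic placed on that path,
                 where a path is a tuple of link names
    capacity  -- dict: link -> capacity
    Returns (delivers_all, no_overload).
    """
    carried = {f: 0.0 for f in flow_size}   # traffic carried per flow
    load = {e: 0.0 for e in capacity}       # traffic placed per link
    for (f, path), amount in traffic.items():
        carried[f] += amount
        for e in path:
            load[e] += amount
    # Constraint 1: each flow's paths together carry at least its size.
    delivers_all = all(carried[f] >= flow_size[f] - tol for f in flow_size)
    # Constraint 2: the load on every link stays within its capacity.
    no_overload = all(load[e] <= capacity[e] + tol for e in capacity)
    return delivers_all, no_overload


def throughput(flow_size):
    """TE objective: the sum of granted flow sizes."""
    return sum(flow_size.values())


# Hypothetical instance: one flow of size 10 split 7/3 over two paths.
capacity = {"s2-s4": 10, "s2-s1": 10, "s1-s4": 10}
flow_size = {"s2->s4": 10}
traffic = {
    ("s2->s4", ("s2-s4",)): 7,
    ("s2->s4", ("s2-s1", "s1-s4")): 3,
}
print(check_basic_te(flow_size, traffic, capacity))  # (True, True)
print(throughput(flow_size))  # 10
```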

slide-17
SLIDE 17

Formulating control plane FFC

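[Figure: link s1-s2 carrying flows $f_1$, $f_2$, $f_3$]

Total load under faults? Write $l_i^{\text{old}}$ for $f_i$'s load in the old TE and $l_i^{\text{new}}$ for $f_i$'s load in the new TE. A single fault gives $\binom{3}{1}$ cases:

  • Fault on $f_1$: $l_1^{\text{old}} + l_2^{\text{new}} + l_3^{\text{new}} \le \text{link cap}$
  • Fault on $f_2$: $l_1^{\text{new}} + l_2^{\text{old}} + l_3^{\text{new}} \le \text{link cap}$
  • Fault on $f_3$: $l_1^{\text{new}} + l_2^{\text{new}} + l_3^{\text{old}} \le \text{link cap}$

With n flows and FFC protection k: #constraints = $\binom{n}{1} + \dots + \binom{n}{k}$ for each link.

Challenge: too many constraints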
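To get a feel for how fast the per-link constraint count grows, the enumeration can be written out directly. The sketch below is illustrative only: the function names and the sample loads are hypothetical, and a faulty flow is modeled as staying at its old-TE load while every other flow moves to its new-TE load.

```python
from itertools import combinations
from math import comb


def num_fault_cases(n, k):
    """Number of fault combinations with 1..k faulty flows out of n."""
    return sum(comb(n, j) for j in range(1, k + 1))


def worst_case_load(old, new, k):
    """Largest total load on one link when up to k flows keep their old load.

    old[i] / new[i] -- load of flow i on the link in the old / new TE.
    """
    n = len(new)
    worst = sum(new)  # the fault-free case
    for j in range(1, k + 1):
        for faulty in combinations(range(n), j):
            load = sum(old[i] if i in faulty else new[i] for i in range(n))
            worst = max(worst, load)
    return worst


# Hypothetical loads of three flows on one link.
old = [7, 0, 3]
new = [3, 3, 3]
print(num_fault_cases(3, 1))         # 3
print(worst_case_load(old, new, 1))  # 13
print(num_fault_cases(100, 3))       # 166750
```

Even at 100 flows and k=3, brute-force enumeration needs more than 160,000 conditions for a single link.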

slide-18
SLIDE 18

An efficient and precise solution to FFC

Our approach: a lossless compression from $O\left(\binom{n}{k}\right)$ constraints to $O(kn)$ constraints.

  • "Total load under faults ≤ link capacity" is rewritten as "total additional load due to faults ≤ link spare capacity".
  • Let $x_i$ be the additional load due to fault $i$. Given $X = \{x_1, x_2, \dots, x_n\}$, FFC requires that the sum of arbitrary k elements in $X$ is ≤ link spare capacity: $O\left(\binom{n}{k}\right)$ constraints.
  • Define $y_m$ as the m-th largest element in $X$. Then $\sum_{m=1}^{k} y_m \le \text{link spare capacity}$: $O(1)$ constraints.

Expressing $y_m$ with $X$?
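The compression rests on a simple observation: every k-element subset sums to at most the spare capacity exactly when the k largest elements do. A small check of that equivalence, with hypothetical function names and sample loads, might look like this:

```python
from heapq import nlargest
from itertools import combinations


def safe_by_enumeration(x, k, spare):
    """Check every k-element subset of x: C(n, k) conditions."""
    return all(sum(subset) <= spare for subset in combinations(x, k))


def safe_by_top_k(x, k, spare):
    """Check only the sum of the k largest elements: one condition."""
    return sum(nlargest(k, x)) <= spare


# Hypothetical additional loads due to each of five faults.
x = [4, 1, 3, 0, 2]
for spare in (6, 7, 8):
    # Both checks agree for every spare capacity tried.
    print(spare, safe_by_enumeration(x, 2, spare), safe_by_top_k(x, 2, spare))
```

Here `nlargest` sorts concrete numbers. In an LP the loads are variables, so the k largest values have to be expressed through constraints instead.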

slide-19
SLIDE 19

Sorting network

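[Figure: sorting network over inputs $x_1, x_2, x_3, x_4$ with intermediate values $z_1, \dots, z_8$; the 1st round outputs $y_1$ (1st largest) and the 2nd round outputs $y_2$ (2nd largest)]

A comparison takes two inputs, e.g. $x_1$ and $x_2$, and outputs $z_1 = \max\{x_1, x_2\}$ and $z_2 = \min\{x_1, x_2\}$.

Resulting constraint: $y_1 + y_2 \le \text{link spare capacity}$

  • Complexity: O(kn) additional variables and constraints.
  • Throughput: optimal in control-plane and data plane if paths are disjoint.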
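The round structure of such a network can be simulated on plain numbers. The sketch below assumes a bubble-style arrangement in which each round carries a running maximum through the remaining values; the function names and inputs are hypothetical, and it operates on concrete values rather than LP variables.

```python
def compare(a, b):
    """One comparison of the network: returns (max, min)."""
    return (a, b) if a >= b else (b, a)


def k_largest(x, k):
    """Extract the k largest values of x with rounds of comparisons.

    Each round carries the running maximum through the remaining values;
    the minimum of every comparison is passed on to the next round.
    Returns (largest_values, number_of_comparisons).
    """
    remaining = list(x)
    largest, comparisons = [], 0
    for _ in range(min(k, len(remaining))):
        running_max, leftovers = remaining[0], []
        for value in remaining[1:]:
            running_max, smaller = compare(running_max, value)
            leftovers.append(smaller)
            comparisons += 1
        largest.append(running_max)
        remaining = leftovers
    return largest, comparisons


# Hypothetical additional loads for four faults; two rounds give the
# two largest values.
top, used = k_largest([2, 5, 1, 4], 2)
print(top, used)  # [5, 4] 5
```

With n = 4 and k = 2 the two rounds use 3 + 2 = 5 comparisons, which stays within the O(kn) bound stated on the slide.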

slide-20
SLIDE 20

FFC extensions

  • Differential protection for different traffic priorities
  • Minimizing congestion risks without rate limiters
  • Control plane faults on rate limiters
  • Uncertainty in current TE
  • Different TE objectives (e.g. max-min fairness)


slide-21
SLIDE 21

Evaluation overview

  • Testbed experiment
      • FFC can be implemented in commodity switches
      • FFC has no data loss due to congestion under faults
  • Large-scale simulation
      • A WAN network with O(100) switches and O(1000) links
      • Injecting faults based on real failure reports
      • Single priority traffic in a well-provisioned network
      • Multiple priority traffic in a well-utilized network

slide-22
SLIDE 22

FFC prevents congestion with negligible throughput loss

[Figure: bar chart of Ratio (%) for two metrics, FFC Throughput / Optimal Throughput and FFC Data-loss / Non-FFC Data-loss, across four traffic classes: single priority, high priority (high FFC protection), medium priority (low FFC protection), and low priority (no FFC protection). Annotated values: 160% and <0.01%.]

slide-23
SLIDE 23

Conclusions

  • Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults.
  • FFC proactively handles these faults.
      • Guarantee: no congestion when #faults ≤ k.
      • Efficiently computable with low throughput overhead in practice.

[Figure: FFC positioned between heavy network over-provisioning and high risk of congestion]