PCF: Provably Resilient Flexible Routing




slide-1
SLIDE 1

PCF: Provably Resilient Flexible Routing

Chuan Jiang, Sanjay Rao, Mohit Tawarmalani Purdue University


ACM SIGCOMM 2020

slide-2
SLIDE 2

Background

  • Network performance requirements are increasingly stringent.
  • Over a 5-year period, traffic has increased 100X, and performance must be met 99.99% of the time (vs. 99% of the time) [1].
  • Failures of network components are routine, and they have a great impact on network performance.

[1] Hong et al., B4 and after: managing hierarchy, partitioning, and asymmetry for availability and scale in Google’s software-defined WAN, SIGCOMM 2018.

slide-3
SLIDE 3

Background


Design the networks so that the desired traffic can be served over a target set of failures.


slide-4
SLIDE 4

Congestion-free routing

  • Traditional traffic engineering: links may be overloaded upon failures [1, 2].
  • Many schemes [3, 4, 5] have been developed to provide congestion-free mechanisms.
  • They guarantee that a given throughput can be sustained under failures.
  • They use tractable models to deal with the large state space of failure scenarios (e.g., f simultaneous link failures).
  • They typically involve lightweight online operations on failures.
  • FFC [3] is the state-of-the-art mechanism and uses tunnel-based forwarding.
  • A set of pre-selected tunnels and the traffic demand are provided to FFC.
  • It computes reservations on tunnels so that throughput can be guaranteed across failures.

[1] Hong et al., Achieving high utilization with software-driven WAN, SIGCOMM 2013.
[2] Jain et al., B4: Experience with a globally-deployed software defined WAN, SIGCOMM 2013.
[3] Liu et al., Traffic engineering with forward fault correction, SIGCOMM 2014.
[4] Sinha et al., Network design for tolerating multiple link failures using Fast Re-route (FRR), DRCN 2014.
[5] Wang et al., R3: resilient routing reconfiguration, SIGCOMM 2010.

slide-5
SLIDE 5

Congestion-free routing vs. optimal routing

  • FFC’s mechanism is not flexible enough, and its throughput can be very conservative.
  • Optimal mechanism
  • Most flexible
  • It recomputes the best routing online each time a failure occurs, which always provides the best throughput.
  • It incurs higher response overhead because of the online operations.
  • It is intractable to provide a performance guarantee under failures.

slide-6
SLIDE 6

Bridge the gap!

Figure: FFC and the optimal scheme compared on three axes: tractable failure analysis (yes/no), throughput (low/high), and response overhead (low/high).

slide-7
SLIDE 7

Bridge the gap!

  • Our goal is to design a new mechanism that sustains high throughput with low response overhead while providing tractable failure analysis.

Figure: FFC and the optimal scheme on the axes of tractable failure analysis, throughput, and response overhead, with the desired area for new mechanisms marked.

slide-8
SLIDE 8

Contributions

  • We show that existing congestion-free schemes perform much worse than optimal.
  • FFC’s performance can be arbitrarily worse than optimal.
  • FFC’s performance can degrade with an increase in the number of tunnels.
  • We propose a set of novel mechanisms called PCF (Provably Congestion-free and resilient Flexible routing).
  • PCF ensures the network is provably congestion-free under failures.
  • PCF performs closer to the network’s intrinsic capability.

slide-9
SLIDE 9

Contributions


PCF’s schemes can sustain higher throughput than FFC by a factor of up to 1.5X on average across the topologies, with a benefit of 2.6X in some cases.

slide-10
SLIDE 10

Example - Topology overview

Figure: topology with nodes S, U, and T and links e1, e2, e3, e4, e5 (link capacities in the legend: 1/3 and 1).

Tunnels:
l1 - e1,e4
l2 - e1,e5
l3 - e2,e4
l4 - e2,e5
l5 - e3,e4
l6 - e3,e5

slide-11
SLIDE 11

How well can the network perform?


  • Single link failure
  • Respond to failure optimally
  • 2/3 unit of traffic can always be sent
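
The 2/3 figure can be checked by enumerating single link failures. The sketch below assumes, from the tunnel list, that e1, e2, and e3 are parallel S-U links with capacity 1/3 each and that e4 and e5 are parallel U-T links with capacity 1 each. The function and variable names are illustrative only. Because the two groups of links are in series, the best achievable S-T throughput in a scenario is the smaller of the two surviving group capacities.

```python
from fractions import Fraction

# Assumed layout (inferred from the tunnel list, not stated on the slide):
# e1..e3 are parallel S-U links, e4..e5 are parallel U-T links.
CAPACITY = {
    "e1": Fraction(1, 3), "e2": Fraction(1, 3), "e3": Fraction(1, 3),
    "e4": Fraction(1), "e5": Fraction(1),
}
SEGMENTS = {"S-U": ["e1", "e2", "e3"], "U-T": ["e4", "e5"]}


def max_throughput(failed):
    """Max S-T flow when the links in `failed` are down.

    The topology is a series of two groups of parallel links, so the
    max flow is the smaller of the surviving group capacities.
    """
    return min(
        sum(CAPACITY[e] for e in links if e not in failed)
        for links in SEGMENTS.values()
    )


def optimal_guarantee():
    """Worst case over all single link failures, responding optimally to each."""
    return min(max_throughput({e}) for e in CAPACITY)


if __name__ == "__main__":
    for link in CAPACITY:
        print(link, "fails ->", max_throughput({link}))
    print("guaranteed throughput:", optimal_guarantee())
```

Under these assumptions the worst case is the loss of one S-U link, which leaves two links of capacity 1/3.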
slide-12
SLIDE 12

How well can FFC perform?

Reservation on tunnels:
l1 - e1,e4: 1/6
l2 - e1,e5: 1/6
l3 - e2,e4: 1/6
l4 - e2,e5: 1/6
l5 - e3,e4: 1/6
l6 - e3,e5: 1/6

slide-13
SLIDE 13

How well can FFC perform?

Remaining tunnels can only carry 1/2!

slide-14
SLIDE 14

How well can FFC perform?

FFC’s performance guarantee: 1/2
Optimal scheme: 2/3
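
The 1/2 guarantee follows directly from the listed reservations. A minimal sketch, assuming link capacities of 1/3 for e1 to e3 and 1 for e4 and e5 (inferred from the example, not stated explicitly): it only evaluates the reservation of 1/6 per tunnel and does not solve FFC's optimization model. All names are illustrative.

```python
from fractions import Fraction

# Tunnels and reservations as listed in the example.
TUNNELS = {
    "l1": ["e1", "e4"], "l2": ["e1", "e5"], "l3": ["e2", "e4"],
    "l4": ["e2", "e5"], "l5": ["e3", "e4"], "l6": ["e3", "e5"],
}
RESERVATION = {name: Fraction(1, 6) for name in TUNNELS}

# Assumed capacities (inferred from the example).
CAPACITY = {
    "e1": Fraction(1, 3), "e2": Fraction(1, 3), "e3": Fraction(1, 3),
    "e4": Fraction(1), "e5": Fraction(1),
}


def link_load(link):
    """Total reservation of all tunnels that traverse `link`."""
    return sum(RESERVATION[t] for t, links in TUNNELS.items() if link in links)


def respects_capacity():
    """True if no link carries more reservation than its capacity."""
    return all(link_load(e) <= CAPACITY[e] for e in CAPACITY)


def surviving_reservation(failed_link):
    """Reservation left on tunnels that do not use the failed link."""
    return sum(
        RESERVATION[t] for t, links in TUNNELS.items() if failed_link not in links
    )


def guarantee():
    """Worst-case surviving reservation over all single link failures."""
    return min(surviving_reservation(e) for e in CAPACITY)


if __name__ == "__main__":
    print("capacity respected:", respects_capacity())
    for link in CAPACITY:
        print(link, "fails ->", surviving_reservation(link))
    print("guarantee:", guarantee())
```

Losing e4 or e5 takes down three tunnels at once, which is what pulls the guarantee down to 1/2.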

slide-15
SLIDE 15

Underlying reason

  • FFC’s reservations are made at the granularity of an entire tunnel.
  • e4 fails -> l1, l3, l5 fail -> the capacity reserved for them on e1, e2, e3 is lost!
  • PCF can solve this issue. For this example, it achieves the optimal throughput.

slide-16
SLIDE 16

PCF’s solution

  • FFC doesn’t provide enough flexibility in network response.
  • The optimal mechanism has the most flexibility, but doesn’t provide tractable failure analysis.
  • PCF carefully introduces flexibility in network response to simultaneously meet three objectives:
  • High throughput, tractable failure analysis, low response overhead
  • It introduces an abstraction called the logical sequence.

slide-17
SLIDE 17

PCF’s solution - Logical sequence

Tunnels:
l1 - e1
l2 - e2
l3 - e3
l4 - e4
l5 - e5

  • Logical sequence: S-U-T
  • Traffic is independently routed in the two segments (S-U and U-T) of the logical sequence.
  • On each segment, we want to make reservations to ensure that it works upon failures.

slide-18
SLIDE 18

PCF’s solution - Logical sequence

On segment S-U, 2/3 unit of traffic can be sent under single link failure.

slide-19
SLIDE 19

PCF’s solution - Logical sequence

On segment U-T, 1 unit of traffic can be sent under single link failure.

slide-20
SLIDE 20

PCF’s solution - Logical sequence

We can reserve 2/3 unit on the logical sequence S-U-T. This reservation is always available under single link failure.

Performance guarantee: 2/3 (optimal)
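
The reservation on S-U-T can be reproduced segment by segment. This sketch assumes that each segment is served by one single-link tunnel per physical link, each carrying that link's full capacity, with capacities inferred as 1/3 for e1 to e3 and 1 for e4 and e5. The names are illustrative, and the code is not PCF's optimization model.

```python
from fractions import Fraction

# Segments of the logical sequence S-U-T and the links serving them.
# Capacities are assumptions inferred from the example.
SEGMENT_LINKS = {
    "S-U": {"e1": Fraction(1, 3), "e2": Fraction(1, 3), "e3": Fraction(1, 3)},
    "U-T": {"e4": Fraction(1), "e5": Fraction(1)},
}


def segment_guarantee(links, max_failures=1):
    """Capacity left on one segment after its `max_failures` largest links fail."""
    surviving = sorted(links.values())
    if max_failures:
        surviving = surviving[:-max_failures]
    return sum(surviving, Fraction(0))


def sequence_reservation(segments, max_failures=1):
    """Largest reservation that every segment can still carry in its worst case."""
    return min(
        segment_guarantee(links, max_failures) for links in segments.values()
    )


if __name__ == "__main__":
    for name, links in SEGMENT_LINKS.items():
        print(name, "->", segment_guarantee(links))
    print("reservation on S-U-T:", sequence_reservation(SEGMENT_LINKS))
```

Because each segment is protected on its own, a failure on U-T no longer strands the capacity reserved on S-U, which is the effect that limited the tunnel-level reservation.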

slide-21
SLIDE 21

PCF’s solution - Logical sequence

  • Logical sequence: a sequence of nodes from s to t
  • Logical hops: s, v1, v2, v3, …, vm, t
  • Logical segments: s-v1, v1-v2, v2-v3, …, vm-t
  • Traffic needs to traverse the logical hops.
  • Logical hops don’t require a direct link between them.

Figure: a logical sequence s, v1, v2, …, vm, t; each pair of consecutive logical hops forms a logical segment.

slide-22
SLIDE 22

PCF’s solution - Logical sequence

  • Reserve on s-v1, v1-v2, v2-v3, …, vm-t independently.
  • The reservation can be made on underlying physical tunnels or other logical sequences.
  • We also consider conditional logical sequences, which are only active under certain conditions (e.g., a set of links fail).

Figure: each segment of a logical sequence is backed by physical tunnels or by other logical sequences.

slide-23
SLIDE 23

Logical sequence - model

  • Goal: Determine the reservation on each physical tunnel and logical sequence
  • Objective: Maximize allocated throughput
  • Constraints:
  • Link capacity constraints
  • For any node pair s-t, and under any failure scenario, ensure sufficient reservation on physical tunnels and logical sequences from s to t to sustain the throughput from s to t and the reservations of other logical sequences that use s-t as a segment.
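
One way to write this model as a linear program is sketched below. The notation is an assumption introduced here, not taken from the slides: $a_l$ is the reservation on physical tunnel $l$, $b_q$ the reservation on logical sequence $q$, $c_e$ the capacity of link $e$, $d_{st}$ the demand from $s$ to $t$, $z$ a common factor by which all demands are scaled (one possible measure of allocated throughput), $X$ the set of failure scenarios, $y_l^x \in \{0,1\}$ whether tunnel $l$ survives scenario $x$, $T(s,t)$ and $L(s,t)$ the tunnels and logical sequences from $s$ to $t$, and $Q(s,t)$ the logical sequences that use $s$-$t$ as a segment.

```latex
\begin{align*}
\max_{z,\; a \ge 0,\; b \ge 0} \quad & z \\
\text{s.t.} \quad
& \sum_{l \,:\, e \in l} a_l \;\le\; c_e
  && \forall e \\
& \sum_{l \in T(s,t)} y_l^x\, a_l \;+\; \sum_{q \in L(s,t)} b_q
  \;\ge\; z\, d_{st} \;+\; \sum_{q' \in Q(s,t)} b_{q'}
  && \forall s, t,\ \forall x \in X
\end{align*}
```

As written, the second constraint has one copy per failure scenario. A tractable model needs a compact way to cover all scenarios in $X$ (for example, all sets of up to $f$ link failures), and conditional logical sequences would add scenario-dependent terms. Neither is shown in this sketch.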
slide-24
SLIDE 24

FFC - can deteriorate with more tunnels

Figure: topology with nodes s, 1, 2, 3, 4, t (link capacities in the legend: 1 and 1/2) and tunnels l1, l2, l3, l4.

Provided tunnels | Maximum number of tunnels sharing a common link | Estimated number of tunnel failures under single link failure
l1, l2, l3 | 1 | 1

  • FFC estimates the maximum number of tunnel failures, then considers all combinations of that many tunnel failures.

slide-25
SLIDE 25

FFC - can deteriorate with more tunnels

Provided tunnels | Maximum number of tunnels sharing a common link | Estimated number of tunnel failures under single link failure
l1, l2, l3, l4 | 2 | 2

  • With 4 tunnels, FFC plans for all combinations of 2 tunnel failures, including l1 and l2 failing at the same time.
  • If l1 and l2 fail at the same time, which never occurs under single link failure, the performance is very low.
  • Providing more tunnels to FFC may hurt the performance!
slide-26
SLIDE 26


FFC - can deteriorate with more tunnels


PCF solves this issue by modeling the fact that when one link fails, l1 and l2 cannot fail at the same time.
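
The gap between the scenarios FFC plans for and the scenarios a single link failure can actually cause can be made concrete with a small sketch. The tunnel-to-link mapping below is hypothetical (it is not the set of paths from the figure): l1, l2, and l3 are link-disjoint, and l4 shares one link, named "x" here, with l3. All function names are illustrative.

```python
from itertools import combinations

# Hypothetical tunnel-to-link mapping (NOT the paths from the slide figure).
TUNNELS = {
    "l1": {"a"},
    "l2": {"b"},
    "l3": {"x", "c"},
    "l4": {"x", "d"},
}


def all_links(tunnels):
    """Every link used by at least one tunnel."""
    return set().union(*tunnels.values())


def tunnels_on(link, tunnels):
    """Names of the tunnels that traverse `link`."""
    return {name for name, links in tunnels.items() if link in links}


def estimated_tunnel_failures(tunnels):
    """Maximum number of tunnels sharing a common link."""
    return max(len(tunnels_on(e, tunnels)) for e in all_links(tunnels))


def planned_scenarios(tunnels):
    """All tunnel-failure combinations of the estimated size."""
    k = estimated_tunnel_failures(tunnels)
    return {frozenset(c) for c in combinations(sorted(tunnels), k)}


def realizable_scenarios(tunnels):
    """Tunnel sets that one link failure can actually take down."""
    return {frozenset(tunnels_on(e, tunnels)) for e in all_links(tunnels)}


if __name__ == "__main__":
    three = {name: TUNNELS[name] for name in ("l1", "l2", "l3")}
    print("estimate with 3 tunnels:", estimated_tunnel_failures(three))
    print("estimate with 4 tunnels:", estimated_tunnel_failures(TUNNELS))
    extra = planned_scenarios(TUNNELS) - realizable_scenarios(TUNNELS)
    print("planned but impossible:", sorted(sorted(s) for s in extra))
```

In this hypothetical mapping, adding l4 raises the estimate from 1 to 2, so the pair {l1, l2} is planned for even though no single link failure removes both.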

slide-27
SLIDE 27

Theoretical results

  • Proposition: PCF’s performance does not degrade with additional tunnels, and PCF performs at least as well as FFC.
  • Proposition: There exist topologies for which (i) FFC’s throughput is arbitrarily worse than optimal even when exponentially many tunnels are used; and (ii) PCF’s throughput achieves the optimal with only polynomially many tunnels.

slide-28
SLIDE 28

PCF - implementation

  • When the logical sequences do not recursively depend on each other (they satisfy a topological order):
  • A local proportional routing mechanism can be used.
  • Traffic is redistributed across the active tunnels and logical sequences.
  • In more general cases:
  • A centralized controller solves a linear system upon each failure.
  • Solving a linear system is much easier than solving an optimization problem.
  • Amenable to distributed implementation in the future.
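
Proportional redistribution can be sketched as follows. This is a generic illustration of splitting traffic over surviving paths in proportion to their reservations, assuming each path (physical tunnel or logical sequence) has a single scalar reservation. The function name and its error handling are assumptions, not PCF's exact mechanism.

```python
from fractions import Fraction


def proportional_split(traffic, reservations, failed=()):
    """Split `traffic` over surviving paths in proportion to their reservations.

    `reservations` maps a path name to its reserved capacity; `failed`
    lists the paths that are currently down.
    """
    alive = {
        path: r for path, r in reservations.items()
        if path not in failed and r > 0
    }
    total = sum(alive.values())
    if traffic > total:
        # A congestion-free design should never reach this branch.
        raise ValueError("surviving reservation cannot carry the traffic")
    return {path: traffic * r / total for path, r in alive.items()}


if __name__ == "__main__":
    # Reservations of 1/6 on six tunnels; l1, l3, l5 are down.
    reservations = {f"l{i}": Fraction(1, 6) for i in range(1, 7)}
    split = proportional_split(
        Fraction(1, 2), reservations, failed=("l1", "l3", "l5")
    )
    for path, amount in split.items():
        print(path, amount)
```

Each node needs only its own reservations and the set of failed paths to compute the split, which is why this kind of response can be made locally.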

slide-29
SLIDE 29

PCF - family of schemes

Figure: the PCF family of schemes (PCF-TF, PCF-LS-TopSort, PCF-LS-General, PCF-CLS-TopSort, PCF-CLS-General) and FFC, connected by arrows; an arrow relating A and B means A is provably better than B.

All PCF schemes are associated with tractable models that guarantee the network is congestion-free under failures.

slide-30
SLIDE 30

PCF - family of schemes

Response mechanism marked on the figure: distribute traffic proportionally (fully distributed).

slide-31
SLIDE 31

PCF - family of schemes

Response mechanisms marked on the figure: distribute traffic proportionally (fully distributed); solve a linear system.

slide-32
SLIDE 32

Evaluation - instantiating logical sequences

  • PCF-LS - We chose topologically sorted sequences by using shortest paths.
  • PCF-CLS - We additionally added sequences that are activated on the failure of a link.

slide-33
SLIDE 33

Evaluation - setup

  • Physical tunnels: as disjoint as possible
  • 21 topologies (the largest topology has 151 links)
  • Traffic matrix: gravity model
  • Metric: demand scale (the factor by which the traffic demand of all pairs can be scaled)
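
A gravity-model traffic matrix and the scaling behind the demand-scale metric can be sketched as follows. The node weights and the total traffic volume are hypothetical inputs; how they were chosen for the evaluation is not stated here. Finding the largest feasible scale factor requires the reservation model and is not shown.

```python
def gravity_matrix(weights, total_traffic):
    """Gravity model: demand(i, j) is proportional to weights[i] * weights[j].

    The demands are normalized so that they sum to `total_traffic`.
    """
    nodes = list(weights)
    norm = sum(weights[i] * weights[j] for i in nodes for j in nodes if i != j)
    return {
        (i, j): total_traffic * weights[i] * weights[j] / norm
        for i in nodes for j in nodes if i != j
    }


def scale_demands(demands, factor):
    """Scale the demand of every pair by the same factor (the demand scale)."""
    return {pair: factor * d for pair, d in demands.items()}


if __name__ == "__main__":
    # Hypothetical node weights, e.g. proportional to attached capacity.
    demands = gravity_matrix({"A": 1, "B": 2, "C": 3}, total_traffic=22)
    for pair, d in sorted(demands.items()):
        print(pair, d)
    print(scale_demands(demands, 0.5)[("B", "C")])
```

A demand scale of 0.5, for instance, means that only half of every pair's demand can be guaranteed across the targeted failures.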

slide-34
SLIDE 34

Benefits of the better failure model

Figure: demand scale of FFC and PCF-TF with 2, 3, and 4 tunnels (Deltacom topology, single link failure).

  • FFC’s performance is worse with 3 and 4 tunnels than with only 2 tunnels.
  • PCF performs better as tunnels are added.

slide-35
SLIDE 35

PCF vs. FFC on multiple topologies

Figure: CDF over topologies (fraction of topologies) of demand scale relative to FFC for the optimal scheme; higher is better. 21 topologies, up to 3 link failures.

  • The optimal scheme gives much higher throughput than FFC.
  • For 40% of the topologies, the optimal scheme can sustain 40% more demand than FFC.

slide-36
SLIDE 36

PCF vs. FFC on multiple topologies

Figure: CDF over topologies (fraction of topologies) of demand scale relative to FFC for PCF-TF and the optimal scheme; higher is better. 21 topologies, up to 3 link failures.

  • PCF-TF improves over FFC by 11% on average and more than 50% in the best case.
  • PCF-TF has the same response mechanism as FFC.

slide-37
SLIDE 37

PCF vs. FFC on multiple topologies

Figure: CDF over topologies (fraction of topologies) of demand scale relative to FFC for PCF-TF, PCF-LS, and the optimal scheme; higher is better. 21 topologies, up to 3 link failures.

  • PCF-LS improves over FFC by 25% on average, and performs 2.6x better in the best case.
  • Fully distributed response mechanism

slide-38
SLIDE 38

PCF vs. FFC on multiple topologies

Figure: CDF over topologies (fraction of topologies) of demand scale relative to FFC for PCF-TF, PCF-LS, PCF-CLS, and the optimal scheme; higher is better. 21 topologies, up to 3 link failures.

  • PCF-CLS improves over FFC by 50% on average, and matches the optimal for most cases.
  • Only requires solving a linear system on failures

slide-39
SLIDE 39

Other results

  • Similar improvements over FFC are observed in other experiments:
  • Evaluation on the same topology over multiple different demands
  • Evaluation on other metrics instead of demand scale
  • An interesting heuristic shows the feasibility of achieving nearly optimal performance for most topologies with completely local routing under single link failure.

slide-40
SLIDE 40

Solving time

Figure: solving time in seconds (log scale, truncated at 1 h) versus number of sub-links for PCF-TF, PCF-CLS, and the optimal scheme. 21 topologies, up to 3 link failures.

  • PCF schemes:
  • For most topologies, the solving times are under 10 seconds.
  • For the largest topology (302 links), the solving time is under 100 seconds.
  • Optimal scheme:
  • Does not finish within one hour for many topologies.
  • For the largest topology, it took days to finish.

slide-41
SLIDE 41

Conclusion

  • We show that existing congestion-free schemes perform much worse than the network’s intrinsic capability. We present the underlying reasons.
  • We propose PCF in order to bridge the gap.
  • Carefully introduce flexibility in network response to achieve:
  • High throughput, tractable failure analysis, low response overhead
  • Formal results show that PCF is provably better than FFC.
  • PCF achieves up to 50% improvement over FFC on average across 21 topologies.

slide-42
SLIDE 42

Thanks!

Chuan Jiang: jiang486@purdue.edu
Sanjay Rao: sanjay@ecn.purdue.edu