Towards Highly Available Clos-Based WAN Routers

SLIDE 1

Towards Highly Available Clos-Based WAN Routers

Sucha Supittayapornpong, Barath Raghavan, Ramesh Govindan
University of Southern California
SIGCOMM 2019

SLIDE 2

Google’s Wide Area Network

This network connects datacenters, so it has to be highly available.

[Figure from "B4 and After", SIGCOMM’18]

SLIDE 3

Google’s Wide Area Network

Each datacenter has one or more routers, and routers are connected to one another by trunks.

[Figure from "B4 and After", SIGCOMM’18]

SLIDE 4

A Trunk Contains Many Optical Links

https://www.sd-wan-experts.com/blog/undersea-cables/

SLIDE 5

WAN Router

A trunk’s links are wired to the router’s ports. Real routers have 128 or 512 ports.

[Figure labels: Trunk, Wiring, Router; figure from "B4 and After", SIGCOMM’18]

SLIDE 6

WAN Router

Let’s use a toy router to develop intuitions.

[Figure labels: Trunk, Wiring, Router]

SLIDE 7

Clos-Based WAN Router

A router is built as a Clos topology.

[Figure labels: Upper stage, Lower stage, Internal link]

SLIDE 8

Clos is Non-Blocking

It can handle any traffic matrix without loss.

(All-to-all traffic, 1 unit)

SLIDE 9

Clos is Non-Blocking

Equal-cost multipath (ECMP) routing can achieve the non-blocking property.

SLIDE 10

Achieving Non-Blocking Property via ECMP

ECMP splits traffic equally to nexthops.

SLIDE 11

Achieving Non-Blocking Property via ECMP

ECMP splits traffic equally to nexthops.

SLIDE 12

Achieving Non-Blocking Property via ECMP

ECMP splits traffic equally to nexthops.
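
The equal split can be sketched in a few lines of Python. The helper name `ecmp_split` and the next-hop labels are illustrative and do not come from the talk.

```python
from fractions import Fraction


def ecmp_split(demand, nexthops):
    """Split `demand` equally across `nexthops` (hypothetical helper)."""
    if not nexthops:
        raise ValueError("ECMP needs at least one next hop")
    share = Fraction(demand) / len(nexthops)
    return {hop: share for hop in nexthops}


# Two units of traffic over two uplinks: one unit on each uplink.
shares = ecmp_split(2, ["upper-0", "upper-1"])
print({hop: float(share) for hop, share in shares.items()})
```

Exact fractions are used so that splits such as one unit over three next hops do not pick up rounding error.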

SLIDE 13

Implication of Non-Blocking Property

There is sufficient internal capacity to route traffic between lower and upper stages.

SLIDE 14

What Happens If There are Failures?

SLIDE 15

What Happens If There are Failures?

A single failure reduces internal capacity.

SLIDE 16

What Happens If There are Failures?

A single failure reduces internal capacity. Overall capacity can drop by half when ECMP is used.
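
One simplified way to see the halving, assuming a toy router in which a fully loaded lower switch has two unit-capacity uplinks: after one uplink fails, ECMP puts the whole load on the survivor, so demand must be scaled to one half to fit. The function below is a sketch of that toy model only. It looks at the uplinks of a single switch and is not the talk's capacity computation.

```python
from fractions import Fraction


def ecmp_scale_after_failures(num_uplinks, failed_uplinks, link_capacity=1):
    """Largest demand scaling that fits after an equal split over survivors.

    Toy model (assumption): the lower switch was fully loaded before the
    failure, and only its own uplinks constrain throughput.
    """
    surviving = num_uplinks - failed_uplinks
    if surviving <= 0:
        return Fraction(0)
    demand = Fraction(num_uplinks) * link_capacity  # upward load before failure
    per_link = demand / surviving                   # equal split after failure
    return min(Fraction(1), Fraction(link_capacity) / per_link)


print(ecmp_scale_after_failures(2, 0))  # 1
print(ecmp_scale_after_failures(2, 1))  # 1/2
```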

SLIDE 17

Key Question

Can we completely mask internal link and switch failures?

SLIDE 18

Key Question

Can we completely mask internal link and switch failures? If not, can we degrade gracefully? Existing approaches do neither of these.

SLIDE 19

Key Insight: Wiring trunks to maximize early forwarding

SLIDE 20

Key Insight: Wiring trunks to maximize early forwarding

Careful wiring enables early forwarding.

[Figure panels: Previous, Early forwarding]

SLIDE 21

Key Insight: Wiring trunks to maximize early forwarding

Early forwarding can reduce upflow.

SLIDE 22

Key Insight: Wiring trunks to maximize early forwarding

Early forwarding can reduce upflow.

SLIDE 23

Key Insight: Wiring trunks to maximize early forwarding

The router can recover full capacity in this example. (We completely mask the failure.)

SLIDE 24

Early forwarding needs a weighted version of ECMP (WCMP)

[Figure labels: Weight = 0, Weight = 1]

SLIDE 25

WCMP can increase table sizes

Weight = 2 and Weight = 1: use 2 + 1 = 3 weight entries.
Weight = 21 and Weight = 11: use 21 + 11 = 32 weight entries.
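
The entry counts on this slide follow from a simple model, sketched here under the assumption that a next hop with weight w occupies w table entries. The function name is illustrative.

```python
def wcmp_table_entries(weights):
    """Count table entries when each next hop appears `weight` times.

    Assumption: one table entry per unit of weight, as in the slide's
    2 + 1 = 3 and 21 + 11 = 32 examples.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    return sum(weights)


print(wcmp_table_entries([2, 1]))    # 3
print(wcmp_table_entries([21, 11]))  # 32
print(2 / 1, 21 / 11)                # the two split ratios are close
```

The ratios 2:1 and 21:11 differ by less than 0.1, yet the second needs more than ten times as many entries, which is why the choice of weights matters for table size.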

SLIDE 26

WCMP weights can depend on failure pattern

SLIDE 27

Challenges

What wiring minimizes upflow?
What is the most space-efficient set of WCMP weights?
What is the effective capacity for a failure pattern?

SLIDE 28

Contributions

SLIDE 29

The Entire Pipeline is Offline

Computing routing tables is expensive and cannot be done after a failure happens, so we must precompute tables for every possible failure pattern. Challenge: All of these steps must scale to very large routers.

SLIDE 30

Finding Optimal Wiring

SLIDE 31

Upflow depends on both trunk wiring and traffic

[Figure: same traffic, different wiring]

SLIDE 32

Upflow depends on both trunk wiring and traffic

[Figure: different traffic, same wiring]

SLIDE 33

Upflow depends on both trunk wiring and traffic

Upflow is a function of the wiring and the traffic matrix.

SLIDE 34

Each wiring has its worst-case traffic

Traffic matrix 1: Upflow = 2
Traffic matrix 2: Upflow = 4

SLIDE 35

Each wiring has its worst-case traffic

Traffic matrix 1: Upflow = 8
Traffic matrix 2: Upflow = 6

SLIDE 36

Optimal wiring minimizes the worst-case upflow

[Figure: two wirings, with worst-case upflow = 4 and worst-case upflow = 8; choose the wiring with upflow = 4]
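
The selection rule is a min-max: take each wiring's worst upflow over traffic matrices, then pick the wiring whose worst case is smallest. A minimal sketch using the upflow values from the two preceding slides follows. The wiring names are placeholders, and the upflow numbers are taken as given rather than computed.

```python
def choose_wiring(upflow_by_wiring):
    """Return the wiring with the smallest worst-case upflow.

    `upflow_by_wiring` maps a wiring name to its upflow under each
    traffic matrix considered.
    """
    worst_case = {name: max(values) for name, values in upflow_by_wiring.items()}
    best = min(worst_case, key=worst_case.get)
    return best, worst_case


upflows = {
    "wiring_a": [2, 4],  # upflow under traffic matrices 1 and 2
    "wiring_b": [8, 6],
}
best, worst_case = choose_wiring(upflows)
print(best, worst_case)  # wiring_a {'wiring_a': 4, 'wiring_b': 8}
```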

SLIDE 37

Challenge: There are infinitely many traffic matrices!

SLIDE 38

Solution: Extreme traffic matrices are sufficient
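
A standard argument for why extreme points can suffice, stated here as an assumption because the slide does not give the reasoning: if the set of admissible traffic matrices $\mathcal{T}$ is a bounded polytope and the upflow $f(w, T)$ of a wiring $w$ is convex in the traffic matrix $T$, then the worst case is attained at a vertex of $\mathcal{T}$. The symbols $f$, $w$, $T$, and $\mathcal{T}$ are notation introduced for this note.

```latex
\max_{T \in \mathcal{T}} f(w, T)
  \;=\;
\max_{T \in \mathrm{ext}(\mathcal{T})} f(w, T)
```

Because a polytope has finitely many vertices, the infinite set of traffic matrices reduces to a finite set of candidates.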

SLIDE 39

Finding the optimal wiring becomes a mixed-integer linear program (MILP)

SLIDE 40

Calculating Effective Capacity

SLIDE 41

A Non-Blocking Router Allows Topology Abstraction

Abstraction simplifies traffic engineering.

[Figure: non-blocking router and its topology abstraction; labels: A, C, 2, 2]

SLIDE 42

A Blocking Router Breaks the Topology Abstraction

The router can no longer be abstracted as a simple node with flow conservation.

[Figure: blocking router; labels: A, C, 2, 1]

SLIDE 43

Upon Failure, Scale Demand to Ensure Non-Blocking

[Figure panels: Blocking, Non-blocking; annotation: = 2 × 0.5]

SLIDE 44

Effective Capacity for Non-Blocking Design

Effective capacity is the largest scaling factor at which the router remains non-blocking under a given failure pattern.

[Figure panels: Blocking; Non-blocking, 𝜄 = 0.5; annotation: = 2 × 0.5]
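
For intuition, the scaling factor can be computed directly when the routing is held fixed: it is the largest factor that keeps every link's scaled load within its capacity. The sketch below makes that simplifying assumption (fixed per-link loads, factor capped at 1), and the link names are made up. The computation described in the talk also chooses the routing.

```python
from fractions import Fraction


def largest_scaling_factor(link_loads, link_capacities):
    """Largest s in [0, 1] such that s * load <= capacity on every link.

    Simplification (assumption): per-link loads are given, i.e. the
    routing is fixed rather than optimized.
    """
    factor = Fraction(1)
    for link, load in link_loads.items():
        if load > 0:
            ratio = Fraction(link_capacities[link]) / Fraction(load)
            factor = min(factor, ratio)
    return factor


# A demand of 2 units on an internal link with 1 unit of capacity left.
s = largest_scaling_factor({"internal": 2}, {"internal": 1})
print(s, 2 * s)  # 1/2 1
```

With a factor of 0.5, the demand of 2 becomes 2 × 0.5 = 1, which fits on the remaining capacity.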

SLIDE 45

Computing Effective Capacity

For a given failure pattern, finding the effective capacity requires solving one linear program per traffic matrix.

Traffic matrix 1: 𝜄 = 0.5
Traffic matrix 2: 𝜄 = 0.75

SLIDE 46

Effective Capacity under Failure and Traffic

The effective capacity is the minimum value across traffic matrices.

Traffic matrix 1: 𝜄 = 0.5
Traffic matrix 2: 𝜄 = 0.75
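
Combining the per-traffic-matrix results is then a single minimum. A short sketch with the two values shown on this slide follows. The function and key names are illustrative.

```python
def effective_capacity(scaling_by_traffic_matrix):
    """Effective capacity for one failure pattern: the smallest
    per-traffic-matrix scaling factor."""
    if not scaling_by_traffic_matrix:
        raise ValueError("need at least one traffic matrix")
    return min(scaling_by_traffic_matrix.values())


per_matrix = {"traffic matrix 1": 0.5, "traffic matrix 2": 0.75}
print(effective_capacity(per_matrix))  # 0.5
```

Taking the minimum means the router stays non-blocking for every traffic matrix considered, not only the favorable ones.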

SLIDE 47

Challenge: Exponential Number of Failure Patterns

SLIDE 48

Challenge: Exponential Number of Failure Patterns

[Figure panels: Similar, Not similar]

SLIDE 49

Challenge: Exponential Number of Failure Patterns

Solution: Group similar failure patterns using a graph canonicalization algorithm, then calculate effective capacity for each canonical pattern.

[Figure panel: Similar]
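
The grouping idea can be illustrated on a tiny example. The sketch below uses brute-force relabeling as a stand-in for a real graph canonicalization algorithm, and it assumes that any permutation of lower switches or of upper switches is a symmetry. In a wired router the trunk wiring would restrict which permutations count, so this is a toy model, not the talk's method. All names are hypothetical.

```python
from itertools import combinations, permutations


def canonical_form(failed_links, num_lower, num_upper):
    """Smallest relabeling of a failure pattern under switch permutations.

    A link is a (lower, upper) pair. Brute force, so only for tiny cases.
    """
    best = None
    for lower_perm in permutations(range(num_lower)):
        for upper_perm in permutations(range(num_upper)):
            relabeled = tuple(sorted(
                (lower_perm[l], upper_perm[u]) for l, u in failed_links))
            if best is None or relabeled < best:
                best = relabeled
    return best


def group_failure_patterns(num_lower, num_upper):
    """Map each canonical pattern to the failure patterns it represents."""
    links = [(l, u) for l in range(num_lower) for u in range(num_upper)]
    groups = {}
    for k in range(len(links) + 1):
        for pattern in combinations(links, k):
            key = canonical_form(pattern, num_lower, num_upper)
            groups.setdefault(key, []).append(pattern)
    return groups


groups = group_failure_patterns(2, 2)
total = sum(len(members) for members in groups.values())
print(total, len(groups))  # 16 patterns fall into 7 groups
```

Effective capacity would then be computed once per group rather than once per pattern.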

SLIDE 50

Compacting Routing Table

Please see this part in the paper.

SLIDE 51

Evaluation

Resilience of 128-port router
Comparison to alternative strategies
Scalability to 512-port router
Routing table sizes
Impact of optimizations

SLIDE 52

Evaluation

Resilience of 128-port router
Comparison to alternative strategies
Scalability to 512-port router
Routing table sizes
Impact of optimizations

SLIDE 53

Methodology

We enumerate all combinations of trunk sizes in which every trunk size is a multiple of 8 (34 combinations). We compute the effective capacity under all possible failure conditions.

Setup: 128-port router, 4 trunks, 16 lower switches, 8 upper switches, 8 links per lower switch.
Failure types: link failure, upper switch failure, lower switch failure.
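
One reading that reproduces the count of 34, offered as an assumption rather than the authors' stated procedure: the four trunks together use all 128 ports, each trunk has at least 8 links, sizes are multiples of 8, and the order of trunks does not matter.

```python
from itertools import combinations_with_replacement


def trunk_size_combinations(num_ports=128, num_trunks=4, step=8):
    """Unordered trunk-size combinations that use every port.

    Assumptions (not stated on the slide): trunks fill all ports and
    each trunk has at least `step` links.
    """
    sizes = range(step, num_ports + 1, step)
    return [combo
            for combo in combinations_with_replacement(sizes, num_trunks)
            if sum(combo) == num_ports]


combos = trunk_size_combinations()
print(len(combos))  # 34
print(combos[0], combos[-1])
```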

SLIDE 54

Effective Capacity - Link Failure: 128-Port Router

Our approach can mask up to 6 concurrent link failures.

[Figure legend: (0,16], (16,32), 32]

SLIDE 55

Lower Switch Failure: 128-Port Router

Capacity degrades gracefully.

[Figure label: Lower switch failure]

SLIDE 56

Comparison to Alternative Wiring Strategies

[Figure panels: Baseline Wiring, Random Wiring]

SLIDE 57

Minimal-Upflow Wiring Yields Superior Resilience

No other approach can mask even a single link failure.

[Figure panels: Minimal-Upflow Wiring, Baseline Wiring, Random Wiring]

SLIDE 58

Scalability: 512-Port Router

The pipeline can scale to the 512-port router.

[Figure panels: Lower switch failure, Upper switch failure]

SLIDE 59

Conclusion

Min-upflow wiring and early forwarding can mask a significant number of failures. This improves the availability of WAN routers. It can also be used to reduce the cost of WAN routers.

SLIDE 60

https://github.com/USC-NSL/Highly-Available-WAN-Router (Available Oct. 2019)
