B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN


SLIDE 1

B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN

SLIDE 2

(“Chi”) Chi-yao Hong, Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi, Kondapa Naidu B., Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev, Steve Padgett, Faro Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn, Jonathan Zolla, Joon Ong, Amin Vahdat On behalf of many others in: Google Network Infrastructure and Network SREs

SLIDE 3

3

[Timeline figure, 2011-2018: B4 evolves from Saturn, the first-generation "copy network" with 99% availability, through the J-POP and Stargate generations, to 99.9% and then 99.99% availability while carrying >100x more traffic, toward a highly available, massive-scale network.]

SLIDE 4

Previous B4 paper published in SIGCOMM 2013

4

SLIDE 5

Background: B4 with SDN Traffic Engineering (TE), deployed in 2012

[Architecture diagram: 12-site topology; a demand matrix (via Google BwE) feeds the central TE controller, which programs per-site domain TE controllers with site-level tunnels (tunnels & tunnel splits).]

5

SLIDE 6

Background: B4 with SDN Traffic Engineering (TE), deployed in 2012

Key takeaways:

❏ High efficiency: lower per-byte cost compared with B2 (Google's global backbone running RSVP TE on vendor gear)
❏ Deterministic convergence: fast, global TE optimization and failure handling
❏ Rapid software iteration: ~1 month to develop and deploy a median-size software feature

6

SLIDE 7

But, it also comes with new challenges

7

SLIDE 8

Grand Challenge #1: High Availability Requirements

Service Class | Application Examples | Availability SLO
SC4 | Search ads, DNS, WWW | 99.99%
SC3 | Proto service backend, Email | 99.95%
SC2 | Ads database replication | 99.9%
SC1 | Search index copies, logs | 99%
SC0 | Bulk transfer | N/A

B4 initially had 99% availability in 2013

8

SLIDE 9

Service Class | Application Examples | Availability SLO
SC4 | Search ads, DNS, WWW | 99.99%
SC3 | Proto service backend, Email | 99.95%
SC2 | Ads database replication | 99.9%
SC1 | Search index copies, logs | 99%
SC0 | Bulk transfer | N/A

B4 initially had 99% availability. A very demanding goal, given:

  • inherent unreliability of long-haul links
  • necessary management operations

9

SLIDE 10

Grand Challenge #2: Scale Requirements

Our bandwidth requirement doubled every ~9 months

10

SLIDE 11

traffic increased by >100x in 5 years

11

SLIDE 12

Grand Challenge #2: Scale Requirements

Our bandwidth requirement doubled every ~9 months

Scale increased across dimensions:

  • #Cluster prefixes: 8x
  • #B4 sites: 3x
  • #Control domains: 16x
  • #Tunnels: 60x

12

SLIDE 13

Other challenges: No disruption to existing traffic, maintain high cost efficiency and high feature velocity

13

SLIDE 14

To meet these demanding requirements, we’ve had to aggressively develop many point solutions

14

SLIDE 15

Lessons Learned

1. Flat topology scales poorly and hurts availability
2. Solving the capacity asymmetry problem in a hierarchical topology is key to achieving high availability at scale
3. Scalable switch forwarding rule management is essential to hierarchical TE

15

SLIDE 16

[Diagram: four sites interconnected by the B4 WAN; within a site, BF and CF switch chassis provide 5.12 Tbps toward clusters and 5.12 / 6.4 Tbps toward the WAN (other B4 sites).]

Saturn: first-generation B4 site fabric

16

SLIDE 17

Scaling option #1: Add more chassis (up to 8 chassis per Saturn fabric)

[Diagram: the same Saturn site fabric scaled out with additional BF/CF chassis; sites connect to the B4 WAN.]

17

Saturn: first-generation B4 site fabric

SLIDE 18

Scaling option #2: Build multiple B4 sites in close proximity

[Diagram: multiple B4 sites placed in close proximity.]

Drawbacks: slower central TE controller, limited switch table space, and complicated capacity planning and job allocation

18

SLIDE 19

Jumpgate: Two-layer Topology

[Diagram: a Jumpgate site delivers 80 Tbps toward WAN / clusters / sidelinks; the site is built from supernodes, each a two-layer fabric of edge and spine switches (x16 / x32).]

19

SLIDE 20

Jumpgate: Two-layer Topology

[Diagram: the same Jumpgate site built from two-layer (edge/spine, x16 / x32) supernodes.]

Supports horizontal scaling by adding more supernodes to a site
Supports vertical scaling by upgrading a supernode in place to a new generation
Improves availability with a granular, per-supernode control domain

20

SLIDE 21

Lessons Learned

1. Flat topology scales poorly and hurts availability
2. Solving the capacity asymmetry problem in a hierarchical topology is key to achieving high availability at scale
3. Scalable switch forwarding rule management is essential to hierarchical TE

21

SLIDE 22

[Diagram: Sites A, B, and C; supernode-level links of capacity 4 between neighboring supernodes are abstracted into a single site-level link whose capacity (16) is the sum of the supernode-level link capacities.]

22

SLIDE 23

[Diagram: Sites A, B, and C with asymmetric supernode-level link capacities; the site-level abstraction suggests 14 units of capacity between sites, but only 8 are actually deliverable, creating a bottleneck.]

Abstraction loss: 43% = (14 - 8) / 14
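Stated generally (the symbols below are my labels for the quantities on this slide, not the deck's notation), the loss introduced by the site-level abstraction is the gap between the advertised and the deliverable capacity:

\[
\text{abstraction loss} \;=\; \frac{C_{\text{abstract}} - C_{\text{deliverable}}}{C_{\text{abstract}}} \;=\; \frac{14 - 8}{14} \;\approx\; 43\%
\]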

23

SLIDE 24

[CDF over site-level links and topology events of site-level link capacity loss due to topology abstraction, as a fraction of total capacity (log10 scale): 100% capacity loss in 18% of cases; 2% capacity loss in the median case, due to striping inefficiency.]

24

SLIDE 25

Solution = Sidelinks + Supernode-level TE

25

SLIDE 26

[Diagram: Sites A, B, and C with sidelinks; each supernode-level link carries 3.5 units, with traffic split 57% toward the next site and 43% toward the self site (over sidelinks).]

26

SLIDE 27

Solution = Sidelinks + Supernode-level TE

Multi-layer TE (Site-level & supernode-level) turns out to be challenging!

27

SLIDE 28

Design Proposals

Hierarchical Tunneling

Site-level tunnels + Supernode-level sub-tunnels

Two layers of IP encapsulation lead to inefficient hashing

Supernode-level TE

Supernode-level tunnels

Scaling challenges: increases path allocation run time by 188x

28

SLIDE 29

Tunnel Split Group (TSG): supernode-level traffic splits, no packet encapsulation, calculated per site-level link

[Diagram: Site A (4 supernodes) sending to Site B (2 supernodes), with per-supernode split variables (x, 4x, ...).]

Assume balanced ingress traffic; maximize admissible demand subject to fairness and link capacity constraints

29

SLIDE 30

Greedy Exhaustive Waterfill Algorithm: iteratively allocate each flow on its direct path (w/o sidelinks), or alternatively on its indirect paths (w/ sidelinks at the source site), until no flow can be allocated further
Provably forwarding-loop free
Low abstraction capacity loss
Takes less than 1 second to run
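The waterfill idea maps naturally to a small round-robin allocation loop. The following is a minimal sketch under my own simplifications (a fixed increment, explicit per-flow detour path lists, and invented names such as waterfill and STEP); it is not the production algorithm, which additionally proves loop freedom and runs against the real supernode-level topology.

```python
from collections import defaultdict

STEP = 0.1  # waterfill increment (illustrative granularity)

def waterfill(flows, capacity, sidelink_paths):
    """flows: list of (src_supernode, dst_supernode) demands.
    capacity: dict mapping a directed supernode-level link (u, v) to remaining capacity.
    sidelink_paths: dict mapping a flow to alternative paths (lists of links)
      that detour over sidelinks at the source site."""
    alloc = defaultdict(float)                   # flow -> admitted demand
    active = set(flows)
    while active:                                # grow every active flow by STEP per round (fairness)
        for flow in list(active):
            direct = flow                        # the direct supernode-level link
            if capacity.get(direct, 0.0) >= STEP:
                capacity[direct] -= STEP         # allocate on the direct path
                alloc[flow] += STEP
                continue
            for path in sidelink_paths.get(flow, []):   # otherwise try a sidelink detour
                if all(capacity.get(link, 0.0) >= STEP for link in path):
                    for link in path:
                        capacity[link] -= STEP
                    alloc[flow] += STEP
                    break
            else:
                active.discard(flow)             # flow cannot be allocated further
    return dict(alloc)

# Example: A2's direct link to B1 is down, so it detours over the A2->A1 sidelink;
# both flows end up with an equal share of the A1->B1 link.
caps = {("A1", "B1"): 1.0, ("A2", "A1"): 1.0}
detours = {("A2", "B1"): [[("A2", "A1"), ("A1", "B1")]]}
print(waterfill([("A1", "B1"), ("A2", "B1")], caps, detours))
```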

Site A (4 supernodes) Site B (2 supernodes)

30

SLIDE 31

[Same CDF of site-level link capacity loss due to topology abstraction (log10 scale), with sidelinks and supernode-level TE: cases that previously saw 100% capacity loss now see < 2% loss.]

31

SLIDE 32

TSG Sequencing Problem

[Diagram: current TSGs vs. target TSGs across supernodes A1, A2, B1, and B2.]

Bad properties during update: forwarding loops and blackholes

32

SLIDE 33

Dependency-Graph-Based TSG Update

Loop-free and no extra blackholes; requires no packet tagging
1. Map target TSGs to a supernode dependency graph
2. Apply TSG updates in reverse topological ordering*
One or two steps suffice in >99.7% of TSG ops

* Shares ideas with work on IGP updates:

  • Francois & Bonaventure, Avoiding Transient Loops during IGP Convergence in IP Networks, INFOCOM’05
  • Vanbever et al., Seamless Network-wide IGP Migrations, SIGCOMM’11
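As a rough illustration of step 2 (my own sketch, not the production sequencer; the graph encoding and names are assumptions), the update order can be computed with an off-the-shelf topological sort over the supernode dependency graph, so that a supernode switches to its target TSG only after every supernode it will send through has already switched:

```python
from graphlib import TopologicalSorter  # Python 3.9+

def tsg_update_order(target_next_hops):
    """target_next_hops: dict mapping a supernode to the set of supernodes its
    target TSG forwards traffic through (e.g., sidelink next hops)."""
    order = TopologicalSorter()
    for node, next_hops in target_next_hops.items():
        # A node's next hops must be updated before the node itself; with edges
        # drawn in the traffic direction this matches the deck's "reverse
        # topological ordering". Cycles (rare) would need an extra step not
        # handled by this sketch.
        order.add(node, *next_hops)
    return list(order.static_order())

# Example: A1's target TSG detours via A2, while A2 sends straight to site B.
print(tsg_update_order({"A1": {"A2"}, "A2": set()}))   # ['A2', 'A1']
```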

33

SLIDE 34

Lessons Learned

1. Flat topology scales poorly and hurts availability
2. Solving the capacity asymmetry problem in a hierarchical topology is key to achieving high availability at scale
3. Scalable switch forwarding rule management is essential to hierarchical TE

34

SLIDE 35

[Diagram: a B4 site built from two-layer (x16 / x32) Clos supernodes.]

35

Multi-stage Hashing across Switches in Clos Supernode

1. Ingress traffic at edge switches:
   a. Site-level tunnel split
   b. TSG site-level split (to self-site or next-site)
2. At spine switches:
   a. TSG supernode-level split
   b. Egress edge switch split
3. Egress traffic at edge switches:
   a. Egress port/trunk split

Enables hierarchical TE at scale: overall throughput improved by >6%
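To make the division of labor concrete, here is a minimal sketch (my illustration, not the switch pipeline itself) of how a hierarchical split can be realized as a product of small hashed splits at each stage; stage names and weights follow the slide where possible (the 57/43 split echoes the earlier sidelink example), while tunnel, supernode, and port names are invented:

```python
import hashlib

def _hash(flow_key, stage):
    """Deterministic per-stage hash of the flow key plus a stage salt."""
    digest = hashlib.sha256(f"{flow_key}|{stage}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def weighted_pick(flow_key, stage, choices):
    """choices: list of (name, weight); pick one in proportion to its weight."""
    total = sum(weight for _, weight in choices)
    point = _hash(flow_key, stage) % total
    for name, weight in choices:
        if point < weight:
            return name
        point -= weight
    return choices[-1][0]

def forward(flow_key):
    # 1. Ingress edge switch: site-level tunnel split, then TSG site-level split.
    tunnel = weighted_pick(flow_key, "tunnel", [("tunnel-via-B", 3), ("tunnel-via-C", 1)])
    direction = weighted_pick(flow_key, "tsg-site", [("next-site", 57), ("self-site", 43)])
    # 2. Spine switch: TSG supernode-level split and egress edge switch split.
    supernode = weighted_pick(flow_key, "tsg-supernode", [("B1", 1), ("B2", 1)])
    egress_edge = weighted_pick(flow_key, "egress-edge", [(f"edge-{i}", 1) for i in range(4)])
    # 3. Egress edge switch: egress port/trunk split.
    port = weighted_pick(flow_key, "port", [(f"port-{i}", 1) for i in range(8)])
    return tunnel, direction, supernode, egress_edge, port

print(forward("10.0.0.1,10.1.2.3,51512,443,TCP"))
```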

SLIDE 36

[Timeline figure, 2011-2018: availability milestones 99% → 99.9% → 99.99% while traffic grows >100x, toward a highly available, massive-scale network. Generations and techniques along the way: Saturn (flat topology, SDN TE tunneling, "copy network"); Jumpgate / J-POP / Stargate (two-layer topology, two service classes); TSG hierarchical TE with efficient switch rule management & more service classes.]

36

SLIDE 37

Conclusions

❏ Highly available WAN with plentiful bandwidth offers unique benefits to many cloud services (e.g., Spanner)
❏ Future work: limit the blast radius of rare yet catastrophic failures
❏ Reduce dependencies across components
❏ Network operation via per-QoS canary

37

SLIDE 38

Before | After
Copy network with 99% availability | Highly available network with 99.99% availability
Inter-DC WAN with a moderate number of sites | 100x more traffic, 60x more tunnels
Saturn: flat site topology & per-site domain TE controller | Jumpgate: hierarchical topology & granular TE control domain
Site-level tunneling | Site-level tunneling in conjunction with supernode-level TE (“Tunnel Split Group”)
Tunnel splits implemented at ingress switches | Multi-stage hashing across switches in Clos supernode

B4 and After: Managing Hierarchy, Partitioning, and Asymmetry for Availability and Scale in Google's Software-Defined WAN

SLIDE 39

[Diagram: a B4 site built from two-layer (x16 / x32) Clos supernodes; switch pipeline: ACL (Flow Match) → ECMP (Port Hashing) → Encap (+Tunnel IP).]

39

SLIDE 40

Switch pipeline: ACL (Flow Match) → ECMP (Port Hashing) → Encap (+Tunnel IP)

Size(ACL) ≥ #Sites ✕ #PrefixesPerSite ✕ #ServiceClasses

Per site: >16 aggregated IPv4 & IPv6 cluster prefixes and 6 aggregated QoS classes; the ACL table holds up to 3K entries

Scaling bottleneck: hit the ACL table limit with ~32 sites
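A quick sanity check of the inequality above, using the numbers on this slide (the function name and the exact 3,000-entry constant are my stand-ins for the deck's "up to 3K entries"):

```python
def acl_entries_needed(sites, prefixes_per_site=16, service_classes=6):
    # Lower bound from the slide: Size(ACL) >= sites * prefixes_per_site * service_classes
    return sites * prefixes_per_site * service_classes

ACL_TABLE_LIMIT = 3_000   # roughly the "up to 3K entries" quoted above

for sites in (16, 24, 32, 33):
    need = acl_entries_needed(sites)
    print(sites, need, "ok" if need <= ACL_TABLE_LIMIT else "exceeds ACL table")
# Around 32 sites the requirement (32 * 16 * 6 = 3072) crosses the ~3K limit.
```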

40

SLIDE 41

Switch pipeline (before): ACL (Flow Match) → ECMP (Port Hashing) → Encap (+Tunnel IP)

Switch pipeline (after): VFP (QoS Match) → Per-VRF LPM (Prefix Match) → ECMP (Port Hashing) → Encap (+Tunnel IP), replacing the ACL flow match

Increases # supported sites by 60x

Enables new features: disable per-flow tunneling
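The scaling win comes from splitting one big flow-match table into a small QoS classifier plus per-VRF prefix tables. A minimal sketch of that two-stage lookup (my illustration with invented VRF names and an illustrative DSCP-to-VRF mapping, not the switch's actual tables):

```python
import ipaddress

# Stage 1 (VFP): QoS marking (DSCP) -> VRF. Values are illustrative.
VFP = {46: "vrf-sc4", 34: "vrf-sc3", 10: "vrf-sc1"}

# Stage 2 (per-VRF LPM): destination prefix -> next-hop (ECMP) group.
LPM = {
    "vrf-sc4": {ipaddress.ip_network("10.1.0.0/16"): "ecmp-group-7"},
    "vrf-sc3": {ipaddress.ip_network("10.1.0.0/16"): "ecmp-group-3"},
    "vrf-sc1": {ipaddress.ip_network("0.0.0.0/0"): "ecmp-group-1"},
}

def lookup(dscp, dst_ip):
    vrf = VFP[dscp]                               # QoS match selects the VRF
    dst = ipaddress.ip_address(dst_ip)
    best = None
    for prefix, group in LPM[vrf].items():        # longest-prefix match within the VRF
        if dst in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, group)
    return vrf, best[1] if best else None

# Each VRF's LPM table scales with the number of prefixes only, instead of
# sites x prefixes x service classes as in the flat flow-match ACL.
print(lookup(46, "10.1.2.3"))   # ('vrf-sc4', 'ecmp-group-7')
```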

41

SLIDE 42

Switch pipeline: VFP (QoS Match) → Per-VRF LPM (Prefix Match) → ECMP (Port Hashing) → Encap (+Tunnel IP)

Size(ECMP) ≥ #Sites ✕ #PathingClasses ✕ TunnelSplits ✕ TSG_Splits ✕ SwitchSplits

Example values from the deck: 33 sites, 3 pathing classes, and 4-, 16-, and 32-way splits across the stages

~198K entries required; only 16K supported by our switches
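A back-of-the-envelope version of that sizing (the mapping of the 4-, 16-, and 32-way split widths to particular stages is my assumption, since the flattened slide text does not spell it out; the product of the quoted values lands in the same ballpark as the ~198K figure):

```python
def ecmp_entries_needed(sites, pathing_classes, tunnel_splits, tsg_splits, switch_splits):
    # Lower bound from the slide: the product of the per-stage fan-outs.
    return sites * pathing_classes * tunnel_splits * tsg_splits * switch_splits

need = ecmp_entries_needed(sites=33, pathing_classes=3,
                           tunnel_splits=4, tsg_splits=16, switch_splits=32)
print(f"{need:,} entries needed vs 16K supported")   # ~200K >> 16,384
```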

42

SLIDE 43

Switch pipeline: VFP (QoS Match) → Per-VRF LPM (Prefix Match) → ECMP (Port Hashing) → Encap (+Tunnel IP)

Size(ECMP) ≥ #Sites ✕ #PathingClasses ✕ TunnelSplits ✕ TSG_Splits ✕ SwitchSplits

Scaling bottleneck: hit the ACL table limit with ~32 sites

[Diagram: two-layer (x16 / x32) Clos supernode.]

Overall throughput improved by >6%; supports more sites & pathing classes

43

SLIDE 44

[Diagram: a B4 site built from two-layer (x16 / x32) Clos supernodes; original switch pipeline: ACL (Flow Match) → ECMP (Port Hashing) → Encap (+Tunnel IP).]

Limitations of the original pipeline: supports only up to 32 sites; reduced efficiency with lower path-split granularity

Solutions: efficient flow matching via virtual routing & forwarding (VRF); multi-stage hashing by leveraging source MAC marking and packet load balancing via spine-layer switches

44