Evolve or Die
High-Availability Design Principles Drawn from Google’s Network Infrastructure
Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google
The push towards higher 9s of availability
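Each additional "nine" of availability shrinks the allowed downtime by a factor of ten. A minimal arithmetic sketch (the function name is illustrative, not from the talk):

```python
# Downtime budget implied by each availability level ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1.0 - availability)

for label, a in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.1f} min/year")
# 99.9%: 525.6 min/year
# 99.99%: 52.6 min/year
# 99.999%: 5.3 min/year
```

At five nines, the yearly budget is barely five minutes, which is why manual recovery procedures alone cannot meet the target.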
How do providers achieve these levels?
Figure: Timeline (2006–2014) of Google's network evolution, capacity growing over time. Data center fabric generations: Firehose 1.0, Firehose 1.1, Watchtower, 4-Post, Saturn, Jupiter. Other milestones: B4, Google Global Cache, BwE, gRPC, Freedome, QUIC, Andromeda.
Result: Failures!
Figure: Google's three networks — data centers, B4 (inter-data-center WAN), and B2 (backbone, peering with other ISPs).
Design Differences
Management Plane Software
❖ Connect a new data center to B2 and B4
❖ Upgrade B4 or data center control plane software
❖ Drain or undrain links, switches, routers, services (drain: temporarily remove from service)
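Drain and undrain, the workhorse management-plane operations above, can be sketched as a pair of idempotent state changes. All names here (`Link`, `drain`, `undrain`) are illustrative, not Google's actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    drained: bool = False  # drained == temporarily removed from service

def drain(link: Link) -> None:
    """Take a link out of service so it carries no traffic (e.g. before an upgrade)."""
    link.drained = True

def undrain(link: Link) -> None:
    """Return a previously drained link to service."""
    link.drained = False

link = Link("dc1-to-b4-port3")
drain(link)       # safe to upgrade or repair while drained
assert link.drained
undrain(link)
assert not link.drained
```

Because both operations are idempotent, repeating them (for instance, after a partial failure of the management software) is always safe.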
Why is high network availability a challenge?
Blame-free process
Examples
Impact
❖ Evolution impacts availability
❖ No silver bullet
❖ Need fast recovery
Why is high network availability a challenge?
What are the characteristics of network availability failures?
Backbone router fails
Minimize Operator Intervention
Necessary for upgrade-in-place
Risky! High packet loss
Assess risk continuously
Figure: Management plane. Operator “intent” drives the management plane software; the run-time applies configuration, performs management operations, and verifies operation. Throughout: assess risk continuously, minimize operator intervention.
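The run-time's apply-then-verify cycle can be sketched as a single guarded operation: assess risk first, apply the configuration, verify the result, and roll back on failure. All function and data-structure names below are illustrative assumptions, not Google's actual software:

```python
def execute_intent(intent, network, assess_risk, apply_config, verify):
    """Run one management operation end to end, rolling back if verification fails."""
    if not assess_risk(intent, network):
        return "rejected"           # risk too high: never touch the network
    snapshot = dict(network)        # remember state so we can roll back
    apply_config(intent, network)
    if verify(intent, network):
        return "applied"
    network.clear()
    network.update(snapshot)        # verification failed: restore prior state
    return "rolled-back"

# Toy usage: the "intent" sets a link's drain state in a flat state table.
net = {"link-a": "active"}
result = execute_intent(
    ("link-a", "drained"), net,
    assess_risk=lambda i, n: i[0] in n,
    apply_config=lambda i, n: n.__setitem__(i[0], i[1]),
    verify=lambda i, n: n[i[0]] == i[1],
)
print(result, net)  # applied {'link-a': 'drained'}
```

The key design point is that verification is part of the operation itself, not a later manual step, which keeps operator intervention to a minimum.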
Automated Risk Assessment
Fail open
Contain failure radius
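"Fail open" means a switch that loses its controller keeps its last-known forwarding state instead of wiping it, so data-plane traffic keeps flowing through a control-plane failure. A minimal sketch with illustrative names:

```python
class Switch:
    def __init__(self):
        self.forwarding_table = {}
        self.controller_alive = True

    def on_controller_update(self, table):
        self.forwarding_table = table

    def on_heartbeat_timeout(self):
        # Fail open: do NOT clear forwarding state; keep forwarding with
        # the last-known-good table until the controller returns.
        self.controller_alive = False

    def forward(self, dst):
        return self.forwarding_table.get(dst, "drop")

sw = Switch()
sw.on_controller_update({"10.0.0.0/24": "port1"})
sw.on_heartbeat_timeout()           # controller unreachable
print(sw.forward("10.0.0.0/24"))    # port1 -- traffic still flows
```

The trade-off is that stale state may persist until the controller recovers, which is why failing open pairs with containing the failure radius.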
Centralized Control Plane
Traffic: exceedingly tricky!
Data center
Design fallback strategies
Figure: Fallback from B4 to B2, which can shift large traffic volumes from many data centers onto B2.
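A fallback strategy like the one above can be sketched as a simple path-selection rule: prefer B4, but shift to B2 when B4 is unhealthy or out of headroom. The thresholds and function name are illustrative assumptions, not actual Google policy:

```python
def choose_backbone(b4_healthy: bool, b4_headroom: float, demand: float) -> str:
    """Pick a backbone for a traffic aggregate; fall back from B4 to B2 when needed."""
    if b4_healthy and demand <= b4_headroom:
        return "B4"
    # Fallback path: B2 must be provisioned to absorb traffic shifted
    # from many data centers at once.
    return "B2"

print(choose_backbone(True, 100.0, 40.0))   # B4
print(choose_backbone(False, 100.0, 40.0))  # B2 (B4 unhealthy: fall back)
```

The danger the slide hints at: because many data centers can fall back simultaneously, the aggregate shifted volume is large, so the fallback target must be sized for it.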
Presentation template from SlidesCarnival
The role of evolution in failures ▸ Rethink the management plane
The prevalence of large, severe failures ▸ Prevent and mitigate large failures
Long failure durations ▸ Recover fast
Why is this difficult? Modern high-capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into 100s of thousands of lines
Centralized Control Plane
Figure: B4 centralized control plane: Gateway, Topology Modeler, TE Server, BwE.
“One piece of this large update seems wrong!!”
“Let me check the correctness of the update...”
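One way to act on "one piece of this large update seems wrong" is to validate a large control-plane update piece by piece, quarantining only the suspect pieces rather than failing the whole update. A hedged sketch; the data model and sanity check are illustrative, not the actual Topology Modeler logic:

```python
def validate_update(update: dict, is_sane) -> tuple[dict, dict]:
    """Split a proposed update into accepted and quarantined pieces."""
    accepted, quarantined = {}, {}
    for key, value in update.items():
        (accepted if is_sane(key, value) else quarantined)[key] = value
    return accepted, quarantined

# Toy sanity check: link capacities must be positive.
update = {"link-a": 100, "link-b": -5, "link-c": 40}
ok, bad = validate_update(update, lambda k, v: v > 0)
print(ok)   # {'link-a': 100, 'link-c': 40}
print(bad)  # {'link-b': -5}
```

Accepting the sane pieces keeps the control plane making progress while the quarantined pieces are flagged for inspection, which contains the failure radius of a bad update.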
B2
❖ Must have one functional backup SDN controller
❖ Anycast route must have AS path length
❖ Data center must peer with two B2 routers
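Automated risk assessment can be framed as invariant checking: a proposed operation is allowed only if the network state it would produce still satisfies invariants like the ones above. The state representation and names below are illustrative assumptions:

```python
def check_invariants(state: dict) -> list[str]:
    """Return the invariants a (proposed) network state would violate."""
    violations = []
    if state["functional_backup_sdn_controllers"] < 1:
        violations.append("must have one functional backup SDN controller")
    for dc, peers in state["dc_b2_peerings"].items():
        if peers < 2:
            violations.append(f"{dc} must peer with two B2 routers")
    return violations

# Proposed state after draining one of dc2's B2 peerings:
proposed = {
    "functional_backup_sdn_controllers": 1,
    "dc_b2_peerings": {"dc1": 2, "dc2": 1},
}
print(check_invariants(proposed))  # ['dc2 must peer with two B2 routers']
```

An empty violation list means the operation may proceed; a non-empty one rejects it before any configuration is touched.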
Different hardware
Different control planes
Different forwarding paradigms
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network
SIGCOMM 2015
Gateway, Topology Modeler, TE Server, BwE
❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering
SIGCOMM 2013 (B4), SIGCOMM 2015 (BwE)
Other ISPs
❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities
Table: Failure events occur across all three networks and across the data, control, and management planes.
Lots of complexity hidden below this statement
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into 100s of thousands of lines
Centralized Control Plane
Adds design complexity
99.99% uptime
99.999% uptime