SLIDE 1

Evolve or Die

High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google

SLIDE 2

Network availability is the biggest challenge facing large content and cloud providers today

SLIDE 3

Why?


At four 9s availability

❖ Outage budget is 4 mins per month

At five 9s availability

❖ Outage budget is 24 seconds per month

The push towards higher 9s of availability
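
A quick sanity check on those budgets, as a minimal Python sketch; the 30-day month is an assumption, which is why the computed numbers land near, not exactly on, the slide's figures:

```python
# Back-of-the-envelope outage budgets (assumes a 30-day month).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

for label, availability in [("four 9s", 0.9999), ("five 9s", 0.99999)]:
    budget_s = (1 - availability) * SECONDS_PER_MONTH
    print(f"{label}: {budget_s:.0f} s/month ({budget_s / 60:.1f} min)")

# four 9s: 259 s/month (4.3 min)
# five 9s: 26 s/month (0.4 min)
```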

SLIDE 4

How do providers achieve these levels?

By learning from failures

SLIDE 5

Paper's Focus: What has Google Learnt from Failures?

Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?

SLIDE 6

Why is high network availability a challenge?

❖ Velocity of Evolution ❖ Scale ❖ Management Complexity

SLIDE 7

Evolution

(Chart: data center fabric capacity over time: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter)

Network hardware evolves continuously

SLIDE 8

Evolution

(Timeline, 2006–2014: Google Global Cache, Watchtower, B4, Freedome, Jupiter, BwE, gRPC, QUIC, Andromeda)

So does network software

SLIDE 9

Evolution


New hardware and software can

❖ Introduce bugs ❖ Disrupt existing software

Result: Failures!

SLIDE 10

Scale and Complexity

(Diagram: Google's three networks: B2, B4, and the data center networks, peering with other ISPs)

SLIDE 11

Scale and Complexity


Design Differences

B4 and Data Centers: ❖ Use merchant silicon chips ❖ Centralized control planes

B2: ❖ Vendor gear ❖ Decentralized control plane

SLIDE 12

Scale and Complexity


These differences increase management complexity and pose availability challenges

SLIDE 13

The Management Plane

Management Plane Software

Manages network evolution

SLIDE 14

Management Plane Operations

❖ Connect a new data center to B2 and B4
❖ Upgrade B4 or data center control plane software
❖ Drain or undrain links, switches, routers, services

Many operations require multiple steps and can take hours or days

("Drain": temporarily remove from service)
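
To see why a single operation can take hours, here is a hedged sketch of a link drain as an automated, multi-step sequence; every helper is hypothetical (injected as a callback), not Google's actual tooling:

```python
import time

MAX_METRIC = 2**24 - 1   # hypothetical "repel traffic" routing cost
IDLE_BPS = 1e6           # hypothetical threshold below which the link counts as idle

def drain_link(link, set_cost, carried_bps, disable, check_interval_s=60):
    """Sketch of a drain: take a link out of service without dropping traffic.

    set_cost / carried_bps / disable stand in for real device-access tooling.
    """
    set_cost(link, MAX_METRIC)            # 1. make the link unattractive to routing
    while carried_bps(link) > IDLE_BPS:   # 2. wait for traffic to shift elsewhere
        time.sleep(check_interval_s)
    disable(link)                         # 3. only now remove the link from service
```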

SLIDE 15

The Management Plane


Low-level abstractions for management operations

❖ Command-line interfaces to high-capacity routers

A small mistake by an operator can impact a large part of the network

SLIDE 16

Why is high network availability a challenge?

What are the characteristics of network availability failures?

❖ Duration, severity, prevalence ❖ Root-cause categorization

SLIDE 17

Key Takeaway


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 18

We analyzed over 100 post-mortem reports written over a 2-year period

SLIDE 19

What is a Post-mortem?

Carefully curated description of a previously unseen failure that had significant availability impact

Helps learn from failures

Blame-free process

SLIDE 20

What a Post-Mortem Contains


❖ Description of failure, with detailed timeline
❖ Root-cause(s) confirmed by reproducing the failure
❖ Discussion of fixes and follow-up action items

SLIDE 21

Failure Examples and Impact


Examples
❖ Entire control plane fails ❖ Upgrade causes backbone traffic shift ❖ Multiple top-of-rack switches fail

Impact
❖ Data center goes offline ❖ WAN capacity falls below demand ❖ Several services fail concurrently

SLIDE 22

Key Quantitative Results


❖ 70% of failures occur while a management plane operation is in progress ▸ Evolution impacts availability
❖ Failures are everywhere: all three networks and all three planes see comparable failure rates ▸ No silver bullet
❖ 80% of failure durations fall between 10 and 100 minutes ▸ Need fast recovery

SLIDE 23

Root causes


Lessons learned from root causes motivate availability design principles

SLIDE 24

Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
▸ Re-Think the Management Plane ▸ Avoid and Mitigate Large Failures ▸ Evolve or Die

SLIDE 25

Re-think the Management Plane

SLIDE 26

Availability Principle


Operator types the wrong CLI command or runs the wrong script ▸ a backbone router fails

Principle: Minimize Operator Intervention

SLIDE 27

Availability Principle


To upgrade part of a large device…

❖ A line card, or a block of a Clos fabric

… proceed while the rest of the device carries traffic

❖ Enables higher availability

Necessary for upgrade-in-place (sketched below)
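
One way to picture this is a rolling upgrade over the blocks of a Clos fabric, one block out of service at a time; a sketch under assumed helper callbacks, not the actual procedure:

```python
def upgrade_in_place(blocks, drain, upgrade, verify, undrain):
    """Sketch: upgrade one fabric block at a time while the rest carry traffic."""
    for block in blocks:
        drain(block)            # shift this block's traffic onto the remaining blocks
        upgrade(block)          # e.g., install new switch software on this block only
        if not verify(block):   # never return an unverified block to service
            raise RuntimeError(f"post-upgrade verification failed on {block!r}")
        undrain(block)          # restore the block before touching the next one
```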

SLIDE 28

Availability Principle


Ensure residual capacity > demand

Early risk assessments were manual: risky, with high packet loss when they were wrong

Principle: Assess risk continuously
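
The principle boils down to a guard that must hold before, and during, any capacity-removing operation; a minimal sketch, where the inputs and the headroom factor are illustrative assumptions:

```python
def safe_to_proceed(residual_capacity_gbps, forecast_demand_gbps, headroom=1.1):
    """Risk check: capacity left during the operation must cover forecast demand.

    headroom > 1 hedges against demand spikes and estimation error; 1.1 is an
    illustrative choice, not a Google number.
    """
    return residual_capacity_gbps >= headroom * forecast_demand_gbps

safe_to_proceed(800, 700)   # True: 800 >= 1.1 * 700
safe_to_proceed(750, 700)   # False: a drain now would be too risky
```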

SLIDE 29

Re-think the Management Plane

I want to upgrade this router

“Intent”

Management Plane Software

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

SLIDE 30

Re-think the Management Plane

Management Plane Run-time

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

Apply configuration ▸ Perform management operation ▸ Verify operation

Assess Risk Continuously ▸ Minimize Operator Intervention

Automated Risk Assessment
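
A hedged sketch of the pipeline on this slide: an operator's intent compiles into operations, device configurations, and verification tests, which the runtime then applies and checks. Every type, name, and device identifier here is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    operations: list = field(default_factory=list)  # ordered management operations
    configs: dict = field(default_factory=dict)     # device -> generated configuration
    tests: list = field(default_factory=list)       # checks that the intent was realized

def compile_intent(intent: str) -> Plan:
    """Hypothetical compiler from high-level intent to an executable plan."""
    if intent.startswith("upgrade router "):
        device = intent.rsplit(" ", 1)[-1]
        return Plan(
            operations=[f"drain {device}", f"upgrade {device}", f"undrain {device}"],
            configs={device: "<generated device config>"},
            tests=[f"{device} forwards traffic", f"{device} BGP sessions established"],
        )
    raise ValueError(f"unrecognized intent: {intent!r}")

plan = compile_intent("upgrade router br01")  # br01 is a made-up device name
```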

SLIDE 31

Avoid and Mitigate Large Failures

SLIDE 32

Availability Principle


B4 and data centers have a dedicated control-plane network

❖ Failure of this network can bring down the entire control plane

Principles: Fail open ▸ Contain the failure radius

SLIDE 33

Fail Open

(Diagram: the centralized control plane fails while the data center keeps carrying traffic)

Preserve the forwarding state of all switches
❖ Fail-open the entire data center

Exceedingly tricky!
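
In switch-local terms, fail-open means: on losing the controller, freeze the last programmed forwarding state rather than flushing it. A sketch; the switch interface is assumed, and the hard part is deliberately left as a comment:

```python
def on_controller_timeout(switch):
    """Hypothetical fail-open handler in a switch's local agent."""
    switch.freeze_forwarding_tables()  # keep forwarding with the last-known-good state
    switch.mark_state_stale()          # record that no fresh updates are arriving
    # The "exceedingly tricky" part: deciding when the frozen state has drifted
    # so far from the real topology that it can no longer be trusted.
```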

SLIDE 34

Availability Principle


A bug can cause state inconsistency between control plane components

➔ Capacity reduction in WAN or data center

Design fallback strategies

SLIDE 35

Design Fallback Strategies


A large section of the WAN (B4) fails, so demand exceeds capacity

SLIDE 36

Design Fallback Strategies


Fallback to B2!

Can shift large traffic volumes from many data centers

(Diagram: traffic moves from B4 onto B2)

SLIDE 37

Design Fallback Strategies


When centralized traffic engineering fails...

❖ … fall back to IP routing

Big Red Buttons

❖ For every new software upgrade, design controls so an operator can initiate fallback to a “safe” version
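
A Big Red Button is, at heart, a pre-built and pre-tested one-step revert to a known-safe version; a minimal sketch with hypothetical helpers:

```python
def big_red_button(component, safe_version, deploy, healthy):
    """Sketch of a one-step fallback to a known-safe software version.

    The essential property: this path is designed and exercised *before* the
    new version ships, so an operator can trigger it under pressure.
    """
    deploy(component, safe_version)   # the single, pre-tested action
    if not healthy(component):
        raise RuntimeError(f"fallback of {component!r} did not restore health")
```
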
SLIDE 38

Evolve or Die!

SLIDE 39

We cannot treat a change to the network as an exceptional event

SLIDE 40

Evolve or Die

Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation ❖ Permits small, verifiable changes

SLIDE 41

Conclusion


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 42

Evolve or Die

High-Availability Design Principles Drawn from Google’s Network Infrastructure

Presentation template from SlidesCarnival

SLIDE 43

Older Slides

SLIDE 44

Popular root-cause categories

Cabling error, interface card failure, cable cut, …

SLIDE 45

Popular root-cause categories

Operator types wrong CLI command, runs wrong script

SLIDE 46

Popular root-cause categories

Incorrect demand or capacity estimation for upgrade-in-place

SLIDE 47

Upgrade in place


SLIDE 48

Assessing Risk Correctly

Residual capacity? Varies by interconnect
Demand? Can change dynamically

SLIDE 49

Popular root-cause categories

Hardware or link layer failures in control plane network

SLIDE 50

Popular root-cause categories

Two control plane components have inconsistent views of control plane state, caused by a bug

SLIDE 51

Popular root-cause categories

Running out of memory, CPU, OS resources (threads)...

SLIDE 52

Lessons from Failures

The role of evolution in failures ▸ Rethink the Management Plane
The prevalence of large, severe failures ▸ Prevent and mitigate large failures
Long failure durations ▸ Recover fast

SLIDE 53

High-level Management Plane Abstractions

I want to upgrade this router

Why is this difficult? Modern high-capacity routers:

❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines

SLIDE 54

High-level Management Plane Abstractions

I want to upgrade this router

“Intent”

Management Plane Software

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

SLIDE 55

Management Plane Automation

Management Plane Software

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

Apply configuration ▸ Perform management operation ▸ Verify operation

Assess Risk Continuously ▸ Minimize Operator Intervention

SLIDE 56

Large Control Plane Failures

Centralized Control Plane

SLIDE 57

Contain the blast radius

(Diagram: the single centralized control plane is split into several smaller ones)

Smaller failure impact, but increased complexity

SLIDE 58

Fail-Open

Centralized Control Plane

Preserve forwarding state of all switches ❖ Fail-open the entire fabric

SLIDE 59

Defensive Control-Plane Design

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

One piece of this large update seems wrong!!

SLIDE 60

Trust but Verify

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

Let me check the correctness of the update...
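
"Trust but verify" means a component sanity-checks updates from its upstream peer before acting on them; a hedged sketch, where the 50% change threshold and the set-of-links representation are illustrative:

```python
def accept_topology_update(current_links, proposed_links, max_change_fraction=0.5):
    """Defensive check between control-plane components (sketch).

    A bug upstream should not be able to wipe out the topology in one shot:
    reject any single update that changes an implausibly large fraction of links.
    """
    changed = len(set(current_links) ^ set(proposed_links))
    if changed > max_change_fraction * max(len(current_links), 1):
        raise ValueError(f"suspicious update ({changed} links changed); keeping last good state")
    return proposed_links
```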

SLIDE 61

Fallback to B2

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

(Diagram: traffic falls back from B4 onto B2)

SLIDE 62

Mitigating Large Failures

Design Fallback Strategies
▸ B4 ➔ B2
▸ Tunneling ➔ IP routing
▸ Big Red Buttons

SLIDE 63

Continuously Monitor Invariants


❖ Must have one functional backup SDN controller
❖ Anycast route must have an AS path length of 3
❖ Data center must peer with two B2 routers
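
Invariants like these lend themselves to continuous, automated checking; a minimal monitor-loop sketch, where the predicates stand in for real telemetry queries:

```python
import time

def monitor_invariants(invariants, alert, period_s=60):
    """Continuously evaluate named invariants and alert on violations (sketch).

    `invariants` maps a description to a zero-argument predicate, e.g.:
        {"one functional backup SDN controller": lambda: backups_up() >= 1}
    where backups_up is a hypothetical telemetry query.
    """
    while True:
        for name, holds in invariants.items():
            if not holds():
                alert(f"invariant violated: {name}")
        time.sleep(period_s)
```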

SLIDE 64

This Alone isn’t Enough...

SLIDE 65


We cannot treat a change to the network as an exceptional event

SLIDE 66

Evolve or Die

Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation ❖ Permits small, verifiable changes

SLIDE 67

Key Takeaway


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 68

Evolve or Die

High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google

Presentation template from SlidesCarnival

SLIDE 69

Impact of Availability Failures

SLIDE 70

A Case Study: Google

Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?

SLIDE 71

The velocity of evolution is fueled by traffic growth...

SLIDE 72

… and by an increase in product and service offerings

SLIDE 73

Networks have very different designs

❖ Different hardware ❖ Different control planes ❖ Different forwarding paradigms

These differences increase management and evolution complexity

SLIDE 74

Data centers
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network

[SIGCOMM 2015]

SLIDE 75

B4

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering

[SIGCOMM 2013, SIGCOMM 2015]

SLIDE 76

B2

❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities

(Diagram: B2 peers with other ISPs)

SLIDE 77

The Management Plane

Low-level, per-device abstractions for management operations

SLIDE 78

Where do failures happen?

No network or plane dominates

SLIDE 79

How long do the failures last?

Durations much longer than outage budgets
Shorter failures on B2

SLIDE 80

What role does evolution play?

70% of failures happen when a management operation is in progress

SLIDE 81

Where do failures happen?

(Chart: failure counts by network and plane across B2, B4, the data centers, and the control plane network)

SLIDE 82

Failures are everywhere

SLIDE 83

Across networks

(Table: every root-cause category occurs in all three networks)

SLIDE 84

Across planes

(Table: root-cause categories span the data, control, and management planes)

SLIDE 85

Root-Cause Categorization

What are the root causes for these failures?

SLIDE 86

Rethink the Management Plane

Low-level network management cannot ensure high availability

SLIDE 87

Re-think the Management Plane

I want to upgrade this router

Lots of complexity hidden below this statement

❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines

SLIDE 88

Contain failure radius

(Diagram: the control plane is partitioned into several smaller centralized control planes)

Each partition managed by a different control plane
Even if one partition fails, others can carry traffic

Adds design complexity (sketched below)
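
Containing the radius amounts to sharding switches across independent controller domains, so one domain's failure strands only its own partition; a sketch, where round-robin assignment is an assumed policy:

```python
def partition_switches(switches, num_domains):
    """Sketch: statically shard switches across independent controller domains."""
    domains = [[] for _ in range(num_domains)]
    for i, switch in enumerate(sorted(switches)):
        domains[i % num_domains].append(switch)  # round-robin for an even spread
    return domains

# With 4 domains, losing one controller domain strands roughly 25% of switches
# instead of 100%, at the cost of coordinating state across domain boundaries.
```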

SLIDE 89

Key Takeaway


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 90

By learning from failures

SLIDE 91

What has Google Learnt from Failures?

Why is high network availability a challenge? ▸ Factors that impact availability
What are the characteristics of network failures? ▸ Severity, duration, prevalence ▸ Root-cause categorization
What design principles can achieve high availability? ▸ Lessons learned from root-causes

SLIDE 92

(Diagram: a global network of data centers)

In a global network

Failures are common
Configuration can change

These can impact network availability

SLIDE 93

How long does it take...

… to root-cause a failure? 10s of minutes to hours
… to upgrade part of the network? Hours to days

SLIDE 94

Outage budgets...

… for four 9s availability (99.99% uptime)? 4 minutes per month
… for five 9s availability (99.999% uptime)? 24 seconds per month

SLIDE 95

To move towards higher availability targets, it is important to learn from failures
