SLIDE 1

Evolve or Die

High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google

SLIDE 2

Network availability is the biggest challenge facing large content and cloud providers today

SLIDE 3

Why?


At four 9s availability

❖ Outage budget is 4 mins per month

At five 9s availability

❖ Outage budget is 24 seconds per month

The push towards higher 9s of availability
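
A quick sanity check on those budgets, as a minimal Python sketch; the 30-day month is an assumption, which is why the computed numbers land near, not exactly on, the slide's figures:

```python
# Back-of-the-envelope outage budgets (assumes a 30-day month).
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds

for label, availability in [("four 9s", 0.9999), ("five 9s", 0.99999)]:
    budget_s = (1 - availability) * SECONDS_PER_MONTH
    print(f"{label}: {budget_s:.0f} s/month ({budget_s / 60:.1f} min)")

# four 9s: 259 s/month (4.3 min)
# five 9s: 26 s/month (0.4 min)
```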

SLIDE 4

How do providers achieve these levels?

By learning from failures

SLIDE 5

Paper's Focus: What has Google Learnt from Failures?

Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?

SLIDE 6

Why is high network availability a challenge?

❖ Velocity of Evolution ❖ Scale ❖ Management Complexity

SLIDE 7

Evolution

(Chart: data center fabric capacity over time: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter)

Network hardware evolves continuously

SLIDE 8

Evolution

(Timeline, 2006–2014: Google Global Cache, Watchtower, B4, Freedome, Jupiter, BwE, gRPC, QUIC, Andromeda)

So does network software

SLIDE 9

Evolution


New hardware and software can

❖ Introduce bugs ❖ Disrupt existing software

Result: Failures!

SLIDE 10

Scale and Complexity

(Diagram: Google's three networks: B2, B4, and the data center networks, peering with other ISPs)

SLIDE 11

Scale and Complexity


Design Differences

B4 and Data Centers: ❖ Use merchant silicon chips ❖ Centralized control planes

B2: ❖ Vendor gear ❖ Decentralized control plane

SLIDE 12

Scale and Complexity


These differences increase management complexity and pose availability challenges

SLIDE 13

The Management Plane

Management Plane Software

Manages network evolution

SLIDE 14

Management Plane Operations

❖ Connect a new data center to B2 and B4
❖ Upgrade B4 or data center control plane software
❖ Drain or undrain links, switches, routers, services

Many operations require multiple steps and can take hours or days

("Drain": temporarily remove from service)
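
To see why a single operation can take hours, here is a hedged sketch of a link drain as an automated, multi-step sequence; every helper is hypothetical (injected as a callback), not Google's actual tooling:

```python
import time

MAX_METRIC = 2**24 - 1   # hypothetical "repel traffic" routing cost
IDLE_BPS = 1e6           # hypothetical threshold below which the link counts as idle

def drain_link(link, set_cost, carried_bps, disable, check_interval_s=60):
    """Sketch of a drain: take a link out of service without dropping traffic.

    set_cost / carried_bps / disable stand in for real device-access tooling.
    """
    set_cost(link, MAX_METRIC)            # 1. make the link unattractive to routing
    while carried_bps(link) > IDLE_BPS:   # 2. wait for traffic to shift elsewhere
        time.sleep(check_interval_s)
    disable(link)                         # 3. only now remove the link from service
```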

SLIDE 15

The Management Plane


Low-level abstractions for management operations

❖ Command-line interfaces to high-capacity routers

A small mistake by an operator can impact a large part of the network

SLIDE 16

Why is high network availability a challenge?

What are the characteristics of network availability failures?

❖ Duration, severity, prevalence ❖ Root-cause categorization

SLIDE 17

Key Takeaway


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 18

We analyzed over 100 post-mortem reports written over a 2-year period

SLIDE 19

What is a Post-mortem?

Carefully curated description of a previously unseen failure that had significant availability impact

Helps learn from failures

Blame-free process

SLIDE 20

What a Post-Mortem Contains


❖ Description of failure, with detailed timeline
❖ Root-cause(s) confirmed by reproducing the failure
❖ Discussion of fixes and follow-up action items

SLIDE 21

Failure Examples and Impact


Examples
❖ Entire control plane fails ❖ Upgrade causes backbone traffic shift ❖ Multiple top-of-rack switches fail

Impact
❖ Data center goes offline ❖ WAN capacity falls below demand ❖ Several services fail concurrently

SLIDE 22

Key Quantitative Results


❖ 70% of failures occur while a management plane operation is in progress ▸ Evolution impacts availability
❖ Failures are everywhere: all three networks and all three planes see comparable failure rates ▸ No silver bullet
❖ 80% of failure durations fall between 10 and 100 minutes ▸ Need fast recovery

SLIDE 23

Root causes


Lessons learned from root causes motivate availability design principles

SLIDE 24

Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
▸ Re-Think the Management Plane ▸ Avoid and Mitigate Large Failures ▸ Evolve or Die

SLIDE 25

Re-think the Management Plane

SLIDE 26

Availability Principle


Operator types the wrong CLI command or runs the wrong script ▸ a backbone router fails

Principle: Minimize Operator Intervention

SLIDE 27

Availability Principle


To upgrade part of a large device…

❖ A line card, or a block of a Clos fabric

… proceed while the rest of the device carries traffic

❖ Enables higher availability

Necessary for upgrade-in-place (sketched below)
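
One way to picture this is a rolling upgrade over the blocks of a Clos fabric, one block out of service at a time; a sketch under assumed helper callbacks, not the actual procedure:

```python
def upgrade_in_place(blocks, drain, upgrade, verify, undrain):
    """Sketch: upgrade one fabric block at a time while the rest carry traffic."""
    for block in blocks:
        drain(block)            # shift this block's traffic onto the remaining blocks
        upgrade(block)          # e.g., install new switch software on this block only
        if not verify(block):   # never return an unverified block to service
            raise RuntimeError(f"post-upgrade verification failed on {block!r}")
        undrain(block)          # restore the block before touching the next one
```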

SLIDE 28

Availability Principle


Ensure residual capacity > demand

Early risk assessments were manual: risky, with high packet loss when they were wrong

Principle: Assess risk continuously
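
The principle boils down to a guard that must hold before, and during, any capacity-removing operation; a minimal sketch, where the inputs and the headroom factor are illustrative assumptions:

```python
def safe_to_proceed(residual_capacity_gbps, forecast_demand_gbps, headroom=1.1):
    """Risk check: capacity left during the operation must cover forecast demand.

    headroom > 1 hedges against demand spikes and estimation error; 1.1 is an
    illustrative choice, not a Google number.
    """
    return residual_capacity_gbps >= headroom * forecast_demand_gbps

safe_to_proceed(800, 700)   # True: 800 >= 1.1 * 700
safe_to_proceed(750, 700)   # False: a drain now would be too risky
```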

SLIDE 29

Re-think the Management Plane

I want to upgrade this router

“Intent”

Management Plane Software

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

SLIDE 30

Re-think the Management Plane

Management Plane Run-time

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

Apply configuration ▸ Perform management operation ▸ Verify operation

Assess Risk Continuously ▸ Minimize Operator Intervention

Automated Risk Assessment
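
A hedged sketch of the pipeline on this slide: an operator's intent compiles into operations, device configurations, and verification tests, which the runtime then applies and checks. Every type, name, and device identifier here is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    operations: list = field(default_factory=list)  # ordered management operations
    configs: dict = field(default_factory=dict)     # device -> generated configuration
    tests: list = field(default_factory=list)       # checks that the intent was realized

def compile_intent(intent: str) -> Plan:
    """Hypothetical compiler from high-level intent to an executable plan."""
    if intent.startswith("upgrade router "):
        device = intent.rsplit(" ", 1)[-1]
        return Plan(
            operations=[f"drain {device}", f"upgrade {device}", f"undrain {device}"],
            configs={device: "<generated device config>"},
            tests=[f"{device} forwards traffic", f"{device} BGP sessions established"],
        )
    raise ValueError(f"unrecognized intent: {intent!r}")

plan = compile_intent("upgrade router br01")  # br01 is a made-up device name
```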

SLIDE 31

Avoid and Mitigate Large Failures

SLIDE 32

Availability Principle


B4 and data centers have a dedicated control-plane network

❖ Failure of this network can bring down the entire control plane

Principles: Fail open ▸ Contain the failure radius

SLIDE 33

Fail Open

(Diagram: the centralized control plane fails while the data center keeps carrying traffic)

Preserve the forwarding state of all switches
❖ Fail-open the entire data center

Exceedingly tricky!
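
In switch-local terms, fail-open means: on losing the controller, freeze the last programmed forwarding state rather than flushing it. A sketch; the switch interface is assumed, and the hard part is deliberately left as a comment:

```python
def on_controller_timeout(switch):
    """Hypothetical fail-open handler in a switch's local agent."""
    switch.freeze_forwarding_tables()  # keep forwarding with the last-known-good state
    switch.mark_state_stale()          # record that no fresh updates are arriving
    # The "exceedingly tricky" part: deciding when the frozen state has drifted
    # so far from the real topology that it can no longer be trusted.
```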

SLIDE 34

Availability Principle


A bug can cause state inconsistency between control plane components

➔ Capacity reduction in WAN or data center

Design fallback strategies

SLIDE 35

Design Fallback Strategies


A large section of the WAN (B4) fails, so demand exceeds capacity

SLIDE 36

Design Fallback Strategies


Fallback to B2!

Can shift large traffic volumes from many data centers

(Diagram: traffic moves from B4 onto B2)

SLIDE 37

Design Fallback Strategies


When centralized traffic engineering fails...

❖ … fall back to IP routing

Big Red Buttons

❖ For every new software upgrade, design controls so an operator can initiate fallback to a “safe” version
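
A Big Red Button is, at heart, a pre-built and pre-tested one-step revert to a known-safe version; a minimal sketch with hypothetical helpers:

```python
def big_red_button(component, safe_version, deploy, healthy):
    """Sketch of a one-step fallback to a known-safe software version.

    The essential property: this path is designed and exercised *before* the
    new version ships, so an operator can trigger it under pressure.
    """
    deploy(component, safe_version)   # the single, pre-tested action
    if not healthy(component):
        raise RuntimeError(f"fallback of {component!r} did not restore health")
```
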
SLIDE 38

Evolve or Die!

SLIDE 39

We cannot treat a change to the network as an exceptional event

SLIDE 40

Evolve or Die

Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation ❖ Permits small, verifiable changes

SLIDE 41

Conclusion


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 42

Evolve or Die

High-Availability Design Principles Drawn from Google’s Network Infrastructure

Presentation template from SlidesCarnival

SLIDE 43

Older Slides

SLIDE 44

Popular root-cause categories

Cabling error, interface card failure, cable cut, …

SLIDE 45

Popular root-cause categories

Operator types wrong CLI command, runs wrong script

SLIDE 46

Popular root-cause categories

Incorrect demand or capacity estimation for upgrade-in-place

SLIDE 47

Upgrade in place


SLIDE 48

Assessing Risk Correctly

Residual capacity? Varies by interconnect
Demand? Can change dynamically

SLIDE 49

Popular root-cause categories

Hardware or link layer failures in control plane network

SLIDE 50

Popular root-cause categories

Two control plane components have inconsistent views of control plane state, caused by a bug

SLIDE 51

Popular root-cause categories

Running out of memory, CPU, OS resources (threads)...

SLIDE 52

Lessons from Failures

The role of evolution in failures ▸ Rethink the Management Plane
The prevalence of large, severe failures ▸ Prevent and mitigate large failures
Long failure durations ▸ Recover fast

SLIDE 53

High-level Management Plane Abstractions

I want to upgrade this router

Why is this difficult? Modern high-capacity routers:

❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines

SLIDE 54

High-level Management Plane Abstractions

I want to upgrade this router

“Intent”

Management Plane Software

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

SLIDE 55

Management Plane Automation

Management Plane Software

❖ Management Operations ❖ Device Configurations ❖ Tests to Verify Operation

Apply configuration ▸ Perform management operation ▸ Verify operation

Assess Risk Continuously ▸ Minimize Operator Intervention

SLIDE 56

Large Control Plane Failures

Centralized Control Plane

SLIDE 57

Contain the blast radius

(Diagram: the single centralized control plane is split into several smaller ones)

Smaller failure impact, but increased complexity

SLIDE 58

Fail-Open

Centralized Control Plane

Preserve forwarding state of all switches ❖ Fail-open the entire fabric

SLIDE 59

Defensive Control-Plane Design

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

One piece of this large update seems wrong!!

SLIDE 60

Trust but Verify

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

Let me check the correctness of the update...
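
"Trust but verify" means a component sanity-checks updates from its upstream peer before acting on them; a hedged sketch, where the 50% change threshold and the set-of-links representation are illustrative:

```python
def accept_topology_update(current_links, proposed_links, max_change_fraction=0.5):
    """Defensive check between control-plane components (sketch).

    A bug upstream should not be able to wipe out the topology in one shot:
    reject any single update that changes an implausibly large fraction of links.
    """
    changed = len(set(current_links) ^ set(proposed_links))
    if changed > max_change_fraction * max(len(current_links), 1):
        raise ValueError(f"suspicious update ({changed} links changed); keeping last good state")
    return proposed_links
```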

SLIDE 61

Fallback to B2

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

(Diagram: traffic falls back from B4 onto B2)

SLIDE 62

Mitigating Large Failures

Design Fallback Strategies
▸ B4 ➔ B2
▸ Tunneling ➔ IP routing
▸ Big Red Buttons

SLIDE 63

Continuously Monitor Invariants


❖ Must have one functional backup SDN controller
❖ Anycast route must have an AS path length of 3
❖ Data center must peer with two B2 routers
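
Invariants like these lend themselves to continuous, automated checking; a minimal monitor-loop sketch, where the predicates stand in for real telemetry queries:

```python
import time

def monitor_invariants(invariants, alert, period_s=60):
    """Continuously evaluate named invariants and alert on violations (sketch).

    `invariants` maps a description to a zero-argument predicate, e.g.:
        {"one functional backup SDN controller": lambda: backups_up() >= 1}
    where backups_up is a hypothetical telemetry query.
    """
    while True:
        for name, holds in invariants.items():
            if not holds():
                alert(f"invariant violated: {name}")
        time.sleep(period_s)
```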

SLIDE 64

This Alone isn’t Enough...

SLIDE 65


We cannot treat a change to the network as an exceptional event

SLIDE 66

Evolve or Die

Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation ❖ Permits small, verifiable changes

SLIDE 67

Key Takeaway


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 68

Evolve or Die

High-Availability Design Principles Drawn from Google’s Network Infrastructure

Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat … and a cast of hundreds at Google

Presentation template from SlidesCarnival

SLIDE 69

Impact of Availability Failures

SLIDE 70

A Case Study: Google

Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?

SLIDE 71

The velocity of evolution is fueled by traffic growth...

SLIDE 72

… and by an increase in product and service offerings

SLIDE 73

Networks have very different designs

❖ Different hardware ❖ Different control planes ❖ Different forwarding paradigms

These differences increase management and evolution complexity

SLIDE 74

Data centers
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network

[SIGCOMM 2015]

SLIDE 75

B4

(B4 control plane components: Gateway, Topology Modeler, TE Server, BwE)

❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering

[SIGCOMM 2013, SIGCOMM 2015]

SLIDE 76

B2

❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities

(Diagram: B2 peers with other ISPs)

SLIDE 77

The Management Plane

Low-level, per-device abstractions for management operations

SLIDE 78

Where do failures happen?

No network or plane dominates

SLIDE 79

How long do the failures last?

Durations much longer than outage budgets
Shorter failures on B2

SLIDE 80

What role does evolution play?

70% of failures happen when a management operation is in progress

SLIDE 81

Where do failures happen?

(Chart: failure counts by network and plane across B2, B4, the data centers, and the control plane network)

SLIDE 82

Failures are everywhere

SLIDE 83

Across networks

(Table: every root-cause category occurs in all three networks)

SLIDE 84

Across planes

(Table: root-cause categories span the data, control, and management planes)

SLIDE 85

Root-Cause Categorization

What are the root causes for these failures?

SLIDE 86

Rethink the Management Plane

Low-level network management cannot ensure high availability

SLIDE 87

Re-think the Management Plane

I want to upgrade this router

Lots of complexity hidden below this statement

❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into hundreds of thousands of lines

SLIDE 88

Contain failure radius

(Diagram: the control plane is partitioned into several smaller centralized control planes)

Each partition managed by a different control plane
Even if one partition fails, others can carry traffic

Adds design complexity (sketched below)
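
Containing the radius amounts to sharding switches across independent controller domains, so one domain's failure strands only its own partition; a sketch, where round-robin assignment is an assumed policy:

```python
def partition_switches(switches, num_domains):
    """Sketch: statically shard switches across independent controller domains."""
    domains = [[] for _ in range(num_domains)]
    for i, switch in enumerate(sorted(switches)):
        domains[i % num_domains].append(switch)  # round-robin for an even spread
    return domains

# With 4 domains, losing one controller domain strands roughly 25% of switches
# instead of 100%, at the cost of coordinating state across domain boundaries.
```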

SLIDE 89

Key Takeaway


Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily

SLIDE 90

By learning from failures

SLIDE 91

What has Google Learnt from Failures?

Why is high network availability a challenge? ▸ Factors that impact availability
What are the characteristics of network failures? ▸ Severity, duration, prevalence ▸ Root-cause categorization
What design principles can achieve high availability? ▸ Lessons learned from root-causes

SLIDE 92

(Diagram: a global network of data centers)

In a global network

Failures are common
Configuration can change

These can impact network availability

SLIDE 93

How long does it take...

… to root-cause a failure? 10s of minutes to hours
… to upgrade part of the network? Hours to days

SLIDE 94

Outage budgets...

… for four 9s availability (99.99% uptime)? 4 minutes per month
… for five 9s availability (99.999% uptime)? 24 seconds per month

SLIDE 95

To move towards higher availability targets, it is important to learn from failures
