Chaos Engineering: Why the world needs more resilient systems



slide-1
SLIDE 1

Chaos Engineering: Why the world needs more resilient systems

@tammybutow

slide-2
SLIDE 2

Oh hai, nice to meet you!

@tammybutow tb@gremlin.com

Principal SRE @ Gremlin
Tech Advisory Board @ Greenpeace
Enjoys Skateboarding, Snowboarding, Metal, Punk & Breaking Things On Purpose.

slide-3
SLIDE 3

Our Gremlin Team Were Previously @

Dropbox, DigitalOcean, National Australia Bank, Queensland University of Technology, Netflix, Amazon, Salesforce, Google, PagerDuty, Datadog

slide-4
SLIDE 4

Why the world needs:

More Resilient Systems!

slide-5
SLIDE 5

What is a resilient system?

A resilient system is a highly available and durable system.
A resilient system can maintain an acceptable level of service in the face of failure.
A resilient system can weather the storm (a misconfiguration, a large scale natural disaster or controlled chaos engineering).

slide-6
SLIDE 6

Let’s review industry examples to understand why we need:

Resilient Systems

slide-7
SLIDE 7

Med Tech Industry:

Cardiac monitoring is now done via a Bluetooth device implanted in the body and a mobile app. The patient takes no action. Resilience of the device is the only thing the patient cares about.

slide-8
SLIDE 8
slide-9
SLIDE 9

Fin Tech Industry:

People are changing jobs, moving homes, traveling and more. Systems need to not only keep up but also provide value anytime/anywhere.

slide-10
SLIDE 10

A “technical issue related to some routine maintenance” impacted the purchase of over 2000 homes.

slide-11
SLIDE 11

Transport Tech Industry:

People are traveling so frequently for work and leisure. They need to be able to get where they need to go with no hassles.

slide-12
SLIDE 12
slide-13
SLIDE 13

Edu Tech Industry:

More remote learning than ever before. Many students learn remotely. They need reliable access to teachers, students and learning materials.

slide-14
SLIDE 14
slide-15
SLIDE 15

Enviro Tech Industry:

People need protection from bushfires, tsunamis, earthquakes and storms. Many of the warning systems for these disasters are unreliable legacy systems.

slide-16
SLIDE 16

Saturday, 7 February 2009 - Australia’s all-time worst bushfire disaster

slide-17
SLIDE 17

Saturday, 7 February 2009 - Australia’s all-time worst bushfire disasters

slide-18
SLIDE 18

Saturday, 7 February 2009 - Australia’s all-time worst bushfire disasters

slide-19
SLIDE 19

What do these systems have in common?

The primary concern of the user is resilience of the system, in particular high availability.

slide-20
SLIDE 20

Let’s figure out how to create:

A great future for everyone

slide-21
SLIDE 21

What does a great future look like?

slide-22
SLIDE 22

How do we create:

More Resilient Systems?

slide-23
SLIDE 23

Introducing:

Chaos Engineering

slide-24
SLIDE 24

What is Chaos Engineering?

slide-25
SLIDE 25

Chaos Engineering:

Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

slide-26
SLIDE 26

Inject something harmful in order to build immunity.

slide-27
SLIDE 27
slide-28
SLIDE 28

We can inject harm in hosts, containers, pods, applications and more.
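As one illustration of injecting harm at the application level, here is a minimal Python sketch that adds artificial latency to a function call. The `inject_latency` decorator and `fetch_profile` function are hypothetical names made up for this example; they are not part of Gremlin or any other chaos tool, and host, container and pod attacks work at a different layer.

```python
import functools
import random
import time


def inject_latency(delay_seconds, probability=1.0, rng=random.random):
    """Delay a fraction of calls to simulate a slow dependency.

    Hypothetical helper for illustration only.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Affecting only some calls keeps the blast radius small.
            if rng() < probability:
                time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_seconds=0.05, probability=1.0)
def fetch_profile(user_id):
    # Stand-in for a real call to a downstream service.
    return {"user_id": user_id, "name": "example"}


if __name__ == "__main__":
    start = time.monotonic()
    profile = fetch_profile(42)
    elapsed = time.monotonic() - start
    print(f"fetched {profile} in {elapsed:.3f}s")
```

Lowering `probability` limits how many calls are affected, which is one simple way to start with a small experiment.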

slide-29
SLIDE 29

What is a Chaos Engineer?

slide-30
SLIDE 30

Chaos Engineer:

A vaccine research computer scientist.

SREs / Production Engineers commonly practice Chaos Engineering.

slide-31
SLIDE 31

Chaos Engineer:

A vaccine research computer scientist.

slide-32
SLIDE 32

Chaos Engineer:

A vaccine research computer scientist.

http://www.cancerresearchuk.org/about-cancer/cancer-in-general/treatment/immunotherapy/types/vaccines-to-treat-cancer

slide-33
SLIDE 33

The Bad Database Vaccine

Bad DB Vaccine

What happens when the database is unreachable?
Does the database have reliable and trustworthy monitoring?
Does the database fail gracefully?
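The questions above can be turned into a small experiment. The following Python sketch is an assumption-laden illustration, not the talk's demo: `FakeDatabase` and `ProfileService` are hypothetical in-memory stand-ins, the failure is simulated with a flag, and the `db_errors` counter stands in for a real monitoring signal.

```python
class DatabaseUnreachable(Exception):
    """Raised by the stand-in client when the database cannot be reached."""


class FakeDatabase:
    """In-memory stand-in for a real database client (hypothetical)."""

    def __init__(self):
        self.reachable = True
        self.rows = {"user:1": "Ada"}

    def get(self, key):
        if not self.reachable:
            raise DatabaseUnreachable(key)
        return self.rows[key]


class ProfileService:
    def __init__(self, db):
        self.db = db
        self.cache = {}      # last known good values
        self.db_errors = 0   # counter a monitoring system could scrape

    def get_name(self, key):
        try:
            value = self.db.get(key)
            self.cache[key] = value
            return value, "fresh"
        except DatabaseUnreachable:
            self.db_errors += 1
            # Fail gracefully: serve stale data if we have it.
            if key in self.cache:
                return self.cache[key], "stale"
            return None, "unavailable"


def run_experiment():
    db = FakeDatabase()
    service = ProfileService(db)
    baseline = service.get_name("user:1")
    db.reachable = False                    # inject the failure
    during = service.get_name("user:1")
    uncached = service.get_name("user:2")
    db.reachable = True                     # always restore at the end
    after = service.get_name("user:1")
    return baseline, during, uncached, after, service.db_errors


if __name__ == "__main__":
    for step in run_experiment():
        print(step)
```

The experiment answers each question in miniature: the service keeps responding while the database is unreachable, the error counter shows that monitoring saw the failure, and uncached keys degrade to an explicit "unavailable" result instead of an unhandled exception.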

slide-34
SLIDE 34

Injecting Harm in DynamoDB

https://www.gremlin.com/community/tutorials/gremlin-gameday-breaking-dynamodb/

slide-35
SLIDE 35

What do you need before you can start doing:

Chaos Engineering

slide-36
SLIDE 36

Prerequisites for Chaos Engineering

slide-37
SLIDE 37

Prerequisites for Chaos Engineering

  • 1. High Severity Incident Management
  • 2. Monitoring
  • 3. Measure the Impact of Downtime

slide-38
SLIDE 38

Chaos Engineering Prerequisite #1:

High Severity Incident Management

slide-39
SLIDE 39

High Severity Incident Management:

The practice of recording, triaging, tracking, and assigning business value to problems that impact critical systems.

slide-40
SLIDE 40

gremlin.com/community

slide-41
SLIDE 41

What are SEVs?

slide-42
SLIDE 42

What are SEVs?

The term SEV is derived from “High Severity Incident”

slide-43
SLIDE 43

What are SEVs?

slide-44
SLIDE 44

How Do You Determine SEV levels?

slide-45
SLIDE 45

What is an example of SEV 0?

SEV Name: SEV 0 Runaway Cow (auto-generated code names help your team remember and refer to SEVs!)
SEV Description: Nintendo Switch eShop is down and not working
SEV Start Time: 08:40am Dec 25 2017 (Christmas Day)
What is the availability impact? 100%
What is the outage duration? 5 hours and 40 minutes

slide-46
SLIDE 46

What is an example of SEV 0?

slide-47
SLIDE 47

What is the SEV Lifecycle?

slide-48
SLIDE 48
slide-49
SLIDE 49

How To Run A GameDay gremlin.com/community

slide-50
SLIDE 50

How do you identify your critical systems?

slide-51
SLIDE 51

What are your critical tier 0 systems?

  • Traffic
  • Database
  • Storage

slide-52
SLIDE 52

Chaos Engineering Prerequisite #2:

Monitoring

slide-53
SLIDE 53

Why Do You Need:

Monitoring

slide-54
SLIDE 54

Why Monitor - The Google SRE Book

https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html

slide-55
SLIDE 55

How Should You Use Monitoring

slide-56
SLIDE 56

Critical Services Dashboard gremlin.com/community

slide-57
SLIDE 57

The Four Golden Signals - The Google SRE Book

https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html

slide-58
SLIDE 58

The Four Golden Signals - The Google SRE Book

https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html

Monitoring Signal: Description. Example.

  • Latency: The time it takes to service a request. Example: an HTTP 500 error triggered by loss of connection to a database may be served very quickly, so track the latency of failed requests separately from successful ones.
  • Traffic: A measure of how much demand is being placed on your system. Example: for a web service, this measurement is usually HTTP requests per second.
  • Errors: The rate of requests that fail, either explicitly, implicitly or by policy. Example: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests.
  • Saturation: How "full" your service is. Should also signal impending saturation. Example: it looks like your database will fill its hard drive in 4 hours.
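To make the four signals concrete, here is a rough Python sketch that derives them from a batch of request records. The `Request` type, the `golden_signals` function, the nearest-rank percentile and the use of disk usage as the saturation measure are all assumptions chosen for illustration; a real setup would read these values from a monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, disk_used_bytes, disk_total_bytes):
    """Compute the four golden signals for one time window (illustrative)."""
    # Latency: only successful requests, so fast failures do not skew it.
    ok = sorted(r.duration_ms for r in requests if r.status < 500)
    failed = [r for r in requests if r.status >= 500]

    p95 = None
    if ok:
        # Nearest-rank 95th percentile.
        p95 = ok[math.ceil(0.95 * len(ok)) - 1]

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "saturation": disk_used_bytes / disk_total_bytes,
    }


def sample_requests():
    # 19 successes taking 10..190 ms, plus one fast HTTP 500.
    requests = [Request(duration_ms=10.0 * i, status=200) for i in range(1, 20)]
    requests.append(Request(duration_ms=5.0, status=500))
    return requests


if __name__ == "__main__":
    signals = golden_signals(sample_requests(), window_seconds=10,
                             disk_used_bytes=80, disk_total_bytes=100)
    for name, value in signals.items():
        print(f"{name}: {value}")
```

Keeping the failed request out of the latency figure follows the point made in the Latency row: a quickly served HTTP 500 would otherwise make the service look faster than it is.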

slide-59
SLIDE 59

What Happens If You Do Chaos Engineering Without Monitoring?

slide-60
SLIDE 60

You won’t know what’s happening

slide-61
SLIDE 61

Chaos Engineering Prerequisite #3:

Measure The Impact Of Downtime

slide-62
SLIDE 62

Measure The Impact Of Downtime

We need to understand how SEV 0s impact our customers and business.

slide-63
SLIDE 63

Measure The Impact Of Downtime

System Impact:

  • Availability
  • Durability

Customer/Business Impact:

  • Outcome
  • Cost
  • Time
slide-64
SLIDE 64

What is the impact of the Nintendo Switch eShop SEV 0?

SEV Description: Nintendo Switch eShop is down and not working
What is the availability impact? 100%
Time? 5 hours and 40 minutes
Cost? ______
Outcome? Switch users all over the world can’t buy games
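The availability and time figures above can be converted into a monthly availability number. This Python sketch uses only the outage duration and 100% impact from the slide; the choice of a 31-day window (December) and the `availability_percent` helper are assumptions for illustration. The cost is left blank on the slide, so it is not estimated here.

```python
from datetime import timedelta


def availability_percent(downtime, window, impact_fraction=1.0):
    """Availability over a window, weighting downtime by the share of users hit."""
    lost_seconds = downtime.total_seconds() * impact_fraction
    return 100.0 * (1 - lost_seconds / window.total_seconds())


# From the slide: 5 hours 40 minutes at 100% availability impact.
OUTAGE = timedelta(hours=5, minutes=40)
# Assumption: measure against the 31 days of December.
WINDOW = timedelta(days=31)

if __name__ == "__main__":
    downtime_minutes = OUTAGE.total_seconds() / 60
    monthly = availability_percent(OUTAGE, WINDOW, impact_fraction=1.0)
    print(f"downtime: {downtime_minutes:.0f} minutes")
    print(f"monthly availability: {monthly:.3f}%")
```

Under these assumptions, a single outage of 340 minutes brings the month to roughly 99.24% availability.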

slide-65
SLIDE 65

Now we’re ready to get started with:

Chaos Engineering

slide-66
SLIDE 66

Chaos Engineering Use Case: Twilio

slide-67
SLIDE 67

Chaos Engineering Case Study: Twilio

Ratequeue Chaos has 3 goals:

  • 1. Pick a shard
  • 2. Kill primary
  • 3. Monitor recovery
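The three steps can be sketched as a small simulation. This is not Twilio's implementation: the `Shard` class, its failover rule (promote the first healthy replica) and the `run_chaos` function are hypothetical, in-memory stand-ins that only show the shape of the experiment.

```python
import random


class Shard:
    """In-memory model of one shard: a primary plus replicas (hypothetical)."""

    def __init__(self, name, replica_count=2):
        self.name = name
        # Maps node name -> healthy flag; node0 starts as primary.
        self.nodes = {f"{name}-node{i}": True for i in range(replica_count + 1)}
        self.primary = f"{name}-node0"

    def kill(self, node):
        self.nodes[node] = False

    def has_healthy_primary(self):
        return self.nodes[self.primary]

    def fail_over(self):
        """Promote the first healthy node, if there is one."""
        for node, healthy in self.nodes.items():
            if healthy:
                self.primary = node
                return


def run_chaos(shards, rng, max_checks=5):
    shard = rng.choice(shards)                  # 1. Pick a shard
    old_primary = shard.primary
    shard.kill(old_primary)                     # 2. Kill primary
    recovered = False
    checks = 0
    for checks in range(1, max_checks + 1):     # 3. Monitor recovery
        shard.fail_over()  # stands in for the system's own failover logic
        if shard.has_healthy_primary():
            recovered = True
            break
    return {
        "shard": shard.name,
        "old_primary": old_primary,
        "new_primary": shard.primary if recovered else None,
        "recovered": recovered,
        "checks": checks,
    }


if __name__ == "__main__":
    cluster = [Shard("a"), Shard("b"), Shard("c")]
    print(run_chaos(cluster, random.Random(7)))
```

In a real system, step 3 would watch dashboards and alerts rather than trigger the failover itself; the loop here only marks where that observation happens.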

slide-68
SLIDE 68

Share The Chaos Engineering Journey Widely

slide-69
SLIDE 69

Share The Chaos Engineering Journey Widely

  • Do a Chaos Engineering Kick Off @ All Hands
  • Send email updates & progress reports
  • Run Monthly Metrics Reviews
  • Deliver Presentations

slide-70
SLIDE 70

Don’t Surprise Everyone!

slide-71
SLIDE 71

What is Gremlin?

slide-72
SLIDE 72

What is Gremlin?

slide-73
SLIDE 73

Gremlin Chaos Engineering Attacks

There is a range of attacks built-in and ready to run on Linux.

Type of Attack: Attack (Gremlin Support, March 2018)

  • Resource: CPU ✅, Disk ✅, IO ✅, Memory ✅
  • State: Process Killer ✅, Shutdown ✅, Time Travel ✅
  • Network: Blackhole ✅, DNS ✅, Latency ✅, Packet Loss ✅

slide-74
SLIDE 74

Live Chaos Engineering

Demo

slide-75
SLIDE 75

Create a Kubernetes Cluster gremlin.com/community

slide-76
SLIDE 76

Create a Kubernetes Cluster

  • Master: 159.65.85.204
  • Node 1: 159.65.85.158
  • Node 2: 159.65.85.169
  • Node 3: 159.65.85.202

slide-77
SLIDE 77

Host Level Chaos Engineering With Kubernetes

slide-78
SLIDE 78

Create a Kubernetes DaemonSet For Gremlin

slide-79
SLIDE 79

Create a Kubernetes DaemonSet For Gremlin
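The slide's manifest is not included in this transcript. As a stand-in, here is a Python sketch that builds a generic DaemonSet manifest, the Kubernetes object that runs one copy of a pod on every node, which is why it suits host-level agents. The agent name `chaos-agent` and the image `registry.example.com/chaos-agent:latest` are placeholders, not Gremlin's actual image or configuration; a real Gremlin DaemonSet also needs credentials and extra settings that are not shown here.

```python
import json


def build_daemonset(name, image, namespace="default"):
    """Return a minimal apps/v1 DaemonSet manifest as a dict (illustrative)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The selector must match the pod template labels.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {"name": name, "image": image},
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, not a real agent.
    manifest = build_daemonset(
        "chaos-agent", "registry.example.com/chaos-agent:latest"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.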

slide-80
SLIDE 80

View Your Kubernetes Pods

slide-81
SLIDE 81

Run An Attack From The Gremlin Control Panel

slide-82
SLIDE 82

Monitor Your Chaos Engineering Attack

slide-83
SLIDE 83

Monitor Your Chaos Engineering Attack

slide-84
SLIDE 84

Notify Your Team

slide-85
SLIDE 85

Let’s Review:

The Path To Chaos Engineering

slide-86
SLIDE 86

The Path To Chaos Engineering

  • High Severity Incident Management
  • Monitoring
  • Measure the impact of downtime
  • Chaos Engineering
  • Make & Measure Improvements

slide-87
SLIDE 87

Blast Radius and Advanced Chaos

  • High Severity Incident Management
  • Monitoring
  • Measure the impact of downtime
  • Chaos Engineering
  • Make & Measure Improvements

slide-88
SLIDE 88

How do you Make Improvements?

slide-89
SLIDE 89

How do you make improvements?

  • 1. Build - Build a new system / improve existing
  • 2. Borrow - Use open source / contribute to open source
  • 3. Buy - Use 3rd party systems
  • 4. Brush up - GameDays / Team training
  • 5. Break - Chaos Engineering / Failure injection
  • 6. Begone - Decommission systems / delete code

slide-90
SLIDE 90

Always Measure Improvements

Tell a story of before and after with metrics.

slide-91
SLIDE 91

The world needs:

More Resilient Systems

slide-92
SLIDE 92

You can create:

More Resilient Systems!

slide-93
SLIDE 93

Join us on this journey! gremlin.com/community

gremlin.com/slack

slide-94
SLIDE 94

Thanks!

@tammybutow gremlin.com