USING CHAOS TO BUILD RESILIENT SYSTEMS
@tammybütow, Gremlin



slide-1
SLIDE 1

USING CHAOS TO BUILD RESILIENT SYSTEMS

@tammybütow, Gremlin

slide-2
SLIDE 2

What’s the scale of your infra?

@tammybütow #QCONNYC

slide-3
SLIDE 3

How many services do you have running in production?

slide-4
SLIDE 4

How many engineers do you have at your company?

slide-5
SLIDE 5

A Common Chaos Engineering Journey

🚳 🚘 🏏

slide-6
SLIDE 6

TOP 5 MOST POPULAR WAYS TO USE CHAOS ENGINEERING IN 2018

slide-7
SLIDE 7

🚣

ADVANCED USES OF CHAOS ENGINEERING

🚣

slide-8
SLIDE 8

What happened this week: June 2018 Slack Outage

slide-9
SLIDE 9

slide-10
SLIDE 10

TAMMY BÜTOW

Principal SRE, Gremlin
Causing chaos in prod since 2009.
Previously SRE Manager @ Dropbox, leading Databases, Block Storage and Code Workflows for 500 million users and 800 engineers.
@tammybütow

slide-11
SLIDE 11

GREMLIN

  • We are practitioners of Chaos Engineering.
  • We build software that helps engineers build resilient systems in a safe, secure and simple way.
  • We offer 11 ways to inject chaos for your Chaos Engineering experiments (e.g. host/container packet loss and shutdown).

slide-12
SLIDE 12

PART 1: LAYING THE FOUNDATION

slide-13
SLIDE 13

Let’s Define A Resilient System:

  • A resilient system is a highly available and durable system.
  • A resilient system can maintain an acceptable level of service in the face of failure.
  • A resilient system can weather the storm (a misconfiguration, a large-scale natural disaster or controlled chaos engineering).

slide-14
SLIDE 14

It would be silly to give an Olympic pole-vaulter a broom and ban them from practicing!

slide-15
SLIDE 15

“Thoughtful planned experiments designed to reveal the weaknesses in our systems”

- Kolton Andrus, Gremlin CEO

slide-16
SLIDE 16

Think of it like a vaccination:

Inject something harmful in order to build an immunity.

slide-17
SLIDE 17

Eventually systems will break in many undesired ways. Break them first on purpose with controlled chaos! 💦

slide-18
SLIDE 18

DOGFOODING

  • Using your own product.
  • For us that means using Gremlin for our Chaos Engineering experiments.
  • Failure Fridays

🐷

slide-19
SLIDE 19

Failure Fridays are dedicated time for teams to collaboratively focus on using Chaos Engineering practices to reveal weaknesses in your services.

slide-20
SLIDE 20

WHY DO DISTRIBUTED SYSTEMS NEED CHAOS?

  • Unusual, hard-to-debug failures are common.
  • Systems & companies scale rapidly, and Chaos Engineering helps you learn along the way.

🚁

slide-21
SLIDE 21

FULL-STACK CHAOS ENGINEERING

  • You can inject chaos at any layer.
  • API, App, Cache, Database, OS, Host, Network, Power & more.

💼

slide-22
SLIDE 22

WHY RUN CHAOS ENGINEERING EXPERIMENTS?

slide-23
SLIDE 23

Are you confident that your metrics and alerting are as good as they should be? #pagerpain 📠

slide-24
SLIDE 24

Are you confident your customers are getting as good an experience as they should be? #customerpain 😟

slide-25
SLIDE 25

Are you losing money due to downtime and broken features? #businesspain 💹

slide-26
SLIDE 26

HOW DO YOU RUN CHAOS ENGINEERING EXPERIMENTS?

slide-27
SLIDE 27

HOW TO RUN A CHAOS ENGINEERING EXPERIMENT

  • Form a hypothesis
  • Consider blast radius
  • Run experiment
  • Measure results
  • Find & fix issues, or scale up the experiment
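
The steps above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not Gremlin's API: `Experiment`, `run_experiment` and all of the callables are hypothetical names introduced here. The one design point it tries to show is that the chaos is always halted, even when the injection or the measurement fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Experiment:
    # All field names are hypothetical; they mirror the steps on the slide.
    hypothesis: str                              # what we expect to stay true
    blast_radius: str                            # what is allowed to be affected
    inject: Callable[[], None]                   # starts the failure
    halt: Callable[[], None]                     # stops the failure
    measure: Callable[[], Dict[str, float]]      # collects metrics during the run
    passes: Callable[[Dict[str, float]], bool]   # did the hypothesis hold?


def run_experiment(exp: Experiment) -> Dict[str, object]:
    """Run one experiment and report whether the hypothesis held."""
    try:
        exp.inject()
        metrics = exp.measure()
    finally:
        exp.halt()  # stop the chaos no matter what happened above
    return {
        "hypothesis": exp.hypothesis,
        "blast_radius": exp.blast_radius,
        "metrics": metrics,
        "hypothesis_held": exp.passes(metrics),
    }
```

If `hypothesis_held` comes back false, that is the "find & fix issues" branch; if it is true, the same experiment can be repeated with a larger blast radius.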

slide-28
SLIDE 28

Don’t run before you can walk

slide-29
SLIDE 29

The 3 Prerequisites for Chaos Engineering

1. Monitoring & Observability
2. On-Call & Incident Management
3. Know Your Cost of Downtime Per Hour

slide-30
SLIDE 30

What Do I Use For Monitoring & Observability?

slide-31
SLIDE 31

We All Need To Know The Cost Of Downtime

slide-32
SLIDE 32

We All Need Incident Management

slide-33
SLIDE 33

HOW TO CHOOSE A CHAOS EXPERIMENT

  • Identify top 5 critical systems
  • Choose 1 system
  • Whiteboard the system
  • Select attack: resource/state/network
  • Determine scope

slide-34
SLIDE 34

WHAT SHOULD WE MEASURE?

  • Availability — 500s
  • Service specific KPIs
  • System metrics: CPU, IO, Disk
  • Customer complaints
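
As an illustration of the first measurement, availability can be estimated from the share of requests that ended in a 5xx response. This is a minimal sketch; the function name and the idea of feeding it a plain list of status codes are assumptions made for the example, not something the slides prescribe.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not end in a 5xx server error."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed")
    server_errors = sum(1 for code in codes if 500 <= code <= 599)
    return 1.0 - server_errors / len(codes)
```

Comparing this number before, during and after an experiment shows whether the injected failure was visible to users.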

📉

slide-35
SLIDE 35

HOW TO RUN YOUR OWN GAMEDAY!

gremlin.com/gameday

slide-36
SLIDE 36

HOW TO RUN YOUR OWN GAMEDAY!

gremlin.com/gameday

slide-37
SLIDE 37

EXAMPLE SYSTEM: KUBERNETES RETAIL STORE

User
Primary: kube-01
Node: kube-02
Node: kube-03
Node: kube-04

slide-38
SLIDE 38

PART 2: RESOURCE CHAOS ENGINEERING

slide-39
SLIDE 39

RESOURCE CHAOS

We can increase CPU, disk, IO and memory consumption to check that monitoring is set up to catch problems. It is important to catch issues before they turn into high-severity incidents (for example, customers unable to purchase a new product) and downtime for customers.
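
A CPU attack can be approximated with a bounded busy loop. This is a minimal sketch and not the script used in the talk; `burn_cpu` is a hypothetical helper, and it deliberately loads only one core for a fixed time so the blast radius stays small.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; returns the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

While it runs, the process should show up with high CPU usage in a tool such as `top`, which is a quick way to confirm that monitoring sees the load. Because the loop is time-boxed, the chaos ends on its own.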

slide-40
SLIDE 40

CPU CHAOS

slide-41
SLIDE 41

LET’S CREATE A “KNOWN-KNOWN” EXPERIMENT

https://github.com/tammybutow/chaosengineeringbootcamp

slide-42
SLIDE 42

CHAOS IN TOP

slide-43
SLIDE 43

LET’S KILL THE CHAOS NOW

slide-44
SLIDE 44

NO MORE CHAOS IN TOP

slide-45
SLIDE 45

DISK CHAOS

slide-46
SLIDE 46

DISK CHAOS

💦
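
One way to practice disk chaos safely is to write a file of known size and watch the usage figure move. This is a minimal sketch under stated assumptions: `fill_disk` and `percent_used` are hypothetical helpers, and the size is capped by the caller so the disk is never filled completely by accident.

```python
import os
import shutil
import tempfile


def fill_disk(directory: str, megabytes: int) -> str:
    """Write `megabytes` MiB of zeros into a new file and return its path."""
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(prefix="disk-chaos-", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        for _ in range(megabytes):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data really reaches the disk
    return path


def percent_used(directory: str) -> float:
    """Disk usage of the filesystem that holds `directory`, in percent."""
    usage = shutil.disk_usage(directory)
    return 100.0 * usage.used / usage.total
```

Deleting the returned file with `os.remove(path)` ends the experiment; the interesting question is whether a disk alert fired before that point.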

slide-47
SLIDE 47

MEMORY CHAOS

slide-48
SLIDE 48

MEMORY CHAOS

💦

free -m
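
Memory pressure can be produced by allocating a block and holding on to it while watching `free -m`. This is a minimal sketch; `hold_memory` is a hypothetical helper, and the 4096-byte step is an assumed page size used only to touch each page so the allocation is actually backed by RAM.

```python
def hold_memory(megabytes: int) -> bytearray:
    """Allocate `megabytes` MiB and touch every page so it becomes resident."""
    block = bytearray(megabytes * 1024 * 1024)
    for offset in range(0, len(block), 4096):  # 4096 is an assumed page size
        block[offset] = 1
    return block
```

The memory stays in use for as long as the caller keeps the returned object; dropping the reference (for example with `del block`) releases it again.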

slide-49
SLIDE 49

PART 3: STATE CHAOS ENGINEERING

slide-50
SLIDE 50

PROCESS CHAOS

slide-51
SLIDE 51

PROCESS CHAOS

Ways to create process chaos on purpose:

  • Kill one process
  • Kill a process in a loop
  • Spawn new processes
  • Fork bomb
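
The first two items in the list can be tried without touching a real service by killing a harmless child process. This is a minimal sketch: the "victim" is just a Python process that sleeps, and `spawn_victim`, `kill_once` and `loop_kill` are hypothetical names. In a real experiment the target would be a supervised service, and the point would be to check that it gets restarted. The fork bomb is intentionally left out.

```python
import subprocess
import sys
import time


def spawn_victim() -> subprocess.Popen:
    """Start a harmless child process that only sleeps (a stand-in for a service)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_once(proc: subprocess.Popen) -> int:
    """Kill one process and return its exit code."""
    proc.kill()
    return proc.wait(timeout=5)


def loop_kill(rounds: int, pause: float = 0.1) -> int:
    """Start and kill the victim `rounds` times; returns how many were killed."""
    killed = 0
    for _ in range(rounds):
        proc = spawn_victim()
        time.sleep(pause)  # give the process a moment to start
        kill_once(proc)
        killed += 1
    return killed
```

A non-zero exit code from `kill_once` confirms that the process did not exit cleanly, which is what a crash looks like to a supervisor.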

slide-52
SLIDE 52

PROCESS CHAOS

💦

pkill -u chaos

slide-53
SLIDE 53

SHUTDOWN CHAOS

slide-54
SLIDE 54

SHUTDOWN CHAOS

💦

shutdown -h

slide-55
SLIDE 55

WHAT ARE OTHER WAYS YOU CAN TURN OFF A SERVER? WHAT IF YOU WANT TO TURN OFF EVERY SERVER WHEN IT’S ONE WEEK OLD?
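
For the second question, the selection step can be sketched separately from the shutdown itself. This is a minimal illustration under stated assumptions: the inventory is a plain dictionary of server name to launch time, and `servers_to_retire` is a hypothetical helper. How the launch times are obtained, and how each server is actually turned off, depends on the platform and is not shown.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def servers_to_retire(launch_times: Dict[str, datetime],
                      now: datetime,
                      max_age: timedelta = timedelta(days=7)) -> List[str]:
    """Return the servers whose age is at least `max_age`, oldest first."""
    expired = [(launched, name) for name, launched in launch_times.items()
               if now - launched >= max_age]
    return [name for launched, name in sorted(expired)]
```

Keeping the selection as a pure function makes it easy to review which servers would be affected before anything is shut down.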

slide-56
SLIDE 56

HALT, REBOOT & POWEROFF CHAOS

💦

halt

slide-57
SLIDE 57

WHAT ABOUT SHUTTING DOWN CONTAINERS AND K8S PODS?

slide-58
SLIDE 58

THE MANY WAYS TO KILL CONTAINERS

  • Kill self
  • Kill a container from the host
  • Use one container to kill another
  • Use one container to kill several containers
  • Use several containers to kill several

slide-59
SLIDE 59

The average lifespan of a container is 2.5 days, and they fail in many unexpected ways.

slide-60
SLIDE 60

TIME TRAVEL CHAOS

slide-61
SLIDE 61

TIME TRAVEL CHAOS AKA CLOCK SKEW

💦

ntpq
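
Clock skew is easiest to reason about when the code under test takes its clock as a parameter. This is a minimal sketch, not part of the talk: a token expiry check stands in for any time-sensitive logic, and `is_token_valid`, `skewed` and the 30-second tolerance are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Callable


def is_token_valid(issued_at: datetime,
                   ttl: timedelta,
                   clock: Callable[[], datetime],
                   tolerance: timedelta = timedelta(seconds=30)) -> bool:
    """Accept a token if the local clock is within its lifetime, plus tolerance."""
    now = clock()
    not_before = issued_at - tolerance
    not_after = issued_at + ttl + tolerance
    return not_before <= now <= not_after


def skewed(clock: Callable[[], datetime], skew: timedelta) -> Callable[[], datetime]:
    """Wrap a clock so that it reports a time shifted by `skew`."""
    return lambda: clock() + skew
```

Running the same check with `skewed(clock, ...)` shows how much drift the logic tolerates before it starts rejecting valid requests.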

slide-62
SLIDE 62

PART 4: NETWORK CHAOS ENGINEERING

slide-63
SLIDE 63

BLACKHOLE CHAOS

slide-64
SLIDE 64

BLACKHOLE CHAOS

💦

ip route show
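
From the caller's point of view, a blackholed dependency accepts traffic and never answers. The sketch below imitates that locally with a socket that listens but never accepts or replies; it is a stand-in for illustration, not a route-level blackhole, and `open_blackhole` and `call_dependency` are hypothetical names. What it tests is whether the client gives up after a timeout instead of hanging.

```python
import socket


def open_blackhole() -> socket.socket:
    """Listen on a local port but never accept or answer anything."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    return server


def call_dependency(address, timeout: float) -> str:
    """Return 'ok' on a reply, 'timeout' if the dependency stays silent."""
    try:
        with socket.create_connection(address, timeout=timeout) as conn:
            conn.sendall(b"ping")
            return "ok" if conn.recv(16) else "closed"
    except socket.timeout:
        return "timeout"
```

A client without a timeout would block forever in this situation, which is the kind of weakness a blackhole experiment is meant to reveal.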

slide-65
SLIDE 65

DNS CHAOS

slide-66
SLIDE 66

DNS CHAOS

💦

slide-67
SLIDE 67

DNS CHAOS

💦
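
A DNS outage can be rehearsed in code by swapping the resolver for one that always fails. This is a minimal sketch under stated assumptions: `resolve_with_fallback` and `broken_dns` are hypothetical helpers, and falling back to the last known address is just one possible mitigation, chosen here for illustration.

```python
import socket
from typing import Callable, Dict


def resolve_with_fallback(hostname: str,
                          resolver: Callable[[str], str],
                          cache: Dict[str, str]) -> str:
    """Resolve `hostname`, falling back to the last known address if DNS fails."""
    try:
        address = resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        if hostname in cache:
            return cache[hostname]
        raise
    cache[hostname] = address
    return address


def broken_dns(hostname: str) -> str:
    """Stand-in for a DNS outage: every lookup fails."""
    raise socket.gaierror("simulated DNS outage")
```

In normal operation the resolver argument could be `socket.gethostbyname`; during the experiment it is replaced by `broken_dns` to see which hosts the service can still reach.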

slide-68
SLIDE 68

LATENCY CHAOS

slide-69
SLIDE 69

LATENCY CHAOS

💦

mtr google.com
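
At the application level, latency chaos amounts to making a call slower and checking whether it still fits its time budget. This is a minimal sketch, not network-level injection: `with_latency` and `within_budget` are hypothetical helpers, and the delay is added with a plain sleep.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_latency(func: Callable[[], T], delay: float) -> Callable[[], T]:
    """Wrap a call so that it waits `delay` seconds before running."""
    def delayed() -> T:
        time.sleep(delay)
        return func()
    return delayed


def within_budget(func: Callable[[], T], budget: float) -> Tuple[T, bool]:
    """Run `func` and report whether it finished inside `budget` seconds."""
    started = time.monotonic()
    result = func()
    return result, (time.monotonic() - started) <= budget
```

Raising the injected delay step by step shows at what point the call breaks its budget, which is a small-scale version of widening the blast radius.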

slide-70
SLIDE 70

PACKET LOSS CHAOS

slide-71
SLIDE 71

PACKET LOSS CHAOS

💦
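
Packet loss can be simulated in a test by dropping a fraction of sends and checking that the retry logic copes. This is a minimal sketch under stated assumptions: `lossy` and `send_with_retries` are hypothetical helpers, a send is modelled as a function that returns whether it was acknowledged, and the random generator is passed in so runs can be made repeatable with a seed.

```python
import random
from typing import Callable


def lossy(send: Callable[[bytes], bool], loss_rate: float,
          rng: random.Random) -> Callable[[bytes], bool]:
    """Wrap `send` so that a fraction `loss_rate` of packets is silently dropped."""
    def wrapped(packet: bytes) -> bool:
        if rng.random() < loss_rate:
            return False  # dropped: the caller never gets an acknowledgement
        return send(packet)
    return wrapped


def send_with_retries(send: Callable[[bytes], bool], packet: bytes,
                      max_attempts: int) -> int:
    """Return the number of attempts used, or 0 if every attempt was lost."""
    for attempt in range(1, max_attempts + 1):
        if send(packet):
            return attempt
    return 0
```

With a loss rate of 0.0 every packet gets through on the first attempt, and with 1.0 every attempt is lost; values in between show how many retries the service needs.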

slide-72
SLIDE 72

PART 5: COMPLEX OUTAGES

slide-73
SLIDE 73

We can combine different types of chaos engineering experiments to reproduce complicated outages. Reproducing an outage gives you confidence that you can handle it if and when it happens again.

slide-74
SLIDE 74

Let’s go back in time to look at some of the worst outage stories that kicked off the introduction of chaos engineering.

slide-75
SLIDE 75

DROPBOX’S WORST OUTAGE EVER

Some master-replica pairs were impacted, which resulted in the site going down.

https://blogs.dropbox.com/tech/2014/01/outage-post-mortem/

slide-76
SLIDE 76

UBER’S DATABASE OUTAGE

1. Master log replication to S3 failed
2. Logs backed up on the primary
3. Alerts fired to an engineer but were ignored
4. Disk fills up on database primary
5. Engineer deletes unarchived WAL files
6. Error in config prevents promotion

- Matt Ranney, Uber, 2015

slide-77
SLIDE 77

OUTAGES HAPPEN.

slide-78
SLIDE 78

THERE ARE MANY MORE OUTAGES YOU CAN READ ABOUT HERE: https://github.com/danluu/post-mortems

slide-79
SLIDE 79

HOW CAN YOU CONTINUE YOUR CHAOS ENGINEERING JOURNEY?

slide-80
SLIDE 80

JOIN THE CHAOS SLACK

GREMLIN.COM/SLACK

slide-81
SLIDE 81

LEARN WITH THE GREMLIN COMMUNITY

GREMLIN.COM/COMMUNITY

slide-82
SLIDE 82

THE FIRST CHAOS ENGINEERING CONFERENCE!

CHAOSCONF.IO

slide-83
SLIDE 83

THANK YOU

QCON NYC @tammybütow #CHAOSENGINEERING