The Practice of Chaos Engineering: Ana Medina, Chaos Engineer at Gremlin


SLIDE 1

The Practice of Chaos Engineering

Ana Medina Chaos Engineer at Gremlin

@ana_m_medina

SLIDE 2

@ana_m_medina #reactive18

Gremlin, Uber, SFEFCU, Google, Quicken Loans, Stanford University, Miami Dade College

SLIDE 3

How many of you have heard of Chaos Engineering?

SLIDE 4

How many of you have run a Chaos Engineering experiment?

SLIDE 5

What is Chaos Engineering?

SLIDE 6

Chaos Engineering

Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

SLIDE 7

Chaos Engineering

“Inject something harmful to build an immunity.”

  • @KoltonAndrus, Gremlin Founder and CEO

SLIDE 8

Why?

  • Microservices
  • Systems are scaling fast
  • Downtime is really expensive
  • Our dependencies will fail
  • Pager fatigue and burnout really hurt

SLIDE 9

Use Cases:

  • Outage reproduction
  • On-call training
  • Strengthen new products
  • Battle test new infrastructure and services

SLIDE 10

What do you need before doing Chaos Engineering?

  • Monitoring/Observability
  • On-Call and Incident Management
  • Cost of Downtime Per Hour
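
The cost of downtime per hour can be estimated before any experiment is run. The sketch below is one rough way to do that arithmetic; the formula (average hourly revenue plus incident-response labor), the `DowntimeInputs` fields, and every figure are illustrative assumptions, not numbers from the talk.

```python
from dataclasses import dataclass


@dataclass
class DowntimeInputs:
    """Hypothetical inputs; every figure here is an assumption for illustration."""
    annual_revenue: float            # revenue that depends on the service being up
    engineers_responding: int        # people pulled into a typical incident
    hourly_cost_per_engineer: float  # loaded hourly cost of one responder


def cost_of_downtime_per_hour(inputs: DowntimeInputs) -> float:
    """Rough estimate: average hourly revenue plus incident-response labor."""
    hours_per_year = 365 * 24
    lost_revenue = inputs.annual_revenue / hours_per_year
    response_cost = inputs.engineers_responding * inputs.hourly_cost_per_engineer
    return lost_revenue + response_cost


example = DowntimeInputs(
    annual_revenue=8_760_000,
    engineers_responding=5,
    hourly_cost_per_engineer=100,
)
estimate = cost_of_downtime_per_hour(example)
print(f"Estimated cost of downtime: ${estimate:,.2f} per hour")
```

A real estimate would likely weight peak hours more heavily than an annual average does; the flat average is used here only to keep the arithmetic visible.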

SLIDE 11

Chaos Engineering is not

  • Unexpected or unmonitored experiments
  • Creating outages

SLIDE 12

“Chaos Engineering Without Observability ... Is Just Chaos”

  • @mipsytipsy, Charity Majors, CEO of Honeycomb

SLIDE 13

Minimize the blast radius
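
One way to keep the blast radius small is to cap how many hosts an experiment may touch. The sketch below is illustrative only: `select_targets`, its parameters, and the host names are hypothetical and do not come from the talk or from any Gremlin API.

```python
import math
import random


def select_targets(hosts, fraction, max_targets, seed=None):
    """Pick a small random subset of hosts to experiment on.

    `fraction` and `max_targets` together cap the blast radius:
    the experiment touches at most `fraction` of the fleet and
    never more than `max_targets` hosts.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    hosts = list(hosts)
    rng = random.Random(seed)
    # At least one host (if any exist), at most max_targets.
    count = min(max_targets, max(1, math.floor(len(hosts) * fraction)))
    count = min(count, len(hosts))
    return rng.sample(hosts, count)


fleet = [f"host-{i}" for i in range(100)]  # hypothetical fleet
targets = select_targets(fleet, fraction=0.05, max_targets=3, seed=1)
print(targets)
```

Starting with a single host and widening the fraction only after the experiment behaves as expected keeps the impact of a surprise small.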

SLIDE 14

THE BEGINNING

Chaos Monkey (Level 0)

MATURITY REQUIRED: Low
APPROACH TAKEN: Random
VALUE PROVIDED: Prepare for host failures in the cloud

SLIDE 15

THE FIRST STEP

Infrastructure Failures (Level 1)

MATURITY REQUIRED: Basic Operations
APPROACH TAKEN: Disciplined
VALUE PROVIDED: Prepare for host-level failures

SLIDE 16

INTERMEDIATE

Network Failures (Level 1.5)

MATURITY REQUIRED: Networking expertise
APPROACH TAKEN: Gameday
VALUE PROVIDED: Prepare for high impact events

SLIDE 17

THE NEXT STEP

Application Failures (Level 2)

MATURITY REQUIRED: Advanced
APPROACH TAKEN: Precision Experiments
VALUE PROVIDED: Safely validate the user experience

SLIDE 18

Latency added to 50% of Android traffic

SLIDE 19

Exceptions: 50% of Android traffic failed
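
The two application-level experiments above (latency or exceptions for 50% of Android traffic) can be sketched as a wrapper around a request handler. This is a minimal illustration, not how the experiments in the talk were implemented: `inject_fault`, `InjectedFault`, `is_android`, `get_feed`, and the dictionary-shaped request are all hypothetical names chosen for the example.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_fault(matches, percent, latency_s=0.0, fail=False, rng=random.random):
    """Wrap a request handler so a share of matching requests is degraded.

    matches:   predicate choosing which requests are in scope (e.g. Android only)
    percent:   share of in-scope requests to affect, 0-100
    latency_s: seconds of delay to add before handling
    fail:      raise InjectedFault instead of handling the request
    rng:       returns a float in [0, 1); injectable so tests can be deterministic
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(request):
            if matches(request) and rng() * 100 < percent:
                if latency_s:
                    time.sleep(latency_s)
                if fail:
                    raise InjectedFault("chaos experiment: injected failure")
            return handler(request)
        return wrapper
    return decorator


def is_android(request):
    # Assumption: requests are dicts carrying a "platform" key.
    return request.get("platform") == "android"


@inject_fault(is_android, percent=50, latency_s=0.01)
def get_feed(request):
    return {"status": 200, "user": request["user"]}


# Requests outside the predicate are never touched.
print(get_feed({"platform": "ios", "user": "ana"}))
```

Scoping the fault with a predicate and a percentage is what keeps this a precision experiment: traffic outside the chosen segment is served normally.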

SLIDE 20

You can and should inject chaos at every layer of your stack

  • Application
  • API
  • Caching
  • Database
  • Hardware
  • Cloud Infrastructure / Bare metal

SLIDE 21

Top places to inject chaos

SLIDE 22

SLIDE 23

https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/

SLIDE 24

Getting Started:

  • Identify top 5 critical systems
  • Choose system
  • Whiteboard the system
  • Determine what experiment you want to run (resource, state, network)
  • Determine Blast Radius
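
The getting-started steps above can be written down as a small record so a plan is checked before anything is run. This is a sketch under stated assumptions: the `ExperimentPlan` class, its field names, and the example systems are hypothetical; only the five steps and the three experiment categories come from the slide.

```python
from dataclasses import dataclass

# The three experiment categories named on the slide.
EXPERIMENT_TYPES = ("resource", "state", "network")


@dataclass(frozen=True)
class ExperimentPlan:
    critical_systems: tuple  # step 1: the top 5 critical systems
    system: str              # step 2: the one chosen for this experiment
    diagram: str             # step 3: where the whiteboard sketch is kept
    experiment_type: str     # step 4: resource, state, or network
    blast_radius: str        # step 5: what the experiment is allowed to touch

    def __post_init__(self):
        if len(self.critical_systems) > 5:
            raise ValueError("list at most 5 critical systems")
        if self.system not in self.critical_systems:
            raise ValueError("chosen system must be one of the critical systems")
        if self.experiment_type not in EXPERIMENT_TYPES:
            raise ValueError(f"experiment_type must be one of {EXPERIMENT_TYPES}")


# All values below are placeholders for illustration.
plan = ExperimentPlan(
    critical_systems=("checkout", "login", "search", "payments", "notifications"),
    system="checkout",
    diagram="wiki/checkout-architecture",
    experiment_type="network",
    blast_radius="one checkout host in staging",
)
print(plan)
```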

SLIDE 25

Companies doing Chaos Engineering

SLIDE 26

Chaos Days

SLIDE 27

Chaos Days: Dedicated day for your entire company to focus on building resilience instead of new products.

https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/

SLIDE 28

“What could go wrong?”
“Do we know what will happen if this breaks?”

SLIDE 29

Chaos Day Crew:

  • VP Engineering / CTO / COO
  • Executive Assistant
  • Engineering Director / Manager
  • Senior Engineer
  • New Grad / Intern Engineer

SLIDE 30

What experiments can you run?

  • Reproduce outage conditions
  • Unpredictable circumstances
  • Large traffic spikes
  • Race conditions
  • Datacenter failure
  • Time travel: system clocks out of sync
  • Network errors
  • CPU overloads
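
As one concrete instance from the list above, the "time travel" experiment can be approximated in application code by handing the logic under test a clock that runs ahead of or behind real time. This is only a sketch: `skewed_clock` and `token_is_valid` are hypothetical names, and a real experiment would more likely shift the system clock on a host than swap a function.

```python
from datetime import datetime, timedelta, timezone


def utc_now():
    return datetime.now(timezone.utc)


def skewed_clock(offset, base=utc_now):
    """Return a clock function that reports `base()` shifted by `offset`."""
    return lambda: base() + offset


def token_is_valid(issued_at, ttl, now):
    """Hypothetical check under test: valid from issue time until issue time + ttl."""
    current = now()
    return issued_at <= current < issued_at + ttl


issued = utc_now()
ttl = timedelta(minutes=5)

# Compare behaviour with an accurate clock and with clocks that drift.
for label, offset in [("in sync", timedelta(0)),
                      ("10 min fast", timedelta(minutes=10)),
                      ("2 min slow", timedelta(minutes=-2))]:
    print(label, token_is_valid(issued, ttl, skewed_clock(offset)))
```

Passing the clock in as a parameter is the design choice that makes the experiment possible without touching the host; code that calls the system time directly can only be tested by skewing the machine itself.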

SLIDE 31

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

gremlin.com/chaos-monkey/

SLIDE 37

Join the Chaos, Join Slack: bit.ly/chaos-eng-slack

1,900+ members across the world

Learn more:

SLIDE 38

THANKS!

@ana_m_medina ana@gremlin.com