Chaos Engineering with Containers - Ana Medina

SLIDE 1

#QConSF @ana_m_medina

Chaos Engineering with Containers

Ana Medina
Chaos Engineer at Gremlin

SLIDE 2

Ana Medina

@ana_m_medina

Chaos Engineer @ Gremlin. Previously Software Engineer / SRE @ Uber. Also worked / interned @ SFEFCU, Google, Quicken Loans, Stanford University, and Miami Dade College. College dropout. Self-taught engineer.

SLIDE 3

How many of you have heard of Chaos Engineering?

SLIDE 4

How many of you have run a Chaos Engineering experiment?

SLIDE 5

Chaos Engineering

Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

SLIDE 6

Chaos Engineering

“Inject something harmful to build an immunity.”

  • @KoltonAndrus, Gremlin Founder and CEO

SLIDE 7

Why?

  • Microservices
  • Systems are scaling fast
  • Downtime is really expensive
  • Our dependencies will fail
  • Pager fatigue and burnout really hurt

SLIDE 8

“Chaos Engineering Without Observability ... Is Just Chaos”

  • @mipsytipsy (Charity Majors, CEO of Honeycomb)

SLIDE 9

Prerequisites for Chaos Engineering

  • Monitoring/Observability
  • On-Call and Incident Management
  • Cost of Downtime Per Hour

SLIDE 10

Use Cases for Chaos Engineering

  • Outage reproduction
  • On-call training
  • Strengthen new products
  • Battle test new infrastructure and services

SLIDE 11

Use Cases for Chaos Engineering - Containers

  • Testing provider-specific reliability (e.g., EKS vs. AKS vs. GKE)

  • Auto Scaling
  • Logs, Disk failure

SLIDE 12

Minimize the Blast radius

SLIDE 13

Monitoring / Observability

SLIDE 14

What to measure and monitor?

  • System Metrics: CPU, Disk, I/O
  • Availability
  • Service specific KPIs
  • Customer complaints
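
As a rough illustration of the system-metrics bullet above, the sketch below takes a snapshot before and after an experiment and compares the two, using only the Python standard library. The function names (`disk_used_pct`, `snapshot`, `diff`) and the use of the 1-minute load average as a stand-in for CPU pressure are assumptions made for this example, not something shown in the talk; a real setup would normally rely on an existing monitoring stack.

```python
import os
import shutil
import time


def disk_used_pct(path="/"):
    """Return the percentage of disk space used on the filesystem at `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def snapshot(path="/"):
    """Collect a small set of system metrics at one point in time."""
    # os.getloadavg() is Unix-only; it is used here as a rough CPU proxy.
    load_1m, _, _ = os.getloadavg()
    return {
        "timestamp": time.time(),
        "load_1m": load_1m,
        "disk_used_pct": disk_used_pct(path),
    }


def diff(before, after):
    """Return the per-metric change between two snapshots."""
    return {k: after[k] - before[k] for k in before if k != "timestamp"}
```

Taking one `snapshot()` before the experiment and one after, then passing both to `diff`, gives a quick view of what moved while the failure was injected.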

SLIDE 15

Demo

SLIDE 16

#1 - Battle Test Cloud infrastructure

Real World Scenario: A company / user is evaluating cloud-provider managed Kubernetes. Which one is more reliable?
The Hypothesis: Shutting down a container (1/1) should cause only a small delay before the app is reachable again.
The Experiment: Shut down the Kubernetes dashboard container.
Abort Conditions: The app is unreachable after 60 seconds.
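
The 60-second abort condition can be sketched as a polling loop. This is a minimal illustration under stated assumptions, not the tooling used in the demo: `is_reachable` and `wait_for_recovery` are hypothetical helper names, the 2-second polling interval is arbitrary, and "reachable" is assumed to mean any HTTP response with a status below 500.

```python
import time
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if `url` answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx responses still prove the app is answering requests.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        return False


def wait_for_recovery(probe, abort_after=60.0, interval=2.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True or `abort_after` seconds pass.

    Returns the seconds elapsed at recovery, or None if the abort
    condition was hit first.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if probe():
            return elapsed
        if elapsed >= abort_after:
            return None
        sleep(interval)
```

A call such as `wait_for_recovery(lambda: is_reachable("http://dashboard.example.internal/"))` (the URL is a placeholder) would return the observed delay, which is the number the hypothesis is about, or `None` when the experiment should be aborted.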

SLIDE 21

#2 - Shutdown of a Container

Real World Scenario: A company / user is evaluating containers. Are they as reliable as promised?
The Hypothesis: Yes, they will come back up.
The Experiment: Shut down a container, wait a few seconds, and check whether it is up.
Abort Conditions: The app is unreachable after 60 seconds.
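
The shape of this experiment (confirm steady state, shut down, wait, check again) can be sketched as follows. The slides do not show how the shutdown was triggered, so `shutdown` is left as a caller-supplied hook; `container_is_running` and `run_shutdown_experiment` are hypothetical names, and the check assumes a Docker host where `docker inspect` is available.

```python
import subprocess
import time


def container_is_running(name, run=subprocess.run):
    """Ask the Docker CLI whether container `name` is currently running."""
    result = run(
        ["docker", "inspect", "-f", "{{.State.Running}}", name],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


def run_shutdown_experiment(shutdown, is_up, wait_seconds=5.0, sleep=time.sleep):
    """Run one shutdown experiment and report whether the hypothesis held.

    `shutdown` and `is_up` are caller-supplied callables (assumed hooks):
    the first injects the failure, the second checks the container.
    """
    # Do not inject failure into a system that is already unhealthy.
    if not is_up():
        raise RuntimeError("steady state not met: container is not running")
    shutdown()
    sleep(wait_seconds)
    return {"hypothesis_held": is_up()}
```

Refusing to start when the container is already down keeps the result interpretable: a failed check afterwards can then be attributed to the injected shutdown rather than to a pre-existing problem.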

SLIDE 23

#3 - Blackholing traffic to Catalog

Real World Scenario: A company / user is working with their UI team to provide a good user experience when there are API/DB issues.
The Hypothesis: Images will not load, but the product listing will.
The Experiment: Blackhole all traffic from the front end to the REST API and DB ports.
Abort Conditions: The app is unreachable after 60 seconds.
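
The slides do not show how the blackhole was implemented. One generic way to approximate it on a Linux host is to drop outbound TCP packets to the relevant ports with `iptables`; the sketch below only builds the commands and does not run them. The helper name `blackhole_rules` and the example ports (8080 for the REST API, 5432 for the DB) are assumptions for illustration.

```python
def blackhole_rules(ports, action="-A"):
    """Build iptables commands that drop outbound TCP traffic to `ports`.

    Use action="-A" to append the rules and "-D" to delete them again
    when the experiment ends or the abort condition is hit.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return [
        ["iptables", action, "OUTPUT", "-p", "tcp",
         "--dport", str(port), "-j", "DROP"]
        for port in ports
    ]


# Hypothetical ports: 8080 for the REST API, 5432 for the database.
start_cmds = blackhole_rules([8080, 5432])
rollback_cmds = blackhole_rules([8080, 5432], action="-D")
```

Generating the rollback commands up front, from the same port list, makes it harder to leave a rule behind after the experiment.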

SLIDE 25

Case Study

SLIDE 26

Companies doing Chaos Engineering

SLIDE 27

Tools You Can Use

  • Gremlin
  • Chaos Toolkit
  • Litmus
  • PowerfulSeal

SLIDE 28

Break Things Together

bit.ly/chaos-eng-slack


2,000+ members across the world

SLIDE 29

THANKS!

@ana_m_medina ana@gremlin.com