heretical resilience (to repair is human) - Ryn Daniels



SLIDE 1

heretical resilience
(to repair is human)

Ryn Daniels - @rynchantress
QCon New York 2018

SLIDE 2

@rynchantress qcon nyc 2018

SLIDE 3

my side of the story

AKA: A Dramatic Retelling of The Time I Nearly Broke Etsy Dot Com

blargh

SLIDE 4

SLIDE 5

apache versions

SLIDE 6

apache versions

SLIDE 7

SLIDE 8

SLIDE 9

blargh

SLIDE 10

blargh

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

blargh

SLIDE 15

blargh

SLIDE 16

SLIDE 17

SLIDE 18

SLIDE 19

SLIDE 20

SLIDE 21

SLIDE 22

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

blargh

SLIDE 27

blargh

SLIDE 28

The Post-mortem

aka: What the heck actually just happened?

SLIDE 29

The Post-mortem

aka: What the heck actually just happened?

aka: what did we learn?

SLIDE 30

how did the site stay up?

SLIDE 31

SLIDE 32

SLIDE 33

Lesson 1

Always keep 7 servers out of config management, just in case.

SLIDE 34

Lesson 1

Consider fallbacks for automation

SLIDE 35

distrusting your automation

  • How will you detect problems?
  • How easily can you test your automation?
  • Can you turn the automation off?
  • Do you remember how to do the thing manually?
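
One way to make these questions concrete is a guard in front of an automated run: a kill switch a human can flip to turn the automation off, plus a sanity check that refuses a run touching too much of the fleet at once. The following is only a minimal Python sketch under stated assumptions. The kill-switch path, the 20% threshold, and the function name are hypothetical and do not come from the talk or from any config-management tool.

```python
from pathlib import Path

# Hypothetical location of an operator-controlled kill switch file.
KILL_SWITCH = Path("/etc/automation/disabled")
# Hypothetical limit on the fraction of the fleet one run may change.
MAX_CHANGED_FRACTION = 0.2


def should_run(changed_hosts, fleet_size, kill_switch=KILL_SWITCH):
    """Return (ok, reason) saying whether an automated run may proceed."""
    # A human can always turn the automation off by creating the file.
    if kill_switch.exists():
        return False, "automation disabled by operator"
    if fleet_size <= 0:
        return False, "unknown fleet size"
    # Refuse runs whose blast radius is larger than expected.
    fraction = len(changed_hosts) / fleet_size
    if fraction > MAX_CHANGED_FRACTION:
        return False, f"run would change {fraction:.0%} of hosts"
    return True, "ok"
```

A refused run is a prompt for a person to look at what the automation was about to do, which is also a chance to rehearse the manual procedure.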

SLIDE 36

How did we respond so fast?

SLIDE 37

SLIDE 38

blargh

SLIDE 39

Lesson 2

Create a Slack Team in charge of maintaining a proper amount of slack in case of incidents.

SLIDE 40

Lesson 2

maintain adaptive capacity

SLIDE 41

twiddling your thumbs

  • How do people ask each other for help?
  • Which teams have more or less slack?
  • What happens after work gets rearranged?

SLIDE 42

what couldn't we see?

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

SLIDE 47

Lesson 3

Buy a couple botnets to DDoS your monitoring tools every now and then.

SLIDE 48

Lesson 3

understand the dependencies in your tooling

SLIDE 49

watching the world burn

  • What do your monitoring/automation/orchestration tools depend on?
  • Who watches the watchers?
  • How do you communicate internally and externally?
  • Do you have backup tools?
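
The first two questions can be partly answered mechanically once tool dependencies are written down. The following is only a minimal Python sketch under stated assumptions: the dependency map is hypothetical and hand-maintained, and every tool and service name in it is invented for illustration. It computes what each tool depends on, directly or indirectly, and lists which tools stop working when one component fails.

```python
# Hypothetical dependency map: each tool lists what it directly depends on.
DEPENDS_ON = {
    "alerting": ["metrics-db", "chat"],
    "metrics-db": ["storage"],
    "chat": ["sso"],
    "deploy-tool": ["config-management", "chat"],
    "config-management": ["storage"],
}


def transitive_deps(tool, graph):
    """Return every component that `tool` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(tool, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # the seen set also protects against cycles
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen


def affected_by(failed, graph):
    """Return the sorted tools that stop working when `failed` is down."""
    return sorted(t for t in graph if failed in transitive_deps(t, graph))


print(affected_by("storage", DEPENDS_ON))
```

If the tools used to detect and announce an outage appear in the output for a component they are meant to watch, a backup tool or an out-of-band channel is worth considering.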

SLIDE 50

what actually went wrong with chef?

SLIDE 51

SLIDE 52

Lesson 4

Always label your dragons.

SLIDE 53

Lesson 4

make informed decisions about which yaks to shave.

SLIDE 54

choosing your yaks wisely

  • Which teams have sufficient slack?
  • Can a problem be avoided if not solved?
  • What are the tradeoffs and opportunity costs?
  • Who has the precision yak razors?

SLIDE 55

who digs into the weird things?

SLIDE 56

Lesson 4.5

Hire the person who created the primary language your site is written in. (This always scales.)

SLIDE 57

Lesson 4.5

Develop depth of inter-team relationships

SLIDE 58

finding your own rasmus

  • Which areas only have one (or two) people who understand them?
  • How is information shared within your organization?
  • What behaviors are rewarded?

SLIDE 59

what happened afterwards?

SLIDE 60

SLIDE 61

Lesson 5

Give people ill-fitting clothing when they mess up.

SLIDE 62

Lesson 5

encourage organizational learning

SLIDE 63

a warning to others

  • How do people respond to incidents?
  • What happens after an incident?
  • How are remediation items prioritized?
  • What happens to the bandaid solutions?

SLIDE 64

SLIDE 65

technology can be robust.*

only humans can be resilient.

*for some already-known, pre-defined subset of problems

SLIDE 66

SLIDE 67

  1. understand your automation
  2. maintain adaptive capacity
  3. know your dependencies
  4. build cross-team relationships
  5. always be learning

SLIDE 68

  1. understand your automation
  2. maintain adaptive capacity
  3. know your dependencies
  4. build cross-team relationships
  5. always be learning

SLIDE 69

Thank you!