

slide-1
SLIDE 1

Learning From Failure | @_pkill | indeedhi.re/2wKa2Mm

slide-2
SLIDE 2


What would catastrophic failure look like in your organization?

Try and picture this.

slide-3
SLIDE 3

[Quadrant chart: tightly coupled vs. loosely coupled on one axis, linear vs. complex on the other]

Systems that are tightly coupled and complex are less resilient to catastrophe

slide-4
SLIDE 4

92% of catastrophic failures are the result of incorrect error handling.

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems - Ding Yuan et al.

slide-5
SLIDE 5

#velocityconf

Learning from Failure: Why a Total Site Outage Can be a Good Thing

Alex Elman

Site Reliability Engineer @_pkill

slide-6
SLIDE 6
slide-7
SLIDE 7


slide-8
SLIDE 8

RAID: Redundant Array of Inexpensive Datacenters

slide-9
SLIDE 9


US West Response pool failover: US West » US Central + US East » Global Anycast DNS
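The failover order on this slide can be sketched as a lookup over an ordered list of pools. This is only an illustration: the pool names, the `FAILOVER_CHAIN` table, and `pick_pool` are hypothetical, and the slide does not say how the real traffic director is configured.

```python
from typing import Dict, List, Optional

# Hypothetical failover order read off the slide: the region's own pool first,
# then the combined US Central + US East pool, then a global pool.
FAILOVER_CHAIN: Dict[str, List[str]] = {
    "us-west": ["us-west", "us-central+us-east", "global"],
}


def pick_pool(client_region: str, healthy: Dict[str, bool]) -> Optional[str]:
    """Return the first healthy pool in the client's failover chain, or None."""
    for pool in FAILOVER_CHAIN.get(client_region, ["global"]):
        if healthy.get(pool, False):
            return pool
    return None  # every pool in the chain is failing its healthcheck
```

Note that this sketch returns nothing at all when every pool is unhealthy, which is the situation that matters later in the talk.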

slide-10
SLIDE 10


Failing out a datacenter

Datacenter A Datacenter B Datacenter C

slide-11
SLIDE 11

Application topology

[Diagram: job seekers from around the world; Dynect Anycast DNS network; three stacks, each made up of nginx, a webapp pool, and a services pool]

slide-12
SLIDE 12


Hiding broken parts of a service from the user is an example of the Graceful Degradation Pattern
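A minimal sketch of graceful degradation, assuming a page assembled from independent sections. The section names and `render_page` are made up for illustration and are not taken from the talk.

```python
from typing import Callable, Dict, List


def render_page(sections: Dict[str, Callable[[], str]]) -> List[str]:
    """Render every section that works and omit the ones that fail."""
    rendered = []
    for name, fetch in sections.items():
        try:
            rendered.append(f"{name}: {fetch()}")
        except Exception:
            # Degrade gracefully: hide the broken section instead of failing
            # the whole page. A real system would also log or count this.
            continue
    return rendered


def broken() -> str:
    raise RuntimeError("recommendation service unavailable")


page = render_page({
    "search_results": lambda: "25 jobs",
    "recommended_jobs": broken,
})
# page contains only the search results; the broken section is hidden.
```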

slide-13
SLIDE 13


slide-14
SLIDE 14


slide-15
SLIDE 15


Failure Myth #1: It is not worth planning for a catastrophic failure that’s never going to happen

slide-16
SLIDE 16

Normalcy bias: A type of cognitive bias that leaves planners and first responders ill-equipped to respond to a catastrophic disaster because nothing like it has been encountered before or it seems inconceivable.

slide-17
SLIDE 17

“Catastrophe” by Marco Verch. Original Creative Commons 2.0 License

slide-18
SLIDE 18


slide-19
SLIDE 19


Alert: region down

slide-20
SLIDE 20


Alert: 4 regions down

slide-21
SLIDE 21


Catastrophe

slide-22
SLIDE 22

The incident lifecycle

[Cycle diagram: Detection » Diagnosis » Mitigation » Recovery » Cleanup » Retro » Prevention]

slide-23
SLIDE 23

Swiss cheese accident model

slide-24
SLIDE 24


Failure Myth #2: A single failure can cause a catastrophe

slide-25
SLIDE 25

OUTAGE: [
  {
    date: "2016-01-20T17:46:07.890-0600",
    description: "Load data artifact",
    errorMessage: "Last load of data artifact failed",
    id: "dataArtifact",
    lastKnownGoodTimestamp: 1453332285462,
    status: "OUTAGE",
    thrown: {
      exception: "RuntimeException",
      message: "Last load of data artifact failed",
      stack: [
        "com.indeed.healthcheck.JasxDependencyManager$15.ping(JasxDependencyManager.java:908)",
        "com.indeed.status.core.PingableDependency.call(PingableDependency.java:59)",
        "com.indeed.status.core.PingableDependency.call(PingableDependency.java:15)",
        "java.util.concurrent.FutureTask.run(FutureTask.java:262)",
        "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)",
        "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)",
        "java.lang.Thread.run(Thread.java:745)"
      ]
    },
    timestamp: 1453333567890,
    urgency: "Required: Failure of this dependency would result in complete system outage"
  }

Diagnosis
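To see why the `urgency` field in this healthcheck output matters, here is a simplified sketch of how per-dependency results might roll up into one status for the instance. The `DependencyResult` type, the `DEGRADED` status, and the roll-up rule are assumptions made for illustration. They are not the API of the library that appears in the stack trace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DependencyResult:
    id: str
    urgency: str  # "REQUIRED" or "WEAK" (simplified from the slide's urgency text)
    status: str   # "OK" or "OUTAGE"


def instance_status(results: List[DependencyResult]) -> str:
    """Roll dependency results up into one status for the whole instance."""
    if any(r.status == "OUTAGE" and r.urgency == "REQUIRED" for r in results):
        return "OUTAGE"    # a load balancer would take this instance out of rotation
    if any(r.status == "OUTAGE" for r in results):
        return "DEGRADED"  # still serving, with reduced functionality
    return "OK"


failed_artifact = DependencyResult("dataArtifact", "REQUIRED", "OUTAGE")
```

Under this rule the same failed dependency takes the instance down when it is declared `REQUIRED` and only degrades it when it is declared `WEAK`.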

slide-26
SLIDE 26

RAD: Resilient Artifact Distribution

  • Use BitTorrent protocol
      ○ Faster
      ○ Reduced network burden for servers
      ○ Horizontally scalable
      ○ Encrypted
  • Resilient to network issues
      ○ Peers in multiple regions/DCs
  • Self-service platform
      ○ Devs can declare data in code
      ○ No SRE toil needed

[Diagram labels: Down, DC1, DC2, DC4]

slide-27
SLIDE 27


Data artifact build process

[Diagram: Artifact builder, Publisher, Rhone, Tracker, two Hubs, and five Consumers, with flows labeled Announce and Data]

slide-28
SLIDE 28

Artifact Generation 1

[Diagram: six JobWebapp instances, each labeled artifact.1]

Diagnosis

slide-29
SLIDE 29

[Diagram: six JobWebapp instances, each labeled artifact.1, plus three artifact.2 labels]

Diagnosis

Load artifact generation 2

slide-30
SLIDE 30

[Diagram: six JobWebapp instances; artifact.2 has reached all six, three still show artifact.1, and three are marked unavailable]

Diagnosis

slide-31
SLIDE 31

[Diagram: six JobWebapp instances, all labeled artifact.2 and all marked unavailable]

Outage: 0% availability

Diagnosis

slide-32
SLIDE 32

1. Republished artifact to last known good generation
2. Performed a rolling restart of JobWebapp
3. Turned off healthchecking in the load balancer

Mitigation

slide-33
SLIDE 33

1. Disabled artifact builder
2. Waited for new artifact to replicate
3. Verified all instances of the webapp were restarted
4. Verified recovery with telemetry
5. Verified healthchecks

Recovery

slide-34
SLIDE 34


slide-35
SLIDE 35


Something is still wrong, go back to diagnosis

Recovery

slide-36
SLIDE 36

The incident lifecycle

[Cycle diagram: Detection » Diagnosis » Mitigation » Recovery » Cleanup » Retro » Prevention]

slide-37
SLIDE 37

$ host -t A indeed.com
indeed.com has no A record

Recovery
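One plausible reading of the empty answer above, consistent with the "no fail-open pool" contributing factor listed later in the deck, is a traffic director that only serves addresses from pools passing their healthchecks. The sketch below contrasts that with a fail-open variant. The function, the pool names, and the addresses (from the documentation ranges) are hypothetical.

```python
from typing import Dict, List


def answer_a_records(pools: Dict[str, List[str]],
                     healthy: Dict[str, bool],
                     fail_open: bool) -> List[str]:
    """Return the addresses a traffic director would serve for one lookup."""
    live = [ip for name, ips in pools.items()
            if healthy.get(name, False) for ip in ips]
    if live or not fail_open:
        return live  # may be empty: the "no A record" case
    # Fail open: every pool looks unhealthy, so the healthcheck itself is
    # suspect. Serve all pools rather than nothing.
    return [ip for ips in pools.values() for ip in ips]


pools = {"us-west": ["192.0.2.10"], "us-east": ["198.51.100.10"]}
all_down = {"us-west": False, "us-east": False}
```

With `fail_open=False` and every pool marked down, the answer is empty. With `fail_open=True` the same input still returns every address.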

slide-38
SLIDE 38

1. Disabled artifact builder
2. Waited for new artifact to replicate
3. Verified recovery with telemetry
4. Verified healthchecks
5. While waiting for DNS TTL expiration, validated hypothesis

Recovery

slide-39
SLIDE 39

1. Harvested logs and artifacts for investigation
2. Re-enabled healthchecking in load balancer
3. Restored log verbosity levels
4. Restored artifact building

Cleanup

slide-40
SLIDE 40

The incident lifecycle

[Cycle diagram: Detection » Diagnosis » Mitigation » Recovery » Cleanup » Retro » Prevention]

slide-41
SLIDE 41

The incident lifecycle

[Cycle diagram: Detection » Diagnosis » Mitigation » Recovery » Cleanup » Retro » Prevention]

slide-42
SLIDE 42

Canary Artifact Deployment

slide-43
SLIDE 43

1. Attempt to claim a canary lock
2. Load the artifact
3. If successful, “bless” the run
4. After blessing, other servers load the artifact
5. If unsuccessful, log event and try again

Prevention
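The five steps above can be sketched as follows. This is an in-memory illustration only: `CanaryCoordinator` stands in for whatever shared lock and "blessed" marker the real system uses, and every name here is hypothetical.

```python
import logging
from typing import Callable, Optional, Set


class CanaryCoordinator:
    """In-memory stand-in for a shared canary lock and a set of blessed generations."""

    def __init__(self) -> None:
        self.lock_holder: Optional[str] = None
        self.blessed: Set[str] = set()

    def try_claim(self, server: str) -> bool:
        if self.lock_holder is None:
            self.lock_holder = server
            return True
        return False

    def release(self, server: str) -> None:
        if self.lock_holder == server:
            self.lock_holder = None


def deploy_step(server: str, generation: str, coord: CanaryCoordinator,
                load: Callable[[str], None]) -> str:
    """Run one attempt of the canary protocol for a single server."""
    if generation in coord.blessed:
        load(generation)               # step 4: a canary already proved it loads
        return "loaded"
    if not coord.try_claim(server):    # step 1: someone else is the canary
        return "waiting"
    try:
        load(generation)               # step 2
    except Exception:
        # step 5: log the event; the caller retries on its next attempt
        logging.warning("canary load of %s failed on %s", generation, server)
        return "failed"
    else:
        coord.blessed.add(generation)  # step 3: bless the run
        return "blessed"
    finally:
        coord.release(server)          # always give the lock back
```

The property that matters is that an artifact which fails to load on the canary is never blessed, so at most one server at a time is exposed to it.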

slide-44
SLIDE 44


Canary Artifact Deployment is an example of the Circuit Breaker Pattern
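For readers unfamiliar with the pattern, here is a generic circuit breaker sketch. It is not the canary mechanism itself: it shows the general idea of stopping calls after repeated failures and allowing a trial call after a cool-down. The class name, thresholds, and injectable clock are illustrative assumptions.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Open on too many failures, or re-open if the trial call failed.
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```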

slide-45
SLIDE 45


Failure

Incorporate failure into your system’s design and design for both known and unknown failures.

slide-46
SLIDE 46

Resilience is a property that describes a system’s ability to adapt to a previously unknown failure, while robustness is a system’s ability to recover from a known failure.

slide-47
SLIDE 47


Failure Myth #3: Failure can be prevented

slide-48
SLIDE 48


Failure is a routine part of running distributed systems

slide-49
SLIDE 49

[Quadrant chart: tightly coupled vs. loosely coupled on one axis, linear vs. complex on the other]

Systems that are tightly coupled and complex are less resilient to catastrophe

slide-50
SLIDE 50


Global aviation is an example of a complex and tightly coupled system

slide-51
SLIDE 51


Failure Myth #4: Adding resilience improves reliability

slide-52
SLIDE 52


Improving resilience can reduce reliability.

slide-53
SLIDE 53


Incident Response

slide-54
SLIDE 54


Situational awareness in decision making

slide-55
SLIDE 55

Our cognitive biases are useful adaptations, but they often lead us astray during incident response.

You don’t have to eliminate them, but be aware of them.

slide-56
SLIDE 56

Availability heuristic: Relying only on the ideas that come to mind when making decisions in uncertain situations.

slide-57
SLIDE 57

Focusing effect: The tendency to place too much importance on one aspect of an event.

slide-58
SLIDE 58

Illusory correlation: Inaccurately perceiving a relationship between two unrelated events.

slide-59
SLIDE 59

Confirmation bias: The tendency to search for, interpret, focus on, or discard evidence in a way that confirms one's preconceptions.

slide-60
SLIDE 60

The incident lifecycle revisited

[Cycle diagram: Detection » Diagnosis » Mitigation » Recovery » Cleanup » Retro » Prevention]

Annotations: New symptom emerges; Responder joined later; Dev helped with identifying DNS issue in Slack

slide-61
SLIDE 61


The Retrospective Process

slide-62
SLIDE 62

Debriefings » Synthesis & Analysis » Retro report compiled » Retro report released » Remediation meetings

The Retrospective Process

Learning from incidents

Urgent remediations addressed

Retrospective start

slide-63
SLIDE 63

Debriefings » Synthesis & Analysis » Retro report compiled » Retro report released » Remediation meetings

The Retrospective Process

Learning from incidents

Urgent remediations addressed

Testimony is most accurate within two weeks of return to normalization.

slide-64
SLIDE 64


Debriefing

slide-65
SLIDE 65

+ Debrief facilitator
+ Debrief facilitator trainee
+ Scribe
+ Incident owner
+ Incident participants
+ Retrospective owner
+ Subject matter experts

Debrief attendees

slide-66
SLIDE 66

+ Impartial: Not involved in the incident
+ Curious: Asks questions
+ Attentive: Listens
+ Respectful: Improves psychological safety
+ Thorough: Captures all relevant testimony
+ Patient: Mediates heated debate
+ Uses shared language: Sufficiently technical

Qualities of debrief facilitators

slide-67
SLIDE 67

1. Facilitator reviews the timeline
2. Facilitator interviews attendees
3. Call for clarifying questions

Debrief agenda

slide-68
SLIDE 68


What questions should a facilitator ask?

slide-69
SLIDE 69


What was happening just before the incident? During the incident?

slide-70
SLIDE 70


Was there a call for assistance? How was it known who to contact?

slide-71
SLIDE 71


How could this incident have been worse?

slide-72
SLIDE 72


How did we arrive at the decision to turn off the healthchecking in the load balancer?

slide-73
SLIDE 73

+ Start debriefs as soon as possible
+ Before the debrief
+ Send out questions to participants
+ Assess the comfort level of participants
+ Commit someone to scribe or record
+ Conduct 1:1 debriefs if necessary

Debriefing tips

slide-74
SLIDE 74

Debriefings » Synthesis & Analysis » Retro report compiled » Retro report released » Remediation meetings

The Retrospective Process

Learning from incidents

Address urgent remediations

Interviews, narratives, contributing factors, latent threats, impact, remediation items

slide-75
SLIDE 75

+ “...made a mistake by…”
+ “The developer carelessly…”
+ “... suboptimal decision-making...”
+ “... should have been obvious…”
+ “Could have prevented the outage…”
+ “... failed to verify the change...”

Avoid counterfactuals

slide-76
SLIDE 76


Root cause analysis is a fairy tale

slide-77
SLIDE 77


Root cause is also an imprecise concept.

slide-78
SLIDE 78

1. Initiating cause
2. Most basic cause
3. Earliest cause
4. Deepest cause

WIKIPEDIA’S DEFINITION OF ROOT CAUSE

So many choices...

slide-79
SLIDE 79

1. Initiating: Non-critical healthcheck dependency commit?
2. Most basic: Filesystem exhaustion on build server?
3. Earliest: The Big Bang??
4. Deepest: The Human Condition???

WIKIPEDIA’S DEFINITION OF ROOT CAUSE

slide-80
SLIDE 80


Root cause analysis is too narrow in scope to maximize learning.

slide-81
SLIDE 81


Root cause analysis is too narrow in scope to maximize learning. It leaves important contributions unexplored.

slide-82
SLIDE 82


Root cause analysis is not blame-aware.

slide-83
SLIDE 83


The Five Whys is also problematic

slide-84
SLIDE 84


Is the root cause hiding here somewhere?

Why? Why? Why? Why? Why?

slide-85
SLIDE 85

Why? Why? Why? Why? Why?

[Diagram: the chain of whys, with two regions labeled "Universe of other contributions"]

slide-86
SLIDE 86


Fixating on root cause is an easy trap to fall into.

slide-87
SLIDE 87


Causal analysis and diagnosis are supremely important activities.

slide-88
SLIDE 88

What should we do instead?

Locate contributing factors

slide-89
SLIDE 89

Contributing factors

+ Artifact publishing script didn’t handle a certain exception
+ Builder used non-atomic filesystem writes
+ Filesystem filled up to 100%
+ Non-critical healthcheck dependency marked as REQUIRED
+ No fail-open pool in the DNS traffic director
+ Corrupt data artifact loaded into webapp without verification
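Two of these factors, non-atomic writes and loading without verification, have well-known countermeasures. The sketch below is a generic illustration and not the remediation described in the talk: it writes to a temporary file and renames it into place, and it checks a SHA-256 digest before accepting the data. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import os
import tempfile


def publish_atomically(data: bytes, dest: str) -> str:
    """Write data to dest via temp file + rename; return its SHA-256 hex digest."""
    directory = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # a full disk surfaces here, before the rename
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # never leave a partial file behind
        raise
    return hashlib.sha256(data).hexdigest()


def load_verified(path: str, expected_sha256: str) -> bytes:
    """Read an artifact and refuse it if its digest does not match."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("artifact checksum mismatch; refusing to load")
    return data
```

The temporary file is created in the destination directory so that the rename stays within one filesystem.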

slide-90
SLIDE 90

Debriefings » Synthesis & Analysis » Retro report compiled » Retro report released » Remediation meetings

The Retrospective Process

Learning from incidents

Address urgent remediations

Write report and assemble deliverables

slide-91
SLIDE 91

1. Contributing factors
2. Remaining threats
3. Remediation items
4. Command line history
5. Chat transcripts
6. Graphs
7. Retrospective report

RETROSPECTIVE DELIVERABLES

slide-92
SLIDE 92

Debriefings » Synthesis & Analysis » Retro report compiled » Retro report released » Remediation meetings

The Retrospective Process

Learning from incidents

Address urgent remediations

Promote this material far and wide in your organization. Add this to your incident library.

slide-93
SLIDE 93

Debriefings » Synthesis & Analysis » Retro report compiled » Retro report released » Remediation meetings

The Retrospective Process

Learning from incidents

Address urgent remediations

These happen on the team level. This is where remediation owners are determined.

slide-94
SLIDE 94

+ Execution is team dependent
+ Dive deep into the retrospective report
+ Assign owners for remediation items
+ Discuss finer points of the contributing factors
+ Can continue in perpetuity

REMEDIATION MEETINGS

slide-95
SLIDE 95


We don’t deeply know our systems.

slide-96
SLIDE 96

System as imagined System as found

slide-97
SLIDE 97

System as imagined: urgency: "Weak: Failure of this dependency would result in minor functionality loss"

System as found: urgency: "Required: Failure of this dependency would result in complete system outage"

slide-98
SLIDE 98


Failure

The best opportunity to gain an understanding about how our systems behave is through failure.

slide-99
SLIDE 99


Chaos testing

Test in ALL environments with the goal of validating your hypothesis. Discovering things you didn’t know about your systems is a consequence.

slide-100
SLIDE 100


Failure Myth #5: Safety can be measured by the number of accidents that occur

slide-101
SLIDE 101

+ Number of threats identified and mitigated
+ Number of tests running (including prod)
+ How readily information travels through an organization
+ How reliability work is prioritized compared to feature work
+ How experience with failure influences future design decisions
+ Psychological safety

WHAT CAN BE MEASURED?

slide-102
SLIDE 102

1. Embrace failure as part of your systems
2. Evolve into a learning organization
3. Move the boundary of your systems to include people who interact with them
4. Humans are imperfect responders. Be aware of your cognitive biases.
5. Root cause analysis hinders learning and is not blame-aware. Locate contributing factors.

KEY TAKEAWAYS

slide-103
SLIDE 103


Revisit what catastrophic failure looks like for you. Why isn’t this happening in your organization right now? Do you know what’s going right and why?

slide-104
SLIDE 104


Thank you

Alex Elman Site Reliability Engineer @_pkill

slide-105
SLIDE 105
