Learning From Failure | @_pkill | indeedhi.re/2wKa2Mm

Our cognitive biases are useful adaptations, but they often lead us astray during incident response. You don’t have to eliminate them, but be aware of them.

Availability heuristic: Relying only on the ideas that come to mind when making decisions in uncertain situations.

Focusing effect: The tendency to place too much importance on one aspect of an event.

Illusory correlation: Inaccurately perceiving a relationship between two unrelated events.

Confirmation bias: The tendency to search for, interpret, focus on or discard evidence in a way that confirms one's preconceptions.

The incident lifecycle revisited
[Figure: incident lifecycle diagram. Stage labels: Detection, Mitigation, Cleanup, Prevention, Retro, Diagnosis, Recovery. Annotations: "Responder joined later"; "New symptom emerges"; "Dev helped with identifying DNS issue in Slack".]

The Retrospective Process

The Retrospective Process: Learning from incidents
[Figure: retrospective process timeline. Labels: Retrospective start; Debriefings; Synthesis & Analysis; Retro report compiled; Urgent remediations addressed; Retro report released; Remediation meetings.]

The Retrospective Process: Learning from incidents
Testimony is most accurate within two weeks of return to normalization.
[Figure: retrospective process timeline]

Debriefing

Debrief attendees
+ Debrief facilitator
+ Debrief facilitator trainee
+ Scribe
+ Incident owner
+ Incident participants
+ Retrospective owner
+ Subject matter experts

Qualities of debrief facilitators
+ Impartial: Not involved in the incident
+ Curious: Asks questions
+ Attentive: Listens
+ Respectful: Improves psychological safety
+ Thorough: Captures all relevant testimony
+ Patient: Mediates heated debate
+ Uses shared language: Sufficiently technical

Debrief agenda
1. Facilitator reviews the timeline
2. Facilitator interviews attendees
3. Call for clarifying questions

What questions should a facilitator ask?

What was happening just before the incident? During the incident?

Was there a call for assistance? How was it known who to contact?

How could this incident have been worse?

How did we arrive at the decision to turn off the healthchecking in the load balancer?

Debriefing tips
+ Start debriefs as soon as possible
+ Before the debrief:
+ Send out questions to participants
+ Assess the comfort level of participants
+ Commit someone to scribe or record
+ Conduct 1:1 debriefs if necessary

The Retrospective Process: Learning from incidents
Interviews, narratives, contributing factors, latent threats, impact, remediation items
[Figure: retrospective process timeline]

Avoid counterfactuals
+ “...made a mistake by…”
+ “The developer carelessly…”
+ “... suboptimal decision-making...”
+ “... should have been obvious…”
+ “Could have prevented the outage…”
+ “... failed to verify the change...”

Root cause analysis is a fairy tale

Root cause is also an imprecise concept.

So many choices...
Wikipedia’s definition of root cause:
1. Initiating cause
2. Most basic cause
3. Earliest cause
4. Deepest cause

Wikipedia’s definition of root cause:
1. Initiating: Non-critical healthcheck dependency commit?
2. Most basic: Filesystem exhaustion on build server?
3. Earliest: The Big Bang??
4. Deepest: The Human Condition???

Root cause analysis is too narrow in scope to maximize learning. It leaves important contributions unexplored.

Root cause analysis is not blame-aware.

The Five Whys is also problematic

[Figure: a chain of five "Why?" questions] Is the root cause hiding here somewhere?

[Figure: the same chain of five "Why?" questions, surrounded on both sides by a "universe of other contributions"]

Fixating on root cause is an easy trap to fall into.

Causal analysis and diagnosis are supremely important activities.

What should we do instead?
Locate contributing factors

Contributing factors
+ Artifact publishing script didn’t handle a certain exception
+ Builder used non-atomic filesystem writes
+ Filesystem filled up to 100%
+ Non-critical healthcheck dependency marked as REQUIRED
+ No fail-open pool in the DNS traffic director
+ Corrupt data artifact loaded into webapp without verification
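Two of the contributing factors above (non-atomic filesystem writes and loading an artifact without verification) can be illustrated with a minimal Python sketch. This is not the tooling from the incident; the function names `write_atomically` and `load_verified` are hypothetical, and the approach (write to a temporary file, rename into place, check a SHA-256 digest before use) is one common mitigation, assumed here for illustration.

```python
import hashlib
import os
import tempfile


def write_atomically(path: str, data: bytes) -> str:
    """Hypothetical helper: publish data so readers never see a partial file.

    Returns the SHA-256 hex digest of the data for later verification.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file is created in the destination directory so that
    # os.replace is a rename within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        # A failed write (for example, a full disk) leaves the old file intact.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return hashlib.sha256(data).hexdigest()


def load_verified(path: str, expected_sha256: str) -> bytes:
    """Hypothetical helper: refuse to return an artifact whose digest differs."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError(f"artifact {path} failed verification")
    return data
```

A publisher would record the digest returned by `write_atomically` alongside the artifact, and a consumer would pass it to `load_verified` before loading the data.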

The Retrospective Process: Learning from incidents
Write report and assemble deliverables
[Figure: retrospective process timeline]

Retrospective deliverables
1. Contributing factors
2. Remaining threats
3. Remediation items
4. Command line history
5. Chat transcripts
6. Graphs
7. Retrospective report

The Retrospective Process: Learning from incidents
Promote this material far and wide in your organization. Add this to your incident library.
[Figure: retrospective process timeline]

The Retrospective Process: Learning from incidents
These happen on the team level. This is where remediation owners are determined.
[Figure: retrospective process timeline]

Remediation meetings
+ Execution is team dependent
+ Dive deep into the retrospective report
+ Assign owners for remediation items
+ Discuss finer points of the contributing factors
+ Can continue in perpetuity

We don’t deeply know our systems.

System as imagined vs. system as found

System as imagined:
urgency: "Weak: Failure of this dependency would result in minor functionality loss"

System as found:
urgency: "Required: Failure of this dependency would result in complete system outage"
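The gap between the two urgency declarations above can be sketched in Python. This is a minimal illustration, not the healthcheck framework from the talk: the `Urgency` enum, the `Dependency` class, and the `evaluate` and `urgency_drift` functions are all hypothetical names. The sketch assumes a healthcheck in which only a failed required dependency reports an outage, while a failed weak dependency only degrades the service.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Urgency(Enum):
    REQUIRED = "required"  # failure would result in complete system outage
    WEAK = "weak"          # failure would result in minor functionality loss


@dataclass
class Dependency:
    name: str
    urgency: Urgency
    check: Callable[[], bool]  # returns True when the dependency is healthy


def evaluate(dependencies: List[Dependency]) -> str:
    """Return "OUTAGE", "DEGRADED", or "OK" for a set of dependency checks."""
    status = "OK"
    for dep in dependencies:
        try:
            healthy = dep.check()
        except Exception:
            # A check that raises is treated as an unhealthy dependency.
            healthy = False
        if healthy:
            continue
        if dep.urgency is Urgency.REQUIRED:
            return "OUTAGE"
        status = "DEGRADED"
    return status


def urgency_drift(imagined: Dict[str, Urgency],
                  found: Dict[str, Urgency]) -> List[str]:
    """List dependencies whose declared urgency differs from what is deployed."""
    return sorted(
        name for name in imagined
        if name in found and imagined[name] is not found[name]
    )
```

With a non-critical dependency declared `WEAK`, its failure yields `"DEGRADED"`; the same failure with the dependency marked `REQUIRED` yields `"OUTAGE"`, which is the mismatch the slide describes. Comparing the intended declarations against the deployed ones with `urgency_drift` is one way to surface that mismatch before an incident.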

Failure
The best opportunity to gain an understanding about how our systems behave is through failure.

Chaos testing
Test in ALL environments with the goal of validating your hypothesis. Discovering things you didn’t know about your systems is a consequence.

Failure Myth #5: Safety can be measured by the number of accidents that occur

