Picking up the pieces: A guide to Post Incident Review (@kleeut)



slide-1
SLIDE 1 @kleeut

Picking up the pieces

A guide to Post Incident Review


slide-3
SLIDE 3 @kleeut

Klee Thomas

  • Clean code enthusiast
  • Code Crafter
  • Lover of stupid shirts
  • Organiser of Newcastle Coders Group
  • Senior Software Developer at nib health funds
slide-4
SLIDE 4 @kleeut

  • Agile
  • Pairing
  • Clean Code
  • TDD
  • DevOps
  • Continuous Integration
  • Continuous Delivery
  • Etc.
slide-5
SLIDE 5 @kleeut

Something is going to go wrong

Our customers expect more from our software.
We are building systems that are more complicated and complex.
slide-6
SLIDE 6 @kleeut

Cynefin

  • Complicated
  • Complex
  • Chaotic
  • Simple

slide-7
SLIDE 7 @kleeut

Something is going to go wrong

Our workforce is more and more transient. Something is going to go wrong.
slide-8
SLIDE 8 @kleeut

Create a prepared culture

slide-9
SLIDE 9 @kleeut

Post Incident Review (PIR)

slide-10
SLIDE 10 @kleeut

Analysis of an incident

Exposing and reflecting on:
  • What happened
  • What went wrong
  • How we responded
  • How we can improve
slide-11
SLIDE 11 @kleeut

The Flow of an incident

slide-12
SLIDE 12 @kleeut

Something is going wrong -> Fix it -> Back to work
slide-14
SLIDE 14 @kleeut

Incident Life Cycle

slide-15
SLIDE 15 @kleeut

Detection -> Response -> Resolution -> Analysis -> Readiness
slide-16
SLIDE 16 @kleeut

Detection -> Response -> Resolution -> Analysis -> Readiness

Focus: Analysis
slide-17
SLIDE 17 @kleeut

When to run a PIR

As soon as possible
slide-18
SLIDE 18 @kleeut

As Soon As Possible

  • Memory fades
  • We make fake memories
  • Within 2 days of resolution
slide-19
SLIDE 19 @kleeut

Regularly

  • Do this for large and small incidents
  • We learn more about the weaknesses in our system
  • We get practice at running reviews
slide-20
SLIDE 20 @kleeut

Path to great Post Incident Review

slide-21
SLIDE 21 @kleeut

Example

Something is going wrong
  • Customers stopped being able to access https://klees-example.com.
Fix it
  • Ops added more disk space to the virtual machine.
  • Ops rebooted the server.
Back to work
  • Customer requests went back to being fulfilled.
slide-22
SLIDE 22 @kleeut

Root Cause Analysis

slide-23
SLIDE 23 @kleeut

5 Whys

  • A great technique for Root Cause Analysis
  • Get beyond the immediate answer
  • Just keep asking “Why?”
slide-24
SLIDE 24 @kleeut

Why did the site go down?

  • No disk space.
Why?
  • Too many logs
Why?
  • No log rolling
Why?
  • Using a custom log manager
Why?
  • John didn't want another dependency

  • No disk space.
Why?
  • Nobody added more space
Why?
  • We didn't know space was low
Why?
  • Bill turned off alerts
Why?
  • Too many alerts overnight
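
The first chain above passes through "No log rolling". As a minimal sketch of what log rolling could look like, assuming the service were written in Python (the slides do not say what klees-example.com runs on), the standard library's `RotatingFileHandler` caps how much disk the logs can consume. The logger name, size limit and backup count are illustrative assumptions, not values from the incident.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(path: str, max_bytes: int = 10_000_000,
                          backups: int = 5) -> logging.Logger:
    """Return a logger whose file is rolled over before it reaches max_bytes."""
    logger = logging.getLogger("klees-example")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Keep the active file plus at most `backups` older files (app.log.1, ...).
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With these settings the log files stay bounded at roughly (backups + 1) × max_bytes on disk, however long the service runs.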
slide-25
SLIDE 25 @kleeut

5 Whys - problems

  • No repeatable outcome
  • Root Cause Analysis can lead to blaming an individual
slide-26
SLIDE 26 @kleeut

Blame

  • Blame is natural and human
  • Blame happens when we’re in pain
  • Blame leads to fear
  • Fear leads to hiding/misrepresenting facts
slide-27
SLIDE 27 @kleeut

Blame

If you don't blame a successful product launch on one person, why would you blame a failure on one person?
slide-28
SLIDE 28 @kleeut

Don’t blame the person

“Blame the process, not the people” - W. Edwards Deming
slide-29
SLIDE 29 @kleeut

The Prime Directive

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

  • Norm Kerth, Project Retrospectives: A Handbook for Team Reviews

slide-30
SLIDE 30 @kleeut

Contributing factors

slide-31
SLIDE 31 @kleeut

Ishikawa / Fishbone / Cause & Effect Diagram

[Diagram: Problem, with Primary Causes and Secondary Causes branching off]
slide-32
SLIDE 32 @kleeut

Categories

6 “M”s - Manufacturing
  • Machines
  • Methods
  • Materials
  • Mind (People)
  • Measurement
8 P’s - Product Marketing
  • Product
  • Price
  • Promotion
  • Place
  • Process
  • People
  • Physical Evidence
  • Performance
slide-33
SLIDE 33 @kleeut

Ishikawa / Fishbone / Cause & Effect Diagram

[Diagram: Problem, with branches for People, Methods, Code, Systems and Monitoring]
slide-34
SLIDE 34 @kleeut

Ishikawa / Fishbone / Cause & Effect Diagram

[Diagram: Problem "Klees-example stopped serving requests", with branches for People, Methods, Code, Systems and Monitoring]

Causes shown on the diagram:
  • Too many logs
  • Not enough disk
  • Inadequate checking of server
  • Inadequate alerting
  • Disk
  • Uptime
  • On-call person hard to reach
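
Two of the factors on this diagram, "Not enough disk" and "Inadequate alerting", lend themselves to a small automated check. The sketch below is one hypothetical way to classify disk utilisation in Python. The function names and the 80% / 90% thresholds are assumptions for illustration, not something the slides prescribe.

```python
import shutil


def disk_alert_level(used_bytes: int, total_bytes: int,
                     warn_at: float = 0.80, page_at: float = 0.90) -> str:
    """Classify disk utilisation as 'ok', 'warn' or 'page'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    utilisation = used_bytes / total_bytes
    if utilisation >= page_at:
        return "page"  # wake somebody up
    if utilisation >= warn_at:
        return "warn"  # raise a ticket for working hours
    return "ok"


def check_path(path: str = "/") -> str:
    """Check the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return disk_alert_level(usage.used, usage.total)
```

Splitting the levels is a deliberate choice here: a low-urgency warning that arrives during working hours is less likely to be switched off than a page that fires overnight.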
slide-35
SLIDE 35 @kleeut

Heuristics/Bias

  • Subconscious
  • Problem solving shortcuts
  • Save time
  • Make things more important than they are
  • Risk ignoring valuable learnings
slide-36
SLIDE 36 @kleeut

Bias

Anchoring
  • The first piece of evidence is the most relevant
Availability
  • I can think of it therefore it’s true
Confirmation
  • Favouring evidence that supports what we already believe
slide-37
SLIDE 37 @kleeut

Bias

Hindsight
  • The answer is obvious... if you know the answer
Outcome
  • Just because the outcome was good doesn’t mean it was a good decision
  • Could have, should have, why didn’t
Bandwagon Effect
  • Getting swept up in the crowd
slide-38
SLIDE 38 @kleeut

How I run a PIR

slide-39
SLIDE 39 @kleeut

The Prime Directive

“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”

  • Norm Kerth, Project Retrospectives: A Handbook for Team Reviews

slide-40
SLIDE 40 @kleeut

Summary

  • Incident TL;DR
  • Outline what happened
  • What was the resolution
slide-41
SLIDE 41 @kleeut

What happened

Objective timeline
Multiple points of view
  • People Involved
  • Automated Systems
  • Chat Logs
slide-42
SLIDE 42 @kleeut

Elaborate

Don’t hide what happened
  • What happened
  • What did we do
Don’t ask why X happened
  • ask how it happened
  • what factors informed the decision
slide-43
SLIDE 43 @kleeut

Key Metrics

Who was involved
  • Incident Commander
  • Contributors
Time to Acknowledge:
Time to Recover:
Elapsed Time in each phase (Detection, Response, Remediation)
Severity: (e.g. fatal, critical, moderate, low, false alarm)
slide-44
SLIDE 44 @kleeut

Example

Summary: On January 13, klees-example.com stopped serving requests. We were able to get it back online within 20 minutes by allocating more disk space to the server.
slide-45
SLIDE 45 @kleeut

Timeline

2019-01-12 23:30 - Logs show disk utilisation passes 90%
2019-01-13 09:30 - Logs show 503 responses start occurring in the routers
2019-01-13 09:35 - Logs show no 200 responses in routers at all
2019-01-13 09:40 - Customer calls service desk
2019-01-13 09:41 - Service desk contacts dev via Slack
2019-01-13 09:43 - Devs refer to Ops via Slack
2019-01-13 09:45 - Ops identify 100% disk usage on vmke01
2019-01-13 09:46 - Ops increase virtual disk space by 15%
2019-01-13 09:47 - Ops restart server
2019-01-13 09:49 - Logs show 200 responses in routers
slide-46
SLIDE 46 @kleeut

Who was involved
  • @Jane, @Bill, @Fred
Time to Acknowledge: 11 minutes
Time to Recover: 20 minutes
Elapsed Time in each phase:
  • Detection: 11 minutes
  • Response: 3 minutes
  • Remediation: 4 minutes
Severity: Fatal
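
Metrics like these can be derived from the timeline instead of being estimated from memory. The following is a rough Python sketch that uses timestamps from the example timeline. Which events mark the start of the incident, the acknowledgement and the recovery is an assumption here, and the dictionary keys and function names are hypothetical. Measured from the first 503 responses, it gives 11 minutes to acknowledge and 19 minutes to recover, consistent with "within 20 minutes" in the summary.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the example timeline; the choice of which
# event marks each milestone is an assumption made for this sketch.
TIMELINE = {
    "first_errors": "2019-01-13 09:30",  # 503 responses start
    "acknowledged": "2019-01-13 09:41",  # service desk contacts dev
    "recovered": "2019-01-13 09:49",     # 200 responses return
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def key_metrics(timeline: dict[str, str]) -> dict[str, int]:
    """Compute time to acknowledge and time to recover, in minutes."""
    start = timeline["first_errors"]
    return {
        "time_to_acknowledge": minutes_between(start, timeline["acknowledged"]),
        "time_to_recover": minutes_between(start, timeline["recovered"]),
    }
```

Calling `key_metrics(TIMELINE)` returns both values in minutes. Moving a milestone to a different event changes the numbers, so it helps to write the chosen boundaries into the review itself.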
slide-47
SLIDE 47 @kleeut

What went well?

For all the bad stuff, something must have gone well.
Look at all the phases.
How can you be more ready?
slide-48
SLIDE 48 @kleeut

What could we improve?

There are going to be areas that didn’t work so well.
Be aware of blame.
  • Understand what led to actions.
  • Identify processes that may have failed or been missing.
Look at all the phases.
How can you be more ready?
slide-49
SLIDE 49 @kleeut

Action Items

  • Document them as they come up (Parking Lot)
  • Small or large, immediate and long term
  • Commit to some, but not necessarily all
  • Add them to your issue trackers
  • Assign them
  • Feed back into all stages of the life cycle
slide-50
SLIDE 50 @kleeut

Overview

  • The incident lifecycle: Detection -> Response -> Remediation -> Analysis -> Readiness
  • Avoid blame with an objective and honest timeline of events
  • Identify what went well and what went poorly
  • Track your actions
  • Run reviews often, even on small things
slide-51
SLIDE 51 @kleeut

Klee Thomas

@kleeut