Picking up the pieces A guide to Post Incident Review @kleeut

Klee Thomas
Clean code enthusiast
Code Crafter
Lover of stupid shirts
Organiser of Newcastle Coders Group
Senior Software Developer at nib health funds

Agile, Pairing, Clean Code, TDD, DevOps, Continuous Integration, Continuous Delivery, etc.

Something is going to go wrong
Our customers expect more from our software.
We are building systems that are more complicated and complex.

Cynefin: Complex, Complicated, Chaotic, Simple

Something is going to go wrong
Our workforce is more and more transient.
Something is going to go wrong.

Create a prepared culture

Post Incident Review (PIR)

Analysis of an incident
Exposing Reflection on:
● What happened
● What went wrong
● How we responded
● How we can improve

The Flow of an incident

Something is going wrong -> Fix it -> Back to work

Incident Life Cycle

Detection -> Response -> Resolution -> Analysis -> Readiness

Detection -> Response -> Resolution -> Analysis -> Readiness (Analysis highlighted)

When to run a PIR: As soon as possible

As Soon As Possible
Memory fades
We make fake memories
Within 2 days of resolution

Regularly
Do this for large and small incidents
We learn more about the weaknesses in our system
We get practice at running reviews.

Path to great Post Incident Review

Example
Something is going wrong: Customers stopped being able to access https://klees-example.com.
Fix it: Ops added more disk space to the virtual machine. Ops rebooted the server.
Back to work: Customer requests went back to being fulfilled.

Root Cause Analysis

5 Whys
A great technique for Root Cause analysis
Get beyond the immediate answer
Just keep asking “Why?”

Why did the site go down?
One line of questioning:
• No disk space. Why?
• Too many logs. Why?
• No log rolling. Why?
• Using a custom log manager. Why?
• John didn't want another dependency.
Another line of questioning:
• No disk space. Why?
• Nobody added more space. Why?
• We didn't know space was low. Why?
• Bill turned off alerts. Why?
• Too many alerts overnight.
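The two lines of questioning on this slide can be written down as plain lists to see where they part ways. The following is only a sketch: the answers are transcribed from the slide, while the names `chain_a`, `chain_b`, `shared_prefix` and `render` are made up for illustration.

```python
from itertools import takewhile

# Answers transcribed from the slide; each list is one line of questioning.
chain_a = [
    "No disk space",
    "Too many logs",
    "No log rolling",
    "Using a custom log manager",
    "John didn't want another dependency",
]
chain_b = [
    "No disk space",
    "Nobody added more space",
    "We didn't know space was low",
    "Bill turned off alerts",
    "Too many alerts overnight",
]


def shared_prefix(a, b):
    """Return the leading answers that both chains have in common."""
    pairs = takewhile(lambda pair: pair[0] == pair[1], zip(a, b))
    return [left for left, _ in pairs]


def render(question, chain):
    """Format a chain as the opening question followed by numbered, indented answers."""
    lines = [question]
    for depth, answer in enumerate(chain, start=1):
        lines.append(f"{'  ' * depth}{depth}. {answer}")
    return "\n".join(lines)


common = shared_prefix(chain_a, chain_b)
roots = (chain_a[-1], chain_b[-1])

print(render("Why did the site go down?", chain_a))
print(render("Why did the site go down?", chain_b))
print(f"Chains agree for {len(common)} step(s), then end at: {roots}")
```

Both chains start from the same answer and finish at a different person, which is the pattern the next slides warn about.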

5 Whys - problems
No repeatable outcome
Root Cause analysis can lead to blaming an individual.

Blame
Blame is natural and human
Blame happens when we’re in pain
Blame leads to fear
Fear leads to hiding/misrepresenting facts

Blame
If you don't blame a successful product launch on one person, why would you blame a failure on one person?

Don’t blame the person
Blame the process, not the people - W. Edwards Deming

The Prime Directive
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
- Norm Kerth, Project Retrospectives: A Handbook for Team Reviews

Contributing factors

Ishikawa / Fishbone / Cause & Effect Diagram
[Diagram labels: Primary Causes, Secondary Causes, Problem]

Categories
6 “M”s - Manufacturing: Machines, Methods, Materials, Mind (People), Measurement
8 “P”s - Product Marketing: Product, Price, Promotion, Place, Process, People, Physical Evidence, Performance

Ishikawa / Fishbone / Cause & Effect Diagram
Categories: Monitoring, Methods, People, Code, Systems
Problem

Ishikawa / Fishbone / Cause & Effect Diagram
Problem: Klees-example stopped serving requests
Categories: Monitoring, Methods, People, Code, Systems
Causes shown on the diagram:
• On-call person hard to reach
• Inadequate alerting
• Inadequate checking of disk
• Server uptime
• Not enough disk
• Too many logs

Heuristics/Bias
• Subconscious
• Problem solving shortcuts
• Save time
• Make things more important than they are
• Risk ignoring valuable learnings

Bias
Anchoring - The first piece of evidence is the most relevant
Availability - I can think of it therefore it’s true
Confirmation - Just because the outcome was good doesn’t mean it was a good decision

Bias
Hindsight - The answer is obvious... if you know the answer
Outcome - Could have, should have, why didn’t
Bandwagon Effect - Getting swept up in the crowd

How I run a PIR

The Prime Directive
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
- Norm Kerth, Project Retrospectives: A Handbook for Team Reviews

Summary
Incident TL;DR
Outline what happened
What was the resolution

What happened
Objective Timeline
Multiple points of view
• People Involved
• Automated Systems
• Chat Logs

Elaborate
Don’t hide what happened
• What happened
• What did we do
Don’t ask why X happened
• Ask how it happened
• What factors informed the decision

Key Metrics
Who was involved
• Incident Commander
• Contributors
Time to Acknowledge:
Time to Recover:
Elapsed Time in each phase (Detection, Response, Remediation)
Severity: (e.g. fatal, critical, moderate, low, false alarm)

Example
Summary: On January 13, klees-example.com stopped serving requests. We were able to get it back online within 20 minutes by allocating more disk space to the server.

Timeline
2019-01-12 23:30 - Logs show disk utilisation passes 90%
2019-01-13 09:30 - Logs show 503 responses start occurring in the routers
2019-01-13 09:35 - Logs show no 200 responses in routers at all
2019-01-13 09:40 - Customer calls service desk
2019-01-13 09:41 - Service desk contacts dev via Slack
2019-01-13 09:43 - Devs refer to Ops via Slack
2019-01-13 09:45 - Ops identify 100% disk usage on vmke01
2019-01-13 09:46 - Ops increase virtual disk space by 15%
2019-01-13 09:47 - Ops restart server
2019-01-13 09:49 - Logs show 200 responses in routers

Who was involved
• @Jane, @Bill, @Fred
Time to Acknowledge: 11 minutes
Time to Recover: 20 minutes
Elapsed Time in each phase:
• Detection: 11 minutes
• Response: 3 minutes
• Remediation: 4 minutes
Severity: Fatal
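As a rough sketch, the headline metrics can be derived straight from an objective timeline like the one in this example. The choice of boundary events here is an assumption, not something the slides state: the incident is taken to start at the first 503 responses, to be acknowledged when the service desk contacts dev, and to be recovered when 200 responses reappear. The helper names `when` and `minutes_between` are hypothetical. With these boundaries the time to recover works out to 19 minutes, which is consistent with the "within 20 minutes" in the summary.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (timestamp, event) pairs transcribed from the example timeline.
TIMELINE = [
    ("2019-01-12 23:30", "Logs show disk utilisation passes 90%"),
    ("2019-01-13 09:30", "Logs show 503 responses start occurring in the routers"),
    ("2019-01-13 09:35", "Logs show no 200 responses in routers at all"),
    ("2019-01-13 09:40", "Customer calls service desk"),
    ("2019-01-13 09:41", "Service desk contacts dev via Slack"),
    ("2019-01-13 09:43", "Devs refer to Ops via Slack"),
    ("2019-01-13 09:45", "Ops identify 100% disk usage on vmke01"),
    ("2019-01-13 09:46", "Ops increase virtual disk space by 15%"),
    ("2019-01-13 09:47", "Ops restart server"),
    ("2019-01-13 09:49", "Logs show 200 responses in routers"),
]


def when(fragment):
    """Return the timestamp of the first event whose text contains `fragment`."""
    for stamp, event in TIMELINE:
        if fragment in event:
            return datetime.strptime(stamp, FMT)
    raise ValueError(f"no event matching {fragment!r}")


def minutes_between(start_fragment, end_fragment):
    """Whole minutes elapsed between two events identified by text fragments."""
    delta = when(end_fragment) - when(start_fragment)
    return int(delta.total_seconds() // 60)


# Assumed boundaries: first 503s -> service desk contacts dev -> 200s return.
time_to_acknowledge = minutes_between("503 responses", "Service desk contacts dev")
time_to_recover = minutes_between("503 responses", "Logs show 200 responses")

print(f"Time to Acknowledge: {time_to_acknowledge} minutes")  # 11 minutes
print(f"Time to Recover: {time_to_recover} minutes")          # 19 minutes
```

Writing down which events mark each boundary keeps the numbers comparable from one review to the next.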

What went well?
For all the bad stuff, something must have gone well.
Look at all the phases.
How can you be more ready?

What could we improve?
There are going to be areas that didn’t work so well.
Be aware of blame.
• Understand what led to actions.
• Identify processes that may have failed or been missing.
Look at all the phases.
How can you be more ready?

Action Items
Document them as they come up (Parking Lot)
Small or large, immediate and long term
Commit to some, but not necessarily all.
Add them to your issue trackers, assign them
Feed back into all stages of the life cycle.

Overview
The incident lifecycle: Detection -> Response -> Remediation -> Analysis -> Readiness.
Avoid blame with an objective and honest timeline of events.
Identify what went well and what went poorly.
Track your actions.
Run reviews often, even on small things.

Klee Thomas @kleeut
