DATA-DRIVEN POSTMORTEMS
ILAN RABINOVITCH, DATADOG @IRABINOVITCH
$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events
Datadog Overview
“THE PROBLEMS WE WORK ON AT DATADOG ARE HARD AND OFTEN DON'T HAVE OBVIOUS, CLEAN- CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE YOUR TROUBLESHOOTING SKILLS, NO MATTER WHAT ROLE YOU WORK IN.”
Internal Datadog Developer Guide
“AN ANALYSIS OR DISCUSSION OF AN EVENT HELD SOON AFTER IT HAS OCCURRED, ESPECIALLY IN ORDER TO DETERMINE WHY IT WAS A FAILURE.”
OXFORD ENGLISH DICTIONARY
POSTMORTEM
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
WHAT IS DEVOPS?
▸ Culture
▸ Automation
▸ Metrics
▸ Sharing
OUR FOCUS AREA
▸ Culture
▸ Sharing
CULTURE & SHARING RESOURCES
▸ Blameless Postmortems by John Allspaw
http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems by Dave Zwieback
http://bit.ly/human-postmortem
CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT
SO INSTRUMENT ALL THE THINGS!
METRICS
▸ Well-understood
▸ Granular
▸ Tagged by scope
▸ Long-lived
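One way to picture these properties is a metric point that carries a descriptive name, a raw timestamp, and scope tags. The sketch below is illustrative only: `MetricPoint`, `filter_by_tags`, the metric name, and the tag keys are all hypothetical, and none of this is Datadog's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetricPoint:
    """A single observation. All names here are hypothetical."""
    name: str         # well-understood: a descriptive, agreed-upon metric name
    timestamp: float  # granular: one point per collection interval (epoch seconds)
    value: float
    tags: Dict[str, str] = field(default_factory=dict)  # tagged by scope


def filter_by_tags(points: List[MetricPoint], **scope: str) -> List[MetricPoint]:
    """Return the points whose tags match every key/value pair in `scope`."""
    return [
        p for p in points
        if all(p.tags.get(key) == value for key, value in scope.items())
    ]


# Made-up data for illustration.
points = [
    MetricPoint("web.request.latency_ms", 1000.0, 120.0, {"env": "prod", "az": "us-east-1a"}),
    MetricPoint("web.request.latency_ms", 1000.0, 95.0, {"env": "prod", "az": "us-east-1b"}),
    MetricPoint("web.request.latency_ms", 1000.0, 300.0, {"env": "staging", "az": "us-east-1a"}),
]

prod_points = filter_by_tags(points, env="prod")
prod_1a = filter_by_tags(points, env="prod", az="us-east-1a")
```

Scope tags are what let a postmortem narrow an outage to one environment or zone after the fact. "Long-lived" is a retention decision made in the metrics store, so it is not shown here.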
RECURSE UNTIL YOU FIND THE TECHNICAL CAUSE
HUMAN DATA
▸ Everyone!
▸ Responders
▸ Identifiers
▸ Affected Users
HUMAN DATA
DATA COLLECTION: WHAT?
▸ Their perspective
▸ What they did
▸ What they thought
▸ Why they thought/did it
TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES
… we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously …
Joyent Postmortem http://bit.ly/joyent-post
JOYENT US-EAST-1 POST-MORTEM 2014
“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”
“ONE PICTURE IS WORTH TEN THOUSAND WORDS”
HUMAN DATA
▸ As soon as possible
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
HUMAN DATA
▸ Stress
▸ Sleep deprivation
▸ Burnout
HUMAN DATA
▸ Blame/Fear of punitive action
▸ Bias
  ▸ Anchoring
  ▸ Hindsight
  ▸ Outcome
  ▸ Availability
  ▸ Recency
DATADOG POSTMORTEMS
A FEW NOTES
▸ Postmortems emailed company-wide
▸ Scheduled recurring postmortem meetings
DATADOG’S POSTMORTEM TEMPLATE (1/5)
▸ Describe what happened here at a high level; think of it as an abstract in a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?
DATADOG’S POSTMORTEM TEMPLATE (2/5)
▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
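The last detection question can be answered from three timestamps in the incident timeline. This is a minimal sketch, assuming you have recorded when the problem started, when a monitor first alerted, and when a human declared the outage. The function name, field names, and timestamps are hypothetical.

```python
from datetime import datetime


def detection_delays(started: datetime, alerted: datetime, declared: datetime) -> dict:
    """Compute how long alerting and declaration took for one incident."""
    if alerted < started or declared < started:
        raise ValueError("alerted and declared must not precede started")
    return {
        "time_to_alert": alerted - started,
        "time_to_declare": declared - started,
    }


# Made-up timestamps for illustration only.
delays = detection_delays(
    started=datetime(2016, 1, 1, 10, 0),
    alerted=datetime(2016, 1, 1, 10, 4),
    declared=datetime(2016, 1, 1, 10, 15),
)
print(delays["time_to_alert"])    # 0:04:00
print(delays["time_to_declare"])  # 0:15:00
```

A large gap between the two values suggests the metric and monitor existed but the response process was slow. A missing alert timestamp means there was no monitor to begin with.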
DATADOG’S POSTMORTEM TEMPLATE (3/5)
▸ Who was the incident owner & who else was involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
*Names changed
DATADOG’S POSTMORTEM TEMPLATE (4/5)
▸ Deep dive into the cause
▸ Examples from this incident:
  ▸ http://bit.ly/dd-statuspage
  ▸ http://bit.ly/alq-postmortem
DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT IN THE FUTURE?
▸ Link to GitHub issues and Trello cards
▸ Now?
▸ Next?
▸ Later?
▸ Follow-up notes
*Names changed
DATADOG’S POSTMORTEM TEMPLATE
▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
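The five questions above can be kept as a small skeleton so that every postmortem starts with the same sections. This is only a sketch: the `Postmortem` class, its field names, and the `TODO` placeholder are assumptions for illustration, not Datadog's internal tooling.

```python
from dataclasses import dataclass, field
from typing import List

# (attribute name, section heading) pairs, in template order.
SECTIONS = [
    ("summary", "What happened (summary)?"),
    ("detection", "How did we detect it?"),
    ("response", "How did we respond?"),
    ("deep_dive", "Why did it happen (deep dive)?"),
    ("next_steps", "Actionable next steps!"),
]


@dataclass
class Postmortem:
    """One field per template section. Field names are hypothetical."""
    title: str
    summary: str = ""
    detection: str = ""
    response: str = ""
    deep_dive: str = ""
    next_steps: List[str] = field(default_factory=list)


def missing_sections(pm: Postmortem) -> List[str]:
    """Return the headings of sections that are still empty."""
    return [heading for attr, heading in SECTIONS if not getattr(pm, attr)]


def render(pm: Postmortem) -> str:
    """Render the postmortem as plain text, marking empty sections as TODO."""
    lines = [pm.title.upper()]
    for attr, heading in SECTIONS:
        lines.append("")
        lines.append(heading)
        value = getattr(pm, attr)
        if not value:
            lines.append("TODO")
        elif isinstance(value, list):
            lines.extend(f"▸ {item}" for item in value)
        else:
            lines.append(value)
    return "\n".join(lines)


# Made-up draft for illustration.
draft = Postmortem(title="Example outage", summary="API errors for 20 minutes.")
print(render(draft))
print(missing_sections(draft))
```

Checking `missing_sections` before the postmortem is sent out is one way to make sure no question is skipped.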
KEEP LEARNING
MORE RESOURCES
▸ The Infinite Hows - John Allspaw
http://bit.ly/infinite-hows
▸ “Blameless” Postmortems don’t work - J Paul Reed
http://bit.ly/blameless-dont-work
▸ Monitoring 101 - Alexis Lê-Quôc
http://dtdg.co/monitoring-101-data
LET’S TALK! @IRABINOVITCH @DATADOGHQ