DATA-DRIVEN POSTMORTEMS, ILAN RABINOVITCH, DATADOG @IRABINOVITCH


slide-1
SLIDE 1

DATA-DRIVEN POSTMORTEMS

ILAN RABINOVITCH, DATADOG @IRABINOVITCH

slide-2
SLIDE 2

$ finger ilan@datadog

[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events

slide-3
SLIDE 3
Datadog Overview

  • SaaS-based infrastructure and app monitoring
  • Open Source Agent
  • Time series data (metrics and events)
  • Processing nearly a trillion data points per day
  • Intelligent Alerting
  • We’re hiring! (www.datadoghq.com/careers/)

slide-4
SLIDE 4

“THE PROBLEMS WE WORK ON AT DATADOG ARE HARD AND OFTEN DON'T HAVE OBVIOUS, CLEAN-CUT SOLUTIONS, SO IT'S USEFUL TO CULTIVATE YOUR TROUBLESHOOTING SKILLS, NO MATTER WHAT ROLE YOU WORK IN.”

Internal Datadog Developer Guide

slide-5
SLIDE 5

“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.”

HENRY FORD
slide-6
SLIDE 6

POSTMORTEM

“AN ANALYSIS OR DISCUSSION OF AN EVENT HELD SOON AFTER IT HAS OCCURRED, ESPECIALLY IN ORDER TO DETERMINE WHY IT WAS A FAILURE.”

OXFORD ENGLISH DICTIONARY

slide-7
SLIDE 7

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

WHAT IS DEVOPS?

▸ Culture
▸ Automation
▸ Metrics
▸ Sharing

slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

OUR FOCUS AREA

▸ Culture
▸ Sharing

slide-11
SLIDE 11

BLAMELESS POSTMORTEMS

slide-12
SLIDE 12
slide-13
SLIDE 13

CULTURE & SHARING RESOURCES

BLAMELESS POSTMORTEMS

▸ Blameless Postmortems by John Allspaw
  http://bit.ly/etsy-blameless
▸ The Human Side of Postmortems by Dave Zwieback
  http://bit.ly/human-postmortem

slide-14
SLIDE 14

CULTURE & SHARING ARE GREAT, BUT WHAT ABOUT METRICS?

slide-15
SLIDE 15
slide-16
SLIDE 16

Follow @honest_update on Twitter
slide-17
SLIDE 17

COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE

SO INSTRUMENT ALL THE THINGS!
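
One way to read "instrument all the things" in code is a small wrapper that records a call count, an error count, and a duration for any function it decorates. This is a minimal sketch, not Datadog's agent or client library: the in-memory `METRICS` store, the `record` helper, and the metric names are all hypothetical and exist only for illustration.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: metric name -> list of (timestamp, value).
# A real setup would ship these points to a metrics backend instead.
METRICS = defaultdict(list)


def record(name, value):
    """Append one timestamped data point for the named metric."""
    METRICS[name].append((time.time(), value))


def instrumented(func):
    """Record calls, errors, and duration for every invocation of func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            record(f"{func.__name__}.errors", 1)
            raise
        finally:
            # Runs on success and on failure, so no call goes uncounted.
            record(f"{func.__name__}.calls", 1)
            record(f"{func.__name__}.duration_seconds",
                   time.perf_counter() - start)
    return wrapper


@instrumented
def handle_request(payload):
    # Stand-in for real work.
    return payload.upper()


result = handle_request("ping")
```

The point of the wrapper is that the data exists before anyone knows it will be needed; nothing has to be added during an incident.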

slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

METRICS

4 QUALITIES OF GOOD METRICS

▸ Well-understood
▸ Granular
▸ Tagged by scope
▸ Long-lived
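
To make "tagged by scope" concrete, the sketch below stores each data point with a set of `key:value` tags and averages a metric over whichever scope is asked for. The `Point` type, the `average` helper, the metric name, and the tag values are assumptions made up for this example; they are not taken from the talk or from any particular product.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Point:
    name: str
    timestamp: int      # seconds since epoch; one point per second is "granular"
    value: float
    tags: frozenset     # e.g. frozenset({"region:us-east", "role:web"})


def point(name, timestamp, value, *tags):
    """Convenience constructor that freezes the tag list."""
    return Point(name, timestamp, value, frozenset(tags))


def average(points, name, *scope):
    """Average `name` over the points that carry every tag in `scope`."""
    wanted = set(scope)
    values = [p.value for p in points
              if p.name == name and wanted <= p.tags]
    return mean(values) if values else None


# Illustrative data only.
points = [
    point("request.latency_ms", 1000, 120.0, "region:us-east", "role:web"),
    point("request.latency_ms", 1001, 80.0, "region:us-east", "role:web"),
    point("request.latency_ms", 1001, 300.0, "region:eu-west", "role:web"),
]

us_east = average(points, "request.latency_ms", "region:us-east")
```

Because the scope lives in tags rather than in the metric name, the same points can be sliced by region, by role, or by both without re-instrumenting anything.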

slide-23
SLIDE 23

RECURSE UNTIL YOU FIND THE TECHNICAL CAUSE

slide-24
SLIDE 24

IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM

slide-25
SLIDE 25

HUMAN DATA

DATA COLLECTION: WHO?

▸ Everyone!
▸ Responders
▸ Identifiers
▸ Affected Users

slide-26
SLIDE 26

HUMAN DATA

DATA COLLECTION: WHAT?

▸ Their perspective
▸ What they did
▸ What they thought
▸ Why they thought/did it

slide-27
SLIDE 27

HUMAN ELEMENT

TECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES

slide-28
SLIDE 28
slide-29
SLIDE 29

… we will be dramatically improving the tooling that humans (and systems) interact with such that input validation is much more strict and will not allow for all servers, and control plane servers to be rebooted simultaneously …

Joyent Postmortem: http://bit.ly/joyent-post

JOYENT US-EAST-1 POST-MORTEM 2014
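
The stricter input validation described in the Joyent quote can be sketched as a guard that runs before any reboot command is issued. This is an illustration of the idea only, not Joyent's tooling: the function name, the host names, and the `max_fraction` limit are assumptions introduced here.

```python
class UnsafeRebootError(ValueError):
    """Raised when a reboot request is too broad to run without review."""


def validate_reboot(targets, fleet, control_plane, max_fraction=0.25):
    """Return the sorted target list, or raise if the request is unsafe.

    max_fraction is a hypothetical policy knob: the largest share of the
    fleet that one command may reboot.
    """
    targets, fleet, control_plane = set(targets), set(fleet), set(control_plane)
    if not targets:
        raise UnsafeRebootError("no targets given; refusing to guess")
    unknown = targets - fleet
    if unknown:
        raise UnsafeRebootError(f"unknown hosts: {sorted(unknown)}")
    if control_plane and control_plane <= targets:
        raise UnsafeRebootError("would reboot every control plane server at once")
    if len(targets) > max_fraction * len(fleet):
        raise UnsafeRebootError(
            f"{len(targets)} of {len(fleet)} hosts exceeds the per-command limit")
    return sorted(targets)


# Illustrative fleet: eight compute nodes and two control plane servers.
FLEET = [f"node{i}" for i in range(8)] + ["cp0", "cp1"]
CONTROL_PLANE = ["cp0", "cp1"]

approved = validate_reboot(["node1", "node0"], FLEET, CONTROL_PLANE)
```

A request that names the whole fleet fails the control plane check before anything is rebooted, which turns a typo into an error message instead of an outage.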

slide-30
SLIDE 30

“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”

RICHARD GUINDON

slide-31
SLIDE 31

“ONE PICTURE IS WORTH TEN THOUSAND WORDS”

CHINESE PROVERB

slide-32
SLIDE 32

HUMAN DATA

DATA COLLECTION: WHEN?

▸ As soon as possible.
▸ Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases

slide-33
SLIDE 33

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Stress
▸ Sleep deprivation
▸ Burnout

slide-34
SLIDE 34

HUMAN DATA

DATA SKEW/CORRUPTION

▸ Blame/Fear of punitive action
▸ Bias
  ▸ Anchoring
  ▸ Hindsight
  ▸ Outcome
  ▸ Availability
  ▸ Recency

slide-35
SLIDE 35

HOW WE DO POSTMORTEMS AT DATADOG

slide-36
SLIDE 36

DATADOG POSTMORTEMS

A FEW NOTES

▸ Postmortems emailed company-wide
▸ Scheduled recurring postmortem meetings

slide-37
SLIDE 37

DATADOG’S POSTMORTEM TEMPLATE (1/5)

SUMMARY: WHAT HAPPENED?

▸ Describe what happened here at a high level; think of it as an abstract in a scientific paper.
▸ What was the impact on customers?
▸ What was the severity of the outage?
▸ What components were affected?
▸ What ultimately resolved the outage?

slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40

DATADOG’S POSTMORTEM TEMPLATE (2/5)

HOW WAS THE OUTAGE DETECTED?

▸ We want to make sure we detected the issue early and would catch the same issue if it were to repeat.
▸ Did we have a metric that showed the outage?
▸ Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
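
The last question, how long it took to declare an outage, is simple arithmetic once the timestamps are written down. The sketch below assumes three recorded moments (when the problem started, when a monitor first fired, and when an outage was declared); the function name, the field names, and the example times are hypothetical.

```python
from datetime import datetime


def detection_report(started, first_alert, declared):
    """Return minutes from outage start to first alert and to declaration.

    first_alert may be None when no monitor fired, which is itself a finding.
    """
    def minutes(later, earlier):
        return (later - earlier).total_seconds() / 60

    return {
        "time_to_alert_min": (minutes(first_alert, started)
                              if first_alert is not None else None),
        "time_to_declare_min": minutes(declared, started),
    }


# Illustrative timestamps only.
report = detection_report(
    started=datetime(2016, 1, 1, 10, 0),
    first_alert=datetime(2016, 1, 1, 10, 4),
    declared=datetime(2016, 1, 1, 10, 12),
)
```

A `None` in `time_to_alert_min` answers "was there a monitor on that metric?" directly, and suggests an obvious follow-up item.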

slide-41
SLIDE 41
slide-42
SLIDE 42
slide-43
SLIDE 43

DATADOG’S POSTMORTEM TEMPLATE (3/5)

HOW DID WE RESPOND?

▸ Who was the incident owner & who else was involved?
▸ Slack archive links and timeline of events!
▸ What went well?
▸ What didn’t go so well?
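
A chat archive can be turned into a first-draft timeline mechanically. The sketch below assumes an exported log with one message per line in the form `<ISO timestamp> <author>: <text>`; that format, the function names, and the sample messages are assumptions for illustration and do not reflect Slack's actual export format.

```python
from datetime import datetime


def parse_line(line):
    """Split '<ISO timestamp> <author>: <text>' into its three parts."""
    stamp, rest = line.split(" ", 1)
    author, text = rest.split(": ", 1)
    return datetime.fromisoformat(stamp), author, text


def build_timeline(lines):
    """Sort messages by time and label each with minutes since the first."""
    events = sorted(parse_line(line) for line in lines if line.strip())
    if not events:
        return []
    start = events[0][0]
    return [
        f"T+{int((ts - start).total_seconds() // 60):>3}m  {author}: {text}"
        for ts, author, text in events
    ]


# Illustrative messages, deliberately out of order.
chat_log = [
    "2016-01-01T10:12:00 bob: declaring an outage",
    "2016-01-01T10:04:00 alice: latency monitor fired",
    "2016-01-01T10:30:00 alice: rollback done, latency recovering",
]

timeline = build_timeline(chat_log)
```

The generated timeline is only a starting point; responders still need to annotate it with what they were thinking at each step.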

slide-44
SLIDE 44

*Names changed

slide-45
SLIDE 45

CHATOPS ARCHIVES FTW!

*Names changed

slide-46
SLIDE 46

*Names changed

TRACK LEARNINGS AS YOU GO

slide-47
SLIDE 47

DATADOG’S POSTMORTEM TEMPLATE (4/5)

WHY DID IT HAPPEN?

▸ Deep dive into the cause
▸ Examples from this incident:
  ▸ http://bit.ly/dd-statuspage
  ▸ http://bit.ly/alq-postmortem

slide-48
SLIDE 48

DATADOG’S POSTMORTEM TEMPLATE (5/5)

HOW DO WE PREVENT IT IN THE FUTURE?

▸ Link to GitHub issues and Trello cards
  ▸ Now?
  ▸ Next?
  ▸ Later?
▸ Follow up notes

slide-49
SLIDE 49

*Names changed

slide-50
SLIDE 50

DATADOG’S POSTMORTEM TEMPLATE

RECAP:

▸ What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸ Why did it happen (deep dive)?
▸ Actionable next steps!
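
The five-part recap can be kept honest with a small structure that knows which sections are still empty. This is a sketch of one possible encoding, not Datadog's internal template: the `Postmortem` class, its field names, and the `TODO` placeholder are assumptions introduced for this example.

```python
from dataclasses import dataclass, field

# Section attribute -> heading, in the order given in the recap.
SECTIONS = [
    ("summary", "What happened (summary)?"),
    ("detection", "How did we detect it?"),
    ("response", "How did we respond?"),
    ("cause", "Why did it happen (deep dive)?"),
]


@dataclass
class Postmortem:
    title: str
    summary: str = ""
    detection: str = ""
    response: str = ""
    cause: str = ""
    next_steps: list = field(default_factory=list)

    def missing_sections(self):
        """List the headings that have no content yet."""
        missing = [heading for attr, heading in SECTIONS
                   if not getattr(self, attr).strip()]
        if not self.next_steps:
            missing.append("Actionable next steps!")
        return missing

    def render(self):
        """Render the postmortem as plain text, marking gaps with TODO."""
        lines = [self.title.upper(), ""]
        for attr, heading in SECTIONS:
            lines += [heading, getattr(self, attr).strip() or "TODO", ""]
        lines.append("Actionable next steps!")
        lines += [f"▸ {step}" for step in self.next_steps] or ["TODO"]
        return "\n".join(lines)


# Illustrative draft with only two parts filled in.
draft = Postmortem(
    title="Example outage",
    summary="Illustrative text only.",
    next_steps=["Add a monitor on queue depth"],
)
missing = draft.missing_sections()
```

Checking `missing_sections()` before sending the document makes it harder to skip the detection or response sections when the cause feels obvious.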

slide-51
SLIDE 51

KEEP LEARNING

MORE RESOURCES

▸ The Infinite Hows - John Allspaw
  http://bit.ly/infinite-hows
▸ “Blameless” Postmortems don’t work - J Paul Reed
  http://bit.ly/blameless-dont-work
▸ Monitoring 101 - Alexis Lê-Quôc
  http://dtdg.co/monitoring-101-data

slide-52
SLIDE 52

QUESTIONS?

LET’S TALK! @IRABINOVITCH @DATADOGHQ