Preparing for the Unexpected, by Samuel Parkinson



slide-1
SLIDE 1 #qconlondon #prepfortheunexpected

Preparing for the Unexpected

Samuel Parkinson
samuel.parkinson@ft.com

Photo by Hush Naidoo on Unsplash
slide-2
SLIDE 2 #qconlondon #prepfortheunexpected
slide-3
SLIDE 3 #qconlondon #prepfortheunexpected

Let’s start with a story

slide-4
SLIDE 4 #qconlondon #prepfortheunexpected

What’s the worst thing that could happen?

slide-5
SLIDE 5 #qconlondon #prepfortheunexpected
slide-6
SLIDE 6 #qconlondon #prepfortheunexpected
slide-7
SLIDE 7 #qconlondon #prepfortheunexpected
slide-8
SLIDE 8 #qconlondon #prepfortheunexpected
slide-9
SLIDE 9 #qconlondon #prepfortheunexpected
slide-10
SLIDE 10 #qconlondon #prepfortheunexpected *************
slide-11
SLIDE 11 #qconlondon #prepfortheunexpected
slide-12
SLIDE 12 #qconlondon #prepfortheunexpected

The FT.com zone was missing

slide-13
SLIDE 13 #qconlondon #prepfortheunexpected
slide-14
SLIDE 14 #qconlondon #prepfortheunexpected

FT.com has over 5,100 subdomains 😭

slide-15
SLIDE 15 #qconlondon #prepfortheunexpected

This impacted the whole company

slide-16
SLIDE 16 #qconlondon #prepfortheunexpected
slide-17
SLIDE 17 #qconlondon #prepfortheunexpected
slide-18
SLIDE 18 #qconlondon #prepfortheunexpected

😲

slide-19
SLIDE 19 #qconlondon #prepfortheunexpected

We had never prepared for such an incident

slide-20
SLIDE 20 #qconlondon #prepfortheunexpected

It’s a classic data loss situation

slide-21
SLIDE 21 #qconlondon #prepfortheunexpected
slide-22
SLIDE 22 #qconlondon #prepfortheunexpected

Our provider had a partial backup

slide-23
SLIDE 23 #qconlondon #prepfortheunexpected

But critical records we used for DNS load balancing were missing 👼

slide-24
SLIDE 24 #qconlondon #prepfortheunexpected

About 10 people worked to resolve the incident

slide-25
SLIDE 25 #qconlondon #prepfortheunexpected

And over 30 people were online to follow along

slide-26
SLIDE 26 #qconlondon #prepfortheunexpected

Most were not called, but still volunteered their time

slide-27
SLIDE 27 #qconlondon #prepfortheunexpected
slide-28
SLIDE 28 #qconlondon #prepfortheunexpected

4h 30m

The first hour was a total outage.
slide-29
SLIDE 29 #qconlondon #prepfortheunexpected

Lack of panic in the moment

slide-30
SLIDE 30 #qconlondon #prepfortheunexpected

It was a slick operation and we recovered

slide-31
SLIDE 31 #qconlondon #prepfortheunexpected

It took restoring from a backup and manual entry to get there

slide-32
SLIDE 32 #qconlondon #prepfortheunexpected

We were focused on recovery, not what happened

slide-33
SLIDE 33 #qconlondon #prepfortheunexpected

People were joining the incident to learn

slide-34
SLIDE 34 #qconlondon #prepfortheunexpected
slide-35
SLIDE 35 #qconlondon #prepfortheunexpected
slide-36
SLIDE 36 #qconlondon #prepfortheunexpected

This is where we are today

slide-37
SLIDE 37 #qconlondon #prepfortheunexpected
slide-38
SLIDE 38 #qconlondon #prepfortheunexpected
slide-39
SLIDE 39 #qconlondon #prepfortheunexpected
slide-40
SLIDE 40 #qconlondon #prepfortheunexpected Photo by Victor Garcia on Unsplash
slide-41
SLIDE 41 #qconlondon #prepfortheunexpected Photo by Markus Spiske on Unsplash
slide-42
SLIDE 42 #qconlondon #prepfortheunexpected
  • 0. How do we do on-call?
  • 1. Our incident management challenges
  • 2. Making out-of-hours sustainable
  • 3. The results and takeaways
slide-43
SLIDE 43 #qconlondon #prepfortheunexpected
  • 0. How do we do on-call?
  • 1. Our incident management challenges
  • 2. Making out-of-hours sustainable
  • 3. The results and takeaways
slide-44
SLIDE 44 #qconlondon #prepfortheunexpected

FT Core, Customer Products, Enterprise Services, Internal Products, Operations & Reliability, FT Group Products

slide-45
SLIDE 45 #qconlondon #prepfortheunexpected

We are Customer Products

slide-46
SLIDE 46 #qconlondon #prepfortheunexpected

45 engineers and counting 📉

slide-47
SLIDE 47 #qconlondon #prepfortheunexpected

And we own about 180 systems

slide-48
SLIDE 48 #qconlondon #prepfortheunexpected
slide-49
SLIDE 49 #qconlondon #prepfortheunexpected

Split into 9 teams

slide-50
SLIDE 50 #qconlondon #prepfortheunexpected
slide-51
SLIDE 51 #qconlondon #prepfortheunexpected
slide-52
SLIDE 52 #qconlondon #prepfortheunexpected

Operations monitor our entire estate 24/7

slide-53
SLIDE 53 #qconlondon #prepfortheunexpected
slide-54
SLIDE 54 #qconlondon #prepfortheunexpected

Our systems are a drop in the pond

slide-55
SLIDE 55 #qconlondon #prepfortheunexpected

You build it, you run it

slide-56
SLIDE 56 #qconlondon #prepfortheunexpected

Supporting our systems out-of-hours

slide-57
SLIDE 57 #qconlondon #prepfortheunexpected

This is our approach to DevOps

slide-58
SLIDE 58 #qconlondon #prepfortheunexpected Photo by Joshua Coleman on Unsplash

Our engineers wear many hats

slide-59
SLIDE 59 #qconlondon #prepfortheunexpected

We’re putting on our incident management hat

slide-60
SLIDE 60 #qconlondon #prepfortheunexpected

How do we do support out-of-hours?

slide-61
SLIDE 61 #qconlondon #prepfortheunexpected

Our engineers volunteer to be part of the out-of-hours team

slide-62
SLIDE 62 #qconlondon #prepfortheunexpected

We don’t have shifts

slide-63
SLIDE 63 #qconlondon #prepfortheunexpected

We don’t have shifts

slide-64
SLIDE 64 #qconlondon #prepfortheunexpected

Which means, we could all be unavailable

slide-65
SLIDE 65 #qconlondon #prepfortheunexpected

What do we care about?

slide-66
SLIDE 66 #qconlondon #prepfortheunexpected

We’re talking about our business capabilities

slide-67
SLIDE 67 #qconlondon #prepfortheunexpected

We’re talking about our business capabilities

slide-68
SLIDE 68 #qconlondon #prepfortheunexpected

What is an incident at the FT?

slide-69
SLIDE 69 #qconlondon #prepfortheunexpected

Customer Products has two really important business capabilities

slide-70
SLIDE 70 #qconlondon #prepfortheunexpected
  • 1. Our users can always read the news

slide-71
SLIDE 71 #qconlondon #prepfortheunexpected
  • 2. Journalists must be able to publish the news

slide-72
SLIDE 72 #qconlondon #prepfortheunexpected

If either of these goes wrong, we declare an incident

slide-73
SLIDE 73 #qconlondon #prepfortheunexpected
slide-74
SLIDE 74 #qconlondon #prepfortheunexpected
slide-75
SLIDE 75 #qconlondon #prepfortheunexpected
  • 0. How do we do on-call?
  • 1. Our incident management challenges
  • 2. Making out-of-hours sustainable
  • 3. The results and takeaways
slide-76
SLIDE 76 #qconlondon #prepfortheunexpected

What were our challenges?

slide-77
SLIDE 77 #qconlondon #prepfortheunexpected

We were not immediately productive on call →

slide-78
SLIDE 78 #qconlondon #prepfortheunexpected

We had an engineering mindset in an operations situation

We were not immediately productive on call
slide-79
SLIDE 79 #qconlondon #prepfortheunexpected

Because we don’t have any SRE or DevOps specialists

We were not immediately productive on call
slide-80
SLIDE 80 #qconlondon #prepfortheunexpected

“I always start with the impact and the comms, they kinda jump in at the Tech.”

slide-81
SLIDE 81 #qconlondon #prepfortheunexpected

Our incident management process wasn’t second nature

We were not immediately productive on call
slide-82
SLIDE 82 #qconlondon #prepfortheunexpected

We had very few incidents in the first half of the year

slide-83
SLIDE 83 #qconlondon #prepfortheunexpected

We had very few incidents in the first half of the year

slide-84
SLIDE 84 #qconlondon #prepfortheunexpected

And we were down to 5 people on the out-of-hours support team

slide-85
SLIDE 85 #qconlondon #prepfortheunexpected

So we needed to make the out-of-hours team sustainable

slide-86
SLIDE 86 #qconlondon #prepfortheunexpected
  • 0. How do we do on-call?
  • 1. Our incident management challenges
  • 2. Making out-of-hours sustainable
  • 3. The results and takeaways
slide-87
SLIDE 87 #qconlondon #prepfortheunexpected

We surveyed engineers about helping out during an incident

slide-88
SLIDE 88 #qconlondon #prepfortheunexpected

There were many people on the fence

slide-89
SLIDE 89 #qconlondon #prepfortheunexpected

There were many people on the fence

7 people 3 people 6 people
slide-90
SLIDE 90 #qconlondon #prepfortheunexpected

And they told us why

slide-91
SLIDE 91 #qconlondon #prepfortheunexpected

“I will need much more confidence in systems and domains knowledge.”

slide-92
SLIDE 92 #qconlondon #prepfortheunexpected

“If I were to have a better understanding of how it works and what I would need to do, I would very likely join.”

slide-93
SLIDE 93 #qconlondon #prepfortheunexpected

We set out to convince people to join our out-of-hours team

slide-94
SLIDE 94 #qconlondon #prepfortheunexpected

We built and ran incident workshops

slide-95
SLIDE 95 #qconlondon #prepfortheunexpected

So our engineers are better prepared to take on incidents

slide-96
SLIDE 96 #qconlondon #prepfortheunexpected

And we wrote a generic runbook for our microservices

slide-97
SLIDE 97 #qconlondon #prepfortheunexpected

So engineers know what they can do, and can apply it to our ~180 systems

slide-98
SLIDE 98 #qconlondon #prepfortheunexpected

We set out in the last 6 months of 2019 to address the situation

slide-99
SLIDE 99 #qconlondon #prepfortheunexpected

Building your incident workshop →

slide-100
SLIDE 100 #qconlondon #prepfortheunexpected

Don’t Panic!

Building your incident workshop
slide-101
SLIDE 101 #qconlondon #prepfortheunexpected

Set aside a couple of hours to write a workshop

Building your incident workshop
slide-102
SLIDE 102 #qconlondon #prepfortheunexpected

Start by having a read of your old incidents

Building your incident workshop
slide-103
SLIDE 103 #qconlondon #prepfortheunexpected

Start by having a read of your old incidents

Building your incident workshop
slide-104
SLIDE 104 #qconlondon #prepfortheunexpected

Use the first page to set the scene

Building your incident workshop
slide-105
SLIDE 105 #qconlondon #prepfortheunexpected

Use the first page to set the scene

Building your incident workshop
slide-106
SLIDE 106 #qconlondon #prepfortheunexpected

Follow it with several pages of graphs and information

Building your incident workshop
slide-107
SLIDE 107 #qconlondon #prepfortheunexpected

Follow it with 3–5 pages of graphs and info

Building your incident workshop
slide-108
SLIDE 108 #qconlondon #prepfortheunexpected

Each page progresses the incident

Building your incident workshop
slide-109
SLIDE 109 #qconlondon #prepfortheunexpected

Each page progresses the incident

Building your incident workshop
slide-110
SLIDE 110 #qconlondon #prepfortheunexpected

And is an opportunity to ask questions about tools and systems

Building your incident workshop
slide-111
SLIDE 111 #qconlondon #prepfortheunexpected

Include the dead ends you encountered in production

Building your incident workshop
slide-112
SLIDE 112 #qconlondon #prepfortheunexpected

Include the dead ends you encountered in production

Building your incident workshop
slide-113
SLIDE 113 #qconlondon #prepfortheunexpected

Then wrap it up in a summary page

Building your incident workshop
slide-114
SLIDE 114 #qconlondon #prepfortheunexpected

Then wrap it up in a summary page

Building your incident workshop
slide-115
SLIDE 115 #qconlondon #prepfortheunexpected

What actually happened?

Building your incident workshop
slide-116
SLIDE 116 #qconlondon #prepfortheunexpected

What did we do?

Building your incident workshop
slide-117
SLIDE 117 #qconlondon #prepfortheunexpected

What caused the incident?

Building your incident workshop
slide-118
SLIDE 118 #qconlondon #prepfortheunexpected

Keep it minimal

Building your incident workshop
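
The workshop document described on the preceding slides (a scene-setting first page, 3–5 pages that progress the incident, dead ends included, and a summary page) could be sketched as a small data structure with a sanity check. This is only an illustration: the `Page` and `WorkshopPack` names, their fields, and the `problems` helper are assumptions made for this sketch, not something taken from the FT's own workshop material. Only the page counts and the three summary questions come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List

# The three summary questions are taken from the slides.
SUMMARY_QUESTIONS = (
    "What actually happened?",
    "What did we do?",
    "What caused the incident?",
)


@dataclass
class Page:
    """One handout page (hypothetical structure)."""
    title: str
    body: str                # graphs, log excerpts, chat snippets, and so on
    dead_end: bool = False   # True if the page shows a lead that went nowhere


@dataclass
class WorkshopPack:
    """A complete workshop document (hypothetical structure)."""
    scene: Page              # first page: sets the scene
    progress: List[Page]     # each page progresses the incident
    summary: Dict[str, str]  # summary question -> answer

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the pack looks fine."""
        found = []
        if not 3 <= len(self.progress) <= 5:
            found.append(f"expected 3-5 progress pages, got {len(self.progress)}")
        for question in SUMMARY_QUESTIONS:
            if not self.summary.get(question, "").strip():
                found.append(f"summary does not answer: {question}")
        return found


# Usage: a draft pack that is still too short and has an incomplete summary.
draft = WorkshopPack(
    scene=Page("Background", "It is 02:00 and an alert has fired."),
    progress=[Page("Error rate graph", "..."), Page("A red herring", "...", dead_end=True)],
    summary={"What did we do?": "Restored from a backup."},
)
for problem in draft.problems():
    print(problem)
```

Keeping the check as a list of problems rather than a pass/fail flag fits the "keep it minimal" advice: the author sees exactly what is missing and nothing more.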
slide-119
SLIDE 119 #qconlondon #prepfortheunexpected

The open format encourages people to share what they know

Building your incident workshop
slide-120
SLIDE 120 #qconlondon #prepfortheunexpected

The open format encourages people to share what they know

Building your incident workshop
slide-121
SLIDE 121 #qconlondon #prepfortheunexpected

Running your incident workshop →

slide-122
SLIDE 122 #qconlondon #prepfortheunexpected

You are the incident lead*

* aka the facilitator

Running your incident workshop
slide-123
SLIDE 123 #qconlondon #prepfortheunexpected

Split people into small teams

Running your incident workshop
slide-124
SLIDE 124 #qconlondon #prepfortheunexpected
slide-125
SLIDE 125 #qconlondon #prepfortheunexpected

Introduce the session and the format

Running your incident workshop
slide-126
SLIDE 126 #qconlondon #prepfortheunexpected

Hand out that first page of background information

Running your incident workshop
slide-127
SLIDE 127 #qconlondon #prepfortheunexpected

Give teams ~10 minutes to discuss this information

Running your incident workshop
slide-128
SLIDE 128 #qconlondon #prepfortheunexpected

Encourage the discussion, pose questions

Running your incident workshop
slide-129
SLIDE 129 #qconlondon #prepfortheunexpected

Remind people to write their thoughts down!

Running your incident workshop
slide-130
SLIDE 130 #qconlondon #prepfortheunexpected

Remind people to write their thoughts down!

Running your incident workshop
slide-131
SLIDE 131 #qconlondon #prepfortheunexpected

Bring the teams together and ask them the following...

Running your incident workshop
slide-132
SLIDE 132 #qconlondon #prepfortheunexpected

We’re addressing that engineering mindset

Running your incident workshop
slide-133
SLIDE 133 #qconlondon #prepfortheunexpected
  • 1. What actions, if any, can you take right now?

Running your incident workshop
slide-134
SLIDE 134 #qconlondon #prepfortheunexpected
  • 2. What more information do you need?

Running your incident workshop
slide-135
SLIDE 135 #qconlondon #prepfortheunexpected
  • 3. What are you communicating?

Running your incident workshop
slide-136
SLIDE 136 #qconlondon #prepfortheunexpected

People get really excited, so moderate the conversation!

Running your incident workshop
slide-137
SLIDE 137 #qconlondon #prepfortheunexpected
slide-138
SLIDE 138 #qconlondon #prepfortheunexpected

Then, hand out more information.

Hopefully it’s what they’ve asked for...

Running your incident workshop
slide-139
SLIDE 139 #qconlondon #prepfortheunexpected

Repeat the handouts and questions until the incident is “over”

Running your incident workshop
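
The facilitation loop from the preceding slides (hand out a page, let teams discuss, bring them together for the three questions, repeat until the incident is "over") can be written down as a simple agenda generator. This is a rough sketch under assumptions: the `agenda` function and its step labels are invented for illustration, and applying the ~10 minute discussion window to every page, not just the first, is a guess.

```python
from typing import Iterator, List, Tuple

# The three questions asked after each handout, taken from the slides.
QUESTIONS = (
    "What actions, if any, can you take right now?",
    "What more information do you need?",
    "What are you communicating?",
)


def agenda(pages: List[str], discuss_minutes: int = 10) -> Iterator[Tuple[str, str]]:
    """Yield (step, detail) pairs for the facilitator to follow (hypothetical helper)."""
    yield ("intro", "Introduce the session and the format")
    for title in pages:
        yield ("handout", title)
        yield ("discuss", f"Teams discuss for ~{discuss_minutes} minutes and write their thoughts down")
        for question in QUESTIONS:
            yield ("ask", question)
    yield ("wrap-up", "Leave plenty of time for questions")


# Usage: print a running order for a two-page workshop.
for step, detail in agenda(["Background", "Error rate graphs"]):
    print(f"{step:8} {detail}")
```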
slide-140
SLIDE 140 #qconlondon #prepfortheunexpected

Leave plenty of time for questions

Running your incident workshop
slide-141
SLIDE 141 #qconlondon #prepfortheunexpected
slide-142
SLIDE 142 #qconlondon #prepfortheunexpected

Documenting a generic microservices runbook →

slide-143
SLIDE 143 #qconlondon #prepfortheunexpected

One place for generic actions you can take during an incident

Documenting a generic microservices runbook
slide-144
SLIDE 144 #qconlondon #prepfortheunexpected

We call it the FT.com incident tool belt 🛡

Documenting a generic microservices runbook
slide-145
SLIDE 145 #qconlondon #prepfortheunexpected

We call it the FT.com incident tool belt 🛡

Documenting a generic microservices runbook
slide-146
SLIDE 146 #qconlondon #prepfortheunexpected

Each action has prep, usage and previous incidents documented

Documenting a generic microservices runbook
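
One way to keep every tool belt action in the same shape (prep, usage, previous incidents) is a small record type that renders to plain text. The sketch below is illustrative only: `ToolBeltAction`, its field names, and the example action are assumptions, and the usage step is a placeholder rather than a real command from the FT's runbook.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToolBeltAction:
    """One generic action in the runbook (hypothetical structure)."""
    name: str
    preparation: List[str]   # things to set up before you are on call
    usage: List[str]         # steps to follow during an incident
    previous_incidents: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the action as plain text with the three documented sections."""
        lines = [self.name, "", "Preparation:"]
        lines += [f"  - {step}" for step in self.preparation]
        lines += ["", "Usage:"]
        lines += [f"  - {step}" for step in self.usage]
        lines += ["", "Previous incidents:"]
        lines += [f"  - {ref}" for ref in self.previous_incidents] or ["  - (none recorded yet)"]
        return "\n".join(lines)


# Usage: the action name and usage step are placeholders for illustration.
restart = ToolBeltAction(
    name="Restart an application (example action)",
    preparation=["Download the Heroku CLI and log in"],
    usage=["Run the restart command for the affected app (placeholder step)"],
)
print(restart.render())
```

Rendering "(none recorded yet)" for an empty incident list keeps the section visible, which nudges people to add context after the next incident.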
slide-147
SLIDE 147 #qconlondon #prepfortheunexpected

Preparation, such as downloading the Heroku CLI

Documenting a generic microservices runbook
slide-148
SLIDE 148 #qconlondon #prepfortheunexpected

Preparation, such as downloading the Heroku CLI

Documenting a generic microservices runbook
slide-149
SLIDE 149 #qconlondon #prepfortheunexpected

Previous incidents provide context for those yet to happen

Documenting a generic microservices runbook
slide-150
SLIDE 150 #qconlondon #prepfortheunexpected

Previous incidents provide context for those yet to happen

Documenting a generic microservices runbook
slide-151
SLIDE 151 #qconlondon #prepfortheunexpected

We ran training sessions on each action

Documenting a generic microservices runbook
slide-152
SLIDE 152 #qconlondon #prepfortheunexpected

They were hands on and we ran them in production

Documenting a generic microservices runbook
slide-153
SLIDE 153 #qconlondon #prepfortheunexpected
  • 0. How do we do on-call?
  • 1. Our incident management challenges
  • 2. Making out-of-hours sustainable
  • 3. The results and takeaways
slide-154
SLIDE 154 #qconlondon #prepfortheunexpected

Did the workshops and documentation help?

slide-155
SLIDE 155 #qconlondon #prepfortheunexpected

6 incident workshops with 3 “incidents”

slide-156
SLIDE 156 #qconlondon #prepfortheunexpected

6 tool box workshops

slide-157
SLIDE 157 #qconlondon #prepfortheunexpected

How likely are you to help out during an incident?

slide-158
SLIDE 158 #qconlondon #prepfortheunexpected
slide-159
SLIDE 159 #qconlondon #prepfortheunexpected

“Solving OOH incidents is probably not as scary as most of us think.”

slide-160
SLIDE 160 #qconlondon #prepfortheunexpected

The workshops expanded everyone's mental model of our systems

slide-161
SLIDE 161 #qconlondon #prepfortheunexpected
slide-162
SLIDE 162 #qconlondon #prepfortheunexpected
slide-163
SLIDE 163 #qconlondon #prepfortheunexpected

We got better at the process

slide-164
SLIDE 164 #qconlondon #prepfortheunexpected

“[I’ve learnt] to focus initially on comms and customer experience, and less on finding the technical root cause.”

slide-165
SLIDE 165 #qconlondon #prepfortheunexpected

We started learning from our old incidents

slide-166
SLIDE 166 #qconlondon #prepfortheunexpected Photo by Aleksandar Cvetanovic on Unsplash
slide-167
SLIDE 167 #qconlondon #prepfortheunexpected

“I am keen to work towards being on the out-of-hours team.”

slide-168
SLIDE 168 #qconlondon #prepfortheunexpected

Engineers did join the out-of-hours team 🎊

slide-169
SLIDE 169 #qconlondon #prepfortheunexpected

There are now 11 of us

slide-170
SLIDE 170 #qconlondon #prepfortheunexpected

Many of whom were dialed into the big DNS incident

slide-171
SLIDE 171 #qconlondon #prepfortheunexpected

What are we doing this year?

slide-172
SLIDE 172 #qconlondon #prepfortheunexpected

Continue to promote joining the out-of-hours team

slide-173
SLIDE 173 #qconlondon #prepfortheunexpected

Evangelise running incident workshops across the company

slide-174
SLIDE 174 #qconlondon #prepfortheunexpected

Key takeaways

Photo by Ollie Jordan on Unsplash
slide-175
SLIDE 175 #qconlondon #prepfortheunexpected

Practicing incident management can prepare you for the unexpected

slide-176
SLIDE 176 #qconlondon #prepfortheunexpected

Confidence doesn’t have to come from in-depth knowledge

slide-177
SLIDE 177 #qconlondon #prepfortheunexpected

Take this back to your teams, and have fun!

slide-178
SLIDE 178 #qconlondon #prepfortheunexpected

We’re hiring! 📉 https://www.ft.com/qcon

slide-179
SLIDE 179 #qconlondon #prepfortheunexpected
Further reading
  • Coping with Complexity by the SNAFUcatchers
  • Beyond the "Fix-it" Treadmill by J. Paul Reed
  • Cognitive Work of Hypothesis Exploration During Anomaly Response by Marisa R. Grayson
  • Learning from Incidents in Software Blog
slide-180
SLIDE 180 #qconlondon #prepfortheunexpected
Incident workshop examples
  • FT.com Incident Workshop #1
  • FT.com Incident Workshop #2
  • FT.com Incident Workshop #3
slide-181
SLIDE 181 #qconlondon #prepfortheunexpected

Thank you! I’d love to hear your thoughts, suggestions and feedback! samuel.parkinson@ft.com