Preparing for the Unexpected, by Samuel Parkinson

  1. Preparing for the Unexpected. Samuel Parkinson, samuel.parkinson@ft.com. Photo by Hush Naidoo on Unsplash. #qconlondon #prepfortheunexpected

  3. Let’s start with a story

  4. What’s the worst thing that could happen?

  12. The FT.com zone was missing

  14. FT.com has over 5,100 subdomains 😭

  15. This impacted the whole company

  18. 😲

  19. We had never prepared for such an incident

  20. It’s a classic data loss situation

  22. Our provider had a partial backup

  23. But critical records we used for DNS load balancing were missing 👼

  24. About 10 people worked to resolve the incident

  25. And over 30 people were online to follow along

  26. Most were not called, but still volunteered their time

  28. 4h 30m. The first hour was a total outage.

  29. Lack of panic in the moment

  30. It was a slick operation and we recovered

  31. It took restoring from a backup and manual entry to get there

  32. We were focused on recovery, not on what happened

  33. People were joining the incident to learn

  36. This is where we are today

  40. Photo by Victor Garcia on Unsplash

  41. Photo by Markus Spiske on Unsplash

  42. (0) How do we do on-call?; (1) Our incident management challenges; (2) Making out-of-hours sustainable; (3) The results and takeaways

  44. Internal Products, FT Core, Enterprise Services, Customer Products, Operations & Reliability, FT Group Products

  45. We are Customer Products

  46. 45 engineers and counting 📉

  47. And we own about 180 systems

  49. Split into 9 teams

  52. Operations monitor our entire estate 24/7

  54. Our systems are a drop in the pond

  55. You build it, you run it

  56. Supporting our systems out-of-hours

  57. This is our approach to DevOps

  58. Our engineers wear many hats. Photo by Joshua Coleman on Unsplash.

  59. We’re putting on our incident management hat

  60. How do we do support out-of-hours?

  61. Our engineers volunteer to be part of the out-of-hours team

  62. We don’t have shifts

  64. Which means we could all be unavailable

  65. What do we care about?

  66. We’re talking about our business capabilities

  68. What is an incident at the FT?

  69. Customer Products has two really important business capabilities

  70. 1. Our users can always read the news

  71. 2. Journalists must be able to publish the news

  72. If either of these goes wrong, we declare an incident

  75. (0) How do we do on-call?; (1) Our incident management challenges; (2) Making out-of-hours sustainable; (3) The results and takeaways

  76. What were our challenges?

  77. We were not immediately productive on call

  78. We were not immediately productive on call: we had an engineering mindset in an operations situation

  79. We were not immediately productive on call: because we don’t have any SRE or DevOps specialists

  80. “I always start with the impact and the comms, they kinda jump in at the Tech.”

  81. We were not immediately productive on call: our incident management process wasn’t second nature

  82. We had very few incidents in the first half of the year

  84. And we were down to 5 people on the out-of-hours support team

  85. So we needed to make the out-of-hours team sustainable

  86. (0) How do we do on-call?; (1) Our incident management challenges; (2) Making out-of-hours sustainable; (3) The results and takeaways

  87. We surveyed engineers about helping out during an incident

  88. There were many people on the fence

  89. There were many people on the fence (chart labels: 7 people, 3 people, 6 people)

  90. And they told us why

  91. “I will need much more confidence in systems and domains knowledge.”

  92. “If I were to have a better understanding of how it works and what I would need to do, I would very likely join.”

  93. We set out to convince people to join our out-of-hours team

  94. We built and ran incident workshops

  95. So our engineers are better prepared to take on incidents

  96. And we wrote a generic runbook for our microservices

  97. So engineers know what they can do, and can apply it to our ~180 systems

  98. We set out in the last 6 months of 2019 to address the situation

  99. Building your incident workshop

  100. Building your incident workshop: Don’t Panic!
