Service Ownership: Learn Faster (Holly Allen, Service Engineering)

Service Ownership: Learn Faster. Holly Allen (@hollyjallen), Service Engineering; software development and leadership for 18 years. Presented at #QConSF, November 2018.


  1. Service Ownership: Learn Faster. Holly Allen, Service Engineering, @hollyjallen

  2. Holly Allen: Software development and leadership for 18 years

  3. @hollyjallen,#QConSF Nov 2018

  6. Software! 😎

  8. S L O W 😪

  10. Measure Design Learn

  11. Toyota Production System

  17. Kaizen: Continuous Improvement

  18. Measure Design Learn

  22. Executive dedication to learning

  23. High Trust Teams

  24. Measure Design Learn

  26. 🚁 Slack launched February 2014

  27. 5 Years: Grew to 13+ million weekly active users, with active sessions of 10+ hours a day

  28. 5 Years: From 10 to 15,000 servers, in 25 cloud data centers worldwide

  29. 5 Years: From 8 to 1,200 people, in 9 offices worldwide

  30. Measure Design Learn

  32. ✅ Continuous Deployment ✅ Experiment Frameworks ✅ User Research

  33. Something didn't scale...

  34. 😮 Centralized Operations

  35. Who should be responsible for the management, monitoring and operation of a production application?

  36. Centralized Operations Division of Labor

  37. Devs: Features, Scale, Architecture. Ops: Cloud Infra, Deployment, Monitoring.

  38. Ops is getting the pages

  39. Product Development grew faster than Operations. A lot faster.

  40. 20 Product Developers, 1 Ops Engineer

  41. How can operations reliably reach the developers when there's a problem?

  42. "Call Maude, she knows how this works"

  43. Devs: "I've never been on-call before, this is scary!" Ops: "Now I know I can find a developer when I need to."

  44. Ops is getting the first pages. Ultra-senior devs on-call.

  45. Measure Design Learn

  46. How can operations reliably reach the developers when there's a problem?

  47. 📠 Most devs go on-call, Fall 2017

  48. Kaizen: Continuous Improvement

  49. "Wait, I'm on-call now?"

  50. Devs: "I'm glad I'm only on call a few times a year." Ops: "I'll be able to reach a search engineer if I need to."

  51. Learn by Doing

  52. On-call 3 times a year 🤕

  53. Ops is getting the first pages. Ultra-senior devs on-call. Seven One dev rotations

  54. Continuous Deployment: 100+ prod deploys a day

  55. What Changed?

  58. Page the dev

  59. Devs: "I don't understand this part of the code." Ops: "These are the machine alerts I'm seeing."

  60. Human Routers

  61. "Call Andy, he knows how this works"

  62. Postmortems weren't a great place for learning

  63. Can we catch problems earlier?

  67. Investing in tech to make detection and remediation faster

  68. Reorg! Fall 2017: Operations is out, Service Engineering is in

  69. How can Slack ensure that developers know when there's a problem?

  70. Centralized Operations Service Ownership

  71. Measure Design Learn

  72. "We are the toolsmith and specialists. We empower Service Ownership"

  73. Devs: Features, Reliability, Performance, Postmortems. Service Engineering: Cloud Platform, Observability tools, Service Discovery, Define best practice.

  74. 👌 I joined Slack in February 2018

  75. How to empower development teams to improve service reliability?

  76. Define service health and operational maturity

      • At least one alerting health metric, like latency or throughput

  77. Send metrics to Prometheus. Observability team is here to help! 🔯

  78. Define service health and operational maturity

      • Team should be on-call ready
      • At least 4, preferably 6 engineers participating to make it sustainable
      • 24/7 or during the weekday, depending on the service

  79. Define service health and operational maturity

      • Runbooks for standard actions and troubleshooting
      • Central location in our code repository
      • Up to date and usable by any engineer

  80. Define service health and operational maturity

      • Paging alerts should link to the runbook
      • Make responding to a page easy
      • Practice incident response

  81. Incident Lunch ⛑

  82. Site Reliability Engineers

      • DevOps generalists
      • Emotional intelligence
      • Mentoring
      • Ambassadors
      • Operational maturity

  83. SRE embedded in dev teams

  84. Devs SRE Ops

  85. Devs: "Um, where are the SREs?" SREs: "I'm over here doing operational tasks."

  86. SRE Ops is still getting the first pages

  87. How do we lower operational burden on the SREs?

  88. Plan: Send paging alerts to the development teams

  89. Devs: "We need training." SREs: "We're going to plan this out perfectly."

  91. Host-level alerts. Hundreds of them.

  92. Test with the users

  94. 💫 Everything was fine!

  95. Empowered Continuous Improvement

  96. Devs SRE Ops

  97. How do we test our understanding of how Slack will fail?

  98. "Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."

  99. Measure Design Learn

  100. Success Metrics

       • Increased engineer confidence
       • Validate reliability improvements
       • Learn something new
       • Practice incident response
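
The "Define service health and operational maturity" criteria from slides 76 to 80 can be read as a checklist. Below is a minimal Python sketch of that checklist, not Slack's actual tooling: the `Service` and `Alert` records, their field names, and the `maturity_gaps` helper are all hypothetical, and only the criteria themselves (an alerting health metric, at least 4 on-call engineers, runbooks in the repository, paging alerts that link to a runbook) come from the slides.

```python
from dataclasses import dataclass, field
from typing import List

# Slide 78: "At least 4, preferably 6 engineers participating".
MIN_ONCALL_ENGINEERS = 4


@dataclass
class Alert:
    """Hypothetical alert record; not tied to any real monitoring API."""
    name: str
    pages: bool = False      # True if this alert pages a human
    runbook_url: str = ""    # link the responder sees when paged


@dataclass
class Service:
    """Hypothetical description of a service and its on-call setup."""
    name: str
    alerts: List[Alert] = field(default_factory=list)
    oncall_engineers: int = 0
    runbook_paths: List[str] = field(default_factory=list)


def maturity_gaps(service: Service) -> List[str]:
    """Return the checklist items from slides 76-80 that the service misses."""
    gaps = []
    if not service.alerts:
        gaps.append("no alerting health metric (e.g. latency or throughput)")
    if service.oncall_engineers < MIN_ONCALL_ENGINEERS:
        gaps.append(
            f"fewer than {MIN_ONCALL_ENGINEERS} engineers in the on-call rotation"
        )
    if not service.runbook_paths:
        gaps.append("no runbooks in the code repository")
    unlinked = [a.name for a in service.alerts if a.pages and not a.runbook_url]
    if unlinked:
        gaps.append("paging alerts without a runbook link: " + ", ".join(unlinked))
    return gaps


# Example: a service with a paging latency alert but a thin rotation.
search = Service(
    name="search",
    alerts=[Alert("search_latency_p99_high", pages=True)],
    oncall_engineers=3,
    runbook_paths=["runbooks/search/restart.md"],
)
for gap in maturity_gaps(search):
    print(gap)
```

Returning a list of gaps rather than a pass/fail flag is a design choice made for this sketch: it lets a team see which criteria are still open.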
