Service Ownership: Learn Faster. Holly Allen, Service Engineering


  1. Service Ownership: Learn Faster. Holly Allen, Service Engineering, @hollyjallen

  2. Holly Allen: software development and leadership for 18 years

  3. @hollyjallen,#QConSF Nov 2018

  6. Software! 😎

  8. S L O W 😪

  10. Measure, Design, Learn

  11. Toyota Production System

  17. Kaizen: Continuous Improvement

  18. Measure, Design, Learn

  22. Executive dedication to learning

  23. High Trust Teams

  24. Measure, Design, Learn

  26. Slack launched February 2014

  27. 5 Years: Grew to 13+ million weekly active users, with active sessions of 10+ hours a day

  28. 5 Years: From 10 to 15,000 servers in 25 cloud data centers worldwide

  29. 5 Years: From 8 to 1,200 people in 9 offices worldwide

  30. Measure, Design, Learn

  32. ✅ Continuous Deployment ✅ Experiment Frameworks ✅ User Research

  33. Something didn't scale...

  34. 😮 Centralized Operations

  35. Who should be responsible for the management, monitoring and operation of a production application?

  36. Centralized Operations: Division of Labor

  37. Devs: Features, Scale, Architecture. Ops: Cloud Infra, Deployment, Monitoring.

  38. Ops is getting the pages

  39. Product Development grew faster than Operations. A lot faster.

  40. 20 Product Developers, 1 Ops Engineer

  41. How can operations reliably reach the developers when there's a problem?

  42. "Call Maude, she knows how this works"

  43. Devs: "I've never been on-call before, this is scary!" Ops: "Now I know I can find a developer when I need to."

  44. Ops is getting the first pages. Ultra-senior devs on-call.

  45. Measure, Design, Learn

  46. How can operations reliably reach the developers when there's a problem?

  47. 📠 Most devs go on-call, Fall 2017

  48. Kaizen: Continuous Improvement

  49. "Wait, I'm on-call now?"

  50. Devs: "I'm glad I'm only on call a few times a year." Ops: "I'll be able to reach a search engineer if I need to."

  51. Learn by Doing

  52. On-call 3 times a year 🤕

  53. Ops is getting the first pages. Ultra-senior devs on-call. Seven / One dev rotations.

  54. Continuous Deployment: 100+ prod deploys a day

  55. What Changed?

  58. Page the dev

  59. Devs: "I don't understand this part of the code." Ops: "These are the machine alerts I'm seeing."

  60. Human Routers

  61. "Call Andy, he knows how this works"

  62. Postmortems weren't a great place for learning

  63. Can we catch problems earlier?

  67. Investing in tech to make detection and remediation faster

  68. Reorg! Operations is out, Service Engineering is in. Fall 2017

  69. How can Slack ensure that developers know when there's a problem?

  70. Centralized Operations, Service Ownership

  71. Measure, Design, Learn

  72. "We are the toolsmith and specialists. We empower Service Ownership"

  73. Devs: Features, Reliability, Performance, Postmortems. Service: Cloud Platform, Observability tools, Service Discovery, Define best practice.

  74. 👌 I joined Slack in February 2018

  75. How to empower development teams to improve service reliability?

  76. Define service health and operational maturity
      • At least one alerting health metric, like latency or throughput

  77. Send metrics to Prometheus. Observability team is here to help! 🔯
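
Slides 76 and 77 ask each service for at least one alerting health metric, such as latency. As a minimal sketch of what such a check might look like, the Python below computes a nearest-rank percentile over a window of latency samples and decides whether to page. It is plain Python, not the Prometheus client or a Prometheus alerting rule; `LatencyAlert`, `should_page`, the service name and the 500 ms threshold are hypothetical and do not come from the talk.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass(frozen=True)
class LatencyAlert:
    """Hypothetical paging rule: page when a latency percentile exceeds a threshold."""
    service: str
    pct: float
    threshold_ms: float

    def should_page(self, window_ms: Sequence[float]) -> bool:
        # Compare the chosen percentile of the current window with the threshold.
        return percentile(window_ms, self.pct) > self.threshold_ms
```

With `LatencyAlert("search", 99, 500.0)`, a window in which 2 of 100 requests take around 900 ms would page, while a window of uniformly fast requests would not. In a setup that sends metrics to Prometheus, the same idea would more likely live in an alerting rule over a latency histogram than in application code.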

  78. Define service health and operational maturity
      • Team should be on-call ready
      • At least 4, preferably 6 engineers participating to make it sustainable
      • 24/7 or during the weekday, depending on the service
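
The sizing guidance on slide 78 (at least 4, preferably 6 engineers) can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes week-long shifts and a simple round-robin order, neither of which is stated in the talk, and the function names are hypothetical.

```python
from typing import List, Sequence


def is_oncall_ready(engineers: int, minimum: int = 4) -> bool:
    """True when the rotation meets the slide's minimum of 4 engineers."""
    return engineers >= minimum


def shifts_per_engineer_per_year(engineers: int, shift_days: int = 7) -> float:
    """Average number of shifts each engineer takes per year (assumes equal sharing)."""
    if engineers < 1 or shift_days < 1:
        raise ValueError("engineers and shift_days must be positive")
    return 365 / shift_days / engineers


def build_rotation(names: Sequence[str], shifts: int) -> List[str]:
    """Round-robin schedule: one name per shift, cycling through the team."""
    if not names:
        raise ValueError("need at least one engineer")
    return [names[i % len(names)] for i in range(shifts)]
```

Under these assumptions a team of 4 would each carry about 13 week-long shifts a year, and a team of 6 under 9, which is one way to see why the larger rotation is described as more sustainable.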

  79. Define service health and operational maturity
      • Runbooks for standard actions and troubleshooting
      • Central location in our code repository
      • Up to date and useable by any engineer

  80. Define service health and operational maturity
      • Paging alerts should link to the runbook
      • Make responding to a page easy
      • Practice incident response
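
The rule that paging alerts should link to the runbook is easy to check mechanically. The following is a hedged sketch of such a lint, not Slack's tooling: the `AlertRule` type, its fields and the example runbook paths are all hypothetical, and a real check would read alert definitions from wherever the team keeps them.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    paging: bool
    runbook_url: Optional[str] = None


def alerts_missing_runbook(rules: Iterable[AlertRule]) -> List[str]:
    """Names of paging alerts that have no (or only a blank) runbook link."""
    return [
        rule.name
        for rule in rules
        if rule.paging and not (rule.runbook_url or "").strip()
    ]


# Example data, invented for illustration.
EXAMPLE_RULES = [
    AlertRule("search_latency_high", paging=True, runbook_url="runbooks/search.md"),
    AlertRule("queue_depth_high", paging=True),
    AlertRule("disk_usage_warning", paging=False),
]
```

Calling `alerts_missing_runbook(EXAMPLE_RULES)` returns only `queue_depth_high`: non-paging alerts are ignored, since the slide's requirement applies to alerts that wake someone up.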

  81. Incident Lunch ⛑

  82. Site Reliability Engineers
      • Devops generalists
      • Emotional intelligence
      • Mentoring
      • Ambassadors
      • Operational maturity

  83. SRE embedded in dev teams

  84. Devs, SRE, Ops

  85. Devs: "Um, where are the SREs?" SREs: "I'm over here doing operational tasks."

  86. SRE Ops is still getting the first pages

  87. How do we lower operational burden on the SREs?

  88. Plan: Send paging alerts to the development teams

  89. Devs: "We need training." SREs: "We're going to plan this out perfectly."

  91. Host level alerts: hundreds of them

  92. Test with the users

  94. 💫 Everything was fine!

  95. Empowered Continuous Improvement

  96. Devs, SRE, Ops

  97. How do we test our understanding of how Slack will fail?

  98. "Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."

  99. Measure, Design, Learn

  100. Success Metrics
      • Increased engineer confidence
      • Validate reliability improvements
      • Learn something new
      • Practice incident response
