Detangling complex systems with compassion & production excellence


  1. Detangling complex systems with compassion & production excellence. Liz Fong-Jones (@lizthegrey), #VelocityConf San Jose, June 13, 2019, w/ illustrations by @emilywithcurls!

  2. Production is increasingly complex. 2 @lizthegrey at #VelocityConf

  3. There is no 100% uptime.

  4. Our strategies need to evolve.

  5. Co "bought" DevOps.

  6. Ordering the alphabet soup...

  7. Noisy alerts. Grumpy engineers.

  8. Walls of meaningless dashboards.

  9. Incidents take forever to fix.

  10. Everyone bugs the "expert".

  11. Deploys are unpredictable.

  12. There's no time to do projects...

  13. and when there's time, there's no plan.

  14. The team is struggling to hold on.

  15. What's Co missing?

  16. Co forgot who operates systems.

  17. Tools aren't magical.

  18. Invest in people, culture, & process.

  19. Enter the art of Production Excellence.

  20. Make systems more reliable & friendly.

  21. ProdEx takes planning.

  22. Measure and act on what matters.

  23. Involve everyone.

  24. Build everyone's confidence. Encourage asking questions.

  25. How do we get started?

  26. Know when it's too broken.

  27. & be able to debug, together when it is.

  28. Eliminate (unnecessary) complexity.

  29. Our systems are always failing.

  30. What if we measure too broken?

  31. We need Service Level Indicators

  32. Think in terms of events in context.

  33. Is this event good or bad?

  34. Are users grumpy? Ask your PM.

  35. What threshold buckets events?

  36. HTTP Code 200? Latency < 300ms?

  37. How many eligible events did we see?

  38. Availability: Good / Eligible Events

  39. Set a target Service Level Objective.

  40. Use a window and target percentage.

  41. 99.9% of events good in past 30 days.

  42. A good SLO barely keeps users happy.

  43. Drive alerting with SLOs.

  44. Is my service on fire?

  45. Error budget: allowed unavailability

  46. How long until I run out?

  47. Page if it's hours. Ticket if it's days.

  48. Data-driven business decisions.

  49. Is it safe to do this risky experiment?

  50. Should we invest in more reliability?

  51. Perfect SLO > Good SLO >>> No SLO

  52. Measure what you can today.

  53. Iterate to meet user needs.

  54. Only alert on what matters.

  55. SLIs & SLOs are only half the picture...

  56. Our outages are never identical.

  57. Failure modes can't be predicted.

  58. Support debugging novel cases. In production.

  59. Allow forming & testing hypotheses.

  60. Dive into data to ask new questions.

  61. Our services must be observable.

  62. Can you examine events in context?

  63. Can you explain the variance?

  64. Can you mitigate impact & debug later?

  65. SLOs and Observability go together.

  66. But they alone don't create collaboration.

  67. Debugging is not a solo activity.

  68. Debugging is for everyone.

  69. Collaboration is interpersonal.

  70. Operations must be sustainable.

  71. We learn better when we document.

  72. Fix hero culture. Share knowledge.

  73. Reward curiosity and teamwork.

  74. Learn from the past. Reward your future self.

  75. Outages don't repeat, but they rhyme.

  76. Risk analysis helps us plan.

  77. Quantify risks by frequency & impact.

  78. Which risks are most significant?

  79. Address risks that threaten the SLO.

  80. Make the business case to fix them.

  81. And prioritize completing the work.

  82. Lack of observability is systemic risk.

  83. So is lack of collaboration.

  84. Season the alphabet soup with ProdEx

  85. Production Excellence brings teams closer together. Measure. Debug. Collaborate. Fix. lizthegrey.com; @lizthegrey
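
Slides 31 to 47 describe a chain: classify each event as good or bad, divide good events by eligible events, compare against a target over a window, and alert on how quickly the error budget is being spent. The following is a minimal Python sketch of that chain, not the speaker's implementation. The `Event` type, all function names, and the 24-hour and 7-day alert cutoffs are assumptions made for illustration; the slides only give the 200 / 300 ms example threshold, the 99.9% over 30 days example target, and the "hours vs. days" rule of thumb.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    """One request seen by the service (hypothetical shape)."""
    status: int            # HTTP status code
    latency_ms: float      # request latency in milliseconds
    eligible: bool = True  # assumption: False for e.g. health checks


def is_good(event: Event, max_latency_ms: float = 300.0) -> bool:
    # Example threshold from slide 36: HTTP 200 and latency < 300 ms.
    return event.status == 200 and event.latency_ms < max_latency_ms


def availability(events: Iterable[Event]) -> Optional[float]:
    """SLI from slide 38: good events / eligible events."""
    eligible = [e for e in events if e.eligible]
    if not eligible:
        return None  # no eligible traffic, so the ratio is undefined
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


def error_budget_events(eligible_in_window: int, slo_target: float = 0.999) -> float:
    """Bad events the SLO tolerates over the window (slide 45)."""
    return (1.0 - slo_target) * eligible_in_window


def hours_until_exhausted(budget: float, bad_so_far: float, bad_per_hour: float) -> float:
    """Slide 46: at the current burn rate, how long until the budget is gone?"""
    remaining = budget - bad_so_far
    if remaining <= 0:
        return 0.0
    if bad_per_hour <= 0:
        return math.inf
    return remaining / bad_per_hour


def alert_action(hours_left: float,
                 page_below_hours: float = 24.0,        # assumed cutoff for "hours"
                 ticket_below_hours: float = 24.0 * 7   # assumed cutoff for "days"
                 ) -> str:
    """Slide 47: page if it's hours, ticket if it's days."""
    if hours_left < page_below_hours:
        return "page"
    if hours_left < ticket_below_hours:
        return "ticket"
    return "none"


if __name__ == "__main__":
    sample = [Event(200, 100), Event(200, 400), Event(500, 50), Event(200, 250)]
    print("availability:", availability(sample))
    budget = error_budget_events(1_000_000)
    hours = hours_until_exhausted(budget, bad_so_far=400, bad_per_hour=50)
    print("action:", alert_action(hours))
```

Separating the "how long until I run out?" calculation from the page/ticket decision keeps the cutoffs easy to tune per service, which matches the deck's advice to measure what you can today and iterate.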
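
Slides 77 to 79 suggest quantifying risks by frequency and impact, ranking them, and addressing the ones that threaten the SLO. One way to sketch that in Python is to express each risk as expected "bad minutes" per year and compare the total with a yearly error budget. This is an assumption-heavy illustration: the `Risk` fields, the time-based budget (the deck's SLO example is event-based), and every number in the sample list are made up for the example and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Risk:
    """A hypothetical risk entry: how often it happens and how much it hurts."""
    name: str
    incidents_per_year: float    # frequency
    minutes_per_incident: float  # time to detect plus time to repair
    fraction_of_users: float     # share of users affected, 0.0 to 1.0

    @property
    def expected_bad_minutes(self) -> float:
        # Frequency times impact, in user-weighted minutes per year.
        return (self.incidents_per_year
                * self.minutes_per_incident
                * self.fraction_of_users)


def yearly_budget_minutes(slo_target: float) -> float:
    """Minutes of full unavailability per year that the target tolerates."""
    return (1.0 - slo_target) * 365 * 24 * 60


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Most significant risks first (slide 78)."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)


def threatens_slo(risks: List[Risk], slo_target: float) -> bool:
    """True if the combined expected cost exceeds the yearly budget (slide 79)."""
    total = sum(r.expected_bad_minutes for r in risks)
    return total > yearly_budget_minutes(slo_target)


if __name__ == "__main__":
    # Illustrative, invented numbers only.
    risks = [
        Risk("database failover", 2, 60, 1.0),
        Risk("bad deploy", 12, 30, 1.0),
        Risk("single-zone outage", 1, 240, 0.25),
    ]
    for r in rank_risks(risks):
        print(f"{r.name}: {r.expected_bad_minutes} bad minutes/year")
    print("threatens 99.9% SLO:", threatens_slo(risks, 0.999))
```

Putting each risk in the same unit as the error budget makes it easier to argue the business case in slide 80: the top entry in the ranking shows where a fix buys back the most budget.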

