
The Practice of Chaos Engineering, Ana Medina

  1. The Practice of Chaos Engineering. Ana Medina, Chaos Engineer at Gremlin, @ana_m_medina

  2. #reactive18 @ana_m_medina: Gremlin, Uber, SFEFCU, Google, Quicken Loans, Stanford University, Miami Dade College

  3. How many of you have heard of Chaos Engineering?

  4. How many of you have run a Chaos Engineering experiment?

  5. What is Chaos Engineering?

  6. Chaos Engineering: Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

  7. Chaos Engineering: “Inject something harmful to build an immunity.” -@KoltonAndrus, Gremlin Founder and CEO

  8. Why?
     ● Microservices
     ● Systems are scaling fast
     ● Downtime is really expensive
     ● Our dependencies will fail
     ● Pager fatigue and burnout really hurt

  9. Use Cases:
     ● Outage reproduction
     ● On-call training
     ● Strengthen new products
     ● Battle test new infrastructure and services

  10. What do you need before doing Chaos Engineering?
      ● Monitoring/Observability
      ● On-Call and Incident Management
      ● Cost of Downtime Per Hour

  11. Chaos Engineering is not
      ● Unexpected or unmonitored experiments
      ● Creating outages

  12. “Chaos Engineering Without Observability ... Is Just Chaos” -@mipsytipsy, Charity Majors, CEO of Honeycomb

  13. Minimize the Blast Radius
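
One way to keep the blast radius small is to cap how many hosts an experiment may touch. The sketch below is only an illustration of that idea, not something shown in the talk: `pick_targets`, the `fraction` and `max_targets` parameters, and the `web-NN` host names are all hypothetical.

```python
import random


def pick_targets(hosts, fraction, max_targets, seed=None):
    """Return a bounded random subset of hosts to include in an experiment.

    All names here are hypothetical; this only illustrates capping the
    number of affected hosts.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Aim for the requested fraction, but always at least one host...
    count = max(1, int(len(hosts) * fraction))
    # ...and never more than the hard cap or the number of hosts available.
    count = min(count, max_targets, len(hosts))
    return rng.sample(list(hosts), count)


hosts = [f"web-{i:02d}" for i in range(20)]
targets = pick_targets(hosts, fraction=0.1, max_targets=3, seed=42)
print(targets)  # two of the twenty hypothetical hosts
```

Starting with a small fraction and a hard cap means a surprising result affects only a few hosts; both limits can be raised as confidence grows.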

  14. Level 0: THE BEGINNING (Chaos Monkey)
      VALUE PROVIDED: Prepare for host failures in the cloud
      APPROACH TAKEN: Random
      MATURITY REQUIRED: Low

  15. Level 1: THE FIRST STEP (Infrastructure Failures)
      VALUE PROVIDED: Prepare for host-level failures
      APPROACH TAKEN: Disciplined
      MATURITY REQUIRED: Basic Operations

  16. Level 1.5: INTERMEDIATE (Network Failures)
      VALUE PROVIDED: Prepare for high impact events
      APPROACH TAKEN: Gameday
      MATURITY REQUIRED: Networking expertise

  17. Level 2: THE NEXT STEP (Application Failures)
      VALUE PROVIDED: Safely validate the user experience
      APPROACH TAKEN: Precision Experiments
      MATURITY REQUIRED: Advanced

  18. Latency added to 50% of Android traffic

  19. Exceptions: 50% of Android traffic failed
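
The two application-level experiments above (latency and exceptions on half of Android traffic) could be approximated with a small piece of request-handling code. This is a minimal sketch under assumed names, not Gremlin's implementation: `handle`, `in_experiment`, `InjectedFault`, the `platform` string, and the 0.2 s delay are all hypothetical.

```python
import hashlib
import time


class InjectedFault(Exception):
    """Raised in place of a real failure while an experiment is running."""


def in_experiment(request_id, percent):
    """Deterministically place roughly `percent`% of request IDs in the experiment."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent


def handle(request_id, platform, attack=None, percent=50,
           delay_s=0.2, sleep=time.sleep):
    """Hypothetical request handler that applies an optional attack first."""
    if attack and platform == "android" and in_experiment(request_id, percent):
        if attack == "latency":
            sleep(delay_s)  # added latency only
        elif attack == "exception":
            raise InjectedFault(f"injected failure for {request_id}")
    return "ok"  # normal path, untouched for all other traffic


# Roughly half of these hypothetical request IDs fall inside the experiment.
ids = [f"req-{i}" for i in range(1000)]
affected = sum(in_experiment(i, 50) for i in ids)
print(f"{affected} of {len(ids)} requests selected")
```

Hashing the request ID rather than drawing a random number keeps the selection repeatable, so the same request is consistently inside or outside the experiment.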

  20. You can and should inject chaos at every layer of your stack
      ● Application
      ● API
      ● Caching
      ● Database
      ● Hardware
      ● Cloud Infrastructure / Bare metal

  21. Top places to inject chaos

  23. https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/

  24. Getting Started:
      ● Identify top 5 critical systems
      ● Choose system
      ● Whiteboard the system
      ● Determine what experiment you want to run: (resource, state, network)
      ● Determine Blast Radius
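
The getting-started checklist can be captured as a small record that is filled in before anything is run. The sketch below is an assumption about how that might look, not a format from the talk: the `Experiment` class and the `checkout-api` example are hypothetical, and the `hypothesis` and `abort_conditions` fields are additions that the slide does not list.

```python
from dataclasses import dataclass, field

# The three experiment categories named on the slide.
ATTACK_TYPES = {"resource", "state", "network"}


@dataclass
class Experiment:
    """Hypothetical experiment plan; field names are illustrative only."""
    system: str
    attack_type: str
    blast_radius: str
    hypothesis: str = ""                                   # assumed field
    abort_conditions: list = field(default_factory=list)   # assumed field

    def problems(self):
        """Return a list of reasons the plan is not ready to run."""
        issues = []
        if self.attack_type not in ATTACK_TYPES:
            issues.append(f"unknown attack type: {self.attack_type}")
        if not self.blast_radius.strip():
            issues.append("blast radius not determined")
        if not self.hypothesis.strip():
            issues.append("missing hypothesis")
        if not self.abort_conditions:
            issues.append("no abort conditions")
        return issues


plan = Experiment(
    system="checkout-api",
    attack_type="network",
    blast_radius="one staging host",
    hypothesis="requests fall back to cached prices",
    abort_conditions=["error rate above agreed threshold"],
)
print(plan.problems())  # an empty list means the plan is complete
```

Writing the plan down first makes the blast radius and the stopping rule explicit before any fault is injected.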

  25. Companies doing Chaos Engineering

  26. Chaos Days

  27. Chaos Days: Dedicated day for your entire company to focus on building resilience instead of new products.
      https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/

  28. “What could go wrong?” “Do we know what will happen if this breaks?”

  29. Chaos Day Crew:
      ● VP Engineering / CTO / COO
      ● Executive Assistant
      ● Engineering Director / Manager
      ● Senior Engineer
      ● New Grad / Intern Engineer

  30. What experiments can you run?
      • Reproduce outage conditions
      • Unpredictable circumstances
      • Large traffic spikes
      • Race conditions
      • Datacenter failure
      • Time travel: system clocks out of sync
      • Network errors
      • CPU overloads

  36. gremlin.com/chaos-monkey/

  37. Learn more: Join the Chaos. Join Slack: bit.ly/chaos-eng-slack (1,900+ members across the world)

  38. THANKS! ana@gremlin.com @ana_m_medina
