Building Confidence in Healthcare Systems Through - PowerPoint PPT Presentation

  1. Building Confidence in Healthcare Systems Through Chaos Engineering. Carl Chesser, @che55er | che55er.io

  2. About me

  3. Our Story, Traffic Management Patterns, Introducing Chaos Experiments, Summary

  4. Our Story

  5. The Challenge: Changing existing service deployments in a complex deployment environment supporting critical workloads.

  6. Incrementally Build, Allowing for Change: We wanted to control our deployments through declarative configuration that would allow us to continue to change.

  7. Complexity was Growing within our Systems: As we pursued more ways of increasing availability and infrastructure features, our systems grew in size and complexity.

  8. Cross Functional Team Alignment for Experiments: As we built and tested our infrastructure, we wanted cross team alignment when evaluating the layers of the infrastructure.

  9. Introducing OpenStack: We wanted to have clean availability zone separation, and therefore wanted to ensure we didn’t have shared resources.

  10. Introduce the Tiger Team: We had three different organizations, but wanted one cross functional team. Platform and Operations already were located together, and needed to get our Infrastructure team located together as well.

  11. Starting with DC/OS: When we began our journey, we were leveraging DC/OS to manage our workloads (via Marathon). With our usage of DC/OS and OpenStack, we were needing to better understand the reactions of these systems in common failure modes.

  12. Validate Early Concerns: We introduced gamedays to start validating concerns of the whole system. Simulating traffic through the system while killing VMs, powering off hypervisor, stopping availability zones, and shared infrastructure in DC/OS.

  13. Evolving to Kubernetes: As we lived with our current system, we knew we would need to evolve it to Kubernetes. Look, there on the horizon!

  14. Competing Time in Growing Both Systems: As we were evolving our system, we wanted to collapse the amount of effort and time to start comparing effects of production workloads.

  15. Leveraging Spinnaker: When we built our deployments for DC/OS, we added support for DC/OS to Spinnaker. We then leveraged it to deploy to both systems as we compared the behavior in Kubernetes.

  16. Fear of Running Experiments on Live Traffic: We are not Netflix! The introduction of chaos experiments on live production systems, even for a small percentage of traffic, can seem too risky. Larry becomes defensive when first approached about applying chaos experiments in production at ACME corporation.

  17. Introduced Shadow Traffic: Rather than delaying when we could start evaluating our newer system, we could leverage a replay of production traffic.

  18. Traffic Management Patterns

  19. API Gateway to Facilitate Change: We evolved our systems many times by leveraging a control gate into our system. Used as an abstraction of the backing system.

  20. Chaining Traffic: Supports an API gateway to simply call another gateway, versus the backing set of services.
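Chaining can be pictured as a gateway whose target is simply another gateway. The sketch below is illustrative only: the `Gateway` class, `backing_service`, and the hop counter are hypothetical names, not the gateway product used in the talk.

```python
from typing import Callable

# A handler takes a request path and returns a response body.
Handler = Callable[[str], str]


class Gateway:
    """Hypothetical gateway that forwards each request to one configured target."""

    def __init__(self, name: str, target: Handler) -> None:
        self.name = name
        self.target = target
        self.hops = 0  # requests forwarded by this gateway, for illustration

    def __call__(self, path: str) -> str:
        self.hops += 1
        return self.target(path)


def backing_service(path: str) -> str:
    # Stand-in for the real set of backing services.
    return f"service handled {path}"


# The inner gateway fronts the services; the outer gateway chains to it.
inner = Gateway("inner", backing_service)
outer = Gateway("outer", inner)
```

Because a gateway and a service share the same call shape here, the outer gateway does not need to know which one it is pointing at.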

  21. Canary Traffic: Supports gradually transitioning a subset of traffic to a different target by leveraging chaining. Avoid the Big Bang.
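One common way to pick the canary subset is to hash a request identifier into a stable bucket, so the same request ID always lands on the same side and raising the percentage only ever adds traffic to the canary. This is a minimal sketch under that assumption; the function names and the "stable"/"canary" labels are hypothetical, and the talk does not say how the split was actually chosen.

```python
import hashlib


def canary_bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_target(request_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of request IDs, else "stable"."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if canary_bucket(request_id) < canary_percent else "stable"
```

Moving `canary_percent` from 0 toward 100 in small steps gives the gradual transition the slide describes, and setting it back to 0 is the rollback.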

  22. Shadowing Traffic: Replays a percentage of traffic to another backend. Background replay of safe requests (read-only, HTTP GET). Build in a bulkhead for your resource pool supporting the replay of traffic to avoid unnecessary stress on your service at bursts of traffic.
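The three ideas on this slide (sample a percentage, replay only safe requests, and cap the resources spent on replay) can be sketched together. The `ShadowReplayer` class, its `send` callback, and the `max_in_flight` limit are hypothetical names for illustration; the bulkhead here is a bounded semaphore that sheds replays when full instead of queuing them.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class ShadowReplayer:
    """Replays a sample of safe requests to a second backend, off the request path."""

    def __init__(self, send: Callable[[str], None], percent: float,
                 max_in_flight: int = 4, rng: Optional[random.Random] = None) -> None:
        self._send = send          # hypothetical call to the shadow backend
        self._percent = percent    # share of eligible requests to replay (0-100)
        self._slots = threading.BoundedSemaphore(max_in_flight)  # the bulkhead
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._rng = rng or random.Random()
        self.dropped = 0           # replays skipped because the bulkhead was full

    def maybe_shadow(self, method: str, path: str) -> bool:
        """Return True if a background replay was scheduled for this request."""
        if method.upper() != "GET":                 # only replay read-only requests
            return False
        if self._rng.random() * 100 >= self._percent:
            return False                            # not part of the sample
        if not self._slots.acquire(blocking=False):
            self.dropped += 1                       # shed load rather than queue up
            return False
        self._pool.submit(self._replay, path)
        return True

    def _replay(self, path: str) -> None:
        try:
            self._send(path)
        except Exception:
            pass                                    # shadow failures never reach the caller
        finally:
            self._slots.release()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

Dropping replays during a burst keeps the live request path unaffected, at the cost of the shadow seeing a smaller sample at exactly those moments. The `dropped` counter is not synchronized, which is acceptable for a sketch but would need a lock under concurrent callers.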

  23. Shadow Allows Early Testing: Rather than imposing a canary early with experiments, where a small percentage of failure still introduces undesirable risk, look to leverage a shadow of traffic.

  24. Learning from Production as we built the New: We were able to further compare and evaluate the system as we expanded the new and applied gameday exercises.

  25. Transitioning to Kubernetes became Simple*: We identified an issue in our existing system, and through our continual assessment of the new system and practicing traffic management, it became a simple* choice. * Simple by it being well understood, practiced, and supported by data.

  26. Applied in our Cross Site Kubernetes Support: Deploy services across data center sites, and we were able to leverage traffic across sites for a site incident.

  27. Introducing Chaos Experiments (gamedays)

  28. Align the Introduction of Chaos with Organized Experiments: Optimize engineering focus on the introduction of chaos as planned experiments. Minimize the opportunity for chaos to become a scapegoat for mysterious issues.

  29. Prepare for the Experiment: Describe the scenario, what is expected to occur, how it will be measured, who is needed. Identify prerequisites that are needed to be completed (ex. improved telemetry on connection refresh of data store).
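The preparation checklist on this slide maps naturally onto a small record that can be reviewed before the gameday. This is a sketch only: the `ExperimentPlan` class and its field names are hypothetical, chosen to mirror the items the slide lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExperimentPlan:
    scenario: str            # what will be done to the system
    expected: str            # what is expected to occur
    measurement: str         # how the outcome will be measured
    participants: List[str]  # who is needed
    prerequisites: Dict[str, bool] = field(default_factory=dict)  # name -> completed?

    def blockers(self) -> List[str]:
        """List what still prevents the experiment from being run."""
        missing = [f"missing {name}"
                   for name in ("scenario", "expected", "measurement")
                   if not getattr(self, name).strip()]
        if not self.participants:
            missing.append("missing participants")
        missing += [f"prerequisite not done: {name}"
                    for name, done in self.prerequisites.items() if not done]
        return missing
```

An empty `blockers()` list is the signal that the experiment is ready to schedule; anything else names the gap to close first.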

  30. Observability is Critical: You need easy access to essential telemetry data of all the parts of the system. You want to be able to ask different and new questions of your system without having to change the system. When you discover a gap in visibility, focus on how to make it easy to rebuild your system with the improvement through low coordination.

  31. Utilize a Dedicated Space: Have a common space (physical/virtual) where everyone attends during the experiment. You want to optimize communication when assessing the experiment. Schedule adequate amount of time for multiple iterations (ex. whole afternoon).

  32. Understand and Embrace needed Compliance: Production systems will bear more compliance and controls. Much of this is around risk, so focus on the introduction through low risk scenarios (ex. non-live systems being built).

  33. Plan to be Surprised: We generally always learned something new about the larger system and the effects of compounding failures. Capture what was surprising (actual results vs. what was the expected results) in an open and searchable repository. Plan added time to digest the surprises.
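Capturing expected versus actual results can be as simple as writing one plain-text record per surprise, which keeps the findings searchable with ordinary tools. The sketch below assumes a JSON Lines format; the `Finding` class and `to_json_lines` helper are hypothetical, and the talk does not say what kind of repository was used.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Finding:
    experiment: str
    expected: str
    actual: str

    @property
    def surprising(self) -> bool:
        # A finding is a surprise when the outcome differs from the expectation.
        return self.expected.strip().lower() != self.actual.strip().lower()


def to_json_lines(findings: List[Finding]) -> str:
    """Serialize only the surprising findings, one JSON object per line."""
    return "\n".join(json.dumps(asdict(f), sort_keys=True)
                     for f in findings if f.surprising)
```

Comparing free-text strings is a crude test for surprise; in practice the participants would decide what counts, and the record is what matters.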

  34. Cross Functional Involvement: Helps share knowledge on how different layers of a system are viewed during the experiment. Diverse perspectives can accelerate and improve group learning.

  35. Prepares Your Team: Your entire team may not be able to participate, but they should be able to learn from the findings. Experiments help you practice how you look into the system, where signals normally arise, and identifies gaps on essential telemetry for broader insight.

  36. Summary

  37. Work to build cross functional teams to maximize learning. Plan for your experiments. Identify how to make it easy to improve observability into your system. Remind your teams and leadership on measurable improvements through this practice. Identify how you can minimize risk through traffic management approaches.
