Building Confidence in Healthcare Systems Through Chaos Engineering
Carl Chesser
@che55er | che55er.io
Agenda:
About me
Our Story
Introducing Chaos Experiments
Traffic Management Patterns
Summary
Changing existing service deployments in a complex deployment environment supporting critical workloads.
We wanted to control our deployments through declarative configuration that would allow us to continue to change.
As we pursued more ways of increasing availability and infrastructure features, our systems grew in size and complexity.
As we built and tested our infrastructure, we wanted cross-team alignment when evaluating the layers of the infrastructure.
We wanted to have clean availability zone separation, and therefore wanted to ensure we didn't have shared resources.
We had three different teams, but wanted a cross-functional team.
Platform and Operations already were located together, and we needed to get our Infrastructure team located together as well.
With our usage of DC/OS and OpenStack, we needed to better understand the reactions of these systems in common failure modes.
When we began our journey, we were leveraging DC/OS to manage our workloads (via Marathon).
Simulating traffic through the system while killing VMs, powering off the hypervisor, and stopping availability zones and shared infrastructure in DC/OS.
We introduced gamedays to start validating concerns of the whole system.
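The traffic simulation during these gamedays can be sketched as a simple probe loop that records success and latency while infrastructure is being degraded. This is only an illustrative sketch; the endpoint and sample counts are hypothetical, and the talk itself references tools such as vegeta for real load generation.

```python
import time
import urllib.request
from urllib.error import URLError

def probe(url, timeout=2.0):
    """Issue one GET against the system under test; return (ok, latency_s)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (URLError, OSError):
        ok = False
    return ok, time.monotonic() - start

def summarize(results):
    """Turn (ok, latency) samples into a gameday scorecard."""
    total = len(results)
    failures = sum(1 for ok, _ in results if not ok)
    latencies = sorted(lat for _, lat in results)
    return {
        "requests": total,
        "error_rate": failures / total if total else 0.0,
        "p95_latency_s": latencies[int(0.95 * (total - 1))] if total else 0.0,
    }

# Hypothetical gameday run while VMs/zones are being killed:
# samples = [probe("http://localhost:8080/health") for _ in range(200)]
# print(summarize(samples))
```

The value of a gameday comes from comparing this scorecard before, during, and after each injected failure.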
As we lived with our current system, we knew we would need to evolve it to Kubernetes.
Look, there on the horizon!
As we were evolving our system, we wanted to collapse the amount of effort and time to start comparing effects on production workloads.
When we built our deployments for DC/OS, we added support for DC/OS to Spinnaker.
We then leveraged it to deploy to both systems as we compared the behavior in Kubernetes.
The introduction of chaos experiments on live production systems, even for a small percentage of traffic, can seem too risky.
We are not Netflix!
Larry becomes defensive when first approached about applying chaos experiments in production at ACME corporation.
Rather than delaying when we could start evaluating our newer system, we could leverage a replay of production traffic.
We evolved our systems many times by leveraging a control gate into our system.
Used as an abstraction of the backing system.
Supports an API gateway to simply call another gateway, versus the backing set of services.
Supports gradually transitioning a subset of traffic to a different target by leveraging chaining.
Avoid the Big Bang.
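The control-gate pattern above can be sketched as a tiny weighted router. The backend names and weights below are hypothetical stand-ins for the two deployment stacks the talk describes, not the actual gateway implementation.

```python
import random

class ControlGate:
    """Abstraction over the backing system: routes each request to the
    current or the new target by an adjustable weight, so a subset of
    traffic can be transitioned gradually instead of in one big bang."""

    def __init__(self, current, new, new_weight=0.0):
        self.current = current          # callable handling the existing backend
        self.new = new                  # callable handling the new backend
        self.new_weight = new_weight    # fraction of traffic sent to the new target

    def route(self, request):
        target = self.new if random.random() < self.new_weight else self.current
        return target(request)

# Hypothetical backends standing in for the two deployment stacks:
def dcos_backend(req):
    return f"dcos:{req}"

def k8s_backend(req):
    return f"k8s:{req}"

gate = ControlGate(dcos_backend, k8s_backend, new_weight=0.05)  # start small
# gate.new_weight = 0.25  # later: widen the subset as confidence grows
```

Because the gate is just another callable, gates can be chained (a gateway calling another gateway), which is what makes the gradual transition composable.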
Replays a percentage of traffic to another backend.
Background replay of safe requests (read-only, HTTP GET).
Build in a bulkhead for your resource pool supporting the replay of traffic, to avoid unnecessary stress on your service at bursts of traffic.
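One way to sketch this replay-with-bulkhead pattern is a fire-and-forget mirror of sampled read-only requests into a small bounded worker pool. All names here are illustrative assumptions, not the talk's actual implementation.

```python
import random
import threading
import concurrent.futures

class ShadowReplayer:
    """Mirrors a fraction of safe (read-only GET) requests to a shadow
    backend through a small bounded pool (the bulkhead); overflow is
    dropped, so traffic bursts never stress the shadow path or block
    the live response path."""

    def __init__(self, shadow_call, max_workers=4, max_pending=16, sample_rate=0.1):
        self.shadow_call = shadow_call
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self.max_pending = max_pending
        self.sample_rate = sample_rate
        self._pending = 0
        self._lock = threading.Lock()

    def maybe_replay(self, method, request, sampled=None):
        """Fire-and-forget; returns True if the request was mirrored."""
        if sampled is None:
            sampled = random.random() < self.sample_rate
        if method != "GET" or not sampled:
            return False            # only replay safe, sampled requests
        with self._lock:
            if self._pending >= self.max_pending:
                return False        # bulkhead full: shed the replay, not the user
            self._pending += 1
        future = self.pool.submit(self.shadow_call, request)
        future.add_done_callback(self._release)
        return True

    def _release(self, _future):
        with self._lock:
            self._pending -= 1
```

Shedding replays when the pool is saturated is the bulkhead: the shadow backend sees at most a fixed amount of concurrent load regardless of how live traffic bursts.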
Rather than imposing a canary early with experiments, where a small percentage of failure still introduces undesirable risk, look to leverage a shadow of traffic.
We were able to further compare and evaluate the system as we expanded the new system and applied gameday exercises.
We identified an issue in our existing system, and through our continual assessment of the new system and practicing traffic management, it became a simple* choice.
* Simple by it being well understood, practiced, and supported by data.
Deploy services across data center sites, and we were able to leverage traffic across sites for a site incident.
(gamedays)
Optimize engineering focus on the introduction of chaos as planned experiments.
Minimize the opportunity for chaos to become a scapegoat for mysterious issues.
Prepare for the Experiment
Describe the scenario: what is expected to occur, how it will be measured, and who is needed.
Identify prerequisites that need to be completed (ex. improved telemetry on connection refresh of a data store).
Observability is Critical
You need easy access to essential telemetry data on all the parts of the system.
You want to be able to ask different and new questions of your system without having to change the system.
When you discover a gap in visibility, focus on how to make it easy to rebuild your system with the improvement through low coordination.
Utilize a Dedicated Space
Have a common space (physical/virtual) where everyone attends during the experiment.
You want to optimize communication when assessing the experiment.
Schedule an adequate amount of time for multiple iterations (ex. a whole afternoon).
Understand and Embrace Needed Compliance
Production systems will bear more compliance and controls.
Much of this is around risk, so focus on the introduction through low-risk scenarios (ex. non-live systems being built).
Plan to be Surprised
We generally always learned something new about the larger system and the effects of compounding failures.
Capture what was surprising (actual results vs. expected results) in an open and searchable repository. Plan added time to digest the surprises.
Cross Functional Involvement
Diverse perspectives can accelerate and improve group learning.
Helps share knowledge of how different layers of a system are viewed during the experiment.
Prepares Your Team
Your entire team may not be able to participate, but they should be able to learn from the findings.
Experiments help you practice how you look into the system and where signals normally arise, and identify gaps in essential telemetry for broader insight.
Plan for your experiments.
Work to build cross functional teams to maximize learning.
Identify how to make it easy to improve observability into your system.
Remind your teams and leadership of measurable improvements through this practice.
Identify how you can minimize risk through traffic management approaches.
https://kubernetes.io/
https://spinnaker.io/
https://dropwizard.io/
https://metrics.dropwizard.io/
https://github.com/Netflix/zuul
https://github.com/tsenart/vegeta
Carl Chesser
@che55er | che55er.io