Building Confidence in Healthcare Systems Through Chaos Engineering
Carl Chesser
@che55er | che55er.io
Agenda:
About me
Our Story
Introducing Chaos Experiments
Traffic Management Patterns
Summary
Changing existing service deployments in a complex deployment environment supporting critical workloads.
We wanted to control our deployments through declarative configuration that would allow us to continue to change.
As we pursued more ways of increasing availability and infrastructure features, our systems grew in size and complexity.
As we built and tested our infrastructure, we wanted cross-team alignment when evaluating the layers of the infrastructure.
We wanted to have clean availability zone separation, and therefore wanted to ensure we didn't have shared resources.
We had three different teams, but wanted a cross-functional team.
Platform and Operations already were located together, and we needed to get our Infrastructure team located together as well.
With our usage of DC/OS and OpenStack, we needed to better understand the reactions of these systems in common failure modes.
When we began our journey, we were leveraging DC/OS to manage our workloads (via Marathon).
Simulating traffic through the system while killing VMs, powering off the hypervisor, and stopping availability zones and shared infrastructure in DC/OS.
We introduced gamedays to start validating concerns of the whole system.
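The traffic simulation during these gamedays can be sketched as a simple probe loop that records success and latency while infrastructure is being degraded. This is only an illustrative sketch; the endpoint and sample counts are hypothetical, and the talk itself references tools such as vegeta for real load generation.

```python
import time
import urllib.request
from urllib.error import URLError

def probe(url, timeout=2.0):
    """Issue one GET against the system under test; return (ok, latency_s)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (URLError, OSError):
        ok = False
    return ok, time.monotonic() - start

def summarize(results):
    """Turn (ok, latency) samples into a gameday scorecard."""
    total = len(results)
    failures = sum(1 for ok, _ in results if not ok)
    latencies = sorted(lat for _, lat in results)
    return {
        "requests": total,
        "error_rate": failures / total if total else 0.0,
        "p95_latency_s": latencies[int(0.95 * (total - 1))] if total else 0.0,
    }

# Hypothetical gameday run while VMs/zones are being killed:
# samples = [probe("http://localhost:8080/health") for _ in range(200)]
# print(summarize(samples))
```

The value of a gameday comes from comparing this scorecard before, during, and after each injected failure.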
As we lived with our current system, we knew we would need to evolve it to Kubernetes.
Look, there on the horizon!
As we were evolving our system, we wanted to collapse the amount of effort and time to start comparing effects on production workloads.
When we built our deployments for DC/OS, we added support for DC/OS to Spinnaker.
We then leveraged it to deploy to both systems as we compared the behavior in Kubernetes.
The introduction of chaos experiments on live production systems, even for a small percentage of traffic, can seem too risky.
We are not Netflix!
Larry becomes defensive when first approached about applying chaos experiments in production at ACME corporation.
Rather than delaying when we could start evaluating our newer system, we could leverage a replay of production traffic.
We evolved our systems many times by leveraging a control gate into our system.
Used as an abstraction of the backing system.
Supports an API gateway to simply call another gateway, versus the backing set of services.
Supports gradually transitioning a subset of traffic to a different target by leveraging chaining.
Avoid the Big Bang.
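The control-gate pattern above can be sketched as a tiny weighted router. The backend names and weights below are hypothetical stand-ins for the two deployment stacks the talk describes, not the actual gateway implementation.

```python
import random

class ControlGate:
    """Abstraction over the backing system: routes each request to the
    current or the new target by an adjustable weight, so a subset of
    traffic can be transitioned gradually instead of in one big bang."""

    def __init__(self, current, new, new_weight=0.0):
        self.current = current          # callable handling the existing backend
        self.new = new                  # callable handling the new backend
        self.new_weight = new_weight    # fraction of traffic sent to the new target

    def route(self, request):
        target = self.new if random.random() < self.new_weight else self.current
        return target(request)

# Hypothetical backends standing in for the two deployment stacks:
def dcos_backend(req):
    return f"dcos:{req}"

def k8s_backend(req):
    return f"k8s:{req}"

gate = ControlGate(dcos_backend, k8s_backend, new_weight=0.05)  # start small
# gate.new_weight = 0.25  # later: widen the subset as confidence grows
```

Because the gate is just another callable, gates can be chained (a gateway calling another gateway), which is what makes the gradual transition composable.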
Replays a percentage of traffic to another backend.
Background replay of safe requests (read-only, HTTP GET).
Build in a bulkhead for your resource pool supporting the replay of traffic, to avoid unnecessary stress on your service at bursts of traffic.
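One way to sketch this replay-with-bulkhead pattern is a fire-and-forget mirror of sampled read-only requests into a small bounded worker pool. All names here are illustrative assumptions, not the talk's actual implementation.

```python
import random
import threading
import concurrent.futures

class ShadowReplayer:
    """Mirrors a fraction of safe (read-only GET) requests to a shadow
    backend through a small bounded pool (the bulkhead); overflow is
    dropped, so traffic bursts never stress the shadow path or block
    the live response path."""

    def __init__(self, shadow_call, max_workers=4, max_pending=16, sample_rate=0.1):
        self.shadow_call = shadow_call
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self.max_pending = max_pending
        self.sample_rate = sample_rate
        self._pending = 0
        self._lock = threading.Lock()

    def maybe_replay(self, method, request, sampled=None):
        """Fire-and-forget; returns True if the request was mirrored."""
        if sampled is None:
            sampled = random.random() < self.sample_rate
        if method != "GET" or not sampled:
            return False            # only replay safe, sampled requests
        with self._lock:
            if self._pending >= self.max_pending:
                return False        # bulkhead full: shed the replay, not the user
            self._pending += 1
        future = self.pool.submit(self.shadow_call, request)
        future.add_done_callback(self._release)
        return True

    def _release(self, _future):
        with self._lock:
            self._pending -= 1
```

Shedding replays when the pool is saturated is the bulkhead: the shadow backend sees at most a fixed amount of concurrent load regardless of how live traffic bursts.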
Rather than imposing a canary early with experiments, where a small percentage of failure still introduces undesirable risk, look to leverage a shadow of traffic.
We were able to further compare and evaluate the system as we expanded the new system and applied gameday exercises.
We identified an issue in our existing system, and through our continual assessment of the new system and practicing traffic management, it became a simple* choice.
* Simple by it being well understood, practiced, and supported by data.
Deploy services across data center sites, and we were able to leverage traffic across sites for a site incident.
(gamedays)
Optimize engineering focus on the introduction of chaos as planned experiments.
Minimize the opportunity for chaos to become a scapegoat for mysterious issues.
Prepare for the Experiment
Describe the scenario: what is expected to occur, how it will be measured, and who is needed.
Identify prerequisites that need to be completed (ex. improved telemetry on connection refresh of a data store).
Observability is Critical
You need easy access to essential telemetry data on all the parts of the system.
You want to be able to ask different and new questions of your system without having to change the system.
When you discover a gap in visibility, focus on how to make it easy to rebuild your system with the improvement through low coordination.
Utilize a Dedicated Space
Have a common space (physical/virtual) where everyone attends during the experiment.
You want to optimize communication when assessing the experiment.
Schedule an adequate amount of time for multiple iterations (ex. a whole afternoon).
Understand and Embrace Needed Compliance
Production systems will bear more compliance and controls.
Much of this is around risk, so focus on the introduction through low-risk scenarios (ex. non-live systems being built).
Plan to be Surprised
We generally always learned something new about the larger system and the effects of compounding failures.
Capture what was surprising (actual results vs. expected results) in an open and searchable repository. Plan added time to digest the surprises.
Cross Functional Involvement
Diverse perspectives can accelerate and improve group learning.
Helps share knowledge of how different layers of a system are viewed during the experiment.
Prepares Your Team
Your entire team may not be able to participate, but they should be able to learn from the findings.
Experiments help you practice how you look into the system and where signals normally arise, and identify gaps in essential telemetry for broader insight.
Plan for your experiments.
Work to build cross functional teams to maximize learning.
Identify how to make it easy to improve observability into your system.
Remind your teams and leadership of measurable improvements through this practice.
Identify how you can minimize risk through traffic management approaches.
https://kubernetes.io/
https://spinnaker.io/
https://dropwizard.io/
https://metrics.dropwizard.io/
https://github.com/Netflix/zuul
https://github.com/tsenart/vegeta
Carl Chesser
@che55er | che55er.io