Building Confidence in Healthcare Systems Through Chaos Engineering


slide-1
SLIDE 1

Building Confidence in Healthcare Systems Through Chaos Engineering

Carl Chesser

@che55er | che55er.io

slide-2
SLIDE 2

About me

slide-3
SLIDE 3

Our Story

Traffic Management Patterns

Introducing Chaos Experiments

Summary

slide-4
SLIDE 4

Our Story

slide-5
SLIDE 5

The Challenge

Changing existing service deployments in a complex deployment environment supporting critical workloads.

slide-6
SLIDE 6

Incrementally Build, Allowing for Change

We wanted to control our deployments through declarative configuration that would allow us to continue to change.

slide-7
SLIDE 7

Complexity was Growing within our Systems

As we pursued more ways of increasing availability and infrastructure features, our systems grew in size and complexity.

slide-8
SLIDE 8

Cross Functional Team Alignment for Experiments

As we built and tested our infrastructure, we wanted cross team alignment when evaluating the layers of the infrastructure.

slide-9
SLIDE 9

Introducing OpenStack

We wanted to have clean availability zone separation, and therefore wanted to ensure we didn't have shared resources.

slide-10
SLIDE 10

Introduce the Tiger Team

We had three different organizations, but wanted one cross functional team.

Platform and Operations already were located together, and needed to get our Infrastructure team located together as well.

slide-11
SLIDE 11

Starting with DC/OS

With our usage of DC/OS and OpenStack, we were needing to better understand the reactions of these systems in common failure modes.

When we began our journey, we were leveraging DC/OS to manage our workloads (via Marathon).

slide-12
SLIDE 12

Validate Early Concerns

Simulating traffic through the system while killing VMs, powering off a hypervisor, and stopping availability zones and shared infrastructure in DC/OS.

We introduced gamedays to start validating concerns of the whole system.

slide-13
SLIDE 13

Evolving to Kubernetes

As we lived with our current system, we knew we would need to evolve it to Kubernetes.

Look, there on the horizon!

slide-14
SLIDE 14

Competing Time in Growing Both Systems

As we were evolving our system, we wanted to collapse the amount of effort and time to start comparing effects of production workloads.

slide-15
SLIDE 15

Leveraging Spinnaker

When we built our deployments for DC/OS, we added support for DC/OS to Spinnaker.

We then leveraged it to deploy to both systems as we compared the behavior in Kubernetes.

slide-16
SLIDE 16

Fear of Running Experiments on Live Traffic

The introduction of chaos experiments on live production systems, even for a small percentage of traffic, can seem too risky.

We are not Netflix!

Larry becomes defensive when first approached about applying chaos experiments in production at ACME corporation.

slide-17
SLIDE 17

Introduced Shadow Traffic

Rather than delaying when we could start evaluating our newer system, we could leverage a replay of production traffic.

slide-18
SLIDE 18

Traffic Management Patterns

slide-19
SLIDE 19

API Gateway to Facilitate Change

We evolved our systems many times by leveraging a control gate into our system.

Used as an abstraction of the backing system.

slide-20
SLIDE 20

Chaining Traffic

Supports an API gateway to simply call another gateway, versus the backing set of services.
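As a rough illustration of chaining, the sketch below models a gateway whose routes point either at a backing service or at another gateway. The `Gateway` class, the `Backend` type, and the route prefixes are hypothetical names chosen for this example, not anything shown in the talk; a real gateway such as Zuul would express this through its own routing configuration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Backend = Callable[[str], str]  # hypothetical: takes a path, returns a response body


@dataclass
class Gateway:
    """Hypothetical gateway whose routes point at a backend or another gateway."""
    name: str
    routes: Dict[str, Union["Gateway", Backend]]

    def handle(self, path: str) -> str:
        # Route on the first path segment, e.g. "/orders/1" -> "/orders".
        prefix = "/" + path.strip("/").split("/")[0]
        target = self.routes[prefix]
        if isinstance(target, Gateway):
            return target.handle(path)  # chained: delegate to the next gateway
        return target(path)             # direct: call the backing service


new_gateway = Gateway("new", {"/orders": lambda path: f"new-backend:{path}"})
old_gateway = Gateway("old", {
    "/orders": new_gateway,                          # chained route
    "/billing": lambda path: f"old-backend:{path}",  # still served directly
})
```

Callers keep talking to `old_gateway`; only its route table changes when a set of services moves behind the new gateway.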

slide-21
SLIDE 21

Canary Traffic

Supports gradually transitioning a subset of traffic to a different target by leveraging chaining.

Avoid the Big Bang.
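A minimal sketch of canary routing, assuming the gateway can pick a target per request: a configurable fraction of requests goes to the new target and the rest stay on the existing one. `CanaryRouter`, its fields, and the example URLs are hypothetical and only illustrate the idea.

```python
import random
from dataclasses import dataclass, field


@dataclass
class CanaryRouter:
    """Hypothetical router: picks a target for each request by weight."""
    primary: str
    canary: str
    canary_fraction: float = 0.0  # 0.0 = all primary, 1.0 = all canary
    rng: random.Random = field(default_factory=random.Random)

    def choose_target(self) -> str:
        # random() returns a float in [0.0, 1.0), so a fraction of 0.0
        # never selects the canary and a fraction of 1.0 always does.
        if self.rng.random() < self.canary_fraction:
            return self.canary
        return self.primary


router = CanaryRouter(
    primary="https://gateway.example/dcos",       # placeholder URL
    canary="https://gateway.example/kubernetes",  # placeholder URL
    canary_fraction=0.05,
    rng=random.Random(42),  # seeded only to make the sketch repeatable
)
targets = [router.choose_target() for _ in range(10_000)]
canary_share = targets.count(router.canary) / len(targets)
```

Raising `canary_fraction` in small steps is what replaces the Big Bang cutover: the share of traffic on the new target grows only as confidence does.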

slide-22
SLIDE 22

Shadowing Traffic

Replays a percentage of traffic to another backend.

Background replay of safe requests (read-only, HTTP GET).

Build in a bulkhead for your resource pool supporting the replay of traffic to avoid unnecessary stress on your service at bursts of traffic.
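The three points above can be sketched together as follows. This is an illustrative sketch, not the implementation used in the talk: `ShadowReplayer`, `maybe_replay`, and the `send` callback are hypothetical names, and the bulkhead is modeled here as a small thread pool guarded by a semaphore so that replays are dropped rather than queued during a burst.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class ShadowReplayer:
    """Hypothetical shadow replayer with a bulkhead on in-flight replays."""

    def __init__(self, send: Callable[[str, str], None], fraction: float,
                 max_in_flight: int = 4,
                 rng: Optional[random.Random] = None):
        self._send = send          # function that calls the shadow backend
        self._fraction = fraction  # share of eligible requests to replay
        self._rng = rng or random.Random()
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        # The semaphore is the bulkhead: when it is exhausted, new replays
        # are dropped instead of queueing behind a burst of traffic.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self.dropped = 0

    def maybe_replay(self, method: str, path: str) -> bool:
        """Return True if the request was handed off for background replay."""
        if method.upper() != "GET":
            return False           # only replay safe, read-only requests
        if self._rng.random() >= self._fraction:
            return False           # not sampled
        if not self._slots.acquire(blocking=False):
            self.dropped += 1      # bulkhead full: shed the replay
            return False
        self._pool.submit(self._run, method, path)
        return True

    def _run(self, method: str, path: str) -> None:
        try:
            self._send(method, path)
        except Exception:
            pass                   # shadow failures must never reach callers
        finally:
            self._slots.release()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


seen = []
replayer = ShadowReplayer(send=lambda m, p: seen.append((m, p)), fraction=1.0)
replayer.maybe_replay("GET", "/resource/123")  # hypothetical path
replayer.maybe_replay("POST", "/resource")     # skipped: not read-only
replayer.close()
```

Because the replay runs in the background and its failures are swallowed, the primary request path is unaffected by how the shadow backend behaves.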

slide-23
SLIDE 23

Shadow Allows Early Testing

Rather than imposing a canary early with experiments, where a small percentage of failure still introduces undesirable risk, look to leverage a shadow of traffic.

slide-24
SLIDE 24

Learning from Production as we built the New

We were able to further compare and evaluate the system as we expanded the new and applied gameday exercises.

slide-25
SLIDE 25

Transitioning to Kubernetes became Simple*

We identified an issue in our existing system, and through our continual assessment of the new system and practicing traffic management, it became a simple* choice.

* Simple by it being well understood, practiced, and supported by data.

slide-26
SLIDE 26

Applied in our Cross Site Kubernetes Support

Deploy services across data center sites, and we were able to leverage traffic across sites for a site incident.

slide-27
SLIDE 27

Introducing Chaos Experiments

(gamedays)

slide-28
SLIDE 28

Align the Introduction of Chaos with Organized Experiments

Optimize engineering focus on the introduction of chaos as planned experiments.

Minimize the opportunity for chaos to become a scapegoat for mysterious issues.

slide-29
SLIDE 29

Prepare for the Experiment

Describe the scenario, what is expected to occur, how it will be measured, who is needed.

Identify prerequisites that are needed to be completed (ex. improved telemetry on connection refresh of data store)
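One way to make this preparation concrete is to write the plan down as structured data. The sketch below is only an illustration: `ExperimentPlan`, `Prerequisite`, and the example scenario, measurements, and participants are hypothetical, with the prerequisite text taken from the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Prerequisite:
    description: str
    done: bool = False


@dataclass
class ExperimentPlan:
    """Hypothetical record of a planned chaos experiment (gameday)."""
    scenario: str              # what will be done to the system
    expected_outcome: str      # what is expected to occur
    measurements: List[str]    # how it will be measured
    participants: List[str]    # who is needed
    prerequisites: List[Prerequisite] = field(default_factory=list)

    def blocking_prerequisites(self) -> List[str]:
        """Descriptions of prerequisites that are not yet complete."""
        return [p.description for p in self.prerequisites if not p.done]

    def ready(self) -> bool:
        return not self.blocking_prerequisites()


# Illustrative values only; they are not taken from a real experiment.
plan = ExperimentPlan(
    scenario="Power off one hypervisor while traffic is replayed",
    expected_outcome="Workloads reschedule; gateway error rate stays flat",
    measurements=["gateway error rate", "time to reschedule"],
    participants=["platform", "operations", "infrastructure"],
    prerequisites=[
        Prerequisite("Improved telemetry on connection refresh of data store"),
    ],
)
```

Checking `plan.ready()` before scheduling the gameday keeps unfinished prerequisites from being discovered in the middle of the experiment.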

slide-30
SLIDE 30

Observability is Critical

You need easy access to essential telemetry data of all the parts of the system.

You want to be able to ask different and new questions of your system without having to change the system.

When you discover a gap in visibility, focus on how to make it easy to rebuild your system with the improvement through low coordination.

slide-31
SLIDE 31

Utilize a Dedicated Space

Have a common space (physical/virtual) where everyone attends during the experiment.

You want to optimize communication when assessing the experiment. Schedule an adequate amount of time for multiple iterations (ex. whole afternoon).

slide-32
SLIDE 32

Understand and Embrace needed Compliance

Production systems will bear more compliance and controls.

Much of this is around risk, so focus on the introduction through low risk scenarios (ex. non-live systems being built).

slide-33
SLIDE 33

Plan to be Surprised

We generally always learned something new about the larger system and the effects of compounding failures.

Capture what was surprising (actual results vs. the expected results) in an open and searchable repository. Plan added time to digest the surprises.
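A small sketch of what such a repository could hold, assuming each observation is stored as expected versus actual results with a flag for whether the team found it surprising. `Finding`, `search`, and `surprises` are hypothetical names, and the entries are placeholders invented purely to show the shape of the data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of one gameday observation."""
    experiment: str
    expected: str
    actual: str
    surprising: bool  # judged by the people in the room, not computed


def search(findings: List[Finding], term: str) -> List[Finding]:
    """Case-insensitive keyword search over every text field."""
    needle = term.lower()
    return [
        f for f in findings
        if needle in " ".join((f.experiment, f.expected, f.actual)).lower()
    ]


def surprises(findings: List[Finding]) -> List[Finding]:
    """Only the findings the team flagged as surprising."""
    return [f for f in findings if f.surprising]


# Placeholder entries, not real results.
log = [
    Finding("example-a", "expected text A", "actual text A, same as expected", False),
    Finding("example-b", "expected text B", "actual text B, with a timeout", True),
]
```

In practice the same fields could live in a wiki or issue tracker; what matters is that expected and actual results sit side by side and can be searched later.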

slide-34
SLIDE 34

Cross Functional Involvement

Diverse perspectives can accelerate and improve group learning.

Helps share knowledge on how different layers of a system are viewed during the experiment.

slide-35
SLIDE 35

Prepares Your Team

Your entire team may not be able to participate, but they should be able to learn from the findings.

Experiments help you practice how you look into the system and where signals normally arise, and they identify gaps in essential telemetry for broader insight.

slide-36
SLIDE 36

Summary

slide-37
SLIDE 37

Plan for your experiments

Work to build cross functional teams to maximize learning

Identify how to make it easy to improve observability into your system

Remind your teams and leadership on measurable improvements through this practice

Identify how you can minimize risk through traffic management approaches

slide-38
SLIDE 38

Technologies

https://kubernetes.io/
https://spinnaker.io/
https://dropwizard.io/
https://metrics.dropwizard.io/
https://github.com/Netflix/zuul
https://github.com/tsenart/vegeta

slide-39
SLIDE 39

Get your Guide!

slide-40
SLIDE 40

Thank you!

Carl Chesser

@che55er | che55er.io