Automating Chaos Experiments In Production - Ali Basiri, Chaos Team


slide-1
SLIDE 1

Automating Chaos Experiments In Production

Ali Basiri - Chaos Team @abasiri

slide-2
SLIDE 2

Netflix

slide-3
SLIDE 3

CDN: Movie Bits

Control Plane: Website, Apps, Signup, Login, Browsing, Search, Playback control, Bookmarks, ...

slide-4
SLIDE 4

Ali Basiri

Software Engineer @ Netflix

  • Chaos Engineer
  • Distributed Systems Engineer
  • Co-author of Principles of Chaos
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9

Chaos Monkey

slide-10
SLIDE 10

Service Availability

slide-11
SLIDE 11

Anatomy of a Failure

slide-12
SLIDE 12
slide-13
SLIDE 13

Movie Info API CDN Selection

slide-14
SLIDE 14

Movie Info API CDN Selection

slide-15
SLIDE 15

Movie Info API CDN Selection Fallback

slide-16
SLIDE 16

Movie Info API CDN Selection Fallback

slide-17
SLIDE 17

Movie Info API CDN Selection Fallback

slide-18
SLIDE 18

Movie Info API CDN Selection Fallback

slide-19
SLIDE 19

FIT

slide-20
SLIDE 20

Request Level Failure Injection

slide-21
SLIDE 21

Request Level Failure Injection

Movie Info API CDN Selection

slide-22
SLIDE 22
slide-23
SLIDE 23
slide-24
SLIDE 24
slide-25
SLIDE 25

API Gateway Personalization

Is API resilient to failure of Personalization?

slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32

API Gateway Personalization

Randomly select 10% of requests to participate in experiment
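
The random selection on this slide can be sketched as follows. This is a minimal illustration, not FIT's actual implementation: the header name `x-should-fail` and the helper names are hypothetical, and only the 10% rate comes from the slide.

```python
import random

SAMPLE_RATE = 0.10  # from the slide: 10% of requests take part in the experiment


def should_participate(rng: random.Random, rate: float = SAMPLE_RATE) -> bool:
    """Return True for roughly `rate` of calls."""
    return rng.random() < rate


def tag_request(headers: dict, rng: random.Random) -> dict:
    """Return a copy of the headers carrying a hypothetical experiment flag."""
    tagged = dict(headers)
    # "x-should-fail" is an assumed header name, used only for illustration.
    tagged["x-should-fail"] = "true" if should_participate(rng) else "false"
    return tagged


if __name__ == "__main__":
    rng = random.Random(42)
    tagged = [tag_request({"path": "/home"}, rng) for _ in range(10_000)]
    selected = sum(1 for h in tagged if h["x-should-fail"] == "true")
    print(f"{selected} of {len(tagged)} requests selected")
```

Tagging the request at the edge lets every downstream service make the same decision for that request without sampling again.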

slide-33
SLIDE 33

API Gateway Personalization

slide-34
SLIDE 34

API Gateway Personalization

if (shouldFail == true)

slide-35
SLIDE 35

API Gateway Personalization

if (shouldFail == true)
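
A rough sketch of what the `shouldFail` check might look like at the call site, assuming the caller has a fallback for Personalization (as in the earlier CDN Selection example). All function names and return values here are hypothetical stand-ins, not Netflix code.

```python
from typing import List


class InjectedFailure(Exception):
    """Raised in place of the real call when a request is marked to fail."""


def call_personalization(user_id: str) -> List[str]:
    # Stand-in for the real remote call to Personalization.
    return [f"pick-for-{user_id}-1", f"pick-for-{user_id}-2"]


def fallback_rows() -> List[str]:
    # Hypothetical degraded response, e.g. a non-personalized list.
    return ["popular-1", "popular-2"]


def get_rows(user_id: str, should_fail: bool) -> List[str]:
    """Call Personalization, or simulate its failure for tagged requests."""
    try:
        if should_fail:
            # The failure is injected before the remote call is made.
            raise InjectedFailure("personalization failure injected")
        return call_personalization(user_id)
    except InjectedFailure:
        return fallback_rows()


if __name__ == "__main__":
    print(get_rows("u1", should_fail=False))
    print(get_rows("u1", should_fail=True))
```

If the fallback path is broken or missing, only the tagged requests see the error, which is what keeps the blast radius of the experiment small.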

slide-36
SLIDE 36
slide-37
SLIDE 37
slide-38
SLIDE 38

Even More Availability

FIT

slide-39
SLIDE 39

CH∀OS ENGINEERING

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

slide-40
SLIDE 40

Stream Starts Per Second (SPS)

slide-41
SLIDE 41

Principles Of Chaos Engineering

  • Build a Hypothesis around Steady State Behavior
  • Vary Real-world Events
  • Run Experiments in Production
  • Automate Experiments to Run Continuously

http://principlesofchaos.org

slide-42
SLIDE 42

Stream Starts Per Second (SPS)

slide-43
SLIDE 43

ChAP

slide-44
SLIDE 44

Goal: Chaos All The Things

slide-45
SLIDE 45

API Gateway Personalization

slide-46
SLIDE 46

API Gateway Personalization

API Control API Exp

slide-47
SLIDE 47

API Gateway Personalization

API Control API Exp

slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51
slide-52
SLIDE 52
slide-53
SLIDE 53
slide-54
SLIDE 54

API Gateway Personalization

API Control API Exp

slide-55
SLIDE 55

API Gateway Personalization

API Control API Exp

Select 1% of requests for control

Select 1% of requests for experiment

slide-56
SLIDE 56

API Gateway Personalization

API Control API Exp

slide-57
SLIDE 57

API Gateway Personalization

API Control API Exp

if(shouldRoute == true)

slide-58
SLIDE 58

API Gateway Personalization

API Control API Exp

Traffic split: 1% (API Control), 1% (API Exp), 98% (API)
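
One way to picture the 1% / 1% / 98% split is as a routing function at the gateway. The sketch below assumes hash-based bucketing on a request ID so the decision is repeatable; the slides only say requests are selected, so the hashing scheme and the cluster names `api-control`, `api-exp`, and `api` are assumptions made for illustration.

```python
import hashlib


def bucket(request_id: str, buckets: int = 100) -> int:
    """Map a request ID to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(request_id: str) -> str:
    """Pick a target cluster: 1% control, 1% experiment, 98% regular."""
    b = bucket(request_id)
    if b == 0:
        return "api-control"  # hypothetical name for the control cluster
    if b == 1:
        return "api-exp"      # hypothetical name for the experiment cluster
    return "api"              # everything else goes to the regular cluster


if __name__ == "__main__":
    routes = [route(f"req-{i}") for i in range(20_000)]
    for target in ("api-control", "api-exp", "api"):
        print(target, routes.count(target))
```

Keeping the control and experiment groups the same size makes their metrics directly comparable, since both clusters see a similar slice of production traffic.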

slide-59
SLIDE 59

API Gateway Personalization

API Control API Exp

if(shouldFail == true)

slide-60
SLIDE 60

API Gateway Personalization

API Control API Exp

slide-61
SLIDE 61

Stream Starts Per Second (SPS)

slide-62
SLIDE 62

Fallback Metrics

slide-63
SLIDE 63

Fallback Metrics

slide-64
SLIDE 64

Fallback Metrics

slide-65
SLIDE 65

CPU Utilization

slide-66
SLIDE 66

Future Work on ChAP

slide-67
SLIDE 67

Automated Canary Analysis

slide-68
SLIDE 68

Detect divergence and stop early
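
As a rough illustration of stopping early, the sketch below compares SPS samples from the control and experiment clusters and signals a stop once the experiment falls behind for several samples in a row. The 5% threshold, the three-sample streak, and the function name are assumptions for this example, not values from the talk.

```python
from typing import Iterable, Optional, Tuple


def first_divergence(
    samples: Iterable[Tuple[float, float]],
    max_relative_drop: float = 0.05,  # assumed tolerance
    consecutive: int = 3,             # assumed number of bad samples in a row
) -> Optional[int]:
    """Return the sample index at which to stop the experiment, or None.

    Each sample is a (control_sps, experiment_sps) pair.
    """
    streak = 0
    for i, (control, experiment) in enumerate(samples):
        drop = (control - experiment) / control if control > 0 else 0.0
        if drop > max_relative_drop:
            streak += 1
            if streak >= consecutive:
                return i
        else:
            # A healthy sample resets the streak, which filters out brief noise.
            streak = 0
    return None


if __name__ == "__main__":
    readings = [(100, 99), (100, 90), (100, 88), (100, 85), (100, 99)]
    print(first_divergence(readings))
```

Requiring a streak rather than a single bad sample trades a little detection latency for fewer false stops on noisy metrics.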

slide-69
SLIDE 69

Integrate with continuous delivery system

slide-70
SLIDE 70

Clone multiple services to run an experiment

Diagram: services A, B, C, and D; B is cloned into B Con and B Exp, and C is cloned into C Con and C Exp.

slide-71
SLIDE 71

http://principlesofchaos.org

http://chaos.community

slide-72
SLIDE 72

Questions?