FAILURE AT NETFLIX VELOCITY Cannot Connect to the Netflix Service - - PowerPoint PPT Presentation




slide-1
SLIDE 1

FAILURE AT NETFLIX VELOCITY

THE ART OF CHAOS ENGINEERING

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

Cannot Connect to the Netflix Service

slide-7
SLIDE 7
slide-8
SLIDE 8

IMPACT: 0%
LATENCY: 0 ms

slide-9
SLIDE 9

IMPACT: 1%
LATENCY: 50 ms

slide-10
SLIDE 10

IMPACT: 50%
LATENCY: 50 ms

slide-11
SLIDE 11

IMPACT: 100%
LATENCY: 50 ms

slide-12
SLIDE 12

IMPACT: 25%
LATENCY: 250 ms

slide-13
SLIDE 13

IMPACT: 50%
LATENCY: 250 ms

slide-14
SLIDE 14

IMPACT: 75%
LATENCY: 250 ms

slide-15
SLIDE 15

IMPACT: 100%
LATENCY: 250 ms

slide-16
SLIDE 16

IMPACT: 50%
LATENCY: 500 ms

slide-17
SLIDE 17

IMPACT: %
LATENCY: 500 ms

slide-18
SLIDE 18

IMPACT: %
LATENCY: ms

slide-19
SLIDE 19

IMPACT: 50%
LATENCY: -500 ms

slide-20
SLIDE 20

IMPACT: %
LATENCY: ms

slide-21
SLIDE 21

HYPOTHESIS PROVEN
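
The impact and latency values on the preceding slides read as a ramp: delay a small share of requests by a small amount, then widen the share and lengthen the delay step by step. The sketch below is one minimal way to express that idea in Python. It is not Netflix's tooling: the names `LatencyStep`, `RAMP`, `should_inject`, and `call_with_latency` are hypothetical, and the ramp lists only the steps whose two values are both legible on the slides.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyStep:
    """One experiment step (hypothetical type, not from the talk)."""
    impact_pct: float  # share of requests that receive the delay, 0 to 100
    latency_ms: int    # injected delay in milliseconds


# Steps with both values legible on the slides, in slide order.
RAMP = [
    LatencyStep(1, 50),
    LatencyStep(50, 50),
    LatencyStep(100, 50),
    LatencyStep(25, 250),
    LatencyStep(50, 250),
    LatencyStep(75, 250),
    LatencyStep(100, 250),
    LatencyStep(50, 500),
]


def should_inject(step: LatencyStep, rng: random.Random) -> bool:
    """Decide whether this request falls inside the impacted share."""
    # rng.random() is in [0, 1), so 100% always injects and 0% never does.
    return rng.random() * 100 < step.impact_pct


def call_with_latency(handler, step: LatencyStep, rng: random.Random,
                      sleep=time.sleep):
    """Run handler(), first sleeping if the request is selected."""
    if should_inject(step, rng):
        sleep(step.latency_ms / 1000.0)
    return handler()


if __name__ == "__main__":
    rng = random.Random()
    step = RAMP[0]  # start with the smallest blast radius
    result = call_with_latency(lambda: "response", step, rng)
    print(step, result)
```

Passing `sleep` as a parameter keeps the injection decision testable without real delays. In practice the share of impacted traffic would more likely be chosen per customer or per device than per request, so that the same users see consistent behavior; that choice is an assumption here, not something the slides state.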

slide-22
SLIDE 22

LEARNINGS

Application Behavior
Blast Radius
Consistency

slide-23
SLIDE 23

You already know all that

slide-24
SLIDE 24

REMEMBER THE GRAY AREA?

slide-25
SLIDE 25

{Systems | People} don’t react predictably when exposed to non-typical behavior.

slide-26
SLIDE 26

INCIDENT MANAGEMENT

slide-27
SLIDE 27

WHAT’S THE BIG DEAL? FIX IT AND MOVE ON!

slide-28
SLIDE 28

INCIDENTS ARE:

Expensive
Stressful
Potentially Public

slide-29
SLIDE 29

INCIDENTS ARE:

Expensive
Stressful
Potentially Public

slide-30
SLIDE 30

INCIDENTS ARE:

Expensive
Stressful
Potentially Public

slide-31
SLIDE 31

INCIDENTS ARE:

Expensive
Stressful
Public

slide-32
SLIDE 32
slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35

WHAT CAUSES INCIDENTS?

slide-36
SLIDE 36

CHANGE

slide-37
SLIDE 37

THINGS WE CAN’T CONTROL

slide-38
SLIDE 38

THINGS WE DON’T CONTROL

slide-39
SLIDE 39

THINGS WE CAN CONTROL

slide-40
SLIDE 40

THAT ALL SOUNDS REALLLLLY NICE

slide-41
SLIDE 41

HOW?

slide-42
SLIDE 42

INCIDENT MANAGEMENT GOALS

Short
Unique
Never Forget

slide-43
SLIDE 43
slide-44
SLIDE 44

FOUR PARTS

slide-45
SLIDE 45

1 ENGAGEMENT

slide-46
SLIDE 46

2 COMMUNICATIONS

slide-47
SLIDE 47

3 COORDINATION

slide-48
SLIDE 48

4 MEMORIALIZATION

slide-49
SLIDE 49

TECHNIQUES

slide-50
SLIDE 50

SPECIALISTS

slide-51
SLIDE 51

BEFORE INCIDENTS

Education
Best Practices
Drilling

slide-52
SLIDE 52

DURING INCIDENTS

Incident Leader

slide-53
SLIDE 53

AFTER INCIDENTS

Memorialization
Research
Coordination
Incident Review

slide-54
SLIDE 54

NETFLIX INCIDENT MANAGERS

How We Think

slide-55
SLIDE 55

WE EXPECT

Failure
Incidents
Empathy

slide-56
SLIDE 56

INCIDENT MANAGEMENT CONTEXT

People

slide-57
SLIDE 57

“Human operators have dual roles: producers of & defenders against failure”

— Richard I Cook

slide-58
SLIDE 58

INCIDENT MANAGEMENT CONTEXT

Failure

slide-59
SLIDE 59

Constant Failures
Degraded is Normal
Any Change is a Gamble

slide-60
SLIDE 60

INCIDENT MANAGEMENT CONTEXT

Biases

slide-61
SLIDE 61

Hindsight
Single Attribution
Anchoring

slide-62
SLIDE 62

MEASURING SUCCESS

Short & Less Impact
More Incidents
Better Team Engagement

slide-63
SLIDE 63

FAILURE AT NETFLIX VELOCITY

THE ART OF CHAOS ENGINEERING

Dave Hahn — Senior SRE, Netflix dhahn@netflix.com @relix42