Designing Services for Resilience Experiments: Lessons from Netflix



slide-1
SLIDE 1

Designing Services for Resilience Experiments: Lessons from Netflix

Nora Jones, Senior Chaos Engineer @nora_js

slide-12
SLIDE 12

So, how can teams design services for resilience testing?

  • Failure Injection Enabled
  • RPC enabled
  • Fallback Paths
      ○ And ways to discover them
  • Proper monitoring
      ○ Key business metrics to look for
  • Proper timeouts
      ○ And ways to discover them

slide-14
SLIDE 14

Known Ways to Increase Confidence in Resilience

  • Unit Tests
slide-16
SLIDE 16

Known Ways to Increase Confidence in Resilience

  • Integration Tests
slide-18
SLIDE 18

New Ways to Increase Confidence in Resilience

  • Chaos Experiments
slide-20
SLIDE 20

SPS (Stream Starts per Second): Key Business Metric

slide-22
SLIDE 22

Chaos Engineering: Netflix’s ChAP

[Diagram labels: API, Personalization, 100%]

slide-23
SLIDE 23

Chaos Engineering: Netflix’s ChAP

[Diagram labels: API Gateway, Personalization, API Control; traffic split 1% / 98%]

slide-25
SLIDE 25

Chaos Engineering: Netflix’s ChAP

[Diagram labels: API Gateway, Personalization, API Control, API Exp; traffic split 1% / 1% / 98%]
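
The 1% / 1% / 98% split shown on this slide can be sketched as a deterministic bucketing function. This is only an illustration: the slides do not say how ChAP actually routes traffic, so the hashing approach, the function name `assign_cluster`, and the constants below are all assumptions.

```python
import hashlib

# Hypothetical bucket sizes, taken from the 1% / 1% / 98% split on the slide.
CONTROL_PCT = 1
EXPERIMENT_PCT = 1


def assign_cluster(request_key: str) -> str:
    """Map a request key to 'control', 'experiment', or 'regular'.

    Hashing the key keeps the assignment stable, so the same key always
    lands in the same cluster for the duration of an experiment.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < CONTROL_PCT:
        return "control"
    if bucket < CONTROL_PCT + EXPERIMENT_PCT:
        return "experiment"
    return "regular"
```

Only the experiment cluster would receive the injected failure; the control cluster gets the same share of traffic with no failure, which gives a like-for-like baseline for comparison.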

slide-27
SLIDE 27

Monitoring

slide-28
SLIDE 28

Monitoring SHORTED
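
The slide does not explain what "SHORTED" refers to. Assuming it marks an experiment that was cut short because the key business metric deviated, a minimal check could compare the metric between the control and experiment clusters. The function name, the 5% threshold, and the use of a simple mean are all hypothetical choices for this sketch, not something the slides specify.

```python
def should_stop_experiment(control_sps, experiment_sps, max_relative_drop=0.05):
    """Return True if the experiment's mean SPS falls too far below control.

    control_sps and experiment_sps are lists of SPS samples collected over
    the same time window. The threshold is an assumed value.
    """
    if not control_sps or not experiment_sps:
        return False  # not enough data to decide
    control_mean = sum(control_sps) / len(control_sps)
    experiment_mean = sum(experiment_sps) / len(experiment_sps)
    if control_mean == 0:
        return False  # avoid dividing by zero; nothing to compare against
    relative_drop = (control_mean - experiment_mean) / control_mean
    return relative_drop > max_relative_drop
```

A real system would likely use a more robust statistical comparison than a difference of means; the point here is only that the decision is driven by the business metric rather than by low-level error counts.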

slide-29
SLIDE 29
1. Have Failure Injection Testing Enabled.

slide-30
SLIDE 30

Sample Failure Injection Library

https://github.com/norajones/FailureInjectionLibrary
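
As a rough illustration of what a failure injection point can look like, here is a minimal sketch. It is not the API of the library linked above; the names `FailureInjector`, `Scenario`, and `InjectedFailure` are hypothetical and exist only for this example.

```python
import time
from dataclasses import dataclass


class InjectedFailure(Exception):
    """Raised when an injection point is told to fail."""


@dataclass
class Scenario:
    injection_point: str        # name of the call site to affect
    fail: bool = False          # raise instead of calling through
    delay_seconds: float = 0.0  # latency added before the call


class FailureInjector:
    """Wraps calls so that a failure or delay can be switched on per call site."""

    def __init__(self):
        self._scenarios = {}

    def enable(self, scenario):
        self._scenarios[scenario.injection_point] = scenario

    def disable(self, injection_point):
        self._scenarios.pop(injection_point, None)

    def call(self, injection_point, func, *args, **kwargs):
        scenario = self._scenarios.get(injection_point)
        if scenario is not None:
            if scenario.delay_seconds > 0:
                time.sleep(scenario.delay_seconds)  # simulate added latency
            if scenario.fail:
                raise InjectedFailure(injection_point)
        return func(*args, **kwargs)
```

With no scenario enabled, `call` is a plain pass-through, so the wrapper can stay in the service permanently and be activated only during an experiment.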

slide-37
SLIDE 37

Types of Chaos Failures

slide-39
SLIDE 39

Criteria & API

slide-42
SLIDE 42

Automating Creation of Chaos Experiments

slide-43
SLIDE 43
2. Have Good Monitoring in Place for Configuration Changes.

slide-49
SLIDE 49

Have Good Monitoring in Place

  • RPC Enabled
      ○ Associated Hystrix Commands
          ■ Associated Fallbacks
  • Timeouts
  • Retries
  • All in One Place!
slide-51
SLIDE 51
RPC/Ribbon

  • Java library managing REST clients to/from different services
  • Fast failing/fallback capability

slide-52
SLIDE 52

RPC/Ribbon Timeouts

slide-53
SLIDE 53

RPC Timeouts

At what point does the service give up?

slide-54
SLIDE 54

Retries

Immediately retrying an operation after a failure is usually not a great idea.
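
One common alternative to retrying immediately is to wait between attempts, with the wait growing exponentially and randomized so that many clients do not retry in lockstep. The slides do not prescribe this technique; the sketch below, including the name `retry_with_backoff` and all default values, is an assumption for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=3, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on any exception with jittered backoff.

    The sleep function is injectable so the delays can be observed or
    skipped in tests.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Catching every exception keeps the sketch short; a real client would normally retry only errors that are known to be transient.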

slide-55
SLIDE 55

Retries

Understand how your timeouts and your retries interact.

slide-56
SLIDE 56

Circuit Breakers/Fallback Paths

slide-57
SLIDE 57

Hystrix Commands/Fallback Paths

If your service is non-critical, ensure that there are fallback paths in place.

slide-58
SLIDE 58

Fallback Strategies

  • Static Content
  • Cache
  • Fallback Service
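
The strategies named on this slide can be combined into an ordered chain that also records which path produced the answer. This is a sketch under assumptions: the slides present the strategies as options rather than as a chain, and `call_with_fallbacks` is a hypothetical helper, not a Hystrix API.

```python
def call_with_fallbacks(primary, fallbacks):
    """Try primary(), then each fallback in order.

    fallbacks is a list of (name, callable) pairs. Returns a tuple of
    (value, path_name) so callers and dashboards can see which path answered.
    """
    try:
        return primary(), "primary"
    except Exception:
        pass  # fall through to the fallbacks
    for name, fallback in fallbacks:
        try:
            return fallback(), name
        except Exception:
            continue  # this fallback failed too; try the next one
    raise RuntimeError("primary call and all fallbacks failed")
```

For example, a hypothetical recommendations call might pass `[("cache", read_cache), ("fallback_service", call_backup), ("static", lambda: DEFAULT_ROWS)]`, where each name is a placeholder. Returning the path name makes it possible to measure how often each fallback is actually used.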

slide-59
SLIDE 59

Fallback Strategies

Know what your fallback strategy is and how to get that information.

slide-62
SLIDE 62

3. Ensure Synergy between Hystrix Timeouts, RPC Timeouts, and Retry Logic.
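
One way to read "synergy" here: if the wrapping command times out before the RPC client has had time to finish its retries, those retries can never complete. The check below sketches that rule. The rule itself, the `DependencyConfig` fields, and the function names are assumptions for illustration, and the calculation ignores any delay between retries.

```python
from dataclasses import dataclass


@dataclass
class DependencyConfig:
    name: str
    wrapper_timeout_ms: int  # e.g. the timeout on the wrapping Hystrix command
    rpc_timeout_ms: int      # per-attempt timeout on the RPC client
    max_retries: int         # retries after the first attempt


def worst_case_rpc_ms(cfg):
    """Time needed for the first attempt plus every retry to time out."""
    return cfg.rpc_timeout_ms * (1 + cfg.max_retries)


def find_timeout_mismatches(configs):
    """Return names of dependencies whose wrapper gives up before retries finish."""
    return [c.name for c in configs if c.wrapper_timeout_ms < worst_case_rpc_ms(c)]
```

For instance, a 1000 ms wrapper timeout around an RPC with a 600 ms timeout and one retry would be flagged, because the two attempts can take up to 1200 ms.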

slide-65
SLIDE 65

ChAP’s Monocle

slide-69
SLIDE 69

There isn’t always money in microservices

slide-71
SLIDE 71

Criticality Score

RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
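
The formula above can be computed as follows. The slides do not give the RPS bucket boundaries, so the values in `RPS_BUCKET_UPPER_BOUNDS` and the 1-based bucket numbering are hypothetical placeholders.

```python
import bisect

# Hypothetical bucket boundaries in requests per second; not from the slides.
RPS_BUCKET_UPPER_BOUNDS = [10, 100, 1000, 10000]


def rps_bucket(rps):
    """Return a 1-based bucket index; higher traffic lands in a higher bucket."""
    return bisect.bisect_left(RPS_BUCKET_UPPER_BOUNDS, rps) + 1


def criticality_score(rps, retries, hystrix_commands):
    """RPS bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(rps) * retries * hystrix_commands
```

Because the formula is a plain product, a dependency configured with zero retries scores zero under this sketch, whatever its traffic.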

slide-75
SLIDE 75

Chaos Success Stories

slide-76
SLIDE 76

“We ran a chaos experiment which verifies that our fallback path works and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”

slide-78
SLIDE 78

“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”

slide-79
SLIDE 79

Don’t lose sight of your company’s customers.

slide-80
SLIDE 80

Takeaways

  • Designing for resiliency testability is a shared responsibility.
  • Configuration changes can cause outages.
  • Have explicit monitoring in place on antipatterns in configuration changes.

@nora_js

slide-81
SLIDE 81

Questions?

@nora_js