SLIDE 1 Designing Services for Resilience Experiments: Lessons from Netflix
Nora Jones, Senior Chaos Engineer @nora_js
SLIDE 2
SLIDE 3
SLIDE 4
SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8 So, how can teams design services for resilience testing?
- Failure Injection Enabled
SLIDE 9 So, how can teams design services for resilience testing?
- Failure Injection Enabled
- RPC enabled
SLIDE 10 So, how can teams design services for resilience testing?
- Failure Injection Enabled
- RPC enabled
- Fallback Paths
○ And ways to discover them
SLIDE 11 So, how can teams design services for resilience testing?
- Failure Injection Enabled
- RPC enabled
- Fallback Paths
○ And ways to discover them
○ Key business metrics to look for
SLIDE 12 So, how can teams design services for resilience testing?
- Failure Injection Enabled
- RPC enabled
- Fallback Paths
○ And ways to discover them
○ Key business metrics to look for
○ And ways to discover them
SLIDE 13
Known Ways to Increase Confidence in Resilience
SLIDE 14 Known Ways to Increase Confidence in Resilience
SLIDE 15
SLIDE 16 Known Ways to Increase Confidence in Resilience
SLIDE 17
SLIDE 18 New Ways to Increase Confidence in Resilience
SLIDE 19
SLIDE 20
SPS: Key Business Metric
SLIDE 21
SLIDE 22
Chaos Engineering: Netflix’s ChAP
API Personalization 100%
SLIDE 23 Chaos Engineering: Netflix’s ChAP
API Gateway
Personalization
API Control
1% 98%
SLIDE 24 Chaos Engineering: Netflix’s ChAP
API Gateway
Personalization
API Control
1% 98%
SLIDE 25 Chaos Engineering: Netflix’s ChAP
API Gateway
Personalization
API Control API Exp
1% 1% 98%
SLIDE 26 Chaos Engineering: Netflix’s ChAP
API Gateway
Personalization
API Control API Exp
1% 1% 98%
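Speaker-note sketch: one way to picture the 1% / 1% / 98% split in the diagram is a sticky, hash-based assignment at the gateway. This is only an illustration under assumptions. The slides do not say how ChAP picks customers, and the names assign_cluster, CONTROL_PCT, EXPERIMENT_PCT, and the "baseline" label are hypothetical.

```python
import hashlib

# Percentages taken from the slide; the constant names are hypothetical.
CONTROL_PCT = 1
EXPERIMENT_PCT = 1


def assign_cluster(customer_id: str) -> str:
    """Deterministically map a customer to control, experiment, or baseline."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < CONTROL_PCT:
        return "control"
    if bucket < CONTROL_PCT + EXPERIMENT_PCT:
        return "experiment"
    return "baseline"


# Tally a synthetic population to see the approximate split.
counts = {"control": 0, "experiment": 0, "baseline": 0}
for i in range(100_000):
    counts[assign_cluster(f"user-{i}")] += 1
```

Because the assignment depends only on the customer id, the same customer stays in the same cluster for the whole experiment, which keeps the control and experiment metrics comparable.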
SLIDE 27
Monitoring
SLIDE 28
Monitoring SHORTED
SLIDE 29
1. Have Failure Injection Testing Enabled.
SLIDE 30
Sample Failure Injection Library
https://github.com/norajones/FailureInjectionLibrary
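Speaker-note sketch: the idea of a failure injection point can be shown with a small decorator that adds latency or raises an error when a scenario is active. This is not the API of the library linked above. Every name here (injection_point, FailureScenario, ACTIVE_SCENARIOS, InjectedFailure, get_recommendations) is hypothetical, and treating failure and latency as the two injectable behaviours is an assumption.

```python
import random
import time
from dataclasses import dataclass
from functools import wraps


class InjectedFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


@dataclass
class FailureScenario:
    # Hypothetical knobs, for illustration only.
    fail: bool = False
    added_latency_s: float = 0.0
    percentage: float = 100.0  # share of calls affected


# Injection points look up their active scenario by name.
ACTIVE_SCENARIOS: dict[str, FailureScenario] = {}


def injection_point(name: str, rng=random.random, sleep=time.sleep):
    """Wrap a dependency call so an active scenario can delay or fail it."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            scenario = ACTIVE_SCENARIOS.get(name)
            if scenario and rng() * 100 < scenario.percentage:
                if scenario.added_latency_s:
                    sleep(scenario.added_latency_s)
                if scenario.fail:
                    raise InjectedFailure(name)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@injection_point("personalization")
def get_recommendations(user_id: str) -> list[str]:
    # Stand-in for a real downstream call.
    return [f"title-for-{user_id}"]
```

With no scenario registered the wrapped call behaves normally; setting ACTIVE_SCENARIOS["personalization"] = FailureScenario(fail=True) makes every call raise InjectedFailure until the entry is removed.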
SLIDE 31
SLIDE 32
SLIDE 33
SLIDE 34
SLIDE 35
SLIDE 36
SLIDE 37
Types of Chaos Failures
SLIDE 38
Types of Chaos Failures
SLIDE 39
Criteria & API
SLIDE 40
SLIDE 41
SLIDE 42
Automating Creation of Chaos Experiments
SLIDE 43
2. Have Good Monitoring in Place for Configuration Changes.
SLIDE 44 Have Good Monitoring in Place
SLIDE 45 Have Good Monitoring in Place
○ Associated Hystrix Commands
SLIDE 46 Have Good Monitoring in Place
○ Associated Hystrix Commands ■ Associated Fallbacks
SLIDE 47 Have Good Monitoring in Place
○ Associated Hystrix Commands ■ Associated Fallbacks
SLIDE 48 Have Good Monitoring in Place
○ Associated Hystrix Commands ■ Associated Fallbacks
SLIDE 49 Have Good Monitoring in Place
○ Associated Hystrix Commands ■ Associated Fallbacks
- Timeouts
- Retries
- All in One Place!
SLIDE 50
SLIDE 51
RPC/Ribbon
- Java library managing REST clients to/from different services
- Fast failing/fallback capability
SLIDE 52
RPC/Ribbon Timeouts
SLIDE 53
RPC Timeouts
At what point does the service give up?
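Speaker-note sketch: "giving up" can be made concrete with a caller-side deadline. The example below uses Python's standard concurrent.futures rather than Ribbon, purely to illustrate the idea. The helper names call_with_timeout and slow_dependency are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s: float, *args):
    """Return func(*args), or raise TimeoutError once timeout_s has elapsed."""
    future = _executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The caller gives up here; the worker may still finish in the background.
        future.cancel()
        raise TimeoutError(f"gave up after {timeout_s}s")


def slow_dependency(delay_s: float) -> str:
    # Stand-in for a remote call that takes delay_s seconds.
    time.sleep(delay_s)
    return "ok"
```

Note that the caller stops waiting but the work itself is not interrupted, which is one reason a timeout on its own does not remove load from a struggling dependency.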
SLIDE 54
Retries
Immediately retrying an operation after a failure is usually not a great idea.
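Speaker-note sketch: a common alternative to retrying immediately is exponential backoff with jitter. The slides do not prescribe a specific policy, so this is a swapped-in general technique, and retry_with_backoff and its parameters are hypothetical names.

```python
import random
import time


def retry_with_backoff(operation, max_retries=3, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep, rng=random.uniform):
    """Call operation(); on failure wait a jittered, growing delay, then retry."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: random delay between 0 and the exponential cap.
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            sleep(rng(0, cap))
            attempt += 1
```

The sleep and rng parameters exist only so the delay schedule can be inspected or replaced in tests. Spreading retries out in time avoids many clients hitting a recovering service at the same instant.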
SLIDE 55
Retries
Understand the logic between your timeouts and your retries.
SLIDE 56
Circuit Breakers/Fallback Paths
SLIDE 57
Hystrix Commands/Fallback Paths
If your service is non-critical, ensure that there are fallback paths in place.
SLIDE 58
Fallback Strategies
Static Content Cache Fallback Service
SLIDE 59
Fallback Strategies
Know what your fallback strategy is and how to get that information.
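Speaker-note sketch: the three strategies named on the previous slide (static content, cache, fallback service) can be chained, with the chosen path returned so it can be monitored. The ordering, the function fetch_with_fallback, and STATIC_ROWS are assumptions for illustration, not how Hystrix implements fallbacks.

```python
STATIC_ROWS = ["Popular titles"]  # hypothetical static default content


def fetch_with_fallback(user_id, primary, cache, fallback_service=None):
    """Try the primary call, then cache, then a fallback service, then static.

    Returns (result, source) so dashboards can show which path served it.
    """
    try:
        return primary(user_id), "primary"
    except Exception:
        pass  # fall through to the fallback strategies
    if user_id in cache:
        return cache[user_id], "cache"
    if fallback_service is not None:
        try:
            return fallback_service(user_id), "fallback_service"
        except Exception:
            pass  # the fallback service can fail too
    return STATIC_ROWS, "static"
```

Returning the source alongside the result is one way to "get that information": a chaos experiment can then assert that the expected fallback path, and not a deeper one, actually served the request.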
SLIDE 60
SLIDE 61
SLIDE 62
3. Ensure Synergy between Hystrix Timeouts, RPC Timeouts, and Retry Logic.
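Speaker-note sketch: one mismatch worth checking is a wrapping (Hystrix-style) timeout that is shorter than the time the RPC layer needs to exhaust its own timeout and retries. The check below is simple arithmetic under assumptions: it ignores any delay between retries, and ClientConfig, its field names, and find_antipatterns are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClientConfig:
    # Hypothetical field names; all durations in milliseconds.
    hystrix_timeout_ms: int
    rpc_timeout_ms: int
    max_retries: int


def worst_case_rpc_ms(cfg: ClientConfig) -> int:
    """Longest time the RPC layer can take: first attempt plus every retry."""
    return cfg.rpc_timeout_ms * (1 + cfg.max_retries)


def find_antipatterns(cfg: ClientConfig) -> list[str]:
    """Return human-readable warnings for timeout/retry mismatches."""
    problems = []
    if cfg.hystrix_timeout_ms < cfg.rpc_timeout_ms:
        problems.append("wrapper timeout fires before one RPC attempt can time out")
    elif cfg.hystrix_timeout_ms < worst_case_rpc_ms(cfg):
        problems.append("wrapper timeout fires before all retries can complete")
    return problems
```

For example, a 1000 ms wrapper timeout around a 500 ms RPC timeout with 3 retries allows at most the first two attempts to finish, so the remaining retries are configured but can never complete.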
SLIDE 63
SLIDE 64
SLIDE 65
ChAP’s Monocle
SLIDE 66
ChAP’s Monocle
SLIDE 67
ChAP’s Monocle
SLIDE 68
SLIDE 69
There isn’t always money in microservices
SLIDE 70
Criticality Score
SLIDE 71 Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
SLIDE 72 Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
SLIDE 73 Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
SLIDE 74 Criticality Score
RPS Stats Range bucket * number of retries * number of Hystrix Commands = Criticality Score
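Speaker-note sketch: the formula multiplies three numbers, so it is easy to compute once RPS is mapped to a bucket. The bucket boundaries below are invented for illustration, since the slides do not give the real ranges, and rps_bucket and criticality_score are hypothetical names.

```python
import bisect

# Hypothetical RPS thresholds; the real bucket boundaries are not in the slides.
RPS_BUCKET_EDGES = [10, 100, 1_000, 10_000]


def rps_bucket(rps: float) -> int:
    """Map requests per second to a bucket number, starting at 1."""
    return bisect.bisect_right(RPS_BUCKET_EDGES, rps) + 1


def criticality_score(rps: float, retries: int, hystrix_commands: int) -> int:
    """RPS bucket * number of retries * number of Hystrix commands."""
    return rps_bucket(rps) * retries * hystrix_commands
```

With these assumed edges, a dependency at 500 RPS falls in bucket 3, so 2 retries and 4 Hystrix commands give a score of 3 * 2 * 4 = 24. Taken literally, the product is 0 whenever retries or Hystrix commands are 0.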
SLIDE 75
Chaos Success Stories
SLIDE 76
“We ran a chaos experiment which verifies that our fallback path works and it successfully caught an issue in the fallback path and the issue was resolved before it resulted in any availability incident!”
SLIDE 77
“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful...
SLIDE 78
“While [failing calls] we discovered an increase in license requests for the experiment cluster even though fallbacks were all successful. ...This likely means that whoever was consuming the fallback was retrying the call, causing an increase in license requests.”
SLIDE 79
Don’t lose sight of your company’s customers.
SLIDE 80 Takeaways
- Designing for resiliency testability is a shared responsibility.
- Configuration changes can cause outages.
- Have explicit monitoring in place on antipatterns in configuration changes.
@nora_js
SLIDE 81 Questions?
@nora_js