SLIDE 1
Kolton Andrus (@deelyle)
SLIDE 2 Overview
- 1. Why is Failure Testing Important?
- 2. How did we build Failure as a Service?
- 3. How has this made our systems more resilient?
SLIDE 3
SLIDE 4
SLIDE 5 Why Failure Testing?
- 1. Makes our systems immune to failure
- 2. Prevents larger outages
- 3. Production verification is requisite
SLIDE 6
SLIDE 7
Failure testing is a form of hormesis - we imbibe the poison to become immune.
SLIDE 8
SLIDE 9
SLIDE 10
Validating that our defenses will work when called upon - by exercising them at scale in production.
SLIDE 11
Building Failure as a Service
FIT - Failure Injection Testing
SLIDE 12
SLIDE 13
What about the monkeys?
SLIDE 14 The 5 W’s
- 1. Why
- 2. Who - Failure Scope
- 3. Where - Injection Point
- 4. What - Injected Failure
- 5. When - Ad-hoc & Automated
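The five questions above can be read as the fields of a single failure-test definition. The sketch below is only an illustration under that assumption: the class name `FailureSpec`, its field names, and the example values are hypothetical and are not taken from FIT itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureSpec:
    """Hypothetical record answering the 5 W's for one failure test."""

    why: str               # Why: the hypothesis the test is meant to check
    scope: str             # Who: failure scope, e.g. one test account
    injection_point: str   # Where: e.g. "circuit-breaker" or "network-call"
    failure: str           # What: "error" or "latency"
    delay_ms: int = 0      # Only meaningful when failure == "latency"
    automated: bool = False  # When: ad-hoc (False) or automated (True)

    def validate(self) -> None:
        """Reject definitions that leave one of the questions unanswered."""
        if not self.why or not self.scope or not self.injection_point:
            raise ValueError("why, scope and injection_point are required")
        if self.failure not in ("error", "latency"):
            raise ValueError(f"unknown failure type: {self.failure}")
        if self.failure == "latency" and self.delay_ms <= 0:
            raise ValueError("latency failures need a positive delay_ms")


# Example: an ad-hoc test that fails one secondary service for one account.
spec = FailureSpec(
    why="API should fall back when the secondary service is down",
    scope="account:test-user-1",
    injection_point="network-call",
    failure="error",
)
spec.validate()
```

Keeping the scope explicit in the definition is what lets a test start with a single account and widen gradually.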
SLIDE 15
SLIDE 16
SLIDE 17 Injection Points
[Diagram labels: Zuul (Proxy), API, Critical Service, Secondary Service, Cache, C*, Circuit Breaker, Network Calls]
SLIDE 18
SLIDE 19
SLIDE 20
SLIDE 21
“Knowing how the system behaves in the face of failure is invaluable - our assumptions are often incomplete”
SLIDE 22
SLIDE 23 Injected Failure
[Diagram labels: Zuul (Proxy), API, Critical, Secondary, Cache, C*, Circuit Breaker, Network Calls, Failure Metadata, FIT, Failure Scope, Decorated Request]
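The diagram labels suggest a flow in which the proxy decorates in-scope requests with failure metadata, and each injection point checks that metadata before doing its real work. The following is a minimal sketch of that idea, not FIT's actual implementation: the header name, the `point:failure` encoding, and every function name are assumptions made for illustration.

```python
from typing import Callable, Dict

# Hypothetical header used to carry failure metadata with the request.
FAILURE_HEADER = "x-failure-metadata"


class InjectedFailure(Exception):
    """Raised at an injection point instead of performing the real call."""


def decorate_request(headers: Dict[str, str], in_scope: bool,
                     injection_point: str, failure: str) -> Dict[str, str]:
    """At the edge: attach failure metadata only to in-scope requests."""
    decorated = dict(headers)  # copy so the caller's headers stay untouched
    if in_scope:
        decorated[FAILURE_HEADER] = f"{injection_point}:{failure}"
    return decorated


def call_with_injection(headers: Dict[str, str], injection_point: str,
                        call: Callable[[], str]) -> str:
    """At an injection point: fail if the metadata targets this point."""
    metadata = headers.get(FAILURE_HEADER)
    if metadata is not None:
        point, _, failure = metadata.partition(":")
        # This sketch handles only the "error" failure type.
        if point == injection_point and failure == "error":
            raise InjectedFailure(f"injected error at {injection_point}")
    return call()


# An in-scope request fails at the targeted point and nowhere else.
request = decorate_request({"user": "test-user-1"}, True,
                           "secondary-service", "error")
critical_result = call_with_injection(request, "critical-service",
                                      lambda: "critical ok")
```

Because the metadata travels with the request, requests outside the failure scope pass through every injection point unchanged.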
SLIDE 24
SLIDE 25
Great, does it work?
SLIDE 26
SLIDE 27
SLIDE 28
SLIDE 29
Aggressive failure testing creates not just robust programs, but an antifragile programming culture.
SLIDE 30 Takeaways
- 1. Failure Testing is a worthwhile investment
- 2. Testing in Production is sustainable
- 3. It can harden your systems against failure
SLIDE 31 Resources
- Netflix Techblog - FIT
- “On Designing and Deploying Internet-Scale Services” - James Hamilton
- Drift into Failure - Sidney Dekker
- Antifragile - Nassim Nicholas Taleb
SLIDE 32 Photo Credits
- Nuclear Blast - Mark Waldrep
- Forest Fire
- Poison
- Needle
- Explosion
- Robot
SLIDE 33
Demo Slides
SLIDE 34
SLIDE 35
SLIDE 36
SLIDE 37
SLIDE 38