SLIDE 1 Chaos Kong: Endowing Netflix with Antifragility
Luke Kosewski Traffic & Chaos Engineering
SLIDE 2
This is a Case Study We’ll Be Doing TOGETHER
SLIDE 3
This is What AWS Failover Looks Like
us-west-2 us-east-1 eu-west-1
SLIDE 4 Failover is Run By This Guy
A Traffic Engineer
SLIDE 6 A Traffic Engineer’s Environment
SLIDE 7 A Traffic Engineer’s Environment
- Netflix control plane
- Primarily in 3 AWS regions (eu-west-1, us-east-1, us-west-2)
SLIDE 8 A Traffic Engineer’s Environment
- Netflix control plane
- Primarily in 3 AWS regions (eu-west-1, us-east-1, us-west-2)
- They look like this:
us-west-2 us-east-1 eu-west-1
SLIDE 9 Traffic’s Teammates
Casey Rosenthal (management)
Chaos: Lorin Hochstein, Aaron Blohowiak & Ali Basiri
Traffic: Niosha Behnam & myself
Intuition: Justin Reynolds
traffic@netflix.com / chaos@netflix.com
SLIDE 10 Our Relationship
Chaos Traffic
High Availability
Flow
SLIDE 11
Storytime with Luke
Once upon a time... (August 2013)
SLIDE 12
3 SREs at Netflix
SLIDE 13
3 SREs at Netflix 10s of services
SLIDE 14
3 SREs at Netflix 10s of services 100s of devs
SLIDE 15
Disaster
SLIDE 16
Active-Active
SLIDE 17
Opportunity
SLIDE 18
Flow
SLIDE 19 Fail Out of US-East-1: Case Study
➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation
SLIDE 20
SLIDE 21
SLIDE 22
SLIDE 23 Fail Out of US-East-1: Case Study
➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation
SLIDE 24 January 14, 2016
Stream Starts per Second – us-east region
SLIDE 26 Fail Out of US-East-1: Case Study
➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation
SLIDE 27
Diurnal Scaling
SLIDE 28
Y’all Ready for This?
SLIDE 29
What to Scale?
SLIDE 30 What to Scale?
- Anything absorbing incoming traffic
SLIDE 31 What to Scale?
- Anything absorbing incoming traffic
- Large stateless services
SLIDE 32 What to Scale?
- Anything absorbing incoming traffic
- Large stateless services
- Required stateful services (carefully)
SLIDE 33
That’s Better
SLIDE 34
How to Scale?
SLIDE 35
SLIDE 36 Two More Fallbacks
SLIDE 37 Two More Fallbacks
- “Time of Day” estimation
- Largest observed value in the last 24h as an intercept
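A minimal sketch of how these two fallbacks might be chained when a primary traffic estimate is unavailable. The function names and the per-minute sample layout are hypothetical, and reading "intercept" as a flat constant estimate is an assumption; none of this is Netflix's actual scaling code.

```python
from typing import Optional, Sequence

MINUTES_PER_DAY = 24 * 60


def time_of_day_estimate(history: Sequence[float], minute_of_day: int) -> Optional[float]:
    """Fallback 1: average the load seen at this minute of day on earlier days.

    `history` is a hypothetical series of per-minute samples, oldest first,
    whose first sample falls on minute 0 of a day.
    """
    samples = history[minute_of_day::MINUTES_PER_DAY]
    return sum(samples) / len(samples) if samples else None


def peak_last_24h(history: Sequence[float]) -> Optional[float]:
    """Fallback 2: the largest value in the last 24h, used as a flat estimate."""
    window = history[-MINUTES_PER_DAY:]
    return max(window) if window else None


def traffic_estimate(primary: Optional[float],
                     history: Sequence[float],
                     minute_of_day: int) -> Optional[float]:
    """Prefer the primary estimate, then time of day, then the 24h peak."""
    if primary is not None:
        return primary
    by_time_of_day = time_of_day_estimate(history, minute_of_day)
    if by_time_of_day is not None:
        return by_time_of_day
    return peak_last_24h(history)
```

The 24h peak is deliberately last: it tends to over-estimate, which costs capacity but is the safer direction to be wrong in during a failover.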
SLIDE 38
How Much?
SLIDE 39
Ooze
SLIDE 40
Nimble
SLIDE 41 Fail Out of US-East-1: Case Study
➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation
SLIDE 42
What do I mean by that?
SLIDE 43 Why We Proxy
Stream Starts per Second - EU
SLIDE 44
How do We Proxy?
- Archaius dynamic properties – regionally scoped
- Zuul proxy with dynamic filters (Groovy)
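The idea of a regionally scoped property driving a proxy filter can be sketched as follows. This is an illustration in Python, not the Archaius or Zuul API: `ProxyRule`, `PROXY_RULES`, and `route` are hypothetical names, and the per-request random sampling is an assumption about how a fraction of traffic might be selected.

```python
import random
from dataclasses import dataclass
from typing import Dict


@dataclass
class ProxyRule:
    target_region: str  # region that should serve the proxied requests
    fraction: float     # share of requests to proxy, from 0.0 to 1.0


# Hypothetical stand-in for a regionally scoped dynamic property store:
# each region reads only its own entry, and entries can change at runtime.
PROXY_RULES: Dict[str, ProxyRule] = {}


def route(local_region: str, rng: random.Random) -> str:
    """Return the region that should handle one incoming request."""
    rule = PROXY_RULES.get(local_region)
    if rule is not None and rng.random() < rule.fraction:
        return rule.target_region
    return local_region
```

Setting `PROXY_RULES["us-east-1"] = ProxyRule("us-west-2", 0.25)` would send roughly a quarter of us-east-1 requests to us-west-2, and raising the fraction step by step shifts traffic gradually without a deploy.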
SLIDE 45 Fail Out of US-East-1: Case Study
➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation
SLIDE 46
Traditional DNS
SLIDE 47
Netflix’s DNS as a DB
SLIDE 48
Failover
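One way to picture a DNS cutover is as an update to a small table of records. The sketch below is a generic illustration under assumptions: the record names, targets, and the `fail_over` helper are hypothetical and do not describe Netflix's actual DNS layout.

```python
from typing import Dict


def fail_over(records: Dict[str, str], failed: str, savior: str) -> Dict[str, str]:
    """Return a copy of `records` in which the failed region's name
    resolves to the same target as the savior region's name."""
    if failed not in records or savior not in records:
        raise KeyError("both regions need an existing record")
    updated = dict(records)
    updated[failed] = records[savior]
    return updated


# Hypothetical records: one regional name per region, each pointing at
# that region's own load balancer.
records = {
    "us-east-1.api.example.com": "lb.us-east-1.example.com",
    "us-west-2.api.example.com": "lb.us-west-2.example.com",
    "eu-west-1.api.example.com": "lb.eu-west-1.example.com",
}

after_cutover = fail_over(
    records, "us-east-1.api.example.com", "us-west-2.api.example.com"
)
```

Returning a new mapping keeps the pre-failover records around, so failing back is a matter of restoring the original table.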
SLIDE 49 Are We Done?
Stream Starts per Second - EU
SLIDE 50 Nope
Stream Starts per Second - EU
SLIDE 51 Fail Out of US-East-1: Case Study
➢ Outage! ➢ Scaling-up ➢ Proxying ➢ DNS design and cutover ➢ Improvisation
SLIDE 52
Recap: Proxying
SLIDE 54 The Crowd Goes Wild
Stream Starts per Second - EU
SLIDE 55
This is What Success Feels Like
SLIDE 56
Positive Feedback Loop
The more we practice, the better and more daring we get
SLIDE 57
Other Takeaways
SLIDE 58 Thank You and Questions
Luke Kosewski – luke@netflix.com Traffic & Chaos Engineering
SLIDE 59 Summary of NFLX github/techblog links
http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html
http://techblog.netflix.com/2016/03/global-cloud-active-active-and-beyond.html
https://github.com/Netflix/archaius
http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
https://github.com/Netflix/zuul
http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html
http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html