Chaos Kong: Endowing Netflix with Antifragility, by Luke Kosewski

slide-1
SLIDE 1

Chaos Kong Endowing Netflix with Antifragility

Luke Kosewski Traffic & Chaos Engineering

slide-2
SLIDE 2

This is a Case Study We’ll Be Doing TOGETHER

slide-3
SLIDE 3

This is What AWS Failover Looks Like

us-west-2 us-east-1 eu-west-1

slide-4
SLIDE 4

Failover is Run By This Guy

A Traffic Engineer

slide-5
SLIDE 5

Failover is Run By This Guy

A Traffic Engineer

slide-6
SLIDE 6

A Traffic Engineer’s Environment

  • Netflix control plane
slide-7
SLIDE 7

A Traffic Engineer’s Environment

  • Netflix control plane
  • Primarily in 3 AWS regions (EU, us-east-1, us-west-2)
slide-8
SLIDE 8

A Traffic Engineer’s Environment

  • Netflix control plane
  • Primarily in 3 AWS regions (EU, us-east-1, us-west-2)
  • They look like this:

us-west-2 us-east-1 eu-west-1

slide-9
SLIDE 9

Casey Rosenthal

Traffic’s Teammates

Chaos Intuition

traffic@netflix.com / chaos@netflix.com

Traffic

(management) Lorin Hochstein, Aaron Blohowiak & Ali Basiri Niosha Behnam & myself

Justin Reynolds

slide-10
SLIDE 10

Our Relationship

Chaos Traffic

High Availability

Flow

slide-11
SLIDE 11

Storytime with Luke

Once upon a time... (August 2013)

slide-12
SLIDE 12

3 SREs at Netflix

slide-13
SLIDE 13

3 SREs at Netflix 10s of services

slide-14
SLIDE 14

3 SREs at Netflix 10s of services 100s of devs

slide-15
SLIDE 15

Disaster

slide-16
SLIDE 16

Active-Active

slide-17
SLIDE 17

Opportunity

slide-18
SLIDE 18

Flow

slide-19
SLIDE 19

Fail Out of US-East-1: Case Study

➢ Outage!
➢ Scaling-up
➢ Proxying
➢ DNS design and cutover
➢ Improvisation

slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23

Fail Out of US-East-1: Case Study

➢ Outage!
➢ Scaling-up
➢ Proxying
➢ DNS design and cutover
➢ Improvisation

slide-24
SLIDE 24

January 14, 2016

Stream Starts per Second – us-east region

slide-25
SLIDE 25

January 14, 2016

Stream Starts per Second – us-east region

slide-26
SLIDE 26

Fail Out of US-East-1: Case Study

➢ Outage!
➢ Scaling-up
➢ Proxying
➢ DNS design and cutover
➢ Improvisation

slide-27
SLIDE 27

Diurnal Scaling

slide-28
SLIDE 28

Y’all Ready for This?

slide-29
SLIDE 29

What to Scale?

slide-30
SLIDE 30

What to Scale?

  • Anything absorbing incoming traffic
slide-31
SLIDE 31

What to Scale?

  • Anything absorbing incoming traffic
  • Large stateless services
slide-32
SLIDE 32

What to Scale?

  • Anything absorbing incoming traffic
  • Large stateless services
  • Required stateful services (carefully)
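
The three bullets above can be read as a simple triage over a service catalog. The following is a minimal sketch of that reading, not Netflix tooling: the `Service` fields, the `LARGE_THRESHOLD` cutoff, and the category names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical cutoff for what counts as a "large" stateless service.
LARGE_THRESHOLD = 100


@dataclass(frozen=True)
class Service:
    # All fields are illustrative assumptions, not a real catalog schema.
    name: str
    instance_count: int = 0
    takes_incoming_traffic: bool = False
    stateless: bool = False
    required_stateful: bool = False


def scaling_category(svc: Service) -> Optional[str]:
    """Map a service to one of the three slide bullets, or None to skip it."""
    if svc.takes_incoming_traffic:
        return "incoming-traffic"
    if svc.stateless and svc.instance_count >= LARGE_THRESHOLD:
        return "large-stateless"
    if svc.required_stateful:
        return "stateful-careful"  # scale, but with extra care
    return None


def scaling_plan(services: List[Service]) -> Dict[str, List[str]]:
    """Group the names of services that need scaling by category."""
    plan: Dict[str, List[str]] = {}
    for svc in services:
        category = scaling_category(svc)
        if category is not None:
            plan.setdefault(category, []).append(svc.name)
    return plan


catalog = [
    Service("edge-proxy", instance_count=50, takes_incoming_traffic=True),
    Service("api", instance_count=400, stateless=True),
    Service("small-tool", instance_count=3, stateless=True),
    Service("session-store", instance_count=30, required_stateful=True),
]
plan = scaling_plan(catalog)
```

Small stateless services fall through to `None` here on the assumption that they have enough headroom already; the talk does not say how that decision is made in practice.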
slide-33
SLIDE 33

That’s Better

slide-34
SLIDE 34

How to Scale?

slide-35
SLIDE 35
slide-36
SLIDE 36

Two More Fallbacks

  • “Time of Day” estimation
slide-37
SLIDE 37

Two More Fallbacks

  • “Time of Day” estimation
  • largest observed value in the last 24h as an intercept
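
A fallback chain like the one listed above can be sketched as an ordered list of estimators, where the first one that returns a value wins. This is only an illustration under stated assumptions: the slides do not describe the primary estimate, so it is a placeholder here; the hourly profile and the sample list are hypothetical inputs; and the 24h maximum is used directly as the estimate rather than as an intercept in a fitted model.

```python
from typing import Callable, Dict, Optional, Sequence

# An estimator returns a demand figure, or None when it has no data.
Estimator = Callable[[], Optional[float]]


def time_of_day_estimate(hourly_profile: Dict[int, float], hour: int) -> Optional[float]:
    """Look up typical demand for this hour in a (hypothetical) historical profile."""
    return hourly_profile.get(hour)


def max_last_24h(samples: Sequence[float]) -> Optional[float]:
    """Use the largest value observed in the last 24h as a conservative estimate."""
    return max(samples) if samples else None


def estimate_demand(estimators: Sequence[Estimator]) -> float:
    """Return the first estimate that is available, trying estimators in order."""
    for estimate in estimators:
        value = estimate()
        if value is not None:
            return value
    raise RuntimeError("no demand estimate available")


profile = {20: 900.0}             # hypothetical: only 20:00 has a profile entry
samples = [400.0, 1200.0, 800.0]  # hypothetical samples from the last 24h

demand = estimate_demand([
    lambda: None,                              # placeholder primary signal, unavailable
    lambda: time_of_day_estimate(profile, 3),  # no entry for 03:00, so falls through
    lambda: max_last_24h(samples),             # last resort: 24h maximum
])
```

Ordering the estimators from most to least specific means the coarse 24h maximum is only used when nothing better is available.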

slide-38
SLIDE 38

How Much?

slide-39
SLIDE 39

Ooze

slide-40
SLIDE 40

Nimble

slide-41
SLIDE 41

Fail Out of US-East-1: Case Study

➢ Outage!
➢ Scaling-up
➢ Proxying
➢ DNS design and cutover
➢ Improvisation

slide-42
SLIDE 42

What do I mean by that?

slide-43
SLIDE 43

Why We Proxy

Stream Starts per Second - EU

slide-44
SLIDE 44

How do We Proxy?

  • Archaius dynamic properties – regionally scoped
  • Zuul proxy with dynamic filters (Groovy)
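
The idea of a regionally scoped property driving a proxy filter can be sketched as follows. This is a Python illustration only and uses neither the Archaius nor the Zuul API (real Zuul filters of that era were written in Groovy): the property name `failover.proxy.target`, the in-memory `PROPERTIES` store, and the `route` function are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical property store keyed by (property name, region).
# Setting a value for one region changes behaviour only in that region.
PROPERTIES: Dict[Tuple[str, str], str] = {}


def get_property(name: str, region: str) -> Optional[str]:
    """Read a regionally scoped property; None means it is not set."""
    return PROPERTIES.get((name, region))


def route(path: str, local_region: str) -> Tuple[str, str]:
    """Decide which region should serve a request that arrived in local_region."""
    target = get_property("failover.proxy.target", local_region)
    if target is not None and target != local_region:
        return (target, path)    # proxy the request to the target region
    return (local_region, path)  # serve it locally


before = route("/play", "us-east-1")

# Flip the property for us-east-1 only; other regions are unaffected.
PROPERTIES[("failover.proxy.target", "us-east-1")] = "us-west-2"
after = route("/play", "us-east-1")
elsewhere = route("/play", "eu-west-1")
```

Because the property is read on every request, changing it takes effect without redeploying the proxy, which is the point of pairing dynamic properties with dynamic filters.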

slide-45
SLIDE 45

Fail Out of US-East-1: Case Study

➢ Outage!
➢ Scaling-up
➢ Proxying
➢ DNS design and cutover
➢ Improvisation

slide-46
SLIDE 46

Traditional DNS

slide-47
SLIDE 47

Netflix’s DNS as a DB

slide-48
SLIDE 48

Failover
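
The slide titles do not say how the records are stored, so the following is one possible reading of "DNS as a DB", sketched under assumptions: a table maps each hostname to the region that currently serves it, and the cutover is a single rewrite of every row that points at the failed region. The hostnames under `example.com` and the `fail_out` helper are hypothetical.

```python
from typing import Dict

# Hypothetical table: hostname -> region currently serving it.
records: Dict[str, str] = {
    "api-us-east-1.example.com": "us-east-1",
    "api-us-west-2.example.com": "us-west-2",
    "api-eu-west-1.example.com": "eu-west-1",
}


def fail_out(table: Dict[str, str], failed: str, savior: str) -> Dict[str, str]:
    """Return a new table where names served by `failed` point at `savior`."""
    return {
        name: (savior if region == failed else region)
        for name, region in table.items()
    }


after_failover = fail_out(records, failed="us-east-1", savior="us-west-2")
```

Returning a new table instead of mutating the old one keeps the pre-failover state around, so failing back is just a matter of reapplying the original rows.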

slide-49
SLIDE 49

Are We Done?

Stream Starts per Second - EU

slide-50
SLIDE 50

Nope

Stream Starts per Second - EU

slide-51
SLIDE 51

Fail Out of US-East-1: Case Study

➢ Outage!
➢ Scaling-up
➢ Proxying
➢ DNS design and cutover
➢ Improvisation

slide-52
SLIDE 52

Recap: Proxying

slide-53
SLIDE 53

Recap: Proxying

slide-54
SLIDE 54

The Crowd Goes Wild

Stream Starts per Second - EU

slide-55
SLIDE 55

This is What Success Feels Like

slide-56
SLIDE 56

Positive Feedback Loop

The more we practice, the better and more daring we get

slide-57
SLIDE 57

Other Takeaways

slide-58
SLIDE 58

Thank You and Questions

Luke Kosewski – luke@netflix.com
Traffic & Chaos Engineering

slide-59
SLIDE 59

Summary of NFLX github/techblog links

  • Active/Active

http://techblog.netflix.com/2013/12/active-active-for-multi-regional.html
http://techblog.netflix.com/2016/03/global-cloud-active-active-and-beyond.html

  • Archaius

https://github.com/Netflix/archaius
http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html

  • Zuul

https://github.com/Netflix/zuul
http://techblog.netflix.com/2013/06/announcing-zuul-edge-service-in-cloud.html

  • SPS

http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html