Keeping Movies Running Amid Thunderstorms: Fault-tolerant Systems @ Netflix - PowerPoint PPT Presentation

SLIDE 1

Keeping Movies Running Amid Thunderstorms

Fault-tolerant Systems @ Netflix

Sid Anand (@r39132) QCon SF 2011

SLIDE 2

Backgrounder

Netflix Then and Now

SLIDE 3

Netflix Then and Now

Netflix prior to circa 2009
  • Users watched DVDs at home. Peak days: Friday, Saturday, Sunday
  • Users returned DVDs & updated their queues. Peak days: Sunday, Monday
  • We shipped the next DVDs. Peak days: Monday, Tuesday
  • Scheduled site downtimes on alternate Wednesdays

Netflix post circa 2009
  • Users watch streaming at home. Peak days: Friday, Saturday, Sunday
  • Off-peak days see many orders of magnitude more traffic than prior to 2009
  • User expectation is that streaming is always available
  • No scheduled site downtimes
  • Fault tolerance is a top design concern

SLIDE 4

Netflix DC Architecture

A Simple System

SLIDE 5

Netflix’s DC Architecture

Components
  • 1 Netscaler H/W load balancer
  • ~20 “WWW” Apache + Tomcat servers
  • 3 Oracle DBs & 1 MySQL DB
  • Cache servers
  • Cinematch recommendation system

[Diagram: H/W load balancer in front of Apache + Tomcat servers, backed by Oracle, MySQL, cache servers, and the Cinematch system]

SLIDE 6

Netflix’s DC Architecture

Types of Production Issues
  • Java garbage collection problems would result in slower WWW pages
  • Deadlocks in our multi-threaded Java application would cause web page loads to time out
  • Transaction locking in the DB would result in similar web page load timeouts
  • Under-optimized SQL or DB issues would cause slower web pages (e.g. the DB optimizer picks a sub-optimal execution plan)


SLIDE 7

Netflix’s DC Architecture

Architecture Pros
  • As serious as these sound, they were typically single-system failure scenarios
  • Single-system failures are relatively easy to resolve

Architecture Cons
  • Not horizontally scalable: we're constrained by what can fit on a single box
  • Not conducive to high-velocity development and deployment


SLIDE 8

Netflix’s Cloud Architecture

A Less Simple System

SLIDE 9

Netflix’s Cloud Architecture

Components
  • Many (~100) applications, organized in clusters
  • Clusters can be at different levels in the call stack
  • Clusters can call each other

[Diagram: ELBs front the NES clusters, which call NMTS clusters via Discovery; NMTS calls NBES and IAAS]

SLIDE 10

Netflix’s Cloud Architecture

Levels
  • NES: Netflix Edge Services
  • NMTS: Netflix Mid-tier Services
  • NBES: Netflix Back-end Services
  • IAAS: AWS IAAS Services
  • Discovery: helps services discover NMTS and NBES services


SLIDE 11

Netflix’s Cloud Architecture

Components (NES) Overview

  • Any service that browsers and streaming devices connect to over the internet
  • They sit behind AWS Elastic Load Balancers (a.k.a. ELBs)
  • They call clusters at lower levels


SLIDE 12

Netflix’s Cloud Architecture

Components (NES) Examples

API Servers
  • Support the video browsing experience
  • Also allow users to modify their queue

Streaming Control Servers
  • Support streaming video playback
  • Authenticate your Wii, PS3, etc...
  • Download DRM to the Wii, PS3, etc...
  • Return a list of CDN URLs to the Wii, PS3, etc...


SLIDE 13

Netflix’s Cloud Architecture

Components (NMTS) Overview

  • Can call services at the same or lower levels: other NMTS, NBES, IAAS; not NES
  • Exposed through our Discovery service


SLIDE 14

Netflix’s Cloud Architecture

Components (NMTS) Examples

Netflix Queue Servers

Modify items in the users' movie queue

Viewing History Servers

Record and track all streaming movie watching

SIMS Servers

Compute and serve user-to-user and movie-to-movie similarities


SLIDE 15

Netflix’s Cloud Architecture

Components (NBES) Overview

  • A back-end, usually 3rd-party, open-source service
  • A leaf in the call tree: cannot call anything else


SLIDE 16

Netflix’s Cloud Architecture

Components (NBES) Examples

Cassandra Clusters

Our new cloud database is Cassandra; it stores all sorts of data to support application needs

Zookeeper Clusters

Our distributed lock service and sequence generator

Memcached Clusters

Typically caches things that we store in S3 but need to access quickly or often


SLIDE 17

Netflix’s Cloud Architecture

Components (IAAS) Examples

AWS S3

Large-sized data (e.g. video encodes, application logs, etc...) is stored here, not in Cassandra

AWS SQS

Amazon's message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS)


SLIDE 18

Netflix’s Cloud Architecture

Types of Production Issues
  • A user-issued call will pass through multiple levels during normal operation
  • We are now exposed to multi-system coincident failures, a.k.a. coordinated failures


SLIDE 19

Netflix’s Cloud Architecture

Architecture Pros
  • Horizontally scalable at every level
  • Should give us maximum availability
  • Supports high-velocity development and deployment

Architecture Cons
  • A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation, so latency can be a concern
  • We are now exposed to multi-system coincident failures, a.k.a. coordinated failures
  • A lot of moving parts


SLIDE 20

Issue 1

Capacity Planning

SLIDE 21

Issue 1

  • Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances
  • If either of these services expects a large increase in traffic, it needs to let the owner of Service A know
  • Service A can then scale up ahead of the traffic increase

Disaster avoided??

SLIDE 22

Issue 1

  • A given application owner may need to contact 20 other application owners each time he expects a large increase in traffic
  • Too much human coordination
  • A few options:
    • Some service owners vastly over-provision for their application
      • Not cost effective
    • Auto-scaling
      • We want to generalize the model first proved by our Streaming Control Server (a.k.a. NCCP) team

SLIDE 23

ELB AutoScaling Interlude

How to use an ELB
  • An Elastic Load Balancer (ELB) routes traffic to your EC2 instances
    • e.g. of an ELB: nccp-wii-11111111.us-east-1.elb.amazonaws.com
  • Netflix maps a CNAME to this ELB
    • e.g.: nccp.wii.netflix.com
  • Netflix then registers the API Service's EC2 instances with this ELB
  • The ELB periodically polls attached EC2 instances to ensure the instances are healthy
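To make the interlude concrete, here is a minimal, hypothetical sketch of the same steps using the modern boto3 SDK (not the tooling that existed in 2011). The load balancer name, port, health-check path, and instance IDs are invented placeholders, and the friendly CNAME would be set up separately in DNS.

    import boto3

    ELB_NAME = "nccp-wii-example"   # hypothetical name
    INSTANCE_PORT = 7001            # hypothetical Tomcat port

    elb = boto3.client("elb")  # classic Elastic Load Balancing API

    # 1. Create the ELB; AWS returns a DNS name (like nccp-wii-11111111.us-east-1.elb.amazonaws.com)
    #    that would then be mapped to a friendlier CNAME such as nccp.wii.netflix.com.
    created = elb.create_load_balancer(
        LoadBalancerName=ELB_NAME,
        Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                    "InstanceProtocol": "HTTP", "InstancePort": INSTANCE_PORT}],
        AvailabilityZones=["us-east-1a", "us-east-1d"],
    )
    print("ELB DNS name:", created["DNSName"])

    # 2. Health checking: the ELB periodically polls each attached instance.
    elb.configure_health_check(
        LoadBalancerName=ELB_NAME,
        HealthCheck={"Target": f"HTTP:{INSTANCE_PORT}/healthcheck",
                     "Interval": 10, "Timeout": 5,
                     "UnhealthyThreshold": 2, "HealthyThreshold": 2},
    )

    # 3. Register the service's EC2 instances with the ELB.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=ELB_NAME,
        Instances=[{"InstanceId": "i-0abc1234"}, {"InstanceId": "i-0def5678"}],
    )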

SLIDE 24

ELB AutoScaling Interlude

Taking this a bit further
  • The NCCP servers can publish metrics to AWS CloudWatch
  • We can set up an alarm in CloudWatch on a metric (e.g. CPU)
  • We can associate an auto-scale policy with that alarm (e.g. if CPU > 60%, add 3 more instances)
  • When a metric goes above a limit, an alarm is triggered, causing auto-scaling, which grows our pool

SLIDE 25

ELB AutoScaling Interlude

[Diagram: NCCP EC2 instances publish CPU data to CloudWatch; CloudWatch alarms trigger Auto Scaling policies; EC2 instances are added or removed]

SLIDE 26

ELB AutoScaling Interlude

Scale Out Event: Average CPU > 60% for 5 minutes
Scale In Event: Average CPU < 30% for 5 minutes
Cool Down Period: 10 minutes
Auto-Scale Alerts: DLAutoScaleEvents
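As a concrete illustration of wiring alarms to scaling policies, here is a minimal sketch using the modern boto3 SDK (Netflix's 2011 tooling differed). The 60%/30% thresholds, 5-minute evaluation window, and 10-minute cool-down come from the settings above; the group name, alarm names, and the scale-in adjustment of -3 are assumptions.

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    ASG = "nccp-wii-example-v001"  # hypothetical Auto Scaling Group name

    def scaling_policy(name, adjustment):
        """Create a simple ChangeInCapacity policy and return its ARN."""
        return autoscaling.put_scaling_policy(
            AutoScalingGroupName=ASG,
            PolicyName=name,
            AdjustmentType="ChangeInCapacity",
            ScalingAdjustment=adjustment,
            Cooldown=600,  # 10-minute cool-down period
        )["PolicyARN"]

    scale_out = scaling_policy("scale-out", 3)   # "add 3 more instances"
    scale_in = scaling_policy("scale-in", -3)    # assumed symmetric scale-in

    def cpu_alarm(name, comparison, threshold, action_arn):
        """Alarm on the ASG's average CPU over one 5-minute period."""
        cloudwatch.put_metric_alarm(
            AlarmName=name,
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=1,
            Threshold=threshold,
            ComparisonOperator=comparison,
            AlarmActions=[action_arn],
        )

    cpu_alarm("nccp-cpu-high", "GreaterThanThreshold", 60.0, scale_out)
    cpu_alarm("nccp-cpu-low", "LessThanThreshold", 30.0, scale_in)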

SLIDE 27

@r39132

SLIDE 28

Issue 1

Summary: We would like to have auto-scaling at all levels.

SLIDE 29

Issue 2

Thundering herds to NMTS

SLIDE 30

Issue 2

Step 1

Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances

Step 2a

Service Y overwhelms Service A

Step 3

Services X & Y experience read and connection timeouts against an overwhelmed Service A

Step 4

Service A's tier gets 2 more machines

SLIDE 31

Issue 2

Step 5

  • New requests + retries cause request storms (a.k.a. thundering herds)
  • If Service A can be grown to exceed the retry storm's steady-state traffic volume, we can exit this vicious cycle

Step 6

  • Else, more timeouts, and the vicious cycle continues

SLIDE 32

Issue 2

Step 1

Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances

Step 2b

Service A experiences slowness

Step 3

Services X & Y experience read and connection timeouts against a slower Service A

Step 4

If the slowness can be fixed by adding more machines to Service A's tier, then do so

SLIDE 33

Issue 2

Step 5

  • New requests + retries cause request storms (a.k.a. thundering herds)
  • If Service A can be grown to exceed the retry storm's steady-state traffic volume, we can exit this vicious cycle

Step 6

  • Else, more timeouts, and the vicious cycle continues

SLIDE 34

Issue 2

Potential Causes of Thundering Herd

  • Service Y sends more traffic to Service A without checking whether Service A has available capacity
  • Service A slows down
  • Service Y's timeouts against Service A are set too low
  • Service Y's retries against Service A are too aggressive
  • Natural organic growth in traffic hits a tipping point in the system (in Service A in this case)

[Diagram: timeouts propagate upstream while the thundering herd hits downstream services]

SLIDE 35

Solutions to Issue 2

Thundering herds to NMTS

SLIDE 36

Solutions to Issue 2

The Platform Solution

  • Every service at Netflix sits on the platform.jar
  • The platform.jar offers 2 components of interest here:
    • NIWS library: the client side of Netflix Inter-Web Service calls. Handles retry, failover, thundering-herd prevention, & fast failure
    • BaseServer library: a set of Tomcat servlet filters that protect the underlying application servlet stack. In this context, it throttles traffic

[Diagram: Service X's NIWS client, with its throttle layer, calls Service A's BaseServer filter chain, with its throttle layer]

SLIDE 37

Solutions to Issue 2

The Platform Solution

  • NIWS library
    • Fair retry logic: e.g. exponential bounded backoff
    • Takes 2 configuration params per client:
      • Max_Num_of_Requests (a.k.a. MNR)
      • Sample_Interval_in_seconds (a.k.a. SI)
    • Ensures that a client does not send more than MNR/SI requests/s, else throttles requests at the client (see the sketch below)
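The following is a minimal, hypothetical Python sketch of the two ideas named above: a per-client MNR/SI request budget and bounded exponential backoff with jitter. It is not the actual NIWS library; the class and function names are invented for illustration.

    import random
    import time

    class ClientThrottle:
        """Hypothetical MNR/SI budget: at most max_requests per sample_interval_s, per client."""

        def __init__(self, max_requests, sample_interval_s):
            self.max_requests = max_requests            # MNR
            self.sample_interval_s = sample_interval_s  # SI
            self.window_start = time.monotonic()
            self.count = 0

        def allow(self):
            now = time.monotonic()
            if now - self.window_start >= self.sample_interval_s:
                self.window_start, self.count = now, 0  # start a new sampling window
            if self.count >= self.max_requests:
                return False                            # throttle at the client
            self.count += 1
            return True

    def call_with_backoff(remote_call, throttle, max_attempts=4,
                          base_delay_s=0.1, max_delay_s=2.0):
        """Bounded exponential backoff with jitter around a remote call."""
        for attempt in range(max_attempts):
            if not throttle.allow():
                raise RuntimeError("client-side throttle: over the MNR/SI budget")
            try:
                return remote_call()
            except (TimeoutError, ConnectionError):
                if attempt == max_attempts - 1:
                    raise                               # give up; caller degrades gracefully
                delay = min(max_delay_s, base_delay_s * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries

    # Usage sketch: call_with_backoff(lambda: fetch_from_service_a(user), ClientThrottle(100, 1.0))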

SLIDE 38

Solutions to Issue 2

The Platform Solution

  • BaseServer
    • As an additional fail-safe, the server can set throttles that are not client specific (i.e. the limits apply to total inbound traffic, regardless of client)
    • Takes 1 configuration parameter:
      • Max_Num_of_Concurrent_Requests (a.k.a. MNCR)
    • Ensures that a server does not handle more than MNCR requests at any instant
    • If the traffic exceeds the limits, reject excess calls at the server (i.e. 503s); a sketch follows
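The real BaseServer is a set of Java/Tomcat servlet filters; as a language-neutral illustration only, here is a hypothetical Python/WSGI analogue that enforces an MNCR-style cap and sheds excess requests with 503s.

    import threading

    class ConcurrencyLimitMiddleware:
        """WSGI analogue of a server-side throttle: cap concurrent requests, shed the rest with 503."""

        def __init__(self, app, max_concurrent_requests):
            self.app = app
            self.slots = threading.Semaphore(max_concurrent_requests)  # MNCR

        def __call__(self, environ, start_response):
            if not self.slots.acquire(blocking=False):
                # Over the MNCR limit: reject immediately rather than queueing more work.
                start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
                return [b"throttled"]
            try:
                # Materialize the response so the slot is held for the whole request.
                return list(self.app(environ, start_response))
            finally:
                self.slots.release()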

SLIDE 39

Solutions to Issue 2

The Platform Solution

  • Graceful Degradation
    • Any client that is throttled at either the NIWS Throttle Layer or the BaseServer Throttle Layer needs to implement graceful degradation
    • Netflix's web-scale traffic falls into 2 categories:
      • Users get a personalized set of movies to pick from (i.e. via the API Edge Server path)
        • GD: show popular movies, not personalized movies (see the sketch below)
      • Users can start watching a movie (i.e. via the NCCP Edge Server path)
        • GD: a tougher problem to solve
        • When device leases expire, we honor them if we are unable to generate a new one for them
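A minimal, hypothetical sketch of the first degradation path: serve a personalized list when the mid-tier call succeeds, and fall back to an unpersonalized popular list when it is throttled or times out. The exception and function names are invented for illustration.

    class ThrottledError(Exception):
        """Raised (hypothetically) when a call is rejected by a throttle layer."""

    def movie_rows_for(user_id, personalize, popular_titles):
        """Return personalized rows when possible; degrade to popular titles otherwise."""
        try:
            return personalize(user_id)        # normal path via mid-tier services
        except (TimeoutError, ThrottledError):
            return popular_titles()            # degraded, but the page still renders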

SLIDE 40

Solutions to Issue 2

This all sounds great!

  • But what if developers do not use these built-in features of the platform, or neglect to set their configuration appropriately?
  • (i.e. the default RPS limit in the NIWS client is Integer.MAX_VALUE)

SLIDE 41

Solutions to Issue 2

We have a little help

SLIDE 42

Simian Army

Prevention is the best medicine

SLIDE 43

Simian Army

  • Chaos Monkey
    • Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scale Group); a sketch follows below
    • Similar to how EC2 instances can be killed by AWS with little warning
    • Tests clients' ability to gracefully deal with broken connections, interrupted calls, etc...
    • Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances
    • If not, the Chaos Monkey will win!
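A minimal, hypothetical sketch of the core Chaos Monkey action using the modern boto3 SDK (the real Simian Army is a separate Netflix OSS project). The ASG name and the dry-run default are assumptions.

    import random
    import boto3

    def kill_one_instance(asg_name, dry_run=True):
        """Chaos-Monkey-style action: terminate one random instance in an Auto Scaling Group."""
        autoscaling = boto3.client("autoscaling")
        ec2 = boto3.client("ec2")

        groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        instances = groups["AutoScalingGroups"][0]["Instances"]
        if not instances:
            return None

        victim = random.choice(instances)["InstanceId"]
        if not dry_run:
            ec2.terminate_instances(InstanceIds=[victim])  # the ASG should reincarnate it
        return victim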

SLIDE 44

Simian Army

  • Latency Monkey
    • Simulates soft failures -- i.e. a service gets slower
    • Injects random delays in NIWS (client-side) or BaseServer (server-side) of a client-server interaction in production; a sketch follows below
    • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays, which lead to thundering herds and timeouts
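As an illustration only, here is a hypothetical Python decorator that injects occasional random delays around a client call; the real Latency Monkey hooks into the NIWS/BaseServer layers rather than wrapping individual functions, and the probability and delay bounds are made-up defaults.

    import functools
    import random
    import time

    def inject_latency(probability=0.01, min_delay_s=0.5, max_delay_s=5.0):
        """Latency-Monkey-style wrapper: occasionally add an artificial delay to a call."""
        def decorator(call):
            @functools.wraps(call)
            def wrapper(*args, **kwargs):
                if random.random() < probability:
                    time.sleep(random.uniform(min_delay_s, max_delay_s))  # simulated slow dependency
                return call(*args, **kwargs)
            return wrapper
        return decorator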

SLIDE 45

Simian Army

Does this solve all of our issues?

SLIDE 46

Simian Army

The infinite cloud is infinite when your needs are moderate!

To ensure fairness among tenants, AWS meters or limits every resource. Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit -- a few hours!

SLIDE 47

Simian Army

  • Limits Monkey
    • Checks once a day whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS!
  • Conformity & Janitor Monkeys
    • Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc...) to increase head-room; a sketch of the orphaned-instance check follows
    • Buy us more time before we run out of resources and also save us $$$$
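A minimal, hypothetical boto3 sketch of one Conformity-style check: finding running EC2 instances that belong to no Auto Scaling Group. The real monkeys cover many more resource types (security groups, ELBs, ASGs, etc.).

    import boto3

    def running_instances_outside_asgs():
        """Conformity/Janitor-style check: running EC2 instances not in any Auto Scaling Group."""
        ec2 = boto3.client("ec2")
        autoscaling = boto3.client("autoscaling")

        in_asg = {
            i["InstanceId"]
            for page in autoscaling.get_paginator("describe_auto_scaling_instances").paginate()
            for i in page["AutoScalingInstances"]
        }

        orphans = []
        pages = ec2.get_paginator("describe_instances").paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    if instance["InstanceId"] not in in_asg:
                        orphans.append(instance["InstanceId"])
        return orphans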

SLIDE 48

Simian Army

The Simian Army fills the gap created by an absence of process and a need to ensure fault-tolerance and efficient operation of our systems

SLIDE 49

Fast Rollback

Fault-tolerant deployment

SLIDE 50

Fast Rollback

What is the point of having fault-tolerant layers if deploying a bug can take them down?

SLIDE 51

Fast Rollback

SLIDE 52

Fast Rollback

Optimism causes outages

SLIDE 53

Fast Rollback

Optimism causes outages
Production traffic is unique

SLIDE 54

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running

SLIDE 55

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running
Switch traffic to new version

SLIDE 56

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running
Switch traffic to new version
Monitor results

SLIDE 57

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running
Switch traffic to new version
Monitor results
Revert traffic quickly
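The following is a minimal, hypothetical boto3 sketch of the traffic-switching idea behind fast rollback, using the api-frontend ELB and api-usprod-v007/v008 group names that appear on the next slides. Netflix's actual deployment tooling in 2011 differed, and the monitoring step is only indicated by a comment.

    import boto3

    autoscaling = boto3.client("autoscaling")
    elb = boto3.client("elb")

    def instance_ids_of(asg_name):
        """List the instance IDs currently in an Auto Scaling Group."""
        groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        return [i["InstanceId"] for i in groups["AutoScalingGroups"][0]["Instances"]]

    def move_traffic(frontend_elb, from_asg, to_asg):
        """Register the new version's instances, then pull the old version out of rotation.

        Rolling back is the same call with from_asg/to_asg swapped, which is why the
        old version is kept running instead of being torn down immediately.
        """
        elb.register_instances_with_load_balancer(
            LoadBalancerName=frontend_elb,
            Instances=[{"InstanceId": i} for i in instance_ids_of(to_asg)],
        )
        # ... monitor error rates and latency here before committing ...
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=frontend_elb,
            Instances=[{"InstanceId": i} for i in instance_ids_of(from_asg)],
        )

    # Deploy:   move_traffic("api-frontend", "api-usprod-v007", "api-usprod-v008")
    # Rollback: move_traffic("api-frontend", "api-usprod-v008", "api-usprod-v007")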

SLIDE 58

Fast Rollback

SLIDE 59

Fast Rollback

api-usprod-v007 api-frontend

SLIDE 60

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 61

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 62

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 63

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 64

Fast Rollback

api-frontend api-usprod-v008

SLIDE 65

Fast Rollback

SLIDE 66

Fast Rollback

api-usprod-v007 api-frontend

SLIDE 67

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 68

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 69

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 70

Fast Rollback

api-usprod-v007 api-frontend

SLIDE 71

Acknowledgements

Platform Engineering

  • Sudhir Tonse
  • Pradeep Kamath

Engineering Tools

  • Joe Sondow

Streaming Server

  • Ranjit Mavinkurve

SLIDE 72

Questions?

Sid Anand @r39132
