Keeping Movies Running Amid Thunderstorms: Fault-tolerant Systems @ Netflix - PowerPoint PPT Presentation

SLIDE 1

Keeping Movies Running Amid Thunderstorms

Fault-tolerant Systems @ Netflix

Sid Anand (@r39132) QCon SF 2011

SLIDE 2

Backgrounder

Netflix Then and Now

SLIDE 3

Netflix Then and Now

Netflix prior to circa 2009
  • Users watched DVDs at home. Peak days: Friday, Saturday, Sunday
  • Users returned DVDs & updated their queues. Peak days: Sunday, Monday
  • We shipped the next DVDs. Peak days: Monday, Tuesday
  • Scheduled site downtimes on alternate Wednesdays

Netflix post circa 2009
  • Users watch streaming at home. Peak days: Friday, Saturday, Sunday
  • Off-peak days see many orders of magnitude more traffic than prior to 2009
  • User expectation is that streaming is always available
  • No scheduled site downtimes
  • Fault tolerance is a top design concern

SLIDE 4

Netflix DC Architecture

A Simple System

SLIDE 5

Netflix’s DC Architecture

Components
  • 1 Netscaler H/W load balancer
  • ~20 “WWW” Apache + Tomcat servers
  • 3 Oracle DBs & 1 MySQL DB
  • Cache servers
  • Cinematch recommendation system

[Diagram: H/W load balancer in front of Apache + Tomcat servers, backed by Oracle, MySQL, cache servers, and the Cinematch system]

SLIDE 6

Netflix’s DC Architecture

Types of Production Issues
  • Java garbage collection problems would result in slower WWW pages
  • Deadlocks in our multi-threaded Java application would cause web page loads to time out
  • Transaction locking in the DB would result in similar web page load timeouts
  • Under-optimized SQL or DB issues would cause slower web pages (e.g. the DB optimizer picks a sub-optimal execution plan)


SLIDE 7

Netflix’s DC Architecture

Architecture Pros
  • As serious as these sound, they were typically single-system failure scenarios
  • Single-system failures are relatively easy to resolve

Architecture Cons
  • Not horizontally scalable: we're constrained by what can fit on a single box
  • Not conducive to high-velocity development and deployment


SLIDE 8

Netflix’s Cloud Architecture

A Less Simple System

SLIDE 9

Netflix’s Cloud Architecture

Components
  • Many (~100) applications, organized in clusters
  • Clusters can be at different levels in the call stack
  • Clusters can call each other

[Diagram: ELBs front the NES clusters, which call NMTS clusters via Discovery; NMTS calls NBES and IAAS]

SLIDE 10

Netflix’s Cloud Architecture

Levels
  • NES: Netflix Edge Services
  • NMTS: Netflix Mid-tier Services
  • NBES: Netflix Back-end Services
  • IAAS: AWS IAAS Services
  • Discovery: helps services discover NMTS and NBES services


SLIDE 11

Netflix’s Cloud Architecture

Components (NES) Overview

  • Any service that browsers and streaming devices connect to over the internet
  • They sit behind AWS Elastic Load Balancers (a.k.a. ELBs)
  • They call clusters at lower levels


SLIDE 12

Netflix’s Cloud Architecture

Components (NES) Examples

API Servers
  • Support the video browsing experience
  • Also allow users to modify their queue

Streaming Control Servers
  • Support streaming video playback
  • Authenticate your Wii, PS3, etc...
  • Download DRM to the Wii, PS3, etc...
  • Return a list of CDN URLs to the Wii, PS3, etc...


SLIDE 13

Netflix’s Cloud Architecture

Components (NMTS) Overview

  • Can call services at the same or lower levels: other NMTS, NBES, IAAS; not NES
  • Exposed through our Discovery service


SLIDE 14

Netflix’s Cloud Architecture

Components (NMTS) Examples

Netflix Queue Servers

Modify items in the users' movie queue

Viewing History Servers

Record and track all streaming movie watching

SIMS Servers

Compute and serve user-to-user and movie-to-movie similarities


SLIDE 15

Netflix’s Cloud Architecture

Components (NBES) Overview

  • A back-end, usually 3rd-party, open-source service
  • A leaf in the call tree: cannot call anything else


SLIDE 16

Netflix’s Cloud Architecture

Components (NBES) Examples

Cassandra Clusters

Our new cloud database is Cassandra; it stores all sorts of data to support application needs

Zookeeper Clusters

Our distributed lock service and sequence generator

Memcached Clusters

Typically caches things that we store in S3 but need to access quickly or often


SLIDE 17

Netflix’s Cloud Architecture

Components (IAAS) Examples

AWS S3

Large-sized data (e.g. video encodes, application logs, etc...) is stored here, not in Cassandra

AWS SQS

Amazon's message queue to send events (e.g. Facebook network updates are processed asynchronously over SQS)


SLIDE 18

Netflix’s Cloud Architecture

Types of Production Issues
  • A user-issued call will pass through multiple levels during normal operation
  • We are now exposed to multi-system coincident failures, a.k.a. coordinated failures


SLIDE 19

Netflix’s Cloud Architecture

Architecture Pros
  • Horizontally scalable at every level
  • Should give us maximum availability
  • Supports high-velocity development and deployment

Architecture Cons
  • A user-issued call will pass through multiple levels (a.k.a. hops) during normal operation, so latency can be a concern
  • We are now exposed to multi-system coincident failures, a.k.a. coordinated failures
  • A lot of moving parts


SLIDE 20

Issue 1

Capacity Planning

SLIDE 21

Issue 1

  • Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances
  • If either of these services expects a large increase in traffic, it needs to let the owner of Service A know
  • Service A can then scale up ahead of the traffic increase

Disaster avoided??

SLIDE 22

Issue 1

  • A given application owner may need to contact 20 other application owners each time he expects a large increase in traffic
  • Too much human coordination
  • A few options:
    • Some service owners vastly over-provision for their application
      • Not cost effective
    • Auto-scaling
      • We want to generalize the model first proved by our Streaming Control Server (a.k.a. NCCP) team

SLIDE 23

ELB AutoScaling Interlude

How to use an ELB
  • An Elastic Load Balancer (ELB) routes traffic to your EC2 instances
    • e.g. of an ELB: nccp-wii-11111111.us-east-1.elb.amazonaws.com
  • Netflix maps a CNAME to this ELB
    • e.g.: nccp.wii.netflix.com
  • Netflix then registers the API Service's EC2 instances with this ELB
  • The ELB periodically polls attached EC2 instances to ensure the instances are healthy
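To make the interlude concrete, here is a minimal, hypothetical sketch of the same steps using the modern boto3 SDK (not the tooling that existed in 2011). The load balancer name, port, health-check path, and instance IDs are invented placeholders, and the friendly CNAME would be set up separately in DNS.

    import boto3

    ELB_NAME = "nccp-wii-example"   # hypothetical name
    INSTANCE_PORT = 7001            # hypothetical Tomcat port

    elb = boto3.client("elb")  # classic Elastic Load Balancing API

    # 1. Create the ELB; AWS returns a DNS name (like nccp-wii-11111111.us-east-1.elb.amazonaws.com)
    #    that would then be mapped to a friendlier CNAME such as nccp.wii.netflix.com.
    created = elb.create_load_balancer(
        LoadBalancerName=ELB_NAME,
        Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80,
                    "InstanceProtocol": "HTTP", "InstancePort": INSTANCE_PORT}],
        AvailabilityZones=["us-east-1a", "us-east-1d"],
    )
    print("ELB DNS name:", created["DNSName"])

    # 2. Health checking: the ELB periodically polls each attached instance.
    elb.configure_health_check(
        LoadBalancerName=ELB_NAME,
        HealthCheck={"Target": f"HTTP:{INSTANCE_PORT}/healthcheck",
                     "Interval": 10, "Timeout": 5,
                     "UnhealthyThreshold": 2, "HealthyThreshold": 2},
    )

    # 3. Register the service's EC2 instances with the ELB.
    elb.register_instances_with_load_balancer(
        LoadBalancerName=ELB_NAME,
        Instances=[{"InstanceId": "i-0abc1234"}, {"InstanceId": "i-0def5678"}],
    )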

SLIDE 24

ELB AutoScaling Interlude

Taking this a bit further
  • The NCCP servers can publish metrics to AWS CloudWatch
  • We can set up an alarm in CloudWatch on a metric (e.g. CPU)
  • We can associate an auto-scale policy with that alarm (e.g. if CPU > 60%, add 3 more instances)
  • When a metric goes above a limit, an alarm is triggered, causing auto-scaling, which grows our pool

SLIDE 25

ELB AutoScaling Interlude

[Diagram: NCCP EC2 instances publish CPU data to CloudWatch; CloudWatch alarms trigger Auto Scaling policies; EC2 instances are added or removed]

SLIDE 26

ELB AutoScaling Interlude

Scale Out Event: Average CPU > 60% for 5 minutes
Scale In Event: Average CPU < 30% for 5 minutes
Cool Down Period: 10 minutes
Auto-Scale Alerts: DLAutoScaleEvents
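As a concrete illustration of wiring alarms to scaling policies, here is a minimal sketch using the modern boto3 SDK (Netflix's 2011 tooling differed). The 60%/30% thresholds, 5-minute evaluation window, and 10-minute cool-down come from the settings above; the group name, alarm names, and the scale-in adjustment of -3 are assumptions.

    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    ASG = "nccp-wii-example-v001"  # hypothetical Auto Scaling Group name

    def scaling_policy(name, adjustment):
        """Create a simple ChangeInCapacity policy and return its ARN."""
        return autoscaling.put_scaling_policy(
            AutoScalingGroupName=ASG,
            PolicyName=name,
            AdjustmentType="ChangeInCapacity",
            ScalingAdjustment=adjustment,
            Cooldown=600,  # 10-minute cool-down period
        )["PolicyARN"]

    scale_out = scaling_policy("scale-out", 3)   # "add 3 more instances"
    scale_in = scaling_policy("scale-in", -3)    # assumed symmetric scale-in

    def cpu_alarm(name, comparison, threshold, action_arn):
        """Alarm on the ASG's average CPU over one 5-minute period."""
        cloudwatch.put_metric_alarm(
            AlarmName=name,
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
            Statistic="Average",
            Period=300,
            EvaluationPeriods=1,
            Threshold=threshold,
            ComparisonOperator=comparison,
            AlarmActions=[action_arn],
        )

    cpu_alarm("nccp-cpu-high", "GreaterThanThreshold", 60.0, scale_out)
    cpu_alarm("nccp-cpu-low", "LessThanThreshold", 30.0, scale_in)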

SLIDE 27

@r39132

SLIDE 28

Issue 1

Summary: We would like to have auto-scaling at all levels.

SLIDE 29

Issue 2

Thundering herds to NMTS

SLIDE 30

Issue 2

Step 1

Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances

Step 2a

Service Y overwhelms Service A

Step 3

Services X & Y experience read and connection timeouts against an overwhelmed Service A

Step 4

Service A's tier gets 2 more machines

SLIDE 31

Issue 2

Step 5

  • New requests + retries cause request storms (a.k.a. thundering herds)
  • If Service A can be grown to exceed the retry storm's steady-state traffic volume, we can exit this vicious cycle

Step 6

  • Else, more timeouts, and the vicious cycle continues

SLIDE 32

Issue 2

Step 1

Service X and Service Y, each made up of 2 instances, call Service A, also made up of 2 instances

Step 2b

Service A experiences slowness

Step 3

Services X & Y experience read and connection timeouts against a slower Service A

Step 4

If the slowness can be fixed by adding more machines to Service A's tier, then do so

SLIDE 33

Issue 2

Step 5

  • New requests + retries cause request storms (a.k.a. thundering herds)
  • If Service A can be grown to exceed the retry storm's steady-state traffic volume, we can exit this vicious cycle

Step 6

  • Else, more timeouts, and the vicious cycle continues

SLIDE 34

Issue 2

Potential Causes of Thundering Herd

  • Service Y sends more traffic to Service A without checking whether Service A has available capacity
  • Service A slows down
  • Service Y's timeouts against Service A are set too low
  • Service Y's retries against Service A are too aggressive
  • Natural organic growth in traffic hits a tipping point in the system (in Service A in this case)

[Diagram: timeouts propagate upstream while the thundering herd hits downstream services]

SLIDE 35

Solutions to Issue 2

Thundering herds to NMTS

SLIDE 36

Solutions to Issue 2

The Platform Solution

  • Every service at Netflix sits on the platform.jar
  • The platform.jar offers 2 components of interest here:
    • NIWS library: the client side of Netflix Inter-Web Service calls. Handles retry, failover, thundering-herd prevention, & fast failure
    • BaseServer library: a set of Tomcat servlet filters that protect the underlying application servlet stack. In this context, it throttles traffic

[Diagram: Service X's NIWS client, with its throttle layer, calls Service A's BaseServer filter chain, with its throttle layer]

SLIDE 37

Solutions to Issue 2

The Platform Solution

  • NIWS library
    • Fair retry logic: e.g. exponential bounded backoff
    • Takes 2 configuration params per client:
      • Max_Num_of_Requests (a.k.a. MNR)
      • Sample_Interval_in_seconds (a.k.a. SI)
    • Ensures that a client does not send more than MNR/SI requests/s, else throttles requests at the client (see the sketch below)
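The following is a minimal, hypothetical Python sketch of the two ideas named above: a per-client MNR/SI request budget and bounded exponential backoff with jitter. It is not the actual NIWS library; the class and function names are invented for illustration.

    import random
    import time

    class ClientThrottle:
        """Hypothetical MNR/SI budget: at most max_requests per sample_interval_s, per client."""

        def __init__(self, max_requests, sample_interval_s):
            self.max_requests = max_requests            # MNR
            self.sample_interval_s = sample_interval_s  # SI
            self.window_start = time.monotonic()
            self.count = 0

        def allow(self):
            now = time.monotonic()
            if now - self.window_start >= self.sample_interval_s:
                self.window_start, self.count = now, 0  # start a new sampling window
            if self.count >= self.max_requests:
                return False                            # throttle at the client
            self.count += 1
            return True

    def call_with_backoff(remote_call, throttle, max_attempts=4,
                          base_delay_s=0.1, max_delay_s=2.0):
        """Bounded exponential backoff with jitter around a remote call."""
        for attempt in range(max_attempts):
            if not throttle.allow():
                raise RuntimeError("client-side throttle: over the MNR/SI budget")
            try:
                return remote_call()
            except (TimeoutError, ConnectionError):
                if attempt == max_attempts - 1:
                    raise                               # give up; caller degrades gracefully
                delay = min(max_delay_s, base_delay_s * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries

    # Usage sketch: call_with_backoff(lambda: fetch_from_service_a(user), ClientThrottle(100, 1.0))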

SLIDE 38

Solutions to Issue 2

The Platform Solution

  • BaseServer
    • As an additional fail-safe, the server can set throttles that are not client specific (i.e. the limits apply to total inbound traffic, regardless of client)
    • Takes 1 configuration parameter:
      • Max_Num_of_Concurrent_Requests (a.k.a. MNCR)
    • Ensures that a server does not handle more than MNCR requests at any instant
    • If the traffic exceeds the limits, reject excess calls at the server (i.e. 503s); a sketch follows
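The real BaseServer is a set of Java/Tomcat servlet filters; as a language-neutral illustration only, here is a hypothetical Python/WSGI analogue that enforces an MNCR-style cap and sheds excess requests with 503s.

    import threading

    class ConcurrencyLimitMiddleware:
        """WSGI analogue of a server-side throttle: cap concurrent requests, shed the rest with 503."""

        def __init__(self, app, max_concurrent_requests):
            self.app = app
            self.slots = threading.Semaphore(max_concurrent_requests)  # MNCR

        def __call__(self, environ, start_response):
            if not self.slots.acquire(blocking=False):
                # Over the MNCR limit: reject immediately rather than queueing more work.
                start_response("503 Service Unavailable", [("Content-Type", "text/plain")])
                return [b"throttled"]
            try:
                # Materialize the response so the slot is held for the whole request.
                return list(self.app(environ, start_response))
            finally:
                self.slots.release()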

SLIDE 39

Solutions to Issue 2

The Platform Solution

  • Graceful Degradation
    • Any client that is throttled at either the NIWS Throttle Layer or the BaseServer Throttle Layer needs to implement graceful degradation
    • Netflix's web-scale traffic falls into 2 categories:
      • Users get a personalized set of movies to pick from (i.e. via the API Edge Server path)
        • GD: show popular movies, not personalized movies (see the sketch below)
      • Users can start watching a movie (i.e. via the NCCP Edge Server path)
        • GD: a tougher problem to solve
        • When device leases expire, we honor them if we are unable to generate a new one for them
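A minimal, hypothetical sketch of the first degradation path: serve a personalized list when the mid-tier call succeeds, and fall back to an unpersonalized popular list when it is throttled or times out. The exception and function names are invented for illustration.

    class ThrottledError(Exception):
        """Raised (hypothetically) when a call is rejected by a throttle layer."""

    def movie_rows_for(user_id, personalize, popular_titles):
        """Return personalized rows when possible; degrade to popular titles otherwise."""
        try:
            return personalize(user_id)        # normal path via mid-tier services
        except (TimeoutError, ThrottledError):
            return popular_titles()            # degraded, but the page still renders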

SLIDE 40

Solutions to Issue 2

This all sounds great!

  • But what if developers do not use these built-in features of the platform, or neglect to set their configuration appropriately?
  • (i.e. the default RPS limit in the NIWS client is Integer.MAX_VALUE)

SLIDE 41

Solutions to Issue 2

We have a little help

SLIDE 42

Simian Army

Prevention is the best medicine

SLIDE 43

Simian Army

  • Chaos Monkey
    • Simulates hard failures in AWS by killing a few instances per ASG (i.e. Auto Scale Group); a sketch follows below
    • Similar to how EC2 instances can be killed by AWS with little warning
    • Tests clients' ability to gracefully deal with broken connections, interrupted calls, etc...
    • Verifies that all services are running within the protection of AWS Auto Scale Groups, which reincarnate killed instances
    • If not, the Chaos Monkey will win!
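A minimal, hypothetical sketch of the core Chaos Monkey action using the modern boto3 SDK (the real Simian Army is a separate Netflix OSS project). The ASG name and the dry-run default are assumptions.

    import random
    import boto3

    def kill_one_instance(asg_name, dry_run=True):
        """Chaos-Monkey-style action: terminate one random instance in an Auto Scaling Group."""
        autoscaling = boto3.client("autoscaling")
        ec2 = boto3.client("ec2")

        groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        instances = groups["AutoScalingGroups"][0]["Instances"]
        if not instances:
            return None

        victim = random.choice(instances)["InstanceId"]
        if not dry_run:
            ec2.terminate_instances(InstanceIds=[victim])  # the ASG should reincarnate it
        return victim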

SLIDE 44

Simian Army

  • Latency Monkey
    • Simulates soft failures -- i.e. a service gets slower
    • Injects random delays in NIWS (client-side) or BaseServer (server-side) of a client-server interaction in production; a sketch follows below
    • Tests the ability of applications to detect and recover (i.e. Graceful Degradation) from the harder problem of delays, which lead to thundering herds and timeouts
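As an illustration only, here is a hypothetical Python decorator that injects occasional random delays around a client call; the real Latency Monkey hooks into the NIWS/BaseServer layers rather than wrapping individual functions, and the probability and delay bounds are made-up defaults.

    import functools
    import random
    import time

    def inject_latency(probability=0.01, min_delay_s=0.5, max_delay_s=5.0):
        """Latency-Monkey-style wrapper: occasionally add an artificial delay to a call."""
        def decorator(call):
            @functools.wraps(call)
            def wrapper(*args, **kwargs):
                if random.random() < probability:
                    time.sleep(random.uniform(min_delay_s, max_delay_s))  # simulated slow dependency
                return call(*args, **kwargs)
            return wrapper
        return decorator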

SLIDE 45

Simian Army

Does this solve all of our issues?

SLIDE 46

Simian Army

The infinite cloud is infinite when your needs are moderate!

To ensure fairness among tenants, AWS meters or limits every resource. Hence, we hit limits quite often. Our “velocity” is limited by how long it takes for AWS to turn around and raise the limit -- a few hours!

SLIDE 47

Simian Army

  • Limits Monkey
    • Checks once a day whether we are approaching one of our limits and triggers alerts for us to proactively reach out to AWS!
  • Conformity & Janitor Monkeys
    • Find and clean up orphaned resources (e.g. EC2 instances that are not in an ASG, unreferenced security groups, ELBs, ASGs, etc...) to increase head-room; a sketch of the orphaned-instance check follows
    • Buy us more time before we run out of resources and also save us $$$$
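A minimal, hypothetical boto3 sketch of one Conformity-style check: finding running EC2 instances that belong to no Auto Scaling Group. The real monkeys cover many more resource types (security groups, ELBs, ASGs, etc.).

    import boto3

    def running_instances_outside_asgs():
        """Conformity/Janitor-style check: running EC2 instances not in any Auto Scaling Group."""
        ec2 = boto3.client("ec2")
        autoscaling = boto3.client("autoscaling")

        in_asg = {
            i["InstanceId"]
            for page in autoscaling.get_paginator("describe_auto_scaling_instances").paginate()
            for i in page["AutoScalingInstances"]
        }

        orphans = []
        pages = ec2.get_paginator("describe_instances").paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    if instance["InstanceId"] not in in_asg:
                        orphans.append(instance["InstanceId"])
        return orphans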

SLIDE 48

Simian Army

The Simian Army fills the gap created by an absence of process and a need to ensure fault-tolerance and efficient operation of our systems

SLIDE 49

Fast Rollback

Fault-tolerant deployment

SLIDE 50

Fast Rollback

What is the point of having fault-tolerant layers if deploying a bug can take them down?

SLIDE 51

Fast Rollback

SLIDE 52

Fast Rollback

Optimism causes outages

SLIDE 53

Fast Rollback

Optimism causes outages
Production traffic is unique

SLIDE 54

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running

SLIDE 55

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running
Switch traffic to new version

SLIDE 56

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running
Switch traffic to new version
Monitor results

SLIDE 57

Fast Rollback

Optimism causes outages
Production traffic is unique
Keep old version running
Switch traffic to new version
Monitor results
Revert traffic quickly
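The following is a minimal, hypothetical boto3 sketch of the traffic-switching idea behind fast rollback, using the api-frontend ELB and api-usprod-v007/v008 group names that appear on the next slides. Netflix's actual deployment tooling in 2011 differed, and the monitoring step is only indicated by a comment.

    import boto3

    autoscaling = boto3.client("autoscaling")
    elb = boto3.client("elb")

    def instance_ids_of(asg_name):
        """List the instance IDs currently in an Auto Scaling Group."""
        groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
        return [i["InstanceId"] for i in groups["AutoScalingGroups"][0]["Instances"]]

    def move_traffic(frontend_elb, from_asg, to_asg):
        """Register the new version's instances, then pull the old version out of rotation.

        Rolling back is the same call with from_asg/to_asg swapped, which is why the
        old version is kept running instead of being torn down immediately.
        """
        elb.register_instances_with_load_balancer(
            LoadBalancerName=frontend_elb,
            Instances=[{"InstanceId": i} for i in instance_ids_of(to_asg)],
        )
        # ... monitor error rates and latency here before committing ...
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName=frontend_elb,
            Instances=[{"InstanceId": i} for i in instance_ids_of(from_asg)],
        )

    # Deploy:   move_traffic("api-frontend", "api-usprod-v007", "api-usprod-v008")
    # Rollback: move_traffic("api-frontend", "api-usprod-v008", "api-usprod-v007")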

SLIDE 58

Fast Rollback

SLIDE 59

Fast Rollback

api-usprod-v007 api-frontend

SLIDE 60

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 61

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 62

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 63

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 64

Fast Rollback

api-frontend api-usprod-v008

SLIDE 65

Fast Rollback

SLIDE 66

Fast Rollback

api-usprod-v007 api-frontend

SLIDE 67

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 68

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 69

Fast Rollback

api-usprod-v007 api-frontend api-usprod-v008

SLIDE 70

Fast Rollback

api-usprod-v007 api-frontend

SLIDE 71

Acknowledgements

Platform Engineering

  • Sudhir Tonse
  • Pradeep Kamath

Engineering Tools

  • Joe Sondow

Streaming Server

  • Ranjit Mavinkurve

SLIDE 72

Questions?

Sid Anand @r39132
