Netflix: Petabyte Scale Analytics

Netflix: Petabyte Scale Analytics Infrastructure in the Cloud
Daniel C. Weeks, Tom Gianos

Overview
● Data at Netflix
● Netflix Scale
● Platform Architecture
● Data Warehouse
● Genie
● Q&A

Data at Netflix

Our Biggest Challenge is Scale

Netflix Key Business Metrics
86+ million members
Global
1000+ devices supported
125+ million hours / day

Netflix Key Platform Metrics
500B Events
60 PB DW
Read 3 PB
Write 500 TB

Big Data Platform Architecture

Data Pipelines
Event Data: Cloud apps → Kafka → Ursula → S3 (5 min)
Dimension Data: Cassandra → SSTables → Aegisthus → S3 (Daily)

[Diagram: platform layers]
Interface: Big Data Portal, Big Data API
Tools: Transport, Visualization, Quality, Workflow Vis, Job/Cluster Vis
Service: Orchestration, Metadata
Compute
Storage: Parquet, S3

[Diagram: clusters]
Production: ~2300 d2.4xl
Ad-hoc: ~1200 d2.4xl
Other

S3 Data Warehouse

Why S3?
• Lots of 9’s
• Features not available in HDFS
• Decouple Compute and Storage

Decoupled Scaling
[Chart: Warehouse Size vs. HDFS Capacity; labels: All Clusters, 3x Replication, No Buffer]
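The chart labels above suggest a capacity calculation. A minimal sketch of that arithmetic follows, assuming the 60 PB warehouse figure from the Key Platform Metrics slide and reading "All Clusters" as each cluster holding its own full copy in HDFS. The function name and the cluster count are hypothetical and do not come from the slides.

```python
def raw_hdfs_capacity_pb(warehouse_pb, replication=3, clusters=1):
    """Raw HDFS capacity (PB) needed to hold the warehouse with no buffer.

    Assumption: every cluster keeps a full copy of the warehouse, and HDFS
    stores each block `replication` times.
    """
    return warehouse_pb * replication * clusters


# 60 PB is the warehouse size quoted on the Key Platform Metrics slide.
single_cluster = raw_hdfs_capacity_pb(60)
# Two clusters is a hypothetical count used only for illustration.
two_clusters = raw_hdfs_capacity_pb(60, clusters=2)
print(single_cluster, two_clusters)
```

With S3 as the shared store, the compute clusters no longer need to grow just because the warehouse does.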

Decouple Compute / Storage
[Diagram: Production and Ad-hoc clusters sharing S3]

Tradeoffs - Performance
• Split Calculation (Latency)
  – Impacts job start time
  – Executes off cluster
• Table Scan (Latency + Throughput)
  – Parquet seeks add latency
  – Read overhead and available throughput
• Performance Converges with Volume and Complexity

Tradeoffs - Performance

Metadata
• Metacat: Federated Metadata Service
• Hive Thrift Interface
• Logical Abstraction

Partitioning - Less is More
Database: data_science, etl, telemetry, ab_test
Table: country_d, catalog_d, playback_f, search_f
Partition: date=20161101, date=20161102, date=20161103, date=20161104

Partition Locations
data_science.playback_f
date=20161101: s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101/…
date=20161102: s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161102/…
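The location pattern on this slide can be sketched as a small helper. This is an illustration only: the function name and the bucket value "example-bucket" are hypothetical, and the helper stops at the partition prefix because the slide elides everything after it.

```python
def partition_prefix(bucket, database, table, dateint):
    """Build the S3 prefix for one date partition, following the slide's pattern."""
    return (
        f"s3://{bucket}/hive/warehouse/"
        f"{database}.db/{table}/dateint={dateint}/"
    )


# "example-bucket" is a placeholder; the slide shows only <bucket>.
prefix = partition_prefix("example-bucket", "data_science", "playback_f", 20161101)
print(prefix)
```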

Parquet

Parquet File Format
Column Oriented
● Store column data contiguously
● Improve compression
● Column projection
Strong Community Support
● Spark, Presto, Hive, Pig, Drill, Impala, etc.
● Works well with S3

[Diagram: Parquet file layout]
Row Group: Column Chunks, made up of Dict Pages and Data Pages
Footer: schema, version, etc.
RowGroup Metadata: row count, size, etc.
Column Chunk Metadata: [encoding, size, min, max] for each column chunk

Staging Data
• Partition by low cardinality fields
• Sort by high cardinality predicate fields
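One way to see why sorting by a predicate field helps is to simulate the per-chunk min/max statistics that Parquet keeps. The sketch below is pure Python and does not read or write Parquet. The row-group size, the data, and the function names are all assumptions made for illustration.

```python
import random


def row_group_stats(values, rows_per_group):
    """Split values into fixed-size row groups and record (min, max) per group."""
    groups = [
        values[i:i + rows_per_group]
        for i in range(0, len(values), rows_per_group)
    ]
    return [(min(group), max(group)) for group in groups]


def groups_to_read(stats, target):
    """Count row groups whose [min, max] range could contain target."""
    return sum(1 for low, high in stats if low <= target <= high)


random.seed(0)
shuffled = list(range(10_000))
random.shuffle(shuffled)

# Same data, same row-group size; only the ordering differs.
unsorted_stats = row_group_stats(shuffled, 1_000)
sorted_stats = row_group_stats(sorted(shuffled), 1_000)

unsorted_reads = groups_to_read(unsorted_stats, 4_242)
sorted_reads = groups_to_read(sorted_stats, 4_242)
print(unsorted_reads, sorted_reads)
```

After sorting, each row group covers a narrow, non-overlapping range, so an equality filter on that field can rule out every group but one from the statistics alone.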

Staging Data Original Sorted

Filtered Original Processed

Parquet Tuning Guide
http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide

A Nascent Data Platform Gateway

Need Somewhere to Test
[Diagram: Prod Gateway in front of Prod; Test Gateway in front of Test]

More Users = More Resources
[Diagram: several Prod Gateways in front of Prod; several Test Gateways in front of Test]

Clusters for Specific Purposes
[Diagram: Prod, Test, and Backfill, each behind its own set of gateways]

User Base Matures
[Diagram: Prod, Test, and Backfill with their gateways, surrounded by user requests]
• "R?"
• "There’s a bug in Presto 0.149, need 0.150"
• "I want Spark 1.6.1"
• "I need Spark 2.0"
• "My job is slow, I need more resources"

No one is happy

Genie to the Rescue Prod Test Backfill

Problems Netflix Data Platform Faces
• For Administrators
  – Coordination of many moving parts
    • ~15 clusters
    • ~45 different client executables and versions for those clusters
  – Heavy load
    • ~45-50k jobs per day
  – Hundreds of users with different problems
• For Users
  – Don’t want to know details
  – All clusters and client applications need to be available for use
  – Need to provide tools to make doing their jobs easy

Genie for the Platform Administrator

An administrator wants a tool to…
• Simplify configuration management and deployment
• Minimize impact of changes to users
• Track and respond to problems with system quickly
• Scale client resources as load increases

Genie Configuration Data Model
• Cluster: metadata about cluster
  – [sched:sla, type:yarn, ver:2.7.1]
• Command: executable(s)
  – [type:spark-submit, ver:1.6.0]
• Application: dependencies for an executable
Cluster 1 to 0..* Command; Command 1 to 0..* Application

Search Resources

Administration Use Cases

Updating a Cluster
• Start up a new cluster
• Register Cluster with Genie
• Run tests
• Move tags from old to new cluster in Genie
  – New cluster begins taking load immediately
• Let old jobs finish on old cluster
• Shut down old cluster
• No down time!
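The tag move in the steps above can be sketched with plain sets. This is not Genie's API: the cluster names, the `resolve` and `move_tag` helpers, and the in-memory registry are hypothetical stand-ins for tag-based routing.

```python
# Hypothetical registry: cluster name -> set of tags.
clusters = {
    "prod-old": {"sched:sla", "type:yarn", "ver:2.7.1"},
    "prod-new": {"type:yarn", "ver:2.7.1"},
}


def resolve(registry, criteria):
    """Return the names of clusters whose tags include every criterion."""
    wanted = set(criteria)
    return sorted(name for name, tags in registry.items() if wanted <= tags)


def move_tag(registry, tag, old, new):
    """Move one tag from the old cluster to the new one."""
    registry[old].discard(tag)
    registry[new].add(tag)


criteria = ["type:yarn", "sched:sla"]
before = resolve(clusters, criteria)
move_tag(clusters, "sched:sla", "prod-old", "prod-new")
after = resolve(clusters, criteria)
print(before, after)
```

Clients keep sending the same criteria throughout. Only the registry changes, which is why the switch is invisible to them.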

Load Balance Between Clusters
• Different loads at different times of day
• Copy tags from one cluster to another to split load
• Remove tags when done
• Transparent to all clients!

Update Application Binaries
• Copy new binaries to central download location
• Genie cache will invalidate old binaries on next invocation and download new ones
• Instant change across entire Genie cluster

Genie for Users

User wants a tool to…
• Discover a cluster to run job on
• Run the job client
• Handle all dependencies and configuration
• Monitor the job
• View history of jobs
• Get job results

Submitting a Job
{
  …
  "clusterCriteria":[ "type:yarn", "sched:sla" ],
  "commandCriteria":[ "type:spark", "ver:1.6.0" ]
  …
}
[Diagram: the request is matched against Clusters and Commands]
1. https://analyticsforinsights.files.wordpress.com/2015/04/superman-data-scientist-graphic.jpg
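A minimal sketch of assembling the two criteria fields shown on this slide follows, using only the standard library. The helper name is hypothetical. The slide elides the rest of the request with "…", so this is a fragment rather than a complete Genie job request, and the field shapes simply mirror the slide.

```python
import json


def build_criteria_fragment(cluster_criteria, command_criteria):
    """Return only the criteria fields shown on the slide.

    A real request carries more fields, which the slide elides.
    """
    return {
        "clusterCriteria": list(cluster_criteria),
        "commandCriteria": list(command_criteria),
    }


fragment = build_criteria_fragment(
    ["type:yarn", "sched:sla"],
    ["type:spark", "ver:1.6.0"],
)
body = json.dumps(fragment, indent=2)
print(body)
```

The user names what kind of cluster and command the job needs by tag, and leaves the choice of a concrete cluster to Genie.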

Genie Job Data Model
[Diagram: Job Request, Job, Job Metadata, and Job Execution linked one-to-one; a Job is linked to 1 Cluster, 1 Command, and 0..* Applications]

Job Request

Python Client Example

Job History

Job Output

Wrapping Up

Data Warehouse
• S3 for Scale
• Decouple Compute & Storage
• Parquet for Speed

Genie at Netflix
• Runs the OSS code
• Runs ~45k jobs per day in production
• Runs on ~25 i2.4xl instances at any given time
• Keeps ~3 months of jobs (~3.1 million) in history

Resources
• http://netflix.github.io/genie/
  – Work in progress for 3.0.0
• https://github.com/Netflix/genie
  – Demo instructions in README
• https://hub.docker.com/r/netflixoss/genie-app/
  – Docker Container

Questions?
