Netflix: Petabyte Scale Analytics - PowerPoint PPT Presentation



SLIDE 1

Daniel C. Weeks Tom Gianos

Netflix: Petabyte Scale Analytics Infrastructure in the Cloud

SLIDE 2
  • Data at Netflix
  • Netflix Scale
  • Platform Architecture
  • Data Warehouse
  • Genie
  • Q&A

Overview

SLIDE 3

Data at Netflix

SLIDE 4
SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8

Our Biggest Challenge is Scale

SLIDE 9

Netflix Key Business Metrics

  • 86+ million members
  • Global
  • 1000+ devices supported
  • 125+ million hours / day

SLIDE 10

Netflix Key Platform Metrics

  • 60 PB data warehouse
  • Read: 3 PB
  • Write: 500 TB
  • 500B events

SLIDE 11

Big Data Platform Architecture

SLIDE 12

Data Pipelines

  • Event data: cloud apps → Kafka → Ursula → S3 (every 5 min)
  • Dimension data: Cassandra → SSTables → Aegisthus → S3 (daily)

SLIDE 13

Layered architecture (from the diagram):
  • Storage: S3 + Parquet
  • Compute
  • Services: Transport, Orchestration, Metadata, Quality, Visualization (Workflow Vis, Job/Cluster Vis)
  • Tools / Interface: Big Data API, Big Data Portal

SLIDE 14

  • Production: ~2300 d2.4xl
  • Ad-hoc: ~1200 d2.4xl
  • Other

SLIDE 15

S3 Data Warehouse

SLIDE 16

Why S3?

  • Lots of 9’s
  • Features not available in HDFS
  • Decouple Compute and Storage
SLIDE 17

Decoupled Scaling

Chart: warehouse size vs. required HDFS capacity (all clusters, 3x replication, no buffer).

SLIDE 18

Decouple Compute / Storage

Diagram: Production and Ad-hoc clusters share one S3 warehouse, decoupling compute from storage.

SLIDE 19

Tradeoffs - Performance

  • Split Calculation (Latency)
    – Impacts job start time
    – Executes off cluster
  • Table Scan (Latency + Throughput)
    – Parquet seeks add latency
    – Read overhead and available throughput
  • Performance converges with volume and complexity
SLIDE 20

Tradeoffs - Performance

SLIDE 21

Metadata

  • Metacat: Federated Metadata Service
  • Hive Thrift Interface
  • Logical Abstraction
SLIDE 22

Partitioning - Less is More

Example hierarchy:
  • Databases: ab_test, telemetry, etl, data_science
  • Tables: country_d, catalog_d, playback_f, search_f
  • Partitions: date=20161101, date=20161102, date=20161103, date=20161104

SLIDE 23

Partition Locations

Example: partitions date=20161101 and date=20161102 of data_science.playback_f map to:

s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161101/… s3://<bucket>/hive/warehouse/data_science.db/playback_f/dateint=20161102/…
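The mapping above can be sketched as a trivial helper (illustrative only; "my-bucket" stands in for the elided <bucket> placeholder):

```python
# Illustrative: build the per-partition S3 location recorded in the
# metastore, following the path layout shown above.
def partition_location(bucket: str, database: str, table: str, dateint: int) -> str:
    return (f"s3://{bucket}/hive/warehouse/{database}.db/"
            f"{table}/dateint={dateint}/")

print(partition_location("my-bucket", "data_science", "playback_f", 20161101))
```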

SLIDE 24

Parquet

SLIDE 25

Parquet File Format

Column Oriented

  • Store column data contiguously
  • Improve compression
  • Column projection

Strong Community Support

  • Spark, Presto, Hive, Pig, Drill, Impala, etc.
  • Works well with S3
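The column-oriented layout and column projection can be pictured with a pure-Python illustration (this models the idea, not Parquet's actual encoding):

```python
# Row-oriented: whole records stored together, one tuple per row.
row_layout = [("US", 1.5), ("US", 2.0), ("FR", 0.5)]

# Column-oriented: each column's values stored contiguously.
column_layout = {"country": ["US", "US", "FR"],
                 "hours": [1.5, 2.0, 0.5]}

# Column projection: a query over "hours" touches only that contiguous
# list; with the row layout every full record must be visited.
hours_total = sum(column_layout["hours"])
print(hours_total)
```

Contiguous storage of like-typed values is also what improves compression: runs such as ["US", "US"] encode far better together than interleaved with floats.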
SLIDE 26

Footer

  • File metadata: schema, version, etc.
  • RowGroup metadata: row count, size, etc.
  • Column chunk metadata: [encoding, size, min, max] per chunk

Row groups

  • A file holds one or more row groups
  • Each row group holds one column chunk per column
  • Each column chunk holds an optional dictionary page followed by data pages

SLIDE 27

Staging Data

  • Partition by low cardinality fields
  • Sort by high cardinality predicate fields
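A toy model of why this staging strategy pays off at read time (pure Python, not Parquet internals; member_id is a hypothetical predicate field):

```python
import random

# Sorted data gives each "row group" a narrow [min, max] range, so the
# footer's column-chunk stats let a reader skip most groups entirely.

def row_groups(values, size):
    return [values[i:i + size] for i in range(0, len(values), size)]

def groups_matching(groups, target):
    # Readers consult per-chunk min/max stats before scanning a group.
    return [g for g in groups if min(g) <= target <= max(g)]

random.seed(0)
member_ids = random.sample(range(1000), 1000)   # stand-in for a member_id column

unsorted_hits = groups_matching(row_groups(member_ids, 100), 42)
sorted_hits = groups_matching(row_groups(sorted(member_ids), 100), 42)
print(len(sorted_hits), len(unsorted_hits))   # sorted data: 1 group to scan
```

Unsorted, nearly every group's [min, max] straddles the predicate value, so no group can be skipped.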
SLIDE 28

Staging Data

Chart: Sorted vs. Original.

SLIDE 29

Filtered

Chart: Processed vs. Original.

SLIDE 30

Parquet Tuning Guide

http://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide

SLIDE 31
SLIDE 32

A Nascent Data Platform

Gateway

SLIDE 33

Need Somewhere to Test

Diagram: Test gateway → Test cluster alongside Prod gateway → Prod cluster.

SLIDE 34

More Users = More Resources

Diagram: multiple Prod gateways and Test gateways, each fronting their clusters.

SLIDE 35

Clusters for Specific Purposes

Diagram: Prod, Test, and Backfill gateways, each fronting their own clusters.

SLIDE 36

User Base Matures

Diagram: Prod, Test, and Backfill gateways, now facing a stream of user requests:

"R?" "I need Spark 2.0" "I want Spark 1.6.1" "There's a bug in Presto 0.149, need 0.150" "My job is slow" "I need more resources"

SLIDE 37

No one is happy

SLIDE 38

Genie to the Rescue

Diagram: Genie routes jobs to the Prod, Test, and Backfill clusters.

SLIDE 39

Problems Netflix Data Platform Faces

  • For Administrators
    – Coordination of many moving parts
      • ~15 clusters
      • ~45 different client executables and versions for those clusters
    – Heavy load
      • ~45-50k jobs per day
    – Hundreds of users with different problems
  • For Users
    – Don't want to know details
    – All clusters and client applications need to be available for use
    – Need tools that make doing their jobs easy

SLIDE 40

Genie for the Platform Administrator

SLIDE 41

An administrator wants a tool to…

  • Simplify configuration management and deployment
  • Minimize impact of changes to users
  • Track and respond to problems with the system quickly
  • Scale client resources as load increases
SLIDE 42

Genie Configuration Data Model

  • Metadata about cluster
    – [sched:sla, type:yarn, ver:2.7.1]
  • Executable(s)
    – [type:spark-submit, ver:1.6.0]
  • Dependencies for an executable

Model: Cluster 1 → 0..* Command 1 → 0..* Application
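The configuration data model can be sketched with plain dataclasses (a hedged illustration; the field names are assumptions, not Genie's actual schema):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Application:
    # Dependencies/binaries an executable needs at runtime.
    name: str

@dataclass
class Command:
    # An executable plus its metadata tags; 0..* applications.
    tags: Set[str]
    applications: List[Application] = field(default_factory=list)

@dataclass
class Cluster:
    # Cluster metadata tags; 0..* commands registered against it.
    tags: Set[str]
    commands: List[Command] = field(default_factory=list)

spark_submit = Command({"type:spark-submit", "ver:1.6.0"},
                       [Application("spark-1.6.0-bin")])
sla_cluster = Cluster({"sched:sla", "type:yarn", "ver:2.7.1"}, [spark_submit])
print(sorted(sla_cluster.tags))
```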

SLIDE 43

Search Resources

SLIDE 44

Administration Use Cases

SLIDE 45

Updating a Cluster

  • Start up a new cluster
  • Register Cluster with Genie
  • Run tests
  • Move tags from old to new cluster in Genie

– New cluster begins taking load immediately

  • Let old jobs finish on old cluster
  • Shut down old cluster
  • No down time!
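The tag-move step can be pictured with a toy routing model (illustrative only, not Genie's API): jobs match any cluster whose tags cover their criteria, so moving a tag redirects new jobs instantly.

```python
# Toy model of tag-based cluster routing.
clusters = {
    "prod-old": {"sched:sla", "type:yarn", "ver:2.7.1"},
    "prod-new": {"type:yarn", "ver:2.7.1"},
}

def matching_clusters(criteria):
    # A job can run on any cluster whose tags cover its criteria.
    return sorted(name for name, tags in clusters.items() if criteria <= tags)

criteria = {"sched:sla", "type:yarn"}
print(matching_clusters(criteria))        # only the old cluster matches

# "Move tags from old to new cluster": new jobs route to prod-new
# immediately, while in-flight jobs finish on prod-old.
clusters["prod-old"].discard("sched:sla")
clusters["prod-new"].add("sched:sla")
print(matching_clusters(criteria))        # now only the new cluster
```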
SLIDE 46

Load Balance Between Clusters

  • Different loads at different times of day
  • Copy tags from one cluster to another to split load
  • Remove tags when done
  • Transparent to all clients!
SLIDE 47

Update Application Binaries

  • Copy new binaries to central download location
  • Genie cache will invalidate old binaries on next invocation and download new ones
  • Instant change across entire Genie cluster
SLIDE 48

Genie for Users

SLIDE 49

User wants a tool to…

  • Discover a cluster to run a job on
  • Run the job client
  • Handle all dependencies and configuration
  • Monitor the job
  • View history of jobs
  • Get job results
SLIDE 50

Submitting a Job


{
  …
  "clusterCriteria": [ "type:yarn", "sched:sla" ],
  "commandCriteria": [ "type:spark", "ver:1.6.0" ]
  …
}

Diagram: criteria matched against registered Commands and Clusters.

SLIDE 51

Genie Job Data Model

Entities: Job Request, Job, Job Execution, and Job Metadata (1:1 with the job), linked to the Cluster, Command, and Application(s) the job ran with.

SLIDE 52

Job Request

SLIDE 53

Python Client Example
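The slide's code did not survive extraction; below is a hedged stand-in that builds a job request like the JSON shown on the submission slide. Only clusterCriteria/commandCriteria come from the deck; the other field names, the job details, and the endpoint URL are illustrative assumptions, not Genie's documented schema.

```python
import json

# Hypothetical job request payload.
job_request = {
    "name": "example-spark-job",                       # illustrative
    "clusterCriteria": ["type:yarn", "sched:sla"],     # from the deck
    "commandCriteria": ["type:spark", "ver:1.6.0"],    # from the deck
    "commandArgs": "--class com.example.Job app.jar",  # illustrative
}

body = json.dumps(job_request).encode()
print(body.decode())

# A client would POST this to a Genie endpoint (placeholder URL), e.g.:
# import urllib.request
# req = urllib.request.Request("http://genie.example.com/api/v3/jobs",
#                              data=body, method="POST",
#                              headers={"Content-Type": "application/json"})
```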

SLIDE 54

Job History

SLIDE 55

Job Output

SLIDE 56

Wrapping Up

SLIDE 57

Data Warehouse

  • S3 for Scale
  • Decouple Compute & Storage
  • Parquet for Speed
SLIDE 58

Genie at Netflix

  • Runs the OSS code
  • Runs ~45k jobs per day in production
  • Runs on ~25 i2.4xl instances at any given time
  • Keeps ~3 months of jobs (~3.1 million) in history

SLIDE 59

Resources

  • http://netflix.github.io/genie/
    – Work in progress for 3.0.0
  • https://github.com/Netflix/genie
    – Demo instructions in README
  • https://hub.docker.com/r/netflixoss/genie-app/
    – Docker Container

SLIDE 60

Questions?