Taming large state for real-time joins
Sonali Sharma & Shriya Arora, Netflix
Waiting for your data be like ....
“I love waiting for my data” - said no stakeholder ever!
Sonali Sharma & Shriya Arora
- Senior Data Engineers, Data Science and Engineering, Netflix
- Build data products for personalization
- Build low-latency data pipelines
- Work with petabyte-scale data
Coming up in the next 40 minutes
- Use case for a stateful streaming pipeline
- Concept and Building blocks of streaming apps
- Data join in a streaming context (windows)
- Challenges in building low latency pipeline
Use case for streaming pipeline
Netflix Traffic
- 1 trillion events per day
- 100 PB of data stored on cloud
Recommendations everywhere!
Which artwork to show?
Signal: Take Fraction
Take Fraction = 1 / 3
[Diagram: the artwork is shown to users A, B, and C; one plays the title and two do not, giving a take fraction of 1/3]
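The arithmetic behind the signal is simple; here is a hypothetical worked example in Scala (values invented for illustration):

```scala
// Hypothetical worked example of the take fraction signal:
// the artwork was shown to three profiles, one of them played the title.
val impressions = 3
val plays = 1
val takeFraction = plays.toDouble / impressions
println(f"take fraction = $takeFraction%.2f") // prints "take fraction = 0.33"
```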
Making a case for streaming ETL
- Real-time reporting
- Real-time alerting
- Faster training of ML models
- Computational gains
Recap: Use case
- Join impression events with playback events in real time to calculate take fraction
- Train the model faster and on fresher data
- Convert a large batch data processing pipeline to a stateful streaming pipeline
Concepts and Building Blocks
Modern stream processing frameworks
QCon stream processing talks, 2017
Bounded vs Unbounded Data
- Batch: data at rest, hard boundaries
- Stream: data is unbounded
Solution: Windows
Windows split the stream into buckets of finite size, over which we can apply computations.

Group by:
```scala
stream.keyBy(...)
      .window(...)
      [.trigger(...)]
      [.allowedLateness(...)]
      .reduce/aggregate/fold/apply()
```

Join:
```scala
stream.join(otherStream)
      .where(<KeySelector>)
      .equalTo(<KeySelector>)
      .window(<WindowAssigner>)
      .apply(<JoinFunction>)
```
Event time vs processing time
[Diagram: a clock and a sequence of numbered events, contrasting event time (when the event happened) with processing time (when it is observed)]
Out-of-order and late-arriving events
[Diagram: two bursts of events from the Netflix apps are bucketed differently by event-time windows vs. processing-time windows]
Ingestion pipeline
Solution: Watermark
A watermark is a notion of input completeness with respect to event time. Watermarks act as a metric of progress when processing an unbounded data source.
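As a sketch of how this looks in Flink's DataStream API (assuming a hypothetical `Event` type with an epoch-millis `eventTs` field, and Flink 1.11+ on the classpath):

```scala
import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

// Sketch: declare that events may arrive up to 30 seconds out of order;
// the watermark then trails the maximum seen event time by 30 seconds.
val withWatermarks = stream.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[Event](Duration.ofSeconds(30))
    .withTimestampAssigner(new SerializableTimestampAssigner[Event] {
      override def extractTimestamp(e: Event, recordTs: Long): Long = e.eventTs
    })
)
```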
Slowly changing dimensions
Enriching stream with dimensional data
Combine streams
[Diagram: raw streams are combined and enriched via API calls for movie metadata (Hive or a data map), producing an enriched stream]
Fault tolerance
[Diagram: checkpoint {n-1} and checkpoint {n} positioned between older and newer records in the stream]
Checkpoint
- Snapshot of metadata and state of the app
- Helps in recovery
Checkpoint interval
- The interval should cover the checkpoint duration and the minimum pause between checkpoints, with some buffer
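A sketch of the corresponding Flink configuration (the interval, pause, and timeout values are illustrative, not the ones used in the talk):

```scala
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Trigger a checkpoint every 10 minutes ...
env.enableCheckpointing(10 * 60 * 1000L)
// ... but leave at least 2 minutes between the end of one checkpoint and the
// start of the next, so the application can make progress in between.
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(2 * 60 * 1000L)
// Fail any checkpoint that does not complete within 5 minutes.
env.getCheckpointConfig.setCheckpointTimeout(5 * 60 * 1000L)
```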
Recap: Concepts and Building blocks
- Handle unbounded data by defining boundaries using windows
- Process on event time
- Handle out-of-order and late-arriving events using watermarks
- Enrich data in-stream using external calls
- Fault tolerance is very important for streaming applications
Making a stream join work
Data Flow Architecture
[Diagram: the impression and playback streams are read from Kafka, each goes through Transform + AssignTs, is partitioned with .keyBy, then reduced and written to the output]
Data Flow Architecture
[Diagram: the Transform + AssignTs stage expands into Parse (raw -> T), Filter (T -> T), and AssignTs (t.getTs())]
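As a sketch, the Transform + AssignTs stage for one stream might look like this (`parse`, `isImpression`, the Kafka source, and the watermark strategy are hypothetical placeholders):

```scala
// Sketch of one branch of the pipeline: raw Kafka records are parsed,
// filtered down to the events of interest, and assigned event timestamps.
val impressions = env
  .addSource(impressionKafkaSource)                 // raw records from Kafka
  .map(raw => parse(raw))                           // Parse: raw -> T
  .filter(t => isImpression(t))                     // Filter: T -> T
  .assignTimestampsAndWatermarks(watermarkStrategy) // AssignTs: t.getTs()
```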
Joining streams: Keyed Streams
[Diagram: applying .keyBy turns a DataStream into a KeyedStream]
Stream joins in Flink: Maintaining State
- Events need to be held in state for user-defined intervals of time for meaningful aggregations
- Data held in state needs to be cleared when no longer needed
[Diagram: keyed state for keys A, B, and C is kept in RocksDB and snapshotted at checkpoints]
Aggregating streams: Windows
Windows split the stream into buckets of finite size, over which we can apply computations.
- Stream volume: 200k events/sec/region
- Repeating values for the same keys: 3-4
Aggregating streams
Can the events be summarized as they come?
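One way to do this in Flink is incremental aggregation with reduce, which keeps one summarized value per key and window instead of every raw event (the `Impression` case class and its fields are invented for illustration):

```scala
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Sketch: with 3-4 repeating values per key, summarizing on arrival keeps
// one record per key in state instead of 3-4 raw events.
case class Impression(profileId: String, titleId: String, count: Int)

val summarized = impressions
  .keyBy(i => (i.profileId, i.titleId))
  .window(TumblingEventTimeWindows.of(Time.minutes(10)))
  .reduce((a, b) => a.copy(count = a.count + b.count))
```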
Updating state: CoProcess Function
[Diagram: impression (I) and playback (P) events keyed by K1, K3, and K4 are merged into a per-key ValueState&lt;T&gt; holding a composite type of I + P]
Stream joins in Flink: Updating State
- Timers
○ Flink’s TimerService can be used to register callbacks for future time instants.
[Diagram: processElement() writes aggregated elements to state and registers a timer via the timer service; onTimer() later reads the state and emits the aggregated elements]
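A condensed sketch of this pattern (the event and output types, field names, and the 4-hour expiry are invented for illustration; the Flink APIs shown are real):

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction
import org.apache.flink.util.Collector

// Assumed event shapes, keyed by a (profile, title) string.
case class Impression(key: String)
case class Playback(key: String)

// Composite value held in keyed state: impressions and plays seen so far.
case class TakeState(impressions: Int = 0, plays: Int = 0)

class TakeFractionJoin
    extends KeyedCoProcessFunction[String, Impression, Playback, (String, Double)] {

  private var state: ValueState[TakeState] = _

  override def open(parameters: Configuration): Unit =
    state = getRuntimeContext.getState(
      new ValueStateDescriptor("take-state", classOf[TakeState]))

  override def processElement1(
      imp: Impression,
      ctx: KeyedCoProcessFunction[String, Impression, Playback, (String, Double)]#Context,
      out: Collector[(String, Double)]): Unit = {
    val cur = Option(state.value()).getOrElse(TakeState())
    state.update(cur.copy(impressions = cur.impressions + 1))
    // Register a callback 4 hours (event time) after this impression.
    ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 4 * 60 * 60 * 1000L)
  }

  override def processElement2(
      play: Playback,
      ctx: KeyedCoProcessFunction[String, Impression, Playback, (String, Double)]#Context,
      out: Collector[(String, Double)]): Unit = {
    val cur = Option(state.value()).getOrElse(TakeState())
    state.update(cur.copy(plays = cur.plays + 1))
  }

  override def onTimer(
      ts: Long,
      ctx: KeyedCoProcessFunction[String, Impression, Playback, (String, Double)]#OnTimerContext,
      out: Collector[(String, Double)]): Unit = {
    val s = state.value()
    if (s != null && s.impressions > 0)
      out.collect((ctx.getCurrentKey, s.plays.toDouble / s.impressions))
    state.clear() // clear state that is no longer needed
  }
}
```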
Recap
[Diagram: the impression and playback streams are read from Kafka, each goes through Transform + AssignTs, is partitioned with .keyBy, summarized, and written to the output]
Challenges
Challenge: Data Correctness
- Trade-offs
○ Latency vs. completeness
- Duplicates
○ Most streaming systems are at-least-once
○ De-duplication explodes state
- Data validation
○ Real-time auditing of data
○ How to stop the incoming flow of bad data?
Challenge: Operations
Visibility into event time progression
Challenge: Operations
- Visibility into state
- Monitoring checkpoints
- Periodic Savepoints
- Intercepting RocksDB
metrics
Challenge: Data recovery
- Replaying from Kafka
○ Checkpoints contain offset information
○ Different streams have different volumes
- Replaying from Hive
○ Kafka retention is expensive
○ Easier for stateless applications
Solution: Replaying from Kafka
- Ingestion-time filtering
○ Read all input streams from earliest
○ The Netflix Kafka producer stamps processing time
○ Filter out events based on processing time
[Timeline: the system goes down at T2 and comes back up at T7; on restart, only events ingested between T2 and T7 are replayed]

```scala
stream.filter(e => e.ingestionTs > T2 && e.ingestionTs < T7)
```
Challenge: Region failovers
- Event time is dependent on incoming data
- Force-move the watermark via a maxInactivity parameter
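The maxInactivity parameter described here is a custom mechanism; recent Flink versions offer an analogous built-in, withIdleness, which sketches the same idea (the 5-minute value is illustrative):

```scala
import java.time.Duration
import org.apache.flink.api.common.eventtime.WatermarkStrategy

// Sketch: if a source partition emits nothing for 5 minutes (e.g. its region
// failed over), mark it idle so it no longer holds back the overall watermark.
val strategy = WatermarkStrategy
  .forBoundedOutOfOrderness[Event](Duration.ofSeconds(30))
  .withIdleness(Duration.ofMinutes(5))
```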
Challenges we are working on
- State Schema Evolution
- Application level De-duplication
- Auto Scaling and recovery
- Replaying and Restating data
Finally
- Fresher data for Personalization models
- Enhanced user experience
- Enable stakeholders to make decisions earlier
- Save on storage and compute costs
- Real-time auditing and early detection of data gaps