Real Time Recommendations using Spark Streaming
Elliot Chow

Why?
- React more quickly to changes in interest
- Time-of-day effects
- Real-world events

Feedback Loop (diagram): UI, Recommendation Systems, Data Systems, Stream Processing

Trends Data
- What people browse: impressions
- What people watch: plays

Trends Data - Impressions
Appearance of a video in the viewport

Trends Data - Plays
Member plays a video

Why Spark Streaming?
- Existing Spark infrastructure
- Experience with Spark
- Batch and Streaming

Components

Design (pipeline diagram)
- Plays: Consume -> Filter
- Impressions: Consume -> Filter
- Both streams: Join -> Aggregate -> Transform -> Cassandra, S3

Join Key
- “Request Id”: a unique identifier of the source of a play or impression

Design - Aggregate step (pipeline diagram repeated: Consume -> Filter -> Join -> Aggregate -> Transform -> Cassandra, S3)

Output
Video            Epoch            Plays  Impressions
Stranger Things  1 (00:00-00:30)  4      5
Stranger Things  1 (00:00-00:30)  3      6
House Of Cards   2 (00:30-01:00)  8      10
Marseille        2 (00:30-01:00)  3      3

Output
- Instead of raw counts, output sets of request ids
- Count = cardinality of the set of request ids
- Idempotent counting
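The set-based counting above can be sketched with plain Scala collections. This is only an illustration under assumptions, not the pipeline's actual code: the `IdempotentCounts` object, the `(video, epoch)` key shape, and the `record` and `count` helpers are hypothetical names. It shows why replaying the same request id leaves the count unchanged.

```scala
object IdempotentCounts {
  type RequestId = String
  // Hypothetical key shape: (video title, epoch number).
  type Key = (String, Int)

  // Merge one observation into the aggregate.
  // Adding a request id that is already present leaves the set unchanged.
  def record(agg: Map[Key, Set[RequestId]], key: Key, id: RequestId): Map[Key, Set[RequestId]] =
    agg.updated(key, agg.getOrElse(key, Set.empty[RequestId]) + id)

  // The count is the cardinality of the set of request ids.
  def count(agg: Map[Key, Set[RequestId]], key: Key): Int =
    agg.get(key).map(_.size).getOrElse(0)
}
```

Because set insertion is idempotent, reprocessing a batch after a failure cannot inflate the count, which a raw counter incremented per record would do.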

Design - Transform step (pipeline diagram repeated: Consume -> Filter -> Join -> Aggregate -> Transform -> Cassandra, S3)

Programming with Spark Streaming

Streaming Joins

Streaming Joins - Time
- Time to browse and select a video
- Batched logging from client application
- Delays in data sources

Streaming Joins - Attempt I
- Window both plays and impressions by epoch duration
- Join the two windows together
- Slide by epoch duration
(Diagram: plays and impressions on a time axis t)

Streaming Joins - Attempt I
- Easy to implement
- Tight coupling with processing time
- Does not mesh well with absolute time windows
- Failure can mean loss of all data for the entire window
(Diagram: a window from 00:15 to 00:45 overlaps Epoch 1, 00:00-00:30, and Epoch 2, 00:30-01:00)
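The mismatch between processing-time windows and absolute epochs can be illustrated with a little arithmetic. This is a rough sketch, assuming the slide's times are hours and minutes and that epochs are 30 minutes long; the `Epochs` object and its helpers are hypothetical names.

```scala
object Epochs {
  // Assumed epoch length in minutes, matching the 00:00-00:30 epochs on the slide.
  val EpochMinutes = 30

  // 1-based epoch index for a minute-of-day timestamp (00:00-00:30 is epoch 1).
  def epochOf(minuteOfDay: Int): Int = minuteOfDay / EpochMinutes + 1

  // Epochs touched by a processing-time window [startMinute, endMinute).
  def epochsCovered(startMinute: Int, endMinute: Int): Range =
    epochOf(startMinute) to epochOf(endMinute - 1)
}
```

Under these assumptions `Epochs.epochsCovered(15, 45)` spans epochs 1 and 2, so a window that starts whenever the job happens to start does not line up with a single absolute epoch.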

Streaming Joins - Attempt II
- Join using mapWithState
- Join key is the mapWithState key
- State is the plays and impressions sharing the same join key
- Use timeouts to expire unjoined data

Streaming Joins - Attempt II (animation: plays and impressions flow into a MapWithStateRDD keyed by request id)
- (R1, I1) arrives; state: R1 => { I1 }
- (R2, I8) arrives; state: R1 => { I1 }, R2 => { I8 }
- (R1, P1) arrives; state: R1 => { I1, P1 }, R2 => { I8 }; output: (R1, I1), (R1, P1)
- (R3, I5) arrives; state: R1 => { I1, P1 }, R2 => { I8 }, R3 => { I5 }
- (R1, I6) arrives; R2 no longer appears in the state; state: R1 => { I1, P1, I6 }, R3 => { I5 }; output: (R1, I6)
- ...
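The state transitions in this walkthrough can be modelled without Spark. The sketch below is a plain-collections model under assumptions, not the talk's implementation: `KeyedJoin`, `step`, and `expire` are hypothetical names, and `expire` only stands in for the timeout behaviour that mapWithState would provide.

```scala
object KeyedJoin {
  sealed trait Event
  final case class Impression(id: String) extends Event
  final case class Play(id: String) extends Event

  type RequestId = String
  type State = Map[RequestId, Set[Event]]

  // A key is joined once it holds at least one play and one impression.
  private def joined(s: Set[Event]): Boolean =
    s.exists(_.isInstanceOf[Play]) && s.exists(_.isInstanceOf[Impression])

  // Apply one (requestId, event) record to the state.
  // Emits nothing until the key is joined, then everything not emitted before.
  def step(state: State, rid: RequestId, ev: Event): (State, List[(RequestId, Event)]) = {
    val before = state.getOrElse(rid, Set.empty[Event])
    val after = before + ev
    val emitted: List[(RequestId, Event)] =
      if (!joined(after)) Nil
      else if (joined(before)) { if (before(ev)) Nil else List(rid -> ev) }
      else after.toList.map(e => rid -> e)
    (state.updated(rid, after), emitted)
  }

  // Timeout stand-in: drop the keys the caller considers expired.
  def expire(state: State, expired: Set[RequestId]): State = state -- expired
}
```

Feeding the model the records from the walkthrough reproduces the same outputs: nothing for R1 until P1 arrives, then both R1 records, and later only the new impression I6. Every unjoined key stays in the map until it is expired, which is the memory cost the next slide refers to.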

Streaming Joins - Attempt II
- Make progress every batch
- Too much “uninteresting” data
- High memory usage
- Large checkpoints

Streaming Joins - An Observation
- Join incoming batch of plays to windowed impressions, and vice versa
- Slide by batch interval...
- Slide by batch interval again...
(Diagram: plays and impressions on a time axis t, with the window advancing one batch at a time)

Streaming Joins - Attempt III
- Counts are updated every batch
- Uses Spark’s windowing
- No checkpoints
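The batch-to-window join behind Attempt III can be sketched on plain collections. This is a simplified model under assumptions, not Spark's windowing API: `WindowedJoin`, `Batch`, `joinedNow`, and `run` are hypothetical names, and each batch is reduced to the request ids it saw on each side.

```scala
object WindowedJoin {
  type RequestId = String

  // One micro-batch: request ids seen for plays and for impressions.
  final case class Batch(plays: Set[RequestId], impressions: Set[RequestId])

  // Request ids joined when `current` arrives, given the earlier batches
  // still inside the window. New plays are matched against windowed
  // impressions (including the current batch), and new impressions
  // against windowed plays.
  def joinedNow(window: Seq[Batch], current: Batch): Set[RequestId] = {
    val windowPlays = window.flatMap(_.plays).toSet ++ current.plays
    val windowImpressions = window.flatMap(_.impressions).toSet ++ current.impressions
    (current.plays intersect windowImpressions) ++ (current.impressions intersect windowPlays)
  }

  // Slide over the batches, keeping the last `windowSize` batches (current included).
  def run(batches: Seq[Batch], windowSize: Int): Seq[Set[RequestId]] =
    batches.indices.map { i =>
      val window = batches.slice(math.max(0, i - windowSize + 1), i)
      joinedNow(window, batches(i))
    }
}
```

In this model a play is joined only if its impression is still inside the window when the play arrives, so the window length bounds how late one side may be relative to the other.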

mapWithState

mapWithState
- Can be used for more than sessionization
- Be aware of cache evictions
- Lots of state may need to be recomputed

mapWithState
    val input: DStream[(VideoId, RequestId)] = // ...
    val spec: StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...
    // One mapped record is emitted per input record.
    val output: DStream[(VideoId, Set[RequestId])] = {
      input.
        mapWithState(spec)
    }

mapWithState
    val input: DStream[(VideoId, RequestId)] = // ...
    val spec: StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...
    // Keep only the largest set emitted for each video in the batch.
    val output: DStream[(VideoId, Set[RequestId])] = {
      input.
        mapWithState(spec).
        groupByKey.
        mapValues(_.maxBy(_.size))
    }

mapWithState
    val input: DStream[(VideoId, RequestId)] = // ...
    val spec: StateSpec[VideoId, Iterable[RequestId], Set[RequestId], (VideoId, Set[RequestId])] = // ...
    // Group first, so the state function sees all of a video's request ids at once.
    val output: DStream[(VideoId, Set[RequestId])] = {
      input.
        groupByKey.
        mapWithState(spec)
    }

mapWithState
    val input: DStream[(VideoId, RequestId)] = // ...
    val spec: StateSpec[VideoId, RequestId, Set[RequestId], Unit] = // ...
    // Ignore the mapped output and read the (key, state) pairs instead.
    val output: DStream[(VideoId, Set[RequestId])] = {
      input.
        mapWithState(spec).
        stateSnapshots
    }
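The difference between the first two variants can be seen in a small model that does not depend on Spark. It assumes a state function that adds each request id to the key's set and emits the updated set; `PerRecordState`, `perRecord`, and `largestPerKey` are hypothetical names for illustration only.

```scala
object PerRecordState {
  type VideoId = String
  type RequestId = String
  type Snapshot = (VideoId, Set[RequestId])

  // Each input record updates the set for its key and emits the updated set,
  // so a key that appears n times in a batch produces n outputs of growing size.
  def perRecord(
      state: Map[VideoId, Set[RequestId]],
      batch: Seq[(VideoId, RequestId)]
  ): (Map[VideoId, Set[RequestId]], Vector[Snapshot]) =
    batch.foldLeft((state, Vector.empty[Snapshot])) {
      case ((st, out), (video, rid)) =>
        val updated = st.getOrElse(video, Set.empty[RequestId]) + rid
        (st.updated(video, updated), out :+ (video -> updated))
    }

  // Keep only the largest emitted set per key, as groupByKey plus maxBy(_.size) does.
  def largestPerKey(out: Seq[Snapshot]): Map[VideoId, Set[RequestId]] =
    out.groupBy(_._1).map { case (video, sets) => video -> sets.map(_._2).maxBy(_.size) }
}
```

In this model the largest set per key equals the final state for every key touched in the batch, which is also what reading the state directly would give for those keys.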

Productionizing Spark Streaming

Metrics
- Monitoring system health
- Aid in diagnosis of issues
- Needs to be performant and accurate

Metrics - Option I
- Use “traditional” stream processing metrics
- Events/second, bytes/second, …
- Batching can make numbers hard to interpret
- Susceptible to recomputation

Metrics - Option II
- Spark Accumulators
- Used internally by Spark
- Susceptible to recomputation
- Unclear when to report the metric
- Can make use of SparkListener & StreamingListener

Metrics - Option III
- Explicit counts on RDDs
- Counts will be accurate
- Additional latency
- Use caching to prevent duplicate work*

Metrics
- Processing time < Batch interval
- Time the different parts of the job
- Spark is lazy - may require forcing evaluation
- Use Spark UI metrics
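The point about laziness can be shown by analogy with a lazy Scala `Iterator`; this is not Spark code. In the sketch below, `LazyTiming`, `timed`, and `evaluations` are hypothetical names. Building the pipeline does no work, so timing that step alone measures almost nothing until something forces evaluation.

```scala
object LazyTiming {
  // Run `body` and return its result with the elapsed wall-clock nanoseconds.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, System.nanoTime() - start)
  }

  // Count how many elements the mapping function processed when the lazy
  // pipeline is only built (force = false) versus consumed (force = true).
  def evaluations(force: Boolean): Int = {
    var calls = 0
    val pipeline = Iterator.range(0, 1000).map { x => calls += 1; x * 2 }
    if (force) pipeline.foreach(_ => ())
    calls
  }
}
```

Spark transformations are likewise deferred until an action runs, so a timer around a transformation should also wrap an action (a count, for example) to measure the work itself.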

Error Handling
- What exceptions cause the streaming job to crash?
- Most seem to be caught to keep the job running
- Exception handling is application-specific
- Stop-gap: track the elapsed time since the batch started
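The stop-gap of tracking elapsed time since the batch started might look roughly like the sketch below. It is an assumption about how such a check could be written, not the talk's code: `BatchWatchdog`, `Watchdog`, and the `maxBatchMillis` limit are hypothetical, and timestamps are passed in so the logic does not depend on a real clock.

```scala
object BatchWatchdog {
  // Remembers when the current batch started and reports whether it has
  // been running longer than `maxBatchMillis`.
  final class Watchdog(maxBatchMillis: Long) {
    private var startedAt: Option[Long] = None

    def batchStarted(now: Long): Unit = { startedAt = Some(now) }

    def batchCompleted(): Unit = { startedAt = None }

    // True only while a batch is in flight and over the limit.
    def isStuck(now: Long): Boolean =
      startedAt.exists(start => now - start > maxBatchMillis)
  }
}
```

One way to wire this up, again as an assumption, would be to call `batchStarted` and `batchCompleted` from a StreamingListener's `onBatchStarted` and `onBatchCompleted` callbacks and poll `isStuck` from a separate thread that alerts or restarts the job.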

Future Work
- Red/Black deployment with zero data-loss
- Auto-scaling
- Improved back pressure per topic
- Updating broadcast variables

Questions? We’re hiring! elliot@netflix.com
