Real Time Recommendations using Spark Streaming (Elliot Chow)
SLIDE 1

Real Time Recommendations using Spark Streaming

Elliot Chow

SLIDE 2

Why?

  • React more quickly to changes in interest
  • Time-of-day effects
  • Real-world events
SLIDE 3

SLIDE 4

Feedback Loop

[Diagram: Data Systems, Stream Processing, Recommendation Systems, and the UI connected in a feedback loop]

SLIDE 5

Trends Data

  • What people browse: impressions
  • What people watch: plays
SLIDE 6

Trends Data - Impressions

Appearance of a video in the viewport

SLIDE 7

Trends Data - Plays

Member plays a video

SLIDE 8

Why Spark Streaming?

  • Existing Spark infrastructure
  • Experience with Spark
  • Batch and Streaming
SLIDE 9

Components

SLIDE 10

Design

[Pipeline diagram: Consume Plays and Consume Impressions → Filter → Join → Transform → Aggregate → Cassandra / S3]

SLIDE 11

Design

[Pipeline diagram repeated: Consume Plays and Consume Impressions → Filter → Join → Transform → Aggregate → Cassandra / S3]

SLIDE 12

Join Key

“Request Id” - a unique identifier of the source of a play or impression
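A minimal sketch of the idea in plain Scala (the event types and fields here are illustrative; the slides only name the request id):

```scala
// Illustrative event types keyed by the "Request Id" described above.
case class Play(requestId: String, videoId: Int)
case class Impression(requestId: String, videoId: Int)

// Group both event streams by request id, then pair them up: plays and
// impressions that originate from the same source share a request id.
def joinByRequestId(
    plays: Seq[Play],
    impressions: Seq[Impression]
): Map[String, (Seq[Play], Seq[Impression])] = {
  val p = plays.groupBy(_.requestId)
  val i = impressions.groupBy(_.requestId)
  (p.keySet ++ i.keySet).map { rid =>
    rid -> (p.getOrElse(rid, Seq.empty), i.getOrElse(rid, Seq.empty))
  }.toMap
}
```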

SLIDE 13

Design

[Pipeline diagram repeated: Consume Plays and Consume Impressions → Filter → Join → Transform → Aggregate → Cassandra / S3]

SLIDE 14

Output

Video              Epoch              Plays   Impressions
Stranger Things    1 (00:00-00:30)    4       5
Stranger Things    1 (00:00-00:30)    3       6
House Of Cards     2 (00:30-01:00)    8       10
Marseille          2 (00:30-01:00)    3       3

SLIDE 15
Output

  • Instead of raw counts, output sets of request ids
  • Count = cardinality of the set of request ids
  • Idempotent counting
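Why sets of request ids make counting idempotent, as a plain-Scala sketch:

```scala
// Adding a request id to a set is a no-op if it is already present, so
// replayed or recomputed events cannot inflate the final count.
def addEvent(seen: Set[String], requestId: String): Set[String] =
  seen + requestId

val once     = List("R1", "R2", "R3").foldLeft(Set.empty[String])(addEvent)
val replayed = List("R1", "R2", "R3", "R2", "R1").foldLeft(Set.empty[String])(addEvent)

// Count = cardinality of the set of request ids.
val count = replayed.size
```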

SLIDE 16

Design

[Pipeline diagram repeated: Consume Plays and Consume Impressions → Filter → Join → Transform → Aggregate → Cassandra / S3]

SLIDE 17

Programming with Spark Streaming

SLIDE 18

Streaming Joins

SLIDE 19

Streaming Joins - Time

  • Time to browse and select a video
  • Batched logging from client application
  • Delays in data sources
SLIDE 20
Streaming Joins - Attempt I

  • Window both plays and impressions by epoch duration
  • Join the two windows together
  • Slide by epoch duration

[Timeline diagram: plays and impressions bucketed into epoch-length windows along time t]
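A plain-Scala simulation of Attempt I (illustrative, not the Spark API): bucket each event into its epoch, then join within the bucket.

```scala
// Tumbling windows of one epoch (30 minutes here, an illustrative value):
// events join only if they share a request id AND fall in the same epoch.
case class Event(requestId: String, timeSec: Long)

val epochSec = 1800L
def epochOf(e: Event): Long = e.timeSec / epochSec

def windowJoin(plays: Seq[Event], impressions: Seq[Event]): Seq[(String, Long)] =
  for {
    p <- plays
    i <- impressions
    if p.requestId == i.requestId && epochOf(p) == epochOf(i)
  } yield (p.requestId, epochOf(p))
```

Because the whole epoch is processed as one unit, a failure can drop the entire in-flight window, which is one of the drawbacks listed on the next slide.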

SLIDE 21

Streaming Joins - Attempt I

  • Easy to implement
  • Tight coupling with processing time
  • Does not mesh well with absolute time windows
  • Failure can mean loss of all data for the entire window

[Timeline: Epoch 1 runs 00:00-00:30 and Epoch 2 runs 00:30-01:00; the processing window starts at 00:15 and ends at 00:45, straddling both epochs]

SLIDE 22

Streaming Joins - Attempt II

  • Join using mapWithState
  • Join key is the mapWithState key
  • State is the plays and impressions sharing the same join key
  • Use timeouts to expire unjoined data
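Roughly what this mapWithState join could look like (a sketch under assumptions: the `Event` type and the 30-minute timeout are illustrative, not the talk's actual code):

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

// Illustrative event type; the I* and P* labels on the next slides are events.
case class Event(kind: String, id: String) // e.g. Event("impression", "I1")

// State under each request id = all plays and impressions seen for it so far.
def track(requestId: String,
          incoming: Option[Event],
          state: State[Set[Event]]): Option[(String, Set[Event])] =
  if (state.isTimingOut()) {
    None // unjoined data is dropped when its timeout fires
  } else {
    val joined = state.getOption().getOrElse(Set.empty) ++ incoming
    state.update(joined)
    Some(requestId -> joined)
  }

// Expire entries that see no new events for 30 minutes.
val spec = StateSpec.function(track _).timeout(Minutes(30))
```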
SLIDE 23

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R1, I1)
State (MapWithStateRDD): empty

SLIDE 24

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R1, I1)
State (MapWithStateRDD): R1 => { I1 }

SLIDE 25

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R2, I8)
State (MapWithStateRDD): R1 => { I1 }

SLIDE 26

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R2, I8)
State (MapWithStateRDD): R1 => { I1 }, R2 => { I8 }

SLIDE 27

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R1, P1)
State (MapWithStateRDD): R1 => { I1 }, R2 => { I8 }

SLIDE 28

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R1, P1)
State (MapWithStateRDD): R1 => { I1, P1 }, R2 => { I8 }
Emitted: (R1, I1), (R1, P1)

SLIDE 29

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R3, I5)
State (MapWithStateRDD): R1 => { I1, P1 }, R2 => { I8 }

SLIDE 30

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R3, I5)
State (MapWithStateRDD): R1 => { I1, P1 }, R2 => { I8 }, R3 => { I5 }

SLIDE 31

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R1, I6)
State (MapWithStateRDD): R1 => { I1, P1 }, R3 => { I5 } (R2 has been expired by its timeout)

SLIDE 32

Streaming Joins - Attempt II

Incoming (Plays & Impressions): (R1, I6)
State (MapWithStateRDD): R1 => { I1, P1, I6 }, R3 => { I5 }
Emitted: (R1, I6)

SLIDE 33

Streaming Joins - Attempt II

State (MapWithStateRDD): R1 => { I1, P1, I6 }, ..., R3 => { I5 }
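The walkthrough above can be replayed as a plain-Scala fold over the per-key state (a simulation only; the timeout that expired R2 partway through is not modeled):

```scala
// MapWithStateRDD contents simulated as: request id -> set of event labels.
def step(state: Map[String, Set[String]],
         event: (String, String)): Map[String, Set[String]] = {
  val (requestId, label) = event
  state.updated(requestId, state.getOrElse(requestId, Set.empty) + label)
}

// The sequence of incoming (request id, event) pairs from the slides.
val events = List("R1" -> "I1", "R2" -> "I8", "R1" -> "P1",
                  "R3" -> "I5", "R1" -> "I6")
val finalState = events.foldLeft(Map.empty[String, Set[String]])(step)
```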

SLIDE 34

Streaming Joins - Attempt II

  • Make progress every batch
  • Too much “uninteresting” data
  • High memory usage
  • Large checkpoints
SLIDE 35

Streaming Joins - An Observation

[Timeline diagram: plays and impressions plotted along time t]

SLIDE 36

Streaming Joins - An Observation

[Timeline diagram: plays and impressions plotted along time t]

SLIDE 37

Streaming Joins - An Observation

[Timeline diagram: plays and impressions plotted along time t]

Join incoming batch of plays to windowed impressions, and vice versa

SLIDE 38

Streaming Joins - An Observation

[Timeline diagram: plays and impressions plotted along time t]

Slide by batch interval...

SLIDE 39

Streaming Joins - An Observation

[Timeline diagram: plays and impressions plotted along time t]

Slide by batch interval again...

SLIDE 40

Streaming Joins - Attempt III

  • Counts are updated every batch
  • Uses Spark’s windowing
  • No checkpoints
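One batch of Attempt III as a plain-Scala sketch (illustrative, not the Spark API): join the incoming plays against the windowed impressions, and the incoming impressions against the windowed plays, as the observation above describes.

```scala
// One batch: new events join against the other stream's recent window,
// in both directions, keyed by request id.
def batchJoin(newPlays: Map[String, String],
              windowedImpressions: Map[String, String],
              newImpressions: Map[String, String],
              windowedPlays: Map[String, String]): Set[(String, String, String)] = {
  val a = for ((rid, p) <- newPlays; i <- windowedImpressions.get(rid))
            yield (rid, p, i)
  val b = for ((rid, i) <- newImpressions; p <- windowedPlays.get(rid))
            yield (rid, p, i)
  (a ++ b).toSet
}
```

Sliding both windows by the batch interval is what lets counts update every batch without checkpointed join state.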
SLIDE 41

mapWithState

SLIDE 42

mapWithState

  • Can be used for more than sessionization
SLIDE 43

mapWithState

  • Can be used for more than sessionization
  • Be aware of cache evictions
  • Lots of state may need to be recomputed
SLIDE 44

mapWithState

val input: DStream[(VideoId, RequestId)] = // ...
val spec: StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...
val output: DStream[(VideoId, Set[RequestId])] = {
  input.
    mapWithState(spec)
}

SLIDE 45

mapWithState

val input: DStream[(VideoId, RequestId)] = // ...
val spec: StateSpec[VideoId, RequestId, Set[RequestId], (VideoId, Set[RequestId])] = // ...
val output: DStream[(VideoId, Set[RequestId])] = {
  input.
    mapWithState(spec).
    groupByKey.
    mapValues(_.maxBy(_.size))
}

SLIDE 46

mapWithState

val input: DStream[(VideoId, RequestId)] = // ...
val spec: StateSpec[VideoId, Iterable[RequestId], Set[RequestId], (VideoId, Set[RequestId])] = // ...
val output: DStream[(VideoId, Set[RequestId])] = {
  input.
    groupByKey.
    mapWithState(spec)
}

SLIDE 47

mapWithState

val input: DStream[(VideoId, RequestId)] = // ...
val spec: StateSpec[VideoId, RequestId, Set[RequestId], Unit] = // ...
val output: DStream[(VideoId, Set[RequestId])] = {
  input.
    mapWithState(spec).
    stateSnapshots
}
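The elided spec could hypothetically be filled in like this for the snapshot variant (an assumption-laden sketch, not the talk's code: the mapped type is Unit because results are read from stateSnapshots, and the timeout value is invented):

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}

type VideoId = Int
type RequestId = String

// Fold each request id into the per-video set; emit nothing per record,
// since stateSnapshots exposes the full (VideoId, Set[RequestId]) state.
def update(videoId: VideoId,
           requestId: Option[RequestId],
           state: State[Set[RequestId]]): Unit =
  if (!state.isTimingOut()) {
    state.update(state.getOption().getOrElse(Set.empty) ++ requestId)
  }

val spec: StateSpec[VideoId, RequestId, Set[RequestId], Unit] =
  StateSpec.function(update _).timeout(Minutes(30))
```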

SLIDE 48

Productionizing Spark Streaming

SLIDE 49

Metrics

  • Monitoring system health
  • Aid in diagnosis of issues
  • Needs to be performant and accurate
SLIDE 50

Metrics - Option I

  • Use “traditional” stream processing metrics
  • Events/second, bytes/second, …
  • Batching can make numbers hard to interpret
  • Susceptible to recomputation
SLIDE 51

Metrics - Option II

  • Spark Accumulators
  • Used internally by Spark
  • Susceptible to recomputation
  • Unclear when to report the metric
  • Can make use of SparkListener & StreamingListener
SLIDE 52

Metrics - Option III

  • Explicit counts on RDDs
  • Counts will be accurate
  • Additional latency
  • Use caching to prevent duplicate work*
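Option III in sketch form (hedged: `report` and `process` stand in for a real metrics client and the job's main output path):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Count each batch explicitly and report it. Caching keeps the count from
// redoing work the main output path also needs, at a memory cost (the
// slide's "*" caveat).
def withBatchCounts[T](stream: DStream[T],
                       report: Long => Unit)(process: RDD[T] => Unit): Unit =
  stream.foreachRDD { rdd =>
    rdd.cache()
    report(rdd.count()) // accurate, but adds latency to each batch
    process(rdd)        // main processing reuses the cached RDD
    rdd.unpersist()
  }
```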
SLIDE 53

Metrics

  • Processing time < Batch interval
  • Time the different parts of the job
  • Spark is lazy - may require forcing evaluation
  • Use Spark UI metrics
SLIDE 54

Error Handling

  • What exceptions cause the streaming job to crash?
SLIDE 55

Error Handling

  • What exceptions cause the streaming job to crash?
  • Most seem to be caught to keep the job running
  • Exception handling is application-specific
  • Stop-gap: track the elapsed time since the batch started
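The stop-gap can be sketched with a StreamingListener that timestamps batch starts; an external watchdog (not shown) alerts or restarts when the elapsed time exceeds a threshold. The class and threshold policy are assumptions, not the talk's code.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.scheduler.{
  StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

// Records when the in-flight batch started; 0 means no batch is running.
class BatchWatchdog extends StreamingListener {
  private val startedAt = new AtomicLong(0L)

  override def onBatchStarted(e: StreamingListenerBatchStarted): Unit =
    startedAt.set(System.currentTimeMillis())

  override def onBatchCompleted(e: StreamingListenerBatchCompleted): Unit =
    startedAt.set(0L)

  // A stuck (but not crashed) job shows up as ever-growing elapsed time.
  def elapsedMillis: Long = {
    val t = startedAt.get()
    if (t == 0L) 0L else System.currentTimeMillis() - t
  }
}
// Register with: ssc.addStreamingListener(new BatchWatchdog)
```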
SLIDE 56

Future Work

SLIDE 57

Future Work

  • Red/Black deployment with zero data-loss
SLIDE 58

Future Work

  • Red/Black deployment with zero data-loss
  • Auto-scaling
SLIDE 59

Future Work

  • Red/Black deployment with zero data-loss
  • Auto-scaling
  • Improved back pressure per topic
SLIDE 60

Future Work

  • Red/Black deployment with zero data-loss
  • Auto-scaling
  • Improved back pressure per topic
  • Updating broadcast variables
SLIDE 61

Questions?

We’re hiring! elliot@netflix.com