The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing




SLIDE 1

Christopher Little

The Dataflow Model

A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Tyler Akidau et al.

SLIDE 2

Outline

  • Prerequisites
  • Problem
  • System
  • Evaluation

SLIDE 3

Prerequisites

SLIDE 4

Event vs Processing Time

SLIDE 5

Low Watermark

SLIDE 6

Fixed Windowing
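
As a minimal sketch of what fixed windowing means, the function below maps an event timestamp to the single fixed-size window that contains it. The function name, the choice of the Unix epoch as the alignment origin, and the half-open `[start, end)` convention are assumptions made for illustration; they are not taken from the slides.

```python
from datetime import datetime, timedelta
from typing import Tuple


def assign_fixed_window(event_time: datetime, size: timedelta) -> Tuple[datetime, datetime]:
    """Return the half-open [start, end) fixed window that contains event_time.

    Windows are aligned to the Unix epoch (an assumption for this sketch),
    so every key sees the same window boundaries.
    """
    epoch = datetime(1970, 1, 1)
    # Distance from the start of the enclosing window to the event.
    offset = (event_time - epoch) % size
    start = event_time - offset
    return start, start + size


window = assign_fixed_window(datetime(2015, 8, 31, 13, 20), timedelta(hours=1))
print(window)
```

Because the window depends only on the event time, records that arrive late in processing time still land in the window where they occurred.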

SLIDE 7

Unaligned Windowing (Tuples)

SLIDE 8

Unaligned Windowing (Sessions)

SLIDE 9

Problem

SLIDE 10

Tracking Video Sessions

  • Online/offline video platform
  • Want aggregate stats per user: track sessions
  • Bill advertisers per view: must be correct
  • Want to adjust bids fast: low latency
  • Must scale: distributed system

SLIDE 11

“A major shortcoming of all the models and systems mentioned above, is that they focus on input data as something which will at some point become complete.”

SLIDE 12

System

SLIDE 13

  • What results are being computed.
  • Where in event time they are being computed.
  • When in processing time they are materialized.
  • How earlier results relate to later refinements.

SLIDE 14

  • What results are being computed. ✔
  • Where in event time they are being computed.
  • When in processing time they are materialized.
  • How earlier results relate to later refinements.

SLIDE 15

Two Primitive Transforms

(fix, 1) (fit, 2)

ParDo( ExpandPrefixes )

(f, 1) (fi, 1) (fix, 1) (f, 2) (fi, 2) (fit, 2)

GroupByKey

(f, [1, 2]) (fi, [1, 2]) (fix, [1]) (fit, [2])
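
The two primitives on this slide can be imitated in plain Python. This is only a sketch: `par_do`, `group_by_key`, and `expand_prefixes` are hypothetical helper names written for this example, not an SDK API. They operate on in-memory lists, whereas the real primitives run over distributed, possibly unbounded collections.

```python
from collections import defaultdict


def expand_prefixes(element):
    """Emit one (prefix, value) pair for every non-empty prefix of the key."""
    key, value = element
    for i in range(1, len(key) + 1):
        yield key[:i], value


def par_do(fn, elements):
    """Element-wise transform: each input may produce zero or more outputs."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs):
    """Collect all values that share a key, preserving arrival order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


inputs = [("fix", 1), ("fit", 2)]
expanded = par_do(expand_prefixes, inputs)
grouped = group_by_key(expanded)
print(expanded)
print(grouped)
```

`par_do` needs to see only one element at a time, so it works unchanged on unbounded input. `group_by_key` has to see every value for a key before emitting, which is why the model pairs it with windowing.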

SLIDE 16

Session Windowing Example

(k1, (v1, 13:02)) (k2, (v2, 13:14)) (k1, (v3, 13:57)) (k1, (v4, 13:20))

ParDo( AssignWindows )

(k1, (v1, [13:02, 13:32])) (k2, (v2, [13:14, 13:44])) (k1, (v3, [13:57, 14:27])) (k1, (v4, [13:20, 13:50]))

GroupByKey

(k1, ([(v1, [13:02, 13:32]), (v3, [13:57, 14:27]), (v4, [13:20, 13:50])])) (k2, ([(v2, [13:14, 13:44])]))

ParDo( MergeWindows )

(k1, ([v1, v4], [13:02, 13:50])) (k1, ([v3], [13:57, 14:27])) (k2, ([v2], [13:14, 13:44]))
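
The session example can be reproduced with a short sketch. Several things here are assumptions: the 30-minute gap is inferred from the window bounds on the slide, times are stored as minutes since midnight, windows are treated as half-open so that only strictly overlapping windows merge, and the helper names (`hm`, `assign_windows`, `merge_windows`) are hypothetical.

```python
from collections import defaultdict

GAP = 30  # session gap in minutes (inferred from the slide's window bounds)


def hm(text):
    """Convert 'HH:MM' to minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)


def assign_windows(element):
    """Give each element a proto-session window [ts, ts + GAP)."""
    key, (value, ts) = element
    return key, (value, (ts, ts + GAP))


def merge_windows(values_with_windows):
    """Merge overlapping windows for one key into sessions."""
    merged = []
    for value, (start, end) in sorted(values_with_windows, key=lambda vw: vw[1]):
        if merged and start < merged[-1][1][1]:
            # Overlaps the current session: extend it and add the value.
            values, (s, e) = merged[-1]
            merged[-1] = (values + [value], (s, max(e, end)))
        else:
            merged.append(([value], (start, end)))
    return merged


events = [
    ("k1", ("v1", hm("13:02"))),
    ("k2", ("v2", hm("13:14"))),
    ("k1", ("v3", hm("13:57"))),
    ("k1", ("v4", hm("13:20"))),
]

grouped = defaultdict(list)
for key, value_and_window in map(assign_windows, events):
    grouped[key].append(value_and_window)  # GroupByKey step

sessions = {key: merge_windows(vws) for key, vws in grouped.items()}
print(sessions)
```

Merging happens only after grouping because session boundaries depend on all the elements seen for a key, not on any single element.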

SLIDE 17

  • What results are being computed. ✔
  • Where in event time they are being computed. ✔
  • When in processing time they are materialized.
  • How earlier results relate to later refinements.

SLIDE 18

Triggering

SLIDE 19

Triggering (end of time)

SLIDE 20

Triggering (periodically)

SLIDE 21

Triggering (on input, tuples)

SLIDE 22

Triggering (on watermark+input)

SLIDE 23

  • What results are being computed. ✔
  • Where in event time they are being computed. ✔
  • When in processing time they are materialized. ✔
  • How earlier results relate to later refinements.

SLIDE 24

Accumulating

SLIDE 25

Discarding

SLIDE 26

Accumulating + Retracting
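
The three refinement modes named on the last three slides differ in what each trigger firing emits for a window. The sketch below illustrates this for a running sum. The input values, the `emit_panes` function, and the use of a negated previous total to stand for a retraction are all assumptions made for this example.

```python
def emit_panes(panes, mode):
    """Return what each trigger firing emits for a summed window.

    panes: list of lists, the new inputs that arrived before each firing.
    mode:  'discarding', 'accumulating', or 'accumulating_retracting'.
    """
    outputs = []
    total = 0
    previous = None  # total emitted by the previous firing, if any
    for pane in panes:
        delta = sum(pane)
        total += delta
        if mode == "discarding":
            # Only the inputs since the last firing; state is thrown away.
            outputs.append([delta])
        elif mode == "accumulating":
            # The full result so far; later panes overwrite earlier ones.
            outputs.append([total])
        elif mode == "accumulating_retracting":
            # Retract the previous result (here: its negation), then emit the new one.
            retraction = [] if previous is None else [-previous]
            outputs.append(retraction + [total])
        else:
            raise ValueError(f"unknown mode: {mode}")
        previous = total
    return outputs


panes = [[5, 2], [3]]  # hypothetical inputs for two firings of one window
for mode in ("discarding", "accumulating", "accumulating_retracting"):
    print(mode, emit_panes(panes, mode))
```

A consumer that sums everything it receives ends up with the correct total of 10 under discarding and under accumulating with retractions. Under plain accumulating it would double count, so that mode suits consumers that overwrite the previous value.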

SLIDE 27

  • What results are being computed. ✔
  • Where in event time they are being computed. ✔
  • When in processing time they are materialized. ✔
  • How earlier results relate to later refinements. ✔

SLIDE 28

Evaluation

SLIDE 29

Evaluation

  • Name
  • Concepts
  • Necessity
  • Clarity

SLIDE 30

Christopher Little

The Dataflow Model

A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

Tyler Akidau et al.