The Dataflow Model



SLIDE 1

The Dataflow Model

SLIDE 2

Problem

  • How can we process unbounded data?
  • Example: track user activity on a website
SLIDE 3

Key ideas

  • Windowing
      • Fixed windows
      • Sliding windows
      • Sessions
  • Time domains
      • Event time
      • Processing time
  • Triggers
SLIDE 4

Contribution

  • Dataflow API
      • Easily build pipelines with your choice of windowing, time domain, and trigger
  • Independent of execution engine
      • Choose batch, micro-batch, or streaming depending on tradeoffs

SLIDE 5

Windowing

SLIDE 6

Types of windows

  • Fixed windows
  • Sliding windows
  • Sessions
SLIDE 7

Fixed windows
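
As a rough illustration of fixed windows, the sketch below maps an event timestamp to the start of the fixed-size window that contains it. The class and method names are hypothetical, and aligning windows to the Unix epoch is an assumption made for this example, not something the slides specify.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: assign an event timestamp to its fixed (tumbling) window. */
final class FixedWindowSketch {
    /** Returns the start of the epoch-aligned window of the given size that contains ts. */
    static Instant windowStart(Instant ts, Duration size) {
        long sizeMs = size.toMillis();
        long tsMs = ts.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch.
        return Instant.ofEpochMilli(tsMs - Math.floorMod(tsMs, sizeMs));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2015-01-01T12:03:30Z");
        Instant start = windowStart(ts, Duration.ofMinutes(2));
        // With two-minute windows, 12:03:30 falls in the window starting at 12:02:00.
        System.out.println(start + " to " + start.plus(Duration.ofMinutes(2)));
    }
}
```

Every timestamp belongs to exactly one fixed window, which is what distinguishes this case from sliding windows.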

SLIDE 8

Sliding windows

Example: compute running average over past 5 minutes of data
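
The running-average example can be sketched in plain Java as follows. This is a minimal illustration, not the Dataflow API: the class name is hypothetical, and it assumes timestamps arrive in non-decreasing order, which real unbounded data does not guarantee.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: running average over a trailing time window. */
final class SlidingAverageSketch {
    private final long windowMs;
    // Each entry is {timestampMs, value}; oldest entries sit at the front.
    private final Deque<long[]> buffer = new ArrayDeque<>();
    private long sum = 0;

    SlidingAverageSketch(long windowMs) {
        this.windowMs = windowMs;
    }

    /** Adds a value and returns the average over the interval (tsMs - windowMs, tsMs]. */
    double add(long tsMs, long value) {
        buffer.addLast(new long[] {tsMs, value});
        sum += value;
        // Evict entries that have fallen out of the trailing window.
        // The entry just added is never evicted, so the buffer stays non-empty.
        while (buffer.peekFirst()[0] <= tsMs - windowMs) {
            sum -= buffer.removeFirst()[1];
        }
        return (double) sum / buffer.size();
    }

    public static void main(String[] args) {
        SlidingAverageSketch avg = new SlidingAverageSketch(5 * 60_000L);
        System.out.println(avg.add(0L, 10));       // only one value in the window
        System.out.println(avg.add(60_000L, 20));  // two values in the window
        System.out.println(avg.add(300_000L, 30)); // the value at t=0 has expired
    }
}
```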
SLIDE 9

Session windows

Example: YouTube viewing sessions
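
A session groups events that are separated by less than some inactivity gap. The sketch below shows one way to compute sessions over a finite batch of timestamps; the class name and the representation of a session as a {first event, last event} pair are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: group event timestamps into sessions separated by an inactivity gap. */
final class SessionSketch {
    /** Returns sessions as {firstEventMs, lastEventMs} pairs, ordered by time. */
    static List<long[]> sessions(long[] timestampsMs, long gapMs) {
        long[] sorted = timestampsMs.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts - last[1] < gapMs) {
                last[1] = ts; // within the gap: extend the current session
            } else {
                result.add(new long[] {ts, ts}); // gap exceeded: start a new session
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long[] views = {0L, 30_000L, 200_000L, 40_000L};
        for (long[] s : sessions(views, 60_000L)) {
            System.out.println(s[0] + " .. " + s[1]);
        }
    }
}
```

Unlike fixed and sliding windows, session boundaries depend on the data itself, so a late event can merge two sessions that were previously separate.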

SLIDE 10

Time domains

  • For many applications, windows should be based on “event time” (when the events actually occur)
      • Example: billing YouTube advertisers
  • Lag, partitions, etc., might cause an event to be processed later than its event time
      • Processing time: when the system observes the event
SLIDE 11

Challenge: time skew
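
Skew can be made concrete by recording both time domains on each event. The `Event` type below is hypothetical and exists only for this sketch.

```java
import java.time.Duration;
import java.time.Instant;

/** Sketch: measure the skew between event time and processing time. */
final class TimeSkewSketch {
    /** Hypothetical event record that carries both time domains. */
    static final class Event {
        final Instant eventTime;      // when the event actually occurred
        final Instant processingTime; // when the pipeline observed it

        Event(Instant eventTime, Instant processingTime) {
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }

        /** Skew: how far processing time lags behind event time. */
        Duration skew() {
            return Duration.between(eventTime, processingTime);
        }
    }

    public static void main(String[] args) {
        Event e = new Event(
                Instant.parse("2015-01-01T12:01:00Z"),
                Instant.parse("2015-01-01T12:06:00Z"));
        System.out.println(e.skew()); // prints PT5M
    }
}
```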

SLIDE 12

Goal: Event-time windows

[Figure panels: Fixed windows; Session windows]

SLIDE 13

Challenge: completion

  • With event times, how does the system know if it has received all of the data in a window?
  • Example: phones might watch YouTube videos (and ads) offline

SLIDE 14

Watermarks

  • Heuristics that tell the system when it is likely to have received most of the data in a window
      • Based on global progress metrics
  • Watermarks are insufficient:
      • Late data might arrive behind the watermark
      • Watermark might be too slow due to one late datum and increase latency for the whole system
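
To make the late-data problem concrete, here is a minimal sketch of one simple watermark heuristic: the largest event time seen so far, minus a fixed allowance for out-of-order arrival. This bounded-delay heuristic is an assumption chosen for illustration; the slides only say that watermarks are based on global progress metrics. All names are hypothetical.

```java
/** Sketch: a bounded-delay watermark heuristic that flags late events. */
final class WatermarkSketch {
    private final long maxDelayMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    WatermarkSketch(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
    }

    /** Current estimate: data with event time before this value is assumed complete. */
    long watermark() {
        if (maxEventTimeMs == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing observed yet
        }
        return maxEventTimeMs - maxDelayMs;
    }

    /** Observes an event and returns true if it arrived behind the watermark (late). */
    boolean observe(long eventTimeMs) {
        boolean late = eventTimeMs < watermark();
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(60_000L);
        System.out.println(wm.observe(100_000L)); // on time
        System.out.println(wm.observe(300_000L)); // on time, advances the watermark
        System.out.println(wm.observe(120_000L)); // behind the watermark: late
    }
}
```

A larger allowance produces fewer late events but delays every window's output, which is the latency tradeoff the slide describes.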

SLIDE 15

Incremental processing

  • Difficult to get the single best result from a window
  • Instead, let windows produce multiple results (improving incrementally over time)

SLIDE 16

Triggers

  • Triggers specify when to output window results
      • at watermark
      • at percentile watermark
      • every minute, etc.
  • Triggers specify how to output results
      • discard previous window
      • accumulate
      • accumulate and retract
  • Triggers are composable
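
The three output modes can be compared with a small plain-Java sketch. It does not use the Dataflow API, and the class and method names are hypothetical. Each input element is the sum of the data that arrived between two consecutive trigger firings; the values 12, 10, 11, 18 are the per-period sums from the discarding example in Figure 8, and the accumulating mode reproduces the 12, 22, 33, 51 of Figure 7.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: what each trigger firing emits under the three output modes. */
final class AccumulationModeSketch {
    /** Discarding: each firing emits only the data received since the last firing. */
    static List<Integer> discarding(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        for (int pane : paneSums) {
            out.add(pane);
        }
        return out;
    }

    /** Accumulating: each firing emits the total of everything received so far. */
    static List<Integer> accumulating(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (int pane : paneSums) {
            total += pane;
            out.add(total);
        }
        return out;
    }

    /** Accumulating and retracting: retract the previous total, then emit the new one. */
    static List<Integer> accumulatingAndRetracting(int[] paneSums) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (int pane : paneSums) {
            if (!first) {
                out.add(-total); // retraction of the previously emitted value
            }
            total += pane;
            out.add(total);
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] panes = {12, 10, 11, 18};
        System.out.println(discarding(panes));
        System.out.println(accumulating(panes));
        System.out.println(accumulatingAndRetracting(panes));
    }
}
```

Discarding suits consumers that sum the outputs themselves, accumulating suits consumers that overwrite the previous value, and retractions let a downstream aggregation undo a value it has already counted.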
SLIDE 17

Examples

SLIDE 18

[Figure 5: Example Inputs. Ten input values (5, 7, 8, 3, 4, 3, 3, 8, 1, 9) plotted by event time (ticks 12:01 to 12:08) against processing time (ticks 12:06 to 12:09), with the ideal and actual watermarks.]

SLIDE 19

PCollection<KV<String, Integer>> output = input
  .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
               .accumulating())
  .apply(Sum.integersPerKey());

[Figure 7: GlobalWindows, AtPeriod, Accumulating. Event time vs. processing time for the example inputs; the periodic firings emit the running sums 12, 22, 33, 51.]

SLIDE 20

PCollection<KV<String, Integer>> output = input
  .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
               .discarding())
  .apply(Sum.integersPerKey());

[Figure 8: GlobalWindows, AtPeriod, Discarding. Event time vs. processing time for the example inputs; the periodic firings emit the per-period sums 12, 10, 11, 18.]

SLIDE 21

PCollection<KV<String, Integer>> output = input
  .apply(Window.into(FixedWindows.of(2, MINUTES))
               .trigger(Repeat(AtWatermark()))
               .accumulating())
  .apply(Sum.integersPerKey());

Let’s run this pipeline under the three execution engines: batch, micro-batch, streaming

SLIDE 22

[Figure 10: FixedWindows, Batch. Event time vs. processing time for the example inputs, with the ideal and actual watermarks; the four two-minute windows output 14, 22, 3, 12.]

SLIDE 23

[Figure 11: FixedWindows, Micro-Batch. Event time vs. processing time for the example inputs, with the ideal and actual watermarks.]

SLIDE 24

[Figure 12: FixedWindows, Streaming. Event time vs. processing time for the example inputs, with the ideal and actual watermarks.]

SLIDE 25

[Figure 13: FixedWindows, Streaming, Partial. Event time vs. processing time for the example inputs, with the ideal and actual watermarks.]

SLIDE 26

PCollection<KV<String, Integer>> output = input
  .apply(Window.into(Sessions.withGapDuration(1, MINUTE))
               .trigger(SequenceOf(
                   RepeatUntil(
                       AtPeriod(1, MINUTE),
                       AtWatermark()),
                   Repeat(AtWatermark())))
               .accumulatingAndRetracting())
  .apply(Sum.integersPerKey());

[Figure 14: Sessions, Retracting. Event time vs. processing time for the example inputs, with the ideal and actual watermarks; session sums are emitted and then retracted as sessions merge.]