
The Dataflow Model



  1. The Dataflow Model

  2. Problem
     • How can we process unbounded data?
     • Example: track user activity on a website

  3. Key ideas
     • Windowing
       • Fixed windows
       • Sliding windows
       • Sessions
     • Time domains
       • Event time
       • Processing time
     • Triggers

  4. Contribution
     • Dataflow API
       • Easily build pipelines with your choice of windowing, time domain, and trigger
     • Independent of execution engine
       • Choose batch, micro-batch, or streaming depending on tradeoffs

  5. Windowing

  6. Types of windows
     • Fixed windows
     • Sliding windows
     • Sessions

  7. Fixed windows
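A fixed window can be pictured as a bucket of constant width on the time axis. The sketch below is only an illustration of that idea, not code from the deck or from the Dataflow API: `FixedWindowSketch` and `windowStart` are hypothetical names, and windows are assumed to be aligned to the Unix epoch.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not part of the Dataflow API): maps an event time to its fixed window. */
final class FixedWindowSketch {
    /** Returns the inclusive start of the epoch-aligned window of the given size that contains eventTime. */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    public static void main(String[] args) {
        Instant eventTime = Instant.parse("2015-08-01T12:03:30Z");
        // With 2-minute windows, 12:03:30 falls in [12:02:00, 12:04:00).
        System.out.println(windowStart(eventTime, Duration.ofMinutes(2)));
    }
}
```

Every element lands in exactly one window, which is what distinguishes fixed windows from sliding ones.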

  8. Sliding windows
     • Example: compute a running average over the past 5 minutes of data
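With sliding windows, one element can belong to several overlapping windows. As a rough sketch of the 5-minute example, assume a new window starts every minute (the period is an assumption; the slide only gives the window size). `SlidingWindowSketch` and `windowStarts` are hypothetical names, not Dataflow API calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the sliding windows that contain a timestamp. */
final class SlidingWindowSketch {
    /**
     * Returns the starts of all windows [start, start + sizeMs) that contain ts,
     * assuming a new epoch-aligned window begins every periodMs milliseconds.
     */
    static List<Long> windowStarts(long ts, long sizeMs, long periodMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window that starts at or before ts.
        long lastStart = ts - Math.floorMod(ts, periodMs);
        // Walk backwards while the window still covers ts.
        for (long start = lastStart; start > ts - sizeMs; start -= periodMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // An element at 7.5 minutes belongs to the windows starting at minutes 7, 6, 5, 4 and 3.
        System.out.println(windowStarts(7 * minute + 30_000L, 5 * minute, minute));
    }
}
```

A running average would then be maintained per window start, with each element contributing to size/period windows.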

  9. Session windows
     • Example: YouTube viewing sessions
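Session windows have no fixed size: they grow as long as events keep arriving within a gap of each other. The sketch below shows one simple way to group timestamps into sessions, assuming the whole input is available and can be sorted (a streaming system would instead merge windows incrementally). `SessionSketch` and `sessions` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: groups event timestamps into gap-based sessions. */
final class SessionSketch {
    /** Returns sessions as {start, end} pairs, where end = last event time + gap (exclusive). */
    static List<long[]> sessions(long[] eventTimes, long gap) {
        long[] sorted = eventTimes.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                // Event falls inside the open session: extend its end.
                last[1] = Math.max(last[1], t + gap);
            } else {
                // Gap exceeded (or first event): start a new session.
                result.add(new long[] {t, t + gap});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Events arrive out of order; with a gap of 2 they form [1, 5) and [10, 13).
        for (long[] s : sessions(new long[] {1, 2, 10, 3, 11}, 2)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```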

  10. Time domains
     • Event time: when the event actually occurred
       • For many applications, windows should be based on “event time”
       • Example: billing YouTube advertisers
     • Processing time: when the system observes the event
       • Lag, network partitions, etc. might cause an event to be processed later than its event time
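To make the difference between the two time domains concrete, the sketch below sums the same records twice: once bucketed by event time and once by processing time. The records, the 2-minute window size, and the names `TimeDomainSketch` and `sumPerWindow` are all made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: sums values per fixed window in either time domain. */
final class TimeDomainSketch {
    /**
     * Each record is {eventTime, processingTime, value}, times in minutes.
     * timeIndex selects the domain: 0 = event time, 1 = processing time.
     */
    static Map<Long, Long> sumPerWindow(long[][] records, int timeIndex, long windowSize) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] r : records) {
            long start = r[timeIndex] - Math.floorMod(r[timeIndex], windowSize);
            sums.merge(start, r[2], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Illustrative data: the last record occurred at minute 1 but was processed at minute 8.
        long[][] records = {{1, 5, 5}, {2, 5, 7}, {3, 6, 3}, {1, 8, 9}};
        System.out.println("event time:      " + sumPerWindow(records, 0, 2));
        System.out.println("processing time: " + sumPerWindow(records, 1, 2));
    }
}
```

In this made-up data the delayed record is credited to the window in which it occurred only under event-time windowing, which is the property a billing application would need.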

  11. Challenge: time skew

  12. Goal: Event-time windows
     • Fixed windows
     • Session windows

  13. Challenge: completion
     • With event times, how does the system know if it has received all of the data in a window?
     • Example: phones might watch YouTube videos (and ads) offline

  14. Watermarks
     • Heuristics that tell the system when it is likely to have received most of the data in a window
     • Based on global progress metrics
     • Watermarks are insufficient:
       • Late data might arrive behind the watermark
       • A single late datum can hold the watermark back, increasing latency for the whole system
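The deck does not say how a watermark is computed. As a minimal sketch of one simple heuristic (an assumption, not the system's actual method), the tracker below sets the watermark to the largest event time seen so far minus a fixed allowed skew, and flags any element that arrives behind it as late. `WatermarkSketch`, `observe`, and `maxSkewMs` are hypothetical names.

```java
/** Hypothetical heuristic watermark: max event time seen so far minus a fixed allowed skew. */
final class WatermarkSketch {
    private final long maxSkewMs;
    private long maxEventTime = Long.MIN_VALUE;

    WatermarkSketch(long maxSkewMs) {
        this.maxSkewMs = maxSkewMs;
    }

    /** Current estimate of the event time before which input is believed to be complete. */
    long watermark() {
        return maxEventTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxEventTime - maxSkewMs;
    }

    /** Records one element; returns true if it arrived behind the watermark (late data). */
    boolean observe(long eventTime) {
        boolean late = eventTime < watermark();
        maxEventTime = Math.max(maxEventTime, eventTime);
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(60);
        System.out.println(w.observe(100)); // false: nothing seen yet
        System.out.println(w.observe(200)); // false: ahead of the watermark (40)
        System.out.println(w.observe(120)); // true: behind the watermark (140)
    }
}
```

Both weaknesses from the slide show up here: a small skew lets late data slip behind the watermark, while a large skew delays every window.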

  15. Incremental processing
     • Difficult to get the single best result from a window
     • Instead, let windows produce multiple results (improving incrementally over time)

  16. Triggers
     • Triggers specify when to output window results
       • at watermark
       • at percentile watermark
       • every minute, etc.
     • Triggers specify how to output results
       • discard previous window
       • accumulate
       • accumulate and retract
     • Triggers are composable
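The three output modes are easiest to compare on a concrete sum. The sketch below simulates a trigger firing four times over a single window; the grouping of the input values into firings ({5, 7}, {3, 4, 3}, {8, 3}, {8, 9, 1}) is an assumption chosen for illustration, and `AccumulationModes` and `emit` are hypothetical names rather than Dataflow API calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of the three ways a trigger can refine a window's sum. */
final class AccumulationModes {
    enum Mode { DISCARDING, ACCUMULATING, ACCUMULATING_AND_RETRACTING }

    /** Each inner array holds the values that arrived between two trigger firings. */
    static List<Integer> emit(int[][] panes, Mode mode) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (int[] pane : panes) {
            int paneSum = 0;
            for (int v : pane) {
                paneSum += v;
            }
            int previous = running;
            running += paneSum;
            switch (mode) {
                case DISCARDING:
                    out.add(paneSum);       // only what arrived since the last firing
                    break;
                case ACCUMULATING:
                    out.add(running);       // everything seen so far
                    break;
                case ACCUMULATING_AND_RETRACTING:
                    if (!first) {
                        out.add(-previous); // retract the previously emitted total
                    }
                    out.add(running);
                    break;
            }
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] panes = {{5, 7}, {3, 4, 3}, {8, 3}, {8, 9, 1}};
        System.out.println(emit(panes, Mode.DISCARDING));                  // [12, 10, 11, 18]
        System.out.println(emit(panes, Mode.ACCUMULATING));                // [12, 22, 33, 51]
        System.out.println(emit(panes, Mode.ACCUMULATING_AND_RETRACTING)); // [12, -12, 22, -22, 33, -33, 51]
    }
}
```

Discarding suits consumers that add up the outputs themselves, accumulating suits consumers that overwrite the previous value, and retractions let a downstream aggregation undo a value it has already counted.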

  17. Examples

  18. Figure 5: Example Inputs (axes: event time and processing time; shows the actual and ideal watermarks)

  19. PCollection<KV<String, Integer>> output = input
        .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
                     .accumulating())
        .apply(Sum.integersPerKey());

      Figure 7: GlobalWindows, AtPeriod, Accumulating (axes: event time and processing time; emitted sums: 12, 22, 33, 51)

  20. PCollection<KV<String, Integer>> output = input
        .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
                     .discarding())
        .apply(Sum.integersPerKey());

      Figure 8: GlobalWindows, AtPeriod, Discarding (axes: event time and processing time; emitted sums: 12, 10, 11, 18)

  21. PCollection<KV<String, Integer>> output = input
        .apply(Window.into(FixedWindows.of(2, MINUTES))
                     .trigger(Repeat(AtWatermark()))
                     .accumulating())
        .apply(Sum.integersPerKey());

      Let’s run this pipeline under the three execution engines: batch, micro-batch, and streaming.

  22. Figure 10: FixedWindows, Batch (axes: event time and processing time; shows the actual and ideal watermarks)

  23. Figure 11: FixedWindows, Micro-Batch (axes: event time and processing time; shows the actual and ideal watermarks)

  24. Figure 12: FixedWindows, Streaming (axes: event time and processing time; shows the actual and ideal watermarks)

  25. Figure 13: FixedWindows, Streaming, Partial (axes: event time and processing time; shows the actual and ideal watermarks)

  26. PCollection<KV<String, Integer>> output = input
        .apply(Window.into(Sessions.withGapDuration(1, MINUTE))
                     .trigger(SequenceOf(
                         RepeatUntil(
                             AtPeriod(1, MINUTE),
                             AtWatermark()),
                         Repeat(AtWatermark())))
                     .accumulatingAndRetracting())
        .apply(Sum.integersPerKey());

      Figure 14: Sessions, Retracting (axes: event time and processing time; shows the actual and ideal watermarks; outputs include negative retraction values)
