
Fundamentals of Stream Processing with Apache Beam (incubating)



  1. Fundamentals of Stream Processing with Apache Beam (incubating). Frances Perry & Tyler Akidau (@francesjperry, @takidau), Apache Beam Committers & Google Engineers. QCon San Francisco, November 2016. Google Docs version of slides (including animations): https://goo.gl/yzvLXe

  2. Agenda
      1. Infinite, Out-of-Order Data Sets
      2. What, Where, When, How
      3. Reasons This is Awesome
      4. Apache Beam (incubating)

  3. Section 1: Infinite, Out-of-Order Data Sets

  4. Data...

  5. ...can be big...

  6. ...really, really big... [Figure labels: Tuesday, Wednesday, Thursday]

  7. ...maybe infinitely big... [Figure: timeline from 8:00 to 14:00]

  8. ...with unknown delays. [Figure: timeline from 8:00 to 14:00, with several elements labeled 8:00]

  9. Element-wise transformations [Figure: processing-time axis, 8:00 to 14:00]

  10. Aggregating via Processing-Time Windows [Figure: processing-time axis, 8:00 to 14:00]

  11. Aggregating via Event-Time Windows [Figure: input on a processing-time axis, 10:00 to 15:00; output on an event-time axis, 10:00 to 15:00]

  12. Formalizing Event-Time Skew [Figure: processing time vs. event time, with labels Ideal, Reality, and Skew]

  13. Formalizing Event-Time Skew. Watermarks describe event-time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late. [Figure: processing time vs. event time, with labels Ideal, ~Watermark, and Skew]
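The trade-off on this slide can be sketched in plain Java. This is only an illustration, not how any Beam runner computes its watermark: the heuristic used here (largest event time seen so far minus a fixed `assumedMaxSkew`) and the class name `WatermarkSketch` are assumptions made for the example.

```java
final class WatermarkSketch {
    // Assumption for this sketch: events are never more than this far out of order.
    private final long assumedMaxSkew;
    private long maxEventTime;
    private boolean seenAny = false;

    WatermarkSketch(long assumedMaxSkew) {
        this.assumedMaxSkew = assumedMaxSkew;
    }

    /** Heuristic watermark: "no timestamp earlier than this is expected". */
    long watermark() {
        return seenAny ? maxEventTime - assumedMaxSkew : Long.MIN_VALUE;
    }

    /** Records one element; returns true if it arrived behind the watermark (late). */
    boolean observe(long eventTime) {
        boolean late = eventTime < watermark();
        if (!seenAny || eventTime > maxEventTime) {
            maxEventTime = eventTime;
            seenAny = true;
        }
        return late;
    }

    public static void main(String[] args) {
        WatermarkSketch fast = new WatermarkSketch(2); // tight bound: watermark advances quickly
        WatermarkSketch slow = new WatermarkSketch(5); // loose bound: watermark lags further
        long[] arrivals = {10, 9, 13, 10};             // event times in arrival order
        for (long t : arrivals) {
            System.out.println(t + " late(fast)=" + fast.observe(t)
                + " late(slow)=" + slow.observe(t));
        }
    }
}
```

With the tight bound, the final element (event time 10) falls behind the watermark and counts as late. With the loose bound it is still on time, at the cost of a watermark that trails further behind and therefore delays results.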

  14. Section 2: What, Where, When, How

  15. What are you computing? Where in event time? When in processing time? How do refinements relate?

  16. What are you computing? Element-wise, aggregating, or composite transformations. [Slide footer: What / Where / When / How]

  17. What: Computing Integer Sums
      // Collection of raw log lines
      PCollection<String> raw = IO.read(...);
      // Element-wise transformation into team/score pairs
      PCollection<KV<String, Integer>> input =
          raw.apply(ParDo.of(new ParseFn()));
      // Composite transformation containing an aggregation
      PCollection<KV<String, Integer>> scores =
          input.apply(Sum.integersPerKey());

  18. What: Computing Integer Sums [animated figure]

  19. What: Computing Integer Sums [animated figure, continued]
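The two steps on slide 17 (an element-wise parse followed by a per-key sum) can be mimicked without the Beam SDK. The sketch below is plain Java over an in-memory list. The `team,score` line format and the class name `SumPerKeySketch` are assumptions, since the slides do not show what `ParseFn` does.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

final class SumPerKeySketch {
    /** Element-wise step: one raw line in, one team/score pair out (hypothetical "team,score" format). */
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return new AbstractMap.SimpleEntry<>(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    /** Aggregating step: sum the scores for each team. */
    static Map<String, Integer> sumPerKey(Iterable<String> rawLines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println(sumPerKey(Arrays.asList("red,3", "blue,4", "red,5")));
    }
}
```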

  20. Where in event time? Windowing divides data into event-time-based finite chunks: fixed, sliding, or sessions. Often required when doing aggregations over unbounded data. [Figure: fixed, sliding, and session windows shown for Key 1, Key 2, and Key 3 along a time axis]

  21. Where: Fixed 2-minute Windows
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2))))
          .apply(Sum.integersPerKey());

  22. Where: Fixed 2-minute Windows [animated figure]
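A minimal sketch of what fixed event-time windowing does to the sum, in plain Java rather than the Beam SDK. Each element is assigned to the window containing its event timestamp, so the arrival order does not affect the result. The class name `FixedWindowSketch`, the millisecond timestamps, and the sample data are all assumptions made for the illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSketch {
    /** Start of the fixed window that contains the given event time. */
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return Math.floorDiv(eventTimeMillis, sizeMillis) * sizeMillis;
    }

    /** Each row of events is {eventTimeMillis, score}; returns window start -> sum of scores. */
    static Map<Long, Integer> sumPerWindow(long[][] events, long sizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (long[] e : events) {
            sums.merge(windowStart(e[0], sizeMillis), (int) e[1], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60 * 1000L;
        // Arrival order differs from event-time order: the 90 s element shows up after the 130 s one.
        long[][] events = {{30_000, 5}, {130_000, 7}, {90_000, 3}, {250_000, 4}};
        System.out.println(sumPerWindow(events, twoMinutes));
    }
}
```

Shuffling the rows of `events` leaves the output unchanged, which is the property the later "Event Time Results are Stable" slide relies on.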

  23. When in processing time?
      • Triggers control when results are emitted.
      • Triggers are often relative to the watermark.
      [Figure: processing time vs. event time, with labels Ideal, ~Watermark, and Skew]

  24. When: Triggering at the Watermark
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()))
          .apply(Sum.integersPerKey());

  25. When: Triggering at the Watermark [animated figure]

  26. When: Early and Late Firings
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1))))
          .apply(Sum.integersPerKey());

  27. When: Early and Late Firings [animated figure]

  28. How do refinements relate?
      • How should multiple outputs per window accumulate?
      • Appropriate choice depends on consumer.

      Firing          Elements  Discarding  Accumulating  Acc. & Retracting
      Speculative     [3]       3           3             3
      Watermark       [5, 1]    6           9             9, -3
      Late            [2]       2           11            11, -9
      Last Observed             2           11            11
      Total Observed            11          23            11

      (Accumulating & Retracting not yet implemented.)
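The table on slide 28 can be reproduced with a few lines of plain Java. This is a sketch of the three accumulation modes for a single window of integer sums, not Beam code. The class name `AccumulationSketch` and the `fire` helper are invented for the example, and each inner array holds the elements that arrived since the previous firing.

```java
import java.util.ArrayList;
import java.util.List;

final class AccumulationSketch {
    enum Mode { DISCARDING, ACCUMULATING, ACCUMULATING_AND_RETRACTING }

    /** Returns, for each firing, the values emitted downstream under the given mode. */
    static List<List<Integer>> fire(int[][] panes, Mode mode) {
        List<List<Integer>> outputs = new ArrayList<>();
        int running = 0;            // sum of everything seen so far in the window
        boolean firedBefore = false;
        for (int[] pane : panes) {
            int paneSum = 0;
            for (int v : pane) {
                paneSum += v;
            }
            int previous = running;
            running += paneSum;
            List<Integer> out = new ArrayList<>();
            switch (mode) {
                case DISCARDING:
                    out.add(paneSum);           // only the new elements
                    break;
                case ACCUMULATING:
                    out.add(running);           // the full total so far
                    break;
                case ACCUMULATING_AND_RETRACTING:
                    out.add(running);           // the full total so far...
                    if (firedBefore) {
                        out.add(-previous);     // ...plus a retraction of the previous total
                    }
                    break;
            }
            outputs.add(out);
            firedBefore = true;
        }
        return outputs;
    }

    public static void main(String[] args) {
        int[][] panes = {{3}, {5, 1}, {2}};     // speculative, watermark, late
        for (Mode m : Mode.values()) {
            System.out.println(m + " " + fire(panes, m));
        }
    }
}
```

Summing every emitted value per mode gives the "Total Observed" row: 11 when discarding, 23 when accumulating (the consumer must overwrite rather than add), and 11 again when retractions cancel the earlier totals.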

  29. How: Add Newest, Remove Previous
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());

  30. How: Add Newest, Remove Previous [animated figure]

  31. Section 3: Reasons This is Awesome

  32. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  33. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  34. Distributed Systems are Distributed

  35. Processing Time Results Differ

  36. Event Time Results are Stable

  37. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  38. Identifying Bursts of User Activity

  39. Sessions
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());

  40. Identifying Bursts of User Activity
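Session windowing can be sketched in plain Java for a single key. This illustration assumes the usual merging rule: each element opens a window of `[eventTime, eventTime + gap)`, and overlapping windows merge. The class name `SessionSketch`, the time units, and the sample data are assumptions, and the sketch ignores triggers and retractions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SessionSketch {
    /**
     * Each row of events is {eventTime, value}.
     * Returns one {start, endExclusive, sum} triple per merged session.
     */
    static List<long[]> sessions(long[][] events, long gap) {
        long[][] sorted = events.clone();               // leave the caller's ordering untouched
        Arrays.sort(sorted, (a, b) -> Long.compare(a[0], b[0]));
        List<long[]> result = new ArrayList<>();
        long[] current = null;
        for (long[] e : sorted) {
            if (current != null && e[0] < current[1]) {
                // Element falls inside the open session: extend it and add the value.
                current[1] = Math.max(current[1], e[0] + gap);
                current[2] += e[1];
            } else {
                // Gap exceeded: start a new session.
                current = new long[] {e[0], e[0] + gap, e[1]};
                result.add(current);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Event times in seconds, in arrival order; gap duration of one minute.
        long[][] events = {{0, 3}, {30, 4}, {200, 5}, {50, 1}, {259, 2}};
        for (long[] s : sessions(events, 60)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```

The out-of-order element at 50 seconds still joins the first burst of activity, because sessions are defined over event time rather than arrival order.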

  41. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  42. Calculating Session Lengths
      input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark())
              .discardingFiredPanes())
          .apply(CalculateWindowLength());

  43. Calculating the Average Session Length
      input
          .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
              .triggering(AtWatermark())
              .discardingFiredPanes())
          .apply(CalculateWindowLength())
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1))))
              .accumulatingFiredPanes())
          .apply(Mean.globally());

  44. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  45. [Six figure panels]
      1. Classic Batch
      2. Batch with Fixed Windows
      3. Streaming
      4. Streaming with Speculative + Late Data
      5. Streaming with Retractions
      6. Sessions

  46. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  47. [Six code panels, one per scenario]

      1. Classic Batch
      PCollection<KV<String, Integer>> scores = input
          .apply(Sum.integersPerKey());

      2. Batch with Fixed Windows
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2))))
          .apply(Sum.integersPerKey());

      3. Streaming
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()))
          .apply(Sum.integersPerKey());

      4. Streaming with Speculative + Late Data
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1))))
          .apply(Sum.integersPerKey());

      5. Streaming with Retractions
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(FixedWindows.of(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());

      6. Sessions
      PCollection<KV<String, Integer>> scores = input
          .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
              .triggering(AtWatermark()
                  .withEarlyFirings(AtPeriod(Minutes(1)))
                  .withLateFirings(AtCount(1)))
              .accumulatingAndRetractingFiredPanes())
          .apply(Sum.integersPerKey());

  48. What / Where / When / How: Correctness, Power, Composability, Flexibility, Modularity

  49. Section 4: Apache Beam (incubating)

  50. The Evolution of Beam [Figure: lineage diagram naming MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Google Cloud Dataflow, and Apache Beam]

  51. What is Part of Apache Beam?
      1. The Beam Model: What / Where / When / How
      2. SDKs for writing Beam pipelines: Java and Python
      3. Runners for existing distributed processing backends
         • Apache Flink
         • Apache Spark
         • Google Cloud Dataflow
         • Direct runner for local development and testing
         • In development: Apache Gearpump and Apache Apex

  52. Apache Beam Technical Vision
      1. End users: who want to write pipelines or transform libraries in a language that's familiar.
      2. SDK writers: who want to make Beam concepts available in new languages.
      3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
      [Figure: Beam Java, Beam Python, and Other Languages feed into "Beam Model: Pipeline Construction"; Runner A, Runner B, and Cloud Dataflow sit above "Beam Model: Fn Runners", each with its own Execution layer]

  53. Visions are a Journey
      • 2016-02-01: Enter Apache Incubator
      • 2016-02-25: 1st commit to ASF repository
      • 2016-06-08: 0.1.0-incubating release
      • 2016-07-28: 0.2.0-incubating release
      • 2016-10-21: Three new committers
      • 2016-10-31: 0.3.0-incubating release

      • Early 2016: Internal API redesign and chaos
      • Mid 2016: Multiple runners execute Beam pipelines
      • Late 2016: API Stabilization

  54. Categorizing Runner Capabilities: http://beam.incubator.apache.org/documentation/runners/capability-matrix/
