SLIDE 1 Fundamentals of Stream Processing with Apache Beam (incubating)
Frances Perry & Tyler Akidau (@francesjperry, @takidau)
Apache Beam Committers & Google Engineers
QCon San Francisco -- November 2016
Google Docs version of slides (including animations): https://goo.gl/yzvLXe
SLIDE 2 Agenda
1 Infinite, Out-of-Order Data Sets
2 What, Where, When, How
3 Reasons This is Awesome
4 Apache Beam (incubating)
SLIDE 3 Infinite, Out-of-Order Data Sets
1
SLIDE 4
Data...
SLIDE 5
...can be big...
SLIDE 6 ...really, really big...
[Figure: data spanning Tuesday, Wednesday, and Thursday]
SLIDE 7 ...maybe infinitely big...
[Figure: timeline from 8:00 to 14:00]
SLIDE 8 ...with unknown delays.
[Figure: timeline from 8:00 to 14:00, with three elements labeled 8:00]
SLIDE 9 Element-wise transformations
[Figure: Processing Time axis from 8:00 to 14:00]
SLIDE 10 Aggregating via Processing-Time Windows
[Figure: Processing Time axis from 8:00 to 14:00]
SLIDE 11 Aggregating via Event-Time Windows
[Figure: Input and Output panels; axes Event Time and Processing Time, 10:00 to 15:00]
SLIDE 12 Formalizing Event-Time Skew
[Figure: Event Time vs. Processing Time; labels: Ideal, Reality, Skew]
SLIDE 13 Formalizing Event-Time Skew
- Watermarks describe event-time progress: "No timestamp earlier than the watermark will be seen."
- Often heuristic-based.
- Too slow? Results are delayed.
- Too fast? Some data is late.
[Figure: Event Time vs. Processing Time; labels: Ideal, ~Watermark, Skew]
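The trade-off above can be sketched in plain Java. This is a minimal illustration, not how Beam or any runner computes watermarks: it assumes a hypothetical heuristic in which the watermark trails the largest event time seen so far by a fixed allowance. The class name `HeuristicWatermark` and the `allowedSkewMillis` parameter are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical heuristic: watermark = max event time seen so far minus a fixed allowance. */
final class HeuristicWatermark {
    private final long allowedSkewMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    /** Current watermark estimate; Long.MIN_VALUE until the first element arrives. */
    long current() {
        return maxEventTimeMillis == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTimeMillis - allowedSkewMillis;
    }

    /** Observes one element and reports whether it arrived behind the watermark (late). */
    boolean observeAndCheckLate(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }

    /** Classifies event times, given in arrival order, as late (true) or on time (false). */
    static List<Boolean> classify(long allowedSkewMillis, long[] eventTimesInArrivalOrder) {
        HeuristicWatermark watermark = new HeuristicWatermark(allowedSkewMillis);
        List<Boolean> late = new ArrayList<>();
        for (long t : eventTimesInArrivalOrder) {
            late.add(watermark.observeAndCheckLate(t));
        }
        return late;
    }

    public static void main(String[] args) {
        long[] arrivals = {1_000, 5_000, 4_000, 9_000, 2_000};
        // Large allowance (slow watermark): nothing is late, but results wait longer.
        System.out.println(classify(10_000, arrivals));
        // Small allowance (fast watermark): the element stamped 2_000 is flagged late.
        System.out.println(classify(2_000, arrivals));
    }
}
```

With the larger allowance every element is on time; with the smaller one, the last element falls behind the watermark and counts as late data.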
SLIDE 14 What, Where, When, How
2
SLIDE 15
- What are you computing?
- Where in event time?
- When in processing time?
- How do refinements relate?
SLIDE 16 What are you computing?
What Where When How
Element-Wise Aggregating Composite
SLIDE 17 What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
SLIDE 18 What: Computing Integer Sums
SLIDE 19 What: Computing Integer Sums
SLIDE 20 Where in event time?
- Windowing divides data into event-time-based finite chunks.
- Often required when doing aggregations over unbounded data.
[Figure: Fixed, Sliding, and Sessions windows along a Time axis; sessions shown per key (Key 1, Key 2, Key 3)]
SLIDE 21 Where: Fixed 2-minute Windows
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());
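To make the effect of fixed windowing concrete, here is a minimal plain-Java sketch of what "2-minute fixed windows plus a per-key sum" computes. It does not use the Beam API; the `Event` class, the `team@windowStart` key format, and the sample data are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSums {
    static final long WINDOW_MILLIS = 2 * 60 * 1000L;

    /** Hypothetical input element: a team's score stamped with its event time. */
    static final class Event {
        final String team;
        final long eventTimeMillis;
        final int score;

        Event(String team, long eventTimeMillis, int score) {
            this.team = team;
            this.eventTimeMillis = eventTimeMillis;
            this.score = score;
        }
    }

    /** Start of the fixed window that contains the given event time. */
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    /** Sums scores per (team, window), keyed as "team@windowStartMillis". */
    static Map<String, Integer> sumPerKeyAndWindow(List<Event> events) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            String key = e.team + "@" + windowStart(e.eventTimeMillis);
            sums.merge(key, e.score, Integer::sum);
        }
        return sums;
    }

    /** Sample data: arrival order does not matter, only event time does. */
    static Map<String, Integer> demo() {
        return sumPerKeyAndWindow(Arrays.asList(
            new Event("red", 130_000, 2),
            new Event("red", 30_000, 3),
            new Event("blue", 10_000, 4),
            new Event("red", 90_000, 5)));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each element lands in exactly one window determined by its event time, so the grouping is the same regardless of the order in which elements arrive.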
SLIDE 22 Where: Fixed 2-minute Windows
SLIDE 23 When in processing time?
- Triggers control when results are emitted.
- Triggers are often relative to the watermark.
[Figure: Event Time vs. Processing Time; labels: Ideal, ~Watermark, Skew]
SLIDE 24 When: Triggering at the Watermark
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
SLIDE 25 When: Triggering at the Watermark
SLIDE 26 When: Early and Late Firings
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
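The following plain-Java sketch simulates the early / on-time / late firing pattern for a single window. It is a simplification under stated assumptions rather than Beam's trigger machinery: the class `TriggerSketch` and its methods are hypothetical, the periodic early trigger is modeled as an explicit `periodicTick()` call, and panes accumulate (each firing reports the running sum).

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates one window's firings: early (periodic), on time (watermark), late (per element). */
final class TriggerSketch {
    private final long windowEndMillis;
    private long sum = 0;
    private boolean unreported = false;      // elements have arrived since the last firing
    private boolean watermarkPassed = false; // watermark has reached the end of the window
    private final List<String> firings = new ArrayList<>();

    TriggerSketch(long windowEndMillis) {
        this.windowEndMillis = windowEndMillis;
    }

    /** An element for this window arrives; after the watermark, every element fires a late pane. */
    void element(int value) {
        sum += value;
        unreported = true;
        if (watermarkPassed) {
            fire("LATE");
        }
    }

    /** Periodic processing-time tick; fires an early pane only if there is something new. */
    void periodicTick() {
        if (!watermarkPassed && unreported) {
            fire("EARLY");
        }
    }

    /** Watermark progress; the first time it reaches the window end, the on-time pane fires. */
    void watermarkAdvancesTo(long watermarkMillis) {
        if (!watermarkPassed && watermarkMillis >= windowEndMillis) {
            watermarkPassed = true;
            fire("ON_TIME");
        }
    }

    private void fire(String label) {
        firings.add(label + "=" + sum);
        unreported = false;
    }

    List<String> firings() {
        return firings;
    }

    /** Replays a small scenario: one early element, two more, the watermark, then a late element. */
    static List<String> demo() {
        TriggerSketch window = new TriggerSketch(120_000);
        window.element(3);
        window.periodicTick();
        window.element(5);
        window.element(1);
        window.watermarkAdvancesTo(120_000);
        window.element(2);
        return window.firings();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [EARLY=3, ON_TIME=9, LATE=11]
    }
}
```

The scenario yields a speculative result before the watermark, a result when the watermark passes the end of the window, and a refinement for the late element.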
SLIDE 27 When: Early and Late Firings
SLIDE 28 How do refinements relate?
- How should multiple outputs per window accumulate?
- Appropriate choice depends on consumer.
Firing                     Speculative  Watermark  Late     Last Observed  Total Observed
Elements                   [3]          [5, 1]     [2]
Discarding                 3            6          2        2              11
Accumulating               3            9          11       11             23
Accumulating & Retracting  3            9, -3      11, -9   11             11
(Accumulating & Retracting not yet implemented.)
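The three accumulation modes can be reproduced with a few lines of plain Java. This sketch is independent of the Beam API; the class and method names are made up for illustration, and the input is the slide's three panes: [3], [5, 1], and [2].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AccumulationModes {
    /** The slide's panes: speculative [3], watermark [5, 1], late [2]. */
    static final List<List<Integer>> PANES =
        Arrays.asList(Arrays.asList(3), Arrays.asList(5, 1), Arrays.asList(2));

    private static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    /** Discarding: each firing reports only what arrived since the previous firing. */
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            out.add(sum(pane));
        }
        return out;
    }

    /** Accumulating: each firing reports the running total for the window. */
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(running);
        }
        return out;
    }

    /** Accumulating and retracting: the new total, plus a negation of the previous total. */
    static List<List<Integer>> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<List<Integer>> out = new ArrayList<>();
        int running = 0;
        boolean firedBefore = false;
        for (List<Integer> pane : panes) {
            int previous = running;
            running += sum(pane);
            List<Integer> firing = new ArrayList<>();
            firing.add(running);
            if (firedBefore) {
                firing.add(-previous); // retract the value emitted by the previous firing
            }
            out.add(firing);
            firedBefore = true;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(discarding(PANES));                // [3, 6, 2]
        System.out.println(accumulating(PANES));              // [3, 9, 11]
        System.out.println(accumulatingAndRetracting(PANES)); // [[3], [9, -3], [11, -9]]
    }
}
```

A consumer that adds up every value it sees gets 11 from discarding, 23 from accumulating, and 11 from accumulating and retracting, which is why the right mode depends on what the consumer does with the output.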
SLIDE 29 How: Add Newest, Remove Previous
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
SLIDE 30 How: Add Newest, Remove Previous
SLIDE 31 Reasons This is Awesome
3
SLIDE 32
Correctness Power Composability Flexibility Modularity
What / Where / When / How
SLIDE 33
SLIDE 34
Distributed Systems are Distributed
SLIDE 35
Processing Time Results Differ
SLIDE 36
Event Time Results are Stable
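A small plain-Java sketch of why this holds: the same events are replayed with two different sets of arrival delays and summed in 2-minute windows, once by processing time and once by event time. The data, the `{eventMinute, processingMinute, value}` row layout, and the class name are assumptions made for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class EventVsProcessingTime {
    static final int EVENT_TIME = 0;      // column index of the event-time minute
    static final int PROCESSING_TIME = 1; // column index of the processing-time minute

    /** Same four events in both runs; only the arrival (processing) times differ. */
    static final long[][] RUN_A = {{0, 1, 5}, {1, 1, 7}, {2, 3, 3}, {3, 3, 4}};
    static final long[][] RUN_B = {{0, 1, 5}, {1, 4, 7}, {2, 3, 3}, {3, 6, 4}};

    /** Sums values into fixed windows keyed by window start, using the chosen time column. */
    static Map<Long, Integer> sumByWindow(long[][] rows, int timeColumn, long windowMinutes) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (long[] row : rows) {
            long start = row[timeColumn] - Math.floorMod(row[timeColumn], windowMinutes);
            sums.merge(start, (int) row[2], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        System.out.println("processing time, run A: " + sumByWindow(RUN_A, PROCESSING_TIME, 2));
        System.out.println("processing time, run B: " + sumByWindow(RUN_B, PROCESSING_TIME, 2));
        System.out.println("event time, run A:      " + sumByWindow(RUN_A, EVENT_TIME, 2));
        System.out.println("event time, run B:      " + sumByWindow(RUN_B, EVENT_TIME, 2));
    }
}
```

The processing-time sums change with the delays, while the event-time sums are identical across both runs.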
SLIDE 37
SLIDE 38
Identifying Bursts of User Activity
SLIDE 39
Sessions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
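Session windowing can be sketched in plain Java as interval merging. This is an illustration only, assuming that each element opens a window [t, t + gap) and that overlapping windows for the same key are merged; the class and method names are hypothetical, and times are in seconds for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SessionSketch {
    /**
     * Groups one key's event times into sessions. Each element opens a window
     * [t, t + gap); windows that overlap are merged. Returns {start, end} pairs.
     */
    static List<long[]> sessions(long[] eventTimes, long gap) {
        long[] sorted = eventTimes.clone();
        Arrays.sort(sorted);
        List<long[]> merged = new ArrayList<>();
        for (long t : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && t < last[1]) {
                // The element falls inside the current session: extend its end.
                last[1] = Math.max(last[1], t + gap);
            } else {
                // The gap was exceeded (or this is the first element): start a new session.
                merged.add(new long[] {t, t + gap});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Event times in seconds, deliberately out of order; gap of 60 seconds.
        for (long[] session : sessions(new long[] {200, 10, 65, 40, 230}, 60)) {
            System.out.println(Arrays.toString(session));
        }
    }
}
```

Unlike fixed windows, the window boundaries here depend on the data itself, which is why sessions capture bursts of activity separated by idle periods.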
SLIDE 40
Identifying Bursts of User Activity
SLIDE 41
SLIDE 42 Calculating Session Lengths
input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength());
SLIDE 43
SLIDE 44 Calculating the Average Session Length
input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength())
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1))))
        .accumulatingFiredPanes())
    .apply(Mean.globally());
SLIDE 45
SLIDE 46
SLIDE 47
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming With Retractions
SLIDE 48
SLIDE 49
1. Classic Batch
PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

2. Batch with Fixed Windows
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

3. Streaming
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

4. Streaming with Speculative + Late Data
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

5. Streaming With Retractions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

6. Sessions
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
SLIDE 50
SLIDE 51 Apache Beam (incubating)
4
SLIDE 52 The Evolution of Beam
MapReduce
Google Cloud Dataflow
Apache Beam
BigTable Dremel Colossus Flume Megastore Spanner PubSub Millwheel
SLIDE 53 What is Part of Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- Java and Python
3. Runners for Existing Distributed Processing Backends
- Apache Flink
- Apache Spark
- Google Cloud Dataflow
- Direct runner for local development and testing
- In development: Apache Gearpump and Apache Apex
SLIDE 54 Apache Beam Technical Vision
1. End users: who want to write pipelines or transform libraries in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Figure: layered architecture. SDKs: Beam Java, Beam Python, Other Languages. Beam Model: Pipeline Construction. Beam Model: Fn Runners. Runners: Runner A, Runner B, Cloud Dataflow, each with Execution.]
SLIDE 55 Visions are a Journey
Phases:
- Early 2016: Internal API redesign and chaos
- Mid 2016: API Stabilization
- Late 2016: Multiple runners execute Beam pipelines
Milestones:
- 2016-02-01: Enter Apache Incubator
- 2016-02-25: 1st commit to ASF repository
- 2016-06-08: 0.1.0-incubating release
- 2016-07-28: 0.2.0-incubating release
- 2016-10-21: Three new committers
- 2016-10-31: 0.3.0-incubating release
SLIDE 56 Categorizing Runner Capabilities
http://beam.incubator.apache.org/documentation/runners/capability-matrix/
SLIDE 57
Learn More!
Streaming Fundamentals: The World Beyond Batch 101 & 102
- http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
- http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Apache Beam (incubating)
- http://beam.incubator.apache.org
Join the Beam community
- user-subscribe@beam.incubator.apache.org
- dev-subscribe@beam.incubator.apache.org
Slides for this talk
- http://goo.gl/yzvLXe
Follow @ApacheBeam on Twitter
SLIDE 58
Thank you!