Fundamentals of Stream Processing with Apache Beam (incubating)


Google Docs version of slides (including animations): https://goo.gl/yzvLXe Fundamentals of Stream Processing with Apache Beam (incubating) Frances Perry & Tyler Akidau @francesjperry, @takidau Apache Beam Committers & Google Engineers


slide-1
SLIDE 1

Frances Perry & Tyler Akidau

@francesjperry, @takidau Apache Beam Committers & Google Engineers

Fundamentals of Stream Processing with Apache Beam (incubating)

QCon San Francisco -- November 2016

Google Docs version of slides (including animations): https://goo.gl/yzvLXe

slide-2
SLIDE 2

Agenda

1. Infinite, Out-of-Order Data Sets
2. What, Where, When, How
3. Reasons This is Awesome
4. Apache Beam (incubating)

slide-3
SLIDE 3

Infinite, Out-of-Order Data Sets

1

slide-4
SLIDE 4

Data...

slide-5
SLIDE 5

...can be big...

slide-6
SLIDE 6

...really, really big...

[Figure: data spanning Tuesday, Wednesday, Thursday]

slide-7
SLIDE 7

...maybe infinitely big...

[Figure: time axis, 8:00 to 14:00]

slide-8
SLIDE 8

...with unknown delays.

[Figure: time axis, 8:00 to 14:00, with three elements stamped 8:00 shown at later points]

slide-9
SLIDE 9

Element-wise transformations

[Figure: processing-time axis, 8:00 to 14:00]

slide-10
SLIDE 10

Aggregating via Processing-Time Windows

[Figure: processing-time axis, 8:00 to 14:00]

slide-11
SLIDE 11

Aggregating via Event-Time Windows

[Figure: input and output streams on event-time and processing-time axes, 10:00 to 15:00]

slide-12
SLIDE 12

Formalizing Event-Time Skew

[Figure: event time vs. processing time, showing the "Ideal" line, the "Reality" curve, and the skew between them]

slide-13
SLIDE 13

Formalizing Event-Time Skew

Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen."

Often heuristic-based.
  • Too slow? Results are delayed.
  • Too fast? Some data is late.

[Figure: event time vs. processing time, showing the ideal line, the approximate watermark (~Watermark), and the skew]
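A minimal sketch of one possible watermark heuristic, in plain Java rather than the Beam API. It assumes the watermark simply trails the newest event time seen by a fixed bound; the class name `HeuristicWatermark` and that bound are illustrative assumptions, not how any particular runner computes its watermark. A small bound makes the watermark "fast" (more elements are flagged late), while a large bound makes it "slow" (results wait longer).

```java
/** Hypothetical heuristic: watermark = newest event time seen minus a fixed bound. */
final class HeuristicWatermark {
    private final long maxExpectedDelayMillis;
    private long maxEventTimeMillis;
    private boolean seenAny = false;

    HeuristicWatermark(long maxExpectedDelayMillis) {
        this.maxExpectedDelayMillis = maxExpectedDelayMillis;
    }

    /** Current watermark estimate; Long.MIN_VALUE until the first element is observed. */
    long current() {
        return seenAny ? maxEventTimeMillis - maxExpectedDelayMillis : Long.MIN_VALUE;
    }

    /** Records an element and reports whether it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        if (!seenAny || eventTimeMillis > maxEventTimeMillis) {
            maxEventTimeMillis = eventTimeMillis;
            seenAny = true;
        }
        return late;
    }
}
```

With a 60-second bound, an element stamped 100 s that arrives after one stamped 200 s falls behind the 140 s watermark and is reported as late.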

slide-14
SLIDE 14

What, Where, When, How

2

slide-15
SLIDE 15

  • What are you computing?
  • Where in event time?
  • When in processing time?
  • How do refinements relate?

slide-16
SLIDE 16

What are you computing?

What Where When How

  • Element-Wise
  • Aggregating
  • Composite

slide-17
SLIDE 17

What: Computing Integer Sums

// Collection of raw log lines
PCollection<String> raw = IO.read(...);

// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));

// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
    input.apply(Sum.integersPerKey());
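For readers without a Beam environment, the same "parse, then sum per key" shape can be sketched in plain Java. This is an illustration under assumptions: the `team,score` line format and the `TeamScores` class are hypothetical, and the sketch runs on an in-memory list rather than a PCollection.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TeamScores {
    /** Element-wise step: parses one "team,score" line (assumed format) into a pair. */
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    /** Aggregating step: sums the parsed scores for each team. */
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            sums.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return sums;
    }
}
```

Calling `TeamScores.sumPerKey(List.of("red,3", "blue,5", "red,4"))` yields a total of 7 for `red` and 5 for `blue`.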


slide-18
SLIDE 18

What: Computing Integer Sums


slide-19
SLIDE 19

What: Computing Integer Sums


slide-20
SLIDE 20

Where in event time?

Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data.

[Figure: fixed, sliding, and session windows laid out along the time axis; session windows are shown per key (Key 1, Key 2, Key 3)]
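How a timestamp maps to fixed and sliding windows can be sketched with a little arithmetic. The following is a plain-Java illustration, not Beam's implementation; the class and method names are hypothetical, and windows are treated as half-open intervals [start, start + size).

```java
import java.util.ArrayList;
import java.util.List;

final class WindowAssignment {
    /** Start of the single fixed window of the given size that contains the timestamp. */
    static long fixedWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Starts of every sliding window (size, period) containing the timestamp, latest first. */
    static List<Long> slidingWindowStarts(long timestampMillis, long sizeMillis, long periodMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, periodMillis);
        for (long start = lastStart; start > timestampMillis - sizeMillis; start -= periodMillis) {
            starts.add(start);
        }
        return starts;
    }
}
```

An element lands in exactly one fixed window but in size / period sliding windows (two of them for 2-minute windows sliding every minute).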

slide-21
SLIDE 21

Where: Fixed 2-minute Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

slide-22
SLIDE 22

Where: Fixed 2-minute Windows


slide-23
SLIDE 23

When in processing time?

  • Triggers control when results are emitted.
  • Triggers are often relative to the watermark.

[Figure: event time vs. processing time, showing the ideal line, the approximate watermark (~Watermark), and the skew]

slide-24
SLIDE 24

When: Triggering at the Watermark

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

slide-25
SLIDE 25

When: Triggering at the Watermark


slide-26
SLIDE 26

When: Early and Late Firings

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

slide-27
SLIDE 27

When: Early and Late Firings


slide-28
SLIDE 28

How do refinements relate?

  • How should multiple outputs per window accumulate?
  • Appropriate choice depends on consumer.

Firing           Elements   Discarding   Accumulating   Acc. & Retracting
Speculative      [3]        3            3              3
Watermark        [5, 1]     6            9              9, -3
Late             [2]        2            11             11, -9
Last Observed               2            11             11
Total Observed              11           23             11

(Accumulating & Retracting not yet implemented.)
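The numbers in this slide's firing table can be reproduced with a short plain-Java sketch. It is only an illustration of the three accumulation modes over the panes [3], [5, 1], [2]; the class and method names are hypothetical and nothing here uses the Beam API.

```java
import java.util.ArrayList;
import java.util.List;

final class AccumulationModes {
    /** Discarding: each firing reports only the elements received since the previous firing. */
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            out.add(sum(pane));
        }
        return out;
    }

    /** Accumulating: each firing reports the running total for the window so far. */
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(running);
        }
        return out;
    }

    /** Accumulating and retracting: the running total, plus a negated copy of the previous total. */
    static List<Integer> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        boolean first = true;
        for (List<Integer> pane : panes) {
            int previous = running;
            running += sum(pane);
            out.add(running);
            if (!first) {
                out.add(-previous); // retraction of the previously emitted total
            }
            first = false;
        }
        return out;
    }

    private static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }
}
```

Summing each output list gives the "Total Observed" row: 11 for discarding, 23 for accumulating (the consumer must overwrite rather than add), and 11 again once retractions are applied.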

slide-29
SLIDE 29

How: Add Newest, Remove Previous

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

slide-30
SLIDE 30

How: Add Newest, Remove Previous


slide-31
SLIDE 31

Reasons This is Awesome

3

slide-32
SLIDE 32

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-33
SLIDE 33

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-34
SLIDE 34

Distributed Systems are Distributed

slide-35
SLIDE 35

Processing Time Results Differ

slide-36
SLIDE 36

Event Time Results are Stable
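The contrast between the last two slides can be sketched in plain Java: the same events, delivered with different delays, always produce the same event-time sums, while sums bucketed by arrival time change from run to run. The `Element` type, the method names, and the sample timestamps are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EventVsProcessingTime {
    /** One input element: when it occurred, when it reached the pipeline, and its value. */
    static final class Element {
        final long eventTime;
        final long arrivalTime;
        final int value;

        Element(long eventTime, long arrivalTime, int value) {
            this.eventTime = eventTime;
            this.arrivalTime = arrivalTime;
            this.value = value;
        }
    }

    /** Sums values per fixed window keyed by event time; independent of delivery delays. */
    static Map<Long, Integer> sumByEventTime(List<Element> elements, long windowSize) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Element e : elements) {
            sums.merge(windowStart(e.eventTime, windowSize), e.value, Integer::sum);
        }
        return sums;
    }

    /** Sums values per fixed window keyed by arrival (processing) time. */
    static Map<Long, Integer> sumByArrivalTime(List<Element> elements, long windowSize) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Element e : elements) {
            sums.merge(windowStart(e.arrivalTime, windowSize), e.value, Integer::sum);
        }
        return sums;
    }

    private static long windowStart(long timestamp, long windowSize) {
        return timestamp - Math.floorMod(timestamp, windowSize);
    }
}
```

If one element that occurred at time 20 arrives at time 25 in one run and at time 130 in another, the event-time sums are identical across both runs, whereas the arrival-time sums move that element's value into a different window.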

slide-37
SLIDE 37

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-38
SLIDE 38

Identifying Bursts of User Activity

slide-39
SLIDE 39

Sessions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
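Session windowing can be pictured as giving every element its own window [timestamp, timestamp + gap) and merging windows that overlap. The plain-Java sketch below illustrates that idea for a single key; the class name is hypothetical, and it processes a complete, sorted batch rather than merging incrementally as a streaming runner would.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SessionWindows {
    /**
     * Groups one key's timestamps into sessions. Each result is {start, end}, where
     * end is the last timestamp in the session plus the gap duration.
     */
    static List<long[]> sessions(List<Long> timestamps, long gapMillis) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && ts < last[1]) {
                last[1] = ts + gapMillis; // overlaps the open session: extend it
            } else {
                result.add(new long[] {ts, ts + gapMillis}); // gap exceeded: new session
            }
        }
        return result;
    }
}
```

With a 1-minute gap, activity at 0 s, 30 s, and 50 s forms one session ending at 110 s, and activity at 200 s starts a second one.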

slide-40
SLIDE 40

Identifying Bursts of User Activity

slide-41
SLIDE 41

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-42
SLIDE 42

Calculating Session Lengths

input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength());

slide-43
SLIDE 43
slide-44
SLIDE 44

Calculating the Average Session Length

input
    .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
        .triggering(AtWatermark())
        .discardingFiredPanes())
    .apply(CalculateWindowLength())
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1))))
        .accumulatingFiredPanes())
    .apply(Mean.globally());
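The composition of the two steps (session lengths first, then their mean) can be sketched in plain Java. This is a simplified illustration: it averages over all sessions at once instead of re-windowing the lengths into fixed 2-minute windows as the pipeline does, and the class, method names, and per-user input map are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class AverageSessionLength {
    /** Lengths of one user's sessions, where a session ends gapMillis after its last event. */
    static List<Long> sessionLengths(List<Long> timestamps, long gapMillis) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<Long> lengths = new ArrayList<>();
        long start = 0;
        long end = 0;
        boolean open = false;
        for (long ts : sorted) {
            if (open && ts < end) {
                end = ts + gapMillis; // still inside the current session
            } else {
                if (open) {
                    lengths.add(end - start); // close the previous session
                }
                start = ts;
                end = ts + gapMillis;
                open = true;
            }
        }
        if (open) {
            lengths.add(end - start);
        }
        return lengths;
    }

    /** Mean session length across all users; 0.0 when there are no sessions. */
    static double mean(Map<String, List<Long>> timestampsByUser, long gapMillis) {
        long total = 0;
        int count = 0;
        for (List<Long> timestamps : timestampsByUser.values()) {
            for (long length : sessionLengths(timestamps, gapMillis)) {
                total += length;
                count++;
            }
        }
        return count == 0 ? 0.0 : (double) total / count;
    }
}
```

The second step only consumes the output of the first, which is the composability point the slide makes: each stage chooses its own windowing and triggering.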

slide-45
SLIDE 45
slide-46
SLIDE 46

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-47
SLIDE 47

1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
6. Sessions
slide-48
SLIDE 48

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-49
SLIDE 49

1. Classic Batch

PCollection<KV<String, Integer>> scores = input
    .apply(Sum.integersPerKey());

2. Batch with Fixed Windows

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2))))
    .apply(Sum.integersPerKey());

3. Streaming

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

4. Streaming with Speculative + Late Data

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

5. Streaming with Retractions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());

6. Sessions

PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Minutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
slide-50
SLIDE 50

Correctness Power Composability Flexibility Modularity

What / Where / When / How

slide-51
SLIDE 51

Apache Beam (incubating)

4

slide-52
SLIDE 52

The Evolution of Beam

[Figure: lineage of Google data-processing systems (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) leading to Google Cloud Dataflow and then Apache Beam]

slide-53
SLIDE 53

What is Part of Apache Beam?

1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines: Java and Python
3. Runners for existing distributed processing backends
  • Apache Flink
  • Apache Spark
  • Google Cloud Dataflow
  • Direct runner for local development and testing
  • In development: Apache Gearpump and Apache Apex

slide-54
SLIDE 54

Apache Beam Technical Vision

1. End users: who want to write pipelines or transform libraries in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.

[Figure: layered architecture. SDKs (Beam Java, Beam Python, other languages) feed "Beam Model: Pipeline Construction"; runners (Runner A, Runner B, Cloud Dataflow) sit under "Beam Model: Fn Runners" and handle execution]

slide-55
SLIDE 55

Visions are a Journey

Phases:
  • Early 2016: Internal API redesign and chaos
  • Mid 2016: API stabilization
  • Late 2016: Multiple runners execute Beam pipelines

Milestones:
  • 2016-02-01: Enter Apache Incubator
  • 2016-02-25: 1st commit to ASF repository
  • 2016-06-08: 0.1.0-incubating release
  • 2016-07-28: 0.2.0-incubating release
  • 2016-10-21: Three new committers
  • 2016-10-31: 0.3.0-incubating release

slide-56
SLIDE 56

Categorizing Runner Capabilities

http://beam.incubator.apache.org/documentation/runners/capability-matrix/

slide-57
SLIDE 57

Learn More!

Streaming Fundamentals: The World Beyond Batch 101 & 102
  • http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  • http://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

Apache Beam (incubating)
  • http://beam.incubator.apache.org

Join the Beam community
  • user-subscribe@beam.incubator.apache.org
  • dev-subscribe@beam.incubator.apache.org

Slides for this talk
  • http://goo.gl/yzvLXe

Follow @ApacheBeam on Twitter

slide-58
SLIDE 58

Thank you!