Have Your Cake & Eat It Too: Further Dispelling the Myths of the Lambda Architecture


Have Your Cake & Eat It Too: Further Dispelling the Myths of the Lambda Architecture. Tyler Akidau, Staff Software Engineer. Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa


slide-1
SLIDE 1

Have Your Cake & Eat It Too

Further Dispelling the Myths of the Lambda Architecture

Tyler Akidau Staff Software Engineer

Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa

slide-2
SLIDE 2

  • MillWheel: Stream Processing System
  • Streaming Flume: High-level API
  • Cloud Dataflow: Data Processing Service
slide-3
SLIDE 3

Google Cloud Dataflow

[Diagram: a pipeline between two GCS endpoints, with Optimize and Schedule stages]

slide-4
SLIDE 4
  • MillWheel: Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
  • Streaming Flume: Robert Bradshaw, Daniel Mills, and more...
  • Cloud Dataflow: Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...

slide-5
SLIDE 5

Cloud Dataflow is unreleased. Things may change.

slide-6
SLIDE 6

Agenda

  • 1. Lambda vs Streaming
  • 2. Strong Consistency
  • 3. Reasoning About Time

slide-7
SLIDE 7

Lambda vs Streaming

1

slide-8
SLIDE 8

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

slide-9
SLIDE 9

The Lambda Architecture

slide-10
SLIDE 10

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

slide-11
SLIDE 11

The Evolution of Streaming

slide-12
SLIDE 12

What does it take?

  • Strong Consistency
  • Tools for Reasoning About Time

slide-13
SLIDE 13

Strong Consistency

2

slide-14
SLIDE 14

Consistent Storage

Storage

slide-15
SLIDE 15
Why consistency is important

  • Mostly correct is not good enough
  • Required for exactly-once processing
  • Required for repeatable results
  • Cannot replace batch without it

slide-16
SLIDE 16
How?

  • Sequencers (e.g. BigTable)
  • Leases (e.g. Spanner)
  • Federation of storage silos (e.g. Samza, Dataflow)
  • RDDs (e.g. Spark)

slide-17
SLIDE 17

http://research.google.com/pubs/pub41378.html

slide-18
SLIDE 18

Reasoning About Time

3

slide-19
SLIDE 19

  • Event Time vs Stream Time
  • Batch vs Streaming
  • Approaches
  • Dataflow API

slide-20
SLIDE 20

  • Event Time - When Events Happened
  • Stream Time - When Events Are Processed

slide-21
SLIDE 21

Batch vs Streaming

slide-22
SLIDE 22

MapReduce

Batch

slide-23
SLIDE 23

Batch: Fixed Windows

[Diagram: MapReduce processing input divided into fixed one-hour windows, from 10:00 through 0:00]

slide-24
SLIDE 24

Batch: User Sessions

[Diagram: MapReduce grouping the 10:00 - 11:00 and 11:00 - 12:00 inputs into per-user sessions for Joan, Larry, Ingo, Amanda, Cheryl, and Arthur]

slide-25
SLIDE 25

Streaming

[Diagram: stream time axis, 10:00 to 16:00]

slide-26
SLIDE 26

Confounding characteristics of data streams

  • Unordered
  • Unbounded
  • Of Varying Event Time Skew

slide-27
SLIDE 27

Event Time Skew

[Plot axes: Stream Time, Event Time; labeled region: Skew]
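As a rough illustration of event time skew, the sketch below treats skew as the difference between when a record is processed (stream time) and when it happened (event time). The class name, the method names, and the millisecond representation are assumptions made for this example; they do not come from the talk.

```java
final class EventTimeSkew {
    // Skew of one record: how long after the event happened the pipeline observed it.
    static long skewMillis(long eventMillis, long streamMillis) {
        return streamMillis - eventMillis;
    }

    // records[i] = {eventMillis, streamMillis}; returns {minSkew, maxSkew}.
    static long[] skewRange(long[][] records) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long[] r : records) {
            long skew = skewMillis(r[0], r[1]);
            min = Math.min(min, skew);
            max = Math.max(max, skew);
        }
        return new long[] {min, max};
    }

    public static void main(String[] args) {
        long[][] records = {{0L, 1_000L}, {5_000L, 65_000L}, {10_000L, 12_000L}};
        long[] range = skewRange(records);
        System.out.println("min skew = " + range[0] + " ms, max skew = " + range[1] + " ms");
    }
}
```

A wide gap between the minimum and maximum is what "varying" skew looks like in practice: some records show up almost immediately, others long after they occurred.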

slide-28
SLIDE 28

Approaches

slide-29
SLIDE 29
  • 1. Time-Agnostic Processing
  • 2. Approximation
  • 3. Stream Time Windowing
  • 4. Event Time Windowing

Approaches to reasoning about time

slide-30
SLIDE 30
  • 1. Time-Agnostic Processing - Filters

[Diagram: records along a stream time axis, 10:00 to 16:00]

Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward, Efficient
Cons: Limited utility
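A minimal sketch of the filter case, assuming a hypothetical log-line shape of "<domain> <path>" (the talk does not specify a log format). Each record is kept or dropped based only on its own content, which is why no notion of time is needed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class DomainFilter {
    // Hypothetical log-line shape for this sketch: "<domain> <path>".
    static boolean fromDomain(String logLine, String domain) {
        return logLine.startsWith(domain + " ");
    }

    // Each record is judged on its own content; no timestamps are consulted.
    static List<String> filter(List<String> logLines, String domain) {
        return logLines.stream()
                .filter(line -> fromDomain(line, domain))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
                "example.com /index.html",
                "other.org /about",
                "example.com /cart");
        System.out.println(filter(logs, "example.com"));
    }
}
```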

slide-31
SLIDE 31
  • 1. Time-Agnostic Processing - Hash Join

[Diagram: records along a stream time axis, 10:00 to 16:00]

Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward, Efficient
Cons: Limited utility
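One way to picture the hash join is a symmetric join that buffers whichever side arrives first and emits a pair when the other side shows up. The sketch below assumes a shared, hypothetical queryId key and at most one click per query; neither detail is stated in the slides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueryClickJoin {
    private final Map<String, String> pendingQueries = new HashMap<>();
    private final Map<String, String> pendingClicks = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    // Called when a query record arrives; queryId is a hypothetical join key.
    void onQuery(String queryId, String queryText) {
        String click = pendingClicks.remove(queryId);
        if (click != null) {
            joined.add(queryText + " -> " + click);
        } else {
            pendingQueries.put(queryId, queryText);   // wait for the matching click
        }
    }

    // Called when a click record arrives; it may precede its query.
    void onClick(String queryId, String clickedUrl) {
        String query = pendingQueries.remove(queryId);
        if (query != null) {
            joined.add(query + " -> " + clickedUrl);
        } else {
            pendingClicks.put(queryId, clickedUrl);   // wait for the matching query
        }
    }

    List<String> joined() {
        return joined;
    }

    public static void main(String[] args) {
        QueryClickJoin join = new QueryClickJoin();
        join.onClick("q2", "b.org");
        join.onQuery("q1", "cake");
        join.onQuery("q2", "pie");
        join.onClick("q1", "a.com");
        System.out.println(join.joined());
    }
}
```

The join never looks at a timestamp, so arrival order does not matter. The cost is that unmatched records stay buffered indefinitely in this sketch.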

slide-32
SLIDE 32
  • 2. Approximation via Online Algorithms

[Diagram: records along a stream time axis, 10:00 to 16:00]

Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact, Complicated Algorithms
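The talk does not name a specific online algorithm. As one illustration, the sketch below uses the Misra-Gries frequent-items summary, which keeps a fixed number of counters regardless of stream length. It tracks hashtags overall rather than per prefix, and the class and method names are invented for this example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

final class FrequentHashtags {
    private final int capacity;   // maximum number of counters kept at any time
    private final Map<String, Integer> counters = new HashMap<>();

    FrequentHashtags(int capacity) {
        this.capacity = capacity;
    }

    void observe(String tag) {
        Integer count = counters.get(tag);
        if (count != null) {
            counters.put(tag, count + 1);
        } else if (counters.size() < capacity) {
            counters.put(tag, 1);
        } else {
            // No free counter: decrement every counter and drop those that reach zero.
            Iterator<Map.Entry<String, Integer>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    // Surviving tags are candidates for "frequent"; their counts are lower bounds.
    Map<String, Integer> candidates() {
        return counters;
    }

    public static void main(String[] args) {
        FrequentHashtags top = new FrequentHashtags(2);
        for (String tag : new String[] {"#a", "#b", "#a", "#c", "#a"}) {
            top.observe(tag);
        }
        System.out.println(top.candidates());
    }
}
```

Memory stays bounded by the capacity, which is the "Efficient" pro. The counts are underestimates, which is the "Inexact" con.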

slide-33
SLIDE 33

11:00 10:00 16:00 15:00 14:00 13:00 12:00

Stream Time

Web server request traffic Per-minute rate of received requests Straightforward Results reflect contents of stream Results don’t reflect events as they happened If approximating event time, usefulness varies Example Input: Example Output: Pros: Cons:

  • 3. Windowing by Stream Time
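A small sketch of stream time windowing, assuming arrival timestamps are non-negative milliseconds (an assumption of this example). Requests are counted in the minute in which the pipeline received them, whatever their event time was.

```java
import java.util.Map;
import java.util.TreeMap;

final class StreamTimeRate {
    static final long MINUTE_MS = 60_000L;
    private final Map<Long, Integer> countsByMinute = new TreeMap<>();

    // arrivalMillis is when the pipeline saw the request, not when it happened.
    void onRequest(long arrivalMillis) {
        long windowStart = arrivalMillis - (arrivalMillis % MINUTE_MS);
        countsByMinute.merge(windowStart, 1, Integer::sum);
    }

    // Window start (ms) -> number of requests received in that minute.
    Map<Long, Integer> countsByMinute() {
        return countsByMinute;
    }

    public static void main(String[] args) {
        StreamTimeRate rate = new StreamTimeRate();
        rate.onRequest(1_000L);
        rate.onRequest(59_999L);
        rate.onRequest(60_000L);
        System.out.println(rate.countsByMinute());
    }
}
```

Because a window closes as soon as its stream time minute ends, no buffering for stragglers is needed. A request delayed upstream is simply counted in a later minute.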
slide-34
SLIDE 34

  • 4. Windowing by Event Time - Fixed Windows

[Diagram: records along a stream time axis regrouped into fixed windows along an event time axis, 10:00 to 16:00]

Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering, Completeness issues
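A simplified sketch of event time fixed windows: hashtags are counted per one-hour window chosen by when they were posted, not when they arrived. It omits the top N and per-prefix parts of the example, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class EventTimeHashtagCounts {
    static final long HOUR_MS = 3_600_000L;
    private final Map<Long, Map<String, Integer>> countsByWindow = new TreeMap<>();

    // eventMillis is when the hashtag was posted; arrival order is irrelevant.
    void observe(long eventMillis, String tag) {
        long windowStart = Math.floorDiv(eventMillis, HOUR_MS) * HOUR_MS;
        countsByWindow
                .computeIfAbsent(windowStart, w -> new HashMap<>())
                .merge(tag, 1, Integer::sum);
    }

    int count(long windowStart, String tag) {
        Map<String, Integer> window = countsByWindow.get(windowStart);
        return window == null ? 0 : window.getOrDefault(tag, 0);
    }

    public static void main(String[] args) {
        EventTimeHashtagCounts counts = new EventTimeHashtagCounts();
        counts.observe(3_700_000L, "#cake");
        counts.observe(100L, "#cake");        // arrives second but belongs to the first hour
        counts.observe(3_601_000L, "#cake");
        System.out.println(counts.count(0L, "#cake") + " then " + counts.count(3_600_000L, "#cake"));
    }
}
```

Every window has to stay open for as long as older records might still arrive. That is the buffering and completeness cost listed under Cons.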
slide-35
SLIDE 35

  • 4. Windowing by Event Time - Sessions

[Diagram: records along a stream time axis regrouped into per-user sessions along an event time axis, 10:00 to 16:00]

Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering, Completeness issues
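Sessions can be sketched as runs of one user's activity separated by a gap of inactivity. The example below processes one user's event times in a single pass after sorting. Treating a gap of gapMillis or more as a session boundary is an assumption of this sketch, not something the slides define.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class Sessionizer {
    // Splits one user's event times into sessions: a new session starts whenever
    // the gap since the previous event is at least gapMillis.
    static List<List<Long>> sessions(List<Long> eventMillis, long gapMillis) {
        List<Long> sorted = new ArrayList<>(eventMillis);
        Collections.sort(sorted);   // arrival order may differ from event order
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sorted) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) >= gapMillis) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> times = Arrays.asList(0L, 130_000L, 30_000L, 120_000L);
        System.out.println(sessions(times, 60_000L));
    }
}
```

With a 60 second gap the input above yields two sessions. Adding one more event at 75,000 ms would bridge them into a single session, which shows how one late record can change results that were already computed.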
slide-36
SLIDE 36

Dataflow API

slide-37
SLIDE 37

  • What are you computing?
  • Where in event time?
  • When in stream time?

slide-38
SLIDE 38

  • What = Aggregation API
  • Where = Windowing API
  • When = Watermarks + Triggers API

slide-39
SLIDE 39

Aggregation API

PCollection<KV<String, Double>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(new Sum());
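The Dataflow snippet above targets an API that was unreleased at the time of the talk, so it cannot be run as written. As a plain-JDK stand-in, the sketch below shows the kind of result a per-key sum produces. It assumes that Sum combines values per key, and the PerKeySum class is invented for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

final class PerKeySum {
    private final Map<String, Double> sums = new TreeMap<>();

    // Folds one (key, value) record into the running total for its key.
    void add(String key, double value) {
        sums.merge(key, value, Double::sum);
    }

    Map<String, Double> sums() {
        return sums;
    }

    public static void main(String[] args) {
        PerKeySum perKey = new PerKeySum();
        perKey.add("joan", 2.0);
        perKey.add("larry", 4.0);
        perKey.add("joan", 7.0);
        System.out.println(perKey.sums());
    }
}
```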

slide-40
SLIDE 40

Aggregation API

[Diagram: example input values reduced by Sum]

slide-41
SLIDE 41

Streaming Mode

[Diagram: the example values plotted against Stream Time and against Event Time, 10:00 to 10:06]

slide-42
SLIDE 42

Windowing API

PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE)))
    .apply(new Sum());

slide-43
SLIDE 43

Windowing API

[Diagram: the example values on Stream Time and Event Time axes, 10:00 to 10:06, assigned to FixedWindows and reduced by Sum per window]

slide-44
SLIDE 44

Watermarks

  • f(S) -> E
  • S = a point in stream time (i.e. now)
  • E = the point in event time up to which input data is complete as of S
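To make f(S) -> E concrete, the sketch below uses a simple heuristic that the talk does not prescribe: the watermark trails the largest event time seen so far by an assumed maximum skew. The class name and the assumedMaxSkewMillis parameter are hypothetical, and real systems may derive watermarks quite differently.

```java
final class HeuristicWatermark {
    private final long assumedMaxSkewMillis;   // assumption: an upper bound on event time skew
    private long maxEventMillisSeen = Long.MIN_VALUE;

    HeuristicWatermark(long assumedMaxSkewMillis) {
        this.assumedMaxSkewMillis = assumedMaxSkewMillis;
    }

    // Returns true if the record is late, i.e. older than the current watermark.
    boolean observe(long eventMillis) {
        boolean late = eventMillis < current();
        maxEventMillisSeen = Math.max(maxEventMillisSeen, eventMillis);
        return late;
    }

    // E as of now: the event time up to which input is believed to be complete.
    long current() {
        return maxEventMillisSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventMillisSeen - assumedMaxSkewMillis;
    }

    public static void main(String[] args) {
        HeuristicWatermark watermark = new HeuristicWatermark(60_000L);
        System.out.println(watermark.observe(600_000L));   // first record, not late
        System.out.println(watermark.observe(550_000L));   // within the assumed skew
        System.out.println(watermark.observe(500_000L));   // older than the watermark
    }
}
```

The assumed skew is the tuning knob. A large value holds the watermark back and delays results, while a small value lets it race ahead so that more records are classified as late.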

slide-45
SLIDE 45

Event Time Skew

[Plot axes: Stream Time, Event Time]

slide-46
SLIDE 46

Watermarks

[Diagram: FixedWindows and Sum over the example values on Stream Time and Event Time axes, with the watermark overlaid]

slide-47
SLIDE 47

Watermark Caveats

  • Too slow = more latency
  • Too fast = late data

slide-48
SLIDE 48

Triggers

When in stream time to emit?

slide-49
SLIDE 49

Triggers API

PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new AtWatermark()))
    .apply(new Sum());

slide-50
SLIDE 50

[Diagram: per-window sums emitted at the watermark, on Stream Time and Event Time axes; one value arrives after the watermark and is marked "Late datum"]

slide-51
SLIDE 51

A Better Strategy

  • 1. Once per stream time minute
  • 2. At watermark
  • 3. Once per record for two weeks
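The three-part strategy above can be simulated for a single window with a few lines of plain Java. This is only a sketch: stream time is reduced to integer minutes, the watermark is assumed to pass the end of the window at a given minute, early firings happen only in minutes that received data, and the two-week cutoff for late data is left out. All names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TriggerSketch {
    // records[i] = {streamMinute, value}, sorted by streamMinute, all for one window.
    static List<String> emissions(int[][] records, int watermarkMinute) {
        List<String> out = new ArrayList<>();
        int sum = 0;
        int i = 0;
        // 1. Early firings: at most one speculative sum per stream time minute.
        while (i < records.length && records[i][0] < watermarkMinute) {
            int minute = records[i][0];
            while (i < records.length && records[i][0] == minute) {
                sum += records[i][1];
                i++;
            }
            out.add("early@" + minute + "=" + sum);
        }
        // 2. On-time firing when the watermark passes the end of the window.
        out.add("on-time@" + watermarkMinute + "=" + sum);
        // 3. Late firings: one updated sum per late record.
        for (; i < records.length; i++) {
            sum += records[i][1];
            out.add("late@" + records[i][0] + "=" + sum);
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] records = {{0, 5}, {0, 2}, {1, 6}, {4, 7}};
        System.out.println(emissions(records, 3));
    }
}
```

Running main prints [early@0=7, early@1=13, on-time@3=13, late@4=20]: two speculative results, the result at the watermark, and a correction once the late record arrives.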
slide-52
SLIDE 52

[Diagram: the same example on Stream Time and Event Time axes with the three-part strategy applied, including the value marked "Late datum"]

slide-53
SLIDE 53

Triggers API

PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new SequenceOf(
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());

slide-54
SLIDE 54

Lambda vs Streaming

  • Low-latency, approximate results
  • Complete, correct results as soon as possible
  • Ability to deal with changes upstream

slide-55
SLIDE 55

One Last Thing...

What if I want sessions?

slide-56
SLIDE 56

Triggers API

PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new Sessions(1, MINUTE))
        .trigger(new SequenceOf(
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());

slide-57
SLIDE 57

[Diagram: session windows built with a 1 minute gap, on Stream Time and Event Time axes, including a value marked "Late datum"]

slide-58
SLIDE 58

Summary

  • Lambda is great
  • Streaming by itself is better :-)
  • Strong Consistency = Correctness
  • Streaming = Aggregation + Windowing + Triggers
  • Tools For Reasoning About Time = Power + Flexibility

slide-59
SLIDE 59

Thank you!

Questions?

Questions about this talk: takidau@google.com (Tyler Akidau)
Questions about Cloud Dataflow: cloude@google.com (Eric Schmidt)