SLIDE 1 Have Your Cake & Eat It Too
Further Dispelling the Myths of the Lambda Architecture
Tyler Akidau, Staff Software Engineer
Google Docs version of slides (with animations) available at: http://goo.gl/eX5kxa
SLIDE 2
- MillWheel: Stream Processing System
- Streaming Flume: High-level API
- Cloud Dataflow: Data Processing Service
SLIDE 3 Google Cloud Dataflow
[Diagram: Optimize and Schedule stages, with GCS at both ends]
SLIDE 4
- MillWheel: Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
- Streaming Flume: Robert Bradshaw, Daniel Mills, and more...
- Cloud Dataflow: Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
SLIDE 5
Cloud Dataflow is unreleased. Things may change.
SLIDE 6 Agenda
- 1. Lambda vs Streaming
- 2. Strong Consistency
- 3. Reasoning About Time
SLIDE 7 1. Lambda vs Streaming
SLIDE 8 http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
SLIDE 9
The Lambda Architecture
SLIDE 10 http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
SLIDE 11
The Evolution of Streaming
SLIDE 12
What does it take?
- Strong Consistency
- Tools for Reasoning About Time
SLIDE 13 2. Strong Consistency
SLIDE 14
Consistent Storage
Storage
SLIDE 15
Why consistency is important
- Mostly correct is not good enough
- Required for exactly-once processing
- Required for repeatable results
- Cannot replace batch without it
SLIDE 16
How?
- Sequencers (e.g. BigTable)
- Leases (e.g. Spanner)
- Federation of storage silos (e.g. Samza, Dataflow)
SLIDE 17 http://research.google.com/pubs/pub41378.html
SLIDE 18 3. Reasoning About Time
SLIDE 19
- Event Time vs Stream Time
- Batch vs Streaming
- Approaches
- Dataflow API
SLIDE 20
Event Time - When Events Happened
Stream Time - When Events Are Processed
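The difference between the two clocks can be made concrete with a minimal sketch. The class and method names are hypothetical and not part of any Dataflow API; the only assumption is that each record carries the time it happened and that the system notes when it processes the record.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not part of any Dataflow API.
final class EventTimeSkew {
    // Skew is how far processing (stream time) lags behind
    // the moment the event actually happened (event time).
    static Duration skew(Instant eventTime, Instant streamTime) {
        return Duration.between(eventTime, streamTime);
    }

    public static void main(String[] args) {
        Instant happened = Instant.parse("2015-01-01T10:00:00Z");
        Instant processed = Instant.parse("2015-01-01T10:03:00Z");
        System.out.println("skew = " + skew(happened, processed));
    }
}
```

In a real stream this skew is not constant: it varies per record, which is why windowing by stream time and windowing by event time can give different answers.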
SLIDE 21
Batch vs Streaming
SLIDE 23 Batch: Fixed Windows
[Diagram: MapReduce over input divided into fixed one-hour windows, [10:00 - 11:00) through [23:00 - 0:00)]
SLIDE 24 Batch: User Sessions
[Diagram: MapReduce over windows [10:00 - 11:00) and [11:00 - 12:00), with sessions for users Joan, Larry, Ingo, Amanda, Cheryl, and Arthur]
SLIDE 25 Streaming
[Diagram: continuous timeline, 10:00 to 16:00]
SLIDE 26
Confounding characteristics of data streams
- Unordered
- Unbounded
- Of Varying Event Time Skew
SLIDE 27 Event Time Skew
[Plot: Stream Time and Event Time axes, with the gap between them labeled Skew]
SLIDE 28
Approaches
SLIDE 29
Approaches to reasoning about time
- 1. Time-Agnostic Processing
- 2. Approximation
- 3. Stream Time Windowing
- 4. Event Time Windowing
SLIDE 30
- 1. Time-Agnostic Processing - Filters
[Diagram: stream time axis, 10:00 to 16:00]
Example Input: Web server traffic logs
Example Output: All traffic from specific domains
Pros: Straightforward, Efficient
Cons: Limited utility
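A filter needs no notion of time at all: each record is kept or dropped on its own. The sketch below assumes a made-up log format in which the domain is the first whitespace-separated field; the class name and that format are illustrative assumptions, not taken from the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example; the "domain path" log format is an assumption.
final class DomainFilter {
    // Keeps only the log lines whose first field equals the given domain.
    // No timestamps are consulted, so arrival order and skew are irrelevant.
    static List<String> fromDomain(List<String> logLines, String domain) {
        return logLines.stream()
            .filter(line -> line.split("\\s+")[0].equals(domain))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> logs = Arrays.asList(
            "example.com /index", "other.org /about", "example.com /search");
        System.out.println(fromDomain(logs, "example.com"));
    }
}
```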
SLIDE 31
- 1. Time-Agnostic Processing - Hash Join
[Diagram: stream time axis, 10:00 to 16:00]
Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs
Pros: Straightforward, Efficient
Cons: Limited utility
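One way to picture a time-agnostic hash join: buffer whichever side arrives first, keyed by a shared id, and emit a pair as soon as the other side shows up. This is a minimal sketch under the assumptions that queries and clicks share an id and that each query has at most one click; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; assumes one click per query id.
final class StreamingHashJoin {
    private final Map<String, String> pendingQueries = new HashMap<>();
    private final Map<String, String> pendingClicks = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    // Called for each query record, in whatever order it arrives.
    void onQuery(String id, String query) {
        String click = pendingClicks.remove(id);
        if (click != null) {
            joined.add(query + " + " + click);
        } else {
            pendingQueries.put(id, query);  // wait for the matching click
        }
    }

    // Called for each click record, in whatever order it arrives.
    void onClick(String id, String click) {
        String query = pendingQueries.remove(id);
        if (query != null) {
            joined.add(query + " + " + click);
        } else {
            pendingClicks.put(id, click);   // wait for the matching query
        }
    }

    List<String> joined() {
        return joined;
    }
}
```

Unmatched records stay buffered indefinitely in this sketch; a production join would need some policy for expiring them.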
SLIDE 32
- 2. Approximation via Online Algorithms
[Diagram: stream time axis, 10:00 to 16:00]
Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix
Pros: Efficient
Cons: Inexact, Complicated Algorithms
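As an illustration of the kind of online algorithm meant here, the sketch below uses the Misra-Gries frequent-items summary. This is a swapped-in, well-known technique chosen for brevity, not necessarily the algorithm behind the slide, and it tracks frequent items overall rather than per prefix. The class name and the `capacity` parameter are assumptions.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Misra-Gries summary: bounded memory, approximate counts.
// Illustrative only; not the talk's per-prefix top-N implementation.
final class HeavyHitters {
    private final int capacity;  // maximum number of counters kept
    private final Map<String, Integer> counters = new HashMap<>();

    HeavyHitters(int capacity) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be >= 1");
        }
        this.capacity = capacity;
    }

    void offer(String item) {
        Integer count = counters.get(item);
        if (count != null) {
            counters.put(item, count + 1);
        } else if (counters.size() < capacity) {
            counters.put(item, 1);
        } else {
            // No free counter: decrement all, dropping those that reach zero.
            Iterator<Map.Entry<String, Integer>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Integer> e = it.next();
                if (e.getValue() == 1) {
                    it.remove();
                } else {
                    e.setValue(e.getValue() - 1);
                }
            }
        }
    }

    // Any item seen more than n / (capacity + 1) times is guaranteed to be
    // present; the stored counts are underestimates of the true counts.
    Map<String, Integer> candidates() {
        return counters;
    }
}
```

The trade-off matches the slide: memory stays fixed no matter how long the stream runs, but the counts are inexact.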
SLIDE 33
- 3. Windowing by Stream Time
[Diagram: stream time axis, 10:00 to 16:00]
Example Input: Web server request traffic
Example Output: Per-minute rate of received requests
Pros: Straightforward; results reflect contents of stream
Cons: Results don’t reflect events as they happened; if approximating event time, usefulness varies
SLIDE 34
- 4. Windowing by Event Time - Fixed Windows
[Diagram: records shuffled from their stream time positions into fixed event time windows, both axes 10:00 to 16:00]
Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour
Pros: Reflects events as they occurred
Cons: More complicated buffering; completeness issues
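Assigning a record to a fixed event time window is simple arithmetic on its timestamp, regardless of when the record arrives. A minimal sketch, assuming timestamps in milliseconds and windows aligned to the epoch (both are assumptions, as are the names):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of fixed windowing by event time.
final class FixedWindowSum {
    // Start of the window [start, start + size) containing the timestamp.
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, sizeMillis);
    }

    // events: each row is {eventTimeMillis, value}, in arrival order.
    // Arrival order does not affect the result.
    static Map<Long, Long> sumPerWindow(long[][] events, long sizeMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] e : events) {
            sums.merge(windowStart(e[0], sizeMillis), e[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60 * 1000L;
        long[][] events = {{130_000L, 4}, {10_000L, 2}, {119_999L, 3}};
        System.out.println(sumPerWindow(events, twoMinutes));
    }
}
```

What this sketch leaves out is exactly what the Cons line names: in a stream, the system must buffer partial sums and decide when a window is complete enough to emit.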
SLIDE 35
- 4. Windowing by Event Time - Sessions
[Diagram: records shuffled from their stream time positions into per-user session windows in event time, both axes 10:00 to 16:00]
Example Input: User activity stream
Example Output: Per-session group of activities
Pros: Reflects events as they occurred
Cons: More complicated buffering; completeness issues
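Sessions are windows defined by gaps in activity rather than by fixed boundaries. The sketch below groups one user's event times into sessions once all of them are known, which is the batch view; a streaming system would have to merge sessions incrementally as out-of-order records arrive. The names and the millisecond units are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical batch-style sessionization for a single user.
final class Sessionizer {
    // A new session starts whenever the gap since the previous
    // event (in event time) exceeds gapMillis.
    static List<List<Long>> sessions(long[] eventTimesMillis, long gapMillis) {
        long[] sorted = eventTimesMillis.clone();
        Arrays.sort(sorted);  // input may arrive out of order
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = null;
        for (int i = 0; i < sorted.length; i++) {
            if (current == null || sorted[i] - sorted[i - 1] > gapMillis) {
                current = new ArrayList<>();
                result.add(current);
            }
            current.add(sorted[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] times = {300_000L, 0L, 30_000L, 330_000L};
        System.out.println(sessions(times, 60_000L));
    }
}
```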
SLIDE 36
Dataflow API
SLIDE 37
What are you computing?
Where in event time?
When in stream time?
SLIDE 38
What = Aggregation API
Where = Windowing API
When = Watermarks + Triggers API
SLIDE 39
Aggregation API
PCollection<KV<String, Double>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(new Sum());
SLIDE 40 Aggregation API
[Diagram: input values combined by Sum into totals]
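What a per-key sum computes can be shown with plain Java collections. This is not the Dataflow API; it is a small in-memory sketch, with made-up keys and values, of the result that a keyed Sum would produce over a bounded input.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory illustration of a per-key sum; not the Dataflow API.
final class SumPerKey {
    static Map<String, Double> sum(List<Map.Entry<String, Double>> records) {
        return records.stream().collect(
            Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingDouble(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // Keys and values are invented for the example.
        List<Map.Entry<String, Double>> records = Arrays.asList(
            new SimpleEntry<String, Double>("joan", 2.0),
            new SimpleEntry<String, Double>("larry", 4.0),
            new SimpleEntry<String, Double>("joan", 7.0));
        System.out.println(sum(records));
    }
}
```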
SLIDE 41 Streaming Mode
[Diagram: the same input values plotted by Stream Time and by Event Time, 10:00 to 10:06]
SLIDE 42
Windowing API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE)))
    .apply(new Sum());
SLIDE 43 Windowing API
[Diagram: input values arriving in Stream Time are assigned by FixedWindows to two-minute Event Time windows, 10:00 to 10:06, and Sum produces one total per window]
SLIDE 44 Watermarks
- f(S) -> E
- S = a point in stream time (i.e. now)
- E = the point in event time up to which input data is complete as of S
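One simple way to approximate f(S) -> E is a heuristic: take the largest event time seen so far and subtract an assumed bound on skew. The sketch below is exactly that heuristic and nothing more; the class name, the `maxSkewMillis` parameter, and the strategy itself are assumptions for illustration, not how any particular system computes its watermark.

```java
// Hypothetical heuristic watermark: max event time seen minus assumed skew.
final class HeuristicWatermark {
    private final long maxSkewMillis;  // assumed bound on event time skew
    private long maxEventTimeSeen = Long.MIN_VALUE;

    HeuristicWatermark(long maxSkewMillis) {
        this.maxSkewMillis = maxSkewMillis;
    }

    // Records an arriving event. Returns true if the event is late,
    // i.e. its event time is already behind the current watermark.
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return late;
    }

    // E for "now": input is assumed complete up to this event time.
    long current() {
        return maxEventTimeSeen == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTimeSeen - maxSkewMillis;
    }
}
```

Choosing the skew bound is the whole difficulty: a large bound delays every result, while a small one lets the watermark pass records that have not arrived yet.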
SLIDE 45
Event Time Skew
[Plot: Stream Time and Event Time axes]
SLIDE 46 Watermarks
[Diagram: FixedWindows and Sum applied to input plotted by Stream Time and Event Time, 10:00 to 10:06, with the watermark overlaid]
SLIDE 47
Watermark Caveats
Too slow = more latency
Too fast = late data
SLIDE 48
Triggers
When in stream time to emit?
SLIDE 49
Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new AtWatermark()))
    .apply(new Sum());
SLIDE 50
[Diagram: Event Time vs Stream Time plot of two-minute window sums emitted at the watermark, with one record marked "Late datum"]
SLIDE 51 A Better Strategy
- 1. Once per stream time minute
- 2. At watermark
- 3. Once per record for two weeks
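The three steps can be read as phases of a window's life, selected by where the watermark sits relative to the end of the window. The sketch below only classifies the phase; it does not emit anything. The names are hypothetical, and measuring the two-week horizon against the watermark is an assumption.

```java
// Hypothetical classification of a window's trigger phase.
final class TriggerPhases {
    enum Phase {
        EARLY,   // watermark before window end: emit once per stream time minute
        LATE,    // watermark passed window end: emit once per late record
        CLOSED   // lateness horizon passed: stop emitting, state can be dropped
    }

    // The "at watermark" firing happens at the EARLY -> LATE transition.
    static Phase phase(long watermarkMillis, long windowEndMillis, long latenessMillis) {
        if (watermarkMillis < windowEndMillis) {
            return Phase.EARLY;
        }
        if (watermarkMillis < windowEndMillis + latenessMillis) {
            return Phase.LATE;
        }
        return Phase.CLOSED;
    }

    public static void main(String[] args) {
        long twoWeeks = 14L * 24 * 60 * 60 * 1000;
        System.out.println(phase(100_000L, 120_000L, twoWeeks));
        System.out.println(phase(130_000L, 120_000L, twoWeeks));
    }
}
```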
SLIDE 52
[Diagram: Event Time vs Stream Time plot of two-minute window sums emitted early, at the watermark, and again after a record marked "Late datum"]
SLIDE 53 Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new FixedWindows(2, MINUTE))
        .trigger(new SequenceOf(
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
SLIDE 54
Lambda vs Streaming
- Low-latency, approximate results
- Complete, correct results as soon as possible
- Ability to deal with changes upstream
SLIDE 55
One Last Thing...
What if I want sessions?
SLIDE 56 Triggers API
PCollection<KV<String, Long>> sums = Pipeline
    .begin()
    .read("userRequests")
    .apply(Window.into(new Sessions(1, MINUTE))
        .trigger(new SequenceOf(
            new RepeatUntil(
                new AtPeriod(1, MINUTE),
                new AtWatermark()),
            new AtWatermark(),
            new RepeatUntil(
                new AfterCount(1),
                new AfterDelay(14, DAYS, TimeDomain.EVENT_TIME)))))
    .apply(new Sum());
SLIDE 57
[Diagram: Event Time vs Stream Time plot of session windows with 1 minute gaps, with sums updated as sessions grow and merge and one record marked "Late datum"]
SLIDE 58
Summary
- Lambda is great
- Streaming by itself is better :-)
- Strong Consistency = Correctness
- Streaming = Aggregation + Windowing + Triggers
- Tools For Reasoning About Time = Power + Flexibility
SLIDE 59
Thank you!
Questions?
Questions about this talk: takidau@google.com (Tyler Akidau)
Questions about Cloud Dataflow: cloude@google.com (Eric Schmidt)