Foundations of streaming SQL or: how I learned to love stream & - - PowerPoint PPT Presentation

foundations of streaming sql
SMART_READER_LITE
LIVE PREVIEW

Foundations of streaming SQL or: how I learned to love stream & - - PowerPoint PPT Presentation

Foundations of streaming SQL or: how I learned to love stream & table theory Slides: https://s.apache.org/streaming-sql-qcon-london Tyler Akidau Apache Beam PMC Software Engineer at Google @takidau Covering ideas from across the Apache


slide-1
SLIDE 1

1

Foundations of streaming SQL

  • r: how I learned to love stream & table theory

Slides: https://s.apache.org/streaming-sql-qcon-london

Tyler Akidau Apache Beam PMC Software Engineer at Google @takidau Covering ideas from across the Apache Beam, Apache Calcite, Apache Kafka, and Apache Flink communities, with thoughts and contributions from Julian Hyde, Fabian Hueske, Shaoxuan Wang, Kenn Knowles, Ben Chambers, Reuven Lax, Mingmin Xu, James Xu, Martin Kleppmann, Jay Kreps and many more, not to mention that whole database community thing... QCon London 2018

slide-2
SLIDE 2

2

Table of Contents

01 02 Stream & Table Theory A Basics B The Beam Model Streaming SQL A Time-varying relations B SQL language extensions Chapter 7 Chapter 9

slide-3
SLIDE 3

3

01 Stream & Table Theory

TFW you realize everything you do was invented by the database community decades ago...

A Basics B The Beam Model

slide-4
SLIDE 4

4

Stream & table basics

https://www.confluent.io/blog/making-sense-of-stream-processing/ https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/

slide-5
SLIDE 5

5

Special theory of stream & table relativity

streams → tables: tables → streams: The aggregation of a stream of updates over time yields a table. The observation of changes to a table over time yields a stream.

slide-6
SLIDE 6

6

Non-relativistic stream & table definitions

Tables are data at rest. Streams are data in motion.

slide-7
SLIDE 7

7

01 Stream & Table Theory

TFW you realize everything you do was invented by the database community decades ago...

A Basics B The Beam Model

slide-8
SLIDE 8

8

The Beam Model What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?

slide-9
SLIDE 9

9

Reconciling streams & tables w/ the Beam Model

  • How does batch processing fit into all of this?
  • What is the relationship of streams to bounded and

unbounded datasets?

  • How do the four what, where, when, how questions map
  • nto a streams/tables world?
slide-10
SLIDE 10

10

Reduce Map

MapReduce

input

  • utput
slide-11
SLIDE 11

11

MapReduce

input

  • utput

MapRead Map MapWrite ReduceRead Reduce ReduceWrite

slide-12
SLIDE 12

12

MapReduce

MapRead Map MapWrite ReduceRead Reduce ReduceWrite ? ? ? ? ? ? ?

slide-13
SLIDE 13

13

MapReduce

MapRead Map MapWrite ReduceRead Reduce ReduceWrite ? ? ? ? ?

table table

slide-14
SLIDE 14

14

Map phase

MapRead Map MapWrite

table

? ? ?

slide-15
SLIDE 15

15

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);

slide-16
SLIDE 16

16

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);

slide-17
SLIDE 17

17

Map phase

MapRead Map MapWrite

table stream

? ?

slide-18
SLIDE 18

18

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>);

slide-19
SLIDE 19

19

Map phase

MapRead Map MapWrite

table stream stream

?

slide-20
SLIDE 20

20

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>); void reduce(K2 key, Iterable<V2> value, Emit<V3>);

slide-21
SLIDE 21

21

Map phase

MapRead Map MapWrite

table stream stream table

slide-22
SLIDE 22

22

MapReduce

MapRead Map MapWrite ReduceRead Reduce ReduceWrite

table stream stream table

? ?

table

slide-23
SLIDE 23

23

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>); void reduce(K2 key, Iterable<V2> value, Emit<V3>);

slide-24
SLIDE 24

24

Map phase API

void map(K1 key, V1 value, Emit<K2, V2>); void reduce(K2 key, Iterable<V2> value, Emit<V3>);

slide-25
SLIDE 25

25

table

MapReduce

MapRead Map MapWrite ReduceRead Reduce ReduceWrite

table stream stream table stream stream

slide-26
SLIDE 26

26

Reconciling streams & tables w/ the Beam Model

  • How does batch processing fit into all of this?
  • What is the relationship of streams to bounded and

unbounded datasets?

  • How do the four what, where, when, how questions map
  • nto a streams/tables world?
  • 1. Tables are read into streams.
  • 2. Streams are processed into new streams until a

grouping operation is hit.

  • 3. Grouping turns the stream into a table.
  • 4. Repeat steps 1-3 until you run out of operations.
slide-27
SLIDE 27

27

Reconciling streams & tables w/ the Beam Model

  • How does batch processing fit into all of this?
  • What is the relationship of streams to bounded and

unbounded datasets?

  • How do the four what, where, when, how questions map
  • nto a streams/tables world?

Streams are the in-motion form of data both bounded and unbounded.

slide-28
SLIDE 28

28

Reconciling streams & tables w/ the Beam Model

  • How does batch processing fit into all of this?
  • What is the relationship of streams to bounded and

unbounded datasets?

  • How do the four what, where, when, how questions map
  • nto a streams/tables world?
slide-29
SLIDE 29

29

The Beam Model What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?

slide-30
SLIDE 30

30

The Beam Model What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?

slide-31
SLIDE 31

31

Example data: individual user scores

slide-32
SLIDE 32

32

What is calculated?

PCollection<KV<Team, Score>> input = IO.read(...) .apply(ParDo.of(new ParseFn()); .apply(Sum.integersPerKey());

slide-33
SLIDE 33

What is calculated?

slide-34
SLIDE 34

34

The Beam Model What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?

slide-35
SLIDE 35

35

Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. Fixed Sliding

1 2 3 5 4

Sessions

2 4 3 1 Key 2 Key 1 Key 3

Time

1 2 3 4

Where in event time?

slide-36
SLIDE 36

36

Where in event time?

PCollection<KV<User, Score>> input = IO.read(...) .apply(ParDo.of(new ParseFn()); .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))) .apply(Sum.integersPerKey());

slide-37
SLIDE 37

Where in event time?

slide-38
SLIDE 38

38

The Beam Model What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?

slide-39
SLIDE 39

39

  • Triggers control

when results are emitted.

  • Triggers are often

relative to the

watermark.

Processing Time Event Time

~Watermark Ideal Skew

When in processing time?

slide-40
SLIDE 40

40

When in processing time?

PCollection<KV<User, Score>> input = IO.read(...) .apply(ParDo.of(new ParseFn()); .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark()) .apply(Sum.integersPerKey());

slide-41
SLIDE 41

When in processing time?

slide-42
SLIDE 42

42

The Beam Model: asking the right questions What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?

slide-43
SLIDE 43

43

How do refinements relate?

PCollection<KV<User, Score>> input = IO.read(...) .apply(ParDo.of(new ParseFn()); .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark().withLateFirings(AtCount(1))) .accumulatingFiredPanes()) .apply(Sum.integersPerKey());

slide-44
SLIDE 44

How do refinements relate?

slide-45
SLIDE 45

45

What/Where/When/How Summary

  • 3. Streaming
  • 4. Streaming + Late Data Handling
  • 1. Classic Batch
  • 2. Windowed Batch
slide-46
SLIDE 46

46

Reconciling streams & tables w/ the Beam Model

  • How does batch processing fit into all of this?
  • What is the relationship of streams to bounded and

unbounded datasets?

  • How do the four what, where, when, how questions map
  • nto a streams/tables world?
slide-47
SLIDE 47

47

General theory of stream & table relativity

Pipelines : tables + streams + operations Tables : data at rest Streams : data in motion Operations : (stream | table) → (stream | table) transformations

  • stream → stream: Non-grouping (element-wise) operations

Leaves stream data in motion, yielding another stream.

  • stream → table: Grouping operations

Brings stream data to rest, yielding a table. Windowing adds the dimension of time to grouping.

  • table → stream: Ungrouping (triggering) operations

Puts table data into motion, yielding a stream. Accumulation dictates the nature of the stream (deltas, values, retractions).

  • table → table: (none)

Impossible to go from rest and back to rest without being put into motion.

slide-48
SLIDE 48

48

02 Streaming SQL

Contorting relational algebra for fun and profit

A Time-varying relations B SQL language extensions

slide-49
SLIDE 49

49

Relational algebra

User Score Time Julie 7 12:01 Frank 3 12:03 Julie 1 12:03 Julie 4 12:07 Score Time 7 12:01 3 12:03 1 12:03 4 12:07

πScore,Time(UserScores) πUserScoresπ

SELECT Score, Time FROM UserScores;

  • | Score | Time |
  • | 7 | 12:01 |

| 3 | 12:03 | | 1 | 12:03 | | 4 | 12:07 |

  • Relational algebra

SQL Relation

slide-50
SLIDE 50

50

Relations evolve over time

12:07> SELECT * FROM UserScores;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 1 | 12:03 | | Julie | 4 | 12:07 |

  • 12:03> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 1 | 12:03 |

  • 12:00> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • 12:01> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
slide-51
SLIDE 51

51

Classic SQL vs Streaming SQL

Classic SQL classic relations single point in time :: :: Streaming SQL time-varying relations every point in time :: ::

slide-52
SLIDE 52

52

Classic SQL vs Streaming SQL

Classic SQL classic relations single point in time :: :: Streaming SQL time-varying relations every point in time :: ::

slide-53
SLIDE 53

53

Classic relations

12:07> SELECT * FROM UserScores;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | | | | | | | | | | | | Julie | 1 | 12:03 | | | Julie | 1 | 12:03 | | | | | | | | | | | | | | | | | | | Julie | 4 | 12:07 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:07> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 1 | 12:03 | | Julie | 4 | 12:07 |

  • 12:03> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 1 | 12:03 |

  • 12:00> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • 12:01> SELECT * FROM

UserScores;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
  • Time-varying relation
slide-54
SLIDE 54

54

Closure property of relational algebra remains intact with time-varying relations.

slide-55
SLIDE 55

55

Time-varying relations: variations

12:07> SELECT * FROM UserScores;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | | | | | | | | | | | | Julie | 1 | 12:03 | | | Julie | 1 | 12:03 | | | | | | | | | | | | | | | | | | | Julie | 4 | 12:07 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

slide-56
SLIDE 56

56

Time-varying relations: filtering

12:07> SELECT * FROM UserScores;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | | | | | | | | | | | | Julie | 1 | 12:03 | | | Julie | 1 | 12:03 | | | | | | | | | | | | | | | | | | | Julie | 4 | 12:07 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:07> SELECT * FROM UserScores WHERE Name = “Julie”;
  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | | | | | | | | | | | | Julie | 1 | 12:03 | | | Julie | 1 | 12:03 | | | | | | | | | | | | | | | | | | | Julie | 4 | 12:07 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

slide-57
SLIDE 57

57

Time-varying relations: grouping

12:07> SELECT * FROM UserScores;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | Julie | 7 | 12:01 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | | | | | | | | | | | | Julie | 1 | 12:03 | | | Julie | 1 | 12:03 | | | | | | | | | | | | | | | | | | | Julie | 4 | 12:07 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

slide-58
SLIDE 58

58

How does this relate to streams & tables?

slide-59
SLIDE 59

59

Time-varying relations: tables

12:07> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 12 | 12:07 |

| Frank | 3 | 12:03 |

  • 12:03> SELECT TABLE Name,

SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 8 | 12:03 |

| Frank | 3 | 12:03 |

  • 12:01> SELECT TABLE Name,

SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
  • 12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT TABLE Name,

SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
slide-60
SLIDE 60

60

Time-varying relations: tables

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
  • 12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:07> SELECT TABLE Name, SUM(Score), MAX(Time) AS OF

SYSTEM TIME ‘12:01’ FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
slide-61
SLIDE 61

61

Time-varying relations: tables

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • ...

12:00

12:00> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
slide-62
SLIDE 62

62

Time-varying relations: streams

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

...

12:01

12:00> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
slide-63
SLIDE 63

63

Time-varying relations: streams

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

...

12:01

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
slide-64
SLIDE 64

64

Time-varying relations: streams

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 8 | 12:03 | ...

12:03

12:01> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |
slide-65
SLIDE 65

65

Time-varying relations: streams

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 8 | 12:03 | ...

12:03

12:03> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 8 | 12:03 |

| Frank | 3 | 12:03 |

slide-66
SLIDE 66

66

Time-varying relations: streams

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 8 | 12:03 | | Julie | 12 | 12:07 | ...

12:07

12:03> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 8 | 12:03 |

| Frank | 3 | 12:03 |

slide-67
SLIDE 67

67

Time-varying relations: streams

12:07> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;

  • | [-inf, 12:01) | [12:01, 12:03) | [12:03, 12:07) | [12:07, now) |

| ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | Name | Score | Time | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- | | | | | | | | Julie | 7 | 12:01 | | | Julie | 8 | 12:03 | | | Julie | 12 | 12:07 | | | | | | | | | | | | | | Frank | 3 | 12:03 | | | Frank | 3 | 12:03 | | | ------------------------- | ------------------------- | ------------------------- | ------------------------- |

  • 12:00> SELECT STREAM Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name;
  • | Name | Score | Time |
  • | Julie | 7 | 12:01 |

| Frank | 3 | 12:03 | | Julie | 8 | 12:03 | | Julie | 12 | 12:07 | ...

12:07

12:07> SELECT TABLE Name, SUM(Score), MAX(Time) FROM UserScores GROUP BY Name;

  • | Name | Score | Time |
  • | Julie | 12 | 12:07 |

| Frank | 3 | 12:03 |

slide-68
SLIDE 68

68

How does this relate to streams & tables?

capture a point-in-time snapshot

  • f a time-varying relation.

capture the evolution of a time-varying relation over time. Tables Streams

slide-69
SLIDE 69

69

02 Streaming SQL

Contorting relational algebra for fun and profit

A Time-varying relations B SQL language extensions

slide-70
SLIDE 70

70

When do you need SQL extensions for streaming?

As a table: As a stream: SQL extensions rarely needed. SQL extensions sometimes needed.

How is output consumed? good defaults = often not needed

slide-71
SLIDE 71

71

When do you need SQL extensions for streaming?*

Explicit table / stream selection

  • SELECT TABLE * from X;
  • SELECT STREAM * from X;

Timestamps and windowing

  • Event-time columns
  • Windowing. E.g.,

SELECT * FROM X GROUP BY SESSION(<COLUMN> INTERVAL '5' MINUTE); ○ Grouping by timestamp ○ Complex multi-row transactions inexpressible in declarative SQL (e.g., session windows) Sane default table / stream selection

  • If all inputs are tables, output is a table
  • If any inputs are streams, output is a stream

Simple triggers

  • Implicitly defined by characteristics of the sink
  • Optionally be configured outside of query.
  • Per-query, e.g.: SELECT * from X EMIT <WHEN>;
  • Focused set of use cases:

○ Repeated updates ... EMIT AFTER <TIMEDELTA> ○ Completeness ... EMIT WHEN WATERMARK PAST <COLUMN> ○ Repeated updates + completeness (e.g., early/on-time/late pattern) ... EMIT AFTER <TIMEDELTA> AND WHEN WATERMARK PAST <COLUMN>

* Most of these extensions are theoretical at this point; very few have concrete implementations.

slide-72
SLIDE 72

72

Summary streams ⇄ tables streams & tables : Beam Model time-varying relations SQL language extensions

slide-73
SLIDE 73

73

Thank you!

In early release now streamingsystems.net

These slides: http://s.apache.org/streaming-sql-big-data-spain

Streaming SQL spec (WIP: Apex, Beam, Calcite, Flink): http://s.apache.org/streaming-sql-spec Streaming in Calcite (Julian Hyde): https://calcite.apache.org/docs/stream.html Streams, joins & temporal tables (Julian Hyde): http://s.apache.org/streams-joins-and-temporal-tables Streaming 101: http://oreilly.com/ideas/the-world-beyond-batch-streaming-101 Streaming 102: http://oreilly.com/ideas/the-world-beyond-batch-streaming-102 Apache Beam: http://beam.apache.org Apache Calcite: http://calcite.apache.org Apache Flink: http://flink.apache.org Twitter: @takidau