It’s About Time: An Introduction to Timely Dataflow (Data Council, October ‘19)

slide-1
SLIDE 1

It’s About Time: An Introduction to Timely Dataflow

Data Council, October ‘19

slide-2
SLIDE 2

  • Malte Sandstede, malte@clockworks.io / @MalteSandstede
  • Nikolas Göbel, niko@clockworks.io / @NikolasGoebel
  • David Bach, david@clockworks.io
  • Moritz Moxter, moritz@clockworks.io

In collaboration with:

  • Frank McSherry
  • Vasia Kalavri (ETH)

clockworks Systems Group

slide-3
SLIDE 3

Stream Processing’s Trifecta

Timeliness Expressivity Consistency

slide-4
SLIDE 4

Stream Processing’s Trifecta

Timeliness Expressivity Consistency

Naive Stateless Processing

  • Low latency
  • Issue: Late arrivals
  • Issue: Complex computations
slide-5
SLIDE 5

Stream Processing’s Trifecta

Timeliness Expressivity Consistency

MapReduce

  • No late arrivals (by definition)
  • Easy to scale
  • Issue: Complex computations
  • Issue: High latency
slide-6
SLIDE 6

Stream Processing’s Trifecta

Timeliness Expressivity Consistency

Database

  • No late arrivals
  • High expressivity
  • ACID
  • Issue: Not realtime!
slide-7
SLIDE 7

Stream Processing’s Trifecta

Timeliness Expressivity Consistency

slide-8
SLIDE 8

Use Case: Kafka Superpowers

(Partitions complect physical representation & use case)

P1 P2 T1

slide-9
SLIDE 9

Use Case: Kafka Superpowers

[Diagram: Physical Representation (P1, P2, T1), Virtual Partitions (V1, V2, V3, V4), Business Logic. Labels: Reactivity, queries, Virtualization, Repartitioning, Joins, time order.]

slide-10
SLIDE 10

Stream Processing as Dataflow

sources sinks

operators

data exchange

slide-11
SLIDE 11

Dataflow Parallelism

slide-12
SLIDE 12

Dataflow Distribution

w1 w2

slide-13
SLIDE 13

Correctness Troubles

SUM

(1, t0) (4, t1) (3, t0) DATA

slide-14
SLIDE 14

Correctness Troubles

SUM

(1, t0) (4, t1) (3, t0) DATA

slide-15
SLIDE 15

Correctness Troubles

SUM

(4, t1) (1, t0) (5, t1) (3, t0) DATA

slide-16
SLIDE 16

Correctness Troubles

SUM

(4, t1) (1, t0) (5, t1) (3, t0) DATA

slide-17
SLIDE 17

Timely Dataflow

A low-latency runtime for distributed cyclic dataflows

github.com/TimelyDataflow

slide-18
SLIDE 18

Correctness with Progress Tracking

SUM

(1, t0) (4, t1) (3, t0)

t0 t0 t2

DATA PROGRESS

t0

slide-19
SLIDE 19

Correctness with Progress Tracking

SUM

(1, t0) (4, t1) (3, t0)

t0 t0 t2

DATA PROGRESS

t0

slide-20
SLIDE 20

Correctness with Progress Tracking

SUM

(1, t0) (4, t1) (3, t0)

t0 t0 t2

DATA PROGRESS

t0

slide-21
SLIDE 21

Correctness with Progress Tracking

SUM

(1, t0) (4, t1) (3, t0)

t0 t2

DATA PROGRESS

t2

slide-22
SLIDE 22

Correctness with Progress Tracking

SUM

(1, t0) (4, t1) (3, t0)

t0 t2

DATA PROGRESS

t2

(4, t0)

t2

slide-23
SLIDE 23

Correctness with Progress Tracking

SUM

(1, t0) (4, t1) (3, t0)

t0 t2

DATA PROGRESS

t2

(4, t0)

t2

(8, t1)
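
The progress-tracking steps on the slides above can be sketched in plain Rust. This is a minimal illustration, not Timely's API: `RunningSum`, `on_data`, and `on_progress` are hypothetical names, times are plain integers (t0 = 0, t1 = 1, t2 = 2), and the frontier is assumed to be the smallest time that may still arrive. Under those assumptions the operator buffers out-of-order records and only emits a total for a time once the frontier has passed it.

```rust
use std::collections::BTreeMap;

/// Hypothetical running-sum operator (not part of Timely): buffers values per
/// logical time and releases a total only once that time is known to be complete.
#[derive(Default)]
pub struct RunningSum {
    pending: BTreeMap<u64, i64>, // time -> sum of values not yet released
    total: i64,                  // running total over all released times
}

impl RunningSum {
    /// Accept a record; records may arrive in any order.
    pub fn on_data(&mut self, value: i64, time: u64) {
        *self.pending.entry(time).or_insert(0) += value;
    }

    /// `frontier` is the smallest time that may still appear on the input.
    /// Every buffered time strictly below it is complete and gets emitted.
    pub fn on_progress(&mut self, frontier: u64) -> Vec<(i64, u64)> {
        // split_off leaves times < frontier in `self.pending`
        // and returns the times >= frontier, which stay buffered.
        let open = self.pending.split_off(&frontier);
        let closed = std::mem::replace(&mut self.pending, open);
        let mut out = Vec::new();
        for (time, sum) in closed {
            self.total += sum;
            out.push((self.total, time));
        }
        out
    }
}
```

Feeding it (1, t0), (4, t1), (3, t0) while the frontier sits at t0 produces nothing; once the frontier reaches t2 it yields (4, t0) and (8, t1), matching the outputs shown on the slides.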

slide-24
SLIDE 24

Progress Tracking… without Progress?

(data sources with different event frequencies)

JOIN

(1, t0) (4, t1) (3, t2)

t1 t2

CLICKSTREAM TOPIC CLICKSTREAM PROGRESS

t0 t3 t4

(2, t3) Waiting on METADATA …

t0

METADATA TOPIC METADATA PROGRESS

(MIN)
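
The stall described on this slide can be illustrated with a small sketch. It assumes a single, totally ordered time domain shared by both inputs, with integer times; `combined_frontier` and `releasable` are hypothetical helpers, not Timely functions.

```rust
/// Hypothetical helper: with one shared, totally ordered time domain, a
/// two-input operator can only trust times below the minimum input frontier.
pub fn combined_frontier(clickstream: u64, metadata: u64) -> u64 {
    clickstream.min(metadata)
}

/// Buffered (value, time) records that are complete under `frontier`.
pub fn releasable(buffered: &[(i64, u64)], frontier: u64) -> Vec<(i64, u64)> {
    buffered
        .iter()
        .copied()
        .filter(|&(_, time)| time < frontier)
        .collect()
}
```

With the clickstream frontier at t4 and the metadata frontier still at t0, the combined frontier is t0 and no clickstream record can be released, however far the busier source has advanced.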

slide-25
SLIDE 25

Multidimensional Progress Tracking

(track sources along independent timelines)

JOIN

(1, t0) (4, t1) (3, t2) CLICKSTREAM TOPIC CLICKSTREAM PROGRESS

t0

(2, t3) …

t0

METADATA TOPIC METADATA PROGRESS

t0 t1 t2 t3 t4

t0

slide-26
SLIDE 26

Multidimensional Progress Tracking

(track sources along independent timelines)

JOIN

(1, t0) (4, t1) (3, t2) CLICKSTREAM TOPIC CLICKSTREAM PROGRESS

t1

(2, t3) …

t0

METADATA TOPIC METADATA PROGRESS

t0 t1 t2 t3 t4

t0

slide-27
SLIDE 27

Multidimensional Progress Tracking

(track sources along independent timelines)

JOIN

(1, t0) (4, t1) (3, t2) CLICKSTREAM TOPIC CLICKSTREAM PROGRESS

t2

(2, t3) …

t0

METADATA TOPIC METADATA PROGRESS

t0 t1 t2 t3 t4

… …

t0

slide-28
SLIDE 28

Creating Dataflows with Timely

slide-29
SLIDE 29

Creating Dataflows with Timely

slide-30
SLIDE 30

Creating Dataflows with Timely

slide-31
SLIDE 31

Creating Dataflows with Timely

slide-32
SLIDE 32

Running Dataflows with Timely

slide-33
SLIDE 33
slide-34
SLIDE 34

Kafka Superpowers

[Diagram: Physical Representation (P1, P2, T1), Virtual Partitions (V1, V2, V3, V4), Business Logic. Labels: Reactivity, queries, Virtualization, Repartitioning, Joins, time order.]

✔ ✔ ?

slide-35
SLIDE 35

Kafka Superpowers

[Diagram: Physical Representation (P1, P2, T1), Virtual Partitions (V1, V2, V3, V4), Business Logic. Labels: Reactivity, queries, Virtualization, Repartitioning, Joins, time order.]

✔ ✔ ?

Timely

slide-36
SLIDE 36

The Trifecta?

Timeliness Expressivity Consistency

slide-37
SLIDE 37

The Trifecta?

Timeliness Expressivity Consistency (recursive) queries

slide-38
SLIDE 38

Recursive Graph Traversal

C F B A D E

slide-39
SLIDE 39

Recursive Graph Traversal

C F B A D E

slide-40
SLIDE 40

Recursive Graph Traversal

C F B A D E

slide-41
SLIDE 41

Recursive Dataflows

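    /// Breadth-First Search
    // Roots reach themselves at distance 0.
    let nodes = roots.map(|x| (x, 0));
    nodes.iterate(|inner| {
        // Bring the outer collections into the iteration scope.
        let edges = edges.enter(&inner.scope());
        let nodes = nodes.enter(&inner.scope());
        // Extend known distances along edges, add the roots back in,
        // and keep the minimum distance per node.
        inner.join_map(&edges, |_k,l,d| (*d, l+1))
             .concat(&nodes)
             .reduce(|_, s, t| t.push((*s[0].0, 1)))
    })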

BFS

EDGE CHANGES REACHABLE NODES TRANSITIVE EDGES
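
To make the fixed-point structure of the dataflow above concrete, here is a plain-Rust sketch that follows the same three steps (extend along edges, add the roots back, keep the minimum per node) in an ordinary loop. It does not use Differential Dataflow and is not incremental; the function `bfs` and the integer node ids are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Minimal distances from `roots`, computed by iterating to a fixed point.
/// A batch sketch of the dataflow's logic, not the library's implementation.
pub fn bfs(edges: &[(u32, u32)], roots: &[u32]) -> HashMap<u32, u32> {
    // roots.map(|x| (x, 0))
    let nodes: HashMap<u32, u32> = roots.iter().map(|&r| (r, 0)).collect();
    let mut inner = nodes.clone();
    loop {
        // concat: start from the root distances ...
        let mut proposals: Vec<(u32, u32)> =
            nodes.iter().map(|(&node, &dist)| (node, dist)).collect();
        // ... join_map: and extend every known distance along each edge.
        for &(src, dst) in edges {
            if let Some(&dist) = inner.get(&src) {
                proposals.push((dst, dist + 1));
            }
        }
        // reduce: keep the minimum proposed distance per node.
        let mut next: HashMap<u32, u32> = HashMap::new();
        for (node, dist) in proposals {
            let best = next.entry(node).or_insert(dist);
            if dist < *best {
                *best = dist;
            }
        }
        // Stop once an iteration changes nothing.
        if next == inner {
            return next;
        }
        inner = next;
    }
}
```

Each pass discovers shortest paths that are one edge longer than the previous pass, so the loop ends after at most as many rounds as there are reachable nodes, even when the graph contains cycles.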

slide-42
SLIDE 42

Recursive Dataflows

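    /// Breadth-First Search
    // Roots reach themselves at distance 0.
    let nodes = roots.map(|x| (x, 0));
    nodes.iterate(|inner| {
        // Bring the outer collections into the iteration scope.
        let edges = edges.enter(&inner.scope());
        let nodes = nodes.enter(&inner.scope());
        // Extend known distances along edges, add the roots back in,
        // and keep the minimum distance per node.
        inner.join_map(&edges, |_k,l,d| (*d, l+1))
             .concat(&nodes)
             .reduce(|_, s, t| t.push((*s[0].0, 1)))
    })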

BFS

EDGE CHANGES REACHABLE NODES TRANSITIVE EDGES

slide-43
SLIDE 43

Progress Tracking… with Loops?

(have to finish iterating before we can handle next input)

BFS

EDGE CHANGES REACHABLE NODES TRANSITIVE EDGES

t1 t2 t0

Have to wait while transitive graph is being discovered.

t0

slide-44
SLIDE 44

Multidimensional Progress Tracking

(track iteration depth separately)

BFS

EDGE CHANGES REACHABLE NODES TRANSITIVE EDGES

t1 t2 t0

t1 1 t0 1

(Product Partial Order)

slide-45
SLIDE 45

t1 t0

Lexicographical Order (Join)

(visibility for )

t0 t2 t3 t1 t2 t3 t2 t2

✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔

slide-46
SLIDE 46

t1 t0

Product Partial Order (Iteration)

(visibility for )

t2 t3 1 2 3 t2 2

✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
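
The two orders named on these slides can be written down directly for two-coordinate timestamps. The sketch below assumes timestamps are pairs of integers, for example (input time, iteration depth); `Time`, `product_le`, and `lex_le` are illustrative names rather than Timely's own types.

```rust
/// A two-coordinate timestamp, e.g. (input time, iteration depth).
pub type Time = (u64, u64);

/// Product partial order: a <= b only if both coordinates are <=.
/// Some pairs of timestamps are incomparable under this order.
pub fn product_le(a: Time, b: Time) -> bool {
    a.0 <= b.0 && a.1 <= b.1
}

/// Lexicographic total order: compare the first coordinate, then the second.
pub fn lex_le(a: Time, b: Time) -> bool {
    a <= b // Rust tuples already compare lexicographically
}
```

Under the product order, (t1, 0) and (t0, 2) are incomparable, so neither timestamp has to wait for the other; under the lexicographic order every timestamp at t0 precedes every timestamp at t1.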

slide-47
SLIDE 47

Multidimensional Progress Tracking

(track iteration depth separately)

BFS

EDGE CHANGES REACHABLE NODES TRANSITIVE EDGES

t1 t2 t0

t1 1 t0 1

(Product Partial Order)

slide-48
SLIDE 48

BFS

EDGE CHANGES REACHABLE NODES TRANSITIVE EDGES

Have to start from scratch for every transaction?

Incremental Execution?

slide-49
SLIDE 49

Differential Dataflow

Iterative, incrementalized operators for Timely
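
As a rough illustration of what an incrementalized operator does, the sketch below maintains a per-key count from input changes and reports only the output changes. It is a simplified model under stated assumptions, not Differential Dataflow's API: `IncrementalCount` and `update` are hypothetical, changes are (key, diff) pairs with diff +1 for an insertion and -1 for a removal, and time is left out entirely.

```rust
use std::collections::HashMap;

/// Hypothetical incrementally maintained count per key.
#[derive(Default)]
pub struct IncrementalCount {
    counts: HashMap<String, i64>,
}

impl IncrementalCount {
    /// Apply a batch of (key, diff) input changes and return only the output
    /// changes as (key, count, diff) triples. Untouched keys are not recomputed.
    pub fn update(&mut self, batch: &[(&str, i64)]) -> Vec<(String, i64, i64)> {
        // Consolidate the batch so each touched key is handled once.
        let mut delta: HashMap<&str, i64> = HashMap::new();
        for &(key, diff) in batch {
            *delta.entry(key).or_insert(0) += diff;
        }
        let mut touched: Vec<(&str, i64)> =
            delta.into_iter().filter(|&(_, d)| d != 0).collect();
        touched.sort(); // deterministic output order
        let mut out = Vec::new();
        for (key, diff) in touched {
            let old = self.counts.get(key).copied().unwrap_or(0);
            let new = old + diff;
            if old != 0 {
                out.push((key.to_string(), old, -1)); // retract the old result
            }
            if new != 0 {
                out.push((key.to_string(), new, 1)); // assert the new result
                self.counts.insert(key.to_string(), new);
            } else {
                self.counts.remove(key);
            }
        }
        out
    }
}
```

The work per batch is proportional to the number of keys that changed rather than to the size of the whole collection, which is the property that avoids starting from scratch for every transaction.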

github.com/TimelyDataflow

slide-50
SLIDE 50

Performance

slide-51
SLIDE 51

    /// BFS
    let nodes = roots.map(|x| (x, 0));
    nodes.iterate(|inner| {
        let edges = edges.enter(&inner.scope());
        let nodes = nodes.enter(&inner.scope());
        inner.join_map(&edges, |_k,l,d| (*d, l+1))
             .concat(&nodes)
             .reduce(|_, s, t| t.push((*s[0].0, 1)))
    })

    [[(bfs ?from ?to) [?from :edge ?to]]
     [(bfs ?from ?to) [?from :edge ?hop] (bfs ?hop ?to)]]

Streaming & Relational Queries

Declarative Differential Dataflows (3DF)

github.com/comnik/declarative-dataflow

slide-52
SLIDE 52

The Trifecta!

Timeliness Expressivity Consistency

slide-53
SLIDE 53

Kafka Superpowers

[Diagram: Physical Representation (P1, P2, T1), Virtual Partitions (V1, V2, V3, V4), Business Logic. Labels: Reactivity, queries, Virtualization, Repartitioning, Joins, time order.]

✔ ✔ ✔

Timely

slide-54
SLIDE 54

Kafka Superpowers

[Diagram: Physical Representation (P1, P2, T1), Virtual Partitions (V1, V2, V3, V4), Business Logic. Labels: Reactivity, queries, Virtualization, Repartitioning, Joins, time order.]

✔ ✔ ✔

Timely DD+3DF

slide-55
SLIDE 55

Kafka Superpowers

P1 P2 T1 V1 V2 V3 V4

clockworks.io/kplex

slide-56
SLIDE 56

Timely as a Programming Model

Timely Dataflow

(Dataflows w/ Multidimensional Progress Tracking)

Differential Dataflow

(Iterative Incrementalized Operators)

3DF

(Streaming Relational Queries)

github.com/TimelyDataflow

github.com/comnik/declarative-dataflow

slide-57
SLIDE 57

Sources

Repositories

  • Timely: github.com/TimelyDataflow
  • ST2: github.com/li1/snailtrail
  • 3DF: github.com/comnik/declarative-dataflow
  • Differential FAQ: github.com/eoxxs/differential-aggregate-query

Papers

  • Naiad (Timely Dataflow): http://dl.acm.org/citation.cfm?doid=2517349.2522738
  • Differential Dataflow: http://michaelisard.com/pubs/differentialdataflow.pdf, arxiv.org/abs/1812.02639
  • SnailTrail: hdl.handle.net/20.500.11850/228581

Talks

  • Reactive Datalog for Datomic (clojure/conj 2018): clockworks.io/2018/12/01/conj-talk.html
  • Across Time and Space (BobKonf 2019): clockworks.io/2019/03/22/across-time-space.html

Blog Posts

  • frankmcsherry.org
  • Incremental Functional Aggregate Queries: clockworks.io/2019/07/06/Incremental-Functional-Aggregate-Queries.html
  • Dataflows you can’t refuse: clockworks.io/2019/02/10/dataflows-you-cant-refuse.html
  • Reactive Datalog with Vega: clockworks.io/2018/11/25/reactive-datalog-with-vega.html
  • Incremental Datalog with Differential Dataflows: clockworks.io/2018/09/13/incremental-datalaog.html

clockworks

www.clockworks.io {david, malte, moritz, niko}@clockworks.io