Coordination-free query evaluation and multi-query optimization in - - PowerPoint PPT Presentation

Formal approaches to: Coordination-free query evaluation and multi-query optimization in parallel and distributed systems. Bas Ketsman. Outline: CALM Formalization, CALM Revision 1, Coordination-free evaluation, Conclusion, Parallel-Correctness.


slide-1
SLIDE 1

Formal approaches to:

Coordination-free query evaluation and multi-query optimization in parallel and distributed systems

Bas Ketsman

slide-2
SLIDE 2

Outline

Coordination-free evaluation

▶ CALM Formalization
▶ CALM Revision 1
▶ Conclusion

Multi-Query optimization

▶ Parallel-Correctness
▶ Transferability
▶ Conclusion

2 / 46

slide-3
SLIDE 3

Introduction

Context: Declarative Networking, where Datalog-based languages are used for parallel and distributed computing in clusters with disordered communication.

CALM-conjecture: No-coordination ≟ Monotonicity [Hellerstein, 2010]

▶ [Ameloot, Neven, Van den Bussche, 2011]: TRUE for a setting where nodes have no information about the horizontal distribution of records
▶ [Zinn, Green, Ludäscher, 2012]: FALSE for settings where nodes have information about the horizontal distribution of records

3 / 46

slide-8
SLIDE 8

Goal: To clarify the relation between monotonicity and coordination in asynchronous systems and to reveal the more complete picture

slide-11
SLIDE 11

Monotonicity

Definition

A query Q is monotone if Q(I) ⊆ Q(I ∪ J) for all database instances I and J.
Notation: M = class of monotone queries

Example

▶ Q∆: Select triangles in a graph ∈ M
▶ Q<: Select open triangles in a graph ∉ M
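
The two example queries can be checked against the definition on a tiny instance. The sketch below is illustrative only: it assumes a directed graph stored as a set of edge pairs, and it assumes that an "open triangle" means two consecutive edges whose closing edge is missing. The function names `triangles` and `open_triangles` are hypothetical and do not come from the slides.

```python
from itertools import permutations

def triangles(edges):
    """Q_triangle: all (a, b, c) with edges a->b, b->c and c->a."""
    e = set(edges)
    nodes = {v for edge in e for v in edge}
    return {(a, b, c) for a, b, c in permutations(nodes, 3)
            if (a, b) in e and (b, c) in e and (c, a) in e}

def open_triangles(edges):
    """Q_open (assumed reading): a->b and b->c exist, but c->a does not."""
    e = set(edges)
    nodes = {v for edge in e for v in edge}
    return {(a, b, c) for a, b, c in permutations(nodes, 3)
            if (a, b) in e and (b, c) in e and (c, a) not in e}

I = {(1, 2), (2, 3)}
J = {(3, 1)}

# Monotone: adding facts never retracts earlier output.
assert triangles(I) <= triangles(I | J)

# Not monotone: the new edge (3, 1) closes the open triangle (1, 2, 3),
# so an output tuple of Q(I) is missing from Q(I ∪ J).
assert not open_triangles(I) <= open_triangles(I | J)
```

A single counterexample pair (I, J) is enough to show that a query is not monotone. Showing that a query is monotone requires an argument over all instances, which the check above does not provide.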

6 / 46

slide-14
SLIDE 14

Relational Transducer Networks

[Ameloot, Neven, Van den Bussche, 2011]

▶ Network N = {x, y, u, z}
▶ Transducer Π
▶ Messages can be arbitrarily delayed but never get lost

Semantics defined in terms of runs over a transition system

7 / 46

slide-16
SLIDE 16

Eventually Consistent Query Evaluation

Definition

A transducer Π computes a query Q if

▶ for all networks N (network independent),
▶ for all databases I,
▶ for all horizontal distributions H (distribution independent), and
▶ for every run of Π,

out(Π) = Q(I) (consistency requirement).

8 / 46

slide-17
SLIDE 17

Example: Q∆: select all triangles

9 / 46

Algorithm:

▶ Broadcast all data
▶ Output triangles whenever new data arrives

Extremely naive, but works... and is coordination-free!
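
The broadcast strategy can be simulated in a few lines. This is a minimal sketch under simplifying assumptions, not the transducer model from the slides: nodes are dictionary keys, a "run" is one random delivery order of the broadcast messages, and the names `run` and `distribution` are hypothetical. It illustrates that the union of outputs is the same whatever the delivery order.

```python
import random
from itertools import permutations

def triangles(edges):
    """All (a, b, c) with edges a->b, b->c and c->a."""
    e = set(edges)
    nodes = {v for edge in e for v in edge}
    return {(a, b, c) for a, b, c in permutations(nodes, 3)
            if (a, b) in e and (b, c) in e and (c, a) in e}

def run(distribution, seed):
    """Simulate one run: every node broadcasts its local edges, and the
    messages are delivered one by one in an arbitrary (shuffled) order."""
    rng = random.Random(seed)
    local = {node: set(edges) for node, edges in distribution.items()}
    messages = [(dst, edge)
                for src, edges in distribution.items()
                for edge in edges
                for dst in distribution if dst != src]
    rng.shuffle(messages)  # messages may be delayed, but none is lost
    output = set()
    for node_edges in local.values():
        output |= triangles(node_edges)   # output on the initial local data
    for dst, edge in messages:
        local[dst].add(edge)
        output |= triangles(local[dst])   # output again when new data arrives
    return output

distribution = {"x": {(1, 2)}, "y": {(2, 3)}, "z": {(3, 1)}}
expected = triangles({(1, 2), (2, 3), (3, 1)})
assert all(run(distribution, seed) == expected for seed in range(20))
```

No node ever has to wait for, or ask about, the data held elsewhere. Because the query is monotone, every tuple output on partial data is still correct on the full database, so outputs never need to be retracted.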

slide-22
SLIDE 22

Example: Q< : select all open triangles

10 / 46

Coordination is needed to reason about the absence of records.

slide-27
SLIDE 27

Coordination-freeness

Goal: separate data-communication from coordination-communication

Definition

Π is coordination-free if, for every input I, there is a distribution of I on which Π computes Q(I) without having to communicate.

[Ameloot, Neven, Van den Bussche, 2011]

11 / 46

slide-30
SLIDE 30

Example: Ideal Distribution

Q∆: select all triangles

12 / 46

Algorithm:

▶ (Broadcast all data)
▶ Output triangles whenever new data arrives

slide-31
SLIDE 31

CALM-conjecture

[Ameloot, Neven, Van den Bussche, 2011]

A query has a coordination-free and eventually consistent execution strategy iff the query is monotone.

Theorem

F0 = M

Definition

F0 = set of queries which are distributedly computed by coordination-free transducers

13 / 46

slide-33
SLIDE 33

Policy-aware Transducers

15 / 46

“Distribution policy”

slide-35
SLIDE 35

Policy-aware Transducers

Deduction rules

▶ in local database ⇒ in global database
▶ not in local database + in scope ⇒ not in global database
▶ not in local database + not in scope ⇒ unknown
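
The three deduction rules amount to a three-valued lookup. The sketch below is one possible encoding and rests on assumptions: a fact is "in scope" for a node when the distribution policy assigns that fact to the node, and the policy is modelled as a Python function from a fact to a set of node names. `Status`, `deduce` and the example `policy` are hypothetical names.

```python
from enum import Enum

class Status(Enum):
    PRESENT = "in global database"
    ABSENT = "not in global database"
    UNKNOWN = "unknown"

def deduce(fact, node, local_db, policy):
    """Apply the three deduction rules from the point of view of `node`."""
    if fact in local_db:
        return Status.PRESENT   # in local database
    if node in policy(fact):
        return Status.ABSENT    # not local, but this node would have received it
    return Status.UNKNOWN       # not local and not in scope

# Hypothetical policy: edges with an even source go to "x", all others to "y".
def policy(fact):
    return {"x"} if fact[0] % 2 == 0 else {"y"}

local_x = {(2, 3)}
assert deduce((2, 3), "x", local_x, policy) is Status.PRESENT
assert deduce((2, 5), "x", local_x, policy) is Status.ABSENT
assert deduce((1, 2), "x", local_x, policy) is Status.UNKNOWN
```

The second rule is what gives policy-aware nodes extra power: a node can conclude that a fact is absent from the global database without asking any other node.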

16 / 46

slide-38
SLIDE 38

Policy-aware Transducers [Zinn, Green, Ludäscher, 2012]

Definition

A distribution policy P for σ and N is a total function from facts(σ) to the power set of N.

Definition

A policy-aware transducer is a transducer with access to P restricted to its active domain

Definition

F1 = set of queries which are distributedly computed by policy-aware coordination-free transducers

18 / 46

slide-39
SLIDE 39

Domain-distinct-monotonicity

Definition

A fact f is domain-distinct from instance I when adom(f) ⊈ adom(I).

Example

[Figure: an instance I with two facts f and f′]

19 / 46

slide-40
SLIDE 40

Domain-distinct-monotonicity

Definition

A query Q is domain-distinct-monotone if Q(I) ⊆ Q(I ∪ J) for all I and J, with J having only facts that are domain-distinct from I.
Notation: Mdistinct = class of domain-distinct-monotone queries

M ⊊ Mdistinct

Remark

Mdistinct: class of queries preserved under extensions

20 / 46

slide-41
SLIDE 41

Domain-distinct-monotonicity

Example

Select open triangles in graph ∈ Mdistinct.

[Figure: an instance I and its output Q(I); an edge that closes an open triangle is not domain-distinct from I.]
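
The open-triangle example can be replayed in code. This is an illustrative sketch under the same assumptions as before (a directed graph as a set of edge pairs, and an open triangle read as two consecutive edges with no closing edge). The helper names `adom`, `is_domain_distinct` and `open_triangles` are hypothetical.

```python
from itertools import permutations

def adom(facts):
    """Active domain: all values occurring in a set of facts."""
    return {value for fact in facts for value in fact}

def is_domain_distinct(fact, instance):
    """adom(f) is not a subset of adom(I): f mentions at least one new value."""
    return not set(fact) <= adom(instance)

def open_triangles(edges):
    """Assumed reading: a->b and b->c exist, but c->a does not."""
    e = set(edges)
    nodes = adom(e)
    return {(a, b, c) for a, b, c in permutations(nodes, 3)
            if (a, b) in e and (b, c) in e and (c, a) not in e}

I = {(1, 2), (2, 3)}
closing_edge = (3, 1)   # uses only values already in I
fresh_edge = (3, 4)     # introduces the new value 4

assert not is_domain_distinct(closing_edge, I)
assert is_domain_distinct(fresh_edge, I)

# Adding a domain-distinct fact keeps every earlier output tuple.
assert open_triangles(I) <= open_triangles(I | {fresh_edge})
```

The only facts that can retract an open triangle are closing edges, and a closing edge connects two values that already occur in I. Such an edge is therefore never domain-distinct from I.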

21 / 46

slide-43
SLIDE 43

Revised CALM-conjecture

A query has a coordination-free and eventually consistent execution strategy under distribution policies iff the query is domain-distinct-monotone.

Theorem

F1 = Mdistinct

Definition

F1 = set of queries which are distributedly computed by policy-aware coordination-free transducers

22 / 46

slide-49
SLIDE 49

Proof idea of Mdistinct ⊆ F1

For domain-distinct-monotone queries:

▶ Broadcast all present and deduced absent facts
▶ Evaluate query on complete sets

Example: Q<: Select open triangles in a graph

[Figure: behaviour of a computing node]

23 / 46

slide-50
SLIDE 50

Complete Picture

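Columns, left to right: Datalog | Datalog + value invention | Monotonicity | Coordination-freeness

Datalog(≠) ⊊ wILOG(≠) = M = F0
SP-Datalog ⊊ SP-wILOG = Mdistinct = F1
semicon-Datalog¬ ⊊ semicon-wILOG¬ = Mdisjoint = F2

Within each column, every class is strictly contained (⊊) in the class on the row below it.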

[Ameloot, Ketsman, Neven, Zinn PODS 2014 best paper; TODS 2016]

24 / 46

slide-52
SLIDE 52

Goal: To clarify the relation between monotonicity and coordination in asynchronous systems and to reveal the more complete picture

▶ A four-level quantification of coordination exists in terms of the amount of information needed for the query to become coordination-free
▶ The CALM conjecture reveals a very robust relation between coordination and non-monotonic behaviour of queries

slide-53
SLIDE 53

Open Questions & Future Work

▶ How to compute queries without coordination and with less communication?
▶ How to compute queries with “some” coordination?

Related Work

▶ Oblivious broadcasting algorithms = broadcast fragment of local database [Ketsman, Neven ICDT 2015; ToCS 2016]
▶ A worst-case optimal load algorithm for join evaluation [Ketsman, Suciu, PODS 2017]

27 / 46

slide-56
SLIDE 56

Motivation

▶ Many systems rely on coordination for communication (for example: MapReduce-like systems)
▶ Avoiding coordination completely is not possible
▶ Minimize the number of communication steps (ideally: one round)

29 / 46

slide-59
SLIDE 59

Single-Round Evaluation Algorithm (1-Round MPC model [Koutris & Suciu 2011])

Input = query Q

▶ Step 1: Redistribution
▶ Step 2: Each server evaluates Q on its local data

Output = union of output at each server

30 / 46

slide-61
SLIDE 61

Multi-Query Evaluation & Optimization

Workload: Q1, Q2, . . . , Qn + fixed database

[Figure: starting from database I, each of Q1(I), Q2(I), Q3(I), · · · is computed after its own redistribution step.]

31 / 46

slide-62
SLIDE 62

Goal: To formally reason about single-round query evaluation and multi-query optimization in systems where communication implies a synchronization barrier

Focus: conjunctive queries

slide-63
SLIDE 63

Single-Round Evaluation Algorithm (1-Round MPC model [Koutris & Suciu 2011])

Input = query Q

▶ Step 1: Redistribution, modeled by a distribution policy P
▶ Step 2: Each server evaluates Q on its local data

Output = union of output at each server

33 / 46

slide-64
SLIDE 64

Can a reshuffle step be avoided?

Main question 1

Given a target query and a distribution policy: do we need to reshuffle? → Parallel-Correctness

Main question 2

Given a target query and a previously computed query: do we need to reshuffle? → Transferability

34 / 46

slide-72
SLIDE 72

Parallel-Correctness

Given target query and distribution policy: Do we need to reshuffle?

Semantic correctness of the simple evaluation algorithm

Definition

Q is parallel-correct w.r.t. P iff for every database I

$$Q(I) = \bigcup_{\kappa \in N} Q(\mathrm{dist}_{P,I}(\kappa))$$

The inclusion ⊇ holds by monotonicity.
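
The definition can be tested on a single database with a few lines of code. This is a hedged sketch, not a decision procedure: the definition quantifies over every database I, while the function below checks one given instance only. The query (paths of length two), the two policies and all names (`paths`, `dist`, `parallel_correct_on`, `by_source`, `by_endpoints`) are hypothetical choices for illustration.

```python
def paths(edges):
    """Example conjunctive query Q(x, z) :- R(x, y), R(y, z)."""
    e = set(edges)
    return {(x, z) for (x, y) in e for (y2, z) in e if y == y2}

def dist(policy, instance, node):
    """dist_{P,I}(node): the facts of I that policy P assigns to `node`."""
    return {fact for fact in instance if node in policy(fact)}

def parallel_correct_on(query, policy, nodes, instance):
    """Check Q(I) == union over all nodes of Q(dist_{P,I}(node)) for one I."""
    union = set()
    for node in nodes:
        union |= query(dist(policy, instance, node))
    return union == query(instance)

NODES = (0, 1)

def by_source(fact):
    """Send R(x, y) to the node chosen by its source value x only."""
    return {fact[0] % 2}

def by_endpoints(fact):
    """Send R(x, y) to the nodes chosen by both x and y, so the two edges
    of any path x -> y -> z meet at the node chosen by y."""
    return {fact[0] % 2, fact[1] % 2}

I = {(1, 2), (2, 3)}
assert not parallel_correct_on(paths, by_source, NODES, I)
assert parallel_correct_on(paths, by_endpoints, NODES, I)
```

Under `by_source` the two edges of the path 1 → 2 → 3 land on different nodes, so no node can derive (1, 3) and a reshuffle would be needed. Under `by_endpoints` both edges meet at the node chosen by the join value, at the cost of replicating facts.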

36 / 46

slide-73
SLIDE 73

Parallel-Correctness

Complexity

CQ: $\Pi_2^p$-complete
UCQ: $\Pi_2^p$-complete
UCQ≠: $\Pi_2^p$-complete
FO: undecidable

(needs policy representation)

37 / 46

slide-75
SLIDE 75

Other use of Parallel-Correctness

[Figure labels: Parallel-correct for Q1; Parallel-correct for Q2]

38 / 46

slide-76
SLIDE 76

Possible Problems

▶ Reasoning about distribution policies is complex
▶ Not every distribution policy is equally efficient
▶ Choice of policy may be hidden behind abstraction layer
▶ Reasoning about query order before policies are known

39 / 46

slide-79
SLIDE 79

Transferability

Given target query and previously computed query: Do we need to reshuffle?

Definition

Q →T Q′ iff Q′ is parallel-correct on every distribution policy P on which Q is parallel-correct.
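
On a very small universe the definition can be explored by brute force. The sketch below is a bounded sanity check only, under loudly stated assumptions: two values, two nodes, one binary relation, and explicit policies given as dictionaries. It does not decide transferability in general. The queries `paths` and `loops` and the function names are hypothetical.

```python
from itertools import combinations, product

VALUES = (0, 1)
NODES = (0, 1)
FACTS = tuple(product(VALUES, repeat=2))   # all possible edges R(x, y)

def subsets(items):
    """All subsets of `items`, as a list of sets."""
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def paths(edges):
    """Q(x, z) :- R(x, y), R(y, z)."""
    return {(x, z) for (x, y) in edges for (y2, z) in edges if y == y2}

def loops(edges):
    """Q'(x) :- R(x, x)."""
    return {(x,) for (x, y) in edges if x == y}

def parallel_correct(query, policy):
    """Parallel-correctness over every database within the small universe."""
    for instance in subsets(FACTS):
        union = set()
        for node in NODES:
            union |= query({f for f in instance if node in policy[f]})
        if union != query(instance):
            return False
    return True

def transfers(q, q_prime):
    """Bounded check of Q ->T Q': enumerate every policy on the universe."""
    for choice in product(subsets(NODES), repeat=len(FACTS)):
        policy = dict(zip(FACTS, choice))
        if parallel_correct(q, policy) and not parallel_correct(q_prime, policy):
            return False
    return True

assert transfers(paths, loops)       # no reshuffle needed before loops
assert not transfers(loops, paths)   # a policy good for loops can break paths
```

The asymmetry matches the intuition behind the definition. Any policy that is parallel-correct for `paths` must place every loop fact on some node, which is all `loops` needs. A policy that is parallel-correct for `loops` may still separate the two edges of a path.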

41 / 46

slide-81
SLIDE 81

Transferability

Given target query and previously computed query: Do we need to reshuffle?

[Figure labels: Parallel-correct for Q1; Parallel-correct for Q2]

Very strong property: like query containment, but for the parallel and distributed setting

41 / 46

slide-83
SLIDE 83

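Complexity

Query class: Parallel-correctness∗ / Transferability

CQ: $\Pi_2^p$-complete / $\Pi_3^p$-complete
UCQ: $\Pi_2^p$-complete / $\Pi_3^p$-complete
UCQ≠: $\Pi_2^p$-complete / $\Pi_3^p$-complete
FO: undecidable / undecidable
sm-CQ: X / NP-complete

(∗needs policy representation)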

[Ameloot,Geck,Ketsman,Neven,Schwentick PODS 2015 best paper; Sigmod record 2016; CACM 2017; JACM (accepted)]

42 / 46

slide-85
SLIDE 85

Goal: To formally reason about single-round query evaluation and multi-query optimization in systems where communication implies a synchronization barrier

▶ A formal framework for reasoning about the correctness of single-round query evaluation and query optimization via distribution policies.
▶ Parallel-correctness: semantic correctness
▶ Transferability: like “containment” but for parallel and distributed query evaluation

slide-86
SLIDE 86

Open Questions & Future Work

▶ How do parallel-correctness and transferability relate to query evaluation in practice?
▶ How much data needs to be reshuffled?

Related Work

▶ Parallel-correctness for CQs with negation [Geck, Ketsman, Neven, Schwentick ICDT 2016]
▶ Extension of parallel-correctness to reason about multi-round evaluation with Datalog [Ketsman, Koutris, Albarghouthi, submitted]
▶ Bag-semantics (ongoing)

45 / 46

slide-88
SLIDE 88

Thank you!