Parallel-Correctness and Transferability for Conjunctive Queries

SLIDE 1

Parallel-Correctness and Transferability for Conjunctive Queries

Tom J. Ameloot (1), Gaetano Geck (2), Bas Ketsman (1), Frank Neven (1), Thomas Schwentick (2)

(1) Hasselt University, (2) Dortmund University

SLIDE 2

Big Data

“Too large for one server”

Several systems: Hadoop, Spark, ... many others

Common Strategy
◮ Data is distributed
◮ Query evaluation: Multiple rounds with reshuffling

SLIDE 3

Simple Evaluation Algorithm

1-Round MPC model [Koutris & Suciu 2011]

Modeled by a distribution policy P

[Figure: the servers after redistribution, each evaluating Q]

Input = query Q
Step 1: Redistribution
Step 2: Evaluate Q at each server
Output = union of output at each server

SLIDE 4

Main Problems

Semantic correctness: When is the simple algorithm correct on a distribution policy? (Parallel-Correctness)

Multiple-query optimization: Which queries can reuse the distribution obtained for another query? (Transferability)

Formal framework for reasoning about correctness of query evaluation and optimization in a distributed setting

SLIDE 5

Outline

1. Definitions
2. Parallel-Correctness
3. Transferability
4. Lowering the Complexity
5. Conclusion & Future Work

SLIDE 6

Definitions

◮ Database schema
◮ Infinite set of data values (dom)
◮ Instance I is a finite set of facts $R(d_1, \ldots, d_n)$
◮ Conjunctive Query: $T(\bar{x}) \leftarrow R_1(\bar{y}_1), \ldots, R_m(\bar{y}_m)$

SLIDE 7

Distribution Policies

Network N is a finite set of nodes

[Figure: a policy P that sends all R-facts to one node and all S-facts to another node]

Definition: A distribution policy P is a total function mapping facts (over dom) to sets of nodes in N

SLIDE 8

Distribution Policies

Network N is a finite set of nodes

Instance I = {R(a, b), R(b, a), S(a)}
Policy P: all R-facts go to node 1, all S-facts go to node 2

$\mathit{dist}_{P,I}(1)$ = {R(a, b), R(b, a)}
$\mathit{dist}_{P,I}(2)$ = {S(a)}

$\mathit{dist}_{P,I}$ = distribution of I based on P

SLIDE 9

Hypercube

◮ Invented in the context of Datalog evaluation [Ganguly, Silberschatz & Tsur 1990]
◮ Described in Map-Reduce context [Afrati & Ullman 2010]
◮ Intensively studied [Beame, Koutris & Suciu 2014]

Algorithm:
◮ Reshuffling based on structure of Q
◮ Partitioning of complete valuations over servers in an instance-independent way through hashing of domain values
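One possible reading of this reshuffling step is sketched below. It assumes that each variable of Q gets a number of hash buckets (`shares`), that servers are identified with coordinate tuples, and that CRC32 serves as the hash function; all of these names and choices are illustrative assumptions rather than details given in the slides.

```python
import zlib
from itertools import product


def bucket(value, size):
    """Deterministic hash of a domain value into range(size); CRC32 is an arbitrary choice."""
    return zlib.crc32(repr(value).encode()) % size


def hypercube_policy(body, shares):
    """Build a policy for the atoms in `body`; `shares` maps each variable to its bucket count.

    Nodes are coordinate tuples with one coordinate per variable (in sorted order).
    """
    variables = sorted(shares)

    def policy(fact):
        rel, values = fact
        nodes = set()
        for atom_rel, args in body:
            if atom_rel != rel or len(args) != len(values):
                continue
            binding = {}
            # The fact matches the atom only if repeated variables receive equal values.
            if any(binding.setdefault(v, d) != d for v, d in zip(args, values)):
                continue
            # Bound variables fix a coordinate; unbound ones range over all buckets.
            axes = [[bucket(binding[v], shares[v])] if v in binding else range(shares[v])
                    for v in variables]
            nodes.update(product(*axes))
        return nodes

    return policy


# Hypothetical query body R(x, y), S(y, z) on a 2 x 3 x 2 grid of servers.
body = [("R", ("x", "y")), ("S", ("y", "z"))]
P = hypercube_policy(body, {"x": 2, "y": 3, "z": 2})
print(len(P(("R", ("a", "b")))))  # replicated along the z axis only
```

In this sketch the facts required by any valuation V meet at the server whose coordinates are the buckets of V(x), V(y), V(z), and the assignment never looks at the instance.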

SLIDE 10

Simple Evaluation Algorithm

Input = query Q
Step 1: distribute data over servers w.r.t. P
Step 2: evaluate Q at each server
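The two steps can be sketched as follows. This is a brute-force illustration under assumed encodings (a query is a pair of a head atom and a list of body atoms, a fact is a relation name with a tuple of values); the function names and the example query are hypothetical.

```python
from itertools import product


def evaluate(query, instance):
    """Evaluate a conjunctive query by trying every valuation over the active domain."""
    head, body = query
    variables = sorted({v for _, args in body for v in args})
    domain = sorted({d for _, vals in instance for d in vals})
    out = set()
    for values in product(domain, repeat=len(variables)):
        val = dict(zip(variables, values))
        if all((rel, tuple(val[v] for v in args)) in instance for rel, args in body):
            out.add((head[0], tuple(val[v] for v in head[1])))
    return out


def simple_evaluation(query, instance, policy, nodes):
    """Step 1: distribute the facts w.r.t. the policy. Step 2: evaluate locally, then union."""
    result = set()
    for node in nodes:
        local = {f for f in instance if node in policy(f)}
        result |= evaluate(query, local)
    return result


# Hypothetical query T(x) <- R(x, y), S(x) over an instance with R-facts and one S-fact.
Q = (("T", ("x",)), [("R", ("x", "y")), ("S", ("x",))])
I = {("R", ("a", "b")), ("R", ("b", "a")), ("S", ("a",))}
split_by_relation = lambda fact: {1} if fact[0] == "R" else {2}

print(evaluate(Q, I))                                       # {('T', ('a',))}
print(simple_evaluation(Q, I, split_by_relation, {1, 2}))   # set()
```

With R-facts and S-facts on different nodes, no single server sees both facts needed to derive T(a), so the distributed result misses it.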

SLIDE 11

Parallel-Correctness

Definition: Q is parallel-correct on I w.r.t. P iff

$$Q(I) = \bigcup_{\kappa \in N} Q(\mathit{dist}_{P,I}(\kappa))$$

(the inclusion ⊇ always holds, by monotonicity)

Definition (w.r.t. all instances): Q is parallel-correct w.r.t. P iff Q is parallel-correct on every I w.r.t. P
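The inclusion that holds by monotonicity can be spelled out in two lines. This short derivation assumes only that every local instance is a subset of I and that conjunctive queries are monotone.

```latex
\begin{align*}
\mathit{dist}_{P,I}(\kappa) \subseteq I
  &\;\Longrightarrow\; Q(\mathit{dist}_{P,I}(\kappa)) \subseteq Q(I)
  && \text{for every } \kappa \in N \text{ (monotonicity of } Q\text{)} \\
  &\;\Longrightarrow\; \bigcup_{\kappa \in N} Q(\mathit{dist}_{P,I}(\kappa)) \subseteq Q(I)
\end{align*}
```

Parallel-correctness therefore only asks for the converse inclusion: every fact of Q(I) must be derived at some node.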

SLIDE 12

Parallel-Correctness

Sufficient Condition

(C0) for every valuation V for Q, $\bigcap_{f \in V(\mathit{body}_Q)} P(f) \neq \emptyset$.

Intuition: Facts required by a valuation meet at some node

Lemma: (C0) implies that Q is parallel-correct w.r.t. P.

(C0) is not necessary.

SLIDE 13

(C0) not Necessary

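Example

Distribution policy P over two nodes: one node receives all facts except R(a, b), the other receives all facts except R(b, a)

Query Q: T(x, z) ← R(x, y), R(y, z), R(x, x)

V = {x, z → a, y → b}
Requires: R(a, b), R(b, a), R(a, a)
Derives: T(a, a)
The required facts do not meet at any node

V′ = {x, y, z → a}
Requires: R(a, a)
Derives: T(a, a)

Both valuations derive the same fact T(a, a), and V′ needs only R(a, a), which is present at both nodes. So T(a, a) is still derived although (C0) fails for V.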
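The example can be replayed in a few lines. The sketch below hard-codes the query and the two-node policy from the example; the encoding of facts and the helper names are assumptions made for illustration.

```python
from itertools import product

HEAD = ("T", ("x", "z"))
BODY = [("R", ("x", "y")), ("R", ("y", "z")), ("R", ("x", "x"))]


def apply_atom(atom, val):
    """Instantiate an atom under a valuation, yielding a fact."""
    rel, args = atom
    return (rel, tuple(val[v] for v in args))


def policy(fact):
    """Node 1 holds all facts except R(a, b); node 2 holds all facts except R(b, a)."""
    nodes = {1, 2}
    if fact == ("R", ("a", "b")):
        nodes.discard(1)
    if fact == ("R", ("b", "a")):
        nodes.discard(2)
    return nodes


def evaluate(instance):
    """Brute-force evaluation of Q over the active domain of the instance."""
    domain = sorted({d for _, vals in instance for d in vals})
    out = set()
    for x, y, z in product(domain, repeat=3):
        val = {"x": x, "y": y, "z": z}
        if all(apply_atom(a, val) in instance for a in BODY):
            out.add(apply_atom(HEAD, val))
    return out


V = {"x": "a", "z": "a", "y": "b"}
required = [apply_atom(a, V) for a in BODY]
meeting_nodes = set.intersection(*(policy(f) for f in required))

I = set(required)
local_results = set()
for node in (1, 2):
    local_results |= evaluate({f for f in I if node in policy(f)})

print(meeting_nodes)                 # set(): (C0) fails for V
print(local_results == evaluate(I))  # True: Q is still correct on this instance
```

The check covers only the single instance made of the three required facts; it illustrates the example and is not a proof for every instance.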

SLIDE 14

Parallel-Correctness

Characterization

Lemma: Q is parallel-correct w.r.t. P iff

(C1) for every minimal valuation V for Q, $\bigcap_{f \in V(\mathit{body}_Q)} P(f) \neq \emptyset$.

Definition: V is minimal if no V′ exists where $V'(\mathit{head}_Q) = V(\mathit{head}_Q)$ and $V'(\mathit{body}_Q) \subsetneq V(\mathit{body}_Q)$.
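Minimality of a single valuation can be tested by brute force, in line with the definition. The sketch below uses an assumed encoding of atoms and facts; it searches only over the values that occur in V's own facts, which suffices because any competing valuation must map the body into a subset of those facts.

```python
from itertools import product


def apply_atom(atom, val):
    """Instantiate an atom under a valuation, yielding a fact."""
    rel, args = atom
    return (rel, tuple(val[v] for v in args))


def is_minimal(head, body, val):
    """True iff no valuation derives the same head fact from a strict subset of the facts."""
    facts = {apply_atom(a, val) for a in body}
    head_fact = apply_atom(head, val)
    variables = sorted({v for _, args in body for v in args})
    domain = sorted({d for _, vals in facts for d in vals})
    for values in product(domain, repeat=len(variables)):
        other = dict(zip(variables, values))
        other_facts = {apply_atom(a, other) for a in body}
        if apply_atom(head, other) == head_fact and other_facts < facts:
            return False
    return True


# Query T(x, z) <- R(x, y), R(y, z), R(x, x)
head = ("T", ("x", "z"))
body = [("R", ("x", "y")), ("R", ("y", "z")), ("R", ("x", "x"))]

print(is_minimal(head, body, {"x": "a", "y": "b", "z": "a"}))  # False
print(is_minimal(head, body, {"x": "a", "y": "a", "z": "a"}))  # True
```

The search is exponential in the number of variables, so this is an illustration of the definition and not an efficient procedure.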

SLIDE 15

Parallel-Correctness

Example

Query Q: T(x, z) ← R(x, y), R(y, z), R(x, x)

V = {x, z → a, y → b}
Requires: R(a, b), R(b, a), R(a, a)
Derives: T(a, a)

V′ = {x, y, z → a}
Requires: R(a, a)
Derives: T(a, a)

V′ is minimal; V is not, since V′ derives the same fact from a strict subset of the facts required by V.

Notice: Q is a minimal CQ
A CQ is minimal iff its injective valuations are minimal

Proposition: Testing whether a valuation is minimal is coNP-complete.

SLIDE 16

Parallel-Correctness

Complexity

Theorem: Deciding whether Q is parallel-correct w.r.t. P is $\Pi^P_2$-complete.

Proof:
◮ Lower bound: Reduction from $\Pi_2$-QBF
◮ Upper bound: Characterization, but requires proper formalization of P

SLIDE 17

Outline

1. Definitions
2. Parallel-Correctness
3. Transferability
4. Lowering the Complexity
5. Conclusion & Future Work

SLIDE 18

Computing Multiple Queries

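[Figure: the data is redistributed and Q is evaluated at every server, giving Q(I); the data is then redistributed again and Q′ is evaluated at every server, giving Q′(I); and so on]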

SLIDE 19

Computing Multiple Queries

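[Figure: the data is redistributed and Q is evaluated at every server, giving Q(I); Q′ is then evaluated on the same distribution, with no reshuffling, giving Q′(I); and so on]

When can Q′ be evaluated on the distribution used for Q?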

SLIDE 20

Transferability

Definition: Q →T Q′ iff Q′ is parallel-correct w.r.t. every distribution policy P for which Q is parallel-correct

Example

Q : T() ← R(x, y), R(y, z), R(z, w)
Q′ : N() ← R(x, y), R(y, x)

[Figure: graphs over the values a, b, c, d depicting the facts required by valuations of Q and Q′]

Q →T Q′

SLIDE 21

Transferability

Characterization & Complexity

Lemma: Q →T Q′ iff

(C2) for every minimal valuation V′ for Q′ there is a minimal valuation V for Q such that $V'(\mathit{body}_{Q'}) \subseteq V(\mathit{body}_Q)$.

Based on query structure alone, not on distribution policies
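Condition (C2) can be checked by brute force on small queries. The sketch below rests on an assumption that is not stated in the slides: since valuations can be renamed injectively without affecting minimality, it should suffice to draw V′ from a pool with as many values as Q′ has variables, and V from that pool extended by as many fresh values as Q has variables. The encoding and all function names are illustrative.

```python
from itertools import product


def apply_atom(atom, val):
    """Instantiate an atom under a valuation, yielding a fact."""
    rel, args = atom
    return (rel, tuple(val[v] for v in args))


def variables_of(query):
    return sorted({v for _, args in query[1] for v in args})


def valuations(query, domain):
    names = variables_of(query)
    for values in product(domain, repeat=len(names)):
        yield dict(zip(names, values))


def body_facts(query, val):
    return frozenset(apply_atom(a, val) for a in query[1])


def is_minimal(query, val):
    """True iff no valuation derives the same head fact from a strict subset of the facts."""
    head = query[0]
    facts = body_facts(query, val)
    domain = sorted({d for _, vals in facts for d in vals})
    return not any(
        apply_atom(head, other) == apply_atom(head, val)
        and body_facts(query, other) < facts
        for other in valuations(query, domain)
    )


def transfers(q, q_prime):
    """Brute-force check of (C2) over a finite pool of values (assumption stated above)."""
    small = list(range(len(variables_of(q_prime))))
    large = list(range(len(variables_of(q_prime)) + len(variables_of(q))))
    minimal_q = {body_facts(q, v) for v in valuations(q, large) if is_minimal(q, v)}
    return all(
        any(body_facts(q_prime, vp) <= facts for facts in minimal_q)
        for vp in valuations(q_prime, small)
        if is_minimal(q_prime, vp)
    )


# Q: T() <- R(x, y), R(y, z), R(z, w)   and   Q': N() <- R(x, y), R(y, x)
Q = (("T", ()), [("R", ("x", "y")), ("R", ("y", "z")), ("R", ("z", "w"))])
Q_PRIME = (("N", ()), [("R", ("x", "y")), ("R", ("y", "x"))])

print(transfers(Q, Q_PRIME))  # True
print(transfers(Q_PRIME, Q))  # False
```

The reverse direction fails because a minimal valuation of Q can require three distinct facts, while no valuation of Q′ requires more than two.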

SLIDE 22

Transferability

Characterization & Complexity

Lemma: Q →T Q′ iff

(C2) for every minimal valuation V′ for Q′ there is a minimal valuation V for Q such that $V'(\mathit{body}_{Q'}) \subseteq V(\mathit{body}_Q)$.

Theorem: Deciding Q →T Q′ is $\Pi^P_3$-complete.

◮ Lower bound: Reduction from $\Pi_3$-QBF
◮ Upper bound: Characterization

SLIDE 23

Outline

1. Definitions
2. Parallel-Correctness
3. Transferability
4. Lowering the Complexity
5. Conclusion & Future Work

SLIDE 24

Strongly Minimal CQs

Definition: A CQ is strongly minimal if all its valuations are minimal

◮ Full CQs: T(x, y) ← R(x, y), R(x, x)
◮ CQs without self-joins: T() ← R(x, y), S(x, x)
◮ Hybrids: T(y) ← R(x, y), R(x, x), R(z, x), S(z)

A minimal CQ is not always strongly minimal
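Strong minimality of small queries can be checked by testing every valuation for minimality. The sketch below assumes that a pool with as many values as the query has variables covers all valuations up to renaming; the encoding and function names are illustrative.

```python
from itertools import product


def apply_atom(atom, val):
    """Instantiate an atom under a valuation, yielding a fact."""
    rel, args = atom
    return (rel, tuple(val[v] for v in args))


def is_minimal(head, body, val):
    """True iff no valuation derives the same head fact from a strict subset of the facts."""
    facts = {apply_atom(a, val) for a in body}
    head_fact = apply_atom(head, val)
    variables = sorted({v for _, args in body for v in args})
    domain = sorted({d for _, vals in facts for d in vals})
    for values in product(domain, repeat=len(variables)):
        other = dict(zip(variables, values))
        other_facts = {apply_atom(a, other) for a in body}
        if apply_atom(head, other) == head_fact and other_facts < facts:
            return False
    return True


def is_strongly_minimal(head, body):
    """Check minimality of every valuation over a pool of len(variables) values (assumption)."""
    variables = sorted({v for _, args in body for v in args})
    pool = range(len(variables))
    return all(
        is_minimal(head, body, dict(zip(variables, values)))
        for values in product(pool, repeat=len(variables))
    )


full_cq = (("T", ("x", "y")), [("R", ("x", "y")), ("R", ("x", "x"))])
no_self_join = (("T", ()), [("R", ("x", "y")), ("S", ("x", "x"))])
hybrid = (("T", ("y",)), [("R", ("x", "y")), ("R", ("x", "x")), ("R", ("z", "x")), ("S", ("z",))])
not_strong = (("T", ("x", "z")), [("R", ("x", "y")), ("R", ("y", "z")), ("R", ("x", "x"))])

for name, (head, body) in [("full", full_cq), ("no self-join", no_self_join),
                           ("hybrid", hybrid), ("T(x, z) query", not_strong)]:
    print(name, is_strongly_minimal(head, body))
```

The last query, T(x, z) ← R(x, y), R(y, z), R(x, x), is a minimal CQ with a non-minimal valuation, so it illustrates the closing remark.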

SLIDE 25

Strongly Minimal CQs

Lemma: Deciding whether Q is strongly minimal is coNP-complete

Theorem: Deciding Q →T Q′ is NP-complete for strongly minimal Q

SLIDE 26

Hypercube

Algorithm:
◮ Reshuffling based on structure of Q
◮ Partitioning of complete valuations over servers in an instance-independent way through hashing of domain values

H(Q) = family of Hypercube policies for Q.

Definition: Q →H Q′ iff Q′ is parallel-correct w.r.t. every P ∈ H(Q).

SLIDE 27

Hypercube

Two properties:
◮ Q-generous: for every valuation, facts meet on some node (∀P ∈ H(Q))
◮ Q-scattered: there is a policy scattering facts in such a way that no facts meet by coincidence (∀I)

Theorem: Deciding whether Q →H Q′ is NP-complete (also when Q or Q′ is acyclic)

SLIDE 28

Related Concepts

Containment

Q ⊆ Q′
Lemma: Containment and transferability are incomparable

Determinacy (Data-Integration)

Q′(I) = Q′(J) implies Q(I) = Q(J), for every I, J
Lemma: Determinacy and transferability are incomparable

SLIDE 29

Summary

Formal framework for reasoning about correctness of query evaluation and optimization in a distributed setting

Main concepts:
◮ Parallel-correctness
◮ Transferability

Independent of expression mechanism

SLIDE 30

Future Work

Expression formalism for distribution policies
◮ Other than Hypercube?

Distribution policy for a set of queries
◮ Given a CQ: which distribution policy? Hypercube
◮ Given a set of CQs: which distribution policy? Open question

SLIDE 31

Future Work

Tractable results
◮ Other classes of queries?
◮ Other families of distribution policies?

More expressive classes of queries
◮ This work: CQs
◮ FO: undecidable
◮ Initial results: UCQs, CQs with negation
