SLIDE 1 Optimizing Query Answering under Ontological Constraints
Giorgio Orsi (1,2) and Andreas Pieris (2)
(1) Institute for the Future of Computing, Oxford Martin School, University of Oxford
(2) Department of Computer Science, University of Oxford
VLDB 2011
SLIDE 2
Ontological Databases
[Diagram labels: Ontological Reasoning, DB Constraints, Ontological DB]
SLIDE 3
Ontological Databases
[Diagram: database D (ABox) and constraints Σ (TBox); labels: Ontological Reasoning, DB Constraints, Ontological DB]
SLIDE 4
Ontological Databases
[Diagram: database D (ABox) and constraints Σ (TBox); labels: Ontological Reasoning, DB Constraints, Ontological DB]
Q(X) ← ∃Y φ(X,Y)
SLIDE 5
Ontological Databases
[Diagram: database D (ABox) and constraints Σ (TBox); labels: Ontological Reasoning, DB Constraints, Ontological DB]
Q(X) ← ∃Y φ(X,Y)
{ t | D ∪ Σ ⊨ ∃u φ(t,u) }
SLIDE 6
Ontological Constraints (examples)
Concept inclusion: ∀X emp(X) → person(X)
(Inverse) relation inclusion: ∀X∀Y manages(X,Y) → isManaged(Y,X)
Relation transitivity: ∀X∀Y∀Z mgs(X,Y), mgs(Y,Z) → mgs(X,Z)
Participation: ∀X emp(X) → ∃Y report(X,Y)
Disjointness: ∀X emp(X), customer(X) → ⊥
Functionality: ∀X∀Y∀Z reports(X,Y), reports(X,Z) → Y = Z
SLIDE 7 Datalog+
Datalog± [Calì et al., PODS 09]
• Datalog variant allowing in the head:
  - ∃-variables → TGDs: ∀X∀Y φ(X,Y) → ∃Z ψ(X,Z)
  - Equality atoms → EGDs: ∀X φ(X) → Xi = Xj
  - Constant false (⊥) → NCs: ∀X φ(X) → ⊥
SLIDE 8 Datalog+
Datalog± [Calì et al., PODS 09]
• Datalog variant allowing in the head:
  - ∃-variables → TGDs: ∀X∀Y φ(X,Y) → ∃Z ψ(X,Z)
  - Equality atoms → EGDs: ∀X φ(X) → Xi = Xj
  - Constant false (⊥) → NCs: ∀X φ(X) → ⊥
• But query answering under Datalog+ is undecidable
SLIDE 9 Datalog+
Datalog± [Calì et al., PODS 09]
• Datalog variant allowing in the head:
  - ∃-variables → TGDs: ∀X∀Y φ(X,Y) → ∃Z ψ(X,Z)
  - Equality atoms → EGDs: ∀X φ(X) → Xi = Xj
  - Constant false (⊥) → NCs: ∀X φ(X) → ⊥
• But query answering under Datalog+ is undecidable
• Datalog+ is syntactically restricted → Datalog±
SLIDE 10 Datalog+
Datalog± [Calì et al., PODS 09]
• Datalog variant allowing in the head:
  - ∃-variables → TGDs: ∀X∀Y φ(X,Y) → ∃Z ψ(X,Z)
  - Equality atoms → EGDs: ∀X φ(X) → Xi = Xj
  - Constant false (⊥) → NCs: ∀X φ(X) → ⊥
• But query answering under Datalog+ is undecidable
• Datalog+ is syntactically restricted → Datalog±
• TGDs are more expressive than inclusion dependencies:
  ∀D∀P∀A runs(D,P), area(P,A) → ∃E employee(E,D,A)
SLIDE 11
The Chase Procedure
Input: Database D, set of TGDs Σ
Output: A model of D ∪ Σ
D: person(john)
Σ: ∀X person(X) → ∃Y father(Y,X)
   ∀X∀Y father(X,Y) → person(X)
chase(D,Σ) = D ∪ ?
SLIDE 12
The Chase Procedure
Input: Database D, set of TGDs Σ
Output: A model of D ∪ Σ
D: person(john)
Σ: ∀X person(X) → ∃Y father(Y,X)
   ∀X∀Y father(X,Y) → person(X)
chase(D,Σ) = D ∪ { father(z1,john), … }
SLIDE 13
The Chase Procedure
Input: Database D, set of TGDs Σ
Output: A model of D ∪ Σ
D: person(john)
Σ: ∀X person(X) → ∃Y father(Y,X)
   ∀X∀Y father(X,Y) → person(X)
chase(D,Σ) = D ∪ { father(z1,john), person(z1), … }
SLIDE 14
The Chase Procedure
Input: Database D, set of TGDs Σ
Output: A model of D ∪ Σ
D: person(john)
Σ: ∀X person(X) → ∃Y father(Y,X)
   ∀X∀Y father(X,Y) → person(X)
chase(D,Σ) = D ∪ { father(z1,john), person(z1), father(z2,z1), … }
SLIDE 15
The Chase Procedure
Input: Database D, set of TGDs Σ
Output: A model of D ∪ Σ
D: person(john)
Σ: ∀X person(X) → ∃Y father(Y,X)
   ∀X∀Y father(X,Y) → person(X)
chase(D,Σ) = D ∪ { father(z1,john), person(z1), father(z2,z1), … }
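A minimal Python sketch of the derivation on this slide. It hard-codes the two TGDs of the running example. The function name chase_person_father, the tuple encoding of facts, the null names z1, z2, ... and the bound max_rounds are assumptions made for illustration; the chase of this example is infinite, so some bound is needed to stop. This is not the authors' implementation.

```python
from itertools import count

# A fact is a tuple: (predicate, arg1, arg2, ...).
# Labelled nulls are the strings "z1", "z2", ... (naming borrowed from the slide).

def chase_person_father(database, max_rounds):
    """Apply the two example TGDs for at most `max_rounds` rounds (hypothetical helper)."""
    facts = set(database)
    fresh = count(1)
    for _ in range(max_rounds):
        new_facts = set()
        for fact in sorted(facts):
            if fact[0] == "person":
                x = fact[1]
                # person(X) -> exists Y father(Y, X): fire only if X has no father yet
                if not any(f[0] == "father" and f[2] == x for f in facts | new_facts):
                    new_facts.add(("father", f"z{next(fresh)}", x))
            elif fact[0] == "father":
                # father(X, Y) -> person(X)
                new_facts.add(("person", fact[1]))
        if new_facts <= facts:
            break  # nothing new was derived: a fixpoint was reached
        facts |= new_facts
    return facts

prefix = chase_person_father({("person", "john")}, max_rounds=3)
```

With three rounds the sketch yields person(john), father(z1,john), person(z1) and father(z2,z1), the facts listed on slide 14; raising the bound keeps extending the chain.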
SLIDE 16 Query Answering via Chase
[see, e.g., Deutsch, Nash & Remmel, PODS 08]
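D ∪ Σ ⊨ Q ⇔ chase(D,Σ) ⊨ Q
[Diagram: C = chase(D,Σ) is mapped by homomorphisms h1, h2, . . . to h1(C), h2(C) inside the models M1, M2, . . . of D ∪ Σ; Q is mapped into C by a homomorphism h]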
SLIDE 17
Query Answering via Rewriting
[Diagram: query Q]
SLIDE 18
Query Answering via Rewriting
[Diagram: Q is compiled into a rewritten query QΣ (compilation)]
SLIDE 19
Query Answering via Rewriting
[Diagram: Q is compiled into a rewritten query QΣ (compilation); QΣ is then evaluated over D (evaluation)]
SLIDE 20
Chase vs Rewriting
SLIDE 21
Linear TGDs
∀X∀Y r(X,Y) → ∃Z ψ(X,Z) (single body atom)
• Properly generalize inclusion dependencies.
• Enjoy the bounded-derivation depth property.
• FO-rewritable: query answering is in AC0 (data complexity).
SLIDE 22
FO-rewritability: example [Gottlob et al., ICDE 11]
Σ: promoter(X) → ∃Y promotesTo(X,Y)
   promotesTo(X,Y) → customer(Y)
Q: q ← promotesTo(A,B), customer(B) (original query)
QΣ: q ← promotesTo(A,B), customer(B)
SLIDE 23
FO-rewritability: example [Gottlob et al., ICDE 11]
Σ: promoter(X) → ∃Y promotesTo(X,Y)
   promotesTo(X,Y) → customer(Y)
Q: q ← promotesTo(A,B), customer(B)
Rewriting step with { Y = B } (V0 is fresh):
   q ← promotesTo(A,B), customer(B)
   ⇒ q ← promotesTo(A,B), promotesTo(V0,B)
QΣ: q ← promotesTo(A,B), customer(B)
SLIDE 24
FO-rewritability: example [Gottlob et al., ICDE 11]
Σ: promoter(X) → ∃Y promotesTo(X,Y)
   promotesTo(X,Y) → customer(Y)
Q: q ← promotesTo(A,B), customer(B)
Factorization with { A = V0 }:
   q ← promotesTo(A,B), promotesTo(V0,B)
   ⇒ q ← promotesTo(A,B)
QΣ: q ← promotesTo(A,B), customer(B)
SLIDE 25
FO-rewritability: example [Gottlob et al., ICDE 11]
Σ: promoter(X) → ∃Y promotesTo(X,Y)
   promotesTo(X,Y) → customer(Y)
Q: q ← promotesTo(A,B), customer(B)
Rewriting step with { X = A, Y = B }:
   q ← promotesTo(A,B)
   ⇒ q ← promoter(A)
QΣ: q ← promotesTo(A,B), customer(B)
    q ← promotesTo(A,B)
SLIDE 26
FO-rewritability: example [Gottlob et al., ICDE 11]
Σ: promoter(X) → ∃Y promotesTo(X,Y)
   promotesTo(X,Y) → customer(Y)
Q: q ← promotesTo(A,B), customer(B)
QΣ, the UCQ rewriting (first-order):
    q ← promoter(A)
    q ← promotesTo(A,B), customer(B)
    q ← promotesTo(A,B)
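To make the point of the rewriting concrete, the following Python sketch evaluates the three conjunctive queries of the UCQ directly over a database, without using the TGDs. The helper names (is_var, holds, ucq_holds), the tuple encoding, and the sample constant alice are assumptions made for illustration; a real system would ship the UCQ to a DBMS as SQL instead.

```python
def is_var(term):
    # Convention assumed here: variables start with an upper-case letter.
    return term[:1].isupper()

def holds(body, facts, binding=None):
    """Return True if the conjunction `body` can be matched against `facts`."""
    binding = binding or {}
    if not body:
        return True
    (pred, *args), rest = body[0], body[1:]
    for fact in facts:
        if fact[0] != pred or len(fact) != len(args) + 1:
            continue
        extended = dict(binding)
        # setdefault binds an unbound variable, or returns its current value
        matched = all(
            (extended.setdefault(t, v) == v) if is_var(t) else (t == v)
            for t, v in zip(args, fact[1:])
        )
        if matched and holds(rest, facts, extended):
            return True
    return False

# The three CQs of the rewriting, as bodies of the Boolean query q.
UCQ = [
    [("promoter", "A")],
    [("promotesTo", "A", "B"), ("customer", "B")],  # the original query
    [("promotesTo", "A", "B")],
]

def ucq_holds(ucq, facts):
    return any(holds(cq, facts) for cq in ucq)

D = {("promoter", "alice")}
answer = ucq_holds(UCQ, D)
```

Over this D the original query alone fails, because D holds no promotesTo or customer facts, while the rewriting succeeds through q ← promoter(A). That is the same answer the chase of D would give.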
SLIDE 27
FO-rewritability
• Desirable properties of an FO-rewriting:
  - independent of the DB
  - executable by any DBMS
  - easy to compute (e.g., polynomial time)
  - small size (e.g., polynomial size)
SLIDE 28
FO-rewritability
• Desirable properties of an FO-rewriting:
  - independent of the DB
  - executable by any DBMS
  - easy to compute (e.g., polynomial time)
  - small size (e.g., polynomial size)
• Unions of Conjunctive Queries (UCQs)
  - executable by any DBMS
  - DB independent
  - easy to optimize and distribute
  - worst-case exponential size in Q and Σ
  [Calvanese et al., JAR 07; Perez Urbina et al., JAL 09; Calì et al., PODS 09; Gottlob et al., ICDE 11; and others…]
SLIDE 29
FO-rewritability
• Combined and hybrid FO-rewriting
  - good computational properties (e.g., polynomial in size)
  - requires access to the DB
  [Perez Urbina et al., JAL 09; Kontchakov et al., KR 10; Gottlob and Schwentick, DL 11]
SLIDE 30
FO-rewritability
• Combined and hybrid FO-rewriting
  - good computational properties (e.g., polynomial in size)
  - requires access to the DB
  [Perez Urbina et al., JAL 09; Kontchakov et al., KR 10; Gottlob and Schwentick, DL 11]
• Purely intensional Datalog rewriting
  - very compressed representation
  - purely intensional
  - requires view-creation or a Datalog engine
  [Perez Urbina et al., JAL 09; Rosati and Almatelli, KR 10]
SLIDE 31
Datalog Rewriting: Keep it First-Order!
• A Datalog query is (in general) not a first-order query
  - a non-recursive Datalog query is a first-order query
  - a bounded Datalog query is a first-order query
SLIDE 32
Datalog Rewriting: Keep it First-Order!
• A Datalog query is (in general) not a first-order query
  - a non-recursive Datalog query is a first-order query
  - a bounded Datalog query is a first-order query
• Input:
  - a (w.l.o.g. Boolean) conjunctive query Q = <q,ρ>
    e.g., Q : q(X) ← p(X), s(X,Y) is <q, q(X) ← p(X), s(X,Y)>
  - a set Σ of linear TGDs
• Output: a bounded Datalog query QΣ = <q,π>
SLIDE 33
Datalog Rewriting: skolemization (and renaming)
Σ: r(X,Y) → ∃Z s(Y,Z)
   s(X,Y) → ∃Z p(Y,Y,Z)
   p(X,Y,Z) → t(Z)
SLIDE 34
Datalog Rewriting: skolemization (and renaming)
Σ: r(X,Y) → ∃Z s(Y,Z)
   s(X,Y) → ∃Z p(Y,Y,Z)
   p(X,Y,Z) → t(Z)
Σf: r(X1,Y1) → s(Y1,f1(Y1))
    s(X2,Y2) → p(Y2,Y2,f2(Y2))
    p(X3,Y3,Z3) → t(Z3)
SLIDE 35
Datalog Rewriting: skolemization (and renaming)
Σ: r(X,Y) → ∃Z s(Y,Z)
   s(X,Y) → ∃Z p(Y,Y,Z)
   p(X,Y,Z) → t(Z)
Σf: r(X1,Y1) → s(Y1,f1(Y1))
    s(X2,Y2) → p(Y2,Y2,f2(Y2))
    p(X3,Y3,Z3) → t(Z3)
• Introduce one Skolem function for each existential variable
• Σf and Σ are equisatisfiable (not equivalent)
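The skolemization and renaming step can be sketched in Python as follows. The function name skolemize and the tuple encoding of atoms and Skolem terms are assumptions made for illustration. Following the slide's example, each Skolem function takes as arguments the variables shared by body and head; other formulations pass all body variables instead.

```python
def skolemize(tgds):
    """Skolemize and rename linear TGDs (illustrative sketch).

    Each TGD is (body_atom, head_atom, existential_vars); an atom is
    (predicate, tuple_of_variable_names).
    """
    rules, skolem_id = [], 0
    for i, ((b_pred, b_args), (h_pred, h_args), exist_vars) in enumerate(tgds, start=1):
        # Variables shared by body and head, in order of first occurrence in the head.
        frontier = [v for v in dict.fromkeys(h_args) if v in b_args]
        skolem = {}
        for z in exist_vars:
            skolem_id += 1
            # A Skolem term is encoded as a tuple: (function_name, arg1, ...).
            skolem[z] = (f"f{skolem_id}", *(f"{v}{i}" for v in frontier))
        # Renaming: every variable of the i-th rule gets the suffix i.
        body = (b_pred, tuple(f"{v}{i}" for v in b_args))
        head = (h_pred, tuple(skolem.get(v, f"{v}{i}") for v in h_args))
        rules.append((body, head))
    return rules

SIGMA = [
    (("r", ("X", "Y")), ("s", ("Y", "Z")), ["Z"]),
    (("s", ("X", "Y")), ("p", ("Y", "Y", "Z")), ["Z"]),
    (("p", ("X", "Y", "Z")), ("t", ("Z",)), []),
]
SIGMA_F = skolemize(SIGMA)
```

On the three TGDs of the slide this produces r(X1,Y1) → s(Y1,f1(Y1)), s(X2,Y2) → p(Y2,Y2,f2(Y2)) and p(X3,Y3,Z3) → t(Z3).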
SLIDE 36
Datalog Rewriting: Rule Saturation
• Apply the resolution inference rule to the rules in Σf
  - at least one of the rules contains Skolem terms
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
    δ3 : p(X3,Y3,Z3) → t(Z3)
SLIDE 37
Datalog Rewriting: Rule Saturation
• Apply the resolution inference rule to the rules in Σf
  - at least one of the rules contains Skolem terms
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
    δ3 : p(X3,Y3,Z3) → t(Z3)
[Σf]: … r(X1,Y1) → p(f1(Y1), f1(Y1), f2(f1(Y1))) …
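A small Python sketch of a single resolution step between two skolemized linear rules, here δ1 and δ2. It uses textbook first-order unification with an occurs check. The names (unify, substitute, resolve) and the flat tuple encoding are assumptions for illustration; the sketch performs one step only and does not reproduce the full saturation procedure of the paper.

```python
def is_var(t):
    # Convention assumed here: variables are strings starting with an upper-case letter.
    return isinstance(t, str) and t[:1].isupper()

def walk(t, s):
    # Follow variable bindings in the substitution s.
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])

def unify(a, b, s):
    """Return a substitution extending s that unifies a and b, or None."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return unify(b, a, s)
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None

def substitute(t, s):
    t = walk(t, s)
    if isinstance(t, tuple):
        return (t[0], *(substitute(a, s) for a in t[1:]))
    return t

def resolve(rule1, rule2):
    """Resolve the head of rule1 with the body of rule2; rules are (body, head)."""
    (body1, head1), (body2, head2) = rule1, rule2
    s = unify(body2, head1, {})
    if s is None:
        return None
    return substitute(body1, s), substitute(head2, s)

# Atoms and Skolem terms share one encoding: (symbol, arg1, arg2, ...).
d1 = (("r", "X1", "Y1"), ("s", "Y1", ("f1", "Y1")))
d2 = (("s", "X2", "Y2"), ("p", "Y2", "Y2", ("f2", "Y2")))
d12 = resolve(d1, d2)
```

Here d12 encodes r(X1,Y1) → p(f1(Y1), f1(Y1), f2(f1(Y1))), the derived rule shown on the slide; resolve(d2, d1) returns None because p and r do not unify.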
SLIDE 38
Datalog Rewriting: Properties of Rule Saturation
• [Σf] mimics the chase derivations.
SLIDE 39
Datalog Rewriting: Properties of Rule Saturation
• [Σf] mimics the chase derivations.
δ1 : r(X1,Y1) → s(Y1,f1(Y1))
δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
δ3 : p(X3,Y3,Z3) → t(Z3)
SLIDE 40
Datalog Rewriting: Properties of Rule Saturation
• [Σf] mimics the chase derivations.
• [Σf] depends only on Σ.
• [Σf] is possibly infinite
  - linear TGDs have the BDDP: it suffices to construct it up to k steps, [Σf]k.
δ1 : r(X1,Y1) → s(Y1,f1(Y1))
δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
δ3 : p(X3,Y3,Z3) → t(Z3)
SLIDE 41
Datalog Rewriting: Query Saturation
• resolve [Σf] with the query Q.
  - use only rules with Skolem terms.
SLIDE 42
Datalog Rewriting: Query Saturation
• resolve [Σf] with the query Q.
  - use only rules with Skolem terms.
[Σf]: …
  δ1 : r(X1,Y1) → s(Y1,f1(Y1))
  δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
  δ3 : p(X3,Y3,Z3) → t(Z3)
  …
  [δ1,δ2] : r(X1,Y1) → p(f1(Y1), f1(Y1), f2(f1(Y1)))
  …
Query: Q ← s(A,B), p(B,B,C)
[Q,Σf]: … Q ← r(X1,Y1), p(f1(Y1), f1(Y1), C) …
SLIDE 43
Datalog Rewriting: Query Saturation
• bypasses chase derivations with function symbols
[Σf]: …
  δ1 : r(X1,Y1) → s(Y1,f1(Y1))
  δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
  δ3 : p(X3,Y3,Z3) → t(Z3)
  …
  [δ1,δ2] : r(X1,Y1) → p(f1(Y1), f1(Y1), f2(f1(Y1)))
  …
Query: Q ← s(A,B), p(B,B,C)
SLIDE 44
Datalog Rewriting: Finalization
• keep only the function-free rules from [Σf] ∪ [Q,Σf]
• derivations producing certain answers are captured by function-symbol-free rules.
SLIDE 45
Optimizations: Pruning
• use the predicate graph to reduce the number of rules in Σf
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
    δ3 : p(X3,Y3,Z3) → t(Z3)
Query: Q ← s(A,B), p(B,B,C)
SLIDE 46
Optimizations: Pruning
• use the predicate graph to reduce the number of rules in Σf
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
    δ3 : p(X3,Y3,Z3) → t(Z3)
Query: Q ← s(A,B), p(B,B,C)
SLIDE 47
Optimizations: Pruning
• use the predicate graph to reduce the number of rules in Σf
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
    δ3 : p(X3,Y3,Z3) → t(Z3)
Query: Q ← s(A,B), p(B,B,C)
• we are no longer independent of Q!
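One plausible reading of this pruning step, sketched in Python: treat each linear rule as an edge from its body predicate to its head predicate, and keep only rules whose head predicate can reach a predicate of the query. The function name prune_rules and the reduction of each rule to a (body predicate, head predicate) pair are assumptions for illustration; the slides do not spell out the exact criterion.

```python
def prune_rules(rules, query_predicates):
    """Keep the rules whose head predicate reaches a query predicate.

    `rules` maps a rule name to (body_predicate, head_predicate).
    """
    relevant = set(query_predicates)
    changed = True
    while changed:
        changed = False
        for body_pred, head_pred in rules.values():
            # Walk the predicate graph backwards from the query predicates.
            if head_pred in relevant and body_pred not in relevant:
                relevant.add(body_pred)
                changed = True
    return {name: rule for name, rule in rules.items() if rule[1] in relevant}

# Predicate-level view of the three rules on the slide.
rules = {"d1": ("r", "s"), "d2": ("s", "p"), "d3": ("p", "t")}
kept = prune_rules(rules, {"s", "p"})
```

Under this reading δ3 is dropped for the query over s and p, since t does not occur in the query and feeds no predicate that does. The result depends on the query, which is the loss of independence noted on the slide.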
SLIDE 48
Optimizations: Query Elimination
• eliminate implied atoms during query saturation
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
Query: Q ← s(A,B), p(B,B,C)
SLIDE 49
Optimizations: Query Elimination
• eliminate implied atoms during query saturation
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
Query: Q ← s(A,B), p(B,B,C)
Atom coverage: s(A,B) ⊨ p(B,B,C) (w.r.t. Σ)
SLIDE 50
Optimizations: Query Elimination
• eliminate implied atoms during query saturation
Σf: δ1 : r(X1,Y1) → s(Y1,f1(Y1))
    δ2 : s(X2,Y2) → p(Y2,Y2,f2(Y2))
Query: Q ← s(A,B), p(B,B,C)
Atom coverage: s(A,B) ⊨ p(B,B,C) (w.r.t. Σ)
Hence: Q ← s(A,B), p(B,B,C) ≡ Q ← s(A,B) (w.r.t. Σ)
SLIDE 51
Optimizations: Query Elimination
• unique elimination strategy (w.r.t. the final size of the rewriting)
  - see paper.
• given m = |body(ρ)| and n = |Σ|
  - worst-case size of [Σf] ∪ [Q,Σf] is O((n∙m)^m)
  - worst-case size of QΣ = <q,π> is O((n+m)^m)
• atom coverage under linear TGDs can be checked in polynomial time
  - see paper.
SLIDE 52
Experimental Results
SLIDE 53
Discussion
• Datalog rewriting is substantially more compact than UCQ rewriting.
• It is unclear whether this always leads to an increase in performance.
• Extend the procedure to larger classes of TGDs
  - guarded TGDs [Calì et al., PODS 09] (non FO-rewritable)
  - sticky-join TGDs [Calì et al., VLDB 10]
SLIDE 54
The Datalog Family
Thomas Lukasiewicz, Georg Gottlob, Andreas Pieris, Andrea Calì, Giorgio Orsi
Thank you!