Inconsistency and Incompleteness in Data Integration: a Logic-based Approach

Andrea Calì
Università di Roma "La Sapienza"

CoLogNET Workshop "Logic-based methods for information integration"
Vienna, Austria, 23 August 2003



SLIDE 2

Joint work with:

  • Maurizio Lenzerini
  • Domenico Lembo
  • Riccardo Rosati

SLIDE 3

What is a data integration system?

  • Offers uniform access to a set of heterogeneous sources
  • The representation provided to the user is called the global schema
  • The user does not need to know anything about the data sources
  • When the user issues a query over the global schema, the system:
      1. determines which sources to query and how
      2. issues suitable queries to the sources
      3. assembles the results and provides the answer to the user

SLIDE 4

Logical architecture for a data integration system

[Figure: logical architecture. Labels: Application, Global Schema, three Sources, each with its Source structure.]

SLIDE 5

Software architecture for a data integration system

[Figure: software architecture. Labels: Application, Query, Global Schema, Mediator, three Wrappers, three Sources, each with its Source structure.]

SLIDE 6

Data integration system: formalisation

A data integration system I is a triple ⟨G, S, M⟩:

  • G: global schema
  • S: source schema
  • M: mapping

SLIDE 7

Our framework

  • global schema G: relational with integrity constraints (ICs)
  • source schema S: relational
  • mapping M: global-as-view (GAV), expressed in the language of unions of conjunctive queries (UCQs)

SLIDE 8

Example

Global schema:

  player(Pname, Pteam)
  team(Tname, Tcity)

Source schema:

  { s1/3, s2/2, s3/2 }

The GAV mapping associates with each relation in the global schema G a view over the source schema:

  player(X, Y) ← s1(X, Y, Z)
  player(X, Y) ← s3(X, Y)
  team(X, Y) ← s2(X, Y)

SLIDE 9

The role of integrity constraints (ICs)

ICs on the global schema:

  • enhance the expressiveness of the global schema
  • in general they are not satisfied by the data at the sources

ICs on the source schema:

  • represent local properties of data sources
  • we assume that the data at the sources satisfy the ICs expressed over the sources

⇒ ICs on the source schema are not considered

SLIDE 10

Constraints on the global schema

  1. key dependencies (KDs)

       key(r) = {A1, …, Ak}

  2. inclusion dependencies (IDs), a generalisation of foreign key dependencies

       r1[A1, …, Am] ⊆ r2[B1, …, Bm]
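As a minimal sketch of what the two kinds of constraints require, the following Python checks a KD and an ID over relations stored as sets of tuples. The function names, the 0-based positional encoding of attributes, and the sample data are assumptions made for illustration only.

```python
from typing import Iterable, Sequence, Tuple

Row = Tuple[str, ...]

def satisfies_kd(rel: Iterable[Row], key: Sequence[int]) -> bool:
    """True if no two distinct tuples of rel agree on all key positions."""
    seen = {}
    for row in rel:
        k = tuple(row[i] for i in key)
        if k in seen and seen[k] != row:
            return False
        seen[k] = row
    return True

def satisfies_id(r1: Iterable[Row], a: Sequence[int],
                 r2: Iterable[Row], b: Sequence[int]) -> bool:
    """True if r1[a] ⊆ r2[b]: each projection of r1 on a occurs in r2 on b."""
    targets = {tuple(row[i] for i in b) for row in r2}
    return all(tuple(row[i] for i in a) in targets for row in r1)

# Illustrative data (assumed): two players and one team.
player = {("figo", "realMadrid"), ("totti", "roma")}
team = {("realMadrid", "madrid")}

kd_ok = satisfies_kd(player, key=[0])          # key on the first attribute
id_ok = satisfies_id(player, [1], team, [0])   # second attr of player ⊆ first attr of team
```

Here the KD holds, while the ID fails because no team tuple has roma as its first attribute.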

SLIDE 11

Outline

  ♦ Introduction (done)
  ♦ Framework (done)

  • Reasoning on integrity constraints
  • Query rewriting for IDs alone
  • Query rewriting for KDs and IDs
  • Semantics for inconsistent data (loosely-sound)
  • Query rewriting under loosely-sound semantics
  • Complexity results

SLIDE 12

Reasoning about constraints

Given a source database D for a system I, a global database B is said to be legal if:

  1. it satisfies the ICs on the global schema
  2. it satisfies the mapping, i.e. B is a superset of the retrieved global database ret(I, D)

  • ret(I, D) is obtained by evaluating, for each relation in G, the mapping queries over the source database
  • assumption of sound mapping
  • in general there are several global databases that are legal for the system

SLIDE 13

Answers to queries under constraints

  • We are interested in certain answers.
  • A tuple t is a certain answer for a query Q if t is in the answer to Q for all (possibly infinite) legal databases.

  • The certain answers to Q are denoted by ans(Q, I, D).
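Written as a formula, the definition above reads as follows. The notation Q^B for the evaluation of Q over a global database B is an assumption introduced here for illustration; it does not appear on the slides.

```latex
\[
  \mathit{ans}(Q, \mathcal{I}, \mathcal{D})
  \;=\;
  \bigcap_{\mathcal{B}\ \text{legal for}\ \mathcal{I}\ \text{w.r.t.}\ \mathcal{D}} Q^{\mathcal{B}}
\]
```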

SLIDE 14

Example

Global schema:

  player(Pname, Pteam)
  team(Tname, Tcity)

Constraints:

  player[Pteam] ⊆ team[Tname]

Mapping:

  player(X, Y) ← s1(X, Y, Z)
  player(X, Y) ← s3(X, Y)
  team(X, Y) ← s2(X, Y)

SLIDE 15

Example (cont’d)

Source database D

  s1:  figo        realMadrid  31
  s2:  realMadrid  madrid
  s3:  totti       roma

SLIDE 16

Example (cont’d)

Retrieved global database ret(I, D)

  player:  figo        realMadrid
           totti       roma
  team:    realMadrid  madrid
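The retrieved global database can be computed by evaluating each mapping view over the source database and taking the union per global relation. The sketch below does this for the example; encoding views as Python functions and the name `retrieve` are assumptions made for illustration.

```python
# Source database D from the example (relation name -> set of tuples).
D = {
    "s1": {("figo", "realMadrid", "31")},
    "s2": {("realMadrid", "madrid")},
    "s3": {("totti", "roma")},
}

# GAV mapping: each global relation is a union of views over the sources.
# Each view is a function from the source database to a set of tuples.
mapping = {
    "player": [
        lambda d: {(x, y) for (x, y, _z) in d["s1"]},  # player(X, Y) ← s1(X, Y, Z)
        lambda d: set(d["s3"]),                         # player(X, Y) ← s3(X, Y)
    ],
    "team": [
        lambda d: set(d["s2"]),                         # team(X, Y) ← s2(X, Y)
    ],
}

def retrieve(mapping, d):
    """Evaluate every mapping view over d and union the results per relation."""
    return {rel: set().union(*(view(d) for view in views))
            for rel, views in mapping.items()}

ret = retrieve(mapping, D)
```

The result contains two player tuples and one team tuple, matching the table above.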

SLIDE 19

Example (cont’d)

Retrieved global database ret(I, D)

  player:  figo        realMadrid
           totti       roma
  team:    realMadrid  madrid
           roma        α

The ID on the global schema tells us that roma is the name of some team.

All legal global databases for I have at least the tuples shown above, where α is some value of the domain of the database.

Warning 1: there may be an infinite number of legal databases for I

Warning 2: in case of cyclic IDs, legal databases for I may be of infinite size

SLIDE 21

Example (cont’d)

For the retrieved global database and the ID shown above, consider the query

  q(X) ← team(X, Y)

  ans(q, I, D) = {realMadrid, roma}
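One way to reproduce this answer is a chase-style construction: add a team tuple with a fresh placeholder (playing the role of α) for every team name required by the ID, evaluate the query, and discard answers that contain a placeholder. This is only an illustration for this single, acyclic ID; the class `Null` and the function names are assumptions, and the slides themselves pursue query rewriting rather than materialisation.

```python
import itertools

ret = {
    "player": {("figo", "realMadrid"), ("totti", "roma")},
    "team": {("realMadrid", "madrid")},
}

class Null:
    """A labelled null: a placeholder such as α for an unknown value."""
    _ids = itertools.count(1)

    def __init__(self):
        self.label = f"_n{next(Null._ids)}"

    def __repr__(self):
        return self.label

def apply_id(db):
    """Enforce player[Pteam] ⊆ team[Tname] by adding team(t, fresh null)."""
    known = {t for (t, _city) in db["team"]}
    missing = {t for (_p, t) in db["player"]} - known
    team = db["team"] | {(t, Null()) for t in missing}
    return {"player": set(db["player"]), "team": team}

def q(db):
    """q(X) ← team(X, Y), keeping only answers that are not nulls."""
    return {x for (x, _y) in db["team"] if not isinstance(x, Null)}

answers = q(apply_id(ret))
```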

SLIDE 22

Query rewriting

Given a user query Q over G

  • we look for a rewriting R of Q expressed over S
  • a rewriting R is perfect if R^D = ans(Q, I, D) for every source database D, where R^D is the result of evaluating R over D.

With a perfect rewriting, we can do query answering by rewriting.

Note that we avoid the construction of the retrieved global database ret(I, D).

SLIDE 24

Query rewriting for IDs alone

Intuition: Use the IDs as basic rewriting rules

  q(X) ← team(X, Y)
  player[Pteam] ⊆ team[Tname]

as a logic rule:

  team(W2, W3) ← player(W1, W2)

Basic rewriting step: when an atom of the query unifies with the head of the rule, substitute the atom with the body of the rule.

We add to the rewriting the query

  q(X) ← player(W, X)
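The step above can be sketched for this one ID as follows. The encoding of queries as (head, body) pairs, the restriction to variable-only arguments, and the applicability test (the Tcity variable must occur nowhere else in the query) are simplifying assumptions for illustration, not the general algorithm.

```python
# A query is (head variables, list of body atoms); an atom is (predicate, args).
# All arguments are assumed to be variables (no constants).

def occurrences(var, query):
    """Count how often var occurs in the head and body of the query."""
    head, body = query
    return head.count(var) + sum(args.count(var) for _, args in body)

def fresh_var(query):
    """Return a variable name not used anywhere in the query."""
    head, body = query
    used = set(head) | {a for _, args in body for a in args}
    name, n = "W", 0
    while name in used:
        n += 1
        name = f"W{n}"
    return name

def rewrite_with_id(query, index):
    """Apply team(W2, W3) ← player(W1, W2) backwards to body atom `index`.

    Returns the rewritten query, or None if the step does not apply.
    """
    head, body = query
    pred, args = body[index]
    if pred != "team":
        return None
    tname, tcity = args
    # The rule head has a fresh variable in the Tcity position, so the query
    # variable there must not be shared with any other part of the query.
    if occurrences(tcity, query) != 1:
        return None
    new_atom = ("player", (fresh_var(query), tname))
    return (head, body[:index] + [new_atom] + body[index + 1:])

q = (("X",), [("team", ("X", "Y"))])
rewritten = rewrite_with_id(q, 0)
```

For a query such as q(X, Y) ← team(X, Y) the step does not apply, because Y is exported in the head and the ID says nothing about its value.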

SLIDE 25

Query Rewriting for IDs alone: algorithm ID-rewrite

Iterative execution of:

  1. reduction: atoms that are subsumed by another atom are eliminated
  2. basic rewriting step

SLIDE 26

Main result

  • ΠM: rules of the mapping
  • ΠID: union of CQs produced by the rewriting algorithm

Theorem: ΠM ∪ ΠID is a perfect rewriting of the user query Q
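For the running example, ΠID consists of q(X) ← team(X, Y) and q(X) ← player(W, X). Unfolding these through the mapping rules gives three conjunctive queries over the sources, which the sketch below evaluates directly on D. The unfolding was done by hand here and the function name is an assumption for illustration.

```python
# Source database D from the example.
D = {
    "s1": {("figo", "realMadrid", "31")},
    "s2": {("realMadrid", "madrid")},
    "s3": {("totti", "roma")},
}

def q_rewritten(d):
    """Union of the three unfolded conjunctive queries over the sources."""
    from_s2 = {x for (x, _y) in d["s2"]}        # q(X) ← s2(X, Y)
    from_s1 = {x for (_w, x, _z) in d["s1"]}    # q(X) ← s1(W, X, Z)
    from_s3 = {x for (_w, x) in d["s3"]}        # q(X) ← s3(W, X)
    return from_s2 | from_s1 | from_s3

answers = q_rewritten(D)
```

The answers are obtained from the sources alone, without building the retrieved global database.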

SLIDE 27

Query answering under IDs and KDs

  • possibility of inconsistencies (recall the sound mapping)
  • when ret(I, D) violates the KDs, no legal database exists and query answering becomes trivial!

Theorem: Query answering under IDs and KDs is undecidable.

Proof: by reduction from implication of IDs and KDs.

SLIDE 28

Separation

Non-key-conflicting IDs (NKCIDs) are of the form

r1[A1] ⊆ r2[A2]

where either:

  1. no KD is defined over r2, or
  2. A2 is not a strict superset of key(r2)

Theorem (separation): Under KDs and NKCIDs, when ret(I, D) satisfies the KDs, the KDs can be ignored with respect to certain answers.
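The NKCID condition is a purely syntactic check on the right-hand side of an ID. A minimal sketch follows; the function name, the representation of attribute sets, and the sample key for team are assumptions for illustration (the slides do not declare a key for team).

```python
from typing import FrozenSet, Optional, Sequence

def is_non_key_conflicting(target_attrs: Sequence[str],
                           key_of_target: Optional[FrozenSet[str]]) -> bool:
    """Check whether r1[A1] ⊆ r2[A2] is an NKCID.

    target_attrs is A2; key_of_target is key(r2), or None if r2 has no KD.
    """
    if key_of_target is None:
        return True                      # case 1: no KD on r2
    # case 2: A2 must not be a strict (proper) superset of key(r2)
    return not (set(target_attrs) > key_of_target)

# Hypothetical key for team, used only to exercise the check.
team_key = frozenset({"Tname"})

nkc_equal = is_non_key_conflicting(["Tname"], team_key)            # A2 = key(r2)
nkc_super = is_non_key_conflicting(["Tname", "Tcity"], team_key)   # A2 ⊃ key(r2)
nkc_nokey = is_non_key_conflicting(["Tname"], None)                # no KD on r2
```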

SLIDE 29

Query rewriting under KDs and NKCIDs

Set of rules ΠKD for considering the case of KD violation for a relation r:

  q(Y1, …, Yn) ← r(X1, …, Xk, …, Xi, …), r(X1, …, Xk, …, X′i, …), Xi ≠ X′i, val(Y1), …, val(Yn)

X1, …, Xk are the variables corresponding to the attributes of key(r)

Theorem: ΠID ∪ ΠKD ∪ Πval ∪ ΠM is a perfect rewriting of Q.

Πval is the set of rules that imposes that val(c) is true if c is a value in the database.
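The effect of a ΠKD rule can be mimicked procedurally: if two tuples of r agree on the key but differ elsewhere, every tuple of known values becomes an answer, which reflects the trivial case where no legal database exists. The sketch below is an assumption-laden illustration: it works on a materialised database rather than on rules, takes "values in the database" to mean all values occurring in the given relations, and uses made-up sample data.

```python
import itertools

def violates_key(rel, key_positions):
    """True if two distinct tuples of rel agree on all key positions."""
    seen = {}
    for row in rel:
        k = tuple(row[i] for i in key_positions)
        if seen.setdefault(k, row) != row:
            return True
    return False

def kd_rule_answers(db, rel_name, key_positions, arity):
    """On a KD violation, return every arity-tuple over the known values."""
    if not violates_key(db[rel_name], key_positions):
        return set()
    values = sorted({v for rows in db.values() for row in rows for v in row})
    return set(itertools.product(values, repeat=arity))

# Illustrative data (assumed): figo appears with two different teams.
inconsistent = {
    "player": {("figo", "realMadrid"), ("totti", "roma"), ("figo", "cavese")},
    "team": {("realMadrid", "madrid")},
}
consistent = {
    "player": {("figo", "realMadrid"), ("totti", "roma")},
    "team": {("realMadrid", "madrid")},
}

all_answers = kd_rule_answers(inconsistent, "player", [0], arity=1)
no_answers = kd_rule_answers(consistent, "player", [0], arity=1)
```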

SLIDE 30

Semantics for inconsistent data sources

  • Under the (strictly) sound semantics, a single KD violation leads to a non-interesting case for query answering
  • New approach: loosely-sound semantics. Add as much as you like (as with sound semantics), and throw away the minimum number of tuples

A global database B1 is better than another database B2, denoted B1 ≫(I,D) B2, iff

  B1 ∩ ret(I, D) ⊃ B2 ∩ ret(I, D)

The answers ansℓ(Q, I, D) to a query are those that are true on all “best” legal global databases w.r.t. ≫(I,D).

SLIDE 31

Example

Global schema:

  player(Pname, Pteam)
  team(Tname, Tcity)

Constraints:

  player[Pteam] ⊆ team[Tname]
  key(player) = {Pname}

Mapping:

  player(X, Y) ← s1(X, Y, Z)
  player(X, Y) ← s3(X, Y)
  team(X, Y) ← s2(X, Y)

SLIDE 32

Example (cont’d)

Source database D

  s1:  figo        realMadrid  31
  s2:  realMadrid  madrid
  s3:  totti       roma
       figo        cavese

SLIDE 33

Example (cont’d)

Retrieved global database ret(I, D)

  player:  figo        realMadrid
           totti       roma
           figo        cavese
  team:    realMadrid  madrid

The two player tuples for figo violate key(player) = {Pname}. There are two possible ways of repairing the violation by deleting a minimal set of tuples.

SLIDE 34

Example (cont’d)

First form

  player:  figo        realMadrid
           totti       roma
  team:    realMadrid  madrid
           roma        α

SLIDE 35

Example (cont’d)

Second form

  player:  totti       roma
           figo        cavese
  team:    realMadrid  madrid
           roma        α
           cavese      β

For the query

  q(X) ← team(X, Y)

we have

  ansℓ(q, I, D) = {roma, realMadrid}
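The same answer can be reproduced by brute force for this small example: enumerate the maximal subsets of the retrieved player tuples that satisfy the key, enforce the ID on each, evaluate the query, and intersect the results. This sketch is an illustration under stated assumptions: only key(player) is declared, so team tuples are never deleted, and the helper names are invented here. The slides obtain the answers by rewriting instead.

```python
import itertools
from collections import defaultdict

ret = {
    "player": {("figo", "realMadrid"), ("totti", "roma"), ("figo", "cavese")},
    "team": {("realMadrid", "madrid")},
}

def player_repairs(players):
    """Yield each maximal subset of players satisfying key(player) = {Pname}."""
    groups = defaultdict(list)
    for row in players:
        groups[row[0]].append(row)          # group tuples by key value
    # A maximal consistent subset keeps exactly one tuple per key value.
    for choice in itertools.product(*groups.values()):
        yield set(choice)

def team_names(players, teams):
    """Answers to q(X) ← team(X, Y) once player[Pteam] ⊆ team[Tname] holds.

    The ID forces a team tuple for every Pteam value, so those names are
    answers even though their city is unknown.
    """
    return {t for (t, _c) in teams} | {t for (_p, t) in players}

answer_sets = [team_names(p, ret["team"]) for p in player_repairs(ret["player"])]
answers = set.intersection(*answer_sets)
```

The name cavese is an answer in only one of the two repairs, so it is not part of the result.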

SLIDE 36

Query rewriting under the loosely-sound semantics

Set of rules ΠℓKD that take KDs into account (Datalog¬ under stable model semantics): for each relation r in G

  r(x, y) ← rD(x, y), not r̄(x, y)
  r̄(x, y) ← rD(x, y), r(x, z), Y1 ≠ Z1
  ···
  r̄(x, y) ← rD(x, y), r(x, z), Ym ≠ Zm

where: in r(x, y) the variables in x correspond to the attributes constituting the key of the relation r; y = Y1, …, Ym and z = Z1, …, Zm.

SLIDE 37

Query rewriting under the loosely-sound semantics (cont’d)

ΠMD: rules obtained from ΠM by replacing each r with rD.

Theorem: ΠℓKD ∪ ΠID ∪ ΠMD is a perfect rewriting of Q.

SLIDE 38

Summary of complexity results

  KDs   IDs   strictly-sound   loosely-sound
  no    GEN   PTIME/PSPACE♠    PTIME/PSPACE♠
  yes   no    PTIME/NP♠        coNP/Π^p_2♠
  yes   FK    PTIME/PSPACE     coNP/PSPACE
  yes   NKC   PTIME/PSPACE     coNP/PSPACE
  yes   1KC   undecidable      undecidable
  yes   GEN   undecidable♠     undecidable♠

SLIDE 39

The system DIS@DIS

  • Deals also with exclusion dependencies
  • The rules ΠM are taken into account by means of unfolding (substitution)
  • Is able to work with local-as-view (LAV) mappings, which are translated into GAV ones (plus integrity constraints) [— ER 2002]

SLIDE 40

[Figure: architecture of the DIS@DIS system, divided into an intensional level and an extensional level. Labels in the figure: User Interface, system specification, query, results, Meta-Data Repository, System Processor, System Analyzer, LAV-GAV Compiler, ID Expander, ED Expander, Query Processor, Query Optimizer, ID Reformulator, KD-ED Reformulator, KD-ED Optimizer, Unfolder, Consistency Processor, Consistency Checker, Source Processor, Source Query Evaluator, Source Wrappers, Data Sources, RGDB Generator, RGDB, Global Query Processor, Global Query Evaluator, Datalog-n Query Handler.]

SLIDE 41

Conclusions

  • Query answering by rewriting in data integration systems under constraints
  • Query rewriting technique for IDs alone, inspired by [Gryz ICDE 1999]
  • Characterisation of the threshold between decidability and undecidability under KDs and IDs [— PODS 2003]
  • Query rewriting technique for a maximal class of KDs and IDs
  • Loose semantics under KDs and IDs
      ⋆ Query rewriting technique for KDs, decoupled from that for IDs

  • All rewritings in purely intensional fashion
  • System DIS@DIS

SLIDE 42

Thank you

Further information:

  • Questions: now
  • Presenter’s contact: http://www.andreacali.com
  • Implementation: system DIS@DIS (try it online!), available from the link above

Thanks to: Maurizio Lenzerini, Giuseppe De Giacomo, Diego Calvanese, Jarek Gryz

Slides typeset with LaTeX2e

SLIDE 43

Algorithm ID-rewrite

Input: relational schema Ψ, set of IDs ΣI, UCQ Q
Output: perfect rewriting of Q

Q′ := Q;
repeat
    Qaux := Q′;
    for each q ∈ Qaux do
        (a) for each g1, g2 ∈ body(q) do
                if g1 and g2 unify then
                    Q′ := Q′ ∪ {τ(reduce(q, g1, g2))};
        (b) for each g ∈ body(q) do
                for each I ∈ ΣI do
                    if I is applicable to g then
                        Q′ := Q′ ∪ { q[g/gr(g, I)] }
until Qaux = Q′;
return Q′
