PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - - PowerPoint PPT Presentation

probabilistic databases
SMART_READER_LITE
LIVE PREVIEW

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - - PowerPoint PPT Presentation

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente Research


slide-1
SLIDE 1

PROBABILISTIC DATABASES

MAURICE VAN KEULEN

ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP

slide-2
SLIDE 2

Maurice van Keulen Associate Professor Data Management Technology University of Twente § Research interests: Data integration, data quality, information extraction, natural language processing, data cleaning I am here on a research visit § Period Nov’17 – Feb’18 several times 3 or 2 days § Goals: § Compare Probabilistic Programming (PP) with Probabilistic Databases (PDBs) § Does PP lead to new insights in PDBs and Probabilistic Data Integration (PDI)?

23 Nov 2017 Research visit RWTH - Probabilistic Databases 2

WHO AM I AND WHY AM I HERE

slide-3
SLIDE 3

§ Motivational example § Intuition – What is a probabilistic database § Theoretical foundation – possible worlds theory § Querying a probabilistic database § Real-world query performance § Probabilistic databases for other data models (XML, DataLog) § Conclusions § Outlook on my next presentation on “Probabilistic Data Integration”

23 Nov 2017 Research visit RWTH - Probabilistic Databases 3

CONTENTS

slide-4
SLIDE 4

MOTIVATIONAL EXAMPLE AND WHAT IS A PROBABILISTIC DATABASE?

slide-5
SLIDE 5

23 Nov 2017 Research visit RWTH - Probabilistic Databases 5

MOTIVATIONAL EXAMPLE: COMBINING DATA …

Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data

  • Interoperability. IT - Information Technology, 54 (3).
  • pp. 138-146. ISSN 1611-2776

Car brand Sales B.M.W. 25 Mercedes 32 Renault 10 Car brand Sales BMW 72 Mercedes-Benz 39 Renault 20 Car brand Sales

Bayerische Motoren Werke

8 Mercedes 35 Renault 15 Car brand Sales

B.M.W.

25

Bayerische Motoren Werke

8

BMW

72 Mercedes 67 Mercedes-Benz 39 Renault 45

slide-6
SLIDE 6

23 Nov 2017 Research visit RWTH - Probabilistic Databases 6

… AND THE PROBLEM OF SEMANTIC DUPLICATES

Car brand Sales

B.M.W.

25

Bayerische Motoren Werke

8

BMW

72 Mercedes 67 Mercedes-Benz 39 Renault 45 Preferred customers … SELECT SUM(Sales) FROM CarSales WHERE Sales>100 ‘No preferred customers’

slide-7
SLIDE 7

Database

Real world (of car brands) Mercedes-Benz 39 72 BMW 45 Renault 67 Mercedes 8 Bayerische Motoren Werke 25 B.M.W. Sales Car brand

ω

d1 d2 d3 d4 d5 d6

  • 1
  • 2
  • 3
  • 4

23 Nov 2017 Research visit RWTH - Probabilistic Databases 7

SEMANTIC DUPLICATES

slide-8
SLIDE 8

23 Nov 2017 Research visit RWTH - Probabilistic Databases 8

MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA

Car brand Sales

B.M.W.

25

Bayerische Motoren Werke

8

BMW

72 Mercedes 67 Mercedes-Benz 39 Renault 45 Mercedes 106 Mercedes-Benz 106 1 2 3 4 5 6 X=0 X=0 X=1 Y=0 X=1 Y=1 X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 Y=0 “Mercedes” correct name 0.5 Y=1 “Mercedes-Benz” correct name 0.5 B.M.W. / BMW / Bayerische Motoren Werke analogously Run some duplicate detection tool

slide-9
SLIDE 9

§ Looks like ordinary database § Several “possible” answers or approximate answers to queries § Important: Scalability (big data!) Sales of “preferred customers” § SELECT SUM(sales) FROM carsales WHERE sales≥ 100

23 Nov 2017 Research visit RWTH - Probabilistic Databases 9

WHAT I HAVE NOW IS A PROBABILISTIC DATABASE

SUM(sales) P 0 14% 105 6% 106 56% 211 24%

slide-10
SLIDE 10

Sales of “preferred customers” § SELECT SUM(sales) FROM carsales WHERE sales≥ 100 § Answer: 106 § Risk = Probability * Impact § Analyst only bothered with problems that matter

23 Nov 2017 Research visit RWTH - Probabilistic Databases 10

QUERYING AND RELIABILITY ASSESSMENT

SUM(sales) P 0 14% 105 6% 106 56% 211 24% Second most likely answer at 24% with impact factor 2 in sales (211 vs 106) Risk of substantially wrong answer

slide-11
SLIDE 11

THEORETICAL FOUNDATION QUERYING A PROBABILISTIC DATABASE

slide-12
SLIDE 12

23 Nov 2017 Research visit RWTH - Probabilistic Databases 12

POSSIBLE WORLDS THEORY

Car brand Sales Mercedes 67 Mercedes-Benz 39 Mercedes 106 Mercedes-Benz 106 Renault 45 4 5 6 X=0 X=0 X=1 Y=0 X=1 Y=1

X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 Y=0 “Mercedes” correct 0.5 Y=1 “Mercedes-Benz” correct 0.5

Car brand Sales Mercedes 67 Mercedes-Benz 39 Renault 45 Car brand Sales Mercedes 67 Mercedes-Benz 39 Renault 45 Car brand Sales Mercedes 106 Renault 45 Car brand Sales Mercedes-Benz 106 Renault 45 X=0 Y=0 X=1 Y=0 X=0 Y=1 X=1 Y=1 0.4 0.1 0.4 0.2 * 0.5 = 0.1

‘0’ (0.2) ‘106’ (0.8) 106 106

slide-13
SLIDE 13

Associate each tuple ti with a sentence 𝜒i

i

§ (ti,𝜒i)

i)

𝜒 is a propositional formula § atoms of the form r = v Example § ( <carbrand=“Mercedes”, sales=106>, X=1⋀Y=0 ) This tuple exists in all worlds for which 𝜒 is true

23 Nov 2017 Research visit RWTH - Probabilistic Databases 13

POSSIBLE WORLDS THEORY

slide-14
SLIDE 14

Given a database D as a probabilistic relation § D={ (t1,𝜒1), …, (tn,𝜒n) } A world w is induced by a random variable assignment for each random variable § Example § 𝜄w={ X ⟼ 1, Y ⟼ 0 } P(w) = P(𝜄w) = 0.8 x 0.5 = 0.4 § w={ <“Mercedes”, 106>, <“Renault”, 45> } § PWS is the set of all possible worlds Semantics § Tuple ti exists in world w iff is 𝜒i true for 𝜄w

i true for 𝜄w

§ Query semantics: Q(D)=∪w∈PWS Q(w)

23 Nov 2017 Research visit RWTH - Probabilistic Databases 14

QUERYING IN POSSIBLE WORLDS THEORY

slide-15
SLIDE 15

23 Nov 2017 Research visit RWTH - Probabilistic Databases 15

POSSIBLE WORLDS THEORY

Query implementation Query semantics possible worlds possible answers

Theory Implementation

compact representation representation

  • f possible answers

Q Q’

slide-16
SLIDE 16

SQL queries in relational databases are executed by § Translating them to relational algebra (query plan) § Optimizing the query plan § Choosing implementations for certain operators (e.g., mergejoin for a join / index lookup for a select) § Execution of the query plan Approach: § Extend relational algebra with sentence propagation

23 Nov 2017 Research visit RWTH - Probabilistic Databases 16

PROBABILISTIC RELATIONAL ALGEBRA

slide-17
SLIDE 17

A selection of relational algebra operations § Selection: σp(R) (t,𝜒)∈R p(t)=true ⇔ (t,𝜒)∈σp(R) § Cartesian product: R x S (t1,𝜒1)∈R (t2,𝜒2)∈S ⇔ (t1t2, 𝜒1 ⋀ 𝜒2)∈R x S § Join: R ⋈p S = σp(RxS) § Duplicate removal: 𝛆(R) (ti,𝜒i)∈R for i∈[1..n] t1=…=tn ⇔ (t1, 𝜒1⋁…⋁𝜒n)∈𝛆(R)

i)∈R for i∈[1..n] t1=…=tn ⇔ (t1, 𝜒1⋁…⋁𝜒n)∈𝛆(R)

23 Nov 2017 Research visit RWTH - Probabilistic Databases 17

PROBABILISTIC RELATIONAL ALGEBRA

slide-18
SLIDE 18

§ In this way one could implement a probabilistic relational database Alternative on-top-of normal RDBMS approach (e.g., MayBMS, Trio) § Represent sentences with additional columns / tables in a normal database § Map SQL query Q to Q’ which additionally performs sentence propagation § Execute Q’ § Logically the same thing, but leverages existing RDBMS functionality to the fullest

23 Nov 2017 Research visit RWTH - Probabilistic Databases 18

ALTERNATIVE IMPLEMENTATION

slide-19
SLIDE 19

REAL-WORLD QUERY PERFORMANCE

slide-20
SLIDE 20

Research question § Are current probabilistic database prototypes mature enough for real-world use? Approach § PDB prototype: MayBMS § Real world scenario § Probabilistic integration of 3 biological databases § Query load with typical queries for the scenario § Measure query execution times

23 Nov 2017 Research visit RWTH - Probabilistic Databases 20

REAL-WORLD EXPERIMENT ON QUERY PERFORMANCE

slide-21
SLIDE 21

§ Proteins in orthologous group expected to have same function § Example “disease-causing bacteria” § Identified protein if silenced will kill bacteria. Side-effects in humans? Ø Find orthologous proteins § Combined insight of multiple sources § Sources: Homologene, PIRSF, and eggNOG § Challenges § Integrating conflicting groups § No truth & partial untrustworthiness: data established with different methods by different research groups

23 Nov 2017 Research visit RWTH - Probabilistic Databases 21

REAL-WORLD SCENARIO

ORTHOLOGOUS GROUPS IN BIO-INFORMATICS

“Ancient Paperbird” A0 A1 A2 “Long-beaked Paperbird” L0 L1 L2 “Running Paperbird” R0 R1 R2

  • B. Wanders, M. van Keulen,
  • P. van der Vet, Uncertain

Groupings: Probabilistic combination of grouping

  • data. DEXA 2015.
slide-22
SLIDE 22

§ Uncertain groupings § Iterative data understanding with Integration views

23 Nov 2017 Research visit RWTH - Probabilistic Databases 22

UNCERTAIN GROUPINGS

Source Groups

S1 ABC1 DE1

FG1

S2 AB2 CD2

FH2

S3 ABE3

FGH3 How to combine the insights of

  • rthologous grouping of

proteins from 3 sources?

slide-23
SLIDE 23

X= 1 2 3

23 Nov 2017 Research visit RWTH - Probabilistic Databases 23

INTEGRATION VIEW 1: COMP

COMBINATION OF INDEPENDENT COMPONENTS ABC1 DE1 AB2 CD2 FG1 FH2 FGH3 Group

ABC1 DE1 FG1 AB2 CD2 FH2 ABE3 FGH3 X=1 X=1 Y=1 X=2 X=2 Y=2 X=3 Y=3 Two choices:

  • Source Si is correct for

component ABCDE

  • Source Si is correct for

component FGH

ABE3

Y= 1 2 3

slide-24
SLIDE 24

23 Nov 2017 Research visit RWTH - Probabilistic Databases 24

INTEGRATION VIEW 2: COLL

FINE-GRAINED COLLISION-BASED INTEGRATION VIEW ABC1 AB2 ABE3 FG1 FH2 FGH3 Group

ABC1 DE1 FG1 AB2 CD2 FH2 ABE3 FGH3 X1=1 X2=1 X3=1 X5=1 X6=1 X7=1 X8=1 X1=2 X4=1 X2=2 X5=1 X7=2 X9=1 X3=2 X4=2 X6=2 X8=2 X9=2 9 choices:

  • For each collision, only one

can be correct

  • A possible world is a

collision-free combination

  • f groups

DE1 CD2

1 2 3 4 5 6 7 8 9

slide-25
SLIDE 25

§ Combining Homologene, PIRSF, and eggNOG Ø 776,000 groups; 14M proteins; 2.8M random vars

23 Nov 2017 Research visit RWTH - Probabilistic Databases 25

EXPERIMENTAL RESULTS

COLL INFORMATION VIEW; MAYBMS

slide-26
SLIDE 26

§ Scales to realistic sizes … but § Some MayBMS limitations § Max 500 rvas per tuple (our case has up to 17885) § Max 33 rvas for confidence computation UDF (this one we solved ourselves) § Exact confidence computation expensive § Complexity from practice: overlapping groups § Graph representation and optimization: cliques

23 Nov 2017 Research visit RWTH - Probabilistic Databases 26

EXPERIMENTAL RESULTS

slide-27
SLIDE 27

PROBABILISTIC DATABASES FOR OTHER DATA MODELS

XML / DATALOG

slide-28
SLIDE 28

Given any data model with its query language § Associate data item with sentence § For every query operator ⨂ in data model’s algebra § Define extended operator ⨷=(⨂,𝜐⨂) § Where 𝜐⨂ is a function that produces the descriptive sentence of a result based on the descriptive sentences of the operands in a manner that is appropriate for operation ⨂ produces a probabilistic variant for that data model + query language

23 Nov 2017 Research visit RWTH - Probabilistic Databases 28

GENERAL APPROACH

  • B. Wanders, M. van Keulen,

Revisiting the formal foundation of Probabilistic

  • Databases. EUSFLAT 2015.
slide-29
SLIDE 29

<company> <name>Foobar Corporation</name> <ticker_symbol>FOO</ticker_symbol> <description>Foobar Corporation is a maker of <EM>Foo(TM)</EM> and Foobar(TM) products and a leading software company with a 300 Billion dollar revenue in 1999. It is located in Alaska. </description> <business_code>Software</business_code> <partners> <partner>YouNameItWeIntegrateIt.com</partner> <partner>TheAppCompany Inc.</partner> </partners> <competitors> <competitor name=“Gorilla Corporation”/> </competitors> </company>

§ Markup tags are chosen purely for logic structure! § Well-formedness: start and end tags need to match precisely and be properly nested

2016/2017 Data Science: XML and the topic XMLDB 29

EXAMPLE XML DOCUMENT

Opening tag Closing tag Shorthand for no embedded elements or text Attribute Mixed content

slide-30
SLIDE 30

2016/2017 Data Science: XML and the topic XMLDB 30

XML DOCUMENTS ARE TREES

name ticker_symbol description em business_code partners partner competitors partner company name = “Gorilla Corporation” ... ... element attribute root text Foobar Corporation FOO Foobar ... maker of Foo(TM) and Foobar(TM) ... Software competitor

slide-31
SLIDE 31

Given any data model with its query language § Associate data item with sentence § For every query operator ⨂ in data model’s algebra § Define extended operator ⨷=(⨂,𝜐⨂) § Where 𝜐⨂ is a function that produces the descriptive sentence of a result based on the descriptive sentences of the operands in a manner that is appropriate for operation ⨂ XML § Data item = XML node § Query algebra is tree navigation algebra

23 Nov 2017 Research visit RWTH - Probabilistic Databases 31

GENERAL APPROACH APPLIED TO XML/XPATH

slide-32
SLIDE 32

23 Nov 2017 Research visit RWTH - Probabilistic Databases 32

PROBABILISTIC XML

UNCERTAINTY ORTHOGONAL TO DATA MODEL

Car brand Sales Mercedes 67 Mercedes-Benz 39 Mercedes 106 Mercedes-Benz 106 1 2 3 4

X=0 X=0 X=1 Y=0 X=1 Y=1

Part_sales “Mercedes” db part_sales 67 “Mercedes-Benz” 39 part_sales 106 “Mercedes” “Mercedes-Benz” brand sales brand sales X=0 X=0 X=1 Y=0 Y=1

slide-33
SLIDE 33

23 Nov 2017 Research visit RWTH - Probabilistic Databases

QUERYING PROBABILISTIC XML

Query implementation

Query semantics possible worlds possible answers compact representation

Theory Implementation

representation

  • f possible answers

33

slide-34
SLIDE 34

Query § //brand[.=“Mercedes”] Answer nodes § (Textnode “Mercedes”, X=0) § (Textnode “Mercedes”, Y=0) but descendant of node with X=1 Answer § (Textnode “Mercedes”, X=0) § (Textnode “Mercedes”, Y=0 ⋀ X=1)

23 Nov 2017 Research visit RWTH - Probabilistic Databases 34

EXAMPLE XPATH QUERY

slide-35
SLIDE 35

23 Nov 2017 Research visit RWTH - Probabilistic Databases

QUERYING PROBABILISTIC XML USING A URDBMS

Query semantics possible worlds possible answers

Theory Implementation

35

Q Q’

slide-36
SLIDE 36

PROBABILISTIC DATABASES FOR OTHER DATA MODELS

DATALOG (JudgeD)

  • B. Wanders, M. van Keulen,

JudgeD: a Probabilistic Datalog with Dependencies. DeLBP 2016.

slide-37
SLIDE 37

§ Datalog is a declarative logic programming language § Syntactically is a subset of Prolog For the logicians among you: § It imposes restrictions, such as § Disallows complex terms as arguments of predicates § Stratification restrictions on negation & recursion § Query evaluation § is based on first-order logic § on finite sets is guaranteed to terminate § is sound and complete § Datalog is not Turing complete

23 Nov 2017 Research visit RWTH - Probabilistic Databases 37

DATALOG

slide-38
SLIDE 38

Given any data model with its query language § Associate data item with sentence § For every query operator ⨂ in data model’s algebra § Define extended operator ⨷=(⨂,𝜐⨂) § Where 𝜐⨂ is a function that produces the descriptive sentence of a result based on the descriptive sentences of the operands in a manner that is appropriate for operation ⨂ DataLog § Data item = DataLog fact / rule § Query operator is entailment

23 Nov 2017 Research visit RWTH - Probabilistic Databases 38

GENERAL APPROACH APPLIED TO DATALOG

slide-39
SLIDE 39

% facts parent(john, douglas). parent(bob, john). parent(ebbon, bob). % rules ancestor(A, B) :- parent(A, B). ancestor(A, B) :- parent(A, C), ancestor(C,B). % query ancestor(A, B)?

23 Nov 2017 Research visit RWTH - Probabilistic Databases 39

DATALOG EXAMPLE

slide-40
SLIDE 40

§ r = (Ah ⟵ A1,…,Am) § r∈KB § ∃𝜄 : Ah𝜄 is ground § (∀i∈[1..m]) KB ⊨ Ai𝜄 ⇒ § KB ⊨ Ah𝜄

23 Nov 2017 Research visit RWTH - Probabilistic Databases 40

DATALOG INFERENCE

ancestor(A, B) :- parent(A, B). 𝜄={A⟼bob,B ⟼john} parent(bob,john) is ground parent(bob,john) ∈ KB KB ⊨ ancestor(bob,john)

slide-41
SLIDE 41

§ r = (Ah ⟵ A1,…,Am , 𝜒) ) § r∈KB § ∃𝜄 : Ah𝜄 is ground § (∀i∈[1..m]) KB ⊨ (Ai𝜄 , 𝜒i)

i)

§ 𝜒’ = 𝜒 ⋀ 𝜒1 ⋀…⋀ 𝜒m § 𝜒’ ≢⊥ ⇒ § KB ⊨ (Ah𝜄 , 𝜒’)

23 Nov 2017 Research visit RWTH - Probabilistic Databases 41

PROBABILISTIC DATALOG INFERENCE

slide-42
SLIDE 42

saw(1, amy, blue, honda) [x1=1]. saw(1, amy, red, toyota) [x1=2]. saw(1, amy, nothing) [x1=3]. @p(x1=1) = 0.7. @p(x1=2) = 0.2. @p(x1=3) = 0.1. drives(1, frank, red, toyota) [y1=1]. drives(1, frank, blue, toyota) [y1=2]. @p(y1=1) = 0.8. @p(y1=2) = 0.2. suspect(X) :- saw(_,_,Color,Car), drives(_,X,Color,Car).

23 Nov 2017 Research visit RWTH - Probabilistic Databases 42

JUDGED: OUR IMPLEMENTATION OF PROB. DATALOG

slide-43
SLIDE 43

suspect(X) :- saw(_,_,Color,Car), drives(_,X,Color,Car). Query § suspect(X)? Answer § Exact § suspect(frank) [y1=1 and x1=2] § Monte carlo § suspect(frank) p=0.16

23 Nov 2017 Research visit RWTH - Probabilistic Databases 43

JUDGED: OUR IMPLEMENTATION OF PROB. DATALOG

slide-44
SLIDE 44

coin(c1). coin(c2). { result(C, heads) :- coin(C) [c(C)=heads]. result(C, tails) :- coin(C) [c(C)=tails]. @uniform c(C). | coin(C) } toss(X, Y) :- result(c1, X), result(c2, Y). toss(A, B)?

23 Nov 2017 Research visit RWTH - Probabilistic Databases 44

NEED FOR HIGHER ORDER FUNCTIONALITY

OPPORTUNITY FOR NOVEL THEORETICAL RESEARCH

Creates as many random variables as there are solutions to coin(C)

slide-45
SLIDE 45

§ A probabilistic database is just like a normal database but queries produce possible or approximate answers § Meant to transparently manage uncertain data § Database prototypes can handle real-world scenarios § Model for probabilistic database generalizable to other data models: XML and DataLog (JudgeD) Some possible applications § Data integration in imperfect circumstances § Data with quality problems § Handling results of natural language processing

  • r other imperfect information extraction techniques

23 Nov 2017 Research visit RWTH - Probabilistic Databases 45

CONCLUSIONS

  • M. van Keulen, Managing

Uncertainty: The Road Towards Better Data Interoperability. IT – Information Technology 54(3), 2012.

slide-46
SLIDE 46

Probabilistic data integration § Represent data quality problems as probabilistic data § 80/20 rule: 80% if the issues are solvable in 20% of the time § Quick-and-dirty data integration whereby unresolved cases are represented as probabilistic data § Improve quality of data integration result during use Properties § Faster journey from start to first use § Trade development time for data quality § More robust against e.g. threshold settings § More potential for natural and effective feedback loops leveraging human attention for continuous improvement

23 Nov 2017 Research visit RWTH - Probabilistic Databases 46

OUTLOOK ON NEXT PRESENTATION

slide-47
SLIDE 47

(Francis Bacon, 1605) (Jorge Luis Borges, 1979)

(often attributed to John Maynard Keynes, but Carveth Read, 1898)