PROBABILISTIC DATABASES
MAURICE VAN KEULEN
ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP
PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - - PowerPoint PPT Presentation
PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente Research
MAURICE VAN KEULEN
ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP
Maurice van Keulen Associate Professor Data Management Technology University of Twente § Research interests: Data integration, data quality, information extraction, natural language processing, data cleaning I am here on a research visit § Period Nov’17 – Feb’18 several times 3 or 2 days § Goals: § Compare Probabilistic Programming (PP) with Probabilistic Databases (PDBs) § Does PP lead to new insights in PDBs and Probabilistic Data Integration (PDI)?
23 Nov 2017 Research visit RWTH - Probabilistic Databases 2
WHO AM I AND WHY AM I HERE
§ Motivational example § Intuition – What is a probabilistic database § Theoretical foundation – possible worlds theory § Querying a probabilistic database § Real-world query performance § Probabilistic databases for other data models (XML, DataLog) § Conclusions § Outlook on my next presentation on “Probabilistic Data Integration”
23 Nov 2017 Research visit RWTH - Probabilistic Databases 3
CONTENTS
MOTIVATIONAL EXAMPLE AND WHAT IS A PROBABILISTIC DATABASE?
23 Nov 2017 Research visit RWTH - Probabilistic Databases 5
MOTIVATIONAL EXAMPLE: COMBINING DATA …
Keulen, M. (2012) Managing Uncertainty: The Road Towards Better Data
Car brand Sales B.M.W. 25 Mercedes 32 Renault 10 Car brand Sales BMW 72 Mercedes-Benz 39 Renault 20 Car brand Sales
Bayerische Motoren Werke
8 Mercedes 35 Renault 15 Car brand Sales
B.M.W.
25
Bayerische Motoren Werke
8
BMW
72 Mercedes 67 Mercedes-Benz 39 Renault 45
23 Nov 2017 Research visit RWTH - Probabilistic Databases 6
… AND THE PROBLEM OF SEMANTIC DUPLICATES
Car brand Sales
B.M.W.
25
Bayerische Motoren Werke
8
BMW
72 Mercedes 67 Mercedes-Benz 39 Renault 45 Preferred customers … SELECT SUM(Sales) FROM CarSales WHERE Sales>100 ‘No preferred customers’
Database
Real world (of car brands) Mercedes-Benz 39 72 BMW 45 Renault 67 Mercedes 8 Bayerische Motoren Werke 25 B.M.W. Sales Car brand
d1 d2 d3 d4 d5 d6
23 Nov 2017 Research visit RWTH - Probabilistic Databases 7
SEMANTIC DUPLICATES
23 Nov 2017 Research visit RWTH - Probabilistic Databases 8
MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA
Car brand Sales
B.M.W.
25
Bayerische Motoren Werke
8
BMW
72 Mercedes 67 Mercedes-Benz 39 Renault 45 Mercedes 106 Mercedes-Benz 106 1 2 3 4 5 6 X=0 X=0 X=1 Y=0 X=1 Y=1 X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 Y=0 “Mercedes” correct name 0.5 Y=1 “Mercedes-Benz” correct name 0.5 B.M.W. / BMW / Bayerische Motoren Werke analogously Run some duplicate detection tool
§ Looks like ordinary database § Several “possible” answers or approximate answers to queries § Important: Scalability (big data!) Sales of “preferred customers” § SELECT SUM(sales) FROM carsales WHERE sales≥ 100
23 Nov 2017 Research visit RWTH - Probabilistic Databases 9
WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
SUM(sales) P 0 14% 105 6% 106 56% 211 24%
Sales of “preferred customers” § SELECT SUM(sales) FROM carsales WHERE sales≥ 100 § Answer: 106 § Risk = Probability * Impact § Analyst only bothered with problems that matter
23 Nov 2017 Research visit RWTH - Probabilistic Databases 10
QUERYING AND RELIABILITY ASSESSMENT
SUM(sales) P 0 14% 105 6% 106 56% 211 24% Second most likely answer at 24% with impact factor 2 in sales (211 vs 106) Risk of substantially wrong answer
THEORETICAL FOUNDATION QUERYING A PROBABILISTIC DATABASE
23 Nov 2017 Research visit RWTH - Probabilistic Databases 12
POSSIBLE WORLDS THEORY
Car brand Sales Mercedes 67 Mercedes-Benz 39 Mercedes 106 Mercedes-Benz 106 Renault 45 4 5 6 X=0 X=0 X=1 Y=0 X=1 Y=1
X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 Y=0 “Mercedes” correct 0.5 Y=1 “Mercedes-Benz” correct 0.5
Car brand Sales Mercedes 67 Mercedes-Benz 39 Renault 45 Car brand Sales Mercedes 67 Mercedes-Benz 39 Renault 45 Car brand Sales Mercedes 106 Renault 45 Car brand Sales Mercedes-Benz 106 Renault 45 X=0 Y=0 X=1 Y=0 X=0 Y=1 X=1 Y=1 0.4 0.1 0.4 0.2 * 0.5 = 0.1
‘0’ (0.2) ‘106’ (0.8) 106 106
Associate each tuple ti with a sentence 𝜒i
i
§ (ti,𝜒i)
i)
𝜒 is a propositional formula § atoms of the form r = v Example § ( <carbrand=“Mercedes”, sales=106>, X=1⋀Y=0 ) This tuple exists in all worlds for which 𝜒 is true
23 Nov 2017 Research visit RWTH - Probabilistic Databases 13
POSSIBLE WORLDS THEORY
Given a database D as a probabilistic relation § D={ (t1,𝜒1), …, (tn,𝜒n) } A world w is induced by a random variable assignment for each random variable § Example § 𝜄w={ X ⟼ 1, Y ⟼ 0 } P(w) = P(𝜄w) = 0.8 x 0.5 = 0.4 § w={ <“Mercedes”, 106>, <“Renault”, 45> } § PWS is the set of all possible worlds Semantics § Tuple ti exists in world w iff is 𝜒i true for 𝜄w
i true for 𝜄w
§ Query semantics: Q(D)=∪w∈PWS Q(w)
23 Nov 2017 Research visit RWTH - Probabilistic Databases 14
QUERYING IN POSSIBLE WORLDS THEORY
23 Nov 2017 Research visit RWTH - Probabilistic Databases 15
POSSIBLE WORLDS THEORY
Query implementation Query semantics possible worlds possible answers
Theory Implementation
compact representation representation
Q Q’
SQL queries in relational databases are executed by § Translating them to relational algebra (query plan) § Optimizing the query plan § Choosing implementations for certain operators (e.g., mergejoin for a join / index lookup for a select) § Execution of the query plan Approach: § Extend relational algebra with sentence propagation
23 Nov 2017 Research visit RWTH - Probabilistic Databases 16
PROBABILISTIC RELATIONAL ALGEBRA
A selection of relational algebra operations § Selection: σp(R) (t,𝜒)∈R p(t)=true ⇔ (t,𝜒)∈σp(R) § Cartesian product: R x S (t1,𝜒1)∈R (t2,𝜒2)∈S ⇔ (t1t2, 𝜒1 ⋀ 𝜒2)∈R x S § Join: R ⋈p S = σp(RxS) § Duplicate removal: 𝛆(R) (ti,𝜒i)∈R for i∈[1..n] t1=…=tn ⇔ (t1, 𝜒1⋁…⋁𝜒n)∈𝛆(R)
i)∈R for i∈[1..n] t1=…=tn ⇔ (t1, 𝜒1⋁…⋁𝜒n)∈𝛆(R)
23 Nov 2017 Research visit RWTH - Probabilistic Databases 17
PROBABILISTIC RELATIONAL ALGEBRA
§ In this way one could implement a probabilistic relational database Alternative on-top-of normal RDBMS approach (e.g., MayBMS, Trio) § Represent sentences with additional columns / tables in a normal database § Map SQL query Q to Q’ which additionally performs sentence propagation § Execute Q’ § Logically the same thing, but leverages existing RDBMS functionality to the fullest
23 Nov 2017 Research visit RWTH - Probabilistic Databases 18
ALTERNATIVE IMPLEMENTATION
REAL-WORLD QUERY PERFORMANCE
Research question § Are current probabilistic database prototypes mature enough for real-world use? Approach § PDB prototype: MayBMS § Real world scenario § Probabilistic integration of 3 biological databases § Query load with typical queries for the scenario § Measure query execution times
23 Nov 2017 Research visit RWTH - Probabilistic Databases 20
REAL-WORLD EXPERIMENT ON QUERY PERFORMANCE
§ Proteins in orthologous group expected to have same function § Example “disease-causing bacteria” § Identified protein if silenced will kill bacteria. Side-effects in humans? Ø Find orthologous proteins § Combined insight of multiple sources § Sources: Homologene, PIRSF, and eggNOG § Challenges § Integrating conflicting groups § No truth & partial untrustworthiness: data established with different methods by different research groups
23 Nov 2017 Research visit RWTH - Probabilistic Databases 21
REAL-WORLD SCENARIO
ORTHOLOGOUS GROUPS IN BIO-INFORMATICS
“Ancient Paperbird” A0 A1 A2 “Long-beaked Paperbird” L0 L1 L2 “Running Paperbird” R0 R1 R2
Groupings: Probabilistic combination of grouping
§ Uncertain groupings § Iterative data understanding with Integration views
23 Nov 2017 Research visit RWTH - Probabilistic Databases 22
UNCERTAIN GROUPINGS
Source Groups
S1 ABC1 DE1
FG1
S2 AB2 CD2
FH2
S3 ABE3
FGH3 How to combine the insights of
proteins from 3 sources?
X= 1 2 3
23 Nov 2017 Research visit RWTH - Probabilistic Databases 23
INTEGRATION VIEW 1: COMP
COMBINATION OF INDEPENDENT COMPONENTS ABC1 DE1 AB2 CD2 FG1 FH2 FGH3 Group
ABC1 DE1 FG1 AB2 CD2 FH2 ABE3 FGH3 X=1 X=1 Y=1 X=2 X=2 Y=2 X=3 Y=3 Two choices:
component ABCDE
component FGH
ABE3
Y= 1 2 3
23 Nov 2017 Research visit RWTH - Probabilistic Databases 24
INTEGRATION VIEW 2: COLL
FINE-GRAINED COLLISION-BASED INTEGRATION VIEW ABC1 AB2 ABE3 FG1 FH2 FGH3 Group
ABC1 DE1 FG1 AB2 CD2 FH2 ABE3 FGH3 X1=1 X2=1 X3=1 X5=1 X6=1 X7=1 X8=1 X1=2 X4=1 X2=2 X5=1 X7=2 X9=1 X3=2 X4=2 X6=2 X8=2 X9=2 9 choices:
can be correct
collision-free combination
DE1 CD2
1 2 3 4 5 6 7 8 9
§ Combining Homologene, PIRSF, and eggNOG Ø 776,000 groups; 14M proteins; 2.8M random vars
23 Nov 2017 Research visit RWTH - Probabilistic Databases 25
EXPERIMENTAL RESULTS
COLL INFORMATION VIEW; MAYBMS
§ Scales to realistic sizes … but § Some MayBMS limitations § Max 500 rvas per tuple (our case has up to 17885) § Max 33 rvas for confidence computation UDF (this one we solved ourselves) § Exact confidence computation expensive § Complexity from practice: overlapping groups § Graph representation and optimization: cliques
23 Nov 2017 Research visit RWTH - Probabilistic Databases 26
EXPERIMENTAL RESULTS
Given any data model with its query language § Associate data item with sentence § For every query operator ⨂ in data model’s algebra § Define extended operator ⨷=(⨂,𝜐⨂) § Where 𝜐⨂ is a function that produces the descriptive sentence of a result based on the descriptive sentences of the operands in a manner that is appropriate for operation ⨂ produces a probabilistic variant for that data model + query language
23 Nov 2017 Research visit RWTH - Probabilistic Databases 28
GENERAL APPROACH
Revisiting the formal foundation of Probabilistic
<company> <name>Foobar Corporation</name> <ticker_symbol>FOO</ticker_symbol> <description>Foobar Corporation is a maker of <EM>Foo(TM)</EM> and Foobar(TM) products and a leading software company with a 300 Billion dollar revenue in 1999. It is located in Alaska. </description> <business_code>Software</business_code> <partners> <partner>YouNameItWeIntegrateIt.com</partner> <partner>TheAppCompany Inc.</partner> </partners> <competitors> <competitor name=“Gorilla Corporation”/> </competitors> </company>
§ Markup tags are chosen purely for logic structure! § Well-formedness: start and end tags need to match precisely and be properly nested
2016/2017 Data Science: XML and the topic XMLDB 29
EXAMPLE XML DOCUMENT
Opening tag Closing tag Shorthand for no embedded elements or text Attribute Mixed content
2016/2017 Data Science: XML and the topic XMLDB 30
XML DOCUMENTS ARE TREES
name ticker_symbol description em business_code partners partner competitors partner company name = “Gorilla Corporation” ... ... element attribute root text Foobar Corporation FOO Foobar ... maker of Foo(TM) and Foobar(TM) ... Software competitor
Given any data model with its query language § Associate data item with sentence § For every query operator ⨂ in data model’s algebra § Define extended operator ⨷=(⨂,𝜐⨂) § Where 𝜐⨂ is a function that produces the descriptive sentence of a result based on the descriptive sentences of the operands in a manner that is appropriate for operation ⨂ XML § Data item = XML node § Query algebra is tree navigation algebra
23 Nov 2017 Research visit RWTH - Probabilistic Databases 31
GENERAL APPROACH APPLIED TO XML/XPATH
23 Nov 2017 Research visit RWTH - Probabilistic Databases 32
PROBABILISTIC XML
UNCERTAINTY ORTHOGONAL TO DATA MODEL
Car brand Sales Mercedes 67 Mercedes-Benz 39 Mercedes 106 Mercedes-Benz 106 1 2 3 4
X=0 X=0 X=1 Y=0 X=1 Y=1
Part_sales “Mercedes” db part_sales 67 “Mercedes-Benz” 39 part_sales 106 “Mercedes” “Mercedes-Benz” brand sales brand sales X=0 X=0 X=1 Y=0 Y=1
23 Nov 2017 Research visit RWTH - Probabilistic Databases
QUERYING PROBABILISTIC XML
Query implementation
Query semantics possible worlds possible answers compact representation
Theory Implementation
representation
33
Query § //brand[.=“Mercedes”] Answer nodes § (Textnode “Mercedes”, X=0) § (Textnode “Mercedes”, Y=0) but descendant of node with X=1 Answer § (Textnode “Mercedes”, X=0) § (Textnode “Mercedes”, Y=0 ⋀ X=1)
23 Nov 2017 Research visit RWTH - Probabilistic Databases 34
EXAMPLE XPATH QUERY
23 Nov 2017 Research visit RWTH - Probabilistic Databases
QUERYING PROBABILISTIC XML USING A URDBMS
Query semantics possible worlds possible answers
Theory Implementation
35
Q Q’
JudgeD: a Probabilistic Datalog with Dependencies. DeLBP 2016.
§ Datalog is a declarative logic programming language § Syntactically is a subset of Prolog For the logicians among you: § It imposes restrictions, such as § Disallows complex terms as arguments of predicates § Stratification restrictions on negation & recursion § Query evaluation § is based on first-order logic § on finite sets is guaranteed to terminate § is sound and complete § Datalog is not Turing complete
23 Nov 2017 Research visit RWTH - Probabilistic Databases 37
DATALOG
Given any data model with its query language § Associate data item with sentence § For every query operator ⨂ in data model’s algebra § Define extended operator ⨷=(⨂,𝜐⨂) § Where 𝜐⨂ is a function that produces the descriptive sentence of a result based on the descriptive sentences of the operands in a manner that is appropriate for operation ⨂ DataLog § Data item = DataLog fact / rule § Query operator is entailment
23 Nov 2017 Research visit RWTH - Probabilistic Databases 38
GENERAL APPROACH APPLIED TO DATALOG
% facts parent(john, douglas). parent(bob, john). parent(ebbon, bob). % rules ancestor(A, B) :- parent(A, B). ancestor(A, B) :- parent(A, C), ancestor(C,B). % query ancestor(A, B)?
23 Nov 2017 Research visit RWTH - Probabilistic Databases 39
DATALOG EXAMPLE
§ r = (Ah ⟵ A1,…,Am) § r∈KB § ∃𝜄 : Ah𝜄 is ground § (∀i∈[1..m]) KB ⊨ Ai𝜄 ⇒ § KB ⊨ Ah𝜄
23 Nov 2017 Research visit RWTH - Probabilistic Databases 40
DATALOG INFERENCE
ancestor(A, B) :- parent(A, B). 𝜄={A⟼bob,B ⟼john} parent(bob,john) is ground parent(bob,john) ∈ KB KB ⊨ ancestor(bob,john)
§ r = (Ah ⟵ A1,…,Am , 𝜒) ) § r∈KB § ∃𝜄 : Ah𝜄 is ground § (∀i∈[1..m]) KB ⊨ (Ai𝜄 , 𝜒i)
i)
§ 𝜒’ = 𝜒 ⋀ 𝜒1 ⋀…⋀ 𝜒m § 𝜒’ ≢⊥ ⇒ § KB ⊨ (Ah𝜄 , 𝜒’)
23 Nov 2017 Research visit RWTH - Probabilistic Databases 41
PROBABILISTIC DATALOG INFERENCE
saw(1, amy, blue, honda) [x1=1]. saw(1, amy, red, toyota) [x1=2]. saw(1, amy, nothing) [x1=3]. @p(x1=1) = 0.7. @p(x1=2) = 0.2. @p(x1=3) = 0.1. drives(1, frank, red, toyota) [y1=1]. drives(1, frank, blue, toyota) [y1=2]. @p(y1=1) = 0.8. @p(y1=2) = 0.2. suspect(X) :- saw(_,_,Color,Car), drives(_,X,Color,Car).
23 Nov 2017 Research visit RWTH - Probabilistic Databases 42
JUDGED: OUR IMPLEMENTATION OF PROB. DATALOG
suspect(X) :- saw(_,_,Color,Car), drives(_,X,Color,Car). Query § suspect(X)? Answer § Exact § suspect(frank) [y1=1 and x1=2] § Monte carlo § suspect(frank) p=0.16
23 Nov 2017 Research visit RWTH - Probabilistic Databases 43
JUDGED: OUR IMPLEMENTATION OF PROB. DATALOG
coin(c1). coin(c2). { result(C, heads) :- coin(C) [c(C)=heads]. result(C, tails) :- coin(C) [c(C)=tails]. @uniform c(C). | coin(C) } toss(X, Y) :- result(c1, X), result(c2, Y). toss(A, B)?
23 Nov 2017 Research visit RWTH - Probabilistic Databases 44
NEED FOR HIGHER ORDER FUNCTIONALITY
OPPORTUNITY FOR NOVEL THEORETICAL RESEARCH
Creates as many random variables as there are solutions to coin(C)
§ A probabilistic database is just like a normal database but queries produce possible or approximate answers § Meant to transparently manage uncertain data § Database prototypes can handle real-world scenarios § Model for probabilistic database generalizable to other data models: XML and DataLog (JudgeD) Some possible applications § Data integration in imperfect circumstances § Data with quality problems § Handling results of natural language processing
23 Nov 2017 Research visit RWTH - Probabilistic Databases 45
CONCLUSIONS
Uncertainty: The Road Towards Better Data Interoperability. IT – Information Technology 54(3), 2012.
Probabilistic data integration § Represent data quality problems as probabilistic data § 80/20 rule: 80% if the issues are solvable in 20% of the time § Quick-and-dirty data integration whereby unresolved cases are represented as probabilistic data § Improve quality of data integration result during use Properties § Faster journey from start to first use § Trade development time for data quality § More robust against e.g. threshold settings § More potential for natural and effective feedback loops leveraging human attention for continuous improvement
23 Nov 2017 Research visit RWTH - Probabilistic Databases 46
OUTLOOK ON NEXT PRESENTATION
(Francis Bacon, 1605) (Jorge Luis Borges, 1979)
(often attributed to John Maynard Keynes, but Carveth Read, 1898)