PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - PowerPoint PPT Presentation

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP

WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente § Research interests: Data integration, data quality, information extraction, natural language processing, data cleaning I am here on a research visit § Period Nov’17 – Feb’18 several times 3 or 2 days § Goals: § Compare Probabilistic Programming (PP) with Probabilistic Databases (PDBs) § Does PP lead to new insights in PDBs and Probabilistic Data Integration (PDI)? Research visit RWTH - Probabilistic Databases 23 Nov 2017 2

CONTENTS § Motivational example § Intuition – What is a probabilistic database § Theoretical foundation – possible worlds theory § Querying a probabilistic database § Real-world query performance § Probabilistic databases for other data models (XML, DataLog) § Conclusions § Outlook on my next presentation on “Probabilistic Data Integration” Research visit RWTH - Probabilistic Databases 23 Nov 2017 3

MOTIVATIONAL EXAMPLE AND WHAT IS A PROBABILISTIC DATABASE?

MOTIVATIONAL EXAMPLE: COMBINING DATA … Car brand Sales Car brand Sales Car brand Sales B.M.W. 25 BMW 72 Bayerische Motoren Werke 8 Mercedes 32 Mercedes-Benz 39 Mercedes 35 Renault 10 Renault 20 Renault 15 Car brand Sales B.M.W. 25 Bayerische Motoren Werke 8 BMW 72 Mercedes 67 Keulen, M. (2012) Mercedes-Benz 39 Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54 (3). Renault 45 pp. 138-146. ISSN 1611-2776 Research visit RWTH - Probabilistic Databases 23 Nov 2017 5

… AND THE PROBLEM OF SEMANTIC DUPLICATES Car brand Sales Preferred customers … B.M.W. 25 SELECT SUM(Sales) Bayerische Motoren Werke 8 FROM CarSales BMW 72 WHERE Sales>100 Mercedes 67 Mercedes-Benz 39 Renault 45 0 ‘No preferred customers’ Research visit RWTH - Probabilistic Databases 23 Nov 2017 6

SEMANTIC DUPLICATES Real world Database (of car brands) ω o 1 Car brand Sales d 1 o 2 B.M.W. 25 Bayerische 8 d 2 Motoren Werke d 3 o 3 Mercedes 67 d 4 Renault 45 d 5 o 4 BMW 72 d 6 Mercedes-Benz 39 Research visit RWTH - Probabilistic Databases 23 Nov 2017 7

MOST DATA QUALITY PROBLEMS CAN BE MODELED AS UNCERTAINTY IN DATA Run some duplicate detection tool Car brand Sales B.M.W. 25 1 X=0 4 and 5 different 0.2 Bayerische Motoren Werke 8 2 X=1 4 and 5 the same 0.8 BMW 72 3 4 Mercedes 67 X=0 Y=0 “Mercedes” 0.5 Mercedes-Benz 39 5 X=0 correct name Renault 45 6 Y=1 “Mercedes-Benz” 0.5 correct name Mercedes 106 X=1 Y=0 Mercedes-Benz 106 X=1 Y=1 B.M.W. / BMW / Bayerische Motoren Werke analogously Research visit RWTH - Probabilistic Databases 23 Nov 2017 8

WHAT I HAVE NOW IS A PROBABILISTIC DATABASE § Looks like ordinary database § Several “possible” answers or approximate answers to queries § Important: Scalability (big data!) SUM(sales) P Sales of “preferred customers” 0 14% § SELECT SUM(sales) 105 6% FROM carsales 106 56% WHERE sales ≥ 100 211 24% Research visit RWTH - Probabilistic Databases 23 Nov 2017 9

QUERYING AND RELIABILITY ASSESSMENT Sales of “preferred customers” SUM(sales) P § SELECT SUM(sales) 0 14% FROM carsales 105 6% WHERE sales ≥ 100 106 56% § Answer: 106 Risk of substantially 211 24% wrong answer § Risk = Probability * Impact § Analyst only bothered with Second most likely problems that matter answer at 24% with impact factor 2 in sales (211 vs 106) Research visit RWTH - Probabilistic Databases 23 Nov 2017 10

THEORETICAL FOUNDATION QUERYING A PROBABILISTIC DATABASE

X=0 4 and 5 different 0.2 X=1 4 and 5 the same 0.8 POSSIBLE WORLDS THEORY Y=0 “Mercedes” correct 0.5 Y=1 “Mercedes-Benz” correct 0.5 Car brand Sales Mercedes 67 4 X=0 X=1 0.4 Y=0 Mercedes-Benz 39 5 X=0 Car brand Sales Mercedes 106 X=1 Y=0 Mercedes 106 106 Mercedes-Benz 106 X=1 Y=1 Renault 45 X=0 Y=1 Renault 45 6 Car brand Sales 0.1 Mercedes 67 Mercedes-Benz 39 0.2 * 0.5 X=1 Renault 45 0 X=0 = 0.1 Y=1 Y=0 Car brand Sales 0.4 Car brand Sales Mercedes-Benz 106 ‘0’ (0.2) Mercedes 67 Renault 45 106 ‘106’ (0.8) Mercedes-Benz 39 Renault 45 0 Research visit RWTH - Probabilistic Databases 23 Nov 2017 12

POSSIBLE WORLDS THEORY Associate each tuple t i with a sentence 𝜒 i i § (t i , 𝜒 i ) i ) 𝜒 is a propositional formula § atoms of the form r = v Example § ( <carbrand=“Mercedes”, sales=106>, X=1 ⋀ Y=0 ) This tuple exists in all worlds for which 𝜒 is true Research visit RWTH - Probabilistic Databases 23 Nov 2017 13

QUERYING IN POSSIBLE WORLDS THEORY Given a database D as a probabilistic relation § D ={ (t 1 , 𝜒 1 ), … , (t n , 𝜒 n ) } A world w is induced by a random variable assignment for each random variable § Example § 𝜄 w ={ X ⟼ 1, Y ⟼ 0 } P( w ) = P( 𝜄 w ) = 0.8 x 0.5 = 0.4 § w ={ <“Mercedes”, 106>, <“Renault”, 45> } § PWS is the set of all possible worlds Semantics § Tuple t i exists in world w iff is 𝜒 i true for 𝜄 w i true for 𝜄 w § Query semantics: Q( D )= ∪ w ∈ PWS Q( w ) Research visit RWTH - Probabilistic Databases 23 Nov 2017 14

POSSIBLE WORLDS THEORY possible worlds possible answers Q Query semantics Theory Implementation Q’ Query implementation compact representation representation of possible answers Research visit RWTH - Probabilistic Databases 23 Nov 2017 15

PROBABILISTIC RELATIONAL ALGEBRA SQL queries in relational databases are executed by § Translating them to relational algebra (query plan) § Optimizing the query plan § Choosing implementations for certain operators (e.g., mergejoin for a join / index lookup for a select) § Execution of the query plan Approach: § Extend relational algebra with sentence propagation Research visit RWTH - Probabilistic Databases 23 Nov 2017 16

PROBABILISTIC RELATIONAL ALGEBRA A selection of relational algebra operations § Selection: σ p (R) (t, 𝜒 ) ∈ R p(t)=true ⇔ (t, 𝜒 ) ∈ σ p (R) § Cartesian product: R x S (t 1 , 𝜒 1 ) ∈ R (t 2 , 𝜒 2 ) ∈ S ⇔ (t 1 t 2 , 𝜒 1 ⋀ 𝜒 2 ) ∈ R x S § Join: R ⋈ p S = σ p (RxS) § Duplicate removal: 𝛆 (R) (t i , 𝜒 i ) ∈ R for i ∈ [1..n] t 1 = … =t n ⇔ (t 1 , 𝜒 1 ⋁ … ⋁ 𝜒 n ) ∈ 𝛆 (R) i ) ∈ R for i ∈ [1..n] t 1 = … =t n ⇔ (t 1 , 𝜒 1 ⋁ … ⋁ 𝜒 n ) ∈ 𝛆 (R) Research visit RWTH - Probabilistic Databases 23 Nov 2017 17

ALTERNATIVE IMPLEMENTATION § In this way one could implement a probabilistic relational database Alternative on-top-of normal RDBMS approach (e.g., MayBMS, Trio) § Represent sentences with additional columns / tables in a normal database § Map SQL query Q to Q’ which additionally performs sentence propagation § Execute Q’ § Logically the same thing, but leverages existing RDBMS functionality to the fullest Research visit RWTH - Probabilistic Databases 23 Nov 2017 18

REAL-WORLD QUERY PERFORMANCE

REAL-WORLD EXPERIMENT ON QUERY PERFORMANCE Research question § Are current probabilistic database prototypes mature enough for real-world use? Approach § PDB prototype: MayBMS § Real world scenario § Probabilistic integration of 3 biological databases § Query load with typical queries for the scenario § Measure query execution times Research visit RWTH - Probabilistic Databases 23 Nov 2017 20

REAL-WORLD SCENARIO ORTHOLOGOUS GROUPS IN BIO-INFORMATICS § Proteins in orthologous group expected to have same function § Example “disease-causing bacteria” “Ancient Paperbird” § Identified protein if silenced will A 0 A 1 A 2 kill bacteria. Side-effects in humans? Ø Find orthologous proteins § Combined insight of multiple sources “Long-beaked “Running § Sources: Homologene, PIRSF, and Paperbird” Paperbird” eggNOG L 0 L 1 L 2 R 0 R 1 R 2 § Challenges § Integrating conflicting groups B. Wanders, M. van Keulen, P. van der Vet, Uncertain § No truth & partial untrustworthiness: Groupings: Probabilistic data established with different combination of grouping data. DEXA 2015. methods by different research groups Research visit RWTH - Probabilistic Databases 23 Nov 2017 21

UNCERTAIN GROUPINGS § Uncertain groupings § Iterative data understanding with Integration views Source Groups How to combine the insights of S 1 ABC 1 DE 1 FG 1 orthologous grouping of S 2 AB 2 CD 2 FH 2 proteins from 3 sources? S 3 ABE 3 FGH 3 Research visit RWTH - Probabilistic Databases 23 Nov 2017 22

INTEGRATION VIEW 1: COMP COMBINATION OF INDEPENDENT COMPONENTS X= Y= ABC 1 DE 1 FG 1 Group 1 1 ABC 1 X=1 DE 1 X=1 AB 2 CD 2 FH 2 2 2 FG 1 Y=1 AB 2 X=2 3 FGH 3 3 ABE 3 CD 2 X=2 FH 2 Y=2 Two choices: ABE 3 X=3 • Source S i is correct for FGH 3 Y=3 component ABCDE • Source S i is correct for component FGH Research visit RWTH - Probabilistic Databases 23 Nov 2017 23

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - PowerPoint PPT Presentation

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente Research

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Probabilistic model Probabilistic model c Probabilistic model Probabilistic model c c

Module 3: Creating and Managing Databases Overview Creating Databases Creating

Open-World Probabilistic Databases Guy Van den Broeck GCAI Oct 21, 2017 Overview 1. Why

Open-World Probabilistic Databases Guy Van den Broeck FLAIRS May 23, 2017 Overview 1. Why

Probabilistic Databases Guy Van den Broeck Scalable Uncertainty Management (SUM) Sep 21, 2016

CS 4110 Probabilistic Programming Probabilistic Programming It's not about writing software.

GEMS/Food Databases and GEMS/Food Databases and GEMS/Food Databases and in the Food Supply

Image Databases Image Databases Image Databases Prof. Paolo Ciaccia Prof. Paolo Ciaccia

Lecture 10: Larger-than-Memory Databases 1 / 53 Larger-than-Memory Databases Recap

Databases and PHP Accessing databases from PHP PHP & Databases l PHP can connect to

3. Text and document databases Normal databases: formatted records; document databases:

Indexing Multimedia Multimedia Databases Databases Indexing Indexing Multimedia Databases

Scalable Uncertainty Management 04 Probabilistic Databases Rainer Gemulla Jun 1, 2012

Part 2 Some mathematical principles and problem-solving techniques 2 1 Some mathematical

Westminster Food and Nutrition Forum 21 Sept 2017 Helen Munday, CSO and Director, Regulation,

This method (in short: PCDM) uses graphics and other data, preferably BPR- Tool symbols, on

probability in environmental risk analysis Ullrika Sahlin Tuesday 16.00-17.30 1 Outline

Week 4 -Wednesday What did we talk about last time? Exceptions Catching exceptions

Week 3 - Friday What did we talk about last time? Wrapper classes Examples if

Exploring the Presence of Technical Debt in Industrial GUI-based Testware: A Case Study Emil

Class Two c++ is a typed language what a thing is matters. A thing is a variable in

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA - PowerPoint PPT Presentation

PROBABILISTIC DATABASES MAURICE VAN KEULEN ASSOCIATE PROFESSOR DATA MANAGEMENT TECHNOLOGY HEAD OF DATABASE GROUP WHO AM I AND WHY AM I HERE Maurice van Keulen Associate Professor Data Management Technology University of Twente Research

Creating Databases and Tables Introduction to Databases in Python Creating Databases

Inductive Inductive Inductive Inductive Databases Databases Databases Databases and

Lecture 11: Persistent Memory Databases 1 / 71 Persistent Memory Databases Recap

Probabilistic model Probabilistic model c Probabilistic model Probabilistic model c c

Module 3: Creating and Managing Databases Overview Creating Databases Creating

Open-World Probabilistic Databases Guy Van den Broeck GCAI Oct 21, 2017 Overview 1. Why

Open-World Probabilistic Databases Guy Van den Broeck FLAIRS May 23, 2017 Overview 1. Why

Probabilistic Databases Guy Van den Broeck Scalable Uncertainty Management (SUM) Sep 21, 2016

CS 4110 Probabilistic Programming Probabilistic Programming It's not about writing software.

GEMS/Food Databases and GEMS/Food Databases and GEMS/Food Databases and in the Food Supply

Image Databases Image Databases Image Databases Prof. Paolo Ciaccia Prof. Paolo Ciaccia

Lecture 10: Larger-than-Memory Databases 1 / 53 Larger-than-Memory Databases Recap

Databases and PHP Accessing databases from PHP PHP &amp; Databases l PHP can connect to

3. Text and document databases Normal databases: formatted records; document databases:

Indexing Multimedia Multimedia Databases Databases Indexing Indexing Multimedia Databases

Scalable Uncertainty Management 04 Probabilistic Databases Rainer Gemulla Jun 1, 2012

Part 2 Some mathematical principles and problem-solving techniques 2 1 Some mathematical

Westminster Food and Nutrition Forum 21 Sept 2017 Helen Munday, CSO and Director, Regulation,

This method (in short: PCDM) uses graphics and other data, preferably BPR- Tool symbols, on

probability in environmental risk analysis Ullrika Sahlin Tuesday 16.00-17.30 1 Outline

Week 4 -Wednesday What did we talk about last time? Exceptions Catching exceptions

Week 3 - Friday What did we talk about last time? Wrapper classes Examples if

Exploring the Presence of Technical Debt in Industrial GUI-based Testware: A Case Study Emil

Class Two c++ is a typed language what a thing is matters. A thing is a variable in

Databases and PHP Accessing databases from PHP PHP & Databases l PHP can connect to