Open-World Probabilistic Databases. Guy Van den Broeck. FLAIRS, May 23, 2017.


SLIDE 1

Open-World Probabilistic Databases

Guy Van den Broeck

FLAIRS May 23, 2017

SLIDE 2

Overview

  • 1. Why probabilistic databases?
  • 2. How probabilistic query evaluation?
  • 3. Why open world?
  • 4. How open-world query evaluation?
  • 5. What is the broader picture?
SLIDE 3

Why probabilistic databases?

SLIDE 4

What we’d like to do…

SLIDE 5

Google Knowledge Graph

> 570 million entities
> 18 billion tuples

SLIDE 6
Probabilistic Databases

  • Tuple-independent probabilistic database
  • Learned from the web, large text corpora, ontologies, etc., using statistical machine learning.

Coauthor
  x         y      P
  Erdos     Renyi  0.6
  Einstein  Pauli  0.7
  Obama     Erdos  0.1

Scientist
  x         P
  Erdos     0.9
  Einstein  0.8
  Pauli     0.6

[Suciu’11]

SLIDE 7

Information Extraction is Noisy!

Coauthor
  x    y         P
  Luc  Laura     0.7
  Luc  Hendrik   0.6
  Luc  Kathleen  0.3
  Luc  Paol      0.3
  Luc  Paolo     0.1

SLIDE 8

What we’d like to do…

∃x Coauthor(Einstein,x) ∧ Coauthor(Erdos,x)

SLIDE 9

Einstein is in the Knowledge Graph

SLIDE 10

Erdős is in the Knowledge Graph

SLIDE 11

This guy is in the Knowledge Graph

… and he published with both Einstein and Erdos!

SLIDE 12

Desired Query Answer

Ernst Straus
Barack Obama, …
Justin Bieber, …

  • 1. Fuse uncertain information from the web ⇒ Embrace probability!
  • 2. Cannot come from labeled data ⇒ Embrace query evaluation!

SLIDE 13

[Chen’16] (NYTimes)

SLIDE 14

How probabilistic query evaluation?

SLIDE 15

Tuple-Independent Probabilistic DB

Probabilistic database D:

Coauthor
  x  y  P
  A  B  p1
  A  C  p2
  B  C  p3

Possible-worlds semantics: every subset of the tuples is a possible world, e.g.
  {(A,B), (A,C), (B,C)}  with probability p1·p2·p3
  {(A,C), (B,C)}         with probability (1-p1)·p2·p3
  {} (the empty world)   with probability (1-p1)·(1-p2)·(1-p3)
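The possible-worlds semantics can be checked by brute force. A minimal Python sketch, with invented probabilities standing in for p1, p2, p3: it enumerates every subset of the tuples and sums the weights of the worlds where the query holds (exponential in the database size, of course).

```python
from itertools import product

# Invented probabilities standing in for p1, p2, p3.
coauthor = {("A", "B"): 0.6, ("A", "C"): 0.3, ("B", "C"): 0.8}

def world_prob(world):
    """Tuple independence: multiply p for tuples in the world, 1-p otherwise."""
    pr = 1.0
    for t, p in coauthor.items():
        pr *= p if t in world else 1.0 - p
    return pr

def query_prob(query):
    """P(Q) = total probability of the possible worlds where Q holds."""
    tuples = list(coauthor)
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        world = {t for t, b in zip(tuples, bits) if b}
        if query(world):
            total += world_prob(world)
    return total

# Example: P(exists x Coauthor(x, C)) = 1 - (1 - 0.3)(1 - 0.8) = 0.86
p = query_prob(lambda w: any(y == "C" for _, y in w))
```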

SLIDE 16

Probabilistic Query Evaluation

Q = ∃x∃y Scientist(x) ∧ Coauthor(x,y)

Scientist
  x  P
  A  p1   (X1)
  B  p2   (X2)
  C  p3   (X3)

Coauthor
  x  y  P
  A  D  q1   (Y1)
  A  E  q2   (Y2)
  B  F  q3   (Y3)
  B  G  q4   (Y4)
  B  H  q5   (Y5)

P(Q) = 1 - {1 - p1·[1-(1-q1)(1-q2)]} · {1 - p2·[1-(1-q3)(1-q4)(1-q5)]}
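The nested formula can be evaluated directly. A sketch with invented numbers for p1..p3 and q1..q5 (none of these values come from the slides):

```python
# Invented probabilities for Scientist (p1..p3) and for each constant's
# Coauthor tuples (q1, q2 for A; q3, q4, q5 for B; none for C).
scientist = {"A": 0.9, "B": 0.8, "C": 0.7}
coauthor = {"A": [0.6, 0.5], "B": [0.4, 0.3, 0.2], "C": []}

def noisy_or(ps):
    """1 - prod(1 - p): probability that at least one independent event fires."""
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

# P(Q) = 1 - prod_x (1 - p_x * [1 - prod_y (1 - q_xy)])
p_q = noisy_or(scientist[x] * noisy_or(coauthor[x]) for x in scientist)
```

Constant C has no Coauthor tuples, so its inner noisy-OR is 0 and it drops out of the product, exactly as in the slide's formula.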

SLIDE 17

Lifted Inference Rules

Preprocess Q (omitted), then apply rules (some have preconditions):

  • Decomposable ∧, ∨:
      P(Q1 ∧ Q2) = P(Q1) · P(Q2)
      P(Q1 ∨ Q2) = 1 - (1 - P(Q1)) · (1 - P(Q2))
  • Decomposable ∃, ∀:
      P(∃z Q) = 1 - Π_{A ∈ Domain} (1 - P(Q[A/z]))
      P(∀z Q) = Π_{A ∈ Domain} P(Q[A/z])
  • Inclusion/exclusion:
      P(Q1 ∧ Q2) = P(Q1) + P(Q2) - P(Q1 ∨ Q2)
      P(Q1 ∨ Q2) = P(Q1) + P(Q2) - P(Q1 ∧ Q2)
  • Negation:
      P(¬Q) = 1 - P(Q)

[Suciu’11]

SLIDE 18

Closed-World Lifted Query Eval

Q = ∃x ∃y Scientist(x) ∧ Coauthor(x,y)

Decomposable ∀-Rule. Check independence: Scientist(A) ∧ ∃y Coauthor(A,y) is independent of Scientist(B) ∧ ∃y Coauthor(B,y), so

P(Q) = 1 - Π_{A ∈ Domain} (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
     = 1 - (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
         × (1 - P(Scientist(B) ∧ ∃y Coauthor(B,y)))
         × (1 - P(Scientist(C) ∧ ∃y Coauthor(C,y)))
         × …

Complexity: PTIME

SLIDE 19

Limitations

H0 = ∀x∀y Smoker(x) ∨ Friend(x,y) ∨ Jogger(y)

The decomposable ∀-rule P(∀z Q) = Π_{A ∈ Domain} P(Q[A/z]) does not apply: H0[Alice/x] and H0[Bob/x] are dependent,
  ∀y (Smoker(Alice) ∨ Friend(Alice,y) ∨ Jogger(y))
  ∀y (Smoker(Bob) ∨ Friend(Bob,y) ∨ Jogger(y))
because both share the Jogger(y) tuples.

Lifted inference sometimes fails: computing P(H0) is #P-hard in the size of the database.
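Since no lifted rule applies, P(H0) can still be computed by raw enumeration of possible worlds. The sketch below does this for a two-person domain with every tuple probability set to 0.5 (all numbers invented); the loop is exponential in the number of tuples, which is exactly the #P-hardness at work.

```python
from itertools import product

# Brute-force evaluation of H0 over a two-person domain.
people = ["Alice", "Bob"]
tuples = ([("S", x) for x in people] + [("J", y) for y in people] +
          [("F", x, y) for x in people for y in people])  # 8 tuples

def holds(world):
    """H0: for all x, y: Smoker(x) or Friend(x,y) or Jogger(y)."""
    return all(("S", x) in world or ("F", x, y) in world or ("J", y) in world
               for x in people for y in people)

# With every tuple probability 0.5, P(H0) is the fraction of the 2^8
# possible worlds that satisfy H0.
count = sum(holds({t for t, b in zip(tuples, bits) if b})
            for bits in product([False, True], repeat=len(tuples)))
p_h0 = count / 2 ** len(tuples)  # every world has probability (1/2)^8
```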

[Suciu’11]

SLIDE 20

Are the Lifted Rules Complete?

You already know:

  • Inference rules: PTIME data complexity
  • Some queries: #P-hard data complexity

Dichotomy Theorem for UCQ / Mon. CNF

  • If lifted rules succeed, then PTIME query
  • If lifted rules fail, then query is #P-hard

Lifted rules are complete for UCQ!

[Dalvi and Suciu;JACM’11]

SLIDE 21

Why open world?

SLIDE 22

Knowledge Base Completion

Given:
Coauthor
  x         y       P
  Einstein  Straus  0.7
  Erdos     Straus  0.6
  Einstein  Pauli   0.9
  …         …       …

Learn:
0.8::Coauthor(x,y) :- Coauthor(z,x) ∧ Coauthor(z,y).

Complete:
Coauthor
  x       y      P
  Straus  Pauli  0.504
  …       …      …
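The completed tuple's probability follows from the rule firing through z = Einstein: rule weight times the probabilities of the two body atoms (assuming this single grounding and tuple independence, which is how the 0.504 on the slide appears to arise):

```python
# Rule 0.8::Coauthor(x,y) :- Coauthor(z,x), Coauthor(z,y), grounded with
# z = Einstein, x = Straus, y = Pauli:
p = 0.8 * 0.7 * 0.9  # weight * P(Einstein,Straus) * P(Einstein,Pauli)
```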

SLIDE 23

Bayesian Learning Loop

Bayesian view on learning:

  • 1. Prior belief:

P(Coauthor(Straus,Pauli)) = 0.01

  • 2. Observe page

P(Coauthor(Straus,Pauli) | …) = 0.2

  • 3. Observe page

P(Coauthor(Straus,Pauli) | …, …) = 0.3

Principled and sound reasoning!

SLIDE 24

Problem: Broken Learning Loop

Bayesian view on learning:

  • 1. Prior belief:

P(Coauthor(Straus,Pauli)) = 0

  • 2. Observe page

P(Coauthor(Straus,Pauli) | …) = 0.2

  • 3. Observe page

P(Coauthor(Straus,Pauli) | …, …) = 0.3

[Ceylan, Darwiche, Van den Broeck; KR’16]

This is mathematical nonsense!

SLIDE 25

What we’d like to do…

∃x Coauthor(Einstein,x) ∧ Coauthor(Erdos,x)

Ernst Straus
Kristian Kersting, …
Justin Bieber, …

SLIDE 26

Open World DB

  • What if a fact is missing?
  • Probability 0 for:

Coauthor
  X         Y          P
  Einstein  Straus     0.7
  Erdos     Straus     0.6
  Einstein  Pauli      0.9
  Erdos     Renyi      0.7
  Kersting  Natarajan  0.8
  Luc       Paol       0.1
  …         …          …

Q1 = ∃x Coauthor(Einstein,x) ∧ Coauthor(Erdos,x)
Q2 = ∃x Coauthor(Bieber,x) ∧ Coauthor(Erdos,x)
Q3 = Coauthor(Einstein,Straus) ∧ Coauthor(Erdos,Straus)
Q4 = Coauthor(Einstein,Bieber) ∧ Coauthor(Erdos,Bieber)
Q5 = Coauthor(Einstein,Bieber) ∧ ¬Coauthor(Einstein,Bieber)

SLIDE 27

Intuition

Coauthor
  X         Y          P
  Einstein  Straus     0.7
  Erdos     Straus     0.6
  Einstein  Pauli      0.9
  Erdos     Renyi      0.7
  Kersting  Natarajan  0.8
  Luc       Paol       0.1
  …         …          …

We know for sure that
  P(Q1) ≥ P(Q3),  P(Q1) ≥ P(Q4)
  P(Q3) ≥ P(Q5),  P(Q4) ≥ P(Q5)  because P(Q5) = 0.
We have strong evidence that P(Q1) ≥ P(Q2).

Q1 = ∃x Coauthor(Einstein,x) ∧ Coauthor(Erdos,x)
Q2 = ∃x Coauthor(Bieber,x) ∧ Coauthor(Erdos,x)
Q3 = Coauthor(Einstein,Straus) ∧ Coauthor(Erdos,Straus)
Q4 = Coauthor(Einstein,Bieber) ∧ Coauthor(Erdos,Bieber)
Q5 = Coauthor(Einstein,Bieber) ∧ ¬Coauthor(Einstein,Bieber)

[Ceylan, Darwiche, Van den Broeck; KR’16]

SLIDE 28

Problem: Curse of Superlinearity

  • Reality is worse!
  • Tuples are intentionally missing!

SLIDE 29

Problem: Curse of Superlinearity

Sibling
  x  y  P
  …  …  …

Storing the Sibling relation explicitly at Facebook scale ⇒ 200 exabytes of data
(all Google storage is about 2 exabytes…)

[Ceylan, Darwiche, Van den Broeck; KR’16]

SLIDE 30

Problem: Model Evaluation

Given:
Coauthor
  x         y       P
  Einstein  Straus  0.7
  Erdos     Straus  0.6
  Einstein  Pauli   0.9
  …         …       …

Learn:
0.8::Coauthor(x,y) :- Coauthor(z,x) ∧ Coauthor(z,y).
        OR
0.6::Coauthor(x,y) :- Affiliation(x,z) ∧ Affiliation(y,z).

What is the likelihood, precision, accuracy, …?

[De Raedt et al; IJCAI’15]

SLIDE 31

Open-World Prob. Databases

Intuition: tuples can be added with P < λ

Q2 = Coauthor(Einstein,Straus) ∧ Coauthor(Erdos,Straus)

Coauthor (closed world)
  X         Y          P
  Einstein  Straus     0.7
  Einstein  Pauli      0.9
  Erdos     Renyi      0.7
  Kersting  Natarajan  0.8
  Luc       Paol       0.1
  …         …          …

Coauthor (open world: add the missing tuple)
  X         Y          P
  Einstein  Straus     0.7
  Einstein  Pauli      0.9
  Erdos     Renyi      0.7
  Kersting  Natarajan  0.8
  Luc       Paol       0.1
  Erdos     Straus     λ
  …         …          …

0.7 · λ ≥ P(Q2) ≥ 0
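The two bounds are just the query evaluated at the two extremes of the missing tuple's probability. A one-screen sketch (λ here is an invented threshold, not a value from the deck):

```python
# Bounds for Q2 = Coauthor(Einstein,Straus) AND Coauthor(Erdos,Straus).
# Coauthor(Einstein,Straus) is in the database with probability 0.7;
# Coauthor(Erdos,Straus) is missing, so its probability lies in [0, lam].
lam = 0.3
p_present = 0.7

lower = p_present * 0.0   # closed world: the missing tuple has probability 0
upper = p_present * lam   # open world: add the missing tuple at lambda
```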

SLIDE 32

How open-world query evaluation?

SLIDE 33

UCQ / Monotone CNF

  • Lower bound = closed-world probability
  • Upper bound = probability after adding all missing tuples with probability λ
  • Polynomial time ☺
  • Quadratic blow-up ☹
  • 200 exabytes … again ☹
SLIDE 34

Closed-World Lifted Query Eval

Q = ∃x ∃y Scientist(x) ∧ Coauthor(x,y)

Decomposable ∀-Rule. Check independence: Scientist(A) ∧ ∃y Coauthor(A,y) is independent of Scientist(B) ∧ ∃y Coauthor(B,y), so

P(Q) = 1 - Π_{A ∈ Domain} (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
     = 1 - (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
         × (1 - P(Scientist(B) ∧ ∃y Coauthor(B,y)))
         × (1 - P(Scientist(C) ∧ ∃y Coauthor(C,y)))
         × …

Complexity: PTIME

SLIDE 35

Closed-World Lifted Query Eval

No supporting facts in the database! Those constants have probability 0 in the closed world, so ignore them:

Q = ∃x ∃y Scientist(x) ∧ Coauthor(x,y)
P(Q) = 1 - Π_{A ∈ Domain} (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
     = 1 - (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
         × (1 - P(Scientist(B) ∧ ∃y Coauthor(B,y)))
         × …

Complexity: linear time!

SLIDE 36

Open-World Lifted Query Eval

No supporting facts in the database! In the open world, each such constant still contributes some probability p:

Q = ∃x ∃y Scientist(x) ∧ Coauthor(x,y)
P(Q) = 1 - Π_{A ∈ Domain} (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
     = 1 - (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))
         × (1 - P(Scientist(B) ∧ ∃y Coauthor(B,y)))
         × …

Complexity: PTIME!

SLIDE 37

Open-World Lifted Query Eval

No supporting facts in the database! Each such constant contributes the same probability p, so all k of them together contribute the factor (1-p)^k: do symmetric lifted inference.

Q = ∃x ∃y Scientist(x) ∧ Coauthor(x,y)
P(Q) = 1 - (1-p)^k × Π_{A with supporting facts} (1 - P(Scientist(A) ∧ ∃y Coauthor(A,y)))

Complexity: linear time!
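The (1-p)^k shortcut is a one-liner in code. A sketch with invented per-constant probabilities: instead of multiplying (1 - p) once per constant with no supporting facts, raise it to the k-th power once.

```python
# Symmetric shortcut: constants with no supporting facts all contribute the
# same factor (1 - p).  All numbers here are invented.
per_constant = {"A": 0.72, "B": 0.5312}  # P(Scientist(x) and exists y ...)
p = 0.05   # same probability for every constant with no supporting facts
k = 1000   # number of such constants

prod_db = 1.0
for q in per_constant.values():
    prod_db *= 1.0 - q          # constants backed by database tuples

p_q = 1.0 - prod_db * (1.0 - p) ** k   # one pow() replaces k multiplications
```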

SLIDE 38

[Ceylan’16]

Complexity Results

SLIDE 39

Implement PDB Query in SQL

  – Convert to nested SQL recursively
  – Open-world existential quantification
  – Conjunction
  – Run as a single query!

Open-world existential quantification:

SELECT (1.0 - (1.0 - pUse) * power(1.0 - 0.0001, (4 - ct))) AS pUse
FROM (SELECT ior(COALESCE(pUse, 0)) AS pUse, count(*) AS ct
      FROM SQL(conjunction))

(0.0001 = open-world probability; 4 = number of open-world query instances; ior = independent-OR aggregate function)

Conjunction, for Q = ∃x P(x) ∧ Q(x):

SELECT q9.c5, COALESCE(q9.pUse, λ) * COALESCE(q10.pUse, λ) AS pUse
FROM SQL(Q(X)) OUTER JOIN SQL(P(X))

SELECT Q.v0 AS c5, p AS pUse FROM Q

[Tal Friedman, Eric Gribkoff]
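The open-world correction in that SQL can be restated in a few lines of Python (a sketch: the two example database probabilities are invented, and ior mirrors the independent-OR aggregate named on the slide):

```python
# 1 - (1 - pUse) * (1 - lam)^(n - ct): combine the ct tuples present in the
# database with the remaining n - ct open-world instances, each bounded by lam.
lam = 0.0001  # the 0.0001 in the SQL
n = 4         # the 4 query instances in the SQL

def ior(ps):
    """Independent-OR aggregate, like the ior() function in the SQL."""
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

def open_world_or(db_probs):
    ct = len(db_probs)
    p_use = ior(db_probs)
    return 1.0 - (1.0 - p_use) * (1.0 - lam) ** (n - ct)

p = open_world_or([0.7, 0.2])  # two of the four instances are in the DB
```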

SLIDE 40

OpenPDB vs ProbLog Running Times (s)

[Plot: running time (s) against size of domain (10–70 constants); series: PDB, ProbLog, Linear (PDB). ProbLog ran out of memory with 70 constants in the domain.]

[Tal Friedman]

SLIDE 41
OpenPDB vs ProbLog Running Times (s)

[Plot: running time (s) against size of domain (500–3000 constants); series: PDB, ProbLog, Linear (PDB). 12.5 million random variables!]

[Tal Friedman]

SLIDE 42

What is the broader picture?

SLIDE 43

The Broader Picture

  • Statistical relational learning (e.g., Markov logic)
  • Open-domain models (BLOG)
  • Probabilistic description logics
  • Certain query answers in databases
  • Open information extraction
  • Learning from positive-only examples
  • Imprecise probabilities: credal sets, interval probability, qualitative uncertainty
  • Credal Bayesian networks
SLIDE 44

Related Work: Lifted Probabilistic Inference

Probability that Card1 is Hearts? 1/4

[Van den Broeck; AAAI-KRR’15]

SLIDE 45

Open-World Lifted Query Eval

Open-world query evaluation on an empty database = lifted inference in AI. With no facts at all, every constant contributes the same factor, so all together the product is (1-p)^k:

Q = ∃x ∃y Smoker(x) ∧ Friend(x,y)
P(Q) = 1 - Π_{A ∈ Domain} (1 - P(Smoker(A) ∧ ∃y Friend(A,y)))
     = 1 - (1 - P(Smoker(A) ∧ ∃y Friend(A,y)))
         × (1 - P(Smoker(B) ∧ ∃y Friend(B,y)))
         × …

SLIDE 46

Conclusions

  • Relational probabilistic reasoning is the frontier and integration of AI, KR, ML, DB, TH, etc.
  • We need:
      relational models and logic
      probabilistic models and statistical learning
      algorithms that scale
  • Open-world data model:
      the semantics makes sense
      FREE for UCQs
      expensive otherwise

SLIDE 47

References

  • Ceylan, Ismail Ilkan, Adnan Darwiche, and Guy Van den Broeck. "Open-world probabilistic databases." Proceedings of KR (2016).

  • Suciu, Dan, Dan Olteanu, Christopher Ré, and Christoph Koch. "Probabilistic databases."

Synthesis Lectures on Data Management 3, no. 2 (2011): 1-180.

  • Dong, Xin, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas

Strohmann, Shaohua Sun, and Wei Zhang. "Knowledge vault: A web-scale approach to probabilistic knowledge fusion." In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 601-610. ACM, 2014.

  • Carlson, Andrew, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. "Toward an Architecture for Never-Ending Language Learning." In AAAI, vol. 5, p. 3. 2010.
  • Niu, Feng, Ce Zhang, Christopher Ré, and Jude W. Shavlik. "DeepDive: Web-scale Knowledge-

base Construction using Statistical Learning and Inference." VLDS 12 (2012): 25-28.

SLIDE 48

References

  • Chen, Brian X. "Siri, Alexa and Other Virtual Assistants Put to the Test" The New York Times

(2016).

  • Dalvi, Nilesh, and Dan Suciu. "The dichotomy of probabilistic inference for unions of

conjunctive queries." Journal of the ACM (JACM) 59, no. 6 (2012): 30.

  • De Raedt, Luc, Anton Dries, Ingo Thon, Guy Van den Broeck, and Mathias Verbeke. "Inducing

probabilistic relational rules from probabilistic examples." In Proceedings of the 24th International Conference on Artificial Intelligence, pp. 1835-1843. AAAI Press, 2015.

  • Van den Broeck, Guy. "Towards high-level probabilistic reasoning with lifted inference." AAAI

Spring Symposium on KRR (2015).

  • Niepert, Mathias, and Guy Van den Broeck. "Tractability through exchangeability: A new

perspective on efficient probabilistic inference." AAAI (2014).

  • Van den Broeck, Guy. "On the completeness of first-order knowledge compilation for lifted probabilistic inference." In Advances in Neural Information Processing Systems, pp. 1386-1394. 2011.
SLIDE 49

References

  • Van den Broeck, Guy, Wannes Meert, and Adnan Darwiche. "Skolemization for weighted first-order model counting." In Proceedings of the 14th International Conference on Principles of Knowledge Representation and Reasoning (KR). 2014.

  • Gribkoff, Eric, Guy Van den Broeck, and Dan Suciu. "Understanding the complexity of lifted

inference and asymmetric weighted model counting." UAI, 2014.

  • Beame, Paul, Guy Van den Broeck, Eric Gribkoff, and Dan Suciu. "Symmetric weighted first-order model counting." In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 313-328. ACM, 2015.
  • Chavira, Mark, and Adnan Darwiche. "On probabilistic inference by weighted model

counting." Artificial Intelligence 172.6 (2008): 772-799.

  • Sang, Tian, Paul Beame, and Henry A. Kautz. "Performing Bayesian inference by weighted

model counting." AAAI. Vol. 5. 2005.

SLIDE 50

References

  • Van den Broeck, Guy, Nima Taghipour, Wannes Meert, Jesse Davis, and Luc De Raedt. "Lifted

probabilistic inference by first-order knowledge compilation." In Proceedings of the Twenty- Second international joint conference on Artificial Intelligence, pp. 2178-2185. AAAI Press/International Joint Conferences on Artificial Intelligence, 2011.

  • Van den Broeck, Guy. Lifted inference and learning in statistical relational models. Diss. Ph. D.

Dissertation, KU Leuven, 2013.

  • Gogate, Vibhav, and Pedro Domingos. "Probabilistic theorem proving." UAI (2011).
SLIDE 51

References

  • Belle, Vaishak, Andrea Passerini, and Guy Van den Broeck. "Probabilistic inference in hybrid domains by weighted model integration." Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI). 2015.
  • Belle, Vaishak, Guy Van den Broeck, and Andrea Passerini. "Hashing-based approximate

probabilistic inference in hybrid domains." In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI). 2015.

  • Fierens, Daan, Guy Van den Broeck, Joris Renkens, Dimitar Shterionov, Bernd Gutmann, Ingo Thon, Gerda Janssens, and Luc De Raedt. "Inference and learning in probabilistic logic programs using weighted boolean formulas." Theory and Practice of Logic Programming 15, no. 3 (2015): 358-401.