Aggregation in Probabilistic Databases via Knowledge Compilation

Robert Fink, Larisa Han, Dan Olteanu
University of Oxford
VLDB 2012, Istanbul

Outline
- Motivation
- Algebraic Foundations
- Representation System
- Query Evaluation

Who is responsible for a larger capacity of biogas plants, Democrats or Republicans?

More biomass plant capacity, Democrats or Republicans?

How to come up with an answer?

Option 1: Use Wikipedia. Search for lists of Governors and their terms. Search for a list of biomass plants, find out when and where they were built, and match them up with the Governors of US states. Group by the political parties of the Governors and sum the capacity of the plants. (Phew.)

Option 2: Find tables on Governors and biomass plants on the Web and write a query like

    compute sum(Plant.capacity)
    from Governor, Plant
    where
      - Plant.date matches Governor.term
      - Plant.location matches Governor.state
    group by Governor.party

Biomass Plants in the US

Governors in US States

Deterministic case

G.Name  G.Party  G.State  P.Location  P.capacity
G1      Dem      CA       CA          17
G2      Dem      FL       FL          5
G3      Dem      NY       NY          9
...
G4      Rep      NY       NY          8
G5      Rep      CA       CA          14
G6      Rep      CA       CA          2

Problem to solve: 17 + 5 + 9 > 8 + 14 + 2?
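The deterministic question is a plain group-by-and-sum. A minimal sketch in Python, assuming only the six rows visible on the slide (the rows elided by "..." are not included, and the names `ROWS` and `capacity_by_party` are illustrative):

```python
from collections import defaultdict

# Rows visible on the slide: (G.Name, G.Party, G.State, P.capacity).
# The rows hidden behind "..." on the slide are not known and are left out.
ROWS = [
    ("G1", "Dem", "CA", 17),
    ("G2", "Dem", "FL", 5),
    ("G3", "Dem", "NY", 9),
    ("G4", "Rep", "NY", 8),
    ("G5", "Rep", "CA", 14),
    ("G6", "Rep", "CA", 2),
]


def capacity_by_party(rows):
    """Group the joined rows by party and sum the plant capacities."""
    totals = defaultdict(int)
    for _name, party, _state, capacity in rows:
        totals[party] += capacity
    return dict(totals)


totals = capacity_by_party(ROWS)
dem_wins = totals["Dem"] > totals["Rep"]
```

On these rows the comparison is 31 > 24, so `dem_wins` is true.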

Uncertain case

G.Name  G.Party  G.State  P.Location  P.capacity  Φ
P1      Dem      CA       SF, CA      17          $x_1$ (p=0.9)
P2      Dem      FL       Florida     5           $x_2$ (p=0.5)
P3      Dem      NY       NY          9           $x_3$ (p=1.0)
...
P4      Rep      NY       NY          8           $y_1$ (p=1.0)
P5      Rep      CA       LA, CA      14          $y_2$ (p=0.8)
P6      Rep      CA       Berkeley    2           $y_3$ (p=0.2)

Problem to solve: $x_1 \otimes 17 + x_2 \otimes 5 + x_3 \otimes 9 \overset{?}{>} y_1 \otimes 8 + y_2 \otimes 14 + y_3 \otimes 2$

Algebraic Expressions give rise to Random Variables

Democratic Biogas Capacity > Republican Biogas Capacity

$\Phi = [\, x_1 \otimes 17 + x_2 \otimes 5 + x_3 \otimes 9 > y_1 \otimes 8 + y_2 \otimes 14 + y_3 \otimes 2 \,]$

- $x_1, x_2, x_3, y_1, y_2, y_3$ are Boolean random variables
- Then the sum expression $\alpha = x_1 \otimes 17 + x_2 \otimes 5 + x_3 \otimes 9$ is an $\mathbb{N}$-valued random variable
- Hence $\Phi$ is a $\mathbb{B}$-valued random variable
- $P_\Phi[\top]$ is the probability that a random choice of possible values for the variables $x_1, x_2, x_3, y_1, y_2, y_3$ satisfies the inequality
- In the previous example, $P_\Phi[\top]$ is the probability that Democrats were responsible for a higher capacity of biogas plants
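To make $P_\Phi[\top]$ concrete, the following Python sketch enumerates every possible world for the six tuples of the uncertain table and adds up the weight of the worlds in which the inequality holds. It assumes the six variables are independent, which the slides do not state explicitly, and it is brute-force enumeration for illustration only, not the knowledge compilation technique of the talk. The names `DEM`, `REP` and `probability_dem_exceeds_rep` are illustrative.

```python
from itertools import product

# (probability, capacity) per tuple, copied from the "Uncertain case" table.
DEM = [(0.9, 17), (0.5, 5), (1.0, 9)]   # x_1, x_2, x_3
REP = [(1.0, 8), (0.8, 14), (0.2, 2)]   # y_1, y_2, y_3


def probability_dem_exceeds_rep(dem, rep):
    """Sum the weights of all worlds where the Dem sum exceeds the Rep sum.

    Assumption: the Boolean variables are independent, so the weight of a
    world is the product of p (tuple present) or 1 - p (tuple absent).
    """
    tuples = dem + rep
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        weight = 1.0
        for present, (p, _capacity) in zip(world, tuples):
            weight *= p if present else 1.0 - p
        dem_world, rep_world = world[:len(dem)], world[len(dem):]
        dem_sum = sum(c for on, (_p, c) in zip(dem_world, dem) if on)
        rep_sum = sum(c for on, (_p, c) in zip(rep_world, rep) if on)
        if dem_sum > rep_sum:
            total += weight
    return total


p_true = probability_dem_exceeds_rep(DEM, REP)
```

Under the independence assumption this evaluates to roughly 0.918. Enumeration is exponential in the number of variables, which is why it serves only as a reference point here.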

Monoids, Semirings, Semimodule

What do we mean by $+$ in $\Phi_1 \otimes 17 + \Phi_2 \otimes 5$? Well, it depends...

Aggregation is modelled by commutative monoids:
- Carrier $M$, e.g. $\mathbb{N}$ or $\mathbb{R}$
- Binary operation $M \times M \to M$
- Neutral element $0 \in M$

Examples of aggregation monoids: SUM $(\mathbb{N}, +, 0)$, MIN $(\mathbb{N}, \min, \infty)$, MAX $(\mathbb{N}, \max, -\infty)$, PROD, COUNT (special case of SUM)
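The monoid view can be sketched in a few lines of Python: each aggregate is a binary operation plus its neutral element, and aggregating is a fold. The neutral elements for SUM, MIN and MAX are taken from the slide; using 1 for PROD and defining COUNT as SUM over ones are the usual choices and are assumptions here. The names `MONOIDS`, `aggregate` and `count` are illustrative.

```python
from functools import reduce
import math
import operator

# Each aggregation monoid is a pair (binary operation, neutral element).
MONOIDS = {
    "SUM": (operator.add, 0),
    "MIN": (min, math.inf),
    "MAX": (max, -math.inf),
    "PROD": (operator.mul, 1),  # assumption: neutral element 1
}


def aggregate(name, values):
    """Fold the values with the monoid operation, starting at the neutral element."""
    op, neutral = MONOIDS[name]
    return reduce(op, values, neutral)


def count(values):
    """COUNT as a special case of SUM: every value contributes 1."""
    return aggregate("SUM", (1 for _ in values))


capacities = [17, 5, 9]
results = {name: aggregate(name, capacities) for name in MONOIDS}
```

Aggregating an empty collection returns the neutral element, which is what makes the fold well defined when no tuple is present in a possible world.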

Monoids, Semirings, Semimodule

What are $\Phi_1, \Phi_2$ in $\Phi_1 \otimes 17 + \Phi_2 \otimes 5$?

Consider the relations R, S, T and the query $\mathrm{AGG}_B\big((R \cup S) \bowtie_A T\big)$:

R:
A  Φ
1  $x_1$
2  $x_2$

S:
A  Φ
1  $y_1$

T:
A  B   Φ
1  17  $z_1$
2  5   $z_2$

Tuple annotations are modelled by semirings. $(R \cup S) \bowtie_A T$ yields

A  B   Φ
1  17  $(x_1 + y_1) \cdot z_1$
2  5   $x_2 \cdot z_2$

Aggregation on top of this table yields

$((x_1 + y_1) \cdot z_1) \otimes 17 + (x_2 \cdot z_2) \otimes 5$

where the meaning of $+$ depends on the aggregation monoid.

Monoids, Semirings, Semimodule

Semimodule
- Algebraic framework introduced by Amsterdamer et al. [2011]
- The algebraic structure combining semirings and monoids is called a semimodule
- Generalisation of a vector space. "Scalars": tuple annotations; "Vectors": aggregation values
- Semimodule expressions represent data values conditioned on tuple annotations

Semiring and semimodule expressions are random variables
- Semimodule: random variable over the aggregation domain
- Semiring expressions: ?
  - So far in probabilistic databases: Boolean random variable
  - However: $\mathbb{B}$ is in general not large enough for aggregation; a larger semiring is needed, for example the natural numbers

Aggregation Needs Semirings Larger Than $\mathbb{B}$

ProducerEU:
Item  Φ
1     $x_1$
2     $x_2$

ProducerUS:
Item  Φ
1     $y_1$

Products:
Item  Price  Φ
1     17     $z_1$
2     5      $z_2$

Query: $\mathrm{SUM}_{Price}\big((ProducerEU \cup ProducerUS) \bowtie_{Item} Products\big)$, asking for the total price of products sold by all producers

Resulting expression: $((x_1 + y_1) \cdot z_1) \otimes 17 + (x_2 \cdot z_2) \otimes 5$

Valuation $\nu: x_1, x_2, y_1, z_1, z_2 \mapsto \top$ yields $\top \otimes 17 + \top \otimes 5 = 22$. Arguably not the expected result.

The Boolean semiring is not large enough for SUM.

Better choice: semiring $\mathbb{N}$. Identify $\bot \sim 0$, $\top \sim 1$. Valuation $\nu: x_1, x_2, y_1, z_1, z_2 \mapsto 1$ yields $((1 + 1) \cdot 1) \otimes 17 + (1 \cdot 1) \otimes 5 = 2 \otimes 17 + 1 \otimes 5 = 39$.
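The difference between the two semirings can be checked with a small Python sketch that evaluates the same expression twice, once with Boolean or/and and once with integer addition and multiplication. The nested-tuple encoding of the expression and the names `TERMS`, `evaluate` and `sum_aggregate` are assumptions made for this illustration. For the SUM monoid, $n \otimes m$ is taken to be $n \cdot m$, with $\top$ read as 1 and $\bot$ as 0.

```python
import operator

# ((x1 + y1) * z1) (x) 17  +  (x2 * z2) (x) 5, as (annotation, value) pairs.
TERMS = [
    (("*", ("+", "x1", "y1"), "z1"), 17),
    (("*", "x2", "z2"), 5),
]
VARIABLES = ["x1", "x2", "y1", "z1", "z2"]


def evaluate(expr, valuation, plus, times):
    """Evaluate an annotation in the semiring given by the plus and times operations."""
    if isinstance(expr, str):
        return valuation[expr]
    op, left, right = expr
    lhs = evaluate(left, valuation, plus, times)
    rhs = evaluate(right, valuation, plus, times)
    return plus(lhs, rhs) if op == "+" else times(lhs, rhs)


def sum_aggregate(terms, valuation, plus, times):
    """SUM monoid: scalar (x) value is int(scalar) * value; the terms are added up."""
    return sum(int(evaluate(ann, valuation, plus, times)) * value
               for ann, value in terms)


# Boolean semiring: + is "or", * is "and"; every variable is set to True.
bool_result = sum_aggregate(TERMS, {v: True for v in VARIABLES},
                            operator.or_, operator.and_)

# Semiring of natural numbers: every variable is set to 1.
nat_result = sum_aggregate(TERMS, {v: 1 for v in VARIABLES},
                           operator.add, operator.mul)
```

With all variables true, the Boolean evaluation gives 22 and the evaluation over the natural numbers gives 39, matching the two valuations on the slide.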

The pvc-tables Representation System

Ingredients for pvc-tables
- A set $X$ of variable symbols
- Tuples contain constants or semimodule expressions over $X$
- Every tuple is annotated with a semiring expression over $X$

Queries
- Query $Q$ maps pvc-table database $D$ to pvc-table $Q(D)$
- Annotations are propagated via query operators
- Expressions concisely encode probability distributions of answers

Properties of pvc-tables
- Polynomial overhead (Amsterdamer et al. [2011]): $|Q(D)| \in O(\mathrm{poly}(|D|))$ (unlike pc-tables)
- Completeness: Every finite probability distribution over relations (with set or bag semantics) can be represented by pvc-tables
