1 / 30
Aggregation in Probabilistic Databases via Knowledge Compilation
Robert Fink, Larisa Han, Dan Olteanu University of Oxford VLDB 2012, Istanbul
Aggregation in Probabilistic Databases via Knowledge Compilation - - PowerPoint PPT Presentation
Aggregation in Probabilistic Databases via Knowledge Compilation Robert Fink, Larisa Han, Dan Olteanu University of Oxford VLDB 2012, Istanbul 1 / 30 Outline Motivation Algebraic Foundations Representation System Query Evaluation 2 / 30
1 / 30
Robert Fink, Larisa Han, Dan Olteanu University of Oxford VLDB 2012, Istanbul
Motivation Algebraic Foundations Representation System Query Evaluation
2 / 30
3 / 30
3 / 30
Who is responsible for a larger capacity of biogas plants, Democrats or Republicans?
3 / 30
How to come up with an answer? Option 1: Use Wikipedia, search for lists of Governors and their
they were build, match up with Governors of US states. Group by political parties of Governors, sum capacity of plants. (Phew.)
4 / 30
How to come up with an answer? Option 1: Use Wikipedia, search for lists of Governors and their
they were build, match up with Governors of US states. Group by political parties of Governors, sum capacity of plants. (Phew.) Option 2: Find tables on Governors and biomass plants on the Web and write a query like compute sum(Plant.capacity) from Governor, Plant where
group by Governor.party
4 / 30
5 / 30
6 / 30
G.Name G.Party G.State P .Location P .capacity G1 Dem CA CA 17 G2 Dem FL FL 5 G3 Dem NY NY 9 ... G4 Rep NY NY 8 G5 Rep CA CA 14 G6 Rep CA CA 2 Problem to solve: 17 + 5 + 9 > 8 + 14 + 2?
7 / 30
G.Name G.Party G.State P .Location P .capacity P1 Dem CA SF , CA 17 P2 Dem FL Florida 5 P3 Dem NY NY 9 ... P4 Rep NY NY 8 P5 Rep CA LA, CA 14 P6 Rep CA Berkeley 2
8 / 30
G.Name G.Party G.State P .Location P .capacity Prob P1 Dem CA SF , CA 17 0.9 P2 Dem FL Florida 5 0.5 P3 Dem NY NY 9 1.0 ... P4 Rep NY NY 8 1.0 P5 Rep CA LA, CA 14 0.8 P6 Rep CA Berkeley 2 0.2
8 / 30
G.Name G.Party G.State P .Location P .capacity Φ P1 Dem CA SF , CA 17 x1 (p=0.9) P2 Dem FL Florida 5 x2 (p=0.5) P3 Dem NY NY 9 x3 (p=1.0) ... P4 Rep NY NY 8 y1 (p=1.0) P5 Rep CA LA, CA 14 y2 (p=0.8) P6 Rep CA Berkeley 2 y3 (p=0.2)
8 / 30
G.Name G.Party G.State P .Location P .capacity Φ P1 Dem CA SF , CA 17 x1 (p=0.9) P2 Dem FL Florida 5 x2 (p=0.5) P3 Dem NY NY 9 x3 (p=1.0) ... P4 Rep NY NY 8 y1 (p=1.0) P5 Rep CA LA, CA 14 y2 (p=0.8) P6 Rep CA Berkeley 2 y3 (p=0.2) Problem to solve: x1 ⊗ 17 + x2 ⊗ 5 + x3 ⊗ 9 > y1 ⊗ 8 + y2 ⊗ 14 + y3 ⊗ 2 ?
8 / 30
Democratic Biogas Capacity > Republican Biogas Capacity Φ = [x1⊗17 + x2⊗5 + x3⊗9 > x4⊗8 + x5⊗14 + x6⊗2]
9 / 30
Democratic Biogas Capacity > Republican Biogas Capacity Φ = [x1⊗17 + x2⊗5 + x3⊗9 > y1⊗8 + y2⊗14 + y3⊗2] x1, x2, x3, y1, y2, y3 are Boolean random variables
9 / 30
Democratic Biogas Capacity > Republican Biogas Capacity Φ = [x1⊗17 + x2⊗5 + x3⊗9 > y1⊗8 + y2⊗14 + y3⊗2] x1, x2, x3, y1, y2, y3 are Boolean random variables Then the sum expression α = x1⊗17 + x2⊗5 + x3⊗9 is an N-valued random variable
9 / 30
Democratic Biogas Capacity > Republican Biogas Capacity Φ = [x1⊗17 + x2⊗5 + x3⊗9 > y1⊗8 + y2⊗14 + y3⊗2] x1, x2, x3, y1, y2, y3 are Boolean random variables Then the sum expression α = x1⊗17 + x2⊗5 + x3⊗9 is an N-valued random variable Hence Φ is a B-valued random variable
9 / 30
Democratic Biogas Capacity > Republican Biogas Capacity Φ = [x1⊗17 + x2⊗5 + x3⊗9 > y1⊗8 + y2⊗14 + y3⊗2] x1, x2, x3, y1, y2, y3 are Boolean random variables Then the sum expression α = x1⊗17 + x2⊗5 + x3⊗9 is an N-valued random variable Hence Φ is a B-valued random variable PΦ[⊤] is the probability that a random choice of possible values for the variables x1, x2, x3, y1, y2, y3 satisfies the inequality
9 / 30
Democratic Biogas Capacity > Republican Biogas Capacity Φ = [x1⊗17 + x2⊗5 + x3⊗9 > y1⊗8 + y2⊗14 + y3⊗2] x1, x2, x3, y1, y2, y3 are Boolean random variables Then the sum expression α = x1⊗17 + x2⊗5 + x3⊗9 is an N-valued random variable Hence Φ is a B-valued random variable PΦ[⊤] is the probability that a random choice of possible values for the variables x1, x2, x3, y1, y2, y3 satisfies the inequality In previous example, PΦ[⊤] is the probability that democrats were responsible for a higher capacity of biogas plants
9 / 30
Motivation Algebraic Foundations Representation System Query Evaluation
10 / 30
What do we mean by + in Φ1 ⊗ 17+Φ2 ⊗ 5? Well, it depends . . .
11 / 30
What do we mean by + in Φ1 ⊗ 17+Φ2 ⊗ 5? Well, it depends . . .
Aggregation modelled by commutative monoids
Carrier M, e.g. N or R Binary operation M × M → M Neutral element 0 ∈ M Examples for aggregation monoids: SUM (N, +, 0), MIN (N, min, ∞), MAX (N, max, −∞), PROD, COUNT (special case of SUM)
11 / 30
What are Φ1, Φ2 in Φ1 ⊗ 17 + Φ2 ⊗ 5?
12 / 30
What are Φ1, Φ2 in Φ1 ⊗ 17 + Φ2 ⊗ 5? Consider Query: AGGB
A Φ 1 x1 2 x2 S A Φ 1 y1 T A B Φ 1 17 z1 2 5 z2
12 / 30
What are Φ1, Φ2 in Φ1 ⊗ 17 + Φ2 ⊗ 5? Consider Query: AGGB
A Φ 1 x1 2 x2 S A Φ 1 y1 T A B Φ 1 17 z1 2 5 z2
Tuples annotations modelled by semirings (R ∪ S) ✶A T yields
(R ∪ S) ✶A T A B Φ 1 17 (x1 + y1) · z1 2 5 x2 · z2
Aggregation on top of this table yields: ((x1 + y1) · z1) ⊗ 17 + (x2 · z2) ⊗ 5 where the meaning of + depends on the aggregation monoid
12 / 30
Semimodule
Algebraic framework introduced by Amsterdamer et al. [2011] The algebraic structure combining semirings and monoids is called semimodule Generalisation of vector space. “Scalars”: tuple annotations, “Vectors”: aggregation values Semimodule expressions represent data values conditioned on tuple annotations
Semiring and semimodule expressions are random variables
Semimodule: Random variable over aggregation domain Semiring expressions: ?
◮ So far in probabilistic databases:
Boolean random variable
◮ However: B is in general not large enough for aggregation; need
larger semiring, for example natural numbers
13 / 30
ProducerEU Item Φ 1 x1 2 x2 ProducerUS Item Φ 1 y1 Products Item Price Φ 1 17 z1 2 5 z2
Query: SUMPrice
Resulting expression: ((x1 + y1) · z1) ⊗ 17 + (x2 · z2) ⊗ 5 Valuation ν : x1, x2, y1, z1, z2 → ⊤ yields ⊤ ⊗ 17 + ⊤ ⊗ 5 = 22 Arguably not the expected result
14 / 30
ProducerEU Item Φ 1 x1 2 x2 ProducerUS Item Φ 1 y1 Products Item Price Φ 1 17 z1 2 5 z2
Query: SUMPrice
Resulting expression: ((x1 + y1) · z1) ⊗ 17 + (x2 · z2) ⊗ 5 Valuation ν : x1, x2, y1, z1, z2 → ⊤ yields ⊤ ⊗ 17 + ⊤ ⊗ 5 = 22 Arguably not the expected result Boolean semiring is not large enough for SUM Better choice: Semiring N. Identify ⊥ ∼ 0, ⊤ ∼ 1. Valuation ν : x1, x2, y1, z1, z2 → 1 yields ((1 + 1) · 1) ⊗ 17 + (1 · 1) ⊗ 5 = 2 ⊗ 17 + 1 ⊗ 5 = 39.
14 / 30
Motivation Algebraic Foundations Representation System Query Evaluation
15 / 30
Ingredients for pvc-tables
A set X of variable symbols Tuples contain constants or semimodule expressions over X Every tuple is annotated with a semiring expression over X
Queries
Query Q maps pvc-table database D to pvc-table Q(D) Annotations are propagated via query operators Expressions concisely encode probability distributions of answers
Properties of pvc-tables
Polynomial overhead (Amsterdamer et al. [2011]): |Q(D)| ∈ O
Completeness: Every finite probability distribution over relations (with set or bag semantics) can be represented by pvc-tables
16 / 30
Semantics: Set vs Bag & Deterministic vs Probabilistic
Different choices for the semiring and the probability distributions of the annotation variables give rise to different database semantics. Database Semantics Semiring Probability Distributions Deterministic Set B Px[⊤] = 1 or Px[⊥] = 1 Deterministic Bag N ∃n ∈ N : Px[n] = 1 Probabilistic Set B Px[⊤], Px[⊥] ∈ [0, 1] Probabilistic Bag N ∀n ∈ N : Px[n] ∈ [0, 1]
17 / 30
Motivation Algebraic Foundations Representation System Query Evaluation
18 / 30
Step 1: Construction of Expressions
Alongside (standard) query evaluation, compute annotations. Project, Union, Cartesian Product: Construction of semiring expressions (· for joint, and + for alternative use of data) Aggregation (with grouping): Construct semimodule expressions (
AGG Φ ⊗ v) R A B Φ a 1 x1 a 2 x2 b 3 x3 b 4 x4
select AGG(B) from R group by A
− − − − − − − − − − − − − − − − − − →
pvc-table A AGG(B) Φ a x1 ⊗ 1 + x2 ⊗ 2 [x1 + x2 = 0] b x3 ⊗ 3 + x4 ⊗ 4 [x3 + x4 = 0]
19 / 30
Step 2: Probability Computation
Problem: Given a tuple, compute its probability distribution. Idea: Tuple probability is equivalent to joint probability distribution of its semimodule expressions and annotation expression as obtained from evaluation step 1. Approach: Compile expressions into a tractable form consisting of independent and mutually exclusive sub-expressions.
20 / 30
Consider semiring expression Φ = x + y. If x, y are independent random variables over N, then the probability distribution of Φ is given by the convolution of x and y. If x, y are in N: Px+y[n] =
i+j=n
Px[i]Py[j]
21 / 30
Consider semiring expression Φ = x + y. If x, y are independent random variables over N, then the probability distribution of Φ is given by the convolution of x and y. If x, y are in N: Px+y[n] =
i+j=n
Px[i]Py[j] If x, y are Boolean: Px+y[⊥] =
a∨b=⊥
Px[a]Py[b] Px+y[⊤] =
a∨b=⊤
Px[a]Py[b]
21 / 30
Consider semiring expression Φ = x + y. If x, y are independent random variables over N, then the probability distribution of Φ is given by the convolution of x and y. If x, y are in N: Px+y[n] =
i+j=n
Px[i]Py[j] If x, y are Boolean: Px+y[⊥] =
a∨b=⊥
Px[a]Py[b] = Px[⊥]Py[⊥] Px+y[⊤] =
a∨b=⊤
Px[a]Py[b] = Px[⊤]Py[⊤] + Px[⊥]Py[⊤] + Px[⊤]Py[⊥] = 1 − Px[⊥]Py[⊥]
21 / 30
The applicability of convolution is not limited to “sums”; convolution is equally well defined for other binary operations:
Convolution for algebraic operations
Semiring expressions: Φ · Ψ, Φ + Ψ Semimodule expressions: α + β Mixed semiring and semimodule expressions: Φ ⊗ α Convolution is also applicable to comparisons of expressions, such as α ≤ β
22 / 30
What if there are no independent sub-expressions? Example: α = a(b + c) ⊗ 10 + c ⊗ 20 Idea: Instantiate one of the variables to create mutually exclusive sub-expressions. P(α) = Pc[1] · P
Pc[2] · P
Pc[3] · P
· · · Need to consider all possible values of c with non-zero probability. In particular: For Boolean variables, the above construction yields Shannon’s expansion.
23 / 30
Decomposition gives rise to a tree whose nodes explain the decomposition steps taken. For example, for mutex decomposition, ⊕ for convolution w.r.t. +, ⊗ for convolution w.r.t. ⊗, etc. Example: α = a(b + c) ⊗ 10 + c ⊗ 20
⊕ ⊗ a ⊗ ⊕ b 1 1 ⊗ 10 1 ⊗ 20 c ← 1 ⊕ ⊗ a ⊗ ⊕ b 2 1 ⊗ 10 2 ⊗ 20 c ← 2
24 / 30
The probability distribution Pd of a d-tree d whose nodes have probability distributions p1, . . . , pn can be computed in time O( |pi|).
Specific polynomial time cases
For MIN and MAX monoids combined with any semiring For SUM monoid: If monoid values and size of probability distributions of semiring expressions are bounded by constants
◮ This subsumes COUNT aggregation 25 / 30
Approximate probability computation by partial expansion of d-tree (Olteanu et al. [2010], Fink et al. [2011]) Sensitivity analysis and explanation of query results (Kanagal et al. [2011]) Conditioning probabilistic databases (Koch and Olteanu [2008])
26 / 30
Tractability for query evaluation on probabilistic databases is considered with respect to data complexity: For which class of queries can probability distributions of query answers be computed in polynomial-time data complexity for any tuple-independent database?
27 / 30
Tractability for query evaluation on probabilistic databases is considered with respect to data complexity: For which class of queries can probability distributions of query answers be computed in polynomial-time data complexity for any tuple-independent database? Syntactic characterisation of tractable queries with aggregates
◮ There are known classes of tractable non-aggregate queries with
polynomial-time d-tree compilation, e.g. hierarchical queries
◮ Extend these classes by adding nested aggregation without
breaking the tractable (e.g. hierarchical) property
27 / 30
Example 1
select R.A from R where R.B =
where S.C = R.C
select S.B from S where S.C = R.C
28 / 30
Example 2
select 1 where
where S.A=T.A
select 1 where (select R.A from R) select 1 from S,T where S.A=T.A select 1 where (select R.A from R) <= (select 1 from S,T where S.A=T.A)
28 / 30
29 / 30
Yael Amsterdamer, Daniel Deutch, and Val Tannen. Provenance for Aggregate
Robert Fink, Dan Olteanu, and Swaroop Rath. Providing support for full relational algebra in probabilistic databases. In ICDE, 2011. Bhargav Kanagal, Jian Li, and Amol Deshpande. Sensitivity analysis and explanations for robust query evaluation in probabilistic databases. In SIGMOD, 2011. Christoph Koch and Dan Olteanu. Conditioning probabilistic databases. PVLDB, 1(1), 2008.
conditional tables. JIIS, 19(3), 2002. Dan Olteanu, Jiewen Huang, and Christoph Koch. Approximate confidence computation in probabilistic databases. In ICDE, 2010.
30 / 30