Exact Inference: Variable Elimination
Probabilistic Graphical Models
Sharif University of Technology, Spring 2018
Soleymani
Probabilistic Inference and Learning

We now have compact representations of probability distributions (graphical models).
A graphical model M describes a unique probability distribution P.
Typical tasks:
Task 1: How do we answer queries about P_M, e.g., P_M(X|Y)?
  We use "inference" as a name for the process of computing answers to such queries.
Task 2: How do we estimate a plausible model M from data D?
  i. We use "learning" as a name for the process of obtaining a point estimate of M.
  ii. For the Bayesian approach, however, we seek p(M|D), which is actually an inference problem.
  iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.
This slide has been adopted from Eric Xing, PGM 10-708, CMU.
Why we need inference

If we know the graphical model, we use inference to find marginal or conditional distributions efficiently.
We also need inference during learning, when we try to find a model from incomplete data or when the learning approach is Bayesian (as we will see in the next lectures).
Inference query

Notation: nodes 𝒳 = {X₁, …, Xₙ}; 𝒆 denotes evidence on a set of variables 𝑬; 𝑿 = 𝒳 − 𝑬. We may query a subset 𝒀 of the non-evidence variables 𝑿 = {𝒀, 𝒁} and "don't care" about the remaining 𝒁.

Likelihood (probability of evidence):
  P(𝒆) = Σ_𝑿 P(𝑿, 𝒆)
Marginal probability distribution:
  P(𝑿) = Σ_{𝒳−𝑿} P(𝒳)
Conditional probability distribution (a posteriori belief):
  P(𝑿|𝒆) = P(𝑿, 𝒆) / Σ_𝑿 P(𝑿, 𝒆)
Marginalized conditional probability distribution (𝑿 = 𝒀 ∪ 𝒁):
  P(𝒀|𝒆) = Σ_𝒁 P(𝒀, 𝒁, 𝒆) / Σ_𝒀 Σ_𝒁 P(𝒀, 𝒁, 𝒆)
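As a concrete illustration, here is a minimal brute-force sketch of these queries on a made-up three-variable joint table (all numbers and names are hypothetical); variable elimination will later compute the same quantities without materializing the full joint.

    import numpy as np

    # Hypothetical joint P(X1, X2, X3) over binary variables, as a 3-d table
    P = np.array([[[0.04, 0.06], [0.10, 0.05]],
                  [[0.15, 0.10], [0.20, 0.30]]])

    # Likelihood of the evidence X3 = 1: P(e) = sum over all non-evidence vars
    likelihood = P[:, :, 1].sum()

    # Marginal P(X1): sum out all other variables
    marginal_x1 = P.sum(axis=(1, 2))

    # Conditional P(X1 | X3 = 1) = P(X1, e) / sum_X1 P(X1, e)
    joint_x1_e = P[:, :, 1].sum(axis=1)
    conditional = joint_x1_e / joint_x1_e.sum()

    print(likelihood, marginal_x1, conditional)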
Most Probable Assignment (MPA)

Most probable assignment for some variables of interest, given the evidence 𝑬 = 𝒆:
  𝒀* = argmax_𝒀 P(𝒀|𝒆)
This is the maximum a posteriori (MAP) configuration of 𝒀.
Applications of MPA:
  Classification: find the most likely label, given the evidence.
  Explanation: what is the most likely scenario, given the evidence?
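Continuing the enumeration sketch above (same hypothetical table P), the MPA of (X1, X2) given X3 = 1 is a single argmax over the evidence-reduced table:

    # MPA of (X1, X2) given X3 = 1, by enumeration over the reduced table
    reduced = P[:, :, 1]                          # entries P(X1, X2, X3=1)
    x1, x2 = np.unravel_index(reduced.argmax(), reduced.shape)
    print(f"argmax P(X1, X2 | X3=1) = (X1={x1}, X2={x2})")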
MPA: Example
This slide has been adopted from Eric Xing, PGM 10-708, CMU.
Marginal probability: Enumeration

P(𝒀|𝒆) ∝ P(𝒀, 𝒆), where P(𝒀, 𝒆) = Σ_𝒁 P(𝒀, 𝒁, 𝒆).
Computing marginal probabilities requires exponential computation in general:
  It is a #P-complete problem (enumeration is intractable).
  Even for graphs of polynomial size, the computation can be exponential.
  We cannot find a general procedure that works efficiently for arbitrary GMs.
Hardness of Inference

Hardness does not mean we cannot solve inference.
It implies that we cannot find a general procedure that works efficiently for arbitrary GMs.
For particular families of GMs, we can have provably efficient procedures:
  For special graph structures, provably efficient algorithms (avoiding exponential cost) are available.
Exact inference

Exact inference algorithms:
  Variable elimination algorithm: general graph, one query.
  Belief propagation (sum-product on factor graphs): tree, marginal probabilities on all nodes.
  Junction tree algorithm: general graph, marginal probabilities on all clique nodes.
Inference on a chain

Chain: A → B → C → D
P(d) = Σ_a Σ_b Σ_c P(a, b, c, d)
     = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|b) P(d|c)
A naïve summation needs to enumerate an exponential number of terms.
Inference on a chain: marginalization and elimination

P(d) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|b) P(d|c)
     = Σ_c P(d|c) Σ_b P(c|b) Σ_a P(a) P(b|a)
Eliminating A yields P(b); eliminating B then yields P(c); eliminating C yields P(d).
In a chain of n nodes, each having k values: O(nk²) instead of O(kⁿ).
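A minimal sketch of this computation for the chain A → B → C → D with k states per node: each elimination is one vector-matrix product (k² work), so the whole marginal costs O(nk²). The CPT values below are random and purely illustrative.

    import numpy as np

    k = 3
    rng = np.random.default_rng(0)

    # Hypothetical chain CPTs: p_a[i] = P(A=i), t_ba[i, j] = P(B=j | A=i), etc.
    def random_cpd(shape):
        t = rng.random(shape)
        return t / t.sum(axis=-1, keepdims=True)

    p_a  = random_cpd((k,))
    t_ba = random_cpd((k, k))
    t_cb = random_cpd((k, k))
    t_dc = random_cpd((k, k))

    # Eliminate A, then B, then C: each step is a k x k vector-matrix product
    p_b = p_a @ t_ba        # P(b) = sum_a P(a) P(b|a)
    p_c = p_b @ t_cb        # P(c) = sum_b P(b) P(c|b)
    p_d = p_c @ t_dc        # P(d) = sum_c P(c) P(d|c)
    print(p_d, p_d.sum())   # a valid distribution: sums to 1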
Inference on a chain

In both directed and undirected graphical models, the joint probability is a factored expression over subsets of the variables. For an undirected chain over X₁, …, Xₙ:
  P(𝒙) = (1/Z) 𝜙₁,₂(x₁, x₂) 𝜙₂,₃(x₂, x₃) ⋯ 𝜙ₙ₋₁,ₙ(xₙ₋₁, xₙ)
  P(xᵢ) = (1/Z) Σ_{x₁} ⋯ Σ_{xᵢ₋₁} Σ_{xᵢ₊₁} ⋯ Σ_{xₙ} 𝜙₁,₂(x₁, x₂) ⋯ 𝜙ₙ₋₁,ₙ(xₙ₋₁, xₙ)
Pushing each sum inside as far as it will go:
  P(xᵢ) ∝ [Σ_{xᵢ₋₁} 𝜙(xᵢ₋₁, xᵢ) Σ_{xᵢ₋₂} 𝜙(xᵢ₋₂, xᵢ₋₁) ⋯ Σ_{x₁} 𝜙(x₁, x₂)] × [Σ_{xᵢ₊₁} 𝜙(xᵢ, xᵢ₊₁) Σ_{xᵢ₊₂} 𝜙(xᵢ₊₁, xᵢ₊₂) ⋯ Σ_{xₙ} 𝜙(xₙ₋₁, xₙ)]
Each elimination step takes O(|Val(Xᵢ)| × |Val(Xᵢ₊₁)|) operations.
[Figure: an undirected chain over X₁, X₂, …, Xₙ₋₁, Xₙ.]
Inference on a chain: improvement reasons

Computing an expression of the form (sum-product inference):
  Σ_𝒁 ∏_{𝜙∈𝚽} 𝜙        (𝚽: the set of factors)
We used the structure of the BN to factorize the joint distribution, so the scopes of the resulting factors are limited.
Distributive law: if X ∉ Scope(𝜙₁), then Σ_X 𝜙₁ · 𝜙₂ = 𝜙₁ · Σ_X 𝜙₂.
  Each summation is thus performed over the product of only a subset of the factors.
We find sub-expressions that can be computed once, saved, and reused in later computations, instead of being computed exponentially many times.
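A quick numeric check of the distributive law on two hypothetical factors 𝜙₁(b) and 𝜙₂(a, b) (names and sizes made up): summing over a before or after the product gives the same result because a ∉ Scope(𝜙₁).

    import numpy as np

    rng = np.random.default_rng(1)
    phi1 = rng.random(4)          # phi1(b): no dependence on a
    phi2 = rng.random((3, 4))     # phi2(a, b)

    lhs = (phi1[None, :] * phi2).sum(axis=0)   # sum_a phi1(b) * phi2(a, b)
    rhs = phi1 * phi2.sum(axis=0)              # phi1(b) * sum_a phi2(a, b)
    assert np.allclose(lhs, rhs)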
Variable elimination algorithm for sum-product inference

Sum out each variable, one at a time:
  All factors containing that variable are removed from the set of factors and multiplied to generate a product factor.
  The variable is summed out from the generated product factor, giving a new factor.
  The new factor is added to the set of available factors.
The resulting factors do not necessarily correspond to any probability or conditional probability in the network.
Procedure Sum-Product-VE(G, 𝒁)
  // 𝒁: the variables to be eliminated
  𝚽 ← all factors of G
  Select an elimination order Z₁, …, Z_k for 𝒁
  for i = 1, …, k:
    𝚽 ← Sum-Product-Elim-Var(𝚽, Zᵢ)
  𝜙* ← ∏_{𝜙∈𝚽} 𝜙
  Return 𝜙*

Procedure Sum-Product-Elim-Var(𝚽, Z)
  𝚽′ ← {𝜙 ∈ 𝚽 : Z ∈ Scope(𝜙)}     // move factors irrelevant to Z outside the summation
  𝚽′′ ← 𝚽 − 𝚽′
  τ ← Σ_Z ∏_{𝜙∈𝚽′} 𝜙               // perform the sum, getting a new term
  return 𝚽′′ ∪ {τ}                  // insert the new term into the product

Note: for a directed graph with no evidence, the result needs no normalization.
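The following is a minimal Python sketch of these two procedures (not taken from any library); a factor is stored as an ordered variable list plus a numpy table, and the names Factor, multiply, sum_product_elim_var, and sum_product_ve are my own.

    import numpy as np
    from functools import reduce

    class Factor:
        """A factor: an ordered list of variable names plus a value table
        whose i-th axis corresponds to the i-th variable in the scope."""
        def __init__(self, scope, table):
            self.scope, self.table = list(scope), np.asarray(table)

    def multiply(f1, f2):
        """Product factor over the union of the two scopes (via broadcasting)."""
        scope = f1.scope + [v for v in f2.scope if v not in f1.scope]
        def expand(f):
            axes = [f.scope.index(v) for v in scope if v in f.scope]
            shape = [f.table.shape[f.scope.index(v)] if v in f.scope else 1
                     for v in scope]
            return np.transpose(f.table, axes).reshape(shape)
        return Factor(scope, expand(f1) * expand(f2))

    def sum_product_elim_var(factors, var):
        """Multiply all factors containing var, sum var out, return the
        updated factor set (assumes var occurs in at least one factor)."""
        related = [f for f in factors if var in f.scope]
        rest = [f for f in factors if var not in f.scope]
        tau = reduce(multiply, related)
        tau = Factor([v for v in tau.scope if v != var],
                     tau.table.sum(axis=tau.scope.index(var)))
        return rest + [tau]

    def sum_product_ve(factors, elim_order):
        """Eliminate the variables one at a time, then multiply what is left."""
        for var in elim_order:
            factors = sum_product_elim_var(factors, var)
        return reduce(multiply, factors)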
Procedure Cond-Prob-VE(
  G,        // the network over 𝒳
  𝒀,        // set of query variables
  𝑬 = 𝒆     // evidence
)
  𝚽 ← the factors parameterizing G
  Replace each 𝜙 ∈ 𝚽 by the reduced factor 𝜙[𝑬 = 𝒆]
  Select an elimination order Z₁, …, Z_k for 𝒁 = 𝒳 − 𝒀 − 𝑬
  for i = 1, …, k:
    𝚽 ← Sum-Product-Elim-Var(𝚽, Zᵢ)
  𝜙* ← ∏_{𝜙∈𝚽} 𝜙
  α ← Σ_{𝒚∈Val(𝒀)} 𝜙*(𝒚)
  Return α, 𝜙*
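A matching sketch of Cond-Prob-VE on top of the code above: reduce every factor by the evidence, eliminate the non-query variables, then renormalize (again, the helper names are my own).

    def restrict(f, var, value):
        """Reduce a factor by the evidence var = value (drop that axis)."""
        if var not in f.scope:
            return f
        axis = f.scope.index(var)
        return Factor([v for v in f.scope if v != var],
                      np.take(f.table, value, axis=axis))

    def cond_prob_ve(factors, elim_order, evidence):
        """P(query | evidence): restrict, eliminate, then renormalize."""
        for var, value in evidence.items():
            factors = [restrict(f, var, value) for f in factors]
        phi = sum_product_ve(factors, elim_order)   # unnormalized P(query, e)
        alpha = phi.table.sum()                     # = P(e), the likelihood
        return Factor(phi.scope, phi.table / alpha)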
Directed example

Query: P(X₂ | X₇ = x₇)
P(X₂ | x₇) ∝ P(X₂, x₇)
P(x₂, x₇) = Σ_{x₁} Σ_{x₃} Σ_{x₄} Σ_{x₅} Σ_{x₆} Σ_{x₈} P(x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈)
Consider the elimination order X₁, X₃, X₄, X₅, X₆, X₈:
P(x₂, x₇) = Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} Σ_{x₃} Σ_{x₁} P(x₁) P(x₂) P(x₃|x₁,x₂) P(x₄|x₃) P(x₅|x₂) P(x₆|x₃,x₇) P(x₇|x₄,x₅) P(x₈|x₇)
[Figure: the Bayesian network over X₁, …, X₈.]
P(x₂, x₇)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} Σ_{x₃} P(x₂) P(x₄|x₃) P(x₅|x₂) P(x₆|x₃,x₇) P(x₇|x₄,x₅) P(x₈|x₇) Σ_{x₁} P(x₁) P(x₃|x₁,x₂)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} Σ_{x₃} P(x₂) P(x₄|x₃) P(x₅|x₂) P(x₆|x₃,x₇) P(x₇|x₄,x₅) P(x₈|x₇) m₁(x₂, x₃)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} P(x₂) P(x₅|x₂) P(x₇|x₄,x₅) P(x₈|x₇) Σ_{x₃} P(x₄|x₃) P(x₆|x₃,x₇) m₁(x₂, x₃)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} P(x₂) P(x₅|x₂) P(x₇|x₄,x₅) P(x₈|x₇) m₃(x₂, x₄, x₆)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} P(x₂) P(x₅|x₂) P(x₈|x₇) Σ_{x₄} P(x₇|x₄,x₅) m₃(x₂, x₄, x₆)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} P(x₂) P(x₅|x₂) P(x₈|x₇) m₄(x₂, x₅, x₆)
= Σ_{x₈} Σ_{x₆} P(x₂) P(x₈|x₇) Σ_{x₅} P(x₅|x₂) m₄(x₂, x₅, x₆)
= Σ_{x₈} Σ_{x₆} P(x₂) P(x₈|x₇) m₅(x₂, x₆)
= Σ_{x₈} P(x₂) P(x₈|x₇) Σ_{x₆} m₅(x₂, x₆)
= Σ_{x₈} P(x₂) P(x₈|x₇) m₆(x₂)
= m₈(x₂) m₆(x₂),   where m₈(x₂) = P(x₂) Σ_{x₈} P(x₈|x₇)
Conditional probability

P(x₂ | x₇) = m₈(x₂) m₆(x₂) / Σ_{x₂} m₈(x₂) m₆(x₂)
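Running the sketch above on this network requires concrete CPTs; the ones below are random and purely hypothetical, but the call mirrors the derivation: eliminate X1, X3, X4, X5, X6, X8 with evidence on X7 and read off P(X2 | x7). This reuses the Factor machinery and cond_prob_ve from the earlier sketches.

    rng = np.random.default_rng(0)

    def random_cpt(scope):
        """Hypothetical CPT P(last variable in scope | the others)."""
        t = rng.random([2] * len(scope))
        return Factor(scope, t / t.sum(axis=-1, keepdims=True))

    factors = [random_cpt(s) for s in
               (["X1"], ["X2"], ["X1", "X2", "X3"], ["X3", "X4"],
                ["X2", "X5"], ["X3", "X7", "X6"], ["X4", "X5", "X7"],
                ["X7", "X8"])]
    posterior = cond_prob_ve(factors, ["X1", "X3", "X4", "X5", "X6", "X8"],
                             evidence={"X7": 1})
    print(posterior.scope, posterior.table)   # P(X2 | X7 = 1)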
Undirected example

Query: P(X₂ | X₇ = x₇)
P(X₂ | x₇) ∝ P(X₂, x₇)
P(x₂, x₇) = Σ_{x₁} Σ_{x₃} Σ_{x₄} Σ_{x₅} Σ_{x₆} Σ_{x₈} P(x₁, x₂, x₃, x₄, x₅, x₆, x₇, x₈)
Consider the elimination order X₁, X₃, X₄, X₅, X₆, X₈:
P(x₂, x₇) ∝ Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} Σ_{x₃} Σ_{x₁} 𝜙(x₁,x₂,x₃) 𝜙(x₃,x₄) 𝜙(x₂,x₅) 𝜙(x₃,x₆,x₇) 𝜙(x₄,x₅,x₇) 𝜙(x₇,x₈)
[Figure: the Markov network over X₁, …, X₈.]
P(x₂, x₇)
∝ Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} Σ_{x₃} 𝜙(x₃,x₄) 𝜙(x₂,x₅) 𝜙(x₃,x₆,x₇) 𝜙(x₄,x₅,x₇) 𝜙(x₇,x₈) Σ_{x₁} 𝜙(x₁,x₂,x₃)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} Σ_{x₃} 𝜙(x₃,x₄) 𝜙(x₂,x₅) 𝜙(x₃,x₆,x₇) 𝜙(x₄,x₅,x₇) 𝜙(x₇,x₈) m₁(x₂, x₃)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} 𝜙(x₂,x₅) 𝜙(x₄,x₅,x₇) 𝜙(x₇,x₈) Σ_{x₃} 𝜙(x₃,x₄) 𝜙(x₃,x₆,x₇) m₁(x₂, x₃)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} Σ_{x₄} 𝜙(x₂,x₅) 𝜙(x₄,x₅,x₇) 𝜙(x₇,x₈) m₃(x₂, x₄, x₆)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} 𝜙(x₂,x₅) 𝜙(x₇,x₈) Σ_{x₄} 𝜙(x₄,x₅,x₇) m₃(x₂, x₄, x₆)
= Σ_{x₈} Σ_{x₆} Σ_{x₅} 𝜙(x₂,x₅) 𝜙(x₇,x₈) m₄(x₂, x₅, x₆)
= Σ_{x₈} Σ_{x₆} 𝜙(x₇,x₈) Σ_{x₅} 𝜙(x₂,x₅) m₄(x₂, x₅, x₆)
= Σ_{x₈} Σ_{x₆} 𝜙(x₇,x₈) m₅(x₂, x₆)
= Σ_{x₈} 𝜙(x₇,x₈) Σ_{x₆} m₅(x₂, x₆)
= Σ_{x₈} 𝜙(x₇,x₈) m₆(x₂)
= m₈ m₆(x₂),   where m₈ = Σ_{x₈} 𝜙(x₇,x₈) is a constant for the fixed x₇
[Figure: the Markov network over X₁, …, X₈.]
Complexity of variable elimination algorithm

In each elimination step, the following computations are required:
  𝜓(x, x₁, …, x_k) = ∏_{i=1..N} 𝜙ᵢ(x, 𝒙_{cᵢ})
  τ(x₁, …, x_k) = Σ_x 𝜓(x, x₁, …, x_k)
where N factors contain the variable X being eliminated and each 𝒙_{cᵢ} ⊆ {x₁, …, x_k}. We need:
  (N − 1) × |Val(X)| × ∏_{i=1..k} |Val(Xᵢ)| multiplications:
    for each tuple (x, x₁, …, x_k), we need N − 1 multiplications.
  |Val(X)| × ∏_{i=1..k} |Val(Xᵢ)| additions:
    for each tuple (x₁, …, x_k), we need |Val(X)| additions.
Complexity is exponential in the number of variables in the intermediate factor.
The size of the created factors is the dominant quantity in the complexity of VE.
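For instance, with binary variables, eliminating a variable X that appears in N = 4 factors whose combined scope adds k = 3 other variables takes (4 − 1) × 2 × 2³ = 48 multiplications and 2 × 2³ = 16 additions, and the intermediate factor 𝜓 has 2⁴ = 16 entries; it is this table size that blows up under a bad elimination ordering.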
Graph elimination

Graph elimination gives a simple unified treatment of inference algorithms in both directed and undirected models:
  Convert directed models to undirected ones (moralization).
Graph-theoretic property: the factors created during variable elimination are captured by recording the elimination cliques.
The computational complexity of the Eliminate algorithm can thus be reduced to purely graph-theoretic considerations.
Graph elimination

Begin with the undirected GM or moralized BN.
Choose an elimination ordering (query nodes should come last).
Eliminate a node from the graph and add edges (called fill edges) between all pairs of its neighbors.
Iterate until all non-query nodes are eliminated.
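A small Python sketch of this procedure (the helper names are my own; the adjacency structure is a dict of neighbor sets), recording the elimination clique created at each step; the usage example below is the moralized graph of the running 8-node example.

    def graph_eliminate(adj, order):
        """Eliminate nodes in `order`, adding fill edges between the
        neighbors of each eliminated node; return the elimination cliques."""
        adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
        cliques = []
        for v in order:
            nbrs = adj[v]
            cliques.append({v} | nbrs)                # elimination clique
            for u in nbrs:                            # connect remaining
                adj[u] |= nbrs - {u}                  # neighbors: fill edges
                adj[u].discard(v)
            del adj[v]
        return cliques

    # Moralized graph of the 8-node example network
    edges = [("X1","X2"), ("X1","X3"), ("X2","X3"), ("X3","X4"), ("X2","X5"),
             ("X3","X6"), ("X3","X7"), ("X6","X7"), ("X4","X5"), ("X4","X7"),
             ("X5","X7"), ("X7","X8")]
    adj = {f"X{i}": set() for i in range(1, 9)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    for clique in graph_eliminate(adj, ["X1","X3","X4","X5","X6","X8"]):
        print(sorted(clique))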
Graph elimination

[Figure: starting from the moralized graph over X₁, …, X₈, the nodes X₁, X₃, X₄, X₅, X₆, X₈ are eliminated in turn; each step removes a node from the graph and connects its remaining neighbors, with the added fill edges highlighted.]
Summation ⇔ elimination
Intermediate term ⇔ elimination clique
Graph elimination: elimination cliques

The dependencies induced during marginalization are captured by the elimination cliques.
There is a correspondence between maximal cliques in the induced graph and maximal factors generated in the VE algorithm:
  The complexity depends on the number of variables in the largest elimination clique.
The size of the maximal elimination clique in the induced graph depends on the elimination ordering.
Elimination order: example

Ordering ≺: X₁, X₃, X₄, X₅, X₆        Ordering ≺: X₄, X₃, X₅, X₆, X₁
[Figure: the induced graphs over X₁, …, X₆ under the two orderings.]
Elimination order

Finding the best elimination ordering is NP-hard:
  It is equivalent to finding the treewidth of the graph, which is NP-hard.
  Treewidth: one less than the smallest achievable size of the largest elimination clique, ranging over all possible elimination orderings.
Good elimination orderings lead to small cliques and thus reduce complexity.
What is the optimal order for trees?
Heuristics for finding an ordering

How can we find an ordering that induces a "small" graph?
Some heuristics to select the next node for elimination (sketched below):
  Min-neighbors: the cost of a vertex is the number of neighbors it has in the current graph.
  Min-fill: the cost of a vertex is the number of edges that need to be added to the graph due to its elimination.
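These heuristics are easy to state as a greedy loop over the graph-elimination sketch above (hypothetical helper names; `adj` is the adjacency dict built in the earlier example):

    def greedy_order(adj, cost):
        """Greedily pick the next node to eliminate by a heuristic cost."""
        adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
        order = []
        while adj:
            v = min(adj, key=lambda u: cost(u, adj))
            order.append(v)
            nbrs = adj[v]
            for u in nbrs:                # eliminate v, adding fill edges
                adj[u] |= nbrs - {u}
                adj[u].discard(v)
            del adj[v]
        return order

    def min_neighbors(u, adj):
        return len(adj[u])

    def min_fill(u, adj):
        nbrs = list(adj[u])
        # count the missing edges among the neighbors of u
        return sum(1 for i, a in enumerate(nbrs)
                   for b in nbrs[i+1:] if b not in adj[a])

    print(greedy_order(adj, min_fill))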
Elimination algorithm: summary

The elimination algorithm computes the marginal probability for one query.
It uses the factorization properties and the distributive law to compute marginal probabilities more efficiently:
  reorder computations;
  save intermediate terms.
The elimination order affects the computational complexity; however, finding the best order is in general NP-hard.