 
Exact Inference: Variable Elimination
Probabilistic Graphical Models
Sharif University of Technology, Spring 2018
Soleymani

Probabilistic Inference and Learning
• We now have compact representations of probability distributions: graphical models (GMs).
• A GM $M$ describes a unique probability distribution $P$.
• Typical tasks:
  • Task 1: How do we answer queries about $P_M$, e.g., $P_M(X \mid Y)$?
    • We use "inference" as the name for the process of computing answers to such queries.
  • Task 2: How do we estimate a plausible model $M$ from data $D$?
    i. We use "learning" as the name for the process of obtaining a point estimate of $M$.
    ii. Bayesians instead seek $p(M \mid D)$, which is itself an inference problem.
    iii. When not all variables are observable, even computing a point estimate of $M$ requires inference to impute the missing data.
This slide has been adapted from Eric Xing, PGM 10708, CMU.

Why we need inference
• If we know the graphical model, we use inference to find marginal or conditional distributions efficiently.
• We also need inference during learning, when we fit a model to incomplete data or when the learning approach is Bayesian (as we will see in the next lectures).

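Inference query
Nodes: $\mathcal{X} = \{X_1, \dots, X_n\}$
$\boldsymbol{e}$: evidence on a set of variables $\boldsymbol{E}$
$\boldsymbol{X} = \mathcal{X} - \boldsymbol{E}$, $\boldsymbol{Y} = \boldsymbol{X} - \boldsymbol{Z}$
• Likelihood (probability of evidence):
$$P(\boldsymbol{e}) = \sum_{\boldsymbol{X}} P(\boldsymbol{X}, \boldsymbol{e})$$
• Marginal probability distribution:
$$P(\boldsymbol{X}) = \sum_{\mathcal{X} - \boldsymbol{X}} P(\mathcal{X})$$
• Conditional probability distribution (a posteriori belief):
$$P(\boldsymbol{X} \mid \boldsymbol{e}) = \frac{P(\boldsymbol{X}, \boldsymbol{e})}{\sum_{\boldsymbol{X}} P(\boldsymbol{X}, \boldsymbol{e})}$$
• Marginalized conditional probability distribution, with $\boldsymbol{X} = \boldsymbol{Y} \cup \boldsymbol{Z}$:
$$P(\boldsymbol{Y} \mid \boldsymbol{e}) = \frac{\sum_{\boldsymbol{Z}} P(\boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{e})}{\sum_{\boldsymbol{Y}} \sum_{\boldsymbol{Z}} P(\boldsymbol{Y}, \boldsymbol{Z}, \boldsymbol{e})}$$
  Here we query a subset Y of all domain variables X = {Y, Z} and "don't care" about the remaining Z.
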
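As a minimal sketch of these query types, the snippet below evaluates them by direct summation over a small joint table. The table, its numbers, and the function names are illustrative assumptions, not part of the lecture; the joint is over three binary variables ordered (Y, Z, E).

```python
from itertools import product

# Hypothetical joint P(Y, Z, E) over three binary variables.
# The weights are illustrative; dividing by their sum (36) normalizes them.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
joint = {
    assignment: w / 36
    for assignment, w in zip(product([0, 1], repeat=3), weights)
}

def likelihood(e):
    """P(e): sum the joint over every variable except the evidence."""
    return sum(p for (y, z, ev), p in joint.items() if ev == e)

def conditional(e):
    """P(Y, Z | e) = P(Y, Z, e) / P(e)."""
    pe = likelihood(e)
    return {(y, z): p / pe for (y, z, ev), p in joint.items() if ev == e}

def marginalized_conditional(e):
    """P(Y | e): additionally sum out the "don't care" variable Z."""
    cond = conditional(e)
    return {
        y: sum(p for (yy, z), p in cond.items() if yy == y)
        for y in (0, 1)
    }

p_e1 = likelihood(1)
p_y_given_e1 = marginalized_conditional(1)
```

Every function here touches the full joint table, which is exactly the cost that the rest of the lecture tries to avoid.
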
Most Probable Assignment (MPA)
• Most probable assignment for some variables of interest given evidence $\boldsymbol{E} = \boldsymbol{e}$:
$$\boldsymbol{Y}^* \mid \boldsymbol{e} = \arg\max_{\boldsymbol{Y}} P(\boldsymbol{Y} \mid \boldsymbol{e})$$
  This is the maximum a posteriori configuration of $\boldsymbol{Y}$.
• Applications of MPA
  • Classification: find the most likely label, given the evidence
  • Explanation: what is the most likely scenario, given the evidence

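A small sketch of the MPA query on a made-up posterior table over two binary variables (the numbers are an assumption chosen for illustration). It also shows that the joint argmax is not, in general, the same as picking the most likely value of each variable separately.

```python
# Hypothetical posterior P(Y1, Y2 | e), already conditioned on the evidence.
posterior = {(0, 0): 0.35, (0, 1): 0.05, (1, 0): 0.30, (1, 1): 0.30}

# MPA: a single argmax over joint assignments of (Y1, Y2).
mpa = max(posterior, key=posterior.get)

def marginal_argmax(index):
    """Most likely value of one variable under its own marginal."""
    totals = {}
    for assignment, p in posterior.items():
        value = assignment[index]
        totals[value] = totals.get(value, 0.0) + p
    return max(totals, key=totals.get)

# Maximizing each marginal independently gives a different assignment here.
per_variable = (marginal_argmax(0), marginal_argmax(1))
```

In this table the MPA is (0, 0) with probability 0.35, while the per-variable choice is (1, 0), which has probability 0.30.
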
MPA: Example
[Figure not recovered.]
This slide has been adapted from Eric Xing, PGM 10708, CMU.

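Marginal probability: Enumeration
• $P(\boldsymbol{Y} \mid \boldsymbol{e}) \propto P(\boldsymbol{Y}, \boldsymbol{e})$
• $P(\boldsymbol{Y}, \boldsymbol{e}) = \sum_{\boldsymbol{Z}} P(\boldsymbol{Y}, \boldsymbol{e}, \boldsymbol{Z})$
• Marginal probability: exponential computation is required in general.
  • #P-complete problem (enumeration is intractable)
  • Even for a graph of polynomial size, the computation can be exponential.
• We cannot find a general procedure that works efficiently for arbitrary GMs.
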
Hardness of Inference
• Hardness does not mean we cannot solve inference.
  • It implies that we cannot find a general procedure that works efficiently for arbitrary GMs.
  • For particular families of GMs, we can have provably efficient procedures.
    • For special graph structures, provably efficient algorithms (avoiding exponential cost) are available.

Exact inference
• Variable elimination algorithm
  • General graph
  • One query
• Belief propagation (sum-product on factor graphs)
  • Tree
  • Marginal probability on all nodes
• Junction tree algorithm
  • General graph
  • Marginal probability on all clique nodes

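Inference on a chain
[Figure: directed chain $A \to B \to C \to D$]
$$P(d) = \sum_{a} \sum_{b} \sum_{c} P(a, b, c, d)$$
$$P(d) = \sum_{c} \sum_{b} \sum_{a} P(a)\, P(b \mid a)\, P(c \mid b)\, P(d \mid c)$$
• A naïve summation needs to enumerate over an exponential number of terms.
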
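To make the cost of the naive summation concrete, here is a sketch for the chain A → B → C → D with binary variables. The conditional probability tables are made-up numbers (an assumption for illustration); the function counts how many product terms it has to add up.

```python
from itertools import product

# Hypothetical CPTs for the chain A -> B -> C -> D (all variables binary).
p_a = [0.6, 0.4]
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]  # indexed [a][b]
p_c_given_b = [[0.9, 0.1], [0.4, 0.6]]  # indexed [b][c]
p_d_given_c = [[0.5, 0.5], [0.1, 0.9]]  # indexed [c][d]

def naive_marginal_d(d):
    """P(d) by enumerating every joint assignment of (a, b, c)."""
    total = 0.0
    terms = 0
    for a, b, c in product(range(2), repeat=3):
        total += (
            p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c] * p_d_given_c[c][d]
        )
        terms += 1
    return total, terms

p_d0, terms_used = naive_marginal_d(0)
```

With three hidden binary variables the loop visits 2^3 = 8 terms per value of d; each extra node in the chain doubles that count.
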
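Inference on a chain: marginalization and elimination
[Figure: directed chain $A \to B \to C \to D$]
$$
\begin{aligned}
P(d) &= \sum_{c} \sum_{b} \sum_{a} P(a)\, P(b \mid a)\, P(c \mid b)\, P(d \mid c) \\
&= \sum_{c} \sum_{b} P(c \mid b)\, P(d \mid c) \sum_{a} P(a)\, P(b \mid a) \\
&= \sum_{c} P(d \mid c) \sum_{b} P(c \mid b) \sum_{a} P(a)\, P(b \mid a)
\end{aligned}
$$
  The innermost sum is $P(b)$, the sum over $b$ then yields $P(c)$, and the outermost sum yields $P(d)$.
• In a chain of $n$ nodes, each having $k$ values, the cost is $O(nk^2)$ instead of $O(k^n)$.
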
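The nested sums can be sketched as a single left-to-right pass that keeps only the current marginal. The tables below are illustrative assumptions for a chain A → B → C → D with binary variables, and `chain_marginal` is a hypothetical helper name.

```python
def chain_marginal(prior, transitions):
    """Marginal of the last node of a directed chain.

    prior[x] is P(first node = x); transitions[t][x][y] is
    P(next = y | current = x) for the t-th edge of the chain.
    Each step costs O(k^2), so the whole pass costs O(n k^2).
    """
    message = list(prior)
    for table in transitions:
        k_next = len(table[0])
        # Sum out the current node: message'(y) = sum_x message(x) P(y | x).
        message = [
            sum(message[x] * table[x][y] for x in range(len(message)))
            for y in range(k_next)
        ]
    return message

# Hypothetical CPTs for A -> B -> C -> D.
prior = [0.6, 0.4]                      # P(a)
transitions = [
    [[0.7, 0.3], [0.2, 0.8]],           # P(b | a)
    [[0.9, 0.1], [0.4, 0.6]],           # P(c | b)
    [[0.5, 0.5], [0.1, 0.9]],           # P(d | c)
]

p_d = chain_marginal(prior, transitions)
```

The intermediate values of `message` are P(b) and P(c), matching the labels on the three nested sums.
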
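Inference on a chain
[Figure: chain over $X_1, X_2, \dots, X_{N-1}, X_N$]
• In both directed and undirected graphical models, the joint probability is a factored expression over subsets of the variables. For the undirected chain:
$$P(\boldsymbol{x}) = \frac{1}{Z}\, \phi_{1,2}(x_1, x_2)\, \phi_{2,3}(x_2, x_3) \cdots \phi_{N-1,N}(x_{N-1}, x_N)$$
• The marginal of a single node sums over all the others:
$$P(x_i) = \frac{1}{Z} \sum_{x_1} \cdots \sum_{x_{i-1}} \sum_{x_{i+1}} \cdots \sum_{x_N} \phi_{1,2}(x_1, x_2) \cdots \phi_{N-1,N}(x_{N-1}, x_N)$$
• Pushing each sum inward splits this into a left part and a right part:
$$
\begin{aligned}
P(x_i) = \frac{1}{Z} &\left[ \sum_{x_{i-1}} \phi(x_{i-1}, x_i) \sum_{x_{i-2}} \phi(x_{i-2}, x_{i-1}) \cdots \sum_{x_1} \phi(x_1, x_2) \right] \\
\times &\left[ \sum_{x_{i+1}} \phi(x_i, x_{i+1}) \sum_{x_{i+2}} \phi(x_{i+1}, x_{i+2}) \cdots \sum_{x_N} \phi(x_{N-1}, x_N) \right]
\end{aligned}
$$
• Operations in each elimination: $O(|Val(X_j)| \times |Val(X_{j+1})|)$
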
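A minimal sketch of the left/right decomposition for an undirected chain, checked against brute-force enumeration. The potentials are arbitrary positive numbers chosen for illustration, nodes are indexed from 0, and both function names are hypothetical.

```python
from itertools import product

def undirected_chain_marginal(potentials, i, k):
    """P(x_i) on a chain; potentials[j][u][v] = phi(x_j = u, x_{j+1} = v)."""
    # Left part: eliminate x_0, x_1, ..., x_{i-1} in that order.
    forward = [1.0] * k
    for phi in potentials[:i]:
        forward = [sum(forward[u] * phi[u][v] for u in range(k)) for v in range(k)]
    # Right part: eliminate x_{N-1}, ..., x_{i+1} in that order.
    backward = [1.0] * k
    for phi in reversed(potentials[i:]):
        backward = [sum(phi[u][v] * backward[v] for v in range(k)) for u in range(k)]
    unnormalized = [f * b for f, b in zip(forward, backward)]
    z = sum(unnormalized)  # partition function Z
    return [u / z for u in unnormalized]

def brute_force_marginal(potentials, i, k):
    """Same marginal by enumerating all k^N assignments."""
    n = len(potentials) + 1
    totals = [0.0] * k
    for x in product(range(k), repeat=n):
        weight = 1.0
        for j, phi in enumerate(potentials):
            weight *= phi[x[j]][x[j + 1]]
        totals[x[i]] += weight
    z = sum(totals)
    return [t / z for t in totals]

# Hypothetical potentials for a 4-node binary chain.
potentials = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 1.0], [1.0, 3.0]],
    [[1.0, 1.0], [4.0, 2.0]],
]
fast = [undirected_chain_marginal(potentials, i, 2) for i in range(4)]
slow = [brute_force_marginal(potentials, i, 2) for i in range(4)]
```

Each update of `forward` or `backward` is one elimination step and costs k × k multiplications, in line with the per-elimination bound stated above.
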
Inference on a chain: improvement reasons
• We compute an expression of the form (sum-product inference):
$$\sum_{\boldsymbol{Z}} \prod_{\phi \in \Phi} \phi$$
  where $\Phi$ is the set of factors.
• We used the structure of the BN to factorize the joint distribution, so the scope of the resulting factors is limited.
• Distributive law: if $X \notin \mathrm{Scope}(\phi_1)$, then
$$\sum_{X} \phi_1 \cdot \phi_2 = \phi_1 \cdot \sum_{X} \phi_2$$
  • Each summation is performed over the product of only a subset of the factors.
• We find sub-expressions that can be computed once, then save and reuse them in later computations,
  • instead of recomputing them exponentially many times.

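The distributive law can be checked numerically on two tiny factors. The factor values below are made up for illustration; `phi1` depends only on A, so X is not in its scope, while `phi2` depends on (A, X).

```python
# Hypothetical factors: phi1(A) and phi2(A, X), all variables binary.
phi1 = {0: 2.0, 1: 5.0}
phi2 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 0.5}

# Left-hand side: multiply first, then sum out X (two products per value of A).
lhs = {a: sum(phi1[a] * phi2[(a, x)] for x in (0, 1)) for a in (0, 1)}

# Right-hand side: sum out X first, then multiply once per value of A.
rhs = {a: phi1[a] * sum(phi2[(a, x)] for x in (0, 1)) for a in (0, 1)}
```

Both sides give the same factor over A, but the right-hand side performs fewer multiplications; that saving compounds when many factors do not mention the variable being summed out.
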
Variable elimination algorithm for sum-product inference
• Sum out the variables one at a time. To eliminate a variable:
  • All factors containing that variable are removed from the set of factors and multiplied together to generate a product factor.
  • The variable is summed out of the product factor, which yields a new factor.
  • The new factor is added to the set of available factors.
• The resulting factor does not necessarily correspond to any probability or conditional probability in the network.

Procedure Sum-Product-VE($\boldsymbol{Z}$, $G$)   // $\boldsymbol{Z}$: the variables to be eliminated
    $\Phi \leftarrow$ all factors of $G$
    Select an elimination order $Z_1, \dots, Z_K$ for $\boldsymbol{Z}$
    for $i = 1, \dots, K$
        $\Phi \leftarrow$ Sum-Product-Elim-Var($\Phi$, $Z_i$)
    $\phi^* \leftarrow \prod_{\phi \in \Phi} \phi$
    return $\phi^*$

Procedure Sum-Product-Elim-Var($\Phi$, $Z$)
    $\Phi' \leftarrow \{\phi \in \Phi : Z \in \mathrm{Scope}(\phi)\}$
    $\Phi'' \leftarrow \Phi - \Phi'$
    $m \leftarrow \sum_{Z} \prod_{\phi \in \Phi'} \phi$
    return $\Phi'' \cup \{m\}$

• Move all factors that are irrelevant to the variable being eliminated outside of the summation.
• Perform the sum, getting a new term.
• Insert the new term into the product.
For a directed graph with no evidence, the result does not need normalization.

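The two procedures above can be sketched in plain Python. The factor representation (a pair of a scope tuple and a table keyed by value tuples), the `domains` argument, and the helper names are assumptions made for this illustration; the elimination order is passed in rather than selected.

```python
from itertools import product

UNIT = ((), {(): 1.0})  # identity factor with empty scope

def multiply(f, g, domains):
    """Factor product over the union of the two scopes."""
    (sf, tf), (sg, tg) = f, g
    scope = tuple(dict.fromkeys(sf + sg))  # ordered union, no duplicates
    table = {}
    for values in product(*(domains[v] for v in scope)):
        a = dict(zip(scope, values))
        table[values] = tf[tuple(a[v] for v in sf)] * tg[tuple(a[v] for v in sg)]
    return scope, table

def sum_out(f, var):
    """Marginalize var out of factor f."""
    scope, table = f
    idx = scope.index(var)
    new_table = {}
    for values, p in table.items():
        key = values[:idx] + values[idx + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return scope[:idx] + scope[idx + 1:], new_table

def sum_product_elim_var(factors, var, domains):
    used = [f for f in factors if var in f[0]]      # Phi'
    rest = [f for f in factors if var not in f[0]]  # Phi''
    if not used:
        return rest
    prod = UNIT
    for f in used:
        prod = multiply(prod, f, domains)
    return rest + [sum_out(prod, var)]              # Phi'' plus the new factor m

def sum_product_ve(factors, order, domains):
    for var in order:
        factors = sum_product_elim_var(factors, var, domains)
    result = UNIT
    for f in factors:
        result = multiply(result, f, domains)
    return result

# Hypothetical chain a -> b -> c -> d with binary variables.
domains = {v: (0, 1) for v in "abcd"}
factors = [
    (("a",), {(0,): 0.6, (1,): 0.4}),
    (("a", "b"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}),
    (("b", "c"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.6}),
    (("c", "d"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.9}),
]
scope_d, table_d = sum_product_ve(factors, ["a", "b", "c"], domains)
```

With no evidence in a directed model, the returned table already sums to 1, which matches the remark that no normalization is needed in that case.
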
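Procedure Cond-Prob-VE(
    $\mathcal{K}$,   // the network over $\mathcal{X}$
    $\boldsymbol{Y}$,   // set of query variables
    $\boldsymbol{E} = \boldsymbol{e}$   // evidence
)
    $\Phi \leftarrow$ the factors parametrizing $\mathcal{K}$
    Replace each $\phi \in \Phi$ by $\phi[\boldsymbol{E} = \boldsymbol{e}]$
    Select an elimination order $Z_1, \dots, Z_K$ for $\boldsymbol{Z} = \mathcal{X} - \boldsymbol{Y} - \boldsymbol{E}$
    for $i = 1, \dots, K$
        $\Phi \leftarrow$ Sum-Product-Elim-Var($\Phi$, $Z_i$)
    $\phi^* \leftarrow \prod_{\phi \in \Phi} \phi$
    $\alpha \leftarrow \sum_{\boldsymbol{y} \in Val(\boldsymbol{Y})} \phi^*(\boldsymbol{y})$
    return $\alpha$, $\phi^*$
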
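The two steps that Cond-Prob-VE adds on top of plain elimination are the evidence reduction and the final normalizer. The sketch below shows both on a two-node network A → B with evidence B = 1, so no hidden variables remain to eliminate. The factor representation (scope tuple plus table), the numbers, and the name `reduce_factor` are assumptions for illustration.

```python
def reduce_factor(factor, evidence):
    """phi[E = e]: keep entries consistent with the evidence and drop
    the evidence variables from the scope."""
    scope, table = factor
    keep = [i for i, v in enumerate(scope) if v not in evidence]
    new_scope = tuple(scope[i] for i in keep)
    new_table = {}
    for values, p in table.items():
        if all(evidence.get(v, x) == x for v, x in zip(scope, values)):
            new_table[tuple(values[i] for i in keep)] = p
    return new_scope, new_table

# Hypothetical network A -> B with binary variables.
p_a = (("a",), {(0,): 0.6, (1,): 0.4})
p_b_given_a = (("a", "b"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})

evidence = {"b": 1}
reduced = [reduce_factor(f, evidence) for f in (p_a, p_b_given_a)]

# Both reduced factors now have scope ("a",); phi* is their product.
phi_star = {
    key: reduced[0][1][key] * reduced[1][1][key] for key in reduced[0][1]
}

alpha = sum(phi_star.values())                       # equals P(b = 1)
posterior = {key: p / alpha for key, p in phi_star.items()}  # P(a | b = 1)
```

Returning both the normalizer and the unnormalized factor is convenient: the normalizer is the probability of the evidence, and dividing the factor by it gives the conditional distribution over the query variables.
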
Directed example
[Figure: directed graph over $X_1, \dots, X_8$]
• Query: $P(X_2 \mid X_7 = x_7)$
• $P(X_2 \mid x_7) \propto P(X_2, x_7)$
$$P(x_2, x_7) = \sum_{x_1} \sum_{x_3} \sum_{x_4} \sum_{x_5} \sum_{x_6} \sum_{x_8} P(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)$$
Consider the elimination order $X_1, X_3, X_4, X_5, X_6, X_8$:
$$P(x_2, x_7) = \sum_{x_8} \sum_{x_6} \sum_{x_5} \sum_{x_4} \sum_{x_3} \sum_{x_1} P(x_1)\, P(x_2)\, P(x_3 \mid x_1, x_2)\, P(x_4 \mid x_3)\, P(x_5 \mid x_2)\, P(x_6 \mid x_3, x_7)\, P(x_7 \mid x_4, x_5)\, P(x_8 \mid x_7)$$

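$$
\begin{aligned}
P(x_2, x_7) &= \sum_{x_8} \sum_{x_6} \sum_{x_5} \sum_{x_4} \sum_{x_3} P(x_2)\, P(x_4 \mid x_3)\, P(x_5 \mid x_2)\, P(x_6 \mid x_3, x_7)\, P(x_7 \mid x_4, x_5)\, P(x_8 \mid x_7) \sum_{x_1} P(x_1)\, P(x_3 \mid x_1, x_2) \\
&= \sum_{x_8} \sum_{x_6} \sum_{x_5} \sum_{x_4} \sum_{x_3} P(x_2)\, P(x_4 \mid x_3)\, P(x_5 \mid x_2)\, P(x_6 \mid x_3, x_7)\, P(x_7 \mid x_4, x_5)\, P(x_8 \mid x_7)\, m_1(x_2, x_3) \\
&= \sum_{x_8} \sum_{x_6} \sum_{x_5} \sum_{x_4} P(x_2)\, P(x_5 \mid x_2)\, P(x_7 \mid x_4, x_5)\, P(x_8 \mid x_7) \sum_{x_3} P(x_4 \mid x_3)\, P(x_6 \mid x_3, x_7)\, m_1(x_2, x_3) \\
&= \sum_{x_8} \sum_{x_6} \sum_{x_5} \sum_{x_4} P(x_2)\, P(x_5 \mid x_2)\, P(x_7 \mid x_4, x_5)\, P(x_8 \mid x_7)\, m_3(x_2, x_6, x_4) \\
&= \sum_{x_8} \sum_{x_6} \sum_{x_5} P(x_2)\, P(x_5 \mid x_2)\, P(x_8 \mid x_7) \sum_{x_4} P(x_7 \mid x_4, x_5)\, m_3(x_2, x_6, x_4) \\
&= \sum_{x_8} \sum_{x_6} \sum_{x_5} P(x_2)\, P(x_5 \mid x_2)\, P(x_8 \mid x_7)\, m_4(x_2, x_5, x_6) \\
&= \sum_{x_8} \sum_{x_6} P(x_2)\, P(x_8 \mid x_7) \sum_{x_5} P(x_5 \mid x_2)\, m_4(x_2, x_5, x_6) \\
&= \sum_{x_8} \sum_{x_6} P(x_2)\, P(x_8 \mid x_7)\, m_5(x_2, x_6) \\
&= \sum_{x_8} P(x_2)\, P(x_8 \mid x_7) \sum_{x_6} m_5(x_2, x_6) \\
&= \sum_{x_8} P(x_2)\, P(x_8 \mid x_7)\, m_6(x_2) \\
&= m_8(x_2)\, m_6(x_2)
\end{aligned}
$$
Here $m_8(x_2)$ collects the remaining terms: $m_8(x_2) = P(x_2) \sum_{x_8} P(x_8 \mid x_7)$.
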
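The scopes of the intermediate factors in this derivation can be traced mechanically, without any numbers. The sketch below assumes the factor scopes read off the factorization of the example, with the observed variable X7 already dropped from every scope; `trace_elimination` is a hypothetical helper name.

```python
def trace_elimination(scopes, order):
    """Return the scope of the new factor created when each variable
    in `order` is eliminated (scopes only, no tables)."""
    scopes = [frozenset(s) for s in scopes]
    messages = {}
    for var in order:
        used = [s for s in scopes if var in s]
        rest = [s for s in scopes if var not in s]
        new_scope = frozenset().union(*used) - {var}
        messages[var] = new_scope
        scopes = rest + [new_scope]
    return messages

# Scopes of P(x1), P(x2), P(x3|x1,x2), P(x4|x3), P(x5|x2),
# P(x6|x3,x7), P(x7|x4,x5), P(x8|x7) after fixing X7 = x7.
scopes = [{1}, {2}, {1, 2, 3}, {3, 4}, {2, 5}, {3, 6}, {4, 5}, {8}]
order = [1, 3, 4, 5, 6, 8]

messages = trace_elimination(scopes, order)
```

The trace reproduces the scopes of m_1 through m_6 in the derivation. For X8 it yields an empty scope, because the only factor mentioning X8 is P(x8 | x7) and its sum over x8 is the constant 1; the derivation above instead folds P(x2) into m_8 so that the final line is a product of two factors over x2.
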
Conditional probability
$$P(x_2 \mid x_7) = \frac{m_8(x_2)\, m_6(x_2)}{\sum_{x_2} m_8(x_2)\, m_6(x_2)}$$