SLIDE 1

Exact Inference: Variable Elimination

Probabilistic Graphical Models, Sharif University of Technology, Spring 2018, Soleymani

SLIDE 2

Probabilistic Inference and Learning


 We now have compact representations of probability distributions (graphical models)

 A GM M describes a unique probability distribution P

 Typical tasks:

 Task 1: How do we answer queries about P_M, e.g., P_M(X | Y)?

 We use inference as the name for the process of computing answers to such queries

 Task 2: How do we estimate a plausible model M from data D?

 i. We use learning as the name for the process of obtaining a point estimate of M.
 ii. But Bayesians seek the posterior p(M | D), which is itself an inference problem.
 iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.

This slide has been adapted from Eric Xing, PGM 10-708, CMU.

SLIDE 3

Why we need inference


 If we know the graphical model, we use inference to compute marginal or conditional distributions efficiently

 We also need inference during learning, when we fit a model from incomplete data or when the learning approach is Bayesian (as we will see in the next lectures)

SLIDE 4

Inference query


 Likelihood: probability of evidence

P(e) = Σ_X P(X, e)

 Marginal probability distribution:

P(X) = Σ_{𝒳−X} P(𝒳)

 Conditional probability distribution (a posteriori belief):

P(X | e) = P(X, e) / Σ_X P(X, e)

 Marginalized conditional probability distribution:

P(Y | e) = Σ_Z P(Y, Z, e) / Σ_Y Σ_Z P(Y, Z, e)   (X = Y ∪ Z)

Notation: 𝒳 = {X1, …, Xn} are the nodes; e is the evidence on a set of variables E; X = 𝒳 − E and Z = X − Y; we query a subset Y of the domain variables X = {Y, Z} and "don't care" about the remaining Z (see the numeric sketch below).
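A tiny numeric sketch of these four queries, assuming three binary variables 𝒳 = {Y, Z, E} and a made-up joint table; the array layout and names are mine, not the slides':

import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # joint P(Y, Z, E); entries sum to 1

e = 1                                  # observed evidence E = e
likelihood = P[:, :, e].sum()          # P(e) = sum_{Y,Z} P(Y, Z, e)
marginal_Y = P.sum(axis=(1, 2))        # P(Y) = sum_{Z,E} P(Y, Z, E)
cond_YZ = P[:, :, e] / likelihood      # P(Y, Z | e) = P(Y, Z, e) / P(e)
marg_cond_Y = cond_YZ.sum(axis=1)      # P(Y | e) = sum_Z P(Y, Z | e)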

SLIDE 5

Most Probable Assignment (MPA)


 Most probable assignment for some variables of interest, given evidence E = e:

Y* | e = argmax_Y P(Y | e)

 Applications of MPA

 Classification
 find the most likely label, given the evidence

 Explanation
 what is the most likely scenario, given the evidence?

Maximum a posteriori configuration of Y (a one-line sketch follows).
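Continuing the toy table from the previous slide's sketch, the MPA query is just an argmax over the (marginalized) conditional, here for a single binary query variable Y:

y_star = int(marg_cond_Y.argmax())     # argmax_y P(y | e)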

SLIDE 6

MPA: Example


[Figure: worked example]

This slide has been adapted from Eric Xing, PGM 10-708, CMU.

SLIDE 7

Marginal probability: Enumeration


 P(Y | e) ∝ P(Y, e)
 P(Y, e) = Σ_Z P(Y, e, Z)
 Marginal probability: exponential computation is required in general
 a #P-complete problem (enumeration is intractable)
 even for graphs of polynomial size, the cost can be exponential
 we cannot find a general procedure that works efficiently for arbitrary GMs

SLIDE 8

Hardness of Inference


 Hardness does not mean we cannot solve inference
 it implies that we cannot find a general procedure that works efficiently for arbitrary GMs
 For particular families of GMs, we can have provably efficient procedures
 for special graph structures, provably efficient algorithms (avoiding exponential cost) are available

SLIDE 9

Exact inference


 Exact inference algorithms:

 Variable elimination algorithm
 general graph  one query

 Belief propagation (sum-product on factor graphs)
 tree  marginal probabilities on all nodes

 Junction tree algorithm
 general graph  marginal probabilities on all clique nodes

SLIDE 10

Inference on a chain


P(d) = Σ_a Σ_b Σ_c P(a, b, c, d)
P(d) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|b) P(d|c)

 A naïve summation needs to enumerate over an exponential number of terms

[Figure: chain A → B → C → D]

SLIDE 11

Inference on a chain: marginalization and elimination


P(d) = Σ_a Σ_b Σ_c P(a) P(b|a) P(c|b) P(d|c)
     = Σ_c Σ_b Σ_a P(a) P(b|a) P(c|b) P(d|c)
     = Σ_c P(d|c) Σ_b P(c|b) Σ_a P(a) P(b|a)

 In a chain of n nodes, each taking k values: O(nk²) instead of O(k^n) (see the sketch below)

[Figure: chain A → B → C → D with intermediate marginals P(b), P(c), P(d)]
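A minimal sketch of this elimination on the chain A → B → C → D, with binary variables and made-up CPTs; each step is one matrix-vector product, so the whole computation is O(nk²) rather than the O(k^n) of naïve enumeration:

import numpy as np

p_a = np.array([0.6, 0.4])                   # P(A)
p_b_a = np.array([[0.7, 0.3],                # P(B | A), rows indexed by a
                  [0.2, 0.8]])
p_c_b = np.array([[0.9, 0.1],                # P(C | B)
                  [0.4, 0.6]])
p_d_c = np.array([[0.5, 0.5],                # P(D | C)
                  [0.1, 0.9]])

p_b = p_a @ p_b_a        # sum_a P(a) P(b|a)  ->  P(B)
p_c = p_b @ p_c_b        # sum_b P(b) P(c|b)  ->  P(C)
p_d = p_c @ p_d_c        # sum_c P(c) P(d|c)  ->  P(D)
assert np.isclose(p_d.sum(), 1.0)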

SLIDE 12

Inference on a chain

 In both directed and undirected graphical models, the joint probability is a factored expression over subsets of the variables:

P(x) = (1/Z) ψ_{1,2}(x1, x2) ψ_{2,3}(x2, x3) ⋯ ψ_{N−1,N}(x_{N−1}, x_N)

P(x_i) = (1/Z) Σ_{x1} ⋯ Σ_{x_{i−1}} Σ_{x_{i+1}} ⋯ Σ_{x_N} ψ_{1,2}(x1, x2) ⋯ ψ_{N−1,N}(x_{N−1}, x_N)

P(x_i) = (1/Z) [Σ_{x_{i−1}} ψ(x_{i−1}, x_i) [Σ_{x_{i−2}} ψ(x_{i−2}, x_{i−1}) ⋯ [Σ_{x1} ψ(x1, x2)] ⋯ ]] × [Σ_{x_{i+1}} ψ(x_i, x_{i+1}) [Σ_{x_{i+2}} ψ(x_{i+1}, x_{i+2}) ⋯ [Σ_{x_N} ψ(x_{N−1}, x_N)] ⋯ ]]

 O(|Val(X_i)| × |Val(X_{i+1})|) operations in each elimination step

[Figure: a directed chain and an undirected chain X1 — X2 — ⋯ — X_{N−1} — X_N]

SLIDE 13

Inference on a chain: reasons for the improvement


 Computing an expression of the form (sum-product inference):

Σ_Z Π_{φ∈Φ} φ      (Φ: the set of factors)

 We used the structure of the BN to factorize the joint distribution, so the scopes of the resulting factors are limited

 Distributive law: if X ∉ Scope(φ1), then Σ_X φ1 · φ2 = φ1 · Σ_X φ2
 each summation is performed over the product of only a subset of the factors

 We find sub-expressions that can be computed once, then saved and reused in later computations
 instead of being computed exponentially many times (see the numeric check below)
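The distributive law is easy to check numerically; a small sketch with two made-up factors over binary variables (the names are mine):

import numpy as np

phi1 = np.random.rand(2)        # phi1(A); X is not in its scope
phi2 = np.random.rand(2, 2)     # phi2(A, X)

lhs = (phi1[:, None] * phi2).sum(axis=1)   # sum_X phi1(A) * phi2(A, X)
rhs = phi1 * phi2.sum(axis=1)              # phi1(A) * sum_X phi2(A, X)
assert np.allclose(lhs, rhs)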

SLIDE 14

Variable elimination algorithm for sum-product inference


 Sum out each variable, one at a time:

 all factors containing that variable are removed from the set of factors and multiplied together to generate a product factor
 the variable is summed out of the product factor, yielding a new factor
 the new factor is added back to the set of available factors

The resulting factor does not necessarily correspond to any probability or conditional probability in the network.

SLIDE 15


Procedure Sum-Product-VE (Z, G)
// Z: the variables to be eliminated
Φ ← all factors of G
Select an elimination order Z1, …, Zk for Z
for i = 1, …, k
    Φ ← Sum-Product-Elim-Var(Φ, Zi)
φ* ← Π_{φ∈Φ} φ
Return φ*

Procedure Sum-Product-Elim-Var (Φ, Z)
Φ′ ← {φ ∈ Φ : Z ∈ Scope(φ)}
Φ″ ← Φ − Φ′
m ← Σ_Z Π_{φ∈Φ′} φ
return Φ″ ∪ {m}

  • Move all factors irrelevant to the variable being eliminated outside of the summation
  • Perform the sum, getting a new term
  • Insert the new term into the product

No normalization is needed for a directed graph when there is no evidence. (A runnable sketch follows.)
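A compact Python sketch of the two procedures above. The Factor representation (dense numpy tables plus axis names) and all helper names are mine, not the slides'; it mirrors the pseudocode rather than being an efficient implementation:

import numpy as np
from dataclasses import dataclass

@dataclass
class Factor:
    vars: list            # variable names, one per axis of the table
    table: np.ndarray     # nonnegative entries

def multiply(f, g):
    # pointwise product over the union of the two scopes, via broadcasting
    all_vars = f.vars + [v for v in g.vars if v not in f.vars]
    def expand(fac):
        perm = [fac.vars.index(v) for v in all_vars if v in fac.vars]
        shape = [fac.table.shape[fac.vars.index(v)] if v in fac.vars else 1
                 for v in all_vars]
        return np.transpose(fac.table, perm).reshape(shape)
    return Factor(all_vars, expand(f) * expand(g))

def sum_product_elim_var(factors, z):
    # Sum-Product-Elim-Var: multiply all factors mentioning z, then sum z out
    used = [f for f in factors if z in f.vars]
    rest = [f for f in factors if z not in f.vars]
    if not used:                      # z appears nowhere; nothing to do
        return rest
    prod = used[0]
    for f in used[1:]:
        prod = multiply(prod, f)
    m = Factor([v for v in prod.vars if v != z],
               prod.table.sum(axis=prod.vars.index(z)))
    return rest + [m]

def sum_product_ve(factors, elim_order):
    # Sum-Product-VE: eliminate in order, then multiply the remaining factors
    for z in elim_order:
        factors = sum_product_elim_var(factors, z)
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result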

SLIDE 16


Procedure Cond-Prob-VE (
    𝒦,      // the network over 𝒳
    Y,      // set of query variables
    E = e   // evidence
)
Φ ← the factors parametrizing 𝒦
Replace each φ ∈ Φ by φ[E = e]
Select an elimination order Z1, …, Zk for Z = 𝒳 − Y − E
for i = 1, …, k
    Φ ← Sum-Product-Elim-Var(Φ, Zi)
φ* ← Π_{φ∈Φ} φ
α ← Σ_{y∈Val(Y)} φ*(y)
Return α, φ*
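The conditional-query wrapper, again as a sketch on top of sum_product_ve above. Reducing a factor by evidence, φ[E = e], just slices out the observed values; in this sketch the elimination order over Z = 𝒳 − Y − E is passed in by the caller rather than selected automatically:

def reduce_evidence(f, evidence):
    # phi[E = e]: restrict a factor to the observed values
    out = f
    for var, val in evidence.items():
        if var in out.vars:
            ax = out.vars.index(var)
            out = Factor([v for v in out.vars if v != var],
                         np.take(out.table, val, axis=ax))
    return out

def cond_prob_ve(factors, evidence, elim_order):
    # returns (alpha, phi*) with P(query | e) = phi*.table / alpha, alpha = P(e)
    reduced = [reduce_evidence(f, evidence) for f in factors]
    phi_star = sum_product_ve(reduced, elim_order)
    alpha = phi_star.table.sum()
    return alpha, phi_star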

SLIDE 17

Directed example

 Query: P(X2 | X7 = x̄7)
 P(X2 | x̄7) ∝ P(X2, x̄7)

P(x2, x̄7) = Σ_{x1} Σ_{x3} Σ_{x4} Σ_{x5} Σ_{x6} Σ_{x8} P(x1, x2, x3, x4, x5, x6, x̄7, x8)

Consider the elimination order X1, X3, X4, X5, X6, X8:

P(x2, x̄7) = Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} Σ_{x3} Σ_{x1} P(x1) P(x2) P(x3|x1, x2) P(x4|x3) P(x5|x2) P(x6|x3, x̄7) P(x̄7|x4, x5) P(x8|x̄7)

[Figure: the eight-node network over X1, …, X8]

SLIDE 18

P(x2, x̄7)
= Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} Σ_{x3} P(x2) P(x4|x3) P(x5|x2) P(x6|x3, x̄7) P(x̄7|x4, x5) P(x8|x̄7) Σ_{x1} P(x1) P(x3|x1, x2)
= Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} Σ_{x3} P(x2) P(x4|x3) P(x5|x2) P(x6|x3, x̄7) P(x̄7|x4, x5) P(x8|x̄7) m1(x2, x3)
= Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} P(x2) P(x5|x2) P(x̄7|x4, x5) P(x8|x̄7) Σ_{x3} P(x4|x3) P(x6|x3, x̄7) m1(x2, x3)
= Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} P(x2) P(x5|x2) P(x̄7|x4, x5) P(x8|x̄7) m3(x2, x6, x4)
= Σ_{x8} Σ_{x6} Σ_{x5} P(x2) P(x5|x2) P(x8|x̄7) Σ_{x4} P(x̄7|x4, x5) m3(x2, x6, x4)
= Σ_{x8} Σ_{x6} Σ_{x5} P(x2) P(x5|x2) P(x8|x̄7) m4(x2, x5, x6)
= Σ_{x8} Σ_{x6} P(x2) P(x8|x̄7) Σ_{x5} P(x5|x2) m4(x2, x5, x6)
= Σ_{x8} Σ_{x6} P(x2) P(x8|x̄7) m5(x2, x6)
= Σ_{x8} P(x2) P(x8|x̄7) Σ_{x6} m5(x2, x6)
= Σ_{x8} P(x2) P(x8|x̄7) m6(x2)
= P(x2) m6(x2) m8,   where m8 = Σ_{x8} P(x8|x̄7) (= 1 here, since a CPT sums to 1 over its child)

SLIDE 19

Conditional probability


P(x2 | x̄7) = P(x2) m6(x2) m8 / Σ_{x2} P(x2) m6(x2) m8
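To tie the example together, the sketches above can be run on this network. The CPT values below are random placeholders (the slides never specify them), with all variables binary and X7 observed:

def cpt(child, parents, rng):
    # a random CPT P(child | parents) as a Factor, with the child on the last axis
    t = rng.random((2,) * (len(parents) + 1))
    t /= t.sum(axis=-1, keepdims=True)       # normalize over the child
    return Factor(parents + [child], t)

rng = np.random.default_rng(0)
factors = [cpt('x1', [], rng), cpt('x2', [], rng),
           cpt('x3', ['x1', 'x2'], rng), cpt('x4', ['x3'], rng),
           cpt('x5', ['x2'], rng), cpt('x6', ['x3', 'x7'], rng),
           cpt('x7', ['x4', 'x5'], rng), cpt('x8', ['x7'], rng)]

alpha, phi = cond_prob_ve(factors, {'x7': 1},
                          ['x1', 'x3', 'x4', 'x5', 'x6', 'x8'])
print(phi.table / alpha)                     # P(x2 | x7 = 1), a length-2 vector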

SLIDE 20

Undirected example

 Query: P(X2 | X7 = x̄7)
 P(X2 | x̄7) ∝ P(X2, x̄7)

P(x2, x̄7) = Σ_{x1} Σ_{x3} Σ_{x4} Σ_{x5} Σ_{x6} Σ_{x8} P(x1, x2, x3, x4, x5, x6, x̄7, x8)

Consider the elimination order X1, X3, X4, X5, X6, X8:

P(x2, x̄7) = (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} Σ_{x3} Σ_{x1} φ(x3, x4) φ(x2, x5) φ(x3, x6, x̄7) φ(x4, x5, x̄7) φ(x̄7, x8) φ(x1, x2, x3)

[Figure: the eight-node undirected network over X1, …, X8]

SLIDE 21

P(x2, x̄7)
= (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} Σ_{x3} φ(x3, x4) φ(x2, x5) φ(x3, x6, x̄7) φ(x4, x5, x̄7) φ(x̄7, x8) Σ_{x1} φ(x1, x2, x3)
= (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} Σ_{x3} φ(x3, x4) φ(x2, x5) φ(x3, x6, x̄7) φ(x4, x5, x̄7) φ(x̄7, x8) m1(x2, x3)
= (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} φ(x2, x5) φ(x4, x5, x̄7) φ(x̄7, x8) Σ_{x3} φ(x3, x4) φ(x3, x6, x̄7) m1(x2, x3)
= (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} Σ_{x4} φ(x2, x5) φ(x4, x5, x̄7) φ(x̄7, x8) m3(x2, x6, x4)
= (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} φ(x2, x5) φ(x̄7, x8) Σ_{x4} φ(x4, x5, x̄7) m3(x2, x6, x4)
= (1/Z) Σ_{x8} Σ_{x6} Σ_{x5} φ(x2, x5) φ(x̄7, x8) m4(x2, x5, x6)
= (1/Z) Σ_{x8} Σ_{x6} φ(x̄7, x8) Σ_{x5} φ(x2, x5) m4(x2, x5, x6)
= (1/Z) Σ_{x8} Σ_{x6} φ(x̄7, x8) m5(x2, x6)
= (1/Z) Σ_{x8} φ(x̄7, x8) Σ_{x6} m5(x2, x6)
= (1/Z) Σ_{x8} φ(x̄7, x8) m6(x2)

[Figure: the eight-node undirected network over X1, …, X8]

SLIDE 22

Complexity of variable elimination algorithm


 In each elimination step, the following computations are required:

 f(x, x1, …, xk) = Π_{i=1}^{M} g_i(x, 𝐱_{c_i})      (𝐱_{c_i} ⊆ {x1, …, xk}: the other variables in the scope of factor g_i)
 Σ_x f(x, x1, …, xk)

 We need:

 (M − 1) × |Val(X)| × Π_{i=1}^{k} |Val(X_i)| multiplications
 for each tuple (x, x1, …, xk), we need M − 1 multiplications

 |Val(X)| × Π_{i=1}^{k} |Val(X_i)| additions
 for each tuple (x1, …, xk), we need |Val(X)| additions

Complexity is exponential in the number of variables in the intermediate factor; the size of the created factors is the dominant quantity in the complexity of VE.
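A worked instance (the numbers are mine, for illustration): suppose we eliminate a binary variable X that appears in M = 3 factors together with k = 2 other binary variables. The product factor f(x, x1, x2) has |Val(X)| × |Val(X1)| × |Val(X2)| = 2 × 2 × 2 = 8 entries, so forming it takes (M − 1) × 8 = 16 multiplications, and summing X out takes |Val(X)| = 2 additions for each of the 4 tuples (x1, x2), i.e. 8 additions.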

SLIDE 23

Graph elimination


 Graph elimination gives a simple, unified treatment of inference algorithms in both directed and undirected models
 directed models are converted to undirected ones (moralization)
 Graph-theoretic property: the factors produced during variable elimination are captured by recording the elimination cliques
 The computational complexity of the elimination algorithm can thus be reduced to purely graph-theoretic considerations

SLIDE 24

Graph elimination


 Begin with the undirected GM or the moralized BN
 Choose an elimination ordering (query nodes should come last)
 Eliminate a node from the graph and add edges (called fill edges) between all pairs of its neighbors
 Iterate until all non-query nodes are eliminated (a sketch in code follows)
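A sketch of the loop above on a plain adjacency structure, a dict mapping each node to a set of neighbors; the representation and names are mine, and the edge list below is my moralization of the running example:

def eliminate_node(neighbors, node):
    # remove `node`, connect its remaining neighbors, and return the fill edges
    nbrs = neighbors.pop(node)
    for u in nbrs:
        neighbors[u].discard(node)
    fill = []
    for u in nbrs:
        for v in nbrs:
            if u < v and v not in neighbors[u]:
                neighbors[u].add(v)
                neighbors[v].add(u)
                fill.append((u, v))
    return fill

def graph_eliminate(neighbors, order):
    # eliminate nodes in order; record each elimination clique along the way
    cliques = []
    for node in order:
        cliques.append({node} | neighbors[node])
        eliminate_node(neighbors, node)
    return cliques

edges = [('x1', 'x2'), ('x1', 'x3'), ('x2', 'x3'), ('x3', 'x4'), ('x2', 'x5'),
         ('x3', 'x6'), ('x3', 'x7'), ('x6', 'x7'), ('x4', 'x5'), ('x4', 'x7'),
         ('x5', 'x7'), ('x7', 'x8')]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)
cliques = graph_eliminate(neighbors, ['x1', 'x3', 'x4', 'x5', 'x6', 'x8'])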

SLIDE 25

Graph elimination


[Figure: the moralized graph of the example, followed by the sequence of graphs obtained as X1, X3, X4, X5, X6, X8 are eliminated in turn; eliminating a node removes it from the graph and connects its remaining neighbors, and the newly added edges are the fill edges]

Summation ⇔ elimination
Intermediate term ⇔ elimination clique

SLIDE 26

Graph elimination: elimination cliques


 The dependencies induced during marginalization are captured in the elimination cliques
 There is a correspondence between the maximal cliques in the induced graph and the maximal factors generated by the VE algorithm
 the complexity depends on the number of variables in the largest elimination clique
 The size of the maximal elimination clique in the induced graph depends on the elimination ordering

SLIDE 27

Elimination order: example

 Ordering ≺: X1, X3, X4, X5, X6
 Ordering ≺: X4, X3, X5, X6, X1

[Figure: a six-node network over X1, …, X6 and the cliques induced under each ordering]

SLIDE 28

Elimination order


 Finding the best elimination ordering is NP-hard
 equivalent to finding the treewidth of the graph, which is NP-hard
 Treewidth: one less than the smallest achievable size of the largest elimination clique, ranging over all possible elimination orderings
 Good elimination orderings lead to small cliques and thus reduce complexity
 What is the optimal order for trees?

SLIDE 29

Heuristics for finding an ordering


 How can we find an ordering that induces a "small" graph?
 Some heuristics for selecting the next node to eliminate (sketched in code below):

 Min-neighbors
 the cost of a vertex is the number of neighbors it has in the current graph

 Min-fill
 the cost of a vertex is the number of edges that need to be added to the graph due to its elimination
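Both heuristics drop out of the same greedy loop; a sketch reusing eliminate_node from the graph-elimination code above:

def min_neighbors_cost(neighbors, v):
    return len(neighbors[v])            # neighbors in the current graph

def min_fill_cost(neighbors, v):
    nbrs = list(neighbors[v])           # count missing edges among the neighbors
    return sum(1 for i, u in enumerate(nbrs)
               for w in nbrs[i + 1:] if w not in neighbors[u])

def greedy_order(neighbors, cost):
    # repeatedly eliminate the cheapest node under the chosen cost function
    order = []
    while neighbors:
        v = min(neighbors, key=lambda u: cost(neighbors, u))
        order.append(v)
        eliminate_node(neighbors, v)
    return order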

SLIDE 30

Elimination algorithm: summary


 The elimination algorithm computes the marginal probability for one query
 It uses the factorization properties and the distributive law to compute marginal probabilities more efficiently:
 reorder computations
 save intermediate terms
 The elimination order affects the computational complexity; however, finding the best order is NP-hard in general

SLIDE 31

References


 D. Koller and N. Friedman, "Probabilistic Graphical Models: Principles and Techniques", MIT Press, 2009, Chapter 9.
 M. I. Jordan, "An Introduction to Probabilistic Graphical Models", Chapter 3.