Bayesian networks: approximate inference (Machine Intelligence), PowerPoint presentation


SLIDE 1

Bayesian networks: approximate inference

Machine Intelligence, Thomas D. Nielsen, September 2008

SLIDE 2

Approximate Inference

Motivation: Because of the (worst-case) intractability of exact inference in Bayesian networks, we look for more efficient approximate inference techniques: instead of computing the exact posterior P(A | E = e), compute an approximation P̂(A | E = e) with P̂(A | E = e) ≈ P(A | E = e).

SLIDE 3

Approximate Inference

Absolute/Relative Error. For p, p̂ ∈ [0, 1]:

  • p̂ is an approximation of p with absolute error ≤ ε if |p − p̂| ≤ ε, i.e. p̂ ∈ [p − ε, p + ε].
  • p̂ is an approximation of p with relative error ≤ ε if |1 − p̂/p| ≤ ε, i.e. p̂ ∈ [p(1 − ε), p(1 + ε)].

This definition is not always fully satisfactory, because it is not symmetric in p and p̂ and not invariant under the transition p → (1 − p), p̂ → (1 − p̂). Use with care!

When p̂1, p̂2 are approximations of p1, p2 with absolute error ≤ ε, then no error bounds follow for p̂1/p̂2 as an approximation of p1/p2. When p̂1, p̂2 are approximations of p1, p2 with relative error ≤ ε, then p̂1/p̂2 approximates p1/p2 with relative error ≤ 2ε/(1 + ε).

SLIDE 6

Approximate Inference

Randomized Methods. Most methods for approximate inference are randomized algorithms that compute approximations P̂ from random samples of instantiations. We shall consider:

  • Forward sampling
  • Likelihood weighting
  • Gibbs sampling
  • The Metropolis-Hastings algorithm

SLIDE 7

Approximate Inference

Forward Sampling. Observation: a Bayesian network can be used as a random generator that produces full instantiations V = v according to the distribution P(V).

Example: network A → B with CPTs

P(A):      P(A = t) = 0.2, P(A = f) = 0.8
P(B | A):  P(B = t | A = t) = 0.7, P(B = f | A = t) = 0.3
           P(B = t | A = f) = 0.4, P(B = f | A = f) = 0.6

  • Generate random numbers r1, r2 uniformly from [0, 1].
  • Set A = t if r1 ≤ 0.2, and A = f otherwise.
  • Depending on the value of A and r2, set B to t or f.

Generating one random instantiation is linear in the size of the network.
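As a concrete illustration, here is a minimal Python sketch of forward sampling for this two-node network (the function name and dictionary encoding are illustrative, not from the slides):

```python
import random

# CPTs for the network A -> B from the slide above.
P_A = {"t": 0.2, "f": 0.8}
P_B_given_A = {"t": {"t": 0.7, "f": 0.3},   # P(B | A = t)
               "f": {"t": 0.4, "f": 0.6}}   # P(B | A = f)

def sample_forward():
    """Draw one full instantiation (a, b), sampling in topological order."""
    r1, r2 = random.random(), random.random()
    a = "t" if r1 <= P_A["t"] else "f"
    b = "t" if r2 <= P_B_given_A[a]["t"] else "f"
    return a, b
```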

SLIDE 8

Approximate Inference

Sampling Algorithm. Thus, we have a randomized algorithm S that produces possible outputs from sp(V) according to the distribution P(V). Given samples s1, …, sN, define

P̂(A = a | E = e) := |{i ∈ {1, …, N} : E = e and A = a in s_i}| / |{i ∈ {1, …, N} : E = e in s_i}|
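Building on sample_forward() above, a hedged sketch of this estimator for P(A = t | B = t) (the sample count is illustrative):

```python
def estimate_forward(n_samples=100_000):
    """Estimate P(A = t | B = t): keep only samples consistent with B = t."""
    hits = kept = 0
    for _ in range(n_samples):
        a, b = sample_forward()
        if b == "t":             # sample agrees with the evidence
            kept += 1
            hits += (a == "t")
    return hits / kept if kept else float("nan")
```

For this network the exact posterior is P(A = t | B = t) = 0.2 · 0.7 / (0.2 · 0.7 + 0.8 · 0.4) = 0.14/0.46 ≈ 0.304, which the estimate approaches for large sample counts.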

SLIDE 9

Approximate Inference

Forward Sampling: Illustration

[Figure: the N samples are partitioned into those with E ≠ e (discarded) and those with E = e; among the latter, the fraction with A = a is the approximation of P(A = a | E = e).]

SLIDE 10

Approximate Inference

Sampling from the conditional distribution. Problem of forward sampling: samples with E ≠ e are useless! Idea: find a sampling algorithm Sc that produces outputs from sp(V) according to the distribution P(V | E = e).

A tempting approach: fix the variables in E to e and sample the non-evidence variables only!

Problem: only evidence from the ancestors is taken into account!

SLIDE 13

Approximate Inference

Likelihood weighting. We would like to sample from (where pa(X)′′ denotes the parents of X in E)

P(U, e) = ∏_{X ∈ U\E} P(X | pa(X)′, pa(X)′′ = e) × ∏_{X ∈ E} P(X = e | pa(X)′, pa(X)′′ = e),

but by applying forward sampling with fixed E we actually sample from:

Sampling distribution = ∏_{X ∈ U\E} P(X | pa(X)′, pa(X)′′ = e).

Solution: instead of letting each sample count as 1, weight it by

w(x, e) = ∏_{X ∈ E} P(X = e | pa(X)′, pa(X)′′ = e).

SLIDE 15

Approximate Inference

Likelihood weighting: example. Same network A → B as before, with P(A = t) = 0.2, P(B = t | A = t) = 0.7 and P(B = t | A = f) = 0.4.

  • Assume evidence B = t.
  • Generate a random number r uniformly from [0, 1].
  • Set A = t if r ≤ 0.2, and A = f otherwise.
  • If A = t, let the sample count as w(t, t) = 0.7; otherwise as w(f, t) = 0.4.

With N samples (a1, …, aN) we get

P̂(A = t | B = t) = Σ_{i : a_i = t} w(a_i, t) / Σ_{i=1}^{N} w(a_i, t).
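A minimal Python sketch of this weighted estimator, reusing the CPT numbers above (names and sample count are illustrative):

```python
import random

P_A_t = 0.2                           # P(A = t)
P_Bt_given_A = {"t": 0.7, "f": 0.4}   # P(B = t | A)

def estimate_likelihood_weighting(n_samples=100_000):
    """Estimate P(A = t | B = t): sample A, weight by P(B = t | A)."""
    num = den = 0.0
    for _ in range(n_samples):
        a = "t" if random.random() <= P_A_t else "f"
        w = P_Bt_given_A[a]           # weight w(a, t)
        den += w
        if a == "t":
            num += w
    return num / den
```

Unlike forward sampling, no sample is discarded; every draw contributes its weight to the estimate of 0.14/0.46 ≈ 0.304.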

SLIDE 17

Approximate Inference

Gibbs Sampling. For notational convenience assume from now on that for some l: E = V_{l+1}, V_{l+2}, …, V_n. Write W for V_1, …, V_l. Principle: obtain a new sample from the previous sample by randomly changing the value of only one selected variable.

Procedure Gibbs sampling:
  v_0 = (v_{0,1}, …, v_{0,l}) := arbitrary instantiation of W
  i := 1
  repeat forever:
    choose V_k ∈ W   # deterministic or randomized
    generate v_{i,k} randomly according to the distribution
      P(V_k | V_1 = v_{i−1,1}, …, V_{k−1} = v_{i−1,k−1}, V_{k+1} = v_{i−1,k+1}, …, V_l = v_{i−1,l}, E = e)
    set v_i := (v_{i−1,1}, …, v_{i−1,k−1}, v_{i,k}, v_{i−1,k+1}, …, v_{i−1,l})
    i := i + 1
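A compact Python sketch of this loop, with the network-specific conditional distribution abstracted behind a callback (all names are illustrative):

```python
import random

def gibbs_sample(init, conditional, n_steps):
    """Generic Gibbs sampler over the non-evidence variables W.

    init:        dict mapping each variable in W to a starting value
    conditional: conditional(var, state) -> {value: P(var = value | rest, e)}
    Yields one full instantiation of W per step.
    """
    state = dict(init)
    variables = list(state)
    for _ in range(n_steps):
        var = random.choice(variables)    # randomized choice of V_k
        dist = conditional(var, state)    # P(V_k | all other variables, E = e)
        r, acc = random.random(), 0.0
        for value, p in dist.items():     # draw from dist by inverse CDF
            acc += p
            if r <= acc:
                break
        state[var] = value                # falls back to the last value
        yield dict(state)
```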

SLIDE 18

Approximate Inference

Illustration. The process of Gibbs sampling can be understood as a random walk in the space of all instantiations with E = e. Reachable in one step: instantiations that differ from the current one in the value assignment of at most one variable (assuming randomized choice of the variable V_k).

SLIDE 19

Approximate Inference

Implementation of the Sampling Step. The sampling step, generating v_{i,k} randomly according to the distribution P(V_k | V_1 = v_{i−1,1}, …, V_{k−1} = v_{i−1,k−1}, V_{k+1} = v_{i−1,k+1}, …, V_l = v_{i−1,l}, E = e), requires sampling from a conditional distribution. In this special case (all variables but one are instantiated) this is easy: just compute, for each v ∈ sp(V_k), the probability P(V_1 = v_{i−1,1}, …, V_{k−1} = v_{i−1,k−1}, V_k = v, V_{k+1} = v_{i−1,k+1}, …, V_l = v_{i−1,l}, E = e) (linear in the network size), and choose v_{i,k} according to these probabilities (normalized). This can be simplified further by computing the distribution on sp(V_k) only in the Markov blanket of V_k, i.e. the subnetwork consisting of V_k, its parents, its children, and the parents of its children.
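As a sketch of the Markov-blanket shortcut, the conditional callback used by gibbs_sample above can be computed from local CPTs only. The values, cpt, and children encodings below are assumptions for illustration, not from the slides:

```python
def markov_blanket_conditional(var, state, values, cpt, children):
    """P(var | everything else), computed in var's Markov blanket.

    values:    list of possible values of var, i.e. sp(var)
    cpt[X]:    function (full dict of variable values) -> P(X = its value | pa(X))
    children:  dict mapping each variable to the list of its children
    The result is proportional to P(var | pa(var)) * prod_C P(C | pa(C))
    over the children C of var.
    """
    weights = {}
    for value in values:
        s = dict(state, **{var: value})
        w = cpt[var](s)                    # P(var = value | pa(var))
        for child in children[var]:
            w *= cpt[child](s)             # P(child | pa(child)); pa includes var
        weights[value] = w
    z = sum(weights.values())              # normalize over sp(var)
    return {v: w / z for v, w in weights.items()}
```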

SLIDE 20

Approximate Inference

Convergence of Gibbs Sampling. Under certain conditions, the distribution of samples converges to the posterior distribution P(W | E = e):

lim_{i→∞} P(v_i = v) = P(W = v | E = e)   for all v ∈ sp(W).

Sufficient conditions are:
  • in the repeat loop of the Gibbs sampler, the variable V_k is randomly selected (with non-zero selection probability for all V_k ∈ W), and
  • the Bayesian network has no zero entries in its CPTs.

SLIDE 21

Approximate Inference

Approximate Inference using Gibbs Sampling

  1. Start Gibbs sampling with some starting configuration v_0.
  2. Run the sampler for N steps ("burn-in").
  3. Run the sampler for M additional steps; use the relative frequency of state v in these M samples as an estimate of P(W = v | E = e) (see the sketch below).

Problems: How large must N be chosen? It is difficult to say how long the Gibbs sampler takes to converge! Even when sampling is from the stationary distribution, the samples are not independent. Result: the error cannot be bounded as a function of M using Chebyshev's inequality (or related methods).
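A usage sketch of this burn-in scheme on top of the gibbs_sample generator from the earlier slide (the N and M defaults are illustrative):

```python
from collections import Counter
from itertools import islice

def estimate_gibbs(init, conditional, burn_in=1_000, n_keep=10_000):
    """Burn in for N steps, then estimate P(W = v | E = e) by frequency."""
    samples = gibbs_sample(init, conditional, burn_in + n_keep)
    kept = islice(samples, burn_in, None)       # discard the burn-in prefix
    counts = Counter(tuple(sorted(s.items())) for s in kept)
    return {v: c / n_keep for v, c in counts.items()}
```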

SLIDE 22

Approximate Inference

Effect of dependence. P(v_N = v) being close to P(W = v | E = e) means: the probability that v_N lies in the red region is close to P(A = a | E = e). This does not guarantee that the fraction of the samples v_N, v_{N+1}, …, v_{N+M} that lie in the red region yields a good approximation of P(A = a | E = e)!

[Figure: a single random walk from v_0 through v_N to v_{N+M}; consecutive samples cluster together instead of covering the region.]

SLIDE 23

Approximate Inference

Multiple starting points. In practice, one tries to counteract these difficulties by restarting the Gibbs sampler several times (often with different starting points).

[Figure: three random walks v_0 → v_N → v_{N+M} from different starting points.]

SLIDE 24

Approximate Inference

Metropolis-Hastings Algorithm. Another way of constructing a random walk on sp(W): let {q(v, v′) | v, v′ ∈ sp(W)} be a set of transition probabilities over sp(W), i.e. q(v, ·) is a probability distribution for each v ∈ sp(W). The q(v, v′) are called proposal probabilities. Define

α(v, v′) := min{1, [P(W = v′ | E = e) q(v′, v)] / [P(W = v | E = e) q(v, v′)]}
          = min{1, [P(W = v′, E = e) q(v′, v)] / [P(W = v, E = e) q(v, v′)]}

α(v, v′) is called the acceptance probability for the transition from v to v′. (The two expressions agree because the normalization constant P(E = e) cancels in the quotient.)

SLIDE 25

Approximate Inference

Procedure Metropolis-Hastings sampling:
  v_0 = (v_{0,1}, …, v_{0,l}) := arbitrary instantiation of W
  i := 1
  repeat forever:
    sample v′ according to the distribution q(v_{i−1}, ·)
    set accept to true with probability α(v_{i−1}, v′)
    if accept: v_i := v′ else: v_i := v_{i−1}
    i := i + 1
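A generic Python sketch of this loop. The target is passed unnormalized (e.g. as P(W = v, E = e)), since α only needs quotients; all names are illustrative:

```python
import random

def metropolis_hastings(init, propose, q, target, n_steps):
    """Generic Metropolis-Hastings sampler.

    propose(v) -> v_new   draws a proposal from q(v, .)
    q(v, v2) -> float     proposal probability q(v, v2)
    target(v) -> float    unnormalized target, e.g. P(W = v, E = e)
    """
    v = init
    for _ in range(n_steps):
        v_new = propose(v)
        # Acceptance probability alpha(v, v') from the previous slide.
        ratio = (target(v_new) * q(v_new, v)) / (target(v) * q(v, v_new))
        if random.random() < min(1.0, ratio):
            v = v_new                     # accept; otherwise keep v
        yield v
```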

SLIDE 26

Approximate Inference

Convergence of Metropolis-Hastings Sampling. Under certain conditions, the distribution of samples converges to the posterior distribution P(W | E = e). A sufficient condition is: q(v, v′) > 0 for all v, v′. To obtain good performance, q should be chosen so as to obtain high acceptance probabilities, i.e. the quotients

[P(W = v′ | E = e) q(v′, v)] / [P(W = v | E = e) q(v, v′)]

should be close to 1. Optimal (but usually not feasible): q(v, v′) = P(W = v′ | E = e). In general: try to approximate the target distribution P(W | E = e) with q.

SLIDE 27

Loopy belief propagation

Loopy belief propagation:
  • a message-passing algorithm like junction tree propagation,
  • but it works directly on the Bayesian network structure (rather than on a junction tree).

SLIDE 28

Loopy belief propagation

Message passing. A node sends a message to a neighbor by multiplying the incoming messages from all other neighbors onto the potential it holds, then marginalizing the result down to the separator.

[Figure: node C holds P(C | A, B); it receives messages φ_A, φ_B from its parents A and B and messages φ_D, φ_E from its children D and E, and sends messages such as π_E(C) to child E and λ_C(A) to parent A.]

π_E(C) = φ_D Σ_{A,B} P(C | A, B) φ_A φ_B
λ_C(A) = Σ_{B,C} P(C | A, B) φ_B φ_D φ_E
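A small Python sketch of these two message computations for binary variables; the CPT entries and incoming message vectors are made-up illustrations, not from the slides:

```python
# Messages are length-2 vectors indexed by value (0 = t, 1 = f).
phi_A, phi_B = [0.2, 0.8], [0.5, 0.5]    # messages from parents A, B
phi_D, phi_E = [0.9, 0.3], [0.6, 0.7]    # messages from children D, E
# P(C | A, B) encoded as P_C[a][b][c]; entries are illustrative.
P_C = [[[0.7, 0.3], [0.4, 0.6]],
       [[0.5, 0.5], [0.1, 0.9]]]

# pi_E(C) = phi_D(C) * sum_{A,B} P(C | A, B) phi_A(A) phi_B(B)
pi_E = [phi_D[c] * sum(P_C[a][b][c] * phi_A[a] * phi_B[b]
                       for a in (0, 1) for b in (0, 1))
        for c in (0, 1)]

# lambda_C(A) = sum_{B,C} P(C | A, B) phi_B(B) phi_D(C) phi_E(C)
lambda_C = [sum(P_C[a][b][c] * phi_B[b] * phi_D[c] * phi_E[c]
                for b in (0, 1) for c in (0, 1))
            for a in (0, 1)]
```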

SLIDE 35

Loopy belief propagation

Example: error sources. A few observations: when calculating P(E), C and D are treated as being independent (in the junction tree, C and D would appear in the same separator). Evidence on a converging connection may cause the error to cycle.

SLIDE 36

Loopy belief propagation

In general, there is no guarantee of convergence, nor, in case of convergence, that the method converges to the correct distribution. However, the method converges to the correct distribution surprisingly often! If the network is singly connected, convergence is guaranteed.

SLIDE 37

Approximate Inference

Literature

R.M. Neal: Probabilistic Inference Using Markov Chain Monte Carlo Methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. http://omega.albany.edu:8008/neal.pdf

P. Dagum, M. Luby: Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60, 1993.
