slide-1
SLIDE 1

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling
slide-2
SLIDE 2

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Cutset-based Variance Reduction
  • 6. AND/OR importance sampling
slide-3
SLIDE 3

Probabilistic Reasoning; Graphical models

  • Graphical models:

– Bayesian network, constraint networks, mixed network

  • Queries
  • Exact algorithm

– using inference, – search and hybrids

  • Graph parameters:

– tree-width, cycle-cutset, w-cutset

slide-4
SLIDE 4

Queries

  • Probability of evidence (or partition function):

$P(e) = \sum_{X \setminus E} \prod_{i=1}^{n} P(x_i \mid pa_i)\Big|_{E=e}, \qquad Z = \sum_{X} \prod_{i} C_i$

  • Posterior marginal (beliefs):

$P(x_i \mid e) = \frac{P(x_i, e)}{P(e)} = \frac{\sum_{X \setminus E,\; X_i = x_i} \prod_{j=1}^{n} P(x_j \mid pa_j)\big|_{E=e}}{\sum_{X \setminus E} \prod_{j=1}^{n} P(x_j \mid pa_j)\big|_{E=e}}$

  • Most Probable Explanation:

$x^* = \arg\max_{x} P(x, e)$
slide-5
SLIDE 5

Approximation

  • Since inference, search, and hybrids are too expensive when the graph is dense (high treewidth), we approximate:

  • Bounding inference:
  • mini-bucket and mini-clustering
  • Belief propagation
  • Bounding search:
  • Sampling
  • Goal: an anytime scheme
slide-6
SLIDE 6

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling
slide-7
SLIDE 7

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling

techniques

7

slide-8
SLIDE 8

A sample

  • Given a set of variables X={X1,...,Xn}, a sample,

denoted by St is an instantiation of all variables:

$S^t = (x_1^t, x_2^t, \ldots, x_n^t)$

slide-9
SLIDE 9

How to draw a sample ? Univariate distribution

  • Example: Given random variable X having

domain {0, 1} and a distribution P(X) = (0.3, 0.7).

  • Task: Generate samples of X from P.
  • How?

– draw a random number r ∈ [0, 1]
– If r < 0.3, set X = 0
– Else set X = 1
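The slide's recipe is inverse-CDF sampling. A minimal sketch in Python (the function name and list-of-pairs representation are ours, not from the slides):

```python
import random

def sample_categorical(dist, r=None):
    """Draw one sample from a categorical distribution given as a list of
    (value, probability) pairs, by inverting the cumulative distribution."""
    if r is None:
        r = random.random()          # r uniform in [0, 1)
    cum = 0.0
    for value, p in dist:
        cum += p
        if r < cum:                  # first value whose CDF exceeds r
            return value
    return dist[-1][0]               # guard against floating-point rounding

# The slide's example: P(X) = (0.3, 0.7) over domain {0, 1}
P = [(0, 0.3), (1, 0.7)]
x_low = sample_categorical(P, r=0.25)   # r < 0.3, so X = 0
x_high = sample_categorical(P, r=0.90)  # r >= 0.3, so X = 1
print(x_low, x_high)
```

Passing `r` explicitly makes the mapping from the random number to the outcome easy to check by hand; in normal use it is drawn internally.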

slide-10
SLIDE 10

How to draw a sample? Multi-variate distribution

  • Let X={X1,..,Xn} be a set of variables
  • Express the distribution in product form
  • Sample variables one by one from left to right,

along the ordering dictated by the product form.

  • Bayesian network literature: Logic sampling

$P(X) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$

slide-11
SLIDE 11

Sampling for Prob. Inference Outline

  • Logic Sampling
  • Importance Sampling

– Likelihood Sampling
– Choosing a Proposal Distribution

  • Markov Chain Monte Carlo (MCMC)

– Metropolis-Hastings
– Gibbs sampling

  • Variance Reduction
slide-12
SLIDE 12

Logic Sampling: No Evidence (Henrion 1988)

Input: Bayesian network X = {X1,…,XN}, N – #nodes, T – #samples
Output: T samples
Process nodes in topological order – first process the ancestors of a node, then the node itself:
1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
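The loop above can be sketched on a toy two-node network X1 → X2 (the CPT numbers below are made up for illustration; they are not from the slides):

```python
import random

# Toy network X1 -> X2 (illustrative CPTs)
P_X1 = {0: 0.6, 1: 0.4}                             # P(X1)
P_X2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(X2 | X1)

def draw(dist):
    """Sample one value from a {value: prob} dict by inverting the CDF."""
    r, cum = random.random(), 0.0
    for v, p in sorted(dist.items()):
        cum += p
        if r < cum:
            return v
    return v

def logic_sample():
    """One forward (logic) sample: process parents before children."""
    x1 = draw(P_X1)
    x2 = draw(P_X2[x1])     # condition on the parent's sampled value
    return x1, x2

random.seed(0)
samples = [logic_sample() for _ in range(5000)]
f_x1 = sum(1 for x1, _ in samples if x1 == 0) / len(samples)
f_x2 = sum(1 for _, x2 in samples if x2 == 0) / len(samples)
print(f_x1, f_x2)   # empirical marginals, near 0.6 and 0.62
```

With no evidence, the empirical marginals converge to the exact ones (here P(X2=0) = 0.6·0.9 + 0.4·0.2 = 0.62).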

slide-13
SLIDE 13

Logic sampling (example)

Network: X1 → X2, X1 → X3, {X2, X3} → X4, with CPTs P(X1), P(X2 | X1), P(X3 | X1), P(X4 | X2, X3).

$P(X_1, X_2, X_3, X_4) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1)\, P(X_4 \mid X_2, X_3)$

No evidence – generate sample k:
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. Sample x4 from P(x4 | x2, x3)

slide-14
SLIDE 14

Logic Sampling w/ Evidence

Input: Bayesian network X = {X1,…,XN}, N – #nodes, E – evidence, T – #samples
Output: T samples consistent with E

1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
4.     If Xi in E and the sampled xi^t differs from the evidence, reject sample:
5.       Goto Step 1
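The reject-and-restart loop can be sketched on the same toy X1 → X2 network used earlier (illustrative CPTs, not from the slides), with X2 observed:

```python
import random

P_X1 = {0: 0.6, 1: 0.4}
P_X2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}

def draw(dist):
    r, cum = random.random(), 0.0
    for v, p in sorted(dist.items()):
        cum += p
        if r < cum:
            return v
    return v

def rejection_sample(evidence_x2, max_tries=10000):
    """Logic sampling with evidence: any sample that contradicts
    the observed value of X2 is rejected (step 4 above)."""
    for _ in range(max_tries):
        x1 = draw(P_X1)
        x2 = draw(P_X2[x1])
        if x2 == evidence_x2:      # keep only evidence-consistent samples
            return x1, x2
    raise RuntimeError("evidence too unlikely; too many rejections")

random.seed(1)
samples = [rejection_sample(evidence_x2=1) for _ in range(5000)]
# Exact posterior: P(X1=1 | X2=1) = 0.4*0.8 / (0.6*0.1 + 0.4*0.8) ≈ 0.842
f_post = sum(1 for x1, _ in samples if x1 == 1) / len(samples)
print(f_post)
```

The accepted samples are exact draws from the posterior, but the acceptance rate equals P(e), which is why rejection becomes hopeless for unlikely evidence.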

slide-15
SLIDE 15

Logic Sampling (example)

Network: X1 → X2, X1 → X3, {X2, X3} → X4, with CPTs P(x1), P(x2 | x1), P(x3 | x1), P(x4 | x2, x3).

Evidence on X3 – generate sample k:
1. Sample x1 from P(x1)
2. Sample x2 from P(x2 | x1)
3. Sample x3 from P(x3 | x1)
4. If the sampled value of X3 contradicts the evidence, reject the sample and start from 1; otherwise
5. Sample x4 from P(x4 | x2, x3)

slide-16
SLIDE 16

Expected value and Variance

Expected value: Given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, …, Xn}, the expected value of g w.r.t. P is

$E_P[g(x)] = \sum_{x} g(x)\, P(x)$

Variance: The variance of g w.r.t. P is

$\mathrm{Var}_P[g(x)] = \sum_{x} \big(g(x) - E_P[g(x)]\big)^2\, P(x)$

slide-17
SLIDE 17

Monte Carlo Estimate

  • Estimator:

– An estimator is a function of the samples. – It produces an estimate of the unknown parameter of the sampling distribution.

Given i.i.d. samples $S^1, S^2, \ldots, S^T$ drawn from P, the Monte Carlo estimate of $E_P[g(x)]$ is given by:

$\hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(S^t)$

slide-18
SLIDE 18

Example: Monte Carlo estimate

  • Given:
– A distribution P(X) = (0.3, 0.7)
– g(X) = 40 if X = 0; 50 if X = 1
  • Exact value: EP[g(x)] = 40×0.3 + 50×0.7 = 47
  • Generate k = 10 samples from P: 0,1,1,1,0,1,1,0,1,0

$\hat{g} = \frac{40 \times \#(X{=}0) + 50 \times \#(X{=}1)}{\#\text{samples}} = \frac{40 \times 4 + 50 \times 6}{10} = 46$
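The arithmetic on this slide reproduces directly:

```python
def g(x):
    # g from the slide: 40 if X = 0, 50 if X = 1
    return 40 if x == 0 else 50

# The ten samples listed on the slide
samples = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
g_hat = sum(g(x) for x in samples) / len(samples)
print(g_hat)   # (40*4 + 50*6) / 10 = 46.0
```

With more samples the estimate would concentrate around the exact value 47.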

slide-19
SLIDE 19

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling

techniques

19

slide-20
SLIDE 20

Importance sampling: Main idea

  • Express query as the expected value of a

random variable w.r.t. to a distribution Q.

  • Generate random samples from Q.
  • Estimate the expected value from the

generated samples using a monte carlo estimator (average).

20

slide-21
SLIDE 21

Importance sampling for P(e)

Let Z = X \ E. Let Q(Z) be a (proposal) distribution satisfying $P(z, e) > 0 \Rightarrow Q(z) > 0$. Then we can rewrite P(e) as:

$P(e) = \sum_{z} P(z, e) = \sum_{z} \frac{P(z, e)}{Q(z)}\, Q(z) = E_Q\!\left[\frac{P(z, e)}{Q(z)}\right]$

Monte Carlo estimate:

$\widehat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} w(z^t), \quad \text{where } w(z) = \frac{P(z, e)}{Q(z)}$
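The estimator above can be sketched on the toy X1 → X2 network used earlier (illustrative CPTs and a deliberately crude uniform proposal, both our choices, not from the slides):

```python
import random

P_X1 = {0: 0.6, 1: 0.4}
P_X2_given = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
evidence = 1                    # observed X2 = 1
# Exact P(e) = 0.6*0.1 + 0.4*0.8 = 0.38

Q = {0: 0.5, 1: 0.5}            # proposal over Z = {X1}

def draw(dist):
    r, cum = random.random(), 0.0
    for v, p in sorted(dist.items()):
        cum += p
        if r < cum:
            return v
    return v

random.seed(2)
T = 20000
total = 0.0
for _ in range(T):
    z = draw(Q)                                      # sample z ~ Q
    total += P_X1[z] * P_X2_given[z][evidence] / Q[z]  # w(z) = P(z, e) / Q(z)
p_e_hat = total / T                                  # average of the weights
print(p_e_hat)   # near the exact value 0.38
```

Note the estimate is unbiased for any valid Q; the choice of Q only affects the variance of the weights.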

slide-22
SLIDE 22

Properties of IS estimate of P(e)

  • Convergence: by the law of large numbers,

$\widehat{P}(e) = \frac{1}{T} \sum_{i=1}^{T} w(z^i) \xrightarrow{a.s.} P(e) \quad \text{as } T \to \infty$

  • Unbiased: $E_Q[\widehat{P}(e)] = P(e)$
  • Variance:

$\mathrm{Var}_Q[\widehat{P}(e)] = \mathrm{Var}_Q\!\left[\frac{1}{T} \sum_{i=1}^{T} w(z^i)\right] = \frac{\mathrm{Var}_Q[w(z)]}{T}$

slide-23
SLIDE 23

Properties of IS estimate of P(e)

  • Mean Squared Error of the estimator:

$MSE_Q[\widehat{P}(e)] = E_Q\Big[\big(\widehat{P}(e) - P(e)\big)^2\Big] = \big(E_Q[\widehat{P}(e)] - P(e)\big)^2 + \mathrm{Var}_Q[\widehat{P}(e)] = \mathrm{Var}_Q[\widehat{P}(e)] = \frac{\mathrm{Var}_Q[w(z)]}{T}$

The squared-bias term is zero because the estimator is unbiased: $E_Q[\widehat{P}(e)] = P(e)$.

slide-24
SLIDE 24

Estimating P(Xi|e)

Let $\delta_{x_i}(z)$ be a Dirac delta function, which is 1 if z contains $X_i = x_i$ and 0 otherwise.

$P(x_i \mid e) = \frac{P(x_i, e)}{P(e)} = \frac{\sum_{z} \delta_{x_i}(z)\, P(z, e)}{\sum_{z} P(z, e)} = \frac{E_Q\!\big[\delta_{x_i}(z)\, \frac{P(z, e)}{Q(z)}\big]}{E_Q\!\big[\frac{P(z, e)}{Q(z)}\big]}$

Idea: estimate numerator and denominator by IS. Ratio estimate:

$\widehat{P}(x_i \mid e) = \frac{\widehat{P}(x_i, e)}{\widehat{P}(e)} = \frac{\sum_{k=1}^{T} \delta_{x_i}(z^k)\, w(z^k)}{\sum_{k=1}^{T} w(z^k)}$

The estimate is biased: $E[\widehat{P}(x_i \mid e)] \ne P(x_i \mid e)$.

slide-25
SLIDE 25

Properties of the IS estimator for P(Xi|e)

  • Convergence: By Weak law of large numbers
  • Asymptotically unbiased
  • Variance

– Harder to analyze – Liu suggests a measure called “Effective sample size”

$\widehat{P}(x_i \mid e) \to P(x_i \mid e) \quad \text{as } T \to \infty, \qquad \lim_{T \to \infty} E\big[\widehat{P}(x_i \mid e)\big] = P(x_i \mid e)$

slide-26
SLIDE 26

Generating samples from Q

  • No restrictions on “how to”
  • Typically, express Q in product form:

– Q(Z)=Q(Z1)xQ(Z2|Z1)x….xQ(Zn|Z1,..Zn-1)

  • Sample along the order Z1,..,Zn
  • Example:

– Z1Q(Z1)=(0.2,0.8) – Z2 Q(Z2|Z1)=(0.1,0.9,0.2,0.8) – Z3 Q(Z3|Z1,Z2)=Q(Z3)=(0.5,0.5)

slide-27
SLIDE 27

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling

techniques

27

slide-28
SLIDE 28

Likelihood Weighting

(Fung and Chang, 1990; Shachter and Peot, 1990)

28

"Clamping" evidence + logic sampling + weighting samples by the evidence likelihood. It is an instance of importance sampling!

Works well for likely evidence!

slide-29
SLIDE 29

Likelihood Weighting: Sampling

Sample in topological order over X! Clamp the evidence; sample xi ← P(Xi | pai); each P(Xi | pai) is a look-up in the CPT!

slide-30
SLIDE 30

Likelihood Weighting: Proposal Distribution

$Q(X \setminus E) = \prod_{X_i \in X \setminus E} P(X_i \mid pa_i)\Big|_{E=e}$

Example: Given a Bayesian network $P(X_1, X_2, X_3) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2)$ and evidence $X_2 = x_2$: $Q(X_1, X_3) = P(X_1)\, P(X_3 \mid X_1, x_2)$.

Weights: given a sample $x = (x_1, \ldots, x_n)$,

$w(x) = \frac{P(x, e)}{Q(x)} = \frac{\prod_{X_i \in X \setminus E} P(x_i \mid pa_i)\, \prod_{E_j \in E} P(e_j \mid pa_j)}{\prod_{X_i \in X \setminus E} P(x_i \mid pa_i)} = \prod_{E_j \in E} P(e_j \mid pa_j)$

Notice: Q is another Bayesian network

slide-31
SLIDE 31

Likelihood Weighting: Estimates

Estimate P(e):

$\widehat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$

Estimate posterior marginals:

$\widehat{P}(x_i \mid e) = \frac{\widehat{P}(x_i, e)}{\widehat{P}(e)} = \frac{\sum_{t=1}^{T} w^{(t)}\, g_{x_i}(x^{(t)})}{\sum_{t=1}^{T} w^{(t)}}$

where $g_{x_i}(x)$ is 1 if x contains $X_i = x_i$ and zero otherwise.
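Likelihood weighting specializes the estimates above: sample non-evidence variables from their CPTs and weight by the likelihood of the clamped evidence. A minimal sketch on the toy X1 → X2 network used earlier (illustrative CPTs, not from the slides):

```python
import random

P_X1 = {0: 0.6, 1: 0.4}
P_X2_given = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
evidence = 1                                 # clamp X2 = 1

def draw(dist):
    r, cum = random.random(), 0.0
    for v, p in sorted(dist.items()):
        cum += p
        if r < cum:
            return v
    return v

random.seed(3)
T = 20000
num = 0.0                                    # weights of samples with X1 = 1
den = 0.0                                    # all weights
for _ in range(T):
    x1 = draw(P_X1)                          # sample non-evidence variable from its CPT
    w = P_X2_given[x1][evidence]             # weight = likelihood of clamped evidence
    den += w
    if x1 == 1:
        num += w
p_e_hat = den / T                            # estimate of P(e), exact value 0.38
posterior = num / den                        # ratio estimate of P(X1=1 | e), ~0.842
print(p_e_hat, posterior)
```

Unlike rejection, no sample is discarded; unlikely evidence shows up as small weights rather than wasted draws.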

slide-32
SLIDE 32

Likelihood Weighting

  • Converges to exact posterior marginals
  • Generates Samples Fast
  • Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)
  • Increasing sampling variance:
– Convergence may be slow
– Many samples with P(x(t)) = 0 are rejected

32

slide-33
SLIDE 33

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • Error estimation
  • State-of-the-art importance sampling

techniques

33

slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36
slide-37
SLIDE 37


slide-38
SLIDE 38

Outline

  • Definitions and Background on Statistics
  • Theory of importance sampling
  • Likelihood weighting
  • State-of-the-art importance sampling

techniques

38

slide-39
SLIDE 39

Proposal selection

  • One should try to select a proposal that is as close as possible to the posterior distribution.

$\mathrm{Var}_Q[\widehat{P}(e)] = \frac{1}{T}\, \mathrm{Var}_Q[w(z)] = \frac{1}{T} \left( \sum_{z} \frac{P(z, e)^2}{Q(z)} - P(e)^2 \right)$

For the estimator to have zero variance: $Q(z) = \frac{P(z, e)}{P(e)} = P(z \mid e)$

slide-40
SLIDE 40

Perfect sampling using Bucket Elimination

  • Algorithm:

– Run Bucket Elimination on the problem along an ordering o = (XN,…,X1).
– Sample along the reverse ordering: (X1,…,XN).
– At each variable Xi, recover the probability P(Xi | x1,…,xi−1) by referring to its bucket.

slide-41
SLIDE 41

Bucket Elimination

Query: $P(a \mid e) = \alpha\, P(a, e)$. Elimination order: d, e, b, c.

$P(a, e) = \sum_{b, c, d} P(a)\, P(b \mid a)\, P(c \mid a)\, P(d \mid a, b)\, P(e \mid b, c)$

Buckets and messages:
D: P(d | a, b) → $f_D(a, b) = \sum_d P(d \mid a, b)$
E: P(e | b, c) → $f_E(b, c) = P(e \mid b, c)$ (evidence e is fixed)
B: P(b | a), $f_D(a, b)$, $f_E(b, c)$ → $f_B(a, c) = \sum_b P(b \mid a)\, f_D(a, b)\, f_E(b, c)$
C: P(c | a), $f_B(a, c)$ → $f_C(a) = \sum_c P(c \mid a)\, f_B(a, c)$
A: P(a), $f_C(a)$ → $P(a, e) = P(a)\, f_C(a)$

[Bucket tree over D, E, B, C, A: original functions plus messages.] Time and space exp(w*).

slide-42
SLIDE 42

Bucket elimination (BE)

Algorithm elim-bel (Dechter 1996). Elimination operator: $\sum_b$.

bucket B: P(B | A), P(D | B, A), P(e | B, C) → $h^B(A, D, C, e)$
bucket C: P(C | A), $h^B(A, D, C, e)$ → $h^C(A, D, e)$
bucket D: $h^C(A, D, e)$ → $h^D(A, e)$
bucket E: $h^D(A, e)$ → $h^E(A)$
bucket A: P(A), $h^E(A)$ → P(e)

slide-43
SLIDE 43

Sampling from the output of BE

(Dechter 2002) Buckets as above, with messages $h^B(A, D, C, e)$, $h^C(A, D, e)$, $h^D(A, e)$, $h^E(A)$.

Sample along the reverse order:
Sample A = a from $Q(A) \propto P(A)\, h^E(A)$
Evidence bucket: ignore
Set A = a in the buckets; sample D = d from $Q(D \mid a, e) \propto h^C(a, D, e)$
Set A = a, D = d; sample C = c from $Q(C \mid a, d, e) \propto P(C \mid a)\, h^B(a, d, C, e)$
Set A = a, D = d, C = c; sample B = b from $Q(B \mid a, d, c, e) \propto P(B \mid a)\, P(d \mid B, a)\, P(e \mid B, c)$

slide-44
SLIDE 44

Mini-buckets: “local inference”

  • Computation in a bucket is time and space

exponential in the number of variables involved

  • Therefore, partition functions in a bucket into

“mini-buckets” on smaller number of variables

  • Can control the size of each “mini-bucket”,

yielding polynomial complexity.


slide-45
SLIDE 45

Mini-Bucket Elimination

Space and time constraints: the maximum scope size of a newly generated function is bounded by 2. BE would generate a function of scope size 3 in bucket B, so it cannot be used; bucket B is partitioned into mini-buckets:

bucket B: mini-bucket {P(B | A), P(D | B, A)} → $\sum_B$ gives $h^B(A, D)$; mini-bucket {P(e | B, C)} → $\sum_B$ gives $h^B(C, e)$
bucket C: P(C | A), $h^B(C, e)$ → $h^C(A, e)$
bucket D: $h^B(A, D)$ → $h^D(A)$
bucket E: $h^C(A, e)$ → $h^E(A)$
bucket A: P(A), $h^D(A)$, $h^E(A)$ → approximation of P(e)

slide-46
SLIDE 46

Sampling from the output of MBE

Buckets and mini-bucket messages as in MBE above: P(B | A), P(D | B, A), P(e | B, C), P(C | A), $h^B(A, D)$, $h^B(C, e)$, $h^C(A, e)$, $h^D(A)$, $h^E(A)$.

Sampling is the same as in BE-sampling, except that now we construct Q from a randomly selected "mini-bucket".

slide-47
SLIDE 47

IJGP-Sampling (Gogate and Dechter, 2005)

  • Iterative Join Graph Propagation (IJGP)

– A Generalized Belief Propagation scheme (Yedidia et al., 2002)

  • IJGP yields better approximations of P(X|E)

than MBE

– (Dechter, Kask and Mateescu, 2002)

  • Output of IJGP is same as mini-bucket

“clusters”

  • Currently the best performing IS scheme!
slide-48
SLIDE 48

Current Research question

  • Given a Bayesian network with evidence or a

Markov network representing function P, generate another Bayesian network representing a function Q (from a family of distributions, restricted by structure) such that Q is closest to P.

  • Current approaches

– Mini-buckets
– IJGP
– Both

  • Evaluated experimentally, but they need to be justified theoretically.

slide-49
SLIDE 49

Algorithm: Approximate Sampling

1) Run IJGP or MBE 2) At each branch point compute the edge probabilities by consulting output of IJGP or MBE

  • Rejection Problem:

– Some assignments generated are non solutions

slide-50
SLIDE 50

Adaptive Importance Sampling

Initial proposal: $Q^1(Z) = Q(Z_1)\, Q(Z_2 \mid pa(Z_2)) \cdots Q(Z_n \mid pa(Z_n))$
For i = 1 to k do:
  Generate samples $z^1, \ldots, z^N$ from $Q^i$
  $\widehat{P}^i(E = e) = \frac{1}{N} \sum_{j=1}^{N} w(z^j)$
  Update: $Q^{i+1}$ from $Q^i$ and the samples
End For
Return $\widehat{P}(E = e) = \frac{1}{k} \sum_{i=1}^{k} \widehat{P}^i(E = e)$

slide-51
SLIDE 51

Adaptive Importance Sampling

  • General case
  • Given k proposal distributions
  • Take N samples out of each distribution
  • Approximate P(e):

$\widehat{P}(e) = \frac{1}{k} \sum_{j=1}^{k} \overline{w}_j, \qquad \overline{w}_j = \text{average weight under the } j\text{th proposal}$

slide-52
SLIDE 52

Estimating Q'(z)

$Q'(Z) = Q'(Z_1)\, Q'(Z_2 \mid pa(Z_2)) \cdots Q'(Z_n \mid pa(Z_n))$

where each $Q'(Z_i \mid pa(Z_i))$ is estimated by importance sampling.

slide-53
SLIDE 53

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling
slide-54
SLIDE 54

Markov Chain

  • A Markov chain is a discrete random process with

the property that the next state depends only on the current state (Markov Property):

x1 → x2 → x3 → x4

$P(x^t \mid x^{t-1}, \ldots, x^2, x^1) = P(x^t \mid x^{t-1})$

  • If $P(X^t \mid x^{t-1})$ does not depend on t (time-homogeneous) and the state space is finite, then it is often expressed as a transition function (aka transition matrix):

$A(x, x') = P(X^t = x' \mid X^{t-1} = x)$

slide-55
SLIDE 55

Example: Drunkard’s Walk

  • a random walk on the number line where, at

each step, the position may change by +1 or −1 with equal probability

$D(X) = \{\ldots, -2, -1, 0, 1, 2, \ldots\}, \qquad P(n \to n+1) = P(n \to n-1) = 0.5$

[Figure: states …, 1, 2, 3, … on the number line, with the transition matrix P(X).]
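The walk can be simulated directly; a minimal sketch (function name and seed are our choices):

```python
import random

def drunkards_walk(steps, start=0, seed=0):
    """Simulate the random walk: at each step the position
    changes by +1 or -1 with equal probability."""
    rng = random.Random(seed)
    pos = start
    path = [pos]
    for _ in range(steps):
        pos += 1 if rng.random() < 0.5 else -1
        path.append(pos)
    return path

path = drunkards_walk(10)
print(path)
# Consecutive positions always differ by exactly 1: the next state
# depends only on the current position (the Markov property).
```

Running it with different seeds gives different trajectories, but every trajectory satisfies the one-step property checked in the comment.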

slide-56
SLIDE 56

Example: Weather Model

$D(X) = \{rainy, sunny\}$

Transition matrix P(X) (rows: current state; columns: next state):

$P = \begin{pmatrix} P(rainy \to rainy) & P(rainy \to sunny) \\ P(sunny \to rainy) & P(sunny \to sunny) \end{pmatrix} = \begin{pmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{pmatrix}$

[Figure: sample trajectory rain, rain, rain, rain, sun.]

slide-57
SLIDE 57

Multi-Variable System

  • A state is an assignment of values to all the variables:

$X = \{X_1, X_2, X_3\}, \quad D(X_i) \text{ finite, discrete}, \qquad x^t = \{x_1^t, x_2^t, \ldots, x_n^t\}$

[Figure: state $(x_1^t, x_2^t, x_3^t)$ transitions to $(x_1^{t+1}, x_2^{t+1}, x_3^{t+1})$.]

slide-58
SLIDE 58

Bayesian Network System

  • Bayesian Network is a representation of the

joint probability distribution over 2 or more variables

$X = \{X_1, X_2, X_3\}, \qquad x^t = \{x_1^t, x_2^t, x_3^t\}$

[Figure: network over X1, X2, X3; state $(x_1^t, x_2^t, x_3^t)$ transitions to $(x_1^{t+1}, x_2^{t+1}, x_3^{t+1})$.]

slide-59
SLIDE 59

59

Stationary Distribution Existence

  • If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka "fixed point") if its entries sum up to 1 and satisfy:

$\pi(x_j) = \sum_{x_i \in D(X)} \pi(x_i)\, P(x_j \mid x_i)$

  • A finite state space Markov chain has a unique stationary distribution if and only if:

– The chain is irreducible
– All of its states are positive recurrent
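The fixed-point condition can be checked numerically. A sketch using the weather chain from the earlier example, assuming the reading of its matrix as 0.9/0.1 from rainy and 0.5/0.5 from sunny (the helper name is ours):

```python
def step(pi, P):
    """One application of pi' = pi P for a finite chain stored as dicts."""
    states = list(pi)
    return {s2: sum(pi[s1] * P[s1][s2] for s1 in states) for s2 in states}

# Weather chain: rows are the current state, columns the next state.
P = {"rainy": {"rainy": 0.9, "sunny": 0.1},
     "sunny": {"rainy": 0.5, "sunny": 0.5}}

# Iterate pi <- pi P from an arbitrary start until it stops changing.
pi = {"rainy": 0.0, "sunny": 1.0}
for _ in range(100):
    pi = step(pi, P)
print(pi)          # converges to the stationary distribution (5/6, 1/6)
pi_next = step(pi, P)   # applying P once more leaves pi unchanged
```

The limit is independent of the starting vector, illustrating the convergence result on the next slides.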

slide-60
SLIDE 60

60

Irreducible

  • A state x is irreducible if, under the transition rule, one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps
  • If one state is irreducible, then all the states must be irreducible

(Liu, Ch. 12, pp. 249, Def. 12.1.1)

slide-61
SLIDE 61

61

Recurrent

  • A state x is recurrent if the chain returns to x

with probability 1

  • Let M(x) be the expected number of steps to return to state x
  • State x is positive recurrent if M(x) is finite

The recurrent states in a finite state chain are positive recurrent .

slide-62
SLIDE 62

Stationary Distribution Convergence

  • Consider an infinite Markov chain: $P^{(n)} = \big(P^{(n)}(x_j \mid x_i)\big) = P^n$
  • If the chain is both irreducible and aperiodic, then:

$\lim_{n \to \infty} P^{(n)}(x \mid x_0) = \pi(x)$

  • The initial state is not important in the limit: "The most useful feature of a 'good' Markov chain is its fast forgetfulness of its past…" (Liu, Ch. 12.1)

slide-63
SLIDE 63

Aperiodic

  • Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}, where g.c.d. is the greatest common divisor of the integers in the set
  • If d(i) = 1 for all i, then the chain is aperiodic
  • Positive recurrent, aperiodic states are ergodic

63

slide-64
SLIDE 64

Markov Chain Monte Carlo

  • How do we estimate P(X), e.g., P(X | e)?
  • Generate samples that form a Markov chain with stationary distribution π = P(X | e)
  • Estimate π from samples (observed states): visited states x0,…,xn can be viewed as "samples" from distribution π:

$\hat{\pi}(x) = \frac{1}{T} \sum_{t=1}^{T} \delta(x, x^t), \qquad \lim_{T \to \infty} \hat{\pi}(x) = \pi(x)$

slide-65
SLIDE 65

MCMC Summary

  • Convergence is guaranteed in the limit
  • Samples are dependent, not i.i.d.
  • Convergence (mixing rate) may be slow
  • The stronger correlation between states, the

slower convergence!

  • Initial state is not important, but… typically,

we throw away first K samples - “burn-in”

65

slide-66
SLIDE 66

Gibbs Sampling (Geman&Geman,1984)

  • Gibbs sampler is an algorithm to generate a

sequence of samples from the joint probability distribution of two or more random variables

  • Sample new variable value one variable at a

time from the variable’s conditional distribution:

  • Samples form a Markov chain with stationary

distribution P(X|e)

$P(X_i) = P(X_i \mid x_1^t, \ldots, x_{i-1}^t, x_{i+1}^t, \ldots, x_n^t) = P(X_i \mid x^t \setminus x_i)$

slide-67
SLIDE 67

Gibbs Sampling: Illustration

The process of Gibbs sampling can be understood as a random walk in the space of all instantiations of X=x (remember drunkard’s walk): In one step we can reach instantiations that differ from current one by value assignment to at most one variable (assume randomized choice of variables Xi).

slide-68
SLIDE 68

Ordered Gibbs Sampler

Generate sample xt+1 from xt : In short, for i=1 to N:

$x_1^{t+1} \sim P(X_1 \mid x_2^t, x_3^t, \ldots, x_N^t, e)$
$x_2^{t+1} \sim P(X_2 \mid x_1^{t+1}, x_3^t, \ldots, x_N^t, e)$
$\vdots$
$x_N^{t+1} \sim P(X_N \mid x_1^{t+1}, x_2^{t+1}, \ldots, x_{N-1}^{t+1}, e)$

In short, for i = 1 to N: $x_i^{t+1}$ is sampled from $P(X_i \mid x^t \setminus x_i, e)$, always conditioning on the newest value of each variable.

Process All Variables In Some Order

slide-69
SLIDE 69

Transition Probabilities in BN

Markov blanket: $markov_i = pa_i \cup ch_i \cup \Big(\bigcup_{X_j \in ch_i} pa_j\Big)$

$P(x_i \mid x^t \setminus x_i) = P(x_i \mid markov_i^t)$, where:

$P(x_i \mid x \setminus x_i) = \alpha\, P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$

Given its Markov blanket (parents, children, and their parents), Xi is independent of all other nodes.

Computation is linear in the size of the Markov blanket!

slide-70
SLIDE 70

Ordered Gibbs Sampling Algorithm (Pearl,1988)

Input: X, E = e. Output: T samples {x^t}. Fix evidence E = e, initialize x^0 at random.

1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     $x_i^{t+1}$ ← sampled from $P(X_i \mid markov_i^t)$
4.   End For
5. End For
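The sweep above can be sketched on a toy joint over two binary variables (the joint table is made up for illustration; for a Bayesian network the conditional would instead be computed from the Markov blanket):

```python
import random

# Illustrative joint distribution over two binary variables (all entries > 0,
# so the chain is irreducible and Gibbs converges).
joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.15, (1, 1): 0.45}

def conditional(joint, i, other_value):
    """P(X_i | X_other = other_value), computed from the joint table."""
    probs = {}
    for (a, b), p in joint.items():
        xi, xo = (a, b) if i == 0 else (b, a)
        if xo == other_value:
            probs[xi] = probs.get(xi, 0.0) + p
    z = sum(probs.values())
    return {v: p / z for v, p in probs.items()}

def gibbs(joint, T, seed=4):
    rng = random.Random(seed)
    x = [0, 0]                                   # arbitrary initial state
    chain = []
    for _ in range(T):
        for i in (0, 1):                         # ordered Gibbs: sweep variables
            dist = conditional(joint, i, x[1 - i])
            r, cum = rng.random(), 0.0
            for v in sorted(dist):
                cum += dist[v]
                if r < cum:
                    x[i] = v
                    break
        chain.append(tuple(x))
    return chain

chain = gibbs(joint, 20000)
# After burn-in, state frequencies approach the stationary (target) distribution.
freq_11 = sum(1 for s in chain[1000:] if s == (1, 1)) / len(chain[1000:])
print(freq_11)   # should be near joint[(1, 1)] = 0.45
```

Note the samples are dependent (each state differs from the previous one in at most the variables just resampled), so more samples are needed than with i.i.d. draws for the same accuracy.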

slide-71
SLIDE 71

Gibbs Sampling Example - BN

$X = \{X_1, \ldots, X_9\}, \qquad E = \{X_9\}$

[Figure: 9-node network; current sample X1 = x1, …, X8 = x8.]

slide-72
SLIDE 72

Gibbs Sampling Example - BN

$X = \{X_1, \ldots, X_9\}, \qquad E = \{X_9\}$

$x_1^{t+1} \sim P(X_1 \mid x_2^t, \ldots, x_8^t, x_9)$
$x_2^{t+1} \sim P(X_2 \mid x_1^{t+1}, x_3^t, \ldots, x_8^t, x_9)$

slide-73
SLIDE 73

Answering Queries P(xi |e) = ?

  • Method 1: count # of samples where Xi = xi (histogram estimator):

$\widehat{P}(X_i = x_i) = \frac{1}{T} \sum_{t=1}^{T} \delta(x_i, x_i^t) \qquad (\delta \text{ is the Dirac delta function})$

  • Method 2: average probability (mixture estimator):

$\widehat{P}(X_i = x_i) = \frac{1}{T} \sum_{t=1}^{T} P(X_i = x_i \mid markov_i^t)$

  • The mixture estimator converges faster (consider estimates for the unobserved values of Xi; prove via the Rao-Blackwell theorem)

slide-74
SLIDE 74

Rao-Blackwell Theorem

Rao-Blackwell Theorem: Let random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution (R,L) and function g, the following result applies

$\mathrm{Var}\big[E\{g(R) \mid L\}\big] \le \mathrm{Var}\big[g(R)\big]$

for a function of interest g, e.g., the mean or covariance (Casella & Robert, 1996; Liu et al., 1995).

  • The theorem makes a weak promise, but works well in practice!
  • The improvement depends on the choice of R and L.
slide-75
SLIDE 75

Importance vs. Gibbs

Gibbs: $x^t \sim \widehat{P}(X \mid e), \quad \widehat{P}(X \mid e) \to P(X \mid e), \qquad \hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(x^t)$

Importance: $x^t \sim Q(X), \qquad \hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(x^t)\, \frac{P(x^t)}{Q(x^t)}, \quad w^t = \frac{P(x^t)}{Q(x^t)}$

slide-76
SLIDE 76

Gibbs Sampling: Convergence

76

  • Sample from $\widehat{P}(X \mid e)$, which converges to $P(X \mid e)$
  • Converges iff chain is irreducible and ergodic
  • Intuition - must be able to explore all states:

– if Xi and Xj are strongly correlated, Xi=0 Xj=0, then, we cannot explore states with Xi=1 and Xj=1

  • All conditions are satisfied when all

probabilities are positive

  • Convergence rate can be characterized by the

second eigen-value of transition matrix

slide-77
SLIDE 77

Gibbs: Speeding Convergence

Reduce dependence between samples (autocorrelation)

  • Skip samples
  • Randomize Variable Sampling Order
  • Employ blocking (grouping)
  • Multiple chains

Reduce variance (cover in the next section)

77

slide-78
SLIDE 78

Blocking Gibbs Sampler

  • Sample several variables together, as a block
  • Example: Given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample $(x^t, y^t, z^t)$, compute the next sample:

$x^{t+1} \sim P(X \mid y^t, z^t), \qquad w^{t+1} = (y^{t+1}, z^{t+1}) \sim P(Y, Z \mid x^{t+1})$

+ Can improve convergence greatly when two variables are strongly correlated!
– Domain of the block variable grows exponentially with the #variables in a block!
slide-79
SLIDE 79

Gibbs: Multiple Chains

  • Generate M chains of size K
  • Each chain produces an independent estimate $\widehat{P}_m$:

$\widehat{P}_m(x_i \mid e) = \frac{1}{K} \sum_{t=1}^{K} P(x_i \mid x^t \setminus x_i)$

  • Treat the $\widehat{P}_m$ as independent random variables and estimate $P(x_i \mid e)$ as their average:

$\widehat{P} = \frac{1}{M} \sum_{m=1}^{M} \widehat{P}_m$
slide-80
SLIDE 80

Gibbs Sampling Summary

  • Markov Chain Monte Carlo method

(Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994)

  • Samples are dependent, form Markov Chain
  • Sample from $\widehat{P}(X \mid e)$, which converges to $P(X \mid e)$
  • Guaranteed to converge when all P > 0
  • Methods to improve convergence:

– Blocking
– Rao-Blackwellisation

slide-81
SLIDE 81

Overview

  • 1. Probabilistic Reasoning/Graphical models
  • 2. Importance Sampling
  • 3. Markov Chain Monte Carlo: Gibbs Sampling
  • 4. Sampling in presence of Determinism
  • 5. Rao-Blackwellisation
  • 6. AND/OR importance sampling
slide-82
SLIDE 82

Sampling: Performance

  • Gibbs sampling

– Reduce dependence between samples

  • Importance sampling

– Reduce variance

  • Achieve both by sampling a subset of variables

and integrating out the rest (reduce dimensionality), aka Rao-Blackwellisation

  • Exploit graph structure to manage the extra cost

82

slide-83
SLIDE 83

Smaller Subset State-Space

  • A smaller state-space is easier to cover:

$X = \{X_1, X_2, X_3, X_4\}, \; |D(X)| = 64 \qquad \text{vs.} \qquad \bar{X} = \{X_1, X_2\}, \; |D(\bar{X})| = 16$

slide-84
SLIDE 84

Smoother Distribution

[Bar charts: the joint $P(X_1, X_2, X_3, X_4)$ vs. the marginal $P(X_1, X_2)$, with probabilities binned into 0–0.1, 0.1–0.2, 0.2–0.26 — the lower-dimensional distribution is smoother.]

slide-85
SLIDE 85

Speeding Up Convergence

  • Mean Squared Error of the estimator:

$MSE_Q[\widehat{P}] = \mathrm{Var}_Q[\widehat{P}] + BIAS^2 = E_Q\big[\widehat{P} - E_Q[\widehat{P}]\big]^2 + \big(E_Q[\widehat{P}] - P\big)^2$

  • In case of an unbiased estimator, BIAS = 0, so $MSE_Q[\widehat{P}] = \mathrm{Var}_Q[\widehat{P}]$
  • Reduce variance ⇒ speed up convergence!

slide-86
SLIDE 86

Rao-Blackwellisation

Let $X = R \cup L$. Compare the standard estimator with its Rao-Blackwellised version:

$\hat{g} = \frac{1}{T} \sum_{t=1}^{T} g(x^t), \qquad \tilde{g} = \frac{1}{T} \sum_{t=1}^{T} E\big[g(x) \mid l^t\big]$

Since $\mathrm{Var}\{g(x)\} = \mathrm{Var}\{E[g(x) \mid l]\} + E\{\mathrm{var}[g(x) \mid l]\} \ge \mathrm{Var}\{E[g(x) \mid l]\}$, it follows that $\mathrm{Var}\{\tilde{g}\} \le \mathrm{Var}\{\hat{g}\}$ (Liu, Ch. 2.3).

slide-87
SLIDE 87

Rao-Blackwellisation

  • X=RL
  • Importance Sampling:
  • Gibbs Sampling:

– autocovariances are lower (less correlation between samples) – if Xi and Xj are strongly correlated, Xi=0  Xj=0,

  • nly include one fo them into a sampling set

87

} ) ( ) ( { } ) , ( ) , ( { R Q R P Var L R Q L R P Var

Q Q

Liu, Ch.2.5.5 “Carry out analytical computation as much as possible” - Liu

slide-88
SLIDE 88

Blocking Gibbs Sampler vs. Collapsed

  • Standard Gibbs: $P(x \mid y, z),\; P(y \mid x, z),\; P(z \mid x, y)$ (1)
  • Blocking: $P(x \mid y, z),\; P(y, z \mid x)$ (2)
  • Collapsed: $P(x \mid y),\; P(y \mid x)$ (3)

Convergence is faster as we move from (1) to (3).

slide-89
SLIDE 89

Collapsed Gibbs Sampling

Generating Samples

Generate sample $c^{t+1}$ from $c^t$:

$c_1^{t+1} \sim P(C_1 \mid c_2^t, c_3^t, \ldots, c_K^t, e)$
$c_2^{t+1} \sim P(C_2 \mid c_1^{t+1}, c_3^t, \ldots, c_K^t, e)$
$\vdots$
$c_K^{t+1} \sim P(C_K \mid c_1^{t+1}, c_2^{t+1}, \ldots, c_{K-1}^{t+1}, e)$

In short, for i = 1 to K: $c_i^{t+1}$ is sampled from $P(C_i \mid c^t \setminus c_i, e)$.

slide-90
SLIDE 90

Collapsed Gibbs Sampler

Input: C X, E=e Output: T samples {ct } Fix evidence E=e, initialize c0 at random

1. For t = 1 to T (compute samples) 2. For i = 1 to N (loop through variables) 3. ci

t+1  P(Ci | ct\ci)

4. End For 5. End For

slide-91
SLIDE 91

Calculation Time

  • Computing P(ci| ct\ci,e) is more expensive

(requires inference)

  • Trading #samples for smaller variance:

– generate more samples with higher covariance – generate fewer samples with lower covariance

  • Must control the time spent computing sampling probabilities in order to be time-effective!

91

slide-92
SLIDE 92

Exploiting Graph Properties

Recall… computation time is exponential in the adjusted induced width of a graph

  • A w-cutset is a subset of variables s.t. when they are observed, the induced width of the graph is ≤ w

  • when sampled variables form a w-cutset ,

inference is exp(w) (e.g., using Bucket Tree Elimination)

  • cycle-cutset is a special case of w-cutset

92

Sampling w-cutset  w-cutset sampling!

slide-93
SLIDE 93

What If C=Cycle-Cutset ?

$c = \{x_2, x_5\}, \qquad E = \{X_9\}$

[Figure: the full network over X1,…,X9 vs. the network with X2 and X5 observed, which breaks all loops.]

$P(x_2, x_5, x_9)$ can be computed using Bucket Elimination; the computation complexity is O(N).

slide-94
SLIDE 94

Computing Transition Probabilities

Compute the joint probabilities via Bucket Elimination, one run per value of the sampled variable:

$BE: P(x_2 = 0, x_3, x_9), \qquad BE: P(x_2 = 1, x_3, x_9)$

Normalize:

$P(x_2 \mid x_3, x_9) = \alpha\, P(x_2, x_3, x_9), \qquad \alpha = \frac{1}{P(x_2{=}0, x_3, x_9) + P(x_2{=}1, x_3, x_9)}$

slide-95
SLIDE 95

Cutset Sampling-Answering Queries

  • Query: ci C, P(ci |e)=? same as Gibbs:

95

computed while generating sample t using bucket tree elimination compute after generating sample t using bucket tree elimination

T t i t i i

e c c c P T |e c P

1

) , \ | ( 1 ) ( ˆ

T t t i i

,e c x P T |e) (x P

1

) | ( 1

  • Query: xi X\C, P(xi |e)=?
slide-96
SLIDE 96

Cutset Sampling vs. Cutset Conditioning

  • Cutset Conditioning:

$P(x_i \mid e) = \sum_{c \in D(C)} P(x_i \mid c, e)\, P(c \mid e)$

  • Cutset Sampling:

$\widehat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid c^t, e) = \sum_{c \in D(C)} P(x_i \mid c, e)\, \frac{count(c)}{T} \approx \sum_{c \in D(C)} P(x_i \mid c, e)\, P(c \mid e)$

slide-97
SLIDE 97

Cutset Sampling Example

Estimating $P(x_2 \mid e)$ for sampled node $X_2$, from samples 1–3:

$\widehat{P}(x_2 \mid x_9) = \frac{1}{3} \Big[ P(x_2 \mid x_5^1, x_9) + P(x_2 \mid x_5^2, x_9) + P(x_2 \mid x_5^3, x_9) \Big]$

slide-98
SLIDE 98

Cutset Sampling Example

Samples: $c^1 = \{x_2^1, x_5^1\}$, $c^2 = \{x_2^2, x_5^2\}$, $c^3 = \{x_2^3, x_5^3\}$

Estimating $P(x_3 \mid e)$ for non-sampled node $X_3$:

$\widehat{P}(x_3 \mid x_9) = \frac{1}{3} \Big[ P(x_3 \mid x_2^1, x_5^1, x_9) + P(x_3 \mid x_2^2, x_5^2, x_9) + P(x_3 \mid x_2^3, x_5^3, x_9) \Big]$

slide-99
SLIDE 99

CPCS54 Test Results

MSE vs. #samples (left) and time (right). Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 3. Exact Time = 30 sec using Cutset Conditioning.

[Plots, CPCS54 (n = 54, |C| = 15, |E| = 3): MSE of Cutset vs. Gibbs over 1000–5000 samples and 5–25 sec.]

slide-100
SLIDE 100

CPCS179 Test Results

MSE vs. #samples (left) and time (right). Non-Ergodic (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35. Exact Time = 122 sec using Cutset Conditioning.

[Plots, CPCS179 (n = 179, |C| = 8, |E| = 35): MSE of Cutset vs. Gibbs over 100–4000 samples and 20–80 sec.]

slide-101
SLIDE 101

CPCS360b Test Results

MSE vs. #samples (left) and time (right). Ergodic, |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36. Exact Time > 60 min using Cutset Conditioning; exact values obtained via Bucket Elimination.

(Plots: CPCS360b, n=360, |C|=21, |E|=36 — MSE vs. #samples and MSE vs. time; curves for Cutset and Gibbs.)

slide-102
SLIDE 102

Random Networks

MSE vs. #samples (left) and time (right). |X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20. Exact Time = 30 sec using Cutset Conditioning.

(Plots: RANDOM, n=100, |C|=13, |E|=15-20 — MSE vs. #samples and MSE vs. time; curves for Cutset and Gibbs.)

slide-103
SLIDE 103

Coding Networks

Cutset Transforms Non-Ergodic Chain to Ergodic

MSE vs. time. Non-Ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50. Sample the ergodic subspace U = {U1, U2, ..., Uk}. Exact Time = 50 sec using Cutset Conditioning.

(Network diagram: nodes x1-x4, u1-u4, p1-p4, y1-y4. Plot: Coding Networks, n=100, |C|=12-14 — MSE vs. time; curves for IBP, Gibbs, Cutset.)

slide-104
SLIDE 104

Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right). Non-Ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11, |E| = 0. Exact Time = 2 sec using Loop-Cutset Conditioning.

(Plots: HailFinder, n=56, |C|=5, |E|=1 — MSE vs. time and MSE vs. #samples; curves for Cutset and Gibbs.)

slide-105
SLIDE 105

CPCS360b - MSE

MSE vs. Time. Ergodic, |X| = 360, |C| = 26, D(Xi) = 2. Exact Time = 50 min using BTE.

(Plot: cpcs360b, N=360, |E|=[20-34], w*=20 — MSE vs. time; curves for Gibbs, IBP, |C|=26,fw=3, and |C|=48,fw=2.)

slide-106
SLIDE 106

Cutset Importance Sampling

  • Apply Importance Sampling over cutset C

    \hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} \frac{P(c^t, e)}{Q(c^t)} = \frac{1}{T} \sum_{t=1}^{T} w^t

    \hat{P}(c_i \mid e) \propto \frac{1}{T} \sum_{t=1}^{T} \delta(c_i, c^t) \, w^t

    \hat{P}(x_i \mid e) \propto \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid c^t, e) \, w^t

where P(c^t, e) is computed using Bucket Elimination

(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)
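A minimal sketch of the \hat{P}(e) estimator, assuming a toy model with a single binary cutset variable, made-up values for P(c, e) (which bucket elimination would supply in the real scheme), and a uniform proposal Q over the cutset:

```python
import random

# Toy model (numbers hypothetical): a single binary cutset variable c.
P_joint = {0: 0.06, 1: 0.14}   # P(c, e); here P(e) = 0.06 + 0.14 = 0.20
Q = {0: 0.5, 1: 0.5}           # uniform proposal over the cutset

random.seed(1)
T = 200_000
weights = []
for _ in range(T):
    c = 0 if random.random() < Q[0] else 1
    weights.append(P_joint[c] / Q[c])   # w^t = P(c^t, e) / Q(c^t)

P_e_hat = sum(weights) / T              # (1/T) * sum_t w^t
print(round(P_e_hat, 3))  # converges to P(e) = 0.2
```

Because the weights range over only the cutset's (low-dimensional) space, their variance is smaller than that of importance sampling over all of X, mirroring the Rao-Blackwellisation argument for cutset sampling.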

slide-107
SLIDE 107

Likelihood Cutset Weighting (LCS)

  • Z=Topological Order{C,E}
  • Generating sample t+1:

For i = 1 to n do:
    If Z_i ∈ E: set z_i^{t+1} = e_i
    Else: sample z_i^{t+1} from P(Z_i | z_1^{t+1}, ..., z_{i-1}^{t+1})
End For

  • P(Z_i | z_1^{t+1}, ..., z_{i-1}^{t+1}) is computed while generating sample t using bucket tree elimination
  • can be memoized for some number of instances K (based on memory available)

    KL[P(C \mid e), Q(C)] \le KL[P(X \mid e), Q(X)]
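The loop above is forward sampling in topological order with evidence clamping, i.e. likelihood weighting restricted to the cutset variables. A minimal sketch on a hypothetical two-variable chain Z1 → Z2 (all CPT numbers made up; the conditionals here stand in for what LCS would compute with bucket tree elimination):

```python
import random

# Hypothetical CPTs for a two-variable chain Z1 -> Z2.
P_z1 = {0: 0.4, 1: 0.6}                               # P(Z1)
P_z2_given_z1 = {0: {0: 0.7, 1: 0.3},
                 1: {0: 0.1, 1: 0.9}}                 # P(Z2 | Z1)
evidence = {'Z2': 1}                                  # Z2 is observed

def sample_once(rng):
    weight = 1.0
    z1 = 0 if rng.random() < P_z1[0] else 1           # Z1 not evidence: sample it
    z2 = evidence['Z2']                               # Z2 is evidence: clamp it...
    weight *= P_z2_given_z1[z1][z2]                   # ...and multiply in its likelihood
    return {'Z1': z1, 'Z2': z2}, weight

rng = random.Random(7)
samples = [sample_once(rng) for _ in range(100_000)]

# Weighted (normalized) estimate of P(Z1 = 1 | e):
num = sum(w for s, w in samples if s['Z1'] == 1)
den = sum(w for _, w in samples)
print(round(num / den, 2))  # converges to 0.54 / 0.66, approximately 0.82
```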

slide-108
SLIDE 108

Pathfinder 1


slide-109
SLIDE 109

Pathfinder 2


slide-110
SLIDE 110

Link


slide-111
SLIDE 111

Summary

Importance Sampling:

  • i.i.d. samples
  • Unbiased estimator
  • Generates samples fast
  • Samples from Q
  • Rejects samples with zero weight
  • Improves on cutset

Gibbs Sampling:

  • Dependent samples
  • Biased estimator
  • Generates samples slower
  • Samples from P(X|e)
  • Does not converge in presence of constraints
  • Improves on cutset

slide-112
SLIDE 112

CPCS360b

(Plot: cpcs360b, N=360, |LC|=26, w*=21, |E|=15 — MSE vs. time; curves for LW, AIS-BN, Gibbs, LCS, IBP.)

LW – likelihood weighting; LCS – likelihood weighting on a cutset

slide-113
SLIDE 113

CPCS422b

(Plot: cpcs422b, N=422, |LC|=47, w*=22, |E|=28 — MSE vs. time; curves for LW, AIS-BN, Gibbs, LCS, IBP.)

LW – likelihood weighting; LCS – likelihood weighting on a cutset

slide-114
SLIDE 114

Coding Networks

(Plot: coding, N=200, P=3, |LC|=26, w*=21 — MSE vs. time; curves for LW, AIS-BN, Gibbs, LCS, IBP.)

LW – likelihood weighting; LCS – likelihood weighting on a cutset