Sampling Techniques for Probabilistic and Deterministic Graphical Models
ICS 276, Spring 2017. Bozhena Bidyuk, Rina Dechter.
Reading: Darwiche, Chapter 15, and related papers.
Overview: 1. Probabilistic Reasoning / Graphical Models; 2. Importance Sampling; …
A Markov chain is a sequence of states with the property that the next state depends only on the current state (the Markov property):
$$P(x^{t+1} \mid x^t, x^{t-1}, \dots, x^1) = P(x^{t+1} \mid x^t)$$
If the transition probability $P(x^{t+1} \mid x^t)$ is the same for all $t$ (the chain is homogeneous) and the state space is finite, then the chain can be described by a transition matrix.
Example: a chain over states $\{1, 2, 3\}$ with transition matrix $P(X)$.
Example: a weather chain over states {rain, sun} with transition matrix $P(X)$; a sampled trajectory: rain, rain, rain, rain, sun.
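As an illustration, here is a minimal sketch of simulating such a homogeneous chain; the transition probabilities below are invented for illustration, not taken from the slides.

```python
import random

# Hypothetical transition matrix P(next | current) for the rain/sun chain.
P = {
    "rain": {"rain": 0.7, "sun": 0.3},
    "sun":  {"rain": 0.4, "sun": 0.6},
}

def step(state):
    """Sample the next state from P(. | current state)."""
    r, cum = random.random(), 0.0
    for nxt, p in P[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point rounding

state = "rain"
trajectory = [state]
for _ in range(4):
    state = step(state)
    trajectory.append(state)
print(trajectory)  # e.g., ['rain', 'rain', 'rain', 'rain', 'sun']
```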
The state of the chain is a full assignment $x^t = (x_1^t, x_2^t, \dots, x_n^t)$. A new state $x^{t+1}$ is generated by resampling the variables one at a time, e.g., for three variables:
$$x_1^{t+1} \sim P(x_1 \mid x_2^t, x_3^t)$$
$$x_2^{t+1} \sim P(x_2 \mid x_1^{t+1}, x_3^t)$$
$$x_3^{t+1} \sim P(x_3 \mid x_1^{t+1}, x_2^{t+1})$$
In general, $x_i^{t+1} \sim P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^t, \dots, x_n^t)$, yielding the next state $x^{t+1} = (x_1^{t+1}, x_2^{t+1}, x_3^{t+1})$.
The chain has a unique stationary distribution if:
– The chain is irreducible: any state $x_j \in D(X)$ can be reached from any state $x_i \in D(X)$
– All of its states are positive recurrent
(Liu, Ch. 12, p. 249, Def. 12.1.1)
The recurrent states in a finite-state chain are positive recurrent.
$$\lim_{n \to \infty} P^{(n)}(x \mid x^0) = \pi(x)$$
i.e., the $n$-step transition probabilities converge to the stationary distribution $\pi$ regardless of the starting state.
Once the chain mixes, the visited states $x^0, \dots, x^n$ can be viewed as "samples" from the stationary distribution $\pi$, so expectations can be estimated by the ergodic average
$$\hat{E}[f] = \frac{1}{T} \sum_{t=1}^{T} f(x^t)$$
$$x_i^{t+1} \sim P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^t, \dots, x_n^t)$$
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations $X = x$ (recall the drunkard's walk): in one step we can reach only instantiations that differ from the current one in the value of at most one variable (assuming a randomized choice of the variable $X_i$).
$$x_1^{t+1} \sim P(x_1 \mid x_2^t, x_3^t, \dots, x_N^t)$$
$$x_2^{t+1} \sim P(x_2 \mid x_1^{t+1}, x_3^t, \dots, x_N^t)$$
$$x_3^{t+1} \sim P(x_3 \mid x_1^{t+1}, x_2^{t+1}, x_4^t, \dots, x_N^t)$$
$$\vdots$$
$$x_N^{t+1} \sim P(x_N \mid x_1^{t+1}, x_2^{t+1}, \dots, x_{N-1}^{t+1})$$
In general, $x_i^{t+1} \sim P(x_i \mid x_1^{t+1}, \dots, x_{i-1}^{t+1}, x_{i+1}^t, \dots, x_N^t)$.
Process All Variables In Some Order
Markov blanket:
$$P(x_i \mid x^t \setminus x_i) = P(x_i \mid markov_i^t)$$
$$P(x_i \mid x^t \setminus x_i) \propto P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$$
where $markov_i = pa_i \cup ch_i \cup \bigcup_{X_j \in ch_i} pa_j$.
Given its Markov blanket (parents, children, and the children's parents), $X_i$ is independent of all other nodes.
Computation is linear in the size of the Markov blanket!
1. For t = 1 to T (generate samples)
2.   For i = 1 to N (loop through variables)
3.     $x_i^{t+1} \leftarrow$ sampled from $P(X_i \mid markov_i^t)$
4.   End For
5. End For
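A minimal runnable sketch of this loop for a toy network A → B ← C; the CPTs and helper names below are invented for illustration, and the Markov-blanket conditional is computed exactly as in the formula above.

```python
import random

# Toy Bayesian network A -> B <- C with binary variables (CPTs invented
# for illustration). Each CPT maps a parent assignment to P(var = 1).
cpt = {
    "A": {(): 0.3},
    "C": {(): 0.6},
    "B": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}
parents = {"A": (), "C": (), "B": ("A", "C")}
children = {"A": ("B",), "C": ("B",), "B": ()}

def local_prob(var, val, assign):
    """P(var = val | parents(var)), read off the CPT."""
    p1 = cpt[var][tuple(assign[p] for p in parents[var])]
    return p1 if val == 1 else 1.0 - p1

def markov_blanket_prob(var, assign):
    """P(var = 1 | Markov blanket): proportional to
    P(var | pa) * product over children of P(child | its parents)."""
    score = {}
    for val in (0, 1):
        a = dict(assign, **{var: val})
        s = local_prob(var, val, a)
        for ch in children[var]:
            s *= local_prob(ch, a[ch], a)
        score[val] = s
    return score[1] / (score[0] + score[1])

def gibbs(evidence, T=5000):
    """Gibbs sampling: resample each unobserved variable in turn."""
    assign = {v: evidence.get(v, random.randint(0, 1)) for v in cpt}
    free = [v for v in cpt if v not in evidence]
    samples = []
    for _ in range(T):
        for v in free:
            assign[v] = 1 if random.random() < markov_blanket_prob(v, assign) else 0
        samples.append(dict(assign))
    return samples

# Estimate P(A = 1 | B = 1) from the chain.
samples = gibbs({"B": 1})
print(sum(s["A"] for s in samples) / len(samples))
```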
[Figure: example Bayesian network over $X_1, \dots, X_9$.]
$X = \{X_1, X_2, \dots, X_9\}$, $E = \{X_9\}$.
A sample is an assignment to all unobserved variables: $X_1 = x_1$, $X_2 = x_2$, $X_3 = x_3$, $X_4 = x_4$, $X_5 = x_5$, $X_6 = x_6$, $X_7 = x_7$, $X_8 = x_8$.
Sample new values in order, conditioned on the evidence $x_9$:
$$x_1^{t+1} \sim P(x_1 \mid x_2^t, \dots, x_8^t, x_9)$$
$$x_2^{t+1} \sim P(x_2 \mid x_1^{t+1}, x_3^t, \dots, x_8^t, x_9)$$
and so on through $x_8^{t+1}$.
Two estimators for $P(x_i \mid e)$:
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} \delta(x_i, x_i^t) \qquad \text{(histogram estimator; } \delta \text{ is the Dirac delta function)}$$
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid markov_i^t) \qquad \text{(mixture estimator; gives better estimates for the unobserved values of } X_i\text{; prove via the Rao-Blackwell theorem)}$$
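Continuing the toy sketch above (same invented network and helpers), the two estimators differ only in the quantity accumulated per sample:

```python
# Histogram estimator: average of 0/1 indicators (the Dirac delta).
hist = sum(s["A"] for s in samples) / len(samples)

# Mixture estimator: average the exact conditional P(A = 1 | Markov blanket)
# at each sample; by Rao-Blackwell this has lower variance.
mix = sum(markov_blanket_prob("A", s) for s in samples) / len(samples)

print(hist, mix)
```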
Rao-Blackwell Theorem: Let the random variable set X be composed of two groups of variables, R and L. Then, for the joint distribution of (R, L) and a function g, the following holds:
$$Var\{E[g(R) \mid L]\} \leq Var\{g(R)\}$$
for any function of interest g, e.g., the mean or covariance (Casella & Robert, 1996; Liu et al., 1995).
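A small synthetic check of the inequality (numbers invented; the conditional expectation is available in closed form for this Gaussian toy case):

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=100_000)      # the conditioning variables
R = L + rng.normal(size=100_000)  # g(R) = R, so E[g(R) | L] = L

print(L.var(), "<=", R.var())     # Var{E[g(R)|L]} <= Var{g(R)}: ~1.0 <= ~2.0
```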
$$\hat{P} = \frac{\sum_{t=1}^{T} w^t g(x^t)}{\sum_{t=1}^{T} w^t}$$
where $w^t$ is the weight of sample $t$.
– if $X_i$ and $X_j$ are strongly correlated, e.g., $X_i = 0 \Leftrightarrow X_j = 0$, then we cannot explore states with $X_i = 1$ and $X_j = 1$
Blocking: given variables $X$, $Y$, $Z$, each with domain size 2, group $Y$ and $Z$ together to form a variable $W = \{Y, Z\}$ with domain size 4. Then, given sample $(x^t, y^t, z^t)$, compute the next sample as shown in the equations and sketch below.
+ Can improve convergence greatly when two variables are strongly correlated!
– The domain size of the blocked variable grows exponentially with the number of variables in a block!
$$x^{t+1} \sim P(x \mid y^t, z^t)$$
$$(y^{t+1}, z^{t+1}) \sim P(y, z \mid x^{t+1})$$
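A sketch of the blocked draw under these assumptions; the table for P(Y, Z | X) is hypothetical:

```python
import random

# Hypothetical joint conditional P(Y, Z | X) for binary Y, Z,
# indexed by the value of X; each row sums to 1.
p_w_given_x = {
    0: {(0, 0): 0.6, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.1},
    1: {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.6},
}

def sample_block(x):
    """One blocked Gibbs step: draw (y, z) jointly from P(Y, Z | X = x)."""
    r, cum = random.random(), 0.0
    for w, p in p_w_given_x[x].items():
        cum += p
        if r < cum:
            return w
    return w

y, z = sample_block(x=1)  # resample the block W = {Y, Z} in one draw
```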
Averaging over $M$ independent chains of $K$ samples each:
$$\hat{P}(x_i \mid e) = \frac{1}{M} \sum_{m=1}^{M} \hat{P}_m(x_i \mid e), \qquad \hat{P}_m(x_i \mid e) = \frac{1}{K} \sum_{t=1}^{K} P(x_i \mid markov_i^t)$$
which converges to $P(x_i \mid e)$ (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994).
– Blocking
– Rao-Blackwellisation
To speed up convergence of the estimate $\hat{P}(X \mid e)$ to $P(X \mid e)$:
– Reduce dependence between samples
– Reduce variance
Example: collapse $X = \{X_1, X_2, X_3, X_4\}$ onto the subset $\{X_1, X_2\}$, reducing the sampling space from $|D(X)| = 64$ to $|D(\{X_1, X_2\})| = 16$.
[Figure: the cumulative distribution of $P(X_1, X_2, X_3, X_4)$ vs. that of the collapsed $P(X_1, X_2)$ over the joint values 00, 01, 10, 11.]
$$MSE_Q[\hat{P}] = Var_Q[\hat{P}] + BIAS^2$$
$$MSE_Q[\hat{P}] = Var_Q[\hat{P}] + \left( E_Q[\hat{P}] - P \right)^2$$
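A quick numeric check of this decomposition on a synthetic biased estimator (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(1)
P = 0.3                                        # true value
P_hat = P + 0.05 + rng.normal(0, 0.02, 10**6)  # biased, noisy estimates

mse = np.mean((P_hat - P) ** 2)
decomp = P_hat.var() + (P_hat.mean() - P) ** 2  # Var + BIAS^2
print(mse, decomp)                              # agree up to sampling noise
```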
)} ( ~ { ]} | ) ( [ { )} ( { )} ( ˆ { ]} | ) ( [ { )} ( { ]} | ) ( {var[ ]} | ) ( [ { )} ( { ]} | ) ( [ ] | ) ( [ { 1 ) ( ~ )} ( ) ( { 1 ) ( ˆ
1 1
x g Var T l x h E Var T x h Var x g Var l x g E Var x g Var l x g E l x g E Var x g Var l x h E l x h E T x g x h x h T x g L R X
T T
Liu, Ch.2.3
– autocovariances are lower (less correlation between samples)
– if $X_i$ and $X_j$ are strongly correlated, e.g., $X_i = 0 \Leftrightarrow X_j = 0$, collapsing avoids getting trapped in such states
$$Var_Q\left\{ \frac{P(R, L)}{Q(R, L)} \right\} \geq Var_Q\left\{ \frac{P(R)}{Q(R)} \right\}$$
(Liu, Ch. 2.5.5) "Carry out analytical computation as much as possible." – Liu
Three sampling schemes for variables $X$, $Y$, $Z$:
(1) Standard Gibbs: sample from $P(x \mid y, z)$, $P(y \mid x, z)$, $P(z \mid x, y)$
(2) Blocking: sample from $P(x \mid y, z)$, $P(y, z \mid x)$
(3) Collapsed: sample from $P(x \mid y)$, $P(y \mid x)$
Moving from (1) to (3) gives faster convergence.
Generating Samples
$$c_1^{t+1} \sim P(c_1 \mid c_2^t, c_3^t, \dots, c_K^t, e)$$
$$c_2^{t+1} \sim P(c_2 \mid c_1^{t+1}, c_3^t, \dots, c_K^t, e)$$
$$\vdots$$
$$c_K^{t+1} \sim P(c_K \mid c_1^{t+1}, c_2^{t+1}, \dots, c_{K-1}^{t+1}, e)$$
1. For t = 1 to T (generate samples)
2.   For i = 1 to K (loop through cutset variables)
3.     $c_i^{t+1} \leftarrow$ sampled from $P(C_i \mid c^t \setminus c_i, e)$
4.   End For
5. End For
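A minimal sketch of this loop; exact_joint(c, e), standing in for bucket (tree) elimination, is assumed to return P(c, e) for a full cutset assignment, and the cutset variables are assumed binary.

```python
import random

def resample(i, c, e, exact_joint):
    """Draw a new value for cutset variable i from P(C_i | c \\ c_i, e),
    obtained by normalizing the joint P(c, e) over the two values of C_i."""
    w = [exact_joint(dict(c, **{i: v}), e) for v in (0, 1)]
    return 1 if random.random() < w[1] / (w[0] + w[1]) else 0

def cutset_gibbs(cutset_vars, e, exact_joint, T=1000):
    c = {i: random.randint(0, 1) for i in cutset_vars}
    samples = []
    for _ in range(T):
        for i in cutset_vars:  # loop through cutset variables only
            c[i] = resample(i, c, e, exact_joint)
        samples.append(dict(c))
    return samples

# Demo with a dummy joint (a real implementation would call bucket elimination).
demo_joint = lambda c, e: 0.25 + 0.5 * (c["X2"] == c["X5"])
samples = cutset_gibbs(["X2", "X5"], {"X9": 1}, demo_joint, T=100)
```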
Trade-off:
– plain Gibbs generates more samples, but with higher covariance between them
– cutset sampling generates fewer samples, but with lower covariance
Example: cutset $C = \{X_2, X_5\}$, evidence $E = \{X_9\}$.
[Figure: the network over $X_1, \dots, X_9$, and the network remaining after conditioning on the cutset variables $X_2$, $X_5$.]
$P(x_2, x_5, x_9)$ can be computed using Bucket Elimination; the computation complexity is O(N).
To sample a cutset variable, e.g. $X_2$, run Bucket Elimination once for each of its values, BE: $P(x_2, x_3, x_9)$ for every $x_2 \in D(X_2)$, and normalize:
$$P(x_2 \mid x_3, x_9) = \frac{P(x_2, x_3, x_9)}{\sum_{x_2'} P(x_2', x_3, x_9)}$$
For a cutset variable $C_i$:
$$\hat{P}(c_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(c_i \mid c^t \setminus c_i, e)$$
computed while generating sample $t$ using bucket tree elimination. For a non-cutset variable $X_i$:
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid c^t, e)$$
computed after generating sample $t$ using bucket tree elimination.
$$\hat{P}(c_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(c_i \mid c^t \setminus c_i, e), \qquad \forall c_i \in D(C_i)$$
Example with three samples $c^1, c^2, c^3$ of the cutset:
$$\hat{P}(x_2 \mid x_9) = \frac{1}{3} \left[ P(x_2 \mid x_5^1, x_9) + P(x_2 \mid x_5^2, x_9) + P(x_2 \mid x_5^3, x_9) \right]$$
For a non-cutset variable, e.g. $X_3$:
$$c^1 = \{x_2^1, x_5^1\}: \quad P(x_3 \mid x_2^1, x_5^1, x_9)$$
$$c^2 = \{x_2^2, x_5^2\}: \quad P(x_3 \mid x_2^2, x_5^2, x_9)$$
$$c^3 = \{x_2^3, x_5^3\}: \quad P(x_3 \mid x_2^3, x_5^3, x_9)$$
$$\hat{P}(x_3 \mid x_9) = \frac{1}{3} \left[ P(x_3 \mid x_2^1, x_5^1, x_9) + P(x_3 \mid x_2^2, x_5^2, x_9) + P(x_3 \mid x_2^3, x_5^3, x_9) \right]$$
CPCS54 results: MSE vs. #samples (left) and time (right). Ergodic network, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 3. Exact time = 30 sec using cutset conditioning.
[Plots: CPCS54, n=54, |C|=15, |E|=3; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
CPCS179 results: MSE vs. #samples (left) and time (right). Non-ergodic (one deterministic CPT entry), |X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35. Exact time = 122 sec using cutset conditioning.
[Plots: CPCS179, n=179, |C|=8, |E|=35; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
CPCS360b results: MSE vs. #samples (left) and time (right). Ergodic, |X| = 360, D(Xi) = 2, |C| = 21, |E| = 36. Exact time > 60 min using cutset conditioning; exact values obtained via Bucket Elimination.
[Plots: CPCS360b, n=360, |C|=21, |E|=36; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
Random networks: MSE vs. #samples (left) and time (right). |X| = 100, D(Xi) = 2, |C| = 13, |E| = 15-20. Exact time = 30 sec using cutset conditioning.
[Plots: RANDOM, n=100, |C|=13, |E|=15-20; MSE vs. # samples and MSE vs. time (sec), comparing Cutset and Gibbs.]
Cutset Transforms Non-Ergodic Chain to Ergodic
Coding networks: MSE vs. time. Non-ergodic, |X| = 100, D(Xi) = 2, |C| = 13-16, |E| = 50. Sample the ergodic subspace U = {U1, U2, …, Uk}. Exact time = 50 sec using cutset conditioning.
[Figure: coding network structure over inputs u1-u4, parity bits p1-p4, and transmitted bits x1-x4, y1-y4.]
[Plot: Coding networks, n=100, |C|=12-14; MSE vs. time (sec), comparing IBP, Gibbs, and Cutset.]
HailFinder: MSE vs. #samples (left) and time (right). Non-ergodic, |X| = 56, |C| = 5, 2 ≤ D(Xi) ≤ 11. Exact time = 2 sec using loop-cutset conditioning.
[Plots: HailFinder, n=56, |C|=5, |E|=1; MSE vs. time (sec) and MSE vs. # samples, comparing Cutset and Gibbs.]
cpcs360b w-cutset results: MSE vs. time. Ergodic, |X| = 360, D(Xi) = 2, |E| = 20-34, w* = 20. Exact time = 50 min using BTE.
[Plot: cpcs360b, N=360; MSE vs. time (sec), comparing Gibbs, IBP, and w-cutset sampling with |C|=26, fw=3 and |C|=48, fw=2.]
$$\hat{P}(e) = \frac{1}{T} \sum_{t=1}^{T} \frac{P(c^t, e)}{Q(c^t)} = \frac{1}{T} \sum_{t=1}^{T} w^t$$
$$\hat{P}(c_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} \delta(c_i, c_i^t) \, w^t$$
$$\hat{P}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P(x_i \mid c^t, e) \, w^t$$
where $P(c^t, e)$ is computed using Bucket Elimination.
(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006)
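A sketch of assembling these estimates from sampled weights; the arrays below are hypothetical placeholders for quantities that bucket elimination would produce.

```python
import numpy as np

# Hypothetical per-sample quantities from bucket elimination:
w = np.array([0.8, 1.3, 0.5, 1.1])         # w^t = P(c^t, e) / Q(c^t)
p_xi = np.array([0.42, 0.40, 0.47, 0.39])  # P(x_i | c^t, e)

P_e = w.mean()                             # (1/T) sum_t w^t  estimates P(e)
# Dividing by sum(w) instead of T gives the self-normalized estimate of P(x_i | e).
P_xi_e = (w * p_xi).sum() / w.sum()
print(P_e, P_xi_e)
```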
For i = 1 to K do:
  If $Z_i \in E$: set $z_i^t \leftarrow$ the observed value of $Z_i$
  Else: sample $z_i^t \sim P(Z_i \mid z_1^t, \dots, z_{i-1}^t)$
  End If
End For
The conditional $P(Z_i \mid z_1^t, \dots, z_{i-1}^t)$ is computed while generating sample $t$ using bucket tree elimination; the number of instances K is chosen based on the memory available.
KL[P(C|e), Q(C)] ≤ KL[P(X|e), Q(X)]
In the presence of constraints (determinism):
– importance sampling generates zero-weight samples
– convergence is slower
Importance Sampling vs. Gibbs Sampling
cpcs360b, N = 360, |LC| = 26, w* = 21, |E| = 15.
[Plot: MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW = likelihood weighting; LCS = likelihood weighting on a cutset.
cpcs422b, N = 422, |LC| = 47, w* = 22, |E| = 28.
[Plot: MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW = likelihood weighting; LCS = likelihood weighting on a cutset.
coding, N = 200, P = 3, |LC| = 26, w* = 21.
[Plot: MSE vs. time (sec), comparing LW, AIS-BN, Gibbs, LCS, and IBP.]
LW = likelihood weighting; LCS = likelihood weighting on a cutset.