

CS 3750 Advanced Machine Learning
Lecture 6
Milos Hauskrecht (milos@cs.pitt.edu), 5329 Sennott Square

Approximate probabilistic inference:

  • Markov Chain Monte Carlo (MCMC)
  • Variational methods


Markov chain Monte Carlo

  • Importance sampling: samples are generated according to a proposal distribution Q, and every sample from Q is reweighted by an importance weight w; the problem is that Q may be very far from the target distribution.
  • MCMC is a strategy for generating samples from the target distribution, including conditional distributions.
  • MCMC:
    – A Markov chain defines a sampling process that initially generates samples very different from the target distribution (e.g. the posterior), but gradually refines the samples so that they are closer and closer to the posterior.


MCMC

  • The construction of a Markov chain requires two basic ingredients:
    – a transition matrix $P$
    – an initial distribution $\pi^{(0)}$
  • Assume a finite set $S = \{1, \ldots, m\}$ of states; then a transition matrix is

$$P = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1m} \\ p_{21} & p_{22} & \cdots & p_{2m} \\ \vdots & & & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mm} \end{pmatrix}$$

where $p_{ij} \ge 0$ for all $(i, j) \in S$ and $\sum_{j \in S} p_{ij} = 1$ for all $i \in S$.
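To make the two ingredients concrete, here is a minimal sketch (a made-up 3-state chain, not from the lecture) that encodes a transition matrix, checks the two defining properties, and draws one step:

```python
import numpy as np

# Hypothetical transition matrix over S = {0, 1, 2}:
# P[i, j] = p_ij, the probability of moving from state i to state j.
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
])

# Initial distribution pi^(0) over the states.
pi0 = np.array([1.0, 0.0, 0.0])

# Defining properties: p_ij >= 0 and each row sums to 1.
assert (P >= 0).all()
assert np.allclose(P.sum(axis=1), 1.0)

def step(rng, i):
    """Sample the next state given the current state i."""
    return rng.choice(len(P), p=P[i])
```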


Markov Chain

  • A Markov chain defines a random process of selecting states $x^{(1)}, x^{(2)}, \ldots, x^{(m)}, \ldots$
  • Chain dynamics: the probability of a state $x'$ being selected at time $t+1$ is

$$P(X^{(t+1)} = x') = \sum_{x \in \mathrm{Dom}(X)} P(X^{(t)} = x)\, T(x \to x')$$

  • The initial state is selected based on $\pi^{(0)}$; subsequent states are selected based on the previous state and the transition matrix.
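The chain dynamics above can be checked numerically: pushing a distribution through the transition matrix for one step is just a vector-matrix product (a sketch with a made-up 2-state matrix):

```python
import numpy as np

# Hypothetical 2-state transition matrix (each row sums to 1).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

pi = np.array([1.0, 0.0])   # initial distribution pi^(0): start in state 0
for _ in range(100):
    # P(X^(t+1) = x') = sum_x P(X^(t) = x) * T(x -> x'), i.e. a vector-matrix product
    pi = pi @ P

# pi has converged to the stationary distribution q satisfying q = qP,
# which for this matrix is (5/6, 1/6).
print(pi)
```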


MCMC

  • A Markov chain satisfies the Markov property:

$$P(X_{n+1} = j \mid X_1 = i_1, \ldots, X_{n-1} = i_{n-1}, X_n = i) = P(X_{n+1} = j \mid X_n = i)$$

  • Irreducibility: a Markov chain is called irreducible (or indecomposable) if there is a positive transition probability between all pairs of states within a limited number of steps.
  • In irreducible chains there may still exist a periodic structure: for each state $i$, the set of possible return times to $i$ when starting in $i$ is a subset of $\{p, 2p, 3p, \ldots\} \subseteq \mathbb{N}$. The smallest number $p$ with this property is the so-called period of the chain:

$$p = \gcd\{n \in \mathbb{N} : p^{(n)}_{ii} > 0\}$$


MCMC

  • Aperiodicity: an irreducible chain is called aperiodic (or acyclic) if the period $p$ equals 1 or, equivalently, if for all pairs of states $(i, j)$ there is an integer $n_{ij}$ such that for all $n \ge n_{ij}$ the probability $p^{(n)}_{ij} > 0$.
  • If a Markov chain satisfies both irreducibility and aperiodicity, then it converges to an invariant distribution $q(x)$.
  • A Markov chain with transition matrix $P$ will have an equilibrium distribution $q$ iff $q = qP$.
  • A sufficient, but not necessary, condition to ensure a particular $q(x)$ is the invariant distribution of transition matrix $P$ is the following reversibility (detailed balance) condition:

$$q(x_i)\, P(x_{i+1} \mid x_i) = q(x_{i+1})\, P(x_i \mid x_{i+1})$$
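Both conditions are easy to check numerically for a small chain. A sketch (hypothetical 2-state matrix) that recovers $q$ as the left eigenvector of $P$ for eigenvalue 1 and then tests detailed balance element-wise:

```python
import numpy as np

P = np.array([[0.7, 0.3],
              [0.6, 0.4]])

# Invariant distribution: left eigenvector of P with eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P.T)
q = np.real(vecs[:, np.isclose(vals, 1.0)].ravel())
q = q / q.sum()
assert np.allclose(q @ P, q)          # invariance: q = qP

# Detailed balance: q(x) P(x' | x) == q(x') P(x | x') for all state pairs.
flows = q[:, None] * P                # flows[x, x'] = q(x) P(x' | x)
print(q, np.allclose(flows, flows.T))
```

Every 2-state chain is reversible at stationarity, so the detailed-balance check passes here; for larger chains it can genuinely fail even though $q = qP$ holds.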


Markov Chain Monte Carlo

Objective: generate samples from the posterior distribution

  • Idea:
    – A Markov chain defines a sampling process that initially generates samples very different from the target posterior, but gradually refines the samples so that they are closer and closer to the posterior.


MCMC

  • $P(X \mid e)$ is the query we want to compute.
  • $e_1$ and $e_2$ are known evidence variables.
  • Sampling from the distribution $P(X)$ is very different from sampling from the desired posterior $P(X \mid e)$.

[Figure: Bayesian network with evidence nodes $e_1$, $e_2$ and query $P(X \mid e)$]


Markov Chain Monte Carlo (MCMC)

[Figure: state space with a sequence of samples $X_1, X_2, X_3, X_4, \ldots$]

MCMC (Cont.)

  • Goal: a sample from $P(X \mid e)$
  • Start from some $P(X)$ and generate a sample $x_1$
  • From $x_1$ and the transition model generate $x_2$
  • Repeat for $n$ steps (applying $T$ at each step): $X_1 \to X_2 \to \cdots \to X_n$, at which point the samples come from $P'(X \mid e)$
  • Continuing the chain, the subsequent states $X_{n+1}, X_{n+2}, \ldots$ are samples from the desired $P(X \mid e)$
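The steps above can be sketched directly: run the chain for $n$ burn-in steps, then keep collecting states as (approximate) samples from the target. The transition matrix here is made up; with enough burn-in the kept states follow the stationary distribution:

```python
import numpy as np

def chain_samples(P, n_burn, n_keep, start=0, seed=0):
    """Apply the transition T for n_burn steps, then collect n_keep states."""
    rng = np.random.default_rng(seed)
    x = start
    for _ in range(n_burn):            # X1 ... Xn: not yet from the target
        x = rng.choice(len(P), p=P[x])
    kept = []
    for _ in range(n_keep):            # Xn+1, Xn+2, ...: approx. target samples
        x = rng.choice(len(P), p=P[x])
        kept.append(x)
    return np.array(kept)

# Hypothetical 2-state chain with stationary distribution (5/6, 1/6).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
kept = chain_samples(P, n_burn=200, n_keep=5000)
print(kept.mean())   # fraction of time in state 1, close to 1/6
```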

MCMC

  • In general, an MCMC sampling process doesn't have to converge to a stationary distribution.
  • A finite-state Markov chain has a unique stationary distribution iff the Markov chain is regular:
    – regular: there exists some $k$ such that, for each pair of states $x$ and $x'$, the probability of getting from $x$ to $x'$ in exactly $k$ steps is greater than 0.
  • We want Markov chains that converge to a unique target distribution from any initial state.

Big question:
  • How do we build such Markov chains?


Gibbs Sampling

  • A simple method to define a Markov chain for a Bayesian belief network (BBN); it can benefit from the structure (independences) in the network.
  • Evidence: $x_5 = T$, $x_6 = T$
  • All variables have binary values T or F.

[Figure: Bayesian network over $x_1, \ldots, x_6$]


Gibbs Sampling

Initial state $X^0$: $x_5 = x_6 = T$ (fixed evidence); $x_1 = F$, $x_2 = T$, $x_3 = T$, $x_4 = T$.


Gibbs Sampling

From the initial state $X^0$ ($x_5 = x_6 = T$ fixed; $x_1 = F$, $x_2 = T$, $x_3 = T$, $x_4 = T$), update the value of $x_4$.


Gibbs Sampling

New state $X^1$: $x_1 = F$, $x_2 = T$, $x_3 = T$, $x_4 = F$ (resampled), $x_5 = T$, $x_6 = T$.


Gibbs Sampling

From $X^1$ ($x_4 = F$, $x_5 = T$, $x_6 = T$), update the value of $x_3$.


Gibbs Sampling

New state $X^2$: $x_3 = T$ (resampled), $x_4 = F$, $x_5 = T$, $x_6 = T$.


Gibbs Sampling

After many reassignments, the states $X^n, X^{n+1}, \ldots$ become samples from the desired $P(X_{\text{rest}} \mid e)$.


Gibbs Sampling

Keep resampling each variable using the values of the variables in its local neighborhood (Markov blanket), e.g.

$$P(X_4 \mid x_2, x_3, x_5, x_6)$$


Gibbs Sampling

  • Gibbs sampling takes advantage of the graphical model structure.
  • The Markov blanket makes the variable independent from the rest of the network:

$$P(X_4 \mid x_2, x_3, x_5, x_6)$$
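A minimal sketch of the resampling loop, using a made-up joint over two binary variables instead of the slides' network (for a BBN, each conditional would come from the variable's Markov blanket):

```python
import numpy as np

# Hypothetical joint P(x1, x2) over two binary variables as a 2x2 table.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])   # joint[x1, x2], entries sum to 1

def gibbs(joint, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = 0, 0
    samples = []
    for _ in range(n_steps):
        # Resample x1 from P(x1 | x2): a column of the table, renormalized.
        p = joint[:, x2] / joint[:, x2].sum()
        x1 = rng.choice(2, p=p)
        # Resample x2 from P(x2 | x1): a row of the table, renormalized.
        p = joint[x1, :] / joint[x1, :].sum()
        x2 = rng.choice(2, p=p)
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs(joint, 20000)
# Empirical frequencies (after burn-in) approach the joint table.
freq = np.zeros((2, 2))
for a, b in samples[2000:]:
    freq[a, b] += 1
freq /= freq.sum()
print(freq)
```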


Building a Markov Chain

  • A reversible Markov chain: a sufficient, but not necessary, condition to ensure that a particular $q(x)$ is the invariant distribution of transition matrix $P$ is the following reversibility (detailed balance) condition:

$$q(x_i)\, P(x_{i+1} \mid x_i) = q(x_{i+1})\, P(x_i \mid x_{i+1})$$

  • Metropolis-Hastings algorithm:
    – builds a reversible Markov chain
    – uses a proposal distribution to generate candidate states
      • either accept a candidate and take a transition to state $x'$
      • or reject it and stay at the current state $x$


Building a Markov Chain

  • Metropolis-Hastings algorithm:
    – builds a reversible Markov chain
    – uses a proposal distribution (similar to the proposal distribution in importance sampling) to generate candidates $x'$
  • A proposal distribution $Q$ defines proposal transitions $T_Q(x \to x')$.
  • Example: uniform over the values of the variables.
    – Either accept a proposal and take a transition to state $x'$
    – or reject it and stay at the current state $x$.
  • Acceptance probability: $A(x \to x')$


Building a Markov Chain

  • Transition for the MH:

$$T(x \to x') = T_Q(x \to x')\, A(x \to x') \quad \text{if } x' \ne x$$

$$T(x \to x) = T_Q(x \to x) + \sum_{x' \ne x} T_Q(x \to x')\,\big(1 - A(x \to x')\big) \quad \text{otherwise}$$

  • From the reversibility condition: $q(x)\, T(x \to x') = q(x')\, T(x' \to x)$
  • We get:

$$A(x \to x') = \min\left[1,\; \frac{q(x')\, T_Q(x' \to x)}{q(x)\, T_Q(x \to x')}\right]$$
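The accept/reject rule can be sketched for a continuous target, assuming a symmetric Gaussian random-walk proposal (so the $T_Q$ ratio cancels); note $q$ only needs to be evaluable up to a normalizing constant:

```python
import math
import random

def q_unnorm(x):
    """Unnormalized target: a standard Gaussian density up to a constant."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_steps, step=1.0, seed=0):
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step)   # symmetric proposal T_Q
        # A(x -> x') = min(1, q(x') T_Q(x'->x) / (q(x) T_Q(x->x')));
        # the T_Q terms cancel for a symmetric proposal.
        a = min(1.0, q_unnorm(x_prop) / q_unnorm(x))
        if rng.random() < a:
            x = x_prop                      # accept: transition to x'
        # else: reject, stay at the current state x
        samples.append(x)
    return samples

samples = metropolis_hastings(50000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))   # close to the N(0, 1) moments
```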


Building a Markov Chain

  • Comparing Metropolis-Hastings with Gibbs sampling:
    – Gibbs sampling is a special case of MH for which the acceptance probability is 1:

$$A\big((u_i, x_i) \to (u_i, x_i')\big) = \min\left[1,\; \frac{P(x_i' \mid u_i)\, T_Q\big((u_i, x_i') \to (u_i, x_i)\big)}{P(x_i \mid u_i)\, T_Q\big((u_i, x_i) \to (u_i, x_i')\big)}\right] = \min\left[1,\; \frac{P(x_i' \mid u_i)\, P(x_i \mid u_i)}{P(x_i \mid u_i)\, P(x_i' \mid u_i)}\right] = \min[1, 1] = 1$$

(here $u_i$ denotes the values of all variables other than $X_i$, and the Gibbs proposal resamples $x_i'$ from $P(X_i \mid u_i)$).


Metropolis Hastings algorithm

  • Assumptions:
    – we can't draw samples directly from $q(x)$
    – we can evaluate $q(x)$ for any $x$
  • We use a Markov chain that moves to a candidate $x^*$ with acceptance probability

$$\mathcal{A}(x, x^*) = \min\left[1,\; \frac{q(x^*)\, p(x \mid x^*)}{q(x)\, p(x^* \mid x)}\right]$$

  • The transition kernel defined by this process satisfies the detailed balance condition.


Mixing Time in Using Markov Chain

  • Mixing time: the number of steps $n$ we take until we collect a sample from the target distribution.
  • The chain $X_1 \to X_2 \to \cdots \to X_n$ (applying local rules at each step) covers the mixing time; the subsequent states $X_{n+1}, X_{n+2}, \ldots$ are samples from the desired $P(X \mid e)$.


Summary

  • The Markov chain Monte Carlo method attempts to generate samples from the posterior distribution.
  • The Metropolis-Hastings algorithm is a general scheme for specifying a Markov chain.
  • Gibbs sampling is a special case that takes advantage of the network structure (Markov blanket).

CS 3750 Machine Learning

Variational approximations


Variational approximation

Assume we have a function $f(Z)$ that is hard to calculate. Example: the posterior probability $P(Z \mid X)$ in a complex BBN; this inference can be very hard.

Idea: replace calculations of $f(Z)$ with an optimization over a simpler parametric function $q(Z \mid \lambda)$:

$$f(Z) \approx \max_{\lambda}\, q(Z \mid \lambda)$$


Variational lower bound

Let $X$ denote observed variables and $Z$ denote target variables. Assume some distribution $Q(Z \mid X)$ defined by parameters $\lambda$.

From $P(Z \mid X) = \dfrac{P(Z, X)}{P(X)}$:

$$\log P(X) = \log P(Z, X) - \log P(Z \mid X)$$

Average both sides with $Q(Z \mid X)$:

$$\sum_Z Q(Z \mid X) \log P(X) = \sum_Z Q(Z \mid X) \log P(Z, X) - \sum_Z Q(Z \mid X) \log P(Z \mid X)$$

$$\log P(X) = E_Q[\log P(Z, X)] - E_Q[\log P(Z \mid X)]$$


Variational lower bound

$$\log P(X) = \sum_Z Q(Z \mid X) \log P(Z, X) - \sum_Z Q(Z \mid X) \log P(Z \mid X)$$

Add and subtract $\sum_Z Q(Z \mid X) \log Q(Z \mid X)$:

$$\log P(X) = \sum_Z Q(Z \mid X) \log P(Z, X) - \sum_Z Q(Z \mid X) \log Q(Z \mid X) + \sum_Z Q(Z \mid X) \log Q(Z \mid X) - \sum_Z Q(Z \mid X) \log P(Z \mid X)$$

Kullback-Leibler divergence (a distance between two distributions):

$$KL(Q \parallel P) = \sum_Z Q(Z \mid X) \log Q(Z \mid X) - \sum_Z Q(Z \mid X) \log P(Z \mid X)$$

Functional (evidence lower bound, or ELBO):

$$F(Q, P) = \sum_Z Q(Z \mid X) \log P(Z, X) - \sum_Z Q(Z \mid X) \log Q(Z \mid X)$$

Together:

$$\log P(X) = F(Q, P) + KL(Q \parallel P)$$
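The decomposition $\log P(X) = F(Q, P) + KL(Q \parallel P)$ can be verified numerically for a tiny discrete model (the tables below are made up for illustration):

```python
import math

# Hypothetical joint P(Z, X = x_obs) over a 3-valued latent Z, for one observed X.
p_joint = [0.10, 0.25, 0.05]                    # P(Z = z, X = x_obs)
p_x = sum(p_joint)                              # P(X = x_obs) = 0.40
p_post = [p / p_x for p in p_joint]             # P(Z | X = x_obs)

q = [0.3, 0.5, 0.2]                             # some approximation Q(Z | X)

# F(Q, P) = sum_Z Q log P(Z, X) - sum_Z Q log Q   (the ELBO)
F = sum(qi * (math.log(pi) - math.log(qi)) for qi, pi in zip(q, p_joint))
# KL(Q || P) = sum_Z Q log Q - sum_Z Q log P(Z | X)
KL = sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p_post))

assert KL >= 0                                  # KL is never negative
assert abs(F + KL - math.log(p_x)) < 1e-12      # log P(X) = F + KL
print(F, KL, math.log(p_x))
```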


Variational lower bound

$$\log P(X) = F(Q, P) + KL(Q \parallel P)$$

  • $KL(Q \parallel P)$ is the distance between $Q(Z \mid X)$ and $P(Z \mid X)$. It is always $\ge 0$ and equals 0 iff $Q(Z \mid X) = P(Z \mid X)$.
  • We can optimize the approximation $Q(Z \mid X)$ by minimizing $\min_{\lambda} KL(Q \parallel P)$.
  • Since $\log P(X)$ does not depend on $\lambda$, we can also do this by maximizing $\max_{\lambda} F(Q, P)$, where

$$F(Q, P) = \sum_Z Q(Z \mid X) \log P(Z, X) - \sum_Z Q(Z \mid X) \log Q(Z \mid X)$$

This is often much easier.


Latent variable models

Let $X$ denote observed variables $x$ and $Z$ denote hidden (latent) variables $z$.

Inference against the direction of the links is hard: $P(Z \mid X)$.
Solution: define a simpler distribution $Q(Z \mid X)$ to approximate $P(Z \mid X)$ and optimize $\max_{\lambda} F(Q, P)$, where

$$F(Q, P) = \sum_Z Q(Z \mid X) \log P(X \mid Z) + \sum_Z Q(Z \mid X) \log P(Z) - \sum_Z Q(Z \mid X) \log Q(Z \mid X)$$


Mean field approximation

How do we construct the approximation $Q(Z \mid X)$? Mean field approximation:

$$Q(Z \mid X) = \prod_i Q_i(Z_i \mid \lambda_i)$$

Substituting into the objective:

$$\max_{\lambda} F(Q, P) = \max_{\lambda} \sum_{Z_1, Z_2, \ldots} Q(Z \mid X) \log P(X, Z) - \sum_{Z_1, Z_2, \ldots} Q(Z \mid X) \log Q(Z \mid X)$$

$$= \max_{\lambda} \sum_{Z_1, Z_2, \ldots} \left[\prod_i Q_i(Z_i \mid \lambda_i)\right] \log P(X, Z) - \sum_{Z_1, Z_2, \ldots} \left[\prod_i Q_i(Z_i \mid \lambda_i)\right] \log \prod_i Q_i(Z_i \mid \lambda_i)$$


Latent variable models

Let $X$ denote observed variables and $Z$ denote hidden (latent) variables, with $Q(Z \mid X) = \prod_i Q_i(Z_i \mid \lambda_i)$:

$$\max_{\lambda} F(Q, P) = \max_{\lambda} \sum_{Z_1, Z_2, \ldots} \left[\prod_i Q_i(Z_i \mid \lambda_i)\right] \log P(X \mid Z) + \sum_{Z_1, Z_2, \ldots} \left[\prod_i Q_i(Z_i \mid \lambda_i)\right] \log P(Z) - \sum_{Z_1, Z_2, \ldots} \left[\prod_i Q_i(Z_i \mid \lambda_i)\right] \log \prod_i Q_i(Z_i \mid \lambda_i)$$


Latent variable models

Let $X$ denote observed variables and $Z$ denote hidden (latent) variables. The factored $Q$ lets the last two terms decompose into per-variable sums:

$$\max_{\lambda} F(Q, P) = \max_{\lambda} \sum_{Z_1, Z_2, \ldots} \left[\prod_i Q_i(Z_i \mid \lambda_i)\right] \log P(X \mid Z) + \sum_i \sum_{Z_i} Q_i(Z_i \mid \lambda_i) \log P(Z_i) - \sum_i \sum_{Z_i} Q_i(Z_i \mid \lambda_i) \log Q_i(Z_i \mid \lambda_i)$$

(the middle term decomposes this way when the prior factorizes, $P(Z) = \prod_i P(Z_i)$).

Express $F$ analytically, differentiate with respect to the parameters $\lambda_i$ and set the derivatives to 0. This yields mean field equations that can be used to get the optimal set of parameters $\lambda$.
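As a sketch of what the mean field equations look like in the simplest case: for a made-up 2x2 joint over two binary latents, setting the derivative to zero gives the coordinate updates $Q_i(z_i) \propto \exp\big(E_{Q_j}[\log P(x, z_i, z_j)]\big)$, which can be iterated to convergence (this toy model and its numbers are assumptions for illustration, not from the lecture):

```python
import math

# Hypothetical joint P(X = x_obs, Z1, Z2) as a 2x2 log-table over (Z1, Z2).
log_p = [[math.log(0.02), math.log(0.18)],
         [math.log(0.32), math.log(0.08)]]   # probabilities sum to P(x_obs) = 0.60

q1 = [0.5, 0.5]    # Q1(Z1), initialized uniformly
q2 = [0.5, 0.5]    # Q2(Z2)

for _ in range(50):
    # Update Q1: Q1(z1) proportional to exp( sum_z2 Q2(z2) log P(x, z1, z2) )
    w = [math.exp(sum(q2[z2] * log_p[z1][z2] for z2 in range(2))) for z1 in range(2)]
    q1 = [wi / sum(w) for wi in w]
    # Update Q2 symmetrically.
    w = [math.exp(sum(q1[z1] * log_p[z1][z2] for z1 in range(2))) for z2 in range(2)]
    q2 = [wi / sum(w) for wi in w]

# ELBO F(Q, P) = E_Q[log P(x, Z)] + H(Q1) + H(Q2); it lower-bounds log P(x_obs).
F = sum(q1[a] * q2[b] * log_p[a][b] for a in range(2) for b in range(2))
F -= sum(p * math.log(p) for p in q1 + q2)
print(F, math.log(0.60))
```

Each coordinate update can only increase $F$, so the iteration converges; the remaining gap between $F$ and $\log P(x_{\text{obs}})$ is the KL divergence the factored $Q$ cannot remove.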