

  1. STAT 339 Markov Chain Monte Carlo (MCMC) 7 April 2017 Some theory and intuition about MCMC generally

  2. Outline
  ▸ Gibbs Sampling in Mixture Models
  ▸ Sampling as an Approximation Method
  ▸ Gibbs Sampling Intuition and Motivation
  ▸ Concretely: The Algorithm
  ▸ A Bit of MCMC Theory



  5. Intractable Posterior
  ▸ It is easy to write down the unnormalized posterior density:
    p(θ ∣ Y) ∝ p(θ) p(Y ∣ θ)
  ▸ GMM case:
    p(θ ∣ Y) ∝ p(π) [∏_{k=1}^K p(µ_k, Σ_k)] ∏_{n=1}^N ∑_{k=1}^K π_k N(y_n ∣ µ_k, Σ_k)
    where, e.g., p(π) is Dirichlet and p(µ_k, Σ_k) is Normal-Inverse-Wishart (or p(µ_k, Σ_k^{−1}) is Normal-Wishart)
  ▸ But it is hard to do much with it (except maybe find a local maximum)
  ▸ Solution: try to draw samples from it.
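Although the normalizing constant is intractable, the unnormalized log posterior is easy to evaluate pointwise. A minimal sketch for a 1-D GMM, assuming (for brevity, these are not the slides' exact choices) a Dirichlet(α) prior on π and flat priors on the component means and variances:

```python
import numpy as np

def log_normal_pdf(y, mu, sigma2):
    """Log density of N(y | mu, sigma2), with scalar variance sigma2."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / sigma2)

def log_unnorm_posterior(y, pi, mus, sigma2s, alpha=1.0):
    """log p(theta | Y) up to an additive constant for a 1-D GMM."""
    log_prior = np.sum((alpha - 1.0) * np.log(pi))   # Dirichlet kernel on pi
    # log-likelihood: sum_n log sum_k pi_k N(y_n | mu_k, sigma2_k),
    # with the inner sum done stably via logaddexp
    comp = np.log(pi) + log_normal_pdf(y[:, None], mus[None, :], sigma2s[None, :])
    return log_prior + np.sum(np.logaddexp.reduce(comp, axis=1))
```

This is exactly the quantity a local optimizer could climb; sampling methods instead use it (through its conditionals) to explore the whole posterior.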

  6. When can we sample from but not analyze a distribution?
  ▸ Consider trying to evaluate P(Straight) in poker (say)
  ▸ You might be able to do this analytically, but if you have a computer handy, it is easy to simulate dealing out several million poker hands
  ▸ Then, simply count:
    P(Straight) ≈ (1/S) ∑_{s=1}^S I(Did I get a straight in hand s?)
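The counting estimate above can be run directly; a small sketch (the hand-checking logic is one reasonable implementation, counting ace-high and ace-low straights and including straight flushes):

```python
import random

def is_straight(hand):
    """True if the 5-card hand contains five consecutive ranks (ace high or low)."""
    ranks = sorted({rank for rank, suit in hand})
    if len(ranks) < 5:                 # a paired hand cannot be a straight
        return False
    if ranks == [2, 3, 4, 5, 14]:      # the "wheel": A-2-3-4-5
        return True
    return ranks[4] - ranks[0] == 4    # five distinct, consecutive ranks

random.seed(0)
deck = [(rank, suit) for rank in range(2, 15) for suit in "CDHS"]  # 14 = ace
S = 200_000
hits = sum(is_straight(random.sample(deck, 5)) for _ in range(S))
print(hits / S)   # close to the exact value 10240/2598960 ≈ 0.00394
```

The estimate's accuracy grows with S, which is exactly the property MCMC relies on once we can draw samples from the posterior.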

  7. Outline
  ▸ Gibbs Sampling in Mixture Models
  ▸ Sampling as an Approximation Method
  ▸ Gibbs Sampling Intuition and Motivation
  ▸ Concretely: The Algorithm
  ▸ A Bit of MCMC Theory

  8. Gibbs Sampling as "Stochastic EM"
  ▸ The EM algorithm involves, iteratively:
    1. Computing an expectation over cluster assignments (using the posterior, conditioned on parameter values)
    2. Arg-maximizing parameter values (using the likelihood/posterior, conditioned on expected cluster assignments)
  ▸ Gibbs sampling (in this context) involves, iteratively:
    1. Sampling cluster assignments (using the posterior, conditioned on parameter values)
    2. Sampling parameter values (using the posterior, conditioned on cluster assignments)

  9. What does Gibbs Yield?
  ▸ Over many iterations, Gibbs sampling yields a collection of many cluster assignments and many parameter values, distributed according to the joint posterior:
    { {z_n^(s)}_{n=1}^N , π^(s) , {µ_k^(s), Σ_k^(s)}_{k=1}^K }_{s=1}^S
  ▸ These can be used to approximate more or less any function of the assignments and parameters we want
  ▸ Example:
    p(z_n = z_m ∣ Y) ≈ (1/S) ∑_{s=1}^S I(z_n^(s) = z_m^(s))
  ▸ Or, for future data:
    p(z_new = z_m ∣ y_new, Y) ≈ (1/S) ∑_{s=1}^S [ π^(s)_{z_m^(s)} N(y_new ∣ µ^(s)_{z_m^(s)}, Σ^(s)_{z_m^(s)}) / ∑_{k=1}^K π_k^(s) N(y_new ∣ µ_k^(s), Σ_k^(s)) ]
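The first example is a one-liner once the samples are collected; here z_samples is synthetic, standing in for the (S × N) array of assignments a real Gibbs run would produce:

```python
import numpy as np

# Estimate p(z_n = z_m | Y) as the fraction of posterior samples in which
# points n and m land in the same cluster.
rng = np.random.default_rng(0)
z_samples = rng.integers(0, 3, size=(1000, 5))   # hypothetical: S=1000, N=5, K=3

def coclustering_prob(z_samples, n, m):
    return np.mean(z_samples[:, n] == z_samples[:, m])

print(coclustering_prob(z_samples, 0, 1))  # near 1/3 for uniform random assignments
```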

  10. Outline
  ▸ Gibbs Sampling in Mixture Models
  ▸ Sampling as an Approximation Method
  ▸ Gibbs Sampling Intuition and Motivation
  ▸ Concretely: The Algorithm
  ▸ A Bit of MCMC Theory

  11. Concretely: The Algorithm
  ▸ In general, Gibbs sampling applies when we have multiple sets of unknown parameters, but it is difficult to sample all of them directly from their joint distribution.
  ▸ Mixture model: the parameters are z, π, and θ
  ▸ Idea: partition the parameters into "blocks", then iteratively fix all but one block and sample the remaining block from its posterior conditioned on the others
  ▸ Here, we need:
    1. p(z ∣ Y, θ, π)
    2. p(θ, π ∣ z, Y)

  12. Summary: Gibbs Algorithm for GMM
  1. Initialize cluster assignments z^(0) (e.g., using K-means)
  2. For s = 1, ..., S, sample:
    (a) µ_k^(s), Σ_k^(s) ∼ p(µ_k, Σ_k ∣ z^(s−1), Y), for k = 1, ..., K
        (or, alternatively, subdivide into two blocks, with µ_k and Σ_k sampled separately given the other)
    (b) π^(s) ∼ p(π ∣ z^(s−1))
    (c) z_n^(s) ∼ p(z_n ∣ π^(s), θ^(s), y_n), for n = 1, ..., N

  13. Specific Conditional Distributions
  Assuming conjugate priors and Σ_k restricted to diagonal, with precision elements h_{k,d} = σ_{k,d}^{−2}:
    p(µ_{k,d} ∣ h_{k,d}, z, Y) = N(µ_post, h_post^{−1}), where
      h_post = h_prior + n_k h_{k,d}
      µ_post = h_post^{−1} (h_prior µ_prior + n_k h_{k,d} ȳ_{k,d})
    p(h_{k,d} ∣ µ_k, z, Y) = Gamma(a_post, b_post), where
      a_post = a_prior + n_k / 2
      b_post = b_prior + (1/2) ∑_{n : z_n = k} (y_{n,d} − µ_{k,d})²
    p(π ∣ z, Y) = Dir(α_post,1, ..., α_post,K), where
      α_post,k = α_prior,k + n_k
  Each of these can be sampled from with readily available methods (e.g., numpy.random.dirichlet())
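These three conditional draws can be sketched directly with numpy; the hyperparameter values below are placeholders, not values from the slides, and note that numpy's Gamma is parameterized by scale = 1/rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior hyperparameters
mu_prior, h_prior = 0.0, 1.0     # Normal prior on each mu_{k,d}: mean, precision
a_prior, b_prior = 1.0, 1.0      # Gamma prior on each precision h_{k,d}

def sample_mu(y_kd, h_kd):
    """Draw mu_{k,d} | h_{k,d}, z, Y; y_kd holds dimension d of cluster k's points."""
    n_k = len(y_kd)
    h_post = h_prior + n_k * h_kd
    mu_post = (h_prior * mu_prior + n_k * h_kd * y_kd.mean()) / h_post
    return rng.normal(mu_post, 1.0 / np.sqrt(h_post))

def sample_h(y_kd, mu_kd):
    """Draw h_{k,d} | mu_k, z, Y from its Gamma conditional."""
    a_post = a_prior + len(y_kd) / 2
    b_post = b_prior + 0.5 * np.sum((y_kd - mu_kd) ** 2)
    return rng.gamma(a_post, 1.0 / b_post)   # scale = 1 / rate

def sample_pi(z, K, alpha_prior=1.0):
    """Draw pi | z from Dir(alpha_prior + n_1, ..., alpha_prior + n_K)."""
    counts = np.bincount(z, minlength=K)
    return rng.dirichlet(alpha_prior + counts)
```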

  14. Conditional for z
  We have already worked out that
    p(z_n = k ∣ π, θ, Y) = Categorical(q_n1, ..., q_nK), where
    q_nk = π_k N(y_n ∣ µ_k, Σ_k) / ∑_{k′=1}^K π_{k′} N(y_n ∣ µ_{k′}, Σ_{k′})
  Conditioned on the cluster parameters, all z_n are independent. Sampling from this distribution is just a matter of enumerating the probabilities, or using a library function for categorical distributions (e.g., numpy.random.choice())
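A sketch of this draw for the diagonal-covariance case, computed in log space before normalizing (a standard precaution, since the unnormalized q_nk can underflow):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(y, pi, mus, sigma2s):
    """Draw one z_n ~ Categorical(q_n1, ..., q_nK) for a diagonal-covariance GMM.
    y has shape (D,); mus and sigma2s have shape (K, D)."""
    log_w = (np.log(pi)
             - 0.5 * np.sum(np.log(2 * np.pi * sigma2s), axis=1)
             - 0.5 * np.sum((y - mus) ** 2 / sigma2s, axis=1))
    q = np.exp(log_w - log_w.max())   # subtract max before exponentiating
    q /= q.sum()
    return rng.choice(len(pi), p=q)
```

Because the z_n are conditionally independent, this can be called once per data point (or vectorized across n).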

  15. Gibbs Sampling More Generally
  ▸ Gibbs is a very general algorithm that can be used to sample from complicated probability distributions
  ▸ In general, suppose we have p(θ) = p(θ_1, ..., θ_M), where each θ_m is a collection of random variables. We can generate samples from this distribution by:
    1. Initializing each θ_m
    2. For s = 1, ..., S; for m = 1, ..., M:
       sample θ_m^(s) from p(θ_m ∣ θ_1^(s), ..., θ_{m−1}^(s), θ_{m+1}^(s−1), ..., θ_M^(s−1))
  ▸ Note that we can partition θ into the θ_m however we like, as long as certain conditions are met (more on this later)
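The scheme above can be illustrated on a classic toy target (an assumed example, not from the slides): a bivariate standard normal with correlation ρ, where both full conditionals are univariate normals:

```python
import numpy as np

# Gibbs sampling (x, y) ~ bivariate normal with correlation rho = 0.8:
#   x | y ~ N(rho * y, 1 - rho^2),  y | x ~ N(rho * x, 1 - rho^2)
rng = np.random.default_rng(0)
rho, S = 0.8, 20_000
sd = np.sqrt(1 - rho ** 2)
x = y = 0.0
samples = np.empty((S, 2))
for s in range(S):
    x = rng.normal(rho * y, sd)   # theta_1 ~ p(theta_1 | theta_2)
    y = rng.normal(rho * x, sd)   # theta_2 ~ p(theta_2 | theta_1)
    samples[s] = x, y
print(np.corrcoef(samples[1000:].T)[0, 1])   # close to rho = 0.8
```

Each sweep only ever draws from one-dimensional conditionals, yet the collected pairs have the correct joint distribution.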

  16. Application: Image De-Noising
  Consider a generative model of noisy images in which θ represents the pixels of the "pure" image, and Y is the observed "corrupted" image.

  17. Application: Image De-Noising
  ▸ We can define a prior on θ that encourages neighboring pixels to be similar, and a likelihood on Y that encourages each observed pixel to be close to the corresponding one in the pure image.
  ▸ Then, we Gibbs sample each pixel θ_m conditioned on the image and the other θs:
    p(θ_m ∣ θ_−m, Y) ∝ p(θ_m ∣ θ_−m) p(y_m ∣ θ_m)
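A minimal sketch of this sampler under an Ising-style pairwise prior (an assumed concrete choice; the slides do not fix the prior). Pixels θ_m ∈ {−1, +1}; the prior rewards agreement with the four neighbors, the likelihood rewards agreement with the observed pixel y_m, and the strengths J and L are hypothetical values:

```python
import numpy as np

rng = np.random.default_rng(0)
J, L = 1.0, 2.0   # coupling (prior) and likelihood strengths, chosen for illustration

def gibbs_denoise(y, sweeps=20):
    """Gibbs sample each pixel in turn, conditioned on its neighbors and y."""
    theta = y.copy()
    H, W = y.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                nb = sum(theta[a, b]
                         for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < H and 0 <= b < W)
                # p(theta_ij = +1 | rest) from the log-odds of the two states
                logit = 2.0 * (J * nb + L * y[i, j])
                theta[i, j] = 1 if rng.random() < 1 / (1 + np.exp(-logit)) else -1
    return theta

clean = np.ones((20, 20), dtype=int)
clean[:, :10] = -1                  # two-tone test image
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)  # flip ~10% of pixels
restored = gibbs_denoise(noisy)
print((noisy != clean).mean(), (restored != clean).mean())
```

The final state of the chain is one posterior sample of the clean image; averaging θ over many sweeps would give per-pixel posterior marginals instead.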

  18. Results from Two Priors

  19. Outline
  ▸ Gibbs Sampling in Mixture Models
  ▸ Sampling as an Approximation Method
  ▸ Gibbs Sampling Intuition and Motivation
  ▸ Concretely: The Algorithm
  ▸ A Bit of MCMC Theory

  20. Markov Chain Monte Carlo
  ▸ MCMC gets its name from the fact that it is a method for generating random samples (the "Monte Carlo" part) and the fact that each sample depends on exactly one previous sample (the "Markov Chain" part)
  ▸ In general, it involves sampling from q(θ^(s) ∣ θ^(s−1)) for a suitable "transition kernel" q.
  ▸ Under certain conditions, the θ^(s) for large s are distributed according to a distribution p(θ) that is determined by q.
  ▸ Idea: construct a q so that p is the target distribution we want to get samples from.

  21. Markov Chains
  ▸ Simple Markov chain: two possible states, {rain, sun}, at each time step t.
  ▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix
                rain  sun
    Q =  rain ( 0.7   0.3 )
         sun  ( 0.1   0.9 )
    where rows represent the "from" state and columns represent the "to" state. That is, q(rain ∣ rain) = 0.7, q(sun ∣ rain) = 0.3, etc.

  22. Markov Chains
  ▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix
                rain  sun
    Q =  rain ( 0.7   0.3 )
         sun  ( 0.1   0.9 )
  ▸ Question: If it is sunny today, what is the probability that it will be sunny tomorrow?
  ▸ If it is sunny today, what is the probability that it will be rainy two days from now?

  23. Markov Chains
  ▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix
                rain  sun
    Q =  rain ( 0.7   0.3 )
         sun  ( 0.1   0.9 )
  ▸ If it is sunny today, what is the probability that it will be rainy two days from now?
  ▸ We have
    P(θ_{t+2} = rain ∣ θ_t = sun) = ∑_s P(θ_{t+1} = s, θ_{t+2} = rain ∣ θ_t = sun)
      = ∑_s q(s ∣ sun) q(rain ∣ s)
      = (Q²)_{2,1} = (0.1, 0.9) ⋅ (0.7, 0.1)ᵀ = 0.07 + 0.09 = 0.16

  24. Markov Chains
  ▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix
                rain  sun
    Q =  rain ( 0.7   0.3 )
         sun  ( 0.1   0.9 )
  ▸ If it is sunny today, what is the probability that it will be rainy a year from now?
  ▸ We have the recurrence relation
    P(θ_{t+m} ∣ θ_t) = ∑_s P(θ_{t+m−1} = s, θ_{t+m} ∣ θ_t)
  ▸ This is the matrix product of the (m−1)-step transition matrix with the one-step transition matrix.
  ▸ So the m-step transition matrix is just Q^m
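The m-step probabilities can be checked numerically with a matrix power, for the rain/sun chain above:

```python
import numpy as np

Q = np.array([[0.7, 0.3],    # from rain: P(rain), P(sun)
              [0.1, 0.9]])   # from sun
two_step = np.linalg.matrix_power(Q, 2)
print(two_step[1, 0])        # P(rain two days out | sun today) = 0.16
year = np.linalg.matrix_power(Q, 365)
print(year)                  # every row is close to (0.25, 0.75)
```

After a year, the rows of Q^m are indistinguishable: the starting state no longer matters.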

  25. Simulation of Q^∞
  ▸ What does the "many-step" transition matrix look like?
  ▸ Simulating the chain, we find that regardless of the starting state, there is a 1/4 chance of rain and a 3/4 chance of sun on any day far enough in the future.
  ▸ This (1/4, 3/4) is called the stationary distribution of the transition matrix Q
  ▸ Mathematically, the row vector π is stationary w.r.t. Q if πQ = π
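The stationarity condition πQ = π can be verified directly, and π can also be recovered as the left eigenvector of Q with eigenvalue 1 (for this rain/sun chain):

```python
import numpy as np

Q = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.25, 0.75])
print(pi @ Q)                       # (0.25, 0.75): pi is stationary

w, V = np.linalg.eig(Q.T)           # left eigenvectors of Q = eigenvectors of Q^T
v = np.real(V[:, np.argmax(np.real(w))])  # eigenvector for the largest eigenvalue (1)
print(v / v.sum())                  # normalizes to (0.25, 0.75)
```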

  26. MCMC In General
  ▸ Gibbs sampling (and other MCMC algorithms) constructs a transition kernel Q (in Gibbs, subdivided into smaller transitions) so that the stationary distribution is the target distribution.
  ▸ Hence, after a sufficiently long burn-in, we get samples drawn from the posterior.
