

SLIDE 1

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Andrew Gordon Wilson

www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015

1 / 45

SLIDE 2

Resources and Attribution

Image credits, inspiration, and pointers for further reading (SL. = slide numbers):

  • 1. Xing (2014). Markov chain Monte Carlo. Lectures on Probabilistic Graphical Models. (SL. 15)
  • 2. Murray (2009). Markov chain Monte Carlo. Cambridge Machine Learning Summer School. (SL. 7, 11, 12, 13, 16, 19, 20, 24, 36, 37)
  • 3. Murray (2007). Advances in Markov chain Monte Carlo Methods. PhD Thesis. (Detailed descriptions of material in the above reference.)
  • 4. Bishop (2006). Pattern Recognition and Machine Learning (PRML). (SL. 8, 14, 22, 31, 35)
  • 5. Geweke (2004). Getting it right: joint distribution tests of posterior simulators. JASA 99(467): 799-804. (SL. 34)
  • 6. MacKay (2003). Information Theory, Inference, and Learning Algorithms. (SL. 25, 43)
  • 7. Rasmussen (2000). The Infinite Gaussian Mixture Model. NIPS. (SL. 27)

SLIDE 3

Monte Carlo

Monte Carlo approximates expectations with sums formed from sampling:

E[f(x)] = ∫ f(x) p(x) dx ≈ (1/J) Σ_{j=1}^{J} f(x^{(j)}) ,   x^{(j)} ∼ p(x)   (1)
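Equation (1) in a few lines of Python; the choices f(x) = x² and p(x) a standard normal are just for illustration, so the true expectation is E[x²] = 1:

```python
import random

random.seed(0)

# Estimate E[f(x)] for f(x) = x^2 with x ~ N(0, 1); the true value is 1.
J = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(J)]  # x^(j) ~ p(x)
estimate = sum(x * x for x in samples) / J            # (1/J) sum_j f(x^(j))
```

With J = 100,000 samples the estimate lands within a few hundredths of the true value.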

SLIDE 4

Monte Carlo

Example: integrating away the weights in, e.g., a Bayesian neural network.

◮ Specify y(x) = f(x, w), for response y and input (predictor) x. Place a prior p(w) on the weights w.

◮ Infer the posterior p(w|D) ∝ p(D|w) p(w) given data D, where p(D|w) is the likelihood. Derive the predictive distribution at a test input x∗:

p(y∗|x∗, D) = ∫ p(y∗|w, x∗) p(w|D) dw   (2)
            ≈ (1/J) Σ_{j=1}^{J} p(y∗|w^{(j)}, x∗) ,   w^{(j)} ∼ p(w|D)   (3)

But how do we sample from p(w|D)?

SLIDE 5

Monte Carlo Warning!

Marginalisation via prior sampling is dangerous!

p(D) = ∫ p(D|w) p(w) dw   (4)
     ≈ (1/J) Σ_{j=1}^{J} p(D|w^{(j)}) ,   w^{(j)} ∼ p(w)   (5)

◮ Question: do you see the problem?

SLIDE 6

Monte Carlo Warning!

Marginalisation via prior sampling is dangerous!

p(D) = ∫ p(D|w) p(w) dw   (6)
     ≈ (1/J) Σ_{j=1}^{J} p(D|w^{(j)}) ,   w^{(j)} ∼ p(w)   (7)

◮ If you were to compute multiple finite approximations using this strategy, the variance between the approximations would be massive. Most of the time, samples from p(w) will have low likelihood, so only a small percentage of terms contribute significantly to the Monte Carlo sum.
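The problem is easy to see empirically. The sketch below uses a deliberately broad prior and a sharply peaked, entirely hypothetical one-parameter likelihood; it repeats the prior-sampling estimate of p(D) twenty times, and across runs the estimates typically differ by orders of magnitude because almost no prior draws land where the likelihood has mass:

```python
import math
import random

def likelihood(w, datum=5.0, noise=0.1):
    # p(D | w): a sharply peaked Gaussian likelihood around w = 5 (illustrative)
    return math.exp(-0.5 * ((datum - w) / noise) ** 2) / (noise * math.sqrt(2 * math.pi))

random.seed(1)
J = 500
estimates = []
for run in range(20):
    # Each run is an independent J-sample prior-sampling estimate of p(D), Eq. (7)
    ws = [random.gauss(0.0, 50.0) for _ in range(J)]  # w^(j) ~ p(w), a very broad prior
    estimates.append(sum(likelihood(w) for w in ws) / J)
```

Plotting or printing `estimates` shows the between-run spread that makes this strategy unreliable.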

SLIDE 7

Monte Carlo

Sampling from p(x) is equivalent to sampling uniformly in the area under p(x).

SLIDE 8

Monte Carlo Review

Suppose that x ∼ U(0, 1), and we have y = f(x). The distribution of y will be

p(y) = p(x) |dx/dy| = |dx/dy|   (8)

Integrating both sides, we find

x = g(y) = ∫_{−∞}^{y} p(y′) dy′   (9)

Therefore y = g^{−1}(x), where g is the CDF of y. To sample from p(y) we can sample from p(x) and then transform the samples with the inverse CDF of y. In other words, sampling uniformly under the curve of p(y) gives samples of y. This is the starting point for many sampling procedures.
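A concrete instance of this recipe: for an Exponential(rate) distribution the CDF g(y) = 1 − exp(−rate·y) inverts in closed form, so Uniform(0, 1) draws map directly to exponential draws (rate = 2 is an arbitrary choice):

```python
import math
import random

random.seed(0)
rate = 2.0  # Exponential(rate); CDF g(y) = 1 - exp(-rate * y)

def inverse_cdf(x, rate=rate):
    # g^{-1}(x): maps a Uniform(0, 1) draw to an Exponential(rate) draw
    return -math.log(1.0 - x) / rate

samples = [inverse_cdf(random.random()) for _ in range(100_000)]
mean = sum(samples) / len(samples)  # should be close to 1/rate = 0.5
```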

SLIDE 9

Monte Carlo Review

◮ Monte Carlo: Approximates expectations with sums

formed from sampling.

◮ Variables with uniform distribution under the curve of p(x)

are valid samples.

◮ Inverse CDF sampling: if X ∼ U(0, 1) and g(·) is the CDF of distribution G, then g^{−1}(X) ∼ G.

SLIDE 10

Monte Carlo Review

Rejection Sampling

  • 1. Bound the unnormalised p̃(x) with kq(x) ≥ p̃(x).
  • 2. Sample x0 from q(x).
  • 3. Sample u0 from U(0, kq(x0)).
  • 4. (x0, u0) has a uniform distribution under the curve kq(x).
  • 5. Accept x0 if u0 ≤ p̃(x0).

Importance Sampling

∫ f(x) p(x) dx = ∫ f(x) [p(x)/q(x)] q(x) dx ≈ (1/J) Σ_{j=1}^{J} [p(x^{(j)}) / q(x^{(j)})] f(x^{(j)}) ,   x^{(j)} ∼ q(x)
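The rejection-sampling steps above can be sketched directly. Here the unnormalised target p̃ is a standard normal without its constant, the proposal is q = N(0, 2²), and k = 6 is chosen so that kq(x) ≥ p̃(x) everywhere (the supremum of p̃/q is 2√(2π) ≈ 5.01); all these choices are illustrative:

```python
import math
import random

random.seed(0)

def p_tilde(x):
    # Unnormalised target: a standard normal without its 1/sqrt(2*pi) constant
    return math.exp(-0.5 * x * x)

def q_pdf(x, s=2.0):
    # Proposal density q(x) = N(0, s^2)
    return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2 * math.pi))

k = 6.0  # chosen so that k*q(x) >= p_tilde(x) for all x

accepted = []
while len(accepted) < 20_000:
    x0 = random.gauss(0.0, 2.0)               # step 2: x0 ~ q
    u0 = random.uniform(0.0, k * q_pdf(x0))   # step 3: u0 ~ U(0, k q(x0))
    if u0 <= p_tilde(x0):                     # step 5: accept if under the target curve
        accepted.append(x0)

sample_mean = sum(accepted) / len(accepted)   # target mean is 0
```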

SLIDE 11

Review: Sampling from a Bayes Net

Ancestral Sampling

◮ Sample top-level variables from their marginal distributions.
◮ Sample each node conditioned on the samples of its parent nodes.

Example: we wish to sample from P(A, B, C, D, E) = P(A)P(B)P(C|A, B)P(D|B, C)P(E|C, D).

◮ A ∼ P(A), B ∼ P(B), C ∼ P(C|A, B), D ∼ P(D|B, C), E ∼ P(E|C, D)
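The example above can be sketched with binary variables; every number in the conditional probability tables below is invented purely for illustration:

```python
import random

random.seed(0)

def bernoulli(p):
    return 1 if random.random() < p else 0

def ancestral_sample():
    # Hypothetical CPTs for the factorisation P(A)P(B)P(C|A,B)P(D|B,C)P(E|C,D)
    a = bernoulli(0.6)                       # A ~ P(A)
    b = bernoulli(0.3)                       # B ~ P(B)
    c = bernoulli(0.9 if a and b else 0.2)   # C ~ P(C | A, B)
    d = bernoulli(0.7 if b or c else 0.1)    # D ~ P(D | B, C)
    e = bernoulli(0.8 if c == d else 0.4)    # E ~ P(E | C, D)
    return a, b, c, d, e

draws = [ancestral_sample() for _ in range(10_000)]
freq_a = sum(s[0] for s in draws) / len(draws)  # should approach P(A=1) = 0.6
```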

SLIDE 12

Sampling from High Dimensional Distributions

We often can’t decompose P(x) into low-dimensional conditional distributions.

◮ Undirected graphical models: P(x) = (1/Z) ∏_i f_i(x).

◮ Posterior over a directed graphical model: P(A, B, C, D|E) = P(A, B, C, D, E) / P(E).

We often don’t know Z or P(E).

SLIDE 13

Monte Carlo Limitations: High Dimensional Distributions

Rejection and importance sampling scale badly with dimension. Suppose p(x) = N(0, I) and q(x) = N(0, σ²I) in D dimensions.

◮ We require σ ≥ 1.

◮ Rejection sampling has an acceptance rate

∫ [p(x) / (k q(x))] q(x) dx = 1/k .   (10)

Here we must set k = σ^D.

◮ The variance of the importance weights is (σ² / (2 − 1/σ²))^{D/2} − 1.

Generally, for kq(x) ≥ p(x), the ratio of the volume under p(x) to the volume under kq(x) shrinks to zero as D increases.

SLIDE 14

Markov chain Monte Carlo (MCMC)

◮ Markov chain Monte Carlo (MCMC) methods allow us to sample from a wide array of high dimensional distributions, even when we have very little information about these distributions.

(Figure: undirected graphical model for a Markov chain.)

SLIDE 15

Markov chain Monte Carlo (MCMC)

◮ MCMC methods allow us to sample from a wide array of high dimensional distributions.

◮ We sample from a transition probability z_{i+1} ∼ T(z_{i+1} | z_i), which depends on the current state z_i, through an adaptive proposal density q(z_{i+1}; z_i) and an acceptance rule. The samples z_1, z_2, . . . therefore form a Markov chain.

SLIDE 16

Example: Metropolis Algorithm

◮ Sample a proposal x′ from a Gaussian distribution N(x′; x, σ²).

◮ Accept with probability min(1, p(x′)/p(x)).

◮ If rejected, the next sample is a repeat of the previous one, x′ = x. This is unlike rejection or importance sampling, where rejected samples are discarded.

◮ Here we have an adaptive proposal distribution.
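The algorithm above fits in a dozen lines; the standard normal target and σ = 1 proposal are illustrative choices (note p only needs to be known up to a constant):

```python
import math
import random

random.seed(0)

def p(x):
    # Unnormalised target density: a standard normal
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, sigma=1.0, x0=0.0):
    x, chain = x0, []
    for _ in range(n_samples):
        x_prop = random.gauss(x, sigma)              # proposal from N(x'; x, sigma^2)
        if random.random() < min(1.0, p(x_prop) / p(x)):
            x = x_prop                               # accept
        chain.append(x)                              # on rejection, repeat the current state
    return chain

chain = metropolis(50_000)
mean = sum(chain) / len(chain)  # target mean is 0
```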

SLIDE 17

Markov chain Monte Carlo (MCMC)

Under what circumstances does the Markov chain converge to the desired distribution? First, some notation and terminology:

◮ Transition operator: T(z_{i+1} ← z_i) = P(z_{i+1} | z_i).

◮ A Markov chain is homogeneous if the transition probabilities are the same at every step.

◮ A distribution p∗(z) is invariant with respect to a Markov chain if

p∗(z) = Σ_{z′} T(z ← z′) p∗(z′) .   (11)

◮ A sufficient but not necessary condition for p∗(z) to be invariant is detailed balance:

p∗(z′) T(z ← z′) = p∗(z) T(z′ ← z) .   (12)

Exercise: prove that detailed balance implies invariance.
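The exercise can also be checked numerically on a small discrete chain. The 3-state target below is an arbitrary example; a Metropolis-style transition matrix is constructed to satisfy detailed balance, and both Eq. (12) and the invariance condition Eq. (11) are then verified:

```python
# Numerical check on a 3-state chain: build a Metropolis transition matrix
# for a target p* and verify both detailed balance and invariance.
p_star = [0.2, 0.3, 0.5]
n = len(p_star)

# Symmetric proposal (uniform over the 3 states) with Metropolis acceptance.
T = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            T[i][j] = (1.0 / n) * min(1.0, p_star[j] / p_star[i])  # T(j <- i)
    T[i][i] = 1.0 - sum(T[i])  # rejected proposals stay at state i

# Detailed balance: p*(i) T(j <- i) == p*(j) T(i <- j) for all pairs
db = all(abs(p_star[i] * T[i][j] - p_star[j] * T[j][i]) < 1e-12
         for i in range(n) for j in range(n))

# Invariance: applying T to p* returns p*
p_next = [sum(p_star[i] * T[i][j] for i in range(n)) for j in range(n)]
inv = all(abs(p_next[j] - p_star[j]) < 1e-12 for j in range(n))
```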

SLIDE 18

Detailed Balance

◮ A sufficient but not necessary condition for an invariant p∗(z) is detailed balance:

p∗(z′) T(z ← z′) = p∗(z) T(z′ ← z) .   (13)

What does detailed balance mean?

SLIDE 19

Detailed Balance

◮ A sufficient but not necessary condition for an invariant p∗(z) is detailed balance:

p∗(z′) T(z ← z′) = p∗(z) T(z′ ← z) .   (14)

What does detailed balance mean? It means that, at equilibrium, the transitions z ← z′ and z′ ← z are equally probable.

SLIDE 20

Reverse Operators

If T is stationary, we can define a reverse operator:

T̃(z ← z′) ∝ T(z′ ← z) p∗(z) = T(z′ ← z) p∗(z) / [Σ_z T(z′ ← z) p∗(z)] = T(z′ ← z) p∗(z) / p∗(z′)   (15)

Generalised Detailed Balance

T(z′ ← z) p∗(z) = T̃(z ← z′) p∗(z′)   (16)

◮ Generalised detailed balance is both sufficient and necessary for invariance.

◮ Operators satisfying detailed balance are their own reverse operator.

SLIDE 21

Markov chain Monte Carlo (MCMC)

◮ We wish to use Markov chains to sample from a given distribution.

◮ We can do this if

  • 1. The Markov chain leaves the distribution invariant: p∗(z) = Σ_{z′} T(z ← z′) p∗(z′).
  • 2. lim_{m→∞} p(z_m) = p∗(z) regardless of the initial distribution p(z_0) (ergodicity).

SLIDE 22

Creating Transition Operators

Some possibilities:

◮ Construct transition probabilities as a mixture of base transitions B_1, . . . , B_K:

T(z ← z′) = Σ_{k=1}^{K} α_k B_k(z ← z′)   (17)

◮ Or combine base transitions through successive application:

T(z ← z′) = Σ_{z_1} · · · Σ_{z_{K−1}} B_K(z ← z_{K−1}) · · · B_1(z_1 ← z′)   (18)

Question: under what conditions do invariance and detailed balance hold in each case?

SLIDE 23

Metropolis-Hastings (MH) Algorithm

  • 1. Sample a proposal: x′ ∼ q(x′; x); e.g. N(x′; x, σ²).
  • 2. Accept with probability min(1, [p(x′) q(x; x′)] / [p(x) q(x′; x)]).
  • 3. If rejected, the next state in the chain is a repeat of the current state (contrast with rejection sampling).

Questions

  • 1. Do we require that p(x) be normalised?
  • 2. Does the MH algorithm satisfy detailed balance? Hint: what is the transition operator T(x′ ← x)?

SLIDE 24

MH step-size demo

Exploring p(x) = N(x; 0, 1) with proposal q(x′; x) = N(x′; x, σ2) with different step sizes σ:

SLIDE 25

What is the ideal acceptance rate for MH?

◮ Assume standard Metropolis, with a symmetric Gaussian proposal q(x′; x).

◮ Accept x′ with probability λ = min(1, p(x′)/p(x)), independent of q.

◮ All of our information about p is contained in the sequence a of accepts and rejects: a = {1, 0, 1, 1, 1, 1, 0, 0, 1, . . . }.

◮ a is a sequence of Bernoulli random variables with parameter λ.

◮ The entropy (information content) of a is maximized if λ = 0.5.
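The claim about λ = 0.5 is easy to verify numerically; the grid of candidate acceptance rates below is just for illustration:

```python
import math

def bernoulli_entropy(lam):
    # H(lam) = -lam*log(lam) - (1-lam)*log(1-lam), in nats
    if lam in (0.0, 1.0):
        return 0.0
    return -lam * math.log(lam) - (1.0 - lam) * math.log(1.0 - lam)

# Scan acceptance rates: the entropy of the accept/reject sequence peaks at 0.5
rates = [i / 100 for i in range(1, 100)]
best = max(rates, key=bernoulli_entropy)
```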

SLIDE 26

MH Summary

Drawbacks

◮ Large step sizes lead to many rejections.
◮ Small step sizes lead to poor exploration.
◮ Struggles badly with multi-modal distributions (like most popular MCMC strategies).

Benefits

◮ Simple to implement.
◮ Reasonable for sampling from correlated high dimensional distributions.

SLIDE 27

Assessing convergence

SLIDE 28

Assessing convergence

◮ Diagnostics: plot autocorrelations, compute the Gelman-Rubin statistic; packages like R-CODA will report the effective number of samples.

◮ Discussion of thinning, multiple runs, and burn-in: Practical Markov chain Monte Carlo. Charles J. Geyer, Statistical Science, 7(4):473-483, 1992. http://www.jstor.org/stable/2246094

◮ Unit tests, including running on small-scale versions of your problem and checking for reasonable inferences on synthetic data drawn from your model: Getting it right: joint distribution tests of posterior simulators. John Geweke, JASA, 99(467): 799-804, 2004.

SLIDE 29

Auxiliary Variables

Although MCMC is used for marginalisation, sometimes it helps to introduce more variables:

∫ f(x) p(x) dx = ∫∫ f(x) p(x, v) dx dv ≈ (1/J) Σ_{j=1}^{J} f(x^{(j)}) ,   (x^{(j)}, v^{(j)}) ∼ p(x, v)

This helps if p(x|v) and p(v|x) are simple, or if p(x, v) is easier to navigate than p(x).

SLIDE 30

Gibbs Sampling

Suppose we wish to sample from the joint distribution p(A, B, C). The algorithm is as follows:

  • 1. Initialize B0, C0.
  • 2. Sample A0 ∼ p(A|B0, C0)
  • 3. B1 ∼ p(B|A0, C0)
  • 4. C1 ∼ p(C|A0, B1)
  • 5. Repeat in cycles.

This procedure generalises to arbitrarily many variables.
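The cycle above can be sketched with a two-variable example where the conditionals are known exactly: a zero-mean bivariate Gaussian with unit variances and correlation ρ = 0.8 (an illustrative choice), for which each conditional is x | y ∼ N(ρy, 1 − ρ²):

```python
import math
import random

random.seed(0)
rho = 0.8  # target: zero-mean bivariate Gaussian with correlation rho

def gibbs(n_samples):
    x, y = 0.0, 0.0
    chain = []
    s = math.sqrt(1.0 - rho * rho)
    for _ in range(n_samples):
        x = random.gauss(rho * y, s)  # x ~ p(x | y) = N(rho*y, 1 - rho^2)
        y = random.gauss(rho * x, s)  # y ~ p(y | x) = N(rho*x, 1 - rho^2)
        chain.append((x, y))
    return chain

chain = gibbs(50_000)
corr = sum(x * y for x, y in chain) / len(chain)  # E[xy] = rho for this target
```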

SLIDE 31

Gibbs Sampling

We can directly show invariance of the joint distribution and ergodicity under Gibbs sampling. However, Gibbs sampling is also a special case of Metropolis-Hastings, with proposals p(x_i | x_{j≠i}) that are accepted with probability 1.

SLIDE 32

Gibbs Sampling

Gibbs sampling is very popular.

Advantages

◮ Easy access to conditional distributions.
◮ Conditionals may be conjugate (for example, Dirichlet process mixtures, next lecture!) and we can sample from them exactly!
◮ Conditionals are lower dimensional, so we can apply rejection sampling or importance sampling.
◮ WinBUGS and OpenBUGS sample from graphical models using Gibbs sampling.
◮ Can be viewed as a special case of MH with no rejections.

Disadvantages

What might be a drawback?

SLIDE 33

Collapsed Gibbs Sampling

◮ Will be discussed in the next lecture.
◮ Helps overcome some limitations associated with dependencies between variables.
◮ Is critical for sampling from Dirichlet process mixture models (next lecture!).
◮ Good preparation: C.E. Rasmussen. The Infinite Gaussian Mixture Model. NIPS 2000.

SLIDE 34

Geweke: Getting it right

Consider two routes to sampling from the joint distribution p(y, w) over the data y and parameters w.

  • 1. Sample from your generative model: sample parameters from your prior p(w), then sample data from p(y|w).
  • 2. Sample parameters from p(w|y) using your transition operator, then resample data from p(y|w). This is Gibbs sampling from p(y, w).

Separately compute the statistics of p(y, w) from each procedure. Discrepancies will be obvious and indicate bugs in your code. Example: suppose a bug inflates your posterior noise variance; this will be amplified by these cyclical procedures for sampling from p(y, w).

SLIDE 35

Slice Sampling

We wish to sample uniformly under the curve p̃(z).

  • 1. Initialize z_0.
  • 2. Sample u ∼ U(0, p̃(z)).
  • 3. Sample z uniformly from the slice {z : p̃(z) > u} (stepping out to find the boundaries).

This procedure samples uniformly under the curve of p̃(z). It can be viewed as an auxiliary variable MCMC method.
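The three steps, with the stepping-out and shrinking procedures of the following slides, can be sketched for a 1-D unnormalised standard normal; the initial bracket width w is a tunable but non-critical choice:

```python
import math
import random

random.seed(0)

def p_tilde(z):
    # Unnormalised target: a standard normal without its constant
    return math.exp(-0.5 * z * z)

def slice_sample(n_samples, z0=0.0, w=1.0):
    z, chain = z0, []
    for _ in range(n_samples):
        u = random.uniform(0.0, p_tilde(z))  # height uniformly under the curve
        # Step out: place a bracket of width w around z, grow until both ends leave the slice
        left = z - w * random.random()
        right = left + w
        while p_tilde(left) > u:
            left -= w
        while p_tilde(right) > u:
            right += w
        # Shrink: sample within the bracket, shrinking toward z on misses
        while True:
            z_new = random.uniform(left, right)
            if p_tilde(z_new) > u:
                z = z_new  # on the slice: keep it
                break
            if z_new < z:
                left = z_new
            else:
                right = z_new
        chain.append(z)
    return chain

chain = slice_sample(20_000)
mean = sum(chain) / len(chain)  # target mean is 0
```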

SLIDE 36

Unimodal Slice Sampling

  • 1. Bracket slice
  • 2. Shrink bracket if uniform sample off slice
  • 3. Keep first sample on slice

SLIDE 37

Multimodal slice sampling

  • 1. Step out bracket until off slice
  • 2. Sample on slice, shrinking bracket as before.

SLIDE 38

Slice Sampling: Pros and Cons

Advantages

◮ Very automatic: no tunable free parameters, proposal distributions, etc.
◮ No rejections.
◮ A great choice when you have little knowledge of the distribution you are sampling from.

Disadvantages

◮ For multidimensional distributions, one can sample each variable in turn using 1D slice sampling, in a manner analogous to Gibbs sampling. However, this suffers badly from the curse of dimensionality.

SLIDE 39

Advanced Slice Sampling

If one is sampling from a posterior p(v|D) ∝ p(D|v)p(v), it is possible to exploit correlations in the prior p(v) for very efficient joint updates of v in high dimensional spaces. See Elliptical Slice Sampling, Murray et al., AISTATS 2010. We will come back to this when we discuss Gaussian processes.

SLIDE 40

Hamiltonian Monte Carlo

◮ Often probability distributions can be written in the form p(x) = (1/Z) exp(−E(x)), where the gradient of E(x) is available.

◮ The gradient tells us which direction to go to find states of higher probability. It seems wasteful not to use this information!

◮ Hamiltonian (aka hybrid) Monte Carlo methods help us avoid the basic random-walk behaviour of simple Metropolis methods by using gradient information.

SLIDE 41

Hamiltonian Monte Carlo

◮ Form H(x, v) = E(x) + K(v), with K(v) = vᵀv/2.

◮ p(x, v) = (1/Z_H) exp[−H(x, v)] = (1/Z_H) exp[−E(x)] exp[−K(v)].

◮ Since the density is separable, the marginal distribution of x is p(x), so if we can sample from p(x, v) we can simply discard the samples of v to obtain samples of x.

◮ Simulate Hamiltonian dynamics.
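One common recipe (leapfrog integration of the dynamics plus a Metropolis accept step to correct for discretisation error) can be sketched as follows; the 1-D standard normal target E(x) = x²/2, the step size eps, and the number of leapfrog steps L are all illustrative choices:

```python
import math
import random

random.seed(0)

def E(x):
    # Energy of a standard normal target: p(x) ∝ exp(-E(x))
    return 0.5 * x * x

def grad_E(x):
    return x

def leapfrog(x, v, eps, L):
    # Simulate Hamiltonian dynamics with L leapfrog steps of size eps
    v -= 0.5 * eps * grad_E(x)       # initial half step in momentum
    for i in range(L):
        x += eps * v                 # full step in position
        if i < L - 1:
            v -= eps * grad_E(x)     # full step in momentum
    v -= 0.5 * eps * grad_E(x)       # final half step in momentum
    return x, v

def hmc(n_samples, eps=0.2, L=10, x0=0.0):
    x, chain = x0, []
    for _ in range(n_samples):
        v = random.gauss(0.0, 1.0)   # resample momentum, K(v) = v^2/2
        h0 = E(x) + 0.5 * v * v
        x_new, v_new = leapfrog(x, v, eps, L)
        h1 = E(x_new) + 0.5 * v_new * v_new
        # Metropolis correction: accept with probability min(1, exp(H - H'))
        if random.random() < min(1.0, math.exp(h0 - h1)):
            x = x_new
        chain.append(x)
    return chain

chain = hmc(10_000)
mean = sum(chain) / len(chain)
var = sum((z - mean) ** 2 for z in chain) / len(chain)
```

For this quadratic energy the leapfrog integrator is stable whenever eps < 2, so nearly every trajectory is accepted.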

SLIDE 42

Hamiltonian Monte Carlo Algorithm

Benefits

◮ Very efficient with good settings of τ and ε.
◮ State of the art for sampling from posteriors over Bayesian neural networks.

Drawbacks

◮ Very difficult to tune τ and ε. A review and discussion of how to tune HMC is given in R.M. Neal (2011), MCMC Using Hamiltonian Dynamics, http://www.cs.utoronto.ca/~radford/ham-mcmc.abstract.html
◮ HMC helps with local exploration, but not with multimodality... What if we wanted to exploit curvature information too?

SLIDE 43

Hamiltonian Monte Carlo

SLIDE 44

Hamiltonian Monte Carlo

Video

SLIDE 45

Exploring Multimodal Likelihood Surfaces

Group (blackboard) discussion

◮ Multiple runs
◮ Simulated annealing
◮ Parallel tempering
