Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo
Andrew Gordon Wilson
www.cs.cmu.edu/~andrewgw
Carnegie Mellon University
March 18, 2015
◮ Specify y(x) = f(x, w), for response y and input (predictor) x, with parameters w.
◮ Infer the posterior p(w|D) ∝ p(D|w) p(w), for observed data D.
◮ Question: do you see the problem?
◮ If you were to do multiple finite approximations using this
◮ Monte Carlo: approximates expectations with sums, E[f] ≈ (1/N) Σ_i f(x_i) for samples x_i ∼ p(x).
◮ Sampling points uniformly under the curve of p(x), then keeping their x-coordinates, yields draws from p(x).
◮ Inverse CDF sampling: if X ∼ U(0, 1) and g(·) is the inverse of the cumulative distribution function of p(·), then g(X) is distributed according to p (see the sketch after this list).
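A minimal sketch of both ideas in Python (not from the slides; the Exponential target and its closed-form inverse CDF are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse CDF sampling for an Exponential(rate) target:
# CDF F(y) = 1 - exp(-rate * y), so g(u) = F^{-1}(u) = -log(1 - u) / rate.
rate = 2.0
u = rng.uniform(size=100_000)        # X ~ U(0, 1)
samples = -np.log(1.0 - u) / rate    # g(X) ~ Exponential(rate)

# Monte Carlo: approximate the expectation E[Y] = 1/rate with a sample average.
print(samples.mean())                # ~0.5
```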
◮ Sample top-level variables from their marginal distributions.
◮ Sample each node conditioned on the samples of its parent nodes.
◮ For example, A ∼ P(A), then each child of A is sampled given the drawn value of A, and so on down the graph (a toy sketch follows this list).
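A toy ancestral-sampling sketch in Python; the three-variable network A → C ← B and all of its probabilities are hypothetical, chosen only to illustrate the parent-then-child sampling order:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ancestral sampling: draw parents from their marginals, then each child
# conditioned on the sampled values of its parents.
def sample_once():
    a = int(rng.random() < 0.3)                    # A ~ P(A)
    b = int(rng.random() < 0.6)                    # B ~ P(B)
    p_c = {(0, 0): 0.1, (0, 1): 0.4,
           (1, 0): 0.5, (1, 1): 0.9}[(a, b)]       # P(C = 1 | A, B)
    c = int(rng.random() < p_c)                    # C ~ P(C | A, B)
    return a, b, c

draws = [sample_once() for _ in range(10_000)]
print(np.mean([c for _, _, c in draws]))           # Monte Carlo estimate of P(C = 1)
```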
◮ Undirected graphical models: P(x) = (1/Z) ∏_c ψ_c(x_c), where the normalising constant Z is generally intractable.
◮ Posterior over a directed graphical model: p(w|D) = p(D|w)p(w)/p(D), where the evidence p(D) is likewise intractable.
◮ We require σ ≥ 1: the proposal must be at least as broad as the target for the ratio p/q to be bounded.
◮ Variance of importance weights is (σ²/√(2σ² − 1)) − 1 for a N(0, 1) target and N(0, σ²) proposal: zero at σ = 1, infinite for σ² ≤ 1/2, and growing without bound as σ → ∞ (see the sketch after this list).
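A sketch of this effect, assuming the standard unit-Gaussian example (target N(0, 1), proposal N(0, σ²)); the sample size and σ values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical variance of importance weights w(x) = p(x) / q(x)
# for target p = N(0, 1) and proposal q = N(0, sigma^2).
def weight_variance(sigma, n=200_000):
    x = rng.normal(0.0, sigma, size=n)
    log_w = -0.5 * x**2 + 0.5 * (x / sigma)**2 + np.log(sigma)  # log p(x)/q(x)
    return np.exp(log_w).var()

# Variance is 0 at sigma = 1 (proposal equals target), infinite for
# sigma^2 <= 1/2, and grows again as sigma becomes large.
for sigma in [0.8, 1.0, 2.0, 5.0]:
    print(sigma, weight_variance(sigma))
```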
◮ Markov chain Monte Carlo methods (MCMC) allow us to sample from distributions we can only evaluate up to a normalising constant, by simulating a Markov chain whose equilibrium distribution is the target.
◮ MCMC methods allow us to sample from a wide array of distributions for which direct sampling is infeasible.
◮ We sample from a transition probability T(x′ ← x) that depends on the current state x, so successive samples form a Markov chain.
◮ Sample a proposal x′ from a Gaussian distribution centred on the current state x.
◮ Accept with probability min(1, p(x′)/p(x)).
◮ If rejected, the next sample is the same as the previous one.
◮ Here we have an adaptive proposal distribution: it moves with the current state (a minimal sketch follows this list).
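A minimal random-walk Metropolis sketch in Python; the unnormalised target density, step size, and chain length are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_p(x):
    # Any unnormalised log density works: only ratios p(x')/p(x) are needed.
    return -0.5 * x**2 + np.log(1 + np.sin(3 * x)**2)

def metropolis(n_samples, step=1.0, x0=0.0):
    x, out = x0, []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()     # proposal x' ~ N(x, step^2)
        # Accept with probability min(1, p(x')/p(x)), done in log space.
        if np.log(rng.random()) < log_p(x_prop) - log_p(x):
            x = x_prop
        out.append(x)                        # on rejection, repeat the old x
    return np.array(out)

samples = metropolis(50_000)
print(samples.mean(), samples.std())
```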
◮ Transition operator: T_i(z_{i+1} ← z_i) = P(z_{i+1} | z_i).
◮ A Markov chain is homogeneous if the transition probabilities are the same for all i.
◮ A distribution p(z) is invariant with respect to a Markov chain if p(z) = Σ_{z′} T(z ← z′) p(z′).
◮ A sufficient but not necessary condition for an invariant p(z) is detailed balance: T(z′ ← z) p(z) = T(z ← z′) p(z′) (checked numerically in the sketch after this list).
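These conditions can be checked numerically on a small discrete chain. The sketch below builds a Metropolis transition matrix for a hypothetical three-state target and verifies detailed balance and invariance:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])        # target distribution over 3 states
q = np.full((3, 3), 1.0 / 3.0)       # symmetric uniform proposal

# Metropolis transition matrix: T[zp, z] is the probability of moving z -> zp.
T = np.zeros((3, 3))
for z in range(3):
    for zp in range(3):
        if zp != z:
            T[zp, z] = q[zp, z] * min(1.0, p[zp] / p[z])
    T[z, z] = 1.0 - T[:, z].sum()    # rejected mass stays put

print(np.allclose(T * p, (T * p).T))  # detailed balance: T(z'<-z)p(z) = T(z<-z')p(z')
print(np.allclose(T @ p, p))          # invariance: p(z) = sum_z' T(z<-z')p(z')
```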
◮ Generalised detailed balance is both sufficient and necessary for p(z) to be invariant.
◮ Operators satisfying detailed balance are their own reverse operators: the resulting chain is reversible.
◮ Wish to use Markov chains to sample from a given target distribution p∗(z).
◮ We can do this if p∗(z) is invariant under the chain, p∗(z) = Σ_{z′} T(z ← z′) p∗(z′), and the chain is ergodic, so that it converges to p∗(z) from any starting state.
◮ Construct transition probabilities from a set of base transitions B_1, …, B_K.
◮ Can be combined through successive application: T = B_K ∘ ⋯ ∘ B_1; if each B_k leaves p∗ invariant, so does T.
◮ Assume standard Metropolis, with Gaussian proposal q(x; x′).
◮ Accept x′ with probability λ = min(1, p(x′)/p(x)).
◮ All of our information about p is contained in the accept/reject outcomes.
◮ a is a sequence of Bernoulli random variables, with success probability given by the average acceptance rate.
◮ The entropy (information content) of a is maximized if the acceptance rate is 1/2, motivating the heuristic of tuning step sizes towards roughly 50% acceptance (a quick check follows this list).
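A quick numerical check of the entropy claim (the grid of rates is arbitrary):

```python
import numpy as np

# Entropy of a Bernoulli(r) accept/reject variable as a function of the
# acceptance rate r: maximised at r = 0.5.
r = np.linspace(0.01, 0.99, 99)
H = -(r * np.log2(r) + (1 - r) * np.log2(1 - r))
print(r[np.argmax(H)])   # 0.5
```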
◮ Large step sizes lead to many rejections.
◮ Small step sizes lead to poor exploration.
◮ Struggles badly with multi-modal distributions (like most MCMC methods).
◮ Simple to implement.
◮ Reasonable for sampling from correlated, high-dimensional distributions.
◮ Diagnostics: plot autocorrelations, compute the Gelman-Rubin statistic across multiple runs.
◮ Thinning, multiple runs, and burn-in are discussed in Practical Markov Chain Monte Carlo (Geyer, 1992).
◮ Unit tests, including running on small-scale versions of your problem where the exact answer is known (illustrative sketches of the first two diagnostics follow this list).
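Illustrative implementations of the two diagnostics (sketches, not the course's code; the synthetic white-noise "chains" only demonstrate the calling convention):

```python
import numpy as np

rng = np.random.default_rng(4)

def autocorr(chain, max_lag=50):
    # Sample autocorrelation of a single chain up to max_lag.
    x = chain - chain.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    return acf[:max_lag] / acf[0]

def gelman_rubin(chains):
    # R-hat from an (m, n) array of m independent chains of length n.
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    var_hat = (n - 1) / n * W + B / n
    return np.sqrt(var_hat / W)                 # ~1 suggests convergence

chains = rng.normal(size=(4, 2_000))            # perfectly mixed synthetic chains
print(gelman_rubin(chains))                     # ~1.0
print(autocorr(chains[0])[:3])                  # ~[1, 0, 0]
```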
◮ Easy access to conditional distributions.
◮ Conditionals may be conjugate (for example, in Dirichlet process mixture models).
◮ Conditionals will be lower dimensional; we can then apply simpler univariate sampling methods to each conditional in turn.
◮ WinBUGS and OpenBUGS sample from graphical models using Gibbs sampling.
◮ Can be viewed as a special case of MH with no rejections (a minimal sketch follows this list).
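A minimal Gibbs sketch for a bivariate Gaussian with correlation ρ, a hypothetical target chosen because both full conditionals are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
rho = 0.9   # correlation of the bivariate Gaussian target

def gibbs(n_samples):
    x1, x2 = 0.0, 0.0
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Each conditional x_i | x_other is N(rho * x_other, 1 - rho^2),
        # sampled exactly -- so no proposal is ever rejected.
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
        out[t] = x1, x2
    return out

samples = gibbs(20_000)
print(np.corrcoef(samples.T)[0, 1])   # ~0.9
```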
◮ Will be discussed in the next lecture.
◮ Helps overcome some limitations associated with standard Gibbs sampling.
◮ Is critical for sampling from Dirichlet Process Mixture Models.
◮ Good preparation: C.E. Rasmussen, The Infinite Gaussian Mixture Model, NIPS 2000.
◮ Very automatic: lack of tunable free parameters; proposal widths adapt to the local scale of the distribution.
◮ No rejections.
◮ A great choice when you have little knowledge of the distribution you are sampling from.
◮ For multidimensional distributions, one can sample each variable in turn with univariate slice sampling, as in Gibbs sampling (a minimal sketch follows this list).
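A minimal univariate slice sampler with stepping-out and shrinkage, following Neal (2003); the unit-Gaussian target and initial width w are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

def slice_sample(log_p, x0, n_samples, w=1.0):
    x, out = x0, []
    for _ in range(n_samples):
        log_y = log_p(x) + np.log(rng.random())   # uniform height under the curve
        left = x - w * rng.random()               # randomly placed bracket
        right = left + w
        while log_p(left) > log_y:                # step out until outside the slice
            left -= w
        while log_p(right) > log_y:
            right += w
        while True:                               # sample, shrinking on rejections
            x_new = rng.uniform(left, right)
            if log_p(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        out.append(x)
    return np.array(out)

samples = slice_sample(lambda x: -0.5 * x**2, 0.0, 10_000)
print(samples.std())   # ~1 for the unit Gaussian target
```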
◮ Often probability distributions can be written in the form p(x) = (1/Z) exp(−E(x)), for an energy function E(x).
◮ The gradient of the energy tells us which direction to go to find states of higher probability.
◮ Hamiltonian (aka Hybrid) Monte Carlo methods help us use this gradient information to make efficient, long-range proposals.
◮ Form H(x, v) = E(x) + K(v), with K(v) = vᵀv/2 for an auxiliary momentum variable v.
◮ p(x, v) = (1/Z_H) exp(−H(x, v)) = (1/Z_H) exp(−E(x)) exp(−K(v)).
◮ Since the density is separable, the marginal distribution of x is exactly the target: sample (x, v) jointly and discard v.
◮ Simulate Hamiltonian dynamics (e.g., with a leapfrog integrator) to propose distant states, accepted or rejected with a Metropolis step (a minimal sketch follows this list).
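A minimal HMC sketch with a leapfrog integrator; the unit-Gaussian energy and the ε, τ settings below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def hmc(E, grad_E, x0, n_samples, eps=0.1, tau=20):
    x, out = np.asarray(x0, dtype=float), []
    for _ in range(n_samples):
        v = rng.normal(size=x.shape)             # resample momentum v ~ N(0, I)
        x_new, v_new = x.copy(), v.copy()
        v_new -= 0.5 * eps * grad_E(x_new)       # leapfrog: initial half step
        for _ in range(tau):
            x_new += eps * v_new                 # full position step
            v_new -= eps * grad_E(x_new)         # full momentum step
        v_new += 0.5 * eps * grad_E(x_new)       # undo the extra half step
        # Metropolis correction on the change in H(x, v) = E(x) + v'v/2.
        dH = E(x_new) - E(x) + 0.5 * (v_new @ v_new - v @ v)
        if np.log(rng.random()) < -dH:
            x = x_new
        out.append(x.copy())
    return np.array(out)

# Unit Gaussian target: E(x) = x'x/2, so grad E(x) = x.
samples = hmc(lambda x: 0.5 * x @ x, lambda x: x, np.zeros(2), 5_000)
print(samples.std(axis=0))   # ~[1, 1]
```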
◮ Very efficient with good settings of the trajectory length τ and step size ε.
◮ State of the art for sampling from posteriors over Bayesian neural networks.
◮ Very difficult to tune τ and ε. A recent review is Neal (2011), MCMC Using Hamiltonian Dynamics; the No-U-Turn Sampler (Hoffman and Gelman, 2014) sets these parameters automatically.
◮ HMC helps with local exploration, but not with moving between distant, well-separated modes.
◮ Multiple runs
◮ Simulated annealing
◮ Parallel tempering (a minimal sketch follows this list)
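A minimal parallel-tempering sketch; the bimodal target, temperature ladder, and proposal widths are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(8)

def log_p(x):
    # Bimodal target: mixture of Gaussians at -4 and +4 (unnormalised).
    return np.logaddexp(-0.5 * (x - 4)**2, -0.5 * (x + 4)**2)

temps = np.array([1.0, 4.0, 16.0])   # T = 1 is the chain whose samples we keep
x = np.zeros(len(temps))
cold = []
for t in range(20_000):
    for k, T in enumerate(temps):    # Metropolis step on each tempered target p^(1/T)
        prop = x[k] + np.sqrt(T) * rng.normal()
        if np.log(rng.random()) < (log_p(prop) - log_p(x[k])) / T:
            x[k] = prop
    k = rng.integers(len(temps) - 1) # propose swapping one adjacent pair of replicas
    d = (1 / temps[k] - 1 / temps[k + 1]) * (log_p(x[k + 1]) - log_p(x[k]))
    if np.log(rng.random()) < d:
        x[k], x[k + 1] = x[k + 1], x[k]
    cold.append(x[0])
print(np.mean(np.array(cold) > 0))   # cold chain visits both modes, ~0.5
```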