Advanced Machine Learning: MCMC Methods
Amit Sethi, Electrical Engineering, IIT Bombay
Objectives
- We have talked about:
– Exact inference in Factor Graphs using the Sum-Product algorithm (aka Belief Propagation)
– Limitations of the Sum-Product algorithm and some remedies for them
- Today we will learn:
– Sampling methods (aka Monte Carlo methods) when exact inference is intractable
We want to find the expected value of a function, e.g. when calculating messages
E[f ] = ∫ f(z) p(z) dz
- It may not be feasible to compute this integral, but
– Computing f(z) for a given z may be easy, and
– We may be able to draw samples z(l) from p(z), giving the estimate
E[f ] ≈ (1/L) ∑l f(z(l))
Source: “Pattern Recognition and Machine Learning”, book and slides by Christopher Bishop
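A minimal sketch of the estimator above (the particular p and f here are assumptions for illustration): p(z) is a standard normal and f(z) = z², so the true expectation E[f] is Var(z) = 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f] = ∫ f(z) p(z) dz by averaging f over samples drawn
# from p(z). Here p(z) = N(0, 1) and f(z) = z**2 (toy choices).
def mc_expectation(f, sample_p, L=100_000):
    z = sample_p(L)        # z(1), ..., z(L) drawn from p(z)
    return f(z).mean()     # (1/L) * sum_l f(z(l))

estimate = mc_expectation(lambda z: z**2, lambda L: rng.standard_normal(L))
```

With L = 100,000 samples, the estimate is close to the true value of 1.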
We will look at the following ways to sample from a distribution
- Rejection Sampling
- Importance Sampling
- Metropolis and Metropolis-Hastings (MCMC)
- Gibbs Sampling
Sampling marginals
- Note that this procedure can be applied to generate samples from marginals as well
- Simply discard the components of each sample that are not needed
– e.g. for the marginal p(rain), the sample (cloudy = t; sprinkler = f; rain = t; w = t) just becomes (rain = t)
- This is still a fair sampling procedure
- But anything more complex, such as conditioning on evidence, can be a problem
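The discard-and-keep idea can be sketched on a toy sprinkler-style network (the network and all conditional probabilities below are assumptions, not from the slides):

```python
import random

random.seed(0)

# Hypothetical sprinkler-style network; ancestral sampling of the joint,
# then marginalization by discarding the unneeded components.
def sample_joint():
    cloudy = random.random() < 0.5
    sprinkler = random.random() < (0.1 if cloudy else 0.5)
    rain = random.random() < (0.8 if cloudy else 0.2)
    wet = random.random() < (0.99 if (sprinkler or rain) else 0.01)
    return {"cloudy": cloudy, "sprinkler": sprinkler, "rain": rain, "wet": wet}

# Marginal p(rain): keep only the 'rain' component of each joint sample.
samples = [sample_joint()["rain"] for _ in range(50_000)]
p_rain = sum(samples) / len(samples)
```

With the toy numbers above, p(rain) = 0.5·0.8 + 0.5·0.2 = 0.5, and the estimate converges to that value.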
When the partition function is unknown
- Consider the case of an arbitrary, continuous p(z)
- How can we draw samples from it?
- Assume that we can evaluate p(z) up to some constant, efficiently (e.g. in an MRF)
Rejection sampling makes use of an easier “proposal” distribution
- Let’s also assume that we have some simpler distribution q(z), called a proposal distribution, from which we can easily draw samples
– e.g. q(z) is a Gaussian
- We can then draw samples from q(z) and use these, if we have a way to convert them into samples from p(z)
Now, we reject samples according to the ratio of p and q at z
- Introduce a constant k such that kq(z) ≥ p(z) for all z
- Rejection sampling procedure:
– Generate z0 from q(z)
– Generate u0 uniformly from [0, kq(z0)]
– If u0 > p(z0), reject z0; otherwise keep it
- The generated pairs (z0, u0) fall uniformly under the red curve kq(z)
- The kept pairs fall under the blue curve p(z), hence the kept z0 are samples from p(z)
Rejection sampling can end up rejecting a lot of samples from q
- How likely are we to keep samples?
- The probability that a sample is accepted is:
p(accept) = ∫ (p(z)/(kq(z))) q(z) dz = (1/k) ∫ p(z) dz
- A smaller k is better, subject to kq(z) ≥ p(z) for all z
– If q(z) is similar to p(z), this is easier
- In high-dimensional spaces the acceptance rate falls off exponentially, and finding a suitable k becomes challenging
In importance sampling, we scale the weight of each sample by the ratio of p and q
- Approximate the expectation by drawing points from q(z):
E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) ∑l f(z(l)) p(z(l))/q(z(l))
- The quantity p(z(l))/q(z(l)) is known as the importance weight
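A small sketch of this estimator (the target, proposal, and f are assumptions): estimate E_p[z²] for p(z) = N(0, 1) using a broader proposal q(z) = N(0, 2²) and weights w_l = p(z_l)/q(z_l).

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(z, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

L = 200_000
z = rng.normal(0.0, 2.0, size=L)                       # samples from q(z)
w = normal_pdf(z, 0.0, 1.0) / normal_pdf(z, 0.0, 2.0)  # importance weights
estimate = np.mean(w * z**2)                           # (1/L) * sum_l w_l f(z_l)
```

No sample is discarded; each is simply reweighted, and the estimate converges to the true value E_p[z²] = 1.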
MCMC methods generate samples sequentially
- Markov chain Monte Carlo (MCMC) methods use a Markov chain, i.e. a sequence in which each sample depends on the previous one: z(1), z(2), …, z(τ)
- Transitions of the Markov chain form the proposal distribution q(z|z(τ))
- Asymptotically, these samples are drawn from the desired distribution p(z)
Metropolis algorithm assumes the proposal distribution is symmetric
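The algorithm itself appears only as a figure in the original slides; a minimal random-walk Metropolis sketch (the target, step size, and burn-in length are assumptions) looks like this. Note that only an unnormalized target is needed, since the acceptance ratio cancels the normalizing constant.

```python
import numpy as np

rng = np.random.default_rng(3)

def p_tilde(z):
    return np.exp(-0.5 * z**2)    # unnormalized N(0, 1) target

def metropolis(n, step=1.0, z0=0.0):
    z, out = z0, []
    for _ in range(n):
        z_star = z + step * rng.standard_normal()  # symmetric Gaussian proposal
        # Accept with probability min(1, p_tilde(z*)/p_tilde(z))
        if rng.uniform() < min(1.0, p_tilde(z_star) / p_tilde(z)):
            z = z_star
        out.append(z)                              # record current state
    return np.array(out)

chain = metropolis(100_000)
samples = chain[10_000:]                           # discard burn-in
```

The symmetry of the Gaussian proposal, q(z*|z) = q(z|z*), is what lets the acceptance ratio use only the target densities.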
Visualizing Metropolis algorithm
Metropolis-Hastings algorithm generalizes the Metropolis algorithm to non-symmetric transitions
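The acceptance rule is an image in the original slides; the standard Metropolis-Hastings acceptance probability (as in the cited Bishop text, with p̃ the unnormalized target) is:

```latex
A(z^{*}, z^{(\tau)}) = \min\!\left(1,\ \frac{\tilde{p}(z^{*})\, q(z^{(\tau)} \mid z^{*})}{\tilde{p}(z^{(\tau)})\, q(z^{*} \mid z^{(\tau)})}\right)
```

When q is symmetric, the two q factors cancel and this reduces to the Metropolis rule.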
Gibbs Sampling is a simple coordinate-wise MCMC method that needs no separate proposal distribution
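The coordinate-wise idea can be sketched on a correlated 2-D Gaussian (the target and its correlation are assumptions): each step resamples one coordinate from its exact conditional given the other.

```python
import numpy as np

rng = np.random.default_rng(4)

rho = 0.8  # correlation of the (assumed) standard bivariate normal target

def gibbs(n, burn_in=1_000):
    z1, z2 = 0.0, 0.0
    out = []
    for t in range(n + burn_in):
        # Conditionals of a standard bivariate normal with correlation rho:
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho**2))  # z1 | z2
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho**2))  # z2 | z1
        if t >= burn_in:
            out.append((z1, z2))
    return np.array(out)

samples = gibbs(100_000)
```

The sample correlation converges to rho, showing the chain targets the joint distribution even though each update touches one coordinate at a time.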
Markov blanket of a node in an MRF
- It is simply the set of the node’s neighbouring nodes
Example: Estimate the probability of one node given the others in an image-denoising MRF
- Potentials (energy terms):
– For the observation: −ηxiyi
– For spatial coherence: −βxixj
– For the prior: −hxi
- We want P(xi | ~xi), i.e. P(xi | X\xi, Y)
- What is the Markov blanket of xi?
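A hedged sketch of the resulting Gibbs update for one pixel (coefficient values and the toy image are assumptions; the signs follow the energy terms on this slide). With xi ∈ {−1, +1}, the conditional reduces to a sigmoid of the local field from the 4 neighbouring pixels and the observation, which is exactly the Markov blanket.

```python
import numpy as np

rng = np.random.default_rng(5)

h, beta, eta = 0.0, 1.0, 2.0   # assumed coefficients

def gibbs_update_pixel(x, y, i, j):
    """Resample x[i, j] in {-1, +1} from its conditional given its
    Markov blanket: the neighbouring pixels and the observation y[i, j]."""
    nbr_sum = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < x.shape[0] and 0 <= nj < x.shape[1]:
            nbr_sum += x[ni, nj]
    a = h + beta * nbr_sum + eta * y[i, j]     # local field on x_ij
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * a))    # P(x_ij = +1 | blanket)
    return 1 if rng.uniform() < p_plus else -1

y = np.sign(rng.standard_normal((8, 8)))       # toy "noisy image" in {-1, +1}
x = y.copy()
for _ in range(20):                            # a few Gibbs sweeps
    for i in range(8):
        for j in range(8):
            x[i, j] = gibbs_update_pixel(x, y, i, j)
```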
Gibbs sampling as a special case of MH
- Proposal distribution: q(z* | z) = p(z*k | z\k), with the other components left unchanged (z*\k = z\k)
- By holding the other dimensions constant: p(z) = p(zk | z\k) p(z\k)
- Also, p(z*) = p(z*k | z\k) p(z\k)
- So, the acceptance probability is:
A(z*, z) = [p(z*) q(z | z*)] / [p(z) q(z* | z)] = [p(z*k | z\k) p(z\k) p(zk | z\k)] / [p(zk | z\k) p(z\k) p(z*k | z\k)] = 1
- So, the step is always accepted
Issues with Gibbs sampling
- Initialization is random
- Samples are not independent
– Burn-in samples should be discarded (a random initialization may start in, and wander through, a low-probability region for a while)
- Time taken is linear in the number of samples
- The number of iterations needed scales with dimensionality
RBM and its energy function defined
- An RBM is a bipartite graph between its visible and hidden sets of nodes
- Its energy function (for binary units v, h) is:
E(v, h) = −∑i,j vi wij hj − ∑i bi vi − ∑j cj hj
Source: “An Introduction to Restricted Boltzmann Machines”, Asja Fischer and Christian Igel, CIARP 2012
Marginals in an RBM
- The hidden nodes can be used to learn a code explaining the visible nodes
– In a Deep Belief Net (DBN), more layers can be added on top
- Due to its bipartite structure, the Markov blanket of a node in either set is simply the other set of nodes
- This leads to a simple product form for the conditional of each set given the other
- This leads toward a formulation called Product of Experts (PoE)
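The product form can be written explicitly; for a binary RBM with weights wij, visible biases bi, and hidden biases cj, the standard conditionals are:

```latex
p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_i w_{ij} v_i\Big), \qquad
p(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_j w_{ij} h_j\Big)
```

where σ is the logistic sigmoid. Each factor depends only on the opposite layer, which is exactly the Markov-blanket property noted above.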
Gibbs Sampling in RBM
- Let us look at the marginal distribution of the visible nodes
Gibbs Sampling in RBM
- The RBM can be interpreted as a stochastic neural network, for which block Gibbs sampling can be used
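A hedged sketch of block Gibbs sampling in an RBM (the sizes and random weights are toy assumptions). Because the graph is bipartite, all hidden units are conditionally independent given the visible units and vice versa, so each whole layer can be resampled in one vectorized step.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 4
W = 0.1 * rng.standard_normal((n_visible, n_hidden))  # weights w_ij
b = np.zeros(n_visible)                               # visible biases
c = np.zeros(n_hidden)                                # hidden biases

def block_gibbs_step(v):
    # Sample the whole hidden layer: p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij)
    h = (rng.uniform(size=n_hidden) < sigmoid(c + v @ W)).astype(float)
    # Sample the whole visible layer: p(v_i = 1 | h) = sigmoid(b_i + sum_j w_ij h_j)
    v = (rng.uniform(size=n_visible) < sigmoid(b + W @ h)).astype(float)
    return v, h

v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):
    v, h = block_gibbs_step(v)
```

This alternation between layers is the "stochastic neural network" view: each layer's activation probabilities are sigmoids of the other layer's state.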
In summary
- Monte Carlo methods are often preferred over analytical methods for estimating probability distributions and their marginals in complex PGMs
- Rejection sampling and importance sampling do not use samples effectively
– Finding a good proposal distribution can be tricky
- Markov chain Monte Carlo is often preferred over simple Monte Carlo
- The initial (burn-in) samples of MCMC methods are discarded
- Metropolis (and Metropolis-Hastings) uses a proposal distribution for each step
- Gibbs sampling is the most widely used MCMC method