Advanced Machine Learning: MCMC Methods, Amit Sethi, Electrical Engineering - PowerPoint PPT Presentation


SLIDE 1

Advanced Machine Learning MCMC Methods

Amit Sethi Electrical Engineering, IIT Bombay

SLIDE 2

Objectives

  • We have talked about:

– Exact inference in factor graphs using the Sum-Product algorithm (aka Belief Propagation)
– Limitations of the Sum-Product algorithm, and some remedies

  • Today we will learn:

– Sampling methods (aka Monte Carlo methods) when exact inference is intractable

SLIDE 3

We want to find the expected value of a function, e.g. when calculating messages

E[f ] = ∫ f(z) p(z) dz

  • It may not be feasible to compute this integral analytically, but

– Computing f(z) for a given z may be easy
– So, we instead draw L samples z(l) from p(z) and average:

E[f] ≈ (1/L) ∑l f(z(l))
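As a concrete illustration (my own example, not from the slides), a minimal Monte Carlo estimate of E[f], taking p(z) to be a standard normal and f(z) = z², so the true value is 1:

```python
import random

def mc_expectation(f, sample_p, L=100_000, seed=0):
    """Estimate E[f] = integral of f(z) p(z) dz by averaging f over L draws from p."""
    rng = random.Random(seed)
    return sum(f(sample_p(rng)) for _ in range(L)) / L

# p(z) = N(0, 1) and f(z) = z^2, so E[f] = Var[z] = 1
est = mc_expectation(lambda z: z * z, lambda rng: rng.gauss(0.0, 1.0))
```

The estimate converges at rate O(1/√L) regardless of the dimensionality of z, which is what makes sampling attractive when exact inference is intractable.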


Source: “Pattern Recognition and Machine Learning”, Book and slides by Christopher Bishop

[Figure: a density p(z) and a function f(z) plotted against z]

SLIDE 4

We will look at the following ways to sample from a distribution

  • Rejection Sampling
  • Importance Sampling
  • Gibbs Sampling

SLIDE 5

Sampling marginals

  • Note that this procedure can be applied to generate samples for marginals as well
  • Simply discard the portions of each sample that are not needed

– e.g. for the marginal p(rain), the sample (cloudy = t, sprinkler = f, rain = t, w = t) just becomes (rain = t)

  • Still a fair sampling procedure
  • But, anything more complex can be a problem

SLIDE 6

When the partition function is unknown

  • Consider the case of an arbitrary, continuous p(z)
  • How can we draw samples from it?
  • Assume that we can evaluate p(z) up to some constant, efficiently (e.g. an MRF)

SLIDE 7

Rejection sampling makes use of an easier “proposal” distribution

  • Let’s also assume that we have some simpler distribution q(z), called a proposal distribution, from which we can easily draw samples

– e.g. q(z) is a Gaussian

  • We could then draw samples from q(z) and use them, if we had a way to convert them into samples from p(z)

[Figure: target p(z) under a scaled proposal envelope kq(z)]
SLIDE 8

Now, we reject samples according to the ratio of p and q at z

  • Introduce constant k such that kq(z) >= p(z) for all z
  • Rejection sampling procedure:

– Generate z0 from q(z)
– Generate u0 uniformly from [0, kq(z0)]
– If u0 > p(z0), reject sample z0; otherwise keep it

  • Original samples are under the red curve
  • Kept samples are from under the blue curve – hence they are samples from p(z)
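A runnable sketch of this procedure (an illustrative example, not from the deck): the target is a standard normal treated as known only up to a constant, the proposal is N(0, 2²), and k = 2 gives a valid envelope since p(z)/q(z) = 2·exp(−3z²/8) ≤ 2 for all z:

```python
import math
import random

def rejection_sample(p_tilde, sample_q, q_pdf, k, n, seed=0):
    """Draw n samples from p (known via p_tilde, up to a constant) using a
    proposal q with envelope k*q(z) >= p_tilde(z) for all z."""
    rng = random.Random(seed)
    kept, proposed = [], 0
    while len(kept) < n:
        z0 = sample_q(rng)                    # draw z0 from q(z)
        proposed += 1
        u0 = rng.uniform(0.0, k * q_pdf(z0))  # uniform height under the envelope
        if u0 <= p_tilde(z0):                 # keep only points under p_tilde
            kept.append(z0)
    return kept, len(kept) / proposed

# Illustrative target: standard normal; proposal: N(0, 2^2); envelope k = 2.
p_tilde = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
q_pdf = lambda z: math.exp(-z * z / 8.0) / math.sqrt(8 * math.pi)
samples, accept_rate = rejection_sample(
    p_tilde, lambda rng: rng.gauss(0.0, 2.0), q_pdf, k=2.0, n=20_000)
```

The observed acceptance rate comes out near (1/k)∫p̃(z)dz = 1/2 here, matching the acceptance formula on the next slide.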

SLIDE 9

Rejection sampling can end up rejecting a lot of samples from q

  • How likely are we to keep samples?
  • Probability a sample is accepted is:

p(accept) = ∫ [p(z) / (k q(z))] q(z) dz = (1/k) ∫ p(z) dz

  • Smaller k is better subject to kq(z) >= p(z) for all z

– If q(z) is similar to p(z), this is easier

  • In high-dimensional spaces, the acceptance ratio falls off exponentially, and finding a suitable k becomes challenging

SLIDE 10

In importance sampling, we scale the weight of each sample by the ratio of p and q

  • Approximate the expectation by drawing points from q(z):

E[f] = ∫ f(z) p(z) dz = ∫ f(z) [p(z)/q(z)] q(z) dz ≈ (1/L) ∑l f(z(l)) p(z(l))/q(z(l))

  • The quantity p(z(l))/q(z(l)) is known as the importance weight
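A minimal sketch of the weighted estimator (my own illustrative target and proposal, both normalized here for simplicity):

```python
import math
import random

def importance_expectation(f, p_pdf, q_pdf, sample_q, L=100_000, seed=0):
    """E_p[f] ~= (1/L) * sum_l f(z_l) * p(z_l)/q(z_l), with z_l drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(L):
        z = sample_q(rng)
        total += f(z) * p_pdf(z) / q_pdf(z)   # weight each sample by p/q
    return total / L

# Illustrative example: E_p[z^2] = 1 for p = N(0, 1), using proposal q = N(0, 1.5^2).
p_pdf = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
q_pdf = lambda z: math.exp(-z * z / 4.5) / (1.5 * math.sqrt(2 * math.pi))
est = importance_expectation(
    lambda z: z * z, p_pdf, q_pdf, lambda rng: rng.gauss(0.0, 1.5))
```

Unlike rejection sampling, no draw is discarded; a poor match between q and p instead shows up as high variance in the weights.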

SLIDE 11

MCMC methods generate samples sequentially


  • Markov chain Monte Carlo methods use a Markov chain, i.e. a sequence in which each sample depends on the previous one: z(1), z(2), … , z(τ)
  • Transitions of the Markov chain form the proposal distribution q(z|z(τ))
  • Asymptotically, these samples are drawn from the desired distribution p(z)

SLIDE 12

Metropolis algorithm assumes the proposal distribution is symmetric
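The algorithm itself was an image on this slide; a minimal runnable sketch of random-walk Metropolis with a symmetric Gaussian proposal, on an illustrative unnormalized Gaussian target:

```python
import math
import random

def metropolis(p_tilde, z0, step, n, seed=0):
    """Random-walk Metropolis with symmetric proposal q(z*|z) = N(z, step^2);
    accept z* with probability min(1, p_tilde(z*) / p_tilde(z))."""
    rng = random.Random(seed)
    z, chain = z0, []
    for _ in range(n):
        z_star = z + rng.gauss(0.0, step)              # symmetric proposal
        if rng.random() < min(1.0, p_tilde(z_star) / p_tilde(z)):
            z = z_star                                 # accept the move
        chain.append(z)                                # a rejected move repeats z
    return chain

# Illustrative target: unnormalized N(0, 1); burn-in is discarded before use.
chain = metropolis(lambda z: math.exp(-0.5 * z * z), z0=3.0, step=1.0, n=50_000)
kept = chain[5_000:]
```

Because the proposal is symmetric, q(z*|z) = q(z|z*), only the ratio of (unnormalized) target densities enters the acceptance test.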

SLIDE 13

Visualizing the Metropolis algorithm

SLIDE 14

Metropolis-Hastings algorithm generalizes the Metropolis algorithm for non-symmetric transitions
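The acceptance rule itself was an image on this slide; the standard Metropolis-Hastings acceptance probability for an unnormalized target p̃ is:

```latex
A(\mathbf{z}^{*}, \mathbf{z}^{(\tau)}) = \min\!\left(1,\;
  \frac{\tilde{p}(\mathbf{z}^{*})\, q(\mathbf{z}^{(\tau)} \mid \mathbf{z}^{*})}
       {\tilde{p}(\mathbf{z}^{(\tau)})\, q(\mathbf{z}^{*} \mid \mathbf{z}^{(\tau)})}\right)
```

When q is symmetric, the q factors cancel and this reduces to the Metropolis criterion min(1, p̃(z*)/p̃(z(τ))).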

SLIDE 15

Gibbs Sampling is a simple coordinate-wise MCMC method that needs no separate proposal distribution
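As an illustration (my own example, not from the slides), Gibbs sampling for a correlated bivariate Gaussian, where each full conditional is itself a one-dimensional Gaussian:

```python
import random

def gibbs_bivariate_normal(rho, n, seed=0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian with
    correlation rho; each coordinate is resampled from its exact conditional:
    z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2 | z1."""
    rng = random.Random(seed)
    z1 = z2 = 0.0
    sd = (1.0 - rho * rho) ** 0.5
    chain = []
    for _ in range(n):
        z1 = rng.gauss(rho * z2, sd)   # sample z1 from p(z1 | z2)
        z2 = rng.gauss(rho * z1, sd)   # sample z2 from p(z2 | z1)
        chain.append((z1, z2))
    return chain

chain = gibbs_bivariate_normal(rho=0.8, n=50_000)[5_000:]  # drop burn-in
```

The chain recovers the target correlation, but the stronger the correlation between coordinates, the slower the coordinate-wise updates explore the distribution.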

SLIDE 16

Markov blanket of a node in an MRF

  • It is simply the set of neighbouring nodes

SLIDE 17

Example: Estimate prob. of one node given others in an image denoising MRF

  • Potentials:

– For the observation: −η xi yi
– For spatial coherence: −β xi xj
– For the prior: −h xi

  • We want P(xi | ~xi), i.e. P(xi | X\xi, Y)
  • What is the Markov blanket of xi?
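A hedged sketch of the Gibbs conditional this slide asks for, using the three energy terms listed above (−η xi yi, −β xi xj, −h xi) with x, y ∈ {−1, +1}; the Markov blanket of xi is its four grid neighbours plus the observed pixel yi:

```python
import math
import random

def gibbs_update_pixel(x, y, i, j, beta, eta, h, rng):
    """One Gibbs update for pixel x[i][j] in the denoising MRF, with energy
    terms as on the slide: -eta*x_i*y_i, -beta*x_i*x_j, -h*x_i, and
    x, y in {-1, +1}. The Markov blanket of x_ij is its grid neighbours
    plus the observed pixel y_ij, giving
    P(x_ij = +1 | blanket) = sigmoid(2 * (beta*nbr_sum + eta*y_ij + h))."""
    rows, cols = len(x), len(x[0])
    nbr_sum = sum(x[a][b]
                  for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                  if 0 <= a < rows and 0 <= b < cols)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * (beta * nbr_sum + eta * y[i][j] + h)))
    x[i][j] = 1 if rng.random() < p_plus else -1
    return p_plus

# A flipped centre pixel surrounded by agreeing neighbours and a clean
# observation is very likely to be restored to +1.
rng = random.Random(0)
x = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]
y = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
p = gibbs_update_pixel(x, y, 1, 1, beta=1.0, eta=1.0, h=0.0, rng=rng)
```

Sweeping this update over all pixels repeatedly performs Gibbs sampling of P(X | Y) for the denoising model.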

SLIDE 18

Gibbs sampling as a special case of MH

  • Proposal distribution: resample one coordinate zk from its full conditional p(zk | z\k)
  • The other dimensions are held constant, so z*\k = z\k
  • So, the acceptance probability evaluates to exactly 1
  • So, the step is always accepted
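The equations on this slide were images that did not survive extraction; the algebra behind the claim that the step is always accepted can be reconstructed as in PRML §11.3:

```latex
q_k(\mathbf{z}^{*} \mid \mathbf{z}) = p(z_k^{*} \mid \mathbf{z}_{\setminus k}),
\qquad \mathbf{z}^{*}_{\setminus k} = \mathbf{z}_{\setminus k}

A(\mathbf{z}^{*}, \mathbf{z})
  = \frac{p(\mathbf{z}^{*})\, q_k(\mathbf{z} \mid \mathbf{z}^{*})}
         {p(\mathbf{z})\, q_k(\mathbf{z}^{*} \mid \mathbf{z})}
  = \frac{p(z_k^{*} \mid \mathbf{z}_{\setminus k})\, p(\mathbf{z}_{\setminus k})\, p(z_k \mid \mathbf{z}_{\setminus k})}
         {p(z_k \mid \mathbf{z}_{\setminus k})\, p(\mathbf{z}_{\setminus k})\, p(z_k^{*} \mid \mathbf{z}_{\setminus k})}
  = 1
```

The cancellation uses z*\k = z\k, i.e. only the resampled coordinate changes.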

SLIDE 19

Issues with Gibbs sampling

  • Initialization is random
  • Samples are not independent

– The burn-in period should be discarded (a random initialization may start in, and wander through, a low-probability region for some time)

  • Time taken is linear in the number of samples
  • The number of iterations scales with dimensionality

SLIDE 20

RBM and its energy function defined

  • An RBM is a bipartite graph between its visible and hidden sets of nodes

  • Its energy function is:
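The formula itself did not survive extraction; in Fischer & Igel's notation, with binary units vj and hi, weights wij, and biases bj and ci, it reads:

```latex
E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij}\, h_i v_j
  \;-\; \sum_{j=1}^{m} b_j v_j \;-\; \sum_{i=1}^{n} c_i h_i
```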

Source: “An Introduction to Restricted Boltzmann Machines”, Asja Fischer and Christian Igel, CIARP 2012

SLIDE 21

Marginals in an RBM

  • The hidden nodes can be used to learn a code explaining the visible nodes

– In a Deep Belief Net (DBN), more layers can be added on top

  • Due to its bipartite nature, the Markov blanket of a node from either set is simply the other set of nodes
  • This leads to a simple product form for the conditionals of the nodes in a set
  • This leads towards a formulation called Product of Experts (PoE)
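Concretely, the product form mentioned above is:

```latex
p(\mathbf{h} \mid \mathbf{v}) = \prod_{i=1}^{n} p(h_i \mid \mathbf{v}),
\qquad
p(\mathbf{v} \mid \mathbf{h}) = \prod_{j=1}^{m} p(v_j \mid \mathbf{h})
```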

SLIDE 22

Gibbs Sampling in RBM

  • Let us look at the marginal of the visible node

SLIDE 23

Gibbs Sampling in RBM

  • The RBM can be interpreted as a stochastic neural network, for which block Gibbs sampling can be used
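A minimal sketch of one block Gibbs step in a binary RBM (toy sizes and weights are my own illustration; the conditionals are the standard logistic RBM ones):

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def block_gibbs_step(v, W, b, c, rng):
    """One block Gibbs step in a binary RBM. Because the graph is bipartite,
    hidden units are conditionally independent given v (and vice versa), so
    each layer can be resampled in a single block:
    p(h_i = 1 | v) = sigmoid(c_i + sum_j W[i][j] * v_j)
    p(v_j = 1 | h) = sigmoid(b_j + sum_i W[i][j] * h_i)"""
    h = [1 if rng.random() < sigmoid(c[i] + sum(W[i][j] * v[j] for j in range(len(v))))
         else 0
         for i in range(len(c))]
    v_new = [1 if rng.random() < sigmoid(b[j] + sum(W[i][j] * h[i] for i in range(len(c))))
             else 0
             for j in range(len(b))]
    return v_new, h

# Toy RBM with 3 visible and 2 hidden units; all numbers are illustrative.
rng = random.Random(0)
W = [[2.0, -1.0, 0.5], [0.0, 1.5, -0.5]]   # W[i][j]: hidden i, visible j
b, c = [0.1, -0.2, 0.0], [0.0, 0.3]
v, h = block_gibbs_step([1, 0, 1], W, b, c, rng)
```

Alternating these two block updates is exactly the Gibbs chain used, for example, inside contrastive-divergence training.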

SLIDE 24

In summary

  • Monte Carlo methods are often preferred over analytical methods to estimate probability distributions and their marginals in complex PGMs
  • Rejection sampling and importance sampling do not use samples effectively

– Finding a good proposal distribution can be tricky

  • Markov chain Monte Carlo is often preferred over simple Monte Carlo
  • The initial few samples of MCMC methods are discarded as burn-in
  • Metropolis (-Hastings) uses a proposal distribution for each step
  • Gibbs sampling is often the preferred MCMC method

– It makes use of Markov blankets to compute single-variable conditionals