STAT 339 Markov Chain Monte Carlo (MCMC), 7 April 2017



SLIDE 1

STAT 339 Markov Chain Monte Carlo (MCMC)

7 April 2017: Some theory and intuition about MCMC generally

SLIDE 2

Outline

▸ Gibbs Sampling in Mixture Models
  • Sampling as an Approximation Method
  • Gibbs Sampling: Intuition and Motivation
  • Concretely: The Algorithm
▸ A Bit of MCMC Theory



SLIDE 5

Intractable Posterior

▸ It is easy to write down the unnormalized posterior density:

    p(θ ∣ Y) ∝ p(θ) p(Y ∣ θ)

▸ GMM case:

    p(θ ∣ Y) ∝ p(π) ∏_{k=1}^{K} p(µ_k, Σ_k) ∏_{n=1}^{N} ( ∑_{k=1}^{K} π_k N(y_n ∣ µ_k, Σ_k) )

  where, e.g., p(π) is Dirichlet and p(µ_k, Σ_k) is Normal-Inverse-Wishart (or p(µ_k, Σ_k⁻¹) is Normal-Wishart)

▸ But it is hard to do much with it (except maybe find a local maximum)

▸ Solution: try to draw samples from it.

SLIDE 6

When can we sample from but not analyze a distribution?

▸ Consider trying to evaluate P(Straight) in poker (say)

▸ You might be able to do this analytically, but if you have a computer handy, it is easy to simulate dealing out several million poker hands

▸ Then, simply count:

    P(Straight) ≈ (1/S) ∑_{s=1}^{S} I(Did I get a straight in hand s?)
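The counting estimator above is easy to try directly. Below is a minimal sketch; the (rank, suit) hand representation and the simplified `is_straight` check are my own, not from the slides:

```python
import random

RANKS = range(2, 15)  # 2..14, with 14 standing in for the ace
DECK = [(r, s) for r in RANKS for s in range(4)]  # 52 (rank, suit) pairs

def is_straight(hand):
    """Simplified check: five distinct consecutive ranks (ace high or low)."""
    ranks = sorted({r for r, _ in hand})
    if len(ranks) != 5:
        return False
    return ranks[-1] - ranks[0] == 4 or ranks == [2, 3, 4, 5, 14]

def estimate_p_straight(num_hands=1_000_000, seed=0):
    """Monte Carlo estimate: deal many hands, count the straights."""
    rng = random.Random(seed)
    hits = sum(is_straight(rng.sample(DECK, 5)) for _ in range(num_hands))
    return hits / num_hands
```

With enough simulated hands the estimate settles near the analytic value of roughly 0.0039 (counting straight flushes as straights).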


SLIDE 8

Gibbs Sampling as "Stochastic EM"

▸ The EM algorithm involves, iteratively:
  • 1. Computing an expectation over cluster assignments (using the posterior, conditioned on parameter values)
  • 2. Arg-maximizing parameter values (using the likelihood/posterior, conditioned on expected cluster assignments)

▸ Gibbs sampling (in this context) involves, iteratively:
  • 1. Sampling cluster assignments (using the posterior, conditioned on parameter values)
  • 2. Sampling parameter values (using the posterior, conditioned on cluster assignments)

SLIDE 9

What does Gibbs Yield?

▸ Over many iterations, Gibbs sampling yields a collection of many cluster assignments and many parameter values, distributed according to the joint posterior:

    { {z_n^(s)}_{n=1}^{N}, π^(s), {µ_k^(s), Σ_k^(s)}_{k=1}^{K} }_{s=1}^{S}

▸ These can be used to approximate more or less any function of the assignments and parameters we want

▸ Example:

    p(z_n = z_m ∣ Y) ≈ (1/S) ∑_{s=1}^{S} I(z_n^(s) = z_m^(s))

▸ Or, for future data:

    p(z_new = z_m ∣ y_new, Y) ≈ (1/S) ∑_{s=1}^{S} [ π^(s)_{z_m^(s)} N(y_new ∣ µ^(s)_{z_m^(s)}, Σ^(s)_{z_m^(s)}) ] / [ ∑_{k=1}^{K} π_k^(s) N(y_new ∣ µ_k^(s), Σ_k^(s)) ]
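The co-assignment probability above, for instance, is just a Monte Carlo average over the retained draws. A minimal sketch (the (S, N) array layout is an assumption, not from the slides):

```python
import numpy as np

def co_assignment_prob(z_samples, n, m):
    """Monte Carlo estimate of p(z_n = z_m | Y).

    z_samples: shape (S, N), one row of cluster assignments per
    retained Gibbs draw.
    """
    z_samples = np.asarray(z_samples)
    # fraction of draws in which points n and m share a cluster
    return np.mean(z_samples[:, n] == z_samples[:, m])
```

A nice property of this quantity: it only compares assignments within a single draw, so it is unaffected by label switching across draws.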


SLIDE 11

Concretely: The Algorithm

▸ In general, Gibbs sampling applies when we have multiple sets of unknown parameters, but where it is difficult to sample all of them directly from their joint distribution.

▸ Mixture model: the parameters are z, π, and θ

▸ Idea: partition the parameters into "blocks", and iteratively fix all but one block and sample the remaining block from its posterior conditioned on the others

▸ Here, we need:
  1. p(z ∣ Y, θ, π)
  2. p(θ, π ∣ z, Y)

SLIDE 12

Summary: Gibbs Algorithm for GMM

  • 1. Initialize cluster assignments z^(0) (e.g., using K-means)
  • 2. For s = 1, ..., S, sample
    (a) µ_k^(s), Σ_k^(s) ∼ p(µ_k, Σ_k ∣ z^(s−1), Y), k = 1, ..., K
        (or, alternatively, subdivide into two blocks, with µ_k and Σ_k sampled separately given the other)
    (b) π^(s) ∼ p(π ∣ z^(s−1))
    (c) z_n^(s) ∼ p(z_n ∣ π^(s), θ^(s), y_n), n = 1, ..., N
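The sweep above can be sketched as a loop over three conditional samplers. The `sample_*` arguments here are hypothetical callbacks standing in for the specific conditionals spelled out on the next two slides:

```python
import numpy as np

def gibbs_gmm(Y, K, S, sample_params, sample_weights, sample_assignments, z_init):
    """Skeleton of one Gibbs run for a GMM (structure only).

    sample_params, sample_weights, sample_assignments are placeholders
    for the conditional samplers; they are not defined by the slides.
    """
    z = np.asarray(z_init)
    draws = []
    for s in range(S):
        theta = sample_params(Y, z, K)        # (a) mu_k, Sigma_k | z, Y
        pi = sample_weights(z, K)             # (b) pi | z
        z = sample_assignments(Y, pi, theta)  # (c) z_n | pi, theta, y_n
        draws.append((theta, pi, z.copy()))   # retain the draw
    return draws
```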

SLIDE 13

Specific Conditional Distributions

Assuming conjugate priors and Σ_k restricted to diagonal, with precision elements h_{k,d} = σ_{k,d}^{−2}:

    p(µ_{k,d} ∣ h_{k,d}, z, Y) = N(µ_post, h_post^{−1})
    where µ_post = h_post^{−1} (h_prior µ_prior + n_k h_{k,d} ȳ_{k,d})
          h_post = h_prior + n_k h_{k,d}

    p(h_{k,d} ∣ µ_k, z, Y) = Gamma(a_post, b_post)
    where a_post = a_prior + n_k / 2
          b_post = b_prior + (1/2) ∑_{n : z_n = k} (y_{n,d} − µ_{k,d})²

    p(π ∣ z, Y) = Dir(α_post,1, ..., α_post,K)
    where α_post,k = α_prior,k + n_k

Each of these can be sampled from with readily available methods (e.g., numpy.random.dirichlet()).
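In code, each of these draws is a one-liner with NumPy's random generators. The sketch below covers a single dimension d of a single cluster k; the helper names and the empty-cluster fallback are my additions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mu(y_k, h, h_prior, mu_prior):
    """Draw mu_{k,d} | h_{k,d}, z, Y (Normal, parameterized by precision)."""
    n_k = len(y_k)
    if n_k == 0:  # empty cluster: fall back to a draw from the prior
        return rng.normal(mu_prior, 1.0 / np.sqrt(h_prior))
    h_post = h_prior + n_k * h
    mu_post = (h_prior * mu_prior + n_k * h * np.mean(y_k)) / h_post
    return rng.normal(mu_post, 1.0 / np.sqrt(h_post))

def sample_h(y_k, mu, a_prior, b_prior):
    """Draw h_{k,d} | mu_k, z, Y (Gamma posterior on the precision)."""
    a_post = a_prior + len(y_k) / 2.0
    b_post = b_prior + 0.5 * np.sum((y_k - mu) ** 2)
    # NumPy's Gamma takes shape and *scale* (= 1/b_post), not the rate
    return rng.gamma(a_post, 1.0 / b_post)

def sample_pi(z, K, alpha_prior):
    """Draw pi | z (Dirichlet with cluster counts added to the prior)."""
    counts = np.bincount(z, minlength=K)
    return rng.dirichlet(alpha_prior + counts)
```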

SLIDE 14

Conditional for z

We have already worked out that

    p(z_n = k ∣ π, θ, Y) = Categorical(q_n1, ..., q_nK)
    where q_nk = π_k N(y_n ∣ µ_k, Σ_k) / ∑_{k′=1}^{K} π_k′ N(y_n ∣ µ_k′, Σ_k′)

Conditioned on the cluster parameters, all z_n are independent. Sampling from this distribution is just a matter of enumerating the probabilities, or using a library function for categorical distributions (e.g., numpy.random.choice()).
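A 1-D sketch of this step (scalar per-cluster variances assumed for simplicity; the slides' version is the same with multivariate densities):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(y, mu, var):
    """Density of N(y | mu, var) for scalar Gaussians."""
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def sample_z(y, pi, mu, var):
    """Draw z_n | pi, theta, y_n for all n at once.

    y: (N,) data; pi: (K,) weights; mu, var: (K,) cluster parameters.
    """
    # unnormalized q_{nk} = pi_k * N(y_n | mu_k, var_k), then normalize rows
    q = pi * normal_pdf(y[:, None], mu[None, :], var[None, :])
    q /= q.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(pi), p=row) for row in q])
```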

SLIDE 15

Gibbs Sampling More Generally

▸ Gibbs is a very general algorithm that can be used to sample from complicated probability distributions

▸ In general, suppose we have p(θ) = p(θ_1, ..., θ_M), where each θ_m is a collection of random variables. We can generate samples from this distribution by
  • 1. Initializing each θ_m
  • 2. For s = 1, ..., S; for m = 1, ..., M:
       sampling θ_m^(s) from p(θ_m ∣ θ_1^(s), ..., θ_{m−1}^(s), θ_{m+1}^(s−1), ..., θ_M^(s−1))

▸ Note that we can partition θ into the θ_m however we like, as long as certain conditions are met (more on this later)

SLIDE 16

Application: Image De-Noising

Consider a generative model of noisy images in which θ represents the pixels of the "pure" image, and Y is the observed "corrupted" image.

SLIDE 17

Application: Image De-Noising

▸ We can define a prior on θ that encourages neighboring pixels to be similar, and a likelihood on Y that encourages observed pixels to be close to the corresponding one in the pure image.

▸ Then, we Gibbs sample each pixel θ_m conditioned on the image and the other θs:

    p(θ_m ∣ θ_−m, Y) ∝ p(θ_m ∣ θ_−m) p(y_m ∣ θ_m)
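The slides don't pin down the prior, but a common concrete instance uses ±1 pixels with an Ising-style smoothness term. The sketch below is illustrative only; the energy weights beta (smoothness) and eta (fidelity to the observed image) are assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_denoise(Y, beta=1.0, eta=2.0, sweeps=10):
    """Gibbs sampling for binary (+1/-1) image denoising.

    Uses p(theta_m | theta_-m, y_m) proportional to
    exp(beta * theta_m * sum(neighbors) + eta * theta_m * y_m).
    """
    theta = Y.copy()
    H, W = Y.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                # sum of the 4-neighborhood (edges handled by bounds check)
                nb = sum(theta[a, b]
                         for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                         if 0 <= a < H and 0 <= b < W)
                # log p(+1) - log p(-1) for this pixel's conditional
                logit = 2 * (beta * nb + eta * Y[i, j])
                p_plus = 1.0 / (1.0 + np.exp(-logit))
                theta[i, j] = 1 if rng.random() < p_plus else -1
    return theta
```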

SLIDE 18

Results from Two Priors


SLIDE 20

Markov Chain Monte Carlo

▸ MCMC gets its name from the fact that it is a method for generating random samples (the "Monte Carlo" part) and the fact that each sample depends on exactly one previous sample (the "Markov Chain" part)

▸ In general, it involves sampling from q(θ^(s) ∣ θ^(s−1)) for a suitable "transition kernel" q.

▸ Under certain conditions, the θ^(s) for large s are distributed according to p(θ), which is determined by q.

▸ Idea: construct a q so that p is the target distribution we want to get samples from.

SLIDE 21

Markov Chains

▸ Simple Markov chain: two possible states, {rain, sun}, at each time step t.

▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix

    Q =         rain  sun
        rain    0.7   0.3
        sun     0.1   0.9

  where rows represent the "from" state and columns represent the "to" state. That is, q(rain ∣ rain) = 0.7, q(sun ∣ rain) = 0.3, etc.

SLIDE 22

Markov Chains

▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix

    Q =         rain  sun
        rain    0.7   0.3
        sun     0.1   0.9

▸ Question: If it is sunny today, what is the probability that it will be sunny tomorrow?

▸ If it is sunny today, what is the probability that it will be rainy two days from now?

SLIDE 23

Markov Chains

▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix

    Q =         rain  sun
        rain    0.7   0.3
        sun     0.1   0.9

▸ If it is sunny today, what is the probability that it will be rainy two days from now?

▸ We have

    P(θ_{t+2} = rain ∣ θ_t = sun)
      = ∑_s P(θ_{t+1} = s, θ_{t+2} = rain ∣ θ_t = sun)
      = ∑_s q(s ∣ sun) q(rain ∣ s)
      = (Q²)_{sun,rain}
      = (0.1, 0.9) ⋅ (0.7, 0.1)
      = 0.07 + 0.09 = 0.16

SLIDE 24

Markov Chains

▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix

    Q =         rain  sun
        rain    0.7   0.3
        sun     0.1   0.9

▸ If it is sunny today, what is the probability that it will be rainy a year from now?

▸ We have the recurrence relation

    P(θ_{t+m} ∣ θ_t) = ∑_s P(θ_{t+m−1} = s, θ_{t+m} ∣ θ_t)

▸ This is the matrix product of the (m−1)-step transition matrix with the one-step transition matrix.

▸ So the m-step transition matrix is just Q^m

SLIDE 25

Simulation of Q^∞

▸ What does the "many step" transition matrix look like?

▸ Simulation

▸ We find that regardless of the starting state, there is a 1/4 chance of rain and a 3/4 chance of sun on any day far enough in the future.

▸ This is called the stationary distribution of the transition matrix Q

▸ Mathematically, the row vector π is stationary w.r.t. Q if πQ = π
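Both claims are easy to check numerically: the m-step matrix is just a matrix power, and stationarity is one matrix-vector product.

```python
import numpy as np

# Transition matrix from the slides: rows = "from", columns = "to",
# in the order (rain, sun)
Q = np.array([[0.7, 0.3],
              [0.1, 0.9]])

# The "many step" matrix: every row approaches the stationary distribution
print(np.linalg.matrix_power(Q, 100))  # both rows are approximately [0.25, 0.75]

# Stationarity: the row vector pi satisfies pi Q = pi
pi = np.array([0.25, 0.75])
print(np.allclose(pi @ Q, pi))  # True
```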

SLIDE 26

MCMC In General

▸ Gibbs sampling (and other MCMC algorithms) constructs a transition kernel Q (in Gibbs, subdivided into smaller transitions) so that the stationary distribution is the target distribution.

▸ Hence, after a sufficiently long time horizon, we get samples drawn from the posterior.

SLIDE 27

Beyond Gibbs: Metropolis-Hastings

Gibbs turns out to be a special case of the Metropolis-Hastings algorithm.

The Metropolis-Hastings Algorithm: want samples from p(θ)

  • 1. Initialize θ^(0)
  • 2. For s = 1, ..., S:
    (a) Draw a sample θ̃ from some proposal density q̃(θ ∣ θ^(s−1))
    (b) Compute the acceptance probability

        A = min{1, [p(θ̃) q̃(θ^(s−1) ∣ θ̃)] / [p(θ^(s−1)) q̃(θ̃ ∣ θ^(s−1))]}

    (c) Sample u ∼ Unif(0, 1). If u ≤ A, set θ^(s) = θ̃. Otherwise, reject and set θ^(s) = θ^(s−1).

SLIDE 28

Acceptance Probability Intuition

The acceptance probability is

    A = min{1, [p(θ̃) q̃(θ^(s−1) ∣ θ̃)] / [p(θ^(s−1)) q̃(θ̃ ∣ θ^(s−1))]}

Qualitative intuition: tend to reject values that q̃ "oversuggests", and tend to stay at values that are "good" and hard to get back to later.

SLIDE 29

Another Special Case: Metropolis Algorithm

Proposal density: Gaussian random walk with fixed covariance Σ:

    q̃(θ ∣ θ′) = N(θ ∣ θ′, Σ)

Note that q̃(θ ∣ θ′) = q̃(θ′ ∣ θ). Then we accept with probability

    A = min{1, [p(θ̃) q̃(θ^(s−1) ∣ θ̃)] / [p(θ^(s−1)) q̃(θ̃ ∣ θ^(s−1))]} = min{1, p(θ̃) / p(θ^(s−1))}

In other words: accept all "up-hill" steps, allow some "down-hill" steps.
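A minimal random-walk Metropolis sampler for a 1-D target, working in log space to avoid underflow (the target, step size, and function names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(log_p, theta0, step=1.0, S=10_000):
    """Random-walk Metropolis for a 1-D target.

    log_p: log of a possibly unnormalized target density. Because the
    Gaussian proposal is symmetric, A = min(1, p(proposal)/p(current)).
    """
    theta = theta0
    samples = np.empty(S)
    for s in range(S):
        proposal = theta + step * rng.normal()
        # accept "up-hill" moves always, "down-hill" moves with the ratio
        if np.log(rng.random()) < log_p(proposal) - log_p(theta):
            theta = proposal
        samples[s] = theta  # on rejection, the old value is recorded again
    return samples
```

Targeting a standard normal via `log_p = lambda t: -0.5 * t * t`, the sample mean and standard deviation settle near 0 and 1.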

SLIDE 30

Gibbs Yields Zero Rejections

In the mth part of a Gibbs iteration, we sample θ′_m according to

    q̃(θ′_m ∣ θ_−m) = p(θ_1, ..., θ_{m−1}, θ′_m, θ_{m+1}, ..., θ_M) / p(θ_1, ..., θ_{m−1}, θ_{m+1}, ..., θ_M)

The ratio of this and its reverse is just p(θ′)/p(θ), so the acceptance probability becomes

    A = min{1, [p(θ′) q̃(θ ∣ θ′)] / [p(θ) q̃(θ′ ∣ θ)]} = min{1, (p(θ′)/p(θ)) (p(θ)/p(θ′))} = 1

SLIDE 31

When will MCMC (i.e., q) converge (to p)?

Some (rough) conditions (not rigorous!¹):

  • 1. Irreducibility: must be able to get anywhere from anywhere; no inescapable state vortexes (attractors) or isolated components of state space
  • 2. Aperiodicity: no deterministic cycles

A sufficient (but not necessary) condition is detailed balance:

    p(θ) q(θ′ ∣ θ) = p(θ′) q(θ ∣ θ′) for all θ, θ′

where p(θ) is a stationary distribution. I.e., the expected number of jumps between any two states is the same in both directions.
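For the rain/sun chain from earlier, detailed balance can be checked directly: with π = (1/4, 3/4), the expected rain→sun flow is 0.25 × 0.3 = 0.075 and the sun→rain flow is 0.75 × 0.1 = 0.075.

```python
import numpy as np

Q = np.array([[0.7, 0.3],    # rain -> rain, rain -> sun
              [0.1, 0.9]])   # sun -> rain,  sun -> sun
pi = np.array([0.25, 0.75])  # stationary distribution of Q

# flows[i, j] = pi_i * q(j | i): expected rate of i -> j jumps at stationarity
flows = pi[:, None] * Q
print(np.allclose(flows, flows.T))  # True: this chain satisfies detailed balance
```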

¹In fact, strictly, even these are only sufficient when the state space is finite. See Neal (2003) for a good review.