STAT 339: Markov Chain Monte Carlo (MCMC)
7 April 2017
Some theory and intuition about MCMC generally
Outline
▸ Gibbs Sampling in Mixture Models
  ▸ Sampling as an Approximation Method
  ▸ Gibbs Sampling Intuition and Motivation
  ▸ Concretely: The Algorithm
▸ A Bit of MCMC Theory
Intractable Posterior
▸ It is easy to write down the unnormalized posterior density: $p(\theta \mid Y) \propto p(\theta)\, p(Y \mid \theta)$
▸ GMM case:
$$p(\theta \mid Y) \propto p(\pi) \prod_{k=1}^{K} p(\mu_k, \Sigma_k) \prod_{n=1}^{N} \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y_n \mid \mu_k, \Sigma_k) \right)$$
  where, e.g., $p(\pi)$ is Dirichlet and $p(\mu_k, \Sigma_k)$ is Normal-Inverse-Wishart (or $p(\mu_k, \Sigma_k^{-1})$ is Normal-Wishart).
▸ But it is hard to do much with this density (except maybe find a local maximum).
▸ Solution: try to draw samples from it.
When can we sample from but not analyze a distribution?
▸ Consider trying to evaluate P(Straight) in poker (say).
▸ You might be able to do this analytically, but if you have a computer handy, it is easy to simulate dealing out several million poker hands.
▸ Then, simply count (a simulation sketch follows below):
$$P(\text{Straight}) \approx \frac{1}{S} \sum_{s=1}^{S} I(\text{straight in hand } s)$$
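As a concrete illustration (my own sketch, not from the slides), here is that Monte Carlo count with numpy. The straight test treats the ace as high or low and, to keep it short, does not exclude straight flushes:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_straight(hand):
    """True if the 5-card hand has five consecutive ranks (ace low or high).
    Straight flushes are not excluded in this rough version."""
    ranks = sorted({card % 13 for card in hand})   # 0 = ace, ..., 12 = king
    if len(ranks) < 5:                             # paired hands can't be straights
        return False
    if ranks == [0, 9, 10, 11, 12]:                # 10-J-Q-K-A (ace high)
        return True
    return ranks[4] - ranks[0] == 4                # five distinct consecutive ranks

S = 1_000_000
hits = sum(is_straight(rng.choice(52, size=5, replace=False)) for _ in range(S))
print(hits / S)   # ≈ 0.0039; exact value with straight flushes: 10240/2598960 ≈ 0.00394
```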
Gibbs Sampling as "Stochastic EM"
▸ The EM algorithm involves, iteratively:
  1. Computing an expectation over cluster assignments (using the posterior, conditioned on parameter values)
  2. Arg-maximizing parameter values (using the likelihood/posterior conditioned on expected cluster assignments)
▸ Gibbs sampling (in this context) involves, iteratively:
  1. Sampling cluster assignments (using the posterior, conditioned on parameter values)
  2. Sampling parameter values (using the posterior, conditioned on cluster assignments)
What does Gibbs Yield?
▸ Over many iterations, Gibbs sampling yields a collection of many cluster assignments and many parameter values, distributed according to the joint posterior:
$$\left\{ \{z_n^{(s)}\}_{n=1}^{N},\ \pi^{(s)},\ \{\mu_k^{(s)}, \Sigma_k^{(s)}\}_{k=1}^{K} \right\}_{s=1}^{S}$$
▸ These can be used to approximate more or less any function of the assignments and parameters we want.
▸ Example (a one-line implementation follows below):
$$p(z_n = z_m \mid Y) \approx \frac{1}{S} \sum_{s=1}^{S} I\!\left(z_n^{(s)} = z_m^{(s)}\right)$$
▸ Or, for future data:
$$p(z_\text{new} = z_m \mid y_\text{new}, Y) \approx \frac{1}{S} \sum_{s=1}^{S} \frac{\pi^{(s)}_{z_m^{(s)}} \, \mathcal{N}\!\left(y_\text{new} \mid \mu^{(s)}_{z_m^{(s)}}, \Sigma^{(s)}_{z_m^{(s)}}\right)}{\sum_{k=1}^{K} \pi^{(s)}_k \, \mathcal{N}\!\left(y_\text{new} \mid \mu^{(s)}_k, \Sigma^{(s)}_k\right)}$$
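For instance, if the sampled assignments are stacked into an (S, N) integer array Z (a hypothetical variable name, one row per Gibbs iteration), the first estimate above is a one-liner:

```python
import numpy as np

def coassignment_prob(Z, n, m):
    """Monte Carlo estimate of p(z_n = z_m | Y): the fraction of posterior
    samples in which points n and m land in the same cluster."""
    return float(np.mean(Z[:, n] == Z[:, m]))
```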
Concretely: The Algorithm
▸ In general, Gibbs sampling applies when we have multiple sets of unknown parameters, but it is difficult to sample all of them directly from their joint distribution.
▸ Mixture model: parameters are z, π, and θ.
▸ Idea: partition the parameters into “blocks”, and iteratively fix all but one block and sample the remaining block from its posterior conditioned on the others.
▸ Here, we need:
  1. p(z ∣ Y, θ, π)
  2. p(θ, π ∣ z, Y)
Summary: Gibbs Algorithm for GMM
1. Initialize cluster assignments $z^{(0)}$ (e.g., using K-means).
2. For $s = 1, \dots, S$, sample:
   (a) $\mu_k^{(s)}, \Sigma_k^{(s)} \sim p(\mu_k, \Sigma_k \mid z^{(s-1)}, Y)$ for $k = 1, \dots, K$ (or, alternatively, subdivide into two blocks, with $\mu_k$ and $\Sigma_k$ sampled separately given the other)
   (b) $\pi^{(s)} \sim p(\pi \mid z^{(s-1)})$
   (c) $z_n^{(s)} \sim p(z_n \mid \pi^{(s)}, \theta^{(s)}, y_n)$ for $n = 1, \dots, N$
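A minimal skeleton of this loop, with init_z, sample_params, and sample_z standing in for the conditional draws spelled out on the next two slides (the function names are placeholders of mine, not a fixed API):

```python
def gibbs_gmm(Y, K, S, init_z, sample_params, sample_z):
    """Skeleton of the Gibbs sweep above; the three function arguments
    are the conditional samplers sketched on the following slides."""
    z = init_z(Y, K)                        # step 1: e.g., K-means labels
    samples = []
    for s in range(S):                      # step 2
        mu, h, pi = sample_params(Y, z, K)  # steps (a) and (b)
        z = sample_z(Y, pi, mu, h)          # step (c)
        samples.append((mu, h, pi, z.copy()))
    return samples
```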
Specific Conditional Distributions
Assuming conjugate priors and $\Sigma_k$ restricted to diagonal, with precision elements $h_{k,d} = \sigma_{k,d}^{-2}$:
$$p(\mu_{k,d} \mid h_{k,d}, z, Y) = \mathcal{N}(\mu_\text{post}, h_\text{post}^{-1})$$
where
$$\mu_\text{post} = h_\text{post}^{-1}\left(h_\text{prior}\mu_\text{prior} + n_k h_{k,d} \bar{y}_{k,d}\right), \qquad h_\text{post} = h_\text{prior} + n_k h_{k,d}$$
$$p(h_{k,d} \mid \mu_k, z, Y) = \mathrm{Gamma}(a_\text{post}, b_\text{post})$$
where
$$a_\text{post} = a_\text{prior} + \frac{n_k}{2}, \qquad b_\text{post} = b_\text{prior} + \frac{1}{2}\sum_{n : z_n = k} (y_{n,d} - \mu_{k,d})^2$$
$$p(\pi \mid z, Y) = \mathrm{Dir}(\alpha_{\text{post},1}, \dots, \alpha_{\text{post},K})$$
where $\alpha_{\text{post},k} = \alpha_{\text{prior},k} + n_k$.

Each of these can be sampled from with readily available methods (e.g., numpy.random.dirichlet()); see the sketch below.
Conditional for z
We have already worked out that
$$p(z_n = k \mid \pi, \theta, Y) = \mathrm{Categorical}(q_{n1}, \dots, q_{nK}), \qquad q_{nk} = \frac{\pi_k \, \mathcal{N}(y_n \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'} \, \mathcal{N}(y_n \mid \mu_{k'}, \Sigma_{k'})}$$
Conditioned on the cluster parameters, all $z_n$ are independent. Sampling from this distribution is just a matter of enumerating the probabilities, or using a library function for categorical distributions (e.g., numpy.random.choice()), as in the sketch below.
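A sketch for the diagonal-covariance case (mu and h are (K, D) arrays of means and precisions; the names are mine), computing the $q_{nk}$ on the log scale for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(Y, pi, mu, h):
    """Draw all z_n given weights pi (K,), means mu (K, D), precisions h (K, D)."""
    D = Y.shape[1]
    # log N(y_n | mu_k, diag(1/h_k)) for every (n, k) pair, via broadcasting
    sq = (Y[:, None, :] - mu[None, :, :]) ** 2                  # (N, K, D)
    log_dens = 0.5 * np.sum(np.log(h) - h * sq, axis=2) - 0.5 * D * np.log(2 * np.pi)
    log_q = np.log(pi)[None, :] + log_dens
    log_q -= log_q.max(axis=1, keepdims=True)                   # stabilize before exp
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)                           # rows are (q_n1,...,q_nK)
    return np.array([rng.choice(len(pi), p=row) for row in q])
```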
Gibbs Sampling More Generally
▸ Gibbs is a very general algorithm that can be used to sample from complicated probability distributions.
▸ In general, suppose we have $p(\theta) = p(\theta_1, \dots, \theta_M)$, where each $\theta_m$ is a collection of random variables. We can generate samples from this distribution by (see the sketch after this list):
  1. Initializing each $\theta_m$
  2. For $s = 1, \dots, S$; for $m = 1, \dots, M$: sample $\theta_m^{(s)}$ from $p(\theta_m \mid \theta_1^{(s)}, \dots, \theta_{m-1}^{(s)}, \theta_{m+1}^{(s-1)}, \dots, \theta_M^{(s-1)})$
▸ Note that we can partition θ into the $\theta_m$ however we like, as long as certain conditions are met (more on this later).
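A classic self-contained illustration (my example, not from the slides): Gibbs sampling a standard bivariate normal with correlation rho, where each full conditional is a one-dimensional normal:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, S = 0.9, 10_000
theta = np.zeros(2)
samples = np.empty((S, 2))
for s in range(S):
    # theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2), and symmetrically
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))
    samples[s] = theta
print(np.corrcoef(samples[1000:].T))   # off-diagonal ≈ rho after burn-in
```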
Application: Image De-Noising
Consider a generative model of noisy images in which θ represents the pixels of the “pure” image, and Y is the observed “corrupted” image.
Application: Image De-Noising
▸ We can define a prior on θ that encourages neighboring pixels to be similar, and a likelihood on Y that encourages observed pixels to be close to the corresponding ones in the pure image.
▸ Then, we Gibbs sample each pixel $\theta_m$ conditioned on the image and the other θs (a sketch of one such sweep follows below):
$$p(\theta_m \mid \theta_{-m}, Y) \propto p(\theta_m \mid \theta_{-m}) \, p(y_m \mid \theta_m)$$
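One way to make this concrete (a sketch under assumed settings, not necessarily the slides' exact model): binary ±1 pixels with an Ising-style neighbor prior of strength beta and a data-fidelity term of strength eta.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(theta, y, beta=1.0, eta=2.0):
    """One in-place Gibbs pass over a +1/-1 image theta given a noisy image y.
    beta rewards agreeing neighbors; eta rewards agreeing with the data."""
    H, W = theta.shape
    for i in range(H):
        for j in range(W):
            nb = sum(theta[a, b]                       # 4-connected neighbor sum
                     for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= a < H and 0 <= b < W)
            # log p(+1 | rest) - log p(-1 | rest) = 2 (beta * nb + eta * y_ij)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * (beta * nb + eta * y[i, j])))
            theta[i, j] = 1 if rng.random() < p_plus else -1
    return theta
```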
Results from Two Priors
Markov Chain Monte Carlo
▸ MCMC gets its name from the fact that it is a method for generating random samples (the “Monte Carlo” part) and the fact that each sample depends on exactly one previous sample (the “Markov Chain” part).
▸ In general, it involves sampling from $q(\theta^{(s)} \mid \theta^{(s-1)})$ for a suitable “transition kernel” q.
▸ Under certain conditions, the $\theta^{(s)}$ for large s are distributed according to a distribution p(θ), which is determined by q.
▸ Idea: construct a q so that p is the target distribution we want to get samples from.
Markov Chains
▸ Simple Markov chain: two possible states, {rain, sun}, at each time step t.
▸ Suppose q(⋅ ∣ ⋅) is defined by the transition matrix

              rain   sun
      rain     0.7   0.3
      sun      0.1   0.9

  where rows represent the “from” state and columns represent the “to” state. That is, q(rain ∣ rain) = 0.7, q(sun ∣ rain) = 0.3, etc.
Markov Chains
▸ Take q(⋅ ∣ ⋅) defined by the transition matrix Q above.
▸ Question: If it is sunny today, what is the probability that it will be sunny tomorrow? (Read directly off the matrix: q(sun ∣ sun) = 0.9.)
▸ If it is sunny today, what is the probability that it will be rainy two days from now?
Markov Chains
▸ Again take q(⋅ ∣ ⋅) defined by the transition matrix Q above.
▸ If it is sunny today, what is the probability that it will be rainy two days from now?
▸ We have
$$P(\theta_{t+2} = \text{rain} \mid \theta_t = \text{sun}) = \sum_{s} P(\theta_{t+1} = s,\ \theta_{t+2} = \text{rain} \mid \theta_t = \text{sun}) = \sum_{s} q(s \mid \text{sun})\, q(\text{rain} \mid s)$$
$$= (Q^2)_{\text{sun},\,\text{rain}} = (0.1)(0.7) + (0.9)(0.1) = 0.07 + 0.09 = 0.16$$
Markov Chains
▸ Again take q(⋅ ∣ ⋅) defined by the transition matrix Q above.
▸ If it is sunny today, what is the probability that it will be rainy a year from now?
▸ We have the recurrence relation
$$P(\theta_{t+m} \mid \theta_t) = \sum_{s} P(\theta_{t+m-1} = s,\ \theta_{t+m} \mid \theta_t)$$
▸ This is the matrix product of the (m − 1)-step transition matrix with the one-step transition matrix.
▸ So the m-step transition matrix is just $Q^m$.
Simulation of Q∞
▸ What does the “many step” transition matrix look like? (A numpy check follows below.)
▸ Simulating the chain, we find that regardless of the starting state, there is a 1/4 chance of rain and a 3/4 chance of sun on any day far enough in the future.
▸ This is called the stationary distribution of the transition matrix Q.
▸ Mathematically, the row vector π is stationary w.r.t. Q if πQ = π.
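The simulation amounts to a few lines of numpy:

```python
import numpy as np

Q = np.array([[0.7, 0.3],    # rows: from (rain, sun); cols: to (rain, sun)
              [0.1, 0.9]])
print(np.linalg.matrix_power(Q, 2)[1, 0])   # P(rain in 2 days | sun) = 0.16
print(np.linalg.matrix_power(Q, 100))       # both rows ≈ [0.25, 0.75]
pi = np.array([0.25, 0.75])
print(pi @ Q)                                # ≈ pi, so pi is stationary
```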
MCMC In General
▸ Gibbs sampling (like other MCMC algorithms) constructs a transition kernel Q (in Gibbs, subdivided into smaller transitions) so that its stationary distribution is the target distribution.
▸ Hence, after a sufficiently long time horizon, we get samples drawn from the posterior.
Beyond Gibbs: Metropolis-Hastings
Gibbs turns out to be a special case of the Metropolis-Hastings algorithm
The Metropolis-Hastings Algorithm: want samples from p(θ)
1. Initialize $\theta^{(0)}$.
2. For $s = 1, \dots, S$:
   (a) Draw a sample $\tilde{\theta}$ from some proposal density $\tilde{q}(\tilde{\theta} \mid \theta^{(s-1)})$.
   (b) Compute the acceptance probability
$$A = \min\left\{1,\ \frac{p(\tilde{\theta})\, \tilde{q}(\theta^{(s-1)} \mid \tilde{\theta})}{p(\theta^{(s-1)})\, \tilde{q}(\tilde{\theta} \mid \theta^{(s-1)})}\right\}$$
   (c) Sample $u \sim \mathrm{Unif}(0,1)$. If $u \leq A$, set $\theta^{(s)} = \tilde{\theta}$. Otherwise, reject and set $\theta^{(s)} = \theta^{(s-1)}$.
Acceptance Probability Intuition
The acceptance probability is
$$A = \min\left\{1,\ \frac{p(\tilde{\theta})\, \tilde{q}(\theta^{(s-1)} \mid \tilde{\theta})}{p(\theta^{(s-1)})\, \tilde{q}(\tilde{\theta} \mid \theta^{(s-1)})}\right\}$$
Qualitative intuition: tend to reject values that $\tilde{q}$ “over-suggests,” and tend to stay at values that are “good” and hard to get back to later.
Another Special Case: Metropolis Algorithm
Proposal density: a Gaussian random walk with fixed covariance Σ,
$$\tilde{q}(\theta \mid \theta') = \mathcal{N}(\theta \mid \theta', \Sigma)$$
Note that $\tilde{q}(\theta \mid \theta') = \tilde{q}(\theta' \mid \theta)$, so the proposal terms cancel and we accept with probability
$$A = \min\left\{1,\ \frac{p(\tilde{\theta})\, \tilde{q}(\theta^{(s-1)} \mid \tilde{\theta})}{p(\theta^{(s-1)})\, \tilde{q}(\tilde{\theta} \mid \theta^{(s-1)})}\right\} = \min\left\{1,\ \frac{p(\tilde{\theta})}{p(\theta^{(s-1)})}\right\}$$
In other words: accept all “up-hill” steps, allow some “down-hill” steps. (A 1-D sketch follows below.)
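A self-contained one-dimensional sketch (the bimodal target is my example, not from the slides); everything is compared on the log scale, so only the unnormalized density is needed:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(theta):
    """Unnormalized log target: an equal mixture of N(-2, 1) and N(2, 1)."""
    return np.logaddexp(-0.5 * (theta + 2.0) ** 2, -0.5 * (theta - 2.0) ** 2)

def metropolis(log_p, theta0, S, step=1.0):
    theta, out = theta0, np.empty(S)
    for s in range(S):
        prop = theta + step * rng.normal()   # symmetric Gaussian random walk
        # accept with prob min(1, p(prop)/p(theta)), compared on the log scale
        if np.log(rng.random()) <= log_p(prop) - log_p(theta):
            theta = prop                     # up-hill moves always accepted
        out[s] = theta                       # a rejection repeats the current value
    return out

samples = metropolis(log_p, 0.0, 20_000)
print(samples[2000:].mean())   # ≈ 0 for this symmetric target
```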
Gibbs Yields Zero Rejections
In the mth part of a Gibbs iteration, we sample $\theta'_m$ according to
$$\tilde{q}(\theta'_m \mid \theta) = p(\theta'_m \mid \theta_{-m}) = \frac{p(\theta_1, \dots, \theta_{m-1}, \theta'_m, \theta_{m+1}, \dots, \theta_M)}{p(\theta_1, \dots, \theta_{m-1}, \theta_{m+1}, \dots, \theta_M)}$$
The ratio of this proposal and its reverse is just $p(\theta') / p(\theta)$, so the acceptance probability becomes
$$A = \min\left\{1,\ \frac{p(\theta')\, \tilde{q}(\theta \mid \theta')}{p(\theta)\, \tilde{q}(\theta' \mid \theta)}\right\} = \min\left\{1,\ \frac{p(\theta')}{p(\theta)} \cdot \frac{p(\theta)}{p(\theta')}\right\} = 1$$
When will MCMC (i.e., q) converge (to p)?
Some (rough) conditions (not rigorous!¹):
1. Irreducibility: must be able to get anywhere from anywhere; no inescapable state vortexes (attractors) or isolated components of the state space.
2. Aperiodicity: no deterministic cycles.

A sufficient (but not necessary) condition is detailed balance:
$$p(\theta)\, q(\theta' \mid \theta) = p(\theta')\, q(\theta \mid \theta') \quad \text{for all } \theta, \theta'$$
where p(θ) is a stationary distribution. That is, the expected number of jumps between any two states is the same in both directions.

¹In fact, strictly, even these are only sufficient when the state space is finite. See Neal (2003) for a good review.
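For the rain/sun chain from earlier, detailed balance can be checked numerically (and it holds, as it must for any two-state chain with a positive stationary distribution):

```python
import numpy as np

Q = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.25, 0.75])           # stationary distribution from earlier
flow = pi[:, None] * Q                # flow[i, j] = pi(i) * q(j | i)
print(np.allclose(flow, flow.T))      # True: probability flow balances both ways
```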