Markov Chain Monte Carlo (MCMC) Inference
Seung-Hoon Na, Chonbuk National University
Monte Carlo Approximation
- Generate some (unweighted) samples from the posterior
- Use these samples to compute any quantity of interest
– Posterior marginal: p(x1|D)
– Posterior predictive: p(y|D)
– …
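As a minimal sketch of this idea in Python (the "posterior" below is just a stand-in N(1, 0.5²); in practice the samples would come from one of the samplers discussed in these slides):

```python
import random
import statistics

rng = random.Random(0)
# Stand-in for posterior samples: pretend p(x|D) = N(1, 0.5^2).
samples = [rng.gauss(1.0, 0.5) for _ in range(100_000)]

# Any quantity of interest becomes a sample average:
post_mean = statistics.fmean(samples)                       # E[x | D]
prob_positive = sum(s > 0 for s in samples) / len(samples)  # P(x > 0 | D)
print(post_mean, prob_positive)
```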
Sampling from standard distributions
- Using the cdf
– Based on the inverse probability transform
Sampling from standard distributions: Inverse CDF
- Example: Exponential distribution
– cdf: F(x) = 1 − exp(−λx); inverting, F^-1(u) = −log(1 − u)/λ
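A small sketch of the inverse-cdf method for the exponential case (the rate λ and sample count are arbitrary choices for illustration):

```python
import math
import random

def sample_exponential(lam, rng):
    """Inverse-cdf sampling: F(x) = 1 - exp(-lam*x), so F^-1(u) = -log(1-u)/lam."""
    u = rng.random()             # u ~ Uniform(0, 1)
    return -math.log(1.0 - u) / lam

rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(200_000)]
mean = sum(draws) / len(draws)   # should approach 1/lam = 0.5
print(mean)
```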
Sampling from a Gaussian: Box-Muller Method
- Sample z1, z2 ∈ (−1, 1) uniformly
- Discard pairs (z1, z2) that are not inside the unit circle, i.e., keep those satisfying z1^2 + z2^2 ≤ 1
- p(z) = (1/π) I(z inside circle)
- Define x_i = z_i (−2 log r^2 / r^2)^{1/2}, where r^2 = z1^2 + z2^2; then x1, x2 ~ N(0, 1)
Sampling from a Gaussian: Box-Muller Method
- Sampling in polar coordinates
– Sample r in proportion to the normal density → Exponential distribution
– Sample θ uniformly → for a given r, x and y are uniformly distributed
Sampling from a Gaussian: Box-Muller Method
- x ~ N(0, 1), y ~ N(0, 1)
- p(x, y) = (1/2π) exp(−(x^2 + y^2)/2) = (1/2π) exp(−r^2/2)
- r^2 ~ Expon(1/2)
– an Expon(λ) sample: −log(U(0,1))/λ
- r ~ sqrt(−2 log U(0,1))

https://theclevermachine.wordpress.com/2012/09/11/sampling-from-the-normal-distribution-using-the-box-muller-transform/

- Inverting the cdf P(R ≤ r) = 1 − exp(−0.5 r^2) gives P(R^2 ≤ −2 log u) = P(R ≤ sqrt(−2 log u))
- z = r^2
- p(z) = p(r) dr/dz = 0.5 p(r)/r
- p(r) = 2r p(z) = r exp(−r^2/2)
Sampling from a Gaussian: Box-Muller Method
- 1. Draw u1, u2 ~ U(0, 1)
- 2. Transform to polar representation:
– r = sqrt(−2 log u1), θ = 2π u2
- 3. Transform to Cartesian representation:
– x = r cos θ, y = r sin θ
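Steps 1–3 above can be sketched directly (a hedged example; variable names and sample counts are illustrative):

```python
import math
import random

def box_muller(rng):
    """One Box-Muller draw: returns two independent N(0,1) samples."""
    u1, u2 = rng.random(), rng.random()
    r = math.sqrt(-2.0 * math.log(1.0 - u1))  # radius; 1-u1 in (0,1] avoids log(0)
    theta = 2.0 * math.pi * u2                # angle, uniform on [0, 2*pi)
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)
xs = [box_muller(rng)[0] for _ in range(100_000)]
m = sum(xs) / len(xs)                # should be near 0
v = sum(x * x for x in xs) / len(xs) # should be near 1
print(m, v)
```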
Rejection sampling
- when the inverse cdf method cannot be used
- Create a proposal distribution q(x) such that M q(x) provides an upper envelope for the unnormalized target p̃(x)
- Sample x ~ q(x)
- Sample u ~ U(0, 1)
- If u > p̃(x)/(M q(x)), reject the sample
- Otherwise accept it
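The accept/reject loop above as a sketch (the Beta(2,2)-style target and uniform proposal are toy choices, not from the slides):

```python
import random

def rejection_sample(p_tilde, q_sample, q_pdf, M, rng):
    """Rejection sampling: propose x ~ q, accept with prob p_tilde(x)/(M*q(x))."""
    while True:
        x = q_sample(rng)      # x ~ q(x)
        u = rng.random()       # u ~ U(0, 1)
        if u <= p_tilde(x) / (M * q_pdf(x)):
            return x           # accept; otherwise loop (reject)

rng = random.Random(0)
# Toy target p(x) ∝ x(1-x) on [0,1] (a Beta(2,2)); proposal Uniform(0,1).
# The max of x(1-x) is 0.25, so M = 0.25 makes M*q(x) an upper envelope.
draws = [rejection_sample(lambda x: x * (1.0 - x),
                          lambda r: r.random(),
                          lambda x: 1.0,
                          0.25, rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)   # Beta(2,2) has mean 0.5
print(mean)
```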
Rejection sampling
Rejection sampling: Proof
- Why does rejection sampling work?
- Let S = {(x, u) : u ≤ p̃(x)/(M q(x))}, the acceptance region
- The cdf of an accepted point then equals the cdf of the target p(x)
Rejection sampling
- How efficient is this method?
- P(accept) = ∫ q(x) ∫ I((x, u) ∈ S) du dx = ∫ (p̃(x)/(M q(x))) q(x) dx = (1/M) ∫ p̃(x) dx
→ We need to choose M as small as possible while still satisfying M q(x) ≥ p̃(x)
Rejection sampling: Example
- Suppose we want to sample from a Gamma distribution
- When the shape α is an integer, i.e., α = k, we can sample k Expon(λ) variables and add them up
- But for non-integer α, we cannot use this trick and instead use rejection sampling
Rejection sampling: Example
- Use as the proposal a density q(x) from which we can easily sample
- To obtain M as small as possible, check the ratio p(x)/q(x):
– p(x)/q(x) ≤ M ⇔ p(x)/(M q(x)) ≤ 1
- This ratio attains its maximum at a particular x; set M to that maximum value
Rejection sampling: Example
- Proposal
Rejection Sampling: Application to Bayesian Statistics
- Suppose we want to draw (unweighted) samples from the posterior p(θ|D)
- Use rejection sampling
– Target distribution: the unnormalized posterior p(D|θ) p(θ)
– Proposal: the prior p(θ)
– M: p(D|θ_MLE)
- Acceptance probability: p(D|θ)/p(D|θ_MLE)
Adaptive rejection sampling
- Upper bound the log density with a piecewise
linear function
Importance Sampling
- MC methods for approximating integrals of the
form:
- The idea: draw samples x in regions which have high probability p(x), but also where |f(x)| is large
- Define I = E[f(X)]
- In some cases it is better to sample from a proposal q(x) than from the target p(x) itself
Importance Sampling
- Use samples from any proposal q(x) to estimate the integral:
– Î = (1/S) Σ_s w_s f(x^s), with importance weights w_s = p(x^s)/q(x^s)
- How should we choose the proposal?
– Minimize the variance of the estimate Î
Importance Sampling
- By Jensen's inequality, E[u^2(x)] ≥ (E[u(x)])^2
- Setting u(x) = p(x)|f(x)|/q(x) = w(x)|f(x)|, we obtain a lower bound on the variance
- Equality holds when u(x) = p(x)|f(x)|/q(x) is constant, i.e., for the optimal proposal q*(x) ∝ p(x)|f(x)|
Importance Sampling: Handling unnormalized distributions
- What if only the unnormalized target and proposal are available, without the normalization constants Z_p, Z_q?
- Use the same set of samples to evaluate Z_p/Z_q and to normalize the unnormalized importance weights p̃(x^s)/q(x^s)
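A sketch of self-normalized importance sampling with unnormalized weights (the Gaussian target/proposal pair is a made-up toy check, not from the slides):

```python
import math
import random

def snis_expectation(f, p_tilde, q_sample, q_pdf, n, rng):
    """Self-normalized importance sampling: estimate E_p[f] using only the
    unnormalized target p_tilde; the weights are normalized by their sum."""
    xs = [q_sample(rng) for _ in range(n)]
    ws = [p_tilde(x) / q_pdf(x) for x in xs]   # unnormalized weights
    return sum(w * f(x) for w, x in zip(ws, xs)) / sum(ws)

rng = random.Random(0)
# Toy check: target p ∝ exp(-x^2/2) (an unnormalized N(0,1)); proposal N(1, 2^2).
q_pdf = lambda x: math.exp(-((x - 1.0) ** 2) / 8.0) / math.sqrt(8.0 * math.pi)
est = snis_expectation(lambda x: x,
                       lambda x: math.exp(-x * x / 2.0),
                       lambda r: r.gauss(1.0, 2.0),
                       q_pdf, 200_000, rng)
print(est)   # E_p[x] = 0
```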
Ancestral sampling for PGM
- Ancestral sampling
– Sample the root nodes, then sample their children, then their children, and so on
– This is okay when we have no evidence
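A toy ancestral-sampling sketch on a hypothetical three-node chain (all CPT numbers are invented for illustration):

```python
import random

def ancestral_sample(rng):
    """Sample nodes in topological order, each conditional on its sampled parents.
    Hypothetical chain Cloudy -> Rain -> WetGrass; all CPT numbers are made up."""
    cloudy = rng.random() < 0.5
    rain = rng.random() < (0.8 if cloudy else 0.1)
    wet = rng.random() < (0.9 if rain else 0.2)
    return cloudy, rain, wet

rng = random.Random(0)
n = 100_000
p_wet = sum(ancestral_sample(rng)[2] for _ in range(n)) / n
# Exact: P(rain) = 0.5*0.8 + 0.5*0.1 = 0.45; P(wet) = 0.45*0.9 + 0.55*0.2 = 0.515
print(p_wet)
```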
Ancestral sampling
Ancestral sampling: Example
Rejection sampling for PGM
- Now suppose that we have some evidence and are interested in conditional queries:
- Rejection sampling (local sampling)
– Perform ancestral sampling,
– but as soon as we sample a value that is inconsistent with an observed value, reject the whole sample and start again
- However, rejection sampling is very inefficient (it requires very many samples) and cannot be applied to real-valued evidence
Importance Sampling for DGM: Likelihood weighting
- Likelihood weighting
– Sample unobserved variables as before, conditional on their parents
– Don't sample observed variables; instead, just use their observed values
– This is equivalent to using a proposal that clamps the observed variables to their observed values
– The corresponding importance weight: w = Π_{t∈E} p(x_t | pa(x_t)), where E is the set of observed nodes
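Likelihood weighting on the same kind of toy chain (hypothetical CPTs; WetGrass is the observed node):

```python
import random

def lw_sample(rng):
    """One likelihood-weighting draw for the hypothetical chain
    Cloudy -> Rain -> WetGrass with WetGrass observed True (CPTs made up):
    sample the unobserved nodes given their parents, but do NOT sample the
    observed node; the weight is p(wet=True | rain)."""
    cloudy = rng.random() < 0.5
    rain = rng.random() < (0.8 if cloudy else 0.1)
    weight = 0.9 if rain else 0.2
    return rain, weight

rng = random.Random(0)
num = den = 0.0
for _ in range(200_000):
    rain, w = lw_sample(rng)
    num += w * rain
    den += w
est = num / den          # P(rain | wet=True); exact value is 0.405/0.515
print(est)
```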
Likelihood weighting
Sampling importance resampling (SIR)
- Draw unweighted samples by first using importance sampling
- Then sample with replacement, where the probability that we pick x^s is its normalized weight w_s
Sampling importance resampling (SIR)
- Application: Bayesian inference
– Goal: draw samples from the posterior p(θ|D)
– Unnormalized posterior: p(D|θ) p(θ)
– Proposal: q(θ)
– Normalized weights: w_s ∝ p(D|θ^s) p(θ^s)/q(θ^s)
– Then we use SIR to sample from the posterior
- Typically S' ≪ S (far fewer resampled draws than original samples)
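The resampling step can be sketched with the standard library's weighted choice (the two-point example is only there to check the resampled frequencies):

```python
import random

def sir_resample(samples, weights, n_out, rng):
    """SIR resampling step: draw n_out samples with replacement, picking
    samples[i] with probability proportional to weights[i]."""
    return rng.choices(samples, weights=weights, k=n_out)

rng = random.Random(0)
# Two weighted "posterior samples": value 1.0 carries 3x the weight of 0.0,
# so after resampling it should appear about 75% of the time.
resampled = sir_resample([0.0, 1.0], [1.0, 3.0], n_out=100_000, rng=rng)
mean = sum(resampled) / len(resampled)
print(mean)
```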
Particle Filtering
- Simulation-based algorithm for recursive Bayesian inference
– Sequential importance sampling → resampling
Markov Chain Monte Carlo (MCMC)
- 1) Construct a Markov chain on the state space X
– whose stationary distribution is the target density p*(x) of interest
- 2) Perform a random walk on the state space
– in such a way that the fraction of time we spend in each state x is proportional to p*(x)
- 3) By drawing (correlated!) samples x_0, x_1, x_2, …, from the chain, perform Monte Carlo integration wrt p*
Markov Chain Monte Carlo (MCMC) vs. Variational inference
- Variational inference
– (1) for small to medium problems, it is usually faster;
– (2) it is deterministic;
– (3) it is easy to determine when to stop;
– (4) it often provides a lower bound on the log likelihood.
Markov Chain Monte Carlo (MCMC) vs. Variational inference
- MCMC
– (1) it is often easier to implement;
– (2) it is applicable to a broader range of models, such as models whose size or structure changes depending on the values of certain variables (e.g., as happens in matching problems), or models without nice conjugate priors;
– (3) sampling can be faster than variational methods when applied to really huge models or datasets.
Gibbs Sampling
- Sample each variable in turn, conditioned on the values of all the other variables in the distribution
- For example, if we have D = 3 variables:
– x1^{s+1} ~ p(x1 | x2^s, x3^s)
– x2^{s+1} ~ p(x2 | x1^{s+1}, x3^s)
– x3^{s+1} ~ p(x3 | x1^{s+1}, x2^{s+1})
- Need to derive the full conditional for each variable i
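A sketch of one Gibbs sweep for a toy bivariate Gaussian, where both full conditionals are known in closed form (ρ, the burn-in, and sample counts are arbitrary choices):

```python
import math
import random

def gibbs_bivariate_gaussian(rho, n_samples, rng, burn_in=1000):
    """Gibbs sampler for a zero-mean bivariate Gaussian with correlation rho.
    Full conditionals: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2."""
    sd = math.sqrt(1.0 - rho * rho)
    x1 = x2 = 0.0
    out = []
    for s in range(n_samples + burn_in):
        x1 = rng.gauss(rho * x2, sd)   # sample x1 | x2
        x2 = rng.gauss(rho * x1, sd)   # sample x2 | x1 (using the new x1)
        if s >= burn_in:
            out.append((x1, x2))
    return out

rng = random.Random(0)
samples = gibbs_bivariate_gaussian(0.8, 50_000, rng)
corr = sum(a * b for a, b in samples) / len(samples)   # E[x1*x2] = rho
print(corr)
```

Note that consecutive draws are correlated, so the effective sample size is smaller than the raw count.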
Gibbs Sampling: Ising model
- Full conditional: p(x_t = +1 | x_{−t}, θ) = sigm(2 J η_t), where η_t = Σ_{s∈nbr(t)} x_s and sigm is the sigmoid function
Gibbs Sampling: Ising model
- Combine an Ising prior with a local evidence term ψ_t
Gibbs Sampling: Ising model
- Ising prior with W_st = J = 1
– Gaussian noise model with σ = 2
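A sketch of the Ising-denoising Gibbs sweep (grid size and constants are illustrative; the 2y/σ² log-odds term follows from the Gaussian likelihood with means ±1):

```python
import math
import random

def gibbs_ising_denoise(y, J, sigma, n_sweeps, rng):
    """Gibbs sampling for an Ising prior (coupling J) with Gaussian noise
    y = x + N(0, sigma^2). Full conditional:
    p(x_t = +1 | neighbours, y_t) = sigmoid(2*J*eta_t + logodds_t)."""
    H, W = len(y), len(y[0])
    x = [[1 if y[i][j] > 0 else -1 for j in range(W)] for i in range(H)]
    for _ in range(n_sweeps):
        for i in range(H):
            for j in range(W):
                eta = sum(x[a][b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                          if 0 <= a < H and 0 <= b < W)
                # log N(y|+1,s^2) - log N(y|-1,s^2) simplifies to 2*y/s^2
                logodds = 2.0 * y[i][j] / sigma ** 2
                p_plus = 1.0 / (1.0 + math.exp(-(2.0 * J * eta + logodds)))
                x[i][j] = 1 if rng.random() < p_plus else -1
    return x

rng = random.Random(0)
# Noisy observation of an all-(+1) 10x10 image, sigma = 2 as on the slide.
y = [[1.0 + rng.gauss(0.0, 2.0) for _ in range(10)] for _ in range(10)]
x = gibbs_ising_denoise(y, J=1.0, sigma=2.0, n_sweeps=20, rng=rng)
frac_plus = sum(v == 1 for row in x for v in row) / 100.0
print(frac_plus)   # the smoothing prior should recover mostly +1
```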
Gibbs Sampling: Ising model
Gaussian Mixture Model (GMM)
- Likelihood function
- Factored conjugate prior
Gaussian Mixture Model (GMM): Variational EM
- Standard VB approximation to the posterior:
- Mean field approximation
- VBEM results in the optimal form of q(z, θ):
[Ref] Gaussian Models
- Marginals and conditionals of a Gaussian model
- x ~ N(μ, Σ)
- Marginals:
- Posterior conditionals:
[Ref] Gaussian Models
- Linear Gaussian systems
- The posterior:
- The normalization constant:
[Ref] Gaussian Models: Posterior distribution of μ
- The likelihood wrt μ:
- The prior:
- The posterior:
[Ref] Gaussian Models: Posterior distribution of Σ
- The likelihood wrt Σ:
- The conjugate prior: the inverse Wishart distribution
- The posterior:
Gaussian Mixture Model (GMM): Gibbs Sampling
- Full joint distribution:
Gaussian Mixture Model (GMM): Gibbs Sampling
- The full conditionals:
Gaussian Mixture Model (GMM): Gibbs Sampling
Label switching problem
- Unidentifiability
– The model parameters θ and the indicator functions z are unidentifiable
– We can arbitrarily permute the hidden labels without affecting the likelihood
- Monte Carlo average of the samples for the clusters:
– Samples for cluster 1 may be used as samples for cluster 2, and vice versa
- Label switching problem
– If we could average over all modes, we would find E[μ_k|D] is the same for all k
Collapsed Gibbs sampling
- Analytically integrate out some of the unknown
quantities, and just sample the rest.
- Suppose we sample z and integrate out θ
- Thus the θ parameters do not participate in the Markov chain
- Consequently we can draw conditionally independent samples → this will have much lower variance than samples drawn from the joint state space
Rao-Blackwellisation
Collapsed Gibbs sampling
- Theorem 24.2.1 (Rao-Blackwell)
– Let z and θ be dependent random variables, and f(z, θ) be some scalar function. Then var[E_θ[f(z, θ) | z]] ≤ var[f(z, θ)]
→ The variance of the estimate created by analytically integrating out θ will always be lower (or rather, will never be higher) than the variance of a direct MC estimate.
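A small numerical illustration of the theorem on a made-up two-component model (z ~ Bernoulli(0.5), θ|z ~ N(μ_z, 1) with μ_z = ±1): estimate E[θ] either from direct θ draws or from the analytic conditional means E[θ|z].

```python
import random
import statistics

def estimates(rng, n=200):
    """One replication: estimate E[theta] (= 0) from n draws, two ways."""
    zs = [rng.random() < 0.5 for _ in range(n)]
    mus = [1.0 if z else -1.0 for z in zs]
    direct = statistics.fmean(rng.gauss(mu, 1.0) for mu in mus)  # average of theta draws
    rao_bw = statistics.fmean(mus)                               # average of E[theta | z]
    return direct, rao_bw

rng = random.Random(0)
reps = [estimates(rng) for _ in range(2000)]
var_direct = statistics.pvariance([d for d, _ in reps])
var_rb = statistics.pvariance([r for _, r in reps])
print(var_direct, var_rb)   # the Rao-Blackwellised estimator has lower variance
```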
Collapsed Gibbs sampling
After integrating out the parameters.
Collapsed Gibbs: GMM
- Analytically integrate out the model parameters μ_k, Σ_k and π, and just sample the indicators z
– Once we integrate out π, all the z_i nodes become inter-dependent
– Once we integrate out θ_k, all the x_i nodes become inter-dependent
- Full conditionals:
Collapsed Gibbs: GMM
- Suppose a symmetric prior of the form π = Dir(α), with α_k = α/K
- Integrating out π, using the Dirichlet-multinomial formula:
– p(z_i = k | z_{−i}, α) ∝ N_{k,−i} + α/K, where N_{k,−i} counts the points other than i assigned to cluster k
– Σ_{k=1}^K N_{k,−i} = N − 1
[ref] Dirichlet-multinomial
- The marginal likelihood for the Dirichlet-multinoulli model:
Collapsed Gibbs: GMM
All the data assigned to cluster k except for x_i
- To compute this quantity, we remove x_i's statistics from its current cluster (namely cluster z_i), and then evaluate x_i under each cluster's posterior predictive. Once we have picked a new cluster, we add x_i's statistics to this new cluster.
Collapsed Gibbs: GMM
- Collapsed Gibbs sampler for a mixture model
Collapsed Gibbs vs. Vanilla Gibbs
A mixture of K = 4 two-dimensional Gaussians applied to N = 300 data points, with 20 different random initializations
Collapsed Gibbs vs. Vanilla Gibbs
Log probability averaged over 100 different random initializations. Solid line: the median; thick dashed: the 0.25 and 0.75 quantiles; thin dashed: the 0.05 and 0.95 quantiles.
Metropolis Hastings algorithm
- At each step, propose to move from the current state x to a new state x' with probability q(x'|x)
– q: the proposal distribution
– E.g., the random walk Metropolis algorithm, the independence sampler
Metropolis Hastings algorithm
- The acceptance probability if the proposal is symmetric: r = min(1, p*(x')/p*(x))
- The acceptance probability if the proposal is asymmetric: r = min(1, (p*(x') q(x|x')) / (p*(x) q(x'|x)))
– Need the Hastings correction
– p* is the given target distribution
Metropolis Hastings algorithm
- When evaluating α, we only need to know the unnormalized density
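A sketch of random-walk MH using only an unnormalized log density (the N(3, 1) target and step size are toy choices):

```python
import math
import random

def metropolis_hastings(log_p_tilde, x0, step, n_samples, rng, burn_in=2000):
    """Random-walk Metropolis: the Gaussian proposal is symmetric, so the
    acceptance probability reduces to min(1, p_tilde(x')/p_tilde(x));
    only the unnormalized density is needed."""
    x, lp = x0, log_p_tilde(x0)
    out = []
    for s in range(n_samples + burn_in):
        x_prop = x + rng.gauss(0.0, step)            # propose x' ~ N(x, step^2)
        lp_prop = log_p_tilde(x_prop)
        if math.log(rng.random() + 1e-300) < lp_prop - lp:  # guard against log(0)
            x, lp = x_prop, lp_prop                  # accept; else stay at x
        if s >= burn_in:
            out.append(x)
    return out

rng = random.Random(0)
# Toy target: unnormalized N(3, 1), i.e. log p_tilde(x) = -(x - 3)^2 / 2.
chain = metropolis_hastings(lambda x: -((x - 3.0) ** 2) / 2.0, 0.0, 1.0, 50_000, rng)
m = sum(chain) / len(chain)
print(m)
```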
Metropolis Hastings algorithm
- Gibbs sampling is a special case of MH, using the proposal q(x'|x) = p(x'_i | x_{−i}) I(x'_{−i} = x_{−i})
- Then the acceptance probability is 100%
Metropolis Hastings: Example
An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians
MH sampling results using a Gaussian proposal with variance v = 1
Metropolis Hastings: Example
An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians
MH sampling results using a Gaussian proposal with variance v = 500
Metropolis Hastings: Example
An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians
MH sampling results using a Gaussian proposal with variance v = 8
Metropolis Hastings: Gaussian Proposals
- 1) an independence proposal
- 2) a random walk proposal
MH for binary logistic regression
Metropolis Hastings: Example
- MH for binary logistic regression
Joint posterior of the parameters
Metropolis Hastings: Example
- MH for binary logistic regression
– Initialize the chain at the mode, computed using IRLS
– Use the random walk Metropolis sampler
Metropolis Hastings: Example
- MH for binary logistic regression