SLIDE 1

Markov Chain Monte Carlo (MCMC) Inference

Seung-Hoon Na, Chonbuk National University

SLIDE 2

Monte Carlo Approximation

  • Generate some (unweighted) samples from the posterior: x^s ∼ p(x | D)
  • Use these samples to compute any quantity of interest
    – Posterior marginal: p(x1 | D)
    – Posterior predictive: p(y | D)
    – …
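
As a concrete illustration (not from the slides; the Gaussian draws below merely stand in for real posterior samples), a minimal Python sketch of this recipe:

import numpy as np

# A minimal sketch (hypothetical numbers): pretend `samples` are S draws from
# the posterior p(x | D); posterior quantities become empirical averages.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=10_000)   # stand-in for posterior draws

posterior_mean = samples.mean()            # approximates E[x | D]
tail_prob = (samples > 2.0).mean()         # approximates P(x > 2 | D)
print(posterior_mean, tail_prob)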

SLIDE 3

Sampling from standard distributions

  • Using the cdf

– Based on the inverse probability transform

SLIDE 4

Sampling from standard distributions: Inverse CDF

SLIDE 5

Sampling from standard distributions: Inverse CDF

  • Example: Exponential distribution
    – cdf: F(x) = 1 βˆ’ exp(βˆ’Ξ»x), so the inverse transform is x = F⁻¹(u) = βˆ’ln(1 βˆ’ u)/Ξ»
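
A minimal Python sketch of the inverse-CDF method for this exponential example (the rate Ξ» = 2 is an arbitrary choice for illustration):

import numpy as np

# Inverse-CDF sampling for Exponential(lam):
# if u ~ U(0,1), then x = -ln(1 - u)/lam has cdf F(x) = 1 - exp(-lam*x).
rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=100_000)
x = -np.log(1.0 - u) / lam

print(x.mean())   # should be close to 1/lam = 0.5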

SLIDE 6

Sampling from a Gaussian: Box-Muller Method

  • Sample z1, z2 ∈ (βˆ’1, 1) uniformly
  • Discard pairs (z1, z2) that do not lie inside the unit circle, i.e., keep those satisfying z1² + z2² ≀ 1
  • The accepted points are uniform inside the circle: p(z) = (1/Ο€) I(z inside circle)
  • Define xi = zi √(βˆ’2 ln r² / r²), where r² = z1² + z2²; then x1, x2 are two independent N(0,1) samples

SLIDE 7

Sampling from a Gaussian: Box-Muller Method

  • Sampling in polar coordinates
    – Sample r so that it follows the radial density implied by the normal distribution (an exponential distribution in r²)
    – Sample ΞΈ uniformly; for a given r, the points (x, y) are then uniformly distributed on the circle of radius r

SLIDE 8

Sampling from a Gaussian: Box-Muller Method

  • π‘Œ ~ 𝑂 0,1

𝑍 ~ 𝑂 0,1

  • π‘ž 𝑦, 𝑧 =

1 2𝜌 exp βˆ’ 𝑦2+𝑧2 2

= 1

2𝜌 exp βˆ’ 𝑠2 2

  • 𝑠2~πΉπ‘¦π‘ž

1 2

  • πΉπ‘¦π‘ž πœ‡ =

βˆ’ log(𝑉 0,1 ) πœ‡

  • 𝑠 ∼

βˆ’2 log(𝑉 0,1 )

https://theclevermachine.wordpress.com/2012/09/11/sampling-from-the-normal- distribution-using-the-box-muller-transform/

𝑄 𝑉 ≀ 1 βˆ’ exp(βˆ’0.5𝑠2) =𝑄 𝑠2 ≀ βˆ’2π‘šπ‘π‘•π‘‰ = 𝑄 𝑠 ≀ βˆ’2π‘šπ‘π‘•π‘‰

  • 𝑨 = 𝑠2
  • 𝑄 𝑨 = 𝑄 𝑠

𝑒𝑠 𝑒𝑨 = 0.5 𝑄 𝑠

  • 𝑄 𝑠 = 2𝑄 𝑨 = π‘“π‘¦π‘ž βˆ’

𝑠2 2

slide-9
SLIDE 9

Sampling from a Gaussian: Box-Muller Method

  • 1. Draw u1, u2 ∼ U(0,1)
  • 2. Transform to polar representation:
    – r = √(βˆ’2 log u1),  ΞΈ = 2Ο€ u2
  • 3. Transform to Cartesian representation (see the sketch below):
    – x = r cos ΞΈ,  y = r sin ΞΈ
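
A minimal Python sketch of these three steps (the sample size is arbitrary):

import numpy as np

# Box-Muller transform: two uniforms in, two independent standard normals out.
rng = np.random.default_rng(0)
n = 100_000
u1, u2 = rng.uniform(size=n), rng.uniform(size=n)

r = np.sqrt(-2.0 * np.log(u1))      # radius: r^2 ~ Exp(1/2)
theta = 2.0 * np.pi * u2            # angle: uniform on [0, 2*pi)

x = r * np.cos(theta)               # two independent N(0,1) samples
y = r * np.sin(theta)
print(x.mean(), x.std(), y.std())   # approximately 0, 1, 1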

SLIDE 10

Rejection sampling

  • Used when the inverse cdf method cannot be applied
  • Create a proposal distribution q(x) we can sample from, which satisfies M q(x) β‰₯ pΜƒ(x)
    – M q(x) provides an upper envelope for the unnormalized target pΜƒ(x)
  • Sample x ∼ q(x)
  • Sample u ∼ U(0,1)
  • If u > pΜƒ(x) / (M q(x)), reject the sample
  • Otherwise accept it (a small sketch follows below)
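
A minimal Python sketch of this accept/reject loop, using an illustrative unnormalized target pΜƒ(x) = exp(βˆ’x⁴) and a N(0,1) proposal (both choices are assumptions made only for this example; M = 2.7 is a valid envelope constant for this particular pair):

import numpy as np

# Rejection sampling: draw from q, accept with probability p̃(x)/(M q(x)).
rng = np.random.default_rng(0)

def p_tilde(x):
    return np.exp(-x**4)                          # unnormalized target

def q_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # N(0,1) density

M = 2.7                                           # max of p̃/q is about 2.67 here
accepted = []
while len(accepted) < 10_000:
    x = rng.normal()                              # x ~ q
    u = rng.uniform()                             # u ~ U(0,1)
    if u <= p_tilde(x) / (M * q_pdf(x)):          # accept region
        accepted.append(x)

print(np.mean(accepted), np.var(accepted))        # samples now follow p(x) ∝ exp(-x^4)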
SLIDE 11

Rejection sampling

SLIDE 12

Rejection sampling: Proof

  • Why does rejection sampling work?
  • Let S be the set of accepted pairs (x, u), i.e., those with u ≀ pΜƒ(x) / (M q(x))
  • The cdf of the accepted points equals the cdf of the target p(x)

SLIDE 13

Rejection sampling

  • How efficient is this method?
  • P(accept) = ∫ q(x) ∫ I((x, u) ∈ S) du dx = ∫ (pΜƒ(x) / (M q(x))) q(x) dx = (1/M) ∫ pΜƒ(x) dx
    – using P(u ≀ t) = t for u ∼ U(0,1), applied with t = pΜƒ(x) / (M q(x)) ≀ 1
  • We need to choose M as small as possible while still satisfying M q(x) β‰₯ pΜƒ(x)

SLIDE 14

Rejection sampling: Example

  • Suppose we want to sample from a Gamma distribution
  • When the shape parameter is an integer, i.e., Ξ± = k, we can use the fact that a Gamma(k, Ξ») variable is the sum of k iid Exp(Ξ») variables
  • But for non-integer Ξ±, we cannot use this trick, and instead use rejection sampling

SLIDE 15

Rejection sampling: Example

  • Use a distribution we can sample from directly as the proposal q(x), where M q(x) β‰₯ p(x)
  • To obtain M as small as possible, examine the ratio p(x)/q(x):
  • This ratio attains its maximum at some point xβˆ—; setting M = p(xβˆ—)/q(xβˆ—) ensures p(x)/q(x) ≀ M, i.e., p(x)/(M q(x)) ≀ 1

SLIDE 16

Rejection sampling: Example

  • Proposal
SLIDE 17

Rejection Sampling: Application to Bayesian Statistics

  • Suppose we want to draw (unweighted) samples from the posterior
  • Use rejection sampling
    – Target distribution:
    – Proposal:
    – M:
  • Acceptance probability:
SLIDE 18

Adaptive rejection sampling

  • Upper bound the log density with a piecewise linear function

SLIDE 19

Importance Sampling

  • MC methods for approximating integrals of the form I = E[f(x)] = ∫ f(x) p(x) dx
  • The idea: draw samples x in regions which have high probability, p(x), but also where |f(x)| is large
  • E.g., define f(x) = I(x ∈ E) to estimate the probability of a rare event E
  • It may be easier to sample from a proposal q(x) than to sample from p(x) itself

SLIDE 20

Importance Sampling

  • Sample from any proposal q(x) to estimate the integral:
    – I = ∫ f(x) (p(x)/q(x)) q(x) dx β‰ˆ (1/S) Ξ£_s w_s f(x^s), where w_s = p(x^s)/q(x^s) are the importance weights (see the sketch below)
  • How should we choose the proposal?
    – Minimize the variance of the estimate Î
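
A minimal Python sketch under an assumed toy setup (target p = N(0,1), rare-event function f(x) = I(x > 3), proposal q = N(4,1); none of these choices come from the slides):

import numpy as np

# Importance sampling: sample from q, reweight by p/q.
rng = np.random.default_rng(0)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

S = 100_000
xs = rng.normal(4.0, 1.0, size=S)                        # x^s ~ q
w = normal_pdf(xs, 0.0, 1.0) / normal_pdf(xs, 4.0, 1.0)  # importance weights p/q
f = (xs > 3.0).astype(float)

print(np.mean(w * f))    # approx P(x > 3) under N(0,1), about 1.35e-3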

SLIDE 21

Importance Sampling

  • By Jensen’s inequality, we have E[u²(x)] β‰₯ (E[u(x)])²
  • Setting u(x) = p(x)|f(x)| / q(x) = w(x)|f(x)|, we get the lower bound E_q[w(x)² f(x)²] β‰₯ (∫ p(x)|f(x)| dx)²
  • Equality holds when u(x) = p(x)|f(x)| / q(x) is constant, i.e., for the optimal proposal qβˆ—(x) ∝ p(x)|f(x)|

SLIDE 22

Importance Sampling: Handling unnormalized distributions

  • What if only the unnormalized target pΜƒ(x) and proposal qΜƒ(x) are available, without the normalization constants Zp, Zq?
  • Use the same set of samples to evaluate Zp/Zq:
    – w̃_s = pΜƒ(x^s) / qΜƒ(x^s) is the unnormalized importance weight
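
For reference, the resulting self-normalized estimator, filled in here in its standard form rather than recovered from the slide, is:

\frac{Z_p}{Z_q} \approx \frac{1}{S} \sum_{s=1}^{S} \tilde{w}_s,
\qquad
\mathbb{E}[f(\mathbf{x})] \approx \frac{\sum_{s=1}^{S} \tilde{w}_s \, f(\mathbf{x}^s)}{\sum_{s=1}^{S} \tilde{w}_s},
\qquad
\tilde{w}_s = \frac{\tilde{p}(\mathbf{x}^s)}{\tilde{q}(\mathbf{x}^s)}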

SLIDE 23

Ancestral sampling for PGM

  • Ancestral sampling

    – Sample the root nodes,
    – then sample their children,
    – then their children, etc.
    – This is okay when we have no evidence (a small sketch follows below)
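
A minimal Python sketch of ancestral sampling on a hypothetical three-node chain A β†’ B β†’ C with made-up CPTs (the same toy network is reused for likelihood weighting below):

import numpy as np

# Ancestral sampling: sample the root, then each child given its sampled parent.
rng = np.random.default_rng(0)

p_a = 0.6                                   # P(A=1), hypothetical
p_b_given_a = {0: 0.2, 1: 0.7}              # P(B=1 | A), hypothetical
p_c_given_b = {0: 0.1, 1: 0.9}              # P(C=1 | B), hypothetical

def ancestral_sample():
    a = int(rng.random() < p_a)                  # root node
    b = int(rng.random() < p_b_given_a[a])       # child, given the parent
    c = int(rng.random() < p_c_given_b[b])       # grandchild
    return a, b, c

samples = [ancestral_sample() for _ in range(10_000)]
print(np.mean([c for _, _, c in samples]))       # Monte Carlo estimate of P(C=1)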

SLIDE 24

Ancestral sampling

SLIDE 25

Ancestral sampling: Example

SLIDE 26

Rejection sampling for PGM

  • Now, suppose that we have some evidence and are interested in conditional queries:
  • Rejection sampling (local sampling)
    – Perform ancestral sampling,
    – but as soon as we sample a value that is inconsistent with an observed value, reject the whole sample and start again
  • However, rejection sampling is very inefficient (it requires very many samples) and cannot be applied to real-valued evidence

SLIDE 27

Importance Sampling for DGM: Likelihood weighting

  • Likelihood weighting
    – Sample unobserved variables as before, conditional on their parents; but do not sample observed variables; instead, just use their observed values
    – This is equivalent to using a proposal that samples each unobserved node from p(xt | x_pa(t)) and clamps each observed node to its observed value
    – The corresponding importance weight is w = ∏_{t ∈ E} p(xt | x_pa(t)), evaluated at the observed values, where E is the set of observed nodes (a small sketch follows below)
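
A minimal Python sketch of likelihood weighting on the same hypothetical A β†’ B β†’ C chain, with the evidence C = 1 clamped (the CPT numbers are still made up):

import numpy as np

# Likelihood weighting: sample unobserved nodes from their CPDs, clamp the
# evidence node, and weight each sample by p(evidence | sampled parents).
rng = np.random.default_rng(0)

p_a = 0.6
p_b_given_a = {0: 0.2, 1: 0.7}
p_c_given_b = {0: 0.1, 1: 0.9}

def weighted_sample(c_obs=1):
    a = int(rng.random() < p_a)
    b = int(rng.random() < p_b_given_a[a])
    w = p_c_given_b[b] if c_obs == 1 else 1.0 - p_c_given_b[b]   # weight from evidence node
    return a, w

draws = [weighted_sample() for _ in range(50_000)]
weights = np.array([w for _, w in draws])
a_vals = np.array([a for a, _ in draws])
print(np.sum(weights * a_vals) / np.sum(weights))   # estimate of P(A=1 | C=1)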

SLIDE 28

Likelihood weighting

SLIDE 29

Likelihood weighting

SLIDE 30

Sampling importance resampling (SIR)

  • Draw unweighted samples by first using importance sampling to obtain weighted samples
  • Then sample with replacement, where the probability that we pick x^s is its normalized weight w_s (a small sketch follows below)
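
A minimal Python sketch of SIR under an assumed toy setup (target N(0,1), broad proposal N(0,3)); the resampling step is just a weighted draw with replacement:

import numpy as np

# SIR: importance sampling followed by resampling in proportion to the weights.
rng = np.random.default_rng(0)

S = 50_000                                       # number of weighted samples
xs = rng.normal(0.0, 3.0, size=S)                # x^s ~ q (a broad proposal)
w_unnorm = np.exp(-0.5 * xs**2) / np.exp(-0.5 * (xs / 3.0)**2)  # p̃/q̃ up to constants
w = w_unnorm / w_unnorm.sum()                    # normalized weights

S_prime = 5_000                                  # typically S' << S
resampled = rng.choice(xs, size=S_prime, replace=True, p=w)
print(resampled.mean(), resampled.std())         # approximately 0 and 1 (target N(0,1))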

SLIDE 31

Sampling importance resampling (SIR)

  • Application: Bayesian inference
    – Goal: draw samples from the posterior
    – Unnormalized posterior:
    – Proposal:
    – Normalized weights:
    – Then, we use SIR to sample from the posterior
  • Typically S′ << S

SLIDE 32

Particle Filtering

  • Simulation-based algorithm for recursive Bayesian inference
    – Sequential importance sampling with resampling

SLIDE 33

Markov Chain Monte Carlo (MCMC)

  • 1) Construct a Markov chain on the state space X
    – whose stationary distribution is the target density pβˆ—(x) of interest
  • 2) Perform a random walk on the state space
    – in such a way that the fraction of time we spend in each state x is proportional to pβˆ—(x)
  • 3) By drawing (correlated!) samples x0, x1, x2, . . . from the chain, perform Monte Carlo integration wrt pβˆ—

SLIDE 34

Markov Chain Monte Carlo (MCMC) vs. Variational inference

  • Variational inference
    – (1) for small to medium problems, it is usually faster;
    – (2) it is deterministic;
    – (3) it is easy to determine when to stop;
    – (4) it often provides a lower bound on the log likelihood.

SLIDE 35

Markov Chain Monte Carlo (MCMC) vs. Variational inference

  • MCMC
    – (1) it is often easier to implement;
    – (2) it is applicable to a broader range of models, such as models whose size or structure changes depending on the values of certain variables (e.g., as happens in matching problems), or models without nice conjugate priors;
    – (3) sampling can be faster than variational methods when applied to really huge models or datasets.

SLIDE 36

Gibbs Sampling

  • Sample each variable in turn, conditioned on the values of all the other variables in the distribution
  • For example, if we have D = 3 variables:
    – x1^(s+1) ∼ p(x1 | x2^(s), x3^(s))
    – x2^(s+1) ∼ p(x2 | x1^(s+1), x3^(s))
    – x3^(s+1) ∼ p(x3 | x1^(s+1), x2^(s+1))
  • Need to derive the full conditional p(xi | xβˆ’i) for each variable i (a small sketch follows below)
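
A minimal Python sketch of Gibbs sampling on an assumed toy target, a bivariate Gaussian with correlation ρ, whose full conditionals are themselves Gaussian:

import numpy as np

# Gibbs sampling: alternately resample each variable from its full conditional.
rng = np.random.default_rng(0)
rho = 0.8
S = 20_000

x1, x2 = 0.0, 0.0
samples = np.empty((S, 2))
for s in range(S):
    # full conditional p(x1 | x2) = N(rho * x2, 1 - rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    # full conditional p(x2 | x1) = N(rho * x1, 1 - rho^2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[s] = (x1, x2)

print(np.corrcoef(samples[1000:].T)[0, 1])   # roughly rho after burn-in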
SLIDE 37

Gibbs Sampling: Ising model

  • Full conditional
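
The formula itself is not in the extracted text; for an Ising model with states x_t ∈ {βˆ’1, +1} and coupling strength J, the standard full conditional is:

p(x_t = +1 \mid \mathbf{x}_{-t}, \theta)
 = \frac{\exp(J \eta_t)}{\exp(J \eta_t) + \exp(-J \eta_t)}
 = \mathrm{sigm}(2 J \eta_t),
\qquad
\eta_t = \sum_{s \in \mathrm{nbr}(t)} x_s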
SLIDE 38

Gibbs Sampling: Ising model

  • Combine an Ising prior with a local evidence term ψt(xt)
SLIDE 39

Gibbs Sampling: Ising model

  • Ising prior with Wij = J = 1
    – Gaussian noise model with Οƒ = 2

SLIDE 40

Gibbs Sampling: Ising model

SLIDE 41

Gaussian Mixture Model (GMM)

  • Likelihood function
  • Factored conjugate prior
SLIDE 42

Gaussian Mixture Model (GMM): Variational EM

  • Standard VB approximation to the posterior:
  • Mean field approximation
  • VBEM results in the optimal form of q(z, θ):
SLIDE 43

[Ref] Gaussian Models

  • Marginals and conditionals of a Gaussian model
  • x ∼ N(ΞΌ, Ξ£)
  • Marginals:
  • Posterior conditionals:

SLIDE 44

[Ref] Gaussian Models

  • Linear Gaussian systems
  • The posterior:
  • The normalization constant:
SLIDE 45

[Ref] Gaussian Models: Posterior distribution of μ

  • The likelihood wrt μ:
  • The prior:
  • The posterior:
SLIDE 46

[Ref] Gaussian Models: Posterior distribution of Σ

  • The likelihood as a function of Σ:
  • The conjugate prior: the inverse Wishart distribution
  • The posterior:

SLIDE 47

Gaussian Mixture Model (GMM): Gibbs Sampling

  • Full joint distribution:
SLIDE 48

Gaussian Mixture Model (GMM): Gibbs Sampling

  • The full conditionals:
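
The formulas are not reproduced in the extracted text; assuming the conjugate priors of the preceding slides, the standard full conditionals have the form:

p(z_i = k \mid x_i, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi}) \propto \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

p(\boldsymbol{\pi} \mid \mathbf{z}) = \mathrm{Dir}(\alpha_1 + N_1, \dots, \alpha_K + N_K),
\qquad
N_k = \sum_i \mathbb{I}(z_i = k)

and each (ΞΌk, Ξ£k) is resampled from its Gaussian / inverse-Wishart conditional computed from the data currently assigned to cluster k, using the updates on the [Ref] Gaussian Models slides.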
SLIDE 49

Gaussian Mixture Model (GMM): Gibbs Sampling

SLIDE 50

Label switching problem

  • Unidentifiability
    – The parameters of the model θ, and the indicator functions z, are unidentifiable
    – We can arbitrarily permute the hidden labels without affecting the likelihood
  • Monte Carlo average of the samples for a cluster:
    – Samples for cluster 1 may be mixed up with samples for cluster 2
  • Label switching problem
    – If we could average over all modes, we would find that E[μk | D] is the same for all k

SLIDE 51

Collapsed Gibbs sampling

  • Analytically integrate out some of the unknown quantities, and just sample the rest
  • Suppose we sample z and integrate out ΞΈ
  • Thus the ΞΈ parameters do not participate in the Markov chain
  • Consequently we can draw conditionally independent samples of ΞΈ given each sampled z; this will have much lower variance than samples drawn from the joint state space
  • This is known as Rao-Blackwellisation

SLIDE 52

Collapsed Gibbs sampling

  • Theorem 24.2.1 (Rao-Blackwell)
    – Let z and θ be dependent random variables, and f(z, θ) be some scalar function. Then:
  • The variance of the estimate created by analytically integrating out θ will always be lower (or rather, will never be higher) than the variance of a direct MC estimate.
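
The inequality the theorem asserts (filled in here via the law of total variance) is:

\mathrm{var}_{z,\theta}\big[ f(z, \theta) \big] \;\ge\; \mathrm{var}_{z}\big[ \mathbb{E}_{\theta}[ f(z, \theta) \mid z ] \big]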

SLIDE 53

Collapsed Gibbs sampling

After integrating out the parameters.

SLIDE 54

Collapsed Gibbs: GMM

  • Analytically integrate out the model parameters μk, Σk and π, and just sample the indicators z
    – Once we integrate out π, all the zi nodes become inter-dependent
    – Once we integrate out θk, all the xi nodes become inter-dependent
  • Full conditionals:
SLIDE 55

Collapsed Gibbs: GMM

  • Suppose a symmetric prior of the form π ∼ Dir(α), with αk = α/K
  • Integrating out π, using the Dirichlet-multinomial formula:
    – p(zi = k | z−i, α) = (Nk,−i + α/K) / (N + α − 1)
    – where Nk,−i is the number of points assigned to cluster k excluding point i, so Σ_{k=1}^{K} Nk,−i = N − 1

SLIDE 56

[Ref] Dirichlet-multinomial

  • The marginal likelihood for the Dirichlet-multinoulli model:
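
For reference, this marginal likelihood has the standard form (filled in here, not recovered from the slide):

p(\mathcal{D} \mid \boldsymbol{\alpha})
 = \frac{B(\mathbf{N} + \boldsymbol{\alpha})}{B(\boldsymbol{\alpha})}
 = \frac{\Gamma(\sum_k \alpha_k)}{\Gamma(N + \sum_k \alpha_k)}
   \prod_k \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}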

SLIDE 57

Collapsed Gibbs: GMM

  • All the data assigned to cluster k except for xi
  • To compute this predictive term, we remove xi’s statistics from its current cluster (namely cluster zi), and then evaluate xi under each cluster’s posterior predictive. Once we have picked a new cluster, we add xi’s statistics to this new cluster

SLIDE 58

Collapsed Gibbs: GMM

  • Collapsed Gibbs sampler for a mixture model
SLIDE 59

Collapsed Gibbs vs. Vanilla Gibbs

A mixture of K = 4 two-dimensional Gaussians applied to N = 300 data points, with 20 different random initializations

SLIDE 60

Collapsed Gibbs vs. Vanilla Gibbs

Log probability averaged over 100 different random initializations. Solid line: the median; thick dashed: the 0.25 and 0.75 quantiles; thin dashed: the 0.05 and 0.95 quantiles.

SLIDE 61

Metropolis Hastings algorithm

  • At each step, propose to move from the current state x to a new state x′ with probability q(x′ | x)
    – q: the proposal distribution
    – E.g., a Gaussian centered on the current state (random walk Metropolis algorithm), or a proposal that ignores the current state (independence sampler)

SLIDE 62

Metropolis Hastings algorithm

  • The acceptance probability if the proposal is symmetric:
  • The acceptance probability if the proposal is asymmetric:
    – Need the Hastings correction
  • Both are defined with respect to the given target distribution pβˆ—
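
The two acceptance probabilities referred to above, written out in their standard MH forms:

\alpha_{\mathrm{symmetric}} = \min\left(1, \frac{p^*(x')}{p^*(x)}\right),
\qquad
\alpha_{\mathrm{asymmetric}} = \min\left(1, \frac{p^*(x') \, q(x \mid x')}{p^*(x) \, q(x' \mid x)}\right)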

SLIDE 63

Metropolis Hastings algorithm

  • When evaluating Ξ±, we only need to know the unnormalized density pΜƒ (a small sketch follows below)
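
A minimal Python sketch of random-walk MH on an assumed toy target (an unnormalized mixture of two 1-D Gaussians; all numbers are illustrative, chosen only to echo the examples that follow):

import numpy as np

# Random-walk Metropolis-Hastings: only the unnormalized target p̃ is needed.
rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized mixture of N(-20, 10^2) and N(20, 10^2), weights 0.3 / 0.7
    return 0.3 * np.exp(-0.5 * (x + 20) ** 2 / 100) + 0.7 * np.exp(-0.5 * (x - 20) ** 2 / 100)

def mh(n_steps=50_000, step_sd=8.0):
    x = 0.0
    chain = np.empty(n_steps)
    for s in range(n_steps):
        x_prop = x + step_sd * rng.normal()          # symmetric Gaussian proposal
        alpha = min(1.0, p_tilde(x_prop) / p_tilde(x))
        if rng.random() < alpha:                     # accept with probability alpha
            x = x_prop
        chain[s] = x
    return chain

chain = mh()
print(chain.mean())   # should approach 0.3*(-20) + 0.7*20 = 8 if the chain mixes well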

SLIDE 64

Metropolis Hastings algorithm

SLIDE 65

Metropolis Hastings algorithm

  • Gibbs sampling is a special case of MH, using the proposal q(x′ | x) = p(x′i | xβˆ’i) I(x′−i = xβˆ’i)
  • Then, the acceptance probability is 100%
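
Filled in for completeness, the standard calculation behind this claim (with x′−i = x−i under the Gibbs proposal):

\alpha
 = \frac{p(x') \, q(x \mid x')}{p(x) \, q(x' \mid x)}
 = \frac{p(x_i' \mid x_{-i}') \, p(x_{-i}') \, p(x_i \mid x_{-i}')}{p(x_i \mid x_{-i}) \, p(x_{-i}) \, p(x_i' \mid x_{-i})}
 = 1

Since x′−i = x−i, every factor in the numerator cancels with one in the denominator, so the proposal is always accepted.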
SLIDE 66

Metropolis Hastings: Example

An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians

MH sampling results using a Gaussian proposal with variance v = 1

SLIDE 67

Metropolis Hastings: Example

An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians

MH sampling results using a Gaussian proposal with variance v = 500

SLIDE 68

Metropolis Hastings: Example

An example of the Metropolis Hastings algorithm for sampling from a mixture of two 1D Gaussians

MH sampling results using a Gaussian proposal with variance v = 8

SLIDE 69

Metropolis Hastings: Gaussian Proposals

  • 1) an independence proposal
  • 2) a random walk proposal

MH for binary logistic regression

SLIDE 70

Metropolis Hastings: Example

  • MH for binary logistic regression

Joint posterior of the parameters

SLIDE 71

Metropolis Hastings: Example

  • MH for binary logistic regression
    – Initialize the chain at the mode, computed using IRLS
    – Use the random walk Metropolis sampler

SLIDE 72

Metropolis Hastings: Example

  • MH for binary logistic regression