Metropolis-Hastings algorithm (STAT 544 lecture slides, Dr. Jarad Niemi, Iowa State University, April 2, 2019)


SLIDE 1

Metropolis-Hastings algorithm

  • Dr. Jarad Niemi

STAT 544 - Iowa State University

April 2, 2019

Jarad Niemi (STAT544@ISU) Metropolis-Hastings April 2, 2019 1 / 32

SLIDE 2

Outline

  • Metropolis-Hastings algorithm
  • Independence proposal
  • Random-walk proposal
  • Optimal tuning parameter
  • Binomial example
  • Normal example
  • Binomial hierarchical example

SLIDE 3

Metropolis-Hastings algorithm

Let p(θ|y) be the target distribution and θ(t) be the current draw from p(θ|y). The Metropolis-Hastings algorithm performs the following

  • 1. propose θ∗ ∼ g(θ|θ(t))
  • 2. accept θ(t+1) = θ∗ with probability min{1, r} where

r = r(θ(t), θ∗) = [p(θ∗|y) / g(θ∗|θ(t))] / [p(θ(t)|y) / g(θ(t)|θ∗)] = [p(θ∗|y) / p(θ(t)|y)] × [g(θ(t)|θ∗) / g(θ∗|θ(t))]

  • 3. otherwise, set θ(t+1) = θ(t).
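The steps above can be sketched in code. Below is a minimal Python sketch (the course's own examples are in R; Python is used here only for illustration), where `log_p`, `g_sample`, and `g_logpdf` are user-supplied stand-ins for log p(θ|y), sampling from g(·|θ), and log g(·|·):

```python
import math
import random

def metropolis_hastings(log_p, g_sample, g_logpdf, theta0, n_iter):
    """Generic Metropolis-Hastings on the log scale.

    log_p(theta): log target density (normalizing constant may be omitted)
    g_sample(theta): draw a proposal from g(.|theta)
    g_logpdf(x, given): log proposal density g(x|given)
    """
    theta = theta0
    draws = []
    for _ in range(n_iter):
        prop = g_sample(theta)
        # log r = log p(prop) - log p(theta) + log g(theta|prop) - log g(prop|theta)
        log_r = (log_p(prop) - log_p(theta)
                 + g_logpdf(theta, prop) - g_logpdf(prop, theta))
        if random.random() < math.exp(min(0.0, log_r)):
            theta = prop              # accept with probability min(1, r)
        draws.append(theta)           # otherwise theta(t+1) = theta(t)
    return draws
```

For example, a N(0, 1) target with a random-walk N(θ(t), 1) proposal produces draws whose mean is near 0 and variance near 1.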

SLIDE 4

Metropolis-Hastings algorithm

Suppose we only know the target up to a normalizing constant, i.e. p(θ|y) = q(θ|y)/q(y) where we only know q(θ|y). The Metropolis-Hastings algorithm performs the following

  • 1. propose θ∗ ∼ g(θ|θ(t))
  • 2. accept θ(t+1) = θ∗ with probability min{1, r} where

r = r(θ(t), θ∗) = [p(θ∗|y) / p(θ(t)|y)] × [g(θ(t)|θ∗) / g(θ∗|θ(t))] = [q(θ∗|y)/q(y)] / [q(θ(t)|y)/q(y)] × [g(θ(t)|θ∗) / g(θ∗|θ(t))] = [q(θ∗|y) / q(θ(t)|y)] × [g(θ(t)|θ∗) / g(θ∗|θ(t))]

  • 3. otherwise, set θ(t+1) = θ(t).

SLIDE 5

Two standard Metropolis-Hastings algorithms

Independent Metropolis-Hastings

Independent proposal, i.e. g(θ|θ(t)) = g(θ)

Random-walk Metropolis

Symmetric proposal, i.e. g(θ|θ(t)) = g(θ(t)|θ) for all θ, θ(t).

SLIDE 6

Independence Metropolis-Hastings

Let p(θ|y) ∝ q(θ|y) be the target distribution, θ(t) be the current draw from p(θ|y), and g(θ|θ(t)) = g(θ), i.e. the proposal is independent of the current value. The independence Metropolis-Hastings algorithm performs the following

  • 1. propose θ∗ ∼ g(θ)
  • 2. accept θ(t+1) = θ∗ with probability min{1, r} where

r = [q(θ∗|y) / g(θ∗)] / [q(θ(t)|y) / g(θ(t))] = [q(θ∗|y) / q(θ(t)|y)] × [g(θ(t)) / g(θ∗)]

  • 3. otherwise, set θ(t+1) = θ(t).

SLIDE 7

Intuition through examples

[Figure: a 3×3 grid of panels showing the proposal and target densities for current ∈ {−1, 0, 1} and proposed ∈ {−1, 0, 1}, marking each current and proposed value and whether the proposal is accepted (TRUE/FALSE).]

SLIDE 8

Example: Normal-Cauchy model

Let Y ∼ N(θ, 1) with θ ∼ Ca(0, 1) such that the posterior is

p(θ|y) ∝ p(y|θ)p(θ) ∝ exp(−(y − θ)²/2) / (1 + θ²).

Use N(y, 1) as the proposal; then the Metropolis-Hastings acceptance probability is min{1, r} with

r = [q(θ∗|y) / q(θ(t)|y)] × [g(θ(t)) / g(θ∗)]
  = [exp(−(y − θ∗)²/2) / (1 + (θ∗)²)] / [exp(−(y − θ(t))²/2) / (1 + (θ(t))²)] × [exp(−(θ(t) − y)²/2) / exp(−(θ∗ − y)²/2)]
  = (1 + (θ(t))²) / (1 + (θ∗)²)
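A short Python sketch of this sampler (the deck's own code is in R; this translation is only for illustration), using the simplified ratio r = (1 + (θ(t))²) / (1 + (θ∗)²):

```python
import random

def indep_mh_normal_cauchy(y, n_iter, seed=1):
    """Independence MH for the Normal-Cauchy posterior with a N(y, 1) proposal.

    The normal terms cancel, so r = (1 + theta^2) / (1 + prop^2).
    """
    rng = random.Random(seed)
    theta = y                       # start at the observed data value
    draws = []
    for _ in range(n_iter):
        prop = rng.gauss(y, 1.0)    # proposal does not depend on the current value
        r = (1.0 + theta ** 2) / (1.0 + prop ** 2)
        if rng.random() < r:        # accept with probability min(1, r)
            theta = prop
        draws.append(theta)
    return draws
```

With y = 1, the mean of the draws lands between 0 and y, the Cauchy prior shrinking the estimate toward 0.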

SLIDE 9

Example: Normal-Cauchy model

[Figure: the N(y, 1) proposal density and the Normal-Cauchy target density plotted together.]

SLIDE 10

Example: Normal-Cauchy model

[Figure: trace plots of θ over 100 iterations of independence Metropolis-Hastings, with a reasonable starting value (top) and a poor starting value (bottom).]

SLIDE 11

Need heavy tails

Recall that rejection sampling requires the proposal to have heavy tails, and that importance sampling is efficient only when the proposal has heavy tails. Independence Metropolis-Hastings also requires a heavy-tailed proposal for efficiency. If θ(t) is in a region where p(θ(t)|y) >> g(θ(t)), i.e. the target has heavier tails than the proposal, then any proposal θ∗ with p(θ∗|y) ≈ g(θ∗), i.e. in the center of the target and proposal, results in

r = [p(θ∗|y) / g(θ∗)] × [g(θ(t)) / p(θ(t)|y)] ≈ 0

and few samples will be accepted.
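The effect is easy to quantify. The following Python check (a sketch; it anticipates the Ca(0, 1) target and N(0, 1) proposal of the next slides) evaluates r when the chain sits at the tail value θ(t) = 4 and the central value θ∗ = 0 is proposed:

```python
import math

def cauchy_pdf(x):
    """Target: Ca(0, 1) density."""
    return 1.0 / (math.pi * (1.0 + x * x))

def normal_pdf(x):
    """Proposal: N(0, 1) density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def accept_ratio(current, proposed):
    # r = [p(prop|y)/g(prop)] * [g(curr)/p(curr|y)] for an independence proposal
    return (cauchy_pdf(proposed) / normal_pdf(proposed)) * \
           (normal_pdf(current) / cauchy_pdf(current))

print(round(accept_ratio(4.0, 0.0), 4))   # -> 0.0057
```

The chain accepts a move back to the center with probability below 1%, so once it reaches the heavy tail of the target it sticks there for many iterations.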

SLIDE 12

Need heavy tails - example

Suppose θ|y ∼ Ca(0, 1) and we use a standard normal as the proposal.

[Figure: left, the Ca(0, 1) target and N(0, 1) proposal densities; right, the ratio target/proposal on a log scale, which grows rapidly in the tails.]

SLIDE 13

Need heavy tails

[Figure: trace plot of θ over 1000 iterations of independence Metropolis-Hastings with the light-tailed N(0, 1) proposal.]

SLIDE 14

Random-walk Metropolis

Let p(θ|y) ∝ q(θ|y) be the target distribution, θ(t) be the current draw from p(θ|y), and g(θ∗|θ(t)) = g(θ(t)|θ∗), i.e. the proposal is symmetric. The Metropolis algorithm performs the following

  • 1. propose θ∗ ∼ g(θ|θ(t))
  • 2. accept θ(t+1) = θ∗ with probability min{1, r} where

r = [q(θ∗|y) / q(θ(t)|y)] × [g(θ(t)|θ∗) / g(θ∗|θ(t))] = q(θ∗|y) / q(θ(t)|y)

  • 3. otherwise, set θ(t+1) = θ(t).

This is also referred to as random-walk Metropolis.

SLIDE 15

Stochastic hill climbing

Notice that r = q(θ∗|y)/q(θ(t)|y), and thus the algorithm will always accept whenever the target density is larger at the proposed value than at the current value. Suppose θ|y ∼ N(0, 1), θ(t) = 1, and θ∗ ∼ N(θ(t), 1).

[Figure: N(0, 1) target density and the N(1, 1) proposal density centered at the current value θ(t) = 1.]
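A quick numerical Python illustration of this hill-climbing behavior with the same N(0, 1) target and current value θ(t) = 1 (the proposed values below are made up for illustration):

```python
import math

def q(theta):
    """Unnormalized N(0, 1) target density."""
    return math.exp(-0.5 * theta * theta)

current = 1.0
for proposed in [0.5, 0.0, 1.5, 2.0]:
    r = q(proposed) / q(current)
    # moves toward the mode have r > 1 and are always accepted;
    # moves away are accepted only with probability r < 1
    print(proposed, round(min(1.0, r), 3))
```

The first two proposals (uphill) print acceptance probability 1.0; the last two (downhill) print probabilities below 1, so downhill moves still happen, just less often.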

SLIDE 16

Example: Normal-Cauchy model

Let Y ∼ N(θ, 1) with θ ∼ Ca(0, 1) such that the posterior is

p(θ|y) ∝ p(y|θ)p(θ) ∝ exp(−(y − θ)²/2) / (1 + θ²).

Use N(θ(t), v²) as the proposal; then the acceptance probability is min{1, r} with

r = q(θ∗|y) / q(θ(t)|y) = [p(y|θ∗)p(θ∗)] / [p(y|θ(t))p(θ(t))].

For this example, let v² = 1.
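A minimal Python sketch of this random-walk sampler (an illustrative translation; the deck's own code is in R):

```python
import math
import random

def rw_metropolis_normal_cauchy(y, v=1.0, n_iter=10000, seed=1):
    """Random-walk Metropolis for q(theta|y) = exp(-(y-theta)^2/2) / (1+theta^2)
    with proposal N(theta(t), v^2)."""
    rng = random.Random(seed)

    def log_q(theta):
        return -0.5 * (y - theta) ** 2 - math.log(1.0 + theta ** 2)

    theta = 0.0
    draws = []
    for _ in range(n_iter):
        prop = rng.gauss(theta, v)
        # symmetric proposal: r reduces to q(prop|y) / q(theta|y)
        if rng.random() < math.exp(min(0.0, log_q(prop) - log_q(theta))):
            theta = prop
        draws.append(theta)
    return draws
```

With y = 1 and v² = 1 the draws behave like the independence sampler's, but the acceptance rate now depends on the step size v.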

SLIDE 17

Example: Normal-Cauchy model

[Figure: trace plots of θ over 100 iterations of random-walk Metropolis, with a reasonable starting value (top) and a poor starting value (bottom).]

SLIDE 18

Random-walk tuning parameter

Let p(θ|y) be the target distribution, suppose the proposal is symmetric with scale v², and suppose θ(t) is (approximately) distributed according to p(θ|y).

If v² ≈ 0, then θ∗ ≈ θ(t), so r = q(θ∗|y)/q(θ(t)|y) ≈ 1 and essentially all proposals are accepted, but the chain barely moves since θ∗ ≈ θ(t). As v² → ∞, θ∗ will typically fall far from the mass of the target distribution, so q(θ∗|y) ≈ 0, r = q(θ∗|y)/q(θ(t)|y) ≈ 0, and essentially all proposals are rejected. So there is an optimal v² somewhere in between.

For normal targets, the optimal random-walk proposal variance is 2.4² Var(θ|y)/d, where d is the dimension of θ, which results in an acceptance rate of about 40% for d = 1, decreasing toward 20% as d → ∞.
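This tradeoff is easy to check empirically. A Python sketch (a N(0, 1) target is assumed here, and the three v values are arbitrary choices):

```python
import math
import random

def rw_accept_rate(v, n_iter=20000, seed=7):
    """Acceptance rate of random-walk Metropolis on a N(0, 1) target
    with proposal N(theta(t), v^2)."""
    rng = random.Random(seed)
    theta, accepted = 0.0, 0
    for _ in range(n_iter):
        prop = rng.gauss(theta, v)
        log_r = 0.5 * (theta * theta - prop * prop)  # log q(prop) - log q(theta)
        if rng.random() < math.exp(min(0.0, log_r)):
            theta, accepted = prop, accepted + 1
    return accepted / n_iter

# v too small: nearly everything accepted, but steps are tiny;
# v near 2.4 * sd: moderate acceptance; v too large: almost nothing accepted
for v in [0.1, 2.4, 10.0]:
    print(v, round(rw_accept_rate(v), 2))
```

The acceptance rate decreases monotonically in v, with the v = 2.4 run landing near the roughly 40% rate quoted above for d = 1.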

SLIDE 19

Random-walk with tuning parameter that is too big and too small

Let y|θ ∼ N(θ, 1), θ ∼ Ca(0, 1), and y = 1.

[Figure: trace plots of θ over 100 iterations for tuning parameters v = 0.1 and v = 10.]

SLIDE 20

Binomial model

Let Y ∼ Bin(n, θ) and θ ∼ Be(1/2, 1/2); thus the posterior is

p(θ|y) ∝ θ^(y−0.5) (1 − θ)^(n−y−0.5) I(0 < θ < 1).

To construct a random-walk Metropolis algorithm, we choose the proposal θ∗ ∼ N(θ(t), 0.4²) and accept, i.e. set θ(t+1) = θ∗, with probability min{1, r} where

r = p(θ∗|y) / p(θ(t)|y) = [(θ∗)^(y−0.5) (1 − θ∗)^(n−y−0.5) I(0 < θ∗ < 1)] / [(θ(t))^(y−0.5) (1 − θ(t))^(n−y−0.5) I(0 < θ(t) < 1)]

  • otherwise, set θ(t+1) = θ(t).

SLIDE 21

Binomial model

n = 10000
log_q = function(theta, y=3, n=10) {
  if (theta < 0 | theta > 1) return(-Inf)
  (y-0.5)*log(theta) + (n-y-0.5)*log(1-theta)
}
current = 0.5 # initial value
samps = rep(NA, n)
for (i in 1:n) {
  proposed = rnorm(1, current, 0.4) # tuning parameter is 0.4
  logr = log_q(proposed) - log_q(current)
  if (log(runif(1)) < logr) current = proposed
  samps[i] = current
}
length(unique(samps))/n # acceptance rate

[1] 0.3746

SLIDE 22

Binomial

[Figure: trace plot of samps over the 10,000 iterations and a histogram of samps on the density scale.]

SLIDE 23

Normal model

Assume Yi ∼ N(µ, σ²) independently with p(µ, σ) ∝ Ca⁺(σ; 0, 1), and thus

p(µ, σ|y) ∝ [∏(i=1..n) σ^(−1) exp(−(yi − µ)²/(2σ²))] × [1/(1 + σ²)] × I(σ > 0)
          = σ^(−n) exp(−(∑(i=1..n) yi² − 2µnȳ + nµ²)/(2σ²)) × [1/(1 + σ²)] × I(σ > 0)

Perform a random-walk Metropolis using a normal proposal, i.e. if µ(t) and σ(t) are the current values for µ and σ, then propose

(µ∗, σ∗)′ ∼ N((µ(t), σ(t))′, S)

where S is the tuning parameter.

SLIDE 24

Adapting the tuning parameter

Recall that the optimal random-walk tuning parameter (if the target is normal) is 2.4² Var(θ|y)/d, where Var(θ|y) is the unknown posterior covariance matrix. We can estimate Var(θ|y) using the sample covariance matrix of draws from the posterior. A proposed automatic adaptation of the Metropolis-Hastings tuning parameter:

  • 1. Start with S0. Set b = 0.
  • 2. Run M iterations of the MCMC using proposal covariance 2.4² Sb/d.
  • 3. Set Sb+1 to the sample covariance matrix of all previous draws.
  • 4. If b < B, set b = b + 1 and return to step 2. Otherwise, throw away all previous draws and go to step 5.
  • 5. Run K iterations of the MCMC using 2.4² SB/d.

SLIDE 25

R code for Metropolis-Hastings

library(MASS) # for mvrnorm, used in the sampling loops below

n = 20
y = rnorm(n)
sum_y2 = sum(y^2)
nybar = n*mean(y) # n*ybar, so that sum((y-mu)^2) = sum_y2 - 2*mu*nybar + n*mu^2
log_q = function(x) {
  if (x[2] < 0) return(-Inf)
  -n*log(x[2]) - (sum_y2 - 2*nybar*x[1] + n*x[1]^2)/(2*x[2]^2) - log(1+x[2]^2)
}

B = 10
M = 100
samps = matrix(NA, nrow=B*M, ncol=2)
a_rate = rep(NA, B)

# Initialize
S = diag(2) # S_0
current = c(0, 1)

SLIDE 26

R code for Metropolis-Hastings - Adapting

# Adapt
for (b in 1:B) {
  for (m in 1:M) {
    i = (b-1)*M + m
    proposed = mvrnorm(1, current, 2.4^2*S/2)
    logr = log_q(proposed) - log_q(current)
    if (log(runif(1)) < logr) current = proposed
    samps[i,] = current
  }
  a_rate[b] = length(unique(samps[1:i,1]))/length(samps[1:i,1])
  S = var(samps[1:i,])
}
a_rate

 [1] 0.0300000 0.2700000 0.3566667 0.4000000 0.4240000 0.4333333 0.4200000
 [8] 0.4175000 0.4166667 0.4270000

var(samps) # S_B

           [,1]       [,2]
[1,] 0.04898222 0.00255292
[2,] 0.00255292 0.02365873

SLIDE 27

R code for Metropolis-Hastings - Adapting

library(ggplot2)
library(reshape2) # for melt

samps = as.data.frame(samps)
names(samps) = c("mu", "sigma")
samps$iteration = 1:nrow(samps)
ggplot(melt(samps, id.var='iteration', variable.name='parameter'),
       aes(x=iteration, y=value)) +
  geom_line() +
  facet_wrap(~parameter, scales='free') +
  theme_bw()

[Figure: trace plots of mu and sigma over the 1,000 adaptation iterations.]

SLIDE 28

R code for Metropolis-Hastings - Inference

# Final run
K = 10000
samps = matrix(NA, nrow=K, ncol=2)
for (k in 1:K) {
  proposed = mvrnorm(1, current, 2.4^2*S/2)
  logr = log_q(proposed) - log_q(current)
  if (log(runif(1)) < logr) current = proposed
  samps[k,] = current
}
length(unique(na.omit(samps[,1])))/length(na.omit(samps[,1])) # acceptance rate

[1] 0.3947

SLIDE 29

R code for Metropolis-Hastings - Inference

[Figure: trace plots and posterior density estimates of mu and sigma from the final run of 10,000 iterations.]

SLIDE 30

Hierarchical binomial model

Recall the hierarchical binomial model

Yi ∼ Bin(ni, θi), θi ∼ Be(α, β), independently, with p(α, β) ∝ (α + β)^(−5/2)

and, after marginalizing out the θi,

Yi ∼ Beta-binomial(ni, α, β), independently, with p(α, β) ∝ (α + β)^(−5/2) I(α > 0) I(β > 0).

Thus the posterior is

p(α, β|y) ∝ [∏(i=1..n) B(α + yi, β + ni − yi) / B(α, β)] × (α + β)^(−5/2) I(α > 0) I(β > 0)

where B(·, ·) is the beta function. We can perform exactly the same adapting procedure, but now using this posterior as the target distribution.
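For instance, the unnormalized log posterior can be evaluated with log-gamma functions for numerical stability. A Python sketch (the data vectors here are made up for illustration):

```python
import math

def log_beta(a, b):
    """log B(a, b) computed via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_posterior(alpha, beta, y, n):
    """Unnormalized log p(alpha, beta | y) for the marginalized beta-binomial model."""
    if alpha <= 0 or beta <= 0:
        return -math.inf
    loglik = sum(log_beta(alpha + yi, beta + ni - yi) - log_beta(alpha, beta)
                 for yi, ni in zip(y, n))
    return loglik - 2.5 * math.log(alpha + beta)

# made-up data: 4 groups with success counts y out of n trials
y = [1, 3, 2, 5]
n = [10, 12, 8, 15]
print(log_posterior(2.0, 8.0, y, n))
```

This function can be dropped into the same adaptive random-walk Metropolis loop used for the normal model, with (α, β) in place of (µ, σ).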

SLIDE 31

Beta-binomial hyperparameter posterior

[Figure: marginal histograms and joint scatter plot of the posterior draws of alpha and beta; the hyperparameters are strongly correlated (Corr: 0.984).]

SLIDE 32

Metropolis-Hastings summary

The Metropolis-Hastings algorithm samples θ∗ ∼ g(·|θ(t)) and sets θ(t+1) = θ∗ with probability min{1, r}, where

r = [q(θ∗|y) / q(θ(t)|y)] × [g(θ(t)|θ∗) / g(θ∗|θ(t))],

and otherwise sets θ(t+1) = θ(t). There are two common Metropolis-Hastings proposals:

  • independent proposal: g(θ∗|θ(t)) = g(θ∗)
  • random-walk proposal: g(θ∗|θ(t)) = g(θ(t)|θ∗)

Independent proposals suffer from the same heavy-tail problems as rejection sampling proposals. Random-walk proposals require tuning of the random walk parameter.
