Introduction to MCMC and BUGS (transcribed from a PowerPoint presentation)
SLIDE 1

Introduction to MCMC and BUGS

◮ Basic recipes, and a sample of some techniques for getting started.
◮ Two simple worked-out examples.
◮ R-code will be provided for those.
◮ No background in MCMC assumed.
◮ Not for experts!

SLIDE 2

MCMC: Reminder Slide

Outline:

◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

SLIDE 3

Bayesian Inference

◮ Data: Y (realisation y)
◮ Parameters, latent variables: θ = (θ1, θ2, . . . , θk)

Assume:

◮ Likelihood: f(y|θ). Distribution of the data given the parameters.
◮ Prior: π(θ). Prior distribution, before we observe data.

Inference is based on the joint posterior distribution:

π(θ|y) = f(y|θ)π(θ) / ∫ f(y|θ)π(θ) dθ ∝ f(y|θ)π(θ),

i.e. Posterior ∝ Likelihood × Prior.

SLIDE 4

Example 1: Normal-Cauchy

◮ Suppose Y1, . . . , Yn iid ∼ N(θ, 1). We don't know θ.
◮ iid = independently and identically distributed.
◮ Assume a Cauchy prior distribution: π(θ) = 1/{π(1 + θ²)}.

Posterior density:

π(θ|y) ∝ exp{−Σ_{i=1}^{n} (yi − θ)²/2} × 1/(1 + θ²)
        ∝ exp{−n(θ − ȳ)²/2} × 1/(1 + θ²)   (can be shown).

Things of interest to Bayesians:

◮ Posterior mean = E(θ|y) = ∫_{−∞}^{∞} θ π(θ|y) dθ.
◮ Posterior variance = Var(θ|y) = E(θ²|y) − {E(θ|y)}².
◮ Credible interval: {a(y), b(y)} for θ such that Pr{a(y) < θ < b(y) | y} = 0.95, for example.

SLIDE 5

Example 2: One-sample normal.

◮ Data: Y1, . . . , Yn iid ∼ N(µ, σ²).
◮ Let precision = τ = 1/σ².
◮ Assume a 'flat' prior distribution for µ, i.e. π(µ) = 1,
◮ and let τ ∼ Gamma(a, b), a, b > 0, which has mean a/b, so that

π(µ, τ) ∝ τ^{a−1} e^{−bτ}.

◮ Joint posterior density:

π(µ, τ|y) ∝ τ^{n/2+a−1} exp{−τ Σ(yi − µ)²/2 − bτ}.

◮ For making inference, we want quantities like:
◮ Posterior mean = E(µ|y) = ∫∫ µ π(µ, τ|y) dµ dτ.
◮ Posterior variance = Var(µ|y): horrible double integrals! Difficult to solve in general.
◮ Credible intervals involve even more of these!

SLIDE 6

MCMC: Reminder Slide

Outline:

◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

SLIDE 7

Monte Carlo (MC) integration

◮ General problem: evaluate the expectation of a general function h(X) of the random variable X:

E_π[h(X)] = ∫ h(x) π(x) dx,

assuming that it exists, i.e. ∫ |h(x)| π(x) dx < ∞.

◮ This can be very difficult, but if we can draw samples X(1), X(2), . . . , X(N) ∼ π(x), then we can estimate E_π[h(X)] by

h̄_N = (1/N) Σ_{t=1}^{N} h(X(t)).

◮ This is the basic idea of statistics: "Estimate unknowns by drawing samples."
◮ Change of notation: Parameters = θ ≡ x; Posterior = π(θ|y) = π(x). Replace y by its observed value and suppress it.
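The Monte Carlo average h̄_N is easy to try out. The course labs use R; purely as an illustration, here is a minimal Python sketch that estimates E_π[h(X)] for h(x) = x² under π = N(0, 1), whose true value is 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw N independent samples X(1), ..., X(N) from pi(x) = N(0, 1)
N = 100_000
x = rng.standard_normal(N)

# Monte Carlo estimate of E[h(X)] with h(x) = x^2; the true value is 1
h_bar = np.mean(x ** 2)
print(h_bar)
```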

SLIDE 8

Consistency

◮ For independent samples, by the Law of Large Numbers (LLN),

h̄_N = (1/N) Σ_{t=1}^{N} h(X(t)) → E_π[h(X)] as N → ∞.   (1)

◮ But independent sampling from π(x) is usually difficult.
◮ It turns out that (1), the LLN, still applies if we generate the samples using a Markov chain. But first, some revision of Markov chains.

SLIDE 9

MCMC: Reminder Slide

Outline:

◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

SLIDE 10

Markov chains

◮ A Markov chain is generated by sampling

X(t+1) ∼ p(x | x(t)), t = 0, 1, 2, . . . .

p is called the transition kernel.

◮ So X(t+1) depends only on X(t), not on X(0), X(1), . . . , X(t−1):

p(x(t+1) | x(t), x(t−1), . . .) = p(x(t+1) | x(t)).

◮ For example: X(t+1) | x(t) ∼ N(0.5 x(t), 1.0).
◮ This is called a first-order auto-regressive process with lag-1 auto-correlation 0.5.
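The AR(1) example above is straightforward to simulate. A minimal Python sketch (illustrative only; the lab code is in R):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_chain(x0, n_iter, rho=0.5):
    """Simulate X(t+1) | x(t) ~ N(rho * x(t), 1)."""
    x = np.empty(n_iter + 1)
    x[0] = x0
    for t in range(n_iter):
        x[t + 1] = rng.normal(rho * x[t], 1.0)
    return x

# Two short chains from very different starting points
chain_a = ar1_chain(4.0, 60)
chain_b = ar1_chain(-4.0, 60)

# A long run settles into the stationary distribution N(0, 1/(1 - 0.25))
x_long = ar1_chain(0.0, 200_000)
print(x_long[1000:].var())  # close to 1.33
```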

SLIDE 11

◮ Simulation of the chain X(t+1) | x(t) ∼ N(0.5 x(t), 1.0) with two different starting points x(0) (one red and one blue).

[Figure: trace plots of the two chains over 60 iterations.]

◮ After about 5-7 iterations the chains seem to have forgotten their starting positions.
◮ Caution: having p(x(t+1) | x(t)) = p(x(t+1) | x(t), x(t−1)) does not mean that X(t+1) and X(t−1) are marginally independent.

SLIDE 12

Stationarity

As t → ∞, the Markov chain converges in distribution to its stationary (invariant) distribution.

◮ In the above example it can be proved that this is

X(t) | x(0) ∼ N(0.0, 1.33) as t → ∞,

which does not depend on x(0).

◮ Does this happen for all Markov chains?

SLIDE 13

Irreducibility

◮ Assuming that a stationary distribution exists, it is unique if the chain is irreducible.
◮ Irreducible means any set of states (values) can be reached from any other state (value) in a finite number of moves.
◮ An example of a reducible Markov chain: suppose p(x|y) = 0 for x ∈ A and y ∈ B, and vice versa.

[Figure: two disjoint regions A and B inside a rectangle of all possible values.]

◮ Here the whole set of values is the rectangle.

SLIDE 14

Aperiodicity

A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor (g.c.d.) of the return times to any particular state, i say, is 1.

◮ Think of recording the numbers of steps taken to return to state i. The g.c.d. of those numbers should be 1.
◮ If the g.c.d. is bigger than 1, say 2, then the chain will return in cycles of 2, 4, 6, . . . steps. This is not allowed for aperiodicity.
◮ The definition can be extended to general state-space cases (where the Markov chain can take any value in a continuous interval).

SLIDE 15

Ergodicity

Suppose that the Markov chain has the stationary distribution π(x) and is aperiodic and irreducible. Then it is said to be ergodic.

◮ Ergodic theorem:

h̄_N = (1/N) Σ_{t=1}^{N} h(X(t)) → E_π[h(X)] as N → ∞.

◮ h̄_N is called an ergodic average.
◮ If σ²_h = Var_π[h(X)] < ∞, then the central limit theorem holds, i.e. h̄_N (suitably scaled) converges to a normal distribution,
◮ and this convergence occurs geometrically fast.

SLIDE 16

Numerical standard errors (nse)

The nse of h̄_N is √Var_π(h̄_N), and for large N

nse(h̄_N) ≈ √{ (σ²_h/N) [1 + 2 Σ_{ℓ=1}^{N−1} ρ_ℓ(h)] },

where ρ_ℓ(h) is the lag-ℓ auto-correlation of h(X(t)).

◮ In general no simpler expression exists for the nse.
◮ See Geyer (1992) and Besag and Green (1993) for ideas and further references.

SLIDE 17

Numerical standard errors (nse) ...

◮ If h(X(t)) can be approximated as a first-order auto-regressive process, then

nse(h̄_N) ≈ √{ (σ²_h/N) (1 + ρ)/(1 − ρ) },

where ρ is the lag-1 auto-correlation of h(X(t)).

◮ The first factor is the usual term under independent sampling.
◮ The second factor is usually > 1,
◮ and thus is the penalty to be paid for using a Markov chain instead of independent sampling.
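Numerically, the AR(1) approximation gives a simple inflation factor: for example, ρ = 0.5 inflates the variance by (1 + 0.5)/(1 − 0.5) = 3, so the nse is √3 ≈ 1.73 times its independent-sampling value. A small Python helper (illustrative sketch only):

```python
import math

def nse_ar1(sigma2_h, N, rho):
    """Approximate nse of the ergodic average when h(X(t)) behaves
    like an AR(1) process with lag-1 auto-correlation rho."""
    return math.sqrt(sigma2_h / N * (1 + rho) / (1 - rho))

indep = nse_ar1(1.0, 10_000, 0.0)  # independent sampling: 0.01
corr = nse_ar1(1.0, 10_000, 0.5)   # sqrt(3) times larger
print(indep, corr)
```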

SLIDE 18

Numerical standard errors (nse) ...

Moreover,

◮ the nse may not be finite in general.
◮ the nse is finite if the chain converges geometrically.
◮ if the nse is finite, then we can make it as small as we like by increasing N.
◮ it can be proved that the following simple-minded estimator of the nse is not consistent:

nsê(h̄_N) = √{ (σ²_h/N) [1 + 2 Σ_{ℓ=1}^{N−1} r_ℓ(h)] },

where r_ℓ(h) is the sample lag-ℓ auto-correlation of h(x(t)).

We will return to the estimation of the nse later.

SLIDE 19

Markov chains – summary

◮ A Markov chain may have a stationary distribution.
◮ The stationary distribution is unique if the chain is irreducible.
◮ We can estimate nse's if the chain is also geometrically convergent.

Where does all this get us?

SLIDE 20

MCMC: Reminder Slide

Outline:

◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

SLIDE 21

MCMC

◮ How do we construct a Markov chain whose stationary distribution is our posterior distribution, π(x)?
◮ Metropolis et al. (1953) showed us how.
◮ The method was generalized by Hastings (1970). This is called Markov chain Monte Carlo (MCMC).

SLIDE 22

Metropolis-Hastings algorithm

At each iteration t:

Step 1. Sample a "candidate" point y ∼ q(y | x(t)), where q is the "proposal" distribution.

Step 2. Then with probability

α(x(t), y) = min{1, [π(y) q(x(t)|y)] / [π(x(t)) q(y|x(t))]}

set x(t+1) = y (acceptance). If not accepted, set x(t+1) = x(t) (rejection).
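The two steps above fit in a few lines of code. A generic Python sketch (illustrative, not the course's R code), working on the log scale for numerical stability; run on a N(0, 1) target with a random-walk proposal it recovers mean 0 and variance 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def mh_step(x, log_target, propose, log_q):
    """One Metropolis-Hastings update.

    log_target(x): log pi(x), up to an additive constant.
    propose(x):    draws y ~ q(y | x).
    log_q(y, x):   log q(y | x), up to an additive constant.
    """
    y = propose(x)
    log_alpha = (log_target(y) - log_target(x)
                 + log_q(x, y) - log_q(y, x))
    if np.log(rng.uniform()) < log_alpha:
        return y, True    # accept
    return x, False       # reject

# Example: N(0, 1) target with a random-walk N(x, 1) proposal
log_target = lambda x: -0.5 * x ** 2
propose = lambda x: rng.normal(x, 1.0)
log_q = lambda y, x: -0.5 * (y - x) ** 2   # symmetric, cancels anyway

x, n_acc, samples = 0.0, 0, []
for _ in range(20_000):
    x, accepted = mh_step(x, log_target, propose, log_q)
    n_acc += accepted
    samples.append(x)
samples = np.array(samples)
print(samples.mean(), samples.var())  # roughly 0 and 1
```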

SLIDE 23

Metropolis-Hastings algorithm...

◮ The normalising constant in π(x) is not required to run the algorithm: it cancels out in the ratio.
◮ If q(y|x) = π(y), then we obtain independent samples, and α(x, y) = 1 always.
◮ Usually q is chosen so that q(y|x) is easy to sample from.
◮ Theoretically, any density q(·|x) having the same support (range) as π(x) should work. However, some q's are better than others; see later.
◮ The induced Markov chains have the desirable properties under mild conditions on π(x).
◮ π(x) is often called the target distribution. Recall the notation change from θ to x.

SLIDE 24

Implementing MCMC : Reminder Slide

◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors

SLIDE 25

The Metropolis algorithm

◮ Suppose that the proposal distribution is symmetric: q(x|y) ≡ q(y|x).
◮ This was originally proposed by Metropolis et al. (1953).
◮ A special case of this is known as the random-walk Metropolis algorithm, where q(y|x) ≡ q(|y − x|).
◮ In this case,

q(x(t)|y) / q(y|x(t)) = q(|x(t) − y|) / q(|y − x(t)|) = 1,

hence

α(x(t), y) = min{1, π(y)/π(x(t))}.
SLIDE 26

The random walk Metropolis algorithm

◮ Example: consider the N(0, 1) target distribution, i.e. π(x) ∝ exp(−x²/2),
◮ and consider N(x(t), σ²) as the proposal distribution, for some σ².

[Figure: the N(0, 1) target density with several normal proposal densities centred at different current values.]

◮ The proposal depends on where you are, i.e. on the current value x(t).
◮ Need to tune σ² to have about 15-40% acceptance of the proposals; Gelman et al. (1996).
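The effect of σ² on the acceptance rate is easy to demonstrate: a tiny step size accepts almost everything but moves slowly, while a huge step size rejects almost everything. A Python sketch for the N(0, 1) target (illustrative; σ = 2.4 here is just one moderate value, not a recommendation from the slides):

```python
import numpy as np

rng = np.random.default_rng(7)

def rw_metropolis(n_iter, sigma, x0=0.0):
    """Random-walk Metropolis for the N(0, 1) target.
    Returns the samples and the observed acceptance rate."""
    x, acc, xs = x0, 0, []
    for _ in range(n_iter):
        y = x + rng.normal(0.0, sigma)
        # alpha = min(1, pi(y)/pi(x)); the symmetric proposal cancels
        if np.log(rng.uniform()) < 0.5 * (x ** 2 - y ** 2):
            x, acc = y, acc + 1
        xs.append(x)
    return np.array(xs), acc / n_iter

rates = {s: rw_metropolis(20_000, s)[1] for s in (0.1, 2.4, 20.0)}
print(rates)  # acceptance rate falls as sigma grows
```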

SLIDE 27

The Independence Sampler

◮ The proposal does not depend on the current value x: q(y|x) ≡ q(y).
◮ So α(x, y) has a simpler form.
◮ Beware: independence samplers are either very good or very bad.
◮ The tails of q(y) must be heavier than the tails of π(x) for geometric (rapid) convergence.

[Figure: a target density together with a heavier-tailed proposal density.]

SLIDE 28

Return to Example 1

◮ Let Y1, . . . , Yn iid ∼ N(θ, 1) and π(θ) = 1/{π(1 + θ²)}.
◮ Posterior:

π(θ|y) ∝ exp{−n(θ − ȳ)²/2} × 1/(1 + θ²).

◮ Suppose n = 20 and ȳ = 0.0675. In the x notation, we have the target density

π(x) ∝ exp{−20(x − 0.0675)²/2} × 1/(1 + x²).

◮ Here the support is the real line, i.e. −∞ < x < ∞.
◮ Suppose we want E(θ|y) = E(X) = ∫_{−∞}^{∞} x π(x) dx.
◮ No analytic answer. We do not even know the constant that will give us ∫_{−∞}^{∞} π(x) dx = 1.

SLIDE 29

Example 1 continued...

◮ We can approximate E(X) very accurately by numerical integration. It can be shown that E(X) = 0.0620.
◮ Let us try to find this using the independence sampler.
◮ We can sample from the Cauchy distribution easily.
◮ Try the Cauchy distribution as the proposal, i.e. q(y|x) = 1/{π(1 + y²)}.

[Figure: trace plot of 1000 iterations of the independence sampler.]

        True answer   MCMC estimate   nse     lag-1 corr
E(X)    0.0620        0.0604          0.018   0.266
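A sketch of this independence sampler in Python (the lab provides the R version; this one is purely illustrative). It targets π(x) ∝ exp{−20(x − 0.0675)²/2}/(1 + x²) with a standard Cauchy proposal and should reproduce a posterior mean near 0.0620.

```python
import numpy as np

rng = np.random.default_rng(3)
n, ybar = 20, 0.0675

def log_target(x):
    # log pi(x) up to an additive constant
    return -n * (x - ybar) ** 2 / 2 - np.log(1 + x ** 2)

def log_q(x):
    # standard Cauchy proposal density, up to a constant
    return -np.log(1 + x ** 2)

x, samples = 0.0, []
for _ in range(50_000):
    y = rng.standard_cauchy()
    # alpha = min(1, pi(y) q(x) / (pi(x) q(y)))
    if np.log(rng.uniform()) < (log_target(y) - log_target(x)
                                + log_q(x) - log_q(y)):
        x = y
    samples.append(x)

post_mean = float(np.mean(samples))
print(post_mean)  # close to 0.0620
```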
SLIDE 30

Implementing MCMC : Reminder Slide

◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors

SLIDE 31

Gibbs Sampling

◮ Suppose that x = (x1, x2, . . . , xk) is k-dimensional.
◮ The Gibbs sampler (Gelfand and Smith, 1990) uses what are called the full conditional distributions:

π(xj | x1, . . . , xj−1, xj+1, . . . , xk) = π(x1, . . . , xj−1, xj, xj+1, . . . , xk) / ∫ π(x1, . . . , xj−1, xj, xj+1, . . . , xk) dxj.

◮ Note that the full conditional distribution π(xj | x1, . . . , xj−1, xj+1, . . . , xk) is proportional to the joint, since the denominator in the above integral does not depend on xj.
◮ Often this helps in finding it, e.g. just look at the terms involving xj in the joint posterior density.

SLIDE 32

Return to Example 2.

◮ We had the joint posterior density:

π(µ, τ|y) ∝ τ^{n/2+a−1} exp{−τ Σ(yi − µ)²/2 − bτ}.

◮ We need the two full conditional distributions, π(µ|τ, y) and π(τ|µ, y).
◮ Easy to see:

π(τ|µ, y) ∝ τ^{n/2+a−1} exp{−τ [Σ(yi − µ)²/2 + b]},

i.e.

τ | µ, y ∼ Gamma(n/2 + a, b + (1/2) Σ(yi − µ)²).

◮ It can be shown that Σ_{i=1}^{n} (yi − µ)² = n(µ − ȳ)² + Σ_{i=1}^{n} (yi − ȳ)².
◮ Hence we can easily deduce that:

µ | τ, y ∼ N(ȳ, 1/(nτ)).
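The two full conditionals above are all we need for a Gibbs sampler. A Python sketch on simulated data (the lab uses R and BUGS; note NumPy's rng.gamma takes a scale = 1/rate argument), checked against the closed-form answers E(µ|y) = ȳ and E(σ²|y) = (Syy + 2b)/(n − 3 + 2a) quoted later in the slides:

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated data: 20 observations from N(5, 1), as in the lab exercise
y = rng.normal(5.0, 1.0, size=20)
n, ybar = len(y), y.mean()
a, b = 2.0, 1.0  # Gamma(a, b) prior on tau

def gibbs(n_iter, mu=0.0):
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # tau | mu, y ~ Gamma(n/2 + a, rate = b + sum((y - mu)^2)/2)
        tau = rng.gamma(n / 2 + a, 1.0 / (b + np.sum((y - mu) ** 2) / 2))
        # mu | tau, y ~ N(ybar, 1/(n * tau))
        mu = rng.normal(ybar, 1.0 / np.sqrt(n * tau))
        out[t] = mu, tau
    return out

draws = gibbs(10_000)[2_500:]            # discard M = N/4 as burn-in
mu_hat = draws[:, 0].mean()
sigma2_hat = (1.0 / draws[:, 1]).mean()  # sigma^2 = 1/tau

Syy = np.sum((y - ybar) ** 2)
exact_sigma2 = (Syy + 2 * b) / (n - 3 + 2 * a)  # closed-form E(sigma^2 | y)
print(mu_hat, ybar, sigma2_hat, exact_sigma2)
```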

SLIDE 33

Gibbs sampling

Sample or update in turn:

X1(t+1) ∼ π(x1 | x2(t), x3(t), . . . , xk(t))
X2(t+1) ∼ π(x2 | x1(t+1), x3(t), . . . , xk(t))
X3(t+1) ∼ π(x3 | x1(t+1), x2(t+1), x4(t), . . . )
. . .
Xk(t+1) ∼ π(xk | x1(t+1), x2(t+1), . . . , xk−1(t+1))

◮ Always use the most recent values.
◮ The order of updating matters, but often not too much (Roberts and Sahu, 1997).
◮ The full conditional distributions are also called complete conditional distributions.

SLIDE 34

Thus in two dimensions (k = 2), the sample path of the Gibbs sampler will look something like:

[Figure: the sample path x(0), x(1), . . . , x(4) in the (x1, x2) plane, moving parallel to one coordinate axis at a time.]

SLIDE 35

Sampling from full conditionals

◮ We must be able to sample from π(xj | x1, . . . , xj−1, xj+1, . . . , xk) to do Gibbs sampling.
◮ In Example 2 the two full conditional distributions are both standard distributions: normal and gamma.
◮ In real-life problems, full conditionals often have complex algebraic forms, but are usually (nearly) log-concave.
◮ For (nearly) log-concave univariate densities, use adaptive rejection sampling (Gilks and Wild, 1992) and follow-up work.

SLIDE 36

Implementing MCMC : Reminder Slide

◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors

SLIDE 37

Number of Chains

How many parallel chains of MCMC should be run?

◮ Several parallel chains (Gelman and Rubin, 1992)
  ◮ give an indication of convergence
  ◮ and a sense of statistical security.
◮ One very long run (Geyer, 1992)
  ◮ reaches parts of the parameter space the other scheme cannot.

SLIDE 38

Implementing MCMC : Reminder Slide

◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors

SLIDE 39

Determining Burn-in

◮ Early iterations x(1), . . . , x(M) reflect the starting value x(0).
◮ These iterations are called burn-in.
◮ After the burn-in, we say the chain has 'converged'.
◮ Discard the burn-in iterations and form the delayed ergodic average:

h̄_MN = (1/(N − M)) Σ_{t=M+1}^{N} h(X(t)).

◮ Methods for determining M are called convergence diagnostics.

SLIDE 40

Convergence Diagnostics

◮ Must do:
  ◮ Plot the time series of each quantity of interest.
  ◮ Plot the auto-correlation functions.
◮ If not satisfied, try some other diagnostics in the R package CODA.
◮ See for example: Gelman and Rubin (1992), Robert (1998), Cowles and Carlin (1996), Brooks and Roberts (1998).
◮ But realise that you cannot prove that you have converged using any of those.

SLIDE 41

Implementing MCMC : Reminder Slide

◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors

SLIDE 42

Numerical standard errors (nse)

◮ Suppose we decide to run the chain until nse(h̄_MN) is sufficiently small.
◮ For a given run length N, how can we estimate the nse, taking account of the auto-correlations in h(X(M+1)), . . . , h(X(N))?
SLIDE 43

Method of batching

In this method the problem of auto-correlation is overcome by

◮ dividing the sequence x(M+1), . . . , x(N) into k equal-length batches,
◮ calculating the mean bj of each batch j,
◮ checking that b1, . . . , bk are approximately uncorrelated.

Then we can estimate

nsê(x̄_MN) = √{ (1/(k(k − 1))) Σ_{i=1}^{k} (bi − b̄)² }.

SLIDE 44

Practical notes

◮ Use at least 20 batches.
◮ Estimate the lag-1 auto-correlation of the sequence {bi}.
◮ If the auto-correlation is high, a longer run should be used, giving larger batches.
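The batching recipe above, as an illustrative Python sketch (the lab exercise asks for R). Applied to the first-order auto-regressive chain from earlier with lag-1 correlation 0.5, the batched nse correctly comes out larger than the naive i.i.d. formula, which ignores the auto-correlation:

```python
import numpy as np

rng = np.random.default_rng(5)

def batch_nse(x, k=20):
    """Estimate the nse of the mean of a (possibly auto-correlated)
    sequence x by the method of batching, with k batches."""
    m = len(x) // k
    b = x[: k * m].reshape(k, m).mean(axis=1)  # batch means b_1, ..., b_k
    return float(np.sqrt(np.sum((b - b.mean()) ** 2) / (k * (k - 1))))

# AR(1) chain with lag-1 auto-correlation 0.5
N = 100_000
x = np.empty(N)
x[0] = 0.0
for t in range(N - 1):
    x[t + 1] = 0.5 * x[t] + rng.standard_normal()

naive = float(x.std(ddof=1) / np.sqrt(N))  # pretends samples are independent
batched = batch_nse(x)
print(naive, batched)  # batched should be roughly sqrt(3) times naive
```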

SLIDE 45

Again return to Example 2.

◮ Let Syy = Σ_{i=1}^{n} (yi − ȳ)². It is easy to find:

E(µ|y) = ȳ and E(σ²|y) = (Syy + 2b)/(n − 3 + 2a).

◮ From n = 20 observations, we have ȳ = 5.1597 and Syy = 9.1887. Suppose we take a = 2, b = 1, a proper prior.
◮ Take N = 10000, M = N/4 (i.e. discard the first 2500 iterations).

       True mean   Gibbs mean   nse      lag-1 corr
µ      5.1597      5.1599       0.0019   0.0402
σ²     0.5328      0.5336       0.0019   0.0198

◮ We shall verify these numbers in the lab session using R and BUGS.

SLIDE 46

Problem 1. Metropolis-Hastings Algorithm: Normal-Cauchy example.

Let Y1, . . . , Yn i.i.d. ∼ N(θ, 1) and π(θ) = 1/{π(1 + θ²)}. Recall that the posterior distribution can be written as

π(θ|y) ∝ exp{−n(θ − ȳ)²/2} × 1/(1 + θ²).

Suppose n = 20 and ȳ = 0.0675.

1. Using the R language, code the independence sampler where the standard Cauchy distribution is the proposal distribution.
2. Run your code to estimate the posterior mean and the standard deviation.
3. Compare your MCMC estimate with the exact answer obtained using the R function integrate.
4. Program the random-walk Metropolis algorithm with a normal proposal distribution. Tune the proposal variance to obtain an acceptance rate between 25-35%. Now do the previous two parts.

SLIDE 47

Problem 2. Gibbs Sampler: Normal (µ, σ²) example.

Data Y1, . . . , Yn are a random sample from N(µ, σ²). Let τ = 1/σ². Assume the prior distribution π(µ) ∝ 1 and τ ∼ Gamma(a, b), a > 0, b > 0, so that τ has mean a/b. Joint posterior:

π(µ, τ|y) ∝ τ^{n/2+a−1} exp{−τ Σ(yi − µ)²/2 − bτ}.

1. Use the R function rnorm to simulate a data set with 20 observations from N(5, 1).
2. Write an R program implementing the Gibbs sampler for sampling µ and τ.
3. Obtain estimates of E(µ|y) and E(σ²|y).
4. Provide MCMC trace plots for µ and σ².
5. Verify that the estimates are close to the true values.
6. Use the method of batching to estimate the nse.
7. Write a program in WinBUGS and verify.