Introduction to MCMC and BUGS
◮ Basic recipes, and a sample of some techniques for getting started.
◮ Two simple worked-out examples.
◮ R-code will be provided for those.
◮ No background in MCMC assumed.
◮ Not for experts!
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC
◮ Data: Y (realisation y).
◮ Parameters, latent variables: θ = (θ1, θ2, . . . , θk).
◮ Likelihood: f(y|θ). Distribution of the data given the parameters.
◮ Prior: π(θ). Prior distribution, before we observe data.
◮ Posterior, by Bayes' theorem:
π(θ|y) = f(y|θ)π(θ) / ∫ f(y|θ)π(θ) dθ.
◮ Suppose Y1, . . . , Yn iid N(θ, 1).
◮ iid = independently and identically distributed.
◮ Assume a Cauchy prior distribution: π(θ) = 1/{π(1 + θ²)}.
◮ Posterior:
π(θ|y) ∝ exp{−Σ_{i=1}^{n}(y_i − θ)²/2} · 1/(1 + θ²)
∝ exp{−n(θ − ȳ)²/2} · 1/(1 + θ²) (can be shown).
◮ Posterior Mean = E(θ|y) = ∫_{−∞}^{∞} θ π(θ|y) dθ.
◮ Posterior Variance = Var(θ|y) = E(θ²|y) − {E(θ|y)}².
◮ Credible interval: {a(y), b(y)} for θ such that P{a(y) ≤ θ ≤ b(y) | y} = 1 − α.
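For a scalar θ these summaries can be approximated by brute-force numerical integration on a grid. A minimal Python sketch for the normal-likelihood/Cauchy-prior posterior π(θ|y) ∝ exp{−n(θ − ȳ)²/2}/(1 + θ²); the values n = 20 and ȳ = 1.0 are hypothetical, not from the slides:

```python
import numpy as np

def posterior_summaries(ybar, n):
    """Grid approximation of E(theta|y) and Var(theta|y) for the
    unnormalised posterior exp(-n*(theta-ybar)^2/2) / (1+theta^2)."""
    grid = np.linspace(ybar - 10.0, ybar + 10.0, 200_001)
    dens = np.exp(-0.5 * n * (grid - ybar) ** 2) / (1.0 + grid ** 2)
    w = dens / dens.sum()                  # normalise on the grid
    mean = np.sum(grid * w)
    var = np.sum(grid ** 2 * w) - mean ** 2
    return mean, var

m, v = posterior_summaries(ybar=1.0, n=20)   # ybar = 1.0 is a placeholder
```

Because the normalising constant is formed numerically, only ratios of the unnormalised density matter; this foreshadows why MCMC never needs that constant at all.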
◮ Data: Y1, . . . , Yn iid N(µ, σ²), with both µ and σ² unknown.
◮ Let precision = τ = 1/σ².
◮ Assume a 'flat' prior distribution for µ, i.e. π(µ) = 1,
◮ and let τ ∼ Gamma(a, b), a, b > 0, which has mean a/b.
◮ Joint posterior density:
π(µ, τ|y) ∝ τ^{n/2 + a − 1} exp{−(τ/2) Σ_{i=1}^{n}(y_i − µ)² − bτ}.
◮ Posterior Mean = E(µ|y) = ∫_{−∞}^{∞} ∫_{0}^{∞} µ π(µ, τ|y) dτ dµ: a horrible double integral.
◮ Posterior Variance = Var(µ|y): horrible double integrals again.
◮ Credible intervals involve even more of these!
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC
◮ General problem: Evaluate the expectation of any function h of a random variable X with density π(x): E_π[h(X)] = ∫ h(x) π(x) dx.
◮ This can be very difficult, but if we can draw samples X^(1), . . . , X^(N) from π(x), then estimate it by the sample mean
h̄_N = (1/N) Σ_{t=1}^{N} h(X^(t)).   (1)
◮ This is the basic idea of statistics: estimate a population mean by a sample mean.
◮ Change notation: Parameters = θ ≡ x; Posterior = π(θ|y) ≡ π(x).
◮ For independent samples, by the Law of Large Numbers, h̄_N → E_π[h(X)] as N → ∞.
◮ But independent sampling from π(x) is usually difficult or impossible.
◮ It turns out that (1), the LLN, still applies if we generate dependent samples using a Markov chain whose stationary distribution is π(x).
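The estimator (1) is easy to see in action with a distribution we can sample directly. A short sketch with a standard normal target and h(x) = x², whose true expectation is 1:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw independent samples from pi(x) = N(0, 1) and estimate
# E_pi[h(X)] for h(x) = x^2 by the sample mean, as in (1).
N = 100_000
x = rng.standard_normal(N)
h_bar = np.mean(x ** 2)      # true value: E[X^2] = Var(X) = 1
```

With N = 100,000 independent draws the estimate is typically within a few thousandths of the truth; MCMC aims for the same limit with dependent draws.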
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC
◮ A Markov chain is generated by sampling X^(t+1) ∼ p(x | x^(t)), t = 0, 1, 2, . . .
◮ So X^(t+1) depends only on X^(t), not on x^(0), x^(1), . . . , x^(t−1).
◮ For example: X^(t+1) = 0.5 x^(t) + ε_t, with ε_t ∼ N(0, 1).
◮ This is called a first-order auto-regressive (AR(1)) process.
◮ Simulation of the chain: X^(t+1)|x^(t) ∼ N(0.5 x^(t), 1.0).
[Figure: trace plots of the simulated chain, iterations i = 10 to 60, values roughly −4 to 4.]
◮ After about 5–7 iterations the chains seemed to have converged, i.e. forgotten their different starting values.
◮ Caution: we have p(x^(t+1)|x^(t)) = p(x^(t+1)|x^(t), x^(t−1), . . .); this is the Markov property, and it does not mean successive samples are independent.
◮ As t grows, the distribution of X^(t) settles down to a stationary distribution; in the above example (it can be proved that) this is the N(0, 4/3) distribution.
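The AR(1) example can be simulated directly and checked against its stationary distribution: the stationary variance v solves v = 0.5²v + 1, giving v = 4/3. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate X^(t+1) | x^(t) ~ N(0.5 x^(t), 1) from a poor starting value.
T = 200_000
x = np.empty(T)
x[0] = 10.0                          # deliberately far from stationarity
for t in range(T - 1):
    x[t + 1] = 0.5 * x[t] + rng.standard_normal()

# After a short burn-in the draws look like the stationary N(0, 4/3).
burned = x[100:]
mean_est, var_est = burned.mean(), burned.var()
```

The starting value 10.0 is forgotten within a handful of iterations, matching the 5–7 iterations observed in the trace plots.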
◮ Does this happen for all Markov chains? No: we need some conditions.
◮ Assuming that a stationary distribution exists, it is unique if the chain is irreducible.
◮ Irreducible means any set of states (values) can be reached from any other state in a finite number of moves.
◮ An example of a reducible Markov chain: [figure]
◮ Here the whole set of values is the rectangle.
◮ Aperiodicity: think of recording the number of steps taken to return to a given state; the chain is aperiodic if the greatest common divisor (g.c.d.) of these return times is 1.
◮ If the g.c.d. is bigger than 1, 2 say, then the chain can return to the state only at even numbers of steps: it is periodic.
◮ The definition can be extended to general state spaces.
◮ Ergodic theorem: if the chain {X^(t)} is irreducible and aperiodic with stationary distribution π(x), then
h̄_N = (1/N) Σ_{t=1}^{N} h(X^(t)) → E_π[h(X)] as N → ∞.
◮ If σ²_h = Var_π[h(X)] < ∞, a central limit theorem also holds for h̄_N,
◮ and this convergence occurs geometrically fast for geometrically ergodic chains.
◮ The numerical standard error (nse) of h̄_N is
nse(h̄_N) = (σ_h/√N) √{1 + 2 Σ_{k=1}^{N−1} ρ_k(h)},
where ρ_k(h) is the lag-k auto-correlation of the sequence h(X^(t)).
◮ In general no simpler expression exists for the nse.
◮ See Geyer (1992), and Besag and Green (1993) for suggestions.
◮ If the auto-correlations decay geometrically, ρ_k(h) = ρ^k, then approximately
nse(h̄_N) ≈ (σ_h/√N) √{(1 + ρ)/(1 − ρ)}.
◮ The first factor is the usual term under independent sampling.
◮ The second term is usually > 1,
◮ and thus is the penalty to be paid for using a Markov chain instead of independent samples.
◮ The nse may not be finite in general.
◮ The nse is finite if the chain converges geometrically.
◮ If the nse is finite, then we can make it as small as we like by increasing N.
◮ It can be proved that the following simple-minded estimate works:
nse(h̄_N) ≈ (σ̂_h/√N) √{1 + 2 Σ_{k=1}^{N−1} ρ̂_k(h)},
with the sample variance σ̂²_h and sample auto-correlations ρ̂_k(h) plugged in (in practice the sum is truncated at a moderate lag).
◮ A Markov chain may have a stationary distribution.
◮ The stationary distribution is unique if the chain is irreducible.
◮ We can estimate nse's if the chain is also geometrically convergent.
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC
◮ How do we construct a Markov chain whose stationary distribution is exactly our target π(x)?
◮ Metropolis et al. (1953) showed us how.
◮ The method was generalized by Hastings (1970).
◮ Metropolis-Hastings algorithm: given the current value x, propose y ∼ q(y|x) and accept it with probability
α(x, y) = min{1, [π(y) q(x|y)] / [π(x) q(y|x)]};
otherwise stay at x.
◮ The normalising constant in π(x) is not required to run the algorithm: it cancels in the ratio.
◮ If q(y|x) = π(y), then we obtain independent samples from π(x).
◮ Usually q is chosen so that q(y|x) is easy to sample from.
◮ Theoretically, any density q(·|x) having the same support as π(x) should work.
◮ The induced Markov chains have the desirable properties (irreducibility, aperiodicity) under mild conditions.
◮ π(x) is often called the target distribution; recall that for us the target is the posterior distribution.
◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors
◮ Suppose that the proposal distribution is symmetric: q(y|x) = q(x|y).
◮ This was originally proposed by Metropolis et al. (1953).
◮ A special case of this is known as the random-walk sampler: y = x + ε, where ε has a symmetric distribution.
◮ In this case the acceptance probability simplifies to α(x, y) = min{1, π(y)/π(x)}.
◮ Example: consider the N(0, 1) target distribution, i.e. π(x) ∝ exp(−x²/2),
◮ and consider N(x, σ²) proposals.
[Figure: proposal densities centred at the current value, plotted with the target density.]
◮ The proposal depends on where you are, the current value x.
◮ Need to tune σ² to have about 15–40% acceptance of proposals.
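A minimal random-walk Metropolis sketch for the N(0, 1) target. The proposal standard deviation sigma = 5.0 is a tuning choice made here for illustration (moderate acceptance for this target); only log π up to a constant is needed:

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_metropolis(logpi, x0, sigma, n_iter, rng):
    """Random-walk Metropolis: propose y = x + N(0, sigma^2),
    accept with probability min(1, pi(y)/pi(x))."""
    x = x0
    out = np.empty(n_iter)
    n_acc = 0
    for t in range(n_iter):
        y = x + sigma * rng.standard_normal()
        if np.log(rng.uniform()) < logpi(y) - logpi(x):
            x = y
            n_acc += 1
        out[t] = x
    return out, n_acc / n_iter

# Target N(0, 1): log pi(x) = -x^2/2 up to an (unneeded) constant.
chain, acc = rw_metropolis(lambda x: -0.5 * x * x,
                           x0=0.0, sigma=5.0, n_iter=50_000, rng=rng)
```

Rerunning with very small or very large sigma shows the trade-off: tiny steps are almost always accepted but move slowly, huge steps are rarely accepted; hence the tuning advice above.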
◮ Proposal does not depend on the current value, x: q(y|x) = q(y). This is the independence sampler.
◮ So α(x, y) = min{1, [π(y)q(x)] / [π(x)q(y)]}, a simpler form.
◮ Beware: independence samplers are either very good or very bad.
◮ Tails of q(y) must be heavier than the tails of π(x) for the sampler to work well.
[Figure: a heavy-tailed proposal density plotted against the target density.]
◮ Let Y1, . . . , Yn iid N(θ, 1) with Cauchy prior π(θ) = 1/{π(1 + θ²)}.
◮ Posterior:
π(θ|y) ∝ exp{−n(θ − ȳ)²/2} · 1/(1 + θ²).
◮ Suppose n = 20, with ȳ the observed sample mean.
◮ Here the support is the real line, i.e. −∞ < x < ∞.
◮ Suppose we want E(θ|y) = E(X) = ∫_{−∞}^{∞} x π(x) dx.
◮ No analytic answer. We do not even know the normalising constant that makes ∫_{−∞}^{∞} π(x) dx = 1.
◮ We can approximate E(X) very accurately by Monte Carlo.
◮ Let us try to find this using the independence sampler.
◮ We can sample from the Cauchy distribution easily.
◮ Try the Cauchy distribution as the proposal, i.e. q(y) = 1/{π(1 + y²)}, the proposal distribution.
[Figure: sampler output over 1000 iterations.]
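The independence sampler for this example is short, because with a Cauchy proposal the factor 1/(1 + θ²) cancels in the acceptance ratio, leaving only the likelihood term exp{−n(θ − ȳ)²/2}. A Python sketch (the slides do not give ȳ, so ybar = 1.0 is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(2)

n, ybar = 20, 1.0              # ybar is hypothetical; slides' value not shown
N = 50_000

# log of target/proposal ratio: the Cauchy factors cancel, leaving
# log pi(th|y) - log q(th) = -n*(th - ybar)^2 / 2 (up to a constant).
def log_w(th):
    return -0.5 * n * (th - ybar) ** 2

x = 0.0
draws = np.empty(N)
for t in range(N):
    y = rng.standard_cauchy()                  # propose from the prior
    if np.log(rng.uniform()) < log_w(y) - log_w(x):
        x = y
    draws[t] = x

post_mean = draws[N // 4:].mean()              # drop first quarter as burn-in
```

The Cauchy proposal has heavier tails than the posterior, which is exactly the condition noted above for the independence sampler to behave well.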
◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors
◮ Suppose that x = (x1, x2, . . . , xk) is k-dimensional.
◮ The Gibbs sampler (Gelfand and Smith, 1990) uses what are called the full conditional distributions: π(xi | x1, . . . , xi−1, xi+1, . . . , xk).
◮ Note that the full conditional distribution is proportional to the joint distribution: π(xi | x−i) ∝ π(x).
◮ Often this helps in finding it, e.g. just look at the terms in the joint density that involve xi.
◮ We had the joint posterior density:
π(µ, τ|y) ∝ τ^{n/2 + a − 1} exp{−(τ/2) Σ_{i=1}^{n}(y_i − µ)² − bτ}.
◮ Easy to see:
τ | µ, y ∼ Gamma(n/2 + a, b + (1/2) Σ_{i=1}^{n}(y_i − µ)²).
◮ It can be shown that
Σ_{i=1}^{n}(y_i − µ)² = n(µ − ȳ)² + Σ_{i=1}^{n}(y_i − ȳ)².
◮ Hence we can easily deduce that:
µ | τ, y ∼ N(ȳ, 1/(nτ)).
◮ One cycle of the Gibbs sampler updates each component in turn from its full conditional:
x1^(t+1) ∼ π(x1 | x2^(t), x3^(t), · · · , xk^(t))
x2^(t+1) ∼ π(x2 | x1^(t+1), x3^(t), · · · , xk^(t))
x3^(t+1) ∼ π(x3 | x1^(t+1), x2^(t+1), x4^(t), · · · )
· · ·
xk^(t+1) ∼ π(xk | x1^(t+1), x2^(t+1), · · · , x(k−1)^(t+1)).
◮ Always use the most recent values.
◮ The order of updating matters, but often not too much in practice.
◮ The full conditional distributions are also called complete conditionals.
[Figure: zig-zag path of successive Gibbs iterates x^(0), x^(1), x^(2), x^(3), x^(4) in the (x1, x2) plane.]
◮ We must be able to sample from the full conditional distributions.
◮ In Example 2 the two full conditional distributions are standard (normal and gamma), so sampling is easy.
◮ In real-life problems, full conditionals often have non-standard forms.
◮ For (nearly) log-concave univariate densities, use adaptive rejection sampling.
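The two full conditionals of Example 2 give this Gibbs sampler sketch. The slides' 20 observations are not reproduced here, so the data are simulated as a stand-in, and the hyperparameters a = b = 0.01 are an illustrative vague choice:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in data (the slides' 20 observations are not listed).
y = rng.normal(loc=2.0, scale=1.5, size=20)
n, ybar = len(y), y.mean()
Syy = ((y - ybar) ** 2).sum()
a, b = 0.01, 0.01                      # vague Gamma(a, b) prior for tau

N = 20_000
mu, tau = ybar, 1.0                    # starting values
mus, taus = np.empty(N), np.empty(N)
for t in range(N):
    # mu | tau, y ~ N(ybar, 1/(n*tau))
    mu = rng.normal(ybar, 1.0 / np.sqrt(n * tau))
    # tau | mu, y ~ Gamma(n/2 + a, rate = b + (Syy + n*(mu-ybar)^2)/2),
    # using sum(y_i - mu)^2 = n*(mu - ybar)^2 + Syy.
    rate = b + 0.5 * (Syy + n * (mu - ybar) ** 2)
    tau = rng.gamma(shape=n / 2 + a, scale=1.0 / rate)
    mus[t], taus[t] = mu, tau

post_mean_mu = mus[N // 4:].mean()     # drop first quarter as burn-in
```

Note that NumPy's `gamma` is parameterised by shape and scale, so the rate parameter b + Σ(y_i − µ)²/2 enters as its reciprocal.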
◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors
◮ Several parallel chains (Gelman and Rubin, 1992):
◮ give an indication of convergence,
◮ and a sense of statistical security.
◮ One very long run (Geyer, 1992):
◮ reaches parts of the parameter space that the other scheme may never reach.
◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors
◮ Early iterations x^(1), . . . , x^(M) reflect the starting value x^(0).
◮ These iterations are called burn-in.
◮ After the burn-in, we say the chain has 'converged'.
◮ Discard the burn-in iterations and form the delayed estimator:
h̄ = 1/(N − M) Σ_{t=M+1}^{N} h(x^(t)).
◮ Methods for determining M are called convergence diagnostics.
◮ Must do:
◮ Plot the time series for each quantity of interest.
◮ Plot the auto-correlation functions.
◮ If not satisfied, try some other diagnostics from the literature.
◮ See for example: Gelman and Rubin (1992), and the review by Robert.
◮ But realise that you cannot prove that you have converged.
◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors
◮ Suppose we decide to run the chain until iteration N.
◮ For a given run length N, how can we estimate the nse? One simple approach is the method of batch means:
◮ divide the sequence into k batches of equal length,
◮ calculate the mean b_j for each batch j,
◮ check that the batch means are approximately uncorrelated.
◮ Use at least 20 batches.
◮ Estimate the lag-1 autocorrelation of the sequence {b_j}.
◮ If the auto-correlation is high, a longer run should be used; otherwise estimate nse(h̄) by s_b/√k, where s_b is the standard deviation of the batch means.
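The batch-means recipe above can be sketched as follows, using the AR(1) chain X^(t+1) = 0.5 x^(t) + N(0, 1) from earlier as the input sequence (its stationary variance is 4/3 and lag-1 correlation 0.5, so the true nse of the mean is about 2/√T):

```python
import numpy as np

rng = np.random.default_rng(4)

def batch_means_nse(x, n_batches=20):
    """nse of the mean of a stationary sequence via batch means."""
    m = len(x) // n_batches
    batches = x[: n_batches * m].reshape(n_batches, m).mean(axis=1)
    # lag-1 autocorrelation of batch means: should be near 0 if batches
    # are long enough for the means to be roughly uncorrelated
    lag1 = np.corrcoef(batches[:-1], batches[1:])[0, 1]
    nse = batches.std(ddof=1) / np.sqrt(n_batches)
    return nse, lag1

# Test sequence: the AR(1) chain from the Markov-chain example.
T = 100_000
x = np.empty(T)
x[0] = 0.0
for t in range(T - 1):
    x[t + 1] = 0.5 * x[t] + rng.standard_normal()

nse, lag1 = batch_means_nse(x)
# Theory: Var(x_bar) ≈ (4/3)(1 + rho)/((1 - rho) T) = 4/T, so nse ≈ 2/sqrt(T).
```

If `lag1` came out large, the batches are too short relative to the chain's correlation length and a longer run (or longer batches) is needed, exactly as the bullet above advises.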
◮ Let Syy = Σ_{i=1}^{n}(y_i − ȳ)².
◮ From n = 20 observations, we have observed values of ȳ and Syy.
◮ Take N = 10000, M = N/4 (i.e. discard the first 2500 iterations as burn-in).
◮ We shall verify these numbers in the lab session.
◮ Let Y1, . . . , Yn be i.i.d. N(θ, 1) with Cauchy prior π(θ) = 1/{π(1 + θ²)}. Recall that the posterior distribution can be written as
π(θ|y) ∝ exp{−n(θ − ȳ)²/2} · 1/(1 + θ²).