Introduction to MCMC and BUGS


1. Introduction to MCMC and BUGS
◮ Basic recipes, and a sample of some techniques for getting started.
◮ Two simple worked-out examples.
◮ R code will be provided for those.
◮ No background in MCMC assumed.
◮ Not for experts!

2. MCMC: Reminder Slide
Outline:
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

3. Bayesian Inference
◮ Data: Y (realisation y).
◮ Parameters, latent variables: θ = (θ_1, θ_2, ..., θ_k).
Assume:
◮ Likelihood: f(y | θ). Distribution of data given parameters.
◮ Prior: π(θ). Prior distribution, before we observe data.
Inference is based on the joint posterior distribution:
$$\pi(\theta \mid y) = \frac{f(y \mid \theta)\,\pi(\theta)}{\int f(y \mid \theta)\,\pi(\theta)\,d\theta} \propto f(y \mid \theta)\,\pi(\theta),$$
i.e. Posterior ∝ Likelihood × Prior.

4. Example 1: Normal-Cauchy
◮ Suppose Y_1, ..., Y_n ~ iid N(θ, 1). We don't know θ.
◮ iid = independently and identically distributed.
◮ Assume a Cauchy prior distribution: π(θ) = 1 / {π(1 + θ²)}.
Posterior density:
$$\pi(\theta \mid y) \propto \exp\left\{-\frac{\sum_{i=1}^{n}(y_i - \theta)^2}{2}\right\} \times \frac{1}{1 + \theta^2} \propto \exp\left\{-\frac{n(\theta - \bar{y})^2}{2}\right\} \times \frac{1}{1 + \theta^2} \quad \text{(can be shown)}.$$
Things of interest to Bayesians:
◮ Posterior mean = E(θ | y) = $\int_{-\infty}^{\infty} \theta\,\pi(\theta \mid y)\,d\theta$.
◮ Posterior variance = Var(θ | y) = E(θ² | y) − {E(θ | y)}².
◮ Credible interval: {a(y), b(y)} for θ such that Pr{a(y) < θ < b(y) | y} = 0.95, for example.
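Because θ is one-dimensional here, these quantities can be approximated by brute-force numerical integration on a grid. A minimal R sketch, assuming simulated data y (the true θ = 1, the sample size n = 20, and the grid limits are illustrative choices, not from the slides):

```r
# Grid approximation of the Normal-Cauchy posterior of Example 1.
set.seed(1)
n <- 20
y <- rnorm(n, mean = 1, sd = 1)            # illustrative data, true theta = 1

step  <- 0.001
theta <- seq(-10, 10, by = step)           # grid over theta
log_post <- -n * (theta - mean(y))^2 / 2 - log(1 + theta^2)
post <- exp(log_post - max(log_post))      # unnormalised, numerically stable
post <- post / sum(post * step)            # normalise on the grid

post_mean <- sum(theta * post * step)                  # E(theta | y)
post_var  <- sum(theta^2 * post * step) - post_mean^2  # Var(theta | y)

# 95% credible interval from the grid CDF
cdf <- cumsum(post * step)
ci  <- c(theta[which.min(abs(cdf - 0.025))],
         theta[which.min(abs(cdf - 0.975))])
c(mean = post_mean, var = post_var, lower = ci[1], upper = ci[2])
```

Grid integration works here only because θ is one-dimensional; the point of MCMC is to handle the cases where it does not.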

5. Example 2: One-sample normal
◮ Data: Y_1, ..., Y_n ~ iid N(μ, σ²).
◮ Let precision = τ = 1/σ².
◮ Assume a 'flat' prior distribution for μ, i.e. π(μ) = 1,
◮ and let τ ~ Gamma(a, b), a, b > 0, which has mean a/b. Then π(μ, τ) ∝ τ^{a−1} e^{−bτ}.
◮ Joint posterior density:
$$\pi(\mu, \tau \mid y) \propto \tau^{n/2 + a - 1} \exp\left\{-\frac{\tau \sum (y_i - \mu)^2}{2} - b\tau\right\}.$$
For making inference, we want quantities like:
◮ Posterior mean = E(μ | y) = $\int_{0}^{\infty} \int_{-\infty}^{\infty} \mu\,\pi(\mu, \tau \mid y)\,d\mu\,d\tau$.
◮ Posterior variance = Var(μ | y): horrible double integrals! Difficult to solve in general.
◮ Credible intervals involve even more of these!
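Although the integrals are hard, the unnormalised log posterior itself is easy to code. A small R sketch (the hyperparameters a = b = 0.01 and the simulated data are illustrative assumptions):

```r
# Unnormalised log joint posterior for (mu, tau) in Example 2:
# flat prior on mu, Gamma(a, b) prior on the precision tau.
log_post <- function(mu, tau, y, a = 0.01, b = 0.01) {
  if (tau <= 0) return(-Inf)               # tau must be positive
  n <- length(y)
  (n / 2 + a - 1) * log(tau) - tau * sum((y - mu)^2) / 2 - b * tau
}

set.seed(2)
y <- rnorm(30, mean = 2, sd = 3)           # illustrative data
log_post(mu = 2, tau = 1 / 9, y = y)       # evaluate at one (mu, tau) point
```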

6. MCMC: Reminder Slide
Outline:
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

7. Monte Carlo (MC) integration
◮ General problem: evaluate the expectation of a general function h(X) of the random variable X,
$$E_\pi[h(X)] = \int h(x)\,\pi(x)\,dx,$$
assuming that it exists, i.e. $\int |h(x)|\,\pi(x)\,dx < \infty$.
◮ This can be very difficult, but if we can draw samples X^(1), X^(2), ..., X^(N) ~ π(x), then we can estimate E_π[h(X)] by
$$\bar{h}_N = \frac{1}{N} \sum_{t=1}^{N} h\left(X^{(t)}\right).$$
◮ This is the basic idea of statistics: "Estimate unknowns by drawing samples."
◮ Change of notation: parameters θ ≡ x; posterior π(θ | y) = π(x). Replace y by its observed value and suppress it.
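A minimal R illustration of the estimator (my example, not the slides'): take π = N(0, 1) and h(x) = x², so the true value is E_π[h(X)] = 1.

```r
# Monte Carlo integration: estimate E[h(X)] by an average over draws.
set.seed(3)
N <- 10000
x <- rnorm(N)            # X^(1), ..., X^(N) ~ N(0, 1)
mean(x^2)                # MC estimate of E[X^2]; should be close to 1
```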

8. Consistency
◮ For independent samples, by the Law of Large Numbers (LLN),
$$\bar{h}_N = \frac{1}{N} \sum_{t=1}^{N} h\left(X^{(t)}\right) \to E_\pi[h(X)] \quad \text{as } N \to \infty. \qquad (1)$$
◮ But independent sampling from π(x) is usually difficult.
◮ It turns out that (1), the LLN, still applies if we generate the samples using a Markov chain. But first, some revision of Markov chains.
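To watch (1) at work, track the running average as N grows (a sketch continuing the illustrative N(0, 1), h(x) = x² example above):

```r
# Running Monte Carlo averages settle down to E[h(X)] = 1 as N grows.
set.seed(4)
x <- rnorm(10000)
running <- cumsum(x^2) / seq_along(x)      # h-bar_N for N = 1, 2, ...
running[c(10, 100, 1000, 10000)]           # drifts towards 1
```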

9. MCMC: Reminder Slide
Outline:
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

10. Markov chains
◮ A Markov chain is generated by sampling
$$X^{(t+1)} \sim p\left(x \mid x^{(t)}\right), \quad t = 0, 1, 2, \ldots$$
Here p is called the transition kernel.
◮ So X^(t+1) depends only on X^(t), not on X^(0), X^(1), ..., X^(t−1):
$$p\left(X^{(t+1)} \mid x^{(t)}, x^{(t-1)}, \ldots\right) = p\left(X^{(t+1)} \mid x^{(t)}\right).$$
◮ For example: X^(t+1) | x^(t) ~ N(0.5 x^(t), 1.0).
◮ This is called a first-order autoregressive process with lag-1 autocorrelation 0.5.

11.
◮ Simulation of the chain X^(t+1) | x^(t) ~ N(0.5 x^(t), 1.0), with two different starting points x^(0) (one red and one blue).
[Figure: trace plots of the two chains over 60 iterations, values ranging roughly from −4 to 4.]
◮ After about 5-7 iterations the chains seem to have forgotten their starting positions.
◮ Caution: although p(X^(t+1) | x^(t)) = p(X^(t+1) | x^(t), x^(t−1)), this does not mean that X^(t+1) and X^(t−1) are marginally independent.
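The figure can be reproduced with a short R sketch (the seed and the starting values ±4 are my illustrative choices):

```r
# Simulate X^(t+1) | x^(t) ~ N(0.5 * x^(t), 1) from two starting points.
simulate_chain <- function(x0, n_iter = 60) {
  x <- numeric(n_iter + 1)
  x[1] <- x0
  for (t in 1:n_iter) x[t + 1] <- rnorm(1, mean = 0.5 * x[t], sd = 1)
  x
}
set.seed(5)
chain_red  <- simulate_chain(x0 =  4)
chain_blue <- simulate_chain(x0 = -4)
plot(chain_red, type = "l", col = "red", ylim = c(-5, 5),
     xlab = "iteration", ylab = "x")
lines(chain_blue, col = "blue")   # both forget their start within a few steps
```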

12. Stationarity
As t → ∞, the Markov chain converges (in distribution) to its stationary (or invariant) distribution.
◮ In the above example it can be proved that
$$X^{(t)} \mid x^{(0)} \sim N(0.0,\, 1.33) \quad \text{as } t \to \infty,$$
which does not depend on x^(0).
◮ Does this happen for all Markov chains?
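This limit is easy to check empirically (a sketch; note 1.33 ≈ 1/(1 − 0.5²), the stationary variance of this AR(1) chain):

```r
# Empirical check of the stationary distribution N(0, 1.33).
set.seed(6)
x <- numeric(100000)
for (t in 2:length(x)) x[t] <- rnorm(1, mean = 0.5 * x[t - 1], sd = 1)
x <- x[-(1:100)]                   # drop a short initial stretch
c(mean = mean(x), var = var(x))    # roughly 0 and 1.33
```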

13. Irreducibility
◮ Assuming that a stationary distribution exists, it is unique if the chain is irreducible.
◮ Irreducible means any set of states (values) can be reached from any other state (value) in a finite number of moves.
◮ An example of a reducible Markov chain: suppose p(x | y) = 0 for x ∈ A and y ∈ B, and vice versa.
[Diagram: two disjoint sets A and B inside a rectangle.]
◮ Here the whole set of values is the rectangle.

14. Aperiodicity
A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor (g.c.d.) of the return times to any particular state, i say, is 1.
◮ Think of recording the number of steps taken to return to the state 1. The g.c.d. of those numbers should be 1.
◮ If the g.c.d. is bigger than 1, say 2, then the chain will return in cycles of 2, 4, 6, ... steps. This is not allowed for aperiodicity.
◮ The definition can be extended to general state-space cases (where the Markov chain can take any value in a continuous interval).
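As a small illustration (mine, not the slides'): a two-state chain that always flips is periodic, since every return time to state 1 equals 2 and the g.c.d. is therefore 2, not 1.

```r
# A deterministic two-state flip chain: period 2, hence not aperiodic.
n_iter <- 1000
state <- numeric(n_iter)
state[1] <- 1
for (t in 2:n_iter) state[t] <- 3 - state[t - 1]   # 1 -> 2 -> 1 -> ...
return_times <- diff(which(state == 1))            # steps between visits to 1
unique(return_times)                               # all equal to 2
```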

15. Ergodicity
Suppose that the Markov chain has the stationary distribution π(x) and is aperiodic and irreducible. Then it is said to be ergodic.
◮ Ergodic theorem:
$$\bar{h}_N = \frac{1}{N} \sum_{t=1}^{N} h\left(X^{(t)}\right) \to E_\pi[h(X)] \quad \text{as } N \to \infty.$$
◮ h̄_N is called an ergodic average.
◮ If σ²_h = Var_π[h(X)] < ∞, then the central limit theorem holds, i.e. h̄_N converges to a normal distribution,
◮ and this convergence occurs geometrically fast.
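So the Markov chain itself can replace independent sampling. A sketch using the illustrative AR(1) chain above with h(x) = x²: at stationarity E_π[h(X)] = 1.33, and the ergodic average finds it.

```r
# Ergodic averaging over a Markov chain instead of independent draws.
set.seed(7)
N <- 50000
x <- numeric(N)
for (t in 2:N) x[t] <- rnorm(1, mean = 0.5 * x[t - 1], sd = 1)
mean(x^2)     # ergodic average of h(x) = x^2; approximately 1.33
```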

16. Numerical standard errors (nse)
The nse of h̄_N is $\sqrt{\mathrm{Var}_\pi(\bar{h}_N)}$, and for large N
$$\mathrm{nse}\left(\bar{h}_N\right) \approx \sqrt{\frac{\sigma_h^2}{N} \left(1 + 2 \sum_{\ell=1}^{N-1} \rho_\ell(h)\right)},$$
where ρ_ℓ(h) is the lag-ℓ autocorrelation in {h(X^(t))}.
◮ In general no simpler expression exists for the nse.
◮ See Geyer (1992), and Besag and Green (1993) for ideas and further references.

17. Numerical standard errors (nse) ...
◮ If {h(X^(t))} can be approximated as a first-order autoregressive process, then
$$\mathrm{nse}\left(\bar{h}_N\right) \approx \sqrt{\frac{\sigma_h^2}{N} \cdot \frac{1 + \rho}{1 - \rho}},$$
where ρ is the lag-1 autocorrelation of {h(X^(t))}.
◮ The first factor is the usual term under independent sampling.
◮ The second factor is usually > 1,
◮ and is thus the penalty to be paid for using a Markov chain instead of independent sampling.
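Under this AR(1) approximation the nse takes a few lines of R (a sketch; the chain and h(x) = x² are the illustrative ones used earlier, with ρ estimated by the sample lag-1 autocorrelation):

```r
# nse of an ergodic average under the first-order autoregressive
# approximation: sqrt(sigma_h^2 / N * (1 + rho) / (1 - rho)).
set.seed(8)
N <- 50000
x <- numeric(N)
for (t in 2:N) x[t] <- rnorm(1, mean = 0.5 * x[t - 1], sd = 1)
h <- x^2
rho <- acf(h, lag.max = 1, plot = FALSE)$acf[2]  # sample lag-1 autocorrelation
nse <- sqrt(var(h) / N * (1 + rho) / (1 - rho))
c(estimate = mean(h), nse = nse)
```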

18. Numerical standard errors (nse) ...
Moreover,
◮ the nse may not be finite in general;
◮ the nse is finite if the chain converges geometrically;
◮ if the nse is finite, then we can make it as small as we like by increasing N;
◮ it can be proved that the following simple-minded estimator of the nse is not consistent:
$$\widehat{\mathrm{nse}} = \sqrt{\frac{\sigma_h^2}{N} \left(1 + 2 \sum_{\ell=1}^{N-1} r_\ell(h)\right)},$$
where r_ℓ(h) is the sample lag-ℓ autocorrelation in {h(x^(t))}.
We will return to the estimation of the nse later.

19. Markov chains – summary
◮ A Markov chain may have a stationary distribution.
◮ The stationary distribution is unique if the chain is irreducible.
◮ We can estimate nse's if the chain is also geometrically convergent.
Where does this all get us?

20. MCMC: Reminder Slide
Outline:
◮ Motivation
◮ Monte Carlo integration
◮ Markov chains
◮ MCMC

21. MCMC
◮ How do we construct a Markov chain whose stationary distribution is our posterior distribution, π(x)?
◮ Metropolis et al. (1953) showed us how.
◮ The method was generalized by Hastings (1970). This is called Markov chain Monte Carlo (MCMC).

22. Metropolis-Hastings algorithm
At each iteration t:
Step 1: Sample a "candidate" point y from a "proposal" distribution: y ~ q(y | x^(t)).
Step 2: Then with probability
$$\alpha\left(x^{(t)}, y\right) = \min\left\{1,\; \frac{\pi(y)\, q\left(x^{(t)} \mid y\right)}{\pi\left(x^{(t)}\right)\, q\left(y \mid x^{(t)}\right)}\right\}$$
set x^(t+1) = y (acceptance). If not accepted, set x^(t+1) = x^(t) (rejection).
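A minimal R implementation of the algorithm for the Normal-Cauchy posterior of Example 1 (a sketch, not the presenters' code; the random-walk proposal, its step size 1, the starting value, and the burn-in length are my assumptions):

```r
# Random-walk Metropolis for the Normal-Cauchy posterior of Example 1.
# The proposal is symmetric, so q cancels in the acceptance ratio.
set.seed(9)
n <- 20
y <- rnorm(n, mean = 1, sd = 1)               # illustrative data
log_post <- function(theta) {
  -n * (theta - mean(y))^2 / 2 - log(1 + theta^2)
}

N <- 10000
theta <- numeric(N)
theta[1] <- 0                                 # starting value
for (t in 1:(N - 1)) {
  cand <- rnorm(1, mean = theta[t], sd = 1)   # Step 1: propose a candidate
  log_alpha <- log_post(cand) - log_post(theta[t])
  theta[t + 1] <- if (log(runif(1)) < log_alpha) cand else theta[t]
}                                             # Step 2: accept or reject

keep <- theta[-(1:1000)]                      # discard burn-in
c(mean = mean(keep), var = var(keep))         # posterior mean and variance
quantile(keep, c(0.025, 0.975))               # 95% credible interval
```

Working on the log scale avoids numerical underflow, and the grid-based answers from Example 1 give an independent check on these estimates.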

23. Metropolis-Hastings algorithm ...
◮ The normalising constant in π(x) is not required to run the algorithm; it cancels out in the ratio.
◮ If q(y | x) = π(y), then we obtain independent samples, and α(x, y) = 1 always.
◮ Usually q is chosen so that q(y | x) is easy to sample from.
◮ Theoretically, any density q(· | x) having the same support (range) should work. However, some q's are better than others. See later.
◮ The induced Markov chains have the desirable properties under mild conditions on π(x).
◮ π(x) is often called the target distribution. Recall the notation change from θ to x.

24. Implementing MCMC: Reminder Slide
◮ Flavours of Metropolis-Hastings
◮ Gibbs Sampler
◮ Number of Chains
◮ Burn-in and run length
◮ Numerical standard errors
