

  1. Centre for Research in Statistical Methodology
  http://go.warwick.ac.uk/crism
  • Conferences and workshops (including general calls for workshops to be organised primarily outside Warwick; calls every 6 months, next in Summer 2014)
  • Research Fellow positions: next advertising 2 positions around February 2014
  • PhD studentships
  • Academic visitor programme.

  2. From Peskun Ordering to Optimal Simulated Tempering
  Gareth Roberts, University of Warwick
  MCMSki, Chamonix, January 2014
  Mainly joint work with Jeffrey Rosenthal, but with aspects of joint work with Yves Atchade.

  3. Plan for talk
  1. Why 0.234 is natural in many problems
  2. Comparisons of algorithms based on their diffusion limits; links to Peskun ordering
  3. A heterogeneous scaling problem: spacing of temperatures in simulated tempering
  4. Local 0.234 story for simulated tempering
  5. Conclusions

  4. Metropolis-Hastings algorithm
  Given a target density π(·) that we wish to sample from, and a Markov chain transition kernel density q(·, ·), we construct a Markov chain as follows. Given X_n, generate Y_{n+1} from q(X_n, ·). Now set X_{n+1} = Y_{n+1} with probability
  α(X_n, Y_{n+1}) = 1 ∧ [π(Y_{n+1}) q(Y_{n+1}, X_n)] / [π(X_n) q(X_n, Y_{n+1})].
  Otherwise set X_{n+1} = X_n.
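For concreteness, here is a minimal Python sketch of this recipe (the function names, the generic propose/log_q interface, and the use of log densities for numerical stability are our own choices, not part of the slides):

```python
import numpy as np

def metropolis_hastings(log_pi, propose, log_q, x0, n_iter, seed=0):
    """Generic Metropolis-Hastings sampler (sketch).

    log_pi  : log target density, known up to an additive constant
    propose : (x, rng) -> candidate Y drawn from q(x, .)
    log_q   : (x, y)   -> log q(x, y), the proposal density
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for n in range(n_iter):
        y = propose(x, rng)
        # log of pi(y) q(y, x) / (pi(x) q(x, y)); accept with prob 1 ∧ ratio
        log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain[n] = x
    return chain
```

Working on the log scale avoids overflow and underflow when the density ratio is extreme.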

  5. Two first scaling problems
  • RWM: q(x, y) = q(|y − x|), so the acceptance probability simplifies to α(x, y) = 1 ∧ π(y)/π(x). For example Y ∼ MVN_d(x, σ² I_d), but also more generally.
  • MALA: Y ∼ MVN( x^(k) + h V ∇ log π(x^(k)) / 2, h V ).
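The two proposal mechanisms can each be written in a few lines; a sketch assuming a Gaussian random walk for RWM and taking the preconditioning matrix V = I for MALA (both simplifying assumptions of ours):

```python
import numpy as np

def rwm_propose(x, sigma, rng):
    # Symmetric proposal, q(x, y) = q(|y - x|): the q-ratio cancels,
    # leaving alpha(x, y) = min(1, pi(y) / pi(x)).
    return x + sigma * rng.standard_normal(x.shape)

def mala_propose(x, h, grad_log_pi, rng):
    # Y ~ MVN(x + h * grad log pi(x) / 2, h I), i.e. the slide's proposal
    # with V taken to be the identity matrix.
    return x + 0.5 * h * grad_log_pi(x) + np.sqrt(h) * rng.standard_normal(x.shape)
```

Unlike RWM, the MALA proposal is asymmetric, so the q-ratio in the Metropolis-Hastings acceptance probability does not cancel.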

  6. The Goldilocks dilemma
  [Figure: a grid of random walk Metropolis trace plots for proposal scalings σ ranging from 0.01 to 100, each panel annotated with a diagnostic value; mixing is poor when σ is too small or too large, and best at intermediate scalings.]

  7. Scaling problems and diffusion limits
  Choosing σ in the above algorithms to optimise efficiency: for ‘appropriate choices’ the d-dimensional algorithm has a limit which is a diffusion. The faster the diffusion the better!
  • How should σ_d depend on d for large d?
  • What does this tell us about the efficiency of the algorithm?
  • Can we optimise σ_d in some sensible way?
  • Can we characterise optimal (or close to optimal) values of σ_d in terms of observable properties of the Markov chain?
  For RWM and MALA (and some other local algorithms), and for some simple classes of target distributions, a solution to the above can be obtained by considering a diffusion limit (for high-dimensional problems).
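One celebrated answer to the last question, for RWM in high dimensions, is to tune the proposal scale so that the average acceptance rate is roughly 0.234. A sketch of such tuning by stochastic approximation (the Robbins-Monro adaptation scheme and the step-size schedule here are standard choices of ours, not prescribed by the slides):

```python
import numpy as np

def tune_rwm_scale(log_pi, x0, n_iter, target=0.234, seed=0):
    """Adapt the RWM proposal scale so the acceptance rate approaches `target`."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    log_sigma = 0.0
    for n in range(1, n_iter + 1):
        y = x + np.exp(log_sigma) * rng.standard_normal(x.shape)
        alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))
        if rng.uniform() < alpha:
            x = y
        # Robbins-Monro step: raise sigma when accepting too often, lower it otherwise
        log_sigma += (alpha - target) / n**0.6
    return np.exp(log_sigma)
```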

  8. Simulated tempering
  Consider a d-dimensional target density f_d, and suppose it is possible to construct MCMC on f_{d,β} = f_d^β, 0 ≤ β ≤ 1. This typically mixes better for small β; however, we are interested in f_{d,1}.
  Problem: choose a finite collection of inverse temperatures, B = {β_i}, such that we can construct a Markov chain on R^d × B which “optimally” permits the exploration of f_{d,1}.
  This is also a scaling problem: choosing how large to make β_i − β_{i−1} for each i.
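A sketch of the resulting algorithm on the augmented space R^d × B, with joint target proportional to c_i f_d(x)^{β_i}. The pseudo-prior weights log_c are assumed supplied (ideally log_c[i] is near minus the log normalising constant at β_i, and choosing them well is itself nontrivial), and the within-temperature move is a plain RWM step:

```python
import numpy as np

def simulated_tempering(log_f, betas, log_c, x0, n_iter, sigma=1.0, seed=0):
    """Sketch of simulated tempering: a Markov chain on (x, i).

    betas is sorted increasingly with betas[-1] = 1, the inverse
    temperature of interest; log_c are given pseudo-prior weights.
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    i = 0                                  # start at the hottest temperature
    out = []
    for _ in range(n_iter):
        # Within-temperature RWM move targeting f^{beta_i}
        y = x + sigma * rng.standard_normal(x.shape)
        if np.log(rng.uniform()) < betas[i] * (log_f(y) - log_f(x)):
            x = y
        # Symmetric proposal to a neighbouring inverse temperature
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(betas):
            log_a = (betas[j] - betas[i]) * log_f(x) + log_c[j] - log_c[i]
            if np.log(rng.uniform()) < log_a:
                i = j
        out.append((i, x.copy()))
    return out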

  9. What is “efficiency”?
  Let X be a Markov chain. Then for a π-integrable function g, efficiency can be described by
  σ²(g, P) = lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i) ).
  Under weak(ish) regularity conditions,
  σ²(g, P) = Var_π(g) + 2 Σ_{i=1}^∞ Cov_π( g(X_0), g(X_i) ).
  In general the relative efficiency between two possible Markov chains varies depending on which function of interest g is being considered. As d → ∞ the dependence on g disappears, at least in cases where we have a diffusion limit, as we will see...
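In practice σ²(g, P) is usually estimated from a single run; a batch-means sketch (a standard estimator, not one the slides prescribe):

```python
import numpy as np

def batch_means_variance(g_values, n_batches=30):
    """Batch-means estimate of sigma^2(g, P) = lim_n n Var(sample mean)."""
    g = np.asarray(g_values, dtype=float)
    n = (len(g) // n_batches) * n_batches        # drop the ragged tail
    batches = g[:n].reshape(n_batches, -1)
    batch_size = batches.shape[1]
    # Var of the batch means is roughly sigma^2 / batch_size, so rescale
    return batch_size * np.var(batches.mean(axis=1), ddof=1)
```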

  10. How do we measure “efficiency” efficiently?
  It is well established that estimating the limiting variance is hard.
  “It’s easy, just measure ESJD instead!” (Andrew Gelman, 1993)
  ESJD = E[ (X_{t+1} − X_t)² ]
  Why is this a good idea? Optimising this is just like considering only linear functions g and ignoring all but the first term in
  Σ_{i=1}^∞ Cov_π( g(X_0), g(X_i) ).
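ESJD, by contrast, is trivial to estimate from chain output; a sketch for a possibly multivariate chain (summing squared jumps across coordinates, one common convention):

```python
import numpy as np

def esjd(chain):
    """Average squared jump, estimating E[(X_{t+1} - X_t)^2]."""
    x = np.asarray(chain, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                  # treat a scalar chain as d = 1
    steps = np.diff(x, axis=0)          # X_{t+1} - X_t for each t
    return np.mean(np.sum(steps**2, axis=1))
```

Rejected moves contribute zero to ESJD, so it directly penalises both over-cautious and over-ambitious proposal scalings.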

  11. Diffusion limits: a framework for studying algorithm optimality
  Many MCMC algorithms are well approximated by diffusions (usually in the sense of diffusion limit results for high-dimensional algorithms, or other limiting regimes). Examples include random walk Metropolis, various versions of the Langevin algorithm, Gibbs samplers, and simulated tempering.
  This provides a natural framework for studying algorithm complexity and optimisation. But why is the comparison of diffusion limits any easier than comparing the finite-dimensional algorithms?

  12. MCMC sample paths and diffusions
  [Figure: trace plot of an MCMC sample path over 10,000 iterations, resembling a diffusion sample path.]
  Here ESJD is the quadratic variation
  lim_{ε→0} Σ_{i=1}^{⌊t ε⁻¹⌋} ( X_{iε} − X_{(i−1)ε} )².

  13. Diffusions
  A d-dimensional diffusion is a continuous-time strong Markov process with continuous sample paths. We can define a diffusion as the solution of the Stochastic Differential Equation (SDE)
  dX_t = μ(X_t) dt + σ(X_t) dB_t,
  where B denotes d-dimensional Brownian motion, σ is a d × d matrix and μ is a d-vector.
  A diffusion is often understood intuitively and constructively via its dynamics over small time intervals. Approximately, for small h,
  X_{t+h} | X_t = x_t ≈ x_t + h μ(x_t) + h^{1/2} σ(x_t) Z,
  where Z is a d-dimensional standard normal random variable.
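The small-h dynamics above are exactly the Euler-Maruyama scheme, so a diffusion can be simulated approximately as follows (a sketch assuming a vector-valued state and a matrix-valued σ(x), with the usual discretisation error):

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, t_end, h, seed=0):
    """Simulate dX_t = mu(X_t) dt + sigma(X_t) dB_t by iterating
    the small-h dynamics displayed above."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    path = [x.copy()]
    for _ in range(int(t_end / h)):
        z = rng.standard_normal(x.shape)    # d-dimensional standard normal
        x = x + h * mu(x) + np.sqrt(h) * (sigma(x) @ z)
        path.append(x.copy())
    return np.array(path)
```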

  14. “Efficiency” for diffusions
  Consider two Langevin diffusions, both with π invariant: for h_1 < h_2,
  dX_t^i = h_i^{1/2} dB_t + h_i ∇ log π(X_t^i) dt / 2,   i = 1, 2.
  [Figure: thinned trace plots of the two diffusions.]
  X^2 is a “speeded-up” version of X^1. But how can we compare diffusions which have non-constant diffusion coefficient?

  15. Peskun ordering
  Peskun (1973). Many uses in MCMC theory (e.g. see work by Mira, Tierney and co-workers).
  P_1 and P_2 are two Markov chain kernels with invariant distribution π. We say P_2 dominates P_1 in the Peskun sense, and write P_2 ⪰ P_1, if for all x and all sets A not containing x,
  P_1(x, A) ≤ P_2(x, A).
  Peskun ordering implies an ordering on asymptotic variances of ergodic estimates:
  lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i^{(1)}) ) ≥ lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i^{(2)}) ),
  where X^{(i)} moves according to P_i.
  Surprisingly many applications in MCMC! But are proposal scaling problems completely incompatible with Peskun??
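On a finite state space both halves of this slide can be checked numerically: the Peskun condition is an entrywise comparison of off-diagonal transition probabilities, and σ²(g, P) has a closed form via the fundamental matrix. A toy sketch (finite-state, so an illustration of the theorem rather than the continuous-state setting of the talk):

```python
import numpy as np

def peskun_dominates(P2, P1):
    """Check P2(x, A) >= P1(x, A) for all A not containing x,
    i.e. an entrywise off-diagonal comparison."""
    off = ~np.eye(P1.shape[0], dtype=bool)
    return bool(np.all(P2[off] >= P1[off]))

def asymptotic_variance(P, pi, g):
    """Exact sigma^2(g, P) = Var_pi(g) + 2 sum_k Cov_pi(g(X_0), g(X_k))
    for a finite ergodic chain, via the fundamental matrix Z."""
    pi, g = np.asarray(pi, float), np.asarray(g, float)
    g = g - pi @ g                                    # centre g under pi
    Z = np.linalg.inv(np.eye(len(pi)) - P + np.outer(np.ones(len(pi)), pi))
    return pi @ (g * (2.0 * (Z @ g) - g))
```

If peskun_dominates(P2, P1) holds, then asymptotic_variance(P2, pi, g) ≤ asymptotic_variance(P1, pi, g) for every g, which is exactly the displayed variance ordering.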

  16. Peskun ordering in continuous time
  Peskun ordering of P_1 and P_2 does not imply, nor is it implied by, ordering of the 2-step transition kernels. How about continuous time? (See work of Leisen and Mira, 2008.)
  But diffusions satisfy P(x, {x}) = 0, so there can never be any interesting Peskun orderings in the original sense.
  HOWEVER, MANY discrete-time processes possess the SAME diffusion limit. Maybe we can consider the limit along sequences of chains which do satisfy Peskun ordering.

  17. A more powerful diffusion comparison result
  Consider two Langevin diffusions, both with stationary distribution π:
  dX_t^i = h_i(X_t^i)^{1/2} dB_t + V_i(X_t^i) dt,   i = 1, 2,
  with h_1(x) ≤ h_2(x) for all x. (Here V_i(x) = ( h_i(x) ∇ log π(x) + h_i′(x) ) / 2.)
  Under regularity conditions on the tails of π (which have to decay exponentially, or π needs to have bounded support), X^2 dominates X^1 in the covariance ordering sense:
  lim_{t→∞} t Var( (1/t) ∫_0^t g(X_s^1) ds ) ≥ lim_{t→∞} t Var( (1/t) ∫_0^t g(X_s^2) ds ).

  18. Diffusion comparison result (ctd)
  Proof by finding suitable Peskun-ordered birth-and-death processes with the given diffusion limit. A dominated convergence argument is needed to extend the covariance ordering to the limiting diffusions.
  The argument can be extended to give an approximate covariance ordering for ANY processes with these respective limits.
