

  1. Centre for Research in Statistical Methodology
  http://go.warwick.ac.uk/crism
  • Conferences and workshops (including general calls for workshops to be organised primarily outside Warwick; calls every 6 months, next in Summer 2014)
  • Research Fellow positions: next advertising 2 positions around February 2014
  • PhD studentships
  • Academic visitor programme.

  2. From Peskun Ordering to Optimal Simulated Tempering
  Gareth Roberts, University of Warwick
  MCMSki, Chamonix, January 2014
  Mainly joint work with Jeffrey Rosenthal, but with aspects of joint work with Yves Atchade.

  3. Plan for talk
  1. Why 0.234 is natural in many problems
  2. Comparisons of algorithms based on their diffusion limits; links to Peskun ordering
  3. A heterogeneous scaling problem: spacing of temperatures in simulated tempering
  4. Local 0.234 story for simulated tempering
  5. Conclusions

  4. Metropolis-Hastings algorithm
  Given a target density π(·) that we wish to sample from, and a Markov chain transition kernel density q(·, ·), we construct a Markov chain as follows. Given X_n, generate Y_{n+1} from q(X_n, ·). Now set X_{n+1} = Y_{n+1} with probability
  α(X_n, Y_{n+1}) = 1 ∧ [π(Y_{n+1}) q(Y_{n+1}, X_n)] / [π(X_n) q(X_n, Y_{n+1})].
  Otherwise set X_{n+1} = X_n.
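For concreteness, here is a minimal Python sketch of this recipe (the function names, the generic propose/log_q interface, and the use of log densities for numerical stability are our own choices, not part of the slides):

```python
import numpy as np

def metropolis_hastings(log_pi, propose, log_q, x0, n_iter, seed=0):
    """Generic Metropolis-Hastings sampler (sketch).

    log_pi  : log target density, known up to an additive constant
    propose : (x, rng) -> candidate Y drawn from q(x, .)
    log_q   : (x, y)   -> log q(x, y), the proposal density
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    chain = np.empty((n_iter, x.size))
    for n in range(n_iter):
        y = propose(x, rng)
        # log of pi(y) q(y, x) / (pi(x) q(x, y)); accept with prob 1 ∧ ratio
        log_ratio = log_pi(y) + log_q(y, x) - log_pi(x) - log_q(x, y)
        if np.log(rng.uniform()) < log_ratio:
            x = y
        chain[n] = x
    return chain
```

Working on the log scale avoids overflow and underflow when the density ratio is extreme.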

  5. Two first scaling problems
  • RWM: q(x, y) = q(|y − x|), so the acceptance probability simplifies to α(x, y) = 1 ∧ π(y)/π(x). For example Y ∼ MVN_d(x, σ² I_d), but also more generally.
  • MALA: Y ∼ MVN( x^(k) + h V ∇ log π(x^(k)) / 2, h V ).
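The two proposal mechanisms can each be written in a few lines; a sketch assuming a Gaussian random walk for RWM and taking the preconditioning matrix V = I for MALA (both simplifying assumptions of ours):

```python
import numpy as np

def rwm_propose(x, sigma, rng):
    # Symmetric proposal, q(x, y) = q(|y - x|): the q-ratio cancels,
    # leaving alpha(x, y) = min(1, pi(y) / pi(x)).
    return x + sigma * rng.standard_normal(x.shape)

def mala_propose(x, h, grad_log_pi, rng):
    # Y ~ MVN(x + h * grad log pi(x) / 2, h I), i.e. the slide's proposal
    # with V taken to be the identity matrix.
    return x + 0.5 * h * grad_log_pi(x) + np.sqrt(h) * rng.standard_normal(x.shape)
```

Unlike RWM, the MALA proposal is asymmetric, so the q-ratio in the Metropolis-Hastings acceptance probability does not cancel.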

  6. The Goldilocks dilemma
  [Figure: a grid of random walk Metropolis trace plots for proposal scalings σ ranging from 0.01 to 100, each panel annotated with a diagnostic value; mixing is poor when σ is too small or too large, and best at intermediate scalings.]

  7. Scaling problems and diffusion limits
  Choosing σ in the above algorithms to optimise efficiency: for ‘appropriate choices’ the d-dimensional algorithm has a limit which is a diffusion. The faster the diffusion the better!
  • How should σ_d depend on d for large d?
  • What does this tell us about the efficiency of the algorithm?
  • Can we optimise σ_d in some sensible way?
  • Can we characterise optimal (or close to optimal) values of σ_d in terms of observable properties of the Markov chain?
  For RWM and MALA (and some other local algorithms), and for some simple classes of target distributions, a solution to the above can be obtained by considering a diffusion limit (for high-dimensional problems).
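One celebrated answer to the last question, for RWM in high dimensions, is to tune the proposal scale so that the average acceptance rate is roughly 0.234. A sketch of such tuning by stochastic approximation (the Robbins-Monro adaptation scheme and the step-size schedule here are standard choices of ours, not prescribed by the slides):

```python
import numpy as np

def tune_rwm_scale(log_pi, x0, n_iter, target=0.234, seed=0):
    """Adapt the RWM proposal scale so the acceptance rate approaches `target`."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    log_sigma = 0.0
    for n in range(1, n_iter + 1):
        y = x + np.exp(log_sigma) * rng.standard_normal(x.shape)
        alpha = min(1.0, np.exp(log_pi(y) - log_pi(x)))
        if rng.uniform() < alpha:
            x = y
        # Robbins-Monro step: raise sigma when accepting too often, lower it otherwise
        log_sigma += (alpha - target) / n**0.6
    return np.exp(log_sigma)
```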

  8. Simulated tempering
  Consider a d-dimensional target density f_d, and suppose it is possible to construct MCMC on f_{d,β} = f_d^β, 0 ≤ β ≤ 1. This typically mixes better for small β; however, we are interested in f_{d,1}.
  Problem: choose a finite collection of inverse temperatures, B = {β_i}, such that we can construct a Markov chain on R^d × B which “optimally” permits the exploration of f_{d,1}.
  This is also a scaling problem: choosing how large to make β_i − β_{i−1} for each i.
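A sketch of the resulting algorithm on the augmented space R^d × B, with joint target proportional to c_i f_d(x)^{β_i}. The pseudo-prior weights log_c are assumed supplied (ideally log_c[i] is near minus the log normalising constant at β_i, and choosing them well is itself nontrivial), and the within-temperature move is a plain RWM step:

```python
import numpy as np

def simulated_tempering(log_f, betas, log_c, x0, n_iter, sigma=1.0, seed=0):
    """Sketch of simulated tempering: a Markov chain on (x, i).

    betas is sorted increasingly with betas[-1] = 1, the inverse
    temperature of interest; log_c are given pseudo-prior weights.
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    i = 0                                  # start at the hottest temperature
    out = []
    for _ in range(n_iter):
        # Within-temperature RWM move targeting f^{beta_i}
        y = x + sigma * rng.standard_normal(x.shape)
        if np.log(rng.uniform()) < betas[i] * (log_f(y) - log_f(x)):
            x = y
        # Symmetric proposal to a neighbouring inverse temperature
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(betas):
            log_a = (betas[j] - betas[i]) * log_f(x) + log_c[j] - log_c[i]
            if np.log(rng.uniform()) < log_a:
                i = j
        out.append((i, x.copy()))
    return out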

  9. What is “efficiency”?
  Let X be a Markov chain. Then for a π-integrable function g, efficiency can be described by
  σ²(g, P) = lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i) ).
  Under weak(ish) regularity conditions,
  σ²(g, P) = Var_π(g) + 2 Σ_{i=1}^∞ Cov_π( g(X_0), g(X_i) ).
  In general the relative efficiency between two possible Markov chains varies depending on which function of interest g is being considered. As d → ∞ the dependence on g disappears, at least in cases where we have a diffusion limit, as we will see...
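In practice σ²(g, P) is usually estimated from a single run; a batch-means sketch (a standard estimator, not one the slides prescribe):

```python
import numpy as np

def batch_means_variance(g_values, n_batches=30):
    """Batch-means estimate of sigma^2(g, P) = lim_n n Var(sample mean)."""
    g = np.asarray(g_values, dtype=float)
    n = (len(g) // n_batches) * n_batches        # drop the ragged tail
    batches = g[:n].reshape(n_batches, -1)
    batch_size = batches.shape[1]
    # Var of the batch means is roughly sigma^2 / batch_size, so rescale
    return batch_size * np.var(batches.mean(axis=1), ddof=1)
```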

  10. How do we measure “efficiency” efficiently?
  It is well established that estimating the limiting variance is hard.
  “It’s easy, just measure ESJD instead!” (Andrew Gelman, 1993)
  ESJD = E[ (X_{t+1} − X_t)² ]
  Why is this a good idea? Optimising this is just like considering only linear functions g and ignoring all but the first term in
  Σ_{i=1}^∞ Cov_π( g(X_0), g(X_i) ).
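ESJD, by contrast, is trivial to estimate from chain output; a sketch for a possibly multivariate chain (summing squared jumps across coordinates, one common convention):

```python
import numpy as np

def esjd(chain):
    """Average squared jump, estimating E[(X_{t+1} - X_t)^2]."""
    x = np.asarray(chain, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                  # treat a scalar chain as d = 1
    steps = np.diff(x, axis=0)          # X_{t+1} - X_t for each t
    return np.mean(np.sum(steps**2, axis=1))
```

Rejected moves contribute zero to ESJD, so it directly penalises both over-cautious and over-ambitious proposal scalings.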

  11. Diffusion limits: a framework for studying algorithm optimality
  Many MCMC algorithms are well approximated by diffusions (usually in the sense of diffusion limit results for high-dimensional algorithms, or other limiting regimes). Examples include random walk Metropolis, various versions of the Langevin algorithm, Gibbs samplers, and simulated tempering.
  This provides a natural framework for studying algorithm complexity and optimisation. But why is the comparison of diffusion limits any easier than comparing the finite-dimensional algorithms?

  12. MCMC sample paths and diffusions
  [Figure: trace plot of an MCMC sample path over 10,000 iterations, resembling a diffusion sample path.]
  Here ESJD is the quadratic variation
  lim_{ε→0} Σ_{i=1}^{⌊t ε⁻¹⌋} ( X_{iε} − X_{(i−1)ε} )².

  13. Diffusions
  A d-dimensional diffusion is a continuous-time strong Markov process with continuous sample paths. We can define a diffusion as the solution of the Stochastic Differential Equation (SDE)
  dX_t = μ(X_t) dt + σ(X_t) dB_t,
  where B denotes d-dimensional Brownian motion, σ is a d × d matrix and μ is a d-vector.
  A diffusion is often understood intuitively and constructively via its dynamics over small time intervals. Approximately, for small h,
  X_{t+h} | X_t = x_t ≈ x_t + h μ(x_t) + h^{1/2} σ(x_t) Z,
  where Z is a d-dimensional standard normal random variable.
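The small-h dynamics above are exactly the Euler-Maruyama scheme, so a diffusion can be simulated approximately as follows (a sketch assuming a vector-valued state and a matrix-valued σ(x), with the usual discretisation error):

```python
import numpy as np

def euler_maruyama(mu, sigma, x0, t_end, h, seed=0):
    """Simulate dX_t = mu(X_t) dt + sigma(X_t) dB_t by iterating
    the small-h dynamics displayed above."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    path = [x.copy()]
    for _ in range(int(t_end / h)):
        z = rng.standard_normal(x.shape)    # d-dimensional standard normal
        x = x + h * mu(x) + np.sqrt(h) * (sigma(x) @ z)
        path.append(x.copy())
    return np.array(path)
```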

  14. “Efficiency” for diffusions
  Consider two Langevin diffusions, both with π invariant: for h_1 < h_2,
  dX_t^i = h_i^{1/2} dB_t + h_i ∇ log π(X_t^i) dt / 2,   i = 1, 2.
  [Figure: thinned trace plots of the two diffusions.]
  X^2 is a “speeded-up” version of X^1. But how can we compare diffusions which have non-constant diffusion coefficient?

  15. Peskun ordering
  Peskun (1973). Many uses in MCMC theory (e.g. see work by Mira, Tierney and co-workers).
  P_1 and P_2 are two Markov chain kernels with invariant distribution π. We say P_2 dominates P_1 in the Peskun sense, and write P_2 ⪰ P_1, if for all x and all sets A not containing x,
  P_1(x, A) ≤ P_2(x, A).
  Peskun ordering implies an ordering on asymptotic variances of ergodic estimates:
  lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i^{(1)}) ) ≥ lim_{n→∞} n Var( (1/n) Σ_{i=1}^n g(X_i^{(2)}) ),
  where X^{(i)} moves according to P_i.
  Surprisingly many applications in MCMC! But are proposal scaling problems completely incompatible with Peskun??
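On a finite state space both halves of this slide can be checked numerically: the Peskun condition is an entrywise comparison of off-diagonal transition probabilities, and σ²(g, P) has a closed form via the fundamental matrix. A toy sketch (finite-state, so an illustration of the theorem rather than the continuous-state setting of the talk):

```python
import numpy as np

def peskun_dominates(P2, P1):
    """Check P2(x, A) >= P1(x, A) for all A not containing x,
    i.e. an entrywise off-diagonal comparison."""
    off = ~np.eye(P1.shape[0], dtype=bool)
    return bool(np.all(P2[off] >= P1[off]))

def asymptotic_variance(P, pi, g):
    """Exact sigma^2(g, P) = Var_pi(g) + 2 sum_k Cov_pi(g(X_0), g(X_k))
    for a finite ergodic chain, via the fundamental matrix Z."""
    pi, g = np.asarray(pi, float), np.asarray(g, float)
    g = g - pi @ g                                    # centre g under pi
    Z = np.linalg.inv(np.eye(len(pi)) - P + np.outer(np.ones(len(pi)), pi))
    return pi @ (g * (2.0 * (Z @ g) - g))
```

If peskun_dominates(P2, P1) holds, then asymptotic_variance(P2, pi, g) ≤ asymptotic_variance(P1, pi, g) for every g, which is exactly the displayed variance ordering.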

  16. Peskun ordering in continuous time
  Peskun ordering of P_1 and P_2 does not imply, nor is it implied by, ordering of the 2-step transition kernels. How about continuous time? (See work of Leisen and Mira, 2008.)
  But diffusions satisfy P(x, {x}) = 0, so there can never be any interesting Peskun orderings in the original sense.
  HOWEVER, MANY discrete-time processes possess the SAME diffusion limit. Maybe we can consider the limit along sequences of chains which do satisfy Peskun ordering.

  17. A more powerful diffusion comparison result
  Consider two Langevin diffusions, both with stationary distribution π:
  dX_t^i = h_i(X_t^i)^{1/2} dB_t + V_i(X_t^i) dt,   i = 1, 2,
  with h_1(x) ≤ h_2(x) for all x. (Here V_i(x) = ( h_i(x) ∇ log π(x) + h_i′(x) ) / 2.)
  Under regularity conditions on the tails of π (which have to decay exponentially, or π needs to have bounded support), X^2 dominates X^1 in the covariance ordering sense:
  lim_{t→∞} t Var( (1/t) ∫_0^t g(X_s^1) ds ) ≥ lim_{t→∞} t Var( (1/t) ∫_0^t g(X_s^2) ds ).

  18. Diffusion comparison result (ctd)
  Proof by finding suitable Peskun-ordered birth-and-death processes with the given diffusion limit. A dominated convergence argument is needed to extend the covariance ordering to the limiting diffusions.
  The argument can be extended to give an approximate covariance ordering for ANY processes with these respective limits.
