SLIDE 1

New Langevin based algorithms for MCMC in high dimensions

Alain Durmus. Joint work with Gareth O. Roberts, Gilles Vilmart and Konstantinos Zygalakis.

Département TSI, Telecom ParisTech. Sixièmes rencontres des jeunes statisticiens.

SLIDE 2

Main themes of this talk

  • Scaling limits of Metropolis-Hastings algorithms
  • A new MH algorithm with a new scaling

SLIDE 6

Brief review of scaling results ◮ Introduction

Motivation

Let F : R^d → R and let π be a probability measure on R^d (with density π). Generic problem: estimate the expectation E_π[F] := ∫ F(x)π(x)dx, where

  • π is known only up to a multiplicative factor;
  • we do not know how to sample from π (no basic Monte Carlo estimator);
  • π is a high-dimensional density (usual importance sampling and accept/reject are inefficient).

A solution is to approximate E_π[F] by n^{-1} Σ_{i=1}^n F(X_i), where (X_i)_{i≥0} is a Markov chain with invariant measure π.
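
To make this concrete, here is a minimal sketch of the ergodic-average estimator in Python (my illustration, not from the slides), assuming a callable step(x, rng) that draws X_{k+1} ∼ P(X_k, ·):

    import numpy as np

    def mcmc_estimate(step, F, x0, n, burn_in=1000, seed=0):
        """Approximate E_pi[F] by the ergodic average (1/n) sum_i F(X_i)."""
        rng = np.random.default_rng(seed)
        x, total = x0, 0.0
        for k in range(burn_in + n):
            x = step(x, rng)          # X_{k+1} ~ P(X_k, .)
            if k >= burn_in:
                total += F(x)
        return total / n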

SLIDE 7

Brief review of scaling results ◮ Introduction

Markov chain theory

Definition

Let P : R^d × B(R^d) → R_+. P is a Markov kernel if

  • for all x ∈ R^d, A ↦ P(x, A) is a probability measure on R^d,
  • for all A ∈ B(R^d), x ↦ P(x, A) is measurable from R^d to R.

A transition density function q : R^d × R^d → R is a measurable function such that for all x ∈ R^d, ∫_{R^d} q(x, y) dy = 1. Then P(x, A) = ∫_A q(x, y) dy is a Markov kernel on R^d with density q.

A Markov chain associated with P is a stochastic process (X_k)_{k≥0} such that for all k ≥ 0, X_{k+1} ∼ P(X_k, ·).

SLIDE 8

Brief review of scaling results ◮ Introduction

Markov chain theory

Some simple properties:

  • If P1 and P2 are two Markov kernels, we can define a new Markov kernel, denoted P1P2, by, for x ∈ R^d and A ∈ B(R^d),

    P1P2(x, A) = ∫_{R^d} P1(x, dz) P2(z, A) .

  • If P is a Markov kernel and ν a probability measure on R^d, we can define a measure on R^d, denoted νP, by, for A ∈ B(R^d),

    νP(A) = ∫_{R^d} ν(dz) P(z, A) .

  • Let P be a Markov kernel on R^d. For f : R^d → R_+ measurable, we can define a measurable function Pf : R^d → R̄_+ by

    Pf(x) = ∫_{R^d} P(x, dz) f(z) .

SLIDE 9

Brief review of scaling results ◮ Introduction

Markov chain theory

Invariant probability measure: π is said to be an invariant probability measure for the Markov kernel P if πP = π.

Theorem (Meyn and Tweedie, 2003, Ergodic theorem)

Under some conditions on P, we have for any F ∈ L¹(π),

    (1/n) Σ_{i=1}^n F(X_i) → ∫ F(x)π(x)dx   π-a.s., as n → +∞.

A simple condition for π to be an invariant measure for P is reversibility:

    π(dy)P(y, dx) = π(dx)P(x, dy) .

SLIDE 10

Brief review of scaling results ◮ Introduction

MCMC: rationale

To approximate E_π[F]: find a kernel P with invariant measure π from which we can sample efficiently. Question: how do we find such a P? ⇒ the Metropolis–Hastings algorithm provides a way to build such a kernel.

SLIDE 12

Brief review of scaling results ◮ The Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm (I)

Initial data: the target density π, a transition density q, X_0 ∼ µ_0. For k ≥ 0, given X_k:

  • 1. Generate Y_{k+1} ∼ q(X_k, ·).
  • 2. Set X_{k+1} = Y_{k+1} with probability α(X_k, Y_{k+1}), and X_{k+1} = X_k with probability 1 − α(X_k, Y_{k+1}), where

    α(x, y) = 1 ∧ [π(y) q(y, x)] / [π(x) q(x, y)] .

The algorithm produces a Markov chain with a kernel P_MH reversible w.r.t. π. Note that X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})}(Y_{k+1} − X_k), where U ∼ U[0, 1].
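
A generic sketch of this algorithm in Python (my rendering of the pseudo-code above, with hypothetical callables log_pi, propose, log_q); note that log_pi and log_q only need to be known up to additive constants, which matches the "known up to a multiplicative factor" requirement:

    import numpy as np

    def metropolis_hastings(log_pi, propose, log_q, x0, n, seed=0):
        """MH chain: propose Y ~ q(X_k, .), accept with probability alpha(X_k, Y)."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        chain = np.empty((n, x.size))
        for k in range(n):
            y = propose(x, rng)
            # log alpha(x, y) = log[ pi(y) q(y, x) / (pi(x) q(x, y)) ], capped at 0
            log_alpha = min(0.0, log_pi(y) - log_pi(x) + log_q(y, x) - log_q(x, y))
            if np.log(rng.uniform()) <= log_alpha:
                x = y
            chain[k] = x
        return chain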

SLIDE 13

Brief review of scaling results ◮ The Metropolis-Hastings algorithm

The RWM and MALA

Two well-known Metropolis–Hastings algorithms:

1) The Random Walk Metropolis:

    Y_{k+1} = X_k + σ_d Z_{k+1}, with (Z_k)_{k≥0} an i.i.d. sequence of law N_d(0, Id_d),
    q(x, y) = φ_d((y − x)/σ_d), where φ_d is the standard Gaussian density on R^d,
    X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})} σ_d Z_{k+1} .

2) The Metropolis Adjusted Langevin Algorithm: assume that log π is at least C¹, with gradient denoted by b. Then

    Y_{k+1} = X_k + σ_d² b(X_k)/2 + σ_d Z_{k+1},
    q(x, y) = φ_d((y − x − σ_d² b(x)/2)/σ_d),
    X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})}(σ_d² b(X_k)/2 + σ_d Z_{k+1}) .
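
Written as ingredients for the metropolis_hastings sketch above (my illustration; the log-densities are kept up to additive constants, which cancel in the acceptance ratio):

    def rwm_propose(sigma):
        return lambda x, rng: x + sigma * rng.standard_normal(x.size)

    def mala_propose(sigma, b):
        # Y_{k+1} = X_k + sigma^2 b(X_k)/2 + sigma Z_{k+1}
        return lambda x, rng: x + 0.5 * sigma**2 * b(x) + sigma * rng.standard_normal(x.size)

    def mala_log_q(sigma, b):
        # q(x, y) = phi_d((y - x - sigma^2 b(x)/2) / sigma), up to a constant
        def log_q(x, y):
            r = y - x - 0.5 * sigma**2 * b(x)
            return -0.5 * float(r @ r) / sigma**2
        return log_q

For the RWM the proposal is symmetric, so one can simply pass log_q = lambda x, y: 0.0.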

SLIDE 14

Brief review of scaling results ◮ The Metropolis-Hastings algorithm

Scaling problems and diffusion limits

Scaling problems:

  • How should σ_d depend on the dimension d?
  • What does this tell us about the efficiency of the algorithm?
  • Can we optimize σ_d in a sensible way?
  • Can we characterize the optimal choice of σ_d by some intrinsic criterion independent of π?

For the RWM and MALA, diffusion limits answer these questions.

SLIDE 15

Brief review of scaling results ◮ The Metropolis-Hastings algorithm

Efficiency of MH algorithms

Let (X_k)_{k≥0} be a Markov chain with invariant measure π. Under some conditions we have a LLN and a CLT: for some F,

    (1/n) Σ_{i=1}^n F(X_i) → ∫ F(x)π(x)dx   a.s., as n → +∞,

    √n [ (1/n) Σ_{i=1}^n F(X_i) − ∫ F(x)π(x)dx ] ⇒ N(0, σ²(F, P)) , as n → +∞,

where

    σ²(F, P) = lim_{n→+∞} n Var_π[ (1/n) Σ_{i=1}^n F(X_i) ] = Var_π[F(X_0)] + 2 Σ_{i≥1} Cov_π[F(X_i), F(X_0)] .
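
In practice σ²(F, P) must be estimated from a single chain; a minimal batch-means sketch (a standard estimator, my addition rather than something stated on the slide):

    import numpy as np

    def batch_means_asymptotic_var(fx, n_batches=30):
        """Estimate sigma^2(F, P) from the samples fx[i] = F(X_i)."""
        fx = np.asarray(fx, dtype=float)
        m = len(fx) // n_batches                      # batch length
        batches = fx[: m * n_batches].reshape(n_batches, m)
        # m times the variance of the batch means estimates lim_n n Var(mean)
        return m * batches.mean(axis=1).var(ddof=1)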

SLIDE 16

Brief review of scaling results ◮ The Metropolis-Hastings algorithm

Efficiency of MH algorithms

  • Given F, the CLT allows us to compare two Markov kernels P1 and P2:

    σ²(F, P1) ≤ σ²(F, P2) ⇒ P1 is more efficient than P2 .

  • For all i ≥ 1, Cov_π[F(X_i), F(X_0)] ≥ 0: therefore we cannot do better than i.i.d. samples.
  • However, there are no practical conditions ensuring that σ²(F, P1) ≤ σ²(F, P2) for all F; as we will see, such a comparison is available for Langevin diffusions.
SLIDE 17

Brief review of scaling results ◮ The Metropolis-Hastings algorithm

Expected Square Jump Distance

A common efficiency criterion: the ESJD, defined for a Markov chain in one dimension by

    ESJD = E_π[(X_1 − X_0)²] .

One justification: maximizing the ESJD ⇔ minimizing Cov_π[F(X_1), F(X_0)] for F a linear function.
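
Its empirical counterpart is immediate (sketch; xs is a 1-d array of successive states of a stationary chain):

    import numpy as np

    def esjd(xs):
        """Empirical E_pi[(X_1 - X_0)^2] from successive chain states."""
        xs = np.asarray(xs, dtype=float)
        return float(np.mean((xs[1:] - xs[:-1]) ** 2))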

SLIDE 19

Brief review of scaling results ◮ Speed of Langevin diffusions

Langevin diffusion

Let π be a probability measure on R^d with C¹ log-density, and set b(x) = ∇ log π(x). Consider the overdamped Langevin equation:

    dY_t = (b(Y_t)/2) dt + dB_t .

Note that the proposal of the MALA is just an Euler–Maruyama discretization of this SDE. Under some conditions, (Y_t)_{t≥0} is ergodic with respect to π, and we again have a LLN and a CLT:

    (1/t) ∫_0^t F(Y_s) ds → ∫ F(x)π(x)dx   a.s., as t → +∞,

    √t [ (1/t) ∫_0^t F(Y_s) ds − ∫ F(x)π(x)dx ] ⇒ N(0, σ²(F, Y)) , as t → +∞,

where σ²(F, Y) = lim_{t→+∞} t Var_π[ (1/t) ∫_0^t F(Y_s) ds ].
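
Since the MALA proposal is one Euler–Maruyama step of this SDE, here is that step spelled out (sketch; with step size h = σ_d² it is exactly the MALA proposal):

    import numpy as np

    def euler_maruyama_step(y, b, h, rng):
        """One Euler-Maruyama step of dY_t = (b(Y_t)/2) dt + dB_t, step size h."""
        return y + 0.5 * h * b(y) + np.sqrt(h) * rng.standard_normal(y.size)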

SLIDE 20

Brief review of scaling results ◮ Speed of Langevin diffusions

Scaled Langevin equation

Consider the following scaled Langevin equation:

    dY^c_t = (c b(Y^c_t)/2) dt + √c dB_t .   (1)

Then a solution of (1) is given by (Y¹_{ct})_{t≥0}:

    Y¹_{ct} = Y¹_0 + ∫_0^{ct} (b(Y¹_s)/2) ds + B_{ct}
            = Y¹_0 + ∫_0^t (c b(Y¹_{cu})/2) du + √c B̃_t   (change of variable s = cu),

with the Brownian motion B̃_t = c^{−1/2} B_{ct}.

SLIDE 21

Brief review of scaling results ◮ Speed of Langevin diffusions

Efficiency of Langevin solutions

Which c leads to the best convergence, i.e. minimizes σ²(F, (Y^c_t)_{t≥0})?

1. To get closer to equilibrium, it seems sensible to accelerate the diffusion: take c large.
2. This is also justified by the CLT:

    σ²(F, (Y^c_t)_{t≥0}) = lim_{t→+∞} t Var_π[ (1/t) ∫_0^t F(Y¹_{cs}) ds ]
                         = c^{−1} lim_{t→+∞} ct Var_π[ (1/(ct)) ∫_0^{ct} F(Y¹_s) ds ]   (change of variable u = cs),

so that

    σ²(F, (Y^c_t)_{t≥0}) = c^{−1} σ²(F, (Y¹_t)_{t≥0}) .

Conclusion: the faster, the better. Note that this result holds for all F.

SLIDE 23

Brief review of scaling results ◮ Review of the scaling results for the RWM and MALA

Scaling of the RWM

Recall:

    Y_{k+1} = X_k + σ_d Z_{k+1}, (Z_k)_{k≥0} i.i.d. of law N_d(0, Id_d),
    q(x, y) = φ_d((y − x)/σ_d),
    X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})} σ_d Z_{k+1} .

Theorem (Roberts, Gelman and Gilks, 1997)

Let π_d(x) = Π_{i=1}^d π(x_i) and let {X^{d,RWM}_k, k ≥ 0} be the Markov chain produced by the RWM on R^d with target density π_d, X^{d,RWM}_0 ∼ π_d and σ_d = ℓ d^{−1/2}. Consider

    Y^{d,RWM}_t = X^{d,RWM}_{⌊td⌋,1} .

Then the sequence of càdlàg processes on R, {(Y^{d,RWM}_t)_{t≥0}, d ≥ 1}, converges weakly in the Skorokhod topology to (Y_t)_{t≥0}, the solution of the scaled Langevin equation

    dY_t = (h_RWM(ℓ) b(Y_t)/2) dt + h_RWM(ℓ)^{1/2} dB_t ,

for some function h_RWM(ℓ), which we can optimize.

SLIDE 24

Brief review of scaling results ◮ Review of the scaling results for the RWM and MALA

Scaling of the MALA

Recall:

    Y_{k+1} = X_k + σ_d² b(X_k)/2 + σ_d Z_{k+1} ,
    q(x, y) = φ_d((y − x − σ_d² b(x)/2)/σ_d) ,
    X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})}(σ_d² b(X_k)/2 + σ_d Z_{k+1}) .

Theorem (Roberts and Rosenthal, 2001)

Let π_d(x) = Π_{i=1}^d π(x_i) and let {X^{d,E}_k, k ≥ 0} be the Markov chain produced by the MALA on R^d with target density π_d, X^{d,E}_0 ∼ π_d and σ_d = ℓ d^{−1/6}. Consider

    Y^{d,E}_t = X^{d,E}_{⌊td^{1/3}⌋,1} .

Then the sequence of càdlàg processes on R, {(Y^{d,E}_t)_{t≥0}, d ≥ 1}, converges weakly in the Skorokhod topology to (Y_t)_{t≥0}, the solution of the scaled Langevin equation

    dY_t = (h_E(ℓ) b(Y_t)/2) dt + h_E(ℓ)^{1/2} dB_t ,

for some function h_E(ℓ), which we can optimize.

SLIDE 25

Brief review of scaling results ◮ Review of the scaling results for the RWM and MALA

Consequences on the tuning of the two algorithms

  • 1. Since the solution of the Langevin equation explores the invariant distribution in O(1) time at stationarity, the RWM explores it in O(d) steps, and MALA in O(d^{1/3}) steps.
  • 2. To get the best-mixing algorithm, tune the parameter ℓ to maximize the corresponding speed measure h(ℓ) (the RWM and MALA then approximate the fastest Langevin solution).

Question: can we find a proposal for which the MH algorithm explores the invariant distribution faster than MALA?

SLIDE 28

A new MH algorithm with a scaling in d^{1/5} ◮ Formal derivation

Framework

We propose here to study how a class of MH algorithms must be scaled with the dimension d. As for the RWM and MALA, we make the following simplifications:

a) π_d is of product form:

    π_d(x) = Π_{i=1}^d π(x_i) = Π_{i=1}^d exp(g(x_i)) .

b) π is smooth enough.

c) We suppose that X^d_0 ∼ π_d and that the proposal is a Gaussian kernel:

    Y^d_{k+1,i} = µ(X^d_{k,i}, σ_d) + S(X^d_{k,i}, σ_d) Z^d_{k+1,i} ,
    q(x, y) = φ_d((y − µ(x, σ_d))/S(x, σ_d)) ,
    X^d_{k+1} = X^d_k + ✶{U ≤ α(X^d_k, Y^d_{k+1})}(Y^d_{k+1} − X^d_k) .

For MALA, µ(x, σ) = x + σ² b(x)/2 and S(x, σ) = σ.

d) For γ > 0, σ_d = ℓ d^{−γ/2}.

SLIDE 29

A new MH algorithm with a scaling in d^{1/5} ◮ Formal derivation

Form of the proposal

Note that the proposal Y^d_{k+1,i} = ψ(X^d_{k,i}, γ, Z^d_{k+1,i}) depends on:

(i) the current state X^d_k,
(ii) the parameter γ, through σ_d = ℓ d^{−γ/2},
(iii) the noise Z^d_{k+1} ∼ N(0, Id_d).

SLIDE 30

A new MH algorithm with a scaling in d^{1/5} ◮ Formal derivation

Choice of γ

1) On the one hand, γ should be as small as possible, so that the chain makes large steps and decorrelation between samples is maximized.
2) On the other hand, γ should not be too small, otherwise the acceptance probability goes to 0 with the dimension.

This leads to the definition of the critical exponent γ_c:

    γ_c = min{γ ≥ 0 | lim inf_{d→∞} E[α(X^d_0, Y^d_1)] > 0} ,

where the expectation is taken with X^d_0 ∼ π_d.

For the RWM γ_c = 1 and for MALA γ_c = 1/3. Where do these exponents come from? Can we do better?
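
The critical exponent can be seen empirically (my illustration, assuming the standard Gaussian target π_d = N_d(0, Id_d)): with σ_d = ℓ d^{−γ/2}, the mean RWM acceptance collapses to 0 as d grows when γ < 1, and stabilizes when γ = 1:

    import numpy as np

    def rwm_mean_acceptance(d, gamma, ell=1.0, n=2000, seed=0):
        """Mean RWM acceptance for pi_d = N_d(0, Id_d), sigma_d = ell * d^(-gamma/2)."""
        rng = np.random.default_rng(seed)
        sigma = ell * d ** (-gamma / 2.0)
        x = rng.standard_normal(d)                 # X_0 ~ pi_d (stationarity)
        acc = 0.0
        for _ in range(n):
            y = x + sigma * rng.standard_normal(d)
            log_alpha = 0.5 * (x @ x - y @ y)      # log pi(y) - log pi(x); q symmetric
            acc += np.exp(min(0.0, log_alpha))     # alpha(x, y) = 1 ^ exp(log_alpha)
            if np.log(rng.uniform()) <= log_alpha:
                x = y
        return acc / n

    for d in (10, 100, 1000):
        print(d, rwm_mean_acceptance(d, 0.5), rwm_mean_acceptance(d, 1.0))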

SLIDE 31

A new MH algorithm with a scaling in d^{1/5} ◮ Formal derivation

Formal derivation

Consider the acceptance ratio α(x, y) = 1 ∧ α̃(x, y), where

    log α̃(x, y) = Σ_{i=1}^d [ g(y_i) − g(x_i) + (y_i − µ(x_i, σ_d))²/(2S²(x_i, σ_d)) − (x_i − µ(y_i, σ_d))²/(2S²(y_i, σ_d)) ] .

If y_i = ψ(x_i, σ_d, z_i), then a Taylor expansion of order k with respect to σ_d leads to

    log α̃(x, y) = Σ_{j=1}^k Σ_{i=1}^d (ℓ^j / d^{jγ/2}) C_j(x_i, z_i) + R_{k+1}(x, σ̃, z) .

It turns out that the scaling of each proposal is directly related to how many of the C_j terms vanish.

SLIDE 32

A new MH algorithm with a scaling in d^{1/5} ◮ Formal derivation

Formal derivation

    log α̃(x, y) = Σ_{j=1}^k Σ_{i=1}^d (ℓ^j / d^{jγ/2}) C_j(x_i, z_i) + R_{k+1}(x, σ̃, z) .

If C_j = 0 for j = 1, …, m, then the leading term is

    Σ_{i=1}^d (ℓ^{m+1} / d^{(m+1)γ/2}) C_{m+1}(x_i, z_i) .

This implies that γ_c = 1/(m + 1). Indeed, taking γ = 1/(m + 1) and using an appropriate limiting theorem,

    (ℓ^{m+1} / d^{1/2}) Σ_{i=1}^d C_{m+1}(X^d_{0,i}, Z^d_{0,i}) ⇒ N(0, σ∗²)   as d → +∞,

and the other terms go to some constant. For the RWM m = 0 and for MALA m = 2.
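
This cancellation can be checked symbolically (my sketch, taking for g the double-well g(t) = −t⁴/4 + t²/2 that appears later in the talk): for the RWM the expansion of log α̃ has a nonzero coefficient at order σ, while for MALA the σ and σ² coefficients vanish identically:

    import sympy as sp

    sigma, x, z = sp.symbols('sigma x z')
    g = lambda t: -t**4 / 4 + t**2 / 2        # log-density of one coordinate
    b = lambda t: -t**3 + t                   # b = g'

    # RWM: y = x + sigma z; q is symmetric, so log(alpha~) = g(y) - g(x)
    y_rwm = x + sigma * z
    print(sp.expand(g(y_rwm) - g(x)).coeff(sigma, 1))    # nonzero  =>  m = 0

    # MALA: y = mu(x) + sigma z with mu(x) = x + sigma^2 b(x)/2
    mu = lambda w: w + sigma**2 * b(w) / 2
    y = mu(x) + sigma * z
    lr = sp.expand(g(y) - g(x)
                   + (y - mu(x))**2 / (2 * sigma**2)
                   - (x - mu(y))**2 / (2 * sigma**2))
    print(lr.coeff(sigma, 1), lr.coeff(sigma, 2))        # 0 0  =>  m = 2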

SLIDE 33

A new MH algorithm with a scaling in d^{1/5} ◮ Formal derivation

A proposal with a new scaling

If we now choose

    µ_fMALA(x, σ_d) = x + (σ_d²/2) b(x) − (σ_d⁴/24) Db(x) b(x) − (σ_d⁴/24) {Id : D²b(x)} ,
    S_fMALA(x, σ_d) = σ_d + (σ_d³/12) Db(x) ,

where b(x) = ∇ log π_d(x), then we end up with m = 4 and γ_c = 1/5. The new proposal

    Y_{k+1} = µ_fMALA(X_k, σ_d) + S_fMALA(X_k, σ_d) Z_{k+1}

is used in a Metropolis algorithm. The resulting Markov chain is called the fast Metropolis Adjusted Langevin Algorithm (fMALA).
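
A coordinatewise sketch of this proposal for the product-form target (my 1-d reading of the formulas above, in which Db reduces to g″ and {Id : D²b} to g‴ on each coordinate; worth checking against the paper before use):

    import numpy as np

    def fmala_propose(x, sigma, gp, gpp, gppp, rng):
        """One fMALA proposal; gp, gpp, gppp are g', g'', g''' applied coordinatewise."""
        mu = (x + 0.5 * sigma**2 * gp(x)
              - (sigma**4 / 24.0) * (gpp(x) * gp(x) + gppp(x)))
        s = sigma + (sigma**3 / 12.0) * gpp(x)
        return mu + s * rng.standard_normal(x.size)

Plugged into the generic metropolis_hastings sketch with the matching Gaussian log q (mean mu, scale s), this gives the fMALA chain.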

SLIDE 35

A new MH algorithm with a scaling in d^{1/5} ◮ Main results

Assumptions

H1:

  • 1. g ∈ C¹⁰(R).
  • 2. g′ is Lipschitz.
  • 3. There exists a polynomial P_0 on R such that for all t ∈ R and i = 1, …, 10, |g^{(i)}(t)| ≤ P_0(t).
  • 4. For all k ∈ N, ∫_R t^k e^{g(t)} dt < +∞.

SLIDE 36

A new MH algorithm with a scaling in d^{1/5} ◮ Main results

Limiting acceptance probability

Recall:

    Y_{k+1} = µ_fMALA(X_k, σ_d) + S_fMALA(X_k, σ_d) Z_{k+1} ,
    X_{k+1} = X_k + ✶{U ≤ α(X_k, Y_{k+1})}(Y_{k+1} − X_k) .

Theorem

Assume H1. Let {X^{d,fMALA}_k, k ≥ 0} be the Markov chain produced by the fMALA on R^d with target density π_d, X^{d,fMALA}_0 ∼ π_d and σ_d = ℓ d^{−1/10}. Denote by q_d^{fMALA} the transition density associated with this chain and by α_d^{fMALA} the acceptance ratio. Then

    lim_{d→+∞} ∫ π_d(x) q_d^{fMALA}(x, y) α_d^{fMALA}(x, y) dx dy = a_fMALA(ℓ) ,

where a_fMALA(ℓ) = 2Φ(−K_fMALA ℓ⁵/2), with Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt, for some constant K_fMALA depending on π.

SLIDE 37

A new MH algorithm with a scaling in d^{1/5} ◮ Main results

Scaling of fMALA

Theorem

Assume H1. Let {X^{d,fMALA}_k, k ≥ 0} be the Markov chain produced by the fMALA on R^d with target density π_d, X^{d,fMALA}_0 ∼ π_d and σ_d = ℓ d^{−1/10}. Consider

    Y^{d,fMALA}_t = X^{d,fMALA}_{⌊td^{1/5}⌋,1} .

Then the sequence of càdlàg processes on R, {(Y^{d,fMALA}_t)_{t≥0}, d ≥ 1}, converges weakly in the Skorokhod topology to (Y_t)_{t≥0}, the solution of the scaled Langevin equation

    dY_t = (h_fMALA(ℓ) b(Y_t)/2) dt + h_fMALA(ℓ)^{1/2} dB_t ,

with h_fMALA(ℓ) = ℓ² a_fMALA(ℓ).

SLIDE 39

A new MH algorithm with a scaling in d^{1/5} ◮ Maximization of the speed of the diffusion

Maximizing h(ℓ)

Can we find a criterion for maximizing h(ℓ) that is independent of the target π? By definition of a_fMALA(ℓ),

    ℓ = [ −(2/K_fMALA) Φ^{−1}(a_fMALA(ℓ)/2) ]^{1/5} ,

and hence

    h(ℓ) = ℓ² a_fMALA(ℓ) = [ −(2/K_fMALA) Φ^{−1}(a_fMALA(ℓ)/2) ]^{2/5} a_fMALA(ℓ) .

Therefore the maximization can be written in terms of a_fMALA(ℓ), and h(ℓ) is maximized for a_fMALA(ℓ) ≈ 0.704.

Conclusion: while running your algorithm, tune ℓ so that the empirical mean of the acceptance ratio is ≈ 0.704.
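
The value 0.704 can be recovered numerically (sketch): the constant (2/K_fMALA)^{2/5} does not move the maximizer, so it suffices to maximize a ↦ (−Φ^{−1}(a/2))^{2/5} a over (0, 1):

    from scipy.optimize import minimize_scalar
    from scipy.stats import norm

    # Speed of the limiting diffusion as a function of the limiting
    # acceptance rate a, up to a positive constant:
    # h(a) proportional to (-Phi^{-1}(a/2))^{2/5} * a
    neg_speed = lambda a: -((-norm.ppf(a / 2.0)) ** 0.4) * a

    res = minimize_scalar(neg_speed, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(round(res.x, 3))    # ~ 0.704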

SLIDE 40

A new MH algorithm with a scaling in d^{1/5} ◮ Maximization of the speed of the diffusion

Back to ESJD

In the framework of the theorem, for large d,

    ESJD ≈ d^{−1/5} h(ℓ) .

So, in high dimension, maximizing the ESJD is equivalent to maximizing h(ℓ)!

FIGURE: Comparison of d^{−1/5} ESJD between fMALA and MALA for d = 1600 and g(x) = −x⁴/4 + x²/2, a double-well potential.

SLIDE 41

A new MH algorithm with a scaling in d^{1/5} ◮ Maximization of the speed of the diffusion

The End

Thank you for your attention!
