Optimal scaling and convergence of Markov chain Monte Carlo methods - PowerPoint PPT Presentation

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Optimal scaling and convergence of Markov chain Monte Carlo methods Alain Durmus Joint work with: Sylvain Le Corff, ´ Eric Moulines, Gareth Roberts, Umut S ¸im¸ sekli February 16, 2016 1/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm 1 Introduction 2 Optimal scaling of the symmetric RWM algorithm 3 Explicit bounds for the ULA algorithm 2/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Introduction Sampling distributions over high-dimensional state-space has recently attracted a lot of research efforts in computational statistics and machine learning community... Applications (non-exhaustive) Bayesian inference for high-dimensional models and Bayesian non parametric. Bayesian linear inverse problems (typically function space problems). Aggregation of estimators and experts. 3/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Bayesian setting A Bayesian model is specified by 1 a prior distribution p on the parameter space θ ∈ R d 2 the sampling distribution of the observed data conditional on its parameters, often termed likelihood: Y ∼ L ( ·| θ ) The inference is based on the posterior distribution: p (d θ ) L ( Y | θ ) π (d θ ) = L ( Y | u ) p (d u ) . � In most cases the normalizing constant is not tractable: π (d θ ) ∝ p (d θ ) L ( Y | θ ) . 4/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Logistic and probit regression Likelihood: Binary regression set-up in which the binary observations (responses) ( Y 1 , . . . , Y n ) are conditionally independent Bernoulli random variables with success probability F ( θ T X i ) , where 1 X i is a d dimensional vector of known covariates, 2 θ is a d dimensional vector of unknown regression coefficient 3 F is a distribution function. Two important special cases: 1 probit regression: F is the standard normal distribution function, 2 logistic regression: F is the standard logistic distribution function, F ( t ) = e t / (1 + e t ) . 5/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Logistic and probit regression (II) The posterior density distribution of θ is given, up to a proportionality constant by π ( θ | ( Y, X )) ∝ exp( − U ( θ )) , where the potential U ( θ ) is given by p { Y i log F ( θ T X i )+(1 − Y i ) log(1 − F ( θ T X i )) } +g( θ ) , � U ( θ ) = − i =1 where g is the log density of the posterior distribution. Two important cases: Gaussian prior g( θ ) = (1 / 2) θ T Σ θ , ridge regression. Laplace prior g( θ ) = λ � d i =1 | θ i | , lasso regression. 6/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Bayesian setting (II) Bayesian decision theory relies on computing expectations: � π ( f ) = R d f ( θ ) π (d θ ) Generic problem: estimation of an integral π ( f ) , where - π is known up to a multiplicative factor ; - Sampling directly from π is not an option; n � A solution is to approximate E π [ f ] by n − 1 f ( X i ) , i =1 where ( X i ) i ≥ 0 is a Markov chain associated with a Markov kernel P for which π is invariant. 7/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Markov chain theory Invariant probability measure: π is said to be an invariant probability measure for the Markov kernel P if X 0 ∼ π then X 1 ∼ π Ergodic Theorem (Meyn and Tweedie, 2003): If π is invariant, With some conditions on P , we have for any f ∈ L 1 ( π ) , n 1 � � f ( X i ) − → f ( x ) π ( x )d x . n π -a.s. i =1 8/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm MCMC: rationale To approximate π ( f ) : find P with invariant measure π , from which we can efficiently sample. MCMC methods are algorithms which aims to build such kernel. One of the most famous example: The Metropolis-Hastings algorithm. 9/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm The Metropolis-Hastings algorithm Initial Data: the target density π , a transition density q , X 0 ∼ µ 0 . For k ≥ 0 given X k , 1 Generate Y k +1 ∼ q ( X k , · ) . 2 Set � Y k +1 with probability α ( X k , Y k +1 ) , X k +1 = with probability 1 − α ( X k , Y k +1 ) . X k where α ( x, y ) = 1 ∧ π ( y ) q ( y, x ) q ( x, y ) . π ( x ) π is invariant for the corresponding Markov kernel P . 10/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Example: The symmetric Random Walk Metropolis algorithm The Random Walk Metropolis:  Y k +1 = X k + σZ k +1 ( Z k ) k ≥ 0 i.i.d. sequence of law N d (0 , Id d )   = σ − d φ d ( � y − x � /σ ) where φ d is the Gaussian density on R d q ( x, y )  α ( x, y ) = π ( y ) /π ( x ) .  11/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Study of MCMC methods: measures of efficiency 1 How to measure the efficiency of MCMC methods ? 2 Equivalent problem: quantifying the convergence of the Markov kernel P to its stationary distribution π . 3 We consider two criteria: the asymptotic variance ⇒ justifies optimal scaling results. convergence in some metric on the set of probability measures. 12/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm 1 Introduction 2 Optimal scaling of the symmetric RWM algorithm 3 Explicit bounds for the ULA algorithm 13/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Behaviour of the RWM Recall the RWM proposal: Y k +1 = X k + σZ k +1 On the one hand, σ should be as large as possible so that the chain explores the state spaces. On the other hand, σ should not be too large as possible otherwise α → 0 . 14/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Scaling problems Questions: How should σ depend on the dimension d ? We study the following very simple model. Consider π a one dimensional positive density on R of the form π ∝ e − u . Define the positive density on given for all x ∈ R d by d d e u ( x i ) , π d ( x ) = � � π ( x i ) = i =1 i =1 where x i is the i -th component of x . 15/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Study of the acceptance ratio (I) Recall π d ( x ) = � d i =1 π ( x i ) = � d i =1 e u ( x i ) Then the acceptance ratio can be written of the form for all x, y ∈ R d , α ( x, y ) = 1 ∧ π ( y ) π ( x ) � d � � = 1 ∧ exp u ( x i ) − u ( y i ) . i =1 16/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Study of the acceptance ratio (II) �� d � Recall α ( x, y ) = 1 ∧ exp i =1 u ( x i ) − u ( y i ) We want that the acceptance ratio during the algorithm ∈ (0 , 1) . 0 ∼ π d and the proposal based on X d Let X d 0 , Y d 1 = X d 0 + σZ d 1 . We consider the mean acceptance ratio, i.e. the quantity: α ( X d 0 , Y d α ( X d 0 , X d 0 + σZ d � � � � E 1 ) = E 1 ) � d � �� u ( X d 0 ,i ) − u ( X d 0 ,i + σZ d 1 ∧ exp = E 1 ,i ) . i =1 17/66 Stochastic seminar, Helsinki university

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Study of the acceptance ratio (III) � �� d �� α ( X d 0 , Y d � i =1 u ( X d 0 ,i ) − u ( X d 0 ,i + σZ d 1 ) = E 1 ∧ exp 1 ,i ) E If u is C 3 then a third Taylor expansion gives: u ( X d 0 ,i ) − u ( X d 0 ,i + σZ d 1 ,i ) = σZ d 1 ,i u ′ ( X d 0 ,i ) + ( σZ d 1 ,i ) 2 u ′′ ( X d 0 ,i ) / 2 + o ( σ 3 ) . (1) Set now σ = ℓd − ξ . By (3) if ξ < 1 / 2 , then d � u ( X d 0 ,i ) − u ( X d 0 ,i + ℓd − ξ Z d lim inf 1 ,i ) = −∞ d → + ∞ i =1 and therefore α ( X d 0 , Y d � � lim inf 1 ) → 0 . d → + ∞ E 18/66 Stochastic seminar, Helsinki university

Optimal scaling and convergence of Markov chain Monte Carlo methods - PowerPoint PPT Presentation

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Optimal scaling and convergence of Markov chain Monte Carlo methods Alain Durmus Joint work with: Sylvain Le Corff, Eric Moulines, Gareth

Markov Chain Monte Carlo Methods Michel Bierlaire michel.bierlaire@epfl.ch Transport and

Markov chain Monte Carlo Dr. Jarad Niemi STAT 544 - Iowa State University April 2, 2018 Jarad

BROCHURE 2019 TETRA JUICES DEL MONTE DEL MONTE 6 x 1L GOLD PINEAPPLE 6 x 1L 6 x 1L 6 x 1L

Markov Chains Markov Processes Discrete-time Markov Chains Continuous-time Markov Chains Dr

MARKOV CHAIN MONTE CARLO METHODS MARKOV CHAIN MONTE CARLO METHODS MARKO LAINE, FMI MARKO LAINE,

Hidden Markov Models Discrete Markov Processes 1 Hidden Markov Models Hidden Markov Models 2

Draft Supercanonical convergence rates in quasi-Monte Carlo simulation of Markov chains Pierre

Markov chain Monte Carlo Reminder Need to sample large, non-standard distributions: Markov

Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabs Pczos

Probabilistic Graphical Models Probabilistic Graphical Models Markov Chain Monte Carlo Inference

Part 3 Markov Chain Modeling Markov Chain Model Stochastic model Amounts to sequence of

Markov chains and Hidden Markov Models 9000 Markov chains and HMMs We will discuss: Markov

CSCE 471/871 Lecture 3: Markov Chains Markov Chains and and Hidden Markov Models Hidden

Adaptive and Interacting Markov chain Monte Carlo Gersende FORT LTCI CNRS & Telecom

Distributed Markov chain Monte Carlo Lawrence Murray CSIRO Mathematics, Informatics and

Bayesian inference & Markov chain Monte Carlo Note 1: Many slides for this lecture were kindly

Logistic Regression Required reading: Mitchell draft chapter (see course website)

Discriminative vs. Generative Learning CS 760@UW-Madison Goals for the lecture you should

Chapter 3: Modeling with First-Order Differential Equations Department of Electrical Engineering

Machine Learning - MT 2016 16. Course Summary Varun Kanade University of Oxford November 30,

Overview of logistic regression Richard Erickson Instructor DataCamp Generalized Linear Models

Logistic Regression Two Worlds: Probabilistic & Algorithmic We know two conceptual approaches

Bayesian logistic regression Already covered in lectures on classification Laplace and

Perceptron and Logistic Regression Milan Straka October 19, 2020 Charles University in Prague

Optimal scaling and convergence of Markov chain Monte Carlo methods - PowerPoint PPT Presentation

Introduction Optimal scaling of the symmetric RWM algorithm Explicit bounds for the ULA algorithm Optimal scaling and convergence of Markov chain Monte Carlo methods Alain Durmus Joint work with: Sylvain Le Corff, Eric Moulines, Gareth

Markov Chain Monte Carlo Methods Michel Bierlaire michel.bierlaire@epfl.ch Transport and

Markov chain Monte Carlo Dr. Jarad Niemi STAT 544 - Iowa State University April 2, 2018 Jarad

BROCHURE 2019 TETRA JUICES DEL MONTE DEL MONTE 6 x 1L GOLD PINEAPPLE 6 x 1L 6 x 1L 6 x 1L

Markov Chains Markov Processes Discrete-time Markov Chains Continuous-time Markov Chains Dr

MARKOV CHAIN MONTE CARLO METHODS MARKOV CHAIN MONTE CARLO METHODS MARKO LAINE, FMI MARKO LAINE,

Hidden Markov Models Discrete Markov Processes 1 Hidden Markov Models Hidden Markov Models 2

Draft Supercanonical convergence rates in quasi-Monte Carlo simulation of Markov chains Pierre

Markov chain Monte Carlo Reminder Need to sample large, non-standard distributions: Markov

Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabs Pczos

Probabilistic Graphical Models Probabilistic Graphical Models Markov Chain Monte Carlo Inference

Part 3 Markov Chain Modeling Markov Chain Model Stochastic model Amounts to sequence of

Markov chains and Hidden Markov Models 9000 Markov chains and HMMs We will discuss: Markov

CSCE 471/871 Lecture 3: Markov Chains Markov Chains and and Hidden Markov Models Hidden

Adaptive and Interacting Markov chain Monte Carlo Gersende FORT LTCI CNRS &amp; Telecom

Distributed Markov chain Monte Carlo Lawrence Murray CSIRO Mathematics, Informatics and

Bayesian inference &amp; Markov chain Monte Carlo Note 1: Many slides for this lecture were kindly

Logistic Regression Required reading: Mitchell draft chapter (see course website)

Discriminative vs. Generative Learning CS 760@UW-Madison Goals for the lecture you should

Chapter 3: Modeling with First-Order Differential Equations Department of Electrical Engineering

Machine Learning - MT 2016 16. Course Summary Varun Kanade University of Oxford November 30,

Overview of logistic regression Richard Erickson Instructor DataCamp Generalized Linear Models

Logistic Regression Two Worlds: Probabilistic &amp; Algorithmic We know two conceptual approaches

Bayesian logistic regression Already covered in lectures on classification Laplace and

Perceptron and Logistic Regression Milan Straka October 19, 2020 Charles University in Prague

Adaptive and Interacting Markov chain Monte Carlo Gersende FORT LTCI CNRS & Telecom

Bayesian inference & Markov chain Monte Carlo Note 1: Many slides for this lecture were kindly

Logistic Regression Two Worlds: Probabilistic & Algorithmic We know two conceptual approaches