SLIDE 1

Metropolis-Hastings Algorithm for Mixture Model and its Weak Convergence

Kengo KAMATANI

University of Tokyo, Japan

SLIDE 2

1 The Gibbs sampler usually works well.
2 However, in certain settings it works poorly, e.g. for the mixture model.
3 Fortunately, we found an alternative MCMC method which works better in simulation.

Problem

Both the Gibbs sampler (2) and the alternative method (3) are uniformly ergodic. Therefore, to compare the two methods we have to calculate their convergence rates, which is very difficult. Hence the comparison is hard within the Harris recurrence approach, and we take another approach.

SLIDE 3

Summary of the talk

  • Sec. 1: I show a bad behavior of the Gibbs sampler.
  • Sec. 2: Define efficiency (consistency) of MCMC. Prove that the Gibbs sampler has a bad convergence property.
  • Sec. 3: Propose a new MCMC. Prove that the new MCMC is better than the Gibbs sampler.

SLIDE 4

Note that the Harris recurrence property is also very important for our approach; without it, our approach is useless. Another motivation of our approach is to separate two different convergence issues: 1) convergence to the local area, and 2) consistency. Only the mixture model is considered here, but the approach may be useful for other models.

SLIDE 5

Outline

1 Bad behavior of the Gibbs sampler
    Model description
    Gibbs sampler

2 Efficiency of MCMC
    What is MCMC?
    Consistency
    Degeneracy

3 MH algorithm converges faster
    MH proposal construction
    MH performance

SLIDE 6

Outline

1 Bad behavior of the Gibbs sampler
    Model description
    Gibbs sampler

2 Efficiency of MCMC
    What is MCMC?
    Consistency
    Degeneracy

3 MH algorithm converges faster
    MH proposal construction
    MH performance

SLIDE 7

Bad behavior of the Gibbs sampler

Model description

1 Consider the model

    p_{X|Θ}(dx|θ) = (1 − θ)F0(dx) + θF1(dx).

2 Flip a coin with probability of heads θ. If the coin is heads, generate x from F1; otherwise, from F0.
3 We do not observe the coin, only x.
4 Observations x^n = (x1, x2, . . . , xn), xi ∼ p_{X|Θ}(dx|θ0).

Prior distribution p_Θ = Beta(α1, α0). We want to calculate the posterior distribution.
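The data-generating process in steps 1–4 can be sketched directly. Below, F0 = N(0, 1) and F1 = N(2, 1) are purely illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, theta, rng):
    """Draw n observations from (1 - theta) F0 + theta F1.
    F0 = N(0, 1) and F1 = N(2, 1) are illustrative stand-ins."""
    y = rng.random(n) < theta                     # latent coins y_i ~ Bi(1, theta)
    x = np.where(y, rng.normal(2.0, 1.0, n),      # heads: x_i ~ F1
                    rng.normal(0.0, 1.0, n))      # tails: x_i ~ F0
    return x

x_n = sample_mixture(10_000, theta=0.0, rng=rng)  # true model F0, i.e. theta0 = 0
```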

SLIDE 8

Bad behavior of the Gibbs sampler

Gibbs sampler

1 Set θ(0) ∈ Θ.
2 Generate yi ∼ Bi(1, pi) (i = 1, 2, . . . , n), where

    pi = θ(0)f1(xi) / ((1 − θ(0))f0(xi) + θ(0)f1(xi)),

    and count m = Σ_{i=1}^n yi. Here Fi(dx) = fi(x)dx.
3 Generate θ(1) ∼ Beta(α1 + m, α0 + n − m).
4 The empirical measure of (θ(0), θ(1), . . . , θ(N − 1)) is an estimator of the posterior distribution.

The next figure is a path of the Gibbs sampler when the true model is F0, that is, θ0 = 0.
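Before turning to the figure, here is a minimal sketch of one sweep of this Gibbs sampler, again with illustrative normal densities standing in for f0 and f1 (an assumption, not the talk's setting):

```python
import numpy as np
from scipy.stats import norm

def gibbs_step(theta, x, alpha1, alpha0, rng):
    """One sweep: impute labels y_i, then draw theta from its Beta conditional."""
    f0 = norm.pdf(x, 0.0, 1.0)                        # f0(x_i), illustrative
    f1 = norm.pdf(x, 2.0, 1.0)                        # f1(x_i), illustrative
    p = theta * f1 / ((1.0 - theta) * f0 + theta * f1)
    y = rng.random(x.size) < p                        # y_i ~ Bi(1, p_i)
    m = int(y.sum())                                  # m = sum_i y_i
    return rng.beta(alpha1 + m, alpha0 + x.size - m)  # theta ~ Beta(a1+m, a0+n-m)
```

Iterating gibbs_step N times yields the path (θ(0), . . . , θ(N − 1)) whose empirical measure estimates the posterior.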

SLIDE 9

Bad behavior of the Gibbs sampler

Gibbs sampler

[Figure: path of MCMC; x-axis: iteration (200–1000), y-axis: deviance (2–6)]

Figure: Plot of paths of MCMC methods for n = 10^4. The dashed line is a path from the Gibbs sampler and the solid line is from the MH algorithm.

SLIDE 10

Bad behavior of the Gibbs sampler

How to define efficiency

1 MCMC methods produce a complicated Markov chain.
2 We make an approximation of the MCMC method: we observe the behavior of MCMC methods as the sample size n → ∞.

SLIDE 11

Outline

1 Bad behavior of the Gibbs sampler
    Model description
    Gibbs sampler

2 Efficiency of MCMC
    What is MCMC?
    Consistency
    Degeneracy

3 MH algorithm converges faster
    MH proposal construction
    MH performance

SLIDE 12

Weak convergence of MCMC

What is MCMC?

Write s instead of θ.

1 For each observation x, the Gibbs sampler produces a path s = (s(0), s(1), . . .) in S^∞.
2 In other words, for x ∈ X, the Gibbs sampler defines a law G^x ∈ P(S^∞).
3 Therefore, a Gibbs sampler is a set of probability measures G = (G^x; x ∈ X). (Later, we will consider G as a random variable G(x) = G^x.)

Let ν̂_m(s) be the empirical measure of s(0), . . . , s(m − 1), and let ν^x be the target distribution for each x. We expect that d(ν̂_m(s), ν^x) → 0 in a certain sense.
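As a concrete reading of d(ν̂_m(s), ν^x) → 0, one can take d to be the Kolmogorov distance and compare the chain's empirical distribution with the target cdf. A toy sketch (the Beta target and i.i.d. "path" are stand-ins for illustration, not the mixture posterior):

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(1)
s = rng.beta(2.0, 5.0, size=10_000)    # stand-in for a path s(0), ..., s(m-1)
target = beta(2.0, 5.0)
d_m = kstest(s, target.cdf).statistic  # sup_t |nu_hat_m((-inf,t]) - nu^x((-inf,t])|
print(d_m)                             # small for large m: d(nu_hat_m, nu^x) -> 0
```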

SLIDE 13

Weak convergence of MCMC

Consistency

1 We expect that as m → ∞,

    E^G(d(ν̂_m(s), ν)) → 0.

    But G and ν depend on x!

2 We expect that as m → ∞,

    E^{G^x}(d(ν̂_m(s), ν^x)) = o_P(1).

    But G^x and ν^x may depend on n!

SLIDE 14

Weak convergence of MCMC

Consistency

Definition

Let (M_n = (M^x_n); n ∈ N) be a sequence of MCMC. We call (M_n; n ∈ N) consistent for ν_n = (ν^x_n) if, for any m(n) → ∞,

    E^{M^x_n}(d(ν̂_{m(n)}(s), ν^x_n)) = o_{P_n}(1).

For a regular model, the Gibbs sampler has consistency under the scaling θ → n^{1/2}(θ − θ0).

SLIDE 15

Weak convergence of MCMC

Degeneracy

Definition

1 If a measure ω ∈ P(S^∞) satisfies

    ω({s; s(0) = s(1) = s(2) = · · · }) = 1,     (1)

    we call it degenerate.
2 We also call M degenerate (in P) if M^x is degenerate for a.s. x.
3 If M_n ⇒ M and M is degenerate, we call M_n degenerate in the limit.

The Gibbs sampler G_n for the mixture model is degenerate under the scaling θ → n^{1/2}θ if θ0 = 0 as n → ∞.

SLIDE 16

Weak convergence of MCMC

Degeneracy

In fact, G_n tends to a diffusion-process-type variable under the time scaling 0, 1, 2, . . . → 0, n^{−1/2}, 2n^{−1/2}, . . .! Under both space and time scaling, G^x_n is similar to the law of

    dS_t = (α1 + S_t Z_n − S_t^2 I) dt + S_t dB_t,

where Z_n ⇒ N(0, I) and I is the Fisher information matrix. If we take m(n)n^{−1/2} → ∞, the empirical measure converges to the posterior distribution. We call G_n n^{1/2}-weakly consistent.
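A minimal Euler–Maruyama sketch of this limiting diffusion in the scalar case, taking I = 1 and a single frozen draw Z ~ N(0, 1); the step size and the reflection at 0 are illustrative assumptions:

```python
import numpy as np

def limit_path(alpha1, T=10.0, dt=1e-3, s0=1.0, seed=2):
    """Euler-Maruyama for dS_t = (alpha1 + S_t Z - S_t^2) dt + S_t dB_t, I = 1."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal()                # Z ~ N(0, 1), fixed along the path
    s, path = s0, [s0]
    for _ in range(int(T / dt)):
        drift = (alpha1 + s * z - s * s) * dt
        diff = s * np.sqrt(dt) * rng.standard_normal()  # S_t dB_t increment
        s = max(s + drift + diff, 0.0)       # keep the scaled parameter nonnegative
        path.append(s)
    return np.array(path)
```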

SLIDE 17

Outline

1 Bad behavior of the Gibbs sampler
    Model description
    Gibbs sampler

2 Efficiency of MCMC
    What is MCMC?
    Consistency
    Degeneracy

3 MH algorithm converges faster
    MH proposal construction
    MH performance

SLIDE 18

MH algorithm converges faster

MH proposal construction

Construct a posterior distribution for another parametric family:

1 Fix Q ⊂ P(X).
2 For each θ, set

    q_{X|Θ}(dx|θ) := argmin_{q ∈ Q} d(p_{X|Θ}(dx|θ), q),

    where d is a certain metric, e.g. the Kullback-Leibler divergence (see the sketch below).
3 Calculate the posterior q^n_{Θ|X^n}(dθ|x^n).

Remark

We assume that we can generate θ ∼ q^n_{Θ|X^n}(dθ|x^n) on a PC.

This construction is similar to

1 the quasi-Bayes method (see e.g. Smith and Makov 1978),
2 the variational Bayes method (see e.g. Humphreys and Titterington 2000).
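To make step 2 concrete: if Q were, say, the normal family, the KL projection argmin_{q ∈ Q} KL(p(·|θ) ‖ q) reduces to moment matching. The normal choice of Q below is an illustration, not the family used in the talk:

```python
def kl_projection_normal(theta, mu0=0.0, mu1=2.0, sigma2=1.0):
    """Moment-matched normal q(.|theta), the KL(p||q) minimizer over normals,
    for the mixture (1 - theta) N(mu0, sigma2) + theta N(mu1, sigma2)."""
    mean = (1.0 - theta) * mu0 + theta * mu1
    var = sigma2 + theta * (1.0 - theta) * (mu1 - mu0) ** 2  # law of total variance
    return mean, var
```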

SLIDE 19

MH algorithm converges faster

MH proposal construction

Construct an independence-type Metropolis-Hastings algorithm with target distribution p^n_{Θ|X^n}(dθ|x^n):

Step 0 Generate θ(0) ∼ q^n_{Θ|X^n}(dθ|x^n). Go to Step 1.
Step i Generate θ*(i) ∼ q^n_{Θ|X^n}(dθ|x^n). Then set

    θ(i) = θ*(i)       with probability α(θ(i − 1), θ*(i)),
    θ(i) = θ(i − 1)    with probability 1 − α(θ(i − 1), θ*(i)).

    Go to Step i + 1.
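A sketch of this independence-type MH algorithm. Here log_p evaluates the log target density p^n_{Θ|X^n}, and sample_q / log_q draw from and evaluate the proposal q^n_{Θ|X^n}; all three are assumed available, as in the Remark on slide 18:

```python
import numpy as np

def independence_mh(log_p, sample_q, log_q, n_iter, rng):
    """Independence MH: alpha = min(1, p(th*) q(th) / (p(th) q(th*)))."""
    theta = sample_q()                        # Step 0: theta(0) ~ q
    chain = [theta]
    for _ in range(n_iter):
        prop = sample_q()                     # theta*(i) ~ q
        log_alpha = (log_p(prop) - log_q(prop)) - (log_p(theta) - log_q(theta))
        if np.log(rng.random()) < log_alpha:  # accept w.p. alpha(theta(i-1), theta*(i))
            theta = prop
        chain.append(theta)
    return np.array(chain)
```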

SLIDE 20

MH algorithm converges faster

MH performance (Normal): Mean squared error

[Figure: MCMC standard error vs. iteration (2000–10000); y-axis 0.0–2.0]

Figure: The dashed line is a path from the Gibbs sampler and the solid line is from the MH algorithm, for n = 10.

[Figure: same axes as the left panel]

Figure: The same figure as the left. The sample size is 10^2.

SLIDE 21

MH algorithm converges faster

Remarks

If F0 and F1 are similar, the Gibbs sampler becomes even worse. This fact can be verified by taking F_ε and letting ε → 0 instead of ε ≡ 1. In this case, we have to take m(n)εn^{−1/2} → ∞ (G_n is ε^{−1}n^{1/2}-weakly consistent). For other models, there are cases where m(n)n^{−1} → ∞ is required.

SLIDE 22

MH algorithm converges faster

Thank you!
