  1. Approximate Posterior Sampling via Stochastic Optimisation
     Connie Trojan
     Supervisor: Srshti Putcha
     6th September 2019

  2. Background
     Large scale machine learning models rely on stochastic optimisation techniques to learn parameters of interest.
     It is useful to understand parameter uncertainty using Bayesian inference.
     The Bayesian posterior is usually simulated using Markov chain Monte Carlo (MCMC) sampling algorithms.
     Stochastic gradient MCMC methods combine stochastic optimisation methods with MCMC to reduce computation time.

  3. Notation
     In the Bayesian approach, the unknown parameter θ is treated as a random variable. Writing x = (x_1, ..., x_N) for the observed data, the Bayesian posterior distribution π(θ | x) has the form
         \pi(\theta \mid x) \propto p(\theta)\, \ell(x \mid \theta) = p(\theta) \prod_{i=1}^{N} \ell(x_i \mid \theta),
     where:
     - p(θ) is the prior distribution
     - ℓ(x_i | θ) is the likelihood associated with observation i
     - N is the size of the dataset

  4. Notation
     In particular, gradient-based MCMC algorithms use the log posterior f(θ) to propose moves:
         f(\theta) = k + f_0(\theta) + \sum_{i=1}^{N} f_i(\theta) \equiv k + \log p(\theta) + \sum_{i=1}^{N} \log \ell(x_i \mid \theta),
     where k is a constant that does not depend on θ.
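     For concreteness, here is a small worked instance that is not on the slides: assume a hypothetical Gaussian model with likelihood x_i | θ ~ N(θ, 1) and prior θ ~ N(0, 1). The terms above then become
         f_0(\theta) = \log p(\theta) = -\tfrac{1}{2}\theta^2 + \text{const}, \qquad f_i(\theta) = \log \ell(x_i \mid \theta) = -\tfrac{1}{2}(x_i - \theta)^2 + \text{const},
     so the gradient used by the algorithms that follow is
         \nabla f(\theta) = \nabla f_0(\theta) + \sum_{i=1}^{N} \nabla f_i(\theta) = -\theta + \sum_{i=1}^{N} (x_i - \theta).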

  5. Stochastic Optimisation
     Stochastic optimisation is an efficient way of learning model parameters, typically used in machine learning.
     Stochastic Gradient Ascent (SGA): set a starting value θ_0, a batch size n ≪ N, and step sizes ε_t, then iterate:
     1. Take a subsample S_t of size n from the data.
     2. Estimate the gradient at θ_t by
            \nabla \hat{f}(\theta_t) = \nabla f_0(\theta_t) + \frac{N}{n} \sum_{x_i \in S_t} \nabla f_i(\theta_t)
     3. Set
            \theta_{t+1} = \theta_t + \epsilon_t \nabla \hat{f}(\theta_t)
     There are many ways of speeding up convergence, such as adding a momentum term to step 3:
            \theta_{t+1} = \theta_t + \epsilon_t \nabla \hat{f}(\theta_t) + \gamma(\theta_t - \theta_{t-1})
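     A minimal Python sketch of SGA as described above. It assumes user-supplied functions grad_log_prior(theta) for ∇f_0 and grad_log_lik(batch, theta) for the summed ∇f_i over a minibatch, a NumPy array of observations, and a step-size schedule; these names and the interface are illustrative, not from the slides.

```python
import numpy as np

def sga(grad_log_prior, grad_log_lik, data, theta0, n, n_iters, step_size, momentum=0.0):
    """Stochastic gradient ascent on the log posterior f (sketch).

    grad_log_prior(theta):      gradient of f_0 at theta
    grad_log_lik(batch, theta): sum of the gradients of f_i over the minibatch
    step_size(t):               step size epsilon_t at iteration t
    momentum:                   gamma in the momentum variant (0 gives plain SGA)
    """
    N = len(data)
    theta = np.asarray(theta0, dtype=float)
    prev_theta = theta.copy()
    for t in range(n_iters):
        # 1. subsample S_t of size n without replacement
        batch = data[np.random.choice(N, size=n, replace=False)]
        # 2. unbiased gradient estimate: grad f_0 + (N / n) * sum over the batch
        grad_est = grad_log_prior(theta) + (N / n) * grad_log_lik(batch, theta)
        # 3. ascent step, plus the optional momentum term gamma * (theta_t - theta_{t-1})
        new_theta = theta + step_size(t) * grad_est + momentum * (theta - prev_theta)
        prev_theta, theta = theta, new_theta
    return theta
```

     With momentum = 0 this is the plain update from step 3; a positive value adds the momentum term shown above.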

  6. Stochastic Optimisation
     Robbins-Monro criteria for convergence: if
         \sum_{t=1}^{\infty} \epsilon_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty,
     then θ_t will converge to a local maximum.
     The step sizes are usually set to ε_t = (αt + β)^{-γ} with γ ∈ (0.5, 1].
     These algorithms only converge to a point estimate of the posterior mode.
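     A schedule of this form can be passed as the step_size argument of the SGA sketch above. The default values of alpha, beta and gamma below are arbitrary illustrations, not taken from the slides.

```python
def make_step_size(alpha=1.0, beta=10.0, gamma=0.55):
    """Polynomially decaying schedule epsilon_t = (alpha * t + beta) ** (-gamma).

    For gamma in (0.5, 1] the Robbins-Monro conditions hold: the step sizes
    sum to infinity while their squares have a finite sum.
    """
    return lambda t: (alpha * t + beta) ** (-gamma)
```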

  7. MCMC
     Many problems for which Bayesian inference would be useful involve non-standard distributions and a large number of parameters, making exact inference challenging. MCMC algorithms aim to generate random samples from the posterior. These samplers construct a Markov chain, often a random walk, which converges to the desired stationary distribution.

  8. Metropolis-Adjusted Langevin Algorithm (MALA)
     The Langevin diffusion describes dynamics which converge to π(θ):
         d\theta(t) = \tfrac{1}{2} \nabla f(\theta(t))\, dt + db(t),
     where b(t) is standard Brownian motion. MALA uses the following discretisation to propose samples:
         \theta_{t+1} = \theta_t + \frac{\sigma^2}{2} \nabla f(\theta_t) + \sigma \eta_t.
     A Metropolis-Hastings accept/reject step is then used to correct the discretisation error, ensuring convergence to the desired stationary distribution.
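     Equivalently (not stated on the slide, but it follows directly from the discretisation), the proposal is Gaussian, which gives the proposal density q that appears in the acceptance probability on the next slide:
         q(\theta^* \mid \theta_t) = N\!\left(\theta^* \;;\; \theta_t + \tfrac{\sigma^2}{2} \nabla f(\theta_t),\; \sigma^2 I\right).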

  9. MALA algorithm
     Set a starting value θ_0 and step size σ². Iterate the following:
     1. Propose
            \theta^* = \theta_t + \frac{\sigma^2}{2} \nabla f(\theta_t) + \sigma \eta_t, \quad \text{where } \eta_t \sim N(0, I)
     2. Accept and set θ_{t+1} = θ* with probability
            a(\theta^*, \theta_t) = \min\left(1, \; \frac{\pi(\theta^*)\, q(\theta_t \mid \theta^*)}{\pi(\theta_t)\, q(\theta^* \mid \theta_t)}\right),
        where q(x | y) is the density of proposing x from the current state y.
     3. If the proposal is rejected, set θ_{t+1} = θ_t.
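     A minimal Python sketch of this algorithm, assuming user-supplied functions log_post(theta) (the log posterior f up to a constant) and grad_log_post(theta) (its gradient), with theta a NumPy array; these names are illustrative, not from the slides.

```python
import numpy as np

def mala(log_post, grad_log_post, theta0, sigma, n_iters):
    """Metropolis-adjusted Langevin algorithm (sketch)."""
    theta = np.asarray(theta0, dtype=float)
    samples = []

    def log_q(x, y):
        # log density (up to a constant) of the Langevin proposal x given current state y
        mean = y + 0.5 * sigma ** 2 * grad_log_post(y)
        return -np.sum((x - mean) ** 2) / (2 * sigma ** 2)

    for _ in range(n_iters):
        # 1. Langevin proposal: half a gradient step plus Gaussian noise
        prop = theta + 0.5 * sigma ** 2 * grad_log_post(theta) \
               + sigma * np.random.standard_normal(theta.shape)
        # 2. Metropolis-Hastings correction for the discretisation error
        log_a = (log_post(prop) + log_q(theta, prop)) - (log_post(theta) + log_q(prop, theta))
        if np.log(np.random.rand()) < log_a:
            theta = prop
        # 3. if rejected, theta is unchanged and the current state is recorded again
        samples.append(theta.copy())
    return np.array(samples)
```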

  10. MALA
      [Figure: MALA output for three step sizes, with the corresponding acceptance rates: σ = 0.03 (a = 0.99), σ = 0.13 (a = 0.57), σ = 0.20 (a = 0.13).]

  11. Stochastic Gradient Langevin Dynamics (SGLD)
      SGLD aims to reduce the computational cost of MALA by replacing the full gradient calculation in the proposal with the stochastic approximation ∇f̂(θ):
          \theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \nabla \hat{f}(\theta_t) + \sqrt{\epsilon_t}\, \eta_t
      Here, the step sizes ε_t decrease to 0 as in SGA. Since the Metropolis-Hastings acceptance rate tends to 1 as the step size decreases, the costly accept/reject step is omitted.
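      A minimal Python sketch of SGLD under the same hypothetical interface as the SGA sketch above (grad_log_prior, grad_log_lik, a NumPy data array, and a decreasing step-size schedule):

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, data, theta0, n, n_iters, step_size):
    """SGLD sketch: stochastic gradient steps with injected Gaussian noise of
    variance epsilon_t, and no Metropolis-Hastings accept/reject step."""
    N = len(data)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for t in range(n_iters):
        eps = step_size(t)
        # minibatch estimate of the gradient of the log posterior, as in SGA
        batch = data[np.random.choice(N, size=n, replace=False)]
        grad_est = grad_log_prior(theta) + (N / n) * grad_log_lik(batch, theta)
        # Langevin-style update: half a gradient step plus N(0, eps) noise
        theta = theta + 0.5 * eps * grad_est \
                + np.sqrt(eps) * np.random.standard_normal(theta.shape)
        samples.append(theta.copy())
    return np.array(samples)
```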
