SLIDE 1

Stochastic Gradient Annealed Importance Sampling

Scott Cameron Hans Eggers Steve Kroon

Stellenbosch University NITheP

SLIDE 2

Motivation

Stochastic optimization

SLIDE 3

Motivation

Goal: Efficient large-scale marginal likelihood estimation using mini-batches

SLIDE 4

Marginal Likelihood (Evidence)

Consider a Bayesian model with data D = \{y_n\}_{n=1}^{N} and joint distribution

p(D, \theta) = p(\theta) \prod_n p(y_n \mid \theta)

Posterior given by Bayes' theorem:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

Marginal likelihood:

Z := p(D) = \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta

Posterior predictive:

p(y' \mid D) = \int p(y' \mid \theta)\, p(\theta \mid D)\, \mathrm{d}\theta

SLIDE 5

Model Comparison/Combination

Posterior odds over models M_1, M_2, \dots:

\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{Z_1}{Z_2} \cdot \frac{p(M_1)}{p(M_2)}

M_1 is a 'better' model than M_2 if Z_1 \gg Z_2.

Combined predictions:

p(y' \mid D) = \frac{\sum_i p(y' \mid D, M_i)\, Z_i\, p(M_i)}{\sum_i Z_i\, p(M_i)}

This weighs models proportionately to how well they describe the data.
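To make the weighting concrete, here is a small sketch (not from the talk; all numbers are invented) of turning log-evidences into posterior model probabilities, normalizing in log space for numerical stability:

```python
import numpy as np

# Hypothetical log-evidences and a uniform model prior; the numbers are
# illustrative, not results from the talk.
log_Z = np.array([-105.2, -110.7])        # log Z_1, log Z_2
log_prior = np.log(np.array([0.5, 0.5]))  # p(M_1), p(M_2)

# P(M_i | D) is proportional to Z_i * p(M_i); normalize in log space.
log_post = log_Z + log_prior
log_post -= np.logaddexp.reduce(log_post)
post = np.exp(log_post)
print(post)  # M_1 dominates, since Z_1 >> Z_2
```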

SLIDE 6

Why is this difficult?

Example model:

\mu \sim \mathcal{N}(0, 1), \qquad y_n \sim \mathcal{N}(\mu, 1)

Naive estimator:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} p(D \mid \mu_i), \qquad \mu_i \sim p(\mu)
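A quick numerical sketch of the failure mode (settings are illustrative): this conjugate model has a closed-form evidence, and when the data sit far from the prior, prior samples almost never reach the region where the likelihood is large, so the naive estimate falls far short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate toy model mu ~ N(0,1), y_n ~ N(mu,1). The data are deliberately
# generated far from the prior (true mean 6) so prior and posterior barely
# overlap; N, M and the means are illustrative choices.
N = 200
y = rng.normal(6.0, 1.0, size=N)
S, Q = y.sum(), (y ** 2).sum()

# Closed-form log-evidence for this conjugate pair.
log_Z_true = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1)
              - 0.5 * Q + S ** 2 / (2 * (N + 1)))

# Naive estimator: Z_hat = (1/M) * sum_i p(D | mu_i) with mu_i ~ p(mu).
M = 10_000
mu = rng.normal(size=M)
log_lik = (-0.5 * N * np.log(2 * np.pi)
           - 0.5 * ((y[:, None] - mu[None, :]) ** 2).sum(axis=0))
log_Z_hat = np.logaddexp.reduce(log_lik) - np.log(M)

print(log_Z_true, log_Z_hat)  # the naive estimate falls far short
```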

SLIDE 7

Why is this difficult?

Prior sampling consistently underestimates, while the harmonic mean consistently overestimates.

SLIDE 8

Annealed Importance Sampling

Adiabatically decrease the temperature: 0 = \lambda_0 < \cdots < \lambda_T = 1

f_t(\theta) = p(D \mid \theta)^{\lambda_t}\, p(\theta)

Update particles with HMC¹:

U_t(\theta) = -\lambda_t \log p(D \mid \theta) - \log p(\theta)

Iterated importance sampling:

w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(D \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}

Estimator:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} w_i^{(T)}

¹Hamiltonian Monte Carlo
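The annealing loop above can be sketched on the same conjugate toy model. A random-walk Metropolis move stands in for HMC here, and the schedule, step size and particle count are illustrative choices, not the talk's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# AIS sketch on the toy model mu ~ N(0,1), y_n ~ N(mu,1).
N = 100
y = rng.normal(2.0, 1.0, size=N)

def log_lik(mu):      # log p(D | mu), vectorized over a particle array
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * ((y[:, None] - mu[None, :]) ** 2).sum(axis=0))

def log_prior(mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * mu ** 2

M, T = 500, 300
lam = np.linspace(0.0, 1.0, T + 1)
mu = rng.normal(size=M)               # lambda_0 = 0: sample from the prior
log_w = np.zeros(M)

for t in range(1, T + 1):
    # w_i <- w_i * p(D | theta_i)^(lambda_t - lambda_{t-1})
    log_w += (lam[t] - lam[t - 1]) * log_lik(mu)
    for _ in range(2):                # MCMC moves targeting f_t
        prop = mu + 0.3 * rng.normal(size=M)
        log_acc = (lam[t] * log_lik(prop) + log_prior(prop)
                   - lam[t] * log_lik(mu) - log_prior(mu))
        mu = np.where(np.log(rng.uniform(size=M)) < log_acc, prop, mu)

log_Z_hat = np.logaddexp.reduce(log_w) - np.log(M)

# closed-form evidence for this conjugate pair, for comparison
S, Q = y.sum(), (y ** 2).sum()
log_Z_true = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1)
              - 0.5 * Q + S ** 2 / (2 * (N + 1)))
print(log_Z_hat, log_Z_true)  # the two should be close
```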

SLIDE 10

Problems with Scalability

Accurate estimates require T ∝ |D|

  • (1) HMC needs likelihood gradients: O(|D|)
  • (2) Importance weights need the full likelihood: O(|D|)

Altogether, roughly O(|D|^2) complexity

SLIDE 11

Stochastic Gradient HMC

Simulate Langevin dynamics:

\dot{\theta} = v, \qquad \dot{v} = -\nabla U(\theta) - \gamma v + \sqrt{2\gamma}\, \xi, \qquad \langle \xi(t)\, \xi(t') \rangle = \delta(t - t')

Fokker–Planck equation²:

\frac{\partial p}{\partial t} = \partial^{T} A \left( p\, \partial H + \partial p \right), \qquad A = \begin{pmatrix} 0 & -I \\ I & \gamma \end{pmatrix}

Canonical ensemble:

p_\infty(\theta, v) = \frac{1}{Z} e^{-H(\theta, v)}

²H(\theta, v) = U(\theta) + \frac{1}{2} v^2

SLIDE 12

Stochastic Gradient HMC

Euler–Maruyama discretization:

\Delta\theta = v, \qquad \Delta v = -\eta \nabla \hat{U}(\theta) - \alpha v + \mathcal{N}(0,\ 2(\alpha - \hat{\beta})\eta)

Mini-batch energy estimate:

\hat{U}(\theta) = -\frac{|D|}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)

Time complexity O(|B|) ≪ O(|D|): solves (1)
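A minimal sketch of the discretized dynamics on the conjugate toy model (step size, friction and batch size are illustrative, and \hat{\beta} is taken as 0):

```python
import numpy as np

rng = np.random.default_rng(2)

# SGHMC sketch for the toy model mu ~ N(0,1), y_n ~ N(mu,1). eta, alpha,
# B and the beta_hat = 0 choice are illustrative.
N = 1000
y = rng.normal(1.0, 1.0, size=N)

def grad_U_hat(mu, batch):
    # gradient of the minibatch energy
    # U_hat(mu) = -(|D|/|B|) * sum_{y in B} log p(y|mu) - log p(mu)
    return -(N / len(batch)) * (batch - mu).sum() + mu

eta, alpha, B = 1e-4, 0.1, 100
mu, v = 0.0, 0.0
samples = []
for step in range(20_000):
    batch = rng.choice(y, size=B, replace=False)
    mu += v
    v += (-eta * grad_U_hat(mu, batch) - alpha * v
          + rng.normal(0.0, np.sqrt(2 * alpha * eta)))
    if step >= 5_000:
        samples.append(mu)

# With beta_hat = 0 the minibatch gradient noise widens the stationary
# distribution a little, but the posterior mean N*ybar/(N+1) is recovered.
print(np.mean(samples), N * y.mean() / (N + 1))
```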

SLIDE 14

Comparison of MCMC Trajectories

(Trajectory plots: RWMH, HMC, SGLD, SGHMC)

SLIDE 15

Bayesian Updating/Online Estimation

Predictive distributions:

Z = \prod_n p(y_n \mid y_{<n}) = \prod_n \int p(y_n \mid \theta)\, p(\theta \mid y_{<n})\, \mathrm{d}\theta

Estimate p(y_n \mid y_{<n}) with AIS:

\theta_i^{(n)},\ \tilde{w}_i^{(n)} \leftarrow \mathrm{AIS}(y_n, \theta_i^{(n-1)})

Marginal likelihood:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} \prod_n \tilde{w}_i^{(n)}

Solves (2)
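For a conjugate model the one-step predictives are available in closed form, which makes it easy to check the chain-rule decomposition Z = \prod_n p(y_n \mid y_{<n}) numerically (a sketch, with illustrative settings):

```python
import numpy as np

# Toy model mu ~ N(0,1), y_n ~ N(mu,1): each predictive p(y_n | y_<n) is
# Gaussian, so the sequential product can be compared to the batch evidence.
rng = np.random.default_rng(3)
N = 50
y = rng.normal(1.0, 1.0, size=N)

log_Z_seq = 0.0
m, s2 = 0.0, 1.0                      # running posterior N(m, s2) over mu
for yn in y:
    var = s2 + 1.0                    # predictive variance: posterior + noise
    log_Z_seq += -0.5 * np.log(2 * np.pi * var) - 0.5 * (yn - m) ** 2 / var
    # conjugate Gaussian update of the posterior over mu
    m = (m / s2 + yn) / (1 / s2 + 1)
    s2 = 1 / (1 / s2 + 1)

# batch evidence in closed form, for comparison
S, Q = y.sum(), (y ** 2).sum()
log_Z_batch = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1)
               - 0.5 * Q + S ** 2 / (2 * (N + 1)))
print(log_Z_seq, log_Z_batch)  # the two agree
```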

SLIDE 17

Stochastic Gradient Annealed Importance Sampling

Intermediate distributions:

f_n^{(\lambda)}(\theta) = p(y_n \mid \theta)^{\lambda} \left[ \prod_{k<n} p(y_k \mid \theta) \right] p(\theta)

Update particles with SGHMC:

\hat{U}_n^{(\lambda)}(\theta) = -\lambda \log p(y_n \mid \theta) - \frac{n-1}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)

Importance weights:

w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(y_n \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}

ML estimator:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} w_i^{(T)}

SLIDE 18

Results

Gaussian mixture model

  • vs nested sampling
  • vs annealed importance sampling

SLIDE 19

Parameter sensitivity

Adaptive annealing schedule

  • Blue ≈ no annealing steps

SLIDE 20

Distribution Shift

Data may change over time

(Panels: 1 ≤ n ≤ 10^3,  10^3 < n ≤ 10^4,  10^4 < n ≤ 10^5,  total)

SLIDE 21

Distribution Shift

Dashed lines = shuffled data

SLIDE 22

Thank You!

[1] Cameron, S.A.; Eggers, H.C.; Kroon, S. Stochastic Gradient Annealed Importance Sampling for Efficient Online Marginal Likelihood Estimation. Entropy 21.11 (2019).
[2] Chen, T.; Fox, E.; Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. Proceedings of ICML, vol. 5 (2014).

Funded by NITheP³. Paper sponsored by MaxEnt 2019. Big thanks to Hans and Steve!

³National Institute of Theoretical Physics

SLIDE 23

Extra Slides

SLIDE 24

SGAIS

Algorithm 1 Stochastic Gradient Annealed Importance Sampling

1: ∀i: sample θ_i ∼ p(θ)
2: ∀i: w_i ← 1
3: for n = 1, …, N do
4:     λ ← 0
5:     while λ < 1 do
6:         Δ ← argmin_Δ |ESS(Δ) − ESS*|
7:         λ ← λ + Δ
8:         ∀i: w_i ← w_i · p(y_n | θ_i)^Δ    ▷ optionally resample particles
9:         ∀i: θ_i ← SGHMC(θ_i, Û_n^{(λ)})
10:     end while
11: end for
12: return Ẑ = (1/M) ∑_i w_i
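Line 6's adaptive choice of Δ can be sketched as a bisection that keeps the effective sample size of the reweighted particles at a target ESS* (helper names and all numbers are illustrative):

```python
import numpy as np

def ess(log_w):
    # effective sample size of normalized importance weights
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / (w ** 2).sum()

def next_step(log_w, log_lik_n, lam, ess_target):
    # largest Delta in (0, 1 - lam] whose reweighting keeps ESS >= ess_target,
    # found by bisection (assumes ESS shrinks as Delta grows)
    lo, hi = 0.0, 1.0 - lam
    if ess(log_w + hi * log_lik_n) >= ess_target:
        return hi                     # can jump straight to lambda = 1
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if ess(log_w + mid * log_lik_n) >= ess_target:
            lo = mid
        else:
            hi = mid
    return lo

# illustrative usage with made-up particle weights and log-likelihoods
rng = np.random.default_rng(4)
log_w = np.zeros(256)                          # uniform weights
log_lik_n = rng.normal(-3.0, 2.0, size=256)    # hypothetical log p(y_n|theta_i)
delta = next_step(log_w, log_lik_n, lam=0.0, ess_target=128.0)
print(delta)  # temperature increment that keeps ESS near the target
```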

SLIDE 25

Number of Particles

SLIDE 26

Effective Sample Size

SLIDE 27

Learning Rate

SLIDE 28

Burnin

SLIDE 29

Learning Rate × Burnin

SLIDE 30

Batch Size
