

  1. Stochastic Gradient Annealed Importance Sampling
     Scott Cameron, Hans Eggers, Steve Kroon
     Stellenbosch University, NITheP

  2. Motivation: stochastic optimization

  3. Motivation
     Goal: efficient large-scale marginal likelihood estimation using mini-batches

  4. Marginal Likelihood (Evidence)
     Consider a Bayesian model over data $\mathcal{D} = \{ y_n \}_{n=1}^N$:
     $p(\mathcal{D}, \theta) = p(\theta) \prod_{n=1}^N p(y_n \mid \theta)$
     The posterior is given by Bayes' theorem:
     $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$
     Marginal likelihood:
     $Z := p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta$
     Posterior predictive:
     $p(y' \mid \mathcal{D}) = \int p(y' \mid \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta$

  5. Model Comparison/Combination
     Posterior over models $M_1, M_2, \cdots$:
     $\frac{P(M_1 \mid \mathcal{D})}{P(M_2 \mid \mathcal{D})} = \frac{Z_1\, p(M_1)}{Z_2\, p(M_2)}$
     $M_1$ is a 'better' model than $M_2$ if $Z_1 \gg Z_2$.
     Combined predictions:
     $p(y' \mid \mathcal{D}) = \frac{\sum_i p(y' \mid \mathcal{D}, M_i)\, Z_i\, p(M_i)}{\sum_i Z_i\, p(M_i)}$
     This weighs models proportionally to how well they describe the data.
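
     Annotation (not from the slides): in practice these model weights are computed from
     log-evidences for numerical stability. A minimal numpy sketch, where `model_posterior`
     is a hypothetical helper name:

```python
import numpy as np

def model_posterior(log_Z, log_prior):
    """P(M_i | D) from log-evidences log Z_i and log-priors log p(M_i),
    computed in log space to avoid overflow/underflow."""
    log_w = np.asarray(log_Z, dtype=float) + np.asarray(log_prior, dtype=float)
    log_w -= log_w.max()          # shift so the largest weight is exp(0)
    w = np.exp(log_w)
    return w / w.sum()

# Two models with equal priors and log Z_1 - log Z_2 = 5:
print(model_posterior([5.0, 0.0], np.log([0.5, 0.5])))
# -> [0.9933 0.0067]; M_1 dominates, matching Z_1 >> Z_2
```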

  6. Why is this difficult?
     Example model:
     $\mu \sim \mathcal{N}(0, 1), \qquad y_n \sim \mathcal{N}(\mu, 1)$
     Naive estimator:
     $\hat{Z} = \frac{1}{M} \sum_{i=1}^M p(\mathcal{D} \mid \mu_i), \qquad \mu_i \sim p(\mu)$
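
     Annotation: to make the failure concrete, here is a small numpy experiment
     (illustrative; the constants are my own, not the slides'). For this conjugate model
     the evidence is available in closed form, since marginally
     $\mathcal{D} \sim \mathcal{N}(0, I + \mathbf{1}\mathbf{1}^\top)$, so we can compare
     the naive estimator against the truth:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model from the slide: mu ~ N(0, 1), y_n ~ N(mu, 1).
N = 50
y = rng.normal(1.0, 1.0, size=N)             # data drawn with true mu = 1

# Naive estimator: Z_hat = (1/M) sum_i p(D | mu_i), mu_i ~ p(mu).
M = 10_000
mu = rng.normal(0.0, 1.0, size=M)            # prior samples
log_lik = (-0.5 * ((y[None, :] - mu[:, None]) ** 2).sum(axis=1)
           - 0.5 * N * np.log(2 * np.pi))
log_Z_naive = np.logaddexp.reduce(log_lik) - np.log(M)

# Exact evidence: marginally y ~ N(0, Sigma) with Sigma = I + 11^T.
Sigma = np.eye(N) + np.ones((N, N))
_, logdet = np.linalg.slogdet(Sigma)
log_Z_true = -0.5 * (y @ np.linalg.solve(Sigma, y)
                     + logdet + N * np.log(2 * np.pi))

print(log_Z_naive, log_Z_true)  # the naive estimate is typically far too low
```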

  7. Why is this difficult?
     Prior sampling consistently underestimates $Z$; the harmonic mean estimator consistently overestimates it.
     [Figure: estimator behaviour for prior sampling vs. the harmonic mean]

  8. Annealed Importance Sampling
     Adiabatically decrease temperature: $0 = \lambda_0 < \cdots < \lambda_T = 1$
     $f_t(\theta) = p(\mathcal{D} \mid \theta)^{\lambda_t}\, p(\theta)$
     Update particles with HMC (Hamiltonian Monte Carlo):
     $U_t(\theta) = -\lambda_t \log p(\mathcal{D} \mid \theta) - \log p(\theta)$
     Iterated importance sampling:
     $w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(\mathcal{D} \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}$
     Estimator:
     $\hat{Z} = \frac{1}{M} \sum_{i=1}^M w_i^{(T)}$
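
     Annotation: a self-contained sketch of AIS on the earlier toy model (my own
     illustrative code; the step size, path length, and linear schedule are arbitrary
     choices, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=50)        # toy data: mu ~ N(0,1), y_n ~ N(mu,1)

def log_lik(mu):                          # log p(D | mu) for an array of particles
    return (-0.5 * ((y[None, :] - mu[:, None]) ** 2).sum(axis=1)
            - 0.5 * y.size * np.log(2 * np.pi))

def grad_U(mu, lam):                      # grad of U_t = -lam log p(D|mu) - log p(mu)
    return lam * (y.size * mu - y.sum()) + mu

def U(mu, lam):
    return -lam * log_lik(mu) + 0.5 * mu ** 2

def hmc_step(mu, lam, eps=0.05, L=10):
    """One HMC update per particle, targeting f_t ~ p(D|mu)^lam p(mu)."""
    v0 = rng.normal(size=mu.shape)
    m, v = mu.copy(), v0 - 0.5 * eps * grad_U(mu, lam)   # leapfrog integration
    for step in range(L):
        m = m + eps * v
        v = v - eps * grad_U(m, lam) * (1.0 if step < L - 1 else 0.5)
    log_acc = U(mu, lam) + 0.5 * v0 ** 2 - U(m, lam) - 0.5 * v ** 2
    accept = np.log(rng.uniform(size=mu.shape)) < log_acc
    return np.where(accept, m, mu)

M, T = 1_000, 100
lams = np.linspace(0.0, 1.0, T + 1)      # 0 = lam_0 < ... < lam_T = 1
mu = rng.normal(0.0, 1.0, size=M)        # particles start from the prior
log_w = np.zeros(M)
for t in range(1, T + 1):
    log_w += (lams[t] - lams[t - 1]) * log_lik(mu)   # weight update at theta^(t-1)
    mu = hmc_step(mu, lams[t])                        # then move under f_t
log_Z_ais = np.logaddexp.reduce(log_w) - np.log(M)   # Z_hat = mean of w^(T)
print(log_Z_ais)
```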


  9. Problems with Scalability
     Accurate estimates require $T \propto |\mathcal{D}|$ annealing steps.
     (1) HMC needs likelihood gradients: $O(|\mathcal{D}|)$ per step.
     (2) Importance weights need the full likelihood: $O(|\mathcal{D}|)$ per step.
     Overall complexity is more or less $O(|\mathcal{D}|^2)$.

  10. Stochastic Gradient HMC
      Simulate Langevin dynamics:
      $\dot{\theta} = v, \qquad \dot{v} = -\nabla U(\theta) - \gamma v + \sqrt{2\gamma}\, \xi, \qquad \langle \xi(t)\, \xi(t') \rangle = \delta(t - t')$
      Fokker–Planck equation:
      $\frac{\partial p}{\partial t} = \partial^{T} A \{ p\, \partial H + \partial p \}, \qquad A = \begin{pmatrix} 0 & -I \\ I & \gamma \end{pmatrix}$
      Canonical ensemble:
      $p_\infty(\theta, v) = \frac{1}{Z} e^{-H(\theta, v)}, \qquad H(\theta, v) = U(\theta) + \frac{1}{2} v^2$

  11. Stochastic Gradient HMC
      Euler–Maruyama discretization:
      $\Delta\theta = v, \qquad \Delta v = -\eta \nabla \hat{U}(\theta) - \alpha v + \mathcal{N}(0, 2(\alpha - \hat{\beta})\eta)$
      Mini-batch energy estimate:
      $\hat{U}(\theta) = -\frac{|\mathcal{D}|}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)$
      Time complexity: $O(|B|) \ll O(|\mathcal{D}|)$; this solves (1).
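
      Annotation: a sketch of this discretized update (my own rendering of the slide's
      equations; the gradient callables and hyperparameter values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

def sghmc_step(theta, v, batch, data_size, grad_log_lik, grad_log_prior,
               eta=1e-4, alpha=0.01, beta_hat=0.0):
    """One SGHMC step (Chen et al., 2014), in the slide's parameterization:
    dtheta = v,
    dv = -eta * grad(U_hat)(theta) - alpha * v + N(0, 2 (alpha - beta_hat) eta),
    with the mini-batch energy gradient
    grad(U_hat) = -(|D|/|B|) sum_{y in B} grad log p(y|theta) - grad log p(theta)."""
    theta = theta + v
    grad_U_hat = (-(data_size / len(batch)) * grad_log_lik(theta, batch)
                  - grad_log_prior(theta))
    noise = rng.normal(scale=np.sqrt(2.0 * (alpha - beta_hat) * eta),
                       size=np.shape(v))
    v = v - eta * grad_U_hat - alpha * v + noise
    return theta, v
```

      Each step touches only the mini-batch `batch`, giving the $O(|B|)$ cost claimed
      above; `beta_hat` estimates the gradient-noise contribution and is often simply
      set to 0.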


  12. Comparison of MCMC Trajectories
      [Figure: sample trajectories for RWMH, HMC, SGLD, and SGHMC]

  13. Bayesian Updating/Online Estimation
      Predictive distributions:
      $Z = \prod_n p(y_n \mid y_{<n}) = \prod_n \int p(y_n \mid \theta)\, p(\theta \mid y_{<n})\, \mathrm{d}\theta$
      Estimate $p(y_n \mid y_{<n})$ with AIS:
      $\theta_i^{(n)},\, \tilde{w}_i^{(n)} \leftarrow \mathrm{AIS}(y_n, \theta_i^{(n-1)})$
      Marginal likelihood:
      $\hat{Z} = \prod_n \frac{1}{M} \sum_{i=1}^M \tilde{w}_i^{(n)}$
      This solves (2).
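
      Annotation: the product over predictive factors becomes a sum of log-mean-weights.
      A tiny helper sketch (function and argument names are mine):

```python
import numpy as np

def log_evidence_from_increments(log_w_per_datum):
    """Combine per-datum AIS weights into
    log Z_hat = sum_n log( (1/M) sum_i w~_i^(n) ).
    `log_w_per_datum` has shape (N, M): one row of particle
    log-weights for each data point processed online."""
    log_w = np.asarray(log_w_per_datum, dtype=float)
    M = log_w.shape[1]
    return float((np.logaddexp.reduce(log_w, axis=1) - np.log(M)).sum())
```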


  14. Stochastic Gradient Annealed Importance Sampling
      Intermediate distributions:
      $f_n^{(\lambda)}(\theta) = p(y_n \mid \theta)^{\lambda} \left[ \prod_{k<n} p(y_k \mid \theta) \right] p(\theta)$
      Update particles with SGHMC:
      $\hat{U}_n^{(\lambda)}(\theta) = -\lambda \log p(y_n \mid \theta) - \frac{n-1}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)$
      Importance weights:
      $w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(y_n \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}$
      ML estimator:
      $\hat{Z} = \frac{1}{M} \sum_{i=1}^M w_i^{(T)}$
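
      Annotation: a direct transcription of the mini-batch energy $\hat{U}_n^{(\lambda)}$
      as a sketch; `log_lik_one` and `log_prior` are assumed model callables, not names
      from the slides:

```python
def U_hat(theta, lam, y_n, batch_past, n, log_lik_one, log_prior):
    """U_hat = -lam log p(y_n|theta)
               - (n-1)/|B| sum_{y in B} log p(y|theta)
               - log p(theta),
    where B (`batch_past`) is a mini-batch from the first n-1 data points."""
    past = 0.0
    if n > 1:
        past = (n - 1) / len(batch_past) * sum(
            log_lik_one(y, theta) for y in batch_past)
    return -lam * log_lik_one(y_n, theta) - past - log_prior(theta)
```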

  15. Results
      Gaussian mixture model:
      • vs. nested sampling
      • vs. annealed importance sampling

  16. Parameter Sensitivity
      Adaptive annealing schedule.
      [Figure; blue ≈ no annealing steps]

  17. Distribution Shift
      Data may change over time.
      [Figure panels: $1 \le n \le 10^3$, $10^3 < n \le 10^4$, $10^4 < n \le 10^5$, total]

  18. Distribution Shift
      [Figure; dashed lines = shuffled data]

  19. Thank You!
      [1] Cameron, S.A.; Eggers, H.C.; Kroon, S. Stochastic Gradient Annealed Importance Sampling for Efficient Online Marginal Likelihood Estimation. Entropy 21(11) (2019).
      [2] Chen, T.; Fox, E.; Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. Proceedings of the 31st International Conference on Machine Learning (ICML) (2014).
      Funded by NITheP (National Institute of Theoretical Physics). Paper sponsored by MaxEnt 2019. Big thanks to Hans and Steve!

  20. Extra Slides

  21. SGAIS
      Algorithm 1: Stochastic Gradient Annealed Importance Sampling
       1: ∀i: sample θ_i ∼ p(θ)
       2: ∀i: w_i ← 1
       3: for n = 1, ..., N do
       4:     λ ← 0
       5:     while λ < 1 do
       6:         Δ ← argmin_Δ [ESS(Δ) − ESS*]
       7:         λ ← λ + Δ
       8:         ∀i: w_i ← w_i · p(y_n | θ_i)^Δ
                  ▷ optionally resample particles
       9:         ∀i: θ_i ← SGHMC(θ_i, Û_n^(λ))
      10:     end while
      11: end for
      12: return Ẑ = (1/M) Σ_i w_i
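
      Annotation: a runnable end-to-end sketch of Algorithm 1 on the earlier toy
      Gaussian model. All constants are illustrative; an SGLD kernel stands in for
      SGHMC for brevity, the optional resampling step is omitted, and the ESS
      criterion is applied to the incremental weights (one common reading of line 6):

```python
import numpy as np

rng = np.random.default_rng(3)

ys = rng.normal(1.0, 1.0, size=200)      # stream: mu ~ N(0,1), y_n ~ N(mu,1)
M = 500                                   # number of particles
ESS_TARGET = 0.5 * M                      # ESS* in line 6

def log_lik_one(y, mu):                   # log p(y | mu) for an array of particles
    return -0.5 * (y - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

def ess(log_w):                           # effective sample size of the weights
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / (w ** 2).sum()

def sgld_step(mu, lam, y_n, seen, n, eta=1e-3, batch=32):
    """Langevin move under U_hat_n^(lam); an SGLD stand-in for SGHMC."""
    grad = lam * (mu - y_n) + mu          # -lam d log p(y_n|mu) - d log p(mu)
    if n > 1:                             # mini-batch term over past data
        B = rng.choice(seen, size=min(batch, n - 1), replace=False)
        grad += (n - 1) / B.size * (B.size * mu - B.sum())
    return mu - eta * grad + rng.normal(scale=np.sqrt(2.0 * eta), size=mu.shape)

mu = rng.normal(0.0, 1.0, size=M)         # line 1: sample particles from the prior
log_w = np.zeros(M)                       # line 2: w_i <- 1
for n, y_n in enumerate(ys, start=1):     # line 3
    lam = 0.0                             # line 4
    while lam < 1.0:                      # line 5
        ll = log_lik_one(y_n, mu)
        # line 6: largest Delta whose incremental ESS stays above ESS* (bisection)
        rem = 1.0 - lam
        if ess(rem * ll) >= ESS_TARGET:
            delta = rem
        else:
            lo, hi = 0.0, rem
            for _ in range(30):
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if ess(mid * ll) >= ESS_TARGET else (lo, mid)
            delta = max(lo, 1e-6)         # floor to guarantee progress
        lam += delta                      # line 7
        log_w += delta * ll               # line 8: w_i <- w_i p(y_n|theta_i)^Delta
        for _ in range(5):                # line 9: move particles with SG-MCMC
            mu = sgld_step(mu, lam, y_n, ys[:n - 1], n)
print(np.logaddexp.reduce(log_w) - np.log(M))   # line 12: log Z_hat
```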

  22. Number of Particles

  23. Effective Sample Size

  24. Learning Rate

  25. Burn-in

  26. Learning Rate × Burn-in

  27. Batch Size
