SLIDE 1

Stochastic Gradient Annealed Importance Sampling

Scott Cameron Hans Eggers Steve Kroon

Stellenbosch University NITheP

SLIDE 2

Motivation

Stochastic optimization

SLIDE 3

Motivation

Goal: Efficient large-scale marginal likelihood estimation using mini-batches

SLIDE 4

Marginal Likelihood (Evidence)

Consider a Bayesian model with data D = \{y_n\}_{n=1}^{N} and joint distribution

p(D, \theta) = p(\theta) \prod_n p(y_n \mid \theta)

Posterior given by Bayes' theorem:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

Marginal likelihood:

Z := p(D) = \int p(D \mid \theta)\, p(\theta)\, \mathrm{d}\theta

Posterior predictive:

p(y' \mid D) = \int p(y' \mid \theta)\, p(\theta \mid D)\, \mathrm{d}\theta

SLIDE 5

Model Comparison/Combination

Posterior odds over models M_1, M_2, \dots:

\frac{P(M_1 \mid D)}{P(M_2 \mid D)} = \frac{Z_1}{Z_2} \cdot \frac{p(M_1)}{p(M_2)}

M_1 is a 'better' model than M_2 if Z_1 \gg Z_2.

Combined predictions:

p(y' \mid D) = \frac{\sum_i p(y' \mid D, M_i)\, Z_i\, p(M_i)}{\sum_i Z_i\, p(M_i)}

This weighs models proportionately to how well they describe the data.
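To make the weighting concrete, here is a small sketch (not from the talk; all numbers are invented) of turning log-evidences into posterior model probabilities, normalizing in log space for numerical stability:

```python
import numpy as np

# Hypothetical log-evidences and a uniform model prior; the numbers are
# illustrative, not results from the talk.
log_Z = np.array([-105.2, -110.7])        # log Z_1, log Z_2
log_prior = np.log(np.array([0.5, 0.5]))  # p(M_1), p(M_2)

# P(M_i | D) is proportional to Z_i * p(M_i); normalize in log space.
log_post = log_Z + log_prior
log_post -= np.logaddexp.reduce(log_post)
post = np.exp(log_post)
print(post)  # M_1 dominates, since Z_1 >> Z_2
```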

SLIDE 6

Why is this difficult?

Example model:

\mu \sim \mathcal{N}(0, 1), \qquad y_n \sim \mathcal{N}(\mu, 1)

Naive estimator:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} p(D \mid \mu_i), \qquad \mu_i \sim p(\mu)
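A quick numerical sketch of the failure mode (settings are illustrative): this conjugate model has a closed-form evidence, and when the data sit far from the prior, prior samples almost never reach the region where the likelihood is large, so the naive estimate falls far short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate toy model mu ~ N(0,1), y_n ~ N(mu,1). The data are deliberately
# generated far from the prior (true mean 6) so prior and posterior barely
# overlap; N, M and the means are illustrative choices.
N = 200
y = rng.normal(6.0, 1.0, size=N)
S, Q = y.sum(), (y ** 2).sum()

# Closed-form log-evidence for this conjugate pair.
log_Z_true = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1)
              - 0.5 * Q + S ** 2 / (2 * (N + 1)))

# Naive estimator: Z_hat = (1/M) * sum_i p(D | mu_i) with mu_i ~ p(mu).
M = 10_000
mu = rng.normal(size=M)
log_lik = (-0.5 * N * np.log(2 * np.pi)
           - 0.5 * ((y[:, None] - mu[None, :]) ** 2).sum(axis=0))
log_Z_hat = np.logaddexp.reduce(log_lik) - np.log(M)

print(log_Z_true, log_Z_hat)  # the naive estimate falls far short
```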

SLIDE 7

Why is this difficult?

Prior sampling consistently underestimates, while the harmonic mean consistently overestimates.

SLIDE 8

Annealed Importance Sampling

Adiabatically decrease the temperature: 0 = \lambda_0 < \cdots < \lambda_T = 1

f_t(\theta) = p(D \mid \theta)^{\lambda_t}\, p(\theta)

Update particles with HMC¹:

U_t(\theta) = -\lambda_t \log p(D \mid \theta) - \log p(\theta)

Iterated importance sampling:

w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(D \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}

Estimator:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} w_i^{(T)}

¹Hamiltonian Monte Carlo
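The annealing loop above can be sketched on the same conjugate toy model. A random-walk Metropolis move stands in for HMC here, and the schedule, step size and particle count are illustrative choices, not the talk's settings:

```python
import numpy as np

rng = np.random.default_rng(1)

# AIS sketch on the toy model mu ~ N(0,1), y_n ~ N(mu,1).
N = 100
y = rng.normal(2.0, 1.0, size=N)

def log_lik(mu):      # log p(D | mu), vectorized over a particle array
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * ((y[:, None] - mu[None, :]) ** 2).sum(axis=0))

def log_prior(mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * mu ** 2

M, T = 500, 300
lam = np.linspace(0.0, 1.0, T + 1)
mu = rng.normal(size=M)               # lambda_0 = 0: sample from the prior
log_w = np.zeros(M)

for t in range(1, T + 1):
    # w_i <- w_i * p(D | theta_i)^(lambda_t - lambda_{t-1})
    log_w += (lam[t] - lam[t - 1]) * log_lik(mu)
    for _ in range(2):                # MCMC moves targeting f_t
        prop = mu + 0.3 * rng.normal(size=M)
        log_acc = (lam[t] * log_lik(prop) + log_prior(prop)
                   - lam[t] * log_lik(mu) - log_prior(mu))
        mu = np.where(np.log(rng.uniform(size=M)) < log_acc, prop, mu)

log_Z_hat = np.logaddexp.reduce(log_w) - np.log(M)

# closed-form evidence for this conjugate pair, for comparison
S, Q = y.sum(), (y ** 2).sum()
log_Z_true = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1)
              - 0.5 * Q + S ** 2 / (2 * (N + 1)))
print(log_Z_hat, log_Z_true)  # the two should be close
```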

SLIDE 10

Problems with Scalability

Accurate estimates require T ∝ |D|

  • (1) HMC needs likelihood gradients: O(|D|)
  • (2) Importance weights need the full likelihood: O(|D|)

Altogether, roughly O(|D|^2) complexity

SLIDE 11

Stochastic Gradient HMC

Simulate Langevin dynamics:

\dot{\theta} = v, \qquad \dot{v} = -\nabla U(\theta) - \gamma v + \sqrt{2\gamma}\, \xi, \qquad \langle \xi(t)\, \xi(t') \rangle = \delta(t - t')

Fokker–Planck equation²:

\frac{\partial p}{\partial t} = \partial^{T} A \left( p\, \partial H + \partial p \right), \qquad A = \begin{pmatrix} 0 & -I \\ I & \gamma \end{pmatrix}

Canonical ensemble:

p_\infty(\theta, v) = \frac{1}{Z} e^{-H(\theta, v)}

²H(\theta, v) = U(\theta) + \frac{1}{2} v^2

SLIDE 12

Stochastic Gradient HMC

Euler–Maruyama discretization:

\Delta\theta = v, \qquad \Delta v = -\eta \nabla \hat{U}(\theta) - \alpha v + \mathcal{N}(0,\ 2(\alpha - \hat{\beta})\eta)

Mini-batch energy estimate:

\hat{U}(\theta) = -\frac{|D|}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)

Time complexity O(|B|) ≪ O(|D|): solves (1)
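A minimal sketch of the discretized dynamics on the conjugate toy model (step size, friction and batch size are illustrative, and \hat{\beta} is taken as 0):

```python
import numpy as np

rng = np.random.default_rng(2)

# SGHMC sketch for the toy model mu ~ N(0,1), y_n ~ N(mu,1). eta, alpha,
# B and the beta_hat = 0 choice are illustrative.
N = 1000
y = rng.normal(1.0, 1.0, size=N)

def grad_U_hat(mu, batch):
    # gradient of the minibatch energy
    # U_hat(mu) = -(|D|/|B|) * sum_{y in B} log p(y|mu) - log p(mu)
    return -(N / len(batch)) * (batch - mu).sum() + mu

eta, alpha, B = 1e-4, 0.1, 100
mu, v = 0.0, 0.0
samples = []
for step in range(20_000):
    batch = rng.choice(y, size=B, replace=False)
    mu += v
    v += (-eta * grad_U_hat(mu, batch) - alpha * v
          + rng.normal(0.0, np.sqrt(2 * alpha * eta)))
    if step >= 5_000:
        samples.append(mu)

# With beta_hat = 0 the minibatch gradient noise widens the stationary
# distribution a little, but the posterior mean N*ybar/(N+1) is recovered.
print(np.mean(samples), N * y.mean() / (N + 1))
```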

SLIDE 14

Comparison of MCMC Trajectories

(Trajectory plots: RWMH, HMC, SGLD, SGHMC)

SLIDE 15

Bayesian Updating/Online Estimation

Predictive distributions:

Z = \prod_n p(y_n \mid y_{<n}) = \prod_n \int p(y_n \mid \theta)\, p(\theta \mid y_{<n})\, \mathrm{d}\theta

Estimate p(y_n \mid y_{<n}) with AIS:

\theta_i^{(n)},\ \tilde{w}_i^{(n)} \leftarrow \mathrm{AIS}(y_n, \theta_i^{(n-1)})

Marginal likelihood:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} \prod_n \tilde{w}_i^{(n)}

Solves (2)
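For a conjugate model the one-step predictives are available in closed form, which makes it easy to check the chain-rule decomposition Z = \prod_n p(y_n \mid y_{<n}) numerically (a sketch, with illustrative settings):

```python
import numpy as np

# Toy model mu ~ N(0,1), y_n ~ N(mu,1): each predictive p(y_n | y_<n) is
# Gaussian, so the sequential product can be compared to the batch evidence.
rng = np.random.default_rng(3)
N = 50
y = rng.normal(1.0, 1.0, size=N)

log_Z_seq = 0.0
m, s2 = 0.0, 1.0                      # running posterior N(m, s2) over mu
for yn in y:
    var = s2 + 1.0                    # predictive variance: posterior + noise
    log_Z_seq += -0.5 * np.log(2 * np.pi * var) - 0.5 * (yn - m) ** 2 / var
    # conjugate Gaussian update of the posterior over mu
    m = (m / s2 + yn) / (1 / s2 + 1)
    s2 = 1 / (1 / s2 + 1)

# batch evidence in closed form, for comparison
S, Q = y.sum(), (y ** 2).sum()
log_Z_batch = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.log(N + 1)
               - 0.5 * Q + S ** 2 / (2 * (N + 1)))
print(log_Z_seq, log_Z_batch)  # the two agree
```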

SLIDE 17

Stochastic Gradient Annealed Importance Sampling

Intermediate distributions:

f_n^{(\lambda)}(\theta) = p(y_n \mid \theta)^{\lambda} \left[ \prod_{k<n} p(y_k \mid \theta) \right] p(\theta)

Update particles with SGHMC:

\hat{U}_n^{(\lambda)}(\theta) = -\lambda \log p(y_n \mid \theta) - \frac{n-1}{|B|} \sum_{y \in B} \log p(y \mid \theta) - \log p(\theta)

Importance weights:

w_i^{(t)} \leftarrow w_i^{(t-1)}\, p(y_n \mid \theta_i^{(t-1)})^{\lambda_t - \lambda_{t-1}}

ML estimator:

\hat{Z} = \frac{1}{M} \sum_{i=1}^{M} w_i^{(T)}

SLIDE 18

Results

Gaussian mixture model

  • vs nested sampling
  • vs annealed importance sampling

SLIDE 19

Parameter sensitivity

Adaptive annealing schedule

  • Blue ≈ no annealing steps

SLIDE 20

Distribution Shift

Data may change over time

(Panels: 1 ≤ n ≤ 10^3,  10^3 < n ≤ 10^4,  10^4 < n ≤ 10^5,  total)

SLIDE 21

Distribution Shift

Dashed lines = shuffled data

SLIDE 22

Thank You!

[1] Cameron, S.A.; Eggers, H.C.; Kroon, S. Stochastic Gradient Annealed Importance Sampling for Efficient Online Marginal Likelihood Estimation. Entropy 21.11 (2019).
[2] Chen, T.; Fox, E.; Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. Proceedings of ICML, vol. 5 (2014).

Funded by NITheP³. Paper sponsored by MaxEnt 2019. Big thanks to Hans and Steve!

³National Institute of Theoretical Physics

SLIDE 23

Extra Slides

SLIDE 24

SGAIS

Algorithm 1 Stochastic Gradient Annealed Importance Sampling

1: ∀i: sample θ_i ∼ p(θ)
2: ∀i: w_i ← 1
3: for n = 1, …, N do
4:     λ ← 0
5:     while λ < 1 do
6:         Δ ← argmin_Δ |ESS(Δ) − ESS*|
7:         λ ← λ + Δ
8:         ∀i: w_i ← w_i · p(y_n | θ_i)^Δ    ▷ optionally resample particles
9:         ∀i: θ_i ← SGHMC(θ_i, Û_n^{(λ)})
10:     end while
11: end for
12: return Ẑ = (1/M) ∑_i w_i
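Line 6's adaptive choice of Δ can be sketched as a bisection that keeps the effective sample size of the reweighted particles at a target ESS* (helper names and all numbers are illustrative):

```python
import numpy as np

def ess(log_w):
    # effective sample size of normalized importance weights
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / (w ** 2).sum()

def next_step(log_w, log_lik_n, lam, ess_target):
    # largest Delta in (0, 1 - lam] whose reweighting keeps ESS >= ess_target,
    # found by bisection (assumes ESS shrinks as Delta grows)
    lo, hi = 0.0, 1.0 - lam
    if ess(log_w + hi * log_lik_n) >= ess_target:
        return hi                     # can jump straight to lambda = 1
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if ess(log_w + mid * log_lik_n) >= ess_target:
            lo = mid
        else:
            hi = mid
    return lo

# illustrative usage with made-up particle weights and log-likelihoods
rng = np.random.default_rng(4)
log_w = np.zeros(256)                          # uniform weights
log_lik_n = rng.normal(-3.0, 2.0, size=256)    # hypothetical log p(y_n|theta_i)
delta = next_step(log_w, log_lik_n, lam=0.0, ess_target=128.0)
print(delta)  # temperature increment that keeps ESS near the target
```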

SLIDE 25

Number of Particles

SLIDE 26

Effective Sample Size

SLIDE 27

Learning Rate

SLIDE 28

Burnin

SLIDE 29

Learning Rate × Burnin

SLIDE 30

Batch Size
