The Variational Predictive Natural Gradient
Da Tang1 Rajesh Ranganath2
1Columbia University 2New York University
June 12, 2019
Variational Inference
◮ Latent variable models: p(x, z; θ) = p(z)p(x|z; θ).
◮ Variational inference approximates the posterior through maximizing the ELBO:

L(λ, θ) = Eq[log p(x|z; θ)] − KL(q(z|x; λ) || p(z)).
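The ELBO above can be estimated by Monte Carlo. A minimal sketch for a toy model with p(z) = N(0, I), an illustrative Gaussian likelihood p(x|z) = N(z, σ²I), and a diagonal Gaussian variational family (all names and the model choice are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lik(x, z, sigma=1.0):
    # log p(x | z; theta) for an illustrative Gaussian observation model
    # x ~ N(z, sigma^2 I); any differentiable likelihood would do.
    d = x.size
    return (-0.5 * np.sum((x - z) ** 2) / sigma**2
            - 0.5 * d * np.log(2.0 * np.pi * sigma**2))

def elbo(x, mu, log_s, n_samples=2000):
    # L(lam, theta) = E_q[log p(x|z; theta)] - KL(q(z|x; lam) || p(z)),
    # with q(z|x; lam) = N(mu, diag(exp(log_s)^2)) and p(z) = N(0, I).
    s = np.exp(log_s)
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + s * eps                               # reparameterized samples from q
    ll = np.mean([log_lik(x, zi) for zi in z])     # Monte Carlo likelihood term
    kl = 0.5 * np.sum(s**2 + mu**2 - 1.0 - 2.0 * log_s)  # closed-form Gaussian KL
    return ll - kl

x = np.array([0.5, -1.0])
print(elbo(x, mu=x / 2.0, log_s=np.zeros(2)))
```

The KL term has a closed form for Gaussians, so only the expected log-likelihood needs sampling.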
◮ The q-Fisher information

Fq = Eq[∇λ log q(z|x; λ) · ∇λ log q(z|x; λ)⊤]

(Hoffman et al., 2013) approximates the negative Hessian of the objective.
◮ The natural gradient: ∇λ^NG L(λ) = Fq⁻¹ · ∇λL(λ).
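For a Gaussian variational family with mean parameter λ and fixed covariance C, the q-Fisher information is exactly C⁻¹, which makes the preconditioning easy to check numerically. A sketch (C, λ, and the sample size are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# q(z; lam) = N(lam, C) for a fixed, known covariance C (illustrative choice).
C = np.array([[1.0, 0.9], [0.9, 1.0]])
C_inv = np.linalg.inv(C)
lam = np.array([0.3, -0.2])

def score(z):
    # grad_lam log q(z; lam) = C^{-1} (z - lam) for the Gaussian above.
    return C_inv @ (z - lam)

# Monte Carlo estimate of F_q = E_q[score score^T].
zs = rng.multivariate_normal(lam, C, size=50000)
scores = np.array([score(z) for z in zs])
F_q = scores.T @ scores / len(zs)

# For a Gaussian mean parameter, F_q equals C^{-1} exactly, so the
# Monte Carlo estimate should be close to C_inv.
print(np.round(F_q, 2))

# The natural gradient preconditions the ordinary gradient by F_q^{-1}.
grad = np.array([1.0, 0.0])
nat_grad = np.linalg.solve(F_q, grad)
```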
◮ The curvature of the ELBO may be pathological.
◮ Example: a bivariate Gaussian model with unknown mean and known covariance

Σ = [[1, 1 − ε], [1 − ε, 1]].

Figure: gradient vs. VPNG update directions from the current iterate toward the optimum.

◮ The natural gradient fails to help.
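The pathology can be made concrete: the eigenvalues of Σ above are 2 − ε and ε, so the precision matrix (which drives the ELBO curvature) has condition number (2 − ε)/ε, blowing up as ε → 0. A small sketch (function name is ours):

```python
import numpy as np

def condition_number(eps):
    # Covariance from the bivariate Gaussian example; its precision
    # matrix determines the curvature of the ELBO.
    Sigma = np.array([[1.0, 1.0 - eps], [1.0 - eps, 1.0]])
    return np.linalg.cond(np.linalg.inv(Sigma))

# Eigenvalues of Sigma are (2 - eps) and eps, so the condition number is
# (2 - eps) / eps: the curvature degenerates as eps -> 0.
for eps in (0.5, 0.1, 0.01):
    print(eps, condition_number(eps))
```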
Limitations of the q-Fisher information:
◮ Approximates the Hessian of the objective well only when q(z|x; λ) ≈ p(z|x; θ).
◮ Ignores the model likelihood p(x|z; θ) in its computation.
◮ Construct a positive definite matrix that resembles the negative Hessian of the
expected log-likelihood part Lll = Eq(z|x;λ) [log p(x|z; θ)] of the ELBO.
◮ Reparameterize the variational distribution q:

z = g(x, ε; λ) ∼ q(z|x; λ) ⇐⇒ ε ∼ s(ε).
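For a diagonal Gaussian variational family, the reparameterization takes the familiar location-scale form z = μ + σ ⊙ ε with ε ∼ N(0, I). A sketch (treating μ and log σ directly as λ rather than as outputs of an inference network; that simplification is ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def g(x, eps, lam):
    # z = g(x, eps; lam) for a diagonal Gaussian q(z|x; lam).  In an
    # amortized setup, mu and log_sigma would be outputs of an inference
    # network applied to x; here they are taken directly as lam.
    mu, log_sigma = lam
    return mu + np.exp(log_sigma) * eps

mu = np.array([1.0, -2.0])
log_sigma = np.array([0.0, np.log(0.5)])
lam = (mu, log_sigma)

eps = rng.standard_normal((100000, 2))   # eps ~ s(eps) = N(0, I)
z = g(None, eps, lam)                    # x unused in this non-amortized sketch

# The transformed samples are distributed as q(z|x; lam) = N(mu, diag(sigma^2)).
print(np.round(z.mean(axis=0), 2), np.round(z.std(axis=0), 2))
```

The point of the transform is that samples from q become a deterministic, differentiable function of λ, so gradients (and the Fisher information below) can pass through the sampling step.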
◮ The variational predictive Fisher information:

Fr = Eε[ Ep(x′|z=g(x,ε;λ);θ)[ ∇λ,θ log p(x′|z = g(x, ε; λ); θ) · ∇λ,θ log p(x′|z = g(x, ε; λ); θ)⊤ ] ],

exactly the "expected" Fisher information of the reparameterized predictive distribution p(x′|z = g(x, ε; λ); θ).
◮ The variational predictive Fisher captures the curvature of variational inference.
◮ Matrix spectrum comparison (for the bivariate Gaussian example):

Figure: (d) precision matrix Σ⁻¹, (e) q-Fisher information Fq, (f) our Fisher information Fr.
◮ The variational predictive natural gradient (VPNG):

∇λ,θ^VPNG L = Fr⁻¹ · ∇λ,θ L(λ, θ).
◮ In practice, use Monte Carlo estimates to approximate Fr and add a small damping term to ensure invertibility.
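A minimal sketch of this estimate: given per-sample score vectors ∇λ,θ log p(x′|z = g(x, εᵢ; λ); θ) (model-specific, assumed supplied by the caller), average their outer products, add a damping term, and solve against the ELBO gradient. The function name and example numbers are ours:

```python
import numpy as np

rng = np.random.default_rng(3)

def vpng_step(grad, scores, damping=1e-3):
    # scores[i] is a Monte Carlo draw of
    #   grad_{lam,theta} log p(x'| z = g(x, eps_i; lam); theta)
    # (model-specific; assumed computed by the caller).  Average the outer
    # products to estimate F_r, damp for invertibility, and precondition
    # the ELBO gradient.
    n, d = scores.shape
    F_r = scores.T @ scores / n + damping * np.eye(d)
    return np.linalg.solve(F_r, grad)

# Illustrative scores with strongly correlated coordinates, mimicking the
# pathological-curvature example.
scores = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.99], [0.99, 1.0]], size=5000)
grad = np.array([1.0, 1.0])
step = vpng_step(grad, scores)
print(step)
```

Solving the damped linear system avoids forming Fr⁻¹ explicitly, which is the usual numerically stable choice.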
◮ Tested on synthetic data with high correlations.
◮ Empirical results:

Method     Train AUC        Test AUC
Gradient   0.734 ± 0.017    0.718 ± 0.022
NG         0.744 ± 0.043    0.751 ± 0.047
VPNG       0.972 ± 0.011    0.967 ± 0.011

Table: Bayesian logistic regression AUC.
Figure: Learning curves (train and test ELBO vs. wall-clock time) of variational autoencoders (upper) and variational matrix factorization (lower) on real datasets, comparing Gradient, NG, and VPNG.
◮ The VPNG corrects for curvature in the variational inference objective, jointly across the variational and model parameters.
◮ Future work includes extending to general Bayesian networks with multiple stochastic layers.
Code available at https://github.com/datang1992/VPNG.