SLIDE 1: Deep Variational Inference

FLARE Reading Group Presentation
Wesley Tansey
9/28/2016

SLIDE 2: What is Variational Inference?

SLIDE 3: What is Variational Inference?

  • Want to estimate some distribution, p*(x)

[Figure: p*(x)]

SLIDE 4: What is Variational Inference?

  • Want to estimate some distribution, p*(x)
  • Too expensive to estimate

[Figure: p*(x)]

SLIDE 5: What is Variational Inference?

  • Want to estimate some distribution, p*(x)
  • Too expensive to estimate
  • Approximate it with a tractable distribution, q(x)

[Figure: p*(x) and q(x)]

SLIDE 6: What is Variational Inference?

  • Fit q(x) inside of p*(x)
  • Centered at a single mode
    ○ q(x) is unimodal here
    ○ VI is a MAP estimate

[Figure: p*(x) and q(x)]

SLIDE 7: What is Variational Inference?

  • Mathematically:

    KL(q || p*) = Σ_x q(x) log(q(x) / p*(x))

Still hard! p*(x) usually has a tricky normalizing constant.

SLIDE 8: What is Variational Inference?

  • Mathematically:

    KL(q || p*) = Σ_x q(x) log(q(x) / p*(x))

  • Use the unnormalized p~ instead

SLIDE 9: What is Variational Inference?

  • Mathematically:

    KL(q || p*) = Σ_x q(x) log(q(x) / p*(x))

  • Use the unnormalized p~ instead:

    log(q(x) / p*(x)) = log(q(x)) - log(p*(x))
                      = log(q(x)) - log(p~(x) / Z)
                      = log(q(x)) - log(p~(x)) + log(Z)

SLIDE 10: What is Variational Inference?

  • Mathematically:

    KL(q || p*) = Σ_x q(x) log(q(x) / p*(x))

  • Use the unnormalized p~ instead:

    log(q(x) / p*(x)) = log(q(x)) - log(p*(x))
                      = log(q(x)) - log(p~(x) / Z)
                      = log(q(x)) - log(p~(x)) + log(Z)

  • log(Z) is a constant => can ignore it in our optimization problem
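
A quick numeric check of that claim (a toy discrete example of my own, not from the slides): the objective computed with the unnormalized p~ differs from the true KL(q || p*) only by the constant log(Z), so minimizing either one gives the same q.

```python
import numpy as np

p_tilde = np.array([2.0, 1.0, 0.5, 0.5])        # unnormalized target p~(x)
Z = p_tilde.sum()                                # normalizing constant
p_star = p_tilde / Z                             # true target p*(x)
q = np.array([0.4, 0.3, 0.2, 0.1])               # a candidate approximation q(x)

kl_true = np.sum(q * np.log(q / p_star))           # KL(q || p*)
kl_unnorm = np.sum(q * np.log(q / p_tilde))        # same sum with p~ in place of p*
print(np.isclose(kl_true, kl_unnorm + np.log(Z)))  # True: they differ by exactly log Z
```
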
SLIDE 11: Mean Field VI

  • Classical method
  • Uses a factorized q:

    q(x) = ∏_i q_i(x_i)

[1] Blei, Ng, Jordan, "Latent Dirichlet Allocation", JMLR, 2003.

SLIDE 12: Mean Field VI

  • Example: multivariate Gaussian
  • Product of independent Gaussians for q
  • Spherical covariance underestimates the true covariance
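
A small illustration of the underestimation (a 2-D toy example of my own, not from the slides): for a correlated Gaussian target and the KL(q || p*) objective, the optimal fully factorized Gaussian q has per-dimension variance 1/Λ_ii (the reciprocal of the precision's diagonal), which is smaller than the true marginal variance Σ_ii.

```python
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])            # true covariance of p*(x)
Lambda = np.linalg.inv(Sigma)             # precision matrix of p*(x)

true_marginal_var = np.diag(Sigma)        # [1.0, 1.0]
mean_field_var = 1.0 / np.diag(Lambda)    # [0.19, 0.19]: badly underestimated
print(true_marginal_var, mean_field_var)
```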

SLIDE 13: Variational Bayes

  • Vanilla mean field VI assumes you know all the parameters, θ, of the true distribution, p*(x)

[1] Blei, Ng, Jordan, "Latent Dirichlet Allocation", JMLR, 2003.

SLIDE 14: Variational Bayes

  • Vanilla mean field VI assumes you know all the parameters, θ, of the true distribution, p*(x)
  • Enter: Variational Bayes (VB)

[1] Blei, Ng, Jordan, "Latent Dirichlet Allocation", JMLR, 2003.

SLIDE 15: Variational Bayes

  • VB infers both the latent q(x) variables, z, and the p*(x) parameters, θ
  • VB-EM was popularized for LDA [1]
    ○ E-step for z, M-step for θ

[1] Blei, Ng, Jordan, "Latent Dirichlet Allocation", JMLR, 2003.

SLIDE 16: Variational Bayes

  • VB usually uses a mean field approximation of the form:

    q(x) = q(z_i | θ) ∏_i q_i(x_i | z_i)

SLIDE 17: Issues with Mean Field VB

  • Requires analytical solutions of expectations w.r.t. q_i
    ○ Intractable in general
  • Factored form limits the power of the approximation

SLIDE 18: Issues with Mean Field VB

  • Requires analytical solutions of expectations w.r.t. q_i
    ○ Intractable in general
  • Factored form limits the power of the approximation

Solution: Auto-Encoding Variational Bayes (Kingma and Welling, 2014)

SLIDE 19: Issues with Mean Field VB

  • Requires analytical solutions of expectations w.r.t. q_i
    ○ Intractable in general
  • Factored form limits the power of the approximation

Solution: Auto-Encoding Variational Bayes (Kingma and Welling, 2014)

Solution: Variational Inference with Normalizing Flows (Rezende and Mohamed, 2015)

SLIDE 20: Auto-Encoding Variational Bayes [1]

High-level idea:
  1) Optimize the same lower bound that we get in VB
  2) A reparameterization (data augmentation) trick leads to a lower-variance estimator
  3) Lots of choices of q(z | x) and p(z) lead to a partial closed form
  4) Use a neural network to parameterize qϕ(z | x) and pθ(x | z)
  5) SGD to fit everything

[1] Kingma and Welling, "Auto-Encoding Variational Bayes", ICLR, 2014.

SLIDE 21: 1) VB Lower Bound

  • Given N i.i.d. data points, (x(1), ..., x(N))
  • Maximize the marginal likelihood:

    log pθ(x(1), ..., x(N)) = Σ_i log pθ(x(i))

SLIDE 22: 1) VB Lower Bound

  • Given N i.i.d. data points, (x(1), ..., x(N))
  • Maximize the marginal likelihood:

    log pθ(x(1), ..., x(N)) = Σ_i log pθ(x(i))

SLIDE 23: 1) VB Lower Bound

  • Given N i.i.d. data points, (x(1), ..., x(N))
  • Maximize the marginal likelihood:

    log pθ(x(1), ..., x(N)) = Σ_i log pθ(x(i))

[Equation annotation: "Always positive"]

SLIDE 24: 1) VB Lower Bound

  • Given N i.i.d. data points, (x(1), ..., x(N))
  • Maximize the marginal likelihood:

    log pθ(x(1), ..., x(N)) = Σ_i log pθ(x(i))

[Equation annotations: "Always positive", "Lower bound"]

SLIDE 25: 1) VB Lower Bound

  • Write the lower bound

SLIDE 26: 1) VB Lower Bound

  • Write the lower bound

Anyone want the derivation?
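
Since the slide asks: here is the standard derivation (my reconstruction in LaTeX, using the AEVB paper's notation). The per-datapoint log-likelihood splits into the variational lower bound L plus a KL term that is never negative, so L is indeed a lower bound on log pθ(x(i)).

```latex
\begin{aligned}
\log p_\theta(x^{(i)})
  &= \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\left[\log p_\theta(x^{(i)})\right] \\
  &= \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\left[\log \frac{p_\theta(x^{(i)}, z)}{p_\theta(z \mid x^{(i)})}\right] \\
  &= \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\left[\log \frac{p_\theta(x^{(i)}, z)}{q_\phi(z \mid x^{(i)})}\right]
   + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\left[\log \frac{q_\phi(z \mid x^{(i)})}{p_\theta(z \mid x^{(i)})}\right] \\
  &= \underbrace{\mathcal{L}(\theta, \phi; x^{(i)})}_{\text{lower bound}}
   + \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x^{(i)}) \,\middle\|\, p_\theta(z \mid x^{(i)})\right)}_{\ge\, 0}
\end{aligned}
```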

SLIDE 27: 1) VB Lower Bound

  • Write the lower bound
  • Rewrite the lower bound

SLIDE 28: 1) VB Lower Bound

  • Write the lower bound
  • Rewrite the lower bound

SLIDE 29: 1) VB Lower Bound

  • Write the lower bound
  • Rewrite the lower bound

Derivation?

SLIDE 30: 1) VB Lower Bound

  • Write the lower bound
  • Rewrite the lower bound
  • Monte Carlo gradient estimator of the expectation part

SLIDE 31: 1) VB Lower Bound

  • Write the lower bound
  • Rewrite the lower bound
  • Monte Carlo gradient estimator of the expectation part
    ○ Too high variance (see the sketch below)
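
A toy illustration of the variance problem (my own example, not from the paper): estimate d/dμ E_{z ~ N(μ, 1)}[z^2] with the naive score-function (REINFORCE-style) estimator f(z) ∇_μ log q(z) = z^2 (z - μ). The true gradient is 2μ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 10000
z = mu + rng.standard_normal(n)           # samples from q(z) = N(mu, 1)
grad_samples = (z ** 2) * (z - mu)        # score-function gradient samples
print(grad_samples.mean())                # close to 2 * mu = 3.0, but...
print(grad_samples.var())                 # ...the per-sample variance is large (~50 here)
```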

SLIDE 32: 2) Reparameterization trick

  • Rewrite qϕ(z(l) | x)
  • Separate q into a deterministic function of x and an auxiliary noise variable ϵ
  • Leads to a lower-variance estimator

SLIDE 33: 2) Reparameterization trick

  • Example: univariate Gaussian
  • Can rewrite a sample as the mean plus a scaled noise variable, z = μ + σϵ (see the snippet below)
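
The same toy gradient as above, now using the slide's trick (my example): write z = μ + σϵ with ϵ ~ N(0, 1) and differentiate through the sample. The mean stays the same but the per-sample variance drops by an order of magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 10000
eps = rng.standard_normal(n)
z = mu + sigma * eps                      # reparameterized sample: z ~ N(mu, sigma^2)
grad_samples = 2.0 * z                    # d/dmu of z^2 with z = mu + sigma * eps
print(grad_samples.mean())                # again close to 2 * mu = 3.0
print(grad_samples.var())                 # variance ~4 vs. ~50 for the score estimator
```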

SLIDE 34: 2) Reparameterization trick

  • Lots of distributions like this. Three classes given:
    ○ Tractable inverse CDF: Exponential, Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang
    ○ Location-scale: Laplace, Elliptical, Student's t, Logistic, Uniform, Triangular, Gaussian
    ○ Composition: Log-Normal (exponentiated normal), Gamma (sum of exponentials), Dirichlet (sum of Gammas), Beta, Chi-Squared, F
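
A minimal sketch of the first class, tractable inverse CDF (the Exponential case; the rate value is my assumption): the sample is a deterministic function of the parameter and uniform noise, so gradients with respect to the rate can flow through it.

```python
import numpy as np

def exponential_reparam(rate, eps):
    """Inverse-CDF transform: if eps ~ Uniform(0, 1), the output ~ Exponential(rate)."""
    return -np.log(1.0 - eps) / rate

rng = np.random.default_rng(0)
eps = rng.uniform(size=100_000)
z = exponential_reparam(2.0, eps)
print(z.mean())   # close to 1 / rate = 0.5
```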

SLIDE 35: 2) Reparameterization trick

  • Yields a new MC estimator

SLIDE 36: 2) Reparameterization trick

  • Plug the estimator into the lower bound equation
  • The KL term can often be integrated analytically
    ○ Careful choice of priors

SLIDE 37: 2) Reparameterization trick

  • Plug the estimator into the lower bound equation
  • The KL term can often be integrated analytically
    ○ Careful choice of priors

SLIDE 38: 3) Partial closed form

  • The KL term can often be integrated analytically
    ○ Careful choice of priors
    ○ E.g. both Gaussian (closed form below)
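
For the "both Gaussian" case (diagonal Gaussian q(z | x), standard-normal prior p(z)), the KL term has the closed form used in the AEVB paper's Gaussian example: KL = -0.5 Σ_j (1 + log σ_j^2 - μ_j^2 - σ_j^2). A direct transcription:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

print(gaussian_kl(np.zeros(3), np.zeros(3)))                  # 0.0 when q equals the prior
print(gaussian_kl(np.array([1.0, 0.0, 0.0]), np.zeros(3)))    # 0.5
```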

SLIDE 39: 4) Auto-encoder connection

  • Regularizer
  • Reconstruction error
  • Neural nets
    ○ Encode: q(z | x)
    ○ Decode: p(x | z)

SLIDE 40: 4) Auto-encoder connection (alt.)

  • q(z | x) encodes
  • p(x | z) decodes
  • "Information layer(s)" need to compress
    ○ Reals = infinite info
    ○ Reals + random noise = finite info

More info in Karol Gregor's DeepMind lecture: https://www.youtube.com/watch?v=P78QYjWh5sM
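
A minimal sketch of this encoder/decoder view in PyTorch (layer sizes, activations, and the Bernoulli likelihood are my assumptions, not from the slides): the encoder outputs the parameters of q(z | x), a reparameterized sample of z is drawn, the decoder outputs the parameters of p(x | z), and the loss is reconstruction error plus the KL regularizer (the negative lower bound).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)            # shared encoder body
        self.enc_mu = nn.Linear(h_dim, z_dim)         # mean of q(z | x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)     # log-variance of q(z | x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))   # decoder for p(x | z)

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                    # reparameterization trick
        z = mu + torch.exp(0.5 * logvar) * eps
        x_logits = self.dec(z)                        # logits of a Bernoulli p(x | z)
        return x_logits, mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl                                 # minimize this with SGD/Adam
```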

SLIDE 41: Where are we with VI now? (2013-ish)

  • Deep networks parameterize both q(z | x) and p(x | z)
  • Lower-variance estimator of the expected log-likelihood
  • Can choose from lots of families of q(z | x) and p(z)

SLIDE 42: Where are we with VI now? (2013-ish)

  • Problem:
    ○ Most parametric families available are simple
    ○ E.g. product of independent univariate Gaussians
    ○ Most posteriors are complex

SLIDE 43: Variational Inference with Normalizing Flows [1]

High-level idea:
  1) VAEs are great, but our posterior q(z | x) needs to be simple
  2) Take a simple q(z | x) and apply a series of k transformations to z to get q_k(z | x). Metaphor: z "flows" through each transform.
  3) Be clever in the choice of transforms (computational issue)
  4) The variational posterior q now converges to the true posterior p
  5) A deep NN now parameterizes q and the flow parameters

[1] Rezende and Mohamed, "Variational Inference with Normalizing Flows", ICML, 2015.

SLIDE 44: What is a normalizing flow?

  • A function that transforms a probability density through a sequence of invertible mappings

[Figure: q0(z | x) → qk(z | x)]

SLIDE 45: Key equations (1)

  • The chain rule (change of variables) lets us write q_k as a product of q_0 and the inverse Jacobian determinants

SLIDE 46: Key equations (2)

  • The density q_k(z') is obtained by successively composing k transforms

SLIDE 47: Key equations (3)

  • The log likelihood of q_k(z') has a nice additive form (see the sketch below)
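
A minimal numeric sketch of that additive form (1-D affine flows of my own choosing, not the paper's planar/radial flows): push samples of z0 ~ N(0, 1) through k invertible maps f_k(z) = a_k z + b_k and accumulate log q_k(z) = log q_0(z0) - Σ_k log |det df_k/dz|.

```python
import numpy as np

def log_standard_normal(z):
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
z = rng.standard_normal(5)              # samples z0 from the base density q0
log_q = log_standard_normal(z)          # log q0(z0)

for a, b in [(2.0, 1.0), (0.5, -3.0), (1.5, 0.2)]:   # three affine flow steps
    z = a * z + b                       # z_k = f_k(z_{k-1})
    log_q -= np.log(np.abs(a))          # subtract log|det df_k/dz| (= log|a| in 1-D)

print(z)                                # final samples, distributed as q_k
print(log_q)                            # their log-density under q_k
```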

SLIDE 48: Key equations (4)

  • An expectation over q_k can be written as an expectation under q_0
  • Cute name: the law of the unconscious statistician (LOTUS)
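
LOTUS in code (a toy check with an assumed invertible transform f(z) = z^3 + z): we can estimate E_{q_k}[g(z)] by sampling from the base q_0 and pushing the samples through the flow, without ever evaluating the density q_k.

```python
import numpy as np

rng = np.random.default_rng(0)
z0 = rng.standard_normal(100_000)       # samples from the base density q0
f = lambda z: z ** 3 + z                # a monotone (hence invertible) transform
g = lambda z: np.abs(z)                 # some test function
print(np.mean(g(f(z0))))                # Monte Carlo estimate of E_{q_k}[g(z)]
```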

SLIDE 49: Types of flows

  1) Infinitesimal flows:
    ○ Can show convergence in the limit
    ○ Skipping (theoretical; computationally expensive)
  2) Invertible linear-time flows:
    ○ log-det can be calculated efficiently

SLIDE 50: Planar Flows

  • Applies the transform f(z) = z + u h(w^T z + b), where h is a smooth nonlinearity such as tanh (see the sketch below)
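
A sketch of a single planar flow step, following the formulas in Rezende and Mohamed (2015): f(z) = z + u h(w^T z + b) with h = tanh, and log |det ∂f/∂z| = log |1 + u^T ψ(z)| with ψ(z) = h'(w^T z + b) w. The constraint on u that guarantees invertibility is omitted here for brevity, and the parameter values are arbitrary.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Apply one planar flow to a batch z of shape (n, d); return f(z) and log|det J|."""
    lin = z @ w + b                                # w^T z + b, shape (n,)
    f_z = z + np.outer(np.tanh(lin), u)            # z + u * h(w^T z + b), shape (n, d)
    psi = np.outer(1.0 - np.tanh(lin) ** 2, w)     # h'(w^T z + b) * w, shape (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))        # log|1 + u^T psi(z)|, shape (n,)
    return f_z, log_det

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 2))
f_z, log_det = planar_flow(z, u=np.array([0.5, -0.3]), w=np.array([1.0, 2.0]), b=0.1)
print(f_z, log_det)
```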

SLIDE 51: Radial Flows

  • Applies the transform f(z) = z + β h(α, r)(z - z0), where r = ||z - z0|| and h(α, r) = 1/(α + r), expanding or contracting the density radially around the reference point z0

SLIDE 52: Summary

  • VI approximates p(x) via a latent variable model
    ○ p(x) = Σ_z p(z) p(x | z)
  • VAE introduces an auto-encoder approach
    ○ The reparameterization trick makes it feasible
    ○ Deep NNs parameterize q(z | x) and p(x | z)
  • NF takes q(z | x) from simple to complex
    ○ Series of linear-time transforms
    ○ Convergence in the limit