  1. Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016

  2. What is Variational Inference?

  3. What is Variational Inference? p*(x) ● Want to estimate some distribution, p*(x)

  4. What is Variational Inference? p*(x) ● Want to estimate some distribution, p*(x) ● Too expensive to estimate

  5. What is Variational Inference? p*(x) q(x) ● Want to estimate some distribution, p*(x) ● Too expensive to estimate ● Approximate it with a tractable distribution, q(x)

  6. What is Variational Inference? p*(x) q(x) ● Fit q(x) inside of p*(x) ● Centered at a single mode ○ q(x) is unimodal here ○ With a unimodal q, VI gives a mode-seeking fit, similar in spirit to a MAP estimate

  7. What is Variational Inference? ● Mathematically: KL(q || p*) = Σ_x q(x) log(q(x) / p*(x)) ● Still hard! p*(x) usually has a tricky normalizing constant

  8. What is Variational Inference? ● Mathematically: KL(q || p*) = Σ_x q(x) log(q(x) / p*(x)) ● Use the unnormalized p̃ instead

  9. What is Variational Inference? ● Mathematically: KL(q || p*) = Σ_x q(x) log(q(x) / p*(x)) ● Use the unnormalized p̃ instead: log(q(x) / p*(x)) = log(q(x)) - log(p*(x)) = log(q(x)) - log(p̃(x) / Z) = log(q(x)) - log(p̃(x)) + log(Z)

  10. What is Variational Inference? ● Mathematically: KL(q || p*) = Σ_x q(x) log(q(x) / p*(x)) ● Use the unnormalized p̃ instead: log(q(x) / p*(x)) = log(q(x)) - log(p̃(x)) + log(Z) ● log(Z) is constant => can ignore it in our optimization problem
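A compact restatement of the derivation above (standard algebra, written out since the slide equations do not survive well in plain text):

```latex
\mathrm{KL}(q \,\|\, p^*)
  = \sum_x q(x) \log \frac{q(x)}{p^*(x)}
  = \sum_x q(x) \bigl[ \log q(x) - \log \tilde{p}(x) + \log Z \bigr]
  = \underbrace{\sum_x q(x) \log \frac{q(x)}{\tilde{p}(x)}}_{J(q)} + \log Z .
```

Since log Z does not depend on q, minimizing KL(q || p*) over q is equivalent to minimizing J(q), which only needs the unnormalized p̃.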

  11. Mean Field VI ● Classical method ● Uses a factorized q: q(x) = ∏_i q_i(x_i) [1] Blei, Ng, Jordan, “Latent Dirichlet Allocation”, JMLR, 2003.

  12. Mean Field VI ● Example: Multivariate Gaussian ● Product of independent Gaussians for q ● Spherical covariance underestimates true covariance
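For intuition on the covariance underestimation, here is the standard textbook result for a Gaussian target (e.g. Bishop, PRML §10.1.2; not from this slide): for p(x) = N(x | μ, Σ) with precision Λ = Σ⁻¹, the optimal fully factorized q under KL(q || p) is

```latex
q_i^*(x_i) = \mathcal{N}\!\left(x_i \mid \mu_i,\; \Lambda_{ii}^{-1}\right),
\qquad \Lambda = \Sigma^{-1},
\qquad \Lambda_{ii}^{-1} \le \Sigma_{ii},
```

so each marginal variance is underestimated whenever the components are correlated.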

  13. Variational Bayes ● Vanilla mean field VI assumes you know all the parameters, θ , of the true distribution, p*(x) [1] Blei, Ng, Jordan, “ Latent Dirichlet Allocation ”, JMLR, 2003.

  14. Variational Bayes ● Vanilla mean field VI assumes you know all the parameters, θ , of the true distribution, p*(x) ● Enter: Variational Bayes (VB) [1] Blei, Ng, Jordan, “ Latent Dirichlet Allocation ”, JMLR, 2003.

  15. Variational Bayes ● VB infers both the latent variables of q(x), z, and the parameters of p*(x), θ ● VB-EM was popularized for LDA 1 ○ E-step for z, M-step for θ [1] Blei, Ng, Jordan, “Latent Dirichlet Allocation”, JMLR, 2003.

  16. Variational Bayes ● VB usually uses a mean field approximation of the form: q(x) = q(z | θ) ∏_i q_i(x_i | z_i)

  17. Issues with Mean Field VB ● Requires analytical solutions of expectations w.r.t. q_i ○ Intractable in general ● Factored form limits the power of the approximation

  18. Issues with Mean Field VB ● Requires analytical solutions of expectations w.r.t. q_i ○ Intractable in general ○ Solution: Auto-Encoding Variational Bayes (Kingma and Welling, 2014) ● Factored form limits the power of the approximation

  19. Issues with Mean Field VB ● Requires analytical solutions of expectations w.r.t. q_i ○ Intractable in general ○ Solution: Auto-Encoding Variational Bayes (Kingma and Welling, 2014) ● Factored form limits the power of the approximation ○ Solution: Variational Inference with Normalizing Flows (Rezende and Mohamed, 2015)

  20. Auto-Encoding Variational Bayes 1 High-level idea: 1) Optimize the same lower bound that we get in VB 2) The reparameterization trick (an auxiliary-noise-variable trick) leads to a lower-variance estimator 3) Lots of choices of q(z|x) and p(z) lead to a partial closed form 4) Use neural networks to parameterize q_ϕ(z | x) and p_θ(x | z) 5) SGD to fit everything [1] Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR, 2014.

  21. 1) VB Lower Bound ● Given N iid data points, (x_1, ..., x_N) ● Maximize the marginal likelihood: log p_θ(x_1, ..., x_N) = Σ_i log p_θ(x^(i))

  23. 1) VB Lower Bound ● Given N iid data points, (x_1, ..., x_N) ● Maximize the marginal likelihood: log p_θ(x_1, ..., x_N) = Σ_i log p_θ(x^(i)) ● Each log p_θ(x^(i)) decomposes into a KL term, which is always positive, plus a second term

  24. 1) VB Lower Bound ● Given N iid data points, (x_1, ..., x_N) ● Maximize the marginal likelihood: log p_θ(x_1, ..., x_N) = Σ_i log p_θ(x^(i)) ● Since the KL term is always positive, the second term is a lower bound on log p_θ(x^(i))
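The decomposition those callouts refer to (the slide equation was an image; this is the standard form from the Auto-Encoding Variational Bayes paper):

```latex
\log p_\theta(x^{(i)})
  = \underbrace{D_{\mathrm{KL}}\!\bigl(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\bigr)}_{\ge\, 0}
  + \mathcal{L}(\theta, \phi; x^{(i)}),
```

so the variational term L(θ, ϕ; x^(i)) lower-bounds the marginal log-likelihood.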

  25. 1) VB Lower Bound ● Write lower bound

  26. 1) VB Lower Bound ● Write lower bound Anyone want the derivation?

  27. 1) VB Lower Bound ● Write lower bound ● Rewrite lower bound

  29. 1) VB Lower Bound ● Write lower bound ● Rewrite lower bound Derivation?
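For reference, the "written" and "rewritten" forms of the lower bound on these slides, reproduced from the Auto-Encoding Variational Bayes paper since the slide equations were images:

```latex
\mathcal{L}(\theta, \phi; x^{(i)})
  = \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\bigl[\log p_\theta(x^{(i)}, z) - \log q_\phi(z \mid x^{(i)})\bigr]
  = -D_{\mathrm{KL}}\!\bigl(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\bigr)
    + \mathbb{E}_{q_\phi(z \mid x^{(i)})}\!\bigl[\log p_\theta(x^{(i)} \mid z)\bigr].
```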

  30. 1) VB Lower Bound ● Write lower bound ● Rewrite lower bound ● Monte Carlo gradient estimator of expectation part

  31. 1) VB Lower Bound ● Write lower bound ● Rewrite lower bound ● Monte Carlo gradient estimator of expectation part ○ Too high variance
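The high-variance estimator in question is the usual score-function (REINFORCE-style) Monte Carlo gradient, shown here for context:

```latex
\nabla_\phi\, \mathbb{E}_{q_\phi(z)}\!\left[f(z)\right]
  = \mathbb{E}_{q_\phi(z)}\!\left[f(z)\, \nabla_\phi \log q_\phi(z)\right]
  \approx \frac{1}{L} \sum_{l=1}^{L} f\!\bigl(z^{(l)}\bigr)\, \nabla_\phi \log q_\phi\!\bigl(z^{(l)}\bigr),
  \qquad z^{(l)} \sim q_\phi(z).
```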

  32. 2) Reparameterization trick ● Rewrite q_ϕ(z^(l) | x) ● Separate q into a deterministic function of x and an auxiliary noise variable ϵ ● Leads to a lower-variance estimator

  33. 2) Reparameterization trick ● Example: univariate Gaussian ● Can rewrite as sum of mean and a scaled noise variable
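A minimal numerical sketch of the univariate Gaussian case; the values of mu and log_sigma are made-up stand-ins for encoder outputs, not anything from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative encoder outputs for a single data point (made-up numbers).
mu, log_sigma = 0.3, -1.2
sigma = np.exp(log_sigma)

# Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1).
# All randomness lives in eps, so (mu, sigma) enter only through a
# deterministic, differentiable transform.
eps = rng.standard_normal(10_000)
z = mu + sigma * eps

# Sanity check: the samples match the intended Gaussian.
print(z.mean(), z.std())   # roughly 0.3 and exp(-1.2) ~ 0.30
```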

  34. 2) Reparameterization trick ● Lots of distributions like this. Three classes given: ○ Tractable inverse CDF: Exponential, Cauchy, Logistic, Rayleigh, Pareto, Weibull, Reciprocal, Gompertz, Gumbel, Erlang ○ Location-scale: Laplace, Elliptical, Student’s t, Logistic, Uniform, Triangular, Gaussian ○ Composition: Log-Normal (exponentiated normal), Gamma (sum of exponentials), Dirichlet (sum of Gammas), Beta, Chi-Squared, F

  35. 2) Reparameterization trick ● Yields a new MC estimator
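The resulting estimator, in the notation of the Auto-Encoding Variational Bayes paper (g_ϕ is the deterministic transform and p(ϵ) the noise distribution):

```latex
\mathbb{E}_{q_\phi(z \mid x)}\!\left[f(z)\right]
  = \mathbb{E}_{p(\epsilon)}\!\left[f\!\bigl(g_\phi(\epsilon, x)\bigr)\right]
  \approx \frac{1}{L} \sum_{l=1}^{L} f\!\bigl(g_\phi(\epsilon^{(l)}, x)\bigr),
  \qquad \epsilon^{(l)} \sim p(\epsilon),
```

and gradients with respect to ϕ can now flow through g_ϕ.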

  36. 2) Reparameterization trick ● Plug estimator into the lower bound eq. ● KL term often can be integrated analytically ○ Careful choice of priors

  38. 3) Partial closed form ● KL term often can be integrated analytically ○ Careful choice of priors ○ E.g. both Gaussian
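For the common choice of a diagonal-Gaussian encoder and a standard-normal prior (the "both Gaussian" case in the Auto-Encoding Variational Bayes paper), the KL term has the closed form

```latex
D_{\mathrm{KL}}\!\bigl(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\bigr)
  = -\tfrac{1}{2} \sum_{j=1}^{J} \bigl(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\bigr).
```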

  39. 4) Auto-encoder connection ● Regularizer ● Reconstruction error ● Neural nets ○ Encode: q(z | x) ○ Decode: p(x | z)
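A minimal sketch of the resulting per-example objective (negative lower bound), assuming a diagonal-Gaussian encoder and a Bernoulli decoder; the function name neg_elbo and all numeric values are illustrative stand-ins rather than real network outputs:

```python
import numpy as np

def neg_elbo(x, mu, log_var, x_recon, eps=1e-7):
    """Negative lower bound for one data point:
    Bernoulli reconstruction error (decoder) plus the analytic KL
    between N(mu, diag(exp(log_var))) (encoder) and N(0, I) (prior)."""
    recon = -np.sum(x * np.log(x_recon + eps)
                    + (1.0 - x) * np.log(1.0 - x_recon + eps))
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + kl

# Stand-in values for a 4-pixel binary "image" and a 2-d latent code.
x = np.array([1.0, 0.0, 1.0, 1.0])          # observed data
mu = np.array([0.1, -0.2])                   # encoder mean
log_var = np.array([-0.5, -0.1])             # encoder log-variance
x_recon = np.array([0.9, 0.1, 0.8, 0.7])     # decoder output probabilities
print(neg_elbo(x, mu, log_var, x_recon))
```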

  40. 4) Auto-encoder connection (alt.) ● q(z | x) encodes ● p(x | z) decodes ● “Information layer(s)” need to compress ○ Reals = infinite info ○ Reals + random noise = finite info More info in Karol Gregor’s Deep Mind lecture: https://www.youtube.com/watch?v=P78QYjWh5sM

  41. Where are we with VI now? (2013’ish) ● Deep networks parameterize both q(z | x) and p(x | z) ● Lower-variance estimator of expected log-likelihood ● Can choose from lots of families of q(z | x) and p(z)

  42. Where are we with VI now? (2013’ish) ● Problem: ○ Most parametric families available are simple ○ E.g. product of independent univariate Gaussians ○ Most posteriors are complex

  43. Variational Inference with Normalizing Flows 1 High-level idea: 1) VAEs are great, but our posterior q(z|x) needs to be simple 2) Take a simple q(z | x) and apply a series of K transformations to z to get q_K(z | x). Metaphor: z “flows” through each transform. 3) Be clever in the choice of transforms (a computational issue) 4) The variational posterior q can now approach the true posterior p (asymptotically, as more transforms are added) 5) A deep NN now parameterizes q and the flow parameters [1] Rezende and Mohamed, “Variational Inference with Normalizing Flows”, ICML, 2015.

  44. What is a normalizing flow? ● A function that transforms a probability density q_0(z | x) through a sequence of invertible mappings into q_K(z | x)

  45. Key equations (1) ● The chain rule (change of variables) lets us write q_K as q_0 times a product of inverse Jacobian determinants
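The single-transform version of that equation, as in the normalizing-flows paper: for an invertible map f with z' = f(z),

```latex
q(z') = q(z) \left| \det \frac{\partial f^{-1}}{\partial z'} \right|
      = q(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}.
```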

  46. Key equations (2) ● Density q_K(z') obtained by successively composing K transforms

  47. Key equations (3) ● Log-likelihood of q_K(z') has a nice additive form
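Written out (again from the paper), with z_K = f_K ∘ ⋯ ∘ f_1(z_0) and z_0 ~ q_0:

```latex
\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K} \ln \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|.
```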

  48. Key equations (4) ● Expectation over q_K can be written as an expectation under q_0 ● Cute name: law of the unconscious statistician (LOTUS)
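Concretely:

```latex
\mathbb{E}_{q_K}\!\left[h(z)\right]
  = \mathbb{E}_{q_0}\!\left[h\bigl(f_K \circ \cdots \circ f_1(z_0)\bigr)\right],
```

which can be evaluated without computing q_K explicitly, as long as h does not itself depend on q_K.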

  49. Types of flows 1) Infinitesimal Flows: ○ Can show convergence in the limit ○ Skipping (theoretical; computationally expensive) 2) Invertible Linear-Time Flows: ○ log-det can be calculated efficiently

  50. Planar Flows ● Applies the transform shown below:
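The planar transform and its determinant term, reproduced from the Rezende and Mohamed paper since the slide equation was an image (h is a smooth elementwise nonlinearity such as tanh, and u, w, b are the flow parameters):

```latex
f(z) = z + u\, h(w^\top z + b),
\qquad
\psi(z) = h'(w^\top z + b)\, w,
\qquad
\left| \det \frac{\partial f}{\partial z} \right| = \left| 1 + u^\top \psi(z) \right|.
```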

  51. Radial Flows ● Applies the transform shown below:
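The radial transform and its determinant as given in the paper, with z_0 the reference point, r = ‖z − z_0‖, h(α, r) = 1/(α + r), and d the dimension of z:

```latex
f(z) = z + \beta\, h(\alpha, r)\,(z - z_0),
\qquad
\left| \det \frac{\partial f}{\partial z} \right|
  = \bigl[1 + \beta h(\alpha, r)\bigr]^{d-1}
    \bigl[1 + \beta h(\alpha, r) + \beta h'(\alpha, r)\, r\bigr].
```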

  52. Summary ● VI approximates p(x) via a latent variable model ○ p(x) = Σ_z p(z) p(x | z) ● VAE introduces an auto-encoder approach ○ Reparameterization trick makes it feasible ○ Deep NNs parameterize q(z | x) and p(x | z) ● NF takes q(z|x) from simple to complex ○ Series of linear-time transforms ○ Convergence in the limit
