Probabilistic Graphical Models: Inference & Learning in DL

  1. Probabilistic Graphical Models: Inference & Learning in DL
     Zhiting Hu
     Lecture 19, March 29, 2017

  2. Deep Generative Models
     - Explicit probabilistic models: provide an explicit parametric specification of the distribution of data x
     - Tractable likelihood function p_θ(x)
     - E.g., p(x, z | β) = p(x | z) p(z | β)

  3. Deep Generative Models
     - Explicit probabilistic models: provide an explicit parametric specification of the distribution of data x
     - Tractable likelihood function p_θ(x)
     - E.g., Sigmoid Belief Nets: binary observed units x ∈ {0,1}^D and binary hidden layers h^(1), h^(2), with
       p(x_i = 1 | w_i, h^(1)) = σ(w_i^T h^(1) + c_i)
       p(h^(1)_j = 1 | w'_j, h^(2)) = σ(w'_j^T h^(2) + b_j)
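
As a concrete illustration of the layer-wise Bernoulli parameterization above, here is a minimal NumPy sketch of ancestral (top-down) sampling in a two-layer sigmoid belief net. The layer sizes, weight scales, and the uniform Bernoulli top layer are illustrative assumptions, not values from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 8 top units -> 16 middle units -> 32 observed units
D2, D1, D = 8, 16, 32
W2 = 0.1 * rng.standard_normal((D1, D2))   # weights from h^(2) to h^(1)
W1 = 0.1 * rng.standard_normal((D, D1))    # weights from h^(1) to x
b = np.zeros(D1)                           # biases of h^(1)
c = np.zeros(D)                            # biases of x

# Ancestral sampling: each binary unit is Bernoulli given its parents in the layer above.
h2 = rng.binomial(1, 0.5, size=D2)                 # top layer, p(h^(2)_k = 1) = 0.5 (assumption)
h1 = rng.binomial(1, sigmoid(W2 @ h2 + b))         # p(h^(1)_j = 1 | h^(2)) = σ(w'_j^T h^(2) + b_j)
x  = rng.binomial(1, sigmoid(W1 @ h1 + c))         # p(x_i = 1 | h^(1)) = σ(w_i^T h^(1) + c_i)
```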

  4. Deep Generative Models
     - Explicit probabilistic models: provide an explicit parametric specification of the distribution of data x
     - Tractable likelihood function p_θ(x)
     - E.g., deep generative models parameterized with NNs (e.g., VAEs):
       p_θ(x | z) = N( x; μ_θ(z), σ²_θ(z) )       p(z) = N( z; 0, I )
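
A minimal sketch of such an NN-parameterized Gaussian observation model in PyTorch, assuming a small MLP decoder that outputs the mean and log-variance of p_θ(x | z); the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """p_θ(x | z) = N(x; μ_θ(z), σ_θ(z)² I), with μ_θ and log σ_θ² given by a small MLP."""
    def __init__(self, z_dim=20, x_dim=784, hidden=400):   # sizes are illustrative assumptions
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, x_dim)
        self.logvar = nn.Linear(hidden, x_dim)

    def forward(self, z):
        h = self.body(z)
        return self.mu(h), self.logvar(h)

decoder = GaussianDecoder()
z = torch.randn(5, 20)              # z ~ p(z) = N(0, I)
mu_x, logvar_x = decoder(z)         # parameters of p_θ(x | z) for each sampled z
```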

  5. Deep Generative Models
     - Implicit probabilistic models: define a stochastic process that simulates data x
     - Do not require a tractable likelihood function
     - Data simulator: a natural approach for problems in population genetics, weather, ecology, etc.
     - E.g., generate data from a deterministic transformation of parameters and random noise (e.g., GANs):
       x_n = g_θ(z_n),   z_n ∼ N(0, I)
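
A minimal sketch of the implicit-model view: a deterministic network pushes Gaussian noise forward into samples, so drawing data is easy but there is no tractable expression for the induced density. The small MLP and its sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Implicit model: x = g_θ(z), z ~ N(0, I). Sampling is trivial; evaluating p_θ(x) is not.
g = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))   # illustrative sizes

z = torch.randn(1000, 10)   # noise z_n ~ N(0, I)
x = g(z)                    # simulated data x_n = g_θ(z_n); no closed-form likelihood is available
```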

  6. Recap: Variational Inference
     - Consider a probabilistic model p_θ(x, z)
     - Assume a variational distribution q_φ(z | x)
     - Lower bound on the log-likelihood:
       log p(x) = ∫ q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ] dz + KL( q_φ(z|x) || p_θ(z|x) )
                ≥ ∫ q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ] dz  =: L(θ, φ; x)
     - Free energy: F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
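
To make the bound concrete, here is a minimal Monte Carlo check of L(θ, φ; x) on a toy model where log p(x) is available in closed form: p(z) = N(0, 1), p(x | z) = N(z, 1), so p(x) = N(0, 2). The particular variational parameters m and s are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

x = 1.5                       # a single observation
m, s = 0.7, 0.8               # variational parameters of q(z|x) = N(m, s^2) (arbitrary choice)

rng = np.random.default_rng(0)
z = rng.normal(m, s, size=100_000)                        # z ~ q(z|x)

log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # log p(z) + log p(x|z)
log_q = norm.logpdf(z, m, s)                              # log q(z|x)
elbo = np.mean(log_joint - log_q)                         # Monte Carlo estimate of L(θ, φ; x)

log_px = norm.logpdf(x, 0, np.sqrt(2))                    # exact log p(x) for this toy model
print(elbo, log_px)   # ELBO ≤ log p(x); the gap is KL(q(z|x) || p(z|x))
```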

  7. Wake Sleep Algorithm
     - Consider a generative model p_θ(x | z), e.g., sigmoid belief nets
     - Variational bound:
       log p(x) ≥ Σ_z q_φ(z|x) log [ p_θ(x, z) / q_φ(z|x) ]  =: L(θ, φ; x)
     - Use an inference network q_φ(z | x)
     - Maximize the bound w.r.t. p_θ → Wake phase:
       max_θ E_{q_φ(z|x)} [ log p_θ(x, z) ]
       - Get samples from q_φ(z|x) through a bottom-up pass
       - Use the samples as targets for updating the generator

  8. Wake Sleep Algorithm
     - [Hinton et al., Science 1995]
     - Generally applicable to a wide range of generative models by training a separate inference network
     - Consider a generative model p_θ(x | z) with prior p(z), e.g., multi-layer belief nets
     - Free energy: F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
     - Inference network q_φ(z | x), a.k.a. recognition network

  9. Wake Sleep Algorithm
     [Figure: recognition weights R1, R2 map the data x bottom-up to the latent layers; courtesy of Maei's slides]
     - Free energy: F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
     - Minimize the free energy w.r.t. p_θ → Wake phase:
       max_θ E_{q_φ(z|x)} [ log p_θ(x, z) ]
       - Get samples from q_φ(z|x) through a bottom-up pass on the training data
       - Use the samples as targets for updating the generator

  10. Wake Sleep Algorithm
     [Figure: generative weights G1, G2 and recognition weights R1, R2 over the data x]
     - Free energy: F(θ, φ; x) = −log p(x) + KL( q_φ(z|x) || p_θ(z|x) )
     - Minimize the free energy w.r.t. q_φ(z|x)? Computationally expensive / high variance
     - Instead, minimize a reversed-KL free energy w.r.t. q_φ(z|x) → Sleep phase:
       F′(θ, φ; x) = −log p(x) + KL( p_θ(z|x) || q_φ(z|x) )
       max_φ E_{p_θ(x,z)} [ log q_φ(z|x) ]
       - "Dream" up samples from p_θ through a top-down pass
       - Use the samples as targets for updating the recognition network

  11. Wake Sleep Algorithm
     [Figure: generative weights G1, G2 and recognition weights R1, R2 over the data x]
     - Wake phase:
       - Use the recognition network to perform a bottom-up pass, creating samples for the layers above (from data)
       - Train the generative network using the samples obtained from the recognition model
     - Sleep phase:
       - Use the generative weights to reconstruct data by performing a top-down pass
       - Train the recognition weights using the samples obtained from the generative model
     - KL is not symmetric
       - Doesn't optimize a well-defined objective function
       - Not guaranteed to converge
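
Putting the two phases together, this is a minimal PyTorch sketch of wake-sleep with a single Bernoulli latent layer and a fixed Bernoulli(0.5) prior. The one-layer linear architectures, sizes, and optimizer settings are illustrative assumptions rather than the multi-layer setup on the slides.

```python
import torch
import torch.nn as nn

x_dim, z_dim = 784, 64
gen = nn.Linear(z_dim, x_dim)     # generative weights:  p_θ(x | z) = Bernoulli(σ(W_g z + c))
rec = nn.Linear(x_dim, z_dim)     # recognition weights: q_φ(z | x) = Bernoulli(σ(W_r x + b))
opt_g = torch.optim.SGD(gen.parameters(), lr=1e-3)
opt_r = torch.optim.SGD(rec.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss(reduction='sum')

def wake_step(x):
    # Wake: bottom-up pass on data, sample z ~ q_φ(z|x), then fit the generator to (z, x).
    with torch.no_grad():
        z = torch.bernoulli(torch.sigmoid(rec(x)))
    loss = bce(gen(z), x)         # -log p_θ(x | z) with z treated as a fixed target
    opt_g.zero_grad(); loss.backward(); opt_g.step()

def sleep_step(batch_size=32):
    # Sleep: top-down "dream", sample z ~ p(z) and x ~ p_θ(x|z), then fit the recognition net to (x, z).
    with torch.no_grad():
        z = torch.bernoulli(0.5 * torch.ones(batch_size, z_dim))
        x = torch.bernoulli(torch.sigmoid(gen(z)))
    loss = bce(rec(x), z)         # -log q_φ(z | x) with the dreamed z as the target
    opt_r.zero_grad(); loss.backward(); opt_r.step()
```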

  12. Variational Auto-encoders (VAEs)
     - [Kingma & Welling, 2014]
     - Enjoy similar applicability to the wake-sleep algorithm
       - Not applicable to discrete latent variables
     - Optimize a variational lower bound on the log-likelihood
     - Reduce variance through reparameterization of the recognition distribution
       - Alternative: use control variates, as in reinforcement learning [Mnih & Gregor, 2014]

  13. Variational Auto-encoders (VAEs)
     - Generative model p_θ(x | z) with prior p(z), a.k.a. decoder
     - Inference network q_φ(z | x), a.k.a. encoder or recognition network
     - Variational lower bound:
       log p(x) ≥ E_{q_φ(z|x)} [ log p_θ(x | z) ] − KL( q_φ(z|x) || p(z) )  =: L(θ, φ; x)

  14. Variational Auto-encoders (VAEs)
     - Variational lower bound:
       L(θ, φ; x) = E_{q_φ(z|x)} [ log p_θ(x | z) ] − KL( q_φ(z|x) || p(z) )
     - Optimizing L(θ, φ; x) w.r.t. p_θ(x | z): the same as the wake phase
     - Optimizing L(θ, φ; x) w.r.t. q_φ(z | x):
       - Directly computing the gradient with MC estimation gives a REINFORCE-like update rule, which suffers from high variance [Mnih & Gregor, 2014] (next lecture for more on REINFORCE)
       - VAEs use a reparameterization trick to reduce variance
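
For contrast with the reparameterization trick introduced next, here is a minimal NumPy sketch of the score-function (REINFORCE-like) estimator, ∇_m E_{q_m(z)}[f(z)] = E_{q_m(z)}[f(z) ∇_m log q_m(z)], on a toy Gaussian q with mean parameter m; the choice of f and the parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 0.5                          # variational parameter: q(z) = N(m, 1)
f = lambda z: (z - 2.0) ** 2     # toy integrand; true gradient d/dm E[f(z)] = 2*(m - 2) = -3

z = rng.normal(m, 1.0, size=100_000)           # z ~ q(z)

# Score-function estimator: f(z) * d/dm log N(z; m, 1) = f(z) * (z - m)
grad_score = f(z) * (z - m)

# Reparameterized estimator: z = m + eps, so d/dm f(m + eps) = f'(z) = 2*(z - 2)
grad_reparam = 2.0 * (z - 2.0)

print(grad_score.mean(), grad_score.var())      # unbiased, but much higher variance
print(grad_reparam.mean(), grad_reparam.var())  # unbiased, low variance
```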

  15. VAEs: Reparameterization Trick

  16. VAEs: Reparameterization Trick
     - q_φ( z^(n) | x^(n) ) = N( z^(n); μ^(n), σ^(n)² I )
     - z = z_φ(ε) is a deterministic mapping of the noise ε
     [Figure courtesy: Chang's slides]
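
A minimal PyTorch sketch of the Gaussian reparameterization above: sampling z ~ N(μ, σ²I) is rewritten as a deterministic function of (μ, log σ²) and external noise ε ~ N(0, I), so gradients of any downstream loss reach μ and log σ². The tensor shapes and the toy loss are illustrative assumptions.

```python
import torch

mu = torch.zeros(8, 20, requires_grad=True)       # μ_φ(x); illustrative shapes (batch 8, z_dim 20)
logvar = torch.zeros(8, 20, requires_grad=True)   # log σ_φ(x)²

eps = torch.randn_like(mu)                        # ε ~ N(0, I), independent of φ
z = mu + torch.exp(0.5 * logvar) * eps            # z = z_φ(ε) = μ + σ ⊙ ε, differentiable in (μ, log σ²)

loss = (z ** 2).sum()                             # stand-in for -log p_θ(x | z)
loss.backward()                                   # gradients arrive in mu.grad and logvar.grad through z
```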

  17. VAEs: Reparameterization Trick
     - Variational lower bound:
       L(θ, φ; x) = E_{q_φ(z|x)} [ log p_θ(x | z) ] − KL( q_φ(z|x) || p(z) )
       E_{q_φ(z|x)} [ log p_θ(x | z) ] = E_{ε∼N(0,I)} [ log p_θ(x | z_φ(ε)) ]
     - Optimizing L(θ, φ; x) w.r.t. q_φ(z | x):
       ∇_φ E_{q_φ(z|x)} [ log p_θ(x | z) ] = E_{ε∼N(0,I)} [ ∇_φ log p_θ(x | z_φ(ε)) ]
       - Uses the gradients w.r.t. the latent variables
     - For Gaussian distributions, KL( q_φ(z|x) || p(z) ) can be computed and differentiated analytically
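
The analytic KL term mentioned above has the standard closed form for a diagonal Gaussian posterior against a standard normal prior: KL( N(μ, diag(σ²)) || N(0, I) ) = ½ Σ_d ( μ_d² + σ_d² − log σ_d² − 1 ). A minimal sketch, with illustrative shapes:

```python
import torch

mu = torch.randn(8, 20)        # μ_φ(x) for a batch (illustrative shapes)
logvar = torch.randn(8, 20)    # log σ_φ(x)²

# KL( N(μ, diag(σ²)) || N(0, I) ), computed per example and differentiable in μ and log σ²
kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1)
```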

  18. VAEs: Training
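
As a concrete stand-in for the training procedure, here is a minimal PyTorch sketch of one VAE training step on binary data, combining a one-sample reparameterized reconstruction term with the analytic Gaussian KL from the previous slides. The encoder/decoder architectures, sizes, and optimizer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, z_dim, hidden = 784, 20, 400   # illustrative sizes

enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * z_dim))  # q_φ(z|x)
dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))      # p_θ(x|z) logits
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def train_step(x):                                    # x: (batch, x_dim), entries in {0, 1}
    mu, logvar = enc(x).chunk(2, dim=1)               # q_φ(z|x) = N(μ, diag(σ²))
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)                    # reparameterized sample
    recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction='sum')     # ≈ -E_q[log p_θ(x|z)]
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0)              # KL(q_φ(z|x) || N(0, I))
    loss = recon + kl                                 # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage, e.g. with binarized MNIST batches: loss = train_step(x_batch)
```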

  19. VAEs: Results

  20. VAEs: Results
     - Generated MNIST images [Gregor et al., 2015]

  21. VAEs: Limitations and variants
     - Element-wise reconstruction error
       - For image generation, every pixel must be reconstructed
       - Sensitive to irrelevant variance, e.g., translations
     - Variant: feature-wise (perceptual-level) reconstruction [Dosovitskiy et al., 2016]
       - Use a pre-trained neural network to extract features of the data
       - Generated images are required to have feature vectors similar to those of the data
     - Variant: combining VAEs with GANs [Larsen et al., 2016] (more later)
     [Figure: reconstruction results with different losses]

  22. VAEs: Limitations and variants
     - Not applicable to discrete latent variables
       - Differentiable reparameterization does not apply to discrete variables
       - The wake-sleep algorithm and GANs allow discrete latents
     - Variant: marginalize out the discrete latents [Kingma et al., 2014]
       - Expensive when the discrete space is large
     - Variant: use continuous approximations
       - Gumbel-softmax [Jang et al., 2017] for approximating multinomial variables
     - Variant: combine VAEs with the wake-sleep algorithm [Hu et al., 2017]
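
A minimal sketch of the Gumbel-softmax relaxation mentioned above: a categorical sample is approximated by a differentiable softmax over logits perturbed with Gumbel noise, so gradients can pass through the "sampling" step. The temperature, logits, and toy downstream loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, temperature=0.5):
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1)
    u = torch.rand_like(logits)
    g = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Differentiable relaxation of a one-hot categorical sample
    return F.softmax((logits + g) / temperature, dim=-1)

logits = torch.zeros(4, 10, requires_grad=True)   # batch of 4, 10 categories (illustrative)
y = gumbel_softmax_sample(logits)                 # "soft" one-hot vectors; sharpen as temperature → 0
loss = (y * torch.arange(10.0)).sum()             # toy downstream objective (relaxed expected index)
loss.backward()                                   # gradients flow back into the logits
```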

  23. VAEs: Limitations and variants
     - Usually use a fixed standard normal distribution as the prior: p(z) = N( z; 0, I )
       - For ease of inference and learning
       - Limited flexibility: converts the data distribution to a fixed, single-mode prior distribution
     - Variant: use hierarchical nonparametric priors [Goyal et al., 2017]
       - E.g., Dirichlet process, nested Chinese restaurant process (more later)
       - Learn the structures of the priors jointly with the model


  25. Deep Generative Models
     - Implicit probabilistic models: define a stochastic process that simulates data x
     - Do not require a tractable likelihood function
     - Data simulator: a natural approach for problems in population genetics, weather, ecology, etc.
     - E.g., generate data from a deterministic transformation of parameters and random noise (e.g., GANs):
       x_n = g_θ(z_n),   z_n ∼ N(0, I)

  26. Generative Adversarial Nets (GANs)
     - [Goodfellow et al., 2014]
     - Assume an implicit generative model
     - Learn the cost function jointly
     - Interpreted as a mini-max game between a generator and a discriminator
     - Generate sharp, high-fidelity samples
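
A minimal sketch of the mini-max game, assuming the standard GAN objective min_G max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))]; the tiny MLPs, the toy data source, and the non-saturating generator loss are illustrative assumptions rather than details from the slide.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 16, 2   # illustrative sizes
G = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))            # generator x = G(z)
D = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # discriminator D(x) in (0, 1)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def gan_step(x_real):
    # Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))]
    z = torch.randn(x_real.size(0), z_dim)
    x_fake = G(z).detach()
    d_loss = -(torch.log(D(x_real) + 1e-8).mean() + torch.log(1 - D(x_fake) + 1e-8).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (non-saturating variant): maximize E[log D(G(z))]
    z = torch.randn(x_real.size(0), z_dim)
    g_loss = -torch.log(D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Usage with a toy 2-D "real" distribution: gan_step(torch.randn(128, 2) + 3.0)
```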
