Generative networks part 2: GANs


  1. Generative networks part 2: GANs


  2. Recap on generative networks. Generative networks provide a way to sample from any distribution.
     1. Sample z ∼ µ, where µ denotes an efficiently sampleable distribution (e.g., uniform or Gaussian).
     2. Output g(z), where g : R^d → R^m is a deep network.
     Notation: let g_#µ (the pushforward of µ through g) denote this distribution.
     Brief remarks:
     - Can this model any target distribution ν? Yes, (roughly) for the same reason that g can approximate any f : R^d → R^m.
     - Graphical models let us sample and estimate probabilities; what about here? Nope.
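As a concrete illustration of the two-step recipe (not from the slides), here is a minimal NumPy sketch; the untrained two-layer ReLU network g is a hypothetical stand-in for a trained generator.

```python
import numpy as np

# Sketch of sampling from a pushforward g_# mu: draw z ~ mu, output g(z).
rng = np.random.default_rng(0)
d, m, width = 2, 3, 16                              # latent dim, output dim, hidden width
W1, b1 = rng.normal(size=(width, d)), rng.normal(size=width)
W2, b2 = rng.normal(size=(m, width)), rng.normal(size=m)

def g(z):
    """A deep network g : R^d -> R^m (here, one hidden ReLU layer)."""
    return W2 @ np.maximum(0.0, W1 @ z + b1) + b2

z = rng.normal(size=d)   # step 1: z ~ mu (standard Gaussian)
x = g(z)                 # step 2: a sample from the pushforward g_# mu
print(x)
```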

  3. Univariate examples. g(x) = x, the identity function, mapping Uniform([0,1]) to itself.
     [Plot: the identity map on [0,1].]

  4. Univariate examples. g(x) = x²/2, mapping Uniform([0,1]) to a distribution with density ∝ 1/√x.
     [Plot: the pushforward density, blowing up near 0.]
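A quick empirical check of this example (an illustration, not part of the deck): push uniform samples through g and histogram the outputs.

```python
import numpy as np

# Push Uniform([0,1]) through g(x) = x^2/2 and inspect the output histogram.
rng = np.random.default_rng(0)
y = rng.uniform(size=100_000) ** 2 / 2                       # samples from g_# Uniform([0,1])
dens, edges = np.histogram(y, bins=20, range=(0.0, 0.5), density=True)
print(dens[:4])   # density grows near 0, consistent with p(y) = 1/sqrt(2y)
```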

  5. Univariate examples. g is the inverse CDF of the Gaussian; the input distribution is Uniform([0,1]) and the output is Gaussian.
     [Plot: Gaussian density of the output sample.]
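This example is easy to reproduce; a minimal sketch using SciPy's percent-point function (the inverse CDF):

```python
import numpy as np
from scipy.stats import norm

# g = inverse Gaussian CDF, applied to Uniform([0,1]) samples,
# yields standard Gaussian samples.
rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)   # z ~ Uniform([0,1])
x = norm.ppf(u)                 # g(z); output distribution is N(0, 1)
print(x.mean(), x.std())        # approximately 0 and 1
```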


  6. Another way to visualize generative networks. Given a sample from a distribution (even g_#µ), here's the "kernel density" / "Parzen window" estimate of its density:
     1. Start with a random draw (x_i)_{i=1}^n.
     2. "Place bumps at every x_i": define
            p̂(x) := (1/n) Σ_{i=1}^n k((x − x_i)/h),
        where k is a kernel function (not the SVM one!) and h is the "bandwidth"; for example:
        - Gaussian: k(z) ∝ exp(−‖z‖²/2);
        - Epanechnikov: k(z) ∝ max{0, 1 − ‖z‖²}.
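A direct implementation of this estimator takes only a few lines. The sketch below uses the Gaussian kernel and adds a 1/h normalization (often absorbed into k, and not written on the slide) so that p̂ integrates to 1 in one dimension.

```python
import numpy as np

def parzen_density(x, sample, h):
    """Evaluate p_hat(x) = (1/(n*h)) * sum_i k((x - x_i)/h) at each point of x."""
    z = (x[:, None] - sample[None, :]) / h           # shape (len(x), n)
    k = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)     # normalized Gaussian kernel
    return k.mean(axis=1) / h

rng = np.random.default_rng(0)
sample = rng.normal(size=500)                        # the random draw (x_i)_{i=1}^n
xs = np.linspace(-4, 4, 200)
p_hat = parzen_density(xs, sample, h=0.3)            # bandwidth h = 0.3
```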

  7. Examples — univariate sampling. Univariate sample: kernel density estimate (kde) vs. a GMM fit by E-M.
     [Plot: "kde" and "gmm" density curves over the sample.]

  8. Examples — univariate sampling. Univariate sample: kernel density estimate (kde) vs. a kde of GAN samples.
     [Plot: "kde" and "gan kde" density curves over the same sample.]
     This is admittedly very indirect! As mentioned, there aren't great ways to get GAN/VAE density information.

  9. Examples — bivariate sampling. Bivariate sample, GMM fit by E-M.
     [Plot: bivariate sample with fitted GMM density.]

  10. Examples — bivariate sampling. Bivariate sample, kernel density estimate (kde).
      [Plot: bivariate sample with kde contours.]

  11. Examples — bivariate sampling. Bivariate sample, GAN kde.
      [Plot: bivariate sample with kde of GAN samples.]
      Question: how will this plot change with network capacity?



  12. Approaches we've seen for modeling distributions. Let's survey our approaches to density estimation.
      - Graphical models: can be interpretable, can encode domain knowledge.
      - Kernel density estimation: easy to implement, converges to the right thing, suffers a curse of dimension.
      - Training: easy for KDE, messy for graphical models. Interpretability: fine for both. Sampling: easy for both. Probability measurements: easy for KDE, sometimes easy for graphical models.
      Deep networks:
      - Either we have easy sampling, or we can estimate densities; doing both seems to have major computational or data costs.

  13. Brief VAE Recap


  14. (Variational) Autoencoders.
      Autoencoder:
      - x_i → (encoder map f) → latent z_i = f(x_i) → (decoder map g) → x̂_i = g(z_i).
        Objective: (1/n) Σ_{i=1}^n ℓ(x_i, x̂_i).
      Variational autoencoder:
      - x_i → (encoder map f) → latent distribution µ_i = f(x_i) → (pushforward through g) → x̂_i ∼ g_#µ_i.
        Objective: (1/n) Σ_{i=1}^n [ℓ(x_i, x̂_i) + λ · KL(µ, µ_i)].
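For concreteness, here is a minimal sketch of this objective as it is commonly implemented, assuming (beyond what the slide states) a diagonal-Gaussian encoder, a standard-Gaussian prior µ = N(0, I), squared-error loss ℓ, one reparameterized sample per input, and the usual closed-form KL between the encoder distribution and the prior; `enc` and `dec` are hypothetical networks.

```python
import torch

def vae_loss(x, enc, dec, lam=1.0):
    """One-sample estimate of the VAE objective for a batch x of shape (n, m)."""
    m_i, log_var = enc(x)                          # encoder: mu_i = N(m_i, diag(exp(log_var)))
    z = m_i + torch.exp(0.5 * log_var) * torch.randn_like(m_i)   # reparameterized z ~ mu_i
    x_hat = dec(z)                                 # x_hat_i ~ g_# mu_i
    recon = ((x - x_hat) ** 2).sum(dim=1)          # squared-error stand-in for ell(x_i, x_hat_i)
    # Closed-form KL between the diagonal Gaussian mu_i and the prior N(0, I):
    kl = 0.5 * (torch.exp(log_var) + m_i ** 2 - 1.0 - log_var).sum(dim=1)
    return (recon + lam * kl).mean()
```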

  15. [Plot: VAE reconstructions, x̂_i ∼ g_#µ_i.]

  16. [Plot: fresh VAE samples, x̂_i ∼ g_#µ with small λ.]

  17. Generative Adversarial Networks (GANs)


  18. Generative network setup and training.
      - We are given (x_i)_{i=1}^n ∼ ν.
      - We want to find g so that (g(z_i))_{i=1}^n ≈ (x_i)_{i=1}^n, where (z_i)_{i=1}^n ∼ µ.
      Problem: this isn't as simple as fitting g(z_i) ≈ x_i.
      Solutions:
      - VAE: for each x_i, construct a distribution µ_i so that x̂_i ∼ g_#µ_i and x_i are close, as are µ_i and µ. To generate fresh samples, draw z ∼ µ and output g(z).
      - GAN: pick a notion of distance between distributions (or between the samples (g(z_i))_{i=1}^n and (x_i)_{i=1}^n), and pick g to minimize it!


  19. GAN overview. The GAN approach: minimize D(ν, g_#µ) directly, where "D" is some notion of distance/divergence:
      - Jensen-Shannon divergence (original GAN paper).
      - Wasserstein distance (influential follow-up).
      Each distance is computed with an alternating/adversarial scheme:
      1. We have some current choice g_t, and use it to produce a sample (x̂_i)_{i=1}^n with x̂_i = g_t(z_i).
      2. We train a discriminator/critic f_t to find differences between (x̂_i)_{i=1}^n and (x_i)_{i=1}^n.
      3. We then pick a new generator g_{t+1}, trained to fool f_t!

  20. Jensen-Shannon divergence (original GAN)



  21. Original GAN formulation. Let p, p_g denote the densities of the data and the generator, respectively, and let p̃ = (p + p_g)/2. The original GAN minimizes the Jensen-Shannon divergence:
          2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃)
                         = ∫ p(x) ln(p(x)/p̃(x)) dx + ∫ p_g(x) ln(p_g(x)/p̃(x)) dx
                         = E_p ln(p(x)/p̃(x)) + E_{p_g} ln(p_g(x)/p̃(x)).
      But we've been saying we can't write down p_g? The original GAN approach instead applies alternating minimization to
          inf_{g∈G} sup_{f∈F, f : X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].
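The divergence itself is easy to evaluate when densities are available; a small numeric check of the identity above on two arbitrary discrete distributions (chosen here for illustration):

```python
import numpy as np

def kl(a, b):
    """KL(a, b) for discrete distributions with full support."""
    return float(np.sum(a * np.log(a / b)))

p   = np.array([0.5, 0.3, 0.2])   # arbitrary "data" histogram
p_g = np.array([0.2, 0.2, 0.6])   # arbitrary "generator" histogram
p_tilde = (p + p_g) / 2
js = 0.5 * (kl(p, p_tilde) + kl(p_g, p_tilde))
print(js)                          # JS(p, p_g) always lies in [0, ln 2]
```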

  22. Original GAN formulation and algorithm. The original GAN objective:
          inf_{g∈G} sup_{f∈F, f : X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].
      The algorithm alternates two steps:
      1. Hold g fixed and optimize f. Specifically, generate a sample (x̂_j)_{j=1}^m = (g(z_j))_{j=1}^m, and approximately optimize
             sup_{f∈F, f : X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(x̂_j)) ].
      2. Hold f fixed and optimize g. Specifically, generate (z_j)_{j=1}^m and approximately optimize
             inf_{g∈G} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].
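A minimal sketch of one round of this alternation in PyTorch, under stated assumptions rather than as the paper's exact procedure: `f` (the discriminator, assumed to end in a sigmoid so its outputs lie in (0,1)), `g` (the generator), the optimizers, and `d_latent` are all hypothetical placeholders.

```python
import torch

def gan_round(f, g, opt_f, opt_g, x, d_latent):
    """One alternating round: update f with g frozen, then update g with f frozen."""
    m = x.shape[0]
    # Step 1: hold g fixed, (approximately) maximize the objective over f.
    z = torch.randn(m, d_latent)
    x_hat = g(z).detach()                          # sample (x_hat_j) from the frozen g_t
    loss_f = -(torch.log(f(x)).mean() + torch.log(1 - f(x_hat)).mean())
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()
    # Step 2: hold f fixed, (approximately) minimize over g (train g to fool f_t).
    z = torch.randn(m, d_latent)
    loss_g = torch.log(1 - f(g(z))).mean()         # the term of the objective involving g
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```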
