  1. Generative Adversarial Networks. Aaron Mishkin, UBC MLRG 2018W2

  2. Generative Adversarial Networks. “Two imaginary celebrities that were dreamed up by a random number generator.” https://research.nvidia.com/publication/2017-10_Progressive-Growing-of

  3. Why care about GANs? Why spend your limited time learning about GANs:
      • GANs are achieving state-of-the-art results in a large variety of image generation tasks.
      • There’s been a veritable explosion in GAN publications over the last few years – many people are very excited!
      • GANs are stimulating new theoretical interest in min-max optimization problems and “smooth games”.

  4. Why care about GANs: Hyper-realistic Image Generation. StyleGAN: image generation with hierarchical style transfer [3]. https://arxiv.org/abs/1812.04948

  5. Why care about GANs: Conditional Generative Models. Conditional GANs: high-resolution image synthesis via semantic labeling [8]. Input: segmentation map. Output: synthesized image. https://research.nvidia.com/publication/2017-12_High-Resolution-Image-Synthesis

  6. Why care about GANs: Image Super-Resolution. SRGAN: photo-realistic super-resolution [4]. Shown: bicubic interpolation vs. SRGAN vs. the original image. https://arxiv.org/abs/1609.04802

  7. Why care about GANs: Publications. Approximately 500 GAN papers as of September 2018! See https://github.com/hindupuravinash/the-gan-zoo for the exhaustive list of papers. Image credit: https://github.com/bgavran.

  8. Generative Models

  9. Generative Modeling
      Generative models estimate the probabilistic process that generated a set of observations D.
      • D = {(x_i, y_i)}_{i=1}^n : supervised generative models learn the joint distribution p(x_i, y_i), often to compute p(y_i | x_i).
      • D = {x_i}_{i=1}^n : unsupervised generative models learn the distribution of D for clustering, sampling, etc.
      We can:
      • directly estimate p(x_i), or
      • introduce latent variables y_i and estimate p(x_i, y_i).
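The following is a minimal sketch (not from the slides) of the supervised case: a Gaussian naive Bayes model fits the joint p(x, y) and then uses Bayes' rule to report p(y | x). The synthetic two-class data and all sizes are illustrative assumptions.

```python
# Illustrative only: a supervised generative model (Gaussian naive Bayes).
# It models the joint p(x, y) = p(x | y) p(y) and uses Bayes' rule for p(y | x).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two classes with different means: the "generative process" behind D.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(3.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

model = GaussianNB().fit(X, y)            # estimates p(x | y) and p(y)
print(model.predict_proba([[1.5, 1.5]]))  # p(y | x) via Bayes' rule
```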

  10. Generative Modeling: Unsupervised Parametric Approaches
      • Direct Estimation: choose a parameterized family p(x | θ) and learn θ by maximizing the log-likelihood
        θ* = arg max_θ Σ_{i=1}^n log p(x_i | θ).
      • Latent Variable Models: define a joint distribution p(x, z | θ) and learn θ by maximizing the log-marginal likelihood
        θ* = arg max_θ Σ_{i=1}^n log ∫ p(x_i, z_i | θ) dz_i.
      Both approaches require that p(x | θ) is easy to evaluate.
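As a toy illustration of direct estimation (my own sketch, not the speaker's), the snippet below fits a one-dimensional Gaussian family p(x | θ) with θ = (μ, log σ) by maximizing the log-likelihood numerically; the data and optimizer choice are arbitrary assumptions.

```python
# Illustrative only: direct maximum-likelihood estimation for a
# one-dimensional Gaussian family p(x | θ) with θ = (μ, log σ).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # observations D

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

# θ* = arg max_θ Σ_i log p(x_i | θ), found here by minimizing the negative.
theta_star = minimize(neg_log_likelihood, x0=[0.0, 0.0]).x
print("mu* ≈", theta_star[0], " sigma* ≈", np.exp(theta_star[1]))
```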

  11. Generative Modeling: Models for (Very) Complex Data. How can we learn such models for very complex data? https://www.researchgate.net/figure/Heterogeneousness-and-diversity-of-the-CIFAR-10-entries-in-their-10-

  12. Generative Modeling: Normalizing Flows and VAEs
      Design parameterized densities with huge capacity!
      • Normalizing flows: a sequence of non-linear transformations of a simple distribution p_z(z),
        p(x | θ_{0:k}) = p_z(z) where z = f_{θ_k}^{-1} ∘ ··· ∘ f_{θ_1}^{-1} ∘ f_{θ_0}^{-1}(x).
        Each f_{θ_j} must be invertible with a tractable log-determinant Jacobian.
      • VAEs: latent-variable models where inference networks specify parameters,
        p(x, y | θ) = p(x | f_θ(y)) p_y(y).
        The marginal likelihood is maximized via the ELBO.
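A minimal sketch of the flow idea under simple assumptions: one invertible affine map f_θ(z) = a·z + b stands in for the composed transformations, and the density is evaluated by inverting it and adding the log-determinant Jacobian term that the bullet above requires. The parameter values are hypothetical.

```python
# Illustrative only: the change-of-variables computation behind a normalizing
# flow, using one invertible affine map f_θ(z) = a * z + b (so θ = (a, b)).
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0          # flow parameters (hypothetical values)

def log_density(x):
    z = (x - b) / a      # invert the flow: z = f_θ^{-1}(x)
    # log p(x | θ) = log p_z(z) + log |det J_{f^{-1}}(x)| = log p_z(z) - log |a|
    return norm.logpdf(z) - np.log(abs(a))

def sample(n, rng=np.random.default_rng(0)):
    return a * rng.standard_normal(n) + b   # x = f_θ(z), z ~ p_z

xs = sample(5)
print(log_density(xs))   # a tractable density: exactly what GANs give up
```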

  13. GANs

  14. GANs: Density-Free Models
      Generative Adversarial Networks (GANs) instead use an unrestricted generator G_{θ_g}(z) such that
      p(x | θ_g) = p_z({z}) where {z} = G_{θ_g}^{-1}(x).
      • Problem: the inverse image of G_{θ_g}(z) may be huge!
      • Problem: it’s likely intractable to preserve volume through G(z; θ_g).
      So, we can’t evaluate p(x | θ_g) and we can’t learn θ_g by maximum likelihood.

  15. GANs: Discriminators
      GANs learn by comparing model samples with examples from D.
      • Sampling from the generator is easy: x̂ = G_{θ_g}(ẑ), where ẑ ∼ p_z(z).
      • Given a sample x̂, a discriminator tries to distinguish it from true examples: D(x) = Pr(x ∼ p_data).
      • The discriminator “supervises” the generator network.
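To make the two roles concrete, here is a hedged sketch with tiny placeholder MLPs standing in for G_{θ_g} and D_{θ_d} (the sizes and architectures are assumptions, not those of any paper): sampling x̂ = G(ẑ) is a single forward pass, and D maps each sample to a probability of being real.

```python
# Illustrative only: sampling from a generator and scoring the sample with a
# discriminator. Both networks are tiny placeholder MLPs.
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

z_hat = torch.randn(5, latent_dim)   # ẑ ~ p_z(z), a simple Gaussian base
x_hat = G(z_hat)                     # x̂ = G_{θ_g}(ẑ): sampling is easy
print(D(x_hat))                      # D(x̂) ≈ Pr(x̂ ~ p_data)
```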

  16. GANs: Generator + Discriminator. https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

  17. GANs: Goodfellow et al. (2014)
      • Let z ∈ R^m and p_z(z) be a simple base distribution.
      • The generator G_{θ_g}(z) : R^m → D̃ is a deep neural network.
      • D̃ is the manifold of generated examples.
      • The discriminator D_{θ_d}(x) : D ∪ D̃ → (0, 1) is also a deep neural network.
      https://arxiv.org/abs/1511.06434

  18. GANs: Saddle-Point Optimization
      Saddle-point optimization: learn G_{θ_g}(z) and D_{θ_d}(x) jointly via the objective V(θ_d, θ_g):
      min_{θ_g} max_{θ_d} E_{p_data}[ log D_{θ_d}(x) ] + E_{p_z(z)}[ log(1 − D_{θ_d}(G_{θ_g}(z))) ],
      where the first term is the likelihood of true data and the second the likelihood of generated data.
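A sketch of how V(θ_d, θ_g) can be estimated by Monte Carlo from minibatches, again with placeholder networks and a stand-in "data" distribution (both are assumptions made only for illustration).

```python
# Illustrative only: a Monte Carlo estimate of the saddle-point objective
# V(θ_d, θ_g) using placeholder networks and a toy 2-D "data" distribution.
import torch
import torch.nn as nn

latent_dim, data_dim, m = 8, 2, 256
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

x = torch.randn(m, data_dim) + 3.0   # stand-in for samples from p_data
z = torch.randn(m, latent_dim)       # z ~ p_z(z)

# V ≈ E_{p_data}[log D(x)] + E_{p_z}[log(1 - D(G(z)))]
V = torch.log(D(x)).mean() + torch.log(1 - D(G(z))).mean()
print(V.item())   # θ_d is trained to increase this, θ_g to decrease it
```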

  19. GANs: Optimal Discriminators
      Claim: Given G_{θ_g} defining an implicit distribution p_g = p(x | θ_g), the optimal discriminator is
      D*(x) = p_data(x) / (p_data(x) + p_g(x)).
      Proof Sketch:
      V(θ_d, θ_g) = ∫_D p_data(x) log D(x) dx + ∫ p_z(z) log(1 − D(G_{θ_g}(z))) dz
                  = ∫_{D ∪ D̃} [ p_data(x) log D(x) + p_g(x) log(1 − D(x)) ] dx.
      Maximizing the integrand for all x is sufficient and gives the result (see bonus slides).
      Previous slide: https://commons.wikimedia.org/wiki/File:Saddle_point.svg
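The pointwise maximization in the proof sketch can be checked numerically: for fixed values a = p_data(x) and b = p_g(x), the integrand a·log D + b·log(1 − D) peaks at D = a/(a + b). The particular values of a and b below are arbitrary.

```python
# Illustrative only: a numerical check that, for fixed densities
# a = p_data(x) and b = p_g(x), the integrand a*log(D) + b*log(1 - D)
# is maximized at D* = a / (a + b).
import numpy as np

a, b = 0.7, 0.2                       # hypothetical density values at some x
D = np.linspace(1e-4, 1 - 1e-4, 100001)
integrand = a * np.log(D) + b * np.log(1 - D)

print("argmax over the grid:", D[np.argmax(integrand)])
print("closed form a/(a+b): ", a / (a + b))   # the two agree
```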

  20. GANs: Jensen-Shannon Divergence and Optimal Generators
      Given an optimal discriminator D*(x), the generator objective is
      C(θ_g) = E_{p_data}[ log D*_{θ_d}(x) ] + E_{p_g(x)}[ log(1 − D*_{θ_d}(x)) ]
             = E_{p_data}[ log ( p_data(x) / (p_data(x) + p_g(x)) ) ] + E_{p_g(x)}[ log ( p_g(x) / (p_data(x) + p_g(x)) ) ]
             ∝ (1/2) KL( p_data ∥ (p_data + p_g)/2 ) + (1/2) KL( p_g ∥ (p_data + p_g)/2 )   (up to additive and multiplicative constants),
      which is the Jensen-Shannon divergence between p_data and p_g.
      C(θ_g) achieves its global minimum at p_g = p_data given an optimal discriminator!
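A small numerical illustration of the conclusion (distribution values chosen arbitrarily): computing the Jensen-Shannon divergence from its KL definition shows it is positive when p_g ≠ p_data and exactly zero when they match.

```python
# Illustrative only: the Jensen-Shannon divergence between two discrete
# distributions, computed from its definition via KL terms. It is zero
# exactly when p_g = p_data, matching the global minimum of C(θ_g).
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.1, 0.4, 0.5])
p_g    = np.array([0.3, 0.3, 0.4])

print(jsd(p_data, p_g))       # > 0 while the generator mismatches the data
print(jsd(p_data, p_data))    # = 0 at p_g = p_data
```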

  21. GANs: Learning Generators and Discriminators
      Putting these results to use in practice:
      • High-capacity discriminators D_{θ_d} approximate the Jensen-Shannon divergence when close to the global maximum.
      • D_{θ_d} is a “differentiable program”.
      • We can use D_{θ_d} to learn G_{θ_g} with our favourite gradient descent method.
      https://arxiv.org/abs/1511.06434

  22. GANs: Training Procedure
      for i = 1 ... N do
          for k = 1 ... K do
              • Sample noise samples { z_1, ..., z_m } ∼ p_z(z).
              • Sample examples { x_1, ..., x_m } from p_data(x).
              • Update the discriminator D_{θ_d} by gradient ascent on its objective:
                θ_d = θ_d + α_d ∇_{θ_d} (1/m) Σ_{j=1}^m [ log D(x_j) + log(1 − D(G(z_j))) ].
          end for
          • Sample noise samples { z_1, ..., z_m } ∼ p_z(z).
          • Update the generator G_{θ_g}:
            θ_g = θ_g − α_g ∇_{θ_g} (1/m) Σ_{j=1}^m log(1 − D(G(z_j))).
      end for
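Below is a minimal PyTorch rendering of this procedure on a toy 1-D problem. It is a sketch, not the reference implementation: the architectures, step sizes α_d and α_g, minibatch size m, and the Gaussian stand-in for p_data are all assumptions.

```python
# Illustrative only: a minimal version of the training procedure above,
# fitting a toy 1-D Gaussian. All hyperparameters are arbitrary choices.
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, m, K, N = 4, 64, 2, 2000
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)

def sample_data(m):                       # stand-in for p_data: N(2, 0.5^2)
    return 2.0 + 0.5 * torch.randn(m, 1)

for i in range(N):
    for k in range(K):                    # K discriminator steps
        x, z = sample_data(m), torch.randn(m, latent_dim)
        # Ascend E[log D(x)] + E[log(1 - D(G(z)))] by descending its negative.
        loss_d = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    z = torch.randn(m, latent_dim)        # one generator step
    # Descend E[log(1 - D(G(z)))], as in the minimax objective.
    loss_g = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

x_fake = G(torch.randn(1000, latent_dim))
print("generated mean/std:", x_fake.mean().item(), x_fake.std().item())
```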

  23. Problems (c. 2016)

  24. Problems with GANs
      • Vanishing gradients: the discriminator becomes “too good” and the generator gradient vanishes.
      • Non-convergence: the generator and discriminator oscillate without reaching an equilibrium.
      • Mode collapse: the generator distribution collapses to a small set of examples.
      • Mode dropping: the generator distribution doesn’t fully cover the data distribution.

  25. Problems: Vanishing Gradients
      • The minimax objective saturates when D_{θ_d} is close to perfect:
        V(θ_d, θ_g) = E_{p_data}[ log D_{θ_d}(x) ] + E_{p_z(z)}[ log(1 − D_{θ_d}(G_{θ_g}(z))) ].
      • A non-saturating heuristic objective for the generator is
        J(G_{θ_g}) = − E_{p_z(z)}[ log D_{θ_d}(G_{θ_g}(z)) ].
      https://arxiv.org/abs/1701.00160
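The sketch below contrasts the two generator objectives on a batch of samples from placeholder networks (sizes and data are assumptions); the point is only that the non-saturating form −log D(G(z)) retains gradient signal where log(1 − D(G(z))) flattens out.

```python
# Illustrative only: the saturating vs. non-saturating generator losses side
# by side for one batch of generated samples, using placeholder networks.
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

z = torch.randn(128, latent_dim)
d_fake = D(G(z))

saturating     = torch.log(1 - d_fake).mean()   # part of the minimax V
non_saturating = -torch.log(d_fake).mean()      # heuristic J(G_{θ_g})

# Both push D(G(z)) toward 1, but the non-saturating loss keeps a strong
# gradient when the discriminator confidently rejects samples (D(G(z)) ≈ 0).
print(saturating.item(), non_saturating.item())
```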

  26. Problems: Addressing Vanishing Gradients
      Solutions:
      • Change objectives: use the non-saturating heuristic objective, maximum-likelihood cost, etc.
      • Limit the discriminator: restrict the capacity of the discriminator.
      • Schedule learning: try to balance training D_{θ_d} and G_{θ_g}.

  27. Problems: Non-Convergence
      Simultaneous gradient descent is not guaranteed to converge for minimax objectives.
      • Goodfellow et al. only showed convergence when updates are made in function space [2].
      • The parameterization of D_{θ_d} and G_{θ_g} results in a highly non-convex objective.
      • In practice, training tends to oscillate – updates “undo” each other, as in the toy example sketched below.
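The standard toy example of this failure (not from the slides) is the bilinear game min_x max_y x·y: simultaneous gradient steps rotate around the equilibrium at the origin and slowly spiral outward instead of converging.

```python
# Illustrative only: simultaneous gradient descent/ascent on the classic
# bilinear game min_x max_y x*y. The unique equilibrium is (0, 0), but the
# iterates spiral outward instead of converging.
import numpy as np

x, y = 1.0, 1.0
lr = 0.1
for t in range(200):
    grad_x, grad_y = y, x                       # ∂(xy)/∂x = y, ∂(xy)/∂y = x
    x, y = x - lr * grad_x, y + lr * grad_y     # simultaneous updates

print(np.hypot(x, y))   # distance from the equilibrium keeps growing
```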

  28. Problems: Addressing Non-Convergence. Solutions: lots and lots of hacks! https://github.com/soumith/ganhacks

  29. Problems: Mode Collapse and Mode Dropping
      One explanation: SGD may optimize the max-min objective
      max_{θ_d} min_{θ_g} E_{p_data}[ log D_{θ_d}(x) ] + E_{p_z(z)}[ log(1 − D_{θ_d}(G_{θ_g}(z))) ].
      Intuition: the generator maps all z values to the single x̂ that is most likely to fool the discriminator.
      https://arxiv.org/abs/1701.00160

  30. A Possible Solution
