Generative Adversarial Networks
Aaron Mishkin, UBC MLRG 2018W2
Generative Adversarial Networks
“Two imaginary celebrities that were dreamed up by a random number generator.”
https://research.nvidia.com/publication/2017-10_Progressive-Growing-of
Why care about GANs?
Why spend your limited time learning about GANs?
- GANs are achieving state-of-the-art results in a wide variety of image generation tasks.
- There has been a veritable explosion in GAN publications over the last few years; many people are very excited!
- GANs are stimulating new theoretical interest in min-max optimization problems and "smooth games".
Why care about GANs: Hyper-realistic Image Generation
StyleGAN: image generation with hierarchical style transfer [3].
https://arxiv.org/abs/1812.04948
Why care about GANs: Conditional Generative Models
Conditional GANs: high-resolution image synthesis via semantic labeling [8]. (Figure: input segmentation map; output: synthesized image.)
https://research.nvidia.com/publication/2017-12_High-Resolution-Image-Synthesis
Why care about GANs: Image Super Resolution
SRGAN: photo-realistic super-resolution [4]. (Figure panels: bicubic interpolation, SRGAN output, original image.)
https://arxiv.org/abs/1609.04802
Why care about GANs: Publications
Approximately 500 GAN papers as of September 2018!
See https://github.com/hindupuravinash/the-gan-zoo for the exhaustive list of papers. Image Credit: https://github.com/bgavran.
Generative Models
Generative Modeling
Generative Models estimate the probabilistic process that generated a set of observations D.
- D = {(xi, yi)}_{i=1}^n: supervised generative models learn the joint distribution p(xi, yi), often in order to compute p(yi | xi).
- D = {xi}_{i=1}^n: unsupervised generative models learn the distribution of D for clustering, sampling, etc. We can either:
  - directly estimate p(xi), or
  - introduce latent variables zi and estimate p(xi, zi).
Generative Modeling: Unsupervised Parametric Approaches
- Direct Estimation: choose a parameterized family p(x | θ) and learn θ by maximizing the log-likelihood:

  θ* = arg max_θ Σ_{i=1}^n log p(xi | θ).

- Latent Variable Models: define a joint distribution p(x, z | θ) and learn θ by maximizing the log-marginal likelihood:

  θ* = arg max_θ Σ_{i=1}^n log ∫ p(xi, zi | θ) dzi.
Both approaches require that p(x | θ) is easy to evaluate.
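To make the direct-estimation recipe concrete, here is a minimal sketch in Python, assuming a univariate Gaussian family p(x | θ) with θ = (μ, σ); the data and all names are illustrative, not from the talk.

```python
import numpy as np

# Minimal sketch of direct estimation for a univariate Gaussian family.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # stand-in for the dataset D

def log_likelihood(x, mu, sigma):
    """sum_i log p(x_i | theta) for the Gaussian family."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                  - (x - mu)**2 / (2.0 * sigma**2))

# For this family, arg max_theta has a closed form: the sample mean
# and the (biased) sample standard deviation.
mu_hat, sigma_hat = data.mean(), data.std()
print(mu_hat, sigma_hat, log_likelihood(data, mu_hat, sigma_hat))
```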
Generative Modeling: Models for (Very) Complex Data
How can we learn such models for very complex data?
https://www.researchgate.net/figure/Heterogeneousness-and-diversity-of-the-CIFAR-10-entries-in-their-10-
Generative Modeling: Normalizing Flows and VAEs
Design parameterized densities with huge capacity!
- Normalizing flows: a sequence of invertible non-linear transformations of a simple base distribution pz(z):

  p(x | θ0:k) = pz(z) Π_{j=0}^k |det J_{f_{θj}^{-1}}|, where z = f_{θk}^{-1} ◦ · · · ◦ f_{θ1}^{-1} ◦ f_{θ0}^{-1}(x).

  Each f_{θj} must be invertible with a tractable log-determinant Jacobian (see the sketch after this list).
- VAEs: latent-variable models where neural networks specify the distribution's parameters: p(x, z | θ) = p(x | fθ(z)) pz(z). The marginal likelihood is maximized via the ELBO.
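Returning to the normalizing-flow formula above: a sanity-check sketch of the change-of-variables computation with a single affine flow f(z) = a·z + b. The map and all names are illustrative assumptions, not from the talk.

```python
import numpy as np

# Change-of-variables for one invertible affine flow f(z) = a*z + b, a != 0.
a, b = 2.0, -1.0

def base_log_prob(z):
    """log p_z(z) for a standard Gaussian base distribution."""
    return -0.5 * (np.log(2.0 * np.pi) + z**2)

def log_prob_x(x):
    z = (x - b) / a                    # z = f^{-1}(x)
    log_det = -np.log(np.abs(a))       # log |det d f^{-1} / dx|
    return base_log_prob(z) + log_det  # p(x | theta) via change of variables

print(log_prob_x(0.5))
```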
GANs
GANs: Density-Free Models
Generative Adversarial Networks (GANs) instead use an unrestricted generator Gθg(z), for which the model density would be p(x | θg) = pz({z}), where {z} = Gθg^{-1}(x) is the preimage of x.
- Problem: the inverse image of Gθg(z) may be huge!
- Problem: it's likely intractable to preserve volume through Gθg(z).
So, we can't evaluate p(x | θg) and we can't learn θg by maximum likelihood.
GANs: Discriminators
GANs learn by comparing model samples with examples from D.
- Sampling from the generator is easy: x̂ = Gθg(ẑ), where ẑ ∼ pz(z).
- Given a sample x̂, a discriminator tries to distinguish it from true examples: D(x) = Pr(x ∼ pdata).
- The discriminator "supervises" the generator network, as in the sketch below.
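An illustrative sketch of how the two networks fit together; the architectures are assumptions for a toy 2-D problem, not from the talk.

```python
import torch
import torch.nn as nn

# A tiny generator/discriminator pair for 2-D data.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

z = torch.randn(16, 8)   # z_hat ~ p_z(z), a simple base distribution
x_fake = G(z)            # x_hat = G_{theta_g}(z_hat): sampling is one forward pass
p_real = D(x_fake)       # D(x_hat) estimates Pr(x_hat ~ p_data)
print(p_real.shape)      # torch.Size([16, 1])
```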
GANs: Generator + Discriminator
https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
GANs: Goodfellow et al. (2014)
- Let z ∈ R^m and pz(z) be a simple base distribution.
- The generator Gθg(z) : R^m → D̃ is a deep neural network, where D̃ is the manifold of generated examples.
- The discriminator Dθd(x) : D ∪ D̃ → (0, 1) is also a deep neural network.
https://arxiv.org/abs/1511.06434
GANs: Saddle-Point Optimization
Saddle-Point Optimization: learn Gθg(z) and Dθd(x) jointly via the objective V(θd, θg):

  min_θg max_θd Epdata[log Dθd(x)] + Epz(z)[log(1 − Dθd(Gθg(z)))]

The first expectation is the likelihood the discriminator assigns to true data; the second is the likelihood it assigns to rejecting generated data.
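A minimal sketch, with illustrative stand-in networks, of how one Monte Carlo estimate of V(θd, θg) is computed from a minibatch:

```python
import torch
import torch.nn as nn

def estimate_V(D, G, x_real, z, eps=1e-7):
    term_real = torch.log(D(x_real) + eps).mean()    # E_pdata[log D(x)]
    term_fake = torch.log(1 - D(G(z)) + eps).mean()  # E_pz[log(1 - D(G(z)))]
    return term_real + term_fake  # theta_d ascends on V; theta_g descends

# Usage with stand-in networks:
G = nn.Sequential(nn.Linear(8, 2))
D = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
print(estimate_V(D, G, torch.randn(16, 2), torch.randn(16, 8)))
```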
GANs: Optimal Discriminators
Claim: Given Gθg defining an implicit distribution pg = p(x | θg), the optimal discriminator is

  D*(x) = pdata(x) / (pdata(x) + pg(x)).

Proof Sketch:

  V(θd, θg) = ∫_D pdata(x) log D(x) dx + ∫ pz(z) log(1 − D(Gθg(z))) dz
            = ∫_{D ∪ D̃} [pdata(x) log D(x) + pg(x) log(1 − D(x))] dx

Maximizing the integrand separately for each x is sufficient and gives the result (see bonus slides).
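Spelling out the pointwise maximization, which the slide defers to the bonus slides (a standard calculus step):

```latex
% For fixed x, write a = p_data(x), b = p_g(x) and maximize
% h(D) = a log D + b log(1 - D) over D in (0, 1).
\[
  h'(D) = \frac{a}{D} - \frac{b}{1-D} = 0
  \;\Longrightarrow\; a(1-D) = bD
  \;\Longrightarrow\; D^*(x) = \frac{a}{a+b}
  = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}.
\]
```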
Previous Slide: https://commons.wikimedia.org/wiki/File:Saddle_point.svg
GANs: Jensen-Shannon Divergence and Optimal Generators
Given an optimal discriminator D*(x), the generator objective is

  C(θg) = Epdata[log D*(x)] + Epg(x)[log(1 − D*(x))]
        = Epdata[log (pdata(x) / (pdata(x) + pg(x)))] + Epg(x)[log (pg(x) / (pdata(x) + pg(x)))]
        = −log 4 + 2 · JSD(pdata ∥ pg),

where the Jensen-Shannon divergence is

  JSD(pdata ∥ pg) = (1/2) KL(pdata ∥ (pdata + pg)/2) + (1/2) KL(pg ∥ (pdata + pg)/2).

C(θg) achieves its global minimum at pg = pdata given an optimal discriminator!
GANs: Learning Generators and Discriminators
Putting these results to use in practice:
- High-capacity discriminators Dθd approximate the Jensen-Shannon divergence when trained close to their global maximum.
- Dθd is a “differentiable program”.
- We can use Dθd to learn Gθg with our favourite gradient
descent method.
https://arxiv.org/abs/1511.06434
GANs: Training Procedure
for i = 1 . . . N do
  for k = 1 . . . K do
    Sample noise {z1, . . . , zm} ∼ pz(z).
    Sample examples {x1, . . . , xm} from pdata(x).
    Update the discriminator Dθd by gradient ascent:
      θd = θd + αd ∇θd (1/m) Σ_{i=1}^m [log D(xi) + log(1 − D(G(zi)))]
  end for
  Sample noise {z1, . . . , zm} ∼ pz(z).
  Update the generator Gθg by gradient descent:
    θg = θg − αg ∇θg (1/m) Σ_{i=1}^m log(1 − D(G(zi)))
end for

A runnable sketch of this loop is given below.
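A minimal runnable sketch of this procedure on a toy 2-D dataset; the architectures, learning rates, and data are illustrative assumptions, not from the talk.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
m, latent_dim, N, K = 64, 8, 200, 1  # minibatch size, noise dim, outer/inner steps

G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)
eps = 1e-7  # numerical guard inside the logs

def sample_data(m):
    """Stand-in for p_data: a Gaussian blob centered at (2, 2)."""
    return 0.5 * torch.randn(m, 2) + 2.0

for i in range(N):
    for k in range(K):
        # Discriminator step: ascend log D(x) + log(1 - D(G(z))).
        x, z = sample_data(m), torch.randn(m, latent_dim)
        loss_d = -(torch.log(D(x) + eps).mean()
                   + torch.log(1 - D(G(z).detach()) + eps).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: descend log(1 - D(G(z))) (the minimax form).
    z = torch.randn(m, latent_dim)
    loss_g = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(4, latent_dim)))  # samples should drift toward (2, 2)
```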
Problems (c. 2016)
Problems with GANs
- Vanishing gradients: the discriminator becomes "too good" and the generator gradient vanishes.
- Non-Convergence: the generator and discriminator oscillate
without reaching an equilibrium.
- Mode Collapse: the generator distribution collapses to a
small set of examples.
- Mode Dropping: the generator distribution doesn’t fully
cover the data distribution.
Problems: Vanishing Gradients
- The minimax objective saturates when Dθd is close to perfect:

  V(θd, θg) = Epdata[log Dθd(x)] + Epz(z)[log(1 − Dθd(Gθg(z)))].

- A non-saturating heuristic objective for the generator is

  J(Gθg) = −Epz(z)[log Dθd(Gθg(z))].

The gradient comparison sketched below shows the difference.
https://arxiv.org/abs/1701.00160
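A small sketch of why the heuristic helps, using illustrative values of the discriminator score d = Dθd(Gθg(z)) (not from the talk):

```python
import torch

# Compare gradients of the two generator losses w.r.t. d = D(G(z)).
for d0 in (0.01, 0.5, 0.99):
    d = torch.tensor(d0, requires_grad=True)
    (g_sat,) = torch.autograd.grad(torch.log(1 - d), d)   # minimax term
    d = torch.tensor(d0, requires_grad=True)
    (g_ns,) = torch.autograd.grad(-torch.log(d), d)       # heuristic term
    print(f"d={d0}: saturating grad={g_sat.item():.2f}, "
          f"non-saturating grad={g_ns.item():.2f}")
# When the discriminator confidently rejects a sample (d near 0), the
# saturating loss has gradient near -1 while the heuristic has gradient
# near -100: learning signal survives exactly where it is needed.
```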
Problems: Addressing Vanishing Gradients
Solutions:
- Change Objectives: use the non-saturating heuristic objective, maximum-likelihood cost, etc.
- Limit Discriminator: restrict the capacity of the
discriminator.
- Schedule Learning: try to balance training Dθd and Gθg .
Problems: Non-Convergence
Simultaneous gradient descent is not guaranteed to converge for minimax objectives.
- Goodfellow et al. only showed convergence when updates are made in function space [2].
- The parameterization of Dθd and Gθg results in a highly non-convex objective.
- In practice, training tends to oscillate: updates "undo" each other.
Problems: Addressing Non-Convergence
Solutions: Lots and lots of hacks!
https://github.com/soumith/ganhacks
Problems: Mode Collapse and Mode Dropping
One Explanation: SGD may optimize the max-min objective

  max_θd min_θg Epdata[log Dθd(x)] + Epz(z)[log(1 − Dθd(Gθg(z)))]

Intuition: the generator maps all z values to the single x̂ that is most likely to fool the discriminator.
https://arxiv.org/abs/1701.00160
A Possible Solution
A Possible Solution: Alternative Divergences
There is a large variety of divergence measures for distributions:

- f-Divergences (e.g. Jensen-Shannon, Kullback-Leibler):

  Df(P ∥ Q) = ∫_χ q(x) f(p(x)/q(x)) dx

  Used by GANs [2], f-GANs [7], and more.

- Integral Probability Metrics (e.g. earth mover's distance, maximum mean discrepancy):

  γF(P, Q) = sup_{f ∈ F} |∫ f dP − ∫ f dQ|

  Used by Wasserstein GANs [1], Fisher GANs [6], Sobolev GANs [5], and more.
A Possible Solution: Wasserstein GANs
Wasserstein GANs: Strong theory and excellent empirical results.
- “In no experiment did we see evidence of mode collapse for
the WGAN algorithm.” [1]
https://arxiv.org/abs/1701.07875
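A hedged sketch of the WGAN critic update with weight clipping, following the recipe in [1]; the networks, data, and constants here are illustrative stand-ins.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

x_real, z = 0.5 * torch.randn(64, 2) + 2.0, torch.randn(64, 8)
# The critic ascends E[f(x_real)] - E[f(G(z))], which estimates the
# Wasserstein-1 distance when f is (approximately) 1-Lipschitz.
loss_c = -(critic(x_real).mean() - critic(G(z).detach()).mean())
opt_c.zero_grad(); loss_c.backward(); opt_c.step()
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)  # weight clipping enforces the Lipschitz constraint
```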
Summary
Summary
Recap:
- GANs are a class of density-free generative models with
(mostly) unrestricted generator functions.
- Introducing adversarial discriminator networks allows GANs to
learn by minimizing the Jensen-Shannon divergence.
- Concurrently learning the generator and discriminator is
challenging due to
- Vanishing Gradients,
- Non-convergence due to oscillation,
- Mode collapse and mode dropping.
- A variety of alternative objective functions are being proposed.
Acknowledgements and References
There are lots of excellent references on GANs:
- Sebastian Nowozin’s presentation at MLSS 2018.
- NIPS 2016 tutorial on GANs by Ian Goodfellow.
- A nice explanation of Wasserstein GANs by Alex Irpan.