Yale
Intro Tutorial on GANs
Michela Paganini
March 21, 2018
Fermilab Machine Learning Group Meeting
Outline
Overview: generative modeling; Generative Adversarial Networks (GANs)
Hands-on: build a vanilla GAN on FashionMNIST
GAN improvements (f-GAN, WGAN, WGAN-GP)
Hands-on: build a WGAN on FashionMNIST
Task: given a training dataset, generate more samples from the same data distribution.
Why do we care?
Taxonomy of generative models (from I. Goodfellow), all based on maximum likelihood:
Explicit, tractable density: fully visible belief nets (NADE, MADE, PixelRNN); change-of-variables models (nonlinear ICA)
Explicit, approximate density: variational (VAE); Markov chain (Boltzmann machine)
Implicit density, Markov chain: GSN
Implicit density, direct: GAN
2-player game between generator and discriminator.
The latent prior, mapped to sample space by the generator, implicitly defines a distribution.
The discriminator tells how fake or real a sample looks via a score.
Discriminator: distinguish real samples from fake samples.
Generator: transform noise into a realistic sample.
Construct a two-person zero-sum minimax game with value
V(G, D) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))],
solved as min_G max_D V(G, D): an inner maximization by D and an outer minimization by G.
With a perfect discriminator, the generator minimizes the Jensen-Shannon divergence between p_data and p_g, up to constants (see the derivation in the backup slide at the end).
Training alternates gradient updates: ascent on D (to maximize V) and descent on G (to minimize it).
Heuristic (non-saturating) generator loss: maximize E_z[log D(G(z))] instead of minimizing E_z[log(1 - D(G(z)))], which gives the generator stronger gradients early in training.
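A minimal sketch of the alternating updates with the non-saturating loss, in the spirit of the "vanilla GAN on FashionMNIST" hands-on. The PyTorch framing, architectures, and hyperparameters are illustrative assumptions, not necessarily those of the original notebook.

```python
# Minimal vanilla GAN training loop (non-saturating loss) on FashionMNIST.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

latent_dim = 100
device = "cuda" if torch.cuda.is_available() else "cpu"

G = nn.Sequential(  # generator: noise -> flattened 28x28 image in [-1, 1]
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),
).to(device)

D = nn.Sequential(  # discriminator: image -> probability of being real
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
).to(device)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

loader = DataLoader(
    datasets.FashionMNIST("data", train=True, download=True,
                          transform=transforms.Compose([
                              transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                          ])),
    batch_size=64, shuffle=True)

for epoch in range(5):
    for real, _ in loader:
        real = real.view(real.size(0), -1).to(device)
        b = real.size(0)
        ones = torch.ones(b, 1, device=device)
        zeros = torch.zeros(b, 1, device=device)

        # Discriminator step: ascend on log D(x) + log(1 - D(G(z)))
        z = torch.randn(b, latent_dim, device=device)
        fake = G(z).detach()
        loss_D = bce(D(real), ones) + bce(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator step (non-saturating): maximize log D(G(z))
        z = torch.randn(b, latent_dim, device=device)
        loss_G = bce(D(G(z)), ones)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```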
Divergence: a function that establishes the "distance" of one probability distribution to another on a statistical manifold.
f-divergences: D_f(P || Q) = ∫ q(x) f(p(x)/q(x)) dx, for a convex f with f(1) = 0.
Convex conjugate: f*(t) = sup_u { u·t - f(u) }, which gives the variational lower bound
D_f(P || Q) ≥ sup_T ( E_{x~P}[T(x)] - E_{x~Q}[f*(T(x))] ).
We have no access to the distributions in functional form; use empirical expectations over samples instead.
This extends the GAN formalism (f-GAN): any f-divergence can be used as the GAN objective, with the discriminator playing the role of the variational function T.
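A small sketch of one such objective, using the variational bound above for the (forward) KL divergence, where f(u) = u log u and f*(t) = exp(t - 1). The function name and PyTorch framing are assumptions for illustration.

```python
# f-GAN-style estimate of a lower bound on KL(P_data || P_model).
# The critic maximizes this quantity; the generator works against it.
import torch

def fgan_kl_objective(critic, real, fake):
    t_real = critic(real)   # T(x) on real samples
    t_fake = critic(fake)   # T(x) on generated samples
    # E_P[T(x)] - E_Q[f*(T(x))], with f*(t) = exp(t - 1) for forward KL
    return t_real.mean() - torch.exp(t_fake - 1).mean()
```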
If at any point during training the supports of the data and model distributions have no overlap, this family of divergences collapses to a constant or infinite value, providing no useful gradient to the generator.
Practical stabilization tricks: instance noise, label flipping, label smoothing (see the sketch below).
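Two of these tricks made concrete, as they might appear inside a discriminator update. The noise scale and smoothing value are illustrative assumptions, not prescriptions from the slides.

```python
import torch

def smoothed_targets(batch_size, real=True, device="cpu"):
    # One-sided label smoothing: real targets at 0.9 instead of 1.0,
    # so the discriminator never becomes overconfident on real data.
    return torch.full((batch_size, 1), 0.9 if real else 0.0, device=device)

def add_instance_noise(images, sigma=0.1):
    # Instance noise: add small Gaussian noise to both real and fake inputs
    # so the two distributions overlap and the divergence stays informative.
    return images + sigma * torch.randn_like(images)
```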
Other distance metrics: Integral Probability Metrics, Wasserstein distances, proper scoring rules.
E.g., replace the family of f-divergences with the Wasserstein-1 distance, often referred to as the Earth Mover's Distance (EMD).
Intuition: think of PDFs as mounds of dirt; EMD describes how much "work" it takes to transform one mound of dirt into another.
It accounts for both "mass" and distance, so it remains meaningful even for disjoint PDFs!
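A tiny numerical illustration of that intuition (not from the slides), using SciPy's 1D Wasserstein distance on empirical samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=10_000)   # samples from N(0, 1)
b = rng.normal(loc=3.0, scale=1.0, size=10_000)   # samples from N(3, 1)

# Even when the two "mounds of dirt" barely overlap, the distance stays
# finite and grows smoothly with their separation (here ~3.0).
print(wasserstein_distance(a, b))
```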
Excellent WGAN review!
The primal form of the Wasserstein-1 distance,
W(P_r, P_g) = inf_{γ ∈ Π(P_r, P_g)} E_{(x, y) ~ γ}[ ||x - y|| ],
is intractable: the infimum runs over an uncountably infinite set of candidate distribution pairings (couplings).
Kantorovich-Rubinstein duality to the rescue! What does this give us?
W(P_r, P_g) = sup_{||f||_L ≤ 1} E_{x ~ P_r}[f(x)] - E_{x ~ P_g}[f(x)],
i.e. a restriction of the problem to K-Lipschitz functions f.
Excellent blog post about WGAN and KR-duality from V. Herrmann!
We can constrain a neural network to be K-Lipschitz!
This lets us parameterize f(x) as a neural network and clamp its weights to a compact space, say [-c, c].
This guarantees that the network f is K-Lipschitz, with K = r(c) for some function r of the clipping threshold.
Great! The network can now estimate the Wasserstein-1 distance up to a constant factor; f is now a critic rather than a discriminator.
For a Lipschitz continuous function, there is a double cone whose vertex can be translated along the graph so that the graph always remains entirely outside the cone.
Training the critic:
For each batch of real samples, we want the output of f to be as big (+1.0) as possible.
For each batch of fake samples, we want the output of f to be as small (-1.0) as possible.
Training the generator:
We want f, evaluated on generated samples, to be as big (+1.0) as possible.
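A sketch of these updates, including the weight clipping described above. The PyTorch framing and hyperparameter values (clip threshold, number of critic steps) are illustrative assumptions in the spirit of the WGAN hands-on, not a transcription of it.

```python
import torch

def wgan_step(critic, generator, real, opt_C, opt_G,
              latent_dim=100, clip=0.01, n_critic=5):
    b = real.size(0)

    # Critic: maximize E[f(real)] - E[f(fake)], i.e. minimize the negative.
    for _ in range(n_critic):
        z = torch.randn(b, latent_dim, device=real.device)
        fake = generator(z).detach()
        loss_C = critic(fake).mean() - critic(real).mean()
        opt_C.zero_grad(); loss_C.backward(); opt_C.step()

        # Weight clipping keeps the critic (roughly) K-Lipschitz.
        for p in critic.parameters():
            p.data.clamp_(-clip, clip)

    # Generator: maximize E[f(fake)], i.e. minimize -E[f(fake)].
    z = torch.randn(b, latent_dim, device=real.device)
    loss_G = -critic(generator(z)).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_C.item(), loss_G.item()
```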
Excellent blog post about WGAN from A. Irpan!
The Wasserstein estimate is a reliable metric, so the critic can safely be trained to convergence.
Problems with weight clipping:
Restricting the network's weights to a compact space restricts its expressivity (capacity underutilization).
Gradients can explode or vanish depending on the clipping threshold.
Why not make the network 1-Lipschitz directly?
WGAN with Gradient Penalty (WGAN-GP)
TL;DR: penalize the critic for having a gradient norm too far from unity; this is a better way to encourage 1-Lipschitz critics.
For every real sample, build a fake sample and randomly linearly interpolate between the two, x̂ = ε·x_real + (1 - ε)·x_fake with ε ~ U[0, 1]; the penalty added to the critic loss is λ (||∇_x̂ f(x̂)||_2 - 1)^2.
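A sketch of the penalty term itself, assuming flattened inputs of shape (batch, features); the PyTorch framing and λ = 10 follow common practice rather than the slides verbatim.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    # Randomly interpolate between each real sample and its paired fake sample.
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)

    # Gradient of the critic output with respect to the interpolated inputs.
    out = critic(x_hat)
    grads = torch.autograd.grad(outputs=out, inputs=x_hat,
                                grad_outputs=torch.ones_like(out),
                                create_graph=True)[0]

    # Penalize deviation of the gradient norm from 1; add this to the critic loss.
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()
```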
You can find me at: michela.paganini@yale.edu
Questions?
From the original paper, we know that the optimal discriminator is
D*_G(x) = p_data(x) / (p_data(x) + p_g(x)).
Define the generator's objective when solving for the infinite-capacity discriminator, C(G) = max_D V(G, D). We can rewrite the value as
C(G) = E_{x ~ p_data}[log D*_G(x)] + E_{x ~ p_g}[log(1 - D*_G(x))].
Simplifying notation and applying some algebra,
C(G) = -log 4 + KL(p_data || (p_data + p_g)/2) + KL(p_g || (p_data + p_g)/2).
We recognize this as a summation of two KL divergences and can combine them into the Jensen-Shannon divergence:
C(G) = -log 4 + 2 · JSD(p_data || p_g).
This yields a unique global minimum, C(G) = -log 4, precisely when p_g = p_data.