SLIDE 1

Generative Adversarial Networks

Aaron Mishkin

UBC MLRG 2018W2

SLIDE 2

Generative Adversarial Networks

“Two imaginary celebrities that were dreamed up by a random number generator.”

https://research.nvidia.com/publication/2017-10 Progressive-Growing-of

SLIDE 3

Why care about GANs?

Why spend your limited time learning about GANs:

  • GANs are achieving state-of-the-art results in a large variety of image generation tasks.
  • There’s been a veritable explosion in GAN publications over the last few years – many people are very excited!
  • GANs are stimulating new theoretical interest in min-max optimization problems and “smooth games”.

SLIDE 4

Why care about GANs: Hyper-realistic Image Generation

StyleGAN: image generation with hierarchical style transfer [3].

https://arxiv.org/abs/1812.04948

SLIDE 5

Why care about GANs: Conditionally Generative Models

Conditional GANs: high-resolution image synthesis via semantic labeling [8]. (Figure panels: input segmentation map, output synthesized image.)

https://research.nvidia.com/publication/2017-12 High-Resolution-Image-Synthesis

SLIDE 6

Why care about GANs: Image Super Resolution

SRGAN: photo-realistic super-resolution [4]. (Figure panels: bicubic interpolation, SRGAN, original image.)

https://arxiv.org/abs/1609.04802

SLIDE 7

Why care about GANs: Publications

Approximately 500 GAN papers as of September 2018!

See https://github.com/hindupuravinash/the-gan-zoo for the exhaustive list of papers. Image Credit: https://github.com/bgavran.

SLIDE 8

Generative Models

SLIDE 9

Generative Modeling

Generative Models estimate the probabilistic process that generated a set of observations D.

  • D = {(xi, yi)}_{i=1}^n: supervised generative models learn the joint distribution p(xi, yi), often to compute p(yi | xi).
  • D = {xi}_{i=1}^n: unsupervised generative models learn the distribution of D for clustering, sampling, etc. We can:
  • directly estimate p(xi), or
  • introduce latent variables yi and estimate p(xi, yi).

SLIDE 10

Generative Modeling: Unsupervised Parametric Approaches

  • Direct Estimation: Choose a parameterized family p(x | θ) and learn θ by maximizing the log-likelihood:

        θ∗ = arg max_θ Σ_{i=1}^n log p(xi | θ).

  • Latent Variable Models: Define a joint distribution p(x, z | θ) and learn θ by maximizing the log-marginal likelihood:

        θ∗ = arg max_θ Σ_{i=1}^n log ∫ p(xi, zi | θ) dzi.

Both approaches require that p(x | θ) is easy to evaluate.
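For concreteness, here is a minimal NumPy sketch (not from the slides) of the direct-estimation route: fitting a univariate Gaussian p(x | θ) by gradient ascent on the log-likelihood. The data, parameterization, and step size are assumptions made for this example.

```python
import numpy as np

# Illustrative sketch: direct maximum-likelihood estimation of a Gaussian
# p(x | theta) with theta = (mu, log_sigma), by gradient ascent on the
# average log-likelihood. Data and hyperparameters are assumptions.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)      # observations D = {x_i}

mu, log_sigma = 0.0, 0.0
lr = 0.05
for step in range(5000):
    sigma = np.exp(log_sigma)
    grad_mu = np.mean(data - mu) / sigma**2           # d/dmu of mean log-likelihood
    grad_log_sigma = np.mean((data - mu) ** 2) / sigma**2 - 1.0
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))                          # approaches (2.0, 0.5)
```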

SLIDE 11

Generative Modeling: Models for (Very) Complex Data

How can we learn such models for very complex data?

https://www.researchgate.net/figure/Heterogeneousness-and-diversity-of-the-CIFAR-10-entries-in-their-10-

SLIDE 12

Generative Modeling: Normalizing Flows and VAEs

Design parameterized densities with huge capacity!

  • Normalizing flows: a sequence of non-linear transformations of a simple distribution pz(z):

        p(x | θ0:k) = pz(z), where z = f_{θk}^{-1} ◦ · · · ◦ f_{θ1}^{-1} ◦ f_{θ0}^{-1}(x).

    Each f_{θj}^{-1} must be invertible with a tractable log-det. Jacobian.

  • VAEs: latent-variable models where inference networks specify the parameters: p(x, y | θ) = p(x | fθ(y)) py(y). The marginal likelihood is maximized via the ELBO.
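To make the flow idea concrete, here is a minimal sketch (not from the slides) using a single invertible affine transformation, so the inverse and its log-det Jacobian are available in closed form; the parameters and the numerical check are assumptions for the example.

```python
import numpy as np

# One-layer "flow": x = f(z) = a * z + b with a > 0 and base density
# p_z = N(0, 1). Then p(x) = p_z(f^{-1}(x)) * |d f^{-1}/dx|, and the
# log-det Jacobian of the inverse is simply -log(a).
a, b = 2.0, 1.0                          # illustrative flow parameters

def log_base_density(z):
    return -0.5 * (z ** 2 + np.log(2 * np.pi))

def flow_log_density(x):
    z = (x - b) / a                      # invert the transformation
    return log_base_density(z) - np.log(a)

# Numerical check that the transformed density integrates to ~1.
xs = np.linspace(-10.0, 12.0, 20001)
dx = xs[1] - xs[0]
print(np.sum(np.exp(flow_log_density(xs))) * dx)   # ~1.0
```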

SLIDE 13

GANs

SLIDE 14

GANs: Density-Free Models

Generative Adversarial Networks (GANs) instead use an unrestricted generator Gθg(z) such that p(x | θg) = pz({z}), where {z} = Gθg^{-1}(x).

  • Problem: the inverse image of Gθg(z) may be huge!
  • Problem: it’s likely intractable to preserve volume through G(z; θg).

So, we can’t evaluate p(x | θg) and we can’t learn θg by maximum likelihood.

SLIDE 15

GANs: Discriminators

GANs learn by comparing model samples with examples from D.

  • Sampling from the generator is easy: x̂ = Gθg(ẑ), where ẑ ∼ pz(z).
  • Given a sample x̂, a discriminator tries to distinguish it from true examples: D(x) = Pr(x ∼ pdata).
  • The discriminator “supervises” the generator network.

SLIDE 16

GANs: Generator + Discriminator

https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016

SLIDE 17

GANs: Goodfellow et al. (2014)

  • Let z ∈ R^m and pz(z) be a simple base distribution.
  • The generator Gθg(z) : R^m → D̃ is a deep neural network.
  • D̃ is the manifold of generated examples.
  • The discriminator Dθd(x) : D ∪ D̃ → (0, 1) is also a deep neural network.

https://arxiv.org/abs/1511.06434
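For concreteness, a minimal PyTorch sketch of networks with these shapes (my illustration, not the architectures from [2] or the DCGAN paper): the generator maps z ∈ R^m into data space and the discriminator maps a data point to a probability in (0, 1). Dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

m, d = 64, 784   # latent and (flattened) data dimensions, chosen for illustration

# Generator G_thetag : R^m -> data space.
G = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, d), nn.Tanh())

# Discriminator D_thetad : data space -> (0, 1).
D = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, m)      # z ~ p_z(z), a simple base distribution
x_fake = G(z)               # generated samples
print(D(x_fake).shape)      # probability that each sample came from p_data
```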

SLIDE 18

GANs: Saddle-Point Optimization

Saddle-Point Optimization: learn Gθg(z) and Dθd(x) jointly via the objective V(θd, θg):

    min_{θg} max_{θd}  Epdata[log Dθd(x)] + Epz(z)[log(1 − Dθd(Gθg(z)))]

The first term is the likelihood of true data; the second is the likelihood of generated data.
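As a small illustration (not from the slides) of how this objective is evaluated on a minibatch: the discriminator ascends V while the generator descends it. The stand-in probabilities below replace D's actual outputs just to keep the example tiny.

```python
import torch

# Stand-in discriminator outputs on one minibatch (assumed values).
d_real = torch.tensor([0.9, 0.8, 0.7])   # D_thetad(x) for x ~ p_data
d_fake = torch.tensor([0.2, 0.3, 0.1])   # D_thetad(G_thetag(z)) for z ~ p_z

# V(theta_d, theta_g) estimated on the minibatch.
V = torch.log(d_real).mean() + torch.log(1.0 - d_fake).mean()

discriminator_loss = -V                            # the discriminator maximizes V
generator_loss = torch.log(1.0 - d_fake).mean()    # the minimax generator minimizes this term
print(V.item(), discriminator_loss.item(), generator_loss.item())
```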

SLIDE 19

GANs: Optimal Discriminators

Claim: Given Gθg defining an implicit distribution pg = p(x | θg), the optimal discriminator is

    D∗(x) = pdata(x) / (pdata(x) + pg(x)).

Proof Sketch:

    V(θd, θg) = ∫_D pdata(x) log D(x) dx + ∫_D̃ pz(z) log(1 − D(Gθg(z))) dz
              = ∫_{D ∪ D̃} [ pdata(x) log D(x) + pg(x) log(1 − D(x)) ] dx.

Maximizing the integrand for all x is sufficient and gives the result (see bonus slides).

Previous Slide: https://commons.wikimedia.org/wiki/File:Saddle_point.svg

SLIDE 20

GANs: Jensen-Shannon Divergence and Optimal Generators

Given an optimal discriminator D∗(x), the generator objective is

    C(θg) = Epdata[log D∗(x)] + Epg(x)[log(1 − D∗(x))]
          = Epdata[log (pdata(x) / (pdata(x) + pg(x)))] + Epg(x)[log (pg(x) / (pdata(x) + pg(x)))]
          ∝ (1/2) KL(pdata || (pdata + pg)/2) + (1/2) KL(pg || (pdata + pg)/2),

which is the Jensen-Shannon divergence between pdata and pg.

C(θg) achieves its global minimum at pg = pdata given an optimal discriminator!

SLIDE 21

GANs: Learning Generators and Discriminators

Putting these results to use in practice:

  • High-capacity discriminators Dθd approximate the Jensen-Shannon divergence when close to global maximum.
  • Dθd is a “differentiable program”.
  • We can use Dθd to learn Gθg with our favourite gradient descent method.

https://arxiv.org/abs/1511.06434

SLIDE 22

GANs: Training Procedure

for i = 1 . . . N do
    for k = 1 . . . K do
        Sample m noise samples {z1, . . . , zm} ∼ pz(z).
        Sample m examples {x1, . . . , xm} from pdata(x).
        Update the discriminator Dθd by ascending its gradient:
            θd = θd + αd ∇θd (1/m) Σ_{i=1}^m [ log D(xi) + log(1 − D(G(zi))) ].
    end for
    Sample m noise samples {z1, . . . , zm} ∼ pz(z).
    Update the generator Gθg by descending its gradient:
        θg = θg − αg ∇θg (1/m) Σ_{i=1}^m log(1 − D(G(zi))).
end for
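A minimal PyTorch sketch of this training loop (my illustration, not the code from [2]): the toy data, the tiny MLPs, and the hyperparameters are all assumptions, and the two updates mirror the discriminator-ascent and generator-descent steps above.

```python
import torch
import torch.nn as nn

# Illustrative GAN training loop on toy 2-D data; all choices are assumptions.
m_latent, d_data, batch = 8, 2, 64
G = nn.Sequential(nn.Linear(m_latent, 32), nn.ReLU(), nn.Linear(32, d_data))
D = nn.Sequential(nn.Linear(d_data, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.05)   # alpha_d
opt_g = torch.optim.SGD(G.parameters(), lr=0.05)   # alpha_g
eps = 1e-7                                         # avoids log(0) if D saturates

def sample_data(n):
    return torch.randn(n, d_data) * 0.5 + 2.0      # stand-in for p_data

for i in range(1000):                              # i = 1 ... N
    for k in range(1):                             # k = 1 ... K discriminator steps
        z = torch.randn(batch, m_latent)           # z ~ p_z
        x = sample_data(batch)                     # x ~ p_data
        # Ascend log D(x) + log(1 - D(G(z))) by descending its negative.
        loss_d = -(torch.log(D(x) + eps).mean()
                   + torch.log(1 - D(G(z).detach()) + eps).mean())
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
    z = torch.randn(batch, m_latent)
    # Generator descends log(1 - D(G(z))), the minimax objective above.
    loss_g = torch.log(1 - D(G(z)) + eps).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
```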

SLIDE 23

Problems (c. 2016)

SLIDE 24

Problems with GANs

  • Vanishing gradients: the discriminator becomes “too good” and the generator gradient vanishes.
  • Non-Convergence: the generator and discriminator oscillate without reaching an equilibrium.
  • Mode Collapse: the generator distribution collapses to a small set of examples.
  • Mode Dropping: the generator distribution doesn’t fully cover the data distribution.

SLIDE 25

Problems: Vanishing Gradients

  • The minimax objective saturates when Dθd is close to perfect:

        V(θd, θg) = Epdata[log Dθd(x)] + Epz(z)[log(1 − Dθd(Gθg(z)))].

  • A non-saturating heuristic objective for the generator is

        J(Gθg) = −Epz(z)[log Dθd(Gθg(z))].

https://arxiv.org/abs/1701.00160
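A small sketch (my illustration, not from the slides or the tutorial linked above) of why the heuristic helps: when the discriminator confidently rejects a generated sample, the saturating loss log(1 − D(G(z))) has an almost-zero gradient with respect to the discriminator's logit, while −log D(G(z)) does not. The logit value is an assumption.

```python
import torch

# D(G(z)) = sigmoid(logit); a confident discriminator gives a very negative logit.
logit = torch.tensor(-6.0, requires_grad=True)    # D(G(z)) ~ 0.0025

d_fake = torch.sigmoid(logit)
saturating = torch.log(1 - d_fake)                # minimax generator objective
non_saturating = -torch.log(d_fake)               # heuristic generator objective

g_sat, = torch.autograd.grad(saturating, logit, retain_graph=True)
g_heur, = torch.autograd.grad(non_saturating, logit)
print(g_sat.item(), g_heur.item())                # ~ -0.0025 vs ~ -0.9975
```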

SLIDE 26

Problems: Addressing Vanishing Gradients

Solutions:

  • Change Objectives: use the non-saturating heuristic objective, maximum-likelihood cost, etc.
  • Limit Discriminator: restrict the capacity of the discriminator.
  • Schedule Learning: try to balance training Dθd and Gθg.

SLIDE 27

Problems: Non-Convergence

Simultaneous gradient descent is not guaranteed to converge for minimax objectives.

  • Goodfellow et al. only showed convergence when updates are made in the function space [2].
  • The parameterization of Dθd and Gθg results in a highly non-convex objective.
  • In practice, training tends to oscillate – updates “undo” each other.

SLIDE 28

Problems: Addressing Non-Convergence

Solutions: Lots and lots of hacks!

https://github.com/soumith/ganhacks

SLIDE 29

Problems: Mode Collapse and Mode Dropping

One Explanation: SGD may optimize the max-min objective

    max_{θd} min_{θg}  Epdata[log Dθd(x)] + Epz(z)[log(1 − Dθd(Gθg(z)))].

Intuition: the generator maps all z values to the x̂ that is most likely to fool the discriminator.

https://arxiv.org/abs/1701.00160

SLIDE 30

A Possible Solution

SLIDE 31

A Possible Solution: Alternative Divergences

There is a large variety of divergence measures for distributions:

  • f-Divergences (e.g. Jensen-Shannon, Kullback-Leibler):

        Df(P || Q) = ∫_χ q(x) f(p(x) / q(x)) dx.

    GANs [2], f-GANs [7], and more.

  • Integral Probability Metrics (e.g. Earth Mover’s Distance, Maximum Mean Discrepancy):

        γF(P || Q) = sup_{f ∈ F} | ∫ f dP − ∫ f dQ |.

    Wasserstein GANs [1], Fisher GANs [6], Sobolev GANs [5], and more.

SLIDE 32

A Possible Solution: Wasserstein GANs

Wasserstein GANs: Strong theory and excellent empirical results.

  • “In no experiment did we see evidence of mode collapse for the WGAN algorithm.” [1]

https://arxiv.org/abs/1701.07875
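For intuition, a minimal sketch (my illustration, not the reference implementation of [1]) of one critic update in the weight-clipping variant of WGAN: the critic has no sigmoid output, it maximizes Epdata[f(x)] − Epg[f(G(z))], and its weights are clipped as a crude Lipschitz constraint. Architectures and constants are assumptions.

```python
import torch
import torch.nn as nn

# One WGAN critic update (weight-clipping variant); all choices are illustrative.
d_data, m_latent, batch = 2, 8, 64
critic = nn.Sequential(nn.Linear(d_data, 32), nn.ReLU(), nn.Linear(32, 1))  # no sigmoid
G = nn.Sequential(nn.Linear(m_latent, 32), nn.ReLU(), nn.Linear(32, d_data))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

x_real = torch.randn(batch, d_data) * 0.5 + 2.0       # stand-in for p_data
x_fake = G(torch.randn(batch, m_latent)).detach()     # generator samples

# Critic maximizes E[f(x_real)] - E[f(x_fake)], so we descend the negative.
loss_c = -(critic(x_real).mean() - critic(x_fake).mean())
opt_c.zero_grad()
loss_c.backward()
opt_c.step()
for p in critic.parameters():                         # crude Lipschitz constraint
    p.data.clamp_(-0.01, 0.01)
```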

SLIDE 33

Summary

SLIDE 34

Summary

Recap:

  • GANs are a class of density-free generative models with (mostly) unrestricted generator functions.
  • Introducing adversarial discriminator networks allows GANs to learn by minimizing the Jensen-Shannon divergence.
  • Concurrently learning the generator and discriminator is challenging due to:
  • Vanishing gradients,
  • Non-convergence due to oscillation,
  • Mode collapse and mode dropping.
  • A variety of alternative objective functions are being proposed.

SLIDE 35

Acknowledgements and References

There are lots of excellent references on GANs:

  • Sebastian Nowozin’s presentation at MLSS 2018.
  • NIPS 2016 tutorial on GANs by Ian Goodfellow.
  • A nice explanation of Wasserstein GANs by Alex Irpan.

SLIDE 36

Bonus: Optimal Discriminators Cont.

The integrand h(D(x)) = pdata(x) log D(x) + pg(x) log(1 − D(x)) is concave for D(x) ∈ (0, 1). We take the derivative and compute a stationary point in the domain:

    ∂h(D(x)) / ∂D(x) = pdata(x)/D(x) − pg(x)/(1 − D(x)) = 0  ⇒  D(x) = pdata(x) / (pdata(x) + pg(x)).

Since h is concave, this maximizes the integrand over the domain of the discriminator, completing the proof.

SLIDE 37

References i

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.

[3] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.

SLIDE 38

References ii

[4] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.

[5] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. arXiv preprint arXiv:1711.04894, 2017.

[6] Youssef Mroueh and Tom Sercu. Fisher GAN. In Advances in Neural Information Processing Systems, pages 2513–2523, 2017.

SLIDE 39

References iii

[7] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[8] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
