
SLIDE 1

Wasserstein GAN

Martin Arjovsky, Soumith Chintala, Léon Bottou, ICML 2017

Presented by Yaochen Xie, 12-22-2017

SLIDE 2

Contents

❖ GAN and its applications [1]
❖ GAN vs. Variational Auto-Encoder [2]
❖ What’s wrong with GAN [3], [4]
❖ JS Divergence and KL Divergence [3], [4]
❖ Wasserstein Distance [4], [5]
❖ WGAN and its Implementation [4]

SLIDE 3

Take A Look Back at GAN

D and G play the following two-player minimax game with the value function V(D, G):
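For reference, the value function from [1]:

    \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]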

SLIDE 4

Applications of GAN - Image Translation

Examples: Conditional GAN, Triangle GAN

SLIDE 5

Applications of GAN - Super-Resolution

SLIDE 6

Applications of GAN - Image Inpainting

(Figure: inpainting results compared across Input, TV, LR, GAN, and the real image)

SLIDE 7

GAN vs. VAE - AutoEncoder

SLIDE 8

GAN vs. VAE - Variational AutoEncoder

  • Add a constraint on the encoding network that forces it to generate latent vectors that roughly follow a unit Gaussian distribution.

  • Generative loss: mean squared error.

  • Latent loss: KL divergence.
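Combining the two gives the usual VAE objective (a standard formulation assuming a diagonal-Gaussian encoder; the relative weighting of the two terms is an implementation choice not stated on the slide):

    \mathcal{L}_{\text{VAE}} = \mathbb{E}_{q(z \mid x)}\big[\|x - \hat{x}\|^2\big] + D_{\text{KL}}\big(q(z \mid x)\,\|\,\mathcal{N}(0, I)\big),
    \qquad
    D_{\text{KL}} = \tfrac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)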
SLIDE 9

GAN vs. VAE

  • VAE - explicit, uses MSE to judge generation quality.

  • GAN - implicit, uses a discriminator to judge generation quality.

SLIDE 10

Drawbacks of GAN

  • Gradient vanishing
  • Unstable, not converging
  • Mode collapse

SLIDE 11

Kullback–Leibler divergence (Relative Entropy)

Discrete Distributions:

Continuous Distributions:

A measure of how much one distribution differs from another. Notice that KL(P||Q) is not equal to KL(Q||P). Rigorously, the KL divergence cannot be considered a distance, since it is not symmetric.
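For reference, the standard definitions:

    \text{Discrete: } D_{\text{KL}}(P\,\|\,Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}
    \qquad
    \text{Continuous: } D_{\text{KL}}(P\,\|\,Q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx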

SLIDE 12

Jensen-Shannon divergence

A symmetrized and smoothed version of the KL divergence. When two distributions are far from each other (for example, have disjoint supports), the JS divergence saturates at a constant and provides no useful gradient.
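For reference, the standard definition, where M is the equal mixture of the two distributions:

    \text{JS}(P\,\|\,Q) = \tfrac{1}{2}D_{\text{KL}}(P\,\|\,M) + \tfrac{1}{2}D_{\text{KL}}(Q\,\|\,M), \qquad M = \tfrac{1}{2}(P + Q)

When P and Q have disjoint supports, JS(P || Q) equals the constant log 2.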

SLIDE 13

Where is the loss from?

Cross entropy: the GAN loss is based on cross entropy. What if p and q belong to continuous distributions?

The sums over discrete outcomes become expectations.
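Concretely (standard forms): the cross entropy of q relative to p, and the discriminator loss of [1] written with expectations:

    H(p, q) = -\sum_x p(x)\log q(x) = -\mathbb{E}_{x\sim p}[\log q(x)]

    L_D = -\mathbb{E}_{x\sim P_r}[\log D(x)] - \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]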

SLIDE 14

What’s going wrong?

Now we fix G and let D be optimal:

=> the loss becomes 2 times the Jensen-Shannon divergence (minus a constant).

Up to this point, optimizing the loss is equivalent to minimizing the JS divergence between Pr and Pg. But when Pr and Pg have disjoint supports, the JS divergence is the constant log 2 and yields no gradient for the generator.

Gradient Vanishing!
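The derivation behind this slide, from [1] and [3]:

    D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)},
    \qquad
    V(G, D^*) = 2\,\text{JS}(P_r\,\|\,P_g) - 2\log 2

When P_r and P_g have disjoint supports, JS(P_r || P_g) = log 2, so V(G, D^*) is constant and its gradient with respect to G is zero.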

SLIDE 15

What’s going wrong?

When G is fixed and D is optimal, the alternative generator loss (−log D) is equivalent to minimizing KL(Pg || Pr) − 2·JS(Pr || Pg), which causes Mode Collapse and Unstable training.

The asymmetry of KL(Pg || Pr) is the culprit: KL —> ∞ when the generator produces samples where Pr ≈ 0 (implausible samples are punished heavily), but KL —> 0 when the generator ignores regions where Pr > 0 (dropped modes are barely punished), so G prefers to collapse onto a few safe modes.
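The identity behind this slide, from [3], for the alternative generator loss −log D with D* the optimal discriminator:

    \mathbb{E}_{x\sim P_g}[-\log D^*(x)] = D_{\text{KL}}(P_g\,\|\,P_r) - 2\,\text{JS}(P_r\,\|\,P_g) + 2\log 2 + \mathbb{E}_{x\sim P_r}[\log D^*(x)]

The last two terms do not depend on G, so minimizing this loss minimizes KL(Pg || Pr) − 2·JS(Pr || Pg): a reversed KL that barely penalizes missing modes, plus a term that actively pushes the two distributions apart.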

SLIDE 16

What’s going wrong?

SLIDE 17

We need a weaker distance

The most fundamental difference between such distances is their impact on the convergence of sequences of probability distributions.

WHY?

A sequence of distributions Pn converges to P under a distance ρ iff ρ(Pn, P) —> 0.

A weaker distance makes it easier to converge.

  • 1. We wish the loss θ ↦ ρ(Pr, Pθ) to be continuous in θ.

SLIDE 18

We need a weaker distance

Continuity means that when a sequence of parameters θt converges to θ, the distributions Pθt also converge to Pθ.

WHY?

  • 2. We wish the distance between distributions to be as weak as possible. The weaker this distance, the easier it is to define a continuous mapping from θ-space to Pθ-space, since it is easier for the distributions to converge. Then the loss ρ(Pr, Pθ) has a better chance of being continuous in θ.

SLIDE 19

Wasserstein (Earth-Mover) Distance!

“If each distribution is viewed as a unit amount of "dirt" piled on M, the metric is the minimum "cost" of turning one pile into the other, which is assumed to be the amount of dirt that needs to be moved times the distance it has to be moved.”

SLIDE 20

Wasserstein Distance!

  • KL divergence and JS divergence are too strong for the loss function to be continuous.

  • The Wasserstein distance is a weaker measurement of distance such that:

  • 1. W(Pr, Pθ) is continuous in θ if the generator gθ is continuous in θ.

  • 2. W(Pr, Pθ) is continuous everywhere and differentiable almost everywhere if gθ is locally Lipschitz with finite expectation of the local Lipschitz constants.

SLIDE 21

Wasserstein Distance!

SLIDE 22

Optimal Transportation View of GAN

Brenier potential

SLIDE 23

Convex Geometry

  • Minkowski theorem
  • Alexandrov theorem
  • Geometric interpretation of the optimal transport map

SLIDE 24

Wasserstein distance in WGAN

Kantorovich-Rubinstein Duality: when μ and ν have bounded support,

W(μ, ν) = sup over f with Lip(f) ≤ 1 of E_{x∼μ}[f(x)] − E_{y∼ν}[f(y)],

where Lip(f) denotes the minimal Lipschitz constant for f.

https://vincentherrmann.github.io/blog/wasserstein/

SLIDE 25

Implementation

Compared with the original GAN, WGAN makes four changes:

  • Discriminator (with sigmoid activation) —> Critic (without sigmoid).
  • Do not take the logarithm in the critic and generator losses.
  • Truncate (clip) the parameters of the Critic to a fixed range after each update.
  • Do not use momentum-based optimizers when doing gradient descent (use RMSProp instead).

L_D = E_{x∼Pg}[f(x)] − E_{x∼Pr}[f(x)],   L_G = −E_{x∼Pg}[f(x)]
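A minimal PyTorch-style sketch of one WGAN training step, illustrating the four changes (a sketch only: the network shapes, the clipping constant c = 0.01, and the learning rate 5e-5 are common defaults from [4], not values given on the slide):

import torch
import torch.nn as nn

# Critic without a sigmoid output (change 1): it returns an unbounded score f(x).
critic = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784))

# Change 4: RMSProp instead of a momentum-based optimizer such as Adam.
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)

c = 0.01  # clipping constant for change 3

def critic_step(real):
    # Change 2: no logarithm; minimize L_D = E_Pg[f(x)] - E_Pr[f(x)].
    z = torch.randn(real.size(0), 100)
    fake = generator(z).detach()
    loss_d = critic(fake).mean() - critic(real).mean()
    opt_c.zero_grad()
    loss_d.backward()
    opt_c.step()
    # Change 3: truncate (clip) every critic parameter to [-c, c].
    with torch.no_grad():
        for p in critic.parameters():
            p.clamp_(-c, c)
    return loss_d.item()

def generator_step(batch_size):
    # Minimize L_G = -E_Pg[f(x)] (again, no logarithm).
    z = torch.randn(batch_size, 100)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item()

In [4] the critic is updated several times (n_critic = 5) for every generator update; that outer loop is omitted here.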

SLIDE 26

Experiments

SLIDE 27

References

[1] Ian J. Goodfellow et al. Generative Adversarial Nets.
[2] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes.
[3] Martin Arjovsky and Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks.
[4] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN.
[5] Na Lei, Kehua Su, Li Cui, Shing-Tung Yau, and David Xianfeng Gu. A Geometric View of Optimal Transportation and Generative Model.