Bregman and Wasserstein, with Applications to Generative Adversarial Networks (GANs) and beyond

Xin Guo

Joint work with Johnny Hong, Tianyi Lin and Nan Yang, University of California, Berkeley

IMS-FIPS 2018, September 10, King’s College, London

Problem Set-Up

Given the data X = (X1, . . . , Xn) i.i.d. ∼ Pr, where Xi ∈ Rd.

Pr is the true distribution over sample X.
Pr is unknown.
Pr could be complicated as d increases.

Goal: How to learn Pr from data?
Idea: Construct a sequence of parametric probability distributions Pθ to approximate Pr.

Pθ is a parametric distribution over X.
Pθ is structured!
Pθ is known.

Question 1: How to generate Pθ?
Question 2*: How to evaluate the quality of Pθ?

Roadmap

1. Bregman Divergence Function
2. Generative Adversarial Networks (GANs)
3. Wasserstein Divergence and GANs
4. Relaxed Wasserstein
   Moment Estimate, Concentration Inequality, and Duality
   Continuity, Differentiability
   Gradient Descent Scheme
5. Empirical Results
   Experiment Setup
   MNIST and Fashion-MNIST datasets
   CIFAR-10 and ImageNet datasets
6. Conclusions

A curious and simple math puzzle

Given a random variable X and a filtration G, find (all?) loss/divergence functions F(x, y) such that
  argmin_{Y ∈ G} E[F(X, Y)] = E[X | G].

Example: the L2 function: argmin_{Y ∈ G} E[(X − Y)²] = E[X | G].
Counter-example: the L1 function.
Is L2 the unique choice?
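Before answering, a quick numerical look at the example and the counter-example; this is a minimal NumPy sketch added here (not from the slides), minimizing over constants y so that E[X | G] reduces to E[X]:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # a skewed X, so mean != median

ys = np.linspace(0.0, 3.0, 601)                # candidate constants y
l2 = [np.mean((x - y) ** 2) for y in ys]       # E[(X - y)^2]
l1 = [np.mean(np.abs(x - y)) for y in ys]      # E[|X - y|]

print("argmin L2:", ys[np.argmin(l2)], " mean:",   x.mean())      # both ~ 1.0
print("argmin L1:", ys[np.argmin(l1)], " median:", np.median(x))  # both ~ log 2
```

The L2 minimizer sits at the mean, while the L1 minimizer sits at the median, which is why the L1 loss fails the puzzle.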

Answer: Bregman Loss Functions Dφ(x, y) (Banerjee, G. and Wang (2005))

Sufficient:
  argmin_{Y ∈ G} E[Dφ(X, Y)] = E[X | G].

Necessary: If for all X
  argmin_{y ∈ Rd} E[F(X, y)] = E[X],
then, with proper regularity conditions and up to an additive constant, F(x, y) = Dφ(x, y).

What is a Bregman Divergence Function?

BDF Dφ(x, y)

Let φ : Rd → R be a strictly convex, differentiable function.
Then Dφ : Rd × Rd → R is defined as
  Dφ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩.
For any x, y ∈ Rd, Dφ(x, y) ≥ 0, with equality iff x = y.

Some examples of BDFs

φ(x) = x², then Dφ(x, y) = (x − y)².

Let p := (p1, . . . , pd) be a probability distribution (Σ_{j=1}^d pj = 1). Then φ(p) := Σ_{j=1}^d pj log pj (the negative Shannon entropy) is strictly convex on the d-simplex.
Let q = (q1, . . . , qd) be another probability distribution. Then
  Dφ(p, q) = Σ_{j=1}^d pj log pj − Σ_{j=1}^d qj log qj − ⟨p − q, ∇φ(q)⟩ = Σ_{j=1}^d pj log(pj / qj),
which is the KL-divergence between p and q.
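Both examples can be checked directly from the definition Dφ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩; the following small Python sketch (an addition, with hand-coded gradients) compares the definition with the closed forms above:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# Example 1: phi(x) = ||x||^2  ->  D_phi(x, y) = ||x - y||^2
phi_sq  = lambda v: np.dot(v, v)
grad_sq = lambda v: 2.0 * v
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(bregman(phi_sq, grad_sq, x, y), np.sum((x - y) ** 2))        # both 9.25

# Example 2: phi(p) = sum_j p_j log p_j (negative Shannon entropy)
#            ->  D_phi(p, q) = KL(p, q) on the simplex
phi_ent  = lambda p: np.sum(p * np.log(p))
grad_ent = lambda p: np.log(p) + 1.0
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(bregman(phi_ent, grad_ent, p, q), np.sum(p * np.log(p / q)))  # equal
```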

Proof of sufficiency

Let Y be any G-measurable random variable, and Y* := E[X | G]. Then
  E[Dφ(X, Y)] − E[Dφ(X, Y*)] = E[ φ(Y*) − φ(Y) − ⟨X − Y, ∇φ(Y)⟩ + ⟨X − Y*, ∇φ(Y*)⟩ ].
Notice that, since Y and ∇φ(Y) are G-measurable,
  E[⟨X − Y, ∇φ(Y)⟩] = E[ E[⟨X − Y, ∇φ(Y)⟩ | G] ] = E[⟨Y* − Y, ∇φ(Y)⟩].
Thus E[⟨X − Y*, ∇φ(Y*)⟩] = 0,
and E[Dφ(X, Y)] − E[Dφ(X, Y*)] = E[Dφ(Y*, Y)] ≥ 0.
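The final identity can also be verified by simulation. A minimal added sketch, assuming the squared-loss Bregman divergence and a σ-algebra G generated by a coin flip S (so E[X | G] is just the per-group mean):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
s = rng.integers(0, 2, size=n)                    # S generates G (two atoms)
x = np.where(s == 1, 2.0, -1.0) + rng.normal(size=n)

d = lambda a, b: (a - b) ** 2                     # Bregman divergence for phi(t) = t^2
y_star = np.where(s == 1, x[s == 1].mean(), x[s == 0].mean())   # ~ E[X | G]
y = np.where(s == 1, 0.7, -0.3)                   # an arbitrary G-measurable Y

lhs = d(x, y).mean() - d(x, y_star).mean()
rhs = d(y_star, y).mean()
print(lhs, rhs)                                   # agree up to Monte-Carlo error
```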

More facts about BDFs

Introduced and studied in the context of projection (Csiszar (1975)).
Pythagoras theorem holds for BDF (Censor and Lent (1981)).
Bijection between the family of exponential distributions and BDFs, via Legendre duality (Merugu, Banerjee, Dhillon, Ghosh (2003)).
Widely applied to data analysis and machine learning, such as K-means clustering.
Well adopted in convex optimization.

Generator Network [Goodfellow et al., 2014]

Generates the samples according to Pθ.
The real samples X are inaccessible.
Generates more compelling copies of X.

How to Make Generator Network Better?

A knowledgeable mentor: the discriminator.

Discriminator Network [Goodfellow et al., 2014]

Determines whether the samples are generated or not.
Has access to the real samples X.
Optimizes the generator network by identifying faked samples.

Graphical Model

Generative modeling

The procedure of generative modeling is to construct a class of suitable parametric probability distributions Pθ.

Generate a latent variable Z ∈ Z with a fixed probability distribution PZ.
  PZ is known and simple, e.g., the uniform distribution.
Generate a sequence of parametric functions gθ : Z → X.
  gθ is complicated but structured.
  gθ is the reason why generative modeling is powerful.
Construct Pθ as the probability distribution of gθ(Z). More specifically,
  Pθ(dx) = ∫_Z 1{gθ(z) ∈ dx} PZ(dz) = EZ[ 1{gθ(Z) ∈ dx} ].
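As a concrete illustration of the push-forward construction, the following sketch (an addition; the two-layer map gθ and its parameters are purely hypothetical) draws Z from a simple PZ and passes it through gθ, so samples from Pθ are produced without ever writing down a density:

```python
import numpy as np

rng = np.random.default_rng(0)

def g_theta(z, theta):
    # A toy structured generator g_theta: Z -> X (one hidden layer, tanh).
    w1, b1, w2, b2 = theta
    return np.tanh(z @ w1 + b1) @ w2 + b2

# P_Z: known and simple (uniform on [-1, 1]^2); theta: hypothetical parameters.
theta = (rng.normal(size=(2, 16)), np.zeros(16),
         rng.normal(size=(16, 2)), np.zeros(2))
z = rng.uniform(-1.0, 1.0, size=(10_000, 2))

x_fake = g_theta(z, theta)   # 10,000 samples from P_theta = law of g_theta(Z)
print(x_fake.shape, x_fake.mean(axis=0))
```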

GANs: different divergence functions

GANs:
  LSGANs [Mao et al., 2016]: Least square loss.
  DRAGANs [Kodali et al., 2017]: Regret minimization.
  CGANs [Mirza and Osindero, 2014]: Conditional extension.
  InfoGANs [Chen et al., 2016]: Information-theoretic extension.
  ACGANs [Odena et al., 2017]: Structured latent space.
  EBGANs [Zhao et al., 2016]: New perspective of the energy.
  BEGANs [Berthelot et al., 2017]: Auto-encoder extension.
GANs training: [Arjovsky and Bottou, 2017].
Wasserstein GANs:
  WGANs [Arjovsky et al., 2017]: Wasserstein L1 divergence.
  Improved WGANs [Gulrajani et al., 2017]: Gradient penalty.

Several Choices of Divergence

The divergences to measure the difference between P and Q include:

Kullback-Leibler (KL) divergence:
  KL(P, Q) = ∫_X P(dx) · log( P(dx) / Q(dx) ).

Jensen-Shannon (JS) divergence:
  JS(P, Q) = (1/2) [ KL(P, (P + Q)/2) + KL(Q, (P + Q)/2) ].

Wasserstein divergence/distance of order p:
  Wp(P, Q) = ( inf_{π ∈ Π(P,Q)} ∫_{X×X} m(x, y)^p π(dx, dy) )^{1/p},
with m a metric such as m(x, y) = ‖x − y‖_q for q ≥ 1.

Discussions on these divergences

Example: Given θ ∈ [0, 1], assume that P and Q satisfy
  ∀(x, y) ∈ P: x = 0, y ∼ Uniform(0, 1),
  ∀(x, y) ∈ Q: x = θ, y ∼ Uniform(0, 1).

When θ ≠ 0: KL(P, Q) = KL(Q, P) = +∞, JS(P, Q) = log(2), W1(P, Q) = |θ|.
When θ = 0: KL(P, Q) = KL(Q, P) = JS(P, Q) = W1(P, Q) = 0.
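The contrast in this example is easy to reproduce for one-dimensional discrete distributions; the sketch below is an addition (the supports and weights are illustrative, and SciPy's wasserstein_distance is used for W1): with disjoint supports, KL blows up and JS saturates at log 2, while W1 still reports how far apart the supports are.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def kl(p, q):
    mask = p > 0
    if np.any(q[mask] == 0):           # P puts mass where Q has none -> infinite
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

support = np.array([0.0, 0.1, 1.0])
p = np.array([1.0, 0.0, 0.0])          # all mass at 0
q = np.array([0.0, 1.0, 0.0])          # all mass at theta = 0.1 (disjoint support)

print("KL:", kl(p, q))                                       # inf
print("JS:", js(p, q), " log 2 =", np.log(2.0))              # log 2
print("W1:", wasserstein_distance(support, support, p, q))   # 0.1 = |theta|
```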

Remark

KL is infinite when the two distributions are disjoint;
JS has a sudden jump and is discontinuous at θ = 0;
W1 is continuous and relatively smooth;
the Wasserstein L1 divergence outperforms the KL and JS divergences but lacks flexibility.

Remedy: Relaxed Wasserstein

Definition (G., Hong, Lin, and Yang 2018)
The Relaxed Wasserstein (RW) divergence between the probability distributions P and Q is defined as
  WDφ(P, Q) = inf_{π ∈ Π(P,Q)} ∫_{X×X} Dφ(x, y) π(dx, dy),
where Dφ is the Bregman divergence with a strictly convex and differentiable function φ : Rd → R, i.e.,
  Dφ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩.

1. WDφ(P, Q) ≥ 0, and WDφ(P, Q) = 0 iff P = Q almost everywhere.
2. WDφ(P, Q) is not a metric, as it is asymmetric.
3. WDφ includes WKL, with φ(x) = −x⊤ log(x).
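For discrete P and Q the infimum over couplings Π(P, Q) is a finite linear program, so WDφ can be computed exactly. A minimal added sketch, assuming SciPy's linprog and, for simplicity, the squared-loss Bregman cost Dφ(x, y) = ‖x − y‖² from φ(x) = ‖x‖²; any strictly convex differentiable φ plugs in the same way:

```python
import numpy as np
from scipy.optimize import linprog

def relaxed_wasserstein(xs, p, ys, q, bregman):
    # Cost matrix C[i, j] = D_phi(x_i, y_j).
    C = np.array([[bregman(x, y) for y in ys] for x in xs])
    n, m = C.shape
    # Couplings pi >= 0 with row sums p and column sums q.
    A_eq = np.vstack([np.kron(np.eye(n), np.ones((1, m))),    # row marginals
                      np.kron(np.ones((1, n)), np.eye(m))])   # column marginals
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Squared-loss Bregman divergence: phi(x) = ||x||^2 gives D_phi(x, y) = ||x - y||^2.
sq = lambda x, y: float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

xs, p = [np.array([0.0]), np.array([1.0])], np.array([0.5, 0.5])
ys, q = [np.array([0.2]), np.array([1.3])], np.array([0.5, 0.5])
print(relaxed_wasserstein(xs, p, ys, q, sq))   # 0.065 for this toy example
```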

Relaxed Wasserstein as Divergence

Question: Is WDφ a good divergence?

Point 1: WDφ(P, Q) should be small when P and Q are close.
Requirement: WDφ(P, Q) should be dominated by a standard divergence, such as the total variation
  TV(P, Q) := sup_{A ∈ B} |P(A) − Q(A)|.

Point 2: WDφ(Pn, Pr) → 0 as n → ∞, where Pr is the true distribution and Pn is the empirical distribution based on X = (X1, X2, . . . , Xn) i.i.d. ∼ Pr.
Requirement: WDφ(Pn, Pr) should have a moment estimate and a concentration inequality, i.e., there exist α, β > 0 such that
  E[ WDφ(Pn, Pr) ] = O(n^{−α})           (moment estimate),
  Prob( WDφ(Pn, Pr) ≥ ε ) = O(n^{−β})    (concentration inequality).

Dominated by TV and Standard Wasserstein

Theorem (G., Hong, Lin, and Yang 2018)
Assume that φ : X → R is a strictly convex and smooth function with an L-Lipschitz continuous gradient. Then
  WDφ(P, Q) ≤ L [diam(X)]² · TV(P, Q),
  WDφ(P, Q) ≤ (L/2) · WL2(P, Q)²,
where P and Q are two probability distributions supported on a compact set X ⊂ Rd.


Moment Estimate for RW

Theorem (G., Hong, Lin, and Yang 2018)
Assume that Mq(Pr) = ∫_X ‖x‖₂^q Pr(dx) < +∞ for some q > 2. Then there exists a constant C(q, d) > 0 such that, for n ≥ 1,
  E[ WDφ(Pn, Pr) ] ≤ (C(q, d) L Mq^{2/q}(Pr) / 2) × {
    n^{−1/2} + n^{−(q−2)/q},               if 1 ≤ d ≤ 3 and q ≠ 4,
    n^{−1/2} log(1 + n) + n^{−(q−2)/q},    if d = 4 and q ≠ 4,
    n^{−2/d} + n^{−(q−2)/q},               if d ≥ 5 and q ≠ d/(d − 2). }

Concentration Inequality for RW

Theorem (G., Hong, Lin, and Yang 2018)
Let Eα,γ(Pr) = ∫_X exp(γ ‖x‖₂^α) Pr(dx), and assume that one of the three following conditions holds:
  ∃ α > 2, ∃ γ > 0 such that Eα,γ(Pr) < ∞,
  ∃ α ∈ (0, 2), ∃ γ > 0 such that Eα,γ(Pr) < ∞,
  ∃ q > 4 such that Mq(Pr) < ∞.
Then for n ≥ 1 and ε > 0 there exist scalars a(n, ε) and b(n, ε) such that
  Prob( WDφ(Pn, Pr) ≥ ε ) ≤ a(n, ε) · 1{ε ≤ L/2} + b(n, ε).

Duality Representation for RW

Theorem (G., Hong, Lin, and Yang 2018)
Assume that the two probability distributions P and Q satisfy ∫_X ‖x‖₂² (P + Q)(dx) < +∞. Then there exists a Lipschitz continuous function f : X → R such that the RW divergence has the duality representation
  WDφ(P, Q) = ∫_X φ(x) (P − Q)(dx) + ∫_X ⟨∇φ(x), x⟩ Q(dx) − [ ∫_X f(x) P(dx) + ∫_X f*(∇φ(x)) Q(dx) ],
where f* is the conjugate of f, i.e., f*(y) = sup_{x ∈ Rd} ⟨x, y⟩ − f(x).

Key element for proof of duality

The classical duality representation for the standard Wasserstein distance.
The RW divergence can be decomposed in terms of a distorted squared Wasserstein-L2 distance of order 2, plus some residual terms that are independent of the choice of the coupling π.


Relaxed Wasserstein for GANs

Question: Is WDφ tractable for GANs?
Requirement 1: WDφ(Pr, Pθ) should be continuous and differentiable w.r.t. θ.
Requirement 2: WDφ(Pr, Pθ) should have an easily computed or approximated gradient, i.e.,
  ∇θ [ WDφ(Pr, Pθ) ] = F(gθ, φ, Z, . . .),
where F is an abstract mapping.

Continuity and Differentiability

Theorem (G., Hong, Lin, and Yang 2018)
1. WDφ(Pr, Pθ) is continuous in θ if gθ is continuous in θ.
2. WDφ(Pr, Pθ) is differentiable almost everywhere if gθ is locally Lipschitz with a constant L(θ, z) such that E[L(θ, Z)²] < ∞, i.e., for each given (θ0, z0) there exists a neighborhood N such that
  ‖gθ(z) − gθ0(z0)‖₂ ≤ L(θ0, z0) ( ‖θ − θ0‖₂ + ‖z − z0‖₂ )   for any (θ, z) ∈ N.


Gradient Descent Scheme

Corollary (G., Hong, Lin, and Yang 2018)
Assume that gθ is locally Lipschitz with a constant L(θ, z) such that E[L(θ, Z)²] < ∞, and that ∫_X ‖x‖₂² (Pr + Pθ)(dx) < +∞. Then there exists a Lipschitz continuous solution f : X → R such that the gradient of the RW divergence has an explicit form, i.e.,
  ∇θ [ WDφ(Pr, Pθ) ] = EZ[ [∇θ gθ(Z)]⊤ ∇²φ(gθ(Z)) gθ(Z) ] + EZ[ ∇θ f(∇φ(gθ(Z))) ].
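One operational reading of this corollary (an interpretation added here, not the authors' implementation): if a critic network f stands in for the dual potential from the duality theorem, then automatic differentiation of the surrogate loss E[ ⟨∇φ(gθ(Z)), gθ(Z) held fixed⟩ + f(∇φ(gθ(Z))) ] reproduces exactly the stated gradient. The PyTorch sketch below assumes φ(x) = ½‖x‖² (so ∇φ is the identity) and an untrained toy critic; in practice f would be fitted alongside the generator, which is not shown:

```python
import torch

torch.manual_seed(0)
d_z, d_x = 8, 2
g = torch.nn.Sequential(torch.nn.Linear(d_z, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, d_x))          # generator g_theta
f = torch.nn.Sequential(torch.nn.Linear(d_x, 32), torch.nn.ReLU(),
                        torch.nn.Linear(32, 1))            # toy critic ~ dual potential f
opt = torch.optim.RMSprop(g.parameters(), lr=5e-5)

grad_phi = lambda x: x          # phi(x) = 0.5 * ||x||^2, so grad phi = identity

z = torch.rand(64, d_z) * 2 - 1          # Z ~ Uniform([-1, 1]^{d_z})
x_fake = g(z)

# Surrogate whose theta-gradient matches the corollary:
#   <grad phi(g(Z)), g(Z) detached>  +  f(grad phi(g(Z)))
loss = (grad_phi(x_fake) * x_fake.detach()).sum(dim=1).mean() \
       + f(grad_phi(x_fake)).mean()

opt.zero_grad()
loss.backward()     # populates d loss / d theta, i.e., the gradient in the corollary
opt.step()
print(loss.item())
```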


Experiment Setup

RW: KL divergence, with φ(x) = −x⊤ log(x).
Approach: RMSProp [Tieleman and Hinton, 2012].

Experiment I:
  Baselines: WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs and ACGANs.
  Datasets:
    MNIST: 60000 (train) and 10000 (test).
    Fashion-MNIST: 60000 (train) and 10000 (test).

Experiment II:
  Baselines: WGANs and WGANs-GP.
  Datasets:
    CIFAR-10 (color): 50000 (train) and 10000 (test).
    ImageNet (color): 14197122 images.

Metric for performance

The inception score is defined as follows:
  Inception Score = exp{ Ex[ DKL(p(y|x), p(y)) ] }.
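This formula translates directly into code. A small added sketch, where probs is a hypothetical (N, K) array of classifier probabilities p(y | x) evaluated on N generated images:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs[i] = p(y | x_i); the marginal p(y) is the average over generated images.
    p_y = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))        # exp{ E_x[ KL(p(y|x), p(y)) ] }

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)   # stand-in predictions
print(inception_score(probs))
```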


Empirical Results on MNIST and Fashion-MNIST datasets

[Table of generated samples on MNIST and Fashion-MNIST for RWGANs, LSGANs, WGANs, DRAGANs, CGANs, BEGANs, InfoGANs, EBGANs, GANs, and ACGANs; the sample images are not recoverable from this text extraction.]

Empirical Results on MNIST dataset

[Tables of generated MNIST samples at N = 1, 10, 25, and 100 for RWGANs, WGANs, CGANs, InfoGANs, GANs, LSGANs, DRAGANs, BEGANs, EBGANs, and ACGANs; the sample images are not recoverable from this text extraction.]


Empirical Results on CIFAR-10 and ImageNet datasets

[Table of generated samples on CIFAR-10 and ImageNet for RWGANs, WGANs, and WGANs-GP under the DCGAN and MLP architectures; the sample images are not recoverable from this text extraction.]

slide-130
SLIDE 130

Empirical Results on Inception Score

Architecture  Method     CIFAR-10                        ImageNet
                         First 5 epochs  Last 10 epochs  First 3 epochs  Last 5 epochs
DCGAN         RWGANs     1.8606          2.3962          2.0430          2.7008
              WGANs      1.6329          2.4246          2.2070          2.7972
              WGANs-GP   1.7259          2.3731          2.2749          2.7331
MLP           RWGANs     1.3126          2.1710          2.0025          2.4805
              WGANs      1.2798          1.9007          1.7401          2.2304
              WGANs-GP   1.2711          2.2192          1.8845          2.3448

41 / 50
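For reference, the Inception Score reported above is the exponential of the average KL divergence between the classifier's conditional label distribution p(y|x) for a generated image x and the marginal label distribution over all generated samples; higher scores indicate sharper, more diverse samples. The sketch below is our own illustration, not the evaluation code behind the table: it assumes the class probabilities have already been computed (in the standard protocol, by a pretrained Inception-v3 network) and only applies the scoring formula.

import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (N, K) array with probs[i, k] = p(y = k | x_i) for N generated images.
    # IS = exp( mean_i KL( p(y | x_i) || p(y) ) ), where p(y) is the marginal
    # label distribution over the generated batch.
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy check: confident, diverse predictions score higher than diffuse ones.
rng = np.random.default_rng(0)
diffuse = rng.dirichlet(np.ones(10) * 5.0, size=500)   # low-confidence rows
labels = rng.integers(0, 10, size=500)
peaked = np.full((500, 10), 0.003)
peaked[np.arange(500), labels] = 0.973                 # confident, varied rows
print(inception_score(diffuse), inception_score(peaked))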

slide-133
SLIDE 133

Empirical Results on ImageNet dataset

[Figure: generated ImageNet samples at N = 1 epoch for RWGANs, WGANs, and WGANs-GP, under DCGAN and MLP architectures.]

42 / 50

slide-136
SLIDE 136

Empirical Results on ImageNet dataset

[Figure: generated ImageNet samples at N = 25 epochs for RWGANs, WGANs, and WGANs-GP, under DCGAN and MLP architectures.]

43 / 50

slide-139
SLIDE 139

Conclusions

In summary:

We propose a novel class of statistical divergences, the Relaxed Wasserstein (RW) divergence. RW retains the key probabilistic properties of the Wasserstein distance while relaxing the symmetry requirement.

RW divergence adds flexibility to generative modeling: the generating function φ can be chosen from a class of strictly convex, differentiable functions, each encoding different curvature information (a small numerical illustration follows this slide).

We present a gradient-based optimization framework for learning RWGANs and obtain encouraging results on image generation.

44 / 50
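To make the curvature point above concrete, the following is a small numerical illustration of our own (not code from the talk). It evaluates the Bregman divergence D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩ for two standard generators: φ(x) = ‖x‖², whose flat curvature gives the symmetric squared Euclidean distance, and the negative entropy φ(x) = Σᵢ xᵢ log xᵢ, whose curvature blows up near zero and gives the asymmetric generalized KL divergence.

import numpy as np

def bregman(phi, grad_phi, x, y):
    # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2: uniform curvature, D_phi is the squared Euclidean distance.
sq = lambda x: float(np.dot(x, x))
sq_grad = lambda x: 2.0 * x

# phi(x) = sum_i x_i log x_i: curvature 1/x_i, D_phi is the generalized KL divergence.
negent = lambda x: float(np.sum(x * np.log(x)))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, x, y))          # equals ||x - y||^2, unchanged if x and y are swapped
print(bregman(negent, negent_grad, x, y))  # equals KL(x || y) since both sum to 1
print(bregman(negent, negent_grad, y, x))  # a different value: the divergence is asymmetric

Swapping φ therefore swaps the ground cost, and its curvature, on which the relaxed transport problem is built.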

slide-143
SLIDE 143

Future directions:

Does an optimal choice of φ exist in real problems?

Should φ depend on the data samples or on the problem structure?

Applications to finance: an ongoing project with JP Morgan using GANs.

In the theory of optimal transport and stochastic games, relaxed Wasserstein is more natural than the Wasserstein distance: it keeps the same nice mathematical properties without the symmetry constraint (see the sketch after this slide).

45 / 50
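To recall what the last point refers to, here is a schematic form of such a relaxed transport cost written with a Bregman ground cost D_φ. This is a sketch under the assumption that RW is obtained by replacing the metric cost in the optimal-transport problem with D_φ; it is asymmetric precisely because D_φ is.

\[
  \mathrm{RW}_{\varphi}(\mathbb{P}, \mathbb{Q})
    \;=\; \inf_{\pi \in \Pi(\mathbb{P}, \mathbb{Q})}
          \mathbb{E}_{(X, Y) \sim \pi}\bigl[\, D_{\varphi}(X, Y) \,\bigr],
  \qquad
  D_{\varphi}(x, y) \;=\; \varphi(x) - \varphi(y) - \langle \nabla \varphi(y),\, x - y \rangle,
\]

where Π(P, Q) denotes the set of couplings of P and Q. In general D_φ(x, y) ≠ D_φ(y, x), so RW_φ(P, Q) ≠ RW_φ(Q, P); the choice φ(x) = ‖x‖² recovers the squared 2-Wasserstein cost as the symmetric special case.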

slide-144
SLIDE 144

References

Arjovsky, M. and Bottou, L. (2017). Towards principled methods for training generative adversarial networks. ArXiv Preprint: 1701.04862.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein generative adversarial networks. ICML, pages 214–223.
Aude, G., Cuturi, M., Peyré, G., and Bach, F. (2016). Stochastic optimization for large-scale optimal transport. ArXiv Preprint: 1605.08527.
Bernton, E., Jacob, P. E., Gerber, M., and Robert, C. P. (2017). Inference in generative models using the Wasserstein distance. ArXiv Preprint: 1701.05146.
Berthelot, D., Schumm, T., and Metz, L. (2017). BEGAN: Boundary equilibrium generative adversarial networks. ArXiv Preprint: 1703.10717.
Blanchet, J., Kang, Y., and Murthy, K. (2016). Robust Wasserstein profile inference and applications to machine learning. ArXiv Preprint: 1610.05627.

46 / 50

slide-145
SLIDE 145

References

Carlier, G., Duval, V., Peyré, G., and Schmitzer, B. (2017). Convergence of entropic schemes for optimal transport and gradient flows. SIAM Journal on Mathematical Analysis, 49(2):1385–1418.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS, pages 2172–2180.
Esfahani, P. M. and Kuhn, D. (2015). Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, pages 1–52.
Gao, R., Chen, X., and Kleywegt, A. J. (2017). Wasserstein distributional robustness and regularization in statistical learning. ArXiv Preprint: 1712.06050.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. NIPS, pages 2672–2680.

47 / 50

slide-146
SLIDE 146

References

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of Wasserstein GANs. ArXiv Preprint: 1704.00028.
Guo, X., Hong, J., Lin, T., and Yang, N. (2017). Relaxed Wasserstein with applications to GANs. ArXiv Preprint: 1705.07164.
Karazeev, A. (2017). Generative Adversarial Networks (GANs): Engine and Applications. https://blog.statsbot.co/generative-adversarial-networks-gans-engine-and-applications-f96291965b47.
Kodali, N., Abernethy, J., Hays, J., and Kira, Z. (2017). How to train your DRAGAN. ArXiv Preprint: 1705.07215.
Mao, X., Li, Q., Xie, H., Lau, R. Y. K., Wang, Z., and Smolley, S. P. (2016). Least squares generative adversarial networks. ArXiv Preprint: 1611.04076.
Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets.

48 / 50

slide-147
SLIDE 147

References

Odena, A., Olah, C., and Shlens, J. (2017). Conditional image synthesis with auxiliary classifier GANs. ICML, pages 2642–2651.
Peyré, G. (2015). Entropic approximation of Wasserstein gradient flows. SIAM Journal on Imaging Sciences, 8(4):2323–2351.
Ramdas, A., Trillos, N. G., and Cuturi, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2).
Zhao, J., Mathieu, M., and LeCun, Y. (2016). Energy-based generative adversarial network. ArXiv Preprint: 1609.03126.

49 / 50

slide-148
SLIDE 148

Thank you for your attention!

50 / 50