Generative networks part 2: GANs (PowerPoint PPT Presentation)



SLIDE 1

Generative networks part 2: GANs

23 / 54

SLIDE 2

Recap on generative networks

Generative networks provide a way to sample from any distribution.

1. Sample z ∼ µ, where µ denotes an efficiently sampleable distribution (e.g., uniform or Gaussian).
2. Output g(z), where g : R^d → R^m is a deep network.

Notation: let g#µ (pushforward of µ through g) denote this distribution.

24 / 54

SLIDE 3

Recap on generative networks

Generative networks provide a way to sample from any distribution.

1. Sample z ∼ µ, where µ denotes an efficiently sampleable distribution (e.g., uniform or Gaussian).
2. Output g(z), where g : R^d → R^m is a deep network.

Notation: let g#µ (pushforward of µ through g) denote this distribution.

Brief remarks:
◮ Can this model any target distribution ν? Yes, (roughly) for the same reason that g can approximate any f : R^d → R^m.
◮ Graphical models let us sample and estimate probabilities; what about here? Nope.

24 / 54
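The two-step recipe above is short enough to sketch directly; here g is a hypothetical toy map (a coordinate-wise shift), standing in for a trained deep network:

```python
import random

def sample_pushforward(g, n, d=2):
    """Draw n samples from g#mu, where mu is uniform on [0,1]^d."""
    samples = []
    for _ in range(n):
        z = [random.random() for _ in range(d)]  # step 1: z ~ mu
        samples.append(g(z))                     # step 2: output g(z)
    return samples

# Toy "generator": shift each coordinate by 1 (a real g would be a deep network).
g = lambda z: [zi + 1.0 for zi in z]
xs = sample_pushforward(g, n=1000)
```

Here g#µ is simply the uniform distribution on [1, 2]^2; the point is only the sample-then-map structure.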

SLIDE 4

Univariate examples

g(x) = x, the identity function, mapping Uniform([0, 1]) to itself.

[Figure: input and output distributions on [0, 1].]

25 / 54

SLIDE 5

Univariate examples

g(x) = x², mapping Uniform([0, 1]) to something ∝ 1/√x.

[Figure: pushforward density on (0, 1], concentrating near 0.]

26 / 54
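This pushforward is easy to check empirically: since P(Z² ≤ 1/4) = P(Z ≤ 1/2) = 1/2, half the mass of g#µ lands below 1/4 (a stdlib-only sketch):

```python
import random

random.seed(0)
zs = [random.random() for _ in range(100_000)]  # z ~ Uniform([0, 1])
xs = [z * z for z in zs]                        # x = g(z) = z^2

# The pushforward density ~ 1/sqrt(x) piles mass near 0:
# P(X <= 1/4) = P(Z <= 1/2) = 1/2.
frac_below_quarter = sum(x <= 0.25 for x in xs) / len(xs)
```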

SLIDE 6

Univariate examples

g is the inverse CDF of the Gaussian; the input distribution is Uniform([0, 1]) and the output is Gaussian.

[Figure: uniform input on [0, 1] mapped to a Gaussian output.]

27 / 54
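This inverse-CDF construction works verbatim with Python's standard library, which exposes the Gaussian inverse CDF as NormalDist.inv_cdf (a sketch):

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(0)
g = NormalDist(mu=0.0, sigma=1.0).inv_cdf  # g = Gaussian inverse CDF

# Push Uniform([0, 1]) samples through g; the output is (standard) Gaussian.
xs = [g(random.random()) for _ in range(50_000)]
m, s = fmean(xs), stdev(xs)
```

The sample mean and standard deviation land close to 0 and 1, as a standard Gaussian requires.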

SLIDE 7

Another way to visualize generative networks

Given a sample from a distribution (even g#µ), here's the "kernel density" / "Parzen window" estimate of its density:

1. Start with a random draw (x_i)_{i=1}^n.
2. "Place bumps at every x_i": define

    p̂(x) := (1/n) Σ_{i=1}^n k((x − x_i)/h),

where k is a kernel function (not the SVM one!) and h is the "bandwidth"; for example:

28 / 54

SLIDE 8

Another way to visualize generative networks

Given a sample from a distribution (even g#µ), here's the "kernel density" / "Parzen window" estimate of its density:

1. Start with a random draw (x_i)_{i=1}^n.
2. "Place bumps at every x_i": define

    p̂(x) := (1/n) Σ_{i=1}^n k((x − x_i)/h),

where k is a kernel function (not the SVM one!) and h is the "bandwidth"; for example:

◮ Gaussian: k(z) ∝ exp(−z²/2);
◮ Epanechnikov: k(z) ∝ max{0, 1 − z²}.

28 / 54
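The bump-placing estimate above takes only a few lines; note this sketch normalizes by n·h rather than the slide's n (so that p̂ integrates to one when k does), and uses normalized forms of the two kernels just named:

```python
import math

def gaussian_kernel(z):
    # k(z) = exp(-z^2/2) / sqrt(2*pi), integrates to 1.
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def epanechnikov_kernel(z):
    # k(z) = (3/4) * max(0, 1 - z^2), integrates to 1.
    return 0.75 * max(0.0, 1.0 - z * z)

def kde(points, h, k=gaussian_kernel):
    """Return p_hat(x) = (1/(n*h)) * sum_i k((x - x_i)/h) as a callable."""
    n = len(points)
    return lambda x: sum(k((x - xi) / h) for xi in points) / (n * h)

# Bumps at three sample points; p_hat is a genuine density.
p_hat = kde([0.0, 0.1, 2.0], h=0.5)
```

With the 1/(nh) factor, numerically integrating p̂ over a wide grid returns approximately 1.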

SLIDE 9

Examples — univariate sampling.

Univariate sample, kernel density estimate (kde), GMM E-M.

[Figure: univariate sample with overlaid kde and gmm density curves.]

29 / 54

SLIDE 10

Examples — univariate sampling.

Univariate sample, kernel density estimate (kde), GAN kde.

[Figure: univariate sample with overlaid kde and gan kde density curves.]

This is admittedly very indirect! As mentioned, there aren’t great ways to get GAN/VAE density information.

30 / 54

SLIDE 11

Examples — bivariate sampling.

Bivariate sample, GMM E-M.

[Figure: bivariate sample with GMM E-M density contours.]

31 / 54

SLIDE 12

Examples — bivariate sampling.

Bivariate sample, kernel density estimate (kde).

[Figure: bivariate sample with kde density contours.]

32 / 54

SLIDE 13

Examples — bivariate sampling.

Bivariate sample, GAN kde.

[Figure: bivariate sample with GAN kde density contours.]

Question: how will this plot change with network capacity?

33 / 54

SLIDE 14

Approaches we’ve seen for modeling distributions.

34 / 54

SLIDE 15

Approaches we've seen for modeling distributions.

Let's survey our approaches to density estimation.
◮ Graphical models: can be interpretable, can encode domain knowledge.
◮ Kernel density estimation: easy to implement, converges to the right thing, suffers a curse of dimension.
◮ Training: easy for KDE, messy for graphical models. Interpretability: fine for both. Sampling: easy for both. Probability measurements: easy for KDE, sometimes easy for graphical models.

34 / 54

SLIDE 16

Approaches we've seen for modeling distributions.

Let's survey our approaches to density estimation.
◮ Graphical models: can be interpretable, can encode domain knowledge.
◮ Kernel density estimation: easy to implement, converges to the right thing, suffers a curse of dimension.
◮ Training: easy for KDE, messy for graphical models. Interpretability: fine for both. Sampling: easy for both. Probability measurements: easy for KDE, sometimes easy for graphical models.

Deep networks.
◮ Either we have easy sampling, or we can estimate densities. Doing both seems to have major computational or data costs.

34 / 54

SLIDE 17

Brief VAE Recap

35 / 54

SLIDE 18

(Variational) Autoencoders

◮ Autoencoder: x_i --f--> latent z_i = f(x_i) --g--> x̂_i = g(z_i).
  Objective: (1/n) Σ_{i=1}^n ℓ(x_i, x̂_i).

36 / 54

SLIDE 19

(Variational) Autoencoders

◮ Autoencoder: x_i --f--> latent z_i = f(x_i) --g--> x̂_i = g(z_i).
  Objective: (1/n) Σ_{i=1}^n ℓ(x_i, x̂_i).
◮ Variational Autoencoder: x_i --f--> latent distribution µ_i = f(x_i) --pushforward--> x̂_i ∼ g#µ_i.
  Objective: (1/n) Σ_{i=1}^n [ ℓ(x_i, x̂_i) + λ·KL(µ, µ_i) ].

36 / 54
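When the latent distributions are Gaussian, the KL penalty in the VAE objective has a closed form. Below is a sketch of the generic univariate-Gaussian formula; either argument order of the slide's KL(µ, µ_i) can be plugged in:

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """Closed form for KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

# Penalty comparing a per-example latent Gaussian N(1, 1) against a prior N(0, 1):
penalty = kl_gauss(1.0, 1.0, 0.0, 1.0)  # = 0.5
```

This closed form is what makes the KL term cheap to differentiate during VAE training.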

SLIDE 20

[Figure: reconstructions, x̂_i ∼ g#µ_i.]

37 / 54

SLIDE 21

[Figure: samples, x̂_i ∼ g#µ with small λ.]

37 / 54

SLIDE 22

Generative Adversarial Networks (GANs)

38 / 54

SLIDE 23

Generative network setup and training.

◮ We are given (x_i)_{i=1}^n ∼ ν.
◮ We want to find g so that (g(z_i))_{i=1}^n ≈ (x_i)_{i=1}^n, where (z_i)_{i=1}^n ∼ µ.

Problem: this isn't as simple as fitting g(z_i) ≈ x_i.

39 / 54

SLIDE 24

Generative network setup and training.

◮ We are given (x_i)_{i=1}^n ∼ ν.
◮ We want to find g so that (g(z_i))_{i=1}^n ≈ (x_i)_{i=1}^n, where (z_i)_{i=1}^n ∼ µ.

Problem: this isn't as simple as fitting g(z_i) ≈ x_i.

Solutions:
◮ VAE: For each x_i, construct a distribution µ_i so that x̂_i ∼ g#µ_i and x_i are close, as are µ_i and µ. To generate fresh samples, draw z ∼ µ and output g(z).
◮ GAN: Pick a distance notion between distributions (or between the samples (g(z_i))_{i=1}^n and (x_i)_{i=1}^n) and pick g to minimize that!

39 / 54

SLIDE 25

GAN overview

GAN approach: we minimize D(ν, g#µ) directly, where "D" is some notion of distance/divergence:
◮ Jensen-Shannon divergence (original GAN paper).
◮ Wasserstein distance (influential follow-up).

40 / 54

SLIDE 26

GAN overview

GAN approach: we minimize D(ν, g#µ) directly, where "D" is some notion of distance/divergence:
◮ Jensen-Shannon divergence (original GAN paper).
◮ Wasserstein distance (influential follow-up).

Each distance is computed with an alternating/adversarial scheme:

1. We have some current choice g_t, and use it to produce a sample (x̂_i)_{i=1}^n with x̂_i = g_t(z_i).
2. We train a discriminator/critic f_t to find differences between (x̂_i)_{i=1}^n and (x_i)_{i=1}^n.
3. We then pick a new generator g_{t+1}, trained to fool f_t!

40 / 54

SLIDE 27

Jensen-Shannon divergence (original GAN)

41 / 54

SLIDE 28

Original GAN formulation

Let p, p_g denote the densities of the data and the generator, and p̃ = p/2 + p_g/2.

The original GAN minimizes the Jensen-Shannon divergence:

    2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃)
                   = ∫ p(x) ln (p(x)/p̃(x)) dx + ∫ p_g(x) ln (p_g(x)/p̃(x)) dx
                   = E_p ln (p(x)/p̃(x)) + E_{p_g} ln (p_g(x)/p̃(x)).

42 / 54

SLIDE 29

Original GAN formulation

Let p, p_g denote the densities of the data and the generator, and p̃ = p/2 + p_g/2.

The original GAN minimizes the Jensen-Shannon divergence:

    2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃)
                   = ∫ p(x) ln (p(x)/p̃(x)) dx + ∫ p_g(x) ln (p_g(x)/p̃(x)) dx
                   = E_p ln (p(x)/p̃(x)) + E_{p_g} ln (p_g(x)/p̃(x)).

But we've been saying we can't write down p_g?

42 / 54

SLIDE 30

Original GAN formulation

Let p, p_g denote the densities of the data and the generator, and p̃ = p/2 + p_g/2.

The original GAN minimizes the Jensen-Shannon divergence:

    2 · JS(p, p_g) = KL(p, p̃) + KL(p_g, p̃)
                   = ∫ p(x) ln (p(x)/p̃(x)) dx + ∫ p_g(x) ln (p_g(x)/p̃(x)) dx
                   = E_p ln (p(x)/p̃(x)) + E_{p_g} ln (p_g(x)/p̃(x)).

But we've been saying we can't write down p_g?

The original GAN approach applies alternating minimization to

    inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].

42 / 54

SLIDE 31

Original GAN formulation and algorithm.

Original GAN objective:

    inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].

The algorithm alternates these two steps:

1. Hold g fixed and optimize f. Specifically, generate a sample (x̂_j)_{j=1}^m = (g(z_j))_{j=1}^m, and approximately optimize

    sup_{f∈F, f:X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(x̂_j)) ].

2. Hold f fixed and optimize g. Specifically, generate (z_j)_{j=1}^m and approximately optimize

    inf_{g∈G} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].

43 / 54
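The two alternating steps can be sketched end to end on a toy one-dimensional problem, with a generator g_θ(z) = θ + z and a logistic critic f(x) = σ(ax + b); this setup, the hand-written gradients, and all learning rates and step counts are illustrative choices, not from the slides:

```python
import math, random

random.seed(0)
sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))

# Toy setup: real data x ~ N(2, 1); generator g_theta(z) = theta + z with
# z ~ N(0, 1), so theta = 2 matches the data distribution exactly.
xs = [random.gauss(2.0, 1.0) for _ in range(200)]
theta, a, b, lr = 0.0, 0.0, 0.0, 0.1

for _ in range(200):
    zs = [random.gauss(0.0, 1.0) for _ in range(200)]
    fakes = [theta + z for z in zs]
    # Step 1: hold g fixed, take several gradient-ascent steps on the critic.
    for _ in range(5):
        ga = (sum((1 - sigmoid(a * x + b)) * x for x in xs) / len(xs)
              - sum(sigmoid(a * x + b) * x for x in fakes) / len(fakes))
        gb = (sum(1 - sigmoid(a * x + b) for x in xs) / len(xs)
              - sum(sigmoid(a * x + b) for x in fakes) / len(fakes))
        a, b = a + lr * ga, b + lr * gb
    # Step 2: hold f fixed, one gradient-descent step on theta to fool f.
    gtheta = -sum(sigmoid(a * (theta + z) + b) * a for z in zs) / len(zs)
    theta -= lr * gtheta
```

With these settings θ should drift toward 2, though, as the remarks on the next slide warn, such loops tend to oscillate rather than converge cleanly.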

SLIDE 32

Some implementation issues

The algorithm alternates these two steps:

1. Hold g fixed and optimize f. Specifically, generate a sample (x̂_j)_{j=1}^m = (g(z_j))_{j=1}^m, and approximately optimize

    sup_{f∈F, f:X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(x̂_j)) ].

2. Hold f fixed and optimize g. Specifically, generate (z_j)_{j=1}^m and approximately optimize

    inf_{g∈G} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].

Remarks.
◮ Common practice: do many f ascents for each g descent.
◮ Training has all sorts of instabilities and heuristic fixes; e.g., mode collapse (g outputs a small subset of training elements).
◮ The original intuition was game-theoretic: generator and critic compete.

44 / 54

SLIDE 33

Optimal discriminator

Given p (of the data), p_g (from g), p_z (on z),

    E ln f(x) + E ln(1 − f(g(z)))
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(g(z))) p_z(z) dz
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(x)) p_g(x) dx
      = ∫ [ ln f(x) p(x) + ln(1 − f(x)) p_g(x) ] dx.

45 / 54

SLIDE 34

Optimal discriminator

Given p (of the data), p_g (from g), p_z (on z),

    E ln f(x) + E ln(1 − f(g(z)))
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(g(z))) p_z(z) dz
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(x)) p_g(x) dx
      = ∫ [ ln f(x) p(x) + ln(1 − f(x)) p_g(x) ] dx.

To find the maximal f, maximize pointwise.

45 / 54

SLIDE 35

Optimal discriminator

Given p (of the data), p_g (from g), p_z (on z),

    E ln f(x) + E ln(1 − f(g(z)))
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(g(z))) p_z(z) dz
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(x)) p_g(x) dx
      = ∫ [ ln f(x) p(x) + ln(1 − f(x)) p_g(x) ] dx.

To find the maximal f, maximize pointwise. The map r ↦ a ln(r) + b ln(1 − r) is concave, with maximizer r = a/(a + b).

45 / 54

SLIDE 36

Optimal discriminator

Given p (of the data), p_g (from g), p_z (on z),

    E ln f(x) + E ln(1 − f(g(z)))
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(g(z))) p_z(z) dz
      = ∫ ln f(x) p(x) dx + ∫ ln(1 − f(x)) p_g(x) dx
      = ∫ [ ln f(x) p(x) + ln(1 − f(x)) p_g(x) ] dx.

To find the maximal f, maximize pointwise. The map r ↦ a ln(r) + b ln(1 − r) is concave, with maximizer r = a/(a + b). Therefore, the optimal discriminator satisfies

    f(x) = p(x) / (p(x) + p_g(x)).

45 / 54
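The pointwise claim that r ↦ a ln(r) + b ln(1 − r) is maximized at r = a/(a + b) is easy to verify numerically over a grid:

```python
import math

# Pointwise objective from the slide: phi(r) = a*ln(r) + b*ln(1 - r).
def phi(r, a, b):
    return a * math.log(r) + b * math.log(1 - r)

a, b = 0.3, 0.7
r_star = a / (a + b)  # claimed maximizer
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda r: phi(r, a, b))  # numerical maximizer on (0, 1)
```

The grid maximizer agrees with a/(a + b) to the grid resolution; with a = p(x) and b = p_g(x) this is exactly the optimal discriminator value at x.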

SLIDE 37

Recovering Jensen-Shannon divergence

Let's plug in the optimal discriminator f(x) = p(x)/(p(x) + p_g(x)):

    sup_{f∈F} E ln f(x) + E ln(1 − f(g(z)))
      = sup_{f∈F} ∫ [ ln f(x) p(x) + ln(1 − f(x)) p_g(x) ] dx
      = ∫ [ p(x) ln (p(x)/(p(x) + p_g(x))) + p_g(x) ln (p_g(x)/(p(x) + p_g(x))) ] dx
      = ∫ [ p(x) ln (2p(x)/(p(x) + p_g(x))) + p_g(x) ln (2p_g(x)/(p(x) + p_g(x))) ] dx − ln 4
      = KL(p, (p + p_g)/2) + KL(p_g, (p + p_g)/2) − ln 4
      = 2 · JS(p, p_g) − ln 4.

46 / 54
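The identity on this slide can be checked numerically for discrete distributions, plugging the optimal f = p/(p + p_g) into the inner objective and comparing against 2·JS(p, p_g) − ln 4 (a sketch):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Plug the optimal discriminator f = p/(p+pg) into the inner objective:
p  = [0.5, 0.3, 0.2]
pg = [0.2, 0.2, 0.6]
f  = [pi / (pi + gi) for pi, gi in zip(p, pg)]
inner = sum(pi * math.log(fi) + gi * math.log(1 - fi)
            for pi, gi, fi in zip(p, pg, f))
```

The two quantities agree to floating-point precision, for any pair of discrete densities you substitute.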

SLIDE 38

Technical remarks.

◮ This derivation is over the true distribution, not the sample! The sample induces a discrete distribution!
◮ How to regularize/generalize? The optimum of memorizing the training set is trivial and doesn't need a GAN to train (just randomly sample the training set).
◮ We pick f from a class of deep networks, and in general can't set it to the arbitrary p/(p + p_g). So the Jensen-Shannon connection is strained.
◮ There are many refinements, including architecture choices; e.g., "DCGAN".

47 / 54

SLIDE 39

Wasserstein GAN (WGAN)

48 / 54

SLIDE 40

Wasserstein GAN (WGAN)

Original GAN objective:

    inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].

49 / 54

SLIDE 41

Wasserstein GAN (WGAN)

Original GAN objective:

    inf_{g∈G} sup_{f∈F, f:X→(0,1)} [ (1/n) Σ_{i=1}^n ln f(x_i) + (1/m) Σ_{j=1}^m ln(1 − f(g(z_j))) ].

Wasserstein GAN objective:

    inf_{g∈G} sup_{f∈F, ‖f‖_Lip ≤ 1} [ (1/n) Σ_{i=1}^n f(x_i) − (1/m) Σ_{j=1}^m f(g(z_j)) ],

where "‖f‖_Lip ≤ 1" means f is 1-Lipschitz (|f(x) − f(y)| ≤ ‖x − y‖).

49 / 54
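For intuition, in one dimension the Wasserstein-1 distance this objective estimates can be computed exactly for two equal-size empirical samples by sorting and matching; this is a ground-truth reference, not part of the training procedure:

```python
def w1_empirical(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples:
    sort both and average the coordinate-wise gaps."""
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a sample by c moves every unit of mass a distance c, so W1 = c.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x + 0.5 for x in xs]
```

For example, `w1_empirical(xs, ys)` returns 0.5 here, matching the shift.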

SLIDE 42

WGAN remarks

Wasserstein GAN objective:

    inf_{g∈G} sup_{f∈F, ‖f‖_Lip ≤ 1} [ (1/n) Σ_{i=1}^n f(x_i) − (1/m) Σ_{j=1}^m f(g(z_j)) ],

where "‖f‖_Lip ≤ 1" means f is 1-Lipschitz (|f(x) − f(y)| ≤ ‖x − y‖).

50 / 54

SLIDE 43

WGAN remarks

Wasserstein GAN objective:

    inf_{g∈G} sup_{f∈F, ‖f‖_Lip ≤ 1} [ (1/n) Σ_{i=1}^n f(x_i) − (1/m) Σ_{j=1}^m f(g(z_j)) ],

where "‖f‖_Lip ≤ 1" means f is 1-Lipschitz (|f(x) − f(y)| ≤ ‖x − y‖).

Remarks.
◮ In practice, G and F are deep network architectures, and ‖f‖_Lip ≤ 1 is only approximately enforced.
◮ This objective is a "Wasserstein distance" or "earth mover distance"; it can be interpreted as how much mass we have to shift to convert one distribution into another (in this case, g#µ and the original).
◮ The above formulation of the Wasserstein distance is the "dual form", given via "Kantorovich-Rubinstein duality".

50 / 54

SLIDE 44

Summary and Reflection

51 / 54

SLIDE 45

Reflection

◮ We gave two approaches (GAN and VAE) to sampling with deep networks: train g, then sample from g#µ.
◮ There are other ways to sample with deep networks (e.g., fit a density and then use Langevin dynamics), but no one talks about them?
◮ Open question: how to evaluate GANs?! This is currently a disaster. On the plus side: the community wants evaluation that matches the human notion of similarity.
◮ Both GAN and VAE are used extensively; some approaches blend both (e.g., BicycleGAN).
◮ GAN needs alternating minimization, VAE uses regular minimization. Both are finicky, though.

52 / 54

SLIDE 46

Original papers

◮ (Original VAE paper.) Diederik P. Kingma and Max Welling. "Auto-Encoding Variational Bayes". 2013.
◮ (Original GAN paper.) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. "Generative Adversarial Nets". 2014.
◮ (Wasserstein GAN papers.)
  ◮ Martin Arjovsky, Soumith Chintala, and Léon Bottou. "Wasserstein Generative Adversarial Networks". 2017.
  ◮ Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. "Improved Training of Wasserstein GANs". 2017.

53 / 54

SLIDE 47

Summary (of part 1).
◮ The sampling scheme: draw z ∼ µ efficiently, then compute g(z), where g is a deep network.
◮ The basic VAE scheme and its objective function (the ERM perspective); perhaps the recap in part 2 has the cleanest presentation.

Summary (of part 2).
◮ GAN: minimize a distance on probability measures.
◮ Original GAN: Jensen-Shannon divergence, and the corresponding alternating scheme.
◮ WGAN.

54 / 54