CS7015 (Deep Learning) : Lecture 23 - Generative Adversarial Networks (GANs)
slide-1
SLIDE 1

1/38

CS7015 (Deep Learning) : Lecture 23

Generative Adversarial Networks (GANs) Mitesh M. Khapra

Department of Computer Science and Engineering Indian Institute of Technology Madras

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-2
SLIDE 2

2/38

Module 23.1: Generative Adversarial Networks - The intuition

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-3
SLIDE 3

3/38

So far we have looked at generative models which explicitly model the joint probability distribution or the conditional probability distribution

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-4
SLIDE 4

3/38

[Figure: recap of the earlier models - an RBM with visible units V ∈ {0, 1}^m, hidden units H ∈ {0, 1}^n and weights W ∈ R^{m×n}; a VAE with encoder Qθ(z|X) and decoder Pφ(X|z); and an autoregressive model factorizing p(x1) p(x2|x1) p(x3|x1, x2) p(x4|x1, x2, x3)]

So far we have looked at generative models which explicitly model the joint probability distribution or the conditional probability distribution. For example, in RBMs we learn P(X, H), in VAEs we learn P(z|X) and P(X|z), whereas in AR models we learn P(X)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-5
SLIDE 5

3/38


So far we have looked at generative models which explicitly model the joint probability distribution or the conditional probability distribution. For example, in RBMs we learn P(X, H), in VAEs we learn P(z|X) and P(X|z), whereas in AR models we learn P(X). What if we are only interested in sampling from the distribution and don’t really care about the explicit density function P(X)?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-6
SLIDE 6

3/38


So far we have looked at generative models which explicitly model the joint probability distribution or the conditional probability distribution. For example, in RBMs we learn P(X, H), in VAEs we learn P(z|X) and P(X|z), whereas in AR models we learn P(X). What if we are only interested in sampling from the distribution and don’t really care about the explicit density function P(X)? What does this mean?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-7
SLIDE 7

3/38


So far we have looked at generative models which explicitly model the joint probability distribution or the conditional probability distribution. For example, in RBMs we learn P(X, H), in VAEs we learn P(z|X) and P(X|z), whereas in AR models we learn P(X). What if we are only interested in sampling from the distribution and don’t really care about the explicit density function P(X)? What does this mean? Let us see

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-8
SLIDE 8

4/38

As usual we are given some training data (say, MNIST images) which obviously comes from some underlying distribution

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-9
SLIDE 9

4/38

As usual we are given some training data (say, MNIST images) which obviously comes from some underlying distribution Our goal is to generate more images from this distribution (i.e., create images which look similar to the images from the training data)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-10
SLIDE 10

4/38

As usual we are given some training data (say, MNIST images) which obviously comes from some underlying distribution Our goal is to generate more images from this distribution (i.e., create images which look similar to the images from the training data) In other words, we want to sample from a complex high dimensional distribution which is intractable (recall RBMs, VAEs and AR models deal with this intractability in their own way)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-11
SLIDE 11

5/38

z ∼ N(0, I) Complex Transformation Sample Generated

GANs take a different approach to this problem where the idea is to sample from a simple tractable distribution (say, z ∼ N(0, I)) and then learn a complex transformation from this to the training distribution

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-12
SLIDE 12

5/38

z ∼ N(0, I) Complex Transformation Sample Generated

GANs take a different approach to this problem where the idea is to sample from a simple tractable distribution (say, z ∼ N(0, I)) and then learn a complex transformation from this to the training distribution In other words, we will take a z ∼ N(0, I), learn to make a series of complex transformations on it so that the output looks as if it came from our training distribution

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-13
SLIDE 13

6/38

What can we use for such a complex transformation?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-14
SLIDE 14

6/38

What can we use for such a complex transformation? A Neural Network

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-15
SLIDE 15

6/38

What can we use for such a complex transformation? A Neural Network How do you train such a neural network?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-16
SLIDE 16

6/38

What can we use for such a complex transformation? A Neural Network How do you train such a neural network? Using a two player game

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-17
SLIDE 17

6/38

What can we use for such a complex transformation? A Neural Network How do you train such a neural network? Using a two player game There are two players in the game:

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-18
SLIDE 18

6/38

Generator z ∼ N(0, I)

What can we use for such a complex transformation? A Neural Network How do you train such a neural network? Using a two player game There are two players in the game: a generator

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-19
SLIDE 19

6/38

Generator z ∼ N(0, I) Discriminator

What can we use for such a complex transformation? A Neural Network How do you train such a neural network? Using a two player game There are two players in the game: a generator and a discriminator

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-20
SLIDE 20

6/38

Generator z ∼ N(0, I) Real Images Discriminator

What can we use for such a complex transformation? A Neural Network How do you train such a neural network? Using a two player game There are two players in the game: a generator and a discriminator The job of the generator is to produce images which look so natural that the discriminator thinks that the images came from the real data distribution

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-21
SLIDE 21

6/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

What can we use for such a complex transformation? A Neural Network How do you train such a neural network? Using a two player game There are two players in the game: a generator and a discriminator The job of the generator is to produce images which look so natural that the discriminator thinks that the images came from the real data distribution The job of the discriminator is to get better and better at distinguishing between true images and generated (fake) images

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-22
SLIDE 22

7/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So let’s look at the full picture

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-23
SLIDE 23

7/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So let’s look at the full picture Let Gφ be the generator and Dθ be the discriminator (φ and θ are the parameters of G and D, respectively)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-24
SLIDE 24

7/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So let’s look at the full picture Let Gφ be the generator and Dθ be the discriminator (φ and θ are the parameters of G and D, respectively) We have a neural network based generator which takes as input a noise vector z ∼ N(0, I) and produces Gφ(z) = X

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-25
SLIDE 25

7/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So let’s look at the full picture Let Gφ be the generator and Dθ be the discriminator (φ and θ are the parameters of G and D, respectively) We have a neural network based generator which takes as input a noise vector z ∼ N(0, I) and produces Gφ(z) = X We have a neural network based discriminator which could take as input a real X or a generated X = Gφ(z) and classify the input as real/fake

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
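To make this setup concrete, here is a minimal sketch (an illustration, not the architecture used in the lecture) of the two players as small neural networks; the layer sizes, the flattened MNIST-sized output and the latent dimension of 100 are assumptions chosen only for the example:

    import torch
    import torch.nn as nn

    latent_dim = 100      # dimension of z ~ N(0, I); an arbitrary choice for this sketch
    data_dim = 28 * 28    # a flattened MNIST-sized image (assumption)

    # Generator G_phi: noise vector z -> fake sample X = G_phi(z)
    G = nn.Sequential(
        nn.Linear(latent_dim, 128), nn.ReLU(),
        nn.Linear(128, data_dim), nn.Tanh(),      # outputs scaled to [-1, 1]
    )

    # Discriminator D_theta: (real or generated) sample -> score in (0, 1)
    D = nn.Sequential(
        nn.Linear(data_dim, 128), nn.LeakyReLU(0.2),
        nn.Linear(128, 1), nn.Sigmoid(),          # probability that the input is real
    )

    z = torch.randn(16, latent_dim)   # a minibatch of z ~ N(0, I)
    fake = G(z)                       # G_phi(z)
    score = D(fake)                   # D_theta(G_phi(z)), one value in (0, 1) per sample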

slide-26
SLIDE 26

8/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

What should be the objective function of the overall network?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-27
SLIDE 27

8/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

What should be the objective function of the overall network? Let’s look at the objective function of the generator first

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-28
SLIDE 28

8/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

What should be the objective function of the overall network? Let’s look at the objective function of the generator first. Given an image generated by the generator as Gφ(z), the discriminator assigns a score Dθ(Gφ(z)) to it

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-29
SLIDE 29

8/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

What should be the objective function of the overall network? Let’s look at the objective function of the generator first. Given an image generated by the generator as Gφ(z), the discriminator assigns a score Dθ(Gφ(z)) to it. This score will be between 0 and 1 and will tell us the probability of the image being real or fake

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-30
SLIDE 30

8/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

What should be the objective function of the overall network? Let’s look at the objective function of the generator first. Given an image generated by the generator as Gφ(z), the discriminator assigns a score Dθ(Gφ(z)) to it. This score will be between 0 and 1 and will tell us the probability of the image being real or fake. For a given z, the generator would want to maximize log Dθ(Gφ(z)) (log likelihood) or minimize log(1 − Dθ(Gφ(z)))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-31
SLIDE 31

9/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

This is just for a single z and the generator would like to do this for all possible values of z,

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-32
SLIDE 32

9/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

This is just for a single z and the generator would like to do this for all possible values of z. For example, if z was discrete and drawn from a uniform distribution (i.e., p(z) = 1/N ∀z) then the generator’s objective function would be

min_φ Σ_{i=1}^{N} (1/N) log(1 − Dθ(Gφ(z_i)))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-33
SLIDE 33

9/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

This is just for a single z and the generator would like to do this for all possible values of z. For example, if z was discrete and drawn from a uniform distribution (i.e., p(z) = 1/N ∀z) then the generator’s objective function would be

min_φ Σ_{i=1}^{N} (1/N) log(1 − Dθ(Gφ(z_i)))

However, in our case, z is continuous and not uniform (z ∼ N(0, I)), so the equivalent objective function would be

min_φ ∫ p(z) log(1 − Dθ(Gφ(z))) dz

min_φ E_{z∼p(z)}[log(1 − Dθ(Gφ(z)))]

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-34
SLIDE 34

10/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

Now let’s look at the discriminator

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-35
SLIDE 35

10/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

Now let’s look at the discriminator The task of the discriminator is to assign a high score to real images and a low score to fake images

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-36
SLIDE 36

10/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

Now let’s look at the discriminator The task of the discriminator is to assign a high score to real images and a low score to fake images And it should do this for all possible real images and all possible fake images

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-37
SLIDE 37

10/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

Now let’s look at the discriminator. The task of the discriminator is to assign a high score to real images and a low score to fake images. And it should do this for all possible real images and all possible fake images. In other words, it should try to maximize the following objective function

max_θ E_{x∼pdata}[log Dθ(x)] + E_{z∼p(z)}[log(1 − Dθ(Gφ(z)))]

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
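On a minibatch, this objective is estimated by a simple Monte Carlo average. A small sketch of that estimate (reusing the illustrative G, D and latent_dim from the earlier sketch; x_real below is only a stand-in for a batch of real images):

    import torch

    # Reuses the illustrative G, D, latent_dim from the earlier sketch
    x_real = torch.rand(16, 28 * 28)    # stand-in for a minibatch of real images
    z = torch.randn(x_real.size(0), latent_dim)
    d_real = D(x_real)                  # D_theta(x) for x drawn from the data
    d_fake = D(G(z))                    # D_theta(G_phi(z)) for z ~ N(0, I)
    # Minibatch estimate of E_x[log D_theta(x)] + E_z[log(1 - D_theta(G_phi(z)))];
    # the discriminator performs gradient ascent on this quantity w.r.t. theta only
    disc_objective = torch.log(d_real).mean() + torch.log(1 - d_fake).mean()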

slide-38
SLIDE 38

11/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

If we put the objectives of the generator and discriminator together we get a minimax game

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-39
SLIDE 39

11/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

If we put the objectives of the generator and discriminator together we get a minimax game

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

The first term in the objective is only w.r.t. the parameters of the discriminator (θ)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-40
SLIDE 40

11/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

If we put the objectives of the generator and discriminator together we get a minimax game

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

The first term in the objective is only w.r.t. the parameters of the discriminator (θ). The second term in the objective is w.r.t. the parameters of the generator (φ) as well as the discriminator (θ)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-41
SLIDE 41

11/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

If we put the objectives of the generator and discriminator together we get a minimax game

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

The first term in the objective is only w.r.t. the parameters of the discriminator (θ). The second term in the objective is w.r.t. the parameters of the generator (φ) as well as the discriminator (θ). The discriminator wants to maximize the second term whereas the generator wants to minimize it (hence it is a two-player game)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-42
SLIDE 42

12/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So the overall training proceeds by alternating between these two steps

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-43
SLIDE 43

12/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So the overall training proceeds by alternating between these two steps

Step 1: Gradient Ascent on Discriminator
max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-44
SLIDE 44

12/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So the overall training proceeds by alternating between these two steps

Step 1: Gradient Ascent on Discriminator
max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

Step 2: Gradient Descent on Generator
min_φ E_{z∼p(z)} log(1 − Dθ(Gφ(z)))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-45
SLIDE 45

12/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So the overall training proceeds by alternating between these two steps

Step 1: Gradient Ascent on Discriminator
max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

Step 2: Gradient Descent on Generator
min_φ E_{z∼p(z)} log(1 − Dθ(Gφ(z)))

In practice, the above generator objective does not work well and we use a slightly modified objective

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-46
SLIDE 46

12/38

Generator z ∼ N(0, I) Real Images Discriminator Real or Fake

So the overall training proceeds by alternating between these two steps

Step 1: Gradient Ascent on Discriminator
max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

Step 2: Gradient Descent on Generator
min_φ E_{z∼p(z)} log(1 − Dθ(Gφ(z)))

In practice, the above generator objective does not work well and we use a slightly modified objective. Let us see why

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-47
SLIDE 47

13/38

[Plot: the generator loss log(1 − D(G(z))) as a function of D(G(z))]

When the sample is likely fake, we want to give feedback to the generator (using gradients)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-48
SLIDE 48

13/38


When the sample is likely fake, we want to give feedback to the generator (using gradients). However, in this region where D(G(z)) is close to 0, the curve of the loss function is very flat and the gradient would be close to 0

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-49
SLIDE 49

13/38

[Plot: the losses log(1 − D(G(z))) and − log(D(G(z))) as functions of D(G(z))]

When the sample is likely fake, we want to give feedback to the generator (using gradients). However, in this region where D(G(z)) is close to 0, the curve of the loss function is very flat and the gradient would be close to 0. Trick: Instead of minimizing the likelihood of the discriminator being correct, maximize the likelihood of the discriminator being wrong

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-50
SLIDE 50

13/38


When the sample is likely fake, we want to give feedback to the generator (using gradients). However, in this region where D(G(z)) is close to 0, the curve of the loss function is very flat and the gradient would be close to 0. Trick: Instead of minimizing the likelihood of the discriminator being correct, maximize the likelihood of the discriminator being wrong. In effect, the objective remains the same but the gradient signal becomes better

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
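To see why, compare the gradients of the two curves with respect to the score D = D(G(z)):

d/dD [log(1 − D)] = −1/(1 − D), which is only about −1 when D ≈ 0, so exactly when the discriminator confidently rejects a generated sample the generator receives a weak, nearly constant gradient

d/dD [− log D] = −1/D, whose magnitude grows without bound as D → 0, so the modified objective min_φ [− log Dθ(Gφ(z))] (equivalently max_φ log Dθ(Gφ(z))) gives the generator a strong gradient precisely where it is doing badly, while both objectives still push Dθ(Gφ(z)) towards 1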

slide-51
SLIDE 51

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-52
SLIDE 52

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-53
SLIDE 53

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

3:

for k steps do

7:

end for

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-54
SLIDE 54

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

3:

for k steps do

4:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

7:

end for

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-55
SLIDE 55

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

3:

for k steps do

4:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

5:

  • Sample minibatch of m examples {x(1), .., x(m)} from data generating distribution pdata(x)

7:

end for

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-56
SLIDE 56

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

3:

for k steps do

4:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

5:

  • Sample minibatch of m examples {x(1), .., x(m)} from data generating distribution pdata(x)

6:

  • Update the discriminator by ascending its stochastic gradient:

∇θ (1/m) Σ_{i=1}^{m} [ log Dθ(x^(i)) + log(1 − Dθ(Gφ(z^(i)))) ]

7:

end for

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-57
SLIDE 57

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

3:

for k steps do

4:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

5:

  • Sample minibatch of m examples {x(1), .., x(m)} from data generating distribution pdata(x)

6:

  • Update the discriminator by ascending its stochastic gradient:

∇θ (1/m) Σ_{i=1}^{m} [ log Dθ(x^(i)) + log(1 − Dθ(Gφ(z^(i)))) ]

7:

end for

8:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-58
SLIDE 58

14/38

With that we are now ready to see the full algorithm for training GANs

1: procedure GAN Training 2:

for number of training iterations do

3:

for k steps do

4:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

5:

  • Sample minibatch of m examples {x(1), .., x(m)} from data generating distribution pdata(x)

6:

  • Update the discriminator by ascending its stochastic gradient:

∇θ (1/m) Σ_{i=1}^{m} [ log Dθ(x^(i)) + log(1 − Dθ(Gφ(z^(i)))) ]

7:

end for

8:

  • Sample minibatch of m noise samples {z(1), .., z(m)} from noise prior pg(z)

9:

  • Update the generator by ascending its stochastic gradient

∇φ (1/m) Σ_{i=1}^{m} log Dθ(Gφ(z^(i)))

10:

end for

11: end procedure

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
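A minimal, runnable sketch of this alternating procedure (reusing the illustrative G, D and latent_dim from the earlier sketch, with the non-saturating generator objective discussed in Module 23.1; the optimizer, learning rate and k = 1 are assumptions, not prescriptions from the lecture):

    import torch

    # Reuses the illustrative G, D, latent_dim from the earlier sketch;
    # `loader` is assumed to yield minibatches of real images of shape (batch, 28*28)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    k = 1   # number of discriminator steps per generator step

    for x_real in loader:
        for _ in range(k):
            # Step 1: gradient ascent on the discriminator objective (descent on its negative)
            z = torch.randn(x_real.size(0), latent_dim)
            d_loss = -(torch.log(D(x_real)).mean()
                       + torch.log(1 - D(G(z).detach())).mean())
            opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Step 2: generator update with the non-saturating objective, i.e. maximize log D(G(z))
        z = torch.randn(x_real.size(0), latent_dim)
        g_loss = -torch.log(D(G(z))).mean()
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()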

slide-59
SLIDE 59

15/38

Module 23.2: Generative Adversarial Networks - Architecture

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-60
SLIDE 60

16/38

We will now look at one of the popular neural networks used for the generator and discriminator (Deep Convolutional GANs)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-61
SLIDE 61

16/38

We will now look at one of the popular neural networks used for the generator and discriminator (Deep Convolutional GANs) For discriminator, any CNN based classifier with 1 class (real) at the output can be used (e.g. VGG, ResNet, etc.)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-62
SLIDE 62

16/38

We will now look at one of the popular neural networks used for the generator and discriminator (Deep Convolutional GANs) For discriminator, any CNN based classifier with 1 class (real) at the output can be used (e.g. VGG, ResNet, etc.)

Figure: Generator (Radford et al., 2015) (left) and discriminator (Yeh et al., 2016) (right) used in DCGAN

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-63
SLIDE 63

17/38

Architecture guidelines for stable Deep Convolutional GANs Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-64
SLIDE 64

17/38

Architecture guidelines for stable Deep Convolutional GANs Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator). Use batchnorm in both the generator and the discriminator.

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-65
SLIDE 65

17/38

Architecture guidelines for stable Deep Convolutional GANs Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator). Use batchnorm in both the generator and the discriminator. Remove fully connected hidden layers for deeper architectures.

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-66
SLIDE 66

17/38

Architecture guidelines for stable Deep Convolutional GANs Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator). Use batchnorm in both the generator and the discriminator. Remove fully connected hidden layers for deeper architectures. Use ReLU activation in generator for all layers except for the output, which uses tanh.

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-67
SLIDE 67

17/38

Architecture guidelines for stable Deep Convolutional GANs Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator). Use batchnorm in both the generator and the discriminator. Remove fully connected hidden layers for deeper architectures. Use ReLU activation in generator for all layers except for the output, which uses tanh. Use LeakyReLU activation in the discriminator for all layers

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
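A compact sketch that follows these guidelines (strided convolutions instead of pooling in the discriminator, fractionally-strided convolutions in the generator, batchnorm, no fully connected hidden layers, ReLU/tanh in the generator, LeakyReLU in the discriminator); the channel counts and the 3-channel 64×64 resolution are illustrative assumptions, not the exact DCGAN configuration:

    import torch.nn as nn

    # Generator: z of shape (N, 100, 1, 1) -> 3x64x64 image, fractionally-strided convolutions
    netG = nn.Sequential(
        nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),   # 4x4
        nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),   # 8x8
        nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(True),   # 16x16
        nn.ConvTranspose2d(64, 32, 4, 2, 1),   nn.BatchNorm2d(32),  nn.ReLU(True),   # 32x32
        nn.ConvTranspose2d(32, 3, 4, 2, 1),    nn.Tanh(),                            # 64x64, in [-1, 1]
    )

    # Discriminator: 3x64x64 image -> real/fake score, strided convolutions, no pooling,
    # no fully connected hidden layers, LeakyReLU in every layer
    netD = nn.Sequential(
        nn.Conv2d(3, 32, 4, 2, 1),    nn.LeakyReLU(0.2, True),                        # 32x32
        nn.Conv2d(32, 64, 4, 2, 1),   nn.BatchNorm2d(64),  nn.LeakyReLU(0.2, True),   # 16x16
        nn.Conv2d(64, 128, 4, 2, 1),  nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),   # 8x8
        nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),   # 4x4
        nn.Conv2d(256, 1, 4, 1, 0),   nn.Sigmoid(),                                   # 1x1 score
    )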

slide-68
SLIDE 68

18/38

Module 23.3: Generative Adversarial Networks - The Math Behind it

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-69
SLIDE 69

19/38

We will now delve a bit deeper into the objective function used by GANs and see what it implies

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-70
SLIDE 70

19/38

We will now delve a bit deeper into the objective function used by GANs and see what it implies. Suppose we denote the true data distribution by pdata(x) and the distribution of the data generated by the model as pG(x)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-71
SLIDE 71

19/38

We will now delve a bit deeper into the objective function used by GANs and see what it implies. Suppose we denote the true data distribution by pdata(x) and the distribution of the data generated by the model as pG(x)

What do we wish should happen at the end of training?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-72
SLIDE 72

19/38

We will now delve a bit deeper into the objective function used by GANs and see what it implies. Suppose we denote the true data distribution by pdata(x) and the distribution of the data generated by the model as pG(x)

What do we wish should happen at the end of training? pG(x) = pdata(x)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-73
SLIDE 73

19/38

We will now delve a bit deeper into the objective function used by GANs and see what it implies. Suppose we denote the true data distribution by pdata(x) and the distribution of the data generated by the model as pG(x)

What do we wish should happen at the end of training? pG(x) = pdata(x) Can we prove this formally even though the model is not explicitly computing this density?

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-74
SLIDE 74

19/38

We will now delve a bit deeper into the objective function used by GANs and see what it implies. Suppose we denote the true data distribution by pdata(x) and the distribution of the data generated by the model as pG(x)

What do we wish should happen at the end of training? pG(x) = pdata(x) Can we prove this formally even though the model is not explicitly computing this density? We will try to prove this over the next few slides

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-75
SLIDE 75

20/38

Theorem: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if and only if pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-76
SLIDE 76

20/38

Theorem: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if and only if pG = pdata

is equivalent to

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-77
SLIDE 77

20/38

Theorem: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if and only if pG = pdata

is equivalent to

Theorem:
1. If pG = pdata then the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved, and

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-78
SLIDE 78

20/38

Theorem: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if and only if pG = pdata

is equivalent to

Theorem:
1. If pG = pdata then the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved, and
2. The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-79
SLIDE 79

21/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-80
SLIDE 80

21/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-81
SLIDE 81

21/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata
(b) Find the value of V(D, G) for other values of the generator, i.e., for any pG such that pG ≠ pdata

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-82
SLIDE 82

21/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata
(b) Find the value of V(D, G) for other values of the generator, i.e., for any pG such that pG ≠ pdata
(c) Show that a < b ∀ pG ≠ pdata (and hence the minimum V(D, G) is achieved when pG = pdata)

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-83
SLIDE 83

21/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata
(b) Find the value of V(D, G) for other values of the generator, i.e., for any pG such that pG ≠ pdata
(c) Show that a < b ∀ pG ≠ pdata (and hence the minimum V(D, G) is achieved when pG = pdata)

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata
Show that when V(D, G) is minimum then pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23


slide-85
SLIDE 85

22/38

First let us look at the objective function again

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-86
SLIDE 86

22/38

First let us look at the objective function again

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

We will expand it to its integral form

min_φ max_θ ∫_x pdata(x) log Dθ(x) dx + ∫_z p(z) log(1 − Dθ(Gφ(z))) dz

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-87
SLIDE 87

22/38

First let us look at the objective function again

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

We will expand it to its integral form

min_φ max_θ ∫_x pdata(x) log Dθ(x) dx + ∫_z p(z) log(1 − Dθ(Gφ(z))) dz

Let pG(X) denote the distribution of the X’s generated by the generator, and since X is a function of z we can replace the second integral as shown below

min_φ max_θ ∫_x pdata(x) log Dθ(x) dx + ∫_x pG(x) log(1 − Dθ(x)) dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-88
SLIDE 88

22/38

First let us look at the objective function again

min_φ max_θ [E_{x∼pdata} log Dθ(x) + E_{z∼p(z)} log(1 − Dθ(Gφ(z)))]

We will expand it to its integral form

min_φ max_θ ∫_x pdata(x) log Dθ(x) dx + ∫_z p(z) log(1 − Dθ(Gφ(z))) dz

Let pG(X) denote the distribution of the X’s generated by the generator, and since X is a function of z we can replace the second integral as shown below

min_φ max_θ ∫_x pdata(x) log Dθ(x) dx + ∫_x pG(x) log(1 − Dθ(x)) dx

The above replacement follows from the law of the unconscious statistician

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
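Stated in the notation of these slides (with f(·) = log(1 − Dθ(·))), the identity being used is:

E_{z∼p(z)}[f(Gφ(z))] = ∫_z p(z) f(Gφ(z)) dz = ∫_x pG(x) f(x) dx

i.e., an expectation of a function of Gφ(z) can be written either under the noise density p(z) or under the induced density pG(x) of X = Gφ(z), without ever having to write pG(x) down explicitly.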

slide-89
SLIDE 89

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-90
SLIDE 90

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-91
SLIDE 91

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-92
SLIDE 92

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x. To find the optima we will take the derivative of the term inside the integral w.r.t. D and set it to zero

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-93
SLIDE 93

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x. To find the optima we will take the derivative of the term inside the integral w.r.t. D and set it to zero

d/d(Dθ(x)) [pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))] = 0

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-94
SLIDE 94

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x. To find the optima we will take the derivative of the term inside the integral w.r.t. D and set it to zero

d/d(Dθ(x)) [pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))] = 0
pdata(x) · (1/Dθ(x)) + pG(x) · (1/(1 − Dθ(x))) · (−1) = 0

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-95
SLIDE 95

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x. To find the optima we will take the derivative of the term inside the integral w.r.t. D and set it to zero

d/d(Dθ(x)) [pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))] = 0
pdata(x) · (1/Dθ(x)) + pG(x) · (1/(1 − Dθ(x))) · (−1) = 0
pdata(x)/Dθ(x) = pG(x)/(1 − Dθ(x))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-96
SLIDE 96

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x. To find the optima we will take the derivative of the term inside the integral w.r.t. D and set it to zero

d/d(Dθ(x)) [pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))] = 0
pdata(x) · (1/Dθ(x)) + pG(x) · (1/(1 − Dθ(x))) · (−1) = 0
pdata(x)/Dθ(x) = pG(x)/(1 − Dθ(x))
pdata(x)(1 − Dθ(x)) = pG(x) Dθ(x)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-97
SLIDE 97

23/38

Okay, so our revised objective is given by

min_φ max_θ ∫_x (pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))) dx

Given a generator G, we are interested in finding the optimum discriminator D which will maximize the above objective function. The above objective will be maximized when the quantity inside the integral is maximized ∀x. To find the optima we will take the derivative of the term inside the integral w.r.t. D and set it to zero

d/d(Dθ(x)) [pdata(x) log Dθ(x) + pG(x) log(1 − Dθ(x))] = 0
pdata(x) · (1/Dθ(x)) + pG(x) · (1/(1 − Dθ(x))) · (−1) = 0
pdata(x)/Dθ(x) = pG(x)/(1 − Dθ(x))
pdata(x)(1 − Dθ(x)) = pG(x) Dθ(x)
Dθ(x) = pdata(x) / (pG(x) + pdata(x))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
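A quick numerical sanity check of this closed form (an illustrative sketch; the densities 0.8 and 0.2 are arbitrary values assumed for one fixed x): the inner term a log D + b log(1 − D) should peak at D = a/(a + b).

    import numpy as np

    a, b = 0.8, 0.2                          # pretend pdata(x) = 0.8 and pG(x) = 0.2 at some fixed x
    D = np.linspace(1e-6, 1 - 1e-6, 100001)  # candidate discriminator outputs in (0, 1)
    objective = a * np.log(D) + b * np.log(1 - D)
    print(D[np.argmax(objective)])           # ~0.8, found numerically
    print(a / (a + b))                       # 0.8, the closed-form optimum derived above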

slide-98
SLIDE 98

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-99
SLIDE 99

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....”

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-100
SLIDE 100

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....” So let us substitute pG = pdata into D∗_G(x) and see what happens to the loss functions

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-101
SLIDE 101

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....” So let us substitute pG = pdata into D∗_G(x) and see what happens to the loss functions

D∗_G = pdata / (pdata + pG) = 1/2

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-102
SLIDE 102

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....” So let us substitute pG = pdata into D∗_G(x) and see what happens to the loss functions

D∗_G = pdata / (pdata + pG) = 1/2

V(G, D∗_G) = ∫_x [pdata(x) log D(x) + pG(x) log(1 − D(x))] dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-103
SLIDE 103

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....” So let us substitute pG = pdata into D∗_G(x) and see what happens to the loss functions

D∗_G = pdata / (pdata + pG) = 1/2

V(G, D∗_G) = ∫_x [pdata(x) log D(x) + pG(x) log(1 − D(x))] dx
= ∫_x [pdata(x) log(1/2) + pG(x) log(1 − 1/2)] dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-104
SLIDE 104

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....” So let us substitute pG = pdata into D∗_G(x) and see what happens to the loss functions

D∗_G = pdata / (pdata + pG) = 1/2

V(G, D∗_G) = ∫_x [pdata(x) log D(x) + pG(x) log(1 − D(x))] dx
= ∫_x [pdata(x) log(1/2) + pG(x) log(1 − 1/2)] dx
= − log 2 ∫_x pG(x) dx − log 2 ∫_x pdata(x) dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-105
SLIDE 105

24/38

This means for any given generator

D∗_G(x) = pdata(x) / (pdata(x) + pG(x))

Now the if part of the theorem says “if pG = pdata ....” So let us substitute pG = pdata into D∗_G(x) and see what happens to the loss functions

D∗_G = pdata / (pdata + pG) = 1/2

V(G, D∗_G) = ∫_x [pdata(x) log D(x) + pG(x) log(1 − D(x))] dx
= ∫_x [pdata(x) log(1/2) + pG(x) log(1 − 1/2)] dx
= − log 2 ∫_x pG(x) dx − log 2 ∫_x pdata(x) dx
= −2 log 2 = − log 4

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-106
SLIDE 106

25/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata
(b) Find the value of V(D, G) for other values of the generator, i.e., for any pG such that pG ≠ pdata
(c) Show that a < b ∀ pG ≠ pdata (and hence the minimum V(D, G) is achieved when pG = pdata)

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata
Show that when V(D, G) is minimum then pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23


slide-108
SLIDE 108

26/38

So what we have proved so far is that if the generator is optimal (pG = pdata) the discriminator’s loss value is − log 4

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-109
SLIDE 109

26/38

So what we have proved so far is that if the generator is optimal (pG = pdata) the discriminator’s loss value is − log 4. We still haven’t proved that this is the minimum

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-110
SLIDE 110

26/38

So what we have proved so far is that if the generator is optimal (pG = pdata) the discriminator’s loss value is − log 4. We still haven’t proved that this is the minimum. For example, it is possible that for some pG ≠ pdata, the discriminator’s loss value is lower than − log 4

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-111
SLIDE 111

26/38

So what we have proved so far is that if the generator is optimal (pG = pdata) the discriminator’s loss value is − log 4. We still haven’t proved that this is the minimum. For example, it is possible that for some pG ≠ pdata, the discriminator’s loss value is lower than − log 4. To show that the discriminator achieves its lowest value “if pG = pdata”, we need to show that for all other values of pG the discriminator’s loss value is greater than − log 4

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-112
SLIDE 112

27/38

To show this we will get rid of the assumption that pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-113
SLIDE 113

27/38

To show this we will get rid of the assumption that pG = pdata

C(G) = ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( 1 − pdata(x) / (pG(x) + pdata(x)) ) ] dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-114
SLIDE 114

27/38

To show this we will get rid of the assumption that pG = pdata

C(G) = ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( 1 − pdata(x) / (pG(x) + pdata(x)) ) ] dx
= ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( pG(x) / (pG(x) + pdata(x)) ) + (log 2 − log 2)(pdata(x) + pG(x)) ] dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-115
SLIDE 115

27/38

To show this we will get rid of the assumption that pG = pdata

C(G) = ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( 1 − pdata(x) / (pG(x) + pdata(x)) ) ] dx
= ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( pG(x) / (pG(x) + pdata(x)) ) + (log 2 − log 2)(pdata(x) + pG(x)) ] dx
= − log 2 ∫_x (pG(x) + pdata(x)) dx + ∫_x [ pdata(x)( log 2 + log( pdata(x) / (pG(x) + pdata(x)) ) ) + pG(x)( log 2 + log( pG(x) / (pG(x) + pdata(x)) ) ) ] dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-116
SLIDE 116

27/38

To show this we will get rid of the assumption that pG = pdata

C(G) = ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( 1 − pdata(x) / (pG(x) + pdata(x)) ) ] dx
= ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( pG(x) / (pG(x) + pdata(x)) ) + (log 2 − log 2)(pdata(x) + pG(x)) ] dx
= − log 2 ∫_x (pG(x) + pdata(x)) dx + ∫_x [ pdata(x)( log 2 + log( pdata(x) / (pG(x) + pdata(x)) ) ) + pG(x)( log 2 + log( pG(x) / (pG(x) + pdata(x)) ) ) ] dx
= − log 2 · (1 + 1) + ∫_x [ pdata(x) log( pdata(x) / ((pG(x) + pdata(x))/2) ) + pG(x) log( pG(x) / ((pG(x) + pdata(x))/2) ) ] dx

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-117
SLIDE 117

27/38

To show this we will get rid of the assumption that pG = pdata

C(G) = ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( 1 − pdata(x) / (pG(x) + pdata(x)) ) ] dx
= ∫_x [ pdata(x) log( pdata(x) / (pG(x) + pdata(x)) ) + pG(x) log( pG(x) / (pG(x) + pdata(x)) ) + (log 2 − log 2)(pdata(x) + pG(x)) ] dx
= − log 2 ∫_x (pG(x) + pdata(x)) dx + ∫_x [ pdata(x)( log 2 + log( pdata(x) / (pG(x) + pdata(x)) ) ) + pG(x)( log 2 + log( pG(x) / (pG(x) + pdata(x)) ) ) ] dx
= − log 2 · (1 + 1) + ∫_x [ pdata(x) log( pdata(x) / ((pG(x) + pdata(x))/2) ) + pG(x) log( pG(x) / ((pG(x) + pdata(x))/2) ) ] dx
= − log 4 + KL( pdata || (pG + pdata)/2 ) + KL( pG || (pG + pdata)/2 )

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-118
SLIDE 118

28/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata
(b) Find the value of V(D, G) for other values of the generator, i.e., for any pG such that pG ≠ pdata
(c) Show that a < b ∀ pG ≠ pdata (and hence the minimum V(D, G) is achieved when pG = pdata)

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata
Show that when V(D, G) is minimum then pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23


slide-120
SLIDE 120

29/38

Okay, so we have

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-121
SLIDE 121

29/38

Okay, so we have

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

We know that KL divergence is always ≥ 0

∴ C(G) ≥ − log 4

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-122
SLIDE 122

29/38

Okay, so we have

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

We know that KL divergence is always ≥ 0

∴ C(G) ≥ − log 4

Hence the minimum possible value of C(G) is − log 4

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-123
SLIDE 123

29/38

Okay, so we have

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

We know that KL divergence is always ≥ 0

∴ C(G) ≥ − log 4

Hence the minimum possible value of C(G) is − log 4. But this is the value that C(G) achieves when pG = pdata (and this is exactly what we wanted to prove)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-124
SLIDE 124

29/38

Okay, so we have

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

We know that KL divergence is always ≥ 0

∴ C(G) ≥ − log 4

Hence the minimum possible value of C(G) is − log 4. But this is the value that C(G) achieves when pG = pdata (and this is exactly what we wanted to prove). We have, thus, proved the if part of the theorem

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-125
SLIDE 125

30/38

Outline of the Proof

The ‘if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved if pG = pdata
(a) Find the value of V(D, G) when the generator is optimal, i.e., when pG = pdata
(b) Find the value of V(D, G) for other values of the generator, i.e., for any pG such that pG ≠ pdata
(c) Show that a < b ∀ pG ≠ pdata (and hence the minimum V(D, G) is achieved when pG = pdata)

The ‘only if’ part: The global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved only if pG = pdata
Show that when V(D, G) is minimum then pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23


slide-127
SLIDE 127

31/38

Now let’s look at the other part of the theorem: If the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved then pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-128
SLIDE 128

31/38

Now let’s look at the other part of the theorem: If the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved then pG = pdata

We know that

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-129
SLIDE 129

31/38

Now let’s look at the other part of the theorem: If the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved then pG = pdata

We know that

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

If the global minimum is achieved then C(G) = − log 4, which implies that

KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) = 0

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-130
SLIDE 130

31/38

Now let’s look at the other part of the theorem: If the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved then pG = pdata

We know that

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

If the global minimum is achieved then C(G) = − log 4, which implies that

KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) = 0

This will happen only when pG = pdata (you can prove this easily)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-131
SLIDE 131

31/38

Now let’s look at the other part of the theorem: If the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved then pG = pdata

We know that

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

If the global minimum is achieved then C(G) = − log 4, which implies that

KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) = 0

This will happen only when pG = pdata (you can prove this easily)

In fact, KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) is (up to a factor of 2) the Jensen-Shannon divergence between pG and pdata:

KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) = 2 · JSD(pdata || pG)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-132
SLIDE 132

31/38

Now let’s look at the other part of the theorem: If the global minimum of the virtual training criterion C(G) = max_D V(G, D) is achieved then pG = pdata

We know that

C(G) = − log 4 + KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 )

If the global minimum is achieved then C(G) = − log 4, which implies that

KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) = 0

This will happen only when pG = pdata (you can prove this easily)

In fact, KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) is (up to a factor of 2) the Jensen-Shannon divergence between pG and pdata:

KL( pdata || (pdata + pG)/2 ) + KL( pG || (pdata + pG)/2 ) = 2 · JSD(pdata || pG)

which is minimum only when pG = pdata

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
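A small numerical illustration of this last step: for two discrete distributions, KL(p || (p + q)/2) + KL(q || (p + q)/2) is non-negative and vanishes exactly when p = q, which is why C(G) can only reach − log 4 at pG = pdata. The two example distributions below are arbitrary:

    import numpy as np

    def kl(p, q):
        return np.sum(p * np.log(p / q))

    def kl_to_mixture(p, q):                 # KL(p || (p+q)/2) + KL(q || (p+q)/2) = 2 * JSD(p || q)
        m = (p + q) / 2
        return kl(p, m) + kl(q, m)

    p = np.array([0.1, 0.4, 0.5])
    q = np.array([0.3, 0.3, 0.4])
    print(kl_to_mixture(p, q))               # > 0, since p != q
    print(kl_to_mixture(p, p))               # 0.0, attained only when the two distributions match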

slide-133
SLIDE 133

32/38

Module 23.4: Generative Adversarial Networks - Some Cool Stuff and Applications

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-134
SLIDE 134

33/38

In each row the first image was generated by the network by taking a vector z1 as the input and the last image was generated by a vector z2 as the input

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-135
SLIDE 135

33/38

In each row the first image was generated by the network by taking a vector z1 as the input and the last image was generated by a vector z2 as the input. All intermediate images were generated by feeding z’s which were obtained by interpolating z1 and z2 (z = λz1 + (1 − λ)z2)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-136
SLIDE 136

33/38

In each row the first image was generated by the network by taking a vector z1 as the input and the last image was generated by a vector z2 as the input. All intermediate images were generated by feeding z’s which were obtained by interpolating z1 and z2 (z = λz1 + (1 − λ)z2). As we transition from z1 to z2 in the input space there is a corresponding smooth transition in the image space also

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23
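A sketch of how such an interpolation is produced (illustrative; assumes a trained generator G and the latent_dim used in the earlier sketches):

    import torch

    z1 = torch.randn(1, latent_dim)
    z2 = torch.randn(1, latent_dim)
    frames = []
    for lam in torch.linspace(1, 0, steps=8):   # lambda goes from 1 (pure z1) down to 0 (pure z2)
        z = lam * z1 + (1 - lam) * z2           # z = lambda*z1 + (1 - lambda)*z2
        frames.append(G(z))                     # one generated image per interpolated z
    # 'frames' now holds a smooth sequence of images transitioning from z1 to z2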

slide-137
SLIDE 137

34/38

The first 3 images in the first column were generated by feeding some z11, z12, z13 respectively as the input to the generator

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-138
SLIDE 138

34/38

The first 3 images in the first column were generated by feeding some z11, z12, z13 respectively as the input to the generator The fourth image was generated by taking an average of z1 = z11, z12, z13 and feeding it to the generator

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-139
SLIDE 139

34/38

The first 3 images in the first column were generated by feeding some z11, z12, z13 respectively as the input to the generator The fourth image was generated by taking an average of z1 = z11, z12, z13 and feeding it to the generator Similarly we obtain the average vectors z2 and z3 for the 2nd and 3rd columns

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-140
SLIDE 140

34/38

The first 3 images in the first column were generated by feeding some z11, z12, z13 respectively as the input to the generator The fourth image was generated by taking an average of z1 = z11, z12, z13 and feeding it to the generator Similarly we obtain the average vectors z2 and z3 for the 2nd and 3rd columns If we do a simple vector arithmetic on these averaged vectors then we see the corresponding effect in the generated images

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23


slide-142
SLIDE 142

35/38 Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-143
SLIDE 143

36/38

Module 23.5: Bringing it all together (the deep generative summary)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-144
SLIDE 144

37/38

RBMs VAEs AR models GANs Abstraction Yes Yes No No

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-145
SLIDE 145

37/38

RBMs VAEs AR models GANs Abstraction Yes Yes No No Generation Yes Yes Yes Yes

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-146
SLIDE 146

37/38

RBMs VAEs AR models GANs Abstraction Yes Yes No No Generation Yes Yes Yes Yes Compute P(X) Intractable Intractable Tractable No

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-147
SLIDE 147

37/38

RBMs VAEs AR models GANs Abstraction Yes Yes No No Generation Yes Yes Yes Yes Compute P(X) Intractable Intractable Tractable No Sampling MCMC Fast Slow Fast

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-148
SLIDE 148

37/38

RBMs VAEs AR models GANs Abstraction Yes Yes No No Generation Yes Yes Yes Yes Compute P(X) Intractable Intractable Tractable No Sampling MCMC Fast Slow Fast Type of GM Undirected Directed Directed Directed

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-149
SLIDE 149

37/38

RBMs VAEs AR models GANs Abstraction Yes Yes No No Generation Yes Yes Yes Yes Compute P(X) Intractable Intractable Tractable No Sampling MCMC Fast Slow Fast Type of GM Undirected Directed Directed Directed Loss KL-divergence KL-divergence KL-divergence JSD

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-150
SLIDE 150

37/38

                RBMs                    VAEs                    AR models       GANs
Abstraction     Yes                     Yes                     No              No
Generation      Yes                     Yes                     Yes             Yes
Compute P(X)    Intractable             Intractable             Tractable       No
Sampling        MCMC                    Fast                    Slow            Fast
Type of GM      Undirected              Directed                Directed        Directed
Loss            KL-divergence           KL-divergence           KL-divergence   JSD
Assumptions     X independent given z   X independent given z   None            N.A.

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-151
SLIDE 151

37/38

                RBMs                    VAEs                    AR models       GANs
Abstraction     Yes                     Yes                     No              No
Generation      Yes                     Yes                     Yes             Yes
Compute P(X)    Intractable             Intractable             Tractable       No
Sampling        MCMC                    Fast                    Slow            Fast
Type of GM      Undirected              Directed                Directed        Directed
Loss            KL-divergence           KL-divergence           KL-divergence   JSD
Assumptions     X independent given z   X independent given z   None            N.A.
Samples         Bad                     Ok                      Good            Good (best)

Table: Comparison of Generative Models

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-152
SLIDE 152

37/38

                RBMs                    VAEs                    AR models       GANs
Abstraction     Yes                     Yes                     No              No
Generation      Yes                     Yes                     Yes             Yes
Compute P(X)    Intractable             Intractable             Tractable       No
Sampling        MCMC                    Fast                    Slow            Fast
Type of GM      Undirected              Directed                Directed        Directed
Loss            KL-divergence           KL-divergence           KL-divergence   JSD
Assumptions     X independent given z   X independent given z   None            N.A.
Samples         Bad                     Ok                      Good            Good (best)

Table: Comparison of Generative Models Recent works look at combining these methods: e.g. Adversarial Autoencoders (Makhzani 2015), PixelVAE (Gulrajani 2016) and PixelGAN Autoencoders (Makhzani 2017)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23

slide-153
SLIDE 153

38/38

Source: Ian Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 23