CS7015 (Deep Learning) : Lecture 21
Variational Autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras



2/36

Acknowledgments

Tutorial on Variational Autoencoders by Carl Doersch
Blog on Variational Autoencoders by Jaan Altosaar


4/36

Module 21.1: Revisiting Autoencoders


5/36

[Figure: an autoencoder with input X, encoder weights W, hidden representation h, decoder weights W^*, and reconstruction \hat{X}]

h = g(WX + b)
\hat{X} = f(W^* h + c)

Before we start talking about VAEs, let us quickly revisit autoencoders.

An autoencoder contains an encoder which takes the input X and maps it to a hidden representation.

The decoder then takes this hidden representation and tries to reconstruct the input from it as \hat{X}.

The training happens using the following objective function:

\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

where m is the number of training instances \{x_i\}_{i=1}^{m} and each x_i \in \mathbb{R}^n (x_{ij} is thus the j-th dimension of the i-th training instance).
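As a concrete sketch of this objective, a single-hidden-layer autoencoder with a sigmoid encoder g and a linear decoder f can be written in a few lines of numpy. The layer sizes, random toy data, and choice of f and g below are illustrative assumptions, not specified by the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 8, 3              # training instances, input dim, hidden dim

X = rng.normal(size=(m, n))      # toy data, purely illustrative
W = rng.normal(scale=0.1, size=(k, n))    # encoder weights
b = np.zeros(k)                  # encoder bias
Ws = rng.normal(scale=0.1, size=(n, k))   # decoder weights (W^*)
c = np.zeros(n)                  # decoder bias

def g(a):                        # encoder non-linearity (sigmoid)
    return 1.0 / (1.0 + np.exp(-a))

H = g(X @ W.T + b)               # h_i = g(W x_i + b), one row per instance
X_hat = H @ Ws.T + c             # x_hat_i = f(W^* h_i + c), with f = identity

# (1/m) * sum_i sum_j (x_hat_ij - x_ij)^2
loss = np.mean(np.sum((X_hat - X) ** 2, axis=1))
```

Training would then minimize `loss` with respect to W, W^*, b, c by gradient descent; only the forward pass and the objective are shown here.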


6/36

But where's the fun in this? We are taking an input and simply reconstructing it.

Of course, the fun lies in the fact that we are getting a good abstraction of the input.

But RBMs were able to do something more besides abstraction (they were able to do generation).

Let us revisit generation in the context of autoencoders.


7/36

[Figure: the decoder alone, mapping h through W^* to \hat{X} = f(W^* h + c)]

Can we do generation with autoencoders? In other words, once the autoencoder is trained, can I remove the encoder, feed a hidden representation h to the decoder and decode an \hat{X} from it?

In principle, yes! But in practice there is a problem with this approach.

h is a very high dimensional vector and only a few vectors in this space would actually correspond to meaningful latent representations of our input.

So of all the possible values of h, which values should I feed to the decoder? (We had asked a similar question before: slide 67, bullet 5 of lecture 19.)


8/36

Ideally, we should only feed those values of h which are highly likely.

In other words, we are interested in sampling from P(h|X) so that we pick only those h's which have a high probability.

But unlike RBMs, autoencoders do not have such a probabilistic interpretation. They learn a hidden representation h but not a distribution P(h|X).

Similarly, the decoder is also deterministic and does not learn a distribution over X (given an h we can get an X but not P(X|h)).

9/36

We will now look at variational autoencoders, which have the same structure as autoencoders but learn a distribution over the hidden variables.

10/36

Module 21.2: Variational Autoencoders: The Neural Network Perspective


11/36

[Figure: Abstraction] [Figure: Generation]

Let \{x_i\}_{i=1}^{N} be the training data. We can think of X as a random variable in \mathbb{R}^n. For example, X could be an image and the dimensions of X correspond to pixels of the image.

We are interested in learning an abstraction (i.e., given an X, find the hidden representation z). We are also interested in generation (i.e., given a hidden representation, generate an X).

In probabilistic terms, we are interested in P(z|X) and P(X|z) (to be consistent with the literature on VAEs we will use z instead of H and X instead of V).


12/36

[Figure: an RBM with visible units v_1, ..., v_m (V \in \{0,1\}^m, biases b_1, ..., b_m), hidden units h_1, ..., h_n (H \in \{0,1\}^n, biases c_1, ..., c_n), and weights W \in \mathbb{R}^{m \times n}]

Earlier we saw RBMs, where we learnt P(z|X) and P(X|z). Below we list certain characteristics of RBMs.

Structural assumptions: We assume certain independencies in the Markov Network.

Computational: When training with Gibbs Sampling we have to run the Markov Chain for many time steps, which is expensive.

Approximation: When using Contrastive Divergence, we approximate the expectation by a point estimate.

(Nothing is wrong with the above; we just mention these points to make the reader aware of these characteristics.)
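To make the computational point concrete, here is a minimal sketch of block Gibbs sampling in an RBM with binary units. The random weights and chain length are purely illustrative assumptions; in real training these alternating steps must be repeated many times per parameter update, which is what makes Gibbs-based training expensive:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                                   # visible units, hidden units
W = rng.normal(scale=0.1, size=(m, n))        # weights (illustrative values)
b = np.zeros(m)                               # visible biases
c = np.zeros(n)                               # hidden biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

v = rng.integers(0, 2, size=m).astype(float)  # random initial visible state
for _ in range(50):                           # run the Markov chain
    # sample h | v: each h_j ~ Bernoulli(sigmoid(c_j + v . W[:, j]))
    h = (rng.random(n) < sigmoid(c + v @ W)).astype(float)
    # sample v | h: each v_i ~ Bernoulli(sigmoid(b_i + W[i, :] . h))
    v = (rng.random(m) < sigmoid(b + W @ h)).astype(float)
```

After enough steps, (v, h) is approximately a sample from the RBM's joint distribution; contrast this with the single feed-forward pass a VAE needs.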


13/36

[Figure: Data X → Encoder Q_θ(z|X) → z → Decoder P_φ(X|z) → Reconstruction \hat{X}. θ: the parameters of the encoder neural network; φ: the parameters of the decoder neural network]

We now return to our goals.

Goal 1: Learn a distribution over the latent variables (Q(z|X)).

Goal 2: Learn a distribution over the visible variables (P(X|z)).

VAEs use a neural network based encoder for Goal 1 and a neural network based decoder for Goal 2. We will look at the encoder first.


14/36

[Figure: the encoder maps X to the parameters μ and Σ of Q_θ(z|X); X \in \mathbb{R}^n, μ \in \mathbb{R}^m and Σ \in \mathbb{R}^{m \times m}]

Encoder: What do we mean when we say we want to learn a distribution? We mean that we want to learn the parameters of the distribution.

But what are the parameters of Q(z|X)? Well, that depends on our modeling assumption!

In VAEs we assume that the latent variables come from a standard normal distribution N(0, I), and the job of the encoder is then to predict the parameters of this distribution.
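A minimal sketch of such an encoder, under the common simplifying assumption (not stated on this slide) that Σ is diagonal, so the network only has to output a mean vector μ and a per-dimension log-variance. The single-hidden-layer architecture, tanh activation, and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 16, 2   # input dim, hidden width, latent dim (illustrative)

W1 = rng.normal(scale=0.1, size=(d, n))    # shared hidden layer
W_mu = rng.normal(scale=0.1, size=(k, d))  # head predicting mu
W_lv = rng.normal(scale=0.1, size=(k, d))  # head predicting log sigma^2

def encoder(x):
    h = np.tanh(W1 @ x)
    mu = W_mu @ h           # mean of Q(z|x)
    logvar = W_lv @ h       # log of the diagonal entries of Sigma
    return mu, logvar

x = rng.normal(size=n)
mu, logvar = encoder(x)
sigma2 = np.exp(logvar)     # variances are positive by construction
```

Predicting the log-variance (rather than Σ directly) is a standard trick: exponentiation guarantees positive variances without any constraint on the network output.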


15/36

[Figure: x_i → encoder Q_θ(z|X) → (μ, Σ) → sample z → decoder P_φ(X|z) → \hat{x}_i]

Now what about the decoder? The job of the decoder is to predict a probability distribution over X: P(X|z). Once again we will assume a certain form for this distribution.

For example, if we want to predict 28 x 28 pixels and each pixel belongs to \mathbb{R} (i.e., X \in \mathbb{R}^{784}), then what would be a suitable family for P(X|z)?

We could assume that P(X|z) is a Gaussian distribution with unit variance. The job of the decoder f would then be to predict the mean of this distribution as f_φ(z).


16/36

What would be the objective function of the decoder? For any given training sample x_i, it should maximize P(x_i), given by

P(x_i) = \int P(z) P(x_i|z) \, dz

Since integrating over all z is intractable, we instead sample z from Q_θ(z|x_i) and maximize log P_φ(x_i|z); this gives the per-sample loss

l_i(θ) = -\mathbb{E}_{z \sim Q_θ(z|x_i)}[\log P_φ(x_i|z)]

(As usual, we take the log for numerical stability.)
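Under the unit-variance Gaussian assumption from the previous slide, -log P_φ(x|z) reduces, up to an additive constant, to half the squared error between x and the predicted mean f_φ(z), so the expectation can be estimated by Monte Carlo. The toy linear decoder, the stand-in distribution for Q, and all dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2                          # data dim, latent dim (illustrative)
Wd = rng.normal(scale=0.5, size=(n, k))

def f_phi(z):                        # toy decoder: predicts the mean of P(x|z)
    return Wd @ z

def neg_log_p(x, z):
    # -log N(x; f_phi(z), I) = 0.5*||x - f_phi(z)||^2 + (n/2)*log(2*pi)
    r = x - f_phi(z)
    return 0.5 * (r @ r) + 0.5 * n * np.log(2 * np.pi)

x = rng.normal(size=n)
# Monte Carlo estimate of E_{z~Q}[-log P(x|z)] with S samples
S = 1000
zs = 0.3 * rng.normal(size=(S, k)) + 1.0   # stand-in for samples from Q(z|x)
recon_loss = np.mean([neg_log_p(x, z) for z in zs])
```

Minimizing `recon_loss` is thus equivalent to minimizing an expected squared reconstruction error, which connects this probabilistic objective back to the plain autoencoder loss.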


17/36

This is the loss function for one data point (l_i(θ)), and we will just sum over all the data points to get the total loss L(θ):

L(θ) = \sum_{i=1}^{m} l_i(θ)

In addition, we also want a constraint on the distribution over the latent variables. Specifically, we had assumed P(z) to be N(0, I), and we want Q(z|X) to be as close to P(z) as possible.

Thus, we will modify the loss function such that

l_i(θ, φ) = -\mathbb{E}_{z \sim Q_θ(z|x_i)}[\log P_φ(x_i|z)] + KL(Q_θ(z|x_i) \,||\, P(z))

(KL divergence captures the difference, or distance, between two distributions.)
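For a diagonal-Gaussian Q_θ(z|x_i) = N(μ, diag(σ²)) and P(z) = N(0, I), this KL term has a well-known closed form, KL = ½ Σ_j (μ_j² + σ_j² − 1 − log σ_j²), so no sampling is needed to compute it. A small sketch (the example inputs are hypothetical):

```python
import numpy as np

def kl_diag_gauss_vs_std(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) )
    # = 0.5 * sum_j ( mu_j^2 + sigma_j^2 - 1 - log sigma_j^2 )
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)

# When Q already equals the prior N(0, I), the penalty is zero
kl_zero = kl_diag_gauss_vs_std(np.zeros(3), np.zeros(3))

# A narrow Q centred away from 0 pays a positive penalty
kl_pos = kl_diag_gauss_vs_std(np.array([1.0, -1.0]),
                              np.log(np.array([0.1, 0.1])))
```

The second call illustrates exactly the "cheating" behaviour the next slide discusses: mapping x_i to a sharp, far-from-prior Gaussian is penalized.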


18/36

l_i(θ, φ) = -\mathbb{E}_{z \sim Q_θ(z|x_i)}[\log P_φ(x_i|z)] + KL(Q_θ(z|x_i) \,||\, P(z))

The second term in the loss function can actually be thought of as a regularizer. It ensures that the encoder does not cheat by mapping each x_i to a different point (a normal distribution with very low variance) in the Euclidean space. In other words, in the absence of the regularizer, the encoder could learn a unique mapping for each x_i and the decoder could then decode from this unique mapping.

Even with high variance in samples from the distribution, we want the decoder to be able to reconstruct the original data very well (the motivation is similar to that of adding noise).

To summarize, for each data point we predict a distribution such that, with high probability, a sample from this distribution should be able to reconstruct the original data point.

But why do we choose a normal distribution? Isn't it too simplistic to assume that z follows a normal distribution?


19/36

Isn't it a very strong assumption that P(z) ∼ N(0, I)? For example, in the 2-dimensional case, how can we be sure that P(z) is a normal distribution and not any other distribution?

The key insight here is that any distribution in d dimensions can be generated by the following two steps.

Step 1: Start with a set of d variables that are normally distributed (that's exactly what we are assuming for P(z)).

Step 2: Map these variables through a sufficiently complex function (that's exactly what the first few layers of the decoder can do).


slide-81
SLIDE 81

20/36

li(θ, φ) = −Ez∼Qθ(z|xi)[log Pφ(xi|z)] +KL(Qθ(z|xi)||P(z))

In particular, note that in the adjoining example if z is 2-D and normally distributed then f(z) is roughly ring shaped (giving us the distribution in the bottom figure) f(z) = z 10 + z ||z|| A non-linear neural network, such as the one we use for the decoder, could learn a complex mapping from z to fφ(z) using its parameters φ The initial layers of a non linear decoder could learn their weights such that the output is fφ(z) The above argument suggests that even if we start with normally distributed variables the initial layers of the decoder could learn a complex transformation of these variables say fφ(z) if required The objective function of the decoder will ensure that an appropriate transformation of z is learnt to recon- struct X

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
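The ring-shaped mapping above is easy to check numerically. A minimal NumPy sketch, using exactly the f(z) from the slide (the sample size is arbitrary):

```python
import numpy as np

def f(z):
    # f(z) = z/10 + z/||z||, applied row-wise to a batch of 2-D samples
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / 10.0 + z / norms

rng = np.random.default_rng(0)
z = rng.standard_normal((100000, 2))   # z ~ N(0, I), 2-D
x = f(z)

# Both terms of f(z) are positive multiples of z, so the radius is
# ||f(z)|| = 1 + ||z||/10: always above 1, concentrated near 1.125.
radii = np.linalg.norm(x, axis=1)
print(radii.min() > 1.0)                # True
print(abs(radii.mean() - 1.125) < 0.01) # True
```

The transformed samples thus lie in a thin annulus, even though z itself is a plain Gaussian blob.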

slide-82
SLIDE 82

21/36

Module 21.3: Variational Autoencoders (the graphical model perspective)


slide-83
SLIDE 83

22/36

[Figure: directed graphical model — latent variable z generating observed variable X, plate over N data points]

Here we can think of z and X as random variables. We are then interested in the joint probability distribution P(X, z), which factorizes as P(X, z) = P(z)P(X|z). This factorization is natural because we can imagine that the latent variables are fixed first and then the visible variables are drawn conditioned on the latent variables. For example, if we want to draw a digit we could first fix the latent variables (the digit, size, angle, thickness, position and so on) and then draw a digit which corresponds to these latent variables. And of course, unlike RBMs, this is a directed graphical model.
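The factorization P(X, z) = P(z)P(X|z) directly dictates how to sample from the model (ancestral sampling): draw z first, then X given z. A toy sketch, where the linear map W, b stands in for a learned conditional and all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: z ~ N(0, I) in 2-D, X | z ~ N(Wz + b, 0.01 * I) in 3-D.
# W and b are arbitrary stand-ins, not anything from the lecture.
W = rng.standard_normal((3, 2))
b = rng.standard_normal(3)

def sample_joint(n):
    # Ancestral sampling: fix the latent variables first, ...
    z = rng.standard_normal((n, 2))
    # ... then draw the visible variables conditioned on them.
    X = z @ W.T + b + 0.1 * rng.standard_normal((n, 3))
    return X, z

X, z = sample_joint(5)
print(X.shape, z.shape)  # (5, 3) (5, 2)
```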

slide-88
SLIDE 88

23/36

[Figure: directed graphical model — z generating X, plate over N data points]

Now, at inference time, we are given an X (observed variable) and we are interested in finding the most likely assignment of the latent variables z which would have resulted in this observation. Mathematically, we want to find

P(z|X) = P(X|z)P(z) / P(X)

This is hard to compute because the denominator P(X) is intractable:

P(X) = ∫ P(X|z)P(z) dz = ∫∫...∫ P(X|z1, z2, ..., zn)P(z1, z2, ..., zn) dz1 ... dzn

In RBMs, we had a similar integral which we approximated using Gibbs sampling. VAEs, on the other hand, cast this into an optimization problem and learn the parameters of the optimization problem.
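To see what the integral for P(X) involves, here is a naive Monte Carlo estimator on a hypothetical 1-D model chosen so that the marginal is known in closed form (z ∼ N(0,1), X|z ∼ N(z,1), hence X ∼ N(0,2)). In high dimensions this naive estimator degrades rapidly, which is why we cannot rely on it in practice:

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 0.5
z = rng.standard_normal(200000)             # samples from the prior P(z)
mc_estimate = normal_pdf(x, z, 1.0).mean()  # (1/S) * sum_s P(x | z_s)
exact = normal_pdf(x, 0.0, 2.0)             # closed-form marginal N(0, 2)

print(abs(mc_estimate - exact) < 0.005)     # True: close in this easy 1-D case
```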

slide-93
SLIDE 93

24/36

[Figure: graphical model with z and X, plate over N data points]

Specifically, in VAEs, instead of the intractable posterior P(z|X), we assume that the posterior distribution is given by Qθ(z|X). Further, we assume that Qθ(z|X) is a Gaussian whose parameters are determined by a neural network: µ, Σ = gθ(X). The parameters of the distribution are thus determined by the parameters θ of a neural network. Our job then is to learn the parameters of this neural network.
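A minimal sketch of what gθ(X) could look like: a one-hidden-layer network that outputs µ and the (diagonal) variances of Σ. All layer sizes, weight values, and the diagonal-covariance choice are illustrative assumptions, not details from the lecture:

```python
import numpy as np

rng = np.random.default_rng(3)

d_in, d_hidden, k = 4, 8, 2          # input dim, hidden dim, latent dim
theta = {
    "W1": rng.standard_normal((d_hidden, d_in)) * 0.1, "b1": np.zeros(d_hidden),
    "W_mu": rng.standard_normal((k, d_hidden)) * 0.1,  "b_mu": np.zeros(k),
    "W_ls": rng.standard_normal((k, d_hidden)) * 0.1,  "b_ls": np.zeros(k),
}

def g_theta(X, theta):
    h = np.tanh(theta["W1"] @ X + theta["b1"])
    mu = theta["W_mu"] @ h + theta["b_mu"]
    # Predict log sigma^2 and exponentiate so the variances stay positive;
    # Sigma is taken to be diagonal, a common choice for VAE encoders.
    sigma2 = np.exp(theta["W_ls"] @ h + theta["b_ls"])
    return mu, sigma2

mu, sigma2 = g_theta(rng.standard_normal(d_in), theta)
print(mu.shape, sigma2.shape)   # (2,) (2,)
print(np.all(sigma2 > 0))       # True
```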

slide-97
SLIDE 97

25/36

[Figure: graphical model with z and X, plate over N data points]

But what is the objective function for this neural network? Well, we want the proposed distribution Qθ(z|X) to be as close as possible to the true distribution P(z|X). We can capture this using the following objective function:

minimize KL(Qθ(z|X) || P(z|X))

What are the parameters of this objective function? They are the parameters of the neural network (we will return to this again).

slide-102
SLIDE 102

26/36

Let us expand the KL divergence term:

D[Qθ(z|X) || P(z|X)] = ∫ Qθ(z|X) log Qθ(z|X) dz − ∫ Qθ(z|X) log P(z|X) dz
                     = E_{z∼Qθ(z|X)}[log Qθ(z|X) − log P(z|X)]

For shorthand we will use EQ = E_{z∼Qθ(z|X)}. Substituting P(z|X) = P(X|z)P(z)/P(X), we get

D[Qθ(z|X) || P(z|X)] = EQ[log Qθ(z|X) − log P(X|z) − log P(z) + log P(X)]
                     = EQ[log Qθ(z|X) − log P(z)] − EQ[log P(X|z)] + log P(X)
                     = D[Qθ(z|X) || P(z)] − EQ[log P(X|z)] + log P(X)

∴ log P(X) = EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] + D[Qθ(z|X) || P(z|X)]
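The identity above can be verified numerically on a small discrete model where every term is computable exactly. The probabilities here are arbitrary toy values:

```python
import numpy as np

# Verify: log P(X) = E_Q[log P(X|z)] - D[Q(z|X)||P(z)] + D[Q(z|X)||P(z|X)]
P_z = np.array([0.3, 0.7])            # prior over 2 latent states
P_X_given_z = np.array([0.9, 0.2])    # P(X = our observation | z)
Q = np.array([0.5, 0.5])              # some (suboptimal) approximate posterior

P_X = np.sum(P_X_given_z * P_z)       # marginal likelihood
post = P_X_given_z * P_z / P_X        # true posterior P(z|X)

elbo = np.sum(Q * np.log(P_X_given_z)) - np.sum(Q * np.log(Q / P_z))
gap = np.sum(Q * np.log(Q / post))    # D[Q || P(z|X)], always >= 0

print(np.isclose(np.log(P_X), elbo + gap))  # True: the identity holds
print(gap >= 0 and elbo <= np.log(P_X))     # True: ELBO is a lower bound
```

Because the gap term is a KL divergence, it is non-negative, which is exactly what makes the first two terms a lower bound on log P(X) on the next slide.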

slide-111
SLIDE 111

27/36

So, we have

log P(X) = EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] + D[Qθ(z|X) || P(z|X)]

Recall that we are interested in maximizing the log likelihood of the data, i.e., log P(X). Since the KL divergence (the red term) is always ≥ 0, we can say that

EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] ≤ log P(X)

The quantity on the LHS is thus a lower bound on the quantity that we want to maximize and is known as the Evidence Lower BOund (ELBO). Maximizing this tractable lower bound serves as a surrogate for maximizing log P(X), and hence our equivalent objective now becomes

maximize EQ[log P(X|z)] − D[Qθ(z|X) || P(z)]

This method of learning the parameters of probability distributions associated with graphical models using optimization (by maximizing the ELBO) is called variational inference. Why is this any easier? It is easy because of certain assumptions that we make, as discussed on the next slide.

slide-118
SLIDE 118

28/36

[Figure: VAE architecture — encoder Qθ(z|X) maps Xi to µ and Σ, a z is sampled, and the decoder Pφ(X|z) produces X̂i]

First we will just reintroduce the parameters in the equation to make things explicit:

maximize EQ[log Pφ(X|z)] − D[Qθ(z|X) || P(z)]

At training time, we are interested in learning the parameters θ and φ which maximize the above for every training example xi ∈ {xi}_{i=1}^N. So our total objective function is

maximize_{θ,φ} Σ_{i=1}^N ( EQ[log Pφ(X = xi|z)] − D[Qθ(z|X = xi) || P(z)] )

We will shorthand P(X = xi) as P(xi). However, we will assume that we are using stochastic gradient descent, so we need to deal with only one of the terms in the summation, corresponding to the current training example.

slide-123
SLIDE 123

29/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

So our objective function w.r.t. one example is

maximize_{θ,φ} EQ[log Pφ(xi|z)] − D[Qθ(z|xi) || P(z)]

Now, first we will do a forward prop through the encoder using Xi and compute µ(X) and Σ(X). The second term in the above objective function is the KL divergence between two normal distributions, N(µ(X), Σ(X)) and N(0, I). With some simple trickery you can show that this term reduces to the following expression (see proof here):

D[N(µ(X), Σ(X)) || N(0, I)] = 1/2 ( tr(Σ(X)) + µ(X)ᵀµ(X) − k − log det(Σ(X)) )

where k is the dimensionality of the latent variables. This term can be computed easily because we have already computed µ(X) and Σ(X) in the forward pass.
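The closed-form KL above is a one-liner for a diagonal Σ, and we can sanity-check it against the standard 1-D formula. A minimal sketch (the diagonal-covariance assumption is mine, matching the usual VAE encoder):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma2):
    # 1/2 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) ), Sigma = diag(sigma2)
    k = len(mu)
    return 0.5 * (np.sum(sigma2) + mu @ mu - k - np.sum(np.log(sigma2)))

# Sanity checks: KL(N(0,I) || N(0,I)) = 0, and KL > 0 elsewhere.
print(kl_to_standard_normal(np.zeros(3), np.ones(3)))        # 0.0
print(kl_to_standard_normal(np.array([1.0, -1.0]),
                            np.array([0.5, 2.0])) > 0)       # True

# Cross-check one 1-D case against the textbook formula
# KL(N(mu, s^2) || N(0, 1)) = -log(s)/1... = -0.5*log(s^2) + (s^2 + mu^2)/2 - 1/2:
mu, s2 = 0.3, 0.8
lhs = kl_to_standard_normal(np.array([mu]), np.array([s2]))
rhs = -0.5 * np.log(s2) + (s2 + mu ** 2) / 2 - 0.5
print(np.isclose(lhs, rhs))                                  # True
```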

slide-130
SLIDE 130

30/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

Now let us look at the other term in the objective function:

EQ[log Pφ(xi|z)]

This is again an expectation and hence intractable (an integral over z). In VAEs, we approximate this with a single z sampled from N(µ(X), Σ(X)). Hence this term is also easy to compute (of course it is a nasty approximation, but we will live with it!).

slide-134
SLIDE 134

31/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

Further, as usual, we need to assume some parametric form for P(X|z). For example, if we assume that P(X|z) is a Gaussian with mean µ(z) and covariance I, then

log P(X = Xi|z) = C − 1/2 ||Xi − µ(z)||²

µ(z) in turn is a function of the parameters of the decoder and can be written as fφ(z):

log P(X = Xi|z) = C − 1/2 ||Xi − fφ(z)||²

Our effective objective function thus becomes

minimize_{θ,φ} Σ_{i=1}^N [ 1/2 ( tr(Σ(Xi)) + µ(Xi)ᵀµ(Xi) − k − log det(Σ(Xi)) ) + ||Xi − fφ(z)||² ]
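Putting the two pieces together, a per-example loss can be sketched as below. The linear "decoder" fφ is a placeholder, the covariance is taken diagonal, and the reconstruction term uses a single sampled z, as on the slide; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 2, 3
W_dec = rng.standard_normal((d, k)) * 0.1   # stand-in decoder parameters phi

def f_phi(z):
    return W_dec @ z

def vae_loss(x, mu, sigma2):
    # KL(N(mu, diag(sigma2)) || N(0, I)) in closed form
    kl = 0.5 * (np.sum(sigma2) + mu @ mu - k - np.sum(np.log(sigma2)))
    # one-sample Monte Carlo estimate of -E_Q[log P(x|z)] (up to a constant)
    eps = rng.standard_normal(k)
    z = mu + np.sqrt(sigma2) * eps          # reparameterized sample of z
    recon = np.sum((x - f_phi(z)) ** 2)
    return kl + recon

x = rng.standard_normal(d)
loss = vae_loss(x, mu=np.zeros(k), sigma2=np.ones(k))
print(loss >= 0)   # True: here the KL term is 0 and the recon term is >= 0
```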

slide-139
SLIDE 139

32/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

The above loss can be easily computed and we can update the parameters θ of the encoder and φ of the decoder using backpropagation. However, there is a catch! The network is not end-to-end differentiable, because the output fφ(z) is not an end-to-end differentiable function of the input X. Why? Because after passing X through the network we simply compute µ(X) and Σ(X) and then sample a z to be fed to the decoder. This sampling step makes the entire process non-deterministic, and hence fφ(z) is not a continuous function of the input X.

slide-145
SLIDE 145

33/36

[Figure: reparameterized VAE — ǫ ∼ N(0, I) enters as an input; z = µ + Σ^{1/2}ǫ feeds the decoder Pφ(X|z)]

VAEs use a neat trick to get around this problem. This is known as the reparameterization trick, wherein we move the process of sampling to an input layer. For the 1-dimensional case, given mean µ and standard deviation σ, we can sample from the corresponding Gaussian by first sampling ǫ ∼ N(0, 1) and then computing z = µ + σ ∗ ǫ. The adjacent figure shows the difference between the original network and the reparameterized network. The randomness in fφ(z) is now associated with ǫ, and not with X or the parameters of the model.
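The 1-D reparameterization is easy to verify: z = µ + σǫ with ǫ ∼ N(0, 1) has exactly the desired mean and standard deviation, while all the randomness lives in the extra input ǫ. A quick check (the particular µ, σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

mu, sigma = 2.0, 0.5
eps = rng.standard_normal(1000000)  # eps ~ N(0, 1), treated as an input
z = mu + sigma * eps                # reparameterized sample

print(abs(z.mean() - mu) < 0.01)    # True
print(abs(z.std() - sigma) < 0.01)  # True
```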

slide-150
SLIDE 150

34/36

Data: {Xi}_{i=1}^N
Model: X̂ = fφ(µ(X) + Σ(X)^{1/2} ∗ ǫ)
Parameters: θ, φ
Algorithm: Gradient descent
Objective: minimize_{θ,φ} Σ_{i=1}^N [ 1/2 ( tr(Σ(Xi)) + µ(Xi)ᵀµ(Xi) − k − log det(Σ(Xi)) ) + ||Xi − fφ(z)||² ]

With that we are done with the process of training VAEs. Specifically, we have described the data, model, parameters, objective function and learning algorithm. Now what happens at test time? We need to consider both abstraction and generation. In other words, we are interested in computing a z given an X, as well as in generating an X given a z. Let us look at each of these goals.
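The model line above, X̂ = fφ(µ(X) + Σ(X)^{1/2}ǫ), can be sketched as a single forward pass. The linear encoder/decoder weights below are toy stand-ins for the learned parameters; the point is that once ǫ is an input, the pass is deterministic in (X, ǫ) and hence backprop-friendly:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 3, 2
W_enc_mu = rng.standard_normal((k, d)) * 0.1   # part of theta (illustrative)
W_enc_ls = rng.standard_normal((k, d)) * 0.1   # predicts log sigma^2
W_dec = rng.standard_normal((d, k)) * 0.1      # phi (illustrative)

def forward(x, eps):
    mu = W_enc_mu @ x
    sigma = np.exp(0.5 * (W_enc_ls @ x))       # sqrt of the diagonal of Sigma
    z = mu + sigma * eps                       # reparameterized sample
    return W_dec @ z                           # X_hat = f_phi(z)

x = rng.standard_normal(d)
eps = rng.standard_normal(k)
x_hat = forward(x, eps)
print(x_hat.shape)                             # (3,)

# Because eps is an input, the forward pass is deterministic in (x, eps):
print(np.allclose(forward(x, eps), x_hat))     # True
```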

slide-155
SLIDE 155

35/36

[Figure: reparameterized VAE — ǫ ∼ N(0, I); encoder Qθ(z|X) gives µ, Σ; z = µ + Σ^{1/2}ǫ; decoder Pφ(X|z) produces X̂i]

Abstraction: After the model parameters are learned, we feed an X to the encoder. By doing a forward pass using the learned parameters of the model we compute µ(X) and Σ(X). We then sample a z from the distribution N(µ(X), Σ(X)) using the same reparameterization trick. In other words, once we have obtained µ(X) and Σ(X), we first sample ǫ ∼ N(0, I) and then compute z = µ(X) + Σ(X)^{1/2} ∗ ǫ.

slide-159
SLIDE 159

36/36

+ ǫ ∼ N(0, I)

z Xi Qθ(z|X) Σ µ Pφ(X|z) ˆ Xi

Generation After the model parameters are learned we re- move the encoder and feed a z ∼ N(0, I) to the decoder

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

slide-160
SLIDE 160

36/36

+ ǫ ∼ N(0, I)

z Xi Qθ(z|X) Σ µ Pφ(X|z) ˆ Xi

Generation After the model parameters are learned we re- move the encoder and feed a z ∼ N(0, I) to the decoder The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

slide-161
SLIDE 161

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work?


slide-162
SLIDE 162

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I).


slide-163
SLIDE 163

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I). If the model is trained well, then Qθ(z|X) should also become close to N(0, I).
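For a diagonal Gaussian Qθ(z|X) = N(µ, diag(σ²)) and prior p(z) = N(0, I), the divergence D(Qθ(z|X) || p(z)) being minimized has the well-known closed form ½ Σⱼ (σⱼ² + µⱼ² − 1 − log σⱼ²). A minimal numpy sketch (variable names are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the z dimensions."""
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

# The KL term is zero exactly when Q already equals the prior N(0, I)
print(kl_to_standard_normal(np.zeros(4), np.ones(4)))  # → 0.0
```

Driving this term to zero is what pulls Qθ(z|X) toward N(0, I), which is why sampling from the prior at generation time is reasonable.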


slide-164
SLIDE 164

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I). If the model is trained well, then Qθ(z|X) should also become close to N(0, I). Hence, if we feed in z ∼ N(0, I), it is almost as if we are feeding in a z ∼ Qθ(z|X), and the decoder was indeed trained to produce a good fφ(z) from such a z.


slide-165
SLIDE 165

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I). If the model is trained well, then Qθ(z|X) should also become close to N(0, I). Hence, if we feed in z ∼ N(0, I), it is almost as if we are feeding in a z ∼ Qθ(z|X), and the decoder was indeed trained to produce a good fφ(z) from such a z. Hence, this will work!
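The generation procedure described above can be sketched end to end. `decoder` here is a hypothetical stand-in for the trained fφ network (a real VAE would use the learned decoder weights; the linear map below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    """Stand-in for the learned decoder f_phi mapping z to the mean of X.
    A trained VAE would run its decoder network here instead."""
    W = np.full((784, z.shape[0]), 0.01)  # illustrative fixed weights
    return W @ z

# Generation: sample z from the prior, decode, then sample X ~ N(f_phi(z), I)
z = rng.standard_normal(2)                       # z ~ N(0, I)
mean_X = decoder(z)                              # f_phi(z)
X = mean_X + rng.standard_normal(mean_X.shape)   # X ~ N(f_phi(z), I)
```

Note that no encoder is involved: generation only needs the prior p(z) and the decoder, which is exactly why training pushed Qθ(z|X) toward that prior.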
