CS7015 (Deep Learning) : Lecture 21
Variational Autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras



2/36

Acknowledgments

Tutorial on Variational Autoencoders by Carl Doersch
Blog on Variational Autoencoders by Jaan Altosaar


4/36

Module 21.1: Revisiting Autoencoders


5/36

[Figure: an autoencoder with input X, encoder weights W, hidden representation h, decoder weights W^*, and reconstruction \hat{X}]

h = g(WX + b)
\hat{X} = f(W^* h + c)

Before we start talking about VAEs, let us quickly revisit autoencoders.

An autoencoder contains an encoder which takes the input X and maps it to a hidden representation.

The decoder then takes this hidden representation and tries to reconstruct the input from it as \hat{X}.

The training happens using the following objective function:

\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

where m is the number of training instances \{x_i\}_{i=1}^{m} and each x_i \in \mathbb{R}^n (x_{ij} is thus the j-th dimension of the i-th training instance).
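As a concrete sketch of this objective, a single-hidden-layer autoencoder with a sigmoid encoder g and a linear decoder f can be written in a few lines of numpy. The layer sizes, random toy data, and choice of f and g below are illustrative assumptions, not specified by the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 8, 3              # training instances, input dim, hidden dim

X = rng.normal(size=(m, n))      # toy data, purely illustrative
W = rng.normal(scale=0.1, size=(k, n))    # encoder weights
b = np.zeros(k)                  # encoder bias
Ws = rng.normal(scale=0.1, size=(n, k))   # decoder weights (W^*)
c = np.zeros(n)                  # decoder bias

def g(a):                        # encoder non-linearity (sigmoid)
    return 1.0 / (1.0 + np.exp(-a))

H = g(X @ W.T + b)               # h_i = g(W x_i + b), one row per instance
X_hat = H @ Ws.T + c             # x_hat_i = f(W^* h_i + c), with f = identity

# (1/m) * sum_i sum_j (x_hat_ij - x_ij)^2
loss = np.mean(np.sum((X_hat - X) ** 2, axis=1))
```

Training would then minimize `loss` with respect to W, W^*, b, c by gradient descent; only the forward pass and the objective are shown here.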


6/36

But where's the fun in this? We are taking an input and simply reconstructing it.

Of course, the fun lies in the fact that we are getting a good abstraction of the input.

But RBMs were able to do something more besides abstraction (they were able to do generation).

Let us revisit generation in the context of autoencoders.


7/36

[Figure: the decoder alone, mapping h through W^* to \hat{X} = f(W^* h + c)]

Can we do generation with autoencoders? In other words, once the autoencoder is trained, can I remove the encoder, feed a hidden representation h to the decoder and decode an \hat{X} from it?

In principle, yes! But in practice there is a problem with this approach.

h is a very high dimensional vector and only a few vectors in this space would actually correspond to meaningful latent representations of our input.

So of all the possible values of h, which values should I feed to the decoder? (We had asked a similar question before: slide 67, bullet 5 of lecture 19.)


8/36

Ideally, we should only feed those values of h which are highly likely.

In other words, we are interested in sampling from P(h|X) so that we pick only those h's which have a high probability.

But unlike RBMs, autoencoders do not have such a probabilistic interpretation. They learn a hidden representation h but not a distribution P(h|X).

Similarly, the decoder is also deterministic and does not learn a distribution over X (given an h we can get an X but not P(X|h)).

9/36

We will now look at variational autoencoders, which have the same structure as autoencoders but learn a distribution over the hidden variables.

10/36

Module 21.2: Variational Autoencoders: The Neural Network Perspective


11/36

[Figure: Abstraction] [Figure: Generation]

Let \{x_i\}_{i=1}^{N} be the training data. We can think of X as a random variable in \mathbb{R}^n. For example, X could be an image and the dimensions of X correspond to pixels of the image.

We are interested in learning an abstraction (i.e., given an X, find the hidden representation z). We are also interested in generation (i.e., given a hidden representation, generate an X).

In probabilistic terms, we are interested in P(z|X) and P(X|z) (to be consistent with the literature on VAEs we will use z instead of H and X instead of V).


12/36

[Figure: an RBM with visible units v_1, ..., v_m (V \in \{0,1\}^m, biases b_1, ..., b_m), hidden units h_1, ..., h_n (H \in \{0,1\}^n, biases c_1, ..., c_n), and weights W \in \mathbb{R}^{m \times n}]

Earlier we saw RBMs, where we learnt P(z|X) and P(X|z). Below we list certain characteristics of RBMs.

Structural assumptions: We assume certain independencies in the Markov Network.

Computational: When training with Gibbs Sampling we have to run the Markov Chain for many time steps, which is expensive.

Approximation: When using Contrastive Divergence, we approximate the expectation by a point estimate.

(Nothing is wrong with the above; we just mention these points to make the reader aware of these characteristics.)
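To make the computational point concrete, here is a minimal sketch of block Gibbs sampling in an RBM with binary units. The random weights and chain length are purely illustrative assumptions; in real training these alternating steps must be repeated many times per parameter update, which is what makes Gibbs-based training expensive:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                                   # visible units, hidden units
W = rng.normal(scale=0.1, size=(m, n))        # weights (illustrative values)
b = np.zeros(m)                               # visible biases
c = np.zeros(n)                               # hidden biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

v = rng.integers(0, 2, size=m).astype(float)  # random initial visible state
for _ in range(50):                           # run the Markov chain
    # sample h | v: each h_j ~ Bernoulli(sigmoid(c_j + v . W[:, j]))
    h = (rng.random(n) < sigmoid(c + v @ W)).astype(float)
    # sample v | h: each v_i ~ Bernoulli(sigmoid(b_i + W[i, :] . h))
    v = (rng.random(m) < sigmoid(b + W @ h)).astype(float)
```

After enough steps, (v, h) is approximately a sample from the RBM's joint distribution; contrast this with the single feed-forward pass a VAE needs.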


13/36

[Figure: Data X → Encoder Q_θ(z|X) → z → Decoder P_φ(X|z) → Reconstruction \hat{X}. θ: the parameters of the encoder neural network; φ: the parameters of the decoder neural network]

We now return to our goals.

Goal 1: Learn a distribution over the latent variables (Q(z|X)).

Goal 2: Learn a distribution over the visible variables (P(X|z)).

VAEs use a neural network based encoder for Goal 1 and a neural network based decoder for Goal 2. We will look at the encoder first.


14/36

[Figure: the encoder maps X to the parameters μ and Σ of Q_θ(z|X); X \in \mathbb{R}^n, μ \in \mathbb{R}^m and Σ \in \mathbb{R}^{m \times m}]

Encoder: What do we mean when we say we want to learn a distribution? We mean that we want to learn the parameters of the distribution.

But what are the parameters of Q(z|X)? Well, that depends on our modeling assumption!

In VAEs we assume that the latent variables come from a standard normal distribution N(0, I), and the job of the encoder is then to predict the parameters of this distribution.
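A minimal sketch of such an encoder, under the common simplifying assumption (not stated on this slide) that Σ is diagonal, so the network only has to output a mean vector μ and a per-dimension log-variance. The single-hidden-layer architecture, tanh activation, and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 16, 2   # input dim, hidden width, latent dim (illustrative)

W1 = rng.normal(scale=0.1, size=(d, n))    # shared hidden layer
W_mu = rng.normal(scale=0.1, size=(k, d))  # head predicting mu
W_lv = rng.normal(scale=0.1, size=(k, d))  # head predicting log sigma^2

def encoder(x):
    h = np.tanh(W1 @ x)
    mu = W_mu @ h           # mean of Q(z|x)
    logvar = W_lv @ h       # log of the diagonal entries of Sigma
    return mu, logvar

x = rng.normal(size=n)
mu, logvar = encoder(x)
sigma2 = np.exp(logvar)     # variances are positive by construction
```

Predicting the log-variance (rather than Σ directly) is a standard trick: exponentiation guarantees positive variances without any constraint on the network output.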


15/36

[Figure: x_i → encoder Q_θ(z|X) → (μ, Σ) → sample z → decoder P_φ(X|z) → \hat{x}_i]

Now what about the decoder? The job of the decoder is to predict a probability distribution over X: P(X|z). Once again we will assume a certain form for this distribution.

For example, if we want to predict 28 x 28 pixels and each pixel belongs to \mathbb{R} (i.e., X \in \mathbb{R}^{784}), then what would be a suitable family for P(X|z)?

We could assume that P(X|z) is a Gaussian distribution with unit variance. The job of the decoder f would then be to predict the mean of this distribution as f_φ(z).


16/36

What would be the objective function of the decoder? For any given training sample x_i, it should maximize P(x_i), given by

P(x_i) = \int P(z) P(x_i|z) \, dz

Since integrating over all z is intractable, we instead sample z from Q_θ(z|x_i) and maximize log P_φ(x_i|z); this gives the per-sample loss

l_i(θ) = -\mathbb{E}_{z \sim Q_θ(z|x_i)}[\log P_φ(x_i|z)]

(As usual, we take the log for numerical stability.)
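Under the unit-variance Gaussian assumption from the previous slide, -log P_φ(x|z) reduces, up to an additive constant, to half the squared error between x and the predicted mean f_φ(z), so the expectation can be estimated by Monte Carlo. The toy linear decoder, the stand-in distribution for Q, and all dimensions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 2                          # data dim, latent dim (illustrative)
Wd = rng.normal(scale=0.5, size=(n, k))

def f_phi(z):                        # toy decoder: predicts the mean of P(x|z)
    return Wd @ z

def neg_log_p(x, z):
    # -log N(x; f_phi(z), I) = 0.5*||x - f_phi(z)||^2 + (n/2)*log(2*pi)
    r = x - f_phi(z)
    return 0.5 * (r @ r) + 0.5 * n * np.log(2 * np.pi)

x = rng.normal(size=n)
# Monte Carlo estimate of E_{z~Q}[-log P(x|z)] with S samples
S = 1000
zs = 0.3 * rng.normal(size=(S, k)) + 1.0   # stand-in for samples from Q(z|x)
recon_loss = np.mean([neg_log_p(x, z) for z in zs])
```

Minimizing `recon_loss` is thus equivalent to minimizing an expected squared reconstruction error, which connects this probabilistic objective back to the plain autoencoder loss.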


17/36

This is the loss function for one data point (l_i(θ)), and we will just sum over all the data points to get the total loss L(θ):

L(θ) = \sum_{i=1}^{m} l_i(θ)

In addition, we also want a constraint on the distribution over the latent variables. Specifically, we had assumed P(z) to be N(0, I), and we want Q(z|X) to be as close to P(z) as possible.

Thus, we will modify the loss function such that

l_i(θ, φ) = -\mathbb{E}_{z \sim Q_θ(z|x_i)}[\log P_φ(x_i|z)] + KL(Q_θ(z|x_i) \,||\, P(z))

(KL divergence captures the difference, or distance, between two distributions.)
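For a diagonal-Gaussian Q_θ(z|x_i) = N(μ, diag(σ²)) and P(z) = N(0, I), this KL term has a well-known closed form, KL = ½ Σ_j (μ_j² + σ_j² − 1 − log σ_j²), so no sampling is needed to compute it. A small sketch (the example inputs are hypothetical):

```python
import numpy as np

def kl_diag_gauss_vs_std(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) )
    # = 0.5 * sum_j ( mu_j^2 + sigma_j^2 - 1 - log sigma_j^2 )
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)

# When Q already equals the prior N(0, I), the penalty is zero
kl_zero = kl_diag_gauss_vs_std(np.zeros(3), np.zeros(3))

# A narrow Q centred away from 0 pays a positive penalty
kl_pos = kl_diag_gauss_vs_std(np.array([1.0, -1.0]),
                              np.log(np.array([0.1, 0.1])))
```

The second call illustrates exactly the "cheating" behaviour the next slide discusses: mapping x_i to a sharp, far-from-prior Gaussian is penalized.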


18/36

l_i(θ, φ) = -\mathbb{E}_{z \sim Q_θ(z|x_i)}[\log P_φ(x_i|z)] + KL(Q_θ(z|x_i) \,||\, P(z))

The second term in the loss function can actually be thought of as a regularizer. It ensures that the encoder does not cheat by mapping each x_i to a different point (a normal distribution with very low variance) in the Euclidean space. In other words, in the absence of the regularizer, the encoder could learn a unique mapping for each x_i and the decoder could then decode from this unique mapping.

Even with high variance in samples from the distribution, we want the decoder to be able to reconstruct the original data very well (the motivation is similar to that of adding noise).

To summarize, for each data point we predict a distribution such that, with high probability, a sample from this distribution should be able to reconstruct the original data point.

But why do we choose a normal distribution? Isn't it too simplistic to assume that z follows a normal distribution?


19/36

Isn't it a very strong assumption that P(z) ∼ N(0, I)? For example, in the 2-dimensional case, how can we be sure that P(z) is a normal distribution and not any other distribution?

The key insight here is that any distribution in d dimensions can be generated by the following two steps.

Step 1: Start with a set of d variables that are normally distributed (that's exactly what we are assuming for P(z)).

Step 2: Map these variables through a sufficiently complex function (that's exactly what the first few layers of the decoder can do).


slide-81
SLIDE 81

20/36

li(θ, φ) = −Ez∼Qθ(z|xi)[log Pφ(xi|z)] +KL(Qθ(z|xi)||P(z))

In particular, note that in the adjoining example if z is 2-D and normally distributed then f(z) is roughly ring shaped (giving us the distribution in the bottom figure) f(z) = z 10 + z ||z|| A non-linear neural network, such as the one we use for the decoder, could learn a complex mapping from z to fφ(z) using its parameters φ The initial layers of a non linear decoder could learn their weights such that the output is fφ(z) The above argument suggests that even if we start with normally distributed variables the initial layers of the decoder could learn a complex transformation of these variables say fφ(z) if required The objective function of the decoder will ensure that an appropriate transformation of z is learnt to recon- struct X

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21
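The ring-shaped mapping above is easy to check numerically. A minimal NumPy sketch, using exactly the f(z) from the slide (the sample size is arbitrary):

```python
import numpy as np

def f(z):
    # f(z) = z/10 + z/||z||, applied row-wise to a batch of 2-D samples
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / 10.0 + z / norms

rng = np.random.default_rng(0)
z = rng.standard_normal((100000, 2))   # z ~ N(0, I), 2-D
x = f(z)

# Both terms of f(z) are positive multiples of z, so the radius is
# ||f(z)|| = 1 + ||z||/10: always above 1, concentrated near 1.125.
radii = np.linalg.norm(x, axis=1)
print(radii.min() > 1.0)                # True
print(abs(radii.mean() - 1.125) < 0.01) # True
```

The transformed samples thus lie in a thin annulus, even though z itself is a plain Gaussian blob.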

slide-82
SLIDE 82

21/36

Module 21.3: Variational Autoencoders (the graphical model perspective)


slide-83
SLIDE 83

22/36

[Figure: directed graphical model — latent variable z generating observed variable X, plate over N data points]

Here we can think of z and X as random variables. We are then interested in the joint probability distribution P(X, z), which factorizes as P(X, z) = P(z)P(X|z). This factorization is natural because we can imagine that the latent variables are fixed first and then the visible variables are drawn conditioned on the latent variables. For example, if we want to draw a digit we could first fix the latent variables (the digit, size, angle, thickness, position and so on) and then draw a digit which corresponds to these latent variables. And of course, unlike RBMs, this is a directed graphical model.
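The factorization P(X, z) = P(z)P(X|z) directly dictates how to sample from the model (ancestral sampling): draw z first, then X given z. A toy sketch, where the linear map W, b stands in for a learned conditional and all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: z ~ N(0, I) in 2-D, X | z ~ N(Wz + b, 0.01 * I) in 3-D.
# W and b are arbitrary stand-ins, not anything from the lecture.
W = rng.standard_normal((3, 2))
b = rng.standard_normal(3)

def sample_joint(n):
    # Ancestral sampling: fix the latent variables first, ...
    z = rng.standard_normal((n, 2))
    # ... then draw the visible variables conditioned on them.
    X = z @ W.T + b + 0.1 * rng.standard_normal((n, 3))
    return X, z

X, z = sample_joint(5)
print(X.shape, z.shape)  # (5, 3) (5, 2)
```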

slide-88
SLIDE 88

23/36

[Figure: directed graphical model — z generating X, plate over N data points]

Now, at inference time, we are given an X (observed variable) and we are interested in finding the most likely assignment of the latent variables z which would have resulted in this observation. Mathematically, we want to find

P(z|X) = P(X|z)P(z) / P(X)

This is hard to compute because the denominator P(X) is intractable:

P(X) = ∫ P(X|z)P(z) dz = ∫∫...∫ P(X|z1, z2, ..., zn)P(z1, z2, ..., zn) dz1 ... dzn

In RBMs, we had a similar integral which we approximated using Gibbs sampling. VAEs, on the other hand, cast this into an optimization problem and learn the parameters of the optimization problem.
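To see what the integral for P(X) involves, here is a naive Monte Carlo estimator on a hypothetical 1-D model chosen so that the marginal is known in closed form (z ∼ N(0,1), X|z ∼ N(z,1), hence X ∼ N(0,2)). In high dimensions this naive estimator degrades rapidly, which is why we cannot rely on it in practice:

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 0.5
z = rng.standard_normal(200000)             # samples from the prior P(z)
mc_estimate = normal_pdf(x, z, 1.0).mean()  # (1/S) * sum_s P(x | z_s)
exact = normal_pdf(x, 0.0, 2.0)             # closed-form marginal N(0, 2)

print(abs(mc_estimate - exact) < 0.005)     # True: close in this easy 1-D case
```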

slide-93
SLIDE 93

24/36

[Figure: graphical model with z and X, plate over N data points]

Specifically, in VAEs, instead of the intractable posterior P(z|X), we assume that the posterior distribution is given by Qθ(z|X). Further, we assume that Qθ(z|X) is a Gaussian whose parameters are determined by a neural network: µ, Σ = gθ(X). The parameters of the distribution are thus determined by the parameters θ of a neural network. Our job then is to learn the parameters of this neural network.
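A minimal sketch of what gθ(X) could look like: a one-hidden-layer network that outputs µ and the (diagonal) variances of Σ. All layer sizes, weight values, and the diagonal-covariance choice are illustrative assumptions, not details from the lecture:

```python
import numpy as np

rng = np.random.default_rng(3)

d_in, d_hidden, k = 4, 8, 2          # input dim, hidden dim, latent dim
theta = {
    "W1": rng.standard_normal((d_hidden, d_in)) * 0.1, "b1": np.zeros(d_hidden),
    "W_mu": rng.standard_normal((k, d_hidden)) * 0.1,  "b_mu": np.zeros(k),
    "W_ls": rng.standard_normal((k, d_hidden)) * 0.1,  "b_ls": np.zeros(k),
}

def g_theta(X, theta):
    h = np.tanh(theta["W1"] @ X + theta["b1"])
    mu = theta["W_mu"] @ h + theta["b_mu"]
    # Predict log sigma^2 and exponentiate so the variances stay positive;
    # Sigma is taken to be diagonal, a common choice for VAE encoders.
    sigma2 = np.exp(theta["W_ls"] @ h + theta["b_ls"])
    return mu, sigma2

mu, sigma2 = g_theta(rng.standard_normal(d_in), theta)
print(mu.shape, sigma2.shape)   # (2,) (2,)
print(np.all(sigma2 > 0))       # True
```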

slide-97
SLIDE 97

25/36

[Figure: graphical model with z and X, plate over N data points]

But what is the objective function for this neural network? Well, we want the proposed distribution Qθ(z|X) to be as close as possible to the true distribution P(z|X). We can capture this using the following objective function:

minimize KL(Qθ(z|X) || P(z|X))

What are the parameters of this objective function? They are the parameters of the neural network (we will return to this again).

slide-102
SLIDE 102

26/36

Let us expand the KL divergence term:

D[Qθ(z|X) || P(z|X)] = ∫ Qθ(z|X) log Qθ(z|X) dz − ∫ Qθ(z|X) log P(z|X) dz
                     = E_{z∼Qθ(z|X)}[log Qθ(z|X) − log P(z|X)]

For shorthand we will use EQ = E_{z∼Qθ(z|X)}. Substituting P(z|X) = P(X|z)P(z)/P(X), we get

D[Qθ(z|X) || P(z|X)] = EQ[log Qθ(z|X) − log P(X|z) − log P(z) + log P(X)]
                     = EQ[log Qθ(z|X) − log P(z)] − EQ[log P(X|z)] + log P(X)
                     = D[Qθ(z|X) || P(z)] − EQ[log P(X|z)] + log P(X)

∴ log P(X) = EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] + D[Qθ(z|X) || P(z|X)]
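The identity above can be verified numerically on a small discrete model where every term is computable exactly. The probabilities here are arbitrary toy values:

```python
import numpy as np

# Verify: log P(X) = E_Q[log P(X|z)] - D[Q(z|X)||P(z)] + D[Q(z|X)||P(z|X)]
P_z = np.array([0.3, 0.7])            # prior over 2 latent states
P_X_given_z = np.array([0.9, 0.2])    # P(X = our observation | z)
Q = np.array([0.5, 0.5])              # some (suboptimal) approximate posterior

P_X = np.sum(P_X_given_z * P_z)       # marginal likelihood
post = P_X_given_z * P_z / P_X        # true posterior P(z|X)

elbo = np.sum(Q * np.log(P_X_given_z)) - np.sum(Q * np.log(Q / P_z))
gap = np.sum(Q * np.log(Q / post))    # D[Q || P(z|X)], always >= 0

print(np.isclose(np.log(P_X), elbo + gap))  # True: the identity holds
print(gap >= 0 and elbo <= np.log(P_X))     # True: ELBO is a lower bound
```

Because the gap term is a KL divergence, it is non-negative, which is exactly what makes the first two terms a lower bound on log P(X) on the next slide.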

slide-111
SLIDE 111

27/36

So, we have

log P(X) = EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] + D[Qθ(z|X) || P(z|X)]

Recall that we are interested in maximizing the log likelihood of the data, i.e., log P(X). Since the KL divergence (the red term) is always ≥ 0, we can say that

EQ[log P(X|z)] − D[Qθ(z|X) || P(z)] ≤ log P(X)

The quantity on the LHS is thus a lower bound on the quantity that we want to maximize and is known as the Evidence Lower BOund (ELBO). Maximizing this tractable lower bound serves as a surrogate for maximizing log P(X), and hence our equivalent objective now becomes

maximize EQ[log P(X|z)] − D[Qθ(z|X) || P(z)]

This method of learning the parameters of probability distributions associated with graphical models using optimization (by maximizing the ELBO) is called variational inference. Why is this any easier? It is easy because of certain assumptions that we make, as discussed on the next slide.

slide-118
SLIDE 118

28/36

[Figure: VAE architecture — encoder Qθ(z|X) maps Xi to µ and Σ, a z is sampled, and the decoder Pφ(X|z) produces X̂i]

First we will just reintroduce the parameters in the equation to make things explicit:

maximize EQ[log Pφ(X|z)] − D[Qθ(z|X) || P(z)]

At training time, we are interested in learning the parameters θ and φ which maximize the above for every training example xi ∈ {xi}_{i=1}^N. So our total objective function is

maximize_{θ,φ} Σ_{i=1}^N ( EQ[log Pφ(X = xi|z)] − D[Qθ(z|X = xi) || P(z)] )

We will shorthand P(X = xi) as P(xi). However, we will assume that we are using stochastic gradient descent, so we need to deal with only one of the terms in the summation, corresponding to the current training example.

slide-123
SLIDE 123

29/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

So our objective function w.r.t. one example is

maximize_{θ,φ} EQ[log Pφ(xi|z)] − D[Qθ(z|xi) || P(z)]

Now, first we will do a forward prop through the encoder using Xi and compute µ(X) and Σ(X). The second term in the above objective function is the KL divergence between two normal distributions, N(µ(X), Σ(X)) and N(0, I). With some simple trickery you can show that this term reduces to the following expression (see proof here):

D[N(µ(X), Σ(X)) || N(0, I)] = 1/2 ( tr(Σ(X)) + µ(X)ᵀµ(X) − k − log det(Σ(X)) )

where k is the dimensionality of the latent variables. This term can be computed easily because we have already computed µ(X) and Σ(X) in the forward pass.
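The closed-form KL above is a one-liner for a diagonal Σ, and we can sanity-check it against the standard 1-D formula. A minimal sketch (the diagonal-covariance assumption is mine, matching the usual VAE encoder):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma2):
    # 1/2 * ( tr(Sigma) + mu^T mu - k - log det(Sigma) ), Sigma = diag(sigma2)
    k = len(mu)
    return 0.5 * (np.sum(sigma2) + mu @ mu - k - np.sum(np.log(sigma2)))

# Sanity checks: KL(N(0,I) || N(0,I)) = 0, and KL > 0 elsewhere.
print(kl_to_standard_normal(np.zeros(3), np.ones(3)))        # 0.0
print(kl_to_standard_normal(np.array([1.0, -1.0]),
                            np.array([0.5, 2.0])) > 0)       # True

# Cross-check one 1-D case against the textbook formula
# KL(N(mu, s^2) || N(0, 1)) = -log(s)/1... = -0.5*log(s^2) + (s^2 + mu^2)/2 - 1/2:
mu, s2 = 0.3, 0.8
lhs = kl_to_standard_normal(np.array([mu]), np.array([s2]))
rhs = -0.5 * np.log(s2) + (s2 + mu ** 2) / 2 - 0.5
print(np.isclose(lhs, rhs))                                  # True
```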

slide-130
SLIDE 130

30/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

Now let us look at the other term in the objective function:

EQ[log Pφ(xi|z)]

This is again an expectation and hence intractable (an integral over z). In VAEs, we approximate this with a single z sampled from N(µ(X), Σ(X)). Hence this term is also easy to compute (of course it is a nasty approximation, but we will live with it!).

slide-134
SLIDE 134

31/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

Further, as usual, we need to assume some parametric form for P(X|z). For example, if we assume that P(X|z) is a Gaussian with mean µ(z) and covariance I, then

log P(X = Xi|z) = C − 1/2 ||Xi − µ(z)||²

µ(z) in turn is a function of the parameters of the decoder and can be written as fφ(z):

log P(X = Xi|z) = C − 1/2 ||Xi − fφ(z)||²

Our effective objective function thus becomes

minimize_{θ,φ} Σ_{i=1}^N [ 1/2 ( tr(Σ(Xi)) + µ(Xi)ᵀµ(Xi) − k − log det(Σ(Xi)) ) + ||Xi − fφ(z)||² ]
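Putting the two pieces together, a per-example loss can be sketched as below. The linear "decoder" fφ is a placeholder, the covariance is taken diagonal, and the reconstruction term uses a single sampled z, as on the slide; everything here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
k, d = 2, 3
W_dec = rng.standard_normal((d, k)) * 0.1   # stand-in decoder parameters phi

def f_phi(z):
    return W_dec @ z

def vae_loss(x, mu, sigma2):
    # KL(N(mu, diag(sigma2)) || N(0, I)) in closed form
    kl = 0.5 * (np.sum(sigma2) + mu @ mu - k - np.sum(np.log(sigma2)))
    # one-sample Monte Carlo estimate of -E_Q[log P(x|z)] (up to a constant)
    eps = rng.standard_normal(k)
    z = mu + np.sqrt(sigma2) * eps          # reparameterized sample of z
    recon = np.sum((x - f_phi(z)) ** 2)
    return kl + recon

x = rng.standard_normal(d)
loss = vae_loss(x, mu=np.zeros(k), sigma2=np.ones(k))
print(loss >= 0)   # True: here the KL term is 0 and the recon term is >= 0
```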

slide-139
SLIDE 139

32/36

[Figure: VAE architecture — encoder Qθ(z|X), sampled z, decoder Pφ(X|z)]

The above loss can be easily computed and we can update the parameters θ of the encoder and φ of the decoder using backpropagation. However, there is a catch! The network is not end-to-end differentiable, because the output fφ(z) is not an end-to-end differentiable function of the input X. Why? Because after passing X through the network we simply compute µ(X) and Σ(X) and then sample a z to be fed to the decoder. This sampling step makes the entire process non-deterministic, and hence fφ(z) is not a continuous function of the input X.

slide-145
SLIDE 145

33/36

[Figure: reparameterized VAE — ǫ ∼ N(0, I) enters as an input; z = µ + Σ^{1/2}ǫ feeds the decoder Pφ(X|z)]

VAEs use a neat trick to get around this problem. This is known as the reparameterization trick, wherein we move the process of sampling to an input layer. For the 1-dimensional case, given mean µ and standard deviation σ, we can sample from the corresponding Gaussian by first sampling ǫ ∼ N(0, 1) and then computing z = µ + σ ∗ ǫ. The adjacent figure shows the difference between the original network and the reparameterized network. The randomness in fφ(z) is now associated with ǫ, and not with X or the parameters of the model.
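The 1-D reparameterization is easy to verify: z = µ + σǫ with ǫ ∼ N(0, 1) has exactly the desired mean and standard deviation, while all the randomness lives in the extra input ǫ. A quick check (the particular µ, σ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

mu, sigma = 2.0, 0.5
eps = rng.standard_normal(1000000)  # eps ~ N(0, 1), treated as an input
z = mu + sigma * eps                # reparameterized sample

print(abs(z.mean() - mu) < 0.01)    # True
print(abs(z.std() - sigma) < 0.01)  # True
```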

slide-150
SLIDE 150

34/36

Data: {Xi}_{i=1}^N
Model: X̂ = fφ(µ(X) + Σ(X)^{1/2} ∗ ǫ)
Parameters: θ, φ
Algorithm: Gradient descent
Objective: minimize_{θ,φ} Σ_{i=1}^N [ 1/2 ( tr(Σ(Xi)) + µ(Xi)ᵀµ(Xi) − k − log det(Σ(Xi)) ) + ||Xi − fφ(z)||² ]

With that we are done with the process of training VAEs. Specifically, we have described the data, model, parameters, objective function and learning algorithm. Now what happens at test time? We need to consider both abstraction and generation. In other words, we are interested in computing a z given an X, as well as in generating an X given a z. Let us look at each of these goals.
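The model line above, X̂ = fφ(µ(X) + Σ(X)^{1/2}ǫ), can be sketched as a single forward pass. The linear encoder/decoder weights below are toy stand-ins for the learned parameters; the point is that once ǫ is an input, the pass is deterministic in (X, ǫ) and hence backprop-friendly:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 3, 2
W_enc_mu = rng.standard_normal((k, d)) * 0.1   # part of theta (illustrative)
W_enc_ls = rng.standard_normal((k, d)) * 0.1   # predicts log sigma^2
W_dec = rng.standard_normal((d, k)) * 0.1      # phi (illustrative)

def forward(x, eps):
    mu = W_enc_mu @ x
    sigma = np.exp(0.5 * (W_enc_ls @ x))       # sqrt of the diagonal of Sigma
    z = mu + sigma * eps                       # reparameterized sample
    return W_dec @ z                           # X_hat = f_phi(z)

x = rng.standard_normal(d)
eps = rng.standard_normal(k)
x_hat = forward(x, eps)
print(x_hat.shape)                             # (3,)

# Because eps is an input, the forward pass is deterministic in (x, eps):
print(np.allclose(forward(x, eps), x_hat))     # True
```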

slide-155
SLIDE 155

35/36

[Figure: reparameterized VAE — ǫ ∼ N(0, I); encoder Qθ(z|X) gives µ, Σ; z = µ + Σ^{1/2}ǫ; decoder Pφ(X|z) produces X̂i]

Abstraction: After the model parameters are learned, we feed an X to the encoder. By doing a forward pass using the learned parameters of the model we compute µ(X) and Σ(X). We then sample a z from the distribution N(µ(X), Σ(X)) using the same reparameterization trick. In other words, once we have obtained µ(X) and Σ(X), we first sample ǫ ∼ N(0, I) and then compute z = µ(X) + Σ(X)^{1/2} ∗ ǫ.

slide-159
SLIDE 159

36/36

+ ǫ ∼ N(0, I)

z Xi Qθ(z|X) Σ µ Pφ(X|z) ˆ Xi

Generation After the model parameters are learned we re- move the encoder and feed a z ∼ N(0, I) to the decoder

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

slide-160
SLIDE 160

36/36

+ ǫ ∼ N(0, I)

z Xi Qθ(z|X) Σ µ Pφ(X|z) ˆ Xi

Generation After the model parameters are learned we re- move the encoder and feed a z ∼ N(0, I) to the decoder The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I)

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 21

slide-161
SLIDE 161

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work?


slide-162
SLIDE 162

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I).


slide-163
SLIDE 163

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I). If the model is trained well, then Qθ(z|X) should also become close to N(0, I).
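For a diagonal Gaussian Qθ(z|X) = N(µ, diag(σ²)) and prior p(z) = N(0, I), the divergence D(Qθ(z|X) || p(z)) being minimized has the well-known closed form ½ Σⱼ (σⱼ² + µⱼ² − 1 − log σⱼ²). A minimal numpy sketch (variable names are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the z dimensions."""
    var = sigma ** 2
    return 0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var))

# The KL term is zero exactly when Q already equals the prior N(0, I)
print(kl_to_standard_normal(np.zeros(4), np.ones(4)))  # → 0.0
```

Driving this term to zero is what pulls Qθ(z|X) toward N(0, I), which is why sampling from the prior at generation time is reasonable.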


slide-164
SLIDE 164

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I). If the model is trained well, then Qθ(z|X) should also become close to N(0, I). Hence, if we feed in z ∼ N(0, I), it is almost as if we are feeding in a z ∼ Qθ(z|X), and the decoder was indeed trained to produce a good fφ(z) from such a z.


slide-165
SLIDE 165

36/36


Generation: after the model parameters are learned, we remove the encoder and feed a z ∼ N(0, I) to the decoder. The decoder will then predict fφ(z) and we can draw an X ∼ N(fφ(z), I). Why would this work? Well, we had trained the model to minimize D(Qθ(z|X) || p(z)), where p(z) was N(0, I). If the model is trained well, then Qθ(z|X) should also become close to N(0, I). Hence, if we feed in z ∼ N(0, I), it is almost as if we are feeding in a z ∼ Qθ(z|X), and the decoder was indeed trained to produce a good fφ(z) from such a z. Hence, this will work!
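The generation procedure described above can be sketched end to end. `decoder` here is a hypothetical stand-in for the trained fφ network (a real VAE would use the learned decoder weights; the linear map below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z):
    """Stand-in for the learned decoder f_phi mapping z to the mean of X.
    A trained VAE would run its decoder network here instead."""
    W = np.full((784, z.shape[0]), 0.01)  # illustrative fixed weights
    return W @ z

# Generation: sample z from the prior, decode, then sample X ~ N(f_phi(z), I)
z = rng.standard_normal(2)                       # z ~ N(0, I)
mean_X = decoder(z)                              # f_phi(z)
X = mean_X + rng.standard_normal(mean_X.shape)   # X ~ N(f_phi(z), I)
```

Note that no encoder is involved: generation only needs the prior p(z) and the decoder, which is exactly why training pushed Qθ(z|X) toward that prior.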
