
CS7015 (Deep Learning) : Lecture 21
Variational Autoencoders
Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras


1. Let {x_i}_{i=1}^N be the training data. We can think of X as a random variable in R^n; for example, X could be an image and the dimensions of X correspond to the pixels of the image. We are interested in abstraction (i.e., given an X, find the hidden representation z) and also in generation (i.e., given a hidden representation, generate an X). In probabilistic terms, we are interested in P(z|X) and P(X|z). (To be consistent with the literature on VAEs we will use z instead of H and X instead of V.)
(Figures: Abstraction, X → z; Generation, z → X.)

2. Earlier we saw RBMs, where we learnt P(z|X) and P(X|z). Below we list certain characteristics of RBMs:
- Structural assumptions: we assume certain independencies in the Markov network.
- Computational: when training with Gibbs sampling we have to run the Markov chain for many time steps, which is expensive.
- When using Contrastive Divergence, we approximate the expectation by a point estimate.
(Nothing is wrong with the above; we only mention these points to make the reader aware of these characteristics.)
(Figure: an RBM with visible units V ∈ {0,1}^m with biases b_1, ..., b_m, hidden units H ∈ {0,1}^n with biases c_1, ..., c_n, and weights W ∈ R^{m×n}.)
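To make the computational point concrete, here is a minimal sketch (an illustration assuming the binary units and W ∈ R^{m×n} of the figure, not code from the lecture) of the block Gibbs chain that such training has to run:

```python
# A minimal sketch of block Gibbs sampling in an RBM with binary units, as in the
# figure (an illustration, not code from the lecture). The point is the cost: every
# gradient estimate requires running such a back-and-forth chain for many steps.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_chain(v0, W, b, c, k=100, seed=0):
    """Run k steps of block Gibbs sampling starting from a visible vector v0 in {0,1}^m."""
    rng = np.random.default_rng(seed)
    v, h = v0, None
    for _ in range(k):
        # P(h_j = 1 | v) = sigmoid((W^T v + c)_j), sampled for all hidden units at once
        h = (rng.random(c.shape) < sigmoid(W.T @ v + c)).astype(float)
        # P(v_i = 1 | h) = sigmoid((W h + b)_i), sampled for all visible units at once
        v = (rng.random(b.shape) < sigmoid(W @ h + b)).astype(float)
    return v, h
```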

3. We now return to our goals.
Goal 1: Learn a distribution over the latent variables, Q(z|X).
Goal 2: Learn a distribution over the visible variables, P(X|z).
VAEs use a neural network based encoder for Goal 1 and a neural network based decoder for Goal 2, where θ denotes the parameters of the encoder network and φ the parameters of the decoder network. We will look at the encoder first.
(Figure: Data X → Encoder Q_θ(z|X) → z → Decoder P_φ(X|z) → Reconstruction X̂.)

4. Encoder: What do we mean when we say we want to learn a distribution? We mean that we want to learn the parameters of the distribution. But what are the parameters of Q(z|X)? Well, that depends on our modeling assumption! In VAEs we assume that the latent variables come from a standard normal distribution N(0, I), and the job of the encoder is then to predict the parameters μ and Σ of the distribution Q_θ(z|X), where X ∈ R^n, μ ∈ R^m and Σ ∈ R^{m×m}.
(Figure: X → Q_θ(z|X) → μ, Σ.)
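As a concrete illustration, here is a minimal sketch of such an encoder (an assumption, not the lecture's code; the layer sizes are hypothetical and, as is common in practice, Σ is taken to be diagonal and predicted through its log-variance):

```python
# A minimal sketch of the encoder Q_theta(z|X) (an assumption, not the lecture's code).
# It maps X in R^n to the parameters (mu, Sigma) of a Gaussian over z in R^m; here
# Sigma is assumed diagonal and predicted via its log-variance, a common simplification.
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n=784, hidden=400, m=20):   # hypothetical sizes
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, m)       # predicts mu in R^m
        self.logvar = nn.Linear(hidden, m)   # predicts log of the diagonal of Sigma

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)
```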

5. Now what about the decoder? The job of the decoder is to predict a probability distribution over X: P(X|z). Once again we will assume a certain form for this distribution. For example, if we want to predict 28 x 28 pixels and each pixel belongs to R (i.e., X ∈ R^784), then what would be a suitable family for P(X|z)? We could assume that P(X|z) is a Gaussian distribution with unit variance. The job of the decoder f would then be to predict the mean of this distribution as f_φ(z).
(Figure: x_i → Q_θ(z|X) → μ, Σ → sample z → P_φ(X|z) → reconstruction x̂_i.)
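A matching sketch of the decoder (again an assumption, mirroring the hypothetical encoder above); under the unit-variance Gaussian assumption it only has to output the mean f_φ(z):

```python
# A matching sketch of the decoder P_phi(X|z) (again an assumption). Under the
# unit-variance Gaussian assumption it only has to output the mean f_phi(z) in R^n.
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, m=20, hidden=400, n=784):   # hypothetical sizes
        super().__init__()
        self.net = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, n))

    def forward(self, z):
        return self.net(z)   # f_phi(z), the mean of the Gaussian P_phi(X|z)
```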

6. What would be the objective function of the decoder? For any given training sample x_i it should maximize P(x_i), given by
P(x_i) = ∫ P(z) P(x_i|z) dz.
In practice, with z drawn from the encoder's distribution, this leads to minimizing the per-sample loss
l_i(θ, φ) = -E_{z∼Q_θ(z|x_i)}[log P_φ(x_i|z)].
(As usual, we take the log for numerical stability.)
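Since P(X|z) was assumed to be a Gaussian with unit variance and mean f_φ(z), the term inside this expectation has a simple form (a standard computation, stated here for clarity): up to an additive constant it is just the squared reconstruction error.

```latex
-\log P_\phi(x_i \mid z)
  = -\log \mathcal{N}\!\left(x_i \mid f_\phi(z), I\right)
  = \tfrac{1}{2}\,\lVert x_i - f_\phi(z) \rVert^2 + \tfrac{n}{2}\log(2\pi)
```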

7. This is the loss function for one data point, l_i(θ, φ), and we will just sum over all the data points to get the total loss
L(θ, φ) = ∑_{i=1}^{N} l_i(θ, φ).
In addition, we also want a constraint on the distribution over the latent variables. Specifically, we had assumed P(z) to be N(0, I) and we want Q(z|X) to be as close to P(z) as possible. Thus, we modify the loss function such that
l_i(θ, φ) = -E_{z∼Q_θ(z|x_i)}[log P_φ(x_i|z)] + KL(Q_θ(z|x_i) || P(z)),
where the KL divergence captures the difference (or distance) between the two distributions.
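Both terms of l_i(θ, φ) are computable: the expectation can be estimated from samples of Q_θ(z|x_i), and the KL term has a standard closed form (quoted here, not derived on the slide) when Q_θ(z|x_i) = N(μ, Σ) with diagonal Σ = diag(σ_1², ..., σ_m²) and P(z) = N(0, I):

```latex
\mathrm{KL}\big(\mathcal{N}(\mu, \Sigma)\,\|\,\mathcal{N}(0, I)\big)
  = \frac{1}{2} \sum_{j=1}^{m} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)
```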

8. The second term in the loss function can actually be thought of as a regularizer. It ensures that the encoder does not cheat by mapping each x_i to a different point (a normal distribution with very low variance) in the Euclidean space. In other words, in the absence of the regularizer, the encoder could learn a unique mapping for each x_i and the decoder could then decode from this unique mapping. Even with high variance in the samples from the distribution, we want the decoder to be able to reconstruct the original data very well (a motivation similar to that of adding noise). To summarize, for each data point we predict a distribution such that, with high probability, a sample from this distribution should be able to reconstruct the original data point. But why do we choose a normal distribution? Isn't it too simplistic to assume that z follows a normal distribution?
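Putting the pieces together, here is a sketch of how l_i(θ, φ) could be computed in one forward pass with the hypothetical Encoder and Decoder above (an assumption, not the lecture's code); it uses a single sample from Q_θ(z|x_i) to estimate the expectation:

```python
# A sketch of one forward pass computing l_i(theta, phi) with the hypothetical
# Encoder/Decoder above (an assumption, not the lecture's code). It uses a single
# sample z from Q_theta(z|x_i), the squared-error form of -log P_phi(x_i|z), and
# the closed-form KL term for a diagonal Gaussian versus N(0, I).
import torch

def loss_i(x, encoder, decoder):
    mu, logvar = encoder(x)                     # parameters of Q_theta(z|x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)        # one sample z ~ N(mu, diag(std^2))
    x_hat = decoder(z)                          # f_phi(z)
    recon = 0.5 * torch.sum((x - x_hat) ** 2)   # -log P_phi(x|z) up to a constant
    kl = 0.5 * torch.sum(std ** 2 + mu ** 2 - 1.0 - logvar)
    return recon + kl
```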

9. Isn't it a very strong assumption that P(z) ∼ N(0, I)? For example, in the 2-dimensional case, how can we be sure that P(z) is a normal distribution and not any other distribution? The key insight here is that any distribution in d dimensions can be generated by the following steps.
Step 1: Start with a set of d variables that are normally distributed (that's exactly what we are assuming for P(z)).
Step 2: Map these variables through a sufficiently complex function (that's exactly what the first few layers of the decoder can do).

10. In particular, note that in the adjoining example, if z is 2-D and normally distributed, then
f(z) = z/10 + z/||z||
is roughly ring shaped (giving us the distribution in the bottom figure). A non-linear neural network, such as the one we use for the decoder, could learn such a complex mapping from z to f_φ(z) using its parameters φ; the initial layers of a non-linear decoder could learn their weights such that the output is f_φ(z). This argument suggests that even if we start with normally distributed variables, the initial layers of the decoder could learn a complex transformation of these variables, say f_φ(z), if required. The objective function of the decoder will ensure that an appropriate transformation of z is learnt to reconstruct X.
(Figure: samples of z ∼ N(0, I) in 2-D (top) and the ring-shaped distribution of f(z) (bottom).)
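A small numerical illustration of this example (not code from the lecture): 2-D Gaussian samples pushed through f(z) = z/10 + z/||z|| concentrate near a ring.

```python
# Gaussian samples mapped through f(z) = z/10 + z/||z|| land near a ring.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((10000, 2))                            # z ~ N(0, I) in 2-D
f = z / 10.0 + z / np.linalg.norm(z, axis=1, keepdims=True)    # f(z) = z/10 + z/||z||

radii = np.linalg.norm(f, axis=1)
print(radii.mean(), radii.std())   # radii cluster around 1, i.e. a ring shape
```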

11. Module 21.3: Variational Autoencoders (the graphical model perspective)

12. Here we can think of z and X as random variables. We are then interested in the joint probability distribution P(X, z), which factorizes as P(X, z) = P(z) P(X|z). This factorization is natural because we can imagine that the latent variables are fixed first and then the visible variables are drawn based on the latent variables. For example, if we want to draw a digit, we could first fix the latent variables (the digit, size, angle, thickness, position and so on) and then draw a digit which corresponds to these latent variables. And of course, unlike RBMs, this is a directed graphical model.
(Figure: the directed graphical model z → X, inside a plate replicated N times.)
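This factorization also says how new data could be generated once the model is trained: draw z from the prior, then decode. A sketch (an assumption, reusing the hypothetical Decoder above):

```python
# Ancestral sampling from P(X, z) = P(z) P(X|z): first z ~ N(0, I), then take the
# mean f_phi(z) of P_phi(X|z) as the generated sample (a sketch, not the lecture's code).
import torch

def generate(decoder, m=20, num_samples=16):
    z = torch.randn(num_samples, m)     # z ~ P(z) = N(0, I)
    with torch.no_grad():
        x_mean = decoder(z)             # f_phi(z), the mean of P_phi(X|z)
    return x_mean
```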

13. Now, at inference time, we are given an X (the observed variable) and we are interested in finding the most likely assignments of the latent variables z which would have resulted in this observation. Mathematically, we want to find
P(z|X) = P(X|z) P(z) / P(X).
This is hard to compute because the denominator P(X) is intractable:
P(X) = ∫ P(X|z) P(z) dz = ∫∫...∫ P(X|z_1, z_2, ..., z_n) P(z_1, z_2, ..., z_n) dz_1 ... dz_n.
In RBMs, we had a similar integral which we approximated using Gibbs sampling. VAEs, on the other hand, cast this into an optimization problem and learn the parameters of the optimization problem.

14. Specifically, in VAEs we assume that, instead of the intractable P(z|X), the posterior distribution is given by Q_θ(z|X). Further, we assume that Q_θ(z|X) is a Gaussian whose parameters are determined by a neural network: μ, Σ = g_θ(X). The parameters of the distribution are thus determined by the parameters θ of a neural network, and our job then is to learn the parameters of this neural network.

15. But what is the objective function for this neural network? Well, we want the proposed distribution Q_θ(z|X) to be as close as possible to the true distribution P(z|X). We can capture this using the following objective function:
minimize KL(Q_θ(z|X) || P(z|X)).
What are the parameters of the objective function?
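For reference, the quantity being minimized is, by the standard definition of KL divergence,

```latex
\mathrm{KL}\big(Q_\theta(z \mid X)\,\|\,P(z \mid X)\big)
  = \mathbb{E}_{z \sim Q_\theta(z \mid X)}\big[\log Q_\theta(z \mid X) - \log P(z \mid X)\big]
```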
