Advanced Machine Learning: Variational Auto-encoders
Amit Sethi, EE, IITB
Objectives
- Learn how VAEs help in sampling from a data distribution
- Write the objective function of a VAE
- Derive how the VAE objective is adapted for SGD
VAE setup
- We are interested in maximizing the data likelihood
  P(X) = ∫ P(X|z; θ) P(z) dz
- Let the mean of P(X|z; θ) be modeled by a network f(z; θ)
- Further, let us assume that
  P(X|z; θ) = N(X | f(z; θ), σ²I)
Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
We do not care about distribution of z
- Latent variable z is drawn from a standard normal, z ~ N(0, I)
- It may represent many different variations of the data

[Figure: graphical model in which z ~ N(0, I) generates X through decoder parameters θ]
Example of a variable transformation
X = g(z) = z/10 + z/‖z‖
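The transform above can be checked numerically. A minimal numpy sketch (only g comes from the slide; the sample count and seed are illustrative):

```python
import numpy as np

def g(z):
    """Doersch's example transform: z/10 + z/||z||.
    It maps a 2-D standard-normal blob onto a ring of radius about 1."""
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / 10.0 + z / norm

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))       # z ~ N(0, I)
radii = np.linalg.norm(g(z), axis=-1)    # equals 1 + ||z||/10 exactly
print(radii.mean(), radii.std())         # concentrated near radius 1
```

Even though z is a featureless Gaussian blob, the transformed samples trace an annulus; this is the sense in which a learned g(z; θ) can turn a simple latent distribution into a complicated data distribution.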
Because of the Gaussian assumption, the most obvious variation may not be the most likely

- Although the "2" on the right is a better choice as a variation of the one on the left, the one in the middle is more likely under the Gaussian assumption
Sampling z from standard normal is problematic
- It may give samples of z that are unlikely to have produced X
- Can we sample z itself intelligently?
- Enter Q(z|X), with which we can compute, e.g., E_{z~Q}[P(X|z)]
- All we need to do is reduce the KL divergence between Q(z|X) and P(z|X)
- Hence, a variational method
VAE Objective Setup
D[Q(z) ‖ P(z|X)] = E_{z~Q}[log Q(z) − log P(z|X)]
= E_{z~Q}[log Q(z) − log P(X|z) − log P(z)] + log P(X)

Rearranging some terms:

log P(X) − D[Q(z) ‖ P(z|X)] = E_{z~Q}[log P(X|z)] − D[Q(z) ‖ P(z)]

Introducing the dependency of Q on X:

log P(X) − D[Q(z|X) ‖ P(z|X)] = E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]
Optimizing the RHS
- Q encodes X into z; P(X|z) decodes z back into X
- Assume Q(z|X) on the LHS is a high-capacity neural network
- For: E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]
- Assume: Q(z|X) = N(z | μ(X; θ), Σ(X; θ))
- Then the KL divergence has the closed form:
  D[N(μ(X), Σ(X)) ‖ N(0, I)] = 1/2 [tr(Σ(X)) + μ(X)^T μ(X) − k − log det(Σ(X))]
- In SGD, the objective becomes maximizing:
  E_{X~D}[log P(X) − D[Q(z|X) ‖ P(z|X)]] = E_{X~D}[E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]]
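The closed-form KL term is easy to verify for a diagonal covariance. A minimal numpy sketch, assuming Σ(X) = diag(σ²) as is typical in VAE implementations (function and variable names are illustrative):

```python
import numpy as np

def kl_to_std_normal(mu, sigma2):
    """D[N(mu, diag(sigma2)) || N(0, I)] from the slide:
    0.5 * (tr(Sigma) + mu^T mu - k - log det(Sigma))."""
    k = mu.shape[-1]
    return 0.5 * (sigma2.sum() + mu @ mu - k - np.log(sigma2).sum())

# The penalty vanishes exactly when the encoder outputs the prior itself...
print(kl_to_std_normal(np.zeros(3), np.ones(3)))   # 0.0
# ...and grows as the posterior mean moves away from the origin.
print(kl_to_std_normal(np.array([1.0, 0.0, 0.0]), np.ones(3)))  # 0.5
```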
Moving the gradient inside the expectation
- We need to compute the gradient of:
  log P(X|z) − D[Q(z|X) ‖ P(z)]
- The first term does not depend on the parameters of Q, but E_{z~Q}[log P(X|z)] does!
- So, we need to generate z that are plausible, i.e., decodable
The actual model that resists backpropagation
- Cannot backpropagate through a stochastic unit
The actual model that resists backpropagation
- E_{X~D}[E_{e~N(0,I)}[log P(X | z = μ(X) + Σ^{1/2}(X)·e)] − D[Q(z|X) ‖ P(z)]]
- Now we can backpropagate end-to-end, because the expectations are no longer taken with respect to distributions that depend on the model parameters

Reparameterization trick: e ~ N(0, I) and z = μ(X) + Σ^{1/2}(X)·e. This works if Q(z|X) and P(z) are continuous.
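The reparameterized sampler can be written down directly. A minimal numpy sketch (names illustrative; in a real VAE, mu and sigma2 would be encoder outputs):

```python
import numpy as np

def sample_z(mu, sigma2, rng):
    """Reparameterization trick: z = mu + Sigma^{1/2} * e with e ~ N(0, I).
    All randomness lives in e, so gradients can flow through mu and sigma2."""
    e = rng.standard_normal(mu.shape)
    return mu + np.sqrt(sigma2) * e

rng = np.random.default_rng(0)
mu, sigma2 = np.array([2.0, -1.0]), np.array([0.25, 4.0])
zs = np.stack([sample_z(mu, sigma2, rng) for _ in range(20000)])
print(zs.mean(axis=0))  # close to mu
print(zs.var(axis=0))   # close to sigma2
```

The key design point: the same function written as `z = rng.normal(mu, np.sqrt(sigma2))` would produce identically distributed samples, but the stochastic node would then sit on the path from mu and sigma2 to z, blocking backpropagation.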
Test-time sampling is straightforward
- The encoder pathway, including the multiplication and addition, is discarded; sampling z ~ N(0, I) and decoding is enough
- To estimate the likelihood of a test sample, generate samples of z and then average P(X|z)
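The likelihood estimate can be demonstrated with a toy 1-D decoder. Everything below is an illustrative assumption: the identity decoder f(z) = z is chosen only because it makes the exact answer P(X) = N(X; 0, 1 + σ²) known in closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, x = 0.5, 0.7

# Toy decoder f(z) = z, so P(X|z) = N(X; z, sigma2) and, analytically,
# P(X) = integral P(X|z) P(z) dz = N(X; 0, 1 + sigma2).
z = rng.standard_normal(100_000)  # z ~ N(0, 1), the prior
px_given_z = np.exp(-(x - z) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
mc_estimate = px_given_z.mean()   # Monte Carlo E_{z~P(z)}[P(X|z)]
exact = np.exp(-x ** 2 / (2 * (1 + sigma2))) / np.sqrt(2 * np.pi * (1 + sigma2))
print(mc_estimate, exact)         # the two agree closely
```

In high dimensions, most prior samples z contribute almost nothing to this average; drawing z from Q(z|X) with importance weights is far more efficient, which is exactly the motivation for the encoder.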
Conditional VAE
Sample results for a MNIST VAE
Sample results for a MNIST CVAE
Advanced Machine Learning Generative Adversarial Networks
Amit Sethi, EE, IITB
Objectives
- Articulate how using a discriminator helps a generator
- Write the objective function of GAN
- Write the training algorithm for GAN
GAN trains two networks together
- GAN objective:
  min_G max_D V(D, G) = E_{x~p_x(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

[Figure: z feeds the generator G to produce x'; real x and generated x' feed the discriminator D, which outputs y]
Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014
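The value function V(D, G) can be estimated from discriminator outputs on real and generated minibatches. A minimal numpy sketch (names and batch size are illustrative):

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Minibatch estimate of V(D, G) =
    E_x[log D(x)] + E_z[log(1 - D(G(z)))],
    where d_real = D(x) on real data and d_fake = D(G(z)) on fakes."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator at chance (D = 0.5 everywhere) yields V = -log 4,
# which is also the value of the game at the global optimum.
d = np.full(8, 0.5)
print(gan_value(d, d))  # -log(4), about -1.386
```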
At the solution, the transformed distribution from z will emulate px(x)
- As training progresses, the distributions of the transformed noise and the data become indistinguishable

[Figure: snapshots over training steps showing the generator distribution converging to p_x(x) while D flattens to 1/2]
The trick is to allow D to catch up before improving G in each iteration
- For each training iteration:
  - For k steps:
    - Update the discriminator by ascending its stochastic gradient:
      ∇_{θ_D} (1/m) Σ_{i=1}^{m} [log D(x^(i)) + log(1 − D(G(z^(i))))]
  - Update the generator by descending its stochastic gradient:
    ∇_{θ_G} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i))))
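The alternating updates can be run end-to-end on a 1-D toy problem with hand-derived gradients. Everything here, the data distribution N(4, 1), the linear generator, the logistic discriminator, and all hyperparameters, is an illustrative assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Data ~ N(4, 1); generator G(z) = theta + z should learn theta near 4;
# discriminator D(x) = sigmoid(w*x + c).
theta, w, c = 0.0, 0.0, 0.0
lr, m, k = 0.05, 64, 3

for _ in range(2000):
    for _ in range(k):                        # k discriminator steps
        x = 4.0 + rng.standard_normal(m)      # real minibatch
        g = theta + rng.standard_normal(m)    # fake minibatch
        dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
        # ascend (1/m) sum [log D(x_i) + log(1 - D(G(z_i)))]
        w += lr * np.mean((1 - dx) * x - dg * g)
        c += lr * np.mean((1 - dx) - dg)
    g = theta + rng.standard_normal(m)        # one generator step
    dg = sigmoid(w * g + c)
    # descend (1/m) sum log(1 - D(G(z_i))); d/dtheta = -mean(D(G(z)) * w)
    theta += lr * np.mean(dg) * w

print(theta)  # drifts from 0 toward the data mean
```

Note the dynamic the slide describes: until the discriminator learns a nonzero w, the generator's gradient is zero, so letting D catch up first is what gives G a useful signal.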
An optimum exists
- For a fixed generator, the optimal discriminator is
  D(x) = p_x(x) / (p_x(x) + p_G(x))
- Because
  E_{x~p_x(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))] = ∫_x [p_x(x) log D(x) + p_G(x) log(1 − D(x))] dx
- And the maximum of a log y + b log(1 − y) is attained at y = a/(a + b)
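The pointwise optimum y = a/(a + b) can be checked numerically; a minimal sketch (the values of a and b are illustrative):

```python
import numpy as np

def optimal_d(p_data, p_g):
    """For a fixed G, the pointwise maximizer of
    p_data*log(y) + p_g*log(1-y) over y in (0, 1) is p_data/(p_data + p_g)."""
    return p_data / (p_data + p_g)

# Brute-force check at one point x, where p_x(x) = 0.3 and p_G(x) = 0.7:
a, b = 0.3, 0.7
ys = np.linspace(1e-4, 1 - 1e-4, 9999)
vals = a * np.log(ys) + b * np.log(1 - ys)
print(ys[np.argmax(vals)], optimal_d(a, b))  # both close to 0.3
```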
Generatorβs optimization reduces as followsβ¦
- E_{x~p_x(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]
  = E_{x~p_x(x)}[log(p_x(x) / (p_x(x) + p_G(x)))] + E_{x~p_G(x)}[log(p_G(x) / (p_x(x) + p_G(x)))]
  = −log 4 + KL[p_x(x) ‖ (p_x(x) + p_G(x))/2] + KL[p_G(x) ‖ (p_x(x) + p_G(x))/2]
- This assumes that the generator and the discriminator have high enough capacity to model the desired distributions arbitrarily well.
Some sample generations and interpolations of latent vector
DC-GAN was designed to generate better images
- No pooling: use convolutions with stride greater or less than 1 instead
- No fully connected layers
- Heavy use of batchnorm
- Use ReLU in G and leaky ReLU in D, in all but the final layers
- Use tanh in the last layer of G
Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016
While mode collapse isn't evident, there is some underfitting
GAN features can directly be used for classification
GANs allow latent vector βarithmeticβ
Advantages and disadvantages of GAN
Advantages:
- Markov chains are not needed
- Only backprop is used
- No inference is needed
- Models a wide range of functions

Disadvantages:
- No explicit representation of the generator's distribution p_G(x)
- D must be kept in sync with G
- Mode collapse
Conditional GAN introduces another variable (e.g. class)
- Instead of the GAN objective:
  E_{x~p_x(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]
- CGAN uses a modified objective conditioned on y:
  E_{x~p_x(x)}[log D(x|y)] + E_{z~p_z(z)}[log(1 − D(G(z|y)))]
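One simple way to implement the conditioning, used in MLP-style conditional GANs, is to append a one-hot encoding of the label to both the noise fed to G and the sample fed to D. A minimal sketch (function names and dimensions are illustrative):

```python
import numpy as np

def condition(x, y, num_classes):
    """Append a one-hot encoding of label y to each row of x, so the
    downstream network sees (input, class) jointly."""
    one_hot = np.eye(num_classes)[y]       # (batch, num_classes)
    return np.concatenate([x, one_hot], axis=-1)

z = np.random.default_rng(0).standard_normal((4, 100))  # noise batch
y = np.array([0, 3, 3, 9])                              # digit labels
print(condition(z, y, 10).shape)  # (4, 110)
```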
Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014
Conditional GAN introduces another variable (e.g. class)
Each row is conditioned upon one digit label of a CGAN
Interpretable latent variables β InfoGAN
- GAN objective:
  min_G max_D V(D, G) = E_{x~data}[log D(x)] + E_{z~noise}[log(1 − D(G(z)))]
- Introduce extra (supposedly interpretable) latent variables c in addition to z
- InfoGAN objective:
  min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))
- Mutual information:
  I(c; G(z, c)) = H(c) − H(c | G(z, c))
Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.
Variational information maximization to the rescue
- Mutual information:
  I(c; G(z, c)) = H(c) − H(c | G(z, c))
  = E_{x~G(z,c)}[E_{c'~P(c|x)}[log P(c'|x)]] + H(c)
  = E_{x~G(z,c)}[KL(P(·|x) ‖ Q(·|x)) + E_{c'~P(c|x)}[log Q(c'|x)]] + H(c)
  ≥ E_{x~G(z,c)}[E_{c'~P(c|x)}[log Q(c'|x)]] + H(c)
- Further, E_{x~X, y~Y|x}[f(x, y)] = E_{x~X, y~Y|x, x'~X|y}[f(x', y)] under certain conditions, so the bound above equals
  L_I(G, Q) = E_{c~P(c), x~G(z,c)}[log Q(c|x)] + H(c)
- Overall: min_{G,Q} max_D V_I(D, G, Q) = V(D, G) − λ L_I(G, Q)
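The bound L_I ≤ I can be seen on a tiny discrete example: assume a two-symbol code c and a noisy "generator" that flips c with probability 0.1 (all numbers here are illustrative, not from the paper):

```python
import numpy as np

# c ~ Uniform{0, 1}; generated x flips c with probability 0.1,
# so the true posterior is P(c = x | x) = 0.9 by symmetry.
H_c = np.log(2.0)                                   # entropy of the code prior
H_c_given_x = -(0.9 * np.log(0.9) + 0.1 * np.log(0.1))
I = H_c - H_c_given_x                               # true mutual information

def L_I(q_correct):
    """Variational bound with an auxiliary Q that assigns probability
    q_correct to the true code: E[log Q(c|x)] + H(c)."""
    e_log_q = 0.9 * np.log(q_correct) + 0.1 * np.log(1 - q_correct)
    return e_log_q + H_c

print(I, L_I(0.9))   # bound is tight when Q equals the true posterior
print(L_I(0.5))      # any other Q gives a strictly smaller value
```

This is why InfoGAN can maximize L_I with a learned network Q instead of the intractable posterior P(c|x): pushing the bound up pushes the true mutual information up with it.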
InfoGAN visually
Inspiration source: "InfoGAN - Generative Adversarial Networks Part III" by Zak Jost, TowardsDataScience.com
[Figure: z and c feed the generator G to produce x'; real x and generated x' feed the discriminator D, which outputs y; an auxiliary head Q predicts c' from x']

- The figure elements in red were added to InfoGAN on top of a regular GAN
- c can be a mix of class codes and disentangled continuous variables
Results are more interpretable