Advanced Machine Learning: Variational Auto-encoders. Amit Sethi, EE, IITB. PowerPoint PPT Presentation



SLIDE 1

Advanced Machine Learning Variational Auto-encoders

Amit Sethi, EE, IITB

SLIDE 2

Objectives

  • Learn how VAEs help in sampling from a data distribution
  • Write the objective function of a VAE
  • Derive how the VAE objective is adapted for SGD
SLIDE 3

VAE setup

  • We are interested in maximizing the data likelihood

P(X) = ∫ P(X|z; θ) P(z) dz

  • Let P(X|z; θ) be modeled by f(z; θ)
  • Further, let us assume that

P(X|z; θ) = N(X | f(z; θ), σ²I)

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
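The integral above can be approximated by Monte Carlo over the prior: draw z ~ N(0, I), decode, and average P(X|z). A minimal numpy sketch, with a hypothetical affine decoder f(z; θ) = Wz + b standing in for the neural network (W, b, sigma2, and the test point X are made-up values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder f(z; theta): a fixed affine map standing in for a neural net
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])  # maps 2-d z to 3-d X
b = np.array([0.1, -0.2, 0.0])
sigma2 = 0.5                                         # P(X|z) = N(X | f(z), sigma2 * I)

def f(z):
    return z @ W.T + b

def log_gaussian(x, mean, var):
    # log density of an isotropic Gaussian N(x | mean, var * I)
    k = x.shape[-1]
    return -0.5 * (k * np.log(2 * np.pi * var) + np.sum((x - mean) ** 2, axis=-1) / var)

# Monte Carlo estimate of P(X) = E_{z~N(0,I)}[P(X|z)], computed in log space for stability
X = np.array([0.3, -0.1, 0.2])
z = rng.standard_normal((100_000, 2))                # z ~ N(0, I)
log_px_given_z = log_gaussian(X, f(z), sigma2)
m = log_px_given_z.max()
log_px = m + np.log(np.mean(np.exp(log_px_given_z - m)))  # log-mean-exp
print(f"Monte Carlo log P(X) ~ {log_px:.3f}")
```

Because this toy decoder is affine, P(X) is exactly N(b, WWᵀ + σ²I) and the estimate can be checked in closed form; with a real neural decoder no such closed form exists, which is what motivates the variational machinery on the following slides.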

SLIDE 4

We do not care about distribution of z

  • Latent variable z is drawn from a standard normal, z ~ N(0, I)
  • It may represent many different variations of the data

[Figure: graphical model in which z ~ N(0, I) is decoded through parameters θ to produce X]

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 5

Example of a variable transformation

X = g(z) = z/10 + z/‖z‖, with z ~ N(0, I)

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
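A quick numpy check of this transformation (the sample size is an arbitrary choice): a standard 2-D Gaussian is pushed onto a noisy ring of radius roughly 1, since g rescales each sample to unit norm and adds back a tenth of the original vector.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal((5000, 2))        # z ~ N(0, I) in 2-D

# g(z) = z/10 + z/||z||: mostly a projection onto the unit circle,
# plus a small radial perturbation that depends on ||z||
norms = np.linalg.norm(z, axis=1, keepdims=True)
X = z / 10 + z / norms

r = np.linalg.norm(X, axis=1)             # radii of the transformed samples
print(f"radius: mean ~ {r.mean():.3f}, std ~ {r.std():.3f}")
```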

SLIDE 6

Because of Gaussian assumption, the most obvious variation may not be the most likely

  • Although the '2' on the right is a better choice as a variation of the one on the left, the one in the middle is more likely due to the Gaussian assumption

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 7

Sampling z from standard normal is problematic

  • It may give samples of z that are unlikely to have produced X
  • Can we sample z itself intelligently?
  • Enter Q(z|X) to compute, e.g., E_{z~Q}[P(X|z)]
  • All we need to do is reduce the KL divergence between Q(z) and P(z|X)
  • Hence, a variational method

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 8

VAE Objective Setup

D[Q(z) ‖ P(z|X)] = E_{z~Q}[log Q(z) − log P(z|X)]
                 = E_{z~Q}[log Q(z) − log P(X|z) − log P(z)] + log P(X)

Rearranging some terms:

log P(X) − D[Q(z) ‖ P(z|X)] = E_{z~Q}[log P(X|z)] − D[Q(z) ‖ P(z)]

Introducing the dependency of Q on X:

log P(X) − D[Q(z|X) ‖ P(z|X)] = E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
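The identity on this slide can be verified exactly on a toy model where z is discrete and every distribution is a small table (the probability values below are arbitrary):

```python
import numpy as np

# Toy model: z takes 3 values, X is a single fixed observation
Pz = np.array([0.5, 0.3, 0.2])            # prior P(z)
Px_given_z = np.array([0.7, 0.2, 0.05])   # likelihood P(X|z) at the observed X
Qz = np.array([0.6, 0.3, 0.1])            # an arbitrary recognition distribution Q(z|X)

Px = np.sum(Px_given_z * Pz)              # P(X) = sum_z P(X|z) P(z)
Pz_given_X = Px_given_z * Pz / Px         # Bayes rule: P(z|X)

def kl(q, p):
    return np.sum(q * np.log(q / p))

lhs = np.log(Px) - kl(Qz, Pz_given_X)                 # log P(X) - D[Q(z|X) || P(z|X)]
rhs = np.sum(Qz * np.log(Px_given_z)) - kl(Qz, Pz)    # E_Q[log P(X|z)] - D[Q(z|X) || P(z)]
print(lhs, rhs)
```

The right-hand side (the ELBO) equals the left-hand side for any Q, and since the KL term on the left is non-negative, it lower-bounds log P(X).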

SLIDE 9

Optimizing the RHS

  • Q is encoding X into z; P(X|z) is decoding z
  • Assume Q(z|X) on the LHS is a high-capacity NN
  • For: E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]
  • Assume: Q(z|X) = N(z | μ(X; θ), Σ(X; θ))
  • Then the KL divergence is:

D[N(μ(X), Σ(X)) ‖ N(0, I)] = 1/2 [ tr(Σ(X)) + μ(X)ᵀμ(X) − k − log det(Σ(X)) ]

  • In SGD, the objective becomes maximizing:

E_{X~D}[log P(X) − D[Q(z|X) ‖ P(z|X)]] = E_{X~D}[E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]]

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
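The closed-form KL on this slide can be sanity-checked against a Monte Carlo estimate of E_{z~Q}[log Q(z) − log N(z|0, I)] for a diagonal-covariance Gaussian (the μ and Σ values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

k = 3
mu = np.array([0.5, -1.0, 0.2])
var = np.array([0.8, 1.5, 0.5])   # diagonal of Sigma(X), as VAE encoders usually output
Sigma = np.diag(var)

# Closed form from the slide: 1/2 [tr(Sigma) + mu^T mu - k - log det(Sigma)]
kl_closed = 0.5 * (np.trace(Sigma) + mu @ mu - k - np.log(np.linalg.det(Sigma)))

# Monte Carlo: E_{z~N(mu,Sigma)}[log N(z|mu,Sigma) - log N(z|0,I)]
z = mu + rng.standard_normal((200_000, k)) * np.sqrt(var)
log_q = -0.5 * (np.sum((z - mu) ** 2 / var, axis=1)
                + k * np.log(2 * np.pi) + np.sum(np.log(var)))
log_p = -0.5 * (np.sum(z ** 2, axis=1) + k * np.log(2 * np.pi))
kl_mc = np.mean(log_q - log_p)
print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")
```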

SLIDE 10

Moving the gradient inside the expectation

  • We need to compute the gradient of:

log P(X|z) − D[Q(z|X) ‖ P(z)]

  • The first term does not depend on the parameters of Q, but E_{z~Q}[log P(X|z)] does!
  • So, we need to generate z that are plausible, i.e. decodable

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 11

The actual model that resists backpropagation

  • Cannot backpropagate through a stochastic unit

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 12

The actual model that resists backpropagation

  • E_{X~D}[E_{ε~N(0,I)}[log P(X | z = μ(X) + Σ^{1/2}(X)·ε)] − D[Q(z|X) ‖ P(z)]]
  • Now we can backpropagate end-to-end, because the expectations are no longer taken with respect to distributions that depend on the model parameters

Reparameterization trick: ε ~ N(0, I) and z = μ(X) + Σ^{1/2}(X)·ε. This works if Q(z|X) and P(z) are continuous.

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
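The trick is easy to see in 1-D: with z = μ + σε, gradients of E[f(z)] with respect to μ and σ become plain Monte Carlo averages over ε, which does not depend on the parameters. For f(z) = z² the true gradients are 2μ and 2σ, so the estimator can be checked directly (μ, σ, and f here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma = 0.7, 0.5
eps = rng.standard_normal(500_000)    # eps ~ N(0, 1), independent of the parameters
z = mu + sigma * eps                  # reparameterization: z ~ N(mu, sigma^2)

# Objective E[f(z)] with f(z) = z^2; analytically E[z^2] = mu^2 + sigma^2,
# so dE/dmu = 2*mu and dE/dsigma = 2*sigma
grad_mu = np.mean(2 * z)              # chain rule: f'(z) * dz/dmu, with dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)     # f'(z) * dz/dsigma, with dz/dsigma = eps
print(f"grad wrt mu ~ {grad_mu:.3f}, grad wrt sigma ~ {grad_sigma:.3f}")
```

Sampling z directly from N(μ, σ²) gives no path for the gradient to reach μ and σ; the reparameterized form does, which is exactly what the figure on this slide depicts.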

SLIDE 13

Test-time sampling is straightforward

  • The encoder pathway, including the multiplication and addition, is discarded
  • To get an estimate of the likelihood of a test sample, generate z and then compute P(X|z)

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 14

Conditional VAE

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 15

Sample results for a MNIST VAE

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 16

Sample results for a MNIST CVAE

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 17

Advanced Machine Learning Generative Adversarial Networks

Amit Sethi, EE, IITB

SLIDE 18

Objectives

  • Articulate how using a discriminator helps a generator
  • Write the objective function of a GAN
  • Write the training algorithm for a GAN
SLIDE 19

GAN trains two networks together

  • GAN objective:

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

[Figure: z → G → x'; both x' and real x feed the discriminator D, which outputs y]

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 20

At the solution, the transformed distribution from z will emulate px(x)

  • As training progresses, the distributions of the transformed noise and the data become indistinguishable

[Figure: four training steps in which G's output distribution converges to p_x(x) while D flattens toward 1/2]

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 21

The trick is to allow D to catch up before improving G in each iteration

  • For each training iteration:
  • For k steps, update the discriminator by ascending its stochastic gradient:

∇_θD (1/n) Σ_{i=1}^{n} [log D(x^(i)) + log(1 − D(G(z^(i))))]

  • Then update the generator by descending its stochastic gradient:

∇_θG (1/n) Σ_{i=1}^{n} log(1 − D(G(z^(i))))

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 22

An optimum exists

  • For a fixed generator, the optimal discriminator is

D*(x) = p_data(x) / (p_data(x) + p_G(x))

  • Because

E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))] = ∫ [p_data(x) log D(x) + p_G(x) log(1 − D(x))] dx

and the optimum of b log y + c log(1 − y) is y = b / (b + c)

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014
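Both claims on this slide can be checked numerically on a grid, assuming a 1-D example with p_data = N(0, 1) and p_G = N(1, 1) (arbitrary choices for illustration):

```python
import numpy as np

# Two densities on a fine grid
x = np.linspace(-10.0, 11.0, 20001)
dx = x[1] - x[0]
p_data = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)        # data density N(0, 1)
p_g = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)   # generator density N(1, 1)

d_star = p_data / (p_data + p_g)   # optimal discriminator from the slide

# Inner objective at D*: integral of p_data*log(D*) + p_g*log(1 - D*)
value = np.sum((p_data * np.log(d_star) + p_g * np.log(1.0 - d_star)) * dx)

# Check that D* is a maximizer: any perturbation of D lowers the objective
perturbed = np.clip(d_star + 0.05, 1e-12, 1 - 1e-12)
value_perturbed = np.sum((p_data * np.log(perturbed) + p_g * np.log(1.0 - perturbed)) * dx)
print(value, value_perturbed)
```

The value at D* is always at least −log 4 (it equals −log 4 + twice the Jensen-Shannon divergence, which the next slide derives), with equality exactly when p_G matches p_data and D* is 1/2 everywhere.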

SLIDE 23

Generator's optimization reduces as follows...

  • π”½π’š~π‘žπ’š(π’š) log 𝐸 π’š

+ π”½π’œ~π‘žπ’œ π’œ [log(1 βˆ’ 𝐸(𝐻 π’œ )] = π”½π’š~π‘žπ’š(π’š) log π‘žπ’š(π’š) π‘žπ’š π’š + π‘žπ»(π’š) + π”½π’š~π‘žπ‘― π’š log π‘žπ‘―(π’š) π‘žπ’š π’š + π‘žπ»(π’š) = βˆ’ log 4 + 𝐿𝑀 π‘žπ’š(π’š) π‘žπ’š π’š + π‘žπ»(π’š) 2 + 𝐿𝑀 π‘žπ»(π’š) π‘žπ’š π’š + π‘žπ»(π’š) 2

  • This assumes that the generator and the discriminator are

high capacity such that these can model the desired distributions arbitrarily well.

Source: β€œGenerative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014

SLIDE 24

Some sample generations and interpolations of latent vector

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 25

DC-GAN was designed to generate better images

  • No pooling: convolutions with stride greater than 1 (or fractionally-strided convolutions) instead
  • No fully connected layers
  • Heavy use of batchnorm
  • ReLU in G and leaky ReLU in D, in all but the final layers
  • tanh in the last layer of G

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 26

While mode-collapse isn't evident, there is some underfitting

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 27

GAN features can directly be used for classification

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 28

GANs allow latent vector "arithmetic"

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 29

Advantages and disadvantages of GAN

  • Markov chains not

needed

  • Only backprop used
  • No inference needed
  • Models a wide range of

functions

  • No explicit generator
  • Need to sync D with G
  • Mode collapse

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 30

Conditional GAN introduces another variable (e.g. class)

  • Instead of the GAN objective:

E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

  • CGAN uses a modified objective, conditioned on side information y (e.g. a class label):

E_{x~p_data(x)}[log D(x|y)] + E_{z~p_z(z)}[log(1 − D(G(z|y)))]

Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014

SLIDE 31

Conditional GAN introduces another variable (e.g. class)

Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014

SLIDE 32

Each row is conditioned upon one digit label of a CGAN

Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014

SLIDE 33

Interpretable latent variables – InfoGAN

  • GAN objective:

min_G max_D V(D, G) = E_{x~data}[log D(x)] + E_{z~noise}[log(1 − D(G(z)))]

  • Introduce extra (supposedly interpretable) latent variables c in addition to z
  • InfoGAN objective:

min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))

  • Mutual information:

I(c; G(z, c)) = H(c) − H(c | G(z, c))

Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.

SLIDE 34

Variational information maximization to the rescue

  • Mutual information, writing P(c|x) for the true posterior and Q(c|x) for an auxiliary distribution:

I(c; G(z, c)) = H(c) − H(c | G(z, c))
             = E_{x~G(z,c)}[E_{c'~P(c|x)}[log P(c'|x)]] + H(c)
             = E_{x~G(z,c)}[D[P(·|x) ‖ Q(·|x)] + E_{c'~P(c|x)}[log Q(c'|x)]] + H(c)
             ≥ E_{x~G(z,c)}[E_{c'~P(c|x)}[log Q(c'|x)]] + H(c)

  • Further, E_{x~X, y~Y|x}[f(x, y)] = E_{x~X, y~Y|x, x'~X|y}[f(x', y)] under certain conditions, which implies that the previous quantity equals

L_I(G, Q) = E_{c~P(c), x~G(z,c)}[log Q(c|x)] + H(c)

  • Overall:

min_{G,Q} max_D V_I(D, G, Q) = V(D, G) − λ L_I(G, Q)

Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.
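The bound and its tightness can be verified exactly on a discrete toy problem where c and x take small finite ranges and the "generator" is just a conditional table P(x|c) (the probability values below are arbitrary):

```python
import numpy as np

# Discrete toy: code c in {0, 1}, output x in {0, 1, 2}
p_c = np.array([0.5, 0.5])                     # code distribution P(c)
p_x_given_c = np.array([[0.7, 0.2, 0.1],       # a toy "generator" P(x|c)
                        [0.1, 0.3, 0.6]])
p_cx = p_c[:, None] * p_x_given_c              # joint P(c, x)
p_x = p_cx.sum(axis=0)
p_c_given_x = p_cx / p_x                       # true posterior P(c|x)

def entropy(p):
    return -np.sum(p * np.log(p))

mutual_info = entropy(p_c) + entropy(p_x) - entropy(p_cx)   # I(c; x)

def lower_bound(q_c_given_x):
    # L_I = E_{c~P(c), x~P(x|c)}[log Q(c|x)] + H(c)
    return np.sum(p_cx * np.log(q_c_given_x)) + entropy(p_c)

q_uniform = np.full((2, 3), 0.5)   # a deliberately poor auxiliary Q(c|x)
print(lower_bound(q_uniform), lower_bound(p_c_given_x), mutual_info)
```

Any valid Q gives a lower bound on I(c; x), and the bound is tight exactly when Q equals the true posterior; InfoGAN trains a network Q to chase that posterior.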

SLIDE 35

InfoGAN visually

Inspiration source: "InfoGAN - Generative Adversarial Networks Part III" by Zak Jost, TowardsDataScience.com

[Figure: z and c feed G to produce x'; x' and real x feed D, which outputs y; a head Q on top of D predicts c']

  • The figure elements in red were added to InfoGAN on top of a regular GAN
  • c can be a mix of class codes and disentangled continuous variables

SLIDE 36

Results are more interpretable

Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.

SLIDE 37

Results are more interpretable

Source: β€œInfoGAN: Interpretable Representation Learning byInformation Maximizing Generative Adversarial Nets” by Chen et al., in NeurIPS 2016.