Advanced Machine Learning: Variational Auto-encoders. Amit Sethi, EE, IITB. PowerPoint PPT Presentation



SLIDE 1

Advanced Machine Learning Variational Auto-encoders

Amit Sethi, EE, IITB

SLIDE 2

Objectives

  • Learn how VAEs help in sampling from a data distribution
  • Write the objective function of a VAE
  • Derive how the VAE objective is adapted for SGD
SLIDE 3

VAE setup

  • We are interested in maximizing the data likelihood

P(X) = ∫ P(X|z; θ) P(z) dz

  • Let P(X|z; θ) be modeled by f(z; θ)
  • Further, let us assume that

P(X|z; θ) = N(X | f(z; θ), σ²I)

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
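The integral above can be approximated by Monte Carlo over the prior: draw z ~ N(0, I), decode, and average P(X|z). A minimal numpy sketch, with a hypothetical affine decoder f(z; θ) = Wz + b standing in for the neural network (W, b, sigma2, and the test point X are made-up values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder f(z; theta): a fixed affine map standing in for a neural net
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]])  # maps 2-d z to 3-d X
b = np.array([0.1, -0.2, 0.0])
sigma2 = 0.5                                         # P(X|z) = N(X | f(z), sigma2 * I)

def f(z):
    return z @ W.T + b

def log_gaussian(x, mean, var):
    # log density of an isotropic Gaussian N(x | mean, var * I)
    k = x.shape[-1]
    return -0.5 * (k * np.log(2 * np.pi * var) + np.sum((x - mean) ** 2, axis=-1) / var)

# Monte Carlo estimate of P(X) = E_{z~N(0,I)}[P(X|z)], computed in log space for stability
X = np.array([0.3, -0.1, 0.2])
z = rng.standard_normal((100_000, 2))                # z ~ N(0, I)
log_px_given_z = log_gaussian(X, f(z), sigma2)
m = log_px_given_z.max()
log_px = m + np.log(np.mean(np.exp(log_px_given_z - m)))  # log-mean-exp
print(f"Monte Carlo log P(X) ~ {log_px:.3f}")
```

Because this toy decoder is affine, P(X) is exactly N(b, WWᵀ + σ²I) and the estimate can be checked in closed form; with a real neural decoder no such closed form exists, which is what motivates the variational machinery on the following slides.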

SLIDE 4

We do not care about distribution of z

  • Latent variable z is drawn from a standard normal, z ~ N(0, I)
  • It may represent many different variations of the data

[Figure: graphical model in which z ~ N(0, I) is decoded through parameters θ to produce X]

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 5

Example of a variable transformation

X = g(z) = z/10 + z/‖z‖, with z ~ N(0, I)

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
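A quick numpy check of this transformation (the sample size is an arbitrary choice): a standard 2-D Gaussian is pushed onto a noisy ring of radius roughly 1, since g rescales each sample to unit norm and adds back a tenth of the original vector.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal((5000, 2))        # z ~ N(0, I) in 2-D

# g(z) = z/10 + z/||z||: mostly a projection onto the unit circle,
# plus a small radial perturbation that depends on ||z||
norms = np.linalg.norm(z, axis=1, keepdims=True)
X = z / 10 + z / norms

r = np.linalg.norm(X, axis=1)             # radii of the transformed samples
print(f"radius: mean ~ {r.mean():.3f}, std ~ {r.std():.3f}")
```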

SLIDE 6

Because of Gaussian assumption, the most obvious variation may not be the most likely

  • Although the '2' on the right is a better choice as a variation of the one on the left, the one in the middle is more likely due to the Gaussian assumption

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 7

Sampling z from standard normal is problematic

  • It may give samples of z that are unlikely to have produced X
  • Can we sample z itself intelligently?
  • Enter Q(z|X) to compute, e.g., E_{z~Q}[P(X|z)]
  • All we need to do is reduce the KL divergence between Q(z) and P(z|X)
  • Hence, a variational method

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 8

VAE Objective Setup

D[Q(z) ‖ P(z|X)] = E_{z~Q}[log Q(z) − log P(z|X)]
                 = E_{z~Q}[log Q(z) − log P(X|z) − log P(z)] + log P(X)

Rearranging some terms:

log P(X) − D[Q(z) ‖ P(z|X)] = E_{z~Q}[log P(X|z)] − D[Q(z) ‖ P(z)]

Introducing the dependency of Q on X:

log P(X) − D[Q(z|X) ‖ P(z|X)] = E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
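The identity on this slide can be verified exactly on a toy model where z is discrete and every distribution is a small table (the probability values below are arbitrary):

```python
import numpy as np

# Toy model: z takes 3 values, X is a single fixed observation
Pz = np.array([0.5, 0.3, 0.2])            # prior P(z)
Px_given_z = np.array([0.7, 0.2, 0.05])   # likelihood P(X|z) at the observed X
Qz = np.array([0.6, 0.3, 0.1])            # an arbitrary recognition distribution Q(z|X)

Px = np.sum(Px_given_z * Pz)              # P(X) = sum_z P(X|z) P(z)
Pz_given_X = Px_given_z * Pz / Px         # Bayes rule: P(z|X)

def kl(q, p):
    return np.sum(q * np.log(q / p))

lhs = np.log(Px) - kl(Qz, Pz_given_X)                 # log P(X) - D[Q(z|X) || P(z|X)]
rhs = np.sum(Qz * np.log(Px_given_z)) - kl(Qz, Pz)    # E_Q[log P(X|z)] - D[Q(z|X) || P(z)]
print(lhs, rhs)
```

The right-hand side (the ELBO) equals the left-hand side for any Q, and since the KL term on the left is non-negative, it lower-bounds log P(X).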

SLIDE 9

Optimizing the RHS

  • Q is encoding X into z; P(X|z) is decoding z
  • Assume Q(z|X) on the LHS is a high-capacity NN
  • For: E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]
  • Assume: Q(z|X) = N(z | μ(X; θ), Σ(X; θ))
  • Then the KL divergence is:

D[N(μ(X), Σ(X)) ‖ N(0, I)] = 1/2 [ tr(Σ(X)) + μ(X)ᵀμ(X) − k − log det(Σ(X)) ]

  • In SGD, the objective becomes maximizing:

E_{X~D}[log P(X) − D[Q(z|X) ‖ P(z|X)]] = E_{X~D}[E_{z~Q}[log P(X|z)] − D[Q(z|X) ‖ P(z)]]

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
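The closed-form KL on this slide can be sanity-checked against a Monte Carlo estimate of E_{z~Q}[log Q(z) − log N(z|0, I)] for a diagonal-covariance Gaussian (the μ and Σ values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

k = 3
mu = np.array([0.5, -1.0, 0.2])
var = np.array([0.8, 1.5, 0.5])   # diagonal of Sigma(X), as VAE encoders usually output
Sigma = np.diag(var)

# Closed form from the slide: 1/2 [tr(Sigma) + mu^T mu - k - log det(Sigma)]
kl_closed = 0.5 * (np.trace(Sigma) + mu @ mu - k - np.log(np.linalg.det(Sigma)))

# Monte Carlo: E_{z~N(mu,Sigma)}[log N(z|mu,Sigma) - log N(z|0,I)]
z = mu + rng.standard_normal((200_000, k)) * np.sqrt(var)
log_q = -0.5 * (np.sum((z - mu) ** 2 / var, axis=1)
                + k * np.log(2 * np.pi) + np.sum(np.log(var)))
log_p = -0.5 * (np.sum(z ** 2, axis=1) + k * np.log(2 * np.pi))
kl_mc = np.mean(log_q - log_p)
print(f"closed form: {kl_closed:.4f}, Monte Carlo: {kl_mc:.4f}")
```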

SLIDE 10

Moving the gradient inside the expectation

  • We need to compute the gradient of:

log P(X|z) − D[Q(z|X) ‖ P(z)]

  • The first term does not depend on the parameters of Q, but E_{z~Q}[log P(X|z)] does!
  • So, we need to generate z that are plausible, i.e. decodable

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 11

The actual model that resists backpropagation

  • Cannot backpropagate through a stochastic unit

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 12

The actual model that resists backpropagation

  • E_{X~D}[E_{ε~N(0,I)}[log P(X | z = μ(X) + Σ^{1/2}(X)·ε)] − D[Q(z|X) ‖ P(z)]]
  • Now we can backpropagate end-to-end, because the expectations are no longer taken with respect to distributions that depend on the model parameters

Reparameterization trick: ε ~ N(0, I) and z = μ(X) + Σ^{1/2}(X)·ε. This works if Q(z|X) and P(z) are continuous.

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch
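The trick is easy to see in 1-D: with z = μ + σε, gradients of E[f(z)] with respect to μ and σ become plain Monte Carlo averages over ε, which does not depend on the parameters. For f(z) = z² the true gradients are 2μ and 2σ, so the estimator can be checked directly (μ, σ, and f here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

mu, sigma = 0.7, 0.5
eps = rng.standard_normal(500_000)    # eps ~ N(0, 1), independent of the parameters
z = mu + sigma * eps                  # reparameterization: z ~ N(mu, sigma^2)

# Objective E[f(z)] with f(z) = z^2; analytically E[z^2] = mu^2 + sigma^2,
# so dE/dmu = 2*mu and dE/dsigma = 2*sigma
grad_mu = np.mean(2 * z)              # chain rule: f'(z) * dz/dmu, with dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)     # f'(z) * dz/dsigma, with dz/dsigma = eps
print(f"grad wrt mu ~ {grad_mu:.3f}, grad wrt sigma ~ {grad_sigma:.3f}")
```

Sampling z directly from N(μ, σ²) gives no path for the gradient to reach μ and σ; the reparameterized form does, which is exactly what the figure on this slide depicts.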

SLIDE 13

Test-time sampling is straightforward

  • The encoder pathway, including the multiplication and addition, is discarded
  • To get an estimate of the likelihood of a test sample, generate z and then compute P(X|z)

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 14

Conditional VAE

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 15

Sample results for a MNIST VAE

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 16

Sample results for a MNIST CVAE

Source: VAEs by Kingma, Welling, et al.; "Tutorial on Variational Autoencoders" by Carl Doersch

SLIDE 17

Advanced Machine Learning Generative Adversarial Networks

Amit Sethi, EE, IITB

SLIDE 18

Objectives

  • Articulate how using a discriminator helps a generator
  • Write the objective function of a GAN
  • Write the training algorithm for a GAN
SLIDE 19

GAN trains two networks together

  • GAN objective:

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

[Figure: z → G → x'; both x' and real x feed the discriminator D, which outputs y]

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 20

At the solution, the transformed distribution from z will emulate px(x)

  • As training progresses, the distributions of the transformed noise and the data become indistinguishable

[Figure: four training steps in which G's output distribution converges to p_x(x) while D flattens toward 1/2]

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 21

The trick is to allow D to catch up before improving G in each iteration

  • For each training iteration:
  • For k steps, update the discriminator by ascending its stochastic gradient:

∇_θD (1/n) Σ_{i=1}^{n} [log D(x^(i)) + log(1 − D(G(z^(i))))]

  • Then update the generator by descending its stochastic gradient:

∇_θG (1/n) Σ_{i=1}^{n} log(1 − D(G(z^(i))))

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 22

An optimum exists

  • For a fixed generator, the optimal discriminator is

D*(x) = p_data(x) / (p_data(x) + p_G(x))

  • Because

E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))] = ∫ [p_data(x) log D(x) + p_G(x) log(1 − D(x))] dx

and the optimum of b log y + c log(1 − y) is y = b / (b + c)

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014
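Both claims on this slide can be checked numerically on a grid, assuming a 1-D example with p_data = N(0, 1) and p_G = N(1, 1) (arbitrary choices for illustration):

```python
import numpy as np

# Two densities on a fine grid
x = np.linspace(-10.0, 11.0, 20001)
dx = x[1] - x[0]
p_data = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)        # data density N(0, 1)
p_g = np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)   # generator density N(1, 1)

d_star = p_data / (p_data + p_g)   # optimal discriminator from the slide

# Inner objective at D*: integral of p_data*log(D*) + p_g*log(1 - D*)
value = np.sum((p_data * np.log(d_star) + p_g * np.log(1.0 - d_star)) * dx)

# Check that D* is a maximizer: any perturbation of D lowers the objective
perturbed = np.clip(d_star + 0.05, 1e-12, 1 - 1e-12)
value_perturbed = np.sum((p_data * np.log(perturbed) + p_g * np.log(1.0 - perturbed)) * dx)
print(value, value_perturbed)
```

The value at D* is always at least −log 4 (it equals −log 4 + twice the Jensen-Shannon divergence, which the next slide derives), with equality exactly when p_G matches p_data and D* is 1/2 everywhere.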

SLIDE 23

Generator's optimization reduces as follows...

  • π”½π’š~π‘žπ’š(π’š) log 𝐸 π’š

+ π”½π’œ~π‘žπ’œ π’œ [log(1 βˆ’ 𝐸(𝐻 π’œ )] = π”½π’š~π‘žπ’š(π’š) log π‘žπ’š(π’š) π‘žπ’š π’š + π‘žπ»(π’š) + π”½π’š~π‘žπ‘― π’š log π‘žπ‘―(π’š) π‘žπ’š π’š + π‘žπ»(π’š) = βˆ’ log 4 + 𝐿𝑀 π‘žπ’š(π’š) π‘žπ’š π’š + π‘žπ»(π’š) 2 + 𝐿𝑀 π‘žπ»(π’š) π‘žπ’š π’š + π‘žπ»(π’š) 2

  • This assumes that the generator and the discriminator are

high capacity such that these can model the desired distributions arbitrarily well.

Source: β€œGenerative Adversarial Nets” by Goodfellow et al. in NeurIPS 2014

SLIDE 24

Some sample generations and interpolations of latent vector

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 25

DC-GAN was designed to generate better images

  • No pooling: convolutions with stride greater than 1 (or fractionally-strided convolutions) instead
  • No fully connected layers
  • Heavy use of batchnorm
  • ReLU in G and leaky ReLU in D, in all but the final layers
  • tanh in the last layer of G

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 26

While mode-collapse isn't evident, there is some underfitting

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 27

GAN features can directly be used for classification

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 28

GANs allow latent vector "arithmetic"

Source: "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks" by Radford et al. in ICLR 2016

SLIDE 29

Advantages and disadvantages of GAN

  • Markov chains not

needed

  • Only backprop used
  • No inference needed
  • Models a wide range of

functions

  • No explicit generator
  • Need to sync D with G
  • Mode collapse

Source: "Generative Adversarial Nets" by Goodfellow et al. in NeurIPS 2014

SLIDE 30

Conditional GAN introduces another variable (e.g. class)

  • Instead of the GAN objective:

E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

  • CGAN uses a modified objective, conditioned on side information y (e.g. a class label):

E_{x~p_data(x)}[log D(x|y)] + E_{z~p_z(z)}[log(1 − D(G(z|y)))]

Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014

SLIDE 31

Conditional GAN introduces another variable (e.g. class)

Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014

SLIDE 32

Each row is conditioned upon one digit label of a CGAN

Source: "Conditional Generative Adversarial Nets" by Mirza and Osindero, arXiv 2014

SLIDE 33

Interpretable latent variables – InfoGAN

  • GAN objective:

min_G max_D V(D, G) = E_{x~data}[log D(x)] + E_{z~noise}[log(1 − D(G(z)))]

  • Introduce extra (supposedly interpretable) latent variables c in addition to z
  • InfoGAN objective:

min_G max_D V_I(D, G) = V(D, G) − λ I(c; G(z, c))

  • Mutual information:

I(c; G(z, c)) = H(c) − H(c | G(z, c))

Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.

SLIDE 34

Variational information maximization to the rescue

  • Mutual information, writing P(c|x) for the true posterior and Q(c|x) for an auxiliary distribution:

I(c; G(z, c)) = H(c) − H(c | G(z, c))
             = E_{x~G(z,c)}[E_{c'~P(c|x)}[log P(c'|x)]] + H(c)
             = E_{x~G(z,c)}[D[P(·|x) ‖ Q(·|x)] + E_{c'~P(c|x)}[log Q(c'|x)]] + H(c)
             ≥ E_{x~G(z,c)}[E_{c'~P(c|x)}[log Q(c'|x)]] + H(c)

  • Further, E_{x~X, y~Y|x}[f(x, y)] = E_{x~X, y~Y|x, x'~X|y}[f(x', y)] under certain conditions, which implies that the previous quantity equals

L_I(G, Q) = E_{c~P(c), x~G(z,c)}[log Q(c|x)] + H(c)

  • Overall:

min_{G,Q} max_D V_I(D, G, Q) = V(D, G) − λ L_I(G, Q)

Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.
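The bound and its tightness can be verified exactly on a discrete toy problem where c and x take small finite ranges and the "generator" is just a conditional table P(x|c) (the probability values below are arbitrary):

```python
import numpy as np

# Discrete toy: code c in {0, 1}, output x in {0, 1, 2}
p_c = np.array([0.5, 0.5])                     # code distribution P(c)
p_x_given_c = np.array([[0.7, 0.2, 0.1],       # a toy "generator" P(x|c)
                        [0.1, 0.3, 0.6]])
p_cx = p_c[:, None] * p_x_given_c              # joint P(c, x)
p_x = p_cx.sum(axis=0)
p_c_given_x = p_cx / p_x                       # true posterior P(c|x)

def entropy(p):
    return -np.sum(p * np.log(p))

mutual_info = entropy(p_c) + entropy(p_x) - entropy(p_cx)   # I(c; x)

def lower_bound(q_c_given_x):
    # L_I = E_{c~P(c), x~P(x|c)}[log Q(c|x)] + H(c)
    return np.sum(p_cx * np.log(q_c_given_x)) + entropy(p_c)

q_uniform = np.full((2, 3), 0.5)   # a deliberately poor auxiliary Q(c|x)
print(lower_bound(q_uniform), lower_bound(p_c_given_x), mutual_info)
```

Any valid Q gives a lower bound on I(c; x), and the bound is tight exactly when Q equals the true posterior; InfoGAN trains a network Q to chase that posterior.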

SLIDE 35

InfoGAN visually

Inspiration source: "InfoGAN - Generative Adversarial Networks Part III" by Zak Jost, TowardsDataScience.com

[Figure: z and c feed G to produce x'; x' and real x feed D, which outputs y; a head Q on top of D predicts c']

  • The figure elements in red were added to InfoGAN on top of a regular GAN
  • c can be a mix of class codes and disentangled continuous variables

SLIDE 36

Results are more interpretable

Source: "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets" by Chen et al., in NeurIPS 2016.

SLIDE 37

Results are more interpretable

Source: β€œInfoGAN: Interpretable Representation Learning byInformation Maximizing Generative Adversarial Nets” by Chen et al., in NeurIPS 2016.