SLIDE 1

Variational Auto-Encoders

Diederik P. Kingma

SLIDE 2

Introduction and Motivation

SLIDE 3

Motivation and applications

• Versatile framework for unsupervised and semi-supervised deep learning
• Representation learning, e.g. 2D visualisation
• Data-efficient learning, e.g. semi-supervised learning
• Artificial creativity, e.g. image/text resynthesis, molecule design

SLIDE 4

Sad Kanye -> Happy Kanye

“Smile vector”. Tom White, 2016, twitter: @dribnet

SLIDE 5

Background

SLIDE 6

Probabilistic Models

• x: observed random variables
• p*(x): the underlying, unknown process
• pθ(x): model distribution
• Goal: pθ(x) ≈ p*(x)
• We want pθ(x) to be flexible
• Conditional modeling goal: pθ(x|y) ≈ p*(x|y)

SLIDE 7

Concept 1: Parameterization of conditional distributions with Neural Networks

SLIDE 8

Common example

[Figure: NeuralNet(x) maps an input x to a distribution over class labels y (Cat, Dog, Mouse, …), e.g. probabilities 0.9, 0.45, …]
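To make Concept 1 concrete, here is a minimal sketch in PyTorch (the slides name no framework, so the library, layer sizes, and class count are all assumptions) of a neural network parameterizing the conditional p(y|x):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Hypothetical classifier: an MLP whose outputs parameterize p(y|x)
# over three classes (Cat, Dog, Mouse). Sizes are illustrative.
net = nn.Sequential(
    nn.Linear(784, 128),  # assumed input: a flattened 28x28 image
    nn.ReLU(),
    nn.Linear(128, 3),    # one logit per class
)

x = torch.randn(16, 784)                   # a minibatch of inputs
p_y_given_x = Categorical(logits=net(x))   # conditional distribution p(y|x)

y = torch.randint(0, 3, (16,))             # dummy labels
loss = -p_y_given_x.log_prob(y).mean()     # maximum-likelihood training loss
```

Training then simply maximizes log p(y|x) over labeled pairs.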

SLIDE 9

Concept 2: Generalization into directed graphical models (Bayesian networks) parameterized with neural networks

SLIDE 10

Directed graphical models / Bayesian networks

• Traditionally: conditionals parameterized using probability tables
• Here: we parameterize the conditionals using neural networks
• Joint distribution factorizes as:

pθ(x1, …, xM) = ∏j pθ(xj | pa(xj))

where pa(xj) denotes the parents of variable xj in the directed graph.

SLIDE 11

Maximum Likelihood (ML)

Log-probability of a datapoint x: log pθ(x)
Log-likelihood of an i.i.d. dataset D: log pθ(D) = Σx∈D log pθ(x)
Optimizable with (minibatch) SGD
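As an illustrative sketch (not from the slides): fitting a univariate Gaussian pθ(x) = N(x; μ, σ²) by minibatch SGD on the negative log-likelihood. The toy data and hyperparameters are hypothetical.

```python
import torch
from torch.distributions import Normal

data = 2.0 * torch.randn(10_000) + 3.0      # toy dataset, x ~ N(3, 2^2)

mu = torch.zeros(1, requires_grad=True)     # θ = (μ, log σ)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([mu, log_sigma], lr=0.05)

for step in range(2_000):
    x = data[torch.randint(0, len(data), (64,))]             # minibatch
    loss = -Normal(mu, log_sigma.exp()).log_prob(x).mean()   # -log-likelihood
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mu.item(), log_sigma.exp().item())    # ≈ 3 and ≈ 2
```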

SLIDE 12

Concept 3: Generalization into Deep Latent-Variable Models

SLIDE 13

Deep Latent-Variable Model (DLVM)

• Introduce latent variables z into the graph
• Latent-variable model pθ(x, z), where the conditionals are parameterized with neural networks
• Advantage: extremely flexible. Even if each conditional is simple (e.g. conditional Gaussian), the marginal likelihood pθ(x) = ∫ pθ(x, z) dz can be arbitrarily complex
• Disadvantage: the marginal likelihood pθ(x) is intractable
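To see concretely why the marginal is a problem, consider a hedged sketch of the naive Monte Carlo estimate pθ(x) ≈ (1/K) Σk pθ(x|zk) with zk ∼ p(z). The estimator is valid, but in high dimensions almost no prior samples land where the posterior has mass, so the variance explodes. The decoder below is a hypothetical placeholder.

```python
import math
import torch
import torch.nn as nn
from torch.distributions import Normal

decoder = nn.Linear(2, 5)   # hypothetical pθ(x|z): z in R^2 -> mean of x in R^5

def naive_log_marginal(x, n_samples=100_000):
    z = torch.randn(n_samples, 2)                          # z ~ p(z) = N(0, I)
    log_px_given_z = Normal(decoder(z), 1.0).log_prob(x).sum(-1)
    # log pθ(x) ≈ log (1/K) Σk pθ(x|zk): unbiased in pθ(x), but the
    # variance explodes as the dimensionality of x and z grows.
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(n_samples)

x = torch.randn(5)
print(naive_log_marginal(x))
```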

SLIDE 14

Neural Net

SLIDE 15

DLVM: Optimization is non-trivial

• Direct optimization of log pθ(x)? Intractable marginal likelihood
• Expectation maximization (EM)? Intractable posterior: pθ(z|x) = pθ(x,z)/pθ(x)
• MAP: point estimate of pθ(z|x)? Overfits
• Traditional variational EM and MCMC-EM? Slow
• And none of these tells us how to do fast posterior inference

SLIDE 16

Variational Autoencoders (VAEs)

SLIDE 17

Solution: Variational Autoencoder (VAE)

• Introduce qφ(z|x): a parametric model of the true posterior
• Parameterized by another neural network
• Joint optimization of qφ(z|x) and pθ(x,z)
• Remarkably simple objective: the evidence lower bound (ELBO) [MacKay, 1992]

SLIDE 18

Encoder / Approximate Posterior

• qφ(z|x): parametric model of the posterior
• φ: variational parameters
• We optimize the variational parameters φ such that qφ(z|x) ≈ pθ(z|x)
• Like a DLVM, the inference model can be (almost) any directed graphical model
• Note that traditionally, variational methods employ local (per-datapoint) variational parameters; here we only have a global φ

SLIDE 19

[Figure: example model in plate notation; latent z and observed x inside a plate of size N, with parameters θ]

Evidence Lower Bound / ELBO

The ELBO simultaneously achieves two things:

1. Maximization of log pθ(x) ⇒ good marginal likelihood
2. Minimization of DKL(qφ(z|x)||pθ(z|x)) ⇒ accurate (and fast) posterior inference

Objective (ELBO):

L(x; θ, φ) = Eqφ(z|x)[log pθ(x, z) − log qφ(z|x)]

which can be rewritten as:

L(x; θ, φ) = log pθ(x) − DKL(qφ(z|x)||pθ(z|x))

Since DKL ≥ 0, the ELBO is a lower bound on log pθ(x), and maximizing it pursues both goals at once.
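The identity L = log p(x) − DKL can be checked numerically on a toy conjugate model where everything is available in closed form; this sketch is illustrative and not part of the slides. With p(z) = N(0, 1) and p(x|z) = N(z, 1), the marginal is p(x) = N(0, 2) and the posterior is p(z|x) = N(x/2, 1/2).

```python
import torch
from torch.distributions import Normal, kl_divergence

x = torch.tensor(1.3)
q = Normal(0.2, 0.8)   # some approximate posterior q(z|x)

# Monte Carlo estimate: L = E_q[log p(x,z) - log q(z|x)]
z = q.sample((100_000,))
log_pxz = Normal(0.0, 1.0).log_prob(z) + Normal(z, 1.0).log_prob(x)
elbo_mc = (log_pxz - q.log_prob(z)).mean()

# Closed form: L = log p(x) - DKL(q(z|x) || p(z|x))
log_px = Normal(0.0, torch.tensor(2.0).sqrt()).log_prob(x)
posterior = Normal(x / 2, torch.tensor(0.5).sqrt())
elbo_exact = log_px - kl_divergence(q, posterior)

print(elbo_mc.item(), elbo_exact.item())   # agree up to Monte Carlo error
```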

SLIDE 20

Stochastic Gradient Descent (SGD)

• Minibatch SGD requires unbiased gradient estimates
• Reparameterization trick for continuous latent variables [Kingma and Welling, 2013]
• REINFORCE for discrete latent variables
• Adam optimizer: adaptively preconditioned SGD [Kingma and Ba, 2014]
• Weight normalisation for faster convergence [Salimans and Kingma, 2016]

SLIDE 21

ELBO as KL Divergence

SLIDE 22

Gradients

An unbiased gradient estimator of the ELBO w.r.t. the generative model parameters θ is straightforwardly obtained:

∇θ L(x; θ, φ) = Eqφ(z|x)[∇θ log pθ(x, z)] ≃ ∇θ log pθ(x, z), with z ∼ qφ(z|x)

A gradient estimator of the ELBO w.r.t. the variational parameters φ is more difficult to obtain, because the expectation is taken w.r.t. a distribution that itself depends on φ:

∇φ Eqφ(z|x)[f(z)] ≠ Eqφ(z|x)[∇φ f(z)]

SLIDE 23

Reparameterization Trick

Construct the following Monte Carlo estimator:

ε ∼ p(ε), z = g(ε, φ, x)
L̃(x; θ, φ) = log pθ(x, z) − log qφ(z|x)

where p(ε) and g() are chosen such that z ∼ qφ(z|x). This estimator has a simple Monte Carlo gradient ∇θ,φ L̃(x; θ, φ), obtained by ordinary backpropagation through g (see the sketch below).
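A hedged PyTorch sketch of the trick for a Gaussian qφ(z|x): writing z = μ + σ ⊙ ε with ε ∼ N(0, I) moves the randomness into ε, so the sample stays differentiable w.r.t. (μ, σ). In torch.distributions this is exactly what .rsample() does, while .sample() blocks gradients.

```python
import torch
from torch.distributions import Normal

mu = torch.tensor([0.5], requires_grad=True)
log_sigma = torch.tensor([-1.0], requires_grad=True)

# Reparameterization by hand: z = mu + sigma * eps, eps ~ N(0, I).
eps = torch.randn(1)
z = mu + log_sigma.exp() * eps
z.sum().backward()
print(mu.grad, log_sigma.grad)   # both gradients are well-defined

# Equivalent built-in: rsample() uses the same trick;
# sample() would detach the result and block gradients.
mu.grad = None
log_sigma.grad = None
z = Normal(mu, log_sigma.exp()).rsample()
z.sum().backward()
```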

SLIDE 24

Reparameterization Trick

This is an unbiased estimator of the exact single-datapoint ELBO gradient:

Ep(ε)[∇θ,φ L̃(x; θ, φ)] = ∇θ,φ L(x; θ, φ)

SLIDE 25

Reparameterization Trick

Under reparameterization, the density is given by the change-of-variables formula:

log qφ(z|x) = log p(ε) − log |det(∂z/∂ε)|

Important: choose transformations g() for which the log-determinant is computationally affordable/simple.

SLIDE 26

Factorized Gaussian Posterior

A common choice is a simple factorized Gaussian encoder:

qφ(z|x) = N(z; μ, diag(σ²)), where (μ, log σ) = EncoderNeuralNetφ(x)

After reparameterization, we can write:

ε ∼ N(0, I), z = μ + σ ⊙ ε

SLIDE 27

Factorized Gaussian Posterior

The Jacobian of the transformation is:

∂z/∂ε = diag(σ)

The determinant of a diagonal matrix is the product of its diagonal entries. So the posterior density is:

log qφ(z|x) = log p(ε) − Σi log σi = Σi [log N(εi; 0, 1) − log σi]
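A minimal sketch of this encoder and its change-of-variables density (the network shapes are hypothetical); the final assertion checks the derivation against PyTorch's own Gaussian log-density.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

enc = nn.Linear(784, 2 * 20)   # hypothetical encoder: x -> (μ, log σ), 20 latents

x = torch.randn(1, 784)
mu, log_sigma = enc(x).chunk(2, dim=-1)
sigma = log_sigma.exp()

eps = torch.randn_like(mu)     # ε ~ N(0, I)
z = mu + sigma * eps           # reparameterized sample, z ~ qφ(z|x)

# Change of variables: log q(z|x) = Σi [log N(εi; 0, 1) - log σi]
log_qz = (Normal(0.0, 1.0).log_prob(eps) - log_sigma).sum(-1)

# Sanity check against the direct Gaussian log-density
assert torch.allclose(log_qz, Normal(mu, sigma).log_prob(z).sum(-1), atol=1e-5)
```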

SLIDE 28

Full-covariance Gaussian posterior

The factorized Gaussian posterior can be extended to a Gaussian with full covariance:

qφ(z|x) = N(z; μ, Σ)

A reparameterization of this distribution, with a surprisingly simple determinant, is:

ε ∼ N(0, I), z = μ + Lε

where L is a lower (or upper) triangular matrix with non-zero entries on the diagonal. The off-diagonal elements define the correlations (covariance) of the elements in z.

SLIDE 29

Full-covariance Gaussian posterior

The reason for this parameterization of the full-covariance Gaussian is that the Jacobian determinant is remarkably simple. The Jacobian is trivial:

∂z/∂ε = L

And the determinant of a triangular matrix is simply the product of its diagonal entries. So:

log |det(∂z/∂ε)| = Σi log |Lii|

SLIDE 30

Full-covariance Gaussian posterior

This parameterization corresponds to the Cholesky decomposition Σ = LLᵀ of the covariance of z.

SLIDE 31

Full-covariance Gaussian posterior

One way to construct the matrix L is as follows:

(μ, log σ, L′) ← EncoderNeuralNetφ(x)
L ← Lmask ⊙ L′ + diag(σ)

Here Lmask is a masking matrix that zeroes out the diagonal and everything above it, so the diagonal of L equals σ. The log-determinant is then identical to the factorized Gaussian case:

log |det(∂z/∂ε)| = Σi log σi
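A hedged sketch of this construction, with hypothetical dimensions, checking the claimed log-determinant against PyTorch's MultivariateNormal:

```python
import torch
from torch.distributions import MultivariateNormal

d = 4
mu = torch.randn(d)
log_sigma = 0.1 * torch.randn(d)
L_raw = torch.randn(d, d)                        # unconstrained encoder output L′

mask = torch.ones(d, d).tril(diagonal=-1)        # strictly lower-triangular Lmask
L = mask * L_raw + torch.diag(log_sigma.exp())   # L = Lmask ⊙ L′ + diag(σ)

eps = torch.randn(d)
z = mu + L @ eps                                 # z ~ N(μ, L Lᵀ)

log_det = log_sigma.sum()                        # log|det(∂z/∂ε)| = Σi log σi
log_qz = MultivariateNormal(torch.zeros(d), torch.eye(d)).log_prob(eps) - log_det

# Matches the direct full-covariance Gaussian log-density
assert torch.allclose(log_qz, MultivariateNormal(mu, scale_tril=L).log_prob(z), atol=1e-5)
```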
 


SLIDE 32

Full-covariance Gaussian posterior

Therefore, the density is identical to the diagonal Gaussian case:

log qφ(z|x) = Σi [log N(εi; 0, 1) − log σi]

SLIDE 33

Beyond Gaussian posteriors

SLIDE 34

Normalizing Flows

• Full-covariance Gaussian: a single transformation operation, ft(ε, x) = Lε
• Normalizing flows: multiple transformation steps

SLIDE 35

Normalizing Flows

Define z ∼ qφ(z|x) through a chain of transformations:

ε0 ∼ p(ε), εt = ft(εt−1, x) for t = 1, …, T, z = εT

The Jacobian of the transformation factorizes:

∂z/∂ε0 = ∏t (∂εt/∂εt−1)

And the density follows by chaining the change-of-variables formula:

log qφ(z|x) = log p(ε0) − Σt log |det(∂εt/∂εt−1)|

[Rezende and Mohamed, 2015]
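A hedged sketch of such a chain, using simple invertible elementwise affine steps as stand-ins for the flow transformations ft; the per-step log-determinants accumulate exactly as in the formula above.

```python
import torch
from torch.distributions import Normal

# Each step ft: elementwise affine ε -> a ⊙ ε + b (invertible since a > 0);
# its Jacobian is diag(a), so log|det| = Σi log ai.
steps = [(torch.rand(3) + 0.5, torch.randn(3)) for _ in range(4)]

eps = torch.randn(3)                          # ε0 ~ p(ε) = N(0, I)
log_q = Normal(0.0, 1.0).log_prob(eps).sum()  # log p(ε0)

for a, b in steps:                            # chain εt = ft(εt−1)
    eps = a * eps + b
    log_q = log_q - a.log().sum()             # subtract per-step log-determinant

z = eps                                       # z = εT, with log q(z|x) = log_q
```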

SLIDE 36

Inverse Autoregressive Flows

• Probably the most flexible type of transformation with a simple determinant that can be chained
• Each transformation is given by an autoregressive neural net, with a triangular Jacobian
• Among the best known ways to construct arbitrarily flexible posteriors
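A minimal sketch of a single IAF-style step, zi = μi(ε&lt;i) + σi(ε&lt;i) · εi, using one strictly masked linear layer as the simplest possible autoregressive net (the actual IAF paper uses deeper MADE-style networks). Because output i depends only on inputs j &lt; i, the Jacobian is triangular with diagonal σ:

```python
import torch

d = 5
# Strictly lower-triangular masks: output i depends only on inputs j < i.
mask = torch.ones(d, d).tril(diagonal=-1)
W_m = torch.randn(d, d) * mask    # weights producing the shift μ(ε)
W_s = torch.randn(d, d) * mask    # weights producing the log-scale log σ(ε)

def iaf_step(eps):
    mu = W_m @ eps                  # μi depends only on ε<i
    log_sigma = W_s @ eps           # σi likewise
    z = mu + log_sigma.exp() * eps  # zi = μi + σi · εi
    log_det = log_sigma.sum()       # triangular Jacobian with diagonal σ
    return z, log_det

eps = torch.randn(d)
z, log_det = iaf_step(eps)          # chain several such steps for more flexibility
```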

SLIDE 37

Inverse Autoregressive Flow

SLIDE 38

Posteriors in 2D space

SLIDE 39

Deep IAF helps towards better likelihoods

[Kingma, Salimans et al., 2016]

SLIDE 40

Optimization Issues

• Overpruning (latent dimensions left unused):
  • Solution 1: KL annealing
  • Solution 2: free bits (see the IAF paper)
• 'Blurriness' of samples:
  • Solution: better q or p models

Both overpruning fixes are sketched below.
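A hedged sketch of both remedies applied to a per-dimension KL term; kl_per_dim is a hypothetical tensor of shape (batch, latent_dim) holding the KL contribution of each latent dimension.

```python
import torch

kl_per_dim = torch.rand(64, 20)   # hypothetical DKL per latent dimension, per example

# Solution 1: KL annealing. Scale the KL term by a weight β that warms
# up from 0 to 1 over the first training steps.
step, warmup_steps = 1_000, 10_000
beta = min(1.0, step / warmup_steps)
kl_annealed = beta * kl_per_dim.sum(-1).mean()

# Solution 2: free bits. Give every dimension a "free" KL budget λ, so
# the optimizer gains nothing from pruning a dimension's KL below λ.
lam = 0.5
kl_free_bits = kl_per_dim.mean(0).clamp(min=lam).sum()
```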

SLIDE 41

Better generative models

SLIDE 42

Improving Q versus improving P

SLIDE 43

PixelVAE

• Use PixelCNN models for p(x|z) and p(z)
• No need for a complicated q(z|x): a factorized Gaussian suffices

[Gulrajani et al, 2016]

SLIDE 44

PixelVAE

[Gulrajani et al, 2016]

SLIDE 45

PixelVAE

SLIDE 46

PixelVAE

SLIDE 47

Applications

SLIDE 48

Visualisation of Data in 2D

SLIDE 49

Representation learning

[Figure: encoder maps observations x to a 2D latent representation z]

SLIDE 50

Semi-supervised learning

SLIDE 51

SSL With Auxiliary VAE

[Maaløe et al, 2016]

SLIDE 52

Data-efficient learning on ImageNet

[Pu et al, “Variational Autoencoder for Deep Learning of Images, Labels and Captions”, 2016]

Accuracy improves from 10% to 60% with only 1% of the labels.

SLIDE 53

(Re)Synthesis

SLIDE 54

Analogy-making

SLIDE 55

Automatic chemical design

• VAE trained on a text representation of 250K molecules
• Uses the latent space to design new drugs and organic LEDs

[Gómez-Bombarelli et al, 2016]

SLIDE 56

Semantic Editing

“Smile vector”. Tom White, 2016, twitter: @dribnet

SLIDE 57

Semantic Editing

“Smile vector”. Tom White, 2016, twitter: @dribnet

SLIDE 58

Semantic Editing

“Neural Photo Editing”. Andrew Brock et al, 2016

SLIDE 59

Questions?