Learning Deep Generative Models: Inference & Representation

SLIDE 1

Learning Deep Generative Models

Inference & Representation, Lecture 12. Rahul G. Krishnan, Fall 2015

SLIDE 2

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 3

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 4

Overview of Lecture

1 Review mathematical concepts: Jensen’s Inequality and the Maximum Likelihood (ML) principle
2 Learning as Optimization: maximizing the Evidence Lower Bound (ELBO)
3 Learning in LDA
4 Stochastic Variational Inference
5 Learning Deep Generative Models
6 Summarize

SLIDE 5

Recap

Jensen’s Inequality: for concave f, we have f(E[X]) ≥ E[f(X)].

For example, take X with P[X = a] = 1 − λ and P[X = b] = λ. Then:

f(E[X]) = f((1 − λ)a + λb) ≥ (1 − λ)f(a) + λf(b) = E[f(X)]

Figure: Jensen’s Inequality
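
A quick numerical check of the inequality (my own sketch, not from the slides), using the concave function f = log:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # any positive random variable X

# Jensen's inequality for the concave function f = log: f(E[X]) >= E[f(X)].
lhs = np.log(x.mean())   # f(E[X])
rhs = np.log(x).mean()   # E[f(X)]
print(f"log(E[X]) = {lhs:.4f} >= E[log X] = {rhs:.4f}")
```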

SLIDE 6

Recap

We assume that for D = {x_1, . . . , x_N}, the x_i ∼ p(x) are i.i.d. We hypothesize a model (with parameters θ) for how the data is generated.

The Maximum Likelihood Principle: max_θ p(D; θ) = ∏_{i=1}^N p(x_i; θ)

We typically work with the log probability, i.e. max_θ ∑_{i=1}^N log p(x_i; θ)
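
As a toy illustration of the principle (a sketch I’ve added, with made-up data): for a Bernoulli model p(x; θ) = θ^x (1 − θ)^(1−x), maximizing the log-likelihood over a grid recovers the closed-form MLE, the sample mean:

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # i.i.d. Bernoulli(theta) draws

# Log-likelihood sum_i log p(x_i; theta), evaluated on a grid of theta values.
thetas = np.linspace(0.01, 0.99, 99)
loglik = (data.sum() * np.log(thetas)
          + (len(data) - data.sum()) * np.log(1 - thetas))
print(thetas[loglik.argmax()], data.mean())  # grid maximizer vs. closed form
```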

SLIDE 7

A simple Bayesian Network

[Two-node Bayesian network: z → x]

Let’s start with a very simple generative model for our data. We assume that the data is generated i.i.d. as:

z ∼ p(z), x ∼ p(x|z)

z is latent/hidden and x is observed.

SLIDE 8

Bounding the Marginal Likelihood

Log-likelihood of a single datapoint x ∈ D under the model: log p(x; θ).

Important: assume there exists q(z; φ) (a variational approximation). Then:

log p(x) = log ∑_z p(x, z)
= log ∑_z q(z) p(x, z)/q(z)   (multiply and divide by q(z))
= log E_{z∼q(z)}[p(x, z)/q(z)]
≥ ∑_z q(z) log p(x, z)/q(z)   (by Jensen’s Inequality)
= L(x; θ, φ) = E_{q(z)}[log p(x, z)] + H(q(z))

The first term is the expectation of the joint distribution; H(q(z)) is the entropy.
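
The bound is easy to verify numerically when z is discrete. Below is a sketch with a hypothetical two-state model (my own toy numbers): it checks that L(x; θ, φ) ≤ log p(x) for an arbitrary q:

```python
import numpy as np

# A hypothetical two-state model: z in {0, 1}, x in {0, 1}.
p_z = np.array([0.3, 0.7])                 # p(z)
p_x_given_z = np.array([[0.9, 0.1],        # p(x | z = 0)
                        [0.2, 0.8]])       # p(x | z = 1)

x = 1
log_px = np.log((p_z * p_x_given_z[:, x]).sum())      # exact log p(x)

q = np.array([0.5, 0.5])                              # any q(z; phi)
log_joint = np.log(p_z) + np.log(p_x_given_z[:, x])   # log p(x, z)
elbo = (q * log_joint).sum() - (q * np.log(q)).sum()  # E_q[log p(x,z)] + H(q)
print(elbo, "<=", log_px)
```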

SLIDE 9

Evidence Lower BOund (ELBO)/Variational Bound

When is the lower bound tight? Look at the gap between the function and the lower bound, log p(x; θ) − L(x; θ, φ):

log p(x) − ∑_z q(z) log p(x, z)/q(z)
= ∑_z q(z) log p(x) − ∑_z q(z) log p(x, z)/q(z)
= ∑_z q(z) log q(z)p(x)/p(x, z)
= KL(q(z; φ) || p(z|x))

SLIDE 10

Evidence Lower BOund (ELBO)/Variational Bound

We assumed the existence of q(z; φ). What we just showed is that:

Key Point: the optimal q(z; φ) corresponds to the one that realizes KL(q(z; φ)||p(z|x)) = 0 ⇔ q(z; φ) = p(z|x).

SLIDE 11

Evidence Lower BOund (ELBO)/Variational Bound

In order to estimate the likelihood of the entire dataset D, we need ∑_{i=1}^N log p(x_i; θ). Summing up over datapoints we get:

max_θ ∑_{i=1}^N log p(x_i; θ) ≥ max_{θ,φ_1,...,φ_N} ∑_{i=1}^N L(x_i; θ, φ_i)   (the ELBO)

Note that we use a different φ_i for every data point.

SLIDE 12

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 13

Summary

Learning as Optimization: variational learning turns learning into an optimization problem, namely:

max_{θ,φ_1,...,φ_N} ∑_{i=1}^N L(x_i; θ, φ_i)

SLIDE 14

Summary

Optimal q: the optimal q(z; φ) used in the bound corresponds to the intractable posterior distribution p(z|x).

SLIDE 15

Summary

Approximating the Posterior: the better q(z; φ) approximates the posterior, the smaller the KL(q(z; φ)||p(z|x)) we can achieve, and the closer the ELBO will be to log p(x; θ).

SLIDE 16

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 17

Generative Model

Latent Dirichlet Allocation (LDA)

[Plate diagram: nodes α, θ, z, w, β, η; plates N, M, K]

Figure: Generative Model for Latent Dirichlet Allocation

SLIDE 18

Generative Model

1 Sample global topics β_k ∼ Dir(η_k)
2 For each document d = 1, . . . , N:
3   Sample θ_d ∼ Dir(α)
4   For each word m = 1, . . . , M:
5     Sample topic z_dm ∼ Mult(θ_d)
6     Sample word w_dm ∼ Mult(β_{z_dm})

S denotes the simplex, V is the vocabulary, and K is the number of topics; θ_d ∈ S^K and β_{z_dm} ∈ S^V.
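
The generative process translates directly into code. Here is a sketch with hypothetical sizes and hyperparameters (my own choices, not the lecture’s):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_docs, M = 3, 50, 4, 20     # topics, vocabulary size, documents, words per doc
eta, alpha = 0.1, 0.5              # Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)    # global topics, rows in S^V
docs = []
for d in range(n_docs):
    theta_d = rng.dirichlet(np.full(K, alpha))   # topic proportions in S^K
    z = rng.choice(K, size=M, p=theta_d)         # topic z_dm for each word
    words = [rng.choice(V, p=beta[zm]) for zm in z]  # words w_dm
    docs.append(words)
print(docs[0])
```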

SLIDE 19

Variational Distribution

w are observed and z, β, θ are latent; we will perform inference over z, β, θ. As before, we assume that there exists a distribution over our latent variables, and that it factorizes (the mean-field assumption).

Variational Distribution:

q(θ, z, β; Φ) = q(θ; γ) ∏_{n=1}^N q(z_n; φ_n) ∏_{k=1}^K q(β_k; λ_k)

Denote by Φ = {γ, φ, λ} the parameters of the variational approximation.

SLIDE 20

Homework

Your next homework assignment involves implementing a mean-field algorithm for inference in LDA. Assume the topic-word probabilities β_{1:K} are observed and fixed; you won’t have to infer these. Perform inference over θ and z. The following slides are to give you intuition and understanding of how to derive the updates for inference. Read Blei et al. (2003) (particularly the appendix) for details on the derivation.

SLIDE 21

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 22

ELBO Derivation

The joint distribution over the corpus is:

log p(θ, z, w, β; α, η) = log ∏_{k=1}^K p(β_k; η) ∏_{d=1}^D p(θ_d; α) ∏_{n=1}^N p(z_dn | θ_d) p(w_dn | z_dn, β)

SLIDE 23

ELBO Derivation

Denote by Φ = {γ, φ, λ} the parameters of the variational approximation. For a single document, the bound on the log-likelihood is:

log p(w; α, η) ≥ E_{q(θ,z,β;Φ)}[log p(θ, z, w, β; α, η)] + H(q(θ, z, β; Φ)), where the right-hand side is L(w; α, η, Φ)

SLIDE 24

ELBO Derivation

Assumption: the posterior distribution fully factorizes.

[Plate diagram: variational parameters γ, φ, λ for θ, z, β; plates M, N, K]

Figure: Plate model for Mean Field Approximation to LDA

SLIDE 25

ELBO Derivation

What q(θ, z, β; Φ) do we use? The mean-field assumption:

q(θ, z, β; Φ) = q(θ; γ) ∏_{n=1}^N q(z_n; φ_n) ∏_{k=1}^K q(β_k; λ_k)

θ is a multinomial parameter, therefore γ is a Dirichlet parameter; likewise for β_k. Each z_n ∈ {1, . . . , K}, therefore φ_n represents the parameters of a multinomial distribution.

SLIDE 26

Variational EM

L(w; α, η, Φ) = E_{q(θ,z,β;Φ)}[log p(θ, z, w, β; α, η)] + H(q(θ, z, β; Φ))

L is a function of α, η (the parameters of the model) and of Φ = {γ, φ, λ} (the parameters of the approximation to the posterior).

Variational EM:
1 Fix α, η; approximate γ∗, φ∗, λ∗ (mean-field inference)
2 Fix γ∗, φ∗, λ∗; update α, η

Unlike EM, variational EM is not guaranteed to reach a local maximizer of L.

SLIDE 27

Variational EM

Deriving the updates for variational inference in the HW:

1 See the Appendix in Blei et al. (2003)
2 Expand the bound L using the factorization of the joint distribution and the form of the mean-field posterior
3 Isolate terms in L corresponding to the variational parameters γ, φ
4 Find γ∗, φ∗ that maximize L(γ), L(φ)

SLIDE 28

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 29

Variational Inference

Let us focus just on variational inference (the E-step) for the moment.

φ_dn^k: probability that word n in document d has topic k
γ_d: posterior Dirichlet parameter for document d
λ_k: posterior Dirichlet parameter for topic k

SLIDE 30

Variational Inference

Let’s recall what the variational distribution looked like.

[Plate diagram: variational parameters γ, φ, λ for θ, z, β; plates M, N, K]

Figure: Plate model for Mean Field Approximation to LDA

SLIDE 31

Variational Inference

1 For a single document d
2 Repeat till convergence:
3   Update φ_dn^k for n ∈ {1, . . . , N}, k ∈ {1, . . . , K}
4   Update γ_d

This process yields the local posterior parameters: φ_dn^k gives us the probability that the nth word was drawn from topic k, and γ_d gives us a Dirichlet parameter; samples from this distribution give us an estimate of the topic proportions in the document. (A code sketch of these updates follows.)
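
The slides leave the exact update equations to Blei et al. (2003). For intuition, here is a sketch of the local loop with β fixed (as in the homework), using the standard coordinate-ascent updates φ_dn^k ∝ β_{k,w_dn} exp(Ψ(γ_dk)) and γ_d = α + ∑_n φ_dn from that paper’s appendix:

```python
import numpy as np
from scipy.special import digamma

def local_mean_field(words, beta, alpha, n_iters=50):
    # words: array of word ids for one document; beta: (K, V) fixed topics.
    K = beta.shape[0]
    gamma = np.full(K, alpha + len(words) / K)   # initialize gamma_d
    for _ in range(n_iters):
        # phi_dn^k proportional to beta[k, w_dn] * exp(digamma(gamma_k))
        phi = beta[:, words].T * np.exp(digamma(gamma))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)          # gamma_d = alpha + sum_n phi_dn
    return phi, gamma
```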

SLIDE 32

Variational Inference

We just saw the updates to the local variational parameters (local to every document). What about the update to λ, the global variational parameter (shared across all documents)?

SLIDE 33

Variational Inference

The posterior over β uses local posterior parameters from every document.

1 For all documents d = 1, . . . , M, repeat:
2   Update φ_dn^k for n ∈ {1, . . . , N}, k ∈ {1, . . . , K}
3   Update γ_d
4 Update λ_k ← η + ∑_{d=1}^M ∑_{n=1}^N φ_dn^k w_dn for k = {1, . . . , K}, where η is the prior over β_k
5 The update to λ_k uses φ from every document in the corpus (a code sketch follows)
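
For completeness, a sketch of the global update (my own code, treating w_dn as the one-hot indicator of word n in document d, so the sum scatters each φ_dn into the column of its word):

```python
import numpy as np

def update_lambda(phis, docs, eta, K, V):
    # lambda_k = eta + sum_d sum_n phi_dn^k w_dn, accumulated over the corpus.
    lam = np.full((K, V), float(eta))
    for phi, words in zip(phis, docs):   # phi: (N_d, K); words: word ids
        np.add.at(lam.T, words, phi)     # scatter-add phi_dn into word w_dn's entry
    return lam
```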

SLIDE 34

Inefficiencies in the Algorithm

As M (the number of documents) increases, inference becomes increasingly inefficient: Step 4 requires you to process the entire dataset before updating λ_k. Can we do better?

SLIDE 35

Stochastic Variational Inference

Key Point: instead of waiting to process the entire corpus before updating λ, why don’t we replicate the update from a single document M times?

SLIDE 36

SVI Pseudocode

1 For t = 1, . . . , T:
2   Sample a document d from the dataset
3   Repeat till convergence:
4     Update φ_dn^k for n ∈ {1, . . . , N}, k ∈ {1, . . . , K}
5     Update γ_d
6   λ̂_k ← η + M ∑_{n=1}^N φ_dn^k w_dn for k = {1, . . . , K}   (the single-document update multiplied by M; η is the prior over β_k)
7   Set λ_t ← (1 − ρ_t)λ_{t−1} + ρ_t λ̂

ρ_t is the adaptive learning rate. (A code sketch follows.)
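
Putting the two steps together, here is a sketch of the full SVI loop (my own code; the step-size schedule ρ_t = (τ0 + t)^(−κ) is a common choice from Hoffman et al. (2013), and the local updates again follow Blei et al. (2003)):

```python
import numpy as np
from scipy.special import digamma

def svi_lda(docs, lam, alpha, eta, T, tau0=1.0, kappa=0.7, inner=20):
    # docs: list of word-id arrays; lam: (K, V) initial global parameter.
    M = len(docs)
    K, V = lam.shape
    rng = np.random.default_rng(0)
    for t in range(1, T + 1):
        words = docs[rng.integers(M)]        # sample one document
        # Local step: mean-field updates for this document.
        E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        gamma = np.full(K, alpha + len(words) / K)
        for _ in range(inner):
            phi = np.exp(E_log_beta[:, words].T + digamma(gamma))
            phi /= phi.sum(axis=1, keepdims=True)
            gamma = alpha + phi.sum(axis=0)
        # Global step: replicate this document's statistics M times.
        lam_hat = np.full((K, V), float(eta))
        np.add.at(lam_hat.T, words, M * phi)
        rho = (tau0 + t) ** (-kappa)         # adaptive learning rate rho_t
        lam = (1 - rho) * lam + rho * lam_hat
    return lam
```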

SLIDE 37

SVI Pseudocode

For k = {1, . . . , K}:

λ̂_k ← η + M ∑_{n=1}^N φ_dn^k w_dn

λ̂_k is the estimate of the variational parameter. We update λ_t to be a weighted sum of its previous value and the proposed estimate.

SLIDE 38

What do we gain?

Lets us scale up to much larger datasets
Faster convergence

Figure: Per-word predictive probability for 100-topic LDA; SVI converges faster than batch variational inference. Taken from Hoffman et al. (2013)

SLIDE 39

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 40

Deep Generative Model

Can we give an efficient learning algorithm for Bayesian networks like this?

[Two-node Bayesian network: z → x]

SLIDE 41

Deep Generative Model

Or deeper latent variable models like this?

[Layered Bayesian network: latent variables z_1, . . . , z_5 over observed x_1, x_2]

SLIDE 42

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 43

Outline

Reset the notation from LDA; we’re starting afresh. First, a simple model to learn the technique, then a more complex latent variable model.

SLIDE 44

Simple Generative Model

[Two-node Bayesian network: z → x]

z ∼ p(z), x ∼ p(x|z)

Assume that θ are the parameters of the generative model. These include the parameters of the prior p(z) and of the conditional p(x|z).

SLIDE 45

New methods for Learning

Based on recent work in learning graphical models (Rezende et al., 2014; Kingma & Welling, 2013). In variational EM, every point in our dataset had an associated set of posterior parameters.

SLIDES 46-52

New Methods for Learning

We’ll use a single variational approximation for all datapoints
To do that, we will learn a conditional, parametric function
The output of this function will be the parameters of the variational distribution
We will approximate the posterior with this distribution
So the q(z) we assumed previously will now be q_φ(z|x)
For every x, we get a different set of posterior parameters

Optimization Problem: max_{φ,θ} ∑_{i=1}^N L(x_i; θ, φ)

SLIDE 53

ELBO

L(x; θ, φ) = ∑_z q_φ(z|x) log p_θ(x, z)/q_φ(z|x)
= ∑_z q_φ(z|x) log p_θ(x|z) − ∑_z q_φ(z|x) log q_φ(z|x)/p_θ(z)
= E_{q_φ(z|x)}[log p_θ(x, z)] + H(q_φ(z|x))   (1)

The first term is the expectation of the joint distribution; the second is the entropy of q_φ(z|x).

SLIDE 54

Key Points

Parametric q(z|x; φ): we’re going to learn a conditional parametric approximation q(z|x; φ) to p(z|x), the posterior distribution.

Shared φ: we learn a conditional model q(z|x; φ) where φ is shared across all x.

Gradient Ascent: we perform joint optimization of θ and φ on max_{θ,φ} ∑_{i=1}^N L(x_i; θ, φ)

SLIDE 55

Plate Model

[Plate diagram: variational parameters φ and generative parameters θ for the model z → x]

Figure: Learning DGMs

Use Stochastic Gradient Ascent to learn this model

SLIDE 56

Putting it all together

L(x; θ, φ) = E_{z∼q_φ(z|x)}[log p_θ(x, z)] + H(q_φ(z|x))

Step 1: Sample a datapoint from the dataset: x ∼ D
Posterior inference: evaluate q_φ(z|x) to obtain the parameters of the posterior
Step 2: Sample z_{1:K} ∼ q_φ(z|x)

SLIDE 57

Putting it all together

L(x; θ, φ) = E_{z∼q_φ(z|x)}[log p_θ(x, z)]  (a)  + H(q_φ(z|x))  (b)

Step 3: Estimate the ELBO. Approximate (a) as a Monte Carlo estimate over the K samples; (b) is typically an analytic function of φ.

SLIDE 58

Putting it all together

Compute the gradients of L(x; θ, φ) = E_{z∼q_φ(z|x)}[log p_θ(x, z)]  (a)  + H(q_φ(z|x))  (b)

Step 4: Compute the gradients ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). First look at the gradients with respect to θ:

∇_θ L(x; θ, φ) = E_z[∇_θ log p(x, z; θ)]   (the entropy term does not depend on θ)

We approximate these gradients using a Monte Carlo estimator with the K samples.

SLIDE 59

Putting it all together

Compute the gradients of L(x; θ, φ) = E_{z∼q_φ(z|x)}[log p_θ(x, z)]  (a)  + H(q_φ(z|x))  (b)

Step 4: Compute the gradients ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). Now look at the gradients with respect to φ. As before, what we would like is to move the gradient into the expectation and approximate it with a Monte Carlo estimator. The issue is that the expectation also depends on φ.

SLIDE 60

Putting it all together

Recent Work. What we want: ∇E[f] = E[∇f̃], i.e. we can write the gradient of an expectation as an expectation of gradients (Ranganath et al., 2014; Kingma & Welling, 2013; Rezende et al., 2014).

SLIDE 61

Putting it all together

Compute the gradients of L(x; θ, φ) = E_{z∼q_φ(z|x)}[log p_θ(x, z)]  (a)  + H(q_φ(z|x))  (b)

Step 4: Compute the gradients ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ). Write the gradient of an expectation as an expectation of gradients (Ranganath et al., 2014; Kingma & Welling, 2013; Rezende et al., 2014). We approximate the gradients using a Monte Carlo estimator with the K samples.

SLIDE 62

Putting it all together

Update θ and φ. Step 5: Update parameters: θ ← θ + η_θ ∇_θ L(x; θ, φ) and φ ← φ + η_φ ∇_φ L(x; θ, φ)

SLIDE 63

Putting it all together

Pseudocode:

Step 1: Sample a datapoint from the dataset: x ∼ D
Step 2: Perform posterior inference: sample z_{1:K} ∼ q(z|x; φ)
Step 3: Estimate the ELBO
Step 4: Approximate the gradients ∇_θ L(x; θ, φ) and ∇_φ L(x; θ, φ) (the gradients are Monte Carlo estimates over the K samples)
Step 5: Update parameters: θ ← θ + η_θ ∇_θ L(x; θ, φ) and φ ← φ + η_φ ∇_φ L(x; θ, φ)
Step 6: Go to Step 1

(A worked code sketch of this loop follows.)
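
To make the recipe concrete, here is a self-contained sketch on a toy linear-Gaussian model that is entirely my own construction (p(z) = N(0, 1), p(x|z) = N(θz, 1), q(z|x) = N(ax, exp(2s))); it writes out the Monte Carlo gradients by hand, anticipating the location-and-scale transformation introduced a few slides later:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: p(z) = N(0,1), p(x|z) = N(theta*z, 1); q(z|x) = N(a*x, exp(2*s)).
theta_true = 2.0
data = theta_true * rng.standard_normal(1000) + rng.standard_normal(1000)

theta, a, s = 0.5, 0.0, 0.0   # generative and variational parameters
lr, K = 0.01, 10              # step size and number of MC samples

for step in range(5000):
    x = data[rng.integers(len(data))]        # Step 1: sample a datapoint
    mu, sigma = a * x, np.exp(s)
    eps = rng.standard_normal(K)
    z = mu + sigma * eps                     # Step 2: sample z_{1:K} from q
    # Steps 3-4: Monte Carlo gradients of the ELBO.
    dlogp_dz = theta * (x - theta * z) - z   # d/dz log p(x, z)
    g_theta = np.mean((x - theta * z) * z)   # d/dtheta log p(x, z)
    g_a = np.mean(dlogp_dz * x)              # chain rule through z = a*x + sigma*eps
    g_s = np.mean(dlogp_dz * sigma * eps) + 1.0  # +1 from entropy H(q) = s + const
    # Step 5: gradient ascent; Step 6: loop.
    theta, a, s = theta + lr * g_theta, a + lr * g_a, s + lr * g_s

print(f"theta after training: {theta:.2f} (data generated with theta = 2.0)")
```

With these (arbitrary) settings, θ should drift toward roughly ±2, the maximum-likelihood value for the simulated data.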

SLIDE 64

Gaussian DGMs

This is a very general framework capable of learning many different kinds of graphical models. Let’s consider a simple set of DGMs where the priors and the conditionals are Gaussian.

SLIDE 65

Assumption on q(z|x)

q(z|x): assume q(z|x; φ) approximates the posterior with a Gaussian distribution, z ∼ N(µ(x; φ), Σ(x; φ)), with p(x, z) = p(z)p(x|z).

L(x; θ, φ) = E_{q_φ(z|x)}[log p_θ(x, z)] + H(q_φ(z|x))

The first term is a function of θ and φ; the entropy term is a function of φ alone. For multivariate Gaussian distributions of dimension D:

H(q_φ(z|x)) = (D/2)(1 + log 2π) + (1/2) log |det Σ(x; φ)|
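
As a sanity check on the entropy formula (my own addition, comparing against scipy’s implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 3
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)   # a random positive-definite covariance

# H = (D/2)(1 + log 2*pi) + (1/2) log|det Sigma|
H_formula = 0.5 * D * (1 + np.log(2 * np.pi)) + 0.5 * np.log(np.linalg.det(Sigma))
H_scipy = multivariate_normal(mean=np.zeros(D), cov=Sigma).entropy()
print(H_formula, H_scipy)         # the two values should agree
```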

SLIDE 66

Location and Scale transformations

We’ll need one more tool in our toolbox; this one is specific to Gaussian latent variable models. In some cases, we can sample from distribution A and transform the samples to appear as if they came from distribution B. This is easy to see in the univariate Gaussian case: z ∼ N(µ, σ²) is equivalent to z = µ + σε where ε ∼ N(0, 1). Therefore:

E_{z∼N(µ,σ²)}[f(z)] = E_{ε∼N(0,1)}[f(µ + σε)]
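
A quick numerical demonstration of this identity (a sketch with arbitrary values of my choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
f = np.cos                         # any test function

# Location-scale: mu + sigma * eps with eps ~ N(0, 1) is distributed as
# N(mu, sigma^2), so the two estimates below agree up to sampling noise.
z = rng.normal(mu, sigma, size=1_000_000)
eps = rng.standard_normal(1_000_000)
print(f(z).mean(), f(mu + sigma * eps).mean())
```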

SLIDE 67

Gradients of L(x; θ, φ)

Gradients with respect to θ:

∇_θ L = ∇_θ E_{q_φ(z|x)}[log p_θ(x, z)] + ∇_θ H_φ = E_{q_φ(z|x)}[∇_θ log p_θ(x, z)]

Gradients with respect to φ: define Σ_φ(x) := R_φ(x)R_φ(x)^T. Then:

∇_φ L = ∇_φ E_{z∼q_φ(z|x)}[log p_θ(x, z)] + ∇_φ H_φ
= ∇_φ E_{ε∼N(0,I)}[log p_θ(x, µ_φ(x) + R_φ(x)ε)] + ∇_φ H_φ   (using the location and scale transformation)
= E_{ε∼N(0,I)}[∇_φ log p_θ(x, µ_φ(x) + R_φ(x)ε)] + ∇_φ H_φ

SLIDE 68

Gradients of L(x; θ, φ)

The gradients are expectations! We approximate them with a Monte Carlo estimate.

Gradients with respect to θ, using samples z_k ∼ q_φ(z|x):

∇_θ L = E_{z∼q_φ(z|x)}[∇_θ log p_θ(x, z)] ≈ (1/K) ∑_{k=1}^K ∇_θ log p_θ(x, z_k)

Gradients with respect to φ, using samples ε_k ∼ N(0, I):

∇_φ L = E_{ε∼N(0,I)}[∇_φ log p_θ(x, µ_φ(x) + R_φ(x)ε)] + ∇_φ H_φ ≈ (1/K) ∑_{k=1}^K ∇_φ log p_θ(x, µ_φ(x) + R_φ(x)ε_k) + ∇_φ H_φ

SLIDE 69

Learning: A graphical view

Let’s see a pictorial representation of this process for a single data point x.

SLIDE 70

Learning: A graphical view

For a given datapoint x, do inference to infer the parameters that form the approximation to the posterior. At this point, we can evaluate the entropy H(q_φ(z|x)).

[Diagram: x → µ(x), Σ(x)]

Figure: Step 1 & 2: Sampling datapoint & inferring µ(x), Σ(x)

SLIDE 71

Learning: A graphical view

Sample z_{1:K} from the posterior: z_{1:K} ∼ N(µ(x), Σ(x)). Now we have a fully observed Bayesian network.

[Diagram: x → µ(x), Σ(x) → z]

Figure: Step 2: Sampling z

SLIDE 72

Learning: A graphical view

Evaluate the ELBO, i.e. L(x; θ, φ) = E_{z∼q_φ(z|x)}[log p_θ(x, z)] + H(q_φ(z|x))

[Diagram: x → µ(x), Σ(x) → z]

Figure: Step 3: Evaluating ELBO

SLIDE 73

Learning: A graphical view

Compute ∇_θ L(x; θ, φ) = E_{z∼q_φ(z|x)}[∇_θ log p_θ(x, z)]

[Diagram: x → µ(x), Σ(x) → z]

Figure: Step 4: Compute Gradients

SLIDE 74

Learning: A graphical view

Use the location and scale transformation: compute ∇_φ L(x; θ, φ) = E_{ε∼N(0,I)}[∇_φ log p(x, µ(x; φ) + R(x; φ)ε)] + ∇_φ H(q_φ(z|x))

[Diagram: x → µ(x), Σ(x) → z]

Figure: Step 4: Compute Gradients

SLIDE 75

Easy to Learn

Specific forms of these models also go by the name Variational Autoencoders. There are ways to learn non-Gaussian graphical models (not covered here). These models are easily implemented in popular libraries such as Torch/Theano; there is a Torch implementation you can play around with at https://github.com/clinicalml/dgm

SLIDE 76

Combining Deep Learning with Graphical Models

[Layered Bayesian network: latent variables z_1, . . . , z_5 over observed x_1, x_2]

We haven’t yet talked about the parameterizations of the conditional distributions (in both p and q). One possibility is to use a neural network, which results in a powerful, highly non-linear transformation.

SLIDE 77

Generating Digits from MNIST

Figure: Generating MNIST Digits (Kingma & Welling, 2013)

SLIDE 78

Generating Faces

With a DGM trained on images of faces, let’s look at how the samples vary as we move around in the latent space:

Traversing the face manifold (Radford, 2015)
Morphing Faces (Dumoulin, 2015)
Many more such examples!

SLIDE 79

Outline

1 Introduction: Variational Bound; Summary
2 Variational Inference: Latent Dirichlet Allocation; Learning LDA; Stochastic Variational Inference
3 Deep Generative Models: Bayesian Networks & Deep-Learning; Learning; Summary of DGMs
4 Summary

SLIDE 80

Limitations of DGMs

New methods allow us to learn a broad and powerful class of generative models, but:

1 They can be tricky to learn.
2 There are no theoretical guarantees on the optimization problem.
3 Interpretability: does z really mean anything? Can you and I put a name to the quantity it represents?

SLIDE 81

Summary

There’s a lot more to do! This is an active area of research.

Probabilistic Programming: if I can write out my graphical model, can I automatically learn it using techniques from stochastic variational inference?
Tightening the bound on log p(x): how can we form better and more complex approximations to the posterior distributions?

SLIDE 82

Appendix: References

References I

Blei, David M., Ng, Andrew Y., & Jordan, Michael I. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research.

Dumoulin, Vincent. 2015. Morphing Faces. http://vdumoulin.github.io/morphing_faces/online_demo.html.

Hoffman, Matthew D., Blei, David M., Wang, Chong, & Paisley, John William. 2013. Stochastic Variational Inference. Journal of Machine Learning Research.

Kingma, Diederik P., & Welling, Max. 2013. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.

Radford, Alec. 2015. Traversing the Face Manifold. https://www.youtube.com/watch?v=XNZIN7Jh3Sg.

SLIDE 83

References II

Ranganath, Rajesh, Gerrish, Sean, & Blei, David M. 2014. Black Box Variational Inference. In: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014.

Rezende, Danilo Jimenez, Mohamed, Shakir, & Wierstra, Daan. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. arXiv preprint arXiv:1401.4082.
