SLIDE 1

Approximate Inference: Variational Inference

CMSC 691 UMBC

SLIDE 2

Goal: Posterior Inference

Hyperparameters: α
Unknown "parameters": Θ
Data

Likelihood model: p(data | Θ)
Posterior: pα(Θ | data)

SLIDE 3

(Some) Learning Techniques

  • MAP/MLE: point estimation, basic EM (what we've already covered)
  • Variational Inference: functional optimization (today)
  • Sampling/Monte Carlo (next class)

SLIDE 4

Outline

Variational Inference
  • Basic Technique
  • Variational Approximation
  • Example: Topic Models

SLIDE 5

Variational Inference: Core Idea

  • Observed x, latent r.v.s θ
  • We have some joint model p(θ, x)
  • We want to compute p(θ | x), but this is computationally difficult

SLIDE 6

Variational Inference: Core Idea

  • Observed x, latent r.v.s θ
  • We have some joint model p(θ, x)
  • We want to compute p(θ | x), but this is computationally difficult
  • Solution: approximate p(θ | x) with a different distribution qλ(θ), and make qλ(θ) "close" to p(θ | x)

SLIDE 7

Variational Inference

p(θ | x): difficult to compute

SLIDE 8

Variational Inference

p(θ | x): difficult to compute
q(θ): easy(ier) to compute, controlled by parameters λ
Minimize the "difference" by changing λ

SLIDE 9

Variational Inference

p(θ | x): difficult to compute
q(θ): easy(ier) to compute
Minimize the "difference" by changing λ

SLIDE 10

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
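As a concrete illustration, here is a minimal runnable sketch of this loop. The quadratic objective F and the fixed step size ρ are illustrative stand-ins, not from the slides; in variational inference F would be the KL-based objective defined later.

```python
# A minimal sketch of the loop above, using a toy concave objective
# F(lam) = -(lam - 3)^2 as a stand-in for the real objective.

def F(lam):                        # toy objective (illustrative stand-in)
    return -(lam - 3.0) ** 2

def grad_F(lam):                   # its gradient
    return -2.0 * (lam - 3.0)

lam = 0.0                          # pick a starting value lambda_0
for t in range(1000):
    y = F(lam)                     # 1. get value
    g = grad_F(lam)                # 2. get gradient
    rho = 0.1                      # 3. scaling factor (fixed step size here)
    lam_new = lam + rho * g        # 4. take the step
    if abs(lam_new - lam) < 1e-8:  # converged?
        break
    lam = lam_new                  # 5. t += 1 (next iteration)

print(lam)                         # approaches 3.0, the maximizer of F
```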
SLIDE 12

Variational Inference: The Function to Optimize

p(θ | x): the posterior of the desired model. q(θ): any easy-to-compute distribution.

SLIDE 13

Variational Inference: The Function to Optimize

p(θ | x): the posterior of the desired model. q(θ): any easy-to-compute distribution. Find the best such distribution q (hence "variational": calculus of variations).

SLIDE 14

Variational Inference: The Function to Optimize

Find the best distribution q. α: parameters for the desired model.

SLIDE 15

Variational Inference: The Function to Optimize

Find the best distribution q. λ: variational parameters for θ. α: parameters for the desired model.

SLIDE 16

Variational Inference: The Function to Optimize

Find the best distribution q. λ: variational parameters for θ. α: parameters for the desired model. The "difference" is the KL-divergence (an expectation):

DKL( q(θ) ‖ p(θ | x) ) = 𝔼q(θ)[ log q(θ) − log p(θ | x) ]
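For intuition, a small sketch of this KL term for discrete distributions; the function name and the toy inputs are illustrative.

```python
# A sketch of D_KL(q || p) for discrete distributions, matching the
# expectation form above: E_q[log q(theta) - log p(theta | x)].
import numpy as np

def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    support = q > 0                 # 0 * log 0 = 0 by convention
    return float(np.sum(q[support] * (np.log(q[support]) - np.log(p[support]))))

print(kl([0.5, 0.5], [0.9, 0.1]))   # > 0: the distributions differ
print(kl([0.5, 0.5], [0.5, 0.5]))   # 0.0: identical distributions
```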

SLIDE 17

Variational Inference

Find the best distribution q. λ: variational parameters for θ. α: parameters for the desired model.

SLIDE 18

Exponential Family Recap: "Easy" Expectations

Exponential Family Recap: "Easy" Posterior Inference

p is the conjugate prior for the likelihood π: the posterior then has the same form as the prior

SLIDE 19

Variational Inference

Find the best distribution q. When p and q have the same exponential-family form, the variational update for q(θ) is (often) computable in closed form.

SLIDE 20

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Let F(q(•; λt)) = KL[ q(•; λt) ‖ p(•) ]

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
SLIDE 21

Variational Inference: Maximization or Minimization?

SLIDE 22

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ

SLIDE 23

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ = log ∫ p(x, θ) · (q(θ) / q(θ)) dθ

SLIDE 24

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ = log ∫ p(x, θ) · (q(θ) / q(θ)) dθ = log 𝔼q(θ)[ p(x, θ) / q(θ) ]

SLIDE 25

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ
         = log ∫ p(x, θ) · (q(θ) / q(θ)) dθ
         = log 𝔼q(θ)[ p(x, θ) / q(θ) ]

         ≥ 𝔼q(θ)[ log p(x, θ) ] − 𝔼q(θ)[ log q(θ) ]

         = ℒ(q)
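Here is a Monte Carlo sketch of this bound on a toy conjugate model where log p(x) is known exactly; the model, names, and numbers are illustrative choices, not from the slides.

```python
# ELBO sketch: L(q) = E_q[log p(x, theta)] - E_q[log q(theta)] on the toy model
# p(theta) = N(0, 1), p(x | theta) = N(theta, 1), q(theta) = N(mu, sigma^2).
import numpy as np
from scipy.stats import norm

x = 1.5                          # one observed data point
mu, sigma = 1.0, 0.8             # variational parameters of q

rng = np.random.default_rng(0)
theta = rng.normal(mu, sigma, size=100_000)      # theta ~ q(theta)

log_joint = norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)
log_q = norm.logpdf(theta, mu, sigma)

elbo = np.mean(log_joint - log_q)
log_evidence = norm.logpdf(x, 0, np.sqrt(2.0))   # exact log p(x) for this model
print(elbo, "<=", log_evidence)                  # the ELBO lower-bounds log p(x)
```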

SLIDE 26

Jensen’s Inequality

For a concave function f:

  • weights α ∈ Δ^(K−1) (the probability simplex)
  • a sequence of points x = x1, …, xK
  • f(αᵀx) ≥ Σk αk f(xk)
  • For a convex f, flip the inequality

SLIDE 27

Jensen’s Inequality

For a concave function f:

  • weights α ∈ Δ^(K−1) (the probability simplex)
  • a sequence of points x = x1, …, xK
  • f(αᵀx) ≥ Σk αk f(xk)

For a convex f, flip the inequality:

  • f(αᵀx) ≤ Σk αk f(xk)
  • log is concave, so for variational inference: log 𝔼q[ p / q ] ≥ 𝔼q[ log p / q ]
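A quick numeric check of that step; the uniform samples below are an arbitrary positive random variable chosen purely for illustration.

```python
# Jensen's inequality for the concave log: log(E[t]) >= E[log t].
import numpy as np

t = np.random.default_rng(0).uniform(0.1, 2.0, size=100_000)
print(np.log(t.mean()), ">=", np.log(t).mean())
```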

SLIDE 28

EM: A Maximization-Maximization Procedure

Throwback:

F(θ, q) = 𝔼q[ log p(X, Z | θ) ] − 𝔼q[ log q(Z) ]

  • q: any distribution over Z
  • log p(X, Z | θ): the complete joint according to the "true" model
  • F(θ, q) lower-bounds the observed-data log-likelihood

we'll see this again with variational inference
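A small sketch of this bound on a toy two-component Gaussian mixture with one observation; the model and numbers are illustrative, not from the slides. F never exceeds the observed-data log-likelihood, with equality when q(z) = p(z | x, θ).

```python
# EM lower bound: F(theta, q) = E_q[log p(x, z | theta)] - E_q[log q(z)].
import numpy as np
from scipy.stats import norm

x = 0.3
weights, means = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # theta

# complete-data log-likelihood log p(x, z | theta) for z in {0, 1}
log_joint = np.log(weights) + norm.logpdf(x, means, 1.0)

q = np.array([0.7, 0.3])                           # any distribution over z
F = np.sum(q * log_joint) - np.sum(q * np.log(q))
log_evidence = np.log(np.sum(np.exp(log_joint)))   # log p(x | theta)
print(F, "<=", log_evidence)
```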

SLIDE 29

Steps

  • 1. Write out the objective
SLIDE 30

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s

SLIDE 31

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
SLIDE 32

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters

SLIDE 33

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters
  • 5. Optimize based on gradients, with two options:
    • 1. Closed-form solutions
      • Can lead to better convergence
      • May not be possible, or worth it, to get
    • 2. Non-closed form (e.g., Newton-like step required)

SLIDE 34

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters
  • 5. Optimize based on gradients, with two options:
    • 1. Closed-form solutions
      • Can lead to better convergence
      • May not be possible, or worth it, to get
    • 2. Non-closed form (e.g., Newton-like step required)
      • Differentiation can be handled automatically
      • Convergence can be slower
SLIDE 35

Outline

Variational Inference
  • Basic Technique
  • Variational Approximation
  • Example: Topic Models

SLIDE 36

What should q be?

Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior.

SLIDE 37

What should q be?

Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior. Therefore… q needs to be a distribution over the latent random variables.

SLIDE 38

What should q be?

Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior. Therefore… q needs to be a distribution over the latent random variables. q's precise formulation is task- and model-dependent, and q's complexity (what (in)dependence assumptions it makes) directly influences the computations.

SLIDE 39

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)
SLIDE 40

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)

Mean field:

  • Regardless of the dependencies in the true model, assume all θi are independent in the q distribution
  • Under the q distribution, each θi has its own parameters

SLIDE 41

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)

Mean field:

  • Regardless of the dependencies in the true model, assume all θi are independent in the q distribution
  • Under the q distribution, each θi has its own parameters: θi ∼ q(θi; γi)

SLIDE 42

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)

Mean field:

  • Regardless of the dependencies in the true model, assume all θi are independent in the q distribution
  • Under the q distribution, each θi has its own parameters: θi ∼ q(θi; γi)

q(Θ) = ∏i=1…M q(θi; γi)
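A minimal sketch of such a factorized q; the Gaussian factors are purely an illustrative choice, since any convenient family can be used per factor.

```python
# Mean-field q: however the true model couples theta_1, ..., theta_M,
# q treats them as independent, each with its own parameters gamma_i.
from scipy.stats import norm

gammas = [(0.0, 1.0), (2.0, 0.5), (-1.0, 2.0)]   # (mean, std) for each factor

def log_q(thetas):
    # log q(Theta) = sum_i log q(theta_i; gamma_i)
    return sum(norm.logpdf(th, m, s) for th, (m, s) in zip(thetas, gammas))

print(log_q([0.1, 1.9, -0.5]))
```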

SLIDE 43

Some General Guidelines

Easiest math occurs when:

  • Conjugacy exists in the true model
  • Family distributions in q mimic those in p
SLIDE 44

Some General Guidelines

Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
SLIDE 45

Some General Guidelines

Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
    – If θi ∼ pi and pi is a certain type of distribution (e.g., Dirichlet or normal), then qi is that same type of distribution

SLIDE 46

Outline

Variational Inference
  • Basic Technique
  • Variational Approximation
  • Example: Topic Models

SLIDE 47

Mixture Model vs. Admixture Model

  • Both consider K generating distributions
  • Mixture model
  • Admixture model
SLIDE 48

Mixture Model vs. Admixture Model

  • Both consider K generating distributions
  • Mixture model
    – Each of the N datapoints is generated from one of those K distributions
  • Admixture model
SLIDE 49

Mixture Model vs. Admixture Model

  • Both consider K generating distributions
  • Mixture model
    – Each of the N datapoints is generated from one of those K distributions
  • Admixture model
    – Each of the N datapoints is generated from a mixture of those K distributions

SLIDE 50

Bag-of-Items Models: Admixture Models

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region . …

p(document) = a distribution over its unigram counts

Unigram counts: Three: 1, people: 2, attack: 2, …

SLIDE 51

Bag-of-Items Models: Admixture Models

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region . …

pφ,ω(document) = a distribution over its unigram counts

Unigram counts: Three: 1, people: 2, attack: 2, …

Global (corpus-level) parameters interact with local (document-level) parameters

SLIDE 52

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts

SLIDE 53

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

SLIDE 54

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
SLIDE 55

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
  • 2. Each document i has general "preferences" for which topics to use
SLIDE 56

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
  • 2. Each document i has general "preferences" for which topics to use
  • 3. Each observed word j in a document i can come from a different topic
SLIDE 57

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage; per-document (unigram) word counts (count of word j in document i); per-topic word usage. K "topics": distributions over the vocabulary.

SLIDE 58

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage; per-document (unigram) word counts; per-topic word usage

SLIDE 59

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage; per-document (unigram) word counts; per-topic word usage

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
  • 2. Each document i has general "preferences" for which topics to use
  • 3. Each observed word j in a document i can come from a different topic

Explicit conditioning left off (for space)
SLIDE 64

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

Explicit conditioning left off (for space)
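A short sketch of this generative story in code; the sizes, hyperparameters, and fixed document length are illustrative simplifications.

```python
# LDA's generative process in the notation above.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 20, 5, 30        # topics, vocab size, documents, words/doc
beta, alpha = 0.1, 0.5           # Dirichlet hyperparameters (illustrative)

pi = rng.dirichlet(np.full(V, beta), size=K)      # K topics over the vocab
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))      # document's topic preferences
    z = rng.choice(K, size=N, p=theta)            # one topic per token
    w = np.array([rng.choice(V, p=pi[k]) for k in z])  # each word from its topic
    docs.append(w)
print(docs[0])                   # word ids of the first document
```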
SLIDE 65

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

q: mean-field approximation

Explicit conditioning left off (for space)
SLIDE 66

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

q: mean-field approximation
  πk ∼ Dirichlet(λk)
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

Explicit conditioning left off (for space)
SLIDE 67

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

q: mean-field approximation
  πk ∼ Dirichlet(λk)
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

Explicit conditioning left off (for space)

Notice: full independence, no shared parameters!!!

SLIDE 68

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Let F(q(•; λt)) = KL[ q(•; λt) ‖ p(•) ]

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
SLIDE 69

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ]

SLIDE 70

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = 𝔼q(θ(d))[ (α − 1)ᵀ log θ(d) ] + C

exponential family form of the Dirichlet:

p(θ) = ( Γ(Σk αk) / ∏k Γ(αk) ) ∏k θk^(αk − 1)

natural params = (αk − 1)k;  suff. stats. = (log θk)k
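This is what makes the expectation "easy": for θ ∼ Dirichlet(a), 𝔼[log θk] = ψ(ak) − ψ(Σj aj), the gradient of the Dirichlet's log normalizer (ψ is the digamma function). A quick check, with illustrative numbers:

```python
# E[log theta_k] for a Dirichlet, analytic vs. Monte Carlo.
import numpy as np
from scipy.special import digamma

a = np.array([2.0, 5.0, 1.0])
analytic = digamma(a) - digamma(a.sum())

mc = np.log(np.random.default_rng(0).dirichlet(a, size=200_000)).mean(axis=0)
print(analytic)    # matches the Monte Carlo estimate
print(mc)
```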

SLIDE 71

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = 𝔼q(θ(d))[ (α − 1)ᵀ log θ(d) ] + C

this is an expectation of the sufficient statistics under the q distribution:

q's natural params = (γd,k − 1)k;  suff. stats. = (log θk)k

SLIDE 72

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = (α − 1)ᵀ 𝔼q(θ(d))[ log θ(d) ] + C

the expectation of the sufficient statistics is the gradient of the log normalizer

SLIDE 73

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = (α − 1)ᵀ ∇γd A(γd − 1) + C

the expectation of the sufficient statistics is the gradient of the log normalizer A, evaluated at q's natural parameters γd − 1

SLIDE 74

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = (α − 1)ᵀ ∇γd A(γd − 1) + C

ℒ|γd = (α − 1)ᵀ ∇γd A(γd − 1) + M(γd), where M(γd) collects the objective's other γd-dependent terms

there’s more math to do!

SLIDE 75

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Let F(q(•; λt)) = KL[ q(•; λt) ‖ p(•) ]

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
SLIDE 76

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

ℒ|γd = (α − 1)ᵀ ∇γd A(γd − 1) + M(γd)

∇γd ℒ|γd = (α − 1)ᵀ ∇²γd A(γd − 1) + ∇γd M(γd)

SLIDE 77

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

ℒ|γd = (α − 1)ᵀ ∇γd A(γd − 1) + M(γd)

∇γd ℒ|γd = (α − 1)ᵀ ∇²γd A(γd − 1) + ∇γd M(γd)

analytically solve this for faster convergence (Blei et al., 2003)
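For reference, setting these gradients to zero yields the familiar closed-form coordinate updates of Blei et al. (2003). Here is a sketch in the notation above, assuming a scalar symmetric α and given topic-side Dirichlet parameters λ (`lam`, shape K × V); it is an illustration, not the full algorithm.

```python
# Closed-form coordinate updates for one document: psi are the per-token
# discrete variational parameters, gamma the document's Dirichlet parameters.
import numpy as np
from scipy.special import digamma

def update_document(w, lam, alpha, iters=50):
    # w: array of word ids for one document
    K = lam.shape[0]
    gamma = np.full(K, alpha + len(w) / K)               # standard initialization
    E_log_pi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    for _ in range(iters):
        # psi[n, k] proportional to exp(E[log theta_k] + E[log pi_{k, w_n}])
        log_psi = digamma(gamma) + E_log_pi[:, w].T
        log_psi -= log_psi.max(axis=1, keepdims=True)    # numerical stability
        psi = np.exp(log_psi)
        psi /= psi.sum(axis=1, keepdims=True)
        gamma = alpha + psi.sum(axis=0)                  # gamma_k = alpha + sum_n psi_nk
    return gamma, psi
```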

SLIDE 78

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters
  • 5. Optimize based on gradients, with two options:
    • 1. Closed-form solutions
      • Can lead to better convergence
      • May not be possible, or worth it, to get
    • 2. Non-closed form (e.g., Newton-like step required)
      • Differentiation can be handled automatically
      • Convergence can be slower
SLIDE 79

Some General Guidelines, Recap

"Analytically solve this for faster convergence." Obviously, the math can be intimidating: don't let that deter you! Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
    – If θi ∼ pi and pi is a certain type of distribution (e.g., Dirichlet or normal), then qi is that same type of distribution
SLIDE 80

Some General Guidelines, Recap

"Analytically solve this for faster convergence." Obviously, the math can be intimidating: don't let that deter you! Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
    – If θi ∼ pi and pi is a certain type of distribution (e.g., Dirichlet or normal), then qi is that same type of distribution

Alternatively: perform neural variational inference (e.g., variational autoencoder)
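A minimal sketch of that neural route: an encoder network amortizes q's parameters, and a Monte Carlo ELBO is optimized via the reparameterization trick (θ = μ + σ·ε). The architecture and sizes here are illustrative, not from the slides.

```python
# One training step of a toy VAE-style objective.
import torch
import torch.nn as nn

enc = nn.Linear(10, 2 * 4)     # x (10-d) -> mean and log-std of q(theta | x) (4-d)
dec = nn.Linear(4, 10)         # theta -> mean of p(x | theta)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(32, 10)        # a toy batch of "observations"
mu, log_sig = enc(x).chunk(2, dim=-1)
theta = mu + log_sig.exp() * torch.randn_like(mu)    # theta ~ q(theta | x)

log_p = -0.5 * ((x - dec(theta)) ** 2).sum(-1)       # Gaussian log-lik (up to a constant)
kl = 0.5 * (mu**2 + (2 * log_sig).exp() - 1 - 2 * log_sig).sum(-1)  # KL(q || N(0, I))
loss = -(log_p - kl).mean()    # negative ELBO

opt.zero_grad()
loss.backward()
opt.step()
```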

SLIDE 81

Variational Inference: Core Idea

  • Observed x, latent r.v.s θ
  • We have some joint model p(θ, x)
  • We want to compute p(θ | x), but this is computationally difficult
  • Solution: approximate p(θ | x) with a different distribution qλ(θ), and make qλ(θ) "close" to p(θ | x)

Basic Technique · Variational Approximation · Example: Topic Models