Approximate Inference: Variational Inference
CMSC 691 UMBC
Goal: Posterior Inference
Hyperparameters α, unknown "parameters" Θ, data x
Likelihood model: p(x | Θ)
Goal: the posterior p_α(Θ | x)
(Some) Learning Techniques
– MAP/MLE: point estimation, basic EM (what we've already covered)
– Variational inference: functional optimization (today)
– Sampling/Monte Carlo (next class)
Outline
Variational Inference
– Basic technique
– Variational approximation
– Example: topic models
Variational Inference: Core Idea
Computing the posterior p(θ | x) is computationally difficult.
Idea: replace p(θ | x) with a different distribution q_λ(θ), and make q_λ(θ) "close" to p(θ | x).
Variational Inference
p(θ | x): difficult to compute.
q(θ): easy(ier) to compute, controlled by parameters λ.
Minimize the "difference" between them by changing λ.
Variational Inference: A Gradient-Based Optimization Technique
Set t = 0
Pick a starting value λ_t
Until converged:
1. Compute the gradient of the objective, ∇F(λ_t)
2. Update λ_{t+1} ← λ_t − η_t ∇F(λ_t)
3. Set t ← t + 1
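A minimal sketch of this loop on a toy objective chosen so the gradient is trivial (an assumption for illustration: q = N(λ, 1), p = N(0, 1), so F(λ) = KL(q ‖ p) = λ²/2 and ∇F(λ) = λ):

def grad_F(lam):
    # gradient of F(lam) = KL(N(lam, 1) || N(0, 1)) = lam^2 / 2
    return lam

lam = 5.0            # pick a starting value lambda_0
step_size = 0.1      # eta
for t in range(1000):
    g = grad_F(lam)
    if abs(g) < 1e-8:        # converged
        break
    lam -= step_size * g     # lambda_{t+1} = lambda_t - eta * grad F(lambda_t)

print(lam)   # approaches 0, where q exactly matches p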
Variational Inference: The Function to Optimize
Find the best distribution (calculus of variations):
– p(θ | x; α): the posterior of the desired model, with parameters α for the desired model
– q(θ; λ): any easy-to-compute distribution, with variational parameters λ for θ
Measure the "difference" with the KL divergence (an expectation):
D_KL(q(θ) ‖ p(θ | x)) = 𝔼_{q(θ)}[ log ( q(θ) / p(θ | x) ) ]
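A small numeric illustration of this KL divergence, for two made-up discrete distributions (illustrative values only):

import numpy as np

q = np.array([0.7, 0.2, 0.1])   # easy-to-compute approximation
p = np.array([0.5, 0.3, 0.2])   # target (posterior) probabilities
kl = np.sum(q * (np.log(q) - np.log(p)))   # E_q[log q - log p]
print(kl)   # nonnegative; zero iff q == p everywhere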
Exponential Family Recap: "Easy" Expectations; "Easy" Posterior Inference (p is the conjugate prior for π)
Variational Inference
Find the best distribution: when p and q have the same exponential-family form, the variational update for q(θ) is (often) computable in closed form.
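To make "closed form" concrete: with a conjugate pair, the update is just parameter arithmetic. A standard Beta-Bernoulli example (a textbook illustration, not specific to this deck):

# Beta(a, b) prior on a coin's bias; Bernoulli observations.
# Conjugacy makes the exact posterior Beta(a + heads, b + tails).
a, b = 2.0, 2.0                  # prior pseudo-counts
data = [1, 0, 1, 1, 0, 1]        # observed coin flips
heads = sum(data)
tails = len(data) - heads
posterior = (a + heads, b + tails)
print(posterior)                 # (6.0, 4.0): no iterative optimization needed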
Variational Inference: A Gradient-Based Optimization Technique
Set t = 0
Pick a starting value λ_t
Let F(q(•; λ_t)) = KL[q(•; λ_t) ‖ p(•)]
Until converged: update λ_{t+1} ← λ_t − η_t ∇F(λ_t)
Variational Inference: Maximization or Minimization?
Evidence Lower Bound (ELBO)
log p(x) = log ∫ p(x, θ) dθ
         = log ∫ p(x, θ) (q(θ) / q(θ)) dθ
         = log 𝔼_{q(θ)}[ p(x, θ) / q(θ) ]
         ≥ 𝔼_{q(θ)}[ log p(x, θ) ] − 𝔼_{q(θ)}[ log q(θ) ]
         = ℒ(q)
Maximizing the lower bound ℒ(q) is equivalent to minimizing KL(q(θ) ‖ p(θ | x)): the two differ by the constant log p(x).
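When the ELBO has no closed form, it can still be estimated by Monte Carlo: sample θ from q and average log p(x, θ) − log q(θ). A sketch on a toy conjugate model (the model and numbers are assumptions for illustration):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1); observe one x.
x = 1.5
m, s = 0.5, 0.8                     # q(theta) = N(m, s^2)

theta = rng.normal(m, s, size=100_000)                             # theta ~ q
log_joint = norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)    # log p(x, theta)
log_q = norm.logpdf(theta, m, s)                                   # log q(theta)
elbo = np.mean(log_joint - log_q)

# Here log p(x) is known exactly: marginally x ~ N(0, 2).
print(elbo, norm.logpdf(x, 0, np.sqrt(2)))   # ELBO <= log p(x)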
Jensen’s Inequality
For a concave function 𝑔
𝑦1, … , 𝑦𝐿
inequality
Jensen’s Inequality
For a concave function 𝑔
𝑦1, … , 𝑦𝐿
For convex f, flip inequality
variational inference: log 𝔽 𝑞 𝑟 ≤ 𝔽 log 𝑞 𝑟
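A two-line numeric check of the concave case (illustrative only):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 3.0, size=100_000)    # any positive random variable

# log is concave, so log E[x] >= E[log x]
print(np.log(x.mean()), np.log(x).mean())  # the first number is larger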
EM: A Maximization-Maximization Procedure
F(θ, q) = 𝔼_q[ log p(x, z | θ) ] − 𝔼_q[ log q(z) ]
where log p(x, z | θ) is the log-likelihood and q(z) is any distribution over the latent z.
We'll see this again with variational inference.
Throwback
log p(x, z | θ) is the complete joint according to the "true" model.
Steps
1. Write out the objective (the ELBO) and simplify and expand it
– In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
2. Optimize with respect to the variational parameters
Outline
Variational Inference
– Basic technique
– Variational approximation
– Example: topic models
What should q be?
Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior.
Therefore, q needs to be a distribution over the latent random variables.
q's precise formulation is task- and model-dependent.
q's complexity (what (in)dependence assumptions it makes) directly influences the computations.
Very common: Mean Field Approximation
Let the observed variables be X = {x₁, …, x_N} and the latent variables be Θ = {θ₁, …, θ_M}.
Mean field: regardless of the dependencies in the true model, assume all θ_j are independent in the q distribution.
In the q distribution, each θ_j has its own parameters: θ_j ∼ q(θ_j; γ_j), so
q(Θ) = ∏_{j=1}^{M} q(θ_j; γ_j)
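Independence means the mean-field log density is just a sum over per-variable factors, each with its own parameters. A small sketch (the Beta factors and numbers are hypothetical):

from scipy.stats import beta

params = [(2.0, 5.0), (1.0, 1.0), (4.0, 2.0)]   # one (a, b) per theta_j
theta = [0.3, 0.7, 0.6]                          # a point at which to evaluate q

# q(Theta) factorizes, so log q(Theta) = sum_j log q(theta_j; gamma_j)
log_q = sum(beta.logpdf(t, a, b) for t, (a, b) in zip(theta, params))
print(log_q)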
Some General Guidelines
The easiest math occurs when:
– If θ_j ∼ p_j and θ_k | θ_j ∼ p_k, then p_j is chosen to be the conjugate prior of p_k
– If θ_j ∼ p_j and p_j is a certain type of distribution (e.g., Dirichlet or normal), then q_j is that same type of distribution
Outline
Variational Inference
– Basic technique
– Variational approximation
– Example: topic models
Mixture Model vs. Admixture Model
Both begin with K distributions.
– Mixture: each of the N datapoints is generated from one of those K distributions
– Admixture: each of the N datapoints is generated from a mixture of those K distributions
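The difference is easy to see in sampling code. A sketch with made-up sizes (K components, V vocabulary items; all values hypothetical):

import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 5
components = rng.dirichlet(np.ones(V), size=K)   # K distributions over V items

# Mixture: ONE component generates the whole datapoint
k = rng.choice(K)
mixture_doc = rng.choice(V, size=10, p=components[k])

# Admixture: each draw may use a DIFFERENT component,
# mixed by a per-datapoint proportion vector
proportions = rng.dirichlet(np.ones(K))
zs = rng.choice(K, size=10, p=proportions)
admixture_doc = np.array([rng.choice(V, p=components[z]) for z in zs])

print(mixture_doc, admixture_doc)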
Bag-of-Items Models: Admixture Models
"Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region. …"
Unigram counts: Three: 1, people: 2, attack: 2, …
Global (corpus-level) parameters interact with local (document-level) parameters.
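Bag-of-words featurization is just counting. For the excerpt above (the slide's counts cover the full document hinted at by the "…", so they can differ):

from collections import Counter

text = ("Three people have been fatally shot , and five people , including "
        "a mayor , were seriously wounded as a result of a Shining Path "
        "attack today against a community in Junin department , central "
        "Peruvian mountain region .")
counts = Counter(text.split())
print(counts["people"], counts["attack"])   # 2 and 1 in this excerpt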
Latent Dirichlet Allocation (Blei et al., 2003)
Data: per-document (unigram) word counts, i.e., the count of word j in document i.
Core assumptions:
1. Each document is a bag of words
2. Each document has its own preferences for which topics to use
3. Each word (token) can come from a different topic
Model components:
– Per-document (latent) topic usage
– Per-topic word usage: K "topics," each a distribution over the vocabulary
– Per-document (unigram) word counts (observed)
(Explicit conditioning left off.)
Variational Inference: LDirA
Topic usage · per-document (unigram) word counts · topic words
p: True model
π_k ∼ Dirichlet(β)   (per-topic word distribution)
θ^(d) ∼ Dirichlet(α)   (per-document topic usage)
z^(d,n) ∼ Discrete(θ^(d))
w^(d,n) ∼ Discrete(π_{z^(d,n)})
q: Mean-field approximation
π_k ∼ Dirichlet(λ_k)
θ^(d) ∼ Dirichlet(γ_d)
z^(d,n) ∼ Discrete(ψ^(d,n))
(Explicit conditioning left off.)
Notice: full independence, no shared parameters!
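The true model p is a short generative program. A sketch with made-up sizes (K, V, D, N and the hyperparameters are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 20, 4, 15
beta_hp, alpha_hp = 0.1, 0.5     # hypothetical Dirichlet hyperparameters

pi = rng.dirichlet(np.full(V, beta_hp), size=K)          # pi_k ~ Dirichlet(beta)
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha_hp))          # theta^(d) ~ Dirichlet(alpha)
    z = rng.choice(K, size=N, p=theta)                   # z^(d,n) ~ Discrete(theta^(d))
    w = np.array([rng.choice(V, p=pi[zn]) for zn in z])  # w^(d,n) ~ Discrete(pi_{z^(d,n)})
    docs.append(w)
print(docs[0])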
Variational Inference: A Gradient-Based Optimization Technique (recap)
Set t = 0; pick a starting value λ_t
Let F(q(•; λ_t)) = KL[q(•; λ_t) ‖ p(•)]
Until converged: take gradient steps on λ
Variational Inference: LDirA Topic Proportions
p: true model: θ^(d) ∼ Dirichlet(α); z^(d,n) ∼ Discrete(θ^(d))
q: mean-field approximation: θ^(d) ∼ Dirichlet(γ_d); z^(d,n) ∼ Discrete(ψ^(d,n))

One ELBO term, the prior term for θ^(d):
𝔼_{q(θ^(d))}[ log p(θ^(d) | α) ]
= 𝔼_{q(θ^(d))}[ (α − 1)ᵀ log θ^(d) ] + C   (exponential-family form)
= (α − 1)ᵀ 𝔼_{q(θ^(d))}[ log θ^(d) ] + C
= (α − 1)ᵀ ∇A(γ_d − 1) + C   (the expectation of the sufficient statistics is the gradient of the log normalizer A)

Exponential-family form of the Dirichlet:
p(θ) = ( Γ(Σ_k α_k) / ∏_k Γ(α_k) ) ∏_k θ_k^{α_k − 1}
natural parameters [α_k − 1]_k; sufficient statistics [log θ_k]_k
The q distribution has the same sufficient statistics [log θ_k]_k, with natural parameters [γ_{d,k} − 1]_k.

Collecting the ELBO terms that involve γ_d:
ℒ|_{γ_d} = (α − 1)ᵀ ∇A(γ_d − 1) + M(γ_d)
where M(γ_d) collects the remaining γ_d-dependent terms. There's more math to do!
Variational Inference: A Gradient-Based Optimization Technique (recap)
Set t = 0; pick a starting value λ_t
Let F(q(•; λ_t)) = KL[q(•; λ_t) ‖ p(•)]
Until converged: take gradient steps on λ
Variational Inference: LDirA Topic Proportions (continued)
p: true model: θ^(d) ∼ Dirichlet(α); z^(d,n) ∼ Discrete(θ^(d))
q: mean-field approximation: θ^(d) ∼ Dirichlet(γ_d); z^(d,n) ∼ Discrete(ψ^(d,n))
ℒ|_{γ_d} = (α − 1)ᵀ ∇A(γ_d − 1) + M(γ_d)
∇_{γ_d} ℒ|_{γ_d} = (α − 1)ᵀ ∇²A(γ_d − 1) + ∇_{γ_d} M(γ_d)
Rather than iterating gradient steps, analytically solve this for faster convergence (Blei et al., 2003).
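Setting this gradient to zero yields the closed-form coordinate update of Blei et al. (2003): γ_d = α + Σ_n ψ^(d,n). A sketch (the numbers are hypothetical):

import numpy as np

alpha = np.array([0.5, 0.5, 0.5])        # K = 3 topics
psi_d = np.array([[0.9, 0.05, 0.05],     # one row per token in document d:
                  [0.1, 0.8, 0.1],       # q's distribution over that token's topic
                  [0.3, 0.3, 0.4]])
gamma_d = alpha + psi_d.sum(axis=0)      # gamma_d = alpha + sum_n psi^(d,n)
print(gamma_d)                           # updated Dirichlet parameters for theta^(d)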
Steps (recap)
1. Write out the objective (the ELBO) and simplify and expand it
– In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
2. Optimize with respect to the variational parameters
Some General Guidelines, Recap
"Analytically solve this for faster convergence." Obviously, the math can be intimidating: don't let that deter you!
The easiest math occurs when:
– If θ_j ∼ p_j and θ_k | θ_j ∼ p_k, then p_j is chosen to be the conjugate prior of p_k
– If θ_j ∼ p_j and p_j is a certain type of distribution (e.g., Dirichlet or normal), then q_j is that same type of distribution
Alternatively: perform neural variational inference (e.g., with a variational autoencoder)
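The neural route replaces conjugate algebra with gradients through sampling. A minimal reparameterization-trick sketch on a toy Gaussian model (the model, learning rate, and sample sizes are all assumptions for illustration; this is the core move behind variational autoencoders, not a full one):

import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1); q(theta) = N(m, exp(log_s)^2).
x, m, log_s = 1.5, 0.0, 0.0
lr = 0.05
for step in range(2000):
    eps = rng.normal(size=64)
    s = np.exp(log_s)
    theta = m + s * eps                  # reparameterize: theta = m + s * eps
    # d log p(x, theta) / d theta, for log p = -theta^2/2 - (x - theta)^2/2 + const
    dlogp = -theta + (x - theta)
    grad_m = -np.mean(dlogp)                        # descend the negative ELBO
    grad_log_s = -np.mean(dlogp * s * eps) - 1.0    # -1.0 from q's entropy term, log s
    m -= lr * grad_m
    log_s -= lr * grad_log_s
print(m, np.exp(log_s))   # approaches the exact posterior N(x/2, 1/2): (0.75, 0.707)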
Variational Inference: Core Idea (Summary)
We want the posterior of the joint model p(θ, x), but computing p(θ | x) is computationally difficult.
So: replace p(θ | x) with a different distribution q_λ(θ), and make q_λ(θ) "close" to p(θ | x).
Covered: basic technique · variational approximation · example: topic models