SLIDE 1

Approximate Inference: Variational Inference

CMSC 691 UMBC

SLIDE 2

Goal: Posterior Inference

Hyperparameters: α
Unknown "parameters": Θ
Data

Likelihood model: p(data | Θ)
Posterior: pα(Θ | data)

SLIDE 3

(Some) Learning Techniques

  • MAP/MLE: point estimation, basic EM (what we've already covered)
  • Variational Inference: functional optimization (today)
  • Sampling/Monte Carlo (next class)

SLIDE 4

Outline

Variational Inference
  • Basic Technique
  • Variational Approximation
  • Example: Topic Models

SLIDE 5

Variational Inference: Core Idea

  • Observed x, latent r.v.s θ
  • We have some joint model p(θ, x)
  • We want to compute p(θ | x), but this is computationally difficult

SLIDE 6

Variational Inference: Core Idea

  • Observed x, latent r.v.s θ
  • We have some joint model p(θ, x)
  • We want to compute p(θ | x), but this is computationally difficult
  • Solution: approximate p(θ | x) with a different distribution qλ(θ), and make qλ(θ) "close" to p(θ | x)

SLIDE 7

Variational Inference

p(θ | x): difficult to compute

SLIDE 8

Variational Inference

p(θ | x): difficult to compute
q(θ): easy(ier) to compute, controlled by parameters λ
Minimize the "difference" by changing λ

SLIDE 9

Variational Inference

p(θ | x): difficult to compute
q(θ): easy(ier) to compute
Minimize the "difference" by changing λ

SLIDE 10

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
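As a concrete illustration, here is a minimal runnable sketch of this loop. The quadratic objective F and the fixed step size ρ are illustrative stand-ins, not from the slides; in variational inference F would be the KL-based objective defined later.

```python
# A minimal sketch of the loop above, using a toy concave objective
# F(lam) = -(lam - 3)^2 as a stand-in for the real objective.

def F(lam):                        # toy objective (illustrative stand-in)
    return -(lam - 3.0) ** 2

def grad_F(lam):                   # its gradient
    return -2.0 * (lam - 3.0)

lam = 0.0                          # pick a starting value lambda_0
for t in range(1000):
    y = F(lam)                     # 1. get value
    g = grad_F(lam)                # 2. get gradient
    rho = 0.1                      # 3. scaling factor (fixed step size here)
    lam_new = lam + rho * g        # 4. take the step
    if abs(lam_new - lam) < 1e-8:  # converged?
        break
    lam = lam_new                  # 5. t += 1 (next iteration)

print(lam)                         # approaches 3.0, the maximizer of F
```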
SLIDE 12

Variational Inference: The Function to Optimize

p(θ | x): the posterior of the desired model. q(θ): any easy-to-compute distribution.

SLIDE 13

Variational Inference: The Function to Optimize

p(θ | x): the posterior of the desired model. q(θ): any easy-to-compute distribution. Find the best such distribution q (hence "variational": calculus of variations).

SLIDE 14

Variational Inference: The Function to Optimize

Find the best distribution q. α: parameters for the desired model.

SLIDE 15

Variational Inference: The Function to Optimize

Find the best distribution q. λ: variational parameters for θ. α: parameters for the desired model.

SLIDE 16

Variational Inference: The Function to Optimize

Find the best distribution q. λ: variational parameters for θ. α: parameters for the desired model. The "difference" is the KL-divergence (an expectation):

DKL( q(θ) ‖ p(θ | x) ) = 𝔼q(θ)[ log q(θ) − log p(θ | x) ]
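For intuition, a small sketch of this KL term for discrete distributions; the function name and the toy inputs are illustrative.

```python
# A sketch of D_KL(q || p) for discrete distributions, matching the
# expectation form above: E_q[log q(theta) - log p(theta | x)].
import numpy as np

def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    support = q > 0                 # 0 * log 0 = 0 by convention
    return float(np.sum(q[support] * (np.log(q[support]) - np.log(p[support]))))

print(kl([0.5, 0.5], [0.9, 0.1]))   # > 0: the distributions differ
print(kl([0.5, 0.5], [0.5, 0.5]))   # 0.0: identical distributions
```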

SLIDE 17

Variational Inference

Find the best distribution q. λ: variational parameters for θ. α: parameters for the desired model.

SLIDE 18

Exponential Family Recap: "Easy" Expectations

Exponential Family Recap: "Easy" Posterior Inference

p is the conjugate prior for the likelihood π: the posterior then has the same form as the prior

SLIDE 19

Variational Inference

Find the best distribution q. When p and q have the same exponential-family form, the variational update for q(θ) is (often) computable in closed form.

SLIDE 20

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Let F(q(•; λt)) = KL[ q(•; λt) ‖ p(•) ]

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
SLIDE 21

Variational Inference: Maximization or Minimization?

SLIDE 22

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ

SLIDE 23

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ = log ∫ p(x, θ) · (q(θ) / q(θ)) dθ

SLIDE 24

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ = log ∫ p(x, θ) · (q(θ) / q(θ)) dθ = log 𝔼q(θ)[ p(x, θ) / q(θ) ]

SLIDE 25

Evidence Lower Bound (ELBO)

log p(x) = log ∫ p(x, θ) dθ
         = log ∫ p(x, θ) · (q(θ) / q(θ)) dθ
         = log 𝔼q(θ)[ p(x, θ) / q(θ) ]

         ≥ 𝔼q(θ)[ log p(x, θ) ] − 𝔼q(θ)[ log q(θ) ]

         = ℒ(q)
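Here is a Monte Carlo sketch of this bound on a toy conjugate model where log p(x) is known exactly; the model, names, and numbers are illustrative choices, not from the slides.

```python
# ELBO sketch: L(q) = E_q[log p(x, theta)] - E_q[log q(theta)] on the toy model
# p(theta) = N(0, 1), p(x | theta) = N(theta, 1), q(theta) = N(mu, sigma^2).
import numpy as np
from scipy.stats import norm

x = 1.5                          # one observed data point
mu, sigma = 1.0, 0.8             # variational parameters of q

rng = np.random.default_rng(0)
theta = rng.normal(mu, sigma, size=100_000)      # theta ~ q(theta)

log_joint = norm.logpdf(theta, 0, 1) + norm.logpdf(x, theta, 1)
log_q = norm.logpdf(theta, mu, sigma)

elbo = np.mean(log_joint - log_q)
log_evidence = norm.logpdf(x, 0, np.sqrt(2.0))   # exact log p(x) for this model
print(elbo, "<=", log_evidence)                  # the ELBO lower-bounds log p(x)
```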

SLIDE 26

Jensen’s Inequality

For a concave function f:

  • weights α ∈ Δ^(K−1) (the probability simplex)
  • a sequence of points x = x1, …, xK
  • f(αᵀx) ≥ Σk αk f(xk)
  • For a convex f, flip the inequality

SLIDE 27

Jensen’s Inequality

For a concave function f:

  • weights α ∈ Δ^(K−1) (the probability simplex)
  • a sequence of points x = x1, …, xK
  • f(αᵀx) ≥ Σk αk f(xk)

For a convex f, flip the inequality:

  • f(αᵀx) ≤ Σk αk f(xk)
  • log is concave, so for variational inference: log 𝔼q[ p / q ] ≥ 𝔼q[ log p / q ]
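A quick numeric check of that step; the uniform samples below are an arbitrary positive random variable chosen purely for illustration.

```python
# Jensen's inequality for the concave log: log(E[t]) >= E[log t].
import numpy as np

t = np.random.default_rng(0).uniform(0.1, 2.0, size=100_000)
print(np.log(t.mean()), ">=", np.log(t).mean())
```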

SLIDE 28

EM: A Maximization-Maximization Procedure

Throwback:

F(θ, q) = 𝔼q[ log p(X, Z | θ) ] − 𝔼q[ log q(Z) ]

  • q: any distribution over Z
  • log p(X, Z | θ): the complete joint according to the "true" model
  • F(θ, q) lower-bounds the observed-data log-likelihood

we'll see this again with variational inference
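A small sketch of this bound on a toy two-component Gaussian mixture with one observation; the model and numbers are illustrative, not from the slides. F never exceeds the observed-data log-likelihood, with equality when q(z) = p(z | x, θ).

```python
# EM lower bound: F(theta, q) = E_q[log p(x, z | theta)] - E_q[log q(z)].
import numpy as np
from scipy.stats import norm

x = 0.3
weights, means = np.array([0.5, 0.5]), np.array([-1.0, 1.0])   # theta

# complete-data log-likelihood log p(x, z | theta) for z in {0, 1}
log_joint = np.log(weights) + norm.logpdf(x, means, 1.0)

q = np.array([0.7, 0.3])                           # any distribution over z
F = np.sum(q * log_joint) - np.sum(q * np.log(q))
log_evidence = np.log(np.sum(np.exp(log_joint)))   # log p(x | theta)
print(F, "<=", log_evidence)
```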

SLIDE 29

Steps

  • 1. Write out the objective
SLIDE 30

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s

SLIDE 31

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
SLIDE 32

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters

SLIDE 33

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters
  • 5. Optimize based on gradients, with two options:
    • 1. Closed-form solutions
      • Can lead to better convergence
      • May not be possible, or worth it, to get
    • 2. Non-closed form (e.g., Newton-like step required)

SLIDE 34

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters
  • 5. Optimize based on gradients, with two options:
    • 1. Closed-form solutions
      • Can lead to better convergence
      • May not be possible, or worth it, to get
    • 2. Non-closed form (e.g., Newton-like step required)
      • Differentiation can be handled automatically
      • Convergence can be slower
SLIDE 35

Outline

Variational Inference
  • Basic Technique
  • Variational Approximation
  • Example: Topic Models

SLIDE 36

What should q be?

Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior.

SLIDE 37

What should q be?

Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior. Therefore… q needs to be a distribution over the latent random variables.

SLIDE 38

What should q be?

Terminology: p (our generative story) is our "true" model; q approximates the "true" model's posterior. Therefore… q needs to be a distribution over the latent random variables. q's precise formulation is task- and model-dependent, and q's complexity (what (in)dependence assumptions it makes) directly influences the computations.

SLIDE 39

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)
SLIDE 40

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)

Mean field:

  • Regardless of the dependencies in the true model, assume all θi are independent in the q distribution
  • Under the q distribution, each θi has its own parameters

SLIDE 41

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)

Mean field:

  • Regardless of the dependencies in the true model, assume all θi are independent in the q distribution
  • Under the q distribution, each θi has its own parameters: θi ∼ q(θi; γi)

SLIDE 42

Very common: Mean Field Approximation

  • Let the observed data be X = {x1, …, xN}
  • Let the latent random variables be Θ = {θ1, …, θM}
  • Goal: learn q(Θ)

Mean field:

  • Regardless of the dependencies in the true model, assume all θi are independent in the q distribution
  • Under the q distribution, each θi has its own parameters: θi ∼ q(θi; γi)

q(Θ) = ∏i=1…M q(θi; γi)
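A minimal sketch of such a factorized q; the Gaussian factors are purely an illustrative choice, since any convenient family can be used per factor.

```python
# Mean-field q: however the true model couples theta_1, ..., theta_M,
# q treats them as independent, each with its own parameters gamma_i.
from scipy.stats import norm

gammas = [(0.0, 1.0), (2.0, 0.5), (-1.0, 2.0)]   # (mean, std) for each factor

def log_q(thetas):
    # log q(Theta) = sum_i log q(theta_i; gamma_i)
    return sum(norm.logpdf(th, m, s) for th, (m, s) in zip(thetas, gammas))

print(log_q([0.1, 1.9, -0.5]))
```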

SLIDE 43

Some General Guidelines

Easiest math occurs when:

  • Conjugacy exists in the true model
  • Family distributions in q mimic those in p
SLIDE 44

Some General Guidelines

Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
SLIDE 45

Some General Guidelines

Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
    – If θi ∼ pi and pi is a certain type of distribution (e.g., Dirichlet or normal), then qi is that same type of distribution

SLIDE 46

Outline

Variational Inference
  • Basic Technique
  • Variational Approximation
  • Example: Topic Models

SLIDE 47

Mixture Model vs. Admixture Model

  • Both consider K generating distributions
  • Mixture model
  • Admixture model
SLIDE 48

Mixture Model vs. Admixture Model

  • Both consider K generating distributions
  • Mixture model
    – Each of the N datapoints is generated from one of those K distributions
  • Admixture model
SLIDE 49

Mixture Model vs. Admixture Model

  • Both consider K generating distributions
  • Mixture model
    – Each of the N datapoints is generated from one of those K distributions
  • Admixture model
    – Each of the N datapoints is generated from a mixture of those K distributions

SLIDE 50

Bag-of-Items Models: Admixture Models

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region . …

p(document) = a distribution over its unigram counts

Unigram counts: Three: 1, people: 2, attack: 2, …

SLIDE 51

Bag-of-Items Models: Admixture Models

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region . …

pφ,ω(document) = a distribution over its unigram counts

Unigram counts: Three: 1, people: 2, attack: 2, …

Global (corpus-level) parameters interact with local (document-level) parameters

SLIDE 52

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts

SLIDE 53

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

SLIDE 54

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
SLIDE 55

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
  • 2. Each document i has general "preferences" for which topics to use
SLIDE 56

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (unigram) word counts: count of word j in document i

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
  • 2. Each document i has general "preferences" for which topics to use
  • 3. Each observed word j in a document i can come from a different topic
SLIDE 57

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage; per-document (unigram) word counts (count of word j in document i); per-topic word usage. K "topics": distributions over the vocabulary.

SLIDE 58

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage; per-document (unigram) word counts; per-topic word usage

SLIDE 59

Latent Dirichlet Allocation (Blei et al., 2003)

Per-document (latent) topic usage; per-document (unigram) word counts; per-topic word usage

Core assumptions:

  • 1. K "topics": distributions over possible vocab words
  • 2. Each document i has general "preferences" for which topics to use
  • 3. Each observed word j in a document i can come from a different topic

Explicit conditioning left off (for space)
SLIDE 64

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

Explicit conditioning left off (for space)
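A short sketch of this generative story in code; the sizes, hyperparameters, and fixed document length are illustrative simplifications.

```python
# LDA's generative process in the notation above.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 20, 5, 30        # topics, vocab size, documents, words/doc
beta, alpha = 0.1, 0.5           # Dirichlet hyperparameters (illustrative)

pi = rng.dirichlet(np.full(V, beta), size=K)      # K topics over the vocab
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))      # document's topic preferences
    z = rng.choice(K, size=N, p=theta)            # one topic per token
    w = np.array([rng.choice(V, p=pi[k]) for k in z])  # each word from its topic
    docs.append(w)
print(docs[0])                   # word ids of the first document
```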
SLIDE 65

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

q: mean-field approximation

Explicit conditioning left off (for space)
SLIDE 66

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

q: mean-field approximation
  πk ∼ Dirichlet(λk)
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

Explicit conditioning left off (for space)
SLIDE 67

Variational Inference: LDirA

Topic usage; per-document (unigram) word counts; topic words

p: true model
  πk ∼ Dirichlet(β)
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))
  w(d,n) ∼ Discrete(πz(d,n))

q: mean-field approximation
  πk ∼ Dirichlet(λk)
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

Explicit conditioning left off (for space)

Notice: full independence, no shared parameters!!!

SLIDE 68

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Let F(q(•; λt)) = KL[ q(•; λt) ‖ p(•) ]

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
SLIDE 69

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ]

SLIDE 70

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = 𝔼q(θ(d))[ (α − 1)ᵀ log θ(d) ] + C

exponential family form of the Dirichlet:

p(θ) = ( Γ(Σk αk) / ∏k Γ(αk) ) ∏k θk^(αk − 1)

natural params = (αk − 1)k;  suff. stats. = (log θk)k
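This is what makes the expectation "easy": for θ ∼ Dirichlet(a), 𝔼[log θk] = ψ(ak) − ψ(Σj aj), the gradient of the Dirichlet's log normalizer (ψ is the digamma function). A quick check, with illustrative numbers:

```python
# E[log theta_k] for a Dirichlet, analytic vs. Monte Carlo.
import numpy as np
from scipy.special import digamma

a = np.array([2.0, 5.0, 1.0])
analytic = digamma(a) - digamma(a.sum())

mc = np.log(np.random.default_rng(0).dirichlet(a, size=200_000)).mean(axis=0)
print(analytic)    # matches the Monte Carlo estimate
print(mc)
```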

SLIDE 71

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = 𝔼q(θ(d))[ (α − 1)ᵀ log θ(d) ] + C

this is an expectation of the sufficient statistics under the q distribution:

q's natural params = (γd,k − 1)k;  suff. stats. = (log θk)k

SLIDE 72

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = (α − 1)ᵀ 𝔼q(θ(d))[ log θ(d) ] + C

the expectation of the sufficient statistics is the gradient of the log normalizer

SLIDE 73

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = (α − 1)ᵀ ∇γd A(γd − 1) + C

the expectation of the sufficient statistics is the gradient of the log normalizer A, evaluated at q's natural parameters γd − 1

SLIDE 74

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

𝔼q(θ(d))[ log p(θ(d) | α) ] = (α − 1)ᵀ ∇γd A(γd − 1) + C

ℒ|γd = (α − 1)ᵀ ∇γd A(γd − 1) + M(γd), where M(γd) collects the objective's other γd-dependent terms

there’s more math to do!

SLIDE 75

Variational Inference: A Gradient-Based Optimization Technique

Set t = 0

Pick a starting value λt

Let F(q(•; λt)) = KL[ q(•; λt) ‖ p(•) ]

Until converged:

  • 1. Get value yt = F(q(•; λt))
  • 2. Get gradient gt = F′(q(•; λt))
  • 3. Get scaling factor ρt
  • 4. Set λt+1 = λt + ρt · gt
  • 5. Set t += 1
SLIDE 76

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

ℒ|γd = (α − 1)ᵀ ∇γd A(γd − 1) + M(γd)

∇γd ℒ|γd = (α − 1)ᵀ ∇²γd A(γd − 1) + ∇γd M(γd)

SLIDE 77

Variational Inference: LDirA Topic Proportions

p: true model
  θ(d) ∼ Dirichlet(α)
  z(d,n) ∼ Discrete(θ(d))

q: mean-field approximation
  θ(d) ∼ Dirichlet(γd)
  z(d,n) ∼ Discrete(ψ(d,n))

ℒ|γd = (α − 1)ᵀ ∇γd A(γd − 1) + M(γd)

∇γd ℒ|γd = (α − 1)ᵀ ∇²γd A(γd − 1) + ∇γd M(γd)

analytically solve this for faster convergence (Blei et al., 2003)
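For reference, setting these gradients to zero yields the familiar closed-form coordinate updates of Blei et al. (2003). Here is a sketch in the notation above, assuming a scalar symmetric α and given topic-side Dirichlet parameters λ (`lam`, shape K × V); it is an illustration, not the full algorithm.

```python
# Closed-form coordinate updates for one document: psi are the per-token
# discrete variational parameters, gamma the document's Dirichlet parameters.
import numpy as np
from scipy.special import digamma

def update_document(w, lam, alpha, iters=50):
    # w: array of word ids for one document
    K = lam.shape[0]
    gamma = np.full(K, alpha + len(w) / K)               # standard initialization
    E_log_pi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    for _ in range(iters):
        # psi[n, k] proportional to exp(E[log theta_k] + E[log pi_{k, w_n}])
        log_psi = digamma(gamma) + E_log_pi[:, w].T
        log_psi -= log_psi.max(axis=1, keepdims=True)    # numerical stability
        psi = np.exp(log_psi)
        psi /= psi.sum(axis=1, keepdims=True)
        gamma = alpha + psi.sum(axis=0)                  # gamma_k = alpha + sum_n psi_nk
    return gamma, psi
```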

SLIDE 78

Steps

  • 1. Write out the objective
  • 2. Use basic properties of expectations and logs to simplify and expand it
    – In general, the objective is just a large sum of individual expectations, each focused on one or two R.V.s
  • 3. Simplify each expectation
  • 4. Differentiate the objective wrt the variational parameters
  • 5. Optimize based on gradients, with two options:
    • 1. Closed-form solutions
      • Can lead to better convergence
      • May not be possible, or worth it, to get
    • 2. Non-closed form (e.g., Newton-like step required)
      • Differentiation can be handled automatically
      • Convergence can be slower
SLIDE 79

Some General Guidelines, Recap

"Analytically solve this for faster convergence." Obviously, the math can be intimidating: don't let that deter you! Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
    – If θi ∼ pi and pi is a certain type of distribution (e.g., Dirichlet or normal), then qi is that same type of distribution
SLIDE 80

Some General Guidelines, Recap

"Analytically solve this for faster convergence." Obviously, the math can be intimidating: don't let that deter you! Easiest math occurs when:

  • Conjugacy exists in the true model
    – If θi ∼ pi and θj | θi ∼ pj, then pi is chosen to be the conjugate prior of pj
  • Family distributions in q mimic those in p
    – If θi ∼ pi and pi is a certain type of distribution (e.g., Dirichlet or normal), then qi is that same type of distribution

Alternatively: perform neural variational inference (e.g., variational autoencoder)
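A minimal sketch of that neural route: an encoder network amortizes q's parameters, and a Monte Carlo ELBO is optimized via the reparameterization trick (θ = μ + σ·ε). The architecture and sizes here are illustrative, not from the slides.

```python
# One training step of a toy VAE-style objective.
import torch
import torch.nn as nn

enc = nn.Linear(10, 2 * 4)     # x (10-d) -> mean and log-std of q(theta | x) (4-d)
dec = nn.Linear(4, 10)         # theta -> mean of p(x | theta)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(32, 10)        # a toy batch of "observations"
mu, log_sig = enc(x).chunk(2, dim=-1)
theta = mu + log_sig.exp() * torch.randn_like(mu)    # theta ~ q(theta | x)

log_p = -0.5 * ((x - dec(theta)) ** 2).sum(-1)       # Gaussian log-lik (up to a constant)
kl = 0.5 * (mu**2 + (2 * log_sig).exp() - 1 - 2 * log_sig).sum(-1)  # KL(q || N(0, I))
loss = -(log_p - kl).mean()    # negative ELBO

opt.zero_grad()
loss.backward()
opt.step()
```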

SLIDE 81

Variational Inference: Core Idea

  • Observed x, latent r.v.s θ
  • We have some joint model p(θ, x)
  • We want to compute p(θ | x), but this is computationally difficult
  • Solution: approximate p(θ | x) with a different distribution qλ(θ), and make qλ(θ) "close" to p(θ | x)

Basic Technique · Variational Approximation · Example: Topic Models