Variational inference Probabilistic Graphical Models Sharif - PowerPoint PPT Presentation

Variational inference Probabilistic Graphical Models Sharif University of Technology Spring 2016 Soleymani Some slides are adapted from Xing ’ s slides

Nodes: 𝒀 = {𝑌 1 , … , 𝑌 𝑜 } Evidence: 𝒀 𝑾 Inference query Query variables: 𝒂 = 𝒀\𝒀 𝑾  Marginal probability (Likelihood): 𝑄 𝒀 𝒘 = 𝑄(𝒂, 𝒀 𝒘 ) 𝒂  Conditional probability (a posteriori belief): 𝑄(𝒂, 𝒚 𝒘 ) 𝑄 𝒂|𝒚 𝒘 = 𝒂 𝑄(𝒂, 𝒚 𝒘 )  Marginalized conditional probability: 𝑿 𝑄(𝒁, 𝑿, 𝒚 𝒘 ) 𝑄 𝒁|𝒚 𝒘 = (𝒁 = 𝒁 ∪ 𝑿) 𝒁 𝑿 𝑄(𝒁, 𝑿, 𝒚 𝒘 )  Most probable assignment for some variables of interest given an evidence 𝒀 𝑾 = 𝒚 𝒘 𝒁 ∗ |𝒚 𝒘 = argmax 𝑄 𝒁|𝒚 𝒘 𝒁 2

Exact methods for inference  Variable elimination  Message Passing: shared terms  Sum-product (belief propagation)  Max-product  Junction Tree 3

Junction tree  General algorithm on graphs with cycles  Message passing on junction trees 𝑛 𝑗𝑘 𝑇 𝑗𝑘 𝑇 𝑗𝑘 𝐷 𝐷 𝑗 𝑘 𝑛 𝑘𝑗 𝑇 𝑗𝑘 4

Why approximate inference  The computational complexity of Junction tree algorithm with be at least 𝐿 𝐷 where 𝐷 shows the largest elimination clique (the largest clique in the triangulated graph) Tree-width of an 𝑂 × 𝑂 grid is 𝑂  For a distribution 𝑄 associated with a complex graph, computing the marginal (or conditional) probability of arbitrary random variable(s) is intractable 5

Learning and inference  Learning is also an inference problem or usually needs inference  For Bayesian inference that is one of the principal foundations for machine learning, learning is just an inference problem  For Maximum Likelihood approach, also, we need inference when we have incomplete data or when we encounter an undirected model 6

Approximate inference  Approximate inference techniques  Variational algorithms  Loopy belief propagation  Mean field approximation  Expectation propagation  Stochastic simulation / sampling methods 7

Variational methods  “ variational ” : general term for optimization-based formulations  Many problems can be expressed in terms of an optimization problem in which the quantity being optimized is a functional  Variational inference is a deterministic framework that is widely used for approximate inference 8

Variatonal inference methods  Constructing an approximation to the target distribution 𝑄 where this approximation takes a simpler form for inference:  We define a target class of distributions 𝒭  Search for an instance 𝑅 ∗ in 𝒭 that is the best approximation to 𝑄  Queries will be answered using 𝑅 ∗ rather than on 𝑄 Constrained optimization  𝒭 : given family of distributions  Simpler families for which solving the optimization problem will be computationally tractable  However, the family may not be sufficiently expressive to encode 𝑄 9

Setup  Assume that we are interested in the posterior distribution Observed variables 𝑄(𝑎, 𝑌|𝛽) 𝑌 = {𝑦 1 , … , 𝑦 𝑜 } 𝑄 𝑎 𝑌, 𝛽 = 𝑎 = {𝑨 1 , … , 𝑨 𝑛 } Hidden variables 𝑄 𝑎, 𝑌 𝛽 𝑒𝑎  The problem of computing the posterior is an instance of more general problems that variational inference solves  Main idea:  We pick a family of distributions over the latent variables with its own variational parameters  Then, find the setting of the parameters that makes 𝑅 close to the posterior of interest  Use 𝑅 with the fitted parameters as an approximation for the posterior 10

Approximation  Goal: Approximate a difficult distribution 𝑄(𝑎|𝑌) with a new distribution 𝑅(𝑎) such that:  𝑄(𝑎|𝑌) and 𝑅(𝑎) are close  Computation on 𝑅(𝑎) is easy  Typically, the true posterior is not in the variational family.  How should we measure distance between distributions?  The Kullback-Leibler divergence (KL-divergence) between two distributions 𝑄 and 𝑅 11

KL divergence  Kullback-Leibler divergence between 𝑄 and 𝑅 : 𝐿𝑀(𝑄| 𝑅 = 𝑄 𝑦 log 𝑄(𝑦) 𝑅(𝑦) 𝑒𝑦  A result from information theory: For any 𝑄 and 𝑅 𝐿𝑀(𝑄| 𝑅 ≥ 0  𝐿𝑀(𝑄| 𝑅 = 0 if and only if 𝑄 ≡ 𝑅  𝐸 is asymmetric 12

How measure the distance of 𝑄 and 𝑅 ?  We wish to find a distribution 𝑅 such that 𝑅 is a “ good ” approximation to 𝑄  We can therefore use KL divergence as a scoring function to decide a good 𝑅  But, 𝐿𝑀(𝑄(𝑎|𝑌)||𝑅(𝑎)) ≠ 𝐿𝑀(𝑅(𝑎)||𝑄(𝑎|𝑌)) 13

M-projection vs. I-projection  M-projection of 𝑅 onto 𝑄 𝑅 ∗ = argmin 𝐿𝑀(𝑄||𝑅) 𝑅∈𝒭  I-projection of 𝑅 onto 𝑄 𝑅 ∗ = argmin 𝐿𝑀(𝑅||𝑄) 𝑅∈𝒭  These two will differ only when 𝑅 is minimized over a restricted set of probability distributions (when 𝑄 ∉ 𝑅 set of possible 𝑅 distributions) 14

KL divergence: M-projection vs. I-projection  Let 𝑄 be a 2D Gaussian and 𝑅 be a Gaussian distribution with diagonal covariance matrix: 𝑅 𝒜 log 𝑅 𝒜 𝑅 ∗ = argmin 𝑄 𝒜 log 𝑄 𝒜 𝑄 𝒜 𝑒𝒜 𝑅 ∗ = argmin 𝑅 𝒜 𝑒𝒜 𝑅 𝑅 𝑄 : Green 𝑅 ∗ : Red 𝐹 𝑄 𝒜 = 𝐹 𝑅 [𝒜] 𝐹 𝑄 𝒜 = 𝐹 𝑅 [𝒜] [Bishop] 15

KL divergence: M-projection vs. I-projection  Let 𝑄 is mixture of two 2D Gaussians and 𝑅 be a 2D Gaussian distribution with arbitrary covariance matrix: 𝑅 𝒜 log 𝑅 𝒜 𝑄 : Blue 𝑄 𝒜 log 𝑄 𝒜 𝑅 ∗ = argmin 𝑅 ∗ = argmin 𝑄 𝒜 𝑒𝒜 𝑅 𝒜 𝑒𝒜 𝑅 ∗ : Red 𝑅 𝑅 two good solutions! 𝐹 𝑄 𝒜 = 𝐹 𝑅 𝒜 𝐷𝑝𝑤 𝑄 𝒜 = 𝐷𝑝𝑤 𝑅 𝒜 16 [Bishop]

M-projection  Computing 𝐿𝑀(𝑄| 𝑅 requires inference on 𝑄 𝑄 𝑨 log 𝑄 𝑨 𝐿𝑀(𝑄| 𝑅 = 𝑅 𝑨 = −𝐼 𝑄 − 𝐹 𝑄 [log 𝑅(𝑨)] 𝑨 Inference on 𝑄 (that is difficult) is required!  When 𝑅 is in the exponential family: 𝐿𝑀(𝑄| 𝑅 = 0 ⇔ 𝐹 𝑄 𝑈 𝑨 = 𝐹 𝑅 [𝑈 𝑨 ] Moment projection  Expectation Propagation methods are based on minimizing 𝐿𝑀(𝑄| 𝑅 17

I-projection can be computed without performing inference on  𝐿𝑀(𝑅| 𝑄 𝑄 𝐿𝑀(𝑅| 𝑄 = 𝑅 𝑨 log 𝑅 𝑨 𝑄 𝑨 𝑒𝑨 = −𝐼 𝑅 − 𝐹 𝑅 [log 𝑄(𝑨)]  Most variational inference algorithms make use of 𝐿𝑀(𝑅| 𝑄  Computing expectations w.r.t. 𝑅 is tractable (by choosing a suitable class of distributions for 𝑅 )  We choose a restricted family of distributions such that the expectations can be evaluated and optimized efficiently.  and yet which is still sufficiently flexible as to give a good approximation 18

Example of variatinal approximation Variational Laplace Approx. [Bishop] 19

Evidence Lower Bound (ELBO) ln 𝑄 𝑌 = ℒ 𝑅 + 𝐿𝑀(𝑅||𝑄) 𝑌 = {𝑦 1 , … , 𝑦 𝑜 } 𝑎 = {𝑨 1 , … , 𝑨 𝑛 } ℒ 𝑅 = 𝑅 𝑎 ln 𝑄(𝑌, 𝑎) 𝑒𝑎 𝑅(𝑎) 𝐿𝑀(𝑅||𝑄) = − 𝑅 𝑎 ln 𝑄(𝑎|𝑌) 𝑒𝑎 𝑅(𝑎) We also called ℒ 𝑅 as  We can maximize the lower bound ℒ 𝑅 𝐺[𝑄, 𝑅] latter.  equivalent to minimizing KL divergence.  if we allow any possible choice for 𝑅(𝑎) , then the maximum of the lower bound occurs when the KL divergence vanishes  occurs when 𝑅(𝑎) equals the posterior distribution 𝑄(𝑎|𝑌) .  The difference between the ELBO and the KL divergence is ln 𝑄(𝑌) which is what the ELBO bounds 20

Evidence Lower Bound (ELBO)  Lower bound on the marginal likelihood  This quantity should increase monotonically with each iteration  we maximize the ELBO to find the parameters that gives as tight a bound as possible on the marginal likelihood  ELBO converges to a local minimum.  Variational inference is closely related to EM 21

Factorized distributions 𝑅 The restriction on the distributions in the form of factorization assumptions: 𝑅 𝑎 = 𝑅 𝑗 (𝑎 𝑗 ) 𝑗 ℒ 𝑅 = 𝑅 𝑗 ln 𝑄(𝑌, 𝑎) − ln 𝑅 𝑗 𝑒𝑎 𝑗 𝑗 Coordinate ascent to optimize ℒ 𝑅 : ℒ 𝑘 𝑅 = 𝑅 𝑘 ln 𝑄(𝑌, 𝑎) 𝑅 𝑗 𝑒𝑎 𝑗 𝑒𝑎 𝑘 − 𝑅 𝑘 ln 𝑅 𝑘 𝑒𝑎 𝑘 + 𝑑𝑝𝑜𝑡𝑢 𝑗≠𝑘 ⇒ ℒ 𝑘 𝑅 = 𝑅 𝑘 𝐹 −𝑘 ln 𝑄 𝑌, 𝑎 𝑒𝑎 𝑘 − 𝑅 𝑘 ln 𝑅 𝑘 𝑒𝑎 𝑘 + 𝑑𝑝𝑜𝑡𝑢 𝐹 −𝑘 ln 𝑄 𝑌, 𝑎 = ln 𝑄 𝑌, 𝑎 𝑅 𝑗 𝑒𝑎 𝑗 22 𝑗≠𝑘

Variational inference Probabilistic Graphical Models Sharif - PowerPoint PPT Presentation

Variational inference Probabilistic Graphical Models Sharif University of Technology Spring 2016 Soleymani Some slides are adapted from Xing s slides Nodes: = { 1 , , } Evidence: Inference query Query

Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is

Variational Inference for GPs: Presenters Group1: Stochastic variational inference. Slides 2 - 28

Variational Auto-encoders 2 VARIATIONAL AUTO-ENCODERS INTRODUCTION VARIATIONAL AUTO-ENCODERS

Rejection Sampling Variational Inference Karan Grewal CSC2547 / STA4273 Overview Variational

CS480/680 Machine Learning Lecture 11: February 11 th , 2020 Variational Inference Zahra

An Introduction to An Introduction to Variational Variational Methods for Graphical Models

Lecture Variational 13 Inference Panini Kaushal Scribes : - Margulies Smedeuranh Niklas

CS 285 Instructor: Sergey Levine UC Berkeley Todays Lecture 1. Probabilistic latent variable

Variational Inference for Bayes vMF Mixture Hanxiao Liu September 23, 2014 1 / 14 Variational

Variational Mean Field Variational Mean Field for Graphical Models for Graphical Models

Variational Inference CMSC 691 UMBC Goal: Posterior Inference Hyperparameters Unknown

Variational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor

Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family

Neural Variational Inference and Learning Andriy Mnih, Karol Gregor 22 June 2014 1 / 14

Regret bounds for online variational inference Pierre Alquier ACML Nagoya, Nov. 18, 2019

The Variational Predictive Natural Gradient Da Tang 1 Rajesh Ranganath 2 1 Columbia University 2

Probabilistic & Unsupervised Learning Factored Variational Approximations and Variational

A start of Variational Methods for ERGM Ranran Wang, UW MURI-UCI April 24, 2009 A start of

Layouts in Drupal 8 @timplunkett Core Developer, Acquia Office of the CTO Drupal 7 Block

Simple geometrical models for the distribution of domain sizes in martensitic microstructures

Weighted Sobolev spaces for the advection operator: A variational method for computing shape

variational methods for effective dynamics, part II Robert L. Jerrard Department of Mathematics

Model optimization and selection: Variational Approach for Markov Processes (VAMP) Frank No (FU

Variational Methods for Path Integral Scattering J. Carron Paul-Scherrer Institute, Villigen

Variational inference Probabilistic Graphical Models Sharif - PowerPoint PPT Presentation

Variational inference Probabilistic Graphical Models Sharif University of Technology Spring 2016 Soleymani Some slides are adapted from Xing s slides Nodes: = { 1 , , } Evidence: Inference query Query

Deep Variational Inference FLARE Reading Group Presentation Wesley Tansey 9/28/2016 What is

Variational Inference for GPs: Presenters Group1: Stochastic variational inference. Slides 2 - 28

Variational Auto-encoders 2 VARIATIONAL AUTO-ENCODERS INTRODUCTION VARIATIONAL AUTO-ENCODERS

Rejection Sampling Variational Inference Karan Grewal CSC2547 / STA4273 Overview Variational

CS480/680 Machine Learning Lecture 11: February 11 th , 2020 Variational Inference Zahra

An Introduction to An Introduction to Variational Variational Methods for Graphical Models

Lecture Variational 13 Inference Panini Kaushal Scribes : - Margulies Smedeuranh Niklas

CS 285 Instructor: Sergey Levine UC Berkeley Todays Lecture 1. Probabilistic latent variable

Variational Inference for Bayes vMF Mixture Hanxiao Liu September 23, 2014 1 / 14 Variational

Variational Mean Field Variational Mean Field for Graphical Models for Graphical Models

Variational Inference CMSC 691 UMBC Goal: Posterior Inference Hyperparameters Unknown

Variational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor

Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family

Neural Variational Inference and Learning Andriy Mnih, Karol Gregor 22 June 2014 1 / 14

Regret bounds for online variational inference Pierre Alquier ACML Nagoya, Nov. 18, 2019

The Variational Predictive Natural Gradient Da Tang 1 Rajesh Ranganath 2 1 Columbia University 2

Probabilistic &amp; Unsupervised Learning Factored Variational Approximations and Variational

A start of Variational Methods for ERGM Ranran Wang, UW MURI-UCI April 24, 2009 A start of

Layouts in Drupal 8 @timplunkett Core Developer, Acquia Office of the CTO Drupal 7 Block

Simple geometrical models for the distribution of domain sizes in martensitic microstructures

Weighted Sobolev spaces for the advection operator: A variational method for computing shape

variational methods for effective dynamics, part II Robert L. Jerrard Department of Mathematics

Model optimization and selection: Variational Approach for Markov Processes (VAMP) Frank No (FU

Variational Methods for Path Integral Scattering J. Carron Paul-Scherrer Institute, Villigen

Probabilistic & Unsupervised Learning Factored Variational Approximations and Variational