

SLIDE 1

Variational Inference and Learning

Michael Gutmann

Probabilistic Modelling and Reasoning (INFR11134) School of Informatics, University of Edinburgh

Spring semester 2018

SLIDE 2

Recap

◮ Learning and inference often involve intractable integrals
◮ For example: marginalisation

   p(x) = ∫ p(x, y) dy

◮ For example: the likelihood in case of unobserved variables

   L(θ) = p(D; θ) = ∫ p(u, D; θ) du

◮ We can use Monte Carlo integration and sampling to approximate the integrals.
◮ Alternative: variational approach to (approximate) inference and learning.

SLIDE 3

History

Variational methods have a long history, in particular in physics. For example:

◮ Fermat’s principle (1650) to explain the path of light: “light travels between two given points along the path of shortest time” (see e.g. http://www.feynmanlectures.caltech.edu/I_26.html)
◮ Principle of least action in classical mechanics and beyond (see e.g. http://www.feynmanlectures.caltech.edu/II_19.html)
◮ Finite element methods to solve problems in fluid dynamics or civil engineering.

SLIDE 4

Program

  • 1. Preparations
  • 2. The variational principle
  • 3. Application to inference and learning

SLIDE 5

Program

  • 1. Preparations

     Concavity of the logarithm and Jensen’s inequality
     Kullback-Leibler divergence and its properties

  • 2. The variational principle
  • 3. Application to inference and learning

SLIDE 6

log is concave

◮ log(u) is concave

   log(a u1 + (1−a) u2) ≥ a log(u1) + (1−a) log(u2),   a ∈ [0, 1]

◮ log(average) ≥ average(log)
◮ Generalisation:

   log E[g(x)] ≥ E[log g(x)]   with g(x) > 0

[Figure: plot of log(u) illustrating its concavity]

◮ Jensen’s inequality for concave functions.
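A small numerical check of the generalised inequality (the distribution and the positive function g are toy choices made for illustration, not part of the slides):

```python
# Check Jensen's inequality for the concave log: log E[g(x)] >= E[log g(x)] for g(x) > 0.
# Toy choice: x ~ N(0, 1) and g(x) = exp(x), so that E[g(x)] = e^{1/2} and E[log g(x)] = 0.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
g = np.exp(x)                      # a positive function of x

print(np.log(np.mean(g)))          # log of the average, approx 0.5
print(np.mean(np.log(g)))          # average of the log, approx 0.0
# The first value is always >= the second, as Jensen's inequality states.
```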

SLIDE 7

Kullback-Leibler divergence

◮ Kullback–Leibler divergence KL(p||q)

   KL(p||q) = ∫ p(x) log (p(x)/q(x)) dx = E_p(x)[ log (p(x)/q(x)) ]

◮ Properties
   ◮ KL(p||q) = 0 if and only if (iff) p = q
     (they may be different on sets of probability zero)
   ◮ KL(p||q) ≠ KL(q||p)
   ◮ KL(p||q) ≥ 0

◮ Non-negativity follows from the concavity of the logarithm.
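These properties are easy to verify numerically. A quick sketch for discrete distributions (the probability vectors are toy values chosen for illustration):

```python
# KL divergence between two discrete distributions: non-negative, zero iff p = q,
# and asymmetric in its arguments.
import numpy as np

def kl(p, q):
    """KL(p||q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(kl(p, q), kl(q, p))   # both >= 0 but different: the divergence is not symmetric
print(kl(p, p))             # 0.0: the divergence vanishes iff the arguments agree
```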

SLIDE 8

Non-negativity of the KL divergence

Non-negativity follows from the concavity of the logarithm:

   E_p(x)[ log (q(x)/p(x)) ] ≤ log E_p(x)[ q(x)/p(x) ]
                             = log ∫ (q(x)/p(x)) p(x) dx
                             = log ∫ q(x) dx
                             = log 1 = 0.

From E_p(x)[ log (q(x)/p(x)) ] ≤ 0 it follows that

   KL(p||q) = E_p(x)[ log (p(x)/q(x)) ] = −E_p(x)[ log (q(x)/p(x)) ] ≥ 0

SLIDE 9

Asymmetry of the KL divergence

Blue: mixture of Gaussians p(x) (fixed)
Green: (unimodal) Gaussian q that minimises KL(q||p)
Red: (unimodal) Gaussian q that minimises KL(p||q)


Barber Figure 28.1, Section 28.3.4

SLIDE 10

Asymmetry of the KL divergence

   argmin_q KL(q||p) = argmin_q ∫ q(x) log (q(x)/p(x)) dx

◮ Optimal q avoids regions where p is small.
◮ Produces good local fit, “mode seeking”

   argmin_q KL(p||q) = argmin_q ∫ p(x) log (p(x)/q(x)) dx

◮ Optimal q is nonzero where p is nonzero
   (and does not care about regions where p is small)

◮ Corresponds to MLE; produces global fit/moment matching
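As a concrete illustration of the two directions, one can fit a single Gaussian q to a fixed mixture p by numerically minimising each KL divergence. This is a sketch with an assumed two-component mixture and toy settings, not the example from the figures:

```python
# Fit a unimodal Gaussian q to a fixed two-component Gaussian mixture p by minimising
# KL(p||q) (global fit / moment matching) and KL(q||p) (mode seeking).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

xs = np.linspace(-15, 25, 4001)                               # integration grid
p = 0.5 * norm.pdf(xs, -3, 1) + 0.5 * norm.pdf(xs, 8, 1)      # fixed mixture p(x)

def kl(a, b):
    """KL(a||b) for densities tabulated on the grid xs (trapezoidal rule)."""
    return np.trapz(a * (np.log(a + 1e-300) - np.log(b + 1e-300)), xs)

def q_pdf(params):
    mu, log_sigma = params
    return norm.pdf(xs, mu, np.exp(log_sigma))

res_pq = minimize(lambda t: kl(p, q_pdf(t)), x0=[2.0, 1.5])   # argmin_q KL(p||q)
res_qp = minimize(lambda t: kl(q_pdf(t), p), x0=[7.0, 0.0])   # argmin_q KL(q||p)

print("KL(p||q): mean %.2f, std %.2f" % (res_pq.x[0], np.exp(res_pq.x[1])))  # broad fit covering both modes
print("KL(q||p): mean %.2f, std %.2f" % (res_qp.x[0], np.exp(res_qp.x[1])))  # locks onto one mode
```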


SLIDE 11

Asymmetry of the KL divergence

Blue: mixture of Gaussians p(x) (fixed)
Red: optimal (unimodal) Gaussians q(x)
Global moment matching (left) versus mode seeking (middle and right; two local minima are shown).

[Three panels: min_q KL(p||q), min_q KL(q||p), min_q KL(q||p)]

Bishop Figure 10.3

SLIDE 12

Program

  • 1. Preparations

     Concavity of the logarithm and Jensen’s inequality
     Kullback-Leibler divergence and its properties

  • 2. The variational principle
  • 3. Application to inference and learning

SLIDE 13

Program

  • 1. Preparations
  • 2. The variational principle

     Variational lower bound
     Free energy and the decomposition of the log marginal
     Free energy maximisation to compute the marginal and conditional from the joint

  • 3. Application to inference and learning

SLIDE 14

Variational lower bound: auxiliary distribution

Consider a joint pdf/pmf p(x, y) with marginal p(x) = ∫ p(x, y) dy

◮ Like for importance sampling, we can write

   p(x) = ∫ p(x, y) dy = ∫ (p(x, y)/q(y)) q(y) dy = E_q(y)[ p(x, y)/q(y) ]

   where q(y) is an auxiliary distribution (called the variational distribution in the context of variational inference/learning); a quick numerical check of this identity is sketched below.

◮ The log marginal is

   log p(x) = log E_q(y)[ p(x, y)/q(y) ]

◮ Instead of approximating the expectation with a sample average, we now use the concavity of the logarithm.
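The rewriting of the marginal as an expectation holds exactly for any valid q. A quick check on a toy discrete joint (the numbers are made up purely for illustration):

```python
# Check the identity p(x) = E_{q(y)}[ p(x, y)/q(y) ]; it holds for any auxiliary q
# with q(y) > 0 wherever p(x, y) > 0.
import numpy as np

p_xy = np.array([0.05, 0.10, 0.02, 0.13])   # p(x, y) for a fixed x and y in {0, 1, 2, 3}
q = np.array([0.4, 0.3, 0.2, 0.1])          # an arbitrary auxiliary distribution over y

print(p_xy.sum())                 # p(x) by direct marginalisation
print(np.sum(q * (p_xy / q)))     # E_q[ p(x, y)/q(y) ]: the same value
```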

SLIDE 15

Variational lower bound: concavity of the logarithm

◮ Concavity of the log gives

   log p(x) = log E_q(y)[ p(x, y)/q(y) ] ≥ E_q(y)[ log (p(x, y)/q(y)) ]

   This is the variational lower bound for log p(x).

◮ Right-hand side is called the (variational) free energy

   F(x, q) = E_q(y)[ log (p(x, y)/q(y)) ]

   It depends on x through the joint p(x, y), and on the auxiliary distribution q(y).
   (Since q is a function, the free energy is called a functional, which is a mapping that depends on a function.)

SLIDE 16

Decomposition of the log marginal

◮ We can re-write the free energy as

   F(x, q) = E_q(y)[ log (p(x, y)/q(y)) ]
           = E_q(y)[ log (p(y|x) p(x) / q(y)) ]
           = E_q(y)[ log (p(y|x)/q(y)) + log p(x) ]
           = E_q(y)[ log (p(y|x)/q(y)) ] + log p(x)
           = −KL(q(y)||p(y|x)) + log p(x)

◮ Hence: log p(x) = KL(q(y)||p(y|x)) + F(x, q)
◮ KL ≥ 0 implies the bound log p(x) ≥ F(x, q).
◮ KL(q||p) = 0 iff q = p implies that for q(y) = p(y|x), the free energy is maximised and equals log p(x).
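A small numerical sanity check of the decomposition and of the bound (a toy discrete joint and an arbitrary q are assumed for illustration):

```python
# Verify log p(x) = KL(q(y)||p(y|x)) + F(x, q), that F(x, q) <= log p(x) for any q,
# and that the bound is tight at q(y) = p(y|x).
import numpy as np

p_xy = np.array([0.05, 0.10, 0.02, 0.13])    # p(x, y) for a fixed x and y in {0, 1, 2, 3}
p_x = p_xy.sum()                             # marginal p(x)
post = p_xy / p_x                            # posterior p(y|x)

def free_energy(q):
    return np.sum(q * (np.log(p_xy) - np.log(q)))    # F(x, q) = E_q[log p(x, y)/q(y)]

def kl(q, p):
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.4, 0.3, 0.2, 0.1])                    # some variational distribution
print(np.log(p_x), kl(q, post) + free_energy(q))      # equal: the decomposition holds
print(free_energy(q) <= np.log(p_x))                  # True: the variational lower bound
print(np.isclose(free_energy(post), np.log(p_x)))     # True: the bound is tight at q = p(y|x)
```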

SLIDE 17

Variational principle

◮ By maximising the free energy

   F(x, q) = E_q(y)[ log (p(x, y)/q(y)) ]

   we can split the joint p(x, y) into p(x) and p(y|x):

   log p(x) = max_{q(y)} F(x, q)
   p(y|x)   = argmax_{q(y)} F(x, q)

◮ You can think of free energy maximisation as a “function” that takes as input a joint p(x, y) and returns as output the (log) marginal and the conditional.

SLIDE 18

Variational principle

◮ Given p(x, y), consider the inference tasks
   1. compute p(x) = ∫ p(x, y) dy
   2. compute p(y|x)
◮ Variational principle: we can formulate these inference problems as optimisation problems.
◮ Maximising the free energy

   F(x, q) = E_q(y)[ log (p(x, y)/q(y)) ]

   gives
   1. log p(x) = max_{q(y)} F(x, q)
   2. p(y|x) = argmax_{q(y)} F(x, q)
◮ Inference becomes optimisation.
◮ Note: while we use q(y) to denote the variational distribution, it depends on the (fixed) x. Better (and rarer) notation is q(y|x).

SLIDE 19

Solving the optimisation problem

   F(x, q) = E_q(y)[ log (p(x, y)/q(y)) ]

◮ Difficulties when maximising the free energy:
   ◮ optimisation with respect to the pdf/pmf q(y)
   ◮ computation of the expectation
◮ Restrict the search space to a family Q of variational distributions q(y) for which F(x, q) is computable.
◮ Family Q specified by
   ◮ independence assumptions, e.g. q(y) = ∏_i q(yi), which corresponds to “mean-field” variational inference
   ◮ parametric assumptions, e.g. q(yi) = N(yi; µi, σi²)
◮ Optimisation is generally challenging: lots of research on how to do it (keywords: stochastic variational inference, black-box variational inference). A minimal sketch of such a setup follows below.
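The sketch below illustrates the two kinds of assumptions together: a factorised (mean-field) Gaussian family and a Monte Carlo estimate of the free energy. The toy model inside log_joint and all names are assumptions made up for this example, not part of the slides; in stochastic or black-box variational inference one would follow stochastic gradients of this estimate with respect to (µ, σ):

```python
# Mean-field Gaussian variational family q(y) = prod_i N(y_i; mu_i, sigma_i^2) and a
# Monte Carlo estimate of the free energy F(x, q) = E_q[log p(x, y) - log q(y)].
import numpy as np

rng = np.random.default_rng(0)

def norm_logpdf(y, mu, sigma):
    return -0.5 * ((y - mu) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def log_joint(x, y):
    # Toy joint assumed for illustration (up to constants): y ~ N(0, I), x | y ~ N(sum(y), 1).
    return -0.5 * np.sum(y**2, axis=-1) - 0.5 * (x - y.sum(axis=-1))**2

def free_energy_mc(x, mu, log_sigma, n_samples=5000):
    """Estimate F(x, q) by sampling y from the factorised Gaussian q."""
    sigma = np.exp(log_sigma)
    y = mu + sigma * rng.standard_normal((n_samples, mu.size))    # y ~ q(y)
    log_q = np.sum(norm_logpdf(y, mu, sigma), axis=1)             # log q(y), factorised
    return np.mean(log_joint(x, y) - log_q)

print(free_energy_mc(x=2.0, mu=np.zeros(3), log_sigma=np.zeros(3)))
```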

SLIDE 20

Program

  • 1. Preparations
  • 2. The variational principle

     Variational lower bound
     Free energy and the decomposition of the log marginal
     Free energy maximisation to compute the marginal and conditional from the joint

  • 3. Application to inference and learning

SLIDE 21

Program

  • 1. Preparations
  • 2. The variational principle
  • 3. Application to inference and learning

     Inference: approximating posteriors
     Learning with Bayesian models
     Learning with statistical models and unobserved variables
     Learning with statistical models and unobserved variables: EM algorithm

SLIDE 22

Approximate posterior inference

◮ Inference task: given a value x = xo and the joint pdf/pmf p(x, y), compute p(y|xo).
◮ Variational approach: estimate the posterior by solving an optimisation problem

   p̂(y|xo) = argmax_{q(y)∈Q} F(xo, q)

   where Q is the set of pdfs in which we search for the solution.
◮ The decomposition of the log marginal gives

   log p(xo) = KL(q(y)||p(y|xo)) + F(xo, q) = const

◮ Because the sum of the KL term and the free energy is constant, we have

   argmax_{q(y)∈Q} F(xo, q) = argmin_{q(y)∈Q} KL(q(y)||p(y|xo))

SLIDE 23

Nature of the approximation

◮ When minimising KL(q||p) with respect to q, q will try to be zero where p is small.
◮ Assume the true posterior is a correlated bivariate Gaussian and we work with

   Q = {q(y) : q(y) = q(y1)q(y2)}

   (independence but no parametric assumptions)
◮ p̂(y|xo), i.e. the q(y) that minimises KL(q||p), is Gaussian.
◮ The mean is correct, but the variances are dictated by the variance of p along each axis with the other variable held fixed (the conditional variances, not the marginal ones).
◮ The posterior variance is underestimated.

[Figure: contours of the true posterior and of the factorised mean-field approximation over (y1, y2); Bishop, Figure 10.2]
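The underestimation can be checked with the known analytic result for a Gaussian target: the optimal factorised q(yi) is Gaussian with the correct mean and precision Λii, the diagonal entry of the precision matrix (Bishop Section 10.1.2). The covariance values below are assumed purely for illustration:

```python
# For a correlated bivariate Gaussian posterior with covariance Sigma and precision
# Lambda = inv(Sigma), the factorised q minimising KL(q||p) has variances 1/Lambda_ii,
# which are smaller than the true marginal variances Sigma_ii.
import numpy as np

Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])             # true posterior covariance (strong correlation)
Lambda = np.linalg.inv(Sigma)              # posterior precision matrix

print(np.diag(Sigma))                      # true marginal variances: [1.  1. ]
print(1.0 / np.diag(Lambda))               # mean-field variances: [0.19 0.19] (underestimated)
```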

SLIDE 24

Nature of the approximation

◮ Assume that the true posterior is multimodal, but that the family of variational distributions Q only includes unimodal distributions.
◮ The learned approximate posterior p̂(y|xo) only covers one mode (“mode-seeking” behaviour).

[Figure: two panels showing two different local optima. Blue: true posterior. Red: approximation. Bishop Figure 10.3 (adapted)]

SLIDE 25

Learning by Bayesian inference

◮ Task 1: For a Bayesian model p(x|θ)p(θ) = p(x, θ), compute the posterior p(θ|D).
◮ Formally the same problem as before: D = xo and θ ≡ y.
◮ Task 2: For a Bayesian model p(v, h|θ)p(θ) = p(v, h, θ), compute the posterior p(θ|D) where the data D are for the visibles v only.
◮ With the equivalence D = xo and (h, θ) ≡ y, we are formally back to the problem just studied.
◮ But the variational distribution q(y) becomes q(h, θ).
◮ Often: assume q(h, θ) factorises as q(h)q(θ)

(see Barber Section 11.5)

SLIDE 26

Parameter estimation in presence of unobserved variables

◮ Task: For the model p(v, h; θ), estimate the parameters θ from data D about the visibles v.
◮ See the slides on Intractable Likelihood Functions: the log likelihood function ℓ(θ) is implicitly defined by the integral

   ℓ(θ) = log p(D; θ) = log ∫ p(D, h; θ) dh,

   which is generally intractable.
◮ We could approximate ℓ(θ) and its gradient using Monte Carlo integration.

◮ Here: use the variational approach.

SLIDE 27

Parameter estimation in presence of unobserved variables

◮ Foundational result that we derived:

   log p(x) = KL(q(y)||p(y|x)) + F(x, q)
   F(x, q)  = E_q(y)[ log (p(x, y)/q(y)) ]
   log p(x) = max_{q(y)} F(x, q)
   p(y|x)   = argmax_{q(y)} F(x, q)

◮ With the correspondence v ≡ x, h ≡ y, p(v, h; θ) ≡ p(x, y) we obtain

   log p(v; θ) = KL(q(h)||p(h|v; θ)) + F(v, q; θ)
   F(v, q; θ)  = E_q(h)[ log (p(v, h; θ)/q(h)) ]
   log p(v; θ) = max_{q(h)} F(v, q; θ)
   p(h|v; θ)   = argmax_{q(h)} F(v, q; θ)

◮ Plug in D for v: log p(D; θ) equals ℓ(θ)

SLIDE 28

Approximate MLE by free energy maximisation

◮ With v = D and ℓ(θ) = log p(D; θ), the equations become

   ℓ(θ) = KL(q(h)||p(h|D; θ)) + JF(q, θ)
   JF(q, θ) = E_q(h)[ log (p(D, h; θ)/q(h)) ]
   ℓ(θ) = max_{q(h)} JF(q, θ)
   p(h|D; θ) = argmax_{q(h)} JF(q, θ)

   where we write JF(q, θ) for the free energy F(D, q; θ) when the data D are fixed.

◮ Maximum likelihood estimation (MLE):

   max_θ ℓ(θ) = max_θ max_{q(h)} JF(q, θ)

   MLE = maximise the free energy with respect to θ and q(h)

◮ Restricting the search space Q for the variational distribution q(h) due to computational reasons leads to an approximation.

SLIDE 29

Free energy as sum of completed log likelihood and entropy

◮ We can write the free energy as

   JF(q, θ) = E_q(h)[ log (p(D, h; θ)/q(h)) ] = E_q(h)[log p(D, h; θ)] − E_q(h)[log q(h)]

◮ −E_q(h)[log q(h)] is the entropy of q(h) (entropy is a measure of randomness or variability, see e.g. Barber Section 8.2)
◮ log p(D, h; θ) is the log-likelihood for the filled-in data (D, h)
◮ E_q(h)[log p(D, h; θ)] is the weighted average of these “completed” log-likelihoods, with the weighting given by q(h).

SLIDE 30

Free energy as sum of completed log likelihood and entropy

   JF(q, θ) = E_q(h)[log p(D, h; θ)] − E_q(h)[log q(h)]

◮ When maximising JF(q, θ) with respect to q, we look for random variables h (filled-in data) that
   ◮ are maximally variable (large entropy)
   ◮ are maximally compatible with the observed data (according to the model p(D, h; θ))
◮ If included in the search space Q, p(h|D; θ) is the optimal q, which means that the posterior fulfils the two desiderata best.

SLIDE 31

Variational EM algorithm

Variational expectation maximisation (EM): maximise JF(q, θ) by iterating between maximisation with respect to q and maximisation with respect to θ.

[Figure: alternating coordinate-wise maximisation of the free energy, one axis being the variational distribution and the other the model parameters]

(Adapted from http://www.cs.cmu.edu/~tom/10-702/Zoubin-702.pdf)

SLIDE 32

Where is the “expectation”?

◮ The optimisation with respect to q is called the “expectation step”:

   max_{q∈Q} JF(q, θ) = max_{q∈Q} E_q[ log (p(D, h; θ)/q(h)) ]

◮ Denote the best q by q∗, so that max_{q∈Q} JF(q, θ) = JF(q∗, θ).
◮ When we maximise with respect to θ, we need to know JF(q∗, θ),

   JF(q∗, θ) = E_q∗[ log (p(D, h; θ)/q∗(h)) ],

   which is defined in terms of an expectation; this is the reason for the name “expectation step”.

SLIDE 33

Classical EM algorithm

◮ From

   ℓ(θk) = KL(q(h)||p(h|D; θk)) + JF(q, θk)

   we know that the optimal q(h) is given by p(h|D; θk).
◮ If we can compute the posterior p(h|D; θk), we obtain the (classical) EM algorithm that iterates between:

   Expectation step:
   JF(q∗, θ) = E_p(h|D;θk)[log p(D, h; θ)] − E_p(h|D;θk)[log p(h|D; θk)]
   (the second term does not depend on θ and does not need to be computed)

   Maximisation step:
   argmax_θ JF(q∗, θ) = argmax_θ E_p(h|D;θk)[log p(D, h; θ)]
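The classical EM updates can be made concrete with a standard example where the posterior over the hidden variables is tractable: a two-component Gaussian mixture with unobserved component labels h. This is a minimal sketch; the data, initialisation, and iteration count are assumptions made for illustration:

```python
# Classical EM for a 1-D mixture of two Gaussians. E-step: responsibilities
# r[n, k] = p(h_n = k | x_n; theta_k). M-step: maximise the expected completed
# log likelihood E_{p(h|D;theta_k)}[log p(D, h; theta)] in closed form.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])   # observed D

pi = np.array([0.5, 0.5])            # mixture weights
mu = np.array([-1.0, 1.0])           # component means
sigma = np.array([1.0, 1.0])         # component standard deviations

for _ in range(50):
    # E-step: posterior over the hidden labels given the current parameters
    dens = pi * norm.pdf(data[:, None], mu, sigma)          # shape (N, 2)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: closed-form maximisation of the expected completed log likelihood
    Nk = r.sum(axis=0)
    pi = Nk / len(data)
    mu = (r * data[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (data[:, None] - mu)**2).sum(axis=0) / Nk)

    log_lik = np.log(dens.sum(axis=1)).sum()   # never decreases across iterations

print(pi, mu, sigma, log_lik)
```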

SLIDE 34

Classical EM algorithm never decreases the log likelihood

◮ Assume you have updated the parameters and start iteration k with the optimisation with respect to q:

   max_q JF(q, θk−1)

◮ The optimal solution q∗_k is the posterior, so that ℓ(θk−1) = JF(q∗_k, θk−1).
◮ Optimise with respect to θ while keeping q fixed at q∗_k:

   max_θ JF(q∗_k, θ)

◮ Because of the maximisation, the optimiser θk is such that

   JF(q∗_k, θk) ≥ JF(q∗_k, θk−1) = ℓ(θk−1)

◮ From the variational lower bound, ℓ(θ) ≥ JF(q, θ):

   ℓ(θk) ≥ JF(q∗_k, θk) ≥ ℓ(θk−1)

Hence: EM yields a non-decreasing sequence ℓ(θ1), ℓ(θ2), …

SLIDE 35

Examples

◮ Work through the examples in Barber Section 11.2 for the classical EM algorithm.
◮ Example 11.4 treats the cancer-asbestos-smoking example that we had in an earlier lecture.

SLIDE 36

Program recap

  • 1. Preparations

     Concavity of the logarithm and Jensen’s inequality
     Kullback-Leibler divergence and its properties

  • 2. The variational principle

     Variational lower bound
     Free energy and the decomposition of the log marginal
     Free energy maximisation to compute the marginal and conditional from the joint

  • 3. Application to inference and learning

     Inference: approximating posteriors
     Learning with Bayesian models
     Learning with statistical models and unobserved variables
     Learning with statistical models and unobserved variables: EM algorithm
