Approximate Inference
9.520 Class 19 Ruslan Salakhutdinov BCS and CSAIL, MIT
Plan:
1. Introduction/Notation.
2. Examples of successful Bayesian models.
3. Laplace and Variational Inference.
4. Basic Sampling Algorithms.
5. Markov chain Monte Carlo algorithms.
Bishop’s book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book).
MacKay's book: Information Theory, Inference, and Learning Algorithms, chapters 29-32.
Markov Chain Monte Carlo Methods:
http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html
http://www.cs.toronto.edu/~murray/teaching/
Notation: P(x) is the probability of x, P(x|θ) the conditional probability of x given θ, and P(x, θ) the joint probability of x and θ.
Bayes Rule: P(θ|x) = P(x|θ)P(θ) / P(x), where P(x) = ∫ P(x, θ) dθ (marginalization).
I will use probability distribution and probability density interchangeably. It should be obvious from the context.
Given a dataset D = {x1, ..., xn}:
Bayes Rule: P(θ|D) = P(D|θ)P(θ) / P(D), where P(D|θ) is the likelihood function of θ, P(θ) is the prior probability of θ, and P(θ|D) is the posterior distribution over θ.
Computing the posterior distribution is known as the inference problem. But:
P(D) = ∫ P(D|θ)P(θ) dθ
This integral can be very high-dimensional and difficult to compute.
Prediction: Given D, computing the conditional probability of a new observation x∗ requires computing the following integral:
P(x∗|D) = ∫ P(x∗|θ, D) P(θ|D) dθ = E_{P(θ|D)}[P(x∗|θ, D)],
which is sometimes called the predictive distribution. Computing the predictive distribution requires the posterior P(θ|D).
Compare model classes, e.g. M1 and M2. We need to compute posterior probabilities given D:
P(M|D) = P(D|M)P(M) / P(D),
where
P(D|M) = ∫ P(D|θ, M) P(θ|M) dθ
is known as the marginal likelihood or evidence.
These computations often involve very high-dimensional integrals. Computing posterior distributions (and hence predictive distributions) is often analytically intractable.
In this class we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference.
First, a few examples of successful Bayesian models:
– Bayesian Probabilistic Matrix Factorization
– Bayesian Neural Networks
– Dirichlet Process Mixtures (last class)
[Figure: a partially observed N × M rating matrix R (observed ratings and missing entries "?") factorized into a user feature matrix U and a movie feature matrix V.]
Probabilistic Matrix Factorization: we have N users, M movies, and integer rating values from 1 to K. Let rij be the rating of user i for movie j, and let U ∈ R^{D×N} and V ∈ R^{D×M} be latent user and movie feature matrices:
R ≈ U⊤V
Goal: Predict missing ratings.
[Figure: graphical model for Bayesian PMF, with hyperpriors over ΘU and ΘV, latent feature vectors Ui (i = 1, ..., N) and Vj (j = 1, ..., M), observed ratings Rij, and noise parameter σ.]
Probabilistic linear model with Gaussian observation noise:
p(rij | ui, vj, σ²) = N(rij | ui⊤vj, σ²)
Gaussian priors over parameters:
p(U | µU, ΣU) = ∏_{i=1}^{N} N(ui | µU, ΣU),   p(V | µV, ΣV) = ∏_{j=1}^{M} N(vj | µV, ΣV).
Conjugate Gaussian-inverse-Wishart priors are placed on the user and movie hyperparameters ΘU = {µU, ΣU} and ΘV = {µV, ΣV}, giving a hierarchical prior.
Predictive distribution: Consider predicting a rating r∗ij for user i and query movie j:
p(r∗ij | R) = ∫∫ p(r∗ij | ui, vj) p(U, V, ΘU, ΘV | R) d{U, V} d{ΘU, ΘV}
Exact evaluation of this predictive distribution is analytically intractable. The posterior distribution p(U, V, ΘU, ΘV | R) is complicated and does not have a closed-form expression. We need to approximate.
Bayesian Neural Networks. Regression problem: given a set of i.i.d. observations X = {xn}, n = 1, ..., N, with corresponding targets D = {tn}, n = 1, ..., N.
Likelihood:
p(D | X, w) = ∏_{n=1}^{N} N(tn | y(xn, w), β²)
The mean is given by the output of the neural network:
yk(x, w) = ∑_{j=1}^{M} w_kj^(2) σ( ∑_{i=1}^{D} w_ji^(1) xi ),
where σ(·) is the hidden-unit nonlinearity (e.g. a sigmoid) and w^(1), w^(2) are the first- and second-layer weights.
Gaussian prior over the network parameters: p(w) = N(0, α²I).
Likelihood:
p(D | X, w) = ∏_{n=1}^{N} N(tn | y(xn, w), β²)
Gaussian prior over parameters:
p(w) = N(0, α²I)
The posterior is analytically intractable:
p(w | D, X) = p(D | w, X) p(w) / ∫ p(D | w, X) p(w) dw
Remark: Under certain conditions, Radford Neal (1994) showed that, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.
Undirected graphical models: x is a binary random vector with xi ∈ {+1, −1}:
p(x) = (1/Z) exp( ∑_{(i,j)∈E} θij xi xj + ∑_{i} θi xi ),
where Z is known as the partition function:
Z = ∑_{x} exp( ∑_{(i,j)∈E} θij xi xj + ∑_{i} θi xi ).
If x is 100-dimensional, we need to sum over 2^100 terms. The sum might decompose (e.g. junction tree). Otherwise we need to approximate.
Remark: Compare to the marginal likelihood.
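To make the 2^D blow-up concrete, here is a minimal NumPy sketch (mine, not from the slides) that computes the partition function Z of a small binary MRF by brute-force enumeration; the chain-structured edge set and random parameters are made up for illustration.

```python
import itertools
import numpy as np

# Small binary MRF: x_i in {+1, -1}, p(x) = (1/Z) exp( sum_{(i,j) in E} theta_ij x_i x_j + sum_i theta_i x_i ).
D = 10                                       # brute force is O(2^D); hopeless for D = 100
rng = np.random.default_rng(0)
edges = [(i, i + 1) for i in range(D - 1)]   # a simple chain; any edge set works
theta_edge = {e: rng.normal(0, 0.5) for e in edges}
theta_node = rng.normal(0, 0.5, size=D)

def unnorm_log_prob(x):
    """log p~(x): the exponent of the Gibbs distribution, without log Z."""
    pair = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
    return pair + np.dot(theta_node, x)

# Enumerate all 2^D configurations to get Z exactly.
Z = sum(np.exp(unnorm_log_prob(np.array(x)))
        for x in itertools.product([+1, -1], repeat=D))
print(f"Z = {Z:.3f} after summing {2**D} terms")
```

Each extra dimension doubles the number of terms in the sum, which is why for D = 100 the sum must either decompose or be approximated.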
Notation: For most situations we will be interested in evaluating the expectation:
E[f] = ∫ f(z) p(z) dz
We will use the following notation: p(z) = p̃(z)/Z.
We can evaluate p̃(z) pointwise, but cannot evaluate Z. For example, the posterior P(θ|D) = (1/P(D)) P(D|θ)P(θ), or an energy-based model p(z) = (1/Z) exp(−E(z)).
Laplace Approximation. Consider p(z) = p̃(z)/Z. Goal: find a Gaussian approximation q(z) which is centered on a mode of the distribution p(z).
At a stationary point z0 the gradient ∇p̃(z) vanishes. Consider a Taylor expansion of ln p̃(z):
ln p̃(z) ≈ ln p̃(z0) − (1/2)(z − z0)⊤A(z − z0),
where A is the Hessian matrix:
A = −∇∇ ln p̃(z)|_{z=z0}
Exponentiating both sides:
p̃(z) ≈ p̃(z0) exp( −(1/2)(z − z0)⊤A(z − z0) )
The corresponding normalized Gaussian approximation is:
q(z) = ( |A|^{1/2} / (2π)^{D/2} ) exp( −(1/2)(z − z0)⊤A(z − z0) )
Remember p(z) = p̃(z)/Z, where we approximate:
Z = ∫ p̃(z) dz ≈ p̃(z0) ∫ exp( −(1/2)(z − z0)⊤A(z − z0) ) dz = p̃(z0) (2π)^{D/2} / |A|^{1/2}
Bayesian inference: P(θ|D) = (1/P(D)) P(D|θ)P(θ). Identify p̃(θ) = P(D|θ)P(θ) and Z = P(D). Then:
p(θ|D) ≈ ( |A|^{1/2} / (2π)^{D/2} ) exp( −(1/2)(θ − θMAP)⊤A(θ − θMAP) ),
where θMAP is the mode of the posterior and A is the Hessian of −ln P(D|θ)P(θ) at θMAP.
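A minimal one-dimensional NumPy sketch of the Laplace approximation (illustrative only; the unnormalized density p̃ below is made up): find a mode z0 of ln p̃, take A = −d²/dz² ln p̃ at z0 by finite differences, and compare the Laplace estimate of Z with crude numerical integration.

```python
import numpy as np

def log_p_tilde(z):
    # An arbitrary non-Gaussian unnormalized density (illustration only).
    return -0.5 * z**2 + np.log(1.0 / (1.0 + np.exp(-(20.0 * z + 4.0))))

# Find the mode z0 on a fine grid (a Newton or gradient method would also work).
grid = np.linspace(-5.0, 5.0, 200001)
z0 = grid[np.argmax(log_p_tilde(grid))]

# A = -(d^2/dz^2) ln p~(z) at z0, via central finite differences.
h = 1e-4
A = -(log_p_tilde(z0 + h) - 2.0 * log_p_tilde(z0) + log_p_tilde(z0 - h)) / h**2

# Laplace estimate: Z ~= p~(z0) * sqrt(2*pi / A), the D = 1 case of p~(z0) (2*pi)^{D/2} |A|^{-1/2}.
Z_laplace = np.exp(log_p_tilde(z0)) * np.sqrt(2.0 * np.pi / A)

# Compare with brute-force numerical integration of p~(z) on the same grid.
Z_numeric = np.sum(np.exp(log_p_tilde(grid))) * (grid[1] - grid[0])
print(f"mode {z0:.3f}, A {A:.3f}, Z_laplace {Z_laplace:.4f}, Z_numeric {Z_numeric:.4f}")
```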
Using the same identification, the Laplace approximation to the model evidence is:
P(D) = ∫ P(D|θ)P(θ) dθ ≈ P(D|θMAP) P(θMAP) (2π)^{D/2} / |A|^{1/2},
or, taking logarithms:
ln P(D) ≈ ln P(D|θMAP) + ln P(θMAP) + (D/2) ln 2π − (1/2) ln |A|
The Bayesian Information Criterion (BIC) can be obtained from the Laplace approximation:
ln P(D) ≈ ln P(D|θMAP) + ln P(θMAP) + (D/2) ln 2π − (1/2) ln |A|
by taking the large-sample limit (N → ∞), where N is the number of data points:
ln P(D) ≈ ln P(D|θMAP) − (1/2) D ln N
Variational Inference. Key Idea: Approximate the intractable distribution p(θ|D) with a simpler, tractable distribution q(θ). We can lower-bound the marginal likelihood using Jensen's inequality:
ln p(D) = ln ∫ p(D, θ) dθ = ln ∫ q(θ) [ p(D, θ) / q(θ) ] dθ ≥ ∫ q(θ) ln [ p(D, θ) / q(θ) ] dθ
= ∫ q(θ) ln p(D, θ) dθ + ∫ q(θ) ln [ 1 / q(θ) ] dθ
= ln p(D) − KL( q(θ) || p(θ|D) ) = L(q),
where KL(q||p) is the Kullback–Leibler divergence, a non-symmetric measure of the difference between two probability distributions q and p. The goal of variational inference is to maximize the variational lower bound L(q) w.r.t. the approximate distribution q, or equivalently to minimize KL(q||p).
Key Idea: Approximate the intractable distribution p(θ|D) with a simpler, tractable distribution q(θ) by minimizing KL(q(θ)||p(θ|D)). We can choose a fully factorized distribution: q(θ) = ∏_{i=1}^{D} qi(θi), also known as a mean-field approximation. The variational lower bound takes the form:
L(q) = ∫ q(θ) ln [ p(D, θ) / q(θ) ] dθ = ∫ qj(θj) [ ∫ ln p(D, θ) ∏_{i≠j} qi(θi) dθi ] dθj + ∑_{i} ∫ qi(θi) ln [ 1 / qi(θi) ] dθi
Suppose we keep {qi, i ≠ j} fixed and maximize L(q) w.r.t. all possible forms for the distribution qj(θj).
[Figure: the original distribution (yellow), along with the Laplace (red) and variational (green) approximations.]
By maximizing L(q) w.r.t. all possible forms for the distribution qj(θj) we obtain a general expression for the optimal factor:
q∗j(θj) = exp( E_{i≠j}[ln p(D, θ)] ) / ∫ exp( E_{i≠j}[ln p(D, θ)] ) dθj
Iterative Procedure: Initialize all qj and then iterate through the factors, replacing each in turn with a revised estimate. Convergence is guaranteed as the bound is convex w.r.t. each of the factors qj (see Bishop, chapter 10).
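As a concrete (illustrative) instance of these updates, the sketch below applies the mean-field recursion to a correlated two-dimensional Gaussian target, similar to the factorized-Gaussian example in Bishop, chapter 10: with q(z) = q1(z1)q2(z2), each optimal factor is a Gaussian whose mean depends on the current estimate of the other factor. The specific target parameters are made up.

```python
import numpy as np

# Target: p(z) = N(z | mu, Lambda^{-1}) with a correlated precision matrix (illustration only).
mu = np.array([0.0, 0.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])       # precision matrix

# Mean-field q(z) = q1(z1) q2(z2). For a Gaussian target the optimal factors are Gaussians:
# q_j has precision Lam[j, j] and mean
#   m_j = mu_j - Lam[j, j]^{-1} * Lam[j, k] * (E[z_k] - mu_k),  k != j,
# which follows from q_j* ∝ exp(E_{i≠j}[ln p(z)]).
m = np.array([5.0, -5.0])          # initialize the factor means far from the truth
for it in range(20):
    m[0] = mu[0] - Lam[0, 1] * (m[1] - mu[1]) / Lam[0, 0]
    m[1] = mu[1] - Lam[1, 0] * (m[0] - mu[0]) / Lam[1, 1]

print("mean-field means:    ", m, "   true means:", mu)
print("mean-field variances:", 1.0 / np.diag(Lam),
      "   true marginal variances:", np.diag(np.linalg.inv(Lam)))
```

The printout also illustrates a well-known property of minimizing KL(q||p) with a factorized q: the means are recovered, but the marginal variances are underestimated.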
Notation (recap): we are interested in evaluating the expectation
E[f] = ∫ f(z) p(z) dz
for a distribution p(z) = p̃(z)/Z, where p̃(z) can be evaluated pointwise but Z cannot (e.g. the posterior P(θ|D) = (1/P(D)) P(D|θ)P(θ), or p(z) = (1/Z) exp(−E(z))).
Simple Monte Carlo. General Idea: Draw independent samples {z1, ..., zN} from the distribution p(z) to approximate the expectation:
E[f] = ∫ f(z) p(z) dz ≈ (1/N) ∑_{n=1}^{N} f(zn) = f̂
Note that E[f̂] = E[f], so the estimator f̂ has the correct mean (it is unbiased). Its variance is:
var[f̂] = (1/N) E[ (f − E[f])² ]
Remark: The accuracy of the estimator does not depend on the dimensionality of z.
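A minimal NumPy sketch of the simple Monte Carlo estimator (mine, not from the slides): estimate E[f] for f(z) = ||z||² under a D-dimensional standard Gaussian, for which the exact answer is D, and watch the estimate tighten as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, N):
    """Simple Monte Carlo: f_hat = (1/N) * sum_n f(z_n), with z_n ~ p(z)."""
    z = sampler(N)
    return np.mean(f(z))

D = 50                                        # dimensionality of z
f = lambda z: np.sum(z**2, axis=1)            # f(z) = ||z||^2, so E[f] = D under N(0, I)
sampler = lambda N: rng.standard_normal((N, D))

for N in [10, 100, 1000, 10000]:
    est = mc_estimate(f, sampler, N)
    print(f"N = {N:6d}   f_hat = {est:8.3f}   (exact E[f] = {D})")
```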
In general:
E[f] = ∫ f(z) p(z) dz ≈ (1/N) ∑_{n=1}^{N} f(zn),   zn ∼ p(z)
For the predictive distribution:
P(x∗|D) = ∫ P(x∗|θ, D) p(θ|D) dθ ≈ (1/N) ∑_{n=1}^{N} P(x∗|θn, D),   θn ∼ p(θ|D)
Problem: It is hard to draw exact samples from p(z).
Basic Sampling Algorithm: how to generate samples from a simple non-uniform distribution, assuming we can generate samples from the uniform distribution. Define the cumulative distribution function:
h(y) = ∫_{−∞}^{y} p(ŷ) dŷ
Sample z ∼ U[0, 1]. Then y = h^{−1}(z) is a sample from p(y).
Problem: Computing the cumulative h(y) is just as hard!
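A small sketch of the inverse-CDF idea for a case where h is available in closed form, the exponential distribution p(y) = λ exp(−λy): here h(y) = 1 − exp(−λy), so y = h^{−1}(z) = −ln(1 − z)/λ. This example is mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                        # rate of the exponential distribution p(y) = lam * exp(-lam * y)

# h(y) = integral_{0}^{y} p(y') dy' = 1 - exp(-lam * y), so h^{-1}(z) = -ln(1 - z) / lam.
z = rng.uniform(0.0, 1.0, size=100000)   # z ~ U[0, 1]
y = -np.log(1.0 - z) / lam               # y = h^{-1}(z) is a sample from p(y)

print("sample mean:", y.mean(), "  exact mean 1/lam:", 1.0 / lam)
print("sample var: ", y.var(),  "  exact var 1/lam^2:", 1.0 / lam**2)
```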
Rejection Sampling: Sampling from the target distribution p(z) = p̃(z)/Zp is difficult. Suppose we have an easy-to-sample proposal distribution q(z), such that k q(z) ≥ p̃(z) for all z.
Sample z0 from q(z). Sample u0 from Uniform[0, k q(z0)]. The pair (z0, u0) then has a uniform distribution under the curve of k q(z). If u0 > p̃(z0), the sample is rejected; otherwise z0 is accepted as a sample from p(z).
The probability that a sample is accepted is:
p(accept) = ∫ [ p̃(z) / (k q(z)) ] q(z) dz = (1/k) ∫ p̃(z) dz
The fraction of accepted samples depends on the ratio of the area under p̃(z) to the area under k q(z). It is hard to find an appropriate q(z) with the optimal k. Rejection sampling is a useful technique in one or two dimensions, and is typically applied as a subroutine in more advanced algorithms.
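A minimal rejection-sampling sketch (illustrative; the unnormalized bimodal target p̃, the Gaussian proposal, and the envelope constant k are all made up): propose z0 from q, draw u0 uniformly under k q(z0), and keep z0 only when u0 falls under p̃(z0).

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    # Unnormalized bimodal target (illustration only).
    return np.exp(-0.5 * (z - 2.0)**2) + 0.5 * np.exp(-0.5 * (z + 2.0)**2)

# Proposal q(z) = N(0, sigma_q^2); pick k numerically so that k q(z) >= p_tilde(z) everywhere.
sigma_q = 3.0
q = lambda z: np.exp(-0.5 * (z / sigma_q)**2) / (sigma_q * np.sqrt(2.0 * np.pi))
grid = np.linspace(-10.0, 10.0, 10001)
k = 1.05 * np.max(p_tilde(grid) / q(grid))

n_proposed = 100000
z0 = rng.normal(0.0, sigma_q, size=n_proposed)   # z0 ~ q(z)
u0 = rng.uniform(0.0, k * q(z0))                 # u0 ~ Uniform[0, k q(z0)]
samples = z0[u0 <= p_tilde(z0)]                  # reject whenever u0 > p_tilde(z0)

Zp = np.sum(p_tilde(grid)) * (grid[1] - grid[0]) # crude estimate of the area under p_tilde
print(f"accepted {len(samples)}/{n_proposed} = {len(samples)/n_proposed:.2%}, "
      f"predicted acceptance Zp/k = {Zp/k:.2%}")
```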
Importance Sampling: Suppose we have an easy-to-sample proposal distribution q(z), such that q(z) > 0 whenever p(z) > 0.
E[f] = ∫ f(z) p(z) dz = ∫ f(z) [ p(z) / q(z) ] q(z) dz ≈ (1/N) ∑_{n=1}^{N} [ p(zn) / q(zn) ] f(zn),   zn ∼ q(z)
The quantities wn = p(zn)/q(zn) are known as importance weights. Unlike rejection sampling, all samples are retained. But wait: we cannot compute p(z), only p̃(z).
Let our proposal be of the form q(z) = q̃(z)/Zq:
E[f] = ∫ f(z) p(z) dz = ∫ f(z) [ p(z) / q(z) ] q(z) dz = (Zq/Zp) ∫ f(z) [ p̃(z) / q̃(z) ] q(z) dz
≈ (Zq/Zp) (1/N) ∑_{n=1}^{N} [ p̃(zn) / q̃(zn) ] f(zn) = (Zq/Zp) (1/N) ∑_{n=1}^{N} wn f(zn),   zn ∼ q(z)
But we can use the same importance weights to approximate Zp/Zq:
Zp/Zq = (1/Zq) ∫ p̃(z) dz = ∫ [ p̃(z) / q̃(z) ] q(z) dz ≈ (1/N) ∑_{n=1}^{N} [ p̃(zn) / q̃(zn) ] = (1/N) ∑_{n=1}^{N} wn
Hence:
E[f] ≈ ∑_{n=1}^{N} wn f(zn) / ∑_{n=1}^{N} wn
This estimator is consistent but biased.
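A sketch of this self-normalized estimator with unnormalized target and proposal (both densities below are made up for illustration): the same weights wn = p̃(zn)/q̃(zn) appear in the numerator and in the estimate of Zp/Zq, so the unknown normalizers cancel.

```python
import numpy as np

rng = np.random.default_rng(0)

p_tilde = lambda z: np.exp(-0.5 * (z - 1.0)**2 / 0.25)   # unnormalized target N(1, 0.5^2); Z_p "unknown"
q_tilde = lambda z: np.exp(-0.5 * z**2 / 4.0)             # unnormalized proposal N(0, 2^2)
f = lambda z: z**2                                        # we want E_p[f]; exact value is 1^2 + 0.5^2 = 1.25

N = 100000
z = rng.normal(0.0, 2.0, size=N)     # z_n ~ q(z)
w = p_tilde(z) / q_tilde(z)          # importance weights w_n = p~(z_n) / q~(z_n)

estimate = np.sum(w * f(z)) / np.sum(w)   # E[f] ~= sum_n w_n f(z_n) / sum_n w_n
print(f"self-normalized IS estimate: {estimate:.4f}   (exact: 1.25)")

# Effective sample size: a rough diagnostic of how degenerate the weights are.
ess = np.sum(w)**2 / np.sum(w**2)
print(f"effective sample size: {ess:.0f} of {N}")
```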
If our proposal distribution q(z) poorly matches our target distribution p(z), then the importance weights are dominated by a few samples and the variance of the estimate can be very large (an unreliable estimator). For high-dimensional problems, finding good proposal distributions is very hard. What can we do? Markov chain Monte Carlo.
Markov Chains: A first-order Markov chain is a series of random variables {z1, ..., zN} such that the following conditional independence property holds for n ∈ {1, ..., N − 1}:
p(zn+1 | z1, ..., zn) = p(zn+1 | zn)
We can specify a Markov chain by the initial distribution p(z1) and the transition probabilities T(zn+1 ← zn) ≡ p(zn+1 | zn).
Remark: T(zn+1 ← zn) is sometimes called a transition kernel.
The marginal probability of a particular state can be computed as:
p(zn+1) = ∑_{zn} T(zn+1 ← zn) p(zn)
A distribution π(z) is said to be invariant, or stationary, with respect to a Markov chain if each step in the chain leaves π(z) invariant:
π(z) = ∑_{z′} T(z ← z′) π(z′)
A given Markov chain may have many stationary distributions. For example, if T(z ← z′) = I{z = z′} is the identity transformation, then any distribution is invariant.
A sufficient (but not necessary) condition for ensuring that π(z) is invariant is to choose a transition kernel that satisfies the detailed balance property:
π(z′) T(z ← z′) = π(z) T(z′ ← z)
A transition kernel that satisfies detailed balance will leave that distribution invariant:
∑_{z′} π(z′) T(z ← z′) = ∑_{z′} π(z) T(z′ ← z) = π(z) ∑_{z′} T(z′ ← z) = π(z)
A Markov chain that satisfies detailed balance is said to be reversible.
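A tiny numerical check (illustrative; the transition matrix is made up) for a two-state chain: find the stationary distribution as the left eigenvector of T with eigenvalue 1, confirm invariance, and verify detailed balance π(z′)T(z ← z′) = π(z)T(z′ ← z) entry by entry.

```python
import numpy as np

# T[i, j] = probability of moving from state i to state j (each row sums to 1).
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: pi = pi T, i.e. the left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

print("stationary pi:   ", pi)          # [0.75, 0.25] for this T
print("invariance pi T: ", pi @ T)      # equals pi
# Detailed balance: pi_i T[i, j] == pi_j T[j, i] for all i, j; i.e. the matrix pi_i T[i, j] is symmetric.
print("detailed balance:", np.allclose(pi[:, None] * T, (pi[:, None] * T).T))
```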
We want to sample from a target distribution π(z) = π̃(z)/Z (e.g. a posterior distribution). Obtaining independent samples is difficult.
Instead, we set up a Markov chain whose invariant distribution is π(z). If the chain is ergodic, i.e. it is possible to get from any state to any other state (not necessarily in one move), then the chain will converge to this unique invariant distribution π(z).
We then obtain dependent samples, drawn approximately from π(z), by simulating the Markov chain for some time.
Ergodicity: There exists a K such that, for any starting z, T^K(z′ ← z) > 0 for all z′ with π(z′) > 0.
Metropolis-Hastings Algorithm: a Markov chain transition operator from the current state z to a new state z′ is defined as follows:
A candidate state z∗ is proposed from a proposal distribution q(z∗|z), e.g. N(z, σ²).
The candidate is accepted with probability
min( 1, [ π̃(z∗) / π̃(z) ] · [ q(z|z∗) / q(z∗|z) ] ).
If accepted, the new state is z′ = z∗; otherwise the new state is set to a copy of the current state.
Note: there is no need to know the normalizing constant Z.
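A compact Metropolis-Hastings sketch (illustrative; the unnormalized target π̃ and the proposal width are made up). Because the proposal N(z, σ²) is symmetric, the ratio q(z|z∗)/q(z∗|z) cancels and the rule reduces to the Metropolis acceptance probability min(1, π̃(z∗)/π̃(z)).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi_tilde(z):
    # Unnormalized log target: an equal mixture of two well-separated Gaussians (illustration only).
    return np.logaddexp(-0.5 * (z - 3.0)**2, -0.5 * (z + 3.0)**2)

def metropolis(log_target, z_init, n_steps, step_size):
    z, samples, n_accept = z_init, [], 0
    for _ in range(n_steps):
        z_star = z + step_size * rng.standard_normal()       # propose z* ~ N(z, step_size^2)
        # Accept with probability min(1, pi~(z*) / pi~(z)); work in log space for stability.
        if np.log(rng.uniform()) < log_target(z_star) - log_target(z):
            z, n_accept = z_star, n_accept + 1
        samples.append(z)                                     # a rejected move keeps a copy of z
    return np.array(samples), n_accept / n_steps

samples, accept_rate = metropolis(log_pi_tilde, z_init=0.0, n_steps=50000, step_size=2.5)
print(f"acceptance rate: {accept_rate:.2f}")
print(f"fraction of samples in the right-hand mode: {(samples > 0).mean():.2f}  (target: 0.50)")
```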
We can show that the M-H transition kernel leaves π(z) invariant by showing that it satisfies detailed balance:
π(z) T(z′ ← z) = π(z) q(z′|z) min( 1, [ π(z′) / π(z) ] · [ q(z|z′) / q(z′|z) ] )
= min( π(z) q(z′|z), π(z′) q(z|z′) )
= π(z′) q(z|z′) min( [ π(z) / π(z′) ] · [ q(z′|z) / q(z|z′) ], 1 )
= π(z′) T(z ← z′)
Note that whether the chain is ergodic will depend on the particulars of the stationary distribution π and the proposal distribution q.
[Figure: using the Metropolis algorithm to sample from a Gaussian distribution with proposal q(z′|z) = N(z, 0.04); accepted moves are shown in green, rejected moves in red.]
Choice of proposal: q(z′|z) = N(z, ρ²). If ρ is large, many proposals are rejected; if ρ is small, the chain moves too slowly. The specific choice of proposal can greatly affect the performance of the algorithm.
Gibbs Sampling: Consider sampling from p(z1, ..., zN).
Initialize zi, i = 1, ..., N.
For t = 1, ..., T:
  Sample z1^(t+1) ∼ p(z1 | z2^(t), ..., zN^(t))
  Sample z2^(t+1) ∼ p(z2 | z1^(t+1), z3^(t), ..., zN^(t))
  ...
  Sample zN^(t+1) ∼ p(zN | z1^(t+1), ..., z(N−1)^(t+1))
The Gibbs sampler is a particular instance of the M-H algorithm with proposals p(zn | z_{i≠n}), which are accepted with probability 1. We apply a series of these component-wise operators.
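A minimal Gibbs-sampler sketch for a case where the conditionals are trivially available, a bivariate Gaussian with correlation ρ (parameters made up): p(z1|z2) = N(ρ z2, 1 − ρ²) and symmetrically for z2, so each coordinate is resampled in turn from its conditional.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95                 # correlation of the target N(0, [[1, rho], [rho, 1]])

def gibbs(n_steps, z_init=(5.0, -5.0)):
    z1, z2 = z_init
    samples = np.empty((n_steps, 2))
    for t in range(n_steps):
        # Resample each coordinate from its conditional given the *current* value of the other.
        z1 = rng.normal(rho * z2, np.sqrt(1.0 - rho**2))   # z1 ~ p(z1 | z2)
        z2 = rng.normal(rho * z1, np.sqrt(1.0 - rho**2))   # z2 ~ p(z2 | z1)
        samples[t] = (z1, z2)
    return samples

samples = gibbs(20000)[1000:]            # discard an initial burn-in period
print("sample means:      ", samples.mean(axis=0))          # close to [0, 0]
print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```

With ρ close to 1 the coordinate-wise moves are small, so the chain mixes slowly, a simple illustration of why strongly coupled variables are hard for Gibbs sampling.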
Applicability of the Gibbs sampler depends on how easy it is to sample from the conditional probabilities p(zn | z_{i≠n}).
For discrete variables:
p(zn | z_{i≠n}) = p(zn, z_{i≠n}) / ∑_{zn} p(zn, z_{i≠n})
The sum can be computed analytically.
For continuous variables:
p(zn | z_{i≠n}) = p(zn, z_{i≠n}) / ∫ p(zn, z_{i≠n}) dzn
The integral is univariate and is often analytically tractable or amenable to standard sampling methods.
Remember the predictive distribution? Consider predicting a rating r∗ij for user i and query movie j:
p(r∗ij | R) = ∫∫ p(r∗ij | ui, vj) p(U, V, ΘU, ΘV | R) d{U, V} d{ΘU, ΘV}
Use a Monte Carlo approximation:
p(r∗ij | R) ≈ (1/N) ∑_{n=1}^{N} p(r∗ij | ui(n), vj(n)).
The samples (ui(n), vj(n)) are generated by running a Gibbs sampler, whose stationary distribution is the posterior distribution of interest.
Monte Carlo approximation:
p(r∗ij | R) ≈ (1/N) ∑_{n=1}^{N} p(r∗ij | ui(n), vj(n)).
The conditional distributions over the user and movie feature vectors are Gaussians, so they are easy to sample from:
p(ui | R, V, ΘU, α) = N(ui | µi∗, Σi∗),   p(vj | R, U, ΘV, α) = N(vj | µj∗, Σj∗)
The conditional distributions over the user and movie hyperparameters also have closed-form distributions, so they are easy to sample from.
Netflix dataset: Bayesian PMF can handle over 100 million ratings.
Main problems of MCMC: successive samples are correlated, and chains can get trapped in isolated modes.
More advanced MCMC methods for sampling from distributions with isolated modes:
Hamiltonian Monte Carlo methods (which make use of gradient information), Nested Sampling, Coupling from the Past, and many others.