

  1. Approximate Inference. 9.520 Class 19. Ruslan Salakhutdinov, BCS and CSAIL, MIT.

  2. Plan: 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Laplace and Variational Inference. 4. Basic Sampling Algorithms. 5. Markov chain Monte Carlo algorithms.

  3. References/Acknowledgements • Chris Bishop's book: Pattern Recognition and Machine Learning, chapter 11 (many figures are borrowed from this book). • David MacKay's book: Information Theory, Inference, and Learning Algorithms, chapters 29-32. • Radford Neal's technical report on Probabilistic Inference Using Markov Chain Monte Carlo Methods. • Zoubin Ghahramani's ICML tutorial on Bayesian Machine Learning: http://www.gatsby.ucl.ac.uk/~zoubin/ICML04-tutorial.html • Ian Murray's tutorial on Sampling Methods: http://www.cs.toronto.edu/~murray/teaching/

  4. Basic Notation. $P(x)$: probability of $x$; $P(x \mid \theta)$: conditional probability of $x$ given $\theta$; $P(x, \theta)$: joint probability of $x$ and $\theta$. Bayes Rule: $P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$, where $P(x) = \int P(x, \theta)\, d\theta$ (marginalization). I will use probability distribution and probability density interchangeably; it should be obvious from the context.
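
To make the notation concrete, here is a tiny numerical illustration of Bayes rule and marginalization for a two-valued $\theta$; the numbers are made up purely for illustration and are not from the slides.

```python
import numpy as np

prior = np.array([0.7, 0.3])        # P(theta) for theta in {0, 1}
lik = np.array([0.2, 0.9])          # P(x | theta) for one fixed observation x

p_x = np.sum(lik * prior)           # marginalization: P(x) = sum_theta P(x, theta)
posterior = lik * prior / p_x       # Bayes rule: P(theta | x)
print(p_x, posterior)
```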

  5. Inference Problem. Given a dataset $\mathcal{D} = \{x_1, \ldots, x_n\}$, Bayes Rule gives $P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$, where $P(\mathcal{D} \mid \theta)$ is the likelihood function of $\theta$, $P(\theta)$ is the prior probability of $\theta$, and $P(\theta \mid \mathcal{D})$ is the posterior distribution over $\theta$. Computing the posterior distribution is known as the inference problem. But $P(\mathcal{D}) = \int P(\mathcal{D}, \theta)\, d\theta$, and this integral can be very high-dimensional and difficult to compute.

  6. Prediction. As above, $P(\mathcal{D} \mid \theta)$ is the likelihood function of $\theta$, $P(\theta)$ is the prior probability of $\theta$, and $P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})}$ is the posterior distribution over $\theta$. Prediction: given $\mathcal{D}$, computing the conditional probability of $x^*$ requires computing the integral $P(x^* \mid \mathcal{D}) = \int P(x^* \mid \theta, \mathcal{D})\, P(\theta \mid \mathcal{D})\, d\theta = \mathbb{E}_{P(\theta \mid \mathcal{D})}\!\left[ P(x^* \mid \theta, \mathcal{D}) \right]$, which is sometimes called the predictive distribution. Computing the predictive distribution requires the posterior $P(\theta \mid \mathcal{D})$.
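
The expectation form suggests a simple Monte Carlo estimate of the predictive distribution once (approximate) posterior samples are available. The sketch below is illustrative only; the sampler and the likelihood function are assumed to be supplied by the caller.

```python
import numpy as np

def predictive_density(x_star, posterior_samples, likelihood):
    """Monte Carlo estimate of P(x* | D) = E_{P(theta|D)}[P(x* | theta)].

    posterior_samples : theta values drawn (approximately) from P(theta | D),
                        e.g. by one of the MCMC methods discussed later.
    likelihood        : function (x, theta) -> P(x | theta), assumed given by the model.
    """
    return np.mean([likelihood(x_star, theta) for theta in posterior_samples])

# Toy usage: Gaussian likelihood with unknown mean theta and unit variance.
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.2, size=5000)   # stand-in posterior samples
gauss = lambda x, th: np.exp(-0.5 * (x - th) ** 2) / np.sqrt(2 * np.pi)
print(predictive_density(0.5, samples, gauss))
```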

  7. Model Selection. Compare model classes, e.g. $\mathcal{M}_1$ and $\mathcal{M}_2$. Need to compute posterior probabilities given $\mathcal{D}$: $P(\mathcal{M} \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \mathcal{M})\, P(\mathcal{M})}{P(\mathcal{D})}$, where $P(\mathcal{D} \mid \mathcal{M}) = \int P(\mathcal{D} \mid \theta, \mathcal{M})\, P(\theta \mid \mathcal{M})\, d\theta$ is known as the marginal likelihood or evidence.
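
Not on the slide, but as a concrete (if crude) baseline: the evidence integral is an expectation under the prior and can be estimated by simple Monte Carlo. This estimator is notoriously high-variance and is shown only to make the integral tangible; the function names are hypothetical.

```python
import numpy as np

def naive_evidence(data_loglik, prior_sampler, num_samples=10000):
    """Estimate ln P(D | M) = ln E_{P(theta|M)}[P(D | theta, M)] by sampling the prior.

    data_loglik   : function theta -> ln P(D | theta, M)
    prior_sampler : function n -> n draws from P(theta | M)
    """
    thetas = prior_sampler(num_samples)
    logliks = np.array([data_loglik(th) for th in thetas])
    m = logliks.max()                                   # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(logliks - m)))
```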

  8. Computational Challenges. • Computing marginal likelihoods often requires computing very high-dimensional integrals. • Computing posterior distributions (and hence predictive distributions) is often analytically intractable. • In this class, we will concentrate on Markov chain Monte Carlo (MCMC) methods for performing approximate inference. • First, let us look at some specific examples: Bayesian Probabilistic Matrix Factorization, Bayesian Neural Networks, and Dirichlet Process Mixtures (last class).

  9. Bayesian PMF. [Figure: a sparse user-by-movie rating matrix $R$ with many missing entries, factorized as $R \approx U^{\top} V$ with user features $U$ and movie features $V$.] We have $N$ users, $M$ movies, and integer rating values from 1 to $K$. Let $r_{ij}$ be the rating of user $i$ for movie $j$, and let $U \in \mathbb{R}^{D \times N}$, $V \in \mathbb{R}^{D \times M}$ be latent user and movie feature matrices: $R \approx U^{\top} V$. Goal: predict missing ratings.
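
A minimal sketch of the low-rank prediction $R \approx U^{\top} V$, with small illustrative dimensions and random factors; it only shows how a single missing rating would be filled in from the latent features.

```python
import numpy as np

D, N, M = 5, 100, 80                 # latent dimension, users, movies (illustrative)
rng = np.random.default_rng(0)
U = rng.normal(size=(D, N))          # latent user features, one column per user
V = rng.normal(size=(D, M))          # latent movie features, one column per movie

def predict_rating(i, j):
    """Point prediction for user i, movie j under R ~ U^T V."""
    return U[:, i] @ V[:, j]

print(predict_rating(3, 17))
```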

  10. Bayesian PMF. [Figure: graphical model with hyperpriors $\alpha_U, \alpha_V$, hyperparameters $\Theta_U, \Theta_V$, latent features $U_i$ ($i = 1, \ldots, N$) and $V_j$ ($j = 1, \ldots, M$), observed ratings $R_{ij}$, and noise parameter $\sigma$.] Probabilistic linear model with Gaussian observation noise. Likelihood: $p(r_{ij} \mid u_i, v_j, \sigma^2) = \mathcal{N}(r_{ij} \mid u_i^{\top} v_j, \sigma^2)$. Gaussian priors over parameters: $p(U \mid \mu_U, \Sigma_U) = \prod_{i=1}^{N} \mathcal{N}(u_i \mid \mu_U, \Sigma_U)$ and $p(V \mid \mu_V, \Sigma_V) = \prod_{j=1}^{M} \mathcal{N}(v_j \mid \mu_V, \Sigma_V)$. Conjugate Gaussian-inverse-Wishart priors on the user and movie hyperparameters $\Theta_U = \{\mu_U, \Sigma_U\}$ and $\Theta_V = \{\mu_V, \Sigma_V\}$: a hierarchical prior.

  11. Bayesian PMF. Predictive distribution: consider predicting a rating $r^*_{ij}$ for user $i$ and query movie $j$: $p(r^*_{ij} \mid R) = \iint p(r^*_{ij} \mid u_i, v_j)\, p(U, V, \Theta_U, \Theta_V \mid R)\; d\{U, V\}\, d\{\Theta_U, \Theta_V\}$, where the second factor is the posterior over parameters and hyperparameters. Exact evaluation of this predictive distribution is analytically intractable: the posterior distribution $p(U, V, \Theta_U, \Theta_V \mid R)$ is complicated and does not have a closed-form expression. Need to approximate.
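
Assuming posterior samples of $(U, V)$ are available (e.g. from the MCMC methods discussed later in the course), the predictive distribution is typically approximated by averaging per-sample predictions. The sketch below is a hypothetical illustration of that averaging step, not the exact procedure from the slides.

```python
import numpy as np

def predictive_mean_rating(i, j, U_samples, V_samples):
    """Approximate E[r*_ij | R] by averaging u_i^T v_j over posterior samples.

    U_samples, V_samples : lists of arrays of shape (D, N) and (D, M),
                           assumed drawn from p(U, V, ... | R) by some sampler.
    """
    preds = [U[:, i] @ V[:, j] for U, V in zip(U_samples, V_samples)]
    return np.mean(preds)
```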

  12. Bayesian Neural Nets. Regression problem: given a set of i.i.d. observations $X = \{x_n\}_{n=1}^{N}$ with corresponding targets $\mathcal{D} = \{t_n\}_{n=1}^{N}$. Likelihood: $p(\mathcal{D} \mid X, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^2)$. The mean is given by the output of the neural network: $y_k(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M} w^{(2)}_{kj}\, \sigma\!\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right)$, where $\sigma(x)$ is the sigmoid function. Gaussian prior over the network parameters: $p(\mathbf{w}) = \mathcal{N}(0, \alpha^2 I)$.
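
A minimal sketch of the two-layer mean function $y_k(\mathbf{x}, \mathbf{w})$ above, with illustrative sizes; treating index 0 as a bias unit is an assumption about how the sums are indexed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nn_output(x, W1, W2):
    """y_k(x, w) = sum_j W2[k, j] * sigmoid(sum_i W1[j, i] * x_i), with x_0 = 1 as bias."""
    x = np.concatenate(([1.0], x))       # prepend bias input x_0 = 1
    h = sigmoid(W1 @ x)                  # hidden activations, shape (M,)
    h = np.concatenate(([1.0], h))       # prepend hidden bias unit
    return W2 @ h                        # outputs y_k, shape (K,)

# Illustrative shapes: D inputs, M hidden units, K outputs.
D_in, M, K = 3, 4, 2
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(M, D_in + 1))
W2 = rng.normal(scale=0.5, size=(K, M + 1))
print(nn_output(np.array([0.2, -1.0, 0.5]), W1, W2))
```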

  13. Bayesian Neural Nets. Likelihood: $p(\mathcal{D} \mid X, \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, \mathbf{w}), \beta^2)$. Gaussian prior over parameters: $p(\mathbf{w}) = \mathcal{N}(0, \alpha^2 I)$. The posterior is analytically intractable: $p(\mathbf{w} \mid \mathcal{D}, X) = \frac{p(\mathcal{D} \mid \mathbf{w}, X)\, p(\mathbf{w})}{\int p(\mathcal{D} \mid \mathbf{w}, X)\, p(\mathbf{w})\, d\mathbf{w}}$. Remark: under certain conditions, Radford Neal (1994) showed that, as the number of hidden units goes to infinity, a Gaussian prior over parameters results in a Gaussian process prior over functions.

  14. Undirected Models. $\mathbf{x}$ is a binary random vector with $x_i \in \{+1, -1\}$: $p(\mathbf{x}) = \frac{1}{Z} \exp\!\left( \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i \right)$, where $Z$ is known as the partition function: $Z = \sum_{\mathbf{x}} \exp\!\left( \sum_{(i,j) \in E} \theta_{ij} x_i x_j + \sum_{i \in V} \theta_i x_i \right)$. If $\mathbf{x}$ is 100-dimensional, we need to sum over $2^{100}$ terms. The sum might decompose (e.g. junction tree); otherwise we need to approximate. Remark: compare to the marginal likelihood.
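
To make the $2^{|V|}$ blow-up concrete, here is a brute-force computation of $Z$ for a tiny model with made-up parameters; it is only feasible for a handful of variables.

```python
import itertools
import numpy as np

def partition_function(theta_pair, theta_single, edges, n):
    """Brute-force Z = sum_x exp(sum_{(i,j) in E} theta_ij x_i x_j + sum_i theta_i x_i).

    Exponential in n, so only feasible for small n; for n = 100 the sum has 2^100 terms.
    """
    Z = 0.0
    for x in itertools.product([-1, +1], repeat=n):
        energy = sum(theta_pair[(i, j)] * x[i] * x[j] for (i, j) in edges)
        energy += sum(theta_single[i] * x[i] for i in range(n))
        Z += np.exp(energy)
    return Z

# Tiny 3-node chain as an example.
edges = [(0, 1), (1, 2)]
theta_pair = {(0, 1): 0.5, (1, 2): -0.3}
theta_single = [0.1, 0.0, -0.2]
print(partition_function(theta_pair, theta_single, edges, n=3))
```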

  15. Inference. For most situations we will be interested in evaluating the expectation $\mathbb{E}[f] = \int f(z)\, p(z)\, dz$. We will use the following notation: $p(z) = \tilde{p}(z)/Z$; we can evaluate $\tilde{p}(z)$ pointwise, but cannot evaluate $Z$. • Posterior distribution: $P(\theta \mid \mathcal{D}) = \frac{1}{P(\mathcal{D})} P(\mathcal{D} \mid \theta)\, P(\theta)$. • Markov random fields: $P(z) = \frac{1}{Z} \exp(-E(z))$.

  16. Laplace Approximation. Consider $p(z) = \tilde{p}(z)/Z$. Goal: find a Gaussian approximation $q(z)$ which is centered on a mode of the distribution $p(z)$. [Figure: a non-Gaussian density and the Gaussian approximation centered on its mode.] At a stationary point $z_0$ the gradient $\nabla \tilde{p}(z)$ vanishes. Consider a Taylor expansion of $\ln \tilde{p}(z)$: $\ln \tilde{p}(z) \approx \ln \tilde{p}(z_0) - \frac{1}{2}(z - z_0)^{T} A (z - z_0)$, where $A$ is the Hessian matrix $A = -\nabla \nabla \ln \tilde{p}(z) \big|_{z = z_0}$.

  17. Laplace Approximation. Consider $p(z) = \tilde{p}(z)/Z$. Goal: find a Gaussian approximation $q(z)$ which is centered on a mode of the distribution $p(z)$. [Figure: the same density and Gaussian approximation as on the previous slide.] Exponentiating both sides: $\tilde{p}(z) \approx \tilde{p}(z_0) \exp\!\left( -\frac{1}{2}(z - z_0)^{T} A (z - z_0) \right)$. We get a multivariate Gaussian approximation: $q(z) = \frac{|A|^{1/2}}{(2\pi)^{D/2}} \exp\!\left( -\frac{1}{2}(z - z_0)^{T} A (z - z_0) \right)$.
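
A rough numerical sketch of this construction: find a mode of $\ln \tilde{p}$, estimate the Hessian $A$ there, and use $\mathcal{N}(z_0, A^{-1})$ as $q(z)$. The general-purpose scipy optimizer and the finite-difference Hessian are choices made here for self-containment, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_approx(log_p_tilde, z_init, eps=1e-4):
    """Return (z0, cov) such that q(z) = N(z0, cov) is the Laplace approximation.

    log_p_tilde : function z -> ln p~(z), the unnormalized log density.
    """
    # Find a mode z0 by maximizing ln p~(z).
    res = minimize(lambda z: -log_p_tilde(z), z_init)
    z0 = res.x
    D = len(z0)
    # Finite-difference estimate of A = -Hessian of ln p~ at z0.
    A = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            e_i, e_j = np.eye(D)[i] * eps, np.eye(D)[j] * eps
            A[i, j] = -(log_p_tilde(z0 + e_i + e_j) - log_p_tilde(z0 + e_i - e_j)
                        - log_p_tilde(z0 - e_i + e_j) + log_p_tilde(z0 - e_i - e_j)) / (4 * eps**2)
    return z0, np.linalg.inv(A)      # q(z) = N(z0, A^{-1})
```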

  18. Laplace Approximation. Remember $p(z) = \tilde{p}(z)/Z$, where we approximate $Z = \int \tilde{p}(z)\, dz \approx \tilde{p}(z_0) \int \exp\!\left( -\frac{1}{2}(z - z_0)^{T} A (z - z_0) \right) dz = \tilde{p}(z_0) \frac{(2\pi)^{D/2}}{|A|^{1/2}}$. Bayesian inference: $P(\theta \mid \mathcal{D}) = \frac{1}{P(\mathcal{D})} P(\mathcal{D} \mid \theta)\, P(\theta)$. Identify $\tilde{p}(\theta) = P(\mathcal{D} \mid \theta)\, P(\theta)$ and $Z = P(\mathcal{D})$: • The posterior is approximately Gaussian around the MAP estimate $\theta_{\mathrm{MAP}}$: $p(\theta \mid \mathcal{D}) \approx \frac{|A|^{1/2}}{(2\pi)^{D/2}} \exp\!\left( -\frac{1}{2}(\theta - \theta_{\mathrm{MAP}})^{T} A (\theta - \theta_{\mathrm{MAP}}) \right)$.

  19. Laplace Approximation. Remember $p(z) = \tilde{p}(z)/Z$, where we approximate $Z = \int \tilde{p}(z)\, dz \approx \tilde{p}(z_0) \frac{(2\pi)^{D/2}}{|A|^{1/2}}$. Bayesian inference: $P(\theta \mid \mathcal{D}) = \frac{1}{P(\mathcal{D})} P(\mathcal{D} \mid \theta)\, P(\theta)$. Identify $\tilde{p}(\theta) = P(\mathcal{D} \mid \theta)\, P(\theta)$ and $Z = P(\mathcal{D})$: • Can approximate the model evidence $P(\mathcal{D}) = \int P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta$. • Using the Laplace approximation: $\ln P(\mathcal{D}) \approx \ln P(\mathcal{D} \mid \theta_{\mathrm{MAP}}) + \ln P(\theta_{\mathrm{MAP}}) + \frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |A|$, where the terms beyond the first form the Occam factor, which penalizes model complexity.
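
Continuing the sketch above, the log evidence under the Laplace approximation is the log likelihood at the MAP estimate plus the Occam-factor terms; using slogdet for the log determinant is an implementation choice, not from the slides.

```python
import numpy as np

def log_evidence_laplace(log_lik, log_prior, theta_map, A):
    """ln P(D) ~ ln P(D|theta_MAP) + ln P(theta_MAP) + (D/2) ln(2 pi) - (1/2) ln |A|.

    A : Hessian of the negative log posterior at theta_MAP (e.g. from laplace_approx above).
    """
    D = len(theta_map)
    _, logdet_A = np.linalg.slogdet(A)
    occam = log_prior(theta_map) + 0.5 * D * np.log(2 * np.pi) - 0.5 * logdet_A
    return log_lik(theta_map) + occam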

  20. Bayesian Information Criterion. BIC can be obtained from the Laplace approximation $\ln P(\mathcal{D}) \approx \ln P(\mathcal{D} \mid \theta_{\mathrm{MAP}}) + \ln P(\theta_{\mathrm{MAP}}) + \frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |A|$ by taking the large-sample limit ($N \to \infty$), where $N$ is the number of data points: $\ln P(\mathcal{D}) \approx \ln P(\mathcal{D} \mid \theta_{\mathrm{MAP}}) - \frac{1}{2} D \ln N$. • Quick, easy, and does not depend on the prior. • Can use the maximum likelihood estimate of $\theta$ instead of the MAP estimate. • $D$ denotes the number of "well-determined parameters". • Danger: counting parameters can be tricky (e.g. infinite models).
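
A one-line BIC computation following the formula above, written here with the maximum-likelihood fit (which, as the slide notes, can stand in for the MAP estimate); the numbers in the usage example are made up.

```python
import numpy as np

def bic(log_lik_at_fit, num_params, num_data):
    """ln P(D) ~ ln P(D | theta_hat) - (D/2) ln N, the BIC approximation to the log evidence."""
    return log_lik_at_fit - 0.5 * num_params * np.log(num_data)

# Example: compare two hypothetical models fit to N = 500 points.
print(bic(-1230.4, num_params=3, num_data=500))
print(bic(-1221.7, num_params=12, num_data=500))
```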

  21. Variational Inference. Key idea: approximate the intractable distribution $p(\theta \mid \mathcal{D})$ with a simpler, tractable distribution $q(\theta)$. We can lower bound the marginal likelihood using Jensen's inequality: $\ln p(\mathcal{D}) = \ln \int p(\mathcal{D}, \theta)\, d\theta = \ln \int q(\theta) \frac{p(\mathcal{D}, \theta)}{q(\theta)}\, d\theta \geq \int q(\theta) \ln \frac{p(\mathcal{D}, \theta)}{q(\theta)}\, d\theta = \int q(\theta) \ln p(\mathcal{D}, \theta)\, d\theta + \int q(\theta) \ln \frac{1}{q(\theta)}\, d\theta$, where the second term is the entropy functional. This variational lower bound satisfies $\mathcal{L}(q) = \ln p(\mathcal{D}) - \mathrm{KL}\!\left( q(\theta)\, \|\, p(\theta \mid \mathcal{D}) \right)$, where $\mathrm{KL}(q \| p)$ is the Kullback–Leibler divergence, a non-symmetric measure of the difference between two probability distributions $q$ and $p$. The goal of variational inference is to maximize the variational lower bound with respect to the approximate distribution $q$, or equivalently to minimize $\mathrm{KL}(q \| p)$.
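
A tiny discrete check of the identity $\mathcal{L}(q) = \ln p(\mathcal{D}) - \mathrm{KL}(q \,\|\, p(\theta \mid \mathcal{D}))$, using made-up numbers for the joint $p(\mathcal{D}, \theta)$ over three values of $\theta$.

```python
import numpy as np

# Made-up joint p(D, theta) over three discrete values of theta (D is fixed/observed).
p_joint = np.array([0.10, 0.05, 0.02])
p_D = p_joint.sum()                      # p(D) by marginalization
p_post = p_joint / p_D                   # posterior p(theta | D)

q = np.array([0.5, 0.3, 0.2])            # an arbitrary tractable approximation q(theta)

elbo = np.sum(q * np.log(p_joint)) + np.sum(q * np.log(1.0 / q))   # E_q[ln p(D,theta)] + entropy
kl = np.sum(q * np.log(q / p_post))                                 # KL(q || p(theta|D))

print(elbo, np.log(p_D) - kl)            # the two quantities agree: the bound's slack is exactly KL
```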
