

SLIDE 1

Variational Model Selection for Sparse Gaussian Process Regression

Michalis K. Titsias
School of Computer Science, University of Manchester
7 September 2008

SLIDE 2

Outline

- Gaussian process regression and sparse methods
- Variational inference based on inducing variables:
  - Auxiliary inducing variables
  - The variational bound
  - Comparison with the PP/DTC and SPGP/FITC marginal likelihoods
  - Experiments on large datasets
  - Inducing variables selected from training data
- Variational reformulation of SD, FITC and PITC
- Related work / Conclusions

SLIDE 3

Gaussian process regression

Regression with Gaussian noise

Data: {(xi, yi), i = 1, ..., n}, where xi is an input vector and yi a scalar output

Likelihood: yi = f(xi) + ε, ε ∼ N(0, σ2), so that p(y|f) = N(y|f, σ2 I), with fi = f(xi)

GP prior on f: p(f) = N(f|0, Knn), where Knn is the n × n covariance matrix on the training data, computed using a kernel that depends on θ

Hyperparameters: (σ2, θ)
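
A minimal NumPy sketch of this model (illustrative only; the squared-exponential kernel and all variable names below are assumptions, not the talk's code):

    import numpy as np

    def kernel(X1, X2, theta):
        # Squared-exponential kernel: k(x, x') = sf2 * exp(-||x - x'||^2 / (2*l2))
        sf2, l2 = theta
        d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
        return sf2 * np.exp(-0.5 * d2 / l2)

    # Prior p(f) = N(f | 0, Knn) and likelihood p(y | f) = N(y | f, sigma2*I)
    n, theta, sigma2 = 50, (1.0, 0.5), 0.1
    X = 6.0 * np.random.rand(n, 1)
    Knn = kernel(X, X, theta)
    f = np.random.multivariate_normal(np.zeros(n), Knn + 1e-8 * np.eye(n))
    y = f + np.sqrt(sigma2) * np.random.randn(n)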

SLIDE 4

Gaussian process regression

Type II maximum likelihood (ML-II) inference and learning

Prediction: assume the hyperparameters (σ2, θ) are known. Infer the latent values f∗ at test inputs X∗:

p(f∗|y) = ∫ p(f∗|f) p(f|y) df

where p(f∗|f) is the test conditional and p(f|y) the posterior over the training latent values

Learning (σ2, θ): maximize the marginal likelihood

p(y) = ∫ p(y|f) p(f) df = N(y|0, σ2 I + Knn)

Time complexity is O(n3)
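
In code, the O(n3) cost is the Cholesky factorization of σ2 I + Knn (a sketch continuing the assumptions above):

    def log_marginal_likelihood(y, Knn, sigma2):
        # log p(y) = log N(y | 0, sigma2*I + Knn); the Cholesky is the O(n^3) step
        n = y.shape[0]
        L = np.linalg.cholesky(sigma2 * np.eye(n) + Knn)
        a = np.linalg.solve(L, y)
        return -0.5 * (n * np.log(2 * np.pi)
                       + 2 * np.sum(np.log(np.diag(L)))
                       + a @ a)

ML-II then maximizes this quantity over (σ2, θ), typically with a gradient-based optimizer.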

SLIDE 5

Sparse GP regression

Time complexity is O(n3): exact prediction and training are intractable for large datasets. We can compute neither the predictive distribution p(f∗|y) nor the marginal likelihood p(y)

Approximate/sparse methods:
- Subset of data: keep only m training points; complexity O(m3)
- Inducing/active/support variables: complexity O(nm2)
- Other methods: iterative solvers for linear systems

SLIDE 6

Sparse GP regression using inducing variables

Inducing variables:
- Subset of training points (Csato and Opper, 2002; Seeger et al., 2003; Smola and Bartlett, 2001)
- Test points (BCM; Tresp, 2000)
- Auxiliary variables (Snelson and Ghahramani, 2006; Quiñonero-Candela and Rasmussen, 2005)

Training the sparse GP regression system:
- Select inducing inputs
- Select hyperparameters (σ2, θ)

Which objective function is going to do all that? The approximate marginal likelihood. But which approximate marginal likelihood?

SLIDE 7

Sparse GP regression using inducing variables

Approximate marginal likelihoods currently in use are derived by changing/approximating the likelihood p(y|f) or by changing/approximating the prior p(f) (Quiñonero-Candela and Rasmussen, 2005)

They all have the form FP = log N(y|0, K̃), where K̃ is some approximation to the true covariance σ2 I + Knn

Overfitting can often occur:
- The approximate marginal likelihood is not a lower bound
- Joint learning of the inducing inputs and hyperparameters easily leads to overfitting

SLIDE 8

Sparse GP regression using inducing variables

What we wish to do here: do model selection in a different way
- Never think about approximating the likelihood p(y|f) or the prior p(f)
- Apply standard variational inference: just introduce a variational distribution to approximate the true posterior
- That will give us a lower bound
- We will propose the bound for model selection, jointly handling inducing inputs and hyperparameters

SLIDE 9

Auxiliary inducing variables (Snelson and Ghahramani, 2006)

Auxiliary inducing variables: m latent function values fm associated with arbitrary inputs Xm

Model augmentation: we augment the GP prior:
prior: p(f, fm) = p(f|fm) p(fm)
joint: p(y|f) p(f|fm) p(fm)
marginal likelihood: p(y) = ∫ p(y|f) p(f|fm) p(fm) df dfm

The model is unchanged! The predictive distribution and the marginal likelihood are the same. The parameters Xm play no active role (at the moment)... and there is no fear of overfitting when we specify Xm

SLIDE 10

Auxiliary inducing variables

What we wish: to use the auxiliary variables (fm, Xm) to facilitate inference about the training function values f

Before we get there, let's specify the ideal inducing variables

Definition: we call (fm, Xm) optimal when y and f are conditionally independent given fm:
p(f|fm, y) = p(f|fm)

At optimality, the augmented true posterior p(f, fm|y) factorizes as
p(f, fm|y) = p(f|fm) p(fm|y)

SLIDE 11

Auxiliary inducing variables

What we wish: to use the auxiliary variables (fm, Xm) to facilitate inference about the training function values f

Question: how can we discover optimal inducing variables?

Answer: minimize a distance between the true posterior p(f, fm|y) and an approximation q(f, fm) with respect to Xm and (optionally) the number m

The key: q(f, fm) must satisfy the factorization that holds for optimal inducing variables:
true: p(f, fm|y) = p(f|fm, y) p(fm|y)
approximate: q(f, fm) = p(f|fm) φ(fm)

SLIDE 12

Variational learning of inducing variables

Variational distribution: q(f, fm) = p(f|fm) φ(fm), where φ(fm) is an unconstrained variational distribution over fm

Standard variational inference: we minimize the divergence KL(q(f, fm) || p(f, fm|y))

Equivalently, we maximize a lower bound on the true log marginal likelihood:

FV(Xm, φ(fm)) = ∫ q(f, fm) log [ p(y|f) p(f|fm) p(fm) / q(f, fm) ] df dfm

Let's compute this

SLIDE 13

Computation of the variational bound

FV(Xm, φ(fm)) = ∫ p(f|fm) φ(fm) log [ p(y|f) p(f|fm) p(fm) / (p(f|fm) φ(fm)) ] df dfm

= ∫ p(f|fm) φ(fm) log [ p(y|f) p(fm) / φ(fm) ] df dfm

= ∫ φ(fm) { ∫ p(f|fm) log p(y|f) df + log [ p(fm) / φ(fm) ] } dfm

= ∫ φ(fm) { log G(fm, y) + log [ p(fm) / φ(fm) ] } dfm

where
log G(fm, y) = log N(y | E[f|fm], σ2 I) − (1/(2σ2)) Tr[Cov(f|fm)]
E[f|fm] = Knm Kmm⁻¹ fm,  Cov(f|fm) = Knn − Knm Kmm⁻¹ Kmn

SLIDE 14

Computation of the variational bound

Merge the logs:

FV(Xm, φ(fm)) = ∫ φ(fm) log [ G(fm, y) p(fm) / φ(fm) ] dfm

Reverse Jensen's inequality to maximize with respect to φ(fm):

FV(Xm) = log ∫ G(fm, y) p(fm) dfm
= log ∫ N(y | E[f|fm], σ2 I) p(fm) dfm − (1/(2σ2)) Tr[Cov(f|fm)]
= log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn) − (1/(2σ2)) Tr[Cov(f|fm)]

where Cov(f|fm) = Knn − Knm Kmm⁻¹ Kmn
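
A direct NumPy transcription of this bound (a sketch continuing the earlier snippets; it forms Qnn = Knm Kmm⁻¹ Kmn densely for clarity, whereas a practical O(nm2) implementation would use the matrix inversion lemma):

    def variational_bound(y, Knn, Knm, Kmm, sigma2):
        # F_V = log N(y | 0, sigma2*I + Qnn) - Tr[Knn - Qnn] / (2*sigma2)
        n, m = Knm.shape
        Lm = np.linalg.cholesky(Kmm + 1e-8 * np.eye(m))
        A = np.linalg.solve(Lm, Knm.T)            # A.T @ A == Qnn
        Qnn = A.T @ A
        L = np.linalg.cholesky(sigma2 * np.eye(n) + Qnn)
        a = np.linalg.solve(L, y)
        log_gauss = -0.5 * (n * np.log(2 * np.pi)
                            + 2 * np.sum(np.log(np.diag(L)))
                            + a @ a)
        return log_gauss - (np.trace(Knn) - np.trace(Qnn)) / (2.0 * sigma2)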

SLIDE 15

Variational bound versus PP log likelihood

The traditional projected process (PP or DTC) log likelihood is

FP = log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn)

What we obtained is

FV = log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn) − (1/(2σ2)) Tr[Knn − Knm Kmm⁻¹ Kmn]

We got an extra trace term (the total variance of p(f|fm))

SLIDE 16

Optimal φ∗(fm) and predictive distribution

The optimal φ∗(fm) that corresponds to the above bound gives rise to the PP predictive distribution (Csato and Opper, 2002; Seeger, Williams and Lawrence, 2003)

The approximate predictive distribution is identical to that of PP

SLIDE 17

Variational bound for model selection

Learning the inducing inputs Xm and (σ2, θ) using continuous optimization

Maximize the bound with respect to (Xm, σ2, θ):

FV = log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn) − (1/(2σ2)) Tr[Knn − Knm Kmm⁻¹ Kmn]

The first term encourages fitting the data y
The second (trace) term says to minimize the total variance of p(f|fm)
The trace Tr[Knn − Knm Kmm⁻¹ Kmn] can stand on its own as an objective function for sparse GP learning

SLIDE 18

Variational bound for model selection

When the bound becomes equal to the true log marginal likelihood, i.e. FV = log p(y), then:

Tr[Knn − Knm Kmm⁻¹ Kmn] = 0, i.e. Knn = Knm Kmm⁻¹ Kmn

p(f|fm) becomes a delta function, and we can reproduce the full/exact GP prediction
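
A quick numerical check of this limit, reusing the kernel, log_marginal_likelihood and variational_bound sketches from earlier slides: taking Xm to be all the training inputs gives Qnn = Knn, the trace vanishes, and the bound is tight (up to jitter-level numerical error).

    # With Xm = X we have Knm = Kmm = Knn, so Qnn = Knn up to jitter:
    Knn = kernel(X, X, theta)
    full = log_marginal_likelihood(y, Knn, sigma2)
    bound = variational_bound(y, Knn, Knn, Knn, sigma2)
    assert np.isclose(full, bound)   # F_V = log p(y) when the trace is zero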

SLIDE 19

Illustrative comparison on Ed Snelson’s toy data

[Figure: Ed Snelson's one-dimensional toy dataset]

We compare the traditional PP/DTC log likelihood

FP = log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn)

and the bound

FV = log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn) − (1/(2σ2)) Tr[Knn − Knm Kmm⁻¹ Kmn]

We will jointly maximize over (Xm, σ2, θ)

SLIDE 20

Illustrative comparison

200 training points; the red line is the full GP, the blue line the sparse GP. We used 8, 10 and 15 inducing points

[Figure: fits with 8, 10 and 15 inducing points (columns); VAR (top row) versus PP (bottom row)]

SLIDE 21

Illustrative comparison

Squared-exponential kernel: σf2 exp(−(xm − xn)2 / (2ℓ2))

Table: Model parameters found by variational training

            8         10        15        full GP
    ℓ2      0.5050    0.4327    0.3573    0.3561
    σf2     0.5736    0.6820    0.6854    0.6833
    σ2      0.0859    0.0817    0.0796    0.0796
    MargL  −63.5282  −57.6909  −55.5708  −55.5647

There is a pattern here (observed in many datasets): the noise σ2 decreases with the number of inducing points until the full GP is matched. This is desirable: the method prefers to explain some signal as noise when the number of inducing variables is not enough

SLIDE 22

Illustrative comparison

A more challenging problem: from the original 200 training points keep only 20 (using the MATLAB command X = X(1 : 10 : end))

[Figure: the full 200-point dataset and the 20-point subset]

SLIDE 23

Illustrative comparison

[Figure: fits with 8, 10 and 15 inducing points (columns); VAR (top row) versus PP (bottom row)]

SLIDE 24

Illustrative comparison

Squared-exponential kernel: σf2 exp(−(xm − xn)2 / (2ℓ2))

Table: Model parameters found by variational training

            8         10        15        full GP
    ℓ2      0.2621    0.2808    0.1804    0.1798
    σf2     0.3721    0.5334    0.5209    0.5209
    σ2      0.1163    0.0846    0.0647    0.0646
    MargL  −16.0995  −14.8373  −14.3473  −14.3461

Table: Model parameters found by the PP marginal likelihood

            8         10        15        full GP
    ℓ2      0.0766    0.0632    0.0593    0.1798
    σf2     1.0846    1.1353    1.1939    0.5209
    σ2      0.0536    0.0589    0.0531    0.0646
    MargL   −8.7969   −8.3492   −8.0989  −14.3461

SLIDE 25

Variational bound compared to PP likelihood

- The variational method converges to the full GP model in a systematic way as we increase the number of inducing variables
- It tends to find smoother predictive distributions than the full GP (the decreasing-σ2 pattern) when the number of inducing variables is not enough
- The PP marginal likelihood will not converge to the full GP as we increase the number of inducing inputs and maximize over them
- PP tends to interpolate the training examples

SLIDE 26

SPGP/FITC marginal likelihood (Snelson and Ghahramani, 2006)

SPGP uses the following marginal likelihood:

FSPGP = log N(y | 0, σ2 I + diag[Knn − Knm Kmm⁻¹ Kmn] + Knm Kmm⁻¹ Kmn)

- The covariance used is closer to the true covariance σ2 I + Knn than that of PP
- SPGP uses a non-stationary covariance matrix that can model input-dependent noise
- SPGP is significantly better for model selection than the PP marginal likelihood (Snelson and Ghahramani, 2006; Snelson, 2007)
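
A sketch of this likelihood in the same NumPy style (dense for clarity; gauss_logpdf is an assumed helper, reused again on a later slide):

    def gauss_logpdf(y, C):
        # log N(y | 0, C) via a Cholesky factorization
        L = np.linalg.cholesky(C)
        a = np.linalg.solve(L, y)
        return -0.5 * (len(y) * np.log(2 * np.pi)
                       + 2 * np.sum(np.log(np.diag(L))) + a @ a)

    def fitc_log_marginal(y, Knn, Knm, Kmm, sigma2):
        m = Kmm.shape[0]
        A = np.linalg.solve(np.linalg.cholesky(Kmm + 1e-8 * np.eye(m)), Knm.T)
        Qnn = A.T @ A                    # Knm Kmm^{-1} Kmn
        Lam = sigma2 * np.eye(len(y)) + np.diag(np.diag(Knn - Qnn))
        return gauss_logpdf(y, Lam + Qnn)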

SLIDE 27

SPGP/FITC marginal likelihood on toy data

First row: 200 training points; second row: 20 training points; columns: 8, 10 and 15 inducing points

[Figure: SPGP/FITC fits on the toy data]

SLIDE 28

SPGP/FITC on toy data

Model parameters found by the SPGP/FITC marginal likelihood

Table: 200 training points

            8         10        15        full GP
    ℓ2      0.2531    0.3260    0.3096    0.3561
    σf2     0.3377    0.7414    0.6761    0.6833
    σ2      0.0586    0.0552    0.0674    0.0796
    MargL  −56.4397  −50.3789  −52.7890  −55.5647

Table: 20 training points

            8         10        15        full GP
    ℓ2      0.2622    0.2664    0.1657    0.1798
    σf2     0.5976    0.6489    0.5419    0.5209
    σ2      0.0046    0.0065    0.0008    0.0646
    MargL  −11.8439  −11.8636  −11.4308  −14.3461

SLIDE 29

SPGP/FITC marginal likelihood

- It can be much more robust to overfitting than PP; still, joint learning of inducing points and hyperparameters can cause overfitting
- It is able to model input-dependent noise: a great advantage in terms of performance measures that involve the predictive variance (like average negative log probability density)
- It will not converge to the full GP as we increase the number of inducing points and optimize over them

SLIDE 30

Boston-housing dataset

13 inputs, 455 training points, 51 test points. Optimizing only over the inducing inputs Xm; (σ2, θ) fixed to the values obtained from the full GP

[Figure: KL(p||q) and KL(q||p) between the full GP predictive distribution (a 51-dimensional Gaussian) and the sparse ones, and the log marginal likelihood, against the number of inducing variables, for VAR, PP and SPGP]

Only the variational method drops the KLs to zero

SLIDE 31

Boston-housing dataset

Joint learning of inducing inputs and hyperparameters

[Figure: standardised mean squared error (SMSE), standardized negative log probability density (SNLP) and the log marginal likelihood against the number of inducing points, for full GP, VAR, PP and SPGP]

With 250 inducing points the variational method is very close to the full GP

SLIDE 32

Large datasets

Two large datasets:
- kin40k: 10,000 training points, 30,000 test points, 8 attributes, http://ida.first.fraunhofer.de/∼anton/data.html
- sarcos: 44,484 training points, 4,449 test points, 21 attributes, http://www.gaussianprocess.org/gpml/data/

The inputs were normalized to have zero mean and unit variance on the training set, and the outputs were centered so as to have zero mean on the training set

SLIDE 33

kin40k

Joint learning of inducing points and hyperparameters. The subset of data (SD) method uses 2000 training points

[Figure: standardised mean squared error (SMSE) and standardized negative log probability density (SNLP) against the number of inducing points, for SD, VAR, PP and SPGP]

SLIDE 34

sarcos

Joint learning of inducing points and hyperparameters. The subset of data (SD) method uses 2000 training points

[Figure: standardised mean squared error (SMSE) and standardized negative log probability density (SNLP) against the number of inducing points, for SD, VAR, PP and SPGP]

SLIDE 35

Variational bound for greedy model selection

Inducing inputs Xm selected from the training set

Let m ⊂ {1, ..., n} be the indices of the subset of data used as inducing/active variables; n − m denotes the remaining training points

Optimal active latent values fm satisfy:
p(f|y) = p(fn−m|fm, yn−m) p(fm|y) = p(fn−m|fm) p(fm|y)

Variational distribution: q(f) = p(fn−m|fm) φ(fm)

Variational bound:
FV = log N(y | 0, σ2 I + Knm Kmm⁻¹ Kmn) − (1/(2σ2)) Tr[Cov(fn−m|fm)]

SLIDE 36

Variational bound for greedy model selection

Greedy selection with hyperparameter adaption (Seeger et al., 2003):

1. Initialization: m = ∅, n − m = {1, ..., n}
2. Point insertion and adaption:
   - E-like step: add j ∈ J ⊂ n − m into m so that a criterion ∆j is maximised
   - M-like step: update (σ2, θ) by maximizing the approximate marginal likelihood
3. Go to step 2 or stop

For the PP marginal likelihood this is problematic: convergence is non-smooth, and the algorithm is not an EM. The variational bound solves this problem: the above procedure becomes precisely a variational EM algorithm (see the sketch below)
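
A sketch of the greedy loop (illustrative assumptions: the criterion ∆j is taken to be the bound improvement itself, the M-like step is left as a stub, and kernel/variational_bound are the earlier sketches):

    def greedy_select(X, y, theta, sigma2, m_max, cand_per_step=50):
        Knn = kernel(X, X, theta)
        active, remaining = [], list(range(len(y)))
        for _ in range(m_max):
            # E-like step: insert the candidate j that most increases the bound
            J = np.random.choice(remaining,
                                 min(cand_per_step, len(remaining)),
                                 replace=False)
            def bound_with(j):
                idx = active + [int(j)]
                return variational_bound(y, Knn, Knn[:, idx],
                                         Knn[np.ix_(idx, idx)], sigma2)
            j_best = int(max(J, key=bound_with))
            active.append(j_best)
            remaining.remove(j_best)
            # M-like step (stub): update (sigma2, theta) by maximizing the same
            # bound, then recompute Knn; Proposition 1 (next slide) guarantees
            # the bound never decreases at the insertion step.
        return active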

SLIDE 37

Variational bound for greedy model selection

The variational EM property follows from Proposition 1

Proposition 1: Let (m, Xm, fm) be the current set of active points. Adding any training point i ∈ n − m into the active set can never decrease the lower bound.

In other words: inserting a point can never increase the divergence KL(q(f)||p(f|y))

E-step (point insertion): corresponds to an update of the variational distribution q(f) = p(fn−m|fm) φ(fm)
M-step: updates the parameters by maximizing the bound

Monotonic increase of the variational bound is guaranteed for any possible criterion ∆

SLIDE 38

Variational formulation for sparse GP regression

Define a full GP regression model

Define a variational distribution of the form q(f, fm) = p(f|fm) φ(fm)

Get the approximate predictive distribution:
true: p(f∗|y) = ∫ p(f∗|f, fm) p(f, fm|y) df dfm
approximate: q(f∗|y) = ∫ p(f∗|fm) p(f|fm) φ(fm) df dfm = ∫ p(f∗|fm) φ(fm) dfm

Compute the bound and use it for model selection

Regarding the predictive distribution, what differentiates SD, PP/DTC, FITC and PITC is the φ(fm) distribution
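
For the variational/PP choice, the optimal φ∗(fm) is Gaussian and q(f∗|y) is available in closed form; a sketch in the earlier NumPy style (Ksm and Kss_diag, the test-inducing kernel matrix and the diagonal of the test covariance, are assumed inputs):

    def pp_predict(y, Knm, Kmm, Ksm, Kss_diag, sigma2):
        # q(f*|y) = N(Ksm S Kmn y / sigma2,
        #             Kss - Ksm Kmm^{-1} Kms + Ksm S Kms)
        # with S = (Kmm + Kmn Knm / sigma2)^{-1}
        m = Kmm.shape[0]
        L = np.linalg.cholesky(Kmm + Knm.T @ Knm / sigma2 + 1e-8 * np.eye(m))
        A = np.linalg.solve(L, Ksm.T)            # A.T @ A == Ksm S Kms
        mean = A.T @ np.linalg.solve(L, Knm.T @ y) / sigma2
        Lm = np.linalg.cholesky(Kmm + 1e-8 * np.eye(m))
        B = np.linalg.solve(Lm, Ksm.T)           # B.T @ B == Ksm Kmm^{-1} Kms
        var = Kss_diag - np.sum(B**2, axis=0) + np.sum(A**2, axis=0)
        return mean, var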

SLIDE 39

Variational bound for FITC (similarly for PITC)

The full GP model that variationally reformulates FITC models input-dependent noise:

p(y|f) = N(y | f, σ2 I + diag[Knn − Knm Kmm⁻¹ Kmn])

FITC log marginal likelihood:

FSPGP(Xm) = log N(y | 0, Λ + Knm Kmm⁻¹ Kmn), where Λ = σ2 I + diag[Knn − Knm Kmm⁻¹ Kmn]

The corresponding variational bound:

FV(Xm) = log N(y | 0, Λ + Knm Kmm⁻¹ Kmn) − (1/2) Tr[Λ⁻¹ K̃], where K̃ = Knn − Knm Kmm⁻¹ Kmn

Again a trace term is added
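
The corresponding change to the earlier fitc_log_marginal sketch (reusing gauss_logpdf; since Λ is diagonal, the trace only needs the diagonal of K̃):

    def fitc_variational_bound(y, Knn, Knm, Kmm, sigma2):
        m = Kmm.shape[0]
        A = np.linalg.solve(np.linalg.cholesky(Kmm + 1e-8 * np.eye(m)), Knm.T)
        Qnn = A.T @ A
        ktilde_diag = np.diag(Knn - Qnn)     # diag of Knn - Knm Kmm^{-1} Kmn
        lam = sigma2 + ktilde_diag           # diagonal of Lambda
        F_spgp = gauss_logpdf(y, np.diag(lam) + Qnn)
        return F_spgp - 0.5 * np.sum(ktilde_diag / lam)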

SLIDE 40

Related work/Conclusion

Related work:
- There is an unpublished draft by Lehel Csato and Manfred Opper on variational learning of hyperparameters in sparse GPs
- Seeger (2003) also uses variational methods for sparse GP classification problems

Conclusions:
- The variational method can provide us with lower bounds
- This can be very useful for joint learning of inducing inputs and hyperparameters
- Future extensions: classification, differential equations

SLIDE 41

Acknowledgements

Thanks for feedback to: Neil Lawrence, Magnus Rattray, Chris Williams, Joaquin Quiñonero-Candela, Ed Snelson, Manfred Opper, Mauricio Alvarez and Kevin Sharp