  1. Variational Model Selection for Sparse Gaussian Process Regression
     Michalis K. Titsias
     School of Computer Science, University of Manchester
     7 September 2008

  2. Outline
     - Gaussian process regression and sparse methods
     - Variational inference based on inducing variables
     - Auxiliary inducing variables
     - The variational bound
     - Comparison with the PP/DTC and SPGP/FITC marginal likelihoods
     - Experiments on large datasets
     - Inducing variables selected from the training data
     - Variational reformulation of SD, FITC and PITC
     - Related work / conclusions

  3. Gaussian process regression
     Regression with Gaussian noise.
     - Data: $\{(\mathbf{x}_i, y_i),\ i = 1, \dots, n\}$, where $\mathbf{x}_i$ is a vector and $y_i$ a scalar
     - Likelihood: $y_i = f(\mathbf{x}_i) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so $p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 I)$, where $f_i = f(\mathbf{x}_i)$
     - GP prior on $\mathbf{f}$: $p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K_{nn})$, where $K_{nn}$ is the $n \times n$ covariance matrix on the training data, computed using a kernel that depends on $\theta$
     - Hyperparameters: $(\sigma^2, \theta)$
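The slides do not commit to a particular kernel; below is a minimal numpy sketch of how $K_{nn}$ could be built, assuming the common squared-exponential kernel with $\theta$ = (signal variance, lengthscale). The helper name se_kernel and the toy data are illustrative, not from the talk.

```python
import numpy as np

def se_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between two input sets.
    X1: (n1, d), X2: (n2, d); returns an (n1, n2) covariance matrix."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

# K_nn: covariance of the latent values f at the n training inputs X
X = np.random.randn(50, 2)   # toy inputs, n = 50, d = 2
K_nn = se_kernel(X, X)       # depends on theta = (variance, lengthscale)
```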

  4. Gaussian process regression
     Maximum likelihood II inference and learning.
     - Prediction (assume the hyperparameters $(\sigma^2, \theta)$ are known): infer the latent values $\mathbf{f}_*$ at test inputs $X_*$ from
       $p(\mathbf{f}_* \mid \mathbf{y}) = \int_{\mathbf{f}} p(\mathbf{f}_* \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{y})\, d\mathbf{f}$,
       where $p(\mathbf{f}_* \mid \mathbf{f})$ is the test conditional and $p(\mathbf{f} \mid \mathbf{y})$ the posterior on the training latent values
     - Learning $(\sigma^2, \theta)$: maximize the marginal likelihood
       $p(\mathbf{y}) = \int_{\mathbf{f}} p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f})\, d\mathbf{f} = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \sigma^2 I + K_{nn})$
     - Time complexity is $O(n^3)$
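A small sketch of the exact computation above, reusing the hypothetical se_kernel helper from the previous sketch; the Cholesky factorization is the $O(n^3)$ step that motivates the sparse methods on the next slide.

```python
import numpy as np

def gp_log_marginal_likelihood(X, y, sigma2, kernel):
    """Exact log p(y) = log N(y | 0, sigma2*I + K_nn), via a Cholesky factorization."""
    n = X.shape[0]
    K = kernel(X, X) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K)                        # K = L L^T, the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))             # -0.5 * log|K|
            - 0.5 * n * np.log(2.0 * np.pi))

# Example (reusing the toy X and se_kernel from the sketch above):
# y = np.random.randn(50)
# print(gp_log_marginal_likelihood(X, y, sigma2=0.1, kernel=se_kernel))
```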

  5. Sparse GP regression
     - Time complexity is $O(n^3)$: intractability for large datasets
     - Exact prediction and training are intractable: we can compute neither the predictive distribution $p(\mathbf{f}_* \mid \mathbf{y})$ nor the marginal likelihood $p(\mathbf{y})$
     - Approximate/sparse methods:
       - Subset of data: keep only $m$ training points; complexity is $O(m^3)$
       - Inducing/active/support variables: complexity $O(nm^2)$
       - Other methods: iterative methods for linear systems

  6. Sparse GP regression using inducing variables
     Inducing variables can be:
     - a subset of the training points (Csato and Opper, 2002; Seeger et al., 2003; Smola and Bartlett, 2001)
     - test points (BCM; Tresp, 2000)
     - auxiliary variables (Snelson and Ghahramani, 2006; Quiñonero-Candela and Rasmussen, 2005)
     Training the sparse GP regression system requires:
     - selecting the inducing inputs
     - selecting the hyperparameters $(\sigma^2, \theta)$
     Which objective function is going to do all that? The approximate marginal likelihood. But which approximate marginal likelihood?

  7. Sparse GP regression using inducing variables
     Approximate marginal likelihoods currently used are derived
     - by changing/approximating the likelihood $p(\mathbf{y} \mid \mathbf{f})$, or
     - by changing/approximating the prior $p(\mathbf{f})$ (Quiñonero-Candela and Rasmussen, 2005)
     All have the form $F_P = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \widetilde{K})$, where $\widetilde{K}$ is some approximation to the true covariance $\sigma^2 I + K_{nn}$.
     Overfitting can often occur:
     - the approximate marginal likelihood is not a lower bound
     - joint learning of the inducing inputs and hyperparameters easily leads to overfitting
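For concreteness, the PP/DTC and SPGP/FITC covariances compared later in the talk can be written in this form. The expressions below follow the unifying framework of Quiñonero-Candela and Rasmussen (2005); they are not spelled out on this slide, so treat them as background rather than part of the original deck:

$$
Q_{nn} = K_{nm} K_{mm}^{-1} K_{mn}, \qquad
\widetilde{K}_{\mathrm{PP/DTC}} = \sigma^2 I + Q_{nn}, \qquad
\widetilde{K}_{\mathrm{SPGP/FITC}} = \sigma^2 I + Q_{nn} + \mathrm{diag}\!\big[K_{nn} - Q_{nn}\big].
$$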

  8. Sparse GP regression using inducing variables
     What we wish to do here:
     - Do model selection in a different way: never think about approximating the likelihood $p(\mathbf{y} \mid \mathbf{f})$ or the prior $p(\mathbf{f})$
     - Apply standard variational inference: just introduce a variational distribution to approximate the true posterior; that will give us a lower bound
     - Propose the bound for model selection, jointly handling the inducing inputs and hyperparameters

  9. Auxiliary inducing variables (Snelson and Ghahramani, 2006)
     - Auxiliary inducing variables: $m$ latent function values $\mathbf{f}_m$ associated with arbitrary inputs $X_m$
     - Model augmentation: we augment the GP prior
       - prior: $p(\mathbf{f}, \mathbf{f}_m) = p(\mathbf{f} \mid \mathbf{f}_m)\, p(\mathbf{f}_m)$
       - joint: $p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{f}_m)\, p(\mathbf{f}_m)$
       - marginal likelihood: $\int_{\mathbf{f}, \mathbf{f}_m} p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{f}_m)\, p(\mathbf{f}_m)\, d\mathbf{f}\, d\mathbf{f}_m$
     - The model is unchanged! The predictive distribution and the marginal likelihood are the same
     - The parameters $X_m$ play no active role (at the moment), so there is no fear of overfitting when we specify $X_m$

  10. Auxiliary inducing variables
      What we wish: to use the auxiliary variables $(\mathbf{f}_m, X_m)$ to facilitate inference about the training function values $\mathbf{f}$.
      Before we get there, let's specify the ideal inducing variables.
      - Definition: we call $(\mathbf{f}_m, X_m)$ optimal when $\mathbf{y}$ and $\mathbf{f}$ are conditionally independent given $\mathbf{f}_m$:
        $p(\mathbf{f} \mid \mathbf{f}_m, \mathbf{y}) = p(\mathbf{f} \mid \mathbf{f}_m)$
      - At optimality, the augmented true posterior $p(\mathbf{f}, \mathbf{f}_m \mid \mathbf{y})$ factorizes as
        $p(\mathbf{f}, \mathbf{f}_m \mid \mathbf{y}) = p(\mathbf{f} \mid \mathbf{f}_m)\, p(\mathbf{f}_m \mid \mathbf{y})$

  11. Auxiliary inducing variables
      What we wish: to use the auxiliary variables $(\mathbf{f}_m, X_m)$ to facilitate inference about the training function values $\mathbf{f}$.
      - Question: how can we discover optimal inducing variables?
      - Answer: minimize a distance between the true posterior $p(\mathbf{f}, \mathbf{f}_m \mid \mathbf{y})$ and an approximation $q(\mathbf{f}, \mathbf{f}_m)$, with respect to $X_m$ and (optionally) the number $m$
      - The key: $q(\mathbf{f}, \mathbf{f}_m)$ must satisfy the factorization that holds for optimal inducing variables:
        - true: $p(\mathbf{f}, \mathbf{f}_m \mid \mathbf{y}) = p(\mathbf{f} \mid \mathbf{f}_m, \mathbf{y})\, p(\mathbf{f}_m \mid \mathbf{y})$
        - approximate: $q(\mathbf{f}, \mathbf{f}_m) = p(\mathbf{f} \mid \mathbf{f}_m)\, \phi(\mathbf{f}_m)$

  12. Variational learning of inducing variables
      - Variational distribution: $q(\mathbf{f}, \mathbf{f}_m) = p(\mathbf{f} \mid \mathbf{f}_m)\, \phi(\mathbf{f}_m)$, where $\phi(\mathbf{f}_m)$ is an unconstrained variational distribution over $\mathbf{f}_m$
      - Standard variational inference: we minimize the divergence $\mathrm{KL}\big(q(\mathbf{f}, \mathbf{f}_m)\,\|\,p(\mathbf{f}, \mathbf{f}_m \mid \mathbf{y})\big)$
      - Equivalently, we maximize a lower bound on the true log marginal likelihood:
$$
F_V(X_m, \phi(\mathbf{f}_m)) = \int_{\mathbf{f}, \mathbf{f}_m} q(\mathbf{f}, \mathbf{f}_m) \log \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{f}_m)\, p(\mathbf{f}_m)}{q(\mathbf{f}, \mathbf{f}_m)}\, d\mathbf{f}\, d\mathbf{f}_m
$$
      - Let's compute this
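The equivalence between minimizing this KL divergence and maximizing $F_V$ is the standard variational identity; it is a routine step not written out on the slide:

$$
\log p(\mathbf{y}) = F_V(X_m, \phi(\mathbf{f}_m)) + \mathrm{KL}\big(q(\mathbf{f}, \mathbf{f}_m)\,\|\,p(\mathbf{f}, \mathbf{f}_m \mid \mathbf{y})\big) \;\ge\; F_V(X_m, \phi(\mathbf{f}_m)),
$$

since the KL term is non-negative. Maximizing $F_V$ over $\phi(\mathbf{f}_m)$ and $X_m$ therefore minimizes the divergence, and $F_V$ is a genuine lower bound on $\log p(\mathbf{y})$.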

  13. Computation of the variational bound
$$
\begin{aligned}
F_V(X_m, \phi(\mathbf{f}_m))
&= \int_{\mathbf{f}, \mathbf{f}_m} p(\mathbf{f} \mid \mathbf{f}_m)\, \phi(\mathbf{f}_m) \log \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{f}_m)\, p(\mathbf{f}_m)}{p(\mathbf{f} \mid \mathbf{f}_m)\, \phi(\mathbf{f}_m)}\, d\mathbf{f}\, d\mathbf{f}_m \\
&= \int_{\mathbf{f}, \mathbf{f}_m} p(\mathbf{f} \mid \mathbf{f}_m)\, \phi(\mathbf{f}_m) \log \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f}_m)}{\phi(\mathbf{f}_m)}\, d\mathbf{f}\, d\mathbf{f}_m \\
&= \int_{\mathbf{f}_m} \phi(\mathbf{f}_m) \left\{ \int_{\mathbf{f}} p(\mathbf{f} \mid \mathbf{f}_m) \log p(\mathbf{y} \mid \mathbf{f})\, d\mathbf{f} + \log \frac{p(\mathbf{f}_m)}{\phi(\mathbf{f}_m)} \right\} d\mathbf{f}_m \\
&= \int_{\mathbf{f}_m} \phi(\mathbf{f}_m) \left\{ \log G(\mathbf{f}_m, \mathbf{y}) + \log \frac{p(\mathbf{f}_m)}{\phi(\mathbf{f}_m)} \right\} d\mathbf{f}_m,
\end{aligned}
$$
      where
$$
\log G(\mathbf{f}_m, \mathbf{y}) = \log \mathcal{N}\big(\mathbf{y} \mid \mathbb{E}[\mathbf{f} \mid \mathbf{f}_m], \sigma^2 I\big) - \frac{1}{2\sigma^2} \mathrm{Tr}\big[\mathrm{Cov}(\mathbf{f} \mid \mathbf{f}_m)\big],
$$
$$
\mathbb{E}[\mathbf{f} \mid \mathbf{f}_m] = K_{nm} K_{mm}^{-1} \mathbf{f}_m, \qquad
\mathrm{Cov}(\mathbf{f} \mid \mathbf{f}_m) = K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}.
$$
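The inner integral over $\mathbf{f}$ is the expected log-likelihood under the conditional prior; the closed form of $\log G(\mathbf{f}_m, \mathbf{y})$ follows from a routine Gaussian identity that is not spelled out on the slide. Writing $\boldsymbol{\mu} = \mathbb{E}[\mathbf{f} \mid \mathbf{f}_m]$ and $\Sigma = \mathrm{Cov}(\mathbf{f} \mid \mathbf{f}_m)$:

$$
\int_{\mathbf{f}} p(\mathbf{f} \mid \mathbf{f}_m) \log \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma^2 I)\, d\mathbf{f}
= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\,\mathbb{E}\big[\|\mathbf{y} - \mathbf{f}\|^2\big]
= \log \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}, \sigma^2 I) - \frac{1}{2\sigma^2}\mathrm{Tr}[\Sigma],
$$

using $\mathbb{E}\big[\|\mathbf{y} - \mathbf{f}\|^2\big] = \|\mathbf{y} - \boldsymbol{\mu}\|^2 + \mathrm{Tr}[\Sigma]$ for $\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$.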

  14. Computation of the variational bound
      Merge the logs:
$$
F_V(X_m, \phi(\mathbf{f}_m)) = \int_{\mathbf{f}_m} \phi(\mathbf{f}_m) \log \frac{G(\mathbf{f}_m, \mathbf{y})\, p(\mathbf{f}_m)}{\phi(\mathbf{f}_m)}\, d\mathbf{f}_m
$$
      Reverse Jensen's inequality to maximize with respect to $\phi(\mathbf{f}_m)$:
$$
\begin{aligned}
F_V(X_m)
&= \log \int_{\mathbf{f}_m} G(\mathbf{f}_m, \mathbf{y})\, p(\mathbf{f}_m)\, d\mathbf{f}_m \\
&= \log \int_{\mathbf{f}_m} \mathcal{N}(\mathbf{y} \mid \boldsymbol{\alpha}_m, \sigma^2 I)\, p(\mathbf{f}_m)\, d\mathbf{f}_m - \frac{1}{2\sigma^2} \mathrm{Tr}\big[\mathrm{Cov}(\mathbf{f} \mid \mathbf{f}_m)\big] \\
&= \log \Big[ \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \sigma^2 I + K_{nm} K_{mm}^{-1} K_{mn}\big) \Big] - \frac{1}{2\sigma^2} \mathrm{Tr}\big[\mathrm{Cov}(\mathbf{f} \mid \mathbf{f}_m)\big],
\end{aligned}
$$
      where $\boldsymbol{\alpha}_m = \mathbb{E}[\mathbf{f} \mid \mathbf{f}_m] = K_{nm} K_{mm}^{-1} \mathbf{f}_m$ and $\mathrm{Cov}(\mathbf{f} \mid \mathbf{f}_m) = K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}$.
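A minimal numpy sketch of the collapsed bound $F_V(X_m)$, written naively for clarity. The se_kernel helper and the jitter term are assumptions carried over from the earlier sketches, not part of the derivation; a practical implementation would use the matrix inversion lemma to keep the cost at $O(nm^2)$ and would evaluate only the diagonal of $K_{nn}$.

```python
import numpy as np

def titsias_bound(X, y, X_m, sigma2, kernel, jitter=1e-6):
    """Collapsed variational lower bound
       F_V = log N(y | 0, sigma2*I + Q_nn) - Tr[K_nn - Q_nn] / (2*sigma2),
       where Q_nn = K_nm K_mm^{-1} K_mn."""
    n, m = X.shape[0], X_m.shape[0]
    K_mm = kernel(X_m, X_m) + jitter * np.eye(m)
    K_nm = kernel(X, X_m)
    Q_nn = K_nm @ np.linalg.solve(K_mm, K_nm.T)      # Nystrom approximation of K_nn
    C = sigma2 * np.eye(n) + Q_nn
    Lc = np.linalg.cholesky(C)
    alpha = np.linalg.solve(Lc.T, np.linalg.solve(Lc, y))
    log_gauss = -0.5 * (y @ alpha
                        + 2.0 * np.sum(np.log(np.diag(Lc)))   # log|C|
                        + n * np.log(2.0 * np.pi))
    trace_term = np.trace(kernel(X, X)) - np.trace(Q_nn)      # Tr[K_nn - Q_nn]
    return log_gauss - 0.5 * trace_term / sigma2
```

With $X_m$ a strict subset of the training inputs, this value should lie below the exact log marginal likelihood computed by the earlier gp_log_marginal_likelihood sketch, and the gap should shrink as $m$ grows.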

  15. Variational bound versus PP log likelihood
      The traditional projected process (PP or DTC) log likelihood is
$$
F_P = \log \Big[ \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \sigma^2 I + K_{nm} K_{mm}^{-1} K_{mn}\big) \Big].
$$
      What we obtained is
$$
F_V = \log \Big[ \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \sigma^2 I + K_{nm} K_{mm}^{-1} K_{mn}\big) \Big] - \frac{1}{2\sigma^2} \mathrm{Tr}\big[K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}\big].
$$
      We got this extra trace term (the total variance of $p(\mathbf{f} \mid \mathbf{f}_m)$).

  16. Optimal $\phi^*(\mathbf{f}_m)$ and predictive distribution
      - The optimal $\phi^*(\mathbf{f}_m)$ that corresponds to the above bound gives rise to the PP predictive distribution (Csato and Opper, 2002; Seeger, Williams and Lawrence, 2003)
      - The approximate predictive distribution is identical to PP
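For reference (not written on the slide), the optimal variational distribution and the resulting predictive distribution can be stated explicitly; these are the standard PP/DTC expressions in the notation used here, with $\Sigma = \big(K_{mm} + \sigma^{-2} K_{mn} K_{nm}\big)^{-1}$:

$$
\phi^*(\mathbf{f}_m) = \mathcal{N}\big(\mathbf{f}_m \,\big|\, \sigma^{-2} K_{mm} \Sigma K_{mn} \mathbf{y},\; K_{mm} \Sigma K_{mm}\big),
$$
$$
q(\mathbf{f}_*) = \int p(\mathbf{f}_* \mid \mathbf{f}_m)\, \phi^*(\mathbf{f}_m)\, d\mathbf{f}_m
= \mathcal{N}\big(\mathbf{f}_* \,\big|\, \sigma^{-2} K_{*m} \Sigma K_{mn} \mathbf{y},\; K_{**} - K_{*m} K_{mm}^{-1} K_{m*} + K_{*m} \Sigma K_{m*}\big).
$$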

  17. Variational bound for model selection
      Learning the inducing inputs $X_m$ and $(\sigma^2, \theta)$ using continuous optimization: maximize the bound with respect to $(X_m, \sigma^2, \theta)$,
$$
F_V = \log \Big[ \mathcal{N}\big(\mathbf{y} \mid \mathbf{0}, \sigma^2 I + K_{nm} K_{mm}^{-1} K_{mn}\big) \Big] - \frac{1}{2\sigma^2} \mathrm{Tr}\big[K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}\big].
$$
      - The first term encourages fitting the data $\mathbf{y}$
      - The second (trace) term says to minimize the total variance of $p(\mathbf{f} \mid \mathbf{f}_m)$
      - The trace $\mathrm{Tr}\big[K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}\big]$ can stand on its own as an objective function for sparse GP learning
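A sketch of what this continuous optimization might look like in practice, reusing the hypothetical se_kernel and titsias_bound helpers from the earlier sketches. The log-parameterization, the L-BFGS-B optimizer, and the numerical gradients are illustrative choices for a small problem, not the procedure used in the talk (which would rely on analytic gradients of the bound).

```python
import numpy as np
from scipy.optimize import minimize

def fit_sparse_gp(X, y, m=10, seed=0):
    """Jointly learn the inducing inputs X_m and (sigma2, theta) by maximizing F_V."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xm0 = X[rng.choice(n, size=m, replace=False)]      # initialize X_m on data points

    def unpack(params):
        log_sigma2, log_var, log_len = params[:3]      # log-params keep values positive
        X_m = params[3:].reshape(m, d)
        return np.exp(log_sigma2), np.exp(log_var), np.exp(log_len), X_m

    def neg_bound(params):
        sigma2, var, lenscale, X_m = unpack(params)
        kern = lambda A, B: se_kernel(A, B, variance=var, lengthscale=lenscale)
        return -titsias_bound(X, y, X_m, sigma2, kern)

    x0 = np.concatenate([np.log([0.1, 1.0, 1.0]), Xm0.ravel()])
    res = minimize(neg_bound, x0, method="L-BFGS-B")   # finite-difference gradients
    return unpack(res.x), -res.fun                     # learned parameters and F_V
```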

  18. Variational bound for model selection
      When the bound becomes equal to the true marginal log likelihood, i.e. $F_V = \log p(\mathbf{y})$, then:
      - $\mathrm{Tr}\big[K_{nn} - K_{nm} K_{mm}^{-1} K_{mn}\big] = 0$
      - $K_{nn} = K_{nm} K_{mm}^{-1} K_{mn}$
      - $p(\mathbf{f} \mid \mathbf{f}_m)$ becomes a delta function
      - we can reproduce the full/exact GP prediction
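A quick numerical check of this statement under the assumptions of the earlier sketches: when the inducing inputs coincide with the training inputs, $K_{nn} = K_{nm} K_{mm}^{-1} K_{mn}$, the trace term vanishes, and the bound recovers the exact log marginal likelihood.

```python
import numpy as np

# Reuses the hypothetical se_kernel, gp_log_marginal_likelihood and titsias_bound
# helpers from the earlier sketches.  With X_m set to the full training set the
# two quantities should agree closely (up to the small jitter added to K_mm).
X = np.random.randn(30, 2)
y = np.random.randn(30)
exact = gp_log_marginal_likelihood(X, y, sigma2=0.1, kernel=se_kernel)
bound = titsias_bound(X, y, X_m=X, sigma2=0.1, kernel=se_kernel)
print(exact, bound)
```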
