
Gaussian Process
Lei Tang, Arizona State University, July 31, 2007


  1. Gaussian Process. Lei Tang, Arizona State University, July 31, 2007.

  2. Gaussian processes are known as kriging in geostatistics. Autoregressive moving average models, Kalman filters, and radial basis function networks can all be viewed as forms of Gaussian process models.

  3. Linear regression revisited. With $y(x) = w^T \phi(x)$ and prior $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$, the vector of function values $y = \Phi w$ has mean $\mathbb{E}[y] = \Phi\,\mathbb{E}[w] = 0$ and covariance $\operatorname{cov}[y] = \mathbb{E}[y y^T] = \Phi\,\mathbb{E}[w w^T]\,\Phi^T = \frac{1}{\alpha}\Phi\Phi^T = K$, where $K$ is the Gram matrix with elements $K_{nm} = k(x_n, x_m) = \frac{1}{\alpha}\,\phi(x_n)^T \phi(x_m)$.
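A minimal numerical check of this claim (not from the slides): drawing $w$ from the prior and forming $y = \Phi w$ yields an empirical covariance matching the Gram matrix $\frac{1}{\alpha}\Phi\Phi^T$. The basis functions, $\alpha$, and sample count are illustrative choices.

```python
# Sketch: verify cov[y] = (1/alpha) * Phi @ Phi.T for y = Phi @ w,
# w ~ N(0, alpha^{-1} I). Basis and alpha are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0                                    # weight-prior precision
x = np.linspace(-1, 1, 5)                      # 5 input points
Phi = np.stack([np.ones_like(x), x, x**2], 1)  # polynomial basis, N x M

K = Phi @ Phi.T / alpha                        # analytic Gram matrix

# Monte-Carlo estimate of cov[y] from samples of w ~ N(0, alpha^{-1} I)
W = rng.normal(scale=alpha ** -0.5, size=(100000, Phi.shape[1]))
Y = W @ Phi.T                                  # each row is one sampled y
print(np.allclose(np.cov(Y, rowvar=False), K, atol=0.05))  # -> True
```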

  4. Gaussian Process. A Gaussian process is defined as a probability distribution over functions $y(x)$ such that the set of values of $y(x)$ evaluated at an arbitrary set of points $x_1, \dots, x_N$ jointly have a Gaussian distribution. Gaussian random field: the case where the input vector $x$ is two-dimensional. Stochastic process: $y(x)$ is specified by giving the joint probability distribution for any finite set of values $y(x_1), \dots, y(x_N)$ in a consistent manner.

  5. GP Connection to Kernels. For a Gaussian stochastic process, the joint distribution over the $N$ variables $y_1, \dots, y_N$ is specified completely by the second-order statistics. For most applications we have no prior knowledge about the mean, so by symmetry we take the mean of $y(x)$ to be zero. The Gaussian process is then determined by the covariance of $y(x)$, which is specified by the kernel function: $\mathbb{E}[y(x_n)\, y(x_m)] = k(x_n, x_m)$.

  6. Two Examples of GPs. Specify the covariance (kernel) directly. (1) Gaussian kernel: $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$. (2) Exponential kernel: $k(x, x') = \exp(-\theta |x - x'|)$, which corresponds to the Ornstein-Uhlenbeck process originally introduced to model Brownian motion. [Figure: sample functions drawn from the GP prior under each kernel.]
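A small sketch of how such priors can be visualized: sample functions from a zero-mean GP on a grid under each of the two kernels above. The grid, $\sigma$, $\theta$, and jitter value are assumptions for illustration, not settings from the talk.

```python
# Sketch: draw sample functions from a zero-mean GP prior under the
# Gaussian and exponential (Ornstein-Uhlenbeck) kernels of slide 6.
import numpy as np

def gaussian_kernel(x1, x2, sigma=0.3):
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

def exponential_kernel(x1, x2, theta=3.0):     # OU covariance
    return np.exp(-theta * np.abs(x1 - x2))

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 200)
for kernel in (gaussian_kernel, exponential_kernel):
    K = kernel(x[:, None], x[None, :]) + 1e-6 * np.eye(len(x))  # jitter
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
    print(kernel.__name__, samples.shape)      # 5 sample paths each
```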

  7. GP for Regression with Random Noise. If the noise on the observed target values is taken into account: $p(t_n \mid y_n) = \mathcal{N}(t_n \mid y_n, \beta^{-1})$, $p(t \mid y) = \mathcal{N}(t \mid y, \beta^{-1} I_N)$, and $p(y) = \mathcal{N}(y \mid 0, K)$, so $p(t) = \int p(t \mid y)\, p(y)\, dy = \mathcal{N}(t \mid 0, C)$, where $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1}\delta_{nm}$: the covariances simply add. Hint: matrix inversion lemma $(B^{-1} + C D^{-1} C^T)^{-1} = B - B C (D + C^T B C)^{-1} C^T B$.
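As a sketch of the marginal covariance above, with an assumed squared-exponential kernel and an arbitrary noise precision $\beta$, the noise term only adds $\beta^{-1}$ to the diagonal of $K$:

```python
# Sketch: build C = K + beta^{-1} I and draw one vector of noisy targets.
# The kernel width and beta are placeholder values, not from the talk.
import numpy as np

beta = 25.0                                    # noise precision
x = np.linspace(0, 1, 6)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))  # kernel part
C = K + np.eye(len(x)) / beta                  # covariance of the targets t

t = np.random.default_rng(2).multivariate_normal(np.zeros(len(x)), C)
print(t)                                       # one draw of noisy targets
```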


  10. [Figure slide; only plot axes survived extraction.]

  11. Commonly Used Kernel for Regression: $k(x_n, x_m) = \theta_0 \exp\!\left(-\frac{\theta_1}{2}\,\|x_n - x_m\|^2\right) + \theta_2 + \theta_3\, x_n^T x_m$.
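A direct transcription of this kernel into code might look as follows; the $\theta$ values below are arbitrary placeholders, not settings used in the presentation.

```python
# Sketch: the four-parameter kernel of slide 11, evaluated on two inputs.
import numpy as np

def kernel(xn, xm, theta=(1.0, 4.0, 0.0, 0.0)):
    t0, t1, t2, t3 = theta                     # illustrative hyperparameters
    sq = np.sum((xn - xm) ** 2)                # ||x_n - x_m||^2
    return t0 * np.exp(-0.5 * t1 * sq) + t2 + t3 * np.dot(xn, xm)

print(kernel(np.array([0.1]), np.array([0.3])))
```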

  12. GP for Prediction. $p(t_{N+1}) = \mathcal{N}(t_{N+1} \mid 0, C_{N+1})$ with $C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \end{pmatrix}$, giving the predictive mean $m(x_{N+1}) = k^T C_N^{-1} t$ and variance $\sigma^2(x_{N+1}) = c - k^T C_N^{-1} k$. If we rewrite $m(x_{N+1}) = \sum_{n=1}^{N} a_n k(x_n, x_{N+1})$ and the kernel function depends only on the distance $\|x_n - x_m\|$, we obtain an expansion in radial basis functions.
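A minimal sketch of these predictive equations, using an assumed squared-exponential kernel, made-up training data, and an arbitrary noise precision:

```python
# Sketch: predictive mean m(x_{N+1}) = k^T C_N^{-1} t and
# variance sigma^2(x_{N+1}) = c - k^T C_N^{-1} k from slide 12.
import numpy as np

def k_se(a, b, sigma=0.3):                     # squared-exponential kernel
    return np.exp(-(a - b) ** 2 / (2 * sigma ** 2))

beta = 100.0                                   # noise precision (assumed)
x = np.array([0.1, 0.4, 0.7, 0.9])             # training inputs (made up)
t = np.sin(2 * np.pi * x)                      # training targets (made up)

C_N = k_se(x[:, None], x[None, :]) + np.eye(len(x)) / beta

def predict(x_new):
    k = k_se(x, x_new)                         # vector k(x_n, x_{N+1})
    c = k_se(x_new, x_new) + 1.0 / beta
    v = np.linalg.solve(C_N, k)                # C_N^{-1} k, via a solve
    mean = v @ t                               # m(x_{N+1})
    var = c - k @ v                            # sigma^2(x_{N+1})
    return mean, var

print(predict(0.5))
```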


  14. [Figure slide; only plot axes survived extraction.]

  15. Computation Time for GP Regression. (1) Training. GP: inversion of an $N \times N$ matrix, $O(N^3) + O(N^2)$. Linear basis function model: inversion of an $M \times M$ matrix, $O(M^3) + O(M^2)$. (2) Prediction. GP: $O(N)$; linear basis function model: $O(M)$. Advantages of GPs: if the number of basis functions is larger than the number of data points, the GP is computationally more efficient; there is no need to construct the basis functions explicitly; and the hyperparameters can be learned (by maximum likelihood estimation).


  17. Automatic Relevance Determination. The previous example does not consider the relative importance of each input dimension. Define a kernel as $k(x, x') = \theta_0 \exp\!\left(-\frac{1}{2}\sum_{i=1}^{2} \gamma_i (x_i - x_i')^2\right)$. Automatically learning the hyperparameters yields ARD, which automatically determines the relative importance of each input dimension.
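A sketch of this ARD kernel for a two-dimensional input; the $\gamma$ values are illustrative. A large $\gamma_i$ makes the kernel sensitive to dimension $i$, while $\gamma_i \to 0$ effectively removes that dimension.

```python
# Sketch: ARD kernel of slide 17 with per-dimension scales gamma_i.
import numpy as np

def ard_kernel(x, xp, theta0=1.0, gamma=(10.0, 0.01)):
    g = np.asarray(gamma)                      # one scale per input dimension
    return theta0 * np.exp(-0.5 * np.sum(g * (x - xp) ** 2))

a = np.array([0.2, 0.9])
b = np.array([0.3, 0.1])
print(ard_kernel(a, b))   # barely sensitive to the second coordinate
```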

  18. GP for Classification. Similar to logistic/probit regression, use a nonlinear activation function to map $(-\infty, +\infty)$ into the probability interval $(0, 1)$. Introduce a latent variable $a$; the target given the latent variable is $p(t \mid a) = \sigma(a)^t (1 - \sigma(a))^{1-t}$, and the latent variables $a$ follow a Gaussian process. For prediction, $p(t_{N+1} = 1 \mid t_N) = \int p(t_{N+1} = 1 \mid a_{N+1})\, p(a_{N+1} \mid t_N)\, da_{N+1}$. Unfortunately, this integral is analytically intractable and must be approximated using sampling methods or analytical approximations.
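One of the workarounds mentioned above is sampling. A crude sketch, assuming $p(a_{N+1} \mid t_N)$ has already been approximated by a Gaussian with placeholder mean and variance, estimates the predictive probability by Monte Carlo:

```python
# Sketch: Monte Carlo estimate of p(t_{N+1}=1 | t_N) = E[sigmoid(a_{N+1})]
# under an assumed Gaussian p(a_{N+1} | t_N) = N(mu, s2). mu, s2 are placeholders.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mu, s2 = 0.8, 0.5                              # assumed Gaussian over a_{N+1}
a_samples = np.random.default_rng(3).normal(mu, np.sqrt(s2), size=100000)
print(sigmoid(a_samples).mean())               # approx p(t_{N+1} = 1 | t_N)
```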


  21. GP Classification: Prediction. Use a Gaussian approximation to the posterior distribution over $a_{N+1}$: $p(a_{N+1} \mid t_N) = \int p(a_{N+1} \mid a_N)\, p(a_N \mid t_N)\, da_N$, where $p(a_{N+1} \mid a_N) = \mathcal{N}(a_{N+1} \mid k^T C_N^{-1} a_N,\ c - k^T C_N^{-1} k)$. We still need to estimate $p(a_N \mid t_N)$, for which a Gaussian approximation is used. This is reasonable because the shape of a single-mode distribution is often close to Gaussian, and as the number of data points falling in a fixed region of $x$-space increases, the corresponding uncertainty in the function $a(x)$ decreases, asymptotically approaching a Gaussian.
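A sketch of the combination above: if $p(a_N \mid t_N)$ is approximated by $\mathcal{N}(m, S)$, marginalizing the linear-Gaussian conditional $p(a_{N+1} \mid a_N)$ gives a Gaussian over $a_{N+1}$ whose mean and variance can be computed directly. All matrices and vectors below are placeholder numbers, not quantities from the talk.

```python
# Sketch: combine p(a_{N+1} | a_N) = N(k^T C_N^{-1} a_N, c - k^T C_N^{-1} k)
# with an assumed Gaussian approximation p(a_N | t_N) ~ N(m, S).
import numpy as np

C_N = np.array([[1.0, 0.5], [0.5, 1.0]])       # assumed Gram matrix + noise
k = np.array([0.6, 0.3])                       # k(x_n, x_{N+1})
c = 1.0                                        # k(x_{N+1}, x_{N+1}) + noise
m = np.array([0.9, -0.4])                      # Gaussian approx: mean of a_N
S = np.array([[0.2, 0.0], [0.0, 0.2]])         # Gaussian approx: cov of a_N

w = np.linalg.solve(C_N, k)                    # C_N^{-1} k
mean = w @ m                                   # E[a_{N+1} | t_N]
var = c - k @ w + w @ S @ w                    # var[a_{N+1} | t_N]
print(mean, var)
```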

