1. Gaussian Processes: Covariance Functions and Classification
Carl Edward Rasmussen
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Gaussian Processes in Practice, Bletchley Park, July 12th, 2006

2. Outline
• Covariance functions encode structure. You can learn about them by sampling and by optimizing the marginal likelihood.
• GPs with various covariance functions are equivalent to many well-known models: large neural networks, splines, relevance vector machines, ...
  – infinitely many Gaussian bumps: regression
  – rational quadratic and Matérn covariance functions
• Quick two-page recap of GP regression.
• Approximate inference for Gaussian process classification: replace the non-Gaussian, intractable posterior by a Gaussian. Expectation Propagation.

3. From random functions to covariance functions
Consider the class of functions (sums of squared exponentials):
f(x) = lim_{n→∞} (1/n) Σ_i γ_i exp(−(x − i/n)²),  where γ_i ∼ N(0, 1) ∀i
     = ∫_{−∞}^{∞} γ(u) exp(−(x − u)²) du,          where γ(u) ∼ N(0, 1) ∀u.
The mean function is
μ(x) = E[f(x)] = ∫_{−∞}^{∞} exp(−(x − u)²) ∫_{−∞}^{∞} γ p(γ) dγ du = 0,
and the covariance function is
E[f(x) f(x′)] = ∫ exp(−(x − u)² − (x′ − u)²) du
             = ∫ exp(−2(u − (x + x′)/2)² + (x + x′)²/2 − x² − x′²) du ∝ exp(−(x − x′)²/2).
Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere, not just at your training points!
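
As a quick numerical check of this derivation (a sketch added here, not part of the original slides), the snippet below integrates the product of two Gaussian bumps over all centres u and compares the result with √(π/2)·exp(−(x − x′)²/2):

```python
# Numerical check of the derivation above: the covariance induced by Gaussian-bump
# basis functions placed everywhere,
#   E[f(x) f(x')] = int exp(-(x-u)^2 - (x'-u)^2) du,
# is proportional to the squared exponential exp(-(x-x')^2 / 2).
import numpy as np
from scipy.integrate import quad

def bump_covariance(x, xp):
    """Integrate exp(-(x-u)^2 - (xp-u)^2) over all centres u numerically."""
    integrand = lambda u: np.exp(-(x - u) ** 2 - (xp - u) ** 2)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

for x, xp in [(-1.0, 0.5), (0.0, 0.0), (0.3, 2.0)]:
    lhs = bump_covariance(x, xp)
    rhs = np.sqrt(np.pi / 2) * np.exp(-(x - xp) ** 2 / 2)  # exact value of the integral
    print(f"x={x:+.1f}, x'={xp:+.1f}:  integral={lhs:.6f},  sqrt(pi/2)*SE={rhs:.6f}")
```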

4. Why is it dangerous to use only finitely many basis functions?
[Figure: a fit using finitely many basis functions over inputs x ∈ [−10, 10]; a "?" marks the prediction in a region away from the basis functions.]

5. Rational quadratic covariance function
The rational quadratic (RQ) covariance function
k_RQ(r) = (1 + r²/(2αℓ²))^(−α),  with α, ℓ > 0,
can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using τ = ℓ⁻² and p(τ | α, β) ∝ τ^(α−1) exp(−ατ/β):
k_RQ(r) = ∫ p(τ | α, β) k_SE(r | τ) dτ ∝ ∫ τ^(α−1) exp(−ατ/β) exp(−τr²/2) dτ ∝ (1 + r²/(2αℓ²))^(−α).
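
The scale-mixture identity can also be verified numerically. The sketch below (an added illustration, with arbitrary example values α = 2, β = 1) draws τ from the gamma mixing distribution and compares the averaged SE kernel with the RQ closed form:

```python
# Sketch: check that the RQ kernel is a gamma scale mixture of SE kernels.
# With tau = 1/ell^2 distributed as p(tau) \propto tau^(alpha-1) exp(-alpha*tau/beta),
# averaging exp(-tau r^2 / 2) over tau should reproduce
# (1 + r^2 / (2*alpha*ell^2))^(-alpha) with ell^2 = 1/beta.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
alpha, beta = 2.0, 1.0          # mixture shape and mean of tau (example values)
ell2 = 1.0 / beta               # implied squared length-scale of the RQ kernel

tau = gamma.rvs(a=alpha, scale=beta / alpha, size=200_000, random_state=rng)
r = np.linspace(0.0, 3.0, 7)

mixture = np.array([np.mean(np.exp(-t2 * 0)) if False else
                    np.mean(np.exp(-tau * ri ** 2 / 2)) for t2, ri in zip(r, r)])
mixture = np.array([np.mean(np.exp(-tau * ri ** 2 / 2)) for ri in r])  # Monte Carlo average of SE kernels
closed_form = (1 + r ** 2 / (2 * alpha * ell2)) ** (-alpha)            # RQ kernel

for ri, m, c in zip(r, mixture, closed_form):
    print(f"r={ri:.1f}:  mixture={m:.4f}  RQ={c:.4f}")
```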

6. Rational quadratic covariance function II
[Figure, left: RQ covariance as a function of input distance for α = 1/2, α = 2 and α → ∞. Right: corresponding sample functions f(x) over inputs x ∈ [−5, 5].]
The limit α → ∞ of the RQ covariance function is the SE.

7. Matérn covariance functions
Stationary covariance functions can be based on the Matérn form:
k(x, x′) = [1 / (Γ(ν) 2^(ν−1))] (√(2ν)/ℓ |x − x′|)^ν K_ν(√(2ν)/ℓ |x − x′|),
where K_ν is the modified Bessel function of the second kind of order ν, and ℓ is the characteristic length scale.
Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the hyperparameter ν can control the degree of smoothness.

8. Matérn covariance functions II
Univariate Matérn covariance function with unit characteristic length scale and unit variance:
[Figure, left: covariance as a function of input distance for ν = 1/2, ν = 1, ν = 2 and ν → ∞. Right: corresponding sample functions f(x) over inputs x ∈ [−5, 5].]

9. Matérn covariance functions III
Perhaps the most interesting cases for machine learning are ν = 3/2 and ν = 5/2, for which
k_{ν=3/2}(r) = (1 + √3 r/ℓ) exp(−√3 r/ℓ),
k_{ν=5/2}(r) = (1 + √5 r/ℓ + 5r²/(3ℓ²)) exp(−√5 r/ℓ).
Other special cases:
• ν = 1/2: Laplacian covariance function; sample functions look like stationary Brownian motion.
• ν → ∞: Gaussian covariance function, with smooth (infinitely differentiable) sample functions.
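
The closed forms for ν = 3/2 and ν = 5/2 can be checked against the general Bessel-function expression; the sketch below (an added illustration using SciPy, with ℓ = 1) does exactly that:

```python
# Sketch: evaluate the general Matérn form with scipy's modified Bessel function K_nu
# and compare against the nu = 3/2 and nu = 5/2 closed-form expressions above (ell = 1).
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, ell=1.0):
    """General Matérn covariance; returns 1.0 at r = 0 by continuity."""
    r = np.asarray(r, dtype=float)
    scaled = np.sqrt(2 * nu) * r / ell
    out = np.ones_like(r)
    nz = scaled > 0
    out[nz] = (2 ** (1 - nu) / gamma(nu)) * scaled[nz] ** nu * kv(nu, scaled[nz])
    return out

r = np.linspace(0.01, 3.0, 5)
print(np.allclose(matern(r, 1.5), (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)))
print(np.allclose(matern(r, 2.5),
                  (1 + np.sqrt(5) * r + 5 * r ** 2 / 3) * np.exp(-np.sqrt(5) * r)))
```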

10. A Comparison
Left: SE covariance function, log marginal likelihood −15.6. Right: Matérn covariance function with ν = 3/2, log marginal likelihood −18.0.

11. GP regression recap
We use a Gaussian process prior for the latent function:
f | X, θ ∼ N(0, K).
The likelihood is a factorized Gaussian:
y | f ∼ ∏_{i=1}^{n} N(y_i | f_i, σ_n²).
The posterior is Gaussian:
p(f | D, θ) = p(f | X, θ) p(y | f) / p(D | θ).
The latent value at the test point, f(x*), is Gaussian:
p(f* | D, θ, x*) = ∫ p(f* | f, X, θ, x*) p(f | D, θ) df,
and the predictive distribution is Gaussian:
p(y* | D, θ, x*) = ∫ p(y* | f*) p(f* | D, θ, x*) df*.

12. Prior and posterior
[Figure: sample functions f(x) from the prior (left) and from the posterior (right), over inputs x ∈ [−5, 5].]
Predictive distribution:
p(y* | x*, x, y) ∼ N( k(x*, x)ᵀ [K + σ²_noise I]⁻¹ y ,  k(x*, x*) + σ²_noise − k(x*, x)ᵀ [K + σ²_noise I]⁻¹ k(x*, x) ).
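
A minimal sketch of these predictive equations (added here for illustration; the SE covariance and the length scale, signal and noise variances are arbitrary example choices, not the values behind the plots):

```python
# GP regression: predictive mean and variance of y* from the equations above.
import numpy as np

def k_se(a, b, ell=1.0, sf2=1.0):
    """Squared exponential covariance matrix between 1-D input vectors a and b."""
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_predict(x, y, xstar, ell=1.0, sf2=1.0, sn2=0.1):
    K = k_se(x, x, ell, sf2) + sn2 * np.eye(len(x))        # K + sigma_noise^2 I
    ks = k_se(x, xstar, ell, sf2)                          # k(x*, x) for each test point
    alpha = np.linalg.solve(K, y)
    mean = ks.T @ alpha                                    # k(x*,x)^T [K + sn2 I]^-1 y
    var = (sf2 + sn2
           - np.sum(ks * np.linalg.solve(K, ks), axis=0))  # predictive variance of y*
    return mean, var

x = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(x)
mean, var = gp_predict(x, y, np.linspace(-5, 5, 9))
print(np.round(mean, 3), np.round(var, 3))
```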

13. The marginal likelihood
To choose between models M_1, M_2, ..., compare the posterior for the models:
p(M_i | D) = p(y | x, M_i) p(M_i) / p(D).
The log marginal likelihood
log p(y | x, M_i) = −(1/2) yᵀ K⁻¹ y − (1/2) log |K| − (n/2) log(2π)
is the combination of a data-fit term and a complexity penalty. Occam's Razor is automatic.
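
For illustration (not part of the slides), the log marginal likelihood can be computed stably via a Cholesky factorisation, which makes the data-fit and complexity terms explicit; the SE kernel and hyperparameters below are arbitrary example choices:

```python
# Log marginal likelihood of GP regression via a Cholesky factorisation of K.
import numpy as np

def log_marginal_likelihood(x, y, ell=1.0, sf2=1.0, sn2=0.1):
    n = len(x)
    K = sf2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / ell ** 2) + sn2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # K^{-1} y
    data_fit = -0.5 * y @ alpha                            # -1/2 y^T K^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))               # -1/2 log |K|
    norm_const = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + norm_const

x = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(x)
print(log_marginal_likelihood(x, y))
```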

14. Binary Gaussian Process Classification
[Figure: a latent function f(x) and the corresponding class probability π(x) ∈ [0, 1], both as functions of the input x.]
The class probability is related to the latent function through
p(y = 1 | f(x)) = π(x) = Φ(f(x)).
Observations are independent given f, so the likelihood is
p(y | f) = ∏_{i=1}^{n} p(y_i | f_i) = ∏_{i=1}^{n} Φ(y_i f_i).

15. Likelihood functions
The logistic (1 + exp(−y_i f_i))⁻¹ and probit Φ(y_i f_i) likelihoods and their derivatives:
[Figure: log likelihood, first derivative and second derivative, plotted as functions of the latent times target z_i = y_i f_i, for each of the two likelihoods.]
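
For the probit case, the plotted quantities have simple closed forms; the sketch below (added here, following the standard expressions for log Φ(y_i f_i) and its first two derivatives with respect to f_i) evaluates them:

```python
# Log probit likelihood log Phi(y f) and its first and second derivatives w.r.t. f.
import numpy as np
from scipy.stats import norm

def probit_loglik_and_derivs(y, f):
    """y in {-1, +1}; returns log Phi(y f), d/df, d^2/df^2."""
    z = y * f
    log_lik = norm.logcdf(z)
    ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))   # phi(z)/Phi(z), computed in log space
    d1 = y * ratio                                     # first derivative
    d2 = -ratio ** 2 - z * ratio                       # second derivative (uses y^2 = 1)
    return log_lik, d1, d2

for z in [-2.0, 0.0, 2.0]:
    print(z, [round(float(v), 3) for v in probit_loglik_and_derivs(1, z)])
```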

16. Exact expressions
We use a Gaussian process prior for the latent function:
f | X, θ ∼ N(0, K).
The posterior becomes
p(f | D, θ) = p(f | X, θ) p(y | f) / p(D | θ) = N(f | 0, K) ∏_{i=1}^{n} Φ(y_i f_i) / p(D | θ),
which is non-Gaussian. The latent value at the test point, f(x*), is
p(f* | D, θ, x*) = ∫ p(f* | f, X, θ, x*) p(f | D, θ) df,
and the predictive class probability becomes
p(y* | D, θ, x*) = ∫ p(y* | f*) p(f* | D, θ, x*) df*,
both of which are intractable to compute.

17. Gaussian Approximation to the Posterior
We approximate the non-Gaussian posterior by a Gaussian:
p(f | D, θ) ≃ q(f | D, θ) = N(m, A),
then q(f* | D, θ, x*) = N(f* | μ*, σ*²), where
μ* = k*ᵀ K⁻¹ m,
σ*² = k(x*, x*) − k*ᵀ (K⁻¹ − K⁻¹ A K⁻¹) k*.
Using this approximation, the predictive class probability is
q(y* = 1 | D, θ, x*) = ∫ Φ(f*) N(f* | μ*, σ*²) df* = Φ( μ* / √(1 + σ*²) ).
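
The closed-form average of the probit over a Gaussian, used in the last line, is easy to check numerically; the sketch below (added for illustration, with arbitrary values of μ* and σ*²) compares it with quadrature:

```python
# Check: int Phi(f) N(f | mu, s2) df = Phi(mu / sqrt(1 + s2)).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def predictive_prob(mu, s2):
    return norm.cdf(mu / np.sqrt(1.0 + s2))

for mu, s2 in [(0.5, 1.0), (-1.2, 2.5), (2.0, 0.3)]:
    numeric, _ = quad(lambda f: norm.cdf(f) * norm.pdf(f, loc=mu, scale=np.sqrt(s2)),
                      -np.inf, np.inf)
    print(f"mu={mu:+.1f}, s2={s2:.1f}:  closed form={predictive_prob(mu, s2):.6f},  quadrature={numeric:.6f}")
```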

18. What Gaussian?
Some suggestions:
• local expansion: Laplace's method
• optimize a variational lower bound (using Jensen's inequality):
  log p(y | X) = log ∫ p(y | f) p(f) df ≥ ∫ q(f) log [ p(y | f) p(f) / q(f) ] df
• the Expectation Propagation (EP) algorithm

19. Expectation Propagation
Posterior:
p(f | X, y) = (1/Z) p(f | X) ∏_{i=1}^{n} p(y_i | f_i),
where the normalizing term is the marginal likelihood
Z = p(y | X) = ∫ p(f | X) ∏_{i=1}^{n} p(y_i | f_i) df.
The exact likelihood p(y_i | f_i) = Φ(y_i f_i) makes inference intractable. In EP we use a local likelihood approximation
p(y_i | f_i) ≃ t_i(f_i | Z̃_i, μ̃_i, σ̃_i²) ≜ Z̃_i N(f_i | μ̃_i, σ̃_i²),
where Z̃_i, μ̃_i and σ̃_i² are the site parameters, such that
∏_{i=1}^{n} t_i(f_i | Z̃_i, μ̃_i, σ̃_i²) = N(μ̃, Σ̃) ∏_{i=1}^{n} Z̃_i.
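
EP fits the site parameters by moment matching against the "tilted" distribution Φ(y_i f_i) N(f_i | μ₋ᵢ, σ₋ᵢ²) formed from the cavity distribution for site i. The sketch below (added here; it uses the standard closed-form probit moments, cross-checked by quadrature, and is not code from the talk) shows that single-site step:

```python
# Moment matching for one probit site: zeroth, first and second moments of the
# tilted distribution Phi(y f) N(f | mu_cav, s2_cav), closed form vs quadrature.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def tilted_moments(y, mu_cav, s2_cav):
    """Return Zhat, mean and variance of Phi(y f) N(f | mu_cav, s2_cav) / Zhat."""
    z = y * mu_cav / np.sqrt(1.0 + s2_cav)
    Zhat = norm.cdf(z)
    ratio = norm.pdf(z) / Zhat
    mean = mu_cav + y * s2_cav * ratio / np.sqrt(1.0 + s2_cav)
    var = s2_cav - s2_cav ** 2 * ratio / (1.0 + s2_cav) * (z + ratio)
    return Zhat, mean, var

y, mu_cav, s2_cav = 1, -0.4, 1.5           # arbitrary example cavity parameters
Zhat, mean, var = tilted_moments(y, mu_cav, s2_cav)

# quadrature cross-check of the same three moments
p = lambda f: norm.cdf(y * f) * norm.pdf(f, loc=mu_cav, scale=np.sqrt(s2_cav))
Z_num, _ = quad(p, -np.inf, np.inf)
m_num, _ = quad(lambda f: f * p(f), -np.inf, np.inf)
v_num, _ = quad(lambda f: f ** 2 * p(f), -np.inf, np.inf)
print(Zhat, Z_num)
print(mean, m_num / Z_num)
print(var, v_num / Z_num - (m_num / Z_num) ** 2)
```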
