SLIDE 1

Gaussian Processes Covariance Functions and Classification

Carl Edward Rasmussen

Max Planck Institute for Biological Cybernetics, Tübingen, Germany

Gaussian Processes in Practice, Bletchley Park, July 12th, 2006

SLIDE 2

Outline

Covariance functions encode structure. You can learn about them by

  • sampling,
  • optimizing the marginal likelihood.

GPs with various covariance functions are equivalent to many well known models: large neural networks, splines, relevance vector machines, ...

  • regression with infinitely many Gaussian bumps
  • Rational Quadratic and Matérn

Quick two-page recap of GP regression.

Approximate inference for Gaussian process classification: replace the non-Gaussian intractable posterior by a Gaussian. Expectation Propagation.

SLIDE 3

From random functions to covariance functions

Consider the class of functions (sums of squared exponentials):

f(x) = \lim_{n \to \infty} \frac{1}{n} \sum_i \gamma_i \exp(-(x - i/n)^2), where \gamma_i \sim N(0, 1), \forall i
     = \int_{-\infty}^{\infty} \gamma(u) \exp(-(x - u)^2) \, du, where \gamma(u) \sim N(0, 1), \forall u.

The mean function is:

\mu(x) = E[f(x)] = \int_{-\infty}^{\infty} \exp(-(x - u)^2) \int_{-\infty}^{\infty} \gamma \, p(\gamma) \, d\gamma \, du = 0,

and the covariance function:

E[f(x) f(x')] = \int \exp\big(-(x - u)^2 - (x' - u)^2\big) \, du
              = \int \exp\Big(-2\big(u - \tfrac{x + x'}{2}\big)^2 + \tfrac{(x + x')^2}{2} - x^2 - x'^2\Big) du \;\propto\; \exp\Big(-\tfrac{(x - x')^2}{2}\Big).

Thus, the squared exponential covariance function is equivalent to regression using infinitely many Gaussian shaped basis functions placed everywhere, not just at your training points!
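To make this concrete, here is a minimal sketch (not from the slides; it assumes numpy and an illustrative unit length-scale) that draws sample functions directly from a GP prior with the SE covariance, with no explicit basis functions anywhere:

```python
import numpy as np

def k_se(x1, x2, ell=1.0):
    """Squared exponential covariance between two sets of 1-d inputs."""
    d = x1[:, None] - x2[None, :]              # pairwise differences
    return np.exp(-0.5 * (d / ell) ** 2)

# Draw three functions from the GP prior: each curve is one joint Gaussian
# draw, equivalent to regression with infinitely many Gaussian bumps.
x = np.linspace(-5.0, 5.0, 200)
K = k_se(x, x) + 1e-10 * np.eye(len(x))        # jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```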

SLIDE 4

Why is it dangerous to use only finitely many basis functions?

[Figure: illustration over the input range −10 to 10; the question mark highlights the behaviour of a finite basis-function model away from the basis functions.]

SLIDE 5

Rational quadratic covariance function

The rational quadratic (RQ) covariance function:

k_{RQ}(r) = \Big(1 + \frac{r^2}{2 \alpha \ell^2}\Big)^{-\alpha}, with \alpha, \ell > 0,

can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales. Using \tau = \ell^{-2} and p(\tau | \alpha, \beta) \propto \tau^{\alpha - 1} \exp(-\alpha \tau / \beta):

k_{RQ}(r) = \int p(\tau | \alpha, \beta) \, k_{SE}(r | \tau) \, d\tau \;\propto\; \int \tau^{\alpha - 1} \exp\Big(-\frac{\alpha \tau}{\beta}\Big) \exp\Big(-\frac{\tau r^2}{2}\Big) d\tau \;\propto\; \Big(1 + \frac{r^2}{2 \alpha \ell^2}\Big)^{-\alpha}.
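As a quick numerical check of the scale-mixture view (a sketch, assuming numpy; the length-scale and distance grid are illustrative), the RQ covariance approaches the SE covariance as α grows:

```python
import numpy as np

def k_rq(r, ell=1.0, alpha=2.0):
    """Rational quadratic covariance as a function of input distance r."""
    return (1.0 + r ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

def k_se(r, ell=1.0):
    """Squared exponential covariance, the alpha -> infinity limit of the RQ."""
    return np.exp(-0.5 * r ** 2 / ell ** 2)

r = np.linspace(0.0, 3.0, 7)
for alpha in (0.5, 2.0, 1e6):
    # The maximum discrepancy from the SE shrinks as alpha increases.
    print(alpha, np.max(np.abs(k_rq(r, alpha=alpha) - k_se(r))))
```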

SLIDE 6

Rational quadratic covariance function II

[Figure: left, the RQ covariance as a function of input distance for α = 1/2, α = 2 and α → ∞; right, corresponding sample functions, output f(x) against input x.]

The limit α → ∞ of the RQ covariance function is the SE.

SLIDE 7

Matérn covariance functions

Stationary covariance functions can be based on the Matérn form:

k(x, x') = \frac{1}{\Gamma(\nu) 2^{\nu - 1}} \Big(\frac{\sqrt{2\nu}}{\kappa} |x - x'|\Big)^{\nu} K_\nu\Big(\frac{\sqrt{2\nu}}{\kappa} |x - x'|\Big),

where K_\nu is the modified Bessel function of the second kind of order \nu, and \kappa is the characteristic length scale. Sample functions from Matérn forms are \lceil \nu \rceil - 1 times differentiable; thus, the hyperparameter \nu can control the degree of smoothness.

SLIDE 8

Matérn covariance functions II

Univariate Matérn covariance function with unit characteristic length scale and unit variance:

[Figure: left, the covariance as a function of input distance; right, sample functions, output f(x) against input x, for ν = 1/2, 1, 2 and ν → ∞.]

SLIDE 9

Matérn covariance functions II

It is possible that the most interesting cases for machine learning are \nu = 3/2 and \nu = 5/2, for which

k_{\nu = 3/2}(r) = \Big(1 + \frac{\sqrt{3} r}{\ell}\Big) \exp\Big(-\frac{\sqrt{3} r}{\ell}\Big),

k_{\nu = 5/2}(r) = \Big(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3 \ell^2}\Big) \exp\Big(-\frac{\sqrt{5} r}{\ell}\Big).

Other special cases (see the code sketch below):

  • \nu = 1/2: Laplacian covariance function; sample functions: stationary Brownian motion
  • \nu \to \infty: Gaussian covariance function with smooth (infinitely differentiable) sample functions
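A small sketch of these special cases (assuming numpy; the unit length-scale is illustrative):

```python
import numpy as np

def matern(r, ell=1.0, nu=1.5):
    """Matern covariance for nu in {1/2, 3/2, 5/2} and the SE limit nu -> inf."""
    r = np.abs(r)
    if nu == 0.5:                        # Laplacian / exponential covariance
        return np.exp(-r / ell)
    if nu == 1.5:
        s = np.sqrt(3.0) * r / ell
        return (1.0 + s) * np.exp(-s)
    if nu == 2.5:
        s = np.sqrt(5.0) * r / ell
        return (1.0 + s + 5.0 * r ** 2 / (3.0 * ell ** 2)) * np.exp(-s)
    if np.isinf(nu):                     # SE (Gaussian) covariance
        return np.exp(-0.5 * r ** 2 / ell ** 2)
    raise NotImplementedError("only nu in {1/2, 3/2, 5/2, inf} in this sketch")
```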

SLIDE 10

A Comparison

[Figure] Left: the SE covariance function, log marginal likelihood −15.6; right: the Matérn covariance function with ν = 3/2, log marginal likelihood −18.0.

SLIDE 11

GP regression recap

We use a Gaussian process prior for the latent function:

f | X, \theta \sim N(0, K).

The likelihood is a factorized Gaussian:

y | f \sim \prod_{i=1}^{m} N(y_i | f_i, \sigma_n^2).

The posterior is Gaussian:

p(f | D, \theta) = \frac{p(f | X, \theta) \, p(y | f)}{p(D | \theta)}.

The latent value at the test point, f(x_*), is Gaussian:

p(f_* | D, \theta, x_*) = \int p(f_* | f, X, \theta, x_*) \, p(f | D, \theta) \, df,

and the predictive distribution is Gaussian:

p(y_* | D, \theta, x_*) = \int p(y_* | f_*) \, p(f_* | D, \theta, x_*) \, df_*.

SLIDE 12

Prior and posterior

[Figure: functions drawn from the GP prior (left) and from the posterior (right), output f(x) against input x.]

Predictive distribution:

p(y_* | x_*, x, y) \sim N\big(k(x_*, x)^\top [K + \sigma_{\mathrm{noise}}^2 I]^{-1} y,\;\; k(x_*, x_*) + \sigma_{\mathrm{noise}}^2 - k(x_*, x)^\top [K + \sigma_{\mathrm{noise}}^2 I]^{-1} k(x_*, x)\big).
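A minimal implementation sketch of this predictive distribution (assuming numpy and a kernel function k(X1, X2) that returns the covariance matrix, e.g. the SE sketch above; the noise level is illustrative):

```python
import numpy as np

def gp_predict(X, y, Xs, k, sigma_noise=0.1):
    """Predictive mean and variance of GP regression with Gaussian noise."""
    Ky = k(X, X) + sigma_noise ** 2 * np.eye(len(X))   # K + sigma_noise^2 I
    Ks = k(X, Xs)                                      # k(x, x*)
    L = np.linalg.cholesky(Ky)                         # stable solves via Cholesky
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha                                # k(x*,x)^T [K + s^2 I]^-1 y
    v = np.linalg.solve(L, Ks)
    var = np.diag(k(Xs, Xs)) + sigma_noise ** 2 - np.sum(v ** 2, axis=0)
    return mean, var
```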

SLIDE 13

The marginal likelihood

To choose between models M_1, M_2, \ldots, compare the posteriors of the models:

p(M_i | D) = \frac{p(y | x, M_i) \, p(M_i)}{p(D)}.

The log marginal likelihood

\log p(y | x, M_i) = -\tfrac{1}{2} y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log(2\pi)

is the combination of a data fit term and a complexity penalty: Occam’s Razor is automatic.
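In code, the same quantity can be evaluated as follows (a sketch with the same assumptions as the regression snippet above; here the covariance of the noisy targets, K + σ_n² I, plays the role of K):

```python
import numpy as np

def log_marginal_likelihood(X, y, k, sigma_noise=0.1):
    """log p(y|x) = -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi)."""
    n = len(y)
    Ky = k(X, X) + sigma_noise ** 2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                    # data fit term
    complexity = -np.sum(np.log(np.diag(L)))       # -1/2 log|K|, the penalty
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)
```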

SLIDE 14

Binary Gaussian Process Classification

[Figure: left, the latent function f(x) against input x; right, the class probability π(x) against input x.]

The class probability is related to the latent function through:

p(y = 1 | f(x)) = \pi(x) = \Phi(f(x)).

Observations are independent given f, so the likelihood is

p(y | f) = \prod_{i=1}^{n} p(y_i | f_i) = \prod_{i=1}^{n} \Phi(y_i f_i).

SLIDE 15

Likelihood functions

The logistic likelihood, (1 + \exp(-y_i f_i))^{-1}, and the probit likelihood, \Phi(y_i f_i), and their derivatives:

[Figure: the log likelihood, log p(y_i | f_i), and its first and second derivatives as functions of z_i = y_i f_i, for the logistic (left) and probit (right) likelihoods.]

SLIDE 16

Exact expressions

We use a Gaussian process prior for the latent function:

f | X, \theta \sim N(0, K).

The posterior becomes:

p(f | D, \theta) = \frac{p(f | X, \theta) \, p(y | f)}{p(D | \theta)} = \frac{N(f | 0, K)}{p(D | \theta)} \prod_{i=1}^{m} \Phi(y_i f_i),

which is non-Gaussian. The latent value at the test point, f(x_*), is

p(f_* | D, \theta, x_*) = \int p(f_* | f, X, \theta, x_*) \, p(f | D, \theta) \, df,

and the predictive class probability becomes

p(y_* | D, \theta, x_*) = \int p(y_* | f_*) \, p(f_* | D, \theta, x_*) \, df_*,

both of which are intractable to compute.

SLIDE 17

Gaussian Approximation to the Posterior

We approximate the non-Gaussian posterior by a Gaussian:

p(f | D, \theta) \simeq q(f | D, \theta) = N(m, A),

then q(f_* | D, \theta, x_*) = N(f_* | \mu_*, \sigma_*^2), where

\mu_* = k_*^\top K^{-1} m,

\sigma_*^2 = k(x_*, x_*) - k_*^\top (K^{-1} - K^{-1} A K^{-1}) k_*.

Using this approximation:

q(y_* = 1 | D, \theta, x_*) = \int \Phi(f_*) \, N(f_* | \mu_*, \sigma_*^2) \, df_* = \Phi\Big(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\Big).
SLIDE 18

What Gaussian?

Some suggestions:

  • local expansion: Laplace’s method
  • optimize a variational lower bound (using Jensen’s inequality):

    \log p(y | X) = \log \int p(y | f) \, p(f) \, df \;\ge\; \int \log\Big(\frac{p(y | f) \, p(f)}{q(f)}\Big) q(f) \, df

  • the Expectation Propagation (EP) algorithm

SLIDE 19

Expectation Propagation

Posterior:

p(f | X, y) = \frac{1}{Z} p(f | X) \prod_{i=1}^{n} p(y_i | f_i),

where the normalizing term is the marginal likelihood

Z = p(y | X) = \int p(f | X) \prod_{i=1}^{n} p(y_i | f_i) \, df.

The exact likelihood, p(y_i | f_i) = \Phi(f_i y_i), makes inference intractable. In EP we use a local likelihood approximation

p(y_i | f_i) \simeq t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i \, N(f_i | \tilde{\mu}_i, \tilde{\sigma}_i^2),

where the site parameters are \tilde{Z}_i, \tilde{\mu}_i and \tilde{\sigma}_i^2, such that:

\prod_{i=1}^{n} t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\tilde{\mu}, \tilde{\Sigma}) \prod_i \tilde{Z}_i.

SLIDE 20

Expectation Propagation II

We approximate the posterior by:

q(f | X, y) \triangleq \frac{1}{Z_{EP}} p(f | X) \prod_{i=1}^{n} t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = N(\mu, \Sigma),

with \mu = \Sigma \tilde{\Sigma}^{-1} \tilde{\mu} and \Sigma = (K^{-1} + \tilde{\Sigma}^{-1})^{-1}.

How do we choose the site parameters? Key idea: iteratively update each site in turn, based on the approximation so far. The approximate posterior for f_i contains three kinds of terms:

1. the prior p(f | X)
2. the approximate likelihoods t_j for all cases j ≠ i
3. the exact likelihood for case i, p(y_i | f_i).

SLIDE 21

The Cavity distribution

The cavity distribution

q_{-i}(f_i) \propto \int p(f | X) \prod_{j \ne i} t_j(f_j | \tilde{Z}_j, \tilde{\mu}_j, \tilde{\sigma}_j^2) \, df_j

can be found by “removing” one term from the approximate posterior marginal q(f_i | X, y) = N(f_i | \mu_i, \sigma_i^2), to get:

q_{-i}(f_i) \triangleq N(f_i | \mu_{-i}, \sigma_{-i}^2),

where \mu_{-i} = \sigma_{-i}^2 (\sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i), and \sigma_{-i}^2 = (\sigma_i^{-2} - \tilde{\sigma}_i^{-2})^{-1}.

Now, find \hat{q}(f_i) which matches the desired:

\hat{q}(f_i) \triangleq \hat{Z}_i \, N(\hat{\mu}_i, \hat{\sigma}_i^2) \simeq q_{-i}(f_i) \, p(y_i | f_i),

by matching moments.

SLIDE 22

Expectation Propagation III

The desired moments can be computed in closed form:

\hat{Z}_i = \Phi(z_i),

\hat{\mu}_i = \mu_{-i} + \frac{y_i \, \sigma_{-i}^2 \, N(z_i)}{\Phi(z_i) \sqrt{1 + \sigma_{-i}^2}},

\hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4 \, N(z_i)}{(1 + \sigma_{-i}^2) \, \Phi(z_i)} \Big(z_i + \frac{N(z_i)}{\Phi(z_i)}\Big),

where z_i = \frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}. These moments are achieved by setting the site parameters to:

\tilde{\mu}_i = \tilde{\sigma}_i^2 (\hat{\sigma}_i^{-2} \hat{\mu}_i - \sigma_{-i}^{-2} \mu_{-i}),

\tilde{\sigma}_i^2 = (\hat{\sigma}_i^{-2} - \sigma_{-i}^{-2})^{-1},

\tilde{Z}_i = \hat{Z}_i \sqrt{2\pi} \sqrt{\sigma_{-i}^2 + \tilde{\sigma}_i^2} \, \exp\Big(\tfrac{1}{2} (\mu_{-i} - \tilde{\mu}_i)^2 / (\sigma_{-i}^2 + \tilde{\sigma}_i^2)\Big).
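A sketch of a single site update built from these formulas (assuming numpy and scipy; mu_i, sigma2_i denote the current approximate posterior marginal moments of f_i and mu_tilde_i, sigma2_tilde_i the current site parameters; a full EP implementation would sweep over the sites repeatedly and recompute the joint Gaussian N(μ, Σ) after each update):

```python
import numpy as np
from scipy.stats import norm

def ep_site_update(y_i, mu_i, sigma2_i, mu_tilde_i, sigma2_tilde_i):
    """One EP update for a probit site: cavity, moment matching, new site params."""
    # Cavity distribution: remove site i from the approximate marginal.
    sigma2_cav = 1.0 / (1.0 / sigma2_i - 1.0 / sigma2_tilde_i)
    mu_cav = sigma2_cav * (mu_i / sigma2_i - mu_tilde_i / sigma2_tilde_i)

    # Moments of cavity times the exact probit likelihood.
    z = y_i * mu_cav / np.sqrt(1.0 + sigma2_cav)
    ratio = norm.pdf(z) / norm.cdf(z)                  # N(z) / Phi(z)
    mu_hat = mu_cav + y_i * sigma2_cav * ratio / np.sqrt(1.0 + sigma2_cav)
    sigma2_hat = sigma2_cav - sigma2_cav ** 2 * ratio * (z + ratio) / (1.0 + sigma2_cav)

    # New site parameters that reproduce the matched moments.
    sigma2_tilde_new = 1.0 / (1.0 / sigma2_hat - 1.0 / sigma2_cav)
    mu_tilde_new = sigma2_tilde_new * (mu_hat / sigma2_hat - mu_cav / sigma2_cav)
    return mu_tilde_new, sigma2_tilde_new
```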

SLIDE 23

The EP approximation

[Figure: the likelihood, cavity distribution, exact posterior, and Gaussian approximation for a single site, shown on two scales.]

SLIDE 24

Predictive distribution

The latent predictive mean:

E_q[f_* | X, y, x_*] = k_*^\top K^{-1} \mu = k_*^\top K^{-1} (K^{-1} + \tilde{\Sigma}^{-1})^{-1} \tilde{\Sigma}^{-1} \tilde{\mu} = k_*^\top (K + \tilde{\Sigma})^{-1} \tilde{\mu},

and variance:

V_q[f_* | X, y, x_*] = k(x_*, x_*) - k_*^\top (K + \tilde{\Sigma})^{-1} k_*,

which can be plugged into the class probability equation:

q(y_* = 1 | D, \theta, x_*) = \int \Phi(f_*) \, N(f_* | \mu_*, \sigma_*^2) \, df_* = \Phi\Big(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\Big).

SLIDE 25

Marginal Likelihood

The EP approximation to the marginal likelihood:

Z_{EP} = q(y | X) = \int p(f) \prod_{i=1}^{n} t_i(f_i | \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \, df,

which evaluates to:

\log(Z_{EP} | \theta) = -\tfrac{1}{2} \log |K + \tilde{\Sigma}| - \tfrac{1}{2} \tilde{\mu}^\top (K + \tilde{\Sigma})^{-1} \tilde{\mu} + \sum_{i=1}^{n} \log \Phi\Big(\frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}\Big) + \tfrac{1}{2} \sum_{i=1}^{n} \log(\sigma_{-i}^2 + \tilde{\sigma}_i^2) + \sum_{i=1}^{n} \frac{(\mu_{-i} - \tilde{\mu}_i)^2}{2 (\sigma_{-i}^2 + \tilde{\sigma}_i^2)},

which has a nice interpretation. It is possible to analytically evaluate the derivatives of the estimated log marginal likelihood w.r.t. the hyperparameters.
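A direct transcription of this expression into code (a sketch, assuming numpy and scipy; mu_cav, sigma2_cav hold the cavity parameters μ₋ᵢ, σ₋ᵢ² and mu_tilde, sigma2_tilde the site parameters, all as 1-d arrays):

```python
import numpy as np
from scipy.stats import norm

def log_z_ep(K, y, mu_tilde, sigma2_tilde, mu_cav, sigma2_cav):
    """EP approximation to the log marginal likelihood, log Z_EP."""
    B = K + np.diag(sigma2_tilde)                       # K + Sigma_tilde
    L = np.linalg.cholesky(B)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, mu_tilde))
    term1 = -np.sum(np.log(np.diag(L)))                 # -1/2 log|K + Sigma_tilde|
    term2 = -0.5 * mu_tilde @ alpha
    z = y * mu_cav / np.sqrt(1.0 + sigma2_cav)
    term3 = np.sum(norm.logcdf(z))                      # sum_i log Phi(z_i)
    term4 = 0.5 * np.sum(np.log(sigma2_cav + sigma2_tilde))
    term5 = np.sum((mu_cav - mu_tilde) ** 2 / (2.0 * (sigma2_cav + sigma2_tilde)))
    return term1 + term2 + term3 + term4 + term5
```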

SLIDE 26

Example

[Figure: left, the predictive class probability p(y = 1 | x) under the Laplace approximation, EP, and the true p(y | X), with Class 1 and Class −1 data; right, the corresponding latent function f(x) under the Laplace and EP approximations.]

SLIDE 27

USPS Digits, 3s vs 5s

[Figure: contour plots over log lengthscale, log(ℓ), and log magnitude, log(σf), with contour levels ranging from −200 to −92 and from 0.25 to 0.89.]

SLIDE 28

USPS Digits, 3s vs 5s

[Figure: MCMC samples of a latent value f compared with the Laplace and EP approximations to p(f|D), for two cases.]

SLIDE 29

The Structure of the posterior

[Figure: a one-dimensional marginal, p(xi) against xi.]

SLIDE 30

Conclusions

Covariance functions for Gaussian processes

  • encode useful information about the functions
  • can be learnt from the data

Whereas inference for regression with Gaussian noise can be done in closed form,

  • non-Gaussian likelihoods (as e.g. in classification) cannot
  • (many) good approximations exist

For the details: Rasmussen and Williams, ‘Gaussian Processes for Machine Learning’, the MIT Press, 2006. For the (Matlab) code: www.GaussianProcess.org/gpml.
