Kernel Design: Nicolas Durrande, PROWLER.io (nicolas@prowler.io)

Gaussian Process Summer School
Kernel Design
Nicolas Durrande, PROWLER.io (nicolas@prowler.io)
Sheffield, September 2017

- Introduction
- What is a kernel?
- Choosing the appropriate kernel
- Making new from old
- Effect of linear operators
- Application: periodicity detection
- Conclusion

We have seen during the introduction lectures that the distribution of a GP Z depends on two functions:
- the mean $m(x) = \mathrm{E}(Z(x))$
- the covariance $k(x, x') = \mathrm{cov}(Z(x), Z(x'))$
In this talk, we will focus on the covariance function, which is often called the kernel.

We assume we have observed a function f at a limited number of time points $x_1, \dots, x_n$:
[Figure: observed values of f(x) for x in [0, 1]]
The observations are denoted by $f_i = f(x_i)$ (or $F = f(X)$).

Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Z:
[Figure: sample paths of Z(x) for x in [0, 1]]

Combining these two pieces of information means keeping only the sample paths that interpolate the data points:
[Figure: sample paths of Z(x) | Z(X) = F for x in [0, 1]]

The conditional distribution is still Gaussian. For a centred prior (zero prior mean), its moments are:
$$m(x) = \mathrm{E}(Z(x) \mid Z(X) = F) = k(x, X) \, k(X, X)^{-1} F$$
$$c(x, x') = \mathrm{cov}(Z(x), Z(x') \mid Z(X) = F) = k(x, x') - k(x, X) \, k(X, X)^{-1} k(X, x')$$
It can be represented as a mean function with confidence intervals.
[Figure: conditional mean and confidence intervals of Z(x) | Z(X) = F for x in [0, 1]]
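The two conditional moments translate almost line for line into NumPy. The following is a minimal sketch rather than the lecture's own code: the names `sq_exp` and `gp_posterior`, the toy data, the kernel parameters and the small `jitter` added to the diagonal for numerical stability are all assumptions made for illustration, and the prior mean is taken to be zero.

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    """Squared-exponential kernel matrix between two 1-D input arrays."""
    d = x[:, None] - y[None, :]
    return sigma2 * np.exp(-d ** 2 / (2.0 * theta ** 2))

def gp_posterior(x_new, X, F, kernel, jitter=1e-10):
    """Mean and covariance of Z(x_new) given Z(X) = F, for a zero-mean prior."""
    K = kernel(X, X) + jitter * np.eye(len(X))    # k(X, X), jitter is only for stability
    k_star = kernel(x_new, X)                     # k(x, X)
    mean = k_star @ np.linalg.solve(K, F)         # k(x, X) k(X, X)^{-1} F
    cov = kernel(x_new, x_new) - k_star @ np.linalg.solve(K, k_star.T)
    return mean, cov

# Toy observations (hypothetical data, not the data of the slides)
X = np.array([0.1, 0.4, 0.6, 0.9])
F = np.sin(2.0 * np.pi * X)

x_new = np.linspace(0.0, 1.0, 101)
mean, cov = gp_posterior(x_new, X, F, sq_exp)

# Pointwise 95% confidence intervals from the diagonal of the covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```

Evaluating `gp_posterior` at the observation points themselves returns `F` with (numerically) zero variance, which is the interpolation property described above.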

Let Z be a random process with kernel k. Some properties of kernels can be obtained directly from their definition.
Example
- $k(x, x) = \mathrm{cov}(Z(x), Z(x)) = \mathrm{var}(Z(x)) \geq 0$ ⇒ $k(x, x)$ is non-negative.
- $k(x, y) = \mathrm{cov}(Z(x), Z(y)) = \mathrm{cov}(Z(y), Z(x)) = k(y, x)$ ⇒ $k(x, y)$ is symmetric.
We can obtain a finer result...

We introduce the random variable $T = \sum_{i=1}^{n} a_i Z(x_i)$ where $n$, $a_i$ and $x_i$ are arbitrary. Computing the variance of T gives:
$$\mathrm{var}(T) = \mathrm{cov}\Big(\sum_i a_i Z(x_i), \sum_j a_j Z(x_j)\Big) = \sum_i \sum_j a_i a_j \, \mathrm{cov}(Z(x_i), Z(x_j)) = \sum_i \sum_j a_i a_j \, k(x_i, x_j)$$
Since a variance is non-negative, we have
$$\sum_i \sum_j a_i a_j \, k(x_i, x_j) \geq 0$$
for any arbitrary $n$, $a_i$ and $x_i$.
Definition
The functions satisfying the above inequality for all $n \in \mathbb{N}$, for all $x_i \in D$, for all $a_i \in \mathbb{R}$ are called positive semi-definite functions.
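The inequality says that every Gram matrix built from k has a non-negative quadratic form, so a single set of points with a negative eigenvalue is enough to show that a function is not a kernel. The sketch below illustrates this numerically. The helper names and the two candidate functions are assumptions chosen for the example, and a finite check of this kind can disprove positive semi-definiteness but never prove it.

```python
import numpy as np

def gram_matrix(kernel, xs):
    """Matrix with entries k(x_i, x_j) for a vectorised kernel function."""
    xs = np.asarray(xs, dtype=float)
    return kernel(xs[:, None], xs[None, :])

def quadratic_form(kernel, xs, a):
    """Evaluate sum_i sum_j a_i a_j k(x_i, x_j)."""
    a = np.asarray(a, dtype=float)
    return float(a @ gram_matrix(kernel, xs) @ a)

def min_eigenvalue(kernel, xs):
    """Smallest eigenvalue of the symmetric Gram matrix."""
    return float(np.linalg.eigvalsh(gram_matrix(kernel, xs)).min())

gaussian = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # a known psd function
abs_diff = lambda x, y: np.abs(x - y)                 # symmetric, but not psd

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 3.0, size=8)

print(min_eigenvalue(gaussian, xs))   # non-negative up to rounding error
print(min_eigenvalue(abs_diff, xs))   # strictly negative

# An explicit counterexample for abs_diff: x = (0, 1), a = (1, -1)
print(quadratic_form(abs_diff, [0.0, 1.0], [1.0, -1.0]))   # -2.0
```

The Gram matrix of `abs_diff` has a zero diagonal, so its eigenvalues sum to zero and at least one of them must be negative.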

We have just seen:
k is a covariance ⇒ k is a positive semi-definite function
The reverse is also true:
Theorem (Loève)
k corresponds to the covariance of a GP ⇔ k is a symmetric positive semi-definite function

Proving that a function is psd is often difficult. However, many functions have already been proven to be psd:
- squared exp.: $k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{2\theta^2}\right)$
- Matérn 5/2: $k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{5}|x - y|}{\theta} + \frac{5|x - y|^2}{3\theta^2}\right) \exp\left(-\frac{\sqrt{5}|x - y|}{\theta}\right)$
- Matérn 3/2: $k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{3}|x - y|}{\theta}\right) \exp\left(-\frac{\sqrt{3}|x - y|}{\theta}\right)$
- exponential: $k(x, y) = \sigma^2 \exp\left(-\frac{|x - y|}{\theta}\right)$
- Brownian: $k(x, y) = \sigma^2 \min(x, y)$
- white noise: $k(x, y) = \sigma^2 \delta_{x,y}$
- constant: $k(x, y) = \sigma^2$
- linear: $k(x, y) = \sigma^2 xy$
When k is a function of $x - y$, the kernel is called stationary. $\sigma^2$ is called the variance and $\theta$ the lengthscale.
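The kernels in this list are short enough to write out directly. The following sketch gives one possible NumPy transcription for scalar or array inputs. The function names and default parameter values are assumptions made for illustration, and the Brownian kernel is only meaningful for non-negative inputs.

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=1.0):
    return sigma2 * np.exp(-(x - y) ** 2 / (2.0 * theta ** 2))

def matern52(x, y, sigma2=1.0, theta=1.0):
    r = np.sqrt(5.0) * np.abs(x - y) / theta
    return sigma2 * (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def matern32(x, y, sigma2=1.0, theta=1.0):
    r = np.sqrt(3.0) * np.abs(x - y) / theta
    return sigma2 * (1.0 + r) * np.exp(-r)

def exponential(x, y, sigma2=1.0, theta=1.0):
    return sigma2 * np.exp(-np.abs(x - y) / theta)

def brownian(x, y, sigma2=1.0):
    return sigma2 * np.minimum(x, y)           # inputs assumed non-negative

def white_noise(x, y, sigma2=1.0):
    return np.where(x == y, sigma2, 0.0)       # sigma2 * delta_{x,y}

def constant(x, y, sigma2=1.0):
    return np.full(np.broadcast(x, y).shape, sigma2)

def linear(x, y, sigma2=1.0):
    return sigma2 * x * y

# Gram matrix on a small grid, using broadcasting to form all pairs (x_i, x_j)
x = np.linspace(0.1, 2.0, 5)
K = matern52(x[:, None], x[None, :], sigma2=2.0, theta=0.5)
```

For the stationary kernels, the diagonal of the Gram matrix equals the variance parameter, here `sigma2 = 2.0`.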

For a few kernels, it is possible to prove they are psd directly from the definition:
- $k(x, y) = \delta_{x,y}$
- $k(x, y) = 1$
For most of them a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:
Theorem (Bochner)
A continuous stationary function $k(x, y) = \tilde{k}(|x - y|)$ is positive definite if and only if $\tilde{k}$ is the Fourier transform of a finite positive measure:
$$\tilde{k}(t) = \int_{\mathbb{R}} e^{-i\omega t} \, d\mu(\omega)$$

Example
We consider the following measure:
[Figure: the measure $\mu(\omega)$]
Its Fourier transform gives $\tilde{k}(t) = \frac{\sin(t)}{t}$:
[Figure: the function $\tilde{k}(t)$]
As a consequence, $k(x, y) = \frac{\sin(x - y)}{x - y}$ is a valid covariance function.
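This example can be checked numerically. The plot of the measure does not survive in this transcript, so the sketch below assumes one measure whose Fourier transform is $\sin(t)/t$: the uniform density $1/2$ on $[-1, 1]$. The function names and the midpoint-rule discretisation are also choices made for illustration.

```python
import numpy as np

def fourier_transform(t, density, omega):
    """Midpoint-rule approximation of k(t) = int exp(-i w t) density(w) dw.

    omega must be an evenly spaced grid of cell midpoints.
    """
    d_omega = omega[1] - omega[0]
    integrand = np.exp(-1j * np.outer(t, omega)) * density(omega)
    return integrand.sum(axis=1) * d_omega

def uniform_density(w):
    """Assumed measure: uniform on [-1, 1] with total mass 1."""
    return np.full_like(w, 0.5)

# Midpoints of 2000 equal cells covering [-1, 1]
edges = np.linspace(-1.0, 1.0, 2001)
omega = 0.5 * (edges[:-1] + edges[1:])

t = np.linspace(-10.0, 10.0, 41)
k_numeric = fourier_transform(t, uniform_density, omega)

# np.sinc(x) is sin(pi x) / (pi x), so this is sin(t) / t
k_exact = np.sinc(t / np.pi)

print(np.max(np.abs(k_numeric - k_exact)))   # small discretisation error
```

Because the density is symmetric in $\omega$, the imaginary part of the transform vanishes and the resulting kernel is real-valued.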

Usual kernels
Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:
- The Gaussian is the Fourier transform of itself ⇒ it is psd.
- Matérn kernels are the Fourier transforms of $\frac{1}{(1 + \omega^2)^p}$ ⇒ they are psd.

Unusual kernels
The inverse Fourier transform of a (symmetrised) sum of Gaussians gives (A. Wilson, ICML 2013):
[Figure: the spectral measure $\mu(\omega)$ and the kernel $\tilde{k}(t)$ obtained from it by Fourier transform]
The resulting kernel is parametrised by its spectrum.
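With the convention $\tilde{k}(t) = \int e^{-i\omega t} d\mu(\omega)$ used in Bochner's theorem, a Gaussian density with mean $\mu_q$ and variance $v_q$, symmetrised around zero, transforms into $\exp(-v_q t^2 / 2) \cos(\mu_q t)$. A weighted sum of such terms gives a kernel of the kind described here. The sketch below is a one-dimensional illustration under that convention. The function name and the parameter values are hypothetical, and the original paper uses a different frequency convention, so its constants differ.

```python
import numpy as np

def spectral_mixture(t, weights, means, variances):
    """k(t) = sum_q w_q * exp(-v_q t^2 / 2) * cos(mu_q t), with t = x - y."""
    t = np.asarray(t, dtype=float)[..., None]   # extra axis for the components
    w, mu, v = (np.asarray(p, dtype=float) for p in (weights, means, variances))
    return np.sum(w * np.exp(-0.5 * v * t ** 2) * np.cos(mu * t), axis=-1)

# Two spectral components (hypothetical values)
x = np.linspace(0.0, 5.0, 50)
K = spectral_mixture(x[:, None] - x[None, :],
                     weights=[1.0, 0.5],
                     means=[2.0, 9.0],
                     variances=[0.1, 0.5])
```

The weights add up to the prior variance $k(0)$, the means set the dominant frequencies of the sample paths, and the variances control how quickly those oscillations decorrelate.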

Unusual kernels
The sample paths have the following shape:
[Figure: sample paths for x in [0, 5]]

Changing the kernel has a huge impact on the model:
[Figure: two panels, "Gaussian kernel" and "Exponential kernel"]

This is because changing the kernel implies changing the prior:
[Figure: two panels, "Gaussian kernel" and "Exponential kernel"]

In order to choose a kernel, one should gather all the available information about the function to approximate:
- Is it stationary?
- Is it differentiable, what's its regularity?
- Do we expect particular trends?
- Do we expect particular patterns (periodicity, cycles, additivity)?
Kernels often include rescaling parameters: $\theta$ for the x axis (length-scale) and $\sigma$ for the y axis ($\sigma^2$ often corresponds to the GP variance). They can be tuned by
- maximizing the likelihood
- minimizing the prediction error
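As an illustration of tuning by maximum likelihood, the sketch below evaluates the Gaussian log-likelihood of the observations under a zero-mean GP prior and picks the best lengthscale on a grid. It assumes a squared-exponential kernel, toy data, a fixed variance and a small `nugget` on the diagonal for numerical stability. A grid search stands in for the gradient-based optimisers that GP libraries typically use.

```python
import numpy as np

def sq_exp(x, y, sigma2, theta):
    d = x[:, None] - y[None, :]
    return sigma2 * np.exp(-d ** 2 / (2.0 * theta ** 2))

def log_likelihood(X, F, sigma2, theta, nugget=1e-6):
    """Log-density of F under a zero-mean GP prior evaluated at X."""
    n = len(X)
    K = sq_exp(X, X, sigma2, theta) + nugget * np.eye(n)
    L = np.linalg.cholesky(K)                    # K = L L^T
    alpha = np.linalg.solve(L, F)                # F^T K^{-1} F = alpha^T alpha
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log det K
    return -0.5 * alpha @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

# Toy observations (hypothetical data)
X = np.linspace(0.0, 1.0, 10)
F = np.sin(2.0 * np.pi * X)

# Grid search over the lengthscale with the variance fixed to 1
thetas = np.linspace(0.05, 1.0, 20)
scores = np.array([log_likelihood(X, F, 1.0, th) for th in thetas])
best_theta = thetas[np.argmax(scores)]
print(best_theta)
```

The log-determinant term penalises overly flexible priors, so the likelihood balances data fit against model complexity without a separate validation set.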

It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:
- On a test set
- Using leave-one-out
Two (ideally three) things should be checked:
- Is the mean accurate (MSE, $Q^2$)?
- Do the confidence intervals make sense?
- Are the predicted covariances right?
Furthermore, it is often interesting to try some input remapping such as $x \to \log(x)$, $x \to \exp(x)$, ...
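A leave-one-out check of the first two points can be sketched as follows. The kernel, its parameters, the toy data and the `nugget` term are assumptions made for the example, and the kernel parameters are kept fixed across the refits. $Q^2$ is computed here as one minus the ratio of the squared prediction errors to the total sum of squares.

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    d = x[:, None] - y[None, :]
    return sigma2 * np.exp(-d ** 2 / (2.0 * theta ** 2))

def predict(x_new, X, F, nugget=1e-6):
    """Predictive mean and variance of a new observation (zero-mean GP)."""
    K = sq_exp(X, X) + nugget * np.eye(len(X))
    k_star = sq_exp(x_new, X)
    mean = k_star @ np.linalg.solve(K, F)
    reduction = np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    var = np.diag(sq_exp(x_new, x_new)) + nugget - reduction
    return mean, var

def leave_one_out(X, F):
    """Predict each observation from all the others."""
    means, variances = np.empty(len(X)), np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        m, v = predict(X[i:i + 1], X[keep], F[keep])
        means[i], variances[i] = m[0], v[0]
    return means, variances

# Toy observations (hypothetical data)
X = np.linspace(0.0, 1.0, 12)
F = np.sin(2.0 * np.pi * X) + 0.5 * X

loo_mean, loo_var = leave_one_out(X, F)

# Accuracy of the mean
mse = np.mean((F - loo_mean) ** 2)
q2 = 1.0 - np.sum((F - loo_mean) ** 2) / np.sum((F - F.mean()) ** 2)

# Calibration of the intervals: these should look like standard normal draws
standardised = (F - loo_mean) / np.sqrt(loo_var)
```

If most standardised residuals fall far outside $[-2, 2]$, the confidence intervals are too narrow. If they all sit very close to zero, the intervals are too wide.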

Making new from old:
Kernels can be:
- Summed together
  - On the same space: $k(x, y) = k_1(x, y) + k_2(x, y)$
  - On the tensor space: $k(x, y) = k_1(x_1, y_1) + k_2(x_2, y_2)$
- Multiplied together
  - On the same space: $k(x, y) = k_1(x, y) \times k_2(x, y)$
  - On the tensor space: $k(x, y) = k_1(x_1, y_1) \times k_2(x_2, y_2)$
- Composed with a function
  - $k(x, y) = k_1(f(x), f(y))$
All these operations preserve the positive definiteness. How can this be useful?
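These operations map naturally onto higher-order functions that take kernels and return a new kernel. The sketch below shows one way to write them. The helper names, the two base kernels and the convention that tensor-space inputs store their two coordinates along the last axis are all assumptions made for illustration.

```python
import numpy as np

def k_sum(k1, k2):
    """Sum on the same space."""
    return lambda x, y: k1(x, y) + k2(x, y)

def k_product(k1, k2):
    """Product on the same space."""
    return lambda x, y: k1(x, y) * k2(x, y)

def k_compose(k1, f):
    """Composition with a function f applied to both inputs."""
    return lambda x, y: k1(f(x), f(y))

def k_tensor_sum(k1, k2):
    """Sum on the tensor space: the last axis holds (x_1, x_2)."""
    return lambda x, y: k1(x[..., 0], y[..., 0]) + k2(x[..., 1], y[..., 1])

def k_tensor_product(k1, k2):
    """Product on the tensor space: the last axis holds (x_1, x_2)."""
    return lambda x, y: k1(x[..., 0], y[..., 0]) * k2(x[..., 1], y[..., 1])

# Two base kernels with unit parameters
sq_exp = lambda x, y: np.exp(-0.5 * (x - y) ** 2)
linear = lambda x, y: x * y

k_a = k_sum(sq_exp, linear)
k_b = k_product(sq_exp, linear)
k_c = k_compose(sq_exp, np.log)          # input remapping x -> log(x)
k_d = k_tensor_product(sq_exp, linear)
k_e = k_tensor_sum(sq_exp, linear)

# Gram matrices: 1-D inputs for k_a, k_b, k_c and 2-D inputs for k_d, k_e
x = np.linspace(0.5, 3.0, 6)
K_a = k_a(x[:, None], x[None, :])
K_b = k_b(x[:, None], x[None, :])
K_c = k_c(x[:, None], x[None, :])

pts = np.random.default_rng(0).uniform(0.5, 3.0, size=(6, 2))
K_d = k_d(pts[:, None, :], pts[None, :, :])
K_e = k_e(pts[:, None, :], pts[None, :, :])
```

Because each helper returns a function with the same signature as its inputs, the combinations can be nested freely, for instance `k_sum(k_b, k_c)`.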

Sum of kernels over the same space
Example (The Mauna Loa observatory dataset)
This famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.
[Figure: CO2 concentration against year, 1960 to 2030]
Let's try to predict the concentration for the next 20 years.

Sum of kernels over the same space
We first consider a squared-exponential kernel:
$$k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{\theta^2}\right)$$
[Figure: two panels showing the resulting model predictions over 1950 to 2040]
The results are terrible!

Sum of kernels over the same space
What happens if we sum both kernels?
$$k(x, y) = k_{rbf1}(x, y) + k_{rbf2}(x, y)$$
[Figure: model predictions over 1950 to 2040]
The model is drastically improved!

Sum of kernels over the same space
We can try the following kernel:
$$k(x, y) = \sigma_0^2 x^2 y^2 + k_{rbf1}(x, y) + k_{rbf2}(x, y) + k_{per}(x, y)$$
[Figure: model predictions over 1950 to 2040]
Once again, the model is significantly improved.
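A kernel with this structure can be assembled as a plain sum of four terms. In the sketch below every numerical value is a hypothetical placeholder rather than a fitted parameter from the slides, and the form of `periodic` is an assumption: the transcript does not define $k_{per}$, so a commonly used periodic kernel with a one-year period is swapped in.

```python
import numpy as np

def rbf(x, y, sigma2, theta):
    return sigma2 * np.exp(-(x - y) ** 2 / (2.0 * theta ** 2))

def periodic(x, y, sigma2, theta, period):
    # Assumed form of k_per; the slides do not spell it out in this section.
    s = np.sin(np.pi * (x - y) / period)
    return sigma2 * np.exp(-2.0 * s ** 2 / theta ** 2)

def co2_kernel(x, y):
    """sigma_0^2 x^2 y^2 + k_rbf1 + k_rbf2 + k_per, with placeholder parameters."""
    return (1e-12 * x ** 2 * y ** 2                              # slow polynomial trend
            + rbf(x, y, sigma2=50.0, theta=50.0)                 # long lengthscale
            + rbf(x, y, sigma2=1.0, theta=2.0)                   # short lengthscale
            + periodic(x, y, sigma2=5.0, theta=1.0, period=1.0)) # yearly cycle

# Inputs are dates expressed in decimal years
years = np.arange(1958.0, 2018.0, 0.5)
K = co2_kernel(years[:, None], years[None, :])
```

Each term targets one feature of the data: the polynomial and long-lengthscale terms capture the overall increase, the periodic term the seasonal cycle, and the short-lengthscale term the remaining local variations.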
