SLIDE 1

Gaussian Process Summer School

Kernel Design

Nicolas Durrande – PROWLER.io (nicolas@prowler.io)

Sheffield, September 2018

SLIDE 2

Introduction

SLIDE 3

We have seen during the introduction lectures that the distribution of a GP Z depends on two functions:

  • the mean m(x) = E(Z(x))
  • the covariance k(x, x′) = cov(Z(x), Z(x′))

In this talk, we will focus on the covariance function, which is often called the kernel.

SLIDE 4

Given some data, the conditional distribution is still Gaussian:

m(x) = E(Z(x) | Z(X) + ε = F) = k(x, X) (k(X, X) + τ²I)⁻¹ F
c(x, x′) = cov(Z(x), Z(x′) | Z(X) + ε = F) = k(x, x′) − k(x, X) (k(X, X) + τ²I)⁻¹ k(X, x′)

It can be represented as a mean function with confidence intervals.

[Figure: posterior mean of Z(x) | Z(X) = F over x ∈ [0, 1], with confidence intervals.]
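A minimal numpy sketch of these two conditioning formulas; the kernel, the data (X, F) and the noise variance τ² are illustrative choices, not taken from the slides:

```python
import numpy as np

def kernel(x, y, variance=1.0, lengthscale=0.2):
    """Squared-exponential kernel, broadcasting over 1D arrays."""
    return variance * np.exp(-0.5 * (x[:, None] - y[None, :])**2 / lengthscale**2)

X = np.array([0.1, 0.3, 0.5, 0.9])   # observation locations (illustrative)
F = np.array([0.2, 0.8, 0.3, -0.4])  # observed values (illustrative)
tau2 = 1e-2                          # observation noise variance
x = np.linspace(0, 1, 200)           # prediction grid

KXX = kernel(X, X) + tau2 * np.eye(len(X))
KxX = kernel(x, X)

m = KxX @ np.linalg.solve(KXX, F)                     # posterior mean m(x)
c = kernel(x, x) - KxX @ np.linalg.solve(KXX, KxX.T)  # posterior covariance c(x, x')
ci = 1.96 * np.sqrt(np.diag(c))                       # pointwise confidence half-width
```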

SLIDE 5

What is a kernel?

SLIDE 6

Let Z be a random process with kernel k. Some properties of kernels can be obtained directly from their definition.

Example

k(x, x) = cov(Z(x), Z(x)) = var(Z(x)) ≥ 0 ⇒ k(x, x) is positive.
k(x, y) = cov(Z(x), Z(y)) = cov(Z(y), Z(x)) = k(y, x) ⇒ k is symmetric.

We can obtain a sharper result...

SLIDE 7

We introduce the random variable T = Σᵢ₌₁ⁿ aᵢ Z(xᵢ), where n, the aᵢ and the xᵢ are arbitrary. Computing the variance of T gives:

var(T) = cov(Σᵢ aᵢ Z(xᵢ), Σⱼ aⱼ Z(xⱼ)) = Σᵢ Σⱼ aᵢ aⱼ cov(Z(xᵢ), Z(xⱼ)) = Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0

Since a variance is non-negative, this inequality holds for any arbitrary n, aᵢ and xᵢ.

Definition

The functions k satisfying

Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0   for all n ∈ ℕ, for all xᵢ ∈ D, for all aᵢ ∈ ℝ

are called positive semi-definite functions.
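This definition is easy to check numerically. A small sketch, assuming the squared-exponential kernel from the earlier example, drawing random n, aᵢ and xᵢ and verifying that the quadratic form is never negative:

```python
import numpy as np

def kernel(x, y, variance=1.0, lengthscale=0.2):
    return variance * np.exp(-0.5 * (x[:, None] - y[None, :])**2 / lengthscale**2)

rng = np.random.default_rng(0)
for _ in range(1000):
    n = rng.integers(1, 20)
    x = rng.uniform(0, 1, n)
    a = rng.normal(size=n)
    quad = a @ kernel(x, x) @ a   # sum_i sum_j a_i a_j k(x_i, x_j)
    assert quad >= -1e-10         # non-negative, up to round-off
```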

SLIDE 8

We have just seen:

k is a covariance ⇒ k is a symmetric positive semi-definite function.

The converse is also true:

Theorem (Loève)

k corresponds to the covariance of a GP ⇔ k is a symmetric positive semi-definite function.

SLIDE 9

Proving that a function is psd is often difficult. However, there are many functions that have already been proven to be psd:

  • squared exp.: k(x, y) = σ² exp(−(x − y)² / (2θ²))
  • Matérn 5/2: k(x, y) = σ² (1 + √5 |x − y|/θ + 5|x − y|²/(3θ²)) exp(−√5 |x − y|/θ)
  • Matérn 3/2: k(x, y) = σ² (1 + √3 |x − y|/θ) exp(−√3 |x − y|/θ)
  • exponential: k(x, y) = σ² exp(−|x − y|/θ)
  • Brownian: k(x, y) = σ² min(x, y)
  • white noise: k(x, y) = σ² δ_{x,y}
  • constant: k(x, y) = σ²
  • linear: k(x, y) = σ² xy

When k is a function of x − y, the kernel is called stationary. σ² is called the variance and θ the lengthscale. (A few direct numpy implementations are sketched below.)
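A sketch of some of these kernels written directly as numpy functions (parameter defaults are arbitrary):

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(x - y)**2 / (2 * theta**2))

def exponential(x, y, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-np.abs(x - y) / theta)

def matern32(x, y, sigma2=1.0, theta=0.2):
    r = np.abs(x - y) / theta
    return sigma2 * (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)

def matern52(x, y, sigma2=1.0, theta=0.2):
    r = np.abs(x - y) / theta
    return sigma2 * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def brownian(x, y, sigma2=1.0):
    return sigma2 * np.minimum(x, y)

# All of them broadcast, so a Gram matrix is simply:
x = np.linspace(0, 1, 5)
K = matern52(x[:, None], x[None, :])
```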

SLIDE 10

Examples of kernels in gpflow:

[Figure: k(x, 0) for the GPflow kernels Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic and ArcCosine, and k(x, 1) for the Linear and Polynomial kernels.]
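A sketch of how such k(x, 0) curves can be computed, assuming GPflow 2's callable-kernel API (kernel names as in gpflow.kernels; RBF is an alias for SquaredExponential):

```python
import numpy as np
import gpflow

x = np.linspace(-1, 1, 200).reshape(-1, 1)
zero = np.zeros((1, 1))

kernels = [gpflow.kernels.Matern12(), gpflow.kernels.Matern32(),
           gpflow.kernels.Matern52(), gpflow.kernels.SquaredExponential()]

for k in kernels:
    kx0 = k(x, zero).numpy()   # column of k(x, 0) values
    print(type(k).__name__, kx0[:3, 0])
```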

SLIDE 11

Associated samples

[Figure: sample paths associated with each kernel above: Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial, ArcCosine.]
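A minimal numpy sketch of how such prior samples can be drawn (kernel and grid are illustrative): build the Gram matrix, add a small jitter, and sample a multivariate normal.

```python
import numpy as np

def matern32(x, y, sigma2=1.0, theta=0.2):
    r = np.abs(x - y) / theta
    return sigma2 * (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)

x = np.linspace(0, 1, 300)
K = matern32(x[:, None], x[None, :]) + 1e-10 * np.eye(len(x))  # jitter for stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=5)  # 5 sample paths
```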

SLIDE 12

For a few kernels, it is possible to prove that they are psd directly from the definition:

  • k(x, y) = δ_{x,y}
  • k(x, y) = 1

For most kernels, a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:

Theorem (Bochner)

A continuous stationary function k(x, y) = k̃(|x − y|) is positive definite if and only if k̃ is the Fourier transform of a finite positive measure:

k̃(t) = ∫_ℝ exp(−iωt) dµ(ω)

SLIDE 13

Example

We consider a measure with a box-shaped (uniform) density. Its Fourier transform gives k̃(t) = sin(t)/t:

[Figure: the measure µ(ω) and its Fourier transform k̃(t) = sin(t)/t.]

As a consequence, k(x, y) = sin(x − y)/(x − y) is a valid covariance function.
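A numerical sanity check (not a proof): the Gram matrix of this sinc kernel on random points should have no significantly negative eigenvalue.

```python
import numpy as np

def sinc_kernel(x, y):
    # np.sinc(t) = sin(pi t) / (pi t), so this evaluates sin(x - y) / (x - y)
    return np.sinc((x - y) / np.pi)

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 200)
K = sinc_kernel(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```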

SLIDE 14

Usual kernels

Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:

  • The Gaussian is the Fourier transform of itself ⇒ it is psd.
  • Matérn kernels are the Fourier transforms of 1/(1 + ω²)ᵖ ⇒ they are psd.

SLIDE 15

Unusual kernels

Taking the inverse Fourier transform of a (symmetrised) sum of Gaussians gives the spectral mixture kernel (A. Wilson, ICML 2013):

[Figure: a spectral measure µ(ω) made of Gaussian bumps and, after the Fourier transform F, the resulting kernel k̃(t).]

The obtained kernel is parametrised by its spectrum.

SLIDE 16

Unusual kernels

The sample paths have the following shape:

[Figure: sample paths from a GP with a spectral mixture kernel.]

SLIDE 17

Choosing the appropriate kernel

SLIDE 18

Changing the kernel has a huge impact on the model:

[Figures: GP models of the same data with a Gaussian kernel (left) and an exponential kernel (right).]

SLIDE 19

This is because changing the kernel implies changing the prior

[Figures: prior samples with a Gaussian kernel (left) and an exponential kernel (right).]

SLIDE 20

In order to choose a kernel, one should gather all possible information about the function to approximate:

  • Is it stationary?
  • Is it differentiable? What is its regularity?
  • Do we expect particular trends?
  • Do we expect particular patterns (periodicity, cycles, additivity)?

Kernels often include rescaling parameters: θ for the x axis (the lengthscale) and σ for the y axis (σ² often corresponds to the GP variance). They can be tuned by maximizing the likelihood or by minimizing the prediction error (a likelihood-based sketch follows below).
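A sketch of likelihood-based tuning, assuming GPflow 2's GPR model and Scipy optimizer wrapper (data and kernel choice are illustrative):

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
Y = np.sin(6 * X) + 0.1 * rng.normal(size=(20, 1))

kernel = gpflow.kernels.Matern52(variance=1.0, lengthscales=0.5)
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss, model.trainable_variables)  # maximum likelihood
gpflow.utilities.print_summary(model)  # fitted variance, lengthscale and noise
```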

SLIDE 21

It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:

  • on a test set, or
  • using leave-one-out.

Two (ideally three) things should be checked:

  • Is the mean accurate (MSE, Q²)?
  • Do the confidence intervals make sense?
  • Are the predicted covariances right?

Furthermore, it is often interesting to try some input remapping such as x → log(x), x → exp(x), ...
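A rough numpy sketch of leave-one-out validation of the GP mean predictor (kernel, noise level and data are illustrative):

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(x[:, None] - y[None, :])**2 / (2 * theta**2))

def loo_scores(kern, X, F, tau2=1e-2):
    n, preds = len(X), np.empty(len(X))
    for i in range(n):
        m = np.arange(n) != i                       # leave observation i out
        KXX = kern(X[m], X[m]) + tau2 * np.eye(n - 1)
        preds[i] = (kern(X[i:i+1], X[m]) @ np.linalg.solve(KXX, F[m])).item()
    mse = np.mean((preds - F)**2)
    return mse, 1 - mse / np.var(F)                 # MSE and Q2

X = np.linspace(0, 1, 15)
F = np.sin(6 * X)
print(loo_scores(sq_exp, X, F))
```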

SLIDE 22

Making new from old

SLIDE 23

Making new from old: kernels can be

Summed together
  ◮ on the same space: k(x, y) = k1(x, y) + k2(x, y)
  ◮ on the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)

Multiplied together
  ◮ on the same space: k(x, y) = k1(x, y) × k2(x, y)
  ◮ on the tensor space: k(x, y) = k1(x1, y1) × k2(x2, y2)

Composed with a function
  ◮ k(x, y) = k1(f(x), f(y))

All these operations preserve positive definiteness. How can this be useful? (A small GPflow sketch follows below.)
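A small GPflow 2 sketch of these operations: kernel objects can be combined with + and *, and composition with a function can be done by evaluating a kernel on transformed inputs (the warped_gram helper is illustrative, not a GPflow built-in):

```python
import numpy as np
import gpflow

k_sum = gpflow.kernels.Matern12() + gpflow.kernels.Linear()             # sum on the same space
k_prod = gpflow.kernels.SquaredExponential() * gpflow.kernels.Cosine()  # product on the same space

def warped_gram(k1, f, x, y):
    """Gram matrix of k(x, y) = k1(f(x), f(y))."""
    return k1(f(x), f(y)).numpy()

x = np.linspace(0.1, 1, 50).reshape(-1, 1)
K = warped_gram(gpflow.kernels.Matern32(), lambda t: 1.0 / t, x, x)
```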

SLIDE 24

Sum of kernels over the same input space

Property

k(x, y) = k1(x, y) + k2(x, y) is a valid covariance structure. This can be proved directly from the p.s.d. definition.

Example

[Figure: a Matern12 kernel k(x, ·) plus a Linear kernel k(x, ·) equals their sum.]

SLIDE 25

Sum of kernels over the same input space

Z ∼ N(0, k1 + k2) can be seen as Z = Z1 + Z2, where Z1 and Z2 are independent with Z1 ∼ N(0, k1) and Z2 ∼ N(0, k2), so that k(x, y) = k1(x, y) + k2(x, y).

Example

[Figure: sample paths Z1(x) + Z2(x) = Z(x).]

SLIDE 26

Sum of kernels over the same space

Example (The Mauna Loa observatory dataset)

This famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.

[Figure: monthly CO2 concentration from 1958 onwards.]

Let’s try to predict the concentration for the next 20 years.

SLIDE 27

Sum of kernels over the same space

We first consider a squared-exponential kernel: k(x, y) = σ² exp(−(x − y)²/θ²)

[Figure: model predictions up to 2040 with this single squared-exponential kernel.]

The results are terrible!

SLIDE 28

Sum of kernels over the same space

What happens if we sum two kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)

[Figure: model predictions up to 2040 with the sum of two squared-exponential kernels.]

SLIDE 29

Sum of kernels over the same space

What happens if we sum two kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)

[Figure: model predictions up to 2040 with the sum of two squared-exponential kernels.]

The model is drastically improved!

SLIDE 30

Sum of kernels over the same space

We can try the following kernel: k(x, y) = σ₀² x²y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)

[Figure: model predictions up to 2040 with this kernel.]

SLIDE 31

Sum of kernels over the same space

We can try the following kernel: k(x, y) = σ₀² x²y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)

[Figure: model predictions up to 2040 with this kernel.]

Once again, the model is significantly improved.

SLIDE 32

Sum of kernels over tensor space

Property

k(x, y) = k1(x1, y1) + k2(x2, y2) is a valid covariance structure.

[Figure: k1(x1, ·) over [0, 1]² plus k2(x2, ·) over [0, 1]² equals the additive kernel k(x, ·).]

Remark: From a GP point of view, k is the kernel of Z(x) = Z1(x1) + Z2(x2)

SLIDE 33

Sum of kernels over tensor space

We can have a look at a few sample paths from Z:

[Figure: three sample paths of Z over [0, 1]².]

⇒ They are additive (up to a modification).

Additive kernels over tensor spaces are very useful for
  • approximating additive functions,
  • building models over high-dimensional input spaces.

SLIDE 34

Sum of kernels over tensor space

Remarks

It is straightforward to show that the mean predictor is additive

m(x) = (k1(x, X) + k2(x, X)) k(X, X)⁻¹ F
     = k1(x1, X1) k(X, X)⁻¹ F + k2(x2, X2) k(X, X)⁻¹ F
     = m1(x1) + m2(x2)

⇒ The model shares the prior behaviour (additivity).

The sub-models can be interpreted as GP regression models with observation noise:

m1(x1) = E( Z1(x1) | Z1(X1) + Z2(X2) = F )
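A numpy sketch of this additive decomposition with two 1D squared-exponential sub-kernels acting on x1 and x2 (toy data, small jitter in place of observation noise):

```python
import numpy as np

def se(a, b, theta=0.3):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * theta**2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 2))                     # 2D inputs
F = np.sin(6 * X[:, 0]) + X[:, 1]**2               # an additive test function
K = se(X[:, 0], X[:, 0]) + se(X[:, 1], X[:, 1]) + 1e-6 * np.eye(30)
alpha = np.linalg.solve(K, F)

x = np.linspace(0, 1, 100)
m1 = se(x, X[:, 0]) @ alpha                        # sub-model m1(x1)
m2 = se(x, X[:, 1]) @ alpha                        # sub-model m2(x2)
# The full mean predictor at (x1, x2) is m1(x1) + m2(x2).
```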

SLIDE 35

Sum of kernels over tensor space

Remark

The prediction variance has interesting features

[Figure: prediction variance over (x1, x2) ∈ [0, 1]² with a kernel product (left) and with a kernel sum (right).]

SLIDE 36

Sum of kernels over tensor space

This property can be used to construct a design of experiments that covers the space with only cst × d points.

[Figure: prediction variance over (x1, x2) ∈ [0, 1]² for such a design.]

SLIDE 37

Product over the same space

Property

k(x, y) = k1(x, y) × k2(x, y) is a valid covariance structure.

Example

We consider the product of a squared-exponential kernel with a cosine kernel:

[Figure: squared-exponential kernel × cosine kernel = their product.]

SLIDE 38

Product over the tensor space

Property

k(x, y) = k1(x1, y1) × k2(x2, y2) is a valid covariance structure.

Example

We multiply two squared-exponential kernels:

[Figure: k1(x1, ·) × k2(x2, ·) = k(x, ·) over [0, 1]².]

A short calculation shows that we obtain the usual 2D squared-exponential kernel.

SLIDE 39

Composition with a function

Property

Let k1 be a kernel over D1 × D1 and f be an arbitrary function D → D1, then k(x, y) = k1(f (x), f (y)) is a kernel over D × D.

Proof: Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) = Σᵢ Σⱼ aᵢ aⱼ k1(f(xᵢ), f(xⱼ)) = Σᵢ Σⱼ aᵢ aⱼ k1(yᵢ, yⱼ) ≥ 0, where yᵢ = f(xᵢ).

Remarks:
  • k corresponds to the covariance of Z(x) = Z1(f(x)).
  • This can be seen as a (nonlinear) rescaling of the input space.

SLIDE 40

Example

We consider f(x) = 1/x and a Matérn 3/2 kernel k1(x, y) = (1 + |x − y|) exp(−|x − y|). We obtain:

[Figures: the resulting kernel k(x, y) = k1(1/x, 1/y) and associated sample paths over [0, 1].]
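A numpy sketch of this warping: evaluate the Matérn 3/2 kernel on f(x) = 1/x and draw sample paths (grid and jitter values are illustrative):

```python
import numpy as np

def matern32(a, b):
    r = np.abs(a[:, None] - b[None, :])
    return (1 + r) * np.exp(-r)

def f(t):
    return 1.0 / t

x = np.linspace(0.05, 1, 300)                      # avoid x = 0, where f blows up
K = matern32(f(x), f(x)) + 1e-10 * np.eye(len(x))  # k(x, y) = k1(f(x), f(y))
samples = np.random.default_rng(2).multivariate_normal(np.zeros(len(x)), K, size=3)
```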

SLIDE 41

All these transformations can be combined!

Example

k(x, y) = f(x) f(y) k1(x, y) is a valid kernel. This can be illustrated with f(x) = 1/x and k1(x, y) = (1 + |x − y|) exp(−|x − y|):

[Figures: the resulting kernel and associated sample paths over [0, 1].]

SLIDE 42

Effect of linear operators

SLIDE 43

Effect of a linear operator

Property (Ginsbourger 2013)

Let L be a linear operator that commutes with the covariance, then k(x, y) = Lx(Ly(k1(x, y))) is a kernel.

Example

We want to approximate a function [0, 1] → ℝ that is symmetric with respect to 0.5. We will consider two linear operators:

L1 : f(x) → f(x) if x < 0.5, f(1 − x) if x ≥ 0.5

L2 : f(x) → (f(x) + f(1 − x)) / 2
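A numpy sketch of the corresponding kernels k_L(x, y) = Lx Ly k(x, y): for L1 this amounts to composing with the fold s(x) = min(x, 1 − x), and for L2 to averaging the four reflected evaluations (base kernel and grid are illustrative):

```python
import numpy as np

def k(a, b, theta=0.2):                                   # base squared-exponential kernel
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * theta**2))

def s(x):                                                 # fold around 0.5, used by L1
    return np.where(x < 0.5, x, 1 - x)

def k_L1(x, y):
    return k(s(x), s(y))

def k_L2(x, y):
    return (k(x, y) + k(x, 1 - y) + k(1 - x, y) + k(1 - x, 1 - y)) / 4

x = np.linspace(0, 1, 200)
rng = np.random.default_rng(3)
for kern in (k_L1, k_L2):
    K = kern(x, x) + 1e-10 * np.eye(len(x))
    path = rng.multivariate_normal(np.zeros(len(x)), K)   # a symmetric sample path
```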

SLIDE 44

Effect of a linear operator

Example

Associated sample paths are:

k1 = L1(L1(k)): [Figure: sample paths over [0, 1], symmetric about 0.5.]

k2 = L2(L2(k)): [Figure: sample paths over [0, 1], symmetric about 0.5.]

The differentiability is not always respected!

SLIDE 45

Effect of a linear operator

These linear operators are projections onto a space of symmetric functions:

[Diagram: the space H, the subspace Hsym of symmetric functions, and the projections L1 f and L2 f of a function f.]

What about the optimal projection? ⇒ This can be difficult... but it raises interesting questions!

SLIDE 46

Application: Periodicity detection

SLIDE 47

Periodicity detection

We will now discuss the detection of periodicity. Given a few observations, can we extract the periodic part of a signal?

[Figure: a few observations of a signal.]

SLIDE 48

As previously, we will build a decomposition of the process into two independent GPs: Z = Zp + Za, where Zp is a GP in the span of the Fourier basis B(t) = (sin(t), cos(t), . . . , sin(nt), cos(nt))ᵀ.

Property

It can be proved that the kernels of Zp and Za are

kp(x, y) = B(x)ᵀ G⁻¹ B(y)
ka(x, y) = k(x, y) − kp(x, y)

where G is the Gram matrix associated with B in the RKHS.

SLIDE 49

As previously, the decomposition of the kernel comes with a decomposition of the model:

m(x) = (kp(x, X) + ka(x, X)) k(X, X)⁻¹ F
     = kp(x, X) k(X, X)⁻¹ F   (periodic sub-model mp)
     + ka(x, X) k(X, X)⁻¹ F   (aperiodic sub-model ma)

and we can associate a prediction variance to the sub-models:

vp(x) = kp(x, x) − kp(x, X) k(X, X)⁻¹ kp(X, x)
va(x) = ka(x, x) − ka(x, X) k(X, X)⁻¹ ka(X, x)
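A numpy sketch of these sub-model formulas. As a stand-in for the RKHS-based decomposition above, kp is taken here to be a simple cosine kernel and ka a squared-exponential kernel; the formulas apply to any decomposition k = kp + ka:

```python
import numpy as np

def kp(a, b, period=2 * np.pi):                    # crude periodic (cosine) kernel
    return np.cos(2 * np.pi * (a[:, None] - b[None, :]) / period)

def ka(a, b, theta=2.0):                           # aperiodic part: squared exponential
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * theta**2))

rng = np.random.default_rng(4)
X = rng.uniform(0, 15, 30)
F = np.sin(X) + 0.3 * X / 15 + 0.05 * rng.normal(size=30)

K = kp(X, X) + ka(X, X) + 1e-6 * np.eye(30)
alpha = np.linalg.solve(K, F)
Kinv = np.linalg.inv(K)

x = np.linspace(0, 15, 300)
mp = kp(x, X) @ alpha                                            # periodic sub-model mean
ma = ka(x, X) @ alpha                                            # aperiodic sub-model mean
vp = np.diag(kp(x, x)) - np.sum((kp(x, X) @ Kinv) * kp(x, X), axis=1)
va = np.diag(ka(x, x)) - np.sum((ka(x, X) @ Kinv) * ka(x, X), axis=1)
```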

SLIDE 50

Example

For the observations shown previously we obtain:

[Figure: full model || periodic sub-model + aperiodic sub-model.]

Can we do any better?

SLIDE 51

Initially, the kernel is parametrised by 2 variables, k(x, y, σ², θ), but writing k as a sum allows us to tune the parameters of the sub-kernels independently.

Let k∗ be defined as

k∗(x, y, σₚ², σₐ², θₚ, θₐ) = kp(x, y, σₚ², θₚ) + ka(x, y, σₐ², θₐ)

Furthermore, we include a 5th parameter in k∗ accounting for the period, by changing the Fourier basis:

Bω(t) = (sin(ωt), cos(ωt), . . . , sin(nωt), cos(nωt))ᵀ

SLIDE 52

MLE of the 5 parameters of k∗ gives:

[Figure: full model || periodic sub-model + aperiodic sub-model, after MLE of the 5 parameters.]

We will now illustrate the use of these kernels for gene expression analysis.

SLIDE 53

We can apply this method to study the circadian rhythm in organisms. We used Arabidopsis data from Edward 2006.

The dimensions of the data are: 22810 genes, 13 time points.

Edward 2006 gives a list of the 3504 most periodically expressed genes. The comparison with our approach gives:

  • 21767 genes with the same label (2461 periodic and 19306 non-periodic)
  • 1043 genes with different labels

SLIDE 54

Let’s look at genes with different labels:

[Figure: expression profiles of genes with differing labels (At1g60810, At4g10040, At1g06290, At5g48900, At5g41480, At3g08000, At3g03900, At2g36400), with the legend "periodic for Edward" / "periodic for our approach".]

SLIDE 55

Conclusion

SLIDE 56

Small recap

We have seen that:

  • Kernels have a huge impact on the model.
  • They have to reflect the prior belief on the function to approximate.
  • Kernels can (and should) be tailored to the problem at hand.
  • Although a direct proof of the positive definiteness of a function is often intractable, Bochner's theorem allows us to build kernels from their power spectrum.

SLIDE 57

Various operations can be applied to kernels while keeping p.s.d.-ness:

Making new from old
  • sum
  • product
  • composition with a function
  • these can be combined

Linear operators
  • If we have a linear operator that transforms any function into a function satisfying the desired property, it is possible to build a GP fulfilling the requirements.

SLIDE 58
  • C. E. Rasmussen and C. Williams

Gaussian Processes for Machine Learning, The MIT Press, 2006.

  • A. Berlinet and C. Thomas-Agnan

RKHS in probability and statistics, Kluwer academic, 2004.

  • N. Durrande, D. Ginsbourger, O. Roustant

Additive covariance kernels for high-dimensional Gaussian process modeling, AFST 2012.

  • N. Durrande, J. Hensman, M. Rattray, N. D. Lawrence

Detecting periodicities with Gaussian processes. PeerJ Computer Science 2016.

  • D. Ginsbourger, X. Bay, L. Carraro and O. Roustant

Argumentwise invariant kernels for the approximation of invariant functions, AFST 2012.
