

slide-1
SLIDE 1

Gaussian Process Summer School

Kernel Design

Nicolas Durrande – PROWLER.io (nicolas@prowler.io)

Sheffield, September 2019

slide-2
SLIDE 2

Second Introduction to GPs and GP Regression

2 / 77

slide-3
SLIDE 3

The pdf of a Gaussian random variable is:

f(x) = 1 / (σ √(2π)) · exp( −(x − µ)² / (2σ²) )

[Figure: the corresponding bell-shaped density, x vs. density.]

The parameters µ and σ² correspond to the mean and variance:

µ = E[X],   σ² = E[X²] − E[X]²

The variance is positive.

3 / 77

slide-4
SLIDE 4

Definition

We say that a vector Y = (Y1, . . . , Yn)ᵗ follows a multivariate normal distribution if any linear combination of Y follows a normal distribution:

∀α ∈ Rn,  αᵗY ∼ N

Two examples and one counter-example:

[Figure: three scatter plots of (Y1, Y2) samples — two jointly Gaussian examples and one counter-example.]

4 / 77

slide-5
SLIDE 5

The pdf of a multivariate Gaussian is:

f_Y(x) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −½ (x − µ)ᵗ Σ⁻¹ (x − µ) )

It is parametrised by
  - the mean vector µ = E[Y]
  - the covariance matrix Σ = E[YYᵗ] − E[Y]E[Y]ᵗ (i.e. Σi,j = cov(Yi, Yj))

[Figure: surface plot of a 2D Gaussian density over (x1, x2).]

A covariance matrix is symmetric, Σi,j = Σj,i, and positive semi-definite: ∀α ∈ Rn, αᵗΣα ≥ 0.
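To make this concrete, here is a small numpy sketch (my addition, not from the slides) that draws samples from a toy 2D Gaussian and checks both the empirical covariance and the positive semi-definiteness of Σ; the values of mu and Sigma are arbitrary:

```python
import numpy as np

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])                     # symmetric and positive semi-definite

rng = np.random.default_rng(0)
Y = rng.multivariate_normal(mu, Sigma, size=100_000)

print(np.cov(Y.T))                                 # close to Sigma
print(np.all(np.linalg.eigvalsh(Sigma) >= 0))      # psd check: all eigenvalues >= 0
```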

5 / 77

slide-6
SLIDE 6

Conditional distribution

2D multivariate Gaussian conditional distribution:

p(y1 | y2 = α) = p(y1, α) / p(α)
              = exp(quadratic in y1 and α) / const
              = exp(quadratic in y1) / const
              = Gaussian distribution!

[Figure: 2D Gaussian density f_Y over (x1, x2) and the conditional slice at y2 = α, with mean µc and standard deviation √Σc.]

The conditional distribution is still Gaussian!

6 / 77

slide-7
SLIDE 7

3D Example

3D multivariate Gaussian conditional distribution:

[Figure: conditional distribution of a 3D multivariate Gaussian, shown over (x1, x2, x3).]

7 / 77

slide-8
SLIDE 8

Conditional distribution

Let (Y1, Y2) be a Gaussian vector (Y1 and Y2 may both be vectors):

(Y1, Y2)ᵗ ∼ N( (µ1, µ2)ᵗ , [[Σ11, Σ12], [Σ21, Σ22]] )

The conditional distribution of Y1 given Y2 is:

Y1 | Y2 ∼ N(µcond, Σcond)

with
  µcond = E[Y1 | Y2]        = µ1 + Σ12 Σ22⁻¹ (Y2 − µ2)
  Σcond = cov[Y1, Y1 | Y2]  = Σ11 − Σ12 Σ22⁻¹ Σ21
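A minimal numpy sketch of these conditioning formulas (my addition; the names mu1, S12, ... simply mirror the slide's block notation, and the numerical values are a toy example):

```python
import numpy as np

def gaussian_conditional(mu1, mu2, S11, S12, S22, y2):
    """Return the mean and covariance of Y1 | Y2 = y2."""
    A = np.linalg.solve(S22, y2 - mu2)                    # Sigma22^{-1} (y2 - mu2)
    mu_cond = mu1 + S12 @ A
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return mu_cond, Sigma_cond

# Toy 1D-given-1D example: conditioning shrinks the variance to 1 - 0.8^2 = 0.36.
mu_c, S_c = gaussian_conditional(np.array([0.]), np.array([0.]),
                                 np.array([[1.]]), np.array([[0.8]]),
                                 np.array([[1.]]), np.array([0.5]))
print(mu_c, S_c)
```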

8 / 77

slide-9
SLIDE 9

Gaussian processes

[Figure: sample paths of a Gaussian process Y(x).]

Definition

A random process Z over D ⊂ Rd is said to be Gaussian if ∀n ∈ N, ∀xi ∈ D, (Z(x1), . . . , Z(xn)) is multivariate normal. ⇒ Demo: https://github.com/awav/interactive-gp

9 / 77

slide-10
SLIDE 10

We write Z ∼ N(m(·), k(·, ·)):
  - m : D → R is the mean function: m(x) = E[Z(x)]
  - k : D × D → R is the covariance function (i.e. the kernel): k(x, y) = cov(Z(x), Z(y))

The mean m can be any function, but not the kernel:

Theorem (Loeve)

k is a GP covariance ⇔ k is symmetric, k(x, y) = k(y, x), and positive semi-definite:

Σ_{i=1..n} Σ_{j=1..n} αi αj k(xi, xj) ≥ 0   for all n ∈ N, for all xi ∈ D, for all αi ∈ R
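Positive semi-definiteness can at least be probed numerically. The sketch below (my addition) builds the Gram matrix of a candidate kernel on an arbitrary grid and inspects its eigenvalues; this cannot prove that a function is psd, but it can quickly disprove it. Both kernels are toy examples:

```python
import numpy as np

def gram(kernel, x):
    """Gram matrix of a scalar kernel on a set of points."""
    return np.array([[kernel(a, b) for b in x] for a in x])

k_rbf = lambda x, y: np.exp(-0.5 * (x - y) ** 2)   # squared exponential: psd
k_bad = lambda x, y: -(x - y) ** 2                 # symmetric but NOT psd

x = np.linspace(0, 5, 50)
print(np.linalg.eigvalsh(gram(k_rbf, x)).min())    # >= 0 up to round-off
print(np.linalg.eigvalsh(gram(k_bad, x)).min())    # clearly negative
```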

10 / 77

slide-11
SLIDE 11

Proving that a function is psd is often difficult. However, there are a lot of functions that have already been proven to be psd:

squared exp.   k(x, y) = σ² exp( −(x − y)² / (2θ²) )

Matérn 5/2     k(x, y) = σ² (1 + √5 |x − y| / θ + 5 |x − y|² / (3θ²)) exp( −√5 |x − y| / θ )

Matérn 3/2     k(x, y) = σ² (1 + √3 |x − y| / θ) exp( −√3 |x − y| / θ )

exponential    k(x, y) = σ² exp( −|x − y| / θ )

Brownian       k(x, y) = σ² min(x, y)

white noise    k(x, y) = σ² δ_{x,y}

constant       k(x, y) = σ²

linear         k(x, y) = σ² x y

When k is a function of x − y, the kernel is called stationary. σ² is called the variance and θ the lengthscale. ⇒ Demo: https://github.com/NicolasDurrande/shinyApps
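As an illustration (my addition, not from the slides), here are direct numpy implementations of three of the kernels above, with sigma2 the variance and theta the lengthscale:

```python
import numpy as np

def k_se(x, y, sigma2=1.0, theta=1.0):            # squared exponential
    return sigma2 * np.exp(-(x - y) ** 2 / (2 * theta ** 2))

def k_matern32(x, y, sigma2=1.0, theta=1.0):      # Matérn 3/2
    d = np.abs(x - y)
    return sigma2 * (1 + np.sqrt(3) * d / theta) * np.exp(-np.sqrt(3) * d / theta)

def k_exp(x, y, sigma2=1.0, theta=1.0):           # exponential
    return sigma2 * np.exp(-np.abs(x - y) / theta)

print(k_se(0.0, 0.3), k_matern32(0.0, 0.3), k_exp(0.0, 0.3))
```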

11 / 77

slide-12
SLIDE 12

Examples of kernels in gpflow:

[Figure: plots of k(x, 0) — or k(x, 1) for the Linear and Polynomial kernels — for the gpflow kernels Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial and ArcCosine.]
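A hedged GPflow 2.x sketch (my addition; class names and call signatures should be checked against the GPflow version you use) that evaluates a few of these kernels on a grid, as in the figure:

```python
import numpy as np
import gpflow

X = np.linspace(-3, 3, 200).reshape(-1, 1)
zero = np.zeros((1, 1))

kernels = {
    "Matern12": gpflow.kernels.Matern12(),
    "Matern32": gpflow.kernels.Matern32(),
    "Matern52": gpflow.kernels.Matern52(),
    "RBF": gpflow.kernels.SquaredExponential(),   # also known as RBF
}

for name, k in kernels.items():
    kx0 = k(X, zero).numpy().ravel()              # covariance between the grid and 0
    print(name, kx0[:3])
```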

12 / 77

slide-13
SLIDE 13

Associated samples

[Figure: sample paths associated with each of the above kernels (Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial, ArcCosine).]

13 / 77

slide-14
SLIDE 14

Gaussian process regression

We assume we have observed a function f for a set of points X = (X1, . . . , Xn):

[Figure: observations of f at the points X.]

The vector of observations is F = f(X) (i.e. Fi = f(Xi)).

14 / 77

slide-15
SLIDE 15

Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Z ∼ N(0, k):

[Figure: sample paths of the prior Gaussian process Z.]

15 / 77

slide-16
SLIDE 16

The posterior distribution Z(·) | Z(X) = F:
  - is still Gaussian,
  - can be computed analytically.

It is N(m(·), c(·, ·)) with:

m(x)    = E[Z(x) | Z(X) = F]           = k(x, X) k(X, X)⁻¹ F
c(x, y) = cov[Z(x), Z(y) | Z(X) = F]   = k(x, y) − k(x, X) k(X, X)⁻¹ k(X, y)
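These two formulas translate directly into code. A minimal numpy sketch (my addition, with a squared-exponential kernel, toy observations and a small jitter term added for numerical stability):

```python
import numpy as np

def k(A, B, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta ** 2))

X = np.array([0.1, 0.4, 0.6, 0.9])                 # observation points
F = np.sin(6 * X)                                  # observed values F = f(X)
x = np.linspace(0, 1, 200)                         # prediction points

K_XX = k(X, X) + 1e-10 * np.eye(len(X))            # jitter for numerical stability
K_xX = k(x, X)

m = K_xX @ np.linalg.solve(K_XX, F)                            # posterior mean
c = k(x, x) - K_xX @ np.linalg.solve(K_XX, K_xX.T)             # posterior covariance
upper = m + 1.96 * np.sqrt(np.maximum(np.diag(c), 0))          # 95% upper bound
```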

16 / 77

slide-17
SLIDE 17

A few words on GPR complexity:
  - Storage footprint: we have to store the covariance matrix, which is n × n.
  - Complexity: we have to invert the covariance matrix, which is O(n³).

Storage footprint is often the first limit to be reached. The maximal number of observation points is typically between 1 000 and 10 000.

What if we have more data? ⇒ Talk from Zhenwen this afternoon
What if we need to be faster? ⇒ Talk from Arno on Wednesday

17 / 77

slide-18
SLIDE 18

Samples from the posterior distribution

[Figure: sample paths of the posterior Y(x) | Y(X) = F.]

18 / 77

slide-19
SLIDE 19

It can be summarized by a mean function and 95% confidence intervals.

[Figure: posterior mean of Y(x) | Y(X) = F with 95% confidence intervals.]

19 / 77

slide-20
SLIDE 20

A few remarkable properties of GPR models:
  - They (can) interpolate the data points.
  - The prediction variance does not depend on the observations.
  - The mean predictor does not depend on the variance parameter.
  - The mean (usually) comes back to zero when predicting far away from the observations.

Can we prove them? ⇒ Demo https://durrande.shinyapps.io/gp_playground

20 / 77

slide-21
SLIDE 21

We are not always interested in models that interpolate the data, for example when there is some observation noise: F = f(X) + ε.

Let N be a process N(0, n(·, ·)) that represents the observation noise. The GPR expressions with noise are:

m(x)    = E[Z(x) | Z(X) + N(X) = F]           = k(x, X) (k(X, X) + n(X, X))⁻¹ F
c(x, y) = cov[Z(x), Z(y) | Z(X) + N(X) = F]   = k(x, y) − k(x, X) (k(X, X) + n(X, X))⁻¹ k(X, y)
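For i.i.d. noise, n(X, X) is simply τ²I, so the only change with respect to the noise-free sketch is the matrix being inverted. A small numpy illustration (my addition, same assumed squared-exponential kernel and toy data):

```python
import numpy as np

def k(A, B, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta ** 2))

X = np.array([0.1, 0.4, 0.6, 0.9])
F = np.sin(6 * X) + 0.05 * np.random.default_rng(0).standard_normal(len(X))
x = np.linspace(0, 1, 200)
tau2 = 0.01                                        # noise variance

K_noisy = k(X, X) + tau2 * np.eye(len(X))          # k(X, X) + n(X, X)
m = k(x, X) @ np.linalg.solve(K_noisy, F)          # no longer interpolates the data
c = k(x, x) - k(x, X) @ np.linalg.solve(K_noisy, k(x, X).T)
```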

21 / 77

slide-22
SLIDE 22

Examples of models with observation noise for n(x, y) = τ² δ_{x,y}:

[Figure: three GPR models Z(x) | Z(X) + N(X) = F with increasing noise levels.]

The values of τ² are respectively 0.001, 0.01 and 0.1.

What if F = f (X) + ε isn’t appropriate? ⇒ Talks from Alan and Neil tomorrow.

22 / 77

slide-23
SLIDE 23

Parameter estimation

23 / 77

slide-24
SLIDE 24

The choice of the kernel parameters has a great influence on the model. ⇒ Demo https://durrande.shinyapps.io/gp_playground

In order to choose a prior that is suited to the data at hand, we can search for the parameters that maximise the model likelihood.

Definition

The likelihood of a distribution with density f_X given some observations X1, . . . , Xp is:

L = Π_{i=1..p} f_X(Xi)

24 / 77

slide-25
SLIDE 25

In the GPR context, we often have only one observation of the vector F. The likelihood is then:

L(σ², θ) = f_{Z(X)}(F) = 1 / ((2π)^(n/2) |k(X, X)|^(1/2)) · exp( −½ Fᵗ k(X, X)⁻¹ F )

It is thus possible to maximise L – or log(L) – with respect to the kernel's parameters in order to find a well-suited prior.

Why is the likelihood linked to good model predictions? They are linked by the product rule:

fZ(X)(F) = f (F1) × f (F2|F1) × f (F3|F1, F2) × · · · × f (Fn|F1, . . . , Fn−1)
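A sketch of the corresponding negative log-likelihood (my addition), written with a Cholesky factorisation so that log|k(X, X)| and the quadratic form are computed stably; it can be passed to any optimiser (for instance scipy.optimize.minimize) with the parameters encoded in log-space:

```python
import numpy as np

def k(A, B, sigma2, theta):
    return sigma2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta ** 2))

def neg_log_likelihood(params, X, F):
    sigma2, theta = np.exp(params)                     # log-space keeps parameters positive
    K = k(X, X, sigma2, theta) + 1e-8 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, F))        # K^{-1} F
    return (0.5 * F @ alpha                                     # quadratic term
            + np.sum(np.log(np.diag(L)))                        # 0.5 * log|K|
            + 0.5 * len(X) * np.log(2 * np.pi))

# Usage sketch: scipy.optimize.minimize(neg_log_likelihood, np.log([1.0, 0.2]), args=(X, F))
```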

25 / 77

slide-26
SLIDE 26

Model validation

26 / 77

slide-27
SLIDE 27

The idea is to introduce new data and to compare the model prediction with reality.

[Figure: model predictions Z(x) | Z(X) = F compared against test observations.]

Two (ideally three) things should be checked:
  - Is the mean accurate?
  - Do the confidence intervals make sense?
  - Are the predicted covariances right?

27 / 77

slide-28
SLIDE 28

Let Xt be the test set and Ft = f(Xt) be the associated observations.

The accuracy of the mean can be measured by computing:
  - the Mean Square Error: MSE = mean((Ft − m(Xt))²)
  - a "normalised" criterion: Q² = 1 − Σ(Ft − m(Xt))² / Σ(Ft − mean(Ft))²

On the above example we get MSE = 0.038 and Q² = 0.95.
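Both criteria are one-liners; a small sketch (my addition, with made-up toy numbers rather than the slide's example):

```python
import numpy as np

def mse(Ft, m_Xt):
    return np.mean((Ft - m_Xt) ** 2)

def q2(Ft, m_Xt):
    return 1.0 - np.sum((Ft - m_Xt) ** 2) / np.sum((Ft - np.mean(Ft)) ** 2)

Ft = np.array([1.0, 2.0, 3.0, 4.0])      # toy test observations
m_Xt = np.array([1.1, 1.9, 3.2, 3.9])    # toy predicted means
print(mse(Ft, m_Xt), q2(Ft, m_Xt))       # Q2 close to 1 means an accurate mean
```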

28 / 77

slide-29
SLIDE 29

The predicted distribution can be tested by normalising the residuals. According to the model, Ft ∼ N(m(Xt), c(Xt, Xt)), so c(Xt, Xt)^(−1/2) (Ft − m(Xt)) should be independent N(0, 1) variables:

[Figure: histogram of the standardised residuals with the N(0, 1) density, and the associated normal Q-Q plot.]
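A sketch of the whitening step (my addition), using a matrix square root of the predicted covariance:

```python
import numpy as np
from scipy.linalg import sqrtm

def standardised_residuals(Ft, m_Xt, c_XtXt):
    """Whiten the test residuals with c(Xt, Xt)^{-1/2}."""
    C_half = np.real(sqrtm(c_XtXt))                       # c(Xt, Xt)^{1/2}
    return np.linalg.solve(C_half, Ft - m_Xt)             # c(Xt, Xt)^{-1/2} (Ft - m(Xt))

# If the model is well calibrated, the output should look like i.i.d. N(0, 1) samples
# (roughly 95% of them within [-1.96, 1.96]).
```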

29 / 77

slide-30
SLIDE 30

When no test set is available, another option is to consider cross validation methods such as leave-one-out. The steps are:

  • 1. build a model based on all observations except one
  • 2. compute the model error at this point

This procedure can be repeated for all the design points in order to get a vector of errors.
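A naive leave-one-out sketch (my addition, assuming a zero-mean GP with a fixed squared-exponential kernel and toy data); each iteration simply rebuilds the model without the i-th point and records the prediction error at that point:

```python
import numpy as np

def k(A, B, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta ** 2))

def loo_errors(X, F):
    errors = np.empty(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                          # drop the i-th point
        K = k(X[mask], X[mask]) + 1e-10 * np.eye(mask.sum())
        m_i = k(X[i:i+1], X[mask]) @ np.linalg.solve(K, F[mask])
        errors[i] = F[i] - m_i[0]
    return errors

X = np.linspace(0, 1, 10)
F = np.sin(6 * X)
print(loo_errors(X, F))
```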

30 / 77

slide-31
SLIDE 31

Model to be tested:

[Figure: the GPR model Z(x) | Z(X) = F based on all observations.]

31 / 77

slide-32
SLIDE 32

Step 1:

[Figure: leave-one-out model Z(x) | Z(X) = F with one of the observations removed.]

32 / 77

slide-33
SLIDE 33

Step 2:

[Figure: leave-one-out model Z(x) | Z(X) = F with another observation removed.]

33 / 77

slide-34
SLIDE 34

Step 3:

[Figure: leave-one-out model Z(x) | Z(X) = F with yet another observation removed.]

34 / 77

slide-35
SLIDE 35

We finally obtain: MSE = 0.24 and Q² = 0.34. We can also look at the residual distribution. For leave-one-out, there is no joint distribution for the residuals, so they have to be standardised independently.

[Figure: histogram and normal Q-Q plot of the independently standardised leave-one-out residuals.]

35 / 77

slide-36
SLIDE 36

Choosing the kernel

36 / 77

slide-37
SLIDE 37

Changing the kernel has a huge impact on the model:

[Figure: GPR models obtained with a Gaussian (squared-exponential) kernel and with an exponential kernel.]

37 / 77

slide-38
SLIDE 38

This is because changing the kernel means changing the prior on f.

[Figure: prior sample paths for a Gaussian kernel and for an exponential kernel.]

Kernels encode the prior belief about the function to approximate... they should be chosen accordingly!

38 / 77

slide-39
SLIDE 39

In order to choose a kernel, one should gather all possible information about the function to approximate:
  - Is it stationary?
  - Is it differentiable? What is its regularity?
  - Do we expect particular trends?
  - Do we expect particular patterns (periodicity, cycles, additivity)?

It is common to try various kernels and to assess the model accuracy (test set or leave-one-out). Furthermore, it is often interesting to try some input remapping such as x → log(x), x → exp(x), ...

39 / 77

slide-40
SLIDE 40

We have seen previously:

Theorem (Loeve)

k corresponds to the covariance of a GP ⇔ k is a symmetric positive semi-definite function:

Σ_{i=1..n} Σ_{j=1..n} αi αj k(xi, xj) ≥ 0   for all n ∈ N, for all xi ∈ D, for all αi ∈ R.

40 / 77

slide-41
SLIDE 41

For a few kernels, it is possible to prove they are psd directly from the definition, e.g. k(x, y) = δ_{x,y} or k(x, y) = 1. For most of them, a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:

Theorem (Bochner)

A continuous stationary function k(x, y) = k̃(|x − y|) is positive definite if and only if k̃ is the Fourier transform of a finite positive measure:

k̃(t) = ∫_R e^(−iωt) dµ(ω)

41 / 77

slide-42
SLIDE 42

Example

We consider the following measure (a uniform density on an interval). Its Fourier transform gives k̃(t) = sin(t)/t:

[Figure: the box-shaped spectral measure and the corresponding covariance k̃(t) = sin(t)/t.]

As a consequence, k(x, y) = sin(x − y) / (x − y) is a valid covariance function.
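As a quick numerical sanity check (my addition), the Gram matrix of this sinc kernel indeed has non-negative eigenvalues:

```python
import numpy as np

def k_sinc(x, y):
    # np.sinc(t) = sin(pi t) / (pi t), so this evaluates sin(x - y) / (x - y)
    return np.sinc((x - y) / np.pi)

x = np.random.default_rng(0).uniform(0, 10, 60)
K = k_sinc(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min())        # >= 0 up to round-off
```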

42 / 77

slide-43
SLIDE 43

Usual kernels

Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:
  - The Gaussian is the Fourier transform of itself ⇒ it is psd.
  - Matérn kernels are the Fourier transforms of 1 / (1 + ω²)^p ⇒ they are psd.

43 / 77

slide-44
SLIDE 44

Unusual kernels

The inverse Fourier transform of a (symmetrised) sum of Gaussians gives (A. Wilson, ICML 2013):

[Figure: a spectral measure µ(ω) made of Gaussian bumps and, after Fourier transform, the corresponding kernel k̃(t).]

The obtained kernel is parametrised by its spectrum.
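A sketch of one such spectrally defined component (my addition, assuming a single pair of Gaussian bumps at ±mu with width sigma, following the convention of that paper): the inverse Fourier transform is a Gaussian envelope multiplied by a cosine.

```python
import numpy as np

def k_spectral_component(x, y, mu=3.0, sigma=0.5):
    """Kernel whose spectrum is a symmetrised pair of Gaussian bumps at +/- mu."""
    t = x - y
    return np.exp(-2 * np.pi ** 2 * sigma ** 2 * t ** 2) * np.cos(2 * np.pi * mu * t)

x = np.linspace(0, 5, 100)
K = k_spectral_component(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min())        # psd up to round-off
```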

44 / 77

slide-45
SLIDE 45

Unusual kernels

The sample paths have the following shape:

[Figure: sample paths of a GP with this spectrally defined kernel.]

45 / 77

slide-46
SLIDE 46

Making new from old

46 / 77

slide-47
SLIDE 47

Making new from old: Kernels can be: Summed together

◮ On the same space k(x, y) = k1(x, y) + k2(x, y) ◮ On the tensor space k(x, y) = k1(x1, y1) + k2(x2, y2)

Multiplied together

◮ On the same space k(x, y) = k1(x, y) × k2(x, y) ◮ On the tensor space k(x, y) = k1(x1, y1) × k2(x2, y2)

Composed with a function

◮ k(x, y) = k1(f (x), f (y))

All these operations will preserve the positive definiteness. How can this be useful?
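A small numpy sketch of the three operations (my addition, with arbitrary base kernels and an arbitrary warping function f), checking that the resulting Gram matrices remain positive semi-definite:

```python
import numpy as np

k1 = lambda x, y: np.exp(-0.5 * (x - y) ** 2)     # squared exponential
k2 = lambda x, y: np.minimum(x, y)                 # Brownian (on positive inputs)
f  = lambda x: np.log(1 + x)                       # an arbitrary input warping

k_sum  = lambda x, y: k1(x, y) + k2(x, y)          # sum
k_prod = lambda x, y: k1(x, y) * k2(x, y)          # product
k_comp = lambda x, y: k1(f(x), f(y))               # composition with f

x = np.linspace(0.01, 5, 50)
for k in (k_sum, k_prod, k_comp):
    K = k(x[:, None], x[None, :])
    print(np.linalg.eigvalsh(K).min() >= -1e-8)    # True: still psd
```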

47 / 77

slide-48
SLIDE 48

Sum of kernels over the same input space

Property

k(x, y) = k1(x, y) + k2(x, y) is a valid covariance structure. This can be proved directly from the p.s.d. definition.

Example

[Figure: k(x, 0.03) for a Matern12 kernel, a Linear kernel, and their sum.]

48 / 77

slide-49
SLIDE 49

Sum of kernels over the same input space

Z ∼ N(0, k1 + k2) can be seen as Z = Z1 + Z2, where Z1 and Z2 are independent with Z1 ∼ N(0, k1) and Z2 ∼ N(0, k2):

k(x, y) = k1(x, y) + k2(x, y)

Example

[Figure: a sample path of Z1, a sample path of Z2, and their sum Z.]

49 / 77

slide-50
SLIDE 50

Sum of kernels over the same space

Example: The Mauna Loa observatory dataset [GPML 2006]

This famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.

[Figure: monthly CO2 concentration (ppm) at Mauna Loa from 1958 onwards.]

Let’s try to predict the concentration for the next 20 years.

50 / 77

slide-51
SLIDE 51

Sum of kernels over the same space

We first consider a squared-exponential kernel:

k(x, y) = σ² exp( −(x − y)² / θ² )

[Figure: the resulting GPR prediction and confidence intervals over 1950–2040.]

The results are terrible!

51 / 77

slide-52
SLIDE 52

Sum of kernels over the same space

What happens if we sum two such kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)

[Figure: GPR prediction with the sum of two RBF kernels over 1950–2040.]

52 / 77

slide-53
SLIDE 53

Sum of kernels over the same space

What happens if we sum two such kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)

[Figure: GPR prediction with the sum of two RBF kernels over 1950–2040.]

The model is drastically improved!

52 / 77

slide-54
SLIDE 54

Sum of kernels over the same space

We can try the following kernel:

k(x, y) = σ0² x² y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)

[Figure: GPR prediction with this composite kernel over 1950–2040.]

53 / 77

slide-55
SLIDE 55

Sum of kernels over the same space

We can try the following kernel:

k(x, y) = σ0² x² y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)

[Figure: GPR prediction with this composite kernel over 1950–2040.]

Once again, the model is significantly improved.
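A hedged GPflow 2.x sketch of such a composite kernel (my addition; the structure mirrors the formula above, but the hyperparameter values are placeholders and the class names and signatures should be checked against your GPflow version):

```python
import gpflow

k = (
    gpflow.kernels.Linear() * gpflow.kernels.Linear()            # ~ x^2 y^2 trend term
    + gpflow.kernels.SquaredExponential(lengthscales=50.0)       # long-term variations
    + gpflow.kernels.SquaredExponential(lengthscales=5.0)        # medium-term variations
    + gpflow.kernels.Periodic(gpflow.kernels.SquaredExponential(), period=1.0)  # yearly cycle
)

# Usage sketch, given training arrays X (years) and Y (CO2 concentration):
# model = gpflow.models.GPR(data=(X, Y), kernel=k)
# gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
```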

53 / 77

slide-56
SLIDE 56

Sum of kernels over tensor space

Property

k(x, y) = k1(x1, y1) + k2(x2, y2) is a valid covariance structure.

[Figure: a kernel k1(x1, y1), a kernel k2(x2, y2), and their sum seen over the 2D input space.]

Remark: From a GP point of view, k is the kernel of Z(x) = Z1(x1) + Z2(x2)

54 / 77

slide-57
SLIDE 57

Sum of kernels over tensor space

We can have a look at a few sample paths from Z:

[Figure: sample paths of Z(x) = Z1(x1) + Z2(x2) over the 2D input space.]

⇒ They are additive (up to a modification).

Additive kernels are very useful for:
  - approximating additive functions,
  - building models over high-dimensional input spaces.

55 / 77

slide-58
SLIDE 58

Sum of kernels over tensor space

Remarks

It is straightforward to show that the mean predictor is additive:

m(x) = (k1(x, X) + k2(x, X)) k(X, X)⁻¹ F
     = k1(x1, X1) k(X, X)⁻¹ F + k2(x2, X2) k(X, X)⁻¹ F
     =          m1(x1)        +          m2(x2)

⇒ The model shares the prior behaviour.

The sub-models can be interpreted as GP regression models with observation noise:

m1(x1) = E( Z1(x1) | Z1(X1) + Z2(X2) = F )

56 / 77

slide-59
SLIDE 59

Sum of kernels over tensor space

Remark

The prediction variance has interesting features

[Figure: prediction variance over (x1, x2) with a kernel product vs. with a kernel sum.]

57 / 77

slide-60
SLIDE 60

Sum of kernels over tensor space

This property can be used to construct a design of experiment that covers the space with only cst × d points.

[Figure: prediction variance over (x1, x2) for such a design of experiments.]

58 / 77

slide-61
SLIDE 61

Product over the same space

Property

k(x, y) = k1(x, y) × k2(x, y) is a valid covariance structure.

Example

We consider the product of a squared-exponential kernel with a cosine kernel:

[Figure: the squared-exponential kernel, the cosine kernel, and their product.]

59 / 77

slide-62
SLIDE 62

Product over the tensor space

Property

k(x, y) = k1(x1, y1) × k2(x2, y2) is a valid covariance structure.

Example

We multiply two squared-exponential kernels:

[Figure: the two 1D squared-exponential kernels and their product over the 2D input space.]

Calculation shows we obtain the usual 2D squared-exponential kernel.

60 / 77

slide-63
SLIDE 63

Composition with a function

Property

Let k1 be a kernel over D1 × D1 and f be an arbitrary function D → D1, then k(x, y) = k1(f (x), f (y)) is a kernel over D × D.

Proof: Σi Σj αi αj k(xi, xj) = Σi Σj αi αj k1(f(xi), f(xj)) = Σi Σj αi αj k1(yi, yj) ≥ 0, with yi = f(xi).

Remarks: k corresponds to the covariance of Z(x) = Z1(f(x)). This can be seen as a (nonlinear) rescaling of the input space.

61 / 77

slide-64
SLIDE 64

Example

We consider f(x) = 1/x and a Matérn 3/2 kernel k1(x, y) = (1 + |x − y|) e^(−|x − y|). We obtain:

[Figure: the resulting kernel and associated sample paths.]
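A sketch reproducing this warping idea (my addition, with unit variance and lengthscale, on a grid that avoids x = 0):

```python
import numpy as np

def k1(x, y):
    """Matern 3/2 kernel with unit parameters."""
    d = np.abs(x - y)
    return (1 + d) * np.exp(-d)

f = lambda x: 1.0 / x
k_warped = lambda x, y: k1(f(x), f(y))             # k(x, y) = k1(f(x), f(y))

x = np.linspace(0.05, 1, 100)
K = k_warped(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min() >= -1e-8)        # still a valid (psd) covariance
```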

62 / 77

slide-65
SLIDE 65

All these transformations can be combined!

Example

k(x, y) = f(x) f(y) k1(x, y) is a valid kernel. This can be illustrated with f(x) = 1/x and k1(x, y) = (1 + |x − y|) e^(−|x − y|):

[Figure: the resulting kernel and associated sample paths.]

63 / 77

slide-66
SLIDE 66

Can we automate the construction of the covariance?

Automatic statistician [Duvenaud 2013, Steinruecken 2019]

It considers a set of possible kernel functions and kernel combinations (+, ×, change-point) and uses a greedy approach to find the kernel that minimises BIC = −2 log(L) + #params · log(n). The automatic statistician also generates human-readable reports!

64 / 77

slide-67
SLIDE 67

Applying linear operators to GPs

65 / 77

slide-68
SLIDE 68

Effect of a linear operator

Property (Ginsbourger 2013)

Let L be a linear operator that commutes with the covariance, then k(x, y) = Lx(Ly(k1(x, y))) is a kernel.

Example

We want to approximate a function [0, 1] → R that is symmetric with respect to 0.5. We will consider two linear operators:

L1 : f(x) → f(x) if x < 0.5,  f(1 − x) if x ≥ 0.5
L2 : f(x) → (f(x) + f(1 − x)) / 2
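Applying the averaging operator L2 to both arguments of a base kernel k gives (my derivation, not from the slides) k2(x, y) = ¼ (k(x, y) + k(x, 1 − y) + k(1 − x, y) + k(1 − x, 1 − y)). A small sketch (my addition, with an arbitrary squared-exponential base kernel) showing that samples drawn with this kernel are symmetric about 0.5:

```python
import numpy as np

k = lambda x, y: np.exp(-0.5 * (x - y) ** 2 / 0.2 ** 2)    # base kernel

def k_sym(x, y):
    return 0.25 * (k(x, y) + k(x, 1 - y) + k(1 - x, y) + k(1 - x, 1 - y))

x = np.linspace(0, 1, 101)                                  # grid symmetric about 0.5
K = k_sym(x[:, None], x[None, :]) + 1e-8 * np.eye(len(x))   # tiny jitter for Cholesky
Z = np.linalg.cholesky(K) @ np.random.default_rng(0).standard_normal(len(x))
print(np.allclose(Z, Z[::-1], atol=1e-2))                   # sample is symmetric (up to the jitter)
```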

66 / 77

slide-69
SLIDE 69

Effect of a linear operator

Example

Associated sample paths are:

[Figure: sample paths for k1 = L1(L1(k)) and for k2 = L2(L2(k)), both symmetric with respect to 0.5.]

The differentiability is not always respected!

67 / 77

slide-70
SLIDE 70

Effect of a linear operator

These linear operators are projections onto a space of symmetric functions:

[Figure: diagram of the space H, its subspace Hsym of symmetric functions, and the images L1 f and L2 f of a function f.]

What about the optimal projection? ⇒ This can be difficult... but it raises interesting questions!

68 / 77

slide-71
SLIDE 71

Application to sensitivity analysis

69 / 77

slide-72
SLIDE 72

The analysis of the influence of the various variables of a d-dimensional function f is often based on the FANOVA decomposition:

f(x) = f0 + Σ_{i=1..d} fi(xi) + Σ_{i<j} fi,j(xi, xj) + · · · + f1,...,d(x)

where ∫ fI(xI) dxi = 0 if i ∈ I.

The expressions of the fI are:

f0          = ∫ f(x) dx
fi(xi)      = ∫ f(x) dx_−i − f0
fi,j(xi, xj) = ∫ f(x) dx_−ij − fi(xi) − fj(xj) + f0

Can we obtain a similar decomposition for a GP?

70 / 77

slide-73
SLIDE 73

samples with zero integrals

We are interested in building a GP such that the integrals of the samples are exactly zero.

Idea: project a GP onto a space of functions with zero integrals:

[Figure: diagram of the projection of Z onto Z0.]

It can be proved that the orthogonal projection is

Z0(x) = Z(x) − ( ∫ k(x, s) ds × ∫ Z(s) ds ) / ∫∫ k(s, t) ds dt

71 / 77

slide-74
SLIDE 74

The associated kernel is:

k0(x, y) = k(x, y) − ( ∫ k(x, s) ds × ∫ k(y, s) ds ) / ∫∫ k(s, t) ds dt

Such 1-dimensional kernels are great when combined as ANOVA kernels:

k(x, y) = Π_{i=1..d} (1 + k0(xi, yi))
        = 1 + Σ_{i=1..d} k0(xi, yi)                       [additive part]
            + Σ_{i<j} k0(xi, yi) k0(xj, yj)               [2nd-order interactions]
            + · · · + Π_{i=1..d} k0(xi, yi)               [full interaction]
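A sketch of the zero-integral kernel k0 on [0, 1] (my addition, with an arbitrary squared-exponential base kernel), where the integrals above are replaced by a simple quadrature; the sanity check verifies that ∫ k0(x, s) ds ≈ 0, so samples of a GP with kernel k0 integrate to zero:

```python
import numpy as np

k = lambda x, y: np.exp(-0.5 * (x - y) ** 2 / 0.2 ** 2)    # base kernel on [0, 1]

s = np.linspace(0, 1, 500)                                  # quadrature grid
ds = s[1] - s[0]
denom = np.sum(k(s[:, None], s[None, :])) * ds ** 2         # double integral of k

def k0(x, y):
    """Zero-integral kernel for scalar inputs x and y."""
    ix = np.sum(k(x, s)) * ds                               # integral of k(x, .)
    iy = np.sum(k(y, s)) * ds                               # integral of k(y, .)
    return k(x, y) - ix * iy / denom

for x in (0.1, 0.5, 0.9):
    print(np.sum(np.array([k0(x, si) for si in s])) * ds)   # ~ 0 for every x
```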

72 / 77

slide-75
SLIDE 75

10d example

Let us consider the test function f : [0, 1]¹⁰ → R with ε ∼ N(0, 1) observation noise:

x → 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + ε

The steps for approximating f with GPR are:
  1. learn f on a DoE (here LHS maximin with 180 points),
  2. get the optimal values for the kernel parameters using MLE,
  3. build a model based on the kernel (1 + k0).

The structure of the kernel allows m to be split into sub-models:

m(x) = ( 1 + Σi k0(xi, Xi) + Σ_{i<j} k0(xi, Xi) k0(xj, Xj) + . . . ) k(X, X)⁻¹ F
     = m0 + Σi mi(xi) + Σ_{i<j} mi,j(xi, xj) + · · · + m1,...,d(x)

73 / 77

slide-76
SLIDE 76

The univariate sub-models are:

[Figure: the univariate sub-models mi(xi); recall that f(x) = 10 sin(π x1 x2) + 20 (x3 − 0.5)² + 10 x4 + 5 x5 + N(0, 1).]

74 / 77
slide-77
SLIDE 77

Conclusion: GPR and kernel design in practice

75 / 77

slide-78
SLIDE 78

The various steps for building a GPR model are:

  • 1. Get the Data (Design of Experiment)

◮ What is the overall evaluation budget? ◮ What is my model for?

  • 2. Choose a kernel. Do we have any specific knowledge we can

include in it?

  • 3. Estimate the parameters

◮ Maximum likelihood ◮ Cross-validation ◮ Multi-start

  • 4. Validate the model

◮ Test set ◮ Leave-one-out to check mean and confidence intervals ◮ Leave-k-out to check predicted covariances

Remarks

It is common to iterate over steps 2, 3 and 4.

76 / 77

slide-79
SLIDE 79

In practice, the following errors may appear:

  • Error: Cholesky decomposition failed
  • Error: the matrix is not positive definite

In practice, invertibility issues arise when observation points are close by. This is especially true if:
  - the kernel corresponds to very regular sample paths (squared-exponential for example),
  - the range (or lengthscale) parameters are large.

In order to avoid numerical problems during optimisation, one can (see the sketch below):
  - add some (very) small observation noise,
  - impose a maximum bound on lengthscales,
  - impose a minimal bound for the noise variance,
  - avoid using the Gaussian kernel.
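A sketch of the first remedy (my addition): adding a small jitter (nugget) to the diagonal so that the Cholesky factorisation goes through even with duplicated observation points:

```python
import numpy as np

def k(A, B, theta=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * theta ** 2))

X = np.array([0.5, 0.5, 1.0, 2.0])                 # two identical observation points
K = k(X, X)                                        # singular covariance matrix

try:
    np.linalg.cholesky(K)                          # fails: matrix not positive definite
except np.linalg.LinAlgError:
    print("Cholesky decomposition failed")

L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))  # jitter restores positive definiteness
print("ok with jitter")
```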

77 / 77