Kernel Design Nicolas Durrande PROWLER.io (nicolas@prowler.io) - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Gaussian Process Summer School

Kernel Design

Nicolas Durrande – PROWLER.io (nicolas@prowler.io)

Sheffield, September 2017

1 / 59

slide-2
SLIDE 2

  • Introduction
  • What is a kernel?
  • Choosing the appropriate kernel
  • Making new from old
  • Effect of linear operators
  • Application: periodicity detection
  • Conclusion

2 / 59

slide-4
SLIDE 4

We have seen during the introduction lectures that the distribution of a GP Z depends on two functions:

  • the mean m(x) = E(Z(x))
  • the covariance k(x, x′) = cov(Z(x), Z(x′))

In this talk, we will focus on the covariance function, which is often called the kernel.

4 / 59

slide-5
SLIDE 5

We assume we have observed a function f for a limited number of time points x1, . . . , xn :

[Figure: the observed values f(x) plotted against x ∈ [0, 1].]

The observations are denoted by fi = f (xi) (or F = f (X)).

5 / 59

slide-6
SLIDE 6

Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Z:

[Figure: sample paths Z(x) plotted against x ∈ [0, 1].]

6 / 59

slide-7
SLIDE 7

Combining these two pieces of information means keeping only the sample paths that interpolate the data points:

[Figure: sample paths of Z(x) | Z(X) = F plotted against x ∈ [0, 1].]

7 / 59

slide-8
SLIDE 8

The conditional distribution is still Gaussian with moments :

$$m(x) = \mathrm{E}(Z(x) \mid Z(X) = F) = k(x, X)\, k(X, X)^{-1} F$$

$$c(x, x') = \mathrm{cov}(Z(x), Z(x') \mid Z(X) = F) = k(x, x') - k(x, X)\, k(X, X)^{-1} k(X, x')$$

It can be represented as a mean function with confidence intervals.

[Figure: conditional mean and confidence intervals of Z(x) | Z(X) = F for x ∈ [0, 1].]
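The conditional moments can be sketched in a few lines of NumPy. This is only an illustration: the squared-exponential kernel, its parameters, the toy observations and the small `jitter` term (added for numerical stability) are assumptions, not part of the slides.

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    """Squared-exponential kernel matrix between 1-D arrays x and y (assumed kernel)."""
    d = x[:, None] - y[None, :]
    return sigma2 * np.exp(-d**2 / (2.0 * theta**2))

def gp_posterior(x_new, X, F, kern, jitter=1e-10):
    """Conditional mean m(x) and covariance c(x, x') of Z given Z(X) = F."""
    K = kern(X, X) + jitter * np.eye(len(X))   # k(X, X), slightly regularised
    k_star = kern(x_new, X)                    # k(x, X)
    mean = k_star @ np.linalg.solve(K, F)
    cov = kern(x_new, x_new) - k_star @ np.linalg.solve(K, k_star.T)
    return mean, cov

# Hypothetical observations, chosen only for the example.
X = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
F = np.sin(2 * np.pi * X)

x_new = np.linspace(0.0, 1.0, 101)
mean, cov = gp_posterior(x_new, X, F, sq_exp)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))      # clip tiny negative round-off
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # pointwise 95% intervals
```

At the observation points the mean reproduces F and the variance vanishes, which is the interpolation behaviour described above.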

8 / 59

slide-10
SLIDE 10

Let Z be a random process with kernel k. Some properties of kernels can be obtained directly from their definition.

Example

k(x, x) = cov(Z(x), Z(x)) = var(Z(x)) ≥ 0 ⇒ k(x, x) is non-negative.

k(x, y) = cov(Z(x), Z(y)) = cov(Z(y), Z(x)) = k(y, x) ⇒ k is symmetric.

We can obtain a finer result...

10 / 59

slide-11
SLIDE 11

We introduce the random variable $T = \sum_{i=1}^{n} a_i Z(x_i)$ where $n$, $a_i$ and $x_i$ are arbitrary. Computing the variance of $T$ gives:

$$\mathrm{var}(T) = \mathrm{cov}\Big(\sum_i a_i Z(x_i), \sum_j a_j Z(x_j)\Big) = \sum_i \sum_j a_i a_j\, \mathrm{cov}(Z(x_i), Z(x_j)) = \sum_i \sum_j a_i a_j\, k(x_i, x_j)$$

Since a variance is non-negative, we have

$$\sum_i \sum_j a_i a_j\, k(x_i, x_j) \ge 0$$

for any arbitrary $n$, $a_i$ and $x_i$.

Definition

The functions satisfying the above inequality for all n ∈ N, for all xi ∈ D, for all ai ∈ R are called positive semi-definite functions.
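A numerical check cannot prove that a function is positive semi-definite, but it can expose functions that are not. The sketch below, with hypothetical helper names, builds the matrix of values k(xi, xj) for random points and looks at its smallest eigenvalue, which is equivalent to testing the inequality for all vectors a at those points.

```python
import numpy as np

def gram(kern, xs):
    """Matrix of k(x_i, x_j) for a scalar kernel function kern."""
    return np.array([[kern(xi, xj) for xj in xs] for xi in xs])

def quadratic_form(kern, xs, a):
    """sum_i sum_j a_i a_j k(x_i, x_j), i.e. the variance of T when k is a covariance."""
    return float(a @ gram(kern, xs) @ a)

def looks_psd(kern, xs, tol=1e-10):
    """True if the Gram matrix at xs is symmetric with no eigenvalue below -tol."""
    K = gram(kern, xs)
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)

brownian = lambda x, y: min(x, y)        # a valid kernel on [0, +inf)
not_a_kernel = lambda x, y: abs(x - y)   # symmetric, but not positive semi-definite

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=8)
a = rng.normal(size=8)

print(looks_psd(brownian, xs))       # consistent with psd at these points
print(looks_psd(not_a_kernel, xs))   # a negative eigenvalue rules it out
```

Passing the check at one set of points is only a necessary condition; failing it at any set of points is conclusive.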

11 / 59

slide-12
SLIDE 12

We have just seen:

k is a covariance ⇒ k is a positive semi-definite function

The converse is also true:

Theorem (Loève)

k corresponds to the covariance of a GP ⇔ k is a symmetric positive semi-definite function

12 / 59

slide-13
SLIDE 13

Proving that a function is psd is often difficult. However there are a lot of functions that have already been proven to be psd :

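  • squared exp.: $k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{2\theta^2}\right)$
  • Matérn 5/2: $k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{5}\,|x - y|}{\theta} + \frac{5\,|x - y|^2}{3\theta^2}\right) \exp\left(-\frac{\sqrt{5}\,|x - y|}{\theta}\right)$
  • Matérn 3/2: $k(x, y) = \sigma^2 \left(1 + \frac{\sqrt{3}\,|x - y|}{\theta}\right) \exp\left(-\frac{\sqrt{3}\,|x - y|}{\theta}\right)$
  • exponential: $k(x, y) = \sigma^2 \exp\left(-\frac{|x - y|}{\theta}\right)$
  • Brownian: $k(x, y) = \sigma^2 \min(x, y)$
  • white noise: $k(x, y) = \sigma^2 \delta_{x,y}$
  • constant: $k(x, y) = \sigma^2$
  • linear: $k(x, y) = \sigma^2 x y$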

When k is a function of x − y, the kernel is called stationary. σ2 is called the variance and θ the lengthscale.
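For reference, the kernels listed on this slide can be written as NumPy functions returning the matrix of values between two 1-D arrays of inputs. The function names and default parameter values are illustrative choices, not taken from any particular library.

```python
import numpy as np

def _dist(x, y):
    """Matrix of |x_i - y_j| for 1-D inputs."""
    return np.abs(np.asarray(x, float)[:, None] - np.asarray(y, float)[None, :])

def squared_exp(x, y, sigma2=1.0, theta=1.0):
    return sigma2 * np.exp(-_dist(x, y)**2 / (2.0 * theta**2))

def matern52(x, y, sigma2=1.0, theta=1.0):
    r = np.sqrt(5.0) * _dist(x, y) / theta
    return sigma2 * (1.0 + r + r**2 / 3.0) * np.exp(-r)

def matern32(x, y, sigma2=1.0, theta=1.0):
    r = np.sqrt(3.0) * _dist(x, y) / theta
    return sigma2 * (1.0 + r) * np.exp(-r)

def exponential(x, y, sigma2=1.0, theta=1.0):
    return sigma2 * np.exp(-_dist(x, y) / theta)

def brownian(x, y, sigma2=1.0):
    # Defined for non-negative inputs.
    return sigma2 * np.minimum(np.asarray(x, float)[:, None], np.asarray(y, float)[None, :])

def white_noise(x, y, sigma2=1.0):
    return sigma2 * (_dist(x, y) == 0.0).astype(float)

def constant(x, y, sigma2=1.0):
    return sigma2 * np.ones((len(x), len(y)))

def linear(x, y, sigma2=1.0):
    return sigma2 * np.asarray(x, float)[:, None] * np.asarray(y, float)[None, :]

x = np.linspace(0.1, 1.0, 10)
K = matern52(x, x, sigma2=2.0, theta=0.5)   # 10 x 10 covariance matrix
```

The first four depend on x and y only through |x − y| and are therefore stationary; for them the diagonal of the matrix is σ².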

13 / 59

slide-14
SLIDE 14

14 / 59

slide-15
SLIDE 15

For a few kernels, it is possible to prove they are psd directly from the definition:

  • k(x, y) = δx,y
  • k(x, y) = 1

For most of them a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:

Theorem (Bochner)

A continuous stationary function $k(x, y) = \tilde{k}(|x - y|)$ is positive definite if and only if $\tilde{k}$ is the Fourier transform of a finite positive measure:

$$\tilde{k}(t) = \int_{\mathbb{R}} e^{-i\omega t}\, d\mu(\omega)$$

15 / 59

slide-16
SLIDE 16

Example

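We consider the following measure:

[Figure: the measure µ(ω); the plot is not recoverable from the transcript.]

Its Fourier transform gives $\tilde{k}(t) = \frac{\sin(t)}{t}$:

[Figure: the function k̃(t).]

As a consequence, $k(x, y) = \frac{\sin(x - y)}{x - y}$ is a valid covariance function.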
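The plot of the measure is not available here. One measure that does give sin(t)/t is the uniform density 1/2 on [−1, 1]; the sketch below assumes that density (an assumption, not read off the slide), checks the Fourier transform numerically, and verifies that the resulting kernel yields a positive semi-definite matrix on a set of points.

```python
import numpy as np

def sinc_kernel(x, y):
    """k(x, y) = sin(x - y) / (x - y), with value 1 when x == y."""
    t = x[:, None] - y[None, :]
    # np.sinc(u) is sin(pi u) / (pi u), so sin(t) / t equals np.sinc(t / pi).
    return np.sinc(t / np.pi)

def fourier_of_uniform(t, n=20001):
    """Trapezoidal approximation of the integral of 0.5 * exp(-i w t) over w in [-1, 1]."""
    omega = np.linspace(-1.0, 1.0, n)
    values = 0.5 * np.exp(-1j * omega * t)
    h = omega[1] - omega[0]
    return h * (values[0] / 2 + values[1:-1].sum() + values[-1] / 2)

t = 2.0
approx = fourier_of_uniform(t)   # numerically close to sin(2) / 2, imaginary part near 0

x = np.linspace(-5.0, 5.0, 12)
smallest_eigenvalue = np.linalg.eigvalsh(sinc_kernel(x, x)).min()
```

The imaginary part vanishes because the assumed measure is symmetric, which is what makes the kernel real-valued.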

16 / 59

slide-17
SLIDE 17

Usual kernels

Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:

  • The Gaussian is the Fourier transform of itself ⇒ it is psd.
  • Matérn kernels are the Fourier transforms of $\frac{1}{(1 + \omega^2)^p}$ ⇒ they are psd.

17 / 59

slide-18
SLIDE 18

Unusual kernels

The inverse Fourier transform of a (symmetrised) sum of Gaussians gives (A. Wilson, ICML 2013):

[Figure: the spectral measure µ(ω) is mapped by the Fourier transform F to the kernel k̃(t).]

The obtained kernel is parametrised by its spectrum.
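A minimal sketch of such a spectrum-parametrised kernel in one dimension follows. It uses the closed form from the spectral mixture construction, k(τ) = Σq wq exp(−2π²τ²vq) cos(2πτµq), where wq, µq and vq are the weight, mean and variance of each Gaussian component of the spectrum. The numerical values are made up for illustration.

```python
import numpy as np

def spectral_mixture(x, y, weights, means, variances):
    """1-D spectral mixture kernel: each Gaussian component of the spectrum
    contributes w * exp(-2 pi^2 tau^2 v) * cos(2 pi tau mu), with tau = x - y."""
    tau = x[:, None] - y[None, :]
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return k

# Hypothetical spectrum: two bumps centred at frequencies 0.5 and 2.0.
weights = [1.0, 0.5]
means = [0.5, 2.0]
variances = [0.01, 0.05]

x = np.linspace(0.0, 6.0, 40)
K = spectral_mixture(x, x, weights, means, variances)
```

Each component is a squared exponential multiplied by a cosine, so narrow spectral bumps produce slowly decaying, quasi-periodic covariances.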

18 / 59

slide-19
SLIDE 19

Unusual kernels

The sample paths have the following shape:

[Figure: sample paths of the GP with this kernel.]

19 / 59

slide-21
SLIDE 21

Changing the kernel has a huge impact on the model:

[Figure panels: Gaussian kernel, exponential kernel.]

21 / 59

slide-22
SLIDE 22

This is because changing the kernel implies changing the prior.

[Figure panels: Gaussian kernel, exponential kernel.]
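The difference between the two priors can be reproduced by drawing sample paths. The sketch below is an illustration under assumed settings (unit variance, lengthscale 0.2, 200 grid points); it samples through an eigendecomposition because the Gaussian kernel matrix is numerically close to singular.

```python
import numpy as np

def gaussian_kernel(x, y, theta=0.2):
    d = x[:, None] - y[None, :]
    return np.exp(-d**2 / (2.0 * theta**2))

def exponential_kernel(x, y, theta=0.2):
    d = np.abs(x[:, None] - y[None, :])
    return np.exp(-d / theta)

def sample_prior(kern, x, n_samples, rng):
    """Draw zero-mean GP sample paths on the grid x; returns shape (n_samples, len(x))."""
    K = kern(x, x)
    w, V = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)       # remove tiny negative eigenvalues from round-off
    L = V * np.sqrt(w)              # K = L @ L.T
    return (L @ rng.standard_normal((len(x), n_samples))).T

def roughness(paths):
    """Mean squared increment between neighbouring grid points."""
    return float(np.mean(np.diff(paths, axis=1)**2))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
smooth = sample_prior(gaussian_kernel, x, 5, rng)
rough = sample_prior(exponential_kernel, x, 5, rng)
print(roughness(smooth), roughness(rough))
```

Paths drawn with the Gaussian kernel are very smooth, while those drawn with the exponential kernel are continuous but jagged, so the same data lead to very different models.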

22 / 59

slide-23
SLIDE 23

In order to choose a kernel, one should gather all available information about the function to approximate:

  • Is it stationary?
  • Is it differentiable, and what is its regularity?
  • Do we expect particular trends?
  • Do we expect particular patterns (periodicity, cycles, additivity)?

Kernels often include rescaling parameters: θ for the x axis (length-scale) and σ for the y axis (σ² often corresponds to the GP variance). They can be tuned by

  • maximizing the likelihood
  • minimizing the prediction error
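As an illustration of likelihood-based tuning, the sketch below evaluates the negative log marginal likelihood of a zero-mean GP and picks (σ², θ) on a coarse grid. The squared-exponential kernel, the small noise term, the toy data and the grid are all assumptions made for the example; in practice a gradient-based optimiser is normally used instead of a grid.

```python
import numpy as np

def sq_exp(x, y, sigma2, theta):
    d = x[:, None] - y[None, :]
    return sigma2 * np.exp(-d**2 / (2.0 * theta**2))

def neg_log_likelihood(X, F, sigma2, theta, noise=1e-4):
    """-log p(F | X, sigma2, theta) for a zero-mean GP with a small noise term."""
    n = len(X)
    K = sq_exp(X, X, sigma2, theta) + noise * np.eye(n)
    L = np.linalg.cholesky(K)                              # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, F))    # K^{-1} F
    # 0.5 * log det K equals the sum of the logs of the diagonal of L.
    return 0.5 * F @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2.0 * np.pi)

# Toy data, chosen only for the example.
X = np.linspace(0.0, 1.0, 15)
F = np.sin(2.0 * np.pi * X)

grid = [(s2, th) for s2 in (0.25, 1.0, 4.0) for th in np.logspace(-2, 0, 25)]
scores = [neg_log_likelihood(X, F, s2, th) for s2, th in grid]
best_sigma2, best_theta = grid[int(np.argmin(scores))]
```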

23 / 59

slide-24
SLIDE 24

It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:

  • on a test set
  • using leave-one-out

Two (ideally three) things should be checked:

  • Is the mean accurate (MSE, Q2)?
  • Do the confidence intervals make sense?
  • Are the predicted covariances right?

Furthermore, it is often interesting to try some input remapping such as x → log(x), x → exp(x), ...
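Leave-one-out predictions for a zero-mean GP with fixed hyperparameters do not require refitting n models: they follow from the inverse of the covariance matrix (see Rasmussen and Williams, 2006, section 5.4.2). The sketch below uses that identity; the Matérn 3/2 kernel, the nugget and the toy data are assumptions for the example.

```python
import numpy as np

def matern32(x, y, theta=0.3):
    r = np.sqrt(3.0) * np.abs(x[:, None] - y[None, :]) / theta
    return (1.0 + r) * np.exp(-r)

def loo_predictions(K, F):
    """Leave-one-out mean and variance at every observation, from K^{-1} only."""
    K_inv = np.linalg.inv(K)
    d = np.diag(K_inv)
    mean = F - (K_inv @ F) / d
    var = 1.0 / d
    return mean, var

def q2(F, pred):
    """Q2 coefficient: 1 - (sum of squared errors) / (total sum of squares)."""
    return 1.0 - np.sum((F - pred)**2) / np.sum((F - F.mean())**2)

# Toy data, chosen only for the example.
X = np.linspace(0.0, 1.0, 12)
F = np.sin(2.0 * np.pi * X)
K = matern32(X, X) + 1e-6 * np.eye(len(X))

loo_mean, loo_var = loo_predictions(K, F)
mse = np.mean((F - loo_mean)**2)
score = q2(F, loo_mean)
# If the confidence intervals make sense, these should look roughly standard normal.
standardised = (F - loo_mean) / np.sqrt(loo_var)
```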

24 / 59

slide-26
SLIDE 26

Making new from old: kernels can be

  • Summed together
    ◮ On the same space: k(x, y) = k1(x, y) + k2(x, y)
    ◮ On the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)
  • Multiplied together
    ◮ On the same space: k(x, y) = k1(x, y) × k2(x, y)
    ◮ On the tensor space: k(x, y) = k1(x1, y1) × k2(x2, y2)
  • Composed with a function
    ◮ k(x, y) = k1(f(x), f(y))

All these operations preserve positive definiteness. How can this be useful?
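These operations translate directly into small combinators on kernel functions. The sketch below uses hypothetical helper names and two base kernels chosen for the example, and checks numerically that each combination still gives a matrix without negative eigenvalues.

```python
import numpy as np

def rbf(x, y, theta=0.3):
    return np.exp(-(x[:, None] - y[None, :])**2 / (2.0 * theta**2))

def linear(x, y):
    return x[:, None] * y[None, :]

def k_sum(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)

def k_prod(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)

def k_compose(k1, f):
    return lambda x, y: k1(f(x), f(y))

def on_dim(k, i):
    """Apply a 1-D kernel to coordinate i of 2-D inputs (tensor-space constructions)."""
    return lambda x, y: k(x[:, i], y[:, i])

def min_eig(kern, x):
    return float(np.linalg.eigvalsh(kern(x, x)).min())

same_space = {
    "sum": k_sum(rbf, linear),
    "product": k_prod(rbf, linear),
    "composition with log": k_compose(rbf, np.log),
}
tensor_space = {
    "sum": k_sum(on_dim(rbf, 0), on_dim(rbf, 1)),
    "product": k_prod(on_dim(rbf, 0), on_dim(rbf, 1)),
}

rng = np.random.default_rng(0)
x1 = np.linspace(0.1, 1.0, 12)       # 1-D inputs (positive, so log is defined)
x2 = rng.uniform(size=(15, 2))       # 2-D inputs

for name, kern in same_space.items():
    print("same space,", name, min_eig(kern, x1))
for name, kern in tensor_space.items():
    print("tensor space,", name, min_eig(kern, x2))
```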

26 / 59

slide-27
SLIDE 27

Sum of kernels over the same space

Example (The Mauna Loa observatory dataset)

This famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.

[Figure: monthly CO2 concentration (roughly 320 to 440) against year, 1960 to 2030.]

Let’s try to predict the concentration for the next 20 years.

27 / 59

slide-28
SLIDE 28

Sum of kernels over the same space

We first consider a squared-exponential kernel:

$$k(x, y) = \sigma^2 \exp\left(-\frac{(x - y)^2}{\theta^2}\right)$$

[Figures: two model predictions of the CO2 concentration over 1950 to 2040.]

The results are terrible !

28 / 59

slide-30
SLIDE 30

Sum of kernels over the same space

What happens if we sum both kernels?

$$k(x, y) = k_{\mathrm{rbf}1}(x, y) + k_{\mathrm{rbf}2}(x, y)$$

[Figure: model prediction of the CO2 concentration over 1950 to 2040.]

The model is drastically improved !

29 / 59

slide-32
SLIDE 32

Sum of kernels over the same space

We can try the following kernel:

$$k(x, y) = \sigma_0^2\, x^2 y^2 + k_{\mathrm{rbf}1}(x, y) + k_{\mathrm{rbf}2}(x, y) + k_{\mathrm{per}}(x, y)$$

[Figure: model prediction of the CO2 concentration over 1950 to 2040.]

Once again, the model is significantly improved.

30 / 59

slide-33
SLIDE 33

Sum of kernels over tensor space

Property

k(x, y) = k1(x1, y1) + k2(x2, y2) is a valid covariance structure.

[Figure: surface plots over [0, 1]² illustrating k1 + k2 = k.]

Remark : From a GP point of view, k is the kernel of Z(x) = Z1(x1) + Z2(x2)

31 / 59

slide-34
SLIDE 34

Sum of kernels over tensor space

We can have a look at a few sample paths from Z :

[Figure: three sample paths of Z over [0, 1]².]

⇒ They are additive (up to a modification).

Tensor additive kernels are very useful for

  • approximating additive functions
  • building models over high-dimensional input spaces

32 / 59

slide-35
SLIDE 35

Sum of kernels over tensor space

We consider the test function f(x) = sin(4πx1) + cos(4πx2) + 2x2 and a set of 20 observations in [0, 1]².

[Figure panels: test function; observations.]

33 / 59

slide-36
SLIDE 36

Sum of kernels over tensor space

We obtain the following models:

  • Gaussian kernel: [Figure: mean predictor.] RMSE is 1.06
  • Additive Gaussian kernel: [Figure: mean predictor.] RMSE is 0.12

34 / 59

slide-37
SLIDE 37

Sum of kernels over tensor space

Remarks

It is straightforward to show that the mean predictor is additive:

$$m(x) = \big(k_1(x_1, X_1) + k_2(x_2, X_2)\big)\, k(X, X)^{-1} F = \underbrace{k_1(x_1, X_1)\, k(X, X)^{-1} F}_{m_1(x_1)} + \underbrace{k_2(x_2, X_2)\, k(X, X)^{-1} F}_{m_2(x_2)}$$

⇒ The model shares the prior behaviour.

The sub-models can be interpreted as GP regression models with observation noise:

$$m_1(x_1) = \mathrm{E}\big(Z_1(x_1) \mid Z_1(X_1) + Z_2(X_2) = F\big)$$
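The decomposition of the mean predictor into sub-models can be sketched with the test function used earlier in this section. The random design, the lengthscale 0.2 and the small nugget are assumptions for the example, so the numbers will not match the RMSE values reported on the slides.

```python
import numpy as np

def rbf(a, b, theta=0.2):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2.0 * theta**2))

def f(x):
    """Test function f(x) = sin(4 pi x1) + cos(4 pi x2) + 2 x2."""
    return np.sin(4 * np.pi * x[:, 0]) + np.cos(4 * np.pi * x[:, 1]) + 2 * x[:, 1]

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))     # 20 observation points in [0, 1]^2 (assumed design)
F = f(X)

# Additive kernel k(x, y) = k1(x1, y1) + k2(x2, y2), plus a small nugget for stability.
K = rbf(X[:, 0], X[:, 0]) + rbf(X[:, 1], X[:, 1]) + 1e-6 * np.eye(len(X))
alpha = np.linalg.solve(K, F)     # k(X, X)^{-1} F, shared by both sub-models

def m(x):
    """Full mean predictor."""
    return (rbf(x[:, 0], X[:, 0]) + rbf(x[:, 1], X[:, 1])) @ alpha

def m1(x1):
    """Sub-model depending on the first input only."""
    return rbf(x1, X[:, 0]) @ alpha

def m2(x2):
    """Sub-model depending on the second input only."""
    return rbf(x2, X[:, 1]) @ alpha

x_test = rng.uniform(size=(50, 2))
gap = np.max(np.abs(m(x_test) - (m1(x_test[:, 0]) + m2(x_test[:, 1]))))
```

Here `gap` is zero up to round-off, which is the additivity of the mean predictor stated above.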

35 / 59

slide-38
SLIDE 38

Sum of kernels over tensor space

Remark

The prediction variance has interesting features

[Figure panels, both over (x1, x2) ∈ [0, 1]²: prediction variance with kernel product; prediction variance with kernel sum.]

36 / 59

slide-39
SLIDE 39

Sum of kernels over tensor space

This property can be used to construct a design of experiment that covers the space with only cst × d points.

[Figure: prediction variance over (x1, x2) ∈ [0, 1]².]

37 / 59

slide-40
SLIDE 40

Product over the same space

Property

k(x, y) = k1(x, y) × k2(x, y) is a valid covariance structure.

Example

We consider the product of a squared exponential with a cosine:

[Figure: squared exponential × cosine = product kernel.]

38 / 59

slide-41
SLIDE 41

Product over the tensor space

Property

k(x, y) = k1(x1, y1) × k2(x2, y2) is a valid covariance structure.

Example

We multiply two squared exponential kernels:

[Figure: surface plots over [0, 1]² illustrating k1 × k2 = k.]

Calculation shows we obtain the usual 2D squared exponential kernel.
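The calculation is short. Assuming each factor is a squared exponential with its own variance and lengthscale (the subscripted parameters are notation introduced here), the product of the exponentials becomes the exponential of the sum:

```latex
\begin{aligned}
k(x, y) &= k_1(x_1, y_1) \times k_2(x_2, y_2) \\
        &= \sigma_1^2 \exp\left(-\frac{(x_1 - y_1)^2}{2\theta_1^2}\right)
           \times \sigma_2^2 \exp\left(-\frac{(x_2 - y_2)^2}{2\theta_2^2}\right) \\
        &= \sigma_1^2 \sigma_2^2 \exp\left(-\sum_{i=1}^{2} \frac{(x_i - y_i)^2}{2\theta_i^2}\right)
\end{aligned}
```

With equal lengthscales θ1 = θ2 = θ the exponent is −‖x − y‖² / (2θ²), the isotropic 2D squared exponential.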

39 / 59

slide-42
SLIDE 42

Composition with a function

Property

Let k1 be a kernel over D1 × D1 and f be an arbitrary function D → D1, then k(x, y) = k1(f (x), f (y)) is a kernel over D × D.

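Proof: writing $y_i = f(x_i)$,

$$\sum_i \sum_j a_i a_j\, k(x_i, x_j) = \sum_i \sum_j a_i a_j\, k_1(f(x_i), f(x_j)) = \sum_i \sum_j a_i a_j\, k_1(y_i, y_j) \ge 0$$

Remarks:

  • k corresponds to the covariance of Z(x) = Z1(f(x))
  • This can be seen as a (nonlinear) rescaling of the input space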

40 / 59

slide-43
SLIDE 43

Example

We consider $f(x) = \frac{1}{x}$ and a Matérn 3/2 kernel $k_1(x, y) = (1 + |x - y|)\, e^{-|x - y|}$. We obtain:

[Figure panels over [0, 1]: kernel; sample paths.]

41 / 59

slide-44
SLIDE 44

All these transformations can be combined !

Example

k(x, y) = f(x) f(y) k1(x, y) is a valid kernel. This can be illustrated with $f(x) = \frac{1}{x}$ and $k_1(x, y) = (1 + |x - y|)\, e^{-|x - y|}$:

[Figure panels over [0, 1]: kernel; sample paths.]
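Both examples with f(x) = 1/x can be reproduced in a few lines: the input warping k1(f(x), f(y)) and the rescaling f(x) f(y) k1(x, y). The grid starting at 0.05 (to stay away from the singularity at 0) is an assumption for the sketch.

```python
import numpy as np

def matern32(x, y):
    """k1(x, y) = (1 + |x - y|) exp(-|x - y|), as in the examples."""
    d = np.abs(x[:, None] - y[None, :])
    return (1.0 + d) * np.exp(-d)

def warped(x, y):
    """Composition with f(x) = 1 / x: k(x, y) = k1(1 / x, 1 / y)."""
    return matern32(1.0 / x, 1.0 / y)

def scaled(x, y):
    """Rescaling by f(x) = 1 / x: k(x, y) = f(x) f(y) k1(x, y)."""
    return (1.0 / x)[:, None] * (1.0 / y)[None, :] * matern32(x, y)

x = np.linspace(0.05, 1.0, 40)
K_warped = warped(x, x)
K_scaled = scaled(x, x)

# The warped kernel keeps a constant variance, while the rescaled one has
# variance k(x, x) = 1 / x^2, which grows quickly as x approaches 0.
variance_warped = np.diag(K_warped)
variance_scaled = np.diag(K_scaled)
```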

42 / 59

slide-46
SLIDE 46

Effect of a linear operator

Property (Ginsbourger 2013)

Let L be a linear operator that commutes with the covariance, then k(x, y) = Lx(Ly(k1(x, y))) is a kernel.

Example

We want to approximate a function [0, 1] → R that is symmetric with respect to 0.5. We will consider 2 linear operators:

$$L_1 : f(x) \mapsto \begin{cases} f(x) & x < 0.5 \\ f(1 - x) & x \ge 0.5 \end{cases} \qquad\qquad L_2 : f(x) \mapsto \frac{f(x) + f(1 - x)}{2}$$

44 / 59

slide-47
SLIDE 47

Effect of a linear operator

Example

Associated sample paths are:

[Figure: sample paths Y against x ∈ [0, 1] for k1 = L1(L1(k)).]

[Figure: sample paths Y against x ∈ [0, 1] for k2 = L2(L2(k)).]

The differentiability is not always respected !
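Applying each operator to both arguments of a base kernel gives explicit formulas: L1 amounts to folding the input with g(x) = min(x, 1 − x), and L2 averages the kernel over the four combinations of x, 1 − x, y and 1 − y. The sketch below assumes a squared-exponential base kernel with lengthscale 0.2.

```python
import numpy as np

def base(x, y, theta=0.2):
    """Base kernel k (assumed squared exponential)."""
    return np.exp(-(x[:, None] - y[None, :])**2 / (2.0 * theta**2))

def k_fold(x, y):
    """L1 applied to both arguments: k(g(x), g(y)) with g(x) = min(x, 1 - x)."""
    g = lambda u: np.minimum(u, 1.0 - u)
    return base(g(x), g(y))

def k_avg(x, y):
    """L2 applied to both arguments: average over x, 1 - x and y, 1 - y."""
    return 0.25 * (base(x, y) + base(1.0 - x, y) + base(x, 1.0 - y) + base(1.0 - x, 1.0 - y))

x = np.linspace(0.0, 1.0, 41)
K_fold = k_fold(x, x)
K_avg = k_avg(x, x)

# Symmetry with respect to 0.5: replacing x by 1 - x leaves the kernel unchanged,
# so Z(x) and Z(1 - x) are perfectly correlated with equal variance.
fold_symmetric = np.allclose(k_fold(1.0 - x, x), K_fold)
avg_symmetric = np.allclose(k_avg(1.0 - x, x), K_avg)
```

The fold in `k_fold` introduces a kink at x = 0.5, which is why sample paths from the first construction are generally not differentiable there even when the base kernel is smooth.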

45 / 59

slide-48
SLIDE 48

Effect of a linear operator

These linear operators are projections onto a space of symmetric functions:

[Figure: a function f in the space H and its projections L1f and L2f onto the subspace Hsym.]

What about the optimal projection? ⇒ This can be difficult... but it raises interesting questions!

46 / 59

slide-50
SLIDE 50

Periodicity detection

We will now discuss the detection of periodicity. Given a few observations, can we extract the periodic part of a signal?

[Figure: observations of a signal.]

48 / 59

slide-51
SLIDE 51

As previously, we will build a decomposition of the process into two independent GPs:

Z = Zp + Za

where Zp is a GP in the span of the Fourier basis

$$B(t) = (\sin(t), \cos(t), \dots, \sin(nt), \cos(nt))^{\top}$$

Property

It can be proved that the kernels of Zp and Za are

$$k_p(x, y) = B(x)^{\top} G^{-1} B(y), \qquad k_a(x, y) = k(x, y) - k_p(x, y)$$

where G is the Gram matrix associated to B in the RKHS.

49 / 59

slide-52
SLIDE 52

As previously, a decomposition of the kernel comes with a decomposition of the model:

$$m(t) = \big(k_p(t, X) + k_a(t, X)\big)\, k(X, X)^{-1} F = \underbrace{k_p(t, X)\, k(X, X)^{-1} F}_{\text{periodic sub-model } m_p} + \underbrace{k_a(t, X)\, k(X, X)^{-1} F}_{\text{aperiodic sub-model } m_a}$$

and we can associate a prediction variance to the sub-models:

$$v_p(t) = k_p(t, t) - k_p(t, X)\, k(X, X)^{-1} k_p(X, t)$$

$$v_a(t) = k_a(t, t) - k_a(t, X)\, k(X, X)^{-1} k_a(X, t)$$

50 / 59

slide-53
SLIDE 53

Example

For the observations shown previously we obtain:

[Figure: the model shown as the sum of two sub-model panels.]

Can we do any better?

51 / 59

slide-54
SLIDE 54

Initially, the kernels are parametrised by 2 variables, k(x, y, σ², θ), but writing k as a sum allows the parameters of the sub-kernels to be tuned independently.

Let k∗ be defined as

$$k^*(x, y, \sigma_p^2, \sigma_a^2, \theta_p, \theta_a) = k_p(x, y, \sigma_p^2, \theta_p) + k_a(x, y, \sigma_a^2, \theta_a)$$

Furthermore, we include a 5th parameter in k∗ accounting for the period by changing the Fourier basis:

$$B_\omega(t) = (\sin(\omega t), \cos(\omega t), \dots, \sin(n\omega t), \cos(n\omega t))^{\top}$$

52 / 59

slide-55
SLIDE 55

MLE of the 5 parameters of k∗ gives:

[Figure: the model shown as the sum of two sub-model panels.]

We will now illustrate the use of these kernels for gene expression analysis.

53 / 59

slide-56
SLIDE 56

We can apply this method to study the circadian rhythm in organisms. We used Arabidopsis data from Edward 2006.

The dimension of the data is:

  • 22810 genes
  • 13 time points

Edward 2006 gives a list of the 3504 most periodically expressed genes. The comparison with our approach gives:

  • 21767 genes with the same label (2461 per. and 19306 non-per.)
  • 1043 genes with different labels

54 / 59

slide-57
SLIDE 57

Let’s look at genes with different labels:

[Figure: expression profiles (time axis from 30 to 70) for genes At1g60810, At4g10040, At1g06290, At5g48900, At5g41480, At3g08000, At3g03900 and At2g36400. Legend: periodic for Edward; periodic for our approach.]

55 / 59

slide-59
SLIDE 59

Small recap

We have seen that

  • Kernels have a huge impact on the model.
  • They have to reflect the prior belief on the function to approximate.
  • Kernels can (and should) be tailored to the problem at hand.
  • Although a direct proof of the positive definiteness of a function is often intractable, Bochner's theorem makes it possible to build kernels from their power spectrum.

57 / 59

slide-60
SLIDE 60

Various operations can be applied to kernels while keeping positive semi-definiteness:

Making new from old

  • sum
  • product
  • composition with a function
  • these can be combined

Linear operator

If we have a linear operator that transforms any function into a function satisfying the desired property, it is possible to build a GP fulfilling the requirements.

58 / 59

slide-61
SLIDE 61
  • C. E. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2006.
  • A. Berlinet and C. Thomas-Agnan, RKHS in probability and statistics, Kluwer academic, 2004.
  • N. Durrande, D. Ginsbourger, O. Roustant, Additive covariance kernels for high-dimensional Gaussian process modeling, AFST 2012.
  • N. Durrande, D. Ginsbourger, O. Roustant, L. Carraro, ANOVA kernels and RKHS of zero mean functions for model-based sensitivity analysis, JMA 2013.
  • N. Durrande, J. Hensman, M. Rattray, N. D. Lawrence, Detecting periodicities with Gaussian processes, PeerJ Computer Science 2016.
  • D. Ginsbourger, X. Bay, L. Carraro and O. Roustant, Argumentwise invariant kernels for the approximation of invariant functions, AFST 2012.

59 / 59