SLIDE 1

Gaussian Process Summer School

Kernel Design

Nicolas Durrande – PROWLER.io (nicolas@prowler.io)

Sheffield, September 2018

SLIDE 2

Introduction

SLIDE 3

We have seen during the introduction lectures that the distribution of a GP Z depends on two functions:

  • the mean m(x) = E(Z(x))
  • the covariance k(x, x′) = cov(Z(x), Z(x′))

In this talk, we will focus on the covariance function, which is often called the kernel.

SLIDE 4

Given some data, the conditional distribution is still Gaussian:

m(x) = E(Z(x) | Z(X) + ε = F) = k(x, X) (k(X, X) + τ²I)⁻¹ F
c(x, x′) = cov(Z(x), Z(x′) | Z(X) + ε = F) = k(x, x′) − k(x, X) (k(X, X) + τ²I)⁻¹ k(X, x′)

It can be represented as a mean function with confidence intervals.

[Figure: posterior mean of Z(x) | Z(X) = F over x ∈ [0, 1], with confidence intervals.]
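A minimal numpy sketch of these two conditioning formulas; the kernel, the data (X, F) and the noise variance τ² are illustrative choices, not taken from the slides:

```python
import numpy as np

def kernel(x, y, variance=1.0, lengthscale=0.2):
    """Squared-exponential kernel, broadcasting over 1D arrays."""
    return variance * np.exp(-0.5 * (x[:, None] - y[None, :])**2 / lengthscale**2)

X = np.array([0.1, 0.3, 0.5, 0.9])   # observation locations (illustrative)
F = np.array([0.2, 0.8, 0.3, -0.4])  # observed values (illustrative)
tau2 = 1e-2                          # observation noise variance
x = np.linspace(0, 1, 200)           # prediction grid

KXX = kernel(X, X) + tau2 * np.eye(len(X))
KxX = kernel(x, X)

m = KxX @ np.linalg.solve(KXX, F)                     # posterior mean m(x)
c = kernel(x, x) - KxX @ np.linalg.solve(KXX, KxX.T)  # posterior covariance c(x, x')
ci = 1.96 * np.sqrt(np.diag(c))                       # pointwise confidence half-width
```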

SLIDE 5

What is a kernel?

SLIDE 6

Let Z be a random process with kernel k. Some properties of kernels can be obtained directly from their definition.

Example

k(x, x) = cov(Z(x), Z(x)) = var(Z(x)) ≥ 0 ⇒ k(x, x) is positive.
k(x, y) = cov(Z(x), Z(y)) = cov(Z(y), Z(x)) = k(y, x) ⇒ k is symmetric.

We can obtain a sharper result...

SLIDE 7

We introduce the random variable T = Σᵢ₌₁ⁿ aᵢ Z(xᵢ), where n, the aᵢ and the xᵢ are arbitrary. Computing the variance of T gives:

var(T) = cov(Σᵢ aᵢ Z(xᵢ), Σⱼ aⱼ Z(xⱼ)) = Σᵢ Σⱼ aᵢ aⱼ cov(Z(xᵢ), Z(xⱼ)) = Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0

Since a variance is non-negative, this inequality holds for any arbitrary n, aᵢ and xᵢ.

Definition

The functions k satisfying

Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0   for all n ∈ ℕ, for all xᵢ ∈ D, for all aᵢ ∈ ℝ

are called positive semi-definite functions.
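This definition is easy to check numerically. A small sketch, assuming the squared-exponential kernel from the earlier example, drawing random n, aᵢ and xᵢ and verifying that the quadratic form is never negative:

```python
import numpy as np

def kernel(x, y, variance=1.0, lengthscale=0.2):
    return variance * np.exp(-0.5 * (x[:, None] - y[None, :])**2 / lengthscale**2)

rng = np.random.default_rng(0)
for _ in range(1000):
    n = rng.integers(1, 20)
    x = rng.uniform(0, 1, n)
    a = rng.normal(size=n)
    quad = a @ kernel(x, x) @ a   # sum_i sum_j a_i a_j k(x_i, x_j)
    assert quad >= -1e-10         # non-negative, up to round-off
```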

SLIDE 8

We have just seen:

k is a covariance ⇒ k is a symmetric positive semi-definite function.

The converse is also true:

Theorem (Loève)

k corresponds to the covariance of a GP ⇔ k is a symmetric positive semi-definite function.

SLIDE 9

Proving that a function is psd is often difficult. However, there are many functions that have already been proven to be psd:

  • squared exp.: k(x, y) = σ² exp(−(x − y)² / (2θ²))
  • Matérn 5/2: k(x, y) = σ² (1 + √5 |x − y|/θ + 5|x − y|²/(3θ²)) exp(−√5 |x − y|/θ)
  • Matérn 3/2: k(x, y) = σ² (1 + √3 |x − y|/θ) exp(−√3 |x − y|/θ)
  • exponential: k(x, y) = σ² exp(−|x − y|/θ)
  • Brownian: k(x, y) = σ² min(x, y)
  • white noise: k(x, y) = σ² δ_{x,y}
  • constant: k(x, y) = σ²
  • linear: k(x, y) = σ² xy

When k is a function of x − y, the kernel is called stationary. σ² is called the variance and θ the lengthscale. (A few direct numpy implementations are sketched below.)
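A sketch of some of these kernels written directly as numpy functions (parameter defaults are arbitrary):

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(x - y)**2 / (2 * theta**2))

def exponential(x, y, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-np.abs(x - y) / theta)

def matern32(x, y, sigma2=1.0, theta=0.2):
    r = np.abs(x - y) / theta
    return sigma2 * (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)

def matern52(x, y, sigma2=1.0, theta=0.2):
    r = np.abs(x - y) / theta
    return sigma2 * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def brownian(x, y, sigma2=1.0):
    return sigma2 * np.minimum(x, y)

# All of them broadcast, so a Gram matrix is simply:
x = np.linspace(0, 1, 5)
K = matern52(x[:, None], x[None, :])
```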

SLIDE 10

Examples of kernels in gpflow:

[Figure: k(x, 0) for the GPflow kernels Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic and ArcCosine, and k(x, 1) for the Linear and Polynomial kernels.]
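A sketch of how such k(x, 0) curves can be computed, assuming GPflow 2's callable-kernel API (kernel names as in gpflow.kernels; RBF is an alias for SquaredExponential):

```python
import numpy as np
import gpflow

x = np.linspace(-1, 1, 200).reshape(-1, 1)
zero = np.zeros((1, 1))

kernels = [gpflow.kernels.Matern12(), gpflow.kernels.Matern32(),
           gpflow.kernels.Matern52(), gpflow.kernels.SquaredExponential()]

for k in kernels:
    kx0 = k(x, zero).numpy()   # column of k(x, 0) values
    print(type(k).__name__, kx0[:3, 0])
```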

SLIDE 11

Associated samples

[Figure: sample paths associated with each kernel above: Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic, Linear, Polynomial, ArcCosine.]
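A minimal numpy sketch of how such prior samples can be drawn (kernel and grid are illustrative): build the Gram matrix, add a small jitter, and sample a multivariate normal.

```python
import numpy as np

def matern32(x, y, sigma2=1.0, theta=0.2):
    r = np.abs(x - y) / theta
    return sigma2 * (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)

x = np.linspace(0, 1, 300)
K = matern32(x[:, None], x[None, :]) + 1e-10 * np.eye(len(x))  # jitter for stability
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K, size=5)  # 5 sample paths
```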

SLIDE 12

For a few kernels, it is possible to prove that they are psd directly from the definition:

  • k(x, y) = δ_{x,y}
  • k(x, y) = 1

For most kernels, a direct proof from the definition is not possible. The following theorem is helpful for stationary kernels:

Theorem (Bochner)

A continuous stationary function k(x, y) = k̃(|x − y|) is positive definite if and only if k̃ is the Fourier transform of a finite positive measure:

k̃(t) = ∫_ℝ exp(−iωt) dµ(ω)

SLIDE 13

Example

We consider a measure with a box-shaped (uniform) density. Its Fourier transform gives k̃(t) = sin(t)/t:

[Figure: the measure µ(ω) and its Fourier transform k̃(t) = sin(t)/t.]

As a consequence, k(x, y) = sin(x − y)/(x − y) is a valid covariance function.
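A numerical sanity check (not a proof): the Gram matrix of this sinc kernel on random points should have no significantly negative eigenvalue.

```python
import numpy as np

def sinc_kernel(x, y):
    # np.sinc(t) = sin(pi t) / (pi t), so this evaluates sin(x - y) / (x - y)
    return np.sinc((x - y) / np.pi)

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 200)
K = sinc_kernel(x[:, None], x[None, :])
print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off
```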

SLIDE 14

Usual kernels

Bochner's theorem can be used to prove the positive definiteness of many usual stationary kernels:

  • The Gaussian is the Fourier transform of itself ⇒ it is psd.
  • Matérn kernels are the Fourier transforms of 1/(1 + ω²)ᵖ ⇒ they are psd.

SLIDE 15

Unusual kernels

Taking the inverse Fourier transform of a (symmetrised) sum of Gaussians gives the spectral mixture kernel (A. Wilson, ICML 2013):

[Figure: a spectral measure µ(ω) made of Gaussian bumps and, after the Fourier transform F, the resulting kernel k̃(t).]

The obtained kernel is parametrised by its spectrum.

SLIDE 16

Unusual kernels

The sample paths have the following shape:

[Figure: sample paths from a GP with a spectral mixture kernel.]

SLIDE 17

Choosing the appropriate kernel

SLIDE 18

Changing the kernel has a huge impact on the model:

[Figures: GP models of the same data with a Gaussian kernel (left) and an exponential kernel (right).]

SLIDE 19

This is because changing the kernel implies changing the prior

[Figures: prior samples with a Gaussian kernel (left) and an exponential kernel (right).]

SLIDE 20

In order to choose a kernel, one should gather all possible information about the function to approximate:

  • Is it stationary?
  • Is it differentiable? What is its regularity?
  • Do we expect particular trends?
  • Do we expect particular patterns (periodicity, cycles, additivity)?

Kernels often include rescaling parameters: θ for the x axis (the lengthscale) and σ for the y axis (σ² often corresponds to the GP variance). They can be tuned by maximizing the likelihood or by minimizing the prediction error (a likelihood-based sketch follows below).
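A sketch of likelihood-based tuning, assuming GPflow 2's GPR model and Scipy optimizer wrapper (data and kernel choice are illustrative):

```python
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 20).reshape(-1, 1)
Y = np.sin(6 * X) + 0.1 * rng.normal(size=(20, 1))

kernel = gpflow.kernels.Matern52(variance=1.0, lengthscales=0.5)
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

opt = gpflow.optimizers.Scipy()
opt.minimize(model.training_loss, model.trainable_variables)  # maximum likelihood
gpflow.utilities.print_summary(model)  # fitted variance, lengthscale and noise
```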

SLIDE 21

It is common to try various kernels and to assess the model accuracy. The idea is to compare some model predictions against actual values:

  • on a test set, or
  • using leave-one-out.

Two (ideally three) things should be checked:

  • Is the mean accurate (MSE, Q²)?
  • Do the confidence intervals make sense?
  • Are the predicted covariances right?

Furthermore, it is often interesting to try some input remapping such as x → log(x), x → exp(x), ...
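A rough numpy sketch of leave-one-out validation of the GP mean predictor (kernel, noise level and data are illustrative):

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, theta=0.2):
    return sigma2 * np.exp(-(x[:, None] - y[None, :])**2 / (2 * theta**2))

def loo_scores(kern, X, F, tau2=1e-2):
    n, preds = len(X), np.empty(len(X))
    for i in range(n):
        m = np.arange(n) != i                       # leave observation i out
        KXX = kern(X[m], X[m]) + tau2 * np.eye(n - 1)
        preds[i] = (kern(X[i:i+1], X[m]) @ np.linalg.solve(KXX, F[m])).item()
    mse = np.mean((preds - F)**2)
    return mse, 1 - mse / np.var(F)                 # MSE and Q2

X = np.linspace(0, 1, 15)
F = np.sin(6 * X)
print(loo_scores(sq_exp, X, F))
```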

SLIDE 22

Making new from old

SLIDE 23

Making new from old: kernels can be

Summed together
  ◮ on the same space: k(x, y) = k1(x, y) + k2(x, y)
  ◮ on the tensor space: k(x, y) = k1(x1, y1) + k2(x2, y2)

Multiplied together
  ◮ on the same space: k(x, y) = k1(x, y) × k2(x, y)
  ◮ on the tensor space: k(x, y) = k1(x1, y1) × k2(x2, y2)

Composed with a function
  ◮ k(x, y) = k1(f(x), f(y))

All these operations preserve positive definiteness. How can this be useful? (A small GPflow sketch follows below.)
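A small GPflow 2 sketch of these operations: kernel objects can be combined with + and *, and composition with a function can be done by evaluating a kernel on transformed inputs (the warped_gram helper is illustrative, not a GPflow built-in):

```python
import numpy as np
import gpflow

k_sum = gpflow.kernels.Matern12() + gpflow.kernels.Linear()             # sum on the same space
k_prod = gpflow.kernels.SquaredExponential() * gpflow.kernels.Cosine()  # product on the same space

def warped_gram(k1, f, x, y):
    """Gram matrix of k(x, y) = k1(f(x), f(y))."""
    return k1(f(x), f(y)).numpy()

x = np.linspace(0.1, 1, 50).reshape(-1, 1)
K = warped_gram(gpflow.kernels.Matern32(), lambda t: 1.0 / t, x, x)
```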

SLIDE 24

Sum of kernels over the same input space

Property

k(x, y) = k1(x, y) + k2(x, y) is a valid covariance structure. This can be proved directly from the p.s.d. definition.

Example

[Figure: a Matern12 kernel k(x, ·) plus a Linear kernel k(x, ·) equals their sum.]

SLIDE 25

Sum of kernels over the same input space

Z ∼ N(0, k1 + k2) can be seen as Z = Z1 + Z2, where Z1 and Z2 are independent with Z1 ∼ N(0, k1) and Z2 ∼ N(0, k2), so that k(x, y) = k1(x, y) + k2(x, y).

Example

[Figure: sample paths Z1(x) + Z2(x) = Z(x).]

SLIDE 26

Sum of kernels over the same space

Example (The Mauna Loa observatory dataset)

This famous dataset compiles the monthly CO2 concentration in Hawaii since 1958.

[Figure: monthly CO2 concentration from 1958 onwards.]

Let’s try to predict the concentration for the next 20 years.

SLIDE 27

Sum of kernels over the same space

We first consider a squared-exponential kernel: k(x, y) = σ² exp(−(x − y)²/θ²)

[Figure: model predictions up to 2040 with this single squared-exponential kernel.]

The results are terrible!

SLIDE 28

Sum of kernels over the same space

What happens if we sum two kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)

[Figure: model predictions up to 2040 with the sum of two squared-exponential kernels.]

SLIDE 29

Sum of kernels over the same space

What happens if we sum two kernels? k(x, y) = k_rbf1(x, y) + k_rbf2(x, y)

[Figure: model predictions up to 2040 with the sum of two squared-exponential kernels.]

The model is drastically improved!

SLIDE 30

Sum of kernels over the same space

We can try the following kernel: k(x, y) = σ₀² x²y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)

[Figure: model predictions up to 2040 with this kernel.]

SLIDE 31

Sum of kernels over the same space

We can try the following kernel: k(x, y) = σ₀² x²y² + k_rbf1(x, y) + k_rbf2(x, y) + k_per(x, y)

[Figure: model predictions up to 2040 with this kernel.]

Once again, the model is significantly improved.

SLIDE 32

Sum of kernels over tensor space

Property

k(x, y) = k1(x1, y1) + k2(x2, y2) is a valid covariance structure.

[Figure: k1(x1, ·) over [0, 1]² plus k2(x2, ·) over [0, 1]² equals the additive kernel k(x, ·).]

Remark: From a GP point of view, k is the kernel of Z(x) = Z1(x1) + Z2(x2)

SLIDE 33

Sum of kernels over tensor space

We can have a look at a few sample paths from Z:

[Figure: three sample paths of Z over [0, 1]².]

⇒ They are additive (up to a modification).

Additive kernels over tensor spaces are very useful for
  • approximating additive functions,
  • building models over high-dimensional input spaces.

SLIDE 34

Sum of kernels over tensor space

Remarks

It is straightforward to show that the mean predictor is additive

m(x) = (k1(x, X) + k2(x, X)) k(X, X)⁻¹ F
     = k1(x1, X1) k(X, X)⁻¹ F + k2(x2, X2) k(X, X)⁻¹ F
     = m1(x1) + m2(x2)

⇒ The model shares the prior behaviour (additivity).

The sub-models can be interpreted as GP regression models with observation noise:

m1(x1) = E( Z1(x1) | Z1(X1) + Z2(X2) = F )
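A numpy sketch of this additive decomposition with two 1D squared-exponential sub-kernels acting on x1 and x2 (toy data, small jitter in place of observation noise):

```python
import numpy as np

def se(a, b, theta=0.3):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * theta**2))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (30, 2))                     # 2D inputs
F = np.sin(6 * X[:, 0]) + X[:, 1]**2               # an additive test function
K = se(X[:, 0], X[:, 0]) + se(X[:, 1], X[:, 1]) + 1e-6 * np.eye(30)
alpha = np.linalg.solve(K, F)

x = np.linspace(0, 1, 100)
m1 = se(x, X[:, 0]) @ alpha                        # sub-model m1(x1)
m2 = se(x, X[:, 1]) @ alpha                        # sub-model m2(x2)
# The full mean predictor at (x1, x2) is m1(x1) + m2(x2).
```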

SLIDE 35

Sum of kernels over tensor space

Remark

The prediction variance has interesting features

[Figure: prediction variance over (x1, x2) ∈ [0, 1]² with a kernel product (left) and with a kernel sum (right).]

SLIDE 36

Sum of kernels over tensor space

This property can be used to construct a design of experiments that covers the space with only cst × d points.

[Figure: prediction variance over (x1, x2) ∈ [0, 1]² for such a design.]

SLIDE 37

Product over the same space

Property

k(x, y) = k1(x, y) × k2(x, y) is a valid covariance structure.

Example

We consider the product of a squared-exponential kernel with a cosine kernel:

[Figure: squared-exponential kernel × cosine kernel = their product.]

SLIDE 38

Product over the tensor space

Property

k(x, y) = k1(x1, y1) × k2(x2, y2) is a valid covariance structure.

Example

We multiply two squared-exponential kernels:

[Figure: k1(x1, ·) × k2(x2, ·) = k(x, ·) over [0, 1]².]

A short calculation shows that we obtain the usual 2D squared-exponential kernel.

SLIDE 39

Composition with a function

Property

Let k1 be a kernel over D1 × D1 and f be an arbitrary function D → D1, then k(x, y) = k1(f (x), f (y)) is a kernel over D × D.

Proof: Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) = Σᵢ Σⱼ aᵢ aⱼ k1(f(xᵢ), f(xⱼ)) = Σᵢ Σⱼ aᵢ aⱼ k1(yᵢ, yⱼ) ≥ 0, where yᵢ = f(xᵢ).

Remarks:
  • k corresponds to the covariance of Z(x) = Z1(f(x)).
  • This can be seen as a (nonlinear) rescaling of the input space.

SLIDE 40

Example

We consider f(x) = 1/x and a Matérn 3/2 kernel k1(x, y) = (1 + |x − y|) exp(−|x − y|). We obtain:

[Figures: the resulting kernel k(x, y) = k1(1/x, 1/y) and associated sample paths over [0, 1].]
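A numpy sketch of this warping: evaluate the Matérn 3/2 kernel on f(x) = 1/x and draw sample paths (grid and jitter values are illustrative):

```python
import numpy as np

def matern32(a, b):
    r = np.abs(a[:, None] - b[None, :])
    return (1 + r) * np.exp(-r)

def f(t):
    return 1.0 / t

x = np.linspace(0.05, 1, 300)                      # avoid x = 0, where f blows up
K = matern32(f(x), f(x)) + 1e-10 * np.eye(len(x))  # k(x, y) = k1(f(x), f(y))
samples = np.random.default_rng(2).multivariate_normal(np.zeros(len(x)), K, size=3)
```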

SLIDE 41

All these transformations can be combined!

Example

k(x, y) = f(x) f(y) k1(x, y) is a valid kernel. This can be illustrated with f(x) = 1/x and k1(x, y) = (1 + |x − y|) exp(−|x − y|):

[Figures: the resulting kernel and associated sample paths over [0, 1].]

SLIDE 42

Effect of linear operators

SLIDE 43

Effect of a linear operator

Property (Ginsbourger 2013)

Let L be a linear operator that commutes with the covariance, then k(x, y) = Lx(Ly(k1(x, y))) is a kernel.

Example

We want to approximate a function [0, 1] → ℝ that is symmetric with respect to 0.5. We will consider two linear operators:

L1 : f(x) → f(x) if x < 0.5, f(1 − x) if x ≥ 0.5

L2 : f(x) → (f(x) + f(1 − x)) / 2
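A numpy sketch of the corresponding kernels k_L(x, y) = Lx Ly k(x, y): for L1 this amounts to composing with the fold s(x) = min(x, 1 − x), and for L2 to averaging the four reflected evaluations (base kernel and grid are illustrative):

```python
import numpy as np

def k(a, b, theta=0.2):                                   # base squared-exponential kernel
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * theta**2))

def s(x):                                                 # fold around 0.5, used by L1
    return np.where(x < 0.5, x, 1 - x)

def k_L1(x, y):
    return k(s(x), s(y))

def k_L2(x, y):
    return (k(x, y) + k(x, 1 - y) + k(1 - x, y) + k(1 - x, 1 - y)) / 4

x = np.linspace(0, 1, 200)
rng = np.random.default_rng(3)
for kern in (k_L1, k_L2):
    K = kern(x, x) + 1e-10 * np.eye(len(x))
    path = rng.multivariate_normal(np.zeros(len(x)), K)   # a symmetric sample path
```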

SLIDE 44

Effect of a linear operator

Example

Associated sample paths are:

k1 = L1(L1(k)): [Figure: sample paths over [0, 1], symmetric about 0.5.]

k2 = L2(L2(k)): [Figure: sample paths over [0, 1], symmetric about 0.5.]

The differentiability is not always respected!

SLIDE 45

Effect of a linear operator

These linear operators are projections onto a space of symmetric functions:

[Diagram: the space H, the subspace Hsym of symmetric functions, and the projections L1 f and L2 f of a function f.]

What about the optimal projection? ⇒ This can be difficult... but it raises interesting questions!

SLIDE 46

Application: Periodicity detection

SLIDE 47

Periodicity detection

We will now discuss the detection of periodicity. Given a few observations, can we extract the periodic part of a signal?

[Figure: a few observations of a signal.]

SLIDE 48

As previously, we will build a decomposition of the process into two independent GPs: Z = Zp + Za, where Zp is a GP in the span of the Fourier basis B(t) = (sin(t), cos(t), . . . , sin(nt), cos(nt))ᵀ.

Property

It can be proved that the kernels of Zp and Za are

kp(x, y) = B(x)ᵀ G⁻¹ B(y)
ka(x, y) = k(x, y) − kp(x, y)

where G is the Gram matrix associated with B in the RKHS.

SLIDE 49

As previously, the decomposition of the kernel comes with a decomposition of the model:

m(x) = (kp(x, X) + ka(x, X)) k(X, X)⁻¹ F
     = kp(x, X) k(X, X)⁻¹ F   (periodic sub-model mp)
     + ka(x, X) k(X, X)⁻¹ F   (aperiodic sub-model ma)

and we can associate a prediction variance to the sub-models:

vp(x) = kp(x, x) − kp(x, X) k(X, X)⁻¹ kp(X, x)
va(x) = ka(x, x) − ka(x, X) k(X, X)⁻¹ ka(X, x)
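A numpy sketch of these sub-model formulas. As a stand-in for the RKHS-based decomposition above, kp is taken here to be a simple cosine kernel and ka a squared-exponential kernel; the formulas apply to any decomposition k = kp + ka:

```python
import numpy as np

def kp(a, b, period=2 * np.pi):                    # crude periodic (cosine) kernel
    return np.cos(2 * np.pi * (a[:, None] - b[None, :]) / period)

def ka(a, b, theta=2.0):                           # aperiodic part: squared exponential
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * theta**2))

rng = np.random.default_rng(4)
X = rng.uniform(0, 15, 30)
F = np.sin(X) + 0.3 * X / 15 + 0.05 * rng.normal(size=30)

K = kp(X, X) + ka(X, X) + 1e-6 * np.eye(30)
alpha = np.linalg.solve(K, F)
Kinv = np.linalg.inv(K)

x = np.linspace(0, 15, 300)
mp = kp(x, X) @ alpha                                            # periodic sub-model mean
ma = ka(x, X) @ alpha                                            # aperiodic sub-model mean
vp = np.diag(kp(x, x)) - np.sum((kp(x, X) @ Kinv) * kp(x, X), axis=1)
va = np.diag(ka(x, x)) - np.sum((ka(x, X) @ Kinv) * ka(x, X), axis=1)
```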

SLIDE 50

Example

For the observations shown previously we obtain:

[Figure: full model || periodic sub-model + aperiodic sub-model.]

Can we do any better?

SLIDE 51

Initially, the kernel is parametrised by 2 variables, k(x, y, σ², θ), but writing k as a sum allows us to tune the parameters of the sub-kernels independently.

Let k∗ be defined as

k∗(x, y, σₚ², σₐ², θₚ, θₐ) = kp(x, y, σₚ², θₚ) + ka(x, y, σₐ², θₐ)

Furthermore, we include a 5th parameter in k∗ accounting for the period, by changing the Fourier basis:

Bω(t) = (sin(ωt), cos(ωt), . . . , sin(nωt), cos(nωt))ᵀ

SLIDE 52

MLE of the 5 parameters of k∗ gives:

[Figure: full model || periodic sub-model + aperiodic sub-model, after MLE of the 5 parameters.]

We will now illustrate the use of these kernels for gene expression analysis.

SLIDE 53

We can apply this method to study the circadian rhythm in organisms. We used Arabidopsis data from Edward 2006.

The dimensions of the data are: 22810 genes, 13 time points.

Edward 2006 gives a list of the 3504 most periodically expressed genes. The comparison with our approach gives:

  • 21767 genes with the same label (2461 periodic and 19306 non-periodic)
  • 1043 genes with different labels

SLIDE 54

Let’s look at genes with different labels:

[Figure: expression profiles of genes with differing labels (At1g60810, At4g10040, At1g06290, At5g48900, At5g41480, At3g08000, At3g03900, At2g36400), with the legend "periodic for Edward" / "periodic for our approach".]

SLIDE 55

Conclusion

SLIDE 56

Small recap

We have seen that:

  • Kernels have a huge impact on the model.
  • They have to reflect the prior belief on the function to approximate.
  • Kernels can (and should) be tailored to the problem at hand.
  • Although a direct proof of the positive definiteness of a function is often intractable, Bochner's theorem allows us to build kernels from their power spectrum.

SLIDE 57

Various operations can be applied to kernels while keeping p.s.d.-ness:

Making new from old
  • sum
  • product
  • composition with a function
  • these can be combined

Linear operators
  • If we have a linear operator that transforms any function into a function satisfying the desired property, it is possible to build a GP fulfilling the requirements.

SLIDE 58
  • C. E. Rasmussen and C. Williams

Gaussian Processes for Machine Learning, The MIT Press, 2006.

  • A. Berlinet and C. Thomas-Agnan

RKHS in probability and statistics, Kluwer academic, 2004.

  • N. Durrande, D. Ginsbourger, O. Roustant

Additive covariance kernels for high-dimensional Gaussian process modeling, AFST 2012.

  • N. Durrande, J. Hensman, M. Rattray, N. D. Lawrence

Detecting periodicities with Gaussian processes. PeerJ Computer Science 2016.

  • D. Ginsbourger, X. Bay, L. Carraro and O. Roustant

Argumentwise invariant kernels for the approximation of invariant functions, AFST 2012.
