SLIDE 1

An introduction to Gaussian processes

Oliver Stegle and Karsten Borgwardt Machine Learning and Computational Biology Research Group, Max Planck Institute for Biological Cybernetics and Max Planck Institute for Developmental Biology, Tübingen

SLIDE 2

Motivation

Why Gaussian processes?

◮ So far: linear models with a finite number of basis functions, e.g. φ(x) = (1, x, x^2, . . . , x^K)
◮ Open questions:
  ◮ How to design a suitable basis?
  ◮ How many basis functions to pick?
◮ Gaussian processes: an accurate and flexible regression method yielding predictions alongside error bars.

[Figure: example regression dataset, Y against X]

SLIDE 5

Motivation

Further reading

◮ A comprehensive and very good introduction to Gaussian processes:
  C. E. Rasmussen and C. K. Williams, Gaussian Processes for Machine Learning
  ◮ Free download: http://www.gaussianprocess.org/gpml/
◮ A really good introductory video lecture to watch:
  http://videolectures.net/gpip06_mackay_gpb/
  ◮ Many ideas used in this course are borrowed from this lecture.

SLIDE 7

Intuitive approach

Outline

Motivation
Intuitive approach
Function space view
GP classification & other extensions
Summary

SLIDE 8

Intuitive approach

The Gaussian distribution

◮ Gaussian processes are merely based on the good old Gaussian:

  N(x | µ, K) = (1 / √|2π K|) · exp( −(1/2) (x − µ)^T K^{−1} (x − µ) )

◮ K is the covariance matrix or kernel matrix.
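Not part of the original slides: a small NumPy sketch of this density (the function name and the Cholesky-based evaluation are my own choices), handy for checking the later formulas numerically.

```python
import numpy as np

def gaussian_log_density(x, mu, K):
    """Log of N(x | mu, K) for a full covariance matrix K."""
    d = len(x)
    diff = x - mu
    # Cholesky factorisation gives a numerically stable determinant and solve
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, diff)           # alpha @ alpha = diff^T K^{-1} diff
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log |K|
    return -0.5 * (d * np.log(2 * np.pi) + log_det + alpha @ alpha)
```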
SLIDE 9

Intuitive approach

A 2D Gaussian

◮ Probability contour
◮ Samples

[Figure: contour and samples of a 2D Gaussian over (y1, y2)]

K = [1, 0.6; 0.6, 1]

SLIDE 11

Intuitive approach

A 2D Gaussian

Varying the covariance matrix

[Figure: contours and samples of (y1, y2) for three different covariance matrices]

K = [1, 0.14; 0.14, 1]      K = [1, 0.6; 0.6, 1]      K = [1, 0.9; 0.9, 1]

◮ The larger the off-diagonal entry, the more strongly y1 and y2 are correlated.
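A quick way to reproduce the effect of the off-diagonal entry (not from the slides; values and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
for c in (0.14, 0.6, 0.9):
    K = np.array([[1.0, c],
                  [c, 1.0]])
    # y ~ N(0, K): multiply standard normals by a Cholesky factor of K
    y = rng.standard_normal((2000, 2)) @ np.linalg.cholesky(K).T
    print(f"off-diagonal {c}: empirical correlation "
          f"{np.corrcoef(y[:, 0], y[:, 1])[0, 1]:.2f}")
```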

SLIDE 12

Intuitive approach

A 2D Gaussian

Inference

[Figure: the 2D Gaussian over (y1, y2); observing y1 constrains the plausible values of y2]

SLIDE 15

Intuitive approach

Inference

◮ Joint probability: p(y1, y2 | K) = N([y1, y2] | 0, K)
◮ Conditional probability:

  p(y2 | y1, K) = p(y1, y2 | K) / p(y1 | K) ∝ exp( −(1/2) [y1, y2] K^{−1} [y1, y2]^T )

◮ Completing the square yields a Gaussian with non-zero mean as the posterior for y2.
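The conditioning step can be checked numerically. This is a minimal sketch (not from the slides) using the K = [1, 0.6; 0.6, 1] example above, written with the standard Gaussian conditioning formulas rather than an explicit completion of the square:

```python
import numpy as np

K = np.array([[1.0, 0.6],
              [0.6, 1.0]])
y1_observed = 1.0

# For a zero-mean 2D Gaussian: y2 | y1 ~ N(K21/K11 * y1, K22 - K21^2/K11)
cond_mean = K[1, 0] / K[0, 0] * y1_observed
cond_var = K[1, 1] - K[1, 0] ** 2 / K[0, 0]
print(cond_mean, cond_var)   # 0.6 and 0.64 for this K
```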

SLIDE 16

Intuitive approach

Extending the idea to higher dimensions

◮ Let us interpret y1 and y2 as outputs in a regression setting.
◮ We can introduce an additional 3rd point.

[Figure: the outputs y1, y2 and a third point plotted against their inputs X]

◮ Now p([y1, y2, y3] | K3) = N([y1, y2, y3] | 0, K3), where K3 is now a 3 × 3 covariance matrix!

SLIDE 20

Intuitive approach

Constructing Covariance Matrices

◮ Analogously we can look at the joint probability for arbitrarily many points and obtain predictions.
◮ Issue: how to construct a good covariance matrix?
◮ A simple heuristic:

  K2 = [1, 0.6; 0.6, 1]      K3 = [1, 0.6, 0; 0.6, 1, 0.6; 0, 0.6, 1]

◮ Note:
  ◮ The ordering of the points y1, y2, y3 matters.
  ◮ Important to ensure that covariance matrices remain positive definite (matrix inversion).

SLIDE 23

Intuitive approach

Constructing Covariance Matrices

A general recipe

◮ Use a covariance function (kernel function) to construct K:

  Ki,j = k(xi, xj; θK)

◮ For example: the squared exponential covariance function embodies the belief that points further apart are less correlated:

  kSE(xi, xj; A, L) = A^2 · exp( −0.5 · (xi − xj)^2 / L^2 )

  A: overall correlation (amplitude); L: scaling parameter (smoothness)

◮ θK = {A, L} are called hyperparameters.
◮ We denote the covariance matrix for a set of inputs X = {x1, . . . , xN} as KX,X(θK).
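A minimal sketch of this recipe in NumPy (not from the slides; function names are my own): evaluate the covariance function at all pairs of inputs to build K.

```python
import numpy as np

def k_se(xi, xj, A=1.0, L=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return A**2 * np.exp(-0.5 * (xi - xj)**2 / L**2)

def cov_matrix(X, A=1.0, L=1.0):
    """Build K_{X,X}(theta) for a 1D array of inputs X."""
    X = np.asarray(X, dtype=float)
    return A**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / L**2)

K = cov_matrix(np.linspace(0, 10, 5), A=1.0, L=2.0)
print(np.round(K, 3))
```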

SLIDE 27

Intuitive approach

Constructing Covariance Matrices

GP samples using the squared exponential covariance function

[Figure: sampled points joined as function-like curves, for A=1, L=1; A=1, L=0.5; A=3, L=1]

10D Gaussian

SLIDE 28

Intuitive approach

Constructing Covariance Matrices

GP samples using the squared exponential covariance function

[Figure: samples drawn at 500 input locations now look like smooth functions, for A=1, L=1; A=1, L=0.5; A=3, L=1]

500D Gaussian
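These sample plots can be reproduced with a few lines of NumPy. The sketch below is my own illustration, not from the deck; it draws one function-like sample per hyperparameter setting at 500 input locations.

```python
import numpy as np

def cov_matrix(X, A, L):
    return A**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / L**2)

X = np.linspace(-6, 6, 500)                 # 500 input locations -> a 500D Gaussian
rng = np.random.default_rng(1)
for A, L in [(1, 1), (1, 0.5), (3, 1)]:
    K = cov_matrix(X, A, L) + 1e-8 * np.eye(len(X))   # small jitter for stability
    f = rng.multivariate_normal(np.zeros(len(X)), K)  # one sampled "function"
    print(A, L, f[:3])                                # plot f against X to see the curves
```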

SLIDE 29

Intuitive approach

Constructing Covariance Matrices

GP samples using the squared exponential covariance function

[Figure: the 2D Gaussian over (y1, y2) with samples marked]

Reminder: Every function line corresponds to a sample drawn from this 2D Gaussian!

SLIDE 30

Intuitive approach

Why this all works

◮ Consistency of the 10D and 500D Gaussian.
◮ A small quiz:
  ◮ Let y1, y2, y3 have covariance matrix

    K3 = [1, 0.5, 0; 0.5, 1, 0.5; 0, 0.5, 1]   with inverse   K3^{−1} = [1.5, −1, 0.5; −1, 2, −1; 0.5, −1, 1.5]

  ◮ Now focus on the variables y1, y2, integrating out y3. Which of the following statements is true?

    a) K2 = [1, 0.5; 0.5, 1]        b) K2^{−1} = [1.5, −1; −1, 2]
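A numerical check of the consistency property the quiz is probing (not in the slides): marginalising y3 out of the joint Gaussian simply drops its row and column from K3, which is statement (a); taking the corresponding block of the inverse, statement (b), gives something different.

```python
import numpy as np

K3 = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.5],
               [0.0, 0.5, 1.0]])

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(3), K3, size=200_000)

print(np.round(np.cov(samples[:, :2].T), 2))    # ~ [[1, 0.5], [0.5, 1]]: the K3 block
print(np.round(np.linalg.inv(K3)[:2, :2], 2))   # [[1.5, -1], [-1, 2]]: NOT the inverse of K2
print(np.round(np.linalg.inv(K3[:2, :2]), 2))   # the actual K2^{-1}
```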

SLIDE 32

Function space view

Outline

Motivation
Intuitive approach
Function space view
GP classification & other extensions
Summary

SLIDE 33

Function space view

So far:

  1. Joint Gaussian distribution over the set of all outputs y.
  2. Covariance function as a recipe to construct suitable covariance matrices from the corresponding inputs X.

SLIDE 34

Function space view

Function space view

The Gaussian process as a prior on functions

◮ Covariance function and hyperparameters reflect the prior belief about function smoothness, lengthscales etc.
◮ The general recipe allows a joint Gaussian to be constructed for an arbitrary selection of input locations X.

Prior on the (infinite-dimensional) function f(x):
  p(f(x)) = GP(f(x) | k)

Prior on a finite set of function values f = (f1, . . . , fN):
  p(f | X, θK) = N(f | 0, KX,X(θK))

SLIDE 35

Function space view

Noise-free observations

◮ Given noise-free training data D = {xn, fn}, n = 1, . . . , N
◮ Want to make predictions f⋆ at test points X⋆
◮ The joint distribution of f and f⋆ is

  p([f, f⋆] | X, X⋆, θK) = N( [f, f⋆] | 0, [KX,X, KX,X⋆; KX⋆,X, KX⋆,X⋆] )

  (All kernel matrices K depend on the hyperparameters θK, which are dropped for brevity.)

◮ Real data is rarely noise-free.

SLIDE 37

Function space view

Inference

◮ Given observed noisy data D = {X, y}, the joint probability over latent function values f and f⋆ given y is

  p([f, f⋆] | X, X⋆, y, θK, σ^2) ∝ N([f, f⋆] | 0, K) × ∏_{n=1}^{N} N(yn | fn, σ^2)
                                        (prior)              (likelihood)

  where K is the block matrix [KX,X, KX,X⋆; KX⋆,X, KX⋆,X⋆].

SLIDE 39

Function space view

Inference

◮ Applying “Gaussian calculus”, integrating out f yields

  p([y, f⋆] | X, X⋆, θK, σ^2) ∝ N( [y, f⋆] | 0, [KX,X + σ^2 I, KX,X⋆; KX⋆,X, KX⋆,X⋆] )

◮ Note: Assuming noisy instead of perfect (noise-free) observations merely corresponds to adding a diagonal component to the self-covariance KX,X.

SLIDE 41

Function space view

Making predictions

◮ The predictive distribution follows from the joint distribution by completing the square (conditioning):

  p([y, f⋆] | X, X⋆, θK, σ^2) ∝ N( [y, f⋆] | 0, [KX,X + σ^2 I, KX,X⋆; KX⋆,X, KX⋆,X⋆] )

◮ Gaussian predictive distribution for f⋆:

  p(f⋆ | X, y, X⋆, θK, σ^2) = N(f⋆ | µ⋆, Σ⋆)   with
  µ⋆ = KX⋆,X (KX,X + σ^2 I)^{−1} y
  Σ⋆ = KX⋆,X⋆ − KX⋆,X (KX,X + σ^2 I)^{−1} KX,X⋆
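Putting the predictive equations together, a compact NumPy sketch of GP regression (my own illustration, with an assumed squared exponential kernel and illustrative hyperparameters):

```python
import numpy as np

def se_kernel(X1, X2, A=1.0, L=1.0):
    """Squared exponential covariance between two 1D input arrays."""
    return A**2 * np.exp(-0.5 * (X1[:, None] - X2[None, :])**2 / L**2)

def gp_predict(X, y, X_star, A=1.0, L=1.0, sigma2=0.1):
    """Predictive mean and covariance of f* at X_star given noisy data (X, y)."""
    K = se_kernel(X, X, A, L) + sigma2 * np.eye(len(X))   # K_{X,X} + sigma^2 I
    K_star = se_kernel(X_star, X, A, L)                   # K_{X*,X}
    K_ss = se_kernel(X_star, X_star, A, L)                # K_{X*,X*}
    mu_star = K_star @ np.linalg.solve(K, y)
    Sigma_star = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mu_star, Sigma_star

# toy data
X = np.linspace(1, 10, 8)
y = np.sin(X) + 0.1 * np.random.default_rng(0).standard_normal(len(X))
mu, Sigma = gp_predict(X, y, np.linspace(0, 11, 50), L=2.0)
err_bars = 2 * np.sqrt(np.diag(Sigma))   # the "error bars" promised on the motivation slide
```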

SLIDE 43

Function space view

Making predictions

Example

[Figure: GP predictive mean with error bars on the example dataset, Y against X]

SLIDE 45

Function space view

Learning hyperparameters

  • 1. Fixed covariance matrix: p(y | K)
  • 2. Constructed covariance matrix: {K}i,j = k(xi, xj; θK)
  • 3. Can we learn the hyperparameters θK?
SLIDE 46

Function space view

Learning hyperparameters

◮ Formally we are interested in the posterior

  p(θK | D) ∝ p(y | X, θK) p(θK)

◮ Inference is analytically intractable!
◮ MAP estimate instead of a full posterior: set θK to the most probable hyperparameter settings:

  θ̂K = argmax_θK ln[ p(y | X, θK) p(θK) ]
      = argmax_θK [ ln N(y | 0, KX,X(θK) + σ^2 I) + ln p(θK) ]
      = argmax_θK [ −(1/2) log det(KX,X(θK) + σ^2 I) − (1/2) y^T (KX,X(θK) + σ^2 I)^{−1} y − (N/2) log 2π + ln p(θK) ]

◮ Optimization can be carried out using standard optimization techniques.
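A sketch of hyperparameter learning by maximising the log marginal likelihood with a standard optimiser (assuming a flat prior p(θK), so the MAP estimate reduces to maximum marginal likelihood; parameter names and starting values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_theta, X, y):
    """-ln N(y | 0, K_XX(A, L) + sigma^2 I); parameters are optimised in log space."""
    A, L, sigma = np.exp(log_theta)
    K = A**2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / L**2) + sigma**2 * np.eye(len(X))
    Lc = np.linalg.cholesky(K)
    alpha = np.linalg.solve(Lc, y)
    return (np.sum(np.log(np.diag(Lc)))          # 0.5 * log det
            + 0.5 * alpha @ alpha                # 0.5 * y^T K^{-1} y
            + 0.5 * len(X) * np.log(2 * np.pi))

X = np.linspace(1, 10, 20)
y = np.sin(X) + 0.1 * np.random.default_rng(0).standard_normal(len(X))
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.5]), args=(X, y))
print(np.exp(res.x))   # learned A, L, sigma
```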

SLIDE 49

Function space view

Choosing covariance functions

◮ The covariance function embodies the prior belief about functions.
◮ Example: linear regression

  yn = w xn + c + ψn

◮ The covariance function captures the covariation of the outputs (averaging over w, c and the noise ψ):

  k(xn, xn′) = ⟨yn yn′⟩ = ⟨(w xn + c + ψn)(w xn′ + c + ψn′)⟩
             = w^2 · xn xn′ + c^2   +   δn,n′ ψ^2
               (kernel part k(xn, xn′))    (noise)

SLIDE 50

Function space view

Choosing covariance functions

Multidimensional input space

◮ Generalise the squared exponential covariance function to multiple dimensions.

◮ 1 dimension:
  kSE(xi, xj; A, L) = A^2 exp( −0.5 · (xi − xj)^2 / L^2 )

◮ D dimensions:
  kSE(xi, xj; A, L) = A^2 exp( −0.5 · Σ_{d=1}^{D} (xi^d − xj^d)^2 / Ld^2 )

◮ Lengthscale parameters Ld denote the “relevance” of a particular data dimension.
◮ Large Ld correspond to irrelevant dimensions.
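A possible NumPy version of the D-dimensional kernel with one lengthscale per dimension (my own sketch; setting one lengthscale very large effectively switches that dimension off):

```python
import numpy as np

def se_kernel_ard(X1, X2, A=1.0, lengthscales=None):
    """Squared exponential kernel with one lengthscale L_d per input dimension."""
    L = np.asarray(lengthscales, dtype=float)
    diff = (X1[:, None, :] - X2[None, :, :]) / L        # differences scaled per dimension
    return A**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

X = np.random.default_rng(0).normal(size=(5, 3))        # 5 points in D = 3 dimensions
K = se_kernel_ard(X, X, A=1.0, lengthscales=[1.0, 2.0, 100.0])  # 3rd dimension ~ irrelevant
print(np.round(K, 3))
```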

SLIDE 52

Function space view

Choosing covariance functions

2D regression

[Figure: GP regression surface over two input dimensions X1 and X2, output Y]

SLIDE 54

Function space view

Choosing covariance functions

Any kernel will do

◮ Established kernels are all valid covariance functions, allowing for a wide range of possible input domains X:
  ◮ Graph kernels (molecules)
  ◮ Kernels defined on strings (DNA sequences)

SLIDE 55

Function space view

Choosing covariance functions

Combining existing covariance functions

◮ The sum of two covariance functions is itself a valid covariance function:
  kS(x, x′) = k1(x, x′) + k2(x, x′)
◮ The product of two covariance functions is itself a valid covariance function:
  kP(x, x′) = k1(x, x′) · k2(x, x′)
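A small check of these closure rules (illustrative code, not from the slides): combine two simple covariance functions and verify that the resulting Gram matrix stays positive semi-definite.

```python
import numpy as np

def se(L):
    return lambda x1, x2: np.exp(-0.5 * (x1 - x2) ** 2 / L**2)

def linear(x1, x2):
    return x1 * x2

# Sum and product of valid covariance functions are again valid covariance functions
k_sum = lambda x1, x2: se(2.0)(x1, x2) + linear(x1, x2)
k_prod = lambda x1, x2: se(2.0)(x1, x2) * linear(x1, x2)

X = np.linspace(1, 5, 4)
K = np.array([[k_sum(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K) >= -1e-10)    # all True: positive semi-definite, as required
```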

SLIDE 56

GP classification & other extensions

Outline

Motivation
Intuitive approach
Function space view
GP classification & other extensions
Summary

SLIDE 57

GP classification & other extensions

GPs for classification

◮ How to deal with binary observations?

[Figure: binary observations over two input dimensions X1 and X2, with output axis Y]

SLIDE 59

GP classification & other extensions

GPs for classification

Probit likelihood model

◮ Posterior with a general likelihood model:

  p(f | X, y, θK, σ^2) ∝ N(f | 0, KX,X(θK)) × ∏_{n=1}^{N} p(yn | fn)
                              (prior)              (likelihood)

◮ Classification: probit link model

  p(yn = 1 | fn) = 1 / (1 + exp(−fn))

SLIDE 60

GP classification & other extensions

GPs for classification

Inference

◮ Inference with a non-Gaussian likelihood is analytically intractable.
◮ Idea: approximate each of the true likelihood terms with a Gaussian, minimising

  KL[ N(f | 0, KX,X(θK)) × ∏_{n=1}^{N} p(yn | fn)  ||  N(f | 0, KX,X(θK)) × ∏_{n=1}^{N} N(fn | µ̃n, σ̃n) ]
          (prior × exact likelihood)                       (prior × Gaussian approximation)

◮ The KL divergence is a common measure of approximation accuracy:

  KL[P || Q] = ∫ P(θ) ln[ P(θ) / Q(θ) ] dθ

SLIDE 62

GP classification & other extensions

Robust regression

Regression with 15% outliers

[Figure: standard GP fit on data with 15% outliers, showing the mean and a ±2 standard deviation band]

SLIDE 63

GP classification & other extensions

Robust regression

Regression with 1% outliers

[Figure: standard GP fit on data with 1% outliers, showing the mean and a ±2 standard deviation band]

SLIDE 64

GP classification & other extensions

Robust regression

Mixture likelihood model

◮ Naive: filtering out the outliers.
◮ We would rather like the likelihood model to embody the belief that a fraction of datapoints is “useless”:

  p(yn | fn) = πok · N(yn | fn, σ^2) + (1 − πok) · N(yn | fn, σo^2)

  (σo^2: a much broader variance for the outlier component)
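One way to evaluate such a mixture likelihood in code (a sketch with assumed values for π_ok and the two variances, not taken from the slides):

```python
import numpy as np

def robust_log_likelihood(y, f, pi_ok=0.85, sigma=0.1, sigma_out=3.0):
    """Log of the two-component mixture likelihood: one narrow and one broad Gaussian."""
    def log_gauss(r, s):
        return -0.5 * (r / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
    return np.logaddexp(np.log(pi_ok) + log_gauss(y - f, sigma),
                        np.log(1 - pi_ok) + log_gauss(y - f, sigma_out))

print(robust_log_likelihood(0.05, 0.0))   # inlier: high likelihood
print(robust_log_likelihood(2.5, 0.0))    # outlier: explained by the broad component
```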

SLIDE 66

GP classification & other extensions

Robust regression

Mixture likelihood in action

Robust noise model

[Figure: GP fit with the mixture likelihood on the outlier data, showing the mean and a ±2 standard deviation band; the fit is no longer distorted by the outliers]

SLIDE 67

GP classification & other extensions

Why Gaussian processes and not something else?

◮ Tractable probabilistic model; uncertainty estimates.
◮ Equal or better performance than other methods.
◮ Many other approaches arise as special cases:
  ◮ Linear regression
  ◮ Splines
  ◮ Neural networks
◮ Kernel method; flexible choice of covariance functions.
◮ Major limitation: inversion of an N × N matrix; scaling O(N^3).

SLIDE 68

Summary

Outline

Motivation
Intuitive approach
Function space view
GP classification & other extensions
Summary

SLIDE 69

Summary

Summary

◮ The key ingredient of a Gaussian process is the covariance function: a recipe to construct covariance matrices.
◮ GP predictions boil down to conditioning joint Gaussian distributions.
◮ The most probable covariance function hyperparameters can be derived from the marginal likelihood.
◮ Non-Gaussian likelihood models allow for classification and robust regression, but require approximate inference techniques.
