An introduction to Gaussian processes
Oliver Stegle and Karsten Borgwardt
Machine Learning and Computational Biology Research Group, Max Planck Institute for Biological Cybernetics and Max Planck Institute for Developmental Biology, Tübingen
Motivation
Why Gaussian processes?
◮ So far: linear models with a finite number of basis functions, e.g. $\phi(x) = (1, x, x^2, \dots, x^K)$
◮ Open questions:
  ◮ How to design a suitable basis?
  ◮ How many basis functions to pick?
◮ Gaussian processes: an accurate and flexible regression method yielding predictions along with error bars.
[Figure: GP regression example, predictive mean with error bars (Y vs. X).]
Motivation
Further reading
◮ A comprehensive and very good introduction to Gaussian processes:
C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning.
◮ Free download: http://www.gaussianprocess.org/gpml/
◮ A really good introductory video lecture to watch:
http://videolectures.net/gpip06_mackay_gpb/
◮ Many ideas used in this course are borrowed from this lecture.
Outline
◮ Motivation
◮ Intuitive approach
◮ Function space view
◮ GP classification & other extensions
◮ Summary
Intuitive approach
The Gaussian distribution
◮ Gaussian processes are based on the good old Gaussian distribution:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{K}) = \frac{1}{\sqrt{|2\pi \mathbf{K}|}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top} \mathbf{K}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
◮ $\mathbf{K}$ is called the covariance matrix or kernel matrix.
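To make this concrete, here is a minimal NumPy sketch (the function and variable names are our own, not from the slides) of the basic operation behind everything that follows: drawing samples from a zero-mean multivariate Gaussian with a given covariance matrix.

```python
import numpy as np

def sample_gaussian(K, n_samples=5, seed=0):
    """Draw samples from N(0, K) via the Cholesky factorization K = L L^T."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(K)                 # lower-triangular factor of K
    Z = rng.standard_normal((K.shape[0], n_samples))
    return L @ Z                              # each column is one sample ~ N(0, K)

K = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(sample_gaussian(K, n_samples=3))
```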
Intuitive approach
A 2D Gaussian
◮ Probability contour and samples:
[Figure: probability contour and samples of the 2D Gaussian over $(y_1, y_2)$.]
$$\mathbf{K} = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}$$
Intuitive approach
A 2D Gaussian
Varying the covariance matrix
[Figure: samples of $(y_1, y_2)$ for three covariance matrices; larger off-diagonal entries give more strongly correlated samples.]
$$\mathbf{K} = \begin{pmatrix} 1 & 0.14 \\ 0.14 & 1 \end{pmatrix} \qquad \mathbf{K} = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix} \qquad \mathbf{K} = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}$$
Intuitive approach
A 2D Gaussian
Inference
[Figure: observing $y_1$ restricts the 2D Gaussian to a slice, narrowing the distribution over $y_2$.]
Intuitive approach
Inference
◮ Joint probability: $p(y_1, y_2 \mid \mathbf{K}) = \mathcal{N}([y_1, y_2] \mid \mathbf{0}, \mathbf{K})$
◮ Conditional probability:
$$p(y_2 \mid y_1, \mathbf{K}) = \frac{p(y_1, y_2 \mid \mathbf{K})}{p(y_1 \mid \mathbf{K})} \propto \exp\left(-\frac{1}{2}[y_1, y_2]\, \mathbf{K}^{-1} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)$$
◮ Completing the square yields a Gaussian with non-zero mean as the posterior for $y_2$.
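As a sketch of this computation (the standard Gaussian conditioning formulas in the 2×2 case; the naming is ours), conditioning the 2D Gaussian above on an observed $y_1$:

```python
import numpy as np

def condition_2d(K, y1):
    """Mean and variance of p(y2 | y1) for [y1, y2] ~ N(0, K), K a 2x2 matrix."""
    mean = K[1, 0] / K[0, 0] * y1             # non-zero posterior mean
    var = K[1, 1] - K[1, 0] ** 2 / K[0, 0]    # reduced posterior variance
    return mean, var

K = np.array([[1.0, 0.6],
              [0.6, 1.0]])
print(condition_2d(K, y1=1.5))                # -> (0.9, 0.64)
```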
Intuitive approach
Extending the idea to higher dimensions
◮ Let us interpret $y_1$ and $y_2$ as outputs in a regression setting.
◮ We can introduce an additional third point.
[Figure: three output values $y_1, y_2, y_3$ plotted against their inputs (Y vs. X).]
◮ Now $P([y_1, y_2, y_3] \mid \mathbf{K}_3) = \mathcal{N}([y_1, y_2, y_3] \mid \mathbf{0}, \mathbf{K}_3)$, where $\mathbf{K}_3$ is now a $3 \times 3$ covariance matrix!
Intuitive approach
Constructing Covariance Matrices
◮ Analogously, we can look at the joint probability of arbitrarily many points and obtain predictions.
◮ Issue: how do we construct a good covariance matrix?
◮ A simple heuristic:
$$\mathbf{K}_2 = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix} \qquad \mathbf{K}_3 = \begin{pmatrix} 1 & 0.6 & 0 \\ 0.6 & 1 & 0.6 \\ 0 & 0.6 & 1 \end{pmatrix}$$
◮ Note:
  ◮ The ordering of the points $y_1, y_2, y_3$ matters.
  ◮ It is important to ensure that covariance matrices remain positive definite (they need to be inverted).
Intuitive approach
Constructing Covariance Matrices
A general recipe
◮ Use a covariance function (kernel function) to construct $\mathbf{K}$:
$$K_{i,j} = k(x_i, x_j; \theta_K)$$
◮ For example, the squared exponential covariance function embodies the belief that points further apart are less correlated:
$$k_{\mathrm{SE}}(x_i, x_j; A, L) = A^2 \exp\left(-0.5\, \frac{(x_i - x_j)^2}{L^2}\right)$$
  ◮ $A$: overall correlation, amplitude
  ◮ $L$: scaling parameter, smoothness
◮ $\theta_K = \{A, L\}$ are called hyperparameters.
◮ We denote the covariance matrix for a set of inputs $X = \{x_1, \dots, x_N\}$ as $\mathbf{K}_{X,X}(\theta_K)$.
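A minimal NumPy sketch of this recipe (our own helper names, not the authors' code): the squared exponential covariance function and the matrix $\mathbf{K}_{X,X}$ it induces.

```python
import numpy as np

def k_se(xi, xj, A=1.0, L=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return A**2 * np.exp(-0.5 * (xi - xj)**2 / L**2)

def cov_matrix(X, A=1.0, L=1.0):
    """K[i, j] = k_se(X[i], X[j]) for a 1D array of inputs X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None] - X[None, :]
    return A**2 * np.exp(-0.5 * diff**2 / L**2)

X = np.array([1.0, 2.0, 3.0])
print(cov_matrix(X, A=1.0, L=1.0))   # nearby inputs -> entries close to A^2
```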
Intuitive approach
Constructing Covariance Matrices
GP samples using the squared exponential covariance function
[Figure: function samples from a 10D and from a 500D Gaussian for (A=1, L=1), (A=1, L=0.5) and (A=3, L=1), together with the corresponding 2D Gaussian over two function values $(y_1, y_2)$.]
Reminder: every function line corresponds to a sample drawn from this 2D Gaussian!
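A sketch of how such figures are produced (the small jitter term for numerical stability is our own addition, a common trick):

```python
import numpy as np

def sample_gp_prior(X, A=1.0, L=1.0, n_samples=3, seed=0):
    """Draw function samples from a GP prior with the SE covariance."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None] - X[None, :]
    K = A**2 * np.exp(-0.5 * diff**2 / L**2)
    K += 1e-8 * np.eye(len(X))               # jitter for numerical stability
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(X)), K, size=n_samples)

X = np.linspace(-6, 6, 100)                  # a "500D Gaussian" would use 500 points
samples = sample_gp_prior(X, A=1.0, L=1.0)
print(samples.shape)                         # (3, 100): each row is one function sample
```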
Intuitive approach
Why this all works
◮ Consistency of the 10D and the 500D Gaussian.
◮ A small quiz:
  ◮ Let $y_1, y_2, y_3$ have covariance matrix and inverse
$$\mathbf{K}_3 = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0.5 \\ 0 & 0.5 & 1 \end{pmatrix}, \qquad \mathbf{K}_3^{-1} = \begin{pmatrix} 1.5 & -1 & 0.5 \\ -1 & 2 & -1 \\ 0.5 & -1 & 1.5 \end{pmatrix}$$
  ◮ Now focus on the variables $y_1, y_2$, integrating out $y_3$. Which of the following statements is true?
$$\text{a)}\; \mathbf{K}_2 = \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \qquad \text{b)}\; \mathbf{K}_2^{-1} = \begin{pmatrix} 1.5 & -1 \\ -1 & 2 \end{pmatrix}$$
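A quick numerical check of the quiz (a NumPy sketch; try the quiz first if you like): marginalizing a Gaussian simply drops the rows and columns of the integrated-out variables from the covariance matrix, whereas the precision matrix does not truncate this way.

```python
import numpy as np

K3 = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.5],
               [0.0, 0.5, 1.0]])

K2 = K3[:2, :2]                    # marginal covariance of (y1, y2): statement a)
print(K2)
print(np.linalg.inv(K2))           # != top-left block of inv(K3): statement b) is false
print(np.linalg.inv(K3)[:2, :2])
```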
Function space view
So far:
1. A joint Gaussian distribution over the set of all outputs $\mathbf{y}$.
2. A covariance function as a recipe to construct suitable covariance matrices from the corresponding inputs $X$.
Function space view
The Gaussian process as a prior on functions
◮ The covariance function and its hyperparameters reflect the prior belief about function smoothness, lengthscales etc.
◮ The general recipe allows a joint Gaussian to be constructed for an arbitrary selection of input locations $X$.
Prior on the (infinite-dimensional) function $f(x)$:
$$p(f(x)) = \mathcal{GP}(f(x) \mid k)$$
Prior on a finite set of function values $\mathbf{f} = (f_1, \dots, f_N)$:
$$p(\mathbf{f} \mid X, \theta_K) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}_{X,X}(\theta_K))$$
Function space view
Noise-free observations
◮ Given noise-free training data $\mathcal{D} = \{x_n, f_n\}_{n=1}^N$.
◮ We want to make predictions $\mathbf{f}_\star$ at test points $X_\star$.
◮ The joint distribution of $\mathbf{f}$ and $\mathbf{f}_\star$ is
$$p([\mathbf{f}, \mathbf{f}_\star] \mid X, X_\star, \theta_K) = \mathcal{N}\left([\mathbf{f}, \mathbf{f}_\star] \,\middle|\, \mathbf{0}, \begin{pmatrix} \mathbf{K}_{X,X} & \mathbf{K}_{X,X_\star} \\ \mathbf{K}_{X_\star,X} & \mathbf{K}_{X_\star,X_\star} \end{pmatrix}\right)$$
(All kernel matrices $\mathbf{K}$ depend on the hyperparameters $\theta_K$, which are dropped for brevity.)
◮ But real data is rarely noise-free.
Function space view
Inference
◮ Given observed noisy data $\mathcal{D} = \{X, \mathbf{y}\}$, the joint probability over the latent function values $\mathbf{f}$ and $\mathbf{f}_\star$ given $\mathbf{y}$ is
$$p([\mathbf{f}, \mathbf{f}_\star] \mid X, X_\star, \mathbf{y}, \theta_K, \sigma^2) \propto \underbrace{\mathcal{N}([\mathbf{f}, \mathbf{f}_\star] \mid \mathbf{0}, \mathbf{K})}_{\text{Prior}} \times \underbrace{\prod_{n=1}^{N} \mathcal{N}(y_n \mid f_n, \sigma^2)}_{\text{Likelihood}}$$
Function space view
Inference
◮ Applying “Gaussian calculus”, integrating out $\mathbf{f}$ yields
$$p([\mathbf{y}, \mathbf{f}_\star] \mid X, X_\star, \theta_K, \sigma^2) \propto \mathcal{N}\left([\mathbf{y}, \mathbf{f}_\star] \,\middle|\, \mathbf{0}, \begin{pmatrix} \mathbf{K}_{X,X} + \sigma^2 \mathbf{I} & \mathbf{K}_{X,X_\star} \\ \mathbf{K}_{X_\star,X} & \mathbf{K}_{X_\star,X_\star} \end{pmatrix}\right)$$
◮ Note: assuming noisy instead of noise-free observations merely corresponds to adding a diagonal term to the self-covariance $\mathbf{K}_{X,X}$.
Function space view
Making predictions
◮ The predictive distribution follows from the joint distribution by completing the square (conditioning):
$$p([\mathbf{y}, \mathbf{f}_\star] \mid X, X_\star, \theta_K, \sigma^2) \propto \mathcal{N}\left([\mathbf{y}, \mathbf{f}_\star] \,\middle|\, \mathbf{0}, \begin{pmatrix} \mathbf{K}_{X,X} + \sigma^2 \mathbf{I} & \mathbf{K}_{X,X_\star} \\ \mathbf{K}_{X_\star,X} & \mathbf{K}_{X_\star,X_\star} \end{pmatrix}\right)$$
◮ This gives a Gaussian predictive distribution for $\mathbf{f}_\star$:
$$p(\mathbf{f}_\star \mid X, \mathbf{y}, X_\star, \theta_K, \sigma^2) = \mathcal{N}(\mathbf{f}_\star \mid \boldsymbol{\mu}_\star, \boldsymbol{\Sigma}_\star)$$
with
$$\boldsymbol{\mu}_\star = \mathbf{K}_{X_\star,X} \left(\mathbf{K}_{X,X} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{y}$$
$$\boldsymbol{\Sigma}_\star = \mathbf{K}_{X_\star,X_\star} - \mathbf{K}_{X_\star,X} \left(\mathbf{K}_{X,X} + \sigma^2 \mathbf{I}\right)^{-1} \mathbf{K}_{X,X_\star}$$
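These two formulas are all of GP regression. A self-contained sketch in NumPy/SciPy (our own naming; a Cholesky solve replaces the explicit matrix inverse for numerical stability, a standard substitution):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def se_kernel(X1, X2, A=1.0, L=1.0):
    """Squared exponential covariance matrix between two sets of 1D inputs."""
    diff = np.asarray(X1)[:, None] - np.asarray(X2)[None, :]
    return A**2 * np.exp(-0.5 * diff**2 / L**2)

def gp_predict(X, y, X_star, A=1.0, L=1.0, sigma2=0.1):
    """Posterior mean and covariance of f_star given noisy observations y."""
    K = se_kernel(X, X, A, L) + sigma2 * np.eye(len(X))
    K_star = se_kernel(X_star, X, A, L)              # K_{X*,X}
    K_ss = se_kernel(X_star, X_star, A, L)           # K_{X*,X*}
    c = cho_factor(K)                                # factorize K + sigma2*I once
    mu = K_star @ cho_solve(c, y)                    # K_{X*,X} (K + s2 I)^-1 y
    Sigma = K_ss - K_star @ cho_solve(c, K_star.T)   # posterior covariance
    return mu, Sigma

X = np.array([1.0, 3.0, 5.0, 7.0, 9.0])
y = np.sin(X)
mu, Sigma = gp_predict(X, y, X_star=np.linspace(0, 10, 50))
err = 2 * np.sqrt(np.diag(Sigma))                    # error bars: +/- 2 std dev
```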
Function space view
Making predictions
Example
[Figure: GP predictive mean with error bars on the example data (Y vs. X).]
Function space view
Learning hyperparameters
1. Fixed covariance matrix: $p(\mathbf{y} \mid \mathbf{K})$.
2. Constructed covariance matrix: $K_{i,j} = k(x_i, x_j; \theta_K)$.
3. Can we learn the hyperparameters $\theta_K$?
Function space view
Learning hyperparameters
◮ Formally we are interested in the posterior
$$p(\theta_K \mid \mathcal{D}) \propto p(\mathbf{y} \mid X, \theta_K)\, p(\theta_K)$$
◮ Full posterior inference is analytically intractable!
◮ Use a MAP estimate instead of the full posterior: set $\theta_K$ to the most probable hyperparameter settings:
$$\hat{\theta}_K = \underset{\theta_K}{\operatorname{argmax}}\; \ln\left[p(\mathbf{y} \mid X, \theta_K)\, p(\theta_K)\right] = \underset{\theta_K}{\operatorname{argmax}}\; \ln \mathcal{N}\left(\mathbf{y} \mid \mathbf{0}, \mathbf{K}_{X,X}(\theta_K) + \sigma^2 \mathbf{I}\right) + \ln p(\theta_K)$$
$$= \underset{\theta_K}{\operatorname{argmax}}\; \left[-\frac{1}{2}\ln\det\!\left[\mathbf{K}_{X,X}(\theta_K) + \sigma^2 \mathbf{I}\right] - \frac{1}{2}\mathbf{y}^{\top}\!\left[\mathbf{K}_{X,X}(\theta_K) + \sigma^2 \mathbf{I}\right]^{-1}\mathbf{y} - \frac{N}{2}\ln 2\pi + \ln p(\theta_K)\right]$$
◮ The optimization can be carried out using standard optimization techniques.
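A sketch of this optimization with a flat prior on $\theta_K$, so the MAP estimate reduces to maximizing the marginal likelihood (the log-parameterization and the choice of SciPy optimizer are our own assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """-ln N(y | 0, K(A, L) + sigma^2 I); parameters live on the log scale."""
    A, L, sigma = np.exp(log_params)
    diff = X[:, None] - X[None, :]
    K = A**2 * np.exp(-0.5 * diff**2 / L**2) + sigma**2 * np.eye(len(X))
    sign, logdet = np.linalg.slogdet(K)              # stable log-determinant
    alpha = np.linalg.solve(K, y)
    return 0.5 * logdet + 0.5 * y @ alpha + 0.5 * len(X) * np.log(2 * np.pi)

X = np.linspace(0, 10, 30)
y = np.sin(X) + 0.1 * np.random.default_rng(0).standard_normal(30)
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
print(np.exp(res.x))                                 # fitted A, L, sigma
```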
Function space view
Choosing covariance functions
◮ The covariance function embodies the prior belief about functions.
◮ Example: linear regression
$$y_n = w x_n + c + \psi_n$$
◮ The covariance function expresses the covariation between outputs (taking expectations over zero-mean priors on $w$ and $c$ and over the noise $\psi_n$):
$$k(x_n, x_{n'}) = \langle y_n y_{n'} \rangle = \langle (w x_n + c + \psi_n)(w x_{n'} + c + \psi_{n'}) \rangle = \underbrace{w^2 \cdot x_n x_{n'} + c^2}_{\text{kernel: } k(x_n, x_{n'})} + \delta_{n,n'} \psi^2$$
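A sketch of this linear covariance function (the variance values $w^2$, $c^2$ below are illustrative, and the naming is ours):

```python
import numpy as np

def k_linear(X1, X2, w2=1.0, c2=1.0):
    """Linear-regression covariance: w2 * x_i * x_j + c2 (noise handled separately)."""
    return w2 * np.outer(X1, X2) + c2

X = np.array([0.0, 1.0, 2.0])
print(k_linear(X, X))
```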
Function space view
Choosing covariance functions
Multidimensional input space
◮ Generalise the squared exponential covariance function to multiple dimensions.
◮ 1 dimension:
$$k_{\mathrm{SE}}(x_i, x_j; A, L) = A^2 \exp\left(-0.5\, \frac{(x_i - x_j)^2}{L^2}\right)$$
◮ D dimensions:
$$k_{\mathrm{SE}}(\mathbf{x}_i, \mathbf{x}_j; A, \mathbf{L}) = A^2 \exp\left(-0.5 \sum_{d=1}^{D} \frac{(x_i^d - x_j^d)^2}{L_d^2}\right)$$
◮ The lengthscale parameters $L_d$ denote the “relevance” of a particular data dimension.
◮ Large $L_d$ correspond to irrelevant dimensions.
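A sketch of the D-dimensional variant with per-dimension lengthscales (our own naming):

```python
import numpy as np

def k_se_ard(X1, X2, A=1.0, lengthscales=None):
    """SE covariance with per-dimension lengthscales; X1 is (N, D), X2 is (M, D)."""
    L = np.ones(X1.shape[1]) if lengthscales is None else np.asarray(lengthscales)
    diff = (X1[:, None, :] - X2[None, :, :]) / L   # differences scaled per dimension
    return A**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

X = np.random.default_rng(0).normal(size=(5, 3))
K = k_se_ard(X, X, A=1.0, lengthscales=[1.0, 0.5, 100.0])  # 3rd dim ~ irrelevant
print(K.shape)                                     # (5, 5)
```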
Function space view
Choosing covariance functions
2D regression
[Figure: GP regression with two-dimensional inputs; predicted surface Y over (X1, X2).]
Function space view
Choosing covariance functions
Any kernel will do
◮ Established kernels are all valid covariance functions, allowing for a wide range of possible input domains $\mathcal{X}$:
  ◮ Graph kernels (molecules)
  ◮ Kernels defined on strings (DNA sequences)
Function space view
Choosing covariance functions
Combining existing covariance functions
◮ The sum of two covariance functions is itself a valid covariance function:
$$k_S(x, x') = k_1(x, x') + k_2(x, x')$$
◮ The product of two covariance functions is itself a valid covariance function:
$$k_P(x, x') = k_1(x, x') \cdot k_2(x, x')$$
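For instance, combining the SE and linear sketches from earlier (helper names are ours; redefined here so the block stands alone):

```python
import numpy as np

def k_se(X1, X2, A=1.0, L=1.0):
    diff = np.asarray(X1)[:, None] - np.asarray(X2)[None, :]
    return A**2 * np.exp(-0.5 * diff**2 / L**2)

def k_linear(X1, X2, w2=1.0, c2=1.0):
    return w2 * np.outer(X1, X2) + c2

def k_sum(X1, X2):
    """Valid covariance: sum of two covariance functions."""
    return k_se(X1, X2) + k_linear(X1, X2)

def k_product(X1, X2):
    """Valid covariance: product of two covariance functions."""
    return k_se(X1, X2) * k_linear(X1, X2)

X = np.array([0.0, 1.0, 2.0])
print(np.linalg.eigvalsh(k_sum(X, X)) >= -1e-10)   # positive semi-definite check
```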
GP classification & other extensions
GPs for classification
◮ How do we deal with binary observations?
[Figure: binary observations over two input dimensions (X1, X2); fitting them with plain GP regression yields latent values Y far outside the label range.]
GP classification & other extensions
GPs for classification
Probit likelihood model
◮ Posterior with a general likelihood model:
$$p(\mathbf{f} \mid X, \mathbf{y}, \theta_K) \propto \underbrace{\mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}_{X,X}(\theta_K))}_{\text{Prior}} \times \underbrace{\prod_{n=1}^{N} p(y_n \mid f_n)}_{\text{Likelihood}}$$
◮ Classification: a link model squashing $f_n$ into $[0, 1]$, e.g. the logistic (sigmoid) link
$$p(y_n = 1 \mid f_n) = \frac{1}{1 + \exp(-f_n)}$$
(a probit link uses the Gaussian CDF $\Phi(f_n)$ instead).
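A sketch of the logistic link and the resulting Bernoulli likelihood (our own naming):

```python
import numpy as np

def sigmoid(f):
    """Logistic link: squashes a latent value into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-f))

def log_likelihood(y, f):
    """Sum of ln p(y_n | f_n) for binary labels y in {0, 1}."""
    p = sigmoid(f)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

f = np.array([-2.0, 0.5, 3.0])   # latent function values
y = np.array([0, 1, 1])
print(sigmoid(f), log_likelihood(y, f))
```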
GP classification & other extensions
GPs for classification
Inference
◮ Inference with a non-Gaussian likelihood is analytically intractable.
◮ Idea: approximate each of the exact likelihood terms with a Gaussian, i.e. minimize
$$\mathrm{KL}\Bigg[\,\underbrace{\mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}_{X,X}(\theta_K))}_{\text{Prior}} \times \underbrace{\prod_{n=1}^{N} p(y_n \mid f_n)}_{\text{exact likelihood}} \;\Bigg\|\; \underbrace{\mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}_{X,X}(\theta_K))}_{\text{Prior}} \times \underbrace{\prod_{n=1}^{N} \mathcal{N}(f_n \mid \tilde{\mu}_n, \tilde{\sigma}_n)}_{\text{approximation}}\,\Bigg]$$
◮ The KL divergence is a common measure of approximation accuracy:
$$\mathrm{KL}[P \,\|\, Q] = \int P(\theta) \ln \frac{P(\theta)}{Q(\theta)}\, d\theta$$
GP classification & other extensions
Robust regression
[Figure: GP regression (mean and ±2·stdDev band) with a standard Gaussian noise model on data containing 15% and 1% outliers.]
GP classification & other extensions
Robust regression
Mixture likelihood model
◮ Naive approach: filter out the outliers.
◮ We would rather like the likelihood model to embody the belief that a fraction of the datapoints is “useless”:
$$p(y_n \mid f_n) = \pi_{\mathrm{ok}}\, \mathcal{N}\left(y_n \mid f_n, \sigma^2\right) + (1 - \pi_{\mathrm{ok}})\, \mathcal{N}\left(y_n \mid f_n, \sigma_{\infty}^2\right)$$
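A sketch of this two-component noise model (the parameter values below are illustrative; $\sigma_\infty$ is a large "outlier" standard deviation):

```python
import numpy as np
from scipy.stats import norm

def robust_likelihood(y, f, pi_ok=0.85, sigma=0.1, sigma_inf=3.0):
    """Mixture likelihood p(y | f): mostly tight noise, occasionally huge noise."""
    return (pi_ok * norm.pdf(y, loc=f, scale=sigma)
            + (1 - pi_ok) * norm.pdf(y, loc=f, scale=sigma_inf))

print(robust_likelihood(y=0.05, f=0.0))   # inlier: tight component dominates
print(robust_likelihood(y=2.5, f=0.0))    # outlier: still non-negligible likelihood
```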
GP classification & other extensions
Robust regression
Mixture likelihood in action
[Figure: robust noise model on the outlier data; mean and ±2·stdDev band.]
GP classification & other extensions
Why Gaussian processes and not something else?
◮ Tractable probabilistic model with uncertainty estimates.
◮ Equal or better performance than other methods.
◮ Many other approaches arise as special cases:
  ◮ Linear regression
  ◮ Splines
  ◮ Neural networks
◮ Kernel method: flexible choice of covariance functions.
◮ Major limitation: inversion of an $N \times N$ matrix; scaling $\mathcal{O}(N^3)$.
Summary
◮ The key ingredient of a Gaussian process is the covariance function: a recipe to construct covariance matrices.
◮ GP predictions boil down to conditioning joint Gaussian distributions.
◮ The most probable covariance function hyperparameters can be derived from the marginal likelihood.
◮ Non-Gaussian likelihood models allow for classification and robust regression, but require approximate inference techniques.