Introduction to Gaussian Processes
Neil D. Lawrence GPMC 6th February 2017
Book: Rasmussen and Williams (2006)

Outline
◮ The Gaussian Density
◮ Covariance from Basis Functions
The Gaussian Density

◮ Perhaps the most common probability density.

p(y|µ, σ²) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) ) = N(y|µ, σ²)

◮ The Gaussian density.

Figure: The Gaussian PDF p(h|µ, σ²) over height h/m, with µ = 1.7 and variance σ² = 0.0225. Mean shown as a red line. It could represent the heights of a population of students.
Sum of Gaussians

◮ The sum of Gaussian variables is also Gaussian.

yi ∼ N(µi, σi²)

Σi yi ∼ N( Σi µi, Σi σi² ),   with the sums running over i = 1, …, n.

(Aside: as the number of terms in the sum increases, the sum of non-Gaussian, finite-variance variables also tends to a Gaussian [central limit theorem].)
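A quick numerical check of both statements, not part of the original slides and with made-up means and variances: summing independent Gaussians reproduces the stated mean and variance, and summing many uniform (non-Gaussian) variables gives an approximately Gaussian result, as the central limit theorem suggests.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum of independent Gaussians: y_i ~ N(mu_i, sigma_i^2), values assumed for illustration.
mus = np.array([1.0, -0.5, 2.0])
sigmas = np.array([0.5, 1.0, 0.25])
samples = rng.normal(mus, sigmas, size=(100000, 3)).sum(axis=1)

print(samples.mean(), mus.sum())          # both close to 2.5
print(samples.var(), (sigmas**2).sum())   # both close to 1.3125

# Central limit theorem: the sum of many uniform variables is approximately Gaussian.
clt = rng.uniform(-1, 1, size=(100000, 50)).sum(axis=1)
print(clt.mean(), clt.std())              # roughly 0 and sqrt(50/3)
```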
Scaling a Gaussian

◮ Scaling a Gaussian leads to a Gaussian.

y ∼ N(µ, σ²)

And the scaled density is distributed as

wy ∼ N(wµ, w²σ²)
Figure: A linear regression between x and y, showing the data points and the best-fit line.
◮ Predict a real value, yi, given some inputs xi.
◮ Predict the quality of meat given spectral measurements (Tecator data).
◮ Radiocarbon dating, the C14 calibration curve: predict age given the quantity of C14 isotope.
◮ Predict the quality of different Go or Backgammon moves given expert-rated training data.
Figure: A line y = mx + c (slope m, intercept c) fitted to data points in the x–y plane.
A PHILOSOPHICAL ESSAY ON PROBABILITIES

"The day will come when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us." Clairaut then undertook to submit to analysis the perturbations which the comet had experienced by the action of the two great planets, Jupiter and Saturn; after immense calculations he fixed its next passage at the perihelion toward the beginning of April, 1759, which was actually verified by observation. The regularity which astronomy shows us in the movements of the comets doubtless exists also in all phenomena. The curve described by a simple molecule of air or vapor is regulated in a manner just as certain as the planetary orbits; the only difference between them is that which comes from our ignorance.

Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others. In this state of indecision it is impossible for us to announce their occurrence with certainty. It is, however, probable that one of these events, chosen at will, will not occur because we see several cases equally possible which exclude its occurrence, while only a single one favors it.

The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability.
What about two unknowns and one observation?

y1 = mx1 + c
◮ We can compute m given c:

m = (y1 − c)/x1

◮ Different choices of the intercept give different slopes, e.g.:

c = 1.75 ⟹ m = 1.25
c = −0.777 ⟹ m = 3.78
c = −4.01 ⟹ m = 7.01
c = −0.718 ⟹ m = 3.72
c = 2.45 ⟹ m = 0.545
c = −0.657 ⟹ m = 3.66
c = −3.13 ⟹ m = 6.13
c = −1.47 ⟹ m = 4.47

◮ Assuming c ∼ N(0, 4), we find a distribution of solutions (see the sketch below).

Figure: Lines through the single observation in the x–y plane, one for each sampled value of c.
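A minimal sketch of this idea (my own illustration, assuming the single observation x1 = 1, y1 = 3, which is consistent with the c, m pairs listed above, e.g. c = 1.75 gives m = 1.25): each draw of c from N(0, 4) fixes a corresponding slope m = (y1 − c)/x1, so we obtain a distribution over candidate lines rather than a single solution.

```python
import numpy as np

rng = np.random.default_rng(1)

# One observation, two unknowns: y1 = m*x1 + c (values assumed for illustration).
x1, y1 = 1.0, 3.0

# Sample the offset c from its prior and solve for the slope m given c.
c = rng.normal(0.0, 2.0, size=10)   # N(0, 4) has standard deviation 2
m = (y1 - c) / x1

for ci, mi in zip(c, m):
    print(f"c = {ci:6.3f}  =>  m = {mi:6.3f}")
```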
◮ To deal with the overdetermined system we introduced a probability distribution for the 'variable', ǫi.
◮ For the underdetermined system we introduce a probability distribution for the 'parameter', c.
◮ This is known as a Bayesian treatment.
◮ For general Bayesian inference we need multivariate priors.
◮ E.g. for multivariate linear regression:

yi = Σj wj xi,j + ǫi = w⊤xi,: + ǫi

(where we've dropped c for convenience), we need a prior over w.
◮ This motivates a multivariate Gaussian density.
◮ We will use the multivariate Gaussian to put a prior directly on the function.
◮ Bayesian inference requires a prior on the parameters.
◮ The prior represents your belief, before you see the data, about the likely value of the parameters.
◮ For linear regression, consider a Gaussian prior on the intercept:

c ∼ N(0, α1)

◮ The posterior distribution is found by combining the prior with the likelihood.
◮ The posterior distribution represents your belief, after you see the data, about the likely value of the parameters.
◮ The posterior is found through Bayes' rule:

p(c|y) = p(y|c) p(c) / p(y)
p(c) = N(c|0, α1)

p(y|m, c, x, σ²) = N(y|mx + c, σ²I)

p(c|y, m, x, σ²) = N( c | Σi (yi − mxi) / (n + σ²/α1), (σ⁻²n + α1⁻¹)⁻¹ )

Figure: A Gaussian prior combines with a Gaussian likelihood for a Gaussian posterior.
◮ Multiply the likelihood by the prior:
◮ they are both "exponentiated quadratics", so the answer is always also an exponentiated quadratic, because exp(a²) exp(b²) = exp(a² + b²).
◮ Complete the square to get the resulting density in the form of a Gaussian.
◮ Recognise the mean and (co)variance of the Gaussian. This is the estimate of the posterior.
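As a concrete instance of these steps, here is the closed-form Gaussian posterior over the offset c for a fixed slope m, which is what "completing the square" yields in this model; the data values and parameter settings are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from y_i = m*x_i + c + noise (values assumed for illustration).
m_true, c_true, sigma2, alpha1 = 2.0, 1.0, 0.1, 4.0
x = np.linspace(0, 1, 20)
y = m_true * x + c_true + rng.normal(0, np.sqrt(sigma2), x.shape)

# The Gaussian prior c ~ N(0, alpha1) combined with the Gaussian likelihood
# gives a Gaussian posterior; completing the square gives its mean and variance.
n = len(x)
post_var = 1.0 / (n / sigma2 + 1.0 / alpha1)
post_mean = post_var * np.sum(y - m_true * x) / sigma2

print(post_mean, post_var)
```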
◮ Noise-corrupted data point:

yi = w⊤xi,: + ǫi

◮ Multivariate regression likelihood:

p(y|X, w) = (2πσ²)^(−n/2) exp( −(1/(2σ²)) Σi (yi − w⊤xi,:)² ),   sum over i = 1, …, n.

◮ Now use a multivariate Gaussian prior:

p(w) = (2πα)^(−p/2) exp( −(1/(2α)) w⊤w )
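A minimal sketch of combining this likelihood and prior (the standard Bayesian linear regression algebra, with a toy problem assumed for illustration): the posterior over w is again Gaussian, with covariance (X⊤X/σ² + I/α)⁻¹ and mean given below.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy problem: y = X w + noise, with p columns of inputs.
n, p, sigma2, alpha = 50, 3, 0.05, 1.0
w_true = rng.normal(0, 1, p)
X = rng.normal(0, 1, (n, p))
y = X @ w_true + rng.normal(0, np.sqrt(sigma2), n)

# Gaussian prior w ~ N(0, alpha*I) with the Gaussian likelihood gives a
# Gaussian posterior over w (again an exponentiated quadratic in w).
post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / alpha)
post_mean = post_cov @ X.T @ y / sigma2

print(w_true)
print(post_mean)
```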
◮ Consider height, h/m, and weight, w/kg.
◮ We could sample height from a distribution:

h ∼ N(1.7, 0.0225)

◮ And similarly weight:

w ∼ N(75, 36)

Figure: Gaussian distributions for height, p(h), and weight, p(w).
Figure: Samples of height and weight drawn independently, shown against the joint distribution (w/kg against h/m) and the marginal distributions p(h) and p(w).
◮ This assumes height and weight are independent:

p(h, w) = p(h) p(w)

◮ In reality they are dependent (body mass index = w/h²).
Figure: Samples of height and weight shown against a correlated joint distribution (w/kg against h/m) and its marginal distributions p(h) and p(w).
p(w, h) = p(w) p(h)

p(w, h) = (1/√(2πσ1² · 2πσ2²)) exp( −½ [ (w − µ1)²/σ1² + (h − µ2)²/σ2² ] )

p(w, h) = (1/|2πD|^(1/2)) exp( −½ (y − µ)⊤ D⁻¹ (y − µ) ),

where y = [w, h]⊤, µ = [µ1, µ2]⊤ and D = diag(σ1², σ2²).
Form a correlated Gaussian from the original by rotating the data space using a rotation matrix R:

p(y) = (1/|2πD|^(1/2)) exp( −½ (y − µ)⊤ D⁻¹ (y − µ) )

p(y) = (1/|2πD|^(1/2)) exp( −½ (R⊤y − R⊤µ)⊤ D⁻¹ (R⊤y − R⊤µ) )

p(y) = (1/|2πD|^(1/2)) exp( −½ (y − µ)⊤ R D⁻¹ R⊤ (y − µ) ),   with C⁻¹ = R D⁻¹ R⊤

p(y) = (1/|2πC|^(1/2)) exp( −½ (y − µ)⊤ C⁻¹ (y − µ) ),   with C = R D R⊤
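A short numerical sketch of this construction (example values of my own): start from a diagonal covariance D, rotate with a rotation matrix R to form C = RDR⊤, and draw correlated samples from N(µ, C).

```python
import numpy as np

rng = np.random.default_rng(4)

# Independent (diagonal) covariance and a rotation matrix R (assumed values).
D = np.diag([1.0, 0.1])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Correlated covariance formed by rotating the data space: C = R D R^T.
C = R @ D @ R.T

mu = np.zeros(2)
samples = rng.multivariate_normal(mu, C, size=5)
print(C)
print(samples)
```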
◮ Recall: yi ∼ N(µi, σi²) implies Σi yi ∼ N( Σi µi, Σi σi² ), with the sums over i = 1, …, n.
◮ And: y ∼ N(µ, σ²) implies wy ∼ N(wµ, w²σ²).
◮ If

x ∼ N(µ, Σ)

◮ and

y = Wx,

◮ then

y ∼ N(Wµ, WΣW⊤).
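A quick empirical check of this property, with µ, Σ and W chosen arbitrarily for the example: draw x ∼ N(µ, Σ), map through y = Wx, and compare the sample mean and covariance of y with Wµ and WΣW⊤.

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed example values for the mean, covariance and linear map.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
W = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])

x = rng.multivariate_normal(mu, Sigma, size=200000)
y = x @ W.T                                   # y = W x for each sample

print(y.mean(axis=0), W @ mu)                 # should agree
print(np.cov(y.T), W @ Sigma @ W.T)           # should agree
```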
Multi-variate Gaussians

◮ We will consider a Gaussian with a particular structure of covariance matrix.
◮ Generate a single sample from this 25-dimensional Gaussian distribution, f = [f1, f2, …, f25]⊤.
◮ We will plot these points against their index.
Figure: A sample from a 25-dimensional Gaussian distribution. (a) The 25 correlated random variables, plotted against their index. (b) A colormap showing the correlations between dimensions; for example, the correlation between f1 and f2 is 0.96587.
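A sketch for reproducing a figure like this one. It assumes (as the slides reveal later) that the covariance is built with the exponentiated quadratic function applied to the point indices treated as inputs; the specific inputs and parameter values here are my own choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Inputs: 25 points; covariance from the exponentiated quadratic
# k(x, x') = alpha * exp(-(x - x')^2 / (2 l^2)) (assumed parameter values).
x = np.linspace(-3, 3, 25)
alpha, ell = 1.0, 2.0
K = alpha * np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))

# One sample from the 25-dimensional Gaussian N(0, K); jitter for stability.
rng = np.random.default_rng(6)
f = rng.multivariate_normal(np.zeros(25), K + 1e-8 * np.eye(25))

plt.plot(np.arange(1, 26), f, 'o-')
plt.xlabel('index i')
plt.ylabel('f_i')
plt.show()
```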
◮ The single contour of the Gaussian density represents the joint distribution, p(f1, f2), with covariance

K = [ 1  0.96587 ; 0.96587  1 ]

◮ We observe that f1 = −0.313.
◮ Conditional density: p(f2 | f1 = −0.313).
◮ Prediction of f2 from f1 requires the conditional density.
◮ The conditional density is also Gaussian:

p(f2 | f1) = N( f2 | (k1,2/k1,1) f1, k2,2 − k1,2²/k1,1 ),

where the covariance of the joint density is given by

K = [ k1,1  k1,2 ; k2,1  k2,2 ]
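A direct computation of this conditional with the numbers from the slide (f1 = −0.313 and correlation k1,2 = 0.96587):

```python
import numpy as np

# Joint covariance of (f1, f2) and the observed value of f1.
K = np.array([[1.0, 0.96587],
              [0.96587, 1.0]])
f1 = -0.313

# Conditional p(f2 | f1) = N(k12/k11 * f1, k22 - k12^2/k11).
cond_mean = K[0, 1] / K[0, 0] * f1
cond_var = K[1, 1] - K[0, 1]**2 / K[0, 0]

print(cond_mean, cond_var)
```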
◮ The same setup applies for f1 and f5: the single contour of the Gaussian density represents the joint distribution, p(f1, f5), with covariance

K = [ 1  0.57375 ; 0.57375  1 ]

◮ We observe that f1 = −0.313.
◮ Conditional density: p(f5 | f1 = −0.313).
◮ Prediction of f∗ from f requires the multivariate conditional density.
◮ The multivariate conditional density is also Gaussian:

p(f∗|f) = N( f∗ | K∗,f Kf,f⁻¹ f, K∗,∗ − K∗,f Kf,f⁻¹ Kf,∗ )

◮ Here the covariance of the joint density is given by

K = [ Kf,f  Kf,∗ ; K∗,f  K∗,∗ ]
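A minimal sketch of this prediction formula, using a Cholesky factorisation in place of the explicit inverse; the covariance function and the training/test data below are assumptions made for illustration, while the conditional itself is exactly the equation above.

```python
import numpy as np

def eq_cov(x, x2, alpha=1.0, ell=2.0):
    """Exponentiated quadratic covariance between two sets of 1-D inputs."""
    return alpha * np.exp(-(x[:, None] - x2[None, :])**2 / (2 * ell**2))

# Assumed training inputs/values and test inputs.
x = np.array([-3.0, -1.0, 0.5, 2.0])
f = np.array([0.4, -0.2, 0.1, 0.8])
x_star = np.linspace(-4, 4, 9)

K_ff = eq_cov(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
K_sf = eq_cov(x_star, x)
K_ss = eq_cov(x_star, x_star)

# p(f* | f) = N(K*,f Kff^-1 f,  K*,* - K*,f Kff^-1 Kf,*)
L = np.linalg.cholesky(K_ff)
A = np.linalg.solve(L, K_sf.T)                # A = L^-1 Kf,*
mean = A.T @ np.linalg.solve(L, f)
cov = K_ss - A.T @ A

print(mean)
print(np.diag(cov))
```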
Where did this covariance matrix come from?

Exponentiated Quadratic Kernel Function (RBF, Squared Exponential, Gaussian):

k(x, x′) = α exp( −‖x − x′‖²₂ / (2ℓ²) )

◮ The covariance matrix is built using the inputs to the function, x.
◮ For the example above it was based on Euclidean distance.
◮ The covariance function is also known as a kernel.

For example, with α = 1.00 and ℓ = 2.00, the entry for x1 = −3.0 with itself is

k1,1 = 1.00 × exp( −(−3.0 − (−3.0))² / (2 × 2.00²) ) = 1.00
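To check the worked entry, a few lines computing covariance matrix entries with α = 1.00 and ℓ = 2.00; only x1 = −3.0 is given on the slide, the remaining inputs are assumed.

```python
import numpy as np

alpha, ell = 1.0, 2.0
x = np.array([-3.0, 1.2, 1.4])   # x1 = -3.0 as on the slide; the other inputs assumed

# k(x, x') = alpha * exp(-(x - x')^2 / (2 l^2))
K = alpha * np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))

print(K[0, 0])   # k_{1,1} = 1.00 * exp(0) = 1.00
print(K)
```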
Covariance from Basis Functions
◮ Radial basis functions commonly have the form

φk(xi) = exp( −‖xi − µk‖² / (2ℓ²) ).

◮ The basis functions map the data into a "feature space" in which a linear sum is a non-linear function.

Figure: A set of radial basis functions with width ℓ = 2 and location parameters µ = [−4, 0, 4]⊤.
◮ Represent a function by a linear sum over a basis,

f(xi,:; w) = Σk wk φk(xi,:),   (1)

with the sum running over k = 1, …, m.
◮ Here: m basis functions, φk(·) is the kth basis function, and w = [w1, …, wm]⊤.
◮ For the standard linear model: φk(xi,:) = xi,k.
Functions are derived using

f(x) = Σk wk φk(x),   sum over k = 1, …, m,

where the elements of w are independently sampled from a Gaussian density, wk ∼ N(0, α).

Figure: Functions sampled using the basis set from the previous figure. Each line is a separate sample, generated by a weighted sum of the basis set. The weights, w, are sampled from a Gaussian density with variance α = 1.
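A sketch that generates figures of this kind, using the basis described earlier (width ℓ = 2 and centres µ = [−4, 0, 4]⊤ from the previous figure) and weights drawn independently from N(0, α) with α = 1; the x-range is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

ell, alpha = 2.0, 1.0
mu = np.array([-4.0, 0.0, 4.0])          # basis-function centres
x = np.linspace(-8, 8, 200)

# Radial basis functions phi_k(x) = exp(-(x - mu_k)^2 / (2 l^2)).
Phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * ell**2))

# Sample weights w_k ~ N(0, alpha) and form f(x) = sum_k w_k phi_k(x).
for _ in range(10):
    w = rng.normal(0.0, np.sqrt(alpha), size=mu.shape)
    plt.plot(x, Phi @ w)

plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
```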
Use matrix notation to write the function,

f(xi; w) = Σk wk φk(xi),   sum over k = 1, …, m.

◮ Computed at the training data this gives a vector, f = Φw.
◮ w ∼ N(0, αI).
◮ w and f are only related by an inner product.
◮ Φ ∈ ℜ^(n×p) is a design matrix.
◮ Φ is fixed and non-stochastic for a given training set.
◮ f is Gaussian distributed.
We use ⟨·⟩ to denote expectations under prior distributions.

◮ We have ⟨f⟩ = Φ⟨w⟩.
◮ The prior mean of w was zero, giving ⟨f⟩ = 0.
◮ The prior covariance of f is

K = ⟨ff⊤⟩ − ⟨f⟩⟨f⟩⊤ = Φ⟨ww⊤⟩Φ⊤,

giving

K = αΦΦ⊤.
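A numerical check of K = αΦΦ⊤ (using the same assumed basis as in the sketch above): the empirical second moment of many samples of f = Φw, with w ∼ N(0, αI), converges to αΦΦ⊤.

```python
import numpy as np

rng = np.random.default_rng(8)

ell, alpha = 2.0, 1.0
mu = np.array([-4.0, 0.0, 4.0])
x = np.linspace(-6, 6, 5)

Phi = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * ell**2))

# Analytic prior covariance of f = Phi w with w ~ N(0, alpha I).
K = alpha * Phi @ Phi.T

# Monte Carlo estimate of <f f^T> under the prior on w (prior mean is zero,
# so this is also the covariance).
W = rng.normal(0.0, np.sqrt(alpha), size=(200000, len(mu)))
F = W @ Phi.T
K_mc = F.T @ F / len(F)

print(np.max(np.abs(K - K_mc)))   # small
```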
◮ The prior covariance between two points xi and xj is

k(xi, xj) = α Σk φk(xi) φk(xj),   sum over k = 1, …, m,

or, for the radial basis functions above,

k(xi, xj) = α Σk exp( −( ‖xi − µk‖² + ‖xj − µk‖² ) / (2ℓ²) ).
RBF Basis Functions

k(x, x′) = α φ(x)⊤ φ(x′)

φk(x) = exp( −‖x − µk‖² / ℓ² ),   µ = [−1, 1]⊤
◮ Need to choose where to place the basis-function centres and how many to use. Restrict the analysis to a 1-D input, x.
◮ Consider uniform spacing over a region:

k(xi, xj) = α Σk φk(xi) φk(xj)

= α Σk exp( −(xi − µk)²/(2ℓ²) − (xj − µk)²/(2ℓ²) )

= α Σk exp( −( xi² + xj² − 2µk(xi + xj) + 2µk² ) / (2ℓ²) ),

with the sums over k = 1, …, m.
◮ Set each center location to

µk = a + ∆µ · (k − 1).

◮ Specify the basis functions in terms of their indices,

k(xi, xj) = α′ ∆µ Σk exp( −(xi² + xj²)/(2ℓ²) + ( 2(a + ∆µ·(k − 1))(xi + xj) − 2(a + ∆µ·(k − 1))² ) / (2ℓ²) ).

◮ Here we've scaled the variance of the process by ∆µ (so α = α′∆µ).
◮ Take µ1 = a and µm = b, so b = a + ∆µ · (m − 1).
◮ This implies b − a = ∆µ(m − 1), and therefore

m = (b − a)/∆µ + 1.

◮ Take the limit as ∆µ → 0, so m → ∞:

k(xi, xj) = α′ ∫ₐᵇ exp( −(xi² + xj²)/(2ℓ²) + ( 2µ(xi + xj) − 2µ² ) / (2ℓ²) ) dµ,

where we have used a + k · ∆µ → µ.
◮ Performing the integration leads to

k(xi, xj) = α′ √(πℓ²) exp( −(xi − xj)²/(4ℓ²) ) × ½ [ erf( (b − (xi + xj)/2)/ℓ ) − erf( (a − (xi + xj)/2)/ℓ ) ].

◮ Now take the limit as a → −∞ and b → ∞:

k(xi, xj) = α exp( −(xi − xj)²/(4ℓ²) ),

where α = α′√(πℓ²).
◮ An RBF model with infinitely many basis functions is a Gaussian process.
◮ The covariance function is given by the exponentiated quadratic covariance function,

k(xi, xj) = α exp( −(xi − xj)²/(4ℓ²) ).

◮ Note: the functional forms of the covariance function and of the basis functions are similar:
◮ this is a special case;
◮ in general they are very different.

Similar results can be obtained for multi-dimensional input models (Williams, 1998; Neal, 1996).
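A numerical sanity check of this limit (my own sketch, with assumed values of α′ and ℓ): with many uniformly spaced centres on [a, b] and the variance scaled by ∆µ, the basis-function covariance approaches α′√(πℓ²) exp(−(xi − xj)²/(4ℓ²)), as derived above.

```python
import numpy as np

alpha_prime, ell = 1.0, 2.0
a, b, m = -20.0, 20.0, 4001            # wide interval, densely spaced centres
mu = np.linspace(a, b, m)
dmu = mu[1] - mu[0]

def k_basis(xi, xj):
    """Covariance from a finite sum of basis functions, scaled by dmu."""
    phi_i = np.exp(-(xi - mu)**2 / (2 * ell**2))
    phi_j = np.exp(-(xj - mu)**2 / (2 * ell**2))
    return alpha_prime * dmu * np.sum(phi_i * phi_j)

def k_limit(xi, xj):
    """Exponentiated quadratic limit with alpha = alpha' * sqrt(pi * l^2)."""
    return alpha_prime * np.sqrt(np.pi * ell**2) * np.exp(-(xi - xj)**2 / (4 * ell**2))

print(k_basis(0.3, 1.1), k_limit(0.3, 1.1))     # nearly equal
print(k_basis(-2.0, 2.5), k_limit(-2.0, 2.5))   # nearly equal
```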
Laplace's essay was reprinted (1951) as A Philosophical Essay on Probabilities, New York: Dover; the fifth edition of 1825 was reprinted in 1986 with notes by Bernard Bru, Paris: Christian Bourgois Éditeur; translated by Andrew Dale (1995) as Philosophical Essay on Probabilities, New York: Springer-Verlag. [Google Books]