Probabilistic Graphical Models Lecture 21: Advanced Gaussian Processes
Andrew Gordon Wilson
www.cs.cmu.edu/~andrewgw Carnegie Mellon University April 1, 2015
Gaussian process review

Definition: A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

◮ Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning that any finite collection of function values is jointly Gaussian: [f(x1), . . . , f(xN)]ᵀ ∼ N(μ, K), with μi = m(xi) and Kij = k(xi, xj).
◮ GP posterior ∝ Likelihood × GP prior.

[Figure: sample functions from a Gaussian process prior (left) and from the GP posterior after conditioning on data (right).]
◮ Observed noisy data y = (y(x1), . . . , y(xN))ᵀ at input locations X.
◮ Start with the standard regression assumption: p(y(x)|f(x)) = N(y(x); f(x), σ²).
◮ Place a Gaussian process distribution over noise free functions, f(x) ∼ GP(0, kθ).
◮ Infer p(f∗|y, X, X∗) for the noise free function f evaluated at test points X∗, as sketched below.
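A minimal numpy sketch of this inference step, under an assumed RBF kernel and hand-picked hyperparameters (the toy data and function names are illustrative, not from the lecture):

```python
import numpy as np

def rbf_kernel(x1, x2, a=1.0, ell=1.0):
    """k(x, x') = a^2 exp(-(x - x')^2 / (2 ell^2)) for 1-D inputs."""
    return a**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * ell**2))

def gp_predict(X, y, Xs, sigma=0.1):
    """Compute p(f* | y, X, X*) for GP regression with Gaussian noise."""
    K = rbf_kernel(X, X)                        # n x n train covariance
    Ks = rbf_kernel(X, Xs)                      # n x n* cross covariance
    Kss = rbf_kernel(Xs, Xs)                    # n* x n* test covariance
    A = K + sigma**2 * np.eye(len(X))           # K + sigma^2 I
    mean = Ks.T @ np.linalg.solve(A, y)         # predictive mean
    cov = Kss - Ks.T @ np.linalg.solve(A, Ks)   # predictive covariance
    return mean, cov

X = np.linspace(-3, 3, 20)
y = np.sin(X) + 0.1 * np.random.randn(20)       # noisy observations
mean, cov = gp_predict(X, y, np.linspace(-3, 3, 100))
```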
[Figure: (a) marginal likelihood p(y|M) over all possible datasets y, for a simple, a complex, and an appropriate model; (b) fits to data (output f(x) vs input x) from the simple, complex, and appropriate models.]
◮ We can integrate away the entire Gaussian process f(x) to obtain the marginal likelihood, as a function of the kernel hyperparameters θ alone:
p(y|θ) = ∫ p(y|f, θ) p(f|θ) df,
log p(y|θ) = −(1/2) yᵀ(Kθ + σ²I)⁻¹y − (1/2) log|Kθ + σ²I| − (N/2) log 2π,
where the first term is a model fit and the second a complexity penalty (computed in the sketch below).
◮ An extremely powerful mechanism for kernel learning.

[Figure: samples from the GP prior and from the GP posterior (output f(x) vs input x).]
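The decomposition above translates directly into code. A sketch of the log marginal likelihood with the model fit and complexity terms kept separate (the Cholesky-based evaluation is a standard numerical choice, not specific to the lecture):

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma=0.1):
    """log p(y | theta) for y ~ N(0, K_theta + sigma^2 I)."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma**2 * np.eye(n))     # A = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # A^{-1} y
    fit = -0.5 * y @ alpha                    # model fit term
    complexity = -np.log(np.diag(L)).sum()    # -(1/2) log|A|
    const = -0.5 * n * np.log(2 * np.pi)
    return fit + complexity + const
```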
◮ A fully Bayesian treatment would integrate away the kernel hyperparameters θ.
◮ For example, we could specify a prior p(θ), use MCMC to take J samples from p(θ|y), and then approximate the predictive distribution: p(f∗|y, X, X∗) ≈ (1/J) Σ_{j=1}^J p(f∗|y, X, X∗, θ(j)), with θ(j) ∼ p(θ|y).
◮ If we have a non-Gaussian noise model, and thus cannot integrate away f(x) in closed form, we can also sample f(x), or use a deterministic approximation to the non-Gaussian likelihood.
[Figure: CO₂ concentration (ppm, 320–400) from 1968 to 2004.]
◮ Long rising trend: k1(xp, xq) = θ1² exp(−(xp − xq)²/(2θ2²))
◮ Quasi-periodic seasonal changes: k2(xp, xq) = θ3² exp(−(xp − xq)²/(2θ4²) − 2 sin²(π(xp − xq))/θ5²)
◮ Multi-scale medium term irregularities: k3(xp, xq) = θ6² (1 + (xp − xq)²/(2θ8θ7²))^(−θ8)
◮ Correlated and i.i.d. noise: k4(xp, xq) = θ9² exp(−(xp − xq)²/(2θ10²)) + θ11² δpq
◮ ktotal(xp, xq) = k1(xp, xq) + k2(xp, xq) + k3(xp, xq) + k4(xp, xq), as sketched below.
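A sketch of this composite kernel in numpy; the θ values one would actually use come from maximising the marginal likelihood, so the placeholder array below is only for illustration:

```python
import numpy as np

def k_total(xp, xq, th):
    """Sum of the four CO2 components; th[i-1] plays the role of theta_i."""
    r = xp - xq
    k1 = th[0]**2 * np.exp(-r**2 / (2 * th[1]**2))                  # trend
    k2 = th[2]**2 * np.exp(-r**2 / (2 * th[3]**2)
                           - 2 * np.sin(np.pi * r)**2 / th[4]**2)   # seasonal
    k3 = th[5]**2 * (1 + r**2 / (2 * th[7] * th[6]**2))**(-th[7])   # medium term
    k4 = th[8]**2 * np.exp(-r**2 / (2 * th[9]**2)) \
         + th[10]**2 * (r == 0)                                     # noise
    return k1 + k2 + k3 + k4

theta = np.ones(11)            # placeholder hyperparameters
print(k_total(2000.0, 2000.5, theta))
```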
◮ Informally, k describes the similarities between pairs of data points: the covariance between f(x) and f(x′).
◮ We have seen that all linear basis function models f(x) = wᵀφ(x), with w ∼ N(0, Σw), are GPs with kernel k(x, x′) = φ(x)ᵀΣwφ(x′).
◮ We have also accumulated some experience with the RBF kernel kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)).
◮ The kernel controls the generalisation behaviour of a kernel machine.
◮ A kernel is also known as a covariance function or covariance kernel in the Gaussian process literature.
◮ Symmetric
◮ Provides information about proximity of points
◮ Exercise: Is it a valid kernel?
◮ A kernel machine makes predictions using a weighted combination of kernels centred on the training points: f(x) = Σ_{i=1}^N αi k(x, xi).
◮ The representer theorem says this function exists with finitely many coefficients αi.
◮ Initially viewed as a strength of kernel methods: for datasets that are not too large, this is a compact representation.
◮ Unfortunately, the number of nonzero αi often grows linearly in the size of the training set.
◮ Example: In GP regression, the predictive mean is f̄(x∗) = k∗ᵀ(K + σ²I)⁻¹y = Σ_{i=1}^N αi k(xi, x∗), with α = (K + σ²I)⁻¹y.
◮ Sums of kernels are kernels: k(x, x′) = ka(xa, x′a) + kb(xb, x′b).
◮ Products of kernels are kernels: k(x, x′) = ka(xa, x′a) kb(xb, x′b).
◮ A stationary kernel is invariant to translations of the input space.
◮ All distance kernels, k = k(||x − x′||), are examples of stationary kernels.
◮ The RBF kernel kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)) is stationary; the polynomial kernel k(x, x′) = (xᵀx′ + σ0²)^p is an example of a non-stationary kernel.
◮ Stationarity provides a useful inductive bias.
◮ f(x, w) is a Gaussian process, f(x) ∼ GP(m, k), with mean function m(x) = E[f(x)] and covariance function k(x, x′) = cov(f(x), f(x′)).
◮ The entire basis function model of Eqs. (31) and (32) is encapsulated as a distribution over functions with kernel k.
◮ Start with the basis model f(x) = Σ_{i=1}^J wi φi(x), with wi ∼ N(0, σ²/J) and radial basis functions φi(x) = exp(−(x − ci)²/(2ℓ²)).
◮ Equations (37)–(39) define a radial basis function regression model, with basis functions centred at the points ci.
◮ Using our result for the kernel of a generalised linear model, k(x, x′) = (σ²/J) Σ_{i=1}^J φi(x)φi(x′).
◮ Letting ci+1 − ci = Δc = 1/J, and J → ∞, the kernel in Eq. (42) becomes a Riemann sum: k(x, x′) = lim_{J→∞} (σ²/J) Σ_{i=1}^J φi(x)φi(x′) = σ² ∫_{c0}^{c∞} φc(x)φc(x′) dc.
◮ By setting c0 = −∞ and c∞ = ∞, we spread the infinitely many basis functions across the whole real line, each a distance Δc → 0 apart: k(x, x′) = σ² ∫_{−∞}^{∞} exp(−(x − c)²/(2ℓ²)) exp(−(x′ − c)²/(2ℓ²)) dc = √π ℓ σ² exp(−(x − x′)²/(2(√2ℓ)²)).
◮ It is remarkable we can work with infinitely many basis functions using finite amounts of computation.
◮ The RBF kernel, also known as the Gaussian or squared exponential (SE) kernel: kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)).
◮ Recall Bochner's theorem: if we take the Fourier transform of the RBF kernel, we obtain a Gaussian spectral density.
◮ Functions drawn from a GP with an RBF kernel are infinitely differentiable: very smooth.
[Figure: SE kernels k(τ) with different length-scales, ℓ = 0.7, 2.28, and 7.]
◮ Recap: a kernel machine makes predictions using a weighted combination of kernels centred on the training points: f(x) = Σ_{i=1}^N αi k(x, xi).
◮ The representer theorem says this function exists with finitely many coefficients αi.
◮ Initially viewed as a strength of kernel methods: for datasets that are not too large, this is a compact representation.
◮ Unfortunately, the number of nonzero αi often grows linearly in the size of the training set.
◮ Example: In GP regression, the predictive mean is f̄(x∗) = k∗ᵀ(K + σ²I)⁻¹y = Σ_{i=1}^N αi k(xi, x∗), with α = (K + σ²I)⁻¹y.
◮ What if we want data varying at multiple scales?
◮ k(r) = ∫ exp(−r²/(2ℓ²)) p(ℓ) dℓ: a scale mixture of RBF kernels over length-scales ℓ.
◮ Choosing a gamma distribution over inverse squared length-scales gives the rational quadratic (RQ) kernel, kRQ(r) = (1 + r²/(2αℓ²))^(−α).
◮ One could derive other interesting covariance functions using different choices of p(ℓ).
[Figure: (a) RQ kernels k(τ) for α = 0.1, 2, and 40; (b) sample GP-RQ functions f(x).]
◮ The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community.
◮ Consider a neural network with one hidden layer: f(x) = b + Σ_{i=1}^J vi h(x; ui).
◮ b is a bias, vi are the hidden to output weights, h is any bounded hidden unit transfer function, and ui are the input to hidden weights.
◮ Let b and vi have zero mean and variances σb² and σv²/J, respectively, and let the ui have independent identical distributions. Then, by the central limit theorem, as J → ∞,
k(x, x′) = E[f(x)f(x′)] = σb² + (1/J) Σ_{i=1}^J σv² Eu[hi(x; ui)hi(x′; ui)] → σb² + σv² Eu[h(x; u)h(x′; u)].
◮ Let h(x; u) = erf(u0 + Σ_{j=1}^P uj xj), where erf(z) = (2/√π) ∫₀^z e^(−t²) dt.
◮ Choose u ∼ N(0, Σ).
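For this choice of h and u ∼ N(0, Σ), the expectation has a closed form, the arcsine kernel of Williams (1998): kNN(x, x′) = (2/π) sin⁻¹(2x̃ᵀΣx̃′ / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃′ᵀΣx̃′))), with augmented input x̃ = (1, x1, . . . , xP)ᵀ. A sketch (the Σ below is an assumed value):

```python
import numpy as np

def nn_kernel(x, xp, Sigma):
    """Neural network (arcsine) kernel for erf hidden units, u ~ N(0, Sigma).
    xt = (1, x) is the input augmented with a 1 for the bias u0."""
    xt = np.concatenate(([1.0], np.atleast_1d(x)))
    xpt = np.concatenate(([1.0], np.atleast_1d(xp)))
    num = 2 * xt @ Sigma @ xpt
    den = np.sqrt((1 + 2 * xt @ Sigma @ xt) * (1 + 2 * xpt @ Sigma @ xpt))
    return (2 / np.pi) * np.arcsin(num / den)

Sigma = np.diag([1.0, 10.0])   # assumed prior covariance for (u0, u1)
print(nn_kernel(0.5, -0.5, Sigma))
```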
◮ What if we want to make the length-scale ℓ input dependent, so that ℓ = ℓ(x)?
◮ What if we want to make the length-scale ℓ input dependent, so that ℓ = ℓ(x)?
◮ Just letting ℓ → ℓ(x) doesn't produce a valid kernel.
The Gibbs construction gives a valid kernel with input-dependent length-scales:
k(x, x′) = Π_{p=1}^P (2ℓp(x)ℓp(x′) / (ℓp(x)² + ℓp(x′)²))^(1/2) exp(−Σ_{p=1}^P (xp − x′p)² / (ℓp(x)² + ℓp(x′)²)).

[Figure: (a) an input-dependent length-scale function ℓ(x); (b) a sample function drawn using this length-scale.]
◮ Transform the inputs through a vector-valued function: u(x) = (cos(x), sin(x)).
◮ Apply the RBF kernel in u space: kRBF(x, x′) → kRBF(u(x), u(x′)).
◮ Recover the periodic kernel, kPER(x, x′) = exp(−2 sin²((x − x′)/2)/ℓ²) (checked numerically below).
◮ Can you see anything unusual about this kernel?
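A quick numerical check that warping through u(x) = (cos x, sin x) and applying the RBF kernel reproduces the periodic kernel above, since ||u(x) − u(x′)||² = 4 sin²((x − x′)/2):

```python
import numpy as np

def k_periodic_direct(x, xp, ell=1.0):
    """k_PER(x, x') = exp(-2 sin^2((x - x')/2) / ell^2)."""
    return np.exp(-2 * np.sin((x - xp) / 2)**2 / ell**2)

def k_periodic_warped(x, xp, ell=1.0):
    """The RBF kernel applied to u(x) = (cos x, sin x)."""
    u = np.array([np.cos(x), np.sin(x)])
    up = np.array([np.cos(xp), np.sin(xp)])
    return np.exp(-np.sum((u - up)**2) / (2 * ell**2))

x, xp = 0.3, 2.1
print(k_periodic_direct(x, xp), k_periodic_warped(x, xp))  # identical values
```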
[Figure: (a) the periodic kernel as a function of input distance r; (b) a sample periodic function y(x).]
◮ A stationary kernel is invariant to translations of the input space: k(x, x′) = k(x − x′) = k(τ).
◮ Intuitively, this means the properties of the function are similar across the whole input domain.
◮ How might we make other non-stationary kernels, besides the Gibbs kernel?
◮ Warp the input space: k(x, x′) → k(g(x), g(x′)), where g is an arbitrary deterministic function.
◮ Modulate the amplitude of the kernel: if f(x) ∼ GP(0, k(x, x′)), then w(x)f(x) is a GP with kernel w(x)k(x, x′)w(x′).
◮ What would happen if we tried w1(x)f1(x) + w2(x)f2(x), where f1 and f2 are GPs?
◮ How about σ(w1(x))f1(x) + (1 − σ(w1(x)))f2(x)?
◮ The RBF kernel kRBF(x, x′) = a² exp(−||x − x′||²/(2ℓ²)) gives rise to very smooth sample functions.
◮ How might we create a drop-in replacement, while retaining useful inductive biases?
◮ Could replace the Euclidean distance measure with an absolute distance measure.
◮ Recall that stationary kernels k(τ), τ = x − x′, and spectral densities S(s) are Fourier duals.
◮ If we use a Student-t spectral density for S(s), and take the inverse Fourier transform, we obtain the Matérn kernel.
◮ The Matérn kernel: kMatérn(x, x′) = (2^(1−ν)/Γ(ν)) (√(2ν)||x − x′||/ℓ)^ν Kν(√(2ν)||x − x′||/ℓ), where Kν is a modified Bessel function.
◮ In one dimension, and when ν + 1/2 = p, for some natural number p, the kernel simplifies into a product of an exponential and a polynomial.
◮ By setting ν = 1/2, we obtain the Ornstein-Uhlenbeck (OU) kernel, kOU(x, x′) = exp(−|x − x′|/ℓ).
◮ The Matérn kernel does not have the concentration of measure problems of the RBF kernel in high dimensional input spaces.
◮ The kernel gives rise to a Markovian process (and classical filtering and smoothing techniques apply). A sampling comparison with the RBF kernel follows the figure below.
[Figure: (a) the OU and RBF covariance functions as a function of distance r; (b) sample functions from GP-OU and GP-RBF.]
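A sketch of how such sample paths can be drawn, using a Cholesky factor of the kernel matrix (the jitter term is a standard numerical safeguard):

```python
import numpy as np

def sample_gp(kernel, X, n_samples=3, jitter=1e-6):
    """Draw sample functions from a zero-mean GP with the given kernel."""
    K = kernel(X[:, None], X[None, :]) + jitter * np.eye(len(X))
    return np.linalg.cholesky(K) @ np.random.randn(len(X), n_samples)

k_rbf = lambda x, xp: np.exp(-(x - xp)**2 / 2)   # very smooth samples
k_ou = lambda x, xp: np.exp(-np.abs(x - xp))     # rough, Markovian samples

X = np.linspace(0, 10, 200)
f_rbf, f_ou = sample_gp(k_rbf, X), sample_gp(k_ou, X)
```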
◮ Are Gaussian processes Bayesian nonparametric models?
◮ For a Gaussian process f(x) to be non-parametric, the conditionals f(xi)|f−i, where f−i are all function values except f(xi), must be free to depend on all of f−i.
◮ For this freedom to be possible, it is a necessary (but not sufficient) condition that the kernel be derived from an infinite set of basis functions.
◮ Nonparametric kernels allow for a great amount of flexibility: the information capacity of the model grows with the available data.
◮ The parametric analogue to a GP with a non-parametric RBF kernel is a regression model with finitely many fixed radial basis functions.
◮ Discrete time auto-regressive model, e.g. an AR(1) process: f(t + 1) = a f(t) + ε(t), with ε(t) ∼ N(0, σ²).
◮ Is this model a Gaussian process?
[Figure: Gaussian process regression network. An input x feeds latent node functions f̂1(x), . . . , f̂q(x), which are mixed by input-dependent weight functions W11(x), . . . , Wpq(x) to produce outputs y1(x), . . . , yp(x).]
[Figure: spatial predictions plotted over longitude and latitude.]
◮ GPs in Bayesian neural network like architectures (Salakhutdinov and Hinton, 2008).
◮ Compositions of kernels (Archambeau and Bach, 2011; Durrande et al., 2011).
◮ If we can approximate the spectral density S(s) to arbitrary accuracy, then we can approximate any stationary kernel to arbitrary accuracy.
◮ We can model S(s) to arbitrary accuracy, since scale-location mixtures of Gaussians can approximate any distribution to arbitrary precision.
◮ A scale-location mixture of Gaussians can flexibly model many distributions, even with a small number of components.
◮ For inputs x ∈ R^P, let τ = x − x′. If S(s) is a mixture of Q Gaussians with weights wq, means μq = (μq(1), . . . , μq(P)), and diagonal covariances with entries vq = (vq(1), . . . , vq(P)), then taking the inverse Fourier transform gives the spectral mixture (SM) kernel (sketched below):
kSM(τ) = Σ_{q=1}^Q wq Π_{p=1}^P exp{−2π² τp² vq(p)} cos(2π τp μq(p)).
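A sketch of the SM kernel in one dimension (P = 1), with assumed hyperparameter values:

```python
import numpy as np

def sm_kernel(tau, w, mu, v):
    """k_SM(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q)."""
    tau = np.asarray(tau, dtype=float)[..., None]   # broadcast over components
    comp = np.exp(-2 * np.pi**2 * tau**2 * v) * np.cos(2 * np.pi * tau * mu)
    return (w * comp).sum(-1)

# Q = 2 components; weights, spectral means, and spectral variances (assumed)
w = np.array([1.0, 0.5])
mu = np.array([0.10, 0.25])
v = np.array([0.01, 0.02])
print(sm_kernel(np.linspace(0, 20, 5), w, mu, v))
```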
◮ Observations y(x) ∼ N(y(x); f(x), σ²).
◮ f(x) ∼ GP(0, kSM(x, x′|θ)).
◮ kSM(x, x′|θ) can approximate many different kernels with different settings of its hyperparameters θ.
◮ Learning involves training these hyperparameters through maximum marginal likelihood optimization of log p(y|θ) = −(1/2)yᵀ(Kθ + σ²I)⁻¹y (model fit) − (1/2)log|Kθ + σ²I| (complexity penalty) + const.
◮ Once hyperparameters are trained as θ̂, we make predictions using the predictive distribution p(f∗|y, θ̂).
[Figure: (c) CO₂ concentration (ppm) vs year: train and test data, 95% credible region, and extrapolations using MA, RQ, PER, SE, and SM kernels. (d) Log spectral density vs frequency (1/month): SM and SE kernels against the empirical spectrum.]
[Figure: (a), (b) recovered correlation functions k(τ).]
[Figure: (e) observations y(x); (f) true covariance function against the covariance learned with the SM kernel; (g) log spectral density of the SM kernel against the empirical spectrum.]
[Figure: (h) observations; (i) train and test data with predictions from MA, RQ, SE, PER, and SM kernels; (j) correlation functions of the MA and SM kernels; (k) log spectral densities of the SM and SE kernels against the empirical spectrum.]
[Figure: (l) airline passenger numbers (thousands) from 1949 to 1961: train and test data, 95% credible region, and predictions using PER, SE, RQ, MA, SM, and SSGP. (m) Log spectral densities of the SM and SE kernels against the empirical spectrum.]
◮ Expressive kernels will be most valuable on large datasets.
◮ Computational bottlenecks for GPs:
◮ Inference: computing (Kθ + σ²I)⁻¹y for an n × n matrix Kθ.
◮ Learning: computing log|Kθ + σ²I| for the marginal likelihood evaluations needed to train hyperparameters θ.
◮ Both inference and learning naively require O(n³) operations and O(n²) storage.
◮ Approximate non-parametric kernels in a finite basis ‘dual space’ (e.g. random Fourier features).
◮ Inducing point based sparse approximations. Examples: SoR, FITC.
◮ Exploit existing structure in K to quickly (and exactly) solve linear systems and compute log determinants (e.g. Kronecker and Toeplitz methods).
◮ Return to Bochner's theorem: k(τ) = ∫ S(s) e^(2πis·τ) ds.
◮ We can treat S(s) as a probability distribution and sample from it, to approximate the kernel by Monte Carlo.
◮ It is a valid Monte Carlo procedure to sample the pairs {sj, −sj} from S(s):
k(τ) ≈ (1/J) Σ_{j=1}^J (1/2)[exp(2πi sjᵀτ) + exp(−2πi sjᵀτ)] = (1/J) Σ_{j=1}^J cos(2π sjᵀτ).
◮ This is exactly the covariance function we get if we use a linear basis function model with trigonometric basis functions (a sketch follows).
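A sketch of this Monte Carlo approximation for the RBF kernel, whose spectral density under the convention above is Gaussian with standard deviation 1/(2πℓ):

```python
import numpy as np

def rff_kernel_estimate(tau, ell=1.0, J=500, seed=0):
    """k(tau) ~= (1/J) sum_j cos(2 pi s_j tau), with s_j drawn from S(s)."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, 1.0 / (2 * np.pi * ell), size=J)   # spectral samples
    return np.cos(2 * np.pi * s[None, :] * np.asarray(tau)[:, None]).mean(1)

tau = np.linspace(0, 3, 7)
exact = np.exp(-tau**2 / 2)   # RBF kernel with ell = 1
print(np.abs(rff_kernel_estimate(tau) - exact).max())   # small MC error
```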
◮ Gaussian process f and f∗ evaluated at n training points and J testing points, respectively.
◮ m ≪ n inducing points u, with p(u) = N(0, Ku,u).
◮ p(f∗, f) = ∫ p(f∗, f|u) p(u) du.
◮ Assume that f and f∗ are conditionally independent given u: p(f∗, f) ≈ q(f∗, f) = ∫ q(f∗|u) q(f|u) p(u) du.
◮ Exact conditional distributions:
p(f|u) = N(Kf,u Ku,u⁻¹ u, Kf,f − Qf,f), p(f∗|u) = N(Kf∗,u Ku,u⁻¹ u, Kf∗,f∗ − Qf∗,f∗), where Qa,b := Ka,u Ku,u⁻¹ Ku,b.
◮ Cost for predictions is reduced from O(n³) to O(m²n), where m ≪ n (see the sketch below).
◮ Different inducing point approaches correspond to different additional assumptions on q(f|u) and q(f∗|u).
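A sketch of the resulting O(m²n) predictive mean for an SoR-style approximation, using the standard identity μ∗ = K∗,u(σ²Ku,u + Ku,f Kf,u)⁻¹Ku,f y (toy data and names are illustrative):

```python
import numpy as np

def inducing_predict_mean(X, y, Xs, U, kernel, sigma=0.1):
    """Predictive mean using m inducing points: O(m^2 n) instead of O(n^3)."""
    Kuf = kernel(U[:, None], X[None, :])    # m x n
    Kus = kernel(U[:, None], Xs[None, :])   # m x n*
    Kuu = kernel(U[:, None], U[None, :])    # m x m
    A = sigma**2 * Kuu + Kuf @ Kuf.T        # only an m x m system to solve
    return Kus.T @ np.linalg.solve(A, Kuf @ y)

kernel = lambda x, xp: np.exp(-(x - xp)**2 / 2)
X = np.linspace(-3, 3, 500)
y = np.sin(X) + 0.1 * np.random.randn(500)
U = np.linspace(-3, 3, 20)                  # m = 20 inducing points
print(inducing_predict_mean(X, y, np.array([0.0, 1.0]), U, kernel))
```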
◮ SoR (subset of regressors) assumes deterministic conditionals:
qSoR(f|u) = N(Kf,u Ku,u⁻¹ u, 0), qSoR(f∗|u) = N(Kf∗,u Ku,u⁻¹ u, 0),
so every covariance is replaced by Qa,b = Ka,u Ku,u⁻¹ Ku,b.
◮ Equivalently, SoR uses the degenerate kernel kSoR(xi, xj) = k(xi, U) K_{U,U}⁻¹ k(U, xj).
K_{n×n} ≈ Q_{n×n} = K_{n×m} K_{m×m}⁻¹ K_{m×n}
◮ For m < n, this is a low rank covariance matrix, corresponding to a degenerate (finite basis) Gaussian process.
◮ As a result, for n large, SoR tends to underestimate uncertainty.
◮ Under SoR, the prior variance at a test point z is kSoR(z, z) = k(z, U) K_{U,U}⁻¹ K_{U,z} ≤ k(z, z), which is why uncertainty is underestimated; FITC restores the exact diagonal, qFITC(f|u) = N(Kf,u Ku,u⁻¹ u, diag[Kf,f − Qf,f]).
◮ If x ∈ R^P and the kernel decomposes as a product of kernels across input dimensions, k(xi, xj) = Π_{p=1}^P k^p(xi^p, xj^p) (e.g., the RBF kernel has this property), and
◮ the inputs x ∈ X lie on a multidimensional grid, X = X1 × · · · × XP ⊂ R^P, then
◮ K decomposes into a Kronecker product of matrices over each input dimension, K = K1 ⊗ · · · ⊗ KP, and
◮ the eigendecomposition of K into QVQᵀ also decomposes: Q = Q1 ⊗ · · · ⊗ QP, V = V1 ⊗ · · · ⊗ VP.
◮ Exploiting this structure, we can compute (K + σ²I)⁻¹y and log|K + σ²I| exactly and efficiently for N datapoints (a sketch follows).
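A sketch of a Kronecker-structured solve of (K + σ²I)x = y: eigendecompose each small K_p and apply the Kronecker matrix-vector products dimension by dimension, never forming the full N × N matrix (the helper names are my own):

```python
import numpy as np

def kron_mvprod(As, x):
    """Compute (A1 kron ... kron AP) x without forming the Kronecker product."""
    T = x.reshape([A.shape[0] for A in As])
    for p, A in enumerate(As):
        # contract A with the p-th tensor axis, then restore the axis order
        T = np.moveaxis(np.tensordot(A, T, axes=([1], [p])), 0, p)
    return T.reshape(-1)

def kron_solve(Ks, sigma2, y):
    """Solve (K1 kron ... kron KP + sigma^2 I) x = y via the per-dimension
    eigendecompositions K_p = Q_p V_p Q_p^T."""
    eigs = [np.linalg.eigh(K) for K in Ks]
    lam = eigs[0][0]
    for l, _ in eigs[1:]:
        lam = np.kron(lam, l)                    # eigenvalues of kron(K_p)
    Qs = [Q for _, Q in eigs]
    z = kron_mvprod([Q.T for Q in Qs], y)        # rotate into the eigenbasis
    return kron_mvprod(Qs, z / (lam + sigma2))   # scale and rotate back

# 30 x 40 grid: a 1200-point GP solved with only 30x30 and 40x40 eigensolves
x1, x2 = np.linspace(0, 1, 30), np.linspace(0, 1, 40)
K1 = np.exp(-(x1[:, None] - x1[None, :])**2 / 0.1)
K2 = np.exp(-(x2[:, None] - x2[None, :])**2 / 0.1)
x = kron_solve([K1, K2], 0.01, np.random.randn(30 * 40))
```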
◮ We assumed that the inputs x ∈ X are on a multidimensional grid.
◮ How might we relax this assumption, to use Kronecker methods if there are gaps (missing observations) in the grid?
◮ Assume imaginary points that complete the grid.
◮ Place infinite noise on these points so they have no effect on inference.
◮ The relevant matrices are no longer Kronecker, but we can get around this.
◮ Assuming we have a dataset of M observations which are not necessarily on a grid, complete the grid with W imaginary observations yW.
◮ The total observation vector y = [yM, yW]ᵀ has N = M + W entries.
◮ The imaginary observations yW have no corrupting effect on inference: they are given infinite noise variance.
◮ We use preconditioned conjugate gradients to compute (KN + DN)⁻¹y, where DN contains the noise variances (infinite for the imaginary points).
◮ For the log determinant in the marginal likelihood (used in hyperparameter learning), we approximate the eigenvalues λᵢᴹ of KM using the largest M eigenvalues λᵢᴺ of KN:
log|KM + σ²IM| = Σ_{i=1}^M log(λᵢᴹ + σ²) ≈ Σ_{i=1}^M log(λ̃ᵢᴹ + σ²), where λ̃ᵢᴹ = (M/N) λᵢᴺ for i = 1, . . . , M (a numerical check follows).
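A numerical sketch of this eigenvalue approximation; here K_N is formed densely just to check it, whereas in practice the λᴺ come cheaply from Kronecker structure:

```python
import numpy as np

M, N, sigma2 = 80, 100, 0.1
x_grid = np.linspace(0, 10, N)                         # the completed grid
keep = np.sort(np.random.choice(N, M, replace=False))  # M observed points
K_N = np.exp(-(x_grid[:, None] - x_grid[None, :])**2 / 2)
K_M = K_N[np.ix_(keep, keep)]

lam_N = np.linalg.eigvalsh(K_N)[::-1]          # eigenvalues, largest first
lam_approx = (M / N) * lam_N[:M]               # scaled top-M eigenvalues of K_N
approx = np.log(lam_approx + sigma2).sum()
exact = np.log(np.linalg.eigvalsh(K_M) + sigma2).sum()
print(approx, exact)                           # close for M/N near 1
```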
◮ The spectral mixture kernel, in its standard form, does not quite have product structure across input dimensions.
◮ Introduce a spectral mixture product (SMP) kernel, which takes a product of one-dimensional SM kernels across input dimensions: kSMP(τ|θ) = Π_{p=1}^P kSM(τp|θp).
◮ Observations y(x) ∼ N(y(x); f(x), σ²).
◮ f(x) ∼ GP(0, kSMP(x, x′|θ)).
◮ kSMP(x, x′|θ) can approximate many different kernels with different settings of its hyperparameters θ.
◮ Learning involves training these hyperparameters through maximum marginal likelihood optimization, balancing model fit against a complexity penalty, as before.
◮ Once hyperparameters are trained as θ̂, we make predictions using the predictive distribution p(f∗|y, θ̂).
◮ Exploit Kronecker structure for fast exact inference and learning (with imaginary observations for non-grid data), requiring O(PN^((P+1)/P)) operations and O(PN^(2/P)) storage, for N datapoints and P input dimensions.
[Figure: pattern extrapolation. (a) Train, (b) test, and (c) full images, with reconstructions by (d) GPatt, (e) SSGP, (f) FITC, (g) GP-SE, (h) GP-MA, and (i) GP-RQ.]
[Figure: further pattern extrapolations on two training images, (a) and (d), comparing GPatt, (b) and (e), against GP-MA, (c) and (f).]
◮ Simple initialisation.
◮ The marginal likelihood shrinks the weights of extraneous components to zero.
[Figure: (a)–(i) pattern extrapolation as above; (j) learned versus initial spectral mixture weights and means (w1, μ1, w2, μ2) under the GPatt initialisation; (k)–(p) further train/GPatt/GP-MA comparisons.]
[Figure: texture patterns: (a) rubber mat, (b) tread plate, (c) pores, (d) wood, (e) chain mail, (f) cone.]
[Figure: (a) runtime stress test; (b) accuracy stress test.]
◮ GPatt makes almost no assumptions about the correlation structures in the data; they are learned automatically.
◮ Top row: true frames taken from the middle of a movie. Bottom row: frames predicted by GPatt.
◮ 112,500 datapoints. GPatt training time is under 5 minutes.
◮ Train using 9 years of temperature data. First two rows show the last 12 months of data.
◮ Predictions using GP-SE (a GP with an SE, or RBF, kernel).

[Figure: land surface temperature maps, colour scale −40 to 40.]
◮ Train using 9 years of temperature data. First two rows show the last 12 months of data.
◮ Predictions using GPatt. Training time < 30 minutes.

[Figure: land surface temperature maps, colour scale −40 to 40.]
[Figure: (a) the learned GPatt kernel for the temperature data, showing correlations against time (months), X (km), and Y (km); (b) the learned GP-SE kernel for the same data.]

◮ The learned GPatt kernel tells us interesting properties of the data, capturing structure that a simple SE kernel cannot.