Sparse Gaussian Process Approximations
Dr. Richard E. Turner (ret26@cam.ac.uk)
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge
Motivating application 1: Audio modelling

[Figure: audio time-series data and its reconstruction using a GP model, shown on the seconds and milliseconds scales; T = 10^5 - 10^7 datapoints.]
How can we use GPs in this setting?
Motivating application 2: non-linear regression

[Figure: average test log-likelihood/nats on ten UCI regression benchmarks — boston (N = 506, D = 13), concrete (N = 1030, D = 8), energy (N = 768, D = 8), kin8nm (N = 8192, D = 8), naval (N = 11934, D = 16), power (N = 9568, D = 4), protein (N = 45730, D = 9), red wine (N = 1588, D = 11), yacht (N = 308, D = 6) and year (N = 515345, D = 90) — comparing BNN-deterministic, BNN-sampling, GP and DGP.]
Motivation: Gaussian Process Regression

[Figure: regression dataset with inputs on the x-axis and a question mark at the function value to be predicted; inference & learning in the full model run into intractabilities, both computational and analytic.]
A Brief History of Gaussian Process Approximations

Methods employing pseudo-data fall into two camps:
approximate generative model / exact inference: FITC, PITC, DTC
exact generative model / approximate inference: VFE, EP, PP

FITC: Snelson and Ghahramani, “Sparse Gaussian Processes using Pseudo-inputs”
PITC: Snelson and Ghahramani, “Local and global sparse Gaussian process approximations”
EP: Csató and Opper, 2002 / Qi et al., “Sparse-posterior Gaussian Processes for general likelihoods”
VFE: Titsias, “Variational Learning of Inducing Variables in Sparse Gaussian Processes”
DTC / PP: Seeger et al., “Fast Forward Selection to Speed Up Sparse Gaussian Process Regression”

Unifying views:
A Unifying View of Sparse Approximate Gaussian Process Regression, Quiñonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)
A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC, ...)
Factor Graphs: introduction / reminder

[Figure: factor graph examples.]

What is the minimal factor graph for this multivariate Gaussian? (4-dimensional example; solution shown on the slide.)
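The slide's figures are not reproduced in this transcript. As an illustrative example of the idea (not the slide's exact matrix): zeros in the precision matrix delete pairwise factors, so a 4-dimensional Gaussian with tridiagonal precision Q factorises into a chain,

\[
p(x) \propto \exp\!\big(-\tfrac{1}{2} x^\top Q x\big)
 = \prod_{i=1}^{4} \exp\!\big(-\tfrac{1}{2} Q_{ii}\, x_i^2\big)
   \prod_{i=1}^{3} \exp\!\big(-Q_{i,i+1}\, x_i x_{i+1}\big),
\]

so the minimal factor graph has one unary factor per variable and one pairwise factor per non-zero off-diagonal entry of Q.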
Fully independent training conditional (FITC) approximation

Idea: construct a new generative model (with pseudo-data) that is cheaper to perform exact learning and inference in, calibrated to the original (e.g. using a KL divergence; there are many choices).

The new model keeps all of the prior's factors but severs the direct dependencies between the training function values, resulting in a simpler model; the retained conditionals are set equal to the exact conditionals. Exact inference in the calibrated model then yields an indirect posterior approximation to the original.
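In symbols (the standard FITC construction; the notation is assumed rather than copied from the slides): the pseudo-data u keep the exact GP prior, and the training function values are made conditionally independent given u, each conditional equal to the exact GP conditional,

\[
p(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{0},\, K_{\mathbf{uu}}), \qquad
q(\mathbf{f} \mid \mathbf{u}) = \prod_{n=1}^{N}
 \mathcal{N}\!\big(f_n;\, K_{f_n \mathbf{u}} K_{\mathbf{uu}}^{-1} \mathbf{u},\;
 K_{f_n f_n} - K_{f_n \mathbf{u}} K_{\mathbf{uu}}^{-1} K_{\mathbf{u} f_n}\big),
\]

which gives the FITC prior covariance Q_ff + diag(K_ff - Q_ff), where Q_ff = K_fu K_uu^{-1} K_uf.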
How do we make predictions? In the new model this is exact: compute the posterior over the pseudo-data, then propagate it through the retained conditional to the test point.
The cost of computing the likelihood is O(NM^2), in place of the O(N^3) of exact GP regression.
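A minimal numpy sketch of this O(NM^2) computation (an illustration under assumptions — squared-exponential kernel, zero mean, made-up hyperparameters — not the lecture's code):

```python
import numpy as np

def rbf(X1, X2, ell=1.0, sf2=1.0):
    """Squared-exponential kernel matrix (illustrative choice)."""
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def fitc_log_marginal(X, y, Z, sn2=0.1, jitter=1e-6):
    """FITC log marginal likelihood log N(y; 0, Qff + diag(Kff - Qff) + sn2*I),
    evaluated in O(N M^2) via the matrix inversion lemma."""
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * np.eye(M)
    Kuf = rbf(Z, X)                                  # M x N
    kff_diag = np.full(N, rbf(X[:1], X[:1])[0, 0])   # stationary kernel: constant diagonal
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kuf)                      # V.T @ V = Qff
    qff_diag = np.sum(V**2, axis=0)
    Lam = kff_diag - qff_diag + sn2                  # FITC diagonal correction + noise
    Vs = V / np.sqrt(Lam)                            # columns scaled by Lam^{-1/2}
    B = np.eye(M) + Vs @ Vs.T                        # M x M, never form the N x N matrix
    LB = np.linalg.cholesky(B)
    ys = y / np.sqrt(Lam)
    c = np.linalg.solve(LB, Vs @ ys)
    quad = ys @ ys - c @ c                           # y^T (Qff + Lam)^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(LB))) + np.sum(np.log(Lam))
    return -0.5 * (quad + logdet + N * np.log(2.0 * np.pi))

# toy usage with hypothetical data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)
Z = np.linspace(0, 10, 20)[:, None]                  # M = 20 pseudo-inputs
print(fitc_log_marginal(X, y, Z))
```

The key design point is that only M x M matrices are ever decomposed; the N x N covariance appears only through its diagonal.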
FITC: Demo (Snelson)

[Figure: FITC fit on Snelson's 1-d example dataset, showing the pseudo-input locations, predictive mean and error bars.]
Fully independent training conditional (FITC) approximation

FITC is parametric (although cleverly so): if I see more data, should I add extra pseudo-data?
◮ unnatural from a generative modelling perspective
◮ natural from a prediction perspective (the posterior gets more complex)
⇒ we have lost the elegant separation of model, inference and approximation

FITC is an example of prior approximation. Extensions: inter-domain GPs (pseudo-data live in a different space); partially independent training conditional (PITC) and tree-structured approximations.
Variational free-energy method (VFE)

Lower bound the likelihood. The slack in the bound is a KL divergence between stochastic processes (between the approximate and true posterior over the whole function).
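In symbols (the standard variational identity; notation assumed):

\[
\log p(\mathbf{y}) = \mathcal{F}(q) + \mathrm{KL}\big(q(f) \,\|\, p(f \mid \mathbf{y})\big)
\;\ge\; \mathcal{F}(q) = \int q(f) \log \frac{p(\mathbf{y}, f)}{q(f)}\, \mathrm{d}f,
\]

so maximising the free-energy F(q) tightens the bound by shrinking the KL.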
Assume the approximate posterior factorises with a special form: the GP prior conditional is kept exact, and only a Gaussian over the inducing variables is free.

Comparing the true posterior with this approximate posterior, the free Gaussian has the same form as the prediction from GP regression, parameterised by the input locations of the 'pseudo' data, and by the 'pseudo' data and their covariance.
Plug this form (the predictive from GP regression) into the free-energy and simplify.
The bound collapses into a DTC-like term plus an uncertainty-based correction; the intermediate quantities are a KL between two multivariate Gaussians and the average of a quadratic form. Make the bound as tight as possible by optimising the free parameters (the DTC-like term alone is the DTC objective).
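Written out, this is Titsias's collapsed bound (a standard result, with Q_ff = K_fu K_uu^{-1} K_uf and noise variance sigma_y^2):

\[
\mathcal{F} = \underbrace{\log \mathcal{N}\!\big(\mathbf{y};\, \mathbf{0},\, Q_{\mathbf{ff}} + \sigma_y^2 I\big)}_{\text{DTC-like term}}
\;-\; \underbrace{\frac{1}{2\sigma_y^2}\,\mathrm{tr}\big(K_{\mathbf{ff}} - Q_{\mathbf{ff}}\big)}_{\text{uncertainty-based correction}}.
\]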
Summary of VFE method

Same computational complexity as FITC: O(NM^2). Variational methods are known to underfit (and have other biases). No augmentation is required: the target is the posterior over functions, which includes the inducing variables.
◮ pseudo-input locations are pure variational parameters (they do not parameterise the generative model as they do in FITC)
◮ coherent way of adding pseudo-data: more complex posteriors require more computational resources (more pseudo-points)

Rule of thumb: VFE returns better mean estimates; FITC returns better error-bar estimates.

How should we select M, the number of pseudo-points?
How do we select M = number of pseudo-data?

[Figure: a long 1-d regression dataset (y against x, inputs from 200 to 2000) with pseudo-input locations marked, and log-log plots of SMSE against compute time/s comparing the exact GP with VFE as M is varied.]
EP pseudo-point approximation

Start from the exact joint: the true posterior and marginal likelihood follow from the regression model. Approximate the posterior directly by replacing each likelihood term with a simple pseudo-observation factor, so the approximate posterior has the free parameters of a GP regression fit: the input locations of the 'pseudo' data and the 'pseudo' data themselves.
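Schematically (a standard way of writing the approximation; the site notation t_n is assumed):

\[
p(f \mid \mathbf{y}) \;\propto\; p(f) \prod_{n=1}^{N} p(y_n \mid f_n)
\;\;\approx\;\;
q(f) \;\propto\; p(f) \prod_{n=1}^{N} t_n(\mathbf{u}),
\]

where each site t_n(u) is a simple (unnormalised Gaussian) factor over the M pseudo-points rather than over f_n.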
EP algorithm

Repeat for each data point n:
1. take out one pseudo-observation likelihood (forming the cavity)
2. add in one true observation likelihood (forming the tilted distribution)
3. project onto the approximating family (a KL between unnormalised stochastic processes)
4. update the pseudo-observation likelihood (a rank-1 update)
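In equations (standard EP; the q and t_n notation is assumed as above):

\[
q_{\setminus n}(f) \propto \frac{q(f)}{t_n(\mathbf{u})} \;\;\text{(cavity)}, \qquad
\tilde{p}_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f_n) \;\;\text{(tilted)},
\]
\[
q^{\mathrm{new}} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(\tilde{p}_n \,\big\|\, q\big) \;\;\text{(project)}, \qquad
t_n^{\mathrm{new}}(\mathbf{u}) \propto \frac{q^{\mathrm{new}}(f)}{q_{\setminus n}(f)} \;\;\text{(update)}.
\]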
Fixed points of EP = FITC approximation

The fixed points of this EP algorithm recover the FITC approximation: the approximate-generative-model and approximate-inference views meet here. This interpretation resolves issues with FITC: why does it work so well? Are we allowed to increase M with N?
Power EP algorithm (as tractable as EP)

Repeat for each data point n:
1. take out a fraction of one pseudo-observation likelihood (forming the cavity)
2. add in the same fraction of the true observation likelihood (forming the tilted distribution)
3. project onto the approximating family (a KL between unnormalised stochastic processes)
4. update the pseudo-observation likelihood (a rank-1 update)
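The fractional version (one common form of the Power EP updates, with power parameter alpha; at alpha = 1 this reduces to EP, and alpha -> 0 recovers the VFE solution, per Bui, Yan and Turner):

\[
q_{\setminus n}(f) \propto \frac{q(f)}{t_n(\mathbf{u})^{\alpha}}, \qquad
\tilde{p}_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f_n)^{\alpha}, \qquad
t_n^{\mathrm{new}}(\mathbf{u}) \propto t_n(\mathbf{u})^{1-\alpha}\,
 \frac{q^{\mathrm{new}}(f)}{q_{\setminus n}(f)}.
\]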
Power EP: a unifying framework

Special cases: FITC (Csató and Opper, 2002; Snelson and Ghahramani, 2005) and VFE (Titsias, 2009) are both recovered by Power EP.

[Diagram: map of pseudo-point approximations for GP regression and GP classification, locating PEP, VFE and EP together with their inter-domain and structured variants (FITC, PITC) and the methods of refs. [4-17]. * = optimised pseudo-inputs; ** = structured versions of VFE recover VFE.]

[4] Quiñonero-Candela et al., 2005. [5] Snelson et al., 2005. [6] Snelson, 2006. [7] Schwaighofer, 2002. [8] Titsias, 2009. [9] Csató, 2002. [10] Csató et al., 2002. [11] Seeger et al., 2003. [12] Naish-Guzman et al., 2007. [13] Qi et al., 2010. [14] Hensman et al., 2015. [15] Hernández-Lobato et al., 2016. [16] Matthews et al., 2016. [17] Figueiras-Vidal et al., 2009.
How should I set the power parameter α?

Setup: 6 UCI classification datasets (20 random splits, M = 10, 50, 100, hyperparameters and inducing inputs optimised) and 8 UCI regression datasets (20 random splits, M up to 200, hyperparameters and inducing inputs optimised).

[Figure: average MSE rank, error rank and log-loss rank as a function of α ∈ [0, 1].]

α = 0.5 does well on average.
References (hyperlinked)

Approximate inference in GPs:
A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint, 2016

Scalable approximate inference:
Stochastic Expectation Propagation, NIPS 2015
Black-box α-divergence Minimization, ICML 2016

Deep Gaussian Processes (incl. comparisons to Bayesian neural networks and GPs):
Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016
GP regression: introducing notation

Generative model (like non-linear regression): place a GP prior over the non-linear function (smoothly wiggling functions expected). Since a sum of Gaussian variables is Gaussian, this induces a GP over the observations.

The predictive mean is linear in the data. The predictive covariance is the prior uncertainty minus a reduction in uncertainty, so predictions are more confident than the prior.
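The slides' equations appear as images in the original; the standard GP regression expressions they describe are (notation assumed):

\[
y_n = f(x_n) + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma_y^2), \qquad f \sim \mathcal{GP}(0, k),
\]
\[
\text{mean:}\quad m(x_\star) = K_{\star \mathbf{f}} \big(K_{\mathbf{ff}} + \sigma_y^2 I\big)^{-1} \mathbf{y} \quad \text{(linear in the data)},
\]
\[
\text{covariance:}\quad \Sigma_\star = \underbrace{K_{\star\star}}_{\text{prior uncertainty}}
 - \underbrace{K_{\star \mathbf{f}} \big(K_{\mathbf{ff}} + \sigma_y^2 I\big)^{-1} K_{\mathbf{f}\star}}_{\text{reduction in uncertainty}}.
\]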
A brief introduction to the Kullback-Leibler divergence

\[
\mathrm{KL}(p_1(z) \,\|\, p_2(z)) = \int p_1(z) \log \frac{p_1(z)}{p_2(z)}\, \mathrm{d}z
\]

Important properties:
Gibbs' inequality: KL(p1(z)||p2(z)) ≥ 0, with equality iff p1(z) = p2(z)
◮ proof via Jensen's inequality or differentiation (see MacKay pg. 35)
Non-symmetric: KL(p1(z)||p2(z)) ≠ KL(p2(z)||p1(z))
◮ hence named a divergence and not a distance
Example: binary variables z ∈ {0, 1}, with p(z = 1) = 0.8 and q(z = 1) = ρ.

[Figure: KL(q || p) and KL(p || q) plotted against ρ ∈ [0, 1]; both vanish at ρ = 0.8.]
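Written out for this example:

\[
\mathrm{KL}(q \,\|\, p) = \rho \log\frac{\rho}{0.8} + (1-\rho)\log\frac{1-\rho}{0.2},
\qquad
\mathrm{KL}(p \,\|\, q) = 0.8 \log\frac{0.8}{\rho} + 0.2 \log\frac{0.2}{1-\rho}.
\]

Both are zero at ρ = 0.8, but KL(p || q) diverges as ρ → 0 or ρ → 1 (it heavily penalises q placing vanishing mass where p has mass), while KL(q || p) remains finite at both endpoints.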