Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP)
Andrew Gordon Wilson
Postdoctoral Research Fellow, Carnegie Mellon University
www.cs.cmu.edu/~andrewgw
Joint work with Hannes Nickisch
ICML, Lille, France, 7 July 2015
◮ Gaussian processes (GPs) are exactly the types of models we want to apply to large datasets: flexible, expressive function approximators that can learn rich structure from data.
◮ However, GPs require O(n³) computations and O(n²) storage.
◮ We present a near-exact, O(n), general purpose Gaussian process framework (KISS-GP).
◮ This framework i) provides a new unifying perspective of scalable GP approximations (inducing point and structure exploiting methods), and ii) allows a very large number of inducing points, improving accuracy and kernel learning.
◮ Code is available.
◮ Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning any finite collection of function values is jointly Gaussian: [f(x₁), …, f(xₙ)]ᵀ ∼ N(µ, K), with µᵢ = m(xᵢ) and Kᵢⱼ = k(xᵢ, xⱼ).
◮ Posterior: p(f | y) ∝ p(y | f) × p(f), i.e. GP posterior ∝ likelihood × GP prior.
[Figure: samples from the GP prior and the GP posterior after conditioning on data; input x versus output f(x).]
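◮ A minimal numpy sketch of this prior-to-posterior update, assuming an RBF kernel and illustrative toy data (names such as rbf_kernel are placeholders, not from the talk):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """k(x, x') = variance * exp(-0.5 * (x - x')^2 / lengthscale^2)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Toy training data and a dense set of test inputs
X = np.array([-6.0, -3.0, 0.0, 2.0, 5.0])
y = np.sin(X)
x_star = np.linspace(-10, 10, 200)
noise = 1e-2

# Samples from the GP prior: f ~ N(0, K(x*, x*))
K_ss = rbf_kernel(x_star, x_star)
prior_samples = np.random.multivariate_normal(
    np.zeros(len(x_star)), K_ss + 1e-8 * np.eye(len(x_star)), size=3)

# Posterior mean and covariance after conditioning on (X, y)
K_xx = rbf_kernel(X, X) + noise * np.eye(len(X))
K_xs = rbf_kernel(X, x_star)
post_mean = K_xs.T @ np.linalg.solve(K_xx, y)
post_cov = K_ss - K_xs.T @ np.linalg.solve(K_xx, K_xs)
```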
◮ Kernel hyperparameters θ are learned by maximizing the marginal likelihood:
log p(y | θ) = −(1/2) yᵀ(K_θ + σ²I)⁻¹y [model fit] − (1/2) log |K_θ + σ²I| [complexity penalty] − (n/2) log 2π.
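◮ A sketch of how the two terms are typically computed via a Cholesky factorization (standard GP practice, not code from the talk):

```python
import numpy as np

def log_marginal_likelihood(K_theta, y, sigma2):
    """log p(y | theta) for a zero-mean GP: model fit + complexity penalty + constant."""
    n = len(y)
    L = np.linalg.cholesky(K_theta + sigma2 * np.eye(n))    # K_theta + sigma^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))     # (K_theta + sigma^2 I)^{-1} y
    model_fit = -0.5 * y @ alpha                            # rewards explaining the data
    complexity_penalty = -np.sum(np.log(np.diag(L)))        # -0.5 * log|K_theta + sigma^2 I|
    return model_fit + complexity_penalty - 0.5 * n * np.log(2 * np.pi)
```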
◮ Examples: Kronecker structure, K = K₁ ⊗ K₂ ⊗ · · · ⊗ K_P; Toeplitz structure, arising from a stationary kernel on a regularly spaced 1D grid.
◮ Extremely efficient and accurate, but require severe grid assumptions on the inputs.
◮ Inducing point methods: choose m inducing points U = {uᵢ}, i = 1, …, m, and approximate K_{X,X} ≈ K_{X,U} K_{U,U}^{-1} K_{U,X}.
◮ Examples: SoR, DTC, FITC, Big Data GP.
◮ General purpose, but require m ≪ n for efficiency, which degrades accuracy.
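◮ A minimal sketch of such an inducing point (SoR-style) approximation, with an assumed RBF kernel and illustrative names:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

n, m = 1000, 50
X = np.sort(np.random.uniform(-10, 10, n))        # n training inputs
U = np.linspace(-10, 10, m)                       # m inducing points, m << n

K_XU = rbf(X, U)                                  # n x m
K_UU = rbf(U, U) + 1e-6 * np.eye(m)               # m x m, with jitter
K_approx = K_XU @ np.linalg.solve(K_UU, K_XU.T)   # K_XX ≈ K_XU K_UU^{-1} K_UX

# Accuracy degrades as m is made small relative to n
err = np.max(np.abs(K_approx - rbf(X, X)))
```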
◮ Recall the inducing point approximation: K_{X,X} (n×n) ≈ K_{X,U} (n×m) K_{U,U}^{-1} (m×m) K_{U,X} (m×n).
◮ Complexity is O(m²n + m³).
◮ It is tempting to place the inducing points U on a grid to create structure in K_{U,U}.
◮ Can we approximate K_{X,U} from K_{U,U}?
◮ Structured Kernel Interpolation (SKI): approximate K_{X,U} ≈ W K_{U,U}, where W is a sparse matrix of interpolation weights. Then
K_{X,X} ≈ K_{X,U} K_{U,U}^{-1} K_{U,X} ≈ W K_{U,U} K_{U,U}^{-1} K_{U,U} Wᵀ = W K_{U,U} Wᵀ = K_SKI.
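◮ A minimal sketch of SKI on a 1D grid, using local linear interpolation (two nonzero weights per row of W) in place of the local cubic interpolation used in the paper; all names here are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def linear_interp_weights(X, grid):
    """Sparse W (n x m): local linear interpolation of each x onto the grid,
    so that K_XU ≈ W K_UU."""
    h = grid[1] - grid[0]
    idx = np.clip(np.searchsorted(grid, X) - 1, 0, len(grid) - 2)
    frac = (X - grid[idx]) / h                 # position of x between its two grid neighbours
    rows = np.repeat(np.arange(len(X)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1 - frac, frac], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(X), len(grid)))

n, m = 1000, 400
X = np.sort(np.random.uniform(0, 10, n))
U = np.linspace(0, 10, m)                      # inducing points on a regular grid

W = linear_interp_weights(X, U)                # sparse: O(n) nonzero entries
K_UU = rbf(U, U)

# Only for this small check; K_SKI is never formed explicitly in practice
W_dense = W.toarray()
K_SKI = W_dense @ K_UU @ W_dense.T             # K_XX ≈ W K_UU W^T
err = np.max(np.abs(K_SKI - rbf(X, X)))        # small for a sufficiently fine grid
```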
◮ K_SKI = W K_{U,U} Wᵀ, where W is n×m and K_{U,U} is m×m.
◮ MVMs with W cost O(n) computations and storage.
◮ Toeplitz K_{U,U}: MVMs cost O(m log m).
◮ Kronecker structure in K_{U,U}: MVMs cost O(P m^{1+1/P}).
◮ MVMs with K_SKI cost O(n) computations and storage!
◮ We can therefore solve K_SKI^{-1} y using linear conjugate gradients in j ≪ n iterations, using only MVMs (a sketch follows below).
◮ Even if the inputs X do not have any structure, we can naturally create structure by placing the inducing points U on a regular grid.
◮ We can use m ≫ n inducing points! (Accuracy and kernel learning)
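◮ An illustrative end-to-end sketch (not the authors' GPML code): conjugate gradients applied to (K_SKI + σ²I)v = y, where each MVM uses the sparse W and an FFT-based Toeplitz MVM for K_{U,U}; all names are placeholders:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def linear_interp_weights(X, grid):
    """Sparse n x m local linear interpolation weights (as in the earlier SKI sketch)."""
    h = grid[1] - grid[0]
    idx = np.clip(np.searchsorted(grid, X) - 1, 0, len(grid) - 2)
    frac = (X - grid[idx]) / h
    rows = np.repeat(np.arange(len(X)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1 - frac, frac], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(X), len(grid)))

def toeplitz_mvm(col, v):
    """O(m log m) product of a symmetric Toeplitz matrix (first column `col`) with v,
    via circulant embedding and the FFT."""
    m = len(col)
    c = np.concatenate([col, col[-2:0:-1]])            # circulant embedding of size 2m - 2
    fv = np.fft.rfft(np.concatenate([v, np.zeros(m - 2)]))
    return np.fft.irfft(np.fft.rfft(c) * fv, n=2 * m - 2)[:m]

# Toy problem: the grid of inducing points can be as large as (or larger than) n
n, m, sigma2 = 1000, 2000, 0.01
X = np.sort(np.random.uniform(0, 10, n))
y = np.sin(X) + np.sqrt(sigma2) * np.random.randn(n)
U = np.linspace(0, 10, m)

W = linear_interp_weights(X, U)
k_col = rbf(U, U[:1]).ravel()                          # first column of the Toeplitz K_UU

def ski_mvm(v):
    """(W K_UU W^T + sigma^2 I) v, without ever forming an n x n matrix."""
    return W @ toeplitz_mvm(k_col, W.T @ v) + sigma2 * v

A = LinearOperator((n, n), matvec=ski_mvm)
alpha, info = cg(A, y)                                 # converges in j << n iterations
train_mean = W @ toeplitz_mvm(k_col, W.T @ alpha)      # K_SKI alpha: fitted mean at X
```

◮ The key point: each CG iteration costs only a sparse MVM with W plus a structured MVM with K_{U,U}; no n×n matrix is ever formed.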
◮ The predictive mean of a noise-free, zero-mean GP (σ = 0, µ(x) ≡ 0) is a weighted sum of training–test cross-covariances K_{X,x∗}:
f̄(x∗) = K_{x∗,X} K_{X,X}^{-1} y.
◮ If we perform a noise-free, zero-mean GP regression on the kernel itself, given data {uᵢ, k(uᵢ, z)}, i = 1, …, m, then we recover the SoR kernel k_SoR(x, z) = K_{x,U} K_{U,U}^{-1} K_{U,z} as the predictive mean of this GP evaluated at x.
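◮ A small numerical check of this interpretation (assumed RBF kernel, illustrative names): global GP interpolation of the kernel values {uᵢ, k(uᵢ, z)} gives k_SoR(x, z), while cheap local interpolation of the same values gives k_SKI(x, z) = w_xᵀ K_{U,z}, previewing the next figure:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

U = np.linspace(0, 10, 40)                 # inducing points on a grid
x, z = 3.37, 6.0                           # kernel arguments; x need not lie on the grid

K_UU = rbf(U, U) + 1e-10 * np.eye(len(U))
K_Uz = rbf(U, np.array([z]))               # kernel "targets" k(u_i, z), an m x 1 vector

# SoR: noise-free, zero-mean GP regression on the kernel values, evaluated at x
k_sor = (rbf(np.array([x]), U) @ np.linalg.solve(K_UU, K_Uz)).item()

# SKI: cheap local (here linear) interpolation of the same kernel values
h = U[1] - U[0]
i = int(np.clip(np.searchsorted(U, x) - 1, 0, len(U) - 2))
w = np.zeros(len(U))
w[i], w[i + 1] = 1 - (x - U[i]) / h, (x - U[i]) / h
k_ski = (w @ K_Uz).item()

k_true = rbf(np.array([x]), np.array([z])).item()   # both approximations should be close to this
```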
[Figure: kernel interpolation. a) SoR performs global GP interpolation of the kernel values k(U, u), giving k_SoR(x, u) = K_{x,U} K_{U,U}^{-1} K_{U,u}, for any desired x and u. b) SKI can perform local kernel interpolation: k_SKI(x, u) = w_xᵀ K_{U,u}.]
[Figure: kernel matrix reconstruction experiments. Error versus number of inducing points m for equi-linear, kmeans-linear, equi-GP, and equi-cubic interpolation strategies; and error versus runtime (s) for SKI (linear), SoR, FITC, and SKI (cubic).]
[Figure: comparison of True, FITC, and SKI.]