SLIDE 1

Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP)

Andrew Gordon Wilson

Postdoctoral Research Fellow, Carnegie Mellon University
www.cs.cmu.edu/~andrewgw
Joint work with Hannes Nickisch
ICML, Lille, France, 7 July 2015

SLIDE 2

Scalable and Accurate Gaussian Processes

◮ Gaussian processes (GPs) are exactly the types of models we want to apply to big data: flexible function approximators, capable of using the information in large datasets to learn intricate structure through covariance kernels.

◮ However, GPs require O(n³) computations and O(n²) storage.

◮ We present a near-exact, O(n), general-purpose Gaussian process framework.

◮ This framework i) provides a new unifying perspective of scalable GP approaches, ii) can be used to make predictions with GPs on massive datasets, and iii) enables large-scale kernel learning.

◮ Code is available: http://www.cs.cmu.edu/~andrewgw/pattern

SLIDE 3

Gaussian Process Review

Definition

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution.

Nonparametric Regression Model

◮ Prior: f(x) ∼ GP(m(x), k(x, x′)), meaning (f(x_1), . . . , f(x_N)) ∼ N(µ, K), with µ_i = m(x_i) and K_ij = cov(f(x_i), f(x_j)) = k(x_i, x_j).

◮ Posterior: p(f(x) | D) ∝ p(D | f(x)) p(f(x)), i.e. GP posterior ∝ likelihood × GP prior.

[Figure: samples from the GP prior (left) and from the GP posterior (right); output f(x) plotted against input x.]
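To make the review concrete, here is a minimal NumPy sketch (my addition, not part of the slides) of drawing functions from a zero-mean GP prior, like those in the left panel above; the squared-exponential kernel and the input grid are illustrative choices.

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

x = np.linspace(-10, 10, 200)                   # input grid
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))    # add jitter for numerical stability

# Draw three functions from the zero-mean GP prior f(x) ~ GP(0, k).
prior_samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=3)
```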

SLIDE 4

Inference and Learning

1. Learning: optimize the marginal likelihood,

   log p(y | θ, X) = −(1/2) yᵀ (K_θ + σ²I)⁻¹ y − (1/2) log |K_θ + σ²I| − (N/2) log(2π),

   with respect to the kernel hyperparameters θ. The first term measures model fit and the second acts as a complexity penalty. The marginal likelihood provides a powerful mechanism for kernel learning.

2. Inference: conditioned on the kernel hyperparameters θ, form the predictive distribution for test inputs X∗:

   f∗ | X∗, X, y, θ ∼ N(f̄∗, cov(f∗)),
   f̄∗ = K_θ(X∗, X) [K_θ(X, X) + σ²I]⁻¹ y,
   cov(f∗) = K_θ(X∗, X∗) − K_θ(X∗, X) [K_θ(X, X) + σ²I]⁻¹ K_θ(X, X∗).

   Computing (K_θ + σ²I)⁻¹ y and log |K_θ + σ²I| naively requires O(n³) computations and O(n²) storage.
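As a reference point for these costs (my sketch, not from the slides), naive exact GP prediction and marginal likelihood evaluation via a Cholesky factorisation look roughly as follows; the function and argument names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def exact_gp(K_train, K_cross, K_test_diag, y, noise_var):
    """Naive exact GP regression: O(n^3) time and O(n^2) storage.
    K_train: n x n training kernel; K_cross: n_test x n cross-kernel;
    K_test_diag: diagonal of the test kernel; y: n training targets."""
    n = len(y)
    A = K_train + noise_var * np.eye(n)                  # K_theta + sigma^2 I
    L = cho_factor(A, lower=True)                        # O(n^3) Cholesky factorisation
    alpha = cho_solve(L, y)                              # (K + sigma^2 I)^{-1} y
    mean = K_cross @ alpha                               # predictive mean f_bar_*
    v = cho_solve(L, K_cross.T)
    var = K_test_diag - np.sum(K_cross * v.T, axis=1)    # predictive variances
    log_marginal = (-0.5 * y @ alpha                     # model fit
                    - np.sum(np.log(np.diag(L[0])))      # complexity penalty: -0.5 log|A|
                    - 0.5 * n * np.log(2 * np.pi))       # normalising constant
    return mean, var, log_marginal
```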

SLIDE 5

Scalable Gaussian Processes

Structure Exploiting Approaches

Exploit existing structure in K to efficiently solve linear systems and log determinants.

◮ Examples: Kronecker structure, K = K₁ ⊗ K₂ ⊗ · · · ⊗ K_P, and Toeplitz structure, K_{i,j} = K_{i+1,j+1} (a fast Toeplitz MVM is sketched below).

◮ Extremely efficient and accurate, but require severe grid assumptions.
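For intuition (a sketch I am adding, not part of the slides), a symmetric Toeplitz covariance over a regular 1-D grid admits O(m log m) matrix-vector multiplies by embedding it in a circulant matrix and using the FFT:

```python
import numpy as np

def toeplitz_mvm(c, v):
    """Multiply a symmetric Toeplitz matrix (first column c, length m >= 2)
    by a vector v in O(m log m) via circulant embedding and the FFT."""
    m = len(c)
    c_circ = np.concatenate([c, c[-2:0:-1]])        # circulant first column, length 2m - 2
    v_pad = np.concatenate([v, np.zeros(m - 2)])    # zero-pad v to the same length
    prod = np.fft.irfft(np.fft.rfft(c_circ) * np.fft.rfft(v_pad), n=2 * m - 2)
    return prod[:m]                                  # top block recovers the Toeplitz product

# Example: the covariance of a stationary kernel on a regular grid is Toeplitz.
grid = np.linspace(0, 10, 500)
c = np.exp(-0.5 * (grid - grid[0]) ** 2)             # first column of the RBF covariance
v = np.random.randn(500)
# toeplitz_mvm(c, v) matches scipy.linalg.toeplitz(c) @ v up to rounding error.
```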

Inducing Point Approaches

Introduce m inducing points, U = {u_i}, i = 1, . . . , m, and approximate

  K_{X,X} ≈ K_{X,U} K_{U,U}⁻¹ K_{U,X}.

◮ Examples: SoR, DTC, FITC, the Big Data GP.
◮ General purpose, but requires m ≪ n for efficiency, which degrades accuracy and prohibits expressive kernel learning.

Can we create a new framework that combines the benefits of each approach?
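Before answering that, here is the rank-m reconstruction shared by these inducing point methods in code (my illustration, with assumed argument names):

```python
import numpy as np

def inducing_point_approx(K_XU, K_UU, jitter=1e-8):
    """Rank-m approximation K_XX ~= K_XU K_UU^{-1} K_UX underlying SoR/DTC/FITC.
    Forming it costs O(m^2 n + m^3), and its accuracy is limited by the rank m."""
    m = K_UU.shape[0]
    tmp = np.linalg.solve(K_UU + jitter * np.eye(m), K_XU.T)   # K_UU^{-1} K_UX
    return K_XU @ tmp
```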

SLIDE 6

A New Unifying Framework

◮ Recall the SoR approximation

  K_SoR(X, X) = K_{X,U} K_{U,U}⁻¹ K_{U,X},    (1)

  where K_SoR(X, X) is n × n, K_{X,U} is n × m, K_{U,U} is m × m, and K_{U,X} is m × n.

◮ Complexity is O(m²n + m³).
◮ It is tempting to place inducing points on a grid to create structure in K_{U,U}, but this only helps with the m³ term, not the more critical m²n term coming from K_{X,U}.
◮ Can we approximate K_{X,U} from K_{U,U}?

SLIDE 7

Kernel Interpolation

For example, if we want to approximate k(x, u), we could form

  k(x, u) ≈ w k(u_a, u) + (1 − w) k(u_b, u),    (2)

where u_a ≤ x ≤ u_b. More generally, we form

  K_{X,U} ≈ W K_{U,U},    (3)

where W is an n × m sparse matrix of interpolation weights. For local linear interpolation W has only c = 2 non-zero entries per row; for local cubic interpolation, c = 4. Substituting K_{X,U} ≈ W K_{U,U} into the inducing point approximation,

  K_{X,X} ≈ K_{X,U} K_{U,U}⁻¹ K_{U,X} ≈ W K_{U,U} K_{U,U}⁻¹ K_{U,U} Wᵀ = W K_{U,U} Wᵀ = K_SKI.
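A sketch of how the sparse interpolation matrix W might be assembled for local linear interpolation on a sorted 1-D grid (my illustration; the function name and the edge handling are assumptions):

```python
import numpy as np
from scipy.sparse import csr_matrix

def local_linear_weights(x, grid):
    """n x m sparse matrix W with c = 2 non-zeros per row, so that K_XU ~= W K_UU."""
    n, m = len(x), len(grid)
    idx = np.clip(np.searchsorted(grid, x) - 1, 0, m - 2)   # index of left neighbour u_a
    w = (grid[idx + 1] - x) / (grid[idx + 1] - grid[idx])    # weight on u_a (u_a <= x <= u_b)
    rows = np.repeat(np.arange(n), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([w, 1.0 - w], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(n, m))
```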

SLIDE 8

Kernel Interpolation

  K_SKI = W K_{U,U} Wᵀ,    (4)

  where W is n × m and K_{U,U} is m × m.

◮ MVMs with W cost O(n) computations and storage.
◮ Toeplitz K_{U,U}: MVMs cost O(m log m).
◮ Kronecker structure in K_{U,U}: MVMs cost O(P m^{1+1/P}).

Conclusions

◮ MVMs with K_SKI cost O(n) computations and storage!
◮ We can therefore solve K_SKI⁻¹ y for GP inference using linear conjugate gradients, which converge in j ≪ n iterations (see the sketch after this list).
◮ Even if the inputs X do not have any structure, we can naturally create structure in the latent variables U, which can be exploited for greatly accelerated inference and learning.
◮ We can use m ≫ n inducing points! (Accuracy and kernel learning.)
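A minimal sketch of the conjugate-gradients solve using only fast MVMs (my illustration; mvm_KUU stands for any fast K_{U,U} multiply, e.g. the Toeplitz routine sketched earlier):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def ski_solve(W, mvm_KUU, y, noise_var):
    """Solve (K_SKI + sigma^2 I) alpha = y by conjugate gradients,
    where K_SKI v = W (K_UU (W^T v)) is applied without ever forming K_SKI."""
    n = W.shape[0]

    def matvec(v):
        return W @ mvm_KUU(W.T @ v) + noise_var * v   # cheap structured MVM

    A = LinearOperator((n, n), matvec=matvec, dtype=np.float64)
    alpha, info = cg(A, y)                            # typically j << n iterations
    if info != 0:
        raise RuntimeError("conjugate gradients did not converge")
    return alpha
```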

SLIDE 9

New Unifying Framework

It turns out that all inducing point methods perform global GP interpolation on a user-specified kernel!

◮ The predictive mean of a noise-free, zero-mean GP (σ = 0, µ(x) ≡ 0) is linear in two ways: on the one hand as a w_X(x∗) = K_{X,X}⁻¹ K_{X,x∗} weighted sum of the observations y, and on the other hand as an α = K_{X,X}⁻¹ y weighted sum of the training-test cross-covariances K_{X,x∗}:

  f̄∗ = yᵀ w_X(x∗) = αᵀ K_{X,x∗}    (5)

  (checked numerically in the sketch below).

◮ If we perform a noise-free, zero-mean GP regression on the kernel itself, so that we have data D = {(u_i, k(u_i, x))}, i = 1, . . . , m, then we recover the inducing kernel k̃_SoR(x, z) = K_{x,U} K_{U,U}⁻¹ K_{U,z} as the predictive mean of the GP at test point x∗ = z!
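Equation (5) is easy to check numerically; a small sketch (my addition, with an arbitrary RBF kernel and toy data):

```python
import numpy as np

def k_rbf(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.linspace(0, 10, 20)                    # training inputs
y = np.sin(X)                                 # noise-free observations
x_star = np.array([3.4])                      # test input

K_XX = k_rbf(X, X) + 1e-10 * np.eye(len(X))   # jitter for stability
K_Xs = k_rbf(X, x_star)[:, 0]                 # K_{X, x*}

w = np.linalg.solve(K_XX, K_Xs)               # w_X(x*) = K_XX^{-1} K_{X,x*}
alpha = np.linalg.solve(K_XX, y)              # alpha   = K_XX^{-1} y

assert np.isclose(y @ w, alpha @ K_Xs)        # f_bar_* = y^T w_X(x*) = alpha^T K_{X,x*}
```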

SLIDE 10

Local versus Global Kernel Interpolation

[Two-panel figure: covariance plotted against x for (a) global and (b) local kernel interpolation; each panel shows the true k, k(U, u), and the approximation.]

Figure: Global vs. local kernel interpolation. Triangle markers denote the inducing points used for interpolating k(x, u) from k(U, u). Here u = 0, U = {0, 1, . . . , 10}, and x = 3.4. (a) All conventional inducing point methods, such as SoR or FITC, perform global GP regression on K_{U,u} (a vector of covariances between all inducing points U and the point u), at test point x∗ = x, to form an approximate k̃, e.g., k_SoR(x, u) = K_{x,U} K_{U,U}⁻¹ K_{U,u}, for any desired x and u. (b) SKI instead performs local kernel interpolation on K_{U,u} to form the approximation k_SKI(x, u) = w_xᵀ K_{U,u}.

SLIDE 11

Kernel Matrix Reconstruction

[Figure: kernel matrix reconstruction. (a) K_true. (b) K_SKI with m = 40 inducing points. (c) |K_true − K_SKI,40|. (d) Reconstruction error vs. number of inducing points m for different interpolation strategies: equi-linear, kmeans-linear, equi-GP, equi-cubic. (e) |K_true − K_SKI,150|. (f) |K_true − K_SoR,150|. (g) Error vs. runtime (s) for SKI (linear), SKI (cubic), SoR, and FITC.]

SLIDE 12

Kernel Learning

[Two-panel figure: correlation plotted against τ, comparing the true kernel with the kernels learned by SKI and FITC.]

Figure: Kernel Learning. A product of two kernels (shown in green) was used to sample 10,000 data points from a GP. From this data, we performed kernel learning using SKI (cubic) and FITC, with the results shown in blue and red, respectively. All kernels are a function of τ = x − x′ and are scaled by k(0).

SLIDE 13

Natural Sound Modelling

Figure: Natural Sound Modelling. (a) A natural sound signal: intensity vs. time (s). (b) Runtime (s) vs. number of inducing points m, comparing FITC and SKI (cubic). (c) Error (SMAE) vs. runtime (s).
