Fast Laplace Approximation for Gaussian Processes with a Tensor Product Kernel - PowerPoint PPT Presentation



Slide 1

Fast Laplace Approximation for Gaussian Processes with a Tensor Product Kernel

Perry Groot^a, Markus Peters^b, Tom Heskes^a, Wolfgang Ketter^b

^a Radboud University Nijmegen, ^b Erasmus University Rotterdam

Slide 2

Introduction

Gaussian process models:

+ rich, principled Bayesian framework
− scalability: $O(N^3)$ in the number of data points $N$

Approaches addressing scalability:

  • approximations / subset selection: $O(M^2 N)$
  • exploiting additional structure

Tensor product kernels:

+ efficient use of Kronecker products on grid data
− so far limited to standard regression, which relies on the closed form: if $K = Q \Lambda Q^T$ then $(K + \sigma^2 I)^{-1} y = Q (\Lambda + \sigma^2 I)^{-1} Q^T y$; no comparable closed form exists for non-Gaussian likelihoods (see the sketch below).
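A minimal NumPy sketch of this closed form (the function name and the sanity check are illustrative, not from the paper):

```python
import numpy as np

def gp_regression_solve(K, y, sigma2):
    """Solve (K + sigma^2 I)^{-1} y via the eigendecomposition K = Q L Q^T.

    Once the eigendecomposition is available (cheap for
    Kronecker-structured K), the noisy-regression solve is just a
    rescaling in the eigenbasis.
    """
    lam, Q = np.linalg.eigh(K)                 # K = Q diag(lam) Q^T
    return Q @ ((Q.T @ y) / (lam + sigma2))

# sanity check against a direct solve
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A @ A.T                                    # random PSD matrix
y = rng.standard_normal(5)
assert np.allclose(gp_regression_solve(K, y, 0.1),
                   np.linalg.solve(K + 0.1 * np.eye(5), y))
```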

Slide 3

Gaussian Processes

A Gaussian process (GP) is a collection of random variables with the property that the joint distribution of any finite subset is Gaussian. A GP specifies a probability distribution over functions, $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, and is fully specified by its mean function $m(x)$ and covariance (or kernel) function $k(x, x')$. Typically $m(x) = 0$, which gives $\{f(x_1), \ldots, f(x_N)\} \sim \mathcal{N}(0, K)$ with $K_{ij} = k(x_i, x_j)$.

Slide 4

Gaussian Processes - Covariance function

Squared exponential (or Gaussian) covariance function:

$$k(x, x') = \exp\left( -\frac{1}{2\theta^2} \sum_{n=1}^{N} (x_n - x'_n)^2 \right)$$

where $\theta$ is a length-scale parameter controlling how quickly the functions vary.

[Figure: prior samples drawn with length-scale 0.5 (left) and length-scale 2 (right).]
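A small sketch of this kernel and of drawing prior samples at the two length-scales shown in the figure (illustrative code, not from the paper):

```python
import numpy as np

def sq_exp_kernel(X1, X2, theta):
    """k(x, x') = exp(-sum_n (x_n - x'_n)^2 / (2 theta^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / theta ** 2)

# draw prior samples on [0, 10] for two length-scales
x = np.linspace(0, 10, 200)[:, None]
rng = np.random.default_rng(1)
for theta in (0.5, 2.0):
    K = sq_exp_kernel(x, x, theta) + 1e-8 * np.eye(len(x))  # jitter for PSD
    f = rng.multivariate_normal(np.zeros(len(x)), K)        # one sample path
```

Smaller $\theta$ gives wigglier samples; larger $\theta$ gives smoother ones.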

Slide 5

Gaussian Processes - 1D demo

[Figure: 1D GP demo, four panels.]

Slide 6

Laplace Approximation

For Gaussian process models with non-Gaussian likelihoods we need approximations. Laplace: approximate the true posterior $p(f \mid X, y)$ with a Gaussian $q(f)$ centered on the mode of the posterior:

$$q(f) = \mathcal{N}\big(f \mid \hat{f}, (K^{-1} + W)^{-1}\big) \qquad (1)$$

with $\hat{f} = \arg\max_f p(f \mid X, y, \theta)$ and $W = -\nabla\nabla_f \log p(y \mid f)\big|_{f = \hat{f}}$.
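For concreteness, with a logistic (classification) likelihood $W$ is diagonal with entries $\pi_i (1 - \pi_i)$. A sketch of the gradient and Hessian terms, assuming labels $y \in \{-1, +1\}$ (a standard choice used here for illustration; the slide does not fix the likelihood):

```python
import numpy as np

def logistic_grad_hessian(y, f):
    """Gradient and negative Hessian of log p(y|f) for the logit link.

    y in {-1, +1}; p(y=1|f) = sigmoid(f). W = -grad grad log p(y|f)
    is diagonal with entries pi * (1 - pi), stored as a vector.
    """
    pi = 1.0 / (1.0 + np.exp(-f))
    grad = (y + 1) / 2 - pi          # d/df log p(y|f)
    W = pi * (1 - pi)
    return grad, W
```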

Slide 7

Kronecker product

Assumptions:

  • tensor product kernel function: $k(x_i, x_j) = \prod_{d=1}^{D} k_d(x_i^d, x_j^d)$
  • data on a multi-dimensional grid

This results in a kernel matrix that decomposes into a Kronecker product of matrices of lower dimension:

$$K = K_1 \otimes \cdots \otimes K_D, \qquad A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}$$
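A sketch of how grid data plus a tensor product kernel yields the Kronecker factors, with 1-D squared exponential factors assumed for illustration:

```python
import numpy as np
from functools import reduce

def kernel_factors(grids, thetas):
    """One small N_d x N_d squared-exponential matrix per grid axis."""
    def k1d(x, theta):
        return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / theta ** 2)
    return [k1d(g, t) for g, t in zip(grids, thetas)]

grids = [np.linspace(0, 1, 30), np.linspace(0, 1, 40)]  # 30 x 40 grid, N = 1200
Ks = kernel_factors(grids, thetas=[0.2, 0.3])
K_full = reduce(np.kron, Ks)   # only for checking; never needed explicitly
```

The point of the next slide is that one can work with the small factors `Ks` and never form `K_full`.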

Slide 8

Kronecker product

The Kronecker product has a convenient algebra:

$$(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^T), \qquad AB \otimes CD = (A \otimes C)(B \otimes D), \qquad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$

| Operation             | Standard | Kronecker               |
|-----------------------|----------|-------------------------|
| Storage               | $O(N^2)$ | $O(\sum_d N_d^2)$       |
| Matrix-vector product | $O(N^2)$ | $O(N \sum_d N_d)$       |
| Cholesky / SVD        | $O(N^3)$ | $O(\sum_d N_d^3)$       |

with $N = \prod_d N_d$.
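The vec identity above gives the fast matrix-vector product. A minimal sketch of the standard factor-by-factor recursion (illustrative, not the paper's code):

```python
import numpy as np

def kron_matvec(Ks, b):
    """Compute (K_1 kron ... kron K_D) b without forming the product.

    Applies (A kron B) vec(X) = vec(B X A^T) one factor at a time,
    costing O(N * sum_d N_d) instead of O(N^2), with N = prod_d N_d.
    """
    x = b
    for Kd in Ks:                        # K_1, ..., K_D in order
        Nd = Kd.shape[0]
        X = x.reshape(Nd, -1)            # expose the current dimension
        x = (Kd @ X).T.reshape(-1)       # apply K_d, then cycle the axes
    return x

# sanity check against the dense product
rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
b = rng.standard_normal(12)
assert np.allclose(kron_matvec([A, B], b), np.kron(A, B) @ b)
```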

Slide 9

MAP estimation

Let $b = W f + \nabla \log p(y \mid f)$. We need

$$f^{\text{new}} = (K^{-1} + W)^{-1} b = K\big(b - W^{1/2} v\big), \qquad v = (I + W^{1/2} K W^{1/2})^{-1} W^{1/2} K b$$

Repeat until convergence (sketched in code below):

  • iteratively solve $(I + W^{1/2} K W^{1/2})\, v = W^{1/2} K b$
  • $a = b - W^{1/2} v$
  • $f = K a$
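A dense sketch of this Newton iteration, assuming a `grad_hess` callback such as the logistic one above. The talk's point is that the inner linear system can be solved iteratively with fast Kronecker matrix-vector products; this illustration substitutes a direct solve:

```python
import numpy as np

def laplace_map(K, y, grad_hess, n_iter=20):
    """Find the posterior mode f_hat by Newton iteration (dense sketch).

    grad_hess(y, f) returns (grad log p(y|f), W) with W as a vector.
    """
    N = K.shape[0]
    f = np.zeros(N)
    for _ in range(n_iter):                    # fixed count for brevity;
        g, W = grad_hess(y, f)                 # real code checks convergence
        sW = np.sqrt(W)
        b = W * f + g
        # solve (I + W^(1/2) K W^(1/2)) v = W^(1/2) K b
        M = np.eye(N) + sW[:, None] * K * sW[None, :]
        v = np.linalg.solve(M, sW * (K @ b))
        a = b - sW * v
        f = K @ a
    return f
```

Usage, continuing the earlier sketches: `f_hat = laplace_map(K, y, logistic_grad_hessian)`.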

Slide 10

Marginal Likelihood

To learn the hyperparameters $\theta$, minimize the negative (approximate) marginal likelihood

$$-\log p(y \mid X, \theta) \approx \frac{1}{2} \hat{f}^T K^{-1} \hat{f} - \log p(y \mid \hat{f}) + \frac{1}{2} \log |B|$$

with $B = I + K W$.
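A dense sketch of evaluating this objective at the mode; the reduced-rank treatment of the $\log|B|$ term follows on the next slides (here it is computed directly, using $|I + KW| = |I + W^{1/2} K W^{1/2}|$):

```python
import numpy as np

def laplace_nlml(K, f_hat, log_lik, W):
    """Approximate negative log marginal likelihood at the mode f_hat.

    log_lik is the scalar log p(y | f_hat); W is the diagonal of W.
    """
    N = len(f_hat)
    sW = np.sqrt(W)
    B = np.eye(N) + sW[:, None] * K * sW[None, :]
    _, logdet_B = np.linalg.slogdet(B)
    return 0.5 * f_hat @ np.linalg.solve(K, f_hat) - log_lik + 0.5 * logdet_B
```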

Slide 11

Reduced-rank approximation

We use a reduced-rank approximation:

$$K \approx Q S Q^T + \Lambda_1$$

with

  • $Q = \bigotimes_{d=1}^{D} Q_d$, an $N \times R$ matrix
  • $S = \bigotimes_{d=1}^{D} S_d$, an $R \times R$ matrix
  • $\Lambda_1 = \mathrm{diag}\big(\mathrm{diag}(K) - \mathrm{diag}(Q S Q^T)\big)$
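One natural construction of these factors, assuming each $Q_d$, $S_d$ comes from a truncated eigendecomposition of the corresponding $K_d$ (the slide does not spell this out, so treat this as an assumption):

```python
import numpy as np
from functools import reduce

def reduced_rank_factors(Ks, ranks):
    """Per-dimension truncated eigendecompositions K_d ~ Q_d S_d Q_d^T."""
    Qs, Ss = [], []
    for Kd, r in zip(Ks, ranks):
        lam, V = np.linalg.eigh(Kd)
        idx = np.argsort(lam)[::-1][:r]       # keep the r largest eigenvalues
        Qs.append(V[:, idx])
        Ss.append(np.diag(lam[idx]))
    Q = reduce(np.kron, Qs)                   # N x R
    S = reduce(np.kron, Ss)                   # R x R
    # Lambda_1 corrects the diagonal so diag(K) is matched exactly
    diag_K = reduce(np.kron, [np.diag(Kd) for Kd in Ks])
    lam1 = diag_K - np.einsum('ij,jk,ik->i', Q, S, Q)
    return Q, S, lam1
```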

Slide 12

Evaluating Marginal Likelihood

$$\begin{aligned}
|B| &= |I + W^{1/2} K W^{1/2}| \\
&\approx |I + W^{1/2} \Lambda_1 W^{1/2} + W^{1/2} Q S Q^T W^{1/2}| \\
&= |\Lambda_2 + W^{1/2} Q S Q^T W^{1/2}| \\
&= |\Lambda_2|\,|S|\,|S^{-1} + Q^T W^{1/2} \Lambda_2^{-1} W^{1/2} Q| \\
&= |\Lambda_2|\,|S|\,|S^{-1} + Q^T \Lambda_3 Q|
\end{aligned}$$

with the diagonal matrices $\Lambda_2 = I + W^{1/2} \Lambda_1 W^{1/2}$ and $\Lambda_3 = W^{1/2} \Lambda_2^{-1} W^{1/2}$, so only an $R \times R$ determinant remains.
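A sketch of this computation: $\Lambda_2$ and $\Lambda_3$ are diagonal (stored as vectors), so only the $R \times R$ determinant costs anything (names illustrative):

```python
import numpy as np

def approx_logdet_B(W, lam1, Q, S):
    """log|B| ~ log|L2| + log|S| + log|S^{-1} + Q^T L3 Q|.

    W and lam1 are the diagonals of W and Lambda_1;
    Lambda_2 = I + W^(1/2) Lambda_1 W^(1/2) and
    Lambda_3 = W^(1/2) Lambda_2^{-1} W^(1/2) are diagonal.
    """
    lam2 = 1.0 + W * lam1                          # diagonal of Lambda_2
    lam3 = W / lam2                                # diagonal of Lambda_3
    M = np.linalg.inv(S) + Q.T @ (lam3[:, None] * Q)   # R x R
    _, logdet_M = np.linalg.slogdet(M)
    _, logdet_S = np.linalg.slogdet(S)
    return np.sum(np.log(lam2)) + logdet_S + logdet_M
```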

Slide 13

Gradients

We need the gradients with respect to $\theta$ of

$$-\log p(y \mid X, \theta) \approx \frac{1}{2} \hat{f}^T K^{-1} \hat{f} - \log p(y \mid \hat{f}) + \frac{1}{2} \log |B|$$

which are given by

$$\frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j} = \left. \frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j} \right|_{\text{explicit}} + \sum_{i=1}^{N} \frac{\partial \log q(y \mid X, \theta)}{\partial \hat{f}_i} \, \frac{\partial \hat{f}_i}{\partial \theta_j}$$

(the explicit term holds $\hat{f}$ fixed; the sum accounts for the dependence of $\hat{f}$ on $\theta$).

Slide 14

Explicit Derivatives – Kernel Hyperparameters

Given by

$$\left. \frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j^c} \right|_{\text{explicit}} = \frac{1}{2} \hat{f}^T K^{-1} \frac{\partial K}{\partial \theta_j^c} K^{-1} \hat{f} - \frac{1}{2} \operatorname{tr}\!\left( (W^{-1} + K)^{-1} \frac{\partial K}{\partial \theta_j^c} \right).$$
Slide 15

Explicit Derivatives – Kernel Hyperparameters

$$\begin{aligned}
(W^{-1} + K)^{-1} &= W^{1/2} (I + W^{1/2} K W^{1/2})^{-1} W^{1/2} \\
&\approx W^{1/2} (I + W^{1/2} \Lambda_1 W^{1/2} + W^{1/2} Q S Q^T W^{1/2})^{-1} W^{1/2} \\
&= W^{1/2} (\Lambda_2 + W^{1/2} Q S Q^T W^{1/2})^{-1} W^{1/2} \\
&= W^{1/2} \big( \Lambda_2^{-1} - \Lambda_2^{-1} W^{1/2} Q (S^{-1} + Q^T \Lambda_3 Q)^{-1} Q^T W^{1/2} \Lambda_2^{-1} \big) W^{1/2} \\
&= \Lambda_3 - \Lambda_3 Q (S^{-1} + Q^T \Lambda_3 Q)^{-1} Q^T \Lambda_3
\end{aligned}$$

Slide 16

Explicit Derivatives – Kernel Hyperparameters

$$\begin{aligned}
\operatorname{tr}\!\left( (W^{-1} + K)^{-1} \frac{\partial K}{\partial \theta_j^c} \right) &= \operatorname{tr}\!\left( \Lambda_3 \frac{\partial K}{\partial \theta_j^c} \right) - \operatorname{tr}\!\left( \Lambda_3 Q (S^{-1} + Q^T \Lambda_3 Q)^{-1} Q^T \Lambda_3 \frac{\partial K}{\partial \theta_j^c} \right) \\
&= \operatorname{tr}\!\left( \Lambda_3 \frac{\partial K}{\partial \theta_j^c} \right) - \operatorname{tr}\!\left( V_1^T \frac{\partial K}{\partial \theta_j^c} V_1 \right)
\end{aligned}$$

with $V_1 = \Lambda_3 Q \operatorname{chol}\!\big( (S^{-1} + Q^T \Lambda_3 Q)^{-1} \big)$.
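A sketch of evaluating the two trace terms via $V_1$, with $\partial K / \partial \theta$ passed dense for clarity (in practice it inherits the Kronecker structure):

```python
import numpy as np

def explicit_trace_term(lam3, Q, S, dK):
    """tr((W^{-1} + K)^{-1} dK) using the low-rank form of slide 15.

    lam3 is the diagonal of Lambda_3; dK stands for dK/dtheta.
    V1 V1^T = Lambda_3 Q M^{-1} Q^T Lambda_3 with M = S^{-1} + Q^T L3 Q.
    """
    M = np.linalg.inv(S) + Q.T @ (lam3[:, None] * Q)   # R x R
    L = np.linalg.cholesky(np.linalg.inv(M))           # chol(M^{-1})
    V1 = lam3[:, None] * (Q @ L)                       # N x R
    t1 = lam3 @ np.diag(dK)                            # tr(Lambda_3 dK)
    t2 = np.sum(V1 * (dK @ V1))                        # tr(V1^T dK V1)
    return t1 - t2
```

The second trace uses $\operatorname{tr}(A^T B) = \sum_{ij} A_{ij} B_{ij}$, one of the shortcuts on the next slide.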

Slide 17

Remaining Gradients

Similar arguments apply to the remaining gradients.

Linear algebra shortcuts:

  • $\operatorname{tr}(ABC) = \operatorname{tr}(BCA)$
  • $\operatorname{tr}(A B^T) = \mathrm{SUMROWS}(A \circ B)$, i.e. the sum of the entries of the Hadamard product
  • $\operatorname{diag}(A B C^T) = (A \circ C)\,\operatorname{diag}(B)$ when $B$ is diagonal
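A quick NumPy check of these identities (reading $\mathrm{SUMROWS}(A \circ B)$ as the sum of all entries of the Hadamard product):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
B = np.diag(rng.standard_normal(5))       # B diagonal
C = rng.standard_normal((5, 5))

# tr(ABC) = tr(BCA): cyclic property of the trace
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
# tr(A C^T) = sum of all entries of A o C
assert np.isclose(np.trace(A @ C.T), np.sum(A * C))
# diag(A B C^T) = (A o C) diag(B) when B is diagonal
assert np.allclose(np.diag(A @ B @ C.T), (A * C) @ np.diag(B))
```

These turn $O(N^3)$ products into $O(N^2)$ (or $O(N)$) operations inside the gradient computations.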

Slide 18

Experiment

Artificial classification data generated on $X = [0, 1]^2$ with various grid sizes.

[Figure: runtime in seconds as a function of N for the standard and Kronecker implementations.]