SLIDE 1

Scalable Gaussian processes with a twist of Probabilistic Numerics

Kurt Cutajar

EURECOM, Sophia Antipolis, France
Data Science Meetup, October 30th 2017

SLIDE 2

Agenda

  • Kernel Methods
  • Scalable Gaussian Processes (using Preconditioning)
  • Probabilistic Numerics

SLIDE 3

Kernel Methods

  • Operate in a high-dimensional, implicit feature space
  • Rely on the construction of an n × n Gram matrix K
  • E.g. RBF: k(x_i, x_j) = σ² exp(−d²/2), where d² = (x_i − x_j)⊤ Λ (x_i − x_j)
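
As a concrete illustration (not from the slides), here is a minimal numpy sketch of building the n × n Gram matrix for this kernel, assuming Λ is diagonal with entries 1/lengthscale² (the ARD case); rbf_gram and its argument names are hypothetical:

```python
import numpy as np

def rbf_gram(X, lengthscales, sigma2=1.0):
    """n x n Gram matrix for k(x_i, x_j) = sigma2 * exp(-d2 / 2),
    with d2 = (x_i - x_j)^T Lambda (x_i - x_j), Lambda = diag(1 / lengthscales^2)."""
    Z = X / lengthscales                               # rescale each input dimension
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T     # pairwise squared distances
    return sigma2 * np.exp(-0.5 * np.maximum(d2, 0.0))

X = np.random.randn(200, 3)
K = rbf_gram(X, lengthscales=np.array([1.0, 2.0, 0.5]))   # 200 x 200 Gram matrix
```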

SLIDE 4

Kernel Methods

  • Wide variety of kernel functions available

Taken from David Duvenaud’s PhD Thesis

SLIDE 5

Kernel Methods

  • Choice is not always straightforward!

Taken from David Duvenaud’s PhD Thesis

SLIDE 6

All About that Bass Bayes

posterior = (likelihood × prior) / marginal likelihood

p(par | X, y) = p(y | X, par) p(par) / p(y | X)

SLIDE 7

All About that Bass Bayes - Making Predictions

  • We average over all possible parameter values, weighted by their posterior probability:

p(y∗ | x∗, X, y) = ∫ p(y∗ | x∗, par) p(par | X, y) dpar = N(E[y∗], V[y∗])

SLIDE 8

Gaussian Processes

SLIDE 9

Gaussian Processes - Prior Distribution over Functions

[Figure: samples drawn from the GP prior with covariance K∞; axes: input vs label]

SLIDE 10

Gaussian Processes - Conditioned on Observations

[Figure: GP prior samples with covariance K∞, conditioned on the observed data; axes: input vs label]

SLIDE 11

Gaussian Processes - Posterior Distribution over Functions

[Figure: GP posterior distribution over functions with covariance Ky; axes: input vs label]
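
For GP regression with Gaussian noise, this posterior is available in closed form at a test input: E[y∗] = k∗⊤ Ky⁻¹ y and V[y∗] = k(x∗, x∗) + λ − k∗⊤ Ky⁻¹ k∗. A minimal numpy sketch (illustrative names, not the talk's code):

```python
import numpy as np

def gp_predict(K, k_star, k_star_star, y, noise_var):
    """Posterior predictive mean and variance at one test point, using the
    standard GP regression equations with K_y = K + noise_var * I."""
    Ky = K + noise_var * np.eye(K.shape[0])
    L = np.linalg.cholesky(Ky)                           # O(n^3) factorisation
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_y^{-1} y
    v = np.linalg.solve(L, k_star)
    mean = k_star @ alpha
    var = k_star_star + noise_var - v @ v                # predictive variance of y*
    return mean, var
```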

SLIDE 12

Gaussian Processes

[Figure, three panels: "GP prior" (covariance K∞), "GP regression example" (covariance K∞ with observations), and "Inference result" (posterior covariance Ky); axes: input vs label]

SLIDE 13

Bayesian Learning vs Deep Learning

  • Deep Learning
    + Scalable to very large datasets
    + Increased model flexibility/capacity
    − Frequentist approaches make only point estimates
    − Less robust to overfitting
  • Bayesian Learning
    + Incorporates uncertainty in predictions
    + Works well with smaller datasets
    − Lack of conjugacy necessitates approximation
    − Expensive computational and storage requirements

SLIDE 14

Bayesian Learning vs Deep Learning - Deep Gaussian Processes

  • Deep probabilistic models
  • Composition of functions

f(x) = ( h^(Nh−1)(θ^(Nh−1)) ◦ . . . ◦ h^(0)(θ^(0)) )(x)

[Figure: a two-layer example showing h^(0)(x), h^(1)(x), and the composition h^(1)(h^(0)(x))]
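
A toy sketch of the composition itself (nothing GP-specific; names are illustrative): each layer is just a function, and the deep model is their composition, mirroring h^(1)(h^(0)(x)) above:

```python
import numpy as np
from functools import reduce

def compose(*layers):
    """compose(h1, h0)(x) applies h0 first, then h1, i.e. h1(h0(x));
    in a deep GP each layer would itself be a function drawn from a GP."""
    return lambda x: reduce(lambda value, h: h(value), reversed(layers), x)

h0 = lambda x: np.sin(3.0 * x)     # first layer, parameters theta(0) folded in
h1 = lambda v: v ** 2              # second layer, parameters theta(1) folded in
f = compose(h1, h0)                # f(x) = h1(h0(x))
```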

SLIDE 15

Bayesian Learning vs Deep Learning - Deep Gaussian Processes

  • Inference requires calculating the marginal likelihood:

p(Y | X, θ) = ∫ p(Y | F^(Nh), θ^(Nh)) × p(F^(Nh) | F^(Nh−1), θ^(Nh−1)) × . . . × p(F^(1) | X, θ^(0)) dF^(Nh) . . . dF^(1)

  • Very challenging!

SLIDE 16

Bayesian Learning vs Deep Learning - Deep Gaussian Processes

[Figure: graphical model relating X, F^(1), F^(2), Y with random-feature quantities Φ^(0), Φ^(1), Ω^(0), Ω^(1), W^(0), W^(1) and kernel parameters θ^(0), θ^(1)]

Cutajar et al., Random Feature Expansions for Deep Gaussian Processes, ICML 2017
Yarin Gal, Bayesian Deep Learning, PhD Thesis

SLIDE 17

Scalable Gaussian Processes

SLIDE 18

Gaussian Processes

  • Marginal likelihood

log p(y | par) = −(1/2) log |Ky| − (1/2) y⊤ Ky⁻¹ y + const.

  • Derivatives wrt par

∂ log p(y | par) / ∂par_i = −(1/2) Tr( Ky⁻¹ ∂Ky/∂par_i ) + (1/2) y⊤ Ky⁻¹ (∂Ky/∂par_i) Ky⁻¹ y
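
These two expressions are exactly what the O(n³) Cholesky baseline computes; a minimal numpy sketch for a single kernel parameter (illustrative names, assuming ∂Ky/∂par_i is available as a dense matrix dK):

```python
import numpy as np

def log_marglik_and_grad(K, dK, y, noise_var):
    """Exact log marginal likelihood and its derivative w.r.t. one kernel
    parameter (dK = dKy/dpar_i), via a Cholesky factorisation of K_y."""
    n = y.shape[0]
    Ky = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))          # K_y^{-1} y
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * logdet - 0.5 * y @ alpha - 0.5 * n * np.log(2 * np.pi)
    Ky_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n))) # dense K_y^{-1}
    grad = -0.5 * np.trace(Ky_inv @ dK) + 0.5 * alpha @ dK @ alpha
    return lml, grad
```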

SLIDE 19

Gaussian Processes - Stochastic Trace Estimation

Taken from Shakir Mohamed’s Machine Learning Blog

SLIDE 20

Gaussian Processes - Stochastic Gradients

  • Stochastic estimate of the trace - assuming E[r r⊤] = I, then

Tr( Ky⁻¹ ∂Ky/∂par_i ) = Tr( Ky⁻¹ ∂Ky/∂par_i E[r r⊤] ) = E[ r⊤ Ky⁻¹ ∂Ky/∂par_i r ]

  • Stochastic gradient

−(1/(2N_r)) Σ_{i=1}^{N_r} r^(i)⊤ Ky⁻¹ ∂Ky/∂par_i r^(i) + (1/2) y⊤ Ky⁻¹ ∂Ky/∂par_i Ky⁻¹ y

Linear systems only!
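
A sketch of the resulting estimator (illustrative names): the trace is replaced by an average over probe vectors with E[r r⊤] = I, and every multiplication by Ky⁻¹ is delegated to a linear-system solver such as the (preconditioned) CG discussed next:

```python
import numpy as np

def stochastic_grad(solve_Ky, dK, y, num_probes=10, rng=None):
    """Unbiased stochastic estimate of the gradient above. solve_Ky(v) must
    return K_y^{-1} v (e.g. via CG/PCG); probes r are Rademacher vectors,
    so E[r r^T] = I as required."""
    rng = np.random.default_rng() if rng is None else rng
    n = y.shape[0]
    trace_est = 0.0
    for _ in range(num_probes):
        r = rng.choice([-1.0, 1.0], size=n)
        trace_est += r @ solve_Ky(dK @ r)      # r^T K_y^{-1} (dK/dpar_i) r
    trace_est /= num_probes
    alpha = solve_Ky(y)                        # K_y^{-1} y
    return -0.5 * trace_est + 0.5 * alpha @ dK @ alpha
```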

SLIDE 22

Solving Linear Systems

  • Involve the solution of linear systems Kz = v
  • Cholesky Decomposition
    • K must be stored in memory!
    • O(n²) space and O(n³) time - unfeasible for large n
  • Conjugate Gradient
    • Numerical solution of linear systems
    • O(tn²) for t CG iterations - in theory t = n (possibly worse!)

SLIDE 23

Solving Linear Systems

  • Involve the solution of linear systems Kz = v
  • Cholesky Decomposition
    • K must be stored in memory!
    • O(n²) space and O(n³) time - unfeasible for large n
  • Conjugate Gradient
    • Numerical solution of linear systems
    • O(tn²) for t CG iterations - in theory t = n (possibly worse!)

[Figure: CG iterates converging from the starting point z0 to the solution z]

SLIDE 24

Solving Linear Systems

  • Preconditioned Conjugate Gradient (henceforth PCG)
    • Transforms linear system to be better conditioned, improving convergence
    • Yields a new linear system of the form P⁻¹Kz = P⁻¹v
    • O(tn²) for t PCG iterations - in practice t ≪ n

[Figure: convergence paths from z0 to z under CG (left) and PCG (right)]
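
A minimal, self-contained PCG sketch (not the talk's implementation); apply_Pinv applies P⁻¹ to a vector, and with apply_Pinv = lambda v: v it reduces to plain CG:

```python
import numpy as np

def pcg(A, b, apply_Pinv, tol=1e-8, max_iter=None):
    """Preconditioned conjugate gradient for the SPD system A z = b.
    Each iteration needs one matrix-vector product with A and one
    application of P^{-1}; a good P means far fewer than n iterations."""
    n = b.shape[0]
    max_iter = n if max_iter is None else max_iter
    z = np.zeros(n)
    r = b - A @ z
    s = apply_Pinv(r)
    p = s.copy()
    rs = r @ s
    for _ in range(max_iter):
        Ap = A @ p
        step = rs / (p @ Ap)
        z += step * p
        r -= step * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        s = apply_Pinv(r)
        rs_new = r @ s
        p = s + (rs_new / rs) * p
        rs = rs_new
    return z
```

For example, apply_Pinv = lambda v: v / np.diag(Ky) corresponds to a simple diagonal (Jacobi) preconditioner.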

SLIDE 25

Preconditioning Approaches

  • Suppose we want to precondition Ky = K + λI
  • Our choice of preconditioner, P, should:
    • Approximate Ky as closely as possible
    • Be easy to invert
  • For low-rank preconditioners we employ the Woodbury inversion lemma:

[Figure: Ky, P, and P⁻¹ shown as matrix images]

  • For other preconditioners we solve inner linear systems, once again using CG!
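
For instance, for the Nyström-style preconditioner P = K_XU K_UU⁻¹ K_UX + λI listed on the next slide, the Woodbury identity gives P⁻¹ = (1/λ) I − (1/λ²) K_XU (K_UU + (1/λ) K_UX K_XU)⁻¹ K_UX, so only an m × m matrix ever needs factorising. A hedged numpy sketch (illustrative names), directly usable as apply_Pinv in the pcg sketch above:

```python
import numpy as np

def nystrom_Pinv(K_XU, K_UU, lam):
    """Return a function applying P^{-1} for P = K_XU K_UU^{-1} K_UX + lam*I,
    via the Woodbury identity. Setup costs O(n m^2 + m^3); each
    application then costs only O(n m)."""
    inner = K_UU + (K_XU.T @ K_XU) / lam        # m x m matrix
    L = np.linalg.cholesky(inner)               # assumes K_UU is positive definite
    def apply_Pinv(v):
        w = K_XU.T @ v
        w = np.linalg.solve(L.T, np.linalg.solve(L, w))
        return v / lam - (K_XU @ w) / lam ** 2
    return apply_Pinv
```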

SLIDE 28

Preconditioning Approaches

  • Nyström: P = K_XU K_UU⁻¹ K_UX + λI, where U ⊂ X
  • FITC: P = K_XU K_UU⁻¹ K_UX + diag(K − K_XU K_UU⁻¹ K_UX) + λI
  • PITC: P = K_XU K_UU⁻¹ K_UX + bldiag(K − K_XU K_UU⁻¹ K_UX) + λI
  • Spectral: P_ij = (σ²/m) Σ_{r=1}^{m} cos[ 2π s_r⊤ (x_i − x_j) ] + λI_ij
  • Partial SVD: K = AΛA⊤ ⇒ P = A_[·,1:m] Λ_[1:m,1:m] A⊤_[1:m,·] + λI
  • Block Jacobi: P = bldiag(K) + λI
  • SKI: P = W K_UU W⊤ + λI, where K_UU is Kronecker
  • Regularization: P = K + λI + δI
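
As another concrete entry from this list, the Block Jacobi preconditioner only ever factorises the diagonal blocks of Ky, which is cheap and embarrassingly parallel; a hedged sketch (illustrative names), again usable as apply_Pinv in the pcg sketch above:

```python
import numpy as np

def block_jacobi_Pinv(Ky, block_size):
    """apply-P^{-1} for P = bldiag(K) + lam*I, i.e. the diagonal blocks of
    K_y = K + lam*I; each SPD block is factorised independently."""
    n = Ky.shape[0]
    blocks = []
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        blocks.append((start, end, np.linalg.cholesky(Ky[start:end, start:end])))
    def apply_Pinv(v):
        out = np.empty_like(v)
        for start, end, L in blocks:
            out[start:end] = np.linalg.solve(L.T, np.linalg.solve(L, v[start:end]))
        return out
    return apply_Pinv
```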

SLIDE 29

Comparison of Preconditioners vs CG

SLIDE 30

Experimental Setup - GP Kernel Parameter Optimization

  • Exact gradient-based optimization using Cholesky decomposition (CHOL)
  • Stochastic gradient-based optimization
    • Linear systems solved with CG and PCG
  • GP Approximations
    • Variational learning of inducing variables (VAR)
    • Fully Independent Training Conditional (FITC)
    • Partially Independent Training Conditional (PITC)

SLIDE 31

Results - ARD Kernel

[Figure: for each dataset, error rate or RMSE and negative test log-likelihood plotted against log10(seconds) for PCG, CG, CHOL, FITC, PITC and VAR]

  • Classification: Spam (n = 4061, d = 57), EEG (n = 14979, d = 14)
  • Regression: Power plant (n = 9568, d = 4), Protein (n = 45730, d = 9)

SLIDE 32

Follow-up Work

  • Faster Kernel Ridge Regression Using Sketching and Preconditioning
    Avron et al. (2017)
  • FALKON: An Optimal Large Scale Kernel Method
    Rosasco et al. (2017)
  • Large Linear Multi-output Gaussian Process Learning for Time Series
    Feinberg et al. (2017)
  • Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes
    Kim et al. (2017)

SLIDE 33

Follow-up work

... but what’s left to do now?

SLIDE 35

Probabilistic Numerics

SLIDE 36

Probabilistic Numerics - Mission Statement

We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations. Such uncertainties, arising from the loss of precision induced by numerical calculation with limited time or hardware, are important for much contemporary science and industry.

Hennig et al., 2015

SLIDE 37

Probabilistic Numerics - Applications

  • Quadrature
  • Linear Algebra
  • Optimization
  • Ordinary Differential Equations
  • Partial Differential Equations
  • Monte Carlo Sampling

SLIDE 39

Bayesian Inference of Log Determinants

  • Standard computation of log |A| is O(n³)
  • Existing approximations do not give uncertainty estimates

Poster 70

Infer Eigenspectrum / Distribution of Geometric Mean

  • We use a Gaussian Process with Bayesian Quadrature
  • Truncate posterior using well-known bounds

[Figure: eigenvalue density p(λ), the transformed density p(log(λ)), and the expectation E[log(λ)] = ∫ p(λ) log(λ) dλ]
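
The target quantity is the log-determinant written as an expectation over the eigenspectrum: log|A| = Σᵢ log λᵢ = n ∫ p(λ) log(λ) dλ for the empirical eigenvalue density p(λ). A tiny numpy check of that identity against the O(n³) baselines (illustrative only, not the poster's method):

```python
import numpy as np

# log|A| = sum of log-eigenvalues = n * average of log(lambda) under the
# empirical eigenvalue distribution; both references below cost O(n^3).
n = 500
B = np.random.randn(n, 2 * n)
A = B @ B.T / (2 * n) + np.eye(n)                 # a well-conditioned SPD matrix
lams = np.linalg.eigvalsh(A)
logdet_eig = np.sum(np.log(lams))                 # n * E[log(lambda)]
logdet_chol = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(A))))
assert np.isclose(logdet_eig, logdet_chol)
```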

SLIDE 40

Bayesian Inference of Log Determinants

  • Methods compared on a variety of UFL Sparse Datasets. For each dataset, the matrix was approximately raised up to the power of 5, 10, 15, 20, 25 and 30 (left to right) using stochastic trace estimation.

SLIDE 41

Bayesian Inference of Log Determinants - Application to DPPs

log p(Z) = log( |L_J| / |L + I| )   ∀ J ⊂ X

[Figure: NLL / P(l = lmax) against lengthscale, from 10⁻³ to 10⁰]
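
Evaluating this likelihood requires the log-determinant of the (potentially huge) matrix L + I, which is exactly where the estimator above comes in. A small numpy sketch of the exact computation (illustrative names):

```python
import numpy as np

def dpp_log_prob(L, J):
    """log p(Z = J) = log|L_J| - log|L + I| for a DPP with kernel matrix L;
    J is a list/array of the indices of the observed subset."""
    _, logdet_J = np.linalg.slogdet(L[np.ix_(J, J)])
    _, logdet_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_J - logdet_norm
```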

SLIDE 42

Outlook

Can we include this computational uncertainty within the full pipeline of GP inference?

  • Better calibrated uncertainties
  • Performance tunable to computational budget
  • Applications to Bayesian optimisation

SLIDE 44

Other Work

  • Preconditioning Kernel Matrices (ICML 2016)
    Kurt Cutajar, John Cunningham, Michael Osborne, Maurizio Filippone
  • Bayesian Inference of Log Determinants (UAI 2017)
    Jack Fitzsimons, Kurt Cutajar, Michael Osborne, Stephen Roberts, Maurizio Filippone
  • Random Feature Expansions for Deep Gaussian Processes (ICML 2017)
    Kurt Cutajar, Edwin V. Bonilla, Pietro Michiardi, Maurizio Filippone
  • AutoGP: Exploring the capabilities and limitations of Gaussian processes (UAI 2017)
    Karl Krauth, Edwin V. Bonilla, Kurt Cutajar, Maurizio Filippone
  • Entropic Trace Estimates for Log Determinants (ECML/PKDD 2017)
    Jack Fitzsimons, Diego Granziol, Kurt Cutajar, Maurizio Filippone, Michael Osborne, Stephen Roberts

SLIDE 45

Thank you!

https://github.com/mauriziofilippone
kurt.cutajar@eurecom.fr
