Sparse Gaussian Processes with Spherical Harmonic Features


SLIDE 1

Sparse Gaussian Processes with Spherical Harmonic Features

Vincent Dutordoir¹, Nicolas Durrande¹ and James Hensman²

¹PROWLER.io, ²Amazon (Work completed while JH was at PROWLER.io)

International Conference of Machine Learning – 2020

SLIDE 2

Contribution

We improve the scaling of sparse GPs with the number of datapoints and the number of input dimensions.

Airline dataset: regression problem with 6·10⁶ datapoints and 8 input dimensions. Setup: a single GTX 1070 GPU.

Model   Wall-clock time (s)   NLPD (lower is better)
SVGP    918.77                1.31
VISH    41.32                 1.29


SLIDE 3

Variational Inference with Spherical Harmonics (VISH)

Gist of the method:
• make the inputs (d+1)-dimensional
• project the data radially onto Sᵈ
• run a fast SVGP on the sphere
• map predictions on Sᵈ back to the original space

[Figure: the input x is augmented with a bias coordinate and projected radially onto the sphere]

The efficiency of VISH comes from using spherical harmonics as inducing functions for the SVGP on the sphere.
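To make the mapping concrete, here is a minimal sketch of the bias-augment-and-project step, assuming plain NumPy; the function name map_to_sphere and the bias value 1.0 are illustrative choices, not taken from the paper.

```python
# Minimal sketch (not the authors' code) of the VISH input mapping:
# append a constant bias coordinate, then project radially onto the
# unit hypersphere S^d. The bias value 1.0 is an illustrative choice.
import numpy as np

def map_to_sphere(X, bias=1.0):
    """Map N points in R^d to the unit sphere S^d in R^(d+1)."""
    Xb = np.concatenate([X, np.full((X.shape[0], 1), bias)], axis=1)
    return Xb / np.linalg.norm(Xb, axis=1, keepdims=True)

X = np.random.randn(5, 8)           # e.g. the 8 airline input dimensions
Xs = map_to_sphere(X)
print(np.linalg.norm(Xs, axis=1))   # all 1.0: the points lie on S^8
```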


SLIDE 4

From inducing points to inducing features

Inducing points: u_m = f(z_m); computing K_uu⁻¹ is O(M³).

VISH: u_m = ⟨f, φ_m⟩_H; computing K_uu⁻¹ is O(M).

Orthogonality of the basis functions φ leads to a diagonal K_uu and an O(M) inversion.
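A toy sketch of why the diagonal structure pays off, assuming NumPy/SciPy; the eigenvalues lam are placeholders, not the paper's spectrum.

```python
# Toy comparison (illustrative only): solving against a dense M x M
# K_uu needs an O(M^3) Cholesky factorisation, while the diagonal
# K_uu of VISH is solved by an O(M) elementwise division.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

M = 1000
v = np.random.randn(M)

# Inducing points: dense covariance, O(M^3).
A = np.random.randn(M, M)
Kuu_dense = A @ A.T + 1e-6 * np.eye(M)
x_dense = cho_solve(cho_factor(Kuu_dense), v)

# VISH: [K_uu]_mm = 1/lambda_m, so the solve is O(M).
lam = np.random.rand(M) + 0.1       # placeholder eigenvalues lambda_m
x_diag = v / (1.0 / lam)            # elementwise "inversion"
```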


SLIDE 5

Deep-dive


SLIDE 6

Sparse Variational Gaussian processes

Scalable and flexible

Capture the GP by a set of inducing variables u = f(Z), at locations z_1, …, z_M.

Minimise the KL divergence from p(f(·) | y) to q(f(·)) = GP(µ(·), ν(·, ·′)), with

µ(·) = k_u(·)⊤ K_uu⁻¹ m
ν(·, ·′) = k(·, ·′) − k_u(·)⊤ K_uu⁻¹ (K_uu − S) K_uu⁻¹ k_u(·′),

where [K_uu]_{m,m′} = Cov(u_m, u_{m′}) and [k_u(·)]_m = Cov(u_m, f(·)).

A more flexible (e.g. non-Gaussian likelihoods) and more scalable (e.g. mini-batching) model, at a cost of O(M³ + M²N).

Speedup comes from structure in the K_uu matrix (e.g. Hensman et al., 2017, VFF).
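As a sketch, the two predictive equations above translate almost line for line into NumPy; svgp_predict and the toy RBF kernel below are illustrative placeholders, not the authors' implementation.

```python
# Sketch of the SVGP predictive equations:
#   mu(.)     = k_u(.)^T K_uu^{-1} m
#   nu(., .') = k(., .') - k_u(.)^T K_uu^{-1} (K_uu - S) K_uu^{-1} k_u(.')
import numpy as np

def svgp_predict(Kuu, Kuf, Kff, m, S):
    """Kuu: (M,M), Kuf: (M,N), Kff: (N,N), m: (M,), S: (M,M)."""
    A = np.linalg.solve(Kuu, Kuf)        # K_uu^{-1} k_u(.), shape (M, N)
    mean = A.T @ m                       # predictive mean at the N points
    cov = Kff - A.T @ (Kuu - S) @ A      # predictive covariance
    return mean, cov

# Toy usage with an RBF kernel and M = 3 inducing points.
k = lambda a, b: np.exp(-0.5 * (a - b.T) ** 2)
Z, Xt = np.random.randn(3, 1), np.random.randn(4, 1)
mean, cov = svgp_predict(k(Z, Z) + 1e-6 * np.eye(3), k(Z, Xt), k(Xt, Xt),
                         np.zeros(3), 0.1 * np.eye(3))
```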

SLIDE 7

Outline

• Gaussian processes on the circle and hypersphere
• Spherical harmonics as inducing features
• Linear projection of the data onto the hypersphere


SLIDE 8

Gaussian processes on the circle

Basis of Fourier features: Φ(θ) = [cos(iθ), sin(iθ)], i = 0, 1, 2, …

Kernel: k(θ₁, θ₂) = Σ_{i=0}^∞ λ_i φ_i(θ₁) φ_i(θ₂)

Karhunen–Loève sample: f = Σ_i ξ_i φ_i(θ), with ξ_i ~ N(0, λ_i)

[Figure: the kernel k on the circle and a GP sample f drawn on the circle]
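A short sketch of the Karhunen–Loève construction above, assuming NumPy; the spectral decay λ_i = (1 + i²)⁻¹ is a placeholder, not the paper's kernel.

```python
# Draw a GP sample on the circle via a truncated Karhunen-Loeve
# expansion f(theta) = sum_i xi_i phi_i(theta), xi_i ~ N(0, lambda_i).
import numpy as np

I = 20                                      # truncation level
theta = np.linspace(0, 2 * np.pi, 200)
lam = 1.0 / (1.0 + np.arange(I) ** 2)       # placeholder eigenvalues
xi_c = np.sqrt(lam) * np.random.randn(I)    # coefficients of cos(i theta)
xi_s = np.sqrt(lam) * np.random.randn(I)    # coefficients of sin(i theta)

i = np.arange(I)[:, None]                   # (I, 1) frequencies
f = xi_c @ np.cos(i * theta) + xi_s @ np.sin(i * theta)  # sample on circle
```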


SLIDE 9

Spherical Harmonics

• Orthonormal basis on the hypersphere
• Eigenfunctions of the Laplace–Beltrami operator: ∆_{S^{d−1}} φ_i = λ_i φ_i
• Eigenfunctions of zonal kernels
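The orthonormality can be checked numerically on S² with SciPy's built-in spherical harmonics; a quadrature-grid sketch (in scipy.special.sph_harm, theta is the azimuth and phi the polar angle):

```python
# Numerical check of orthonormality on S^2:
# integral of Y_lm conj(Y_l'm') over the sphere is delta_ll' delta_mm'.
import numpy as np
from scipy.special import sph_harm
from scipy.integrate import trapezoid

th = np.linspace(0, 2 * np.pi, 201)     # azimuth
ph = np.linspace(0, np.pi, 201)         # polar angle
TH, PH = np.meshgrid(th, ph)

def inner(l1, m1, l2, m2):
    integrand = sph_harm(m1, l1, TH, PH) * np.conj(sph_harm(m2, l2, TH, PH))
    integrand *= np.sin(PH)             # surface element on S^2
    return trapezoid(trapezoid(integrand, th, axis=1), ph).real

print(inner(2, 1, 2, 1))                # ~1.0 (same harmonic)
print(inner(2, 1, 3, 0))                # ~0.0 (different harmonics)
```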


SLIDE 10

Mercer’s theorem for zonal kernels on the sphere

[Figure: two points x and x′ on the sphere; a zonal kernel depends only on the angle between them, i.e. on x⊤x′]

Zonal kernels are the spherical counterpart of stationary kernels: k(x, x′) = k′(distance(x, x′)).

Mercer's decomposition: any zonal kernel k on the hypersphere can be decomposed as
k(x, x′) = Σ_{i=0}^∞ λ_i φ_i(x) φ_i(x′).

Karhunen–Loève expansion: a GP f on the hypersphere with zonal covariance k can be written as
f = Σ_i ξ_i φ_i, with ξ_i ~ N(0, λ_i).
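On the circle (S¹ with Fourier features) this decomposition is easy to verify numerically; a sketch with a placeholder spectrum λ_i, not the paper's kernel:

```python
# The truncated Mercer sum with Fourier features collapses to a
# function of |t1 - t2| alone, i.e. a zonal (stationary) kernel:
# cos(i t1) cos(i t2) + sin(i t1) sin(i t2) = cos(i (t1 - t2)).
import numpy as np

I = 50
lam = np.exp(-0.1 * np.arange(I) ** 2)      # placeholder spectrum

def k_mercer(t1, t2):
    i = np.arange(I)
    return np.sum(lam * (np.cos(i * t1) * np.cos(i * t2)
                         + np.sin(i * t1) * np.sin(i * t2)))

# Both pairs are 0.7 apart, so the kernel values agree:
print(k_mercer(0.3, 1.0), k_mercer(1.5, 2.2))
```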

SLIDE 11

Spherical harmonics as inducing features in SVGPs

Define the kernel's RKHS H with the reproducing inner product: ⟨k(x, ·), h(·)⟩_H = h(x).

The approximate posterior is constructed from the inducing features u_m = ⟨f, φ_m⟩_H.

⇒ Diagonal covariance matrix: [K_uu]_{m,m′} = Cov(u_m, u_{m′}) = ⟨φ_m, φ_{m′}⟩_H = λ_m⁻¹ δ_{mm′}

⇒ Spherical harmonics as features: [k_u(·)]_m = Cov(u_m, f(·)) = φ_m(·)

⇒ An O(M²N) approximate GP:
q(f(·)) = GP( Φ(·)⊤ m , k(·, ·′) − Φ(·)⊤ (Λ − S) Φ(·′) ),
where Λ = diag(λ₁, …, λ_M) and Φ(·) = [φ₁(·), …, φ_M(·)].
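Read as code, this predictive needs no K_uu solve at all; a sketch assuming precomputed features Φ and eigenvalues λ (placeholder names, not the paper's API):

```python
# Sketch of the VISH predictive on this slide:
#   q(f) = GP( Phi^T m , k - Phi^T (Lambda - S) Phi )
# with Lambda = diag(lambda_1, ..., lambda_M).
import numpy as np

def vish_predict(Phi, Kff, m, S, lam):
    """Phi: (M,N) harmonics at test points, Kff: (N,N), m: (M,), S: (M,M)."""
    mean = Phi.T @ m                              # no K_uu solve needed
    cov = Kff - Phi.T @ (np.diag(lam) - S) @ Phi  # predictive covariance
    return mean, cov
```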

SLIDE 12

Linear mapping to the hypersphere

Most datasets do not correspond to data on a hypersphere... The proposed solution is to augment the inputs with a constant variable (bias) before projecting them radially onto the hypersphere.

[Figure: the input x is augmented with a bias coordinate and projected radially onto the sphere]

Although this construction may seem arbitrary, it is used implicitly in the Arc-Cosine kernel [Cho & Saul, 2009]:
k(x, x′) = ‖x‖‖x′‖ (sin θ + (π − θ) cos θ), with θ = arccos( x⊤x′ / (‖x‖‖x′‖) ),
where the first factor is radial and the second is angular.
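A sketch of this kernel exactly as written on the slide (Cho & Saul's original order-1 definition carries an extra 1/π factor):

```python
# Arc-Cosine kernel, order 1: a radial factor ||x|| ||x'|| times an
# angular factor sin(theta) + (pi - theta) cos(theta).
import numpy as np

def arccos_kernel(x, xp):
    r = np.linalg.norm(x) * np.linalg.norm(xp)
    theta = np.arccos(np.clip(x @ xp / r, -1.0, 1.0))  # clip for safety
    return r * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

x, xp = np.random.randn(9), np.random.randn(9)  # e.g. 8 inputs + bias
print(arccos_kernel(x, xp))
```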


SLIDE 13

Experiment

Airline dataset: a regression task with 6,000,000 datapoints, fitted in 40 seconds on a single cheap GTX 1070 GPU.

Model          Wall-clock time (s)   NLPD (lower is better)
SVGP           918.77                1.31
Additive-VFF   75.61                 1.32
VISH           41.32                 1.29


SLIDE 14

Conclusion

Summary of the advantages:
• It is the fastest SVGP model to date ⇒ no need for expensive hardware.
• The natural ordering of spherical harmonics makes the model scale nicely with the input dimension ⇒ it does not suffer from the curse of dimensionality the way VFF does.
• Similarities with the Arc-Cosine kernel give it extrapolation properties similar to neural networks.

Reach out to have a chat if you want to know more!
