SLIDE 1

Scalable Exact Inference in Multi-Output Gaussian Processes

Wessel P. Bruinsma¹,², Eric Perim², Will Tebbutt¹, J. Scott Hosking³,⁴, Arno Solin⁵, Richard E. Turner¹,⁶

¹University of Cambridge, ²Invenia Labs, ³British Antarctic Survey, ⁴Alan Turing Institute, ⁵Aalto University, ⁶Microsoft Research

International Conference on Machine Learning 2020

SLIDE 2

Collaborators

Wessel P. Bruinsma · Eric Perim · Will Tebbutt · J. Scott Hosking · Arno Solin · Richard E. Turner

SLIDE 3

Introduction and Motivation

SLIDE 4

Introduction


  • Gaussian processes are a powerful and popular probabilistic modelling framework for nonlinear functions.

[Figure: two correlated output processes f₁ and f₂, with covariances between times t and t′.]

Central modelling choice:

    K(t, t′) = [ cov(f₁(t), f₁(t′))   cov(f₁(t), f₂(t′)) ]
               [ cov(f₂(t), f₁(t′))   cov(f₂(t), f₂(t′)) ]

  • Inference and learning: O(n³p³) time and O(n²p²) memory, where p is the number of outputs.
  • Often alleviated by exploiting structure in K.
SLIDE 5

Instantaneous Linear Mixing Model (ILMM)


x ∼ GP(0, K(t, t′)),   K(t, t) = I_m,
f(t) = h₁x₁(t) + h₂x₂(t) = Hx(t),
y(t) ∼ N(f(t), Σ),

x: "latent processes", H: "basis" or "mixing matrix" (see the generative sketch after the bullets below).

[Figure: f(t) decomposed into the mixture h₁x₁(t) + h₂x₂(t).]

  • Use m ≪ p basis vectors: data lives in “pancake” around col(H).
  • Generalisation of FA to time series setting.
  • Captures many existing MOGPs from literature.
  • Inference and learning: O(m3n3) instead of O(p3n3).
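To make the generative model concrete, here is a minimal NumPy sketch of sampling from an ILMM. The RBF kernel, the helper names, and all parameter values are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rbf(t1, t2, scale=1.0):
    """RBF kernel matrix between two vectors of time points."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / scale) ** 2)

def sample_ilmm(t, H, noise=0.05, seed=0):
    """Draw one sample of y(t) under the ILMM: f(t) = H x(t), y ~ N(f, noise * I)."""
    rng = np.random.default_rng(seed)
    p, m = H.shape
    K = rbf(t, t) + 1e-8 * np.eye(len(t))                      # kernel shared by the m latents
    x = rng.multivariate_normal(np.zeros(len(t)), K, size=m)   # (m, n) independent latent processes
    f = H @ x                                                  # mix into p outputs, (p, n)
    return f + np.sqrt(noise) * rng.standard_normal(f.shape)

t = np.linspace(0, 10, 200)
H = np.random.default_rng(1).standard_normal((5, 2))  # p = 5 outputs, m = 2 latents
y = sample_ilmm(t, H)                                 # (5, 200)
```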
SLIDE 6

Inside the ILMM

SLIDE 7

Key Result


Inference directly on the high-dimensional observation y (p-dimensional, noise Σ) is expensive. ✗

Instead, form the m-dimensional (m ≪ p) "projected observation"

    y_proj = Ty

and perform inference in p(x), x ∼ GP(0, K(t, t′)), under the projected noise Σ_T.

Proposition: This is exact!
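The slides do not spell out T. The form below, T = Σ_T HᵀΣ⁻¹ with Σ_T = (HᵀΣ⁻¹H)⁻¹, is my reading of the paper's construction and should be treated as an assumption; it does reduce to the OILMM's projection S^{-1/2}Uᵀ used on a later slide.

```python
import numpy as np

def project(Y, H, Sigma):
    """Compress (p, n) observations Y into (m, n) projected observations T @ Y."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_T = np.linalg.inv(H.T @ Sigma_inv @ H)  # projected noise, (m, m)
    T = Sigma_T @ H.T @ Sigma_inv                 # projection matrix, (m, p)
    return T @ Y, Sigma_T

# Conditioning x on T @ Y under noise Sigma_T is then exactly equivalent
# to conditioning f = H x on Y under noise Sigma.
```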

SLIDE 8

Key Result (2)


Direct inference p(f | Y): O(n³p³). ✗

With the projection:
  1. Project Y → TY: O(nmp).
  2. Inference p(x | TY): O(n³m³).
  3. Reconstruction: O(nmp).

SLIDE 9

Key Result (3)


log p(Y) = log ∫ p(x) ∏ᵢ₌₁ⁿ N(Tyᵢ | xᵢ, Σ_T) dx    ← likelihood of projected observations under projected noise

         − ½ ∑ᵢ₌₁ⁿ ‖yᵢ − HTyᵢ‖²_Σ                  ← data "lost" by projection (reconstruction error)

         − (n/2) log(|Σ| / |Σ_T|)                   ← noise "lost" by projection

         + const.

  • Learning H ⇔ learning T ⇔ learning a transform of the data!
  • “Regularisation terms” prevent underfitting.
SLIDE 10

Key Insight


  • Inference in the ILMM: condition x on Y_proj under noise Σ_T.
  • Hence, if x are independent under the prior and the projected noise Σ_T is diagonal, then x remain independent upon observing data.
  • Treat the latent processes independently: condition each xᵢ on (Y_proj)ᵢ: under noise (Σ_T)ᵢᵢ!
  • Decouples inference into m independent single-output problems (sketched below).
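A sketch of that decoupling: each latent process is conditioned on its own row of the projected data, as an ordinary single-output GP regression. The unit-scale RBF prior and all names are illustrative assumptions.

```python
import numpy as np

rbf = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)  # unit-scale RBF kernel

def posterior_means(t, Y_proj, Sigma_T, t_star):
    """Condition each latent x_i on (Y_proj)_i independently; return means at t_star."""
    means = []
    for i in range(Y_proj.shape[0]):
        K = rbf(t, t) + Sigma_T[i, i] * np.eye(len(t))        # prior + diagonal projected noise
        k_star = rbf(t_star, t)                               # cross-covariance to test points
        means.append(k_star @ np.linalg.solve(K, Y_proj[i]))  # standard 1-D GP posterior mean
    return np.stack(means)  # (m, n_star); recover f via H @ posterior means
```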
SLIDE 11

“Decoupling” the ILMM

SLIDE 12

Orthogonal ILMM (OILMM)


x ∼ GP(0, K(t, t′)),
f(t) = Hx(t) = US^{1/2}x(t),
y(t) ∼ N(f(t), Σ),

with U orthogonal (orthonormal columns) and S^{1/2} a diagonal scaling.

Key property: Σ_T is diagonal!

[Figure: FA, PPCA, ILMM, and OILMM arranged on two axes: an orthogonality constraint takes FA to PPCA and the ILMM to the OILMM; making the mixture time-varying takes FA to the ILMM and PPCA to the OILMM.]
SLIDE 13

Benefits of Orthogonality


Direct inference p(f | Y): O(n³p³). ✗

With the orthogonal projection:
  1. Project Y → TY: O(nmp).
  2. Inference p(x₁ | (TY)₁:), …, p(x_m | (TY)_m:) independently: O(n³) each.
  3. Reconstruction: O(nmp).

  • Linear scaling in m!
  • Trivially compatible with single-output scaling techniques!
SLIDE 14

Benefits of Orthogonality (2)


1. Project the data and compute the projected noise:

       Y_proj = S^{-1/2}U^TY,    Σ_T = σ²S^{-1} + D,

   where σ² is the observation noise and D the diagonal noise on the latent processes.

2. For i = 1, …, m, compute the log-probability LMLᵢ of (Y_proj)ᵢ: under latent process xᵢ and observation noise (Σ_T)ᵢᵢ.

3. Compute the "regularisation term":

       reg. = −(n/2) log|S| − (n(p − m)/2) log(2πσ²) − (1/(2σ²)) ‖(I_p − UU^T)Y‖²_F.

4. Construct the log-probability of the data Y under the OILMM:

       log p(Y) = ∑ᵢ₌₁ᵐ LMLᵢ + reg.
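A sketch of these four steps in NumPy/SciPy. The function name and signature are hypothetical; step 2 uses a dense O(n³) computation, which is exactly the part any single-output scaling technique could replace.

```python
import numpy as np
from scipy.stats import multivariate_normal

def oilmm_logpdf(t, Y, U, s, sigma2, d, kernels):
    """log p(Y) for Y of shape (p, n); s = diag(S), d = diag(D),
    kernels[i](t, t) builds the (n, n) kernel matrix of latent process i."""
    p, n = Y.shape
    m = len(s)
    # Step 1: project the data and compute the projected noise.
    Y_proj = (U / np.sqrt(s)).T @ Y            # S^{-1/2} U^T Y, shape (m, n)
    noise = sigma2 / s + d                     # diagonal of Sigma_T
    # Step 2: m independent single-output log marginal likelihoods.
    lml = sum(
        multivariate_normal.logpdf(Y_proj[i], cov=kernels[i](t, t) + noise[i] * np.eye(n))
        for i in range(m)
    )
    # Step 3: regularisation terms for the noise and data "lost" by projection.
    resid = Y - U @ (U.T @ Y)                  # (I_p - U U^T) Y
    reg = (-0.5 * n * np.sum(np.log(s))
           - 0.5 * n * (p - m) * np.log(2 * np.pi * sigma2)
           - 0.5 * np.sum(resid ** 2) / sigma2)
    # Step 4: combine.
    return lml + reg
```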

SLIDE 15

Complexities of MOGPs


Class    Complexity    (top to bottom: more restrictive)
MOGP     O(p³n³)
ILMM     O(m³n³)
OILMM    O(mn³)

Use single-output scaling techniques to also bring down the complexity in n:
  • O(mnr²) with r inducing points,
  • O(mnd³) with a d-dimensional state-space approximation.

Orthogonality gives excellent computational benefits. But how restrictive is it?

SLIDE 16

Generality of the OILMM

Definition

An (O)ILMM is separable if K(t, t′) = k(t, t′)I_m. Example: the ICM (intrinsic coregionalisation model).

ILMM versus OILMM:

  • Separable case: without loss of generality.
  • Non-separable case: only affects correlations through time.
  • An ILMM can be approximated by an OILMM (in KL) if the right singular vectors of H are close to the unit vectors (in ‖·‖_F).

  • A separable spatio–temporal GP is an OILMM.
  • The OILMM gives a non-separable relaxation of separable models whilst retaining efficient inference.

SLIDE 17

Missing Data


  • Missing data is troublesome: it breaks orthogonality of H.
  • In the paper, we derive a simple and effective approximation.
SLIDE 18

The OILMM in Practice

SLIDE 19

Demonstration of Scalability


[Figure: runtime (s) and memory (GB) against the number of latent processes m (1–25) for the ILMM and the OILMM.]

SLIDE 20

Demonstration of Generality


           EEG              FX
           PPLP    SMSE     PPLP    SMSE
ILMM      −2.11    0.49     3.39    0.19
OILMM     −2.11    0.49     3.39    0.19

  • Near identical performance on two real-world data sets.
  • Demonstrates that missing data approximation works well.
SLIDE 21

Case Study: Climate Simulators


[Figure: temperature (K) time series, 1979–2004, for ten of the climate simulators (ACCESS1-0, ACCESS1-3, BNU-ESM, CCSM4, CMCC-CM, CNRM-CM5, CSIRO-Mk3-6-0, CanAM4, EC-EARTH, FGOALS-g2).]

  • Jointly model p_s = 28 climate simulators at p_r = 247 spatial locations and n = 10 000 points in time.
  • Equals p = p_s·p_r ≈ 7 k outputs and pn ≈ 70 M observations.
  • Goal: learn the covariance between simulators with H = H_s ⊗ H_r (a sketch follows below).
  • Use m = 50 and inducing points to scale the decoupled problems.
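A sketch of how the Kronecker-structured mixing matrix could be assembled. The 5 × 10 split of m = 50 is an illustrative assumption, not taken from the paper; in practice one would exploit the Kronecker structure rather than build H densely.

```python
import numpy as np

p_s, m_s = 28, 5      # simulators and simulator latents (m_s * m_r = 50 assumed)
p_r, m_r = 247, 10    # spatial locations and spatial latents
H_s = np.random.default_rng(0).standard_normal((p_s, m_s))   # simulator basis
H_r = np.random.default_rng(1).standard_normal((p_r, m_r))   # spatial basis
H = np.kron(H_s, H_r)  # (p_s * p_r, m_s * m_r) = (6916, 50): couples simulators and space
```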
SLIDE 22

Case Study: Climate Simulators (2)


[Figure: two correlation matrices over the 28 simulators (ACCESS1.0 … NorESM1-M): empirical correlations (left) versus correlations learned by the OILMM (right).]

SLIDE 23

Conclusion


Use a projection of the data to accelerate inference in MOGPs with orthogonal bases:

  • Linear scaling in m.
  • Simple to implement.
  • Trivially compatible with single-output scaling techniques.
  • Does not sacrifice significant expressivity.