SLIDE 1

Scalable Exact Inference in Multi-Output Gaussian Processes

Wessel P. Bruinsma¹,², Eric Perim², Will Tebbutt¹, J. Scott Hosking³,⁴, Arno Solin⁵, Richard E. Turner¹,⁶

¹University of Cambridge, ²Invenia Labs, ³British Antarctic Survey, ⁴Alan Turing Institute, ⁵Aalto University, ⁶Microsoft Research

International Conference on Machine Learning 2020

SLIDE 2

Collaborators

Wessel P. Bruinsma · Eric Perim · Will Tebbutt · J. Scott Hosking · Arno Solin · Richard E. Turner

SLIDE 3

Introduction and Motivation

SLIDE 4

Introduction


  • Gaussian processes are a powerful and popular probabilistic modelling framework for nonlinear functions.

[Figure: two correlated output processes f₁ and f₂, with covariances between times t and t′.]

Central modelling choice:

    K(t, t′) = [ cov(f₁(t), f₁(t′))   cov(f₁(t), f₂(t′)) ]
               [ cov(f₂(t), f₁(t′))   cov(f₂(t), f₂(t′)) ]

  • Inference and learning: O(n³p³) time and O(n²p²) memory, where p is the number of outputs.
  • Often alleviated by exploiting structure in K.
SLIDE 5

Instantaneous Linear Mixing Model (ILMM)


x ∼ GP(0, K(t, t′)),   K(t, t) = I_m,
f(t) = h₁x₁(t) + h₂x₂(t) = Hx(t),
y(t) ∼ N(f(t), Σ),

x: "latent processes", H: "basis" or "mixing matrix" (see the generative sketch after the bullets below).

[Figure: f(t) decomposed into the mixture h₁x₁(t) + h₂x₂(t).]

  • Use m ≪ p basis vectors: data lives in “pancake” around col(H).
  • Generalisation of FA to time series setting.
  • Captures many existing MOGPs from literature.
  • Inference and learning: O(m3n3) instead of O(p3n3).
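To make the generative model concrete, here is a minimal NumPy sketch of sampling from an ILMM. The RBF kernel, the helper names, and all parameter values are illustrative assumptions, not the paper's code.

```python
import numpy as np

def rbf(t1, t2, scale=1.0):
    """RBF kernel matrix between two vectors of time points."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / scale) ** 2)

def sample_ilmm(t, H, noise=0.05, seed=0):
    """Draw one sample of y(t) under the ILMM: f(t) = H x(t), y ~ N(f, noise * I)."""
    rng = np.random.default_rng(seed)
    p, m = H.shape
    K = rbf(t, t) + 1e-8 * np.eye(len(t))                      # kernel shared by the m latents
    x = rng.multivariate_normal(np.zeros(len(t)), K, size=m)   # (m, n) independent latent processes
    f = H @ x                                                  # mix into p outputs, (p, n)
    return f + np.sqrt(noise) * rng.standard_normal(f.shape)

t = np.linspace(0, 10, 200)
H = np.random.default_rng(1).standard_normal((5, 2))  # p = 5 outputs, m = 2 latents
y = sample_ilmm(t, H)                                 # (5, 200)
```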
SLIDE 6

Inside the ILMM

SLIDE 7

Key Result


Inference directly on the high-dimensional observation y (p-dimensional, noise Σ) is expensive. ✗

Instead, form the m-dimensional (m ≪ p) "projected observation"

    y_proj = Ty

and perform inference in p(x), x ∼ GP(0, K(t, t′)), under the projected noise Σ_T.

Proposition: This is exact!
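The slides do not spell out T. The form below, T = Σ_T HᵀΣ⁻¹ with Σ_T = (HᵀΣ⁻¹H)⁻¹, is my reading of the paper's construction and should be treated as an assumption; it does reduce to the OILMM's projection S^{-1/2}Uᵀ used on a later slide.

```python
import numpy as np

def project(Y, H, Sigma):
    """Compress (p, n) observations Y into (m, n) projected observations T @ Y."""
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_T = np.linalg.inv(H.T @ Sigma_inv @ H)  # projected noise, (m, m)
    T = Sigma_T @ H.T @ Sigma_inv                 # projection matrix, (m, p)
    return T @ Y, Sigma_T

# Conditioning x on T @ Y under noise Sigma_T is then exactly equivalent
# to conditioning f = H x on Y under noise Sigma.
```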

SLIDE 8

Key Result (2)


Direct inference p(f | Y): O(n³p³). ✗

With the projection:
  1. Project Y → TY: O(nmp).
  2. Inference p(x | TY): O(n³m³).
  3. Reconstruction: O(nmp).

SLIDE 9

Key Result (3)


log p(Y) = log ∫ p(x) ∏ᵢ₌₁ⁿ N(Tyᵢ | xᵢ, Σ_T) dx    ← likelihood of projected observations under projected noise

         − ½ ∑ᵢ₌₁ⁿ ‖yᵢ − HTyᵢ‖²_Σ                  ← data "lost" by projection (reconstruction error)

         − (n/2) log(|Σ| / |Σ_T|)                   ← noise "lost" by projection

         + const.

  • Learning H ⇔ learning T ⇔ learning a transform of the data!
  • “Regularisation terms” prevent underfitting.
SLIDE 10

Key Insight


  • Inference in the ILMM: condition x on Y_proj under noise Σ_T.
  • Hence, if x are independent under the prior and the projected noise Σ_T is diagonal, then x remain independent upon observing data.
  • Treat the latent processes independently: condition each xᵢ on (Y_proj)ᵢ: under noise (Σ_T)ᵢᵢ!
  • Decouples inference into m independent single-output problems (sketched below).
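A sketch of that decoupling: each latent process is conditioned on its own row of the projected data, as an ordinary single-output GP regression. The unit-scale RBF prior and all names are illustrative assumptions.

```python
import numpy as np

rbf = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)  # unit-scale RBF kernel

def posterior_means(t, Y_proj, Sigma_T, t_star):
    """Condition each latent x_i on (Y_proj)_i independently; return means at t_star."""
    means = []
    for i in range(Y_proj.shape[0]):
        K = rbf(t, t) + Sigma_T[i, i] * np.eye(len(t))        # prior + diagonal projected noise
        k_star = rbf(t_star, t)                               # cross-covariance to test points
        means.append(k_star @ np.linalg.solve(K, Y_proj[i]))  # standard 1-D GP posterior mean
    return np.stack(means)  # (m, n_star); recover f via H @ posterior means
```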
SLIDE 11

“Decoupling” the ILMM

SLIDE 12

Orthogonal ILMM (OILMM)


x ∼ GP(0, K(t, t′)),
f(t) = Hx(t) = US^{1/2}x(t),
y(t) ∼ N(f(t), Σ),

with U orthogonal (orthonormal columns) and S^{1/2} a diagonal scaling.

Key property: Σ_T is diagonal!

[Figure: FA, PPCA, ILMM, and OILMM arranged on two axes: an orthogonality constraint takes FA to PPCA and the ILMM to the OILMM; making the mixture time-varying takes FA to the ILMM and PPCA to the OILMM.]
SLIDE 13

Benefits of Orthogonality


Direct inference p(f | Y): O(n³p³). ✗

With the orthogonal projection:
  1. Project Y → TY: O(nmp).
  2. Inference p(x₁ | (TY)₁:), …, p(x_m | (TY)_m:) independently: O(n³) each.
  3. Reconstruction: O(nmp).

  • Linear scaling in m!
  • Trivially compatible with single-output scaling techniques!
SLIDE 14

Benefits of Orthogonality (2)


1. Project the data and compute the projected noise:

       Y_proj = S^{-1/2}U^TY,    Σ_T = σ²S^{-1} + D,

   where σ² is the observation noise and D the diagonal noise on the latent processes.

2. For i = 1, …, m, compute the log-probability LMLᵢ of (Y_proj)ᵢ: under latent process xᵢ and observation noise (Σ_T)ᵢᵢ.

3. Compute the "regularisation term":

       reg. = −(n/2) log|S| − (n(p − m)/2) log(2πσ²) − (1/(2σ²)) ‖(I_p − UU^T)Y‖²_F.

4. Construct the log-probability of the data Y under the OILMM:

       log p(Y) = ∑ᵢ₌₁ᵐ LMLᵢ + reg.
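A sketch of these four steps in NumPy/SciPy. The function name and signature are hypothetical; step 2 uses a dense O(n³) computation, which is exactly the part any single-output scaling technique could replace.

```python
import numpy as np
from scipy.stats import multivariate_normal

def oilmm_logpdf(t, Y, U, s, sigma2, d, kernels):
    """log p(Y) for Y of shape (p, n); s = diag(S), d = diag(D),
    kernels[i](t, t) builds the (n, n) kernel matrix of latent process i."""
    p, n = Y.shape
    m = len(s)
    # Step 1: project the data and compute the projected noise.
    Y_proj = (U / np.sqrt(s)).T @ Y            # S^{-1/2} U^T Y, shape (m, n)
    noise = sigma2 / s + d                     # diagonal of Sigma_T
    # Step 2: m independent single-output log marginal likelihoods.
    lml = sum(
        multivariate_normal.logpdf(Y_proj[i], cov=kernels[i](t, t) + noise[i] * np.eye(n))
        for i in range(m)
    )
    # Step 3: regularisation terms for the noise and data "lost" by projection.
    resid = Y - U @ (U.T @ Y)                  # (I_p - U U^T) Y
    reg = (-0.5 * n * np.sum(np.log(s))
           - 0.5 * n * (p - m) * np.log(2 * np.pi * sigma2)
           - 0.5 * np.sum(resid ** 2) / sigma2)
    # Step 4: combine.
    return lml + reg
```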

SLIDE 15

Complexities of MOGPs


Class    Complexity    (top to bottom: more restrictive)
MOGP     O(p³n³)
ILMM     O(m³n³)
OILMM    O(mn³)

Use single-output scaling techniques to also bring down the complexity in n:
  • O(mnr²) with r inducing points,
  • O(mnd³) with a d-dimensional state-space approximation.

Orthogonality gives excellent computational benefits. But how restrictive is it?

SLIDE 16

Generality of the OILMM

Definition

An (O)ILMM is separable if K(t, t′) = k(t, t′)I_m. Example: the ICM (intrinsic coregionalisation model).

ILMM versus OILMM:

  • Separable case: without loss of generality.
  • Non-separable case: only affects correlations through time.
  • An ILMM can be approximated by an OILMM (in KL) if the right singular vectors of H are close to the unit vectors (in ‖·‖_F).

  • A separable spatio–temporal GP is an OILMM.
  • The OILMM gives a non-separable relaxation of separable models whilst retaining efficient inference.

SLIDE 17

Missing Data


  • Missing data is troublesome: it breaks orthogonality of H.
  • In the paper, we derive a simple and effective approximation.
SLIDE 18

The OILMM in Practice

SLIDE 19

Demonstration of Scalability


[Figure: runtime (s) and memory (GB) against the number of latent processes m (1–25) for the ILMM and the OILMM.]

SLIDE 20

Demonstration of Generality


           EEG              FX
           PPLP    SMSE     PPLP    SMSE
ILMM      −2.11    0.49     3.39    0.19
OILMM     −2.11    0.49     3.39    0.19

  • Near identical performance on two real-world data sets.
  • Demonstrates that missing data approximation works well.
SLIDE 21

Case Study: Climate Simulators


[Figure: temperature (K) time series, 1979–2004, for ten of the climate simulators (ACCESS1-0, ACCESS1-3, BNU-ESM, CCSM4, CMCC-CM, CNRM-CM5, CSIRO-Mk3-6-0, CanAM4, EC-EARTH, FGOALS-g2).]

  • Jointly model p_s = 28 climate simulators at p_r = 247 spatial locations and n = 10 000 points in time.
  • Equals p = p_s·p_r ≈ 7 k outputs and pn ≈ 70 M observations.
  • Goal: learn the covariance between simulators with H = H_s ⊗ H_r (a sketch follows below).
  • Use m = 50 and inducing points to scale the decoupled problems.
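A sketch of how the Kronecker-structured mixing matrix could be assembled. The 5 × 10 split of m = 50 is an illustrative assumption, not taken from the paper; in practice one would exploit the Kronecker structure rather than build H densely.

```python
import numpy as np

p_s, m_s = 28, 5      # simulators and simulator latents (m_s * m_r = 50 assumed)
p_r, m_r = 247, 10    # spatial locations and spatial latents
H_s = np.random.default_rng(0).standard_normal((p_s, m_s))   # simulator basis
H_r = np.random.default_rng(1).standard_normal((p_r, m_r))   # spatial basis
H = np.kron(H_s, H_r)  # (p_s * p_r, m_s * m_r) = (6916, 50): couples simulators and space
```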
SLIDE 22

Case Study: Climate Simulators (2)


[Figure: two correlation matrices over the 28 simulators (ACCESS1.0 … NorESM1-M): empirical correlations (left) versus correlations learned by the OILMM (right).]

SLIDE 23

Conclusion


Use a projection of the data to accelerate inference in MOGPs with orthogonal bases:

  • Linear scaling in m.
  • Simple to implement.
  • Trivially compatible with single-output scaling techniques.
  • Does not sacrifice significant expressivity.