SLIDE 1

A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation

  • Dr. Richard E. Turner (ret26@cam.ac.uk)

Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

...joint work with Thang Bui, Cuong Nguyen and Josiah Yan

1 / 22

SLIDE 2

Manfred Opper is a God

2 / 22

SLIDES 3–7

Motivation: Gaussian Process Regression

  • inputs
  • outputs

inference & learning: intractabilities, both computational and analytic

3 / 22
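The computational intractability above can be made concrete with a few lines of NumPy. This is an illustrative sketch (the RBF kernel and helper names are my choices, not from the slides): the Cholesky factorisation of the N × N Gram matrix costs O(N³), which is exactly the bottleneck sparse approximations target.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance matrix."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, Xs, noise=0.1):
    """Exact GP regression: posterior mean and variance at test inputs Xs.
    The Cholesky of the N x N Gram matrix is the O(N^3) step."""
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)                       # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(X, Xs)
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v**2, axis=0)
    return mean, var
```

With near-zero noise the posterior mean interpolates the training targets, which makes a convenient sanity check.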

SLIDES 8–16

A Brief History of Gaussian Process Approximations

Two families of methods employing pseudo-data:
  • approximate generative model, exact inference: FITC, PITC, DTC
  • exact generative model, approximate inference: VFE, EP, PP

  • FITC: Snelson et al., “Sparse Gaussian Processes using Pseudo-inputs”
  • PITC: Snelson et al., “Local and global sparse Gaussian process approximations”
  • EP: Csató and Opper, 2002; Qi et al., “Sparse-posterior Gaussian Processes for general likelihoods”
  • VFE: Titsias, “Variational Learning of Inducing Variables in Sparse Gaussian Processes”
  • DTC / PP: Seeger et al., “Fast Forward Selection to Speed Up Sparse Gaussian Process Regression”

Unifying views:
  • “A Unifying View of Sparse Approximate Gaussian Process Regression”, Quiñonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)
  • “A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation”, Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC, ...)

4 / 22

SLIDES 17–24

EP pseudo-point approximation

[Diagram: the exact joint yields the marginal likelihood and the true posterior; the true posterior is replaced by an approximate posterior, namely the posterior of a new GP regression model built on 'pseudo' data. The free parameters are the input locations of the 'pseudo' data and their outputs and covariance.]

5 / 22
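The pseudo-data idea can be sketched concretely. Below is a minimal DTC-style sparse predictor, one member of the family the talk unifies (the kernel and function names are illustrative assumptions, not the talk's own code): the posterior is that of a small GP regression model on M pseudo-inputs Z, so prediction costs O(NM²) rather than O(N³).

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sparse_gp_predict(X, y, Z, Xs, noise=0.1, jitter=1e-6):
    """DTC-style sparse GP prediction with M pseudo-inputs Z.
    Only M x M matrices are factorised, giving O(N M^2) cost."""
    M = len(Z)
    Kuu = rbf(Z, Z) + jitter * np.eye(M)
    Kuf = rbf(Z, X)          # (M, N)
    Kus = rbf(Z, Xs)         # (M, S)
    # Gaussian posterior over the pseudo-outputs u
    A = Kuu + Kuf @ Kuf.T / noise
    A_inv = np.linalg.inv(A)
    mean = Kus.T @ (A_inv @ (Kuf @ y)) / noise
    # predictive variance: prior - Nystrom term + pseudo-posterior term
    Kuu_inv = np.linalg.inv(Kuu)
    var = (np.diag(rbf(Xs, Xs))
           - np.sum(Kus * (Kuu_inv @ Kus), axis=0)
           + np.sum(Kus * (A_inv @ Kus), axis=0))
    return mean, var
```

Placing the pseudo-inputs at all training inputs recovers the exact GP mean, a useful sanity check on the algebra.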

SLIDES 25–31

EP algorithm

  • 1. remove: take out one pseudo-observation likelihood (forming the cavity)
  • 2. include: add in one true observation likelihood (forming the tilted distribution)
  • 3. project: onto the approximating family (KL between unnormalised stochastic processes; a rank-1 update)
  • 4. update: the pseudo-observation likelihood

At the projection:
  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

6 / 22
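The four-step loop is easiest to see in a toy model where every step is analytic. The sketch below is a hypothetical stand-in, not the GP case from the slides: EP on a scalar conjugate model θ ~ N(0, v₀), yₙ ~ N(θ, v), with sites stored as natural parameters. Because everything is Gaussian, step 3's moment-matching projection is exact.

```python
import numpy as np

def ep_gaussian_mean(y, prior_var=1.0, lik_var=0.5, n_sweeps=5):
    """EP for theta ~ N(0, prior_var), y_n ~ N(theta, lik_var).
    Each site n is an approximate likelihood stored as natural
    parameters (precision tau, precision-times-mean nu)."""
    y = np.asarray(y, dtype=float)
    tau_site = np.zeros(len(y))
    nu_site = np.zeros(len(y))
    tau_prior = 1.0 / prior_var
    for _ in range(n_sweeps):
        for n in range(len(y)):
            tau_post = tau_prior + tau_site.sum()
            nu_post = nu_site.sum()
            # 1. remove one pseudo-observation likelihood -> cavity
            tau_cav = tau_post - tau_site[n]
            nu_cav = nu_post - nu_site[n]
            # 2. include the true observation likelihood -> tilted
            # 3. project: moment matching (exact for a Gaussian tilted)
            tau_tilt = tau_cav + 1.0 / lik_var
            nu_tilt = nu_cav + y[n] / lik_var
            # 4. update the site: tilted minus cavity
            tau_site[n] = tau_tilt - tau_cav
            nu_site[n] = nu_tilt - nu_cav
    tau_post = tau_prior + tau_site.sum()
    return nu_site.sum() / tau_post, 1.0 / tau_post
```

In this conjugate toy EP lands on the exact posterior; in the GP setting the projection becomes the rank-1 KL step against the pseudo-point approximation described above.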

SLIDES 32–46

Fixed points of EP = FITC approximation

[Derivation slides establishing that the fixed points of EP coincide with the FITC approximation; the two constructions are equivalent.]

Csató & Opper (2002); Qi, Abdel-Gawad & Minka (2010)

Interpretation resolves philosophical issues with FITC (increase M with N). FITC likelihood > GP likelihood ⇒ EP over-estimates the (marginal) likelihood.

10 / 22
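The FITC view can be checked numerically: FITC is exact inference in a modified generative model whose marginal covariance is the Nyström approximation with a corrected diagonal, so its log marginal likelihood is directly computable. A small sketch under assumed helper names and an RBF kernel, with the sanity check that putting pseudo-inputs at every training input recovers the exact GP marginal likelihood (it does not test the over-estimation claim, which holds for sparser pseudo-point sets):

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gauss_log_ml(y, cov):
    """log N(y | 0, cov) via slogdet and a linear solve."""
    n = len(y)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

def fitc_log_ml(X, y, Z, noise=0.1, jitter=1e-6):
    """FITC marginal likelihood: exact inference in the approximate
    model y ~ N(0, Qnn + diag(Knn - Qnn) + noise * I), where
    Qnn = Kfu Kuu^{-1} Kuf is the Nystrom approximation."""
    Kuu = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kuf = rbf(Z, X)
    Knn = rbf(X, X)
    Qnn = Kuf.T @ np.linalg.solve(Kuu, Kuf)
    cov = Qnn + np.diag(np.diag(Knn - Qnn)) + noise * np.eye(len(X))
    return gauss_log_ml(y, cov)
```

Building the dense covariance here is only for clarity; practical implementations exploit the low-rank-plus-diagonal structure.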


SLIDE 48

Power EP algorithm (as tractable as EP)

  • 1. remove: take out a fraction of one pseudo-observation likelihood (forming the cavity)
  • 2. include: add in the same fraction of one true observation likelihood (forming the tilted distribution)
  • 3. project: onto the approximating family (KL between unnormalised stochastic processes; a rank-1 update)
  • 4. update: the pseudo-observation likelihood

At the projection:
  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

12 / 22
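The fractional updates cost nothing extra, which is the "as tractable as EP" point. A sketch on a hypothetical scalar conjugate model (θ ~ N(0, v₀), yₙ ~ N(θ, v), not the GP case): the cavity removes a fraction α of the site, the tilted distribution includes the likelihood raised to the power α, and the site update divides the change by α. In this toy the family contains the true posterior, so every α reaches the same fixed point; in sparse GPs the restricted family makes α interpolate between VFE-like and EP-like solutions.

```python
import numpy as np

def pep_gaussian_mean(y, alpha=0.5, prior_var=1.0, lik_var=0.5, n_sweeps=5):
    """Power EP for theta ~ N(0, prior_var), y_n ~ N(theta, lik_var).
    Same structure as EP, but only a fraction alpha of a site is
    removed and only the alpha-powered likelihood is included."""
    y = np.asarray(y, dtype=float)
    tau_site = np.zeros(len(y))
    nu_site = np.zeros(len(y))
    tau_prior = 1.0 / prior_var
    for _ in range(n_sweeps):
        for n in range(len(y)):
            tau_post = tau_prior + tau_site.sum()
            nu_post = nu_site.sum()
            # 1. remove a fraction alpha of the site -> cavity
            tau_cav = tau_post - alpha * tau_site[n]
            nu_cav = nu_post - alpha * nu_site[n]
            # 2./3. include p(y_n | theta)^alpha and moment match;
            # a Gaussian to the power alpha is an unnormalised
            # Gaussian with precision alpha / lik_var
            tau_tilt = tau_cav + alpha / lik_var
            nu_tilt = nu_cav + alpha * y[n] / lik_var
            # 4. update the site: divide the change by alpha
            tau_site[n] = (tau_tilt - tau_cav) / alpha
            nu_site[n] = (nu_tilt - nu_cav) / alpha
    tau_post = tau_prior + tau_site.sum()
    return nu_site.sum() / tau_post, 1.0 / tau_post
```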

SLIDE 49

Power EP: a unifying framework

  • α = 1: EP / FITC (Csató and Opper, 2002; Snelson and Ghahramani, 2005)
  • α → 0: VFE (Titsias, 2009)

13 / 22

SLIDES 50–56

Power EP: a unifying framework

  • Approximate blocks of data: structured approximations. PITC / BCM (Schwaighofer & Tresp, 2002; Snelson, 2006); VFE (Titsias, 2009)
  • Place pseudo-data in a different space: interdomain transformations (a linear transform puts the pseudo-data in the new space). Figueiras-Vidal & Lázaro-Gredilla, 2009; Tobar et al., 2015; Matthews et al., 2016

14 / 22

SLIDE 57

Power EP: a unifying framework

[Table: GP regression and GP classification methods arranged by inference scheme (PEP / VFE / EP) and by inter-domain and structured-approximation variants, placing each published method in the framework; * = optimised pseudo-inputs, ** = structured versions of VFE recover VFE.]

  • [4] Quiñonero-Candela et al., 2005
  • [5] Snelson et al., 2005
  • [6] Snelson, 2006
  • [7] Schwaighofer, 2002
  • [8] Titsias, 2009
  • [9] Csató, 2002
  • [10] Csató et al., 2002
  • [11] Seeger et al., 2003
  • [12] Naish-Guzman et al., 2007
  • [13] Qi et al., 2010
  • [14] Hensman et al., 2015
  • [15] Hernández-Lobato et al., 2016
  • [16] Matthews et al., 2016
  • [17] Figueiras-Vidal et al., 2009

15 / 22

SLIDE 58

How should I set the power parameter α?

  • 6 UCI classification datasets, 20 random splits, M = 10, 50, 100, hypers and inducing inputs optimised
  • 8 UCI regression datasets, 20 random splits, M = 0 - 200, hypers and inducing inputs optimised

[Plots: MSE rank, error rank, and log-loss rank as a function of α over [0, 1].] α = 0.5 does well on average.

16 / 22

SLIDE 59

How should I set the power parameter α?

[Per-dataset comparisons across VFE, α = 0.01 / 0.05 / 0.1 / 0.2 / 0.4 / 0.5 / 0.6 / 0.8, and EP: MSE / error rank and log-loss rank on the 6 UCI classification and 8 UCI regression datasets.] α = 0.5 does well on average; EP beats VFE in 40% of tests.

17 / 22

SLIDES 60–64

Streaming / Online Sparse Approximations

Goal: online posterior update (using the old posterior and the new data batch). Two new innovations for online learning and inducing-input optimisation.

  • 1. naïve approach: use the previous approximate posterior as the prior,
    q^(new)(f) ≈ p(y^(new) | f) · q^(old)(f)
    (new posterior ≈ new likelihood × old posterior)
  • 1. better approach: only take the likelihood terms from the old posterior,
    q^(new)(f) ≈ p(y^(new) | f) · [q^(old)(f) / p(f | θ^(old))] · p(f | θ^(new))
    (new likelihood × old likelihoods × original prior, which allows the hyperparameters to move from θ^(old) to θ^(new))
  • 2. naïve approach: use the same pseudo-points throughout,
    q^(old)(f) = p(f_{≠u} | u, θ^(old)) q(u),  q^(new)(f) = p(f_{≠u} | u, θ^(new)) q(u)
  • 2. better approach: decouple the sets of pseudo-points,
    q^(old)(f) = p(f_{≠u^(old)} | u^(old), θ^(old)) q(u^(old))
    q^(new)(f) = p(f_{≠u^(new)} | u^(new), θ^(new)) q(u^(new))

VFE is now the best Power EP method (inducing-point clumping).

18 / 22
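The "better approach" to the prior can be sketched in natural parameters for a hypothetical conjugate scalar model (θ ~ N(0, v₀), y ~ N(θ, v), standing in for the GP pseudo-point posterior): subtract the old prior from the old posterior to recover the accumulated likelihood terms, then combine them with the new batch's likelihood and the current prior. In this exact-conjugate case streaming matches the batch posterior; in the GP case the same bookkeeping runs through q(u) at the pseudo-points.

```python
import numpy as np

def streaming_posterior(batches, prior_var=1.0, lik_var=0.5):
    """Online update in natural parameters (precision tau, nu = tau*mean):
    new posterior = new likelihood * (old posterior / old prior) * prior."""
    tau_prior = 1.0 / prior_var
    tau_post, nu_post = tau_prior, 0.0   # start from the prior
    for y in batches:
        y = np.asarray(y, dtype=float)
        # old likelihoods = old posterior divided by the old prior
        tau_lik, nu_lik = tau_post - tau_prior, nu_post - 0.0
        # reinstate the prior and add the new batch's likelihood
        tau_post = tau_prior + tau_lik + len(y) / lik_var
        nu_post = nu_lik + y.sum() / lik_var
    return nu_post / tau_post, 1.0 / tau_post
```

If the prior (hyperparameters) were updated between batches, only `tau_prior` at the reinstatement step would change, which is the point of keeping the likelihood terms separate.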

SLIDE 65

Online Sparse Approximations: Regression and Classification

[Figure: posterior snapshots on a 1-D regression task and a two-input (x1, x2) classification task.]

19 / 22

SLIDE 66

Streaming / Online Sparse Approximations: Time-series Regression

[Figure: mean held-out log-likelihood against accumulated running time (seconds, 1 to 1000) for online variational, exact batch VFE, and minibatch VFE.]

20 / 22

SLIDE 67

Summary

  • Provided a unifying framework for Gaussian process approximation methods using pseudo-points via PEP.
  • FITC and PITC are EP in disguise, and they use the same approximating distribution as VFE.
  • Intermediate powers in PEP perform best on average in the batch setting (more theory and empirical work needed).
  • VFE methods perform best in the online setting.

Core material:
  • A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint, 2016
  • Streaming Sparse Gaussian Process Approximations, arXiv preprint, 2017

21 / 22

SLIDE 68

VFE is best for online inference and learning

22 / 22