Sparse Gaussian Process Approximations, Dr. Richard E. Turner (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Sparse Gaussian Process Approximations

  • Dr. Richard E. Turner (ret26@cam.ac.uk)

Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

1 / 90

slide-2
SLIDE 2

Motivating application 1: Audio modelling

[Figure: audio time-series data reconstruction using a GP model; axes: time /s and time /ms]

T = 10^5–10^7 datapoints

2 / 90

slide-3
SLIDE 3

Motivating application 1: Audio modelling

[Figure: audio time-series data reconstruction using a GP model (repeated from Slide 2)]

How can we use GPs in this setting?

3 / 90

slide-4
SLIDE 4

Motivating application 2: non-linear regression

[Figure: average test log-likelihood (nats) on ten UCI regression benchmarks, comparing BNN-deterministic, BNN-sampling, GP and DGP]

Datasets: boston (N = 506, D = 13); concrete (N = 1030, D = 8); energy (N = 768, D = 8); kin8nm (N = 8192, D = 8); naval (N = 11934, D = 16); power (N = 9568, D = 4); protein (N = 45730, D = 9); red wine (N = 1588, D = 11); yacht (N = 308, D = 6); year (N = 515345, D = 90)

4 / 90


slide-9
SLIDE 9

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs]

9 / 90

slide-10
SLIDE 10

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs, with a query input marked '?']

9 / 90


slide-12
SLIDE 12

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs, with a query input marked '?']

inference & learning

9 / 90

slide-13
SLIDE 13

Motivation: Gaussian Process Regression

[Figure: training data, inputs vs. outputs, with a query input marked '?']

inference & learning → intractabilities: computational and analytic

9 / 90

slide-14
SLIDE 14

Motivation: Gaussian Process Regression

9 / 90

slide-15
SLIDE 15

A Brief History of Gaussian Process Approximations

FITC: Snelson et al., “Sparse Gaussian Processes using Pseudo-inputs”
PITC: Snelson et al., “Local and global sparse Gaussian process approximations”
EP: Csató and Opper 2002 / Qi et al., “Sparse-posterior Gaussian Processes for general likelihoods”
VFE: Titsias, “Variational Learning of Inducing Variables in Sparse Gaussian Processes”
DTC / PP: Seeger et al., “Fast Forward Selection to Speed Up Sparse Gaussian Process Regression”

10 / 90

slide-16
SLIDE 16

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference

(references as listed on Slide 15)

10 / 90

slide-17
SLIDE 17

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; methods employing pseudo-data

(references as listed on Slide 15)

10 / 90

slide-18
SLIDE 18

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; methods employing pseudo-data

(references as listed on Slide 15)

FITC PITC DTC

10 / 90

slide-19
SLIDE 19

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; methods employing pseudo-data

(references as listed on Slide 15)

FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

10 / 90

slide-20
SLIDE 20

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; exact generative model + approximate inference; methods employing pseudo-data

(references as listed on Slide 15)

FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

10 / 90

slide-21
SLIDE 21

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; exact generative model + approximate inference; methods employing pseudo-data

(references as listed on Slide 15)

VFE EP PP FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression Quinonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)

10 / 90


slide-23
SLIDE 23

A Brief History of Gaussian Process Approximations

approximate generative model + exact inference; exact generative model + approximate inference; methods employing pseudo-data

(references as listed on Slide 15)

VFE EP PP FITC PITC DTC

A Unifying View of Sparse Approximate Gaussian Process Regression, Quiñonero-Candela & Rasmussen, 2005 (FITC, PITC, DTC)
A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, Bui, Yan and Turner, 2016 (VFE, EP, FITC, PITC, ...)

10 / 90

slide-24
SLIDE 24

Factor Graphs: introduction / reminder

factor graph examples

11 / 90

slide-25
SLIDE 25

Factor Graphs: introduction / reminder

factor graph examples what is the minimal factor graph for this multivariate Gaussian? 4 dimensional solution:

12 / 90

slide-26
SLIDE 26

Factor Graphs: introduction / reminder

factor graph examples what is the minimal factor graph for this multivariate Gaussian? 4 dimensional solution:

13 / 90

slide-27
SLIDE 27

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

14 / 90

slide-28
SLIDE 28

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data

15 / 90

slide-29
SLIDE 29

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data
  • 2. remove some of the dependencies

(results in simpler model)

all factors

16 / 90


slide-31
SLIDE 31

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data
  • 2. remove some of the dependencies

(results in simpler model)

  • 3. calibrate model

(e.g. using KL divergence, many choices)

equal to exact conditionals all factors

18 / 90
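The augmented factor graph itself is only an image in the original slides; a sketch of the standard FITC construction it describes (with u the M pseudo-data function values at inputs Z, f_t the function value at training input x_t, and K the covariance function evaluated at those points; this notation is assumed rather than taken from the transcript):

\begin{align*}
p(\mathbf{u}) &= \mathcal{N}(\mathbf{u};\, \mathbf{0},\, \mathbf{K}_{\mathbf{uu}}) \\
p(f_t \mid \mathbf{u}) &= \mathcal{N}\!\big(f_t;\; \mathbf{K}_{t\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{u},\; K_{tt} - \mathbf{K}_{t\mathbf{u}}\mathbf{K}_{\mathbf{uu}}^{-1}\mathbf{K}_{\mathbf{u}t}\big) \quad \text{independently for } t = 1, \dots, T \\
p(y_t \mid f_t) &= \mathcal{N}(y_t;\, f_t,\, \sigma^2)
\end{align*}

Step 2 (removing dependencies) corresponds to the f_t being conditionally independent given u; step 3 (calibration) sets each p(f_t | u) to the exact GP conditional, so the prior marginals of the new model match the original.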

slide-32
SLIDE 32

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

  • 1. augment model with M<T pseudo data
  • 2. remove some of the dependencies

(results in simpler model)

  • 3. calibrate model

(e.g. using KL divergence, many choices)

equal to exact conditionals all factors indirect posterior approximation

19 / 90

slide-33
SLIDE 33

Fully independent training conditional (FITC) approximation

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

20 / 90


slide-36
SLIDE 36

Fully independent training conditional (FITC) approximation

How do we make predictions?

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

23 / 90


slide-41
SLIDE 41

Fully independent training conditional (FITC) approximation

cost of computing the likelihood is O(TM²)

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

28 / 90


slide-44
SLIDE 44

Fully independent training conditional (FITC) approximation

cost of computing the likelihood is O(TM²)

construct new generative model (with pseudo-data); cheaper to perform exact learning and inference; calibrated to original

indirect posterior approximation

  • original variances along the diagonal: stops variances collapsing

31 / 90
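The likelihood computation referred to on the preceding slides is only shown pictorially; below is a minimal NumPy sketch of the FITC marginal likelihood for a squared-exponential kernel and Gaussian noise (the function names, hyperparameter values and toy data are illustrative assumptions, not from the slides). It uses the Woodbury identity and the matrix determinant lemma, which is what makes the cost scale as O(TM²) rather than O(T³).

import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    # squared-exponential kernel matrix between two sets of inputs
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def fitc_log_marginal_likelihood(X, y, Z, noise_var=0.1, jitter=1e-6):
    # log N(y | 0, Qff + diag(Kff - Qff) + noise_var*I), with Qff = Kfu Kuu^{-1} Kuf,
    # evaluated in O(T M^2) via the Woodbury identity and matrix determinant lemma
    T, M = X.shape[0], Z.shape[0]
    Kuu = rbf(Z, Z) + jitter * np.eye(M)
    Kuf = rbf(Z, X)
    kff_diag = np.full(T, rbf(X[:1], X[:1])[0, 0])   # prior marginal variances (stationary kernel)
    Luu = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(Luu, Kuf)                    # M x T, so that Qff = A.T @ A
    d = kff_diag - np.sum(A**2, axis=0) + noise_var  # FITC diagonal: keeps the exact marginal variances
    B = np.eye(M) + (A / d) @ A.T                    # I + A D^{-1} A^T
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, (A / d) @ y)
    quad = y @ (y / d) - c @ c                       # y^T (Qff + D)^{-1} y
    logdet = np.sum(np.log(d)) + 2.0 * np.sum(np.log(np.diag(LB)))
    return -0.5 * (T * np.log(2.0 * np.pi) + logdet + quad)

# toy usage with synthetic (hypothetical) data
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)
Z = np.linspace(0.0, 5.0, 15)[:, None]               # M = 15 pseudo-inputs
print(fitc_log_marginal_likelihood(X, y, Z))

In practice the pseudo-input locations Z and the kernel hyperparameters would additionally be optimised by gradient ascent on this quantity, as in the Snelson demo that follows.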

slide-45
SLIDE 45

FITC: Demo (Snelson)

32 / 90


slide-47
SLIDE 47

Fully independent training conditional (FITC) approximation

parametric (although cleverly so): if I see more data, should I add extra pseudo-data?

◮ unnatural from a generative modelling perspective
◮ natural from a prediction perspective (posterior gets more complex)

⇒ lost the elegant separation of model, inference and approximation

example of prior approximation

Extensions: inter-domain GPs (pseudo-data in a different space); partially independent training conditional and tree-structured approximations

34 / 90

slide-48
SLIDE 48

Variational free-energy method (VFE)

lower bound the likelihood

35 / 90


slide-53
SLIDE 53

Variational free-energy method (VFE)

lower bound the likelihood

KL between stochastic processes

40 / 90

slide-54
SLIDE 54

Variational free-energy method (VFE)

lower bound the likelihood
assume approximate posterior factorisation with special form; exact:

KL between stochastic processes

41 / 90
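The equation for the 'special form' is not reproduced in the transcript; in the standard Titsias-style construction it is (a sketch, with u the inducing variables and m, S free variational parameters; symbols assumed):

q(f) = p(f_{\neq \mathbf{u}} \mid \mathbf{u})\, q(\mathbf{u}), \qquad q(\mathbf{u}) = \mathcal{N}(\mathbf{u};\, \mathbf{m},\, \mathbf{S}).

Keeping the conditional prior p(f_{\neq u} | u) exact is what makes the intractable infinite-dimensional terms cancel inside the KL, leaving a finite optimisation over q(u) and the pseudo-input locations.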

slide-55
SLIDE 55

Variational free-energy method (VFE)

true posterior approximate posterior

  • optimise the variational free-energy with respect to these variational parameters

42 / 90

slide-56
SLIDE 56

Variational free-energy method (VFE)

true posterior approximate posterior

same form as prediction from GP-regression

  • optimise the variational free-energy with respect to these variational parameters

43 / 90

slide-57
SLIDE 57

Variational free-energy method (VFE)

true posterior approximate posterior

input locations of 'pseudo' data

  • output locations and covariance of 'pseudo' data

same form as prediction from GP-regression

  • optimise the variational free-energy with respect to these variational parameters

44 / 90

slide-58
SLIDE 58

Variational free-energy method (VFE)

lower bound the likelihood
assume approximate posterior factorisation with special form; exact:

predictive from GP regression

KL between stochastic processes

45 / 90

slide-59
SLIDE 59

Variational free-energy method (VFE)

lower bound the likelihood
assume approximate posterior factorisation with special form; exact:

predictive from GP regression

plug into Free-energy:

KL between stochastic processes

46 / 90


slide-62
SLIDE 62

Variational free-energy method (VFE)

lower bound the likelihood where

DTC like uncertainty based correction

49 / 90

slide-63
SLIDE 63

Variational free-energy method (VFE)

lower bound the likelihood where

DTC like uncertainty based correction KL between two multivariate Gaussians average of quadratic form

50 / 90

slide-64
SLIDE 64

Variational free-energy method (VFE)

lower bound the likelihood, where
make the bound as tight as possible:

DTC like uncertainty based correction KL between two multivariate Gaussians average of quadratic form

51 / 90

slide-65
SLIDE 65

Variational free-energy method (VFE)

lower bound the likelihood, where
make the bound as tight as possible: (DTC)

DTC like uncertainty based correction KL between two multivariate Gaussians average of quadratic form

52 / 90
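The collapsed bound referred to on the last few slides is only shown as an image; for a Gaussian likelihood with noise variance \sigma^2 it takes the standard Titsias (2009) form (a sketch in the assumed notation Q_{ff} = K_{fu} K_{uu}^{-1} K_{uf}):

\mathcal{F} = \log \mathcal{N}\!\big(\mathbf{y};\, \mathbf{0},\, \mathbf{Q}_{\mathbf{ff}} + \sigma^2\mathbf{I}\big) \;-\; \frac{1}{2\sigma^2}\,\mathrm{tr}\big(\mathbf{K}_{\mathbf{ff}} - \mathbf{Q}_{\mathbf{ff}}\big).

The first term is the DTC log marginal likelihood; the trace term is the uncertainty-based correction referred to above, which penalises pseudo-input configurations that fail to capture the prior variance of the training function values.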


slide-67
SLIDE 67

Summary of VFE method

  • optimisation of pseudo-point inputs: VFE has better guarantees than FITC
  • variational methods are known to underfit (and have other biases)
  • no augmentation required: the target is the posterior over functions, which includes the inducing variables

◮ pseudo-input locations are pure variational parameters (they do not parameterise the generative model as they do in FITC)

◮ coherent way of adding pseudo-data: more complex posteriors require more computational resources (more pseudo-points)

Rule of thumb: VFE returns better mean estimates; FITC returns better error-bar estimates.

How should we select M = number of pseudo-points?

54 / 90

slide-68
SLIDE 68

How do we select M = number of pseudo-data?

[Figure: 1-D regression dataset, y against x]

55 / 90


slide-70
SLIDE 70

How do we select M = number of pseudo-data?

[Figure: SMSE and compute time/s, alongside the 1-D regression dataset (y against x)]

57 / 90

slide-71
SLIDE 71

How do we select M = number of pseudo-data?

[Figure: SMSE and compute time/s; × = pseudo-data input locations]

58 / 90

slide-72
SLIDE 72

How do we select M = number of pseudo-data?

[Figure: SMSE and compute time/s for VFE (varying M) compared with the exact GP, on the 1-D regression dataset]

59 / 90


slide-85
SLIDE 85

Power Expectation Propagation and Gaussian Processes

72 / 90

slide-86
SLIDE 86

A Brief History of Gaussian Process Approximations

(recap: history diagram and references as on Slide 23)

73 / 90

slide-87
SLIDE 87

EP pseudo-point approximation

true posterior

74 / 90


slide-89
SLIDE 89

EP pseudo-point approximation

true posterior

marginal likelihood posterior

74 / 90

slide-90
SLIDE 90

EP pseudo-point approximation

true posterior approximate posterior

marginal likelihood posterior

74 / 90


slide-94
SLIDE 94

EP pseudo-point approximation

input locations of 'pseudo' data

  • outputs and covariance of 'pseudo' data

true posterior; approximate posterior

marginal likelihood; posterior

exact joint of new GP regression model

74 / 90

slide-95
SLIDE 95

EP algorithm

75 / 90

slide-96
SLIDE 96

EP algorithm

  • 1. remove

take out one pseudo-observation likelihood

cavity

75 / 90

slide-97
SLIDE 97

EP algorithm

  • 1. remove
  • 2. include

take out one pseudo-observation likelihood
add in one true observation likelihood

cavity; tilted

75 / 90

slide-98
SLIDE 98

EP algorithm

  • 1. remove
  • 2. include
  • 3. project

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family

cavity; tilted
KL between unnormalised stochastic processes

75 / 90

slide-99
SLIDE 99

EP algorithm

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted
KL between unnormalised stochastic processes

75 / 90

slide-100
SLIDE 100

EP algorithm

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted

  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

KL between unnormalised stochastic processes

75 / 90

slide-101
SLIDE 101

EP algorithm

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out one pseudo-observation likelihood
add in one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted

  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

KL between unnormalised stochastic processes rank 1

75 / 90
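Written as equations, the four steps are (a sketch in generic site notation, with q(f) the current Gaussian process approximation and t_n the n-th pseudo-observation likelihood, or 'site'; these symbols are assumed rather than taken from the slides):

\begin{align*}
\text{1. remove:} \quad & q_{\setminus n}(f) \propto q(f) / t_n(f) && \text{(cavity)}\\
\text{2. include:} \quad & \tilde{p}_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f) && \text{(tilted)}\\
\text{3. project:} \quad & q^{\mathrm{new}}(f) = \arg\min_{q' \in \text{Gaussians}} \mathrm{KL}\big(\tilde{p}_n(f) \,\|\, q'(f)\big) && \text{(moment matching)}\\
\text{4. update:} \quad & t_n(f) \leftarrow q^{\mathrm{new}}(f) / q_{\setminus n}(f)
\end{align*}

so that after the update the global approximation equals the projected distribution; because each step touches only one observation, the resulting update is rank 1.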

slide-102
SLIDE 102

A Brief History of Gaussian Process Approximations

(recap: history diagram and references as on Slide 23)

76 / 90

slide-103
SLIDE 103

Fixed points of EP = FITC approximation

(history diagram and references as on Slide 23)

77 / 90


slide-105
SLIDE 105

Fixed points of EP = FITC approximation

(history diagram and references as on Slide 23)

This interpretation resolves issues with FITC: why does it work so well? Are we allowed to increase M with N?

77 / 90

slide-106
SLIDE 106

EP algorithm

(EP algorithm recap, as on Slide 101)

78 / 90

slide-107
SLIDE 107

Power EP algorithm (as tractable as EP)

  • 1. remove
  • 2. include
  • 3. project
  • 4. update

take out a fraction of one pseudo-observation likelihood
add in a fraction of one true observation likelihood
project onto approximating family
update pseudo-observation likelihood

cavity; tilted

  • 1. minimum: moments matched at pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

KL between unnormalised stochastic processes rank 1

79 / 90
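The only change relative to EP is that a fraction \alpha of a likelihood is removed and included at a time (same assumed notation as the EP sketch above):

q_{\setminus n}(f) \propto q(f) / t_n(f)^{\alpha}, \qquad \tilde{p}_n(f) \propto q_{\setminus n}(f)\, p(y_n \mid f)^{\alpha}, \qquad t_n(f) \leftarrow t_n(f)^{1-\alpha}\, q^{\mathrm{new}}(f) / q_{\setminus n}(f).

As the unifying-framework result cited in this deck indicates, \alpha \to 0 recovers the VFE solution and \alpha = 1 recovers EP, whose fixed points coincide with FITC.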

slide-108
SLIDE 108

Power EP: a unifying framework

FITC: Csató and Opper, 2002; Snelson and Ghahramani, 2005
VFE: Titsias, 2009

80 / 90

slide-109
SLIDE 109

Power EP: a unifying framework

[Table: sparse GP approximation papers organised by inference scheme (PEP, VFE, EP, plus inter-domain and structured variants) for GP regression and GP classification; * = optimised pseudo-inputs, ** = structured versions of VFE recover VFE]

[4] Quiñonero-Candela et al., 2005
[5] Snelson et al., 2005
[6] Snelson, 2006
[7] Schwaighofer, 2002
[8] Titsias, 2009
[9] Csató, 2002
[10] Csató et al., 2002
[11] Seeger et al., 2003
[12] Naish-Guzman et al., 2007
[13] Qi et al., 2010
[14] Hensman et al., 2015
[15] Hernández-Lobato et al., 2016
[16] Matthews et al., 2016
[17] Figueiras-Vidal et al., 2009

81 / 90

slide-110
SLIDE 110

How should I set the power parameter α?

6 UCI classification datasets; 20 random splits; M = 10, 50, 100; hypers and inducing inputs optimised
8 UCI regression datasets; 20 random splits; M = 0–200; hypers and inducing inputs optimised

[Figure: MSE rank, error rank and log-loss rank as a function of α]

α = 0.5 does well on average

82 / 90

slide-111
SLIDE 111

References (hyperlinked)

Approximate inference in GPs:
  A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint 2016

Scalable approximate inference:
  Stochastic Expectation Propagation, NIPS 2015
  Black-box α-divergence Minimization, ICML 2016

Deep Gaussian Processes (incl. comparisons to Bayesian Neural Networks and GPs):
  Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016

83 / 90

slide-112
SLIDE 112

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

84 / 90

slide-113
SLIDE 113

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)

85 / 90

slide-114
SLIDE 114

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)
place GP prior over the non-linear function (smoothly wiggling functions expected)

86 / 90

slide-115
SLIDE 115

GP regression: introducing notation

  • Q1. What's the formal justification for how we were using GPs for regression?

generative model (like non-linear regression)
place GP prior over the non-linear function (smoothly wiggling functions expected)
sum of Gaussian variables = Gaussian: induces a GP over the observations

87 / 90
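The equations on this slide are not in the transcript; the generative model being described is, in standard notation (a sketch, symbols assumed):

y_t = f(x_t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2), \qquad f \sim \mathcal{GP}\big(0,\, k(x, x')\big),

and because a sum of Gaussian variables is Gaussian, the observations themselves form a GP: \mathbf{y} \sim \mathcal{GP}\big(0,\, k(x, x') + \sigma^2 \delta_{xx'}\big).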

slide-116
SLIDE 116

GP regression: introducing notation

  • Q3. How do we make predictions?

predictive mean

88 / 90

slide-117
SLIDE 117

GP regression: introducing notation

  • Q3. How do we make predictions?

predictive mean: linear in the data

89 / 90

slide-118
SLIDE 118

GP regression: introducing notation

  • Q3. How do we make predictions?

predictive mean: linear in the data
predictive covariance: predictive uncertainty = prior uncertainty − reduction in uncertainty (predictions more confident than prior)

90 / 90
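The predictive equations that the labels above annotate are, in standard notation (a sketch; K_{ff}, k_{*f} and k_{**} denote train-train, test-train and test-test kernel evaluations, symbols assumed):

\begin{align*}
\text{predictive mean:} \quad & m(x_*) = \mathbf{k}_{*\mathbf{f}}\big(\mathbf{K}_{\mathbf{ff}} + \sigma^2\mathbf{I}\big)^{-1}\mathbf{y} && \text{(linear in the data)}\\
\text{predictive covariance:} \quad & v(x_*) = k_{**} - \mathbf{k}_{*\mathbf{f}}\big(\mathbf{K}_{\mathbf{ff}} + \sigma^2\mathbf{I}\big)^{-1}\mathbf{k}_{\mathbf{f}*} && \text{(prior uncertainty minus reduction in uncertainty)}
\end{align*}

The subtracted term is non-negative, so the predictions are more confident than the prior, as the slide notes.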

slide-119
SLIDE 119

A brief introduction to the Kullback-Leibler divergence

KL(p1(z) || p2(z)) = Σ_z p1(z) log [ p1(z) / p2(z) ]

Important properties:

Gibbs' inequality: KL(p1(z) || p2(z)) ≥ 0, with equality at p1(z) = p2(z)

◮ proof via Jensen's inequality or differentiation (see MacKay pg. 35)

Non-symmetric: KL(p1(z) || p2(z)) ≠ KL(p2(z) || p1(z))

◮ hence named a divergence and not a distance

Example: binary variables z ∈ {0, 1}, with p(z = 1) = 0.8 and q(z = 1) = ρ

[Figure: KL(q || p) and KL(p || q) plotted against ρ; both vanish at ρ = 0.8]
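Worked out for this binary example (a direct instance of the definition above):

KL(q \| p) = \rho \log\frac{\rho}{0.8} + (1-\rho)\log\frac{1-\rho}{0.2}, \qquad KL(p \| q) = 0.8 \log\frac{0.8}{\rho} + 0.2 \log\frac{0.2}{1-\rho}.

Both vanish at \rho = 0.8, but KL(q \| p) stays bounded as \rho \to 0 or \rho \to 1, whereas KL(p \| q) diverges at both ends, which makes the asymmetry concrete.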

91 / 90