Power Expectation Propagation for Deep Gaussian Processes


SLIDE 1

Power Expectation Propagation for Deep Gaussian Processes

  • Dr. Richard E. Turner (ret26@cam.ac.uk)

Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

with Thang Bui, Yingzhen Li, José Miguel Hernández-Lobato, Daniel Hernández-Lobato, Josiah Yan

1 / 32

SLIDE 6

Motivation: Gaussian Process regression

[Figure: training inputs and outputs, with the outputs at unseen test inputs marked "?"]

inference & learning: the intractabilities are both computational and analytic

2 / 32
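
As a reminder of where the two intractabilities come from (standard GP regression facts, not transcribed from the slide): with a Gaussian likelihood the posterior and marginal likelihood have closed forms but require O(N^3) computation, and with non-Gaussian likelihoods even those closed forms disappear.

```latex
% Exact GP regression with Gaussian noise, y_n = f(x_n) + eps_n, eps_n ~ N(0, sigma^2):
% predictive posterior at test inputs and the log marginal likelihood.
\begin{align}
p(\mathbf{f}_* \mid \mathbf{y})
  &= \mathcal{N}\!\big(\mathbf{f}_*;\,
     \mathbf{K}_{*f}(\mathbf{K}_{ff} + \sigma^2 \mathbf{I})^{-1}\mathbf{y},\,
     \mathbf{K}_{**} - \mathbf{K}_{*f}(\mathbf{K}_{ff} + \sigma^2 \mathbf{I})^{-1}\mathbf{K}_{f*}\big), \\
\log p(\mathbf{y})
  &= -\tfrac{1}{2}\,\mathbf{y}^{\top}(\mathbf{K}_{ff} + \sigma^2 \mathbf{I})^{-1}\mathbf{y}
     -\tfrac{1}{2}\log\big|\mathbf{K}_{ff} + \sigma^2 \mathbf{I}\big|
     -\tfrac{N}{2}\log 2\pi.
\end{align}
% The N x N inverse and determinant are the O(N^3) computational bottleneck;
% non-Gaussian likelihoods remove even these closed forms (the analytic problem).
```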

SLIDE 14

EP pseudo-point approximation

[Figure: the true posterior and marginal likelihood are approximated using a small set of 'pseudo' data, parameterised by the input locations of the 'pseudo' data together with their outputs and a covariance]

3 / 32
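
For orientation, here is the usual form of the pseudo-point posterior written in my own notation (an assumption about notation, not copied from the slide): the exact prior conditional is kept and the N likelihood terms are replaced by pseudo-observation factors on the inducing values u = f(z) at M pseudo-input locations z.

```latex
\begin{equation}
q(f) \;=\; p\!\left(f_{\neq u} \mid u\right) q(u),
\qquad
q(u) \;\propto\; p(u) \prod_{n=1}^{N} \tilde t_n(u).
\end{equation}
% Predictions and the approximate marginal likelihood then cost O(N M^2)
% rather than O(N^3), since M (the number of pseudo-points) is chosen << N.
```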

SLIDE 21

EP algorithm

  • 1. remove: take out one pseudo-observation likelihood (forming the cavity)
  • 2. include: add in one true observation likelihood (forming the tilted distribution)
  • 3. project: project onto the approximating family (KL between unnormalised stochastic processes)
  • 4. update: update the pseudo-observation likelihood (a rank-1 change)

Properties of the KL projection:
  • 1. at the minimum: moments are matched at the pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

4 / 32
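
The four steps translate almost line for line into code. The sketch below is a minimal, hedged toy version of the EP inner loop: a 1-D model with a Gaussian prior and generic non-Gaussian likelihood sites, using grid-based moment matching in place of the analytic moments and the GP pseudo-point structure used in the talk.

```python
import numpy as np
from scipy.stats import norm

# Minimal EP sketch for a 1-D toy model: p(x | data) ∝ N(x; 0, 1) * prod_n t_n(x),
# with non-Gaussian true sites t_n. Approximate sites are Gaussian, stored as
# natural parameters (precision lam, precision-times-mean eta).
def ep_1d(sites_true, n_sweeps=20, grid=np.linspace(-10.0, 10.0, 2001)):
    dx = grid[1] - grid[0]
    lam = np.zeros(len(sites_true))      # approximate-site precisions
    eta = np.zeros(len(sites_true))      # approximate-site precision * mean
    prior_lam, prior_eta = 1.0, 0.0      # N(0, 1) prior
    for _ in range(n_sweeps):
        for n, t_n in enumerate(sites_true):
            # 1. remove: delete site n from the posterior -> cavity
            cav_lam = prior_lam + lam.sum() - lam[n]
            cav_eta = prior_eta + eta.sum() - eta[n]
            # 2. include: tilted distribution = cavity * true site (on a grid)
            cavity = np.exp(-0.5 * cav_lam * grid**2 + cav_eta * grid)
            tilted = cavity * t_n(grid)
            tilted /= tilted.sum() * dx
            # 3. project: moment-match a Gaussian to the tilted distribution
            m = (grid * tilted).sum() * dx
            v = (((grid - m) ** 2) * tilted).sum() * dx
            # 4. update: new site = matched posterior / cavity (natural params)
            lam[n] = 1.0 / v - cav_lam
            eta[n] = m / v - cav_eta
    post_lam = prior_lam + lam.sum()
    post_eta = prior_eta + eta.sum()
    return post_eta / post_lam, 1.0 / post_lam   # posterior mean and variance

# usage: two probit-like likelihood sites
sites = [lambda x: norm.cdf(3.0 * (x - 1.0)), lambda x: norm.cdf(3.0 * (x + 0.5))]
print(ep_1d(sites))
```

The same remove/include/project/update pattern carries over to the GP case, where the sites live on the pseudo-points and each update is a rank-1 change.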

SLIDE 36

Fixed points of EP = FITC approximation

Csató & Opper (2002); Qi, Abdel-Gawad & Minka (2010)

[Figure: the EP fixed-point solution is equivalent to the FITC approximation]

  • This interpretation resolves philosophical issues with FITC (increase M with N)
  • FITC is known to overfit ⇒ EP over-estimates the marginal likelihood

8 / 32
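
For reference, the standard FITC definitions (well-known results, not transcribed from the figure): FITC replaces the exact GP prior over the training function values with a conditional that factorises given the inducing values, giving a low-rank-plus-diagonal covariance.

```latex
% Q_ff is the low-rank term built from the M pseudo-points:
\begin{align}
\mathbf{Q}_{ff} &= \mathbf{K}_{fu}\,\mathbf{K}_{uu}^{-1}\,\mathbf{K}_{uf}, \\
p_{\mathrm{FITC}}(\mathbf{f}) &= \mathcal{N}\!\big(\mathbf{f};\; \mathbf{0},\;
   \mathbf{Q}_{ff} + \mathrm{diag}(\mathbf{K}_{ff} - \mathbf{Q}_{ff})\big).
\end{align}
% The EP fixed points with pseudo-observation sites recover exactly this model
% (Csato & Opper 2002; Qi, Abdel-Gawad & Minka 2010), at O(N M^2) cost.
```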

SLIDE 38

Power EP algorithm (as tractable as EP)

  • 1. remove: take out a fraction of one pseudo-observation likelihood (forming the cavity)
  • 2. include: add in a fraction of one true observation likelihood (forming the tilted distribution)
  • 3. project: project onto the approximating family (KL between unnormalised stochastic processes)
  • 4. update: update the pseudo-observation likelihood (a rank-1 change)

Properties of the KL projection:
  • 1. at the minimum: moments are matched at the pseudo-inputs
  • 2. Gaussian regression: matches moments everywhere

10 / 32
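
Written out in my notation (a hedged summary of the undamped inner loop; damping and implementation details are omitted, and the cited papers should be consulted for the precise updates), the only change relative to EP is the power α on the removed and included factors:

```latex
\begin{align}
\text{remove:}\quad  & q^{\setminus n}(f) \;\propto\; q(f)\,\big/\,\tilde t_n(u)^{\alpha}, \\
\text{include:}\quad & \tilde p_n(f) \;\propto\; q^{\setminus n}(f)\, p(y_n \mid f)^{\alpha}, \\
\text{project:}\quad & q^{*}(f) \;=\; \operatorname{proj}\big[\tilde p_n(f)\big], \\
\text{update:}\quad  & \tilde t_n(u) \;\leftarrow\; \big(q^{*}(f)\,\big/\,q^{\setminus n}(f)\big)^{1/\alpha}.
\end{align}
% alpha = 1 recovers EP (and hence FITC at the fixed points); alpha -> 0
% recovers the variational free energy (VFE) approximation.
```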

SLIDE 39

Power EP: a unifying framework

  • FITC: Csató and Opper, 2002; Snelson and Ghahramani, 2005
  • VFE: Titsias, 2009

11 / 32

SLIDE 46

Power EP: a unifying framework

  • Approximate blocks of data: structured approximations
    PITC / BCM: Schwaighofer & Tresp, 2002; Snelson, 2006; VFE: Titsias, 2009
  • Place pseudo-data in a different space (pseudo-data in the new space): interdomain transformations (a linear transform)
    Figueiras-Vidal & Lázaro-Gredilla, 2009; Tobar et al., 2015; Matthews et al., 2016

12 / 32
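
A brief note on what 'interdomain' means here, using the standard construction (my summary, not spelled out on the slide): the inducing variables are linear functionals of the GP rather than point evaluations, so they remain jointly Gaussian with f and the same sparse machinery applies.

```latex
% An interdomain inducing variable u_m defined through a linear transform of f,
% with a window/weight function w_m chosen by the user:
\begin{align}
u_m &= \int w_m(x)\, f(x)\, \mathrm{d}x, \\
\operatorname{cov}\big(f(x), u_m\big) &= \int w_m(x')\, k(x, x')\, \mathrm{d}x', \qquad
\operatorname{cov}\big(u_m, u_{m'}\big) = \iint w_m(x)\, w_{m'}(x')\, k(x, x')\, \mathrm{d}x\, \mathrm{d}x'.
\end{align}
% Different choices of w_m place the 'pseudo-data' in a new domain (e.g. a
% frequency-like domain) while keeping all the conditionals Gaussian.
```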

SLIDE 47

Power EP: a unifying framework

[Table: sparse GP regression and GP classification methods (standard, inter-domain and structured variants) arranged by the divergence they correspond to (PEP, VFE or EP); * = optimised pseudo-inputs, ** = structured versions of VFE recover VFE]

Key to the references in the table:
[4] Quiñonero-Candela et al., 2005; [5] Snelson et al., 2005; [6] Snelson, 2006; [7] Schwaighofer, 2002; [8] Titsias, 2009; [9] Csató, 2002; [10] Csató et al., 2002; [11] Seeger et al., 2003; [12] Naish-Guzman et al., 2007; [13] Qi et al., 2010; [14] Hensman et al., 2015; [15] Hernández-Lobato et al., 2016; [16] Matthews et al., 2016; [17] Figueiras-Vidal et al., 2009

13 / 32

SLIDE 48

How should I set the power parameter α?

[Figure: critical-difference (CD) diagrams of average ranks over the tested values of α (0.05 to 1); Error and MLL ranks for classification, SMSE and SMLL ranks for regression; CD indicates a significant difference]

  • 6 UCI classification datasets, 20 random splits, M = 10, 50, 100, hypers and inducing inputs optimised
  • 8 UCI regression datasets, 20 random splits, M = 0-200, hypers and inducing inputs optimised

α = 0.5 does well overall

14 / 32

SLIDE 49

Deep Gaussian processes

  f_l ∼ GP(0, k(·, ·)),    y_n = g(x_n) = f_L(f_{L−1}(· · · f_2(f_1(x_n)))) + ε_n,
  h_{L−1,n} := f_{L−1}(· · · f_1(x_n)),    y_n = f_L(h_{L−1,n}) + ε_n

Deep GPs [Damianou and Lawrence (2013), unsupervised learning] are a multi-layer generalisation of Gaussian processes, equivalent to deep neural networks with infinitely wide hidden layers.

Questions:
  • How can inference and learning be performed tractably?
  • How do Deep GPs compare to alternatives, e.g. Bayesian neural networks?

[Graphical model: x_n → h_{1,n} → h_{2,n} → y_n through f_1, f_2, f_3, with a plate over the N data points]

15 / 32
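
To make the composition y_n = f_L(... f_1(x_n)) + ε_n concrete, here is a hedged toy sketch (my own illustrative code with made-up kernel hyper-parameters, not the model used later in the talk) that draws a single function from a two-layer deep GP prior by sampling each layer on a grid and feeding the draw into the next layer:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between 1-D input vectors a and b."""
    diff = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_gp(inputs, lengthscale, jitter=1e-6):
    """Draw one sample of f(inputs) with f ~ GP(0, k), via a Cholesky factor."""
    K = rbf(inputs, inputs, lengthscale) + jitter * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ np.random.randn(len(inputs))

x = np.linspace(-3.0, 3.0, 200)
h = sample_gp(x, lengthscale=1.5)        # hidden layer: h = f_1(x)
f = sample_gp(h, lengthscale=0.5)        # output layer evaluated at h: f_2(h)
y = f + 0.05 * np.random.randn(len(x))   # observations: y = f_2(f_1(x)) + noise
```

Even with stationary kernels in each layer, the composed draw f_2(f_1(x)) behaves like a sample from a non-stationary, input-warped GP, which is the automatic kernel-design argument made on the next slide.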

SLIDE 51

Pros and cons of Deep GPs

Why deep GPs? Because they are deep and nonparametric, they:
  • discover useful input warpings or dimensionality compression/expansion, i.e. automatic, nonparametric Bayesian kernel design,
  • give a non-Gaussian functional mapping g.

Drawbacks:
  • bottleneck in the hierarchy? need medium/high-dimensional hidden layers, skip links
  • too flexible?
  ◮ how to incorporate prior knowledge, e.g. invariance
  ◮ learnability/identifiability

16 / 32

SLIDE 54

Power EP for Deep GPs

Joint distribution:

  p(f_1) p(f_2) p(f_3) ∏_n p(y_n | h_{2,n}, f_3) p(h_{2,n} | h_{1,n}, f_2) p(h_{1,n} | f_1, x_n)

EP approximation:

  p(f_1) p(f_2) p(f_3) ∏_n s_{3,n}(h_{2,n}) t_{3,n}(u_3) r_{2,n}(h_{2,n}) s_{2,n}(h_{1,n}) t_{2,n}(u_2) × r_{1,n}(h_{1,n}) t_{1,n}(u_1)

Power EP:
  • initialise with all approximate factors = 1
  • incorporate p(h_{1,n} | f_1, x_n), update r_{1,n}(h_{1,n}) t_{1,n}(u_1)
  • incorporate p(h_{2,n} | h_{1,n}, f_2), update r_{2,n}(h_{2,n}) s_{2,n}(h_{1,n}) t_{2,n}(u_2)
  • incorporate p(y_n | h_{2,n}, f_3), update s_{3,n}(h_{2,n}) t_{3,n}(u_3)

Once again: the optimal Gaussian t_{m,n}(u_m) is rank 1; α → 0 recovers Damianou & Lawrence (2013)

17 / 32

SLIDE 55

Power EP for Deep GPs: three key additional ideas

  1. Reduce memory overhead: tie the factors, t_{m,n}(u_m) = t_m(u_m) (Stochastic Expectation Propagation for the inducing variables)
  2. Reduce message-passing overhead: only pass messages down from the inputs to the outputs (ADF for the hidden-unit activities)
  3. Improve hyper-parameter optimisation: optimise the EP energy log Z_EP directly using ADAM

18 / 32
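
Idea 1 is the stochastic EP (SEP) construction from the related-papers list on the final slide. In my notation (a hedged summary with the deep-GP layer indices dropped), a single shared site raised to the N-th power stands in for the N per-datapoint sites:

```latex
\begin{align}
q(u) &\propto p(u)\, t(u)^{N}
   && \text{one tied site: } O(M^2) \text{ memory instead of } O(N M^2), \\
q^{\setminus 1}(u) &\propto q(u) \,/\, t(u)
   && \text{cavity: remove a single copy of the tied site,} \\
q^{*}(u) &= \operatorname{proj}\big[\, q^{\setminus 1}(u)\, p(y_n \mid u)\,\big]
   && \text{moment-match the tilted distribution,} \\
t(u) &\leftarrow t(u)^{1 - 1/N} \big( q^{*}(u) \,/\, q^{\setminus 1}(u) \big)^{1/N}
   && \text{damped update: fold in } 1/N \text{ of the new site.}
\end{align}
```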

SLIDE 59

Training deep GPs

[Figure (shown over slides 19-22): a toy two-input problem, y = g(x1, x2) + noise, modelled by a two-layer deep GP in which hidden functions f11(x1, x2) and f12(x1, x2) feed an output function f2(f11, f12)]

22 / 32

SLIDE 60

Experiment: Value function of the mountain car problem

[Figure: the value function over (x1, x2), a GP fit, and a DGP fit]

23 / 32

SLIDE 61

Experiment: Comparison to Bayesian neural networks

We compared DGPs with GPs and Bayesian neural networks with one and two hidden layers using:
  • VI(G): Graves' VI [diagonal Gaussian, without the reparam. trick]
  • VI(KW): Kingma and Welling's VI [with the reparam. trick]
  • PBP: ADF with Probabilistic Backpropagation
  • Dropout: combining dropout predictions at test time
  • SGLD: Stochastic gradient Langevin dynamics
  • HMC: Hamiltonian Monte Carlo [only for small networks]

24 / 32

SLIDE 62

Experiment: Comparison to Bayesian neural networks

[Figure: critical-difference diagram of MLL average ranks of all methods across all datasets; methods include PBP-1, Dropout-1, SGLD-1, DGP-1 50, SGLD-2, VI(KW)-1, GP 50, DGP-3 100, DGP-2 100, DGP-3 50, DGP-2 50, GP 100, HMC-1, VI(KW)-2 and DGP-1 100]

25 / 32

SLIDE 67

Experiment: Comparison to Bayesian neural networks [Best results]

[Figure (shown over slides 26-30): average test log-likelihood/nats of the best BNN-deterministic, BNN-sampling, GP and DGP configurations on ten UCI regression datasets]

  dataset    N        D
  boston     506      13
  concrete   1030     8
  energy     768      8
  kin8nm     8192     8
  naval      11934    16
  power      9568     4
  protein    45730    9
  red wine   1588     11
  yacht      308      6
  year       515345   90

30 / 32

SLIDE 68

Experiment: Efficiency of organic photovoltaic molecules

  • Dataset: 50k/10k training/test points, 512-dim. binary input features
  • Need error-bars for active learning or Bayesian optimisation

[Figure: test MLL for BNN-VI, GP 200, GP 400, DGP-2 200 and DGP-5 200]

31 / 32

SLIDE 69

References (hyperlinked)

Core material:
  • A Unifying Framework for Sparse Gaussian Process Approximation using Power Expectation Propagation, arXiv preprint
  • Deep Gaussian Processes for Regression using Approximate Expectation Propagation, ICML 2016

Related papers:
  • Stochastic Expectation Propagation, NIPS 2015
  • Black-box α-divergence Minimization, ICML 2016

32 / 32