SLIDE 1

Doubly Stochastic Inference for Deep Gaussian Processes

Hugh Salimbeni, Department of Computing, Imperial College London, 29/5/2017


SLIDE 6

Motivation

§ DGPs promise much, but are difficult to train
§ Fully factorized VI doesn’t work well
§ We seek a variational approach that works and scales

Other recently proposed schemes [1, 2, 5] make additional approximations and require more machinery than VI.

Doubly Stochastic Inference for DGPs Hugh Salimbeni @Amazon Berlin, 29/5/2017

SLIDE 7

Talk outline

1. Summary: Model, Inference, Results
2. Details: Model, Inference, Results
3. Questions


SLIDE 10

Model

We use the standard DGP model, with one addition:

§ We include a linear (identity) mean function for all the internal layers (1D example in [4])


SLIDE 15

Inference

§ We use the model conditioned on the inducing points as a conditional variational posterior
§ We impose Gaussians on the inducing points (independent between layers but full rank within layers)
§ We use sampling to deal with the intractable expectation

We never compute N × N matrices (we make no additional simplifications to the variational posterior).


SLIDE 21

Results

§ We show significant improvement over single-layer models on large (~10^6) and massive (~10^9) data
§ Big jump in improvement over the single-layer GP with 5× the number of inducing points
§ On small data we never do worse than the single-layer model, and often better
§ We can get 98.1% on MNIST with only 100 inducing points
§ We surpass all permutation-invariant methods on rectangles-images (designed to test deep vs shallow architectures)
§ Identical model/inference hyperparameters for all our models


SLIDE 24

Details: The Model

We use the standard DGP model, with a linear mean function for all the internal layers:

§ If the dimensions agree, use the identity; otherwise PCA
§ Sensible alternative: initialize the latents to the identity (but the linear mean function works better)
§ Not-so-sensible alternative: random initialization. Doesn’t work well (the posterior is (very) multimodal)
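The dimension-matching rule above (identity when layer widths agree, PCA otherwise) can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's code; `linear_mean_matrix` is a hypothetical name.

```python
import numpy as np

def linear_mean_matrix(X, d_in, d_out):
    """Weights W for a linear mean function m(x) = x @ W between layers.

    Identity when the layer widths agree; otherwise project with the
    top principal directions of the data (PCA), giving a sensible
    dimension-matching linear map.
    """
    if d_in == d_out:
        return np.eye(d_in)
    # PCA via SVD of the centred inputs: rows of Vt are principal axes.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d_out].T  # shape (d_in, d_out)

X = np.random.randn(100, 5)
W_same = linear_mean_matrix(X, 5, 5)  # identity mean function
W_down = linear_mean_matrix(X, 5, 2)  # PCA projection to 2 dimensions
```

The PCA columns are orthonormal, so the projection preserves as much variance of the layer inputs as a linear map of that width can.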

SLIDE 25

The DGP: Graphical Model

[Figure: graphical model of a three-layer DGP. Inputs X (with inducing inputs Z^0) feed f^1 with inducing points u^1, producing hidden layer h^1 with noise ε; h^1 (with Z^1) feeds f^2 with u^2, producing h^2; h^2 (with Z^2) feeds f^3 with u^3, producing the outputs y.]

SLIDE 26

The DGP: Density

$$p\big(y, \{h^l, f^l, u^l\}_{l=1}^{L}\big) = \underbrace{\prod_{i=1}^{N} p(y_i \mid f_i^L)}_{\text{likelihood}} \times \underbrace{\prod_{l=1}^{L} p(h^l \mid f^l)\, p(f^l \mid u^l;\, h^{l-1}, Z^{l-1})\, p(u^l;\, Z^{l-1})}_{\text{DGP prior}}$$

SLIDE 27

Factorised Variational Posterior

[Figure: the DGP graphical model (left) and the fully factorised variational posterior (right): Gaussians N(u^l | m^l, S^l) on the inducing points and mean-field Gaussians ∏_i N(h_i^l | μ_i^l, (σ_i^l)^2) on the hidden layers.]

SLIDE 28

Our Variational Posterior

[Figure: the DGP graphical model (left) and our variational posterior (right): Gaussians N(u^l | m^l, S^l) on the inducing points, with the model's conditional structure and the layer noise ε retained.]


SLIDE 32

Recap: ‘GPs for Big Data’ [3]

$$q(f, u) = p(f \mid u; X, Z)\,\mathcal{N}(u \mid m, S)$$

Marginalise $u$ from the variational posterior:

$$\int p(f \mid u; X, Z)\,\mathcal{N}(u \mid m, S)\,du = \mathcal{N}(f \mid \mu, \Sigma) =: q(f \mid m, S; X, Z) \quad (1)$$

Define the following mean and covariance functions:

$$\mu_{m,Z}(x_i) = m(x_i) + \alpha(x_i)^T\big(m - m(Z)\big),$$
$$\Sigma_{S,Z}(x_i, x_j) = k(x_i, x_j) - \alpha(x_i)^T\big(k(Z,Z) - S\big)\,\alpha(x_j),$$

where $\alpha(x_i) = k(Z,Z)^{-1}k(Z, x_i)$. With these functions, $[\mu]_i = \mu_{m,Z}(x_i)$ and $[\Sigma]_{ij} = \Sigma_{S,Z}(x_i, x_j)$.
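The mean and covariance functions above translate directly into numpy. A minimal sketch, assuming an RBF kernel, a zero prior mean function, and a small jitter for numerical stability (all my choices, not fixed by the slide):

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix k(A, B)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def q_f(Xnew, Z, m, S):
    """q(f | m, S; Xnew, Z) = N(f | mu, Sigma), with u marginalised out.

    mu_i       = alpha(x_i)^T m                         (zero prior mean)
    Sigma_{ij} = k(x_i, x_j) - alpha(x_i)^T (k(Z,Z) - S) alpha(x_j)
    alpha(x_i) = k(Z,Z)^{-1} k(Z, x_i)
    """
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))  # jitter for stability
    A = np.linalg.solve(Kzz, rbf(Z, Xnew))   # columns are alpha(x_i)
    mu = A.T @ m
    Sigma = rbf(Xnew, Xnew) - A.T @ (Kzz - S) @ A
    return mu, Sigma

# Sanity check: with m = 0 and S = k(Z,Z) (jitter included),
# q(f) collapses back to the prior N(0, k(X,X)).
Z = np.linspace(-2, 2, 10)[:, None]
X = np.linspace(-1, 1, 5)[:, None]
mu, Sigma = q_f(X, Z, np.zeros(10), rbf(Z, Z) + 1e-6 * np.eye(10))
```

The sanity check at the end is a useful unit test for any SVGP implementation: setting the variational distribution equal to the prior at Z must recover the prior everywhere.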

SLIDE 33

Recap: ‘GPs for Big Data’ [3] cont.

Key idea: the $f_i$ marginals of $q(f, u) = p(f \mid u; X, Z)\,\mathcal{N}(u \mid m, S)$ depend only on the inputs $x_i$ (and the variational parameters).

SLIDE 34

Our Variational Posterior

[Figure: our variational posterior, repeated from Slide 28.]


SLIDE 37

Our approach

We can marginalise all the $u^l$ from our posterior. The result for the $l$th layer is $q(f^l \mid m^l, S^l;\, h^{l-1}, Z^{l-1})$. The fully coupled (both between and within layers) variational posterior is

$$\prod_{l=1}^{L} p(h^l \mid f^l)\, q(f^l \mid m^l, S^l;\, h^{l-1}, Z^{l-1})$$

SLIDE 38

But what about the ith marginals?

Since at each layer the $i$th marginal depends only on the $i$th component of the layer below, we have

$$q\big(\{f_i^l, h_i^l\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p(h_i^l \mid f_i^l)\, q(f_i^l \mid m^l, S^l;\, h_i^{l-1}, Z^{l-1}) \quad (2)$$

SLIDE 39

The lower bound

Since our variational posterior matches the model everywhere except at the inducing points, the bound is

$$\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(\{f_i^l, h_i^l\}_{l=1}^{L})}\big[\log p(y_i \mid f_i^L)\big] - \sum_{l=1}^{L} \mathrm{KL}\big(q(u^l)\,\|\,p(u^l)\big)$$

The analytic marginalisation of all the inner layers, $q(f_i^L)$, is intractable, but we can draw samples ancestrally.
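The KL term in the bound is between two multivariate Gaussians, and has the standard closed form; with a zero prior mean at the inducing inputs it is KL(N(m, S) || N(0, K_ZZ)). A sketch (the zero prior mean and the function name are my assumptions):

```python
import numpy as np

def gauss_kl(m, S, K):
    """KL( N(m, S) || N(0, K) ) for M-dimensional Gaussians."""
    M = len(m)
    trace_term = np.trace(np.linalg.solve(K, S))  # tr(K^{-1} S)
    mahalanobis = m @ np.linalg.solve(K, m)       # m^T K^{-1} m
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + mahalanobis - M + logdet_K - logdet_S)

# The KL vanishes exactly when q(u) matches the prior.
K = np.array([[2.0, 0.5], [0.5, 1.0]])
kl = gauss_kl(np.zeros(2), K.copy(), K)
```

In practice S is usually parameterised through its Cholesky factor so the log-determinant and positive-definiteness come for free; the dense version above keeps the formula visible.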


SLIDE 43

Sampling from the variational posterior

§ Each layer is Gaussian, given the layer below
§ We draw samples using unit Gaussians $\epsilon \sim \mathcal{N}(0, 1)$ at each layer
§ For $\hat{f}_i^l$, take the mean and variance from $q(f_i^l \mid m^l, S^l;\, \hat{h}_i^{l-1}, Z^{l-1})$:

$$\hat{f}_i^l = \mu_{m^l,Z^{l-1}}\big(\hat{h}_i^{l-1}\big) + \epsilon \odot \sqrt{\Sigma_{S^l,Z^{l-1}}\big(\hat{h}_i^{l-1}, \hat{h}_i^{l-1}\big)}$$

where for the first layer $\hat{h}_i^0 = x_i$
§ Just add noise to obtain the $\hat{h}_i^l$

The whole sampling process is differentiable with respect to the variational parameters.
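Putting the pieces together, ancestral sampling propagates each point through the layers, at every layer applying the reparameterisation above to the marginal mean and variance. A 1-D-layers sketch with an assumed RBF kernel and a zero prior mean inside each layer (the paper's linear mean function is omitted here for brevity; all names are illustrative):

```python
import numpy as np

def rbf(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

def layer_marginals(H, Z, m, S):
    """Per-point mean and variance of q(f_i | m, S; h_i, Z)."""
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
    A = np.linalg.solve(Kzz, rbf(Z, H))                  # alpha(h_i) in columns
    mu = A.T @ m                                         # zero prior mean
    var = 1.0 - np.einsum('mi,mn,ni->i', A, Kzz - S, A)  # k(h, h) = 1 for this RBF
    return mu, var

def sample_dgp(X, layers, noise=0.01, rng=np.random.default_rng(0)):
    """One ancestral sample: h^0_i = x_i, then f_hat = mu + eps * sqrt(var)."""
    H = X
    for (Z, m, S) in layers:
        mu, var = layer_marginals(H, Z, m, S)
        eps = rng.standard_normal(len(H))                      # unit Gaussians
        f_hat = mu + eps * np.sqrt(np.clip(var, 0.0, None))    # reparameterisation
        H = (f_hat + np.sqrt(noise) * rng.standard_normal(len(H)))[:, None]
    return H[:, 0]

Z = np.linspace(-2, 2, 8)[:, None]
layers = [(Z, np.zeros(8), 0.1 * np.eye(8)) for _ in range(3)]
y_sample = sample_dgp(np.linspace(-1, 1, 20)[:, None], layers)
```

Because the sample is a deterministic, differentiable function of (m^l, S^l) given eps, the Monte Carlo estimate of the bound admits low-variance reparameterisation gradients.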

SLIDE 44

The second source of stochasticity

§ We sample the bound in minibatches
§ Linear scaling in N
§ Can be used when only streaming is possible (> 50 GB datasets)
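Minibatching works because the data-fit term of the bound is a sum over points: a random batch rescaled by N/|B| is an unbiased estimate, and the KL term does not depend on the batch. A schematic sketch with a stand-in per-point log-likelihood function:

```python
import numpy as np

def minibatch_bound(per_point_loglik, X, y, kl_total, batch_size, rng):
    """Unbiased estimate of sum_i E[log p(y_i | f_i^L)] - sum_l KL."""
    N = len(X)
    idx = rng.choice(N, size=batch_size, replace=False)
    batch_sum = per_point_loglik(X[idx], y[idx]).sum()
    return (N / batch_size) * batch_sum - kl_total

# With a constant per-point term the estimate is exact:
# (1000 / 100) * (-100) - 3.0 = -1003.0
rng = np.random.default_rng(0)
X, y = np.zeros((1000, 1)), np.zeros(1000)
bound_est = minibatch_bound(lambda Xb, yb: -np.ones(len(Xb)), X, y,
                            kl_total=3.0, batch_size=100, rng=rng)
```

Only the current batch is ever in memory, which is what makes the streaming (> 50 GB) setting on the slide feasible.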


SLIDE 48

Inference recap

§ We use the full model as a variational posterior, conditioned on the inducing points
§ We use Gaussians for the inducing points
§ The lower bound requires only the posterior marginals
§ We estimate the bound by Monte Carlo, drawing samples from the posterior marginals

SLIDE 49

Results (1): UCI

SLIDE 50

Code Demo

https://github.com/ICL-SML/Doubly-Stochastic-DGP/blob/master/demos/demo_regression_UCI.ipynb

SLIDE 51

Results (2): Large and Massive Data

Test RMSE:

            N        D    SGP     SGP 500   DGP 2   DGP 3   DGP 4   DGP 5
  year      463810   90   10.67   9.89      9.58    8.98    8.93    8.87
  airline   700K     8    25.6    25.1      24.6    24.3    24.2    24.1
  taxi      1B       9    337.5   330.7     281.4   270.4   268.0   266.4

SLIDE 52

Thanks for listening

Questions?

SLIDE 53

References

[1] T. D. Bui, D. Hernández-Lobato, Y. Li, J. M. Hernández-Lobato, and R. E. Turner. Deep Gaussian Processes for Regression using Approximate Expectation Propagation. ICML, 2016.
[2] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Practical Learning of Deep Gaussian Processes via Random Fourier Features. arXiv preprint arXiv:1610.04386, 2016.
[3] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence, pages 282–290, 2013.
[4] M. Lázaro-Gredilla. Bayesian Warped Gaussian Processes. Advances in Neural Information Processing Systems, pages 1619–1627, 2012.
[5] Y. Wang, M. Brubaker, B. Chaib-Draa, and R. Urtasun. Sequential Inference for Deep Gaussian Process. Artificial Intelligence and Statistics, 2016.