Doubly Stochastic Inference for Deep Gaussian Processes
Hugh Salimbeni
Department of Computing, Imperial College London
29/5/2017
Doubly Stochastic Inference for DGPs, Hugh Salimbeni, @Amazon Berlin, 29/5/2017

Motivation

§ DGPs promise much, but are difficult to train
§ Fully factorized VI doesn’t work well
§ We seek a variational approach that works and scales
§ We include a linear (identity) mean function for all the internal layers
§ We use the model conditioned on the inducing points as a variational posterior
§ We impose Gaussians on the inducing points (independent between layers)
§ We use sampling to deal with the intractable expectation
§ We show significant improvement over single layer models on large datasets
§ Big jump in improvement over the single layer GP with 5× the number of inducing points
§ On small data we never do worse than the single layer model
§ We can get 98.1% on MNIST with only 100 inducing points
§ We surpass all permutation invariant methods
§ Identical model/inference hyperparameters for all our models
§ If dimensions agree use the identity, otherwise PCA
§ Sensible alternative: initialize latents to identity (but linear mean …)
§ Not so sensible alternative: random. Doesn’t work well
[Figure: graphical model of a three-layer DGP: inputs X with inducing inputs Z0, Z1, Z2; GP layers f1, f2, f3 with inducing points u1, u2, u3; noisy hidden layers h1, h2 (noise ε); outputs y]
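As a concrete illustration of the model, here is a minimal numpy sketch (my own, not the talk’s code) of drawing one sample from a DGP prior: each layer is a GP sample evaluated at the noisy outputs of the layer below, with an identity mean function on the internal layers as described above. Kernel choice and hyperparameters are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel matrix k(A, B).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_dgp_prior(X, n_layers=2, noise_var=1e-2, jitter=1e-8, seed=0):
    # One draw from a DGP prior: each layer is a GP sample evaluated at the
    # noisy outputs h of the layer below (h^0 = X); internal layers use an
    # identity mean function, the output layer a zero mean.
    rng = np.random.default_rng(seed)
    h = X
    for l in range(n_layers):
        K = rbf(h, h) + jitter * np.eye(len(h))
        f = rng.multivariate_normal(np.zeros(len(h)), K)[:, None]
        if l < n_layers - 1:
            f = h + f  # identity mean function on internal layers
        h = f + np.sqrt(noise_var) * rng.standard_normal(f.shape)
    return f  # latent outputs f^L; the likelihood adds observation noise

X = np.linspace(-1, 1, 50)[:, None]
f_L = sample_dgp_prior(X)
print(f_L.shape)  # (50, 1)
```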
The joint density factorizes into the likelihood and the DGP prior (with h^0 = X):

$$ p\big(y, \{h^l\}_{l=1}^{L-1}, \{f^l, u^l\}_{l=1}^{L}\big) = \underbrace{\prod_{i=1}^{N} p\big(y_i \,\big|\, f^L_i\big)}_{\text{likelihood}} \times \underbrace{\prod_{l=1}^{L} p\big(f^l \,\big|\, u^l; h^{l-1}\big)\, p\big(u^l\big) \prod_{l=1}^{L-1} p\big(h^l \,\big|\, f^l\big)}_{\text{DGP prior}} $$
[Figure: left, the DGP model; right, the fully factorized variational posterior, which places Gaussians on the inducing points and factorizes the hidden layers across data points]

$$ q = \prod_{l=1}^{3} N\big(u^l \,\big|\, m^l, S^l\big) \prod_i N\big(h^1_i \,\big|\, \mu^1_i, (\sigma^1_i)^2\big) \prod_i N\big(h^2_i \,\big|\, \mu^2_i, (\sigma^2_i)^2\big) $$
[Figure: left, the DGP model; right, our variational posterior, which keeps the model conditioned on the inducing points (including the layer noise ε) and places Gaussians N(u^l | m^l, S^l) on the inducing points]
Our variational posterior keeps the exact model conditionals and replaces the inducing-point prior with Gaussians:

$$ q\big(\{f^l, u^l\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p\big(f^l \,\big|\, u^l; h^{l-1}\big)\, N\big(u^l \,\big|\, m^l, S^l\big) $$
Marginalizing the inducing points, the posterior marginals factorize across data points:

$$ q\big(\{f^l_i, h^l_i\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p\big(h^l_i \,\big|\, f^l_i\big)\, q\big(f^l_i \,\big|\, m^l, S^l; h^{l-1}_i\big) $$
The evidence lower bound is

$$ \mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(\{f^l_i, h^l_i\}_{l=1}^{L})}\big[\log p\big(y_i \,\big|\, f^L_i\big)\big] - \sum_{l=1}^{L} \mathrm{KL}\big[q(u^l)\,\big\|\,p(u^l)\big] $$

The expectation of the log-likelihood term is intractable, so we estimate it by sampling; the KL terms between Gaussians are analytic.
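Since both q(u^l) and p(u^l) are multivariate Gaussians, the KL terms in the bound have a closed form. A small numpy sketch (illustrative, not the talk’s code) for KL[N(m, S) || N(0, K)]:

```python
import numpy as np

def kl_gauss(m, S, K):
    # KL[ N(m, S) || N(0, K) ] for M-dimensional Gaussians, in closed form:
    # 0.5 * ( tr(K^{-1} S) + m^T K^{-1} m - M + log|K| - log|S| )
    M = len(m)
    Kinv = np.linalg.inv(K)
    return 0.5 * (np.trace(Kinv @ S) + m @ Kinv @ m - M
                  + np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1])

# Sanity check: the KL is zero when the two distributions coincide.
K = np.array([[2.0, 0.5], [0.5, 1.0]])
print(round(kl_gauss(np.zeros(2), K, K), 10))  # 0.0
```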
§ Each layer is Gaussian, given the layer below
§ We draw samples using unit Gaussians ε ∼ N(0, 1) at each layer
§ For f̂, the mean and variance come from q(f^l_i | m^l, S^l; ĥ^{l-1}_i), evaluated at the sample from the layer below:

$$ \hat{f}^l_i = \mu_{m^l, Z^{l-1}}\big(\hat{h}^{l-1}_i\big) + \epsilon \odot \sqrt{\Sigma_{S^l, Z^{l-1}}\big(\hat{h}^{l-1}_i\big)}\,, \qquad \hat{h}^0_i = x_i $$

§ Just add noise for the ĥ
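The layer-by-layer reparameterized sampling can be sketched generically. Here `mean_fns` and `var_fns` are hypothetical stand-ins for the per-layer predictive mean μ_{m,Z} and variance Σ_{S,Z} (the toy callables below are assumptions, not the sparse-GP expressions):

```python
import numpy as np

def propagate(x, mean_fns, var_fns, rng):
    # Reparameterized sample of the final-layer function values for one input:
    # at each layer, f_hat = mu(h_hat) + eps * sqrt(var(h_hat)), with
    # h_hat^0 = x. (The slides' noisy layers would also add noise from
    # p(h | f) between layers; omitted here for brevity.)
    h = x
    for mu, var in zip(mean_fns, var_fns):
        eps = rng.standard_normal(h.shape)  # unit Gaussian per layer
        h = mu(h) + eps * np.sqrt(var(h))
    return h

rng = np.random.default_rng(0)
# Toy two-layer stand-ins with input-dependent means and constant variances.
mean_fns = [np.sin, lambda h: 0.5 * h]
var_fns = [lambda h: 0.1 * np.ones_like(h), lambda h: 0.05 * np.ones_like(h)]
sample = propagate(np.array([0.3, -1.2]), mean_fns, var_fns, rng)
print(sample.shape)  # (2,)
```

Because the randomness enters only through the unit Gaussians ε, gradients of the bound flow through the sampled values.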
§ We sample the bound in minibatches
§ Linear scaling in N
§ Can be used when only streaming is possible (> 50GB datasets)
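The minibatch estimator simply rescales a batch sum by N / B; its expectation over uniformly drawn batches equals the full sum, which is what gives unbiased gradients with per-step cost independent of N. A small numpy check (illustrative, not the talk’s code): with B = 1, averaging the estimate N · term_i over every possible choice of i recovers the full sum exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
terms = rng.standard_normal(N)      # stand-in for per-datapoint ELBO terms
full_sum = terms.sum()

# B = 1 minibatch estimate for each possible batch, averaged over all batches:
avg_estimate = np.mean([N * terms[i] for i in range(N)])
print(np.isclose(avg_estimate, full_sum))  # True
```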
§ We use the full model as a variational posterior, conditioned on the inducing points
§ We use Gaussians for the inducing points
§ The lower bound requires only the posterior marginals
§ We can take samples from the posterior marginals using a Monte Carlo method
Results on large-scale regression datasets (SGP: single-layer sparse GP; SGP 500: with 500 inducing points; DGP L: deep GP with L layers; lower is better):

          N        D    SGP     SGP 500   DGP 2   DGP 3   DGP 4   DGP 5
year      463810   90   10.67   9.89      9.58    8.98    8.93    8.87
airline   700K     8    25.6    25.1      24.6    24.3    24.2    24.1
taxi      1B       9    337.5   330.7     281.4   270.4   268.0   266.4