SLIDE 1

Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models

Yarin Gal • Mark van der Wilk • Carl E. Rasmussen

yg279@cam.ac.uk

June 25th, 2014

SLIDE 2

Outline

◮ Gaussian process regression and latent variable models
◮ Why do we want to scale these?
◮ Distributed inference
◮ Utility in scaling-up GPs
◮ New horizons in big data

SLIDE 3

GP regression & latent variable models

Gaussian processes (GPs) are a powerful tool for probabilistic inference over functions.

◮ GP regression captures non-linear functions
◮ Can be seen as an infinite limit of single-layer neural networks
◮ GP latent variable models are an unsupervised version of regression, used for manifold learning
◮ Can be seen as a non-linear generalisation of PCA

SLIDE 4

GP regression & latent variable models

GPs offer:

◮ uncertainty estimates,
◮ robustness to over-fitting,
◮ and principled ways for tuning hyper-parameters

SLIDE 5

GP latent variable models

GP latent variable models are used for tasks such as...

◮ Dimensionality reduction
◮ Face reconstruction
◮ Human pose estimation and tracking
◮ Matching silhouettes
◮ Animation deformation and segmentation
◮ WiFi localisation
◮ State-of-the-art results for face recognition


SLIDE 10

GP regression

Regression setting:

◮ Training dataset with N inputs X ∈ R^{N×Q} (Q-dimensional)
◮ Corresponding D-dimensional outputs Fn = f(Xn)
◮ We place a Gaussian process prior over the space of functions:

f ∼ GP(mean µ(x), covariance k(x, x′))

◮ This implies a joint Gaussian distribution over function values:

p(F|X) = N(F; µ(X), K), Kij = k(xi, xj)

◮ Y consists of noisy observations, making the functions F latent:

p(Y|F) = N(Y; F, β⁻¹ In)
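A minimal NumPy sketch of this model (illustrative, not the authors' code): it samples function values from a zero-mean GP prior with an RBF kernel (an assumed kernel choice), adds noise with precision β, and evaluates log p(Y|X), whose n × n inversion is the O(n³) cost discussed later.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # k(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sq_dist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

N, Q, beta = 50, 1, 100.0                 # number of points, input dimension, noise precision
X = np.random.randn(N, Q)                 # inputs

K = rbf_kernel(X, X) + 1e-8 * np.eye(N)   # Kij = k(xi, xj), with jitter for stability

# p(F|X) = N(F; 0, K): sample latent function values (zero mean for simplicity)
F = np.random.multivariate_normal(np.zeros(N), K)
# p(Y|F) = N(Y; F, beta^{-1} I_n): add observation noise
Y = F + np.random.randn(N) / np.sqrt(beta)

# log p(Y|X) = log N(Y; 0, K + beta^{-1} I_n): the O(n^3) inversion referred to later
C = K + np.eye(N) / beta
sign, logdet = np.linalg.slogdet(C)
log_marginal = -0.5 * (Y @ np.linalg.solve(C, Y) + logdet + N * np.log(2 * np.pi))
print(log_marginal)
```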

SLIDE 15

GP latent variable models

Latent variable models setting:

◮ Infer both the inputs, which are now latent, and the latent function mappings at the same time
◮ The model is identical to regression, with a prior over the now-latent X:

Xn ∼ N(Xn; 0, I), F(Xn) ∼ GP(0, k(X, X)), Yn ∼ N(Fn, β⁻¹ I)

◮ In approximate inference we look for a variational lower bound to:

p(Y) = ∫ p(Y|F) p(F|X) p(X) d(F, X)

◮ This leads to a Gaussian approximation to the posterior over X:

q(X) ≈ p(X|Y)
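A short NumPy sketch of this generative process (illustrative only; the kernel, dimensions, and noise level are assumptions, and the variational fit of q(X) is not implemented here):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    sq_dist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

N, Q, D, beta = 100, 2, 5, 50.0       # data points, latent dim, output dim, noise precision

# Xn ~ N(0, I): latent inputs
X = np.random.randn(N, Q)

# F(X) ~ GP(0, k): one independent GP draw per output dimension
K = rbf_kernel(X, X) + 1e-8 * np.eye(N)
L = np.linalg.cholesky(K)
F = L @ np.random.randn(N, D)

# Yn ~ N(Fn, beta^{-1} I): noisy observations -- only Y is observed in the GP-LVM
Y = F + np.random.randn(N, D) / np.sqrt(beta)

# Variational inference would now fit a Gaussian q(X) ≈ p(X|Y) together with the
# kernel hyper-parameters; that inference step is what the rest of the talk scales up.
```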

SLIDE 19

Why do we want to scale GPs?

◮ Naive models are often used with big data (linear regression, ridge regression, random forests, etc.)
◮ These don’t offer many of the desirable properties of GPs (non-linearity, robustness, uncertainty, etc.)
◮ Scaling GP regression and latent variable models allows for non-linear regression, density estimation, data imputation, dimensionality reduction, etc. on big datasets

SLIDE 20

However...

Problem – time and space complexity

◮ Evaluating p(Y|X) directly is an expensive operation
◮ Involves the inversion of the n × n matrix K
◮ Requires O(n³) time complexity

SLIDE 21

Sparse approximation!

Solution – sparse approximation!

◮ A collection of M “inducing inputs” – a set of points in the same input space with corresponding values in the output space
◮ These summarise the characteristics of the function using fewer points than the training data
◮ Given the dataset, we want to learn an optimal subset of inducing inputs
◮ Requires O(nm² + m³) time complexity

[Quiñonero-Candela and Rasmussen, 2005]
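For reference, a variational sparse GP of this kind can be fitted with GPy (the single-machine baseline compared against later in the talk); the snippet below is a sketch assuming a standard GPy installation, not the distributed implementation presented here.

```python
# Sketch: fitting a sparse GP with M inducing inputs in GPy (assumes GPy is installed).
import numpy as np
import GPy

N, Q, M = 5000, 1, 50                       # data size, input dim, number of inducing inputs
X = np.random.uniform(-3.0, 3.0, (N, Q))
Y = np.sin(X) + 0.1 * np.random.randn(N, 1)

kernel = GPy.kern.RBF(input_dim=Q)
# O(nm^2 + m^3) per evaluation of the variational bound, instead of O(n^3)
model = GPy.models.SparseGPRegression(X, Y, kernel=kernel, num_inducing=M)
model.optimize()                            # learns hyper-parameters and inducing inputs
print(model)
```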

SLIDE 25

Sparse approximation

Sparse approximation in pictures:

Regression on a 5,000-point dataset

SLIDE 26

Sparse approximation

Sparse approximation in pictures:

◮ We can summarise the data using a small number of points

Regression on a 500-point subset (in red)

SLIDE 27

Sparse approximation

Sparse approximation in pictures:

◮ We can summarise the data using a small number of points

Regression on a 50-point subset (in red)

SLIDE 28

Distributed inference

Distributed Inference in GPs

SLIDE 29

Why do we want distributed inference?

Usual datasets used with full GPs [O(n³)]

SLIDE 30

Why do we want distributed inference?

Usual datasets used with sparse GPs [O(nm² + m³), m ≪ n]

SLIDE 31

Why do we want distributed inference?

Big data

SLIDE 32

Why do we want distributed inference?

Distributed sparse GPs – O(nm²/T + m³) = O(n + m³) for T = m² nodes, m ≪ n

SLIDE 33

Distributing the inference

◮ The data points become independent of one another given the inducing inputs
◮ We can write the evidence lower bound as:

log p(Y) ≥ Σ_{i=1}^{n} ∫ q(u) q(Xi) p(Fi|Xi, u) log p(Yi|Fi) d(Fi, Xi, u) − KL(q(u)||p(u)) − KL(q(X)||p(X))

with inducing inputs u and approximating distributions q(·)

◮ We can analytically integrate out q(u) and still keep a factorised form
◮ We can compute each term in the factorised form independently of the others with the Map-Reduce framework (a schematic map/reduce sketch follows below)
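Because each data point contributes its own term to the sum above, the map step can compute per-chunk partial statistics and a cheap reduce step sums them. The sketch below shows only this map/reduce pattern with Python's multiprocessing; `partial_stats` is a hypothetical stand-in, not the actual bound terms from the paper.

```python
# Illustrative map/reduce skeleton for the factorised bound: each worker handles a
# chunk of data points and returns partial statistics; a reduce step sums them.
import numpy as np
from multiprocessing import Pool

def partial_stats(chunk):
    # Stand-in: in the real algorithm this would compute the chunk's contribution to
    # the bound and to the sufficient statistics involving the inducing inputs u.
    return np.sum(chunk**2), len(chunk)

if __name__ == "__main__":
    Y = np.random.randn(100000)
    T = 8                                            # number of nodes / workers
    chunks = np.array_split(Y, T)

    with Pool(T) as pool:
        partials = pool.map(partial_stats, chunks)   # map step, run in parallel

    total_stat = sum(p[0] for p in partials)         # reduce step (cheap, global)
    total_count = sum(p[1] for p in partials)
    print(total_stat, total_count)
```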

SLIDE 37

Map-Reduce framework

[http://mohamednabeel.blogspot.co.uk/]

SLIDE 38

Characteristics of distributed inference

The inference procedure should:

◮ distribute the computational load evenly across nodes,
◮ scale favourably with the number of nodes,
◮ and have low overhead in the global steps.

SLIDE 39

Characteristics of distributed inference

[Figure: thread execution time (s) per iteration – load balancing with 5 cores]

[Figure: thread execution time (s) per iteration – load balancing with 30 cores]

Distribution of computational load

SLIDE 40

Characteristics of distributed inference

[Figure: time per iteration (s) against dataset size (×10³) – time scaling with data, suggested inference vs. GPy]

[Figure: time per iteration (s) against the number of available cores]

Scalability with the number of nodes

SLIDE 41

Characteristics of distributed inference

[Figure: time per iteration (s) against the number of cores, log–log scale – time scaling with cores]

Negligible overhead in the global steps (constant time – O(m³))

SLIDE 42

Utility in scaling-up GPs

◮ We want to predict flight delays from various flight-record characteristics (flight date and time, flight distance, etc.)
◮ Can we improve on GP prediction using increasing amounts of data?
◮ We use different subset sizes of data: 7K, 70K, and 700K

SLIDE 43

Utility in scaling-up GPs

Size      7K      70K     700K
Dist GP   33.56   33.11   32.95

Root mean square error (RMSE) on the flight dataset, 7K–700K

◮ With more data we can learn better inducing inputs!

[Figure: ARD parameters for flight 700K – Year, Month, DayofMonth, DayOfWeek, DepTime, ArrTime, AirTime, Distance, plane_age]

SLIDE 44

Utility in scaling-up GPs

GP latent variable model on the full MNIST dataset (60K, 784 dim.):

◮ Used a density model for each digit (classification by the highest-density model is sketched below)
◮ No pre-processing (the model is non-specialised)
◮ Trained the models on 10K and on all 60K points

Size      10K     60K
Dist GP   8.98%   5.95%

Classification error on a subset and on the full MNIST

◮ Improvement of 3.03 percentage points
◮ Training on the full MNIST dataset took 20 minutes for the longest-running model
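Classification with one density model per digit amounts to scoring a test image under each of the ten models and picking the highest (approximate) log-likelihood. A toy sketch of that rule, where `log_density_under_model` is a hypothetical stand-in for evaluating the trained GP-LVM density:

```python
import numpy as np

def log_density_under_model(model, y):
    # Placeholder: in the real pipeline this would be the (approximate) marginal
    # log-likelihood of the test image y under the GP-LVM trained for one digit.
    mean, var = model                            # toy stand-in: diagonal Gaussian per digit
    return -0.5 * np.sum((y - mean) ** 2 / var + np.log(2 * np.pi * var))

def classify(digit_models, y):
    # Pick the digit whose density model assigns the highest log-density to y
    scores = [log_density_under_model(m, y) for m in digit_models]
    return int(np.argmax(scores))

# Toy usage: ten "models", one per digit, each summarised by a mean image and variances
digit_models = [(np.random.rand(784), np.ones(784)) for _ in range(10)]
test_image = np.random.rand(784)
print(classify(digit_models, test_image))
```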

SLIDE 49

New horizons in big data

But these models give us much more...

◮ The MNIST-trained models are density estimation models
◮ They allow us to perform image imputation,
◮ and to generate new digits by sampling from the posterior, etc.

SLIDE 50

New horizons in big data

Furthermore, real big data is complex and non-linear – and naive models may under-perform on it

◮ Back to flight regression –
◮ The Flight 2M dataset compared to approaches commonly used with big data:

Dataset     Mean    Linear   Ridge   RF      Dist GP
Flight 2M   38.92   37.65    37.65   37.33   35.31

RMSE of regression over flight data with 2M points

◮ These are just error rates – we can do much more with GPs
◮ GPs are robust, offer uncertainty bounds, etc.
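The baseline columns (Mean, Linear, Ridge, RF) correspond to standard regression models; a sketch of how such a comparison could be run with scikit-learn on synthetic data (the actual Flight 2M features and baseline settings are not given here, so everything below is illustrative):

```python
# Illustrative baseline comparison on synthetic data (not the Flight 2M experiment itself).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.randn(10000, 8)                              # stand-in for flight-record features
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.3 * rng.randn(10000)
X_train, X_test, y_train, y_test = X[:8000], X[8000:], y[:8000], y[8000:]

models = {
    "Mean": None,
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    if model is None:
        pred = np.full_like(y_test, y_train.mean())  # predict the training mean
    else:
        pred = model.fit(X_train, y_train).predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: RMSE = {rmse:.3f}")
```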

SLIDE 53

Conclusions

◮ We showed that the inference scales well with data and computational resources
◮ We demonstrated the utility in scaling GPs to big data
◮ The results show that GPs perform better than many common models often used for big data

SLIDE 54

Conclusions

◮ While developing the inference we wrote an introductory tutorial [Gal and van der Wilk, 2014] with detailed derivations
◮ The code developed is open source¹
◮ 300 lines of Python with detailed and documented examples
◮ Pointers between equations in the tutorial and in the code

¹ See https://github.com/markvdw/GParML