SLIDE 1

Lecture 22

Computational Methods for GPs

Colin Rundel 04/12/2017


SLIDE 2

GPs and Computational Complexity


SLIDE 3

The problem with GPs

Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process y ∼ N(µ, Σ):

Want to sample y?

  µ + Chol(Σ) × Z with Zi ∼ N(0, 1)    O(n³)

Evaluate the (log) likelihood?

  −1/2 log |Σ| − 1/2 (x − µ)′ Σ⁻¹ (x − µ) − n/2 log 2π    O(n³)

Update covariance parameter?

  {Σ}ij = σ² exp(−{d}ij ϕ) + σₙ² 1i=j    O(n²)
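As a rough illustration (a minimal NumPy sketch, not code from the lecture), the snippet below shows where each of these costs appears: the O(n²) covariance construction, the O(n³) Cholesky used for sampling, and the O(n³) likelihood evaluation. The function name, covariance form, and parameter values are placeholders.

```python
# A minimal sketch (not from the lecture) of where the costs arise for a GP
# with an exponential covariance; names and parameter values are placeholders.
import numpy as np

def gp_demo(coords, mu, sigma2=1.0, phi=9.0, nugget=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = coords.shape[0]

    # "Update covariance parameter": build {Sigma}_ij element-wise, O(n^2)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    Sigma = sigma2 * np.exp(-d * phi) + nugget * np.eye(n)

    # "Sample y": mu + Chol(Sigma) Z, dominated by the O(n^3) Cholesky
    L = np.linalg.cholesky(Sigma)
    y = mu + L @ rng.standard_normal(n)

    # "Evaluate the (log) likelihood", also O(n^3) via the Cholesky factor
    alpha = np.linalg.solve(L, y - mu)            # L^{-1} (y - mu)
    loglik = (-np.sum(np.log(np.diag(L)))         # -1/2 log|Sigma|
              - 0.5 * alpha @ alpha               # -1/2 (y - mu)' Sigma^{-1} (y - mu)
              - 0.5 * n * np.log(2 * np.pi))
    return y, loglik
```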

SLIDE 7

A simple guide to computational complexity

  • O(n) - Linear complexity - Go for it
  • O(n²) - Quadratic complexity - Pray
  • O(n³) - Cubic complexity - Give up

SLIDE 13

How bad is the problem?

[Figure: time (secs) vs n (2500 to 10000) for matrix inversion via Cholesky, LU, and QR decompositions.]

SLIDE 14

Practice - Migratory Model Prediction

After fitting the GP we need to sample from the posterior predictive distribution at ∼ 3000 locations,

  yp ∼ N( µp + Σpo Σo⁻¹ (yo − µo),  Σp − Σpo Σo⁻¹ Σop )

  Step                                  CPU (secs)   CPU+GPU (secs)   Rel. Performance
  1. Calc. Σp, Σpo                           1.080            0.046               23.0
  2. Calc. chol(Σp − Σpo Σo⁻¹ Σop)           0.467            0.208                2.3
  3. Calc. µp|o + chol(Σp|o) × Z             0.049            0.052                0.9
  4. Calc. Allele Prob                       0.129            0.127                1.0
  Total                                      1.732            0.465                3.7

Total run time: CPU (28.9 min), CPU+GPU (7.8 min)
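A minimal NumPy sketch of this conditional sampling step (illustrative only; the function and argument names are stand-ins for the observed/predicted covariance blocks from a fitted model):

```python
# Sketch of sampling from the GP posterior predictive distribution,
#   yp ~ N( mup + Spo So^{-1} (yo - muo), Sp - Spo So^{-1} Sop ),
# given covariance blocks So (obs x obs), Sp (pred x pred), Spo (pred x obs).
import numpy as np

def sample_predictive(yo, muo, mup, So, Sp, Spo, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Solve against So rather than forming its inverse explicitly
    So_inv_Sop = np.linalg.solve(So, Spo.T)              # So^{-1} Sop
    mu_cond = mup + Spo @ np.linalg.solve(So, yo - muo)  # conditional mean
    Sigma_cond = Sp - Spo @ So_inv_Sop                   # conditional covariance
    # Small jitter on the diagonal for numerical stability of the Cholesky
    L = np.linalg.cholesky(Sigma_cond + 1e-10 * np.eye(len(mup)))
    return mu_cond + L @ rng.standard_normal(len(mup))
```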

SLIDE 16

Cholesky CPU vs GPU (P100)

[Figure: time (secs) vs n (2500 to 10000) for Cholesky, LU, and QR based inversion, comparing CPU and GPU (P100) implementations.]

SLIDE 17

[Figure: time (secs, log scale) vs n for Cholesky, LU, and QR based inversion, CPU vs GPU.]

SLIDE 18

Relative Performance

[Figure: relative performance vs n for Cholesky, LU, and QR based inversion.]

SLIDE 19

Aside - Matrix Multiplication

[Figure: matrix multiplication time (secs) vs n, CPU vs GPU.]

SLIDE 20

An even bigger hammer

bigGP is an R package written by Chris Paciorek (UC Berkeley), et al.

  • Specialized distributed implementation of linear algebra operations for GPs
  • Designed to run on large supercomputer clusters
  • Uses both shared and distributed memory
  • Able to fit models on the order of n = 65k (32 GB covariance matrix)

[Figure: Cholesky decomposition execution time (seconds, log scale) vs matrix dimension n (log scale) for 6, 60, 816, 12480, and 49920 cores.]

SLIDE 21

More scalable solutions?

  • Spectral domain / basis functions
  • Covariance tapering
  • GMRF approximations
  • Low-rank approximations
  • Nearest-neighbor models


SLIDE 22

Low Rank Approximations


SLIDE 23

Low rank approximations in general

Let's look at the example of the singular value decomposition of a matrix,

  M_{n×m} = U_{n×n} diag(S)_{n×m} Vᵗ_{m×m}

where U are called the left singular vectors, V the right singular vectors, and S the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order.

It turns out (Eckart–Young theorem) that we can approximate M as having rank k by defining S̃ to only have the k largest singular values (the others set to zero).

  M̃_{n×m} = U_{n×n} diag(S̃)_{n×m} Vᵗ_{m×m} = Ũ_{n×k} diag(S̃)_{k×k} Ṽᵗ_{k×m}
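A small NumPy sketch of this truncation (a generic illustration, not code from the lecture):

```python
# Rank-k approximation of a matrix via the truncated SVD (Eckart-Young).
import numpy as np

def low_rank_svd(M, k):
    # Full SVD: M = U diag(S) V^t, singular values returned in descending order
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only the k largest singular values and the corresponding vectors
    return U[:, :k], S[:k], Vt[:k, :]

def reconstruct(U, S, Vt):
    # Rebuild the (low rank) matrix from its factors
    return U @ np.diag(S) @ Vt
```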

SLIDE 25

Example

  M = [ 1.000  0.500  0.333  0.250
        0.500  0.333  0.250  0.200
        0.333  0.250  0.200  0.167
        0.250  0.200  0.167  0.143 ]  =  U diag(S) Vᵗ

  U = V = [ −0.79   0.58  −0.18  −0.03
            −0.45  −0.37   0.74   0.33
            −0.32  −0.51  −0.10  −0.79
            −0.25  −0.51  −0.64   0.51 ]

  S = ( 1.50  0.17  0.01  0.00 )

Rank 2 approximation, using the first two columns of U and V and the two largest singular values:

  M̃ = Ũ diag(1.50, 0.17) Ṽᵗ
    = [ 1.000  0.501  0.333  0.249
        0.501  0.330  0.251  0.203
        0.333  0.251  0.200  0.166
        0.249  0.203  0.166  0.140 ]
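Using the low_rank_svd and reconstruct sketches from above, this example can be reproduced; the 4×4 matrix here is built as Mij = 1/(i + j − 1), which matches the entries shown on the slide.

```python
import numpy as np

# The example matrix: M_ij = 1 / (i + j - 1), i, j = 1..4
M = 1.0 / (np.arange(1, 5)[:, None] + np.arange(1, 5)[None, :] - 1)

U, S, Vt = low_rank_svd(M, k=2)        # defined in the previous sketch
M2 = reconstruct(U, S, Vt)             # rank-2 approximation

print(np.round(S, 2))                  # approximately [1.50, 0.17]
print(np.round(M2, 3))                 # matches the rank-2 matrix above
print(np.linalg.norm(M - M2, "fro"))   # Frobenius-norm error of the approximation
```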

SLIDE 27

Approximation Error

We can measure the error of the approximation using the Frobenius norm,

  ∥M − M̃∥F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} (Mij − M̃ij)² )^{1/2}

Strong dependence (large eff. range):

[Figure: SVD singular values (left) and low-rank SVD error in Frobenius norm vs rank (right), strong dependence case.]

SLIDE 29

Weak dependence (short eff. range):

[Figure: SVD singular values (left) and low-rank SVD error in Frobenius norm vs rank (right), weak dependence case.]

SLIDE 30

How does this help? (Sherman-Morrison-Woodbury)

There is an immensely useful linear algebra identity, the Sherman-Morrison-Woodbury formula, for the inverse (and determinant) of a decomposed matrix,

  M̃⁻¹ = ( A_{n×n} + U_{n×k} S_{k×k} Vᵗ_{k×n} )⁻¹
      = A⁻¹ − A⁻¹ U ( S⁻¹ + Vᵗ A⁻¹ U )⁻¹ Vᵗ A⁻¹.

How does this help?

  • Imagine that A = diag(A), then it is trivial to find A⁻¹.
  • S⁻¹ is k × k which is hopefully small, or even better S = diag(S).
  • (S⁻¹ + Vᵗ A⁻¹ U) is k × k which is hopefully small.
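A hedged NumPy sketch of using the identity to solve a linear system, assuming a diagonal A (stored as a vector) and a low rank U S Vᵗ term; this is a generic illustration, not the lecture's code.

```python
# Solve (A + U S V^t) x = b with the Sherman-Morrison-Woodbury identity,
# where A is diagonal and U, V are n x k with k << n.
import numpy as np

def woodbury_solve(A_diag, U, S, Vt, b):
    Ainv_b = b / A_diag                       # A^{-1} b, O(n)
    Ainv_U = U / A_diag[:, None]              # A^{-1} U, O(nk)
    small = np.linalg.inv(S) + Vt @ Ainv_U    # k x k matrix
    correction = Ainv_U @ np.linalg.solve(small, Vt @ Ainv_b)
    return Ainv_b - correction                # (A + U S V^t)^{-1} b in O(n k^2)
```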

SLIDE 32

Aside - Determinant

Remember for any MVN distribution when evaluating the likelihood,

  −1/2 log |Σ| − 1/2 (x − µ)′ Σ⁻¹ (x − µ) − n/2 log 2π

we need the inverse of Σ as well as its determinant.

  • For a full rank Cholesky decomposition we get the determinant for "free",

      |M| = |L Lᵗ| = ∏_{i=1}^{n} (diag(L)i)²

  • For a low rank approximation the Sherman-Morrison-Woodbury / determinant lemma gives us,

      det(M̃) = det(A + U S Vᵗ) = det(S⁻¹ + Vᵗ A⁻¹ U) det(S) det(A)
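Continuing the sketch above, a hedged illustration of both determinant routes (again assuming a diagonal A; these helper names are my own, not the lecture's):

```python
# Log-determinants for the MVN likelihood: the full rank route via Cholesky
# and the low rank route via the matrix determinant lemma.
import numpy as np

def logdet_cholesky(M):
    L = np.linalg.cholesky(M)
    return 2.0 * np.sum(np.log(np.diag(L)))            # log|M| = sum of 2 log L_ii

def logdet_lowrank(A_diag, U, S, Vt):
    small = np.linalg.inv(S) + Vt @ (U / A_diag[:, None])
    _, ld_small = np.linalg.slogdet(small)              # log det(S^{-1} + V^t A^{-1} U)
    _, ld_S = np.linalg.slogdet(S)
    return ld_small + ld_S + np.sum(np.log(A_diag))     # log det(A + U S V^t)
```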

SLIDE 35

Low rank approximations for GPs

For a standard spatial random effects model,

  y(s) = x(s) β + w(s) + ϵ,    ϵ ∼ N(0, τ² I)

  w(s) ∼ N(0, Σ(s)),    Σ(s, s′) = σ² ρ(s, s′|θ)

if we can replace Σ(s) with a low rank approximation of the form Σ(s) ≈ U S Vᵗ where

  • U and V are n × k,
  • S is k × k, and
  • A = τ² I or a similar diagonal matrix,

then the Sherman-Morrison-Woodbury identity and the determinant lemma give the inverse and determinant of A + U S Vᵗ cheaply (a sketch combining these pieces follows below).
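Putting the pieces together, a hedged sketch of an O(nk²) marginal log likelihood for this kind of model, assuming we already have a symmetric low rank factor U S Uᵗ and reusing the woodbury_solve and logdet_lowrank helpers sketched above (all names are my own):

```python
# O(n k^2) Gaussian log likelihood for y ~ N(mu, tau2 I + U S U^t),
# using Woodbury for the quadratic form and the determinant lemma for log|Sigma|.
import numpy as np

def lowrank_gp_loglik(y, mu, tau2, U, S):
    n = len(y)
    A_diag = np.full(n, tau2)
    r = y - mu
    quad = r @ woodbury_solve(A_diag, U, S, U.T, r)   # r' Sigma^{-1} r
    logdet = logdet_lowrank(A_diag, U, S, U.T)        # log|Sigma|
    return -0.5 * (logdet + quad + n * np.log(2 * np.pi))
```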

SLIDE 36

Predictive Processes


SLIDE 37

Gaussian Predictive Processes

For a rank k approximation,

  • Pick k knot locations s⋆
  • Calculate the knot covariance, Σ(s⋆), and the knot cross-covariance, Σ(s, s⋆)
  • Approximate the full covariance using

      Σ(s) ≈ Σ(s, s⋆)_{n×k} Σ(s⋆)⁻¹_{k×k} Σ(s⋆, s)_{k×n}.

  • PPs systematically underestimate the variance (σ²) and inflate τ²; the modified predictive process corrects this (a sketch of both follows below) using

      Σ(s) ≈ Σ(s, s⋆) Σ(s⋆)⁻¹ Σ(s⋆, s) + diag( Σ(s) − Σ(s, s⋆) Σ(s⋆)⁻¹ Σ(s⋆, s) ).

Banerjee, Gelfand, Finley, Sang (2008); Finley, Sang, Banerjee, Gelfand (2008)
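A minimal NumPy sketch of the predictive process and modified predictive process covariance approximations, assuming a generic covariance function cov(X1, X2) supplied by the user (an illustration, not the authors' implementation):

```python
# Predictive process (PP) and modified PP approximations to Sigma(s),
# given knot locations s_star and a covariance function cov(X1, X2).
import numpy as np

def predictive_process_cov(cov, s, s_star, modified=True, jitter=1e-8):
    C_ss = cov(s_star, s_star) + jitter * np.eye(len(s_star))   # Sigma(s*), k x k
    C_xs = cov(s, s_star)                                        # Sigma(s, s*), n x k
    # PP approximation: Sigma(s, s*) Sigma(s*)^{-1} Sigma(s*, s)
    Sigma_pp = C_xs @ np.linalg.solve(C_ss, C_xs.T)
    if not modified:
        return Sigma_pp
    # Modified PP: add back the diagonal the PP approximation loses
    # (only the diagonal of Sigma(s) is actually needed here)
    Sigma_full_diag = np.diag(cov(s, s))
    return Sigma_pp + np.diag(Sigma_full_diag - np.diag(Sigma_pp))
```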

SLIDE 38

Example

Below we have a surface generated from a squared exponential Gaussian process where

  {Σ}ij = σ² exp(−(ϕ {d}ij)²) + τ² 1i=j

  σ² = 1,  ϕ = 9,  τ² = 0.1

[Figure: the simulated surfaces on the unit square, color scale approximately −3 to 3.]

SLIDE 39

Predictive Process Model Results

[Figure: fitted surfaces on the unit square for the True Field, the Full GP, the predictive process with 5 x 5, 10 x 10, and 15 x 15 knots, and the modified predictive process with 5 x 5, 10 x 10, and 15 x 15 knots.]

SLIDE 40

Performance

[Figure: fitting time vs error for the Full GP, PP, and Mod. PP models with 25, 100, and 225 knots.]

SLIDE 41

Parameter Estimates

[Figure: estimates of phi, sigma.sq, and tau.sq for the Full GP, PP, and Mod. PP models with 25, 100, and 225 knots, compared against the true parameter values.]

SLIDE 42

Random Projections


SLIDE 43

Low Rank Approximations via Random Projections

  1. Starting with an m × n matrix A.
  2. Draw an n × (k + p) Gaussian random matrix Ω.
  3. Form Y = A Ω and compute its QR factorization Y = Q R.
  4. Form the (k + p) × n matrix B = Q′ A.
  5. Compute the SVD of the small matrix B, B = Û S V′.
  6. Form the matrix U = Q Û.

The resulting approximation (sketched in code below) has a bounded expected error,

  E ∥A − U S V′∥ ≤ [ 1 + (4 √(k + p) / (p − 1)) √min(m, n) ] σ_{k+1}.

Halko, Martinsson, Tropp (2011)
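A hedged NumPy sketch of this procedure (a generic randomized SVD, not the authors' code):

```python
# Randomized low rank SVD (after Halko, Martinsson, Tropp 2011),
# target rank k with oversampling p.
import numpy as np

def randomized_svd(A, k, p=10, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))      # n x (k+p) Gaussian test matrix
    Y = A @ Omega                                # m x (k+p) sample of the range of A
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for range(Y)
    B = Q.T @ A                                  # (k+p) x n projected matrix
    U_hat, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat                                # lift back to the original space
    return U[:, :k], S[:k], Vt[:k, :]
```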

SLIDE 44

Random Matrix Low Rank Approximations and GPs

The preceding algorithm can be modified slightly to take advantage of the positive definite structure of a covariance matrix.

  1. Starting with an n × n covariance matrix A.
  2. Draw an n × (k + p) Gaussian random matrix Ω.
  3. Form Y = A Ω and compute its QR factorization Y = Q R.
  4. Form the (k + p) × (k + p) matrix B = Q′ A Q.
  5. Compute the eigendecomposition of the small matrix B, B = Û Ŝ Û′.
  6. Form the matrix U = Q Û.

Once again we have a bound on the error,

  E ∥A − Q (Q′ A Q) Q′∥ = E ∥A − U S U′∥ ⪅ c · σ_{k+1}.

Halko, Martinsson, Tropp (2011); Banerjee, Dunson, Tokdar (2012)
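A hedged sketch of this symmetric variant (again generic, not the papers' code):

```python
# Randomized low rank eigendecomposition for a symmetric positive definite
# covariance matrix A, giving A ~= U diag(S) U^t.
import numpy as np

def randomized_eig(A, k, p=10, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ Omega)               # orthonormal basis for the range of A
    B = Q.T @ A @ Q                              # small (k+p) x (k+p) symmetric matrix
    S, U_hat = np.linalg.eigh(B)                 # eigenvalues in ascending order
    S, U_hat = S[::-1], U_hat[:, ::-1]           # reorder to descending
    U = Q @ U_hat
    return U[:, :k], S[:k]
```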

SLIDE 45

Low Rank Approximations and GPUs

Both predictive process and random matrix low rank approximations are good candidates for acceleration using GPUs.

  • Both use Sherman-Morrison-Woodbury to calculate the inverse (involves matrix multiplication, addition, and a small matrix inverse).
  • Predictive processes involve several covariance matrix calculations (knots and cross-covariance) and a small matrix inverse.
  • Random matrix low rank approximations involve a large matrix multiplication (A Ω) and several small matrix decompositions (QR, eigen).

SLIDE 46

Comparison (n = 15,000, k = {100, ..., 4900})

[Figure: time vs approximation error for the lr1, lr1 mod, pp, and pp mod methods, under strong dependence (left) and weak dependence (right).]

SLIDE 47

Rand. Projection LR Decompositions for Prediction

This approach can also be used for prediction: if we want to sample y ∼ N(0, Σ) and

  Σ ≈ U S Uᵗ = (U S^{1/2} Uᵗ)(U S^{1/2} Uᵗ)ᵗ

then

  ypred = (U S^{1/2} Uᵗ) × Z with Zi ∼ N(0, 1),

because Uᵗ U = I (the columns of U are orthonormal). A sketch follows below.

Dehdari, Deutsch (2012)
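A short NumPy sketch of this sampling step, assuming U and S come from the randomized eigendecomposition sketched earlier (the function name is my own):

```python
# Approximate sampling y ~ N(0, Sigma) using a low rank square root
# Sigma ~= U diag(S) U^t = (U diag(sqrt(S)) U^t)(U diag(sqrt(S)) U^t)^t.
import numpy as np

def lowrank_sample(U, S, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, k = U.shape
    Z = rng.standard_normal(n)
    # (U S^{1/2} U^t) Z, evaluated right to left so the cost stays O(nk)
    return U @ (np.sqrt(S) * (U.T @ Z))
```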

SLIDE 48

n = 1000, p = 10000
