SLIDE 1

Lecture 21

Computational Methods for GPs

Colin Rundel 04/10/2017


SLIDE 2

GPs and Computational Complexity


SLIDE 3

The problem with GPs

Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process $\mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu},\, \Sigma)$:

Want to sample $\mathbf{y}$?

$\boldsymbol{\mu} + \text{Chol}(\Sigma) \times \mathbf{z}$ with $z_i \sim \mathcal{N}(0, 1)$ ($\mathcal{O}(n^3)$)

Evaluate the (log) likelihood?

$-\frac{1}{2} \log |\Sigma| - \frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})' \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{n}{2} \log 2\pi$ ($\mathcal{O}(n^3)$)

Update covariance parameter?

$\{\Sigma\}_{ij} = \sigma^2 \exp(-\{d\}_{ij}\, \phi) + \sigma^2_n\, 1_{i=j}$ ($\mathcal{O}(n^2)$)
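
For concreteness, a minimal base R sketch of where these costs show up; the covariance parameter values ($\sigma^2 = 1$, $\phi = 9$, $\sigma^2_n = 0.1$) and the locations are assumed for illustration:

```r
# Minimal base R sketch of the O(n^3) operations above
set.seed(1)
n     <- 500
d     <- as.matrix(dist(runif(n)))           # n x n pairwise distances
Sigma <- 1 * exp(-d * 9) + 0.1 * diag(n)     # sigma2 exp(-d_ij phi) + sigma2_n 1_{i=j}
mu    <- rep(0, n)

U <- chol(Sigma)                             # O(n^3); R returns the upper triangle, Sigma = U'U
y <- mu + drop(t(U) %*% rnorm(n))            # sample: mu + Chol(Sigma) x z

loglik <- -sum(log(diag(U))) -                           # -1/2 log|Sigma|
  0.5 * sum(backsolve(U, y - mu, transpose = TRUE)^2) -  # -1/2 (y-mu)' Sigma^{-1} (y-mu)
  n / 2 * log(2 * pi)                                    # -n/2 log 2pi
```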


SLIDE 7

A simple guide to computational complexity

  β€’ $\mathcal{O}(n)$ - Linear complexity - Go for it
  β€’ $\mathcal{O}(n^2)$ - Quadratic complexity - Pray
  β€’ $\mathcal{O}(n^3)$ - Cubic complexity - Give up


SLIDE 13

How bad is the problem?

[Figure: time (secs) vs. n from 2500 to 10000 for three dense inversion methods: chol inv, LU inv, QR inv]
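
A rough way to reproduce this comparison in base R; the test matrix (an arbitrary dense positive definite matrix) and its size are assumptions:

```r
n <- 2500
X <- matrix(rnorm(n * n), n)
S <- crossprod(X) / n + diag(n)    # a positive definite n x n test matrix

system.time(chol2inv(chol(S)))     # "chol inv"
system.time(solve(S))              # "LU inv" (solve() factors via LU)
system.time(qr.solve(S))           # "QR inv"
```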

SLIDE 14

Practice - Migratory Model Prediction

After fitting the GP we need to sample from the posterior predictive distribution at $\sim 3000$ locations,

$\mathbf{y}_p \sim \mathcal{N}\left(\mu_p + \Sigma_{po}\, \Sigma_o^{-1} (y_o - \mu_o),\;\; \Sigma_p - \Sigma_{po}\, \Sigma_o^{-1}\, \Sigma_{op}\right)$

Step                                          CPU (secs)
1. Calc. Ξ£_p, Ξ£_po, Ξ£_o                       1.080
2. Calc. chol(Ξ£_p - Ξ£_po Ξ£_o^{-1} Ξ£_op)       0.467
3. Calc. ΞΌ_{p|o} + chol(Ξ£_{p|o}) Γ— z          0.049
4. Calc. Allele Prob                          0.129
Total                                         1.732

Total run time for 1000 posterior predictive draws:

  β€’ CPU (28.9 min)
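
A minimal R sketch of steps 1 through 3; the means (mu_p, mu_o), the observed values y_o, and the covariance blocks are assumed to have been built from the fitted model:

```r
# One posterior predictive draw from the conditional MVN above
pred_draw <- function(mu_p, mu_o, y_o, Sigma_p, Sigma_po, Sigma_o) {
  Sigma_o_inv <- chol2inv(chol(Sigma_o))
  mu_cond    <- mu_p + Sigma_po %*% Sigma_o_inv %*% (y_o - mu_o)
  Sigma_cond <- Sigma_p - Sigma_po %*% Sigma_o_inv %*% t(Sigma_po)  # Sigma_{p|o}
  U <- chol(Sigma_cond)                                             # step 2
  drop(mu_cond + t(U) %*% rnorm(length(mu_p)))                      # step 3
}
```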


SLIDE 16

A bigger hammer?

Step                                          CPU (secs)   CPU+GPU (secs)   Rel. Perf
1. Calc. Ξ£_p, Ξ£_po, Ξ£_o                       1.080        0.046            23.0
2. Calc. chol(Ξ£_p - Ξ£_po Ξ£_o^{-1} Ξ£_op)       0.467        0.208            2.3
3. Calc. ΞΌ_{p|o} + chol(Ξ£_{p|o}) Γ— z          0.049        0.052            0.9
4. Calc. Allele Prob                          0.129        0.127            1.0
Total                                         1.732        0.465            3.7

Total run time for 1000 posterior predictive draws:

  β€’ CPU (28.9 min)
  β€’ CPU+GPU (7.8 min)

SLIDE 17

Cholesky CPU vs GPU (P100)

[Figure: time (secs) vs. n from 2500 to 10000 for chol inv, LU inv, and QR inv, on cpu and gpu]

SLIDE 18

[Figure: the same comparison on a log time scale (0.1 to 10 secs) vs. n from 2500 to 10000, chol inv / LU inv / QR inv, cpu and gpu]

SLIDE 19

Relative Performance

[Figure: relative performance vs. n from 2500 to 10000 for chol inv, LU inv, and QR inv]

SLIDE 20

Aside (1) - Matrix Multiplication

[Figure: matrix multiplication time (sec) vs. n from 2500 to 10000, cpu and gpu]

SLIDE 21

Matrix Multiplication - Relative Performance

[Figure: relative performance (roughly 20 to 45) vs. n from 2500 to 10000]

SLIDE 22

Aside (2) - Memory Limitations

A general covariance is a dense $n \times n$ matrix, meaning it will require $n^2 \times 64$ bits to store.

[Figure: covariance matrix size (GB, roughly 5 to 20) vs. n from 10000 to 50000]
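
The storage requirement is easy to check directly in R (8 bytes per double):

```r
# Dense n x n covariance storage in GB
cov_size_gb <- function(n) n^2 * 8 / 1024^3
cov_size_gb(c(10000, 50000))
#> [1]  0.745 18.626
```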

SLIDE 23

Other big hammers

bigGP is an R package written by Chris Paciorek (UC Berkeley), et al.

  β€’ Specialized distributed implementation of linear algebra operations for GPs
  β€’ Designed to run on large supercomputer clusters
  β€’ Uses both shared and distributed memory
  β€’ Able to fit models on the order of n = 65k (32 GB Cov. matrix)

[Figure: Cholesky decomposition execution time (seconds, log scale) vs. matrix dimension n (log scale, 2048 to 131072) for 6, 60, 816, 12480, and 49920 cores]

SLIDE 24

More scalable solutions?

  • Spectral domain / basis functions
  • Covariance tapering
  • GMRF approximations
  • Low-rank approximations
  • Nearest-neighbor models


SLIDE 25

Low Rank Approximations


SLIDE 26

Low rank approximations in general

Let's look at the example of the singular value decomposition of a matrix,

$\underset{n \times m}{M} = \underset{n \times n}{U} \; \underset{n \times m}{\text{diag}(S)} \; \underset{m \times m}{V^t}$

where $U$ are called the left singular vectors, $V$ the right singular vectors, and $S$ the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order.

The Eckart–Young theorem states that we can construct an approximation of $M$ with rank $k$ by setting $\tilde{S}$ to contain only the $k$ largest singular values and all other values set to zero,

$\underset{n \times m}{\tilde{M}} = \underset{n \times n}{U} \; \underset{n \times m}{\text{diag}(\tilde{S})} \; \underset{m \times m}{V^t} = \underset{n \times k}{\tilde{U}} \; \underset{k \times k}{\text{diag}(\tilde{S})} \; \underset{k \times m}{\tilde{V}^t}$
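
In base R this truncation is a few lines (a sketch; k is assumed to be at most min(n, m)):

```r
# Rank k truncation of an SVD (the Eckart-Young construction)
low_rank <- function(M, k) {
  s <- svd(M)                        # M = U diag(S) V'
  s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], k, k) %*%         # keep only the k largest singular values
    t(s$v[, 1:k, drop = FALSE])
}
```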


SLIDE 28

Example

$M = \begin{pmatrix} 1.000 & 0.500 & 0.333 & 0.250 \\ 0.500 & 0.333 & 0.250 & 0.200 \\ 0.333 & 0.250 & 0.200 & 0.167 \\ 0.250 & 0.200 & 0.167 & 0.143 \end{pmatrix} = U \, \text{diag}(S) \, V^t$

$U = V = \begin{pmatrix} -0.79 & 0.58 & -0.18 & -0.03 \\ -0.45 & -0.37 & 0.74 & 0.33 \\ -0.32 & -0.51 & -0.10 & -0.79 \\ -0.25 & -0.51 & -0.64 & 0.51 \end{pmatrix} \qquad S = (1.50 \;\; 0.17 \;\; 0.01 \;\; 0.00)$

Rank 2 approximation:

$\tilde{M} = \begin{pmatrix} -0.79 & 0.58 \\ -0.45 & -0.37 \\ -0.32 & -0.51 \\ -0.25 & -0.51 \end{pmatrix} \begin{pmatrix} 1.50 & 0.00 \\ 0.00 & 0.17 \end{pmatrix} \begin{pmatrix} -0.79 & -0.45 & -0.32 & -0.25 \\ 0.58 & -0.37 & -0.51 & -0.51 \end{pmatrix} = \begin{pmatrix} 1.000 & 0.501 & 0.333 & 0.249 \\ 0.501 & 0.330 & 0.251 & 0.203 \\ 0.333 & 0.251 & 0.200 & 0.166 \\ 0.249 & 0.203 & 0.166 & 0.140 \end{pmatrix}$
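
The example matrix is the 4 x 4 Hilbert matrix, $M_{ij} = 1/(i + j - 1)$, so these numbers can be reproduced directly in R; this also gives the Frobenius error reported on the Approximation Error slide:

```r
M <- outer(1:4, 1:4, function(i, j) 1 / (i + j - 1))   # 4 x 4 Hilbert matrix
s <- svd(M)
round(s$d, 2)                                          #> 1.50 0.17 0.01 0.00
M2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])  # rank 2 approximation
round(norm(M - M2, type = "F"), 5)                     #> 0.00674
```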


SLIDE 30

Approximation Error

We can measure the error of the approximation using the Frobenius norm,

$\|M - \tilde{M}\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} (M_{ij} - \tilde{M}_{ij})^2 \right)^{1/2}$

$M - \tilde{M} = \begin{pmatrix} 0.00022 & -0.00090 & 0.00012 & 0.00077 \\ -0.00090 & 0.00372 & -0.00053 & -0.00317 \\ 0.00012 & -0.00053 & 0.00013 & 0.00039 \\ 0.00077 & -0.00317 & 0.00039 & 0.00277 \end{pmatrix}$

$\|M - \tilde{M}\|_F = 0.00674$


SLIDE 32

Cov Mat - Strong dependence (large eff. range):

[Figure: singular values of the covariance matrix (SVD) and low rank SVD error (Frob. norm) vs. rank]

SLIDE 33

Cov Mat - Weak dependence (short eff. range):

[Figure: singular values of the covariance matrix (SVD) and low rank SVD error (Frob. norm) vs. rank]

SLIDE 34

How does this help? (Sherman-Morrison-Woodbury)

There is an immensely useful linear algebra identity, the Sherman-Morrison-Woodbury formula, for the inverse (and determinant) of a decomposed matrix,

$\underset{n \times n}{\tilde{M}^{-1}} = \left( \underset{n \times n}{A} + \underset{n \times k}{U} \, \underset{k \times k}{S} \, \underset{k \times n}{V^t} \right)^{-1} = A^{-1} - A^{-1} U \left( S^{-1} + V^t A^{-1} U \right)^{-1} V^t A^{-1}.$

How does this help?

  β€’ Imagine that $A = \text{diag}(A)$, then it is trivial to find $A^{-1}$.
  β€’ $S^{-1}$ is $k \times k$ which is hopefully small, or even better $S = \text{diag}(S)$.
  β€’ $(S^{-1} + V^t A^{-1} U)$ is $k \times k$ which is also hopefully small.
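
A base R sketch of the identity for the diagonal-$A$ case from the bullets above; only a $k \times k$ system is ever solved densely:

```r
# Woodbury inverse of A + U S V' with A = diag(a); U, V are n x k, S is k x k
smw_inv <- function(a, U, S, V) {
  AiU  <- U / a                              # A^{-1} U
  core <- solve(solve(S) + t(V) %*% AiU)     # (S^{-1} + V' A^{-1} U)^{-1}, k x k
  diag(1 / a) - AiU %*% core %*% t(V / a)    # A^{-1} - A^{-1} U (.)^{-1} V' A^{-1}
}

# agrees with the dense inverse on a small example
n <- 6; k <- 2
a <- runif(n, 1, 2); U <- matrix(rnorm(n * k), n); S <- diag(k); V <- U
max(abs(smw_inv(a, U, S, V) - solve(diag(a) + U %*% S %*% t(V))))
```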


SLIDE 36

Aside - Determinant

Remember for any MVN distribution when evaluating the likelihood

$-\frac{1}{2} \log |\Sigma| - \frac{1}{2} (\mathbf{y} - \boldsymbol{\mu})' \Sigma^{-1} (\mathbf{y} - \boldsymbol{\mu}) - \frac{n}{2} \log 2\pi$

we need the inverse of $\Sigma$ as well as its determinant.

  β€’ For a full rank Cholesky decomposition we get the determinant for "free",

$|M| = |L L^t| = \prod_{i=1}^{n} \left( \text{diag}(L)_i \right)^2$

  β€’ For a low rank approximation the Sherman-Morrison-Woodbury determinant lemma gives us,

$\det(\tilde{M}) = \det(A + U S V^t) = \det(S^{-1} + V^t A^{-1} U) \, \det(S) \, \det(A)$
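
Both identities are short in R (a sketch; the lemma version assumes $A = \text{diag}(a)$ as on the previous slide):

```r
# log|M| via Cholesky: log|LL'| = 2 sum log diag(L)
logdet_chol <- function(M) 2 * sum(log(diag(chol(M))))

# log det(A + U S V') via the determinant lemma, A = diag(a)
logdet_lemma <- function(a, U, S, V) {
  as.numeric(determinant(solve(S) + t(V) %*% (U / a))$modulus) +  # log det(S^{-1} + V'A^{-1}U)
    as.numeric(determinant(S)$modulus) +                          # log det(S)
    sum(log(a))                                                   # log det(A)
}
```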


SLIDE 39

Low rank approximations for GPs

For a standard spatial random effects model,

$y(\mathbf{s}) = x(\mathbf{s})\, \beta + w(\mathbf{s}) + \epsilon, \qquad \epsilon \sim N(0,\, \tau^2 I)$

$w(\mathbf{s}) \sim \mathcal{N}(0,\, \Sigma(\mathbf{s})), \qquad \Sigma(\mathbf{s}, \mathbf{s}') = \sigma^2\, \rho(\mathbf{s}, \mathbf{s}' \,|\, \theta)$

if we can replace $\Sigma(\mathbf{s})$ with a low rank approximation of the form $\Sigma(\mathbf{s}) \approx U \, S \, V^t$ then the preceding Sherman-Morrison-Woodbury results apply, where

  β€’ $U$ and $V$ are $n \times k$,
  β€’ $S$ is $k \times k$, and
  β€’ $A = \tau^2 I$ or a similar diagonal matrix.

SLIDE 40

Predictive Processes


SLIDE 41

Gaussian Predictive Processes

For a rank $k$ approximation,

  β€’ Pick $k$ knot locations $\mathbf{s}^{\star}$
  β€’ Calculate the knot covariance, $\Sigma(\mathbf{s}^{\star})$, and knot cross-covariance, $\Sigma(\mathbf{s}, \mathbf{s}^{\star})$
  β€’ Approximate the full covariance using

$\Sigma(\mathbf{s}) \approx \underset{n \times k}{\Sigma(\mathbf{s}, \mathbf{s}^{\star})} \; \underset{k \times k}{\Sigma(\mathbf{s}^{\star})^{-1}} \; \underset{k \times n}{\Sigma(\mathbf{s}^{\star}, \mathbf{s})}.$

  β€’ PPs systematically underestimate the variance ($\sigma^2$) and inflate $\tau^2$; the modified predictive process corrects this using

$\Sigma(\mathbf{s}) \approx \Sigma(\mathbf{s}, \mathbf{s}^{\star}) \, \Sigma(\mathbf{s}^{\star})^{-1} \, \Sigma(\mathbf{s}^{\star}, \mathbf{s}) + \text{diag}\!\left( \Sigma(\mathbf{s}) - \Sigma(\mathbf{s}, \mathbf{s}^{\star}) \, \Sigma(\mathbf{s}^{\star})^{-1} \, \Sigma(\mathbf{s}^{\star}, \mathbf{s}) \right).$

Banerjee, Gelfand, Finley, Sang (2008); Finley, Sang, Banerjee, Gelfand (2008)
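
As a sketch (not the authors' code), both approximations fit in a few lines of R, assuming a hypothetical helper cov_fun(A, B) that builds the covariance matrix between two sets of locations:

```r
# (Modified) predictive process covariance for locations s and knots s_star
pp_cov <- function(s, s_star, cov_fun, modified = TRUE) {
  C_sk <- cov_fun(s, s_star)               # Sigma(s, s*),  n x k
  C_kk <- cov_fun(s_star, s_star)          # Sigma(s*),     k x k
  S_pp <- C_sk %*% solve(C_kk, t(C_sk))    # Sigma(s,s*) Sigma(s*)^{-1} Sigma(s*,s)
  if (modified)                            # restore the underestimated variance
    S_pp <- S_pp + diag(diag(cov_fun(s, s)) - diag(S_pp))
  S_pp
}
```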


SLIDE 45

Example

Below we have a surface generated from a squared exponential Gaussian process where

$\{\Sigma\}_{ij} = \sigma^2 \exp\!\left( -(\phi\, d)^2 \right) + \tau^2 I, \qquad \sigma^2 = 1, \;\; \phi = 9, \;\; \tau^2 = 0.1$

[Figure: True Surface and Observed Data, values roughly -3 to 3]

SLIDE 46

Predictive Process Model Results

[Figure: fitted surfaces for the True Field, the Full GP, PP with 5 x 5 / 10 x 10 / 15 x 15 knots, and Mod. PP with 5 x 5 / 10 x 10 / 15 x 15 knots]

SLIDE 47

Performance

[Figure: time vs. error for the Full GP, PP, and Mod. PP with 25, 100, and 225 knots]

SLIDE 48

Parameter Estimates

[Figure: posterior estimates of phi, sigma.sq, and tau.sq for the Full GP, PP, and Mod. PP with 25, 100, and 225 knots, with true values marked]

SLIDE 49

Random Projections


SLIDE 50

Low Rank Approximations via Random Projections

  1. Start with an $m \times n$ matrix $\mathbf{A}$.
  2. Draw an $n \times (k + p)$ Gaussian random matrix $\mathbf{\Omega}$.
  3. Form $\mathbf{Y} = \mathbf{A} \, \mathbf{\Omega}$ and compute its QR factorization $\mathbf{Y} = \mathbf{Q} \, \mathbf{R}$.
  4. Form $\mathbf{B} = \mathbf{Q}' \, \mathbf{A}$.
  5. Compute the SVD of $\mathbf{B} = \hat{\mathbf{U}} \, \mathbf{S} \, \mathbf{V}'$.
  6. Form the matrix $\mathbf{U} = \mathbf{Q} \, \hat{\mathbf{U}}$.
  7. Form $\tilde{\mathbf{A}} = \mathbf{U} \mathbf{S} \mathbf{V}'$.

The resulting approximation has a bounded expected error,

$E\, \|\mathbf{A} - \mathbf{U} \mathbf{S} \mathbf{V}'\|_F \le \left[ 1 + \frac{4 \sqrt{k + p}}{p - 1} \sqrt{\min(m, n)} \right] \sigma_{k+1}.$

Halko, Martinsson, Tropp (2011)
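
A direct base R transcription of the steps above (a sketch; the target rank k and the small oversampling amount p are tuning choices, and the first k singular values/vectors of the result give the rank k approximation):

```r
rand_svd <- function(A, k, p = 10) {
  Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol(A))  # 2. Gaussian test matrix
  Q <- qr.Q(qr(A %*% Omega))                          # 3. Y = A Omega, Y = QR
  B <- t(Q) %*% A                                     # 4. B = Q'A, (k+p) x n
  s <- svd(B)                                         # 5. B = U_hat S V'
  list(u = Q %*% s$u, d = s$d, v = s$v)               # 6. U = Q U_hat
}
```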

SLIDE 57

Random Matrix Low Rank Approximations and GPs

The preceding algorithm can be modified slightly to take advantage of the positive definite structure of a covariance matrix.

  1. Start with an $n \times n$ covariance matrix $\mathbf{A}$.
  2. Draw an $n \times (k + p)$ Gaussian random matrix $\mathbf{\Omega}$.
  3. Form $\mathbf{Y} = \mathbf{A} \, \mathbf{\Omega}$ and compute its QR factorization $\mathbf{Y} = \mathbf{Q} \, \mathbf{R}$.
  4. Form $\mathbf{B} = \mathbf{Q}' \, \mathbf{A} \, \mathbf{Q}$.
  5. Compute the eigen decomposition of $\mathbf{B} = \hat{\mathbf{U}} \, \mathbf{S} \, \hat{\mathbf{U}}'$.
  6. Form the matrix $\mathbf{U} = \mathbf{Q} \, \hat{\mathbf{U}}$.

Once again we have a bound on the error,

$E\, \|\mathbf{A} - \mathbf{U} \mathbf{S} \mathbf{U}'\|_F \lesssim c \cdot \sigma_{k+1}.$

Halko, Martinsson, Tropp (2011); Banerjee, Dunson, Tokdar (2012)
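
The same sketch with the SVD swapped for an eigendecomposition of the small symmetric matrix, yielding the symmetric factorization $\mathbf{A} \approx \mathbf{U} \, \text{diag}(d) \, \mathbf{U}'$:

```r
rand_eig <- function(A, k, p = 10) {
  Omega <- matrix(rnorm(ncol(A) * (k + p)), ncol(A))
  Q <- qr.Q(qr(A %*% Omega))
  B <- t(Q) %*% A %*% Q                       # 4. B = Q'AQ, (k+p) x (k+p), symmetric
  e <- eigen(B, symmetric = TRUE)             # 5. B = U_hat S U_hat'
  list(u = Q %*% e$vectors, d = e$values)     # 6. U = Q U_hat
}
```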


SLIDE 64

Low Rank Approximations and GPUs

Both predictive process and random matrix low rank approximations are good candidates for acceleration using GPUs.

  β€’ Both use Sherman-Morrison-Woodbury to calculate the inverse (involves matrix multiplication, addition, and a small matrix inverse).
  β€’ Predictive processes involve several covariance matrix calculations (knots and cross-covariance) and a small matrix inverse.
  β€’ Random matrix low rank approximations involve a large matrix multiplication ($\mathbf{A} \, \mathbf{\Omega}$) and several small matrix decompositions (QR, eigen).

SLIDE 65

Comparison ($n = 15{,}000$, $k = \{100, \ldots, 4900\}$)

[Figure: time and error vs. rank under strong and weak dependence for methods lr1, lr1 mod, pp, and pp mod]

SLIDE 66

Rand. Projection LR Decompositions for Prediction

This approach can also be used for prediction. If we want to sample

$\mathbf{y} \sim \mathcal{N}(0, \Sigma), \qquad \Sigma \approx \mathbf{U} \mathbf{S} \mathbf{U}^t = (\mathbf{U} \mathbf{S}^{1/2} \mathbf{U}^t)(\mathbf{U} \mathbf{S}^{1/2} \mathbf{U}^t)^t$

then

$y_{\text{pred}} = (\mathbf{U} \, \mathbf{S}^{1/2} \, \mathbf{U}^t) \times \mathbf{z} \quad \text{where } z_i \sim \mathcal{N}(0, 1)$

because $\mathbf{U}^t \, \mathbf{U} = I$ since $\mathbf{U}$ is an orthogonal matrix.

Dehdari, Deutsch (2012)
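
Using the rand_eig() sketch from the earlier slide, one draw looks like the following (Sigma is assumed to be an existing covariance matrix; pmax() guards against tiny negative eigenvalues in the truncated factorization):

```r
lr   <- rand_eig(Sigma, k = 50)
root <- lr$u %*% (sqrt(pmax(lr$d, 0)) * t(lr$u))   # U S^{1/2} U'
y_pred <- drop(root %*% rnorm(nrow(Sigma)))        # one draw from N(0, U S U')
```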

SLIDE 67

$n = 1000$, $p = 10000$