
Lecture 22 - Computational Methods for GPs - Colin Rundel - 04/12/2017



  1. Lecture 22 - Computational Methods for GPs - Colin Rundel - 04/12/2017

  2. GPs and Computational Complexity

  3-6. The problem with GPs

  Unless you are lucky (or clever), Gaussian process models are difficult to scale to large problems. For a Gaussian process y ∼ N(µ, Σ):

  • Want to sample y?
    µ + Chol(Σ) × Z with Z_i ∼ N(0, 1), which is O(n³)
  • Evaluate the (log) likelihood?
    −(1/2) (x − µ)′ Σ⁻¹ (x − µ) − (1/2) log |Σ| − (n/2) log 2π, which is O(n³)
  • Update covariance parameter?
    {Σ}_ij = σ² exp(−{d}_ij ϕ) + σ²_n 1_{i=j}, which is O(n²)
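  To make these costs concrete, here is a minimal R sketch (not from the slides; the exponential covariance form follows the last bullet, but the grid size, parameter values, and variable names are illustrative assumptions) that builds Σ, draws one sample of y, and evaluates the log likelihood from a single Cholesky factor:

  ```r
  # Minimal sketch of the three GP operations above (illustrative parameters)
  set.seed(1)
  n      <- 1000
  coords <- matrix(runif(2 * n), ncol = 2)   # random 2d locations (assumed setup)
  d      <- as.matrix(dist(coords))          # pairwise distances

  sigma2   <- 1.0   # partial sill (assumed value)
  phi      <- 3.0   # range        (assumed value)
  sigma2_n <- 0.1   # nugget       (assumed value)

  # Update covariance: O(n^2) entries
  Sigma <- sigma2 * exp(-d * phi) + sigma2_n * diag(n)

  # Sample y ~ N(mu, Sigma): the Cholesky factorization is the O(n^3) step
  mu <- rep(0, n)
  L  <- t(chol(Sigma))            # chol() returns the upper factor, so transpose
  y  <- mu + L %*% rnorm(n)

  # Log likelihood, reusing the same O(n^3) factorization
  z      <- forwardsolve(L, y - mu)
  loglik <- -0.5 * sum(z^2) - sum(log(diag(L))) - (n / 2) * log(2 * pi)
  ```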

  7-12. A simple guide to computational complexity

  • O(n) - Linear complexity - Go for it
  • O(n²) - Quadratic complexity - Pray
  • O(n³) - Cubic complexity - Give up

  13. How bad is the problem?

  [Figure: time (secs, 0 to 30) vs. n (2,500 to 10,000) for matrix inversion by method: chol inv, LU inv, QR inv.]
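  A comparison along these lines can be timed directly in base R; this is a sketch (the test matrix construction and the value of n are placeholders), and the numbers will depend on the machine and the BLAS in use:

  ```r
  # Sketch: timing three routes to the inverse of an n x n positive definite matrix
  n <- 5000
  X <- matrix(rnorm(n * n), n, n)
  A <- crossprod(X) / n + diag(n)     # well-conditioned SPD test matrix

  system.time(chol2inv(chol(A)))      # inverse via Cholesky
  system.time(solve(A))               # inverse via LU (LAPACK default)
  system.time(qr.solve(A, diag(n)))   # inverse via QR
  ```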

  14-15. Practice - Migratory Model Prediction

  After fitting the GP we need to sample from the posterior predictive distribution at ∼3000 locations:

  y_p ∼ N( µ_p + Σ_po Σ_o⁻¹ (y_o − µ_o),  Σ_p − Σ_po Σ_o⁻¹ Σ_op )

  Step                                    CPU (secs)   CPU+GPU (secs)   Rel. Performance
  1. Calc. Σ_p, Σ_po                           1.080            0.046               23.0
  2. Calc. chol(Σ_p − Σ_po Σ_o⁻¹ Σ_op)         0.467            0.208                2.3
  3. Calc. µ_p|o + chol(Σ_p|o) × Z             0.049            0.052                0.9
  4. Calc. Allele Prob                         0.129            0.127                1.0
  Total                                        1.732            0.465                3.7

  Total run time: CPU (28.9 min), CPU+GPU (7.8 min)
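  For reference, a minimal R sketch of steps 1-3 above (the object names Sigma_o, Sigma_p, Sigma_po, mu_o, mu_p, y_o are placeholders; this is the generic conditional-normal calculation, not code from the migratory model itself):

  ```r
  # One posterior predictive draw y_p | y_o ~ N(mu_p|o, Sigma_p|o)
  # Assumes Sigma_o, Sigma_p, Sigma_po, mu_o, mu_p, y_o already exist.

  L_o <- t(chol(Sigma_o))                                      # factor observed covariance

  Sio_op <- backsolve(t(L_o), forwardsolve(L_o, t(Sigma_po)))  # Sigma_o^{-1} Sigma_op
  Sio_r  <- backsolve(t(L_o), forwardsolve(L_o, y_o - mu_o))   # Sigma_o^{-1} (y_o - mu_o)

  mu_c  <- mu_p + Sigma_po %*% Sio_r                           # conditional mean
  Sig_c <- Sigma_p - Sigma_po %*% Sio_op                       # conditional covariance

  y_p <- mu_c + t(chol(Sig_c)) %*% rnorm(length(mu_p))         # step 3: mean + chol x Z
  ```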

  16. Cholesky CPU vs GPU (P100)

  [Figure: time (secs, 0 to 30) vs. n (2,500 to 10,000) by method (chol inv, LU inv, QR inv) and computation (cpu, gpu).]

  17. [Figure: the same CPU vs. GPU comparison with time (secs) from 0.1 to 10.0, by method (chol inv, LU inv, QR inv), n from 2,500 to 10,000.]

  18. Relative Performance

  [Figure: relative performance (1 to 10) vs. n (2,500 to 10,000) by method: chol inv, LU inv, QR inv.]

  19. Aside - Matrix Multiplication

  [Figure: time (sec, 0.0 to 7.5) vs. n (2,500 to 10,000) for matrix multiplication on cpu vs. gpu.]

  20. An even bigger hammer

  bigGP is an R package written by Chris Paciorek (UC Berkeley), et al.

  • Specialized distributed implementation of linear algebra operations for GPs
  • Designed to run on large supercomputer clusters
  • Uses both shared and distributed memory
  • Able to fit models on the order of n = 65k (32 GB covariance matrix)

  [Figure: Cholesky decomposition execution time in seconds (log scale, 0.01 to 1000) vs. matrix dimension n (log scale, 2,048 to 131,072), for 6, 60, 816, 12,480, and 49,920 cores.]

  21. More scalable solutions?

  • Spectral domain / basis functions
  • Covariance tapering
  • GMRF approximations
  • Low-rank approximations
  • Nearest-neighbor models

  22. Low Rank Approximations

  23-24. Low rank approximations in general

  Let's look at the example of the singular value decomposition of a matrix,

  M (n×m) = U (n×n) diag(S) (n×m) Vᵗ (m×m)

  where U are called the left singular vectors, V the right singular vectors, and S the singular values. Usually the singular values and vectors are ordered such that the singular values are in descending order.

  It turns out (Eckart–Young theorem) that we can approximate M as having rank k by defining S̃ to only have the k largest singular values (the others set to zero):

  M̃ (n×m) = U (n×n) diag(S̃) (n×m) Vᵗ (m×m) = Ũ (n×k) diag(S̃) (k×k) Ṽᵗ (k×m)
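  A short R sketch of this truncation using base svd(); the helper name low_rank() is just for illustration:

  ```r
  # Rank-k approximation of a matrix M via the SVD (Eckart-Young)
  low_rank <- function(M, k) {
    s <- svd(M)    # s$u, s$d (singular values, descending), s$v
    s$u[, 1:k, drop = FALSE] %*%
      diag(s$d[1:k], k, k) %*%
      t(s$v[, 1:k, drop = FALSE])
  }
  ```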

  25-26. Example

  M = [ 1.000  0.500  0.333  0.250 ]
      [ 0.500  0.333  0.250  0.200 ]  =  U diag(S) Vᵗ
      [ 0.333  0.250  0.200  0.167 ]
      [ 0.250  0.200  0.167  0.143 ]

  U = V = [ −0.79   0.58  −0.18  −0.03 ]
          [ −0.45  −0.37   0.74   0.33 ]     S = ( 1.50  0.17  0.01  0.00 )
          [ −0.32  −0.51  −0.10  −0.79 ]
          [ −0.25  −0.51  −0.64   0.51 ]

  Rank 2 approximation:

  M̃ = [ −0.79   0.58 ]  [ 1.50  0.00 ]  [ −0.79  −0.45  −0.32  −0.25 ]
      [ −0.45  −0.37 ]  [ 0.00  0.17 ]  [  0.58  −0.37  −0.51  −0.51 ]
      [ −0.32  −0.51 ]
      [ −0.25  −0.51 ]

    = [ 1.000  0.501  0.333  0.249 ]
      [ 0.501  0.330  0.251  0.203 ]
      [ 0.333  0.251  0.200  0.166 ]
      [ 0.249  0.203  0.166  0.140 ]
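  The matrix M shown here is (up to the rounding displayed) the 4×4 Hilbert matrix, with entries 1/(i + j − 1), so the example is easy to check in R:

  ```r
  # Reproduce the example: M_ij = 1 / (i + j - 1)
  M <- outer(1:4, 1:4, function(i, j) 1 / (i + j - 1))

  s <- svd(M)
  round(s$d, 2)   # singular values: 1.50 0.17 0.01 0.00

  M2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])   # rank-2 approximation
  round(M2, 3)    # matches the slide up to rounding
  ```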

  27-28. Approximation Error

  We can measure the error of the approximation using the Frobenius norm,

  ‖M − M̃‖_F = [ Σ_{i=1}^{n} Σ_{j=1}^{m} (M_ij − M̃_ij)² ]^{1/2}

  Strong dependence (large eff. range):

  [Figure: left panel "SVD", singular values (0 to 15) by index (0 to 50); right panel "Low Rank SVD", error (Frobenius norm) by rank (0 to 50).]

  29. Weak dependence (short eff. range):

  [Figure: left panel "SVD", singular values (0 to 10) by index (0 to 50); right panel "Low Rank SVD", error (Frobenius norm, 0 to 4) by rank (0 to 50).]
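  A sketch of how curves like these might be generated, using an exponential covariance on a one-dimensional grid; the grid and the effective-range values are illustrative assumptions, not the construction used for the slides:

  ```r
  # Singular values and low-rank Frobenius error for an exponential covariance
  x     <- seq(0, 1, length.out = 50)
  d     <- abs(outer(x, x, "-"))
  Sigma <- exp(-d / 0.5)        # "strong" dependence; try exp(-d / 0.05) for "weak"

  s   <- svd(Sigma)
  err <- sapply(1:50, function(k) {
    Sk <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k, drop = FALSE])
    norm(Sigma - Sk, type = "F")            # Frobenius norm of the residual
  })

  plot(s$d, type = "h", xlab = "Index", ylab = "Singular Values")
  plot(err, type = "l", xlab = "Rank",  ylab = "Error (Frob. norm)")
  ```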

  30. How does this help? (Sherman-Morrison-Woodbury)

  There is an immensely useful linear algebra identity, the Sherman-Morrison-Woodbury formula, for the inverse (and determinant) of a decomposed matrix,

  M̃⁻¹ = ( A + U (n×k) S (k×k) Vᵗ (k×n) )⁻¹ = A⁻¹ − A⁻¹ U ( S⁻¹ + Vᵗ A⁻¹ U )⁻¹ Vᵗ A⁻¹.

  • Imagine that A = diag(A); then it is trivial to find A⁻¹.
  • S⁻¹ is k×k, which is hopefully small, or even better S = diag(S).
  • ( S⁻¹ + Vᵗ A⁻¹ U )⁻¹ is k×k, which is hopefully small.
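  A small numerical check of the identity in R (the matrices are random placeholders, with A diagonal so that its inverse is cheap):

  ```r
  # Sketch: verify the Sherman-Morrison-Woodbury identity on random inputs
  set.seed(1)
  n <- 200; k <- 5
  A <- diag(runif(n, 1, 2))                  # diagonal "easy" part
  U <- matrix(rnorm(n * k), n, k)
  V <- matrix(rnorm(n * k), n, k)
  S <- diag(runif(k, 1, 2))

  M <- A + U %*% S %*% t(V)

  Ai <- diag(1 / diag(A))                    # trivial inverse of a diagonal matrix
  Mi <- Ai - Ai %*% U %*% solve(solve(S) + t(V) %*% Ai %*% U) %*% t(V) %*% Ai

  max(abs(Mi - solve(M)))                    # should be numerically ~0
  ```

  The point is that only the k×k inner matrix ever needs a dense factorization; everything else is diagonal or a matrix-vector style product.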
