Scalable Gaussian Processes

  1. Scalable Gaussian Processes
     Zhenwen Dai, Amazon
     9 September 2019 @GPSS 2019

  2. Gaussian process
     Input and output data: X = (x_1, ..., x_N)^T and y = (y_1, ..., y_N).
     Likelihood: p(y | f) = N(y | f, σ²I)
     Prior: p(f | X) = N(f | 0, K(X, X))
     [Figure: a GP fit showing the mean, the data and the confidence band.]
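
     These two distributions fully specify the model behind the figure. Below is a minimal numpy sketch of sampling from this generative model; the RBF kernel and all hyper-parameter values are illustrative assumptions, not taken from the talk.

        import numpy as np

        def rbf_kernel(X1, X2, gamma=1.0, lengthscale=0.2):
            # k(x, x') = gamma * exp(-||x - x'||^2 / (2 * lengthscale^2))
            sqdist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
            return gamma * np.exp(-0.5 * sqdist / lengthscale ** 2)

        N, sigma2 = 50, 0.1
        X = np.random.rand(N, 1)                            # inputs X = (x_1, ..., x_N)^T
        K = rbf_kernel(X, X)                                # K(X, X)
        f = np.random.multivariate_normal(np.zeros(N), K)   # f ~ p(f | X) = N(0, K)
        y = f + np.sqrt(sigma2) * np.random.randn(N)        # y ~ p(y | f) = N(f, sigma^2 I)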

  3. Behind a Gaussian process fit
     Maximum likelihood estimate of the hyper-parameters:
         θ* = argmax_θ log p(y | X, θ) = argmax_θ log N(y | 0, K + σ²I)
     Prediction on a test point X_* given the observed data and the optimized hyper-parameters:
         p(f_* | X_*, y, X, θ) = N(f_* | K_*(K + σ²I)^{-1} y, K_** − K_*(K + σ²I)^{-1} K_*^T)
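
     A numpy sketch of the predictive equation above, reusing rbf_kernel, X, y and sigma2 from the previous sketch; optimizing θ itself requires gradients of the log-likelihood and is not shown.

        # Predictive distribution p(f_* | X_*, y, X, theta) for fixed hyper-parameters.
        import numpy as np

        def gp_predict(X_star, X, y, sigma2):
            K = rbf_kernel(X, X)                      # K
            K_star = rbf_kernel(X_star, X)            # K_*
            K_ss = rbf_kernel(X_star, X_star)         # K_**
            Ky = K + sigma2 * np.eye(X.shape[0])      # K + sigma^2 I
            mean = K_star @ np.linalg.solve(Ky, y)    # K_* (K + sigma^2 I)^{-1} y
            cov = K_ss - K_star @ np.linalg.solve(Ky, K_star.T)
            return mean, cov

        mu, Sigma = gp_predict(np.linspace(0, 1, 100)[:, None], X, y, sigma2)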

  4. How to implement the log-likelihood (1)
     Compute the covariance matrix K:
         K = [ k(x_1, x_1)  ...  k(x_1, x_N)
               ...                ...
               k(x_N, x_1)  ...  k(x_N, x_N) ],
     where k(x_i, x_j) = γ exp(−(1 / (2 l²)) (x_i − x_j)^T (x_i − x_j)).
     The complexity is O(N²Q).
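
     The O(N²Q) cost comes from the pairwise squared distances. A common way to compute them, sketched below for the same assumed kernel, expands ||x_i − x_j||² = ||x_i||² + ||x_j||² − 2 x_i^T x_j so that the dominant cost is the N×N matrix of inner products.

        import numpy as np

        def rbf_kernel_fast(X, gamma=1.0, lengthscale=0.2):
            sq_norms = (X ** 2).sum(axis=1)                                 # O(NQ)
            sqdist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # O(N^2 Q), dominated by X X^T
            sqdist = np.maximum(sqdist, 0.0)                                # clip tiny negatives from round-off
            return gamma * np.exp(-0.5 * sqdist / lengthscale ** 2)         # O(N^2)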

  5. How to implement the log-likelihood (2)
     Plug into the log-pdf of the multivariate normal distribution:
         log p(y | X) = log N(y | 0, K + σ²I)
                      = −(1/2) log |2π(K + σ²I)| − (1/2) y^T (K + σ²I)^{-1} y
                      = −(1/2) (||L^{-1} y||² + N log 2π) − Σ_i log L_ii
     Take a Cholesky decomposition: L = chol(K + σ²I).
     The computational complexity is O(N³ + N² + N). Therefore, the overall complexity, including the computation of K, is O(N³).

  6. A quick profiling (N = 1000, Q = 10)
     Time unit is microseconds (line-profiler output).

     Line #      Time   % Time   Line Contents
          2                      def log_likelihood(kern, X, Y, sigma2):
          3       6.0      0.0       N = X.shape[0]
          4   55595.0     58.7       K = kern.K(X)
          5    4369.0      4.6       Ky = K + np.eye(N)*sigma2
          6   30012.0     31.7       L = np.linalg.cholesky(Ky)
          7    4361.0      4.6       LinvY = dtrtrs(L, Y, lower=1)[0]
          8      49.0      0.1       logL = N*np.log(2*np.pi)/-2.
          9      82.0      0.1       logL += np.square(LinvY).sum()/-2.
         10     208.0      0.2       logL += -np.log(np.diag(L)).sum()
         11       2.0      0.0       return logL
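
     For reference, a self-contained sketch of the profiled function that runs without the kernel object: the covariance matrix is passed in precomputed, and scipy's solve_triangular stands in for the dtrtrs LAPACK wrapper used above. This is a sketch, not the speaker's code.

        import numpy as np
        from scipy.linalg import solve_triangular

        def log_likelihood(K, Y, sigma2):
            N = K.shape[0]
            Ky = K + np.eye(N) * sigma2                 # K + sigma^2 I
            L = np.linalg.cholesky(Ky)                  # O(N^3): the 31.7% line above
            LinvY = solve_triangular(L, Y, lower=True)  # L^{-1} y
            logL = N * np.log(2 * np.pi) / -2.0
            logL += np.square(LinvY).sum() / -2.0
            logL += -np.log(np.diag(L)).sum()
            return logL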

  7. Empirical analysis of computational time
     I collect the run time for N = {10, 100, 500, 1000, 1500, 2000}.
     They take 1.3 ms, 8.5 ms, 28 ms, 0.12 s, 0.29 s and 0.76 s.
     [Figure: run time in seconds versus data size N, with a GP fit (mean, data, confidence).]
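
     A sketch of how such timings can be collected (exact numbers are machine dependent; rbf_kernel_fast and log_likelihood are the functions sketched earlier):

        import timeit
        import numpy as np

        for N in [10, 100, 500, 1000, 1500, 2000]:
            X = np.random.rand(N, 10)
            Y = np.random.randn(N)
            t = timeit.timeit(lambda: log_likelihood(rbf_kernel_fast(X), Y, 0.1), number=5) / 5
            print(N, t)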

  8. What if we have 1 million data points?
     The mean of the predicted computational time is 9.4 × 10⁷ seconds ≈ 2.98 years.
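
     The quoted number is the mean of a GP fit to the measured run times. A back-of-the-envelope O(N³) extrapolation from the N = 2000 measurement lands in the same ballpark:

        t_2000 = 0.76                         # seconds measured at N = 2000
        c = t_2000 / 2000 ** 3                # constant in t(N) ~ c * N^3
        t_million = c * 1e6 ** 3              # ~9.5e7 seconds
        print(t_million / (3600 * 24 * 365))  # ~3 years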

  9. What about waiting for faster computers?
     Computational time = amount of work / computer speed.
     If computer speed increases at a pace of 20% year over year:
     - after 10 years, it will take about 176 days;
     - after 50 years, it will take about 2.9 hours.
     If we double the size of the data, it takes hardware 11.4 years to catch up.
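
     These figures follow from dividing the predicted time by the compound speed-up:

        import math

        t0 = 9.4e7                               # predicted seconds for N = 1e6 today
        print(t0 / 1.2 ** 10 / (3600 * 24))      # ~176 days after 10 years
        print(t0 / 1.2 ** 50 / 3600)             # ~2.9 hours after 50 years
        # Doubling N multiplies the O(N^3) cost by 8, so hardware needs log(8)/log(1.2) years to catch up.
        print(math.log(8) / math.log(1.2))       # ~11.4 years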

 10. What about parallel computing / GPU?
     There is ongoing work on speeding up the Cholesky decomposition with multi-core CPUs or GPUs.
     Main limitation: heavy communication and shared memory.

 11. Other approaches
     Apart from speeding up the exact computation, there has been a lot of work on approximate GP inference.
     These methods often target a specific scenario and provide a good approximation for it.
     This part of the talk gives an overview of common approximations.

 12. Big data (?)
     Lots of data ≠ complex function.
     In real-world problems, we often collect a lot of data for modelling relatively simple relations.
     [Figure: a GP fit (mean, data, confidence) on a large but simple dataset.]

 13. Data subsampling?
     Real data are often not evenly distributed: we tend to get a lot of data on common cases and very few data on rare cases.
     [Figure: a GP fit (mean, data, confidence) on unevenly distributed data.]

 14. Covariance matrix of redundant data
     With redundant data, the covariance matrix becomes low rank. What about a low-rank approximation?
     [Figure]
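
     A small numerical illustration of this point (the clustered inputs are an assumption for the sketch; rbf_kernel is from the earlier sketch): when inputs are heavily clustered, the eigenvalues of K decay quickly, so K is close to low rank.

        import numpy as np

        centres = np.random.rand(20, 1)                                       # 20 distinct input locations
        X = np.repeat(centres, 50, axis=0) + 1e-3 * np.random.randn(1000, 1)  # 1000 near-duplicate points
        K = rbf_kernel(X, X)
        eigvals = np.linalg.eigvalsh(K)[::-1]                                 # sorted, largest first
        print(eigvals[:30] / eigvals[0])                                      # drops towards zero after ~20 values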

 15. Low-rank approximation
     Let's recall the log-likelihood of a GP:
         log p(y | X) = log N(y | 0, K + σ²I),
     where K is the covariance matrix computed from X according to the kernel function k(·, ·) and σ² is the variance of the Gaussian noise distribution.
     Assume K to be low rank. This leads to the Nyström approximation [Williams and Seeger, 2001].

 16. Approximation by subset
     Let's randomly pick a subset of the training inputs: Z ∈ R^{M×Q}.
     Approximate the covariance matrix K by K̃:
         K̃ = K_z K_zz^{-1} K_z^T,
     where K_z = K(X, Z) and K_zz = K(Z, Z). Note that K ∈ R^{N×N}, K_z ∈ R^{N×M} and K_zz ∈ R^{M×M}.
     The log-likelihood is approximated by
         log p(y | X, θ) ≈ log N(y | 0, K_z K_zz^{-1} K_z^T + σ²I).
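
     A direct sketch of this approximation (still O(N³) because K̃ is built explicitly; the efficient version follows on the next slide). X, y, sigma2, rbf_kernel and log_likelihood are from the earlier sketches, and the jitter term is an added assumption for numerical stability.

        import numpy as np

        M = 50
        idx = np.random.choice(X.shape[0], M, replace=False)   # random subset of the training inputs
        Z = X[idx]                                              # Z in R^{M x Q}
        K_z = rbf_kernel(X, Z)                                  # K(X, Z), N x M
        K_zz = rbf_kernel(Z, Z) + 1e-8 * np.eye(M)              # K(Z, Z), M x M, plus jitter
        K_tilde = K_z @ np.linalg.solve(K_zz, K_z.T)            # K_z K_zz^{-1} K_z^T
        print(log_likelihood(K_tilde, y, sigma2))               # log N(y | 0, K_tilde + sigma^2 I)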

 17. Efficient computation using the Woodbury formula
     The naive formulation does not bring any computational benefit:
         L̃ = −(1/2) log |2π(K̃ + σ²I)| − (1/2) y^T (K̃ + σ²I)^{-1} y
     Apply the Woodbury formula:
         (K_z K_zz^{-1} K_z^T + σ²I)^{-1} = σ^{-2} I − σ^{-4} K_z (K_zz + σ^{-2} K_z^T K_z)^{-1} K_z^T
     Note that (K_zz + σ^{-2} K_z^T K_z) ∈ R^{M×M}. The computational complexity reduces to O(NM²).
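
     A sketch of the resulting O(NM²) evaluation: the Woodbury identity handles the quadratic term, and the matching matrix determinant lemma handles the log-determinant, log|K̃ + σ²I| = log|K_zz + σ^{-2}K_z^T K_z| − log|K_zz| + N log σ². The function below is an assumed implementation (y as a 1-D array, jitter added for stability), reusing K_z, K_zz, y and sigma2 from the previous sketch.

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve

        def low_rank_log_likelihood(K_z, K_zz, y, sigma2, jitter=1e-8):
            N, M = K_z.shape
            A = K_zz + K_z.T @ K_z / sigma2 + jitter * np.eye(M)  # K_zz + sigma^{-2} K_z^T K_z
            A_fac = cho_factor(A)
            Kzz_fac = cho_factor(K_zz + jitter * np.eye(M))
            Kz_y = K_z.T @ y                                      # K_z^T y
            # Woodbury: y^T (K_tilde + sigma^2 I)^{-1} y
            quad = (y @ y) / sigma2 - (Kz_y @ cho_solve(A_fac, Kz_y)) / sigma2 ** 2
            # Matrix determinant lemma: log |K_tilde + sigma^2 I|
            logdet = (2 * np.log(np.diag(A_fac[0])).sum()
                      - 2 * np.log(np.diag(Kzz_fac[0])).sum()
                      + N * np.log(sigma2))
            return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

        print(low_rank_log_likelihood(K_z, K_zz, y, sigma2))      # agrees with the naive evaluation up to numerical error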

 18. Nyström approximation
     The above approach is called the Nyström approximation [Williams and Seeger, 2001].
     The approximation is applied directly to the covariance matrix, without the concept of pseudo data.
     The approximation becomes exact if the whole data set is taken, i.e., K K^{-1} K^T = K.
     The subset selection is done randomly.

 19. Gaussian process with pseudo data (1)
     Snelson and Ghahramani [2006] propose the idea of having pseudo data, later referred to as the fully independent training conditional (FITC) approximation.
     Augment the training data (X, y) with pseudo data u at locations Z:
         p([y; u] | [X; Z]) = N([y; u] | 0, [K_ff + σ²I, K_fu; K_fu^T, K_uu]),
     where K_ff = K(X, X), K_fu = K(X, Z) and K_uu = K(Z, Z).

 20. Gaussian process with pseudo data (2)
     Thanks to the marginalization property of the Gaussian distribution,
         p(y | X) = ∫ p(y, u | X, Z) du.
     Further re-arrange the notation:
         p(y, u | X, Z) = p(y | u, X, Z) p(u | Z),
     where
         p(u | Z) = N(u | 0, K_uu),
         p(y | u, X, Z) = N(y | K_fu K_uu^{-1} u, K_ff − K_fu K_uu^{-1} K_fu^T + σ²I).
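
     A small sketch of the two factors, reusing rbf_kernel, X, Z and sigma2 from the earlier sketches (the jitter is an added assumption):

        import numpy as np

        K_ff = rbf_kernel(X, X)
        K_fu = rbf_kernel(X, Z)
        K_uu = rbf_kernel(Z, Z) + 1e-8 * np.eye(Z.shape[0])
        u = np.random.multivariate_normal(np.zeros(Z.shape[0]), K_uu)  # u ~ p(u | Z) = N(0, K_uu)
        W = np.linalg.solve(K_uu, K_fu.T).T                            # K_fu K_uu^{-1}
        cond_mean = W @ u                                              # mean of p(y | u, X, Z)
        cond_cov = K_ff - W @ K_fu.T + sigma2 * np.eye(X.shape[0])     # its covariance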

 21. FITC approximation (1)
     So far p(y | X) has not been changed, but there is also no speed-up yet: K_ff ∈ R^{N×N} still appears in K_ff − K_fu K_uu^{-1} K_fu^T + σ²I.
     The FITC approximation assumes
         p̃(y | u, X, Z) = N(y | K_fu K_uu^{-1} u, Λ + σ²I),
     where Λ = (K_ff − K_fu K_uu^{-1} K_fu^T) ∘ I.

 22. FITC approximation (2)
     Marginalize u from the model definition:
         p̃(y | X, Z) = N(y | 0, K_fu K_uu^{-1} K_fu^T + Λ + σ²I).
     The Woodbury formula can be applied in the same way as in the Nyström approximation (writing K_z = K_fu and K_zz = K_uu):
         (K_z K_zz^{-1} K_z^T + Λ + σ²I)^{-1} = A − A K_z (K_zz + K_z^T A K_z)^{-1} K_z^T A,
     where A = (Λ + σ²I)^{-1}.
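
     A sketch of the resulting FITC log marginal likelihood in O(NM²): Λ is the diagonal correction from the previous slide, A = (Λ + σ²I)^{-1} is diagonal, and the same Woodbury/determinant-lemma pattern as in the Nyström case applies. The implementation below is an assumption (y is a 1-D array, jitter added for stability), reusing rbf_kernel, X, Z, y and sigma2 from the earlier sketches; for the assumed RBF kernel with γ = 1, diag(K_ff) is a vector of ones.

        import numpy as np
        from scipy.linalg import cho_factor, cho_solve

        def fitc_log_likelihood(K_fu, K_uu, K_ff_diag, y, sigma2, jitter=1e-8):
            N, M = K_fu.shape
            Kuu_fac = cho_factor(K_uu + jitter * np.eye(M))
            # Lambda = diag(K_ff - K_fu K_uu^{-1} K_fu^T): only the diagonal is needed.
            Qff_diag = np.einsum('nm,nm->n', K_fu, cho_solve(Kuu_fac, K_fu.T).T)
            Lam = K_ff_diag - Qff_diag
            a = 1.0 / (Lam + sigma2)                                      # A = (Lambda + sigma^2 I)^{-1}, diagonal
            B = K_uu + (K_fu * a[:, None]).T @ K_fu + jitter * np.eye(M)  # K_uu + K_fu^T A K_fu
            B_fac = cho_factor(B)
            Kfu_ay = K_fu.T @ (a * y)                                     # K_fu^T A y
            quad = (y * a) @ y - Kfu_ay @ cho_solve(B_fac, Kfu_ay)        # y^T (Q_ff + Lambda + sigma^2 I)^{-1} y
            # log|Q_ff + Lambda + sigma^2 I| = log|B| - log|K_uu| + sum log(Lambda + sigma^2)
            logdet = (2 * np.log(np.diag(B_fac[0])).sum()
                      - 2 * np.log(np.diag(Kuu_fac[0])).sum()
                      + np.log(Lam + sigma2).sum())
            return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

        print(fitc_log_likelihood(rbf_kernel(X, Z), rbf_kernel(Z, Z), np.ones(X.shape[0]), y, sigma2))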
