

slide-1
SLIDE 1

Scalable Gaussian Processes

Zhenwen Dai

Amazon

September 4, 2018 @GPSS2018

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 1 / 55

slide-2
SLIDE 2

Gaussian process

Input and Output Data: y = (y1, . . . , yN), X = (x1, . . . , xN)^⊤.

p(y|f) = N(y | f, σ²I),
p(f|X) = N(f | 0, K(X, X)).

[Figure: GP posterior fit showing mean, data and confidence interval]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 2 / 55

slide-3
SLIDE 3

The scaling behavior w.r.t. N

The computational cost of a Gaussian process is O(N³).
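As a rough illustration (not from the slides), the cubic scaling can be observed by timing a Cholesky-based log-likelihood on random data of increasing size; the kernel, jitter and sizes below are made up for the example.

import time
import numpy as np

def time_gp_loglik(N, Q=10):
    # Random inputs/outputs, only used to measure runtime.
    X = np.random.randn(N, Q)
    y = np.random.randn(N, 1)
    start = time.time()
    sq_norms = (X ** 2).sum(axis=1)
    K = np.exp(-0.5 * (sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T))
    L = np.linalg.cholesky(K + 1e-2 * np.eye(N))      # the O(N^3) step
    alpha = np.linalg.solve(L, y)
    _ = (-0.5 * (alpha ** 2).sum() - np.log(np.diag(L)).sum()
         - 0.5 * N * np.log(2 * np.pi))
    return time.time() - start

for N in [500, 1000, 1500, 2000, 2500]:
    print(N, round(time_gp_loglik(N), 3))             # runtime grows roughly as N^3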

[Figure: computation time (seconds) vs. data size N]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 3 / 55

slide-4
SLIDE 4

Behind a Gaussian process fit

Point Estimate / Maximum A Posteriori (MAP) of the hyper-parameters:

θ* = arg max_θ log p(y|X, θ) = arg max_θ log N(y | 0, K + σ²I).

Prediction on a test point given the observed data and the optimized hyper-parameters:

p(f*|X*, y, X, θ) = N(f* | K*(K + σ²I)⁻¹y, K** − K*(K + σ²I)⁻¹K*^⊤).
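A minimal numpy sketch of the prediction equation above; the kernel function kern and the variable names are illustrative, not from the slides.

import numpy as np

def gp_predict(kern, X, y, Xstar, sigma2):
    # Posterior mean and covariance of f* given the observed (X, y).
    K = kern(X, X) + sigma2 * np.eye(X.shape[0])
    Kstar = kern(Xstar, X)                    # K_*  : test-train covariance
    Kss = kern(Xstar, Xstar)                  # K_** : test-test covariance
    L = np.linalg.cholesky(K)
    A = np.linalg.solve(L, Kstar.T)           # L^{-1} K_*^T
    mean = A.T @ np.linalg.solve(L, y)        # K_* (K + sigma^2 I)^{-1} y
    cov = Kss - A.T @ A                       # K_** - K_* (K + sigma^2 I)^{-1} K_*^T
    return mean, cov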

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 4 / 55

slide-5
SLIDE 5

How to implement the log-likelihood (1)

Compute the covariance matrix K:

K = [ k(x1, x1) · · · k(x1, xN)
      ⋮            ⋱  ⋮
      k(xN, x1) · · · k(xN, xN) ],

where k(xi, xj) = γ exp( −(1/(2l²)) (xi − xj)^⊤(xi − xj) ).

The complexity is O(N²Q).
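A small numpy sketch of this computation with made-up values for γ and the lengthscale l; the vectorized pairwise squared distance is what gives the O(N²Q) cost.

import numpy as np

def rbf_kernel(X, gamma=1.0, lengthscale=1.0):
    # Pairwise squared distances ||x_i - x_j||^2, computed in O(N^2 Q).
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return gamma * np.exp(-0.5 * sq_dist / lengthscale ** 2)

K = rbf_kernel(np.random.randn(1000, 10))   # N = 1000, Q = 10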

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 5 / 55

slide-6
SLIDE 6

How to implement the log-likelihood (2)

Plug in the log-pdf of the multivariate normal distribution:

log p(y|X) = log N(y | 0, K + σ²I)
           = −(1/2) log |2π(K + σ²I)| − (1/2) y^⊤(K + σ²I)⁻¹y
           = −(1/2)(||L⁻¹y||² + N log 2π) − Σ_i log Lii,

where L = chol(K + σ²I) is the Cholesky decomposition.

The computational complexity is O(N³ + N² + N). Therefore, the overall complexity including the computation of K is O(N³).
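A self-contained numpy sketch of this computation (not part of the slides); it takes a precomputed covariance matrix K so that only the Cholesky-based evaluation is shown.

import numpy as np

def log_likelihood(K, y, sigma2):
    # log N(y | 0, K + sigma^2 I) via a Cholesky decomposition: O(N^3).
    N = y.shape[0]
    L = np.linalg.cholesky(K + sigma2 * np.eye(N))
    Linv_y = np.linalg.solve(L, y)            # L^{-1} y, O(N^2)
    return (-0.5 * (Linv_y ** 2).sum()
            - np.log(np.diag(L)).sum()        # = 0.5 * log|K + sigma^2 I|
            - 0.5 * N * np.log(2 * np.pi))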

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 6 / 55

slide-7
SLIDE 7

A quick profiling (N=1000, Q=10)

The time unit is microseconds.

Line #   Time      % Time  Line Contents
     2                     def log_likelihood(kern, X, Y, sigma2):
     3       6.0      0.0      N = X.shape[0]
     4   55595.0     58.7      K = kern.K(X)
     5    4369.0      4.6      Ky = K + np.eye(N)*sigma2
     6   30012.0     31.7      L = np.linalg.cholesky(Ky)
     7    4361.0      4.6      LinvY = dtrtrs(L, Y, lower=1)[0]
     8      49.0      0.1      logL = N*np.log(2*np.pi)/-2.
     9      82.0      0.1      logL += np.square(LinvY).sum()/-2.
    10     208.0      0.2      logL += -np.log(np.diag(L)).sum()
    11       2.0      0.0      return logL
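A per-line table like this can be produced, for example, with the line_profiler package (an assumption about the tooling used; the snippet reuses log_likelihood, kern, X and Y from the listing above).

from line_profiler import LineProfiler

lp = LineProfiler()
profiled = lp(log_likelihood)        # wrap the function defined in the listing
profiled(kern, X, Y, sigma2=0.1)     # run it once on the data
lp.print_stats()                     # prints a Line # / Time / % Time table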

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 7 / 55

slide-8
SLIDE 8

Too slow or too many data points?

A lot of data does not necessarily mean a complex model.

[Figure: GP fit (mean, data, confidence) on a dataset with many data points]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 8 / 55

slide-9
SLIDE 9

Pseudo Data

Summarize real data into a small set of pseudo data.

[Figure: GP fit with inducing points (mean, inducing, data, confidence)]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 9 / 55

slide-10
SLIDE 10

Sparse Gaussian Process

Sparse GPs refer to a family of approximations:

Nyström approximation [Williams and Seeger, 2001]
Fully independent training conditional (FITC) [Snelson and Ghahramani, 2006]
Variational sparse Gaussian process [Titsias, 2009]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 10 / 55

slide-11
SLIDE 11

Approximation by subset

Let’s randomly pick a subset of the training data: Z ∈ R^{M×Q}.

Approximate the covariance matrix K by K̃:

K̃ = Kz Kzz⁻¹ Kz^⊤, where Kz = K(X, Z) and Kzz = K(Z, Z).

Note that K̃ ∈ R^{N×N}, Kz ∈ R^{N×M} and Kzz ∈ R^{M×M}.

The log-likelihood is approximated by

log p(y|X, θ) ≈ log N(y | 0, Kz Kzz⁻¹ Kz^⊤ + σ²I).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 11 / 55

slide-12
SLIDE 12

Efficient computation using Woodbury formula

The naive formulation does not bring any computational benefit:

L̃ = −(1/2) log |2π(K̃ + σ²I)| − (1/2) y^⊤(K̃ + σ²I)⁻¹y.

Apply the Woodbury formula:

(Kz Kzz⁻¹ Kz^⊤ + σ²I)⁻¹ = σ⁻²I − σ⁻⁴ Kz (Kzz + σ⁻² Kz^⊤ Kz)⁻¹ Kz^⊤.

Note that (Kzz + σ⁻² Kz^⊤ Kz) ∈ R^{M×M}.

The computational complexity reduces to O(NM²).
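A possible numpy sketch of the resulting O(NM²) computation (the kern(A, B) helper returning K(A, B) is illustrative, not from the slides); the quadratic term uses the Woodbury identity and the log-determinant uses the matrix determinant lemma, so the N×N matrix is never formed.

import numpy as np

def nystrom_loglik(kern, X, y, Z, sigma2):
    N, M = X.shape[0], Z.shape[0]
    Kz = kern(X, Z)                                # N x M
    Kzz = kern(Z, Z) + 1e-8 * np.eye(M)            # M x M, jittered
    A = Kzz + Kz.T @ Kz / sigma2                   # M x M matrix from Woodbury
    LA = np.linalg.cholesky(A)
    Lzz = np.linalg.cholesky(Kzz)
    # Quadratic term y^T (Kz Kzz^{-1} Kz^T + sigma^2 I)^{-1} y.
    tmp = np.linalg.solve(LA, Kz.T @ y)
    quad = (y ** 2).sum() / sigma2 - (tmp ** 2).sum() / sigma2 ** 2
    # Log-determinant via the matrix determinant lemma.
    logdet = (N * np.log(sigma2)
              + 2 * np.log(np.diag(LA)).sum()
              - 2 * np.log(np.diag(Lzz)).sum())
    return -0.5 * (quad + logdet + N * np.log(2 * np.pi))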

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 12 / 55

slide-13
SLIDE 13

Nyström approximation

The above approach is called the Nyström approximation by Williams and Seeger [2001].

The approximation is done directly on the covariance matrix, without the concept of pseudo data.

The approximation becomes exact if the whole data set is taken, i.e., K K⁻¹ K^⊤ = K.

The subset selection is done randomly.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 13 / 55

slide-14
SLIDE 14

Gaussian process with Pseudo Data (1)

Snelson and Ghahramani [2006] propose the idea of having pseudo data. This approach is later referred to as fully independent training conditional (FITC).

Augment the training data (X, y) with pseudo data u at locations Z:

p( (y, u) | X, Z ) = N( (y, u) | 0, [ Kff + σ²I   Kfu
                                      Kfu^⊤       Kuu ] ),

where Kff = K(X, X), Kfu = K(X, Z) and Kuu = K(Z, Z).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 14 / 55

slide-15
SLIDE 15

Gaussian process with Pseudo Data (2)

Thanks to the marginalization property of the Gaussian distribution,

p(y|X) = ∫_u p(y, u|X, Z).

Further re-arrange the notation: p(y, u|X, Z) = p(y|u, X, Z) p(u|Z), where

p(u|Z) = N(u | 0, Kuu),
p(y|u, X, Z) = N(y | Kfu Kuu⁻¹ u, Kff − Kfu Kuu⁻¹ Kfu^⊤ + σ²I).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 15 / 55

slide-16
SLIDE 16

FITC approximation (1)

So far, p(y|X) has not been changed and there is no speed-up yet: Kff ∈ R^{N×N} still appears in Kff − Kfu Kuu⁻¹ Kfu^⊤ + σ²I.

The FITC approximation assumes

p̃(y|u, X, Z) = N(y | Kfu Kuu⁻¹ u, Λ + σ²I),

where Λ = (Kff − Kfu Kuu⁻¹ Kfu^⊤) ◦ I.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 16 / 55

slide-17
SLIDE 17

FITC approximation (2)

Marginalize u from the model definition:

p̃(y|X, Z) = N(y | 0, Kfu Kuu⁻¹ Kfu^⊤ + Λ + σ²I).

The Woodbury formula can be applied in the same way as in the Nyström approximation:

(Kfu Kuu⁻¹ Kfu^⊤ + Λ + σ²I)⁻¹ = A − A Kfu (Kuu + Kfu^⊤ A Kfu)⁻¹ Kfu^⊤ A,

where A = (Λ + σ²I)⁻¹.
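Under the same assumptions (an illustrative kern(A, B) helper, not from the slides), a sketch of the FITC log marginal likelihood that only ever forms the diagonal of Kff.

import numpy as np

def fitc_loglik(kern, X, y, Z, sigma2):
    N, M = X.shape[0], Z.shape[0]
    Kfu = kern(X, Z)
    Kuu = kern(Z, Z) + 1e-8 * np.eye(M)
    Luu = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(Luu, Kfu.T)                  # M x N, so Qff = V^T V
    kff_diag = np.array([kern(X[i:i+1], X[i:i+1])[0, 0] for i in range(N)])
    d = kff_diag - (V ** 2).sum(axis=0) + sigma2     # diag(Lambda) + sigma^2
    # Woodbury with the diagonal matrix diag(d): (V^T V + diag(d))^{-1}.
    B = np.eye(M) + (V / d) @ V.T
    LB = np.linalg.cholesky(B)
    tmp = np.linalg.solve(LB, V @ (y.ravel() / d))
    quad = (y.ravel() ** 2 / d).sum() - (tmp ** 2).sum()
    logdet = np.log(d).sum() + 2 * np.log(np.diag(LB)).sum()
    return -0.5 * (quad + logdet + N * np.log(2 * np.pi))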

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 17 / 55

slide-18
SLIDE 18

FITC approximation (3)

FITC allows the pseudo data not to be a subset of the training data. The inducing inputs Z can be optimized via gradient-based optimization.

Like the Nyström approximation, when taking all the training data as inducing inputs, the FITC approximation is equivalent to the original GP:

p̃(y|X, Z = X) = N(y | 0, Kff + σ²I).

FITC can be combined easily with expectation propagation (EP).

Bui et al. [2017] provide an overview and a nice connection with variational sparse GP.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 18 / 55

slide-19
SLIDE 19

Model Approximation vs. Approximate Inference

When the exact model/inference is intractable, there are typically two types of approaches:

Approximate the original model with a simpler one such that inference becomes tractable, e.g., the Nyström approximation and FITC.

Keep the original model but derive an approximate inference method, which is often not able to return the exact answer, e.g., variational inference.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 19 / 55

slide-20
SLIDE 20

Model Approximation vs. Approximate Inference

A problem with model approximation is that, when the approximated model requires some tuning, e.g., of the hyper-parameters, it is unclear how to improve it based on the training data. In the case of FITC, we know the model is correct if Z = X; however, optimizing Z does not necessarily lead to better locations. In fact, optimizing Z can lead to overfitting [Quiñonero-Candela and Rasmussen, 2005].

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 20 / 55

slide-21
SLIDE 21

Variational Sparse Gaussian Process (1)

Titsias [2009] introduces a variational approach for sparse GP. It follows the same concept of pseudo data:

p(y|X) = ∫_{f,u} p(y|f) p(f|u, X, Z) p(u|Z),

where

p(u|Z) = N(u | 0, Kuu),
p(f|u, X, Z) = N(f | Kfu Kuu⁻¹ u, Kff − Kfu Kuu⁻¹ Kfu^⊤).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 21 / 55

slide-22
SLIDE 22

Variational Sparse Gaussian Process (2)

Instead of approximating the model, Titsias [2009] derives a variational lower bound. Normally, a variational lower bound of a marginal likelihood, also known as the evidence lower bound (ELBO), looks like

log p(y|X) = log ∫_{f,u} p(y|f) p(f|u, X, Z) p(u|Z)
           ≥ ∫_{f,u} q(f, u) log [ p(y|f) p(f|u, X, Z) p(u|Z) / q(f, u) ].

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 22 / 55

slide-23
SLIDE 23

Special Variational Posterior

Titsias [2009] defines an unusual variational posterior:

q(f, u) = p(f|u, X, Z) q(u), where q(u) = N(u | µ, Σ).

Plug it into the lower bound (the p(f|u, X, Z) factors cancel):

L = ∫_{f,u} p(f|u, X, Z) q(u) log [ p(y|f) p(f|u, X, Z) p(u|Z) / (p(f|u, X, Z) q(u)) ]
  = ⟨log p(y|f)⟩_{p(f|u,X,Z)q(u)} − KL(q(u) ‖ p(u|Z))
  = ⟨log N(y | Kfu Kuu⁻¹ u, σ²I)⟩_{q(u)} − (1/(2σ²)) tr(Kff − Kfu Kuu⁻¹ Kfu^⊤) − KL(q(u) ‖ p(u|Z)).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 23 / 55

slide-24
SLIDE 24

Special Variational Posterior

There is no inversion of any big covariance matrix in the first term:

−(N/2) log 2πσ² − (1/(2σ²)) ⟨(Kfu Kuu⁻¹ u − y)^⊤ (Kfu Kuu⁻¹ u − y)⟩_{q(u)}.

The overall complexity of the lower bound is O(NM²).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 24 / 55

slide-25
SLIDE 25

Tighten the Bound

Find the optimal parameters of q(u):

µ*, Σ* = arg max_{µ,Σ} L(µ, Σ).

Make the bound as tight as possible by plugging in µ* and Σ*:

L = log N(y | 0, Kfu Kuu⁻¹ Kfu^⊤ + σ²I) − (1/(2σ²)) tr(Kff − Kfu Kuu⁻¹ Kfu^⊤).

The overall complexity of the lower bound remains O(NM²).
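A sketch of the collapsed bound under the same illustrative kern(A, B) assumption (not from the slides); it is the Nyström-style log-density plus the trace correction, evaluated in O(NM²).

import numpy as np

def titsias_collapsed_bound(kern, X, y, Z, sigma2):
    N, M = X.shape[0], Z.shape[0]
    Kfu = kern(X, Z)
    Kuu = kern(Z, Z) + 1e-8 * np.eye(M)
    Luu = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(Luu, Kfu.T)                   # Qff = V^T V
    A = sigma2 * np.eye(M) + V @ V.T                  # M x M system from Woodbury
    LA = np.linalg.cholesky(A)
    tmp = np.linalg.solve(LA, V @ y.ravel())
    # log N(y | 0, Qff + sigma^2 I) via Woodbury / matrix determinant lemma.
    quad = (y.ravel() ** 2).sum() / sigma2 - (tmp ** 2).sum() / sigma2
    logdet = (N - M) * np.log(sigma2) + 2 * np.log(np.diag(LA)).sum()
    log_density = -0.5 * (quad + logdet + N * np.log(2 * np.pi))
    # Trace term -1/(2 sigma^2) tr(Kff - Qff), using only diag(Kff).
    kff_diag = np.array([kern(X[i:i+1], X[i:i+1])[0, 0] for i in range(N)])
    trace_term = -0.5 / sigma2 * (kff_diag.sum() - (V ** 2).sum())
    return log_density + trace_term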

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 25 / 55

slide-26
SLIDE 26

Variational sparse GP

Note that L is not a valid log-pdf: ∫_y exp(L(y)) ≤ 1, due to the trace term.

As inducing points are variational parameters, optimizing the inducing inputs Z always leads to a better bound. The model does not “overfit” with too many inducing points.

[Figure: variational sparse GP fit (mean, inducing inputs, data, confidence)]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 26 / 55

slide-27
SLIDE 27

An alternative view of sparse GP

Is variational sparse GP a hack that only works for GPs?

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 27 / 55

slide-28
SLIDE 28

The two key ingredients

The two key ingredients of variational sparse GP:

Variational compression [Hensman and Lawrence, 2014]
Variational posterior with “pseudo data” (“variational Gaussian process” by Tran et al. [2016])

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 28 / 55

slide-29
SLIDE 29

A generic variational lower bound

Consider a generic probabilistic model p(y|h)p(h|x) with h being a latent variable. A variational lower bound is typically like

log p(y|x) ≥ ∫_h q(h) log [ p(y|h) p(h|x) / q(h) ].

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 29 / 55

slide-30
SLIDE 30

Use prior as posterior

We are free to choose the form of the variational posterior. It is always possible to choose q(h) = p(h|x). This results in a lower bound:

L = ∫_h p(h|x) log p(y|h).

The same idea can easily be applied to a deeper model:

log p(y|x) = log ∫_{h1,h2,h3} p(y|h1) p(h1|h2) p(h2|h3) p(h3|x)
           ≥ ∫_{h1,h2,h3} p(h1|h2) p(h2|h3) p(h3|x) log p(y|h1).

This is weird. Does it work?
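A toy sketch to make the trick concrete (the callables and names are made up for illustration): since the bound is an expectation of log p(y|h) under the prior/conditional distributions, it can be estimated by sampling h from them and averaging.

import numpy as np

def prior_as_posterior_bound(x, y, sample_h_given_x, log_p_y_given_h, n_samples=100):
    # L = E_{p(h|x)}[log p(y|h)], estimated by Monte Carlo with prior samples.
    vals = []
    for _ in range(n_samples):
        h = sample_h_given_x(x)          # draw h ~ p(h|x) (or h1 through deeper layers)
        vals.append(log_p_y_given_h(y, h))
    return np.mean(vals)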

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 30 / 55

slide-31
SLIDE 31

Sigmoid Belief Networks

It works surprisingly well. Dai and Lawrence [2015] applied this trick to sigmoid belief networks (SBN):

p(y|h1) ∏_{l=1}^{L−1} p(hl|hl+1) p(hL),

where

p(y|h1) = ∏_i σ(W1,i h1 + b1,i)^{yi} σ(−W1,i h1 − b1,i)^{1−yi},
p(hl|hl+1) = ∏_i σ(Wl+1,i hl+1 + bl+1,i)^{hl,i} σ(−Wl+1,i hl+1 − bl+1,i)^{1−hl,i},
p(hL) = ∏_i πi^{hL,i} (1 − πi)^{1−hL,i}.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 31 / 55

slide-32
SLIDE 32

Sigmoid Belief Networks

Variational inference of SBN is very hard [Mnih and Gregor, 2014]:

log p(y) ≥ Σ_{h1,...,hL} q(h1, . . . , hL) log [ p(y|h1) ∏_{l=1}^{L−1} p(hl|hl+1) p(hL) / q(h1, . . . , hL) ]

L = Σ_{h1,...,hL} q(hL) ∏_{l=1}^{L−1} p(hl|hl+1) log p(y|h1)

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 32 / 55

slide-33
SLIDE 33

3 hidden layers (100-100-10). The examples generated from each value of the top layer: 1024 examples in total. Columns encode the first 5 bits; rows encode the last 5 bits.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 33 / 55

slide-34
SLIDE 34

What is the price?

The variational lower bound can also be written as

log p(y|x) = ∫_h q(h) log [ p(y|h) p(h|x) / q(h) ] + KL(q(h) ‖ p(h|x, y)).

With q(h) = p(h|x),

log p(y|x) = ∫_h p(h|x) log p(y|h) + KL(p(h|x) ‖ p(h|x, y)).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 34 / 55

slide-35
SLIDE 35

What is the price?

The lower bound is exact only if the posterior is the same as the prior:

log p(y|x) = ∫_h p(h|x) log p(y|h) + KL(p(h|x) ‖ p(h|x, y)),
p(h|x, y) = p(y|h) p(h|x) / ∫_{h′} p(y|h′) p(h′|x).

KL(p(h|x) ‖ p(h|x, y)) is zero when:

y is independent of h, i.e., p(y|h) = p(y).
p(h|x) is a deterministic relation, i.e., p(h|x) = δ(h(x)).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 35 / 55

slide-36
SLIDE 36

A bias towards being a deterministic function

This variational posterior introduces a bias towards being a deterministic function.

Assume the model is parameterized by θ, i.e., p(y|h, θ) p(h|x, θ).

Point estimate:

θ* = arg max_θ L(θ) = arg max_θ ( log p(y|x, θ) − KL(p(h|x, θ) ‖ p(h|x, y, θ)) ).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 36 / 55

slide-37
SLIDE 37

A bias towards being a deterministic function

Variational inference (the p(h|x, θ) factors cancel):

L̂ = ∫_{h,θ} p(h|x, θ) q(θ) log [ p(y|h, θ) p(h|x, θ) p(θ) / (p(h|x, θ) q(θ)) ]
   = ⟨ log p(y|x, θ) − KL(p(h|x, θ) ‖ p(h|x, y, θ)) ⟩_{q(θ)} − KL(q(θ) ‖ p(θ))
   = ⟨ log p(y|x, θ) ⟩_{q(θ)} − KL(q(θ) ‖ p(θ)) − ⟨ KL(p(h|x, θ) ‖ p(h|x, y, θ)) ⟩_{q(θ)}.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 37 / 55

slide-38
SLIDE 38

The bias in variational Sparse GP

Variational sparse GP often “under-fits”:

L = ∫_{f,u} p(f|u, X, Z) q(u) log [ p(y|f) p(f|u, X, Z) p(u|Z) / (p(f|u, X, Z) q(u)) ]
  = ∫_{f,u} p(f|u, X, Z) q(u) log p(y|f) − KL(q(u) ‖ p(u|Z))
  = ⟨log p(y|X, u, Z)⟩_{q(u)} − KL(q(u) ‖ p(u|Z)) − ⟨KL(p(f|X, u, Z) ‖ p(f|X, y, u, Z))⟩_{q(u)}.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 38 / 55

slide-39
SLIDE 39

Flexible variational posterior

Consider a generic probabilistic model, e.g., p(y|f)p(f). Variational lower bound:

L = ∫_f q(f) log [ p(y|f) p(f) / q(f) ].

One flexible variational posterior:

q(f) = ∫_u q(f|u) q(u).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 39 / 55

slide-40
SLIDE 40

Auxiliary variable

It may not be tractable to compute ∫_u q(f|u) q(u).

One way to work around this is to introduce u as an auxiliary variable in the model: p(y|f) p(f) p(u|f).

Note that we are free to choose the forms of p(u|f), q(u) and q(f|u).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 40 / 55

slide-41
SLIDE 41

A further lower bound

Show the relation between the two lower bounds:

∫_f q(f) log [ p(y|f) p(f) / q(f) ]
  = ∫_{f,u} q(u|f) q(f) log [ p(y|f) p(f) p(u|f) / (q(u|f) q(f)) ] − ∫_f q(f) ∫_u q(u|f) log [ p(u|f) / q(u|f) ]

L = L_{f,u} + ⟨KL(q(u|f) ‖ p(u|f))⟩_{q(f)} ≥ L_{f,u}

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 41 / 55

slide-42
SLIDE 42

In the case of sparse GP

This leads back to the usual sparse GP bound that we know:

p(y|f) = N(y | f, σ²I),
p(f|X) = N(f | 0, K(X, X)),
p(u|Z, f, X) = N(u | Kuf Kff⁻¹ f, Kuu − Kfu^⊤ Kff⁻¹ Kfu),
q(u|f)q(f) = q(f|u)q(u) = p(f|X, u, Z) q(u).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 42 / 55

slide-43
SLIDE 43

Parallel Sparse Gaussian Process

Beyond approximating the inference, maybe we could exploit parallelization. For exact Gaussian processes this turns out to be very hard, because parallelizing the Cholesky decomposition is very difficult. Dai et al. [2014] and Gal et al. [2014] propose a parallel inference method for sparse GP.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 43 / 55

slide-44
SLIDE 44

Data Parallelism

Consider a training set D = {(x1, y1), . . . , (xN, yN)}. Assume there are C computational cores/machines.

A data parallelism algorithm divides the data set into C partitions as evenly as possible: D = ⋃_{c=1}^{C} Dc.

The parallelism happens in such a way that the computation running on each core only requires the data from its local partition.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 44 / 55

slide-45
SLIDE 45

A simple example: neural network regression

l = Σ_{n=1}^{N} ||yn − fθ(xn)||² = Σ_{c=1}^{C} Σ_{nc∈Dc} ||ync − fθ(xnc)||²

1. Each core computes its local objective lc = Σ_{nc∈Dc} ||ync − fθ(xnc)||².
2. Each core computes the gradient of its local objective ∂lc/∂θ.
3. Aggregate all the local objectives and gradients: l = Σ_{c=1}^{C} lc and ∂l/∂θ = Σ_{c=1}^{C} ∂lc/∂θ.
4. Take a step along the gradient following a gradient descent algorithm.
5. Repeat from Step 1 until convergence.
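A minimal single-machine sketch of this scheme (made up for illustration): it uses Python multiprocessing and, for brevity, a linear model fθ(x) = Xθ in place of a neural network; the partition/aggregate pattern is the point, not the model.

import numpy as np
from multiprocessing import Pool

def local_obj_grad(args):
    # Objective and gradient on one partition: l_c and dl_c/dtheta.
    Xc, yc, theta = args
    r = Xc @ theta - yc
    return (r ** 2).sum(), 2.0 * Xc.T @ r

def parallel_step(partitions, theta, lr=1e-3):
    with Pool(len(partitions)) as pool:
        results = pool.map(local_obj_grad, [(Xc, yc, theta) for Xc, yc in partitions])
    loss = sum(l for l, _ in results)          # aggregate local objectives
    grad = sum(g for _, g in results)          # aggregate local gradients
    return theta - lr * grad, loss             # one gradient descent step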

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 45 / 55

slide-46
SLIDE 46

Data Parallelism for Sparse GP

The variational lower bound (after applying the Woodbury formula) is

L = −(N/2) log 2πσ² + (1/2) log ( |Kuu| / |Kuu + σ⁻²Φ| ) − (1/(2σ²)) y^⊤y
    + (1/(2σ⁴)) y^⊤ Kfu (Kuu + σ⁻²Φ)⁻¹ Kfu^⊤ y − (1/(2σ²)) φ + (1/(2σ²)) tr(Kuu⁻¹ Φ),

where Φ = Kfu^⊤ Kfu and φ = tr(Kff).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 46 / 55

slide-47
SLIDE 47

Data Parallelism for Sparse GP

The lower bound is not fully distributable as in the simple example, but all the terms involving data can be written as sums across data points:

y^⊤y = Σ_{n=1}^{N} yn²,
y^⊤Kfu = Σ_{n=1}^{N} yn Kfnu,
Φ = Σ_{n=1}^{N} Kfnu^⊤ Kfnu,
φ = Σ_{n=1}^{N} Kfnfn,

where Kfnu = K(xn, Z) and Kfnfn = K(xn, xn).

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 47 / 55

slide-48
SLIDE 48

Data Parallelism for Sparse GP

1. [local] Compute all the data-related terms locally: yc^⊤yc, yc^⊤Kfcu, Φc and φc.
2. [global] Aggregate all the local terms and compute the lower bound L on one node.
3. [global] Compute the gradient of the bound w.r.t. the model parameters.
4. [global] Compute the gradients w.r.t. the local terms, ∂L/∂Kfcu, ∂L/∂Φc and ∂L/∂φc, and broadcast them to the individual nodes.
5. [local] Compute the gradient contribution of the local terms and aggregate the local gradients into the final gradient.
6. [global] Take a gradient step and repeat from Step 1.
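A sketch of the local step (Step 1), again assuming an illustrative kern(A, B) helper that is not part of the slides; each node computes its partition's statistics, which are then summed element-wise on the central node before evaluating the bound.

import numpy as np

def local_statistics(kern, Xc, yc, Z):
    # Per-partition terms: y_c^T y_c, y_c^T K_{fc,u}, Phi_c and phi_c.
    Kfu_c = kern(Xc, Z)                                # N_c x M
    yy_c = (yc ** 2).sum()
    yKfu_c = yc.T @ Kfu_c                              # 1 x M
    Phi_c = Kfu_c.T @ Kfu_c                            # M x M
    phi_c = np.trace(kern(Xc, Xc))                     # only the diagonal is needed; trace shown for brevity
    return yy_c, yKfu_c, Phi_c, phi_c

# On the central node (usage sketch):
# stats = [local_statistics(kern, Xc, yc, Z) for (Xc, yc) in partitions]
# yy, yKfu, Phi, phi = [sum(s) for s in zip(*stats)]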

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 48 / 55

slide-49
SLIDE 49

Data Parallelism for Sparse GP

[Figure: average time per iteration (seconds) and percentage of non-distributable computation time vs. number of data points, for 1–32 CPU cores and 1–4 GPUs]

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 49 / 55

slide-50
SLIDE 50

The emergence of deep learning platforms

Deep learning platforms such as Theano, TensorFlow, Torch, Caffe and MXNet have emerged in recent years. They standardize the programming of deep neural networks. Auto-differentiation enables the flexible construction of DNNs. GPU acceleration enables scalability for real-world applications.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 50 / 55

slide-51
SLIDE 51

GPU for machine learning

The von Neumann architecture is not well suited to machine learning.

Memory bandwidth:
GPU (NVIDIA V100, AWS P3): 900 GB/s
CPU (Intel Xeon E5-2660 v3, AWS C4): 68 GB/s

A GTX 580 GPU has only 3 GB of memory.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 51 / 55

slide-52
SLIDE 52

Probabilistic Programming on deep learning platforms

Edward http://edwardlib.org
PyMC3 https://github.com/pymc-devs/pymc3
pyprob https://github.com/probprog/pyprob
Pyro https://github.com/uber/pyro

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 52 / 55

slide-53
SLIDE 53

GP on deep learning platforms

GPflow https://github.com/GPflow/GPflow
GPyTorch https://github.com/cornellius-gp/gpytorch

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 53 / 55

slide-54
SLIDE 54

MXFusion and GPy2

Beyond GPU acceleration and auto-differentiation: use a Gaussian process as a building block.

MXFusion: a modular probabilistic programming language. https://github.com/amzn/MXFusion

GPy2 (a new interface based on MXFusion):
◮ writing new kernels with auto-differentiation
◮ scalable inference on GPU
◮ constructing hybrid GP, deep GP and recurrent GP models by re-using the GP module with scalable approximate inference.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 54 / 55

slide-55
SLIDE 55

Acknowledgement. James’s blog: https://www.prowler.io/blog/sparse-gps-approximate-the-posterior-not-the-model

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 55 / 55

slide-56
SLIDE 56

Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18:3649–3720, 2017.
Zhenwen Dai and Neil D. Lawrence. Variational hierarchical community of experts. In ICML Deep Learning Workshop, 2015.
Zhenwen Dai, Andreas Damianou, James Hensman, and Neil D. Lawrence. Gaussian process models with parallelization and GPU acceleration. In NIPS Workshop on Software Engineering for Machine Learning, 2014.
Yarin Gal, Mark van der Wilk, and Carl Edward Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems 27, pages 3257–3265, 2014.
James Hensman and Neil D. Lawrence. Nested variational compression in deep Gaussian processes. arXiv:1412.1370, 2014.
Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2006.
Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 55 / 55

slide-57
SLIDE 57

Dustin Tran, Rajesh Ranganath, and David M. Blei. The variational Gaussian process. In International Conference on Learning Representations, 2016.
Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

Zhenwen Dai (Amazon) Scalable Gaussian Processes September 4, 2018 @GPSS2018 55 / 55