SLIDE 1

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Belyaev Mikhail (1,2,3), Burnaev Evgeny (1,2,3), Kapushev Yermek (1,2)

1 Institute for Information Transmission Problems, Moscow, Russia
2 DATADVANCE, llc, Moscow, Russia
3 PreMoLab, MIPT, Dolgoprudny, Russia

ICML 2014 workshop on New Learning Frameworks and Models for Big Data

Beijing, 2014

SLIDE 2

Approximation task

Problem statement. Let $y = g(x)$ be some unknown function. The training sample is given: $D = \{x_i, y_i\}$, $g(x_i) = y_i$, $i = 1, \dots, N$. The task is to construct an approximation $\hat{f}(x)$ such that $\hat{f}(x) \approx g(x)$.

SLIDE 3

Factorial Design of Experiments

Factors: $s_k = \{x^k_{i_k} \in X_k,\ i_k = 1, \dots, n_k\}$, $X_k \subset \mathbb{R}^{d_k}$, $k = 1, \dots, K$, where $d_k$ is the dimensionality of the factor $s_k$.

Factorial Design of Experiments: $S = \{x_i\}_{i=1}^{N} = s_1 \times s_2 \times \cdots \times s_K$.

Dimensionality of $x \in S$: $d = \sum_{k=1}^{K} d_k$.

Sample size: $N = \prod_{k=1}^{K} n_k$.

Figure: two-dimensional factorial DoE over factors x1 and x2.

SLIDE 4

Example of Factorial DoE

Figure: Factorial DoE with a multidimensional factor (axes x1, x2, x3).

SLIDE 5

DoE in engineering problems

1. Independent groups of variables (factors).
2. Training sample generation procedure: fix a value of the first factor; perform experiments varying the values of the other factors; fix another value of the first factor; perform a new series of experiments.
3. Take into account knowledge from the subject domain.

Data properties:
- Has a special structure.
- Can have a large sample size.
- Factor sizes can differ significantly.

SLIDE 6

Example of Factorial DoE

Modeling of the pressure distribution over an aircraft wing:
- Angle of attack: $s_1 = \{0, 0.8, 1.6, 2.4, 3.2, 4.0\}$.
- Mach number: $s_2 = \{0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83\}$.
- Wing point coordinates: $s_3$, 5000 points in $\mathbb{R}^3$.

The training sample: $S = s_1 \times s_2 \times s_3$. Dimensionality $d = 1 + 1 + 3 = 5$. Sample size $N = 6 \cdot 7 \cdot 5000 = 210000$.

SLIDE 7

Existing solutions

- Universal techniques. Disadvantages: do not take the sample structure into account, which leads to low approximation quality and high computational complexity.
- Multivariate Adaptive Regression Splines [Friedman, 1991]. Disadvantages: discontinuous derivatives, non-physical behaviour.
- Tensor products of splines [Stone et al., 1997; Xiao et al., 2013]. Disadvantages: only one-dimensional factors, no accuracy evaluation procedure.
- Gaussian Processes on a lattice [Dietrich and Newsam, 1997; Stroud et al., 2014]. Disadvantages: limited to a two-dimensional grid with equidistant points.

SLIDE 8

The aim is to develop a computationally efficient algorithm that takes into account the special features of factorial Designs of Experiments.

SLIDE 9

Gaussian Process Regression

Function model: $g(x) = f(x) + \varepsilon(x)$, where $f(x)$ is a Gaussian process (GP) and $\varepsilon(x)$ is Gaussian white noise. A GP is fully defined by its mean and covariance function.

The covariance function of $f(x)$:
$$K_f(x, x') = \sigma_f^2 \exp\left(-\sum_{i=1}^{d} \theta_i^2 \left(x^{(i)} - x'^{(i)}\right)^2\right),$$
where $x^{(i)}$ is the $i$-th component of the vector and $\theta = (\sigma_f^2, \theta_1, \dots, \theta_d)$ are the parameters of the covariance function.

The covariance function of $g(x)$:
$$K_g(x, x') = K_f(x, x') + \sigma_{noise}^2 \delta(x, x'),$$
where $\delta(x, x')$ is the Kronecker delta.
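As an illustration, here is a minimal sketch of this covariance function in Python with NumPy; the function names sq_exp_cov and noisy_cov and the vectorized pairwise computation are my own choices, not part of the slides.

```python
import numpy as np

def sq_exp_cov(X1, X2, sigma_f2, theta):
    """K_f(x, x') = sigma_f^2 * exp(-sum_i theta_i^2 (x_i - x'_i)^2)
    for all pairs of rows of X1 (n1 x d) and X2 (n2 x d)."""
    diff = X1[:, None, :] - X2[None, :, :]          # (n1, n2, d) pairwise differences
    return sigma_f2 * np.exp(-np.sum((theta ** 2) * diff ** 2, axis=-1))

def noisy_cov(X, sigma_f2, theta, sigma_noise2):
    """K_g = K_f + sigma_noise^2 * I on the training points."""
    return sq_exp_cov(X, X, sigma_f2, theta) + sigma_noise2 * np.eye(len(X))
```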

SLIDE 10

Choosing parameters

Maximum Likelihood Estimation. Log likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi,$$
where $|K_g|$ is the determinant of the matrix $K_g = \left(K_g(x_i, x_j)\right)_{i,j=1}^{N}$, $x_i, x_j \in S$.

The parameters $\theta^*$ are chosen such that
$$\theta^* = \arg\max_{\theta} \log p(y \mid X, \theta).$$

SLIDE 11

Final model

Prediction of $g(x)$ at a point $x$:
$$\hat{f}(x) = k^T K_g^{-1} y,$$
where $k = (K_f(x_1, x), \dots, K_f(x_N, x))$.

Posterior variance:
$$\sigma^2(x) = K_f(x, x) - k^T K_g^{-1} k.$$
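A minimal sketch of these two formulas, assuming the covariance helpers sq_exp_cov and noisy_cov from the earlier sketch; a Cholesky factorization replaces the explicit inverse for numerical stability, which is a standard substitution rather than something stated on the slide.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X_train, y, X_test, sigma_f2, theta, sigma_noise2):
    """Posterior mean k^T K_g^{-1} y and variance K_f(x,x) - k^T K_g^{-1} k."""
    K_g = noisy_cov(X_train, sigma_f2, theta, sigma_noise2)
    K_s = sq_exp_cov(X_train, X_test, sigma_f2, theta)   # columns are the vectors k
    L = cho_factor(K_g)                                   # K_g = L L^T, O(N^3)
    mean = K_s.T @ cho_solve(L, y)
    # K_f(x, x) = sigma_f^2 for the squared-exponential covariance
    var = sigma_f2 - np.sum(K_s * cho_solve(L, K_s), axis=0)
    return mean, var
```

The $O(N^3)$ factorization in this direct approach is exactly the bottleneck the following slides address.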

SLIDE 12

Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: $O(N^3)$. In the case of factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.

SLIDE 14

Estimation of covariance function parameters

Log likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi.$$

Derivatives:
$$\frac{\partial}{\partial \theta} \log p(y \mid X, \sigma_f, \sigma_{noise}) = -\frac{1}{2} \mathrm{Tr}\left(K_g^{-1} K'\right) + \frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y,$$
where $\theta$ is a parameter of the covariance function (a component $\theta_i$, $\sigma_{noise}$ or $\sigma_{f,i}$, $i = 1, \dots, d$) and $K' = \frac{\partial K}{\partial \theta}$.

SLIDE 15

Tensors and Kronecker product

Definition. A tensor $Y$ is a $K$-dimensional matrix of size $n_1 \times n_2 \times \cdots \times n_K$:
$$Y = \left\{y_{i_1, i_2, \dots, i_K},\ \{i_k = 1, \dots, n_k\}_{k=1}^{K}\right\}.$$

Definition. The Kronecker product of matrices $A$ and $B$ is the block matrix
$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}.$$

SLIDE 16

Related operations

Operation vec:
$$\mathrm{vec}(Y) = \left[Y_{1,1,\dots,1}, Y_{2,1,\dots,1}, \dots, Y_{n_1,1,\dots,1}, Y_{1,2,\dots,1}, \dots, Y_{n_1,n_2,\dots,n_K}\right].$$

Multiplication of a tensor by a matrix along the $k$-th direction:
$$Z = Y \otimes_k B \iff Z_{i_1,\dots,i_{k-1},j,i_{k+1},\dots,i_K} = \sum_{i_k} Y_{i_1,\dots,i_k,\dots,i_K} B_{i_k j}.$$

Connection between tensors and the Kronecker product:
$$\mathrm{vec}(Y \otimes_1 B_1 \cdots \otimes_K B_K) = (B_1 \otimes \cdots \otimes B_K)\,\mathrm{vec}(Y). \tag{1}$$

The complexity of computing the left-hand side is $O\left(N \sum_k n_k\right)$; of the right-hand side, $O(N^2)$.
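A small NumPy sketch checking identity (1) numerically. Note that the pairing of the vec ordering with the Kronecker factor order depends on convention: the code below uses C-order flattening and contracts with B[j, i] (the transpose of the slide's $B_{i_k j}$), which is the combination under which $B_1 \otimes \cdots \otimes B_K$ matches. All names here are illustrative.

```python
import numpy as np

def mode_k_product(Y, B, k):
    """Z[..., j, ...] = sum_i B[j, i] * Y[..., i, ...] along axis k."""
    Z = np.tensordot(B, Y, axes=(1, k))   # contracted axis becomes axis 0
    return np.moveaxis(Z, 0, k)           # move it back to position k

rng = np.random.default_rng(0)
shape = (3, 4, 2)
Y = rng.standard_normal(shape)
Bs = [rng.standard_normal((n, n)) for n in shape]

# Left-hand side of (1): K cheap mode-k products, O(N * sum_k n_k)
Z = Y
for k, B in enumerate(Bs):
    Z = mode_k_product(Z, B, k)

# Right-hand side of (1): explicit Kronecker product, O(N^2)
K = Bs[0]
for B in Bs[1:]:
    K = np.kron(K, B)

assert np.allclose(Z.ravel(), K @ Y.ravel())
```

The left-hand side never materializes the $N \times N$ matrix, which is exactly what makes the factorial-DoE computations below tractable.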

SLIDE 17

Covariance function

Form of the covariance function:
$$K_f(x, y) = \prod_{i=1}^{K} k_i(x_i, y_i), \quad x_i, y_i \in s_i,$$
where $k_i$ is an arbitrary covariance function for the $i$-th factor.

Covariance matrix:
$$K = \bigotimes_{i=1}^{K} K_i,$$
where $K_i$ is the covariance matrix of the $i$-th factor.

SLIDE 18

Fast computation of loglikelihood

Proposition. Let $K_i = U_i D_i U_i^T$ be a Singular Value Decomposition (SVD) of the matrix $K_i$, where $U_i$ is an orthogonal matrix and $D_i$ is diagonal. Then:
$$|K_g| = \prod_{i_1, \dots, i_K} \mathcal{D}_{i_1, \dots, i_K},$$
$$K_g^{-1} = \left(\bigotimes_k U_k^T\right) \left(\bigotimes_k D_k + \sigma^2_{noise} I\right)^{-1} \left(\bigotimes_k U_k\right),$$
$$K_g^{-1} y = \mathrm{vec}\left[\left((Y \otimes_1 U_1 \cdots \otimes_K U_K) * \mathcal{D}^{-1}\right) \otimes_1 U_1^T \cdots \otimes_K U_K^T\right], \tag{2}$$
where $\mathcal{D}$ is the tensor of the diagonal elements of the matrix $\sigma^2_{noise} I + \bigotimes_k D_k$ and $*$ denotes element-wise multiplication.
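A sketch of formula (2) in NumPy, reusing mode_k_product from the previous sketch; eigh stands in for the SVD since each $K_i$ is symmetric positive semi-definite, and kron_gp_solve is an illustrative name, not the authors' code.

```python
import numpy as np
from functools import reduce

def kron_gp_solve(Ks, Y, sigma2_noise):
    """Compute K_g^{-1} y, K_g = kron(K_1,...,K_K) + sigma2*I, per formula (2).
    Y is the output tensor of shape (n_1, ..., n_K); the N x N matrix is never formed."""
    eig = [np.linalg.eigh(K) for K in Ks]        # K_i = U_i D_i U_i^T
    Z = Y
    for k, (d, U) in enumerate(eig):             # rotate: Y x1 U1^T ... xK UK^T
        Z = mode_k_product(Z, U.T, k)
    # eigenvalues of kron(K_1,...,K_K) are the outer product of the factor spectra
    D = reduce(np.multiply.outer, [d for d, _ in eig]) + sigma2_noise
    Z = Z / D                                    # element-wise division by the tensor D
    for k, (d, U) in enumerate(eig):             # rotate back: x1 U1 ... xK UK
        Z = mode_k_product(Z, U, k)
    return Z    # tensor whose vec is K_g^{-1} y; log|K_g| = np.sum(np.log(D))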

SLIDE 19

Computational complexity

Proposition. Calculation of the log likelihood using (2) has the following computational complexity:
$$O\left(N \sum_{i=1}^{K} n_i + \sum_{i=1}^{K} n_i^3\right).$$

Assuming $n_i \ll N$ (the number of factors is large and their sizes are close), we get
$$O\left(N \sum_i n_i\right) = O\left(N^{1 + \frac{1}{K}}\right).$$

SLIDE 20

Fast computation of derivatives

Proposition. The following statements hold:
$$\mathrm{Tr}\left(K_g^{-1} K'\right) = \left\langle \mathrm{diag}\left(\hat{D}^{-1}\right),\ \bigotimes_{i=1}^{K} \mathrm{diag}\left(U_i K_i' U_i^T\right) \right\rangle,$$
$$\frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y = \frac{1}{2}\left\langle A,\ A \otimes_1 K_1^T \otimes_2 \cdots \otimes_{i-1} K_{i-1}^T \otimes_i \frac{\partial K_i^T}{\partial \theta} \otimes_{i+1} K_{i+1}^T \otimes_{i+2} \cdots \otimes_K K_K^T \right\rangle,$$
where $\hat{D} = \sigma^2_{noise} I + \bigotimes_k D_k$ and $\mathrm{vec}(A) = K_g^{-1} y$.

The computational complexity is
$$O\left(N \sum_{i=1}^{K} n_i + \sum_{i=1}^{K} n_i^3\right).$$

SLIDE 21

Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: $O(N^3)$. In the case of factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.

SLIDE 23

Degeneracy

Example of degeneracy

Figure: Original function (true function and training set).

Figure: Approximation obtained using GP from the GPML toolbox.

SLIDE 24

Regularization

Prior distribution:
$$\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} \sim Be(\alpha, \beta), \quad \{i = 1, \dots, d_k\}_{k=1}^{K},$$
$$a_k^{(i)} = c_k \max_{x, y \in s_k}\left(x^{(i)} - y^{(i)}\right), \qquad b_k^{(i)} = C_k \min_{x, y \in s_k,\ x \neq y}\left(x^{(i)} - y^{(i)}\right),$$
where $Be(\alpha, \beta)$ is the Beta distribution, and $c_k$ and $C_k$ are parameters of the algorithm (we use $c_k = 0.01$ and $C_k = 2$).

Initialization:
$$\theta_k^{(i)} = \left[\frac{1}{n_k}\left(\max_{x \in s_k} x^{(i)} - \min_{x \in s_k} x^{(i)}\right)\right]^{-1}.$$

SLIDE 25

Regularization

Log likelihood:
$$\log p(y \mid X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi$$
$$+ \sum_{k,i}\left[(\alpha - 1) \log \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} + (\beta - 1) \log\left(1 - \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\right)\right] - d \log B(\alpha, \beta).$$


Figure: Penalty function, log(PDF) of the prior for α = β = 2.
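A sketch of this penalty term in Python; beta_log_prior is an illustrative name, and the arrays theta, a, b are assumed to be flattened over all factor components (k, i).

```python
import numpy as np
from scipy.special import betaln

def beta_log_prior(theta, a, b, alpha=2.0, beta=2.0):
    """Beta-prior penalty added to the GP log likelihood (see the formula above)."""
    t = (theta - a) / (b - a)               # rescale each theta_k^(i) to (0, 1)
    if np.any((t <= 0.0) | (t >= 1.0)):
        return -np.inf                      # outside the support of the prior
    return np.sum((alpha - 1.0) * np.log(t) + (beta - 1.0) * np.log1p(-t)) \
        - t.size * betaln(alpha, beta)      # -d * log B(alpha, beta)
```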

SLIDE 27

Regularization

Example of regularized approximation

Figure: Original function (true function and training set).

Figure: Approximation (tensorGP-reg) obtained using the developed algorithm with regularization.

SLIDE 28

Toy problems

- A set of 34 functions (typical artificial test functions used for testing global optimization algorithms).
- Dimensionality: from 2 to 6.
- Sample sizes: from 80 to 216000, with full factorial DoE.
- Quality criteria: training time and approximation error
$$MSE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2.$$
- Tested algorithms:
  - MARS: Multivariate Adaptive Regression Splines;
  - SparseGP: sparse Gaussian Processes (GPML toolbox);
  - tensorGP: the developed algorithm without regularization;
  - tensorGP-reg: the developed algorithm with regularization.

SLIDE 29

Dolan-Moré curves

There are $T$ problems and $A$ algorithms; $e_{ta}$ is the approximation error (or training time) of the $a$-th algorithm on the $t$-th problem, and $\tilde{e}_t = \min_a e_{ta}$.
$$\rho_a(\tau) = \frac{\#\{t : e_{ta} \leq \tau \tilde{e}_t\}}{T}.$$
The higher the curve, the better the corresponding algorithm works; $\rho_a(1)$ is the fraction of problems on which the $a$-th algorithm worked best.
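A sketch of how such curves can be computed with NumPy; dolan_more is an illustrative name, and the ≤ comparison is used so that ρ_a(1) equals the fraction of problems where algorithm a is the best, as stated above.

```python
import numpy as np

def dolan_more(scores, taus):
    """Performance profiles rho_a(tau) for a (T problems x A algorithms)
    matrix of errors or training times (smaller is better)."""
    best = scores.min(axis=1, keepdims=True)              # e~_t = min_a e_ta
    return np.array([(scores <= tau * best).mean(axis=0)  # rho_a(tau), row per tau
                     for tau in taus])
```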

SLIDE 30

Results of experiments: training time

Figure: Dolan-Moré curves. Quality criterion: training time.

SLIDE 31

Results of experiments: approximation quality

Figure: Dolan-Moré curves. Quality criterion: MSE.

SLIDE 32

Real problem: Rotating disc of an impeller

Objective functions:
- p1: contact pressure;
- Srmax: maximum radial stress;
- w: weight of the disc.

The geometrical shape of the disc is parametrized by 6 input variables $x = (h_1, h_2, h_3, h_4, r_2, r_3)$ ($r_1$ and $r_4$ are fixed).

Training sample (generated from a computational physical model):
- Sample size: 14400.
- Factor sizes: {1, 8, 8, 3, 15, 5}.

Figure: Rotating disc of an impeller.

SLIDE 33

Results of the experiment

Table: Approximation errors of p1

                MAE       MSE       RRMS
  MARS          4.4644    6.5120    0.1166
  SparseGP      76.9313   86.7034   1.5530
  tensorGP-reg  0.3020    0.3981    0.0070

$$MAE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left|\hat{f}(x_i) - g(x_i)\right|, \qquad MSE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2,$$
$$RRMS = \sqrt{\frac{\sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2}{\sum_{i=1}^{N_{test}} \left(\bar{y} - g(x_i)\right)^2}}, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i.$$
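These three criteria in NumPy, as a small sketch with illustrative names; the square root in RRMS follows the "relative root mean square" reading of the formula above.

```python
import numpy as np

def approximation_errors(f_hat, g_true, y_train):
    """MAE, MSE and RRMS of predictions f_hat against test values g_true."""
    err = f_hat - g_true
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rrms = np.sqrt(np.sum(err ** 2) / np.sum((y_train.mean() - g_true) ** 2))
    return mae, mse, rrms
```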

SLIDE 34

Results of approximation

2D slices of the approximations (the other parameters are fixed).

Figure: Approximation obtained using GPML Sparse GP.

Figure: Approximation of p1 over (x5, x6) obtained using the developed algorithm with regularization (tensorGP-reg), with training and test sets shown.

SLIDE 35

Further research: missing data

Reasons for missing points:
- Data generation is still in progress (each point is time-consuming to compute).
- The data generator failed at some points.

SLIDE 36

Incomplete Factorial DoE

$S_{full}$ is the full factorial DoE, $N_{full} = |S_{full}|$. $S$ is the incomplete factorial DoE, $N = |S|$.

$S \subset S_{full}$ ⇒ the covariance matrix is no longer a Kronecker product:
$$K_f \neq \bigotimes_{i=1}^{K} K_i.$$

Figure: incomplete factorial design over x1 and x2 (some grid points are missing).

SLIDE 37

Computation of loglikelihood

Notation: $\{x_i\}_{i=1}^{N_{full}} = S_{full}$.
- $\tilde{K}_f$: the covariance matrix for the full factorial DoE $S_{full}$.
- $W$: a diagonal matrix such that $W_{ii} = 1$ if $x_i \in S$ and $W_{ii} = 0$ if $x_i \notin S$.
- $\tilde{y}$: the vector of outputs $y$ extended by arbitrary values at the missing points.

Proposition. Let $\tilde{z}^*$ be a solution of
$$\left(\tilde{K}_f W \tilde{K}_f + \sigma^2_{noise} \tilde{K}_f\right) \tilde{z} = \tilde{K}_f W \tilde{y}. \tag{3}$$
Then the solution $z^*$ of $(K_f + \sigma^2_{noise} I) z = y$, i.e. $z^* = (K_f + \sigma^2_{noise} I)^{-1} y$, has the form $z^* = (\tilde{z}^*_{i_1}, \dots, \tilde{z}^*_{i_N})$, where $i_k \in \{j : x_j \in S\}$, $k = 1, \dots, N$.
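A sketch of how system (3) can be solved matrix-free with the Conjugate Gradient method, reusing the Kronecker matvec idea from earlier; it skips the preconditioning change of variables (4) from the next slide, so it does not achieve the min(R+1, N+1) iteration bound. Function names and the SciPy-based structure are my own assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def kron_matvec(Ks, v):
    """(K_1 kron ... kron K_K) v via mode-k products, O(N_full * sum_k n_k)."""
    Z = v.reshape([K.shape[0] for K in Ks])
    for k, K in enumerate(Ks):
        Z = np.moveaxis(np.tensordot(K, Z, axes=(1, k)), 0, k)
    return Z.ravel()

def solve_incomplete(Ks, w, y_tilde, sigma2_noise):
    """Solve (K~ W K~ + sigma2 K~) z~ = K~ W y~, with w the 0/1 diagonal of W,
    then read off z* at the observed points."""
    n_full = w.size
    # K~ W K~ z + sigma2 K~ z = K~ (W K~ z + sigma2 z)
    mv = lambda z: kron_matvec(Ks, w * kron_matvec(Ks, z) + sigma2_noise * z)
    A = LinearOperator((n_full, n_full), matvec=mv)
    z_tilde, info = cg(A, kron_matvec(Ks, w * y_tilde))
    return z_tilde[w.astype(bool)]
```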

SLIDE 38

Computation of loglikelihood

$R = N_{full} - N$ is the number of missing points. $U = \bigotimes_{i=1}^{K} U_i$, $D = \bigotimes_{i=1}^{K} D_i$, $\hat{D} = D + \sigma^2_{noise} I$.

The change of variables:
$$\tilde{\alpha} = \begin{cases} \left(\dfrac{\hat{D}}{D}\right)^{\frac{1}{2}} U^T \tilde{z} & \text{if } R < N, \\[6pt] \left(\dfrac{\sigma^2_{noise}}{D}\right)^{\frac{1}{2}} U^T \tilde{z} & \text{if } R \geq N. \end{cases} \tag{4}$$

Proposition. The system of linear equations (3) can be solved using the change of variables (4) and the Conjugate Gradient Method in at most $\min(R + 1, N + 1)$ iterations. The computational complexity of each iteration is $O\left(N_{full} \sum_k n_k\right)$.

SLIDE 39

Computation of determinant

$$\tilde{K}_f + \sigma^2_{noise} I = \begin{pmatrix} K_g & A \\ A^T & B \end{pmatrix}.$$

Determinant:
$$|K_g| = \frac{\left|\tilde{K}_f + \sigma^2_{noise} I\right|}{\left|B - A^T K_g^{-1} A\right|}. \tag{5}$$

$\left|\tilde{K}_f + \sigma^2_{noise} I\right|$ is computed using the formulae for the full factorial case; $\left|B - A^T K_g^{-1} A\right|$ is computed numerically.

Proposition. The complexity of computing the determinant using (5) is
$$O\left(\min\{R + 1, N + 1\} \cdot R \cdot N_{full} \sum_k n_k\right).$$

SLIDE 40

Conclusion

The developed algorithm:
- is computationally efficient;
- can handle large samples;
- takes into account the features of the given data;
- has proved efficient on a large set of toy problems as well as on real-world problems.

SLIDE 41

Thank you for your attention!

More details are given in: Belyaev, M., Burnaev, E., and Kapushev, Y. (2014). Exact inference for Gaussian process regression in case of big data with the Cartesian product structure. arXiv preprint arXiv:1403.6573.

SLIDE 42

References

Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput., 18(4):1088–1107.

Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–141.

Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 25:1371–1470.

Stroud, J. R., Stein, M. L., and Lysen, S. (2014). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. arXiv preprint arXiv:1402.4281.

Xiao, L., Li, Y., and Ruppert, D. (2013). Fast bivariate P-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):577–599.
