Fast Laplace Approximation for Gaussian Processes with a Tensor Product Kernel - PowerPoint PPT Presentation



Slide 1

Fast Laplace Approximation for Gaussian Processes with a Tensor Product Kernel

Perry Groot^a, Markus Peters^b, Tom Heskes^a, Wolfgang Ketter^b

^a Radboud University Nijmegen, ^b Erasmus University Rotterdam

Slide 2

Introduction

Gaussian process models:

+ rich, principled Bayesian framework
− scalability: $O(N^3)$ in the number of data points $N$

Approaches addressing scalability:

  • approximations / subset selection: $O(M^2 N)$
  • exploiting additional structure

Tensor product kernels:

+ efficient use of Kronecker products on grid data
− so far limited to standard regression, which relies on the closed form: if $K = Q \Lambda Q^T$ then $(K + \sigma^2 I)^{-1} y = Q (\Lambda + \sigma^2 I)^{-1} Q^T y$; no comparable closed form exists for non-Gaussian likelihoods (see the sketch below).
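A minimal NumPy sketch of this closed form (the function name and the sanity check are illustrative, not from the paper):

```python
import numpy as np

def gp_regression_solve(K, y, sigma2):
    """Solve (K + sigma^2 I)^{-1} y via the eigendecomposition K = Q L Q^T.

    Once the eigendecomposition is available (cheap for
    Kronecker-structured K), the noisy-regression solve is just a
    rescaling in the eigenbasis.
    """
    lam, Q = np.linalg.eigh(K)                 # K = Q diag(lam) Q^T
    return Q @ ((Q.T @ y) / (lam + sigma2))

# sanity check against a direct solve
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A @ A.T                                    # random PSD matrix
y = rng.standard_normal(5)
assert np.allclose(gp_regression_solve(K, y, 0.1),
                   np.linalg.solve(K + 0.1 * np.eye(5), y))
```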

Slide 3

Gaussian Processes

A Gaussian process (GP) is a collection of random variables with the property that the joint distribution of any finite subset is Gaussian. A GP specifies a probability distribution over functions, $f(x) \sim \mathcal{GP}(m(x), k(x, x'))$, and is fully specified by its mean function $m(x)$ and covariance (or kernel) function $k(x, x')$. Typically $m(x) = 0$, which gives $\{f(x_1), \ldots, f(x_N)\} \sim \mathcal{N}(0, K)$ with $K_{ij} = k(x_i, x_j)$.

Slide 4

Gaussian Processes - Covariance function

Squared exponential (or Gaussian) covariance function:

$$k(x, x') = \exp\left( -\frac{1}{2\theta^2} \sum_{n=1}^{N} (x_n - x'_n)^2 \right)$$

where $\theta$ is a length-scale parameter controlling how quickly the functions vary.

[Figure: prior samples drawn with length-scale 0.5 (left) and length-scale 2 (right).]
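A small sketch of this kernel and of drawing prior samples at the two length-scales shown in the figure (illustrative code, not from the paper):

```python
import numpy as np

def sq_exp_kernel(X1, X2, theta):
    """k(x, x') = exp(-sum_n (x_n - x'_n)^2 / (2 theta^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / theta ** 2)

# draw prior samples on [0, 10] for two length-scales
x = np.linspace(0, 10, 200)[:, None]
rng = np.random.default_rng(1)
for theta in (0.5, 2.0):
    K = sq_exp_kernel(x, x, theta) + 1e-8 * np.eye(len(x))  # jitter for PSD
    f = rng.multivariate_normal(np.zeros(len(x)), K)        # one sample path
```

Smaller $\theta$ gives wigglier samples; larger $\theta$ gives smoother ones.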

Slide 5

Gaussian Processes - 1D demo

[Figure: 1D GP demo, four panels.]

Slide 6

Laplace Approximation

For Gaussian process models with non-Gaussian likelihoods we need approximations. Laplace: approximate the true posterior $p(f \mid X, y)$ with a Gaussian $q(f)$ centered on the mode of the posterior:

$$q(f) = \mathcal{N}\big(f \mid \hat{f}, (K^{-1} + W)^{-1}\big) \qquad (1)$$

with $\hat{f} = \arg\max_f p(f \mid X, y, \theta)$ and $W = -\nabla\nabla_f \log p(y \mid f)\big|_{f = \hat{f}}$.
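For concreteness, with a logistic (classification) likelihood $W$ is diagonal with entries $\pi_i (1 - \pi_i)$. A sketch of the gradient and Hessian terms, assuming labels $y \in \{-1, +1\}$ (a standard choice used here for illustration; the slide does not fix the likelihood):

```python
import numpy as np

def logistic_grad_hessian(y, f):
    """Gradient and negative Hessian of log p(y|f) for the logit link.

    y in {-1, +1}; p(y=1|f) = sigmoid(f). W = -grad grad log p(y|f)
    is diagonal with entries pi * (1 - pi), stored as a vector.
    """
    pi = 1.0 / (1.0 + np.exp(-f))
    grad = (y + 1) / 2 - pi          # d/df log p(y|f)
    W = pi * (1 - pi)
    return grad, W
```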

Slide 7

Kronecker product

Assumptions:

  • tensor product kernel function: $k(x_i, x_j) = \prod_{d=1}^{D} k_d(x_i^d, x_j^d)$
  • data on a multi-dimensional grid

This results in a kernel matrix that decomposes into a Kronecker product of matrices of lower dimension:

$$K = K_1 \otimes \cdots \otimes K_D, \qquad A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}$$
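A sketch of how grid data plus a tensor product kernel yields the Kronecker factors, with 1-D squared exponential factors assumed for illustration:

```python
import numpy as np
from functools import reduce

def kernel_factors(grids, thetas):
    """One small N_d x N_d squared-exponential matrix per grid axis."""
    def k1d(x, theta):
        return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / theta ** 2)
    return [k1d(g, t) for g, t in zip(grids, thetas)]

grids = [np.linspace(0, 1, 30), np.linspace(0, 1, 40)]  # 30 x 40 grid, N = 1200
Ks = kernel_factors(grids, thetas=[0.2, 0.3])
K_full = reduce(np.kron, Ks)   # only for checking; never needed explicitly
```

The point of the next slide is that one can work with the small factors `Ks` and never form `K_full`.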

Slide 8

Kronecker product

The Kronecker product has a convenient algebra:

$$(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^T), \qquad AB \otimes CD = (A \otimes C)(B \otimes D), \qquad (A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$$

| Operation             | Standard | Kronecker               |
|-----------------------|----------|-------------------------|
| Storage               | $O(N^2)$ | $O(\sum_d N_d^2)$       |
| Matrix-vector product | $O(N^2)$ | $O(N \sum_d N_d)$       |
| Cholesky / SVD        | $O(N^3)$ | $O(\sum_d N_d^3)$       |

with $N = \prod_d N_d$.
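The vec identity above gives the fast matrix-vector product. A minimal sketch of the standard factor-by-factor recursion (illustrative, not the paper's code):

```python
import numpy as np

def kron_matvec(Ks, b):
    """Compute (K_1 kron ... kron K_D) b without forming the product.

    Applies (A kron B) vec(X) = vec(B X A^T) one factor at a time,
    costing O(N * sum_d N_d) instead of O(N^2), with N = prod_d N_d.
    """
    x = b
    for Kd in Ks:                        # K_1, ..., K_D in order
        Nd = Kd.shape[0]
        X = x.reshape(Nd, -1)            # expose the current dimension
        x = (Kd @ X).T.reshape(-1)       # apply K_d, then cycle the axes
    return x

# sanity check against the dense product
rng = np.random.default_rng(2)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
b = rng.standard_normal(12)
assert np.allclose(kron_matvec([A, B], b), np.kron(A, B) @ b)
```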

Slide 9

MAP estimation

Let $b = W f + \nabla \log p(y \mid f)$. We need

$$f^{\text{new}} = (K^{-1} + W)^{-1} b = K\big(b - W^{1/2} v\big), \qquad v = (I + W^{1/2} K W^{1/2})^{-1} W^{1/2} K b$$

Repeat until convergence (sketched in code below):

  • iteratively solve $(I + W^{1/2} K W^{1/2})\, v = W^{1/2} K b$
  • $a = b - W^{1/2} v$
  • $f = K a$
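A dense sketch of this Newton iteration, assuming a `grad_hess` callback such as the logistic one above. The talk's point is that the inner linear system can be solved iteratively with fast Kronecker matrix-vector products; this illustration substitutes a direct solve:

```python
import numpy as np

def laplace_map(K, y, grad_hess, n_iter=20):
    """Find the posterior mode f_hat by Newton iteration (dense sketch).

    grad_hess(y, f) returns (grad log p(y|f), W) with W as a vector.
    """
    N = K.shape[0]
    f = np.zeros(N)
    for _ in range(n_iter):                    # fixed count for brevity;
        g, W = grad_hess(y, f)                 # real code checks convergence
        sW = np.sqrt(W)
        b = W * f + g
        # solve (I + W^(1/2) K W^(1/2)) v = W^(1/2) K b
        M = np.eye(N) + sW[:, None] * K * sW[None, :]
        v = np.linalg.solve(M, sW * (K @ b))
        a = b - sW * v
        f = K @ a
    return f
```

Usage, continuing the earlier sketches: `f_hat = laplace_map(K, y, logistic_grad_hessian)`.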

Slide 10

Marginal Likelihood

To learn the hyperparameters $\theta$, minimize the negative (approximate) marginal likelihood

$$-\log p(y \mid X, \theta) \approx \frac{1}{2} \hat{f}^T K^{-1} \hat{f} - \log p(y \mid \hat{f}) + \frac{1}{2} \log |B|$$

with $B = I + K W$.
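A dense sketch of evaluating this objective at the mode; the reduced-rank treatment of the $\log|B|$ term follows on the next slides (here it is computed directly, using $|I + KW| = |I + W^{1/2} K W^{1/2}|$):

```python
import numpy as np

def laplace_nlml(K, f_hat, log_lik, W):
    """Approximate negative log marginal likelihood at the mode f_hat.

    log_lik is the scalar log p(y | f_hat); W is the diagonal of W.
    """
    N = len(f_hat)
    sW = np.sqrt(W)
    B = np.eye(N) + sW[:, None] * K * sW[None, :]
    _, logdet_B = np.linalg.slogdet(B)
    return 0.5 * f_hat @ np.linalg.solve(K, f_hat) - log_lik + 0.5 * logdet_B
```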

Slide 11

Reduced-rank approximation

We use a reduced-rank approximation:

$$K \approx Q S Q^T + \Lambda_1$$

with

  • $Q = \bigotimes_{d=1}^{D} Q_d$, an $N \times R$ matrix
  • $S = \bigotimes_{d=1}^{D} S_d$, an $R \times R$ matrix
  • $\Lambda_1 = \mathrm{diag}\big(\mathrm{diag}(K) - \mathrm{diag}(Q S Q^T)\big)$
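One natural construction of these factors, assuming each $Q_d$, $S_d$ comes from a truncated eigendecomposition of the corresponding $K_d$ (the slide does not spell this out, so treat this as an assumption):

```python
import numpy as np
from functools import reduce

def reduced_rank_factors(Ks, ranks):
    """Per-dimension truncated eigendecompositions K_d ~ Q_d S_d Q_d^T."""
    Qs, Ss = [], []
    for Kd, r in zip(Ks, ranks):
        lam, V = np.linalg.eigh(Kd)
        idx = np.argsort(lam)[::-1][:r]       # keep the r largest eigenvalues
        Qs.append(V[:, idx])
        Ss.append(np.diag(lam[idx]))
    Q = reduce(np.kron, Qs)                   # N x R
    S = reduce(np.kron, Ss)                   # R x R
    # Lambda_1 corrects the diagonal so diag(K) is matched exactly
    diag_K = reduce(np.kron, [np.diag(Kd) for Kd in Ks])
    lam1 = diag_K - np.einsum('ij,jk,ik->i', Q, S, Q)
    return Q, S, lam1
```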

Slide 12

Evaluating Marginal Likelihood

$$\begin{aligned}
|B| &= |I + W^{1/2} K W^{1/2}| \\
&\approx |I + W^{1/2} \Lambda_1 W^{1/2} + W^{1/2} Q S Q^T W^{1/2}| \\
&= |\Lambda_2 + W^{1/2} Q S Q^T W^{1/2}| \\
&= |\Lambda_2|\,|S|\,|S^{-1} + Q^T W^{1/2} \Lambda_2^{-1} W^{1/2} Q| \\
&= |\Lambda_2|\,|S|\,|S^{-1} + Q^T \Lambda_3 Q|
\end{aligned}$$

with the diagonal matrices $\Lambda_2 = I + W^{1/2} \Lambda_1 W^{1/2}$ and $\Lambda_3 = W^{1/2} \Lambda_2^{-1} W^{1/2}$, so only an $R \times R$ determinant remains.
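A sketch of this computation: $\Lambda_2$ and $\Lambda_3$ are diagonal (stored as vectors), so only the $R \times R$ determinant costs anything (names illustrative):

```python
import numpy as np

def approx_logdet_B(W, lam1, Q, S):
    """log|B| ~ log|L2| + log|S| + log|S^{-1} + Q^T L3 Q|.

    W and lam1 are the diagonals of W and Lambda_1;
    Lambda_2 = I + W^(1/2) Lambda_1 W^(1/2) and
    Lambda_3 = W^(1/2) Lambda_2^{-1} W^(1/2) are diagonal.
    """
    lam2 = 1.0 + W * lam1                          # diagonal of Lambda_2
    lam3 = W / lam2                                # diagonal of Lambda_3
    M = np.linalg.inv(S) + Q.T @ (lam3[:, None] * Q)   # R x R
    _, logdet_M = np.linalg.slogdet(M)
    _, logdet_S = np.linalg.slogdet(S)
    return np.sum(np.log(lam2)) + logdet_S + logdet_M
```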

Slide 13

Gradients

We need the gradients with respect to $\theta$ of

$$-\log p(y \mid X, \theta) \approx \frac{1}{2} \hat{f}^T K^{-1} \hat{f} - \log p(y \mid \hat{f}) + \frac{1}{2} \log |B|$$

which are given by

$$\frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j} = \left. \frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j} \right|_{\text{explicit}} + \sum_{i=1}^{N} \frac{\partial \log q(y \mid X, \theta)}{\partial \hat{f}_i} \, \frac{\partial \hat{f}_i}{\partial \theta_j}$$

(the explicit term holds $\hat{f}$ fixed; the sum accounts for the dependence of $\hat{f}$ on $\theta$).

Slide 14

Explicit Derivatives – Kernel Hyperparameters

Given by

$$\left. \frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j^c} \right|_{\text{explicit}} = \frac{1}{2} \hat{f}^T K^{-1} \frac{\partial K}{\partial \theta_j^c} K^{-1} \hat{f} - \frac{1}{2} \operatorname{tr}\!\left( (W^{-1} + K)^{-1} \frac{\partial K}{\partial \theta_j^c} \right).$$
Slide 15

Explicit Derivatives – Kernel Hyperparameters

$$\begin{aligned}
(W^{-1} + K)^{-1} &= W^{1/2} (I + W^{1/2} K W^{1/2})^{-1} W^{1/2} \\
&\approx W^{1/2} (I + W^{1/2} \Lambda_1 W^{1/2} + W^{1/2} Q S Q^T W^{1/2})^{-1} W^{1/2} \\
&= W^{1/2} (\Lambda_2 + W^{1/2} Q S Q^T W^{1/2})^{-1} W^{1/2} \\
&= W^{1/2} \big( \Lambda_2^{-1} - \Lambda_2^{-1} W^{1/2} Q (S^{-1} + Q^T \Lambda_3 Q)^{-1} Q^T W^{1/2} \Lambda_2^{-1} \big) W^{1/2} \\
&= \Lambda_3 - \Lambda_3 Q (S^{-1} + Q^T \Lambda_3 Q)^{-1} Q^T \Lambda_3
\end{aligned}$$

Slide 16

Explicit Derivatives – Kernel Hyperparameters

$$\begin{aligned}
\operatorname{tr}\!\left( (W^{-1} + K)^{-1} \frac{\partial K}{\partial \theta_j^c} \right) &= \operatorname{tr}\!\left( \Lambda_3 \frac{\partial K}{\partial \theta_j^c} \right) - \operatorname{tr}\!\left( \Lambda_3 Q (S^{-1} + Q^T \Lambda_3 Q)^{-1} Q^T \Lambda_3 \frac{\partial K}{\partial \theta_j^c} \right) \\
&= \operatorname{tr}\!\left( \Lambda_3 \frac{\partial K}{\partial \theta_j^c} \right) - \operatorname{tr}\!\left( V_1^T \frac{\partial K}{\partial \theta_j^c} V_1 \right)
\end{aligned}$$

with $V_1 = \Lambda_3 Q \operatorname{chol}\!\big( (S^{-1} + Q^T \Lambda_3 Q)^{-1} \big)$.
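A sketch of evaluating the two trace terms via $V_1$, with $\partial K / \partial \theta$ passed dense for clarity (in practice it inherits the Kronecker structure):

```python
import numpy as np

def explicit_trace_term(lam3, Q, S, dK):
    """tr((W^{-1} + K)^{-1} dK) using the low-rank form of slide 15.

    lam3 is the diagonal of Lambda_3; dK stands for dK/dtheta.
    V1 V1^T = Lambda_3 Q M^{-1} Q^T Lambda_3 with M = S^{-1} + Q^T L3 Q.
    """
    M = np.linalg.inv(S) + Q.T @ (lam3[:, None] * Q)   # R x R
    L = np.linalg.cholesky(np.linalg.inv(M))           # chol(M^{-1})
    V1 = lam3[:, None] * (Q @ L)                       # N x R
    t1 = lam3 @ np.diag(dK)                            # tr(Lambda_3 dK)
    t2 = np.sum(V1 * (dK @ V1))                        # tr(V1^T dK V1)
    return t1 - t2
```

The second trace uses $\operatorname{tr}(A^T B) = \sum_{ij} A_{ij} B_{ij}$, one of the shortcuts on the next slide.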

Slide 17

Remaining Gradients

Similar arguments apply to the remaining gradients.

Linear algebra shortcuts:

  • $\operatorname{tr}(ABC) = \operatorname{tr}(BCA)$
  • $\operatorname{tr}(A B^T) = \mathrm{SUMROWS}(A \circ B)$, i.e. the sum of the entries of the Hadamard product
  • $\operatorname{diag}(A B C^T) = (A \circ C)\,\operatorname{diag}(B)$ when $B$ is diagonal
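A quick NumPy check of these identities (reading $\mathrm{SUMROWS}(A \circ B)$ as the sum of all entries of the Hadamard product):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
B = np.diag(rng.standard_normal(5))       # B diagonal
C = rng.standard_normal((5, 5))

# tr(ABC) = tr(BCA): cyclic property of the trace
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
# tr(A C^T) = sum of all entries of A o C
assert np.isclose(np.trace(A @ C.T), np.sum(A * C))
# diag(A B C^T) = (A o C) diag(B) when B is diagonal
assert np.allclose(np.diag(A @ B @ C.T), (A * C) @ np.diag(B))
```

These turn $O(N^3)$ products into $O(N^2)$ (or $O(N)$) operations inside the gradient computations.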

Slide 18

Experiment

Artificial classification data generated on $X = [0, 1]^2$ with various grid sizes.

[Figure: runtime in seconds as a function of N for the standard and Kronecker implementations.]