Introduction to Gaussian Processes Iain Murray School of - PowerPoint PPT Presentation

Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh

The problem Learn scalar function of vector values f ( x ) 1 f(x) 0.5 y i 5 0 0 f −0.5 −5 0 −1 1 0.5 −1.5 0.5 0 0.2 0.4 0.6 0.8 1 x 2 1 x 1 0 x We have (possibly noisy) observations { x i , y i } n i =1

Example Applications Real-valued regression: — Robotics: target state → required torque — Process engineering: predicting yield — Surrogate surfaces for optimization or simulation Many problems are not regression: Classification, rating/ranking, discovery, embedding, clustering, . . . But unknown functions may be part of larger model

Model complexity The world is often complicated: 1 1 1 0.5 0.5 0.5 0 0 0 −0.5 −0.5 −0.5 −1 −1 −1 −1.5 −1.5 −1.5 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 simple fit complex fit truth Problems: — Don’t want to underfit, and be too certain — Don’t want to overfit, and generalize poorly — Bayesian model comparison is often hard

Predicting yield Factory settings x 1 → profit of 32 ± 5 monetary units Factory settings x 2 → profit of 100 ± 200 monetary units Which are the best settings x 1 or x 2 ? Knowing the error bars can be important

Optimization In high dimensions it takes many function evaluations to be certain everywhere. Costly if experiments are involved. 1 0.5 0 −0.5 −1 −1.5 0 0.2 0.4 0.6 0.8 1 Error bars are needed to see if a region is still promising.

Bayesian modelling If we come up with a parametric family of functions, f ( x ; θ ) and define a prior over θ , probability theory tells us how to make predictions given data. For flexible models, this usually involves intractable integrals over θ . We’re really good at integrating Gaussians though 2 Can we really solve significant 1 machine learning problems with 0 a simple multivariate Gaussian −1 distribution? −2 −2 −1 0 1 2

Gaussian distributions Completely described by parameters µ and Σ : p ( f | Σ , µ ) = | 2 π Σ | − 1 � � 2 exp 2 ( f − µ ) T Σ − 1 ( f − µ ) − 1 µ and Σ are the mean and covariance: µ i = E [ f i ] Σ ij = E [ f i f j ] − µ i µ j If we know a distribution is Gaussian and know its mean and covariances, we know its density function.

Marginal of Gaussian The marginal of a Gaussian distribution is Gaussian. � A �� f � a � � �� C p ( f , g ) = N ; , C ⊤ B g b As soon as you convince yourself that the marginal � p ( f ) = p ( f , g ) d g is Gaussian, you already know the means and covariances: p ( f ) = N ( f ; a , A )

Conditional of Gaussian Any conditional of a Gaussian distribution is also Gaussian: � A �� f � a � � �� C p ( f , g ) = N ; , C ⊤ B g b p ( f | g ) = N ( f ; a + CB − 1 ( g − b ) , A − CB − 1 C ⊤ ) Showing this result requires some grunt work. But it is standard, and easily looked up.

Noisy observations Previously we inferred f given g . What if we only saw a noisy observation, y ∼ N ( g , S ) ? p ( f , g , y ) = p ( f , g ) p ( y | g ) is Gaussian distributed a quadratic form inside the exponential after multiplying Posterior over f is still Gaussian: � p ( f | y ) ∝ p ( f , g , y ) d g RHS is Gaussian after marginalizing, so still a quadratic form in f inside an exponential.

Laying out Gaussians A way of visualizing draws from a 2D Gaussian: 2 1 0 ⇔ 0 f 2 −0.5 f −1 −1 −2 −2 −1 0 1 2 x_1 x_2 f 1 1.5 1 0.5 Now it’s easy to show three draws 0 f from a 6D Gaussian: −0.5 −1 −1.5 x_1 x_2 x_3 x_4 x_5 x_6

Building large Gaussians Three draws from a 25D Gaussian: 2 1 f 0 −1 x To produce this, we needed a mean: I used zeros(25,1) The covariances were set using a kernel function: Σ ij = k ( x i , x j ) . The x ’s are the positions that I planted the tics on the axis. Later we’ll find k ’s that ensure Σ is always positive semi-definite.

GP regression model 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −1.5 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Noisy observations: f ∼ GP y i | f i ∼ N ( f i , σ 2 n ) f ∼ N (0 , K ) , K ij = k ( x i , x j ) where f i = f ( x i )

GP Posterior Our prior over observations and targets is Gaussian: � K ( X, X ) + σ 2 �� y �� y �� K ( X, X ∗ ) n I � = N P ; 0 , K ( X ∗ , X ) K ( X ∗ , X ∗ ) f ∗ f ∗ Using the rule for conditionals, p ( f ∗ | y ) is Gaussian with: mean , ¯ f ∗ = K ( X ∗ , X )( K ( X, X ) + σ 2 n I ) − 1 y n I ) − 1 K ( X, X ∗ ) cov( f ∗ ) = K ( X ∗ , X ∗ ) − K ( X ∗ , X )( K ( X, X ) + σ 2 The posterior over functions is a Gaussian Process.

GP Posterior Two incomplete ways of visualizing what we know: 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −1.5 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Draws ∼ p ( f | data) Mean and error bars

Point predictions Conditional at one point x ∗ is a simple Gaussian: p ( f ( x ∗ ) | data) = N ( f ; m, s 2 ) Need covariances: K ij = k ( x i , x j ) , ( k ∗ ) i = k ( x ∗ , x i ) Special case of joint posterior: M = K + σ 2 n I m = k ⊤ ∗ M − 1 y s 2 = k ( x ∗ , x ∗ ) − k ⊤ ∗ M − 1 k ∗ � �� positive

Discovery or prediction? 1 ± 2 σ , p(y * |data) ± 2 σ , p(f * |data) 0.5 True f 0 Posterior Mean −0.5 Observations −1 −1.5 0 0.2 0.4 0.6 0.8 1 x * p ( f ∗ | data) = N ( f ∗ ; m, s 2 ) says what we know about the noiseless function. p ( y ∗ | data) = N ( y ∗ ; m, s 2 + σ 2 n ) predicts what we’ll see next.

Review so far We can represent a function as a big vector f We assume that this unknown vector was drawn from a big correlated Gaussian distribution, a Gaussian process . (This might upset some mathematicians, but for all practical machine learning and statistical problems, this is fine.) Observing elements of the vector (optionally corrupted by Gaussian noise) creates a Gaussian posterior distribution. The posterior over functions is still a Gaussian process. Marginalization in Gaussians is trivial: just ignore all of the positions x i that are neither observed nor queried.

Covariance functions The main part that has been missing so far is where the covariance function k ( x i , x j ) comes from. What else can it say, other than nearby points are similar?

Covariance functions We can construct covariance functions from parametric models Simplest example: Bayesian linear regression: f ( x i ) = w ⊤ x i + b, w ∼ N (0 , σ 2 w I ) , b ∼ N (0 , σ 2 b ) ✯ 0 ✯ 0 ✟ ✟ cov( f i , f j ) = E [ f i f j ] − ✟✟✟✟✟✟ E [ f i ] E [ f j ] ✟✟✟✟✟✟ � � ( w ⊤ x i + b ) ⊤ ( w ⊤ x j + b ) = E w x ⊤ = σ 2 i x j + σ 2 b = k ( x i , x j ) Kernel parameters σ 2 w and σ 2 b are hyper-parameters in the Bayesian hierarchical model. More interesting kernels come from models with a large or infinite w Φ( x i ) ⊤ Φ( x j ) + σ 2 feature space: k ( x i , x j ) = σ 2 b , the ‘kernel trick’.

What’s a valid kernel? We could ‘make up’ a kernel function k ( x i , x j ) But any ‘Gram matrix’ must be positive semi-definite:   k ( x 1 , x 1 ) · · · k ( x 1 , x N ) . z ⊤ K z ≥ 0 for all z  .  K = .  ,  k ( x N , x 1 ) · · · k ( x N , x N ) Achieved by positive semi-definite kernel , or Mercer kernel K +ve eigenvalues ⇒ K − 1 +ve eigenvalues ⇒ Gaussian normalizable Mercer kernels give inner-products of some feature vectors Φ( x ) But these Φ( x ) vectors may be infinite.

Squared-exponential kernel An ∞ number of radial-basis functions can give D � � � k ( x i , x j ) = σ 2 ( x d,i − x d,j ) 2 /ℓ 2 − 1 f exp , d 2 d =1 the most commonly-used kernel in machine learning. It looks like an (unnormalized) Gaussian, so is sometimes called the Gaussian kernel. A Gaussian process need not use the “Gaussian” kernel. In fact, other choices will often be better.

Meaning of hyper-parameters Many kernels have similar types of parameters: D � � � k ( x i , x j ) = σ 2 ( x d,i − x d,j ) 2 /ℓ 2 − 1 f exp , d 2 d =1 Consider x i = x j , ⇒ marginal function variance is σ 2 f 20 σ f = 2 σ f = 10 10 0 −10 −20 −30 0 0.2 0.4 0.6 0.8 1

Meaning of hyper-parameters ℓ d parameters give the length-scale in dimension- d � D � � k ( x i , x j ) = σ 2 ( x d,i − x d,j ) 2 /ℓ 2 − 1 f exp , d 2 d =1 Typical distance between peaks ≈ ℓ 2 l = 0.05 l = 0.5 1 0 −1 −2 −3 0 0.2 0.4 0.6 0.8 1

Effect of hyper-parameters Different (SE) kernel parameters give different explanations of the data: 1 1 0.5 0.5 0 0 −0.5 −0.5 −1 −1 −1.5 −1.5 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 ℓ = 0 . 5 , σ n = 0 . 05 ℓ = 1 . 5 , σ n = 0 . 15

Other kernels SE kernel produce very smooth and ‘boring’ functions Kernels are available for rough data, periodic data, strings, graphs, images, models, . . . Different kernels can be combined: k ( x i , x j ) = αk 1 ( x i , x j ) + βk 2 ( x i , x j ) Positive semi-definite if k 1 and k 2 are.

Introduction to Gaussian Processes Iain Murray School of - PowerPoint PPT Presentation

Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f ( x ) 1 f(x) 0.5 y i 5 0 0 f 0.5 5 0 1 1 0.5 1.5 0.5 0 0.2 0.4 0.6

Gaussian Filter The Gaussian filter 1 2 1 A Gaussian kernel gives less 1 2 4 2 weight to

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee

CMPUT 466 Introduction to Gaussian Processes Dan Lizotte The Plan Introduction to Gaussian

Gaussian Processes Dan Cervone NYU CDS November 10, 2015 Dan Cervone (NYU CDS) Gaussian

Non-Gaussian likelihoods for Gaussian Processes Alan Saul Outline Motivation Non-Gaussian

Lecture 3 Capacity of Multiuser Gaussian Channels The Gaussian uplink: 6.1 The fading

State Space Gaussian Processes with Non-Gaussian Likelihoods Hannes Nickisch 1 Arno Solin 2

Another introduction to Gaussian Processes Richard Wilkinson School of Maths and Statistics

Gaussian Processes for Big Data James Hensman joint work with Nicol o Fusi, Neil D. Lawrence

Gaussian Processes Seung-Hoon Na Chonbuk National University Gaussian Process Regression

Faster Gaussian Lattice Sampling using Information Leakage Gaussian Sampling Our Work Lazy

CS70: Jean Walrand: Lecture 36. Gaussian and CLT CS70: Jean Walrand: Lecture 36. Gaussian and

Determining the PSF over the Full FoV of LSST using Anisotropic Gaussian Processes

Scalable Gaussian Processes Zhenwen Dai Amazon September 4, 2018 @GPSS2018 Zhenwen Dai (Amazon)

Scalable Gaussian Processes Zhenwen Dai Amazon 9 September 2019 @GPSS 2019 Zhenwen Dai (Amazon)

Introduction to Gaussian Processes Neil D. Lawrence GPMC 6th February 2017 Book Rasmussen and

Voyaging around nacre with the x-ray shuttle from biomineralisation to prosthetics via mollusc

Event Calendar SHIMA Daio,

First International Workshop on Learning over Multiple Contexts LMCE 2014 Nancy, 19 September

Menu Concerns about the quality of the predictive distributions Augmentation: a bit more

Information Theory and Statistical Inference Samuel Cheng School of ECE University of Oklahoma

Teaching VDM & Teaching Formal Methods Ana Paiva apaiva@fe.up.pt www.fe.up.pt/~apaiva OVT

Infinitely many corks with shadow complexity one . . . . . Hironobu Naoe (Tohoku University)

MLSS 06 - Canberra Elements Hierarchical Basis Sparse Grids Sparse Grids Combination

Introduction to Gaussian Processes Iain Murray School of - PowerPoint PPT Presentation

Introduction to Gaussian Processes Iain Murray School of Informatics, University of Edinburgh The problem Learn scalar function of vector values f ( x ) 1 f(x) 0.5 y i 5 0 0 f 0.5 5 0 1 1 0.5 1.5 0.5 0 0.2 0.4 0.6

Gaussian Filter The Gaussian filter 1 2 1 A Gaussian kernel gives less 1 2 4 2 weight to

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee

CMPUT 466 Introduction to Gaussian Processes Dan Lizotte The Plan Introduction to Gaussian

Gaussian Processes Dan Cervone NYU CDS November 10, 2015 Dan Cervone (NYU CDS) Gaussian

Non-Gaussian likelihoods for Gaussian Processes Alan Saul Outline Motivation Non-Gaussian

Lecture 3 Capacity of Multiuser Gaussian Channels The Gaussian uplink: 6.1 The fading

State Space Gaussian Processes with Non-Gaussian Likelihoods Hannes Nickisch 1 Arno Solin 2

Another introduction to Gaussian Processes Richard Wilkinson School of Maths and Statistics

Gaussian Processes for Big Data James Hensman joint work with Nicol o Fusi, Neil D. Lawrence

Gaussian Processes Seung-Hoon Na Chonbuk National University Gaussian Process Regression

Faster Gaussian Lattice Sampling using Information Leakage Gaussian Sampling Our Work Lazy

CS70: Jean Walrand: Lecture 36. Gaussian and CLT CS70: Jean Walrand: Lecture 36. Gaussian and

Determining the PSF over the Full FoV of LSST using Anisotropic Gaussian Processes

Scalable Gaussian Processes Zhenwen Dai Amazon September 4, 2018 @GPSS2018 Zhenwen Dai (Amazon)

Scalable Gaussian Processes Zhenwen Dai Amazon 9 September 2019 @GPSS 2019 Zhenwen Dai (Amazon)

Introduction to Gaussian Processes Neil D. Lawrence GPMC 6th February 2017 Book Rasmussen and

Voyaging around nacre with the x-ray shuttle from biomineralisation to prosthetics via mollusc

Event Calendar SHIMA Daio,

First International Workshop on Learning over Multiple Contexts LMCE 2014 Nancy, 19 September

Menu Concerns about the quality of the predictive distributions Augmentation: a bit more

Information Theory and Statistical Inference Samuel Cheng School of ECE University of Oklahoma

Teaching VDM &amp; Teaching Formal Methods Ana Paiva apaiva@fe.up.pt www.fe.up.pt/~apaiva OVT

Infinitely many corks with shadow complexity one . . . . . Hironobu Naoe (Tohoku University)

MLSS 06 - Canberra Elements Hierarchical Basis Sparse Grids Sparse Grids Combination

Teaching VDM & Teaching Formal Methods Ana Paiva apaiva@fe.up.pt www.fe.up.pt/~apaiva OVT