Introduction to Gaussian Processes

Iain Murray
murray@cs.toronto.edu

CSC2515, Introduction to Machine Learning, Fall 2008
Dept. Computer Science, University of Toronto

The problem

Learn a scalar function of vector values, $f(x)$.

[Figure: left, a 1D function $f(x)$ plotted against $x$ with observations $y_i$; right, a surface $f$ over a 2D input $(x_1, x_2)$.]

We have (possibly noisy) observations $\{x_i, y_i\}_{i=1}^{n}$.

Example Applications

Real-valued regression:
— Robotics: target state → required torque
— Process engineering: predicting yield
— Surrogate surfaces for optimization or simulation

Classification:
— Recognition: e.g. handwritten digits on cheques
— Filtering: fraud, interesting science, disease screening

Ordinal regression:
— User ratings (e.g. movies or restaurants)
— Disease screening (e.g. predicting Gleason score)

Model complexity

The world is often complicated:

[Figure: three panels on the same axes, titled "simple fit", "complex fit" and "truth".]

Problems:
— Fitting complicated models can be hard
— How do we find an appropriate model?
— How do we avoid over-fitting some aspects of the model?

Predicting yield

Factory settings $x_1$ → profit of 32 ± 5 monetary units
Factory settings $x_2$ → profit of 100 ± 200 monetary units

Which are the best settings, $x_1$ or $x_2$?

Knowing the error bars can be very important.

Optimization

In high dimensions it takes many function evaluations to be certain everywhere. Costly if experiments are involved.

[Figure: 1D example on $x \in [0, 1]$.]

Error bars are needed to see if a region is still promising.

Bayesian modelling

If we come up with a parametric family of functions, $f(x; \theta)$, and define a prior over $\theta$, probability theory tells us how to make predictions given data. For flexible models, this usually involves intractable integrals over $\theta$.

We're really good at integrating Gaussians though.

[Figure: a 2D Gaussian distribution.]

Can we really solve significant machine learning problems with a simple multivariate Gaussian distribution?

Gaussian distributions

Completely described by parameters $\mu$ and $\Sigma$:

$$P(f \mid \Sigma, \mu) = |2\pi\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (f - \mu)^\top \Sigma^{-1} (f - \mu) \right)$$

$\mu$ and $\Sigma$ are the mean and covariance of the distribution. For example:

$$\Sigma_{ij} = \langle f_i f_j \rangle - \mu_i \mu_j$$

If we know a distribution is Gaussian and know its mean and covariances, we know its density function.
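The density above can be evaluated numerically. The following is a minimal sketch, assuming NumPy and a positive-definite $\Sigma$; the function name `gaussian_log_density` is introduced here for illustration and is not from the slides. It works with the log density, which is usually safer numerically than the density itself.

```python
import numpy as np

def gaussian_log_density(f, mu, Sigma):
    """Log of |2 pi Sigma|^(-1/2) exp(-1/2 (f - mu)^T Sigma^{-1} (f - mu))."""
    f = np.asarray(f, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    diff = f - mu
    # Cholesky factor L with Sigma = L L^T (raises if Sigma is not positive definite).
    L = np.linalg.cholesky(Sigma)
    # Solve L z = diff, so that z^T z = diff^T Sigma^{-1} diff.
    z = np.linalg.solve(L, diff)
    # log|2 pi Sigma| = D log(2 pi) + 2 sum_i log L_ii
    log_det = len(diff) * np.log(2.0 * np.pi) + 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * log_det - 0.5 * (z @ z)

# Example: a standard 2D Gaussian evaluated at its mean.
log_p = gaussian_log_density([0.0, 0.0], [0.0, 0.0], np.eye(2))
```

At the mean of a standard 2D Gaussian the quadratic form vanishes, so `log_p` equals $-\log(2\pi)$.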

Marginal of Gaussian

The marginal of a Gaussian distribution is Gaussian.

$$P(f, g) = \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right)$$

As soon as you convince yourself that the marginal

$$P(f) = \int \mathrm{d}g \; P(f, g)$$

is Gaussian, you already know the means and covariances:

$$P(f) = \mathcal{N}(a, A).$$

Conditional of Gaussian

Any conditional of a Gaussian distribution is also Gaussian:

$$P(f, g) = \mathcal{N}\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right)$$

$$P(f \mid g) = \mathcal{N}\left( a + C B^{-1} (g - b), \; A - C B^{-1} C^\top \right)$$

Showing this is not completely straightforward. But it is a standard result, easily looked up.
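The conditioning rule translates directly into a few lines of linear algebra. This is a minimal sketch assuming NumPy; the function name `condition_gaussian` and the numbers in the example are made up for illustration. The block names $a$, $b$, $A$, $B$, $C$ follow the joint distribution over $(f, g)$ used in the slides.

```python
import numpy as np

def condition_gaussian(a, b, A, B, C, g):
    """Mean and covariance of P(f | g) for the joint
    P(f, g) = N([a; b], [[A, C], [C^T, B]])."""
    # Solve linear systems with B instead of forming B^{-1} explicitly.
    mean = a + C @ np.linalg.solve(B, g - b)
    cov = A - C @ np.linalg.solve(B, C.T)
    return mean, cov

# Illustrative joint over two scalars f and g with correlation 0.8.
a = np.array([0.0])
b = np.array([0.0])
A = np.array([[1.0]])
B = np.array([[1.0]])
C = np.array([[0.8]])

mean, cov = condition_gaussian(a, b, A, B, C, g=np.array([1.0]))
# The marginal P(f) needs no computation at all: it is N(a, A).
```

Observing $g = 1$ pulls the mean of $f$ towards the observation and shrinks its variance below the marginal variance $A$.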

Noisy observations

Previously we inferred $f$ given $g$. What if we only saw a noisy observation, $y \sim \mathcal{N}(g, S)$?

$P(f, g, y) = P(f, g)\, P(y \mid g)$ is Gaussian distributed; still a quadratic form inside the exponential after multiplying.

Our posterior over $f$ is still Gaussian:

$$P(f \mid y) \propto \int \mathrm{d}g \; P(f, g, y)$$

(RHS is Gaussian after marginalizing, so still a quadratic form in $f$ inside an exponential.)
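The Gaussian form of this posterior can be written out explicitly. The following is a short derivation sketch, assuming the block notation of the joint over $(f, g)$ from the earlier slides (means $a$, $b$ and covariance blocks $A$, $B$, $C$) and noise that is independent of $f$ and $g$:

```latex
% y = g + \epsilon with \epsilon \sim \mathcal{N}(0, S), independent of f and g,
% so y has mean b, covariance B + S, and cross-covariance C^\top with f.
P(f, y) = \mathcal{N}\left(
  \begin{bmatrix} a \\ b \end{bmatrix},
  \begin{bmatrix} A & C \\ C^\top & B + S \end{bmatrix}
\right)
\quad\Longrightarrow\quad
P(f \mid y) = \mathcal{N}\left(
  a + C (B + S)^{-1} (y - b),\;
  A - C (B + S)^{-1} C^\top
\right)
```

Setting $S = 0$ recovers the noise-free conditional, and the noise only ever enters through $B + S$.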

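Laying out Gaussians

A way of visualizing draws from a 2D Gaussian:

[Figure: left, draws shown as points in the $(f_1, f_2)$ plane; right (⇔), the same draws shown as values of $f$ at positions $x_1$ and $x_2$.]

Now it's easy to show three draws from a 6D Gaussian:

[Figure: three draws shown as values of $f$ at positions $x_1, \dots, x_6$.]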

Building large Gaussians

Three draws from a 25D Gaussian:

[Figure: three draws, $f$ plotted against $x$.]

To produce this, we needed a mean: I used zeros(25,1).

The covariances were set using a kernel function: $\Sigma_{ij} = k(x_i, x_j)$.

The $x$'s are the positions where I planted the ticks on the axis.

Later we'll find $k$'s that ensure $\Sigma$ is always positive semi-definite.
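The recipe on this slide can be sketched in code. This is only an illustration, assuming NumPy: the slide does not say which kernel produced the figure, so the squared-exponential kernel, its lengthscale, the evenly spaced positions on [0, 1] and the jitter term are all assumptions made here.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, ell=0.2):
    """Squared-exponential kernel on 1D inputs (an assumed choice of k)."""
    diff = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (diff / ell) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)      # positions of the 25 ticks (assumed evenly spaced)
mu = np.zeros(25)                  # the mean: zeros(25,1) on the slide
Sigma = se_kernel(x, x)            # Sigma_ij = k(x_i, x_j)

# A small "jitter" on the diagonal keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(Sigma + 1e-6 * np.eye(25))

# If z ~ N(0, I) then mu + L z ~ N(mu, L L^T); each column is one draw.
draws = mu[:, None] + L @ rng.standard_normal((25, 3))
```

Plotting each column of `draws` against `x` gives three smooth curves of the kind described on the slide.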

GP regression model

[Figure: two panels over $x \in [0, 1]$, accompanying the prior on $f$ and the noisy observations.]

$$f \sim \mathcal{GP}$$

$$f \sim \mathcal{N}(0, K), \quad K_{ij} = k(x_i, x_j), \quad \text{where } f_i = f(x_i)$$

Noisy observations:

$$y_i \mid f_i \sim \mathcal{N}(f_i, \sigma_n^2)$$

GP Posterior

Our prior over observations and targets is Gaussian:

$$P\left( \begin{bmatrix} y \\ f_* \end{bmatrix} \right) = \mathcal{N}\left( 0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$$

Using the rule for conditionals, $P(f_* \mid y)$ is Gaussian with:

$$\text{mean}, \; \bar{f}_* = K(X_*, X) \left( K(X, X) + \sigma_n^2 I \right)^{-1} y$$

$$\operatorname{cov}(f_*) = K(X_*, X_*) - K(X_*, X) \left( K(X, X) + \sigma_n^2 I \right)^{-1} K(X, X_*)$$

The posterior over functions is a Gaussian Process.
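The posterior mean and covariance can be computed with a Cholesky factorization rather than an explicit matrix inverse. The following is a minimal sketch assuming NumPy; the squared-exponential kernel, its hyper-parameters, and the toy data are assumptions made for the example, and the function names are introduced here.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, ell=0.2):
    """Squared-exponential kernel on 1D inputs (an assumed choice of k)."""
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return sigma_f**2 * np.exp(-0.5 * (diff / ell) ** 2)

def gp_posterior(X, y, X_star, kernel, sigma_n):
    """Mean and covariance of P(f_* | y) for a zero-mean GP prior."""
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))   # K(X, X) + sigma_n^2 I
    K_s = kernel(X_star, X)                           # K(X_*, X)
    K_ss = kernel(X_star, X_star)                     # K(X_*, X_*)
    L = np.linalg.cholesky(K)                         # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K(X,X) + sigma_n^2 I)^{-1} y
    mean = K_s @ alpha
    V = np.linalg.solve(L, K_s.T)                     # L^{-1} K(X, X_*)
    cov = K_ss - V.T @ V                              # K_ss - K_s K^{-1} K_s^T
    return mean, cov

# Toy data, made up for illustration.
X = np.array([0.1, 0.4, 0.7])
y = np.array([0.5, -0.3, 0.2])
X_star = np.linspace(0.0, 1.0, 5)

mean, cov = gp_posterior(X, y, X_star, se_kernel, sigma_n=0.1)
error_bars = 2.0 * np.sqrt(np.clip(np.diag(cov), 0.0, None))   # +/- 2 sigma on f_*
```

The `np.clip` call only guards against tiny negative variances caused by rounding.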

GP posterior

Two (incomplete) ways of visualizing what we know:

[Figure: left panel, "Draws ∼ p(f | data)"; right panel, "Mean and error bars".]

Point predictions

Conditional at one point $x_*$ is a simple Gaussian:

$$p(f(x_*) \mid \text{data}) = \mathcal{N}(m, s^2)$$

Need covariances:

$$K_{ij} = k(x_i, x_j), \quad (k_*)_i = k(x_*, x_i)$$

Special case of joint posterior:

$$M = K + \sigma_n^2 I$$

$$m = k_*^\top M^{-1} y$$

$$s^2 = k(x_*, x_*) - \underbrace{k_*^\top M^{-1} k_*}_{\text{positive}}$$

Discovery or prediction?

What should error-bars show?

[Figure: plot against $x$ with a marked test location $x_*$, showing the true $f$, the posterior mean, the observations, and two error bands: ± 2σ of p(y* | data) and ± 2σ of p(f* | data).]

$P(f_* \mid \text{data}) = \mathcal{N}(m, s^2)$ says what we know about the noiseless function.

$P(y_* \mid \text{data}) = \mathcal{N}(m, s^2 + \sigma_n^2)$ predicts what we'll see next.
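The difference between the two kinds of error bar shows up clearly at a single test point. This is a minimal sketch assuming NumPy, a squared-exponential kernel, and made-up data; `predict_point` is a name introduced here. It returns $m$, the variance $s^2$ of the noiseless function value, and the variance $s^2 + \sigma_n^2$ of the next observation.

```python
import numpy as np

def se_kernel(x1, x2, sigma_f=1.0, ell=0.2):
    """Squared-exponential kernel on 1D inputs (an assumed choice of k)."""
    diff = np.asarray(x1)[:, None] - np.asarray(x2)[None, :]
    return sigma_f**2 * np.exp(-0.5 * (diff / ell) ** 2)

def predict_point(x_star, X, y, kernel, sigma_n):
    """Return (m, s2, s2 + sigma_n^2) at a single test input x_star."""
    xs = np.array([x_star])
    M = kernel(X, X) + sigma_n**2 * np.eye(len(X))   # M = K + sigma_n^2 I
    k_star = kernel(X, xs)[:, 0]                      # (k_*)_i = k(x_*, x_i)
    m = k_star @ np.linalg.solve(M, y)                # m = k_*^T M^{-1} y
    s2 = kernel(xs, xs)[0, 0] - k_star @ np.linalg.solve(M, k_star)
    return m, s2, s2 + sigma_n**2

# Toy data, made up for illustration.
X = np.array([0.1, 0.4, 0.7])
y = np.array([0.5, -0.3, 0.2])

m, s2, var_y = predict_point(0.4, X, y, se_kernel, sigma_n=0.1)
m_far, s2_far, var_y_far = predict_point(100.0, X, y, se_kernel, sigma_n=0.1)
```

Close to the data, `s2` is small but `var_y` never drops below the noise variance. Far from the data, the prediction falls back to the prior: zero mean and variance $\sigma_f^2$.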

Review so far

We can represent a function as a big vector $f$.

We assume that this unknown vector was drawn from a big correlated Gaussian distribution, a Gaussian process. (This might upset some mathematicians, but for all practical machine learning and statistical problems, this is fine.)

Observing elements of the vector (optionally corrupted by Gaussian noise) creates a posterior distribution. This is also Gaussian: the posterior over functions is still a Gaussian process.

Because marginalization in Gaussians is trivial, we can easily ignore all of the positions $x_i$ that are neither observed nor queried.

Covariance functions

The main part that has been missing so far is where the covariance function $k(x_i, x_j)$ comes from.

Also, other than making nearby points covary, what can we express with covariance functions, and what do they mean?

Covariance functions

We can construct covariance functions from parametric models.

Simplest example: Bayesian linear regression:

$$f(x_i) = w^\top x_i + b, \quad w \sim \mathcal{N}(0, \sigma_w^2 I), \quad b \sim \mathcal{N}(0, \sigma_b^2)$$

$$\begin{aligned}
\operatorname{cov}(f_i, f_j) &= \langle f_i f_j \rangle - \underbrace{\langle f_i \rangle}_{0} \underbrace{\langle f_j \rangle}_{0} \\
&= \left\langle (w^\top x_i + b)^\top (w^\top x_j + b) \right\rangle \\
&= \sigma_w^2 x_i^\top x_j + \sigma_b^2 = k(x_i, x_j)
\end{aligned}$$

Kernel parameters $\sigma_w^2$ and $\sigma_b^2$ are hyper-parameters in the Bayesian hierarchical model.

More interesting kernels come from models with a large or infinite feature space. Because feature weights $w$ are integrated out, this is computationally no more expensive.
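The linear-regression kernel can be checked by simulation: sample many weight vectors and biases from the prior, evaluate the resulting functions, and compare their empirical covariance with $\sigma_w^2 x_i^\top x_j + \sigma_b^2$. This is a rough Monte Carlo sketch assuming NumPy; the inputs, hyper-parameter values and sample count are made up for illustration.

```python
import numpy as np

def linear_kernel(X1, X2, sigma_w=1.0, sigma_b=1.0):
    """k(x_i, x_j) = sigma_w^2 x_i^T x_j + sigma_b^2; rows of X1 and X2 are inputs."""
    return sigma_w**2 * X1 @ X2.T + sigma_b**2

rng = np.random.default_rng(0)
X = np.array([[0.5, -1.0], [2.0, 0.3], [-0.7, 1.5]])   # three made-up 2D inputs
sigma_w, sigma_b = 0.8, 0.5

# Sample many (w, b) pairs from the prior and evaluate f(x_i) = w^T x_i + b.
S = 200_000
W = sigma_w * rng.standard_normal((S, 2))
b = sigma_b * rng.standard_normal((S, 1))
F = W @ X.T + b                       # shape (S, 3): one sampled function per row

K_empirical = F.T @ F / S             # estimates <f_i f_j>; the prior mean is zero
K_analytic = linear_kernel(X, X, sigma_w, sigma_b)
```

The two matrices agree up to Monte Carlo error, which shrinks as `S` grows. Note that the kernel computation never touches `W` or `b`, which is the sense in which the weights have been integrated out.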

Squared-exponential kernel

An infinite number of radial-basis functions can give

$$k(x_i, x_j) = \sigma_f^2 \exp\left( -\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2 \right),$$

the most commonly-used kernel in machine learning.

It looks like an (unnormalized) Gaussian, so is commonly called the Gaussian kernel. Please remember that this has nothing to do with it being a Gaussian process. A Gaussian process need not use the "Gaussian" kernel. In fact, other choices will often be better.
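The kernel formula maps onto a short vectorized function. This is a minimal sketch assuming NumPy, with one lengthscale per input dimension; the function name `se_ard_kernel` and the example inputs are introduced here for illustration.

```python
import numpy as np

def se_ard_kernel(X1, X2, sigma_f=1.0, ell=1.0):
    """k(x_i, x_j) = sigma_f^2 exp(-1/2 sum_d (x_{d,i} - x_{d,j})^2 / ell_d^2).

    X1 is (n, D) and X2 is (m, D); ell is a scalar or a length-D array.
    """
    ell = np.asarray(ell, dtype=float)
    # Dividing by ell first turns the weighted distance into a plain Euclidean one.
    A = np.atleast_2d(np.asarray(X1, dtype=float)) / ell
    B = np.atleast_2d(np.asarray(X2, dtype=float)) / ell
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return sigma_f**2 * np.exp(-0.5 * sq_dist)

# Three made-up 2D inputs, with a longer lengthscale in the second dimension.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = se_ard_kernel(X, X, sigma_f=2.0, ell=[1.0, 2.0])
```

Here a step of 1 in the first dimension and a step of 2 in the second give the same covariance, because each distance is measured in units of its own lengthscale.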

Meaning of hyper-parameters

Many kernels have similar types of parameters:

$$k(x_i, x_j) = \sigma_f^2 \exp\left( -\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2 \right)$$

Consider $x_i = x_j$ ⇒ marginal function variance is $\sigma_f^2$.

[Figure: draws from the prior with $\sigma_f = 2$ and $\sigma_f = 10$.]

Meaning of hyper-parameters

The $\ell_d$ parameters give the overall lengthscale in dimension $d$:

$$k(x_i, x_j) = \sigma_f^2 \exp\left( -\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2 \right)$$

Typical distance between peaks ≈ $\ell$.

[Figure: draws from the prior with $\ell = 0.05$ and $\ell = 0.5$.]
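One way to get a feel for the lengthscale is to ask how strongly two function values are correlated at a fixed separation. A small sketch using the two lengthscales from the figure; the separation of 0.1 and the function name `se_correlation` are choices made here for illustration.

```python
import math

def se_correlation(distance, ell):
    """Correlation k(x_i, x_j) / sigma_f^2 of the 1D squared-exponential kernel."""
    return math.exp(-0.5 * (distance / ell) ** 2)

for ell in (0.05, 0.5):
    rho = se_correlation(0.1, ell)
    print(f"ell = {ell}: correlation at distance 0.1 is {rho:.3f}")
```

With $\ell = 0.05$ the correlation at distance 0.1 is about 0.135, so the function can change substantially over that distance; with $\ell = 0.5$ it is about 0.980, so the two values are nearly identical.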

Typical GP lengthscales

What is the covariance matrix like? Consider 1D problems:

[Figure: three panels plotted against input $x$, with a test location $x_*$ marked; one panel shows output $y$.]

— Zeros in the covariance would ⇒ marginal independence
— Short length scales usually don't match my beliefs
— Empirically, I often learn $\ell \approx 1$, giving a dense $K$

Common exceptions:
Time series data, $\ell$ small.
Irrelevant dimensions, $\ell$ large.
In high dimensions, can have $K_{ij} \approx 0$ with $\ell \approx 1$.
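The last exception can be illustrated numerically: with $\ell = 1$ in every dimension, squared distances add up across dimensions, so the covariance between typical points decays as the dimension grows. This is a rough sketch assuming NumPy, $\sigma_f = 1$, and inputs drawn uniformly from the unit hypercube; that input distribution and the function name are assumptions made for the example.

```python
import numpy as np

def mean_se_covariance(D, ell=1.0, n_pairs=1000, seed=0):
    """Average squared-exponential covariance between random pairs in [0, 1]^D."""
    rng = np.random.default_rng(seed)
    xi = rng.random((n_pairs, D))
    xj = rng.random((n_pairs, D))
    sq_dist = np.sum((xi - xj) ** 2, axis=1)
    return float(np.mean(np.exp(-0.5 * sq_dist / ell**2)))

for D in (1, 10, 100):
    print(D, mean_se_covariance(D))
```

In one dimension the average covariance stays close to 1, giving a dense $K$, while in 100 dimensions it is close to zero even though the lengthscale has not changed.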

What GPs are not

Locally-Weighted Regression weights points with a kernel before fitting a simple model.

[Figure: two panels against input $x$ with the test location $x_*$ marked: output $y$ and kernel value.]

Meaning of kernel zero here: ≈ conditional dependence.

Unlike GP kernel: a) shrinks to small $\ell$ with many data points; b) does not need to be positive definite.
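For contrast with GP regression, locally-weighted regression can be sketched as a weighted least-squares fit repeated at each test location. This is a minimal illustration assuming NumPy; the slide does not specify the "simple model" or the weighting kernel, so the straight line and the Gaussian-shaped weights used here are assumptions.

```python
import numpy as np

def locally_weighted_prediction(x_star, x, y, ell):
    """Fit a weighted straight line around x_star and evaluate it there."""
    # Kernel weights: points near x_star count more (assumed Gaussian-shaped weights).
    w = np.exp(-0.5 * ((x - x_star) / ell) ** 2)
    Phi = np.column_stack([np.ones_like(x), x])       # features [1, x]
    # Weighted least squares: solve (Phi^T W Phi) beta = Phi^T W y.
    A = Phi.T @ (w[:, None] * Phi)
    beta = np.linalg.solve(A, Phi.T @ (w * y))
    return beta[0] + beta[1] * x_star

# Made-up data lying exactly on a line, so any local linear fit recovers it.
x = np.linspace(0.0, 1.0, 11)
y = 2.0 * x + 1.0
prediction = locally_weighted_prediction(0.35, x, y, ell=0.2)
```

The kernel here only sets weights in a local fit. It is never assembled into a covariance matrix, which is why it has no positive-definiteness requirement.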
