Introduction to Gaussian Processes


  1. Introduction to Gaussian Processes
     Iain Murray, murray@cs.toronto.edu
     CSC2515, Introduction to Machine Learning, Fall 2008
     Dept. Computer Science, University of Toronto

  2. The problem
     Learn a scalar function of vector values, $f(\mathbf{x})$.
     [Figures: a 1-D function $f(x)$ with observations $y_i$, and a 2-D surface $f(x_1, x_2)$]
     We have (possibly noisy) observations $\{\mathbf{x}_i, y_i\}_{i=1}^n$.

  3. Example Applications
     Real-valued regression:
     — Robotics: target state → required torque
     — Process engineering: predicting yield
     — Surrogate surfaces for optimization or simulation
     Classification:
     — Recognition: e.g. handwritten digits on cheques
     — Filtering: fraud, interesting science, disease screening
     Ordinal regression:
     — User ratings (e.g. movies or restaurants)
     — Disease screening (e.g. predicting Gleason score)

  4. Model complexity
     The world is often complicated:
     [Figures: three fits to the same data: simple fit, complex fit, truth]
     Problems:
     — Fitting complicated models can be hard
     — How do we find an appropriate model?
     — How do we avoid over-fitting some aspects of the model?

  5. Predicting yield
     Factory settings $\mathbf{x}_1$ → profit of 32 ± 5 monetary units
     Factory settings $\mathbf{x}_2$ → profit of 100 ± 200 monetary units
     Which are the best settings, $\mathbf{x}_1$ or $\mathbf{x}_2$?
     Knowing the error bars can be very important.

  6. Optimization
     In high dimensions it takes many function evaluations to be certain everywhere. Costly if experiments are involved.
     [Figure: posterior mean and uncertainty over a 1-D function]
     Error bars are needed to see if a region is still promising.

  7. Bayesian modelling
     If we come up with a parametric family of functions, $f(\mathbf{x}; \theta)$, and define a prior over $\theta$, probability theory tells us how to make predictions given data. For flexible models, this usually involves intractable integrals over $\theta$.
     We're really good at integrating Gaussians though.
     [Figure: contours of a 2-D Gaussian]
     Can we really solve significant machine learning problems with a simple multivariate Gaussian distribution?

  8. Gaussian distributions
     Completely described by parameters $\mu$ and $\Sigma$:
     $P(\mathbf{f} \,|\, \Sigma, \mu) = |2\pi\Sigma|^{-1/2} \exp\!\big(-\tfrac{1}{2}(\mathbf{f}-\mu)^\top \Sigma^{-1} (\mathbf{f}-\mu)\big)$
     $\mu$ and $\Sigma$ are the mean and covariance of the distribution. For example:
     $\Sigma_{ij} = \langle f_i f_j \rangle - \mu_i \mu_j$
     If we know a distribution is Gaussian and know its mean and covariances, we know its density function.
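
The density above can be evaluated directly. Below is a minimal NumPy/SciPy sketch (my own addition, not part of the slides) that computes the log of this density and checks it against scipy.stats.multivariate_normal; the example mean, covariance, and point are arbitrary.

```python
# Sketch: log of P(f | mu, Sigma) = |2 pi Sigma|^{-1/2} exp(-1/2 (f-mu)^T Sigma^{-1} (f-mu))
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_logpdf(f, mu, Sigma):
    d = len(mu)
    diff = f - mu
    _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|
    quad = diff @ np.linalg.solve(Sigma, diff)    # (f-mu)^T Sigma^{-1} (f-mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
f = np.array([0.3, -0.7])
print(gaussian_logpdf(f, mu, Sigma))
print(multivariate_normal(mu, Sigma).logpdf(f))   # should agree
```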

  9. Marginal of Gaussian
     The marginal of a Gaussian distribution is Gaussian:
     $P(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right)$
     As soon as you convince yourself that the marginal $P(\mathbf{f}) = \int \mathrm{d}\mathbf{g}\, P(\mathbf{f}, \mathbf{g})$ is Gaussian, you already know the means and covariances:
     $P(\mathbf{f}) = \mathcal{N}(a, A)$.

  10. Conditional of Gaussian
      Any conditional of a Gaussian distribution is also Gaussian:
      $P(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix} \right)$
      $P(\mathbf{f} \,|\, \mathbf{g}) = \mathcal{N}\!\left( a + C B^{-1} (\mathbf{g} - b),\; A - C B^{-1} C^\top \right)$
      Showing this is not completely straightforward, but it is a standard result, easily looked up.
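
As an illustration (not from the slides), here is a small NumPy sketch of this conditioning rule; the block names a, b, A, B, C follow the notation above, and the numbers are arbitrary.

```python
# Sketch: conditional of a partitioned Gaussian, P(f | g) = N(a + C B^{-1}(g - b), A - C B^{-1} C^T)
import numpy as np

def condition(a, b, A, B, C, g):
    """Mean and covariance of f | g under the joint N([a; b], [[A, C], [C^T, B]])."""
    CBinv = C @ np.linalg.inv(B)      # explicit inverse is fine for a tiny demo
    mean = a + CBinv @ (g - b)
    cov = A - CBinv @ C.T
    return mean, cov

a, b = np.zeros(1), np.zeros(1)
A = np.array([[1.0]]); B = np.array([[1.0]]); C = np.array([[0.9]])
mean, cov = condition(a, b, A, B, C, g=np.array([2.0]))
print(mean, cov)   # strong correlation pulls f towards 1.8 and shrinks its variance to 0.19
```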

  11. Noisy observations
      Previously we inferred $\mathbf{f}$ given $\mathbf{g}$. What if we only saw a noisy observation, $\mathbf{y} \sim \mathcal{N}(\mathbf{g}, S)$?
      Then $P(\mathbf{f}, \mathbf{g}, \mathbf{y}) = P(\mathbf{f}, \mathbf{g})\, P(\mathbf{y} \,|\, \mathbf{g})$ is Gaussian distributed: still a quadratic form inside the exponential after multiplying.
      Our posterior over $\mathbf{f}$ is still Gaussian:
      $P(\mathbf{f} \,|\, \mathbf{y}) \propto \int \mathrm{d}\mathbf{g}\, P(\mathbf{f}, \mathbf{g}, \mathbf{y})$
      (The right-hand side is Gaussian after marginalizing, so it is still a quadratic form in $\mathbf{f}$ inside an exponential.)

  12. Laying out Gaussians
      A way of visualizing draws from a 2-D Gaussian:
      [Figure: a scatter of correlated samples $(f_1, f_2)$, redrawn as pairs of function values plotted against positions $x_1$ and $x_2$]
      Now it's easy to show three draws from a 6-D Gaussian:
      [Figure: three draws plotted against positions $x_1, \ldots, x_6$]

  13. Building large Gaussians
      [Figure: three draws from a 25-D Gaussian, plotted against x]
      To produce this, we needed a mean: I used zeros(25,1).
      The covariances were set using a kernel function: $\Sigma_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
      The x's are the positions where I planted the tics on the axis.
      Later we'll find k's that ensure $\Sigma$ is always positive semi-definite.
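
A NumPy version of this construction might look like the sketch below; the slide's zeros(25,1) is MATLAB, and the squared-exponential kernel and lengthscale used here are my own choices, since the slide does not name k.

```python
# Sketch: three draws from a 25-D Gaussian whose covariance comes from a kernel
import numpy as np

x = np.linspace(0, 1, 25)                                        # positions of the 25 "tics"
mu = np.zeros(25)                                                # the zeros(25,1) mean
Sigma = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / 0.05**2)    # Sigma_ij = k(x_i, x_j)
Sigma += 1e-9 * np.eye(25)                                       # jitter keeps Sigma positive definite

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mu, Sigma, size=3)               # three draws from the 25-D Gaussian
print(draws.shape)                                               # (3, 25): three functions on 25 points
```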

  14. GP regression model
      [Figures: draws from the GP prior, and noisy observations of one of them]
      $f \sim \mathcal{GP}$, i.e. $\mathbf{f} \sim \mathcal{N}(0, K)$ with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
      Noisy observations: $y_i \,|\, f_i \sim \mathcal{N}(f_i, \sigma_n^2)$, where $f_i = f(\mathbf{x}_i)$

  15. GP Posterior
      Our prior over observations and targets is Gaussian:
      $P\!\left( \begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \right) = \mathcal{N}\!\left( \mathbf{0}, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$
      Using the rule for conditionals, $P(\mathbf{f}_* \,|\, \mathbf{y})$ is Gaussian with:
      mean: $\bar{\mathbf{f}}_* = K(X_*, X) \big( K(X, X) + \sigma_n^2 I \big)^{-1} \mathbf{y}$
      covariance: $\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X) \big( K(X, X) + \sigma_n^2 I \big)^{-1} K(X, X_*)$
      The posterior over functions is a Gaussian Process.
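
The mean and covariance above translate directly into code. The following sketch is my own; the squared-exponential kernel, toy data, and noise level are illustrative assumptions, not from the slides.

```python
# Sketch: GP regression posterior over a grid of test inputs
import numpy as np

def k(A, B, ell=0.2, sf=1.0):
    """Squared-exponential kernel matrix K(A, B) for 1-D inputs."""
    d = A[:, None] - B[None, :]
    return sf**2 * np.exp(-0.5 * d**2 / ell**2)

X = np.array([0.1, 0.4, 0.6, 0.9])            # training inputs
y = np.sin(2 * np.pi * X)                     # toy targets (noise-free here for brevity)
Xs = np.linspace(0, 1, 50)                    # test inputs X_*
sn2 = 0.01                                    # noise variance sigma_n^2

Kxx = k(X, X) + sn2 * np.eye(len(X))          # K(X, X) + sigma_n^2 I
Kxs = k(X, Xs)                                # K(X, X_*)
alpha = np.linalg.solve(Kxx, y)

f_mean = Kxs.T @ alpha                                      # K(X_*, X) (K + sn2 I)^{-1} y
f_cov = k(Xs, Xs) - Kxs.T @ np.linalg.solve(Kxx, Kxs)       # posterior covariance
print(f_mean.shape, f_cov.shape)              # (50,), (50, 50)
```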

  16. GP posterior
      Two (incomplete) ways of visualizing what we know:
      [Figures: draws $\sim p(f \,|\, \text{data})$, and the posterior mean with error bars]

  17. Point predictions
      The conditional at one point $\mathbf{x}_*$ is a simple Gaussian:
      $p(f(\mathbf{x}_*) \,|\, \text{data}) = \mathcal{N}(m, s^2)$
      Need covariances: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $(\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i)$
      Special case of the joint posterior, with $M = K + \sigma_n^2 I$:
      $m = \mathbf{k}_*^\top M^{-1} \mathbf{y}$
      $s^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \underbrace{\mathbf{k}_*^\top M^{-1} \mathbf{k}_*}_{\text{positive}}$
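
For a single test point, this is often computed with a Cholesky factorization of M for numerical stability; the sketch below is my own (with an assumed squared-exponential kernel and toy data) and just follows the m and s² formulas above.

```python
# Sketch: single-point GP prediction via a Cholesky solve of M = K + sigma_n^2 I
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def se_kernel(a, b, ell=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

X = np.array([0.1, 0.4, 0.6, 0.9]); y = np.sin(2 * np.pi * X)
sn2 = 0.01
M = se_kernel(X, X) + sn2 * np.eye(len(X))               # M = K + sigma_n^2 I
L = cho_factor(M)

x_star = np.array([0.5])
k_star = se_kernel(X, x_star)[:, 0]                      # (k_*)_i = k(x_*, x_i)
m = k_star @ cho_solve(L, y)                             # m  = k_*^T M^{-1} y
s2 = se_kernel(x_star, x_star)[0, 0] - k_star @ cho_solve(L, k_star)   # s^2
print(m, s2)     # the subtracted term is positive, so s2 never exceeds k(x_*, x_*)
```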

  18. Discovery or prediction?
      What should error-bars show?
      [Figure: true f, posterior mean, observations, with ±2σ bands for both $p(f_* \,|\, \text{data})$ and $p(y_* \,|\, \text{data})$]
      $P(f_* \,|\, \text{data}) = \mathcal{N}(m, s^2)$ says what we know about the noiseless function.
      $P(y_* \,|\, \text{data}) = \mathcal{N}(m, s^2 + \sigma_n^2)$ predicts what we'll see next.

  19. Review so far
      We can represent a function as a big vector $\mathbf{f}$.
      We assume that this unknown vector was drawn from a big correlated Gaussian distribution, a Gaussian process. (This might upset some mathematicians, but for all practical machine learning and statistical problems this is fine.)
      Observing elements of the vector (optionally corrupted by Gaussian noise) creates a posterior distribution. This is also Gaussian: the posterior over functions is still a Gaussian process.
      Because marginalization in Gaussians is trivial, we can easily ignore all of the positions $\mathbf{x}_i$ that are neither observed nor queried.

  20. Covariance functions
      The main part that has been missing so far is where the covariance function $k(\mathbf{x}_i, \mathbf{x}_j)$ comes from.
      Also, other than making nearby points covary, what can we express with covariance functions, and what do they mean?

  21. Covariance functions
      We can construct covariance functions from parametric models.
      Simplest example, Bayesian linear regression:
      $f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i + b, \quad \mathbf{w} \sim \mathcal{N}(0, \sigma_w^2 I), \quad b \sim \mathcal{N}(0, \sigma_b^2)$
      $\mathrm{cov}(f_i, f_j) = \langle f_i f_j \rangle - \langle f_i \rangle \langle f_j \rangle = \big\langle (\mathbf{w}^\top \mathbf{x}_i + b)(\mathbf{w}^\top \mathbf{x}_j + b) \big\rangle = \sigma_w^2\, \mathbf{x}_i^\top \mathbf{x}_j + \sigma_b^2 = k(\mathbf{x}_i, \mathbf{x}_j)$
      (The $\langle f_i \rangle \langle f_j \rangle$ term is crossed out on the slide because the prior means are zero.)
      Kernel parameters $\sigma_w^2$ and $\sigma_b^2$ are hyper-parameters in the Bayesian hierarchical model.
      More interesting kernels come from models with a large or infinite feature space. Because feature weights $\mathbf{w}$ are integrated out, this is computationally no more expensive.
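
A quick Monte Carlo check (my own illustration, not on the slide) that the linear-regression prior really does induce this kernel: draw many (w, b) pairs and compare the empirical second moment with the formula.

```python
# Sketch: cov(f_i, f_j) under random w, b should match sigma_w^2 x_i^T x_j + sigma_b^2
import numpy as np

rng = np.random.default_rng(0)
sw2, sb2 = 2.0, 0.5                            # sigma_w^2, sigma_b^2
x_i, x_j = np.array([1.0, -0.5]), np.array([0.3, 2.0])

S = 200_000
W = rng.normal(0, np.sqrt(sw2), size=(S, 2))   # w ~ N(0, sigma_w^2 I)
b = rng.normal(0, np.sqrt(sb2), size=S)        # b ~ N(0, sigma_b^2)
f_i = W @ x_i + b
f_j = W @ x_j + b

print(np.mean(f_i * f_j))                      # empirical <f_i f_j> (prior means are zero)
print(sw2 * x_i @ x_j + sb2)                   # k(x_i, x_j) from the formula above
```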

  22. Squared-exponential kernel
      An $\infty$ number of radial-basis functions can give
      $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big( -\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2 \Big)$,
      the most commonly-used kernel in machine learning.
      It looks like an (unnormalized) Gaussian, so it is commonly called the Gaussian kernel. Please remember that this has nothing to do with it being a Gaussian process: a Gaussian process need not use the "Gaussian" kernel. In fact, other choices will often be better.
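
Written as code, this kernel is essentially one line; the sketch below is my own, with arbitrary defaults for $\sigma_f^2$ and the $\ell_d$.

```python
# Sketch: squared-exponential (ARD) kernel for a pair of D-dimensional inputs
import numpy as np

def se_ard(xi, xj, sf2=1.0, ell=None):
    """k(x_i, x_j) = sf2 * exp(-0.5 * sum_d (x_{d,i} - x_{d,j})^2 / ell_d^2)."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    ell = np.ones_like(xi) if ell is None else np.asarray(ell, float)
    return sf2 * np.exp(-0.5 * np.sum((xi - xj)**2 / ell**2))

print(se_ard([0.1, 0.2], [0.1, 0.2]))                     # equals sf2 at zero distance
print(se_ard([0.0, 0.0], [1.0, 0.0], ell=[0.5, 5.0]))     # decays with distance relative to ell_d
```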

  23. Meaning of hyper-parameters
      Many kernels have similar types of parameters:
      $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big( -\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2 \Big)$
      Consider $\mathbf{x}_i = \mathbf{x}_j$: the marginal function variance is $\sigma_f^2$.
      [Figure: prior draws with $\sigma_f = 2$ and $\sigma_f = 10$]

  24. Meaning of hyper-parameters
      The $\ell_d$ parameters give the overall lengthscale in dimension $d$:
      $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big( -\tfrac{1}{2} \sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2 \Big)$
      Typical distance between peaks $\approx 2\ell$.
      [Figure: prior draws with $\ell = 0.05$ and $\ell = 0.5$]
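
To see the effect of $\sigma_f$ and $\ell$ together, one can draw prior samples for a few settings. The sketch below is my own (1-D for simplicity) and prints the rough range of the draws rather than plotting them.

```python
# Sketch: how sigma_f and ell change prior draws; larger sigma_f scales the values,
# larger ell stretches the wiggles
import numpy as np

def prior_draws(sf, ell, n_points=200, n_draws=3, seed=0):
    x = np.linspace(0, 1, n_points)
    K = sf**2 * np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)
    K += 1e-6 * sf**2 * np.eye(n_points)          # jitter for numerical stability
    rng = np.random.default_rng(seed)
    return x, rng.multivariate_normal(np.zeros(n_points), K, size=n_draws)

for sf, ell in [(2.0, 0.05), (10.0, 0.05), (2.0, 0.5)]:
    x, f = prior_draws(sf, ell)
    print(f"sigma_f={sf}, ell={ell}: draws roughly in [{f.min():.1f}, {f.max():.1f}]")
```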

  25. Typical GP lengthscales
      What is the covariance matrix like? Consider 1-D problems:
      [Figures: three 1-D datasets, output y against input x, with very different apparent lengthscales]
      — Zeros in the covariance would imply marginal independence
      — Short length scales usually don't match my beliefs
      — Empirically, I often learn $\ell \approx 1$, giving a dense $K$
      Common exceptions: time series data, $\ell$ small; irrelevant dimensions, $\ell$ large. In high dimensions, we can have $K_{ij} \approx 0$ with $\ell \approx 1$.

  26. What GPs are not
      Locally-Weighted Regression weights points with a kernel before fitting a simple model.
      [Figure: data near $x_*$ weighted by a kernel centred at $x_*$]
      Meaning of kernel zero here: ≈ conditional dependence.
      Unlike the GP kernel, this kernel: a) shrinks to small $\ell$ with many data points; b) does not need to be positive definite.
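
For contrast, here is a rough sketch of locally-weighted regression as described above (my own reading, not Murray's code): weight the data by a kernel centred at $x_*$, then fit a weighted straight line.

```python
# Sketch: locally-weighted linear regression at a single query point x_*
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=40)

def lwr_predict(x_star, x, y, h=0.1):
    w = np.exp(-0.5 * (x - x_star)**2 / h**2)          # kernel weights; need not be a valid GP kernel
    X = np.column_stack([np.ones_like(x), x])          # simple model: a line a + b*x
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # weighted least squares
    return beta[0] + beta[1] * x_star

print(lwr_predict(0.25, x, y))                         # should be near sin(pi/2) = 1
```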
