  1. Introduction to Gaussian Processes. Iain Murray, School of Informatics, University of Edinburgh.

  2. The problem. Learn a scalar function of vector-valued inputs, $f(\mathbf{x})$. [Figures: a 1D function $f(x)$ with noisy observations $y_i$, and a 2D surface $f(x)$ over inputs $x_1, x_2$.] We have (possibly noisy) observations $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$.

  3. Example Applications. Real-valued regression:
  — Robotics: target state → required torque
  — Process engineering: predicting yield
  — Surrogate surfaces for optimization or simulation
  Many problems are not regression: classification, rating/ranking, discovery, embedding, clustering, . . . But unknown functions may be part of a larger model.

  4. Model complexity. The world is often complicated. [Figure: three fits to the same data: a simple fit, a complex fit, and the truth.] Problems:
  — Don't want to underfit, and be too certain
  — Don't want to overfit, and generalize poorly
  — Bayesian model comparison is often hard

  5. Predicting yield. Factory settings $\mathbf{x}_1$ → profit of 32 ± 5 monetary units. Factory settings $\mathbf{x}_2$ → profit of 100 ± 200 monetary units. Which settings are best, $\mathbf{x}_1$ or $\mathbf{x}_2$? Knowing the error bars can be important.

  6. Optimization. In high dimensions it takes many function evaluations to be certain everywhere. Costly if experiments are involved. Error bars are needed to see if a region is still promising.

  7. Bayesian modelling. If we come up with a parametric family of functions, $f(\mathbf{x}; \theta)$, and define a prior over $\theta$, probability theory tells us how to make predictions given data. For flexible models, this usually involves intractable integrals over $\theta$. We're really good at integrating Gaussians though. Can we really solve significant machine learning problems with a simple multivariate Gaussian distribution?

  8. Gaussian distributions. Completely described by parameters $\boldsymbol{\mu}$ and $\Sigma$:
  $$p(\mathbf{f} \mid \Sigma, \boldsymbol{\mu}) = |2\pi\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(\mathbf{f}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{f}-\boldsymbol{\mu})\right)$$
  $\boldsymbol{\mu}$ and $\Sigma$ are the mean and covariance: $\mu_i = \mathbb{E}[f_i]$ and $\Sigma_{ij} = \mathbb{E}[f_i f_j] - \mu_i \mu_j$. If we know a distribution is Gaussian and know its mean and covariances, we know its density function.
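
  As a sanity check, here is a minimal Python/NumPy sketch (the point and parameter values are made up) that evaluates this density directly and compares it with SciPy's implementation:

      import numpy as np
      from scipy.stats import multivariate_normal

      # A toy 2D Gaussian: any mean vector and positive semi-definite covariance will do.
      mu = np.array([0.5, -1.0])
      Sigma = np.array([[1.0, 0.6],
                        [0.6, 2.0]])

      # Evaluate the density at a point using the formula on the slide...
      f = np.array([0.0, 0.0])
      diff = f - mu
      dens = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
             / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

      # ...and check it against a library implementation.
      assert np.isclose(dens, multivariate_normal(mu, Sigma).pdf(f))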

  9. Marginal of Gaussian. The marginal of a Gaussian distribution is Gaussian:
  $$p(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{f}\\\mathbf{g}\end{bmatrix};\ \begin{bmatrix}\mathbf{a}\\\mathbf{b}\end{bmatrix},\ \begin{bmatrix}A & C\\ C^\top & B\end{bmatrix}\right)$$
  As soon as you convince yourself that the marginal $p(\mathbf{f}) = \int p(\mathbf{f}, \mathbf{g})\,\mathrm{d}\mathbf{g}$ is Gaussian, you already know the means and covariances: $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}; \mathbf{a}, A)$.
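
  In code, marginalizing a Gaussian is just reading off the relevant block of the mean and covariance. A small Monte Carlo sketch (sizes and numbers are arbitrary):

      import numpy as np

      rng = np.random.default_rng(0)

      # Joint Gaussian over (f, g), partitioned as on the slide: f is 2D, g is 3D.
      mean = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # [a; b]
      M = rng.standard_normal((5, 5))
      cov = M @ M.T / 5.0                            # a positive semi-definite [A C; C^T B]

      # Marginalizing out g just means keeping the f-block: p(f) = N(f; a, A).
      a, A = mean[:2], cov[:2, :2]

      # Check: the f-components of joint samples have mean a and covariance A.
      samples = rng.multivariate_normal(mean, cov, size=200_000)[:, :2]
      print(samples.mean(axis=0), a)
      print(np.cov(samples, rowvar=False), A, sep="\n")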

  10. Conditional of Gaussian. Any conditional of a Gaussian distribution is also Gaussian:
  $$p(\mathbf{f}, \mathbf{g}) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{f}\\\mathbf{g}\end{bmatrix};\ \begin{bmatrix}\mathbf{a}\\\mathbf{b}\end{bmatrix},\ \begin{bmatrix}A & C\\ C^\top & B\end{bmatrix}\right)$$
  $$p(\mathbf{f} \mid \mathbf{g}) = \mathcal{N}\big(\mathbf{f};\ \mathbf{a} + CB^{-1}(\mathbf{g}-\mathbf{b}),\ A - CB^{-1}C^\top\big)$$
  Showing this result requires some grunt work. But it is standard, and easily looked up.
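
  The formula is easy to implement and to test numerically. A sketch (dimensions and the observed values are made up); the cross-check uses the standard identity that the conditional covariance equals the inverse of the corresponding block of the joint precision matrix:

      import numpy as np

      rng = np.random.default_rng(1)

      # Partitioned joint Gaussian over (f, g): f is 2D, g is 3D.
      a, b = np.zeros(2), np.zeros(3)
      M = rng.standard_normal((5, 5))
      S = M @ M.T + np.eye(5)                # a well-conditioned joint covariance [A C; C^T B]
      A, C, B = S[:2, :2], S[:2, 2:], S[2:, 2:]

      g = np.array([0.3, -1.2, 0.5])         # the observed block

      # Conditional from the slide: N(a + C B^{-1}(g - b), A - C B^{-1} C^T).
      mean_cond = a + C @ np.linalg.solve(B, g - b)
      cov_cond = A - C @ np.linalg.solve(B, C.T)

      # Cross-check against the precision-matrix form: cov(f | g) = inv(Lambda_ff).
      Lam = np.linalg.inv(S)
      assert np.allclose(cov_cond, np.linalg.inv(Lam[:2, :2]))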

  11. Noisy observations. Previously we inferred $\mathbf{f}$ given $\mathbf{g}$. What if we only saw a noisy observation, $\mathbf{y} \sim \mathcal{N}(\mathbf{g}, S)$? Then $p(\mathbf{f}, \mathbf{g}, \mathbf{y}) = p(\mathbf{f}, \mathbf{g})\,p(\mathbf{y} \mid \mathbf{g})$ is Gaussian distributed: multiplying the two factors still leaves a quadratic form inside the exponential. The posterior over $\mathbf{f}$ is still Gaussian: $p(\mathbf{f} \mid \mathbf{y}) \propto \int p(\mathbf{f}, \mathbf{g}, \mathbf{y})\,\mathrm{d}\mathbf{g}$. The RHS is Gaussian after marginalizing, so it is still a quadratic form in $\mathbf{f}$ inside an exponential.
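
  One concrete way to see the result (a sketch, assuming the observation noise is independent of $\mathbf{f}$ and $\mathbf{g}$): if $\mathbf{y} = \mathbf{g} + \text{noise}$ with noise covariance $S$, then $(\mathbf{f}, \mathbf{y})$ is jointly Gaussian with the same cross-covariance $C$ but with $B$ replaced by $B + S$, so the previous conditioning formula applies unchanged:

      import numpy as np

      rng = np.random.default_rng(2)

      # Joint Gaussian over (f, g) as before; f is 2D, g is 3D (numbers are illustrative).
      M = rng.standard_normal((5, 5))
      Sigma = M @ M.T + np.eye(5)
      A, C, B = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]
      a, b = np.zeros(2), np.zeros(3)

      # y = g + independent noise with covariance S, so cov(y) = B + S while cov(f, y) = C.
      S = 0.1 * np.eye(3)
      y = np.array([0.2, -0.4, 1.0])

      # Condition on y exactly as before, with B replaced by B + S.
      mean_post = a + C @ np.linalg.solve(B + S, y - b)
      cov_post = A - C @ np.linalg.solve(B + S, C.T)
      print(mean_post, cov_post, sep="\n")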

  12. Laying out Gaussians. A way of visualizing draws from a 2D Gaussian: [Figure: a scatter plot of samples $(f_1, f_2)$ ⇔ the same draws plotted as values $f$ against positions $x_1$ and $x_2$.] Now it's easy to show three draws from a 6D Gaussian: [Figure: three draws plotted as $f$ against positions $x_1, \dots, x_6$.]

  13. Building large Gaussians. Three draws from a 25D Gaussian: [Figure: three smooth curves $f$ plotted against $x$.] To produce this, we needed a mean: I used zeros(25,1). The covariances were set using a kernel function: $\Sigma_{ij} = k(x_i, x_j)$. The $x$'s are the positions where I planted the ticks on the axis. Later we'll find $k$'s that ensure $\Sigma$ is always positive semi-definite.
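
  A minimal Python/NumPy sketch of the same construction (using the squared-exponential kernel defined later in the talk; the lengthscale and seed are arbitrary choices):

      import numpy as np

      rng = np.random.default_rng(3)

      # 25 input positions (the tick locations on the x-axis).
      x = np.linspace(0, 1, 25)

      # Covariance from a kernel function: Sigma_ij = k(x_i, x_j).
      ell = 0.1
      Sigma = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

      # Zero mean (the slide's zeros(25,1)); a tiny jitter keeps the sampler happy numerically.
      mean = np.zeros(25)
      draws = rng.multivariate_normal(mean, Sigma + 1e-9 * np.eye(25), size=3)
      # Each row of `draws` is one smooth random function evaluated at the 25 positions.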

  14. GP regression model. [Figure: two panels of functions over $x \in [0, 1]$.] Noisy observations:
  $$f \sim \mathcal{GP}, \qquad y_i \mid f_i \sim \mathcal{N}(f_i, \sigma_n^2), \qquad \mathbf{f} \sim \mathcal{N}(\mathbf{0}, K), \quad K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j), \quad \text{where } f_i = f(\mathbf{x}_i)$$

  15. GP Posterior. Our prior over observations and targets is Gaussian:
  $$P\!\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix}\right) = \mathcal{N}\!\left(\begin{bmatrix}\mathbf{y}\\ \mathbf{f}_*\end{bmatrix};\ \mathbf{0},\ \begin{bmatrix} K(X,X) + \sigma_n^2 I & K(X, X_*)\\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right)$$
  Using the rule for conditionals, $p(\mathbf{f}_* \mid \mathbf{y})$ is Gaussian with:
  $$\text{mean, } \bar{\mathbf{f}}_* = K(X_*, X)\big(K(X,X) + \sigma_n^2 I\big)^{-1}\mathbf{y}$$
  $$\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\big(K(X,X) + \sigma_n^2 I\big)^{-1} K(X, X_*)$$
  The posterior over functions is a Gaussian Process.
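
  These two formulas are a few lines of NumPy. A sketch on made-up 1D data with a squared-exponential kernel (all names, data, and settings here are illustrative, not from the talk):

      import numpy as np

      rng = np.random.default_rng(4)

      def se_kernel(Xa, Xb, ell=0.2, sigma_f=1.0):
          # Squared-exponential kernel on scalar inputs.
          return sigma_f**2 * np.exp(-0.5 * (Xa[:, None] - Xb[None, :])**2 / ell**2)

      # Toy 1D training data and a grid of test inputs X_*.
      X = np.array([0.1, 0.3, 0.5, 0.9])
      y = np.sin(6 * X) + 0.05 * rng.standard_normal(4)
      Xs = np.linspace(0, 1, 100)
      sigma_n = 0.05

      # Posterior mean and covariance, exactly as on the slide.
      Kxx = se_kernel(X, X) + sigma_n**2 * np.eye(len(X))
      Ksx = se_kernel(Xs, X)
      Kss = se_kernel(Xs, Xs)
      f_bar = Ksx @ np.linalg.solve(Kxx, y)
      f_cov = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)
      std = np.sqrt(np.diag(f_cov))       # pointwise error bars on the noiseless function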

  16. GP Posterior. Two incomplete ways of visualizing what we know: [Figure: left, draws $\sim p(f \mid \text{data})$; right, the posterior mean and error bars.]

  17. Point predictions. The conditional at one point $\mathbf{x}_*$ is a simple Gaussian: $p(f(\mathbf{x}_*) \mid \text{data}) = \mathcal{N}(f; m, s^2)$. Need covariances: $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $(\mathbf{k}_*)_i = k(\mathbf{x}_*, \mathbf{x}_i)$. Special case of the joint posterior:
  $$M = K + \sigma_n^2 I, \qquad m = \mathbf{k}_*^\top M^{-1}\mathbf{y}, \qquad s^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \underbrace{\mathbf{k}_*^\top M^{-1}\mathbf{k}_*}_{\text{positive}}$$
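
  A self-contained toy version of the point-prediction formulas (kernel, data, and $\mathbf{x}_*$ are all made up); the last line anticipates the next slide's distinction between predicting $f_*$ and $y_*$:

      import numpy as np

      def k(a, b, ell=0.2):                       # squared-exponential kernel on scalars
          return np.exp(-0.5 * (a - b)**2 / ell**2)

      X = np.array([0.1, 0.3, 0.5, 0.9])          # training inputs
      y = np.array([0.2, 0.8, 0.4, -0.6])         # training targets (made up)
      sigma_n, x_star = 0.05, 0.7

      M = k(X[:, None], X[None, :]) + sigma_n**2 * np.eye(len(X))
      k_star = k(x_star, X)                       # vector with entries (k_*)_i = k(x_*, x_i)

      m = k_star @ np.linalg.solve(M, y)                            # predictive mean
      s2 = k(x_star, x_star) - k_star @ np.linalg.solve(M, k_star)  # variance of f(x_*)
      s2_y = s2 + sigma_n**2                      # variance of the next noisy observation y_*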

  18. Discovery or prediction? [Figure: posterior mean, observations, the true $f$, and two sets of error bars: $\pm 2\sigma$ for $p(f_* \mid \text{data})$ and $\pm 2\sigma$ for $p(y_* \mid \text{data})$.] $p(f_* \mid \text{data}) = \mathcal{N}(f_*; m, s^2)$ says what we know about the noiseless function. $p(y_* \mid \text{data}) = \mathcal{N}(y_*; m, s^2 + \sigma_n^2)$ predicts what we'll see next.

  19. Review so far. We can represent a function as a big vector $\mathbf{f}$. We assume that this unknown vector was drawn from a big correlated Gaussian distribution, a Gaussian process. (This might upset some mathematicians, but for all practical machine learning and statistical problems, this is fine.) Observing elements of the vector (optionally corrupted by Gaussian noise) creates a Gaussian posterior distribution. The posterior over functions is still a Gaussian process. Marginalization in Gaussians is trivial: just ignore all of the positions $\mathbf{x}_i$ that are neither observed nor queried.

  20. Covariance functions. The main part that has been missing so far is where the covariance function $k(\mathbf{x}_i, \mathbf{x}_j)$ comes from. What else can it say, other than that nearby points are similar?

  21. Covariance functions. We can construct covariance functions from parametric models. Simplest example, Bayesian linear regression: $f(\mathbf{x}_i) = \mathbf{w}^\top\mathbf{x}_i + b$, with $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 I)$ and $b \sim \mathcal{N}(0, \sigma_b^2)$.
  $$\mathrm{cov}(f_i, f_j) = \mathbb{E}[f_i f_j] - \mathbb{E}[f_i]\,\mathbb{E}[f_j] = \mathbb{E}\big[(\mathbf{w}^\top\mathbf{x}_i + b)(\mathbf{w}^\top\mathbf{x}_j + b)\big] = \sigma_w^2\,\mathbf{x}_i^\top\mathbf{x}_j + \sigma_b^2 = k(\mathbf{x}_i, \mathbf{x}_j)$$
  (the $\mathbb{E}[f_i]\,\mathbb{E}[f_j]$ term vanishes because the prior means are zero). Kernel parameters $\sigma_w^2$ and $\sigma_b^2$ are hyper-parameters in the Bayesian hierarchical model. More interesting kernels come from models with a large or infinite feature space: $k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_w^2\,\Phi(\mathbf{x}_i)^\top\Phi(\mathbf{x}_j) + \sigma_b^2$, the 'kernel trick'.
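
  A quick Monte Carlo check of this derivation (the inputs and prior variances are arbitrary): sample many $(\mathbf{w}, b)$ from the prior, evaluate $f$ at two inputs, and compare the empirical covariance with $\sigma_w^2\,\mathbf{x}_i^\top\mathbf{x}_j + \sigma_b^2$.

      import numpy as np

      rng = np.random.default_rng(5)

      sigma_w, sigma_b = 0.7, 0.3
      x_i = np.array([0.2, -1.0, 0.5])
      x_j = np.array([1.0, 0.4, -0.3])

      # Sample many (w, b) from the prior and evaluate f at the two inputs.
      n = 500_000
      w = sigma_w * rng.standard_normal((n, 3))
      b = sigma_b * rng.standard_normal(n)
      f_i, f_j = w @ x_i + b, w @ x_j + b

      k_mc = np.mean(f_i * f_j) - f_i.mean() * f_j.mean()   # empirical cov(f_i, f_j)
      k_formula = sigma_w**2 * (x_i @ x_j) + sigma_b**2
      print(k_mc, k_formula)                                 # should agree to a couple of decimals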

  22. What's a valid kernel? We could 'make up' a kernel function $k(\mathbf{x}_i, \mathbf{x}_j)$. But any 'Gram matrix'
  $$K = \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_N)\\ \vdots & \ddots & \vdots\\ k(\mathbf{x}_N, \mathbf{x}_1) & \cdots & k(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$$
  must be positive semi-definite: $\mathbf{z}^\top K \mathbf{z} \ge 0$ for all $\mathbf{z}$. This is achieved by a positive semi-definite kernel, or Mercer kernel. $K$ has +ve eigenvalues $\Rightarrow$ $K^{-1}$ has +ve eigenvalues $\Rightarrow$ the Gaussian is normalizable. Mercer kernels give inner products of some feature vectors $\Phi(\mathbf{x})$, but these $\Phi(\mathbf{x})$ vectors may be infinite.
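
  In practice, a quick numerical check of a 'made up' kernel is to build a Gram matrix on some inputs and look at its eigenvalues. A sketch (the kernel below, $\exp(-|x - x'|)$, happens to be a valid one, the exponential/Ornstein-Uhlenbeck covariance):

      import numpy as np

      def my_kernel(xa, xb):                       # a hypothetical candidate kernel
          return np.exp(-np.abs(xa - xb))

      x = np.linspace(0, 1, 50)
      K = my_kernel(x[:, None], x[None, :])        # Gram matrix on 50 inputs
      eigvals = np.linalg.eigvalsh(K)
      print(eigvals.min() >= -1e-10)               # True: numerically positive semi-definite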

  23. Squared-exponential kernel. An $\infty$ number of radial-basis functions can give
  $$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}\sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2\Big),$$
  the most commonly-used kernel in machine learning. It looks like an (unnormalized) Gaussian, so it is sometimes called the Gaussian kernel. A Gaussian process need not use the 'Gaussian' kernel. In fact, other choices will often be better.
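
  The kernel is straightforward to vectorize with one lengthscale per input dimension. A sketch (function name, arguments, and the example settings are illustrative):

      import numpy as np

      def se_kernel_ard(Xa, Xb, ell, sigma_f=1.0):
          # Xa: (n, D), Xb: (m, D), ell: length-D array of per-dimension lengthscales.
          diff = Xa[:, None, :] - Xb[None, :, :]     # (n, m, D) pairwise differences
          sq = np.sum((diff / ell)**2, axis=-1)      # sum_d (x_{d,i} - x_{d,j})^2 / ell_d^2
          return sigma_f**2 * np.exp(-0.5 * sq)

      # Example: 2D inputs with a different lengthscale in each dimension.
      X = np.random.default_rng(6).uniform(size=(5, 2))
      K = se_kernel_ard(X, X, ell=np.array([0.5, 0.1]), sigma_f=2.0)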

  24. Meaning of hyper-parameters. Many kernels have similar types of parameters:
  $$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}\sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2\Big)$$
  Consider $\mathbf{x}_i = \mathbf{x}_j$: the marginal function variance is $\sigma_f^2$. [Figure: draws with $\sigma_f = 2$ and $\sigma_f = 10$.]

  25. Meaning of hyper-parameters. The $\ell_d$ parameters give the length-scale in dimension $d$:
  $$k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\Big(-\tfrac{1}{2}\sum_{d=1}^{D} (x_{d,i} - x_{d,j})^2 / \ell_d^2\Big)$$
  Typical distance between peaks $\approx \ell$. [Figure: draws with $\ell = 0.05$ and $\ell = 0.5$.]

  26. Effect of hyper-parameters. Different (SE) kernel parameters give different explanations of the data: [Figure: posterior fits with $\ell = 0.5$, $\sigma_n = 0.05$ (left) and $\ell = 1.5$, $\sigma_n = 0.15$ (right).]

  27. Other kernels. The SE kernel produces very smooth and 'boring' functions. Kernels are available for rough data, periodic data, strings, graphs, images, models, . . . Different kernels can be combined: $k(\mathbf{x}_i, \mathbf{x}_j) = \alpha k_1(\mathbf{x}_i, \mathbf{x}_j) + \beta k_2(\mathbf{x}_i, \mathbf{x}_j)$ is positive semi-definite if $k_1$ and $k_2$ are (and $\alpha, \beta \ge 0$).
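
  For example, adding a squared-exponential kernel to a periodic (exp-sine-squared) kernel still gives a valid covariance; a quick numerical check (all parameter values are arbitrary):

      import numpy as np

      x = np.linspace(0, 1, 40)
      d = x[:, None] - x[None, :]

      k1 = np.exp(-0.5 * d**2 / 0.1**2)                   # squared-exponential
      k2 = np.exp(-2 * np.sin(np.pi * d)**2 / 0.3**2)     # periodic, period 1

      alpha, beta = 1.0, 0.5
      K = alpha * k1 + beta * k2
      print(np.linalg.eigvalsh(K).min() >= -1e-10)        # still positive semi-definite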
