Kernel Design


  1. Gaussian Process Summer School Kernel Design Nicolas Durrande – PROWLER.io (nicolas@prowler.io) Sheffield, September 2019

  2. Second Introduction to GPs and GP Regression

  3. The pdf of a Gaussian random variable is f(x) = 1/(σ√(2π)) exp(−(x − µ)^2 / (2σ^2)). [Figure: Gaussian density plotted over x ∈ [−4, 4].] The parameters µ and σ^2 correspond to the mean and variance: µ = E[X], σ^2 = E[X^2] − E[X]^2. The variance is positive.
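
As a quick illustration, the density above can be evaluated in a few lines of NumPy; this is only a sketch, and the function name and grid are our own choices.

    import numpy as np

    def gaussian_pdf(x, mu=0.0, sigma2=1.0):
        # Density of a N(mu, sigma2) variable, transcribing the slide's formula.
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    x = np.linspace(-4, 4, 200)
    density = gaussian_pdf(x)      # the curve plotted on the slide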

  4. Definition: We say that a vector Y = (Y_1, ..., Y_n)^t follows a multivariate normal distribution if any linear combination of Y follows a normal distribution: ∀ α ∈ R^n, α^t Y ∼ N. Two examples and one counter-example: [Figure: three scatter plots of (Y_1, Y_2) samples.]

  5. The pdf of a multivariate Gaussian is f_Y(x) = 1/((2π)^{n/2} |Σ|^{1/2}) exp(−(1/2) (x − µ)^t Σ^{-1} (x − µ)). It is parametrised by the mean vector µ = E[Y] and the covariance matrix Σ = E[YY^t] − E[Y]E[Y]^t (i.e. Σ_{i,j} = cov(Y_i, Y_j)). [Figure: 2D Gaussian density surface over (x_1, x_2).] A covariance matrix is symmetric (Σ_{i,j} = Σ_{j,i}) and positive semi-definite: ∀ α ∈ R^n, α^t Σ α ≥ 0.
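
For concreteness, here is a minimal NumPy transcription of this density; scipy.stats.multivariate_normal computes the same thing, so this sketch is purely illustrative.

    import numpy as np

    def mvn_pdf(x, mu, Sigma):
        # Direct transcription of the slide's formula (fine for small n).
        n = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm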

  6. Conditional distribution of a 2D multivariate Gaussian: p(y_1 | y_2 = α) = p(y_1, α) / p(α) = exp(quadratic in y_1 and α) / const = exp(quadratic in y_1) / const = Gaussian distribution! [Figure: joint density f_Y over (x_1, x_2) with the conditional mean µ_c and variance Σ_c highlighted.] The conditional distribution is still Gaussian!

  7. 3D example: conditional distribution of a 3D multivariate Gaussian. [Figure: density over (x_1, x_2, x_3).]

  8. Conditional distribution. Let (Y_1, Y_2) be a Gaussian vector (Y_1 and Y_2 may both be vectors): (Y_1, Y_2)^t ∼ N((µ_1, µ_2)^t, [[Σ_11, Σ_12], [Σ_21, Σ_22]]). The conditional distribution of Y_1 given Y_2 is Y_1 | Y_2 ∼ N(µ_cond, Σ_cond) with µ_cond = E[Y_1 | Y_2] = µ_1 + Σ_12 Σ_22^{-1} (Y_2 − µ_2) and Σ_cond = cov[Y_1, Y_1 | Y_2] = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.
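
These conditioning formulas translate directly into code. The sketch below is a plain NumPy version; the function and argument names are our own, np.linalg.solve replaces the explicit inverse of Σ_22, and Σ_21 is taken as the transpose of Σ_12 since the joint covariance is symmetric.

    import numpy as np

    def gaussian_conditional(mu1, mu2, S11, S12, S22, y2):
        # Conditional distribution of Y1 given Y2 = y2, following the slide's formulas.
        mu_cond = mu1 + S12 @ np.linalg.solve(S22, y2 - mu2)
        Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S12.T)
        return mu_cond, Sigma_cond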

  9. Gaussian processes. [Figure: sample paths Y(x) of a Gaussian process.] Definition: A random process Z over D ⊂ R^d is said to be Gaussian if ∀ n ∈ N, ∀ x_i ∈ D, (Z(x_1), ..., Z(x_n)) is multivariate normal. ⇒ Demo: https://github.com/awav/interactive-gp

  10. We write Z ∼ N(m(.), k(., .)): m : D → R is the mean function, m(x) = E[Z(x)]; k : D × D → R is the covariance function (i.e. kernel), k(x, y) = cov(Z(x), Z(y)). The mean m can be any function, but not the kernel. Theorem (Loève): k is a GP covariance ⇔ k is symmetric (k(x, y) = k(y, x)) and positive semi-definite: for all n ∈ N, for all x_i ∈ D, for all α ∈ R^n, ∑_{i=1}^{n} ∑_{j=1}^{n} α_i α_j k(x_i, x_j) ≥ 0.

  11. Proving that a function is psd is often difficult. However, there are a lot of functions that have already been proven to be psd:
      squared exp.:  k(x, y) = σ^2 exp(−(x − y)^2 / (2θ^2))
      Matérn 5/2:    k(x, y) = σ^2 (1 + √5 |x − y| / θ + 5 |x − y|^2 / (3θ^2)) exp(−√5 |x − y| / θ)
      Matérn 3/2:    k(x, y) = σ^2 (1 + √3 |x − y| / θ) exp(−√3 |x − y| / θ)
      exponential:   k(x, y) = σ^2 exp(−|x − y| / θ)
      Brownian:      k(x, y) = σ^2 min(x, y)
      white noise:   k(x, y) = σ^2 δ_{x,y}
      constant:      k(x, y) = σ^2
      linear:        k(x, y) = σ^2 x y
      When k is a function of x − y, the kernel is called stationary. σ^2 is called the variance and θ the lengthscale. ⇒ Demo: https://github.com/NicolasDurrande/shinyApps
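
A few of these kernels written out in NumPy, as a sketch (the function names and default hyperparameter values are our own choices):

    import numpy as np

    def squared_exp(x, y, variance=1.0, lengthscale=1.0):
        # k(x, y) = sigma^2 exp(-(x - y)^2 / (2 theta^2))
        return variance * np.exp(-(x - y) ** 2 / (2 * lengthscale ** 2))

    def matern32(x, y, variance=1.0, lengthscale=1.0):
        # k(x, y) = sigma^2 (1 + sqrt(3)|x - y|/theta) exp(-sqrt(3)|x - y|/theta)
        r = np.sqrt(3.0) * np.abs(x - y) / lengthscale
        return variance * (1.0 + r) * np.exp(-r)

    def brownian(x, y, variance=1.0):
        # k(x, y) = sigma^2 min(x, y)
        return variance * np.minimum(x, y)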

  12. Examples of kernels in gpflow: [Figure: twelve panels over x ∈ [−3, 3] showing k(x, 0) for the Matern12, Matern32, Matern52, RBF, RationalQuadratic, Constant, White, Cosine, Periodic and ArcCosine kernels, and k(x, 1) for the Linear and Polynomial kernels.]
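
A minimal sketch of how such curves can be computed, assuming GPflow 2 (the kernel classes and the K method used below exist in GPflow 2, but default hyperparameters and plotting details are omitted):

    import numpy as np
    import gpflow

    x = np.linspace(-3, 3, 200).reshape(-1, 1)
    zero = np.zeros((1, 1))

    kernels = [gpflow.kernels.Matern12(), gpflow.kernels.Matern32(),
               gpflow.kernels.Matern52(), gpflow.kernels.SquaredExponential(),
               gpflow.kernels.Cosine()]
    for kernel in kernels:
        k_x0 = kernel.K(x, zero).numpy()   # k(x, 0), one panel of the figure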

  13. Associated samples: [Figure: sample paths drawn from GPs with each of the twelve kernels above.]

  14. Gaussian process regression. We assume we have observed a function f for a set of points X = (X_1, ..., X_n): [Figure: the observed values of f at the points X.] The vector of observations is F = f(X) (i.e. F_i = f(X_i)).

  15. Since f is unknown, we make the general assumption that it is a sample path of a Gaussian process Z ∼ N(0, k): [Figure: prior sample paths Y(x).]

  16. The posterior distribution Z(·) | Z(X) = F is still Gaussian and can be computed analytically. It is N(m(·), c(·, ·)) with m(x) = E[Z(x) | Z(X) = F] = k(x, X) k(X, X)^{-1} F and c(x, y) = cov[Z(x), Z(y) | Z(X) = F] = k(x, y) − k(x, X) k(X, X)^{-1} k(X, y).
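
A direct NumPy transcription of these two formulas, as a sketch (the squared exponential kernel, all names and the toy data are our own choices; in practice a small jitter is often added to k(X, X) for numerical stability):

    import numpy as np

    def sq_exp(A, B, variance=1.0, lengthscale=0.2):
        # squared exponential kernel matrix k(A, B); hyperparameter values are arbitrary
        return variance * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * lengthscale ** 2))

    def gp_posterior(x_new, X, F, kernel=sq_exp):
        # Noise-free conditioning, transcribing the slide's formulas.
        K_XX = kernel(X, X)
        K_xX = kernel(x_new, X)
        mean = K_xX @ np.linalg.solve(K_XX, F)
        cov = kernel(x_new, x_new) - K_xX @ np.linalg.solve(K_XX, K_xX.T)
        return mean, cov

    x_grid = np.linspace(0, 1, 100)
    X_obs = np.array([0.1, 0.4, 0.6, 0.9])
    F_obs = np.sin(6 * X_obs)                      # toy observations
    mean, cov = gp_posterior(x_grid, X_obs, F_obs)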

  17. A few words on GPR complexity. Storage footprint: we have to store the covariance matrix, which is n × n. Complexity: we have to invert the covariance matrix, which is O(n^3). Storage footprint is often the first limit to be reached: the maximal number of observation points is between 1,000 and 10,000. What if we have more data? ⇒ Talk from Zhenwen this afternoon. What if we need to be faster? ⇒ Talk from Arno on Wednesday.

  18. Samples from the posterior distribution: [Figure: sample paths of Y(x) | Y(X) = F.]

  19. It can be summarized by a mean function and 95% confidence intervals. [Figure: posterior mean of Y(x) | Y(X) = F with 95% confidence intervals.]

  20. A few remarkable properties of GPR models: they (can) interpolate the data-points; the prediction variance does not depend on the observations; the mean predictor does not depend on the variance parameter; the mean (usually) comes back to zero when predicting far away from the observations. Can we prove them? ⇒ Demo https://durrande.shinyapps.io/gp_playground

  21. We are not always interested in models that interpolate the data, for example if there is some observation noise: F = f(X) + ε. Let N be a process N(0, n(., .)) that represents the observation noise. The expressions of GPR with noise are m(x) = E[Z(x) | Z(X) + N(X) = F] = k(x, X) (k(X, X) + n(X, X))^{-1} F and c(x, y) = cov[Z(x), Z(y) | Z(X) + N(X) = F] = k(x, y) − k(x, X) (k(X, X) + n(X, X))^{-1} k(X, y).
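
The only change with respect to the noise-free sketch above is that n(X, X) is added to k(X, X); for i.i.d. noise n(x, y) = τ^2 δ_{x,y} this is τ^2 on the diagonal. The names below are our own.

    import numpy as np

    def gp_posterior_noisy(x_new, X, F, kernel, tau2):
        # Same conditioning as the noise-free sketch, with tau^2 added to the
        # diagonal of k(X, X), i.e. n(x, y) = tau^2 * delta_{x,y}.
        K = kernel(X, X) + tau2 * np.eye(len(X))
        K_xX = kernel(x_new, X)
        mean = K_xX @ np.linalg.solve(K, F)
        cov = kernel(x_new, x_new) - K_xX @ np.linalg.solve(K, K_xX.T)
        return mean, cov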

  22. Examples of models with observation noise for n(x, y) = τ^2 δ_{x,y}: [Figure: three models Z(x) | Z(X) + N(X) = F over x ∈ [0, 1], with τ^2 equal to 0.001, 0.01 and 0.1 respectively.] What if F = f(X) + ε isn't appropriate? ⇒ Talks from Alan and Neil tomorrow.

  23. Parameter estimation

  24. The choice of the kernel parameters has a great influence on the model. ⇒ Demo https://durrande.shinyapps.io/gp_playground. In order to choose a prior that is suited to the data at hand, we can search for the parameters that maximise the model likelihood. Definition: The likelihood of a distribution with density f_X, given some observations X_1, ..., X_p, is L = ∏_{i=1}^{p} f_X(X_i).

  25. In the GPR context, we often have only one observation of the vector F. The likelihood is then L(σ^2, θ) = f_{Z(X)}(F) = 1/((2π)^{n/2} |k(X, X)|^{1/2}) exp(−(1/2) F^t k(X, X)^{-1} F). It is thus possible to maximise L – or log(L) – with respect to the kernel's parameters in order to find a well suited prior. Why is the likelihood linked to good model predictions? They are linked by the product rule: f_{Z(X)}(F) = f(F_1) × f(F_2 | F_1) × f(F_3 | F_1, F_2) × ... × f(F_n | F_1, ..., F_{n−1}).
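
A sketch of maximum-likelihood estimation for a zero-mean GP with a squared exponential kernel, using the log-likelihood above (the log-parameterisation, jitter, toy data and optimiser are our own choices, not prescribed by the slides):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_likelihood(log_params, X, F):
        # -log L(sigma^2, theta), transcribing the slide's expression.
        variance, lengthscale = np.exp(log_params)      # log scale keeps them positive
        K = variance * np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * lengthscale ** 2))
        K += 1e-8 * np.eye(len(X))                      # small jitter for stability
        _, logdet = np.linalg.slogdet(K)
        quad = F @ np.linalg.solve(K, F)
        return 0.5 * (logdet + quad + len(X) * np.log(2 * np.pi))

    X = np.linspace(0, 1, 20)
    F = np.sin(6 * X)                                   # toy observations
    res = minimize(neg_log_likelihood, x0=np.log([1.0, 0.2]), args=(X, F))
    sigma2_hat, theta_hat = np.exp(res.x)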

  26. Model validation

  27. The idea is to introduce new data and to compare the model prediction with reality. [Figure: model Z(x) | Z(X) = F over x ∈ [0, 1] together with test observations.] Two (ideally three) things should be checked: Is the mean accurate? Do the confidence intervals make sense? Are the predicted covariances right?

  28. Let X_t be the test set and F_t = f(X_t) be the associated observations. The accuracy of the mean can be measured by computing the Mean Square Error, MSE = mean((F_t − m(X_t))^2), and a "normalised" criterion, Q_2 = 1 − ∑(F_t − m(X_t))^2 / ∑(F_t − mean(F_t))^2. On the above example we get MSE = 0.038 and Q_2 = 0.95.
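
Both criteria are one-liners; the sketch below uses our own argument names.

    import numpy as np

    def mse_and_q2(F_t, m_Xt):
        # Test-set criteria from the slide: MSE and the normalised Q2 score.
        mse = np.mean((F_t - m_Xt) ** 2)
        q2 = 1.0 - np.sum((F_t - m_Xt) ** 2) / np.sum((F_t - np.mean(F_t)) ** 2)
        return mse, q2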

  29. The predicted distribution can be tested by normalising the residuals. According to the model, F_t ∼ N(m(X_t), c(X_t, X_t)), so c(X_t, X_t)^{-1/2} (F_t − m(X_t)) should be independent N(0, 1) samples: [Figure: histogram of the standardised residuals and the associated normal Q-Q plot.]
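
One way to compute the standardised residuals is to use the Cholesky factor of c(X_t, X_t) as the matrix square root in the expression above; this is a sketch, not the only valid choice.

    import numpy as np

    def standardised_residuals(F_t, m_Xt, c_XtXt):
        # Whiten the residuals with the Cholesky factor of the predicted covariance.
        L = np.linalg.cholesky(c_XtXt + 1e-10 * np.eye(len(F_t)))   # jitter for stability
        return np.linalg.solve(L, F_t - m_Xt)   # should look like i.i.d. N(0, 1) samples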

  30. When no test set is available, another option is to consider cross-validation methods such as leave-one-out. The steps are: 1. build a model based on all observations except one; 2. compute the model error at this point. This procedure can be repeated for all the design points in order to get a vector of errors.
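
A naive leave-one-out loop following these two steps; the kernel argument is assumed to behave like the kernel sketches above, and all names are our own.

    import numpy as np

    def leave_one_out_errors(X, F, kernel):
        # Step 1: build a model on all observations except the i-th one.
        # Step 2: compute the error of its mean prediction at the left-out point.
        errors = []
        for i in range(len(X)):
            keep = np.arange(len(X)) != i
            K = kernel(X[keep], X[keep])
            k_i = kernel(X[i:i + 1], X[keep])
            pred = k_i @ np.linalg.solve(K, F[keep])   # posterior mean at X_i
            errors.append(F[i] - pred.item())
        return np.array(errors)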
