slide-1
SLIDE 1

Kernels for deterministic and stochastic approximations of (invariant) functions

David Ginsbourger

1 Idiap Research Institute, UQOD group, Martigny, Switzerland, and 2 IMSV, Mathematics and Statistics Department, University of Bern, Switzerland

Acknowledgements: a number of co-authors, notably appearing via citations!

Advances in Kernel Methods workshop, Gaussian Process and Uncertainty Quantification Summer School, The University of Sheffield, September 6th 2018

slide-2
SLIDE 2

Outline

1. Introduction: p.d. kernels, from analysis to GPs and back
2. On kernels and invariances: contributions from second order to Gaussian; numerical applications and discussion

slide-5
SLIDE 5

Kernel methods and invariances/degeneracies

Kernels are a crucial ingredient in a number of mathematical and statistical methods for function approximation, data classification and beyond: Support Vector Machines, Gaussian Process modelling, regularization in Reproducing Kernel Hilbert Spaces, kernel Principal Component Analysis, embeddings of measures in RKHS, etc.

The implementation of any of these methods requires a valid kernel k. We focus on the choice of k in the function approximation framework, and in particular on invariance/degeneracy properties that can be driven by k.

slide-8
SLIDE 8

What are (complex- and real-valued) p.d. kernels?

Let D be a set and k : D × D → C. k is called a positive definite kernel when

\sum_{i=1}^{n} \sum_{j=1}^{n} a_i \overline{a_j} k(x_i, x_j) ∈ [0, +∞)

for all n ≥ 1, a_1, ..., a_n ∈ C, and x_1, ..., x_n ∈ D.

The following facts follow directly from this definition:
- k(x, x) ∈ [0, +∞) for all x ∈ D
- k(x′, x) = \overline{k(x, x′)} for all x, x′ ∈ D (k is Hermitian)
- Non-negative combinations and pointwise limits of p.d. kernels are p.d.

NB: k : D × D → R is p.d. when both \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) ∈ [0, +∞) for all n ≥ 1, a_1, ..., a_n ∈ R and x_1, ..., x_n ∈ D, and k is symmetric.
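This definition can be probed numerically: positive definiteness of k says exactly that every Gram matrix (k(x_i, x_j))_{i,j} is (Hermitian) positive semi-definite. Below is a minimal added sketch in Python/numpy (real-valued case; the kernel and the points are arbitrary illustrations, not from the slides):

```python
import numpy as np

def gauss_kernel(x, y, theta=1.0):
    """Squared-exponential kernel k(x, y) = exp(-((x - y)/theta)^2)."""
    return np.exp(-((x - y) / theta) ** 2)

# Gram matrix K = (k(x_i, x_j)) on a few arbitrary points.
x = np.array([0.0, 0.3, 0.7, 1.0, 2.5])
K = gauss_kernel(x[:, None], x[None, :])

# k is p.d. iff every such Gram matrix is symmetric positive semi-definite,
# i.e. sum_i sum_j a_i a_j k(x_i, x_j) >= 0 for all real coefficients a.
eigvals = np.linalg.eigvalsh(K)
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue (>= 0 up to rounding):", eigvals.min())
```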

slide-9
SLIDE 9

Considered kernel methods for function approximation

Here we focus on two classes of kernel methods for the approximation of functions based on observational/evaluation data:
- Gaussian Process (GP) modelling/interpolation/regression
- Interpolation/regularization in Reproducing Kernel Hilbert Spaces

Typical settings of interest are those of an objective function f : D → R (e.g. with D ⊂ R^d, d ≥ 1) that one wishes to approximate relying on a limited number n ≥ 1 of evaluations at points x_i ∈ D (1 ≤ i ≤ n).

slide-11
SLIDE 11

About Gaussian Process modelling

GP modelling basically consists in postulating that f is a realization of a real-valued Gaussian random field Z = (Z_x)_{x∈D} and in doing inference on f using the conditional distribution of Z given the available evaluation results.

As we know, in the Gaussian case the mean and covariance functions (say m and k, here) characterize Z's distribution, so choosing them is crucial.

slide-14
SLIDE 14

Reminder: GP/Kriging equations

The GP/Kriging prediction amounts to calculating the conditional expectation and covariance of Z_x knowing Z_{X_n} = z_n, with z_n = (f(x_1), ..., f(x_n))′:

m_n(x) = E[Z_x | Z_{X_n} = z_n] = m(x) + k(x, X_n) k(X_n, X_n)^{-1} (z_n − m(X_n))
k_n(x, x′) = Cov[Z_x, Z_{x′} | Z_{X_n} = z_n] = k(x, x′) − k(x, X_n) k(X_n, X_n)^{-1} k(X_n, x′),

where k(X_n, X_n) = (k(x_i, x_j))_{1 ≤ i, j ≤ n} and k(X_n, x) = (k(x_1, x), ..., k(x_n, x))′.

For given m and k (with possible generalizations to m known up to linear combination coefficients, cf. Universal Kriging with improper uniform prior), Z knowing Z_{X_n} = z_n is a GP with mean m_n and covariance k_n.
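These equations translate directly into a few lines of linear algebra. Below is a minimal added numpy sketch (not the code behind the talk's figures; the Matérn 5/2 kernel, test data and jitter level are illustrative choices):

```python
import numpy as np

def matern52(x, y, sigma=1.0, ell=0.4):
    """Matern 5/2 kernel, in the parametrisation appearing later in the appendix."""
    r = np.abs(x[:, None] - y[None, :])
    return sigma**2 * (1 + r/ell + r**2 / (3 * ell**2)) * np.exp(-r / ell)

def gp_posterior(x_new, X, z, kern, m=lambda t: np.zeros_like(t), jitter=1e-10):
    """Conditional mean m_n and covariance k_n from the Kriging equations."""
    K = kern(X, X) + jitter * np.eye(len(X))    # k(X_n, X_n), numerically stabilized
    Ks = kern(x_new, X)                         # k(x, X_n)
    alpha = np.linalg.solve(K, z - m(X))        # k(X_n, X_n)^{-1} (z_n - m(X_n))
    mean = m(x_new) + Ks @ alpha
    cov = kern(x_new, x_new) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.array([0.1, 0.4, 0.6, 0.9])
z = np.sin(2 * np.pi * X)                       # evaluations of a test f
x_new = np.linspace(0, 1, 5)
mean, cov = gp_posterior(x_new, X, z, matern52)
print(mean)
print(np.sqrt(np.clip(np.diag(cov), 0, None)))  # posterior standard deviation
```

The jitter term is only a standard numerical safeguard for the linear solves; it is not part of the Kriging equations themselves.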

slide-15
SLIDE 15

A classical test function: Branin-Hoo

slide-16
SLIDE 16

GP interpolation (Kriging) of the Branin-Hoo function

The covariance is here a stationary anisotropic Matérn kernel (ν = 5/2) with scale σ and range parameters (θ_1, θ_2) estimated by Maximum Likelihood.

slide-17
SLIDE 17

Conditional simulations of the Branin-Hoo function

slide-18
SLIDE 18

A detour through deterministic function approximation

Approximating f based on evaluations at n points is ill-posed without further assumptions on f. Also in deterministic settings, p.d. kernels play a key role.

- G. Kimeldorf and G. Wahba (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82-95.
- H. Wendland (2005). Scattered Data Approximation. Cambridge University Press.
- G. E. Fasshauer (2011). Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4:21-63.
- M. Scheuerer, R. Schaback and M. Schlather (2013). Interpolation of spatial data - a stochastic or a deterministic problem? European Journal of Applied Mathematics, 24(4):601-629.

slide-20
SLIDE 20

Optimal approximation in RKHSs

Theorem (generalization of Kimeldorf and Wahba's 1971 "representer theorem" by Schölkopf, Herbrich and Smola): Given evaluation results (x_1, z_1), ..., (x_n, z_n) ∈ D × R, an arbitrary cost function c : (D × R^2)^n → R ∪ {∞}, and a strictly increasing function p on [0, ∞), any m_n ∈ H_k (the RKHS with kernel k) minimizing

g ∈ H_k → c((x_1, z_1, g(x_1)), ..., (x_n, z_n, g(x_n))) + p(||g||_{H_k})

admits a representation of the form m_n(·) = \sum_{i=1}^{n} α_i k(·, x_i), with α_1, ..., α_n ∈ R. (Notes: noiseless or noisy z_i's; real-valued k here.)

- B. Schölkopf, R. Herbrich and A.J. Smola (2001). A Generalized Representer Theorem. Computational Learning Theory, Lecture Notes in Computer Science, 2111:416-426.

slide-24
SLIDE 24

In RKHS regularization and GP models with known (e.g., constant) mean, prior assumptions on f are implicitly accounted for through the choice of k.

Classical notions of invariance for k:
- 2nd order stationarity (k invariant w.r.t. simultaneous translations of x and x′)
- Isotropy (k invariant w.r.t. simultaneous rigid motions of x and x′)

We rather investigate some functional properties driven by k, with a main focus on the stochastic case (+ some links to RKHSs). This talk follows to a large extent the paper below and references therein:

- D. G., O. Roustant and N. Durrande (2016). On degeneracy and invariances of random fields paths with applications in Gaussian Process modelling. Journal of Statistical Planning and Inference, 170:117-128.

slide-26
SLIDE 26

Simulating a GP with group-invariant paths

[Figure: contour plot of a simulated GP sample path on [−1, 1]^2 with group-invariant level sets.]
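The construction behind such figures can be sketched as follows (an added Python/numpy illustration, not the code behind the original figure): averaging a base kernel over a finite group in each argument yields an argumentwise invariant p.d. kernel, and the corresponding centred GP has invariant paths (cf. the group-invariance references a few slides below). The group, base kernel and grid here are illustrative assumptions; the group is {id, s} with s(x_1, x_2) = (x_2, x_1).

```python
import numpy as np

def k_base(X, Y, theta=0.5):
    """Base squared-exponential kernel on R^2 (arbitrary illustrative choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / theta**2)

# Finite group G = {id, s} with s(x1, x2) = (x2, x1). Double averaging gives
# an argumentwise invariant kernel: k_G(x, x') = (1/|G|^2) sum_{g,g'} k(gx, g'x').
def orbit(X):
    return [X, X[:, ::-1]]

def k_inv(X, Y):
    return np.mean([k_base(gX, gY) for gX in orbit(X) for gY in orbit(Y)], axis=0)

# Simulate a centred GP path with kernel k_inv on a grid of [-1, 1]^2.
# An eigendecomposition is used instead of Cholesky because invariance
# makes the covariance matrix rank-deficient.
g = np.linspace(-1, 1, 25)
G1, G2 = np.meshgrid(g, g)
X = np.column_stack([G1.ravel(), G2.ravel()])
w, V = np.linalg.eigh(k_inv(X, X))
rng = np.random.default_rng(0)
path = (V @ (np.sqrt(np.clip(w, 0, None)) * rng.standard_normal(len(w)))).reshape(25, 25)

# Every realization satisfies f(x1, x2) = f(x2, x1), up to rounding:
print(path[3, 20], path[20, 3])
```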

slide-27
SLIDE 27

Towards invariant prediction: set-up

[Figure: GP path to be predicted and design points, on [−1, 1]^2.]

slide-28
SLIDE 28

Predicting with an (argumentwise) invariant kernel

[Figure: invariant GP path predicted with an adapted kernel.]

slide-29
SLIDE 29

Predicting with an (argumentwise) invariant kernel

[Figure: invariant GP prediction, posterior standard deviation.]

slide-30
SLIDE 30

Invariant conditional simulations

[Figure: conditional simulations of a GP with group-invariant paths.]
slide-31
SLIDE 31

Some refs on group-invariance in kernel methods

- B. Haasdonk and H. Burkhardt (2007). Invariant kernels for pattern analysis and machine learning. Machine Learning, 68:35-61.
- D. G., X. Bay, O. Roustant and L. Carraro (2012). Argumentwise invariant kernels for the approximation of invariant functions. Annales de la Faculté des Sciences de Toulouse, 21(3):501-527.
- K. Hansen et al. (2013). Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. Journal of Chemical Theory and Computation, 9:3404-3419.
- Y. Mroueh, S. Voinea and T. Poggio (2015). Learning with Group Invariant Features: A Kernel Perspective. Advances in Neural Information Processing Systems, 1558-1566.

slide-32
SLIDE 32

Proposition (DG et al. 2016). Let Z be a measurable random field with paths (a.s.) in some function space F and T : F → F be a linear operator such that for all x ∈ D there exists a signed measure ν_x on D satisfying T(g)(x) = \int g(u) dν_x(u). Assume further that

sup_{x∈D} \int_D \sqrt{k(u, u) + m(u)^2} d|ν_x|(u) < +∞.

Then the following are equivalent:
a) ∀x ∈ D, P(T(Z)_x = 0) = 1 ("T(Z) = 0 up to a modification")
b) ∀x ∈ D, T(m)(x) = 0 and (T ⊗ T)(k)(x, x) = 0.
Assuming further that T(Z) is separable, a) and b) are also equivalent to
c) P(T(Z) = 0) = P(∀x ∈ D, T(Z)_x = 0) = 1 ("T(Z) = 0 a.s.").
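Condition b) can be checked numerically on concrete kernels. In the added sketch below, an illustrative operator is assumed: T(g)(x) = g(x) − g(s(x)) with s the diagonal reflection of R^2, i.e. ν_x = δ_x − δ_{s(x)}, so that (T ⊗ T)(k)(x, x) = k(x, x) − k(x, s(x)) − k(s(x), x) + k(s(x), s(x)). It vanishes for a double-symmetrized kernel but not for a generic one.

```python
import numpy as np

def k_base(x, y, theta=0.5):
    """Generic (non-invariant) squared-exponential kernel on R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / theta**2)

s = lambda x: (x[1], x[0])          # reflection about the diagonal

def k_sym(x, y):
    """Argumentwise invariant kernel via double symmetrisation over {id, s}."""
    return 0.25 * (k_base(x, y) + k_base(s(x), y)
                   + k_base(x, s(y)) + k_base(s(x), s(y)))

# With nu_x = delta_x - delta_{s(x)}:
# (T x T)(k)(x, x) = k(x,x) - k(x,s(x)) - k(s(x),x) + k(s(x),s(x)).
def TT_k(k, x):
    return k(x, x) - k(x, s(x)) - k(s(x), x) + k(s(x), s(x))

x = (0.3, -0.8)
print(TT_k(k_base, x))   # > 0: paths of the base GP are not s-invariant
print(TT_k(k_sym, x))    # 0 up to rounding: b) holds, so T(Z) = 0 a.s.
```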

slide-35
SLIDE 35

Another invariance: random fields with additive paths

Let D = \prod_{i=1}^{d} D_i where D_i ⊂ R. f ∈ R^D is called additive when there exist f_i ∈ R^{D_i} (1 ≤ i ≤ d) such that f(x) = \sum_{i=1}^{d} f_i(x_i) (x = (x_1, ..., x_d) ∈ D).

GP models possessing additive paths (with k(x, x′) = \sum_{i=1}^{d} k_i(x_i, x′_i)) have been considered in Nicolas Durrande's Ph.D. thesis (2011); a code sketch of such a model follows below.

[Figure: two perspective plots of additive GP sample paths over (x_1, x_2).]
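A minimal added sketch of such an additive GP model (the one-dimensional kernels k_i and the grid are illustrative assumptions), checking additivity of simulated paths via second-order mixed differences:

```python
import numpy as np

def k1(s, t, ell=0.3):
    """1-d squared-exponential kernel (arbitrary illustrative choice for each k_i)."""
    return np.exp(-((s - t) / ell) ** 2)

def k_add(X, Y):
    """Additive kernel k(x, x') = sum_i k_i(x_i, x'_i)."""
    return sum(k1(X[:, None, i], Y[None, :, i]) for i in range(X.shape[1]))

# Sample a centred GP with kernel k_add on a 20 x 20 grid of [0, 1]^2
# (eigendecomposition: additive kernels give rank-deficient Gram matrices).
g = np.linspace(0, 1, 20)
G1, G2 = np.meshgrid(g, g)
X = np.column_stack([G1.ravel(), G2.ravel()])
w, V = np.linalg.eigh(k_add(X, X))
rng = np.random.default_rng(1)
f = (V @ (np.sqrt(np.clip(w, 0, None)) * rng.standard_normal(len(w)))).reshape(20, 20)

# Paths are additive, f(x) = f_1(x_1) + f_2(x_2), so mixed differences
# f(a,b) - f(a,b') - f(a',b) + f(a',b') vanish (up to rounding):
print(f[3, 5] - f[3, 15] - f[12, 5] + f[12, 15])
```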

slide-36
SLIDE 36

A few selected references related to additive kernels

- N. Durrande (2011). Étude de classes de noyaux adaptés à la simplification et à l'interprétation des modèles d'approximation. Une approche fonctionnelle et probabiliste. PhD thesis, École des Mines de Saint-Étienne.
- D. Duvenaud, H. Nickisch and C. Rasmussen (2011). Additive Gaussian Processes. Neural Information Processing Systems.
- N. Durrande, D. G. and O. Roustant (2012). Additive covariance kernels for high-dimensional Gaussian Process modeling. Annales de la Faculté des Sciences de Toulouse, 21(3):481-499.
- D. G., N. Durrande and O. Roustant (2013). Kernels and designs for modelling invariant functions: from group invariance to additivity. In mODa 10 - Advances in Model-Oriented Design and Analysis, Contributions to Statistics.

slide-38
SLIDE 38

A link with RKHSs in the Gaussian case

In the Gaussian case, the Loève isometry Ψ between L(Z) (the Hilbert space generated by Z) and the RKHS H_k leads to the following.

Proposition. Let T : F → R^D be a linear operator such that T(m) ≡ 0 and T(Z)_x ∈ L(Z) for any x ∈ D. Then there exists a unique linear operator T̃ : H_k → R^D satisfying Cov(T(Z)_x, Z_{x′}) = T̃(k(·, x′))(x) (x, x′ ∈ D) and such that T̃(h_n)(x) → T̃(h)(x) for any x ∈ D whenever h_n → h in H_k. In addition, we have equivalence between the following:
(i) ∀x ∈ D, T(Z)_x = 0 (almost surely)
(ii) ∀x′ ∈ D, T̃(k(·, x′)) = 0
(iii) T̃(H_k) = {0}

slide-40
SLIDE 40

Examples

a) Let ν be a measure on D s.t. \int_D \sqrt{k(u, u)} dν(u) < +∞. Then a centred Z (Gaussian or not) has centred paths (\int_D Z_u dν(u) = 0 a.s.) iff \int_D k(x, u) dν(u) = 0 for all x ∈ D. For instance, given any p.d. kernel k, the kernel k_0 defined by

k_0(x, y) = k(x, y) − \int k(x, u) dν(u) − \int k(y, u) dν(u) + \int\int k(u, v) dν(u) dν(v)

satisfies the above condition.

b) Solutions to the Laplace equation are called harmonic functions. Let us call harmonic any p.d. kernel solving the Laplace equation argumentwise: Δ(k(·, x′)) = 0 (x′ ∈ D). An example of such a harmonic kernel over R^2 × R^2 can be found in the recent literature (Schaback et al. 2009):

k_harm(x, y) = exp((x_1 y_1 + x_2 y_2)/θ^2) cos((x_2 y_1 − x_1 y_2)/θ^2).
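Both examples lend themselves to quick numerical checks. In the added sketch below, ν is approximated by a uniform quadrature measure on [0, 1] for example a) (an assumption made purely for illustration), and the harmonicity of k_harm is probed with a finite-difference Laplacian; parameter values are arbitrary.

```python
import numpy as np

def k(s, t, theta=0.4):
    """Base squared-exponential kernel on [0, 1] (illustrative choice)."""
    return np.exp(-((s - t) / theta) ** 2)

# Example a): nu approximated by a uniform quadrature measure on [0, 1].
u = np.linspace(0.0, 1.0, 200)
w = np.full(u.size, 1.0 / u.size)          # quadrature weights

def k0(s, t):
    """k0(x,y) = k(x,y) - int k(x,u)dnu - int k(y,u)dnu + int int k dnu dnu."""
    m_s = k(s[:, None], u[None, :]) @ w
    m_t = k(t[:, None], u[None, :]) @ w
    c = w @ (k(u[:, None], u[None, :]) @ w)
    return k(s[:, None], t[None, :]) - m_s[:, None] - m_t[None, :] + c

x = np.array([0.1, 0.5, 0.9])
print(k0(x, u) @ w)      # ~ 0 for each x: the centring condition holds

# Example b): the harmonic kernel of Schaback et al. on R^2 x R^2.
def k_harm(x, y, theta=1.0):
    return (np.exp((x[0] * y[0] + x[1] * y[1]) / theta**2)
            * np.cos((x[1] * y[0] - x[0] * y[1]) / theta**2))

# Finite-difference Laplacian of x -> k_harm(x, y) at a test point:
y, h = (0.7, -0.2), 1e-3
lap = sum((k_harm((0.3 + h * d1, 0.5 + h * d2), y)
           + k_harm((0.3 - h * d1, 0.5 - h * d2), y)
           - 2.0 * k_harm((0.3, 0.5), y)) / h**2
          for d1, d2 in [(1, 0), (0, 1)])
print(lap)               # ~ 0 (up to finite-difference error)
```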

slide-41
SLIDE 41

Example sample paths invariant under various T's

[Figure: (a) zero-mean paths of the centred GP with kernel k_0; (b) harmonic path of a GRF with kernel k_harm.]

slide-42
SLIDE 42

Some "stability of invariances by conditioning" result

Proposition. Let F, G be real separable Banach spaces, µ be a Gaussian measure on B(F) with mean zero and covariance operator C_µ, T : F → F be a bounded linear operator such that T C_µ T* = 0_{F*→F}, A : F → G be another bounded linear operator, and A♯µ be the image of µ under A. Then there exist a Borel measurable mapping m : G → F, a Gaussian covariance R : F* → F with R ≤ C_µ, and a disintegration (q_y)_{y∈G} of µ on B(F) with respect to A such that, for any fixed y ∈ G, q_y is a Gaussian measure with mean m(y) and covariance operator R satisfying T(m(y)) = 0_F and T R T* = 0_{F*→F}.

slide-43
SLIDE 43

GP prediction with invariant kernels: example a)

[Figure: test function, best predictor and 95% confidence intervals for (a) GPR with kernel k and (b) GPR with kernel k_0. Comparison of two GP models: the left one is based on a Gaussian kernel; the right one incorporates the zero-mean property.]

slide-44
SLIDE 44

GP models with invariant kernels: example b)

[Figure: example of a GP model based on a harmonic kernel. (a) Mean predictor and 95% prediction intervals; (b) prediction error.]

slide-47
SLIDE 47

Numerical application: maximum of a harmonic f

Here we consider approximating a harmonic function (left/right: Gaussian/harmonic kernels) and estimating its maximum by GRF modelling.

[Figure: predicted surfaces over (x_1, x_2) ∈ [−1, 1]^2 under the Gaussian kernel (left) and the harmonic kernel (right).]

Extracted from "On degeneracy and invariances of random fields paths with applications in Gaussian Process modelling" (DG, O. Roustant & N. Durrande, Journal of Statistical Planning and Inference, 170:117-128, 2016).

slide-48
SLIDE 48

Numerical application: maximum of a harmonic f

Prediction errors (left/right: Gaussian/harmonic kernels).

[Figure: prediction error maps over (x_1, x_2); error scale about 0 to 0.3 for the Gaussian kernel versus about ±0.015 for the harmonic kernel.]
slide-49
SLIDE 49

Numerical application: maximum of a harmonic f

Prediction errors (left/right: Gaussian/harmonic kernels).

[Figure: two panels plotting Temperature versus θ.]

slide-50
SLIDE 50

Numerical application: maximum of a harmonic f

Conditional simulations of the maximum under the two GRF models.

[Figure: density of simulated maxima under the Gaussian kernel and under the harmonic kernel, with the actual maximum marked.]

slide-51
SLIDE 51

Numerical application: recovering a symmetry axis

[Figure: contour plot of a symmetric test function over [−1, 1]^2.]
slide-52
SLIDE 52

Numerical application 2: recovering a symmetry axis

[Figure: log-likelihood surface over the symmetry-axis parameters (distance to the origin, angle).]
slide-53
SLIDE 53

Discussion

Function approximation approaches based on p.d. kernels enable incorporating degeneracies and invariances under linear operators, including:
- Symmetries and further invariances under group actions
- Additivity and further multivariate sparsity properties towards high-dimensional GRF modelling (see, e.g., the MCQMC 2014 paper)
- Harmonicity, but also, e.g., divergence-free properties for vector fields (see, e.g., Scheuerer and Schlather 2012)

In the Gaussian set-up, such properties are inherited by conditional distributions, which is clearly convenient but also comes with risks.

slide-56
SLIDE 56

Perspectives

- Developing further the inference of degeneracy/invariance properties based on data, and investigating consistency,
- Creating classes of kernels that incorporate some invariant components and some non-invariant components,
- Exploring further the potential of invariant kernels in real-world applications (e.g., from physics, biology, engineering).

Thank you very much for your attention!

slide-57
SLIDE 57

Further references

- C.J. Stone (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689-705.
- M. Scheuerer and M. Schlather (2012). Covariance Models for Divergence-Free and Curl-Free Random Vector Fields. Stochastic Models, 28(3).
- D. Duvenaud (2014). Automatic Model Construction with Gaussian Processes. PhD thesis, University of Cambridge.
- K. Kandasamy, J. Schneider and B. Poczos (2015). High Dimensional Bayesian Optimisation and Bandits via Additive Models. International Conference on Machine Learning (ICML) 2015.
- D. G., O. Roustant, D. Schuhmacher, N. Durrande and N. Lenz (2016). On ANOVA decompositions of kernels and Gaussian random field paths. Monte Carlo and Quasi-Monte Carlo Methods.

slide-58
SLIDE 58

Appendix

3. About GPs and their use in function modelling
4. Examples of GPs and generalities on p.d. kernels
5. Miscellanea

slide-62
SLIDE 62

What do we assume about f in GP modelling?

In Gaussian Process (GP) modelling, probabilistic concepts are used to model the deterministic function f. Let us first focus on an arbitrary point x ∈ D and think of the unknown response value f(x) as a Gaussian random variable, denoted here Z_x. Of course, how the mean and variance of Z_x are specified is crucial. A simple option is to set them to constant values (e.g. 0 mean and σ^2 > 0 variance)...

...However, a white noise assumption would not be very constructive! The crux in GP modelling is to assume that the Z_x's for different x's are correlated.

slide-66
SLIDE 66

Reminder: n-dimensional Gaussian distribution

More precisely, we will appeal to the multivariate Gaussian distribution. Let us forget about x for now and consider a random vector Z = (Z_1, ..., Z_n). Z is said to be multivariate Gaussian distributed when \sum_{i=1}^{n} a_i Z_i is Gaussian distributed whatever n ≥ 1 and a_1, ..., a_n ∈ R. Such a Z is characterized by its mean µ ∈ R^n and covariance matrix K ∈ R^{n×n} (with entries E[Z_i] and Cov[Z_i, Z_j] = E[(Z_i − µ_i)(Z_j − µ_j)], respectively). We use the notation Z ∼ N(µ, K). Note that while µ can take any value, K must be symmetric positive semi-definite (i.e. symmetric with non-negative eigenvalues).

slide-68
SLIDE 68

Reminder: n-dimensional Gaussian distribution

In case of invertible K, Z possesses the probability density function

p_{N(µ,K)}(z) = (2π)^{−n/2} det(K)^{−1/2} exp(−(1/2)(z − µ)′ K^{−1} (z − µ)).

Besides that, denoting by Z_a and Z_b two subvectors of Z such that Z = (Z_a, Z_b), by µ_a, µ_b the corresponding means, and defining the corresponding blocks of Z's covariance matrix by

K = [ K_a   K_ab ]
    [ K_ba  K_b  ],

then (assuming invertibility of K_a) the conditional probability distribution of Z_b knowing that Z_a = z_a is (multivariate) Gaussian with

L(Z_b | Z_a = z_a) = N(µ_b + K_ba K_a^{−1} (z_a − µ_a), K_b − K_ba K_a^{−1} K_ab).

slide-71
SLIDE 71

Priors on functions?

Let us now come back to our function approximation problem. We are interested in having a prior distribution on functions, not just on a finite-dimensional vector!

Good news from probability theory (Kolmogorov's extension theorem): random processes on D (a.k.a. random fields in case of multivariate D) can be defined through finite-dimensional distributions, i.e. through the distributions of the random vectors (Z_{x_1}, ..., Z_{x_n}) for any finite set of points x_1, ..., x_n.

Gaussian Processes (a.k.a. Gaussian Random Fields): a GP (GRF) Z with index set D is a collection of random variables (Z_x)_{x∈D} (defined over the same probability space (Ω, A, P)) such that for any finite set of points x_1, ..., x_n ∈ D, (Z_{x_1}, ..., Z_{x_n}) is multivariate Gaussian.

slide-74
SLIDE 74

Mean and covariance functions of a GP

Hence a GP Z is defined by specifying the mean and the covariance matrix of any random vector of the form (Z_{x_1}, ..., Z_{x_n}), so that Z is characterized by

µ : x ∈ D → µ(x) = E[Z_x] ∈ R
k : (x, x′) ∈ D × D → k(x, x′) = Cov[Z_x, Z_{x′}] ∈ R

While µ can be any function, k is constrained since (k(x_i, x_j))_{1≤i,j≤n} must be symmetric positive semi-definite for any set of points. k's satisfying such a property are referred to as p.d. kernels.

Remark: Assuming µ ≡ 0 for now, k accounts for a number of properties of Z, including pathwise properties, i.e. functional properties of the paths x ∈ D → Z_x(ω) ∈ R, for ω ∈ Ω (paths are also called "realizations" or "trajectories").

slide-75
SLIDE 75

Some GRF simulations (d=1) in R with DiceKriging

Here k(t, t′) = σ^2 (1 + |t′ − t|/ℓ + (t − t′)^2/(3ℓ^2)) exp(−|t′ − t|/ℓ) (Matérn kernel with regularity parameter 5/2), where ℓ = 0.4 and σ = 1.5. Furthermore, the trend here is µ(t) = −1 + 2t + 3t^2.

[Figure: simulated sample paths z against x on [0, 1].]
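A numpy rendition of this simulation (an added sketch mirroring the stated kernel, parameters and trend; the original figure was produced with the DiceKriging R package):

```python
import numpy as np

def matern52(t, tp, sigma=1.5, ell=0.4):
    """Matern 5/2 kernel in the slide's parametrisation."""
    r = np.abs(t[:, None] - tp[None, :])
    return sigma**2 * (1 + r/ell + r**2 / (3 * ell**2)) * np.exp(-r / ell)

mu = lambda t: -1 + 2 * t + 3 * t**2        # quadratic trend from the slide

t = np.linspace(0, 1, 200)
K = matern52(t, t) + 1e-8 * np.eye(len(t))  # small jitter for the Cholesky
rng = np.random.default_rng(2)
paths = mu(t)[:, None] + np.linalg.cholesky(K) @ rng.standard_normal((len(t), 5))
print(paths.shape)                          # (200, 5): five trajectories on [0, 1]
```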

slide-76
SLIDE 76

Some GRF simulations (d=2) in R with DiceKriging

Now take a tensorized version of the Matérn kernel and a constant trend µ = 0.

slide-79
SLIDE 79

Approximating functions using GP models

Let us now consider a deterministic function f : D → R, whose response values are measured at n points X_n = (x_1, ..., x_n) ∈ D^n. Putting a GP prior Z on f and updating it with respect to f's values at the x_i points, we can work out a posterior distribution.

Indeed, finite-dimensional distributions of this posterior can be obtained by looking at the conditional distribution of (Z_{x_{n+1}}, ..., Z_{x_{n+q}}) knowing (Z_{x_1}, ..., Z_{x_n}) for arbitrary points x_{n+1}, ..., x_{n+q} ∈ D. By Gaussianity, it turns out that such conditional distributions are Gaussian, and so the posterior Z | measurements is a GRF.

NB: the same applies in noisy cases when considering (Z_{x_1} + ε_1, ..., Z_{x_n} + ε_n) with Gaussian ε_i's independent of Z.

slide-82
SLIDE 82

About the estimation of covariance parameters

The previous equations were at given µ and k. In practice, however, trend and/or covariance parameters often have to be estimated. Let us consider the case of a known µ and a k that depends on a vector of "hyperparameters" ψ.

Several approaches exist for dealing with the unknown ψ: Maximum Likelihood Estimation (MLE), Cross-Validation (CV), but also Bayesian approaches involving sampling algorithms such as MCMC, SMC, etc. Let us present a brief overview of the MLE approach, probably the most implemented (although not necessarily the most robust) option.

slide-85
SLIDE 85

A brief overview of MLE

Let us denote by K(ψ) the covariance matrix of the responses, say k(X_n, X_n; ψ), under the assumption of covariance hyperparameters with value ψ. The principle of MLE is to search for a value of ψ under which it would have been the most likely to observe the responses z_n. Under the GP model assumptions, Z_{X_n} ∼ N(µ(X_n), K(ψ)). The likelihood then writes as the probability density of Z_{X_n} at point z_n, seen as a function of ψ:

L(ψ; z_n) = (2π)^{−n/2} det(K(ψ))^{−1/2} exp(−(1/2)(z_n − µ(X_n))′ K(ψ)^{−1} (z_n − µ(X_n)))

Solving MLE is typically addressed by equivalently minimizing the function

ℓ(ψ; z_n) = log(det(K(ψ))) + (z_n − µ(X_n))′ K(ψ)^{−1} (z_n − µ(X_n)).

slide-89
SLIDE 89

A brief overview of MLE

Minimizing ℓ is usually analytically intractable, and numerical optimization algorithms are employed. An elegant trick exists to estimate σ^2 ∈ (0, +∞) in case k writes as σ^2 × r, where r is a given kernel depending on parameters θ. Writing K(ψ) = σ^2 R(θ) with ψ = (σ^2, θ), one can derive the "optimal" σ^2 as a function of θ. A swift calculation indeed leads to

σ^{2⋆}(θ) = (1/n)(z_n − µ(X_n))′ R(θ)^{−1} (z_n − µ(X_n)).

Re-injecting the latter equation into ℓ, MLE then boils down to minimizing a function depending solely on θ, the so-called profile (or "concentrated") criterion:

ℓ_p(θ; z_n) = log(det(σ^{2⋆}(θ) R(θ)))
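A sketch of this concentrated criterion (an added illustration; the correlation kernel, the data and the grid search over θ are made up, and note log det(σ^{2⋆} R) = n log σ^{2⋆} + log det R):

```python
import numpy as np

def neg_profile_loglik(theta, X, z, mu, r_kernel):
    """l_p(theta; z_n) = log det(sigma2*(theta) R(theta)), up to constants."""
    n = len(z)
    R = r_kernel(X, X, theta) + 1e-10 * np.eye(n)
    resid = z - mu(X)
    sigma2_star = resid @ np.linalg.solve(R, resid) / n   # concentrated sigma^2
    _, logdetR = np.linalg.slogdet(R)
    return n * np.log(sigma2_star) + logdetR

def gauss_r(X, Y, theta):
    return np.exp(-((X[:, None] - Y[None, :]) / theta) ** 2)

X = np.linspace(0, 1, 12)
z = np.sin(3 * X) + 0.1 * np.cos(20 * X)    # made-up responses
mu = lambda t: np.zeros_like(t)
thetas = np.linspace(0.05, 1.0, 20)
vals = [neg_profile_loglik(th, X, z, mu, gauss_r) for th in thetas]
print("MLE of theta on the grid:", thetas[int(np.argmin(vals))])
```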

slide-92
SLIDE 92

Towards Universal Kriging

Another situation where an elegant concentration of ℓ is feasible is when k depends on ψ and µ depends linearly on p basis functions f_1, ..., f_p:

µ(x) = \sum_{i=1}^{p} β_i f_i(x), where β = (β_1, ..., β_p)′ is assumed unknown.

Then, setting F = (f_j(x_i))_{1≤i≤n, 1≤j≤p}, we have µ(X_n) = Fβ, and maximizing the likelihood with respect to β at fixed covariance parameters (say ψ again) leads to

β⋆(ψ) = (F′ K(ψ)^{−1} F)^{−1} F′ K(ψ)^{−1} z_n.

Plugging β⋆(ψ) into the predictor and inflating the conditional (co)variance accordingly leads to the "Universal Kriging" equations (see also the particular case of "Ordinary Kriging", where p = 1 and µ is a constant).

NB: In a Bayesian set-up where an improper uniform prior is put on β, one even recovers a GP posterior distribution.
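A sketch of this generalized-least-squares step (an added illustration; the basis functions f_1 = 1, f_2 = x, the kernel and the data are made up):

```python
import numpy as np

def beta_star(F, K, z):
    """GLS estimate beta*(psi) = (F' K^{-1} F)^{-1} F' K^{-1} z_n."""
    Ki_F = np.linalg.solve(K, F)
    Ki_z = np.linalg.solve(K, z)
    return np.linalg.solve(F.T @ Ki_F, F.T @ Ki_z)

# Example with p = 2 basis functions: f_1 = 1, f_2 = x (linear trend).
X = np.linspace(0, 1, 10)
F = np.column_stack([np.ones_like(X), X])
K = np.exp(-((X[:, None] - X[None, :]) / 0.3) ** 2) + 1e-8 * np.eye(10)
z = 1.0 + 2.0 * X + 0.05 * np.sin(25 * X)   # trend plus a small wiggle
print(beta_star(F, K, z))                   # close to (1, 2)
```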

slide-93
SLIDE 93

Selected references

- M.L. Stein (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer.
- R. Adler and J. Taylor (2007). Random Fields and Geometry. Springer.
- M. Scheuerer (2009). A Comparison of Models and Methods for Spatial Interpolation in Statistics and Numerical Analysis. PhD thesis, Georg-August-Universität Göttingen.
- O. Roustant, D. Ginsbourger and Y. Deville (2012). DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization. Journal of Statistical Software, 51(1):1-55.
- M. Schlather, A. Malinowski, P.J. Menck, M. Oesting and K. Strokorb (2015). Analysis, Simulation and Prediction of Multivariate Random Fields with Package RandomFields. Journal of Statistical Software, 63(8):1-25.

slide-94
SLIDE 94

Selected references

- B. Rajput and S. Cambanis (1972). Gaussian processes and Gaussian measures. Annals of Mathematical Statistics, 43(6):1944-1952.
- A. O'Hagan (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, Series B, 40(1):1-42.
- H. Omre and K. Halvorsen (1989). The Bayesian bridge between simple and universal kriging. Mathematical Geology, 22(7):767-786.
- M.S. Handcock and M.L. Stein (1993). A Bayesian analysis of kriging. Technometrics, 35(4):403-410.
- A.W. van der Vaart and J.H. van Zanten (2008). Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36:1435-1463.

slide-98
SLIDE 98

Some examples of p.d. kernels and GPs

Let us start with a very classical example (for d = 1): the Brownian motion W = (W_t)_{t∈D} over D = [0, +∞). Let us define W (in distribution) as follows: W_0 = 0; for any t ∈ D and h > 0, W_{t+h} − W_t ∼ N(0, h); and for any t_1 ≤ t_2 ≤ t_3 ≤ t_4 in D, the increments W_{t_4} − W_{t_3} and W_{t_2} − W_{t_1} are independent.

Such conditions define a GP; there remains to work out its expectation and covariance functions. First, for t ∈ D the first two conditions imply that

m(t) = E[W_t] = E[W_0 + W_t − W_0] = E[W_0] + E[W_t − W_0] = 0 + 0 = 0.

Second, taking two points t, t′ ∈ D (assuming, say, that t < t′), the third condition implies that W_{t′} − W_t is independent of W_t − W_0. Consequently,

k_BM(t, t′) = E[W_t W_{t′}] = E[(W_t − W_0)(W_t − W_0 + W_{t′} − W_t)]
            = E[(W_t − W_0)^2] + E[(W_t − W_0)(W_{t′} − W_t)] = t + 0 = t = min(t, t′).
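A quick Monte Carlo check of this derivation (an added sketch; grid, seed and number of paths are arbitrary):

```python
import numpy as np

# Brownian motion covariance derived above: k_BM(t, t') = min(t, t').
t = np.linspace(0.02, 1.0, 50)
K = np.minimum(t[:, None], t[None, :])
L = np.linalg.cholesky(K)

rng = np.random.default_rng(3)
W = L @ rng.standard_normal((len(t), 20000))   # 20000 paths, one per column

# Empirical check: Cov[W_s, W_t] should be close to min(s, t).
i, j = 10, 35
print(np.mean(W[i] * W[j]), "vs", min(t[i], t[j]))
```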

slide-100
SLIDE 100

Examples of covariance kernels and GPs (cont'd)

Another famous covariance function stems from the so-called "Brownian bridge" (ending in 0) B = (B_t)_{t∈[0,1]}. Let us first restrict W to D = [0, 1], obtaining a centred process with covariance k(t, t′) = min(t, t′) over [0, 1]^2. The distribution of B is then obtained by conditioning W on W_1 = 0, thus obtaining the mean m_B(t) = 0 and covariance kernel

k_BB(t, t′) = min(t, t′) − t t′ = min(t, t′)(1 − max(t, t′)).

Another covariance function of interest can be obtained by integrating W. Defining (I_t)_{t∈D} (with D = [0, +∞) again) by I_t = \int_0^t W_u du, we obtain a new centred GP with covariance

k_IBM(t, t′) = \int_0^t \int_0^{t′} min(u, v) du dv = min(t, t′)^3/3 + (max(t, t′) − min(t, t′)) min(t, t′)^2/2.

slide-101
SLIDE 101

Examples of covariance kernels and GPs (cont'd)

Without entering into much detail, let us list a few further examples of 1-dimensional GPs / associated covariance kernels (a code sketch follows below):
- For D = [0, 1] and H ∈ (0, 1), k_fBM(t, t′) = (1/2)(|t|^{2H} + |t′|^{2H} − |t − t′|^{2H}) is the covariance kernel of the fractional (or "fractal") Brownian motion with Hurst coefficient H.
- k_triang(t, t′) = (1 − |t − t′|)_+ is the "triangular" kernel over D = R.
- Defining Z_t = ζ_1 cos(ωt) + ζ_2 sin(ωt), where ζ_1, ζ_2 ∼ N(0, σ^2) independently (σ > 0) and ω > 0, one obtains k(t, t′) = cos(ω(t′ − t)).
- k_OU(t, t′) = e^{−|t−t′|} is called the exponential kernel and characterizes the Ornstein-Uhlenbeck process.
- k(t, t′) = e^{−|t−t′|^2} is the squared-exponential kernel.
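These kernels are straightforward to implement and sanity-check (an added sketch; parameter values are arbitrary, and the smallest Gram-matrix eigenvalues should be non-negative up to rounding):

```python
import numpy as np

# The 1-d kernels listed above, as functions of (t, t'):
kernels = {
    "fBM (H=0.7)": lambda t, tp, H=0.7:
        0.5 * (np.abs(t)**(2*H) + np.abs(tp)**(2*H) - np.abs(t - tp)**(2*H)),
    "triangular": lambda t, tp: np.maximum(1 - np.abs(t - tp), 0.0),
    "cosine": lambda t, tp, w=6.0: np.cos(w * (tp - t)),
    "OU / exponential": lambda t, tp: np.exp(-np.abs(t - tp)),
    "squared-exponential": lambda t, tp: np.exp(-np.abs(t - tp) ** 2),
}

t = np.linspace(0, 1, 100)
for name, k in kernels.items():
    K = k(t[:, None], t[None, :])
    print(f"{name:22s} smallest eigenvalue: {np.linalg.eigvalsh(K).min():+.2e}")
```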

slide-102
SLIDE 102

Examples of covariance kernels and GPs (cont'd)

Previous k's from real-valued one-dimensional settings can be generalized in a number of ways. Let us review a few simple examples.
- One obtains an admissible k on [0, +∞)^d × [0, +∞)^d by taking k(x, x′) = \prod_{i=1}^{d} min(x_i, x′_i), where the x_i's (resp. x′_i's) are the coordinates of x (resp. x′). The associated centred GP over [0, +∞)^d is called the "Brownian sheet".
- The exponential and Gaussian kernels can be generalized to R^d × R^d by taking k(x, x′) = exp(−||x − x′||) and k(x, x′) = exp(−||x − x′||^2), respectively, where || · || refers to the Euclidean norm on R^d.
- From a different perspective, one can define a particular complex-valued GP by taking Z_x = ζ e^{−i⟨x, ω⟩}, where ζ ∼ N(0, σ^2) (σ > 0) and ω ∈ R^d. Such a Z is centred and has (complex) covariance k(x, x′) = Cov(Z_x, Z_{x′}) = E[Z_x \overline{Z_{x′}}] = σ^2 e^{−i⟨x, ω⟩} e^{i⟨x′, ω⟩} = σ^2 e^{−i⟨x − x′, ω⟩}.

slide-104
SLIDE 104

A necessary and sufficient condition of admissibility

A common point about all kernels reviewed so far is that, for ad hoc D, if one takes any n ≥ 1, arbitrary points x_1, ..., x_n, and complex numbers a_1, ..., a_n ∈ C, the following holds:

0 ≤ Var(\sum_{i=1}^{n} a_i Z_{x_i}) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i \overline{a_j} k(x_i, x_j).

This property is indeed necessary for k to be an admissible covariance. Furthermore, it turns out that any k possessing this property is a covariance kernel (there exists some (Gaussian) random process with this k).

slide-105
SLIDE 105

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

From p.d. kernels to function approximation

For an introduction to the mathematical foundations of p.d. kernels and their use in function approximation, see notably the following references:

  • C. Berg, J.P.R. Christensen and P. Ressel (1984). Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer-Verlag.

  • A. Berlinet and C. Thomas-Agnan (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.

22 / 47

slide-107
SLIDE 107

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Choosing p.d. kernels?

In practice, choosing an adapted k for an objective f (about which limited information may be available) is both a crucial and difficult task. Typically, k is chosen among some well-known p.d. kernel families, often among “shift-invariant” (a.k.a. “stationary”) kernels, i.e. functions of x − x′. Examples: Generalized Exponential (including Gaussian) kernels, Matérn kernels, and more generally kernels obtained via the Bochner theorem.

23 / 47

slide-109
SLIDE 109

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Bochner theorem

By a slight abuse of notation, we denote stationary kernels on D = Rᵈ (d ≥ 1) by k : h ∈ Rᵈ → k(h) ∈ C.

Theorem (Bochner’s theorem). A continuous k : h ∈ Rᵈ → k(h) ∈ C is positive definite if and only if it is the Fourier transform of a finite non-negative Borel measure ν on Rᵈ, i.e.

k(h) = ν̂(h) = (2π)^{−d/2} ∫_{Rᵈ} e^{−i⟨h,ω⟩} dν(ω).

See for instance Wendland 2005 (Chap. 6) for a proof. By playing on the “spectral measure” ν one can hence generate all continuous stationary p.d. kernels on Rᵈ. For the case of a measure ν admitting a density q(ω) = dν/dλ(ω) w.r.t. the Lebesgue measure λ, k is hence characterized by its spectral density q.

24 / 47
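Bochner’s theorem can be illustrated by Monte Carlo: sampling frequencies from a spectral density and averaging cosines recovers the kernel (the idea underlying random Fourier features). A minimal R sketch for the 1-d Gaussian case, taking ν to be a probability measure so that the normalizing constant is absorbed (an illustration, not from the slides):

set.seed(1)
omega <- rnorm(10000)                               # frequencies from a N(0,1) spectral density
h <- seq(-3, 3, length.out = 200)
k_mc <- sapply(h, function(hh) mean(cos(omega * hh)))
plot(h, k_mc, type = "l")                           # Monte Carlo estimate of k
lines(h, exp(-h^2 / 2), lty = 2)                    # exact kernel exp(-h^2/2)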

slide-110
SLIDE 110

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

A few 1-dimensional examples

  • Triangular: k(h) := c(a − |h|)₊, with q(ω) ∝ c(1 − cos(aω))/(πω²).
  • Matérn ν = 3/2: k(h) ∝ α^{−3} e^{−α|h|}(1 + α|h|), with q(ω) ∝ (α² + ω²)^{−2}.
  • Gauss: k(h) ∝ e^{−(h/θ)²}, with q(ω) ∝ e^{−θ²ω²}.

  • M. Stein (Springer, 1999). Interpolation of Spatial Data: Some Theory for Kriging.

25 / 47

slide-113
SLIDE 113

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on spectral densities of Matérn kernels (d ≥ 1)

Matérn kernels can be characterized using the Handcock and Wallis parametrization (1994) mentioned in Stein (1999) (here σ = 1):

q(ω) = c(ν, ρ) (4ν/ρ² + ||ω||²)^{−(ν+d/2)}, where c(ν, ρ) = Γ(ν + d/2)(4ν)^ν / (π^{d/2} Γ(ν) ρ^{2ν}).

The corresponding (“isotropic”) p.d. kernel is

k(h) = (1/(2^{ν−1} Γ(ν))) (2ν^{1/2}||h||/ρ)^ν Kν(2ν^{1/2}||h||/ρ),

where Kν is a modified Bessel function of the third kind. More tractable expressions can be obtained for ν = 1/2, 3/2, 5/2, . . . See Stein (1999) for more on this class and Wendland (2005) –chap. 5– for more on Bessel functions.

26 / 47
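The general Matérn form is directly computable in R through besselK; a minimal sketch with the (ν, ρ) parametrization above (the matern helper is ours):

matern <- function(h, nu = 1.5, rho = 1) {
  r <- 2 * sqrt(nu) * abs(h) / rho
  out <- rep(1, length(r))                          # k(0) = 1 by continuity
  pos <- r > 0
  out[pos] <- r[pos]^nu * besselK(r[pos], nu) / (2^(nu - 1) * gamma(nu))
  out
}
h <- seq(0, 3, length.out = 100)
plot(h, matern(h, nu = 1.5), type = "l")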

slide-115
SLIDE 115

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on isotropic p.d. kernels in Rd

The Matérn (class of) kernels considered previously are one among many isotropic p.d. kernels on Rᵈ, i.e. p.d. kernels that write as k(x, x′) = κ(r) where r = ||x − x′||, and κ : R₊ → R is also often (by a slight abuse of language) referred to as positive definite. Such k’s are also called radial. Definition (cf. Wendland 2005): a function Φ : Rᵈ → R is said to be radial if there exists φ : [0, +∞) → R such that Φ(h) = φ(||h||₂) for all h ∈ Rᵈ. A number of κ’s leading to radial p.d. kernels in Rᵈ do exist and have been studied by generations of mathematicians. Some depend on d, some do not!

27 / 47

slide-121
SLIDE 121

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on isotropic p.d. kernels in Rd

Let us review a few examples.

  • κ(r) = e^{−rᵖ} (0 < p ≤ 2): “generalized exponential”.
  • κ(r) = (c² + r²)^{−β} (c, β > 0): “inverse multiquadrics”.
  • κ(r) = (1 − r)₊^ℓ, where (x)₊ = max(0, x): “truncated power kernel”.

While the first two kernels are (strictly) positive definite for all d ≥ 1, for the third one needs to restrict to ℓ ≥ ⌊d/2⌋ + 1 to get this property. Is it possible to characterize radial p.d. functions defined in terms of one κ valid in any dimension? Yes, thanks to completely monotone functions!

28 / 47

slide-124
SLIDE 124

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on isotropic p.d. kernels in Rd

Definition (cf. Wendland 2005): a function φ is called completely monotone on (0, +∞) if it satisfies φ ∈ C^∞(0, +∞) and (−1)^ℓ φ^{(ℓ)}(r) ≥ 0 for all ℓ ∈ N and all r > 0. The function φ is called completely monotone on [0, +∞) if it is in addition in C[0, +∞).

Theorem (Schoenberg; cf. Wendland 2005). A function φ : [0, +∞) → R is completely monotone on [0, +∞) if and only if Φ := φ(|| · ||₂²) is positive definite on every Rᵈ.

Application: the inverse multiquadrics is p.d. in any dimension for c, β > 0.

29 / 47
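Schoenberg’s theorem can be probed numerically: since φ(r) = (c² + r)^{−β} is completely monotone on [0, +∞), inverse multiquadrics Gram matrices should stay positive definite whatever the dimension. A sketch reusing the is_psd helper defined earlier (illustrative only):

phi <- function(r2, c = 1, beta = 1) (c^2 + r2)^(-beta)   # applied to squared distances
for (d in c(1, 5, 20)) {
  X <- matrix(runif(40 * d), nrow = 40)             # 40 random points in dimension d
  D2 <- as.matrix(dist(X))^2                        # squared Euclidean distances
  print(is_psd(phi(D2)))                            # expected: TRUE for every d
}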

slide-125
SLIDE 125

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Nota Bene: geometric anisotropy

Starting from any isotropic p.d. kernel, it is always possible to generalize it and obtain (geometric) anisotropic p.d. kernels through orthogonal transformations and dilatations, by defining

k(x, x′) = κ( ((x − x′)ᵀ Σ (x − x′))^{1/2} ),

where Σ is a real-valued symmetric (strictly!) positive definite matrix.

30 / 47
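In code, geometric anisotropy amounts to a Mahalanobis-type distance inside an isotropic profile; a minimal R sketch (kappa and Sigma are arbitrary illustrative choices):

kappa <- function(r) exp(-r)                        # any valid radial profile
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)            # symmetric positive definite
kaniso <- function(x, y) {
  h <- x - y
  kappa(sqrt(drop(t(h) %*% Sigma %*% h)))
}
kaniso(c(0, 0), c(1, 1))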

slide-127
SLIDE 127

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Other ways of defining p.d. kernels: overview

Kernels that write as functions of ⟨x, x′⟩ (such as the previously presented radial p.d. kernels, once restricted to the sphere) are also called zonal kernels in G. E. Fasshauer’s review paper below, where examples of zonal kernels are discussed:

  • Fasshauer, G. E. (2011). Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4:21-63.

The following paper also includes alternative classes of p.d. kernels:

  • T. Hofmann, B. Schölkopf and A.J. Smola (2008). Kernel methods in machine learning. The Annals of Statistics, Vol. 36, No. 3, 1171-1220.

Overall, the notion of scalar product plays a crucial role in p.d. kernels.

31 / 47

slide-128
SLIDE 128

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Other ways of defining p.d. kernels: Mercer theorem

For continuous p.d. kernels –say real-valued, defined on a compact set D ⊂ Rᵈ– a fruitful approach is to consider the following operator Tk on L²(D):

g → Tk(g)(·) = ∫_D g(x′) k(·, x′) dλ(x′),

where λ refers to the Lebesgue measure (generalizations do exist) on Rᵈ. Under our continuity and compactness conditions, there exists an orthonormal system (ϕj(·))j∈N* of L²(D) and non-negative numbers (λj)j∈N* such that Tk(ϕj) = λj ϕj for all j ∈ N*, and this leads to the Mercer decomposition (1909):

k(x, x′) = Σ_{j=1}^{+∞} λj ϕj(x) ϕj(x′).

See Adler & Taylor, Steinwart and more for detail on the convergence, etc.

32 / 47

slide-130
SLIDE 130

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Basic principle of the Karhunen-Lo` eve expansion

Assuming D compact and k continuous, the Mercer theorem ensures the existence of an orthonormal basis (ϕj)j≥1 of L²(D) such that

k(x, x′) = Σ_{j=1}^{+∞} λj ϕj(x) ϕj(x′).

The KL expansion of a GRF Z then consists in representing it under the form

Zx = Σ_{j=1}^{+∞} √λj ζj ϕj(x)

where the ζj’s are i.i.d. standard Gaussian random variables.

33 / 47

slide-132
SLIDE 132

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Deriving the eigenfunctions: a Fredholm problem

Given a GRF Z of covariance kernel k, finding the basis functions ϕj (j ≥ 1) is the key to the KL decomposition of Z. This is done by solving the following integral equation, called a Fredholm problem:

∫_D k(x, x′) g(x) dµ(x) = λ g(x′).

When possible, the latter is solved analytically by using calculus.

34 / 47
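When no closed form is at hand, the Fredholm problem can be discretized: replace the integral by a quadrature rule and solve the resulting matrix eigenproblem (the Nyström method). A minimal R sketch on a uniform grid of [0, 1], checked against the Brownian motion eigenvalues derived on the next slide (an illustration, not from the slides):

m <- 200
t <- seq(0, 1, length.out = m)
K <- outer(t, t, pmin)                              # Brownian motion kernel min(t, t')
eig <- eigen(K / m, symmetric = TRUE)               # 1/m: quadrature weight of the grid
eig$values[1:3]                                     # approximate leading eigenvalues
1 / (pi * (1:3 - 0.5))^2                            # exact: 1 / (pi^2 (j - 1/2)^2)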

slide-134
SLIDE 134

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Example: KL expansion of the Brownian Motion

For the covariance kernel of the BM, k(t, t′) = min(t, t′), the eigenvalues and eigenfunctions are solutions to the following Fredholm problem:

∫_0^1 min(t, t′) ϕ(t) dt = λ ϕ(t′).

It can be shown by solving a differential equation that the solutions are

λj = 1/(π²(j − 1/2)²) and ϕj(t) = √2 sin((j − 1/2) π t).

  • R.J. Adler and J.E. Taylor (Springer, 2007). Random Fields and Geometry.

35 / 47

slide-135
SLIDE 135

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Example: KL expansion of the Brownian Motion

Let us simulate the Brownian Motion using a truncated KL expansion:

m <- 10000
t <- seq(0, 1, length.out = m)
v <- function(t, j) sqrt(2) * sin((j - 0.5) * pi * t)  # eigenfunctions
lambda <- function(j) 1 / (pi * (j - 0.5))^2           # eigenvalues
q <- 1000                                              # truncation order
KL <- rep(0, m)
for (j in seq(1, q)) {
  KL <- KL + sqrt(lambda(j)) * rnorm(1) * v(t, j)
}

36 / 47

slide-136
SLIDE 136

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Example: KL expansion of the Brownian Motion

Here are two simulation results based on the truncated KL expansion of the Brownian Motion, respectively with q = 50 and q = 1000:

[Figure: two approximate BM sample paths simulated by truncated KL expansion, with q = 50 (left) and q = 1000 (right); t on the horizontal axis, z on the vertical axis.]

The simulations are not exact, but can be performed over an arbitrarily fine set of points. The ζj’s can be stored, and the corresponding path evaluated at a new point later.

37 / 47
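The last remark is easy to make concrete: once the ζj’s are stored, the very same truncated path can be re-evaluated at any new points. A short R sketch (illustrative, hard-coding the BM eigenpairs):

q <- 1000
zeta <- rnorm(q)                                    # store the KL coefficients once
path <- function(t) {                               # evaluate the same path anywhere
  j <- seq_len(q)
  colSums((1 / (pi * (j - 0.5))) * sqrt(2) * sin(outer((j - 0.5) * pi, t)) * zeta)
}
path(c(0.25, 0.5, 0.999))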

slide-137
SLIDE 137

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

A few selected references

  • M.L. Stein (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer.
  • M. Scheuerer (2009). A Comparison of Models and Methods for Spatial Interpolation in Statistics and Numerical Analysis. PhD thesis, Georg-August-Universität Göttingen.
  • C.E. Rasmussen and C.K.I. Williams (2006). Gaussian Processes for Machine Learning. MIT Press.
  • R. Adler and J. Taylor (2007). Random Fields and Geometry. Springer.
  • I. Steinwart (2017). Convergence Types and Rates in Generic Karhunen-Loève Expansions with Applications to Sample Path Properties. arXiv:1403.1040v3 [math.PR].

38 / 47

slide-138
SLIDE 138

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Outline

3. About GPs and their use in function modelling

4. Examples of GPs and generalities on p.d. kernels

5. Miscellanea

39 / 47

slide-140
SLIDE 140

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Further properties of p.d. kernels ( back )

Further general properties can be derived for p.d. kernels, including:

  • Products of p.d. kernels are p.d. kernels.
  • If σ : D → D is a bijection, k(x, x′) is a p.d. kernel if and only if k(σ(x), σ(x′)) is a p.d. kernel.
  • For all x, x′ ∈ D, |k(x, x′)| ≤ √k(x, x) √k(x′, x′).
  • The function dk : (x, x′) ∈ D² → dk(x, x′) = √(k(x, x) + k(x′, x′) − 2ℜ(k(x, x′))) defines a (pseudo-)distance on D (see the sketch below).

Note also that positive definiteness can be generalized as follows: k : (x, x′) ∈ D² → C is called conditionally positive definite (c.p.d.) if it is hermitian and Σ_{i=1}^{n} Σ_{j=1}^{n} ai āj k(xi, xj) ∈ [0, +∞) for all n ≥ 1, x1, . . . , xn ∈ D and a1, . . . , an ∈ C s.t. Σ_{i=1}^{n} ai = 0. Conditional negative definiteness (c.n.d.) is defined similarly with (−∞, 0].

40 / 47
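The kernel pseudo-distance dk mentioned in the list above is one line of code for real-valued kernels; a minimal sketch:

dk <- function(x, y, k) sqrt(k(x, x) + k(y, y) - 2 * k(x, y))
dk(0.2, 0.7, pmin)                                  # distance under the Brownian motion kernel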

slide-143
SLIDE 143

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

RKHS

Reproducing Kernel Hilbert Spaces (RKHS) offer a very convenient framework for function approximation. Definition: a Hilbert space of functions D → C, (H, ⟨·, ·⟩_H), is a RKHS if for all x ∈ D the evaluation functional ex : f ∈ H → f(x) ∈ C is continuous. From the so-called Riesz representation theorem, for all x ∈ D there exists an element of H, denoted here kx, such that f(x) = ⟨f, kx⟩_H. From such a RKHS and the collection of Riesz evaluation representers kx, the “kernel” k : D × D → C associated with H can be defined as follows: k(x, x′) = ⟨kx′, kx⟩_H. Easy to check: k is a p.d. kernel. Less easy to check: any p.d. kernel defines a unique RKHS → the Moore-Aronszajn theorem (published 1950 :-)

41 / 47

slide-146
SLIDE 146

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Representing RKHSs based on the Mercer theorem

For simplicity, let us consider here a RKHS Hk associated with a real-valued Mercer kernel k. Hk can be represented more concretely as follows:

Hk = { f = Σ_{j=1}^{+∞} αj √λj φj, α ∈ R^{N} : Σ_{j=1}^{+∞} αj² < ∞ },

with ⟨ Σ_{j=1}^{+∞} αj √λj φj , Σ_{j=1}^{+∞} βj √λj φj ⟩_H := Σ_{j=1}^{+∞} αj βj.

Comparing this with the K-L expansion of a GP with kernel k, we find that in the case of an infinite number of non-zero eigenvalues, the paths of Z are not in Hk with probability 1 (Parzen-Kallianpur-LePage theorem, as discussed in Lukić and Beder 2001). However, it can be shown that in general GP paths belong to bigger RKHSs (see, e.g., Steinwart 2017 for more detail).

42 / 47

slide-148
SLIDE 148

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Some properties of GRFs and kernels

Back to centred Z for simplicity, one can define a (pseudo-)metric dZ on D by

dZ²(x, x′) = E[(Zx − Zx′)²] = k(x, x) + k(x′, x′) − 2k(x, x′).

A number of properties of Z are driven by dZ. For instance,

Theorem (Sufficient condition for the continuity of GRF paths). Let (Zx)x∈D be a separable Gaussian random field on a compact index set D ⊂ Rᵈ. If for some 0 < C < ∞ and δ, η > 0,

dZ²(x, x′) ≤ C / |log ||x − x′|||^{1+δ}

for all x, x′ ∈ D with ||x − x′|| < η, then the paths of Z are almost surely continuous and bounded.

See, e.g., M. Scheuerer’s PhD thesis (2009) for details.

43 / 47

slide-150
SLIDE 150

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Some properties of GRFs and kernels

Starting from p.d. kernels notably obtained via Bochner’s theorem, an appealing approach to enrich them is by operations conserving symmetric positive definiteness. Classical operations of that kind notably encompass:

  • Non-negative linear combinations of p.d. kernels
  • Products and tensor products of p.d. kernels
  • Multiplication by σ(x)σ(x′) for σ : x ∈ D → [0, +∞)
  • Deformations/warpings: k(g(x), g(x′)) for g : D → D
  • Convolutions, etc.

See, e.g., the section “making new kernels from old” of the book Gaussian Processes for Machine Learning (cited earlier); a small sketch of such combinations follows below.

44 / 47
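A minimal R sketch of the combinations just listed (all names and parameter values are illustrative):

k1 <- function(x, y) exp(-(x - y)^2)                # squared exponential
k2 <- function(x, y) pmin(x, y)                     # Brownian motion
ksum <- function(x, y) 2 * k1(x, y) + k2(x, y)      # non-negative combination
kprod <- function(x, y) k1(x, y) * k2(x, y)         # product
g <- function(x) x^2                                # a warping of [0, 1]
kwarp <- function(x, y) k1(g(x), g(y))              # deformation / warping
sigma <- function(x) 1 + x                          # non-negative scaling function
kscal <- function(x, y) sigma(x) * k1(x, y) * sigma(y)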

slide-151
SLIDE 151

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

The Branin-Hoo function

The rescaled Branin-Hoo function f is defined over [0, 1]² by f(x1, x2) = fBH(15x1 − 5, 15x2), where

fBH : (x1, x2) ∈ [−5, 10] × [0, 15] → a(x2 − b x1² + c x1 − r)² + s(1 − t) cos(x1) + s,

with a = 1, b = 5/(4π²), c = 5/π, r = 6, s = 10 and t = 1/(8π). ( back )

45 / 47
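The rescaled Branin-Hoo function transcribes directly into R (a sketch of the formula above):

branin <- function(x1, x2) {
  a <- 1; b <- 5 / (4 * pi^2); c <- 5 / pi; r <- 6; s <- 10; t <- 1 / (8 * pi)
  u <- 15 * x1 - 5                                  # rescale to [-5, 10]
  v <- 15 * x2                                      # rescale to [0, 15]
  a * (v - b * u^2 + c * u - r)^2 + s * (1 - t) * cos(u) + s
}
branin(0.5, 0.5)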

slide-155
SLIDE 155

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Ordinary Kriging Equations –for completeness!–

Assume Z has a covariance kernel k and constant mean µ ∈ R. Then

mn(x) = k(Xn, x)ᵀ k(Xn, Xn)⁻¹ zn + µn (1 − k(Xn, x)ᵀ k(Xn, Xn)⁻¹ ✶n)

kn(x, x′) = k(x, x′) − k(Xn, x)ᵀ k(Xn, Xn)⁻¹ k(Xn, x′) + (1 − ✶nᵀ k(Xn, Xn)⁻¹ k(Xn, x)) (1 − ✶nᵀ k(Xn, Xn)⁻¹ k(Xn, x′)) / (✶nᵀ k(Xn, Xn)⁻¹ ✶n)

where µn = ✶nᵀ k(Xn, Xn)⁻¹ zn / (✶nᵀ k(Xn, Xn)⁻¹ ✶n).

Under standard conditions, mn and kn are Z’s conditional mean and covariance and L(Z | ZXn = zn) = GRF(mn(·), kn(·, ·′)). ( back )

46 / 47
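These equations are a few lines of linear algebra; below is a minimal R sketch of the ordinary kriging mean (illustrative: solve() stands in for the matrix inverse, and the design, data and kernel are toy choices):

ok_mean <- function(x, Xn, zn, k) {
  K <- outer(Xn, Xn, k)
  Ki <- solve(K)
  one <- rep(1, length(Xn))
  mu_n <- drop(one %*% Ki %*% zn) / drop(one %*% Ki %*% one)
  sapply(x, function(xx) {
    kx <- k(Xn, xx)                                 # vector k(Xn, x)
    drop(kx %*% Ki %*% zn) + mu_n * (1 - drop(kx %*% Ki %*% one))
  })
}
Xn <- c(0.1, 0.4, 0.8); zn <- sin(2 * pi * Xn)
ok_mean(c(0.2, 0.6), Xn, zn, function(t, s) exp(-(t - s)^2 / 0.1))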

slide-158
SLIDE 158

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Heterogeneously noisy OK Equations

mn(x) = µn + k(Xn, x)ᵀ (k(Xn, Xn) + ∆n)⁻¹ (zn − µn ✶n)

kn(x, x′) = k(x, x′) − k(Xn, x)ᵀ (k(Xn, Xn) + ∆n)⁻¹ k(Xn, x′) + (1 − ✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ k(Xn, x)) (1 − ✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ k(Xn, x′)) / (✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ ✶n)

where µn = ✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ zn / (✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ ✶n).

Under usual assumptions, and if Z and the εi’s are independent:

L(Z | An) = N(mn(·), kn(·, ·′)). ( back )

47 / 47
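The noisy case only changes the matrix being inverted; a sketch extending the ok_mean helper above, where noise_var is a hypothetical vector of known noise variances (the diagonal of ∆n):

ok_mean_noisy <- function(x, Xn, zn, k, noise_var) {
  K <- outer(Xn, Xn, k) + diag(noise_var)           # k(Xn, Xn) + Delta_n
  Ki <- solve(K)
  one <- rep(1, length(Xn))
  mu_n <- drop(one %*% Ki %*% zn) / drop(one %*% Ki %*% one)
  sapply(x, function(xx) mu_n + drop(k(Xn, xx) %*% Ki %*% (zn - mu_n * one)))
}
Xn <- c(0.1, 0.4, 0.8); zn <- sin(2 * pi * Xn)
ok_mean_noisy(c(0.2, 0.6), Xn, zn, function(t, s) exp(-(t - s)^2 / 0.1),
              noise_var = rep(0.01, 3))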