slide-1
SLIDE 1

Kernels for deterministic and stochastic approximations of (invariant) functions

David Ginsbourger

1 Idiap Research Institute, UQOD group, Martigny, Switzerland, and 2 IMSV, Mathematics and Statistics Department, University of Bern, Switzerland

Acknowledgements: a number of co-authors, notably appearing via citations!

Advances in Kernel Methods workshop, Gaussian Process and Uncertainty Quantification Summer School, The University of Sheffield, September 6th 2018

slide-2
SLIDE 2

Outline

1. Introduction: p.d. kernels, from analysis to GPs and back
2. On kernels and invariances: contributions from second order to Gaussian; numerical applications and discussion

slide-5
SLIDE 5

Kernel methods and invariances/degeneracies

Kernels are a crucial ingredient in a number of mathematical and statistical methods for function approximation, data classification and beyond: Support Vector Machines, Gaussian Process modelling, regularization in Reproducing Kernel Hilbert Spaces, kernel Principal Component Analysis, embeddings of measures in RKHS, etc.

The implementation of any of these methods requires a valid kernel k. We focus on the choice of k in the function approximation framework, and in particular on invariance/degeneracy properties that can be driven by k.

slide-8
SLIDE 8

What are (complex- and real-valued) p.d. kernels?

Let D be a set and k : D × D → C. k is called a positive definite kernel when

\sum_{i=1}^{n} \sum_{j=1}^{n} a_i \overline{a_j} k(x_i, x_j) ∈ [0, +∞)

for all n ≥ 1, a_1, ..., a_n ∈ C, and x_1, ..., x_n ∈ D.

The following facts follow directly from this definition:
- k(x, x) ∈ [0, +∞) for all x ∈ D
- k(x′, x) = \overline{k(x, x′)} for all x, x′ ∈ D (k is Hermitian)
- Non-negative combinations and pointwise limits of p.d. kernels are p.d.

NB: k : D × D → R is p.d. when both \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) ∈ [0, +∞) for all n ≥ 1, a_1, ..., a_n ∈ R and x_1, ..., x_n ∈ D, and k is symmetric.
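This definition can be probed numerically: positive definiteness of k says exactly that every Gram matrix (k(x_i, x_j))_{i,j} is (Hermitian) positive semi-definite. Below is a minimal added sketch in Python/numpy (real-valued case; the kernel and the points are arbitrary illustrations, not from the slides):

```python
import numpy as np

def gauss_kernel(x, y, theta=1.0):
    """Squared-exponential kernel k(x, y) = exp(-((x - y)/theta)^2)."""
    return np.exp(-((x - y) / theta) ** 2)

# Gram matrix K = (k(x_i, x_j)) on a few arbitrary points.
x = np.array([0.0, 0.3, 0.7, 1.0, 2.5])
K = gauss_kernel(x[:, None], x[None, :])

# k is p.d. iff every such Gram matrix is symmetric positive semi-definite,
# i.e. sum_i sum_j a_i a_j k(x_i, x_j) >= 0 for all real coefficients a.
eigvals = np.linalg.eigvalsh(K)
print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue (>= 0 up to rounding):", eigvals.min())
```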

slide-9
SLIDE 9

Considered kernel methods for function approximation

Here we focus on two classes of kernel methods for the approximation of functions based on observational/evaluation data:
- Gaussian Process (GP) modelling/interpolation/regression
- Interpolation/regularization in Reproducing Kernel Hilbert Spaces

Typical settings of interest are those of an objective function f : D → R (e.g. with D ⊂ R^d, d ≥ 1) that one wishes to approximate relying on a limited number n ≥ 1 of evaluations at points x_i ∈ D (1 ≤ i ≤ n).

slide-11
SLIDE 11

About Gaussian Process modelling

GP modelling basically consists in postulating that f is a realization of a real-valued Gaussian random field Z = (Z_x)_{x∈D} and in doing inference on f using the conditional distribution of Z given the available evaluation results.

As we know, in the Gaussian case the mean and covariance functions (say m and k, here) characterize Z's distribution, so choosing them is crucial.

slide-14
SLIDE 14

Reminder: GP/Kriging equations

The GP/Kriging prediction amounts to calculating the conditional expectation and covariance of Z_x knowing Z_{X_n} = z_n, with z_n = (f(x_1), ..., f(x_n))′:

m_n(x) = E[Z_x | Z_{X_n} = z_n] = m(x) + k(x, X_n) k(X_n, X_n)^{-1} (z_n − m(X_n))
k_n(x, x′) = Cov[Z_x, Z_{x′} | Z_{X_n} = z_n] = k(x, x′) − k(x, X_n) k(X_n, X_n)^{-1} k(X_n, x′),

where k(X_n, X_n) = (k(x_i, x_j))_{1 ≤ i, j ≤ n} and k(X_n, x) = (k(x_1, x), ..., k(x_n, x))′.

For given m and k (with possible generalizations to m known up to linear combination coefficients, cf. Universal Kriging with improper uniform prior), Z knowing Z_{X_n} = z_n is a GP with mean m_n and covariance k_n.
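These equations translate directly into a few lines of linear algebra. Below is a minimal added numpy sketch (not the code behind the talk's figures; the Matérn 5/2 kernel, test data and jitter level are illustrative choices):

```python
import numpy as np

def matern52(x, y, sigma=1.0, ell=0.4):
    """Matern 5/2 kernel, in the parametrisation appearing later in the appendix."""
    r = np.abs(x[:, None] - y[None, :])
    return sigma**2 * (1 + r/ell + r**2 / (3 * ell**2)) * np.exp(-r / ell)

def gp_posterior(x_new, X, z, kern, m=lambda t: np.zeros_like(t), jitter=1e-10):
    """Conditional mean m_n and covariance k_n from the Kriging equations."""
    K = kern(X, X) + jitter * np.eye(len(X))    # k(X_n, X_n), numerically stabilized
    Ks = kern(x_new, X)                         # k(x, X_n)
    alpha = np.linalg.solve(K, z - m(X))        # k(X_n, X_n)^{-1} (z_n - m(X_n))
    mean = m(x_new) + Ks @ alpha
    cov = kern(x_new, x_new) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.array([0.1, 0.4, 0.6, 0.9])
z = np.sin(2 * np.pi * X)                       # evaluations of a test f
x_new = np.linspace(0, 1, 5)
mean, cov = gp_posterior(x_new, X, z, matern52)
print(mean)
print(np.sqrt(np.clip(np.diag(cov), 0, None)))  # posterior standard deviation
```

The jitter term is only a standard numerical safeguard for the linear solves; it is not part of the Kriging equations themselves.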

slide-15
SLIDE 15

A classical test function: Branin-Hoo

slide-16
SLIDE 16

GP interpolation (Kriging) of the Branin-Hoo function

The covariance is here a stationary anisotropic Matérn kernel (ν = 5/2) with scale σ and range parameters (θ_1, θ_2) estimated by Maximum Likelihood.

slide-17
SLIDE 17

Conditional simulations of the Branin-Hoo function

slide-18
SLIDE 18

A detour through deterministic function approximation

Approximating f based on evaluations at n points is ill-posed without further assumptions on f. Also in deterministic settings, p.d. kernels play a key role.

- G. Kimeldorf and G. Wahba (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82-95.
- H. Wendland (2005). Scattered Data Approximation. Cambridge University Press.
- G. E. Fasshauer (2011). Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4:21-63.
- M. Scheuerer, R. Schaback and M. Schlather (2013). Interpolation of spatial data - a stochastic or a deterministic problem? European Journal of Applied Mathematics, 24(4):601-629.

slide-20
SLIDE 20

Optimal approximation in RKHSs

Theorem (generalization of Kimeldorf and Wahba's 1971 "representer theorem" by Schölkopf, Herbrich and Smola): Given evaluation results (x_1, z_1), ..., (x_n, z_n) ∈ D × R, an arbitrary cost function c : (D × R^2)^n → R ∪ {∞}, and a strictly increasing function p on [0, ∞), any m_n ∈ H_k (the RKHS with kernel k) minimizing

g ∈ H_k → c((x_1, z_1, g(x_1)), ..., (x_n, z_n, g(x_n))) + p(||g||_{H_k})

admits a representation of the form m_n(·) = \sum_{i=1}^{n} α_i k(·, x_i), with α_1, ..., α_n ∈ R. (Notes: noiseless or noisy z_i's; real-valued k here.)

- B. Schölkopf, R. Herbrich and A.J. Smola (2001). A Generalized Representer Theorem. Computational Learning Theory, Lecture Notes in Computer Science, 2111:416-426.

slide-24
SLIDE 24

In RKHS regularization and GP models with known (e.g., constant) mean, prior assumptions on f are implicitly accounted for through the choice of k.

Classical notions of invariance for k:
- 2nd order stationarity (k invariant w.r.t. simultaneous translations of x and x′)
- Isotropy (k invariant w.r.t. simultaneous rigid motions of x and x′)

We rather investigate some functional properties driven by k, with a main focus on the stochastic case (+ some links to RKHSs). This talk follows to a large extent the paper below and references therein:

- D. G., O. Roustant and N. Durrande (2016). On degeneracy and invariances of random fields paths with applications in Gaussian Process modelling. Journal of Statistical Planning and Inference, 170:117-128.

slide-26
SLIDE 26

Simulating a GP with group-invariant paths

[Figure: contour plot of a simulated GP sample path on [−1, 1]^2 with group-invariant level sets.]
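The construction behind such figures can be sketched as follows (an added Python/numpy illustration, not the code behind the original figure): averaging a base kernel over a finite group in each argument yields an argumentwise invariant p.d. kernel, and the corresponding centred GP has invariant paths (cf. the group-invariance references a few slides below). The group, base kernel and grid here are illustrative assumptions; the group is {id, s} with s(x_1, x_2) = (x_2, x_1).

```python
import numpy as np

def k_base(X, Y, theta=0.5):
    """Base squared-exponential kernel on R^2 (arbitrary illustrative choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / theta**2)

# Finite group G = {id, s} with s(x1, x2) = (x2, x1). Double averaging gives
# an argumentwise invariant kernel: k_G(x, x') = (1/|G|^2) sum_{g,g'} k(gx, g'x').
def orbit(X):
    return [X, X[:, ::-1]]

def k_inv(X, Y):
    return np.mean([k_base(gX, gY) for gX in orbit(X) for gY in orbit(Y)], axis=0)

# Simulate a centred GP path with kernel k_inv on a grid of [-1, 1]^2.
# An eigendecomposition is used instead of Cholesky because invariance
# makes the covariance matrix rank-deficient.
g = np.linspace(-1, 1, 25)
G1, G2 = np.meshgrid(g, g)
X = np.column_stack([G1.ravel(), G2.ravel()])
w, V = np.linalg.eigh(k_inv(X, X))
rng = np.random.default_rng(0)
path = (V @ (np.sqrt(np.clip(w, 0, None)) * rng.standard_normal(len(w)))).reshape(25, 25)

# Every realization satisfies f(x1, x2) = f(x2, x1), up to rounding:
print(path[3, 20], path[20, 3])
```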

slide-27
SLIDE 27

Towards invariant prediction: set-up

[Figure: GP path to be predicted and design points, on [−1, 1]^2.]

slide-28
SLIDE 28

Predicting with an (argumentwise) invariant kernel

[Figure: invariant GP path predicted with an adapted kernel.]

slide-29
SLIDE 29

Predicting with an (argumentwise) invariant kernel

[Figure: invariant GP prediction, posterior standard deviation.]

slide-30
SLIDE 30

Invariant conditional simulations

[Figure: conditional simulations of a GP with group-invariant paths.]
slide-31
SLIDE 31

Some refs on group-invariance in kernel methods

- B. Haasdonk and H. Burkhardt (2007). Invariant kernels for pattern analysis and machine learning. Machine Learning, 68:35-61.
- D. G., X. Bay, O. Roustant and L. Carraro (2012). Argumentwise invariant kernels for the approximation of invariant functions. Annales de la Faculté des Sciences de Toulouse, 21(3):501-527.
- K. Hansen et al. (2013). Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies. Journal of Chemical Theory and Computation, 9:3404-3419.
- Y. Mroueh, S. Voinea and T. Poggio (2015). Learning with Group Invariant Features: A Kernel Perspective. Advances in Neural Information Processing Systems, 1558-1566.

slide-32
SLIDE 32

Proposition (DG et al. 2016). Let Z be a measurable random field with paths (a.s.) in some function space F and T : F → F be a linear operator such that for all x ∈ D there exists a signed measure ν_x on D satisfying T(g)(x) = \int g(u) dν_x(u). Assume further that

sup_{x∈D} \int_D \sqrt{k(u, u) + m(u)^2} d|ν_x|(u) < +∞.

Then the following are equivalent:
a) ∀x ∈ D, P(T(Z)_x = 0) = 1 ("T(Z) = 0 up to a modification")
b) ∀x ∈ D, T(m)(x) = 0 and (T ⊗ T)(k)(x, x) = 0.
Assuming further that T(Z) is separable, a) and b) are also equivalent to
c) P(T(Z) = 0) = P(∀x ∈ D, T(Z)_x = 0) = 1 ("T(Z) = 0 a.s.").
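Condition b) can be checked numerically on concrete kernels. In the added sketch below, an illustrative operator is assumed: T(g)(x) = g(x) − g(s(x)) with s the diagonal reflection of R^2, i.e. ν_x = δ_x − δ_{s(x)}, so that (T ⊗ T)(k)(x, x) = k(x, x) − k(x, s(x)) − k(s(x), x) + k(s(x), s(x)). It vanishes for a double-symmetrized kernel but not for a generic one.

```python
import numpy as np

def k_base(x, y, theta=0.5):
    """Generic (non-invariant) squared-exponential kernel on R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.exp(-np.sum((x - y) ** 2) / theta**2)

s = lambda x: (x[1], x[0])          # reflection about the diagonal

def k_sym(x, y):
    """Argumentwise invariant kernel via double symmetrisation over {id, s}."""
    return 0.25 * (k_base(x, y) + k_base(s(x), y)
                   + k_base(x, s(y)) + k_base(s(x), s(y)))

# With nu_x = delta_x - delta_{s(x)}:
# (T x T)(k)(x, x) = k(x,x) - k(x,s(x)) - k(s(x),x) + k(s(x),s(x)).
def TT_k(k, x):
    return k(x, x) - k(x, s(x)) - k(s(x), x) + k(s(x), s(x))

x = (0.3, -0.8)
print(TT_k(k_base, x))   # > 0: paths of the base GP are not s-invariant
print(TT_k(k_sym, x))    # 0 up to rounding: b) holds, so T(Z) = 0 a.s.
```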

slide-35
SLIDE 35

Another invariance: random fields with additive paths

Let D = \prod_{i=1}^{d} D_i where D_i ⊂ R. f ∈ R^D is called additive when there exist f_i ∈ R^{D_i} (1 ≤ i ≤ d) such that f(x) = \sum_{i=1}^{d} f_i(x_i) (x = (x_1, ..., x_d) ∈ D).

GP models possessing additive paths (with k(x, x′) = \sum_{i=1}^{d} k_i(x_i, x′_i)) have been considered in Nicolas Durrande's Ph.D. thesis (2011); a code sketch of such a model follows below.

[Figure: two perspective plots of additive GP sample paths over (x_1, x_2).]
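A minimal added sketch of such an additive GP model (the one-dimensional kernels k_i and the grid are illustrative assumptions), checking additivity of simulated paths via second-order mixed differences:

```python
import numpy as np

def k1(s, t, ell=0.3):
    """1-d squared-exponential kernel (arbitrary illustrative choice for each k_i)."""
    return np.exp(-((s - t) / ell) ** 2)

def k_add(X, Y):
    """Additive kernel k(x, x') = sum_i k_i(x_i, x'_i)."""
    return sum(k1(X[:, None, i], Y[None, :, i]) for i in range(X.shape[1]))

# Sample a centred GP with kernel k_add on a 20 x 20 grid of [0, 1]^2
# (eigendecomposition: additive kernels give rank-deficient Gram matrices).
g = np.linspace(0, 1, 20)
G1, G2 = np.meshgrid(g, g)
X = np.column_stack([G1.ravel(), G2.ravel()])
w, V = np.linalg.eigh(k_add(X, X))
rng = np.random.default_rng(1)
f = (V @ (np.sqrt(np.clip(w, 0, None)) * rng.standard_normal(len(w)))).reshape(20, 20)

# Paths are additive, f(x) = f_1(x_1) + f_2(x_2), so mixed differences
# f(a,b) - f(a,b') - f(a',b) + f(a',b') vanish (up to rounding):
print(f[3, 5] - f[3, 15] - f[12, 5] + f[12, 15])
```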

slide-36
SLIDE 36

A few selected references related to additive kernels

- N. Durrande (2011). Étude de classes de noyaux adaptés à la simplification et à l'interprétation des modèles d'approximation. Une approche fonctionnelle et probabiliste. PhD thesis, École des Mines de Saint-Étienne.
- D. Duvenaud, H. Nickisch and C. Rasmussen (2011). Additive Gaussian Processes. Neural Information Processing Systems.
- N. Durrande, D. G. and O. Roustant (2012). Additive covariance kernels for high-dimensional Gaussian Process modeling. Annales de la Faculté des Sciences de Toulouse, 21(3):481-499.
- D. G., N. Durrande and O. Roustant (2013). Kernels and designs for modelling invariant functions: from group invariance to additivity. In mODa 10 - Advances in Model-Oriented Design and Analysis, Contributions to Statistics.

slide-38
SLIDE 38

A link with RKHSs in the Gaussian case

In the Gaussian case, the Loève isometry Ψ between L(Z) (the Hilbert space generated by Z) and the RKHS H_k leads to the following.

Proposition. Let T : F → R^D be a linear operator such that T(m) ≡ 0 and T(Z)_x ∈ L(Z) for any x ∈ D. Then there exists a unique linear operator T̃ : H_k → R^D satisfying Cov(T(Z)_x, Z_{x′}) = T̃(k(·, x′))(x) (x, x′ ∈ D) and such that T̃(h_n)(x) → T̃(h)(x) for any x ∈ D whenever h_n → h in H_k. In addition, we have equivalence between the following:
(i) ∀x ∈ D, T(Z)_x = 0 (almost surely)
(ii) ∀x′ ∈ D, T̃(k(·, x′)) = 0
(iii) T̃(H_k) = {0}

slide-40
SLIDE 40

Examples

a) Let ν be a measure on D s.t. \int_D \sqrt{k(u, u)} dν(u) < +∞. Then a centred Z (Gaussian or not) has centred paths (\int_D Z_u dν(u) = 0 a.s.) iff \int_D k(x, u) dν(u) = 0 for all x ∈ D. For instance, given any p.d. kernel k, the kernel k_0 defined by

k_0(x, y) = k(x, y) − \int k(x, u) dν(u) − \int k(y, u) dν(u) + \int\int k(u, v) dν(u) dν(v)

satisfies the above condition.

b) Solutions to the Laplace equation are called harmonic functions. Let us call harmonic any p.d. kernel solving the Laplace equation argumentwise: Δ(k(·, x′)) = 0 (x′ ∈ D). An example of such a harmonic kernel over R^2 × R^2 can be found in the recent literature (Schaback et al. 2009):

k_harm(x, y) = exp((x_1 y_1 + x_2 y_2)/θ^2) cos((x_2 y_1 − x_1 y_2)/θ^2).
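Both examples lend themselves to quick numerical checks. In the added sketch below, ν is approximated by a uniform quadrature measure on [0, 1] for example a) (an assumption made purely for illustration), and the harmonicity of k_harm is probed with a finite-difference Laplacian; parameter values are arbitrary.

```python
import numpy as np

def k(s, t, theta=0.4):
    """Base squared-exponential kernel on [0, 1] (illustrative choice)."""
    return np.exp(-((s - t) / theta) ** 2)

# Example a): nu approximated by a uniform quadrature measure on [0, 1].
u = np.linspace(0.0, 1.0, 200)
w = np.full(u.size, 1.0 / u.size)          # quadrature weights

def k0(s, t):
    """k0(x,y) = k(x,y) - int k(x,u)dnu - int k(y,u)dnu + int int k dnu dnu."""
    m_s = k(s[:, None], u[None, :]) @ w
    m_t = k(t[:, None], u[None, :]) @ w
    c = w @ (k(u[:, None], u[None, :]) @ w)
    return k(s[:, None], t[None, :]) - m_s[:, None] - m_t[None, :] + c

x = np.array([0.1, 0.5, 0.9])
print(k0(x, u) @ w)      # ~ 0 for each x: the centring condition holds

# Example b): the harmonic kernel of Schaback et al. on R^2 x R^2.
def k_harm(x, y, theta=1.0):
    return (np.exp((x[0] * y[0] + x[1] * y[1]) / theta**2)
            * np.cos((x[1] * y[0] - x[0] * y[1]) / theta**2))

# Finite-difference Laplacian of x -> k_harm(x, y) at a test point:
y, h = (0.7, -0.2), 1e-3
lap = sum((k_harm((0.3 + h * d1, 0.5 + h * d2), y)
           + k_harm((0.3 - h * d1, 0.5 - h * d2), y)
           - 2.0 * k_harm((0.3, 0.5), y)) / h**2
          for d1, d2 in [(1, 0), (0, 1)])
print(lap)               # ~ 0 (up to finite-difference error)
```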

slide-41
SLIDE 41

Example sample paths invariant under various T's

[Figure: (a) zero-mean paths of the centred GP with kernel k_0; (b) harmonic path of a GRF with kernel k_harm.]

slide-42
SLIDE 42

Some "stability of invariances by conditioning" result

Proposition. Let F, G be real separable Banach spaces, µ be a Gaussian measure on B(F) with mean zero and covariance operator C_µ, T : F → F be a bounded linear operator such that T C_µ T* = 0_{F*→F}, A : F → G be another bounded linear operator, and A♯µ be the image of µ under A. Then there exist a Borel measurable mapping m : G → F, a Gaussian covariance R : F* → F with R ≤ C_µ, and a disintegration (q_y)_{y∈G} of µ on B(F) with respect to A such that, for any fixed y ∈ G, q_y is a Gaussian measure with mean m(y) and covariance operator R satisfying T(m(y)) = 0_F and T R T* = 0_{F*→F}.

slide-43
SLIDE 43

GP prediction with invariant kernels: example a)

[Figure: test function, best predictor and 95% confidence intervals for (a) GPR with kernel k and (b) GPR with kernel k_0. Comparison of two GP models: the left one is based on a Gaussian kernel; the right one incorporates the zero-mean property.]

slide-44
SLIDE 44

GP models with invariant kernels: example b)

[Figure: example of a GP model based on a harmonic kernel. (a) Mean predictor and 95% prediction intervals; (b) prediction error.]

slide-47
SLIDE 47

Numerical application: maximum of a harmonic f

Here we consider approximating a harmonic function (left/right: Gaussian/harmonic kernels) and estimating its maximum by GRF modelling.

[Figure: predicted surfaces over (x_1, x_2) ∈ [−1, 1]^2 under the Gaussian kernel (left) and the harmonic kernel (right).]

Extracted from "On degeneracy and invariances of random fields paths with applications in Gaussian Process modelling" (DG, O. Roustant & N. Durrande, Journal of Statistical Planning and Inference, 170:117-128, 2016).

slide-48
SLIDE 48

Numerical application: maximum of a harmonic f

Prediction errors (left/right: Gaussian/harmonic kernels).

[Figure: prediction error maps over (x_1, x_2); error scale about 0 to 0.3 for the Gaussian kernel versus about ±0.015 for the harmonic kernel.]
slide-49
SLIDE 49

Numerical application: maximum of a harmonic f

Prediction errors (left/right: Gaussian/harmonic kernels).

[Figure: two panels plotting Temperature versus θ.]

slide-50
SLIDE 50

Numerical application: maximum of a harmonic f

Conditional simulations of the maximum under the two GRF models.

[Figure: density of simulated maxima under the Gaussian kernel and under the harmonic kernel, with the actual maximum marked.]

slide-51
SLIDE 51

Numerical application: recovering a symmetry axis

[Figure: contour plot of a symmetric test function over [−1, 1]^2.]
slide-52
SLIDE 52

Numerical application 2: recovering a symmetry axis

[Figure: log-likelihood surface over the symmetry-axis parameters (distance to the origin, angle).]
slide-53
SLIDE 53

Discussion

Function approximation approaches based on p.d. kernels enable incorporating degeneracies and invariances under linear operators, including:
- Symmetries and further invariances under group actions
- Additivity and further multivariate sparsity properties towards high-dimensional GRF modelling (see, e.g., the MCQMC 2014 paper)
- Harmonicity, but also, e.g., divergence-free properties for vector fields (see, e.g., Scheuerer and Schlather 2012)

In the Gaussian set-up, such properties are inherited by conditional distributions, which is clearly convenient but also comes with risks.

slide-56
SLIDE 56

Perspectives

- Developing further the inference of degeneracy/invariance properties based on data, and investigating consistency,
- Creating classes of kernels that incorporate some invariant components and some non-invariant components,
- Exploring further the potential of invariant kernels in real-world applications (e.g., from physics, biology, engineering).

Thank you very much for your attention!

slide-57
SLIDE 57

Further references

- C.J. Stone (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13(2):689-705.
- M. Scheuerer and M. Schlather (2012). Covariance Models for Divergence-Free and Curl-Free Random Vector Fields. Stochastic Models, 28(3).
- D. Duvenaud (2014). Automatic Model Construction with Gaussian Processes. PhD thesis, University of Cambridge.
- K. Kandasamy, J. Schneider and B. Poczos (2015). High Dimensional Bayesian Optimisation and Bandits via Additive Models. International Conference on Machine Learning (ICML) 2015.
- D. G., O. Roustant, D. Schuhmacher, N. Durrande and N. Lenz (2016). On ANOVA decompositions of kernels and Gaussian random field paths. Monte Carlo and Quasi-Monte Carlo Methods.

slide-58
SLIDE 58

Appendix

3. About GPs and their use in function modelling
4. Examples of GPs and generalities on p.d. kernels
5. Miscellanea

slide-62
SLIDE 62

What do we assume about f in GP modelling?

In Gaussian Process (GP) modelling, probabilistic concepts are used to model the deterministic function f. Let us first focus on an arbitrary point x ∈ D and think of the unknown response value f(x) as a Gaussian random variable, denoted here Z_x. Of course, how the mean and variance of Z_x are specified is crucial. A simple option is to set them to constant values (e.g. 0 mean and σ^2 > 0 variance)...

...However, a white noise assumption would not be very constructive! The crux in GP modelling is to assume that the Z_x's for different x's are correlated.

slide-66
SLIDE 66

Reminder: n-dimensional Gaussian distribution

More precisely, we will appeal to the multivariate Gaussian distribution. Let us forget about x for now and consider a random vector Z = (Z_1, ..., Z_n). Z is said to be multivariate Gaussian distributed when \sum_{i=1}^{n} a_i Z_i is Gaussian distributed whatever n ≥ 1 and a_1, ..., a_n ∈ R. Such a Z is characterized by its mean µ ∈ R^n and covariance matrix K ∈ R^{n×n} (with entries E[Z_i] and Cov[Z_i, Z_j] = E[(Z_i − µ_i)(Z_j − µ_j)], respectively). We use the notation Z ∼ N(µ, K). Note that while µ can take any value, K must be symmetric positive semi-definite (i.e. symmetric with non-negative eigenvalues).

slide-68
SLIDE 68

Reminder: n-dimensional Gaussian distribution

In case of invertible K, Z possesses the probability density function

p_{N(µ,K)}(z) = (2π)^{−n/2} det(K)^{−1/2} exp(−(1/2)(z − µ)′ K^{−1} (z − µ)).

Besides that, denoting by Z_a and Z_b two subvectors of Z such that Z = (Z_a, Z_b), by µ_a, µ_b the corresponding means, and defining the corresponding blocks of Z's covariance matrix by

K = [ K_a   K_ab ]
    [ K_ba  K_b  ],

then (assuming invertibility of K_a) the conditional probability distribution of Z_b knowing that Z_a = z_a is (multivariate) Gaussian with

L(Z_b | Z_a = z_a) = N(µ_b + K_ba K_a^{−1} (z_a − µ_a), K_b − K_ba K_a^{−1} K_ab).

slide-71
SLIDE 71

Priors on functions?

Let us now come back to our function approximation problem. We are interested in having a prior distribution on functions, not just on a finite-dimensional vector!

Good news from probability theory (Kolmogorov's extension theorem): random processes on D (a.k.a. random fields in case of multivariate D) can be defined through finite-dimensional distributions, i.e. through the distributions of the random vectors (Z_{x_1}, ..., Z_{x_n}) for any finite set of points x_1, ..., x_n.

Gaussian Processes (a.k.a. Gaussian Random Fields): a GP (GRF) Z with index set D is a collection of random variables (Z_x)_{x∈D} (defined over the same probability space (Ω, A, P)) such that for any finite set of points x_1, ..., x_n ∈ D, (Z_{x_1}, ..., Z_{x_n}) is multivariate Gaussian.

slide-74
SLIDE 74

Mean and covariance functions of a GP

Hence a GP Z is defined by specifying the mean and the covariance matrix of any random vector of the form (Z_{x_1}, ..., Z_{x_n}), so that Z is characterized by

µ : x ∈ D → µ(x) = E[Z_x] ∈ R
k : (x, x′) ∈ D × D → k(x, x′) = Cov[Z_x, Z_{x′}] ∈ R

While µ can be any function, k is constrained since (k(x_i, x_j))_{1≤i,j≤n} must be symmetric positive semi-definite for any set of points. k's satisfying such a property are referred to as p.d. kernels.

Remark: Assuming µ ≡ 0 for now, k accounts for a number of properties of Z, including pathwise properties, i.e. functional properties of the paths x ∈ D → Z_x(ω) ∈ R, for ω ∈ Ω (paths are also called "realizations" or "trajectories").

slide-75
SLIDE 75

Some GRF simulations (d=1) in R with DiceKriging

Here k(t, t′) = σ^2 (1 + |t′ − t|/ℓ + (t − t′)^2/(3ℓ^2)) exp(−|t′ − t|/ℓ) (Matérn kernel with regularity parameter 5/2), where ℓ = 0.4 and σ = 1.5. Furthermore, the trend here is µ(t) = −1 + 2t + 3t^2.

[Figure: simulated sample paths z against x on [0, 1].]
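A numpy rendition of this simulation (an added sketch mirroring the stated kernel, parameters and trend; the original figure was produced with the DiceKriging R package):

```python
import numpy as np

def matern52(t, tp, sigma=1.5, ell=0.4):
    """Matern 5/2 kernel in the slide's parametrisation."""
    r = np.abs(t[:, None] - tp[None, :])
    return sigma**2 * (1 + r/ell + r**2 / (3 * ell**2)) * np.exp(-r / ell)

mu = lambda t: -1 + 2 * t + 3 * t**2        # quadratic trend from the slide

t = np.linspace(0, 1, 200)
K = matern52(t, t) + 1e-8 * np.eye(len(t))  # small jitter for the Cholesky
rng = np.random.default_rng(2)
paths = mu(t)[:, None] + np.linalg.cholesky(K) @ rng.standard_normal((len(t), 5))
print(paths.shape)                          # (200, 5): five trajectories on [0, 1]
```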

slide-76
SLIDE 76

Some GRF simulations (d=2) in R with DiceKriging

Now take a tensorized version of the Matérn kernel and a constant trend µ = 0.

slide-79
SLIDE 79

Approximating functions using GP models

Let us now consider a deterministic function f : D → R, whose response values are measured at n points X_n = (x_1, ..., x_n) ∈ D^n. Putting a GP prior Z on f and updating it with respect to f's values at the x_i points, we can work out a posterior distribution.

Indeed, finite-dimensional distributions of this posterior can be obtained by looking at the conditional distribution of (Z_{x_{n+1}}, ..., Z_{x_{n+q}}) knowing (Z_{x_1}, ..., Z_{x_n}) for arbitrary points x_{n+1}, ..., x_{n+q} ∈ D. By Gaussianity, it turns out that such conditional distributions are Gaussian, and so the posterior Z | measurements is a GRF.

NB: the same applies in noisy cases when considering (Z_{x_1} + ε_1, ..., Z_{x_n} + ε_n) with Gaussian ε_i's independent of Z.

slide-82
SLIDE 82

About the estimation of covariance parameters

The previous equations were at given µ and k. In practice, however, trend and/or covariance parameters often have to be estimated. Let us consider the case of a known µ and a k that depends on a vector of "hyperparameters" ψ.

Several approaches exist for dealing with the unknown ψ: Maximum Likelihood Estimation (MLE), Cross-Validation (CV), but also Bayesian approaches involving sampling algorithms such as MCMC, SMC, etc. Let us present a brief overview of the MLE approach, probably the most implemented (although not necessarily the most robust) option.

slide-85
SLIDE 85

A brief overview of MLE

Let us denote by K(ψ) the covariance matrix of the responses, say k(X_n, X_n; ψ), under the assumption of covariance hyperparameters with value ψ. The principle of MLE is to search for a value of ψ under which it would have been the most likely to observe the responses z_n. Under the GP model assumptions, Z_{X_n} ∼ N(µ(X_n), K(ψ)). The likelihood then writes as the probability density of Z_{X_n} at point z_n, seen as a function of ψ:

L(ψ; z_n) = (2π)^{−n/2} det(K(ψ))^{−1/2} exp(−(1/2)(z_n − µ(X_n))′ K(ψ)^{−1} (z_n − µ(X_n)))

Solving MLE is typically addressed by equivalently minimizing the function

ℓ(ψ; z_n) = log(det(K(ψ))) + (z_n − µ(X_n))′ K(ψ)^{−1} (z_n − µ(X_n)).

slide-89
SLIDE 89

A brief overview of MLE

Minimizing ℓ is usually analytically intractable, and numerical optimization algorithms are employed. An elegant trick exists to estimate σ^2 ∈ (0, +∞) in case k writes as σ^2 × r, where r is a given kernel depending on parameters θ. Writing K(ψ) = σ^2 R(θ) with ψ = (σ^2, θ), one can derive the "optimal" σ^2 as a function of θ. A swift calculation indeed leads to

σ^{2⋆}(θ) = (1/n)(z_n − µ(X_n))′ R(θ)^{−1} (z_n − µ(X_n)).

Re-injecting the latter equation into ℓ, MLE then boils down to minimizing a function depending solely on θ, the so-called profile (or "concentrated") criterion:

ℓ_p(θ; z_n) = log(det(σ^{2⋆}(θ) R(θ)))
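A sketch of this concentrated criterion (an added illustration; the correlation kernel, the data and the grid search over θ are made up, and note log det(σ^{2⋆} R) = n log σ^{2⋆} + log det R):

```python
import numpy as np

def neg_profile_loglik(theta, X, z, mu, r_kernel):
    """l_p(theta; z_n) = log det(sigma2*(theta) R(theta)), up to constants."""
    n = len(z)
    R = r_kernel(X, X, theta) + 1e-10 * np.eye(n)
    resid = z - mu(X)
    sigma2_star = resid @ np.linalg.solve(R, resid) / n   # concentrated sigma^2
    _, logdetR = np.linalg.slogdet(R)
    return n * np.log(sigma2_star) + logdetR

def gauss_r(X, Y, theta):
    return np.exp(-((X[:, None] - Y[None, :]) / theta) ** 2)

X = np.linspace(0, 1, 12)
z = np.sin(3 * X) + 0.1 * np.cos(20 * X)    # made-up responses
mu = lambda t: np.zeros_like(t)
thetas = np.linspace(0.05, 1.0, 20)
vals = [neg_profile_loglik(th, X, z, mu, gauss_r) for th in thetas]
print("MLE of theta on the grid:", thetas[int(np.argmin(vals))])
```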

slide-92
SLIDE 92

Towards Universal Kriging

Another situation where an elegant concentration of ℓ is feasible is when k depends on ψ and µ depends linearly on p basis functions f_1, ..., f_p:

µ(x) = \sum_{i=1}^{p} β_i f_i(x), where β = (β_1, ..., β_p)′ is assumed unknown.

Then, setting F = (f_j(x_i))_{1≤i≤n, 1≤j≤p}, we have µ(X_n) = Fβ, and maximizing the likelihood with respect to β at fixed covariance parameters (say ψ again) leads to

β⋆(ψ) = (F′ K(ψ)^{−1} F)^{−1} F′ K(ψ)^{−1} z_n.

Plugging β⋆(ψ) into the predictor and inflating the conditional (co)variance accordingly leads to the "Universal Kriging" equations (see also the particular case of "Ordinary Kriging", where p = 1 and µ is a constant).

NB: In a Bayesian set-up where an improper uniform prior is put on β, one even recovers a GP posterior distribution.
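A sketch of this generalized-least-squares step (an added illustration; the basis functions f_1 = 1, f_2 = x, the kernel and the data are made up):

```python
import numpy as np

def beta_star(F, K, z):
    """GLS estimate beta*(psi) = (F' K^{-1} F)^{-1} F' K^{-1} z_n."""
    Ki_F = np.linalg.solve(K, F)
    Ki_z = np.linalg.solve(K, z)
    return np.linalg.solve(F.T @ Ki_F, F.T @ Ki_z)

# Example with p = 2 basis functions: f_1 = 1, f_2 = x (linear trend).
X = np.linspace(0, 1, 10)
F = np.column_stack([np.ones_like(X), X])
K = np.exp(-((X[:, None] - X[None, :]) / 0.3) ** 2) + 1e-8 * np.eye(10)
z = 1.0 + 2.0 * X + 0.05 * np.sin(25 * X)   # trend plus a small wiggle
print(beta_star(F, K, z))                   # close to (1, 2)
```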

slide-93
SLIDE 93

Selected references

- M.L. Stein (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer.
- R. Adler and J. Taylor (2007). Random Fields and Geometry. Springer.
- M. Scheuerer (2009). A Comparison of Models and Methods for Spatial Interpolation in Statistics and Numerical Analysis. PhD thesis, Georg-August-Universität Göttingen.
- O. Roustant, D. Ginsbourger and Y. Deville (2012). DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization. Journal of Statistical Software, 51(1):1-55.
- M. Schlather, A. Malinowski, P.J. Menck, M. Oesting and K. Strokorb (2015). Analysis, Simulation and Prediction of Multivariate Random Fields with Package RandomFields. Journal of Statistical Software, 63(8):1-25.

slide-94
SLIDE 94

Selected references

- B. Rajput and S. Cambanis (1972). Gaussian processes and Gaussian measures. Annals of Mathematical Statistics, 43(6):1944-1952.
- A. O'Hagan (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society, Series B, 40(1):1-42.
- H. Omre and K. Halvorsen (1989). The Bayesian bridge between simple and universal kriging. Mathematical Geology, 22(7):767-786.
- M.S. Handcock and M.L. Stein (1993). A Bayesian analysis of kriging. Technometrics, 35(4):403-410.
- A.W. van der Vaart and J.H. van Zanten (2008). Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36:1435-1463.

slide-98
SLIDE 98

Some examples of p.d. kernels and GPs

Let us start with a very classical example (for d = 1): the Brownian motion W = (W_t)_{t∈D} over D = [0, +∞). Let us define W (in distribution) as follows: W_0 = 0; for any t ∈ D and h > 0, W_{t+h} − W_t ∼ N(0, h); and for any t_1 ≤ t_2 ≤ t_3 ≤ t_4 in D, the increments W_{t_4} − W_{t_3} and W_{t_2} − W_{t_1} are independent.

Such conditions define a GP; there remains to work out its expectation and covariance functions. First, for t ∈ D the first two conditions imply that

m(t) = E[W_t] = E[W_0 + W_t − W_0] = E[W_0] + E[W_t − W_0] = 0 + 0 = 0.

Second, taking two points t, t′ ∈ D (assuming, say, that t < t′), the third condition implies that W_{t′} − W_t is independent of W_t − W_0. Consequently,

k_BM(t, t′) = E[W_t W_{t′}] = E[(W_t − W_0)(W_t − W_0 + W_{t′} − W_t)]
            = E[(W_t − W_0)^2] + E[(W_t − W_0)(W_{t′} − W_t)] = t + 0 = t = min(t, t′).
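A quick Monte Carlo check of this derivation (an added sketch; grid, seed and number of paths are arbitrary):

```python
import numpy as np

# Brownian motion covariance derived above: k_BM(t, t') = min(t, t').
t = np.linspace(0.02, 1.0, 50)
K = np.minimum(t[:, None], t[None, :])
L = np.linalg.cholesky(K)

rng = np.random.default_rng(3)
W = L @ rng.standard_normal((len(t), 20000))   # 20000 paths, one per column

# Empirical check: Cov[W_s, W_t] should be close to min(s, t).
i, j = 10, 35
print(np.mean(W[i] * W[j]), "vs", min(t[i], t[j]))
```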

slide-100
SLIDE 100

Examples of covariance kernels and GPs (cont'd)

Another famous covariance function stems from the so-called "Brownian bridge" (ending in 0) B = (B_t)_{t∈[0,1]}. Let us first restrict W to D = [0, 1], obtaining a centred process with covariance k(t, t′) = min(t, t′) over [0, 1]^2. The distribution of B is then obtained by conditioning W on W_1 = 0, thus obtaining the mean m_B(t) = 0 and covariance kernel

k_BB(t, t′) = min(t, t′) − t t′ = min(t, t′)(1 − max(t, t′)).

Another covariance function of interest can be obtained by integrating W. Defining (I_t)_{t∈D} (with D = [0, +∞) again) by I_t = \int_0^t W_u du, we obtain a new centred GP with covariance

k_IBM(t, t′) = \int_0^t \int_0^{t′} min(u, v) du dv = min(t, t′)^3/3 + (max(t, t′) − min(t, t′)) min(t, t′)^2/2.

slide-101
SLIDE 101

Examples of covariance kernels and GPs (cont'd)

Without entering into much detail, let us list a few further examples of 1-dimensional GPs / associated covariance kernels (a code sketch follows below):
- For D = [0, 1] and H ∈ (0, 1), k_fBM(t, t′) = (1/2)(|t|^{2H} + |t′|^{2H} − |t − t′|^{2H}) is the covariance kernel of the fractional (or "fractal") Brownian motion with Hurst coefficient H.
- k_triang(t, t′) = (1 − |t − t′|)_+ is the "triangular" kernel over D = R.
- Defining Z_t = ζ_1 cos(ωt) + ζ_2 sin(ωt), where ζ_1, ζ_2 ∼ N(0, σ^2) independently (σ > 0) and ω > 0, one obtains k(t, t′) = cos(ω(t′ − t)).
- k_OU(t, t′) = e^{−|t−t′|} is called the exponential kernel and characterizes the Ornstein-Uhlenbeck process.
- k(t, t′) = e^{−|t−t′|^2} is the squared-exponential kernel.
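These kernels are straightforward to implement and sanity-check (an added sketch; parameter values are arbitrary, and the smallest Gram-matrix eigenvalues should be non-negative up to rounding):

```python
import numpy as np

# The 1-d kernels listed above, as functions of (t, t'):
kernels = {
    "fBM (H=0.7)": lambda t, tp, H=0.7:
        0.5 * (np.abs(t)**(2*H) + np.abs(tp)**(2*H) - np.abs(t - tp)**(2*H)),
    "triangular": lambda t, tp: np.maximum(1 - np.abs(t - tp), 0.0),
    "cosine": lambda t, tp, w=6.0: np.cos(w * (tp - t)),
    "OU / exponential": lambda t, tp: np.exp(-np.abs(t - tp)),
    "squared-exponential": lambda t, tp: np.exp(-np.abs(t - tp) ** 2),
}

t = np.linspace(0, 1, 100)
for name, k in kernels.items():
    K = k(t[:, None], t[None, :])
    print(f"{name:22s} smallest eigenvalue: {np.linalg.eigvalsh(K).min():+.2e}")
```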

slide-102
SLIDE 102

Examples of covariance kernels and GPs (cont'd)

Previous k's from real-valued one-dimensional settings can be generalized in a number of ways. Let us review a few simple examples.
- One obtains an admissible k on [0, +∞)^d × [0, +∞)^d by taking k(x, x′) = \prod_{i=1}^{d} min(x_i, x′_i), where the x_i's (resp. x′_i's) are the coordinates of x (resp. x′). The associated centred GP over [0, +∞)^d is called the "Brownian sheet".
- The exponential and Gaussian kernels can be generalized to R^d × R^d by taking k(x, x′) = exp(−||x − x′||) and k(x, x′) = exp(−||x − x′||^2), respectively, where || · || refers to the Euclidean norm on R^d.
- From a different perspective, one can define a particular complex-valued GP by taking Z_x = ζ e^{−i⟨x, ω⟩}, where ζ ∼ N(0, σ^2) (σ > 0) and ω ∈ R^d. Such a Z is centred and has (complex) covariance k(x, x′) = Cov(Z_x, Z_{x′}) = E[Z_x \overline{Z_{x′}}] = σ^2 e^{−i⟨x, ω⟩} e^{i⟨x′, ω⟩} = σ^2 e^{−i⟨x − x′, ω⟩}.

slide-104
SLIDE 104

A necessary and sufficient condition of admissibility

A common point about all kernels reviewed so far is that, for ad hoc D, if one takes any n ≥ 1, arbitrary points x_1, ..., x_n, and complex numbers a_1, ..., a_n ∈ C, the following holds:

0 ≤ Var(\sum_{i=1}^{n} a_i Z_{x_i}) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i \overline{a_j} k(x_i, x_j).

This property is indeed necessary for k to be an admissible covariance. Furthermore, it turns out that any k possessing this property is a covariance kernel (there exists some (Gaussian) random process with this k).

slide-105
SLIDE 105

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

From p.d. kernels to function approximation

For an introduction to the mathematical foundations of p.d. kernels and their use in function approximation, see notably the following references:

  • C. Berg, J.P.R. Christensen and P. Ressel (1984). Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer-Verlag.

  • A. Berlinet and C. Thomas-Agnan (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers.

22 / 47

slide-107
SLIDE 107

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Choosing p.d. kernels?

In practice, choosing an adapted k for an objective f (about which limited information may be available) is both a crucial and difficult task. Typically, k is chosen among some well-known p.d. kernel families, often among “shift-invariant” (a.k.a. “stationary”) kernels, i.e. functions of x − x′. Examples: Generalized Exponential (including Gaussian) kernels, Matérn kernels, and more generally kernels obtained via the Bochner theorem.

23 / 47

slide-109
SLIDE 109

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Bochner theorem

By a slight abuse of notation, we denote stationary kernels on D = Rᵈ (d ≥ 1) by k : h ∈ Rᵈ → k(h) ∈ C.

Theorem (Bochner’s theorem). A continuous k : h ∈ Rᵈ → k(h) ∈ C is positive definite if and only if it is the Fourier transform of a finite non-negative Borel measure ν on Rᵈ, i.e.

k(h) = ν̂(h) = (2π)^{−d/2} ∫_{Rᵈ} e^{−i⟨h,ω⟩} dν(ω).

See for instance Wendland 2005 (Chap. 6) for a proof. By playing on the “spectral measure” ν one can hence generate all continuous stationary p.d. kernels on Rᵈ. For the case of a measure ν admitting a density q(ω) = dν/dλ(ω) w.r.t. the Lebesgue measure λ, k is hence characterized by its spectral density q.

24 / 47
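Bochner’s theorem can be illustrated by Monte Carlo: sampling frequencies from a spectral density and averaging cosines recovers the kernel (the idea underlying random Fourier features). A minimal R sketch for the 1-d Gaussian case, taking ν to be a probability measure so that the normalizing constant is absorbed (an illustration, not from the slides):

set.seed(1)
omega <- rnorm(10000)                               # frequencies from a N(0,1) spectral density
h <- seq(-3, 3, length.out = 200)
k_mc <- sapply(h, function(hh) mean(cos(omega * hh)))
plot(h, k_mc, type = "l")                           # Monte Carlo estimate of k
lines(h, exp(-h^2 / 2), lty = 2)                    # exact kernel exp(-h^2/2)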

slide-110
SLIDE 110

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

A few 1-dimensional examples

  • Triangular: k(h) := c(a − |h|)₊, with q(ω) ∝ c(1 − cos(aω))/(πω²).
  • Matérn ν = 3/2: k(h) ∝ α^{−3} e^{−α|h|}(1 + α|h|), with q(ω) ∝ (α² + ω²)^{−2}.
  • Gauss: k(h) ∝ e^{−(h/θ)²}, with q(ω) ∝ e^{−θ²ω²}.

  • M. Stein (Springer, 1999). Interpolation of Spatial Data: Some Theory for Kriging.

25 / 47

slide-113
SLIDE 113

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on spectral densities of Matérn kernels (d ≥ 1)

Matérn kernels can be characterized using the Handcock and Wallis parametrization (1994) mentioned in Stein (1999) (here σ = 1):

q(ω) = c(ν, ρ) (4ν/ρ² + ||ω||²)^{−(ν+d/2)}, where c(ν, ρ) = Γ(ν + d/2)(4ν)^ν / (π^{d/2} Γ(ν) ρ^{2ν}).

The corresponding (“isotropic”) p.d. kernel is

k(h) = (1/(2^{ν−1} Γ(ν))) (2ν^{1/2}||h||/ρ)^ν Kν(2ν^{1/2}||h||/ρ),

where Kν is a modified Bessel function of the third kind. More tractable expressions can be obtained for ν = 1/2, 3/2, 5/2, . . . See Stein (1999) for more on this class and Wendland (2005) –chap. 5– for more on Bessel functions.

26 / 47
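The general Matérn form is directly computable in R through besselK; a minimal sketch with the (ν, ρ) parametrization above (the matern helper is ours):

matern <- function(h, nu = 1.5, rho = 1) {
  r <- 2 * sqrt(nu) * abs(h) / rho
  out <- rep(1, length(r))                          # k(0) = 1 by continuity
  pos <- r > 0
  out[pos] <- r[pos]^nu * besselK(r[pos], nu) / (2^(nu - 1) * gamma(nu))
  out
}
h <- seq(0, 3, length.out = 100)
plot(h, matern(h, nu = 1.5), type = "l")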

slide-115
SLIDE 115

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on isotropic p.d. kernels in Rd

The Matérn (class of) kernels considered previously are one among many isotropic p.d. kernels on Rᵈ, i.e. p.d. kernels that write as k(x, x′) = κ(r) where r = ||x − x′||, and κ : R₊ → R is also often (by a slight abuse of language) referred to as positive definite. Such k’s are also called radial. Definition (cf. Wendland 2005): a function Φ : Rᵈ → R is said to be radial if there exists φ : [0, +∞) → R such that Φ(h) = φ(||h||₂) for all h ∈ Rᵈ. A number of κ’s leading to radial p.d. kernels in Rᵈ do exist and have been studied by generations of mathematicians. Some depend on d, some do not!

27 / 47

slide-121
SLIDE 121

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on isotropic p.d. kernels in Rd

Let us review a few examples.

  • κ(r) = e^{−rᵖ} (0 < p ≤ 2): “generalized exponential”.
  • κ(r) = (c² + r²)^{−β} (c, β > 0): “inverse multiquadrics”.
  • κ(r) = (1 − r)₊^ℓ, where (x)₊ = max(0, x): “truncated power kernel”.

While the first two kernels are (strictly) positive definite for all d ≥ 1, for the third one needs to restrict to ℓ ≥ ⌊d/2⌋ + 1 to get this property. Is it possible to characterize radial p.d. functions defined in terms of one κ valid in any dimension? Yes, thanks to completely monotone functions!

28 / 47

slide-124
SLIDE 124

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

More on isotropic p.d. kernels in Rd

Definition (cf. Wendland 2005): a function φ is called completely monotone on (0, +∞) if it satisfies φ ∈ C^∞(0, +∞) and (−1)^ℓ φ^{(ℓ)}(r) ≥ 0 for all ℓ ∈ N and all r > 0. The function φ is called completely monotone on [0, +∞) if it is in addition in C[0, +∞).

Theorem (Schoenberg; cf. Wendland 2005). A function φ : [0, +∞) → R is completely monotone on [0, +∞) if and only if Φ := φ(|| · ||₂²) is positive definite on every Rᵈ.

Application: the inverse multiquadrics is p.d. in any dimension for c, β > 0.

29 / 47
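Schoenberg’s theorem can be probed numerically: since φ(r) = (c² + r)^{−β} is completely monotone on [0, +∞), inverse multiquadrics Gram matrices should stay positive definite whatever the dimension. A sketch reusing the is_psd helper defined earlier (illustrative only):

phi <- function(r2, c = 1, beta = 1) (c^2 + r2)^(-beta)   # applied to squared distances
for (d in c(1, 5, 20)) {
  X <- matrix(runif(40 * d), nrow = 40)             # 40 random points in dimension d
  D2 <- as.matrix(dist(X))^2                        # squared Euclidean distances
  print(is_psd(phi(D2)))                            # expected: TRUE for every d
}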

slide-125
SLIDE 125

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Nota Bene: geometric anisotropy

Starting from any isotropic p.d. kernel, it is always possible to generalize it and obtain (geometric) anisotropic p.d. kernels through orthogonal transformations and dilatations, by defining

k(x, x′) = κ( ((x − x′)ᵀ Σ (x − x′))^{1/2} ),

where Σ is a real-valued symmetric (strictly!) positive definite matrix.

30 / 47
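In code, geometric anisotropy amounts to a Mahalanobis-type distance inside an isotropic profile; a minimal R sketch (kappa and Sigma are arbitrary illustrative choices):

kappa <- function(r) exp(-r)                        # any valid radial profile
Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)            # symmetric positive definite
kaniso <- function(x, y) {
  h <- x - y
  kappa(sqrt(drop(t(h) %*% Sigma %*% h)))
}
kaniso(c(0, 0), c(1, 1))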

slide-127
SLIDE 127

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Other ways of defining p.d. kernels: overview

Kernels that write as functions of ⟨x, x′⟩ (such as the previously presented radial p.d. kernels, once restricted to the sphere) are also called zonal kernels in G. E. Fasshauer’s review paper below, where examples of zonal kernels are discussed:

  • Fasshauer, G. E. (2011). Positive definite kernels: past, present and future. Dolomites Research Notes on Approximation, 4:21-63.

The following paper also includes alternative classes of p.d. kernels:

  • T. Hofmann, B. Schölkopf and A.J. Smola (2008). Kernel methods in machine learning. The Annals of Statistics, Vol. 36, No. 3, 1171-1220.

Overall, the notion of scalar product plays a crucial role in p.d. kernels.

31 / 47

slide-128
SLIDE 128

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Other ways of defining p.d. kernels: Mercer theorem

For continuous p.d. kernels –say real-valued, defined on a compact set D ⊂ Rᵈ– a fruitful approach is to consider the following operator Tk on L²(D):

g → Tk(g)(·) = ∫_D g(x′) k(·, x′) dλ(x′),

where λ refers to the Lebesgue measure (generalizations do exist) on Rᵈ. Under our continuity and compactness conditions, there exists an orthonormal system (ϕj(·))j∈N* of L²(D) and non-negative numbers (λj)j∈N* such that Tk(ϕj) = λj ϕj for all j ∈ N*, and this leads to the Mercer decomposition (1909):

k(x, x′) = Σ_{j=1}^{+∞} λj ϕj(x) ϕj(x′).

See Adler & Taylor, Steinwart and more for detail on the convergence, etc.

32 / 47

slide-130
SLIDE 130

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Basic principle of the Karhunen-Lo` eve expansion

Assuming D compact and k continuous, the Mercer theorem ensures the existence of an orthonormal basis (ϕj)j≥1 of L²(D) such that

k(x, x′) = Σ_{j=1}^{+∞} λj ϕj(x) ϕj(x′).

The KL expansion of a GRF Z then consists in representing it under the form

Zx = Σ_{j=1}^{+∞} √λj ζj ϕj(x)

where the ζj’s are i.i.d. standard Gaussian random variables.

33 / 47

slide-132
SLIDE 132

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Deriving the eigenfunctions: a Fredholm problem

Given a GRF Z of covariance kernel k, finding the basis functions ϕj (j ≥ 1) is the key to the KL decomposition of Z. This is done by solving the following integral equation, called a Fredholm problem:

∫_D k(x, x′) g(x) dµ(x) = λ g(x′).

When possible, the latter is solved analytically by using calculus.

34 / 47
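When no closed form is at hand, the Fredholm problem can be discretized: replace the integral by a quadrature rule and solve the resulting matrix eigenproblem (the Nyström method). A minimal R sketch on a uniform grid of [0, 1], checked against the Brownian motion eigenvalues derived on the next slide (an illustration, not from the slides):

m <- 200
t <- seq(0, 1, length.out = m)
K <- outer(t, t, pmin)                              # Brownian motion kernel min(t, t')
eig <- eigen(K / m, symmetric = TRUE)               # 1/m: quadrature weight of the grid
eig$values[1:3]                                     # approximate leading eigenvalues
1 / (pi * (1:3 - 0.5))^2                            # exact: 1 / (pi^2 (j - 1/2)^2)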

slide-134
SLIDE 134

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Example: KL expansion of the Brownian Motion

For the covariance kernel of the BM, k(t, t′) = min(t, t′), the eigenvalues and eigenfunctions are solutions to the following Fredholm problem:

∫_0^1 min(t, t′) ϕ(t) dt = λ ϕ(t′).

It can be shown by solving a differential equation that the solutions are

λj = 1/(π²(j − 1/2)²) and ϕj(t) = √2 sin((j − 1/2) π t).

  • R.J. Adler and J.E. Taylor (Springer, 2007). Random Fields and Geometry.

35 / 47

slide-135
SLIDE 135

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Example: KL expansion of the Brownian Motion

Let us simulate the Brownian Motion using a truncated KL expansion:

m <- 10000
t <- seq(0, 1, length.out = m)
v <- function(t, j) sqrt(2) * sin((j - 0.5) * pi * t)  # eigenfunctions
lambda <- function(j) 1 / (pi * (j - 0.5))^2           # eigenvalues
q <- 1000                                              # truncation order
KL <- rep(0, m)
for (j in seq(1, q)) {
  KL <- KL + sqrt(lambda(j)) * rnorm(1) * v(t, j)
}

36 / 47

slide-136
SLIDE 136

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Example: KL expansion of the Brownian Motion

Here are two simulation results based on the truncated KL expansion of the Brownian Motion, respectively with q = 50 and q = 1000:

[Figure: two approximate BM sample paths simulated by truncated KL expansion, with q = 50 (left) and q = 1000 (right); t on the horizontal axis, z on the vertical axis.]

The simulations are not exact, but can be performed over an arbitrarily fine set of points. The ζj’s can be stored, and the corresponding path evaluated at a new point later.

37 / 47
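The last remark is easy to make concrete: once the ζj’s are stored, the very same truncated path can be re-evaluated at any new points. A short R sketch (illustrative, hard-coding the BM eigenpairs):

q <- 1000
zeta <- rnorm(q)                                    # store the KL coefficients once
path <- function(t) {                               # evaluate the same path anywhere
  j <- seq_len(q)
  colSums((1 / (pi * (j - 0.5))) * sqrt(2) * sin(outer((j - 0.5) * pi, t)) * zeta)
}
path(c(0.25, 0.5, 0.999))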

slide-137
SLIDE 137

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

A few selected references

  • M.L. Stein (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer.
  • M. Scheuerer (2009). A Comparison of Models and Methods for Spatial Interpolation in Statistics and Numerical Analysis. PhD thesis, Georg-August-Universität Göttingen.
  • C.E. Rasmussen and C.K.I. Williams (2006). Gaussian Processes for Machine Learning. MIT Press.
  • R. Adler and J. Taylor (2007). Random Fields and Geometry. Springer.
  • I. Steinwart (2017). Convergence Types and Rates in Generic Karhunen-Loève Expansions with Applications to Sample Path Properties. arXiv:1403.1040v3 [math.PR].

38 / 47

slide-138
SLIDE 138

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Outline

3. About GPs and their use in function modelling

4. Examples of GPs and generalities on p.d. kernels

5. Miscellanea

39 / 47

slide-140
SLIDE 140

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Further properties of p.d. kernels ( back )

Further general properties can be derived for p.d. kernels, including:

  • Products of p.d. kernels are p.d. kernels.
  • If σ : D → D is a bijection, k(x, x′) is a p.d. kernel if and only if k(σ(x), σ(x′)) is a p.d. kernel.
  • For all x, x′ ∈ D, |k(x, x′)| ≤ √k(x, x) √k(x′, x′).
  • The function dk : (x, x′) ∈ D² → dk(x, x′) = √(k(x, x) + k(x′, x′) − 2ℜ(k(x, x′))) defines a (pseudo-)distance on D (see the sketch below).

Note also that positive definiteness can be generalized as follows: k : (x, x′) ∈ D² → C is called conditionally positive definite (c.p.d.) if it is hermitian and Σ_{i=1}^{n} Σ_{j=1}^{n} ai āj k(xi, xj) ∈ [0, +∞) for all n ≥ 1, x1, . . . , xn ∈ D and a1, . . . , an ∈ C s.t. Σ_{i=1}^{n} ai = 0. Conditional negative definiteness (c.n.d.) is defined similarly with (−∞, 0].

40 / 47
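The kernel pseudo-distance dk mentioned in the list above is one line of code for real-valued kernels; a minimal sketch:

dk <- function(x, y, k) sqrt(k(x, x) + k(y, y) - 2 * k(x, y))
dk(0.2, 0.7, pmin)                                  # distance under the Brownian motion kernel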

slide-143
SLIDE 143

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

RKHS

Reproducing Kernel Hilbert Spaces (RKHS) offer a very convenient framework for function approximation. Definition: a Hilbert space of functions D → C, (H, ⟨·, ·⟩_H), is a RKHS if for all x ∈ D the evaluation functional ex : f ∈ H → f(x) ∈ C is continuous. From the so-called Riesz representation theorem, for all x ∈ D there exists an element of H, denoted here kx, such that f(x) = ⟨f, kx⟩_H. From such a RKHS and the collection of Riesz evaluation representers kx, the “kernel” k : D × D → C associated with H can be defined as follows: k(x, x′) = ⟨kx′, kx⟩_H. Easy to check: k is a p.d. kernel. Less easy to check: any p.d. kernel defines a unique RKHS → the Moore-Aronszajn theorem (published 1950 :-)

41 / 47

slide-146
SLIDE 146

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Representing RKHSs based on the Mercer theorem

For simplicity, let us consider here a RKHS Hk associated with a real-valued Mercer kernel k. Hk can be represented more concretely as follows:

Hk = { f = Σ_{j=1}^{+∞} αj √λj φj, α ∈ R^{N} : Σ_{j=1}^{+∞} αj² < ∞ },

with ⟨ Σ_{j=1}^{+∞} αj √λj φj , Σ_{j=1}^{+∞} βj √λj φj ⟩_H := Σ_{j=1}^{+∞} αj βj.

Comparing this with the K-L expansion of a GP with kernel k, we find that in the case of an infinite number of non-zero eigenvalues, the paths of Z are not in Hk with probability 1 (Parzen-Kallianpur-LePage theorem, as discussed in Lukić and Beder 2001). However, it can be shown that in general GP paths belong to bigger RKHSs (see, e.g., Steinwart 2017 for more detail).

42 / 47

slide-148
SLIDE 148

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Some properties of GRFs and kernels

Back to centred Z for simplicity, one can define a (pseudo-)metric dZ on D by

dZ²(x, x′) = E[(Zx − Zx′)²] = k(x, x) + k(x′, x′) − 2k(x, x′).

A number of properties of Z are driven by dZ. For instance,

Theorem (Sufficient condition for the continuity of GRF paths). Let (Zx)x∈D be a separable Gaussian random field on a compact index set D ⊂ Rᵈ. If for some 0 < C < ∞ and δ, η > 0,

dZ²(x, x′) ≤ C / |log ||x − x′|||^{1+δ}

for all x, x′ ∈ D with ||x − x′|| < η, then the paths of Z are almost surely continuous and bounded.

See, e.g., M. Scheuerer’s PhD thesis (2009) for details.

43 / 47

slide-150
SLIDE 150

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Some properties of GRFs and kernels

Starting from p.d. kernels notably obtained via Bochner’s theorem, an appealing approach to enrich them is by operations conserving symmetric positive definiteness. Classical operations of that kind notably encompass:

  • Non-negative linear combinations of p.d. kernels
  • Products and tensor products of p.d. kernels
  • Multiplication by σ(x)σ(x′) for σ : x ∈ D → [0, +∞)
  • Deformations/warpings: k(g(x), g(x′)) for g : D → D
  • Convolutions, etc.

See, e.g., the section “making new kernels from old” of the book Gaussian Processes for Machine Learning (cited earlier); a small sketch of such combinations follows below.

44 / 47
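A minimal R sketch of the combinations just listed (all names and parameter values are illustrative):

k1 <- function(x, y) exp(-(x - y)^2)                # squared exponential
k2 <- function(x, y) pmin(x, y)                     # Brownian motion
ksum <- function(x, y) 2 * k1(x, y) + k2(x, y)      # non-negative combination
kprod <- function(x, y) k1(x, y) * k2(x, y)         # product
g <- function(x) x^2                                # a warping of [0, 1]
kwarp <- function(x, y) k1(g(x), g(y))              # deformation / warping
sigma <- function(x) 1 + x                          # non-negative scaling function
kscal <- function(x, y) sigma(x) * k1(x, y) * sigma(y)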

slide-151
SLIDE 151

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

The Branin-Hoo function

The rescaled Branin-Hoo function f is defined over [0, 1]² by f(x1, x2) = fBH(15x1 − 5, 15x2), where

fBH : (x1, x2) ∈ [−5, 10] × [0, 15] → a(x2 − b x1² + c x1 − r)² + s(1 − t) cos(x1) + s,

with a = 1, b = 5/(4π²), c = 5/π, r = 6, s = 10 and t = 1/(8π). ( back )

45 / 47
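The rescaled Branin-Hoo function transcribes directly into R (a sketch of the formula above):

branin <- function(x1, x2) {
  a <- 1; b <- 5 / (4 * pi^2); c <- 5 / pi; r <- 6; s <- 10; t <- 1 / (8 * pi)
  u <- 15 * x1 - 5                                  # rescale to [-5, 10]
  v <- 15 * x2                                      # rescale to [0, 15]
  a * (v - b * u^2 + c * u - r)^2 + s * (1 - t) * cos(u) + s
}
branin(0.5, 0.5)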

slide-155
SLIDE 155

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Ordinary Kriging Equations –for completeness!–

Assume Z has a covariance kernel k and constant mean µ ∈ R. Then

mn(x) = k(Xn, x)ᵀ k(Xn, Xn)⁻¹ zn + µn (1 − k(Xn, x)ᵀ k(Xn, Xn)⁻¹ ✶n)

kn(x, x′) = k(x, x′) − k(Xn, x)ᵀ k(Xn, Xn)⁻¹ k(Xn, x′) + (1 − ✶nᵀ k(Xn, Xn)⁻¹ k(Xn, x)) (1 − ✶nᵀ k(Xn, Xn)⁻¹ k(Xn, x′)) / (✶nᵀ k(Xn, Xn)⁻¹ ✶n)

where µn = ✶nᵀ k(Xn, Xn)⁻¹ zn / (✶nᵀ k(Xn, Xn)⁻¹ ✶n).

Under standard conditions, mn and kn are Z’s conditional mean and covariance and L(Z | ZXn = zn) = GRF(mn(·), kn(·, ·′)). ( back )

46 / 47
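These equations are a few lines of linear algebra; below is a minimal R sketch of the ordinary kriging mean (illustrative: solve() stands in for the matrix inverse, and the design, data and kernel are toy choices):

ok_mean <- function(x, Xn, zn, k) {
  K <- outer(Xn, Xn, k)
  Ki <- solve(K)
  one <- rep(1, length(Xn))
  mu_n <- drop(one %*% Ki %*% zn) / drop(one %*% Ki %*% one)
  sapply(x, function(xx) {
    kx <- k(Xn, xx)                                 # vector k(Xn, x)
    drop(kx %*% Ki %*% zn) + mu_n * (1 - drop(kx %*% Ki %*% one))
  })
}
Xn <- c(0.1, 0.4, 0.8); zn <- sin(2 * pi * Xn)
ok_mean(c(0.2, 0.6), Xn, zn, function(t, s) exp(-(t - s)^2 / 0.1))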

slide-158
SLIDE 158

About GPs and their use in function modelling Examples of GPs and generalities on p.d. kernels Miscellanea

Heterogeneously noisy OK Equations

mn(x) = µn + k(Xn, x)ᵀ (k(Xn, Xn) + ∆n)⁻¹ (zn − µn ✶n)

kn(x, x′) = k(x, x′) − k(Xn, x)ᵀ (k(Xn, Xn) + ∆n)⁻¹ k(Xn, x′) + (1 − ✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ k(Xn, x)) (1 − ✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ k(Xn, x′)) / (✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ ✶n)

where µn = ✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ zn / (✶nᵀ (k(Xn, Xn) + ∆n)⁻¹ ✶n).

Under usual assumptions, and if Z and the εi’s are independent:

L(Z | An) = N(mn(·), kn(·, ·′)). ( back )

47 / 47
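The noisy case only changes the matrix being inverted; a sketch extending the ok_mean helper above, where noise_var is a hypothetical vector of known noise variances (the diagonal of ∆n):

ok_mean_noisy <- function(x, Xn, zn, k, noise_var) {
  K <- outer(Xn, Xn, k) + diag(noise_var)           # k(Xn, Xn) + Delta_n
  Ki <- solve(K)
  one <- rep(1, length(Xn))
  mu_n <- drop(one %*% Ki %*% zn) / drop(one %*% Ki %*% one)
  sapply(x, function(xx) mu_n + drop(k(Xn, xx) %*% Ki %*% (zn - mu_n * one)))
}
Xn <- c(0.1, 0.4, 0.8); zn <- sin(2 * pi * Xn)
ok_mean_noisy(c(0.2, 0.6), Xn, zn, function(t, s) exp(-(t - s)^2 / 0.1),
              noise_var = rep(0.01, 3))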