SLIDE 1

Exact Inference for Gaussian Process Regression in case of Big Data with the Cartesian Product Structure

Belyaev Mikhail (1,2,3), Burnaev Evgeny (1,2,3), Kapushev Yermek (1,2)

1 Institute for Information Transmission Problems, Moscow, Russia
2 DATADVANCE, llc, Moscow, Russia
3 PreMoLab, MIPT, Dolgoprudny, Russia

ICML 2014 workshop on New Learning Frameworks and Models for Big Data

Beijing, 2014

SLIDE 2

Approximation task

Problem statement. Let $y = g(x)$ be some unknown function. The training sample is given: $D = \{x_i, y_i\}$, $g(x_i) = y_i$, $i = 1, \dots, N$. The task is to construct an approximation $\hat{f}(x)$ such that $\hat{f}(x) \approx g(x)$.

SLIDE 3

Factorial Design of Experiments

Factors: $s_k = \{x^k_{i_k} \in X_k,\ i_k = 1, \dots, n_k\}$, $X_k \subset \mathbb{R}^{d_k}$, $k = 1, \dots, K$, where $d_k$ is the dimensionality of the factor $s_k$.

Factorial Design of Experiments: $S = \{x_i\}_{i=1}^{N} = s_1 \times s_2 \times \cdots \times s_K$.

Dimensionality of $x \in S$: $d = \sum_{k=1}^{K} d_k$.

Sample size: $N = \prod_{k=1}^{K} n_k$.

Figure: two-dimensional factorial DoE over factors x1 and x2.

SLIDE 4

Example of Factorial DoE

Figure: Factorial DoE with a multidimensional factor (axes x1, x2, x3).

SLIDE 5

DoE in engineering problems

1. Independent groups of variables (factors).
2. Training sample generation procedure: fix a value of the first factor; perform experiments varying the values of the other factors; fix another value of the first factor; perform a new series of experiments.
3. Take into account knowledge from the subject domain.

Data properties:
- Has a special structure.
- Can have a large sample size.
- Factor sizes can differ significantly.

SLIDE 6

Example of Factorial DoE

Modeling of the pressure distribution over an aircraft wing:
- Angle of attack: $s_1 = \{0, 0.8, 1.6, 2.4, 3.2, 4.0\}$.
- Mach number: $s_2 = \{0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83\}$.
- Wing point coordinates: $s_3$, 5000 points in $\mathbb{R}^3$.

The training sample: $S = s_1 \times s_2 \times s_3$. Dimensionality $d = 1 + 1 + 3 = 5$. Sample size $N = 6 \cdot 7 \cdot 5000 = 210000$.

SLIDE 7

Existing solutions

- Universal techniques. Disadvantages: do not take the sample structure into account, which leads to low approximation quality and high computational complexity.
- Multivariate Adaptive Regression Splines [Friedman, 1991]. Disadvantages: discontinuous derivatives, non-physical behaviour.
- Tensor products of splines [Stone et al., 1997; Xiao et al., 2013]. Disadvantages: only one-dimensional factors, no accuracy evaluation procedure.
- Gaussian Processes on a lattice [Dietrich and Newsam, 1997; Stroud et al., 2014]. Disadvantages: limited to a two-dimensional grid with equidistant points.

SLIDE 8

The aim is to develop a computationally efficient algorithm that takes into account the special features of factorial Designs of Experiments.

SLIDE 9

Gaussian Process Regression

Function model: $g(x) = f(x) + \varepsilon(x)$, where $f(x)$ is a Gaussian process (GP) and $\varepsilon(x)$ is Gaussian white noise. A GP is fully defined by its mean and covariance function.

The covariance function of $f(x)$:
$$K_f(x, x') = \sigma_f^2 \exp\left(-\sum_{i=1}^{d} \theta_i^2 \left(x^{(i)} - x'^{(i)}\right)^2\right),$$
where $x^{(i)}$ is the $i$-th component of the vector and $\theta = (\sigma_f^2, \theta_1, \dots, \theta_d)$ are the parameters of the covariance function.

The covariance function of $g(x)$:
$$K_g(x, x') = K_f(x, x') + \sigma_{noise}^2 \delta(x, x'),$$
where $\delta(x, x')$ is the Kronecker delta.
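As an illustration, here is a minimal sketch of this covariance function in Python with NumPy; the function names sq_exp_cov and noisy_cov and the vectorized pairwise computation are my own choices, not part of the slides.

```python
import numpy as np

def sq_exp_cov(X1, X2, sigma_f2, theta):
    """K_f(x, x') = sigma_f^2 * exp(-sum_i theta_i^2 (x_i - x'_i)^2)
    for all pairs of rows of X1 (n1 x d) and X2 (n2 x d)."""
    diff = X1[:, None, :] - X2[None, :, :]          # (n1, n2, d) pairwise differences
    return sigma_f2 * np.exp(-np.sum((theta ** 2) * diff ** 2, axis=-1))

def noisy_cov(X, sigma_f2, theta, sigma_noise2):
    """K_g = K_f + sigma_noise^2 * I on the training points."""
    return sq_exp_cov(X, X, sigma_f2, theta) + sigma_noise2 * np.eye(len(X))
```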

SLIDE 10

Choosing parameters

Maximum Likelihood Estimation. Log likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi,$$
where $|K_g|$ is the determinant of the matrix $K_g = \left(K_g(x_i, x_j)\right)_{i,j=1}^{N}$, $x_i, x_j \in S$.

The parameters $\theta^*$ are chosen such that
$$\theta^* = \arg\max_{\theta} \log p(y \mid X, \theta).$$

SLIDE 11

Final model

Prediction of $g(x)$ at a point $x$:
$$\hat{f}(x) = k^T K_g^{-1} y,$$
where $k = (K_f(x_1, x), \dots, K_f(x_N, x))$.

Posterior variance:
$$\sigma^2(x) = K_f(x, x) - k^T K_g^{-1} k.$$
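A minimal sketch of these two formulas, assuming the covariance helpers sq_exp_cov and noisy_cov from the earlier sketch; a Cholesky factorization replaces the explicit inverse for numerical stability, which is a standard substitution rather than something stated on the slide.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(X_train, y, X_test, sigma_f2, theta, sigma_noise2):
    """Posterior mean k^T K_g^{-1} y and variance K_f(x,x) - k^T K_g^{-1} k."""
    K_g = noisy_cov(X_train, sigma_f2, theta, sigma_noise2)
    K_s = sq_exp_cov(X_train, X_test, sigma_f2, theta)   # columns are the vectors k
    L = cho_factor(K_g)                                   # K_g = L L^T, O(N^3)
    mean = K_s.T @ cho_solve(L, y)
    # K_f(x, x) = sigma_f^2 for the squared-exponential covariance
    var = sigma_f2 - np.sum(K_s * cho_solve(L, K_s), axis=0)
    return mean, var
```

The $O(N^3)$ factorization in this direct approach is exactly the bottleneck the following slides address.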

SLIDE 12

Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: $O(N^3)$. In the case of factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.

SLIDE 14

Estimation of covariance function parameters

Log likelihood:
$$\log p(y \mid X, \theta) = -\frac{1}{2} y^T K_g^{-1} y - \frac{1}{2} \log |K_g| - \frac{N}{2} \log 2\pi.$$

Derivatives:
$$\frac{\partial}{\partial \theta} \log p(y \mid X, \sigma_f, \sigma_{noise}) = -\frac{1}{2} \mathrm{Tr}\left(K_g^{-1} K'\right) + \frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y,$$
where $\theta$ is a parameter of the covariance function (a component $\theta_i$, $\sigma_{noise}$ or $\sigma_{f,i}$, $i = 1, \dots, d$) and $K' = \frac{\partial K}{\partial \theta}$.

SLIDE 15

Tensors and Kronecker product

Definition. A tensor $Y$ is a $K$-dimensional matrix of size $n_1 \times n_2 \times \cdots \times n_K$:
$$Y = \left\{y_{i_1, i_2, \dots, i_K},\ \{i_k = 1, \dots, n_k\}_{k=1}^{K}\right\}.$$

Definition. The Kronecker product of matrices $A$ and $B$ is the block matrix
$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{pmatrix}.$$

SLIDE 16

Related operations

Operation vec:
$$\mathrm{vec}(Y) = \left[Y_{1,1,\dots,1}, Y_{2,1,\dots,1}, \dots, Y_{n_1,1,\dots,1}, Y_{1,2,\dots,1}, \dots, Y_{n_1,n_2,\dots,n_K}\right].$$

Multiplication of a tensor by a matrix along the $k$-th direction:
$$Z = Y \otimes_k B \iff Z_{i_1,\dots,i_{k-1},j,i_{k+1},\dots,i_K} = \sum_{i_k} Y_{i_1,\dots,i_k,\dots,i_K} B_{i_k j}.$$

Connection between tensors and the Kronecker product:
$$\mathrm{vec}(Y \otimes_1 B_1 \cdots \otimes_K B_K) = (B_1 \otimes \cdots \otimes B_K)\,\mathrm{vec}(Y). \tag{1}$$

The complexity of computing the left-hand side is $O\left(N \sum_k n_k\right)$; of the right-hand side, $O(N^2)$.
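A small NumPy sketch checking identity (1) numerically. Note that the pairing of the vec ordering with the Kronecker factor order depends on convention: the code below uses C-order flattening and contracts with B[j, i] (the transpose of the slide's $B_{i_k j}$), which is the combination under which $B_1 \otimes \cdots \otimes B_K$ matches. All names here are illustrative.

```python
import numpy as np

def mode_k_product(Y, B, k):
    """Z[..., j, ...] = sum_i B[j, i] * Y[..., i, ...] along axis k."""
    Z = np.tensordot(B, Y, axes=(1, k))   # contracted axis becomes axis 0
    return np.moveaxis(Z, 0, k)           # move it back to position k

rng = np.random.default_rng(0)
shape = (3, 4, 2)
Y = rng.standard_normal(shape)
Bs = [rng.standard_normal((n, n)) for n in shape]

# Left-hand side of (1): K cheap mode-k products, O(N * sum_k n_k)
Z = Y
for k, B in enumerate(Bs):
    Z = mode_k_product(Z, B, k)

# Right-hand side of (1): explicit Kronecker product, O(N^2)
K = Bs[0]
for B in Bs[1:]:
    K = np.kron(K, B)

assert np.allclose(Z.ravel(), K @ Y.ravel())
```

The left-hand side never materializes the $N \times N$ matrix, which is exactly what makes the factorial-DoE computations below tractable.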

SLIDE 17

Covariance function

Form of the covariance function:
$$K_f(x, y) = \prod_{i=1}^{K} k_i(x_i, y_i), \quad x_i, y_i \in s_i,$$
where $k_i$ is an arbitrary covariance function for the $i$-th factor.

Covariance matrix:
$$K = \bigotimes_{i=1}^{K} K_i,$$
where $K_i$ is the covariance matrix of the $i$-th factor.

SLIDE 18

Fast computation of loglikelihood

Proposition. Let $K_i = U_i D_i U_i^T$ be a Singular Value Decomposition (SVD) of the matrix $K_i$, where $U_i$ is an orthogonal matrix and $D_i$ is diagonal. Then:
$$|K_g| = \prod_{i_1, \dots, i_K} \mathcal{D}_{i_1, \dots, i_K},$$
$$K_g^{-1} = \left(\bigotimes_k U_k^T\right) \left(\bigotimes_k D_k + \sigma^2_{noise} I\right)^{-1} \left(\bigotimes_k U_k\right),$$
$$K_g^{-1} y = \mathrm{vec}\left[\left((Y \otimes_1 U_1 \cdots \otimes_K U_K) * \mathcal{D}^{-1}\right) \otimes_1 U_1^T \cdots \otimes_K U_K^T\right], \tag{2}$$
where $\mathcal{D}$ is the tensor of the diagonal elements of the matrix $\sigma^2_{noise} I + \bigotimes_k D_k$ and $*$ denotes element-wise multiplication.
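A sketch of formula (2) in NumPy, reusing mode_k_product from the previous sketch; eigh stands in for the SVD since each $K_i$ is symmetric positive semi-definite, and kron_gp_solve is an illustrative name, not the authors' code.

```python
import numpy as np
from functools import reduce

def kron_gp_solve(Ks, Y, sigma2_noise):
    """Compute K_g^{-1} y, K_g = kron(K_1,...,K_K) + sigma2*I, per formula (2).
    Y is the output tensor of shape (n_1, ..., n_K); the N x N matrix is never formed."""
    eig = [np.linalg.eigh(K) for K in Ks]        # K_i = U_i D_i U_i^T
    Z = Y
    for k, (d, U) in enumerate(eig):             # rotate: Y x1 U1^T ... xK UK^T
        Z = mode_k_product(Z, U.T, k)
    # eigenvalues of kron(K_1,...,K_K) are the outer product of the factor spectra
    D = reduce(np.multiply.outer, [d for d, _ in eig]) + sigma2_noise
    Z = Z / D                                    # element-wise division by the tensor D
    for k, (d, U) in enumerate(eig):             # rotate back: x1 U1 ... xK UK
        Z = mode_k_product(Z, U, k)
    return Z    # tensor whose vec is K_g^{-1} y; log|K_g| = np.sum(np.log(D))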

SLIDE 19

Computational complexity

Proposition. Calculation of the log likelihood using (2) has the following computational complexity:
$$O\left(N \sum_{i=1}^{K} n_i + \sum_{i=1}^{K} n_i^3\right).$$

Assuming $n_i \ll N$ (the number of factors is large and their sizes are close), we get
$$O\left(N \sum_i n_i\right) = O\left(N^{1 + \frac{1}{K}}\right).$$

SLIDE 20

Fast computation of derivatives

Proposition. The following statements hold:
$$\mathrm{Tr}\left(K_g^{-1} K'\right) = \left\langle \mathrm{diag}\left(\hat{D}^{-1}\right),\ \bigotimes_{i=1}^{K} \mathrm{diag}\left(U_i K_i' U_i^T\right) \right\rangle,$$
$$\frac{1}{2} y^T K_g^{-1} K' K_g^{-1} y = \frac{1}{2}\left\langle A,\ A \otimes_1 K_1^T \otimes_2 \cdots \otimes_{i-1} K_{i-1}^T \otimes_i \frac{\partial K_i^T}{\partial \theta} \otimes_{i+1} K_{i+1}^T \otimes_{i+2} \cdots \otimes_K K_K^T \right\rangle,$$
where $\hat{D} = \sigma^2_{noise} I + \bigotimes_k D_k$ and $\mathrm{vec}(A) = K_g^{-1} y$.

The computational complexity is
$$O\left(N \sum_{i=1}^{K} n_i + \sum_{i=1}^{K} n_i^3\right).$$

SLIDE 21

Gaussian Processes for factorial DoE

Issues:
1. High computational complexity: $O(N^3)$. In the case of factorial DoE the sample size $N$ can be very large.
2. Degeneracy as a result of significantly different factor sizes.

SLIDE 23

Degeneracy

Example of degeneracy

Figure: Original function (true function and training set).

Figure: Approximation obtained using GP from the GPML toolbox.

SLIDE 24

Regularization

Prior distribution:
$$\frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} \sim Be(\alpha, \beta), \quad \{i = 1, \dots, d_k\}_{k=1}^{K},$$
$$a_k^{(i)} = c_k \max_{x, y \in s_k}\left(x^{(i)} - y^{(i)}\right), \qquad b_k^{(i)} = C_k \min_{x, y \in s_k,\ x \neq y}\left(x^{(i)} - y^{(i)}\right),$$
where $Be(\alpha, \beta)$ is the Beta distribution, and $c_k$ and $C_k$ are parameters of the algorithm (we use $c_k = 0.01$ and $C_k = 2$).

Initialization:
$$\theta_k^{(i)} = \left[\frac{1}{n_k}\left(\max_{x \in s_k} x^{(i)} - \min_{x \in s_k} x^{(i)}\right)\right]^{-1}.$$

SLIDE 25

Regularization

Log likelihood:
$$\log p(y \mid X, \theta, \sigma_f, \sigma_{noise}) = -\frac{1}{2} y^T K_y^{-1} y - \frac{1}{2} \log |K_y| - \frac{N}{2} \log 2\pi$$
$$+ \sum_{k,i}\left[(\alpha - 1) \log \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}} + (\beta - 1) \log\left(1 - \frac{\theta_k^{(i)} - a_k^{(i)}}{b_k^{(i)} - a_k^{(i)}}\right)\right] - d \log B(\alpha, \beta).$$


Figure: Penalty function, log(PDF) of the prior for α = β = 2.
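A sketch of this penalty term in Python; beta_log_prior is an illustrative name, and the arrays theta, a, b are assumed to be flattened over all factor components (k, i).

```python
import numpy as np
from scipy.special import betaln

def beta_log_prior(theta, a, b, alpha=2.0, beta=2.0):
    """Beta-prior penalty added to the GP log likelihood (see the formula above)."""
    t = (theta - a) / (b - a)               # rescale each theta_k^(i) to (0, 1)
    if np.any((t <= 0.0) | (t >= 1.0)):
        return -np.inf                      # outside the support of the prior
    return np.sum((alpha - 1.0) * np.log(t) + (beta - 1.0) * np.log1p(-t)) \
        - t.size * betaln(alpha, beta)      # -d * log B(alpha, beta)
```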

SLIDE 27

Regularization

Example of regularized approximation

Figure: Original function (true function and training set).

Figure: Approximation (tensorGP-reg) obtained using the developed algorithm with regularization.

SLIDE 28

Toy problems

- A set of 34 functions (typical artificial test functions used for testing global optimization algorithms).
- Dimensionality: from 2 to 6.
- Sample sizes: from 80 to 216000, with full factorial DoE.
- Quality criteria: training time and approximation error
$$MSE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2.$$
- Tested algorithms:
  - MARS: Multivariate Adaptive Regression Splines;
  - SparseGP: sparse Gaussian Processes (GPML toolbox);
  - tensorGP: the developed algorithm without regularization;
  - tensorGP-reg: the developed algorithm with regularization.

SLIDE 29

Dolan-Moré curves

There are $T$ problems and $A$ algorithms; $e_{ta}$ is the approximation error (or training time) of the $a$-th algorithm on the $t$-th problem, and $\tilde{e}_t = \min_a e_{ta}$.
$$\rho_a(\tau) = \frac{\#\{t : e_{ta} \leq \tau \tilde{e}_t\}}{T}.$$
The higher the curve, the better the corresponding algorithm works; $\rho_a(1)$ is the fraction of problems on which the $a$-th algorithm worked best.
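A sketch of how such curves can be computed with NumPy; dolan_more is an illustrative name, and the ≤ comparison is used so that ρ_a(1) equals the fraction of problems where algorithm a is the best, as stated above.

```python
import numpy as np

def dolan_more(scores, taus):
    """Performance profiles rho_a(tau) for a (T problems x A algorithms)
    matrix of errors or training times (smaller is better)."""
    best = scores.min(axis=1, keepdims=True)              # e~_t = min_a e_ta
    return np.array([(scores <= tau * best).mean(axis=0)  # rho_a(tau), row per tau
                     for tau in taus])
```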

SLIDE 30

Results of experiments: training time

Figure: Dolan-Moré curves. Quality criterion: training time.

SLIDE 31

Results of experiments: approximation quality

Figure: Dolan-Moré curves. Quality criterion: MSE.

SLIDE 32

Real problem: Rotating disc of an impeller

Objective functions:
- p1: contact pressure;
- Srmax: maximum radial stress;
- w: weight of the disc.

The geometrical shape of the disc is parametrized by 6 input variables $x = (h_1, h_2, h_3, h_4, r_2, r_3)$ ($r_1$ and $r_4$ are fixed).

Training sample (generated from a computational physical model):
- Sample size: 14400.
- Factor sizes: {1, 8, 8, 3, 15, 5}.

Figure: Rotating disc of an impeller.

SLIDE 33

Results of the experiment

Table: Approximation errors of p1

                MAE       MSE       RRMS
  MARS          4.4644    6.5120    0.1166
  SparseGP      76.9313   86.7034   1.5530
  tensorGP-reg  0.3020    0.3981    0.0070

$$MAE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left|\hat{f}(x_i) - g(x_i)\right|, \qquad MSE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2,$$
$$RRMS = \sqrt{\frac{\sum_{i=1}^{N_{test}} \left(\hat{f}(x_i) - g(x_i)\right)^2}{\sum_{i=1}^{N_{test}} \left(\bar{y} - g(x_i)\right)^2}}, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i.$$
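These three criteria in NumPy, as a small sketch with illustrative names; the square root in RRMS follows the "relative root mean square" reading of the formula above.

```python
import numpy as np

def approximation_errors(f_hat, g_true, y_train):
    """MAE, MSE and RRMS of predictions f_hat against test values g_true."""
    err = f_hat - g_true
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rrms = np.sqrt(np.sum(err ** 2) / np.sum((y_train.mean() - g_true) ** 2))
    return mae, mse, rrms
```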

SLIDE 34

Results of approximation

2D slices of the approximations (the other parameters are fixed).

Figure: Approximation obtained using GPML Sparse GP.

Figure: Approximation of p1 over (x5, x6) obtained using the developed algorithm with regularization (tensorGP-reg), with training and test sets shown.

SLIDE 35

Further research: missing data

Reasons for missing points:
- Data generation is still in progress (each point is time-consuming to compute).
- The data generator failed at some points.

SLIDE 36

Incomplete Factorial DoE

$S_{full}$ is the full factorial DoE, $N_{full} = |S_{full}|$. $S$ is the incomplete factorial DoE, $N = |S|$.

$S \subset S_{full}$ ⇒ the covariance matrix is no longer a Kronecker product:
$$K_f \neq \bigotimes_{i=1}^{K} K_i.$$

Figure: incomplete factorial design over x1 and x2 (some grid points are missing).

SLIDE 37

Computation of loglikelihood

Notation: $\{x_i\}_{i=1}^{N_{full}} = S_{full}$.
- $\tilde{K}_f$: the covariance matrix for the full factorial DoE $S_{full}$.
- $W$: a diagonal matrix such that $W_{ii} = 1$ if $x_i \in S$ and $W_{ii} = 0$ if $x_i \notin S$.
- $\tilde{y}$: the vector of outputs $y$ extended by arbitrary values at the missing points.

Proposition. Let $\tilde{z}^*$ be a solution of
$$\left(\tilde{K}_f W \tilde{K}_f + \sigma^2_{noise} \tilde{K}_f\right) \tilde{z} = \tilde{K}_f W \tilde{y}. \tag{3}$$
Then the solution $z^*$ of $(K_f + \sigma^2_{noise} I) z = y$, i.e. $z^* = (K_f + \sigma^2_{noise} I)^{-1} y$, has the form $z^* = (\tilde{z}^*_{i_1}, \dots, \tilde{z}^*_{i_N})$, where $i_k \in \{j : x_j \in S\}$, $k = 1, \dots, N$.
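A sketch of how system (3) can be solved matrix-free with the Conjugate Gradient method, reusing the Kronecker matvec idea from earlier; it skips the preconditioning change of variables (4) from the next slide, so it does not achieve the min(R+1, N+1) iteration bound. Function names and the SciPy-based structure are my own assumptions.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def kron_matvec(Ks, v):
    """(K_1 kron ... kron K_K) v via mode-k products, O(N_full * sum_k n_k)."""
    Z = v.reshape([K.shape[0] for K in Ks])
    for k, K in enumerate(Ks):
        Z = np.moveaxis(np.tensordot(K, Z, axes=(1, k)), 0, k)
    return Z.ravel()

def solve_incomplete(Ks, w, y_tilde, sigma2_noise):
    """Solve (K~ W K~ + sigma2 K~) z~ = K~ W y~, with w the 0/1 diagonal of W,
    then read off z* at the observed points."""
    n_full = w.size
    # K~ W K~ z + sigma2 K~ z = K~ (W K~ z + sigma2 z)
    mv = lambda z: kron_matvec(Ks, w * kron_matvec(Ks, z) + sigma2_noise * z)
    A = LinearOperator((n_full, n_full), matvec=mv)
    z_tilde, info = cg(A, kron_matvec(Ks, w * y_tilde))
    return z_tilde[w.astype(bool)]
```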

SLIDE 38

Computation of loglikelihood

$R = N_{full} - N$ is the number of missing points. $U = \bigotimes_{i=1}^{K} U_i$, $D = \bigotimes_{i=1}^{K} D_i$, $\hat{D} = D + \sigma^2_{noise} I$.

The change of variables:
$$\tilde{\alpha} = \begin{cases} \left(\dfrac{\hat{D}}{D}\right)^{\frac{1}{2}} U^T \tilde{z} & \text{if } R < N, \\[6pt] \left(\dfrac{\sigma^2_{noise}}{D}\right)^{\frac{1}{2}} U^T \tilde{z} & \text{if } R \geq N. \end{cases} \tag{4}$$

Proposition. The system of linear equations (3) can be solved using the change of variables (4) and the Conjugate Gradient Method in at most $\min(R + 1, N + 1)$ iterations. The computational complexity of each iteration is $O\left(N_{full} \sum_k n_k\right)$.

SLIDE 39

Computation of determinant

$$\tilde{K}_f + \sigma^2_{noise} I = \begin{pmatrix} K_g & A \\ A^T & B \end{pmatrix}.$$

Determinant:
$$|K_g| = \frac{\left|\tilde{K}_f + \sigma^2_{noise} I\right|}{\left|B - A^T K_g^{-1} A\right|}. \tag{5}$$

$\left|\tilde{K}_f + \sigma^2_{noise} I\right|$ is computed using the formulae for the full factorial case; $\left|B - A^T K_g^{-1} A\right|$ is computed numerically.

Proposition. The complexity of computing the determinant using (5) is
$$O\left(\min\{R + 1, N + 1\} \cdot R \cdot N_{full} \sum_k n_k\right).$$

SLIDE 40

Conclusion

The developed algorithm:
- is computationally efficient;
- can handle large samples;
- takes into account the features of the given data;
- has proved efficient on a large set of toy problems as well as on real-world problems.

SLIDE 41

Thank you for your attention!

More details are given in: Belyaev, M., Burnaev, E., and Kapushev, Y. (2014). Exact inference for Gaussian process regression in case of big data with the Cartesian product structure. arXiv preprint arXiv:1403.6573.

SLIDE 42

References

Dietrich, C. R. and Newsam, G. N. (1997). Fast and exact simulation of stationary Gaussian processes through circulant embedding of the covariance matrix. SIAM J. Sci. Comput., 18(4):1088–1107.

Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–141.

Stone, C., Hansen, M., Kooperberg, C., and Truong, Y. (1997). Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 25:1371–1470.

Stroud, J. R., Stein, M. L., and Lysen, S. (2014). Bayesian and maximum likelihood estimation for Gaussian processes on an incomplete lattice. arXiv preprint arXiv:1402.4281.

Xiao, L., Li, Y., and Ruppert, D. (2013). Fast bivariate P-splines: the sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):577–599.
