Outline: Partial independence · Full independence · Variational approximation · Variational inducing kernels · Case study: a dynamic model for financial data

Efficient Sparse Approximations for Convolution Processes

Mauricio A. Álvarez Joint work with Neil Lawrence, David Luengo and Michalis Titsias

University of Manchester


Introduction: covariances for multiple outputs

For each output d we have a data set D_d = {(x_i^d, y_i^d) | i = 1, …, N_d}. Modelling each output independently, a covariance function k_d(x_i, x_j) gives one Gram matrix per output, e.g. K_1, K_2 and K_3 for data sets D_1, D_2 and D_3.

Stacking the outputs, the joint covariance has K_1, K_2, K_3 as diagonal blocks, with unknown off-diagonal blocks encoding the dependencies between outputs:

K =
⎡ K_1   ?    ?  ⎤
⎢  ?   K_2   ?  ⎥
⎣  ?    ?   K_3 ⎦

Whatever we choose for those blocks, the joint covariance K must be a valid (symmetric, positive semi-definite) covariance matrix.

Some approaches

Linear model of coregionalization.

Intrinsic coregionalization model.

Multitask kernels.

Convolution of covariances.

Convolution of processes or convolution process.


Convolution Process

A convolution process is a moving-average construction that guarantees a valid covariance function.

Consider a set of functions {f_d(x)}_{d=1}^D. Each function can be expressed as

f_d(x) = ∫_X G_d(x − z) u(z) dz = G_d(x) ∗ u(x).

Allowing the influence of more than one latent function, {u_q(z)}_{q=1}^Q, and including an independent process w_d(x),

y_d(x) = f_d(x) + w_d(x) = Σ_{q=1}^Q ∫_X G_{d,q}(x − z) u_q(z) dz + w_d(x).
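As a rough illustration (my own sketch, not from the slides), the construction can be discretised as f_d(x) ≈ Σ_k G_d(x − z_k) u(z_k) Δz. The Gaussian smoothing kernels and the length-scales below are arbitrary choices:

```python
import numpy as np

def smoothing_kernel(tau, S, ell):
    # Assumed Gaussian smoothing kernel G_d(tau), normalised to integrate to S
    return S / (ell * np.sqrt(2 * np.pi)) * np.exp(-0.5 * (tau / ell) ** 2)

rng = np.random.default_rng(0)
z = np.linspace(-3.0, 3.0, 300)    # grid for the latent function u(z)
dz = z[1] - z[0]

# Draw u from a GP with a squared exponential covariance on the grid
Kuu = np.exp(-0.5 * (z[:, None] - z[None, :]) ** 2 / 0.3 ** 2)
u = rng.multivariate_normal(np.zeros(z.size), Kuu + 1e-8 * np.eye(z.size))

# Two outputs sharing the same latent function, through different kernels:
# f_d(x) = sum_k G_d(x - z_k) u(z_k) dz   (Riemann sum for the convolution)
x = np.linspace(-2.0, 2.0, 80)
f1 = smoothing_kernel(x[:, None] - z[None, :], 1.0, 0.1) @ u * dz
f2 = smoothing_kernel(x[:, None] - z[None, :], 1.0, 0.5) @ u * dz
```

Both outputs are correlated because they share u; the wider kernel makes f2 smoother than f1.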


A pictorial representation

(Figure: the latent function u(x) is convolved with the smoothing kernels G_1(x) and G_2(x) to produce the outputs f_1(x) and f_2(x); adding the independent processes w_1(x) and w_2(x) gives the noisy outputs y_1(x) and y_2(x).)

u(x): latent function. G_d(x): smoothing kernel. f_d(x): output function. w_d(x): independent process. y_d(x): noisy output function.

Covariance of the output functions.

The covariance between y_d(x) and y_{d′}(x′) is given as

cov[y_d(x), y_{d′}(x′)] = cov[f_d(x), f_{d′}(x′)] + cov[w_d(x), w_{d′}(x′)] δ_{d,d′},

where

cov[f_d(x), f_{d′}(x′)] = ∫_X G_d(x − z) ∫_X G_{d′}(x′ − z′) cov[u(z), u(z′)] dz′ dz.


Different forms of covariance for the output functions.

Input Gaussian process:

cov[f_d, f_{d′}] = ∫_X G_d(x − z) ∫_X G_{d′}(x′ − z′) k_{u,u}(z, z′) dz′ dz

Input white noise process:

cov[f_d, f_{d′}] = ∫_X G_d(x − z) G_{d′}(x′ − z) dz

Covariance between output functions and latent functions:

cov[f_d, u] = ∫_X G_d(x − z′) k_{u,u}(z′, z) dz′
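For Gaussian smoothing kernels the white noise case has a closed form, since the convolution of two Gaussians is again a Gaussian: with G_d(τ) = S_d exp(−τ²/(2σ_d²)), one gets cov[f_d(x), f_{d′}(x′)] = S_d S_{d′} √(2π) σ_d σ_{d′} / √(σ_d² + σ_{d′}²) · exp(−(x − x′)²/(2(σ_d² + σ_{d′}²))). A quick numerical check (my own sketch; all parameter values are arbitrary):

```python
import numpy as np

def G(tau, S, sigma):
    # Unnormalised Gaussian smoothing kernel (assumed form for this check)
    return S * np.exp(-0.5 * (tau / sigma) ** 2)

S1, s1 = 1.0, 0.3   # kernel parameters for output 1 (arbitrary)
S2, s2 = 0.7, 0.5   # kernel parameters for output 2 (arbitrary)
x, xp = 0.2, -0.1

# Quadrature for cov[f_1(x), f_2(x')] = integral of G_1(x - z) G_2(x' - z) dz
z = np.linspace(-10.0, 10.0, 20001)
dz = z[1] - z[0]
quad = np.sum(G(x - z, S1, s1) * G(xp - z, S2, s2)) * dz

# Closed form: convolution of two Gaussians
closed = (S1 * S2 * np.sqrt(2 * np.pi) * s1 * s2 / np.sqrt(s1**2 + s2**2)
          * np.exp(-0.5 * (x - xp) ** 2 / (s1**2 + s2**2)))
```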


Likelihood of the full Gaussian process.

The likelihood of the model is given by

p(y|X, φ) = N(y|0, K_{f,f} + Σ),

where y = [y_1^⊤, …, y_D^⊤]^⊤ stacks the outputs, K_{f,f} is the covariance matrix with blocks cov[f_d, f_{d′}], Σ is the matrix of noise variances, φ is the set of parameters of the covariance matrix, and X = {x_1, …, x_N} is the set of input vectors.

Learning from the log-likelihood involves the inverse of K_{f,f} + Σ, whose complexity grows as O(N³D³).


Predictive distribution of the full Gaussian process.

Predictive distribution at X∗:

p(y∗|y, X, X∗, φ) = N(µ∗, Λ∗), with

µ∗ = K_{f∗,f}(K_{f,f} + Σ)^{−1} y
Λ∗ = K_{f∗,f∗} − K_{f∗,f}(K_{f,f} + Σ)^{−1} K_{f,f∗} + Σ

Prediction is O(DN) for the mean and O(D²N²) for the variance, per test point. Storage is O(D²N²).
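The predictive equations can be sketched directly in NumPy. This toy single-output example is my own illustration, with an assumed squared exponential covariance; it follows the µ∗ and Λ∗ formulas term by term:

```python
import numpy as np

rng = np.random.default_rng(1)

def k(a, b):
    # Assumed squared exponential covariance for the illustration
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.linspace(-1.0, 1.0, 20)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(X.size)
Xs = np.array([-0.5, 0.0, 0.5])          # test inputs X*
Sigma = 0.01 * np.eye(X.size)            # noise variances

Kff, Ksf, Kss = k(X, X), k(Xs, X), k(Xs, Xs)
alpha = np.linalg.solve(Kff + Sigma, y)
mu = Ksf @ alpha                                        # mean mu*
Lam = Kss - Ksf @ np.linalg.solve(Kff + Sigma, Ksf.T)   # Lambda* before adding noise
```

The solve against K_{f,f} + Σ is the step that becomes the O(N³D³) cost when the matrix stacks D outputs.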

1. Partial independence
2. Full independence
3. Variational Approximation
4. Variational Inducing Kernels
5. Case study: a dynamic model for financial data


Conditional prior distribution.

Sample from p(u):  f_d(x) = ∫_X G_d(x − z) u(z) dz.

Discretize u:  f_d(x) ≈ Σ_{∀k} G_d(x − z_k) u(z_k).

Sample from p(u|u):  f_d(x) ≈ ∫_X G_d(x − z) [u(z)|u] dz, with u(z)|u drawn from the conditional p(u|u).


The conditional independence assumption I.

This form for f_d(x) leads to the following likelihood

p(f|u, Z) = N(f | K_{f,u}K_{u,u}^{−1}u, K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f}),

where u is a discrete sample from the latent function, Z is the set of input vectors corresponding to u, K_{u,u} is the covariance matrix of the latent function values, and K_{f,u} = K_{u,f}^⊤ is the cross-covariance matrix between latent and output functions.

Even though we conditioned on u, we still have dependencies between outputs due to the uncertainty in p(u|u).
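A minimal numerical check of this conditional (my own sketch; the kernel and inducing locations are arbitrary): the conditional covariance K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f} is positive semi-definite and never exceeds the prior variance.

```python
import numpy as np

def kern(a, b):
    # Assumed squared exponential covariance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.linspace(-1.0, 1.0, 12)   # inputs of the outputs
Z = np.linspace(-1.0, 1.0, 4)    # inducing inputs Z for u
u = np.sin(2 * Z)                # a hypothetical observed value of u

Kff = kern(X, X)
Kfu = kern(X, Z)
Kuu = kern(Z, Z) + 1e-10 * np.eye(Z.size)

W = np.linalg.solve(Kuu, Kfu.T).T      # K_{f,u} K_{u,u}^{-1}
cond_mean = W @ u                      # K_{f,u} K_{u,u}^{-1} u
cond_cov = Kff - W @ Kfu.T             # K_{f,f} - K_{f,u} K_{u,u}^{-1} K_{u,f}
```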

The conditional independence assumption II.

Our key assumption is that the outputs will be independent even if we have only observed the vector of inducing values u rather than the whole function u.

(Figure: the blocks of the conditional covariance K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f}; under the assumption, each off-diagonal block K_{f_d,f_{d′}} − K_{f_d,u}K_{u,u}^{−1}K_{u,f_{d′}}, d ≠ d′, is set to zero, while the diagonal blocks are kept.)

Better approximations can be obtained when E[u|u] approximates u.


Comparison of marginal likelihoods

Integrating out u, the marginal likelihood is given as

p(y|Z, X, θ) = N(y | 0, K_{f,u}K_{u,u}^{−1}K_{u,f} + blockdiag[K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f}] + Σ).

(Figure: the exact covariance K_{f,f} compared with the approximation: the diagonal blocks K_{f_d,f_d} are kept exactly, while each off-diagonal block K_{f_d,f_{d′}} is replaced by K_{f_d,u}K_{u,u}^{−1}K_{u,f_{d′}}.)

Discrete case: [G]_{i,k} = G_d(x_i − z_k).

Predictive distribution for the sparse approximation

p(y∗|y, X, X∗, Z, θ) = N(µ∗, Λ∗), with

µ∗ = K_{f∗,u} A^{−1} K_{u,f} (D + Σ)^{−1} y
Λ∗ = D∗ + K_{f∗,u} A^{−1} K_{u,f∗} + Σ
A = K_{u,u} + K_{u,f}(D + Σ)^{−1} K_{f,u}
D∗ = blockdiag[K_{f∗,f∗} − K_{f∗,u}K_{u,u}^{−1}K_{u,f∗}]

with D = blockdiag[K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f}], the block-diagonal term from the marginal likelihood above.


Remarks

For learning, the computational demand lies in the calculation of D^{−1}, which grows as O(N³D) + O(NDM²) (with R = 1). Storage is O(N²D) + O(NDM).

For inference, the computation of the mean grows as O(DM) and the computation of the variance as O(DM2), after some pre-computations and for one test point.

The functional form of the approximation is almost identical to that of the Partially Independent Training Conditional (PITC) approximation [QR05].
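The structure of the PITC covariance (exact diagonal blocks, Nyström off-diagonal blocks) can be checked numerically. The sketch below is my own; it builds a toy three-output model in which each output is a scaled copy of one latent squared exponential process, so all covariances are available in closed form:

```python
import numpy as np

def kern(a, b):
    # Squared exponential covariance (assumed for this toy check)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X = np.linspace(-1.0, 1.0, 15)    # shared training inputs
Z = np.linspace(-1.0, 1.0, 5)     # inducing inputs
S = [1.0, 0.7, 1.3]               # per-output sensitivities (arbitrary)

# Each output is f_d = S[d] * g for one latent GP g, with u = g(Z)
Kfu = np.vstack([s * kern(X, Z) for s in S])
Kff = np.block([[si * sj * kern(X, X) for sj in S] for si in S])
Kuu = kern(Z, Z) + 1e-8 * np.eye(Z.size)
Q = Kfu @ np.linalg.solve(Kuu, Kfu.T)   # Nystrom term K_{f,u}K_{u,u}^{-1}K_{u,f}

# PITC covariance: keep exact per-output diagonal blocks, Nystrom elsewhere
N = X.size
pitc = Q.copy()
for d in range(len(S)):
    blk = slice(d * N, (d + 1) * N)
    pitc[blk, blk] = Kff[blk, blk]
```

Replacing the per-output blocks by only the diagonal entries of K_{f,f} would give the FITC variant discussed below.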


Additional conditional independencies

The N³ term in the computational complexity and the N² term in storage of PITC are still expensive for larger data sets.

An additional assumption is independence over the data points.

(Figure: graphical model with outputs f_1, f_2, …, f_D conditionally independent given the latent variables u.)

Comparison of marginal likelihoods

The marginal likelihood is given as

p(y|Z, X, θ) = N(y | 0, K_{f,u}K_{u,u}^{−1}K_{u,f} + diag[K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f}] + Σ).

(Figure: the exact covariance K_{f,f} compared with the approximation Q_{f,f}: only the diagonal entries K_{f_d,f_d}(x_i, x_i) are kept exactly; all remaining entries, both within each output block and across blocks, are replaced by the corresponding entries of K_{f,u}K_{u,u}^{−1}K_{u,f}.)

Computational requirements

The computational demand is now O(NDM²). Storage is O(NDM).

For inference, the computation of the mean grows as O(DM) and the computation of the variance as O(DM2), after some pre-computations and for one test point.

Similar to the Fully Independent Training Conditional (FITC) approximation [QR05, SG06].


Examples using PITC and FITC

For all our experiments we considered squared exponential covariance functions for the latent process, of the form

k_{u,u}(x, x′) = exp[−½ (x − x′)^⊤ L (x − x′)],

where L is a diagonal matrix which allows for different length-scales along each dimension.

The smoothing kernel had the same form,

G_d(τ) = S_d |L_d|^{1/2} / (2π)^{p/2} · exp[−½ τ^⊤ L_d τ],

where S_d ∈ ℝ and L_d is a symmetric positive definite matrix.
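In 1D (p = 1) the smoothing kernel integrates to S_d, since G_d(τ) is S_d times a normal density with precision L_d. A quick check of both expressions (my own sketch; parameter values are arbitrary):

```python
import numpy as np

def k_uu(x, xp, L):
    # Squared exponential covariance for the latent process (1D, precision L)
    return np.exp(-0.5 * L * (x - xp) ** 2)

def G(tau, S, L):
    # Smoothing kernel G_d(tau) = S * L^{1/2} / (2*pi)^{1/2} * exp(-0.5 * L * tau^2)
    return S * np.sqrt(L) / np.sqrt(2 * np.pi) * np.exp(-0.5 * L * tau ** 2)

tau = np.linspace(-20.0, 20.0, 400001)
dtau = tau[1] - tau[0]
area = np.sum(G(tau, S=2.5, L=4.0)) * dtau   # quadrature; should be close to S = 2.5
```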


Examples using PITC and FITC: Artificial data 1D

Four outputs generated from the full GP (D = 4).

(Figure: predictions on one output for the Full GP, Independent GP, FITC, and PITC models.)


Jura Data set I

Measurements of concentrations of seven heavy metals collected in the topsoil of a 14.5 km² region of the Swiss Jura.

Prediction set (259 locations) and validation set (100 locations).

Primary variable | Secondary variables
Cd | Ni, Zn
Cu | Pb, Ni, Zn

Optimisation of the locations of the inducing inputs.


Jura Data set II

Figure: Mean absolute error on Cd (cadmium) and Cu (copper) for IGP: Independent GP, P(M): PITC with M inducing points, FGP: Full GP, CK: ordinary co-kriging.


Comparison of marginal likelihoods

To obtain the above approximations, we have replaced the exact marginal likelihood p(y|θ) = N(y|0, K_{f,f} + Σ) with the approximate one p(y|θ, Z) = N(y|0, Q_{f,f}(Z) + Σ), where θ are the hyperparameters of the model.

In other words, we have changed the model and, additionally, introduced new hyperparameters Z.

Without additional restrictions, maximisation of the approximate marginal likelihood over Z might lead to overfitting.


An alternative

A different way to approach the problem is to apply approximate inference in the exact model.

Since obtaining the posterior over u is intractable (the computational complexity grows as O(N³D³)), we propose to approximate the posterior using variational inference.


Variational inference in one slide

Variational inference idea: fit a variational distribution to the true posterior by minimising the Kullback-Leibler divergence

KL(q ‖ p) = − ∫ q(u) log [p(u|y) / q(u)] du.

Minimising the KL divergence is equivalent to maximising the lower bound

log ∫ p(y, u) du ≥ L(q) = ∫ q(u) log [p(u, y) / q(u)] du.

Variational inference for convolution processes

We augment the joint distribution p(y, u) with a set of inducing variables u:

p(y, u, u) = p(y|u) p(u|u) p(u).

We want to approximate the true posterior p(u, u|y) with a distribution q(u, u) = p(u|u) φ(u), where φ(u) represents the approximate posterior over the inducing variables u.


Lower bound for the marginal likelihood

The distribution q(u, u) is obtained by minimising the KL divergence. Equivalently, we maximise the following lower bound:

L(Z, φ(u)) = ∫∫ q(u, u) log [p(y, u, u) / q(u, u)] du du
           = ∫∫ p(u|u) φ(u) log [p(y|u) p(u|u) p(u) / (p(u|u) φ(u))] du du.

Maximising the lower bound with respect to φ(u) gives

L(Z, θ) = log N(y | 0, K_{f,u}K_{u,u}^{−1}K_{u,f} + Σ) − ½ trace(Σ^{−1}[K_{f,f} − K_{f,u}K_{u,u}^{−1}K_{u,f}]).
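A defining property of this variational bound, unlike the PITC and FITC marginal likelihoods (which are approximations, not bounds), is that it never exceeds the exact log marginal likelihood log N(y|0, K_{f,f} + Σ). The toy check below is my own, for a single output with a squared exponential kernel and isotropic noise:

```python
import numpy as np

rng = np.random.default_rng(3)

def kern(a, b):
    # Assumed squared exponential covariance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def log_gauss(y, C):
    # log N(y | 0, C)
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + y.size * np.log(2 * np.pi))

X = np.linspace(-1.0, 1.0, 25)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(X.size)
Z = np.linspace(-1.0, 1.0, 5)          # inducing inputs
noise = 0.01
Sigma = noise * np.eye(X.size)

Kff, Kfu = kern(X, X), kern(X, Z)
Kuu = kern(Z, Z) + 1e-8 * np.eye(Z.size)
Qff = Kfu @ np.linalg.solve(Kuu, Kfu.T)

exact = log_gauss(y, Kff + Sigma)
# Variational bound: log N(y|0, Qff + Sigma) - 0.5 * tr(Sigma^{-1}(Kff - Qff))
bound = log_gauss(y, Qff + Sigma) - 0.5 * np.trace(Kff - Qff) / noise
```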

Remarks

Expressions for the (approximate) posterior φ(u) and the predictive distribution take similar forms to those of the PITC and FITC approximations.

The computational complexity is again O(NDM²), plus an additional trace operation.

If we remove the trace term, the form of the likelihood is similar to that of the Deterministic Training Conditional (DTC) approximation.

Since we have an additional trace term and a variational treatment, we call this approximation DTC-VAR.


An illustration: artificial data 1D revisited

Measuring the KL divergence for the 1D toy example above.

(Figure: KL divergence as a function of the number of inducing points for PITC, FITC, and DTC-VAR, with a zoomed view for DTC-VAR.)


Input functions as white noise processes

The key assumption behind the previous approximations is that we can express the conditional prior p(u|u); in other words, that the latent functions can be summarised using just a few points.

If the input function is a white noise process, this is certainly not true.


Variational inducing kernel

Instead of applying the variational framework described before to a finite set of inducing points u, we compute the bound with respect to a finite set of points λ obtained from the process

λ(z) = ∫_X T(z − z′) u(z′) dz′.

We refer to the smoothing kernel T(z − z′) as the inducing kernel.

Under this setup, the points λ are informative about the white noise process.


Comparison

With inducing points: p(y, u, u) = p(y|u) p(u|u) p(u); u is uninformative.
With an inducing kernel: p(y, u, λ) = p(y|u) p(u|λ) p(λ); λ is informative.


Lower bound

Under the same analysis as before, the variational lower bound is obtained as

L(Z, T, θ) = log N(y | 0, K_{f,λ}K_{λ,λ}^{−1}K_{λ,f} + Σ) − ½ trace(Σ^{−1}[K_{f,f} − K_{f,λ}K_{λ,λ}^{−1}K_{λ,f}]).

Example

Measuring the KL divergence for a 1D toy example.

(Figure: KL divergence as a function of the number of inducing points for DTC with a white noise input.)


Financial data set

Multivariate financial data set: the dollar prices of 3 precious metals and the top 10 currencies.

(Figure: time series of the gold price and the AUD exchange rate.)


Dynamic model

Our model: a set of coupled differential equations, driven by either a Gaussian process, a white noise process, or both,

d f_d(t)/dt = B_d f_d(t) + S_d u(t),

where B_d is a decay coefficient and S_d quantifies the influence of the process u(t).

If u(t) is a white noise process → Langevin equation → a linear stochastic differential equation.

The solution for f_d(t) has the form of convolutions. For a single output and a white noise process, f_d(t) is an Ornstein-Uhlenbeck (OU) process.
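As a sanity check (my own sketch, not from the slides), the white-noise-driven case can be simulated with the exact OU discretisation f_{t+Δ} = e^{−BΔ} f_t + ε_t, ε_t ~ N(0, S²(1 − e^{−2BΔ})/(2B)), whose stationary variance is S²/(2B). I write the decay explicitly as −B with B > 0:

```python
import numpy as np

rng = np.random.default_rng(4)
B, S = 1.0, 0.5          # decay and sensitivity (arbitrary values)
dt, n = 0.01, 500_000

# Exact discretisation of df/dt = -B f + S u(t), u(t) white noise
a = np.exp(-B * dt)
step_sd = np.sqrt(S**2 * (1 - np.exp(-2 * B * dt)) / (2 * B))
f = np.empty(n)
f[0] = 0.0
noise = step_sd * rng.standard_normal(n - 1)
for t in range(n - 1):
    f[t + 1] = a * f[t] + noise[t]

stationary_var = S**2 / (2 * B)      # theoretical OU stationary variance
emp_var = f[n // 10:].var()          # empirical variance after burn-in
```

The empirical variance of a long path should settle near S²/(2B), here 0.125.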


Some results

(Figure: real data and predictions for XPT (platinum), CAD, JPY, and AUD.)


Open questions

Choice of the kernel function

Experimental comparison

Online learning

Theoretical connections between methods.

Computational complexity

How inference is affected by different spatial configurations (isotopic vs. heterotopic).

Is there any theoretical way to know beforehand when considering the cross-covariance might help?


References I

[Hig02] David M. Higdon. Space and space-time modelling using process convolutions. In C. Anderson, V. Barnett, P. Chatwin, and A. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues, pages 37-56. Springer-Verlag, 2002.

[QR05] Joaquin Quiñonero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, 2005.

[SG06] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Yair Weiss, Bernhard Schölkopf, and John C. Platt, editors, NIPS, volume 18, Cambridge, MA, 2006. MIT Press.

[Tit09] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 567-574, Clearwater Beach, Florida, 16-18 April 2009. JMLR W&CP 5.