
SLIDE 1

State Space Gaussian Processes with Non-Gaussian Likelihoods

Hannes Nickisch¹, Arno Solin², Alexander Grigorievskiy²,³

¹Philips Research, ²Aalto University, ³Silo.AI

ICML2018

July 13, 2018

SLIDE 2

Poster #151 Non-Gaussian State Space GPs Nickisch, Solin, Grigorievskiy 2/ 14

Outline

◮ Gaussian processes
◮ Temporal GPs as stochastic differential equations (SDEs)
◮ Learning and inference with Gaussian likelihoods
◮ Speeding up computation of state space model parameters
◮ Non-Gaussian likelihoods
◮ Approximate inference algorithms
◮ Computational primitives and how to compute them
◮ Experiments

SLIDE 3


Gaussian Processes (GPs)

Def: A Gaussian process (GP) is a stochastic process such that, for any finite set of inputs t, the corresponding outputs f(t) are jointly Gaussian: f(t) ∼ N(m(t), K(t, t | θ)). Denoted: f(t) ∼ GP(m(t), k(t, t′ | θ))

◮ Used as a prior over continuous functions in statistical models
◮ Properties (e.g. smoothness) are determined by the covariance function k(t, t′ | θ)
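As a quick illustration of the definition (not from the slides): a minimal NumPy sketch that builds a covariance matrix and draws sample paths from a zero-mean GP prior. The squared-exponential kernel and its hyperparameters here are arbitrary choices for illustration.

```python
import numpy as np

def se_kernel(t, tp, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(t, t') = s2 * exp(-(t - t')^2 / (2 l^2))."""
    d = t[:, None] - tp[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

# Sample paths from the zero-mean GP prior f(t) ~ GP(0, k)
t = np.linspace(0.0, 5.0, 100)
K = se_kernel(t, t) + 1e-9 * np.eye(len(t))  # jitter for numerical stability
samples = np.random.default_rng(0).multivariate_normal(np.zeros(len(t)), K, size=3)
```

Changing the lengthscale or swapping the kernel changes the smoothness of the sampled functions, which is the sense in which the covariance function determines the GP's properties.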

SLIDE 4


Temporal Gaussian Processes

◮ Input data is 1-D, usually time
◮ Fully probabilistic (Bayesian) approach
◮ Structural components are conveniently combined by covariance operations
◮ Applicable to unevenly sampled data

Challenges:

◮ Large datasets
◮ Non-Gaussian likelihoods


SLIDE 6


GP as a Stochastic Differential Equation (SDE)

Addressing challenge 1 (large datasets).

Given a 1-D time series {y_i, t_i}, i = 1, …, N:

◮ Gaussian process model:
  f(t) ∼ GP(m(t), k(t, t′))  (GP prior)
  y | f ∼ ∏_{i=1}^N P(y_i | f(t_i))  (likelihood)
◮ Latent posterior:
  Q(f | D) = N(f | m + Kα, (K⁻¹ + W)⁻¹)
◮ Equivalent stochastic differential equation (SDE) [3]:
  df(t)/dt = F f(t) + L w(t),  f(0) ∼ N(0, P∞)
  y | f ∼ ∏_{i=1}^N P(y_i | H f(t_i))
◮ f(t) = H f(t), i.e. the GP value is a linear readout of the state vector f(t)
◮ w(t) is multidimensional white noise
◮ F, L, H, P∞ are determined by the covariance function k [3]
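For concreteness, here is what F, L, H and P∞ look like for the Matérn-3/2 covariance, a standard closed-form case from [3]. This is a sketch for exposition, not the paper's implementation:

```python
import numpy as np

def matern32_to_state_space(lengthscale, variance):
    """State space form (F, L, H, Pinf) of the Matern-3/2 covariance
    k(tau) = variance * (1 + lam*|tau|) * exp(-lam*|tau|), lam = sqrt(3)/lengthscale.
    The spectral density of the driving white noise w(t) is q = 4 * lam**3 * variance."""
    lam = np.sqrt(3.0) / lengthscale
    F = np.array([[0.0, 1.0],
                  [-lam**2, -2.0 * lam]])        # feedback matrix
    L = np.array([[0.0], [1.0]])                 # noise effect matrix
    H = np.array([[1.0, 0.0]])                   # measurement model: f(t) = H f_state(t)
    Pinf = np.array([[variance, 0.0],
                     [0.0, lam**2 * variance]])  # stationary state covariance
    return F, L, H, Pinf
```

P∞ is the stationary solution of the Lyapunov equation F P∞ + P∞ F⊤ + q L L⊤ = 0, which is one way to sanity-check such a conversion.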


SLIDE 8


Inference and Learning with Gaussian likelihood

Gaussian likelihood: P(y_i | f(t_i)) = N(y_i | f(t_i), σ_n²)

◮ Posterior parameters:
  W = σ_n⁻² I_n,  α = (K + W⁻¹)⁻¹ (y − m)
◮ Evidence:
  log Z_GPR = −½ α⊤(y − m) − ½ log|K + W⁻¹| − (N/2) log(2π)
◮ The naïve approach has O(N³) complexity
◮ Solve the SDE between time points (equivalent discrete-time model):
  f_i = A_{i−1} f_{i−1} + q_{i−1},  q_{i−1} ∼ N(0, Q_{i−1})
  y_i = H f_i + ε_i,  ε_i ∼ N(0, σ_n²)
◮ Parameters of the discrete model:
  A_i = A[Δt_i] = e^{Δt_i F},  Q_i = P∞ − A_i P∞ A_i⊤
◮ Inference and learning by the Kalman filter (KF) and Rauch–Tung–Striebel (RTS) smoother in O(N) complexity
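The O(N) evidence computation above can be sketched as a Kalman filter over the discrete-time model, assuming a zero prior mean and a stationary state space model (all names and hyperparameters here are illustrative, not the paper's code):

```python
import numpy as np
from scipy.linalg import expm

def kalman_loglik(t, y, F, H, Pinf, sigma2_n):
    """O(N) log marginal likelihood log Z_GPR for the discrete-time model
    f_i = A_{i-1} f_{i-1} + q_{i-1}, y_i = H f_i + eps_i, where
    A_i = expm(dt_i * F) and Q_i = Pinf - A_i Pinf A_i^T (stationary model)."""
    m, P = np.zeros(F.shape[0]), Pinf.copy()
    ll, t_prev = 0.0, None
    for t_i, y_i in zip(t, y):
        if t_prev is not None:                    # predict to the next time point
            A = expm((t_i - t_prev) * F)
            m = A @ m
            P = A @ P @ A.T + (Pinf - A @ Pinf @ A.T)
        t_prev = t_i
        v = y_i - (H @ m)[0]                      # innovation
        S = (H @ P @ H.T)[0, 0] + sigma2_n        # innovation variance
        k = (P @ H.T)[:, 0] / S                   # Kalman gain
        m = m + k * v                             # measurement update
        P = P - np.outer(k, (H @ P)[0])
        ll += -0.5 * (np.log(2.0 * np.pi * S) + v**2 / S)
    return ll
```

For kernels with an exact state space form (e.g. Matérn), this filter reproduces the dense O(N³) GP evidence exactly, up to floating-point error.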


SLIDE 10


Fast computation of Ai and Qi by interpolation

Problem:

◮ When there are many distinct Δt_i, computing the discrete-time parameters can be slow

Solution:

◮ ψ: s ↦ e^{sX} is a smooth mapping, hence interpolate (similar to KISS-GP [4])
◮ Evaluate ψ on an equispaced grid s_1, s_2, …, s_K, where s_j = s_0 + j · Δs
◮ Use 4-point interpolation: A ≈ c_1 A_{j−1} + c_2 A_j + c_3 A_{j+1} + c_4 A_{j+2}, where the coefficients c_1, …, c_4 are efficiently computable

[Figure: evaluation time (s) vs. number of training inputs n, comparing the naïve approach against the state space method with K = 2000 and K = 10 grid points; mean ± min/max errors visualized.]
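A sketch of the interpolation idea, using Lagrange cubic weights as one concrete choice of 4-point coefficients (the grid parameters and function names are illustrative, not the paper's):

```python
import numpy as np
from scipy.linalg import expm

def interp_expm(F, dts, s0, ds, K):
    """Approximate A(dt) = expm(dt * F) by 4-point (cubic) interpolation
    of values precomputed on the equispaced grid s_j = s0 + j * ds."""
    grid = s0 + ds * np.arange(K)
    A_grid = [expm(s * F) for s in grid]          # K matrix exponentials, computed once
    out = []
    for dt in dts:
        j = int(np.clip(np.floor((dt - s0) / ds), 1, K - 3))
        u = (dt - grid[j]) / ds                   # local coordinate, typically in [0, 1)
        c = (-u * (u - 1) * (u - 2) / 6.0,        # Lagrange cubic weights for the
             (u + 1) * (u - 1) * (u - 2) / 2.0,   # nodes j-1, j, j+1, j+2
             -(u + 1) * u * (u - 2) / 2.0,
             (u + 1) * u * (u - 1) / 6.0)
        out.append(sum(ci * A_grid[j - 1 + i] for i, ci in enumerate(c)))
    return out
```

The point is that K matrix exponentials plus cheap weighted sums replace one matrix exponential per distinct Δt_i, which pays off when there are many unevenly spaced time points.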

SLIDE 11


Non-Gaussian Likelihoods

Addressing challenge 2 (non-Gaussian likelihoods).

Posterior as a Gaussian approximation:

Q(f | D) = N(f | m + Kα, (K⁻¹ + W)⁻¹)

◮ Laplace approximation (LA)
◮ Variational Bayes (VB)
◮ Direct Kullback–Leibler minimization (KL)
◮ Assumed Density Filtering (ADF), a.k.a. single-sweep Expectation Propagation (EP)

Laplace approximation:

◮ log P(f | D) ∝ log P(y | f) + log P(f | t)
◮ Find the mode f̂ of this function by Newton's method
◮ The negative Hessian of the log likelihood at the mode gives the precision term W = −∇² log P(y | f̂)
◮ log Z_LA = −½ [α⊤ mvm_K(α) + ld_K(W) − 2 Σ_i log P(y_i | f̂_i)]
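A dense-matrix sketch of the LA mode search, with a Poisson likelihood as the example and following the numerically stable Newton iteration of [2] (Alg. 3.1). This is for exposition only; the paper computes the same quantities through the state space form:

```python
import numpy as np

def laplace_mode(K, y, lik_grads, n_iter=30):
    """Newton iterations for the Laplace approximation: maximize
    log P(y | f) - 0.5 f^T K^{-1} f (zero-mean GP prior).
    lik_grads(f, y) returns (grad, w) with w = -diag Hessian of log P(y | f)."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        g, w = lik_grads(f, y)
        sw = np.sqrt(w)
        B = np.eye(n) + sw[:, None] * K * sw[None, :]    # well-conditioned matrix
        b = w * f + g
        a = b - sw * np.linalg.solve(B, sw * (K @ b))    # stable Newton step
        f = K @ a                                        # f = (K^{-1} + W)^{-1} b
    return f, w

def poisson_grads(f, y):
    """Gradient and curvature of log P(y_i | f_i) = y_i f_i - exp(f_i) - log(y_i!)."""
    mu = np.exp(f)
    return y - mu, mu
```

At convergence f̂ satisfies the stationarity condition K⁻¹ f̂ = ∇ log P(y | f̂), and W = exp(f̂) is exactly the negative Hessian term used in the precision (K⁻¹ + W).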

SLIDE 12


Computational Primitives

The following computational primitives allow one to cast the covariance approximation in more generic terms:

◮ Linear system solving: solve_K(W, r) := (K + W⁻¹)⁻¹ r
◮ Matrix-vector multiplications: mvm_K(r) := K r
◮ Log-determinants: ld_K(W) := log|B| with well-conditioned B = I + W^{1/2} K W^{1/2}
◮ Predictions need the latent mean E[f∗] and variance V[f∗]
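Dense O(N³) reference versions of the first three primitives, for a diagonal W (the paper's contribution is computing the same quantities in O(N) via the state space form; these are only to pin down the definitions):

```python
import numpy as np

def mvm_K(K, r):
    """Matrix-vector multiplication mvm_K(r) := K r."""
    return K @ r

def solve_K(K, w, r):
    """Linear system solve solve_K(W, r) := (K + W^{-1})^{-1} r with W = diag(w)."""
    return np.linalg.solve(K + np.diag(1.0 / w), r)

def ld_K(K, w):
    """Log-determinant ld_K(W) := log|B|, B = I + W^{1/2} K W^{1/2}."""
    sw = np.sqrt(w)
    B = np.eye(len(w)) + sw[:, None] * K * sw[None, :]
    return np.linalg.slogdet(B)[1]
```

B is used instead of K + W⁻¹ because it stays well-conditioned even when entries of W⁻¹ blow up; the two are related by log|B| = log|K + W⁻¹| + log|W|.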

SLIDE 13


Tackling computational primitives

Using the state space form of temporal GPs.

SpInGP:

◮ The first two computational primitives are calculated using the SpInGP [5] approach
◮ The idea: use the state space form to compose the inverse of the covariance matrix, which turns out to be block-tridiagonal

KF and RTS smoothing:

◮ The last two primitives are solved by Kalman filtering and RTS smoothing
◮ Predictions are computed by primitive 4 and then by propagation through the likelihood

Comments:

◮ Derivatives of the computational primitives, required for learning, are computed in a similar way
◮ SpInGP involves computations with block-tridiagonal matrices; these are similar to KF and RTS smoothing (see the appendix of [1])

SLIDE 14


Experiments 2-3

Experiments are designed to emphasize the paper's findings and statements:

1. A robust regression study (Student's t likelihood) with n = 34,154 observations
2. Numerical effects in non-Gaussian likelihoods
SLIDE 15


Experiment 4

◮ A new, interesting data set of commercial airline accident dates scraped from Wikipedia [6]
◮ Accidents over a time span of ∼100 years, n = 35,959 days
◮ We model the accident intensity as a log Gaussian Cox process (Poisson likelihood)
◮ The GP prior is set up as: k(t, t′) = k_Mat.(t, t′) + k_per.(t, t′) k_Mat.(t, t′)
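A sketch of this prior's structure: a Matérn trend term plus a quasi-periodic product term (periodic × Matérn). All hyperparameter values below are hypothetical placeholders, not the ones used in the experiment:

```python
import numpy as np

def matern32(tau, lengthscale, variance):
    """Matern-3/2 covariance as a function of the lag tau."""
    a = np.sqrt(3.0) / lengthscale * np.abs(tau)
    return variance * (1.0 + a) * np.exp(-a)

def periodic(tau, period, lengthscale_p):
    """Standard periodic covariance (exp-sine-squared)."""
    return np.exp(-2.0 * np.sin(np.pi * tau / period) ** 2 / lengthscale_p**2)

def accident_prior(t, tp):
    """k(t, t') = k_Mat(t, t') + k_per(t, t') * k_Mat(t, t');
    hypothetical lengthscales: a slow trend plus a yearly quasi-periodic part."""
    tau = t[:, None] - tp[None, :]
    return (matern32(tau, 200.0, 1.0)
            + periodic(tau, 365.25, 1.0) * matern32(tau, 2000.0, 0.5))
```

Sums and products of valid covariance functions are again valid covariance functions, which is the "combining structural components by covariance operations" property mentioned earlier.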

SLIDE 16


Conclusions

◮ This paper brings together research done in state space GPs and non-Gaussian approximate inference
◮ We improve stability and provide additional speed-up by fast computation of the state space model parameters
◮ We provide unifying code for all approaches in the GPML toolbox v4.2 [7]
◮ Visit our poster: #151

SLIDE 17


References

[1] H. Nickisch, A. Solin, and A. Grigorievskiy (2018). State Space Gaussian Processes with Non-Gaussian Likelihood. In ICML.
[2] C. E. Rasmussen and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.
[3] J. Hartikainen and S. Särkkä (2010). Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In MLSP.
[4] A. G. Wilson and H. Nickisch (2015). Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In ICML.
[5] A. Grigorievskiy, N. Lawrence, and S. Särkkä (2017). Parallelizable Sparse Inverse Formulation Gaussian Processes (SpInGP). In MLSP.
[6] Wikipedia (2018). List of accidents and incidents involving commercial aircraft. URL https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft
[7] C. E. Rasmussen and H. Nickisch (2010). Gaussian Processes for Machine Learning (GPML) Toolbox. In JMLR.