SLIDE 1

Learning unknown forces in nonlinear models with Gaussian processes and autoregressive flows

Wil O C Ward w.ward@sheffield.ac.uk

Department of Physics and Astronomy, The University of Sheffield

GPSS Workshop: Structurally Constrained Gaussian Processes 12 Sep 2019

SLIDE 2

Collaborative Work

Mauricio Alvarez, Tom Ryder, Dennis Prangle

SLIDE 3

Gaussian Processes

- GPs generalise the Gaussian distribution
- Infinite-dimensional and non-parametric
- Defined in terms of a mean and a covariance function:

$$f(t) \sim \mathcal{GP}\big(m(t),\, k(t, t')\big)$$

[Figure: two panels of sample paths f(t) drawn from GP priors, plotted against t.]

SLIDE 4

Motivating Example

Consider the model
$$\frac{\mathrm{d}}{\mathrm{d}t}x = \alpha(x(t), \theta) + u(t)$$
where α : ℝ² × Θ → ℝ² are the known dynamics,
$$\alpha(x, \theta) = \begin{bmatrix} \theta_1 x_1 - \theta_2 x_1 x_2 \\ \theta_2 x_1 x_2 - \theta_3 x_2 \end{bmatrix}$$
...but θ and u(t) are unknown. How can we infer x(t) and u(t) given some noisy observations y = [x(τ_j) + ε_j]_{j=0}^{N}?
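To make the setup concrete, here is a minimal simulation sketch of this system (my illustration, not from the talk): Euler integration of the predator-prey dynamics with an assumed forcing u(t) acting on the first component, plus noisy observations. All numerical values are arbitrary assumptions.

```python
import numpy as np

def alpha(x, theta):
    """Known Lotka-Volterra dynamics alpha(x, theta)."""
    th1, th2, th3 = theta
    return np.array([th1 * x[0] - th2 * x[0] * x[1],
                     th2 * x[0] * x[1] - th3 * x[1]])

theta = (0.5, 0.25, 0.3)                              # assumed "unknown" parameters
u = lambda t: np.array([0.5 * np.sin(0.5 * t), 0.0])  # hypothetical latent force
dt = 0.01
ts = np.arange(0.0, 40.0, dt)

x = np.array([2.0, 1.0])
xs = np.empty((len(ts), 2))
for i, t in enumerate(ts):
    x = x + (alpha(x, theta) + u(t)) * dt             # simple Euler step
    xs[i] = x

# Noisy observations y_j = x(tau_j) + eps_j at N + 1 times tau_j
rng = np.random.default_rng(0)
idx = np.linspace(0, len(ts) - 1, 21).astype(int)
y = xs[idx] + 0.1 * rng.standard_normal((len(idx), 2))
```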

SLIDE 5

Motivating Example

[Figure: simulated trajectories x₁(t) and x₂(t) over t ∈ [0, 40]; the latent force u(t); and the phase-plane view of x₁ against x₂.]

SLIDE 6

Contents

1. Stochastic Differential Equations and Gaussian Processes
2. Variational Solutions to Non-Linear Latent Force Models
3. Approximate Gaussian Processes
4. Some Results
5. Recap
6. Open Issues

SLIDE 7

Contents

→ 1. Stochastic Differential Equations and Gaussian Processes

SLIDE 8

Itô Processes

Consider an ordinary differential equation describing the dynamics of some (vector-valued) function x : ℝ → ℝᵈ. The dynamics α_k : ℝᵈ → ℝᵈ are known, but the system is driven by a white-noise process whose covariance is a function of x, Σ : ℝᵈ → ℝᵈˣᵈ.

Ordinary Differential Equation with White Noise
$$\sum_{k=0}^{n}\alpha_k(x, t; \theta)\,\frac{\mathrm{d}^k}{\mathrm{d}t^k}x(t) = \Sigma^{1/2}(x, t; \theta)\,w(t)$$

SLIDE 9

Itô Processes

The same equation, now read as a Stochastic Differential Equation with drift and diffusion identified:
$$\underbrace{\sum_{k=0}^{n}\alpha_k(x, t; \theta)\,\frac{\mathrm{d}^k}{\mathrm{d}t^k}x(t)}_{\text{drift terms}} = \underbrace{\Sigma^{1/2}(x, t; \theta)}_{\text{diffusion}}\,w(t)$$

SLIDE 10

Solutions to Itô Processes

- If the system has linear dynamics, it can be solved exactly using Kalman filtering / Rauch–Tung–Striebel smoothing
- For non-linear systems, there are a number of approximation methods
- A stochastic extension of the Euler method gives iterative discrete-time estimation

Euler–Maruyama Discretisation
$$x(t_{k+1}) - x(t_k) \sim \mathcal{N}\big(\alpha(x(t_k))\,\Delta t,\ \Sigma\,\Delta t\big)$$

SLIDE 11

Solutions to Itô Processes

The same discretisation, rearranged as a generative prior:

Euler–Maruyama Discretisation as a Generative Prior
$$x(t_{k+1}) \mid x(t_k) \sim \mathcal{N}\big(x(t_k) + \alpha(x(t_k))\,\Delta t,\ \Sigma\,\Delta t\big)$$
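A minimal sketch (mine, not from the slides) of using this discretisation as a generative prior: repeatedly sample the next state from the Gaussian transition. The drift, diffusion, and step size in the example are illustrative assumptions.

```python
import numpy as np

def euler_maruyama_sample(alpha, Sigma, x0, dt, n_steps, rng):
    """Draw one path from the Euler-Maruyama generative prior:
    x_{k+1} | x_k ~ N(x_k + alpha(x_k) dt, Sigma dt)."""
    d = len(x0)
    chol = np.linalg.cholesky(Sigma * dt)   # for correlated Gaussian noise
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = xs[-1]
        mean = x + alpha(x) * dt
        xs.append(mean + chol @ rng.standard_normal(d))
    return np.array(xs)

# Example: a 1-D Ornstein-Uhlenbeck-like drift (assumed for illustration)
rng = np.random.default_rng(0)
path = euler_maruyama_sample(lambda x: -0.5 * x, np.eye(1), [1.0], 0.01, 1000, rng)
```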

SLIDE 12

Gaussian Processes as SDEs

Examples

- White noise process:
$$w(t) \sim \mathcal{GP}\big(0,\ \varsigma^2\,\delta(t - t')\big)$$
- Half-integer (ν = p + 1/2) Matérn models:
$$f_\nu(t) \sim \mathcal{GP}\left(0,\ \sigma^2\exp\big(-\lambda|t - t'|\big)\,\frac{p!}{(2p)!}\sum_{i=0}^{p}\frac{(p+i)!}{i!\,(p-i)!}\big(2\lambda|t - t'|\big)^{p-i}\right)$$
- Gaussian Radial Basis / Exponentiated Quadratic (ν → ∞):
$$f(t) \sim \mathcal{GP}\big(0,\ \sigma^2\exp(-\lambda|t - t'|^2)\big)$$
SLIDE 13

Gaussian Processes as SDEs

Examples

- White noise process: dw(t) = ς dβ
- Half-integer (ν = p + 1/2) Matérn models satisfy
$$\sum_{i=1}^{p+1}\binom{p+1}{i}\lambda^{p+1-i}\,\frac{\mathrm{d}^i}{\mathrm{d}t^i}f(t) = -\lambda^{p+1}f(t) + w(t)$$
- Gaussian Radial Basis / Exponentiated Quadratic (ν → ∞): infinitely differentiable, so it cannot be represented exactly as an Itô process

SLIDE 14

Gaussian Processes as SDEs

Examples

- White noise process: dw(t) = ς dβ
- Half-integer (ν = p + 1/2) Matérn models in companion (state-space) form:
$$\mathrm{d}f(t) = \underbrace{\begin{bmatrix} & 1 & & \\ & & \ddots & \\ & & & 1 \\ -a_1\lambda^{p+1} & -a_2\lambda^{p} & \cdots & -a_p\lambda \end{bmatrix}}_{G}\underbrace{\begin{bmatrix} f(t) \\ \mathrm{d}f/\mathrm{d}t \\ \vdots \\ \mathrm{d}^{p-1}f/\mathrm{d}t^{p-1} \end{bmatrix}}_{f(t)}\mathrm{d}t \;+\; \varsigma_\nu\begin{bmatrix} 0 \\ \vdots \\ 1 \end{bmatrix}\underbrace{\mathrm{d}\beta}_{w(t)\,\mathrm{d}t}$$
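For instance, a sketch under my own notational assumptions (the a_i taken as the binomial coefficients from the previous slide, and the diffusion scale ς_ν set to 1): build the companion matrix G and simulate the SDE with Euler–Maruyama.

```python
import numpy as np
from math import comb

def matern_companion(p, lam):
    """Companion matrix G for the half-integer Matern SDE (nu = p + 1/2).
    State: (f, df/dt, ..., d^p f/dt^p); a_i taken as binomial coefficients."""
    d = p + 1
    G = np.diag(np.ones(d - 1), k=1)   # ones on the superdiagonal
    G[-1, :] = [-comb(p + 1, i) * lam ** (p + 1 - i) for i in range(d)]
    return G

G = matern_companion(p=1, lam=1.0)     # Matern 3/2: 2-D state (f, df/dt)
L = np.array([0.0, 1.0])
dt, rng = 0.01, np.random.default_rng(1)
f = np.zeros(2)
path = np.empty(4000)
for k in range(4000):
    f = f + G @ f * dt + L * np.sqrt(dt) * rng.standard_normal()
    path[k] = f[0]                     # a draw from (approximately) the GP prior
```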

SLIDE 15

Stochastic Latent Force Models

Recall our motivating example: a mixture of known dynamics with some hidden input function. General form:
$$\alpha_0(x, t; \theta)\,x(t) + \alpha_1(x, t; \theta)\,\frac{\mathrm{d}}{\mathrm{d}t}x(t) + \ldots = u(t)$$
Placing a GP prior over u(t) gives what are termed latent force models.

M. A. Alvarez, D. Luengo, and N. D. Lawrence. Linear latent force models using Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2693–2705, 2013.

SLIDE 16

Companion Form LFMs

It is easy enough to reframe an nth-order differential equation as first-order:
$$\mathrm{d}f/\mathrm{d}t = D(f(t), \theta) + L\,w(t)$$

SLIDE 17

Companion Form LFMs

The same first-order reframing, df/dt = D(f(t), θ) + L w(t), written out in full:

Companion Form
$$f(\tau) = \begin{bmatrix} x(\tau) & \left.\tfrac{\mathrm{d}x}{\mathrm{d}t}\right|_{t=\tau} & \cdots & \left.\tfrac{\mathrm{d}^{n-1}x}{\mathrm{d}t^{n-1}}\right|_{t=\tau} & u(\tau) & \left.\tfrac{\mathrm{d}u}{\mathrm{d}t}\right|_{t=\tau} & \cdots & \left.\tfrac{\mathrm{d}^{m-1}u}{\mathrm{d}t^{m-1}}\right|_{t=\tau} \end{bmatrix}^\top$$

$$D(f(t), \theta) = \begin{bmatrix} f_2 \\ f_3 \\ \vdots \\ \breve{\alpha}_0 f_1 + \sum_{i=1}^{n-1}\breve{\alpha}_i f_{i+1} + f_{n+1} \\ f_{n+2} \\ f_{n+3} \\ \vdots \\ a_0 f_{n+1} + \sum_{i=1}^{m-1} a_i f_{n+i+1} \end{bmatrix}, \qquad L = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}$$

SLIDE 18

Contents

→ 2. Variational Solutions to Non-Linear Latent Force Models

SLIDE 19

Inferring the Joint Posterior of a Non-Linear LFM

Problem: infer f and θ in
$$\frac{\mathrm{d}}{\mathrm{d}t}f(t) = D(f(t), \theta) + L\,w(t)$$

- We cannot infer f exactly if D is non-linear, since the joint posterior is intractable
- Pseudo-chaos under some systems
- Non-linear versions of filters/smoothers exist, e.g. E/UKF, ADF, SMC
- Joint parameter estimation is difficult, as is using autodifferentiation

J. Hartikainen, M. Seppänen, and S. Särkkä. State-space inference for non-linear latent force models with application to satellite orbit prediction. In ICML, pages 723–730, 2012.

SLIDE 20

Variational Bridge Constructs

We want to build a variational approximation of the conditional posterior p(x, u, θ | y).

Variational Bayes: find q* ∈ 𝒬 such that
$$q^* = \arg\min_{q\in\mathcal{Q}}\ \mathrm{KL}\big[\,q(x, u, \theta)\ \big\|\ p(x, u, \theta \mid y)\,\big]$$
where 𝒬 is a family of distributions parameterised by φ.

SLIDE 21

Variational Bridge Constructs

Equivalently, in the companion-form state, approximate p(f, θ | y):
$$q^* = \arg\min_{q\in\mathcal{Q}}\ \mathrm{KL}\big[\,q(f, \theta)\ \big\|\ p(f, \theta \mid y)\,\big]$$
where 𝒬 is a family of distributions parameterised by φ.

SLIDE 22

Variational Bridge Constructs

Evidence Lower Bound (ELBO)
$$\mathcal{L}(\phi) = \mathbb{E}_{f,\theta\sim q}\big[\log p(f, \theta, y) - \log q(f, \theta)\big]$$

SLIDE 23

Variational Bridge Constructs

Unbiased Evidence Lower Bound (ELBO)
$$\hat{\mathcal{L}}(\phi) = \frac{1}{n_s}\sum_{i=1}^{n_s}\log\frac{p(\theta^{(i)})\,p(f^{(i)} \mid \theta^{(i)})\,p(y \mid f^{(i)}, \theta^{(i)})}{q(\theta^{(i)})\,q(f^{(i)} \mid \theta^{(i)})}$$
where f^{(i)} ∼ q(f | θ^{(i)}) and θ^{(i)} ∼ q(θ), for i = 1, …, n_s.
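In code, this estimator is just an average of log-ratios over joint samples. A sketch (mine; all the density and sampler arguments are placeholders standing in for the model's terms):

```python
import numpy as np

def elbo_hat(log_p_theta, log_p_f_given_theta, log_lik,
             log_q_theta, log_q_f_given_theta,
             sample_theta, sample_f, n_s=32):
    """Unbiased Monte Carlo estimate of the ELBO."""
    total = 0.0
    for _ in range(n_s):
        theta = sample_theta()          # theta^(i) ~ q(theta)
        f = sample_f(theta)             # f^(i) ~ q(f | theta^(i))
        total += (log_p_theta(theta) + log_p_f_given_theta(f, theta)
                  + log_lik(f, theta)
                  - log_q_theta(theta) - log_q_f_given_theta(f, theta))
    return total / n_s
```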

SLIDE 24

Variational Bridge Constructs

Likelihood Agnostic

The unbiased ELBO above is valid for any (differentiable?) observation model p(y | f, θ).

SLIDE 25

Black-box Variational Inference

Black-box variational inference (BBVI) is predicated on the fact that the gradient of the ELBO can be written as an unbiased average. That is straightforward here, since we already have L̂(φ) as an unbiased average.

SLIDE 26

Black-box Variational Inference

Monte Carlo approximation of the ELBO gradient:
$$\nabla_\phi\mathcal{L}(\phi) \approx \frac{1}{n_s}\sum_{i=1}^{n_s}\nabla_\phi\log q(f^{(i)}, \theta^{(i)})\,\log\frac{p(f^{(i)}, \theta^{(i)}, y)}{q(f^{(i)}, \theta^{(i)})}$$
where f^{(i)} ∼ q(f | θ^{(i)}) and θ^{(i)} ∼ q(θ), for i = 1, …, n_s.

R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, 2014.
D. Duvenaud and R. P. Adams. Black-box stochastic variational inference in five lines of Python. In NIPS Workshop on Black-box Learning and Inference, 2015.

SLIDE 27

Black-box Variational Inference

Algorithm 1: BBVI with gradient ascent

    Initialise φ₀ (randomly); j ← 0
    while not converged do
        Calculate ∇_φ L(φ_j)
        Update φ w.r.t. the ELBO gradient, e.g. φ_{j+1} ← φ_j + h ∇_φ L(φ_j)
        j ← j + 1
    end while

The result is a variational approximation q(f, θ | φ_j) ≈ p(f, θ | y).
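As a toy end-to-end illustration of Algorithm 1 (my sketch, not the authors' code): score-function BBVI fitting q(θ) = N(m, s²) to an assumed unnormalised log-posterior. Step size, sample count, and the target are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
log_p = lambda th: -0.5 * (th - 2.0) ** 2    # toy unnormalised log-posterior

m, log_s, h, n_s = 0.0, 0.0, 0.01, 64         # phi = (m, log s)
for j in range(2000):
    s = np.exp(log_s)
    th = m + s * rng.standard_normal(n_s)     # theta^(i) ~ q(theta)
    logq = -0.5 * ((th - m) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    w = log_p(th) - logq                      # log (p / q)
    # score function of log q w.r.t. (m, log s)
    dm = (th - m) / s ** 2
    dls = ((th - m) / s) ** 2 - 1.0
    m += h * np.mean(dm * w)                  # gradient ascent on the ELBO
    log_s += h * np.mean(dls * w)
# q should end up near N(2, 1): print(m, np.exp(log_s))
```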

SLIDE 28

Parameter Estimation: q(θ)

Commonly in system estimation the model parameters θ are unknown. We can give these a Bayesian treatment too, using a variational representation of their posterior. Any variational approach works here, e.g. mean-field:
$$q(\theta) = \prod_i \mathcal{N}(\theta_i \mid m_i, s_i)$$
Here the free parameters are scalars, φ_θ = {(m_i, s_i)}_{∀i}.

SLIDE 29

Filtering Density: p(f | θ)

Represent the stochastic process f as a filtering distribution with moments m(t) and P(t).

Mean term:
$$\frac{\mathrm{d}}{\mathrm{d}t}m(t) = D(m, t; \theta)$$

SLIDE 30

Filtering Density: p(f | θ)

Mean term as above. Covariance term: ??

SLIDE 31

Extended Filtering Density: p(f | θ)

Covariance term (linearising D about the mean, as in the extended Kalman filter):
$$\frac{\mathrm{d}}{\mathrm{d}t}P(t) = J_D(m, t; \theta)\,P(t) + P(t)\,J_D(m, t; \theta)^\top + L\varsigma^2 L^\top$$

SLIDE 32

Extended Filtering Density: p(f | θ)

Assume steady state: dP/dt = 0.

SLIDE 33

Extended Filtering Density: p(f | θ)

Denote the covariance in steady state by Σ̃ and solve
$$J_D(m, t; \theta)\,\tilde{\Sigma} + \tilde{\Sigma}\,J_D(m, t; \theta)^\top = -L\varsigma^2 L^\top$$

SLIDE 34

Extended Filtering Density: p(f | θ)

This is an example of a continuous Lyapunov equation: easy to solve numerically, but we need a differentiable form of Σ̃.
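For fixed values, the numerical route is a one-liner. A sketch using SciPy (the matrices here are illustrative assumptions; note that this numerical solve does not by itself give gradients, which is exactly the problem flagged above):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative Jacobian and diffusion for a 2-D state (assumed, stable values)
J = np.array([[-0.2, 1.0],
              [0.0, -1.5]])        # J_D(m, t; theta) at some fixed point
varsigma2 = 3.0                    # spectral density of the white noise
L = np.array([[0.0], [1.0]])

# Solve J Sigma + Sigma J^T = -L varsigma^2 L^T for the steady state
Sigma = solve_continuous_lyapunov(J, -varsigma2 * (L @ L.T))
assert np.allclose(J @ Sigma + Sigma @ J.T, -varsigma2 * (L @ L.T))
```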

SLIDE 35

Extended Filtering Density: p(f | θ)

Note that Σ̃ may be a function of m(t) and θ, so it is stochastic too.

SLIDE 40

Transition density: p(f_k | f_{k−1}, θ)

We construct a discrete-time transition density using Euler–Maruyama:
$$p(f_k \mid f_{k-1}, \theta) = \mathcal{N}(f_k \mid \mu_\Delta, \Sigma_\Delta),$$
where
$$\mu_\Delta = f_{k-1} + D(f_{k-1}, t_k; \theta)\,\Delta t,$$
$$\Sigma_\Delta = \tilde{\Sigma}(f_{k-1}, t_k; \theta) - \exp(\Delta t\,J_D)\,\tilde{\Sigma}(f_{k-1}, t_k; \theta)\,\exp(\Delta t\,J_D)^\top$$

SLIDE 41

Generative model: p(f | θ)

Marginal:
$$p(f \mid \theta) = p(f_0 \mid \theta)\prod_{k=1}^{T} p(f_k \mid f_{k-1}, \theta)$$

SLIDE 42

Generative model: p(f | θ)

Additional points on the marginal above:
- f_k = f(t_k) and f_{k+1} = f(t_{k+1}) = f(t_k + Δt)
- p(f_k | f_{k−1}) ≡ p(x_k | x_{k−1}, u_k) p(u_k | u_{k−1})

SLIDE 43

Variational Approximation: q(f | θ)

- A family of distributions parameterised by φ
- Needs to be flexible, sampleable, and invertible (for autodifferentiation)

SLIDE 44

Variational Approximation: q(f | θ)

- Look to (Bayesian) neural networks and other deep models

SLIDE 45

Variational Approximation: q(f | θ)

- Need to encode temporal (recurrent) structure

SLIDE 46

Variational Approximation: q(f | θ)

Candidates:
- RNNs with priors on the weights

SLIDE 47

Variational Approximation: q(f | θ)

Family of distributions parameterised by φ Needs to be flexible, sampleable and invertible (for autodifferentiation) Look to (Bayesian) neural networks and other deep models Need to encode temporal (recurrent) structure

RNNs with priors on the weights Normalising flows

SLIDE 48

Parametrising q with an RNN

Pros
- Can represent high-dimensional recurrent structure
- A bi-directional RNN can represent the first-order Markov properties of the model
- Priors over the weights, optimising in weight-space

SLIDE 49

Parametrising q with an RNN

Cons
- Need to sample sequentially
- Backpropagation through time (BPTT) is inefficient for propagating gradients
- Doesn't handle latent dimensions well

SLIDE 50

Inverse Autoregressive Flows

- We want to define a distribution for f that is invertible and expressive
- Inverse autoregressive flows (IAFs) introduce a base random vector z₀ ∼ N(0, I)
- Layers of this random variable are shifted and scaled through 1-D convolutions to create an autoregressive model
- Very flexible, and can be sampled in parallel

SLIDE 51

Inverse Autoregressive Flows

Autoregressive Flows
$$z_j = \sigma_j \odot z_{j-1} + \mu_j$$
where [μ_j, s_j] = autoregressiveNN(z_{j−1}, y, θ), σ_j = log(1 + exp s_j), and finally f = bijector(z_N).
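A minimal numpy sketch of one such flow layer (my illustration; the real model uses an autoregressive neural network, replaced here by a masked random linear map so the example is self-contained):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                       # length of the rolled-out state sequence
z = rng.standard_normal(T)   # base sample z_0 ~ N(0, I)

def ar_layer(z, rng):
    """One inverse autoregressive flow layer: z_j = sigma_j * z_{j-1} + mu_j.
    [mu, s] come from an autoregressive map (stand-in: masked linear layers)."""
    mask = np.tril(np.ones((T, T)), k=-1)         # each output sees only the past
    W_mu = mask * rng.standard_normal((T, T)) * 0.1
    W_s = mask * rng.standard_normal((T, T)) * 0.1
    mu, s = W_mu @ z, W_s @ z
    sigma = np.log1p(np.exp(s))                   # softplus, as on the slide
    return sigma * z + mu, np.sum(np.log(sigma))  # new z and log|det Jacobian|

logdet = 0.0
for _ in range(3):                                # N = 3 flow layers
    z, ld = ar_layer(z, rng)
    logdet += ld
```

Because the map is autoregressive, its Jacobian is triangular, so the log-determinant is just the sum of log σ terms accumulated above.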

SLIDE 52

Autoregressive Neural Network Layers

Algorithm 2: jth Autoregressive Neural Network Layer

    ξ^(0a) ← conv1d(z_{j−1}, y, t)
    ξ^(0b) ← dense(θ)
    ξ^(1) ← elu(ξ^(0a) + ξ^(0b))
    for i = 2 … n_ℓ do
        ξ^(i) ← batchnorm(conv1d(elu(ξ^(i−1))))
    end for
    [μ_j, s_j] ← conv1d(ξ^(n_ℓ))
    σ_j ← softplus(s_j)
    z_j ← σ_j ⊙ z_{j−1} + μ_j
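A rough PyTorch transcription of Algorithm 2 (my sketch; the layer widths, kernel sizes, and how z, y, t, and θ are stacked into channels are all assumptions, and the causal/local masking discussed on the next slide is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ARLayer(nn.Module):
    """Sketch of Algorithm 2: one autoregressive NN layer producing (mu, s)."""
    def __init__(self, in_ch=3, hidden=32, n_hidden=2, theta_dim=4):
        super().__init__()
        self.conv0 = nn.Conv1d(in_ch, hidden, 3, padding=1)
        self.dense0 = nn.Linear(theta_dim, hidden)
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, 3, padding=1) for _ in range(n_hidden)])
        self.norms = nn.ModuleList(
            [nn.BatchNorm1d(hidden) for _ in range(n_hidden)])
        self.conv_out = nn.Conv1d(hidden, 2, 3, padding=1)  # channels: mu and s

    def forward(self, z, y, t, theta):
        # z, y, t: (batch, T); theta: (batch, theta_dim)
        xi = F.elu(self.conv0(torch.stack([z, y, t], dim=1))
                   + self.dense0(theta).unsqueeze(-1))
        for conv, norm in zip(self.convs, self.norms):
            xi = norm(conv(F.elu(xi)))
        mu, s = self.conv_out(xi).unbind(dim=1)
        sigma = F.softplus(s)
        return sigma * z + mu, sigma
```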

SLIDE 53

Locally Masked Multivariate Inverse Autoregressive Flows

- Passing the entire flow vector z_j can lead to complex (unrepresentative) temporal dependencies
- We use a local receptive field to update flow layers (similar to WaveNet)
- The multidimensional state f_k is rolled out in sequence
- Hacks and tricks to approximate a locally informed flow state

SLIDE 54

Locally Masked Multivariate Inverse Autoregressive Flows

[Diagram: flow layers z₀ → z₁ → z₂ generating the state sequence f_{t_{k−1}}, f_{t_k}, f_{t_{k+1}}, with each update drawing on a local receptive field.]

SLIDE 55

Variational Log Density

$$\log q(f \mid \theta) = -\frac{1}{2}z_0^\top z_0 - \frac{T}{2}\log 2\pi - \sum_{t=1}^{T}\sum_{j=1}^{N}\log\sigma_j + \log\big|J^{-1}(f)\big|$$

SLIDE 56

Variational Log Density

For each sample, with z^{(i)} ∼ N(0, I):
$$\log q(f^{(i)} \mid \theta^{(i)}) = -\frac{1}{2}z_0^{(i)\top} z_0^{(i)} - \frac{T}{2}\log 2\pi - \sum_{t=1}^{T}\sum_{j=1}^{N}\log\sigma_j^{(i)} + \log\big|J^{-1}(f^{(i)})\big|$$

SLIDE 57

Unbiased Evidence Lower Bound

ELBO
$$\hat{\mathcal{L}}(\phi) = \frac{1}{n_s}\sum_{i=1}^{n_s}\log\frac{p(\theta^{(i)})\,p(f^{(i)} \mid \theta^{(i)})\,p(y \mid f^{(i)}, \theta^{(i)})}{q(\theta^{(i)})\,q(f^{(i)} \mid \theta^{(i)})}$$
where f^{(i)} ∼ q(f | θ^{(i)}) and θ^{(i)} ∼ q(θ), for i = 1, …, n_s.

SLIDE 58

Contents

→ 3. Approximate Gaussian Processes

SLIDE 59

Exponential Gaussian Process

$$f(t) \sim \mathcal{GP}\big(0,\ \sigma_f^2\exp(-\lambda|t - t'|)\big)$$

SLIDE 60

Exponential Gaussian Process

$$f(t) \sim \mathcal{GP}\big(0,\ \sigma_f^2\exp(-\lambda|t - t'|)\big)$$
$$\mathrm{d}f(t) = -\lambda f(t)\,\mathrm{d}t + \sqrt{2\sigma_f^2\lambda}\,\mathrm{d}\beta(t)$$
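This correspondence is easy to check numerically (my sketch): simulate the Ornstein–Uhlenbeck SDE by Euler–Maruyama and compare its stationary variance to σ_f².

```python
import numpy as np

lam, sigma_f2, dt = 1.0, 1.0, 0.01
rng = np.random.default_rng(0)

f = 0.0
samples = []
for k in range(100_000):
    f += -lam * f * dt + np.sqrt(2 * sigma_f2 * lam * dt) * rng.standard_normal()
    if k > 10_000:            # discard burn-in before the process is stationary
        samples.append(f)

print(np.var(samples))        # should be close to sigma_f2 = 1.0
```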

SLIDE 61

Exponential Gaussian Process

Samples from q(f | θ)

[Figure: variational sample paths f(t) over t, with observations marked.]

SLIDE 62

Exponential Gaussian Process

Samples from q(f | θ) and mean and covariance of p(f | y, θ)

[Figure: variational paths and observations, overlaid with the exact GP mean fit and 95% CI.]

SLIDE 63

Model Criticism

- Visual confirmation is fine: it looks like a good estimate
- But empirical evidence for reliability is needed
- Map corresponding samples from p and q into an RKHS
- Two-sample test with MMD to validate the approximation

Maximum Mean Discrepancy (MMD)
- MMD is a measure of distance between two probability distributions
- Samples are embedded in an RKHS
- The metric describes distance as a norm in that RKHS
- Two-sample testing for H₀: MMD²(μ_p, μ_q) = 0

A. Gretton, et al. A kernel two-sample test. JMLR, 2012.
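A compact sketch of the (biased, V-statistic) MMD² estimate with an RBF kernel, as one might use for this kind of two-sample check (the bandwidth choice and sample data are assumptions):

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between samples X (n,d) and Y (m,d),
    using an RBF kernel k(a,b) = exp(-gamma ||a-b||^2)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))           # e.g. paths sampled from p
Y = rng.standard_normal((100, 2)) + 0.5     # e.g. paths sampled from q
print(mmd2(X, Y))
```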

SLIDE 64

Model Criticism

MMD² values comparing samples from q(f | θ) and p(f | y, θ), fit on different numbers of observations N:

    Epoch     10       100      500      1 000    2 500    25 000
    N = 6     0.1111   0.1267   0.0596   0.0484   0.0556   –
    N = 20    0.2731   0.1147   0.0654   0.0696   0.0471   0.0316

Thresholds for rejection at 95% confidence: 0.0371 (N = 6) and 0.0337 (N = 20).

SLIDE 65

Matérn Covariances and Non-Gaussian Likelihoods

Summary statistics for p(y | f^{(i)}, θ), f^{(i)} ∼ q(f | θ), plotted against the true latent function f₂(t).

[Figure: f₂(t), observations, mean and median variational paths, and the 95% CI of the variational paths, over t.]

SLIDE 66

Matérn Covariances and Non-Gaussian Likelihoods

Approximating Matérn 3/2 GPs

SLIDE 67

Contents

→ 4. Some Results

SLIDE 68

Toy Non-Linear ODE

$$\frac{\mathrm{d}}{\mathrm{d}t}x(t) = -\frac{2}{3}\sin(\omega x(t)) + u(t)$$

SLIDE 69

Toy Non-Linear ODE

$$\frac{\mathrm{d}}{\mathrm{d}t}x(t) = -\frac{2}{3}\sin(\omega x(t)) + u(t), \qquad u(t) \sim \mathcal{GP}\big(0,\ k_{\nu=1/2}(t, t')\big)$$

SLIDE 70

Toy Non-Linear ODE

In companion form:
$$\frac{\mathrm{d}}{\mathrm{d}t}\underbrace{\begin{bmatrix} x(t) \\ u(t) \end{bmatrix}}_{f(t)} = \underbrace{\begin{bmatrix} -2\cos(\omega f_1)/3 + f_2 \\ -\lambda f_2 \end{bmatrix}}_{D(f(t),\,\theta)} + \underbrace{\begin{bmatrix} 0 \\ 1 \end{bmatrix}}_{L}\,w(t)$$

SLIDE 71

Toy Non-Linear ODE

The Jacobian of D(f(t), θ) w.r.t. f is
$$J_D(f(t)) = \begin{bmatrix} 2\omega\sin(\omega f_1)/3 & 1 \\ 0 & -\lambda \end{bmatrix}$$

SLIDE 72

Toy Non-Linear ODE

The steady-state covariance Σ̃ then satisfies
$$J_D(f(t))\,\tilde{\Sigma} + \tilde{\Sigma}\,\big[J_D(f(t))\big]^\top + 2\lambda\sigma^2 L L^\top = 0$$

SLIDE 73

Toy Non-Linear ODE

Solving the Lyapunov equation above gives
$$\tilde{\Sigma} = \begin{bmatrix} \dfrac{\sigma^2\lambda}{\frac{2\lambda\omega\sin(\omega f_1)}{3}\left(\frac{2\omega\sin(\omega f_1)}{3} - \lambda\right)} & \dfrac{\sigma^2\lambda}{\lambda^2 - \frac{2\lambda\omega\sin(\omega f_1)}{3}} \\[2ex] \dfrac{\sigma^2\lambda}{\lambda^2 - \frac{2\lambda\omega\sin(\omega f_1)}{3}} & \sigma^2 \end{bmatrix}$$
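A quick numerical sanity check of this closed form (my sketch; parameter values are arbitrary, with f₁ chosen so that J_D is stable):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

lam, sigma2, omega, f1 = 1.5, 0.8, 2.0, -0.3   # arbitrary test values
a = 2 * omega * np.sin(omega * f1) / 3          # recurring term in J_D

J = np.array([[a, 1.0], [0.0, -lam]])
L = np.array([[0.0], [1.0]])
numeric = solve_continuous_lyapunov(J, -2 * lam * sigma2 * (L @ L.T))

closed = np.array([
    [sigma2 * lam / (lam * a * (a - lam)), sigma2 * lam / (lam**2 - lam * a)],
    [sigma2 * lam / (lam**2 - lam * a),    sigma2],
])
print(np.allclose(numeric, closed))             # expect True
```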

SLIDE 74

Toy Non-Linear ODE

Samples of joint posterior of x, u and θ

SLIDE 75

Real World: Gene Expression Data

Multi-output system with a non-linear dependency on the input:
$$\frac{\mathrm{d}}{\mathrm{d}t}x_d(t) = a_d - b_d x_d(t) + s_d\,\frac{u(t)}{\gamma_d + u(t)}$$
- x_d is a model of gene expression, which is noisily observable
- u models the concentration of the transcription factor regulating the observed genes

SLIDE 76

Real World: Gene Expression Data

with
$$x_d(t), u(t) > 0, \qquad \theta_d = \{a_d, b_d, s_d, \gamma_d\}, \qquad d = 1, \ldots, ?$$

SLIDE 77

Real World: Gene Expression Data

Place a GP prior over exp u(t).

SLIDE 78

Real World: Gene Expression Data

…and infer θ_d simultaneously.

SLIDE 79

Real World: Gene Expression Data

Inferred TF concentration and predicted gene expressions for tnfrsf10b (blue) and p26 sesn1 (red)

SLIDE 80

Contents

→ 5. Recap

SLIDE 81

Overview

- GP priors on non-linear forced models are non-linear SDEs

SLIDE 82

Overview

- Filtering approaches struggle with joint parameter estimation
- Sequential inference is slow for propagating gradients

SLIDE 83

Overview

- With inverse autoregressive flows we can batch-sample time series

SLIDE 84

Overview

- We can construct an approximate model for the joint posterior and infer state, input, and parameters by optimising NN weights

SLIDE 85

Overview

- The approximation of GPs is quantifiably good

SLIDE 86

Contents

→ 6. Open Issues

SLIDE 87

Open Issue: Calculating Steady State Covariance

Solving the continuous Lyapunov equation
$$J_D(f, t; \theta)\,\tilde{\Sigma} + \tilde{\Sigma}\,J_D(f, t; \theta)^\top = -L\varsigma^2 L^\top$$
is possible for fixed values of f, t, and θ using numerical solvers, but hard to do online, so no gradients! Solving it manually becomes increasingly difficult as the dimension grows: the solution is a system of d(d − 1)/2 equations.

SLIDE 88

Open Issue: Dimensionality

- Smoother GPs require more orders of differentiation in the state
- Approximations of infinitely-differentiable covariance functions (e.g. periodic, Gaussian RBF) require a series approximation
- The state dimension is proportional to the series truncation threshold

SLIDE 89

References

- M. A. Alvarez, D. Luengo, and N. D. Lawrence. Linear latent force models using Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2693–2705, 2013.
- D. Duvenaud and R. P. Adams. Black-box stochastic variational inference in five lines of Python. In NIPS Workshop on Black-box Learning and Inference, 2015.
- A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
- J. Hartikainen, M. Seppänen, and S. Särkkä. State-space inference for non-linear latent force models with application to satellite orbit prediction. In Proceedings of the 29th International Conference on Machine Learning, pages 723–730, 2012.
- R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.