SLIDE 1 Variational Inference for Diffusion Processes
Cédric Archambeau
Xerox Research Centre Europe, cedric.archambeau@xerox.com
Joint work with Manfred Opper.
Statlearn ’11 Grenoble, March 2011
SLIDE 2 Stochastic differential systems
Many real dynamical systems are continuous in time:
Data assimilation (e.g. numerical weather prediction)
Systems biology (e.g. cellular stress response, transcription factors)
fMRI brain image data (e.g. voxel-based activity)
Modelled by stochastic differential equations (SDEs):
dx(t) = f(x(t), t) dt + D^{1/2}(x(t), t) dw(t),
where dw(t) is a Wiener process (Brownian motion):
dw(t) = lim_{∆t→0} ε_t √∆t,   ε_t ∼ N(0, I).
Deterministic drift f and stochastic diffusion component D
Continuous-time limit of a discrete-time state-space model
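The limit above says that a Wiener increment over a step ∆t is ε√∆t with ε ∼ N(0, I), which yields the Euler-Maruyama scheme ∆x = f ∆t + D^{1/2} ∆w. A minimal 1-D sketch (the helper name `euler_maruyama` and the Ornstein-Uhlenbeck example are illustrative, not from the slides):

```python
import numpy as np

def euler_maruyama(f, D_sqrt, x0, dt, n_steps, rng):
    """Simulate dx = f(x, t) dt + D^{1/2}(x, t) dw with the Euler-Maruyama scheme."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        dw = rng.normal(0.0, np.sqrt(dt))  # Wiener increment: eps * sqrt(dt), eps ~ N(0, 1)
        x[k + 1] = x[k] + f(x[k], t) * dt + D_sqrt(x[k], t) * dw
    return x

rng = np.random.default_rng(0)
# Ornstein-Uhlenbeck process: linear drift, constant diffusion.
path = euler_maruyama(f=lambda x, t: -2.0 * x, D_sqrt=lambda x, t: 0.5,
                      x0=1.0, dt=1e-3, n_steps=5000, rng=rng)
```

Any discretisation step can be chosen a posteriori, as the slides note; the scheme only approximates the transition density over one step.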
SLIDE 3 Stochastic differential systems
Why should we bother?
A lot of theory, few (effective) data-driven approaches
Time discretisation is unavoidable in practice
Physics models enforce continuity constraints, such that the number of observations can be relatively small
High-frequency fluctuations can be incorporated into the diffusion
Any discrete representation can be chosen a posteriori
Easy to handle irregular sampling/missing data
Bayesian approaches are natural:
The SDE induces a non-Gaussian prior over sample paths
Define a noise model (or likelihood) and simulate the posterior process over trajectories via MCMC (Beskos et al., 2009)
Or develop fast deterministic approximations
SLIDE 4
Overview
Setting, notations and variational inference Partially observed diffusion processes Gaussian variational approximation Experiments and conclusion
SLIDE 5 Bayesian inference (framework and notations)
Predictions are made by averaging over all possible models:
p(y*|y) = ∫ p(y*|x) p(x|y) dx.
The latent variables are inferred using Bayes' rule:
p(x|y) [posterior] = p(y|x) [likelihood] × p(x) [prior] / p(y),   p(y) = ∫ p(y|x) p(x) dx.
Type II maximum likelihood estimation of the (hyper)parameters θ:
θ_ML2 = argmax_θ ln p(y|θ).
The marginals are in general analytically intractable:
1. We can use Markov chain Monte Carlo to simulate the integrals; potentially exact, but often slow.
2. Or we can focus on fast(er) approximate inference schemes, such as variational inference.
SLIDE 6 Approximate Bayesian inference (variational inference)
For any distribution q(x) ≈ p(x|y), we optimise a lower bound to the log-marginal likelihood:
ln p(y|θ) = ln ∫ p(y, x|θ) dx ≥ ∫ q(x) ln [p(y, x|θ)/q(x)] dx ≐ −F(q, θ).
(Variational) EM minimises the variational free energy iteratively and monotonically (Beal, 2003):
F(q, θ) = −ln p(y|θ) + KL[q(x)‖p(x|y, θ)],
F(q, θ) = −⟨ln p(y, x|θ)⟩_q(x) − H[q(x)],
where KL[q‖p] = E_q{ln(q/p)} is the Kullback-Leibler divergence and H[q] = −E_q{ln q} the entropy.
An alternative approach is to minimise F(q, θ) with your favourite optimiser, using the decomposition
F(q, θ) = −⟨ln p(y|x, θ)⟩_q(x) + KL[q(x)‖p(x|θ)].
SLIDE 7 Variational inference (continued)
Monotonic decrease of F; convergence is easy to monitor (unlike MCMC)
Deterministic, but different from the Laplace approximation
Usually q is assumed to have a factorised form (q(x) ≈ p(x|y))
KL is w.r.t. q; underestimation of correlations between latent variables
Example: variational treatment of Student-t mixtures
SLIDE 8 Partially observed diffusion process
[Figure: sample path W(t) of a driving Wiener process.]
Model data by a latent diffusion process:
dx(t) = f(x(t), t) dt + D^{1/2}(x(t), t) dw(t),
where f and D have a known functional form.
Discrete-time likelihood with a linear observation operator: y_n = H x(t = t_n) + η_n.
Goal: infer the states x(t) and learn the parameters of f and D given the data.
SLIDE 9 Variational inference for diffusion processes
We are interested in the posterior measure over the sample paths:
dP(x(t)|y_1, …, y_N) / dP(x(t)) = (1/Z) ∏_n P(y_n|x(t_n)).
This quantity is non-Gaussian when f is nonlinear (and in general intractable).
For an approximate measure Q(·), we minimise the variational free energy over a certain time interval:
F(Q, θ) = −⟨ln P(y_1, …, y_N|x(t), θ)⟩_Q(x(t)) + KL[dQ(x(t))‖dP(x(t))],
where t ∈ [0, T]. What is a suitable Q(·)?
SLIDE 10
Gaussian variational approximation
We restrict ourselves to a state-independent diffusion matrix D.
Consider the following linear, but time-dependent SDE:
dx(t) = g(x(t), t) dt + D^{1/2}(t) dw(t),   where g(x(t), t) ≐ −A(t)x(t) + b(t).
It induces a non-stationary Gaussian measure, with marginal mean and marginal covariance satisfying a set of ODEs:
ṁ(t) = −A(t)m(t) + b(t),
Ṡ(t) = −A(t)S(t) − S(t)A⊤(t) + D(t).
We view A(t) and b(t) as variational parameters and approximate the posterior process by this non-stationary Gaussian process.
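Given A(t), b(t) and D(t), the moment ODEs above can be integrated with any standard scheme. A forward-Euler sketch (the function name `forward_moments` is illustrative); for constant coefficients the covariance converges to the solution of the Lyapunov equation A S + S A⊤ = D:

```python
import numpy as np

def forward_moments(A, b, D, m0, S0, dt, n_steps):
    """Euler-integrate the marginal moment ODEs of the linear SDE:
    dm/dt = -A(t) m + b(t),  dS/dt = -A(t) S - S A(t)^T + D(t)."""
    d = len(m0)
    m = np.zeros((n_steps + 1, d))
    S = np.zeros((n_steps + 1, d, d))
    m[0], S[0] = m0, S0
    for k in range(n_steps):
        t = k * dt
        Ak = A(t)
        m[k + 1] = m[k] + dt * (-Ak @ m[k] + b(t))
        S[k + 1] = S[k] + dt * (-Ak @ S[k] - S[k] @ Ak.T + D(t))
    return m, S

# Constant-coefficient example: A = I, D = 2I, so the stationary covariance is I.
m, S = forward_moments(A=lambda t: np.eye(1), b=lambda t: np.zeros(1),
                       D=lambda t: 2.0 * np.eye(1),
                       m0=np.array([3.0]), S0=np.eye(1), dt=1e-3, n_steps=10000)
```

The mean decays to zero and the covariance stays at its stationary value, as the ODEs predict.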
SLIDE 11
Gaussian process
Multivariate Gaussian:
Probability density over D random variables (based on correlations).
Characterised by a mean vector µ and a covariance matrix Σ: f ≡ (f_1, …, f_D)⊤ ∼ N(µ, Σ).
Gaussian process (GP):
Probability measure over random functions (≈ infinitely long vectors).
The marginal over any finite subset of variables is a consistent finite-dimensional Gaussian!
Characterised by a mean function and a covariance function (kernel): f(·) ∼ GP(m(·), k(·, ·)).
Gaussian Processes for Machine Learning (Rasmussen and Williams, 2006)
A and b specify the kernel (in general no closed-form solution)
SLIDE 12 Consistency constraints and smoothing algorithm
The objective function is of the form
F(Q, θ) = ∫ E_obs(t) dt + ∫ E_sde(t) dt + KL[q(x_0)‖p(x_0)],
where E_sde(t) = (1/2)⟨(f_t − g_t)⊤D^{−1}(f_t − g_t)⟩_Q(x_t).
The diffusion matrix of the linear SDE is by construction equal to the diffusion matrix of the original SDE (so that F is finite).
We enforce consistent Gaussian marginals by using the following ODEs as constraints (forward propagation):
ṁ(t) = −A(t)m(t) + b(t),
Ṡ(t) = −A(t)S(t) − S(t)A⊤(t) + D(t).
Differentiating the Lagrangian leads to a set of ODEs for the Lagrange multipliers (backward propagation):
λ̇(t) = −∇_m E_sde(t) + A⊤(t)λ(t),   λ_n⁺ = λ_n⁻ − ∇_m E_obs(t)|_{t=t_n},
Ψ̇(t) = −∇_S E_sde(t) + 2Ψ(t)A(t),   Ψ_n⁺ = Ψ_n⁻ − ∇_S E_obs(t)|_{t=t_n}.
SLIDE 13 Optimal Gaussian variational approximation
The non-linear SDE is reduced to a set of linear ODEs describing the evolution of the means, covariances and Lagrange multipliers.
The smoothing algorithm consists of a forward and a backward integration for fixed A(t) and b(t).
The observations are incorporated in the backward pass (cf. jump conditions).
The optimal Gaussian variational approximation is obtained by optimising F w.r.t. the variational parameters A(t) and b(t).
At equilibrium, the variational parameters satisfy the following conditions:
A = −⟨∂f/∂x⟩,   b = ⟨f(x)⟩ + Am − Dλ.
The variational solution is closely related to statistical linearisation:
{A, b} ← argmin_{A,b} ⟨(f(x) + Ax − b)⊤D^{−1}(f(x) + Ax − b)⟩.
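The statistical-linearisation fixed point can be checked numerically: under a Gaussian q = N(m, S), the least-squares linear fit −Ax + b of a non-linear drift recovers A = −⟨∂f/∂x⟩ via Stein's lemma, and b = ⟨f⟩ + Am. The −Dλ correction in b comes from the smoothing constraints and is omitted here; a bi-stable drift is used as a test function, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, S, theta = 0.3, 0.2, 1.0
f = lambda x: 4.0 * x * (theta - x**2)    # bi-stable drift
fp = lambda x: 4.0 * theta - 12.0 * x**2  # its derivative

x = rng.normal(m, np.sqrt(S), size=200_000)  # samples from q = N(m, S)

# Least-squares fit of g(x) = -A x + b under q: slope is Cov(x, f)/Var(x).
A_ls = -np.cov(x, f(x))[0, 1] / x.var()
b_ls = f(x).mean() + A_ls * m

# Stein's lemma gives Cov(x, f(x)) = S <f'(x)>, hence A = -<df/dx>_q.
A_stein = -fp(x).mean()
```

Both estimates agree up to Monte Carlo error; analytically, A = −(4θ − 12(S + m²)) = −0.52 here.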
SLIDE 14 Illustration of the statistical linearisation principle
[Figure: statistical linearisation of a non-linear function. Shown are the density p(y), the function f(y), its Taylor linearisation f(µ) + ∇f·(y − µ), and the statistical linearisation Ay + b.]
SLIDE 15
Related approaches
Continuous-time sigma-point Kalman smoothers (KS; Särkkä and Sottinen, 2008):
Unscented KS and central-difference KS.
Gaussian approximation of the transition density.
No feedback loop to adjust the sigma points.
Perfect simulation approaches (Beskos et al., 2009):
No discrete-time approximation of the transition density.
Transition density is non-Gaussian.
Drift is restricted to derive from a potential.
Convergence is difficult to monitor, potentially slower.
Other approaches include particle smoothers, hybrid MCMC (Eyink et al., 2004), etc.
SLIDE 16
Diffusions with multiplicative noise
Apply an explicit transformation to obtain a diffusion process with a constant diffusion matrix; such a transformation does not always exist in the multivariate case.
Construct the Gaussian variational approximation based on the following ODEs, which hold for any non-linear SDE:
ṁ(t) = −A(t)m(t) + b(t),
Ṡ(t) = −A(t)S(t) − S(t)A⊤(t) + ⟨D(x(t), t)⟩_Q(x_t).
The smoothing algorithm is analogous; the expressions for A(t) and b(t) are more involved.
SLIDE 17 Bi-stable dynamical system
The deterministic drift is defined as f(t, x) = 4x(θ − x²), θ > 0. Transitions between the two stable states are driven by the stochastic noise.
[Figure: double-well potential u(x) and a sample trajectory x(t) switching between the wells.]
[Figure: smoother iterations. Panels show the initialisation, the variational smoother estimate (state vs. time), and the variational parameters A, b together with the Lagrange multipliers λ, Ψ over time.]
SLIDE 18 Comparison to hybrid Markov Chain Monte Carlo (Eyinck et al., 2004)
Reference solution based on a discrete approximation:
Generates complete sample paths from the posterior.
Modified MCMC scheme to increase the acceptance rate (molecular dynamics).
Still requires generating on the order of 100,000 samples for good results.
Hard to check convergence.
[Figure (a): θ = 1, σ = 0.5; smoothed state x vs. time t.]
[Figure (b): large-noise regime.]
SLIDE 19 Comparison to the continuous-time Unscented Kalman Smoother
[Figure: variational smoother vs. continuous-time unscented Kalman smoother; state x vs. time t.]
SLIDE 20 Failure mode
[Figure: failure mode of the approximation; state x vs. time t.]
SLIDE 21 Stochastic Lorenz attractor
The Lorenz attractor:
f_x = σ(y − x), σ > 0,
f_y = ρx − y − xz, ρ > 0,
f_z = xy − βz, β > 0.
The deterministic dynamics are chaotic; stochastic noise is added to these dynamics.
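A short Euler-Maruyama simulation of the stochastic Lorenz system makes the setting concrete (the parameter values and the constant diffusion D^{1/2} = 2·I are illustrative, not taken from the experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, rho, beta, sqrtD = 10.0, 28.0, 8.0 / 3.0, 2.0
dt, n = 1e-3, 5000

x = np.empty((n + 1, 3))
x[0] = [1.0, 1.0, 25.0]
for k in range(n):
    X, Y, Z = x[k]
    # Lorenz drift f = (sigma (y - x), rho x - y - x z, x y - beta z).
    drift = np.array([sigma * (Y - X), rho * X - Y - X * Z, X * Y - beta * Z])
    x[k + 1] = x[k] + drift * dt + sqrtD * rng.normal(0.0, np.sqrt(dt), size=3)
```

The path wanders over the attractor while the noise perturbs it off the deterministic orbit.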
[Figure: stochastic Lorenz attractor and variational smoother estimates of x1, x2, x3 over time.]
SLIDE 22 Parameter inference
(Variational) EM fails for the diffusion coefficient:
lim_{δ→0} Σ_{i=1}^{T/δ} (x_{iδ} − x_{(i−1)δ})(x_{iδ} − x_{(i−1)δ})⊤ = ∫_0^T D(x(t), t) dt   a.s.
Type II ML based on gradient techniques is ok, as we change the sample paths together with the diffusion coefficient.
Cheap estimate of the posterior (sanity check; Lappalainen and Miskin, 2000):
q(θ) ∝ e^{−F(Q,θ)} p(θ).
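The quadratic-variation identity above can be checked on a simulated path: the summed squared increments recover the diffusion coefficient regardless of the drift. A 1-D sketch with a bi-stable drift (the values of D, dt and T are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D_true, dt, T = 0.64, 1e-4, 10.0
n = int(T / dt)

# Euler-Maruyama path; the drift contributes O(dt) to the quadratic variation.
x = np.empty(n + 1)
x[0] = 0.0
for k in range(n):
    x[k + 1] = x[k] + 4.0 * x[k] * (1.0 - x[k]**2) * dt \
               + np.sqrt(D_true) * rng.normal(0.0, np.sqrt(dt))

# Quadratic variation: sum of squared increments ~= integral of D over [0, T].
D_hat = np.sum(np.diff(x)**2) / T
```

This also shows why EM fails here: the sample paths pin down D almost surely, so D must be updated jointly with the paths.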
[Figure: approximate posterior over the diffusion coefficient D.]
SLIDE 23 Conclusion
Stochastic process models are very powerful when the number of observations is small compared to the complexity of the dynamics.
The Gaussian variational approximation for non-linear SDEs boils down to solving a set of ODEs.
A preferred integration scheme can be used; there is no discrete-time approximation of the transition density.
It can be viewed as a generalisation of sigma-point Kalman smoothers for certain instantiations of the statistical linearisation principle.
Considerably faster than (most) MCMC schemes.
The diffusion matrix can be estimated; multiplicative noise is ok (in principle).
Error bars are underestimated.
SLIDE 24 References
- C. Archambeau and M. Opper: Approximate inference for continuous-time Markov processes. In Inference and Estimation in Probabilistic Time-Series Models. Cambridge University Press, 2011.
- C. Archambeau, M. Opper, Y. Shen, D. Cornford and J. Shawe-Taylor: Variational inference for diffusion processes. NIPS 20, pp. 17-24, 2008.
- A. Beskos et al.: Monte Carlo maximum likelihood estimation for discretely observed diffusion processes. Annals of Statistics, 37:1, pp. 223-245, 2009.
- G. L. Eyink, J. L. Restrepo and F. J. Alexander: A mean field approximation in data assimilation for nonlinear dynamics. Physica D, 194:347-368, 2004.
- I. Karatzas and S. E. Shreve: Brownian Motion and Stochastic Calculus. Springer, 1998.
- H. Lappalainen and J. W. Miskin: Ensemble learning. In M. Girolami, editor, Advances in Independent Component Analysis, pp. 76-92. Springer-Verlag, 2000.
- C. E. Rasmussen and C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006.
- S. Särkkä and T. Sottinen: Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. Bayesian Analysis, 3:3, pp. 555-584, 2008.
SLIDE 25 Informal proof for KL[Q(x(t))‖P(x(t))]
Consider the Euler-Maruyama discrete approximation of the SDEs:
∆x_k = f_k ∆t + D^{1/2} ∆w_k,   ∆w_k ∼ N(0, ∆t I),
∆x_k = g_k ∆t + D^{1/2} ∆ŵ_k,   ∆ŵ_k ∼ N(0, ∆t I),
where ∆x_k ≡ x_{k+1} − x_k.
The joint distributions of discrete sample paths {x_k}_{k≥0} for the true process and its approximation follow from the Markov property:
p(x_0, …, x_K|D) = p(x_0) ∏_k N(x_{k+1}|x_k + f_k ∆t, D∆t),
q(x_0, …, x_K|D) = q(x_0) ∏_k N(x_{k+1}|x_k + g_k ∆t, D∆t).
The KL between the two discretised processes is then given by
KL[q‖p] = KL[q(x_0)‖p(x_0)] + Σ_k ∫ q(x_k) KL[q(x_{k+1}|x_k)‖p(x_{k+1}|x_k)] dx_k
        = KL[q(x_0)‖p(x_0)] + (1/2) Σ_k ⟨(f_k − g_k)⊤D^{−1}(f_k − g_k)⟩_{q(x_k)} ∆t.
Passing to the limit ∆t → 0 is ok! (A formal proof is based on the Girsanov theorem.)
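The per-step identity behind this sum can be verified directly: the KL between two Gaussian transitions with equal covariance D∆t and means shifted by (g − f)∆t reduces to (∆t/2)(f − g)⊤D⁻¹(f − g). A 1-D numerical check (all values illustrative):

```python
import numpy as np

def gauss_kl(mu_q, mu_p, var):
    """KL between two 1-D Gaussians with equal variance: (mu_q - mu_p)^2 / (2 var)."""
    return 0.5 * (mu_q - mu_p)**2 / var

dt, D = 1e-3, 0.5
x, f, g = 0.7, -1.2, 0.4  # current state and the two drifts evaluated there

# Per-step transition KL between N(x + g dt, D dt) and N(x + f dt, D dt):
kl_step = gauss_kl(x + g * dt, x + f * dt, D * dt)

# Integrand of the continuous-time limit, (dt/2)(f - g)^2 / D:
kl_formula = 0.5 * dt * (f - g)**2 / D
```

Summing `kl_step` over the path and taking expectations under q recovers the E_sde term of the free energy.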