SLIDE 1 Variational Inference for Diffusion Processes
Cédric Archambeau
Xerox Research Centre Europe, cedric.archambeau@xerox.com
Joint work with Manfred Opper.
Statlearn ’11 Grenoble, March 2011
SLIDE 2 Stochastic differential systems
Many real dynamical systems are continuous in time:
Data assimilation (e.g. numerical weather prediction)
Systems biology (e.g. cellular stress response, transcription factors)
fMRI brain image data (e.g. voxel-based activity)
Modelled by stochastic differential equations (SDEs):
dx(t) = f(x(t), t) dt + D^{1/2}(x(t), t) dw(t),
where dw(t) is a Wiener process (Brownian motion):
dw(t) = lim_{∆t→0} ε_t √∆t,   ε_t ∼ N(0, I).
Deterministic drift f and stochastic diffusion component D
Continuous-time limit of a discrete-time state-space model
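The limit above says that a Wiener increment over a step ∆t is ε√∆t with ε ∼ N(0, I), which yields the Euler-Maruyama scheme ∆x = f ∆t + D^{1/2} ∆w. A minimal 1-D sketch (the helper name `euler_maruyama` and the Ornstein-Uhlenbeck example are illustrative, not from the slides):

```python
import numpy as np

def euler_maruyama(f, D_sqrt, x0, dt, n_steps, rng):
    """Simulate dx = f(x, t) dt + D^{1/2}(x, t) dw with the Euler-Maruyama scheme."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        t = k * dt
        dw = rng.normal(0.0, np.sqrt(dt))  # Wiener increment: eps * sqrt(dt), eps ~ N(0, 1)
        x[k + 1] = x[k] + f(x[k], t) * dt + D_sqrt(x[k], t) * dw
    return x

rng = np.random.default_rng(0)
# Ornstein-Uhlenbeck process: linear drift, constant diffusion.
path = euler_maruyama(f=lambda x, t: -2.0 * x, D_sqrt=lambda x, t: 0.5,
                      x0=1.0, dt=1e-3, n_steps=5000, rng=rng)
```

Any discretisation step can be chosen a posteriori, as the slides note; the scheme only approximates the transition density over one step.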
SLIDE 3 Stochastic differential systems
Why should we bother?
A lot of theory, few (effective) data-driven approaches
Time discretisation is unavoidable in practice
Physics models enforce continuity constraints, such that the number of observations can be relatively small
High-frequency fluctuations can be incorporated into the diffusion
Any discrete representation can be chosen a posteriori
Easy to handle irregular sampling/missing data
Bayesian approaches are natural:
The SDE induces a non-Gaussian prior over sample paths
Define a noise model (or likelihood) and simulate the posterior process over trajectories via MCMC (Beskos et al., 2009)
Or develop fast deterministic approximations
SLIDE 4
Overview
Setting, notations and variational inference Partially observed diffusion processes Gaussian variational approximation Experiments and conclusion
SLIDE 5 Bayesian inference (framework and notations)
Predictions are made by averaging over all possible models:
p(y*|y) = ∫ p(y*|x) p(x|y) dx.
The latent variables are inferred using Bayes' rule:
p(x|y) [posterior] = p(y|x) [likelihood] × p(x) [prior] / p(y),   p(y) = ∫ p(y|x) p(x) dx.
Type II maximum likelihood estimation of the (hyper)parameters θ:
θ_ML2 = argmax_θ ln p(y|θ).
The marginals are in general analytically intractable:
1. We can use Markov chain Monte Carlo to simulate the integrals; potentially exact, but often slow.
2. Or we can focus on fast(er) approximate inference schemes, such as variational inference.
SLIDE 6 Approximate Bayesian inference (variational inference)
For any distribution q(x) ≈ p(x|y), we optimise a lower bound to the log-marginal likelihood:
ln p(y|θ) = ln ∫ p(y, x|θ) dx ≥ ∫ q(x) ln [p(y, x|θ)/q(x)] dx ≐ −F(q, θ).
(Variational) EM minimises the variational free energy iteratively and monotonically (Beal, 2003):
F(q, θ) = −ln p(y|θ) + KL[q(x)‖p(x|y, θ)],
F(q, θ) = −⟨ln p(y, x|θ)⟩_q(x) − H[q(x)],
where KL[q‖p] = E_q{ln(q/p)} is the Kullback-Leibler divergence and H[q] = −E_q{ln q} the entropy.
An alternative approach is to minimise F(q, θ) with your favourite optimiser, using the decomposition
F(q, θ) = −⟨ln p(y|x, θ)⟩_q(x) + KL[q(x)‖p(x|θ)].
SLIDE 7 Variational inference (continued)
Monotonic decrease of F; convergence is easy to monitor (unlike MCMC)
Deterministic, but different from the Laplace approximation
Usually q is assumed to have a factorised form (q(x) ≈ p(x|y))
KL is w.r.t. q; underestimation of correlations between latent variables
Example: variational treatment of Student-t mixtures
SLIDE 8 Partially observed diffusion process
[Figure: sample path W(t) of a driving Wiener process.]
Model data by a latent diffusion process:
dx(t) = f(x(t), t) dt + D^{1/2}(x(t), t) dw(t),
where f and D have a known functional form.
Discrete-time likelihood with a linear observation operator: y_n = H x(t = t_n) + η_n.
Goal: infer the states x(t) and learn the parameters of f and D given the data.
SLIDE 9 Variational inference for diffusion processes
We are interested in the posterior measure over the sample paths:
dP(x(t)|y_1, …, y_N) / dP(x(t)) = (1/Z) ∏_n P(y_n|x(t_n)).
This quantity is non-Gaussian when f is nonlinear (and in general intractable).
For an approximate measure Q(·), we minimise the variational free energy over a certain time interval:
F(Q, θ) = −⟨ln P(y_1, …, y_N|x(t), θ)⟩_Q(x(t)) + KL[dQ(x(t))‖dP(x(t))],
where t ∈ [0, T]. What is a suitable Q(·)?
SLIDE 10
Gaussian variational approximation
We restrict ourselves to a state-independent diffusion matrix D.
Consider the following linear, but time-dependent SDE:
dx(t) = g(x(t), t) dt + D^{1/2}(t) dw(t),   where g(x(t), t) ≐ −A(t)x(t) + b(t).
It induces a non-stationary Gaussian measure, with marginal mean and marginal covariance satisfying a set of ODEs:
ṁ(t) = −A(t)m(t) + b(t),
Ṡ(t) = −A(t)S(t) − S(t)A⊤(t) + D(t).
We view A(t) and b(t) as variational parameters and approximate the posterior process by this non-stationary Gaussian process.
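Given A(t), b(t) and D(t), the moment ODEs above can be integrated with any standard scheme. A forward-Euler sketch (the function name `forward_moments` is illustrative); for constant coefficients the covariance converges to the solution of the Lyapunov equation A S + S A⊤ = D:

```python
import numpy as np

def forward_moments(A, b, D, m0, S0, dt, n_steps):
    """Euler-integrate the marginal moment ODEs of the linear SDE:
    dm/dt = -A(t) m + b(t),  dS/dt = -A(t) S - S A(t)^T + D(t)."""
    d = len(m0)
    m = np.zeros((n_steps + 1, d))
    S = np.zeros((n_steps + 1, d, d))
    m[0], S[0] = m0, S0
    for k in range(n_steps):
        t = k * dt
        Ak = A(t)
        m[k + 1] = m[k] + dt * (-Ak @ m[k] + b(t))
        S[k + 1] = S[k] + dt * (-Ak @ S[k] - S[k] @ Ak.T + D(t))
    return m, S

# Constant-coefficient example: A = I, D = 2I, so the stationary covariance is I.
m, S = forward_moments(A=lambda t: np.eye(1), b=lambda t: np.zeros(1),
                       D=lambda t: 2.0 * np.eye(1),
                       m0=np.array([3.0]), S0=np.eye(1), dt=1e-3, n_steps=10000)
```

The mean decays to zero and the covariance stays at its stationary value, as the ODEs predict.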
SLIDE 11
Gaussian process
Multivariate Gaussian:
Probability density over D random variables (based on correlations).
Characterised by a mean vector µ and a covariance matrix Σ: f ≡ (f_1, …, f_D)⊤ ∼ N(µ, Σ).
Gaussian process (GP):
Probability measure over random functions (≈ infinitely long vectors).
The marginal over any finite subset of variables is a consistent finite-dimensional Gaussian!
Characterised by a mean function and a covariance function (kernel): f(·) ∼ GP(m(·), k(·, ·)).
Gaussian Processes for Machine Learning (Rasmussen and Williams, 2006)
A and b specify the kernel (in general no closed-form solution)
SLIDE 12 Consistency constraints and smoothing algorithm
The objective function is of the form
F(Q, θ) = ∫ E_obs(t) dt + ∫ E_sde(t) dt + KL[q(x_0)‖p(x_0)],
where E_sde(t) = (1/2)⟨(f_t − g_t)⊤D^{−1}(f_t − g_t)⟩_Q(x_t).
The diffusion matrix of the linear SDE is by construction equal to the diffusion matrix of the original SDE (so that F is finite).
We enforce consistent Gaussian marginals by using the following ODEs as constraints (forward propagation):
ṁ(t) = −A(t)m(t) + b(t),
Ṡ(t) = −A(t)S(t) − S(t)A⊤(t) + D(t).
Differentiating the Lagrangian leads to a set of ODEs for the Lagrange multipliers (backward propagation):
λ̇(t) = −∇_m E_sde(t) + A⊤(t)λ(t),   λ_n⁺ = λ_n⁻ − ∇_m E_obs(t)|_{t=t_n},
Ψ̇(t) = −∇_S E_sde(t) + 2Ψ(t)A(t),   Ψ_n⁺ = Ψ_n⁻ − ∇_S E_obs(t)|_{t=t_n}.
SLIDE 13 Optimal Gaussian variational approximation
The non-linear SDE is reduced to a set of linear ODEs describing the evolution of the means, covariances and Lagrange multipliers.
The smoothing algorithm consists of a forward and a backward integration for fixed A(t) and b(t).
The observations are incorporated in the backward pass (cf. jump conditions).
The optimal Gaussian variational approximation is obtained by optimising F w.r.t. the variational parameters A(t) and b(t).
At equilibrium, the variational parameters satisfy the following conditions:
A = −⟨∂f/∂x⟩,   b = ⟨f(x)⟩ + Am − Dλ.
The variational solution is closely related to statistical linearisation:
{A, b} ← argmin_{A,b} ⟨(f(x) + Ax − b)⊤D^{−1}(f(x) + Ax − b)⟩.
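The statistical-linearisation fixed point can be checked numerically: under a Gaussian q = N(m, S), the least-squares linear fit −Ax + b of a non-linear drift recovers A = −⟨∂f/∂x⟩ via Stein's lemma, and b = ⟨f⟩ + Am. The −Dλ correction in b comes from the smoothing constraints and is omitted here; a bi-stable drift is used as a test function, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, S, theta = 0.3, 0.2, 1.0
f = lambda x: 4.0 * x * (theta - x**2)    # bi-stable drift
fp = lambda x: 4.0 * theta - 12.0 * x**2  # its derivative

x = rng.normal(m, np.sqrt(S), size=200_000)  # samples from q = N(m, S)

# Least-squares fit of g(x) = -A x + b under q: slope is Cov(x, f)/Var(x).
A_ls = -np.cov(x, f(x))[0, 1] / x.var()
b_ls = f(x).mean() + A_ls * m

# Stein's lemma gives Cov(x, f(x)) = S <f'(x)>, hence A = -<df/dx>_q.
A_stein = -fp(x).mean()
```

Both estimates agree up to Monte Carlo error; analytically, A = −(4θ − 12(S + m²)) = −0.52 here.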
SLIDE 14 Illustration of the statistical linearisation principle
[Figure: statistical linearisation of a non-linear function. Shown are the density p(y), the function f(y), its Taylor linearisation f(µ) + ∇f·(y − µ), and the statistical linearisation Ay + b.]
SLIDE 15
Related approaches
Continuous-time sigma-point Kalman smoothers (KS; Särkkä and Sottinen, 2008):
Unscented KS and central-difference KS.
Gaussian approximation of the transition density.
No feedback loop to adjust the sigma points.
Perfect simulation approaches (Beskos et al., 2009):
No discrete-time approximation of the transition density.
Transition density is non-Gaussian.
Drift is restricted to derive from a potential.
Convergence is difficult to monitor, potentially slower.
Other approaches include particle smoothers, hybrid MCMC (Eyink et al., 2004), etc.
SLIDE 16
Diffusions with multiplicative noise
Apply an explicit transformation to obtain a diffusion process with a constant diffusion matrix; such a transformation does not always exist in the multivariate case.
Construct the Gaussian variational approximation based on the following ODEs, which hold for any non-linear SDE:
ṁ(t) = −A(t)m(t) + b(t),
Ṡ(t) = −A(t)S(t) − S(t)A⊤(t) + ⟨D(x(t), t)⟩_Q(x_t).
The smoothing algorithm is analogous; the expressions for A(t) and b(t) are more involved.
SLIDE 17 Bi-stable dynamical system
The deterministic drift is defined as f(t, x) = 4x(θ − x²), θ > 0. Transitions between the two stable states are driven by the stochastic noise.
[Figure: double-well potential u(x) and a sample trajectory x(t) switching between the wells.]
[Figure: smoother iterations. Panels show the initialisation, the variational smoother estimate (state vs. time), and the variational parameters A, b together with the Lagrange multipliers λ, Ψ over time.]
SLIDE 18 Comparison to hybrid Markov Chain Monte Carlo (Eyinck et al., 2004)
Reference solution based on a discrete approximation:
Generates complete sample paths from the posterior.
Modified MCMC scheme to increase the acceptance rate (molecular dynamics).
Still requires generating on the order of 100,000 samples for good results.
Hard to check convergence.
[Figure (a): θ = 1, σ = 0.5; smoothed state x vs. time t.]
[Figure (b): large-noise regime.]
SLIDE 19 Comparison to the continuous-time Unscented Kalman Smoother
[Figure: variational smoother vs. continuous-time unscented Kalman smoother; state x vs. time t.]
SLIDE 20 Failure mode
[Figure: failure mode of the approximation; state x vs. time t.]
SLIDE 21 Stochastic Lorenz attractor
The Lorenz attractor:
f_x = σ(y − x), σ > 0,
f_y = ρx − y − xz, ρ > 0,
f_z = xy − βz, β > 0.
The deterministic dynamics are chaotic; stochastic noise is added to these dynamics.
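A short Euler-Maruyama simulation of the stochastic Lorenz system makes the setting concrete (the parameter values and the constant diffusion D^{1/2} = 2·I are illustrative, not taken from the experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, rho, beta, sqrtD = 10.0, 28.0, 8.0 / 3.0, 2.0
dt, n = 1e-3, 5000

x = np.empty((n + 1, 3))
x[0] = [1.0, 1.0, 25.0]
for k in range(n):
    X, Y, Z = x[k]
    # Lorenz drift f = (sigma (y - x), rho x - y - x z, x y - beta z).
    drift = np.array([sigma * (Y - X), rho * X - Y - X * Z, X * Y - beta * Z])
    x[k + 1] = x[k] + drift * dt + sqrtD * rng.normal(0.0, np.sqrt(dt), size=3)
```

The path wanders over the attractor while the noise perturbs it off the deterministic orbit.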
[Figure: stochastic Lorenz attractor and variational smoother estimates of x1, x2, x3 over time.]
SLIDE 22 Parameter inference
(Variational) EM fails for the diffusion coefficient:
lim_{δ→0} Σ_{i=1}^{T/δ} (x_{iδ} − x_{(i−1)δ})(x_{iδ} − x_{(i−1)δ})⊤ = ∫_0^T D(x(t), t) dt   a.s.
Type II ML based on gradient techniques is ok, as we change the sample paths together with the diffusion coefficient.
Cheap estimate of the posterior (sanity check; Lappalainen and Miskin, 2000):
q(θ) ∝ e^{−F(Q,θ)} p(θ).
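The quadratic-variation identity above can be checked on a simulated path: the summed squared increments recover the diffusion coefficient regardless of the drift. A 1-D sketch with a bi-stable drift (the values of D, dt and T are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
D_true, dt, T = 0.64, 1e-4, 10.0
n = int(T / dt)

# Euler-Maruyama path; the drift contributes O(dt) to the quadratic variation.
x = np.empty(n + 1)
x[0] = 0.0
for k in range(n):
    x[k + 1] = x[k] + 4.0 * x[k] * (1.0 - x[k]**2) * dt \
               + np.sqrt(D_true) * rng.normal(0.0, np.sqrt(dt))

# Quadratic variation: sum of squared increments ~= integral of D over [0, T].
D_hat = np.sum(np.diff(x)**2) / T
```

This also shows why EM fails here: the sample paths pin down D almost surely, so D must be updated jointly with the paths.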
[Figure: approximate posterior over the diffusion coefficient D.]
SLIDE 23 Conclusion
Stochastic process models are very powerful when the number of observations is small compared to the complexity of the dynamics.
The Gaussian variational approximation for non-linear SDEs boils down to solving a set of ODEs.
A preferred integration scheme can be used; there is no discrete-time approximation of the transition density.
It can be viewed as a generalisation of sigma-point Kalman smoothers for certain instantiations of the statistical linearisation principle.
Considerably faster than (most) MCMC schemes.
The diffusion matrix can be estimated; multiplicative noise is ok (in principle).
Error bars are underestimated.
SLIDE 24 References
- C. Archambeau and M. Opper: Approximate inference for continuous-time Markov processes. In Inference and Estimation in Probabilistic Time-Series Models. Cambridge University Press, 2011.
- C. Archambeau, M. Opper, Y. Shen, D. Cornford and J. Shawe-Taylor: Variational inference for diffusion processes. NIPS 20, pp. 17-24, 2008.
- A. Beskos et al.: Monte Carlo maximum likelihood estimation for discretely observed diffusion processes. Annals of Statistics, 37:1, pp. 223-245, 2009.
- G. L. Eyink, J. L. Restrepo and F. J. Alexander: A mean field approximation in data assimilation for nonlinear dynamics. Physica D, 194:347-368, 2004.
- I. Karatzas and S. E. Shreve: Brownian Motion and Stochastic Calculus. Springer, 1998.
- H. Lappalainen and J. W. Miskin: Ensemble learning. In M. Girolami, editor, Advances in Independent Component Analysis, pp. 76-92. Springer-Verlag, 2000.
- C. E. Rasmussen and C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006.
- S. Särkkä and T. Sottinen: Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. Bayesian Analysis, 3:3, pp. 555-584, 2008.
SLIDE 25 Informal proof for KL[Q(x(t))‖P(x(t))]
Consider the Euler-Maruyama discrete approximation of the SDEs:
∆x_k = f_k ∆t + D^{1/2} ∆w_k,   ∆w_k ∼ N(0, ∆t I),
∆x_k = g_k ∆t + D^{1/2} ∆ŵ_k,   ∆ŵ_k ∼ N(0, ∆t I),
where ∆x_k ≡ x_{k+1} − x_k.
The joint distributions of discrete sample paths {x_k}_{k≥0} for the true process and its approximation follow from the Markov property:
p(x_0, …, x_K|D) = p(x_0) ∏_k N(x_{k+1}|x_k + f_k ∆t, D∆t),
q(x_0, …, x_K|D) = q(x_0) ∏_k N(x_{k+1}|x_k + g_k ∆t, D∆t).
The KL between the two discretised processes is then given by
KL[q‖p] = KL[q(x_0)‖p(x_0)] + Σ_k ∫ q(x_k) KL[q(x_{k+1}|x_k)‖p(x_{k+1}|x_k)] dx_k
        = KL[q(x_0)‖p(x_0)] + (1/2) Σ_k ⟨(f_k − g_k)⊤D^{−1}(f_k − g_k)⟩_{q(x_k)} ∆t.
Passing to the limit ∆t → 0 is ok! (A formal proof is based on the Girsanov theorem.)
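The per-step identity behind this sum can be verified directly: the KL between two Gaussian transitions with equal covariance D∆t and means shifted by (g − f)∆t reduces to (∆t/2)(f − g)⊤D⁻¹(f − g). A 1-D numerical check (all values illustrative):

```python
import numpy as np

def gauss_kl(mu_q, mu_p, var):
    """KL between two 1-D Gaussians with equal variance: (mu_q - mu_p)^2 / (2 var)."""
    return 0.5 * (mu_q - mu_p)**2 / var

dt, D = 1e-3, 0.5
x, f, g = 0.7, -1.2, 0.4  # current state and the two drifts evaluated there

# Per-step transition KL between N(x + g dt, D dt) and N(x + f dt, D dt):
kl_step = gauss_kl(x + g * dt, x + f * dt, D * dt)

# Integrand of the continuous-time limit, (dt/2)(f - g)^2 / D:
kl_formula = 0.5 * dt * (f - g)**2 / D
```

Summing `kl_step` over the path and taking expectations under q recovers the E_sde term of the free energy.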