Learning dynamical systems with particle stochastic approximation EM


  1. Learning dynamical systems with particle stochastic approximation EM
Fredrik Lindsten, Linköping University, 2019-04-11
Joint work with Andreas Lindholm, Uppsala University

  2. Parametric state-space models
Dynamical system on state-space form,
x_{t+1} = f(x_t, u_t, v_t; θ),    y_t = g(x_t, u_t, e_t; θ).
Properties:
• Can handle nonlinear dynamics (f(·) and g(·))
• Stochastic (process and measurement noise)
• Noise variables can be non-Gaussian
• Parametric: parameterized by θ ∈ R^{n_θ}
Note: The states {x_t} are latent (unobserved) random variables.
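To make the model class concrete, here is a minimal sketch of simulating such a system. The particular scalar model, its coefficients, and its noise scales are illustrative choices, not taken from the talk:

```python
import numpy as np

def simulate(T, theta, rng):
    """Simulate x_{t+1} = f(x_t, v_t; theta), y_t = g(x_t, e_t; theta)
    for a toy scalar nonlinear model (no exogenous input u_t)."""
    a, b = theta                    # theta parameterizes both f and g
    x = np.zeros(T + 1)             # latent states x_0, ..., x_T
    y = np.zeros(T)                 # observations y_1, ..., y_T
    for t in range(T):
        v = 0.5 * rng.standard_normal()             # process noise v_t
        x[t + 1] = a * x[t] + np.cos(1.2 * t) + v   # nonlinear dynamics f
        e = 0.1 * rng.standard_normal()             # measurement noise e_t
        y[t] = b * x[t + 1] ** 2 + e                # nonlinear measurement g
    return x, y

rng = np.random.default_rng(0)
x, y = simulate(100, theta=(0.7, 0.05), rng=rng)
```

Only y would be available to an identification algorithm; x is latent.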

  3. Maximum likelihood identification
We observe y_{1:T} = (y_1, ..., y_T) and u_{1:T} = (u_1, ..., u_T).
Aim: Compute the maximum likelihood estimate,
θ_ML = arg max_θ log p_θ(y_{1:T}).
Desiderata:
1. Convergent: given enough time, a (local) maximum is found.
2. Computationally efficient: the total runtime to reach a satisfactory solution should not be excessive.

  4. Teaser: Cascaded water tanks
Cascaded water tank system from the nonlinear system identification benchmark: http://www.nonlinearbenchmark.org/
M. Schoukens & J.P. Noël. Three Benchmarks Addressing Open Challenges in Nonlinear System Identification. 20th World Congress of the International Federation of Automatic Control, Toulouse, France, July 2017.
Evaluated by computing simulation error on test data.

Model                        | Error
-----------------------------|------
Initial model to PSAEM       | 2.85
Estimated with PSAEM         | 0.29
Best result from literature  | 0.34

  5. Teaser: Nonlinear toy model
[Figure: average relative error vs. computational time (seconds), log-log axes, comparing PSAEM (N = 15) with PSEM for N = 15, 50, 100, 500.]

  6. A composite algorithm
Particle Stochastic Approximation EM (PSAEM) is ...
... an exercise in building a complicated algorithm from "simple" parts:
Expectation Maximization + stochastic approximation → SAEM; MCMC + particle filters → PMCMC; SAEM + PMCMC → PSAEM.

  7. MCMC + Stochastic Approximation
Particle Stochastic Approximation EM (PSAEM) is ...
... an example of how MCMC and stochastic approximation can come together!
Algorithms involving both optimization and Monte Carlo sampling are common in machine learning and system identification, e.g. SGD, variational inference, SAEM, ...
Often we assume unbiased estimates, but it is possible to generalize to Markovian dependencies ⟹ we can use Markov chain Monte Carlo sampling!

  8. Schematic outline
[Diagram: Expectation Maximization + stochastic approximation → SAEM (will focus on this); MCMC + particle filters → PMCMC; SAEM + PMCMC → PSAEM.]

  9. Expectation Maximization

  10. Latent variable models
A latent variable model is described by a joint probability density function p_θ(x, y) where
• y is observed (the data)
• x is unobserved (the latent variable)
• θ is an unknown parameter
ex) Probabilistic representation of a state-space model,
x_{t+1} = f(x_t, u_t, v_t; θ),  y_t = g(x_t, u_t, e_t; θ)  ⟺  p_θ(x_{t+1} | x_t),  p_θ(y_t | x_t).
Therefore,
p_θ(x_{0:T}, y_{1:T}) = p(x_0) ∏_{t=0}^{T-1} p_θ(x_{t+1} | x_t) ∏_{t=1}^{T} p_θ(y_t | x_t).
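The factorization turns the joint log-density into a simple sum of one-step terms. The sketch below evaluates it for a hypothetical linear-Gaussian special case (coefficients a, c and variances q, r are illustrative, not from the talk):

```python
import numpy as np

def gauss_logpdf(z, mean, var):
    """Log-density of N(mean, var) evaluated at z."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def joint_log_density(x, y, a, c, q, r):
    """log p_theta(x_{0:T}, y_{1:T}) for x_{t+1} = a x_t + v_t, y_t = c x_t + e_t,
    with v_t ~ N(0, q), e_t ~ N(0, r), and x_0 ~ N(0, 1)."""
    lp = gauss_logpdf(x[0], 0.0, 1.0)                  # log p(x_0)
    lp += np.sum(gauss_logpdf(x[1:], a * x[:-1], q))   # sum_t log p(x_{t+1} | x_t)
    lp += np.sum(gauss_logpdf(y, c * x[1:], r))        # sum_t log p(y_t | x_t)
    return lp

rng = np.random.default_rng(1)
x = rng.standard_normal(11)   # a trajectory x_0, ..., x_10 (arbitrary, for illustration)
y = rng.standard_normal(10)   # observations y_1, ..., y_10
lp = joint_log_density(x, y, a=0.9, c=1.0, q=0.5, r=1.0)
```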

  11. Maximum likelihood estimation
Aim: Compute the maximum likelihood estimate,
θ_ML = arg max_θ log p_θ(y).
Evaluating the likelihood p_θ(y) requires marginalizing the latent variable,
p_θ(y) = ∫ p_θ(x, y) dx.
Expectation maximization (EM) avoids explicit marginalization by iteratively optimizing lower bounds on the likelihood.

  12. Expectation Maximization [3]
The EM algorithm generates a sequence of iterates θ_0, θ_1, θ_2, ... Iterate:
(E) Expectation: Q_k(θ) := ∫ log p_θ(x, y) p_{θ_{k-1}}(x | y) dx.
(M) Maximization: θ_k = arg max_θ Q_k(θ).
The iterates {θ_k}_{k≥0} converge (under weak conditions) to a stationary point of p_θ(y).
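To make the two steps concrete, here is a sketch of EM on a hypothetical toy model (not from the talk): x_i ∼ N(0, θ), y_i = x_i + e_i with e_i ∼ N(0, 1), for which both the E-step (Gaussian posterior moments) and the M-step are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, r, n = 2.0, 1.0, 1000
x_lat = rng.normal(0.0, np.sqrt(theta_true), n)   # latent x_i ~ N(0, theta)
y = x_lat + rng.normal(0.0, np.sqrt(r), n)        # observed y_i = x_i + e_i

theta = 0.5
for _ in range(100):
    # E-step: posterior p_theta(x_i | y_i) is N(m_i, v), in closed form
    v = theta * r / (theta + r)
    m = (theta / (theta + r)) * y
    # M-step: Q_k(theta) is maximized by the average posterior second moment
    theta = np.mean(m ** 2) + v

# the fixed point is the MLE of theta, mean(y^2) - r, for this model
```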

  13. EM for nonlinear state-space models
ex) For a state-space model,
Q_k(θ) = ∫ log p_θ(x_{0:T}, y_{1:T}) p_{θ_{k-1}}(x_{0:T} | y_{1:T}) dx_{0:T}.
First hurdle: a high-dimensional integration problem!

  14. Stochastic Approximation EM

  15. Stochastic approximation of the Q-function
Q_k(θ) := ∫ log p_θ(x, y) p_{θ_{k-1}}(x | y) dx.
If x̃ ∼ p_{θ_{k-1}}(x | y), then log p_θ(x̃, y) is an unbiased estimate of Q_k(θ). Replace the E-step with:
(S) Simulate x̃_k ∼ p_{θ_{k-1}}(x | y),
(E) Q̂_k(θ) = Q̂_{k-1}(θ) + γ_k (log p_θ(x̃_k, y) − Q̂_{k-1}(θ)).
Here ∑_k γ_k = ∞, ∑_k γ_k² < ∞, and Q̂_0(θ) ≡ 0.
Intuitive interpretation: if θ_k converges, the x-values sampled at iterations k, k+1, ... will come from approximately the same distribution.

  16. Representation of Q̂
How is Q̂_k(θ) = Q̂_{k-1}(θ) + γ_k (log p_θ(x̃_k, y) − Q̂_{k-1}(θ)) represented in practice?
1. In general, as a weighted sum,
Q̂_k(θ) = ∑_{j=1}^{k} α_{jk} log p_θ(x̃_j, y).
But this can be hard to maximize for large k.
2. For exponential family models, it can be expressed using sufficient statistics,
log p_θ(x, y) = ⟨ψ(θ), s(x, y)⟩ − A(θ).
We can then compute
(E.1) S_k = S_{k-1} + γ_k (s(x̃_k, y) − S_{k-1}),
(E.2) Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ).
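The recursion (E.1) is just a stochastic-approximation average. A minimal sketch with a synthetic scalar "statistic" (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
S = 0.0
for k in range(1, 20001):
    s_k = rng.normal(3.0, 1.0)    # noisy draw of a sufficient statistic, E[s_k] = 3
    gamma_k = 1.0 / k             # step sizes: sum gamma_k = inf, sum gamma_k^2 < inf
    S = S + gamma_k * (s_k - S)   # (E.1): S_k = S_{k-1} + gamma_k (s_k - S_{k-1})

# with gamma_k = 1/k, S_k is exactly the running mean of s_1, ..., s_k
```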

  17. Stochastic Approximation EM [2]
The SAEM algorithm for exponential family models. Initialize θ_0 and S_0 = 0. Iterate:
(S) Simulate x̃_k ∼ p_{θ_{k-1}}(x | y),
(E.1) S_k = S_{k-1} + γ_k (s(x̃_k, y) − S_{k-1}),
(E.2) Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ),
(M) θ_k = arg max_θ Q̂_k(θ).
The iterates {θ_k}_{k≥0} converge (under regularity conditions) to a local maximum of p_θ(y).
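As a sketch on a hypothetical toy exponential-family model (not from the talk) — x_i ∼ N(0, θ), y_i = x_i + e_i, e_i ∼ N(0, 1), with sufficient statistic s(x, y) = mean(x_i²) — SAEM replaces the exact E-step with simulation from the (here Gaussian) posterior:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true, r, n = 2.0, 1.0, 2000
y = rng.normal(0.0, np.sqrt(theta_true), n) + rng.normal(0.0, np.sqrt(r), n)

theta, S = 1.0, 0.0
for k in range(1, 3001):
    # (S): simulate x from the posterior p_theta(x | y), which is Gaussian here
    v = theta * r / (theta + r)
    x_sim = (theta / (theta + r)) * y + np.sqrt(v) * rng.standard_normal(n)
    # (E.1): stochastic-approximation update of the sufficient statistic
    gamma = k ** -0.7
    S = S + gamma * (np.mean(x_sim ** 2) - S)
    # (M): closed-form maximizer of <psi(theta), S_k> - A(theta)
    theta = S
```

The iterates settle near the MLE, mean(y²) − r, despite each E-step using a single random draw.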

  18. SAEM for nonlinear state-space models
ex) For a state-space model, the simulation step involves
x̃_{0:T} ∼ p_θ(x_{0:T} | y_{1:T}).
Second hurdle: simulating from the joint smoothing distribution is not possible for a nonlinear/non-Gaussian state-space model.

  19. Combining SAEM and MCMC

  20. Markovian noise
We need to simulate from an intractable posterior p_θ(x | y). This is what Markov chain Monte Carlo is designed for!
MCMC, key idea: an ergodic stochastic process has "the same behavior averaged over time as averaged over the space of all the system's states" (Wikipedia):
∫ g(x) p_θ(x | y) dx = lim_{K→∞} (1/K) ∑_{k=1}^{K} g(x̃_k).
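The ergodic-average identity can be checked numerically. The sketch below runs a random-walk Metropolis–Hastings chain targeting a standard Gaussian as a stand-in for p_θ(x | y), and compares the time average of g(x) = x² with its known expectation, 1 (the target and proposal are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(x):
    return -0.5 * x ** 2   # unnormalized log-density of N(0, 1)

x, chain = 0.0, []
for k in range(50000):
    prop = x + rng.normal(0.0, 1.0)   # random-walk proposal
    # accept with probability min(1, target(prop) / target(x))
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x = prop
    chain.append(x)

time_avg = np.mean(np.square(chain))  # ergodic average of g(x) = x^2
```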

  21. MCMC basics
Requirements: the process {x̃_k}_{k≥1} has to...
1. ... have p_θ(x | y) as its stationary distribution
2. ... be ergodic ("sufficiently stochastic")
MCMC: initialize x̃_0 and simulate x̃_k ∼ q_θ(x̃_k | x̃_{k-1}), k = 1, 2, ...
Here, {q_θ(x̃ | x) : θ ∈ R^{n_θ}} is a family of Markov kernels s.t.:
1. (Stationary distribution) ∫ q_θ(x̃ | x) p_θ(x | y) dx = p_θ(x̃ | y)
2. (Ergodicity) Sufficient condition: ∃ η > 0 s.t. P_{q_θ(·|x)}[x̃ ∈ A] ≥ η P_{p_θ(·|y)}[x̃ ∈ A].

  22. MCMC methods
65+ years of research on MCMC has resulted in many clever ways of constructing q_θ(x̃ | x):
• Metropolis–Hastings
• Gibbs sampling
• Langevin algorithms
• Hamiltonian Monte Carlo
• Slice sampling
• Particle Markov chain Monte Carlo (← what we use in PSAEM)
• ...

  23. MCMC-SAEM [4]
MCMC-SAEM: initialize θ_0, x̃_0 and S_0 = 0. Iterate:
(S) Simulate x̃_k ∼ q_{θ_{k-1}}(x̃ | x̃_{k-1}),
(E.1) S_k = S_{k-1} + γ_k (s(x̃_k, y) − S_{k-1}),
(E.2) Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ),
(M) θ_k = arg max_θ Q̂_k(θ).
Under appropriate conditions the iterates {θ_k}_{k≥0} converge to a local maximum of p_θ(y).
We require that {q_θ(x̃ | x) : θ ∈ R^{n_θ}}...
• ... has the correct stationary distribution and is ergodic for "all" θ
• ... is smooth in θ: q_θ(x̃ | x) ≈ q_{θ'}(x̃ | x) if θ ≈ θ'
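As an illustration on a hypothetical toy model (x_i ∼ N(0, θ), y_i = x_i + e_i, e_i ∼ N(0, 1); not from the talk), we can pretend the posterior is intractable and run MCMC-SAEM, using one random-walk Metropolis–Hastings sweep per iteration as the kernel q_θ:

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true, r, n = 2.0, 1.0, 500
y = rng.normal(0.0, np.sqrt(theta_true), n) + rng.normal(0.0, np.sqrt(r), n)

def log_joint(x, theta):
    # log p_theta(x, y) up to a constant: prior N(0, theta) times likelihood N(x, r)
    return -0.5 * x ** 2 / theta - 0.5 * (y - x) ** 2 / r

theta, S = 1.0, 0.0
x = np.zeros(n)   # current state of the Markov chain
for k in range(1, 5001):
    # (S): one Metropolis-Hastings sweep of the kernel q_theta targeting p_theta(x | y)
    prop = x + 0.8 * rng.standard_normal(n)
    accept = np.log(rng.uniform(size=n)) < log_joint(prop, theta) - log_joint(x, theta)
    x = np.where(accept, prop, x)
    # (E.1) + (M): sufficient-statistic update, then closed-form maximization
    gamma = k ** -0.7
    S = S + gamma * (np.mean(x ** 2) - S)
    theta = S
```

Note that a single kernel step per iteration suffices; the chain never needs to reach stationarity within one iteration.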

  24. MCMC-SAEM for nonlinear state-space models
ex) For a state-space model, the MCMC step aims at simulating from the joint smoothing distribution p_θ(x_{0:T} | y_{1:T}).
Third hurdle: how can we construct an efficient MCMC method, with the required properties, to simulate from this distribution?

  25. Particle SAEM

  26. Particle MCMC [1]
We propose to use particle MCMC (PMCMC):
1. Use a particle filter to approximate
p_θ(x_{0:T} | y_{1:T}) ≈ p̂_θ(x_{0:T} | y_{1:T}) := ∑_{i=1}^{N} w_T^i δ_{x_{0:T}^i}(x_{0:T}).
2. Use the approximation to define an MCMC kernel on the space of state trajectories.
3. Some "tricks" are used to ensure the correct stationary distribution for any N ≥ 1.

  27. PMCMC – Part 1: The particle filter
Use importance sampling to approximate
p_θ(x_0), p_θ(x_{0:1} | y_1), p_θ(x_{0:2} | y_{1:2}), ..., p_θ(x_{0:T} | y_{1:T}),
sequentially in time. Exploits the temporal structure of a dynamical system!
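A bootstrap particle filter can be sketched in a few lines. The hypothetical linear-Gaussian model and its parameter values below are illustrative only; a full PMCMC implementation would additionally keep the weights and ancestral trajectories to approximate the smoothing distribution:

```python
import numpy as np

rng = np.random.default_rng(7)
a, q, r, T, N = 0.9, 0.5, 1.0, 50, 500

# simulate data from x_{t+1} = a x_t + v_t, y_t = x_t + e_t
x_true = np.zeros(T + 1)
y = np.zeros(T)
for t in range(T):
    x_true[t + 1] = a * x_true[t] + np.sqrt(q) * rng.standard_normal()
    y[t] = x_true[t + 1] + np.sqrt(r) * rng.standard_normal()

# bootstrap particle filter: propagate, weight, resample
particles = rng.normal(0.0, 1.0, N)   # draws from p(x_0)
filter_means = []
for t in range(T):
    particles = a * particles + np.sqrt(q) * rng.standard_normal(N)  # propagate through f
    logw = -0.5 * (y[t] - particles) ** 2 / r                        # weight by p(y_t | x_t)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    particles = particles[rng.choice(N, size=N, p=w)]                # multinomial resampling
    filter_means.append(np.mean(particles))
```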

  28. PMCMC – Part 1
[Figure: particle filter illustration, particle trajectories in state space over time (t = 5, ..., 25).]
