

SLIDE 1

Learning dynamical systems with particle stochastic approximation EM

Fredrik Lindsten, Linköping University

2019-04-11

Joint work with Andreas Lindholm, Uppsala University

SLIDE 2

Parametric state-space models

Dynamical system on state-space form,

  • x_{t+1} = f(x_t, u_t, v_t; θ),
  • y_t = g(x_t, u_t, e_t; θ).

Properties:

  • Can handle nonlinear dynamics (f(·) and g(·))
  • Stochastic (process noise v_t and measurement noise e_t)
  • Noise variables can be non-Gaussian
  • Parametric: parameterized by θ ∈ R^{n_θ}

Note: The states {x_t} are latent (unobserved) random variables.

1/35
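Concretely, this model class can be simulated in a few lines. A minimal sketch, where the scalar functions f, g and the parameter value are illustrative assumptions, not a model from the talk:

```python
import numpy as np

def simulate_ssm(f, g, theta, u, x0, rng):
    """Simulate x_{t+1} = f(x_t, u_t, v_t; theta), y_t = g(x_t, u_t, e_t; theta)."""
    T = len(u)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = x0
    for t in range(T):
        v = rng.normal()  # process noise (need not be Gaussian)
        e = rng.normal()  # measurement noise
        y[t] = g(x[t], u[t], e, theta)
        x[t + 1] = f(x[t], u[t], v, theta)
    return x, y

# Hypothetical scalar example
rng = np.random.default_rng(0)
f = lambda x, u, v, th: th * x + u + 0.1 * v
g = lambda x, u, e, th: x + 0.1 * e
x, y = simulate_ssm(f, g, 0.9, np.zeros(50), 0.0, rng)
```

The same loop structure works for vector-valued states and non-Gaussian noise; only the noise draws change.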

SLIDE 3

Maximum likelihood identification

We observe y_{1:T} = (y_1, . . . , y_T) and u_{1:T} = (u_1, . . . , u_T).

Aim: Compute the maximum likelihood estimate,

  • θ_ML = arg max_θ {log p_θ(y_{1:T})}.

We want an algorithm that is:

  • 1. Convergent: if given enough time, a (local) maximum is found
  • 2. Computationally efficient: the total runtime to recover a satisfactory solution should not be excessively large

SLIDE 4

Teaser: Cascaded water tanks

Cascaded water tank system from the nonlinear system identification benchmark: http://www.nonlinearbenchmark.org/

  • M. Schoukens & J.P. Noël. Three Benchmarks Addressing Open Challenges in Nonlinear System Identification. 20th World Congress of the International Federation of Automatic Control (IFAC), Toulouse, France, July 2017.

Evaluated by computing simulation error on test data.

  Model                       | Error
  ----------------------------|------
  Initial model to PSAEM      | 2.85
  Estimated with PSAEM        | 0.29
  Best result from literature | 0.34

SLIDE 5

Teaser: Nonlinear toy model

[Figure: Average relative error vs. computational time (seconds), log-log scale. Curves: PSAEM N=15; PSEM N=15, N=50, N=100, N=500.]

SLIDE 6

A composite algorithm

Particle Stochastic Approximation EM (PSAEM) is . . . an exercise in building a complicated algorithm from "simple" parts.

[Diagram: Expectation Maximization + stochastic approximation → SAEM; particle filters + MCMC → PMCMC; SAEM + PMCMC → PSAEM.]

SLIDE 7

MCMC + Stochastic Approximation

Particle Stochastic Approximation EM (PSAEM) is . . . an example of how MCMC and stochastic approximation can come together!

Algorithms involving both optimization and Monte Carlo sampling are common in machine learning and system identification. ex) SGD, variational inference, SAEM, . . .

Often we assume unbiased estimates, but it is possible to generalize to Markovian dependencies ⇒ we can use Markov chain Monte Carlo sampling!

SLIDE 8

Schematic outline

[Diagram: the same composite-algorithm schematic (Expectation Maximization, stochastic approximation, SAEM, MCMC, particle filters, PMCMC, PSAEM), highlighting the part the next slides focus on.]

SLIDE 9

Expectation Maximization

SLIDE 10

Latent variable models

A latent variable model is described by a joint probability density function p_θ(x, y) where

  • y is observed (the data)
  • x is unobserved (the latent variable)
  • θ is an unknown parameter

ex) Probabilistic representation of a state space model:

  • x_{t+1} = f(x_t, u_t, v_t; θ)   ⇔   p_θ(x_{t+1} | x_t),
  • y_t = g(x_t, u_t, e_t; θ)      ⇔   p_θ(y_t | x_t).

Therefore,

  p_θ(x_{0:T}, y_{1:T}) = p(x_0) ∏_{t=0}^{T−1} p_θ(x_{t+1} | x_t) ∏_{t=1}^{T} p_θ(y_t | x_t).
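The factorization means the log-joint is a sum of one-step terms. A sketch for a hypothetical linear-Gaussian instance (the specific densities are stand-ins of my own, not from the talk):

```python
import numpy as np

def normal_logpdf(z, mean, var):
    """Log-density of N(mean, var) evaluated at z."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def log_joint(x, y, a, sigma_v2, sigma_e2):
    """log p_theta(x_{0:T}, y_{1:T}) for a hypothetical model with
    x_0 ~ N(0, 1), x_t | x_{t-1} ~ N(a x_{t-1}, sigma_v2), y_t | x_t ~ N(x_t, sigma_e2)."""
    T = len(y)
    lp = normal_logpdf(x[0], 0.0, 1.0)            # log p(x_0)
    for t in range(1, T + 1):
        lp += normal_logpdf(x[t], a * x[t - 1], sigma_v2)  # log p(x_t | x_{t-1})
        lp += normal_logpdf(y[t - 1], x[t], sigma_e2)      # log p(y_t | x_t)
    return lp
```

For x = y = 0 every term reduces to the normalizing constant, which gives an easy sanity check.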

SLIDE 11

Maximum likelihood estimation

Aim: Compute the maximum likelihood estimate,

  • θ_ML = arg max_θ {log p_θ(y)}.

Evaluating the likelihood p_θ(y) requires marginalizing the latent variable,

  • p_θ(y) = ∫ p_θ(x, y) dx.

Expectation maximization (EM) avoids explicit marginalization by iteratively optimizing lower bounds on the likelihood.

SLIDE 12

Expectation Maximization [3]

The EM algorithm generates a sequence of iterates θ_0, θ_1, θ_2, . . . Iterate:

  (E) Q_k(θ) := ∫ log p_θ(x, y) p_{θ_{k−1}}(x | y) dx,
  (M) θ_k = arg max_θ Q_k(θ).

The iterates {θ_k}_{k≥0} converge (under weak conditions) to a stationary point of p_θ(y).
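On a model simple enough that both steps are available in closed form, the iteration becomes transparent. A toy sketch of my own (not from the talk): latent x_i ∼ N(θ, 1), observed y_i ∼ N(x_i, 1), so p_θ(x_i | y_i) = N((θ + y_i)/2, 1/2) and the M-step returns the average posterior mean:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, np.sqrt(2.0), size=200)  # marginally y_i ~ N(theta_true, 2)

theta = 0.0                                  # initial guess
for k in range(100):
    # (E) posterior mean of each latent x_i given y_i at the current theta
    x_mean = (theta + y) / 2.0
    # (M) maximizing Q gives theta = average posterior mean
    theta = x_mean.mean()

theta_ml = y.mean()  # closed-form MLE for this model
```

Here each iteration halves the distance to the MLE ȳ, a concrete instance of EM's monotone improvement.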

SLIDE 13

EM for nonlinear state space models

ex) For a state space model,

  Q_k(θ) = ∫ log p_θ(x_{0:T}, y_{1:T}) p_{θ_{k−1}}(x_{0:T} | y_{1:T}) dx_{0:T}.

First hurdle: a high-dimensional integration problem!

SLIDE 14

Stochastic Approximation EM

SLIDE 15

Stochastic approximation of the Q-function

  Q_k(θ) := ∫ log p_θ(x, y) p_{θ_{k−1}}(x | y) dx.

If x̃ ∼ p_{θ_{k−1}}(x | y), then log p_θ(x̃, y) is an unbiased estimate of Q_k(θ). Replace the E-step with:

  (S) Simulate x̃_k ∼ p_{θ_{k−1}}(x | y),
  (E) Q̂_k(θ) = Q̂_{k−1}(θ) + γ_k (log p_θ(x̃_k, y) − Q̂_{k−1}(θ)).

Here ∑_k γ_k = ∞, ∑_k γ_k² < ∞, and Q̂_0(θ) ≡ 0.

Intuitive interpretation: If θ_k converges, the x-values sampled at iterations k, k+1, . . . will come from approximately the same distribution.
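With the canonical choice γ_k = 1/k, the (E) recursion is exactly a running average of the sampled terms, which makes the smoothing interpretation concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=1000)  # stand-ins for the sampled log p_theta terms

q_hat = 0.0                # corresponds to Q_hat_0 ≡ 0
for k, zk in enumerate(z, start=1):
    gamma = 1.0 / k        # satisfies sum gamma_k = inf, sum gamma_k^2 < inf
    q_hat = q_hat + gamma * (zk - q_hat)
```

After K steps, q_hat equals the sample mean of z_1, . . . , z_K; general step sizes trade this averaging off against faster adaptation.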

SLIDE 16

Representation of Q̂

How is Q̂_k(θ) = Q̂_{k−1}(θ) + γ_k (log p_θ(x̃_k, y) − Q̂_{k−1}(θ)) represented in practice?

  • 1. In general, as a weighted sum,

      Q̂_k(θ) = ∑_{j=1}^{k} α_{jk} log p_θ(x̃_j, y).

    But this can be hard to maximize for large k.

  • 2. Exponential family models can be expressed using sufficient statistics,

      log p_θ(x, y) = ⟨ψ(θ), s(x, y)⟩ − A(θ).

    We can then compute

      (E.1) S_k = S_{k−1} + γ_k (s(x̃_k, y) − S_{k−1}),
      (E.2) Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ).

SLIDE 17

Stochastic Approximation EM [2]

The SAEM algorithm for exponential family models. Initialize θ_0 and S_0 = 0. Iterate:

  (S)   Simulate x̃_k ∼ p_{θ_{k−1}}(x | y),
  (E.1) S_k = S_{k−1} + γ_k (s(x̃_k, y) − S_{k−1}),
  (E.2) Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ),
  (M)   θ_k = arg max_θ Q̂_k(θ).

The iterates {θ_k}_{k≥0} converge (under regularity conditions) to a local maximum of p_θ(y).
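A sketch of the full loop on a toy latent-Gaussian model of my own (x_i ∼ N(θ, 1), y_i ∼ N(x_i, 1), not from the talk), where the posterior can be sampled exactly, the sufficient statistic is the mean of the latent variables, and the M-step reduces to θ = S_k:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(2.0, np.sqrt(2.0), size=200)  # marginally y_i ~ N(theta_true, 2)

theta, S = 0.0, 0.0
for k in range(1, 5001):
    # (S) exact posterior draw: p_theta(x_i | y_i) = N((theta + y_i)/2, 1/2)
    x = rng.normal((theta + y) / 2.0, np.sqrt(0.5))
    # (E.1) stochastic approximation of the sufficient statistic s(x, y) = mean(x)
    gamma = 1.0 / k
    S = S + gamma * (x.mean() - S)
    # (M) for this model, arg max of Q_hat is simply theta = S
    theta = S

theta_ml = y.mean()  # closed-form MLE, the point SAEM should approach
```

Replacing the exact posterior draw in (S) with an MCMC kernel turns this into MCMC-SAEM, which is the step taken next.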

SLIDE 18

SAEM for nonlinear state space models

ex) For a state space model, the simulation step involves

  x̃_{0:T} ∼ p_θ(x_{0:T} | y_{1:T}).

Second hurdle: simulating from the joint smoothing distribution is not possible for a nonlinear/non-Gaussian state space model.

SLIDE 19

Combining SAEM and MCMC

SLIDE 20

Markovian noise

We need to simulate from an intractable posterior p_θ(x | y). This is what Markov chain Monte Carlo is designed for!

MCMC, key idea: An ergodic stochastic process has "the same behavior averaged over time as averaged over the space of all the system's states" (Wikipedia):

  ∫ g(x) p_θ(x | y) dx = lim_{K→∞} (1/K) ∑_{k=1}^{K} g(x̃_k).

SLIDE 21

MCMC basics

Requirements: The process {x̃_k}_{k≥1} has to . . .

  • 1. . . . have p_θ(x | y) as stationary distribution
  • 2. . . . be ergodic ("sufficiently stochastic")

MCMC: Initialize x̃_0 and simulate x̃_k ∼ q_θ(x̃ | x̃_{k−1}), k = 1, 2, . . . Here, {q_θ(x̃ | x) : θ ∈ R^{n_θ}} is a family of Markov kernels s.t.,

  • 1. (Stationary distribution) ∫ q_θ(x̃ | x) p_θ(x | y) dx = p_θ(x̃ | y)
  • 2. (Ergodicity) Sufficient condition: ∃ η > 0 s.t. P_{q_θ(· | x)}[x̃ ∈ A] ≥ η P_{p_θ(· | y)}[x̃ ∈ A].
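As a concrete instance of such a kernel q_θ(x̃ | x), here is a random-walk Metropolis–Hastings transition; the target and step size are illustrative assumptions of mine (MH is one of the classical kernels, not the one PSAEM itself uses):

```python
import numpy as np

def mh_kernel(x, log_target, step, rng):
    """One random-walk Metropolis-Hastings transition; leaves the target invariant."""
    prop = x + step * rng.normal()
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        return prop  # accept the proposal
    return x         # reject: stay at the current state

# Stand-in target: standard normal (playing the role of p_theta(x | y))
log_target = lambda x: -0.5 * x ** 2
rng = np.random.default_rng(4)
x, chain = 0.0, []
for _ in range(20000):
    x = mh_kernel(x, log_target, step=1.0, rng=rng)
    chain.append(x)
chain = np.asarray(chain)
```

Time averages over the chain then approximate expectations under the target, exactly as on the previous slide.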

SLIDE 22

MCMC methods

65+ years of research on MCMC has resulted in many clever ways of constructing qθ( x | x)

  • Metropolis–Hastings
  • Gibbs sampling
  • Langevin algorithms
  • Hamiltonian Monte Carlo
  • Slice sampling
  • Particle Markov chain Monte Carlo (← what we use in PSAEM)
  • . . .


SLIDE 23

MCMC-SAEM [4]

MCMC-SAEM: Initialize θ_0, x̃_0 and S_0 = 0. Iterate:

  (S)   Simulate x̃_k ∼ q_{θ_{k−1}}(x̃ | x̃_{k−1}),
  (E.1) S_k = S_{k−1} + γ_k (s(x̃_k, y) − S_{k−1}),
  (E.2) Q̂_k(θ) = ⟨ψ(θ), S_k⟩ − A(θ),
  (M)   θ_k = arg max_θ Q̂_k(θ).

Under appropriate conditions the iterates {θ_k}_{k≥0} converge to a local maximum of p_θ(y). We require that {q_θ(x̃ | x) : θ ∈ R^{n_θ}} . . .

  • . . . has the correct stationary distribution and is ergodic for "all" θ
  • . . . is smooth in θ: q_θ(x̃ | x) ≈ q_{θ′}(x̃ | x) if θ ≈ θ′

SLIDE 24

MCMC-SAEM for nonlinear state space models

ex) For a state space model, the MCMC step aims at simulating from the joint smoothing distribution p_θ(x_{0:T} | y_{1:T}).

Third hurdle: how can we construct an efficient MCMC method, with the required properties, to simulate from this distribution?

SLIDE 25

Particle SAEM

SLIDE 26

Particle MCMC [1]

We propose to use particle MCMC (PMCMC):

  • 1. Use a particle filter to approximate

      p_θ(x_{0:T} | y_{1:T}) ≈ p̂_θ(x_{0:T} | y_{1:T}) := ∑_{i=1}^{N} w_T^i δ_{x_{0:T}^i}(x_{0:T}).

  • 2. Use the approximation to define an MCMC kernel on the space of state trajectories.
  • 3. Some "tricks" are used to ensure the correct stationary distribution for any N ≥ 1.

SLIDE 27

PMCMC – Part 1: The particle filter

Use importance sampling to approximate p_θ(x_0), p_θ(x_{0:1} | y_1), p_θ(x_{0:2} | y_{1:2}), . . . , p_θ(x_{0:T} | y_{1:T}), sequentially in time. Exploits the temporal structure of a dynamical system!
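A minimal bootstrap particle filter for a hypothetical scalar linear-Gaussian stand-in model (the propagate/weight/resample mechanics are the point, not the model):

```python
import numpy as np

def bootstrap_pf(y, N, a, sigma_v, sigma_e, rng):
    """Bootstrap particle filter: propagate, weight by the likelihood, resample."""
    T = len(y)
    traj = np.empty((T + 1, N))
    traj[0] = rng.normal(0.0, 1.0, size=N)                # particles at t = 0
    for t in range(T):
        x = a * traj[t] + sigma_v * rng.normal(size=N)    # propagate through dynamics
        logw = -0.5 * (y[t] - x) ** 2 / sigma_e ** 2      # weight (Gaussian observation)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)                  # multinomial resampling
        traj[: t + 1] = traj[: t + 1, idx]                # resample the full histories
        traj[t + 1] = x[idx]
    return traj

rng = np.random.default_rng(5)
y = rng.normal(size=25)
traj = bootstrap_pf(y, N=100, a=0.9, sigma_v=0.5, sigma_e=0.5, rng=rng)
```

Because resampling is done at every step, the columns of traj are equally weighted at the end, so drawing a state trajectory (Part 2) amounts to picking one column uniformly at random.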

SLIDE 28

PMCMC – Part 1

[Figure: particle filter illustration; particle trajectories over time (state vs. time).]

SLIDE 29

PMCMC – Part 2: Sampling a state trajectory

From the particle filter approximation

  p̂_θ(x_{0:T} | y_{1:T}) = ∑_{i=1}^{N} w_T^i δ_{x_{0:T}^i}(x_{0:T})

we can simulate a state trajectory

  x̃_{0:T} ∼ p̂_θ(x_{0:T} | y_{1:T}).

[Figure: a single trajectory sampled from the particle system (state vs. time).]

Using a standard particle filter, we need N → ∞ to obtain a sample from the correct smoothing distribution.

SLIDE 30

PMCMC – Part 3: The "tricks"

Let x̃_{0:T} = (x̃_0, . . . , x̃_T) be a fixed reference trajectory. Run a particle filter with the modifications:

  • Sample only N − 1 particles in the standard way.
  • Set the Nth particle deterministically: x_t^N = x̃_t.

After a complete run of the particle filter, sample

  x̃′_{0:T} ∼ p̂_θ(x_{0:T} | y_{1:T}).

This defines a Markov kernel q_θ(x̃′_{0:T} | x̃_{0:T}) with stationary distribution p_θ(x_{0:T} | y_{1:T}) for any N ≥ 1.
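In code the modification is small. A sketch of the conditional particle filter kernel (the ancestor sampling refinement of [5] is omitted), reusing a hypothetical scalar linear-Gaussian stand-in model; one call is one draw from q_θ(x̃′_{0:T} | x̃_{0:T}):

```python
import numpy as np

def conditional_pf(y, ref, N, a, sigma_v, sigma_e, rng):
    """One CPF kernel step: run a PF with particle N pinned to the reference
    trajectory `ref`, then sample a new trajectory from the particle system."""
    T = len(y)
    traj = np.empty((T + 1, N))
    traj[0, :-1] = rng.normal(0.0, 1.0, size=N - 1)
    traj[0, -1] = ref[0]                        # deterministic reference particle
    w = np.full(N, 1.0 / N)
    for t in range(T):
        # Resample ancestors; the reference lineage is kept intact
        idx = rng.choice(N, size=N, p=w)
        idx[-1] = N - 1
        traj[: t + 1] = traj[: t + 1, idx]
        # Propagate, pinning particle N to the reference
        traj[t + 1] = a * traj[t] + sigma_v * rng.normal(size=N)
        traj[t + 1, -1] = ref[t + 1]
        # Weight by the observation likelihood
        logw = -0.5 * (y[t] - traj[t + 1]) ** 2 / sigma_e ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
    j = rng.choice(N, p=w)                      # draw the new reference trajectory
    return traj[:, j]

rng = np.random.default_rng(6)
y = rng.normal(size=25)
ref = np.zeros(26)
new_ref = conditional_pf(y, ref, N=20, a=0.9, sigma_v=0.5, sigma_e=0.5, rng=rng)
```

Iterating conditional_pf, feeding each output back in as the next reference, produces a Markov chain on trajectory space with the smoothing distribution as its stationary distribution.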

SLIDE 31

Conditional Particle filter with Ancestor Sampling [5]

[Figure: CPF-AS illustration; particle system with the reference trajectory (state vs. time).]

SLIDE 32

Conditional Particle filter with Ancestor Sampling [5]

[Figure: CPF-AS illustration, continued (state vs. time).]

SLIDE 33

Putting the pieces together

[Diagram: Expectation Maximization + stochastic approximation → SAEM; particle filters + MCMC → PMCMC; SAEM + PMCMC → PSAEM.]

  • A. Lindholm and F. Lindsten. Learning dynamical systems with particle stochastic approximation EM. Submitted, 2019.
  • F. Lindsten. An efficient stochastic approximation EM algorithm using conditional particle filters. Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013.

SLIDE 34

Numerical examples

SLIDE 35

ex) Nonlinear toy model (I/II)

Consider,

  x_{t+1} = a x_t + b x_t / (1 + x_t²) + c cos(1.2 t) + v_t,   v_t ∼ N(0, σ_v²),
  y_t = 0.05 x_t² + e_t,                                       e_t ∼ N(0, σ_e²).

  • Parameterization: θ = (a, b, c, σ_v², σ_e²)
  • Relative error: ‖(θ_k − θ_ML) ./ θ_ML‖₂
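Simulating this model takes only a few lines. The parameter values below are commonly used with this benchmark model, but treat them as an assumption here rather than the values from the experiment:

```python
import numpy as np

# Assumed parameter values (a common choice for this benchmark model)
a, b, c = 0.5, 25.0, 8.0
sigma_v, sigma_e = np.sqrt(10.0), 1.0

rng = np.random.default_rng(7)
T = 100
x = np.empty(T + 1)
x[0] = 0.0
y = np.empty(T)
for t in range(T):
    # x_{t+1} = a x_t + b x_t / (1 + x_t^2) + c cos(1.2 t) + v_t
    x[t + 1] = (a * x[t] + b * x[t] / (1 + x[t] ** 2)
                + c * np.cos(1.2 * t) + sigma_v * rng.normal())
    # y_t = 0.05 x_t^2 + e_t
    y[t] = 0.05 * x[t] ** 2 + sigma_e * rng.normal()
```

The squared observation makes the model bimodal in the state, which is what makes it a hard test case for smoothing-based identification.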

SLIDE 36

ex) Nonlinear toy model (II/II)

[Figure: Average relative error vs. computational time (seconds), log-log scale. Curves: PSAEM N=15; PSEM N=15, N=50, N=100, N=500.]

SLIDE 37

ex) Cascaded water tanks (I/II)

Cascaded water tank system from the nonlinear system identification benchmark: http://www.nonlinearbenchmark.org/

  • M. Schoukens & J.P. Noël. Three Benchmarks Addressing Open Challenges in Nonlinear System Identification. 20th World Congress of the International Federation of Automatic Control (IFAC), Toulouse, France, July 2017.

We use a gray-box nonlinear state space model,

  x_{t+1}^u = 10 ∧ x_t^u + T_s(−k_1 √(10 ∧ x_t^u) − k_2 (10 ∧ x_t^u) + k_5 u_t) + v_t^u,
  x_{t+1}^l = 10 ∧ x_t^l + T_s(k_1 √(10 ∧ x_t^u) + k_2 (10 ∧ x_t^u) − k_3 √(10 ∧ x_t^l) − k_4 (10 ∧ x_t^l) + k_6 ((x_t^u − 10) ∨ 0)) + v_t^l,
  y_t = 10 ∧ x_t^l + e_t,

with θ = {k_1, k_2, k_3, k_4, k_5, k_6, σ_v², σ_e², x_0}. Here a ∧ b = min(a, b) and a ∨ b = max(a, b).
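The ∧ and ∨ operators encode tank saturation (levels capped at the tank height 10) and upper-tank overflow; in code they are plain min/max clamps. A sketch of one time update with placeholder parameter values of my own (not the estimated ones):

```python
import numpy as np

def tank_update(xu, xl, u, k, Ts, rng, sigma_v=0.0):
    """One step of the gray-box model; the wedge/vee operators become min/max."""
    k1, k2, k3, k4, k5, k6 = k
    cu = min(10.0, xu)             # upper level saturates at the tank height 10
    cl = min(10.0, xl)             # lower level saturates likewise
    over = max(xu - 10.0, 0.0)     # overflow from the upper tank into the lower
    xu_next = cu + Ts * (-k1 * np.sqrt(cu) - k2 * cu + k5 * u) + sigma_v * rng.normal()
    xl_next = cl + Ts * (k1 * np.sqrt(cu) + k2 * cu - k3 * np.sqrt(cl)
                         - k4 * cl + k6 * over) + sigma_v * rng.normal()
    return xu_next, xl_next

rng = np.random.default_rng(8)
k = (0.1, 0.01, 0.1, 0.01, 0.2, 0.3)  # placeholder parameter values
xu, xl = 5.0, 3.0
for t in range(50):
    xu, xl = tank_update(xu, xl, u=5.0, k=k, Ts=1.0, rng=rng)
```

The clamps are what make the model non-smooth, and hence awkward for gradient-based identification but unproblematic for PSAEM.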

SLIDE 38

ex) Cascaded water tanks (II/II)

Evaluated by computing simulation error on test data.

  Model                       | Error
  ----------------------------|------
  Initial model to PSAEM      | 2.85
  Estimated with PSAEM        | 0.29
  Best result from literature | 0.34

SLIDE 39

Wrapping up

SLIDE 40

MCMC + Stochastic Approximation

Particle Stochastic Approximation EM (PSAEM) is . . . an example of how MCMC and stochastic approximation can come together!

Another example: variational inference. Idea: Approximate an intractable distribution p(x) ≈ q_η(x) by optimizing some divergence w.r.t. η.

  • (Standard VI) Minimize KL(q_η ‖ p)
  • (Inclusive VI) Minimize KL(p ‖ q_η): possible using MCMC coupled with stochastic gradient descent!

[Figure: densities q_{η_0}, q_{η⋆} and p.]

SLIDE 41

Summary

  • Possible to build powerful algorithms from simple parts
  • Usability is an issue! We need ergonomic tools (https://birch-lang.org/, PNLSS Toolbox, . . . )
  • Combining MCMC and stochastic approximation can turn ad hoc approximations into consistent algorithms!
  • PSAEM: a consistent algorithm for maximum-likelihood identification of nonlinear state-space models
  • Computational cost: O(K), K → ∞
  • Previous SOTA: O(KNM), K → ∞, N → ∞, M → ∞

SLIDE 42

References

[1] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269–342, 2010.

[2] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. The Annals of Statistics, 27(1):94–128, 1999.

[3] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[4] E. Kuhn and M. Lavielle. Coupling a stochastic approximation version of EM with an MCMC procedure. ESAIM: Probability and Statistics, 8:115–131, 2004.

[5] F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research, 15:2145–2184, 2014.