Probabilistic Graphical Models

10-708

Factor Analysis and State Space Models

Eric Xing

Lecture 15, Nov 2, 2005. Reading: MJ Chap. 13, 14, 15

A road map to more complex dynamic models

[Figure: a roadmap of latent-variable models, organized by discrete vs. continuous latent and observed variables. Single time slice: mixture model (e.g., mixture of multinomials), mixture model (e.g., mixture of Gaussians), factor analysis. Sequential versions: HMM (for discrete sequential data, e.g., text), HMM (for continuous sequential data, e.g., speech signal), state space model. With multiple hidden chains: factorial HMM, switching SSM.]


Review: A primer to multivariate Gaussian

Multivariate Gaussian density:

    p(x | µ, Σ) = (2π)^{-n/2} |Σ|^{-1/2} exp{ -(1/2) (x − µ)^T Σ^{-1} (x − µ) }

A joint Gaussian over a partitioned vector:

    p(x_1, x_2 | µ, Σ) = N( [x_1; x_2] ; [µ_1; µ_2], [Σ_11  Σ_12; Σ_21  Σ_22] )

How do we write down p(x_1), p(x_1|x_2) or p(x_2|x_1) using the block elements in µ and Σ?

  • Formulas to remember:

    Conditional:  p(x_1 | x_2) = N(x_1; m_{1|2}, V_{1|2}), where
                  m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 − µ_2)
                  V_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21

    Marginal:     p(x_2) = N(x_2; m_2, V_2), where  m_2 = µ_2,  V_2 = Σ_22
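As a quick sanity check on these block formulas, here is a minimal NumPy sketch (not from the original slides; the function and variable names are illustrative) that conditions a joint Gaussian on one block:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Mean and covariance of p(x1 | x2) for a joint Gaussian N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    gain = S12 @ np.linalg.inv(S22)          # Sigma_12 Sigma_22^{-1}
    m = mu1 + gain @ (x2 - mu2)              # conditional mean
    V = S11 - gain @ S12.T                   # conditional covariance
    return m, V

# Example: 2-D joint, condition the first coordinate on the second.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
m, V = condition_gaussian(mu, Sigma, np.array([0]), np.array([1]), np.array([2.0]))
```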

Review: The matrix inverse lemma

Consider a block-partitioned matrix:

    M = [ E  F
          G  H ]

First we diagonalize M:

    [ I  −F H^{-1} ] [ E  F ] [ I          0 ]   [ E − F H^{-1} G   0 ]
    [ 0   I        ] [ G  H ] [ −H^{-1} G  I ] = [ 0                H ]

  • Schur complement:  M/H = E − F H^{-1} G

Then we invert, using the formula (XYZ)^{-1} = Z^{-1} Y^{-1} X^{-1}:

    M^{-1} = [ I          0 ] [ (M/H)^{-1}  0      ] [ I  −F H^{-1} ]
             [ −H^{-1} G  I ] [ 0           H^{-1} ] [ 0   I        ]

           = [ (M/H)^{-1}               −(M/H)^{-1} F H^{-1}                       ]
             [ −H^{-1} G (M/H)^{-1}      H^{-1} + H^{-1} G (M/H)^{-1} F H^{-1}     ]

Writing the same inverse in terms of the other Schur complement, M/E = H − G E^{-1} F, and equating the corresponding blocks gives the matrix inverse lemma:

    (E − F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H − G E^{-1} F)^{-1} G E^{-1}
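A small numerical check of the lemma (illustrative only; the block sizes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((3, 3)) + 4 * np.eye(3)   # keep the blocks well conditioned
F = rng.standard_normal((3, 2))
G = rng.standard_normal((2, 3))
H = rng.standard_normal((2, 2)) + 4 * np.eye(2)

Ei, Hi = np.linalg.inv(E), np.linalg.inv(H)
lhs = np.linalg.inv(E - F @ Hi @ G)
rhs = Ei + Ei @ F @ np.linalg.inv(H - G @ Ei @ F) @ G @ Ei
assert np.allclose(lhs, rhs)   # matrix inverse lemma holds numerically
```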


Review: Some matrix algebra

Trace and derivatives

  • tr[A] ≡ Σ_i a_ii
  • Cyclical permutations:  tr[ABC] = tr[CAB] = tr[BCA]
  • Derivatives:

        ∂ tr[BA] / ∂A = B^T
        ∂ tr[x^T A x] / ∂A = ∂ tr[A x x^T] / ∂A = x x^T

Determinants and derivatives

  • ∂ log|A| / ∂A = A^{-T}

Factor analysis

An unsupervised linear regression model:

    p(x) = N(x; 0, I)
    p(y | x) = N(y; µ + Λx, Ψ)

where Λ is called a factor loading matrix, and Ψ is diagonal.

Geometric interpretation

  • To generate data, first generate a point within the manifold, then add
    noise. The coordinates of the point are the components of the latent variable.
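A minimal NumPy sketch of this generative process (the dimensions and parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, N = 5, 2, 1000                      # observed dim, latent dim, sample size
mu = np.zeros(p)
Lam = rng.standard_normal((p, k))         # factor loading matrix Lambda
Psi = np.diag(rng.uniform(0.1, 0.5, p))   # diagonal noise covariance Psi

X = rng.standard_normal((N, k))                          # x ~ N(0, I)
noise = rng.multivariate_normal(np.zeros(p), Psi, size=N)
Y = mu + X @ Lam.T + noise                                # y | x ~ N(mu + Lambda x, Psi)

# Empirical covariance of y should approach Lambda Lambda^T + Psi.
print(np.round(np.cov(Y.T) - (Lam @ Lam.T + Psi), 2))
```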


Marginal data distribution

A marginal Gaussian (e.g., p(x)) times a conditional Gaussian (e.g., p(y|x)) is
a joint Gaussian.

Any marginal (e.g., p(y)) of a joint Gaussian (e.g., p(x,y)) is also a Gaussian.

  • Since the marginal is Gaussian, we can determine it by just computing
    its mean and variance. (Assume the noise W ~ N(0, Ψ) is uncorrelated with the data.)

    E[Y]   = E[µ + ΛX + W] = µ + Λ E[X] + E[W] = µ
    Var[Y] = E[(Y − µ)(Y − µ)^T]
           = E[(ΛX + W)(ΛX + W)^T]
           = Λ E[X X^T] Λ^T + E[W W^T]
           = ΛΛ^T + Ψ

FA = Constrained-Covariance Gaussian

Marginal density for factor analysis (y is p-dim, x is k-dim):

    p(y | θ) = N(y; µ, ΛΛ^T + Ψ)

So the effective covariance is the low-rank outer product of two long skinny
matrices plus a diagonal matrix.

In other words, factor analysis is just a constrained Gaussian model. (If Ψ
were not diagonal then we could model any Gaussian and it would be pointless.)


FA joint distribution

Model:

    p(x) = N(x; 0, I),   p(y | x) = N(y; µ + Λx, Ψ)

Covariance between x and y (assume the noise is uncorrelated with the data or
the latent variables):

    Cov[X, Y] = E[(X − 0)(Y − µ)^T] = E[X (µ + ΛX + W − µ)^T]
              = E[X X^T Λ^T + X W^T] = Λ^T

Hence the joint distribution of x and y:

    p( [x; y] ) = N( [x; y] ; [0; µ], [ I    Λ^T
                                        Λ    ΛΛ^T + Ψ ] )

Inference in Factor Analysis

Apply the Gaussian conditioning formulas to the joint distribution we derived
above, where

    Σ_11 = I,   Σ_12 = Σ_21^T = Λ^T,   Σ_22 = ΛΛ^T + Ψ

We can now derive the posterior of the latent variable x given observation y,
p(x | y) = N(x; m_{1|2}, V_{1|2}), where

    m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (y − µ_2) = Λ^T (ΛΛ^T + Ψ)^{-1} (y − µ)
    V_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21    = I − Λ^T (ΛΛ^T + Ψ)^{-1} Λ

Applying the matrix inversion lemma

    (E − F H^{-1} G)^{-1} = E^{-1} + E^{-1} F (H − G E^{-1} F)^{-1} G E^{-1}

we get

    V_{1|2} = (I + Λ^T Ψ^{-1} Λ)^{-1}
    m_{1|2} = V_{1|2} Λ^T Ψ^{-1} (y − µ)

  • Here we only need to invert a matrix of size |x|×|x|, instead of |y|×|y|.
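A minimal sketch of this posterior computation (the function name is illustrative; it reuses Lam, Psi, mu from the sampling sketch above):

```python
import numpy as np

def fa_posterior(y, mu, Lam, Psi):
    """Posterior p(x | y) = N(m, V) for the factor analysis model."""
    k = Lam.shape[1]
    Psi_inv = np.linalg.inv(Psi)                            # Psi is diagonal, so this is cheap
    V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)    # only a k x k inverse
    m = V @ Lam.T @ Psi_inv @ (y - mu)
    return m, V
```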


Geometric interpretation: inference is linear projection

The posterior is:

    p(x | y) = N(x; m_{1|2}, V_{1|2})

    V_{1|2} = (I + Λ^T Ψ^{-1} Λ)^{-1}
    m_{1|2} = V_{1|2} Λ^T Ψ^{-1} (y − µ)

The posterior covariance does not depend on the observed data y!

Computing the posterior mean is just a linear operation.

EM for Factor Analysis

Incomplete-data log likelihood function (marginal density of y):

    ℓ(θ; D) = −(N/2) log|ΛΛ^T + Ψ| − (1/2) Σ_n (y_n − µ)^T (ΛΛ^T + Ψ)^{-1} (y_n − µ)
            = −(N/2) log|ΛΛ^T + Ψ| − (N/2) tr[(ΛΛ^T + Ψ)^{-1} S],

    where  S = (1/N) Σ_n (y_n − µ)(y_n − µ)^T

  • Estimating µ is trivial:  µ̂_ML = (1/N) Σ_n y_n
  • Parameters Λ and Ψ are coupled nonlinearly in the log likelihood.

Complete log likelihood:

    ℓ_c(θ; D) = Σ_n log p(x_n, y_n) = Σ_n log p(x_n) + Σ_n log p(y_n | x_n)
              = −(N/2) log|I| − (1/2) Σ_n x_n^T x_n
                − (N/2) log|Ψ| − (1/2) Σ_n (y_n − Λx_n)^T Ψ^{-1} (y_n − Λx_n)
              = −(1/2) Σ_n x_n^T x_n − (N/2) log|Ψ| − (N/2) tr[S Ψ^{-1}],

    where  S = (1/N) Σ_n (y_n − Λx_n)(y_n − Λx_n)^T


E-step for Factor Analysis

Compute  ⟨ℓ_c(θ; D)⟩_{p(x|y)}:

    ⟨ℓ_c(θ; D)⟩ = −(1/2) Σ_n ⟨x_n^T x_n⟩ − (N/2) log|Ψ| − (N/2) tr[⟨S⟩ Ψ^{-1}],

    N⟨S⟩ = Σ_n ( y_n y_n^T − y_n ⟨X_n⟩^T Λ^T − Λ ⟨X_n⟩ y_n^T + Λ ⟨X_n X_n^T⟩ Λ^T )

so the E-step needs the expected sufficient statistics

    ⟨X_n⟩       = E[X_n | y_n]
    ⟨X_n X_n^T⟩ = Var[X_n | y_n] + E[X_n | y_n] E[X_n | y_n]^T

  • Recall that we have derived:

    V_{1|2} = (I + Λ^T Ψ^{-1} Λ)^{-1},   m_{1|2} = V_{1|2} Λ^T Ψ^{-1} (y − µ)

    ⇒  ⟨X_n⟩ = m_{x_n|y_n} = V_{1|2} Λ^T Ψ^{-1} (y_n − µ)
       and  ⟨X_n X_n^T⟩ = V_{1|2} + m_{x_n|y_n} m_{x_n|y_n}^T

M-step for Factor Analysis

Take the derivatives of the expected complete log likelihood wrt. the
parameters, using the trace and determinant derivative rules:

    ∂⟨ℓ_c⟩/∂Ψ^{-1} = ∂/∂Ψ^{-1} [ −(1/2) Σ_n ⟨x_n^T x_n⟩ + (N/2) log|Ψ^{-1}| − (N/2) tr(⟨S⟩ Ψ^{-1}) ]
                   = (N/2) Ψ − (N/2) ⟨S⟩

    ⇒  Ψ^{t+1} = ⟨S⟩

    ∂⟨ℓ_c⟩/∂Λ = ∂/∂Λ [ −(N/2) tr(⟨S⟩ Ψ^{-1}) ]
              = −(1/2) Ψ^{-1} ∂/∂Λ Σ_n ( y_n y_n^T − y_n ⟨X_n⟩^T Λ^T − Λ ⟨X_n⟩ y_n^T + Λ ⟨X_n X_n^T⟩ Λ^T )
              = Ψ^{-1} Σ_n y_n ⟨X_n⟩^T − Ψ^{-1} Λ Σ_n ⟨X_n X_n^T⟩

    ⇒  Λ^{t+1} = ( Σ_n y_n ⟨X_n⟩^T ) ( Σ_n ⟨X_n X_n^T⟩ )^{-1}
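Putting the E- and M-steps together, here is a compact, unoptimized sketch of one EM pass, assuming data Y of shape (N, p), µ already set to the data mean, and the same parameter names as the earlier sketches:

```python
import numpy as np

def fa_em_step(Y, mu, Lam, Psi):
    """One EM iteration for factor analysis; returns updated (Lam, Psi)."""
    N, p = Y.shape
    k = Lam.shape[1]
    Psi_inv = np.linalg.inv(Psi)
    V = np.linalg.inv(np.eye(k) + Lam.T @ Psi_inv @ Lam)   # shared posterior covariance
    M = (Y - mu) @ Psi_inv @ Lam @ V                        # row n is <X_n>^T

    # Expected sufficient statistics.
    Exx = N * V + M.T @ M                                   # sum_n <X_n X_n^T>
    Eyx = (Y - mu).T @ M                                    # sum_n (y_n - mu) <X_n>^T

    Lam_new = Eyx @ np.linalg.inv(Exx)
    S = ((Y - mu).T @ (Y - mu) - Lam_new @ Eyx.T) / N       # <S>, using the new Lambda
    Psi_new = np.diag(np.diag(S))                           # keep Psi diagonal
    return Lam_new, Psi_new
```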


Model Invariance and Identifiability

There is degeneracy in the FA model. Since Λ only appears as the outer product
ΛΛ^T, the model is invariant to rotations and axis flips of the latent space.

We can replace Λ with ΛQ for any orthonormal matrix Q and the model remains
the same: (ΛQ)(ΛQ)^T = Λ(QQ^T)Λ^T = ΛΛ^T.

This means that there is no “one best” setting of the parameters: an infinite
number of parameter settings all achieve the same ML score!

Such models are called un-identifiable, since two people both fitting ML
parameters to the identical data will not be guaranteed to identify the same
parameters.

Independent Components Analysis (ICA)

  • ICA is similar to FA, except it assumes the latent source has a
    non-Gaussian density.
  • Hence ICA can extract higher-order moments (not just second order).
  • It is commonly used to solve blind source separation (the cocktail party
    problem).
  • Independent Factor Analysis (IFA) is an approximation to ICA where
    we model the source using a mixture of Gaussians.

[Figure: graphical models for FA, mixture of FA, and IFA.]


A road map to more complex dynamic models

[Figure: the same roadmap of models as above, repeated here to mark the
transition from factor analysis to the sequential models: HMMs, state space
models, factorial HMMs, and switching SSMs.]

State space models (SSM)

A sequential FA, or a continuous-state HMM:

    x_t = A x_{t-1} + G w_t,    w_t ~ N(0, Q)
    y_t = C x_t + v_t,          v_t ~ N(0, R)
    x_1 ~ N(0, Σ)

This is a linear dynamic system (LDS).

In general,

    x_t = f(x_{t-1}) + G w_t
    y_t = g(x_t) + v_t

where f is an (arbitrary) dynamic model, and g is an (arbitrary) observation
model.
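A minimal sketch of simulating such a linear dynamic system (the function name, dimensions and noise draws are illustrative, not part of the slides):

```python
import numpy as np

def simulate_lds(A, C, G, Q, R, Sigma0, T, rng):
    """Draw a trajectory (x_1..x_T, y_1..y_T) from a linear-Gaussian SSM."""
    nx, ny = A.shape[0], C.shape[0]
    xs, ys = np.zeros((T, nx)), np.zeros((T, ny))
    x = rng.multivariate_normal(np.zeros(nx), Sigma0)        # x_1 ~ N(0, Sigma)
    for t in range(T):
        if t > 0:
            w = rng.multivariate_normal(np.zeros(Q.shape[0]), Q)
            x = A @ x + G @ w                                # x_t = A x_{t-1} + G w_t
        v = rng.multivariate_normal(np.zeros(ny), R)
        xs[t], ys[t] = x, C @ x + v                          # y_t = C x_t + v_t
    return xs, ys
```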

The inference problem

Filtering: given y_1, …, y_t, estimate x_t.

  • The Kalman filter is a way to perform exact online inference (sequential
    Bayesian updating) in an LDS. It is the Gaussian analog of the forward
    algorithm for HMMs:

        α_t^i = p(X_t = i | y_{1:t}) ∝ p(y_t | X_t = i) Σ_j p(X_t = i | X_{t-1} = j) α_{t-1}^j

Smoothing: given y_1, …, y_T, estimate x_t (t < T).

  • The Rauch-Tung-Striebel smoother is a way to perform exact offline
    inference in an LDS. It is the Gaussian analog of the forwards-backwards
    algorithm:

        p(X_t = i | y_{1:T}) ∝ α_t^i β_t^i

Online vs offline inference


LDS for 2D tracking

Dynamics: new position = old position + ∆ × velocity + noise
(constant velocity model, Gaussian noise):

    [ x1_t  ]   [ 1 0 ∆ 0 ] [ x1_{t-1}  ]
    [ x2_t  ] = [ 0 1 0 ∆ ] [ x2_{t-1}  ]  + noise
    [ ẋ1_t  ]   [ 0 0 1 0 ] [ ẋ1_{t-1}  ]
    [ ẋ2_t  ]   [ 0 0 0 1 ] [ ẋ2_{t-1}  ]

Observation: project out the first two components (we observe the Cartesian
position of the object, so the observation is linear):

    [ y1_t ]   [ 1 0 0 0 ] [ x1_t  ]
    [ y2_t ] = [ 0 1 0 0 ] [ x2_t  ]  + noise
                           [ ẋ1_t  ]
                           [ ẋ2_t  ]

[Figure: 2D tracking.]
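In code, the corresponding system matrices might look like this (a sketch; ∆ and the noise levels are placeholder values, not from the slides):

```python
import numpy as np

dt = 1.0                                # time step Delta
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])           # constant-velocity dynamics
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]])            # observe position only
Q = 0.01 * np.eye(4)                    # process noise covariance (taking G = I)
R = 0.1 * np.eye(2)                     # observation noise covariance
```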


Kalman filtering in the brain?

Kalman filtering derivation

Since all the CPDs are linear Gaussian, the system defines a large
multivariate Gaussian.

  • Hence all marginals are Gaussian.
  • Hence we can represent the belief state p(X_t | y_{1:t}) as a Gaussian with mean

        x̂_{t|t} ≡ E[X_t | y_1, …, y_t]

    and covariance

        P_{t|t} ≡ E[(X_t − x̂_{t|t})(X_t − x̂_{t|t})^T | y_1, …, y_t]

  • It is common to work with the inverse covariance (precision) matrix P_{t|t}^{-1};
    this is called information form.

Kalman filtering is a recursive procedure to update the belief state:

  • Predict step: compute p(X_{t+1} | y_{1:t}) from the prior belief p(X_t | y_{1:t})
    and the dynamical model p(X_{t+1} | X_t) --- the time update.
  • Update step: compute the new belief p(X_{t+1} | y_{1:t+1}) from the prediction
    p(X_{t+1} | y_{1:t}), the observation y_{t+1} and the observation model
    p(y_{t+1} | X_{t+1}) --- the measurement update.


Predict step

Dynamical model:  x_{t+1} = A x_t + G w_t,  w_t ~ N(0, Q)

  • One-step-ahead prediction of the state:

        x̂_{t+1|t} = E[X_{t+1} | y_1, …, y_t] = A x̂_{t|t}

        P_{t+1|t} = E[(X_{t+1} − x̂_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_1, …, y_t]
                  = A P_{t|t} A^T + G Q G^T

Observation model:  y_t = C x_t + v_t,  v_t ~ N(0, R)

  • One-step-ahead prediction of the observation:

        ŷ_{t+1|t} = E[Y_{t+1} | y_1, …, y_t] = C E[X_{t+1} | y_1, …, y_t] = C x̂_{t+1|t}

        E[(Y_{t+1} − ŷ_{t+1|t})(Y_{t+1} − ŷ_{t+1|t})^T | y_1, …, y_t] = C P_{t+1|t} C^T + R

        E[(Y_{t+1} − ŷ_{t+1|t})(X_{t+1} − x̂_{t+1|t})^T | y_1, …, y_t] = C P_{t+1|t}

Update step

Summarizing the results from the previous slide, we have
p(X_{t+1}, Y_{t+1} | y_{1:t}) ~ N(m_{t+1}, V_{t+1}), where

    m_{t+1} = [ x̂_{t+1|t}
                C x̂_{t+1|t} ],

    V_{t+1} = [ P_{t+1|t}        P_{t+1|t} C^T
                C P_{t+1|t}      C P_{t+1|t} C^T + R ]

Remember the formulas for conditional Gaussian distributions:

    p(x_1, x_2 | µ, Σ) = N( [x_1; x_2] ; [µ_1; µ_2], [Σ_11  Σ_12; Σ_21  Σ_22] )

    p(x_1 | x_2) = N(x_1; m_{1|2}, V_{1|2}),   m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 − µ_2),
                                               V_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21
    p(x_2) = N(x_2; m_2, V_2),                 m_2 = µ_2,   V_2 = Σ_22
slide-14
SLIDE 14

14

Kalman Filter

Measurement updates:

  • where Kt+1 is the Kalman gain matrix

Time updates: Kt can be pre-computed (since it is independent of the data).

t t t t | |

ˆ ˆ x x A =

+1 T

GQG A AP P + =

+ t t t t | | 1

) x ˆ C (y ˆ ˆ

t | 1 t 1 t | | + + + + + +

− + =

1 1 1 1 t t t t t

K x x

t t t t t t | | | 1 1 1 1 + + + +

= KCP

  • P

P

  • 1

T T

R C CP C P K ) (

| |

+ =

+ + + t t t t t 1 1 1
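A minimal NumPy sketch of one filter step under these equations (the function name and the G = I simplification are mine, not the slides'):

```python
import numpy as np

def kalman_step(x, P, y, A, C, Q, R):
    """One Kalman filter step: time update followed by measurement update."""
    # Time update (predict).
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q                       # taking G = I
    # Measurement update (correct).
    S = C @ P_pred @ C.T + R                       # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)            # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```

This could be run on the 2D tracking matrices sketched earlier, e.g. `x, P = kalman_step(x, P, y, A, C, Q, R)` inside a loop over the observations.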

Example of KF in 1D

Consider noisy observations of a 1D particle doing a random walk:

    x_t = x_{t-1} + w,   w ~ N(0, σ_x)
    z_t = x_t + v,       v ~ N(0, σ_z)

(i.e., A = C = G = 1, Q = σ_x, R = σ_z). The KF equations become:

    x̂_{t+1|t} = A x̂_{t|t} = x̂_{t|t}

    P_{t+1|t} = A P_{t|t} A^T + G Q G^T = σ_{t|t} + σ_x

    K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1} = (σ_{t|t} + σ_x) / (σ_{t|t} + σ_x + σ_z)

    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} (z_{t+1} − C x̂_{t+1|t})
                = [ (σ_{t|t} + σ_x) z_{t+1} + σ_z x̂_{t|t} ] / (σ_{t|t} + σ_x + σ_z)

    P_{t+1|t+1} = P_{t+1|t} − K_{t+1} C P_{t+1|t}
                = (σ_{t|t} + σ_x) σ_z / (σ_{t|t} + σ_x + σ_z)
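As a usage sketch, these scalar recursions take only a few lines of Python (the function and variable names are mine):

```python
import numpy as np

def kf_1d(zs, sigma_x, sigma_z, x0=0.0, p0=1.0):
    """Scalar Kalman filter for the random-walk-plus-noise model above."""
    x, p = x0, p0
    means = []
    for z in zs:
        p_pred = p + sigma_x                        # predicted variance
        k = p_pred / (p_pred + sigma_z)             # Kalman gain
        x = x + k * (z - x)                         # innovation-weighted update
        p = p_pred * sigma_z / (p_pred + sigma_z)   # posterior variance
        means.append(x)
    return np.array(means)
```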


KF intuition

The KF update of the mean is

    x̂_{t+1|t+1} = x̂_{t+1|t} + K_{t+1} (z_{t+1} − C x̂_{t+1|t})
                = [ (σ_{t|t} + σ_x) z_{t+1} + σ_z x̂_{t|t} ] / (σ_{t|t} + σ_x + σ_z)

  • the term (z_{t+1} − C x̂_{t+1|t}) is called the innovation

The new belief is a convex combination of the prediction and the observation,
weighted by the Kalman gain matrix:

    K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}

If the observation is unreliable, σ_z (i.e., R) is large, so K_{t+1} is small,
and we pay more attention to the prediction.

If the old prior is unreliable (large σ_{t|t}) or the process is very
unpredictable (large σ_x), we pay more attention to the observation.

KF, RLS and LMS

The KF update of the mean is

    x̂_{t+1|t+1} = A x̂_{t|t} + K_{t+1} (y_{t+1} − C x̂_{t+1|t})

Consider the special case where the hidden state is a constant, x_t = θ, but
the “observation matrix” C is a time-varying vector, C_t = x_t^T.

  • Hence the observation model at each time slice,

        y_t = x_t^T θ + v_t,

    is a linear regression.

We can estimate θ recursively using the Kalman filter:

    θ̂_{t+1} = θ̂_t + P_{t+1} R^{-1} x_{t+1} (y_{t+1} − x_{t+1}^T θ̂_t)

This is called the recursive least squares (RLS) algorithm.

We can approximate P_{t+1} R^{-1} by a scalar constant η_t:

    θ̂_{t+1} = θ̂_t + η_t x_{t+1} (y_{t+1} − x_{t+1}^T θ̂_t)

This is called the least mean squares (LMS) algorithm.

We can adapt η_t online using stochastic approximation theory.
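A minimal sketch of the LMS special case (constant step size; the function name, eta value and loop structure are illustrative):

```python
import numpy as np

def lms(X, y, eta=0.01, n_epochs=1):
    """Least mean squares: online estimation of theta in y_t = x_t^T theta + noise."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            theta = theta + eta * x_t * (y_t - x_t @ theta)   # innovation-driven update
    return theta
```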


Complexity of one KF step

Let X_t ∈ R^{N_x} and Y_t ∈ R^{N_y}.

Computing  P_{t+1|t} = A P_{t|t} A^T + G Q G^T  takes O(N_x^2) time, assuming
dense P and dense A.

Computing  K_{t+1} = P_{t+1|t} C^T (C P_{t+1|t} C^T + R)^{-1}  takes O(N_y^3) time.

So the overall time is, in general, O(max{N_x^2, N_y^3}).

Rauch-Tung-Striebel smoother

  • General structure: KF results + the difference between the "smoothed" and
    predicted results at the next step:

        x̂_{t|T} = x̂_{t|t} + L_t (x̂_{t+1|T} − x̂_{t+1|t})
        P_{t|T} = P_{t|t} + L_t (P_{t+1|T} − P_{t+1|t}) L_t^T
        L_t = P_{t|t} A^T P_{t+1|t}^{-1}

  • Backward computation: pretend to know things at t+1 --- such conditioning
    makes things simple, and we can remove this condition at the end.

The difficulty: we need x̂_{t|T} ≡ E[X_t | y_1, …, y_T], which conditions on
observations after time t.

The trick: iterate the expectations,

    E[X | Z] = E[ E[X | Y, Z] | Z ]
    Var[X | Z] = Var[ E[X | Y, Z] | Z ] + E[ Var[X | Y, Z] | Z ]

so that

    x̂_{t|T} = E[X_t | y_1, …, y_T]
            = E[ E[X_t | X_{t+1}, y_1, …, y_T] | y_1, …, y_T ]
            = E[ E[X_t | X_{t+1}, y_1, …, y_t] | y_1, …, y_T ]     (Hw!)

and the same for P_{t|T}.


RTS derivation

Following the results from the previous slide, we need to derive
p(X_{t+1}, X_t | y_{1:t}) ~ N(m, V), where

    m = [ x̂_{t+1|t}
          x̂_{t|t}   ],

    V = [ P_{t+1|t}      A P_{t|t}
          P_{t|t} A^T    P_{t|t}   ]

  • all the quantities here are available after a forward KF pass.

Remember the formulas for conditional Gaussian distributions:

    p(x_1, x_2 | µ, Σ) = N( [x_1; x_2] ; [µ_1; µ_2], [Σ_11  Σ_12; Σ_21  Σ_22] )

    p(x_1 | x_2) = N(x_1; m_{1|2}, V_{1|2}),   m_{1|2} = µ_1 + Σ_12 Σ_22^{-1} (x_2 − µ_2),
                                               V_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21
    p(x_2) = N(x_2; m_2, V_2),                 m_2 = µ_2,   V_2 = Σ_22

The RTS smoother: conditioning gives

    E[X_t | X_{t+1}, y_1, …, y_t] = x̂_{t|t} + L_t (X_{t+1} − x̂_{t+1|t}),   L_t = P_{t|t} A^T P_{t+1|t}^{-1}

and averaging over p(X_{t+1} | y_{1:T}) gives

    x̂_{t|T} = E[ E[X_t | X_{t+1}, y_1, …, y_t] | y_1, …, y_T ]
            = x̂_{t|t} + L_t (x̂_{t+1|T} − x̂_{t+1|t})

    P_{t|T} ≡ Var[X_t | y_{1:T}]
            = E[ Var[X_t | X_{t+1}, y_{1:t}] | y_{1:T} ] + Var[ E[X_t | X_{t+1}, y_{1:t}] | y_{1:T} ]
            = P_{t|t} + L_t (P_{t+1|T} − P_{t+1|t}) L_t^T
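A minimal sketch of one backward smoothing step, consuming the filtered and predicted quantities from a forward pass (the function and argument names are illustrative):

```python
import numpy as np

def rts_step(x_filt, P_filt, x_pred, P_pred, x_smooth_next, P_smooth_next, A):
    """One Rauch-Tung-Striebel backward step.

    x_filt, P_filt:   x_{t|t}, P_{t|t}      (filtered)
    x_pred, P_pred:   x_{t+1|t}, P_{t+1|t}  (predicted)
    x_smooth_next, P_smooth_next: x_{t+1|T}, P_{t+1|T} (already smoothed)"""
    L = P_filt @ A.T @ np.linalg.inv(P_pred)                 # smoother gain L_t
    x_smooth = x_filt + L @ (x_smooth_next - x_pred)
    P_smooth = P_filt + L @ (P_smooth_next - P_pred) @ L.T
    return x_smooth, P_smooth
```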

Learning SSMs

Complete log likelihood:

    ℓ_c(θ; D) = Σ_n log p(x_n, y_n)
              = Σ_n log p(x_{n,1}) + Σ_n Σ_t log p(x_{n,t} | x_{n,t-1}) + Σ_n Σ_t log p(y_{n,t} | x_{n,t})
              = f_1(X_1 X_1^T; Σ) + f_2({X_t X_t^T, X_t X_{t-1}^T : ∀t}; A, G, Q)
                + f_3({X_t, X_t X_t^T : ∀t}; C, R)

EM

  • E-step: compute ⟨ℓ_c(θ; D)⟩, i.e., the expected sufficient statistics
    ⟨X_t⟩, ⟨X_t X_t^T⟩ and ⟨X_t X_{t-1}^T⟩ given y_{1:T}. These quantities can be
    inferred via the KF and RTS passes, e.g.,

        ⟨X_t X_t^T⟩ ≡ Var(X_t | y_{1:T}) + E[X_t | y_{1:T}] E[X_t | y_{1:T}]^T
                    = P_{t|T} + x̂_{t|T} x̂_{t|T}^T

  • M-step: MLE using ⟨ℓ_c(θ; D)⟩, c.f. the M-step in factor analysis.


Nonlinear systems

  • In robotics and other problems, the motion model and the observation
    model are often nonlinear:

        x_t = f(x_{t-1}) + w_t,    y_t = g(x_t) + v_t

  • An optimal closed-form solution to the filtering problem is no longer
    possible.
  • The nonlinear functions f and g are sometimes represented by neural
    networks (multi-layer perceptrons or radial basis function networks).
  • The parameters of f and g may be learned offline using EM, where we do
    gradient descent (back propagation) in the M step, c.f. learning an
    MRF/CRF with hidden nodes.
  • Or we may learn the parameters online by adding them to the state space:
    x_t' = (x_t, θ). This makes the problem even more nonlinear.

Extended Kalman Filter (EKF)

The basic idea of the EKF is to linearize f and g using a first-order Taylor
expansion, and then apply the standard KF.

  • i.e., we approximate a stationary nonlinear system with a non-stationary
    linear system:

        x_t ≈ f(x̂_{t-1|t-1}) + Â_{x̂_{t-1|t-1}} (x_{t-1} − x̂_{t-1|t-1}) + w_t
        y_t ≈ g(x̂_{t|t-1}) + Ĉ_{x̂_{t|t-1}} (x_t − x̂_{t|t-1}) + v_t

    where

        x̂_{t|t-1} = f(x̂_{t-1|t-1}),    Â_x̂ ≡ ∂f/∂x |_{x=x̂},    Ĉ_x̂ ≡ ∂g/∂x |_{x=x̂}

The noise covariances (Q and R) are not changed, i.e., the additional error
due to linearization is not modeled.
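A minimal EKF step in this spirit (a sketch; f, g and their Jacobians jac_f, jac_g are assumed to be supplied by the user, and the function name is mine):

```python
import numpy as np

def ekf_step(x, P, y, f, jac_f, g, jac_g, Q, R):
    """One extended Kalman filter step: linearize around the current estimate,
    then apply the standard KF predict/update equations."""
    # Predict, linearizing f around x_{t-1|t-1}.
    A = jac_f(x)
    x_pred = f(x)
    P_pred = A @ P @ A.T + Q
    # Update, linearizing g around x_{t|t-1}.
    C = jac_g(x_pred)
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - g(x_pred))
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```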