

SLIDE 1

Recitations for 10-701:

Hidden Markov Model, Kalman Filter and A Unifying View

Mu Li April 16, 2013

SLIDE 2

Outline

Hidden Markov Model Kalman Filter A Unifying View of Linear Gaussian Models

based on slides from Simma & Batzoglou

SLIDE 3

Outline

Hidden Markov Model Kalman Filter A Unifying View of Linear Gaussian Models

SLIDE 4

Example: The Dishonest Casino

One day you go to Las Vegas and meet a casino player who has two dice:

◮ Fair die

P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

◮ Loaded die

P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

The player switches between the dice roughly once every 18 turns. The game:

  • 1. You roll a fair die
  • 2. The casino player rolls, maybe with the fair die, maybe with the loaded die
  • 3. The highest number wins
SLIDE 5

Modeling as HMM

◮ two hidden states: fair, loaded
◮ state transition model
◮ observation model

For an HMM, we typically want to ask three questions.
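As a concrete sketch, the casino model can be written down numerically. The 1/18 switch probability and the uniform initial distribution are assumptions read off the slides, not values stated as exact there:

```python
import numpy as np

# States: 0 = fair, 1 = loaded.
# The slides say the player switches roughly once every 18 turns,
# so we assume a per-roll switch probability of 1/18.
A = np.array([[17/18, 1/18],
              [1/18, 17/18]])        # state transition model

# Emission probabilities for faces 1..6 (columns 0..5).
B = np.array([[1/6] * 6,             # fair die
              [1/10] * 5 + [1/2]])   # loaded die

Pi = np.array([0.5, 0.5])            # assumed uniform initial state

assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
```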

SLIDE 6

Question 1: Evaluation

Given:

◮ a sequence of rolls by the casino player
◮ the models of the dice and the work pattern of the casino player

Question: How likely is the following sequence?

124552646214243156636266613666166466513612115146234126

Answer: probability ≈ 10⁻³⁷

SLIDE 7

Question 2: Decoding

Given:

◮ a sequence of rolls by the casino player
◮ the models of the dice and the work pattern of the casino player

Question: What portion was generated by the fair die, and what portion by the loaded die?

Answer:
124552646214243156 (fair)
636266613666166466 (loaded)
513612115146234126 (fair)
SLIDE 8

Question 3: Learning

Given a sequence of rolls by the casino player.

Question:
◮ How “loaded” is the loaded die?
◮ How “fair” is the fair die?
◮ How often does the casino player change the die?

Answer:
124552646214243156
636266613666166466 (P(6) ≈ 66.6%)
513612115146234126

SLIDE 9

More Examples: Speech Recognition

Given an audio waveform, we would like to robustly extract and recognize any spoken words.

SLIDE 10

Biological Sequence Analysis

Use temporal models to exploit sequential structure, such as DNA sequences

SLIDE 11

Financial Forecasting

Predict future market behaviors from historical data, news reports, expert opinions

SLIDE 12

Discrete Markov Process

Assume:

◮ k states {1, . . . , k}
◮ state transition probability aij = P(x_{t+1} = j | x_t = i), satisfying aij ≥ 0 and Σ_{j=1}^k aij = 1

Given a state sequence {x1, . . . , xT}, where xt ∈ {1, . . . , k},

P(x1, . . . , xT) = P(x1) P(x2|x1) · · · P(xT|x_{T−1}) = π_{x1} a_{x1 x2} · · · a_{x_{T−1} xT}

SLIDE 13

Extension to HMM

◮ k states, state transition probability A = {aij}, initial state distribution Π = {πi}, hidden state sequence X = {x1, . . . , xT}
◮ observed sequence Y = {y1, . . . , yT}, where yt ∈ {1, . . . , m}
◮ observation symbol probability B = {bj(ℓ)}, where bj(ℓ) = P(yt = ℓ | xt = j)

Denote by Λ = (A, B, Π) the model parameters; then

P(X, Y | Λ) = P(x1) ∏_{t=1}^{T−1} P(x_{t+1} | xt) ∏_{t=1}^{T} P(yt | xt)

SLIDE 14

Three problems of HMM

Evaluation: Given observation sequence Y and model parameters Λ, how to compute P(Y | Λ)?
Decoding: Given observation Y and model parameters Λ, how to choose the “optimal” hidden state sequence X?
Learning: How to find the model parameters Λ that maximize P(Y | Λ)?

SLIDE 15

Problem 1: Evaluation

The naive solution: since it is easy to compute P(Y, X | Λ),

P(Y | Λ) = Σ_{all possible X} P(Y, X | Λ)

However, the time complexity is O(T k^T); even for 5 states and 100 observations, there are on the order of 10^72 operations.

But the HMM is a tree, so certainly we can have polynomial algorithms.
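The operation count on the slide can be checked directly; a quick back-of-the-envelope in code, counting roughly T multiplications for each of the k^T hidden sequences:

```python
import math

k, T = 5, 100
# Naive evaluation sums P(Y, X | Lambda) over all k**T hidden sequences,
# each term costing on the order of T multiplications.
ops = T * k**T
magnitude = round(math.log10(ops))   # ≈ 72, matching the slide
```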

SLIDE 16

The forward procedure

Let αt(i) = P(y1, . . . , yt, xt = i | Λ); then

P(Y | Λ) = Σ_{i=1}^k αT(i)

αt(i) can be computed recursively:

α1(i) = πi bi(y1)  ∀i
α_{t+1}(i) = Σ_{j=1}^k P(y_{t+1} | x_{t+1} = i) P(x_{t+1} = i | xt = j) αt(j)
           = bi(y_{t+1}) Σ_{j=1}^k aji αt(j)  ∀i, t ≥ 1
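The forward recursion above translates almost line-for-line into code. This is a minimal sketch; the toy parameters at the bottom are made up for illustration, not taken from the slides:

```python
import numpy as np

def forward(Pi, A, B, y):
    """Forward procedure: alpha[t, i] = P(y_1..y_t, x_t = i | Lambda).

    Pi: (k,) initial distribution; A: (k, k) transitions with
    A[i, j] = P(x_{t+1} = j | x_t = i); B: (k, m) emissions; y: observations.
    """
    T, k = len(y), len(Pi)
    alpha = np.zeros((T, k))
    alpha[0] = Pi * B[:, y[0]]                      # alpha_1(i) = pi_i b_i(y_1)
    for t in range(T - 1):
        # alpha_{t+1}(i) = b_i(y_{t+1}) * sum_j a_{ji} alpha_t(j)
        alpha[t + 1] = B[:, y[t + 1]] * (alpha[t] @ A)
    return alpha

# Toy example with made-up parameters:
Pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
y = [0, 1, 1, 0]
alpha = forward(Pi, A, B, y)
likelihood = alpha[-1].sum()    # P(Y | Lambda) = sum_i alpha_T(i)
```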

SLIDE 17

Illustration of the forward procedure

αt(i) are represented by nodes:

α1(i) = πi bi(y1)  ∀i
α_{t+1}(i) = bi(y_{t+1}) Σ_{j=1}^k aji αt(j)  ∀i, t ≥ 1

SLIDE 18

The backward procedure

Similarly, we can compute in a backward way. Let βt(i) = P(y_{t+1}, . . . , yT | xt = i, Λ); then

P(Y | Λ) = Σ_{i=1}^k πi bi(y1) β1(i) = Σ_{i=1}^k αt(i) βt(i)  ∀t

βt(i) can also be computed recursively:

βT(i) = 1  ∀i
βt(i) = Σ_{j=1}^k P(y_{t+1} | x_{t+1} = j) P(x_{t+1} = j | xt = i) β_{t+1}(j)
      = Σ_{j=1}^k bj(y_{t+1}) aij β_{t+1}(j)  ∀i, t < T
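The backward recursion can be sketched the same way; the toy parameters below are the same made-up ones used in the forward example, so the two procedures should agree on P(Y | Λ):

```python
import numpy as np

def backward(A, B, y):
    """Backward procedure: beta[t, i] = P(y_{t+1}..y_T | x_t = i, Lambda)."""
    T, k = len(y), A.shape[0]
    beta = np.ones((T, k))                           # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j b_j(y_{t+1}) a_{ij} beta_{t+1}(j)
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

# Same toy parameters as the forward example:
Pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
y = [0, 1, 1, 0]
beta = backward(A, B, y)
# P(Y | Lambda) = sum_i pi_i b_i(y_1) beta_1(i)
likelihood = (Pi * B[:, y[0]] * beta[0]).sum()
```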

SLIDE 19

Problem 2: Decoding

There are several possible optimality criteria. One is “individually most likely”. Define the probability of being in state i at time t given Y and Λ:

γt(i) = P(xt = i | Y, Λ) = P(xt = i, Y | Λ) / P(Y | Λ) = αt(i) βt(i) / Σ_{j=1}^k αt(j) βt(j)

Choose the individually most likely state:

x*t = argmax_i γt(i)

The problem: this ignores the sequence structure of X, and may select {. . . , i, j, . . .} even if aij = 0.

SLIDE 20

Viterbi Algorithm

The improved criterion is to find the best state sequence

argmax_X P(X | Y, Λ) = argmax_X P(Y, X | Λ),

which can be solved by dynamic programming easily. Define

δt(i) = max_{x1,...,x_{t−1}} P(x1, . . . , x_{t−1}, xt = i, y1, . . . , yt | Λ)

Then max P(Y, X | Λ) = max_i δT(i), and

δ1(i) = πi bi(y1)
δ_{t+1}(i) = max_j P(y_{t+1} | x_{t+1} = i) P(x_{t+1} = i | xt = j) δt(j)
           = max_j bi(y_{t+1}) aji δt(j)  ∀t ≥ 1

SLIDE 21

Viterbi Algorithm

Given

δt(i) = max_{x1,...,x_{t−1}} P(x1, . . . , x_{t−1}, xt = i, y1, . . . , yt | Λ)

further let

φ1(i) = 0
φ_{t+1}(i) = argmax_j δt(j) aji  ∀t ≥ 1

Then the optimal state sequence maximizing P(X | Y, Λ) can be obtained by backtracking:

x*T = argmax_i δT(i)
x*_{t−1} = φt(x*t)  for t = T, T − 1, . . .
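The two Viterbi slides combine into one short routine: a forward pass storing δ and the backpointers φ, then backtracking. The toy parameters at the end are made up for illustration:

```python
import numpy as np

def viterbi(Pi, A, B, y):
    """Most likely hidden state sequence argmax_X P(X | Y, Lambda)."""
    T, k = len(y), len(Pi)
    delta = np.zeros((T, k))            # delta_t(i)
    phi = np.zeros((T, k), dtype=int)   # backpointers phi_t(i)
    delta[0] = Pi * B[:, y[0]]
    for t in range(1, T):
        # scores[j, i] = delta_{t-1}(j) * a_{ji}
        scores = delta[t - 1][:, None] * A
        phi[t] = scores.argmax(axis=0)
        delta[t] = B[:, y[t]] * scores.max(axis=0)
    # Backtracking: x*_T = argmax_i delta_T(i), x*_{t-1} = phi_t(x*_t).
    x = np.zeros(T, dtype=int)
    x[-1] = delta[-1].argmax()
    for t in range(T - 1, 0, -1):
        x[t - 1] = phi[t, x[t]]
    return x

# Toy example with made-up parameters:
Pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
path = viterbi(Pi, A, B, [0, 1, 1, 0])
```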

SLIDE 22

Problem 3: Learning

◮ find Λ to maximize P(Y | Λ)
◮ can be solved by the EM algorithm. The objective function is not convex; only a local maximum is guaranteed.

Define the probability of being in state i at time t and state j at time t + 1:

ξt(i, j) = P(xt = i, x_{t+1} = j | Y, Λ) = αt(i) aij bj(y_{t+1}) β_{t+1}(j) / P(Y | Λ)
         = αt(i) aij bj(y_{t+1}) β_{t+1}(j) / Σ_{i,j} αt(i) aij bj(y_{t+1}) β_{t+1}(j)

and the probability of being in state i at time t:

γt(i) = Σ_{j=1}^k ξt(i, j)

SLIDE 23

Learning, cont

Then we have the following update rules. Iterate until convergence:

π′i ← #(state i at time 1) = γ1(i)

a′ij ← #(transitions from state i to j) / #(transitions from state i)
     = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i)

b′i(ℓ) ← #(observations of ℓ in state i) / #(state i)
       = Σ_{t=1}^{T} 1{yt = ℓ} γt(i) / Σ_{t=1}^{T} γt(i)

◮ the new parameters are still probabilities: Σ_{i=1}^k π′i = 1, Σ_{j=1}^k a′ij = 1, Σ_{ℓ=1}^m b′i(ℓ) = 1
◮ non-decreasing: P(Y | Λ′) ≥ P(Y | Λ)
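One full Baum-Welch iteration can be sketched by combining the forward/backward passes with the ξ, γ quantities and the update rules above. A minimal sketch; the toy parameters are made up for illustration:

```python
import numpy as np

def baum_welch_step(Pi, A, B, y):
    """One EM update of (Pi, A, B) from a single observation sequence y."""
    T, k = len(y), len(Pi)
    # Forward and backward passes (as on the earlier slides).
    alpha = np.zeros((T, k)); beta = np.ones((T, k))
    alpha[0] = Pi * B[:, y[0]]
    for t in range(T - 1):
        alpha[t + 1] = B[:, y[t + 1]] * (alpha[t] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    PY = alpha[-1].sum()
    gamma = alpha * beta / PY                       # gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(y_{t+1}) beta_{t+1}(j) / P(Y)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, y[1:]].T * beta[1:])[:, None, :]) / PY
    # Update rules.
    Pi2 = gamma[0]
    A2 = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    B2 = np.zeros_like(B)
    for l in range(B.shape[1]):
        B2[:, l] = gamma[np.array(y) == l].sum(0) / gamma.sum(0)
    return Pi2, A2, B2

# Toy example with made-up parameters:
Pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
y = [0, 1, 1, 0, 0, 1]
Pi2, A2, B2 = baum_welch_step(Pi, A, B, y)
```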

SLIDE 24

HMM variants

◮ left-right model, namely aij = 0 for j < i
◮ continuous observations, namely yt is continuous; one convenient assumption is Gaussian, P(yt | xt = i) = N(µi, Σi)
◮ auto-regressive HMM

SLIDE 25

Outline

Hidden Markov Model Kalman Filter A Unifying View of Linear Gaussian Models

SLIDE 26

Basic Idea

The Kalman filter, also known as a linear dynamical system, is just like an HMM, except the hidden states are continuous. An example, object tracking: estimate the motion of targets in the 3D world from indirect, potentially noisy measurements.

SLIDE 27

Object Tracking: 2D example

◮ Let xt,1, xt,2 be the object position at time t, and xt,3, xt,4 be the corresponding velocity
◮ Let ∆ be the sampling period, and assume the following random acceleration model:

  [x_{t+1,1}]   [1 0 ∆ 0] [xt,1]   [ǫt,1]
  [x_{t+1,2}] = [0 1 0 ∆] [xt,2] + [ǫt,2]
  [x_{t+1,3}]   [0 0 1 0] [xt,3]   [ǫt,3]
  [x_{t+1,4}]   [0 0 0 1] [xt,4]   [ǫt,4]

  where ǫt ∼ N(0, Q) is the system noise

◮ Suppose only positions are observed:

  [yt,1]   [1 0 0 0] [xt,1]   [δt,1]
  [yt,2] = [0 1 0 0] [xt,2] + [δt,2]
                     [xt,3]
                     [xt,4]

  where δt ∼ N(0, R) is the measurement noise
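The random acceleration model above can be simulated directly. The sampling period and the noise covariances below are arbitrary assumed values, chosen only to make the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.1                                  # sampling period Delta (assumed)
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float) # constant-velocity dynamics
Bobs = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0]], dtype=float)  # only positions observed
Q = 0.01 * np.eye(4)                      # system noise covariance (assumed)
R = 0.1 * np.eye(2)                       # measurement noise covariance (assumed)

x = np.array([0.0, 0.0, 1.0, 0.5])        # initial position and velocity
states, obs = [], []
for _ in range(50):
    x = A @ x + rng.multivariate_normal(np.zeros(4), Q)   # x_{t+1} = A x_t + eps_t
    states.append(x)
    obs.append(Bobs @ x + rng.multivariate_normal(np.zeros(2), R))  # y_t
```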

SLIDE 28

Example: Robot Navigation

Simultaneous Localization and Mapping (SLAM): as the robot moves, estimate its pose and the world geometry. We will come back to the Kalman filter later.

SLIDE 29

Outline

Hidden Markov Model Kalman Filter A Unifying View of Linear Gaussian Models

SLIDE 30

Discrete time linear dynamic with Gaussian noise

For each time t = 1, 2, . . ., the system generates state xt ∈ R^k and observation yt ∈ R^p by:

x_{t+1} = A xt + wt,  wt ∼ N(0, Q)
yt = B xt + vt,  vt ∼ N(0, R)

Both wt and vt are temporally white (uncorrelated over time). If we assume the initial state x1 ∼ N(π, Q1), then all xt and yt are Gaussian:

x_{t+1} | xt ∼ N(A xt, Q)
yt | xt ∼ N(B xt, R)

SLIDE 31

Question 1: Evaluation

◮ Assume we have state sequence X, observation sequence Y, and model parameters Λ
◮ Due to the Markov property, the forward-backward methods used for the HMM to evaluate P(Y | Λ) can be applied here
◮ Remember that

P(X, Y | Λ) = P(x1 | Λ) ∏_{t=1}^{T−1} P(x_{t+1} | xt, Λ) ∏_{t=1}^{T} P(yt | xt, Λ)

Then it is easy to compute P(X | Y, Λ) = P(X, Y | Λ) / P(Y | Λ)

SLIDE 32

Two types of Decoding

Filtering: online style, only uses observations up to time t: P(xt | y1, . . . , yt)
Smoothing: offline style, uses all observations: P(xt | y1, . . . , yT)

[Figure: (a) true model, (b) Kalman filtering, (c) Kalman smoothing]
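Filtering has a closed form for this model: the standard Kalman predict/update recursion keeps P(xt | y1, . . . , yt) Gaussian. A minimal sketch, assuming (mu0, P0) is the Gaussian prior on the first state:

```python
import numpy as np

def kalman_filter(ys, A, B, Q, R, mu0, P0):
    """Filtering: P(x_t | y_1..y_t) = N(mu_t, P_t) for a linear Gaussian model."""
    mu, P = mu0, P0
    means = []
    for t, y in enumerate(ys):
        if t > 0:
            # Predict: push the posterior through x_{t+1} = A x_t + w_t.
            mu, P = A @ mu, A @ P @ A.T + Q
        # Update: condition on the observation y_t = B x_t + v_t.
        S = B @ P @ B.T + R                 # innovation covariance
        K = P @ B.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu + K @ (y - B @ mu)
        P = (np.eye(len(mu)) - K @ B) @ P
        means.append(mu)
    return means
```

Smoothing would add a backward (Rauch-Tung-Striebel) pass over the stored filtered estimates to obtain P(xt | y1, . . . , yT).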

SLIDE 33

Learning: EM algorithm

Remember we compute Λ by maximizing P(Y | Λ), which equals maximizing the log-likelihood:

L(Λ) = log P(Y | Λ) = log ∫_X P(X, Y | Λ) dX ≥ ∫_X Q(X) log [ P(X, Y | Λ) / Q(X) ] dX = F(Q, Λ)

for any distribution Q over X; the inequality is due to Jensen's inequality.

SLIDE 34

EM algorithm, cont

EM, which is a coordinate ascent algorithm, iterates until convergence:

E-step: Q′ ← argmax_Q F(Q, Λ)
M-step: Λ′ ← argmax_Λ F(Q′, Λ)

Note that argmax_Q F(Q, Λ) = P(X | Y, Λ), so the above two steps are equivalent to

Λ′ ← argmax_{Λ∗} ∫_X P(X | Y, Λ) log P(X, Y | Λ∗) dX

SLIDE 35

Continuous-state

xt is a continuous vector:

x1 ∼ N(π, Q1)
x_{t+1} = A xt + wt,  wt ∼ N(0, Q)
yt = B xt + vt,  vt ∼ N(0, R)

Static (A = 0): each point is generated independently and identically, yt ∼ N(0, B Q Bᵀ + R)
◮ Q = I, R diagonal ⇒ factor analysis
◮ Q = I, R = lim_{ǫ→0} ǫI ⇒ PCA

Time-series (A ≠ 0) ⇒ Kalman filter

SLIDE 36

Discrete-State

Let winner-take-all WTA(x)i = 1 if i = argmax_j xj, and 0 otherwise:

x1 ∼ WTA(N(π, Q1))
x_{t+1} = WTA(A xt + wt),  wt ∼ N(0, Q)
yt = B xt + vt,  vt ∼ N(0, R)

Static (A = 0) ⇒ mixture of Gaussians, P(yt) ∼ Σ_{i=1}^k πi N(Bi, R), where Bi is the i-th column of B

Time-series (A ≠ 0) ⇒ hidden Markov models

SLIDE 37

Conclusion

Topics covered today:

HMM: discrete Markov process with hidden states
  Evaluation: P(Y | Λ), forward-backward algorithm
  Decoding: argmax_X P(X | Y, Λ), Viterbi algorithm
  Learning: argmax_Λ P(Y | Λ), EM algorithm
Kalman filter: similar to the HMM, except with continuous hidden states
Unifying view: discrete-time linear dynamical system with Gaussian noise
◮ three questions: evaluation, decoding, learning
◮ continuous-state: factor analysis, PCA, Kalman filter
◮ discrete-state: mixture of Gaussians, HMM