SLIDE 1
Recitations for 10-701: Hidden Markov Model, Kalman Filter and A Unifying View
Mu Li
April 16, 2013
Based on slides from Simma & Batzoglou
SLIDE 2
SLIDE 3
Outline
◮ Hidden Markov Model
◮ Kalman Filter
◮ A Unifying View of Linear Gaussian Models
SLIDE 4
Example: The Dishonest Casino
One day you go to Las Vegas, where a casino player has two dice:
◮ Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
◮ Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
The player switches dice about once every 18 turns. The game:
- 1. You roll with a fair die
- 2. The casino player rolls, maybe with the fair die, maybe with the loaded die
- 3. Highest number wins
SLIDE 5
Modeling as HMM
◮ two hidden states: fair, loaded
◮ state transition model
◮ observation model
For an HMM, we typically want to ask three questions.
SLIDE 6
Question 1: Evaluation
Given:
◮ a sequence of rolls by the casino player
◮ the models of the dice and the work pattern of the casino player
Question: How likely is it that the following sequence happens?
124552646214243156636266613666166466513612115146234126
Answer: probability ≈ 10^-37
SLIDE 7
Question 2: Decoding
Given:
◮ a sequence of rolls by the casino player
◮ the models of the dice and the work pattern of the casino player
Question: What portion was generated by the fair die, and what portion by the loaded die? Answer: 124552646214243156
- fair
636266613666166466
- loaded
513612115146234126
- fair
SLIDE 8
Question 3: Learning
Given a sequence of rolls by the casino player Question:
◮ How “loaded” is the loaded die?
◮ How “fair” is the fair die?
◮ How often does the casino player change the die?
Answer: 124552646214243156
636266613666166466
- P(6) = 66.6%
513612115146234126
SLIDE 9
More Examples: Speech Recognition
Given an audio waveform, we would like to robustly extract and recognize any spoken words
SLIDE 10
Biological Sequence Analysis
Use temporal models to exploit sequential structure, such as DNA sequences
SLIDE 11
Financial Forecasting
Predict future market behaviors from historical data, news reports, expert opinions
SLIDE 12
Discrete Markov Process
Assume
◮ k states {1, . . . , k}
◮ state transition probability aij = P(xt+1 = j | xt = i), satisfying aij ≥ 0 and Σ_{j=1}^k aij = 1
Given a state sequence {x1, . . . , xT}, where xt ∈ {1, . . . , k},
P(x1, . . . , xT) = P(x1) P(x2|x1) · · · P(xT|xT−1) = πx1 ax1x2 · · · axT−1xT
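As a sketch of the chain probability above, multiplying the initial probability by the successive transition entries gives P(x1, . . . , xT); the 2-state π and A below are made-up values for illustration, not from the slides.

```python
# Probability of a state sequence under a discrete Markov process.
# pi and A are illustrative, not from the slides.
pi = [0.5, 0.5]              # initial distribution pi_i = P(x_1 = i)
A = [[0.9, 0.1],             # A[i][j] = a_ij = P(x_{t+1} = j | x_t = i)
     [0.2, 0.8]]

def sequence_prob(states, pi, A):
    """P(x_1, ..., x_T) = pi_{x_1} * a_{x_1 x_2} * ... * a_{x_{T-1} x_T}."""
    p = pi[states[0]]
    for i, j in zip(states, states[1:]):
        p *= A[i][j]
    return p

p = sequence_prob([0, 0, 1], pi, A)   # pi_0 * a_00 * a_01
```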
SLIDE 13
Extension to HMM
◮ k states, state transition probability A = {aij}, initial state distribution Π = {πi}, hidden state sequence X = {x1, . . . , xT}
◮ observed sequence Y = {y1, . . . , yT}, where yt ∈ {1, . . . , m}
◮ observation symbol probability B = {bj(ℓ)}, where bj(ℓ) = P(yt = ℓ | xt = j)
Denote by Λ = (A, B, Π) the model parameters; then
P(X, Y |Λ) = P(x1) ∏_{t=1}^{T−1} P(xt+1|xt) ∏_{t=1}^{T} P(yt|xt)
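The factorization above translates directly into code: one product over transitions and one over emissions. The parameter values below are illustrative, not from the slides.

```python
# Joint probability P(X, Y | Lambda) of a hidden-state sequence X and an
# observation sequence Y under an HMM; pi, A, B are illustrative values.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]     # a_ij = P(x_{t+1} = j | x_t = i)
B = [[0.9, 0.1], [0.2, 0.8]]     # b_j(l) = P(y_t = l | x_t = j)

def joint_prob(X, Y, pi, A, B):
    p = pi[X[0]]
    for i, j in zip(X, X[1:]):   # product over state transitions
        p *= A[i][j]
    for x, y in zip(X, Y):       # product over emissions
        p *= B[x][y]
    return p
```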
SLIDE 14
Three problems of HMM
Evaluation: Given observation sequence Y and model parameters Λ, how do we compute P(Y |Λ)?
Decoding: Given observation Y and model parameters Λ, how do we choose the “optimal” hidden state sequence X?
Learning: How do we find the model parameters Λ that maximize P(Y |Λ)?
SLIDE 15
Problem 1: Evaluation
The naive solution: since it is easy to compute P(X, Y |Λ),
P(Y |Λ) = Σ_{all possible X} P(X, Y |Λ)
However, the time complexity is O(T k^T); even for 5 states and 100 observations, there are on the order of 10^72 operations.
But the HMM's graph is a tree, so we can certainly find polynomial-time algorithms.
SLIDE 16
The forward procedure
Let αt(i) = P(y1, . . . , yt, xt = i|Λ); then
P(Y |Λ) = Σ_{i=1}^k αT(i)
αt(i) can be computed recursively:
α1(i) = πi bi(y1) ∀i
αt+1(i) = Σ_{j=1}^k P(yt+1|xt+1 = i) P(xt+1 = i|xt = j) αt(j) = bi(yt+1) Σ_{j=1}^k aji αt(j) ∀i, t ≥ 1
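The recursion above can be sketched in a few lines of NumPy; the parameters below are illustrative toy values, not from the slides.

```python
import numpy as np

def forward(Y, pi, A, B):
    """alpha[t, i] = P(y_1..y_t, x_t = i | Lambda); returns the full table."""
    T, k = len(Y), len(pi)
    alpha = np.zeros((T, k))
    alpha[0] = pi * B[:, Y[0]]                 # alpha_1(i) = pi_i b_i(y_1)
    for t in range(T - 1):
        # alpha_{t+1}(i) = b_i(y_{t+1}) * sum_j a_ji alpha_t(j)
        alpha[t + 1] = B[:, Y[t + 1]] * (alpha[t] @ A)
    return alpha

# Illustrative parameters
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha = forward([0, 1], pi, A, B)
prob_Y = alpha[-1].sum()                       # P(Y | Lambda)
```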
SLIDE 17
Illustration of the forward procedure
αt(i) are represented by nodes:
α1(i) = πi bi(y1) ∀i
αt+1(i) = bi(yt+1) Σ_{j=1}^k aji αt(j) ∀i, t ≥ 1
SLIDE 18
The backward procedure
Similarly, we can compute in a backward way. Let βt(i) = P(yt+1, . . . , yT|xt = i, Λ); then
P(Y |Λ) = Σ_{i=1}^k πi bi(y1) β1(i) = Σ_{i=1}^k αt(i) βt(i) ∀t
βt(i) can also be computed recursively:
βT(i) = 1 ∀i
βt(i) = Σ_{j=1}^k P(yt+1|xt+1 = j) P(xt+1 = j|xt = i) βt+1(j) = Σ_{j=1}^k bj(yt+1) aij βt+1(j) ∀i, t < T
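A matching sketch of the backward recursion, with the same illustrative toy parameters as before; note that Σ_i πi bi(y1) β1(i) recovers the same P(Y |Λ) as the forward pass.

```python
import numpy as np

def backward(Y, A, B):
    """beta[t, i] = P(y_{t+1}..y_T | x_t = i, Lambda)."""
    T, k = len(Y), A.shape[0]
    beta = np.ones((T, k))                     # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(y_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    return beta

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
beta = backward([0, 1], A, B)
prob_Y = (pi * B[:, 0] * beta[0]).sum()        # sum_i pi_i b_i(y_1) beta_1(i)
```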
SLIDE 19
Problem 2: Decoding
There are several possible optimality criteria. One is “individually most likely”. Define the probability of being in state i at time t given Y and Λ:
γt(i) = P(xt = i|Y , Λ) = P(xt = i, Y |Λ) / P(Y |Λ) = αt(i)βt(i) / Σ_{j=1}^k αt(j)βt(j)
Choose the individually most likely state
x*_t = argmax_i γt(i)
The problem: this ignores the sequence structure of X, and may select {. . . , i, j, . . .} even if aij = 0
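A short sketch of the “individually most likely” rule: normalize αt(i)βt(i) row by row and take the argmax. The α and β tables below are illustrative values for a 2-state, 2-step chain, not from the slides.

```python
import numpy as np

# Posterior state marginals gamma[t, i] = P(x_t = i | Y, Lambda) from
# forward/backward tables; alpha and beta here are illustrative values.
alpha = np.array([[0.54, 0.08],
                  [0.041, 0.168]])
beta = np.array([[0.31, 0.52],
                 [1.0, 1.0]])

ab = alpha * beta
gamma = ab / ab.sum(axis=1, keepdims=True)   # each row normalizes by P(Y)
states = gamma.argmax(axis=1)                # individually most likely states
```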
SLIDE 20
Viterbi Algorithm
The improved criterion is to find the best state sequence
argmax_X P(X|Y , Λ) = argmax_X P(X, Y |Λ),
which can be solved easily by dynamic programming. Define
δt(i) = max_{x1,...,xt−1} P(x1, . . . , xt−1, xt = i, y1, . . . , yt|Λ)
then max_X P(X, Y |Λ) = max_i δT(i), and
δ1(i) = πi bi(y1)
δt+1(i) = max_j P(yt+1|xt+1 = i) P(xt+1 = i|xt = j) δt(j) = max_j bi(yt+1) aji δt(j) ∀t ≥ 1
SLIDE 21
Viterbi Algorithm
Given
δt(i) = max_{x1,...,xt−1} P(x1, . . . , xt−1, xt = i, y1, . . . , yt|Λ)
further let
φ1(i) = 0
φt+1(i) = argmax_j δt(j) aji ∀t ≥ 1
Then the optimal state sequence maximizing P(X|Y , Λ) can be obtained by backtracking:
x*_T = argmax_i δT(i)
x*_{t−1} = φt(x*_t) for t = T, T − 1, . . .
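The δ/φ recursion and the backtracking step together can be sketched as follows; the toy parameters are illustrative, not from the slides.

```python
import numpy as np

def viterbi(Y, pi, A, B):
    """Most likely hidden-state path argmax_X P(X | Y, Lambda)."""
    T, k = len(Y), len(pi)
    delta = np.zeros((T, k))                  # delta_t(i)
    phi = np.zeros((T, k), dtype=int)         # backpointers phi_t(i)
    delta[0] = pi * B[:, Y[0]]
    for t in range(T - 1):
        scores = delta[t][:, None] * A        # scores[j, i] = delta_t(j) a_ji
        phi[t + 1] = scores.argmax(axis=0)
        delta[t + 1] = B[:, Y[t + 1]] * scores.max(axis=0)
    path = [int(delta[-1].argmax())]          # x*_T
    for t in range(T - 1, 0, -1):             # backtrack: x*_{t-1} = phi_t(x*_t)
        path.append(int(phi[t][path[-1]]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
```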
SLIDE 22
Problem 3: Learning
◮ find Λ to maximize P(Y |Λ)
◮ can be solved by the EM algorithm; the objective function is not convex, so only a local maximum is guaranteed.
Define the probability of being in state i at time t and state j at time t + 1:
ξt(i, j) = P(xt = i, xt+1 = j|Y , Λ) = αt(i) aij bj(yt+1) βt+1(j) / P(Y |Λ) = αt(i) aij bj(yt+1) βt+1(j) / Σ_{i,j} αt(i) aij bj(yt+1) βt+1(j)
and the probability of being in state i at time t:
γt(i) = Σ_{j=1}^k ξt(i, j)
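The ξ formula above can be sketched directly from the forward/backward tables; the α, β, A, B values below are illustrative for a 2-state, 2-step example, and the per-step denominator equals P(Y |Λ).

```python
import numpy as np

def xi_from_fb(alpha, beta, A, B, Y):
    """xi[t, i, j] = P(x_t = i, x_{t+1} = j | Y, Lambda)."""
    T = len(Y)
    xi = np.zeros((T - 1,) + A.shape)
    for t in range(T - 1):
        num = alpha[t][:, None] * A * B[:, Y[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()               # denominator equals P(Y | Lambda)
    return xi

# Illustrative forward/backward tables
alpha = np.array([[0.54, 0.08], [0.041, 0.168]])
beta = np.array([[0.31, 0.52], [1.0, 1.0]])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
xi = xi_from_fb(alpha, beta, A, B, [0, 1])
gamma0 = xi[0].sum(axis=1)                    # gamma_t(i) = sum_j xi_t(i, j)
```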
SLIDE 23
Learning, cont
Then we have the following update rules; iterate until convergence:
π′_i ← expected #(state i at time 1) = γ1(i)
a′_ij ← expected #(transitions from state i to j) / expected #(transitions from state i) = Σ_{t=1}^{T−1} ξt(i, j) / Σ_{t=1}^{T−1} γt(i)
b′_i(ℓ) ← expected #(observations of ℓ in state i) / expected #(state i) = Σ_{t: yt = ℓ} γt(i) / Σ_{t=1}^{T} γt(i)
◮ new parameters are still probabilities: Σ_{i=1}^k π′_i = 1, Σ_{j=1}^k a′_ij = 1, Σ_{ℓ=1}^m b′_i(ℓ) = 1
◮ non-decreasing: P(Y |Λ′) ≥ P(Y |Λ)
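The update rules above amount to ratios of expected counts. A minimal M-step sketch, assuming γ and ξ have already been computed (the posteriors below are hand-made illustrative values whose ξ slices are row-consistent with γ):

```python
import numpy as np

def m_step(gamma, xi, Y, m):
    """Baum-Welch updates from posteriors gamma (T x k) and xi (T-1 x k x k)."""
    Y = np.asarray(Y)
    pi_new = gamma[0]                                        # pi'_i = gamma_1(i)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b'_i(l): expected count of symbol l in state i, normalized per state
    B_new = np.stack([gamma[Y == l].sum(axis=0) for l in range(m)], axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new

# Illustrative posteriors for T = 3, k = 2, m = 2
gamma = np.array([[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
xi = np.array([[[0.5, 0.3], [0.1, 0.1]],
               [[0.2, 0.1], [0.3, 0.4]]])
pi_new, A_new, B_new = m_step(gamma, xi, [0, 1, 0], 2)
```

The normalization properties on the slide (rows of A′ and B′ sum to 1, π′ sums to 1) hold by construction.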
SLIDE 24
HMM variants
◮ Left-right model, namely aij = 0 for j < i
◮ continuous observations, namely yt is continuous; one convenient assumption is Gaussian, P(yt|xt = i) = N(µi, Σi)
◮ auto-regressive HMM
SLIDE 25
Outline
◮ Hidden Markov Model
◮ Kalman Filter
◮ A Unifying View of Linear Gaussian Models
SLIDE 26
Basic Idea
The Kalman Filter, also known as a linear dynamical system, is just like an HMM, except that the hidden states are continuous. An example:
Object Tracking: estimate the motion of targets in the 3D world from indirect, potentially noisy measurements
SLIDE 27
Object Tracking: 2D example
◮ Let xt,1, xt,2 be the object position at time t and xt,3, xt,4 be the corresponding velocity
◮ let ∆ be the sampling period, and assume the following random acceleration model:

  [xt+1,1]   [1 0 ∆ 0] [xt,1]   [ǫt,1]
  [xt+1,2] = [0 1 0 ∆] [xt,2] + [ǫt,2]
  [xt+1,3]   [0 0 1 0] [xt,3]   [ǫt,3]
  [xt+1,4]   [0 0 0 1] [xt,4]   [ǫt,4]

  where ǫt ∼ N(0, Q) is the system noise
◮ suppose only positions are observed:

  [yt,1]   [1 0 0 0] [xt,1]   [δt,1]
  [yt,2] = [0 1 0 0] [xt,2] + [δt,2]
                     [xt,3]
                     [xt,4]

  where δt ∼ N(0, R) is the measurement noise
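The constant-velocity model above can be sketched as small NumPy matrices; ∆ = 0.1 and the initial state are illustrative choices, and the noise terms are dropped here to show one deterministic prediction step.

```python
import numpy as np

dt = 0.1                              # sampling period Delta (illustrative)
A = np.array([[1, 0, dt, 0],          # position += velocity * Delta
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1.0]])
C = np.array([[1, 0, 0, 0],           # only the two positions are observed
              [0, 1, 0, 0.0]])

x = np.array([0.0, 0.0, 1.0, 2.0])    # position (0, 0), velocity (1, 2)
x_next = A @ x                        # noise-free state prediction
y = C @ x_next                        # noise-free measurement
```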
SLIDE 28
Example: Robot Navigation
Simultaneous Localization and Mapping (SLAM): as the robot moves, estimate its pose and the world geometry. We will come back to the Kalman Filter later.
SLIDE 29
Outline
◮ Hidden Markov Model
◮ Kalman Filter
◮ A Unifying View of Linear Gaussian Models
SLIDE 30
Discrete time linear dynamic with Gaussian noise
For each time t = 1, 2, . . ., the system generates state xt ∈ Rk and observation yt ∈ Rp by:
xt+1 = A xt + wt, wt ∼ N(0, Q)
yt = B xt + vt, vt ∼ N(0, R)
where both wt and vt are temporally white (uncorrelated over time). If we assume the initial state x1 ∼ N(π, Q1), then all xt and yt are Gaussian:
xt+1 | xt ∼ N(A xt, Q)
yt | xt ∼ N(B xt, R)
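The generative process above is easy to simulate; the system matrices below are illustrative values and the random generator is seeded for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed for reproducibility

def sample_lds(A, B, Q, R, pi, Q1, T):
    """Draw one trajectory (X, Y) from the linear dynamical system."""
    k, p = A.shape[0], B.shape[0]
    X, Y = np.zeros((T, k)), np.zeros((T, p))
    X[0] = rng.multivariate_normal(pi, Q1)
    for t in range(T):
        Y[t] = B @ X[t] + rng.multivariate_normal(np.zeros(p), R)
        if t + 1 < T:
            X[t + 1] = A @ X[t] + rng.multivariate_normal(np.zeros(k), Q)
    return X, Y

A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.array([[1.0, 0.0]])
Q = Q1 = 0.1 * np.eye(2)
R = np.array([[0.05]])
X, Y = sample_lds(A, B, Q, R, np.zeros(2), Q1, T=5)
```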
SLIDE 31
Question 1: Evaluation
◮ Assume we have state sequence X, observation sequence Y , and model parameters Λ
◮ Due to the Markov property, the forward-backward methods used for HMMs to evaluate P(Y |Λ) can be applied here
◮ Remember that
P(X, Y |Λ) = P(x1|Λ) ∏_{t=1}^{T−1} P(xt+1|xt, Λ) ∏_{t=1}^{T} P(yt|xt, Λ)
Then it is easy to compute P(X|Y , Λ) = P(X, Y |Λ) / P(Y |Λ)
SLIDE 32
Two types of Decoding
Filtering: online style, using only observations up to time t: P(xt|y1, . . . , yt)
Smoothing: offline style, using all observations: P(xt|y1, . . . , yT)
(Figure: (a) true model, (b) Kalman filtering, (c) Kalman smoothing)
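A minimal sketch of one filtering step, in the standard predict/update form (the closed-form Gaussian update for P(xt|y1, . . . , yt)); the 1-D sanity check at the bottom uses made-up values.

```python
import numpy as np

def kalman_step(mu, P, y, A, B, Q, R):
    """One filtering update: P(x_t | y_1..y_t) from P(x_{t-1} | y_1..y_{t-1})."""
    # Predict: push the previous estimate through the dynamics
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update: correct the prediction with the new measurement y
    S = B @ P_pred @ B.T + R                  # innovation covariance
    K = P_pred @ B.T @ np.linalg.inv(S)       # Kalman gain
    mu_new = mu_pred + K @ (y - B @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ B) @ P_pred
    return mu_new, P_new

# 1-D sanity check: static state observed directly with unit noise
mu, P = kalman_step(np.zeros(1), np.eye(1), np.array([2.0]),
                    np.eye(1), np.eye(1), np.zeros((1, 1)), np.eye(1))
```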
SLIDE 33
Learning: EM algorithm
Recall that we compute Λ by maximizing P(Y |Λ), which is equal to maximizing the log-likelihood
L(Λ) = log P(Y |Λ) = log ∫_X P(X, Y |Λ) dX ≥ ∫_X Q(X) log [P(X, Y |Λ) / Q(X)] dX = F(Q, Λ)
for any distribution Q over X; the inequality is due to Jensen's inequality.
SLIDE 34
EM algorithm, cont
EM, which is a coordinate ascent algorithm, iterates until convergence:
E-step: Q′ ← argmax_Q F(Q, Λ)
M-step: Λ′ ← argmax_Λ F(Q′, Λ)
Note that argmax_Q F(Q, Λ) = P(X|Y , Λ), so the two steps are equal to
Λ′ ← argmax_Λ ∫_X P(X|Y , Λ) log P(X, Y |Λ) dX
SLIDE 35
Continuous-state
xt is a continuous vector:
x1 ∼ N(π, Q1)
xt+1 = A xt + wt, wt ∼ N(0, Q)
yt = B xt + vt, vt ∼ N(0, R)
static (each point generated independently and identically): A = 0 ⇒ yt ∼ N(0, B Q Bᵀ + R)
◮ Q = I, R diagonal ⇒ factor analysis
◮ Q = I, R = lim_{ǫ→0} ǫI ⇒ PCA
time-series: A ≠ 0 ⇒ Kalman Filter
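In the static case the marginal covariance of yt is just B Q Bᵀ + R, which can be checked numerically; the small matrices below are illustrative values.

```python
import numpy as np

# For the static model (A = 0) with x ~ N(0, Q), each y_t = B x_t + v_t is
# i.i.d. with covariance B Q B^T + R; small illustrative matrices below.
B = np.array([[1.0, 0.0],
              [1.0, 1.0]])
Q = np.eye(2)
R = 0.5 * np.eye(2)
cov_y = B @ Q @ B.T + R
```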
SLIDE 36
Discrete-State
Let winner-take-all WTA(x)i = 1 if i = argmaxj xj, and 0 otherwise. Then
x1 ∼ WTA(N(π, Q1))
xt+1 = WTA(A xt + wt), wt ∼ N(0, Q)
yt = B xt + vt, vt ∼ N(0, R)
static: A = 0 ⇒ mixture of Gaussians, P(yt) = Σ_{i=1}^k πi N(Bi, R), where Bi is the i-th column of B
time-series: A ≠ 0 ⇒ Hidden Markov Models
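The WTA nonlinearity above is a one-line operation; a minimal sketch (the input vector is an arbitrary illustrative value):

```python
import numpy as np

def wta(x):
    """Winner-take-all: one-hot vector with a 1 at the argmax of x."""
    e = np.zeros(len(x))
    e[np.argmax(x)] = 1.0
    return e

# With A = 0, y_t = B @ wta(gaussian sample) + v_t picks out one column
# of B, i.e. a mixture of Gaussians; here a deterministic call for illustration.
one_hot = wta(np.array([0.2, 1.5, -1.0]))
```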
SLIDE 37