Hidden Markov Models
COSI 114 – Computational Linguistics
James Pustejovsky
February 10, 2015
Brandeis University
Slides thanks to David Blei
Markov Models

Set of states: {s1, s2, ..., sN}
The process moves from one state to another, generating a sequence of states: si1, si2, ..., sik, ...
Markov chain property: the probability of each subsequent state depends only on what was the previous state:
    P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
To define a Markov model, the following probabilities have to be specified: transition probabilities aij = P(si | sj) and initial probabilities πi = P(si).
Example of a Markov model

Two states: 'Rain' and 'Dry'.
Transition probabilities: P('Rain'|'Rain') = 0.3, P('Dry'|'Rain') = 0.7, P('Rain'|'Dry') = 0.2, P('Dry'|'Dry') = 0.8
Initial probabilities: say P('Rain') = 0.4, P('Dry') = 0.6.

Calculation of sequence probability

By the Markov chain property, the probability of a state sequence can be found by the formula:
    P(si1, si2, ..., sik) = P(sik | sik-1) P(sik-1 | sik-2) ... P(si2 | si1) P(si1)
Suppose we want to calculate the probability of the state sequence {'Dry','Dry','Rain','Rain'} in our example:
    P({'Dry','Dry','Rain','Rain'}) = P('Rain'|'Rain') P('Rain'|'Dry') P('Dry'|'Dry') P('Dry') = 0.3 · 0.2 · 0.8 · 0.6
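The chain-rule computation above can be sketched in a few lines of Python. This is a minimal sketch: the transition numbers follow the Rain/Dry example, and the initial probabilities P('Rain') = 0.4, P('Dry') = 0.6 are an assumption where the slide's figure is not fully recoverable.

```python
# Markov chain sequence probability via the chain rule.
# trans[prev][cur] = P(cur | prev); numbers follow the Rain/Dry example.
trans = {"Rain": {"Rain": 0.3, "Dry": 0.7},
         "Dry":  {"Rain": 0.2, "Dry": 0.8}}
# Initial probabilities (an assumption, for illustration).
init = {"Rain": 0.4, "Dry": 0.6}

def sequence_probability(states):
    """P(s1, ..., sk) = P(s1) * product over t of P(s_t | s_{t-1})."""
    p = init[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev][cur]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))  # 0.6 * 0.8 * 0.2 * 0.3 ≈ 0.0288
```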
Hidden Markov Models

As before, the probability of each subsequent state depends only on what was the previous state:
    P(sik | si1, si2, ..., sik-1) = P(sik | sik-1)
But the states are hidden: each state randomly generates one of M visible observations {v1, v2, ..., vM}.
To define a hidden Markov model, the following probabilities have to be specified: matrix of transition probabilities A = (aij), aij = P(si | sj), matrix of observation probabilities B = (bi(vm)), bi(vm) = P(vm | si), and vector of initial probabilities π = (πi), πi = P(si). The model is represented by M = (A, B, π).

Example of a hidden Markov model

Two hidden states, 'Low' and 'High' atmospheric pressure; two observations, 'Dry' and 'Rain'.
Transition probabilities: P('Low'|'Low') = 0.3, P('High'|'Low') = 0.7, P('Low'|'High') = 0.2, P('High'|'High') = 0.8
Observation probabilities: P('Rain'|'Low') = 0.6, P('Dry'|'Low') = 0.4, P('Rain'|'High') = 0.4, P('Dry'|'High') = 0.6
Initial probabilities: say P('Low') = 0.4, P('High') = 0.6.
Graphical Model

Circles indicate states; arrows indicate probabilistic dependencies between states.
Green circles are hidden states, each dependent only on the previous state: "The past is independent of the future given the present."
Purple nodes are observed states, each dependent only on its corresponding hidden state.

HMM Formalism

An HMM is specified by {S, K, Π, A, B}:
S : {s1 ... sN} are the values for the hidden states
K : {k1 ... kM} are the values for the observations
Π = {πi} are the initial state probabilities
A = {aij} are the state transition probabilities
B = {bik} are the observation state probabilities
Inference in an HMM

- Probability estimation: compute the probability of a given observation sequence.
- Decoding: given an observation sequence, compute the most likely hidden state sequence.
- Parameter estimation: given an observation sequence and a set of possible models, determine which model most closely fits the data.
Probability of an Observation Sequence

Given an observation sequence O = (o1 ... oT) and a model μ = (A, B, Π), compute P(O | μ). Direct computation sums over all hidden state sequences X = (x1 ... xT):

    P(O | μ) = Σ_X P(O | X, μ) P(X | μ)
             = Σ_{x1 ... xT} π_{x1} b_{x1}(o1) a_{x1 x2} b_{x2}(o2) ... a_{x(T-1) xT} b_{xT}(oT)

This sum has N^T terms, so we exploit the special structure of the HMM to get an efficient dynamic-programming solution.

Forward Procedure

Define the forward variable
    αt(i) = P(o1 ... ot, xt = i | μ),
the probability of seeing the first t observations and being in state i at time t.

Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N.
Induction:
    αt+1(j) = P(o1 ... ot+1, xt+1 = j)
            = Σ_{i=1..N} P(o1 ... ot, xt = i) P(xt+1 = j | xt = i) P(ot+1 | xt+1 = j)
            = Σ_{i=1..N} αt(i) aij bj(ot+1),  1 ≤ j ≤ N, 1 ≤ t ≤ T-1.

Backward Procedure

Define the backward variable
    βt(i) = P(ot+1 ... oT | xt = i, μ),
the probability of the remaining observations given state i at time t.

Initialization: βT(i) = 1, 1 ≤ i ≤ N.
Induction: βt(i) = Σ_{j=1..N} aij bj(ot+1) βt+1(j),  1 ≤ t ≤ T-1.

Total probability of the observations:
    P(O | μ) = Σ_{i=1..N} αT(i)                 (forward)
             = Σ_{i=1..N} πi bi(o1) β1(i)       (backward)
             = Σ_{i=1..N} αt(i) βt(i), any t    (combination)
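The forward recursion can be sketched in Python and checked against the brute-force sum over all hidden state sequences. The model numbers below are invented for illustration, and transitions are indexed as a[i][j] = P(next = j | current = i) rather than the slide's index order.

```python
from itertools import product

# A small hypothetical HMM (all numbers invented for illustration).
N = 2                                  # hidden states 0..N-1
pi = [0.5, 0.5]                        # initial probabilities
a = [[0.7, 0.3], [0.4, 0.6]]           # a[i][j] = P(x_{t+1} = j | x_t = i)
b = [[0.9, 0.1], [0.2, 0.8]]           # b[i][o] = P(o | x_t = i)

def forward(obs):
    """Forward procedure: returns P(O) = sum_i alpha_T(i)."""
    alpha = [pi[i] * b[i][obs[0]] for i in range(N)]          # alpha_1(i)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(N)) * b[j][o]
                 for j in range(N)]                           # alpha_{t+1}(j)
    return sum(alpha)

def brute_force(obs):
    """Direct computation: sum P(X) P(O | X) over all N^T state sequences."""
    total = 0.0
    for xs in product(range(N), repeat=len(obs)):
        p = pi[xs[0]] * b[xs[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= a[xs[t - 1]][xs[t]] * b[xs[t]][obs[t]]
        total += p
    return total

obs = [0, 1, 1, 0]
print(forward(obs), brute_force(obs))  # the two values agree
```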
Best State Sequence

Find the state sequence that best explains the observations: the Viterbi algorithm.
Viterbi Algorithm

Define
    δj(t) = max_{x1 ... x(t-1)} P(x1 ... x(t-1), o1 ... o(t-1), xt = j, ot),
the probability of the most likely state sequence that ends in state j at time t and accounts for the first t observations.

Recursive computation:
    δj(t+1) = maxi [ δi(t) aij bj(ot+1) ]
    ψj(t+1) = argmaxi [ δi(t) aij bj(ot+1) ]      (backtracking pointer)

Termination and path readout:
    X̂T = argmaxi δi(T)
    X̂t = ψ_{X̂(t+1)}(t+1)
    P(X̂) = maxi δi(T)
Parameter Estimation

Probability of traversing an arc from state i to state j at time t, given the observations:
    pt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / Σ_{m=1..N} αt(m) βt(m)

Probability of being in state i at time t:
    γt(i) = Σ_{j=1..N} pt(i, j)

Re-estimated parameters:
    âij = Σ_{t=1..T-1} pt(i, j) / Σ_{t=1..T-1} γt(i)
    b̂ik = Σ_{t: ot = k} γt(i) / Σ_{t=1..T} γt(i)
    π̂i = γ1(i)
HMM Applications

- Generating parameters for n-gram models
- Tagging
- Speech recognition
Calculation of observation sequence probability

Suppose we want to calculate the probability of the observation sequence {'Dry','Rain'} in our example. Sum over all possible hidden state sequences:
    P({'Dry','Rain'}) = P({'Dry','Rain'}, {'Low','Low'}) + P({'Dry','Rain'}, {'Low','High'}) + P({'Dry','Rain'}, {'High','Low'}) + P({'Dry','Rain'}, {'High','High'})
where the first term is:
    P({'Dry','Rain'}, {'Low','Low'}) = P({'Dry','Rain'} | {'Low','Low'}) P({'Low','Low'}) = P('Dry'|'Low') P('Rain'|'Low') P('Low') P('Low'|'Low') = 0.4 · 0.6 · 0.4 · 0.3
Main issues using HMMs

- Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O.
- Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence O.
- Learning problem. Given some training observation sequences O=o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine HMM parameters M=(A, B, π) that best fit the training data.
Example: Character Recognition

Single characters: the hidden state is the character, the observation is the character image. Estimate P(image | character) for each character (e.g. 0.31 for 'a', 0.005 for 'b', 0.03 for 'c', ..., 0.5 for 'z') and pick the best-scoring character.
Note that there is an infinite number of observations (images), so the probabilistic mapping from hidden state to observation must be modeled, e.g. with mixtures of Gaussian models over feature vectors, or by quantizing the feature-vector space.

Example: Word Recognition

Hidden states are the letters of vocabulary words (e.g. 'Amherst' → a m h e r s t, 'Buffalo' → b u f f a l o); observations are images of the letters. Each word is modeled as a left-to-right HMM, and we choose the word whose HMM most likely produced the word image.
Exploiting the HMM structure: which word?

Each word is a left-to-right HMM with states s1, s2, s3.

HMM for word 'A':
Transition probabilities {aij} =
⎛ .8 .2  0 ⎞
⏐  0 .8 .2 ⏐
⎝  0  0  1 ⎠
Observation probabilities {bjk} =
⎛ .9 .1  0 ⎞
⏐ .1 .8 .1 ⏐
⎝ .9 .1  0 ⎠

HMM for word 'B':
Transition probabilities {aij} =
⎛ .8 .2  0 ⎞
⏐  0 .8 .2 ⏐
⎝  0  0  1 ⎠
Observation probabilities {bjk} =
⎛ .9 .1  0 ⎞
⏐  0 .2 .8 ⏐
⎝ .6 .4  0 ⎠

Suppose the following sequence of feature-vector cluster numbers in 4 slices was observed: {1, 3, 2, 1}. Which HMM generated it: the HMM for 'A' or the HMM for 'B'?

Consider the likelihood of generating the given observations for each possible sequence of hidden states:

HMM for 'A':
Hidden state sequence    Transition probabilities · Observation probabilities
s1→ s1→ s2→ s3           .8 · .2 · .2 · .9 · 0 · .8 · .9  = 0
s1→ s2→ s2→ s3           .2 · .8 · .2 · .9 · .1 · .8 · .9 = 0.0020736
s1→ s2→ s3→ s3           .2 · .2 · 1 · .9 · .1 · .1 · .9  = 0.000324
                                                   Total = 0.0023976
HMM for 'B':
Hidden state sequence    Transition probabilities · Observation probabilities
s1→ s1→ s2→ s3           .8 · .2 · .2 · .9 · 0 · .2 · .6  = 0
s1→ s2→ s2→ s3           .2 · .8 · .2 · .9 · .8 · .2 · .6 = 0.0027648
s1→ s2→ s3→ s3           .2 · .2 · 1 · .9 · .8 · .4 · .6  = 0.006912
                                                   Total = 0.0096768

The HMM for 'B' assigns the higher likelihood, so the observation is recognized as word 'B'.
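As a check on the totals above, a short brute-force enumeration in Python. It assumes, as the listed paths do, that a word model starts in s1 and must end in s3.

```python
from itertools import product

# Parameters transcribed from the two word models; rows = states s1..s3,
# columns = observation symbols 1..3 (0-indexed in the code).
A_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]
A_obs   = [[.9, .1, 0], [.1, .8, .1], [.9, .1, 0]]
B_trans = [[.8, .2, 0], [0, .8, .2], [0, 0, 1]]
B_obs   = [[.9, .1, 0], [0, .2, .8], [.6, .4, 0]]

def likelihood(trans, obs_probs, observations):
    """Sum P(states, observations) over all paths that start in s1 and end in s3."""
    total = 0.0
    T = len(observations)
    for tail in product(range(3), repeat=T - 1):
        states = (0,) + tail
        if states[-1] != 2:          # assumed: word models must finish in the final state
            continue
        p = obs_probs[0][observations[0]]
        for t in range(1, T):
            p *= trans[states[t - 1]][states[t]] * obs_probs[states[t]][observations[t]]
        total += p
    return total

observed = [0, 2, 1, 0]              # cluster numbers {1, 3, 2, 1}, 0-indexed
print(likelihood(A_trans, A_obs, observed))  # ≈ 0.0023976 (word 'A')
print(likelihood(B_trans, B_obs, observed))  # ≈ 0.0096768 (word 'B')
```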
Evaluation Problem

Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the probability that model M has generated sequence O.
Trying to find the probability of the observations by considering all hidden state sequences (as was done in the example) is impractical: there are N^K hidden state sequences — exponential complexity. Use the forward algorithm instead.
Define the forward variable αk(i) as the joint probability of the partial observation sequence o1 o2 ... ok and of the hidden state at time k being si:
    αk(i) = P(o1 o2 ... ok, qk = si)

Initialization: α1(i) = P(o1, q1 = si) = πi bi(o1), 1 <= i <= N.
Forward recursion: αk+1(j) = P(o1 o2 ... ok+1, qk+1 = sj) = Σi P(o1 o2 ... ok+1, qk = si, qk+1 = sj) = Σi P(o1 o2 ... ok, qk = si) aij bj(ok+1) = [Σi αk(i) aij] bj(ok+1), 1 <= j <= N, 1 <= k <= K-1.
Termination: P(o1 o2 ... oK) = Σi αK(i).
Complexity: N²K operations.
Backward Recursion

Define the backward variable βk(i) as the probability of the partial observation sequence ok+1 ok+2 ... oK given that the hidden state at time k is si:
    βk(i) = P(ok+1 ok+2 ... oK | qk = si)

Initialization: βK(i) = 1, 1 <= i <= N.
Backward recursion: βk(j) = P(ok+1 ok+2 ... oK | qk = sj) = Σi P(ok+1 ok+2 ... oK, qk+1 = si | qk = sj) = Σi P(ok+2 ok+3 ... oK | qk+1 = si) aji bi(ok+1) = Σi βk+1(i) aji bi(ok+1), 1 <= j <= N, 1 <= k <= K-1.
Termination: P(o1 o2 ... oK) = Σi P(o1 o2 ... oK, q1 = si) = Σi P(o1 o2 ... oK | q1 = si) P(q1 = si) = Σi β1(i) bi(o1) πi
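A sketch of the backward recursion, checked against the forward one. The model numbers are invented for illustration, and transitions are indexed as a[i][j] = P(next = j | current = i).

```python
# A toy HMM (numbers invented for illustration).
N = 2
pi = [0.6, 0.4]
a = [[0.7, 0.3], [0.4, 0.6]]           # a[i][j] = P(q_{k+1} = j | q_k = i)
b = [[0.5, 0.5], [0.1, 0.9]]           # b[i][o] = P(o | q_k = i)

def forward_prob(obs):
    alpha = [pi[i] * b[i][obs[0]] for i in range(N)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(N)) * b[j][o]
                 for j in range(N)]
    return sum(alpha)

def backward_prob(obs):
    # beta_K(i) = 1; beta_k(j) = sum_i a[j][i] b[i][o_{k+1}] beta_{k+1}(i)
    beta = [1.0] * N
    for o in reversed(obs[1:]):
        beta = [sum(a[j][i] * b[i][o] * beta[i] for i in range(N))
                for j in range(N)]
    # Termination: P(O) = sum_i beta_1(i) b_i(o_1) pi_i
    return sum(beta[i] * b[i][obs[0]] * pi[i] for i in range(N))

obs = [1, 0, 1]
print(forward_prob(obs), backward_prob(obs))  # both equal P(O)
```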
Decoding Problem

Given the HMM M=(A, B, π) and the observation sequence O=o1 o2 ... oK, calculate the most likely sequence of hidden states si that produced this observation sequence.
We want the state sequence that maximizes P(Q, o1 o2 ... oK); brute-force consideration of all paths takes exponential time, so use the efficient Viterbi algorithm instead.
Define the variable δk(i) as the maximum probability of producing the observation sequence o1 o2 ... ok when moving along any hidden state sequence q1 ... qk-1 and getting into qk = si:
    δk(i) = max P(q1 ... qk-1, qk = si, o1 o2 ... ok)
where the max is taken over all possible paths q1 ... qk-1.
General idea: if the best path ending in qk = sj goes through qk-1 = si, then it should coincide with the best path ending in qk-1 = si.

Initialization: δ1(i) = max P(q1 = si, o1) = πi bi(o1), 1 <= i <= N.
Forward recursion: δk(j) = max P(q1 ... qk-1, qk = sj, o1 o2 ... ok) = maxi [ aij bj(ok) max P(q1 ... qk-1 = si, o1 o2 ... ok-1) ] = maxi [ aij bj(ok) δk-1(i) ], 1 <= j <= N, 2 <= k <= K.
To backtrack the best path, keep for each (k, j) the predecessor si that achieved the maximum.
Termination: choose the best path ending at time K, maxi [ δK(i) ], then backtrack.
This algorithm is similar to the forward recursion of the evaluation problem, with Σ replaced by max and additional backtracking.
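The Viterbi recursion and backtracking can be sketched as follows (toy model, invented numbers; transitions indexed as a[i][j] = P(next = j | current = i)):

```python
# Viterbi decoding for a toy HMM (numbers invented for illustration).
N = 2
pi = [0.5, 0.5]
a = [[0.7, 0.3], [0.4, 0.6]]   # a[i][j] = P(next = j | current = i)
b = [[0.9, 0.1], [0.2, 0.8]]   # b[i][o] = P(o | state = i)

def viterbi(obs):
    """Return (best hidden state path, its joint probability with obs)."""
    delta = [pi[i] * b[i][obs[0]] for i in range(N)]              # delta_1(i)
    back = []                                                     # backtracking pointers
    for o in obs[1:]:
        psi, new_delta = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[i] * a[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] * a[best_i][j] * b[j][o])
        back.append(psi)
        delta = new_delta
    last = max(range(N), key=lambda i: delta[i])                  # best final state
    path = [last]
    for psi in reversed(back):                                    # follow pointers back
        path.append(psi[path[-1]])
    return path[::-1], delta[last]

path, p = viterbi([0, 0, 1, 1])
print(path, p)  # [0, 0, 1, 1] with probability ≈ 0.0326592
```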
Learning Problem

Given some training observation sequences O=o1 o2 ... oK and the general structure of the HMM (numbers of hidden and visible states), determine the HMM parameters M=(A, B, π) that best fit the training data, that is, maximize P(O | M).
If the training data contains the sequence of hidden states (as in the character recognition example), use maximum likelihood estimation of the parameters:
    aij = P(si | sj) = (number of transitions from state sj to state si) / (number of transitions out of state sj)
    bi(vm) = P(vm | si) = (number of times observation vm occurs in state si) / (number of times in state si)
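A minimal sketch of these counting estimates in Python; the tiny tagged corpus (D/N/V tags) is invented purely for illustration.

```python
from collections import Counter

# Supervised MLE from tagged data (a hypothetical toy corpus).
# Each training item pairs a hidden state (tag) sequence with its observations.
training = [
    (["D", "N", "V", "N"], ["the", "dog", "bites", "men"]),
    (["D", "N", "V", "N"], ["the", "man", "bites", "dogs"]),
    (["N", "V", "D", "N"], ["dogs", "bite", "the", "man"]),
]

trans, trans_out = Counter(), Counter()
emit, state_count = Counter(), Counter()
for states, words in training:
    for prev, cur in zip(states, states[1:]):
        trans[(prev, cur)] += 1       # count(prev -> cur)
        trans_out[prev] += 1          # count(prev -> anything)
    for s, w in zip(states, words):
        emit[(s, w)] += 1             # count(w emitted in state s)
        state_count[s] += 1           # count(state s)

def a_hat(prev, cur):
    """Estimated P(cur | prev) = count(prev -> cur) / count(prev -> *)."""
    return trans[(prev, cur)] / trans_out[prev]

def b_hat(state, word):
    """Estimated P(word | state) = count(word in state) / count(state)."""
    return emit[(state, word)] / state_count[state]

print(a_hat("V", "N"))    # 2 of the 3 transitions out of V go to N
print(b_hat("D", "the"))  # D always emits "the" in this corpus
```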
Baum-Welch Algorithm

If the hidden states are not observed, use expected counts instead. General idea:
    aij = P(si | sj) = (expected number of transitions from state sj to state si) / (expected number of transitions out of state sj)
    bi(vm) = P(vm | si) = (expected number of times observation vm occurs in state si) / (expected number of times in state si)

Define the variables
    ξk(i, j) = P(qk = si, qk+1 = sj | o1 o2 ... oK)
    γk(i) = P(qk = si | o1 o2 ... oK)
both computable from the forward and backward variables. Then:
    Expected number of transitions from state si to state sj = Σk ξk(i, j)
    Expected number of transitions out of state si = Σk γk(i)
    Expected number of times observation vm occurs in state si = Σk γk(i), where k is such that ok = vm
    Expected number of times in state si = Σk γk(i)
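A compact sketch of one re-estimation step, computing γ and ξ from the forward and backward variables. The model numbers are invented, and transitions are indexed as a[i][j] = P(next = j | current = i).

```python
# One Baum-Welch re-estimation step for a toy HMM (numbers invented for illustration).
N, M = 2, 2                            # hidden states, observation symbols
pi = [0.5, 0.5]
a = [[0.7, 0.3], [0.4, 0.6]]           # a[i][j] = P(q_{k+1} = j | q_k = i)
b = [[0.9, 0.1], [0.2, 0.8]]           # b[i][v] = P(v | q_k = i)

def forward_backward(obs):
    K = len(obs)
    alpha = [[0.0] * N for _ in range(K)]
    beta = [[0.0] * N for _ in range(K)]
    for i in range(N):
        alpha[0][i] = pi[i] * b[i][obs[0]]
        beta[K - 1][i] = 1.0
    for k in range(1, K):
        for j in range(N):
            alpha[k][j] = sum(alpha[k - 1][i] * a[i][j] for i in range(N)) * b[j][obs[k]]
    for k in range(K - 2, -1, -1):
        for i in range(N):
            beta[k][i] = sum(a[i][j] * b[j][obs[k + 1]] * beta[k + 1][j] for j in range(N))
    return alpha, beta

def reestimate(obs):
    K = len(obs)
    alpha, beta = forward_backward(obs)
    p_obs = sum(alpha[K - 1][i] for i in range(N))
    # gamma[k][i] = P(q_k = i | O); xi[k][i][j] = P(q_k = i, q_{k+1} = j | O)
    gamma = [[alpha[k][i] * beta[k][i] / p_obs for i in range(N)] for k in range(K)]
    xi = [[[alpha[k][i] * a[i][j] * b[j][obs[k + 1]] * beta[k + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for k in range(K - 1)]
    new_pi = gamma[0]
    new_a = [[sum(xi[k][i][j] for k in range(K - 1)) /
              sum(gamma[k][i] for k in range(K - 1))
              for j in range(N)] for i in range(N)]
    new_b = [[sum(gamma[k][i] for k in range(K) if obs[k] == v) /
              sum(gamma[k][i] for k in range(K))
              for v in range(M)] for i in range(N)]
    return new_pi, new_a, new_b

new_pi, new_a, new_b = reestimate([0, 1, 0, 0, 1])
print(new_pi)  # each re-estimated distribution sums to 1
```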
Speech Recognition as Probabilistic Inference

- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, ..., ot.
- Define a sentence as a sequence of words: W = w1, w2, ..., wn.

Probabilistic implication: pick the highest-probability sentence:
    Ŵ = argmax_{W∈L} P(W | O)
We can use Bayes' rule to rewrite this:
    Ŵ = argmax_{W∈L} P(O | W) P(W) / P(O)
Since the denominator is the same for each candidate sentence, ignoring it leaves us with:
    Ŵ = argmax_{W∈L} P(O | W) P(W)
where P(O | W) is the likelihood and P(W) is the prior.
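The decision rule can be illustrated with a toy example; the candidate sentences and their likelihood/prior values below are invented purely for illustration.

```python
# Pick W-hat = argmax_W P(O | W) P(W) over a tiny hypothetical candidate list.
candidates = {
    "recognize speech":   {"likelihood": 0.0008, "prior": 0.010},   # 8e-6
    "wreck a nice beach": {"likelihood": 0.0030, "prior": 0.001},   # 3e-6
}
best = max(candidates,
           key=lambda w: candidates[w]["likelihood"] * candidates[w]["prior"])
print(best)  # "recognize speech"
```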
Markov chains A kind of weighted finite-state automaton
What is the probability of a given path through the chain?

HMM topologies: the ergodic (fully-connected) network vs. the Bakis (left-to-right) network.
The Ice Cream Task

You are a climatologist in the year 2799 studying the history of global warming. You can't find records of the weather in Baltimore for the summer in question, but you do find Jason Eisner's diary, which records how many ice creams he ate each day. Can we use this to figure out the weather?

Given: an observation sequence O, each observation an integer = the number of ice creams eaten that day. Figure out the correct hidden sequence Q of weather states (H = hot, C = cold) which caused Jason to eat the ice cream.
Three fundamental problems

- Problem 1 (Evaluation): Given the observation sequence O and an HMM model, how do we compute the probability of O given the model?
- Problem 2 (Decoding): Given the observation sequence O and an HMM model, how do we choose the state sequence that best explains the observations?
- Problem 3 (Learning): How do we adjust the model parameters to maximize the probability of the observation sequence?
Problem 1: Computing the Observation Likelihood

Given the following HMM, how likely is the sequence 3 1 3?
For a Markov chain, we would just follow the states labeled 3 1 3 and multiply the probabilities along the path. But for an HMM, we don't know what the hidden states are!
So let's start with a simpler situation: computing the observation likelihood for a given hidden state sequence, e.g. the weather sequence hot hot cold for the ice cream observations 3 1 3.
To get the total observation likelihood, we would need to sum over every possible hidden state sequence. How many possible hidden state sequences are there for this sequence of 3 observations? 2³ = 8. How about in general, for an HMM with N hidden states and an observation sequence of T observations? N^T. For real problems this is far too many, so we can't just do a separate computation for each hidden state sequence.
The Forward Algorithm

A kind of dynamic programming algorithm that uses a table (trellis) to store intermediate values.
Idea: compute the likelihood of the observation sequence by summing over all possible hidden state sequences, folding all the sequences into a single trellis.
Each cell of the forward algorithm trellis, αt(j), represents the probability of being in state j after seeing the first t observations, given the automaton λ. Each cell thus expresses the following probability:
    αt(j) = P(o1, o2, ..., ot, qt = j | λ)
Decoding

Given an observation sequence (e.g. 3 1 3) and an HMM, the task of the decoder is to find the best hidden state sequence.
One possibility: for each possible hidden state sequence (HHH, HHC, HCH, ...), compute the likelihood and pick the best. Why not? There are exponentially many hidden state sequences.
Instead: use the Viterbi algorithm, a dynamic programming approach. Process the observation sequence left to right, filling out the trellis. Each cell holds the probability of the best path that ends in that state after the first t observations.
Why "Dynamic Programming"?

"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes.
The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word, research. I'm not using the term lightly; I'm using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term, research, in his presence. You can imagine how he felt, then, about the term, mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning is not a good word for various reasons. I decided therefore to use the word, "programming." I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that is it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
— Richard Bellman, "Eye of the Hurricane: An Autobiography", 1984.
Thanks to Chen, Picheny, Eide, Nock
HMMs for Speech

We haven't yet shown how to learn the A and B matrices for HMMs; we'll return to that. But let's return to think about speech recognition.
The observation sequence O is a series of acoustic feature vectors. The hidden states W are the phones and words. For a given phone/word string W, our job is to evaluate P(O | W).
Intuition: how likely is the input to have been generated by just that word string? For example, for the word "five" (phones f ay v), many alignments of observation frames to phones are possible:
f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v