EM & Hidden Markov Models
CMSC 691, UMBC
Recap from last time…

Expectation Maximization (EM)

Two-step, iterative algorithm:
- 0. Assume some value for your parameters
- 1. E-step: count under uncertainty, assuming these parameters
- 2. M-step: maximize log-likelihood, assuming these uncertain (estimated) counts
Counting Requires Marginalizing

E-step: count under uncertainty, assuming these parameters.

$p(w) = p(z_1, w) + p(z_2, w) + p(z_3, w) + p(z_4, w) = \sum_{z=1}^{4} p(z, w)$

break into 4 disjoint pieces
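A tiny numeric sketch of this marginalization (the joint values here are made up for illustration):

# hypothetical joint probabilities p(z, w) for one fixed w and four latent values z
p_joint = {1: 0.04, 2: 0.01, 3: 0.20, 4: 0.05}
p_w = sum(p_joint[z] for z in range(1, 5))  # p(w) = sum of the 4 disjoint pieces = 0.30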
Hidden Markov Models
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
Hidden Markov Models
Class-based Model: use different distributions to explain groupings of observations.

Sequence Model: a bigram model of the classes, not the observations; implicitly models all possible class sequences. There are algorithms for finding the best sequence, computing the marginal likelihood, and doing semi-/un-supervised learning.
Hidden Markov Model
Goal: maximize the (log-)likelihood. In practice we don't actually observe these z values; we just see the words w.

$p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$
If we knew the probability parameters then we could estimate z and evaluate the likelihood… but we don't! :( If we did observe z, estimating the probability parameters would be easy… but we don't! :(
Hidden Markov Model Terminology

Each z_i can take the value of one of K latent states. Transition and emission distributions do not change across time steps.

$p(z_1, w_1, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission probabilities/parameters}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition probabilities/parameters}}$

Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values.
Hidden Markov Model Representation

Represent the probabilities and independence assumptions in a graph:

[Graphical model: a chain of latent states z1 → z2 → z3 → z4 → …, each with an arc down to its observed word w1, w2, w3, w4. The downward arcs carry the emission probabilities $p(w_1 \mid z_1), p(w_2 \mid z_2), \ldots$; the chain arcs carry the transition probabilities $p(z_2 \mid z_1), p(z_3 \mid z_2), \ldots$; and $p(z_1 \mid z_0)$ is the initial starting distribution ("BOS").]
Example: 2-state Hidden Markov Model as a Lattice

[Lattice figure: each time step i has two nodes, z_i = N and z_i = V. Each node emits its word with arcs $p(w_i \mid N)$ and $p(w_i \mid V)$; $p(N \mid \text{start})$ and $p(V \mid \text{start})$ enter the first column; and all four transition arcs $p(N \mid N), p(V \mid N), p(N \mid V), p(V \mid V)$ connect consecutive columns.]
2 State HMM Likelihood

Transition probabilities:
        N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities:
      w1   w2   w3   w4
N     .7   .2   .05  .05
V     .2   .6   .1   .1
Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) = 0.00024696 ≈ 2.5 × 10⁻⁴
Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
A: (.7 × .7) × (.8 × .6) × (.6 × .05) × (.15 × .05) = 0.00005292 ≈ 5.3 × 10⁻⁵
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
Estimating Parameters from Observed Data

[Lattice figure: two fully observed tagged sequences over w1…w4, the paths (N, V, V, N) and (N, V, N, N) from the earlier examples.]

Tally how often each transition and emission occurs in the observed paths (end emission not shown):

Transition Counts:
        N   V   end
start   2
N       1   2   2
V       2   1

Emission Counts:
      w1   w2   w3   w4
N     2         1    2
V          2    1

Normalize each row to get the maximum-likelihood estimates:

Transition MLE:
        N     V     end
start   1
N       .2    .4    .4
V       2/3   1/3

Emission MLE:
      w1    w2    w3    w4
N     .4          .2    .4
V           2/3   1/3

smooth these values if needed
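In code, the MLE is just per-row normalization of the count table. A minimal sketch using the transition counts above (the nested-dict layout is an assumption for illustration):

# maximum-likelihood estimates are just normalized counts
trans_counts = {'start': {'N': 2},
                'N': {'N': 1, 'V': 2, 'end': 2},
                'V': {'N': 2, 'V': 1}}

def mle(counts):
    """Normalize each row of a nested count table into a probability distribution."""
    return {prev: {s: c / sum(row.values()) for s, c in row.items()}
            for prev, row in counts.items()}

print(mle(trans_counts)['N'])  # {'N': 0.2, 'V': 0.4, 'end': 0.4}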
What If We Don't Observe z?

Approach: develop an EM algorithm.
Goal: estimate $p_{\text{trans}}(t' \mid t)$ and $p_{\text{obs}}(w \mid t)$.
Why: compute the expected counts $\mathbb{E}_{z_i = t \to z_{i+1} = t'}[c(t \to t')]$ and $\mathbb{E}_{z_i = t \to w_i = w}[c(t \to w)]$.
Expectation Maximization (EM)

Two-step, iterative algorithm:
- 0. Assume some value for your parameters $p_{\text{obs}}(w \mid s)$ and $p_{\text{trans}}(s' \mid s)$
- 1. E-step: count under uncertainty, assuming these parameters
- 2. M-step: maximize log-likelihood, assuming these uncertain (estimated) counts

The E-step counts under the posteriors:

$p^*(z_i = t \mid w_1, \cdots, w_N) = \frac{p(z_i = t, w_1, \cdots, w_N)}{p(w_1, \cdots, w_N)}$

$p^*(z_i = t, z_{i+1} = t' \mid w_1, \cdots, w_N) = \frac{p(z_i = t, z_{i+1} = t', w_1, \cdots, w_N)}{p(w_1, \cdots, w_N)}$
M-Step

"maximize log-likelihood, assuming these uncertain counts"

If we observed the hidden transitions:

$p_{\text{new}}(t' \mid t) = \frac{c(t \to t')}{\sum_{t''} c(t \to t'')}$
M-Step

We don't observe the hidden transitions, but we can approximately count:

$p_{\text{new}}(t' \mid t) = \frac{\mathbb{E}_{t \to t'}[c(t \to t')]}{\sum_{t''} \mathbb{E}_{t \to t''}[c(t \to t'')]}$

We compute these expected counts in the E-step.
This EM procedure for HMMs (E-step posteriors, M-step normalization of expected counts) is the Baum-Welch algorithm.
Estimating Parameters from Unobserved Data

[Lattice figure: the full N/V trellis over w1…w4, with every emission arc $p^*(w_i \mid s)$ and every transition arc $p^*(s' \mid s)$ weighted by its posterior probability. All of these p* arcs are specific to a time step; example arc values (.5, .4, .6, .5, .3, .3) are shown on the slides. These numbers are made up. End emission not shown.]

Accumulate expected transition and emission counts by summing the posterior arc weights:

Expected Transition Counts:
        N     V     end
start   1.8   .1    .1
N       1.5   .8    .1
V       1.4   1.1   .4

Expected Emission Counts:
      w1   w2   w3   w4
N     .4   .3   .2   .2
V     .1   .6   .3   .3

Normalize each row to get the MLE from expected counts:

Expected Transition MLE:
        N        V        end
start   1.8/2    .1/2     .1/2
N       1.5/2.4  .8/2.4   .1/2.4
V       1.4/2.9  1.1/2.9  .4/2.9

Expected Emission MLE:
      w1      w2      w3      w4
N     .4/1.1  .3/1.1  .2/1.1  .2/1.1
V     .1/1.3  .6/1.3  .3/1.3  .3/1.3
Semi-Supervised Parameter Estimation

Combine the supervised counts (from the observed tagged data) with the expected counts (from the unobserved data) by adding them:

Mixed Transition Counts:
        N     V     end
start   3.8   .1    .1
N       2.5   2.8   2.1
V       3.4   2.1   .4

Mixed Emission Counts:
      w1   w2   w3   w4
N     2.4  .3   1.2  2.2
V     .1   2.6  1.3  .3
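Mixing the two count tables is a plain elementwise addition. A minimal sketch using a few of the transition entries above (tuple-keyed Counters are an illustrative convention; end transitions omitted for brevity, so the normalizer differs from the full table):

from collections import Counter

observed = Counter({('N', 'V'): 2, ('V', 'N'): 2, ('N', 'N'): 1, ('V', 'V'): 1})
expected = Counter({('N', 'V'): 0.8, ('V', 'N'): 1.4, ('N', 'N'): 1.5, ('V', 'V'): 1.1})

mixed = observed + expected  # Counter addition merges the two tables elementwise
print(mixed[('N', 'V')])     # 2.8, matching the mixed table

# re-estimate a row from the mixed counts (end transitions omitted here)
total_N = sum(v for (s, _), v in mixed.items() if s == 'N')
p_V_given_N = mixed[('N', 'V')] / total_N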
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
EM Math

$\max_{\theta} \; \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\left[\log p_{\theta}(z, w)\right]$

$\theta^{(t)}$: current parameters; $\theta$: new parameters; $p_{\theta^{(t)}}(\cdot \mid w)$: posterior distribution.

Maximize the average log-likelihood of our complete data (z, w), averaged across all z according to how likely our current model thinks z is.

For an HMM, with $z \in \{t_1, \ldots, t_K\}^N$:

$\mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid w)}\left[\sum_i \log p_{\theta}(z_i \mid z_{i-1}) + \log p_{\theta}(w_i \mid z_i)\right]$
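Expanding the transition term of this objective shows why the M-step reduces to normalizing expected counts. A sketch of the standard derivation, in the notation above:

$\mathbb{E}_{z \sim p^*}\left[\sum_i \log p_\theta(z_i \mid z_{i-1})\right] = \sum_i \sum_{t, t'} p^*(z_{i-1} = t, z_i = t' \mid w)\, \log p_\theta(t' \mid t) = \sum_{t, t'} \mathbb{E}[c(t \to t')]\, \log p_\theta(t' \mid t)$

Maximizing each row $p_\theta(\cdot \mid t)$ subject to $\sum_{t'} p_\theta(t' \mid t) = 1$ (e.g., with a Lagrange multiplier) gives $p_{\text{new}}(t' \mid t) \propto \mathbb{E}[c(t \to t')]$, exactly the M-step update above.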
EM For HMMs (Baum-Welch Algorithm)

L = p(w_1, ⋯, w_N)
for (i = 1; i ≤ N; ++i) {
  for (state = 0; state < K*; ++state) {
    cobs(obs_i | state) += p(z_i = state, w_1, …, w_i = obs_i) * p(w_{i+1:N} | z_i = state) / L
    for (prev = 0; prev < K*; ++prev) {
      u = pobs(obs_i | state) * ptrans(state | prev)
      ctrans(state | prev) += p(z_{i−1} = prev, w_{1:i−1}) * u * p(w_{i+1:N} | z_i = state) / L
    }
  }
}

The numerator factors are exactly forward and backward values: p(z_i = state, w_{1:i}) = α(i, state), p(w_{i+1:N} | z_i = state) = β(i, state), and p(z_{i−1} = prev, w_{1:i−1}) = α(i − 1, prev).
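Put together as runnable code, the E-step accumulates expected counts from precomputed forward/backward tables. A minimal NumPy sketch; the array conventions are assumptions (alpha[i][s] covers w_1..w_i, beta[i][s] covers w_{i+1}..w_N, as in the tables built later in the deck, and L is the marginal likelihood):

import numpy as np

def estep_counts(obs, p_trans, p_obs, alpha, beta, L):
    """Accumulate expected emission/transition counts from forward (alpha)
    and backward (beta) tables; obs holds integer symbol indices."""
    K = p_trans.shape[0]
    c_obs = np.zeros_like(p_obs)      # c_obs[s, w]
    c_trans = np.zeros_like(p_trans)  # c_trans[prev, s]
    for i, w in enumerate(obs, start=1):
        for s in range(K):
            # posterior probability of being in state s at step i
            c_obs[s, w] += alpha[i, s] * beta[i, s] / L
            for prev in range(K):
                u = p_obs[s, w] * p_trans[prev, s]
                # posterior probability of the prev -> s arc at step i
                c_trans[prev, s] += alpha[i - 1, prev] * u * beta[i, s] / L
    return c_trans, c_obs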
Why Do We Need Backward Values?

[Trellis figure: states A, B, C at steps i−1, i, and i+1.]

α(i, s) is the total probability of all paths:
- 1. that start from the beginning
- 2. that end (currently) in s at step i
- 3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
- 1. that start at step i at state s
- 2. that terminate at the end
- 3. (that emit the observation obs at i+1)

α(i, s) * β(i, s) = total probability of paths through state s at step i, so we can compute posterior state probabilities (normalize by the marginal likelihood).

α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B→s' arc (at time i), so we can compute posterior transition probabilities (normalize by the marginal likelihood).
With Both Forward and Backward Values

α(i, s) * β(i, s) = total probability of paths through state s at step i:

$p(z_i = t \mid w_1, \cdots, w_N) = \frac{\alpha(i, t) \cdot \beta(i, t)}{\alpha(N + 1, \text{END})}$

α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i):

$p(z_i = t, z_{i+1} = t' \mid w_1, \cdots, w_N) = \frac{\alpha(i, t) \cdot p(t' \mid t) \cdot p(\text{obs}_{i+1} \mid t') \cdot \beta(i + 1, t')}{\alpha(N + 1, \text{END})}$
Agenda
- HMM Detailed Definition
- HMM Parameter Estimation
- EM for HMMs
  - General Approach
  - Expectation Calculation
HMM Expectation Calculation

- Calculate the forward (log-)likelihood of an observed (sub-)sequence w1, …, wJ
- Calculate the backward (log-)likelihood of an observed (sub-)sequence wJ+1, …, wN

$p(z_1, w_1, \ldots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$
HMM Likelihood Task

Marginalize over all latent-sequence joint likelihoods:

$p(w_1, w_2, \ldots, w_N) = \sum_{z_1, \cdots, z_N} p(z_1, w_1, z_2, w_2, \ldots, z_N, w_N)$

Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?
A: K^N

Goal: find a way to compute this exponential sum efficiently (in polynomial time).
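To see the blow-up concretely, here is a brute-force version of this sum, a sketch reusing the table conventions from the earlier joint() example (end transitions again omitted). The forward algorithm below computes the same quantity without enumerating the K^N paths:

import itertools

def brute_force_likelihood(words, states, p_trans, p_obs):
    """p(w_1, ..., w_N) by summing the joint over all K^N latent paths."""
    total = 0.0
    for path in itertools.product(states, repeat=len(words)):  # K^N paths
        p, prev = 1.0, 'start'
        for z, w in zip(path, words):
            p *= p_trans[(prev, z)] * p_obs[(z, w)]
            prev = z
        total += p
    return total

brute_force_likelihood(['w1', 'w2', 'w3', 'w4'], ['N', 'V'], p_trans, p_obs)  # sums 2^4 = 16 paths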
2 State HMM Likelihood

[Lattice figure: the two example paths, (N, V, V, N) and (N, V, N, N), highlighted in the N/V trellis over w1…w4.]

Up until time step 2, all the computation was the same. Let's reuse what computations we can.

Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…
Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis.
Solution: marginalize out all information from previous timesteps.
Reusing Computation

[Trellis figure: states A, B, C at steps i−2, i−1, and i.]

Let's first consider "any shared path ending with B" (A→B, B→B, or C→B).

Assume that all necessary information has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C). Then marginalize (sum) across the previous timestep's possible states:

$\alpha(i, B) = \sum_{t} \alpha(i - 1, t) \cdot p(B \mid t) \cdot p(\text{obs at } i \mid B)$

Computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property.
Forward Probability

α(i, B) is the total probability of all paths to state B from the beginning. In general, α(i, s) is the total probability of all paths:
- 1. that start from the beginning
- 2. that end (currently) in s at step i
- 3. that emit the observation obs at i

$\alpha(i, t) = \sum_{t'} \alpha(i - 1, t') \cdot p(t \mid t') \cdot p(\text{obs at } i \mid t)$

Reading the recursion: what's the total probability up until now? What are the immediate ways to get into state s? How likely is it to get into state s this way?
Forward Algorithm
α: a 2D table, (N+2) × K*
- N+2: number of observations (+2 for the BOS & EOS symbols)
- K*: number of states
Use dynamic programming to build the α table left-to-right.
Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = pemission(obs_i | state)
    for (old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}

(We still need to learn pemission and ptransition: EM if not observed.)

Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][END]
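A runnable NumPy version of this pseudocode, as a sketch. The conventions are assumptions for illustration: START and END are rows among the K* states, observations are 0-indexed integer symbols, and the final step into END emits nothing:

import numpy as np

def forward(obs, p_trans, p_obs, start, end):
    """Forward algorithm. p_trans[old, new]: transition probs;
    p_obs[state, symbol]: emission probs. Returns alpha and the likelihood."""
    N, K = len(obs), p_trans.shape[0]
    alpha = np.zeros((N + 2, K))
    alpha[0, start] = 1.0                       # BOS
    for i in range(1, N + 1):
        for state in range(K):
            p_o = p_obs[state, obs[i - 1]]      # emission of obs_i
            for old in range(K):
                alpha[i, state] += alpha[i - 1, old] * p_trans[old, state] * p_o
    for old in range(K):                        # transition into END (no emission)
        alpha[N + 1, end] += alpha[N, old] * p_trans[old, end]
    return alpha, alpha[N + 1, end]             # likelihood = alpha[N+1][END]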
Interactive HMM Example
https://goo.gl/rbHEoc (Jason Eisner, 2002)
Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls
Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for (i = 1; i ≤ N+1; ++i) {
  for (state = 0; state < K*; ++state) {
    pobs = log pemission(obs_i | state)
    for (old = 0; old < K*; ++old) {
      pmove = log ptransition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
    }
  }
}

$\text{logadd}(l_a, l_b) = \begin{cases} l_a + \log(1 + \exp(l_b - l_a)) & l_a \ge l_b \\ l_b + \log(1 + \exp(l_a - l_b)) & l_b > l_a \end{cases}$

(see scipy.special.logsumexp, formerly scipy.misc.logsumexp)
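The same sketch in log-space, using NumPy's built-in logadd (np.logaddexp) to avoid underflow; conventions as in forward() above:

import numpy as np

def forward_log(obs, log_trans, log_obs, start, end):
    """Log-space forward pass; all tables hold log-probabilities."""
    N, K = len(obs), log_trans.shape[0]
    alpha = np.full((N + 2, K), -np.inf)
    alpha[0, start] = 0.0                       # log 1
    for i in range(1, N + 1):
        for state in range(K):
            lp_o = log_obs[state, obs[i - 1]]
            for old in range(K):
                alpha[i, state] = np.logaddexp(
                    alpha[i, state],
                    alpha[i - 1, old] + log_trans[old, state] + lp_o)
    for old in range(K):                        # transition into END (no emission)
        alpha[N + 1, end] = np.logaddexp(alpha[N + 1, end],
                                         alpha[N, old] + log_trans[old, end])
    return alpha, alpha[N + 1, end]             # log-likelihood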
HMM Probabilities

Forward values: α(i, s) is the total probability of all paths:
- 1. that start from the beginning
- 2. that end (currently) in s at step i
- 3. that emit the observation obs at i

$\alpha(i, t) = \sum_{t'} \alpha(i - 1, t') \cdot p(t \mid t') \cdot p(\text{obs at } i \mid t)$

Backward values: β(i, s) is the total probability of all paths:
- 1. that start at step i at state s
- 2. that terminate at the end
- 3. (that emit the observation obs at i+1)

$\beta(i, t) = \sum_{t'} \beta(i + 1, t') \cdot p(t' \mid t) \cdot p(\text{obs at } i + 1 \mid t')$
Backward Algorithm
β: a 2D table, (N+2) × K*
- N+2: number of observations (+2 for the BOS & EOS symbols)
- K*: number of states
Use dynamic programming to build the β table right-to-left.
Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for (i = N; i ≥ 0; --i) {
  for (next = 0; next < K*; ++next) {
    pobs = pemission(obs_{i+1} | next)
    for (state = 0; state < K*; ++state) {
      pmove = ptransition(next | state)
      β[i][state] += β[i+1][next] * pobs * pmove
    }
  }
}

Q: What does β[0][START] represent?
A: The total probability of all paths from stop back to start for the observed sequence, i.e., the marginal likelihood of the observed sequence: β[0][START] = α[N+1][END].
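A runnable NumPy counterpart to forward() above, with the same assumed conventions; the step into END (which emits nothing) is handled explicitly rather than via a pseudo-observation obs_{N+1}:

import numpy as np

def backward(obs, p_trans, p_obs, start, end):
    """Backward algorithm. beta[i, s] = total probability of completing
    the observed sequence from state s at step i."""
    N, K = len(obs), p_trans.shape[0]
    beta = np.zeros((N + 2, K))
    beta[N + 1, end] = 1.0
    for state in range(K):                      # last step: move into END, no emission
        beta[N, state] = p_trans[state, end]
    for i in range(N - 1, -1, -1):
        for nxt in range(K):
            p_o = p_obs[nxt, obs[i]]            # emission of obs_{i+1} (0-indexed obs[i])
            for state in range(K):
                beta[i, state] += beta[i + 1, nxt] * p_o * p_trans[state, nxt]
    return beta                                 # beta[0, start] equals the likelihood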