EM & Hidden Markov Models
CMSC 691, UMBC

Recap from last time: Expectation Maximization (EM), a two-step, iterative algorithm.
0. Assume some value for your parameters.
1. E-step: count under uncertainty, assuming these parameters.
2. M-step: maximize the log-likelihood, assuming these uncertain (estimated) counts.


  1. 2-State HMM Likelihood
     A trellis over time steps z_1 … z_4: at each step the hidden state is N or V,
     transition arcs carry p(N | start), p(V | start), p(N | N), p(V | N), p(N | V),
     p(V | V), and emission arcs carry p(x_i | N) and p(x_i | V) for the observed
     words w_1 … w_4.

     Transition probabilities p(s' | s):
              N     V     end
     start   .7    .2    .1
     N       .15   .8    .05
     V       .6    .35   .05

     Emission probabilities p(w | s):
              w_1   w_2   w_3   w_4
     N       .7    .2    .05   .05
     V       .2    .6    .1    .1

  2. 2-State HMM Likelihood
     Same trellis and tables as above.
     Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)?

  3. 2-State HMM Likelihood
     Q: What's the probability of (N, w_1), (V, w_2), (V, w_3), (N, w_4)?
     A: Multiply one (transition × emission) pair per step:
        (.7 × .7) × (.8 × .6) × (.35 × .1) × (.6 × .05) ≈ 0.000247

  4. 2-State HMM Likelihood
     Same trellis and tables as above.
     Q: What's the probability of (N, w_1), (V, w_2), (N, w_3), (N, w_4)?

  5. 2-State HMM Likelihood
     Q: What's the probability of (N, w_1), (V, w_2), (N, w_3), (N, w_4)?
     A: (.7 × .7) × (.8 × .6) × (.6 × .05) × (.15 × .05) ≈ 0.0000529
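
The per-path arithmetic above is easy to mechanize. Below is a minimal Python sketch (mine, not the deck's) that recomputes both worked answers from the two tables; the dictionary layout is an assumption of this sketch.

    # Transition and emission tables from the slides above.
    trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
             ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
             ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
    emit = {("N", "w1"): .7, ("N", "w2"): .2, ("N", "w3"): .05, ("N", "w4"): .05,
            ("V", "w1"): .2, ("V", "w2"): .6, ("V", "w3"): .1, ("V", "w4"): .1}

    def joint_prob(tagged):
        """p(z, x) for one tagged path: a product of transition * emission terms."""
        p, prev = 1.0, "start"
        for state, word in tagged:
            p *= trans[(prev, state)] * emit[(state, word)]
            prev = state
        return p

    print(joint_prob([("N", "w1"), ("V", "w2"), ("V", "w3"), ("N", "w4")]))  # ~2.47e-04
    print(joint_prob([("N", "w1"), ("V", "w2"), ("N", "w3"), ("N", "w4")]))  # ~5.29e-05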

  6. Agenda
     HMM Detailed Definition
     HMM Parameter Estimation
     EM for HMMs: General Approach; Expectation Calculation

  7. Estimating Parameters from Observed Data
     Two fully observed tagged sequences are shown in the trellis (states and
     words both visible): z = N V V N and z = N V N N over w_1 … w_4. From
     these we tally a Transition Counts table (rows start, N, V; columns N, V,
     end) and an Emission Counts table (rows N, V; columns w_1 … w_4). The end
     emission is not shown.

  8. Estimating Parameters from Observed Data
     Transition Counts:
              N   V   end
     start    2   0   0
     N        1   2   2
     V        2   1   0

     Emission Counts:
              w_1  w_2  w_3  w_4
     N        2    0    1    2
     V        0    2    1    0

     (end emission not shown)

  9. Estimating Parameters from Observed Data
     Normalize each row of counts to get the maximum-likelihood estimates.
     Transition MLE:
              N     V     end
     start    1     0     0
     N        .2    .4    .4
     V        2/3   1/3   0

     Emission MLE:
              w_1   w_2   w_3   w_4
     N        .4    0     .2    .4
     V        0     2/3   1/3   0

  10. Estimating Parameters from Observed Data
      Same MLE tables as above; smooth these values if needed (several cells
      are zero, so the unsmoothed MLE assigns zero probability to unseen
      events).
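
That row normalization, with optional smoothing, takes only a few lines. A sketch (mine), using add-λ smoothing as one assumed choice of smoother:

    def mle(counts, lam=0.0):
        """Row-normalize counts[context][outcome]; lam > 0 adds add-lambda smoothing."""
        probs = {}
        for ctx, row in counts.items():
            total = sum(row.values()) + lam * len(row)
            probs[ctx] = {out: (c + lam) / total for out, c in row.items()}
        return probs

    trans_counts = {"start": {"N": 2, "V": 0, "end": 0},
                    "N": {"N": 1, "V": 2, "end": 2},
                    "V": {"N": 2, "V": 1, "end": 0}}
    print(mle(trans_counts)["N"])           # {'N': 0.2, 'V': 0.4, 'end': 0.4}
    print(mle(trans_counts, lam=0.1)["V"])  # smoothed: no zero-probability cells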

  11. What If We Don't Observe z?
      Approach: develop an EM algorithm.
      Goal: estimate the transition distribution p(s' | s) and the emission
      distribution p(w | s).
      Why: compute expected counts
        E[z_i = s → z_{i+1} = s']  ≈  c(s → s')
        E[z_i = s → x_i = w]       ≈  c(s → w)

  12. Expectation Maximization (EM)
      Two-step, iterative algorithm:
      0. Assume some value for your parameters.
      1. E-step: count under uncertainty, assuming these parameters.
      2. M-step: maximize log-likelihood, assuming these uncertain (estimated)
         counts.

  13. Expectation Maximization (EM)
      Same algorithm as above; for an HMM, the parameters in step 0 are
      p_obs(w | s) and p_trans(s' | s).

  14. Expectation Maximization (EM)
      The E-step quantities are posterior probabilities under the current
      parameters:
        p*(z_i = s | x_1, …, x_N) = p(z_i = s, x_1, …, x_N) / p(x_1, …, x_N)
        p*(z_i = s, z_{i+1} = s' | x_1, …, x_N)
            = p(z_i = s, z_{i+1} = s', x_1, …, x_N) / p(x_1, …, x_N)

  15. M-Step
      "Maximize log-likelihood, assuming these uncertain counts."
      If we observed the hidden transitions:
        p_new(s' | s) = c(s → s') / Σ_{s''} c(s → s'')

  16. M-Step
      We don't observe the hidden transitions, but we can approximately count:
        p_new(s' | s) = E[c(s → s')] / Σ_{s''} E[c(s → s'')]

  17. M-Step
      Same update as above; the expected counts E[c(s → s')] are exactly what
      we compute in the E-step.

  18. Expectation Maximization (EM)
      Putting the pieces together (parameters p_obs(w | s) and p_trans(s' | s);
      E-step posteriors p*(z_i = s | x_1, …, x_N) and
      p*(z_i = s, z_{i+1} = s' | x_1, …, x_N); M-step normalization of the
      estimated counts): for HMMs, this instantiation of EM is the Baum-Welch
      algorithm.
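
At the top level, the algorithm this slide names looks like the loop below (my scaffolding, not code from the deck). expected_counts is a hypothetical helper standing in for the forward-backward E-step sketched after slide 35 below; mle is the normalizer from the earlier sketch.

    def em(words, init_trans, init_emit, iterations=20):
        trans, emit = init_trans, init_emit
        for _ in range(iterations):
            # E-step: expected transition/emission counts under current parameters.
            exp_trans, exp_emit = expected_counts(words, trans, emit)
            # M-step: re-estimate parameters by normalizing the expected counts.
            trans, emit = mle(exp_trans), mle(exp_emit)
        return trans, emit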

  19. Estimating Parameters from Unobserved Data
      The same trellis, but now every arc carries a posterior weight p* rather
      than an observed 0/1 count: p*(V | start), p*(N | N), p*(x_i | V), and so
      on. Summing these weights fills an Expected Transition Counts table
      (rows start, N, V; columns N, V, end) and an Expected Emission Counts
      table (rows N, V; columns w_1 … w_4). The end emission is not shown.

  20. Estimating Parameters from Unobserved Data
      As above, with one caution: all of these p* arcs are specific to a
      time-step (the weight on N → V between steps 2 and 3 generally differs
      from the weight on N → V between steps 3 and 4).

  21. Estimating Parameters from Unobserved Data
      Example per-time-step arc weights are filled in on the trellis (p*
      values of .5, .3, .3 on some transition arcs and .4, .6, .5 on others).

  22. Estimating Parameters from Unobserved Data
      Summing the per-time-step arc weights starts to fill the expected-count
      tables (e.g. the N → N cell accumulates to 1.5 and the V → V cell
      to 1.1).

  23. Estimating Parameters from Unobserved Data
      Expected Transition Counts:
               N     V     end
      start   1.8   .1    .1
      N       1.5   .8    .1
      V       1.4   1.1   .4

      Expected Emission Counts:
               w_1   w_2   w_3   w_4
      N       .4    .3    .2    .2
      V       .1    .6    .3    .3

      (these numbers are made up; end emission not shown)

  24. Estimating Parameters from Unobserved Data
      Normalize each row of expected counts, exactly as in the observed case.
      Expected Transition MLE:
               N         V         end
      start   1.8/2     .1/2      .1/2
      N       1.5/2.4   .8/2.4    .1/2.4
      V       1.4/2.9   1.1/2.9   .4/2.9

      Expected Emission MLE:
               w_1      w_2      w_3      w_4
      N       .4/1.1   .3/1.1   .2/1.1   .2/1.1
      V       .1/1.3   .6/1.3   .3/1.3   .3/1.3

      (these numbers are made up; end emission not shown)

  25. Semi-Supervised Parameter Estimation
      From the labeled portion of the data (observed z):
      Transition Counts:
               N   V   end
      start    2   0   0
      N        1   2   2
      V        2   1   0

      Emission Counts:
               w_1  w_2  w_3  w_4
      N        2    0    1    2
      V        0    2    1    0

  26.–27. Semi-Supervised Parameter Estimation
      Alongside the observed counts above, the unlabeled portion of the data
      contributes expected counts:
      Expected Transition Counts:
               N     V     end
      start   1.8   .1    .1
      N       1.5   .8    .1
      V       1.4   1.1   .4

      Expected Emission Counts:
               w_1   w_2   w_3   w_4
      N       .4    .3    .2    .2
      V       .1    .6    .3    .3

  28. Semi-Supervised Parameter Estimation
      Mixed counts = observed counts + expected counts:
      Mixed Transition Counts:
               N     V     end
      start   3.8   .1    .1
      N       2.5   2.8   2.1
      V       3.4   2.1   .4

      Mixed Emission Counts:
               w_1   w_2   w_3   w_4
      N       2.4   .3    1.2   2.2
      V       .1    2.6   1.3   .3
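
The mixing is literally addition before normalization. A sketch using the tables above (the helper name is mine):

    def mix_counts(observed, expected):
        return {ctx: {out: observed[ctx][out] + expected[ctx][out]
                      for out in observed[ctx]}
                for ctx in observed}

    observed = {"start": {"N": 2, "V": 0, "end": 0},
                "N": {"N": 1, "V": 2, "end": 2},
                "V": {"N": 2, "V": 1, "end": 0}}
    expected = {"start": {"N": 1.8, "V": .1, "end": .1},
                "N": {"N": 1.5, "V": .8, "end": .1},
                "V": {"N": 1.4, "V": 1.1, "end": .4}}
    print(mix_counts(observed, expected)["N"])  # {'N': 2.5, 'V': 2.8, 'end': 2.1}
    # mle(mix_counts(observed, expected)) then yields the new transition table.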

  29. Agenda
      HMM Detailed Definition
      HMM Parameter Estimation
      EM for HMMs: General Approach; Expectation Calculation

  30. EM Math
      Maximize the average log-likelihood of our complete data (z, x), averaged
      across all z according to how likely our current model thinks z is:
        max_θ  E_{z ~ p_{θ^(t)}(· | x)} [ log p_θ(z, x) ]
      Here θ^(t) are the current parameters, θ are the new parameters, and
      p_{θ^(t)}(· | x) is the posterior distribution over z.

  31. EM Math
      Maximize the average log-likelihood of our complete data (z, x):
        max_θ  E_{z ~ p_{θ^(t)}(· | x)} [ log p_θ(z, x) ]
      For an HMM, the latent sequences range over z ∈ {s_1, …, s_K}^N, and the
      joint factors so that
        log p_θ(z, x) = Σ_i [ log p_θ(z_i | z_{i-1}) + log p_θ(x_i | z_i) ]
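
The slides assert that the M-step update is count normalization; here is the one-step justification (my addition, standard EM algebra), written in LaTeX:

    \mathbb{E}_{z \sim p_{\theta^{(t)}}(\cdot \mid x)}\bigl[\log p_\theta(z, x)\bigr]
      = \sum_i \sum_{s, s'} p^*(z_{i-1}{=}s,\, z_i{=}s' \mid x)\, \log p_\theta(s' \mid s)
      + \sum_i \sum_s p^*(z_i{=}s \mid x)\, \log p_\theta(x_i \mid s).
    % Maximizing over each row p_\theta(\cdot \mid s), subject to
    % \sum_{s'} p_\theta(s' \mid s) = 1 (Lagrange multipliers), gives
    p_\theta^{\mathrm{new}}(s' \mid s)
      = \frac{\sum_i p^*(z_{i-1}{=}s,\, z_i{=}s' \mid x)}
             {\sum_{s''} \sum_i p^*(z_{i-1}{=}s,\, z_i{=}s'' \mid x)}
      = \frac{\mathbb{E}[c(s \to s')]}{\sum_{s''} \mathbb{E}[c(s \to s'')]},

which is exactly the M-step update on slides 16 and 17.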

  32. Estimating Parameters from Unobserved Data
      (Repeated from slide 24.) Normalize the expected counts to get the
      Expected Transition MLE and Expected Emission MLE tables.

  33. EM For HMMs (Baum-Welch Algorithm)
      L = p(x_1, …, x_N)
      for(i = 1; i ≤ N; ++i) {
        for(state = 0; state < K*; ++state) {
          c_obs(obs_i | state) += p(z_i = state, x_1, …, x_N) / L
          for(prev = 0; prev < K*; ++prev) {
            c_trans(state | prev) += p(z_{i-1} = prev, z_i = state, x_1, …, x_N) / L
          }
        }
      }

  34. EM For HMMs (Baum-Welch Algorithm)
      L = p(x_1, …, x_N)
      for(i = 1; i ≤ N; ++i) {
        for(state = 0; state < K*; ++state) {
          c_obs(obs_i | state) += p(z_i = state, x_1, …, x_i = obs_i) * p(x_{i+1:N} | z_i = state) / L
          for(prev = 0; prev < K*; ++prev) {
            u = p_obs(obs_i | state) * p_trans(state | prev)
            c_trans(state | prev) += p(z_{i-1} = prev, x_{1:i-1}) * u * p(x_{i+1:N} | z_i = state) / L
          }
        }
      }

  35. EM For HMMs (Baum-Welch Algorithm)
      The numerators above are exactly forward and backward values:
      L = p(x_1, …, x_N)
      for(i = 1; i ≤ N; ++i) {
        for(state = 0; state < K*; ++state) {
          c_obs(obs_i | state) += α(i, state) * β(i, state) / L
          for(prev = 0; prev < K*; ++prev) {
            u = p_obs(obs_i | state) * p_trans(state | prev)
            c_trans(state | prev) += α(i-1, prev) * u * β(i, state) / L
          }
        }
      }
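
A runnable Python rendering of this E-step (my sketch, following the pseudocode; it reuses the trans/emit dictionaries from the first code sketch and omits the start → s and s → end count bookkeeping for brevity):

    from collections import defaultdict

    def forward(words, states, trans, emit):
        """a[i][s] = p(x[0..i], z_i = s), 0-indexed; L = p(x)."""
        a = [defaultdict(float) for _ in range(len(words))]
        for s in states:
            a[0][s] = trans[("start", s)] * emit[(s, words[0])]
        for i in range(1, len(words)):
            for s in states:
                a[i][s] = sum(a[i-1][p] * trans[(p, s)] for p in states) * emit[(s, words[i])]
        L = sum(a[-1][s] * trans[(s, "end")] for s in states)
        return a, L

    def backward(words, states, trans, emit):
        """b[i][s] = p(x[i+1..], end | z_i = s), 0-indexed."""
        b = [defaultdict(float) for _ in range(len(words))]
        for s in states:
            b[-1][s] = trans[(s, "end")]
        for i in range(len(words) - 2, -1, -1):
            for s in states:
                b[i][s] = sum(trans[(s, n)] * emit[(n, words[i+1])] * b[i+1][n]
                              for n in states)
        return b

    def e_step(words, states, trans, emit):
        a, L = forward(words, states, trans, emit)
        b = backward(words, states, trans, emit)
        c_obs, c_trans = defaultdict(float), defaultdict(float)
        for i, w in enumerate(words):
            for s in states:
                c_obs[(s, w)] += a[i][s] * b[i][s] / L        # alpha * beta / L
                if i > 0:
                    for prev in states:
                        u = emit[(s, w)] * trans[(prev, s)]
                        c_trans[(prev, s)] += a[i-1][prev] * u * b[i][s] / L
        return c_obs, c_trans

    c_obs, c_trans = e_step(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit)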

  36. Why Do We Need Backward Values?
      A three-state trellis (states A, B, C) at steps i-1, i, i+1.
      α(i, s) is the total probability of all paths:
        1. that start from the beginning
        2. that end (currently) in s at step i
        3. that emit the observation obs at i
      β(i, s) is the total probability of all paths:
        1. that start at step i at state s
        2. that terminate at the end
        3. (that emit the observation obs at i+1)

  37. Why Do We Need Backward Values?
      The trellis is split at step i, state B: α(i, B) covers everything to the
      left, β(i, B) everything to the right (definitions of α and β as above).

  38. Why Do We Need Backward Values?
      α(i, B) * β(i, B) = total probability of paths through state B at step i

  39. Why Do We Need Backward Values?
      α(i, s) * β(i, s) = total probability of paths through state s at step i
      Normalizing by the marginal likelihood, we can compute posterior state
      probabilities.

  40. Why Do We Need Backward Values?
      Now split the trellis across an arc instead of a state: α(i, B) to the
      left, β(i+1, s') to the right.

  41. Why Do We Need Backward Values?
      α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s')
        = total probability of paths through the B → s' arc (at time i)

  42. Why Do We Need Backward Values?
      Normalizing that arc quantity by the marginal likelihood, we can compute
      posterior transition probabilities.

  43. With Both Forward and Backward Values
      α(i, s) * β(i, s) = total probability of paths through state s at step i
      α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s')
        = total probability of paths through the s → s' arc (at time i)

  44. With Both Forward and Backward Values
      α(i, s) * β(i, s) = total probability of paths through state s at step i, so
        p(z_i = s | x_1, …, x_N) = α(i, s) * β(i, s) / α(N+1, END)

  45. With Both Forward and Backward Values
      Likewise for arcs:
        p(z_i = s, z_{i+1} = s' | x_1, …, x_N)
          = α(i, s) * p(s' | s) * p(obs_{i+1} | s') * β(i+1, s') / α(N+1, END)
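
A quick numeric check of the first identity (my addition), reusing the forward/backward sketch above: the posterior state probabilities at each step sum to 1.

    words = ["w1", "w2", "w3", "w4"]
    a, L = forward(words, ("N", "V"), trans, emit)
    b = backward(words, ("N", "V"), trans, emit)
    for i in range(len(words)):
        posterior = {s: a[i][s] * b[i][s] / L for s in ("N", "V")}
        assert abs(sum(posterior.values()) - 1.0) < 1e-12
        print(i, posterior)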

  46. Agenda
      HMM Detailed Definition
      HMM Parameter Estimation
      EM for HMMs: General Approach; Expectation Calculation

  47. HMM Expectation Calculation
      The joint factors into transition and emission probabilities/parameters:
        p(z_1, x_1, z_2, x_2, …, z_N, x_N)
          = p(z_1 | z_0) p(x_1 | z_1) ⋯ p(z_N | z_{N-1}) p(x_N | z_N)
          = Π_i p(x_i | z_i) p(z_i | z_{i-1})
      Calculate the forward (log) likelihood of an observed (sub-)sequence
      w_1, …, w_J, and the backward (log) likelihood of an observed
      (sub-)sequence w_{J+1}, …, w_N.

  48. HMM Likelihood Task
      Marginalize over all latent-sequence joint likelihoods:
        p(x_1, x_2, …, x_N) = Σ_{z_1, …, z_N} p(z_1, x_1, z_2, x_2, …, z_N, x_N)
      Q: In a K-state HMM for a length-N observation sequence, how many
      summands (different latent sequences) are there?

  49. HMM Likelihood Task
      Q: In a K-state HMM for a length-N observation sequence, how many
      summands (different latent sequences) are there?
      A: K^N

  50. HMM Likelihood Task
      A: K^N
      Goal: find a way to compute this exponential sum efficiently (in
      polynomial time).
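
To see the gap concretely, here is a brute-force enumeration of all K^N = 16 paths for the running example (my illustration, not the deck's), checked against the forward computation from the earlier sketch; the forward algorithm itself is derived on the slides that follow.

    from itertools import product

    def brute_force_likelihood(words, states, trans, emit):
        total = 0.0
        for path in product(states, repeat=len(words)):   # K^N summands
            p, prev = 1.0, "start"
            for s, w in zip(path, words):
                p *= trans[(prev, s)] * emit[(s, w)]
                prev = s
            total += p * trans[(prev, "end")]             # include the end arc
        return total

    words = ["w1", "w2", "w3", "w4"]
    _, L = forward(words, ("N", "V"), trans, emit)
    assert abs(brute_force_likelihood(words, ("N", "V"), trans, emit) - L) < 1e-12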

  51. 2-State HMM Likelihood
      Back in the two-state trellis, the two tagged paths from earlier,
      N V V N and N V N N, are highlighted over w_1 … w_4.

  52. 2-State HMM Likelihood
      Up until the point where the two paths diverge (they share
      start → N → V), all the computation was the same.

  53. 2-State HMM Likelihood
      Up until here, all the computation was the same.
      Let's reuse what computations we can.

  54. 2-State HMM Likelihood
      Solution: pass information "forward" in the graph, e.g., from time
      step 2 to 3.

  55. 2-State HMM Likelihood
      Solution: pass information "forward" in the graph, e.g., from time
      step 2 to 3.
      Issue: these highlighted paths are only 2 of the 16 possible paths
      through the trellis.

  56. 2-State HMM Likelihood
      Issue: the two highlighted paths are only 2 of the 16 possible paths
      through the trellis.
      Solution: marginalize out all information from previous timesteps.

  57. Reusing Computation
      A three-state trellis (A, B, C) at steps i-2, i-1, i. Let's first
      consider "any shared path ending with B (AB, BB, or CB) → B".

  58. Reusing Computation
      Assume that all necessary information about paths up to step i-1 has
      been computed and stored in α(i-1, A), α(i-1, B), α(i-1, C).

  59. Reusing Computation
      To get α(i, B), marginalize (sum) across the previous timestep's
      possible states.

  60. Reusing Computation
      Marginalize across the previous hidden state values:
        α(i, B) = Σ_s α(i-1, s) * p(B | s) * p(obs at i | B)

  61. Reusing Computation
        α(i, B) = Σ_s α(i-1, s) * p(B | s) * p(obs at i | B)
      Computing α at time i-1 will correctly incorporate paths through time
      i-2: we correctly obey the Markov property.

  62. Forward Probability
      α(i, B) is the total probability of all paths to state B from the
      beginning:
        α(i, B) = Σ_{s'} α(i-1, s') * p(B | s') * p(obs at i | B)
      Computing α at time i-1 correctly incorporates paths through time i-2:
      we correctly obey the Markov property.

  63. Forward Probability
      In general:
        α(i, s) = Σ_{s'} α(i-1, s') * p(s | s') * p(obs at i | s)
      α(i, s) is the total probability of all paths:
        1. that start from the beginning
        2. that end (currently) in s at step i
        3. that emit the observation obs at i

  64. Forward Probability
        α(i, s) = Σ_{s'} α(i-1, s') * p(s | s') * p(obs at i | s)
      Reading the factors: α(i-1, s') is the total probability up until now;
      the sum over s' covers the immediate ways to get into state s; and
      p(s | s') * p(obs at i | s) is how likely it is to get into state s
      this way.

  65. Forward Algorithm
      α: a 2D table, (N+2) × K*
        N+2: number of observations (+2 for the BOS & EOS symbols)
        K*: number of states
      Use dynamic programming to build α left-to-right.

  66. Forward Algorithm
      α = double[N+2][K*]
      α[0][*] = 0.0
      α[0][START] = 1.0
      for(i = 1; i ≤ N+1; ++i) {
      }

  67. Forward Algorithm
      (Build continues: add the inner loop over states.)
      for(i = 1; i ≤ N+1; ++i) {
        for(state = 0; state < K*; ++state) {
        }
      }

  68. Forward Algorithm
      (Build continues: look up the emission probability.)
      for(i = 1; i ≤ N+1; ++i) {
        for(state = 0; state < K*; ++state) {
          p_obs = p_emission(obs_i | state)
        }
      }

  69. Forward Algorithm
      α = double[N+2][K*]
      α[0][*] = 0.0
      α[0][START] = 1.0
      for(i = 1; i ≤ N+1; ++i) {
        for(state = 0; state < K*; ++state) {
          p_obs = p_emission(obs_i | state)
          for(old = 0; old < K*; ++old) {
            p_move = p_transition(state | old)
            α[i][state] += α[i-1][old] * p_obs * p_move
          }
        }
      }

  70. Forward Algorithm
      Same algorithm as above; note that p_emission and p_transition are
      parameters we still need to learn (via EM, if not observed).

  71. Forward Algorithm
      Q: What do we return? (How do we return the likelihood of the sequence?)

  72. Forward Algorithm
      Q: What do we return? (How do we return the likelihood of the sequence?)
      A: α[N+1][end]
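
A direct Python transcription of the finished pseudocode (my sketch), with explicit rows for the START and END positions, run on the N/V tables from the first sketch:

    def forward_table(words, states, trans, emit):
        n = len(words)
        all_states = ("start",) + states + ("end",)
        alpha = [{s: 0.0 for s in all_states} for _ in range(n + 2)]
        alpha[0]["start"] = 1.0
        for i in range(1, n + 1):                 # build the table left-to-right
            for state in states:
                p_obs = emit[(state, words[i - 1])]
                for old in ("start",) + states:
                    p_move = trans.get((old, state), 0.0)
                    alpha[i][state] += alpha[i - 1][old] * p_obs * p_move
        for old in states:                        # final step: transition into END
            alpha[n + 1]["end"] += alpha[n][old] * trans[(old, "end")]
        return alpha

    alpha = forward_table(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit)
    print(alpha[5]["end"])   # alpha[N+1][end]: the likelihood of the sequence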

  73. Interactive HMM Example
      https://goo.gl/rbHEoc (Jason Eisner, 2002)
      Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls

  74. Forward Algorithm in Log-Space
      α = double[N+2][K*]
      α[0][*] = -∞
      α[0][START] = 0.0
      for(i = 1; i ≤ N+1; ++i) {
        for(state = 0; state < K*; ++state) {
          p_obs = log p_emission(obs_i | state)
          for(old = 0; old < K*; ++old) {
            p_move = log p_transition(state | old)
            α[i][state] = logadd(α[i][state], α[i-1][old] + p_obs + p_move)
          }
        }
      }

  75. Forward Algorithm in Log-Space
      Same algorithm as above, where
        logadd(lp, lq) = lp + log(1 + exp(lq - lp))   if lp ≥ lq
                         lq + log(1 + exp(lp - lq))   if lq > lp
      (a numerically stable log-sum-exp; cf. scipy.misc.logsumexp).
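
And the log-space variant in Python (my sketch; note scipy.misc.logsumexp has since moved to scipy.special.logsumexp, and numpy's logaddexp implements the two-argument logadd above):

    import numpy as np

    def log_forward(words, states, trans, emit):
        n = len(words)
        all_states = ("start",) + states + ("end",)
        la = [{s: -np.inf for s in all_states} for _ in range(n + 2)]
        la[0]["start"] = 0.0
        for i in range(1, n + 1):
            for state in states:
                lp_obs = np.log(emit[(state, words[i - 1])])
                for old in ("start",) + states:
                    p_move = trans.get((old, state), 0.0)
                    if p_move == 0.0:
                        continue                  # log 0 = -inf: skip impossible arcs
                    la[i][state] = np.logaddexp(la[i][state],
                                                la[i - 1][old] + lp_obs + np.log(p_move))
        for old in states:
            la[n + 1]["end"] = np.logaddexp(la[n + 1]["end"],
                                            la[n][old] + np.log(trans[(old, "end")]))
        return la[n + 1]["end"]                   # log-likelihood of the sequence

    print(log_forward(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit))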
