EM & Hidden Markov Models (CMSC 691 UMBC)


slide-1
SLIDE 1

EM & Hidden Markov Models

CMSC 691 UMBC

slide-2
SLIDE 2

Recap from last time…

slide-3
SLIDE 3

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two-step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these estimated counts

slide-4
SLIDE 4

Counting Requires Marginalizing

E-step: count under uncertainty, assuming these parameters

slide-5
SLIDE 5

Counting Requires Marginalizing

p(w) = p(z1, w) + p(z2, w) + p(z3, w) + p(z4, w) = Σ_{j=1}^{4} p(zj, w)

E-step: count under uncertainty, assuming these parameters

break into 4 disjoint pieces
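The marginalization above is small enough to sketch directly. A minimal Python illustration (the joint values below are made-up numbers for illustration, not from the slides):

```python
# Marginalizing: p(w) is the sum of the joint p(z, w) over the four
# disjoint choices of z. Joint values here are made-up for illustration.
joint = {1: 0.10, 2: 0.05, 3: 0.20, 4: 0.15}  # p(z_j, w) for j = 1..4

p_w = sum(joint.values())  # p(w) = sum_j p(z_j, w)

# Posterior responsibility of each z given w -- the quantity the
# E-step "counts under uncertainty".
posterior = {j: p / p_w for j, p in joint.items()}
print(round(p_w, 6), round(posterior[3], 6))  # 0.5 0.4
```

The four summands are disjoint because z takes exactly one value, so the posterior responsibilities always sum to one.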

slide-6
SLIDE 6

Hidden Markov Models

…

slide-7
SLIDE 7

Agenda

HMM Detailed Definition
HMM Parameter Estimation
EM for HMMs
    General Approach
    Expectation Calculation

slide-8
SLIDE 8

Hidden Markov Models

Class-based Model: use different distributions to explain groupings of observations

Sequence Model: a bigram model over the classes, not the observations; implicitly models all possible class sequences. There are algorithms for finding the best sequence, computing the marginal likelihood, and doing semi-/un-supervised learning.

slide-9
SLIDE 9

Hidden Markov Model

Goal: maximize the (log-)likelihood. In practice we don't actually observe the z values; we just see the words w.

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

slide-10
SLIDE 10

Hidden Markov Model

Goal: maximize the (log-)likelihood. In practice we don't actually observe the z values; we just see the words w.

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

If we knew the probability parameters, we could estimate z and evaluate the likelihood… but we don't! :( If we did observe z, estimating the probability parameters would be easy… but we don't! :(

slide-11
SLIDE 11

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

slide-12
SLIDE 12

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

transition probabilities/parameters

slide-13
SLIDE 13

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

slide-14
SLIDE 14

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states. Transition and emission distributions do not change.

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

slide-15
SLIDE 15

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states. Transition and emission distributions do not change.
Q: How many different probability values are there with K states and V vocab items?

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

slide-16
SLIDE 16

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states. Transition and emission distributions do not change.
Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

slide-17
SLIDE 17

Hidden Markov Model Representation

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

[graphical model: latent states z1 → z2 → z3 → z4 → …, each zi emitting word wi]

represent the probabilities and independence assumptions in a graph

slide-18
SLIDE 18

Hidden Markov Model Representation

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

[same graphical model, with emission arcs labeled p(w1|z1), p(w2|z2), p(w3|z3), p(w4|z4)]

slide-19
SLIDE 19

Hidden Markov Model Representation

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

[same graphical model, with emission arcs p(w1|z1) … p(w4|z4) and transition arcs p(z2|z1), p(z3|z2), p(z4|z3)]

slide-20
SLIDE 20

Hidden Markov Model Representation

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

[same graphical model; emission arcs p(w1|z1) … p(w4|z4), transition arcs p(z2|z1), p(z3|z2), p(z4|z3), and p(z1|z0): the initial starting distribution ("BOS")]

slide-21
SLIDE 21

Hidden Markov Model Representation

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂 = π‘ž 𝑨1| 𝑨0 π‘ž π‘₯1|𝑨1 β‹― π‘ž 𝑨𝑂| π‘¨π‘‚βˆ’1 π‘ž π‘₯𝑂|𝑨𝑂 = ΰ·‘

𝑗

π‘ž π‘₯𝑗|𝑨𝑗 π‘ž 𝑨𝑗| π‘¨π‘—βˆ’1

emission probabilities/parameters transition probabilities/parameters

[same graphical model; emission arcs p(w1|z1) … p(w4|z4), transition arcs p(z2|z1), p(z3|z2), p(z4|z3), and p(z1|z0): the initial starting distribution ("BOS")]

Each zi can take the value of one of K latent states. Transition and emission distributions do not change.

slide-22
SLIDE 22

Example: 2-state Hidden Markov Model as a Lattice

[lattice: at each step i = 1…4, zi is either N or V; each zi emits wi; the lattice continues with …]

slide-23
SLIDE 23

Example: 2-state Hidden Markov Model as a Lattice

[same lattice, with emission arcs labeled: p(w1|N), p(w2|N), p(w3|N), p(w4|N) on the N row and p(w1|V), p(w2|V), p(w3|V), p(w4|V) on the V row]

slide-24
SLIDE 24

Example: 2-state Hidden Markov Model as a Lattice

[same lattice; emission arcs as before, plus transition arcs p(N|start), p(V|start), p(N|N) between consecutive N states, and p(V|V) between consecutive V states]

slide-25
SLIDE 25

Example: 2-state Hidden Markov Model as a Lattice

[same lattice; emission arcs as before, plus all transition arcs: p(N|start), p(V|start), p(N|N), p(V|N), p(N|V), p(V|V) at each step]

slide-26
SLIDE 26

2 State HMM Likelihood

[same 2-state lattice with all emission and transition arcs]

Transition probabilities p(· | ·):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(· | ·):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

slide-27
SLIDE 27

2 State HMM Likelihood

[same 2-state lattice with all emission and transition arcs]

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?

Transition probabilities p(· | ·):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(· | ·):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

slide-28
SLIDE 28

2 State HMM Likelihood

[lattice with the single path z = (N, V, V, N) highlighted]

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 0.000247

Transition probabilities p(· | ·):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(· | ·):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

slide-29
SLIDE 29

2 State HMM Likelihood

[same 2-state lattice with all emission and transition arcs]

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?

Transition probabilities p(· | ·):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(· | ·):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

slide-30
SLIDE 30

2 State HMM Likelihood

[lattice with the single path z = (N, V, N, N) highlighted]

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) ≈ 0.0000529

Transition probabilities p(· | ·):
         N     V     end
start   .7    .2    .1
N       .15   .8    .05
V       .6    .35   .05

Emission probabilities p(· | ·):
        w1    w2    w3    w4
N       .7    .2    .05   .05
V       .2    .6    .1    .1

slide-31
SLIDE 31

Agenda

HMM Detailed Definition
HMM Parameter Estimation
EM for HMMs
    General Approach
    Expectation Calculation

slide-32
SLIDE 32

Estimating Parameters from Observed Data

[two fully observed tagged sequences shown on the lattice: z = (N, V, V, N) and z = (N, V, N, N)]

Transition Counts and Emission Counts tables (empty, to be filled from the observed paths)

end emission not shown

slide-33
SLIDE 33

Estimating Parameters from Observed Data

[the same two observed tagged sequences: z = (N, V, V, N) and z = (N, V, N, N)]

Transition Counts:
         N    V    end
start    2    0    0
N        1    2    2
V        2    1    0

Emission Counts:
        w1   w2   w3   w4
N        2    0    1    2
V        0    2    1    0

end emission not shown

slide-34
SLIDE 34

Estimating Parameters from Observed Data

[the same two observed tagged sequences]

Transition MLE:
         N     V     end
start    1     0     0
N       .2    .4    .4
V       2/3   1/3    0

Emission MLE:
        w1    w2    w3    w4
N       .4     0    .2    .4
V        0    2/3   1/3    0

end emission not shown

slide-35
SLIDE 35

Estimating Parameters from Observed Data

[the same two observed tagged sequences]

Transition MLE:
         N     V     end
start    1     0     0
N       .2    .4    .4
V       2/3   1/3    0

Emission MLE:
        w1    w2    w3    w4
N       .4     0    .2    .4
V        0    2/3   1/3    0

end emission not shown

smooth these values if needed
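The MLE above is just per-row normalization of the count tables, and smoothing is one extra additive term. A Python sketch using the transition counts from these slides (function and parameter names are mine):

```python
# MLE = normalize each row of the count table; counts from the two
# observed tagged sequences on the previous slides.
trans_counts = {
    "start": {"N": 2, "V": 0, "end": 0},
    "N": {"N": 1, "V": 2, "end": 2},
    "V": {"N": 2, "V": 1, "end": 0},
}

def mle(counts, smooth=0.0):
    """p(x | s) = (c(s->x) + smooth) / sum_x' (c(s->x') + smooth)."""
    out = {}
    for s, row in counts.items():
        total = sum(row.values()) + smooth * len(row)
        out[s] = {x: (c + smooth) / total for x, c in row.items()}
    return out

probs = mle(trans_counts)
print(probs["N"])   # {'N': 0.2, 'V': 0.4, 'end': 0.4}
print(probs["V"])   # the V row: 2/3, 1/3, 0
```

Setting `smooth` to a positive value implements the add-λ smoothing the slide alludes to, moving zero counts (e.g. V → end) off of zero probability.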

slide-36
SLIDE 36

What If We Don't Observe z?

Approach: develop an EM algorithm
Goal: estimate p_trans(s' | s) and p_obs(w | s)
Why: compute the expected counts E[c(s → s')] (transitions zi = s → zi+1 = s') and E[c(s → w)] (state s emitting word w)

slide-37
SLIDE 37

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two-step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these estimated counts

slide-38
SLIDE 38

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two-step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these estimated counts

pobs(w | s)   ptrans(s' | s)

slide-39
SLIDE 39

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two-step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these estimated counts

pobs(w | s)   ptrans(s' | s)

p*(zi = s | w1, ⋯, wN) = p(zi = s, w1, ⋯, wN) / p(w1, ⋯, wN)
p*(zi = s, zi+1 = s' | w1, ⋯, wN) = p(zi = s, zi+1 = s', w1, ⋯, wN) / p(w1, ⋯, wN)

slide-40
SLIDE 40

M-Step

β€œmaximize log-likelihood, assuming these uncertain counts”

π‘žnew 𝑑′ 𝑑) = 𝑑(𝑑 β†’ 𝑑′) σ𝑑′′ 𝑑(𝑑 β†’ 𝑑′′)

if we observed the hidden transitions…

slide-41
SLIDE 41

M-Step

β€œmaximize log-likelihood, assuming these uncertain counts”

π‘žnew 𝑑′ 𝑑) = 𝔽𝑑→𝑑′[𝑑 𝑑 β†’ 𝑑′ ] σ𝑑′′ 𝔽𝑑→𝑑′′[𝑑 𝑑 β†’ 𝑑′′ ]

we don't observe the hidden transitions, but we can approximately count

slide-42
SLIDE 42

M-Step

β€œmaximize log-likelihood, assuming these uncertain counts”

π‘žnew 𝑑′ 𝑑) = 𝔽𝑑→𝑑′[𝑑 𝑑 β†’ 𝑑′ ] σ𝑑′′ 𝔽𝑑→𝑑′′[𝑑 𝑑 β†’ 𝑑′′ ]

we don't observe the hidden transitions, but we can approximately count

we compute these in the E-step

slide-43
SLIDE 43

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two-step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these estimated counts

pobs(w | s)   ptrans(s' | s)

p*(zi = s | w1, ⋯, wN) = p(zi = s, w1, ⋯, wN) / p(w1, ⋯, wN)
p*(zi = s, zi+1 = s' | w1, ⋯, wN) = p(zi = s, zi+1 = s', w1, ⋯, wN) / p(w1, ⋯, wN)

Baum-Welch

slide-44
SLIDE 44

Estimating Parameters from Unobserved Data

Expected Transition Counts and Expected Emission Counts tables (empty, to be filled)

end emission not shown

[lattice with posterior ("p*") arcs: p*(N|start), p*(V|start), every transition p*(s'|s) at each step, and emissions p*(w1|N) … p*(w4|N), p*(w1|V) … p*(w4|V)]

slide-45
SLIDE 45

Estimating Parameters from Unobserved Data

Expected Transition Counts and Expected Emission Counts tables (empty, to be filled)

end emission not shown

[lattice with posterior ("p*") arcs: p*(N|start), p*(V|start), every transition p*(s'|s) at each step, and emissions p*(w1|N) … p*(w4|N), p*(w1|V) … p*(w4|V)]

all of these p* arcs are specific to a time-step

slide-46
SLIDE 46

all of these p* arcs are specific to a time-step

Expected Transition Counts and Expected Emission Counts tables (empty, to be filled)

end emission not shown

[same lattice; example time-specific posterior values shown: p*(N|N) = .4, .6, .5 and p*(V|V) = .5, .3, .3 across the three transitions]

slide-47
SLIDE 47

all of these p* arcs are specific to a time-step

Expected Transition Counts (partially filled by summing the time-specific arc posteriors):
         N     V    end
start
N       1.5
V             1.1

e.g., E[c(N → N)] = .4 + .6 + .5 = 1.5 and E[c(V → V)] = .5 + .3 + .3 = 1.1

Expected Emission Counts table (empty, to be filled)

end emission not shown

[same lattice with time-specific p* arcs]

slide-48
SLIDE 48

Estimating Parameters from Unobserved Data

Expected Transition Counts:
          N     V     end
start    1.8   .1    .1
N        1.5   .8    .1
V        1.4   1.1   .4

Expected Emission Counts:
        w1    w2    w3    w4
N       .4    .3    .2    .2
V       .1    .6    .3    .3

end emission not shown

[same lattice with time-specific p* arcs]

(these numbers are made up)

slide-49
SLIDE 49

Estimating Parameters from Unobserved Data

Expected Transition MLE:
          N         V         end
start    1.8/2     .1/2      .1/2
N        1.5/2.4   .8/2.4    .1/2.4
V        1.4/2.9   1.1/2.9   .4/2.9

Expected Emission MLE:
        w1       w2       w3       w4
N       .4/1.1   .3/1.1   .2/1.1   .2/1.1
V       .1/1.3   .6/1.3   .3/1.3   .3/1.3

end emission not shown

[same lattice with time-specific p* arcs]

(these numbers are made up)

slide-50
SLIDE 50

Semi-Supervised Parameter Estimation

Transition Counts:
         N    V    end
start    2    0    0
N        1    2    2
V        2    1    0

Emission Counts:
        w1   w2   w3   w4
N        2    0    1    2
V        0    2    1    0

slide-51
SLIDE 51

Semi-Supervised Parameter Estimation

Expected Transition Counts (unlabeled data):
          N     V     end
start    1.8   .1    .1
N        1.5   .8    .1
V        1.4   1.1   .4

Expected Emission Counts (unlabeled data):
        w1    w2    w3    w4
N       .4    .3    .2    .2
V       .1    .6    .3    .3

Transition Counts (labeled data):
         N    V    end
start    2    0    0
N        1    2    2
V        2    1    0

Emission Counts (labeled data):
        w1   w2   w3   w4
N        2    0    1    2
V        0    2    1    0

slide-52
SLIDE 52

Semi-Supervised Parameter Estimation

Expected Transition Counts (unlabeled data):
          N     V     end
start    1.8   .1    .1
N        1.5   .8    .1
V        1.4   1.1   .4

Expected Emission Counts (unlabeled data):
        w1    w2    w3    w4
N       .4    .3    .2    .2
V       .1    .6    .3    .3

Transition Counts (labeled data):
         N    V    end
start    2    0    0
N        1    2    2
V        2    1    0

Emission Counts (labeled data):
        w1   w2   w3   w4
N        2    0    1    2
V        0    2    1    0

slide-53
SLIDE 53

(the same expected counts from unlabeled data and observed counts from labeled data as on the previous slide)

Semi-Supervised Parameter Estimation

Mixed Transition Counts (expected + observed):
          N     V     end
start    3.8   .1    .1
N        2.5   2.8   2.1
V        3.4   2.1   .4

Mixed Emission Counts (expected + observed):
        w1    w2    w3    w4
N       2.4   .3    1.2   2.2
V       .1    2.6   1.3   .3
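The mixed table is elementwise addition of the two count tables, after which the M-step normalizes as usual. A minimal sketch (only the transition tables shown; the variable names are mine):

```python
# Semi-supervised mixing: hard counts from labeled data plus expected
# (soft) counts from the E-step on unlabeled data; tables from the slides.
expected = {
    "start": {"N": 1.8, "V": 0.1, "end": 0.1},
    "N": {"N": 1.5, "V": 0.8, "end": 0.1},
    "V": {"N": 1.4, "V": 1.1, "end": 0.4},
}
observed = {
    "start": {"N": 2, "V": 0, "end": 0},
    "N": {"N": 1, "V": 2, "end": 2},
    "V": {"N": 2, "V": 1, "end": 0},
}

mixed = {
    s: {x: expected[s][x] + observed[s][x] for x in expected[s]}
    for s in expected
}
print(mixed["N"])  # N row: 2.5, 2.8, 2.1 (up to float rounding)
```

One could also weight the two tables before adding, to control how much the labeled data dominates; the slides simply add them.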

slide-54
SLIDE 54

Agenda

HMM Detailed Definition
HMM Parameter Estimation
EM for HMMs
    General Approach
    Expectation Calculation

slide-55
SLIDE 55

EM Math

max_θ  E_{z ~ p_{θ(t)}( · | w)} [ log p_θ(z, w) ]

θ(t): current parameters;  θ: new parameters;  p_{θ(t)}( · | w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

slide-56
SLIDE 56

EM Math

max_θ  E_{z ~ p_{θ(t)}( · | w)} [ log p_θ(z, w) ]

θ(t): current parameters;  θ: new parameters;  p_{θ(t)}( · | w): posterior distribution

maximize the average log-likelihood of our complete data (z, w), averaged across all z and according to how likely our current model thinks z is

log p_θ(z, w) = Σ_i [ log p_θ(zi | zi−1) + log p_θ(wi | zi) ],  with z ranging over {s1, …, sK}^N

slide-57
SLIDE 57

Estimating Parameters from Unobserved Data

Expected Transition MLE:
          N         V         end
start    1.8/2     .1/2      .1/2
N        1.5/2.4   .8/2.4    .1/2.4
V        1.4/2.9   1.1/2.9   .4/2.9

Expected Emission MLE:
        w1       w2       w3       w4
N       .4/1.1   .3/1.1   .2/1.1   .2/1.1
V       .1/1.3   .6/1.3   .3/1.3   .3/1.3

end emission not shown

[same lattice with time-specific p* arcs]

(these numbers are made up)

slide-58
SLIDE 58

EM For HMMs (Baum-Welch Algorithm)

L = p(w1, ⋯, wN)
for (i = 1; i ≤ N; ++i) {
  for (state = 0; state < K*; ++state) {
    cobs(obsi | state) += p(zi = state, w1, ⋯, wN) / L
    for (prev = 0; prev < K*; ++prev) {
      ctrans(state | prev) += p(zi−1 = prev, zi = state, w1, ⋯, wN) / L
    }
  }
}

slide-59
SLIDE 59

EM For HMMs (Baum-Welch Algorithm)

L = p(w1, ⋯, wN)
for (i = 1; i ≤ N; ++i) {
  for (state = 0; state < K*; ++state) {
    cobs(obsi | state) += p(zi = state, w1, …, wi = obsi) * p(wi+1:N | zi = state) / L
    for (prev = 0; prev < K*; ++prev) {
      u = pobs(obsi | state) * ptrans(state | prev)
      ctrans(state | prev) += p(zi−1 = prev, w1:i−1) * u * p(wi+1:N | zi = state) / L
    }
  }
}

slide-60
SLIDE 60

EM For HMMs (Baum-Welch Algorithm)

L = p(w1, ⋯, wN)
for (i = 1; i ≤ N; ++i) {
  for (state = 0; state < K*; ++state) {
    cobs(obsi | state) += p(zi = state, w1, …, wi = obsi) * p(wi+1:N | zi = state) / L
                          // = α(state, i) * β(state, i) / L
    for (prev = 0; prev < K*; ++prev) {
      u = pobs(obsi | state) * ptrans(state | prev)
      ctrans(state | prev) += p(zi−1 = prev, w1:i−1) * u * p(wi+1:N | zi = state) / L
                          // = α(prev, i − 1) * u * β(state, i) / L
    }
  }
}
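The quantities in the pseudocode can be computed concretely with forward and backward tables. A hedged Python sketch of the E-step expectations, reusing the toy N/V tables from the earlier slides (the function and variable names here are mine, not from the slides):

```python
# Forward-backward expectations for a 2-state HMM, using the toy tables
# from earlier slides. alpha/beta follow the lecture's definitions:
# alpha(i, s) = p(w_1..w_i, z_i = s); beta(i, s) = p(w_{i+1}..w_n, end | z_i = s).
trans = {
    "start": {"N": 0.7, "V": 0.2, "end": 0.1},
    "N": {"N": 0.15, "V": 0.8, "end": 0.05},
    "V": {"N": 0.6, "V": 0.35, "end": 0.05},
}
emit = {
    "N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
    "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1, "w4": 0.1},
}
STATES = ["N", "V"]

def forward(words):
    alpha = [{s: trans["start"][s] * emit[s][words[0]] for s in STATES}]
    for w in words[1:]:
        prev = alpha[-1]
        alpha.append({s: emit[s][w] * sum(prev[p] * trans[p][s] for p in STATES)
                      for s in STATES})
    return alpha

def backward(words):
    beta = [{s: trans[s]["end"] for s in STATES}]   # last position
    for w in reversed(words[1:]):                   # words w_n .. w_2
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][n] * emit[n][w] * nxt[n] for n in STATES)
                        for s in STATES})
    return beta

def e_step(words):
    """Posterior state and transition probabilities (Baum-Welch E-step)."""
    alpha, beta = forward(words), backward(words)
    L = sum(alpha[0][s] * beta[0][s] for s in STATES)  # marginal likelihood
    gamma = [{s: alpha[i][s] * beta[i][s] / L for s in STATES}
             for i in range(len(words))]
    xi = [{(s, n): alpha[i][s] * trans[s][n] * emit[n][words[i + 1]]
                   * beta[i + 1][n] / L
           for s in STATES for n in STATES}
          for i in range(len(words) - 1)]
    return L, gamma, xi

L, gamma, xi = e_step(["w1", "w2", "w3", "w4"])
print(L)         # marginal likelihood p(w1..w4, end)
print(gamma[0])  # posterior over z1; sums to 1
```

Each `gamma[i]` sums to one, and summing `xi` over time steps gives exactly the expected transition counts that the M-step normalizes.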

slide-61
SLIDE 61

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

slide-62
SLIDE 62

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B)   β(i, B)

slide-63
SLIDE 63

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B)   β(i, B)

α(i, B) * β(i, B) = total probability of paths through state B at step i

slide-64
SLIDE 64

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B)   β(i, B)

α(i, s) * β(i, s) = total probability of paths through state s at step i

we can compute posterior state probabilities

(normalize by marginal likelihood)

slide-65
SLIDE 65

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B)   β(i+1, s)

slide-66
SLIDE 66

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, B)   β(i+1, s')

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B→s' arc (at time i)

slide-67
SLIDE 67

Why Do We Need Backward Values?

[trellis: states A, B, C at steps i−1, i, i+1]

α(i, B)   β(i+1, s')

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

we can compute posterior transition probabilities

(normalize by marginal likelihood)

α(i, B) * p(s' | B) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the B→s' arc (at time i)

slide-68
SLIDE 68

With Both Forward and Backward Values

α(i, s) * β(i, s) = total probability of paths through state s at step i
α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i)

slide-69
SLIDE 69

With Both Forward and Backward Values

α(i, s) * β(i, s) = total probability of paths through state s at step i

p(zi = s | w1, ⋯, wN) = α(i, s) * β(i, s) / α(N + 1, END)

α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i)

slide-70
SLIDE 70

With Both Forward and Backward Values

α(i, s) * p(s' | s) * p(obs at i+1 | s') * β(i+1, s') = total probability of paths through the s→s' arc (at time i)
α(i, s) * β(i, s) = total probability of paths through state s at step i

p(zi = s | w1, ⋯, wN) = α(i, s) * β(i, s) / α(N + 1, END)
p(zi = s, zi+1 = s' | w1, ⋯, wN) = α(i, s) * p(s' | s) * p(obsi+1 | s') * β(i + 1, s') / α(N + 1, END)

slide-71
SLIDE 71

Agenda

HMM Detailed Definition
HMM Parameter Estimation
EM for HMMs
    General Approach
    Expectation Calculation

slide-72
SLIDE 72

HMM Expectation Calculation

Calculate the forward (log) likelihood of an observed (sub-)sequence w1, …, wJ

Calculate the backward (log) likelihood of an observed (sub-)sequence wJ+1, …, wN

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi−1)

emission probabilities/parameters transition probabilities/parameters

slide-73
SLIDE 73

HMM Likelihood Task

Marginalize over all latent sequence joint likelihoods

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1,β‹―,𝑨𝑂

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂

Q: In a K-state HMM for a length N observation sequence, how many summands (different latent sequences) are there?

slide-74
SLIDE 74

HMM Likelihood Task

Marginalize over all latent sequence joint likelihoods

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1,β‹―,𝑨𝑂

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂

Q: In a K-state HMM for a length N observation sequence, how many summands (different latent sequences) are there? A: KN

slide-75
SLIDE 75

HMM Likelihood Task

Marginalize over all latent sequence joint likelihoods

π‘ž π‘₯1, π‘₯2, … , π‘₯𝑂 = ෍

𝑨1,β‹―,𝑨𝑂

π‘ž 𝑨1, π‘₯1, 𝑨2, π‘₯2, … , 𝑨𝑂, π‘₯𝑂

Q: In a K-state HMM for a length N observation sequence, how many summands (different latent sequences) are there? A: KN Goal: Find a way to compute this exponential sum efficiently (in polynomial time)
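The K^N blow-up and the polynomial-time fix can both be demonstrated on the toy 2-state example. A sketch comparing the brute-force sum over all 2^4 = 16 latent sequences with the forward recursion (tables as on the earlier slides; end transitions included in both, and the variable names are mine):

```python
from itertools import product

trans = {
    "start": {"N": 0.7, "V": 0.2, "end": 0.1},
    "N": {"N": 0.15, "V": 0.8, "end": 0.05},
    "V": {"N": 0.6, "V": 0.35, "end": 0.05},
}
emit = {
    "N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
    "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1, "w4": 0.1},
}
STATES = ["N", "V"]
words = ["w1", "w2", "w3", "w4"]

def joint(zs):
    # p(z, w): product of transition * emission terms, plus the end transition.
    p, prev = 1.0, "start"
    for z, w in zip(zs, words):
        p *= trans[prev][z] * emit[z][w]
        prev = z
    return p * trans[prev]["end"]

# Brute force: K^N = 2^4 = 16 summands.
brute = sum(joint(zs) for zs in product(STATES, repeat=len(words)))

# Forward algorithm: O(N * K^2) work instead of O(K^N).
alpha = {s: trans["start"][s] * emit[s][words[0]] for s in STATES}
for w in words[1:]:
    alpha = {s: emit[s][w] * sum(alpha[p] * trans[p][s] for p in STATES)
             for s in STATES}
fwd = sum(alpha[s] * trans[s]["end"] for s in STATES)

print(brute, fwd)  # the two marginals agree
```

The recursion works because each alpha entry marginalizes out everything before the current time step, which is exactly the "pass information forward" idea on the next slides.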

slide-76
SLIDE 76

2 State HMM Likelihood

[two highlighted paths through the lattice: z = (N, V, V, N) and z = (N, V, N, N), with their emission and transition arcs]

slide-77
SLIDE 77

2 State HMM Likelihood

[the two highlighted paths z = (N, V, V, N) and z = (N, V, N, N); they agree on the first two steps]

Up until here, all the computation was the same

slide-78
SLIDE 78

2 State HMM Likelihood

[the two highlighted paths z = (N, V, V, N) and z = (N, V, N, N)]

Up until here, all the computation was the same. Let's reuse what computations we can.

slide-79
SLIDE 79

2 State HMM Likelihood

[Trellis diagram: a 2-state (N/V) trellis over observations w1 w2 w3 w4, highlighting two latent paths:
N→V→N→N, with probability p(N|start) p(w1|N) · p(V|N) p(w2|V) · p(N|V) p(w3|N) · p(N|N) p(w4|N)
N→V→V→N, with probability p(N|start) p(w1|N) · p(V|N) p(w2|V) · p(V|V) p(w3|V) · p(N|V) p(w4|N)]

Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…

slide-80
SLIDE 80

2 State HMM Likelihood

[Trellis diagram: a 2-state (N/V) trellis over observations w1 w2 w3 w4, highlighting two latent paths:
N→V→N→N, with probability p(N|start) p(w1|N) · p(V|N) p(w2|V) · p(N|V) p(w3|N) · p(N|N) p(w4|N)
N→V→V→N, with probability p(N|start) p(w1|N) · p(V|N) p(w2|V) · p(V|V) p(w3|V) · p(N|V) p(w4|N)]

Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…

Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis

slide-81
SLIDE 81

2 State HMM Likelihood

[Trellis diagram: a 2-state (N/V) trellis over observations w1 w2 w3 w4, highlighting two latent paths:
N→V→N→N, with probability p(N|start) p(w1|N) · p(V|N) p(w2|V) · p(N|V) p(w3|N) · p(N|N) p(w4|N)
N→V→V→N, with probability p(N|start) p(w1|N) · p(V|N) p(w2|V) · p(V|V) p(w3|V) · p(N|V) p(w4|N)]

Solution: pass information "forward" in the graph, e.g., from time step 2 to 3…

Solution: marginalize out all information from previous timesteps

Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis

slide-82
SLIDE 82

Reusing Computation

let's first consider "any shared path ending with B (A→B, B→B, or C→B) → B"

[Trellis fragment: states A, B, C at time steps i−2, i−1, and i]

slide-83
SLIDE 83

Reusing Computation

let's first consider "any shared path ending with B (A→B, B→B, or C→B) → B"

Assume that all necessary information has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C)

[Trellis fragment: states A, B, C at time steps i−2, i−1, and i, with stored values α(i−1, A), α(i−1, B), α(i−1, C)]

slide-84
SLIDE 84

Reusing Computation

let's first consider "any shared path ending with B (A→B, B→B, or C→B) → B"

Assume that all necessary information has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C)

Marginalize (sum) across the previous timestep's possible states

[Trellis fragment: states A, B, C at time steps i−2, i−1, and i, with stored values α(i−1, A), α(i−1, B), α(i−1, C) feeding α(i, B)]

slide-85
SLIDE 85

Reusing Computation

let's first consider "any shared path ending with B (A→B, B→B, or C→B) → B"

marginalize across the previous hidden state values:

α(i, B) = ∑_s α(i−1, s) · p(B | s) · p(obs at i | B)

[Trellis fragment: states A, B, C at time steps i−2, i−1, and i, with stored values α(i−1, A), α(i−1, B), α(i−1, C) feeding α(i, B)]

slide-86
SLIDE 86

Reusing Computation

let's first consider "any shared path ending with B (A→B, B→B, or C→B) → B"

marginalize across the previous hidden state values:

α(i, B) = ∑_s α(i−1, s) · p(B | s) · p(obs at i | B)

computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property

[Trellis fragment: states A, B, C at time steps i−2, i−1, and i, with stored values α(i−1, A), α(i−1, B), α(i−1, C) feeding α(i, B)]

slide-87
SLIDE 87

Forward Probability

let's first consider "any shared path ending with B (A→B, B→B, or C→B) → B"

marginalize across the previous hidden state values:

α(i, B) = ∑_{s′} α(i−1, s′) · p(B | s′) · p(obs at i | B)

computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property

α(i, B) is the total probability of all paths to that state B from the beginning

[Trellis fragment: states A, B, C at time steps i−2, i−1, and i]

slide-88
SLIDE 88

Forward Probability

α(i, s) is the total probability of all paths:

  • 1. that start from the beginning
  • 2. that end (currently) in s at step i
  • 3. that emit the observation obs at i

α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)

slide-89
SLIDE 89

Forward Probability

α(i, s) is the total probability of all paths:

  • 1. that start from the beginning
  • 2. that end (currently) in s at step i
  • 3. that emit the observation obs at i

α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)

term by term: α(i−1, s′) — what's the total probability up until now? p(s | s′) — what are the immediate ways to get into state s? each summand — how likely is it to get into state s this way?
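In matrix form, one step of this recurrence is a vector-matrix product followed by an elementwise emission weight. A sketch with hypothetical numbers (T, E, and alpha_prev are illustrative values, not from the slides):

```python
import numpy as np

# One forward step, vectorized over states:
#   alpha_i[s] = (sum_{s'} alpha_prev[s'] * T[s', s]) * E[s, w_i]
T = np.array([[0.7, 0.3], [0.4, 0.6]])   # T[s_prev, s] = p(s | s_prev)
E = np.array([[0.5, 0.5], [0.1, 0.9]])   # E[s, w] = p(w | s)
alpha_prev = np.array([0.3, 0.04])       # alpha(i-1, s) for states 0 and 1
w_i = 1                                  # observation index at step i

alpha_i = (alpha_prev @ T) * E[:, w_i]
print(alpha_i)  # each entry sums over all K ways of entering that state
```

The `@` product performs the marginalization over the previous state in one shot, which is how vectorized HMM implementations usually organize the loop.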

slide-90
SLIDE 90

Forward Algorithm

α: a 2D table, (N+2) × K*
N+2: number of observations (+2 for the BOS & EOS symbols)
K*: number of states
Use dynamic programming to build α left-to-right

slide-91
SLIDE 91

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
}

slide-92
SLIDE 92

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
  }
}

slide-93
SLIDE 93

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = p_emission(obs[i] | state)
  }
}

slide-94
SLIDE 94

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = p_emission(obs[i] | state)
    for(old = 0; old < K*; ++old) {
      pmove = p_transition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}

slide-95
SLIDE 95

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = p_emission(obs[i] | state)
    for(old = 0; old < K*; ++old) {
      pmove = p_transition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}

we still need to learn these (EM if not observed)

slide-96
SLIDE 96

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = p_emission(obs[i] | state)
    for(old = 0; old < K*; ++old) {
      pmove = p_transition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}

Q: What do we return? (How do we return the likelihood of the sequence?)

slide-97
SLIDE 97

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = p_emission(obs[i] | state)
    for(old = 0; old < K*; ++old) {
      pmove = p_transition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}

Q: What do we return? (How do we return the likelihood of the sequence?)

A: α[N+1][END]
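The pseudocode above can be sketched as a runnable function. This version folds the START transition into an initial distribution pi rather than carrying explicit BOS/EOS rows, so it returns the sum of the last column instead of α[N+1][END]; pi, trans, and emit in the usage example are hypothetical parameters:

```python
def forward_likelihood(obs, pi, trans, emit):
    """Forward algorithm: alpha[i][s] is the total probability of all
    paths that reach state s at step i while emitting obs[0..i]."""
    K = len(pi)
    N = len(obs)
    alpha = [[0.0] * K for _ in range(N)]
    for s in range(K):
        alpha[0][s] = pi[s] * emit[s][obs[0]]  # p(s|start) * p(obs_0|s)
    for i in range(1, N):
        for s in range(K):
            pobs = emit[s][obs[i]]
            for old in range(K):
                # marginalize over the previous state, reusing alpha[i-1]
                alpha[i][s] += alpha[i-1][old] * trans[old][s] * pobs
    return sum(alpha[N-1])  # marginal likelihood p(obs)

# Hypothetical 2-state parameters, for illustration only
pi = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
print(forward_likelihood([0, 1], pi, trans, emit))
```

This runs in O(N · K²) time, versus the O(K^N) brute-force sum over all latent sequences.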

slide-98
SLIDE 98

Interactive HMM Example

https://goo.gl/rbHEoc (Jason Eisner, 2002)

Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls

slide-99
SLIDE 99

Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = log p_emission(obs[i] | state)
    for(old = 0; old < K*; ++old) {
      pmove = log p_transition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
    }
  }
}

slide-100
SLIDE 100

Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = log p_emission(obs[i] | state)
    for(old = 0; old < K*; ++old) {
      pmove = log p_transition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
    }
  }
}

logadd(l_p, l_q) = { l_p + log(1 + exp(l_q − l_p)),  if l_p ≥ l_q
                     l_q + log(1 + exp(l_p − l_q)),  if l_q > l_p }

scipy.misc.logsumexp (moved to scipy.special.logsumexp in current SciPy)
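A minimal stable implementation of the logadd above, written with math.log1p; it computes log(exp(l_p) + exp(l_q)) without underflow, matching what numpy.logaddexp does:

```python
import math

def logadd(l_p, l_q):
    """Numerically stable log(exp(l_p) + exp(l_q)).

    -inf represents log(0), the initial value of each alpha cell."""
    if l_p == -math.inf:
        return l_q
    if l_q == -math.inf:
        return l_p
    if l_p < l_q:               # keep the larger term outside the exp
        l_p, l_q = l_q, l_p
    return l_p + math.log1p(math.exp(l_q - l_p))
```

Pulling out the larger argument keeps exp(l_q − l_p) ≤ 1, so the addition never overflows even when both log-probabilities are very negative.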

slide-101
SLIDE 101

HMM Expectation Calculation

Calculate the forward (log) likelihood of an observed (sub-)sequence w1, …, wJ

Calculate the backward (log) likelihood of an observed (sub-)sequence wJ+1, …, wN

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi−1)

emission probabilities/parameters: p(wi | zi); transition probabilities/parameters: p(zi | zi−1)

slide-102
SLIDE 102

HMM Probabilities

Forward Values

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)

slide-103
SLIDE 103

HMM Probabilities

Forward Values

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

α(i, s) = ∑_{s′} α(i−1, s′) · p(s | s′) · p(obs at i | s)

Backward Values

β(i, s) is the total probability of all paths:
1. that start at step i at state s
2. that terminate at the end
3. (that emit the observation obs at i+1)

β(i, s) = ∑_{s′} β(i+1, s′) · p(s′ | s) · p(obs at i+1 | s′)

slide-104
SLIDE 104

Backward Algorithm

β: a 2D table, (N+2) × K*
N+2: number of observations (+2 for the BOS & EOS symbols)
K*: number of states
Use dynamic programming to build β right-to-left

slide-105
SLIDE 105

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    pobs = p_emission(obs[i+1] | next)
    for(state = 0; state < K*; ++state) {
      pmove = p_transition(next | state)
      β[i][state] += β[i+1][next] * pobs * pmove
    }
  }
}

slide-106
SLIDE 106

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    pobs = p_emission(obs[i+1] | next)
    for(state = 0; state < K*; ++state) {
      pmove = p_transition(next | state)
      β[i][state] += β[i+1][next] * pobs * pmove
    }
  }
}

Q: What does β[0][START] represent?

slide-107
SLIDE 107

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    pobs = p_emission(obs[i+1] | next)
    for(state = 0; state < K*; ++state) {
      pmove = p_transition(next | state)
      β[i][state] += β[i+1][next] * pobs * pmove
    }
  }
}

Q: What does β[0][START] represent?

A: Total probability of all paths from stop to start, for the observed sequence
slide-108
SLIDE 108

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    pobs = p_emission(obs[i+1] | next)
    for(state = 0; state < K*; ++state) {
      pmove = p_transition(next | state)
      β[i][state] += β[i+1][next] * pobs * pmove
    }
  }
}

Q: What does β[0][START] represent?

A: The marginal likelihood of the observed sequence

slide-109
SLIDE 109

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for(i = N; i ≥ 0; --i) {
  for(next = 0; next < K*; ++next) {
    pobs = p_emission(obs[i+1] | next)
    for(state = 0; state < K*; ++state) {
      pmove = p_transition(next | state)
      β[i][state] += β[i+1][next] * pobs * pmove
    }
  }
}

Q: What does β[0][START] represent?

A: α[N+1][END]
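A runnable sketch of the backward pass, mirroring the earlier forward sketch (START folded into an initial distribution pi; parameters hypothetical). Summing pi[s] · p(obs_0 | s) · beta[0][s] plays the role of β[0][START] and reproduces the same marginal likelihood the forward pass returns:

```python
def backward_likelihood(obs, pi, trans, emit):
    """Backward algorithm: beta[i][s] is the total probability of
    emitting obs[i+1..N-1] given state s at step i."""
    K = len(pi)
    N = len(obs)
    beta = [[0.0] * K for _ in range(N)]
    for s in range(K):
        beta[N-1][s] = 1.0  # nothing remains to emit after the last step
    for i in range(N - 2, -1, -1):
        for s in range(K):
            for nxt in range(K):
                # marginalize over the next state's value and its emission
                beta[i][s] += beta[i+1][nxt] * trans[s][nxt] * emit[nxt][obs[i+1]]
    # fold in the start transition and first emission: the marginal p(obs)
    return sum(pi[s] * emit[s][obs[0]] * beta[0][s] for s in range(K))

# Hypothetical 2-state parameters, for illustration only
pi = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.5], [0.1, 0.9]]
print(backward_likelihood([0, 1], pi, trans, emit))
```

Agreement between the forward and backward totals is a standard sanity check when implementing EM for HMMs, since the E-step multiplies α and β values together.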

slide-110
SLIDE 110

HMM Expectation Calculation

Calculate the forward (log) likelihood of an observed (sub-)sequence w1, …, wJ

Calculate the backward (log) likelihood of an observed (sub-)sequence wJ+1, …, wN

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi−1)

emission probabilities/parameters: p(wi | zi); transition probabilities/parameters: p(zi | zi−1)

slide-111
SLIDE 111

Agenda

HMM Detailed Definition
HMM Parameter Estimation
EM for HMMs
General Approach
Expectation Calculation