Introduction to Hidden Markov Models CMSC 473/673 UMBC Recap from - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Hidden Markov Models

CMSC 473/673 UMBC

slide-2
SLIDE 2

Recap from last time…

slide-3
SLIDE 3

Expectation Maximization (EM)

Two-step, iterative algorithm:

  • 0. Assume some value for your parameters
  • 1. E-step: count under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood, assuming these estimated (uncertain) counts

slide-4
SLIDE 4

Counting Requires Marginalizing

E-step: count under uncertainty, assuming these parameters

slide-5
SLIDE 5

Counting Requires Marginalizing

E-step: count under uncertainty, assuming these parameters

p(w) = p(z = 1, w) + p(z = 2, w) + p(z = 3, w) + p(z = 4, w): break the marginal into 4 disjoint pieces

slide-6
SLIDE 6

EM Example 1: Three Coins/Class-based Unigrams

Imagine three coins. Flip the 1st coin (penny). If heads: flip the 2nd coin (dollar coin). If tails: flip the 3rd coin (dime).

observed: a, b, e, etc.; “We run the code” vs. “The run failed”
unobserved: vowel or consonant? part of speech?
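The three-coin story is a two-class mixture: flip a penny to pick a hidden class, then emit an observation from that class’s distribution. A minimal EM sketch for this class-based unigram model (the token data and the slightly asymmetric initialization are illustrative assumptions, not from the slides):

```python
import math
from collections import defaultdict

def em_three_coins(tokens, iters=20, lam=0.6):
    """EM for a two-class ('heads'/'tails') unigram mixture:
    flip a penny (prob lam of heads), then emit a token from the
    class-specific distribution (dollar coin vs. dime)."""
    vocab = sorted(set(tokens))
    norm = sum(range(1, len(vocab) + 1))
    # asymmetric initialization so the two classes can separate
    emit = {0: {v: (i + 1) / norm for i, v in enumerate(vocab)},
            1: {v: (len(vocab) - i) / norm for i, v in enumerate(vocab)}}
    lls = []
    for _ in range(iters):
        # E-step: posterior over the hidden class for each token
        counts = {0: defaultdict(float), 1: defaultdict(float)}
        class_mass = [0.0, 0.0]
        ll = 0.0
        for w in tokens:
            p0 = lam * emit[0][w]
            p1 = (1 - lam) * emit[1][w]
            ll += math.log(p0 + p1)
            post0 = p0 / (p0 + p1)
            counts[0][w] += post0
            counts[1][w] += 1 - post0
            class_mass[0] += post0
            class_mass[1] += 1 - post0
        # M-step: re-estimate penny bias and emission distributions
        lam = class_mass[0] / len(tokens)
        for k in (0, 1):
            for v in vocab:
                emit[k][v] = counts[k][v] / class_mass[k]
        lls.append(ll)
    return lam, emit, lls
```

Each iteration’s log-likelihood is non-decreasing, which is the standard EM guarantee.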

slide-7
SLIDE 7

EM Example 2: Machine Translation Alignment

Want: P(f|e) But don’t know how to train this directly… Solution: Use P(a, f|e), where a is an alignment Remember:

Le chat est sur la chaise verte The cat is on the green chair

marginalizing across all possible alignments

slide-8
SLIDE 8

IBM Model 1 (1993)

f: vector of French words; e: vector of English words; a: vector of alignment indices; t(fj | ei): translation probability of the word fj given the word ei

Le chat est sur la chaise verte ↔ The cat is on the green chair (alignment indices: 0 1 2 3 4 6 5)

slide-9
SLIDE 9

Learning the Alignments through EM

Two-step, iterative algorithm:

  • 0. Assume some value for the translation parameters t(f|e), and compute other parameter values
  • 1. E-step: count alignments and translations under uncertainty, assuming these parameters
  • 2. M-step: maximize log-likelihood (update parameters), using these estimated (uncertain) counts

P(le | “the cat”), P(chat | “the cat”)

slide-10
SLIDE 10

Follow up: IBM Model 1 Parameters

For IBM model 1, we can compute all parameters given translation parameters: How many of these are there? |French vocabulary| x |English vocabulary| From Rebecca: See Sec. 31 of the Knight tutorial for more about space considerations
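The E-step/M-step loop for these |F| × |E| translation parameters is short enough to sketch directly. A minimal Model 1 EM (it ignores the NULL word and the constant length normalization; the toy sentence pairs are illustrative assumptions):

```python
from collections import defaultdict

def ibm_model1(pairs, iters=10):
    """EM for IBM Model 1 translation probabilities t(f|e).
    `pairs` is a list of (french_words, english_words) tuples."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # uniform initialization of t(f|e)
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iters):
        count = defaultdict(float)   # expected count of (e, f) links
        total = defaultdict(float)   # expected count of e being linked
        # E-step: fractional alignment counts under current t
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(e, f)] for e in es)
                for e in es:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[e] += delta
        # M-step: renormalize expected counts into probabilities
        for (e, f), c in count.items():
            t[(e, f)] = c / total[e]
    return t

pairs = [(["le", "chat"], ["the", "cat"]),
         (["le", "chien"], ["the", "dog"])]
t = ibm_model1(pairs, iters=10)
```

Because “le” co-occurs with “the” in both pairs, EM pushes t(le|the) up, which in turn concentrates t(·|cat) on “chat”.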

slide-11
SLIDE 11

Alignment: Output and Complexities

Component of machine translation systems Produce a translation lexicon automatically Cross-lingual projection/extraction of information Supervision for training other models (for example, neural MT systems)

http://www.cis.upenn.edu/~ccb/figures/research-statement/pivoting.jpg

slide-12
SLIDE 12

Hidden Markov Models

slide-13
SLIDE 13

Agenda

HMM Motivation (Part of Speech) and Brief Definition What is Part of Speech? HMM Detailed Definition HMM Tasks

slide-14
SLIDE 14

Hidden Markov Models

Class-based model: use different distributions to explain groupings of observations.

Sequence model: a bigram model of the classes, not the observations; implicitly model all possible class sequences. There are algorithms for finding the best sequence, computing the marginal likelihood, and doing semi-/un-supervised learning.

slide-15
SLIDE 15

Hidden Markov Models: Part of Speech

p(British Left Waffles on Falkland Islands)

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model; bigram model of the classes; model all class sequences


slide-19
SLIDE 19

Hidden Markov Models: Part of Speech

p(British Left Waffles on Falkland Islands)

1. Explain this sentence as a sequence of (likely?) latent (unseen) tags (labels)
2. Produce a tag sequence for this sentence

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model; bigram model of the classes; model all class sequences

slide-20
SLIDE 20

Agenda

HMM Motivation (Part of Speech) and Brief Definition What is Part of Speech? HMM Detailed Definition HMM Tasks

slide-21
SLIDE 21

Brief Aside: Parts of Speech

Classes of words that behave like one another in similar syntactic contexts

slide-22
SLIDE 22

Parts of Speech

Classes of words that behave like one another in similar syntactic contexts Pronunciation (stress) can differ: object (noun: OB-ject) vs. object (verb: ob-JECT) It can help improve the inputs to other systems (text-to-speech, syntactic parsing)

slide-23
SLIDE 23

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run
Adjectives: would-be, wettest, large, happy, red, fake

slide-24
SLIDE 24

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run
Adjectives: would-be, wettest, large, happy, red, fake
Determiners: a, the, every, what
Conjunctions: and, or, if, because
Prepositions: in, under, top

slide-25
SLIDE 25

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run; modals/auxiliaries: can, do, may
Adjectives: would-be, wettest, large, happy, red, fake
Determiners: a, the, every, what
Conjunctions: and, or, if, because
Prepositions: in, under, top

“I can eat.”


slide-27
SLIDE 27

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run; modals/auxiliaries: can, do, may
Adjectives: would-be, wettest, large, happy, red, fake
Adverbs: recently, happily
Determiners: a, the, every, what
Conjunctions: and, or, if, because
Prepositions: in, under, top

slide-28
SLIDE 28

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run; modals/auxiliaries: can, do, may
Adjectives: would-be, wettest, large, happy, red, fake
Adverbs: recently, happily, then, there (location)
Determiners: a, the, every, what
Conjunctions: and, or, if, because
Prepositions: in, under, top

“Today, we eat there.”

slide-29
SLIDE 29

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run; modals/auxiliaries: can, do, may
Adjectives: would-be, wettest, large, happy, red, fake
Adverbs: recently, happily, then, there (location)
Pronouns: I, you, there
Determiners: a, the, every, what
Conjunctions: and, or, if, because
Prepositions: in, under, top

“I ate.” “There is a cat.”

slide-30
SLIDE 30

Parts of Speech

Adapted from Luke Zettlemoyer

Nouns: milk, cat, cats, UMBC, Baltimore, bread
Verbs: speak, give, run; modals/auxiliaries: can, do, may
Adjectives: would-be, wettest, large, happy, red, fake
Adverbs: recently, happily, then, there (location)
Pronouns: I, you, there
Determiners: a, the, every, what
Conjunctions: and, or, if, because
Prepositions: in, under, top
Numbers: one; 1,324

slide-31
SLIDE 31

Parts of Speech

Adapted from Luke Zettlemoyer

Closed class words vs. open class words:

Open: Nouns (milk, cat, cats, UMBC, Baltimore, bread); Verbs (speak, give, run; modals/auxiliaries: can, do, may); Adjectives (would-be, wettest, large, happy, red, fake); Adverbs (recently, happily, then, there (location))
Closed: Pronouns (I, you, there); Determiners (a, the, every, what); Prepositions (in, under, top); Conjunctions (and, or, if, because); Numbers (one; 1,324)

slide-32
SLIDE 32

Parts of Speech

Adapted from Luke Zettlemoyer

Closed class words vs. open class words:

Open: Nouns (milk, cat, cats, UMBC, Baltimore, bread); Verbs — intransitive (run), transitive, ditransitive (speak, give), plus modals/auxiliaries (can, do, may); Adjectives — subsective vs. non-subsective (would-be, wettest, large, happy, red, fake), Kamp & Partee (1995); Adverbs (recently, happily, then, there (location))
Closed: Pronouns (I, you, there); Determiners (a, the, every, what); Prepositions (in, under, top); Conjunctions (and, or, if, because); Numbers (one; 1,324)


slide-34
SLIDE 34

Parts of Speech

Adapted from Luke Zettlemoyer

Closed class words vs. open class words:

Open: Nouns (milk, cat, cats, UMBC, Baltimore, bread); Verbs — intransitive (run), transitive, ditransitive (speak, give), plus modals/auxiliaries (can, do, may); Adjectives — subsective vs. non-subsective (would-be, wettest, large, happy, red, fake), Kamp & Partee (1995); Adverbs (recently, happily, then, there (location))
Closed: Pronouns (I, you, there); Numbers (one; 1,324); Determiners (a, the, every, what); Prepositions (in, under, top); Conjunctions (and, or, if, because); Particles ((set) up, so (far), not, (call) off)

Language evolves! “I’m reading this because I want to procrastinate.” → “I’m reading this because procrastination.” (“because” as a preposition)

https://www.theatlantic.com/technology/archive/2013/11/english-has-a-new-preposition-because-internet/281601/

slide-35
SLIDE 35

Agenda

HMM Motivation (Part of Speech) and Brief Definition What is Part of Speech? HMM Detailed Definition HMM Tasks


slide-38
SLIDE 38

Hidden Markov Models: Part of Speech

p(British Left Waffles on Falkland Islands)

(i): Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model: emission p(wj | zj). Bigram model of the classes: transition p(zj | zj−1). Model all class sequences: sum over z1, …, zN of the joint p(z1, w1, z2, w2, …, zN, wN).


slide-40
SLIDE 40

Hidden Markov Model

Goal: maximize the (log-)likelihood. In practice we don’t actually observe these z values; we just see the words w.

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj−1)

If we knew the probability parameters, then we could estimate z and evaluate the likelihood… but we don’t! :( If we did observe z, estimating the probability parameters would be easy… but we don’t! :(


slide-46
SLIDE 46

Hidden Markov Model Terminology

Each zi can take the value of one of K latent states. The transition and emission distributions do not change over time.

Q: How many different probability values are there with K states and V vocab items?
A: V·K emission values and K² transition values

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj−1)

transition probabilities/parameters: p(zj | zj−1); emission probabilities/parameters: p(wj | zj)

slide-47
SLIDE 47

Hidden Markov Model Representation

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj−1)

Represent the probabilities and independence assumptions in a graph: latent nodes z1 → z2 → z3 → z4, each emitting an observed word (z1 → w1, z2 → w2, z3 → w3, z4 → w4).

slide-48
SLIDE 48

Hidden Markov Model Representation

p(z1, w1, z2, w2, …, zN, wN) = ∏j p(wj | zj) p(zj | zj−1)

This is a graphical model (see CMSC 478/678… and also CMSC 691: Graphical & Statistical Models of Learning).


slide-52
SLIDE 52

Hidden Markov Model Representation

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj−1)

Graph: z1 → z2 → z3 → z4, with emissions p(w1 | z1), p(w2 | z2), p(w3 | z3), p(w4 | z4), transitions p(z2 | z1), p(z3 | z2), p(z4 | z3), and p(z1 | z0) as the initial starting distribution (“BOS”).

Each zi can take the value of one of K latent states. The transition and emission distributions do not change over time.
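The factored joint above can be computed directly from transition and emission tables. A minimal sketch (the state names, word symbols, and probability values here are illustrative assumptions):

```python
def hmm_joint(states, words, trans, emit, bos="BOS"):
    """Joint probability p(z1, w1, ..., zN, wN) under an HMM:
    product over j of p(z_j | z_{j-1}) * p(w_j | z_j)."""
    p = 1.0
    prev = bos  # z0 is the beginning-of-sequence state
    for z, w in zip(states, words):
        p *= trans[(prev, z)] * emit[(z, w)]
        prev = z
    return p

# illustrative (made-up) parameters for a 2-state HMM
trans = {("BOS", "N"): 0.7, ("BOS", "V"): 0.2,
         ("N", "N"): 0.15, ("N", "V"): 0.8,
         ("V", "N"): 0.6, ("V", "V"): 0.35}
emit = {("N", "w1"): 0.7, ("N", "w2"): 0.2,
        ("V", "w1"): 0.2, ("V", "w2"): 0.6}

p = hmm_joint(["N", "V"], ["w1", "w2"], trans, emit)
# p = p(N|BOS) p(w1|N) p(V|N) p(w2|V) = 0.7 * 0.7 * 0.8 * 0.6
```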

slide-53
SLIDE 53

Example: 2-state Hidden Markov Model as a Lattice

A lattice over observations w1 w2 w3 w4: each zi (i = 1…4) can be N (top row) or V (bottom row).


slide-56
SLIDE 56

Example: 2-state Hidden Markov Model as a Lattice

Two rows of states over the observations w1 w2 w3 w4: z1 = N, …, z4 = N and z1 = V, …, z4 = V.

Emissions: p(w1 | N), …, p(w4 | N) and p(w1 | V), …, p(w4 | V).
Transitions: p(N | start) and p(V | start) into the first column; p(N | N) and p(V | V) along each row; p(V | N) and p(N | V) crossing between rows at each time step.


slide-59
SLIDE 59

Comparison of Joint Probabilities

Unigram language model: p(w1, w2, …, wN) = p(w1) p(w2) ⋯ p(wN) = ∏j p(wj)

Unigram class-based language model (“K” coins): p(z1, w1, …, zN, wN) = p(z1) p(w1 | z1) ⋯ p(zN) p(wN | zN) = ∏j p(wj | zj) p(zj)

Hidden Markov Model: p(z1, w1, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj−1)


slide-61
SLIDE 61

Estimating Parameters from Observed Data

Two fully observed tagged sequences over w1 w2 w3 w4:
  (N, w1), (V, w2), (N, w3), (N, w4)
  (N, w1), (V, w2), (V, w3), (N, w4)

Transition counts: start→N: 2; N→N: 1, N→V: 2, N→end: 2; V→N: 2, V→V: 1
Emission counts: N: w1: 2, w3: 1, w4: 2; V: w2: 2, w3: 1

end emission not shown


slide-63
SLIDE 63

Estimating Parameters from Observed Data

Two fully observed tagged sequences over w1 w2 w3 w4:
  (N, w1), (V, w2), (N, w3), (N, w4)
  (N, w1), (V, w2), (V, w3), (N, w4)

Transition MLE: start→N: 1; N→N: .2, N→V: .4, N→end: .4; V→N: 2/3, V→V: 1/3
Emission MLE: N: w1: .4, w3: .2, w4: .4; V: w2: 2/3, w3: 1/3

end emission not shown; smooth these values if needed
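These counts and MLEs can be reproduced mechanically by relative-frequency estimation. A sketch (w1…w4 are placeholder word symbols, as on the slides):

```python
from collections import Counter

def hmm_mle(tagged_seqs):
    """Relative-frequency (MLE) estimates of HMM transition and
    emission probabilities from fully observed (tag, word) sequences."""
    trans, emit = Counter(), Counter()
    for seq in tagged_seqs:
        prev = "start"
        for tag, word in seq:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            prev = tag
        trans[(prev, "end")] += 1  # ending transition
    # normalize each conditioning context to get probabilities
    trans_tot, emit_tot = Counter(), Counter()
    for (prev, _), c in trans.items():
        trans_tot[prev] += c
    for (tag, _), c in emit.items():
        emit_tot[tag] += c
    p_trans = {k: c / trans_tot[k[0]] for k, c in trans.items()}
    p_emit = {k: c / emit_tot[k[0]] for k, c in emit.items()}
    return p_trans, p_emit

seqs = [[("N", "w1"), ("V", "w2"), ("N", "w3"), ("N", "w4")],
        [("N", "w1"), ("V", "w2"), ("V", "w3"), ("N", "w4")]]
p_trans, p_emit = hmm_mle(seqs)
```

Running this on the two observed sequences reproduces the tables above, e.g. p(V | N) = 2/5 = .4 and p(w1 | N) = 2/5 = .4.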

slide-64
SLIDE 64

Agenda

HMM Motivation (Part of Speech) and Brief Definition What is Part of Speech? HMM Detailed Definition HMM Tasks

slide-65
SLIDE 65

Hidden Markov Model Tasks

Calculate the (log-)likelihood of an observed sequence w1, …, wN.
Calculate the most likely sequence of states (for an observed sequence).
Learn the emission and transition parameters.

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏j p(wj | zj) p(zj | zj−1)



slide-70
SLIDE 70

HMM Likelihood Task

Marginalize the joint over all latent sequences:

p(w1, w2, …, wN) = Σ over z1, …, zN of p(z1, w1, z2, w2, …, zN, wN)

Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?
A: K^N

Goal: find a way to compute this exponential sum efficiently (in polynomial time).

Like in language modeling, you need to model when to stop generating. This ending state is generally not included in “K.”
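For the small 2-state example worked through below, the exponential sum can still be brute-forced. A sketch that enumerates all K^N = 2^4 = 16 latent sequences, using the transition and emission tables from that worked example:

```python
from itertools import product

# transition and emission tables from the worked 2 (3)-state example
trans = {("start", "N"): .7, ("start", "V"): .2, ("start", "end"): .1,
         ("N", "N"): .15, ("N", "V"): .8, ("N", "end"): .05,
         ("V", "N"): .6, ("V", "V"): .35, ("V", "end"): .05}
emit = {("N", "w1"): .7, ("N", "w2"): .2, ("N", "w3"): .05, ("N", "w4"): .05,
        ("V", "w1"): .2, ("V", "w2"): .6, ("V", "w3"): .1, ("V", "w4"): .1}

words = ["w1", "w2", "w3", "w4"]
likelihood = 0.0
terms = {}
for zs in product("NV", repeat=len(words)):   # all K^N latent sequences
    p, prev = 1.0, "start"
    for z, w in zip(zs, words):
        p *= trans[(prev, z)] * emit[(z, w)]
        prev = z
    p *= trans[(prev, "end")]                 # ending transition
    terms[zs] = p
    likelihood += p
```

This is exactly the exponential sum; the forward algorithm later computes the same value in O(K²·N) time.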


slide-72
SLIDE 72

2 (3)-State HMM Likelihood

2-state lattice over w1 w2 w3 w4 (each zi is N or V).

Q: What are the latent sequences here (EOS excluded)?

A:
(N, w1), (N, w2), (N, w3), (N, w4)
(N, w1), (N, w2), (N, w3), (V, w4)
(N, w1), (N, w2), (V, w3), (N, w4)
(N, w1), (N, w2), (V, w3), (V, w4)
(N, w1), (V, w2), (N, w3), (N, w4)
(N, w1), (V, w2), (N, w3), (V, w4)
(N, w1), (V, w2), (V, w3), (N, w4)
(N, w1), (V, w2), (V, w3), (V, w4)
(V, w1), (N, w2), (N, w3), (N, w4)
(V, w1), (N, w2), (N, w3), (V, w4)
… (six more)


slide-74
SLIDE 74

2 (3)-State HMM Likelihood

2-state lattice over w1 w2 w3 w4, with parameters:

Transition: start→N .7, start→V .2, start→end .1; N→N .15, N→V .8, N→end .05; V→N .6, V→V .35, V→end .05
Emission: N: w1 .7, w2 .2, w3 .05, w4 .05; V: w1 .2, w2 .6, w3 .1, w4 .1


slide-76
SLIDE 76

2 (3)-State HMM Likelihood

Q: What’s the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 0.000247

Transition: start→N .7, start→V .2, start→end .1; N→N .15, N→V .8, N→end .05; V→N .6, V→V .35, V→end .05
Emission: N: w1 .7, w2 .2, w3 .05, w4 .05; V: w1 .2, w2 .6, w3 .1, w4 .1

slide-77
SLIDE 77

2 (3)-State HMM Likelihood

Q: What’s the probability of (N, w1), (V, w2), (V, w3), (N, w4) with ending included (unique ending symbol “#”)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) * (.05 * 1) ≈ 0.0000123

Transition: start→N .7, start→V .2, start→end .1; N→N .15, N→V .8, N→end .05; V→N .6, V→V .35, V→end .05
Emission: N: w1 .7, w2 .2, w3 .05, w4 .05; V: w1 .2, w2 .6, w3 .1, w4 .1; the end state emits # with probability 1


slide-79
SLIDE 79

2 (3)-State HMM Likelihood

Q: What’s the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) ≈ 0.0000529

Transition: start→N .7, start→V .2, start→end .1; N→N .15, N→V .8, N→end .05; V→N .6, V→V .35, V→end .05
Emission: N: w1 .7, w2 .2, w3 .05, w4 .05; V: w1 .2, w2 .6, w3 .1, w4 .1

slide-80
SLIDE 80

2 (3)-State HMM Likelihood

Q: What’s the probability of (N, w1), (V, w2), (N, w3), (N, w4) with ending (unique symbol “#”)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) * (.05 * 1) ≈ 0.00000265

Transition: start→N .7, start→V .2, start→end .1; N→N .15, N→V .8, N→end .05; V→N .6, V→V .35, V→end .05
Emission: N: w1 .7, w2 .2, w3 .05, w4 .05; V: w1 .2, w2 .6, w3 .1, w4 .1; the end state emits # with probability 1



slide-83
SLIDE 83

2 (3)-State HMM Likelihood

Consider the two highlighted paths, (N, V, N, N) and (N, V, V, N), through the same lattice: up until time step 2, all the computation was the same. Let’s reuse what computations we can.


slide-86
SLIDE 86

2 (3)-State HMM Likelihood

Issue: these highlighted paths are only 2 of the 16 possible paths through the trellis.
Solution: pass information “forward” in the graph, e.g., from time step 2 to time step 3, marginalizing out all information from previous timesteps.

slide-87
SLIDE 87

Reusing Computation

Three states A, B, C at timesteps i−2, i−1, i. Let’s first consider “any shared path ending with B at step i−1 (AB, BB, or CB) → B.”


slide-90
SLIDE 90

Reusing Computation

Let’s first consider “any shared path ending with B (AB, BB, or CB) → B.” Assume that all necessary information from the previous timestep has been computed and stored in α(i−1, A), α(i−1, B), α(i−1, C). Marginalize (sum) across the previous timestep’s possible states:

α(i, B) = Σ over previous states s′ of α(i−1, s′) · p(B | s′) · p(obs at i | B)


slide-92
SLIDE 92

Forward Probability

Marginalize across the previous hidden state values:

α(i, B) = Σ over previous states s′ of α(i−1, s′) · p(B | s′) · p(obs at i | B)

Computing α at time i−1 will correctly incorporate paths through time i−2: we correctly obey the Markov property. α(i, B) is the total probability of all paths from the beginning to state B at step i.

slide-93
SLIDE 93

Forward Probability

α(i, s) is the total probability of all paths:

  • 1. that start from the beginning
  • 2. that end (currently) in s at step i
  • 3. that emit the observation obs at i
slide-94
SLIDE 94

Forward Probability

α(i, s) is the total probability of all paths:

  • 1. that start from the beginning
  • 2. that end (currently) in s at step i
  • 3. that emit the observation obs at i

how likely is it to get into state s this way? what are the immediate ways to get into state s? what’s the total probability up until now?

slide-95
SLIDE 95

2 (3) -State HMM Likelihood with Forward Probabilities

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

α[1, N] = (.7*.7)
α[2, V] = α[1, N] * (.8*.6) + α[1, V] * (.35*.6)
α[3, V] = α[2, V] * (.35*.1) + α[2, N] * (.8*.1)
α[3, N] = α[2, V] * (.6*.05) + α[2, N] * (.15*.05)

slide-96
SLIDE 96

2 (3) -State HMM Likelihood with Forward Probabilities

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

α[1, N] = (.7*.7)
α[1, V] = (.2*.2)
α[2, V] = α[1, N] * (.8*.6) + α[1, V] * (.35*.6)
α[3, V] = α[2, V] * (.35*.1) + α[2, N] * (.8*.1)
α[3, N] = α[2, V] * (.6*.05) + α[2, N] * (.15*.05)

slide-97
SLIDE 97

2 (3) -State HMM Likelihood with Forward Probabilities

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

α[1, N] = (.7*.7) = .49
α[1, V] = (.2*.2) = .04
α[2, N] = α[1, N] * (.15*.2) + α[1, V] * (.6*.2) = .0195
α[2, V] = α[1, N] * (.8*.6) + α[1, V] * (.35*.6) = .2436
α[3, V] = α[2, V] * (.35*.1) + α[2, N] * (.8*.1)
α[3, N] = α[2, V] * (.6*.05) + α[2, N] * (.15*.05)

slide-99
SLIDE 99

2 (3) -State HMM Likelihood with Forward Probabilities

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

α[1, N] = (.7*.7) = .49
α[1, V] = (.2*.2) = .04
α[2, N] = α[1, N] * (.15*.2) + α[1, V] * (.6*.2) = .0195
α[2, V] = α[1, N] * (.8*.6) + α[1, V] * (.35*.6) = .2436
α[3, V] = α[2, V] * (.35*.1) + α[2, N] * (.8*.1)
α[3, N] = α[2, V] * (.6*.05) + α[2, N] * (.15*.05)

Use dynamic programming to build the α left-to-right
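The left-to-right build of the α table can be sketched in a few lines of Python, using the transition and emission tables from the slide (the dictionary representation and state names are illustrative choices):

```python
# Transition p(next | prev) and emission p(w | state) tables from the slide.
trans = {"start": {"N": .7,  "V": .2,  "end": .1},
         "N":     {"N": .15, "V": .8,  "end": .05},
         "V":     {"N": .6,  "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}

obs = ["w1", "w2", "w3", "w4"]
alpha = [{"start": 1.0}]  # alpha[0]: all probability mass on the start state
for i, w in enumerate(obs, start=1):
    # alpha[i][s] sums over every previous state, so each entry already
    # marginalizes over all paths into s at step i
    alpha.append({s: sum(alpha[i - 1][old] * trans[old][s] * emit[s][w]
                         for old in alpha[i - 1])
                  for s in ("N", "V")})
```

Running this reproduces the values on the slide, e.g. alpha[1]["N"] = .49 and alpha[2]["V"] = .2436.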

slide-100
SLIDE 100

Forward Algorithm

α: a 2D table, (N+2) × K*. N+2: number of observations (+2 for the BOS & EOS symbols); K*: number of states. Use dynamic programming to build the α left-to-right.

slide-101
SLIDE 101

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
}

slide-102
SLIDE 102

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
    }
}

slide-103
SLIDE 103

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
    }
}

slide-104
SLIDE 104

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            α[i][state] += α[i-1][old] * pobs * pmove
        }
    }
}

slide-105
SLIDE 105

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            α[i][state] += α[i-1][old] * pobs * pmove
        }
    }
}

we still need to learn these (EM if not observed)

slide-106
SLIDE 106

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            α[i][state] += α[i-1][old] * pobs * pmove
        }
    }
}

Q: What do we return? (How do we return the likelihood of the sequence?)

slide-107
SLIDE 107

Forward Algorithm

α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            α[i][state] += α[i-1][old] * pobs * pmove
        }
    }
}

Q: What do we return? (How do we return the likelihood of the sequence?)

A: α[N+1][end]
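As a sanity check, α[N+1][END] must equal the likelihood obtained by brute-force enumeration over all K^N latent sequences. A minimal sketch in Python on the toy model from the earlier slides (function names are illustrative):

```python
from itertools import product

trans = {"start": {"N": .7,  "V": .2,  "end": .1},
         "N":     {"N": .15, "V": .8,  "end": .05},
         "V":     {"N": .6,  "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}
states = ("N", "V")

def forward_likelihood(obs):
    # O(N * K^2) dynamic program
    alpha = {"start": 1.0}
    for w in obs:
        alpha = {s: sum(alpha[old] * trans[old][s] * emit[s][w]
                        for old in alpha)
                 for s in states}
    return sum(alpha[s] * trans[s]["end"] for s in states)

def brute_force_likelihood(obs):
    # O(K^N): enumerate every latent sequence and sum the joint probabilities
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p, prev = 1.0, "start"
        for s, w in zip(path, obs):
            p *= trans[prev][s] * emit[s][w]
            prev = s
        total += p * trans[prev]["end"]
    return total
```

The two agree to floating-point precision; only their running times differ.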

slide-108
SLIDE 108

Interactive HMM Example

https://goo.gl/rbHEoc (Jason Eisner, 2002)

Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls

slide-109
SLIDE 109

Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = log pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = log ptransition(state | old)
            α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
        }
    }
}

slide-110
SLIDE 110

Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = log pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = log ptransition(state | old)
            α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
        }
    }
}

logadd(x, y) = log(exp(x) + exp(y))

slide-111
SLIDE 111

Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = log pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = log ptransition(state | old)
            α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
        }
    }
}

logadd(x, y) = log(exp(x) + exp(y))

this can still overflow! (why?)

slide-112
SLIDE 112

Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = -∞
α[0][START] = 0.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = log pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = log ptransition(state | old)
            α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
        }
    }
}

scipy.misc.logsumexp
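The overflow is avoided by shifting by the max before exponentiating, which is what a log-sum-exp routine does internally. A minimal two-argument sketch in Python (the function name follows the slides; the max-shift implementation is a standard choice, not from the deck):

```python
import math

def logadd(x, y):
    # log(exp(x) + exp(y)), computed without overflow:
    # factor out the larger term so exp() only ever sees values <= 0
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))
```

The -inf guards matter because the α table is initialized to log 0 = -∞.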

slide-113
SLIDE 113

Hidden Markov Model Tasks

Calculate the (log) likelihood of an observed sequence w1, …, wN. Calculate the most likely sequence of states (for an observed sequence). Learn the emission and transition parameters.

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi−1)

emission probabilities/parameters; transition probabilities/parameters

slide-114
SLIDE 114

HMM Most-Likely Sequence Task

Maximize over all latent sequence joint likelihoods

max_{z1, ⋯, zN} p(z1, w1, z2, w2, ⋯, zN, wN)

Q: In a K-state HMM for a length N observation sequence, how many comparisons (different latent sequences) do we make?

slide-115
SLIDE 115

HMM Most-Likely Sequence Task

Maximize over all latent sequence joint likelihoods

max_{z1, ⋯, zN} p(z1, w1, z2, w2, ⋯, zN, wN)

Q: In a K-state HMM for a length N observation sequence, how many comparisons (different latent sequences) do we make? A: K^N

slide-116
SLIDE 116

HMM Most-Likely Sequence Task

Maximize over all latent sequence joint likelihoods

max_{z1, ⋯, zN} p(z1, w1, z2, w2, ⋯, zN, wN)

Q: In a K-state HMM for a length N observation sequence, how many comparisons (different latent sequences) do we make? A: K^N Goal: Find a way to compute this exponential comparison efficiently (in polynomial time)

slide-117
SLIDE 117

HMM Most-Likely Sequence Task

Maximize over all latent sequence joint likelihoods

max_{z1, ⋯, zN} p(z1, w1, z2, w2, ⋯, zN, wN)

Q: In a K-state HMM for a length N observation sequence, how many comparisons (different latent sequences) do we make? A: K^N Goal: Find a way to compute this exponential comparison efficiently (in polynomial time)

Viterbi Decoding

slide-118
SLIDE 118

What’s the Maximum Value?

consider “any shared path ending with B (AB, BB, or CB) → B”: maximize across the previous hidden state values

v(i, B) = max_{z′} v(i − 1, z′) ∗ p(B | z′) ∗ p(obs at i | B)

v(i, B) is the maximum probability of any path to that state B from the beginning (and emitting the observation)

(trellis diagram: states A, B, C at steps i − 1 and i)

slide-119
SLIDE 119

What’s the Maximum Value?

consider “any shared path ending with B (AB, BB, or CB) → B”: maximize across the previous hidden state values

v(i, B) = max_{z′} v(i − 1, z′) ∗ p(B | z′) ∗ p(obs at i | B)

v(i, B) is the maximum probability of any path to that state B from the beginning (and emitting the observation)

(trellis diagram: states A, B, C at steps i − 2, i − 1, i)
slide-120
SLIDE 120

What’s the Maximum Value?

consider “any shared path ending with B (AB, BB, or CB) → B”: maximize across the previous hidden state values

v(i, B) = max_{z′} v(i − 1, z′) ∗ p(B | z′) ∗ p(obs at i | B)

computing v at time i − 1 will correctly incorporate (maximize over) paths through time i − 2: we correctly obey the Markov property

v(i, B) is the maximum probability of any path to that state B from the beginning (and emitting the observation)

(trellis diagram: states A, B, C at steps i − 2, i − 1, i)
slide-121
SLIDE 121

2 (3)-State Viterbi

(two trellis diagrams over w1…w4 with N and V states; arc labels not individually recoverable)

Up until here, all the computation was the same Let’s reuse what computations we can

slide-122
SLIDE 122

2 (3) -State Viterbi

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

v[1, N] = (.7*.7) = .49
v[1, V] = (.2*.2) = .04
v[2, N] = max{v[1, N] * (.15*.2), v[1, V] * (.6*.2)} = .0147
v[2, V] = max{v[1, N] * (.8*.6), v[1, V] * (.35*.6)} = .2352
v[3, V] = max{v[2, V] * (.35*.1), v[2, N] * (.8*.1)}
v[3, N] = max{v[2, V] * (.6*.05), v[2, N] * (.15*.05)}

slide-123
SLIDE 123

2 (3) -State Viterbi

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

v[1, N] = (.7*.7) = .49
v[1, V] = (.2*.2) = .04
v[2, N] = max{v[1, N] * (.15*.2), v[1, V] * (.6*.2)} = .0147
v[2, V] = max{v[1, N] * (.8*.6), v[1, V] * (.35*.6)} = .2352
v[3, V] = max{v[2, V] * (.35*.1), v[2, N] * (.8*.1)}
v[3, N] = max{v[2, V] * (.6*.05), v[2, N] * (.15*.05)}

Use dynamic programming to build the v left-to-right

slide-124
SLIDE 124

2 (3) -State Viterbi

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

v[1, N] = (.7*.7) = .49
v[1, V] = (.2*.2) = .04
v[2, N] = max{v[1, N] * (.15*.2), v[1, V] * (.6*.2)} = .0147
v[2, V] = max{v[1, N] * (.8*.6), v[1, V] * (.35*.6)} = .2352
v[3, V] = max{v[2, V] * (.35*.1), v[2, N] * (.8*.1)}
v[3, N] = max{v[2, V] * (.6*.05), v[2, N] * (.15*.05)}

keep backpointers: record the state that produced the maximum
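The max-and-backpointer recurrence for this example can be sketched in Python; the pair representation (probability, backpointer) in each cell is an illustrative choice, not from the slides:

```python
trans = {"start": {"N": .7,  "V": .2,  "end": .1},
         "N":     {"N": .15, "V": .8,  "end": .05},
         "V":     {"N": .6,  "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}
obs = ["w1", "w2", "w3", "w4"]

# v[i][s] = (best path probability into s at step i, best previous state)
v = [{"start": (1.0, None)}]
for i, w in enumerate(obs, start=1):
    row = {}
    for s in ("N", "V"):
        # pick the predecessor maximizing path probability * transition
        best = max(v[i - 1], key=lambda old: v[i - 1][old][0] * trans[old][s])
        row[s] = (v[i - 1][best][0] * trans[best][s] * emit[s][w], best)
    v.append(row)

# follow backpointers from the best final state (including the end transition)
tags = [max(v[-1], key=lambda s: v[-1][s][0] * trans[s]["end"])]
for i in range(len(obs), 1, -1):
    tags.insert(0, v[i][tags[0]][1])
```

Each cell stores the max (not the sum) over predecessors, plus which predecessor achieved it; the traceback then reads the best tag sequence right-to-left.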

slide-125
SLIDE 125

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0

slide-126
SLIDE 126

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
    }
}

slide-127
SLIDE 127

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
        }
    }
}

slide-128
SLIDE 128

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            if (v[i-1][old] * pobs * pmove > v[i][state]) {
                v[i][state] = v[i-1][old] * pobs * pmove
                b[i][state] = old
            }
        }
    }
}

slide-129
SLIDE 129

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            if (v[i-1][old] * pobs * pmove > v[i][state]) {
                v[i][state] = v[i-1][old] * pobs * pmove
                b[i][state] = old
            }
        }
    }
}

Q: How do we return the most likely tag sequence?

slide-130
SLIDE 130

Viterbi Algorithm

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            if (v[i-1][old] * pobs * pmove > v[i][state]) {
                v[i][state] = v[i-1][old] * pobs * pmove
                b[i][state] = old
            }
        }
    }
}

Q: How do we return the most likely tag sequence?

A: (t_i)_i, where t_{i−1} = b[i][t_i]

slide-131
SLIDE 131

Viterbi Algorithm in Log-Space

v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = -∞
v[0][START] = 0.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = log pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = log ptransition(state | old)
            if (v[i-1][old] + pobs + pmove > v[i][state]) {
                v[i][state] = v[i-1][old] + pobs + pmove
                b[i][state] = old
            }
        }
    }
}

slide-132
SLIDE 132

Forward vs. Viterbi

Forward:
α = double[N+2][K*]
α[0][*] = 0.0
α[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            α[i][state] += α[i-1][old] * pobs * pmove
        }
    }
}

Viterbi:
v = double[N+2][K*]
b = int[N+2][K*]
v[0][*] = 0.0
v[0][START] = 1.0
for (i = 1; i ≤ N+1; ++i) {
    for (state = 0; state < K*; ++state) {
        pobs = pemission(obs_i | state)
        for (old = 0; old < K*; ++old) {
            pmove = ptransition(state | old)
            if (v[i-1][old] * pobs * pmove > v[i][state]) {
                v[i][state] = v[i-1][old] * pobs * pmove
                b[i][state] = old
            }
        }
    }
}

slide-134
SLIDE 134

Hidden Markov Model Tasks

Calculate the (log) likelihood of an observed sequence w1, …, wN. Calculate the most likely sequence of states (for an observed sequence). Learn the emission and transition parameters.

p(z1, w1, z2, w2, …, zN, wN) = p(z1 | z0) p(w1 | z1) ⋯ p(zN | zN−1) p(wN | zN) = ∏_i p(wi | zi) p(zi | zi−1)

emission probabilities/parameters; transition probabilities/parameters

slide-135
SLIDE 135

“The farther backward you can look, the farther forward you can see.”

commonly attributed to Winston Churchill

slide-136
SLIDE 136

HMM Probabilities

Forward Values: α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i

α(i, s) = Σ_{s′} α(i − 1, s′) ∗ p(s | s′) ∗ p(obs at i | s)

slide-137
SLIDE 137

HMM Probabilities

Forward Values: α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i

α(i, s) = Σ_{s′} α(i − 1, s′) ∗ p(s | s′) ∗ p(obs at i | s)

Backward Values: β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

β(i, s) = Σ_{s′} β(i + 1, s′) ∗ p(s′ | s) ∗ p(obs at i+1 | s′)

slide-138
SLIDE 138

Backward Algorithm

β: a 2D table, (N+2) × K*. N+2: number of observations (+2 for the BOS & EOS symbols); K*: number of states. Use dynamic programming to build the β right-to-left.

slide-139
SLIDE 139

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for (i = N; i ≥ 0; --i) {
    for (next = 0; next < K*; ++next) {
        pobs = pemission(obs_{i+1} | next)
        for (state = 0; state < K*; ++state) {
            pmove = ptransition(next | state)
            β[i][state] += β[i+1][next] * pobs * pmove
        }
    }
}

slide-140
SLIDE 140

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for (i = N; i ≥ 0; --i) {
    for (next = 0; next < K*; ++next) {
        pobs = pemission(obs_{i+1} | next)
        for (state = 0; state < K*; ++state) {
            pmove = ptransition(next | state)
            β[i][state] += β[i+1][next] * pobs * pmove
        }
    }
}

Q: What does β[0][START] represent?

slide-141
SLIDE 141

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for (i = N; i ≥ 0; --i) {
    for (next = 0; next < K*; ++next) {
        pobs = pemission(obs_{i+1} | next)
        for (state = 0; state < K*; ++state) {
            pmove = ptransition(next | state)
            β[i][state] += β[i+1][next] * pobs * pmove
        }
    }
}

Q: What does β[0][START] represent?
A: Total probability of all paths from start to end, for the observed sequence
slide-142
SLIDE 142

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for (i = N; i ≥ 0; --i) {
    for (next = 0; next < K*; ++next) {
        pobs = pemission(obs_{i+1} | next)
        for (state = 0; state < K*; ++state) {
            pmove = ptransition(next | state)
            β[i][state] += β[i+1][next] * pobs * pmove
        }
    }
}

Q: What does β[0][START] represent? A: The marginal likelihood of the observed sequence

slide-143
SLIDE 143

Backward Algorithm

β = double[N+2][K*]
β[N+1][END] = 1.0
for (i = N; i ≥ 0; --i) {
    for (next = 0; next < K*; ++next) {
        pobs = pemission(obs_{i+1} | next)
        for (state = 0; state < K*; ++state) {
            pmove = ptransition(next | state)
            β[i][state] += β[i+1][next] * pobs * pmove
        }
    }
}

Q: What does β[0][START] represent? A: α[N+1][END]
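The identity β[0][START] = α[N+1][END] is easy to verify numerically on the toy model from the earlier slides; a minimal sketch in Python (the dictionary representation is an illustrative choice):

```python
trans = {"start": {"N": .7,  "V": .2,  "end": .1},
         "N":     {"N": .15, "V": .8,  "end": .05},
         "V":     {"N": .6,  "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}
states = ("N", "V")
obs = ["w1", "w2", "w3", "w4"]

# forward: after consuming obs, fold in the end transition
alpha = {"start": 1.0}
for w in obs:
    alpha = {s: sum(alpha[o] * trans[o][s] * emit[s][w] for o in alpha)
             for s in states}
fwd = sum(alpha[s] * trans[s]["end"] for s in states)   # alpha[N+1][END]

# backward: beta[s] at position i covers obs[i+1:] plus the end transition
beta = {s: trans[s]["end"] for s in states}
for w in reversed(obs[1:]):
    beta = {s: sum(trans[s][t] * emit[t][w] * beta[t] for t in states)
            for s in states}
bwd = sum(trans["start"][s] * emit[s][obs[0]] * beta[s]
          for s in states)                               # beta[0][START]
```

Both passes sum the probabilities of exactly the same set of complete paths, just grouped differently, so fwd and bwd agree.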

slide-144
SLIDE 144

2 (3) -State HMM Likelihood with Backward Probabilities

(trellis diagram over w1…w4 with N and V states; arc labels not individually recoverable)

Transition probabilities:
         N    V    end
  start  .7   .2   .1
  N      .15  .8   .05
  V      .6   .35  .05

Emission probabilities:
      w1  w2  w3   w4
  N   .7  .2  .05  .05
  V   .2  .6  .1   .1

β[3, N] = β[4, V] * (.8*.1) + β[4, N] * (.15*.05)
β[3, V] = β[4, V] * (.35*.1) + β[4, N] * (.6*.05)
β[2, N] = β[3, N] * (.15*.05) + β[3, V] * (.8*.1)
β[2, V] = β[3, N] * (.6*.05) + β[3, V] * (.35*.1)
β[1, N] = β[2, N] * (.15*.2) + β[2, V] * (.8*.6)
β[1, V] = β[2, N] * (.6*.2) + β[2, V] * (.35*.6)

slide-145
SLIDE 145

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

slide-146
SLIDE 146

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

α(i, B)  β(i, B)

slide-147
SLIDE 147

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

α(i, B) * β(i, B) = total probability of paths through state B at step i

slide-148
SLIDE 148

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

α(i, s) * β(i, s) = total probability of paths through state s at step i

we can compute posterior state probabilities

(normalize by marginal likelihood)

slide-149
SLIDE 149

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

α(i, B)  β(i + 1, s)

slide-150
SLIDE 150

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, B)  β(i + 1, s′)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

α(i, B) * p(s′ | B) * p(obs at i+1 | s′) * β(i + 1, s′) = total probability of paths through the B→s′ arc (at time i)

slide-151
SLIDE 151

Why Do We Need Backward Values?

(trellis diagram: states A, B, C at steps i − 1, i, i + 1)

α(i, B)  β(i + 1, s′)

α(i, s) is the total probability of all paths: 1. that start from the beginning; 2. that end (currently) in s at step i; 3. that emit the observation obs at i
β(i, s) is the total probability of all paths: 1. that start at step i in state s; 2. that terminate at the end; 3. (that emit the observation obs at i+1)

we can compute posterior transition probabilities

(normalize by marginal likelihood)

α(i, B) * p(s′ | B) * p(obs at i+1 | s′) * β(i + 1, s′) = total probability of paths through the B→s′ arc (at time i)

slide-152
SLIDE 152

With Both Forward and Backward Values

α(i, s) * β(i, s) = total probability of paths through state s at step i
α(i, s) * p(s′ | s) * p(obs at i+1 | s′) * β(i + 1, s′) = total probability of paths through the s→s′ arc (at time i)
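Dividing α(i, s) · β(i, s) by the marginal likelihood gives the posterior probability of being in state s at step i. A minimal sketch in Python on the toy model from the earlier slides (table layout is an illustrative choice):

```python
trans = {"start": {"N": .7,  "V": .2,  "end": .1},
         "N":     {"N": .15, "V": .8,  "end": .05},
         "V":     {"N": .6,  "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}
states = ("N", "V")
obs = ["w1", "w2", "w3", "w4"]
n = len(obs)

# forward table: a[i][s] includes the emission of obs[i]
a, prev = [], {"start": 1.0}
for w in obs:
    prev = {s: sum(prev[o] * trans[o][s] * emit[s][w] for o in prev)
            for s in states}
    a.append(prev)
L = sum(prev[s] * trans[s]["end"] for s in states)   # marginal likelihood

# backward table: b[i][s] covers obs[i+1:] plus the end transition
b = [None] * n
b[n - 1] = {s: trans[s]["end"] for s in states}
for i in range(n - 2, -1, -1):
    b[i] = {s: sum(trans[s][t] * emit[t][obs[i + 1]] * b[i + 1][t]
                   for t in states) for s in states}

# posterior p(z_i = s | w_1..w_N) = a[i][s] * b[i][s] / L
posterior = [{s: a[i][s] * b[i][s] / L for s in states} for i in range(n)]
```

Since every complete path passes through exactly one state at each step, each posterior[i] sums to 1.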

slide-154
SLIDE 154

With Both Forward and Backward Values

α(i, s) * β(i, s) = total probability of paths through state s at step i
α(i, s) * p(s′ | s) * p(obs at i+1 | s′) * β(i + 1, s′) = total probability of paths through the s→s′ arc (at time i)

p(z_i = s, z_{i+1} = s′ | w1, ⋯, wN) =

slide-155
SLIDE 155

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

estimated counts

slide-156
SLIDE 156

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

estimated counts

pobs(w | s) ptrans(s’ | s)

slide-157
SLIDE 157

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

estimated counts

pobs(w | s) ptrans(s’ | s)

p*(z_i = s | w1, ⋯, wN) = α(i, s) * β(i, s) / α(N+1, END)

p*(z_i = s, z_{i+1} = s′ | w1, ⋯, wN) = α(i, s) * p(s′ | s) * p(obs at i+1 | s′) * β(i + 1, s′) / α(N+1, END)

slide-158
SLIDE 158

M-Step

“maximize log-likelihood, assuming these uncertain counts”

if we observed the hidden transitions…

slide-159
SLIDE 159

M-Step

“maximize log-likelihood, assuming these uncertain counts”

we don’t observe the hidden transitions, but we can approximately count

slide-160
SLIDE 160

M-Step

“maximize log-likelihood, assuming these uncertain counts”

we don’t observe the hidden transitions, but we can approximately count

we compute these in the E-step, with our α and β values
slide-161
SLIDE 161

Estimating Parameters from Unobserved Data

Expected Transition Counts (rows start, N, V; columns N, V, end) and Expected Emission Counts (rows N, V; columns w1–w4): empty, to be filled in

end emission not shown

(trellis diagram over w1…w4 with a per-time-step posterior probability p* on every transition and emission arc; individual labels not recoverable)
slide-162
SLIDE 162

Estimating Parameters from Unobserved Data

Expected Transition Counts (rows start, N, V; columns N, V, end) and Expected Emission Counts (rows N, V; columns w1–w4): empty, to be filled in

end emission not shown

(trellis diagram over w1…w4 with a per-time-step posterior probability p* on every transition and emission arc; individual labels not recoverable)

all of these p* arcs are specific to a time-step

slide-163
SLIDE 163

all of these p* arcs are specific to a time-step

Estimating Parameters from Unobserved Data

Expected Transition Counts and Expected Emission Counts: empty, to be filled in

end emission not shown

(trellis diagram with per-time-step posterior arc values; the recoverable labels include p*(N|N) = .4, .6, .5 and p*(V|V) = .5, .3, .3 at successive time steps)

slide-164
SLIDE 164

all of these p* arcs are specific to a time-step

Estimating Parameters from Unobserved Data

Expected Transition Counts so far: count(N→N) = 1.5 (= .4 + .6 + .5), count(V→V) = 1.1 (= .5 + .3 + .3); remaining cells to be filled in

end emission not shown

(trellis diagram with per-time-step posterior arc values, as on the previous slide)

slide-165
SLIDE 165

Estimating Parameters from Unobserved Data

Expected Transition Counts:
         N    V    end
  start  1.8  .1   .1
  N      1.5  .8   .1
  V      1.4  1.1  .4

Expected Emission Counts:
      w1  w2  w3  w4
  N   .4  .3  .2  .2
  V   .1  .6  .3  .3

end emission not shown

(trellis diagram with per-time-step posterior arc labels; not individually recoverable)

(these numbers are made up)

slide-166
SLIDE 166

Estimating Parameters from Unobserved Data

Expected Transition MLE:
         N        V        end
  start  1.8/2    .1/2     .1/2
  N      1.5/2.4  .8/2.4   .1/2.4
  V      1.4/2.9  1.1/2.9  .4/2.9

Expected Emission MLE:
      w1      w2      w3      w4
  N   .4/1.1  .3/1.1  .2/1.1  .2/1.1
  V   .1/1.3  .6/1.3  .3/1.3  .3/1.3

end emission not shown

(trellis diagram with per-time-step posterior arc labels; not individually recoverable)

(these numbers are made up)
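The M-step normalization behind these tables is just a row-wise divide; a minimal sketch in Python using the made-up expected transition counts from these slides:

```python
# expected transition counts (made-up numbers from the slide)
expected = {"start": {"N": 1.8, "V": .1,  "end": .1},
            "N":     {"N": 1.5, "V": .8,  "end": .1},
            "V":     {"N": 1.4, "V": 1.1, "end": .4}}

# MLE: divide each row by its total, so each row becomes a distribution
mle = {s: {t: c / sum(expected[s].values())
           for t, c in expected[s].items()}
       for s in expected}
```

For example, the start row sums to 2, so p(N | start) = 1.8/2 = .9; the same pattern applies to the emission counts.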

slide-167
SLIDE 167

EM For HMMs (Baum-Welch Algorithm)

α = computeForwards()
β = computeBackwards()
L = α[N+1][END]
for (i = N; i ≥ 0; --i) {
    for (next = 0; next < K*; ++next) {
        cobs(obs_{i+1} | next) += α[i+1][next] * β[i+1][next] / L
        for (state = 0; state < K*; ++state) {
            u = pobs(obs_{i+1} | next) * ptrans(next | state)
            ctrans(next | state) += α[i][state] * u * β[i+1][next] / L
        }
    }
}
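One E-step plus M-step on the toy model can be sketched in Python; this handles a single training sequence, and the data layout and variable names are illustrative choices, not the deck's:

```python
trans = {"start": {"N": .7,  "V": .2,  "end": .1},
         "N":     {"N": .15, "V": .8,  "end": .05},
         "V":     {"N": .6,  "V": .35, "end": .05}}
emit = {"N": {"w1": .7, "w2": .2, "w3": .05, "w4": .05},
        "V": {"w1": .2, "w2": .6, "w3": .1,  "w4": .1}}
states = ("N", "V")
obs = ["w1", "w2", "w3", "w4"]
n = len(obs)

# forward and backward tables (as on the earlier slides)
a, prev = [], {"start": 1.0}
for w in obs:
    prev = {s: sum(prev[o] * trans[o][s] * emit[s][w] for o in prev)
            for s in states}
    a.append(prev)
L = sum(prev[s] * trans[s]["end"] for s in states)
b = [None] * n
b[n - 1] = {s: trans[s]["end"] for s in states}
for i in range(n - 2, -1, -1):
    b[i] = {s: sum(trans[s][t] * emit[t][obs[i + 1]] * b[i + 1][t]
                   for t in states) for s in states}

# E-step: expected counts from posterior state and arc probabilities
c_trans = {s: {t: 0.0 for t in states + ("end",)}
           for s in ("start",) + states}
c_emit = {s: {w: 0.0 for w in obs} for s in states}
for s in states:
    c_trans["start"][s] = trans["start"][s] * emit[s][obs[0]] * b[0][s] / L
    c_trans[s]["end"] = a[n - 1][s] * trans[s]["end"] / L
    for i in range(n):
        c_emit[s][obs[i]] += a[i][s] * b[i][s] / L
    for t in states:
        for i in range(n - 1):
            c_trans[s][t] += (a[i][s] * trans[s][t]
                              * emit[t][obs[i + 1]] * b[i + 1][t] / L)

# M-step: renormalize the soft counts into new parameters
new_trans = {s: {t: c / sum(row.values()) for t, c in row.items()}
             for s, row in c_trans.items()}
new_emit = {s: {w: c / sum(row.values()) for w, c in row.items()}
            for s, row in c_emit.items()}
```

Iterating these two steps (recomputing α, β with new_trans and new_emit) is the Baum-Welch loop.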

slide-168
SLIDE 168

Expectation Maximization (EM)

  • 0. Assume some value for your parameters

Two step, iterative algorithm

  • 1. E-step: count under uncertainty, assuming these

parameters

  • 2. M-step: maximize log-likelihood, assuming these

uncertain counts

estimated counts

pobs(w | s) ptrans(s’ | s)

p*(z_i = s | w1, ⋯, wN) = α(i, s) * β(i, s) / α(N+1, END)

p*(z_i = s, z_{i+1} = s′ | w1, ⋯, wN) = α(i, s) * p(s′ | s) * p(obs at i+1 | s′) * β(i + 1, s′) / α(N+1, END)

Baum-Welch

slide-169
SLIDE 169
slide-170
SLIDE 170

Semi-Supervised Learning

      ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

labeled data:

  • human annotated
  • relatively small/few

examples unlabeled data:

  • raw; not annotated
  • plentiful
slide-171
SLIDE 171

Semi-Supervised Learning

      ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

labeled data:

  • human annotated
  • relatively small/few

examples unlabeled data:

  • raw; not annotated
  • plentiful

EM

slide-173
SLIDE 173
slide-174
SLIDE 174
slide-175
SLIDE 175

Semi-Supervised Learning

      ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

labeled data:

  • human annotated
  • relatively small/few

examples unlabeled data:

  • raw; not annotated
  • plentiful

?

slide-176
SLIDE 176

Semi-Supervised Parameter Estimation

Transition Counts (from the labeled data; blank cells are 0):
         N   V   end
  start  2   –   –
  N      1   2   2
  V      2   1   –

Emission Counts:
      w1  w2  w3  w4
  N   2   –   1   2
  V   –   2   1   –

slide-177
SLIDE 177

Semi-Supervised Parameter Estimation

Expected Transition Counts (from EM):
         N    V    end
  start  1.8  .1   .1
  N      1.5  .8   .1
  V      1.4  1.1  .4

Expected Emission Counts (from EM):
      w1  w2  w3  w4
  N   .4  .3  .2  .2
  V   .1  .6  .3  .3

Transition Counts (from the labeled data; blank cells are 0):
         N   V   end
  start  2   –   –
  N      1   2   2
  V      2   1   –

Emission Counts (from the labeled data):
      w1  w2  w3  w4
  N   2   –   1   2
  V   –   2   1   –

slide-179
SLIDE 179

Expected Transition Counts (from EM):
         N    V    end
  start  1.8  .1   .1
  N      1.5  .8   .1
  V      1.4  1.1  .4

Expected Emission Counts (from EM):
      w1  w2  w3  w4
  N   .4  .3  .2  .2
  V   .1  .6  .3  .3

Transition Counts (from the labeled data; blank cells are 0):
         N   V   end
  start  2   –   –
  N      1   2   2
  V      2   1   –

Emission Counts (from the labeled data):
      w1  w2  w3  w4
  N   2   –   1   2
  V   –   2   1   –

Semi-Supervised Parameter Estimation

Mixed Transition Counts:
         N    V    end
  start  3.8  .1   .1
  N      2.5  2.8  2.1
  V      3.4  2.1  .4

Mixed Emission Counts:
      w1   w2   w3   w4
  N   2.4  .3   1.2  2.2
  V   .1   2.6  1.3  .3
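Mixing the two tables is elementwise addition of counts before the usual row-normalization. A minimal sketch in Python; the supervised counts here are inferred from the slide's mixed totals (blank cells taken as 0), so treat them as an assumption:

```python
# expected (soft) counts from EM on the unlabeled data
expected = {"start": {"N": 1.8, "V": .1,  "end": .1},
            "N":     {"N": 1.5, "V": .8,  "end": .1},
            "V":     {"N": 1.4, "V": 1.1, "end": .4}}
# hard counts from the labeled data (inferred from the slide; blanks = 0)
observed = {"start": {"N": 2.0, "V": 0.0, "end": 0.0},
            "N":     {"N": 1.0, "V": 2.0, "end": 2.0},
            "V":     {"N": 2.0, "V": 1.0, "end": 0.0}}

# add cell-by-cell; the labeled data acts like extra (certain) evidence
mixed = {s: {t: expected[s][t] + observed[s][t] for t in expected[s]}
         for s in expected}
```

Normalizing each row of mixed (as in the M-step) then yields the semi-supervised parameter estimates.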

slide-180
SLIDE 180

Agenda

HMM Motivation (Part of Speech) and Brief Definition What is Part of Speech? HMM Detailed Definition HMM Tasks