INF4820: Algorithms for Artificial Intelligence and Natural Language Processing - PowerPoint PPT Presentation


SLIDE 1

INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models

Murhaf Fares & Stephan Oepen

Language Technology Group (LTG)

October 27, 2016

SLIDES 2–7 (incremental builds of one slide)

Recap: Probabilistic Language Models

◮ Basic probability theory: axioms, joint vs. conditional probability, independence, Bayes’ Theorem;

◮ Previous context can help predict the next element of a sequence, for example words in a sentence;

◮ Rather than use the whole previous context, the Markov assumption says that the whole history can be approximated by the last n − 1 elements;

◮ An n-gram language model predicts the n-th word, conditioned on the n − 1 previous words;

◮ Maximum Likelihood Estimation uses relative frequencies to approximate the conditional probabilities needed for an n-gram model;

◮ Smoothing techniques are used to avoid zero probabilities.
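The recap can be made concrete in a few lines of Python (not from the slides; the toy corpus, function names and the add-one smoothing variant are illustrative choices):

```python
from collections import Counter

def train_bigram(corpus):
    """Collect unigram and bigram counts from tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                  # contexts only
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def mle(unigrams, bigrams, prev, word):
    """Maximum Likelihood Estimate of P(word | prev) from relative frequencies."""
    return bigrams[(prev, word)] / unigrams[prev]

def laplace(unigrams, bigrams, prev, word, vocab_size):
    """Add-one (Laplace) smoothing: no bigram gets probability zero."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

corpus = [["she", "studies", "morphosyntax"],
          ["she", "studies", "linguistics"]]
uni, bi = train_bigram(corpus)
```

With this corpus, the MLE of P(studies | she) is 1.0, while an unseen bigram such as *studies snow* gets a small non-zero probability only under smoothing.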

SLIDES 8–10

Today

Determining

◮ which string is most likely:
  ◮ She studies morphosyntax vs. She studies more faux syntax
◮ which tag sequence is most likely for flies like flowers:
  ◮ NNS VB NNS vs. VBZ P NNS
◮ which syntactic analysis is most likely:
  ◮ [two parse trees for I ate sushi with tuna, attaching the PP with tuna either inside the NP headed by sushi or to the VP headed by ate]

SLIDES 11–13

Parts of Speech

◮ Known by a variety of names: part-of-speech, POS, lexical categories, word classes, morphological classes, . . .
◮ ‘Traditionally’ defined semantically (e.g. “nouns are naming words”), but more accurately defined by their distributional properties.
◮ Open classes
  ◮ New words created/updated/deleted all the time
◮ Closed classes
  ◮ Smaller classes, relatively static membership
  ◮ Usually function words

SLIDES 14–17

Open Class Words

◮ Nouns: dog, Oslo, scissors, snow, people, truth, cups
  ◮ proper or common; countable or uncountable; plural or singular; masculine, feminine or neuter; . . .
◮ Verbs: fly, rained, having, ate, seen
  ◮ transitive, intransitive, ditransitive; past, present, passive; stative or dynamic; plural or singular; . . .
◮ Adjectives: good, smaller, unique, fastest, best, unhappy
  ◮ comparative or superlative; predicative or attributive; intersective or non-intersective; . . .
◮ Adverbs: again, somewhat, slowly, yesterday, aloud
  ◮ intersective; scopal; discourse; degree; temporal; directional; comparative or superlative; . . .

SLIDE 18

Closed Class Words

◮ Prepositions: on, under, from, at, near, over, . . .
◮ Determiners: a, an, the, that, . . .
◮ Pronouns: she, who, I, others, . . .
◮ Conjunctions: and, but, or, when, . . .
◮ Auxiliary verbs: can, may, should, must, . . .
◮ Interjections, particles, numerals, negatives, politeness markers, greetings, existential there, . . .

(Examples from Jurafsky & Martin, 2008)

SLIDES 19–23

POS Tagging

The (automatic) assignment of POS tags to word sequences

◮ non-trivial where words are ambiguous: fly (v) vs. fly (n)
◮ choice of the correct tag is context-dependent
◮ useful as pre-processing for parsing, etc.; but also directly, e.g. in text-to-speech (TTS) systems: content (n) vs. content (adj)
◮ difficulty and usefulness can depend on the tagset
  ◮ English: Penn Treebank (PTB), 45 tags: NNS, NN, NNP, JJ, JJR, JJS
    http://bulba.sdsu.edu/jeanette/thesis/PennTags.html
  ◮ Norwegian: Oslo-Bergen Tagset, multi-part: subst appell fem be ent
    http://tekstlab.uio.no/obt-ny/english/tags.html

SLIDES 24–27

Labelled Sequences

◮ We are interested in the probability of sequences like:

  flies like  the  wind
  NNS   VB    DT   NN
  VBZ   P     DT   NN

◮ In normal text, we see the words, but not the tags.
◮ Consider the POS tags to be the underlying skeleton of the sentence: unseen, but influencing the sentence’s shape.
◮ A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.

SLIDES 28–38

Hidden Markov Models

The generative story, built up one step at a time: states are generated in a chain (S, DT, NN, VBZ, NNS, /S), and each state emits a word (the, cat, eats, mice):

P(S, O) = P(DT|S) P(the|DT) P(NN|DT) P(cat|NN) P(VBZ|NN) P(eats|VBZ) P(NNS|VBZ) P(mice|NNS) P(/S|NNS)

SLIDES 39–41

Hidden Markov Models

For a bi-gram HMM, over observations O_1^N:

P(S, O) = ∏_{i=1}^{N+1} P(si|si−1) P(oi|si)    where s0 = S, sN+1 = /S

◮ The transition probabilities model the probabilities of moving from state to state.
◮ The emission probabilities model the probability that a state emits a particular observation.
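The product maps almost line-for-line onto code. A minimal sketch (the dict layout and the pseudo-state names `<s>` and `</s>` are my choices; the example probabilities are those of the ice-cream weather model that appears later in the deck):

```python
def joint_probability(states, observations, trans, emit):
    """P(S, O) = product over i of P(s_i | s_i-1) * P(o_i | s_i),
    framed by pseudo-states <s> (= S) and </s> (= /S)."""
    p, prev = 1.0, "<s>"
    for s, o in zip(states, observations):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p * trans[(prev, "</s>")]   # final transition, no emission

trans = {("<s>", "H"): 0.8, ("<s>", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "</s>"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "</s>"): 0.2}
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}
```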

SLIDES 42–45

Using HMMs

The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:

◮ P(S, O), given S and O
◮ P(O), given O
◮ the S that maximises P(S|O), given O
◮ We can also learn the model parameters, given a set of observations.

Our observations will be words (wi), and our states POS tags (ti).

SLIDES 46–47

Estimation

As so often in NLP, we learn an HMM from labelled data:

Transition probabilities
Based on a training corpus of previously tagged text, with tags as our states, the MLE can be computed from the counts of observed tags:

P(ti|ti−1) = C(ti−1, ti) / C(ti−1)

Emission probabilities
Computed from relative frequencies in the same way, with the words as observations:

P(wi|ti) = C(ti, wi) / C(ti)
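Both estimates are plain counting. A sketch (the tiny tagged corpus and the function name are mine):

```python
from collections import Counter

def train_hmm(tagged_corpus):
    """MLE estimates from previously tagged text:
    P(t_i | t_i-1) = C(t_i-1, t_i) / C(t_i-1),  P(w_i | t_i) = C(t_i, w_i) / C(t_i)."""
    tag_count, pair_count, word_count = Counter(), Counter(), Counter()
    for sent in tagged_corpus:
        prev = "<s>"
        tag_count["<s>"] += 1
        for word, tag in sent:
            pair_count[(prev, tag)] += 1
            tag_count[tag] += 1
            word_count[(tag, word)] += 1
            prev = tag
        pair_count[(prev, "</s>")] += 1
    trans = {pair: c / tag_count[pair[0]] for pair, c in pair_count.items()}
    emit = {pair: c / tag_count[pair[0]] for pair, c in word_count.items()}
    return trans, emit

corpus = [[("the", "DT"), ("cat", "NN"), ("eats", "VBZ"), ("mice", "NNS")],
          [("the", "DT"), ("dog", "NN")]]
trans, emit = train_hmm(corpus)
```

Here C(NN) = 2 and C(NN, cat) = 1, so the emission estimate P(cat|NN) comes out as 0.5.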

SLIDES 48–51

Implementation Issues

P(S, O) = P(s1|S) P(o1|s1) P(s2|s1) P(o2|s2) P(s3|s2) P(o3|s3) . . .
        = 0.0429 × 0.0031 × 0.0044 × 0.0001 × 0.0072 × . . .

◮ Multiplying many small probabilities → underflow
◮ Solution: work in log(arithmic) space:
  ◮ log(AB) = log(A) + log(B)
  ◮ hence P(A)P(B) = exp(log(A) + log(B))
  ◮ log(P(S, O)) = −1.368 + −2.509 + −2.357 + −4 + −2.143 + . . .

The issues related to MLE / smoothing that we discussed for n-gram models also apply here . . .
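The underflow problem and the log-space fix can be checked numerically (the probabilities are the ones on the slide; the logs are base 10, matching the −1.368 + −2.509 + . . . line):

```python
import math

probs = [0.0429, 0.0031, 0.0044, 0.0001, 0.0072]

# Naive product: every factor shrinks the result; long sequences underflow.
product = 1.0
for p in probs:
    product *= p

# Log space: multiply by adding logs; exponentiate only at the end, if at all.
log_sum = sum(math.log10(p) for p in probs)
```

log10(0.0429) is about −1.368, the first term on the slide, and 10 ** log_sum recovers the naive product up to rounding.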

SLIDES 52–53

Ice Cream and Global Warming

Missing records of weather in Baltimore for Summer 2007:

◮ Jason likes to eat ice cream.
◮ He records his daily ice cream consumption in his diary.
◮ The number of ice creams he ate was influenced, but not entirely determined, by the weather.
◮ Today’s weather is partially predictable from yesterday’s.

A Hidden Markov Model! With:

◮ Hidden states: {H, C} (plus pseudo-states S and /S)
◮ Observations: {1, 2, 3}

SLIDE 54

Ice Cream and Global Warming

[state diagram: S, H, C and /S, with arcs labelled by the transition probabilities below]

Transition probabilities:
P(H|S) = 0.8   P(C|S) = 0.2
P(H|H) = 0.6   P(C|H) = 0.2   P(/S|H) = 0.2
P(H|C) = 0.3   P(C|C) = 0.5   P(/S|C) = 0.2

Emission probabilities:
P(1|H) = 0.2   P(2|H) = 0.4   P(3|H) = 0.4
P(1|C) = 0.5   P(2|C) = 0.4   P(3|C) = 0.1
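Written out as tables (the nested-dict layout is my choice; the numbers are the slide’s):

```python
# Transition rows: from each state, probabilities of the next state / of stopping.
trans = {"<s>": {"H": 0.8, "C": 0.2},
         "H":   {"H": 0.6, "C": 0.2, "</s>": 0.2},
         "C":   {"H": 0.3, "C": 0.5, "</s>": 0.2}}

# Emission rows: from each hidden state, probabilities of eating 1, 2 or 3 ice creams.
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}

# Sanity check: every row is a probability distribution.
rows_ok = all(abs(sum(row.values()) - 1.0) < 1e-9
              for row in list(trans.values()) + list(emit.values()))
```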

SLIDE 55

Using HMMs

The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:

◮ P(S, O), given S and O
◮ P(O), given O
◮ the S that maximises P(S|O), given O
◮ P(sx|O), given O
◮ We can also learn the model parameters, given a set of observations.

SLIDES 56–59

Part-of-Speech Tagging

We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

P(S, O) = ∏_{i=1}^{N+1} P(si|si−1) P(oi|si)

We want: P(S|O) = P(S, O) / P(O)

Actually, we want the state sequence Ŝ that maximises P(S|O):

Ŝ = arg max_S P(S, O) / P(O)

Since P(O) is always the same, we can drop the denominator:

Ŝ = arg max_S P(S, O)

SLIDES 60–72

Decoding

Task: What is the most likely state sequence S, given an observation sequence O and an HMM?

HMM:
P(H|S) = 0.8   P(C|S) = 0.2
P(H|H) = 0.6   P(C|H) = 0.2   P(/S|H) = 0.2
P(H|C) = 0.3   P(C|C) = 0.5   P(/S|C) = 0.2
P(1|H) = 0.2   P(2|H) = 0.4   P(3|H) = 0.4
P(1|C) = 0.5   P(2|C) = 0.4   P(3|C) = 0.1

If O = 3 1 3:

S H H H /S   0.0018432   ← the most likely sequence
S H H C /S   0.0001536
S H C H /S   0.0007680
S H C C /S   0.0003200
S C H H /S   0.0000576
S C H C /S   0.0000048
S C C H /S   0.0001200
S C C C /S   0.0000500
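The eight rows can be reproduced by brute force: enumerate every state sequence and score it with the joint probability (a sketch; the dict layout is mine, the numbers are the slide’s):

```python
from itertools import product

trans = {("<s>", "H"): 0.8, ("<s>", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "</s>"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "</s>"): 0.2}
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}

def joint(states, obs):
    """Score one candidate state sequence against the observations."""
    p, prev = 1.0, "<s>"
    for s, o in zip(states, obs):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p * trans[(prev, "</s>")]

obs = [3, 1, 3]
table = {seq: joint(seq, obs) for seq in product("HC", repeat=len(obs))}
best = max(table, key=table.get)   # ('H', 'H', 'H')
```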

SLIDES 73–76

Dynamic Programming

For (only) two states and a (short) observation sequence of length three, comparing all possible sequences is workable, but . . .

◮ for N observations and L states, there are L^N sequences
◮ we do the same partial calculations over and over again

Dynamic Programming:

◮ records sub-problem solutions for further re-use
◮ useful when a complex problem can be described recursively
◮ examples: Dijkstra’s shortest path, minimum edit distance, longest common subsequence, the Viterbi algorithm
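As a small illustration of the idea, here is minimum edit distance (one of the examples above) with the recursive sub-problems cached, so each (i, j) pair is solved only once (a sketch, assuming unit costs for insert, delete and substitute):

```python
from functools import lru_cache

def edit_distance(a, b):
    """Minimum edit distance via dynamic programming: the recursion on
    prefixes (i, j) is memoised, turning exponential work into O(len(a)*len(b))."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1,        # delete from a
                   d(i, j - 1) + 1,        # insert into a
                   d(i - 1, j - 1) + cost) # substitute (or match)
    return d(len(a), len(b))
```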

SLIDES 77–80

Viterbi Algorithm

Recall our problem: maximise

P(s1 . . . sn|o1 . . . on) = P(s1|s0) P(o1|s1) P(s2|s1) P(o2|s2) . . .

Our recursive sub-problem:

vi(x) = max_{k=1..L} [ vi−1(k) · P(x|k) · P(oi|x) ]

The variable vi(x) represents the maximum probability that the i-th state is x, given that we have seen O_1^i.

At each step, we record backpointers showing which previous state led to the maximum probability.
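A compact sketch of the recursion with backpointers (variable names and dict layout are mine; the model is the ice-cream HMM from the decoding slides):

```python
def viterbi(obs, states, trans, emit):
    """Viterbi decoding: v[i][x] is the max probability of any state
    sequence ending in x after emitting obs[:i+1]; back[i][x] records
    the argmax predecessor so the best path can be recovered."""
    v = [{s: trans[("<s>", s)] * emit[(s, obs[0])] for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        v.append({})
        back.append({})
        for x in states:
            k = max(states, key=lambda s: v[i - 1][s] * trans[(s, x)])
            v[i][x] = v[i - 1][k] * trans[(k, x)] * emit[(x, obs[i])]
            back[i][x] = k
    last = max(states, key=lambda s: v[-1][s] * trans[(s, "</s>")])
    best_prob = v[-1][last] * trans[(last, "</s>")]
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path)), best_prob

trans = {("<s>", "H"): 0.8, ("<s>", "C"): 0.2,
         ("H", "H"): 0.6, ("H", "C"): 0.2, ("H", "</s>"): 0.2,
         ("C", "H"): 0.3, ("C", "C"): 0.5, ("C", "</s>"): 0.2}
emit = {("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
        ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1}
```

On O = 3 1 3 this recovers the path H H H with probability 0.0018432, matching the brute-force table.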

slide-81
SLIDE 81

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

21

slide-82
SLIDE 82

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 v1(H) = 0.32 21

slide-83
SLIDE 83

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 v1(H) = 0.32 21

slide-84
SLIDE 84

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 v1(H) = 0.32 v1(C) = 0.02 21

slide-85
SLIDE 85

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 v1(H) = 0.32 v1(C) = 0.02 21

slide-86
SLIDE 86

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 P(H|H)P(1|H) 0.6 ∗ 0.2 P ( H | C ) P ( 1 | H ) . 3 ∗ . 2 v1(H) = 0.32 v1(C) = 0.02 v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384 21

slide-87
SLIDE 87

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 P(H|H)P(1|H) 0.6 ∗ 0.2 P ( H | C ) P ( 1 | H ) . 3 ∗ . 2 v1(H) = 0.32 v1(C) = 0.02 v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384 21

slide-88
SLIDE 88

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 P(H|H)P(1|H) 0.6 ∗ 0.2 P ( C | H ) P ( 1 | C ) . 2 ∗ . 5 P ( H | C ) P ( 1 | H ) . 3 ∗ . 2 P(C|C)P(1|C) 0.5 ∗ 0.5 v1(H) = 0.32 v1(C) = 0.02 v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384 v2(C) = max(.32∗.1, .02∗.25) = .032 21

slide-89
SLIDE 89

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 P(H|H)P(1|H) 0.6 ∗ 0.2 P(C|H)P(1|C) 0.2 ∗ 0.5 P ( H | C ) P ( 1 | H ) . 3 ∗ . 2 P(C|C)P(1|C) 0.5 ∗ 0.5 v1(H) = 0.32 v1(C) = 0.02 v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384 v2(C) = max(.32∗.1, .02∗.25) = .032 21

slide-90
SLIDE 90

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 P(H|H)P(1|H) 0.6 ∗ 0.2 P ( C | H ) P ( 1 | C ) . 2 ∗ . 5 P ( H | C ) P ( 1 | H ) . 3 ∗ . 2 P(C|C)P(1|C) 0.5 ∗ 0.5 P(H|H)P(3|H) 0.6 ∗ 0.4 P ( H | C ) P ( 3 | H ) . 3 ∗ . 4 v1(H) = 0.32 v1(C) = 0.02 v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384 v2(C) = max(.32∗.1, .02∗.25) = .032 v3(H) = max(.0384∗.24, .032∗.12) = .009216 21

slide-91
SLIDE 91

An Example of the Viterbi Algorithm

C C C H H H S /S 3 1 3

  • 1
  • 2
  • 3

P(H|S)P(3|H) 0.8 ∗ 0.4 P(C|S)P(3|C) 0.2 ∗ 0.1 P(H|H)P(1|H) 0.6 ∗ 0.2 P ( C | H ) P ( 1 | C ) . 2 ∗ . 5 P ( H | C ) P ( 1 | H ) . 3 ∗ . 2 P(C|C)P(1|C) 0.5 ∗ 0.5 P(H|H)P(3|H) 0.6 ∗ 0.4 P ( H | C ) P ( 3 | H ) . 3 ∗ . 4 v1(H) = 0.32 v1(C) = 0.02 v2(H) = max(.32 ∗ .12, .02 ∗ .06) = .0384 v2(C) = max(.32∗.1, .02∗.25) = .032 v3(H) = max(.0384∗.24, .032∗.12) = .009216 21


slide-98
SLIDE 98

An Example of the Viterbi Algorithm

[Trellis: start S, states H and C at each of time steps 1–3, end /S; observations 3 1 3]

Edge scores (transition ∗ emission):
P(H|S)P(3|H) = 0.8 ∗ 0.4    P(C|S)P(3|C) = 0.2 ∗ 0.1
P(H|H)P(1|H) = 0.6 ∗ 0.2    P(C|H)P(1|C) = 0.2 ∗ 0.5
P(H|C)P(1|H) = 0.3 ∗ 0.2    P(C|C)P(1|C) = 0.5 ∗ 0.5
P(H|H)P(3|H) = 0.6 ∗ 0.4    P(C|H)P(3|C) = 0.2 ∗ 0.1
P(H|C)P(3|H) = 0.3 ∗ 0.4    P(C|C)P(3|C) = 0.5 ∗ 0.1
P(/S|H) = 0.2               P(/S|C) = 0.2

Viterbi values:
v1(H) = 0.32    v1(C) = 0.02
v2(H) = max(0.32 ∗ 0.12, 0.02 ∗ 0.06) = 0.0384
v2(C) = max(0.32 ∗ 0.10, 0.02 ∗ 0.25) = 0.032
v3(H) = max(0.0384 ∗ 0.24, 0.032 ∗ 0.12) = 0.009216
v3(C) = max(0.0384 ∗ 0.02, 0.032 ∗ 0.05) = 0.0016
vf(/S) = max(0.009216 ∗ 0.2, 0.0016 ∗ 0.2) = 0.0018432

Best path (following the backpointers): H H H
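The whole trellis can be reproduced with a short script (a sketch, not part of the slides; the dictionaries hardcode the example's transition and emission probabilities, and only the observed values 1 and 3 are given emission entries):

```python
# Viterbi decoding for the ice-cream HMM of the example.
# trans[p][s] = P(s|p); emit[s][o] = P(o|s); S and /S are start and end states.
trans = {"S": {"H": 0.8, "C": 0.2},
         "H": {"H": 0.6, "C": 0.2, "/S": 0.2},
         "C": {"H": 0.3, "C": 0.5, "/S": 0.2}}
emit = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

def viterbi(obs, states=("H", "C")):
    # v[i][s] is the score of the best path ending in state s at time i.
    v = [{s: trans["S"][s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        v.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: v[-2][p] * trans[p][s])
            v[-1][s] = v[-2][prev] * trans[prev][s] * emit[s][o]
            back[-1][s] = prev
    # Final transition into the end state /S.
    last = max(states, key=lambda s: v[-1][s] * trans[s]["/S"])
    score = v[-1][last] * trans[last]["/S"]
    # Recover the best path by following backpointers.
    path = [last]
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path)), score

path, score = viterbi([3, 1, 3])
print(path, round(score, 7))  # ['H', 'H', 'H'] 0.0018432
```

The intermediate values in `v` match the trellis cells on the slide (v2(H) = 0.0384, v3(H) = 0.009216, and so on).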

slide-99
SLIDE 99

Pseudocode for the Viterbi Algorithm

Input: observations of length N, state set of size L
Output: best-path

create a path probability matrix viterbi[N, L + 2]
create a path backpointer matrix backpointer[N, L + 2]
for each state s from 1 to L do
    viterbi[1, s] ← trans(S, s) × emit(o1, s)
    backpointer[1, s] ← 0
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        viterbi[i, s] ← max s′=1..L viterbi[i − 1, s′] × trans(s′, s) × emit(oi, s)
        backpointer[i, s] ← arg max s′=1..L viterbi[i − 1, s′] × trans(s′, s)
    end
end
viterbi[N, L + 1] ← max s=1..L viterbi[N, s] × trans(s, /S)
backpointer[N, L + 1] ← arg max s=1..L viterbi[N, s] × trans(s, /S)
return the path by following backpointers from backpointer[N, L + 1]

slide-100
SLIDE 100

Diversion: Complexity and O(N)

Big-O notation describes the complexity of an algorithm.

◮ it describes the worst-case order of growth in terms of the size of the input
◮ only the largest-order term is represented
◮ constant factors are ignored
◮ determined by looking at loops in the code


slide-108
SLIDE 108

Pseudocode for the Viterbi Algorithm

Annotating each part of the pseudocode with its cost:

◮ initialisation loop over the L states: L
◮ main loops: for each of N time steps and each of L states, a max (and arg max) over L predecessors: L²N
◮ termination and following the backpointers: N
◮ total: L + L²N + N ⇒ O(L²N)
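The O(L²N) count can be checked by counting loop iterations directly (a sketch; the function name is ours, and the counter stands in for the constant-time work inside each loop body):

```python
# Count inner-loop executions of a Viterbi-shaped loop nest:
# for each time step 2..N, for each of L states, a max over L predecessors.
def viterbi_step_count(N, L):
    steps = 0
    for i in range(2, N + 1):        # N - 1 time steps
        for s in range(L):           # L states
            for s_prev in range(L):  # max over L predecessors
                steps += 1
    return steps

print(viterbi_step_count(10, 4))  # (10 - 1) * 4 * 4 = 144
```

The initialisation (L) and termination (N) terms are dominated by this L²(N − 1) core, hence O(L²N).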

slide-109
SLIDE 109

Using HMMs

The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:

◮ P(S, O) given S and O
◮ P(O) given O
◮ S that maximises P(S|O) given O
◮ P(sx|O) given O
◮ We can also learn the model parameters, given a set of observations.

slide-111
SLIDE 111

Computing Likelihoods

Task: Given an observation sequence O, determine the likelihood P(O), according to the HMM.

Compute the sum over all possible state sequences:

P(O) = Σ_S P(O, S)

For example, for the ice cream sequence 3 1 3:

P(3 1 3) = P(3 1 3, cold cold cold) + P(3 1 3, cold cold hot) + P(3 1 3, hot hot cold) + . . .

⇒ O(L^N · N)
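For this tiny example the exponential sum can be computed directly (a sketch, not part of the slides; it reuses the example's transition and emission probabilities, with the end-state transition P(/S|·) = 0.2 included in each term):

```python
# Brute-force P(O): enumerate all L^N state sequences and sum the joints.
from itertools import product

trans = {"S": {"H": 0.8, "C": 0.2},
         "H": {"H": 0.6, "C": 0.2, "/S": 0.2},
         "C": {"H": 0.3, "C": 0.5, "/S": 0.2}}
emit = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

def joint(states, obs):
    # P(O, S): chain of transition * emission factors, then the end transition.
    p = 1.0
    prev = "S"
    for s, o in zip(states, obs):
        p *= trans[prev][s] * emit[s][o]
        prev = s
    return p * trans[prev]["/S"]

obs = [3, 1, 3]
total = sum(joint(seq, obs) for seq in product("HC", repeat=len(obs)))
print(total)  # ≈ 0.0033172, over 2^3 = 8 state sequences
```

This matches the value the forward algorithm computes later, but takes L^N terms instead of O(L²N) work.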


slide-114
SLIDE 114

The Forward Algorithm

Again, we use dynamic programming—storing and reusing the results of partial computations in a trellis α.

Each cell in the trellis stores the probability of being in state sx after seeing the first i observations:

αi(x) = P(o1 . . . oi, si = x) = Σ k=1..L αi−1(k) · P(x|k) · P(oi|x)

Note the Σ, instead of the max in Viterbi.

slide-115
SLIDE 115

An Example of the Forward Algorithm

[Empty trellis: start S, states H and C at each of time steps 1–3, end /S; observations 3 1 3]

slide-123
SLIDE 123

An Example of the Forward Algorithm

[Trellis: start S, states H and C at each of time steps 1–3, end /S; observations 3 1 3]

Edge scores (transition ∗ emission):
P(H|S)P(3|H) = 0.8 ∗ 0.4    P(C|S)P(3|C) = 0.2 ∗ 0.1
P(H|H)P(1|H) = 0.6 ∗ 0.2    P(C|H)P(1|C) = 0.2 ∗ 0.5
P(H|C)P(1|H) = 0.3 ∗ 0.2    P(C|C)P(1|C) = 0.5 ∗ 0.5
P(H|H)P(3|H) = 0.6 ∗ 0.4    P(C|H)P(3|C) = 0.2 ∗ 0.1
P(H|C)P(3|H) = 0.3 ∗ 0.4    P(C|C)P(3|C) = 0.5 ∗ 0.1
P(/S|H) = 0.2               P(/S|C) = 0.2

Forward values (sums, where Viterbi took maxima):
α1(H) = 0.32    α1(C) = 0.02
α2(H) = 0.32 ∗ 0.12 + 0.02 ∗ 0.06 = 0.0396
α2(C) = 0.32 ∗ 0.10 + 0.02 ∗ 0.25 = 0.037
α3(H) = 0.0396 ∗ 0.24 + 0.037 ∗ 0.12 = 0.013944
α3(C) = 0.0396 ∗ 0.02 + 0.037 ∗ 0.05 = 0.002642
αf(/S) = 0.013944 ∗ 0.2 + 0.002642 ∗ 0.2 = 0.0033172

P(3 1 3) = 0.0033172

slide-124
SLIDE 124

Pseudocode for the Forward Algorithm

Input: observations of length N, state set of size L
Output: forward-probability

create a probability matrix forward[N, L + 2]
for each state s from 1 to L do
    forward[1, s] ← trans(S, s) × emit(o1, s)
end
for each time step i from 2 to N do
    for each state s from 1 to L do
        forward[i, s] ← Σ s′=1..L forward[i − 1, s′] × trans(s′, s) × emit(oi, s)
    end
end
forward[N, L + 1] ← Σ s=1..L forward[N, s] × trans(s, /S)
return forward[N, L + 1]
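The pseudocode translates almost line for line into Python (a sketch, not part of the slides; dictionaries stand in for the trellis matrix, and the probabilities are those of the ice-cream example):

```python
# Forward algorithm: P(O) for the ice-cream HMM.
trans = {"S": {"H": 0.8, "C": 0.2},
         "H": {"H": 0.6, "C": 0.2, "/S": 0.2},
         "C": {"H": 0.3, "C": 0.5, "/S": 0.2}}
emit = {"H": {1: 0.2, 3: 0.4}, "C": {1: 0.5, 3: 0.1}}

def forward(obs, states=("H", "C")):
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = {s: trans["S"][s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        # Sum over predecessors, where Viterbi would take the max.
        alpha = {s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
                 for s in states}
    return sum(alpha[s] * trans[s]["/S"] for s in states)

likelihood = forward([3, 1, 3])
print(likelihood)  # ≈ 0.0033172, as on the example slide
```

The only change relative to the Viterbi implementation is `sum` in place of `max`, and no backpointers are needed because no path is recovered.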

slide-125
SLIDE 125

Tagger Evaluation

To evaluate a part-of-speech tagger (or any classification system) we:

◮ train on a labelled training set
◮ test on a separate test set

For a POS tagger, the standard evaluation metric is tag accuracy:

Acc = number of correct tags / number of words

The other metric sometimes used is error rate:

error rate = 1 − Acc
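Both metrics are one-liners over a gold/predicted tag pair (a sketch; the tag sequences are hypothetical, chosen only to illustrate the computation):

```python
# Tag accuracy and error rate on a toy tagged sentence.
gold      = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
predicted = ["DT", "NN", "VBZ", "DT", "NN", "NN"]  # one tag wrong

correct = sum(g == p for g, p in zip(gold, predicted))
acc = correct / len(gold)          # number of correct tags / number of words
error_rate = 1 - acc

print(correct, len(gold))  # 5 6
```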


slide-130
SLIDE 130

Summary

◮ Part-of-speech tagging as an example of sequence labelling.
◮ Hidden Markov Models to model the observation and hidden sequences.
◮ Learn the parameters of an HMM (i.e. transition and emission probabilities) using MLE.
◮ Use Viterbi for decoding, i.e. finding the S that maximises P(S|O) given O.
◮ Use the Forward algorithm for computing the likelihood, i.e. P(O) given O.