SLIDE 1

Lecture 11: Viterbi and Forward Algorithms

Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

Quiz 1

- Max: 24; Mean: 18.1; Median: 18; SD: 3.36

[Figure: histogram of Quiz 1 scores, bins [0-5], [6-10], [11-15], [16-20], [21-25]]

SLIDE 3

This lecture

- Two important algorithms for inference
  - Forward algorithm
  - Viterbi algorithm

SLIDE 4

SLIDE 5

Three basic problems for HMMs

- Likelihood of the input: how likely does the sentence "I love cat" occur?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm

SLIDE 6

Likelihood of the input

- How likely does the sentence "I love cat" occur?
- Compute P(x | μ) for the input x and HMM μ
- Remember, we model P(t, x | μ)
- P(x | μ) = Σ_t P(t, x | μ)
  (marginal probability: sum over all possible tag sequences t)

SLIDE 7

Likelihood of the input

- How likely does the sentence "I love cat" occur?
- P(x | μ) = Σ_t P(t, x | μ) = Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})
- Assume we have 2 tags, N and V:
  P("I love cat" | μ)
    = P("I love cat", "NNN" | μ) + P("I love cat", "NNV" | μ)
    + P("I love cat", "NVN" | μ) + P("I love cat", "NVV" | μ)
    + P("I love cat", "VNN" | μ) + P("I love cat", "VNV" | μ)
    + P("I love cat", "VVN" | μ) + P("I love cat", "VVV" | μ)
- Now, let's write down P("I love cat" | μ) with 45 tags… (see the brute-force sketch below)
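With T tags and n words there are T^n terms in this sum. A minimal brute-force sketch in Python makes the blow-up concrete (the dictionaries trans and emit and the start symbol "<S>" are assumed names, not from the slides); the forward algorithm avoids exactly this enumeration.

from itertools import product

def brute_force_likelihood(words, tags, trans, emit, start="<S>"):
    # P(x | mu) = sum over all T^n tag sequences t of P(t, x | mu)
    total = 0.0
    for seq in product(tags, repeat=len(words)):
        p, prev = 1.0, start
        for word, tag in zip(words, seq):
            # P(t_i | t_{i-1}) * P(x_i | t_i)
            p *= trans[prev][tag] * emit[tag][word]
            prev = tag
        total += p
    return total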

SLIDE 8

Trellis diagram

- Goal: P(x | μ) = Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})

[Trellis: positions j = 1 … 4, with edges labeled by transition probabilities such as P(t_2 = 2 | t_1 = 1) and P(t_3 = 1 | t_2 = 1), and nodes by emission probabilities such as P(x_3 | t_3 = 1)]

- μ is the parameter set of the HMM; we drop it in some slides for simplicity's sake

SLIDE 9

Trellis diagram

- P("I eat a fish", "NVVA")
  = P(N|<S>) P(I|N) · P(V|N) P(eat|V) · P(V|V) P(a|V) · P(A|V) P(fish|A)

[Trellis: positions j = 1 … 4 with tag states N, V, A; the path N→V→V→A collects the emission and transition probabilities above]

SLIDE 10

Trellis diagram

- Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1}): sum over all paths

[Trellis: positions j = 1 … 4 with tag states N, V, A; every left-to-right path corresponds to one tag sequence]

SLIDE 11

Dynamic programming

- Recursively decompose a problem into smaller sub-problems
- Similar to mathematical induction
  - Base step: initial values for j = 1
  - Inductive step: assume we know the values for j = l, then compute j = l + 1

SLIDE 12

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- t_{1:l}: tag sequence of length l; x_{1:l} = x_1, x_2, …, x_l
- Σ_{t_{1:l}} P(t_{1:l}, x_{1:l}) = Σ_{t_{1:l−1}} Σ_r P(t_{1:l−1}, x_{1:l}, t_l = r)

[Trellis: positions j = 1 … 4 with tag states N, V, A; grouping the tag sequences by the tag r at the last position gives P(x_{1:l}, t_l = r)]

SLIDE 13

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- Σ_{t_{1:l}} P(t_{1:l}, x_{1:l}) = Σ_r P(x_{1:l}, t_l = r)
- P(x_{1:l}, t_l = r) = Σ_{r′} P(x_{1:l}, t_{l−1} = r′, t_l = r)
                      = Σ_{r′} P(x_{1:l−1}, t_{l−1} = r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 14

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- Σ_{t_{1:l}} P(t_{1:l}, x_{1:l}) = Σ_r P(x_{1:l}, t_l = r)
- P(x_{1:l}, t_l = r) = Σ_{r′} P(x_{1:l}, t_{l−1} = r′, t_l = r)
                      = Σ_{r′} P(x_{1:l−1}, t_{l−1} = r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
- Let's call P(x_{1:l}, t_l = r) "α_l(r)"; the factor P(x_{1:l−1}, t_{l−1} = r′) is then α_{l−1}(r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 15

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- α_l(r) = Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 16

Forward algorithm

- Inductive step: from j = l − 1 to j = l
- α_l(r) = Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
         = P(x_l | t_l = r) Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]
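For instance, with the three tags N, V, A shown in the trellis, one cell of the second column is computed as

α_2(N) = P(x_2 | N) · ( α_1(N) P(N|N) + α_1(V) P(N|V) + α_1(A) P(N|A) )

and analogously for α_2(V) and α_2(A).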

SLIDE 17

Forward algorithm

- Base step: j = 1
- α_1(r) = P(x_1 | t_1 = r) P(t_1 = r | t_0), where P(t_1 = r | t_0) is the initial probability of tag r

SLIDE 18

Implementation using an array

- Use an n × T table (sentence length × number of tags) to keep α_l(r)


From Julia Hockenmaier, Intro to NLP

SLIDE 19

Implementation using an array

Initial: Trellis[1][r] = P(x_1 | t_1 = r) P(t_1 = r | t_0)

SLIDE 20

Implementation using an array


Induction:

α_l(r) = P(x_l | t_l = r) Σ_{r′} α_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

SLIDE 21

The forward algorithm (Pseudo Code)

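A minimal Python sketch of the forward recursion derived above (not the slide's original pseudocode; the dictionaries init[r] = P(t_1 = r | t_0), trans[r'][r] = P(t_l = r | t_{l-1} = r'), and emit[r][x] = P(x | t = r) are assumed names). It fills the n × T trellis in O(n T^2) time.

def forward(words, tags, trans, emit, init):
    # alpha[l-1][r] = P(x_1 .. x_l, t_l = r)
    alpha = [{r: init[r] * emit[r][words[0]] for r in tags}]      # base step, j = 1
    for x in words[1:]:                                            # inductive step
        prev = alpha[-1]
        alpha.append({r: emit[r][x] *
                         sum(prev[rp] * trans[rp][r] for rp in tags)
                      for r in tags})
    return sum(alpha[-1].values())                                 # P(x | mu)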

SLIDE 22

Jason’s ice cream

- P("1,2,1")? (worked out below)

            p(…|C)   p(…|H)   p(…|START)
  p(1|…)     0.5      0.1
  p(2|…)     0.4      0.2
  p(3|…)     0.1      0.7
  p(C|…)     0.8      0.2      0.5
  p(H|…)     0.2      0.8      0.5

  (observations: number of ice cream cones eaten each day; hidden states: C = cold, H = hot)

[Trellis: three days, states C and H at each step]
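Running the forward recursion on this table (a worked sketch, assuming the START column gives the initial probabilities and there is no STOP transition):

α_1(C) = 0.5 · 0.5 = 0.25          α_1(H) = 0.5 · 0.1 = 0.05
α_2(C) = 0.4 · (0.25·0.8 + 0.05·0.2) = 0.084
α_2(H) = 0.2 · (0.25·0.2 + 0.05·0.8) = 0.018
α_3(C) = 0.5 · (0.084·0.8 + 0.018·0.2) = 0.0354
α_3(H) = 0.1 · (0.084·0.2 + 0.018·0.8) = 0.00312
P("1,2,1") = α_3(C) + α_3(H) = 0.0354 + 0.00312 = 0.03852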

SLIDE 23

Three basic problems for HMMs

- Likelihood of the input: how likely does the sentence "I love cat" occur?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm

SLIDE 24

Prediction in generative model

- Inference: what is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?

[Figure annotated with the initial probability P(t_1)]

SLIDE 25

Tagging the input

- Find the best tag sequence for "I love cat"
- Remember, we model P(t, x | μ)
- t* = argmax_t P(t, x | μ)
  (find the best one from all possible tag sequences)

SLIDE 26

Tagging the input

- Assume we have 2 tags, N and V. Which one is the best?
  P("I love cat", "NNN" | μ), P("I love cat", "NNV" | μ),
  P("I love cat", "NVN" | μ), P("I love cat", "NVV" | μ),
  P("I love cat", "VNN" | μ), P("I love cat", "VNV" | μ),
  P("I love cat", "VVN" | μ), P("I love cat", "VVV" | μ)

- Again! We need an efficient algorithm

SLIDE 27

Trellis diagram

- Goal: argmax_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})

[Trellis: positions j = 1 … 4, with edges labeled by transition probabilities such as P(t_2 = 2 | t_1 = 1) and P(t_3 = 1 | t_2 = 1), and nodes by emission probabilities such as P(x_3 | t_3 = 1)]

SLIDE 28

Trellis diagram

- Goal: argmax_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i−1})
- Find the best path!

[Trellis: positions j = 1 … 4 with tag states N, V, A; each path is one candidate tag sequence]

SLIDE 29

Dynamic programming again!

- Recursively decompose a problem into smaller sub-problems
- Similar to mathematical induction
  - Base step: initial values for j = 1
  - Inductive step: assume we know the values for j = l, then compute j = l + 1

SLIDE 30

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- t_{1:l}: tag sequence of length l; x_{1:l} = x_1, x_2, …, x_l
- max_{t_{1:l}} P(t_{1:l}, x_{1:l}) = max_r max_{t_{1:l−1}} P(t_{1:l−1}, t_l = r, x_{1:l})

[Trellis: positions j = 1 … 4 with tag states N, V, A; group the tag sequences by the tag r at the last position]

SLIDE 31

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- max_{t_{1:l−1}} P(t_{1:l−1}, t_l = r, x_{1:l})
    = max_{r′} max_{t_{1:l−2}} P(t_{1:l−2}, t_l = r, t_{l−1} = r′, x_{1:l})
    = max_{r′} max_{t_{1:l−2}} P(t_{1:l−2}, t_{l−1} = r′, x_{1:l−1}) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
- Let's call max_{t_{1:l−1}} P(t_{1:l−1}, t_l = r, x_{1:l}) "δ_l(r)"; the inner max over t_{1:l−2} is then δ_{l−1}(r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 32

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- δ_l(r) = max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]

SLIDE 33

Viterbi algorithm

- Inductive step: from j = l − 1 to j = l
- δ_l(r) = max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′) P(x_l | t_l = r)
         = P(x_l | t_l = r) max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

[Trellis: columns j = l − 1 and j = l with tag states N, V, A]
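Concretely, with the tags N, V, A from the trellis, one cell of the second column is

δ_2(N) = P(x_2 | N) · max( δ_1(N) P(N|N), δ_1(V) P(N|V), δ_1(A) P(N|A) )

and the r′ attaining the max is remembered as a backpointer (see the later slides).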

SLIDE 34

Viterbi algorithm

- Base step: j = 1
- δ_1(r) = P(x_1 | t_1 = r) P(t_1 = r | t_0), where P(t_1 = r | t_0) is the initial probability of tag r

SLIDE 35

Implementation using an array

Initial: Trellis[1][r] = P(x_1 | t_1 = r) P(t_1 = r | t_0)

SLIDE 36

Implementation using an array


Induction:

δ_l(r) = P(x_l | t_l = r) max_{r′} δ_{l−1}(r′) P(t_l = r | t_{l−1} = r′)

SLIDE 37

Retrieving the best sequence

- Keep one backpointer per trellis cell: the r′ that attains the max; follow the backpointers back from the best final cell to recover the sequence

SLIDE 38

The Viterbi algorithm (Pseudo Code)


Max instead of sum
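A minimal Python sketch under the same assumed dictionaries as the forward sketch (init, trans, emit), replacing the sum with a max and keeping one backpointer per cell (not the slide's original pseudocode):

def viterbi(words, tags, trans, emit, init):
    # delta[l-1][r] = score of the best tag sequence ending in tag r at position l
    delta = [{r: init[r] * emit[r][words[0]] for r in tags}]      # base step, j = 1
    backptr = [{}]                                                 # no backpointer at j = 1
    for x in words[1:]:                                            # inductive step
        prev, scores, bp = delta[-1], {}, {}
        for r in tags:
            best = max(tags, key=lambda rp: prev[rp] * trans[rp][r])
            bp[r] = best
            scores[r] = emit[r][x] * prev[best] * trans[best][r]
        delta.append(scores)
        backptr.append(bp)
    last = max(tags, key=lambda r: delta[-1][r])                   # best final tag
    path = [last]
    for bp in reversed(backptr[1:]):                               # follow backpointers
        path.append(bp[path[-1]])
    return list(reversed(path)), delta[-1][last]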

SLIDE 39

SLIDE 40

Jason’s ice cream

- Best tag sequence for "1,2,1"? (worked out below)

            p(…|C)   p(…|H)   p(…|START)
  p(1|…)     0.5      0.1
  p(2|…)     0.4      0.2
  p(3|…)     0.1      0.7
  p(C|…)     0.8      0.2      0.5
  p(H|…)     0.2      0.8      0.5

  (observations: number of ice cream cones eaten each day; hidden states: C = cold, H = hot)

[Trellis: three days, states C and H at each step]
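Running the Viterbi recursion on the same table (a worked sketch, same assumptions as in the forward example):

δ_1(C) = 0.5 · 0.5 = 0.25          δ_1(H) = 0.5 · 0.1 = 0.05
δ_2(C) = 0.4 · max(0.25·0.8, 0.05·0.2) = 0.08      backpointer: C
δ_2(H) = 0.2 · max(0.25·0.2, 0.05·0.8) = 0.01      backpointer: C
δ_3(C) = 0.5 · max(0.08·0.8, 0.01·0.2) = 0.032     backpointer: C
δ_3(H) = 0.1 · max(0.08·0.2, 0.01·0.8) = 0.0016    backpointer: C
Best final state: C (0.032); following the backpointers gives the sequence C, C, C.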

SLIDE 41

Trick: computing everything in log space

- Homework:
  - Write the forward and Viterbi algorithms in log space
  - Hint: you need a function to compute log(a + b) from log a and log b (a sketch follows below)
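A minimal sketch of that helper (the standard log-sum-exp trick for two values; assumes the inputs are ordinary finite floats):

import math

def log_add(log_a, log_b):
    # returns log(a + b) given log a and log b, without leaving log space
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))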

SLIDE 42

Three basic problems for HMMs

- Likelihood of the input: how likely does the sentence "I love cat" occur?
  - Forward algorithm
- Decoding (tagging) the input: what are the POS tags of "I love cat"?
  - Viterbi algorithm
- Estimation (learning): how do we learn the model?
  - Find the best model parameters
  - Case 1: supervised (tags are annotated)
    - Maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text)
    - Forward-backward algorithm