Lecture 11: Viterbi and Forward Algorithms
Kai-Wei Chang, CS @ University of Virginia (PowerPoint PPT presentation)



  1. Lecture 11: Viterbi and Forward Algorithms. Kai-Wei Chang, CS @ University of Virginia. kw@kwchang.net. Course webpage: http://kwchang.net/teaching/NLP16

  2. Quiz 1: score distribution (histogram over the bins [0-5], [6-10], [11-15], [16-20], [21-25]). Max: 24; Mean: 18.1; Median: 18; SD: 3.36.

  3. This lecture: two important algorithms for inference, the forward algorithm and the Viterbi algorithm.

  4. (Figure slide; no extractable text.)

  5. Three basic problems for HMMs
     - Likelihood of the input: how likely is it that the sentence "I love cat" occurs? Solved by the forward algorithm.
     - Decoding (tagging) the input: what are the POS tags of "I love cat"? Solved by the Viterbi algorithm.
     - Estimation (learning): how to learn the model? Find the best model parameters.
       - Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE).
       - Case 2: unsupervised (only unannotated text): forward-backward algorithm.

  6. Likelihood of the input
     - How likely is it that the sentence "I love cat" occurs?
     - Compute P(x | μ) for the input x and HMM μ.
     - Remember, we model P(t, x | μ).
     - P(x | μ) = Σ_t P(t, x | μ): the marginal probability, a sum over all possible tag sequences t.

  7. Likelihood of the input
     - How likely is it that the sentence "I love cat" occurs?
     - P(x | μ) = Σ_t P(t, x | μ) = Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i-1})
     - Assume we have 2 tags, N and V. Then
       P("I love cat" | μ) = P("I love cat", "N N N" | μ) + P("I love cat", "N N V" | μ)
                           + P("I love cat", "N V N" | μ) + P("I love cat", "N V V" | μ)
                           + P("I love cat", "V N N" | μ) + P("I love cat", "V N V" | μ)
                           + P("I love cat", "V V N" | μ) + P("I love cat", "V V V" | μ)
     - Now, let's write down P("I love cat" | μ) with 45 tags...
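Enumerating tag sequences this way costs T^n products for T tags and n words, which is exactly what the forward algorithm avoids. A minimal brute-force sketch of the sum above, assuming illustrative dictionaries trans[(t', t)] = P(t_i = t | t_{i-1} = t') and emit[(t, w)] = P(x_i = w | t_i = t) plus a start symbol "<S>" (these names are not from the slides):

    from itertools import product

    def brute_force_likelihood(words, tags, trans, emit, start="<S>"):
        """Sum P(t, x) over all |tags|**len(words) tag sequences (exponential!)."""
        total = 0.0
        for t in product(tags, repeat=len(words)):      # e.g. NNN, NNV, ..., VVV
            p, prev = 1.0, start
            for tag, w in zip(t, words):
                # one transition and one emission per position
                p *= trans.get((prev, tag), 0.0) * emit.get((tag, w), 0.0)
                prev = tag
            total += p
        return total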

  8. Trellis diagram
     - μ is the parameter set of the HMM; let's ignore it in some slides for simplicity's sake.
     - Goal: P(x | μ) = Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i-1})
     - (Trellis figure: columns i = 1, 2, 3, 4; edges labeled with transition probabilities such as P(t_2 = 2 | t_1 = 1) and P(t_3 = 1 | t_2 = 1), and emission probabilities such as P(x_3 | t_3 = 1).)

  9. Trellis diagram
     - P("I eat a fish", "N V V A"): one path through the trellis (states N, V, A in each of the columns i = 1..4), with edges labeled P(N | <S>), P(I | N), P(V | N), P(eat | V), P(V | V), P(a | V), P(A | V), P(fish | A).
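Reading the path off the trellis, the joint probability is one transition times one emission per position (an expansion implied by the figure rather than written on the slide):

    P("I eat a fish", "N V V A") = P(N | <S>) P(I | N) · P(V | N) P(eat | V) · P(V | V) P(a | V) · P(A | V) P(fish | A)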

  10. Trellis diagram
      - Σ_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i-1}): a sum over all paths through the trellis (states N, V, A in each of the columns i = 1..4).

  11. Dynamic programming
      - Recursively decompose a problem into smaller sub-problems.
      - Similar to mathematical induction:
        - Base step: initial values for i = 1.
        - Inductive step: assume we know the values for i = k; let's compute i = k + 1.

  12. Forward algorithm
      - Inductive step: from i = k-1 to i = k. The quantity of interest is P(x_{1:k}, t_k = q).
      - Notation: t_{1:k} is a tag sequence of length k; x_{1:k} = x_1, x_2, ..., x_k.
      - Σ_{t_{1:k}} P(t_{1:k}, x_{1:k}) = Σ_q Σ_{t_{1:k-1}} P(t_{1:k-1}, x_{1:k}, t_k = q)
        (outer sum: the tag q at position i = k; inner sum: the tag sequences for positions 1..k-1)
      - (Trellis figure: states N, V, A; columns i = 1..4.)

  13. Forward algorithm
      - Inductive step: from i = k-1 to i = k.
      - Σ_{t_{1:k}} P(t_{1:k}, x_{1:k}) = Σ_q P(x_{1:k}, t_k = q)
      - P(x_{1:k}, t_k = q) = Σ_{q'} P(x_{1:k}, t_{k-1} = q', t_k = q)
        = Σ_{q'} P(x_{1:k-1}, t_{k-1} = q') P(t_k = q | t_{k-1} = q') P(x_k | t_k = q)
      - (Trellis figure: states N, V, A; columns i = k-1 and i = k.)

  14. Forward algorithm
      - Inductive step: from i = k-1 to i = k.
      - Call P(x_{1:k}, t_k = q) the forward probability α_k(q); then P(x_{1:k-1}, t_{k-1} = q') is α_{k-1}(q').
      - P(x_{1:k}, t_k = q) = Σ_{q'} P(x_{1:k}, t_{k-1} = q', t_k = q)
        = Σ_{q'} P(x_{1:k-1}, t_{k-1} = q') P(t_k = q | t_{k-1} = q') P(x_k | t_k = q)

  15. Forward algorithm
      - Inductive step: α_k(q) = Σ_{q'} α_{k-1}(q') P(t_k = q | t_{k-1} = q') P(x_k | t_k = q)

  16. Forward algorithm
      - Inductive step: α_k(q) = Σ_{q'} α_{k-1}(q') P(t_k = q | t_{k-1} = q') P(x_k | t_k = q)
        = P(x_k | t_k = q) Σ_{q'} α_{k-1}(q') P(t_k = q | t_{k-1} = q')

  17. Forward algorithm
      - Base step: i = 1.
      - α_1(q) = P(x_1 | t_1 = q) P(t_1 = q | t_0), where P(t_1 = q | t_0) is the initial probability of starting in tag q.
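Putting the base and inductive steps together with the marginalization from slide 12 gives the full recursion (the termination line is implied rather than written on these slides):

    Base:         α_1(q) = P(x_1 | t_1 = q) P(t_1 = q | t_0)
    Induction:    α_k(q) = P(x_k | t_k = q) Σ_{q'} α_{k-1}(q') P(t_k = q | t_{k-1} = q')
    Termination:  P(x | μ) = Σ_q α_n(q)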

  18. Implementation using an array (from Julia Hockenmaier, Intro to NLP)
      - Use an n × T table (sentence length × number of tags) to keep α_k(q).

  19. Implementation using an array
      - Initialization: Trellis[1][q] = P(x_1 | t_1 = q) P(t_1 = q | t_0)

  20. Implementation using an array
      - Induction: Trellis[k][q] = α_k(q) = P(x_k | t_k = q) Σ_{q'} α_{k-1}(q') P(t_k = q | t_{k-1} = q')

  21. The forward algorithm (pseudo code)
      - Initialize the forward table (fwd) to 0, fill the first column with the base step, then fill each later column with the induction step; see the sketch below.
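A minimal Python sketch of this procedure, reusing the illustrative trans/emit dictionaries and "<S>" start symbol introduced with the brute-force sketch above (these names are not from the slides):

    def forward(words, tags, trans, emit, start="<S>"):
        """Forward algorithm: returns P(words) under the HMM.

        trans[(q', q)] = P(t_k = q | t_{k-1} = q')
        emit[(q, w)]   = P(x_k = w | t_k = q)
        """
        n = len(words)
        # alpha[k][q] holds the forward probability for position k+1 (0-indexed list)
        alpha = [{q: 0.0 for q in tags} for _ in range(n)]

        # Base step: alpha_1(q) = P(x_1 | q) * P(q | <S>)
        for q in tags:
            alpha[0][q] = emit.get((q, words[0]), 0.0) * trans.get((start, q), 0.0)

        # Inductive step: alpha_k(q) = P(x_k | q) * sum_{q'} alpha_{k-1}(q') * P(q | q')
        for k in range(1, n):
            for q in tags:
                total = sum(alpha[k - 1][qp] * trans.get((qp, q), 0.0) for qp in tags)
                alpha[k][q] = emit.get((q, words[k]), 0.0) * total

        # Termination: P(x) = sum_q alpha_n(q)
        return sum(alpha[n - 1].values())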

  22. Jason's ice cream
      - Observations: number of ice cream cones eaten; hidden states: weather C (cold) and H (hot).
      - Emission probabilities (#cones): p(1|C) = 0.5, p(2|C) = 0.4, p(3|C) = 0.1; p(1|H) = 0.1, p(2|H) = 0.2, p(3|H) = 0.7.
      - Transition probabilities: p(C|C) = 0.8, p(H|C) = 0.2; p(C|H) = 0.2, p(H|H) = 0.8; p(C|START) = 0.5, p(H|START) = 0.5.
      - What is P("1, 2, 1")? (Trellis over states C and H for the three observations.)
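As a check on the recursion, plugging the ice-cream numbers into the forward sketch above gives, by hand, α_1 = (0.25, 0.05) for (C, H), α_2 = (0.084, 0.018), α_3 = (0.0354, 0.00312), so P("1, 2, 1") = 0.03852. A usage example (the dictionary literals are just a transcription of the table on this slide):

    # Ice-cream HMM: hidden weather {C, H}, observed #cones {1, 2, 3}.
    tags = ["C", "H"]
    trans = {("<S>", "C"): 0.5, ("<S>", "H"): 0.5,
             ("C", "C"): 0.8, ("C", "H"): 0.2,
             ("H", "C"): 0.2, ("H", "H"): 0.8}
    emit = {("C", "1"): 0.5, ("C", "2"): 0.4, ("C", "3"): 0.1,
            ("H", "1"): 0.1, ("H", "2"): 0.2, ("H", "3"): 0.7}

    print(forward(["1", "2", "1"], tags, trans, emit))  # ≈ 0.03852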

  23. Three basic problems for HMMs (recap)
      - Likelihood of the input: how likely is it that the sentence "I love cat" occurs? Solved by the forward algorithm.
      - Decoding (tagging) the input: what are the POS tags of "I love cat"? Solved by the Viterbi algorithm.
      - Estimation (learning): how to learn the model? Find the best model parameters.
        - Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE).
        - Case 2: unsupervised (only unannotated text): forward-backward algorithm.

  24. Prediction in a generative model
      - Inference: what is the most likely sequence of tags for the given sequence of words?
      - Equivalently: what are the latent states that most likely generate the sequence of words?

  25. Tagging the input
      - Find the best tag sequence for "I love cat".
      - Remember, we model P(t, x | μ).
      - t* = argmax_t P(t, x | μ): find the best one among all possible tag sequences.

  26. Tagging the input
      - Assume we have 2 tags, N and V. Which one is the best:
        P("I love cat", "N N N" | μ), P("I love cat", "N N V" | μ),
        P("I love cat", "N V N" | μ), P("I love cat", "N V V" | μ),
        P("I love cat", "V N N" | μ), P("I love cat", "V N V" | μ),
        P("I love cat", "V V N" | μ), P("I love cat", "V V V" | μ)?
      - Again, we need an efficient algorithm!

  27. Trellis diagram
      - Goal: argmax_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i-1})
      - (Trellis figure: columns i = 1, 2, 3, 4; edges labeled with transition probabilities such as P(t_2 = 2 | t_1 = 1) and P(t_3 = 1 | t_2 = 1), and emission probabilities such as P(x_3 | t_3 = 1).)

  28. Trellis diagram
      - Goal: argmax_t Π_{i=1}^n P(x_i | t_i) P(t_i | t_{i-1})
      - Find the best path through the trellis! (States N, V, A per column, i = 1..4.)
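A minimal Viterbi sketch, under the same trans/emit conventions as the forward code above, showing one standard way to find the best path: replace the sum in the forward induction with a max and keep backpointers. The helper names are illustrative, not from the slides.

    def viterbi(words, tags, trans, emit, start="<S>"):
        """Viterbi: returns (best_score, best_tag_sequence) for the observed words."""
        n = len(words)
        # best[k][q] = probability of the best tag prefix ending in tag q at position k
        best = [{q: 0.0 for q in tags} for _ in range(n)]
        back = [{q: None for q in tags} for _ in range(n)]

        # Base step: same as the forward base step.
        for q in tags:
            best[0][q] = emit.get((q, words[0]), 0.0) * trans.get((start, q), 0.0)

        # Inductive step: max instead of sum, remembering the best predecessor.
        for k in range(1, n):
            for q in tags:
                scores = {qp: best[k - 1][qp] * trans.get((qp, q), 0.0) for qp in tags}
                qp_best = max(scores, key=scores.get)
                back[k][q] = qp_best
                best[k][q] = emit.get((q, words[k]), 0.0) * scores[qp_best]

        # Follow backpointers from the best final tag to recover the path.
        last = max(best[n - 1], key=best[n - 1].get)
        path = [last]
        for k in range(n - 1, 0, -1):
            path.append(back[k][path[-1]])
        path.reverse()
        return best[n - 1][last], path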
