Lecture 12: EM Algorithm
Kai-Wei Chang, CS @ University of Virginia


  1. Lecture 12: EM Algorithm
     Kai-Wei Chang, CS @ University of Virginia
     kw@kwchang.net
     Course webpage: http://kwchang.net/teaching/NLP16

  2. Three basic problems for HMMs
     - Likelihood of the input: forward algorithm (a minimal sketch
       follows below). How likely is it that the sentence "I love cat"
       occurs?
     - Decoding (tagging) the input: Viterbi algorithm. What are the
       POS tags of "I love cat"?
     - Estimation (learning): how do we learn the model? Find the best
       model parameters.
       - Case 1: supervised (tags are annotated): maximum likelihood
         estimation (MLE)
       - Case 2: unsupervised (only unannotated text): forward-backward
         algorithm
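To make the likelihood problem concrete, here is a minimal Python sketch of the forward algorithm. The function name and table layout are my own; the start and transition values are the ones used throughout the lecture's ice cream example, and the emission values are the ones estimated later on slide 12.

    # A minimal forward-algorithm sketch: P(obs) = sum over all tag paths.
    def forward_likelihood(obs, states, start_p, trans_p, emit_p):
        # alpha[s] = P(obs[0..t], tag at position t is s)
        alpha = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
        for o in obs[1:]:
            alpha = {s: sum(alpha[r] * trans_p[r].get(s, 0.0) for r in states)
                        * emit_p[s].get(o, 0.0)
                     for s in states}
        return sum(alpha.values())

    # Two-state weather HMM (C = cold, H = hot); observations are ice creams.
    states = ["C", "H"]
    start_p = {"C": 0.5, "H": 0.5}
    trans_p = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}
    emit_p  = {"C": {1: 0.5, 2: 0.5}, "H": {2: 0.625, 3: 0.375}}
    print(forward_likelihood([1, 2, 2], states, start_p, trans_p, emit_p))
    # -> 0.07125 (up to floating point)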

  3. EM algorithm
     - POS induction: can we tag POS without annotated data?
     - An old idea
     - Good mathematical intuition
     - Tutorial papers:
       ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf
       http://people.csail.mit.edu/regina/6864/em_notes_mike.pdf

  4. Hard EM (Intuition)
     - We don't know the hidden states (i.e., POS tags).
     - But if we know the hidden states, we can estimate the model, and
       if we know the model, we can tag the input (recapped on the next
       two slides).

  5. Recap: Learning from labeled data
     - If we know the hidden states (labels), we count how often we see
       each tag bigram t_{i-1} t_i and each word-tag pair (w_i, t_i),
       then normalize (see the counting sketch below).

     Tags:         C C H H C H    H H C H H H
     Observations: 1 2 2 2 3 2    2 3 1 2 3 2
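A minimal sketch of this count-and-normalize step, using the labeled data from this slide (helper names and data layout are mine):

    # Supervised MLE for an HMM from labeled (tag, word) sequences.
    from collections import Counter, defaultdict

    tagged = [
        [("C", 1), ("C", 2), ("H", 2), ("H", 2), ("C", 3), ("H", 2)],
        [("H", 2), ("H", 3), ("C", 1), ("H", 2), ("H", 3), ("H", 2)],
    ]

    trans, emit = Counter(), Counter()
    for seq in tagged:
        prev = "Start"
        for tag, word in seq:
            trans[(prev, tag)] += 1   # count the tag bigram t_{i-1} t_i
            emit[(tag, word)] += 1    # count the word-tag pair (w_i, t_i)
            prev = tag

    # Normalize counts keyed by (given, outcome) into conditionals.
    def normalize(counts):
        totals = defaultdict(float)
        for (given, _), c in counts.items():
            totals[given] += c
        return {k: c / totals[k[0]] for k, c in counts.items()}

    P_trans = normalize(trans)   # P(t_i | t_{i-1})
    P_emit  = normalize(emit)    # P(w_i | t_i)
    print(P_emit[("C", 1)])      # fraction of C tags that emit 1 -> 0.5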

  6. Recap: Tagging the input
     - If we know the model, we can find the best tag sequence (the
       Viterbi algorithm).

  7. Hard EM (Intuition)
     - We don't know the hidden states (i.e., POS tags).
       1. Let's guess!
       2. Then we have labels, so we can estimate the model.
       3. Check whether the model is consistent with the labels we
          guessed; if not, go back to Step 1.

  8. Let's make a guess.

              P(…|C)   P(…|H)   P(…|Start)
     P(1|…)     ?        0         -
     P(2|…)     ?        ?         -
     P(3|…)     0        ?         -
     P(C|…)    0.8      0.2       0.5
     P(H|…)    0.2      0.8       0.5

     Tags:         ? ? ? ? ? ?    ? ? ? ? ? ?
     Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  9. These are obvious: P(1|H) = 0, so every position that emits a 1
     must be tagged C, and P(3|C) = 0, so every position that emits a 3
     must be tagged H. (Model table unchanged from slide 8.)

     Tags:         C ? ? ? H ?    ? H C ? H ?
     Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  10. Guess more. (Model table unchanged from slide 8.)

      Tags:         C C H H H ?    H H C ? H H
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  11. Guess all of them. Now we can estimate the model by maximum
      likelihood. (Model table unchanged from slide 8.)

      Tags:         C C H H H C    H H C H H H
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  12. Is our guess consistent with the model? Counting from the guessed
      tags: C emits two 1s and two 2s, so P(1|C) = P(2|C) = 0.5; H emits
      five 2s and three 3s, so P(2|H) = 5/8 = 0.625 and
      P(3|H) = 3/8 = 0.375.

               P(…|C)   P(…|H)   P(…|Start)
      P(1|…)    0.5       0         -
      P(2|…)    0.5     0.625       -
      P(3|…)     0      0.375       -
      P(C|…)    0.8      0.2       0.5
      P(H|…)    0.2      0.8       0.5

      Tags:         C C H H H C    H H C H H H
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  13. How do we find the latent states based on our model? Viterbi!
      (Sketched below; model table unchanged from slide 12.)

      Tags:         ? ? ? ? ? ?    ? ? ? ? ? ?
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2
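A minimal Viterbi sketch, plugged with the model just estimated on slide 12 (function name and table layout are mine):

    # Viterbi: the single most probable tag path for an observation sequence.
    def viterbi(obs, states, start_p, trans_p, emit_p):
        # best[s] = probability of the best tag path ending in state s
        best = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
        back = []  # back[t][s] = best predecessor of state s at step t
        for o in obs[1:]:
            ptr, new = {}, {}
            for s in states:
                prev = max(states, key=lambda r: best[r] * trans_p[r].get(s, 0.0))
                ptr[s] = prev
                new[s] = best[prev] * trans_p[prev].get(s, 0.0) * emit_p[s].get(o, 0.0)
            back.append(ptr)
            best = new
        last = max(states, key=lambda s: best[s])  # best final state
        path = [last]
        for ptr in reversed(back):  # follow the back-pointers
            path.append(ptr[path[-1]])
        return list(reversed(path))

    states = ["C", "H"]
    start_p = {"C": 0.5, "H": 0.5}
    trans_p = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}
    emit_p  = {"C": {1: 0.5, 2: 0.5}, "H": {2: 0.625, 3: 0.375}}
    print(viterbi([1, 2, 2, 2, 3, 2], states, start_p, trans_p, emit_p))
    # -> ['C', 'H', 'H', 'H', 'H', 'H'], matching the next slide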

  14. Something is wrong... the Viterbi tags do not match our guess for
      the first sequence. (Model table unchanged from slide 12.)

      From Viterbi: C H H H H H    H H C H H H
      Our guess:    C C H H H C    H H C H H H
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  15. That's fine; let's do it again, re-estimating the model from the
      Viterbi tags.

               P(…|C)   P(…|H)   P(…|Start)
      P(1|…)     1        0         -
      P(2|…)     0       0.7        -
      P(3|…)     0       0.3        -
      P(C|…)    0.8      0.2       0.5
      P(H|…)    0.2      0.8       0.5

      Tags:         C H H H H H    H H C H H H
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  16. This time it is consistent: Viterbi returns exactly the tags we
      used to estimate the model. (Model table unchanged from slide 15.)

      From Viterbi: C H H H H H    H H C H H H
      Our guess:    C H H H H H    H H C H H H
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  17. Is this the only solution? No! EM is sensitive to initialization.
      Here is another consistent model and tagging:

               P(…|C)   P(…|H)   P(…|Start)
      P(1|…)   0.22       0         -
      P(2|…)   0.77       0         -
      P(3|…)     0        1         -
      P(C|…)    0.8      0.2       0.5
      P(H|…)    0.2      0.8       0.5

      Tags:         C C C C H C    C H C C H C
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  18. How about this, where the transition probabilities are unknown
      too?

               P(…|C)   P(…|H)   P(…|Start)
      P(1|…)     ?        0         -
      P(2|…)     ?        ?         -
      P(3|…)     0        ?         -
      P(C|…)     ?        ?        0.5
      P(H|…)     ?        ?        0.5

      Tags:         ? ? ? ? ? ?    ? ? ? ? ? ?
      Observations: 1 2 2 2 3 2    2 3 1 2 3 2

  19. Hard EM
      - We don't know the hidden states (i.e., POS tags).
        1. Guess based on our model: find the best sequence using the
           Viterbi algorithm.
        2. Then we have labels, so we can estimate the model:
           maximum likelihood estimation.
        3. Check whether the model is consistent with the labels we
           guessed; if not, go back to Step 1.
      (The full loop is sketched below.)
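Putting the pieces together, a minimal sketch of the hard EM loop. It reuses the viterbi() sketch from slide 13; estimate() is a compact version of the count-and-normalize step, the start distribution is kept fixed, and smoothing / unseen-event handling are omitted for brevity (all names are mine):

    from collections import Counter

    def estimate(corpus, guesses, states):
        # MLE from the guessed labels, as in the slide 5 sketch.
        trans, emit = Counter(), Counter()
        for obs, tags in zip(corpus, guesses):
            prev = "Start"
            for tag, o in zip(tags, obs):
                trans[(prev, tag)] += 1
                emit[(tag, o)] += 1
                prev = tag
        def norm(counts, given):
            tot = sum(c for k, c in counts.items() if k[0] == given)
            return {k[1]: c / tot for k, c in counts.items() if k[0] == given}
        trans_p = {g: norm(trans, g) for g in states + ["Start"]}
        emit_p = {g: norm(emit, g) for g in states}
        return trans_p, emit_p

    def hard_em(corpus, states, start_p, trans_p, emit_p, max_iters=50):
        guesses = None
        for _ in range(max_iters):
            # Step 1: guess tags with the current model (Viterbi).
            new = [viterbi(obs, states, start_p, trans_p, emit_p)
                   for obs in corpus]
            # Step 3: consistent? stop once the guesses no longer change.
            if new == guesses:
                break
            guesses = new
            # Step 2: re-estimate the model from the guessed labels (MLE).
            trans_p, emit_p = estimate(corpus, guesses, states)
        return trans_p, emit_p, guesses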

  20. Soft EM
      - We don't know the hidden states (i.e., POS tags).
        1. Guess based on our model: find the best sequence using the
           Viterbi algorithm.
        2. Then we have labels, so we can estimate the model:
           maximum likelihood estimation.
           -> Let's use expected counts instead! (Spelled out below.)
        3. Check whether the model is consistent with the labels we
           guessed; if not, go back to Step 1.
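In soft EM, step 2 replaces the counts from one guessed tagging with expected counts: every possible event is counted, weighted by its posterior probability under the current model. In standard EM notation (mine, not the slides'):

    E[c(t \to t')] = \sum_i P(t_i = t,\, t_{i+1} = t' \mid w_1 \dots w_n; \theta)
    E[c(t, w)]     = \sum_{i:\, w_i = w} P(t_i = t \mid w_1 \dots w_n; \theta)

These posteriors are exactly what the forward-backward algorithm from slide 2 computes; the next slides work out the expected counts on a small example.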

  21. Expected counts

               P(…|C)   P(…|H)   P(…|Start)
      P(1|…)    0.8       0         -
      P(2|…)    0.2      0.2        -
      P(3|…)     0       0.8        -
      P(C|…)    0.8      0.2       0.5
      P(H|…)    0.2      0.8       0.5

      Tags:         ? ? ?
      Observations: 1 2 2

  22. Expected counts: some tag sequences are more likely to occur than
      others. (Model table unchanged from slide 21. Any tagging that
      puts H on the 1 has probability zero, since P(1|H) = 0.)

      Candidate taggings of the observations 1 2 2:

      Tags:         C C C    C C H    C H C    C H H
      Observations: 1 2 2    1 2 2    1 2 2    1 2 2

  23. Expected counts (model table unchanged from slide 21):

      Tags:          C C C      C C H      C H C      C H H
      Observations:  1 2 2      1 2 2      1 2 2      1 2 2
      P(tags, obs):  0.01024    0.00256    0.00064    0.00256

  24. Expected counts: assume we draw 100,000 random samples (model
      table unchanged from slide 21; reproduced in the sketch below):

      Tags:          C C C    C C H    C H C    C H H
      Observations:  1 2 2    1 2 2    1 2 2    1 2 2
      Samples:       1024     256      64       256
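Because the example is tiny, the numbers on slides 23-24 can be reproduced by brute force: enumerate every tagging of 1 2 2, compute the joint probability P(tags, obs), and scale by 100,000 samples (variable names are mine; the probabilities are from the slide 21 table):

    # Enumerate all taggings of the observations 1 2 2 under the slide 21
    # model; print P(tags, obs) and the expected count in 100,000 samples.
    from itertools import product

    start_p = {"C": 0.5, "H": 0.5}
    trans_p = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}
    emit_p  = {"C": {1: 0.8, 2: 0.2}, "H": {2: 0.2, 3: 0.8}}
    obs = [1, 2, 2]

    for tags in product("CH", repeat=len(obs)):
        p = start_p[tags[0]] * emit_p[tags[0]].get(obs[0], 0.0)
        for prev, tag, o in zip(tags, tags[1:], obs[1:]):
            p *= trans_p[prev][tag] * emit_p[tag].get(o, 0.0)
        if p > 0:
            print("".join(tags), round(p, 5), round(p * 100_000))
    # CCC 0.01024 1024
    # CCH 0.00256 256
    # CHC 0.00064 64
    # CHH 0.00256 256

Normalizing these four joint probabilities gives the posterior over taggings; the expected counts used in soft EM are the counts from each tagging weighted by that posterior.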
