Lecture 12: EM Algorithm
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
Three basic problems for HMMs
- Likelihood of the input: forward algorithm
  (How likely is it that the sentence "I love cat" occurs?)
- Decoding (tagging) the input: Viterbi algorithm
  (What are the POS tags of "I love cat"?)
- Estimation (learning): how do we learn the model?
  Find the best model parameters.
  - Case 1: supervised (tags are annotated): maximum likelihood estimation (MLE)
  - Case 2: unsupervised (only unannotated text): forward-backward algorithm
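For the first problem, here is a minimal sketch of the forward algorithm in Python. The function name and the dictionary-based model representation are my own illustration, not notation from the lecture:

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """P(obs) under an HMM: sums over all state sequences in O(n * |states|^2) time."""
    # alpha[s] = P(obs[0..t], state at time t is s); updated in place for each t
    alpha = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s].get(o, 0.0) *
                    sum(alpha[r] * trans_p[r].get(s, 0.0) for r in states)
                 for s in states}
    return sum(alpha.values())
```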
EM algorithm
- POS induction: can we tag POS without annotated data?
- An old idea
- Good mathematical intuition
- Tutorial papers:
  ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/tr-97-021.pdf
  http://people.csail.mit.edu/regina/6864/em_notes_mike.pdf
Hard EM (Intuition)
- We don't know the hidden states (i.e., POS tags).
- If we know the model… (see the two recap slides that follow)
Recap: Learning from Labeled Data
- If we know the hidden states (labels):

  Tags: C C H H C H     Tags: H H C H H H
  Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2

- We count how often we see each tag bigram t_{i-1} t_i and each word-tag pair (w_i, t_i), then normalize.
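A minimal Python sketch of this count-and-normalize step. The function name, the data layout, and the use of a special "Start" context are my own choices for illustration; the toy data are the two tagged sequences from the slide:

```python
from collections import defaultdict

def mle_estimate(tagged_sequences):
    """Relative-frequency (MLE) estimates for an HMM from tagged sequences.

    tagged_sequences: list of sentences, each a list of (word, tag) pairs.
    Returns (transition, emission) dictionaries of probabilities.
    """
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))

    for sent in tagged_sequences:
        prev = "Start"                       # special start context
        for word, tag in sent:
            trans_counts[prev][tag] += 1     # count tag bigram t_{i-1} t_i
            emit_counts[tag][word] += 1      # count word-tag pair (w_i, t_i)
            prev = tag

    def normalize(counts):
        return {ctx: {x: c / sum(row.values()) for x, c in row.items()}
                for ctx, row in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)

# The two toy sequences from the slide
data = [list(zip("122232", "CCHHCH")),   # obs 1 2 2 2 3 2, tags C C H H C H
        list(zip("231232", "HHCHHH"))]   # obs 2 3 1 2 3 2, tags H H C H H H
transition, emission = mle_estimate(data)
print(emission["H"])   # relative frequencies of 1/2/3 under tag H
```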
Recap: Tagging the Input
- If we know the model, we can find the best tag sequence.
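A minimal Viterbi decoder in Python for reference. Names and the dictionary-based model representation are mine; the sketch works in log space so that zero probabilities (like P(1|H) = 0 in the example that follows) are handled cleanly:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs under an HMM (dynamic programming)."""
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    # delta[t][s]: best log-probability of any path that ends in state s at time t
    delta = [{s: logp(start_p.get(s, 0.0)) + logp(emit_p.get(s, {}).get(obs[0], 0.0))
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            best_prev = max(states,
                            key=lambda r: delta[t-1][r] + logp(trans_p.get(r, {}).get(s, 0.0)))
            delta[t][s] = (delta[t-1][best_prev]
                           + logp(trans_p.get(best_prev, {}).get(s, 0.0))
                           + logp(emit_p.get(s, {}).get(obs[t], 0.0)))
            back[t][s] = best_prev
    # Trace back from the best final state
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```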
Hard EM (Intuition)
- We don't know the hidden states (i.e., POS tags).
  1. Let's guess!
  2. Then we have labels; we can estimate the model.
  3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1.
Let's make a guess

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      ?        0         -
  P(2|…)      ?        ?         -
  P(3|…)      0        ?         -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: ? ? ? ? ? ?     Tags: ? ? ? ? ? ?
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
These are obvious

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      ?        0         -
  P(2|…)      ?        ?         -
  P(3|…)      0        ?         -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C ? H ? ? ?     Tags: ? H C ? H ?
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
Guess more

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      ?        0         -
  P(2|…)      ?        ?         -
  P(3|…)      0        ?         -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C C H H ? H     Tags: H H C ? H H
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
Guess all of them
Now we can run maximum likelihood estimation.

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      ?        0         -
  P(2|…)      ?        ?         -
  P(3|…)      0        ?         -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C C H H C H     Tags: H H C H H H
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
Is our guess consistent with the model?

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.5       0         -
  P(2|…)     0.5     0.625       -
  P(3|…)      0      0.375       -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C C H H C H     Tags: H H C H H H
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
How to find latent states based on our model? Viterbi!

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.5       0         -
  P(2|…)     0.5     0.625       -
  P(3|…)      0      0.375       -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: ? ? ? ? ? ?     Tags: ? ? ? ? ? ?
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
Something wrong…

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.5       0         -
  P(2|…)     0.5     0.625       -
  P(3|…)      0      0.375       -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

From Viterbi: C H H H H H     From Viterbi: H H C H H H
Our guess:    C C H H C H     Our guess:    H H C H H H
Obs:          1 2 2 2 3 2     Obs:          2 3 1 2 3 2
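As a sanity check, running the viterbi sketch from earlier with the re-estimated emission table above reproduces the two decoded sequences shown on this slide (the variable names are mine; the probabilities are the slide's):

```python
states = ["C", "H"]
start_p = {"C": 0.5, "H": 0.5}
trans_p = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}
emit_p  = {"C": {"1": 0.5, "2": 0.5, "3": 0.0},
           "H": {"1": 0.0, "2": 0.625, "3": 0.375}}

print(viterbi(list("122232"), states, start_p, trans_p, emit_p))
# -> ['C', 'H', 'H', 'H', 'H', 'H']
print(viterbi(list("231232"), states, start_p, trans_p, emit_p))
# -> ['H', 'H', 'C', 'H', 'H', 'H']
```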
That's fine. Let's do it again.

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      1        0         -
  P(2|…)      0       0.7        -
  P(3|…)      0       0.3        -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C H H H H H     Tags: H H C H H H
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
This time it is consistent

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      1        0         -
  P(2|…)      0       0.7        -
  P(3|…)      0       0.3        -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

From Viterbi: C H H H H H     From Viterbi: H H C H H H
Our guess:    C H H H H H     Our guess:    H H C H H H
Obs:          1 2 2 2 3 2     Obs:          2 3 1 2 3 2
Only one solution? No! EM is sensitive to initialization.

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.22      0         -
  P(2|…)     0.77      0         -
  P(3|…)      0        1         -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C C H C C C     Tags: C H C C H C
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
How about this?

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)      ?        0         -
  P(2|…)      ?        ?         -
  P(3|…)      0        ?         -
  P(C|…)      ?        ?        0.5
  P(H|…)      ?        ?        0.5

Tags: ? ? ? ? ? ?     Tags: ? ? ? ? ? ?
Obs:  1 2 2 2 3 2     Obs:  2 3 1 2 3 2
Hard EM
- We don't know the hidden states (i.e., POS tags).
  1. Let's guess based on our model!
     (Find the best sequence using the Viterbi algorithm.)
  2. Then we have labels; we can estimate the model.
     (Maximum likelihood estimation.)
  3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1.
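A minimal hard-EM loop, reusing the viterbi and mle_estimate sketches above. The stopping test (Viterbi tags unchanged) and the handling of the "Start" context are my own choices for illustration:

```python
def hard_em(sequences, states, start_p, trans_p, emit_p, max_iters=50):
    """Hard EM for an HMM: alternate Viterbi tagging with MLE re-estimation."""
    prev_tags = None
    for _ in range(max_iters):
        # Step 1: guess labels with the current model (Viterbi decoding)
        tags = [viterbi(obs, states, start_p, trans_p, emit_p) for obs in sequences]
        # Step 3: if the guesses did not change, model and labels are consistent
        if tags == prev_tags:
            break
        prev_tags = tags
        # Step 2: treat the guesses as labels and re-estimate the model by MLE
        labeled = [list(zip(obs, tag_seq)) for obs, tag_seq in zip(sequences, tags)]
        trans_p, emit_p = mle_estimate(labeled)
        start_p = trans_p.pop("Start", start_p)  # mle_estimate stores P(.|Start) here
    return start_p, trans_p, emit_p
```

Note that this sketch re-estimates the start and transition probabilities as well, whereas the walkthrough on the slides only updates the emission table and keeps the transition rows fixed for simplicity.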
Soft EM
- We don't know the hidden states (i.e., POS tags).
  1. Let's guess based on our model!
     (Hard EM: find the best sequence using the Viterbi algorithm. Soft EM: use expected counts instead!)
  2. Then we have labels; we can estimate the model.
     (Maximum likelihood estimation, now with expected counts.)
  3. Check whether the model is consistent with the labels we guessed; if not, go back to Step 1.
Expected Counts

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.8       0         -
  P(2|…)     0.2      0.2        -
  P(3|…)      0       0.8        -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: ? ? ?
Obs:  1 2 2
Expected Counts
Some sequences are more likely to occur than the others.

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.8       0         -
  P(2|…)     0.2      0.2        -
  P(3|…)      0       0.8        -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

Tags: C C C     Tags: C C H
Obs:  1 2 2     Obs:  1 2 2

Tags: C H C     Tags: C H H
Obs:  1 2 2     Obs:  1 2 2
Expected Counts

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.8       0         -
  P(2|…)     0.2      0.2        -
  P(3|…)      0       0.8        -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

P = 0.01024          P = 0.00256
Tags: C C C          Tags: C C H
Obs:  1 2 2          Obs:  1 2 2

P = 0.00064          P = 0.00256
Tags: C H C          Tags: C H H
Obs:  1 2 2          Obs:  1 2 2
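These four joint probabilities can be reproduced by brute-force enumeration. A small Python sketch; the model dictionaries mirror the table above and the helper names are mine:

```python
from itertools import product

start_p = {"C": 0.5, "H": 0.5}
trans_p = {"C": {"C": 0.8, "H": 0.2}, "H": {"C": 0.2, "H": 0.8}}
emit_p  = {"C": {"1": 0.8, "2": 0.2, "3": 0.0},
           "H": {"1": 0.0, "2": 0.2, "3": 0.8}}
obs = ["1", "2", "2"]

def joint(tags, obs):
    """P(tags, obs) under the HMM: start probability * emissions * transitions."""
    p = start_p[tags[0]] * emit_p[tags[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans_p[tags[t-1]][tags[t]] * emit_p[tags[t]][obs[t]]
    return p

for tags in product("CH", repeat=len(obs)):
    p = joint(tags, obs)
    if p > 0:   # sequences starting with H are impossible because P(1|H) = 0
        print("".join(tags), round(p, 5))
# CCC 0.01024, CCH 0.00256, CHC 0.00064, CHH 0.00256
```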
Expected Counts
Assume we draw 100,000 random samples…

            P(…|C)   P(…|H)   P(…|Start)
  P(1|…)     0.8       0         -
  P(2|…)     0.2      0.2        -
  P(3|…)      0       0.8        -
  P(C|…)     0.8      0.2       0.5
  P(H|…)     0.2      0.8       0.5

about 1024 samples     about 256 samples
Tags: C C C            Tags: C C H
Obs:  1 2 2            Obs:  1 2 2

about 64 samples       about 256 samples
Tags: C H C            Tags: C H H
Obs:  1 2 2            Obs:  1 2 2
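In practice the expected counts are not obtained by sampling or enumeration but with the forward-backward algorithm mentioned at the start of the lecture. Below is a minimal sketch of its E-step for one sequence; the function and variable names are my own, and the model dictionaries and obs are the ones defined in the enumeration snippet above:

```python
from collections import defaultdict

def forward_backward_counts(obs, states, start_p, trans_p, emit_p):
    """Posterior state marginals and expected counts for one sequence (soft-EM E-step)."""
    n = len(obs)
    # Forward pass: alpha[t][s] = P(obs[0..t], state at t is s)
    alpha = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    for t in range(1, n):
        alpha.append({s: emit_p[s].get(obs[t], 0.0) *
                         sum(alpha[t-1][r] * trans_p[r].get(s, 0.0) for r in states)
                      for s in states})
    # Backward pass: beta[t][s] = P(obs[t+1..n-1] | state at t is s)
    beta = [dict() for _ in range(n)]
    beta[n-1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        beta[t] = {s: sum(trans_p[s].get(r, 0.0) * emit_p[r].get(obs[t+1], 0.0) * beta[t+1][r]
                          for r in states)
                   for s in states}
    z = sum(alpha[n-1][s] for s in states)        # likelihood of the whole sequence
    # gamma[t][s] = P(state at t is s | obs); these weights give the expected counts
    gamma = [{s: alpha[t][s] * beta[t][s] / z for s in states} for t in range(n)]
    exp_emit = defaultdict(lambda: defaultdict(float))
    exp_trans = defaultdict(lambda: defaultdict(float))
    for t in range(n):
        for s in states:
            exp_emit[s][obs[t]] += gamma[t][s]
    for t in range(n - 1):
        for r in states:
            for s in states:
                exp_trans[r][s] += (alpha[t][r] * trans_p[r].get(s, 0.0) *
                                    emit_p[s].get(obs[t + 1], 0.0) * beta[t + 1][s] / z)
    return gamma, exp_trans, exp_emit

gamma, exp_trans, exp_emit = forward_backward_counts(obs, ["C", "H"],
                                                     start_p, trans_p, emit_p)
print(round(exp_emit["H"]["2"], 2))   # expected number of 2s emitted by H here: 0.52
```

Re-estimating the model from these fractional counts, instead of the hard Viterbi counts, and iterating is exactly soft EM.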