Lecture 12: EM Algorithm Kai-Wei Chang CS @ University of Virginia - PowerPoint PPT Presentation

Lecture 12: EM Algorithm Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/NLP16 CS6501 Natural Language Processing 1

Three basic problems for HMMs v Likelihood of the input: v Forward algorithm How likely the sentence ”I love cat” occurs v Decoding (tagging) the input: v Viterbi algorithm POS tags of ”I love cat” occurs v Estimation (learning): How to learn the model? v Find the best model parameters v Case 1: supervised – tags are annotated v Maximum likelihood estimation (MLE) v Case 2: unsupervised -- only unannotated text v Forward-backward algorithm CS6501 Natural Language Processing 2

EM algorithm v POS induction – can we tag POS without annotated data? v An old idea v Good mathematical intuition v Tutorial paper: ftp://ftp.icsi.berkeley.edu/pub/techreports/1997/t r-97-021.pdf v http://people.csail.mit.edu/regina/6864/em_note s_mike.pdf CS6501 Natural Language Processing 3

Hard EM (Intuition) v We don’t know the hidden states (i.e., POS tags) v If we know the model CS6501 Natural Language Processing 4

Recap: Learning from Labeled Data C C H H C H v If we know the hidden 1 2 2 2 3 2 states (labels) H H C H H H 2 3 1 2 3 2 v we count how often we see 𝑢 "#$ 𝑢 " and w & 𝑢 " then normalize. CS6501 Natural Language Processing 5

Recap: Tagging the input v If we know the model, we can find the best tag sequence CS6501 Natural Language Processing 6

Hard EM (Intuition) v We don’t know the hidden states (i.e., POS tags) 1. Let’s guess! 2. Then, we have labels; we can estimate the model 3. Check if the model is consistent with the labels we guessed; if no → Step 1. CS6501 Natural Language Processing 7

P(…| C) P(… | H) P(…|Start) ( 1| … ) ? 0 - ( 2| … ) ? ? - ( 3 | …) 0 ? - Let’s make a guess ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 ? ? ? ? ? ? 1 2 2 2 3 2 ? ? ? ? ? ? 2 3 1 2 3 2 CS6501 Natural Language Processing 8

P(…| C) P(… | H) P(…|Start) ( 1| … ) ? 0 - ( 2| … ) ? ? - ( 3 | …) 0 ? - These are obvious ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 C ? H ? ? ? 1 2 2 2 3 2 ? H C ? H ? 2 3 1 2 3 2 CS6501 Natural Language Processing 9

P(…| C) P(… | H) P(…|Start) ( 1| … ) ? 0 - ( 2| … ) ? ? - ( 3 | …) 0 ? - Guess more ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 C C H H ? H 1 2 2 2 3 2 H H C ? H H 2 3 1 2 3 2 CS6501 Natural Language Processing 10

P(…| C) P(… | H) P(…|Start) Guess all of them ( 1| … ) ? 0 - ( 2| … ) ? ? - ( 3 | …) 0 ? - Now we can estimate ML ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 C C H H C H 1 2 2 2 3 2 H H C H H H 2 3 1 2 3 2 CS6501 Natural Language Processing 11

P(…| C) P(… | H) P(…|Start) ( 1| … ) 0.5 0 - ( 2| … ) 0.5 0.625 - Does our guess consistent ( 3 | …) 0 0.375 - ( C| …) 0.8 0.2 0.5 with the model? ( H | …) 0.2 0.8 0.5 C C H H C H 1 2 2 2 3 2 H H C H H H 2 3 1 2 3 2 CS6501 Natural Language Processing 12

P(…| C) P(… | H) P(…|Start) ( 1| … ) 0.5 0 - How to find latent states ( 2| … ) 0.5 0.625 - based on our model? ( 3 | …) 0 0.375 - ( C| …) 0.8 0.2 0.5 Viterbi! ( H | …) 0.2 0.8 0.5 ? ? ? ? ? ? 1 2 2 2 3 2 ? ? ? ? ? ? 2 3 1 2 3 2 CS6501 Natural Language Processing 13

P(…| C) P(… | H) P(…|Start) ( 1| … ) 0.5 0 - Something wrong… ( 2| … ) 0.5 0.625 - ( 3 | …) 0 0.375 - ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 From Viterbi C H H H H H C C H H C H 1 2 2 2 3 2 From Viterbi H H C H H H H H C H H H 2 3 1 2 3 2 CS6501 Natural Language Processing 14

P(…| C) P(… | H) P(…|Start) ( 1| … ) 1 0 - It’s fine. ( 2| … ) 0 0.7 - Let’s do again ( 3 | …) 0 0.3 - ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 C H H H H H 1 2 2 2 3 2 H H C H H H 2 3 1 2 3 2 CS6501 Natural Language Processing 15

P(…| C) P(… | H) P(…|Start) ( 1| … ) 1 0 - This time it is consistent ( 2| … ) 0 0.7 - ( 3 | …) 0 0.3 - ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 From Viterbi C H H H H H C H H H H H 1 2 2 2 3 2 From Viterbi H H C H H H H H C H H H 2 3 1 2 3 2 CS6501 Natural Language Processing 16

P(…| C) P(… | H) P(…|Start) Only one solution? ( 1| … ) 0.22 0 - ( 2| … ) 0.77 0 - ( 3 | …) 0 1 - No! ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 EM is sensitive to initialization C C H C C C 1 2 2 2 3 2 C H C C H C 2 3 1 2 3 2 CS6501 Natural Language Processing 17

P(…| C) P(… | H) P(…|Start) How about this? ( 1| … ) ? 0 - ( 2| … ) ? ? - ( 3 | …) 0 ? - ( C| …) ? ? 0.5 ( H | …) ? ? 0.5 ? ? ? ? ? ? 1 2 2 2 3 2 ? ? ? ? ? ? 2 3 1 2 3 2 CS6501 Natural Language Processing 18

Hard EM v We don’t know the hidden states (i.e., POS tags) 1. Let’s guess based on our model! v Find the best sequence using Viterbi algorithm 2. Then, we have labels; we can estimate the model v Maximum Likelihood Estimation 3. Check if the model is consistent with the labels we guessed; if no → Step 1. CS6501 Natural Language Processing 19

Soft EM v We don’t know the hidden states (i.e., POS tags) 1. Let’s guess based on our model! v Find the best sequence using Viterbi algorithm 2. Then, we have labels; we can estimate Let’s use expected counts instead! the model v Maximum Likelihood Estimation 3. Check if the model is consistent with the labels we guessed; if no → Step 1. CS6501 Natural Language Processing 20

P(…| C) P(… | H) P(…|Start) Expected Counts ( 1| … ) 0.8 0 - ( 2| … ) 0.2 0.2 - ( 3 | …) 0 0.8 - ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 ? ? ? 1 2 2 CS6501 Natural Language Processing 21

P(…| C) P(… | H) P(…|Start) Expected Counts ( 1| … ) 0.8 0 - ( 2| … ) 0.2 0.2 - ( 3 | …) 0 0.8 - Some sequences are more ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 likely to occur than the others C C C C C H 1 2 2 1 2 2 C H C C H H 1 2 2 1 2 2 CS6501 Natural Language Processing 22

P(…| C) P(… | H) P(…|Start) Expected Counts ( 1| … ) 0.8 0 - ( 2| … ) 0.2 0.2 - ( 3 | …) 0 0.8 - ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 0.01024 0.00256 C C C C C H 1 2 2 1 2 2 0.00064 0.00256 C H C C H H 1 2 2 1 2 2 CS6501 Natural Language Processing 23

P(…| C) P(… | H) P(…|Start) Expected Counts ( 1| … ) 0.8 0 - ( 2| … ) 0.2 0.2 - Assume we draw 100,000 ( 3 | …) 0 0.8 - random samples… ( C| …) 0.8 0.2 0.5 ( H | …) 0.2 0.8 0.5 1024 256 C C C C C H 1 2 2 1 2 2 64 256 C H C C H H 1 2 2 1 2 2 CS6501 Natural Language Processing 24

Lecture 12: EM Algorithm Kai-Wei Chang CS @ University of Virginia - PowerPoint PPT Presentation

Lecture 12: EM Algorithm Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/NLP16 CS6501 Natural Language Processing 1 Three basic problems for HMMs v Likelihood of the input: v Forward

Odds Algorithm An Online Algorithm Group Fibonado 20. Dec 2016 Group Fibonado Odds Algorithm

Gram-Schmidt algorithm Aim lecture: We use the theory of last lecture to give an algorithm for

Visible Surface Determination CS418 Computer Graphics John C. Hart Painters Algorithm

Algorithm Analysis October 12, 2016 CMPE 250 Algorithm Analysis October 12, 2016 1 / 66

Malaysian Healthy Ageing Society Plenary Lecture Plenary Lecture Plenary Lecture Plenary

Algorithm Engineering (aka. How to Write Fast Code) CS26 S260 Lecture cture 10 Yan n Gu

Shortest path using A Algorithm Introduction History Components of A Algorithm

Stoer-Wagner Algorithm A Minimum Cut Algorithm for Undirected Graphs BigNews CS214: Algorithms

Quiz I Give the SVD-based algorithm for solving least squares, and I justify the algorithm by that

Some More Critical Section Solutions Dr. Liam OConnor University of Edinburgh LFCS (and UNSW)

A-Star Algorithm & Heaps/Priority Queues Mark Redekopp 2 A* Search Algorithm ALGORITHM

Earley algorithm Earley: introduction Example of Earley algorithm Scott Farrar CLMA,

The BBS Algorithm The BBS Algorithm The BBS Algorithm Prof. Paolo Ciaccia Prof. Paolo Ciaccia

Avoiding Register Overflow in the Bakery Algorithm The Bakery++ Algorithm The Bakery algorithm is

Lecture 1 Grade school algorithms Sept. 8, 2017 1 What is an algorithm? An algorithm is a

ECE 242 Data Structures Lecture 2 Algorithm Analysis September 11, 2009 ECE242 L2: Algorithm

A unifying methodology A Gentle Introduction to the EM Algorithm Dempster, Laird & Rubin

The EM Algorithm The EM algorithm Mixture models Why EM works EM variants Learning

Programming Reactive Systems in Scala: Principles and Abstractions Philipp Haller KTH Royal

Rational Krylov Methods for Solving Nonlinear Eigenvalue Problems Roel Van Beeumen

Learning from Unlabeled Data INFO-4604, Applied Machine Learning University of Colorado Boulder

Expectation maximization don't have any labels. Can you still do something? ! Amazingly you can!

FlexMix: Flexible fitting of finite mixtures with the EM algorithm Bettina Gr un Friedrich

EM-like algorithms for nonparametric estimation in multivariate mixtures Didier Chauveau MAPMO -

Sambuz

Useful Links

Newsletter

Mail Us

Lecture 12: EM Algorithm Kai-Wei Chang CS @ University of Virginia - PowerPoint PPT Presentation

Lecture 12: EM Algorithm Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Couse webpage: http://kwchang.net/teaching/NLP16 CS6501 Natural Language Processing 1 Three basic problems for HMMs v Likelihood of the input: v Forward

Odds Algorithm An Online Algorithm Group Fibonado 20. Dec 2016 Group Fibonado Odds Algorithm

Gram-Schmidt algorithm Aim lecture: We use the theory of last lecture to give an algorithm for

Visible Surface Determination CS418 Computer Graphics John C. Hart Painters Algorithm

Algorithm Analysis October 12, 2016 CMPE 250 Algorithm Analysis October 12, 2016 1 / 66

Malaysian Healthy Ageing Society Plenary Lecture Plenary Lecture Plenary Lecture Plenary

Algorithm Engineering (aka. How to Write Fast Code) CS26 S260 Lecture cture 10 Yan n Gu

Shortest path using A Algorithm Introduction History Components of A Algorithm

Stoer-Wagner Algorithm A Minimum Cut Algorithm for Undirected Graphs BigNews CS214: Algorithms

Quiz I Give the SVD-based algorithm for solving least squares, and I justify the algorithm by that

Some More Critical Section Solutions Dr. Liam OConnor University of Edinburgh LFCS (and UNSW)

A-Star Algorithm &amp; Heaps/Priority Queues Mark Redekopp 2 A* Search Algorithm ALGORITHM

Earley algorithm Earley: introduction Example of Earley algorithm Scott Farrar CLMA,

The BBS Algorithm The BBS Algorithm The BBS Algorithm Prof. Paolo Ciaccia Prof. Paolo Ciaccia

Avoiding Register Overflow in the Bakery Algorithm The Bakery++ Algorithm The Bakery algorithm is

Lecture 1 Grade school algorithms Sept. 8, 2017 1 What is an algorithm? An algorithm is a

ECE 242 Data Structures Lecture 2 Algorithm Analysis September 11, 2009 ECE242 L2: Algorithm

A unifying methodology A Gentle Introduction to the EM Algorithm Dempster, Laird &amp; Rubin

The EM Algorithm The EM algorithm Mixture models Why EM works EM variants Learning

Programming Reactive Systems in Scala: Principles and Abstractions Philipp Haller KTH Royal

Rational Krylov Methods for Solving Nonlinear Eigenvalue Problems Roel Van Beeumen

Learning from Unlabeled Data INFO-4604, Applied Machine Learning University of Colorado Boulder

Expectation maximization don't have any labels. Can you still do something? ! Amazingly you can!

FlexMix: Flexible fitting of finite mixtures with the EM algorithm Bettina Gr un Friedrich

EM-like algorithms for nonparametric estimation in multivariate mixtures Didier Chauveau MAPMO -

Sambuz

Useful Links

Newsletter

Mail Us

A-Star Algorithm & Heaps/Priority Queues Mark Redekopp 2 A* Search Algorithm ALGORITHM

A unifying methodology A Gentle Introduction to the EM Algorithm Dempster, Laird & Rubin