

SLIDE 1

Why doesn't EM find good HMM POS-taggers?

Mark Johnson, Microsoft Research and Brown University


SLIDE 2

Bayesian inference for HMMs

  • Compare Bayesian methods for estimating HMMs for unsupervised POS tagging:
    – Gibbs sampling
    – Variational Bayes
    – How do these compare to EM?
  • Most words belong to only a few POS tags: can a sparse Bayesian prior on P(w|y) capture this?
  • KISS: look at bitag HMM models first
  • Cf. Goldwater and Griffiths (2007), who study semi-supervised Bayesian inference for tritag HMM POS taggers


SLIDE 3

Main findings

  • Bayesian inference finds better POS tags
  • By reducing the number of states, EM can do almost as well
  • All these methods take hundreds of iterations to stabilize (converge?)
  • Wide variation in performance of all models ⇒ multiple runs are needed to assess performance


SLIDE 4

Evaluation methodology

  • “Many-to-1” accuracy: each HMM hidden state y is mapped to the most frequent gold POS tag t it co-occurs with
  • “1-to-1” accuracy (Haghighi and Klein 06): greedily map HMM states to POS tags, under the constraint that at most one state maps to each tag
  • Information-theoretic measure (Meila 03): variation of information VI(Y,T) = H(Y|T) + H(T|Y) (all three measures are sketched in code below)
  • Max-marginal decoding is faster and usually better than Viterbi
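To make these definitions concrete, here is a minimal Python sketch of the three measures. The inputs states (HMM state ids) and gold (gold POS tags) are equal-length token sequences; all names are illustrative, not from the talk.

# Sketch of the three evaluation measures over aligned token sequences.
from collections import Counter
from math import log2

def many_to_one(states, gold):
    # Map each state to the gold tag it co-occurs with most often,
    # then score the mapped sequence against the gold tags.
    pair = Counter(zip(states, gold))
    best = {}
    for (y, t), n in pair.items():
        if y not in best or n > best[y][0]:
            best[y] = (n, t)
    return sum(best[y][1] == t for y, t in zip(states, gold)) / len(gold)

def one_to_one(states, gold):
    # Greedily pair states with tags by descending co-occurrence count,
    # using each state and each tag at most once (Haghighi and Klein 06).
    pair = Counter(zip(states, gold))
    used_y, used_t, correct = set(), set(), 0
    for (y, t), n in pair.most_common():
        if y not in used_y and t not in used_t:
            used_y.add(y)
            used_t.add(t)
            correct += n
    return correct / len(gold)

def vi(states, gold):
    # Variation of information VI(Y,T) = H(Y|T) + H(T|Y), in bits (Meila 03).
    n = len(gold)
    ny, nt = Counter(states), Counter(gold)
    nyt = Counter(zip(states, gold))
    h_y_given_t = -sum(c / n * log2(c / nt[t]) for (y, t), c in nyt.items())
    h_t_given_y = -sum(c / n * log2(c / ny[y]) for (y, t), c in nyt.items())
    return h_y_given_t + h_t_given_y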


SLIDE 5

EM via Forward-Backward

  • HMM model: P(x,y) = Π_i P(yi | yi-1) P(xi | yi)
  • EM iterations: forward-backward computes expected counts (E-step), which are then renormalized (M-step); one iteration is sketched below
  • All experiments run on POS tags from the WSJ Penn Treebank (PTB)
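A compact numpy sketch of the E-step for one sentence, with illustrative names: trans[y,y'], emit[y,w], and start are the current parameter estimates and x is an integer-coded sentence. This is not the talk's code, and a real run would rescale or work in log space to avoid underflow on long sentences.

import numpy as np

def em_step(x, trans, emit, start):
    # Forward-backward expected counts for one sentence.
    K, n = trans.shape[0], len(x)
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = start * emit[:, x[0]]
    for i in range(1, n):                       # forward pass
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, x[i]]
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):              # backward pass
        beta[i] = trans @ (emit[:, x[i + 1]] * beta[i + 1])
    Z = alpha[-1].sum()                         # P(x) under current model
    gamma = alpha * beta / Z                    # gamma[i, y] = P(y_i = y | x)
    xi = np.zeros((K, K))                       # expected transition counts
    for i in range(n - 1):
        xi += np.outer(alpha[i], emit[:, x[i + 1]] * beta[i + 1]) * trans / Z
    return gamma, xi, Z

Summing gamma (per word type) and xi over the corpus and renormalizing the rows gives the M-step's new emit and trans estimates.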


SLIDE 6

EM is slow to stabilize


[Plot: − log likelihood (6.95×10^6 to 7.20×10^6) vs. iteration (0 to 1000)]

SLIDE 7

EM 1-to-1 accuracy varies widely


[Plot: 1-to-1 accuracy (0.2 to 0.5) vs. iteration (0 to 1000)]

SLIDE 8

EM tag distribution is less peaked than empirical


[Plot: token frequency vs. tag (sorted by frequency); series: PTB, VB, EM 50 states, EM 25 states]

SLIDE 9

Bayesian estimation of HMMs

  • HMM with Dirichlet priors on tag→tag and tag→word distributions (written out below)
  • As the Dirichlet parameter approaches zero, the prior prefers sparse (more peaked) distributions
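Written out in LaTeX, using the θ, φ, α, β of the Gibbs-sampling slide below, the model described here is presumably the standard Dirichlet-multinomial HMM:

\begin{aligned}
\theta_y \mid \alpha &\sim \mathrm{Dirichlet}(\alpha) && \text{tag} \to \text{tag distribution for tag } y \\
\phi_y \mid \beta &\sim \mathrm{Dirichlet}(\beta) && \text{tag} \to \text{word distribution for tag } y \\
y_i \mid y_{i-1} &\sim \theta_{y_{i-1}}, \qquad x_i \mid y_i \sim \phi_{y_i}
\end{aligned}

With α, β < 1 the Dirichlet concentrates its mass near the corners of the simplex, i.e. on sparse, peaked distributions.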


SLIDE 10

Gibbs sampling

  • A Gibbs sampler is an MCMC procedure for sampling from the posterior distribution P(y|x,α,β)
  • Integrate out the θ, φ parameters
  • Repeatedly sample from P(yi | y-i, x, α, β), where y-i is the vector of all y except yi (sketched below)
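A minimal sketch of one such pointwise update at an interior position i, following the standard collapsed bitag sampler (cf. Goldwater and Griffiths 2007). The count arrays n_tw, n_tt, n_t (all excluding position i) and the boundary handling are illustrative assumptions, not the talk's code.

import random

def resample(i, y, x, n_tt, n_tw, n_t, K, W, alpha, beta):
    # Sample y[i] from P(y_i | y_-i, x, alpha, beta), with theta and phi
    # integrated out; K tags, W word types, counts exclude position i.
    prev, nxt = y[i - 1], y[i + 1]
    weights = []
    for t in range(K):
        p = (n_tw[t, x[i]] + beta) / (n_t[t] + W * beta)        # emission
        p *= (n_tt[prev, t] + alpha) / (n_t[prev] + K * alpha)  # prev -> t
        # t -> next, with indicator corrections when prev == t (== nxt),
        # since choosing t would itself change those counts
        p *= (n_tt[t, nxt] + (prev == t == nxt) + alpha) \
             / (n_t[t] + (prev == t) + K * alpha)
        weights.append(p)
    return random.choices(range(K), weights)[0]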


SLIDE 11

Gibbs sampling is even slower


[Plot: − log posterior probability (8.70×10^6 to 9.00×10^6) vs. iterations of the Gibbs sampler (0 to 50,000), α=β=0.1]

SLIDE 12

Gibbs stabilizes fast (to poor solutions)


[Plot: 1-to-1 accuracy (0.2 to 0.5) vs. iterations of the Gibbs sampler (0 to 50,000), α=β=0.1]

SLIDE 13

Variational Bayes

  • Variational Bayes approximates the posterior with a factorized distribution: P(y,θ,φ|x,α,β) ≈ Q(y) Q(θ,φ) (MacKay 97, Beal 03)
  • Simple, EM-like procedure: the forward-backward E-step is unchanged, and the M-step's normalization is replaced by ratios of exp(ψ(·)), as sketched below the figure


[Plot: exp(ψ(x)) compared with x − 0.5, for x from 0 to 2]
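The curve above is the core of the EM-like procedure: for moderate counts, damping through exp(ψ(·)) acts roughly like subtracting 0.5 from each count (Beal 03). A minimal numpy sketch with illustrative names; counts holds expected transition or emission counts (one row per tag) and prior is α or β.

import numpy as np
from scipy.special import digamma

def vb_normalise(counts, prior):
    # VB replacement for the M-step: exp(psi(count + prior)) ratios
    # in place of plainly normalized counts.
    dim = counts.shape[-1]      # K for transitions, W for emissions
    return np.exp(digamma(counts + prior)
                  - digamma(counts.sum(axis=-1, keepdims=True) + dim * prior))

The resulting pseudo-probabilities typically sum to less than one; they are plugged directly into the next forward-backward pass, just as EM's parameters would be.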

SLIDE 14

VB posterior seems to stabilize fast


[Plot: − log variational lower bound (6.00×10^6 to 6.20×10^6) vs. iterations of VB (0 to 1000), α=β=0.1]

SLIDE 15

VB 1-to-1 accuracy stabilizes fast


[Plot: 1-to-1 accuracy (0.2 to 0.5) vs. iterations of VB (0 to 1000), α=β=0.1]

SLIDE 16

Summary of results


          α      β      states  1-to-1  S.D.  many-to-1  S.D.  VI(Y,T)  S.D.  H(T|Y)  S.D.  H(Y|T)  S.D.
EM        –      –      50      0.40    0.02  0.62       0.01  4.46     0.08  1.75    0.04  2.71    0.06
VB        0.1    0.1    50      0.47    0.02  0.50       0.02  4.28     0.09  2.39    0.07  1.89    0.06
VB        1E-04  1      50      0.46    0.03  0.50       0.02  4.28     0.11  2.39    0.08  1.90    0.07
VB        0.1    1E-04  50      0.42    0.02  0.60       0.01  4.63     0.07  1.86    0.03  2.77    0.05
VB        1E-04  1E-04  50      0.42    0.02  0.60       0.01  4.62     0.07  1.85    0.03  2.76    0.06
GS        0.1    0.1    50      0.37    0.02  0.51       0.01  5.45     0.07  2.35    0.09  3.20    0.03
GS        1E-04  0.1    50      0.38    0.01  0.51       0.01  5.47     0.04  2.26    0.03  3.22    0.01
GS        0.1    1E-04  50      0.36    0.02  0.49       0.01  5.73     0.05  2.41    0.04  3.31    0.03
GS        1E-04  1E-04  50      0.37    0.02  0.49       0.01  5.74     0.03  2.42    0.02  3.32    0.02
EM        –      –      40      0.42    0.03  0.60       0.02  4.37     0.14  1.84    0.07  2.55    0.08
EM        –      –      25      0.46    0.03  0.56       0.02  4.23     0.17  2.05    0.09  2.19    0.08
EM        –      –      10      0.41    0.01  0.43       0.01  4.32     0.04  2.74    0.03  1.58    0.05

(GS = Gibbs sampling; EM has no Dirichlet hyperparameters, hence the dashes.)

  • Goldwater and Griffiths (2007) report VI = 3.74 for an unsupervised tritag model using Gibbs sampling, but on a reduced 17-tag set

SLIDE 17

Conclusions

  • EM does better if you let it run longer
  • Its state distribution is not skewed enough; two remedies:
    – Bayesian priors
    – Reduce the number of states in EM
  • Variational Bayes may be faster than Gibbs (or maybe it's the initialization?)
  • Huge performance variance with all estimators ⇒ multiple runs needed to assess performance


SLIDE 18

EM 1-to-1 accuracy vs likelihood


[Scatter plot: 1-to-1 accuracy (0.05 to 0.5) vs. − log likelihood (6×10^6 to 1×10^7)]

SLIDE 19

EM many-to-1 accuracy vs likelihood


[Scatter plot: many-to-1 accuracy (0.1 to 0.7) vs. − log likelihood (6×10^6 to 1×10^7)]

SLIDE 20

EM final many-to-1 accuracy vs final likelihood


[Scatter plot: final many-to-1 accuracy (0.57 to 0.65) vs. final − log likelihood (6.96×10^6 to 7.08×10^6)]