

SLIDE 1

Why doesn't EM find good HMM POS-taggers?

Mark Johnson, Microsoft Research and Brown University


SLIDE 2

Bayesian inference for HMMs

  • Compare Bayesian methods for estimating HMMs for unsupervised POS tagging:
    – Gibbs sampling
    – Variational Bayes
    – How do these compare to EM?
  • Most words belong to only a few POS tags: can a sparse Bayesian prior on P(w|y) capture this?
  • KISS: look at bitag HMM models first
  • Cf. Goldwater and Griffiths (2007), who study semi-supervised Bayesian inference for tritag HMM POS taggers


SLIDE 3

Main findings

  • Bayesian inference finds better POS tags
  • By reducing the number of states, EM can do almost as well
  • All these methods take hundreds of iterations to stabilize (converge?)
  • Wide variation in performance of all models ⇒ multiple runs are needed to assess performance


SLIDE 4

Evaluation methodology

  • “Many-to-1” accuracy: each HMM hidden state y is mapped to the most frequent gold POS tag t it co-occurs with
  • “1-to-1” accuracy (Haghighi and Klein 06): greedily map HMM states to POS tags, under the constraint that at most one state maps to each tag
  • Information-theoretic measure (Meila 03): variation of information VI(Y,T) = H(Y|T) + H(T|Y) (all three measures are sketched in code below)
  • Max-marginal decoding is faster and usually better than Viterbi
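To make these definitions concrete, here is a minimal Python sketch of the three measures. The inputs states (HMM state ids) and gold (gold POS tags) are equal-length token sequences; all names are illustrative, not from the talk.

# Sketch of the three evaluation measures over aligned token sequences.
from collections import Counter
from math import log2

def many_to_one(states, gold):
    # Map each state to the gold tag it co-occurs with most often,
    # then score the mapped sequence against the gold tags.
    pair = Counter(zip(states, gold))
    best = {}
    for (y, t), n in pair.items():
        if y not in best or n > best[y][0]:
            best[y] = (n, t)
    return sum(best[y][1] == t for y, t in zip(states, gold)) / len(gold)

def one_to_one(states, gold):
    # Greedily pair states with tags by descending co-occurrence count,
    # using each state and each tag at most once (Haghighi and Klein 06).
    pair = Counter(zip(states, gold))
    used_y, used_t, correct = set(), set(), 0
    for (y, t), n in pair.most_common():
        if y not in used_y and t not in used_t:
            used_y.add(y)
            used_t.add(t)
            correct += n
    return correct / len(gold)

def vi(states, gold):
    # Variation of information VI(Y,T) = H(Y|T) + H(T|Y), in bits (Meila 03).
    n = len(gold)
    ny, nt = Counter(states), Counter(gold)
    nyt = Counter(zip(states, gold))
    h_y_given_t = -sum(c / n * log2(c / nt[t]) for (y, t), c in nyt.items())
    h_t_given_y = -sum(c / n * log2(c / ny[y]) for (y, t), c in nyt.items())
    return h_y_given_t + h_t_given_y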


SLIDE 5

EM via Forward-Backward

  • HMM model: P(x,y) = Π_i P(yi | yi-1) P(xi | yi)
  • EM iterations: forward-backward computes expected counts (E-step), which are then renormalized (M-step); one iteration is sketched below
  • All experiments run on POS tags from the WSJ Penn Treebank (PTB)
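A compact numpy sketch of the E-step for one sentence, with illustrative names: trans[y,y'], emit[y,w], and start are the current parameter estimates and x is an integer-coded sentence. This is not the talk's code, and a real run would rescale or work in log space to avoid underflow on long sentences.

import numpy as np

def em_step(x, trans, emit, start):
    # Forward-backward expected counts for one sentence.
    K, n = trans.shape[0], len(x)
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = start * emit[:, x[0]]
    for i in range(1, n):                       # forward pass
        alpha[i] = (alpha[i - 1] @ trans) * emit[:, x[i]]
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):              # backward pass
        beta[i] = trans @ (emit[:, x[i + 1]] * beta[i + 1])
    Z = alpha[-1].sum()                         # P(x) under current model
    gamma = alpha * beta / Z                    # gamma[i, y] = P(y_i = y | x)
    xi = np.zeros((K, K))                       # expected transition counts
    for i in range(n - 1):
        xi += np.outer(alpha[i], emit[:, x[i + 1]] * beta[i + 1]) * trans / Z
    return gamma, xi, Z

Summing gamma (per word type) and xi over the corpus and renormalizing the rows gives the M-step's new emit and trans estimates.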


SLIDE 6

EM is slow to stabilize


[Plot: − log likelihood (6.95×10^6 to 7.20×10^6) vs. iteration (0 to 1000)]

SLIDE 7

EM 1-to-1 accuracy varies widely


[Plot: 1-to-1 accuracy (0.2 to 0.5) vs. iteration (0 to 1000)]

SLIDE 8

EM tag distribution is less peaked than empirical


[Plot: token frequency vs. tag (sorted by frequency); series: PTB, VB, EM 50 states, EM 25 states]

SLIDE 9

Bayesian estimation of HMMs

  • HMM with Dirichlet priors on tag→tag and tag→word distributions (written out below)
  • As the Dirichlet parameter approaches zero, the prior prefers sparse (more peaked) distributions
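Written out in LaTeX, using the θ, φ, α, β of the Gibbs-sampling slide below, the model described here is presumably the standard Dirichlet-multinomial HMM:

\begin{aligned}
\theta_y \mid \alpha &\sim \mathrm{Dirichlet}(\alpha) && \text{tag} \to \text{tag distribution for tag } y \\
\phi_y \mid \beta &\sim \mathrm{Dirichlet}(\beta) && \text{tag} \to \text{word distribution for tag } y \\
y_i \mid y_{i-1} &\sim \theta_{y_{i-1}}, \qquad x_i \mid y_i \sim \phi_{y_i}
\end{aligned}

With α, β < 1 the Dirichlet concentrates its mass near the corners of the simplex, i.e. on sparse, peaked distributions.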


SLIDE 10

Gibbs sampling

  • A Gibbs sampler is an MCMC procedure for sampling from the posterior distribution P(y|x,α,β)
  • Integrate out the θ, φ parameters
  • Repeatedly sample from P(yi | y-i, x, α, β), where y-i is the vector of all y except yi (sketched below)
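A minimal sketch of one such pointwise update at an interior position i, following the standard collapsed bitag sampler (cf. Goldwater and Griffiths 2007). The count arrays n_tw, n_tt, n_t (all excluding position i) and the boundary handling are illustrative assumptions, not the talk's code.

import random

def resample(i, y, x, n_tt, n_tw, n_t, K, W, alpha, beta):
    # Sample y[i] from P(y_i | y_-i, x, alpha, beta), with theta and phi
    # integrated out; K tags, W word types, counts exclude position i.
    prev, nxt = y[i - 1], y[i + 1]
    weights = []
    for t in range(K):
        p = (n_tw[t, x[i]] + beta) / (n_t[t] + W * beta)        # emission
        p *= (n_tt[prev, t] + alpha) / (n_t[prev] + K * alpha)  # prev -> t
        # t -> next, with indicator corrections when prev == t (== nxt),
        # since choosing t would itself change those counts
        p *= (n_tt[t, nxt] + (prev == t == nxt) + alpha) \
             / (n_t[t] + (prev == t) + K * alpha)
        weights.append(p)
    return random.choices(range(K), weights)[0]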


SLIDE 11

Gibbs sampling is even slower


[Plot: − log posterior probability (8.70×10^6 to 9.00×10^6) vs. iterations of the Gibbs sampler (0 to 50,000), α=β=0.1]

SLIDE 12

Gibbs stabilizes fast (to poor solutions)


[Plot: 1-to-1 accuracy (0.2 to 0.5) vs. iterations of the Gibbs sampler (0 to 50,000), α=β=0.1]

SLIDE 13

Variational Bayes

  • Variational Bayes approximates the posterior with a factorized distribution: P(y,θ,φ|x,α,β) ≈ Q(y) Q(θ,φ) (MacKay 97, Beal 03)
  • Simple, EM-like procedure: the forward-backward E-step is unchanged, and the M-step's normalization is replaced by ratios of exp(ψ(·)), as sketched below the figure


[Plot: exp(ψ(x)) compared with x − 0.5, for x from 0 to 2]
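The curve above is the core of the EM-like procedure: for moderate counts, damping through exp(ψ(·)) acts roughly like subtracting 0.5 from each count (Beal 03). A minimal numpy sketch with illustrative names; counts holds expected transition or emission counts (one row per tag) and prior is α or β.

import numpy as np
from scipy.special import digamma

def vb_normalise(counts, prior):
    # VB replacement for the M-step: exp(psi(count + prior)) ratios
    # in place of plainly normalized counts.
    dim = counts.shape[-1]      # K for transitions, W for emissions
    return np.exp(digamma(counts + prior)
                  - digamma(counts.sum(axis=-1, keepdims=True) + dim * prior))

The resulting pseudo-probabilities typically sum to less than one; they are plugged directly into the next forward-backward pass, just as EM's parameters would be.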

SLIDE 14

VB posterior seems to stabilize fast


[Plot: − log variational lower bound (6.00×10^6 to 6.20×10^6) vs. iterations of VB (0 to 1000), α=β=0.1]

SLIDE 15

VB 1-to-1 accuracy stabilizes fast


[Plot: 1-to-1 accuracy (0.2 to 0.5) vs. iterations of VB (0 to 1000), α=β=0.1]

SLIDE 16

Summary of results


          α      β      states  1-to-1  S.D.  many-to-1  S.D.  VI(Y,T)  S.D.  H(T|Y)  S.D.  H(Y|T)  S.D.
EM        –      –      50      0.40    0.02  0.62       0.01  4.46     0.08  1.75    0.04  2.71    0.06
VB        0.1    0.1    50      0.47    0.02  0.50       0.02  4.28     0.09  2.39    0.07  1.89    0.06
VB        1E-04  1      50      0.46    0.03  0.50       0.02  4.28     0.11  2.39    0.08  1.90    0.07
VB        0.1    1E-04  50      0.42    0.02  0.60       0.01  4.63     0.07  1.86    0.03  2.77    0.05
VB        1E-04  1E-04  50      0.42    0.02  0.60       0.01  4.62     0.07  1.85    0.03  2.76    0.06
GS        0.1    0.1    50      0.37    0.02  0.51       0.01  5.45     0.07  2.35    0.09  3.20    0.03
GS        1E-04  0.1    50      0.38    0.01  0.51       0.01  5.47     0.04  2.26    0.03  3.22    0.01
GS        0.1    1E-04  50      0.36    0.02  0.49       0.01  5.73     0.05  2.41    0.04  3.31    0.03
GS        1E-04  1E-04  50      0.37    0.02  0.49       0.01  5.74     0.03  2.42    0.02  3.32    0.02
EM        –      –      40      0.42    0.03  0.60       0.02  4.37     0.14  1.84    0.07  2.55    0.08
EM        –      –      25      0.46    0.03  0.56       0.02  4.23     0.17  2.05    0.09  2.19    0.08
EM        –      –      10      0.41    0.01  0.43       0.01  4.32     0.04  2.74    0.03  1.58    0.05

(GS = Gibbs sampling; EM has no Dirichlet hyperparameters, hence the dashes.)

  • Goldwater and Griffiths (2007) report VI = 3.74 for an unsupervised tritag model using Gibbs sampling, but on a reduced 17-tag set

SLIDE 17

Conclusions

  • EM does better if you let it run longer
  • Its state distribution is not skewed enough; two remedies:
    – Bayesian priors
    – Reduce the number of states in EM
  • Variational Bayes may be faster than Gibbs (or maybe it's the initialization?)
  • Huge performance variance with all estimators ⇒ multiple runs needed to assess performance


SLIDE 18

EM 1-to-1 accuracy vs likelihood


[Scatter plot: 1-to-1 accuracy (0.05 to 0.5) vs. − log likelihood (6×10^6 to 1×10^7)]

SLIDE 19

EM many-to-1 accuracy vs likelihood


[Scatter plot: many-to-1 accuracy (0.1 to 0.7) vs. − log likelihood (6×10^6 to 1×10^7)]

SLIDE 20

EM final many-to-1 accuracy vs final likelihood


[Scatter plot: final many-to-1 accuracy (0.57 to 0.65) vs. final − log likelihood (6.96×10^6 to 7.08×10^6)]