Semi-Supervised Learning of Sequence Models via Method of Moments

EMNLP 2016 - Empirical Methods in Natural Language Processing, Austin, Texas, November 1-6, 2016

Zita Marinho · Shay B. Cohen · Noah A. Smith · André F. T. Martins


SLIDE 1

Semi-Supervised Learning of Sequence Models via Method of Moments

EMNLP - Empirical Methods for Natural Language Processing

Zita Marinho - IST, University of Lisbon & Robotics Institute, CMU
Shay B. Cohen - School of Informatics, University of Edinburgh
André F. T. Martins - IT, IST, University of Lisbon & Unbabel
Noah A. Smith - Computer Science & Eng., University of Washington

November 1-6, 2016, Austin, Texas

zmarinho@cmu.edu | scohen@inf.ed.ac.uk | andre.martins@unbabel.com | nasmith@cs.washington.edu

SLIDE 2

EMNLP 16 | Semi-supervised sequence labeling with MoM |

Introduction

Sequence Labeling

[Figure: the sentence "Herb fights like a ninja ." shown as observed words w1…w6 with hidden labels y1…y6; tag assignment shown: N V Pre . Det N]

  • observed data {w1, w2, w3, …, w6}
  • labels {y1, y2, y3, …, y6}

SLIDE 3

Introduction

Sequence Labeling

[Figure: the same sentence "Herb fights like a ninja ." with an alternative tag assignment: N V V . Det N]

  • observed data {w1, w2, w3, …, w6}
  • labels {y1, y2, y3, …, y6}

SLIDE 4

Introduction

Sequence Labeling

[Figure: the same sentence with another alternative tag assignment: ADJ N V . Det N]

  • observed data {w1, w2, w3, …, w6}
  • labels {y1, y2, y3, …, y6}

SLIDE 5

Introduction

Sequence Labeling

[Figure: the same sentence with all labels unknown: ? ? ? ? ? ?]

K^6 possible assignments (K labels over 6 positions)

  • observed data {w1, w2, w3, …, w6}
  • labels {y1, y2, y3, …, y6}

SLIDE 6

Introduction

Hidden Markov Model

[Figure: HMM over words w1…w6 with hidden states y1…y6]

Learn parameters?

p(yt | yt-1)    p(wt | yt)

  • supervised learning
  • unsupervised/semi-supervised (this talk)
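In the supervised case, both distributions can be estimated by counting and normalizing over a labeled corpus; a minimal sketch using the slides' toy sentence (illustrative, not the talk's implementation):

```python
from collections import Counter

def hmm_mle(tagged_sentences):
    """Count-based MLE for the HMM: p(y_t | y_t-1) and p(w_t | y_t)."""
    trans, emit = Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<start>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            prev = tag
        trans[(prev, "<stop>")] += 1

    def normalize(counts):
        totals = Counter()
        for (a, _), c in counts.items():
            totals[a] += c
        return {(a, b): c / totals[a] for (a, b), c in counts.items()}

    return normalize(trans), normalize(emit)

corpus = [[("Herb", "N"), ("fights", "V"), ("like", "Pre"),
           ("a", "Det"), ("ninja", "N"), (".", ".")]]
trans, emit = hmm_mle(corpus)
```

With this single sentence, the label N is followed by V once and by "." once, so p(V | N) = 0.5.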
SLIDE 7

Introduction

Hidden Markov Model

[Figure: HMM over words w1…w6 with hidden states y1…y6]

Learn parameters?

p(yt | yt-1)    p(wt | yt)

  • model can be extended to include features

Berg-Kirkpatrick et al., Painless Unsupervised Learning with Features, NAACL-HLT 2010.

  • supervised learning
  • unsupervised/semi-supervised (this talk)
SLIDE 8

Problem Statement

Maximum Likelihood estimation (MLE)

  • exact inference is hard
  • EM is sensitive to local optima (depends on initialization)
  • EM is expensive on large datasets (several inference passes)

Method of Moments estimation (MoM)

  • computationally efficient
  • no local optima
  • one pass over the data
SLIDE 9

Introduction

Hidden Markov Model: via Maximum Likelihood Estimation vs. via Method of Moments

[Table: prior work on learning HMMs and feature HMMs, unsupervised vs. semi-supervised, via MLE (✓) and via MoM; MoM entries for the semi-supervised and feature-based settings are marked "?", the gap addressed in this work]

Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.
Cohen, Stratos, Collins, Foster and Ungar, Spectral Learning of Latent-Variable PCFGs: Algorithms and Sample Complexity, JMLR 2014.
SLIDE 10

Outline

Learning sequence models via MoM

  • 1. Learn HMM models via MoM
  • 2. Solve a QP
  • 3. Extend to feature-based model
  • 4. Experiments
SLIDE 11

Anchor Learning

Method of Moments, key insight:

  • 1. Conditional Independence: infer the label by looking at the context
  • 2. Anchor Trick: learn a proxy for labels with anchors

SLIDE 12

Anchor Learning

[Figure: HMM over the tweet "hehe its gonna b a good day" as words w1…w7 with hidden labels y1…y7, between start and stop states]

  • 1. Conditional Independence
SLIDE 13

[Figure: HMM over the tweet ":) wait now I am goin 2" as words w1…w7 with hidden labels y1…y7, between start and stop states]

context = { w-1 , w+1 }

Log-linear model

  • 1. Conditional Independence
SLIDE 14

Problem Statement

[Figure: window wt-1 wt wt+1 = "tasted like chimichangas" with labels yt-1 yt yt+1; here "like" is tagged adp; the neighboring words form its context]

  • 1. Conditional Independence
SLIDE 15

Problem Statement

[Figure: window "i like fajitas"; here "like" is tagged verb; the neighboring words form its context]

  • 1. Conditional Independence
SLIDE 16

Problem Statement

[Figure: window "i like fajitas"; the label verb for "like" is inferred from its context]

"You shall know a word by the company it keeps." (Firth, 1957)

  • 1. Conditional Independence
SLIDE 17

Anchor Learning

[Figure: HMM over the tweet "hehe its gonna b a good day" with hidden labels y1…y7]

word ⊥ context | label

  • 1. Conditional Independence
SLIDE 18

Anchor Learning

Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.

[Figure: window wt-1, wt = "be", wt+1; "be" is an anchor word for the label verb]

p( label ≠ verb | be ) = 0    p( verb | be ) = 1

  • 2. Anchor Trick: all instances of "be" are tagged verb

SLIDE 19

Anchor Learning

More anchors per label: with more than one anchor word, the context estimates are less biased.

verb = { b, be, are, is, am, have, going }

  • 2. Anchor Trick
SLIDE 20

Anchor Learning

How to find anchors?

  • small labeled corpus
  • small lexicon

[Figure: example anchors per label. noun: Austin, airport, playground; verb: am, be, is, are, go, make, made, become; adp: so, on, of; pron: he, it, she]

  • 2. Anchor Trick
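One simple way to pick anchors from a small labeled corpus (a hypothetical `find_anchors` helper, not the paper's exact selection rule): keep words that occur often enough and almost always carry the same label.

```python
from collections import Counter

def find_anchors(tagged_words, min_count=5, threshold=0.95):
    """Anchor candidates: words that (nearly) always carry one label."""
    pair_counts = Counter(tagged_words)             # (word, label) occurrences
    word_counts = Counter(w for w, _ in tagged_words)
    anchors = {}
    for (word, label), c in pair_counts.items():
        if word_counts[word] >= min_count and c / word_counts[word] >= threshold:
            anchors.setdefault(label, []).append(word)
    return anchors

toy = ([("be", "verb")] * 10 + [("day", "noun")] * 7
       + [("like", "verb")] * 6 + [("like", "adp")] * 4)
anchors = find_anchors(toy)   # "like" is ambiguous, so it is filtered out
```

An ambiguous word like "like" (verb or adposition) never clears the threshold, which is exactly the point of the anchor property.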
SLIDE 21

Method of Moments

Estimate co-occurrences from unlabeled data.

[Figure: example sentences with wt = "like" in context: "Andrew fights like Jet Li."  "eat Fruit like cherry."  "Ann sings like me."  "Children like ice-cream."  Context positions: wt-1, wt+1, wt+2]

SLIDE 22

Method of Moments

[Figure: moment matrix Q, one column per word type; the column for "like" counts its context words (Children, cherry, ice-cream, fights, a, Jet, me, …) from the example sentences]

Q = p(context | word)
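The matrix Q can be estimated from raw text by counting context words around each token and normalizing per word type; a small sketch (the window offsets and normalization are illustrative choices, not the paper's exact configuration):

```python
import numpy as np

def build_Q(sentences, offsets=(-1, 1, 2)):
    """Empirical p(context | word): one column per word type,
    one row per (offset, context-word) pair."""
    words = sorted({w for s in sentences for w in s})
    ctxs = sorted({(d, s[i + d]) for s in sentences for i in range(len(s))
                   for d in offsets if 0 <= i + d < len(s)})
    widx = {w: j for j, w in enumerate(words)}
    cidx = {c: i for i, c in enumerate(ctxs)}
    Q = np.zeros((len(ctxs), len(words)))
    for s in sentences:
        for i, w in enumerate(s):
            for d in offsets:
                if 0 <= i + d < len(s):
                    Q[cidx[(d, s[i + d])], widx[w]] += 1
    Q /= np.maximum(Q.sum(axis=0, keepdims=True), 1e-12)  # column-normalize
    return Q, widx, cidx

sents = [["Andrew", "fights", "like", "Jet", "Li", "."],
         ["Children", "like", "ice-cream", "."]]
Q, widx, cidx = build_Q(sents)
```

Each column is a proper distribution over contexts, so the column for "like" sums to one.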

SLIDE 23

Method of Moments

[Figure: two more sentences, "Let there be love." and "Bill will be a ninja.", add a column for the word "be" (contexts: love, there, will, …) to Q]

Q = p(context | word)

SLIDE 24

Method of Moments

  • 1. Conditional Independence: word ⊥ context | label

p(context | word) = Σ_labels p(context | label) · p(label | word)

In matrix form: Q = Γ R, with
  Q = p(context | word)
  Γ = p(label | word)
  R = p(context | label)
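The factorization can be checked numerically on a toy model (all numbers made up for illustration): rows of Γ are p(label | word), rows of R are p(context | label), and their product gives each word's context distribution.

```python
import numpy as np

# toy model: 3 word types, 2 labels, 4 context types (made-up numbers)
Gamma = np.array([[1.0, 0.0],    # p(label | word), one row per word
                  [0.2, 0.8],
                  [0.5, 0.5]])
R = np.array([[0.4, 0.3, 0.2, 0.1],   # p(context | label), one row per label
              [0.1, 0.2, 0.3, 0.4]])

# conditional independence: p(context | word) = sum_label p(label|word) p(context|label)
Q = Gamma @ R
```

Each row of Q is a proper distribution because each row of Γ and R is; the first word has a deterministic label, so its context distribution equals the first row of R, which is the anchor property in miniature.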

SLIDE 25

Method of Moments

  • 1. Conditional Independence:  p(context | word) = Σ_labels p(context | label) · p(label | word)
  • 2. Anchor Trick:  estimate p(context | label) as p(context | anchors)

Matrix form: Q = Γ R, where R is now estimated directly from the anchors' contexts.

SLIDE 26

Outline

Learning sequence models via MoM

  • 1. Learn HMM models via MoM
  • 2. Solve a QP
  • 3. Extend to feature-based model
  • 4. Experiments
SLIDE 27

Method of Moments

With Q = Γ R, let q be a word's context distribution (from Q) and γ its label distribution (from Γ).

  • solve per word type (~ms each):

γ* = argmin_γ || q − R γ ||²    s.t.  0 ≤ γ ≤ 1,  Σ_labels γ = 1
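The constraint set (γ ≥ 0, Σγ = 1) is the probability simplex, so this QP can be solved per word by projected gradient; a minimal sketch (the paper does not prescribe this particular solver):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def solve_qp(q, R, steps=500):
    """min_gamma ||q - R.T @ gamma||^2  s.t. gamma on the simplex.
    q: a word's context distribution; R: (labels x contexts)."""
    gamma = np.full(R.shape[0], 1.0 / R.shape[0])
    lr = 0.5 / (np.linalg.norm(R, 2) ** 2 + 1e-12)   # safe step size
    for _ in range(steps):
        grad = 2.0 * R @ (R.T @ gamma - q)
        gamma = project_simplex(gamma - lr * grad)
    return gamma

R = np.array([[0.7, 0.3], [0.1, 0.9]])      # toy p(context | label)
q = R.T @ np.array([0.6, 0.4])              # word whose true gamma is (0.6, 0.4)
gamma = solve_qp(q, R)
```

On this toy instance the unconstrained minimizer already lies on the simplex, so the solver recovers γ = (0.6, 0.4).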

SLIDE 28

Method of Moments

Add a supervised regularizer:

γ* = argmin_γ || q − R γ ||² + λ || γsup − γ ||²    s.t.  0 ≤ γ ≤ 1,  Σ_labels γ = 1

SLIDE 29

Method of Moments

γ* = argmin_γ || q − R γ ||² + λ || γsup − γ ||²    s.t.  0 ≤ γ ≤ 1,  Σ_labels γ = 1

where γsup is estimated from labeled data, and q and R are estimated from unlabeled data.

SLIDE 30

Method of Moments

HMM Learning: learn parameters from the γ coefficients.

p(label) = Σ_words p(label | word) · p(word)

Observation matrix, by Bayes' Rule:

p(word | label) = p(label | word) · p(word) / p(label)
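Given the per-word solutions stacked into Γ = p(label | word) and the empirical word frequencies, the observation matrix follows directly from Bayes' rule; a minimal numpy sketch with toy numbers:

```python
import numpy as np

def observation_matrix(Gamma, p_word):
    """Bayes' rule: p(word | label) = p(label | word) p(word) / p(label)."""
    p_label = p_word @ Gamma                 # p(y) = sum_w p(y | w) p(w)
    O = Gamma * p_word[:, None] / p_label[None, :]
    return O, p_label

Gamma = np.array([[1.0, 0.0],   # p(label | word), one row per word
                  [0.2, 0.8],
                  [0.5, 0.5]])
p_word = np.array([0.5, 0.3, 0.2])
O, p_label = observation_matrix(Gamma, p_word)
```

Each column of O sums to one, i.e. every label gets a proper emission distribution.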

SLIDE 31

Method of Moments

HMM Learning:

  • Observation matrix via Bayes' Rule: p(word | label) = p(label | word) · p(word) / p(label)
  • Transition matrix: estimate from labeled data only
SLIDE 32

Outline

Learning sequence models via MoM

  • 1. Learn HMM models via MoM
  • 2. Relax the notion of anchors
  • 3. Solve a QP
  • 4. Experiments
SLIDE 33

Experiments

Semi-supervised Twitter POS tagging

  • 12 Universal POS tags
  • 200k-word Twitter dataset
  • 2.7M unlabeled tweets; 100-1000 labeled tweets

Example: "hehe its gonna b a good day" → x prt verb verb det adj noun

Petrov et al., A Universal Part-of-Speech Tagset, 2011.
Owoputi et al., Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters, 2013.

SLIDE 34

Experiments

Twitter POS tagging, 150 labeled training sequences

[Chart: HMM tagging accuracy for HMM, EM, self-training, AHMM; values as listed: 71.7, 77.2, 78.2, 84.3]

SLIDE 35

Experiments

Twitter POS tagging, 1000 labeled training sequences

[Chart: HMM tagging accuracy for HMM, EM, self-training, AHMM; values as listed: 81.1, 83.1, 86.1, 88.0]

SLIDE 36

Outline

Learning sequence models via MoM

  • 1. Learn HMM models via MoM
  • 2. Relax the notion of anchors
  • 3. Extend to feature HMM
  • 4. Experiments
SLIDE 37

Extend to features

[Figure: HMM over the tweet ":) wait now I am goin 2" as words w1…w7 with hidden labels y1…y7, between start and stop states]

Log-linear model with feature function φ(word):
  • is upper
  • is title
  • is digit
  • is url
  • starts #
  • is emoticon

Berg-Kirkpatrick et al., Painless Unsupervised Learning with Features, NAACL-HLT 2010.
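The binary features on the slide have direct string tests; a sketch of φ(word) (the URL and emoticon checks are crude illustrative regexes, not the paper's feature set):

```python
import re

def phi(word):
    """Binary feature vector phi(word) for a token."""
    return [
        int(word.isupper()),                                  # is upper
        int(word.istitle()),                                  # is title
        int(word.isdigit()),                                  # is digit
        int(bool(re.match(r"(https?://|www\.)", word))),      # is url (crude)
        int(word.startswith("#")),                            # starts #
        int(bool(re.fullmatch(r"[:;=][-']?[)(DPp]", word))),  # is emoticon (crude)
    ]
```

For example, phi(":)") fires only the emoticon feature, and phi("2") only the digit feature.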

SLIDE 38

Extend to features

[Figure: the same tweet HMM, with feature functions φ(word) on words and ψ(context) on contexts]

word ⊥ context | label

  • 1. Conditional Independence
SLIDE 39

Extend to features

Log-linear model: the same decomposition, with probabilities replaced by feature expectations:

  Q:  p(context | word)  becomes  E[ ψ(context) × φ(word) ]
  R:  p(context | label)  becomes  E[ ψ(context) | label ]
  Γ:  p(label | word)  becomes  E[ φ(word) | label ] · p(label) / E[ φ(word) ]

Matrix form: Q = Γ R, with R estimated from the anchors.

SLIDE 40

Method of Moments

Log-linear model: the same QP, now solved per feature dimension φj:

γ* = argmin_γ || q − R γ ||² + λ || γsup − γ ||²    s.t.  Σ_labels γ = 1

SLIDE 41

Method of Moments

Log-linear model: learn parameters from the γ coefficients.

Mean parameters:

µy = E[ φ(word) | label ]    with    γ = E[ φ(word) | label ] · p(label) / E[ φ(word) ]

SLIDE 42

Extend to features

Log-linear model: learn parameters?

Recover canonical parameters θy from mean parameters µy via Fenchel-Legendre duality:

θy* = argmax_θy  θy⊤ µy − log Zy

with partition function Zy = Σw exp( θy⊤ tw ), where tw is the feature vector of word w.
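Recovering canonical from mean parameters is a concave maximization: the gradient of θy⊤µy − log Zy is µy minus the model's expected feature vector, and it vanishes when the moments match. A toy gradient-ascent sketch (the feature matrix T, one row per word, is an illustrative stand-in for the talk's features):

```python
import numpy as np

def recover_theta(mu, T, steps=2000, lr=0.5):
    """Maximize theta . mu - log Z(theta), with Z = sum_w exp(theta . t_w)."""
    theta = np.zeros(T.shape[1])
    for _ in range(steps):
        logits = T @ theta
        p = np.exp(logits - logits.max())
        p /= p.sum()                       # model distribution over words
        theta += lr * (mu - p @ T)         # ascent step: mu - E_theta[t_w]
    return theta

T = np.array([[1.0, 0.0],     # t_w: feature vector of each of 3 words
              [0.0, 1.0],
              [1.0, 1.0]])
target = np.array([0.2, 0.3, 0.5])   # a distribution inside this log-linear family
mu = target @ T                      # its mean parameters
theta = recover_theta(mu, T)
p = np.exp(T @ theta - (T @ theta).max()); p /= p.sum()
```

Because the target distribution lies in the family, matching moments recovers it exactly.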

SLIDE 43

Algorithm

[Pipeline: find anchors → compute moments Q, R → solve QP for Γ and mean parameters µy → solve maxent problem for canonical parameters θy; indicative per-step wall-clock times range from seconds to 2-3 hours]

SLIDE 44

Algorithm

[Same pipeline, with supervision injected at the QP step: find anchors → compute moments → solve QP (mean parameters µy) → solve maxent problem (canonical parameters θy)]

SLIDE 45

Outline

Learning sequence models via MoM

  • 1. Learn HMM models via MoM
  • 2. Relax the notion of anchors
  • 3. Solve a QP
  • 4. Experiments
SLIDE 46

Experiments

Twitter POS tagging, 150 labeled training sequences

[Chart: feature HMM tagging accuracy for HMM, EM, self-training, AHMM; values as listed: 81.8, 81.8, 83.4, 85.3]

SLIDE 47

Experiments

Twitter POS tagging, 1000 labeled training sequences

[Chart: feature HMM tagging accuracy for HMM, EM, self-training, AHMM; values as listed: 89.1, 89.1, 89.4, 89.1]

SLIDE 48

Experiments

Twitter POS tagging

[Chart: tagging accuracy (0.70-0.95) vs. number of labeled training sequences (100-1000), comparing feature HMM, HMM, and anchor FHMM]

SLIDE 49

Experiments

Twitter POS tagging, 1000 training sequences

[Chart: training time in hours: Brown Clusters 42.0, EM 14.9, self-training 10.3, AHMM 3.8]

SLIDE 50

Conclusions

  • MoM algorithm for semi-supervised learning
  • flexible method (easy to add supervision)
  • fast to train (only one pass over the data)
  • particularly good with little supervision

Thank you!

zmarinho@cmu.edu

Support for this research was provided by the Portuguese Science and Technology Foundation (FCT) and CMU Portugal Program, grant SFRH/BD/ 52015/2012. This work has also been partially supported by the European Union under H2020 project SUMMA, grant 688139, and by FCT, through contracts UID/EEA/50008/2013, through the LearnBig project (PTDC/EEISII/7092/2014), and the GoLocal project (grant CMUPERI/TIC/0046/2014).