Structured Generative Models for Unsupervised Named Entity Clustering - PowerPoint PPT Presentation

SLIDE 1

Structured Generative Models for Unsupervised Named Entity Clustering

Micha Elsner, Prof. Eugene Charniak, Prof. Mark E. Johnson

Brown Lab for Linguistic and Information Processing, Brown University, Providence, RI

SLIDE 2

Named Entities

Organizations: Brown Lab for Linguistic and Information Processing, Brown University
Places: Providence, RI
People: Micha Elsner, Prof. Eugene Charniak, Prof. Mark E. Johnson

SLIDE 3

Named Entity Structure

People: Micha Elsner, Prof. Eugene Charniak, Prof. Mark E. Johnson (titles, first names, and surnames occupy separate slots)
Organizations: Brown University, Brown Lab for Linguistic and Information Processing
Places: Providence, RI

SLIDE 4

Motivation

Isn’t this old news?

◮ Cotraining: (Collins+Singer ‘99, Riloff+Jones ‘99)

SLIDE 5

Motivation

Isn’t this old news?

◮ Cotraining: (Collins+Singer ‘99, Riloff+Jones ‘99)

Generative models

New direction in coreference resolution:

(Haghighi+Klein ‘07) (Ng ‘08) and others

Integrated models for subtasks (including Named Entity)

◮ (H+K) cluster named entities using...
  ◮ Head word
  ◮ Coreferent pronouns
◮ Results are promising.
◮ Can we make them state-of-the-art?

SLIDE 6

Goal

◮ Unsupervised, generative model
◮ Cluster named entities by type

People: Micha Elsner, Prof. Eugene Charniak

SLIDE 7

Goal

◮ Unsupervised, generative model
◮ Cluster named entities by type

People: Micha Elsner, Prof. Eugene Charniak

◮ Discover word classes: {Micha, Eugene}, {Elsner, Charniak}, {Prof.}

SLIDE 8

Goal

◮ Unsupervised, generative model
◮ Cluster named entities by type

People: Micha Elsner, Prof. Eugene Charniak

◮ Discover word classes: {Micha, Eugene}, {Elsner, Charniak}, {Prof.}
◮ Cluster possibly-coreferent phrases?

People: Micha Elsner, Prof. Eugene Charniak, Charniak

SLIDE 9

Overview

Introduction
Clustering as parsing
Consistency: finding possible entities
Experiments: pronouns are key!
Future directions

SLIDE 10

Overview

Introduction
Clustering as parsing
Consistency: finding possible entities
Experiments: pronouns are key!
Future directions

SLIDE 11

Clustering as parsing

Grammar:

NE → pers
NE → org
NE → loc
pers → pers_term+
pers_term → Moses
pers_term → Brown
org → org_term+
org_term → Brown
org_term → University

Example parses:

(NE (pers (pers_term Moses) (pers_term Brown)))
(NE (org (org_term Brown) (org_term University)))
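The clustering-as-parsing idea in this grammar can be sketched in a few lines of Python. This is an illustration only: the lexicons below are invented for the example, and a phrase is treated as a candidate for every class whose terminal list covers all of its words.

```python
# Toy version of clustering-as-parsing: each NE class generates a
# sequence of its own terminals (class_term+), so a phrase parses
# under every class whose lexicon covers all of its words.
# The lexicons here are illustrative, not learned.
LEXICON = {
    "pers": {"moses", "brown"},
    "org":  {"brown", "university"},
    "loc":  {"providence", "ri"},
}

def candidate_classes(phrase):
    words = phrase.lower().split()
    return [c for c, terms in LEXICON.items()
            if all(w in terms for w in words)]

print(candidate_classes("Moses Brown"))       # ['pers']
print(candidate_classes("Brown University"))  # ['org']
print(candidate_classes("Brown"))             # ['pers', 'org'] -- ambiguous word
```

Ambiguous single words like "Brown" fall into several classes; the model's job is to resolve such ambiguity from context.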

SLIDE 12

Internal structure

Grammar:

NE → org
org → org1 org2
org1 → Brown
org2 → University

Example parse:

(NE (org (org1 Brown) (org2 University)))

SLIDE 13

Internal structure

Grammar:

NE → org
org → org1 org2
org → (org1)(org2)(org3)(org4)(org5)
org1 → Brown
org2 → University

Example parse:

(NE (org (org1 Brown) (org2 University)))

SLIDE 14

Multiword expansions

Grammar:

NE → loc
loc → loc1 loc2
loc1 → Providence
loc2 → Rhode Island

Example parse:

(NE (loc (loc1 Providence) (loc2 Rhode Island)))

SLIDE 15

Gathering features

◮ Nominal modifiers (Collins+Singer ‘99)
  ◮ Appositive: “Hillary Clinton, the Secretary of State”
  ◮ Prenominal: “candidate Hillary Clinton”
◮ Prepositional governor (C+S ‘99)
  ◮ “a spokesman for Hillary Clinton”
◮ Personal pronouns
  ◮ “. . . Hillary Clinton. She said . . . ”
  ◮ Unsupervised model of (Charniak+Elsner ‘09)
◮ Relative pronouns
  ◮ “Hillary Clinton, who said. . . ”

Add features to input strings:

Hillary Clinton # Secretary candidate # spokesman-for # she who
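The feature-string format on this slide can be sketched as a small helper. The field order (mention, nominal modifiers, prepositional governors, pronouns) is read off the example; the function name and layout are assumptions.

```python
# Sketch of the "#"-delimited feature strings fed to the model:
# mention words, then one field per feature type.  Field names and
# ordering are assumptions based on the slide's example.
def encode(mention, nominals=(), preps=(), pronouns=()):
    fields = [mention,
              " ".join(nominals),   # appositive / prenominal modifiers
              " ".join(preps),      # prepositional governors
              " ".join(pronouns)]   # personal / relative pronouns
    return " # ".join(fields)

s = encode("Hillary Clinton",
           nominals=["Secretary", "candidate"],
           preps=["spokesman-for"],
           pronouns=["she", "who"])
print(s)  # Hillary Clinton # Secretary candidate # spokesman-for # she who
```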

SLIDE 16

Adding features

Grammar:

NE → org pronouns_org
org → org1 org2
pronouns_org → # pronoun_org*
pronoun_org → which
pronoun_org → they
pronoun_org → he
. . .

Example parse:

(NE (org (org1 Brown) (org2 University)) (pronouns_org # which))

SLIDE 17

Learning the grammar

How to learn rule probabilities?

◮ Many, many rules:

◮ With multiword strings, infinite!

◮ Most of them useless.

Bayesian model

Sparse prior over rules. Only useful rules get non-zero probability.
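The sparsity point can be illustrated directly: draws from a symmetric Dirichlet with a small concentration parameter put nearly all their mass on a few rules, while a large concentration gives near-uniform probabilities. A standalone sketch, not the paper's model:

```python
# Symmetric Dirichlet draws via normalised Gamma samples: with small
# alpha most rule probabilities land near zero (sparse), with large
# alpha they land near uniform.  Purely illustrative.
import random

def dirichlet_sample(alpha, k, rng=random):
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)
sparse = dirichlet_sample(0.05, 20, rng)   # a few entries dominate
flat = dirichlet_sample(10.0, 20, rng)     # entries near 1/20
```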

SLIDE 18

Adaptor grammars (Johnson+al ‘07)

◮ Prior over grammars
◮ Form of hierarchical Dirichlet process
◮ Black-box inference, downloadable software

◮ Development is just writing the grammar

◮ But standard inference isn’t always good enough

Tuesday, 11:30

“Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars”, Mark Johnson and Sharon Goldwater.

SLIDE 19

Overview

Introduction
Clustering as parsing
Consistency: finding possible entities
Experiments: pronouns are key!
Future directions

SLIDE 20

Consistent phrases

Definition: Consistent

Phrases that could refer to the same entity. Weaker than coreference. Non-trivial for named entities.

Inconsistent, same heads:

◮ Ford Motor Co.
◮ Lockheed Martin Co.

Consistent, different heads:

◮ Professor Johnson
◮ Mark

SLIDE 21

Modeling consistency

Model’s concept of consistency follows (Charniak ‘01): Phrases are consistent if none of their internal subparts clash.

Ordered template (pers1 pers2 pers3 pers4): Prof. Mark E. Johnson

SLIDE 22

Modeling consistency

Model’s concept of consistency follows (Charniak ‘01): Phrases are consistent if none of their internal subparts clash.

Ordered template (pers1 pers2 pers3 pers4): Prof. Mark E. Johnson
Realizations: Mark Johnson

SLIDE 23

Modeling consistency

Model’s concept of consistency follows (Charniak ‘01): Phrases are consistent if none of their internal subparts clash.

Ordered template (pers1 pers2 pers3 pers4): Prof. Mark E. Johnson
Realizations: Mark Johnson, Prof. Johnson

SLIDE 24

Modeling consistency

Model’s concept of consistency follows (Charniak ‘01): Phrases are consistent if none of their internal subparts clash.

Ordered template (pers1 pers2 pers3 pers4): Prof. Mark E. Johnson
Realizations: Mark Johnson, Prof. Johnson, Mark

SLIDE 25

Modeling consistency

Model’s concept of consistency follows (Charniak ‘01): Phrases are consistent if none of their internal subparts clash.

Ordered template (pers1 pers2 pers3 pers4): Prof. Mark E. Johnson
Realizations: Mark Johnson, Prof. Johnson, Mark
Inconsistent: Mark Steedman
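A minimal sketch of this template-based consistency check, with slot assignment simplified to greedy left-to-right matching (the real model distinguishes slot types, so this is an approximation):

```python
# Consistency in the (Charniak '01) sense: a phrase is consistent
# with an entity's ordered template if its words can fill a
# subsequence of the template's slots without clashing.  Slot
# assignment is simplified here to greedy in-order matching.
def consistent(template, phrase):
    """template: tuple of slot fillers, e.g. ('Prof.', 'Mark', 'E.', 'Johnson')
    phrase: list of words; True iff the words match slots in order."""
    i = 0
    for word in phrase:
        # advance to the next slot this word could fill
        while i < len(template) and template[i] != word:
            i += 1
        if i == len(template):
            return False  # word clashes with every remaining slot
        i += 1
    return True

tmpl = ("Prof.", "Mark", "E.", "Johnson")
print(consistent(tmpl, ["Mark", "Johnson"]))   # True: fills slots 2 and 4
print(consistent(tmpl, ["Prof.", "Johnson"]))  # True: fills slots 1 and 4
print(consistent(tmpl, ["Mark", "Steedman"]))  # False: no slot for Steedman
```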

SLIDE 26

Overview

Introduction
Clustering as parsing
Consistency: finding possible entities
Experiments: pronouns are key!
Future directions

SLIDE 27

Experimental setup

Datasets:

◮ Labeled data: MUC-7

◮ Three entity classes: PERS, ORG, LOC

◮ Unlabeled data: NANC

Combine features for multiple examples:

Hillary Clinton # # # who
Hillary Clinton # Secretary # # she
Hillary Clinton # # spokesman-for # her

→ Hillary Clinton # Secretary # spokesman-for # she her who

More data in equal time... but no per-document features.
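The merging step can be sketched as follows; the "#"-delimited field layout is taken from the examples above, and the position-by-position union (the order of merged tokens is not significant) is an assumption:

```python
# Merge the feature strings of repeated mentions of the same name:
# split each line into its "#"-delimited fields and union the
# tokens field by field.  Token order in a merged field is not
# meaningful; the slide's example differs only in ordering.
def merge(lines):
    rows = [[f.strip() for f in line.split("#")] for line in lines]
    merged = []
    for col in zip(*rows):
        words = []
        for cell in col:
            for w in cell.split():
                if w not in words:      # de-duplicate, keep first-seen order
                    words.append(w)
        merged.append(" ".join(words))
    return " # ".join(merged)

print(merge(["Hillary Clinton # # # who",
             "Hillary Clinton # Secretary # # she",
             "Hillary Clinton # # spokesman-for # her"]))
# Hillary Clinton # Secretary # spokesman-for # who she her
```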

SLIDE 28

Basic results

Our model:

Baseline (all ORG): 46%
Our best model: 86%

Confusion matrix:

        loc    org    per
LOC    1187     97     37
ORG     223   1517    122
PER      36     20    820

SLIDE 29

Essentially unjustified comparisons

(Haghighi+Klein ‘07)

◮ ACE corpus: 61%

(Collins+Singer ‘99)

◮ Easier dataset
  ◮ Only examples with features
  ◮ Proportionally more people
◮ Generative baseline: 83%
◮ Cotraining: 91%

Supervised MUC-7:

◮ Best system (LTG): 94%
◮ Human: 97%

SLIDE 30

Breakdown by features

Model                              Dev accuracy
Baseline (All ORG)                 42.5
Core NPs (no consistency)          45.5
Core NPs (consistency)             48.5
Context features (nominal/prep)    83.3
All features (context + pronouns)  87.1

SLIDE 31

Named entity structure

Sample learned word classes (table flattened from the slide): person slots pers0–pers4 contain titles (rep., sen., dr., minister), first names (john, robert, david, michael), middle initials (j., b., l.), surnames (brown, smith, johnson, washington), and suffixes (jr., iii); location slots loc0–loc4 contain place names (washington, los angeles, new york, texas, united states), modifiers (the, st., new, south, north, national), head nouns (county, city, beach, river, valley), and days (monday, thursday, tuesday).

SLIDE 32

Judging consistency

Sometimes right:

◮ Dr. Seuss
◮ Dr. Quinn

... correctly judged inconsistent.

SLIDE 33

Judging consistency

Sometimes right:

◮ Dr. Seuss
◮ Dr. Quinn

... correctly judged inconsistent.

Sometimes wrong:

◮ Dr. William F. Gibson
◮ Dr. William Gibson

... judged inconsistent.

◮ Bruce Jarvis
◮ Bruce Ellen Jarvis

... judged consistent.

SLIDE 34

Inference is a problem

Gibbs sampling

◮ Converges in the limit....
◮ Not in real life!
◮ Clustering problems are often NP-hard:
  ◮ There’s no guaranteed method.

For this model:

◮ Used heuristic inference
◮ Still only partial convergence!

SLIDE 35

Conclusion

Introduction
Clustering as parsing
Consistency: finding possible entities
Experiments: pronouns are key!
Future directions

SLIDE 36

Overview

Introduction
Clustering as parsing
Consistency: finding possible entities
Experiments: pronouns are key!
Future directions

SLIDE 37

What’s next

◮ Add named-entity to unsupervised coreference
  ◮ Document-level features might help NE...
  ◮ If the combined model could scale.
◮ Improve inference for Bayesian models
  ◮ Gibbs sampling isn’t good enough...
  ◮ Better sampling?
  ◮ Or something completely different?
◮ Adaptor grammars: what else are they good for?

SLIDE 38

Thanks!

◮ Three reviewers
◮ NSF
◮ All of you!

SLIDE 39

Overview

Adaptor grammars: framework for Bayesian grammar learning
Implementing Consistency
Inference: a general problem for this approach

SLIDE 40

Adaptor grammars (Johnson+al ‘07)

◮ A prior over grammars
◮ Some nonterms are Dirichlet processes over subtrees

◮ Previously used expansions gain probability

◮ Black-box inference, downloadable software

◮ Development is just writing the grammar

◮ But standard inference isn’t always good enough

◮ More on this later...

Tuesday, 11:30

“Improving nonparametric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars”, Mark Johnson and Sharon Goldwater.

SLIDE 41

Adaptor grammars (Johnson+al ‘07)

Prior grammar (count, rule):

1  words → word words
1  words → word
1  word → Rhode
1  word → Island
1  word → Colorado
. . .
1  loc2 → words

Data:

Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 42

Adaptor grammars (Johnson+al ‘07)

Posterior grammar (count, rule):

2  words → word words
2  words → word
2  word → Rhode
2  word → Island
1  word → Colorado
. . .
1  loc2 → words
1  loc2 → Rhode Island

Data:

Providence Rhode Island
Boulder Colorado
Newport Rhode Island

Parsed so far:

(NE (loc1 Providence) (loc2 (words (word Rhode) (words (word Island)))))

SLIDE 43

Adaptor grammars (Johnson+al ‘07)

Posterior grammar (count, rule):

2  words → word words
3  words → word
2  word → Rhode
2  word → Island
2  word → Colorado
. . .
1  loc2 → words
1  loc2 → Rhode Island
1  loc2 → Colorado

Data:

Providence Rhode Island
Boulder Colorado
Newport Rhode Island

Parsed so far:

(NE (loc1 Providence) (loc2 (words (word Rhode) (words (word Island)))))
(NE (loc1 Boulder) (loc2 Colorado))

SLIDE 44

Adaptor grammars (Johnson+al ‘07)

Posterior grammar (count, rule):

2  words → word words
3  words → word
2  word → Rhode
2  word → Island
2  word → Colorado
. . .
1  loc2 → words
2  loc2 → Rhode Island
1  loc2 → Colorado

Data:

Providence Rhode Island
Boulder Colorado
Newport Rhode Island

Parsed so far:

(NE (loc1 Providence) (loc2 (words (word Rhode) (words (word Island)))))
(NE (loc1 Boulder) (loc2 Colorado))
(NE (loc1 Newport) (loc2 Rhode Island))
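The caching behaviour in this worked example follows a Chinese restaurant process over whole expansions: a cached expansion is reused with probability proportional to its count, and a fresh expansion is drawn from the base grammar with probability proportional to a concentration parameter alpha. A minimal sketch (alpha and the numbers are illustrative):

```python
# CRP view of an adapted nonterminal: previously generated
# expansions are cached, reuse probability grows with the cache
# count, and a fresh draw from the base grammar has weight alpha.
from collections import Counter

def crp_probs(cache, alpha):
    """Return (prob of each cached expansion, prob of a fresh draw)."""
    n = sum(cache.values())
    reuse = {exp: c / (n + alpha) for exp, c in cache.items()}
    fresh = alpha / (n + alpha)
    return reuse, fresh

cache = Counter({("Rhode", "Island"): 2, ("Colorado",): 1})
reuse, fresh = crp_probs(cache, alpha=1.0)
print(reuse[("Rhode", "Island")])  # 2 / (3 + 1) = 0.5
print(fresh)                       # 1 / (3 + 1) = 0.25
```

This is why "Rhode Island" becomes a reusable unit after it has been generated twice: its cached count makes reuse cheaper than rebuilding it from word-level rules.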

SLIDE 45

Overview

Adaptor grammars: framework for Bayesian grammar learning
Implementing Consistency
Inference: a general problem for this approach

SLIDE 46

Implementing consistency

Grammar:

NE → org
org → org_Brown
. . .
org_Brown → org1_Brown org2_Brown
org1_Brown → org1
org2_Brown → org2
org1 → Brown
org2 → University

Example parse:

(NE (org (org_Brown (org1_Brown (org1 Brown)) (org2_Brown (org2 University)))))

The entity-specific nonterminals (underlined on the original slide) are Dirichlet processes; org1_Brown and org2_Brown get only one expansion each.

SLIDE 47

Yet another infinity

How many entities (like org_Brown) are there?

◮ Grows with the data size...
◮ Again, use Bayesian methods.

Allow an infinite number... and constrain with a sparse prior.
Simple in principle (a special case of the “Infinite PCFG”, Liang+al ‘07).
Requires some code changes.

SLIDE 48

Overview

Adaptor grammars: framework for Bayesian grammar learning
Implementing Consistency
Inference: a general problem for this approach

SLIDE 49

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
2  loc2 → Rhode Island

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island
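The erase/resample/update loop can be sketched for the simpler case of cluster labels rather than parse trees; this illustrates the Gibbs move, not the paper's sampler (which resamples trees under the grammar):

```python
# One Gibbs move over cluster labels: pick a random item, erase its
# label from the counts, then resample it in proportion to the
# remaining counts, with weight alpha for opening a new cluster.
# Illustrative stand-in for resampling a parse tree.
import random
from collections import Counter

def gibbs_step(labels, alpha=1.0, rng=random):
    i = rng.randrange(len(labels))
    others = Counter(labels[:i] + labels[i + 1:])  # erase item i's analysis
    options = list(others) + [max(labels) + 1]     # reuse a cluster, or a new one
    weights = [others[o] for o in others] + [alpha]
    labels[i] = rng.choices(options, weights=weights)[0]
    return labels
```

Running `gibbs_step` repeatedly resamples one analysis at a time, which is exactly why the deck later complains about mobility: single-item moves get stuck in local maxima.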

SLIDE 50

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
1  loc2 → Rhode Island

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 51

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
1  loc2 → Rhode Island

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 52

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
1  loc2 → Rhode Island
1  loc2 → Rhode

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 53

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
1  loc2 → Rhode Island
1  loc2 → Rhode

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 54

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
1  loc2 → Rhode

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 55

Basic inference by sampling

Gibbs sampling:

◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree

Rules for loc2 (count, rule):
1  loc2 → words
1  loc2 → Colorado
1  loc2 → Rhode

Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

SLIDE 56

Issue 1: efficiency

Sampling a new parse

◮ Via CKY algorithm: O(n³)
  ◮ ... times a grammar constant!
  ◮ One set of nonterminals for each entity
  ◮ Scales poorly

Can be dealt with (Metropolis-Hastings algorithm):

◮ Proposal distribution:
  ◮ Easy-to-calculate approximation to the grammar
◮ Worse approximations, slower runtimes.
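The proposal-distribution bullet refers to a Metropolis-Hastings correction: a parse proposed from a cheap approximate grammar q is accepted with probability min(1, p_new/p_old * q_old/q_new), where the q_old/q_new factor is the Hastings term that the heuristic sampler later drops. A sketch with invented function names:

```python
# Metropolis-Hastings acceptance test for a parse proposed from an
# approximate grammar q: accept with prob min(1, p_new/p_old * q_old/q_new).
# Dropping the q_old/q_new factor gives the (invalid but faster)
# heuristic mentioned later in the deck.
import random

def mh_accept(p_new, p_old, q_new, q_old, rng=random):
    ratio = (p_new / p_old) * (q_old / q_new)
    return rng.random() < min(1.0, ratio)

# A proposal that doubles the posterior and is equally likely under q
# is always accepted; a zero-probability proposal never is.
print(mh_accept(2.0, 1.0, 1.0, 1.0))  # True
print(mh_accept(0.0, 1.0, 1.0, 1.0))  # False
```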

SLIDE 57

Issue 2: mobility

Local maxima are still a problem

◮ Gibbs sampling converges in the limit...
◮ Not in real life!
◮ What you’d expect: clustering is often NP-hard
◮ Resampling one tree at a time means lots of local maxima
◮ Better moves:
  ◮ Split and merge entities
  ◮ Reparse multiple strings at once
◮ Tricky to implement...
◮ Correct algorithms can be very slow in practice

SLIDE 58

Compromise: heuristic inference

What we actually do:

◮ Propose only a subset of entities for each string:
  ◮ Must have at least one word in common
  ◮ Less likely if shared word is frequent
◮ Ignore the Hastings correction term!

Not theoretically valid, but faster.

◮ Even so, inference remains a problem.
◮ Too many clusters for the same entity