Structured Generative Models for Unsupervised Named Entity Clustering
Micha Elsner, Prof. Eugene Charniak, Prof. Mark E. Johnson
Brown Lab for Linguistic and Information Processing
Brown University
Providence, RI
Named Entities
◮ People: Micha Elsner, Prof. Eugene Charniak, Prof. Mark E. Johnson
◮ Organizations: Brown Lab for Linguistic and Information Processing, Brown University
◮ Places: Providence, RI
2
Named Entity Structure
◮ People: Micha Elsner; Prof. Eugene Charniak; Prof. Mark E. Johnson
◮ Organizations: Brown University; Brown Lab for Linguistic and Information Processing
◮ Places: Providence, RI
(Figure: the names above, split into their component words.)
3
Motivation
Isn’t this old news?
◮ Cotraining: (Collins+Singer ’99, Riloff+Jones ’99)
Generative models
New direction in coreference resolution: (Haghighi+Klein ’07), (Ng ’08) and others
Integrated models for subtasks (including Named Entity)
◮ (H+K) cluster named entities using...
  ◮ Head word
  ◮ Coreferent pronouns
◮ Results are promising.
◮ Can we make them state-of-the-art?
4
Goal
◮ Unsupervised, generative model
◮ Cluster named entities by type
  People: Micha Elsner, Prof. Eugene Charniak
◮ Discover word classes
  Micha Eugene Elsner Prof. Charniak
◮ Cluster possibly-coreferent phrases?
  People: Micha Elsner; Prof. Eugene Charniak, Charniak
5
Overview
◮ Introduction
◮ Clustering as parsing
◮ Consistency: finding possible entities
◮ Experiments: pronouns are key!
◮ Future directions
6
Clustering as parsing
Grammar:
NE → pers
NE → org
NE → loc
org → org_term+
org_term → Brown
org_term → University
pers → pers_term+
pers_term → Moses
pers_term → Brown
Parses:
(NE (pers (pers_term Moses) (pers_term Brown)))
(NE (org (org_term Brown) (org_term University)))
8
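A minimal sketch of how such a grammar turns clustering into parsing: score each derivation NE → type → type_term+ and keep the best type. The rule probabilities below are invented for illustration (the slides give rules, not numbers), and the length term for type_term+ is ignored.

```python
from math import prod

# Hypothetical rule probabilities; not taken from the talk.
TYPE_PRIOR = {"pers": 1 / 3, "org": 1 / 3, "loc": 1 / 3}
TERM_PROB = {
    "pers": {"Moses": 0.5, "Brown": 0.5},
    "org": {"Brown": 0.5, "University": 0.5},
    "loc": {},
}


def parse_scores(phrase):
    """Score the derivation NE -> type -> type_term+ for every type."""
    words = phrase.split()
    scores = {}
    for ne_type, prior in TYPE_PRIOR.items():
        terms = TERM_PROB[ne_type]
        # A word with no rule under this type gives the derivation probability 0.
        scores[ne_type] = prior * prod(terms.get(w, 0.0) for w in words)
    return scores


def best_type(phrase):
    """Return the highest-scoring type, or None if no type can derive the phrase."""
    scores = parse_scores(phrase)
    label = max(scores, key=scores.get)
    return label if scores[label] > 0 else None


print(best_type("Moses Brown"))
print(best_type("Brown University"))
```

"Brown" alone is ambiguous between pers and org under these numbers; the neighbouring word decides.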
Internal structure
Grammar:
NE → org
org → org1 org2
org → (org1)(org2)(org3)(org4)(org5)
org1 → Brown
org2 → University
Parse:
(NE (org (org1 Brown) (org2 University)))
9
Multiword expansions
Grammar:
NE → loc
loc → loc1 loc2
loc1 → Providence
loc2 → Rhode Island
Parse:
(NE (loc (loc1 Providence) (loc2 Rhode Island)))
10
Gathering features
◮ Nominal modifiers (Collins+Singer ’99)
  ◮ Appositive: “Hillary Clinton, the Secretary of State”
  ◮ Prenominal: “candidate Hillary Clinton”
◮ Prepositional governor (C+S ’99)
  ◮ “a spokesman for Hillary Clinton”
◮ Personal pronouns
  ◮ “. . . Hillary Clinton. She said . . . ”
  ◮ Unsupervised model of (Charniak+Elsner ’09)
◮ Relative pronouns
  ◮ “Hillary Clinton, who said. . . ”
Add features to input strings:
Hillary Clinton # Secretary candidate # spokesman-for # she who
11
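The input-string format above can be sketched as a small helper. The function name and argument names are hypothetical; only the `#`-separated layout (name, nominal modifiers, prepositional governors, pronouns) comes from the slide.

```python
def feature_string(name, nominal=(), governors=(), pronouns=()):
    """Build 'name # nominal modifiers # prepositional governors # pronouns'."""
    fields = [name, " ".join(nominal), " ".join(governors), " ".join(pronouns)]
    # Collapse the double spaces left behind by empty fields.
    return " ".join(" # ".join(fields).split())


print(feature_string(
    "Hillary Clinton",
    nominal=["Secretary", "candidate"],
    governors=["spokesman-for"],
    pronouns=["she", "who"],
))
# Hillary Clinton # Secretary candidate # spokesman-for # she who
```

Empty fields are kept as bare separators, so every string has the same number of `#` marks and the grammar can tell the feature types apart.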
Adding features
Grammar:
NE → org pronouns_org
org → org1 org2
pronouns_org → # pronoun_org*
pronoun_org → which
pronoun_org → they
. . .
pronoun_org → he
. . .
Parse:
(NE (org (org1 Brown) (org2 University)) (pronouns_org # which))
12
Learning the grammar
How to learn rule probabilities?
◮ Many, many rules:
◮ With multiword strings, infinite!
◮ Most of them useless.
Bayesian model
Sparse prior over rules. Only useful rules get non-zero probability.
13
Adaptor grammars (Johnson+al ‘07)
◮ Prior over grammars
◮ Form of hierarchical Dirichlet process
◮ Black-box inference, downloadable software
  ◮ Development is just writing the grammar
  ◮ But standard inference isn’t always good enough
Tuesday, 11:30
“Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars”, Mark Johnson and Sharon Goldwater.
14
Consistent phrases
Definition: Consistent
Phrases that could refer to the same entity.
Weaker than coreference.
Non-trivial for named entities.
Inconsistent, same heads:
◮ Ford Motor Co.
◮ Lockheed Martin Co.
Consistent, different heads:
◮ Professor Johnson
◮ Mark
16
Modeling consistency
Model’s concept of consistency follows (Charniak ’01): phrases are consistent if none of their internal subparts clash.
Ordered template (pers1 pers2 pers3 pers4):
◮ Prof. Mark E. Johnson
Realizations:
◮ Mark Johnson
◮ Prof. Johnson
◮ Mark
Inconsistent:
◮ Mark Steedman
17
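The "no clashing subparts" test can be sketched as follows, assuming each phrase has already been mapped to template slots. That mapping is what the model has to learn; here it is simply written out by hand, and the slot numbers are illustrative.

```python
def consistent(a, b):
    """Two slot -> word maps are consistent if no shared slot holds different words."""
    return all(b[slot] == word for slot, word in a.items() if slot in b)


# Hand-assigned slots (pers1..pers4); a real model would infer these.
template = {1: "Prof.", 2: "Mark", 3: "E.", 4: "Johnson"}
mark_johnson = {2: "Mark", 4: "Johnson"}
prof_johnson = {1: "Prof.", 4: "Johnson"}
mark = {2: "Mark"}
mark_steedman = {2: "Mark", 4: "Steedman"}

for phrase in (mark_johnson, prof_johnson, mark, mark_steedman):
    print(phrase, consistent(template, phrase))
```

Missing slots never clash, which is why "Mark" is consistent with the full template while "Mark Steedman" is not.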
Experimental setup
Datasets:
◮ Labeled data: MUC-7
◮ Three entity classes: PERS, ORG, LOC
◮ Unlabeled data: NANC
Combine features for multiple examples:
Hillary Clinton # # # who
Hillary Clinton # Secretary # # she
Hillary Clinton # # spokesman-for # her
Combined: Hillary Clinton # Secretary # spokesman-for # she her who
More data in equal time... but no per-document features.
19
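Combining the features of several examples of one name can be sketched as a field-by-field union. This is an illustrative helper, not the authors' code; it assumes every example shares the same name and field layout, and it keeps features in first-seen order, so the pronoun order differs from the slide.

```python
def combine(examples):
    """Union the features of several examples of one name, field by field."""
    merged = None
    for line in examples:
        fields = [field.split() for field in line.split("#")]
        if merged is None:
            merged = [list(field) for field in fields]
            continue
        # Skip field 0 (the name itself); merge only the feature fields.
        for seen, new in zip(merged[1:], fields[1:]):
            seen.extend(w for w in new if w not in seen)
    if merged is None:
        return ""
    return " ".join(" # ".join(" ".join(f) for f in merged).split())


examples = [
    "Hillary Clinton # # # who",
    "Hillary Clinton # Secretary # # she",
    "Hillary Clinton # # spokesman-for # her",
]
print(combine(examples))
# Hillary Clinton # Secretary # spokesman-for # who she her
```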
Basic results
Our model:
◮ Baseline (all ORG): 46%
◮ Our best model: 86%
Confusion matrix:
       loc   org   per
LOC   1187    97    37
ORG    223  1517   122
PER     36    20   820
20
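As a check on the numbers, accuracy and the all-ORG baseline can be recomputed from the confusion matrix. This assumes rows are gold labels and columns are model clusters, a reading that reproduces the reported 46% baseline.

```python
# Assumption: rows are gold labels, columns are the model's clusters.
confusion = {
    "LOC": {"loc": 1187, "org": 97, "per": 37},
    "ORG": {"loc": 223, "org": 1517, "per": 122},
    "PER": {"loc": 36, "org": 20, "per": 820},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(row[gold.lower()] for gold, row in confusion.items())

accuracy = correct / total                                # 3524 / 4059, about 0.868
all_org_baseline = sum(confusion["ORG"].values()) / total  # 1862 / 4059, about 0.459

print(f"accuracy = {accuracy:.3f}, all-ORG baseline = {all_org_baseline:.3f}")
```

Most of the errors are ORG entities placed in the loc cluster (223) and in the per cluster (122).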
Essentially unjustified comparisons
(Haghighi+Klein ’07)
◮ ACE corpus: 61%
(Collins+Singer ’99)
◮ Easier dataset
  ◮ Only examples with features
  ◮ Proportionally more people
◮ Generative baseline: 83%
◮ Cotraining: 91%
Supervised MUC-7:
◮ Best system (LTG): 94%
◮ Human: 97%
21
Breakdown by features
Model                                Dev accuracy
Baseline (All ORG)                   42.5
Core NPs (no consistency)            45.5
Core NPs (consistency)               48.5
Context features (nominal/prep)      83.3
All features (context + pronouns)    87.1
22
Named entity structure
pers0        pers1     pers2           pers3     pers4
rep.         john      minister        brown     jr.
sen.         robert    j.              smith     a
washington   david     john            b         smith
dr.          michael   l.              johnson   iii

loc0          loc1       loc2            loc3     loc4
washington    the        texas           county   monday
los angeles   st.        new york        city     thursday
south         new        washington      beach    river
north         national   united states   valley   tuesday
23
Judging consistency
Sometimes right:
◮ Dr. Seuss
◮ Dr. Quinn
... correctly judged inconsistent.
Sometimes wrong:
◮ Dr. William F. Gibson
◮ Dr. William Gibson
... judged inconsistent.
◮ Bruce Jarvis
◮ Bruce Ellen Jarvis
... judged consistent.
24
Inference is a problem
Gibbs sampling
◮ Converges in the limit...
◮ Not in real life!
◮ Clustering problems are often NP-hard:
  ◮ There’s no guaranteed method.
For this model:
◮ Used heuristic inference
◮ Still only partial convergence!
25
Conclusion
Introduction Clustering as parsing Consistency: finding possible entities Experiments: pronouns are key! Future directions
26
What’s next
◮ Add named-entity to unsupervised coreference
  ◮ Document-level features might help NE...
  ◮ If the combined model could scale.
◮ Improve inference for Bayesian models
  ◮ Gibbs sampling isn’t good enough...
  ◮ Better sampling?
  ◮ Or something completely different?
◮ Adaptor grammars: what else are they good for?
28
Thanks!
◮ Three reviewers
◮ NSF
◮ All of you!
29
Overview
◮ Adaptor grammars: framework for Bayesian grammar learning
◮ Implementing Consistency
◮ Inference: a general problem for this approach
30
Adaptor grammars (Johnson+al ‘07)
◮ A prior over grammars
◮ Some nonterms are Dirichlet processes over subtrees
  ◮ Previously used expansions gain probability
◮ Black-box inference, downloadable software
  ◮ Development is just writing the grammar
  ◮ But standard inference isn’t always good enough
  ◮ More on this later...
Tuesday, 11:30
“Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars”, Mark Johnson and Sharon Goldwater.
31
Adaptor grammars (Johnson+al ’07)
Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island

Prior grammar:
count  rule
1      words → word words
1      words → word
1      word → Rhode
1      word → Island
1      word → Colorado
. . .
1      loc2 → words

Posterior grammar after “Providence Rhode Island”:
count  rule
2      words → word words
2      words → word
2      word → Rhode
2      word → Island
1      word → Colorado
. . .
1      loc2 → words
1      loc2 → Rhode Island
Parse: (NE (loc (loc1 Providence) (loc2 (words (word Rhode) (words (word Island))))))

Posterior grammar after “Boulder Colorado”:
count  rule
2      words → word words
3      words → word
2      word → Rhode
2      word → Island
2      word → Colorado
. . .
1      loc2 → words
1      loc2 → Rhode Island
1      loc2 → Colorado
Parse: (NE (loc (loc1 Boulder) (loc2 (words (word Colorado)))))

Posterior grammar after “Newport Rhode Island”:
count  rule
2      words → word words
3      words → word
2      word → Rhode
2      word → Island
2      word → Colorado
. . .
1      loc2 → words
2      loc2 → Rhode Island
1      loc2 → Colorado
Parse: (NE (loc (loc1 Newport) (loc2 Rhode Island))), reusing the cached expansion loc2 → Rhode Island
32
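The caching behaviour in these counts can be sketched with the standard Dirichlet-process predictive rule, P(x) = (n_x + α·P0(x)) / (n + α). The class below is an illustration only: `alpha` and the constant base probability are made-up values, and the downloadable adaptor grammar software is not being reproduced here.

```python
from collections import Counter


class AdaptedNonterminal:
    """Dirichlet-process cache over whole expansions of one nonterminal."""

    def __init__(self, alpha, base_prob):
        self.alpha = alpha          # concentration: weight given to the base grammar
        self.base_prob = base_prob  # callable: base-grammar probability of an expansion
        self.counts = Counter()

    def prob(self, expansion):
        n = sum(self.counts.values())
        cached = self.counts[expansion]
        return (cached + self.alpha * self.base_prob(expansion)) / (n + self.alpha)

    def observe(self, expansion):
        self.counts[expansion] += 1


# Hypothetical settings: alpha = 1 and a flat base probability of 0.01.
loc2 = AdaptedNonterminal(alpha=1.0, base_prob=lambda expansion: 0.01)
before = loc2.prob("Rhode Island")
for expansion in ["Rhode Island", "Colorado", "Rhode Island"]:
    loc2.observe(expansion)
after = loc2.prob("Rhode Island")
print(before, after)
```

After two uses, "Rhode Island" is far more probable as a loc2 expansion than the base grammar alone would make it: previously used expansions gain probability.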
Implementing consistency
Grammar:
NE → org
org → org_Brown . . .
org_Brown → org1_Brown org2_Brown
org1_Brown → org1
org2_Brown → org2
org1 → Brown
org2 → University
Parse:
(NE (org (org_Brown (org1_Brown (org1 Brown)) (org2_Brown (org2 University)))))
Underlined nonterminals are Dirichlet processes.
org1_Brown and org2_Brown get only one expansion.
34
Yet another infinity
How many entities (like org_Brown) are there?
◮ Grows with the data size...
◮ Again, use Bayesian methods.
Allow an infinite number... and constrain with a sparse prior.
◮ Simple in principle (special case of “Infinite PCFG”, Liang+al ’07)
◮ Requires some code changes.
35
Basic inference by sampling
Gibbs sampling:
◮ Start with arbitrary trees
◮ Repeat forever
  ◮ Erase a random tree
  ◮ Sample a tree from the current grammar
  ◮ Update the grammar given the new tree
Data:
Providence Rhode Island
Boulder Colorado
Newport Rhode Island
Rules for loc2, step by step:
◮ Start: 1 loc2 → words, 1 loc2 → Colorado, 2 loc2 → Rhode Island
◮ Erase a tree: 1 loc2 → words, 1 loc2 → Colorado, 1 loc2 → Rhode Island
◮ Sample a new tree (using loc1, loc2, loc3) and update: 1 loc2 → words, 1 loc2 → Colorado, 1 loc2 → Rhode Island, 1 loc2 → Rhode
◮ Erase another tree: 1 loc2 → words, 1 loc2 → Colorado, 1 loc2 → Rhode
◮ Sample a new tree for the erased string: 1 loc2 → words, 1 loc2 → Colorado, 1 loc2 → Rhode
37
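The erase / sample / update loop can be sketched on a much smaller problem than the real grammar. In this toy version a "tree" is just a split point: everything after the split is the loc2 expansion, scored by a Dirichlet-process cache. The parameters `alpha` and `p_word`, the fixed step count, and the assumption that every string has at least two words are all illustrative choices.

```python
import random
from collections import Counter


def gibbs_splits(strings, steps=200, alpha=1.0, p_word=0.1, seed=0):
    """Toy Gibbs sampler: resample one string's split point at a time."""
    rng = random.Random(seed)
    data = [s.split() for s in strings]
    # Start with arbitrary trees: split every string after its first word.
    splits = [1] * len(data)
    counts = Counter(" ".join(w[k:]) for w, k in zip(data, splits))
    for _ in range(steps):
        # Erase a random tree from the grammar counts.
        i = rng.randrange(len(data))
        words = data[i]
        old = " ".join(words[splits[i]:])
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]
        # Sample a new tree from the current grammar:
        # cached count plus alpha times a base probability that decays with length.
        options = list(range(1, len(words)))
        weights = [
            counts[" ".join(words[k:])] + alpha * p_word ** (len(words) - k)
            for k in options
        ]
        splits[i] = rng.choices(options, weights=weights)[0]
        # Update the grammar given the new tree.
        counts[" ".join(words[splits[i]:])] += 1
    return splits, counts


splits, counts = gibbs_splits(
    ["Providence Rhode Island", "Boulder Colorado", "Newport Rhode Island"]
)
print(splits, dict(counts))
```

Because each string is resampled alone, an expansion shared by two strings (such as "Rhode Island") is hard to abandon once cached, which is the mobility problem discussed later.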
Issue 1: efficiency
Sampling a new parse
◮ Via CKY algorithm: O(n^3)
  ◮ ... times a grammar constant!
  ◮ One set of nonterminals for each entity
  ◮ Scales poorly
Can be dealt with (Metropolis-Hastings algorithm):
◮ Proposal distribution:
  ◮ Easy-to-calculate approximation to the grammar
  ◮ Worse approximations, slower runtimes.
38
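For reference, the standard Metropolis-Hastings acceptance rule is sketched below. Here `p_*` stands for the probability of a parse under the true grammar and `q_*` for its probability under the cheap proposal; the function name and arguments are illustrative.

```python
def mh_accept_prob(p_new, p_old, q_new, q_old):
    """Probability of accepting a proposed parse in Metropolis-Hastings.

    p_new, p_old: target (grammar) probabilities of the new and old parses.
    q_new, q_old: proposal probabilities of the new and old parses.
    """
    return min(1.0, (p_new * q_old) / (p_old * q_new))


# A proposal that matches the target is always accepted.
print(mh_accept_prob(p_new=0.2, p_old=0.1, q_new=0.2, q_old=0.1))
# A proposal that over-samples a parse the grammar dislikes is often rejected.
print(mh_accept_prob(p_new=0.05, p_old=0.2, q_new=0.5, q_old=0.5))
```

The worse the proposal approximates the grammar, the more samples are rejected, hence the slower runtimes.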
Issue 2: mobility
Local maxima are still a problem
◮ Gibbs sampling converges in the limit...
◮ Not in real life!
◮ What you’d expect: clustering is often NP-hard
◮ Resampling one tree at a time means lots of local maxima
◮ Better moves:
  ◮ Split and merge entities
  ◮ Reparse multiple strings at once
◮ Tricky to implement...
◮ Correct algorithms can be very slow in practice
39
Compromise: heuristic inference
What we actually do:
◮ Propose only a subset of entities for each string:
  ◮ Must have at least one word in common
  ◮ Less likely if shared word is frequent
◮ Ignore the Hastings correction term!
Not theoretically valid, but faster.
◮ Even so, inference remains a problem.
  ◮ Too many clusters for the same entity
40
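One way the entity proposal might look, as a sketch: keep only entities sharing a word with the string, and weight each by the rarity of its rarest shared word. The inverse-frequency weighting and the entity names are assumptions; the slide only says that frequent shared words make a proposal less likely.

```python
from collections import Counter


def candidate_entities(phrase, entities):
    """Return a proposal distribution over entities sharing a word with the phrase.

    entities: dict mapping an entity name to the set of words seen with it.
    """
    # How many entities each word appears in (a stand-in for word frequency).
    word_freq = Counter(w for words in entities.values() for w in words)
    phrase_words = set(phrase.split())
    weights = {}
    for name, entity_words in entities.items():
        shared = phrase_words & entity_words
        if shared:
            # Assumed weighting: rarer shared words give stronger evidence.
            weights[name] = max(1.0 / word_freq[w] for w in shared)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical entities for illustration.
entities = {
    "brown_univ": {"Brown", "University"},
    "boston_univ": {"Boston", "University"},
    "moses_brown": {"Moses", "Brown"},
}
print(candidate_entities("Moses Brown", entities))
```

Entities with no word in common are never proposed, so the sampler never pays for their nonterminals, at the cost of no longer sampling from the exact posterior.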