Name Phylogeny: A Generative Model of String Variation




slide-1
SLIDE 1

Name Phylogeny

A Generative Model of String Variation

Nicholas Andrews, Jason Eisner and Mark Dredze

Department of Computer Science, Johns Hopkins University

EMNLP 2012 – Thursday, July 12

slide-2
SLIDE 2

Outline

Introduction
Generative Model
Mutation Model
Inference
Experiments
Future Work

slide-3
SLIDE 3

What’s a name phylogeny?

A fragment of a “name phylogeny” learned by our model

Khawaja Gharibnawaz Muinuddin Hasan Chisty; Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Ghareeb Nawaz; Khwaja Moinuddin Chishti; Khwaja gharibnawaz Muinuddin Chishti

Thomas Ruggles Pynchon, Jr.; Thomas Ruggles Pynchon Jr.; Thomas R. Pynchon, Jr.; Thomas R. Pynchon Jr.; Thomas R. Pynchon; Thomas Pynchon, Jr.; Thomas Pynchon Jr.

◮ Each edge corresponds to a “mutation”

slide-4
SLIDE 4

Problem: organizing disorganized collections of strings

Barack Obama; Obama; President Barack Obama; Barack; Barrack; barack obama; Hillary Clinton; Clinton; Bill Clinton; bill; Bill; Barry; Vice President Clinton; Billy; Hillary; will clinton; Hillary Rodham Clinton; Mitt Romney; Barack Obama Sr; Romney; Willard M. Romney; Governor Mitt Romney; Mr. Romney; mitt; Mitt; rommey; clinton; William Clinton; barak; President Bill Clinton; President Barack H. Obama; Ms. Clinton
slide-6
SLIDE 6

Challenges

◮ Name variation: the same entity may have different names, and a good measure of “similarity” between strings may not be available (This work)
◮ Disambiguation: different entities may have names in common, requiring the use of context to disambiguate between them

slide-7
SLIDE 7

How does a name phylogeny help?

1. Organizes name variants into connected components (clusters)
2. Aligns names as “mutations” of one another
3. We can estimate a mutation model given a phylogeny, and a mutation model gives a distribution over phylogenies (→ EM)


slide-9
SLIDE 9

Generative Model

We propose a generative model of string variation that accounts for the sources of name variation.

...
x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
...

What are the sources of variation for names?

slide-10
SLIDE 10

Copying a previous mention

We can copy a name seen before.

...
x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
...
x100001 = Barack Obama

Procedure:

◮ Select a previous name mention uniformly at random
◮ Decide to copy it with probability 1 − µ

slide-11
SLIDE 11

Mutating a previous mention

We can mutate a name seen before.

...
x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
...
x100001 = Mitt

Procedure:

◮ Select a previous name mention uniformly at random
◮ Decide to mutate it with probability µ
◮ Sample a mutation from p(· | Mitt Romney)

slide-12
SLIDE 12

Generating a new name

We can generate a new name.

...
x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
...
x100001 = Joe Biden

Procedure:

◮ Select ♦ with probability proportional to α (a “pseudocount”)
◮ Sample a new name from p(· | ♦)
  ◮ A character language model

slide-13
SLIDE 13

Generative model summary

To generate the next name mention:

1. Pick an existing name mention w, each with probability 1/(α + k), where k is the number of mentions generated so far
   1.1 Copy w verbatim with probability 1 − µ
   1.2 Mutate w with probability µ
2. Decide to talk about a new entity with probability α/(α + k)
   2.1 Generate a name for it
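
The summary above can be sketched as a short sampling loop. This is only an illustrative sketch: `toy_new_name` and `toy_mutate` are hypothetical stand-ins for the character language model p(· | ♦) and the mutation model p(· | w), and `SEED_NAMES` is toy data, none of which comes from the slides. It assumes α > 0 and that k is the number of mentions generated so far.

```python
import random

SEED_NAMES = ["Barack Obama", "Hillary Clinton", "Mitt Romney"]  # toy data


def toy_new_name(rng):
    """Hypothetical stand-in for sampling a new name from p(. | root)."""
    return rng.choice(SEED_NAMES)


def toy_mutate(name, rng):
    """Hypothetical stand-in for sampling from p(. | name): drop one word."""
    words = name.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
        return " ".join(words)
    return name.lower()


def generate_mentions(n, alpha, mu, rng, new_name=toy_new_name, mutate=toy_mutate):
    """Generate n name mentions following the copy / mutate / new-name story."""
    mentions = []
    for _ in range(n):
        k = len(mentions)
        # New entity with probability alpha / (alpha + k); always true when k == 0.
        if rng.random() < alpha / (alpha + k):
            mentions.append(new_name(rng))
        else:
            # Otherwise each earlier mention is picked with probability 1 / (alpha + k).
            w = rng.choice(mentions)
            # Mutate with probability mu, else copy verbatim.
            mentions.append(mutate(w, rng) if rng.random() < mu else w)
    return mentions


if __name__ == "__main__":
    print(generate_mentions(10, alpha=1.0, mu=0.1, rng=random.Random(0)))
```

With a small µ most mentions are verbatim copies, so frequent names tend to become more frequent, which matches the behavior the slides describe.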

slide-14
SLIDE 14

Generative model in action

x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
...

[Figure: phylogeny over the mentions generated so far]

slide-15
SLIDE 15

Generative model in action

x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
x10009 = Mitt
...

[Figure: phylogeny over the mentions generated so far, now including the vertex Mitt]

slide-16
SLIDE 16

Generative model in action

x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
x10009 = Mitt
x10010 = Barack
...

[Figure: phylogeny over the mentions generated so far, now including the vertices Mitt and Barack]

slide-17
SLIDE 17

Generative model in action

x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
x10009 = Mitt
x10010 = Barack
x10011 = Barry
...

[Figure: phylogeny over the mentions generated so far, now including the vertices Mitt, Barack and Barry]

slide-18
SLIDE 18

Generative model in action

x10001 = Mitt Romney
x10002 = President Barack Obama
x10003 = Barack Obama
x10004 = Secretary of State Hillary Clinton
x10005 = Hillary Clinton
x10006 = Barack Obama
x10007 = Clinton
x10008 = Obama
x10009 = Mitt
x10010 = Barack
x10011 = Barry
x10012 = Hillary Clinton
...

[Figure: phylogeny over the mentions generated so far, now including the vertices Mitt, Barack, Barry and Hillary Clinton]

slide-19
SLIDE 19

A few observations

◮ The proposed generative model is clearly naive
  ◮ No model of discourse or of name structure
◮ The pseudocount α controls the likelihood of new names
◮ We assume a low mutation probability µ, so that most names are copied from earlier frequent names


slide-21
SLIDE 21

Name variation as mutations

“Mutations” capture different types of name variation:

1. Transcription errors: Barack → barack
2. Misspellings: Barack → Barrack
3. Abbreviations: Barack Obama → Barack O.
4. Nicknames: Barack → Barry
5. Dropping words: Barack Obama → Barack
slide-22
SLIDE 22

Mutation via probabilistic finite-state transducers

The mutation model is a probabilistic finite-state transducer with four character operations: copy, substitute, delete, insert

◮ Character operations are conditioned on the right input character
◮ Latent regions of contiguous edits
◮ Back-off smoothing

Transducer parameters θ determine the probability of being in different regions, and of the different character operations
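
To make the four character operations concrete, here is a much simpler model than the transducer on this slide: a memoryless stochastic edit model whose probability p(y | x) sums over all alignments by dynamic programming. It is a sketch under stated assumptions, not the authors' transducer. It has no conditioning on the right input character, no latent edit regions and no back-off smoothing; the function name `mutation_prob` and the default probabilities are illustrative choices.

```python
def mutation_prob(x, y, alphabet, p_copy=0.85, p_sub=0.05, p_del=0.05, p_ins=0.05):
    """p(y | x) under a memoryless edit model, summed over all alignments.

    Assumes p_copy + p_sub + p_del + p_ins == 1, at least two symbols in
    `alphabet`, and that every character of y is in `alphabet`.
    Inserted and substituted characters are drawn uniformly.
    """
    v = len(alphabet)
    n, m = len(x), len(y)
    # f[i][j] = probability of having consumed x[:i] and emitted y[:j].
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if j > 0:  # insert y[j-1]
                total += f[i][j - 1] * p_ins / v
            if i > 0:  # delete x[i-1]
                total += f[i - 1][j] * p_del
            if i > 0 and j > 0:
                if x[i - 1] == y[j - 1]:  # copy
                    total += f[i - 1][j - 1] * p_copy
                else:  # substitute with one of the other v - 1 symbols
                    total += f[i - 1][j - 1] * p_sub / (v - 1)
            f[i][j] = total
    # Once the input is exhausted, stop instead of inserting again.
    return f[n][m] * (1.0 - p_ins)


if __name__ == "__main__":
    letters = "abcdefghijklmnopqrstuvwxyzBKMR. "
    print(mutation_prob("Mr. Robert", "Mr. Bobby", letters))
```

The same forward recursion, run together with a backward pass, is what lets an M-step collect expected counts of each operation for a weighted (x, y) pair.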

slide-23
SLIDE 23

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [

Beginning of edit region

slide-24
SLIDE 24

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [ B

1 substitution operation: (R, B)

slide-25
SLIDE 25

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [ B o b

2 copy operations: (o, o), (b, b)

slide-26
SLIDE 26

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [ B o b

3 deletion operations: (e, ε), (r, ε), (t, ε)

slide-27
SLIDE 27

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [ B o b b y

2 insertion operations: (ε, b), (ε, y)

slide-28
SLIDE 28

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [ B o b b y ]

End of edit region

slide-29
SLIDE 29

Example: Mutating a name

Mr. Robert Kennedy → Mr. Bobby Kennedy

M r . _ R o b e r t _ K e n n e d y $
M r . _ [ B o b b y ] _ K e n n e d y $


slide-31
SLIDE 31

Inference

Input: An unaligned corpus of names (“bag-of-words”)

◮ The order in which the tokens were generated is unknown
◮ No “inputs” or “outputs” are known for the mutation model

Barack Obama; Obama; President Barack Obama; Barack; Barrack; barack obama; Hillary Clinton; Clinton; Bill Clinton; bill; Bill; Barry; Vice President Clinton; Billy; Hillary; will clinton; Hillary Rodham Clinton; Mitt Romney; Barack Obama Sr; Romney; Willard M. Romney; Governor Mitt Romney; Mr. Romney; mitt; Mitt; rommey; clinton; William Clinton; barak; President Bill Clinton; President Barack H. Obama; Ms. Clinton

Output: A distribution over name phylogenies parametrized by transducer parameters θ

slide-32
SLIDE 32

Observed vs unobserved names

Could there be latent forms in the phylogeny?

Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Ghareeb Nawaz; Khwaja gharibnawaz Muinuddin Chishti

[Figure: phylogeny over these names, with “?” marking possible unobserved forms]

slide-33
SLIDE 33

Observed vs unobserved names

What we'd like to do:

Khawaja Gharibnawaz Muinuddin Hasan Chisty; Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Ghareeb Nawaz; Khwaja Moinuddin Chishti; Khwaja gharibnawaz Muinuddin Chishti

What we actually do:

Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Ghareeb Nawaz; Khwaja gharibnawaz Muinuddin Chishti

slide-34
SLIDE 34

Type phylogeny vs token phylogeny

The generative model is over tokens (name mentions)

Ehud Barak; President Barack Obama; Secretary of State Hillary Clinton; Barack Obama; Hillary Clinton; Barack Obama; Clinton; Obama; Barak; Barack; Barry; Hillary Clinton; Barry

But we do type-level inference for the following reasons:

1. Allows faster inference
2. Allows type-level supervision
slide-35
SLIDE 35

Type phylogeny vs token phylogeny

We collapse all copy edges into a single vertex

President Barack Obama; Secretary of State Hillary Clinton; BARACK OBAMA (2); HILLARY CLINTON (2); Clinton; Obama; Barack; BARRY (2); Ehud Barak; Barak; Barry

◮ The first token in each collapsed vertex is a mutation, and the rest are copies
◮ Every edge in the phylogeny now corresponds to a mutation
◮ Approximation: we disallow multiple tokens of the same type from being derived by mutation

slide-36
SLIDE 36

Scoring phylogenies

The weight of a single phylogeny is the product of the weights of its edges:

∏_{y ∈ Y} δ(y | pa(y))

where pa(y) is the parent of y in the phylogeny. What should the edge weights be?

slide-37
SLIDE 37

Edge weights

◮ New names: edges from ♦ to a name x:

δ(x | ♦) = α · p(x | ♦)

◮ Mutations: edges from a name x to a name y:

δ(y | x) = µ · p(y | x) · n_x / (n_y + 1)

Approximation: Edge weights are not quite edge-factored. We are making an approximation of the form

E[∏_y δ(y | pa(y))] ≈ ∏_y E[δ(y | pa(y))]
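
As a small illustration of these edge weights, the sketch below scores one phylogeny given as a child-to-parent map. It assumes that n_x is the number of tokens of type x (the slides do not spell this out), and `p_new` and `p_mutate` are hypothetical callables standing in for p(x | ♦) and p(y | x); the constants in the usage example are made up.

```python
from collections import Counter

ROOT = "♦"


def edge_weight(parent, child, counts, alpha, mu, p_new, p_mutate):
    """Weight of the edge parent -> child under the formulas on this slide."""
    if parent == ROOT:
        return alpha * p_new(child)
    # n_x / (n_y + 1) with x = parent and y = child (token counts, assumed).
    return mu * p_mutate(child, parent) * counts[parent] / (counts[child] + 1)


def phylogeny_weight(parent_of, counts, alpha, mu, p_new, p_mutate):
    """Product of edge weights over all vertices y with parent pa(y)."""
    weight = 1.0
    for child, parent in parent_of.items():
        weight *= edge_weight(parent, child, counts, alpha, mu, p_new, p_mutate)
    return weight


if __name__ == "__main__":
    tokens = ["Barack Obama"] * 3 + ["Obama"]
    counts = Counter(tokens)
    tree = {"Barack Obama": ROOT, "Obama": "Barack Obama"}
    # Constant toy probabilities, purely for illustration.
    print(phylogeny_weight(tree, counts, 1.0, 0.1, lambda x: 0.01, lambda y, x: 0.5))
```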

slide-38
SLIDE 38

Inference via EM

Iterate until convergence:

1. E-step: Given θ, compute a distribution over name phylogenies
2. M-step: Re-estimate transducer parameters θ given marginal edge probabilities.
   ◮ This step sums over alignments for each (x, y) string pair using forward-backward
   ◮ Each (x, y) pair may be viewed as a training example weighted by the marginal probability of the edge from x to y

slide-39
SLIDE 39

E-step: marginalizing over latent variables

The latent variables in the model are:

1. Name phylogeny (spanning tree) relating names as inputs and/or outputs
2. Character alignments from potential input names x to output names y

We use the Matrix-Tree theorem for directed graphs (Tutte, 1984) to efficiently evaluate marginal probabilities:

1. Partition function (sum over phylogenies)
2. Edge marginals
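
The two quantities above can be illustrated on a tiny graph. The sketch below builds the Laplacian for spanning trees rooted at vertex 0 (standing in for ♦), takes its determinant with the root row and column removed, and reads an edge marginal off the change in the partition function when that edge is zeroed out. This is an illustrative sketch: recomputing a determinant per edge is done here for clarity, whereas a single matrix inversion yields all marginals at once. The function names are not from the slides.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant via Gaussian elimination over fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def partition_function(weights):
    """Sum over spanning trees rooted at vertex 0 of the product of edge weights.

    weights[u][v] is the weight of the edge u -> v (parent u, child v).
    """
    n = len(weights)
    lap = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
    for child in range(1, n):
        for parent in range(n):
            if parent == child:
                continue
            w = Fraction(weights[parent][child])
            lap[child - 1][child - 1] += w       # total incoming weight
            if parent != 0:
                lap[parent - 1][child - 1] -= w  # off-diagonal entry
    return determinant(lap)


def edge_marginal(weights, parent, child):
    """Probability that parent -> child appears in a tree drawn proportionally to weight."""
    z = partition_function(weights)
    pruned = [list(row) for row in weights]
    pruned[parent][child] = 0
    return (z - partition_function(pruned)) / z


if __name__ == "__main__":
    w = [[0, 2, 3], [0, 0, 5], [0, 7, 0]]
    print(partition_function(w), edge_marginal(w, 0, 1))
```

For this 3-vertex example the three spanning trees have weights 2·3, 2·5 and 3·7, so the determinant should equal their sum.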
slide-40
SLIDE 40

Speed of inference

Two main slowdowns:

◮ The complexity of the E-step is dominated by the O(n³) matrix inversion (for n names) required to compute the edge marginals c_xy.
◮ The M-step sums over alignments for O(n²) input-output pairs

Approximation: To speed up inference, we prune edges (set δ(y | x) = 0) for names with no trigrams in common
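
The pruning rule can be sketched in a few lines. The slide does not say how trigrams are computed, so this sketch assumes case-sensitive character trigrams with no padding; `char_trigrams` and `candidate_edges` are illustrative names.

```python
def char_trigrams(name):
    """Set of character trigrams of a name (assumed: case-sensitive, unpadded)."""
    return {name[i:i + 3] for i in range(len(name) - 2)}


def candidate_edges(names):
    """Ordered pairs (x, y) that survive pruning: x and y share at least one trigram."""
    grams = {name: char_trigrams(name) for name in names}
    return [
        (x, y)
        for x in names
        for y in names
        if x != y and grams[x] & grams[y]
    ]


if __name__ == "__main__":
    print(candidate_edges(["Barack Obama", "Obama", "Mitt Romney"]))
```

Under these assumptions a pair such as Barack → Barry would also be pruned, since the two strings share only the prefix "Bar" if any trigram at all; whether that is acceptable depends on the trigram definition actually used.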


slide-42
SLIDE 42

Data preparation

We used English Wikipedia (2011) to create lists of name variants

1. Wikipedia redirects are human-curated pages to resolve common name variants to the correct page (unambiguously)
2. We use Freebase to restrict to redirects for Person entities
3. We applied some further filters to remove redirects that were clearly not names (e.g. numbers)
4. We use LDC Gigaword to obtain a frequency for each name variant

slide-43
SLIDE 43

Sample Wikipedia redirects

Ho Chi Minh, Ho chi mihn, Ho-Chi Minh, Ho Chih-minh
Guy Fawkes, Guy fawkes, Guy faux, Guy Falks, Guy Faukes, Guy Fawks, Guy foxe, Guy Falkes
Nicholas II of Russia, Nikolai Aleksandrovich Romanov, Nicholas Alexandrovich of Russia, Nicolas II
Bill Gates, Lord Billy, Bill Gates, BillGates, Billy Gates, William Gates III, William H. Gates
William Shakespeare, William shekspere, William shakspeare, Bill Shakespear
Bill Clinton, Billll Clinton, William Jefferson Blythe IV, Bill J. Clinton, William J Clinton

slide-44
SLIDE 44

Wikipedia as supervision

We use Wikipedia name lists for supervision and evaluation

◮ Treat page redirects as “gold” mutations of the page title:
  Ho Chi Minh → Ho chi mihn
  Ho Chi Minh → Ho-Chi Minh
  Ho Chi Minh → Ho Chih-minh
◮ Each list of redirects is a cluster of names belonging to the same entity
◮ No ambiguous names (by construction)

slide-45
SLIDE 45

Experiment 1: Transducer log-likelihood

Data:

◮ 1500 entities (roughly 6000 names) for training
◮ 1500 different entities (roughly 6000 names) for testing

Procedure:

◮ At train time
  1. Initialize transducer parameters θ using different amounts of supervision (up to 250 entities)
  2. Run EM for 10 iterations to re-estimate θ
  3. α = 1.0, µ = 0.1
◮ At test time
  1. Evaluate log-likelihood of the transducer on all “gold” pairs from the test set

slide-46
SLIDE 46

Experiment 1: Mutation model log-likelihood

[Figure: held-out log-likelihood (y-axis) vs. EM iteration 1 to 9 (x-axis), one curve per supervision level: sup=0, sup=5, sup=25, sup=100, sup=250]

slide-47
SLIDE 47

Experiment 2: Ranking

Data: same as before

Procedure:

◮ At train time
  1. Estimate transducer parameters θ
  2. α = 1.0, µ = 0.1
◮ At test time
  1. For each Wikipedia person page in the test set, produce a ranking of all test aliases
  2. Compute mean reciprocal rank (MRR) over all such rankings
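
For reference, a minimal sketch of the MRR computation. The slides do not state which alias determines the rank, so this assumes the reciprocal rank of the first correct alias in each ranking; the function names and the toy data are illustrative.

```python
def reciprocal_rank(ranking, correct):
    """1 / rank of the first correct item, or 0.0 if none appears."""
    for rank, alias in enumerate(ranking, start=1):
        if alias in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, gold):
    """Average reciprocal rank over pages.

    rankings: page title -> ranked list of aliases
    gold:     page title -> set of correct aliases (e.g. its redirects)
    """
    scores = [reciprocal_rank(ranked, gold[page]) for page, ranked in rankings.items()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    rankings = {
        "Guy Fawkes": ["Guy faux", "BillGates"],
        "Bill Gates": ["Guy Falks", "BillGates"],
    }
    gold = {
        "Guy Fawkes": {"Guy faux", "Guy Falks"},
        "Bill Gates": {"BillGates"},
    }
    print(mean_reciprocal_rank(rankings, gold))
```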
slide-48
SLIDE 48

Experiment 2: Ranking

[Figure: MRR (y-axis, 0.60 to 0.85) for the methods jwink, lev, sup10, semi10, unsup, sup]

◮ For each article name in the test corpus, produce a ranking of redirects
◮ The rankings are evaluated using mean reciprocal rank


slide-50
SLIDE 50

Future Work

◮ More sophisticated mutation models
  ◮ Incorporate internal name structure
◮ Incorporate context in the generative story
◮ Cross-lingual experiments
  ◮ Each vertex labeled with a language, allowing systematic relationships between languages
◮ Other potential applications
  ◮ Derivational morphology
  ◮ Paraphrase
  ◮ Transliteration
  ◮ Historical linguistics
  ◮ Bibliographic entry variation

slide-51
SLIDE 51

Experiment 3 (preliminary): Precision/Recall

Procedure:

◮ At train time

  1. Estimate transducer parameters θ using EM
  2. Find the best spanning tree given θ

◮ At test time

  1. Attach held-out names to the most likely vertex in the inferred spanning tree
  2. Evaluate precision and recall for the connected component
slide-52
SLIDE 52

Experiment 3 (preliminary): Example attachment

[Figure: the phylogeny fragment from slide 3, with the held-out name “Thomas Ruggles” and “?” marking candidate attachments]

◮ Held-out names can attach to any vertex in the tree
  ◮ Including ♦
◮ Attachment weights given by edge weights δ(y|x)

slide-53
SLIDE 53

Experiment 3 (preliminary): Results

[Figure: precision (y-axis) vs. recall (x-axis), both from 0.0 to 1.0, with one curve each for 0%, 1%, 8%, 24% and 100% supervised]