slide-1
SLIDE 1

NAME MATCHING WITH PHYLOGENIES

Nicholas Andrews, Jason Eisner, Mark Dredze

slide-7
SLIDE 7

Name variants: Martin Freeman, Marty Freeman, Martin F, M Freeman, Martin Freedman, Marty Freemen

Tasks: Entity Linking, Coref Resolution

slide-9
SLIDE 9

STRING COMPARISON

  • Levenshtein distance: edit distance between two strings
  • Jaro-Winkler: measures matching characters and transpositions

Mark Dredze vs. Mark Drezde (e.g. typo, name variant)
Mark Dredze vs. Benjamin Van Durme
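
For reference, the Levenshtein baseline can be sketched in a few lines of Python. This is the standard unit-cost dynamic program, not code from the talk; the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # copy or substitute
            ))
        prev = curr
    return prev[-1]


print(levenshtein("Mark Dredze", "Mark Drezde"))         # 2
print(levenshtein("Mark Dredze", "Benjamin Van Durme"))  # much larger
```

The swapped "dz"/"zd" counts as two edits because plain Levenshtein distance has no transposition operation, while the unrelated name is many edits away.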

slide-10
SLIDE 10

NAME VARIATION

  • Nicknames: Benjamin Van Durme vs. Ben Van Durme
  • Aliases: Caryn Elaine Johnson vs. Whoopi Goldberg
  • Chinese Names: Zhang Wei vs. Wei Zhang
  • Arab Names: Muhammad ibn Saeed ibn Abd al-Aziz al-Filasteeni vs. Muhammad vs. Abu Kareem

slide-11
SLIDE 11

OUR GOAL: LEARN HOW TO MATCH NAMES

slide-12
SLIDE 12

FINITE STATE TRANSDUCERS

  • Probabilistic finite state transducers encode a probability distribution over strings given a string
  • Character operations: copy, substitute, delete, insert
  • Train parameters on name pairs
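
To make the idea concrete, here is a minimal sketch of a conditional edit model in Python. It is a deliberately simplified, single-state stand-in for a probabilistic transducer and not the model from the talk. The function name `edit_model_prob`, the fixed operation probabilities, and the `alphabet_size` value are all assumptions chosen for illustration. The dynamic program sums over every alignment of x to y.

```python
def edit_model_prob(x, y, p_copy=0.90, p_sub=0.04, p_del=0.03, p_ins=0.03,
                    alphabet_size=30):
    """p(y | x) under a one-state stochastic edit model (illustrative only).

    While input remains, the model inserts a character (p_ins), copies the
    next input character (p_copy), substitutes it (p_sub), or deletes it
    (p_del). Once the input is used up, it inserts (p_ins) or stops.
    Assumes all characters belong to an alphabet of size alphabet_size.
    """
    assert abs(p_copy + p_sub + p_del + p_ins - 1.0) < 1e-9
    n, m = len(x), len(y)
    # f[i][j]: total probability of consuming x[:i] while emitting y[:j].
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if j > 0:  # insert y[j-1], chosen uniformly from the alphabet
                f[i][j] += f[i][j - 1] * p_ins / alphabet_size
            if i > 0:  # delete x[i-1]
                f[i][j] += f[i - 1][j] * p_del
            if i > 0 and j > 0:
                if x[i - 1] == y[j - 1]:  # copy
                    f[i][j] += f[i - 1][j - 1] * p_copy
                else:  # substitute, uniform over the other characters
                    f[i][j] += f[i - 1][j - 1] * p_sub / (alphabet_size - 1)
    return f[n][m] * (1.0 - p_ins)  # finally choose to stop


print(edit_model_prob("Mark Dredze", "Mark Dredze"))
print(edit_model_prob("Mark Dredze", "Mark Drezde"))
```

In a trained model the operation probabilities would be learned from name pairs rather than fixed by hand, and could depend on context.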

slide-22
SLIDE 22

  • Ideal: matched name pairs
    Ronald Fairbairn, William Ronald Dodds Fairbairn
  • Sets of matching names
    Ronald Fairbairn, W. R. D. Fairbairn, William Ronald Dodds Fairbairn
  • Unorganized set of names
    James Wakefield, James Beach Wakefield, Mikhail Dobuzhinsky, Mstislav Dobuzhinsky, John Wilkins, Samuel Loyd

Key Insight: Learn name phylogenies

slide-23
SLIDE 23

[Figure: example name phylogenies. One group contains Ronald Fairbairn, W. R. D. Fairbairn, and William Ronald Dodds Fairbairn; another contains James Wakefield and James Beach Wakefield.]

slide-24
SLIDE 24

WHY A NAME PHYLOGENY?

  • Aligns matching names for transducer
  • Organizes names into connected components (clusters)
  • We can jointly estimate a phylogeny and a mutation model (transducer)

  • A mutation model gives a phylogeny
  • A phylogeny provides training data for a mutation model

slide-25
SLIDE 25

OUTLINE

  • Generative model
  • Inference
  • Experiments

slide-26
SLIDE 26

GENERATIVE MODEL

slide-27
SLIDE 27

NAME VARIATION

  • A generative model of strings that can explain observed name variation

... Mitt Romney President Barack Obama Barack Obama Secretary of State Hillary Clinton Hillary Clinton Barack Obama Clinton Obama ...

  • What are the sources of variation for names?

slide-28
SLIDE 28

GENERATIVE MODEL OF NAME VARIATION

  • Suppose an author decides to write a name
  • Where do names come from?
  • Copy a previous mention
  • Mutate a previous mention, according to the mutation model
  • Create a new name

slide-29
SLIDE 29

COPY A PREVIOUS MENTION

  • Select a previous mention at random (uniformly)
  • Copy it with probability 1-μ

slide-30
SLIDE 30

MUTATE PREVIOUS MENTION

  • Select a previous mention at random (uniformly)
  • Mutate it with probability μ
  • Sample a new mutation from the mutation model given the mention

slide-31
SLIDE 31

CREATE A NEW NAME

  • Select the root of the phylogeny ♦ with probability proportional to α

  • Sample a new name from a character language model

slide-32
SLIDE 32

SUMMARY

  • To generate the next mention
  • Pick an existing name mention w with probability 1/(α + k)
  • Copy w verbatim with probability 1 − μ
  • Mutate w with probability μ
  • Decide to talk about a new entity with probability α/(α + k)
  • Generate a name for it
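
The generative story summarized above can be sketched as a small Python sampler. The helpers `toy_new_name` and `toy_mutate` are hypothetical stand-ins for the character language model and the mutation model, used only so the example runs; the real models in the talk are learned. The function `generate_mentions` and its parameter names are likewise illustrative.

```python
import random
import string


def toy_new_name(rng):
    """Stand-in for the character language model: uniform random letters."""
    length = rng.randint(3, 8)
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return letters.capitalize()


def toy_mutate(name, rng):
    """Stand-in for the mutation model: apply one random character edit."""
    i = rng.randrange(len(name))
    op = rng.choice(["substitute", "delete", "insert"])
    c = rng.choice(string.ascii_lowercase)
    if op == "substitute":
        return name[:i] + c + name[i + 1:]
    if op == "delete" and len(name) > 1:
        return name[:i] + name[i + 1:]
    # Insert (also used instead of deleting the last remaining character).
    return name[:i] + c + name[i:]


def generate_mentions(n, alpha, mu, rng, new_name=toy_new_name, mutate=toy_mutate):
    """Sample n mentions; parents[i] is the index mention i derives from,
    or None when it attaches to the root (a new entity)."""
    mentions, parents = [], []
    for _ in range(n):
        k = len(mentions)
        if rng.random() < alpha / (alpha + k):
            # New entity with probability alpha / (alpha + k).
            mentions.append(new_name(rng))
            parents.append(None)
        else:
            # Otherwise each existing mention is picked with overall
            # probability 1 / (alpha + k).
            p = rng.randrange(k)
            w = mentions[p]
            # Mutate with probability mu, else copy verbatim.
            mentions.append(mutate(w, rng) if rng.random() < mu else w)
            parents.append(p)
    return mentions, parents


mentions, parents = generate_mentions(10, alpha=1.0, mu=0.3, rng=random.Random(0))
for name, parent in zip(mentions, parents):
    print(name, "<-", "ROOT" if parent is None else mentions[parent])
```

The `parents` list is exactly a phylogeny: a forest rooted at ♦ in which every edge is either a verbatim copy or a mutation.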

slide-33
SLIDE 33

INFERENCE

slide-34
SLIDE 34

EM ALGORITHM

  • E-step
  • Given mutation model θ, compute a distribution over phylogenies
  • M-step
  • Re-estimate θ given marginal edge probabilities
  • Sum over alignments for all (x,y) string pairs via forward-backward
  • Each pair is a training example weighted by its marginal probability
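
A much-simplified sketch of the E-step in Python follows. It assumes the order of the mentions is known, so each mention chooses its parent independently among the root and the earlier mentions. That is an assumption made here to keep the example short, not necessarily how the talk computes the distribution over phylogenies. The function `parent_posteriors` and the callables `mutation_prob` and `new_name_prob` are hypothetical names.

```python
def parent_posteriors(names, alpha, mu, mutation_prob, new_name_prob):
    """For each mention, return a dict mapping each candidate parent
    (None for the root, or the index of an earlier mention) to its
    posterior probability. Assumes a known mention order (a simplification).

    mutation_prob(x, y): probability that the mutation model turns x into y.
    new_name_prob(y):    probability of y under the new-name model (must be > 0).
    """
    posteriors = []
    for i, y in enumerate(names):
        # The shared prior factor 1 / (alpha + i) cancels on normalization.
        weights = {None: alpha * new_name_prob(y)}
        for j in range(i):
            x = names[j]
            copy_term = (1.0 - mu) if x == y else 0.0
            weights[j] = copy_term + mu * mutation_prob(x, y)
        total = sum(weights.values())
        posteriors.append({parent: w / total for parent, w in weights.items()})
    return posteriors
```

Each posterior on a non-root edge (x, y) is the weight that pair would receive as a training example in the M-step.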

slide-35
SLIDE 35

SUMMARY

  • Learn a name matching algorithm
  • θ (transducer/mutation model)
  • Phylogeny: a means to an end
  • Part of the reason for a distribution over phylogenies
  • Question: Is θ better than other name matching algorithms?
  • Can θ find matching names more accurately?

slide-36
SLIDE 36

EXPERIMENTS

slide-37
SLIDE 37

DATA

  • English Wikipedia (2011) to create lists of name variants
  • Wikipedia redirects are human-curated pages to resolve common name variants to the correct page (unambiguously)
  • Use Freebase to restrict to redirects for Person entities
  • Applied some further filters to remove redirects that were clearly not names (e.g. numbers)

  • Use LDC Gigaword to obtain a frequency for each name variant

slide-39
SLIDE 39

Thomas Pynchon, Jr.; Thomas R. Pynchon; Thomas Pynchon Jr.; Thomas R. Pynchon Jr.; Thomas Ruggles Pynchon Jr.; Khawaja Gharibnawaz Muinuddin Hasan Chisty; Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Ghareeb Nawaz; Khwaja gharibnawaz; Muinuddin Chishti

slide-43
SLIDE 43

θ (Transducer)

  • Khawaja Gharibnawaz Muinuddin Hasan Chisty; Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Ghareeb Nawaz; Khwaja gharibnawaz; Khwaja Moinuddin Chishti; Muinuddin Chishti
  • Thomas Ruggles Pynchon, Jr.; Thomas Ruggles Pynchon Jr.; Thomas R. Pynchon, Jr.; Thomas R. Pynchon Jr.; Thomas Pynchon, Jr.; Thomas R. Pynchon; Thomas Pynchon Jr.

Our Algorithm

slide-44
SLIDE 44

EXPERIMENT: RANKING

  • Input: query (name)
  • Output: ranked list of possible aliases
  • Evaluation: where is correct alias in list?
  • Mean Reciprocal Rank (MRR) (higher is better)
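
As a reminder of how the metric works, here is a minimal Python sketch of Mean Reciprocal Rank. The function name and the input layout (one ranked candidate list and one set of correct aliases per query) are assumptions for illustration. A query whose ranked list contains no correct alias contributes 0.

```python
def mean_reciprocal_rank(rankings, gold):
    """rankings[i]: ranked candidate aliases for query i (best first).
    gold[i]: set of correct aliases for query i."""
    total = 0.0
    for ranked, correct in zip(rankings, gold):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(rankings)


rankings = [
    ["Marty Freeman", "Martin Freedman"],               # hit at rank 1
    ["Mark Drezde", "Ben Van Durme", "B. Van Durme"],   # hit at rank 2
]
gold = [{"Marty Freeman"}, {"Ben Van Durme", "B. Van Durme"}]
print(mean_reciprocal_rank(rankings, gold))  # (1 + 1/2) / 2 = 0.75
```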

slide-45
SLIDE 45

SETUP

  • Data
  • Train: 1500 entities (~6000 names)
  • Test: 1500 different entities (~6000 names)
  • Settings
  • Train θ on a set of “supervised” pairs (varying levels of training)
  • Baselines: other name matching algorithms

slide-52
SLIDE 52

[Bar chart of MRR (axis from 0.10 to 0.90). Bar labels, in order: Jaro Winkler, Levenshtein, 10 entities, 10+unlabeled, Unsupervised, 1500 entities. Values, in order of appearance: 0.611, 0.642, 0.741, 0.764, 0.763, 0.803.]

slide-53
SLIDE 53

FUTURE WORK

  • Include context for full entity disambiguation
  • Increase matching speed
  • More sophisticated mutation models
  • Incorporate internal name structure
  • Informal genres
  • Cross lingual data

slide-54
SLIDE 54

QUESTIONS

Nicholas Andrews, Jason Eisner, Mark Dredze. Name Phylogeny: A Generative Model of String Variation. Empirical Methods in Natural Language Processing (EMNLP), 2012.
