NAME MATCHING WITH PHYLOGENIES Nicholas Andrews, Jason Eisner, Mark Dredze 1

Entity Linking, Coref Resolution: Martin Freeman; M Freeman; Martin Freedman; Marty Freemen; Marty Freeman; Martin F 2

STRING COMPARISON • Levenshtein distance • Edit distance between two strings • Jaro-Winkler • Measures matching characters and transpositions • Examples: Mark Dredze vs. Mark Drezde (e.g. typo, name variant); Mark Dredze vs. Benjamin Van Durme 3
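As a point of reference for the Levenshtein baseline, here is a minimal sketch of the standard dynamic program for edit distance with unit costs. The function name `levenshtein` is chosen for illustration and is not taken from the slides.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # copy or substitute
            ))
        prev = cur
    return prev[-1]


typo_distance = levenshtein("Mark Dredze", "Mark Drezde")
other_distance = levenshtein("Mark Dredze", "Benjamin Van Durme")
```

Plain Levenshtein distance has no transposition operation, so the swapped "dz"/"zd" in the typo example counts as two substitutions.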

NAME VARIATION • Nicknames: Benjamin Van Durme vs. Ben Van Durme • Aliases: Caryn Elaine Johnson vs. Whoopi Goldberg • Chinese Names: Zhang Wei vs. Wei Zhang • Arab Names: Muhammad ibn Saeed ibn Abd al-Aziz al-Filasteeni vs. Muhammad vs. Abu Kareem 4

OUR GOAL LEARN HOW TO MATCH NAMES 5

FINITE STATE TRANSDUCERS • Probabilistic finite-state transducers encode a conditional probability distribution over output strings given an input string • Character operations: copy, substitute, delete, insert • Train parameters on name pairs 6
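To make the idea of a distribution over strings given a string concrete, the following is a rough sketch of a one-state stochastic edit model that sums over all alignments with a forward pass. It is an illustration only, not the transducer used in the talk: the function name `edit_model_prob`, the fixed operation probabilities, and the alphabet size of 27 are all assumptions, whereas the real parameters would be trained on name pairs.

```python
def edit_model_prob(x, y, p_copy=0.8, p_sub=0.05, p_del=0.05, p_ins=0.1,
                    alphabet_size=27):
    """p(y | x) under a toy one-state edit model (hypothetical parameters).

    While input remains, the model copies, substitutes, deletes, or inserts.
    After the input is consumed it either inserts or stops.
    """
    assert abs(p_copy + p_sub + p_del + p_ins - 1.0) < 1e-9
    n, m = len(x), len(y)
    # f[i][j] = total probability of all edit sequences that have
    # consumed x[:i] and emitted y[:j].
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = f[i][j]
            if cur == 0.0:
                continue
            if j < m:
                # Insert y[j]; inserted characters are uniform over the alphabet.
                f[i][j + 1] += cur * p_ins / alphabet_size
            if i < n:
                # Delete x[i].
                f[i + 1][j] += cur * p_del
                if j < m:
                    if x[i] == y[j]:
                        f[i + 1][j + 1] += cur * p_copy
                    else:
                        # Substitute x[i] with one of the other characters.
                        f[i + 1][j + 1] += cur * p_sub / (alphabet_size - 1)
    # Stop once the input is consumed (the alternative there is another insert).
    return f[n][m] * (1.0 - p_ins)


p_same = edit_model_prob("Mark Dredze", "Mark Dredze")
p_typo = edit_model_prob("Mark Dredze", "Mark Drezde")
p_other = edit_model_prob("Mark Dredze", "Benjamin Van Durme")
```

Under these assumed parameters an exact copy is most probable, a typo is less probable, and an unrelated name is far less probable, which is the ordering a name matcher needs.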

• Ideal: matched name pairs (William Ronald Dodds Fairbairn; Ronald Fairbairn) • Sets of matching names (Ronald Fairbairn; W. R. D. Fairbairn; William Ronald Dodds Fairbairn) 7

William Ronald Dodds Fairbairn Ronald Fairbairn W. R. D. Fairbairn 8

• Ideal: matched name pairs (William Ronald Dodds Fairbairn; Ronald Fairbairn) • Sets of matching names (Ronald Fairbairn; W. R. D. Fairbairn; William Ronald Dodds Fairbairn) • Unorganized set of names (John Wilkins; Mikhail Dobuzhinsky; Samuel Loyd; James Beach Wakefield; Mstislav Dobuzhinsky; James Wakefield) • Key insight: learn name phylogenies 9

Figure: example name phylogeny rooted at ♦ over the names James Beach Wakefield; William Ronald Dodds Fairbairn; James Wakefield; Ronald Fairbairn; W. R. D. Fairbairn 10

WHY A NAME PHYLOGENY? • Aligns matching names for transducer • Organizes names into connected components (clusters) • We can jointly estimate a phylogeny and a mutation model (transducer) • A mutation model gives a phylogeny • A phylogeny provides training data for a mutation model 11

OUTLINE • Generative model • Inference • Experiments 12

GENERATIVE MODEL 13

NAME VARIATION • A generative model of strings that can explain observed name variation ... Mitt Romney President Barack Obama Barack Obama Secretary of State Hillary Clinton Hillary Clinton Barack Obama Clinton Obama ... • What are the sources of variation for names? 14

GENERATIVE MODEL OF NAME VARIATION • Suppose an author decides to write a name • Where do names come from? • Copy a previous mention • Mutate a previous mention • According to mutation model • Create a new name 15

COPY A PREVIOUS MENTION • Select a previous mention at random (uniformly) • Copy it with probability 1 − μ 16

MUTATE PREVIOUS MENTION • Select a previous mention at random (uniformly) • Mutate it with probability μ • Sample a new mutation from the mutation model given the mention 17

CREATE A NEW NAME • Select the root of the phylogeny ♦ with probability proportional to α • Sample a new name from a character language model 18

SUMMARY • To generate the next mention, given k previous mentions • Pick an existing name mention w with probability 1/(α + k) • Copy w verbatim with probability 1 − μ • Mutate w with probability μ • Decide to talk about a new entity with probability α/(α + k) • Generate a name for it 19
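The generative story summarized above can be sketched as a small sampler. The helpers `new_name` and `mutate` are hypothetical toy stand-ins for the character language model and the mutation model (transducer), and the default values of `alpha` and `mu` are arbitrary assumptions made for illustration.

```python
import random
import string


def new_name(rng, stop_prob=0.2):
    """Toy stand-in for the character language model:
    uniform lowercase letters with a geometric length (at least one)."""
    chars = [rng.choice(string.ascii_lowercase)]
    while rng.random() > stop_prob:
        chars.append(rng.choice(string.ascii_lowercase))
    return "".join(chars)


def mutate(name, rng):
    """Toy stand-in for the mutation model: apply one random character edit."""
    i = rng.randrange(len(name))
    op = rng.choice(["substitute", "delete", "insert"])
    c = rng.choice(string.ascii_lowercase)
    if op == "substitute":
        return name[:i] + c + name[i + 1:]
    if op == "delete" and len(name) > 1:
        return name[:i] + name[i + 1:]
    # Insert; also used when deleting would leave an empty name.
    return name[:i] + c + name[i:]


def generate_mentions(n, alpha=1.0, mu=0.3, seed=0):
    """Sample n mentions; parents[i] is the index of the mention that
    mention i was derived from, or None if it attaches to the root."""
    rng = random.Random(seed)
    mentions, parents = [], []
    for k in range(n):  # k = number of previous mentions
        if rng.random() < alpha / (alpha + k):
            # New entity: attach to the root and sample a fresh name.
            parents.append(None)
            mentions.append(new_name(rng))
        else:
            # Each existing mention is chosen with total probability 1/(alpha + k).
            parent = rng.randrange(k)
            parents.append(parent)
            if rng.random() < mu:
                mentions.append(mutate(mentions[parent], rng))
            else:
                mentions.append(mentions[parent])
    return mentions, parents


mentions, parents = generate_mentions(20)
```

The `parents` list is exactly a phylogeny over the sampled mentions: every mention points either to the root or to one earlier mention.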

INFERENCE 20

EM ALGORITHM • E-step • Given mutation model θ, compute a distribution over phylogenies • M-step • Re-estimate θ given marginal edge probabilities • Sum over alignments for all (x,y) string pairs via forward-backward • Each pair is a training example weighted by its marginal edge probability 21
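One way to picture the marginal edge probabilities used in the M-step is the simplified sketch below. It assumes the order of the mentions is known, so that each mention chooses its parent independently among the root and the earlier mentions; the slides do not state this, and computing a distribution over phylogenies in general is more involved. The functions `toy_mutation_prob` and `toy_new_name_prob` are unnormalized, hypothetical stand-ins for the transducer and the character language model.

```python
def parent_posteriors(mentions, mutation_prob, new_name_prob, alpha=1.0, mu=0.3):
    """For each mention, return a dict mapping a candidate parent
    (None for the root, else the index of an earlier mention) to its
    posterior probability. Assumes the mention order is known."""
    posteriors = []
    for k, y in enumerate(mentions):
        # Root edge: start a new entity and generate y from scratch.
        weights = {None: alpha * new_name_prob(y)}
        for j in range(k):
            x = mentions[j]
            copy_term = (1.0 - mu) if x == y else 0.0
            # Edge from x: copy verbatim, or mutate x into y.
            weights[j] = copy_term + mu * mutation_prob(x, y)
        total = sum(weights.values())
        posteriors.append({p: w / total for p, w in weights.items()})
    return posteriors


def toy_mutation_prob(x, y):
    """Unnormalized toy score: 0.1 per mismatched or unmatched position."""
    mismatches = abs(len(x) - len(y)) + sum(a != b for a, b in zip(x, y))
    return 0.1 ** mismatches


def toy_new_name_prob(y):
    """Toy language model: every character has probability 1/27."""
    return (1.0 / 27) ** len(y)


posteriors = parent_posteriors(
    ["Marty Freeman", "Marty Freemen"], toy_mutation_prob, toy_new_name_prob
)
```

In an M-step of this kind, each (parent, child) string pair would then be weighted by its posterior when re-estimating θ.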

SUMMARY • Learn a name matching algorithm • θ (transducer/mutation model) • Phylogeny: a means to an end • Part of the reason for a distribution over phylogenies • Question: Is θ better than other name matching algorithms? • Can θ find matching names more accurately? 22

EXPERIMENTS 23

DATA • English Wikipedia (2011) to create lists of name variants • Wikipedia redirects are human-curated pages to resolve common name variants to the correct page (unambiguously) • Use Freebase to restrict to redirects for Person entities • Applied some further filters to remove redirects that were clearly not names (e.g. numbers) • Use LDC Gigaword to obtain a frequency for each name variant 24

Khwaja Gharib Nawaz; Khwaja Muin al-Din Chishti; Muinuddin Chishti; Thomas Pynchon, Jr.; Thomas R. Pynchon Jr.; Khawaja Gharibnawaz Muinuddin Hasan Chisty; Thomas Ruggles Pynchon Jr..; Thomas Pynchon Jr.; Ghareeb Nawaz; Khwaja gharibnawaz; Thomas R. Pynchon 25

θ (Transducer), Our Algorithm: Thomas Ruggles Pynchon, Jr.; Khawaja Gharibnawaz Muinuddin Hasan Chisty; Thomas Ruggles Pynchon Jr.; Khwaja Muin al-Din Chishti; Thomas R. Pynchon, Jr.; Khwaja Gharib Nawaz; Khwaja Moinuddin Chishti; Thomas R. Pynchon Jr.; Thomas Pynchon, Jr.; Ghareeb Nawaz; Khwaja gharibnawaz; Muinuddin Chishti; Thomas R. Pynchon; Thomas Pynchon Jr. 25

EXPERIMENT: RANKING • Input: query (name) • Output: ranked list of possible aliases • Evaluation: where is correct alias in list? • Mean Reciprocal Rank (MRR) (higher is better) 26
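For readers unfamiliar with the metric, Mean Reciprocal Rank can be sketched as follows. The function name, the dictionary-based input format, and the example candidate lists are assumptions made for illustration, not the evaluation code behind the reported results.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """ranked_lists maps each query name to its ranked candidate aliases;
    gold maps each query name to the set of correct aliases.
    A query with no correct alias in its list contributes 0."""
    total = 0.0
    for query, candidates in ranked_lists.items():
        for rank, candidate in enumerate(candidates, 1):
            if candidate in gold[query]:
                total += 1.0 / rank  # reciprocal rank of the first correct alias
                break
    return total / len(ranked_lists)


ranked = {
    "Ronald Fairbairn": ["W. R. D. Fairbairn", "John Wilkins"],
    "James Wakefield": ["Samuel Loyd", "James Beach Wakefield"],
}
gold = {
    "Ronald Fairbairn": {"W. R. D. Fairbairn", "William Ronald Dodds Fairbairn"},
    "James Wakefield": {"James Beach Wakefield"},
}
mrr = mean_reciprocal_rank(ranked, gold)
```

Here the first query finds a correct alias at rank 1 and the second at rank 2, so the MRR is (1 + 1/2) / 2 = 0.75.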

SETUP • Data • Train: 1500 entities (~6000 names) • Test: 1500 different entities (~6000 names) • Settings • Train θ on a set of “supervised” pairs (varying levels of training) • Baselines: other name matching algorithms 27

Figure: bar chart of MRR (y-axis from 0 to 0.90) • Bars: Jaro-Winkler; Levenshtein; 10 entities; 10+unlabeled; Unsupervised; 1500 entities • Values shown: 0.611; 0.642; 0.741; 0.764; 0.763; 0.803 28

FUTURE WORK • Include context for full entity disambiguation • Increase matching speed • More sophisticated mutation models • Incorporate internal name structure • Informal genres • Cross lingual data 29

QUESTIONS Nicholas Andrews, Jason Eisner, Mark Dredze. Name Phylogeny: A Generative Model of String Variation. Empirical Methods in Natural Language Processing (EMNLP), 2012. 30
