Entity Clustering Across Languages
NAACL 2012 — Montreal
Spence Green*, Nicholas Andrews#, Matthew R. Gormley#, Mark Dredze#, Christopher D. Manning*
*Stanford University  #Johns Hopkins University
One Entity, Many Names
Qaddafi, Muammar
Al-Gathafi, Muammar
al-Qadhafi, Muammar
Al Qathafi, Mu’ammar
Al Qathafi, Muammar
El Gaddafi, Moamar
El Kadhafi, Moammar
El Kazzafi, Moamer
El Qathafi, Mu’Ammar
Task: cluster co-referent entity mentions across a corpus (documents and languages)

Clustering/disambiguation relies on:
◮ Mention similarity
◮ Context similarity
The Apple chief executive was former Beatles road manager Neil Aspinall...

Sentential context is usually required.
Within-document coreference:
Peter said to himself ...
Cross-document coreference:
doc1: Peter Jones said ...
doc2: I told Mr. Jones ...
[Bagga and Baldwin, 1998] [Baron and Freedman, 2008] [McNamee et al., 2011]

Mention variation: Peter Jones / Peter
[Rao et al., 2011]
Crisis management

Arab Spring (2011)
◮ French, Arabic dialects
◮ Facebook, Twitter, blogs...

Haiti earthquake (2010)
◮ Kreyol, English, French
◮ SMS, Twitter, blogs...
Other applications: machine translation, name search
mi = Muammar Qaddafi
mj = Moamer El Kazzafi
sim(mi, mj) = ?

Algorithm: Sorted Jaro-Winkler [Christen, 2006]
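The deck names Sorted Jaro-Winkler but gives no code. A minimal sketch (function names are mine, not the paper's): sort the name tokens, then apply standard Jaro-Winkler.

```python
def jaro(s1, s2):
    """Jaro similarity of two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions between the two matched-character sequences
    transpositions, j = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boost Jaro by a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)

def sorted_jaro_winkler(name1, name2):
    # sort tokens first so word-order variation ("Qaddafi, Muammar" vs
    # "Muammar Qaddafi") does not hurt the character-level score
    a = " ".join(sorted(name1.lower().split()))
    b = " ".join(sorted(name2.lower().split()))
    return jaro_winkler(a, b)
```

Sorting makes the measure robust to the name-order variation in the Qaddafi list above.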
mi = map(Apple) = abbl
mj = map( ) = abl
sim(mi, mj) = ?

Algorithm: Phonetic mapping + classification
Training: parallel name list
◮ 97.1% accuracy on a held-out set

Phonetic mapping: think of Soundex
Best features: character bigrams
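The slide compares the learned phonetic mapping to Soundex. For readers unfamiliar with it, here is classic Soundex (slightly simplified); it is the analogy the slide invokes, not the paper's trained mapper:

```python
def soundex(name):
    """Classic 4-character Soundex code (simplified)."""
    mapping = {}
    for chars, code in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                        ("l", "4"), ("mn", "5"), ("r", "6")]:
        for c in chars:
            mapping[c] = code
    name = "".join(c for c in name.lower() if c.isalpha())
    if not name:
        return ""
    first = name[0].upper()
    digits = []
    prev = mapping.get(name[0], "")
    for c in name[1:]:
        code = mapping.get(c, "")
        if code and code != prev:
            digits.append(code)
        if c not in "hw":  # 'h'/'w' do not break a run of the same code
            prev = code
    # keep the first letter, pad/truncate the digits to 3
    return (first + "".join(digits) + "000")[:4]
```

Like the paper's mapping, Soundex collapses spelling variants of the same sound onto one code.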
Method                   Resource level
Machine Translation      High
Lexicon                  Medium
Polylingual Topic Model  Low

◮ MT: Phrasal with NIST-09 data [Galley et al., 2009]
◮ Lexicon: 31k entries (web and LDC sources)
Polylingual Topic Model (PLTM) [Mimno et al., 2009]
◮ Words linked through cross-lingual priors
◮ Training: Wikipedia document tuples

Map context words to 1-best topics
The Apple chief executive was former Beatles road manager Neil Aspinall...
[Figure: each context word mapped to its 1-best topic id: 2 1 14 99 7 7 103 79]
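"Map context words to 1-best topics" is an argmax over per-word topic distributions. A sketch — the table below is illustrative; the real distributions come from the trained PLTM:

```python
# illustrative word -> topic-distribution table; a trained PLTM supplies these
word_topics = {
    "apple":   [0.10, 0.70, 0.20],
    "beatles": [0.60, 0.10, 0.30],
    "manager": [0.25, 0.25, 0.50],
}

def one_best_topic(word):
    # replace a context word with its single most probable topic id
    dist = word_topics[word]
    return max(range(len(dist)), key=dist.__getitem__)

context = ["apple", "beatles", "manager"]
topic_ids = [one_best_topic(w) for w in context]
```

Topic ids, unlike surface words, are shared across languages, which is what makes the context comparable cross-lingually.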
Context representation: bag of words / smoothed unigram distributions
Measure: Jensen-Shannon divergence
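The context measure can be sketched as Jensen-Shannon divergence between smoothed unigram distributions; the add-alpha smoothing constant here is my choice, not the paper's:

```python
import math
from collections import Counter

def smoothed_unigram(tokens, vocab, alpha=0.5):
    # add-alpha smoothed unigram distribution over a shared vocabulary
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

def kl(p, q):
    # KL divergence in bits; assumes q[w] > 0 wherever p[w] > 0
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def js_divergence(p, q):
    # symmetric and bounded in [0, 1] with log base 2
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Smoothing keeps the divergence finite when a word appears in only one of the two contexts.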
Two algorithms. Setup:
◮ Mention similarity as a hard constraint
◮ Cluster distance: context similarity
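The setup can be sketched as greedy average-link agglomerative clustering in which name similarity is a hard constraint on merges and context distance drives merge order. The thresholds and the toy similarity functions below are illustrative stand-ins, not the paper's:

```python
import itertools

def agglomerative_cluster(items, name_sim, ctx_dist,
                          name_thresh=0.8, stop_dist=0.35):
    """Greedy agglomerative clustering sketch: merge the closest pair of
    clusters (by average-link context distance) whose members also satisfy
    the name-similarity hard constraint; stop when no pair is close enough."""
    clusters = [[it] for it in items]
    while len(clusters) > 1:
        best = None
        for (i, c1), (j, c2) in itertools.combinations(enumerate(clusters), 2):
            # hard constraint: some cross-cluster pair must match by name
            if not any(name_sim(a, b) >= name_thresh for a in c1 for b in c2):
                continue
            d = sum(ctx_dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
            if best is None or d < best[0]:
                best = (d, i, j)
        if best is None or best[0] > stop_dist:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

The hard constraint prunes the pair space: clusters whose names cannot match are never compared by context.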
[Figures: clustering animation — cluster pairs compared by context distance (0.40, 0.31, 0.15); the closest pair is merged]
Data: Automatic Content Extraction (ACE) 2008 Arabic-English
We annotated 216 cross-lingual entities
Genres: conversation
          Docs  Tokens  Entities  Chains  Mentions
Arabic     412   178k    2.6k      4.2k    9.2k
English    414   246k    2.3k      4.0k    9.1k

◮ Chain – set of mentions (within-doc)
◮ Entity – set of chains
Our models cluster chains. Evaluation:
◮ Gold chains
◮ Predicted chains from SERIF [Ramshaw et al., 2011]
Results (B3):

Method    B3    B3 (cross-lingual only)
MT        85.4  78.8
Lexicon   80.4  66.4
PLTM      77.3  58.4
Baseline  70.1  54.5

In paper: CEAF, NVI
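The B3 metric [Bagga and Baldwin, 1998] scores each mention by the overlap between its predicted and gold clusters. A minimal sketch:

```python
from collections import defaultdict

def b_cubed(pred, gold):
    """B3 precision/recall given two mention -> cluster-id maps
    over the same mention set."""
    def groups(assign):
        g = defaultdict(set)
        for mention, cid in assign.items():
            g[cid].add(mention)
        return g
    pred_groups, gold_groups = groups(pred), groups(gold)

    def avg_overlap(assign_a, groups_a, assign_b, groups_b):
        # per mention: |its a-cluster ∩ its b-cluster| / |its a-cluster|
        total = 0.0
        for m in assign_a:
            a = groups_a[assign_a[m]]
            b = groups_b[assign_b[m]]
            total += len(a & b) / len(a)
        return total / len(assign_a)

    precision = avg_overlap(pred, pred_groups, gold, gold_groups)
    recall = avg_overlap(gold, gold_groups, pred, pred_groups)
    return precision, recall
```

Because B3 averages over mentions, splitting a large entity is penalized more than splitting a small one.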
Correct: tony blair, khaled mashaal
Incorrect: NSA, CIA (En and Ar mentions), republican party
Correct: NSA, CIA, labour party
Incorrect: hamed bin khalifa al-thani / khlfa thny
Code and corpus: spencegreen.com

Pairwise models don't scale
◮ See [Rao et al., 2010] and [Singh et al., 2011]

Model at the mention level
◮ See Nick Andrews’ talk at EMNLP!

Unified similarity measure
◮ Logistic regression did not generalize