Entity Clustering Across Languages (NAACL 2012, Montreal)


  1. Entity Clustering Across Languages NAACL 2012 — Montreal. Spence Green*, Nicholas Andrews#, Matthew R. Gormley#, Mark Dredze#, Christopher D. Manning* (*Stanford University, #Johns Hopkins University)

  2. One Entity, Many Names: Qaddafi, Muammar; Al-Gathafi, Muammar; al-Qadhafi, Muammar; Al Qathafi, Mu'ammar; Al Qathafi, Muammar; El Gaddafi, Moamar; El Kadhafi, Moammar; El Kazzafi, Moamer; El Qathafi, Mu'Ammar 3 / 41

  3. One Entity, Many Names: [Arabic spellings, script lost in transcription, shown alongside the romanizations] Qaddafi, Muammar; Al-Gathafi, Muammar; al-Qadhafi, Muammar; Al Qathafi, Mu'ammar; Al Qathafi, Muammar; El Gaddafi, Moamar; El Kadhafi, Moammar; El Kazzafi, Moamer; El Qathafi, Mu'Ammar 3 / 41

  5. Basic Task: Entity Clustering Cluster co-referent entity mentions across a corpus (documents and languages) 4 / 41

  6. Basic Task: Entity Clustering Cluster co-referent entity mentions across a corpus (documents and languages) Clustering/disambiguation relies on: ◮ Mention similarity ◮ Context similarity 4 / 41

  7. Entity Disambiguation: Mention Similarity [Arabic spellings, script lost in transcription] Qaddafi, Muammar; Al-Gathafi, Muammar; al-Qadhafi, Muammar; Al Qathafi, Mu'ammar; Al Qathafi, Muammar; El Gaddafi, Moamar; El Kadhafi, Moammar; El Kazzafi, Moamer; El Qathafi, Mu'Ammar 5 / 41

  8. Entity Disambiguation: Context Similarity Apple 6 / 41

  9. Entity Disambiguation: Context Similarity [Arabic strings, script lost in transcription] glossed as: Apple Inc. (three renderings), a town in Lebanon, camel 6 / 41

  10. Entity Disambiguation: Context Similarity The Apple chief executive was former Beatles road manager Neil Aspinall... Sentential context is usually required 7 / 41

  11. Old: Entity Clustering Tasks Within-doc coref Peter said to himself ... 8 / 41

  12. Old: Entity Clustering Tasks Within-doc coref Cross-doc coref doc1 : Peter Jones said ... Peter said to himself ... doc2 : I told Mr. Jones ... [Bagga and Baldwin, 1998] [Baron and Freedman, 2008] 9 / 41

  13. Old: Entity Clustering Tasks Within-doc coref Cross-doc coref doc1 : Peter Jones said ... Peter said to himself ... doc2 : I told Mr. Jones ... [Bagga and Baldwin, 1998] [Baron and Freedman, 2008] Entity Linking Mr. Jones Peter Jones Peter [McNamee et al., 2011] [Rao et al., 2011] 10 / 41

  14. New: Entity Clustering Across Languages Within-doc coref Cross-doc coref doc1 : Peter Jones said ... Peter said to himself ... doc2 : I told Mr. Jones ... [Bagga and Baldwin, 1998] [Baron and Freedman, 2008] Entity Linking This paper Mr. Jones doc1 : Peter Jones said ... [Arabic mention, script lost in transcription] Peter Jones doc2 : Peter [McNamee et al., 2011] [Rao et al., 2011] 11 / 41

  15. New: Entity Clustering Across Languages The Apple chief executive was former Beatles road manager Neil Aspinall... [Arabic translation of the sentence, script lost in transcription] No knowledge base of entities 12 / 41

  16. Why? Crisis management Arab Spring (2011) ◮ French, Arabic dialects ◮ Facebook, Twitter, blog... Haiti earthquake (2010) ◮ Kreyol, English, French ◮ SMS, Twitter, blog... 13 / 41

  17. Why? Crisis management Arab Spring (2011) ◮ French, Arabic dialects ◮ Facebook, Twitter, blog... Haiti earthquake (2010) ◮ Kreyol, English, French ◮ SMS, Twitter, blog... Other applications: machine translation, name search 13 / 41

  18. Plan: Extend Existing Monolingual System (bar chart) B3: English 91.4, Arabic 89.8, English+Arabic 78.8 14 / 41

  19. Cross-lingual Mention Similarity Cross-lingual Context Similarity Clustering Algorithms Evaluation

  20. Cross-lingual Mention Similarity Cross-lingual Context Similarity Clustering Algorithms Evaluation

  21. Within-language: Edit Distance m_i = Muammar Qaddafi, m_j = Moamer El Kazzafi, sim(m_i, m_j) = ? 17 / 41

  22. Within-language: Edit Distance m_i = Muammar Qaddafi, m_j = Moamer El Kazzafi, sim(m_i, m_j) = ? Algorithm: Sorted Jaro-Winkler [Christen, 2006] 1. Sort tokens 2. Compute Jaro-Winkler distance in O(|m_i| + |m_j|) 3. Evaluate sim(m_i, m_j) < β 17 / 41
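The three steps above can be sketched as follows; this is a minimal textbook Jaro-Winkler with the token sort in front, not the paper's exact implementation, and the threshold β is left to the caller.

```python
# Sketch of sorted Jaro-Winkler mention similarity: sort tokens, then compare.

def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    window = max(ls, lt) // 2 - 1
    s_matched, t_matched = [False] * ls, [False] * lt
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(lt, i + window + 1)):
            if not t_matched[j] and t[j] == c:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among matched characters.
    t_idx = [j for j in range(lt) if t_matched[j]]
    transpositions, k = 0, 0
    for i in range(ls):
        if s_matched[i]:
            if s[i] != t[t_idx[k]]:
                transpositions += 1
            k += 1
    transpositions //= 2
    m = matches
    return (m / ls + m / lt + (m - transpositions) / m) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro plus the Winkler common-prefix boost (prefix capped at 4)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def sorted_jaro_winkler(m_i: str, m_j: str) -> float:
    """Sort tokens first so 'Qaddafi Muammar' matches 'Muammar Qaddafi'."""
    a = " ".join(sorted(m_i.lower().split()))
    b = " ".join(sorted(m_j.lower().split()))
    return jaro_winkler(a, b)
```

Because of the token sort, sorted_jaro_winkler("Muammar Qaddafi", "Qaddafi Muammar") is exactly 1.0; name-order variation costs nothing.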

  23. Cross-language: Binary classifier m_i = map(Apple) = abbl, m_j = map([Arabic 'Apple', script lost in transcription]) = abl, sim(m_i, m_j) = ? 18 / 41

  24. Cross-language: Binary classifier Algorithm: Phonetic mapping + classification 1. Apply deterministic mapping map(·) 2. Extract character-level features 3. Classify (Maxent) 18 / 41

  25. Cross-language: Binary classifier Training: parallel name list ◮ 97.1% accuracy on a held-out set Phonetic mapping: think of Soundex Best features: character bigrams 19 / 41
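The pipeline above (deterministic phonetic mapping, then character-level features) can be sketched like this. The mapping table is a toy Soundex-style stand-in chosen to reproduce the slide's map(Apple) = abbl example, and a simple bigram Dice score stands in for the trained maxent classifier; neither is the paper's actual model.

```python
# Toy Soundex-style mapping (hypothetical; the paper's deterministic
# Arabic/English mapping is more elaborate).
CONSONANT_MAP = {"p": "b", "f": "b", "v": "b", "k": "q", "c": "q", "s": "z"}
VOWELS = set("aeiou")

def phonetic_map(name: str) -> str:
    """Keep the first letter, map consonants, drop non-initial vowels."""
    name = name.lower()
    out = [name[0]]
    for ch in name[1:]:
        if ch in VOWELS or not ch.isalpha():
            continue
        out.append(CONSONANT_MAP.get(ch, ch))
    return "".join(out)

def char_bigrams(s: str) -> set:
    """Character bigrams, the best-performing feature per the slide."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_dice(a: str, b: str) -> float:
    """Dice overlap of character bigrams; a stand-in for the maxent
    classifier's decision over character-level features."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))
```

With this toy table, phonetic_map("Apple") gives "abbl", and comparing it against the slide's Arabic-side mapping "abl" yields a high bigram overlap.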

  26. Cross-lingual Mention Similarity Cross-lingual Context Similarity Clustering Algorithms Evaluation

  27. Context Similarity: Mapping techniques
      Method                    Resource level
      Machine Translation       High
      Lexicon                   Medium
      Polylingual Topic Model   Low
      21 / 41

  28. Context Similarity: Mapping techniques
      Method                    Resource level
      Machine Translation       High
      Lexicon                   Medium
      Polylingual Topic Model   Low
      ◮ MT: Phrasal with NIST-09 data [Galley et al., 2009]
      ◮ Lexicon: 31k entries (web and LDC sources)
      21 / 41

  29. Context Similarity: Polylingual Topic Model Polylingual Topic Model (PLTM) [Mimno et al., 2009] ◮ Words linked through cross-lingual priors ◮ Training: Wikipedia document tuples Map context words to 1-best topics 22 / 41
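The final step on this slide, mapping context words to their 1-best topics, can be sketched as below. The word-topic probabilities are hypothetical stand-ins for a trained PLTM's per-language distributions.

```python
def one_best_topics(words, word_topic_probs):
    """Replace each context word with its most probable topic id;
    words unseen in training keep a None placeholder."""
    out = []
    for w in words:
        probs = word_topic_probs.get(w.lower())
        out.append(max(probs, key=probs.get) if probs else None)
    return out

# Hypothetical English word -> topic distributions from a trained PLTM.
EN_TOPICS = {
    "apple": {3: 0.7, 14: 0.3},
    "beatles": {103: 0.9, 5: 0.1},
}
```

For example, one_best_topics(["Apple", "Beatles", "road"], EN_TOPICS) maps Apple to topic 3 and Beatles to topic 103, leaving the unseen word as None. Because topic ids are shared across languages through the PLTM's cross-lingual priors, the same topic space applies to the Arabic side.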

  30. Context Similarity: Polylingual Topic Model [Mimno et al., 2009] 23 / 41

  31. Context Similarity: Polylingual Topic Model The English sentence "The Apple chief executive was former Beatles road manager Neil Aspinall..." and its Arabic translation (script lost in transcription), with each context word replaced by its 1-best topic id (English: 3 1 6 6 14 5 103 99 6 3 5...; Arabic: 2 1 14 99 7 7 103 79) 24 / 41

  32. Context Similarity Bag of words / smoothed unigram distributions Measure: Jensen-Shannon divergence 25 / 41
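The smoothed-unigram-plus-Jensen-Shannon recipe on this slide can be sketched directly; the add-alpha smoothing below is one standard choice, not necessarily the paper's exact scheme.

```python
import math

def smoothed_unigram(words, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = {v: alpha for v in vocab}
    for w in words:
        counts[w] = counts.get(w, alpha) + 1
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log; bounded by ln 2).
    JSD(P, Q) = 0.5 KL(P || M) + 0.5 KL(Q || M), where M = (P + Q) / 2."""
    jsd = 0.0
    for v in set(p) | set(q):
        pv, qv = p.get(v, 0.0), q.get(v, 0.0)
        m = 0.5 * (pv + qv)
        if pv > 0:
            jsd += 0.5 * pv * math.log(pv / m)
        if qv > 0:
            jsd += 0.5 * qv * math.log(qv / m)
    return jsd
```

Unlike KL divergence, JSD is symmetric and always finite, which is why it is a convenient distance between two mention contexts' word (or topic) distributions.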

  33. Cross-lingual Mention Similarity Cross-lingual Context Similarity Clustering Algorithms Evaluation

  34. Constraint-Based Clustering Two algorithms: 1. Hierarchical clustering 2. Dirichlet process mixture model Setup: ◮ Mention similarity as a hard constraint ◮ Cluster distance: context similarity 27 / 41

  35.–37. Constraint-Based Clustering (diagram, built up over three slides, as transcribed): mention-similarity constraints link {Muammar Qaddafi, al-Qadhafi, El Kazzafi} and {Apple, Apple Corps.}, each mention with its context; context-similarity scores on candidate merges (0.31 for the Qaddafi contexts; 0.40 and 0.15 around Apple) decide which merges survive 28–30 / 41
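The setup above (mention similarity as a hard constraint, context similarity as the cluster distance) can be sketched as a greedy agglomerative loop. The similarity functions and thresholds below are toy stand-ins, not the paper's models; the point is the control flow: a merge is only considered if some cross-cluster mention pair is name-compatible.

```python
def bigrams(s):
    s = "".join(ch for ch in s.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}

def name_sim(a, b):
    """Toy mention similarity: Dice overlap of name character bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    return 2 * len(ba & bb) / (len(ba) + len(bb)) if ba and bb else 0.0

def ctx_dist(a, b):
    """Toy context distance: 1 - Jaccard of context word sets."""
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def cluster(mentions, sim_thresh=0.3, dist_thresh=0.8):
    """Greedy agglomerative clustering: merge two clusters only if
    (1) some cross-cluster mention pair passes the hard name constraint
    and (2) the average context distance between them is small."""
    clusters = [[m] for m in mentions]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                if not any(name_sim(a["name"], b["name"]) >= sim_thresh
                           for a, b in pairs):
                    continue  # hard constraint: no compatible mention names
                avg = sum(ctx_dist(a["ctx"], b["ctx"])
                          for a, b in pairs) / len(pairs)
                if avg < dist_thresh:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

With toy mentions, "Apple" and "Apple Corps" pass the name constraint but stay in separate clusters because their contexts diverge, mirroring the diagram.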

  38. Cross-lingual Mention Similarity Cross-lingual Context Similarity Clustering Algorithms Evaluation

  39. Evaluation Corpus: Automatic Content Extraction (ACE) 2008 Arabic-English. Genres: 1. broadcast conversation 2. broadcast news 3. meeting 4. newswire 5. telephone 6. usenet 7. weblog. We annotated 216 cross-lingual entities 32 / 41

  40. ACE2008 Evaluation Corpus
               Docs  Tokens  Entities  Chains  Mentions
      Arabic   412   178k    2.6k      4.2k    9.2k
      English  414   246k    2.3k      4.0k    9.1k
      ◮ Chain – set of mentions (within-doc)
      ◮ Entity – set of chains
      33 / 41

  41. Within-document Processing Our models cluster chains Evaluation: ◮ Gold chains ◮ Predicted chains from SERIF [Ramshaw et al., 2011] 34 / 41

  42.–43. Evaluation: Gold within-document processing (bar chart; values assigned per the transcribed order)
      Method    B3     B3 (cross-lingual only)
      MT        85.4   70.1
      Lexicon   80.4   66.4
      PLTM      78.8   58.4
      Baseline  77.3   54.5
      In paper: CEAF, NVI
      35 / 41
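B3, the metric reported in these charts, scores each mention by the overlap between its system cluster and its gold cluster, then averages. A minimal sketch:

```python
def b_cubed(system, gold):
    """B^3 precision, recall, and F1. `system` and `gold` are lists of
    clusters (sets of mention ids) covering the same mentions."""
    sys_of = {m: c for c in system for m in c}   # mention -> its system cluster
    gold_of = {m: c for c in gold for m in c}    # mention -> its gold cluster
    mentions = list(gold_of)
    prec = sum(len(sys_of[m] & gold_of[m]) / len(sys_of[m]) for m in mentions)
    rec = sum(len(sys_of[m] & gold_of[m]) / len(gold_of[m]) for m in mentions)
    p, r = prec / len(mentions), rec / len(mentions)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

A perfect clustering scores 1.0 on all three; merging everything into one cluster keeps recall at 1.0 but drags precision down, which is why B3 penalizes both over- and under-clustering.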

  44. Automatic within-document processing (bar chart, B3; values assigned per the transcribed order) MT 76.7, Lexicon 76.0, PLTM 75.3, Baseline 67.0 36 / 41
