
LINA: Identifying Comparable Documents from Wikipedia Emmanuel Morin - PowerPoint PPT Presentation



  1. LINA: Identifying Comparable Documents from Wikipedia
     Emmanuel Morin (1), Amir Hazem (2), Elizaveta Loginova-Clouet (1), Florian Boudin (1)
     (1) LINA - UMR CNRS 6241, Université de Nantes, France
     (2) LIUM - EA 4023, Université du Maine, France
     BUCC-2015 Shared Task

  2. Introduction
     ◮ How far can we go with a language-agnostic model?
     ◮ We experiment with the parallel document identification method of [Enright and Kondrak, 2007]
     ◮ We adapt the method to the BUCC-2015 Shared Task based on two assumptions:
        1. Source documents should be paired 1-to-1 with target documents
        2. We have access to comparable documents in several languages

  3. Outline: Introduction, Method, Experiments, Summary

  4. Method
     ◮ Fast parallel document identification [Enright and Kondrak, 2007]
        ◮ Documents = bags of hapax words
        ◮ Words = blank-separated strings that are 4+ characters long
        ◮ Given a document in language A, the document in language B that shares the largest number of words is considered parallel
     ◮ Works very well for parallel documents
        ◮ 99.96% accuracy on EUROPARL [Enright and Kondrak, 2007]
        ◮ 80% precision on Wikipedia [Patry and Langlais, 2011]
     ◮ We use this approach as a baseline for detecting comparable documents
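The baseline described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes hapax words are counted within each document, and the function names, document ids, and sample texts are all made up for the example.

```python
from collections import Counter


def hapax_bag(text, min_len=4):
    """Return the words of at least min_len characters that occur exactly once."""
    # Words are blank-separated strings; shorter strings are ignored.
    counts = Counter(tok for tok in text.split() if len(tok) >= min_len)
    return {tok for tok, n in counts.items() if n == 1}


def rank_targets(source_text, targets):
    """Rank target documents by the number of hapax words shared with the source.

    targets maps a (hypothetical) document id to its raw text.
    """
    src = hapax_bag(source_text)
    scores = {doc_id: len(src & hapax_bag(text)) for doc_id, text in targets.items()}
    # Highest overlap first; ties are ordered by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data, invented for illustration only.
targets = {
    "fr_1": "Nantes accueille BUCC 2015 avec LINA",
    "fr_2": "Paris organise autre chose",
}
ranking = rank_targets("LINA hosts BUCC 2015 in Nantes", targets)
best_id, best_score = ranking[0]
```

Because the score relies only on identical strings (names, numbers, cognates), no dictionary or language-specific resource is needed, which is what makes the approach language agnostic.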

  5. Improvements using 1-to-1 alignments
     ◮ In the baseline, document pairs are scored independently
        ◮ Multiple source documents are paired with the same target document
        ◮ ≈ 60% of English pages are paired with multiple pages in French or German
     ◮ We remove multiply assigned source documents using pigeonhole reasoning
        ◮ From 60% to 11% of multiply assigned source documents
     [Figure: scored links between doc en 1, doc en 2 and doc fr 1, doc fr 2, doc fr 3; edge scores 7, 4, 10, 6]
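The slide does not spell out the exact pigeonhole procedure. One plausible reading, sketched below as an assumption, is a greedy pass that keeps the highest-scoring pair for each contested target and lets the remaining sources fall back to their next-best free target. The function name, document ids, and scores are hypothetical.

```python
def one_to_one(scores):
    """Greedily build a 1-to-1 pairing from scores[source][target] = shared-word count."""
    # Sort all candidate pairs by decreasing score; ids break ties deterministically.
    pairs = sorted(
        ((s, t, v) for s, row in scores.items() for t, v in row.items()),
        key=lambda p: (-p[2], p[0], p[1]),
    )
    assigned, used_targets = {}, set()
    for source, target, _ in pairs:
        # A target already claimed by a higher-scoring source is no longer available.
        if source not in assigned and target not in used_targets:
            assigned[source] = target
            used_targets.add(target)
    return assigned


# Invented scores: scored independently, both sources would pick fr_1.
scores = {
    "en_1": {"fr_1": 9, "fr_2": 5},
    "en_2": {"fr_1": 8, "fr_2": 3},
}
pairing = one_to_one(scores)
```

In this toy case `en_1` keeps `fr_1` and `en_2` is moved to `fr_2`. Exact score ties still depend on an arbitrary ordering here.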

  6. Improvements using cross-lingual information
     ◮ Simple document weighting function → score ties
     ◮ We break the remaining score ties using a third language
        ◮ From 11% to less than 4% of multiply assigned source documents
     [Figure: scored links between doc en, doc de and doc fr 1, doc fr 2; scores 10, 8, 6, 10, 14]
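How the third language enters the scoring is not detailed on the slide. A minimal sketch, assuming the source document already has a counterpart in a third language and that tied candidates are compared by their overlap with that counterpart, could look like this. The function name, ids, and scores are hypothetical.

```python
def break_tie(tied_targets, pivot_scores):
    """Pick, among tied targets, the one scoring highest against the pivot document.

    pivot_scores maps a target id to its shared-word count with the
    third-language document paired with the source (assumed to be available).
    """
    # Sorting first makes the choice deterministic if the pivot scores also tie.
    return max(sorted(tied_targets), key=lambda t: pivot_scores.get(t, 0))


# Invented example: two French candidates tie for an English source,
# and the German counterpart of that source shares more words with fr_1.
winner = break_tie(["fr_2", "fr_1"], {"fr_1": 5, "fr_2": 3})
```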

  7. Outline: Introduction, Method, Experiments, Summary

  8. Experimental settings
     ◮ We focus on the French-English and German-English pairs
     ◮ The following measures are considered relevant
        ◮ Mean Average Precision (MAP)
        ◮ Success (Succ.)
        ◮ Precision at 5 (P@5)
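The three measures can be sketched as below. The sketch assumes that each source document has exactly one correct target and that Success means the correct target is ranked first; neither definition is given on the slide. Names and data are hypothetical.

```python
def evaluate(rankings, gold, k=5):
    """Return (MAP, Success, P@k) as percentages.

    rankings maps a source id to its ranked list of target ids;
    gold maps a source id to its single correct target id (an assumption).
    """
    ap_sum = success = p_at_k = 0.0
    for source, ranked in rankings.items():
        target = gold[source]
        if target in ranked:
            # With a single relevant document, average precision is 1 / rank.
            ap_sum += 1.0 / (ranked.index(target) + 1)
        # Success counts sources whose correct target is ranked first.
        success += ranked[:1] == [target]
        # At most one of the top k can be relevant, so the contribution is 0 or 1/k.
        p_at_k += (target in ranked[:k]) / k
    n = len(rankings)
    return 100 * ap_sum / n, 100 * success / n, 100 * p_at_k / n


# Invented rankings: en_1 is correct at rank 1, en_2 at rank 2.
rankings = {"en_1": ["fr_1", "fr_2"], "en_2": ["fr_1", "fr_2"]}
gold = {"en_1": "fr_1", "en_2": "fr_2"}
map_score, succ, p5 = evaluate(rankings, gold)
```

Under the single-target assumption, P@5 cannot exceed 20%, which is worth keeping in mind when reading P@5 values.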

  9. Results (FR → EN)

                             Train                   Test
     Strategy          MAP    Succ.  P@5       MAP    Succ.  P@5
     baseline          31.4   28.0    7.4      32.9   30.0    7.5
     + pigeonhole      57.7   56.4   11.9      -      -       -
     + cross-lingual   58.9   57.7   12.1      59.0   57.7   12.1

  10. Results (DE → EN)

                             Train                   Test
     Strategy          MAP    Succ.  P@5       MAP    Succ.  P@5
     baseline          28.7   24.9    6.9      29.0   24.9    7.1
     + pigeonhole      61.6   60.1   12.8      -      -       -
     + cross-lingual   62.3   60.9   12.8      62.2   60.7   12.8

  11. Outline: Introduction, Method, Experiments, Summary

  12. Summary
     ◮ Unsupervised method based on hapax words
     ◮ Promising results: about 60% success using pigeonhole reasoning
     ◮ Using a third language slightly improves performance
     ◮ Future work
        ◮ Finding the optimal alignment across all languages
        ◮ Relaxing the hapax-words constraint

  13. Thank you
     florian.boudin@univ-nantes.fr

  14. References
     Enright, J. and Kondrak, G. (2007). A fast method for parallel document identification. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'07), pages 29–32, Rochester, New York, USA.
     Patry, A. and Langlais, P. (2011). Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC'11), pages 87–95, Portland, Oregon, USA.
