LINA: Identifying Comparable Documents from Wikipedia



SLIDE 1

LINA: Identifying Comparable Documents from Wikipedia

Emmanuel Morin (1), Amir Hazem (2), Elizaveta Loginova-Clouet (1), Florian Boudin (1)

(1) LINA - UMR CNRS 6241, Université de Nantes, France
(2) LIUM - EA 4023, Université du Maine, France

BUCC-2015 Shared Task

1 / 14

SLIDE 2

Introduction

◮ How far can we go with a language-agnostic model?
◮ We experiment with the parallel document identification method of [Enright and Kondrak, 2007]
◮ We adapt the method to the BUCC-2015 Shared Task based on two assumptions:
  1. Source documents should be paired 1-to-1 with target documents
  2. We have access to comparable documents in several languages

SLIDE 3

Outline

◮ Introduction
◮ Method
◮ Experiments
◮ Summary

SLIDE 4

Method

◮ Fast parallel document identification [Enright and Kondrak, 2007]
  ◮ Documents = bags of hapax words
  ◮ Words = blank-separated strings that are 4+ characters long
  ◮ Given a document in language A, the document in language B that shares the largest number of words is considered parallel
◮ Works very well for parallel documents
  ◮ 99.96% accuracy on EUROPARL [Enright and Kondrak, 2007]
  ◮ 80% precision on Wikipedia [Patry and Langlais, 2011]
◮ We use this approach as a baseline for detecting comparable documents
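
The baseline described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `hapax_bag` and `best_match` are hypothetical, and it assumes that "hapax" means a word occurring exactly once within a single document and that the length filter is applied before counting.

```python
from collections import Counter


def hapax_bag(text, min_len=4):
    """Return the set of hapax words of a document (assumed: once per document)."""
    # Words are blank-separated strings of at least `min_len` characters.
    counts = Counter(w for w in text.split() if len(w) >= min_len)
    # Keep only the words that occur exactly once in this document.
    return {w for w, c in counts.items() if c == 1}


def best_match(source_text, targets):
    """Pick the target sharing the most hapax words with the source.

    `targets` maps a target document id to its text.
    Returns (best target id, number of shared hapax words).
    """
    src = hapax_bag(source_text)
    scores = {tid: len(src & hapax_bag(text)) for tid, text in targets.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    targets = {"t1": "Nantes 2015 shared", "t2": "Wikipedia other words"}
    print(best_match("Nantes Wikipedia 2015", targets))
```

Because no dictionary or language-specific resource is involved, the overlap comes only from strings written identically in both languages (names, numbers, cognates), which is what makes the approach language agnostic.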

SLIDE 5

Improvements using 1-to-1 alignments

◮ In the baseline, document pairs are scored independently
  ◮ Multiple source documents are paired with the same target document
  ◮ ≈ 60% of English pages are paired with multiple pages in French or German
◮ We remove multiply assigned source documents using pigeonhole reasoning
  ◮ From 60% to 11% of multiply assigned source documents

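[Figure: example alignment graph between French documents (docfr 1, docfr 2, docfr 3) and English documents (docen 1, docen 2); edge labels: 10, 6, 7, 4]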
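
The slide does not spell out the pigeonhole procedure, so the following is only one plausible reading: a greedy pass that visits candidate pairs from the highest score down and lets each source and each target be used at most once. The function name `one_to_one` and the toy scores are hypothetical and are not taken from the figure.

```python
def one_to_one(scores):
    """Greedy 1-to-1 pairing (an assumed reading of the pigeonhole step).

    `scores` maps (source id, target id) to a shared-word count.
    Returns a dict mapping each assigned source to its target.
    """
    # Highest scores first; the pair itself is a secondary key so that
    # the visiting order is deterministic when scores are equal.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    assigned = {}
    used_targets = set()
    for (src, tgt), _score in ranked:
        # Skip pairs whose source or target has already been claimed.
        if src in assigned or tgt in used_targets:
            continue
        assigned[src] = tgt
        used_targets.add(tgt)
    return assigned


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    toy = {("fr1", "en1"): 9, ("fr2", "en1"): 5, ("fr2", "en2"): 3, ("fr3", "en2"): 8}
    print(one_to_one(toy))
```

In this sketch equal scores are resolved arbitrarily by identifier order, which is exactly the kind of case a further tie-breaking signal would need to handle.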

SLIDE 6

Improvements using cross-lingual information

◮ Simple document weighting function → score ties
◮ We break the remaining score ties using a third language
  ◮ From 11% to less than 4% of multiply assigned source documents

[Figure: example alignment graph over docfr 1, docfr 2, docde and docen; edge labels: 10, 6, 8, 14, 10]
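
One way the third-language tie-break could work is sketched below. It assumes that a pivot document in the third language (for example the German page already paired with the English source) is known, and that the tied candidate sharing the most words with that pivot wins. The name `break_tie` and the data layout are hypothetical; the slides do not give the exact rule.

```python
def break_tie(tied_targets, pivot_doc, pivot_scores):
    """Resolve a score tie with a third-language pivot document (assumed rule).

    `tied_targets` lists candidate ids that received the same score.
    `pivot_scores` maps (pivot id, candidate id) to a shared-word count.
    Returns the winning candidate, or None if the pivot does not separate them.
    """
    def pivot_score(target):
        # Pairs missing from the table are treated as sharing no words.
        return pivot_scores.get((pivot_doc, target), 0)

    best = max(pivot_score(t) for t in tied_targets)
    winners = [t for t in tied_targets if pivot_score(t) == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    print(break_tie(["fr1", "fr2"], "de1", {("de1", "fr1"): 3, ("de1", "fr2"): 9}))
```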

SLIDE 7

Outline

◮ Introduction
◮ Method
◮ Experiments
◮ Summary

SLIDE 8

Experimental settings

◮ We focus on the French-English and German-English pairs
◮ The following measures are considered relevant
  ◮ Mean Average Precision (MAP)
  ◮ Success (Succ.)
  ◮ Precision at 5 (P@5)
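
The three measures can be computed as in the sketch below, under the assumption (in line with the 1-to-1 pairing assumption stated earlier) that each source document has exactly one correct target. In that case average precision reduces to the reciprocal rank of the correct target, Success is the share of sources whose top candidate is correct, and P@5 counts the correct target as one hit out of five. The function name `evaluate` and the input layout are hypothetical.

```python
def evaluate(rankings, gold, k=5):
    """Compute MAP, Success and P@k, assuming one correct target per source.

    `rankings` maps a source id to its ranked list of candidate target ids.
    `gold` maps a source id to its single correct target id.
    """
    ap_sum = succ_sum = pk_sum = 0.0
    for src, ranked in rankings.items():
        target = gold[src]
        if target not in ranked:
            continue  # a missing target contributes 0 to every measure
        rank = ranked.index(target) + 1
        ap_sum += 1.0 / rank          # average precision with one relevant item
        succ_sum += 1.0 if rank == 1 else 0.0
        pk_sum += 1.0 / k if rank <= k else 0.0
    n = len(rankings)
    return {"MAP": ap_sum / n, "Succ": succ_sum / n, f"P@{k}": pk_sum / n}


if __name__ == "__main__":
    # Toy rankings, invented for illustration only.
    print(evaluate({"a": ["x", "y"], "b": ["y", "x"]}, {"a": "x", "b": "x"}))
```

Under this single-target assumption P@5 cannot exceed 0.2, so a P@5 of about 12% would correspond to the correct target appearing in the top five for roughly 60% of sources.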

SLIDE 9

Results (FR→EN)

                  Train                  Test
Strategy          MAP   Succ.  P@5       MAP   Succ.  P@5
baseline          31.4  28.0   7.4       32.9  30.0   7.5
+ pigeonhole      57.7  56.4   11.9      −     −      −
+ cross-lingual   58.9  57.7   12.1      59.0  57.7   12.1

SLIDE 10

Results (DE→EN)

                  Train                  Test
Strategy          MAP   Succ.  P@5       MAP   Succ.  P@5
baseline          28.7  24.9   6.9       29.0  24.9   7.1
+ pigeonhole      61.6  60.1   12.8      −     −      −
+ cross-lingual   62.3  60.9   12.8      62.2  60.7   12.8

SLIDE 11

Outline

◮ Introduction
◮ Method
◮ Experiments
◮ Summary

SLIDE 12

Summary

◮ Unsupervised, hapax-word-based method
◮ Promising results: about 60% success using pigeonhole reasoning
◮ Using a third language slightly improves performance
◮ Future work
  ◮ Finding the optimal alignment across all languages
  ◮ Relaxing the hapax-words constraint

SLIDE 13

Thank you

florian.boudin@univ-nantes.fr

SLIDE 14

References I

Enright, J. and Kondrak, G. (2007). A fast method for parallel document identification. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'07), pages 29–32, Rochester, New York, USA.

Patry, A. and Langlais, P. (2011). Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC'11), pages 87–95, Portland, Oregon, USA.
