LINA: Identifying Comparable Documents from Wikipedia

Emmanuel Morin (1), Amir Hazem (2), Elizaveta Loginova-Clouet (1), Florian Boudin (1)
(1) LINA - UMR CNRS 6241, Université de Nantes, France
(2) LIUM - EA 4023, Université du Maine, France
BUCC-2015 Shared Task

Introduction
◮ How far can we go with a language-agnostic model?
◮ We experiment with the parallel document identification method of [Enright and Kondrak, 2007]
◮ We adapt the method to the BUCC-2015 Shared Task based on two assumptions:
  1. Source documents should be paired 1-to-1 with target documents
  2. We have access to comparable documents in several languages

Outline
◮ Introduction
◮ Method
◮ Experiments
◮ Summary

Method
◮ Fast parallel document identification [Enright and Kondrak, 2007]
  ◮ Documents = bags of hapax words
  ◮ Words = blank-separated strings that are 4+ characters long
  ◮ Given a document in language A, the document in language B that shares the largest number of words is considered parallel
◮ Works very well for parallel documents
  ◮ 99.96% accuracy on EUROPARL [Enright and Kondrak, 2007]
  ◮ 80% precision on Wikipedia [Patry and Langlais, 2011]
◮ We use this approach as a baseline for detecting comparable documents
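The baseline scoring above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `hapax_bag` and `best_match` are hypothetical, and the sketch assumes that a hapax word is one occurring exactly once within a single document and that no case folding is applied.

```python
from collections import Counter


def hapax_bag(text, min_len=4):
    """Return the words of `text` that occur exactly once (assumed hapax definition).

    Words are blank-separated strings of at least `min_len` characters.
    """
    counts = Counter(w for w in text.split() if len(w) >= min_len)
    return {w for w, c in counts.items() if c == 1}


def best_match(source_text, target_texts):
    """Return (index, score) of the target sharing the most hapax words with the source."""
    src = hapax_bag(source_text)
    scores = [len(src & hapax_bag(t)) for t in target_texts]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy documents, invented for illustration only.
source = "Nantes hosts the BUCC workshop workshop in 2015"
targets = [
    "Paris accueille une conférence",
    "Nantes accueille BUCC en 2015",
]
print(best_match(source, targets))
```

Because every source document is scored on its own, nothing prevents two sources from selecting the same target, which is the limitation the next slide addresses.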

Improvements using 1-to-1 alignments
◮ In the baseline, document pairs are scored independently
  ◮ Multiple source documents are paired with the same target document
  ◮ ≈ 60% of English pages are paired with multiple pages in French or German
◮ We remove multiply assigned source documents using pigeonhole reasoning
  ◮ From 60% to 11% of multiply assigned source documents
[Figure: scored links between doc fr 1, doc fr 2, doc fr 3 and doc en 1, doc en 2 (scores 7, 4, 10, 6)]
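The slide does not spell out the pigeonhole procedure, so the following is only one plausible reading: a greedy pass that keeps the highest-scoring pair first and never reuses a target. The function name `one_to_one` and the scores are hypothetical and do not reproduce the edges of the slide's figure.

```python
def one_to_one(scores):
    """Greedy 1-to-1 pairing from a {(source, target): score} mapping.

    Assumption: a target claimed by several sources goes to the
    highest-scoring one; the others fall back to their next free target.
    """
    pairs = {}
    used_targets = set()
    # Visit pairs from highest to lowest score; names break ties deterministically.
    for (src, tgt), _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        if src not in pairs and tgt not in used_targets:
            pairs[src] = tgt
            used_targets.add(tgt)
    return pairs


# Illustrative scores: in the baseline, fr1 and fr2 would both pick en1.
scores = {
    ("fr1", "en1"): 5,
    ("fr2", "en1"): 9,
    ("fr2", "en2"): 3,
    ("fr3", "en2"): 8,
}
print(one_to_one(scores))
```

With three sources and two targets, one source necessarily stays unpaired; here `fr1` is left out because its only candidate is taken by a higher-scoring pair.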

Improvements using cross-lingual information
◮ Simple document weighting function → score ties
◮ We break the remaining score ties using a third language
  ◮ From 11% to less than 4% of multiply assigned source documents
[Figure: scored links between doc fr 1, doc fr 2, doc de and doc en (scores 10, 8, 6, 10, 14)]
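How the third language is consulted is not detailed on the slide. The sketch below assumes that, when several candidates tie on their direct score, the one sharing the most words with a third-language document wins. The function name `resolve` and all values are hypothetical.

```python
def resolve(direct, pivot):
    """Pick the best candidate, breaking ties with third-language scores.

    direct: {candidate: score against the document being paired}
    pivot:  {candidate: score against a third-language document}
    (Assumed tie-breaking rule; the slides do not specify it.)
    """
    top = max(direct.values())
    tied = [doc for doc, score in direct.items() if score == top]
    if len(tied) == 1:
        return tied[0]
    # Among tied candidates, prefer the one closest to the pivot document.
    return max(tied, key=lambda doc: pivot.get(doc, 0))


# Illustrative values only.
direct = {"fr1": 10, "fr2": 10}
pivot = {"fr1": 8, "fr2": 6}
print(resolve(direct, pivot))
```

If the pivot scores tie as well, `max` returns the first tied candidate in insertion order, so some ties can remain unresolved in practice.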

Experimental settings
◮ We focus on the French-English and German-English pairs
◮ The following measures are considered relevant
  ◮ Mean Average Precision (MAP)
  ◮ Success (Succ.)
  ◮ Precision at 5 (P@5)
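As a rough guide to the three measures, the sketch below computes them under two assumptions that the slides do not state: each source document has exactly one correct target, and Success means the correct target is ranked first. The function name `evaluate` and the toy data are hypothetical.

```python
def evaluate(rankings, gold, k=5):
    """Compute MAP, Success and P@k for ranked candidate lists.

    rankings: {source: [targets, best first]}
    gold:     {source: the single correct target} (assumed one per source)
    """
    n = len(gold)
    ap_sum = succ = p_at_k = 0.0
    for src, target in gold.items():
        ranked = rankings.get(src, [])
        if target in ranked:
            rank = ranked.index(target) + 1
            ap_sum += 1.0 / rank        # average precision with one relevant item
            succ += rank == 1           # correct target ranked first
            p_at_k += (rank <= k) / k   # at most one hit among the top k
    return {"MAP": ap_sum / n, "Succ": succ / n, f"P@{k}": p_at_k / n}


# Toy rankings, invented for illustration.
rankings = {"fr1": ["en1", "en2"], "fr2": ["en3", "en2"]}
gold = {"fr1": "en1", "fr2": "en2"}
print(evaluate(rankings, gold))
```

Under the single-target assumption, P@5 cannot exceed 0.2 (20 on a percentage scale).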

Results (FR → EN)

                  Train                  Test
Strategy          MAP    Succ.  P@5      MAP    Succ.  P@5
baseline          31.4   28.0   7.4      32.9   30.0   7.5
+ pigeonhole      57.7   56.4   11.9     −      −      −
+ cross-lingual   58.9   57.7   12.1     59.0   57.7   12.1

Results (DE → EN)

                  Train                  Test
Strategy          MAP    Succ.  P@5      MAP    Succ.  P@5
baseline          28.7   24.9   6.9      29.0   24.9   7.1
+ pigeonhole      61.6   60.1   12.8     −      −      −
+ cross-lingual   62.3   60.9   12.8     62.2   60.7   12.8

Summary
◮ Unsupervised method based on hapax words
◮ Promising results: about 60% success using pigeonhole reasoning
◮ Using a third language slightly improves performance
◮ Future work
  ◮ Finding the optimal alignment across all the languages
  ◮ Relaxing the hapax-words constraint

Thank you
florian.boudin@univ-nantes.fr

References

Enright, J. and Kondrak, G. (2007). A fast method for parallel document identification. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'07), pages 29–32, Rochester, New York, USA.

Patry, A. and Langlais, P. (2011). Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC'11), pages 87–95, Portland, Oregon, USA.
