

SLIDE 1

Objectives Paradocs Experiments Discussion

Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia.

Alexandre Patry (1), Philippe Langlais (2)

(1) KeaText, Saint-Laurent, Québec, Canada
(2) RALI/DIRO, Université de Montréal, Québec, Canada

Alexandre Patry1, Philippe Langlais2 Identifying Parallel Documents from a Large Bilingual Collection of

SLIDE 2

Motivations

For identifying parallel document pairs

⊲ There is a need for tools able to spot texts that are translations of each other without necessarily relying on naming conventions (as in [Chen & Nie, 2000] or [Resnik & Smith, 2003]).
⊲ Currently, there is no decent SMT (statistical machine translation) without parallel documents: the 10^9 web parallel corpus (22 M pairs of sentences), the Opus corpus (13 GB of compressed files), Europarl (20 language pairs), etc.

SLIDE 3

1 Objectives
2 Paradocs: Indexing, Pairing, Post-Filtering
3 Experiments: Europarl, Wikipedia
4 Discussion

SLIDE 4

Paradocs

First version described in [Patry and Langlais, 2005]
Language (and collection) independent
Relies on 3 lightweight components: searching, pairing, filtering

[Figure: overview of the three components. Searching: a source document s is indexed with ψ and the N-best target documents t1 ... tN are retrieved. Pairing: features σ are computed from ψ for each candidate pair. Filtering: scored links between source documents s1 ... sn and target documents t1 ... tm.]

SLIDE 5

Searching

[Figure: searching step. The target collection is indexed with ψ; for a source document s, the N-best target documents t1 ... tN are retrieved.]

SLIDE 6

Searching

2 (lightweight) indexing strategies:

a document ≡ sequence of its numerical entities (ψ ≡ num)
a document ≡ sequence of its hapax words (ψ ≡ hap)

Hapax words were already used by [Enright & Kondrak, 2007] for parallel document pairing and by [Lardilleux & Lepage, 2007] for word alignment
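The two indexing policies can be illustrated with a minimal Python sketch. The tokenisation rules are not given in the slides, so the regular expressions below are assumptions, and the 4-character minimum for hapax words is borrowed from the shell script shown in the comparison to [Enright & Kondrak, 2007]. The names `psi_num` and `psi_hap` are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenisation: the slides do not specify these patterns.
NUM_RE = re.compile(r"\d+(?:\.\d+)?")   # integers and simple decimals
WORD_RE = re.compile(r"[^\W\d_]+")      # runs of letters


def psi_num(text):
    """Sequence of numerical entities, in order of appearance."""
    return NUM_RE.findall(text)


def psi_hap(text, min_len=4):
    """Sequence of hapax words (words occurring exactly once), in order."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    counts = Counter(words)
    return [w for w in words if counts[w] == 1 and len(w) >= min_len]


english = ("The Legislative Assembly convened at 3.30 pm. "
           "Mr. Quirke (Clerk-Designate): THURSDAY, APRIL 1, 1999")
print(psi_num(english))                             # ['3.30', '1', '1999']
print(psi_hap("the cat saw the other cat leave"))   # ['other', 'leave']
```

Both functions return sequences rather than sets, since the pairing features are edit distances over the ordered representations.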

SLIDE 7

Pairing

[Figure: pairing step. For a source document s and each of its N-best candidates t1 ... tN, the representations given by ψ are compared to produce the features σ.]

⊲ binary classifier trained in a supervised way.

SLIDE 8

Features used by the classifier

For a given indexing policy ψ:

normalized edit-distance between the two representations:
    σ1 = ed(ψ(s), ψ(t)) / max(|ψ(s)|, |ψ(t)|)
total number of entities in both documents:
    σ2 = |ψ(s)| + |ψ(t)|
a global binary feature:
    σ3 = δ(s, t) = 1 if ed(ψ(s), ψ(t)) ≤ ed(ψ(s), ψ(t′)) ∀ t′, 0 otherwise

SLIDE 9

Features used by the classifier

⊲ One set of features per indexing policy ψ; we tried 3 of them:
numerical entities (ψ ≡ num)
hapax words (ψ ≡ hap)
punctuation marks (ψ ≡ punc): . ! ? ( ) :

Example with ψ ≡ num+punc

English: The Legislative Assembly convened at 3.30 pm. Mr. Quirke (Clerk-Designate): THURSDAY, APRIL 1, 1999
    num:  <3.30; 1; 1999>
    punc: <.; (; ); :; ,; ,>
Inuktitut: sitamiq, ipuru 1, 1999 maligaliurvik matuiqtaulauqtuq 3:30mi unnusakkut mista kuak (titiraqti - tikkuaqtausimajuq):
    num:  <1; 1999; 3; 30>
    punc: <,; ,; :; (; ); :>

SLIDE 10

Features used by the classifier

⊲ With ψ ≡ num+punc, the example pair above (English / Inuktitut) is represented by σ ≡ [0.75, 7, 1, 0.83, 12, 0]: the three features (σ1, σ2, σ3) for num followed by the same three features for punc.
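The feature vector of the example can be checked with a short Python sketch. It assumes a plain Levenshtein distance with unit costs for ed, and takes the num and punc sequences exactly as printed on the slides; the function names are hypothetical. σ3 depends on the other candidates returned by the search, which the slides do not show, so only σ1 and σ2 are reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs, assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def features(src, candidates, k):
    """(sigma1, sigma2, sigma3) for the pair (src, candidates[k])."""
    tgt = candidates[k]
    dists = [edit_distance(src, c) for c in candidates]
    # The extra 1 only guards against two empty sequences (our addition).
    sigma1 = dists[k] / max(len(src), len(tgt), 1)
    sigma2 = len(src) + len(tgt)
    sigma3 = int(dists[k] == min(dists))   # 1 if no candidate is closer
    return sigma1, sigma2, sigma3


en_num, iu_num = ["3.30", "1", "1999"], ["1", "1999", "3", "30"]
en_punc = [".", "(", ")", ":", ",", ","]
iu_punc = [",", ",", ":", "(", ")", ":"]

s1, s2, _ = features(en_num, [iu_num], 0)
print(round(s1, 2), s2)    # 0.75 7
s1, s2, _ = features(en_punc, [iu_punc], 0)
print(round(s1, 2), s2)    # 0.83 12
```

The two printed lines match the first two entries of each half of σ ≡ [0.75, 7, 1, 0.83, 12, 0].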

SLIDE 11

Post-Filtering

[Figure: bipartite graph linking source documents s1 ... sn to target documents t1 ... tm; each link carries a classifier score (0.6, 0.7, 0.6, 0.8, 0.7).]

SLIDE 12

Post-Filtering

nop: (s1, t2), (s1, tm), (s2, t1), (s3, t3), (sn, t2)

SLIDE 13

Post-Filtering

dup: (s1, t2), (s1, tm), (s2, t1), (s3, t3), (sn, t2)
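The slides do not spell out what each filter does, so the following Python sketch is only one plausible reading: nop is taken to keep every pair accepted by the classifier, and dup is ASSUMED to discard every pair whose target document is claimed by more than one source document. Neither definition is confirmed by the slides.

```python
from collections import Counter


def nop(pairs):
    """Assumed: keep every pair accepted by the classifier."""
    return list(pairs)


def dup(pairs):
    """ASSUMED behaviour: drop pairs whose target has several sources."""
    target_counts = Counter(t for _, t in pairs)
    return [(s, t) for s, t in pairs if target_counts[t] == 1]


pairs = [("s1", "t2"), ("s1", "tm"), ("s2", "t1"), ("s3", "t3"), ("sn", "t2")]
print(nop(pairs))   # all five pairs
print(dup(pairs))   # [('s1', 'tm'), ('s2', 't1'), ('s3', 't3')]
```

Under this reading, t2 is shared by s1 and sn, so both of its pairs are removed. Note that neither filter looks at the classifier score.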


SLIDE 15

Setting

Europarl V5 (split into documents of different sizes)
Search engine: Lucene (N = 20)
Learner: Weka (5-fold cross-validation)
⊲ Europarl is organized into bitexts → controlled experiment

SLIDE 16

Europarl

We varied several experimental conditions:

languages (10): en-da, -de, -el, -es, -fi, -fr, -it, -nl, -pt and -sv
doc. length (7): 10, 20, 30, 50, 70, 100 and 1 000 sentences
indexing strategy (2): num and hap
entities (4): hap, num, num+hap and num+punc
classifier (4): logit, ada, bayes, and j48
post-filtering (2): nop, dup

⟹ 4 480 experiments
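As a quick check of the experiment count, the grid of conditions can be enumerated in Python. The variable names are ours; the values are the ones listed on the slide.

```python
from itertools import product

languages = ["da", "de", "el", "es", "fi", "fr", "it", "nl", "pt", "sv"]
doc_lengths = [10, 20, 30, 50, 70, 100, 1000]
indexing = ["num", "hap"]
entities = ["hap", "num", "num+hap", "num+punc"]
classifiers = ["logit", "ada", "bayes", "j48"]
filters = ["nop", "dup"]

# One experiment per combination of conditions.
grid = list(product(languages, doc_lengths, indexing,
                    entities, classifiers, filters))
print(len(grid))                            # 4480
print(len(languages) * len(doc_lengths))    # 70
```

The second figure, 70, is the number of (language pair, document length) runs over which configurations are compared.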

SLIDE 17

Search error

nodoc: no (English) document returned (source = Dutch)
nogood: no good document returned

[Figure: search errors (%) as a function of the number of sentences per document (log scale, 8 to 1024); two curves: nodoc + nogood, and nodoc.]

SLIDE 18

Which configuration is the best?

70 runs (varying the language pairs and the document length)
Counting the number of winners (in terms of f-measure)

Nb   ψ     σ          Class.   Filter
31   num   num+hap    logit    dup
31   num   num        logit    dup
31   num   num+punc   logit    dup
24   num   num+hap    j48      dup
24   num   num        j48      dup
24   num   num+punc   j48      dup
. . .
1    num   num+hap    j48      nop

SLIDE 19

Which configuration is the best?

⊲ logit > j48 > . . .

SLIDE 20

Which configuration is the best?

⊲ num ≃ num+hap ≃ num+punc


SLIDE 22

Wikipedia

Setting

English-French pages cross-lingually linked in Wikipedia (as available in summer 2009)
    article pairs: 537 067
    avg. document size: 711 (English) and 445 (French) words
Classifier ≡ best configuration on a task conducted on http://www.olympic.org (no cross-validation): ψ ≡ num, learner = j48, dup
Lucene (N = 5)

SLIDE 23

Wikipedia

Manual analysis of 200 cross-language linked pairs

⊲ [Fung & Cheung, 2004]: parallel, noisy parallel, topic-aligned, and very-non-parallel

Wikipedia
Type       Count   Ratio
very-non   92      46%
topic      58      29%
noisy      22      11%
parallel   28      14%
Total      200

⊲ 25% of the article pairs are parallel or noisy parallel


SLIDE 25

Wikipedia

Example of a parallel pair of articles

French: "Cabal Online" est un jeu de rôle en ligne massivement multijoueur en 3D et gratuit, développé par ESTsoft. Différentes versions du jeu sont disponibles pour différents pays, publiées par différentes compagnies dont OGPlanet and Asiasoft. Même si "Cabal Online" est gratuit, le jeu possède un magasin spécial qui permet aux joueurs d'acheter des améliorations et des items dans le jeu contre de l'argent réel. Le jeu prend place dans un monde fantastique connu sous le nom de Nevareth . . .

English: "Cabal Online" is a free-of-charge, 3D massively-multiplayer online role-playing game (MMORPG), developed by South Korean company ESTsoft. Different versions of the game are available for specific countries or regions, published by various companies such as OGPlanet and Asiasoft. Although "Cabal Online" is free-of-charge, the game has a "Cash Shop" which allows players to purchase game enhancements and useful ingame items using real currency. The game takes place in a mythical world known as Nevareth, which was . . .

SLIDE 26

Wikipedia

A few numbers

From the 537 067 English documents of our collection:
106 896 (20%) did not receive any answer from Lucene (nodoc).
117 032 pairs of documents were judged parallel by the classifier.
dup eliminated (slightly less than) half of them ⟹ 61 897 pairs.
Intersecting with the cross-language linked pairs ⟹ 44 447 pairs.
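The proportions behind these counts can be recomputed directly. This is only arithmetic on the figures given on the slide; the variable names are ours.

```python
english_docs = 537_067
nodoc = 106_896
classified_parallel = 117_032
after_dup = 61_897
also_linked = 44_447

# Share of English documents for which Lucene returned nothing.
print(f"nodoc: {nodoc / english_docs:.1%}")                           # 19.9%
# Share of classifier-accepted pairs removed by the dup filter.
removed = classified_parallel - after_dup
print(f"removed by dup: {removed / classified_parallel:.1%}")         # 47.1%
# Share of the remaining pairs that are also cross-language linked.
print(f"also cross-language linked: {also_linked / after_dup:.1%}")   # 71.8%
```

The 47.1% figure is the "slightly less than half" mentioned above.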

SLIDE 27

Wikipedia

Manual analysis of 200 pairs identified as parallel by Paradocs

⊲ The sample reflects the distribution of Paradocs' scores

Paradocs
Type       Count   Ratio
very-non   5       2.5%
topic      34      17.0%
noisy      39      19.5%
parallel   122     61.0%
Total      200

⊲ ≃80% of pairs are parallel or noisy parallel
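The two headline ratios (25% for cross-language linked pairs, about 80% for Paradocs output) follow from the manual counts. A small Python check, using the counts from the two manual analyses and a hypothetical helper name:

```python
def parallel_share(counts):
    """Fraction of pairs labelled parallel or noisy parallel."""
    total = sum(counts.values())
    return (counts["parallel"] + counts["noisy"]) / total


# Counts from the manual analysis of 200 cross-language linked pairs.
wikipedia_links = {"very-non": 92, "topic": 58, "noisy": 22, "parallel": 28}
# Counts from the manual analysis of 200 pairs returned by Paradocs.
paradocs_output = {"very-non": 5, "topic": 34, "noisy": 39, "parallel": 122}

print(parallel_share(wikipedia_links))   # 0.25
print(parallel_share(paradocs_output))   # 0.805
```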


SLIDE 29

Summary

We described Paradocs, a lightweight system for pairing parallel documents in a bilingual collection

We tested Paradocs on Europarl

⊲ Paradocs outperforms [Enright & Kondrak, 2007]
⊲ it (slightly) trades recall for speed

We tested Paradocs on Wikipedia

⊲ precision of 80% (if noisy parallel pairs count as good pairs)
⊲ 1/4 of randomly inspected cross-language linked pairs are parallel or noisy parallel

SLIDE 30

Future Work

Better post-filtering (using the classifier score)
Reducing search errors by indexing some words (possibly those passing a keyword test, e.g. tf.idf)
Comparing Paradocs + sentence alignment (when pairs are labeled as parallel) to parallel sentence extraction (e.g. [Munteanu et al., 2004], [Smith et al., 2010])
Toward a generative model of comparable documents which integrates the notion of units (e.g. paragraphs)

SLIDE 31

Thank you

SLIDE 32

Comparison to (Enright & Kondrak, 2007)

[Enright & Kondrak, 2007] ranks candidate (document) pairs in decreasing order of the number of hapax words they share

    # csh: $1 and $2 are the two documents to compare
    foreach in ($1 $2)
        # one token per line; keep tokens seen once that contain 4 word characters in a row
        cat $in | tr '[:space:]' '\n' | sort | uniq -u | grep '\w\w\w\w' >! $in.hpx
    end
    # count the hapax tokens present in both lists
    sort $1.hpx $2.hpx | uniq -d | wc -l
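For readers less familiar with csh, the same count can be sketched in Python. This mirrors the shell pipeline above (whitespace tokenisation, tokens occurring once, at least 4 consecutive word characters) and is not the original implementation of [Enright & Kondrak, 2007]; the function names and the two toy documents are ours.

```python
import re
from collections import Counter


def hapaxes(text, min_len=4):
    """Tokens occurring exactly once (uniq -u) that contain at least
    `min_len` consecutive word characters (the grep filter)."""
    counts = Counter(text.split())
    pattern = re.compile(r"\w{%d}" % min_len)
    return {tok for tok, n in counts.items() if n == 1 and pattern.search(tok)}


def shared_hapaxes(doc_a, doc_b):
    """Number of hapax words the two documents have in common."""
    return len(hapaxes(doc_a) & hapaxes(doc_b))


a = "Nevareth is a world ; Nevareth has ESTsoft and OGPlanet"
b = "ESTsoft publie le jeu avec OGPlanet"
print(shared_hapaxes(a, b))   # 2
```

Here only ESTsoft and OGPlanet are hapax in both toy documents; Nevareth occurs twice in the first one and is therefore excluded.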

SLIDE 33

Comparison to (Enright & Kondrak, 2007)

[Enright & Kondrak, 2007] ranks candidate (document) pairs in decreasing order of the number of hapax words they share

[Figure: gain (%) as a function of the number of sentences per document (log scale, 8 to 1024); one curve per language: da, de, el, es, fi, fr, it, nl, pt, sv.]

SLIDE 34

Wikipedia

How good is the score produced by Paradocs?

           p ≤ 0.1   p ≤ 0.2   p < 0.5   avg.
very-non   1.1%      91.4%     92.5%     0.25
topic      1.7%      74.6%     78.0%     0.37
noisy      13.6%     77.3%     90.9%     0.26
parallel   7.1%      25.0%     35.7%     0.71

⊲ ≃64% of parallel pairs receive a score ≥ 0.5

SLIDE 35

Wikipedia

How good is the score produced by Paradocs?

⊲ ≃8% of very-non pairs receive a score ≥ 0.5
