 
Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia
Alexandre Patry (1), Philippe Langlais (2)
(1) KeaText, Saint-Laurent, Québec, Canada
(2) RALI/DIRO, Université de Montréal, Québec, Canada
Motivations for identifying parallel document pairs
⊲ There is a need for tools able to spot translated texts without necessarily relying on naming conventions (as in [Chen & Nie, 2000] or [Resnik & Smith, 2003]).
⊲ Currently, there is no decent statistical machine translation (SMT) without parallel documents: the 10^9 web parallel corpus (22 M pairs of sentences), the Opus corpus (13 GB of compressed files), Europarl (20 language pairs), etc.
1 Objectives
2 Paradocs: Indexing, Pairing, Post-Filtering
3 Experiments: Europarl, Wikipedia
4 Discussion
Paradocs
⊲ First version described in [Patry and Langlais, 2005]
⊲ Language (and collection) independent
⊲ Relies on 3 lightweight components: searching, pairing, filtering
[Figure: overview of the three components (searching, pairing, filtering) applied to a source document s and its N-best target candidates t1 … tN.]
Searching
[Figure: the source document s and the target collection are represented with ψ; indexing and retrieval return the N-best target documents t1 … tN.]
2 (lightweight) indexing strategies:
⊲ a document ≡ the sequence of its numerical entities (ψ ≡ num)
⊲ a document ≡ the sequence of its hapax words (ψ ≡ hap), as used by [Enright & Kondrak, 2007] for parallel document pairing and by [Lardilleux & Lepage, 2007] for word alignment
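The two indexing strategies can be sketched as follows. This is a minimal illustration, not the Paradocs implementation: the regular expressions, the function names psi_num and psi_hap, and the choice to count hapax words within a single document (rather than over the whole collection) are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List

# Assumed definition of a numerical entity: digits, optionally with a decimal part.
NUM_RE = re.compile(r"\d+(?:\.\d+)?")
# Assumed definition of a word: a run of letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def psi_num(text: str) -> List[str]:
    """Represent a document as the sequence of its numerical entities."""
    return NUM_RE.findall(text)


def psi_hap(text: str) -> List[str]:
    """Represent a document as the sequence of its hapax words.

    Assumption: a hapax is a word occurring exactly once in this document.
    """
    words = WORD_RE.findall(text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]
```

Both functions keep the entities in document order, which matters because the representations are later compared as sequences.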
Pairing
[Figure: the source document s and each of its N-best candidates t1 … tN are represented with ψ, and each candidate pair is described by a feature vector σ.]
⊲ Binary classifier trained in a supervised way.
Features used by the classifier
For a given indexing policy ψ:
⊲ normalized edit distance between the two representations:
  σ1 = ed(ψ(s), ψ(t)) / max(|ψ(s)|, |ψ(t)|)
⊲ total number of entities in both documents:
  σ2 = |ψ(s)| + |ψ(t)|
⊲ a global binary feature:
  σ3 = δ(s, t) = 1 if ed(ψ(s), ψ(t)) ≤ ed(ψ(s), ψ(t′)) ∀ t′, and 0 otherwise
Features used by the classifier
⊲ One set of features per indexing policy ψ; we tried 3 of them:
  numerical entities (ψ ≡ num)
  hapax words (ψ ≡ hap)
  punctuation marks (ψ ≡ punc): .!?():
Example with ψ ≡ num + punc:
  English text: The Legislative Assembly convened at 3.30 pm . Mr. Quirke ( Clerk-Designate ) : THURSDAY , APRIL 1 , 1999
  Inuktitut text: sitamiq , ipuru 1 , 1999 maligaliurvik matuiqtaulauqtuq 3 : 30mi unnusakkut mista kuak ( titiraqti - tikkuaqtausimajuq ) :
  num: < 3.30 ; 1 ; 1999 > and < 1 ; 1999 ; 3 ; 30 >
  punc: < . ; ( ; ) ; : ; , ; , > and < , ; , ; : ; ( ; ) ; : >
  σ ≡ [0.75, 7, 1, 0.83, 12, 0] (the first three values are for num, the last three for punc)
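The three features can be checked on the num and punc sequences of the example with a short sketch. Several points are assumptions: the slides only say "edit-distance", so unit-cost Levenshtein distance is used here; the ∀ t′ in σ3 is taken to range over the retrieved candidates; and the names edit_distance and pair_features are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost Levenshtein distance between two sequences (assumed variant)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]


def pair_features(src: Sequence[str], tgt: Sequence[str],
                  candidates: Sequence[Sequence[str]]) -> List[float]:
    """Return [sigma1, sigma2, sigma3] for the pair (src, tgt).

    `candidates` holds the representations of all target candidates t'
    considered for `src` (assumed here to be the N-best list).
    """
    d = edit_distance(src, tgt)
    longest = max(len(src), len(tgt))
    sigma1 = d / longest if longest else 0.0   # guard for two empty sequences
    sigma2 = len(src) + len(tgt)
    sigma3 = int(all(d <= edit_distance(src, c) for c in candidates))
    return [sigma1, sigma2, sigma3]


# Sequences taken from the example above.
num_en, num_iu = ["3.30", "1", "1999"], ["1", "1999", "3", "30"]
punc_en = [".", "(", ")", ":", ",", ","]
punc_iu = [",", ",", ":", "(", ")", ":"]

f_num = pair_features(num_en, num_iu, [num_iu])
f_punc = pair_features(punc_en, punc_iu, [punc_iu])
print(f_num[:2])                          # [0.75, 7]
print([round(f_punc[0], 2), f_punc[1]])   # [0.83, 12]
```

σ3 is left out of the printed values because it depends on the other candidates retrieved for the same source document, which the example does not list.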
Post-Filtering
[Figure: candidate pairs between source documents s1 … sn and target documents t1 … tm, with edge scores 0.6, 0.7, 0.7, 0.6 and 0.8.]
⊲ Two filters, nop and dup, are illustrated on the same candidate pairs: (s1, t2), (s1, tm), (s2, t1), (s3, t3), (sn, t2)
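The slides do not spell out what each filter does, so the sketch below rests on an assumed reading: nop keeps every pair accepted by the classifier, and dup discards any pair whose source or target document also appears in another pair. The function names and this definition of dup are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def nop(pairs: List[Pair]) -> List[Pair]:
    """Assumed behaviour: no filtering, every pair is kept."""
    return list(pairs)


def dup(pairs: List[Pair]) -> List[Pair]:
    """Assumed behaviour: drop pairs that share a document with another pair."""
    src_count = Counter(s for s, _ in pairs)
    tgt_count = Counter(t for _, t in pairs)
    return [(s, t) for s, t in pairs
            if src_count[s] == 1 and tgt_count[t] == 1]


# Candidate pairs listed on the slide.
pairs = [("s1", "t2"), ("s1", "tm"), ("s2", "t1"), ("s3", "t3"), ("sn", "t2")]
print(dup(pairs))  # [('s2', 't1'), ('s3', 't3')]
```

Under this reading, s1 and t2 each occur in two pairs, so only (s2, t1) and (s3, t3) survive. A filter that keeps the best-scoring pair per document instead would behave differently.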
Setting
⊲ Corpus: Europarl V5 (split into documents of different sizes)
⊲ Search engine: Lucene (N = 20)
⊲ Learner: Weka (5-fold cross-validation)
⊲ Europarl is organized into bitexts → controlled experiment
Europarl
We varied several experimental conditions:
⊲ languages (10): en-da, -de, -el, -es, -fi, -fr, -it, -nl, -pt and -sv
⊲ document length (7): 10, 20, 30, 50, 70, 100 and 1 000 sentences
⊲ indexing strategy (2): num and hap
⊲ entities (4): hap, num, num + hap and num + punc
⊲ classifier (4): logit, ada, bayes and j48
⊲ post-filtering (2): nop, dup
⇒ 4 480 experiments
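The total of 4 480 is the size of the full grid of conditions, 10 × 7 × 2 × 4 × 4 × 2. The enumeration below reproduces that count; the variable names are illustrative, and the values are those listed on the slide.

```python
from itertools import product

languages = ["en-da", "en-de", "en-el", "en-es", "en-fi",
             "en-fr", "en-it", "en-nl", "en-pt", "en-sv"]
doc_lengths = [10, 20, 30, 50, 70, 100, 1000]        # sentences per document
indexing = ["num", "hap"]                             # psi used for searching
entities = ["hap", "num", "num+hap", "num+punc"]      # feature sets sigma
classifiers = ["logit", "ada", "bayes", "j48"]
filters = ["nop", "dup"]

# One experiment per combination of conditions.
configs = list(product(languages, doc_lengths, indexing,
                       entities, classifiers, filters))
print(len(configs))  # 4480
```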
Search error
⊲ nodoc: no (English) document returned (source = Dutch)
⊲ nogood: no good document returned
[Figure: errors (%), from 10 to 45, as a function of the number of sentences per document (ticks from 8 to 1024), with two curves: nodoc and nodoc + nogood.]
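The two error rates can be computed from a retrieval run as sketched below. The data layout (a dictionary of retrieved candidates and a dictionary of reference translations) and the function name are assumptions; nogood is taken to cover only queries that returned something, so that the two categories do not overlap and can be summed as in the plot.

```python
from typing import Dict, List, Tuple


def search_error_rates(retrieved: Dict[str, List[str]],
                       gold: Dict[str, str]) -> Tuple[float, float]:
    """Return (nodoc %, nodoc + nogood %) over the source documents in `gold`.

    retrieved: source id -> N-best list of target ids (may be empty or missing)
    gold:      source id -> id of its true translation (assumed non-empty)
    """
    nodoc = nogood = 0
    for source, target in gold.items():
        candidates = retrieved.get(source, [])
        if not candidates:
            nodoc += 1            # nothing came back from the search engine
        elif target not in candidates:
            nogood += 1           # candidates returned, none is the translation
    total = len(gold)
    return 100.0 * nodoc / total, 100.0 * (nodoc + nogood) / total
```

A pair lost at this stage cannot be recovered by the classifier, so nodoc + nogood is an upper bound on the recall of the later stages.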
Which configuration is the best?
⊲ 70 runs (varying the language pairs and the document length)
⊲ Counting the number of winners (in terms of f-measure)

Nb   ψ     σ          Class.   Filter
31   num   num+hap    logit    dup
31   num   num        logit    dup
31   num   num+punc   logit    dup
24   num   num+hap    j48      dup
24   num   num        j48      dup
24   num   num+punc   j48      dup
...
1    num   num+hap    j48      nop

⊲ logit > j48 > ...
⊲ num ≃ num+hap ≃ num+punc
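Counting winners can be sketched as follows. The counts in the table add up to more than 70, which suggests that every configuration tied for the best f-measure in a run is credited with a win; the sketch adopts that reading as an assumption. The data layout, the function name and the toy scores are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable


def count_winners(f_scores: Dict[Hashable, Dict[Hashable, float]]) -> Counter:
    """Count, for each configuration, the runs in which it reaches the best f-measure.

    f_scores: run id -> {configuration -> f-measure}
    Ties are credited to every tied configuration (assumed).
    """
    wins: Counter = Counter()
    for by_config in f_scores.values():
        best = max(by_config.values())
        for config, f in by_config.items():
            if f == best:
                wins[config] += 1
    return wins


# Toy scores for two runs, made up for illustration only.
toy = {
    ("en-fr", 10): {"logit/num/dup": 0.9, "j48/num/dup": 0.9, "j48/num/nop": 0.7},
    ("en-fr", 20): {"logit/num/dup": 0.8, "j48/num/dup": 0.6, "j48/num/nop": 0.5},
}
print(count_winners(toy).most_common())
# [('logit/num/dup', 2), ('j48/num/dup', 1)]
```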