Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia


  1. Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia. Alexandre Patry (1), Philippe Langlais (2). (1) KeaText, Saint-Laurent, Québec, Canada; (2) RALI/DIRO, Université de Montréal, Québec, Canada. Sections: Objectives, Paradocs, Experiments, Discussion.

  2. Motivations for identifying parallel document pairs ⊲ There is a need for tools able to spot translated texts without necessarily relying on naming conventions (as in [Chen & Nie, 2000] or [Resnik & Smith, 2003]). ⊲ Currently, there is no decent statistical machine translation (SMT) without parallel documents: the 10^9 web parallel corpus (22 M pairs of sentences), the Opus corpus (13 GB of compressed files), Europarl (20 language pairs), etc.

  3. Outline: 1. Objectives; 2. Paradocs (Indexing, Pairing, Post-Filtering); 3. Experiments (Europarl, Wikipedia); 4. Discussion.

  4. Paradocs: first version described in [Patry and Langlais, 2005]. Language (and collection) independent. Relies on 3 lightweight components: searching, pairing, filtering. [Figure: overview of the three components, from indexing and N-best retrieval of targets t1 ... tN for a source s, through pairing on ψ and σ values, to scored pairs between s1 ... sn and t1 ... tm.]

  5. Searching. [Figure: the source document s and the target collection are represented with ψ; indexing and retrieval return the N-best target documents t1 ... tN.]

  6. Searching: 2 (lightweight) indexing strategies: document ≡ sequence of its numerical entities (ψ ≡ num); document ≡ sequence of its hapax words (ψ ≡ hap), as in [Enright & Kondrak, 2007] for parallel document pairing and [Lardilleux & Lepage, 2007] for word alignment.
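
The two indexing strategies can be illustrated with a minimal Python sketch. The regular expressions, the lowercasing, and the choice to count hapax words within a single document are assumptions made for illustration; the slides do not specify the tokenization, and `index_num` and `index_hap` are hypothetical names.

```python
import re
from collections import Counter

# Assumption: a numerical entity is a run of digits with an optional decimal part.
NUM_RE = re.compile(r"\d+(?:\.\d+)?")
# Assumption: a word is a maximal run of letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def index_num(text):
    """Return the numerical entities of a document, in order of appearance."""
    return NUM_RE.findall(text)


def index_hap(text):
    """Return the words that occur exactly once in the document, in order."""
    words = WORD_RE.findall(text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] == 1]


doc = "The session opened at 9 on 1 April 1999 . The session closed at 17 ."
print(index_num(doc))
print(index_hap(doc))
```

Both functions keep the order of appearance, since each document is described as a sequence and sequences are later compared with an edit distance.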

  7. Pairing. [Figure: the ψ representations of the source s and of each retrieved target t1 ... tN are compared to produce feature values σ, from which each candidate pair is labeled 1 or 0.] ⊲ Binary classifier trained in a supervised way.

  8. Features used by the classifier. For a given indexing policy ψ:
     normalized edit distance between the two representations: $\sigma_1 = \dfrac{\mathrm{ed}(\psi(s), \psi(t))}{\max(|\psi(s)|, |\psi(t)|)}$
     total number of entities in both documents: $\sigma_2 = |\psi(s)| + |\psi(t)|$
     a global binary feature: $\sigma_3 = \delta(s, t) = \begin{cases} 1 & \text{if } \mathrm{ed}(\psi(s), \psi(t)) \le \mathrm{ed}(\psi(s), \psi(t')) \;\; \forall t' \\ 0 & \text{otherwise} \end{cases}$
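
The three features can be sketched in Python as follows. This is a minimal illustration, assuming a unit-cost Levenshtein distance for ed and assuming that t′ ranges over the targets retrieved for s; the function names and the toy sequences are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute or match
        prev = curr
    return prev[-1]


def features(src, tgt, candidates):
    """Return (sigma1, sigma2, sigma3) for the pair (src, tgt).

    candidates holds the representations of every target considered
    for src, including tgt itself (assumption: the retrieved N-best list).
    """
    d = edit_distance(src, tgt)
    longest = max(len(src), len(tgt))
    sigma1 = d / longest if longest else 0.0  # assumption: 0 when both are empty
    sigma2 = len(src) + len(tgt)
    sigma3 = int(all(d <= edit_distance(src, c) for c in candidates))
    return sigma1, sigma2, sigma3


# Toy representations (made up for illustration).
s = ["12", "2004", "7"]
t_a = ["12", "2004", "7", "30"]
t_b = ["5", "1998"]
print(features(s, t_a, [t_a, t_b]))
print(features(s, t_b, [t_a, t_b]))
```

Here σ3 is the only feature that looks beyond the pair itself: it marks the candidate (or candidates, in case of a tie) closest to the source.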

  9. Features used by the classifier ⊲ One set of features per indexing policy ψ; we tried 3 of them: numerical entities (ψ ≡ num), hapax words (ψ ≡ hap), punctuation marks (ψ ≡ punc): .!?():
     Example with ψ ≡ num + punc:
     English text: The Legislative Assembly convened at 3.30 pm . Mr. Quirke ( Clerk-Designate ) : THURSDAY , APRIL 1 , 1999
       num: < 3.30 ; 1 ; 1999 >    punc: < . ; ( ; ) ; : ; , ; , >
     Translated text: sitamiq , ipuru 1 , 1999 maligaliurvik matuiqtaulauqtuq 3 : 30mi unnusakkut mista kuak ( titiraqti - tikkuaqtausimajuq ) :
       num: < 1 ; 1999 ; 3 ; 30 >    punc: < , ; , ; : ; ( ; ) ; : >

  10. Features used by the classifier (continued): for the English text and its translation shown on the previous slide, with ψ ≡ num + punc, the feature vector is σ ≡ [0.75, 7, 1, 0.83, 12, 0], that is, (σ1, σ2, σ3) for num followed by (σ1, σ2, σ3) for punc.
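
The σ1 and σ2 values of this vector can be checked by hand or with a short Python sketch. The extraction rules below are assumptions chosen to reproduce the sequences shown on the slide (digits matched by a regular expression, punctuation taken from whitespace-separated tokens, and a comma added to the punctuation set because commas appear in the slide's sequences). σ3 is left out because it depends on the other candidate targets, which the slide does not show.

```python
import re

# Assumption: the slide's set .!?(): extended with the comma.
PUNC = set(".!?():,")


def num_seq(text):
    """Numerical entities in order of appearance."""
    return re.findall(r"\d+(?:\.\d+)?", text)


def punc_seq(text):
    """Punctuation marks that stand alone as tokens, in order of appearance."""
    return [tok for tok in text.split() if tok in PUNC]


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


english = ("The Legislative Assembly convened at 3.30 pm . Mr. Quirke "
           "( Clerk-Designate ) : THURSDAY , APRIL 1 , 1999")
translated = ("sitamiq , ipuru 1 , 1999 maligaliurvik matuiqtaulauqtuq 3 : 30mi "
              "unnusakkut mista kuak ( titiraqti - tikkuaqtausimajuq ) :")

for name, psi in (("num", num_seq), ("punc", punc_seq)):
    src, tgt = psi(english), psi(translated)
    sigma1 = edit_distance(src, tgt) / max(len(src), len(tgt))
    sigma2 = len(src) + len(tgt)
    print(name, round(sigma1, 2), sigma2)
# prints:
# num 0.75 7
# punc 0.83 12
```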

  11. Post-Filtering. [Figure: bipartite graph linking source documents s1 ... sn to target documents t1 ... tm, with a classifier score on each link (0.6, 0.7, 0.7, 0.6, 0.8).]

  12. Post-Filtering, nop. [Figure: same scored graph.] Pairs listed: (s1, t2), (s1, tm), (s2, t1), (s3, t3), (sn, t2).

  13. Post-Filtering, dup. [Figure: same scored graph.] Pairs listed: (s1, t2), (s1, tm), (s2, t1), (s3, t3), (sn, t2).
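
The slides name the two filters without defining them. The Python sketch below shows one plausible reading only: nop keeps every pair proposed by the classifier, and dup is assumed to discard any pair whose source or target document appears in more than one pair. That definition of dup is an assumption, not something the slides confirm.

```python
from collections import Counter


def nop(pairs):
    """Keep every proposed pair unchanged."""
    return list(pairs)


def dup(pairs):
    """Assumed behaviour: drop pairs that share a document with another pair."""
    src_count = Counter(s for s, _ in pairs)
    tgt_count = Counter(t for _, t in pairs)
    return [(s, t) for s, t in pairs
            if src_count[s] == 1 and tgt_count[t] == 1]


# The pairs listed on the slide.
pairs = [("s1", "t2"), ("s1", "tm"), ("s2", "t1"), ("s3", "t3"), ("sn", "t2")]
print(nop(pairs))
print(dup(pairs))
```

Under this reading, s1 and t2 each occur twice, so only (s2, t1) and (s3, t3) survive the dup filter.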

  14. Outline (Experiments section): 1. Objectives; 2. Paradocs (Indexing, Pairing, Post-Filtering); 3. Experiments (Europarl, Wikipedia); 4. Discussion.

  15. Setting: Europarl V5 (split into documents of different sizes). Search engine: Lucene (N = 20). Learner: Weka (5-fold cross-validation). ⊲ Europarl is organized into bitexts → controlled experiment.

  16. Europarl: we varied several experimental conditions:
     languages (10): en-da, -de, -el, -es, -fi, -fr, -it, -nl, -pt and -sv
     document length (7): 10, 20, 30, 50, 70, 100 and 1,000 sentences
     indexing strategy (2): num and hap
     entities (4): hap, num, num + hap and num + punc
     classifier (4): logit, ada, bayes and j48
     post-filtering (2): nop, dup
     ⇒ 4,480 experiments
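
The total of 4,480 is the size of the full cross-product of the six conditions, which a few lines of Python can confirm. The variable names are illustrative only.

```python
from itertools import product

languages = ["en-da", "en-de", "en-el", "en-es", "en-fi",
             "en-fr", "en-it", "en-nl", "en-pt", "en-sv"]
doc_lengths = [10, 20, 30, 50, 70, 100, 1000]   # sentences per document
indexing = ["num", "hap"]
entities = ["hap", "num", "num+hap", "num+punc"]
classifiers = ["logit", "ada", "bayes", "j48"]
filters = ["nop", "dup"]

grid = list(product(languages, doc_lengths, indexing,
                    entities, classifiers, filters))
runs = len(languages) * len(doc_lengths)   # language pair x document length

print(len(grid))   # 4480
print(runs)        # 70
```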

  17. Search error. nodoc: no (English) document returned (source = Dutch); nogood: no good document returned. [Plot: error rate (%) for nodoc and for nodoc + nogood as a function of the number of sentences per document (8 to 1024).]

  18. Which configuration is the best? 70 runs (varying the language pairs and the document length), counting the number of winners (in terms of f-measure):
     Nb   ψ     σ          Class.   Filter
     31   num   num+hap    logit    dup
     31   num   num        logit    dup
     31   num   num+punc   logit    dup
     24   num   num+hap    j48      dup
     24   num   num        j48      dup
     24   num   num+punc   j48      dup
     ...
     1    num   num+hap    j48      nop

  19. Which configuration is the best? (same table as on the previous slide) ⊲ logit > j48 > ...

  20. Which configuration is the best? (same table as on the previous slide) ⊲ num ≃ num+hap ≃ num+punc
