SLIDE 1

Hybrid Example-Based – Rule-Based MT: Feeding Apertium with Bilingual Chunks

Felipe Sánchez-Martínez

Dept. Llenguatges i Sistemes Informàtics
Universitat d’Alacant
E-03071 Alacant, Spain
fsanchez@dlsi.ua.es

Work done in collaboration with Andy Way (DCU) and Mikel L. Forcada (UA) at the Centre for Next Generation Localisation – DCU

8th July 2009

Felipe Sánchez-Martínez (Univ. d’Alacant) Hybrid MT: Feeding Apertium with chunks 8th July 2009 1 / 27

SLIDE 2

Outline

1. Motivation & goal
2. The Apertium free/open-source MT platform
     • Apertium rule-based MT engine
     • Apertium: example of translation
3. Integration of bilingual chunks into Apertium
     • Considerations
     • Translation approach
     • Computation of the best coverage
4. Experiments
     • Experimental setup
     • Results: marker-based chunks
     • Results: tree-based chunks
5. Discussion

SLIDE 3

Motivation & goal

Motivation:
  • Rule-based machine translation (RBMT) systems do not usually benefit from the post-editing effort of professional translators
  • Some RBMT systems may benefit from the translation units found in translation memories (usually whole sentences)

Goal:
  • To integrate sub-sentential translation units into the Apertium free/open-source MT platform
  • To test the approach with bilingual chunks automatically obtained using the example-based methods implemented in Matrex

SLIDE 4

Apertium rule-based MT engine

source text
  → de-formatter
  → morph. analyser
  → PoS tagger
  → structural transfer (↔ lexical transfer)
  → morph. generator
  → post-generator
  → re-formatter
→ target text

SLIDE 5

Apertium: Example of execution /1

Source text:
Francis’ <strong>car</strong> is broken

De-formatter:
Francis’[ <strong>]car[</strong> ]is broken

Morphological analyser:
^Francis’/Francis<np><ant><m><sg>+’s<gen>$[ <strong>]^car/car<n><sg>$[</strong> ]^is/be<vbser><pri><p3><sg>$ ^broken/break<vblex><pp>$

Part-of-speech tagger:
^Francis<np><ant><m><sg>$ ^’s<gen>$[ <strong>]^car<n><sg>$[</strong> ]^be<vbser><pri><p3><sg>$ ^break<vblex><pp>$
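The analyser output above follows a simple pattern: each lexical unit is written as ^surface/lemma<tags>$, and format information travels in square brackets between units. A minimal Python sketch of reading that pattern is given here. The function name parse_stream is hypothetical, and the sketch assumes a single analysis per unit and no escaped characters, so it is an illustration rather than Apertium's own parser.

```python
import re

# One lexical unit: ^surface/analysis$ (single analysis, no escaping handled).
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")

def parse_stream(stream):
    """Return a list of (surface, lemma, tags) triples found in `stream`."""
    units = []
    for surface, analysis in LEXICAL_UNIT.findall(stream):
        # An analysis may join several words with '+', as in Francis + 's.
        # Splitting on '+' is naive: a lemma containing '+' would break it.
        for part in analysis.split("+"):
            lemma = part.split("<", 1)[0]
            units.append((surface, lemma, TAG.findall(part)))
    return units

example = "^car/car<n><sg>$[</strong> ]^is/be<vbser><pri><p3><sg>$ ^broken/break<vblex><pp>$"
for unit in parse_stream(example):
    print(unit)
```

The bracketed block [</strong> ] is skipped because no lexical unit starts inside it, which mirrors how format information passes through the pipeline untouched.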

SLIDE 6

Apertium: Example of execution /2

Structural transfer (prechunk) + Lexical transfer:
^nom<SN><UNDET><m><sg>{^Francis<np><ant><3><4>$}$ ^pr<GEN>{}$[ <strong>]^nom<SN><UNDET><m><sg>{^coche<n><3><4>$}$[</strong> ]^be pp<Vcop><vblex><pri><p3><sg><GD>{^estar<vblex><3><4><5>$ ^romper<vblex><pp><6><5>$}$

Structural transfer (interchunk):
[<strong>]^nom<SN><PDET><m><sg>{^coche<n><3><4>$}$[</strong> ]^pr<PREP>{^de<pr>$}$ ^nom<SN><PDET><m><sg>{^Francis<np><ant><3><4>$}$ ^be pp<Vcop><vblex><pri><p3><sg><m>{^estar<vblex><3><4><5>$ ^romper<vblex><pp><6><5>$}$

SLIDE 7

Apertium: Example of execution /3

Structural transfer (postchunk):
[<strong>]^el<det><def><m><sg>$ ^coche<n><m><sg>$[</strong>] ^de<pr>$ ^Francis<np><ant><m><sg>$ ^estar<vblex><pri><p3><sg>$ ^romper<vblex><pp><m><sg>$

Morphological generator and post-generator:
[<strong>]el coche[</strong> ]de Francis está roto

Re-formatter:
<strong>el coche</strong> de Francis está roto

Target text:
<strong>el coche</strong> de Francis está roto

SLIDE 8

Considerations

To take into account:
  • Do not break the application of structural transfer rules
  • Use the longest possible chunks

How can the application of rules be preserved?
  • By introducing chunk delimiters as format information:
    . . . is [BCH 12 0]the chunk detected[ECH 12 0] by . . .
  • Chunks can then be recognised after the translation:
    . . . es [BCH 12 0]el segmento detectado[ECH 12 0] por . . .

Known problems:
  • As a result of the structural transfer rules, format information may be moved around
  • Some rules also delete format information (known bug)
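The delimiter idea can be sketched in a few lines of Python. The helper names mark_chunk and find_chunks are hypothetical, and reading the two numbers inside [BCH 12 0] as a chunk identifier followed by an index is an assumption made for illustration only; the slides do not define them.

```python
import re

# A well-formed pair: [BCH id n] ... [ECH id n] with matching numbers.
DELIMITED = re.compile(r"\[BCH (\d+) (\d+)\](.*?)\[ECH \1 \2\]")

def mark_chunk(words, start, end, chunk_id, index=0):
    """Return the sentence with words[start:end] wrapped in chunk delimiters."""
    inner = " ".join(words[start:end])
    marked = f"[BCH {chunk_id} {index}]{inner}[ECH {chunk_id} {index}]"
    return " ".join(words[:start] + [marked] + words[end:])

def find_chunks(translated):
    """Return (chunk_id, index, text) for every well-formed delimiter pair."""
    return [(int(c), int(i), text) for c, i, text in DELIMITED.findall(translated)]

source = mark_chunk("this is the chunk detected by the system".split(), 2, 5, 12)
print(source)
print(find_chunks("esto es [BCH 12 0]el segmento detectado[ECH 12 0] por el sistema"))
```

If a transfer rule deletes one of the delimiters, find_chunks returns nothing for that chunk, which corresponds to the known problem listed above: the chunk simply cannot be applied.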

SLIDE 9

Translation approach

1. Apply a dynamic-programming algorithm to compute the best coverage of the input sentence
2. Translate the input sentence as usual with Apertium
3. Use a language model to choose one of the possible translations for each of the bilingual chunks detected
     • One source-language chunk may have different target-language translations
     • The Apertium translation is also considered
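Step 3 can be illustrated with a toy example. The sketch below uses a tiny add-one-smoothed bigram model as a stand-in for a real language model; it is not the 5-gram SRILM model used in the experiments, and the names train_bigram_lm and choose_translation, as well as the two training sentences, are hypothetical.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Return a function giving the add-one-smoothed bigram log-probability."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # counts of context words
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = len(set(unigrams) | {"</s>"})

    def logprob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(tokens, tokens[1:]))
    return logprob

def choose_translation(left, candidates, right, logprob):
    """Pick the candidate that makes the whole sentence most likely."""
    return max(candidates, key=lambda c: logprob(" ".join([left, c, right])))

lm = train_bigram_lm(["there is one thing that joins us", "one thing is clear"])
# "a thing" plays the role of the Apertium translation, "one thing" of a chunk.
best = choose_translation("but there is", ["a thing", "one thing"], "that joins us", lm)
print(best)
```

Scoring the full sentence rather than the candidate alone lets the surrounding Apertium output influence the choice.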

SLIDE 10

Computation of the best coverage: data structure

Source-language chunks are stored in a trie of strings

[Figure: trie of source-language chunks; the visible labels include "the", "session", "adjourned", "interest", "shown" and "with", together with the numbers 1 to 5]

The trie allows the best coverage to be computed in O(l) time, where l is the length of the input sentence
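A word-level trie of this kind can be sketched with nested dictionaries. The names build_trie and longest_match are hypothetical, and the chunk list is made up from the words visible in the figure; the sketch only shows how shared prefixes are stored once and walked word by word.

```python
END = None  # key marking "a chunk ends here"; cannot clash with a word

def build_trie(chunks):
    """Build a nested-dict trie keyed by words."""
    root = {}
    for chunk in chunks:
        node = root
        for word in chunk.split():
            node = node.setdefault(word, {})
        node[END] = chunk
    return root

def longest_match(trie, words, start):
    """Length in words of the longest chunk starting at words[start], or 0."""
    node, best = trie, 0
    for offset, word in enumerate(words[start:], 1):
        if word not in node:
            break
        node = node[word]
        if END in node:
            best = offset
    return best

trie = build_trie(["the session", "the session adjourned",
                   "with interest", "interest shown"])
words = "the session adjourned with interest".split()
print(longest_match(trie, words, 0))
print(longest_match(trie, words, 3))
```

Each lookup follows one edge per input word, so the cost of a match depends on the chunk length and not on how many chunks are stored.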

SLIDE 11

Computation of the best coverage: algorithm

[Figure: trie traversal over the input "... the session adjourned with the interest ..."]

  • A set of alive states in the trie is maintained to compute all the possible ways to cover the input sentence
  • A new search is started at every word
  • At each position, the best coverage up to that position is stored
  • The algorithm is applied to text segments shorter than sentences
      • The best coverage can be retrieved when there are no more alive states

SLIDE 12

Computation of the best coverage

The best coverage is the one that uses the fewest chunks:
  • the longest possible chunks are preferred
  • each uncovered word counts as one chunk
  • if two coverages use the same number of chunks, the one that uses the most frequent chunks is chosen
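The selection criteria above can be written as a small dynamic program over sentence positions. This is a simplified sketch and not the alive-state trie traversal described in the slides: it looks chunks up in a dictionary and tries every chunk length up to max_len at each position. The function name best_coverage, the max_len parameter, the frequencies in the example, and the reading of "most frequent chunks" as the highest sum of chunk frequencies are all assumptions.

```python
def best_coverage(words, chunk_freq, max_len=5):
    """Cover `words` with the fewest units; ties go to the higher total frequency.

    chunk_freq maps a tuple of words to a frequency. Returns a list of
    (text, is_chunk) segments; an uncovered word is a unit with is_chunk=False.
    """
    # best[i] = (units used, -total chunk frequency, segments) for words[:i]
    best = [(0, 0, [])]
    for end in range(1, len(words) + 1):
        units, neg_freq, segs = best[end - 1]
        # Option 1: leave words[end - 1] uncovered (it counts as one unit).
        options = [(units + 1, neg_freq, segs + [(words[end - 1], False)])]
        # Option 2: end a known chunk at this position.
        for start in range(max(0, end - max_len), end):
            chunk = tuple(words[start:end])
            if chunk in chunk_freq:
                p_units, p_neg_freq, p_segs = best[start]
                options.append((p_units + 1, p_neg_freq - chunk_freq[chunk],
                                p_segs + [(" ".join(chunk), True)]))
        best.append(min(options, key=lambda o: o[:2]))
    return best[-1][2]

# Hypothetical chunk frequencies, for illustration only.
chunks = {("the", "session"): 40, ("the", "session", "adjourned"): 12,
          ("with", "the"): 90, ("the", "interest"): 30,
          ("with", "the", "interest"): 8}
print(best_coverage("the session adjourned with the interest".split(), chunks))
```

Minimising the number of units automatically favours longer chunks, since a longer chunk replaces several shorter units with one.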

SLIDE 13

Experimental setup /1

Data used:
  • Corpora distributed for the WMT 09 Workshop for MT
  • Language pairs: Spanish–English (es-en), English–Spanish (en-es)
  • Linguistic data: apertium-en-es, SVN revision 9284

Software used:
  • Apertium
  • GIZA++ and Moses to calculate word alignments and lexical probabilities
  • SRILM to train 5-gram language models
  • Matrex to segment training corpora and to align chunks

SLIDE 14

Experimental setup /2

Training corpus:
  • Max. sentence length: 45 words
  • Max. word ratio: 1.5 (mean ratio + std. dev.)
  • # sent: 1,187,905; # en words: 26,983,025; # es words: 27,951,388

Development corpus:
  • # sent: 2,050; # en words: 49,884; # es words: 52,719

Test corpus:
  • # sent: 3,027; # en words: 77,438; # es words: 80,580
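The two training-corpus filters can be expressed as a single predicate over a sentence pair. In this sketch the function name keep_pair is hypothetical, tokens are obtained by whitespace splitting, and the word ratio is taken as the longer sentence divided by the shorter one; the slides do not say how the ratio was oriented, so that part is an assumption.

```python
def keep_pair(src, tgt, max_len=45, max_ratio=1.5):
    """Return True if the sentence pair passes the length and ratio filters."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if n_src > max_len or n_tgt > max_len:
        return False
    # Assumed symmetric ratio: longer side over shorter side.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

pairs = [("la sesión se levanta", "the session is adjourned"),
         ("sí", "yes , of course it is")]
print([keep_pair(s, t) for s, t in pairs])
```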

SLIDE 15

Experimental setup /3

Methods used to extract bilingual chunks:
  • Marker-based bilingual chunks (using Matrex)
  • Parse-tree-based bilingual chunks (thanks to John Tinsley)
      • Preliminary results using chunks previously computed from an old version of the Europarl parallel corpus

SLIDE 16

Results: marker-based chunks

Bilingual chunks filtering:
  • There must be at least one word aligned on each side
  • Chunks not seen at least N times are discarded
      • Tested values for N: 5 . . . 80
  • Chunks containing punctuation marks and numbers are discarded
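The three filters can be combined into one predicate. This is a sketch under stated assumptions: the names keep_chunk and is_punct_or_number are hypothetical, alignments are given as a set of (source index, target index) links, a chunk is discarded if any token is a punctuation mark or contains a digit, and the punctuation set is ASCII plus a few Spanish marks.

```python
import string

PUNCTUATION = set(string.punctuation) | set("¿¡«»")

def is_punct_or_number(token):
    """True for tokens made only of punctuation, or containing any digit."""
    return all(ch in PUNCTUATION for ch in token) or any(ch.isdigit() for ch in token)

def keep_chunk(src, tgt, alignments, count, min_count):
    """src, tgt: token lists; alignments: set of (src_index, tgt_index) links."""
    if count < min_count:          # frequency cut-off N
        return False
    if not alignments:             # one link aligns a word on each side
        return False
    return not any(is_punct_or_number(tok) for tok in src + tgt)

print(keep_chunk(["a", "cambio", "de"], ["in", "exchange", "for"], {(1, 1)}, 30, 28))
print(keep_chunk(["en", "2009"], ["in", "2009"], {(0, 0)}, 100, 28))
```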

SLIDE 17

Results: marker-based chunks, Spanish→English /1

Development corpus (BLEU):

[Plot: BLEU (y-axis, 17.7 to 17.9) against the frequency cut-off (x-axis, 10 to 80) for Apertium and Apertium+chunks]

SLIDE 18

Results: marker-based chunks, Spanish→English /2

Test corpus (BLEU):
  • Apertium+chunks: 19.14
  • Apertium: 18.81

  • # of chunks: 6,600 (freq. cut-off: 28)
  • # of applications: 6,321
      • Words covered by chunks: ≈ 17%
  • # of applications not using the Apertium translation: 2,559 (40%)
      • Words really covered by chunks: ≈ 6%

SLIDE 19

Results: marker-based chunks, English→Spanish /1

Development corpus (BLEU):

[Plot: BLEU (y-axis, 17.1 to 17.4) against the frequency cut-off (x-axis, 10 to 80) for Apertium and Apertium+chunks]

SLIDE 20

Results: marker-based chunks, English→Spanish /2

Test corpus (BLEU):
  • Apertium+chunks: 18.94
  • Apertium: 18.51

  • # of chunks: 16,395 (freq. cut-off: 11)
  • # of applications: 6,812
      • Words covered by chunks: ≈ 18%
  • # of applications not using the Apertium translation: 2,884 (42%)
      • Words really covered by chunks: ≈ 8%

SLIDE 21

Some Spanish→English examples (marker-based chunks) /1

(S: source; R: reference; A: Apertium; A+C: Apertium+chunks)

S: desde hace muchos años un fenómeno misterioso ...
R: for years , a mysterious phenomenon ...
A: from does a lot of years a mysterious phenomenon ...
A+C: for many years a mysterious phenomenon ...

S: olmert devolvería casi todas las zonas ocupadas a cambio de la paz
R: olmert would return ... territories in exchange for peace
A: olmert it would give back ... zones to change of the peace
A+C: olmert it would give back ... zones in exchange for peace

S: pero hay una cosa que nos une :
R: but there is one thing that connects us :
A: but there is a thing that joins us :
A+C: but there is one thing that joins us :

SLIDE 22

Results: tree-based chunks

Bilingual chunks filtering:
  • There must be at least one target word aligned with a source word with p(target|source) > 0.01
  • Chunks not seen at least N times are discarded
      • Tested values for N: 5 . . . 120
  • Chunks containing punctuation marks and numbers are discarded
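The lexical-probability condition can be checked with a one-line test over the alignment links of a chunk pair. The function name has_confident_link and the data layout (a dictionary keyed by target word and source word) are hypothetical, and the probabilities in the example are toy values chosen for illustration, not figures from the experiments.

```python
def has_confident_link(src, tgt, links, lex_prob, threshold=0.01):
    """True if some aligned word pair has p(target|source) above the threshold.

    links: iterable of (src_index, tgt_index) pairs;
    lex_prob[(tgt_word, src_word)] holds p(target|source).
    """
    return any(lex_prob.get((tgt[j], src[i]), 0.0) > threshold for i, j in links)

# Toy probabilities, for illustration only.
lex_prob = {("peace", "paz"): 0.8, ("the", "paz"): 0.005}
src, tgt = ["la", "paz"], ["the", "peace"]
print(has_confident_link(src, tgt, [(1, 1)], lex_prob))
print(has_confident_link(src, tgt, [(1, 0)], lex_prob))
```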

SLIDE 23

Results: tree-based chunks, Spanish→English /1

Development corpus (BLEU):

[Plot: BLEU (y-axis, 17.7 to 17.9) against the frequency cut-off (x-axis, 20 to 120) for Apertium and Apertium+chunks]

SLIDE 24

Results: tree-based chunks, Spanish→English /2

Test corpus (BLEU):
  • Apertium+chunks: 19.11
  • Apertium: 18.81

  • # of chunks: 7,466 (freq. cut-off: 74)
  • # of applications: 4,650
      • Words covered by chunks: ≈ 12%
  • # of applications not using the Apertium translation: 3,075 (66%)
      • Words really covered by chunks: ≈ 4%

SLIDE 25

Discussion /1

Small improvement on both the development set and the test set
  • Larger improvement on the test set

Marker-based chunks (BLEU improvement over Apertium; covered words, with words really covered in parentheses):

pair    set    improvement   covered words
es-en   dev    +0.20         18% (7%)
es-en   test   +0.33         17% (6%)
en-es   dev    +0.31         17% (7%)
en-es   test   +0.43         18% (8%)

Noise is introduced because of how Apertium manages format information:
  • Some chunks are not applied because chunk delimiters are lost
  • Some chunk delimiters are moved, so the sequence of words detected after translation is not correct

SLIDE 26

Discussion /2

Possible improvement for the case in which two coverages use the same number of chunks when computing the best coverage:
  • Use the bilingual chunk that would produce the most likely TL translation instead of the most frequent one
  • How? Using a language model with gaps:
    in the session with the interest .

SLIDE 27

Discussion /3

Future work:
  • Try tree-based chunks obtained from the WMT 09 corpus, using FreeLing to parse Spanish (in both directions)
  • Perform a manual evaluation
  • Combine both Spanish→English and English→Spanish chunks
      • Intersection
      • Union
  • Find out how Matrex is helping Apertium
