Building machine translation systems for language pairs with scarce resources



slide-1
SLIDE 1

Building machine translation systems for language pairs with scarce resources

Víctor M. Sánchez-Cartagena
Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, Alacant, Spain

vmsanchez@dlsi.ua.es

Ph.D. Defence supervised by:

  • Dr. Felipe Sánchez Martínez
  • Dr. Juan Antonio Pérez Ortiz

2nd July 2015, Alacant

slide-2
SLIDE 2

Outline

1 Introduction
2 Inferring shallow-transfer rules from small parallel corpora
3 Integrating shallow-transfer data into statistical machine translation
4 Assisting non-expert users in extending morphological dictionaries
5 Concluding remarks

Víctor M. Sánchez-Cartagena 1/51


slide-4
SLIDE 4

Introduction

Objective: ease the development of machine translation (MT) systems for language pairs with scarce resources

The resources needed by the two main types of MT systems are not available

MT systems addressed:

  • Shallow-transfer rule-based MT (RBMT)
  • Phrase-based statistical MT (SMT)

slide-5
SLIDE 5

Shallow-transfer rule-based MT

Apertium shallow-transfer RBMT platform

Source-language (SL) and target-language (TL) intermediate representations: sequence of lexical forms (lemma, part of speech and inflection information)

Example (Spanish→English): las casas pequeñas

1 Analysis of the SL text into SL lexical forms:

las → el DT-gen:f.num:pl
casas → casa N-gen:f.num:pl
pequeñas → pequeño ADJ-gen:f.num:pl

2 Lexical transfer into TL lexical forms:

el DT-gen:f.num:pl → the DT-gen:ε.num:ε
casa N-gen:f.num:pl → house N-gen:ε.num:pl
pequeño ADJ-gen:f.num:pl → small ADJ-gen:ε.num:ε

3 Structural transfer with the rule "N + ADJ to ADJ + N":

the DT-gen:ε.num:ε house N-gen:ε.num:pl small ADJ-gen:ε.num:ε → the DT-gen:ε.num:ε small ADJ-gen:ε.num:ε house N-gen:ε.num:pl

4 Generation of the TL text:

the DT-gen:ε.num:ε → the
small ADJ-gen:ε.num:ε → small
house N-gen:ε.num:pl → houses
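The stages of the "las casas pequeñas" example can be sketched as a toy pipeline. This is only an illustration and not Apertium's implementation: the dictionaries `ANALYSES`, `BILINGUAL` and `GENERATION`, the tuple layout of a lexical form and the single hard-coded N + ADJ rule are assumptions made for this example.

```python
# Toy shallow-transfer pipeline for "las casas pequeñas" (Spanish -> English).
# A lexical form is (lemma, part of speech, gender, number); "" stands for ε.

ANALYSES = {  # hypothetical monolingual dictionary: surface form -> lexical form
    "las": ("el", "DT", "f", "pl"),
    "casas": ("casa", "N", "f", "pl"),
    "pequeñas": ("pequeño", "ADJ", "f", "pl"),
}

BILINGUAL = {  # hypothetical bilingual dictionary: SL form -> TL form
    ("el", "DT", "f", "pl"): ("the", "DT", "", ""),
    ("casa", "N", "f", "pl"): ("house", "N", "", "pl"),
    ("pequeño", "ADJ", "f", "pl"): ("small", "ADJ", "", ""),
}

GENERATION = {  # hypothetical TL generation dictionary: lexical form -> surface
    ("the", "DT", "", ""): "the",
    ("small", "ADJ", "", ""): "small",
    ("house", "N", "", "pl"): "houses",
}


def structural_transfer(forms):
    """Apply one rule left to right: N + ADJ becomes ADJ + N."""
    out, i = [], 0
    while i < len(forms):
        if i + 1 < len(forms) and forms[i][1] == "N" and forms[i + 1][1] == "ADJ":
            out += [forms[i + 1], forms[i]]
            i += 2
        else:
            out.append(forms[i])
            i += 1
    return out


def translate(sentence):
    sl_forms = [ANALYSES[word] for word in sentence.split()]   # analysis
    tl_forms = [BILINGUAL[form] for form in sl_forms]          # lexical transfer
    reordered = structural_transfer(tl_forms)                  # structural transfer
    return " ".join(GENERATION[form] for form in reordered)    # generation


print(translate("las casas pequeñas"))  # the small houses
```

Every stage is a lookup except the structural transfer, which is the only place where word order changes; that is the part the rules encode.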

slide-11
SLIDE 11

Shallow-transfer rule-based MT

Linguistic resources: a huge human effort is spent on developing the linguistic resources for RBMT systems from scratch

slide-12
SLIDE 12

Phrase-based statistical MT

Translation is the TL sentence with highest probability given the SL sentence according to a combination of statistical models

Example: the small houses

Phrase table:

SL phrase       TL phrase         score
the             el                0.5
the             las               0.2
the small       el                0.05
small houses    casas pequeñas    0.7
small           medianas          0.1
houses          hogar             0.3

Target language model: how likely it is that the translation hypothesis occurs in the target language

Final score: combine translation model score, target language model score, and others

Translation hypotheses:

hypothesis             translation model    target language model    final score
el casas pequeñas      0.35                 0.3                      0.3
el hogar               0.015                0.7                      0.3
las casas pequeñas     0.14                 0.6                      0.45
el medianas hogares    0.015                0.2                      0.2
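The translation-model scores of the hypotheses are consistent with multiplying the scores of the phrase pairs used (for instance 0.5 × 0.7 = 0.35 for "el casas pequeñas"). The sketch below enumerates the hypotheses under that assumption. It only considers monotone segmentations (no reordering), and `final_score` is a generic weighted log-linear combination with hypothetical weights; it is not meant to reproduce the illustrative final scores listed in the example.

```python
import math

# Phrase table from the example: (SL phrase, TL phrase) -> translation score.
PHRASE_TABLE = {
    ("the", "el"): 0.5,
    ("the", "las"): 0.2,
    ("the small", "el"): 0.05,
    ("small houses", "casas pequeñas"): 0.7,
    ("small", "medianas"): 0.1,
    ("houses", "hogar"): 0.3,
}


def hypotheses(words):
    """Enumerate monotone segmentations of `words` into known phrases.

    Returns (translation, translation-model score) pairs, where the score is
    the product of the scores of the phrase pairs used.
    """
    if not words:
        return [("", 1.0)]
    results = []
    for end in range(1, len(words) + 1):
        source = " ".join(words[:end])
        for (sl, tl), score in PHRASE_TABLE.items():
            if sl != source:
                continue
            for rest, rest_score in hypotheses(words[end:]):
                results.append(((tl + " " + rest).strip(), score * rest_score))
    return results


def final_score(tm, lm, tm_weight=1.0, lm_weight=1.0):
    """Weighted log-linear combination; the weights are hypothetical."""
    return math.exp(tm_weight * math.log(tm) + lm_weight * math.log(lm))


tm_scores = dict(hypotheses("the small houses".split()))
for translation, score in sorted(tm_scores.items(), key=lambda item: -item[1]):
    print(f"{translation}: {score:.3f}")
```

In a real system the weights of the combination are tuned on held-out data and further feature functions are added, which is what "and others" refers to.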

slide-18
SLIDE 18

Phrase-based statistical MT

SMT systems can be built automatically as long as corpora are available:

  • Parallel corpora for the translation model
  • Monolingual corpora for the target language model (easier to obtain)

SMT systems usually work with surface forms: data sparseness

Data sparseness: it is difficult to observe in the training corpora all the sequences of inflected forms needed to properly translate the potential input texts

slide-19
SLIDE 19

Means to achieve main goal

Reduce human effort for building RBMT systems:

  • Automatically infer shallow-transfer rules for RBMT
  • Allow non-expert users to insert entries in morphological dictionaries

Mitigate data sparseness in SMT:

  • Integrate shallow-transfer RBMT data into a phrase-based SMT system

slide-20
SLIDE 20

Outline

1 Introduction
2 Inferring shallow-transfer rules from small parallel corpora
3 Integrating shallow-transfer data into statistical machine translation
4 Assisting non-expert users in extending morphological dictionaries
5 Concluding remarks

slide-21
SLIDE 21

Motivation

Goal: new algorithm for the automatic inference of shallow-transfer rules from small parallel corpora and existing dictionaries

Why?

  • Shallow-transfer rules are complex and require deep knowledge of the grammar of the languages involved
  • Infer rules from a very small parallel corpus → speed up creation of new RBMT systems, even without bilingual experts
  • Existing algorithm (Sánchez-Martínez & Forcada, 2009) presents two main limitations:
      • Low generalisation power
      • Poor segmentation of input

slide-22
SLIDE 22

Motivation

Generalisation power: inferred rules should be applied to lemmas/morphological inflection attributes different from those in the parallel corpus

Example

Spanish:

el DT-gen:f.num:pl casa N-gen:f.num:pl pequeño ADJ-gen:f.num:pl

English:

the DT-gen:ε.num:ε small ADJ-gen:ε.num:pl house N-gen:ε.num:pl

  • Sánchez-Martínez & Forcada (2009): swap sequence N-gen:f.num:pl - ADJ-gen:f.num:pl
  • New algorithm: swap N - ADJ (regardless of morphological information) if no contrary evidence is found in other sentences

slide-25
SLIDE 25

Motivation

Segmentation of input

Rules in Apertium are applied in a greedy, left-to-right, longest-match manner: a word is never processed by more than one rule

The inference algorithm should ensure that sequences of words that need to be processed together by a single rule are not processed by different rules

Example:

  • Incorrect segmentation: The DT white ADJ house N and CC the DT red ADJ cars N → La casa blanca y el rojo coches
  • Correct segmentation: The DT white ADJ house N and CC the DT red ADJ cars N → La casa blanca y los coches rojos
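Greedy, left-to-right, longest-match rule application can be sketched as follows. The rule sets are hypothetical: in particular, the five-category rule `DT ADJ N CC DT` is an assumption introduced here to show one way in which a long rule can capture the second determiner and leave "red cars" to be translated word for word.

```python
def segment(tags, rules):
    """Greedy left-to-right longest-match segmentation.

    `tags` is a list of lexical categories and `rules` a set of tuples of
    categories. Returns (start, end) spans with `end` exclusive; a word not
    covered by any rule forms a span of length 1.
    """
    longest = max((len(rule) for rule in rules), default=1)
    spans, i = [], 0
    while i < len(tags):
        # Try the longest candidate first; length 1 always succeeds.
        for length in range(min(longest, len(tags) - i), 0, -1):
            if length == 1 or tuple(tags[i:i + length]) in rules:
                spans.append((i, i + length))
                i += length
                break
    return spans


# The white house and the red cars
tags = ["DT", "ADJ", "N", "CC", "DT", "ADJ", "N"]

good_rules = {("DT", "ADJ", "N")}
bad_rules = good_rules | {("DT", "ADJ", "N", "CC", "DT")}  # hypothetical rule

print(segment(tags, good_rules))  # [(0, 3), (3, 4), (4, 7)]
print(segment(tags, bad_rules))   # [(0, 5), (5, 6), (6, 7)]
```

With the second rule set, "red" and "cars" are never seen together by a rule, so no rule can reorder them or make the determiner agree with the noun.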

slide-26
SLIDE 26

Motivation

The new rule inference approach solves these issues thanks to:

  • A rule formalism with more generalisation power: Generalised Alignment Templates (GATs)
      • Extension of the formalism defined by Sánchez-Martínez & Forcada (2009)
  • A more powerful inference algorithm
      • Prevents overgeneralisation
      • Solves conflicts between rules at a global level
      • Ensures correct segmentation of input

slide-27
SLIDE 27

Generalised alignment templates

A GAT is made of:

  • SL word classes and restrictions: define the SL lexical forms matched
  • TL word classes: define the output

SL word classes: 1 PN 2 POS 3 N-gen:ε.num:*
TL word classes: 1 el DT-gen:$3_t.num:$3_s 2 N-gen:$3_t.num:$3_s 3 de PR 4 PN
Restrictions: r1 = {}, r2 = {}, r3 = {}

New values introduced in word classes to apply the same GAT to lexical forms with different values of morphological inflection attributes:

  • Wildcard values (∗)
  • SL references ($j_s) and TL references ($j_t)

Example of translation with a GAT (English→Spanish)

1 SL text: Victor’s plants
2 SL lexical forms: Victor PN ’s POS plant N-gen:ε.num:pl
3 Translations of the SL lexical forms: 1 Victor PN → Víctor PN; 3 plant N-gen:ε.num:pl → planta N-gen:f.num:pl
4 TL word classes: el DT-gen:$3_t.num:$3_s N-gen:$3_t.num:$3_s de PR PN
5 Lemmas inserted: el DT-gen:$3_t.num:$3_s planta N-gen:$3_t.num:$3_s de PR Víctor PN
6 SL references replaced ($3_s takes the number of the third SL lexical form, pl): el DT-gen:$3_t.num:pl planta N-gen:$3_t.num:pl de PR Víctor PN
7 TL references replaced ($3_t takes the gender of the translation of the third SL lexical form, f): el DT-gen:f.num:pl planta N-gen:f.num:pl de PR Víctor PN
8 TL text: las plantas de Víctor
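The application of the GAT to "Victor’s plants" can be sketched in code. The dictionary-based encoding of word classes, the `source` key that links a TL word class to the SL position whose lemma it translates, and the `BILINGUAL` entries are assumptions made for this illustration; restrictions are omitted because they are empty in the example.

```python
# Lexical form: dict with "lemma", "pos" and inflection attributes ("" = ε).
# GAT from the example: PN POS N -> el DT, N, de PR, PN.
GAT = {
    "sl": [
        {"pos": "PN"},
        {"pos": "POS"},
        {"pos": "N", "gen": "", "num": "*"},
    ],
    # ("s", 3) and ("t", 3) encode the references $3_s and $3_t.
    "tl": [
        {"lemma": "el", "pos": "DT", "gen": ("t", 3), "num": ("s", 3)},
        {"source": 3, "pos": "N", "gen": ("t", 3), "num": ("s", 3)},
        {"lemma": "de", "pos": "PR"},
        {"source": 1, "pos": "PN"},
    ],
}

BILINGUAL = {  # hypothetical bilingual dictionary entries
    ("Victor", "PN"): {"lemma": "Víctor", "pos": "PN"},
    ("plant", "N"): {"lemma": "planta", "pos": "N", "gen": "f"},
}


def matches(word_class, form):
    """True if every attribute of the word class is a wildcard or equal."""
    return all(v == "*" or form.get(k) == v for k, v in word_class.items())


def apply_gat(gat, sl_forms):
    """Return the TL lexical forms, or None if the GAT does not match."""
    if len(sl_forms) != len(gat["sl"]) or not all(
        matches(wc, form) for wc, form in zip(gat["sl"], sl_forms)
    ):
        return None
    # Translate each SL lexical form with the bilingual dictionary.
    translated = [BILINGUAL.get((f["lemma"], f["pos"]), {}) for f in sl_forms]
    output = []
    for word_class in gat["tl"]:
        form = {}
        for key, value in word_class.items():
            if key == "source":            # lemma comes from the dictionary
                form["lemma"] = translated[value - 1]["lemma"]
            elif isinstance(value, tuple):  # SL or TL reference
                side, j = value
                pool = sl_forms if side == "s" else translated
                form[key] = pool[j - 1][key]
            else:                           # literal value
                form[key] = value
        output.append(form)
    return output


sl_forms = [
    {"lemma": "Victor", "pos": "PN"},
    {"lemma": "'s", "pos": "POS"},
    {"lemma": "plant", "pos": "N", "gen": "", "num": "pl"},
]
for form in apply_gat(GAT, sl_forms):
    print(form)
```

Because the number is a wildcard on the SL side and a reference on the TL side, the same GAT would also translate a singular noun without any change.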

slide-39
SLIDE 39

Inference of shallow-transfer rules

Method overview


slide-40
SLIDE 40

Inference of shallow-transfer rules

1. Bilingual phrase extraction

slide-41
SLIDE 41

Inference of shallow-transfer rules

1. Bilingual phrase extraction

1 Analyse SL and TL sides of the parallel corpus
2 Compute statistical word alignments as in SMT
3 Extract bilingual phrases compatible with the alignments (in a similar way to SMT)

Example:

English: There were white houses
Spanish: Había casas blancas

English: there ADV be VERB-t:past white ADJ-gen:ε.num:ε house N-gen:ε.num:pl
Spanish: haber VERB-t:past.p:3.num:sg casa N-gen:f.num:pl blanco ADJ-gen:f.num:pl

[Figure: word alignments between the two sequences of lexical forms and the bilingual phrases extracted from them]
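Step 3 can be sketched with the usual consistency criterion from phrase-based SMT: a pair of spans is kept when no word inside it is aligned to a word outside it. The word alignment below is an assumption, since the alignment links of the example are not recoverable from the text, and the sketch does not extend phrases with unaligned words.

```python
def extract_phrases(sl_len, tl_len, alignment, max_len=4):
    """Return ((s_start, s_end), (t_start, t_end)) spans, end exclusive,
    that are consistent with the word alignment.

    `alignment` is a collection of (SL index, TL index) links.
    """
    pairs = []
    for s_start in range(sl_len):
        for s_end in range(s_start + 1, min(sl_len, s_start + max_len) + 1):
            # TL positions linked to the SL span.
            linked = [t for s, t in alignment if s_start <= s < s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked) + 1
            if t_end - t_start > max_len or t_end > tl_len:
                continue
            # Reject the pair if a TL word in the span links outside the SL span.
            if all(s_start <= s < s_end
                   for s, t in alignment if t_start <= t < t_end):
                pairs.append(((s_start, s_end), (t_start, t_end)))
    return pairs


sl = ["there ADV", "be VERB", "white ADJ", "house N"]
tl = ["haber VERB", "casa N", "blanco ADJ"]
# Assumed links: there/be - haber, white - blanco, house - casa.
alignment = [(0, 0), (1, 0), (2, 2), (3, 1)]

for (s0, s1), (t0, t1) in extract_phrases(len(sl), len(tl), alignment):
    print(" ".join(sl[s0:s1]), "|", " ".join(tl[t0:t1]))
```

Under this alignment the pair "white ADJ house N | casa N blanco ADJ" is among the extracted phrases, while "there ADV" alone is not, because "haber" is also linked to "be".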

slide-48
SLIDE 48

Inference of shallow-transfer rules

2. Generation of GATs

slide-49
SLIDE 49

Inference of shallow-transfer rules

2. Generation of GATs

Strategy: from each bilingual phrase, generate as many GATs as possible as long as they correctly reproduce the original bilingual phrase

Transformations applied:

1 Generate a very specific GAT (only matches the initial bilingual phrase)
2 Lexical generalisation
3 Morphological generalisation

1 Generate a very specific GAT

Bilingual phrase (English–Spanish):

white ADJ-gen:ε.num:ε house N-gen:ε.num:pl
casa N-gen:f.num:pl blanco ADJ-gen:f.num:pl

GAT:

SL: 1 white ADJ-gen:ε.num:ε 2 house N-gen:ε.num:pl
TL: 1 casa N-gen:f.num:pl 2 blanco ADJ-gen:f.num:pl
r1 = {gen : ε, num : ε}, r2 = {gen : f, num : pl}

2 Lexical generalisation

Remove lemmas from word classes if they can be obtained from the bilingual dictionary

Starting GAT:

SL: 1 white ADJ-gen:ε.num:ε 2 house N-gen:ε.num:pl
TL: 1 casa N-gen:f.num:pl 2 blanco ADJ-gen:f.num:pl
r1 = {gen : ε, num : ε}, r2 = {gen : f, num : pl}

Lemmas of the adjective removed:

SL: 1 ADJ-gen:ε.num:ε 2 house N-gen:ε.num:pl
TL: 1 casa N-gen:f.num:pl 2 ADJ-gen:f.num:pl
r1 = {gen : ε, num : ε}, r2 = {gen : f, num : pl}

Lemmas of the noun removed:

SL: 1 white ADJ-gen:ε.num:ε 2 N-gen:ε.num:pl
TL: 1 N-gen:f.num:pl 2 blanco ADJ-gen:f.num:pl
r1 = {gen : ε, num : ε}, r2 = {gen : f, num : pl}

All lemmas removed:

SL: 1 ADJ-gen:ε.num:ε 2 N-gen:ε.num:pl
TL: 1 N-gen:f.num:pl 2 ADJ-gen:f.num:pl
r1 = {gen : ε, num : ε}, r2 = {gen : f, num : pl}

3 Morphological generalisation

  • Detect morphological inflection attributes whose value can be obtained with references ($j_s or $j_t) in the TL attributes
  • Add wildcards (∗) in the SL attributes
  • Remove restrictions

Starting GAT:

SL: 1 ADJ-gen:ε.num:ε 2 N-gen:ε.num:pl
TL: 1 N-gen:f.num:pl 2 ADJ-gen:f.num:pl
r1 = {gen : ε, num : ε}, r2 = {gen : f, num : pl}

Gender generalised:

SL: 1 ADJ-gen:*.num:ε 2 N-gen:*.num:pl
TL: 1 N-gen:$2_t.num:pl 2 ADJ-gen:$2_t.num:pl
r1 = {num : ε}, r2 = {num : pl}

Number generalised:

SL: 1 ADJ-gen:ε.num:* 2 N-gen:ε.num:*
TL: 1 N-gen:f.num:$2_s 2 ADJ-gen:f.num:$2_s
r1 = {gen : ε}, r2 = {gen : f}

Gender and number generalised:

SL: 1 ADJ-gen:*.num:* 2 N-gen:*.num:*
TL: 1 N-gen:$2_t.num:$2_s 2 ADJ-gen:$2_t.num:$2_s
r1 = {}, r2 = {}
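The lexical generalisation step for the "white house" phrase can be sketched as an enumeration of the subsets of aligned word pairs whose lemmas the bilingual dictionary can recover. The string encoding of word classes, the alignment and the `BILINGUAL` dictionary are assumptions made for this illustration; restrictions and morphological generalisation are left out.

```python
from itertools import combinations

# Bilingual phrase from the example, as (lemma, tags) pairs.
SL = [("white", "ADJ-gen:ε.num:ε"), ("house", "N-gen:ε.num:pl")]
TL = [("casa", "N-gen:f.num:pl"), ("blanco", "ADJ-gen:f.num:pl")]
ALIGNMENT = [(0, 1), (1, 0)]  # (SL index, TL index); assumed from the example
BILINGUAL = {"white": "blanco", "house": "casa"}  # hypothetical dictionary


def lexical_generalisations(sl, tl, alignment, bilingual):
    """Yield (SL word classes, TL word classes) with lemmas removed for every
    subset of aligned word pairs that the bilingual dictionary reproduces."""
    removable = [(s, t) for s, t in alignment
                 if bilingual.get(sl[s][0]) == tl[t][0]]
    for size in range(len(removable) + 1):
        for subset in combinations(removable, size):
            drop_sl = {s for s, _ in subset}
            drop_tl = {t for _, t in subset}
            sl_classes = [tags if i in drop_sl else f"{lemma} {tags}"
                          for i, (lemma, tags) in enumerate(sl)]
            tl_classes = [tags if i in drop_tl else f"{lemma} {tags}"
                          for i, (lemma, tags) in enumerate(tl)]
            yield sl_classes, tl_classes


for sl_classes, tl_classes in lexical_generalisations(SL, TL, ALIGNMENT, BILINGUAL):
    print(" ".join(sl_classes), "|", " ".join(tl_classes))
```

For this phrase the enumeration yields four GATs: the fully lexicalised one, one per removed lemma pair, and the one without lemmas.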

slide-60
SLIDE 60

Inference of shallow-transfer rules

3. Choosing the most appropriate GATs

slide-61
SLIDE 61

Inference of shallow-transfer rules

3. Choosing the most appropriate GATs

Strategy: select the minimum set of GATs needed to reproduce all the bilingual phrases extracted from the parallel corpus

  • An appropriate level of generalisation is found
  • The more general the GATs → the fewer GATs are needed to reproduce the bilingual phrases
  • In case of conflicts, more specific GATs are used
  • First approach in which conflicts are solved at a global level

Result: hierarchy in which specific GATs correct the errors of more general ones

slide-62
SLIDE 62

Inference of shallow-transfer rules

Implementation:

  • NP-hard problem similar to the already studied set cover problem (Garey and Johnson, 1979)
  • Split into independent subproblems, one for each sequence of SL lexical categories
  • Rewrite as an integer linear programming problem: optimisation of a function subject to restrictions encoded as a set of linear inequalities
  • Solve with branch and cut (Xu et al., 2009)
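The set-cover view of the selection problem can be illustrated with a tiny exact search. This is not the integer linear programming formulation solved with branch and cut: it is an exhaustive search swapped in for readability, usable only on toy instances, and it ignores the handling of conflicts between general and specific GATs. The GAT names and the `reproduces` mapping are hypothetical.

```python
from itertools import combinations


def minimum_cover(phrases, reproduces):
    """Return a smallest set of GAT names that reproduces every phrase.

    `reproduces` maps a GAT name to the set of bilingual phrases it
    reproduces correctly. Exhaustive search: exponential in the number
    of GATs, so only suitable for toy instances.
    """
    names = sorted(reproduces)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            covered = set().union(*(reproduces[name] for name in subset))
            if covered >= set(phrases):
                return set(subset)
    return None  # some phrase is not reproduced by any GAT


phrases = {"p1", "p2", "p3"}
reproduces = {
    "general": {"p1", "p2"},   # general GAT, wrong for p3
    "specific_1": {"p1"},
    "specific_2": {"p2"},
    "specific_3": {"p3"},
}
print(minimum_cover(phrases, reproduces))
```

Here the general GAT replaces two specific ones, and the specific GAT for `p3` is kept to cover the phrase the general one gets wrong, which mirrors the hierarchy described in the selection strategy.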

slide-63
SLIDE 63

Inference of shallow-transfer rules

4. Optimising rules for segmentation

slide-64
SLIDE 64

Inference of shallow-transfer rules

4. Optimising rules for segmentation

Generate GATs only for selected sequences of SL lexical categories so as to ensure correct segmentation

1 Identify key text segments: text segments that maximise BLEU when sentences are translated with the most specific GAT applied to them
2 Discard sequences of lexical categories that usually prevent key text segments from being translated with the same rule

Example: The DT white ADJ house N and CC the DT red ADJ cars N

Remove redundant GATs: long GATs that produce the same translation as the combination of shorter ones

slide-70
SLIDE 70

Evaluation

Evaluation goal: compare translation quality achieved by inferred rules with other approaches:

  • Word-for-word translation
  • Sánchez-Martínez & Forcada (2009)
  • Apertium handcrafted rules

Procedure:

1 Infer rules from small corpora fragments for different language pairs
2 Translate out-of-domain texts and compute MT evaluation metrics (BLEU, TER, METEOR)

Language pair       train           test
Spanish↔Catalan     El Periódico    Consumer Eroski
English↔Spanish     Europarl        Newstest 2013
Breton→French       Ofis Publik     Ofis Publik*

slide-71
SLIDE 71

Some results

The new algorithm systematically outperforms Sánchez-Martínez & Forcada (2009) by a statistically significant margin

The number of GATs inferred is at least one order of magnitude smaller → easier revision and maintenance of rules

Example: Spanish→English

[Plot: BLEU score (0.14 to 0.21) against size of the training corpus in sentences (100 to 5000, log. scale) for Sánchez-Cartagena et al., Sánchez-Martínez and Forcada, Apertium handcrafted and word-for-word]

slide-72
SLIDE 72

Some results

Morphological generalisation involves a vast computational cost and is only useful for very small corpora

Disabling it permits using a larger training corpus and reaching the translation quality of handcrafted rules

Example: Spanish→English

[Plot: BLEU score (0.14 to 0.21) against size of the training corpus in sentences (100 to 25000, log. scale) for Sánchez-Cartagena et al. - no wildcard, Sánchez-Martínez and Forcada, Apertium handcrafted and word-for-word]

slide-73
SLIDE 73

Outline

1 Introduction
2 Inferring shallow-transfer rules from small parallel corpora
3 Integrating shallow-transfer data into statistical machine translation
4 Assisting non-expert users in extending morphological dictionaries
5 Concluding remarks

slide-74
SLIDE 74

Motivation

Goal: new method for integrating shallow-transfer RBMT linguistic resources into the phrase-based SMT architecture

Why?

  • Tackle the data sparseness problem in SMT
  • Existing dictionaries + rule inference → generalisation of knowledge from the parallel corpus to unseen sequences of words
  • Both shallow-transfer RBMT and phrase-based SMT split the text in flat sequences
  • No strategy specifically aimed at integrating shallow-transfer RBMT resources into the SMT architecture can be found in the literature
  • The existing black-box approach (Eisele et al., 2008) presents strong limitations

slide-75
SLIDE 75

Motivation

Limitations of a black-box strategy:

  • Incorrect/missing phrase pairs extracted. Example: A fierce lion eats a lot / Un león feroz come mucho. Phrases extracted: fierce – un león, lot – mucho, ...
  • Inadequate balance between the two types of phrase pairs

slide-78
SLIDE 78

Integrating RBMT data into SMT

Method overview

  • Use the inner workings of the RBMT translation process to generate error-free bilingual phrases
  • Join corpus-extracted + synthetic phrase pairs, do phrase scoring and add a binary feature function

slide-79
SLIDE 79

Integrating RBMT data into SMT

Generation of synthetic phrase pairs

Strategy:

  • Generate phrase pairs for all the bilingual dictionary entries
  • Segment the SL text to be translated with shallow-transfer rules
  • All the linguistic information is extracted from the RBMT system without loss

Example:

SL text: The white house and the red cars

the DT white ADJ house N-num:sg and CC the DT red ADJ car N-num:pl

Generated bilingual phrases:

  • the white house – la casa blanca
  • the red cars – los coches rojos

slide-82
SLIDE 82

Evaluation

Evaluation goals:

  • Compare translation quality achieved by hybrid system with other approaches:
      • Pure RBMT and phrase-based SMT systems
      • Black-box hybrid approach by Eisele et al. (2008)
  • Measure impact of:
      • Size of parallel and monolingual corpora
      • Rules: automatically inferred or handcrafted
      • Domain of test corpus: same or different from training

Procedure: Build hybrid systems using Apertium data and compute MT evaluation metrics (BLEU, TER, METEOR)

Language pair      train         out-of-domain test   TL model
English↔Spanish    Europarl      Newstest 2010        Europarl (+ newscrawl)
Breton→French      Ofis Publik   −                    Ofis Publik + Europarl

slide-83
SLIDE 83

Some results

  • Systematically outperforms black-box approach
  • Outperforms pure systems when parallel corpus is very small or out-of-domain texts are translated
  • Increasing the size of the language model reduces impact

Example: Spanish→English out-of-domain, handcrafted rules

[Figure. TL model: Europarl. BLEU score against size of the training corpus (in sentences, from 10000 to 1272260) for SMT, hybrid and Apertium]

[Figure. TL model: Europarl + newscrawl (4x bigger). BLEU score against size of the training corpus (in sentences, from 10000 to 1272260) for SMT, hybrid and Apertium]

slide-84
SLIDE 84

Some results

Hybrid systems with inferred rules can outperform those with only dictionaries without using any additional resource

Example: English→Spanish out-of-domain. TL model: Europarl

[Figure. BLEU score against size of the training corpus (in sentences, from 10000 to 1272260) for SMT, hybrid-hand. rules, hybrid-only dict., hybrid-auto. rules and Apertium]

slide-85
SLIDE 85

Some results

Outperform factored translation models (Koehn and Hoang, 2007) for small parallel corpora

Example: English→Spanish out-of-domain. TL model: Europarl (factored system uses surface forms + morph. information)

[Figure. BLEU score against size of the training corpus (in sentences, from 10000 to 1272260) for hybrid-auto. rules and factored]

slide-86
SLIDE 86

Outline

1 Introduction
2 Inferring shallow-transfer rules from small parallel corpora
3 Integrating shallow-transfer data into statistical machine translation
4 Assisting non-expert users in extending morphological dictionaries
5 Concluding remarks

slide-87
SLIDE 87

Motivation

Goal: Allow non-expert users to insert entries in RBMT morphological dictionaries

Why?

  • Creation of morphological dictionaries from scratch consumes a great portion of development time of an RBMT system
  • Dictionaries are less complex than transfer rules
  • Non-expert users can enlarge them. No need for people with:
      • Linguistic background
      • Knowledge of RBMT system

Result: reduce RBMT development costs

slide-88
SLIDE 88

Overview


slide-89
SLIDE 89

Overview

System asks iteratively: Is this word a valid form of the word to be inserted?

slide-104
SLIDE 104

Overview

Monolingual morphological dictionaries

  • Surface form ↔ lemma, part of speech, morphological inflection
  • Inflection paradigms group regularities in inflection

Example

Entries: bab/p1, marr/p2, identif/p2

Paradigms:

p1:
  -y ↔ -y N-sg
  -ies ↔ -y N-pl

p2:
  -y ↔ -y V-base
  -ies ↔ -y V-3psg
  -ied ↔ -y V-past
  -ying ↔ -y V-gerund

slide-106
SLIDE 106

Overview

Monolingual morphological dictionaries

  • Surface form ↔ lemma, part of speech, morphological inflection
  • Inflection paradigms group regularities in inflection

Example

baby ↔ baby N-sg
babies ↔ baby N-pl

slide-108
SLIDE 108

Overview

Monolingual morphological dictionaries

  • Surface form ↔ lemma, part of speech, morphological inflection
  • Inflection paradigms group regularities in inflection

Example

Entries: bab/p1, marr/p2, identif/p2, verif/p2

Paradigms:

p1:
  -y ↔ -y N-sg
  -ies ↔ -y N-pl

p2:
  -y ↔ -y V-base
  -ies ↔ -y V-3psg
  -ied ↔ -y V-past
  -ying ↔ -y V-gerund

slide-109
SLIDE 109

Overview

Monolingual morphological dictionaries

  • Surface form ↔ lemma, part of speech, morphological inflection
  • Inflection paradigms group regularities in inflection

Example

verify ↔ verify V-base
verifies ↔ verify V-3psg
verified ↔ verify V-past
verifying ↔ verify V-gerund
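How an entry such as verif/p2 expands into the four forms above can be sketched in a few lines. This is only an illustration with hypothetical names (`PARADIGMS`, `expand`); it is not the Apertium dictionary format, and it assumes each paradigm row is a triple of surface suffix, lemma suffix and morphological tags.

```python
# Each paradigm row: (surface suffix, lemma suffix, morphological tags).
PARADIGMS = {
    "p1": [("y", "y", "N-sg"), ("ies", "y", "N-pl")],
    "p2": [
        ("y", "y", "V-base"),
        ("ies", "y", "V-3psg"),
        ("ied", "y", "V-past"),
        ("ying", "y", "V-gerund"),
    ],
}


def expand(stem, paradigm):
    """Return (surface form, lemma, tags) for every row of the paradigm."""
    return [
        (stem + surface, stem + lemma, tags)
        for surface, lemma, tags in PARADIGMS[paradigm]
    ]


for surface, lemma, tags in expand("verif", "p2"):
    print(f"{surface} ↔ {lemma} {tags}")
```

Adding a word therefore only requires a stem and a paradigm name, which is what makes the task approachable for non-expert users.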


slide-121
SLIDE 121

Choosing the most appropriate inflection paradigm

Word to be inserted: verifies

c1 = verif/p1 = verif/{ -y, -ies } = { verify, verifies }
c2 = verif/p2 = verif/{ -y, -ies, -ied, -ying } = { verify, verifies, verified, verifying }
c3 = verifie/p3 = verifie/{ -ε, -s } = { verifie, verifies }
c4 = verifies/p3 = verifies/{ -ε, -s } = { verifies, verifiess }

Efficiency: Choose words to be validated in order to ask as few questions as possible
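The candidate list can be reproduced with a small sketch: split the word at every position and keep each stem/paradigm pair whose suffix set contains the remaining suffix. The names `SUFFIXES` and `candidates` are hypothetical, the empty string stands for the empty suffix, and only the three paradigms of this example are included.

```python
# Suffix sets of the paradigms in the example; "" is the empty suffix.
SUFFIXES = {
    "p1": ["y", "ies"],
    "p2": ["y", "ies", "ied", "ying"],
    "p3": ["", "s"],
}


def candidates(word):
    """Return (stem, paradigm, generated forms) compatible with the word."""
    found = []
    for cut in range(len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        for name, suffixes in SUFFIXES.items():
            if suffix in suffixes:
                forms = {stem + s for s in suffixes}
                found.append((stem, name, forms))
    return found


for stem, name, forms in candidates("verifies"):
    print(f"{stem}/{name} = {sorted(forms)}")
```

For "verifies" this yields the four candidates c1 to c4, in that order.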


slide-122
SLIDE 122

Choosing words to be validated

1 Assign a score to each paradigm with hidden Markov model (HMM)

  • Score depends on context: paradigms assigned to the other words in the sentence (similar to part-of-speech tagging)
  • HMM is trained from a monolingual corpus

2 Choose words with a binary decision tree built with modified ID3 algorithm. Balance between:

  • HMM scores
  • Number of paradigms discarded if user accepts word
  • Number of paradigms discarded if user rejects word

[Figure: decision tree for "verifies"]

verifying?
  yes (0.7): verif/p2
  no: verify?
    yes (0.26): verif/p1
    no: verifiess?
      yes (0.02): verifies/p3
      no (0.02): verifie/p3
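The idea of picking questions that discard as many likely candidates as possible can be sketched with a plain information-gain criterion. This is a simplification under stated assumptions, not the modified ID3 algorithm of the thesis: the question is chosen greedily by expected remaining entropy, the scores are the four values that appear in the tree on this slide (taken here as fixed inputs rather than computed by an HMM), and the user is simulated by an oracle set of valid forms. All names are hypothetical.

```python
import math

# Candidates for "verifies" and the forms each one generates.
CANDIDATES = {
    "verif/p1": {"verify", "verifies"},
    "verif/p2": {"verify", "verifies", "verified", "verifying"},
    "verifie/p3": {"verifie", "verifies"},
    "verifies/p3": {"verifies", "verifiess"},
}
# Assumed paradigm scores (values shown in the slide's tree).
SCORES = {"verif/p1": 0.26, "verif/p2": 0.70, "verifie/p3": 0.02, "verifies/p3": 0.02}


def entropy(names):
    """Entropy of the score distribution renormalised over `names`."""
    total = sum(SCORES[n] for n in names)
    return -sum((SCORES[n] / total) * math.log2(SCORES[n] / total) for n in names)


def best_question(names):
    """Word whose yes/no answer leaves the lowest expected entropy."""
    total = sum(SCORES[n] for n in names)
    words = set().union(*(CANDIDATES[n] for n in names))
    best, best_cost = None, float("inf")
    for word in sorted(words):  # sorted for deterministic tie-breaking
        yes = [n for n in names if word in CANDIDATES[n]]
        no = [n for n in names if word not in CANDIDATES[n]]
        if not yes or not no:
            continue  # the answer would not discard any candidate
        cost = sum(
            sum(SCORES[n] for n in part) / total * entropy(part)
            for part in (yes, no)
        )
        if cost < best_cost:
            best, best_cost = word, cost
    return best


def identify(valid_forms):
    """Ask questions until one candidate remains; the oracle answers."""
    names, asked = list(CANDIDATES), []
    while len(names) > 1:
        word = best_question(names)
        if word is None:
            break  # remaining candidates generate the same forms
        asked.append(word)
        answer = word in valid_forms
        names = [n for n in names if (word in CANDIDATES[n]) == answer]
    return names, asked


print(identify({"verify", "verifies", "verified", "verifying"}))
```

With these scores the verb reading is confirmed after a single question, while a word that really followed p1 would need two. The early `break` also mirrors the limitation reported in the evaluation: candidates with the same set of suffixes cannot be told apart by this kind of question.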


slide-123
SLIDE 123

Evaluation

Evaluating viability: asked group of 4 non-expert users to insert 150 entries from Spanish Apertium monolingual dictionary

  • High success rate: ∼90%
  • Cannot choose between paradigms with same set of suffixes and different morphological information

Evaluating efficiency: measured number of questions needed on a bigger test set (2423 word/sentence pairs) with an oracle

  • Avg. number of questions (∼5) lower than with simpler heuristic approaches (∼7):
      • Score as proportion of inflected forms found in a parallel corpus
      • Ask about most infrequent word from highest-scored candidate
  • Decision tree brings robustness against wrong paradigm scoring
  • No clear winner between HMM and heuristic scoring

slide-124
SLIDE 124

Outline

1 Introduction
2 Inferring shallow-transfer rules from small parallel corpora
3 Integrating shallow-transfer data into statistical machine translation
4 Assisting non-expert users in extending morphological dictionaries
5 Concluding remarks

slide-125
SLIDE 125

Concluding remarks

Ease the development of MT systems when resources are scarce

Enable the creation of transfer rules for RBMT when no bilingual experts are available with automatic inference of shallow-transfer rules from small parallel corpora

  • Outperforms previous approach in literature thanks to high generalisation power and optimisation of segmentation
  • First approach that treats rule inference as an optimisation problem and solves conflicts at a global level
  • Work published in Computer Speech and Language [1]

[1] Special Issue on Hybrid Machine Translation. Volume 32, issue 1, p. 49–90

slide-126
SLIDE 126

Concluding remarks

Ease the development of MT systems when resources are scarce

Mitigate data sparseness in SMT thanks to the integration of shallow-transfer RBMT data into phrase-based SMT

  • Takes advantage of inner workings of shallow-transfer RBMT system
  • Successful combination with automatic inference of shallow-transfer rules
  • Win WMT 2011 for Spanish→English with Apertium handcrafted rules
  • Work accepted for publication in Journal of Artificial Intelligence Research [2]

[2] Special Track on Cross-language Algorithms and Applications

slide-127
SLIDE 127

Concluding remarks

Ease the development of MT systems when resources are scarce

Speed up and reduce costs of developing RBMT morphological dictionaries by allowing non-expert users to insert new entries into them

  • Uses existing paradigms and asks polar questions to users
  • Users are able to successfully answer the questions
  • Decision tree and hidden Markov model are efficient

Implementation of all methods released as free/open-source software

slide-128
SLIDE 128

Future research lines

Automatic inference of shallow-transfer rules

  • Automatic inference of rules that operate on chunks (Apertium level 2)
  • Active learning + rule inference
      • Select most informative fragments of a monolingual corpus
      • Ask a person to translate them and infer rules

Integration of automatically inferred rules into phrase-based SMT:

  • Less aggressive optimisation for segmentation
  • Give up solving minimisation problem

Allowing non-expert users to insert entries into morphological dictionaries

  • Combination of multiple models for paradigm scoring
  • More sophisticated questions/automatic methods to choose morphological information of paradigm

slide-129
SLIDE 129

Thank you for your attention
