An Italian to Catalan RBMT system reusing data from existing language pairs




SLIDE 1

Introduction Methodology Evaluation Conclusions

An Italian to Catalan RBMT system reusing data from existing language pairs

Antonio Toral, Mireia Ginestí-Rosell, Francis Tyers

2nd International Workshop on Free/Open-Source Rule-Based Machine Translation

2011/01/21

Antonio Toral, Mireia Ginestí-Rosell, Francis Tyers: it→ca RBMT reusing data from existing language pairs

SLIDE 2

Contents

1. Introduction
2. Methodology: Crossdics, Inconsistencies, Coverage
3. Evaluation: Setting, Results
4. Conclusions

SLIDE 9

Introduction

Two main approaches in Machine Translation: Rule-Based (RBMT) and Statistical (SMT)

What is needed to build a system for a new language pair?

- RBMT: dictionaries and rules
- SMT: parallel corpus

Drawbacks:

- RBMT: linguistic expertise on both languages, manual construction
- SMT: only applicable to language pairs with large amounts of parallel data

SLIDE 15

Introduction

This paper: build an RBMT system by exploiting data from existing pairs

- We build an MT system for pair a–b given existing systems for pairs a–c and b–c
- Italian→Catalan from Apertium's Italian–Spanish and Catalan–Spanish

Motivation:

- RBMT is competitive and useful for languages without parallel corpora
- Reusing data from similar pairs significantly reduces the amount of work
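The idea of deriving an a–b resource from a–c and b–c resources can be sketched as a join on the pivot language c. The snippet below is only an illustration under simplifying assumptions: dictionaries are plain word-to-translations mappings, the function name `cross_dictionaries` and the toy entries are invented here, and morphological tags are ignored. It is not the crossing procedure used in Apertium, which operates on full dictionary entries.

```python
from collections import defaultdict


def cross_dictionaries(a_to_c, b_to_c):
    """Derive candidate a->b translations by joining two dictionaries on pivot c.

    Both inputs map a word to a set of pivot-language translations.
    Hypothetical helper for illustration only.
    """
    # Invert b->c so that each pivot word points at its b-language words.
    c_to_b = defaultdict(set)
    for b_word, c_words in b_to_c.items():
        for c_word in c_words:
            c_to_b[c_word].add(b_word)

    # Follow a -> c -> b for every entry of the first dictionary.
    a_to_b = defaultdict(set)
    for a_word, c_words in a_to_c.items():
        for c_word in c_words:
            a_to_b[a_word] |= c_to_b.get(c_word, set())

    # Keep only the source words that reached at least one target word.
    return {a: sorted(bs) for a, bs in a_to_b.items() if bs}


# Toy entries (assumed for the example, not taken from the real dictionaries).
it_es = {"casa": {"casa"}, "cane": {"perro"}, "gatto": {"gato"}, "zio": {"tío"}}
ca_es = {"casa": {"casa"}, "gos": {"perro"}, "gat": {"gato"}}

it_ca = cross_dictionaries(it_es, ca_es)
print(it_ca)
```

In this toy run, "zio" has no Catalan counterpart because its Spanish pivot is missing from the second dictionary, so a crossed dictionary can only be as large as the overlap between the two inputs.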

SLIDE 24

Methodology: Crossdics

Input dictionaries:

- es–it: mono es 11k, mono it 10k, bi 12k
- es–ca: mono es 44k, mono ca 40k, bi 51k

Output dictionaries:

- it–ca: mono it 7k, mono ca 8k, bi 9k

Other linguistic data:

- it tagger and disambiguation probabilities taken from it–es
- transfer rules: 35 taken from oc–ca (mainly noun phrases) + 9 manually created (verbs and clitic pronouns)

SLIDE 30

Methodology: Inconsistencies

Reasons

- Differences of gender and number (it–ca)
- Different ways of categorising lemmas and morphological features (es–it and es–ca)

Solutions

- Manually solve inconsistencies (identified automatically): 0.5 person-months
- Substitute derived ca mono dictionary (8k) for that in es–ca (40k)

SLIDE 34

Methodology: Coverage

Coverage calculated on two Italian corpora: Europarl and Wikipedia

155 most frequent unknown words added to the system

Measure            Europarl      Wikipedia
Number of words    46,569,602    241,563,615
Initial coverage   86.4%         75.5%
Final coverage     88.9%         79.4%
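A rough sketch of how coverage and the list of most frequent unknown words might be computed. It assumes a plain set lookup on lower-cased tokens, whereas a real setup would run the system's morphological analyser over the corpus. The function names and the toy corpus are invented for illustration.

```python
from collections import Counter


def coverage(tokens, known_words):
    """Fraction of corpus tokens found in the (assumed) set of known words."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_words)
    return known / len(tokens)


def most_frequent_unknown(tokens, known_words, n):
    """Return the n unknown words that occur most often in the corpus."""
    counts = Counter(
        token.lower() for token in tokens if token.lower() not in known_words
    )
    return [word for word, _ in counts.most_common(n)]


# Toy corpus and dictionary (assumed, not from Europarl or Wikipedia).
tokens = "il gatto e il cane e lo zio e il gatto".split()
known_words = {"il", "gatto", "cane", "lo"}

initial = coverage(tokens, known_words)

# Add the single most frequent unknown word, then measure again.
to_add = most_frequent_unknown(tokens, known_words, 1)
final = coverage(tokens, known_words | set(to_add))

print(f"initial {initial:.1%}, added {to_add}, final {final:.1%}")
```

For reference, the figures reported on the slide correspond to gains of 2.5 percentage points on Europarl (86.4% to 88.9%) and 3.9 points on Wikipedia (75.5% to 79.4%).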

SLIDE 40

Evaluation: Setting

Systems

- Apertium (the it→ca system)
- Apertium-i (indirect translation using it→es and es→ca)
- Google Translate

Test set: 1k sentences from KDE4 (OPUS project)

Metrics: TER, GTM, BLEU, NIST

SLIDE 43

Evaluation: Results

Metric   Apertium   Apertium-i   Google
TER      0.5703     0.6118       0.6785
GTM      0.5162     0.4712       0.41637
BLEU     0.2290     0.1492       0.2459
NIST     5.6567     4.4753       6.1071

Rankings, where ">" means "better than" (TER is an error rate, so lower is better; for the other metrics higher is better):

- GTM and TER: Apertium > Apertium-i > Google
- BLEU*: Google ≃ Apertium > Apertium-i
- NIST*: Google > Apertium > Apertium-i
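The per-metric rankings can be checked directly from the reported scores. The short sketch below uses the values from the results table and assumes only that TER is treated as lower-is-better and the other three metrics as higher-is-better. The names `scores` and `rank_systems` are chosen here for illustration.

```python
# Scores as reported in the results table.
scores = {
    "TER": {"Apertium": 0.5703, "Apertium-i": 0.6118, "Google": 0.6785},
    "GTM": {"Apertium": 0.5162, "Apertium-i": 0.4712, "Google": 0.41637},
    "BLEU": {"Apertium": 0.2290, "Apertium-i": 0.1492, "Google": 0.2459},
    "NIST": {"Apertium": 5.6567, "Apertium-i": 4.4753, "Google": 6.1071},
}

# TER counts edits, so a smaller value means a better translation.
LOWER_IS_BETTER = {"TER"}


def rank_systems(metric):
    """Order the systems from best to worst for the given metric."""
    higher_is_better = metric not in LOWER_IS_BETTER
    return sorted(scores[metric], key=scores[metric].get, reverse=higher_is_better)


rankings = {metric: rank_systems(metric) for metric in scores}
for metric, order in rankings.items():
    print(f"{metric}: {' > '.join(order)}")
```

A plain sort cannot express the "≃" relation given for BLEU, so it simply places Google ahead of Apertium by the raw score difference of 0.0169.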

SLIDE 47

Conclusions

- it→ca RBMT system derived from es–it and es–ca systems
- Limited amount of manual work: correct inconsistencies, augment coverage and add some transfer rules
- Evaluated against an indirect RBMT system and an SMT system, yielding significant improvements over both
- System released as apertium-ca-it-0.1.0

SLIDE 48

Thanks! Questions?