Experiments in Term Translation Mihael Arcan DERI, NUI Galway - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Monnet is supported by the European Union under Grant No. 248458

Experiments in Term Translation

Mihael Arcan DERI, NUI Galway Supervised by Dr. Paul Buitelaar

slide-2
SLIDE 2
  • Generation of ‘multilingual’ ontologies

– most ontologies are in English
– terms need to be translated

Motivation

slide-3
SLIDE 3
  • Monnet Project
  • Research

– building domain-specific resources

    • architecture
    • domain-specific resources
    • results and evaluation
    • main findings

– term disambiguation

    • building a contextual-semantic resource (general parallel resource, ontology)

– future work

Overview

slide-4
SLIDE 4

Monnet

Business Information in EN, DE, NL, ES etc.

http://www.monnet-project.eu

slide-5
SLIDE 5

Research Objectives

  • Development and use of ‘multilingual ontologies’

– ontologies with rich multilingual descriptors

  • Exploit ‘domain semantics’ to improve Machine Translation

– use of ontological, terminological, linguistic knowledge

Use Cases

  • Financial Use Case

– Cross-lingual Business Intelligence

  • Public Services Use Case

– Multilingual Access to Government Information

Research Objectives & Use Cases

slide-6
SLIDE 6

Harmonizing Business Registration across Europe

XBRL (eXtensible Business Reporting Language) Europe Working Group works with Monnet on the xEBR taxonomy

xEBR (XBRL European Business Register) taxonomy defines common concepts with mappings to country/language specific taxonomies

  • National Bank of Belgium (Belgium)
  • Eogs / DCCA (Denmark)
  • Registrite ja infosüsteemide Keskus eRik (Estonia)
  • Bilans Service - Infogreffe (France)
  • Bundesanzeiger (Germany)
  • Infocamere (Italy)
  • RSCL (Luxembourg)
  • Kamer van Koophandel (Netherlands)
  • Informa DB – Colegio de Registradores (Spain)
  • Bolagsverket (Sweden)
  • Companies House (United Kingdom)
  • EBR (Europe)
  • GBR (Global)
  • IASCF
  • Bank of Spain
  • Software – Audit – Consulting

Financial Use Case

slide-7
SLIDE 7

Public Services Use Case

Translation of Dutch regulation (legal ontology) into several EU languages:

  • Immigration law
  • Tax law
  • Student benefit law
  • Health care benefit law
  • Social security law
  • Law on higher education

slide-8
SLIDE 8
  • Term translation in isolation (no document or sentence context)

– Experiment 1: domain-specific resources generation

    • addressing out-of-vocabulary issue

– Experiment 2: contextual-semantic resource generation

    • term disambiguation

Basic Ideas of my research

slide-9
SLIDE 9
  • Building and exploiting domain-specific resources: Linguee [1] and Wikipedia [2]

Experiment 1

[1] http://www.linguee.com/
[2] http://en.wikipedia.org/

slide-10
SLIDE 10

Architecture of Experiment 1

[Architecture diagram. Components:]

  • xEBR Taxonomy
  • Extraction of financial labels
  • Domain-specific parallel corpus generation: Extraction of a parallel resource, Phrase Table generation
  • Cross-Lingual Lexicon generation: Wikipedia Title extraction, Generation of financial lexicon
  • Decoding process

slide-11
SLIDE 11

Domain-specific parallel corpus generation

xEBR Taxonomy Extraction of financial labels

Querying financial labels Parsing HTML files

Decoding process

Phrase Table generation

slide-12
SLIDE 12

Linguee

http://www.linguee.com/

slide-13
SLIDE 13

Cross-Lingual Lexicon Generation

xEBR Taxonomy Extraction of financial labels

Wikipedia Title extraction

Generation of financial lexicon

Decoding process

slide-14
SLIDE 14

Cross-Lingual Lexicon Generation

slide-22
SLIDE 22

Linguee parallel corpus on xEBR EN terms

English: 24K sentences (1M tokens)
German: 24K sentences (0.85M tokens)

EuroParl (version 6): 1.7M sentences, 43M English, 40M German tokens
JRC Acquis (version 3.0): 1.2M sentences, 32M English, 29M German tokens

Wikipedia lexicon generation:

7334 translation pairs from English to German and the other way around

– Current assets <-> Umlaufvermögen
– Balance sheet <-> Bilanz
– Unpaid calls on subscribed capital

    • capital -> Kapital

– Social security, post-employment and other employee benefit costs

    • employee benefit -> Sachbezug

Domain-Specific resource Generation Overview
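
The sub-term examples above ("capital" inside "Unpaid calls on subscribed capital") suggest looking up label n-grams in the Wikipedia lexicon. The following is a minimal sketch of a longest-match lookup, not the presenter's implementation. It assumes the lexicon is a plain dictionary from lower-cased English titles to German titles; `find_lexicon_matches` is a hypothetical helper name, and the four entries are the ones shown on the slide.

```python
def find_lexicon_matches(label, lexicon):
    """Return (sub-term, translation) pairs, preferring the longest n-grams."""
    tokens = label.lower().replace(",", " ").split()
    covered = [False] * len(tokens)
    matches = []
    # Try the longest n-grams first so 'balance sheet' wins over 'balance'.
    for n in range(len(tokens), 0, -1):
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if any(covered[i] for i in span):
                continue  # tokens already claimed by a longer match
            candidate = " ".join(tokens[start:start + n])
            if candidate in lexicon:
                matches.append((candidate, lexicon[candidate]))
                for i in span:
                    covered[i] = True
    return matches


# Entries taken from the slide; the full lexicon holds 7334 pairs.
lexicon = {
    "current assets": "Umlaufvermögen",
    "balance sheet": "Bilanz",
    "capital": "Kapital",
    "employee benefit": "Sachbezug",
}

print(find_lexicon_matches("Unpaid calls on subscribed capital", lexicon))
print(find_lexicon_matches("Current assets", lexicon))
```

Longest-match-first is a design choice of this sketch: it keeps a multi-word title from being split into weaker single-word matches.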

slide-23
SLIDE 23

Automatic Evaluation

Translation source                              Translation direction   Exact translation   BLEU     Meteor
Acquis                                          English to German        9                  0.1267   0.4795
Acquis                                          German to English       12                  0.1673   0.3726
Europarl                                        English to German        5                  0.0207   0.4120
Europarl                                        German to English        4                  0.1132   0.3258
Linguee                                         English to German       15                  0.3641   0.6309
Linguee                                         German to English       25                  0.3471   0.4084
Linguee + domain-specific lexicon substitution  English to German       22                  0.3479   0.6438
Linguee + domain-specific lexicon substitution  German to English       25                  0.3237   0.4315
Google Translate                                English to German       21                  0.4517   0.6410
Google Translate                                German to English       18                  0.2640   0.3688

  • Evaluation on 63 English-German financial labels
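
The "Exact translation" column counts labels whose system output equals the reference. A minimal sketch of such a count follows. The case and whitespace normalisation is an assumption, since the slides do not say how strings were compared, and `exact_match_count` is a hypothetical helper. The three reference/translation pairs are the ones quoted on the Discussion slide of this deck.

```python
def exact_match_count(hypotheses, references):
    """Count hypotheses equal to their reference after light normalisation."""
    def norm(text):
        # Assumption: ignore case and repeated whitespace when comparing.
        return " ".join(text.lower().split())
    return sum(norm(h) == norm(r) for h, r in zip(hypotheses, references))


references = ["Finanzanlagen", "Finanz- und Beteiligungsergebnis", "Eigenkapital"]
hypotheses = ["Langfristige finanzielle Vermögenswerte", "Finanzergebnis", "Gerechtigkeit"]

print(exact_match_count(hypotheses, references))
```

None of these three pairs matches exactly, although only the last is a wrong translation, which is one reason the deck complements this count with BLEU, Meteor and human judgements.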
slide-27
SLIDE 27

Mono-lingual Human Evaluation

slide-28
SLIDE 28

A = Acceptable, C = Can easily be fixed, N = None of both

                     Translation into German    Translation into English
                     A      C      N            A      C      N
Linguee+Wikipedia    58%    27%    15%          56%    32%    12%
Google Translate     55%    31%    14%          56%    31%    13%
Linguee              51%    37%    12%          39%    40%    21%
JRC-Acquis           32%    28%    40%          39%    31%    30%
Europarl              5%    25%    70%          15%    30%    55%

Manual Evaluation of Translation Quality

  • Evaluation on 63 English-German financial labels
slide-29
SLIDE 29

Agreement metric     Translation into German          Translation into English
                     S      π      κ      α           S      π      κ      α
Linguee+Wikipedia    0.599  0.528  0.533  0.530       0.532  0.452  0.457  0.454
Google Translate     0.698  0.655  0.657  0.657       0.480  0.460  0.465  0.463
Linguee              0.484  0.416  0.437  0.419       0.599  0.537  0.540  0.539
JRC-Acquis           0.412  0.406  0.413  0.408       0.363  0.359  0.366  0.360
Europarl             0.515  0.270  0.269  0.273       0.552  0.493  0.499  0.495

Annotator agreement scores

  • Evaluation on 63 English-German financial labels
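
The S, π and κ columns are chance-corrected agreement coefficients of the form (observed - expected) / (1 - expected); they differ only in how expected agreement is estimated. The sketch below computes them for two annotators over the categories A, C and N. The two-annotator setting, the function name `agreement_scores` and the six toy judgements are assumptions for illustration; the number of annotators in the study is not stated on the slide, and Krippendorff's α is left out.

```python
from collections import Counter


def agreement_scores(ann1, ann2, categories=("A", "C", "N")):
    """Bennett's S, Scott's pi and Cohen's kappa for two annotators."""
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)

    # S: every category is equally likely by chance.
    exp_s = 1.0 / len(categories)
    # pi: one shared category distribution pooled over both annotators.
    exp_pi = sum(((c1[c] + c2[c]) / (2 * n)) ** 2 for c in categories)
    # kappa: each annotator keeps an individual category distribution.
    exp_kappa = sum((c1[c] / n) * (c2[c] / n) for c in categories)

    def corrected(expected):
        return (observed - expected) / (1 - expected)

    return {"S": corrected(exp_s), "pi": corrected(exp_pi), "kappa": corrected(exp_kappa)}


# Hypothetical judgements for six labels (not data from the study).
annotator_1 = list("AACCNN")
annotator_2 = list("AACNNN")
print(agreement_scores(annotator_1, annotator_2))
```

Because π pools the two annotators' distributions, it is never larger than κ on the same data, which matches the pattern in the table.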
slide-30
SLIDE 30

Agreement metric     Translation into German          Translation into English
                     S      π      κ      α           S      π      κ      α
Linguee+Wikipedia    0.599  0.528  0.533  0.530       0.532  0.452  0.457  0.454
Google Translate     0.698  0.655  0.657  0.657       0.480  0.460  0.465  0.463
Linguee              0.484  0.416  0.437  0.419       0.599  0.537  0.540  0.539
JRC-Acquis           0.412  0.406  0.413  0.408       0.363  0.359  0.366  0.360
Europarl             0.515  0.270  0.269  0.273       0.552  0.493  0.499  0.495

Annotator agreement scores

substantial agreement

  • Evaluation on 63 English-German financial labels
slide-31
SLIDE 31

Cross-Lingual Human Evaluation

slide-32
SLIDE 32

Translation into German   Acceptable   Can easily be fixed   None of both
Linguee+Wikipedia         59.15%       29.34%                11.50%

Manual Evaluation of Translation Quality

  • Evaluation on 142 English financial labels

Agreement metric     Translation into German
                     S      π      κ      α
Linguee+Wikipedia    0.467  0.355  0.357  0.355

slide-33
SLIDE 33

Translation into German   Acceptable   Can easily be fixed   None of both
Linguee+Wikipedia         59.15%       29.34%                11.50%

Agreement metric     Translation into German
                     S      π      κ      α
Linguee+Wikipedia    0.467  0.355  0.357  0.355

Manual Evaluation of Translation Quality

  • Evaluation on 142 English financial labels

fair agreement

slide-34
SLIDE 34
  • Reference reduces semantics

Source: Long-term financial assets
Reference: Finanzanlagen
Translation: Langfristige finanzielle Vermögenswerte

  • Reference adds semantics

Source: Financial result
Reference: Finanz- und Beteiligungsergebnis
Translation: Finanzergebnis

  • Domain training needed

Source: equity
Reference: Eigenkapital
Translation (Google Translate): Gerechtigkeit

Discussion

slide-35
SLIDE 35
  • Domain-specific resource gives better results than a bigger, but more general one
  • Linguee parallel corpus + domain-specific multilingual Wikipedia outperform Google Translate for translating German terms into English

Main Findings of Experiment 1

slide-36
SLIDE 36

Experiment 2: Term Disambiguation

slide-37
SLIDE 37

Experiment 1:

  • using the ontology vocabulary to generate new domain-specific resources
  • domain-specific resource (Linguee / Wikipedia)

Experiment 2:

  • using the ontology vocabulary for term disambiguation
  • general resource (Europarl)

Experiment 2: Term Disambiguation

slide-38
SLIDE 38

equity ||| Eigenkapital ||| 0.0635181, 0.207921, 0.0232247, 0.0603448, ...
equity ||| Gerechtigkeit ||| 0.0362212, 0.0416455, 0.325991, 0.235632, ...

p(equity|Gerechtigkeit) > p(equity|Eigenkapital)

Experiment 2: Term Disambiguation
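
The two phrase-table entries above can be read programmatically to see how the candidates compare per score. This is a minimal sketch: `parse_phrase_table_line` is a hypothetical helper, only the four scores visible on the slide are used (the elided "..." is skipped, not filled in), and the code makes no assumption about which column corresponds to the probability named on the slide.

```python
def parse_phrase_table_line(line):
    """Split 'source ||| target ||| scores' into its three fields."""
    source, target, scores = [field.strip() for field in line.split("|||")[:3]]
    # Scores are comma-separated on the slide; the trailing '...' is elided data.
    values = [float(s) for s in scores.replace(",", " ").split() if s != "..."]
    return source, target, values


lines = [
    "equity ||| Eigenkapital ||| 0.0635181, 0.207921, 0.0232247, 0.0603448, ...",
    "equity ||| Gerechtigkeit ||| 0.0362212, 0.0416455, 0.325991, 0.235632, ...",
]

table = {}
for line in lines:
    src, tgt, vals = parse_phrase_table_line(line)
    table[(src, tgt)] = vals

# For each visible score column, report which candidate scores higher.
for column in range(4):
    best = max(table, key=lambda pair: table[pair][column])
    print(column, best[1])
```

The ranking flips between columns, so a decoder that relies on the columns favouring "Gerechtigkeit" will pick the general-domain sense for a financial label.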

slide-39
SLIDE 39

... to decrease significant the negative equity of the Group short-term and ...

... so zu einer Verringerung erhebliche negative Eigenkapital der Fraktion kurzfristige und ...

Experiment 2: Term Disambiguation

slide-40
SLIDE 40

... to decrease significant the negative equity of the Group short-term and ...

... so zu einer Verringerung erhebliche negative Eigenkapital der Fraktion kurzfristige und ...

  • Generation of a contextual-semantic resource

– concentrating only on unigrams

Experiment 2: Term Disambiguation

slide-41
SLIDE 41

equity ||| @-@ Equity ...
equity ||| Beteiligungskapital ...
equity ||| Eigen- ...
equity ||| Eigenkapital ...
equity ||| Eigenkapitalquote ...
equity ||| Eigenkapitals ...
equity ||| Eigenmittel ...
equity ||| Eigenmitteln ...
equity ||| Equity ...
equity ||| Fairness ...
equity ||| Fairneß ...
equity ||| Gerechtigkeit , ...
equity ||| Gerechtigkeit ...
equity ||| Geschlechtergleichstellung ...
equity ||| Gleichbehandlung ...
equity ||| Gleichheit ...
equity ||| Gleichheitsprinzips ...
equity ||| Kapitalanlagegesellschaften ...
equity ||| Kapitalbeteiligung ...
equity ||| Kapitalbeteiligungen ...
equity ||| Stammkapital ...
equity ||| der Gerechtigkeit ...
equity ||| der Gleichheit ...
equity ||| die Gerechtigkeit ...
equity ||| die Gleichbehandlung ...
equity ||| gerechten ...
equity ||| gerechtes ...
equity ||| gleichberechtigten ...
...

Europarl Phrase Table for Equity

slide-42
SLIDE 42
  • ... please tell me which agreement or conclusion of equity or redistribution ...
  • ... sagen Sie mir bitte , in welchem Übereinkommen oder welcher Schlussfolgerung die Rede von Gerechtigkeit oder Umverteilung ist ...

Generating a contextual-semantic resource

Source label: equity
Target label: Gerechtigkeit
Context (frequency + 1): please, tell, me, agreement, conclusion, redistribution

slide-43
SLIDE 43
  • ... Minority interests relate to the percentage of equity of subsidiaries owned by third-party shareholders ...
  • ... Minderheitsanteile betreffen die von Drittaktionären gehaltenen Anteile am Eigenkapital von Tochtergesellschaften ...

Source label: equity
Target label: Eigenkapital
Context (frequency + 1): minority, subsidiaries, interests, owned, relate, third-party, percentage, shareholders

Generating a contextual-semantic resource
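
The two examples above pair a source term and its aligned target term with the remaining content words of the English sentence, incrementing each word's frequency per occurrence. A minimal sketch of that step follows. The stop-word list is hypothetical, chosen only so that the two slide examples are reproduced; the actual filtering used in the experiment is not given. `extract_context` is likewise an illustrative name.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the list used in the experiment is not given.
STOPWORDS = {"which", "or", "of", "to", "the", "by"}


def extract_context(sentence, term, stopwords=STOPWORDS):
    """Count the context words of `term`, dropping stop words and the term itself."""
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence.lower())
    return Counter(t for t in tokens if t != term and t not in stopwords)


# One context vector per (source label, target label) pair, summed over sentences.
contexts = defaultdict(Counter)
contexts[("equity", "Gerechtigkeit")].update(extract_context(
    "please tell me which agreement or conclusion of equity or redistribution",
    "equity"))
contexts[("equity", "Eigenkapital")].update(extract_context(
    "Minority interests relate to the percentage of equity of subsidiaries "
    "owned by third-party shareholders",
    "equity"))

for pair, counts in contexts.items():
    print(pair, sorted(counts))
```

Calling `update` again for every further sentence in which the same pair is aligned yields frequency lists of the kind shown on the next slide.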

slide-44
SLIDE 44

Source label: equity
Target label: Eigenkapital
Context (frequency): capital(43), SMEs(8), risk(7), enterprises(7), tax(6), banks(6), finance(5), European(5) . . . first(1), problematic(1), candidate(1), industries(1)

Source label: equity
Target label: Gerechtigkeit
Context (frequency): social(24), efficiency(21), education(17), European(15), justice(15), training(11), principle(11), quality(10), system(9) . . . migration(1), information(1), directive(1), inspection(1)

...

Generating a contextual-semantic resource

slide-45
SLIDE 45

Source: UK GAAP

Ontology vocabulary: other non financial creditors after more than one year(1), social security costs(1), other value adjustments and provisions(1), net financial income(1) ...

assets(24), year(24), charges(20), financial(20), income(18), payable(14), amounts(12), fixed(11), operating(11), taxes(9) . . .

Financial vocabulary of the UK GAAP ontology

slide-46
SLIDE 46

Cosine similarity

slide-48
SLIDE 48

Cosine similarity

Equity-Gerechtigkeit extracted Context from Europarl Equity-Eigenkapital extracted Context from Europarl Ontology Context
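
Cosine similarity compares each candidate's Europarl context vector with the ontology context vector: the dot product of the two frequency vectors divided by the product of their lengths. The sketch below implements it over word-count dictionaries. The three small vectors are hypothetical toy data, not the counts from the experiment, and `cosine_similarity` is an illustrative helper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty context shares nothing with any other
    return dot / (norm_a * norm_b)


# Hypothetical toy counts, only to show the mechanics.
ontology_context = Counter({"assets": 3, "capital": 2, "tax": 1})
eigenkapital_context = Counter({"capital": 4, "tax": 1, "banks": 1})
gerechtigkeit_context = Counter({"social": 3, "justice": 2, "tax": 1})

print(cosine_similarity(eigenkapital_context, ontology_context))
print(cosine_similarity(gerechtigkeit_context, ontology_context))
```

A candidate whose context shares more vocabulary with the ontology scores higher, independently of how frequent the candidate is in the general corpus.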

slide-49
SLIDE 49

Translation Probability vs Cosine similarity

Translation probability (Equity), p(e|f)

Gerechtigkeit                 -10.6227
Gleichheit                    -11.5476
Eigenkapital                  -12.7612
Gleichbehandlung              -13.0936
Fairness                      -13.6301
Equity                        -13.6523
Beteiligungskapital           -14.2502
Fairneß                       -14.2729
Eigenmitteln                  -14.5058
Eigenkapitals                 -14.8938
Kapitalanlagegesellschaften   -15.2789
Kapitalbeteiligungen          -15.6793
Kapitalbeteiligung            -15.6971

Cosine similarity (Equity)

Eigenkapital                  0.103756
Eigenkapitals                 0.078093
Kapitalbeteiligungen          0.049344
Kapitalbeteiligung            0.045790
Equity                        0.035826
Gerechtigkeit                 0.035676
Beteiligungskapital           0.031068
Gleichheit                    0.027332
Gleichbehandlung              0.012716
Kapitalanlagegesellschaften   0.012131
Fairness                      0.011412
Fairneß                       0.008297
Eigenmitteln                  0.005969

slide-50
SLIDE 50

Translation Probability vs Cosine similarity

Translation probability (Equity), p(e|f)

Gerechtigkeit (justice, equity)   -10.6227
Gleichheit (equality, equity)     -11.5476
Eigenkapital (equity)             -12.7612
Gleichbehandlung                  -13.0936
Fairness                          -13.6301
Equity                            -13.6523
Beteiligungskapital               -14.2502
Fairneß                           -14.2729
Eigenmitteln                      -14.5058
Eigenkapitals                     -14.8938
Kapitalanlagegesellschaften       -15.2789
Kapitalbeteiligungen              -15.6793
Kapitalbeteiligung                -15.6971

Cosine similarity (Equity)

Eigenkapital                      0.103756
Eigenkapitals                     0.078093
Kapitalbeteiligungen              0.049344
Kapitalbeteiligung                0.045790
Equity                            0.035826
Gerechtigkeit                     0.035676
Beteiligungskapital               0.031068
Gleichheit                        0.027332
Gleichbehandlung                  0.012716
Kapitalanlagegesellschaften       0.012131
Fairness                          0.011412
Fairneß                           0.008297
Eigenmitteln                      0.005969
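
The effect of re-ranking can be checked directly from the two lists. The sketch below sorts the candidates for "equity" by each score and prints both ranks side by side. The numbers are copied from the slide, with decimal commas written as points and the leading minus sign of the translation scores restored; the helper name `rank` is an illustrative choice.

```python
# Scores copied from the slide (decimal commas written as points).
translation_score = {
    "Gerechtigkeit": -10.6227, "Gleichheit": -11.5476, "Eigenkapital": -12.7612,
    "Gleichbehandlung": -13.0936, "Fairness": -13.6301, "Equity": -13.6523,
    "Beteiligungskapital": -14.2502, "Fairneß": -14.2729, "Eigenmitteln": -14.5058,
    "Eigenkapitals": -14.8938, "Kapitalanlagegesellschaften": -15.2789,
    "Kapitalbeteiligungen": -15.6793, "Kapitalbeteiligung": -15.6971,
}
cosine = {
    "Eigenkapital": 0.103756, "Eigenkapitals": 0.078093,
    "Kapitalbeteiligungen": 0.049344, "Kapitalbeteiligung": 0.045790,
    "Equity": 0.035826, "Gerechtigkeit": 0.035676,
    "Beteiligungskapital": 0.031068, "Gleichheit": 0.027332,
    "Gleichbehandlung": 0.012716, "Kapitalanlagegesellschaften": 0.012131,
    "Fairness": 0.011412, "Fairneß": 0.008297, "Eigenmitteln": 0.005969,
}


def rank(scores):
    """Candidates ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


by_probability = rank(translation_score)
by_cosine = rank(cosine)

# Show where the top cosine candidates stood under translation probability.
for candidate in by_cosine[:6]:
    print(candidate,
          "probability rank:", by_probability.index(candidate) + 1,
          "cosine rank:", by_cosine.index(candidate) + 1)
```

With these values the financial sense "Eigenkapital" moves from third place under translation probability to first place under cosine similarity, while "Gerechtigkeit" drops from first to sixth.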

slide-52
SLIDE 52
  • Experiment 1: domain-specific resource generation

    • cross-lingual manual evaluation with financial experts
    • extraction of information in non-parallel resources

  • Experiment 2: contextual-semantic resource generation

    • extending the model to all n-grams
    • examine the structure of the ontology

– part-of relation
– parent-child relation

Future Work

slide-53
SLIDE 53

Thank You