Monnet is supported by the European Union under Grant No. 248458
Experiments in Term Translation
Mihael Arcan, DERI, NUI Galway
Supervised by Dr. Paul Buitelaar
- Generation of ‘multilingual’ ontologies
– most of the ontologies are in English language
– terms need to be translated
Motivation
- Monnet Project
- Research
– building domain-specific resources
- architecture
- domain-specific resources
- results and evaluation
- main findings
– term disambiguation
- building a contextual-semantic resource
– general parallel resource
– ontology
– future work
Overview
Monnet
Business Information in EN, DE, NL, ES etc.
http://www.monnet-project.eu
Research Objectives
- Development and use of ‘multilingual ontologies’
– ontologies with rich multilingual descriptors
- Exploit ‘domain semantics’ to improve Machine Translation
– use of ontological, terminological, linguistic knowledge
Use Cases
- Financial Use Case
– Cross-lingual Business Intelligence
- Public Services Use Case
– Multilingual Access to Government Information
Research Objectives & Use Cases
Harmonizing Business Registration across Europe
- The XBRL (eXtensible Business Reporting Language) Europe Working Group works with Monnet on the xEBR taxonomy
- The xEBR (XBRL European Business Register) taxonomy defines common concepts with mappings to country/language-specific taxonomies
- National Bank of Belgium (Belgium)
- Eogs / DCCA (Denmark)
- Registrite ja infosüsteemide Keskus eRik (Estonia)
- Bilans Service - Infogreffe (France)
- Bundesanzeiger (Germany)
- Infocamere (Italy)
- RSCL (Luxembourg)
- Kamer van Koophandel (Netherlands)
- Informa DB – Colegio de Registradores (Spain)
- Bolagsverket (Sweden)
- Companies House (United Kingdom)
- EBR (Europe)
- GBR (Global)
- IASCF
- Bank of Spain
- Software – Audit – Consulting
Financial Use Case
Public Services Use Case
Translation of Dutch regulation (legal ontology) into several EU languages:
- Immigration law
- Tax law
- Student benefit law
- Health care benefit law
- Social security law
- Law on higher education
- Term translation in isolation (no document or sentence context)
– Experiment 1: domain-specific resources generation
  - addressing out-of-vocabulary issue
– Experiment 2: contextual-semantic resource generation
  - term disambiguation
Basic Ideas of my research
- Building and exploiting domain-specific resources
Experiment 1
[1] http://www.linguee.com/
[2] http://en.wikipedia.org/
Architecture of Experiment 1
[Diagram: architecture of Experiment 1]
- xEBR Taxonomy: Extraction of financial labels
- Domain-specific parallel corpus generation: Extraction of a parallel resource, Phrase Table generation
- Cross-Lingual Lexicon generation: Wikipedia Title extraction, Generation of financial lexicon
- Decoding process
Domain-specific parallel corpus generation
[Diagram] xEBR Taxonomy: Extraction of financial labels
- Linguee (http://www.linguee.com/): Querying financial labels, Parsing HTML files
- Phrase Table generation
- Decoding process
Cross-Lingual Lexicon Generation
[Diagram] xEBR Taxonomy: Extraction of financial labels
- Wikipedia Title extraction
- Generation of financial lexicon
- Decoding process
Linguee parallel corpus on xEBR EN terms:
- English: 24K sentences (1M tokens)
- German: 24K sentences (0.85M tokens)
EuroParl (version 6): 1.7M sentences, 43M English, 40M German tokens
JRC Acquis (version 3.0): 1.2M sentences, 32M English, 29M German tokens
Wikipedia lexicon generation:
- 7334 translations (translation pairs) from English to German and the other way around
– Current assets <-> Umlaufvermögen
– Balance sheet <-> Bilanz
– Unpaid calls on subscribed capital
  - capital -> Kapital
– Social security, post-employment and other employee benefit costs
  - employee benefit -> Sachbezug
Domain-Specific resource Generation Overview
Automatic Evaluation
Translation source                              Translation Direction   Exact Translation   BLEU     Meteor
Acquis                                          English to German        9                  0.1267   0.4795
Acquis                                          German to English       12                  0.1673   0.3726
Europarl                                        English to German        5                  0.0207   0.4120
Europarl                                        German to English        4                  0.1132   0.3258
Linguee                                         English to German       15                  0.3641   0.6309
Linguee                                         German to English       25                  0.3471   0.4084
Linguee + Domain-specific lexicon substitution  English to German       22                  0.3479   0.6438
Linguee + Domain-specific lexicon substitution  German to English       25                  0.3237   0.4315
Google Translate                                English to German       21                  0.4517   0.6410
Google Translate                                German to English       18                  0.2640   0.3688
- Evaluation on 63 English-German financial labels
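The "Exact Translation" column can be read as the number of system outputs that are identical to the reference label. A minimal sketch of such a count follows, assuming case-insensitive comparison after trimming whitespace (the slides do not state the exact matching rule). The function name `exact_match_count` is hypothetical, and the example pairs are taken from term pairs shown elsewhere in this deck, not from the 63-label test set.

```python
def exact_match_count(hypotheses, references):
    """Count hypotheses identical to their reference.

    Assumption: comparison is case-insensitive and ignores
    leading/trailing whitespace; the slides do not specify this.
    """
    return sum(
        hyp.strip().lower() == ref.strip().lower()
        for hyp, ref in zip(hypotheses, references)
    )


# Illustrative (hypothesis, reference) pairs from this deck.
hypotheses = ["Langfristige finanzielle Vermögenswerte", "Finanzergebnis", "Bilanz"]
references = ["Finanzanlagen", "Finanz- und Beteiligungsergebnis", "Bilanz"]

matches = exact_match_count(hypotheses, references)
print(matches)
```

BLEU and Meteor give partial credit for overlapping words, which is why a system can score well on them while producing few exact matches.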
Mono-lingual Human Evaluation
A = Acceptable, C = Can easily be fixed, N = None of both

                    Translation into German    Translation into English
                    A      C      N            A      C      N
Linguee+Wikipedia   58%    27%    15%          56%    32%    12%
Google Translate    55%    31%    14%          56%    31%    13%
Linguee             51%    37%    12%          39%    40%    21%
JRC-Acquis          32%    28%    40%          39%    31%    30%
Europarl             5%    25%    70%          15%    30%    55%
Manual Evaluation of Translation Quality
- Evaluation on 63 English-German financial labels
Agreement Metric    Translation into German          Translation into English
                    S      π      κ      α           S      π      κ      α
Linguee+Wikipedia   0.599  0.528  0.533  0.530       0.532  0.452  0.457  0.454
Google Translate    0.698  0.655  0.657  0.657       0.480  0.460  0.465  0.463
Linguee             0.484  0.416  0.437  0.419       0.599  0.537  0.540  0.539
JRC-Acquis          0.412  0.406  0.413  0.408       0.363  0.359  0.366  0.360
Europarl            0.515  0.270  0.269  0.273       0.552  0.493  0.499  0.495
Annotator agreement scores
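To make the agreement coefficients more concrete, here is a minimal sketch of three of them for the simplest case of two annotators assigning the labels A, C and N. It assumes that S refers to Bennett's S, π to Scott's pi and κ to Cohen's kappa; the slides do not say how many annotators took part or which multi-annotator variants were used, so this is an illustration and not the evaluation script. All function names and the toy annotations are hypothetical.

```python
from collections import Counter


def observed_agreement(a, b):
    """Fraction of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def bennett_s(a, b, num_categories):
    """Chance agreement assumes all categories are equally likely."""
    p_e = 1 / num_categories
    return (observed_agreement(a, b) - p_e) / (1 - p_e)


def scott_pi(a, b):
    """Chance agreement uses one label distribution pooled over both annotators."""
    pooled = Counter(a) + Counter(b)
    total = 2 * len(a)
    p_e = sum((count / total) ** 2 for count in pooled.values())
    return (observed_agreement(a, b) - p_e) / (1 - p_e)


def cohen_kappa(a, b):
    """Chance agreement uses a separate label distribution per annotator."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum((count_a[k] / n) * (count_b[k] / n) for k in set(a) | set(b))
    return (observed_agreement(a, b) - p_e) / (1 - p_e)


# Toy annotations (made up for illustration, not the study data).
annotator_1 = ["A", "A", "C", "N", "A", "C"]
annotator_2 = ["A", "C", "C", "N", "A", "A"]

s_value = bennett_s(annotator_1, annotator_2, num_categories=3)
pi_value = scott_pi(annotator_1, annotator_2)
kappa_value = cohen_kappa(annotator_1, annotator_2)
print(s_value, pi_value, kappa_value)
```

The three coefficients differ only in how expected chance agreement is estimated, which is why the S column in the table is consistently the highest: a uniform chance model gives the smallest correction when labels are unevenly distributed.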
substantial agreement
Cross-Lingual Human Evaluation
Translation into German   Acceptable   Can easily be fixed   None of both
Linguee+Wikipedia         59.15%       29.34%                11.50%
Manual Evaluation of Translation Quality
- Evaluation on 142 English financial labels
Agreement Metric (Translation into German)   S      π      κ      α
Linguee+Wikipedia                            0.467  0.355  0.357  0.355
fair agreement
- Reference reduces semantics
  Source: Long-term financial assets
  Reference: Finanzanlagen
  Translation: Langfristige finanzielle Vermögenswerte
- Reference adds semantics
  Source: Financial result
  Reference: Finanz- und Beteiligungsergebnis
  Translation: Finanzergebnis
- Domain training needed
  Source: equity
  Reference: Eigenkapital
  Translation (Google Translate): Gerechtigkeit
Discussion
- Domain-specific resource gives better results than a bigger, but more general one
- Linguee parallel corpus + domain-specific multilingual Wikipedia outperform Google Translate for translating German terms into English
Main Findings of Experiment 1
Experiment 2: Term Disambiguation
Experiment 1:
- using the ontology vocabulary to generate new domain-specific resources
- domain-specific resource (Linguee / Wikipedia)
Experiment 2:
- using the ontology vocabulary for term disambiguation
- general resource (Europarl)
Experiment 2: Term Disambiguation
equity ||| Eigenkapital ||| 0.0635181, 0.207921, 0.0232247, 0.0603448, ...
equity ||| Gerechtigkeit ||| 0.0362212, 0.0416455, 0.325991, 0.235632, ...
p(equity|Gerechtigkeit) > p(equity|Eigenkapital)
Experiment 2: Term Disambiguation
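The phrase-table entries shown for "equity" can be read programmatically. The sketch below parses lines of the form `source ||| target ||| scores` and ranks the candidates by one score. Which position holds which probability depends on the toolkit configuration, so the default `score_index=2` is an assumption made to reproduce the ordering stated on the slide; the function names are hypothetical, and the trailing "..." of the slide entries is left out of the sample lines.

```python
def parse_phrase_table_line(line):
    """Split 'source ||| target ||| scores' into its three fields."""
    fields = [field.strip() for field in line.split("|||")]
    source, target = fields[0], fields[1]
    # Scores may be separated by spaces or by commas (as on the slide).
    scores = [float(s) for s in fields[2].replace(",", " ").split()]
    return source, target, scores


def rank_candidates(lines, source_term, score_index=2):
    """Return (target, score) pairs for source_term, best score first.

    Assumption: score_index=2 selects the probability compared on the slide.
    """
    candidates = []
    for line in lines:
        source, target, scores = parse_phrase_table_line(line)
        if source == source_term:
            candidates.append((target, scores[score_index]))
    return sorted(candidates, key=lambda item: item[1], reverse=True)


lines = [
    "equity ||| Eigenkapital ||| 0.0635181 0.207921 0.0232247 0.0603448",
    "equity ||| Gerechtigkeit ||| 0.0362212 0.0416455 0.325991 0.235632",
]

ranking = rank_candidates(lines, "equity")
print(ranking[0][0])
```

With a general-domain phrase table the financial sense loses this ranking, which is the disambiguation problem Experiment 2 addresses.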
... to decrease significant the negative equity of the Group short-term and ...
... so zu einer Verringerung erhebliche negative Eigenkapital der Fraktion kurzfristige und ...
- Generation of a contextual-semantic resource
– concentrating only on unigrams
Experiment 2: Term Disambiguation
equity ||| @-@ Equity ...
equity ||| Beteiligungskapital ...
equity ||| Eigen- ...
equity ||| Eigenkapital ...
equity ||| Eigenkapitalquote ...
equity ||| Eigenkapitals ...
equity ||| Eigenmittel ...
equity ||| Eigenmitteln ...
equity ||| Equity ...
equity ||| Fairness ...
equity ||| Fairneß ...
equity ||| Gerechtigkeit , ...
equity ||| Gerechtigkeit ...
equity ||| Geschlechtergleichstellung ...
equity ||| Gleichbehandlung ...
equity ||| Gleichheit ...
equity ||| Gleichheitsprinzips ...
equity ||| Kapitalanlagegesellschaften ...
equity ||| Kapitalbeteiligung ...
equity ||| Kapitalbeteiligungen ...
equity ||| Stammkapital ...
equity ||| der Gerechtigkeit ...
equity ||| der Gleichheit ...
equity ||| die Gerechtigkeit ...
equity ||| die Gleichbehandlung ...
equity ||| gerechten ...
equity ||| gerechtes ...
equity ||| gleichberechtigten ...
...
Europarl Phrase Table for Equity
- ... please tell me which agreement or conclusion of equity or redistribution ...
- ... sagen Sie mir bitte , in welchem Übereinkommen oder welcher Schlussfolgerung die Rede von Gerechtigkeit oder Umverteilung ist ...
Generating a contextual-semantic resource
Source Label   Target Label    Context (frequency + 1)
equity         Gerechtigkeit   please, tell, me, agreement, conclusion, redistribution
- ... Minority interests relate to the percentage of equity of subsidiaries owned by third-party shareholders ...
- ... Minderheitsanteile betreffen die von Drittaktionären gehaltenen Anteile am Eigenkapital von Tochtergesellschaften ...

Source Label   Target Label    Context (frequency + 1)
equity         Eigenkapital    minority, subsidiaries, interests, owned, relate, third-party, percentage, shareholders
Generating a contextual-semantic resource
Source label   Target label    Context (frequency)
equity         Eigenkapital    capital(43), SMEs(8), risk(7), enterprises(7), tax(6), banks(6), finance(5), European(5) . . . first(1), problematic(1), candidate(1), industries(1)
equity         Gerechtigkeit   social(24), efficiency(21), education(17), European(15), justice(15), training(11), principle(11), quality(10), system(9) . . . migration(1), information(1), directive(1), inspection(1)
...
Generating a contextual-semantic resource
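The way the contextual-semantic resource is collected can be sketched as follows: for every aligned sentence pair in which the source term appears on the English side and a candidate translation appears on the German side, the remaining English words are counted as context for that translation. This is a simplified reading of the slides. The tokenizer, the tiny stopword list and the function name `build_context_resource` are assumptions for illustration, and matching against the German side is a plain token lookup in place of real word alignment.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not the list used in the experiment).
STOPWORDS = {"which", "or", "of", "the", "to", "by"}


def tokenize(text):
    """Lowercase and keep word characters and hyphens (e.g. 'third-party')."""
    return re.findall(r"[\w-]+", text.lower())


def build_context_resource(sentence_pairs, source_term, target_terms):
    """Map each target term to a Counter of context words from the source side."""
    resource = defaultdict(Counter)
    for source_sentence, target_sentence in sentence_pairs:
        source_tokens = tokenize(source_sentence)
        if source_term not in source_tokens:
            continue
        target_tokens = set(tokenize(target_sentence))
        for target in target_terms:
            if target.lower() in target_tokens:
                resource[target].update(
                    token for token in source_tokens
                    if token != source_term and token not in STOPWORDS
                )
    return resource


pairs = [
    ("please tell me which agreement or conclusion of equity or redistribution",
     "sagen Sie mir bitte , in welchem Übereinkommen oder welcher "
     "Schlussfolgerung die Rede von Gerechtigkeit oder Umverteilung ist"),
    ("Minority interests relate to the percentage of equity of subsidiaries "
     "owned by third-party shareholders",
     "Minderheitsanteile betreffen die von Drittaktionären gehaltenen "
     "Anteile am Eigenkapital von Tochtergesellschaften"),
]

resource = build_context_resource(pairs, "equity", ["Gerechtigkeit", "Eigenkapital"])
print(sorted(resource["Eigenkapital"]))
```

Run over a whole corpus, the counters accumulate into frequency-weighted context vectors of the kind shown in the table above.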
Source: UK GAAP
Ontology vocabulary:
- other non financial creditors after more than one year(1), social security costs(1), other value adjustments and provisions(1), net financial income(1)...
- assets(24), year(24), charges(20), financial(20), income(18), payable(14), amounts(12), fixed(11), operating(11), taxes(9) . . .
Financial vocabulary of the UK GAAP ontology
Cosine similarity
[Figure: cosine similarity between the Ontology Context and each context extracted from Europarl (Equity-Gerechtigkeit, Equity-Eigenkapital)]
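The comparison in the figure can be sketched with sparse count vectors: the context vector of each candidate translation is compared with the vector built from the ontology vocabulary, and the candidate with the higher cosine is preferred. The counts below are made-up toy values chosen so that the vectors overlap; they are not the Europarl or UK GAAP counts, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[key] * vec_b[key] for key in shared)
    norm_a = math.sqrt(sum(value * value for value in vec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty context has no direction
    return dot / (norm_a * norm_b)


# Toy counts for illustration only.
ontology_context = Counter({"assets": 5, "capital": 3, "tax": 2})
candidate_contexts = {
    "Eigenkapital": Counter({"capital": 43, "risk": 7, "tax": 6}),
    "Gerechtigkeit": Counter({"social": 24, "justice": 15, "tax": 1}),
}

scores = {
    target: cosine_similarity(ontology_context, context)
    for target, context in candidate_contexts.items()
}
best = max(scores, key=scores.get)
print(best)
```

Because cosine normalizes by vector length, a candidate with a large but off-domain context (many occurrences in parliamentary debate) does not win simply by being frequent.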
Translation Probability vs Cosine similarity
Translation probability (Equity)          p(e|f)
Gerechtigkeit (justice, equity)           -10.6227
Gleichheit (equality, equity)             -11.5476
Eigenkapital (equity)                     -12.7612
Gleichbehandlung                          -13.0936
Fairness                                  -13.6301
Equity                                    -13.6523
Beteiligungskapital                       -14.2502
Fairneß                                   -14.2729
Eigenmitteln                              -14.5058
Eigenkapitals                             -14.8938
Kapitalanlagegesellschaften               -15.2789
Kapitalbeteiligungen                      -15.6793
Kapitalbeteiligung                        -15.6971

Cosine similarity (Equity)                Cosine
Eigenkapital                              0.103756
Eigenkapitals                             0.078093
Kapitalbeteiligungen                      0.049344
Kapitalbeteiligung                        0.045790
Equity                                    0.035826
Gerechtigkeit                             0.035676
Beteiligungskapital                       0.031068
Gleichheit                                0.027332
Gleichbehandlung                          0.012716
Kapitalanlagegesellschaften               0.012131
Fairness                                  0.011412
Fairneß                                   0.008297
Eigenmitteln                              0.005969
Future Work
- Experiment 1: domain-specific resource generation
– cross-lingual manual evaluation with financial experts
– extraction of information in non-parallel resources
- Experiment 2: contextual-semantic resource generation
– extending the model to all n-grams
– examine the structure of the ontology