QTLEAP: A PROJECT ON MACHINE TRANSLATION BY DEEP LANGUAGE ENGINEERING APPROACHES

SLIDE 1

QTLEAP: A PROJECT ON MACHINE TRANSLATION

BY DEEP LANGUAGE ENGINEERING APPROACHES

ANTÓNIO BRANCO, HANS USZKOREIT, ALJOSCHA BURCHARDT, JAN HAJIČ, MARTIN POPEL, KIRIL SIMOV, PETYA OSENOVA, MARKUS EGG, ENEKO AGIRRE, GERTJAN VAN NOORD, FILIPE BARRANCOS, ROSA DEL GAUDIO, GORKA LABAKA AND JOÃO SILVA

SLIDE 2

António Branco | University of Lisbon META-FORUM 2016 | Lisbon, Jul 4-5, 2016

CONSORTIUM

1. FCUL: University of Lisbon, Faculty of Sciences (Portugal)
2. DFKI: German Research Centre for Artificial Intelligence (Germany)
3. CUNI: Charles University in Prague (Czech Republic)
4. IICT-BAS: Bulgarian Academy of Sciences (Bulgaria)
5. UBER: Humboldt University of Berlin (Germany)
6. UPV/EHU: University of Basque Country (Spain)
7. UG: University of Groningen (The Netherlands)
8. HF: Higher Functions, Lda (Portugal)

SLIDE 3

GOAL AND STRATEGY

Exploit an articulated methodology for quality machine translation that explores deep language engineering approaches to language technology.

Increasingly deep: towards the deployment of 1+3 MT Pilots based on increasingly deeper language engineering approaches.

Increasingly real: towards validation and evaluation increasingly closer to the real usage scenario of an online helpdesk for ICT consumer devices and services, run via a chat channel.

SLIDE 4

HYBRID AND TRANSFER-BASED

TectoMT System (Žabokrtský et al., 2008)

SLIDE 5

example – deep representation

SLIDE 6

examples - translations

English – Portuguese

– En: In the upper right corner of the Panda panel ...
– Pt ref: No canto superior direito do painel do Panda …
– Pt Pilot 0: Nenhum direito máximos canto fazer painel panda ...
– Pt Pilot 2: Em canto direito superior de Panda painel ...

English – Czech

– En: Try using Prezi (www.prezi.com)
– Cs ref: Zkuste použít Prezi (www.Prezi.com)
– Cs Pilot 0: Zkuste Prezi Prezi (www. com).
– Cs Pilot 2: Zkuste použít Prezi (www.Prezi.com)

Annotations on the slide:

wrong tokenization; frequent verb missing
article missing; order switched
unintelligible: only 4 out of [...] words matching
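The word-matching remark on the English – Portuguese example can be checked with a few lines of Python. This is only a rough sketch: the whitespace tokenizer, the lowercasing and the function names are assumptions made for illustration, and the sentence fragments are used without their trailing ellipses.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def matching_words(hypothesis: str, reference: str) -> int:
    """Count hypothesis words that also occur in the reference.

    Counts are clipped: a word repeated in the hypothesis only matches
    as many times as it appears in the reference.
    """
    hyp = Counter(tokens(hypothesis))
    ref = Counter(tokens(reference))
    return sum((hyp & ref).values())


reference = "No canto superior direito do painel do Panda"
pilot0 = "Nenhum direito máximos canto fazer painel panda"
pilot2 = "Em canto direito superior de Panda painel"

for name, hyp in [("Pilot 0", pilot0), ("Pilot 2", pilot2)]:
    print(name, matching_words(hyp, reference), "of", len(tokens(hyp)))
```

On these fragments the count is 4 of 7 words for Pilot 0 and 5 of 7 for Pilot 2. A bag-of-words count ignores word order, so it does not capture the switched order noted for Pilot 2.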

SLIDE 7

HUMAN EVALUATION

Pilot 2 vs. Pilot 0 (SMT Baseline): EN->X, 100 sentences, 2+ evaluators each
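The slide does not say how the individual judgments were aggregated. One simple possibility, sketched here with made-up data, is to count how often evaluators preferred each system. The judgment tuples, labels and function name below are all hypothetical and do not reproduce the project's results.

```python
from collections import Counter

# Hypothetical judgments: (sentence_id, evaluator, verdict). The verdict names
# the system whose output the evaluator preferred, or "tie".
judgments = [
    (1, "A", "pilot2"), (1, "B", "pilot2"),
    (2, "A", "pilot0"), (2, "B", "tie"),
    (3, "A", "tie"), (3, "B", "pilot2"),
]


def preference_shares(rows):
    """Return the fraction of judgments falling under each verdict."""
    counts = Counter(verdict for _, _, verdict in rows)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("pilot2", "pilot0", "tie")}


shares = preference_shares(judgments)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

With 100 sentences and two or more evaluators per sentence, the same function would run over 200 or more rows per language pair.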

SLIDE 8

HUMAN VS AUTOMATIC EVALUATION

Pilot 2 vs. Pilot 0 (SMT Baseline): EN->X

SLIDE 9

P3 AND FURTHER USAGE SCENARIO

1. CA Technologies Development Spain S.A. (CA), Spain
2. Eleka Ingeniaritza Linguistikoa SL (ELEKA), Spain
3. GridLine BV (GRIDLINE), The Netherlands
4. OMQ GmbH (OMQ), Germany
5. Ontotext AD (ONTOTEXT), Bulgaria
6. Lingea s.r.o. (LINGEA), Czech Republic
7. Seznam.cz, a.s. (SEZNAM), Czech Republic
8. Higher Functions, Lda (HF), Portugal
9. Mondragon Lingua (MONDRAGON), Spain
10. text&form (TEXT&FORM), Germany

Advisory Board

SLIDE 10

THANK YOU

SLIDE 11

APPROACH

a common vision: produce high-quality outbound MT using more linguistic-intensive results;
a common approach: deep processing;
a common methodology: hybrid between rule-based and statistical;
a common architecture: transfer-based;
a common real-usage evaluation scenario: online QA in ICT troubleshooting;
a common test dataset: interactions with users in the above real-usage scenario;
a common set of evaluation metrics: automatic mainstream metrics supplemented with the multidimensional quality metrics;
a common language (English) as target or source for each one of the seven languages in the project;
a common path of progression for each language pair, ensuring comparability of the research exercise: every pair is developed along the four Pilots M0-M3.

SLIDE 12

Deep MT

Deep representation:

– Tectogrammatical representation following Functional Generative Description (Sgall et al., 1986)
– Example: tecto representation for a Spanish sentence

No puedo pasar música del ordenador a un disco duro externo.
not I-can copy music from computer to a disk hard external.

SLIDE 13

Architecture: Transfer-based

TectoMT System (Žabokrtský et al., 2008)

source: Martin Popel
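The three stages of a transfer-based architecture (analysis, transfer, synthesis) can be illustrated with a toy sketch on the noun phrase "disco duro externo" from the Spanish example. This is not TectoMT code: the `Node` class, the three functions and the tiny lexicon are hypothetical, and the analysis step simply assumes the first word is the head noun.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A content word with its dependents (a toy stand-in for a deep tree node)."""
    lemma: str
    children: list["Node"] = field(default_factory=list)


# Hypothetical transfer lexicon, taken from the gloss of the Spanish example.
LEXICON = {"disco": "disk", "duro": "hard", "externo": "external"}


def analyse(phrase: str) -> Node:
    """Analysis: treat the first word as the head noun, the rest as modifiers."""
    head, *modifiers = phrase.lower().split()
    return Node(head, [Node(m) for m in modifiers])


def transfer(node: Node) -> Node:
    """Transfer: translate lemmas node by node, keeping the tree shape."""
    return Node(LEXICON[node.lemma], [transfer(c) for c in node.children])


def synthesise(node: Node) -> str:
    """Synthesis: place the adjectives before the noun, outermost first."""
    words = [c.lemma for c in reversed(node.children)] + [node.lemma]
    return " ".join(words)


print(synthesise(transfer(analyse("disco duro externo"))))  # external hard disk
```

The point of the split is that word-order differences are handled in synthesis, so the transfer step only has to map one tree onto another.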

SLIDE 14

The Baseline – Pilot 0

Baseline system

– 7 language pairs: phrase-based SMT (Moses)

Two models: translation model, (mono, target) language model

Training

– Europarl and other parallel and monolingual corpora

Tuning

– MERT, 1K sentences of in-domain data

Evaluation

– Automatic metrics (as usual: BLEU, METEOR), 1K sentences in-domain
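To make the interplay of the two models concrete, the sketch below scores candidate translations with a weighted sum of log-probabilities, the log-linear form used in phrase-based SMT. All probabilities and weights are invented for illustration. A real Moses system uses further features (reordering, word penalty and others) and sets the weights by tuning, for example with MERT.

```python
import math

# Hypothetical toy probabilities for two Czech candidates of "Try using Prezi".
candidates = {
    "Zkuste použít Prezi": {"tm": 0.20, "lm": 0.050},
    "Zkuste Prezi Prezi": {"tm": 0.25, "lm": 0.002},
}


def score(features, weights):
    """Log-linear score: weighted sum of the log-probabilities of each model."""
    return sum(weights[name] * math.log(p) for name, p in features.items())


def best_candidate(weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: score(candidates[c], weights))


# With both models active, the fluent candidate wins.
print(best_candidate({"tm": 1.0, "lm": 1.0}))
# With the language model switched off, the translation model alone decides.
print(best_candidate({"tm": 1.0, "lm": 0.0}))
```

In this toy setting the choice flips when the language model weight is set to zero. That sensitivity to the weights is why they are tuned on in-domain data.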

SLIDE 15

Baseline – training settings

SLIDE 16

Baseline - datasets

Basque
1.5 M sentences bilingual corpora (Elhuyar Foundation, in-domain, etc.)
2.2 M monolingual corpora
Bulgarian
600 K bilingual (Europarl, in-domain LibreOffice, etc)
3.4 M monolingual (+ Bulgarian Ref Corpus)
Czech
15 M bilingual (Czech-English Parallel Corpus)
18 M monolingual (+ Europarl, etc)
Dutch
370 K bi & mono (Dutch Parallel Corpus, ½ in-domain)
German
4.5 M bi & mono (Europarl, in-domain, etc.)
Portuguese
2 M bi & mono (out-domain Europarl)
Spanish
15 M bi & mono (Europarl, UN, in-domain, etc.)

SLIDE 17

PILOT 2: WMT-STYLE RANKING