 
QTLeap: A PROJECT ON MACHINE TRANSLATION BY DEEP LANGUAGE ENGINEERING APPROACHES
António Branco, Hans Uszkoreit, Aljoscha Burchardt, Jan Hajič, Martin Popel, Kiril Simov, Petya Osenova, Markus Egg, Eneko Agirre, Gertjan van Noord, Filipe Barrancos, Rosa Del Gaudio, Gorka Labaka and João Silva
CONSORTIUM
1. FCUL: University of Lisbon, Faculty of Sciences (Portugal)
2. DFKI: German Research Centre for Artificial Intelligence (Germany)
3. CUNI: Charles University in Prague (Czech Republic)
4. IICT-BAS: Bulgarian Academy of Sciences (Bulgaria)
5. UBER: Humboldt University of Berlin (Germany)
6. UPV/EHU: University of Basque Country (Spain)
7. UG: University of Groningen (The Netherlands)
8. HF: Higher Functions, Lda (Portugal)
António Branco | University of Lisbon | META-FORUM 2016 | Lisbon, Jul 4-5, 2016
GOAL AND STRATEGY
Exploit an articulated methodology for quality machine translation that explores deep language engineering approaches to language technology.
• Increasingly deep: towards the deployment of 1+3 MT Pilots based on increasingly deeper language engineering approaches
• Increasingly real: towards validation and evaluation increasingly closer to the real usage scenario, an online helpdesk for ICT consumer devices and services operated via a chat channel
HYBRID AND TRANSFER-BASED
TectoMT System (Žabokrtský et al., 2008)
example – deep representation
examples - translations
• English – Portuguese
– En: In the upper right corner of the Panda panel ...
– Pt ref: No canto superior direito do painel do Panda …
– Pt Pilot 0: Nenhum direito máximos canto fazer painel panda ... (unintelligible: only 4 out-of-order words matching)
– Pt Pilot 2: Em canto direito superior de Panda painel ... (article missing; order switched)
• English – Czech
– En: Try using Prezi (www.prezi.com)
– Cs ref: Zkuste použít Prezi (www.Prezi.com)
– Cs Pilot 0: Zkuste Prezi Prezi (www. com). (wrong tokenization; frequent verb missing)
– Cs Pilot 2: Zkuste použít Prezi (www.Prezi.com)
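The "4 words matching" remark on the Portuguese Pilot 0 output can be checked with a simple word-overlap count. The sketch below is only an illustration, not the project's evaluation procedure: the function name `matching_words` is hypothetical, matching is case-insensitive and ignores word order, and only the sentence fragments shown on the slide are used (the full sentences are truncated there).

```python
from collections import Counter


def matching_words(hypothesis: str, reference: str) -> int:
    """Count hypothesis words that also occur in the reference.

    Matching is case-insensitive, ignores word order, and is clipped:
    a word counts at most as often as it appears in the reference.
    """
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    return sum((hyp & ref).values())


# Fragments as shown on the slide; the trailing "..." is left out.
reference = "No canto superior direito do painel do Panda"
pilot0 = "Nenhum direito máximos canto fazer painel panda"
pilot2 = "Em canto direito superior de Panda painel"

print(matching_words(pilot0, reference))  # 4
print(matching_words(pilot2, reference))  # 5
```

A count like this says nothing about word order, which is why the Pilot 2 output still carries the "order switched" remark even though it matches more words.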
HUMAN EVALUATION
• Pilot 2 vs. Pilot 0 (SMT Baseline): EN->X, 100 sentences, 2+ evaluators each
HUMAN VS AUTOMATIC EVALUATION
• Pilot 2 vs. Pilot 0 (SMT Baseline): EN->X
P3 AND FURTHER USAGE SCENARIO
Advisory Board
1. CA: CA Technologies Development Spain S.A. (Spain)
2. ELEKA: Eleka Ingeniaritza Linguistikoa SL (Spain)
3. GRIDLINE: GridLine BV (The Netherlands)
4. OMQ: OMQ GmbH (Germany)
5. ONTOTEXT: Ontotext AD (Bulgaria)
6. LINGEA: Lingea s.r.o. (Czech Republic)
7. SEZNAM: Seznam.cz, a.s. (Czech Republic)
8. HF: Higher Functions, Lda (Portugal)
9. MONDRAGON: Mondragon Lingua (Spain)
10. TEXT&FORM: text&form (Germany)
THANK YOU
APPROACH
• a common vision: produce high-quality outbound MT using more linguistic-intensive results
• a common approach: deep processing
• a common methodology: hybrid between rule-based and statistical
• a common architecture: transfer-based
• a common evaluation real-usage scenario: online QA in ICT troubleshooting
• a common test dataset: interactions with users in the above real-usage scenario
• a common set of evaluation metrics: automatic mainstream metrics supplemented with the multidimensional quality metrics
• a common language (English) as target or source for each one of the seven languages in the project
• a common path of progression for each language pair, ensuring comparability of the research exercise: every pair is developed along the four Pilots M0-M3
Deep MT
• Deep representation:
– Tectogrammatical representation following Functional Generative Description (Sgall et al., 1986)
– Example: tecto representation for a Spanish sentence
  No puedo pasar música del ordenador
  not I-can copy music from computer
  a un disco duro externo.
  to a disk hard external.
Architecture: Transfer-based
• TectoMT System (Žabokrtský et al., 2008)
source: Martin Popel
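A transfer-based system of this kind works in three stages: analysis of the source sentence into a deep representation, transfer of that representation into the target language, and synthesis of the target surface string. The sketch below is a toy illustration of that stage structure only and is not TectoMT code. The `Node` class, the three-word `LEXICON`, and the head-first analysis rule are all assumptions made for the example, which uses the noun phrase "disco duro externo" from the Spanish sentence in this deck.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical deep-syntax node: a lemma plus its dependent nodes."""
    lemma: str
    children: list = field(default_factory=list)


# Toy lexicon assumed for this example; not taken from the project.
LEXICON = {"disco": "disk", "duro": "hard", "externo": "external"}


def analyse(phrase: str) -> Node:
    """Toy analysis: the first word is the head noun, the rest are its modifiers."""
    head, *modifiers = phrase.lower().split()
    return Node(head, [Node(m) for m in modifiers])


def transfer(node: Node) -> Node:
    """Translate lemmas node by node while keeping the tree shape unchanged."""
    return Node(
        LEXICON.get(node.lemma, node.lemma),
        [transfer(child) for child in node.children],
    )


def synthesise(node: Node) -> str:
    """Toy English synthesis: modifiers go before the head, closest modifier last."""
    words = [synthesise(child) for child in reversed(node.children)]
    return " ".join(words + [node.lemma])


print(synthesise(transfer(analyse("disco duro externo"))))  # external hard disk
```

Because word order is decided only in the synthesis step, the transfer step never has to deal with the Spanish noun-adjective order, which is the kind of reordering that the word-by-word gloss "disk hard external" gets wrong.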
The Baseline – Pilot 0
• Baseline system
– 7 language pairs: phrase-based SMT (Moses)
• Two models: translation model, (mono, target) language model
• Training
– Europarl and other parallel and monolingual corpora
• Tuning
– MERT, 1K sentences of in-domain data
• Evaluation
– Automatic metrics (as usual: BLEU, METEOR), 1K sentences in-domain
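In a phrase-based system such as the baseline above, the translation model and the language model are combined in a log-linear score, and tuning (here MERT) sets the weight of each model. The sketch below is a simplified illustration with only those two features; a real Moses setup uses more features, and the candidate names and probabilities are invented for the example.

```python
import math


def loglinear_score(tm_prob: float, lm_prob: float, w_tm: float, w_lm: float) -> float:
    """Weighted sum of log-probabilities from the two models."""
    return w_tm * math.log(tm_prob) + w_lm * math.log(lm_prob)


# Hypothetical candidates: (label, translation-model prob, language-model prob).
candidates = [
    ("candidate A", 0.30, 0.001),  # favoured by the translation model, disfluent
    ("candidate B", 0.20, 0.010),  # slightly less literal, more fluent
]


def best(candidates, w_tm: float, w_lm: float) -> str:
    """Return the label of the highest-scoring candidate under the given weights."""
    return max(candidates, key=lambda c: loglinear_score(c[1], c[2], w_tm, w_lm))[0]


print(best(candidates, w_tm=1.0, w_lm=0.0))  # candidate A
print(best(candidates, w_tm=1.0, w_lm=1.0))  # candidate B
```

Changing the weights changes which candidate wins, which is why tuning is done on in-domain sentences: the weights are chosen so that the winning candidates score well on the automatic metric for that domain.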
Baseline – training settings
Baseline - datasets
• Basque
– 1.5 M sentences bilingual corpora (Elhuyar Foundation, in-domain, etc.)
– 2.2 M monolingual corpora
• Bulgarian
– 600 K bilingual (Europarl, in-domain LibreOffice, etc.)
– 3.4 M monolingual (+ Bulgarian Ref Corpus)
• Czech
– 15 M bilingual (Czech-English Parallel Corpus)
– 18 M monolingual (+ Europarl, etc.)
• Dutch
– 370 K bi & mono (Dutch Parallel Corpus, ½ in-domain)
• German
– 4.5 M bi & mono (Europarl, in-domain, etc.)
• Portuguese
– 2 M bi & mono (out-of-domain Europarl)
• Spanish
– 15 M bi & mono (Europarl, UN, in-domain, etc.)