SLIDE 1

QTLEAP: A PROJECT ON MACHINE TRANSLATION BY DEEP LANGUAGE ENGINEERING APPROACHES

ANTÓNIO BRANCO, HANS USZKOREIT, ALJOSCHA BURCHARDT, JAN HAJIČ, MARTIN POPEL, KIRIL SIMOV, PETYA OSENOVA, MARKUS EGG, ENEKO AGIRRE, GERTJAN VAN NOORD, FILIPE BARRANCOS, ROSA DEL GAUDIO, GORKA LABAKA AND JOÃO SILVA

SLIDE 2

António Branco | University of Lisbon META-FORUM 2016 | Lisbon, Jul 4-5, 2016

CONSORTIUM


  1. FCUL: University of Lisbon, Faculty of Sciences (Portugal)
  2. DFKI: German Research Centre for Artificial Intelligence (Germany)
  3. CUNI: Charles University in Prague (Czech Republic)
  4. IICT-BAS: Bulgarian Academy of Sciences (Bulgaria)
  5. UBER: Humboldt University of Berlin (Germany)
  6. UPV/EHU: University of Basque Country (Spain)
  7. UG: University of Groningen (The Netherlands)
  8. HF: Higher Functions, Lda (Portugal)

SLIDE 3

GOAL AND STRATEGY


Exploit an articulated methodology for quality machine translation that explores deep language engineering approaches to language technology.

  • Increasingly deep: towards the deployment of 1+3 MT Pilots based on increasingly deeper language engineering approaches
  • Increasingly real: towards validation and evaluation increasingly closer to the real usage scenario of an ICT consumer devices and services online helpdesk via a chat channel

SLIDE 4


HYBRID AND TRANSFER-BASED

TectoMT System (Žabokrtský et al., 2008)

SLIDE 5

Example – deep representation


SLIDE 6

examples - translations

  • English – Portuguese

– En: In the upper right corner of the Panda panel ...
– Pt ref: No canto superior direito do painel do Panda …
– Pt Pilot 0: Nenhum direito máximos canto fazer painel panda ...
– Pt Pilot 2: Em canto direito superior de Panda painel ...

  • English – Czech

– En: Try using Prezi (www.prezi.com)
– Cs ref: Zkuste použít Prezi (www.Prezi.com)
– Cs Pilot 0: Zkuste Prezi Prezi (www. com).
– Cs Pilot 2: Zkuste použít Prezi (www.Prezi.com)


Annotations on the outputs:

  • wrong tokenization; frequent verb missing
  • article missing; order switched
  • unintelligible: only 4 out-of-order words matching
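
The word-matching remark on the Portuguese outputs can be checked with a small script. The following is a minimal sketch, assuming case-insensitive whitespace tokenization and ignoring the trailing ellipses; the function names are illustrative and not part of any QTLeap tooling.

```python
from collections import Counter


def tokens(sentence: str) -> list[str]:
    """Lowercase and split on whitespace, dropping ellipsis markers."""
    return [t for t in sentence.lower().split() if t not in {"...", "…"}]


def matching_words(hypothesis: str, reference: str) -> int:
    """Count hypothesis words also found in the reference (clipped counts)."""
    hyp, ref = Counter(tokens(hypothesis)), Counter(tokens(reference))
    return sum((hyp & ref).values())


reference = "No canto superior direito do painel do Panda …"
pilot0 = "Nenhum direito máximos canto fazer painel panda ..."
pilot2 = "Em canto direito superior de Panda painel ..."

print(matching_words(pilot0, reference))  # 4
print(matching_words(pilot2, reference))  # 5
```

Under these assumptions, Pilot 0 shares four words with the reference (direito, canto, painel, panda), while Pilot 2 shares five. The count ignores word order, so it does not capture the ordering problems noted for either output.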

SLIDE 7

HUMAN EVALUATION


  • Pilot 2 vs. Pilot 0 (SMT Baseline): EN->X, 100 sentences, 2+ evaluators each
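
One way such a pairwise comparison might be tallied is sketched below. The slide gives only the setup (100 sentences, two or more evaluators each), so the verdict labels, the record layout and the sample judgements are all hypothetical and do not reproduce project results.

```python
from collections import Counter

# Hypothetical judgements: (sentence_id, evaluator_id, verdict), where the
# verdict names the system the evaluator preferred for that sentence.
judgements = [
    (1, "A", "pilot2"),
    (1, "B", "pilot2"),
    (2, "A", "pilot0"),
    (2, "B", "tie"),
    (3, "A", "pilot2"),
    (3, "B", "tie"),
]


def preference_shares(rows):
    """Return the share of judgements per verdict, as fractions summing to 1."""
    counts = Counter(verdict for _, _, verdict in rows)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to aggregate
    return {verdict: n / total for verdict, n in counts.items()}


shares = preference_shares(judgements)
for verdict in ("pilot2", "pilot0", "tie"):
    print(f"{verdict}: {shares.get(verdict, 0.0):.0%}")
```

Counting every judgement separately treats evaluators as independent; an alternative would be to take a majority vote per sentence first.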

SLIDE 8

HUMAN VS AUTOMATIC EVALUATION


  • Pilot 2 vs. Pilot 0 (SMT Baseline): EN->X

SLIDE 9

P3 AND FURTHER USAGE SCENARIO


  1. CA Technologies Development Spain S.A. (CA), Spain
  2. Eleka Ingeniaritza Linguistikoa SL (ELEKA), Spain
  3. GridLine BV (GRIDLINE), The Netherlands
  4. OMQ GmbH (OMQ), Germany
  5. Ontotext AD (ONTOTEXT), Bulgaria
  6. Lingea s.r.o. (LINGEA), Czech Republic
  7. Seznam.cz, a.s. (SEZNAM), Czech Republic
  8. Higher Functions, Lda (HF), Portugal
  9. Mondragon Lingua (MONDRAGON), Spain
  10. text&form (TEXT&FORM), Germany

Advisory Board

SLIDE 10

THANK YOU

SLIDE 11

APPROACH


  • a common vision: produce high-quality outbound MT using more linguistic-intensive results;
  • a common approach: deep processing;
  • a common methodology: hybrid between rule-based and statistical;
  • a common architecture: transfer-based;
  • a common evaluation real-usage scenario: online QA in ICT troubleshooting;
  • a common test dataset: interactions with users in the above real-usage scenario;
  • a common set of evaluation metrics: automatic mainstream metrics supplemented with the Multidimensional Quality Metrics;
  • a common language (English) as target or source for each one of the seven languages in the project;
  • a common path of progression for each language pair, ensuring comparability of the research exercise: every pair is developed along the four Pilots M0-M3

SLIDE 12

Deep MT

  • Deep representation:
    – Tectogrammatical representation following Functional Generative Description (Sgall et al., 1986)
    – Example: tecto representation for a Spanish sentence

No puedo pasar música del ordenador
not I-can copy music from computer
a un disco duro externo.
to a disk hard external.

SLIDE 13


Architecture: Transfer-based

  • TectoMT System (Žabokrtský et al., 2008)

source: Martin Popel
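
A transfer-based system translates in three stages: analysis of the source sentence into an abstract representation, transfer of that representation into the target language, and synthesis of the target sentence. The toy sketch below only illustrates that control flow. It uses a flat node list instead of a tectogrammatical tree, and the `Node` class, the two-word lexicon and the single reordering rule are hypothetical illustrations, not the TectoMT implementation or its API.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A toy abstract node: a lemma plus a part-of-speech label."""
    lemma: str
    pos: str


# Hypothetical English-to-Spanish lexicon and tags, for illustration only.
LEXICON = {"external": "externo", "disk": "disco"}
POS = {"external": "ADJ", "disk": "NOUN"}


def analyse(sentence: str) -> list[Node]:
    """Analysis: map surface words to abstract nodes (here: a tag lookup)."""
    return [Node(w, POS.get(w, "X")) for w in sentence.lower().split()]


def transfer(nodes: list[Node]) -> list[Node]:
    """Transfer: translate lemmas node by node, keeping the structure."""
    return [Node(LEXICON.get(n.lemma, n.lemma), n.pos) for n in nodes]


def synthesise(nodes: list[Node]) -> str:
    """Synthesis: apply a target-side ordering rule (noun before adjective)."""
    out = list(nodes)
    for i in range(len(out) - 1):
        if out[i].pos == "ADJ" and out[i + 1].pos == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(n.lemma for n in out)


print(synthesise(transfer(analyse("external disk"))))  # disco externo
```

The point of the split is that lexical choice happens on the abstract representation, while word order and other surface details are left to the target-side synthesis step.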

SLIDE 14

The Baseline – Pilot 0

  • Baseline system
    – 7 language pairs: phrase-based SMT (Moses)
    – Two models: translation model, (mono, target) language model
  • Training
    – Europarl and other parallel and monolingual corpora
  • Tuning
    – MERT, 1K sentences of in-domain data
  • Evaluation
    – Automatic metrics (as usual: BLEU, METEOR), 1K sentences in-domain
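
BLEU, one of the automatic metrics named above, combines clipped n-gram precisions with a brevity penalty. The following is a simplified single-sentence, single-reference sketch without smoothing, assuming whitespace tokenization. It is meant to show the idea only: the function names are illustrative, and real evaluations compute BLEU at corpus level with an established tool.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU for one sentence pair."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matched = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # no smoothing: an empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # The brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


cs_ref = "Zkuste použít Prezi (www.Prezi.com)"
cs_pilot0 = "Zkuste Prezi Prezi (www. com)."
cs_pilot2 = "Zkuste použít Prezi (www.Prezi.com)"

print(sentence_bleu(cs_pilot2, cs_ref))  # 1.0
print(sentence_bleu(cs_pilot0, cs_ref))  # 0.0
```

On the Czech example from the deck, the Pilot 2 output is identical to the reference and scores 1.0. The Pilot 0 output shares no bigram with the reference, so the unsmoothed score drops to 0.0, which is one reason sentence-level BLEU is usually smoothed or replaced by corpus-level scores.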

SLIDE 15

Baseline – training settings


SLIDE 16

Baseline - datasets

  • Basque
    – 1.5 M sentences bilingual corpora (Elhuyar Foundation, in-domain, etc.)
    – 2.2 M monolingual corpora
  • Bulgarian
    – 600 K bilingual (Europarl, in-domain LibreOffice, etc.)
    – 3.4 M monolingual (+ Bulgarian Ref Corpus)
  • Czech
    – 15 M bilingual (Czech-English Parallel Corpus)
    – 18 M monolingual (+ Europarl, etc.)
  • Dutch
    – 370 K bi & mono (Dutch Parallel Corpus, ½ in-domain)
  • German
    – 4.5 M bi & mono (Europarl, in-domain, etc.)
  • Portuguese
    – 2 M bi & mono (out-domain Europarl)
  • Spanish
    – 15 M bi & mono (Europarl, UN, in-domain, etc.)

SLIDE 17

PILOT 2: WMT-STYLE RANKING
