Deep Linguistic Information in Hybrid Machine Translation


  1. Deep Linguistic Information in Hybrid Machine Translation
     Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Czech Republic

  2. Outline: From Data To an MT System
     “DeepBank”: The Prague Czech-English Dependency Treebank (2.0)
       – Texts, annotation style(s), alignment, tools
     The platform: Treex
     TectoMT: hybrid MT, English → Czech
       – The (old) idea
       – Overall design
       – Core modules
     (A Speculation on) The Future
     Dec. 8, 2012, Hybrid MT Workshop, Coling 2012

  3. The Prague Czech-English Dependency Treebank (PCEDT) 2.0
     Parallel treebank
     Dependency style (“Prague”)
       – (surface) syntax
       – syntax & semantics (“tectogrammatics”)
     Figure labels: surface syntax; syntax & semantics (and more) = “tectogrammatics”

  4. The Prague Czech-English Dependency Treebank (PCEDT) 2.0
     Parallel treebank
     Dependency style (“Prague”)
       – (surface) syntax
       – syntax & semantics (“tectogrammatics”)
     Penn Treebank translation into Czech

  5. The Prague Czech-English Dependency Treebank (PCEDT) 2.0
     Parallel treebank
     Dependency style (“Prague”)
       – (surface) syntax
       – syntax & semantics (“tectogrammatics”)
     Penn Treebank translation into Czech
     1 million words
     Published at LDC, June 2012 (LDC2012T08)
       – Also available through LINDAT-Clarin and META-SHARE

  6. PCEDT 2.0: The Alignment(s)
     Czech-English alignments
       – Sentence-level (manual, natural due to translation)
     At both syntactic levels
       – Word (node) level: automatic, test section manually corrected (in part)

  7. PCEDT 2.0: The Alignment(s)
     Czech-English alignments
       – Sentence-level (manual, natural due to translation), 1:1
     At both syntactic levels
       – Word (node) level: automatic, test section manually corrected (in part), m:n
     Between annotation levels
       – Tectogrammatics to surface syntax
       – Surface syntax to word level
     Figure labels: tectogrammatics; PTB syntax; surface syntax

  8. Surface syntax annotation
     English
       – Dependency (head rules + additions, manual corrections)
       – Function label (PDT-style) at all nodes (from PTB + rules)
       – Lemmatization + “pure” POS tags from PTB
       – Automatic (from PTB) + a few manual corrections
     Czech
       – PDT style, no change
       – Syntax: automatic (MST); 2000 sentences fully manual for testing
       – Lemmatization and tagging: automatic, 99%/96%, Spoustová et al., EACL 2009 (COMPOST tagger), http://ufal.mff.cuni.cz/compost (Czech, English & other)
       – No p-level (of course)

  9. Tectogrammatical annotation
     Manual (both languages)
     Major features
       – Nodes with “autosemantic” words only (no function words); ellipsis “restored” (new node for verbal arguments)
       – (Semantic) function (dependent-head relation): verb arguments + ca. 50 functions for other relations
       – Valency lexicons attached (English: links to PropBank)
       – “Formemes”: prep+case style label (useful in MT and search)
       – Co-reference integrated (English: BBN + more; Czech: manual)
     Alignment
       – To surface syntax & between Czech and English
     Example sentence: This temblor-prone city dispatched inspectors, firefighters and other earthquake-trained personnel *-1 to aid San Francisco.

  10. Accompanying Tools
      TrEd (http://ufal.mff.cuni.cz/tred)
        – Annotation, view/browse and search environment
        – Open source, Perl
        – Search and visualization:
            Simple data browser (http://ufal.mff.cuni.cz/pcedt2.0)
            PML-TQ: powerful query language for complex tree-based annotation
      Treex (http://ufal.mff.cuni.cz/treex)
        – Modular NLP processing environment
        – Easy handling of complex NLP-annotated data
        – Modules exist for Czech and English data processing, incl. 3rd-party tools integrated into Treex
        – CPAN-distributed

  11. PCEDT and Tectogrammatics in (hybrid) MT
      The famous (almost) “Vauquois” triangle: ANALYSIS, TRANSFER, SYNTHESIS, from the source language (English) to the target language (Czech)
        – t-layer: deep syntax & semantics (tectogrammatical layer)
        – a-layer: shallow syntax (analytical layer)
        – m-layer: POS & lemmatization (morphological layer)
        – w-layer

  12. Analysis-Transfer-Synthesis Hybrid System
      Over 90 steps: both rule-based and statistical
      Source language (English) to target language (Czech), across the w-layer, m-layer, a-layer and t-layer
      ANALYSIS (bottom-up)
        – Tokenization
        – Lemmatization
        – Tagging (Compost)
        – Parsing (MST)
        – Analytical dep. function
        – Convert to t-tree
        – Grammatemes, formemes
      TRANSFER
        – Structural transfer
        – Lexical transfer (dictionary) & lexical choice
      SYNTHESIS (top-down)
        – Basic morph. categories
        – Agreement
        – Add function words
        – Generate forms
        – Concatenate

  13. Example Translation
      Tokenized: Machine translation should be easy .
      Lemmatized & POS tagged: machine/NN translation/NN should/MD be/VB easy/JJ ./.
      a-layer (parse) + functions:
        should (Pred)
          translation (Sb)
            machine (Atr)
          be (Obj)
            easy (Pnom)
          . (AuxK)
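The three analysis stages on this slide (tokens, lemmas with POS tags, and the a-layer parse with analytical functions) can be held in one small record per token. The following is a minimal sketch, not Treex's actual data model: the class name `ANode`, the field names and the `render` helper are hypothetical, and the head indices are read off the tree in the slide.

```python
from dataclasses import dataclass


@dataclass
class ANode:
    """One a-layer node; the field names are illustrative, not Treex's."""
    idx: int    # 1-based token position in the sentence
    form: str   # tokenized word form
    lemma: str
    pos: str    # Penn Treebank tag
    head: int   # idx of the governing node, 0 for the root
    afun: str   # analytical function label


# "Machine translation should be easy ." as shown on the slide.
SENTENCE = [
    ANode(1, "Machine", "machine", "NN", 2, "Atr"),
    ANode(2, "translation", "translation", "NN", 3, "Sb"),
    ANode(3, "should", "should", "MD", 0, "Pred"),
    ANode(4, "be", "be", "VB", 3, "Obj"),
    ANode(5, "easy", "easy", "JJ", 4, "Pnom"),
    ANode(6, ".", ".", ".", 3, "AuxK"),
]


def children(nodes, head):
    """Return the nodes governed by `head`, in surface order."""
    return [n for n in nodes if n.head == head]


def render(nodes, head=0, depth=0):
    """Return the tree as indented 'lemma (afun)' lines, depth-first."""
    lines = []
    for node in children(nodes, head):
        lines.append("  " * depth + f"{node.lemma} ({node.afun})")
        lines.extend(render(nodes, node.idx, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(SENTENCE)))
```

Storing the head as a token index keeps the sentence a flat list, so the same records serve the tokenized, tagged and parsed views.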

  14. Example Translation
      Mark function nodes & edges to “collapse” in the a-tree:
        should (Pred)
          translation (Sb)
            machine (Atr)
          be (Obj)
            easy (Pnom)
          . (AuxK)

  15. Example Translation
      T-tree backbone + formemes:
        be (v:fin)
          translation (n:subj)
            machine (n:attr)
          easy (adj:compl)
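The step from the a-tree to the t-tree backbone can be sketched as follows. This is an illustration under stated assumptions, not TectoMT's conversion rules: the `HOST` table is hand-assigned for this one sentence and says which content node absorbs each function node ("should" is absorbed by "be", the final period is dropped), and the function name `collapse` is hypothetical.

```python
# a-tree from the example: node -> a-layer parent (None marks the root).
A_PARENT = {
    "should": None,
    "translation": "should",
    "machine": "translation",
    "be": "should",
    "easy": "be",
    ".": "should",
}

# Hand-assigned for this sketch: function node -> content node that absorbs it,
# or None when the function node is simply dropped.
HOST = {"should": "be", ".": None}


def collapse(a_parent, host):
    """Return t-layer parents for content nodes after collapsing function nodes."""
    t_parent = {}
    for node in a_parent:
        if node in host:
            continue  # function nodes get no t-node of their own
        ancestor = a_parent[node]
        while ancestor is not None:
            # A content ancestor stands for itself; a function ancestor
            # is represented by its host (or by nothing if it was dropped).
            target = host.get(ancestor, ancestor)
            if target is not None and target != node:
                break
            ancestor = a_parent[ancestor]
        else:
            target = None  # ran out of ancestors: this node is the t-root
        t_parent[node] = target
    return t_parent


T_PARENT = collapse(A_PARENT, HOST)

if __name__ == "__main__":
    for node, parent in T_PARENT.items():
        print(f"{node} <- {parent}")
```

With these inputs "be" becomes the t-root, "translation" and "easy" attach to it, and "machine" stays under "translation", which matches the backbone on the slide.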

  16. Example Translation
      T-tree backbone + formemes + grammatemes:
        be (v:fin; Modality=hort, Conditional=1, Tense=PresSim)
          translation (n:subj; Num=sg)
            machine (n:attr)
          easy (adj:compl; DoC=Positive)

  17. Example Translation
      Transfer starts: clone the t-tree and attach translation candidates to its nodes
      Dictionary translation: MaxEnt classifier, ~10^6 features
        – Root node (Modality=hort, Conditional=1, Tense=PresSim): lemmas mít, být; formemes v:fin, v:inf
        – Noun node (Num=sg): lemmas překlad, posun, …; formeme n:1
        – Adjective node (DoC=Positive): lemmas snadný, jednoduchý, …; formemes adj:compl, adv:
        – Modifier node: lemmas strojový, stroj; formemes adj:attr, n:2, n:attr

  18. Example Translation
      Select best combination of lemmas & formemes (HMTM)
        – Root node (Modality=hort, Conditional=1, Tense=PresSim): lemmas mít, být; formemes v:fin, v:inf
        – Noun node (Num=sg): lemmas překlad, posun, …; formeme n:1
        – Adjective node (DoC=Positive): lemmas snadný, jednoduchý, …; formemes adj:compl, adv:
        – Modifier node: lemmas strojový, stroj; formemes adj:attr, n:2, n:attr
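Choosing the best combination of labels over a tree can be done with a Viterbi-style bottom-up pass, which is the general idea behind tree-structured hidden Markov models. The sketch below is only an illustration of that idea, not TectoMT's HMTM implementation: every score is made up, the lemma|formeme pairings are assumptions, and the names `TNode`, `edge_score` and `select_best` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TNode:
    name: str
    candidates: dict  # label -> made-up log-score of the label on its own
    children: list = field(default_factory=list)


# Made-up log-scores for how well a child label fits under a parent label.
EDGE = {
    ("být|v:fin", "překlad|n:1"): -0.2,
    ("být|v:fin", "snadný|adj:compl"): -0.2,
    ("překlad|n:1", "strojový|adj:attr"): -0.1,
    ("překlad|n:1", "stroj|n:2"): -0.9,
}


def edge_score(parent_label, child_label):
    """Compatibility of a parent/child label pair; unseen pairs get -1.0."""
    return EDGE.get((parent_label, child_label), -1.0)


def best_per_label(node, score):
    """For each label of `node`: (best subtree score, labels chosen below it)."""
    child_tables = [best_per_label(child, score) for child in node.children]
    table = {}
    for label, own in node.candidates.items():
        total, choice = own, {node.name: label}
        for child_table in child_tables:
            # Pick the child label that is best given this parent label.
            child_label = max(
                child_table,
                key=lambda c: child_table[c][0] + score(label, c),
            )
            sub_score, sub_choice = child_table[child_label]
            total += sub_score + score(label, child_label)
            choice.update(sub_choice)
        table[label] = (total, choice)
    return table


def select_best(root, score):
    """Return {node name: label} for the highest-scoring labelling of the tree."""
    table = best_per_label(root, score)
    return max(table.values(), key=lambda entry: entry[0])[1]


MACHINE = TNode("machine", {"stroj|n:2": -0.5, "strojový|adj:attr": -0.7})
TRANSLATION = TNode("translation", {"překlad|n:1": -0.3, "posun|n:1": -1.2}, [MACHINE])
EASY = TNode("easy", {"snadný|adj:compl": -0.4, "jednoduchý|adj:compl": -0.6})
ROOT = TNode("be", {"být|v:fin": -0.1, "mít|v:fin": -1.5}, [TRANSLATION, EASY])

BEST = select_best(ROOT, edge_score)

if __name__ == "__main__":
    for name, label in BEST.items():
        print(name, "->", label)
```

In this toy setup "stroj|n:2" has the higher score in isolation, but the edge score under "překlad|n:1" makes "strojový|adj:attr" win, which is the kind of context effect a tree model is meant to capture.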

  19. Example Translation
      Clone to a-tree, add core morphological & POS tags + agreement + function words:
        – mít: Gen=MInanim, C=PastP, Num=sg
        – překlad: Num=sg, Case=1
        – strojový: Deg=pos, Case=1, Gen=MInanim
        – by
        – být: C=inf
        – snadný: Deg=pos, Case=1, Gen=MInanim
        – .
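The agreement part of this step can be illustrated with a single rule: an attributive adjective copies gender, number and case from the noun it modifies. This is a simplified sketch, not the actual TectoMT block; the node layout, the `pos` values and the function name are assumptions, and the gender of "překlad" (masculine inanimate) is supplied by hand here rather than taken from a lexicon.

```python
AGREEMENT_FEATURES = ("Gen", "Num", "Case")

# Two nodes from the example; the dict layout is hypothetical.
NODES = {
    "překlad": {
        "pos": "noun",
        "parent": "mít",
        "feats": {"Gen": "MInanim", "Num": "sg", "Case": "1"},
    },
    "strojový": {
        "pos": "adj",
        "parent": "překlad",
        "feats": {"Deg": "pos"},
    },
}


def apply_attribute_agreement(nodes):
    """Copy Gen/Num/Case from a noun to each adjective that depends on it."""
    for node in nodes.values():
        parent = nodes.get(node["parent"])  # None if the parent is not listed
        if node["pos"] == "adj" and parent is not None and parent["pos"] == "noun":
            for name in AGREEMENT_FEATURES:
                if name in parent["feats"]:
                    # Keep a value the adjective already has; fill in the rest.
                    node["feats"].setdefault(name, parent["feats"][name])
    return nodes


apply_attribute_agreement(NODES)

if __name__ == "__main__":
    print(NODES["strojový"]["feats"])
```

The predicative adjective and the past participle in the example also agree with the subject; those cases would need further rules and are left out of this sketch.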

  20. Example Translation
      Rearrange clitics:
        – mít: Gen=MInanim, C=PastP, Num=sg
        – překlad: Num=sg, Case=1
        – strojový: Deg=pos, Case=1, Gen=MInanim
        – by
        – být: C=inf
        – snadný: Deg=pos, Case=1, Gen=MInanim
        – .
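Czech clitics such as the conditional auxiliary "by" tend to occupy the second position in the clause, directly after the first constituent. The sketch below shows that one rule on the example; it is a simplification and not the TectoMT block. The clitic list, the grouping into constituents and the order before rearrangement are all assumptions made for illustration.

```python
# A tiny, incomplete clitic list, assumed for this sketch.
CLITICS = {"by", "se", "si"}


def is_clitic(constituent):
    """A constituent counts as a clitic if it is a single word from CLITICS."""
    return len(constituent) == 1 and constituent[0] in CLITICS


def rearrange_clitics(constituents):
    """Move all clitics to directly after the first non-clitic constituent."""
    clitics = [c for c in constituents if is_clitic(c)]
    rest = [c for c in constituents if not is_clitic(c)]
    if not rest:
        return clitics
    return rest[:1] + clitics + rest[1:]


# Assumed order before the step: the modal verb precedes the clitic.
BEFORE = [["strojový", "překlad"], ["mít"], ["by"], ["být", "snadný"], ["."]]
AFTER = rearrange_clitics(BEFORE)

if __name__ == "__main__":
    print(" ".join(word for constituent in AFTER for word in constituent))
```

Working on whole constituents rather than single words is what keeps "strojový překlad" together in first position.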

  21. Example Translation
      Synthesize word forms: měl, překlad, by, být, snadný, strojový, .
      ... and flatten the tree (capitalize, space): Strojový překlad by měl být snadný.
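The final flattening step amounts to joining the ordered word forms, leaving no space before punctuation, and capitalizing the first letter. A minimal sketch, assuming the forms are already in surface order; the function name and the punctuation set are illustrative only.

```python
# Punctuation that attaches to the preceding word without a space.
NO_SPACE_BEFORE = {".", ",", "!", "?", ";", ":"}


def flatten(forms):
    """Join ordered word forms into a sentence string."""
    text = ""
    for form in forms:
        if text and form not in NO_SPACE_BEFORE:
            text += " "
        text += form
    # Capitalize only the first character; leave the rest untouched.
    return text[:1].upper() + text[1:]


FORMS = ["strojový", "překlad", "by", "měl", "být", "snadný", "."]
SENTENCE_TEXT = flatten(FORMS)

if __name__ == "__main__":
    print(SENTENCE_TEXT)
```

Using `text[:1].upper()` rather than `str.capitalize()` avoids lowercasing the rest of the sentence, which would damage proper nouns.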
