
Resources for Adding Semantics to Machine Translation, Jan Hajič



  1. Resources for Adding Semantics to Machine Translation Jan Hajič Charles University in Prague Computer Science School Institute of Formal and Applied Linguistics Major contributions by: E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký) C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek

  2. Today... • The family of Prague Dependency Treebanks – Incl. the Prague (Czech-)English Dependency Treebank • English “Tectogrammatical Representation” (TR) – Annotation layers – From Penn Treebank+ to PDT-style English annotation – TR annotation of interesting English phenomena • Spoken language annotation – “Speech reconstruction” • Current status + to take home + pointers IWSLT Dec. 3, 2010

  3. The Family of Prague Dependency Treebanks • Prague Dependency Treebank (Czech) – 2001: version 1.0 (no deep syntax/semantics) – 2006: version 2.0 (with deep syntax, semantics: “tectogrammatics”) • Prague Czech-English Dependency TB 1.0 – 2004: automatic annotation – English: PTB, Czech: 1/3rd of PTB translated • Prague Arabic Dependency Treebank 1.0 – 2004: ~ PDT 1.0 (no deep syntax)

  4. The Prague Cze-Eng Dependency Treebank • Penn Treebank + PropBank + BBN (co-reference and Named Entities) + NP structure (D. Vadas, J. R. Curran, ACL’07) + “Czech-like” tectogrammatics • Translation to Czech – Manual annotation (with auto pre-annotation) • Morphology, Syntax, Tectogrammatics (TR)

  5. Example: English TR • Words • Dependencies • Sem. function • Valency (predicates) • Coref (BBN) • Named Entities (BBN)

  6. Layers of Annotation • t-layer – tectogrammatics • a-layer – (surface) syntax • m-layer – Morphology (POS) • w-layer – words (tokens)
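The four interlinked layers can be pictured as a chain of records, each pointing down to the layer below. The sketch below is only an illustration: the class and field names are hypothetical and do not reproduce the treebank's actual file format. It shows one tectogrammatical node for "will join" linked to two surface nodes, on the assumption that the auxiliary gets no t-layer node of its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WNode:
    """w-layer: a token as it appears in the text."""
    id: str
    form: str


@dataclass
class MNode:
    """m-layer: morphology (lemma and POS tag) for one token."""
    id: str
    w: WNode
    lemma: str
    tag: str


@dataclass
class ANode:
    """a-layer: a surface-syntax dependency node."""
    id: str
    m: MNode
    afun: str
    parent: Optional["ANode"] = None


@dataclass
class TNode:
    """t-layer: a tectogrammatical node; may link to several a-nodes."""
    id: str
    t_lemma: str
    functor: str
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None


def surface_forms(t: TNode) -> List[str]:
    """Follow the links t -> a -> m -> w and collect the word forms."""
    return [a.m.w.form for a in t.a_nodes]


# Illustrative fragment "will join" (ids and labels are made up for the sketch).
w_will, w_join = WNode("w1", "will"), WNode("w2", "join")
m_will = MNode("m1", w_will, lemma="will", tag="MD")
m_join = MNode("m2", w_join, lemma="join", tag="VB")
a_join = ANode("a2", m_join, afun="Pred")
a_will = ANode("a1", m_will, afun="AuxV", parent=a_join)
t_join = TNode("t1", t_lemma="join", functor="PRED", a_nodes=[a_will, a_join])

print(surface_forms(t_join))
```

Keeping the links explicit is what lets a parser or generator move between layers without losing the original tokens.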

  7. English Surface Syntax • From PTB: – Form – POS Tag – Function label – (Structure) • Added – Lemma – Heads

  8. Head Determination Rules • Exhaustive set of rules – By J. Eisner + M. Čmejrek/J. Cuřín – 4000 rules (non-terminal based) • Ex.: (S (NP-SBJ VP .)) → VP – Additional rules • Coordination, Apposition • Punctuation (end-of-sentence, internal) • Original idea (possibility of conversion) – J. Robinson (1960s)

  9. Example: Head Determination Rules • Rules: (NP (DT NN)) → NN; (VP (VB NP)) → VB; (VP (MD VP)) → VP; (S (… VP …)) → VP [Figure: phrase-structure tree with the head words (join), (will), (board), (the) propagated up to the nodes]
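The four rules on this slide can be tried out with a small head-percolation sketch. This is not the rule engine used for the treebank (which has about 4000 rules); it is a minimal illustration, assuming trees are nested tuples and that the "(S (… VP …))" pattern means "an S takes any VP child as its head". The subject "They" in the example tree is made up.

```python
from typing import Dict, Tuple

# Exact rules from the slide: (parent, child labels) -> label of the head child.
EXACT_RULES: Dict[Tuple[str, Tuple[str, ...]], str] = {
    ("NP", ("DT", "NN")): "NN",
    ("VP", ("VB", "NP")): "VB",
    ("VP", ("MD", "VP")): "VP",
}

# Assumed reading of "(S (… VP …)) → VP": any S picks its VP child.
FALLBACK_RULES: Dict[str, str] = {"S": "VP"}


def base_label(label: str) -> str:
    """Strip PTB functional tags, e.g. 'NP-SBJ' -> 'NP'."""
    return label.split("-")[0] or label


def head_word(tree) -> str:
    """Return the lexical head of a subtree by following head children."""
    label, rest = tree
    if isinstance(rest, str):  # leaf: (POS tag, word)
        return rest
    parent = base_label(label)
    child_labels = tuple(base_label(child[0]) for child in rest)
    head_label = EXACT_RULES.get((parent, child_labels)) or FALLBACK_RULES.get(parent)
    if head_label is None:
        raise KeyError(f"no head rule for ({parent} {' '.join(child_labels)})")
    for child in rest:
        if base_label(child[0]) == head_label:
            return head_word(child)
    raise KeyError(f"rule for {parent} names {head_label}, but no such child")


NP = ("NP", [("DT", "the"), ("NN", "board")])
TREE = ("S", [
    ("NP-SBJ", [("PRP", "They")]),          # illustrative subject
    ("VP", [("MD", "will"),
            ("VP", [("VB", "join"), NP])]),
])

print(head_word(NP))    # head of the noun phrase
print(head_word(TREE))  # head of the whole sentence
```

Once every node has a head word, each non-head child's head can be attached to its parent's head, which yields the dependency tree.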

  10. Conversion: Analytic Structure, Functions • Syntactic Function assignment (conversion) • Rules – based on PTB functional tags: -SBJ → Sb, -PRD → Pnom, -BNF → Obj, -DTV → Obj, -LGS → Obj, -ADV → Adv, -DIR → Adv, -EXT → Adv, -LOC → Adv, -MNR → Adv, -PRP → Adv, -PUT → Adv, -TMP → Adv – Ad-hoc rules (if functional tags missing) – Lemmatization (years → year)
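The tag-based part of this conversion is a plain lookup. The sketch below uses the mapping from the slide; the function name and the choice to return None when no functional tag is present (the point where the ad-hoc rules would take over) are assumptions made for illustration.

```python
from typing import Optional

# PTB functional tag -> analytic function, as listed on the slide.
PTB_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def analytic_function(ptb_label: str) -> Optional[str]:
    """Map a PTB node label such as 'NP-SBJ' to an analytic function.

    Returns None when the label carries no known functional tag; in that
    case some ad-hoc rule (not modelled here) would have to decide.
    """
    # The first part is the phrase category; the rest are tags or indices.
    for tag in ptb_label.split("-")[1:]:
        if tag in PTB_TO_AFUN:
            return PTB_TO_AFUN[tag]
    return None


for label in ("NP-SBJ", "PP-LOC", "NP-SBJ-1", "NP"):
    print(label, "->", analytic_function(label))
```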

  11. Structure & Functions: PTB to P(E)DT • Penn Treebank structure (with heads added) → PDT-like Analytic Representation → PDT-like Tectogrammatic Representation (automatic pre-annotation) [Figure: the same tree at the three stages, with the head words (join), (will), (board), (the) on the Penn Treebank nodes and the labels PRED.Fut and PAT in the converted trees]

  12. English TR I Predicative Complement • Free (non-valency) modification (of both a noun and a verb) • attribute compl.rf (green arrow to the noun)

  13. English TR II Which + Relative Clause We have not answered your question completely, for which we apologize.

  14. English TR III: Coordination

  15. English TR III: Comparison

  16. English TR IV: Restriction (“Exclusion”) except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides

  17. English TR annotation • TrEd – Pre-annotated – Graphical • TR dep. tree is primary – Text + TR – Czech translation • Valency (a.k.a. “propbanking”) – During TR annotation – Propbank origins and examples • Linked, displayed

  18. EngVallex (give)

  19. EngVallex Format (admit)

  20. Valency in Translation • leave-1 ↔ nechat-3 – ACT() PAT() LOC() ↔ ACT(.1) PAT(.4) LOC() • leave-2 ↔ odjet-1 – ACT() DIR1(from.) ↔ ACT(.1) DIR1(z.[.2])
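One way to picture these paired entries is as a lookup from an English frame to its Czech counterpart, with the slots matched by functor. The sketch below is hypothetical in its names and structure; the form strings (".1", "from.", "z.[.2]") are copied from the slide and treated as opaque labels rather than interpreted.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Frame:
    """A valency frame: frame id plus functor -> surface-form string."""
    frame_id: str
    slots: Dict[str, str]


# English frame id -> (English frame, paired Czech frame), from the slide.
FRAME_PAIRS: Dict[str, Tuple[Frame, Frame]] = {
    "leave-1": (
        Frame("leave-1", {"ACT": "", "PAT": "", "LOC": ""}),
        Frame("nechat-3", {"ACT": ".1", "PAT": ".4", "LOC": ""}),
    ),
    "leave-2": (
        Frame("leave-2", {"ACT": "", "DIR1": "from."}),
        Frame("odjet-1", {"ACT": ".1", "DIR1": "z.[.2]"}),
    ),
}


def translate_frame(eng_frame_id: str) -> Tuple[str, Dict[str, str]]:
    """Return the Czech frame id and, per English functor, the Czech form."""
    eng, cze = FRAME_PAIRS[eng_frame_id]
    return cze.frame_id, {functor: cze.slots[functor] for functor in eng.slots}


print(translate_frame("leave-1"))
print(translate_frame("leave-2"))
```

The point of the pairing is that the verb sense picks the target verb, and the functors then tell a generator which surface form each argument should take on the Czech side.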

  21. Interannotator Agreement 2007-2009: - New annotators (lower numbers) - Annotation “by phenomenon” - Restarting now

  22. Prague English Dependency Treebank • Availability – Version 1.0 now (PTB license needed) • 250k words – Full version (parallel with Czech): early 2011 • Size – Full WSJ portion of PTB (2312 files) – 49208 sentences, 1253013 tokens

  23. Czech PDT-style Annotation • All layers – morphology, syntax, tectogrammatical • So far… – Automatic (many tools by many authors) • Manual annotation – Complete now, co-reference annotation finishing – Top-down • Tectogrammatical first (lower layers automatically) • … then syntactic structure and morphology

  24. Spoken corpus: Speech Reconstruction • Beyond disfluency removal: an idea by F. Jelinek: – Transcription, even if perfect, is hard to analyze – ~ “people [when speaking] are ungrammatical” – ~ editing recorded dialogs for print • Example: Transcript: [breath] i think I th - see Si I think in this picture … after speech reconstruction: I think I see Si in this picture.

  25. Speech Reconstruction Annotation • Multilevel audio/text editor “MEd” – Linking words, free movement of words – Editing, inserting, deleting words – Manual/auto transcripts (simultaneously visible) – Listening (as in transcription)

  26. Speech Reconstruction Corpus: “Companions” • English, Czech dialogs – “Wizard-of-Oz” setting for recording – Topic: Reminiscing over photographs – Used in the EU FP6 “Companions” project – English: 20h, Czech: 120h – Manual transcription – Double or triple SR annotated – Release: spring 2011 • http://ufal.mff.cuni.cz/pdtsl

  27. Connecting speech and language understanding • Full annotation over speech data: – “Companions” corpus → PDT-like annotated - All levels (morphology, syntax, semantics, valency) - Over reconstructed speech (“easy”) - Sample published: PDTSE corpus [Figure: annotation stack from bottom to top: audio; transcript “he is a member they’re [UN] yeah, the yankees member of the club”; “Reconstructed” text “He is a member of the Club – they were the Yankees.”; POS, surface syntax, …; Deep syntax / tectogrammatics with nodes -/CONJ, be/PRED, be/PRED, #PP/ACT, Yankees/PAT, #PP/ACT, member/PAT, Club/RSTR]

  28. Summary • PDT is/has (a)… – (Family of) dependency-based treebanking project(s) • Czech (English, Arabic, ...) – ~1 million words • sufficient size for ML experiments – 4 interlinked layers of annotation • (token, morphology, syntax, deep syntax/semantics++) • independent and “full” information at all levels • interlinked (for the development of parsers/generators) – Parallel corpus Cze <-> Eng -> Machine Translation • PDTSL adds… – Speech, transcription, speech reconstruction

  29. Pointers, Acknowledgements • http://ufal.mff.cuni.cz/pedt • http://ufal.mff.cuni.cz/pdtsl • http://ufal.mff.cuni.cz/pdt2.0 • http://ufal.mff.cuni.cz/~pajas/tred • Acknowledgements – FP7 – Network “META-NET” – FP6-IST “Euromatrix”, Companions – FP7-IST “Euromatrix+”, “Faust” – LC536 (Center for Computational Linguistics) – GAČR 405/06/0589 (Speech and deep syntax) – MŠMT: MSM0021620838, ME838, ME09008
