slide-1
SLIDE 1

Resources for Adding Semantics to Machine Translation

Jan Hajič

Charles University in Prague
Computer Science School
Institute of Formal and Applied Linguistics

Major contributions by:

E: Silvie Cinková, Jana Šindlerová, Josef Toman, (J. Semecký)
C: Marie Mikulová, Zdeňka Urešová, Jan Štěpánek

slide-2
SLIDE 2
  • Dec. 3, 2010

IWSLT

Today...

  • The family of Prague Dependency Treebanks

– Incl. the Prague (Czech-)English Dependency Treebank

  • English “Tectogrammatical Representation” (TR)

– Annotation layers
– From Penn Treebank+ to PDT-style English annotation
– TR annotation of interesting English phenomena

  • Spoken language annotation

– “Speech reconstruction”

  • Current status + to take home + pointers
slide-3
SLIDE 3

The Family of Prague Dependency Treebanks

  • Prague Dependency Treebank (Czech)

– 2001: version 1.0 (no deep syntax/semantics)
– 2006: version 2.0 (w/deep syntax, semantics: “tectogrammatics”)

  • Prague Czech-English Dependency TB 1.0

– 2004: automatic annotation
– English: PTB, Czech: 1/3rd of PTB translated

  • Prague Arabic Dependency Treebank 1.0

– 2004: ~ PDT 1.0 (no deep syntax)

slide-4
SLIDE 4

The Prague Cze-Eng Dependency Treebank

  • Penn Treebank

+ PropBank
+ BBN (co-reference and Named Entities)
+ NP structure (D. Vadas, J. R. Curran, ACL’07)
+ “Czech-like” tectogrammatics

  • Translation to Czech

– Manual annotation (with auto pre-annotation)

  • Morphology, Syntax, Tectogrammatics (TR)
slide-5
SLIDE 5

Example: English TR

  • Words
  • Dependencies
  • Sem. function
  • Valency (predicates)
  • Coref (BBN)
  • Named Entities (BBN)

slide-6
SLIDE 6

Layers of Annotation

  • t-layer

– tectogrammatics

  • a-layer

– (surface) syntax

  • m-layer

– Morphology (POS)

  • w-layer

– words (tokens)
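
The four layers can be pictured as records in which each layer points down to the one beneath it. The following is a minimal sketch only: the class names (`WToken`, `MNode`, `ANode`, `TNode`), the field names and the example labels are assumptions made for illustration, and this is not the actual PDT file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WToken:
    """w-layer: a token as it appears in the text."""
    id: str
    form: str


@dataclass
class MNode:
    """m-layer: lemma and POS tag, linked to one w-layer token."""
    id: str
    w: WToken
    lemma: str
    tag: str


@dataclass
class ANode:
    """a-layer: surface-syntax node, linked to one m-layer node."""
    id: str
    m: MNode
    afun: str                           # syntactic function (illustrative values)
    parent: Optional["ANode"] = None


@dataclass
class TNode:
    """t-layer: tectogrammatical node, linked to one or more a-layer nodes."""
    id: str
    t_lemma: str
    functor: str                        # semantic function, e.g. "PRED", "PAT"
    a_nodes: List[ANode] = field(default_factory=list)
    parent: Optional["TNode"] = None


def surface_forms(t: TNode) -> List[str]:
    """Follow the t -> a -> m -> w links to recover the surface words."""
    return [a.m.w.form for a in t.a_nodes]


# Hypothetical example: "will join" represented by a single t-layer node.
will = ANode("a1", MNode("m1", WToken("w1", "will"), "will", "MD"), "AuxV")
join = ANode("a2", MNode("m2", WToken("w2", "join"), "join", "VB"), "Pred")
will.parent = join
t_join = TNode("t1", "join", "PRED", a_nodes=[will, join])
print(surface_forms(t_join))            # ['will', 'join']
```

Keeping explicit links between layers is what lets a parser or generator move from a deep node back to the words that realize it.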

slide-7
SLIDE 7

English Surface Syntax

  • From PTB:

– Form
– POS Tag
– Function label
– (Structure)

  • Added

– Lemma
– Heads

slide-8
SLIDE 8

Head Determination Rules

  • Exhaustive set of rules

– By J. Eisner + M. Čmejrek/J. Cuřín
– 4000 rules (non-terminal based)

  • Ex.: (S (NP-SBJ VP .)) → VP

– Additional rules

  • Coordination, Apposition
  • Punctuation (end-of-sentence, internal)
  • Original idea (possibility of conversion)

– J. Robinson (1960s)
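
How non-terminal-based head rules turn a phrase-structure tree into dependencies can be sketched as follows. This is a toy illustration, not the actual rule set or conversion tool: the table holds only a handful of rules in the format shown on these slides, and the tree encoding, the `HEAD_RULES` table and the function name `head_and_edges` are assumptions.

```python
# Each rule maps (parent label, child labels) to the index of the head child.
HEAD_RULES = {
    ("S", ("NP-SBJ", "VP", ".")): 1,    # (S (NP-SBJ VP .)) -> VP
    ("NP", ("DT", "NN")): 1,            # (NP (DT NN))      -> NN
    ("VP", ("VB", "NP")): 0,            # (VP (VB NP))      -> VB
    ("VP", ("MD", "VP")): 1,            # (VP (MD VP))      -> VP
}


def head_and_edges(tree):
    """Return (head word, [(dependent, head), ...]) for a bracketed tree.

    A leaf is (POS, word); an inner node is (label, [children]).
    Edges use word forms, so this toy version assumes no repeated words.
    """
    label, body = tree
    if isinstance(body, str):           # leaf: the word is its own head
        return body, []
    child_labels = tuple(child[0] for child in body)
    head_idx = HEAD_RULES[(label, child_labels)]
    heads, edges = [], []
    for child in body:
        child_head, child_edges = head_and_edges(child)
        heads.append(child_head)
        edges.extend(child_edges)
    head = heads[head_idx]
    # Every non-head child's head word becomes a dependent of the head word.
    edges.extend((h, head) for i, h in enumerate(heads) if i != head_idx)
    return head, edges


tree = ("VP", [("MD", "will"),
               ("VP", [("VB", "join"),
                       ("NP", [("DT", "the"), ("NN", "board")])])])
head, edges = head_and_edges(tree)
print(head)     # join
print(edges)    # [('the', 'board'), ('board', 'join'), ('will', 'join')]
```

An exact-match table like this only works when every local tree configuration has a rule, which is why the rule set has to be exhaustive.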

slide-9
SLIDE 9

Example: Head Determination Rules

[Figure: phrase-structure tree with the head word marked at each node: (board), (board), (the), (join), (will), (join), (join), (join)]

Rules:

  • (NP (DT NN)) → NN
  • (VP (VB NP)) → VB
  • (VP (MD VP)) → VP
  • (S (… VP …)) → VP

slide-10
SLIDE 10

Conversion: Analytic Structure, Functions

  • Syntactic Function assignment (conversion)
  • Rules

– based on PTB functional tags:

  • SBJ → Sb
  • PRD → Pnom
  • BNF → Obj
  • DTV → Obj
  • LGS → Obj
  • ADV → Adv
  • DIR → Adv
  • EXT → Adv
  • LOC → Adv
  • MNR → Adv
  • PRP → Adv
  • PUT → Adv
  • TMP → Adv

– Ad-hoc rules (if functional tags missing)
– Lemmatization (years → year)
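
The tag-based part of this conversion amounts to a lookup table. A minimal sketch, assuming the mapping listed above; the names `PTB_TO_AFUN` and `afun_from_label` are hypothetical, and the ad-hoc fallback rules are not shown on the slide, so the function simply returns `None` when no functional tag is found.

```python
from typing import Optional

# PTB functional tag -> PDT-style analytic function, as listed on the slide.
PTB_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_from_label(label: str) -> Optional[str]:
    """Map a PTB node label such as 'NP-SBJ' or 'PP-LOC-CLR' to a function.

    Returns None when the label carries no known functional tag; that is
    the point where ad-hoc rules (not sketched here) would take over.
    """
    for part in label.split("-")[1:]:   # skip the syntactic category itself
        if part in PTB_TO_AFUN:
            return PTB_TO_AFUN[part]
    return None


print(afun_from_label("NP-SBJ"))        # Sb
print(afun_from_label("PP-LOC-CLR"))    # Adv
print(afun_from_label("NP"))            # None
```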

slide-11
SLIDE 11

Structure & Functions: PTB to P(E)DT

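[Figure: three trees for the same example: Penn Treebank structure (with heads added) → PDT-like Analytic Representation → PDT-like Tectogrammatic Representation (automatic pre-annotation). Node labels in the Penn Treebank tree: (board), (board), (the), (join), (will), (join), (join), (join). Functor labels in the tectogrammatic tree: PRED.Fut, PAT.]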

slide-12
SLIDE 12

English TR I: Predicative Complement

  • Free (non-valency) modification (of both a noun and a verb)
  • attribute compl.rf (green arrow to the noun)
slide-13
SLIDE 13

English TR II: Which + Relative Clause

We have not answered your question completely, for which we apologize.

slide-14
SLIDE 14

English TR III: Coordination

slide-15
SLIDE 15

English TR III: Comparison

slide-16
SLIDE 16

English TR IV: Restriction (“Exclusion”)

except, with the exception of, excluding, (all/none) but, beyond, apart from, unless, bar, barring, besides

slide-17
SLIDE 17

English TR annotation

  • TrEd

– Pre-annotated
– Graphical

  • TR dep. tree is primary

– Text + TR
– Czech translation

  • Valency (a.k.a. “propbanking”)

– During TR annotation
– Propbank origins and examples

  • Linked, displayed
slide-18
SLIDE 18

EngVallex (give)

slide-19
SLIDE 19

EngVallex Format (admit)

slide-20
SLIDE 20

Valency in Translation

  • leave-1 → nechat-3

– ACT() PAT() LOC() → ACT(.1) PAT(.4) LOC()

  • leave-2 → odjet-1

– ACT() DIR1(from.) → ACT(.1) DIR1(z.[.2])
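
One way to picture how linked valency frames could help translation: once the English frame of "leave" is identified, the linked Czech frame supplies both the verb and the surface form of each argument. The sketch below is a toy under stated assumptions: the `Frame` class, the tables and the overlap-based `pick_frame` heuristic are hypothetical and not part of EngVallex or any described system. It also reads ".1", ".4" and "z.[.2]" as nominative, accusative and the preposition "z" with genitive, following the usual Czech case numbering.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Frame:
    frame_id: str               # e.g. "leave-1"
    lemma: str
    slots: Dict[str, str]       # functor -> surface form ("" = unspecified)


EN_FRAMES = [
    Frame("leave-1", "leave", {"ACT": "", "PAT": "", "LOC": ""}),
    Frame("leave-2", "leave", {"ACT": "", "DIR1": "from."}),
]
CS_FRAMES = {
    "nechat-3": Frame("nechat-3", "nechat", {"ACT": ".1", "PAT": ".4", "LOC": ""}),
    "odjet-1": Frame("odjet-1", "odjet", {"ACT": ".1", "DIR1": "z.[.2]"}),
}
EN_TO_CS = {"leave-1": "nechat-3", "leave-2": "odjet-1"}   # frame links


def pick_frame(lemma: str, functors: Iterable[str]) -> Frame:
    """Toy heuristic: choose the frame sharing the most functors with the node."""
    observed = set(functors)
    candidates = [f for f in EN_FRAMES if f.lemma == lemma]
    return max(candidates, key=lambda f: len(observed & set(f.slots)))


def czech_counterpart(en_frame: Frame) -> Frame:
    """Follow the link from an English frame to its Czech frame."""
    return CS_FRAMES[EN_TO_CS[en_frame.frame_id]]


cs = czech_counterpart(pick_frame("leave", ["ACT", "DIR1"]))
print(cs.lemma, cs.slots)       # odjet {'ACT': '.1', 'DIR1': 'z.[.2]'}
```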

slide-21
SLIDE 21

Interannotator Agreement

2007-2009:

  • New annotators (lower numbers)
  • Annotation “by phenomenon”

  • Restarting now
slide-22
SLIDE 22

Prague English Dependency Treebank

  • Availability

– Version 1.0 now (PTB license needed)

  • 250k words

– Full version (parallel with Czech): early 2011

  • Size

– Full WSJ portion of PTB (2312 files)

– 49208 sentences, 1253013 tokens

slide-23
SLIDE 23

Czech PDT-style Annotation

  • All layers

– morphology, syntax, tectogrammatical

  • So far…

– Automatic (many tools by many authors)

  • Manual annotation

– Complete now, co-reference annotation finishing
– Top-down

  • Tectogrammatical first (lower layers automatically)
  • … then syntactic structure and morphology
slide-24
SLIDE 24

Spoken corpus: Speech Reconstruction

  • Beyond disfluency removal: an idea by F. Jelinek:

– Transcription, even if perfect, is hard to analyze
– ~ “people [when speaking] are ungrammatical”
– ~ editing recorded dialogs for print

  • Example:

Transcript: [breath] i think I th - see Si I think in this picture
…after speech reconstruction: I think I see Si in this picture.
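
The relation between the transcript and its reconstruction in this example can be made explicit by linking each reconstructed word to a transcript word. Speech reconstruction itself is a manual annotation that also allows moving and inserting words; the sketch below covers only the simple case where words are deleted and order is kept, using a longest-common-subsequence alignment. The function name `link_words` and the lowercasing and punctuation handling are assumptions.

```python
from typing import List, Tuple


def link_words(src: List[str], tgt: List[str]) -> List[Tuple[int, int]]:
    """Return (transcript index, reconstruction index) links via LCS."""
    n, m = len(src), len(tgt)
    # table[i][j] = length of the LCS of src[i:] and tgt[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == tgt[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    links, i, j = [], 0, 0
    while i < n and j < m:
        if src[i] == tgt[j]:
            links.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1                      # transcript word has no counterpart
        else:
            j += 1                      # reconstructed word was inserted
    return links


transcript = "[breath] i think I th - see Si I think in this picture"
reconstructed = "I think I see Si in this picture."

src = transcript.lower().split()
tgt = reconstructed.rstrip(".").lower().split()
links = link_words(src, tgt)
linked = {i for i, _ in links}
dropped = [w for i, w in enumerate(src) if i not in linked]
print(dropped)      # ['[breath]', 'th', '-', 'i', 'think']
```

Here every reconstructed word is linked to a transcript word, and the unlinked transcript words are the non-speech event, the false start and the repetition.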

slide-25
SLIDE 25

Speech Reconstruction Annotation

  • Multilevel audio/text editor “MEd”

– Linking words, free movement of words
– Editing, inserting, deleting words
– Manual/auto transcripts (simultaneously visible)
– Listening (as in transcription)

slide-26
SLIDE 26

Speech Reconstruction Corpus: “Companions”

  • English, Czech dialogs

– “Wizard-of-Oz” setting for recording
– Topic: Reminiscing over photographs
– Used in the EU FP6 “Companions” project
– English: 20h, Czech: 120h
– Manual transcription
– Double or triple SR annotated
– Release: spring 2011

  • http://ufal.mff.cuni.cz/pdtsl
slide-27
SLIDE 27

Connecting speech and language understanding

  • Full annotation over speech data:

– “Companions” corpus → PDT-like annotated

  • All levels (morphology, syntax, semantics, valency)
  • Over reconstructed speech (“easy”)
  • Sample published: PDTSE corpus

[Figure: one utterance shown at several linked layers, labelled: transcript, audio, “Reconstructed”, POS, surface syntax, …, Deep syntax / tectogrammatics]

Transcript: he is a member they’re [UN] yeah, the yankees member of the club
Reconstructed: He is a member of the Club – they were the Yankees.
Tectogrammatical node labels in the figure: -/CONJ, be/PRED, #PP/ACT, member/PAT, Club/RSTR, #PP/ACT, Yankees/PAT, be/PRED

slide-28
SLIDE 28

Summary

  • PDT is/has (a)…

– (Family of) dependency-based treebanking project(s)

  • Czech (English, Arabic, ...)

– ~ 1mil. words

  • sufficient size for ML experiments

– 4 interlinked layers of annotation

  • (token, morphology, syntax, deep syntax/semantics++)
  • independent and “full” information at all levels
  • interlinked (for the development of parsers/generators)

– Parallel corpus Cze <-> Eng -> Machine Translation

  • PDTSL adds…

– Speech, transcription, speech reconstruction

slide-29
SLIDE 29

Pointers, Acknowledgements

  • http://ufal.mff.cuni.cz/pedt
  • http://ufal.mff.cuni.cz/pdtsl
  • http://ufal.mff.cuni.cz/pdt2.0
  • http://ufal.mff.cuni.cz/~pajas/tred
  • Acknowledgements

– FP7 – Network “META-NET”
– FP6-IST “Euromatrix”, Companions
– FP7-IST “Euromatrix+”, “Faust”
– LC536 (Center for Computational Linguistics)
– GAČR 405/06/0589 (Speech and deep syntax)
– MŠMT: MSM0021620838, ME838, ME09008