SLIDE 1

14.–15.9.2015, Sedlec-Prčice

Universal Dependencies

Joakim Nivre, Dan Zeman, Filip Ginter, Sampo Pyysalo, Chris Manning, Marie-Catherine de Marneffe, Natalia Silveira, Slav Petrov, Ryan McDonald, Tim Dozat, Jan Hajič, Jinho Choi, Reut Tsarfaty, Yoav Goldberg, Simonetta Montemagni, Alessandro Lenci, Maria Simi, Cristina Bosco, Veronika Vincze, Richárd Farkas, Teresa Lynn, Jennifer Foster, Prokopis Prokopidis, Jenna Kanerva, Juha Kuokkala, Veronika Laippala, Krister Lindén, Anna Missilä, Hanna Nurmi, Jussi Piitulainen, Aaron Smith, Željko Agić, Nikola Ljubešić, Maria Jesus Aranzabe, Aitziber Atutxa, Iakes Goenaga, Koldo Gojenola, Anders Trærup Johannsen, Hèctor Martínez, Barbara Plank, Petya Osenova, Kiril Simov, Mojgan Seraji, Wolfgang Seeker, Fran Tyers, Aibek Makazhanov, Jon Washington, Çağrı Çöltekin, Arne Skjærholt, Lilja Øvrelid, Miguel Ballesteros, Elena Pascual, Giuseppe Celano, Marco Passarotti, Christophe Onambélé, Dag Haug, Nizar Habash, Riyaz Ahmad, Verginica Mititelu, Catalina Mărănduc, Kaja Dobrovoljc, Tomaž Erjavec, Simon Krek, Yusuke Miyao, Shinsuke Mori, Takaaki Tanaka, Hiroshi Kanayama, Masayuki Asahara, Sumire Uematsu, Rob Voigt, …

Introduction slides stolen from Joakim Nivre

SLIDE 14

Universal Dependencies

Precursors: Stanford Dependencies, CLEAR, Google UD, Stanford UD, HamleDT, Interset, Google universal tags

http://universaldependencies.org

SLIDE 16

Universal Dependencies

  • Milestones:

– 2014-04: EACL Göteborg, kick-off meeting
– 2014-10: UD guidelines version 1
– 2015-01: released treebanks of 10 languages (UD 1.0)
– 2015-05: released treebanks of 18 languages (UD 1.1)
– 2015-11: released 37 treebanks of 33 languages (UD 1.2)
– 2016-05: new release

SLIDE 20

Goals and Requirements

  • Cross-linguistically consistent grammatical annotation
  • Support multilingual research and development in NLP
  • Based on common usage and existing de facto standards
  • Caveats:

– Not a new linguistic theory, but linguistically informed and relevant
– Not an ideal parsing representation, but useful for comparative evaluation
– Not the ultimate annotation scheme, but a lightweight lingua franca

SLIDE 23

Design Principles

  • Dependency

– Widely used in practical NLP systems
– Available in treebanks for many languages

  • Lexicalism

– Basic annotation units are words (syntactic words)
– Words have morphological properties
– Words enter into syntactic relations

  • Recoverability

– Transparent mapping from input text to word segmentation

SLIDE 25

Golden Rules

  • Maximize parallelism

– Don’t annotate the same thing in different ways
– Don’t make different things look the same

  • But don’t overdo it

– Don’t annotate things that are not there
– Languages select from a universal pool of categories
– Allow language-specific extensions

SLIDE 29

Morphology

Form          Lemma        POS    Features
Některé       některý      DET    PronType=Ind Gender=Fem Number=Plur Case=Nom
dívky         dívka        NOUN   Gender=Fem Number=Plur Case=Nom
si            se           PRON   PronType=Prs Reflex=Yes Case=Dat
nicméně       nicméně      CONJ
pochvalovaly  pochvalovat  VERB   VerbForm=Part Tense=Past Voice=Act Aspect=Imp Gender=Fem Number=Plur
zmrzlinu      zmrzlina     NOUN   Gender=Fem Number=Sing Case=Acc
.             .            PUNCT

  • Lemma representing the semantic content of the word
  • Part-of-speech tag representing the abstract lexical category associated with the word
  • Features representing lexical and grammatical properties associated with the lemma or the particular word form
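
As an illustration of the feature layer: in CoNLL-U files the features of a word are written as a single string of Name=Value pairs joined by "|", with "_" standing for an empty set. The sketch below turns such a string into a dictionary and back. The function names parse_feats and format_feats are made up for this example and are not part of any UD tool.

```python
def parse_feats(feats):
    """Turn 'Gender=Fem|Number=Plur' into {'Gender': 'Fem', 'Number': 'Plur'}."""
    if feats in ("", "_"):  # "_" marks a word without features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats):
    """Inverse of parse_feats; feature names are sorted for a stable output."""
    if not feats:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(feats.items()))


# Features of "dívky" from the slide, joined with "|"
divky = parse_feats("Gender=Fem|Number=Plur|Case=Nom")
print(divky["Case"])
print(format_feats(divky))
```

Sorting by feature name is only a convenience here so that two equal feature sets always print the same way.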

SLIDE 30

Part-of-Speech Tags

Open    Closed   Other
ADJ     ADP      PUNCT
ADV     AUX      SYM
INTJ    CONJ     X
NOUN    DET
PROPN   NUM
VERB    PART
        PRON
        SCONJ

  • Taxonomy of 17 universal part-of-speech tags, based on the Google Universal Tagset (Petrov et al., 2012)
  • All languages use the same inventory, but not all tags have to be used by all languages
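
A minimal sketch of how the tag inventory above could be used to check a tagged corpus: the three classes are stored as a dictionary, and a helper reports any tag that is not one of the 17. The names UPOS_CLASSES, UPOS_TO_CLASS and unknown_tags are assumptions of this example only.

```python
# The 17 universal tags from the slide, grouped by class
UPOS_CLASSES = {
    "open": ["ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"],
    "closed": ["ADP", "AUX", "CONJ", "DET", "NUM", "PART", "PRON", "SCONJ"],
    "other": ["PUNCT", "SYM", "X"],
}

# Reverse lookup: tag -> class
UPOS_TO_CLASS = {tag: cls for cls, tags in UPOS_CLASSES.items() for tag in tags}


def unknown_tags(tags):
    """Return the sorted set of tags that are not in the universal inventory."""
    return sorted(set(tags) - set(UPOS_TO_CLASS))


# Tags of the Czech example sentence from the Morphology slide
sentence_tags = ["DET", "NOUN", "PRON", "CONJ", "VERB", "NOUN", "PUNCT"]
print(unknown_tags(sentence_tags))
print(UPOS_TO_CLASS["PRON"])
```

A language that never uses, say, PART simply never produces that tag; the check only rejects tags outside the shared inventory.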

SLIDE 31

Features

Lexical     Inflectional / Nominal   Inflectional / Verbal
PronType    Gender                   VerbForm
NumType     Animacy                  Mood
Poss        Number                   Tense
Reflex      Case                     Aspect
            Definite                 Voice
            Degree                   Person
                                     Negative

  • Standardized inventory of morphological features, based on Interset (Zeman, 2008)
  • Languages select relevant features and can add language-specific features or values with documentation

SLIDE 40

Dependency Relations

  • Taxonomy of 40 universal grammatical relations, broadly attested in language typology (de Marneffe et al., 2014)

– Language-specific subtypes may be added

  • Organizing principles

– Three types of structures: nominals, clauses, modifiers
– Core arguments vs. other dependents (not arguments vs. adjuncts)

SLIDE 41

Dependents of Clausal Predicates

           Nominal                          Clausal                          Other
Core       nsubj, nsubjpass, dobj, iobj     csubj, csubjpass, ccomp, xcomp
Non-Core   nmod, vocative, discourse, expl  advcl                            advmod, neg, aux, auxpass, cop, mark, punct

SLIDE 43

Dependents of Nominals

Nominal               Clausal   Other
nmod, appos, nummod   acl       amod, det, neg, case

punct

SLIDE 45

Multiword Expressions

Relation   Examples
mwe        in spite of, as well as, ad hoc
name       Roger Bacon, New York
compound   phone book, four thousand, dress up
goeswith   notwith standing, with out

  • UD annotation does not permit “words with spaces”

– Multiword expressions are analyzed using special relations
– The mwe, name and goeswith relations are always head-initial
– The compound relation reflects the internal structure
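
To make "head-initial" concrete, here is a small sketch: the first word of the expression is the head and every later word attaches directly to it with the same relation. The helper head_initial_arcs and the (head, dependent, relation) triple layout are choices made for this example.

```python
def head_initial_arcs(word_ids, relation):
    """Attach every word after the first directly to the first word."""
    head, *dependents = word_ids
    return [(head, dep, relation) for dep in dependents]


# "in spite of" as words 1, 2, 3 of a sentence
for arc in head_initial_arcs([1, 2, 3], "mwe"):
    print(arc)
```

The compound relation is deliberately not handled this way, since the slide says it follows the internal structure of the expression instead.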

SLIDE 46

Other Relations

Relation     Explanation
parataxis    Loosely linked clauses of same rank
list         Lists without syntactic structure
remnant      Orphans in ellipsis linked to parallel elements
reparandum   Disfluency linked to (speech) repair
foreign      Elements within opaque stretches of code switching
dep          Unspecified dependency
root         Syntactically independent element of clause/phrase

SLIDE 47

Language-Specific Relations

  • Language-specific relations are subtypes of universal relations added to capture important phenomena
  • Subtyping permits us to “back off” to universal relations

Relation       Explanation
acl:relcl      Relative clause
compound:prt   Verb particle (dress up)
nmod:poss      Genitive nominal (Mary ’s book)
nmod:agent     Agent in passive (saved by the bell)
cc:preconj     Preconjunction (both … and)
det:predet     Predeterminer (all those …)
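
The "back off" step can be sketched in one line, assuming (as in the table above) that a subtype is always written as universal:subtype. The function name universal_relation is chosen for this example.

```python
def universal_relation(deprel):
    """Strip a language-specific subtype: 'acl:relcl' -> 'acl'."""
    return deprel.split(":", 1)[0]


labels = ["acl:relcl", "nmod:poss", "nmod:agent", "nsubj"]
print([universal_relation(label) for label in labels])
```

A tool that only knows the universal inventory can apply this to every label before comparing treebanks of different languages.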

SLIDE 48

Word Segmentation

  • How do we segment sentences into words?

– Depends on language and writing system, often non-trivial
– Segmentation must be reproducible on new data

  • Two options provided:

– Only include words in treebank, but document segmentation
– Include mapping from low-level tokenization to words in treebank

SLIDE 49

CoNLL-U Format

  • Revised version of the CoNLL-X format
  • Two-level segmentation and secondary dependencies
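
The two-level segmentation can be sketched with a tiny reader. CoNLL-U lines have ten tab-separated columns; a surface token that spans several syntactic words gets a range ID such as 1-2, followed by one line per word. The reader below is a simplified illustration, not a validator: the lowercase field names are labels chosen for this example, the sample uses the fused form proň (pro + něj), and every column not needed for the illustration is left as "_". Secondary dependencies would go into the ninth column (DEPS), which this sketch stores but does not interpret.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_sentence(text):
    """Split one CoNLL-U sentence into surface tokens and syntactic words."""
    tokens, words = [], []
    for line in text.splitlines():
        if not line or line.startswith("#"):  # skip blank and comment lines
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        row = dict(zip(FIELDS, cols))
        if "-" in row["id"]:
            # Range line: a surface token covering several words
            start, end = map(int, row["id"].split("-"))
            tokens.append((row["form"], list(range(start, end + 1))))
        else:
            words.append(row)
    return tokens, words


# Sample built with explicit tabs; unknown columns stay "_"
SAMPLE = "\n".join([
    "\t".join(["1-2", "proň"] + ["_"] * 8),
    "\t".join(["1", "pro"] + ["_"] * 8),
    "\t".join(["2", "něj"] + ["_"] * 8),
])

tokens, words = read_sentence(SAMPLE)
print(tokens)
print([w["form"] for w in words])
```

Keeping the range line next to the word lines is what lets the original text be recovered from the treebank.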
slide-59
SLIDE 59

14.–15.9.2015, Sedlec-Prčice 59

Where are we now?

  • Universal Dependencies, Version 1

– Guidelines released October 2014 – Treebank release May 2014 (v. 1.1): Basque, Bulgarian, Croatian, Czech, Danish, English, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Irish, Italian, Persian, Spanish, Swedish

  • Future plans

– New releases every six months (May, November) – Revision of guidelines as needed

  • November 2015

– Improve consistency of existing data – New languages, incl. free part of HamleDT Slovene, Romanian, Kazakh; Ancient Greek, Arabic, Dutch, Estonian, Latin, Polish, Portuguese, Tamil; Korean? Hindi? Norwegian?

SLIDE 60

PML-TQ (Tree Query)

  • All UD 1.1 treebanks are there
  • HamleDT 3.0 alpha is there

http://lindat.mff.cuni.cz/services/pmltq/

SLIDE 61

UD vs. Prague

  • Tokens and words: abychom, kdybychom, dělalť
  • Pronouns vs. determiners, numerals and quantifiers
  • Prepositions, copula, auxiliaries

– Multi-word prepositions

  • Coordination (no nested structures now)
  • Ellipsis, remnant, single root node
  • Attachment of cardinal numbers
  • Direct object, indirect object and nmod
  • Reflexive pronouns, reflexive passive

SLIDE 62

PDT Tokenization

  • Abbreviation + period = 2 tokens (atd .)

– Quite unusual in UD treebanks

SLIDE 63

Fused Tokens and Syntactic Words

  • proň, oň, naň, zaň (but not proč, oč …) = pro něj, o něj, na něj, za něj
  • byls, ses, sis, tys, cos, žes, … (= byl jsi, jsi se, jsi si, ty jsi, co jsi, že jsi …)
  • VERB + ť (dělalť = neboť dělal)
  • abych, abys, aby, abychom, abyste; kdybych, kdybys, kdyby, kdybychom, kdybyste (aby bych? až bych? a bych? kdyby bych? když bych? kdy bych?)
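
A rough sketch of how the closed-class cases above could be expanded with a lookup table. The expansions are copied from the slide as given; the aby/kdyby forms are left out because the slide itself leaves their split open, and the VERB + ť pattern is not covered. The names FUSED and expand, and the two-word input, are invented for the example.

```python
# Fused token -> syntactic words, as listed on the slide
FUSED = {
    "proň": ["pro", "něj"], "oň": ["o", "něj"],
    "naň": ["na", "něj"], "zaň": ["za", "něj"],
    "byls": ["byl", "jsi"], "ses": ["jsi", "se"], "sis": ["jsi", "si"],
    "tys": ["ty", "jsi"], "cos": ["co", "jsi"], "žes": ["že", "jsi"],
}


def expand(tokens):
    """Replace each known fused token by its syntactic words."""
    words = []
    for token in tokens:
        # Lookup is case-insensitive; unknown tokens pass through unchanged
        words.extend(FUSED.get(token.lower(), [token]))
    return words


print(expand(["Čekal", "naň"]))
```

Note that a capitalized fused token comes back in lower case here; a real converter would need to restore the original casing.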

SLIDE 64

Part-of-Speech Tags

  • Pronouns and determiners: next slide
  • The -li morpheme should be SCONJ (we have PART)
  • Response words ano, ne should be INTJ (we have PART)
  • We distinguish POS of foreign words
  • PUNCT and SYM have to be distinguished

SLIDE 65

Pronouns and Determiners

  • English + Romance languages: DET = article or pronominal adjective (this, which, every)
  • Currently functional borderline (but ellipsis?)

This.DET car is expensive. This.PRON is expensive.

  • P: já, ty, on, ona, ono, my, vy, oni, ony, se, kdo, co, někdo, něco, nikdo, nic
  • P/D: ten, tento, tamten, jaký, který, čí, nějaký, některý, něčí, každý, všechen, žádný, …
  • D: můj, tvůj, jeho, její, náš, váš, jejich, svůj

SLIDE 66

Numerals and Quantifiers

  • NUM: jeden, dva, tři, čtyři, pět, šest, …, sto
  • NUM/NOUN: tisíc, milión, miliarda
  • NOUN: polovina, třetina, čtvrtina, setina
  • ADJ: první, druhý, třetí, čtvrtý, …, stý, tisící; dvojí, trojí, čtverý, paterý
  • DET: kolik, tolik, několik, mnoho, málo, kolikátý, kolikerý
  • ADV: jedenkrát, dvakrát, třikrát, čtyřikrát, pětkrát; poprvé, podruhé, potřetí, posté; kolikrát, pokolikáté
  • NUM: dvé, tré, čtvero, patero, šestero, sedmero; jedny, dvoje, troje, čtvery, patery

but: více, méně, …

SLIDE 67

Syntax (a-layer of PDT!)

SLIDE 69

Copula and Nominal Predicate

SLIDE 70

Copula vs. Passive

SLIDE 71

Nested coordination! Shared dependent!

SLIDE 73

Multiple roots are frowned upon

SLIDE 74

NUM:

SLIDE 75

dobj / iobj / nmod

  • We do not (yet?) distinguish direct and indirect object. Not as easy as accusative vs. dative.

– Cením si vaší pomoci. (Gen)
– Čelíme velkým problémům. (Dat)
– Nedisponuje takovým rozpočtem. (Ins)
– Učí mou dceru fyziku. (2 × Acc)

SLIDE 76

dobj / iobj / nmod

  • Core arguments: what exactly are they?
  • English:

– He gave John the book. (iobj)
– He gave the book to John. (nmod)

  • Spanish:

– Dio el libro a John. (iobj)

  • Czech:

– Every Obj is translated to dobj, regardless of the case and the presence of a preposition

SLIDE 77

Reflexive Pronouns

  • Direct or indirect object (dobj, iobj):

Řízl se do prstu / Řízl ho do prstu.

– Including reciprocal usage:

Políbili se. / They kissed each other.

  • Inherently reflexive verbs: smát se, bát se

– compound:reflex (analogy to English compound:prt in give up, come on, …) NOW CHANGED: expl:reflex

  • Reflexive passive:

To se snadněji řekne než udělá.

– auxpass:reflex