Morphology within the Multi-Layered Annotation Scenario of the Prague Dependency Treebank. Magda Ševčíková, Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics. SFCM 2015.


slide-1
SLIDE 1

Morphology within the Multi-Layered Annotation Scenario of the Prague Dependency Treebank

Magda Ševčíková

Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics

SFCM 2015, September 16–17, 2015

slide-2
SLIDE 2

Outline

1 Introduction

2 Morphology in Prague Dependency Treebank
  PDT in a nutshell
  Morphological layer
  Tectogrammatical layer

3 Praguian morphology in NLP of Czech
  Developing taggers
  Named entity recognition
  Derivational morphology

4 Conclusions

Magda Ševčíková: Morphology within the Annotation Scenario of PDT

slide-3
SLIDE 3

Introduction: Treebanks without morphology?

83 treebanks for 51 languages (Zeman 2015)
from coarse-grained part-of-speech information to detailed description of morphological categories, according to the theoretical approach (and morphological richness of the language)

Penn Treebank

https://lindat.mff.cuni.cz/services/pmltq/

slide-4
SLIDE 4

Introduction: Treebanks without morphology?

TIGER treebank

https://lindat.mff.cuni.cz/services/pmltq/

slide-5
SLIDE 5

Introduction: Treebanks without morphology?

TüBa-D/Z

https://weblicht.sfs.uni-tuebingen.de/Tundra/

slide-6
SLIDE 6

Introduction: Treebanks without morphology?

BulTreeBank

https://lindat.mff.cuni.cz/services/pmltq/

slide-7
SLIDE 7

Introduction: Morphology in recent treebanking projects

HamleDT (HArmonized Multi-LanguagE Dependency Treebank)

http://ufal.mff.cuni.cz/hamledt
42 treebanks for 36 languages in version 3.0 (August 18, 2015)
surface-syntactic annotation based on Stanford Dependencies (de Marneffe et al. 2014)
Interset interlingua for morphological features (Zeman 2008)

Universal Dependencies

http://universaldependencies.github.io/docs/
34 languages in version 1.1 (May 15, 2015)
Universal Dependencies standard based on Stanford Dependencies
“interlingua” based on Zeman’s Interset and Google universal part-of-speech tags (Petrov et al. 2012)

slide-8
SLIDE 8

Introduction: Interset interlingua for morphological tagsets

converting tagsets into interlingua (and/or into other tagsets)
comparing tagsets (http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl)

Penn treebank tagset: 48 tags for English
SynTagRus tagset: 376 tags for Russian
Hajič’s tagset for Czech (PDT): 4,294 tags
  vs. 846 tags for Czech assigned by the ajka tagger

Penn tag NNPS
  Interset: pos="noun", subpos="prop", number="plu"
Penn tag VB
  Interset: pos="verb", verbform="inf"

PDT tag NNFP1-----A----
  Interset: pos="noun", negativeness="pos", gender="fem", number="plu", case="nom"
PDT tag VB-P---3P-AA---
  Interset: pos="verb", negativeness="pos", number="plu", person="3", verbform="fin", mood="ind", tense="pres", voice="act"
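
The tagset comparison on this slide can be illustrated with a small sketch. It stores the four feature structures shown above as plain Python dictionaries and reports which features two tags share and which ones only the richer tagset provides. The dictionary layout and the helper names `shared_features` and `extra_features` are illustrative assumptions; they are not part of the Interset software.

```python
# Feature structures copied from the slide; the dict layout is an assumption.
PENN = {
    "NNPS": {"pos": "noun", "subpos": "prop", "number": "plu"},
    "VB": {"pos": "verb", "verbform": "inf"},
}
PDT = {
    "NNFP1-----A----": {"pos": "noun", "negativeness": "pos", "gender": "fem",
                        "number": "plu", "case": "nom"},
    "VB-P---3P-AA---": {"pos": "verb", "negativeness": "pos", "number": "plu",
                        "person": "3", "verbform": "fin", "mood": "ind",
                        "tense": "pres", "voice": "act"},
}


def shared_features(a, b):
    """Features that both structures specify with the same value."""
    return {k: v for k, v in a.items() if b.get(k) == v}


def extra_features(a, b):
    """Names of features specified in `a` but absent from `b`."""
    return sorted(k for k in a if k not in b)


penn_noun, pdt_noun = PENN["NNPS"], PDT["NNFP1-----A----"]
print(shared_features(penn_noun, pdt_noun))
print(extra_features(pdt_noun, penn_noun))
```

Comparing feature structures rather than tag strings is what makes tagsets of very different sizes commensurable: the Czech noun tag adds case, gender and negation on top of what the English tag encodes.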

slide-9
SLIDE 9

Introduction: Morphological richness (HamleDT)

[Zeman 2015]

slide-11
SLIDE 11

Introduction: How rich is Czech?

rich inflectional and derivational morphology in Czech

agent ‘agent’
  agent (nom.sg.)
  agenta (gen.sg.|acc.sg.)
  agentu (dat.sg.|loc.sg.)
  agentovi (dat.sg.|loc.sg.)
  agente (voc.sg.)
  agentem (instr.sg.)
  agenti (nom.pl.|voc.pl.)
  agentové (nom.pl.|voc.pl.)
  agentů (gen.pl.)
  agentům (dat.pl.)
  agenty (acc.pl.|instr.pl.)
  agentech (loc.pl.)

agent ‘agent’
  > agentův ‘agent’s’
  > agentka ‘female agent’
  > agentský ‘agency’
  > superagent ‘superagent’
  ...
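
The inflectional paradigm above shows ambiguity in both directions: one form can carry several case/number readings, and one reading can be expressed by several forms. A minimal sketch of that observation follows; the data is copied from the slide, while the dictionary layout and the name `forms_by_reading` are assumptions made for illustration.

```python
from collections import defaultdict

# Paradigm of "agent" as listed on the slide: form -> case.number readings.
PARADIGM = {
    "agent": ["nom.sg"],
    "agenta": ["gen.sg", "acc.sg"],
    "agentu": ["dat.sg", "loc.sg"],
    "agentovi": ["dat.sg", "loc.sg"],
    "agente": ["voc.sg"],
    "agentem": ["instr.sg"],
    "agenti": ["nom.pl", "voc.pl"],
    "agentové": ["nom.pl", "voc.pl"],
    "agentů": ["gen.pl"],
    "agentům": ["dat.pl"],
    "agenty": ["acc.pl", "instr.pl"],
    "agentech": ["loc.pl"],
}


def forms_by_reading(paradigm):
    """Invert the paradigm: case.number reading -> list of forms."""
    index = defaultdict(list)
    for form, readings in paradigm.items():
        for reading in readings:
            index[reading].append(form)
    return dict(index)


index = forms_by_reading(PARADIGM)
ambiguous = [form for form, readings in PARADIGM.items() if len(readings) > 1]
print(len(PARADIGM), "forms,", len(index), "case/number slots")
print("ambiguous forms:", ambiguous)
print("dat.sg variants:", index["dat.sg"])
```

Twelve distinct forms cover fourteen case/number slots, which is why a tagger needs context to pick the right reading of a form such as agentu.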

slide-12
SLIDE 12

zvát ‘to invite’
ind.pres.act.: zvu, zveš, zve; zveme, zvete, zvou
ind.pret.act.: zval(a) jsem, zval(a) jsi, zval(a); zvali/y jsme, zvali/y jste, zvali/y
ind.fut.act.: budu zvát, budeš zvát, bude zvát; budeme zvát, budete zvát, budou zvát
ind.pres.pass.: jsem zván(a), jsi zván(a), je zván(a); jsme zváni/y, jste zváni/y, jsou zváni/y
ind.pret.pass.: byl(a) jsem zván(a), byl(a) jsi zván(a), byl(a) zván(a); byli/y jsme zváni/y, ...
ind.fut.pass.: budu zván(a), budeš zván(a), bude zván(a); budeme zváni/y, ...
cond.pres.act.: zval(a) bych, zval(a) bys, zval(a) by; zvali/y bychom, ...
cond.pres.pass.: byl(a) bych zván(a), byl(a) bys zván(a), byl(a) by zván(a); byli/y by zváni/y, ...
...

slide-14
SLIDE 14

Morphology in Prague Dependency Treebank: Form and meaning

multiple annotation layers

morphology as a separate layer of annotation
  lemma and positional (POS+) tag (Hajič 2004)

    agentu ‘(to an) agent’
      agent  NNMS3-----A---1
    byli jste zváni ‘(you) were invited’
      být    VpMP---XR-AA---
      být    VB-P---2P-AA---
      zvát   VsMP---XX-AP---

meanings expressed by morphological categories captured at the tectogrammatical layer
  grammateme attributes

    agentu ‘(to an) agent’: one entity
    byli jste zváni ‘(you) were invited’: past event

slide-15
SLIDE 15

Prague Dependency Treebank – a short history

theoretically rooted in Functional Generative Description (Sgall 1967, Sgall et al. 1986)
  language system decomposed into multiple layers
  relation of form and function between neighboring layers
  unambiguity and self-containedness of the sentence representation at each layer

annotation of Prague Dependency Treebank
  started in the late 1990s
  PDT 1.0 (2001): morphological and analytical annotation
  PDT 2.0 (2006): plus tectogrammatical annotation
  PDT 2.5 (2011)
  PDT 3.0 (2013)

slide-16
SLIDE 16

Annotation layers in PDT

one non-annotation (word) layer

three layers of annotation
  morphological layer: 1,960k tokens in 116k sentences in PDT 2.0
  analytical layer: 88k sentences with 1,503k tokens
  tectogrammatical layer: 49k sentences with 830k tokens

cross-layer references between nodes of neighboring layers

lit.: ‘Was would gone to-forest.’ ‘He would have gone to the forest.’

slide-17
SLIDE 17

Annotation at the morphological layer of PDT

automatic morphological analysis
  MorfFlex dictionary with 350k+ manual entries (Hajič – Hlaváčová 1990)
  recognizer of about 12M Czech word forms

manual disambiguation
  each file annotated by two annotators in parallel
  instances of disagreement decided by a third annotator

each token
  two-component lemma (lemma proper and technical suffix)
  positional tag (15 positions)

    agentu ‘(to an) agent’
      agent  NNMS3-----A---1
    Hrbkovu ‘(to) Hrbek’s’
      Hrbkův ;S ^(*3ek)  AUMS3M---------

slide-18
SLIDE 18

Hrbkovu

Hrbkův ;S ^(*3ek)  AUMS3M---------

Lemma part   Explanation
Hrbkův       lemma proper
;S           technical suffix, named entity type: surname
^(*3ek)      technical suffix, derivation: substitute 3 last characters with “ek”

Position  Category             In example
1         part of speech       A: adjective
2         detailed POS         U: possessive
3         gender               M: masc.anim.
4         number               S: singular
5         morph. case          3: dative
6         possessor’s gender   M: masc.anim.
7         possessor’s number
8         person
9         tense
10        degree of comp.
11        negation
12        verbal voice
13        unused
14        unused
15        variant, register
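
The position table above translates directly into a small decoder. The sketch below is illustrative only: the position names are taken from the slide, the function name `decode_tag` is an assumption, and the single-character values are returned as they stand because the slide glosses only a few of them.

```python
# Position names follow the table on the slide (positions are 1-based).
POSITIONS = [
    "part of speech", "detailed POS", "gender", "number", "morph. case",
    "possessor's gender", "possessor's number", "person", "tense",
    "degree of comp.", "negation", "verbal voice", "unused", "unused",
    "variant, register",
]


def decode_tag(tag):
    """Return (position, category, value) for every filled position of a tag."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return [
        (i, name, value)
        for i, (name, value) in enumerate(zip(POSITIONS, tag), start=1)
        if value != "-"  # "-" marks a position that does not apply
    ]


for position, category, value in decode_tag("AUMS3M---------"):
    print(position, category, value)
```

Because every category has a fixed position, a single character lookup is enough to read off, for example, the case of a noun or the tense of a verb.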

slide-19
SLIDE 19

Do této situace se Sparta dostala, jak řekl její předseda, dík Hrbkovu agentu Richovi Wintrovi.
lit.: Into this situation REFL Sparta got, as said her chairman, thanks Hrbek’s agent Rich Winter.
‘Sparta found itself in this situation, as its chairman said, thanks to Hrbek’s agent Rich Winter.’

Form      Lemma                        Tag
Do        do-1                         RR--2----------
této      tento                        PDFS2----------
situace   situace                      NNFS2-----A----
se        se_^(zvr._zájmeno/částice)   P7-X4----------
Sparta    Sparta_;K                    NNFS1-----A----
dostala   dostat                       VpQW---XR-AA---
,         ,                            Z:-------------
jak       jak-3                        Db-------------
řekl      říci_:W                      VpYS---XR-AA---
její      jeho_^(přivlast.)            PSZS1FS3-------
předseda  předseda                     NNMS1-----A----
,         ,                            Z:-------------
dík       dík                          NNIS4-----A----
Hrbkovu   Hrbkův_;S_^(*3ek)            AUMS3M---------
agentu    agent                        NNMS3-----A---1
Richovi   Rich_;Y                      NNMS3-----A----
Wintrovi  Wintr_;S                     NNMS3-----A----
.         .                            Z:-------------
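
The lemmas in this sentence combine a lemma proper with technical suffixes, and the derivation suffix encodes how to recover the base word. A minimal sketch of splitting such a lemma and applying a `^(*3ek)`-style instruction is given below. The regular expressions cover only the suffix shapes that occur in this example (`;X`, `:X` and `^(...)`), and the function names are hypothetical; the real PDT lemma syntax has more cases, such as the sense numbers in do-1, which are left untouched here.

```python
import re

# Technical suffixes seen in the example: _;S  _:W  _^(...)
SUFFIX = re.compile(r"_(;\w|:\w|\^\([^)]*\))")
# Derivation instruction: ^(*<n><ending>) = drop n characters, add the ending.
DERIVATION = re.compile(r"\^\(\*(\d+)(.*)\)")


def split_lemma(lemma):
    """Split a lemma into the lemma proper and its list of technical suffixes."""
    match = SUFFIX.search(lemma)
    if match is None:
        return lemma, []
    return lemma[:match.start()], SUFFIX.findall(lemma[match.start():])


def base_word(proper, suffixes):
    """Apply the first derivation suffix, if any, to the lemma proper."""
    for suffix in suffixes:
        match = DERIVATION.fullmatch(suffix)
        if match:
            cut = int(match.group(1))
            return proper[:len(proper) - cut] + match.group(2)
    return None


proper, suffixes = split_lemma("Hrbkův_;S_^(*3ek)")
print(proper, suffixes, base_word(proper, suffixes))
```

For Hrbkův the instruction removes the last three characters and appends "ek", which yields the surname Hrbek that the possessive adjective is derived from.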

slide-20
SLIDE 20

Morphological annotation: an overview

1,960,000 tokens at the morphological layer of the PDT 3.0
  1,574 different positional tags (vs. 4k possible tags)
  71,503 different morphological lemmas

slide-21
SLIDE 21

Meanings expressed by morphological categories: Grammateme attributes at the tectogrammatical layer

(a type of) node attributes in the tectogrammatical tree

represent morphological meanings that participate in creating the meaning of the sentence, e.g.
  number with nouns
  degree of comparison with adjectives
  tense with verbs

no grammatemes for categories imposed by government or agreement
  case with nouns
  number and gender with adjectives
  person, gender and number with verbs

slide-28
SLIDE 28

Grammatemes: Disambiguating meaning of the sentence

t-example (tree nodes without grammatemes): hire PRED; Tom ACT; agent PAT; young RSTR; Determiner DET

1 Tom hired a young agent.
2 Tom will hire a younger agent.
3 Tom is hiring the youngest agent.
4 Tom will hire younger agents.
5 Tom hired young agents.
6 ...

slide-29
SLIDE 29

Grammatemes: Disambiguating meaning of the sentence

t-example (tree nodes with grammateme values): hire PRED fut; Tom ACT; agent PAT sg; young RSTR comp; Determiner DET

1 Tom hired a young agent.
2 Tom will hire a younger agent.
3 Tom is hiring the youngest agent.
4 Tom will hire younger agents.
5 Tom hired young agents.
6 ...
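
The disambiguating role of grammatemes can be sketched as a filter over the candidate sentences. In the snippet below only the values fut, sg and comp come from the slide; the attribute names `tense`, `number` and `degcmp` are borrowed from the grammateme list later in the talk, and the remaining values (past, pres, pl, pos, sup) as well as the function name `compatible` are illustrative assumptions.

```python
# Candidate sentences from the slide, each paired with the grammateme values
# it would require on "hire" (tense), "agent" (number) and "young" (degcmp).
CANDIDATES = {
    1: ("Tom hired a young agent.",
        {"tense": "past", "number": "sg", "degcmp": "pos"}),
    2: ("Tom will hire a younger agent.",
        {"tense": "fut", "number": "sg", "degcmp": "comp"}),
    3: ("Tom is hiring the youngest agent.",
        {"tense": "pres", "number": "sg", "degcmp": "sup"}),
    4: ("Tom will hire younger agents.",
        {"tense": "fut", "number": "pl", "degcmp": "comp"}),
    5: ("Tom hired young agents.",
        {"tense": "past", "number": "pl", "degcmp": "pos"}),
}


def compatible(grammatemes):
    """Numbers of the sentences consistent with the given grammateme values."""
    return [
        number
        for number, (_, values) in CANDIDATES.items()
        if all(values.get(name) == value for name, value in grammatemes.items())
    ]


print(compatible({}))                 # bare tree: every sentence fits
print(compatible({"tense": "fut"}))   # tense alone narrows the set
print(compatible({"tense": "fut", "number": "sg", "degcmp": "comp"}))
```

With no grammatemes the tree is compatible with all five sentences; with fut, sg and comp filled in, only sentence 2 remains.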

slide-30
SLIDE 30

Two-level typing of tectogrammatical nodes

1 8 types of nodes
    nodetype attribute
    grammatemes relevant for complex nodes only

2 4 semantic parts of speech
    sempos attribute
    semantic nouns, adjectives, adverbs, and verbs
    19 subgroups
    the sempos value delimits the set of relevant grammatemes

t-mf930709-030-p1s1 (tree nodes): root; benzín ACT n.denot; levný RSTR adj.denot; #EmpVerb PRED qcomplex; východ LOC n.denot; #Comma CONJ coap; drahý RSTR adj.denot; benzín ACT n.denot; #EmpVerb PRED qcomplex; západ LOC n.denot

Levnější benzín na Východě, dražší na Západě
‘Cheaper gasoline in the East, more expensive one in the West’

slide-31
SLIDE 31

15 grammatemes in PDT 3.0
Semantic nouns, adjectives, and adverbs

1 number: number of entities which a noun refers to
2 typgroup: plural forms of nouns denoting pairs/groups
3 gender: grammatical gender of nouns
4 person: with pronouns (speaker vs. hearer vs. nonparticipant)
5 politeness: polite usage of 2nd person pronouns
6 degcmp: degree of comparison with adjectives and adverbs
7 negation: negated nouns etc. represented by positive counterparts
8 indeftype: pronominals reduced to a small set of lemmas
9 numertype: numerals reduced to cardinals

slide-35
SLIDE 35

15 grammatemes in PDT 3.0
Semantic verbs

1 tense: past vs. present vs. future events
2 factmod: asserted vs. potential vs. irreal events
3 aspect: imperfective vs. perfective verbs
4 deontmod: modal verbs represented as auxiliaries
5 diatgram: grammaticalized diatheses of verbs
6 iterativeness: iterative verbs represented by non-iteratives
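
The way a sempos value delimits the relevant grammatemes can be sketched as a lookup. The two groups below follow the two lists in the talk (nine grammatemes for semantic nouns, adjectives and adverbs, six for semantic verbs). This grouping is deliberately coarse and is an assumption of the sketch: in PDT the relevant set is defined per sempos subgroup, which is finer than what is shown here, and the name `candidate_grammatemes` is hypothetical.

```python
# Grammateme names as listed in the talk, grouped as on the two slides.
GRAMMATEMES = {
    "nominal": ["number", "typgroup", "gender", "person", "politeness",
                "degcmp", "negation", "indeftype", "numertype"],
    "verbal": ["tense", "factmod", "aspect", "deontmod", "diatgram",
               "iterativeness"],
}


def candidate_grammatemes(sempos):
    """Coarse candidate set for a sempos value such as 'n.denot' or 'v'."""
    head = sempos.split(".")[0]
    if head == "v":
        return GRAMMATEMES["verbal"]
    if head in {"n", "adj", "adv"}:
        return GRAMMATEMES["nominal"]
    return []


print(sum(len(names) for names in GRAMMATEMES.values()))
print(candidate_grammatemes("adj.denot"))
print(candidate_grammatemes("v"))
```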

slide-36
SLIDE 36

Do této situace se Sparta dostala, jak řekl její předseda, dík Hrbkovu agentu Richovi Wintrovi.
‘Sparta found itself in this situation, as its chairman said, thanks to Hrbek’s agent Rich Winter.’

t-mf920922-056-p2s6B (node attributes as extracted from the tree figure)
  root
  Sparta t ACT n.denot fem sg single
  tento t RSTR adj.pron.def.demon
  situace t DIR3 basic state n.denot fem sg single
  #PersPron t APP P n.pron.def.pers fem sg 3 basic
  předseda t ACT P n.denot anim sg single
  #Gen t ADDR P qcomplex
  #PersPron t EFF P n.pron.def.pers neut sg 3 basic
  říci enunc t PAR P v decl asserted cpl act it0 ant
  dostat_se enunc f PRED v decl asserted cpl act it0 ant
  Wintr f CAUS n.denot anim sg single person_name RSTR
  Rich f RSTR n.denot anim sg single person_name
  agent f RSTR n.denot anim sg single
  Hrbek f APP n.denot anim sg single person_name

slide-37
SLIDE 37

Annotation of grammatemes

the last task in the PDT 2.0 annotation procedure

automatic assignment based on
  morphological annotation
    grammateme values mostly cannot be derived from the positional tag of a single word form
    more complex structures including auxiliaries are involved in the value assignment procedure
  preceding tectogrammatical annotations
    tree structure
    semantic roles
    coreference
  lexical resources
    special-purpose lists of pronouns, adverbs, verbs

manual annotation of special problems
  e.g. number with pluralia tantum

slide-38
SLIDE 38

Automatic assignment of grammatemes: using positional tags, tree structure, and lexical lists

number grammateme
  from positional tags with most nouns
  from verb forms with pro-drops
factmod grammateme
  from positional tags of (auxiliary) verb forms
deontmod grammateme
  chtít ‘to want’
indeftype grammateme
  nějaký ‘some’ > jaký

t-mf920925-120-p18s2 (node attributes as extracted from the tree figure)
  root
  být inter f PRED v decl asserted proc act it0 sim
  cíl f ACT n.denot inan pl single
  ještě f RHEM atom
  jaký f RSTR adj.pron.indef indef1
  který t PAT n.pron.indef inher relat inher inher
  #PersPron t ACT n.pron.def.pers anim sg 2 polite
  dosáhnout f RSTR v vol potential cpl act it0 nil

Jsou ještě nějaké cíle, kterých byste chtěl dosáhnout?
‘Are there any goals which (you) would like to achieve?’
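
Three of the assignment sources named on this slide can be sketched as simple lookups: the number grammateme read from position 4 of a noun's positional tag, and two lexical lists for deontmod and indeftype. The values sg, pl, vol and indef1 are the ones visible in the example tree; the table and function names are assumptions of this sketch, and the actual PDT procedure also inspects auxiliaries and the tree structure, which is not modelled here.

```python
# Position 4 of the positional tag holds grammatical number (S, P).
NUMBER_BY_TAG_CHAR = {"S": "sg", "P": "pl"}

# Tiny stand-ins for the special-purpose lexical lists (illustrative only).
DEONTMOD_BY_MODAL = {"chtít": "vol"}
INDEFTYPE_BY_LEMMA = {"nějaký": ("jaký", "indef1")}


def number_from_tag(tag):
    """Number grammateme for a noun, or None if the tag does not decide it."""
    return NUMBER_BY_TAG_CHAR.get(tag[3])


def reduce_pronominal(lemma):
    """Map an indefinite pronominal to (reduced lemma, indeftype), if listed."""
    return INDEFTYPE_BY_LEMMA.get(lemma, (lemma, None))


print(number_from_tag("NNMS3-----A---1"))
print(DEONTMOD_BY_MODAL["chtít"])
print(reduce_pronominal("nějaký"))
```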

slide-43
SLIDE 43

Automatic assignment of grammatemes: using coreference

relative pronouns
  grammatical categories imposed by agreement inherited from the antecedent
  values underspecified (inher value)

t-mf920925-120-p18s2 (relevant nodes, attributes as extracted from the tree figure)
  cíl f ACT n.denot inan pl single
  který t PAT n.pron.indef inher relat inher inher

Jsou ještě nějaké cíle, kterých byste chtěl dosáhnout?
‘Are there any goals which (you) would like to achieve?’
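
How an underspecified inher value could be resolved through a coreference link is sketched below. The treebank stores the value inher on the relative pronoun; filling it in from the antecedent is shown here only as an illustration. Which attribute each inher in the example belongs to is an assumption (gender, number and person are assumed, with relat taken as the indeftype value), and the function name is hypothetical.

```python
def resolve_inherited(node, antecedent):
    """Replace 'inher' values with the antecedent's value where it has one."""
    return {
        name: antecedent.get(name, value) if value == "inher" else value
        for name, value in node.items()
    }


# Relative pronoun "který" and its antecedent "cíl" from the example tree;
# the attribute-to-value assignment is assumed for illustration.
ktery = {"gender": "inher", "indeftype": "relat",
         "number": "inher", "person": "inher"}
cil = {"gender": "inan", "number": "pl"}

print(resolve_inherited(ktery, cil))
```

Values the antecedent does not specify, such as person here, simply stay underspecified.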

slide-45
SLIDE 45

Introduction Morphology in Prague Dependency Treebank Praguian morphology in NLP of Czech Conclusions PDT in a nutshell Morphological layer Tectogrammatical layer

Automatic vs. manual annotation of grammatemes

1,600,000 grammateme values assigned to 550,000 complex nodes at the tectogrammatical layer of PDT 2.0
17,500 of them assigned manually
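To put these figures in proportion, a short calculation (using only the three numbers stated on the slide) gives the share of manually assigned values and the average number of grammateme values per complex node:

```python
total_values = 1_600_000   # grammateme values in PDT 2.0 (from the slide)
complex_nodes = 550_000    # complex nodes carrying them (from the slide)
manual_values = 17_500     # values assigned manually (from the slide)

manual_share = manual_values / total_values
values_per_node = total_values / complex_nodes

print(f"manual share: {manual_share:.2%}")
print(f"values per complex node: {values_per_node:.2f}")
```

Roughly one value in a hundred was assigned by hand; the rest came from the automatic procedure.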

SLIDE 46

Manual annotation of grammatemes

two annotators in parallel

inter-annotator agreement: 70–85 %

simplified annotation environment

treebank positions extracted into simple HTML forms

pluralia tantum

Otevřel dveře.sg na terasu. ‘He opened the door to the terrace.’
vs. několikery dveře.pl ‘several doors’

polite usage of 2nd person pronouns

Vy.polite jste se už přihlásil? ‘Have you logged in already?’
vs. Vy.basic jste se už přihlásili? ‘Have you logged in already?’

absolute usage of comparative forms of adjectives and adverbs

starší.acomp žena ‘an elder woman’
vs. jeho starší.comp bratr ‘his older brother’

biaspectual verbs, pair/group meaning of plural forms, ...
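The slide does not say how inter-annotator agreement was measured. As a minimal sketch, assuming simple percent agreement over parallel decisions, it could be computed as follows; the two label sequences are invented for illustration.

```python
def percent_agreement(first, second):
    """Share of items on which two annotators chose the same value."""
    if len(first) != len(second):
        raise ValueError("annotations must cover the same items")
    if not first:
        raise ValueError("no items to compare")
    same = sum(a == b for a, b in zip(first, second))
    return same / len(first)


# Invented number values for five occurrences of a plurale tantum.
annotator_1 = ["sg", "pl", "sg", "sg", "pl"]
annotator_2 = ["sg", "pl", "pl", "sg", "pl"]

print(percent_agreement(annotator_1, annotator_2))
```

Chance-corrected measures such as Cohen's kappa would give lower figures for the same data; which measure underlies the 70–85 % range is not stated here.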

SLIDE 47

PDT data: developing taggers of Czech

feature-based tagger (Hajič 2004)
part of the PDT 2.0 release

HMM tagger (Krbec 2005)

Morče tagger (Votrubec 2005)
averaged perceptron

combined approach (Spoustová et al. 2007)
Morče tagger, feature-based tagger, HMM tagger, and a rule-based component

Morče tagger semi-supervised (Spoustová et al. 2009)

MorphoDiTa (Straková et al. 2014)
open-source tool for morphological analysis, tagging, lemmatization, tokenization, and morphological generation
available with trained linguistic models
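The averaged perceptron mentioned for the Morče tagger can be illustrated with a toy sketch. This is not the Morče implementation: the feature set, the tag labels `N`/`V`, the naive averaging scheme and the three training sentences are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def features(words, i):
    # Hypothetical feature set: word form, two-character suffix, previous word.
    word = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    return [f"w={word}", f"suf={word[-2:]}", f"prev={prev}"]


class AveragedPerceptron:
    def __init__(self, tags):
        self.tags = sorted(tags)
        self.weights = defaultdict(float)  # (feature, tag) -> current weight
        self.totals = defaultdict(float)   # (feature, tag) -> sum over all steps
        self.steps = 0

    def predict(self, feats, weights):
        def score(tag):
            return sum(weights.get((f, tag), 0.0) for f in feats)
        return max(self.tags, key=score)

    def update(self, feats, gold):
        guess = self.predict(feats, self.weights)
        if guess != gold:
            for f in feats:
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0
        # Naive averaging: accumulate the current weights after every example.
        for key, value in self.weights.items():
            self.totals[key] += value
        self.steps += 1

    def averaged(self):
        return {key: total / self.steps for key, total in self.totals.items()}


def train(sentences, epochs=5):
    tags = {t for _, gold in sentences for t in gold}
    model = AveragedPerceptron(tags)
    for _ in range(epochs):
        for words, gold in sentences:
            for i, tag in enumerate(gold):
                model.update(features(words, i), tag)
    return model


def tag(model, words):
    weights = model.averaged()
    return [model.predict(features(words, i), weights) for i in range(len(words))]


toy = [
    (["pes", "spí"], ["N", "V"]),
    (["kočka", "spí"], ["N", "V"]),
    (["pes", "běží"], ["N", "V"]),
]
model = train(toy)
print(tag(model, ["kočka", "spí"]))
```

Averaging the weights over all training steps, rather than using the final weights, is what distinguishes this variant from the plain perceptron; real taggers use far richer features and a full positional tagset.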

SLIDE 50

Accuracy of taggers

Czech taggers (PDT 2.5)                           Accuracy
Morče semi-supervised (Spoustová et al. 2009)     95.89 %
MorphoDiTa (Straková et al. 2014)                 95.75 %
combination of taggers (Spoustová et al. 2007)    95.70 %
Morče (Votrubec 2005)                             95.67 %
HMM (Krbec 2005)                                  94.82 %
feature-based tagger (Hajič 2004)                 94.04 %

English taggers (PennTB/WSJ)                      Accuracy
Shen et al. (2007)                                97.33 %
MorphoDiTa (Straková et al. 2014)                 97.27 %
Morče semi-supervised (Spoustová et al. 2009)     97.23 %

MorphoDiTa (Czech, first 2 positions)             99.18 %
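The difference between accuracy on the full tag and on its first two positions can be sketched as follows. The sketch assumes 15-character positional tags whose first two positions encode part of speech and detailed part of speech; the gold and predicted tags are invented for illustration.

```python
def accuracy(gold, predicted, positions=None):
    """Share of tokens whose tags match on the first `positions` characters.

    With positions=None the whole tag is compared.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty tag sequences of equal length")
    hits = sum(g[:positions] == p[:positions] for g, p in zip(gold, predicted))
    return hits / len(gold)


# Invented example: the third prediction gets the case wrong (position 5).
gold = ["NNMS1-----A----", "VB-S---3P-AA---", "NNFS4-----A----"]
pred = ["NNMS1-----A----", "VB-S---3P-AA---", "NNFS1-----A----"]

full_tag = accuracy(gold, pred)
first_two = accuracy(gold, pred, positions=2)
print(full_tag, first_two)
```

An error in any single morphological category counts against full-tag accuracy, which is one reason why the Czech figures are lower than the English ones and why the first-two-positions figure is so much higher.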

SLIDE 51

Named entity classification and recognition in Czech

pilot approach in 2007
two-level classification

rough and detailed categories
embedding allowed

g geographical names: gp planets; gt continents; gc states; gu towns; gs streets, squares; gh hydronyms; ...
p person names: pf first names; ps surnames; pm second names; pd (academic) titles; pc inhabitant names; pp religious/myth. persons; ...

5 recognizers since 2007

trained on Czech Named Entity Corpus

SLIDE 52

Czech Named Entity Corpus

http://ufal.mff.cuni.cz/cnec/

data selection

random selection of 6k isolated sentences with 150k tokens
33k NEs manually assigned by two annotators in parallel

categories annotated

7 rough categories in CNEC 1.1, 10 in CNEC 2.0
42 detailed categories in CNEC 1.1, 62 in CNEC 2.0

LINDAT/CLARIN repository

CNEC 1.0 (2009)
CNEC 1.1 (2014)
CNEC 2.0 (2014)

SLIDE 54

1: <P<pf Jan> <ps Stavěl>> byl dlouho činným , zemřel jako stařešina moravského hasičstva krátce před dovršením <qo 75 .> narozenin v <tm únoru> <ty 1933> .

2: " Začínala jsem v roce <ty 1995> s osmi chovanci místního ústavu , dnes jich pracuje třináct , " uvedla ke vzniku mimořádného seskupení herečka <P<pf Viera> <ps Dubačová>> .

3: V současné době je v <i <s CECIMO>> tedy <qc 14> členů .

4: Vnitřní reforma <io Unie> dosud neproběhla a válka na <gl Balkáně> čerpá finanční prostředky : <io<s EU>> bude investovat do poválečné obnovy <gc Jugoslávie> .
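The embedded bracket notation in the sentences above can be read with a small stack-based parser. This is a minimal sketch, assuming that every entity is written as `<type text>` and that plain text contains no angle brackets; the function names are hypothetical and this is not the official CNEC tooling.

```python
import re

# An opening tag with its type label, a closing bracket, or a run of plain text.
TOKEN = re.compile(r"<([A-Za-z_?]+)|>|[^<>]+")


def parse_entities(text):
    """Return (type, surface text) pairs, embedded entities first."""
    entities = []
    stack = []  # open entities as [type, collected text pieces]
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token.startswith("<"):
            stack.append([match.group(1), []])
        elif token == ">":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            etype, pieces = stack.pop()
            surface = " ".join("".join(pieces).split())
            entities.append((etype, surface))
            if stack:  # pass the text up to the embedding entity
                stack[-1][1].append(surface)
        elif stack:
            stack[-1][1].append(token)
    if stack:
        raise ValueError("unclosed entity")
    return entities


def rough_category(etype):
    """First letter of a two-character detailed type, e.g. 'pf' -> 'p'."""
    return etype[0] if len(etype) == 2 and etype.islower() else None


sentence = "<P<pf Jan> <ps Stavěl>> byl dlouho činným"
print(parse_entities(sentence))
```

The stack mirrors the embedding allowed by the annotation scheme: the first name and the surname are closed first, and their text is then passed up to the containing `P` entity.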

[Figure: diagram labelled with the CNEC named-entity types ?, A, C, P, T, ah, at, az, cn, cp, cr, cs, g_, gc, gh, gl, gp, gq, gr, gs, gt, gu, i_, ia, ic, if, io, mn, mr, mt, n_, na, nc, ni, nq, o_, oa, oc, oe, om, op, or, p_, pb, pc, pd, pf, pm, pp, ps, qc, qo, tc, td, tf, th, tm, tp, ty, nr, nw, np, nm, ts, tn, cb, mi]

[Straková et al. 2015]

SLIDE 55

Named entity recognizers for Czech

System                              F-measure         F-measure
                                    (7 categories)    (42 categories)
Straková et al. (2013)              82.82             79.23
Straková et al. (NameTag; 2014)     81.01             77.88
Konkol – Konopík (2013)             79.00             na
Kravalová et al. (2009)             71.00             68.00
Ševčíková et al. (2007)             68.00             62.00
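For readers unfamiliar with the metric, the balanced F-measure is the harmonic mean of precision and recall. The sketch below shows the standard definition on invented counts; the exact evaluation setup behind the figures in the table is not described on the slide.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F-measure (F1) from entity-level counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts: 8 entities found correctly, 2 spurious, 2 missed.
print(f_measure(8, 2, 2))
```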

SLIDE 56

Derivational morphology of Czech

derivational morphology underresourced in most languages

CELEX for English, German and Dutch (Baayen et al. 1995)
DerivBase for German (Zeller et al. 2013)
DerivBase.Hr for Croatian (Šnajder et al. 2014)
language-independent approach by Baranes – Sagot (2014)
Démonette network for French (Hathout – Namer 2014)
DeriNet for Czech (Ševčíková – Žabokrtský 2014)

SLIDE 57

DeriNet: lexical resource of derivational relations in Czech

970k lemmas connected with 715k derivational relations

compatible with the MorfFlex dictionary

[Figure: derivational tree with the lemmas agent, superagent, agentův, superagentův, agentský, agentství, agentsky, agentskost, agentka, superagentka, agentčin, superagentčin]

SLIDE 58

DeriNet 1.0

                              DeriNet 1.0
lemmas                        968,967
unique lemmas                 965,535
derivational links            715,729
derivational clusters         253,238
singleton clusters            101,311
maximum lemmas per cluster    82
maximum cluster depth         8

[Vidra 2015]
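The figures are consistent with a forest-shaped network in which every lemma has at most one base word: the number of clusters then equals lemmas minus links (968,967 − 715,729 = 253,238). The sketch below computes the same kind of statistics on a toy network; the links are illustrative assumptions, not data taken from DeriNet, and depth is counted in links, which may differ from the definition used in the table.

```python
from collections import defaultdict

# Illustrative child -> base-word links (an assumption, not DeriNet data).
parent = {
    "agentka": "agent",
    "agentčin": "agentka",
    "agentův": "agent",
    "agentský": "agent",
    "agentsky": "agentský",
    "superagent": "agent",
    "superagentka": "superagent",
}
# One extra lemma without any link forms a singleton cluster.
lemmas = set(parent) | set(parent.values()) | {"dveře"}


def root(lemma):
    """Follow base-word links up to the root of the cluster."""
    while lemma in parent:
        lemma = parent[lemma]
    return lemma


def depth(lemma):
    """Number of links between a lemma and the root of its cluster."""
    steps = 0
    while lemma in parent:
        lemma = parent[lemma]
        steps += 1
    return steps


clusters = defaultdict(set)
for lemma in lemmas:
    clusters[root(lemma)].add(lemma)

stats = {
    "lemmas": len(lemmas),
    "links": len(parent),
    "clusters": len(clusters),
    "singletons": sum(1 for c in clusters.values() if len(c) == 1),
    "max_cluster_size": max(len(c) for c in clusters.values()),
    "max_depth": max(depth(lemma) for lemma in lemmas),
}
print(stats)
```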

SLIDE 60

Derivational information in dependency trees

derivational information currently available in PDT

lemma suffix at the morphological layer
selected grammatemes and semantic roles at the tectogrammatical layer

extending derivational annotation in tectogrammatical trees

most frequent semantic classes
derived words substituted by the lemma of the base word
the word-formation meaning stored in a deriveme attribute
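The proposed substitution can be pictured with a minimal sketch. The node class, the lookup table and the deriveme labels are hypothetical; only the idea of replacing a derived lemma with its base word and recording the word-formation meaning in a deriveme attribute comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup: derived lemma -> (base lemma, deriveme label).
BASE_WORDS = {
    "učitel": ("učit", "agent_noun"),
    "Karlík": ("Karel", "diminutive"),
}


@dataclass
class TNode:
    """Hypothetical tectogrammatical node with an optional deriveme."""
    t_lemma: str
    deriveme: Optional[str] = None


def substitute_base(node: TNode) -> TNode:
    """Replace a derived lemma by its base word and keep the meaning."""
    entry = BASE_WORDS.get(node.t_lemma)
    if entry is None:
        return node  # not covered by the lookup: leave the node unchanged
    base, meaning = entry
    return TNode(t_lemma=base, deriveme=meaning)


print(substitute_base(TNode("učitel")))
```

Seen this way, the substitution works like an extended lemmatization: the tree stores the base word, and the derived word stays recoverable from the base plus the deriveme.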

SLIDE 61

Dependencies and derivations: natural language processing

machine translation: out-of-vocabulary words

English adverbs ending in -ly
Czech female profession names
Czech diminutives

parsing

sublexical analysis helped to achieve state-of-the-art results in parsing the Turkish Treebank (Eryigit et al. in Computational Linguistics 2008)

paraphrasing, ...

SLIDE 62

Dependencies and derivations: linguistic research

derivational morphemes vs. valency of verbs and nouns

Actor of učit ‘to teach’ incorporated in učitel ‘teacher’
Patient of the verb dát ‘to give’ involved in dárek ‘present’

derivational morphology of Czech vs. other languages

padnout – fallen – to fall
napadnout – auffallen – to stand out
vypadnout – ausfallen – to drop out

alignment at the level of morphemes

diminutive suffix in Karlík vs. the noun phrase little Charles

SLIDE 63

Conclusions (i)

Prague Dependency Treebank

morphology as a separate layer of annotation
grammateme attributes at the tectogrammatical layer

universal part-of-speech tags

substituting language- and framework-specific tagsets
grammatemes not yet confronted

lemmatization and tagging an essential prerequisite for most NLP tasks in Czech

SLIDE 64

Conclusions (ii)

analysing derivational morphology

derivational analysis in dependency trees

substituting derived words with base words as an extended lemmatization
advantageous for NLP and linguistic research
beware of ‘overloading’ the data

(automatic) morphemic analysis missing

semantic classification of affixes

SLIDE 65

References

Baayen, H. et al.: The CELEX lexical database (release 2). LDC 1995.
Baranes, M. – Sagot, B.: A Language-Independent Approach to Extracting Derivational Relations from an Inflectional Lexicon. LREC 2014: 2793–2799.
Eryigit, G. et al.: Dependency Parsing of Turkish. Computational Linguistics 2008:34, 357–389.
Hajič, J.: Disambiguation of Rich Inflection: Computational Morphology of Czech. Prague 2004.
Hajič, J. et al.: Prague Dependency Treebank 2.0. LDC 2006.
Hathout, N. – Namer, F.: Démonette, a French Derivational Morpho-Semantic Network. LiLT 2014:11, 125–168.
Straková, J. et al.: Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. ACL 2014: System Demonstrations, 13–18.
Ševčíková, M. – Žabokrtský, Z.: Word-Formation Network for Czech. LREC 2014: 1087–1093.
Šnajder, J. et al.: DerivBase.Hr: A High-Coverage Derivational Morphology Resource for Croatian. LREC 2014: 3371–3377.
Vidra, J.: Extending the lexical network DeriNet. Bc Thesis, Charles University in Prague 2015.
Zeller, B. et al.: DErivBase: Inducing and evaluating a derivational morphology resource for German. ACL 2013: 1201–1211.
Zeman, D.: From the Jungle to a Park: Harmonizing Annotations across Languages. Keynote at SPMRL 2015, Bilbao.

... see References in the SFCM 2015 proceedings paper