SLIDE 1

Merging Data Resources for Inflectional and Derivational Morphology in Czech

Zdeněk Žabokrtský, Magda Ševčíková, Milan Straka, Jonáš Vidra, Adéla Limburská

Charles University in Prague, Institute of Formal and Applied Linguistics

LREC, 25th May 2016, Portorož

Zdeněk Žabokrtský et al. (UFAL) Inflectional and Derivational Morphology LREC 2016 1 / 19

SLIDE 2

Outline

Motivation for processing inflection and derivation together
Inflectional and derivational resources for Czech
The resulting (merged) data resource
User interfaces to the data
Conclusions

SLIDE 3

Basic notions

morphological inflection: to derive → derives, derived, deriving
morphological derivation: to derive → derivative, derivation, derivator

SLIDE 4

Motivation

an omnipresent problem of NLP: zillions of different words

one of the reasons: morphological variation

standard ways to reduce the lexical space:

◮ lemmatization – replacing inflectionally related words by a selected representative
◮ stemming – replacing related words by a common stem (usually approximated very roughly)
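
To make the contrast concrete, here is a minimal sketch using the English forms of "derive" from the Basic notions slide. The lemma dictionary and the suffix list are toy assumptions made up for this illustration; they are not taken from any of the resources discussed in the talk.

```python
# Toy form -> lemma dictionary (hypothetical data for illustration).
LEMMAS = {"derives": "derive", "derived": "derive", "deriving": "derive"}

# Crude suffix list for the toy stemmer, longest first (an assumption).
SUFFIXES = ("ation", "ative", "ator", "ing", "es", "ed", "e")


def lemmatize(form):
    """Map an inflected form to its lemma; unknown forms are returned as-is."""
    return LEMMAS.get(form, form)


def stem(form):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if form.endswith(suffix) and len(form) - len(suffix) >= 3:
            return form[: -len(suffix)]
    return form


WORDS = ["derive", "derives", "derived", "deriving",
         "derivative", "derivation", "derivator"]

lemmas = sorted({lemmatize(w) for w in WORDS})
stems = sorted({stem(w) for w in WORDS})
print(lemmas)  # four lemmas: inflection is collapsed, derivation is not
print(stems)   # a single rough stem shared by all seven words
```

Lemmatization merges only the inflected forms, while the rough stemmer also merges the derived lemmas, at the price of producing a non-word.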

SLIDE 5

Motivation, cont.

in morphologically complex languages:

◮ possibly several tens (or more) inflected word forms per lemma
◮ but possibly several tens (or more) derived lemmas too!

a common-sense expectation: extending lemmatization (as anti-inflection) with nesting (as anti-derivation) might help NLP apps
in Czech, derivation is the most productive word formation method (hundreds of suffixes)
surprisingly few data resources for derivation (e.g., Derivancze for Czech, DerivBase for German, Démonette for French)

SLIDE 6

Derivation vs. inflection: similarities

For both it holds that:

there is a strong form-function asymmetry, e.g.

◮ there are several suffixes that express the same meaning (e.g. an actor)
◮ one specific suffix can express several roles

the way forms are combined is far from simple concatenation

◮ consonant and vowel changes (not limited to morpheme boundaries, can appear inside roots too)
◮ sometimes similar changes for inflection and derivation: sníh - sněhu (inflection: snow gen.sg.), sníh - sněžný (derivation: snowy adj.)

fuzzy boundaries of parole

◮ exhaustive enumeration of all potentially inflected/derived forms often reaches the language periphery

SLIDE 7

Derivation vs. inflection: differences

different data structure

◮ a set of words connected by inflection:
  ⋆ typically a full Cartesian product of morphological categories
◮ a set of lemmas connected by derivation:
  ⋆ rather an oriented graph (a nest), a rooted tree is often enough

in inflection, the paradigm representative is chosen by a convention, while in derivation, the tree root seems more tangible
semantic relatedness gradually weakens for more distant words in a derivation nest
in NLP, lemmatization is widely used while nesting is not
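
The two data structures can be sketched side by side. The category lists below reflect the seven cases and two numbers of Czech nouns; the derivational edges are illustrative assumptions built on the English "derive" example, and `depth` is a hypothetical helper that counts derivation steps as a rough proxy for distance within a nest.

```python
from itertools import product

# Inflection: the slots of a paradigm form a Cartesian product of categories.
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]
NUMBERS = ["sg", "pl"]
noun_slots = list(product(CASES, NUMBERS))

# Derivation: a nest as a rooted tree, stored as child -> parent.
# These edges are assumptions made up for the illustration.
PARENT = {
    "derivation": "derive",
    "derivational": "derivation",
    "derivationally": "derivational",
    "derivative": "derive",
}


def depth(lemma):
    """Number of derivation steps between a lemma and the root of its nest."""
    steps = 0
    while lemma in PARENT:
        lemma = PARENT[lemma]
        steps += 1
    return steps


print(len(noun_slots))
print(depth("derive"), depth("derivationally"))
```

The paradigm has a fixed shape determined by the categories, whereas the shape of a nest differs from root to root.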

SLIDE 8

MorfFlex CZ

Czech morphological dictionary developed originally by Jan Hajič as a spelling checker and lemmatizer
more than two decades of improvements
985 thousand unique lemmas with their inflectional paradigms associated with a positional tagset
capable of analyzing/generating 120 million word forms (form-lemma-tag triples)
used inter alia in the Prague Dependency Treebank and Czech National Corpus

SLIDE 9

A glimpse at the MorfFlex CZ data

lemma              tag               word form
podle-1 ^(*3ý-1)   Dg-------3N---6   nejnepodlejc
podle-1 ^(*3ý-1)   Dg-------3N----   nejnepodleji
podle-1 ^(*3ý-1)   Dg-------3A---6   nejpodlejc
podle-1 ^(*3ý-1)   Dg-------3A----   nejpodleji
podle-1 ^(*3ý-1)   Dg-------1N----   nepodle
podle-1 ^(*3ý-1)   Dg-------2N---6   nepodlejc
podle-1 ^(*3ý-1)   Dg-------2N----   nepodleji
podle-1 ^(*3ý-1)   Dg-------1A----   podle
podle-1 ^(*3ý-1)   Dg-------2A---6   podlejc
podle-1 ^(*3ý-1)   Dg-------2A----   podleji
podle-2            RR--2----------   podle
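
The 15-character tags in this sample can be unpacked position by position. The sketch below assumes the usual position names of the Prague positional tagset (POS, SubPOS, ..., Grade, Negation, ..., Variant); the names are recalled from the tagset documentation rather than stated on the slide, so they should be checked against it.

```python
# Position names assumed from the Prague positional tagset documentation.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]


def decode_tag(tag):
    """Return the filled positions of a positional tag as a dict."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    # A dash marks a position that does not apply to the word form.
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


print(decode_tag("Dg-------3N---6"))  # tag of nejnepodlejc
print(decode_tag("RR--2----------"))  # tag of the preposition podle
```

Under this reading, the adverb forms differ only in Grade (1, 2, 3), Negation (A, N) and Variant, while the preposition carries only a Case value.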

SLIDE 10

DeriNet

a network capturing derivation in Czech, developed since 2013

oriented graph (forest, each rooted tree = one derivational nest)

◮ nodes = lemmas
◮ edges = derivation relations (from base to derived lemmas)

size before merging with MorfFlex CZ

◮ 306 thousand nodes (chosen according to frequency in the Czech National Corpus)
◮ 117 thousand edges

compiled using a semi-automatic procedure, based especially on

◮ suffix substitution rules (extracted both from grammar books and from data)
◮ manually assembled lists of exceptions
◮ patterns for vowel and consonant changes
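
A minimal sketch of how suffix substitution rules and an exception list could interact. The rule format, the three rules, the exception entry and the tiny lexicon are all assumptions for illustration; the actual DeriNet pipeline is not described at this level of detail in the slides.

```python
# Hypothetical substitution rules:
# (derived POS, derived suffix, base POS, base suffix).
RULES = [
    ("N", "ost", "A", "ý"),   # rozlehlost-N <- rozlehlý-A
    ("A", "ový", "N", "a"),   # povahový-A   <- povaha-N
    ("A", "ský", "N", ""),    # venkovský-A  <- venkov-N
]

# Manually listed link that the rules above cannot produce (stem alternation).
EXCEPTIONS = {("sněžný", "A"): ("sníh", "N")}

# Toy lexicon; a base is proposed only if it is a known lemma.
LEXICON = {
    ("rozlehlý", "A"), ("rozlehlost", "N"),
    ("povaha", "N"), ("povahový", "A"),
    ("venkov", "N"), ("venkovský", "A"),
    ("sníh", "N"), ("sněžný", "A"),
}


def find_base(lemma, pos):
    """Propose a base lemma: exceptions first, then the first rule that fits."""
    if (lemma, pos) in EXCEPTIONS:
        return EXCEPTIONS[(lemma, pos)]
    for derived_pos, derived_suffix, base_pos, base_suffix in RULES:
        if pos != derived_pos or not lemma.endswith(derived_suffix):
            continue
        candidate = lemma[: len(lemma) - len(derived_suffix)] + base_suffix
        if (candidate, base_pos) in LEXICON:
            return (candidate, base_pos)
    return None


print(find_base("rozlehlost", "N"))
print(find_base("sněžný", "A"))
print(find_base("venkov", "N"))
```

Checking candidates against the lexicon keeps the rules from inventing bases that do not exist, and the exception list covers links where the stem itself changes.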

SLIDE 11

A glimpse at the DeriNet data

prohlásit-V prohlášený-A prohlášení-N
rozlehlý-A rozlehle-D rozlehlost-N
zahlcovat-V zahlcovaný-A zahlcování-N zahlcující-A
povaha-N povahový-A povahově-D
obhajovat-V obhajující-A obhajování-N obhajovaný-A obhajovací-A
věčný-A věčnost-N věčně-D
vyšroubovat-V vyšroubovávat-V vyšroubování-N vyšroubovaný-A
ponížit-V ponížení-N ponížený-A poníženě-D poníženost-N
políbit-V políbený-A políbení-N
bobr-N bobrův-A bobrový-A
pivovar-N mikropivovar-N pivovarský-A pivovarsky-D
nepřátelský-A nepřátelství-N nepřátelskost-N
hrnek-N hrneček-N
vysmrkat-V vysmrkání-N vysmrkávat-V
předstírat-V předstírání-N předstírající-A předstíraný-A předstíraně-D
básník-N básníkův-A básnice-N
venkov-N venkovský-A venkovsky-D
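
Nesting (replacing a lemma by the root of its derivational tree) can be sketched with a child-to-parent map over a few of the lemmas listed here. The text lists only the lemmas, so the edges below are assumptions inferred from the word forms, and `nest_root` and `nests` are hypothetical helper names.

```python
from collections import defaultdict

# child -> parent; edges are assumed for illustration, not read from DeriNet.
PARENT = {
    "hrneček-N": "hrnek-N",
    "venkovský-A": "venkov-N",
    "venkovsky-D": "venkovský-A",
    "mikropivovar-N": "pivovar-N",
    "pivovarský-A": "pivovar-N",
    "pivovarsky-D": "pivovarský-A",
}


def nest_root(lemma):
    """Follow parent links until a lemma without a parent is reached."""
    while lemma in PARENT:
        lemma = PARENT[lemma]
    return lemma


def nests(lemmas):
    """Group lemmas by the root of their derivational tree."""
    groups = defaultdict(list)
    for lemma in lemmas:
        groups[nest_root(lemma)].append(lemma)
    return dict(groups)


print(nest_root("venkovsky-D"))
print(nests(["hrnek-N", "hrneček-N", "pivovarsky-D"]))
```

Because every tree has a single root, the loop always terminates as long as the parent map contains no cycles.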

SLIDE 12

Merging process

set of lemmas of the previous DeriNet version extended to that of MorfFlex CZ
the pipeline for building DeriNet re-executed on the new lemma set
only minor modifications of substitution rules and exception lists needed
resulting data: 970 thousand lemmas connected by 715 thousand derivational relations

SLIDE 13

Extension of the derivation forest

after merging DeriNet with MorfFlex CZ, in the derivational forest

◮ #nodes increased 3.2 times
◮ #edges increased 6.1 times

evaluation (based on a manually annotated sample) shows that

◮ precision of derivations stayed at 99 %
◮ recall increased from 75 % to 85 %

we attribute both observations to language economy:

◮ lower-frequency words tend to be derived more frequently...
◮ ...and they tend to be derived in a more regular way
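
The growth factors can be checked against the sizes quoted earlier in the talk: 306 thousand nodes and 117 thousand edges before merging, 970 thousand lemmas and 715 thousand relations after. The sketch below works with these rounded figures only. It also uses the fact that in a forest every non-root node has exactly one incoming edge, so the number of trees equals nodes minus edges.

```python
# Rounded sizes quoted in the slides (in lemmas / relations).
nodes_before, edges_before = 306_000, 117_000
nodes_after, edges_after = 970_000, 715_000

node_growth = round(nodes_after / nodes_before, 1)
edge_growth = round(edges_after / edges_before, 1)

# In a forest, #trees = #nodes - #edges (one incoming edge per non-root node).
trees_before = nodes_before - edges_before
trees_after = nodes_after - edges_after

# Share of lemmas that have a derivational parent, in percent.
derived_before = round(100 * edges_before / nodes_before, 1)
derived_after = round(100 * edges_after / nodes_after, 1)

print(node_growth, edge_growth)       # 3.2 6.1
print(trees_before, trees_after)      # 189000 255000
print(derived_before, derived_after)  # 38.2 73.7
```

The share of lemmas with a parent roughly doubles, which is in line with the observation that lower-frequency words tend to be derived more often.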

SLIDE 14

POS and POS→POS counts in the merged data

NOUNS 421,213
VERBS 52,422
ADVERBS 155,096
ADJECTIVES 340,295

[Diagram: POS→POS derivation counts. Labels in extraction order: 99,009 155,269 55,208 80 29 152,603 37 3 194,450 10 37,961 20,960 294 2,473 44,334 2,152 31,106 172,772 276]
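
A quick arithmetic check of these figures. The four POS counts add up to 969,026, which matches the "970 thousand lemmas" quoted for the merged data. The remaining diagram labels add up to the same number; one possible reading, offered here as an assumption because the text does not say which label belongs to which arrow, is that each lemma is counted exactly once, either as a tree root or through its single incoming edge.

```python
POS_COUNTS = {
    "NOUNS": 421_213,
    "VERBS": 52_422,
    "ADVERBS": 155_096,
    "ADJECTIVES": 340_295,
}

# Remaining labels from the diagram, in the order they appear in the text.
DIAGRAM_LABELS = [
    99_009, 155_269, 55_208, 80, 29, 152_603, 37, 3, 194_450, 10,
    37_961, 20_960, 294, 2_473, 44_334, 2_152, 31_106, 172_772, 276,
]

total_lemmas = sum(POS_COUNTS.values())
total_labels = sum(DIAGRAM_LABELS)

print(total_lemmas)
print(total_labels)
for pos, count in POS_COUNTS.items():
    print(f"{pos}: {100 * count / total_lemmas:.1f} %")
```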

SLIDE 15

Access to the data

Application Programming Interfaces

◮ derivations integrated in the MorphoDiTa tool since version 2.0
◮ REST API

Graphical User Interfaces (in web browsers)

◮ MorphoDiTa online demo - shows both derivations and inflections
◮ DeriNet Viewer - for browsing derivation trees
◮ DeriNet Search - query language allowing quite complex search queries
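
As a rough illustration of the REST access path, the sketch below only builds a request URL and sends nothing. The base URL, the `analyze` method name and the `data` and `derivation` parameter names are assumptions based on recollection of the public MorphoDiTa web service; they are not given in the slides and should be verified against the MorphoDiTa documentation before use.

```python
from urllib.parse import urlencode

# Assumed base URL of the MorphoDiTa web service (verify in the official docs).
BASE_URL = "https://lindat.mff.cuni.cz/services/morphodita/api"


def build_request_url(method, text, **options):
    """Build a GET URL; method and option names are assumptions."""
    query = urlencode({"data": text, **options})
    return f"{BASE_URL}/{method}?{query}"


# "derivation" is an assumed option asking for derivational information.
url = build_request_url("analyze", "hrneček", derivation="tree")
print(url)
```

The URL could then be fetched with any HTTP client; `urlencode` takes care of percent-encoding the Czech diacritics.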

SLIDE 16

Query example

The query [] ([lemma="ný$"], [lemma="ový$"]) searches for adjectives derived from the same base lemma by two different suffixes (-ný and -ový).
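
The effect of such a query can be imitated in a few lines: find every node that has one child whose lemma ends in -ný and another whose lemma ends in -ový. This is a plain re-implementation of the idea and not the DeriNet Search engine. The toy edges are assumptions (sníh → sněžný appears earlier in the talk, sněhový is added for the example), and `bases_with_both` is a hypothetical name.

```python
import re
from collections import defaultdict

# child -> parent; toy edges assumed for illustration.
PARENT = {
    "sněžný-A": "sníh-N",
    "sněhový-A": "sníh-N",
    "povahový-A": "povaha-N",
}


def bases_with_both(parent, first=r"ný$", second=r"ový$"):
    """Return bases having children that match both lemma patterns."""
    children = defaultdict(list)
    for child, base in parent.items():
        # Drop the POS suffix so the pattern is matched against the lemma.
        children[base].append(child.rsplit("-", 1)[0])
    hits = {}
    for base, kids in children.items():
        with_first = [k for k in kids if re.search(first, k)]
        with_second = [k for k in kids if re.search(second, k)]
        if with_first and with_second:
            hits[base] = (with_first, with_second)
    return hits


print(bases_with_both(PARENT))
```

In the toy data only sníh-N qualifies, because povaha-N has a child in -ový but none in -ný.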

SLIDE 17

Future work and open questions

add some missing derivations (e.g. verb prefixation, aspectual counterparts created by suffixation, etc.)
abandon the treeness constraint to allow composition
semantic labelling of derivation relations (diminutives, possessives...)
resolve homonymy – inflection and derivation might impose different criteria for distinguishing homonyms
some problems analogous to those of dependency trees

◮ clear presence of an edge, but unclear orientation
◮ sometimes intermediate words are “predicted” that simply do not exist (phantom lexemes, similar to ellipsis)
◮ we know trees are actually not enough even for derivations, but are irresistibly attractive

SLIDE 18

Conclusions

There is a morphological resource for Czech that

handles both morphological inflection and derivation
covers roughly one million Czech lemmas
is equipped with several user interfaces
is available to you under CC-BY-NC-SA, see http://ufal.mff.cuni.cz/derinet or http://ufal.mff.cuni.cz/morphodita

SLIDE 19

Thank you!
