Improving UD processing via satellite resources for morphology
Kaja Dobrovoljc Tomaž Erjavec Nikola Ljubešić
UDW 2019, Paris, August 30
Jozef Stefan Institute Ljubljana, Slovenia
Slovenian (~580,000 tokens)
Croatian (~500,000 tokens).
lemmatization, morphosyntax (JOS/MULTEXT
lemmatization, morphosyntax (MULTEXT
dependency syntax, semantic roles, multi-word expressions, Universal Dependencies.
Universal Dependencies.
et al. 2015) → 25% of the corpus
manual annotation of UD syntax (Agić and Ljubešić 2015) → 40% of the corpus
rules for morphology.
conversion of morphology.
and TEI XML
and TEI XML
MTE: Numeral, Form=letter, Type=ordinal (e.g. prvi ‘first’, drugi ‘second’, tretji ‘third’ …)
UD: ADJ, NumType=Ord
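The mapping above can be read as a conversion rule from an MTE category and its attributes to a UD tag and features. A minimal sketch in Python, assuming a hypothetical rule table (`RULES`) and hypothetical function names (`mte_to_ud`, `format_feats`); only the ordinal-numeral rule comes from the slide, and the actual conversion scripts may be organised quite differently:

```python
# Hypothetical rule table: each rule pairs an MTE category plus required
# attribute values with the UD UPOS tag and features it converts to.
RULES = [
    # The one rule taken from the slide: letter-form ordinal numerals -> ADJ.
    (("Numeral", {"Form": "letter", "Type": "ordinal"}),
     ("ADJ", {"NumType": "Ord"})),
]


def mte_to_ud(category, attributes):
    """Return (upos, feats) for the first matching rule, or None."""
    for (rule_cat, rule_attrs), (upos, feats) in RULES:
        if category == rule_cat and all(
            attributes.get(k) == v for k, v in rule_attrs.items()
        ):
            return upos, dict(feats)
    return None


def format_feats(feats):
    """Serialise features as a '|'-joined Name=Value list, '_' if empty."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered)
```

Keeping the rules as data rather than as nested conditionals makes it easy to add further category-specific mappings without touching the matching logic.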
inflected forms for Slovenian (~2.7M forms, 100k lemmas).
collection of inflected forms for Croatian (~6.4M forms, 170k lemmas).
grammatical features (JOS/MTE), pronunciation, frequency of usage.
grammatical features (MTE), frequency of usage.
from ssj500k.
from hr500k.
part of Sloleks 2.0 (CLARIN.SI), tab-separated list only.
part of hrLex 1.3 (CLARIN.SI), tab-separated list.
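Because both lexicons are distributed as tab-separated lists, they can be loaded into a simple form-to-analyses lookup. A minimal sketch in Python, assuming a hypothetical column order (form, lemma, tag) and a hypothetical function name (`load_lexicon`); the actual column layout of the Sloleks and hrLex files should be checked before use, and the tag values in the sample are placeholders:

```python
import csv
import io
from collections import defaultdict


def load_lexicon(lines):
    """Map each inflected form to the set of (lemma, tag) pairs listed for it.

    Assumes (hypothetically) that the first three tab-separated columns are
    form, lemma and tag; any further columns are ignored.
    """
    lexicon = defaultdict(set)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        form, lemma, tag = row[:3]
        lexicon[form].add((lemma, tag))
    return lexicon


# Toy two-line sample standing in for a real lexicon file;
# TAG1 and TAG2 are placeholder tag values, not real MSDs.
sample = io.StringIO("prvi\tprvi\tTAG1\ndrugi\tdrugi\tTAG2\n")
lexicon = load_lexicon(sample)
```

A set of analyses per form is used because one inflected form can correspond to several lemma and tag combinations.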
+1.56 +0.91
+1.35 +1.28 +0.53 +0.52
+0.24 +0.14
+0.32 +0.66 +0.54 +0.70 +1.98 +2.92 +1.34 +2.11
+2.6 +1.94
+1.08 +1.45
+3.15 +3.96 +0.13 +0.10 +0.03 +0.51 +0.72 +0.29
+1.45 +2.05
+0.0