The Tagged Corpus (SYN2010) as a Help and a Pitfall in the Word-formation Research
KLÁRA OSOLSOBĚ ÚČJ FF MU BRNO OSOLSOBE@PHIL.MUNI.CZ
25.09.2019 DERIMO 2019
Goals; SAU; Corpus-based linguistic manual; Three steps of the automatic
The query [lemma=".*oš" & tag="NN[MI].*"] returns 125 lemmata, of which 69 are relevant. The query [lemma="(.*oš)|(.*oš[eiů])|(.*oších)|(.*ošům)" & tag="X.*"] returns 282 words, of which 36 are relevant lemmata. The examples illustrating the second query indicate that, if it were left out, the productivity figures would be significantly skewed.
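The two queries can be imitated on a toy token list. The following is only a minimal sketch: the (word, lemma, tag) triples and the shortened tag strings are hypothetical and not taken from SYN2010, the helper names are illustrative, and it assumes that the corpus query language matches each regular expression against the whole attribute value, which is why `re.fullmatch` is used.

```python
import re

# Patterns copied from the two corpus queries. CQL attribute regexes are
# assumed to match the whole value, hence re.fullmatch below.
LEMMA_RE = re.compile(r".*oš")
NOUN_TAG_RE = re.compile(r"NN[MI].*")
UNKNOWN_LEMMA_RE = re.compile(r"(.*oš)|(.*oš[eiů])|(.*oších)|(.*ošům)")
UNKNOWN_TAG_RE = re.compile(r"X.*")

# Hypothetical (word, lemma, tag) triples; tags are illustrative only.
TOKENS = [
    ("divoš", "divoš", "NNMS1-----A----"),
    ("divoši", "divoš", "NNMP1-----A----"),
    ("koš", "koš", "NNIS1-----A----"),        # matches the query, no suffix -oš
    ("rozkoš", "rozkoš", "NNFS1-----A----"),  # feminine, filtered out by the tag
    ("nosí", "nosit", "VB-S---3P-AA---"),
    ("tlusťoši", "tlusťoši", "X@-------------"),  # unrecognized: lemma = form
    ("xyz", "xyz", "X@-------------"),
]


def matching_lemmata(tokens, lemma_re, tag_re):
    """Return the set of lemmata whose lemma and tag both match in full."""
    return {
        lemma
        for _word, lemma, tag in tokens
        if lemma_re.fullmatch(lemma) and tag_re.fullmatch(tag)
    }


def relevant_share(relevant, total):
    """Share of query hits judged relevant after manual checking."""
    return relevant / total


tagged = matching_lemmata(TOKENS, LEMMA_RE, NOUN_TAG_RE)
unknown = matching_lemmata(TOKENS, UNKNOWN_LEMMA_RE, UNKNOWN_TAG_RE)
```

With the counts reported above, `relevant_share(69, 125)` and `relevant_share(36, 282)` give the proportion of hits that survive manual checking for the first and the second query respectively.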
vedoucí (↖1/2/3) 8.348 (1) gerund (e.g. Slepý vedoucí slepého je nebezpečný. = ‚The blind leading the blind is dangerous.‘), (2) adjective (Vedoucí disidenti dostali dlouhé tresty. = ‚The leading dissidents received long prison sentences.‘) and (3) noun (profesionální vedoucí = ‚professional leader‘). (1) and (2) are not distinguished by the automatic morphological analysis. (3) is tagged, but the results of the disambiguation are far from satisfactory.
Lemmas that end in the relevant character string but do not correspond to words formed with the affix were excluded.
sou- -í: soustřed-i-t se → soustřed-ěn-í – concentration; ží-t → sou-ži-t-í – coexistence; soutěž-i-t → soutěž-en-í – competition; souž-i-t → souž-en-í – suffering/problem; soused → soused-ství – neighborhood; soukromí – privacy
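The exclusion step can be sketched as a pattern match followed by a manually maintained exclusion list. This is only an illustration under stated assumptions: the pattern `sou.*í`, the helper name and, above all, the content of `EXCLUDED` are hypothetical. The exclusion set below merely reflects one reading of the derivations listed above (bases that already begin with sou-), not a decision documented in the slides.

```python
import re

# Hypothetical pattern for lemmata that look like sou- ... -í formations.
CIRCUMFIX_RE = re.compile(r"sou.*í")

# Lemmata from the slide plus two non-matching bases for contrast.
CANDIDATES = [
    "soustředění", "soužití", "soutěžení", "soužení",
    "sousedství", "soukromí", "soused", "soutěž",
]

# Assumption: lemmata judged by hand to be formed from a base that
# already begins with "sou" (one possible reading of the slide).
EXCLUDED = {"soustředění", "soutěžení", "soužení", "sousedství"}


def filter_lemmata(lemmata, pattern, excluded):
    """Return (all pattern matches, matches kept after manual exclusion)."""
    matched = [lemma for lemma in lemmata if pattern.fullmatch(lemma)]
    kept = [lemma for lemma in matched if lemma not in excluded]
    return matched, kept


matched, kept = filter_lemmata(CANDIDATES, CIRCUMFIX_RE, EXCLUDED)
```

Keeping the exclusion list as explicit data makes the manual decisions inspectable alongside the corpus query that produced the candidates.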
Limits of working with the results of automatic part-of-speech tagging.
The data-mining method (the corpus query) is described in sufficient detail at the beginning of the frequency report.
Despite the simplifications mentioned above, it is undisputed that, without the results of automatic tagging, any way of creating the Dictionary of affixes used in Czech would be a) incomparably more time-consuming, b) more expensive and c) less objective in its result.
A detailed morphological description of word forms based on the data gained during the work