
SLIDE 1

The Tagged Corpus (SYN2010) as a Help and a Pitfall in the Word-formation Research

KLÁRA OSOLSOBĚ ÚČJ FF MU BRNO OSOLSOBE@PHIL.MUNI.CZ

25.09.2019 DERIMO 2019

SLIDE 2

Goals
SAUČ: corpus-based linguistic manual
Three steps of the automatic analysis
Conclusion


SLIDE 3


SLIDE 4


SLIDE 5

Pitfall No. 1: Tokenization

The affixes described in SAUČ are usually graphically part of a single lexeme. The exception is the MWE (multiword expression).


SLIDE 6

Tokenization: circumfix na- -o

natvrdo × na tvrdo

There are two ways of writing such adverbs. A preposition that is not graphically united with the newly created adverb is an independent unit tagged as a preposition, and its nominal part is very often not identified.

Only the variants written as one word are included in the frequency report.
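The effect of tokenization on the counts can be illustrated with a minimal sketch. It assumes a plain whitespace-tokenized toy sentence (not SYN2010 data), and the function name `count_variants` is hypothetical. The joined spelling is one token, while the split spelling can only be found as a two-token sequence.

```python
def count_variants(tokens, prefix="na", stem="tvrdo"):
    """Count joined ('natvrdo') and split ('na tvrdo') spellings in a token list."""
    # Joined spelling: a single token equal to prefix + stem.
    joined = sum(1 for t in tokens if t.lower() == prefix + stem)
    # Split spelling: the prefix token immediately followed by the stem token.
    split = sum(
        1
        for first, second in zip(tokens, tokens[1:])
        if first.lower() == prefix and second.lower() == stem
    )
    return {"joined": joined, "split": split}


# Hypothetical toy sentence, tokenized on whitespace only.
tokens = "Vejce natvrdo nebo vejce na tvrdo".split()
counts = count_variants(tokens)
print(counts)
```

A frequency report that counts only single tokens would see the joined occurrence and miss the split one.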


SLIDE 7

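Tokenization: prefix + reflexive particle se (za- se)

zamyslel se nad tím (“he thought about it”) × on se nad tím asi ani pořádně nezamyslel (“he probably did not even think about it properly”)

Only the two most frequent word-order variants (positions <-1, 1> relative to the verb) are covered by the frequency report.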
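A minimal sketch of why a <-1, 1> window misses the second example. It assumes whitespace tokenization and a hypothetical pattern for forms of "zamyslet". The helper names `se_offsets` and `in_window` are illustrative and are not part of any corpus tool.

```python
import re

# Hypothetical pattern for (optionally negated) forms of "zamyslet".
VERB = re.compile(r"^(ne)?zamysl", re.IGNORECASE)


def se_offsets(tokens):
    """For each matching verb form, return the offset of the nearest 'se' (None if absent)."""
    se_positions = [i for i, t in enumerate(tokens) if t.lower() == "se"]
    offsets = []
    for i, tok in enumerate(tokens):
        if VERB.match(tok):
            nearest = min((p - i for p in se_positions), key=abs, default=None)
            offsets.append(nearest)
    return offsets


def in_window(offset, window=(-1, 1)):
    """True if 'se' stands within the given window around the verb."""
    return offset is not None and window[0] <= offset <= window[1]


adjacent = "zamyslel se nad tím".split()
distant = "on se nad tím asi ani pořádně nezamyslel".split()

for sentence in (adjacent, distant):
    for offset in se_offsets(sentence):
        print(" ".join(sentence), "| offset:", offset, "| counted:", in_window(offset))
```

In the first sentence "se" directly follows the verb, so it falls inside the window. In the second it stands six tokens before the verb and is not counted.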


SLIDE 8

Pitfall No. 2: Assigning lemma + tag (interpretation based on the morphological dictionary)

(MorfFlex CZ, LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague, http://hdl.handle.net/11858/00-097C-0000-0015-A780-9.)

Measuring productivity is dictionary-dependent.


SLIDE 9

Gaps in the dictionary: circumfix na- -o

na-prav-o, na-lev-o × na-těsn-o, na-kratičk-o

lemma="na.*o" & tag="D.*" (recognised adverbs) × lemma="na.*o" & tag="X.*" (unrecognised words)

The low-frequency words, which the automatic morphological analysis does not recognise, follow the same model of compound adverbs in Czech and demonstrate its productivity. A picture of productivity based only on the results of automatic analysis is therefore inaccurate.
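The gap can be made concrete with a small sketch that groups matching lemma types by the first position of the tag. The (lemma, tag) pairs and the shortened tags are hypothetical stand-ins, not SYN2010 output, and the pattern assumes a query of the form lemma="na.*o".

```python
import re
from collections import Counter

# Hypothetical (lemma, tag) pairs with shortened tags:
# first position "D" = adverb, "X" = unrecognised by the dictionary.
tagged = [
    ("napravo", "Db"),
    ("nalevo", "Db"),
    ("natěsno", "X@"),
    ("nakratičko", "X@"),
    ("nakonec", "Db"),  # does not end in -o, so the pattern skips it
]

pattern = re.compile(r"na.*o")

# Count distinct lemma types per tag class among lemmas matching the circumfix pattern.
types_by_class = Counter(
    tag[0] for lemma, tag in set(tagged) if pattern.fullmatch(lemma)
)
print(types_by_class)
```

In this toy sample, counting only the types tagged D would report half of the formations that actually follow the model.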


SLIDE 10

Gaps in the dictionary: suffix -oš

Mil-oš, Jug-oš × Káj-oš, Tal-oš

The query [lemma=".*oš" & tag="NN[MI].*"] gives 125 lemmata, of which 69 are relevant.

The query [lemma="(.*oš)|(.*oš[eiů])|(.*oších)|(.*ošům)" & tag="X.*"] gives 282 words, which correspond to 36 relevant lemmata.

The examples found by the second query indicate that, without it, the productivity picture would be significantly skewed.
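The two queries above can be imitated on a toy token list. This is a sketch only: the tokens and their tags are hypothetical, the helper `cql_match` is not a real corpus-manager function, and `re.fullmatch` is used on the assumption that the query regexes must match the whole attribute value.

```python
import re


def cql_match(token, lemma_re, tag_re):
    """Mimic [lemma="..." & tag="..."]: both regexes must match the whole value."""
    return bool(
        re.fullmatch(lemma_re, token["lemma"]) and re.fullmatch(tag_re, token["tag"])
    )


# Hypothetical tokens; for unrecognised words the lemma equals the word form.
tokens = [
    {"lemma": "Miloš", "tag": "NNMS1-----A----"},
    {"lemma": "Jugoš", "tag": "NNMS1-----A----"},
    {"lemma": "Kájoš", "tag": "X@-------------"},
    {"lemma": "Talošům", "tag": "X@-------------"},
    {"lemma": "košík", "tag": "NNIS1-----A----"},
]

KNOWN = (r".*oš", r"NN[MI].*")
UNKNOWN = (r"(.*oš)|(.*oš[eiů])|(.*oších)|(.*ošům)", r"X.*")

known = [t["lemma"] for t in tokens if cql_match(t, *KNOWN)]
unknown = [t["lemma"] for t in tokens if cql_match(t, *UNKNOWN)]
print(known)
print(unknown)
```

Because unrecognised words are not lemmatized, the second pattern has to list inflected endings explicitly, and its hits still need to be grouped into lemmata by hand.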


SLIDE 11

Pitfall No. 3: Disambiguation

Disambiguation is the process of identifying which interpretation of a word is used in context. The biggest problem here is homonymy, which affects cases of part-of-speech transition, polyfunctional affixes, and overgeneration by formal queries. Corpus analysis results are “disambiguation-addicted”.


SLIDE 12

-cí: vedou-cí (leader/leading)

vedoucí (↖1/2/3) 8.348

(1) gerund (e.g. Slepý vedoucí slepého je nebezpečný. = “The blind leading the blind is dangerous.”)
(2) adjective (Vedoucí disidenti dostali dlouhé tresty. = “The leading dissidents had received long prison sentences.”)
(3) noun (profesionální vedoucí = “professional leader”)

(1) and (2) are not distinguished by the automatic morphological analysis. (3) is tagged, but the results of the disambiguation are far from satisfactory.


SLIDE 13

-cí: cestují-cí (travelling/traveller)


SLIDE 14

Disambiguation: sou- -í (overgeneration)

soutěžení, souručenství, soukromí

Lemmas that merely begin and end with the given character strings, but do not correspond to words created by the affix, were excluded.

sou- -í:
soustřed-i-t se → soustřed-ěn-í (concentration)
ží-t → sou-ži-t-í (coexistence)
soutěž-i-t → soutěž-en-í (competition)
souž-i-t → souž-en-í (suffering/problem)
soused → soused-ství (neighborhood)
soukromí (privacy)
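The exclusion step can be sketched as a formal pattern followed by a manually maintained stop list. The candidate list is a toy sample, the excluded set simply repeats the overgenerated examples named on this slide, and the names `FORMAL`, `EXCLUDED` and `relevant` are hypothetical.

```python
import re

# Formal query: any lemma that starts with "sou" and ends with "í".
FORMAL = re.compile(r"sou.*í")

# Manually identified overgenerated hits (taken from the slide's examples).
EXCLUDED = {"soutěžení", "souručenství", "soukromí"}


def relevant(lemmas):
    """Keep lemmas that match the formal pattern and are not on the stop list."""
    return [l for l in lemmas if FORMAL.fullmatch(l) and l not in EXCLUDED]


# Toy candidate list; "soused" fails the pattern because it does not end in "í".
candidates = ["sousoší", "soutěžení", "souručenství", "soukromí", "soused"]

formal_hits = [l for l in candidates if FORMAL.fullmatch(l)]
print(len(formal_hits), "formal hits,", len(relevant(candidates)), "relevant")
```

The pattern alone cannot tell a true circumfix formation from a word that happens to share its beginning and ending, so the stop list carries the linguistic judgement.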


SLIDE 15

Conclusion

Working with the results of automatic part-of-speech tagging has its limits.

The data-mining method (the corpus query) is sufficiently described at the beginning of the frequency report.

Despite the above-mentioned simplistic solutions, it is undisputed that without the results of automatic tagging, any way of creating the Dictionary of affixes used in Czech would be a) incomparably more time-consuming, b) more expensive and c) less objective in its result.

A detailed morphological description of word forms based on the data gained during the work on SAUČ is reflected in the NovaMorf project (Osolsobě et al. 2017).


SLIDE 16

NOVAMORF (https://sites.google.com/site/koncepcenovamorf/)


SLIDE 17

Thank you for your attention

Více soch je sou-soš-í, více žen je s-ouž-en-í / sou-žen-í. “Several sculptures create a sculptural group, several women create a problem.”
