SLIDE 1

The Tagged Corpus (SYN2010) as a Help and a Pitfall in the Word-formation Research

KLÁRA OSOLSOBĚ ÚČJ FF MU BRNO OSOLSOBE@PHIL.MUNI.CZ

25.09.2019 DERIMO 2019

SLIDE 2

Goals

SAUČ – Corpus-based linguistic manual

Three steps of the automatic analysis

Conclusion

SLIDE 3

SLIDE 4

SLIDE 5

Pitfall No. 1: Tokenization

The affixes described in SAUČ are usually graphically part of a single lexeme.

MWE (multiword expression)

SLIDE 6

Tokenization: circumfix na- -o

natvrdo × na tvrdo

There are two ways of writing such adverbs. A preposition that is not graphically united with the newly created adverb is an independent unit tagged as a preposition, and its nominal part is very often not identified.

Only the variants written as one word are included in the frequency report.
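The effect of the two spellings on a token-based count can be sketched as follows. This is a minimal illustration, not the procedure used for SAUČ: the function name `count_spelling_variants` and the toy sentence are assumptions made for the example, and no real SYN2010 data is involved.

```python
from collections import Counter


def count_spelling_variants(tokens, joined, parts):
    """Count the one-token spelling and the multi-token spelling of an adverb."""
    counts = Counter()
    n = len(parts)
    for i, tok in enumerate(tokens):
        # One-token spelling, e.g. "natvrdo".
        if tok.lower() == joined:
            counts["joined"] += 1
        # Multi-token spelling, e.g. "na" followed by "tvrdo".
        if [t.lower() for t in tokens[i:i + n]] == list(parts):
            counts["split"] += 1
    return counts


# Toy token stream (invented for illustration, not taken from SYN2010).
tokens = "Vejce uvařil natvrdo , ale ona je chtěla na tvrdo .".split()
variant_counts = count_spelling_variants(tokens, "natvrdo", ("na", "tvrdo"))
print(variant_counts)
```

A query restricted to single tokens would see only the "joined" count here, which is the situation the slide describes for the frequency report.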

SLIDE 7

Tokenization: prefix + reflexive particle se (za- se)

zamyslel se nad tím (= ‚he thought about it‘) × on se nad tím asi ani pořádně nezamyslel (= ‚he probably did not even think about it properly‘)

The frequency report covers the two most frequent word-order variants (variants <-1, 1>).
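The two example clauses can be used to sketch why a fixed window misses some occurrences. The code below is an assumption-laden illustration: the regular expression `VERB` is a crude stand-in for a lemma and tag query, and the helper names `se_offsets` and `in_report_window` are hypothetical. It reads <-1, 1> as "se directly before or directly after the verb".

```python
import re

# Crude stand-in for a query on verbs with the prefix za- (optionally negated).
VERB = re.compile(r"^(ne)?za\w+", re.IGNORECASE)


def se_offsets(tokens, max_distance=10):
    """Return (verb, offset) pairs; offset locates the nearest 'se' relative to the verb."""
    results = []
    for i, tok in enumerate(tokens):
        if not VERB.match(tok):
            continue
        offsets = [j - i for j, t in enumerate(tokens)
                   if t.lower() == "se" and 0 < abs(j - i) <= max_distance]
        nearest = min(offsets, key=abs) if offsets else None
        results.append((tok, nearest))
    return results


def in_report_window(offset):
    """True when 'se' stands directly before or after the verb (variants <-1, 1>)."""
    return offset in (-1, 1)


adjacent = se_offsets("zamyslel se nad tím".split())
distant = se_offsets("on se nad tím asi ani pořádně nezamyslel".split())
print(adjacent, distant)
```

In the first clause the particle is adjacent to the verb; in the second it stands six tokens to the left, so a search limited to the adjacent positions would not find it.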

SLIDE 8

Pitfall No. 2: Assigning lemma + tag

Interpretation is based on the morphological dictionary (MorfFlex CZ, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague, http://hdl.handle.net/11858/00-097C-0000-0015-A780-9).

The measurement of productivity is dictionary-dependent.

SLIDE 9

A gap in the dictionary: circumfix na- -o

na-prav-o, na-lev-o × na-těsn-o, na-kratičk-o

[lemma="na.*o" & tag="D.*"]
[lemma="na.*o" & tag="X.*"]

Low-frequency words that the automatic morphological analysis does not recognise still follow the model of this type of compound adverb in Czech and demonstrate its productivity. The picture of productivity based on the results of automatic analysis is therefore inaccurate.
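The contrast between the two queries can be imitated on a handful of (lemma, tag) pairs. This is only a sketch: the helper `run_query` is hypothetical, the tag strings are shortened, and the tag assignment is invented to mirror the contrast on the slide rather than copied from SYN2010.

```python
import re


def run_query(tokens, lemma_pattern, tag_pattern):
    """Return the sorted set of lemmata whose lemma and tag both match in full."""
    lemma_re = re.compile(lemma_pattern)
    tag_re = re.compile(tag_pattern)
    return sorted({lemma for lemma, tag in tokens
                   if lemma_re.fullmatch(lemma) and tag_re.fullmatch(tag)})


# Toy data (assumption): D = adverb known to the dictionary, X = unrecognised word.
tagged = [
    ("napravo", "Db"), ("nalevo", "Db"),
    ("natěsno", "X@"), ("nakratičko", "X@"),
    ("na", "RR"), ("nebo", "J^"),
]

recognised = run_query(tagged, r"na.*o", r"D.*")
unrecognised = run_query(tagged, r"na.*o", r"X.*")
print(recognised, unrecognised)
```

Counting only the first list would leave out exactly the low-frequency formations that show the pattern is still productive.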

SLIDE 10

A gap in the dictionary: -oš

Mil-oš, Jug-oš × Káj-oš, Tal-oš

The query [lemma=".*oš" & tag="NN[MI].*"] gives 125 lemmata, of which 69 are relevant. The query [lemma="(.*oš)|(.*oš[eiů])|(.*oších)|(.*ošům)" & tag="X.*"] gives 282 words, of which 36 are relevant lemmata. The second query shows that, if unrecognised words were left out, the picture of productivity would be significantly skewed.
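The size of the skew can be estimated from the figures on the slide. The calculation below is a rough sketch and rests on one assumption that the slide does not state: that the 69 relevant lemmata from the first query and the 36 from the second do not overlap.

```python
# Figures taken from the slide.
dictionary_hits = 125      # lemmata returned by the query on tag NN[MI].*
dictionary_relevant = 69   # relevant lemmata among them
unknown_relevant = 36      # relevant lemmata found among words tagged X.*

# Share of relevant lemmata in the first query's output.
precision_dictionary = dictionary_relevant / dictionary_hits

# Assumption: the two relevant sets are disjoint, so they can be added.
total_relevant = dictionary_relevant + unknown_relevant

# Share of relevant lemmata that only the second query finds.
share_missed = unknown_relevant / total_relevant

print(f"{precision_dictionary:.1%} of the first query's lemmata are relevant")
print(f"{share_missed:.1%} of all relevant lemmata come from unrecognised words")
```

Under that assumption, roughly a third of the relevant -oš lemmata would be lost if only dictionary-recognised words were counted.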

SLIDE 11

Pitfall No. 3: Disambiguation

Disambiguation is the process of identifying which interpretation of a word is used in context.

The biggest problem here is homonymy, which affects cases of part-of-speech transition, polyfunctional affixes, and overgeneration by formal queries.

Corpus analysis results are "disambiguation-dependent".

SLIDE 12

-cí: vedou-cí (leader/leading)

vedoucí (↖1/2/3) 8.348

(1) gerund (e.g. Slepý vedoucí slepého je nebezpečný. = ‚The blind leading the blind is dangerous.‘)
(2) adjective (Vedoucí disidenti dostali dlouhé tresty. = ‚The leading dissidents received long prison sentences.‘)
(3) noun (profesionální vedoucí = ‚professional leader‘)

(1) and (2) are not distinguished by the automatic morphological analysis. (3) is tagged, but the results of the disambiguation are far from satisfactory.
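How the tagger's decisions shape the counts for a homonymous form can be sketched with a small tally. The data below is invented for illustration: the three phrases come from the slide, but the one-letter tags (A for adjective, N for noun) are assumed, and the helper `pos_distribution` is hypothetical.

```python
from collections import Counter


def pos_distribution(tagged_tokens, word_form):
    """Count the part-of-speech letters assigned to one word form."""
    return Counter(tag[0] for form, tag in tagged_tokens
                   if form.lower() == word_form)


# Toy tagging (assumption): readings (1) and (2) both surface as A, reading (3) as N.
tagged = [
    ("Slepý", "A"), ("vedoucí", "A"), ("slepého", "A"),
    ("Vedoucí", "A"), ("disidenti", "N"),
    ("profesionální", "A"), ("vedoucí", "N"),
]

distribution = pos_distribution(tagged, "vedoucí")
print(distribution)
```

Any frequency derived from such a tally inherits the tagger's choices: the gerund and adjective readings are merged, and the noun count is only as reliable as the disambiguation.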

SLIDE 13

-cí: cestují-cí (travelling/traveller)

SLIDE 14

Disambiguation: sou- -í (overgeneration)

soutěžení, souručenství, soukromí

Lemmas that match the character string of the query but do not correspond to words formed with the affix were excluded.

sou- -í:
soustřed-i-t se → soustřed-ěn-í – concentration
ží-t → sou-ži-t-í – coexistence
soutěž-i-t → soutěž-en-í – competition
souž-i-t → souž-en-í – suffering/problem
soused → soused-ství – neighborhood
soukromí – privacy
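Overgeneration by a formal query, and one possible way of filtering it, can be sketched with the derivations listed above. The slide does not say how the exclusion was decided, so the rule used here (drop a lemma when its base already begins with sou-) is purely an assumption of this example, as is the helper name `split_candidates`.

```python
import re

# Formal query analogous to lemma="sou.*í": it matches on the string alone.
CIRCUMFIX = re.compile(r"sou.*í")


def split_candidates(derivations):
    """Split query matches into kept and excluded lemmata using the base word."""
    kept, excluded = [], []
    for lemma, base in derivations:
        if not CIRCUMFIX.fullmatch(lemma):
            continue
        # Hypothetical heuristic: if the base already starts with "sou",
        # the initial string is not the prefixal part of the circumfix.
        (excluded if base.startswith("sou") else kept).append(lemma)
    return kept, excluded


# (lemma, base) pairs read off the derivations on the slide.
derivations = [
    ("soustředění", "soustředit"),
    ("soužití", "žít"),
    ("soutěžení", "soutěžit"),
    ("soužení", "soužit"),
    ("sousedství", "soused"),
]

kept, excluded = split_candidates(derivations)
print(kept, excluded)
```

All five lemmata satisfy the formal query, yet only one of them survives the filter, which illustrates how much a purely string-based count can overstate the productivity of the circumfix.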

SLIDE 15

Conclusion

Working with the results of automatic part-of-speech tagging has its limits.

The method of data mining (the corpus query) is sufficiently described at the beginning of the frequency report.

Despite the simplifications mentioned above, it is undisputed that without the results of automatic tagging, any way of creating the Dictionary of affixes used in Czech would be a) incomparably more time-consuming, b) more expensive and c) less objective in its results.

A detailed morphological description of word forms, based on the data gained during the work on SAUČ, is reflected in the NovaMorf project (Osolsobě et al. 2017).

SLIDE 16

NOVAMORF (https://sites.google.com/site/koncepcenovamorf/)

SLIDE 17

Thank you for your attention

Více soch je sou-soš-í, více žen je s-ouž-en-í / sou-žen-í. “Several sculptures create a sculptural group, several women create a problem.”
