
May 29, 2023

Rusyn as a language between state borders – a statistical approach to variation (for small sample sizes). Albert-Ludwig University of Freiburg, Germany, Department of Slavonic Studies. Prof. Dr. Achim Rabus & M. Zaidan Lahjouji

SLIDE 1

Rusyn as a language between state borders – a statistical approach to variation (for small sample sizes)

Albert-Ludwig University of Freiburg, Germany Department of Slavonic Studies

Prof. Dr. Achim Rabus & M. Zaidan Lahjouji

Project: Russinisch als eine Staatsgrenzen überschreitende Minderheitensprache: Quantitative Perspektiven (Rusyn as a minority language across state borders: quantitative perspectives)

SLIDE 2

Topics:

The Rusyn project
Interests and Aims
Border Effects
The Corpus of Spoken Rusyn
Quantitative approaches to spoken data
Pitfalls, Possible Solutions and Limitations
Example Dataset & Analysis

SLIDE 3

The Rusyn Project

Interests and aims:
Status / Condition of the Carpatho-Rusyn language
Documentation of Spoken Carpatho-Rusyn (Corpus)
Language Contact with Several “Roofing Languages” (Slavic and Non-Slavic)
Contact-induced changes?
Language Perception
Border Effects (Woolhiser 2005)
Quantitative / Statistical Approaches to Spoken Language Data

SLIDE 4

The Rusyn Project

Interests and aims:
Status / Condition of the Carpatho-Rusyn language
Documentation of the Rusyn Varieties (Corpus)
Language Contact with Several “Roofing Languages” (Slavic and Non-Slavic)
Contact-induced changes?
Language Perception
Border Effects (Woolhiser 2005)
Quantitative / Statistical Approaches to Spoken Language Data (RStudio)

SLIDE 5

Magocsi, P. R.: Národ znikadiaľ : ilustrovaná história karpatských Rusínov. Prešov : Rusín a Ľudové noviny, 2007, p. 34.

SLIDE 6

Rusyns as National Minority

SLIDE 7

Sociolinguistic Factors

Status? Age? Sex? Education? Mobility? Religion?

Which factors determine how people speak?

SLIDE 8

Border Effects as Hypothesis

Poland Slovakia Ukraine Hungary Romania

Border effects (Woolhiser 2005) are detectable within Rusyn vernacular

SLIDE 9

Example: A(j) Conjugation

Pugh, S.M. (2009). The Rusyn language: A grammar of the literary standard of Slovakia with reference to Lemko and Subcarpathian Rusyn. München. P. 117.

SLIDE 10

Corpus of Spoken Rusyn

SLIDE 11

Example

Variation within conjugation types AJ and A(j) (Pugh 2009: 116-20)

Our dataset contains:
Threefold variation:

(ма, має, мат(ь)) and ((по-)зна, (по-)знає, (по-)знат(ь)).

Several utterances by the same speakers: the observations are not independent, which biases the sample and violates the independence assumption.
Context
Metadata of speakers

мати, 3Ps.Sg.Pres.; (по-)знати, 3Ps.Sg.Pres. (third person singular present)
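The dataset description above can be sketched as a small record type, together with a count of tokens per speaker that makes the repeated-speaker problem visible. This is only an illustration: the field names, speaker ids, and all values are invented and do not come from the Corpus of Spoken Rusyn.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One observed 3rd person singular present form (hypothetical layout)."""
    speaker: str   # speaker id (invented)
    variety: str   # settlement area, e.g. "Slo" or "Lem"
    age: int
    sex: str
    lemma: str
    form: str


# Invented toy records; they make no claim about real speakers or varieties.
toy_tokens = [
    Token("s1", "Slo", 64, "f", "мати", "ма"),
    Token("s1", "Slo", 64, "f", "мати", "ма"),
    Token("s1", "Slo", 64, "f", "знати", "знає"),
    Token("s2", "Lem", 47, "m", "мати", "мать"),
    Token("s3", "Lem", 71, "f", "мати", "має"),
    Token("s3", "Lem", 71, "f", "знати", "знать"),
]

# How many tokens does each speaker contribute?
tokens_per_speaker = Counter(t.speaker for t in toy_tokens)

# Speakers with more than one token are the source of non-independence.
repeated = {s: n for s, n in tokens_per_speaker.items() if n > 1}

# Overall distribution of the observed forms.
form_counts = Counter(t.form for t in toy_tokens)
```

Any speaker listed in `repeated` contributes several correlated observations, which is exactly what a standard regression assumes away.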

SLIDE 12

Coefficients in Multinomial Logistic Regression Model

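\[
\begin{aligned}
\ln\frac{P(\mathit{verbForm} = \mathit{ma})}{P(\mathit{verbForm} = \mathit{mae})} &= b_{10} + b_{11}(\mathit{variety} = \mathit{Slo}) + b_{12}(\mathit{variety} = \mathit{Lem}) + b_{13}\,\mathit{Age} + b_{14}(\mathit{sex} = m) \\
\ln\frac{P(\mathit{verbForm} = \mathit{mat})}{P(\mathit{verbForm} = \mathit{mae})} &= b_{20} + b_{21}(\mathit{variety} = \mathit{Slo}) + b_{22}(\mathit{variety} = \mathit{Lem}) + b_{23}\,\mathit{Age} + b_{24}(\mathit{sex} = m)
\end{aligned}
\]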
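To make the two log-odds equations concrete, the following minimal sketch converts a set of coefficients into probabilities for the three verb forms, with the second form (transliterated here as mae) as the reference category. The coefficient values are hypothetical and chosen only to show the arithmetic; they are not results from the project.

```python
import math


def verb_form_probabilities(coefs, variety, age, sex):
    """Turn the two log-odds equations into probabilities for ma, mae, mat.

    coefs maps each non-reference form to (b0, b_slo, b_lem, b_age, b_sex_m).
    'mae' is the reference category, so its linear predictor is fixed at 0.
    """
    # Predictor vector: intercept, two variety dummies, age, sex dummy.
    x = (1.0, float(variety == "Slo"), float(variety == "Lem"),
         float(age), float(sex == "m"))
    eta = {"mae": 0.0}
    for form, b in coefs.items():
        eta[form] = sum(bi * xi for bi, xi in zip(b, x))
    denom = sum(math.exp(v) for v in eta.values())
    return {form: math.exp(v) / denom for form, v in eta.items()}


# Hypothetical coefficients, for illustration only.
toy_coefs = {
    "ma":  (-0.5, 1.2, 0.3, 0.01, -0.2),
    "mat": (-1.0, 0.1, 1.5, 0.00, 0.4),
}
probs = verb_form_probabilities(toy_coefs, variety="Lem", age=60, sex="f")
```

The three probabilities always sum to 1, and the logarithm of `probs["ma"] / probs["mae"]` recovers the right-hand side of the first equation for the chosen speaker profile.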

SLIDE 13

Problems

Data set is rather small
Biased data set
Dependent variable (verb_form) is categorical
Threefold variation
Independent variables are predominantly categorical
Violation of assumptions (independence)
We have collected precious data, so we don’t want to give up

SLIDE 14

Bootstrapping

[Diagram: Sample0 is resampled into Sample1, Sample2, Sample3, …, Sample500; a regression is run on each sample, leading to a robust estimation.]
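The resampling idea can be sketched in a few lines. The example below is a simplification under stated assumptions: it resamples whole speakers rather than single utterances (one common way to respect the dependence between utterances of the same speaker; the slides do not say whether the project did this), and it replaces the regression with a simple proportion so that it runs without external libraries. The toy data, the transliterated form labels, and all function names are invented for illustration.

```python
import random


def share_of_form(tokens, form):
    """Proportion of tokens realised as `form` (0.0 for an empty list)."""
    return sum(t == form for t in tokens) / len(tokens) if tokens else 0.0


def speaker_bootstrap(by_speaker, statistic, n_resamples=500, seed=1):
    """Percentile bootstrap that resamples whole speakers with replacement."""
    rng = random.Random(seed)
    speakers = list(by_speaker)
    estimates = []
    for _ in range(n_resamples):
        # Draw as many speakers as in the original sample, with replacement.
        drawn = [rng.choice(speakers) for _ in speakers]
        tokens = [tok for s in drawn for tok in by_speaker[s]]
        estimates.append(statistic(tokens))
    estimates.sort()
    # Approximate 95% percentile interval from the sorted estimates.
    low = estimates[int(0.025 * (n_resamples - 1))]
    high = estimates[int(0.975 * (n_resamples - 1))]
    return sum(estimates) / n_resamples, (low, high)


# Toy data: speaker id -> observed forms (invented for illustration).
toy = {
    "s1": ["ma", "ma", "mae"],
    "s2": ["mae"],
    "s3": ["mat", "mat", "ma", "mat"],
    "s4": ["ma", "mae", "ma"],
    "s5": ["mae", "mae"],
}
mean_share, (low, high) = speaker_bootstrap(toy, lambda t: share_of_form(t, "ma"))
```

Any function of the resampled tokens can be passed as `statistic`, so a regression fit could take the place of the proportion used here; the resampling loop itself would stay the same.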

SLIDE 15

Conclusion

Bootstrapping provides us with a robust estimation of the values of interest, even when assumptions aren’t met or the data set is small and/or biased.

Even after bootstrapping, we can still see clear tendencies: settlement area (variety) seems to be the most significant factor.

SLIDE 16

Conclusion

Statistical methods are useful for several aspects of our research.
Our possibilities are rather limited.
Assumptions are often violated when applying state-of-the-art methods.
Nevertheless, robust methods help us to get less biased estimates.
Robust estimates should always be reported.

SLIDE 17

Contact: zaidan.lahjouji@slavistik.uni-freiburg.de achim.rabus@slavistik.uni-freiburg.de www.russinisch.de

Файно Вам дякуєме за Вашу увагу! Thank you very much for your attention!

SLIDE 18

Literature

Christ, Oliver (1994). A modular and flexible architecture for an integrated corpus query system. In: Proceedings of COMPLEX’94: 3rd Conference on Computational Lexicography and Text Research, 23–32.
Evert, S. and Hardie, A. (2011). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In: Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK. University of Birmingham.
Hinneburg, Alexander, Heikki Mannila, Samuli Kaislaniemi, Terttu Nevalainen & Helena Raumolin-Brunberg (2007). “How to handle small samples: Bootstrap and Bayesian methods in the analysis of linguistic change”. Literary and Linguistic Computing 22(2): 137–150.
Mueller, T.; Schmid, H. & Schütze, H. (2013). Efficient higher-order CRFs for morphological tagging. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA, October. Association for Computational Linguistics.
Rabus, A. & A. Šymon (2015): Na novŷch putjach isslidovanja rusyns’kŷch dialektu. Korpus rozhovornoho rusyns’koho jazŷka. In: Koporová, Kvetoslava (Hrsg.): Rusyn’skŷj literaturnŷj jazŷk na Slovakiji. 20 rokiv kodifikaciji. Prešov, 40-54.
Rabus, Achim (2015): Current Developments in Carpatho-Rusyn Speech - Preliminary Observations. In: Krafcik P. & V. Padjak (eds.): Juvilejnyj zbirnyk na čest' profesora Pavla-Roberta Magočija. Užhorod, 489-496.
Rabus, A. & Scherrer, Y. (2017): Lexicon Induction for Spoken Rusyn - Challenges and Results. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, 27-32.

SLIDE 19

Literature

Rabus, A., Savić, S., Waldenfels, R. v. (2012). Towards an electronic corpus of the Velikie Minei Čet'i. In: Rediscovery: Bulgarian Codex Suprasliensis of the 10th century. Sofia: Iztok Zapad.
Scherrer, Y. & Rabus, A. (2017): Multi-source morphosyntactic tagging for spoken Rusyn. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 84–92.
Schimon, A. & A. Rabus (2016): Wahrnehmungsdialektologische Untersuchungen zum Russinischen in Zakarpattja am Beispiel der Region Chust. In: Zeitschrift für Slawistik 61(3), 401-432.
Šymon, A. & A. Rabus (2016): Ysslidovanja rusyns'koho jazŷka yz pohljada vospryymatel'noji dialektologiji. In: Dynamické procesy v súčasnej slavistike, S. 71-88. (Nachdruck in Rusyn 5/2016 und 6/2016)
v. Waldenfels, R.; Woźniak, M. (2017). SpoCo – a simple and adaptable web interface for dialect corpora. In: Journal for Language Technology and Computational Linguistics, 31(1), 145–160.
v. Waldenfels, R.; Daniel, M.; Dobrushina, N. (2014): Why Standard Orthography? Building the Ustya River Basin Corpus, an online corpus of a Russian dialect. Komp'juternaja lingvistika i intellektual'nye technologii: Po materialam ežegodnoj Meždunarodnoj konferencii «Dialog» (Bekasovo, 4–8 ijunja 2014 g.). Vyp. 13 (20). M.: Izd-vo RGGU, 2014.
Woolhiser, C. (2005). Political borders and dialect divergence/convergence in Europe. In Peter Auer, Frans Hinskens, and Paul Kerswill, editors, Dialect change, 236–262. Cambridge Univ. Press, Cambridge.
