Rusyn as a language between state borders a statistical approach to - - PowerPoint PPT Presentation

Rusyn as a language between state borders a statistical approach to variation (for small sample sizes) Albert-Ludwig University of Freiburg, Germany Department of Slavonic Studies Prof. Dr. Achim Rabus & M. Zaidan Lahjouji


slide-1
SLIDE 1

Rusyn as a language between state borders – a statistical approach to variation (for small sample sizes)


Albert-Ludwig University of Freiburg, Germany Department of Slavonic Studies

  • Prof. Dr. Achim Rabus & M. Zaidan Lahjouji

Project: Russinisch als eine Staatsgrenzen überschreitende Minderheitensprache: Quantitative Perspektiven (Rusyn as a minority language crossing state borders: quantitative perspectives)

slide-2
SLIDE 2

Topics:

  • The Rusyn project
  • Interests and Aims
  • Border Effects
  • The Corpus of Spoken Rusyn
  • Quantitative approaches to spoken data
  • Pitfalls, Possible Solutions and Limitations
  • Example Dataset & Analysis
slide-3
SLIDE 3

The Rusyn Project

  • Interests and aims:
  • Status / Condition of the Carpatho-Rusyn language
  • Documentation of Spoken Carpatho-Rusyn (Corpus)
  • Language Contact with Several "Roofing Languages" (Slavic and Non-Slavic)
  • Contact-induced changes?
  • Language Perception
  • Border Effects (Woolhiser 2005)
  • Quantitative / Statistical Approaches to Spoken Language Data
slide-4
SLIDE 4

The Rusyn Project

  • Interests and aims:
  • Status / Condition of the Carpatho-Rusyn language
  • Documentation of the Rusyn Varieties (Corpus)
  • Language Contact with Several "Roofing Languages" (Slavic and Non-Slavic)
  • Contact-induced changes?
  • Language Perception
  • Border Effects (Woolhiser 2005)
  • Quantitative / Statistical Approaches to Spoken Language Data (RStudio)
slide-5
SLIDE 5

Magocsi, P. R.: Národ znikadiaľ : ilustrovaná história karpatských Rusínov. Prešov : Rusín a Ľudové noviny, 2007, p. 34.

slide-6
SLIDE 6

Rusyns as National Minority

slide-7
SLIDE 7

Sociolinguistic Factors

  • Status? Age? Sex? Education? Mobility? Religion?

Which factors determine how people speak?

slide-8
SLIDE 8

Border Effects as Hypothesis

[Map labels: Poland, Slovakia, Ukraine, Hungary, Romania]

Hypothesis: border effects (Woolhiser 2005) are detectable within the Rusyn vernacular.

slide-9
SLIDE 9

Example: A(j) Conjugation

Pugh, S.M. (2009). The Rusyn language: A grammar of the literary standard of Slovakia with reference to Lemko and Subcarpathian Rusyn. München, p. 117.

slide-10
SLIDE 10

Corpus of Spoken Rusyn

CQP query:

[word="ма|має|мат*|зна|знає|знат*|позна|познає|познат*"%cd]

slide-11
SLIDE 11

Example

Variation within conjugation types AJ and A(j) (Pugh 2009: 116-20)

  • Our dataset contains:
  • Threefold variation: (ма, має, мат(ь)) and ((по-)зна, (по-)знає, (по-)знат(ь))
  • Several utterances by the same speakers, which biases the sample and violates the independence assumption
  • Context
  • Metadata of speakers

мати, 3.Ps.Sg.Pres.; (по-)знати, 3.Ps.Sg.Pres.
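The threefold variation listed above can be turned into a categorical variable by classifying each corpus hit. The following is only a minimal sketch: the function name `classify` and the variant labels "a", "ae" and "at" are hypothetical shorthand, not the coding scheme actually used in the project.

```python
import re
from collections import Counter

# Forms from the slide: ма / має / мат(ь) and (по-)зна / (по-)знає / (по-)знат(ь).
# The variant labels "a", "ae", "at" are hypothetical shorthand for the three types.
FORM_RE = re.compile(r"(по)?(ма|зна)(є|ть?)?")
VARIANT = {None: "a", "є": "ae", "т": "at", "ть": "at"}


def classify(token):
    """Return (lemma, variant) for a 3rd person singular present form, else None."""
    m = FORM_RE.fullmatch(token.lower())
    if m is None:
        return None
    prefix, stem, ending = m.groups()
    if prefix and stem == "ма":
        return None  # the prefix по- only combines with зна- in this dataset
    if stem == "ма":
        lemma = "мати"
    else:
        lemma = "познати" if prefix else "знати"
    return lemma, VARIANT[ending]


# Invented example tokens, not corpus data.
tokens = ["має", "ма", "мать", "познає", "знат", "хата"]
counts = Counter(c for c in map(classify, tokens) if c is not None)
print(counts)
```

Tokens that are not one of the listed forms (here "хата") are simply skipped, so the counter only contains the forms under study.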

slide-12
SLIDE 12

Coefficients in Multinomial Logistic Regression Model

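\[
\begin{aligned}
\ln\frac{P(\mathit{verbForm} = \mathit{ma})}{P(\mathit{verbForm} = \mathit{mae})} &= b_{10} + b_{11}(\mathit{variety} = \mathit{Slo}) + b_{12}(\mathit{variety} = \mathit{Lem}) + b_{13}\,\mathit{Age} + b_{14}(\mathit{sex} = m) \\
\ln\frac{P(\mathit{verbForm} = \mathit{mat})}{P(\mathit{verbForm} = \mathit{mae})} &= b_{20} + b_{21}(\mathit{variety} = \mathit{Slo}) + b_{22}(\mathit{variety} = \mathit{Lem}) + b_{23}\,\mathit{Age} + b_{24}(\mathit{sex} = m)
\end{aligned}
\]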
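To make the two log-odds equations more concrete, the sketch below converts a set of coefficients into predicted probabilities for the three verb forms, with "mae" as the reference category. The function name `verb_form_probs` and all coefficient values are hypothetical and chosen for illustration only; they are not estimates from the Rusyn data.

```python
import math


def verb_form_probs(b1, b2, variety, age, sex):
    """Predicted probabilities of ma / mae / mat from two log-odds equations.

    b1 = [b10, b11, b12, b13, b14] for ln(P(ma) / P(mae)),
    b2 = [b20, b21, b22, b23, b24] for ln(P(mat) / P(mae)).
    Any variety other than "Slo" or "Lem" falls into the reference level.
    """
    x = [1.0, float(variety == "Slo"), float(variety == "Lem"),
         float(age), float(sex == "m")]
    eta_ma = sum(b * xi for b, xi in zip(b1, x))
    eta_mat = sum(b * xi for b, xi in zip(b2, x))
    denom = 1.0 + math.exp(eta_ma) + math.exp(eta_mat)
    return {
        "ma": math.exp(eta_ma) / denom,
        "mae": 1.0 / denom,  # reference category
        "mat": math.exp(eta_mat) / denom,
    }


# Hypothetical coefficients, for illustration only.
b1 = [-0.5, 1.2, -0.3, 0.01, 0.2]
b2 = [-1.0, 0.4, 0.9, 0.02, -0.1]
probs = verb_form_probs(b1, b2, variety="Slo", age=60, sex="m")
print({form: round(p, 3) for form, p in probs.items()})
```

Because each equation compares one form against "mae", a positive coefficient raises the odds of that form relative to "mae" rather than its absolute probability.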

slide-13
SLIDE 13

Problems

  • Data set is rather small
  • Biased data set
  • Dependent variable (verb_form) is categorical
  • Threefold variation
  • Independent variables are predominantly categorical
  • Violation of assumptions (independence)
  • We have collected precious data, so we don't want to give up

slide-14
SLIDE 14

Bootstrapping

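[Diagram: the original sample (Sample0) is resampled into bootstrap samples Sample1, Sample2, Sample3, … Sample500; a regression is fitted to each sample, and the results are combined into a robust estimation.]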
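The resampling idea can be sketched in a few lines. Two points here are assumptions rather than the method shown on the slide: the sketch resamples whole speakers instead of single utterances (one possible way to respect the dependence between utterances by the same speaker), and it swaps the regression for a much simpler statistic, the share of "mae" tokens. All data rows and the names `share_mae` and `cluster_bootstrap` are invented for illustration.

```python
import random
from collections import defaultdict


def share_mae(rows):
    """Share of 'mae' tokens; each row is (speaker_id, verb_form)."""
    return sum(form == "mae" for _, form in rows) / len(rows)


def cluster_bootstrap(rows, statistic, n_boot=500, seed=0):
    """Resample speakers with replacement and recompute the statistic each time.

    Returns the mean of the bootstrap estimates and an approximate
    95% percentile interval.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row[0]].append(row)
    speakers = sorted(by_speaker)
    estimates = []
    for _ in range(n_boot):
        drawn = rng.choices(speakers, k=len(speakers))
        resample = [row for s in drawn for row in by_speaker[s]]
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[min(n_boot - 1, int(0.975 * n_boot))]
    return sum(estimates) / n_boot, (lower, upper)


# Invented toy data: (speaker_id, verb_form).
rows = [("s1", "mae"), ("s1", "mae"), ("s1", "ma"),
        ("s2", "mat"), ("s2", "mat"),
        ("s3", "mae"), ("s3", "ma"), ("s3", "mae"),
        ("s4", "ma")]
mean_est, interval = cluster_bootstrap(rows, share_mae)
print(round(mean_est, 3), interval)
```

Replacing `share_mae` with a function that fits the regression to each resample and returns a coefficient would give the procedure drawn in the diagram, with 500 resamples as on the slide.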

slide-15
SLIDE 15

Conclusion

  • Bootstrapping provides a robust estimate of the values of interest, even when assumptions are not met or the data set is small and/or biased.
  • Even after bootstrapping, clear tendencies remain visible: settlement area (Variety) seems to be the most significant factor.

slide-16
SLIDE 16

Conclusion

  • Statistical methods are useful for several aspects of our research.
  • Our possibilities are rather limited.
  • Assumptions are often violated when applying state-of-the-art methods.
  • Nevertheless, robust methods help us to obtain less biased estimates.

  • Robust estimations should always be reported.
slide-17
SLIDE 17

Contact:

  • zaidan.lahjouji@slavistik.uni-freiburg.de
  • achim.rabus@slavistik.uni-freiburg.de
  • www.russinisch.de

Файно Вам дякуєме за Вашу увагу! Thank you very much for your attention!

slide-18
SLIDE 18

Literature

  • Christ, Oliver (1994). A modular and flexible architecture for an integrated corpus query system. In: Proceedings of COMPLEX’94: 3rd Conference on Computational Lexicography and Text Research, 23–32.
  • Evert, S. and Hardie, A. (2011). Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In: Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK. University of Birmingham.
  • Hinneburg, Alexander, Heikki Mannila, Samuli Kaislaniemi, Terttu Nevalainen & Helena Raumolin-Brunberg (2007). “How to handle small samples: Bootstrap and Bayesian methods in the analysis of linguistic change”. Literary and Linguistic Computing 22(2): 137–150.
  • Mueller, T.; Schmid, H. & Schütze, H. (2013). Efficient higher-order CRFs for morphological tagging. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA, October. Association for Computational Linguistics.
  • Rabus, A. & A. Šymon (2015): Na novŷch putjach isslidovanja rusyns’kŷch dialektu. Korpus rozhovornoho rusyns’koho jazŷka. In: Koporová, Kvetoslava (ed.): Rusyn’skŷj literaturnŷj jazŷk na Slovakiji. 20 rokiv kodifikaciji. Prešov, 40-54.
  • Rabus, Achim (2015): Current Developments in Carpatho-Rusyn Speech - Preliminary Observations. In: Krafcik P. & V. Padjak (eds.): Juvilejnyj zbirnyk na čest' profesora Pavla-Roberta Magočija. Užhorod, 489-496.
  • Rabus, A. & Scherrer, Y. (2017): Lexicon Induction for Spoken Rusyn - Challenges and Results. In: Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, 27-32.

slide-19
SLIDE 19

Literature

  • Rabus, A., Savić, S., Waldenfels, R. v. (2012). Towards an electronic corpus of the Velikie Minei Čet'i. In: Rediscovery: Bulgarian Codex Suprasliensis of the 10th century. Sofia: Iztok Zapad.
  • Scherrer, Y. & Rabus, A. (2017): Multi-source morphosyntactic tagging for spoken Rusyn. In: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 84–92.
  • Schimon, A. & A. Rabus (2016): Wahrnehmungsdialektologische Untersuchungen zum Russinischen in Zakarpattja am Beispiel der Region Chust. In: Zeitschrift für Slawistik 61(3), 401-432.
  • Šymon, A. & A. Rabus (2016): Ysslidovanja rusyns'koho jazŷka yz pohljada vospryymatel'noji dialektologiji. In: Dynamické procesy v súčasnej slavistike, pp. 71-88. (Reprinted in Rusyn 5/2016 and 6/2016)
  • v. Waldenfels, R.; Woźniak, M. (2017). SpoCo – a simple and adaptable web interface for dialect corpora. In: Journal for Language Technology and Computational Linguistics, 31(1), 145–160.
  • v. Waldenfels, R.; Daniel, M.; Dobrushina, N. (2014): Why Standard Orthography? Building the Ustya River Basin Corpus, an online corpus of a Russian dialect. Komp'juternaja lingvistika i intellektual'nye technologii: Po materialam ežegodnoj Meždunarodnoj konferencii «Dialog» (Bekasovo, 4–8 ijunja 2014 g.). Vyp. 13 (20). M.: Izd-vo RGGU, 2014.
  • Woolhiser, C. (2005). Political borders and dialect divergence/convergence in Europe. In Peter Auer, Frans Hinskens, and Paul Kerswill, editors, Dialect change, 236–262. Cambridge Univ. Press, Cambridge.
