L1-Identification
Serhiy Bykh, Detmar Meurers
Second Tübingen-Berlin Meeting on Analyzing Learner Language, December 5–6, 2011
Contents
1. Introduction
2. Previous work on L1 identification
3. Our baseline point: surface-based …
[Diagram: six texts in the target language (here: L2 German), written by learners with L1 = English, L1 = Russian, and L1 = French]
• Corpus: 665 ICLEv2 essays
  – seven L1s, with 95 (+ 15) essays per language
• Features:
  – 3 error types (subject-verb disagreement, noun-number …)
  – 70/363/398 function words
  – 300 letter n-grams, n ∈ [1, 3]
  – 450 POS n-grams, n ∈ [2, 3]
• Method: SVM, 70 essays for training, 25 for testing
• Result: 73.7% accuracy (feature combination)
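As a rough illustration of the surface features listed above, here is a minimal Python sketch of letter n-gram extraction for n ∈ [1, 3]. The function names and the frequency-based selection of the top 300 n-grams are my own assumptions, not the exact setup of the study summarized on this slide.

```python
from collections import Counter

def letter_ngrams(text, n_values=(1, 2, 3)):
    """Count letter n-grams for n in [1, 3], as in the surface
    feature set described above."""
    text = text.lower()
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def top_k_features(texts, k=300):
    """Select the k most frequent letter n-grams over the corpus
    as the feature inventory (a simplifying assumption)."""
    total = Counter()
    for t in texts:
        total.update(letter_ngrams(t))
    return [gram for gram, _ in total.most_common(k)]
```

Each essay would then be mapped onto this fixed feature inventory before SVM training.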
• Features used: word-based recurring n-grams
• Examples (from FALKO; learner spelling preserved):
  – n = 2: und zwar, 30 Jahre, wirkliche Welt, berüfliche Ausbildung, …
  – n = 3: was mich betrifft, von geringen Wert, müssen die …
  – n = 6: die Studenten auf die wirkliche Welt …, ...
• All n-grams occurring in ≥ 2 texts of the used corpus
• n-grams of all occurring lengths, 2 ≤ n ≤ max_n(corpus)
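The recurring-n-gram feature definition above (all word n-grams occurring in at least two texts, with 2 ≤ n ≤ max_n(corpus)) can be sketched in Python as follows; whitespace tokenization and the function name are simplifying assumptions of mine.

```python
from collections import defaultdict

def recurring_ngrams(texts, n_min=2, n_max=None):
    """Collect word n-grams of every length n_min..max_n(corpus)
    that occur in at least 2 texts, per the definition above."""
    if n_max is None:
        # max_n(corpus): length of the longest text in tokens
        n_max = max(len(t.split()) for t in texts)
    seen_in = defaultdict(set)  # n-gram -> indices of texts containing it
    for idx, text in enumerate(texts):
        tokens = text.split()
        for n in range(n_min, min(n_max, len(tokens)) + 1):
            for i in range(len(tokens) - n + 1):
                seen_in[" ".join(tokens[i:i + n])].add(idx)
    # keep only n-grams recurring across at least two texts
    return {g for g, docs in seen_in.items() if len(docs) >= 2}
```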
• Machine learning: k-NN with different distance metrics
• Cosine and dot-product metrics work best for sparse vectors
• Testing: leave-one-out cross-validation
• Features: as bit vectors (0 = feature absent, 1 = feature present)
[Illustration: feature bit vectors — rows textA, textB, …, textX; columns feature1 … featuren; a cell value of 1 marks a feature present in that text]
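A minimal sketch of the evaluation setup described above — bit vectors represented as sets of active features, cosine similarity, and leave-one-out testing. Assuming k = 1 for simplicity (the slides mention k-NN with several metrics); function names are my own.

```python
import math

def cosine(u, v):
    """Cosine similarity for 0/1 bit vectors represented as sets of
    active feature names (absent features count as 0)."""
    if not u or not v:
        return 0.0
    return len(u & v) / math.sqrt(len(u) * len(v))

def knn_loo_accuracy(samples, labels):
    """Leave-one-out: classify each text by its nearest neighbour
    among all remaining texts (k = 1) and report accuracy."""
    correct = 0
    for i, test in enumerate(samples):
        best = max((j for j in range(len(samples)) if j != i),
                   key=lambda j: cosine(test, samples[j]))
        if labels[best] == labels[i]:
            correct += 1
    return correct / len(samples)
```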
• Replication of Wong & Dras (2009), i.e., we used the same …
• Corpus: ICLEv2
  – seven L1s (Bulgarian, Czech, French, Russian, Spanish, …)
• Feature set: word-based recurring n-grams:
  – 1. Single n ∈ {2, 3, 4, 5}
  – 2. Intervals:
    • [n, 29], n ∈ [2, 5] (max_n(corpus) = 29)
    • [2, n], n ∈ [3, 6]
  – 3. Picked subsets: {2, 4}, {2, 5}, {2, 3, 5}, {2, 4, 5}, ...
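The interval and subset feature sets above amount to unions of per-length n-gram sets. A small hypothetical helper (names are my own) makes this concrete:

```python
def interval_feature_set(ngrams_by_n, lo, hi):
    """Union of the recurring n-gram sets for n in [lo, hi],
    mirroring the interval feature sets above (e.g. [2, 29])."""
    feats = set()
    for n in range(lo, hi + 1):
        feats |= ngrams_by_n.get(n, set())
    return feats

def subset_feature_set(ngrams_by_n, ns):
    """Union over a picked subset of n values, e.g. {2, 4} or {2, 3, 5}."""
    return set().union(*(ngrams_by_n.get(n, set()) for n in ns))
```

Here `ngrams_by_n` maps each n-gram length to its set of recurring n-grams for the corpus.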
• Corpus: FALKO
  – subset with 6 L1s (Russian, Uzbek, French, English, Danish, Turkish) × 10 essays
• Feature set: recurring n-grams:
  – intervals [2, n], n ∈ [2, 6]
  – (exploration of some other n-gram subsets)
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   63.3   46.7    43.3    40.0    36.7
Features #   2361   3054    3236    3328    3399
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   20.0   41.7    46.7    45.0    43.3
Features #   670    3050    6560    9390    10924
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   43.3   45.0    46.7    53.3    50.0
Features #   1917   4702    6757    7626    7894
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   46.7   46.7    46.7    53.3    50.0
Features #   2135   4987    6835    7530    7741
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   56.7   51.7    43.3    36.7    35.0
Features #   2541   3589    3857    3965    4039
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   53.3   48.3    45.0    41.7    38.3
Features #   2551   3699    3981    4090    4165
[Chart: accuracy (%) and number of features per n-gram interval]
Interval     [2]    [2,3]   [2,4]   [2,5]   [2,6]
Accuracy %   46.7   56.7    51.7    50.0    48.3
Features #   2322   4124    4876    5130    5242
• Best results (accuracy baseline ≈ 16.7%)
• Word-based:
  – n = 2 (single n), cosine, 2361 feat. (max. 3801): …
• POS-based:
  – n interval [2, 4], cosine, 6560 feat. (max. 12246): …
• Word + open-class POS based:
  – N.*, ADJ.*, VV.*, n interval [2, 5], cosine, 7530 feat. (max. …): …
  – N.*, n subset {2, 3, 6}, cosine, 4236 feat. (max. 5663): …
• Features: from surface to more linguistic modeling
  – modeling on different levels of abstraction: words, …
  – modeling on different levels of units: phrases, …
• Evaluation method: use of other machine learning and …
  – e.g., PCA, SVM, etc.
References
Daelemans, W., Zavrel, J., van der Sloot, K. & van den Bosch, A. (2010): TiMBL: Tilburg Memory Based Learner, version 6.3, Reference Guide. ILK Research Group Technical Report Series no. 10-01. http://ilk.uvt.nl/downloads/pub/papers/Timbl_6.3_Manual.pdf
Diehl, E., Christen, H., Leuenberger, S., Pelvat, I. & Studer, T. (2000): Grammatikunterricht: Alles für der Katz? Untersuchungen zum Zweitspracherwerb Deutsch. In: Henne, H. et al. (eds.): Reihe Germanistische Linguistik 220. Niemeyer, Tübingen.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M. (2009): International Corpus of Learner English (Version 2). Presses Universitaires de Louvain, Louvain-la-Neuve.
van Halteren, H. (2008): Source Language Markers in EUROPARL Translations. In: Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 937–944.
Koppel, M., Schler, J. & Zigdon, K. (2005): Automatically Determining an Anonymous Author's Native Language. In: Intelligence and Security Informatics, volume 3495 of Lecture Notes in Computer Science. Springer, pages 209–217.
Odlin, T. (1989): Language Transfer: Cross-linguistic Influence in Language Learning. Cambridge University Press, New York.
Reznicek, M., Walter, M., Schmid, K., Lüdeling, A., Hirschmann, H. & Krummes, C. (2010): Das Falko-Handbuch. Korpusaufbau und Annotationen, Version 1.0.1.
Tsur, O. & Rappoport, A. (2007): Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition (CACLA '07), pages 9–16.
Wong, S.-M. J. & Dras, M. (2009): Contrastive Analysis and Native Language Identification. In: Proceedings of the Australasian Language Technology Association Workshop, pages 53–61.
Swan, M. & Smith, B. (eds.) (2001): Learner English. A Teacher's Guide to Interference and …
• Koppel, Schler & Zigdon (2005)
• Corpus: ICLEv1, 5 L1s × 258 essays = 1290 essays
• Features:
  – 400 function words
  – 200 character n-grams
  – 185 error types
  – 250 POS bigrams
• Method: SVM, 10-fold cross-validation
• Result: 80.2% accuracy (feature combination)
• Tsur & Rappoport (2007)
• Corpus: ICLEv1, 5 L1s × 258 essays = 1290 essays
• Features:
  – character n-grams, n ∈ {1, 2, 3}
    • motivation: influence of L1 syllable structure on the L2 lexis
  – 460 function words
• Method: SVM, 10-fold cross-validation
• Result: 65.6% accuracy (character bigrams)