SLIDE 1
L1-Identification

Serhiy Bykh, Detmar Meurers

Second Tübingen-Berlin Meeting on Analyzing Learner Language 5./6. December 2011

SLIDE 2

Contents

  • 1. Introduction
  • 2. Previous work on L1 identification
  • 3. Our baseline approach: surface-based classification
      3.1. Features
      3.2. Method
      3.3. Results on ICLEv2 and FALKO
  • 4. Future work: Towards more linguistic modeling
  • 5. References
SLIDE 3

Introduction

[Figure: a learner corpus (here: L2 German) of texts that are all written in German, but whose authors have different native languages, e.g. L1 = English, L1 = Russian, L1 = French]

SLIDE 4

Introduction

Corpus in L2 = German

[Figure: the same L2 German corpus; the task is to identify each text's L1, e.g. L1 = English, L1 = Russian, L1 = French]

SLIDE 5

Contents

  • 1. Introduction
  • 2. Previous work on L1 identification
  • 3. Our baseline approach: surface-based classification
      3.1. Features
      3.2. Method
      3.3. Results on ICLEv2 and FALKO
  • 4. Future work: Towards more linguistic modeling
  • 5. References
SLIDE 6

Previous work: Wong & Dras (2009)

  • Corpus: 665 ICLEv2 essays
    – seven L1s, with 95 (+ 15) essays per language
  • Features:
    – 3 error types (subject-verb disagreement, noun-number disagreement, misuse of determiners)
    – 70/363/398 function words
    – 300 letter n-grams, n ∈ [1, 3]
    – 450 POS n-grams, n ∈ [2, 3]
  • Method: SVM, 70 essays per L1 for training, 25 for testing
  • Result: 73.7% accuracy (combined feature set)

SLIDE 7

Contents

  • 1. Introduction
  • 2. Previous work on L1 identification
  • 3. Our baseline approach: surface-based classification
      3.1. Features
      3.2. Method
      3.3. Results on ICLEv2 and FALKO
  • 4. Future work: Towards more linguistic modeling
  • 5. References
SLIDE 8

Our baseline approach: Features

  • Features used: word-based recurring n-grams
  • Examples (from FALKO; learner errors kept verbatim):
    – n=2: und zwar, 30 Jahre, wirkliche Welt, berüfliche Ausbildung, der Abitur
    – n=3: was mich betrifft, von geringen Wert, müssen die Studenten
    – n=6: die Studenten auf die wirkliche Welt, ...
  • All n-grams occurring in ≥ 2 texts of the used corpus
  • n-grams of all occurring lengths, 2 ≤ n ≤ max_n(corpus)
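This feature extraction can be sketched in a few lines. The helper names below are illustrative, not the authors' implementation; the only assumptions taken from the slides are that n-grams of all lengths ≥ 2 are collected and that a feature must recur in at least two texts:

```python
from collections import Counter

def word_ngrams(tokens, n):
    """All contiguous word n-grams of length n in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def recurring_ngrams(corpus, min_texts=2):
    """Collect every word n-gram (any length >= 2) that occurs
    in at least `min_texts` different texts of the corpus."""
    doc_freq = Counter()
    for tokens in corpus:
        seen = set()
        # n ranges from 2 up to the text length (max_n for this text)
        for n in range(2, len(tokens) + 1):
            seen.update(word_ngrams(tokens, n))
        doc_freq.update(seen)  # count each n-gram once per text
    return {g for g, df in doc_freq.items() if df >= min_texts}

# Toy corpus echoing the FALKO examples above
corpus = [
    "die Studenten auf die wirkliche Welt".split(),
    "auf die wirkliche Welt".split(),
    "und zwar die Studenten".split(),
]
feats = recurring_ngrams(corpus)
# ("auf", "die", "wirkliche", "Welt") appears in two texts, so it is kept;
# ("und", "zwar") appears in only one text, so it is dropped
```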

SLIDE 9

Our baseline approach: Method

  • Machine Learning: k-NN with different distance metrics
  • Cosine and Dot Product metrics are best for sparse vectors
  • Testing: leave-one-out
  • Features: as bit vectors (0 = feature absent, 1 = present)

Feature bit vectors (illustrative):

          feature1  feature2  feature3  ...  featuren
  textA      1         0         0      ...     0
  textB      1         1         1      ...     1
  textX      0         0         1      ...     0
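The classification step above can be sketched as follows. This is a minimal sketch assuming k = 1 (the slides leave k open) and plain cosine similarity over bit vectors; all names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length bit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def leave_one_out_accuracy(vectors, labels):
    """1-NN with cosine similarity, evaluated leave-one-out:
    each text is classified against all the others."""
    correct = 0
    for i, test_vec in enumerate(vectors):
        best_j = max(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine(test_vec, vectors[j]),
        )
        correct += labels[best_j] == labels[i]
    return correct / len(vectors)

# Toy data: bit vectors for texts from two L1 classes
vectors = [
    [1, 1, 0, 0], [1, 1, 1, 0],   # class "en"
    [0, 0, 1, 1], [0, 1, 1, 1],   # class "ru"
]
labels = ["en", "en", "ru", "ru"]
acc = leave_one_out_accuracy(vectors, labels)  # 1.0 on this toy data
```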

SLIDE 10

Baseline approach: ICLEv2 task

  • Replication of Wong & Dras (2009): same dataset, but our own features & machine learning setup
  • Corpus: ICLEv2
    – seven L1s (Bulgarian, Czech, French, Russian, Spanish, Chinese, Japanese) x 95 essays = 665 essays
  • Feature set: word-based recurring n-grams:
    – 1. Single n ∈ {2, 3, 4, 5}
    – 2. Intervals:
        [n, 29], n ∈ [2, 5] (max_n(corpus) = 29)
        [2, n], n ∈ [3, 6]
    – 3. Picked subsets: {2, 4}, {2, 5}, {2, 3, 5}, {2, 4, 5}, ...
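The three ways of choosing n-gram lengths (single n, interval, picked subset) all amount to a union over per-length feature sets. A sketch under that reading, with hypothetical names:

```python
def select_lengths(ngrams_by_n, spec):
    """Union of the recurring n-gram sets for the chosen lengths.
    `ngrams_by_n` maps each length n to its set of recurring n-grams;
    `spec` is any iterable of lengths (single n, interval, or subset)."""
    feats = set()
    for n in spec:
        feats |= ngrams_by_n.get(n, set())
    return feats

# Toy per-length feature sets
ngrams_by_n = {
    2: {("und", "zwar")},
    3: {("was", "mich", "betrifft")},
    4: set(),
}

single = select_lengths(ngrams_by_n, [2])            # 1. single n = 2
interval = select_lengths(ngrams_by_n, range(2, 5))  # 2. interval [2, 4]
picked = select_lengths(ngrams_by_n, {2, 4})         # 3. picked subset {2, 4}
```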

SLIDE 11

Baseline approach: ICLEv2 results

SLIDE 12

Baseline approach: ICLEv2 results

Confusion matrix for the best result

SLIDE 13

Baseline approach: FALKO setup

  • Corpus: FALKO
    – Subset with 6 L1s (Rus, Uzb, Fra, Eng, Dan, Tur) x 10 essays = 60 essays
  • Feature set: recurring n-grams:
    – intervals [2, n], n ∈ [2, 6]
    – (exploration of some other n-gram subsets)

SLIDE 14

Baseline approach: FALKO results

Word-based n-grams:

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   63.3   46.7   43.3   40.0   36.7
  features #   2361   3054   3236   3328   3399

SLIDE 15

Baseline approach: FALKO results

Part-of-speech based n-grams:

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   20.0   41.7   46.7   45.0   43.3
  features #   670    3050   6560   9390   10924

SLIDE 16

Baseline approach: FALKO results

Word + open class (N.*, VV.*, ADJ.*, CARD classes) n-grams:

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   43.3   45.0   46.7   53.3   50.0
  features #   1917   4702   6757   7626   7894

SLIDE 17

Baseline approach: FALKO results

Word + open class POS (matching N.*, VV.*, ADJ.*, CARD):

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   46.7   46.7   46.7   53.3   50.0
  features #   2135   4987   6835   7530   7741

SLIDE 18

Baseline approach: FALKO results

Word + ADJ.* POS (ADJA, ADJD):

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   56.7   51.7   43.3   36.7   35.0
  features #   2541   3589   3857   3965   4039

SLIDE 19

Baseline approach: FALKO results

Word + VV.* POS (VVFIN, VVIMP, VVINF, VVIZU, VVPP):

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   53.3   48.3   45.0   41.7   38.3
  features #   2551   3699   3981   4090   4165

SLIDE 20

Baseline approach: FALKO results

Word + N.* POS (NN, NE):

  interval     [2]    [2,3]  [2,4]  [2,5]  [2,6]
  accuracy %   46.7   56.7   51.7   50.0   48.3
  features #   2322   4124   4876   5130   5242

SLIDE 21

Baseline approach: FALKO results

  • Best results (chance baseline ≈ 16.7%)
  • Word-based:
    – n = 2 (single n), cosine, 2361 feat. (max. 3801): 63.3% accuracy
  • POS-based:
    – n interval [2, 4], cosine, 6560 feat. (max. 12246): 46.7% accuracy
  • Word + open class POS based:
    – N.*, ADJ.*, VV.*, n interval [2, 5], cosine, 7530 feat. (max. 8232): 53.3% accuracy
    – N.*, n subset {2, 3, 6}, cosine, 4236 feat. (max. 5663): 58.3% accuracy

SLIDE 22

Contents

  • 1. Introduction
  • 2. Previous work on L1 identification
  • 3. Our baseline approach: surface-based classification
      3.1. Features
      3.2. Method
      3.3. Results on ICLEv2 and FALKO
  • 4. Future work: Towards more linguistic modeling
  • 5. References
SLIDE 23

Towards more linguistic modeling

  • Features: from surface-based to more linguistic modeling
    – modeling on different levels of abstraction: words, POS, lemmas, induced classes, ...
    – modeling on different levels of units: phrases, dependency triples, clauses, sentences, discourse, ...
  • Evaluation method: use of other Machine Learning and Data Mining techniques
    – e.g. PCA, SVM, etc.

SLIDE 24

Towards more linguistic modeling

Example: Is a learner's choice of Adj N vs. N N constructions typical of their L1?

SLIDE 25

References

Daelemans, W. / Zavrel, J. / van der Sloot, K. / van den Bosch, A. (2010): TiMBL: Tilburg Memory Based Learner, version 6.3, Reference Guide. ILK Research Group Technical Report Series no. 10-01. (web: http://ilk.uvt.nl/downloads/pub/papers/Timbl_6.3_Manual.pdf)

Diehl, Erika / Christen, Helen / Leuenberger, Sandra / Pelvat, Isabelle / Studer, Thérèse (2000): Grammatikunterricht: Alles für der Katz? Untersuchungen zum Zweitspracherwerb Deutsch. In: Henne, Helmut et al. (ed.): Reihe Germanistische Linguistik 220. Niemeyer Verlag, Tübingen.

Granger, S. / Dagneaux, E. / Meunier, F. / Paquot, M. (2009): International Corpus of Learner English (Version 2). Presses Universitaires de Louvain, Louvain-la-Neuve.

van Halteren, H. (2008): Source Language Markers in EUROPARL Translations. In: Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 937–944.

Koppel, M. / Schler, J. / Zigdon, K. (2005): Automatically Determining an Anonymous Author's Native Language. In: Intelligence and Security Informatics, volume 3495 of Lecture Notes in Computer Science. Springer-Verlag, pages 209–217.

Odlin, Terence (1989): Language Transfer: Cross-linguistic influence in language learning. Cambridge University Press, New York.

Reznicek, Marc / Walter, Maik / Schmid, Karin / Lüdeling, Anke / Hirschmann, Hagen / Krummes, Cedric (2010): Das Falko-Handbuch. Korpusaufbau und Annotationen, Version 1.0.1.

Swan, Michael / Smith, Bernard (ed.) (2001): Learner English. A teacher's guide to interference and other problems. Cambridge University Press, Cambridge.

Tsur, O. / Rappoport, A. (2007): Using Classifier Features for Studying the Effect of Native Language on the Choice of Written Second Language Words. In: Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition (CACLA '07), pages 9–16.

Wong, S.-M. J. / Dras, M. (2009): Contrastive Analysis and Native Language Identification. In: Proceedings of the Australasian Language Technology Association Workshop, pages 53–61.
SLIDE 26

Thank you for your attention!

SLIDE 27

Previous work

  • Koppel / Schler / Zigdon (2005)
  • Corpus: ICLEv1, 5 L1s x 258 essays = 1290 essays
  • Features:
    – 400 function words
    – 200 character n-grams
    – 185 error types
    – 250 POS bi-grams
  • Method: SVM, 10-fold cross-validation
  • Result: 80.2% accuracy (combined feature set)

SLIDE 28

Previous work

  • Tsur / Rappoport (2007)
  • Corpus: ICLEv1, 5 L1s x 258 essays = 1290 essays
  • Features:
    – character n-grams, n ∈ {1, 2, 3}
        Motivation: influence of the L1's syllable structure on the L2 lexis
    – 460 function words
  • Method: SVM, 10-fold cross-validation
  • Result: 65.6% accuracy (bi-grams)