

SLIDE 1

Authorship identification in large email collections: Experiments using features that belong to different linguistic levels

George K. Mikros & Kostas Perifanos
National and Kapodistrian University of Athens

SLIDE 2

Style

  • Our approach to authorship identification is based mainly on the idea that an author's style is a complex, multifaceted phenomenon affecting the whole spectrum of his/her linguistic production.
  • Following the old theoretical notion of "double articulation" from the Prague School of Linguistics, we accept that stylistic information is constructed in parallel blocks of increasing semantic load, from character n-grams to word n-grams.
  • In order to capture the multilevel manifestation of stylistic traits, we should detect features belonging to many different linguistic levels and ultimately combine them to achieve the most accurate representation of an author's style.

PAN 2011 Lab, 19-22 September 2011, Amsterdam


SLIDE 3

A hierarchical representation of features and related linguistic levels

[Diagram: features stacked by linguistic level — character bigrams and character trigrams at the lower levels (phonology, morphology), word unigrams, word bigrams and word trigrams at the higher levels (syntax, semantics).]


SLIDE 4

Features

The 1000 most frequent n-grams from each of the following feature groups:

  • Character Bigrams (cbg): Character n-grams provide a robust indicator of authorship, and many studies have confirmed their superiority in large datasets.

  • Character Trigrams (ctg): Character trigrams capture a significant amount of stylistic information and have the additional merit of also representing common email acronyms like FYI, FAQ, BTW, etc.

  • Word Unigrams (ung): Word frequency is considered among the oldest and most reliable indicators of authorship, sometimes outperforming even the n-gram features.

  • Word Bigrams (wbg): Word bigrams have long been used successfully in authorship attribution.

  • Word Trigrams (wtg): Word trigrams have also been found to convey useful stylistic information, since they more closely approximate the syntactic structure of the document.
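The feature extraction described above can be sketched in a few lines of Python. This is a minimal illustration of building an n-gram feature vocabulary, not the authors' actual pipeline; the function names are our own:

```python
from collections import Counter

def char_ngrams(text, n):
    # overlapping character n-grams, e.g. "btw" -> ["bt", "tw"] for n=2
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    # overlapping word n-grams over whitespace-tokenized text
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_ngrams(corpus, extractor, n, k=1000):
    # the k most frequent n-grams across the corpus form the feature vocabulary
    counts = Counter()
    for text in corpus:
        counts.update(extractor(text, n))
    return [gram for gram, _ in counts.most_common(k)]

emails = ["btw see the FAQ", "FYI the FAQ moved"]
print(top_ngrams(emails, word_ngrams, 1, k=2))  # -> ['the', 'FAQ']
```

Each document is then represented by the (relative) frequencies of the vocabulary n-grams, and the five feature groups can be concatenated to form the combined ("All") representation.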


SLIDE 5

Algorithms and Datasets

  • Large and Small datasets (authorship attribution scenario)

▫ L2-regularized logistic regression (authorship attribution tasks)

  • Large+ and Small+ datasets (combined authorship attribution and verification scenario)

▫ One-Class SVM and L2-regularized logistic regression

  • Verify 1, 2 & 3 datasets (pure authorship verification)

▫ One-Class SVM (authorship verification tasks) using only the 2000 most frequent character bigrams
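To make the core of L2-regularized logistic regression concrete, here is a pure-Python sketch of binary training by batch gradient descent. This is only an illustration of the technique; the authors presumably used an optimized library implementation:

```python
import math

def train_l2_logreg(X, y, lam=0.1, lr=0.5, epochs=300):
    """Binary L2-regularized logistic regression via batch gradient descent.
    X: list of feature vectors, y: labels in {0, 1}, lam: L2 penalty weight."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = (p - yi) / n
            grad_w = [gj + err * xj for gj, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# toy 1-dimensional dataset: the label flips around x = 1.5
w, b = train_l2_logreg([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
print(predict(w, b, [0.0]), predict(w, b, [3.0]))  # -> 0 1
```

The L2 penalty (the `lam * wj` term) shrinks the weights, which is what keeps the model stable in the very high-dimensional n-gram spaces used here.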


SLIDE 6

Results in Large Train Dataset

Feature   Acc    F1
cbg       0.260  0.246
wtg       0.281  0.256
ctg       0.312  0.293
wbg       0.320  0.303
ung       0.322  0.311
All       0.481  0.465


SLIDE 7

F1 in Large Test Dataset

[Bar chart: F1 scores in the Large test dataset, ranging from 0.035 up to 0.658.]


SLIDE 8

Results in Small Train Dataset

Feature   Acc    F1
cbg       0.423  0.407
wtg       0.502  0.472
ctg       0.519  0.490
wbg       0.576  0.551
ung       0.590  0.568
All       0.683  0.662


SLIDE 9

F1 in Small Test Dataset

[Bar chart: F1 scores in the Small test dataset, ranging from 0.091 up to 0.717.]


SLIDE 10

Procedure in Large & Small + Datasets

[Diagram: the dataset is split into documents by known authors (Author 1 … Author n) and documents by unknown authors. A One-Class SVM decides whether a document belongs to a known or an unknown author, and L2-regularized logistic regression attributes the known-author documents to a specific author.]
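The two-stage decision in the diagram can be sketched as follows. Note the sketch substitutes a simple centroid-distance novelty check for the One-Class SVM and nearest-profile assignment for the logistic-regression classifier; the function names and the threshold are purely illustrative:

```python
import math

def distance(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attribute(doc_vec, author_profiles, threshold):
    """Two-stage decision: verification first, attribution second.
    author_profiles maps author name -> mean feature vector."""
    best_author, best_dist = None, float("inf")
    for author, profile in author_profiles.items():
        d = distance(doc_vec, profile)
        if d < best_dist:
            best_author, best_dist = author, d
    # stage 1 (verification): reject documents far from every known author
    if best_dist > threshold:
        return "UNKNOWN"
    # stage 2 (attribution): assign to the closest known author
    return best_author

profiles = {"author1": [0.9, 0.1], "author2": [0.1, 0.9]}
print(attribute([0.8, 0.2], profiles, threshold=0.5))  # -> author1
print(attribute([5.0, 5.0], profiles, threshold=0.5))  # -> UNKNOWN
```

The key design point is the same as in the slide: an open-set decision (is the author known at all?) is made before the closed-set attribution step.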


SLIDE 11

F1 in Large+ & Small+

[Bar chart, Large+: F1 scores ranging from 0.001 up to 0.587.]

[Bar chart, Small+: F1 scores ranging from 0.065 up to 0.588.]


SLIDE 12

Results in Verification datasets


Dataset   Precision  Recall
Verify1   0.125      0.667
Verify2   0.035      0.600
Verify3   0.036      0.500

SLIDE 13

Conclusions

  • Features spanning multiple linguistic levels capture an author's stylistic variation better than features that focus on a single level.
  • L2-regularized logistic regression performs very well on high-dimensional data.
  • Authorship verification remains a difficult problem, and research should focus on new algorithms for handling one-class problems.
  • We need one or more common benchmark corpora in order to further advance authorship identification tools and methods.
