Authorship identification in large email collections: Experiments using features that belong to different linguistic levels
George K. Mikros & Kostas Perifanos
National and Kapodistrian University of Athens
Style
- Our approach to authorship identification is based mainly on the idea that an author's style is a complex, multifaceted phenomenon affecting the whole spectrum of his/her linguistic production.
- Following the old theoretical notion of "double articulation" from the Prague School of Linguistics, we accept that stylistic information is constructed in parallel blocks of increasing semantic load, from character n-grams to word n-grams.
- In order to capture the multilevel manifestation of stylistic traits, we should detect features that belong to many different linguistic levels and ultimately combine them to achieve the most accurate representation of an author's style.
PAN 2011 Lab, 19-22 September 2011, Amsterdam
A hierarchical representation of features and related linguistic levels
Features (highest to lowest): Word trigrams, Word bigrams, Word unigrams, Character trigrams, Character bigrams
Linguistic levels (highest to lowest): Semantics, Syntax, Morphology, Phonology
Features
1000 most frequent n-grams from the following feature groups:
- Character Bigrams (cbg): Character n-grams provide a robust indicator of authorship, and many studies have confirmed their superiority in large datasets.
- Character Trigrams (ctg): Character trigrams capture a significant amount of stylistic information and have the additional merit of also representing common email acronyms like FYI, FAQ, BTW, etc.
- Word Unigrams (ung): Word frequency is considered among the oldest and most reliable indicators of authorship, sometimes outperforming even the n-gram features.
- Word Bigrams (wbg): Word bigrams have long been used in authorship attribution with success.
- Word Trigrams (wtg): Word trigrams have also been found to convey useful stylistic information, since they more closely approximate the syntactic structure of the document.
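The slides do not name an implementation; as a minimal sketch, the five feature groups can be extracted with scikit-learn's CountVectorizer (the library choice, the toy documents, and the reduced max_features are assumptions for illustration):

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents standing in for real emails.
docs = [
    "BTW, please send me the report by Friday. FYI the deadline moved.",
    "See the FAQ before asking; the answer is usually there already.",
]

# One vectorizer per feature group; max_features keeps only the most
# frequent n-grams (the slides keep the top 1000 per group; 50 here
# because the toy corpus is tiny).
feature_groups = {
    "cbg": CountVectorizer(analyzer="char", ngram_range=(2, 2), max_features=50),
    "ctg": CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=50),
    "ung": CountVectorizer(analyzer="word", ngram_range=(1, 1), max_features=50),
    "wbg": CountVectorizer(analyzer="word", ngram_range=(2, 2), max_features=50),
    "wtg": CountVectorizer(analyzer="word", ngram_range=(3, 3), max_features=50),
}

# The "All" configuration concatenates every group's counts side by side.
X_all = sp.hstack([v.fit_transform(docs) for v in feature_groups.values()])
print(X_all.shape)
```

Concatenating the group matrices reproduces the "All" configuration evaluated in the results below.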
Algorithms and Datasets
- Large and Small Datasets (Authorship Attribution scenario)
▫ L2 Regularized Logistic Regression (Authorship Attribution tasks)
- Large+ and Small+ Datasets (Combined Authorship Attribution and Verification scenario)
▫ One-Class SVM and L2 Regularized Logistic Regression
- Verify 1, 2 & 3 Datasets (Pure Author Verification)
▫ One-Class SVM (Authorship Verification tasks) using only the 2000 most frequent character bigrams.
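As a minimal sketch of the two learners named above, using scikit-learn (an assumption; the random feature matrix stands in for real n-gram frequency vectors):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Toy stand-ins for n-gram frequency vectors: 20 documents x 10 features,
# two candidate authors.
X = rng.random((20, 10))
y = np.array([0] * 10 + [1] * 10)

# Attribution: L2-regularized logistic regression ("l2" is scikit-learn's
# default penalty; C controls the regularization strength).
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
preds = clf.predict(X)

# Verification: a One-Class SVM trained only on author 0's documents;
# predict() returns +1 for "looks like author 0" and -1 otherwise.
verifier = OneClassSVM(kernel="rbf", nu=0.1).fit(X[y == 0])
decisions = verifier.predict(X)
```

Logistic regression needs examples of every candidate author, while the One-Class SVM trains on a single author's documents, which is why the latter fits the verification scenario.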
Results in Large Train Dataset
Feature set:  Cbg    Wtg    Ctg    Wbg    Ung    All
Accuracy:     0.260  0.281  0.312  0.320  0.322  0.481
F1:           0.246  0.256  0.293  0.303  0.311  0.465
F1 in Large Test Dataset
[Bar chart: F1 scores of individual feature configurations on the Large test dataset, ranging from 0.658 down to 0.035; per-bar labels were lost in extraction.]
Results in Small Train Dataset
Feature set:  Cbg    Wtg    Ctg    Wbg    Ung    All
Accuracy:     0.423  0.502  0.519  0.576  0.590  0.683
F1:           0.407  0.472  0.490  0.551  0.568  0.662
F1 in Small Test Dataset
[Bar chart: F1 scores of individual feature configurations on the Small test dataset, ranging from 0.717 down to 0.091; per-bar labels were lost in extraction.]
Procedure in Large+ & Small+ Datasets
[Flow diagram: the data comprise a dataset with unknown authors and a dataset with known authors. A One-Class SVM separates documents by unknown authors from those by known authors; L2 Regularized Logistic Regression then attributes the known-author documents to Author 1, Author 2, Author 3, Author …, Author n, while the remainder are assigned to the Unknown Author class.]
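The two-stage procedure can be sketched as follows, assuming scikit-learn and toy data (the gate-then-attribute wiring reflects the diagram above; the specific parameters are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Toy stand-ins: training documents by three known authors, and a test
# set that may also contain documents by unseen authors.
X_known = rng.random((30, 8))
y_known = np.repeat([0, 1, 2], 10)
X_test = rng.random((5, 8))

# Step 1 (verification): a One-Class SVM trained on all known-author
# documents flags each test document as in-set (+1) or out-of-set (-1).
gate = OneClassSVM(kernel="rbf", nu=0.2).fit(X_known)
in_set = gate.predict(X_test) == 1

# Step 2 (attribution): L2 logistic regression names an author only for
# documents that passed the gate; the rest stay in the Unknown class (-1).
clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X_known, y_known)
labels = np.full(len(X_test), -1)
labels[in_set] = clf.predict(X_test[in_set])
```

Gating first keeps the attribution model from being forced to pick a known author for every document, which is what distinguishes the combined scenario from pure attribution.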
F1 in Large & Small +
Large+: [Bar chart: F1 per feature configuration, ranging from 0.587 down to 0.001; per-bar labels were lost in extraction.]
Small+: [Bar chart: F1 per feature configuration, ranging from 0.588 down to 0.065; per-bar labels were lost in extraction.]
Results in Verification datasets
Dataset   Precision  Recall
Verify1   0.125      0.667
Verify2   0.035      0.600
Verify3   0.036      0.500
Conclusions
- Features spanning multiple linguistic levels capture an author's stylistic variation better than features that focus on a single level.
- L2 Regularized Logistic Regression performs very well on high-dimensional data.
- Authorship verification remains a difficult problem, and research should focus on new algorithms for handling one-class problems.
- We need one or more common benchmark corpora in order to further advance authorship identification tools and methods.