Authorship identification in large email collections: Experiments - PowerPoint PPT Presentation

Authorship identification in large email collections: Experiments using features that belong to different linguistic levels George K. Mikros & Kostas Perifanos National and Kapodistrian University of Athens

2 PAN 2011 Lab, 19-22 September 2011, Amsterdam Style • Our approach to authorship identification is based mainly on the idea that an author’s style is a complex multifaceted phenomenon affecting the whole spectrum of his/her linguistic production. • Following the old theoretical notion of “double articulation” of the Prague School of Linguistics we accept that stylistic information is constructed in parallel blocks of increasing semantic load, from character n- grams, to word n-grams. • In order to capture the multilevel manifestation of stylistic traits we should detect these features, which belong to many different linguistic levels, and utterly combine them for achieving the most accurate representation of an author’s style.

3 PAN 2011 Lab, 19-22 September 2011, Amsterdam An hierarchical representation of features and related linguistic levels Semantics Word trigrams Word bigrams Syntax Word unigrams Morphology Character trigrams Character Phonology bigrams

4 PAN 2011 Lab, 19-22 September 2011, Amsterdam Features 1000 most frequent n-grams from the following feature groups: • Character Bigrams (cbg) : Character n-grams provide a robust indicator of authorship and many studies have confirmed their superiority in large datasets. • Character Trigrams (ctg) : Character trigrams capture significant amount of stylistic information and have the additional merit that they also represent common email acronyms like FYI, FAQ, BTW, etc. • Word Unigrams (ung) : Word frequency is considered among the oldest and most reliable indicators of authorship outperforming sometimes even the n-gram features. • Word Bigrams (wbg) : Word bigrams have long been used in authorship attribution with success. • Word Trigrams (wtg) : Word trigrams have also been found to convey useful stylistic information since they approach more closely the syntactic structure of the document.

5 PAN 2011 Lab, 19-22 September 2011, Amsterdam Algorithms and Datasets • Large and Small Datasets (Authorship Attribution scenario) ▫ L2 Regularized Logistic Regression (Authorship Attribution tasks) • Large and Small + Datasets (Combined Authorship Attribution and Verification scenario) ▫ One-Class SVM and L2 Regularized Logistic Regression • Verify 1, 2 & 3 Datasets (Pure Author Verification) ▫ One-Class SVM (Authorship Verification tasks) using only the 2000 most frequent character bigrams.

6 PAN 2011 Lab, 19-22 September 2011, Amsterdam Results in Large Train Dataset 0,6 0,5 0,481 0,465 0,4 0,32 0,322 0,312 0,311 0,303 0,293 0,3 0,281 0,26 0,256 0,246 0,2 0,1 0 Cbg Wtg Ctg Wbg Ung All Acc F1

7 PAN 2011 Lab, 19-22 September 2011, Amsterdam F 1 in Large Test Dataset 0,7 0,658 0,642 0,594 0,594 0,6 0,571 0,519 0,522 0,508 0,5 0,5 0,428 0,4 0,3 0,238 0,255 0,221 0,219 0,2 0,148 0,1 0,035 0,055 0 0

8 PAN 2011 Lab, 19-22 September 2011, Amsterdam Results in Small Train Dataset 0,8 0,683 0,7 0,662 0,59 0,6 0,576 0,568 0,551 0,519 0,502 0,49 0,5 0,472 0,423 0,407 0,4 0,3 0,2 0,1 0 Cbg Wtg Ctg Wbg Ung All Acc F1

9 PAN 2011 Lab, 19-22 September 2011, Amsterdam F 1 in Small Test Dataset 0,8 0,717 0,717 0,709 0,685 0,7 0,659 0,642 0,638 0,629 0,62 0,6 0,5 0,44 0,432 0,4 0,374 0,372 0,311 0,3 0,232 0,2 0,091 0,1 0 0

10 PAN 2011 Lab, 19-22 September 2011, Amsterdam Procedure in Large & Small + Datasets Unknown Author Author Dataset with 1 Unknown Authors Unknown Authors L2 Author Regularized 2 Logistic Author Known Regression 3 Dataset with Known Authors Authors Author … Author n One-Class SVM

11 PAN 2011 Lab, 19-22 September 2011, Amsterdam F 1 in Large & Small + 0,7 Large+ 0,587 0,6 0,492 0,518 0,5 0,451 0,446 0,368 0,4 0,369 0,3 0,222 0,216 0,201 0,175 0,2 Small+ 0,1 0,037 0,001 0,7 0 0,588 0,575 0,6 0,527 0,5 0,377 0,4 0,349 0,303 0,331 0,301 0,3 0,254 0,189 0,173 0,2 0,1 0,065 0 0

12 PAN 2011 Lab, 19-22 September 2011, Amsterdam Results in Verification datasets 0,8 0,7 0,667 0,6 0,6 0,5 0,5 0,4 0,3 0,2 0,125 0,1 0,035 0,036 0 Verify1 Verify2 Verify3 Precision Recall

13 PAN 2011 Lab, 19-22 September 2011, Amsterdam Conclusions • Features spanning in multiple linguistic levels capture better author’s stylistic variation than features that focus in a specific level. • L2 Regularized Logistic Regression performs very well in high dimensional data. • Authorship verification research remains a difficult problem and research should be focused to new algorithms handling one-class problems. • We need one / many common benchmark corpus/corpora in order to further advance authorship identification tools and methods.

Authorship identification in large email collections: Experiments - PowerPoint PPT Presentation

Authorship identification in large email collections: Experiments using features that belong to different linguistic levels George K. Mikros & Kostas Perifanos National and Kapodistrian University of Athens 2 PAN 2011 Lab, 19-22

Authorship & Publication August 4, 2009 Authorship Publication Authorship Each author

Authorship: why not just toss a coin? Benefits and responsibilities of authorship Tactics

Kernel Methods and String Kernels for Authorship Analysis Marius Popescu 1 Cristian Grozea 2 1

The Natural Science Collections Facility Natural Science Collections Collections in South Africa

Scala Collections 1 / 20 Scala Collections Figure 1: Abstract classes and traits in

GLAD: Groningen Lightweight Authorship Detection PAN, Authorship verification, 2015 Manuela

A Mathematical Study A Mathematical Study of Authorship Attribution of Authorship Attribution

Cross-domain Authorship Attribution Overview of the Author Identification Task at PAN-2018

COMP 213 Advanced Object-oriented Programming Lecture 12 Java Collections. The Collections

Collections Objectives Explore collections in System.Collections namespace memory

Java Collections and Generics Object-oriented programming Inf1 :: 2008 Object-oriented

Authorship ID at PAN11 What -- Why -- How Patrick Juola Evaluating Variations in Language

Managing Research Integrity during the COVID-19 Emergency Authorship agreements Abigail Norris

A multitude of linguistically- rich features for authorship attribution Ludovic Tanguy, Assaf

Leveraging discourse information effectively for authorship attribution Elisa Ferracane, Su

Deep Bayes Factor Scoring for Authorship Verifjcation Benedikt Boenninghoff Dorothea Kolossa

Formal Models of Language Paula Buttery Dept of Computer Science & Technology, University of

Financial Coaching & Counseling Listening & Learning Series Webinar 4 | Taking It to the

An Isabelle/HOL Formalisation of Greens Theorem Mohammad Abdulaziz Data61/ANU and Lawrence

Full accounting for verifiable outsourcing Riad S. Wahby , Ye Ji , Andrew J. Blumberg ,

HMDA Implementation Lessons learned along the way New regulatory requirement to expand the

The Direct Stiffness Method Part I IFEM Ch 2 Slide 1 Introduction to FEM The Direct

DSRC: Deployment and Beyond WINLAB Research Review John B. Kenney Toyota InfoTechnology Center,

Security in Sensor Networks Written by: Prof. Srdjan Capkun & Others Presented By :

Authorship identification in large email collections: Experiments - PowerPoint PPT Presentation

Authorship identification in large email collections: Experiments using features that belong to different linguistic levels George K. Mikros & Kostas Perifanos National and Kapodistrian University of Athens 2 PAN 2011 Lab, 19-22

Authorship &amp; Publication August 4, 2009 Authorship Publication Authorship Each author

Authorship: why not just toss a coin? Benefits and responsibilities of authorship Tactics

Kernel Methods and String Kernels for Authorship Analysis Marius Popescu 1 Cristian Grozea 2 1

The Natural Science Collections Facility Natural Science Collections Collections in South Africa

Scala Collections 1 / 20 Scala Collections Figure 1: Abstract classes and traits in

GLAD: Groningen Lightweight Authorship Detection PAN, Authorship verification, 2015 Manuela

A Mathematical Study A Mathematical Study of Authorship Attribution of Authorship Attribution

Cross-domain Authorship Attribution Overview of the Author Identification Task at PAN-2018

COMP 213 Advanced Object-oriented Programming Lecture 12 Java Collections. The Collections

Collections Objectives Explore collections in System.Collections namespace memory

Java Collections and Generics Object-oriented programming Inf1 :: 2008 Object-oriented

Authorship ID at PAN11 What -- Why -- How Patrick Juola Evaluating Variations in Language

Managing Research Integrity during the COVID-19 Emergency Authorship agreements Abigail Norris

A multitude of linguistically- rich features for authorship attribution Ludovic Tanguy, Assaf

Leveraging discourse information effectively for authorship attribution Elisa Ferracane, Su

Deep Bayes Factor Scoring for Authorship Verifjcation Benedikt Boenninghoff Dorothea Kolossa

Formal Models of Language Paula Buttery Dept of Computer Science &amp; Technology, University of

Financial Coaching &amp; Counseling Listening &amp; Learning Series Webinar 4 | Taking It to the

An Isabelle/HOL Formalisation of Greens Theorem Mohammad Abdulaziz Data61/ANU and Lawrence

Full accounting for verifiable outsourcing Riad S. Wahby , Ye Ji , Andrew J. Blumberg ,

HMDA Implementation Lessons learned along the way New regulatory requirement to expand the

The Direct Stiffness Method Part I IFEM Ch 2 Slide 1 Introduction to FEM The Direct

DSRC: Deployment and Beyond WINLAB Research Review John B. Kenney Toyota InfoTechnology Center,

Security in Sensor Networks Written by: Prof. Srdjan Capkun &amp; Others Presented By :

Authorship & Publication August 4, 2009 Authorship Publication Authorship Each author

Formal Models of Language Paula Buttery Dept of Computer Science & Technology, University of

Financial Coaching & Counseling Listening & Learning Series Webinar 4 | Taking It to the

Security in Sensor Networks Written by: Prof. Srdjan Capkun & Others Presented By :