Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce
Ludovic Tanguy, Franck Sajous, Basilio Calderone and Nabil Hathout
CLLE-ERSS: CNRS & University of Toulouse, France
PAN 2012 - Authorship Attribution - CLEF

Overview
- General method for all subtasks
  - Maximum Entropy classifier (csvLearner)
  - Substantial effort in feature engineering
  - Many linguistically rich features
  - No feature selection
  - Whole texts as items (no splitting)
- Four runs were submitted:
  - Run 1 (CLLE-ERSS1): character trigrams + all linguistic features
  - Run 2 (CLLE-ERSS2): character trigrams only
  - Run 3 (CLLE-ERSS3): bag of words (lemma frequencies)
  - Run 4 (CLLE-ERSS4): a selection of 60 synthetic features
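As an illustration of the character-trigram features used in Runs 1 and 2, here is a minimal sketch that turns a text into relative trigram frequencies. The function name and the choice of relative (rather than raw) frequencies are assumptions made for this example; the slides do not say how csvLearner's input was actually built.

```python
from collections import Counter
from typing import Dict


def char_trigram_frequencies(text: str) -> Dict[str, float]:
    """Relative frequency of every character trigram in `text`.

    Hypothetical helper: normalising by the total trigram count is an
    assumption of this sketch, not something stated on the slides.
    """
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    total = len(trigrams)
    if total == 0:
        return {}  # texts shorter than three characters have no trigrams
    counts = Counter(trigrams)
    return {gram: n / total for gram, n in counts.items()}


# One feature vector per whole text (no splitting), as in the overview.
features = char_trigram_frequencies("the cat sat on the mat")
```

Each whole text yields one such dictionary, which can then be passed to any classifier that accepts sparse frequency features.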

Processing
- All training and test texts were:
  - Normalised for encoding
  - De-hyphenised (based on a lexicon)
  - POS-tagged and parsed (Stanford CoreNLP)
- No split?
  - Using splits of the same few texts is misleading (textual cohesion)
  - No cross-validation data available…

List of features (1)
- Contracted forms
  - Average ratio of frequencies (« do not » vs « don’t », etc.)
- Phrasal verbs
  - Frequency of all verb-preposition pairs (« put on », etc.)
- Lexical genericity and ambiguity
  - Average depth in WordNet
  - Average number of synsets per word
- Frequency of POS trigrams
- Syntactic dependencies
  - Frequency of all word-relation-word triples (« cat – subj – eat »)
- Syntactic complexity
  - Average depth of syntactic parse trees
  - Average length of syntactic links
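The contracted-forms feature can be sketched as follows. The pair list, the function names and the exact ratio (contracted count divided by contracted plus full count, averaged over the pairs that occur) are all assumptions made for illustration; the slides only say that an average ratio of frequencies is used.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical sample of (full form, contracted form) pairs.
PAIRS: List[Tuple[str, str]] = [
    ("do not", "don't"),
    ("is not", "isn't"),
    ("i am", "i'm"),
]


def count_phrase(text: str, phrase: str) -> int:
    """Count whole-word occurrences of `phrase` in already lowercased text."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, text))


def contraction_ratio(text: str,
                      pairs: List[Tuple[str, str]] = PAIRS) -> Optional[float]:
    """Average share of contracted forms over the pairs attested in `text`."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    ratios = []
    for full, short in pairs:
        n_full = count_phrase(text, full)
        n_short = count_phrase(text, short)
        if n_full + n_short:  # skip pairs that never occur in this text
            ratios.append(n_short / (n_full + n_short))
    return sum(ratios) / len(ratios) if ratios else None


ratio = contraction_ratio("I don't know. I do not care. I'm here.")
```

Returning `None` when no pair occurs keeps "no evidence" distinct from "never contracts", which matters for short texts.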

List of features (2)
- Lexical cohesion
  - Density of semantically similar word pairs (according to the Distributional Memory database)
- Morphological complexity
  - Frequency of suffixed words
- Lexical absolute frequency
  - Distribution of words across Nation’s wordlists
- Punctuation and case
  - Frequency of punctuation marks
  - Frequency of uppercased words
- Direct speech
  - Ratio of sentences between quotes
- First person narrative
  - Relative frequency of « I » (per verb, outside quotes)
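A rough sketch of the punctuation and case features is given below. The normalisations are assumptions of this example: punctuation marks are counted per character, and an "uppercased word" is taken to mean a fully uppercase token of more than one letter. The slides do not define either.

```python
import string
from collections import Counter
from typing import Dict


def punctuation_and_case(text: str) -> Dict[str, float]:
    """Per-mark punctuation frequencies plus the share of uppercase words."""
    n_chars = len(text)
    marks = Counter(ch for ch in text if ch in string.punctuation)
    # Assumption: frequency of each mark relative to the text length.
    features = {f"punct_{ch}": n / n_chars for ch, n in marks.items()}

    # Strip surrounding punctuation so "NO!" is counted as the word "NO".
    words = [tok.strip(string.punctuation) for tok in text.split()]
    words = [w for w in words if w]
    # Assumption: single letters such as "I" do not count as uppercased.
    n_upper = sum(1 for w in words if len(w) > 1 and w.isupper())
    features["uppercase_words"] = n_upper / len(words) if words else 0.0
    return features


example = punctuation_and_case("WAIT, what? NO!")
```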

Outcome
- Closed-class tasks (A, C, I)
  - Choose the author with highest probability
- Open-class tasks (B, D, J)
  - Author is « unknown » if max(p) < mean(p) + 1.25 * st.dev(p)
- Results:
  - Overall: all rich + 3char > synthetic rich > lemmas > 3char
  - Good for A, I and J
  - Average for B
  - Bad for C and D
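The open-class decision rule can be written in a few lines. This is a sketch under one stated assumption: the slide does not say whether the standard deviation is the population or the sample version, and the population version (`pstdev`) is used here. The function name and the example probabilities are invented for illustration.

```python
from statistics import mean, pstdev
from typing import Dict


def attribute(probs: Dict[str, float], factor: float = 1.25) -> str:
    """Return the most probable author, or "unknown" if it does not stand out.

    `probs` maps each candidate author to the classifier's probability.
    Assumption: population standard deviation; the slides do not specify.
    """
    values = list(probs.values())
    best = max(probs, key=probs.get)
    cutoff = mean(values) + factor * pstdev(values)
    return "unknown" if probs[best] < cutoff else best


# A clear winner is attributed; a flat distribution is rejected.
clear = attribute({"a": 0.90, "b": 0.05, "c": 0.05})
flat = attribute({"a": 0.40, "b": 0.35, "c": 0.25})
```

For the closed-class tasks the same function applies with the cutoff ignored, that is, simply taking `max(probs, key=probs.get)`.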

Post-hoc analysis
- Lesion studies on test data for tasks A and C
  - Measuring accuracy with different combinations of features
  - Average accuracy gain when adding each subset

Feature subset                  Gain for task A   Gain for task C
Punctuation & case                  +0.204            -0.040
Suffix frequency                    +0.097            +0.009
Absolute lexical frequency          +0.030            -0.003
Syntactic complexity                +0.015            +0.006
Ambiguity/genericity                +0.012            +0.008
Lexical cohesion                    +0.002            -0.000
Phrasal verbs (synthetic)           -0.000            +0.022
Morphological complexity            -0.005            -0.002
Phrasal verbs (detail)              -0.006            -0.006
Contractions                        -0.014            +0.018
First/third person narrative        -0.027            -0.026
POS trigrams                        -0.028            +0.045
Char. trigrams                      -0.034            +0.206
Syntactic dependencies              -0.059            +0.089

Correlation between the two gain columns: r = -0.48
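The r value reported with the table appears to be the Pearson correlation between the task A and task C gains. The sketch below recomputes it from the rounded figures in the table; the result comes out close to the reported value, with small differences expected because the inputs are rounded to three decimals. The variable and function names are illustrative only.

```python
from math import sqrt
from typing import List, Tuple

# (feature subset, gain for task A, gain for task C), copied from the table.
GAINS: List[Tuple[str, float, float]] = [
    ("Punctuation & case", 0.204, -0.040),
    ("Suffix frequency", 0.097, 0.009),
    ("Absolute lexical frequency", 0.030, -0.003),
    ("Syntactic complexity", 0.015, 0.006),
    ("Ambiguity/genericity", 0.012, 0.008),
    ("Lexical cohesion", 0.002, -0.000),
    ("Phrasal verbs (synthetic)", -0.000, 0.022),
    ("Morphological complexity", -0.005, -0.002),
    ("Phrasal verbs (detail)", -0.006, -0.006),
    ("Contractions", -0.014, 0.018),
    ("First/third person narrative", -0.027, -0.026),
    ("POS trigrams", -0.028, 0.045),
    ("Char. trigrams", -0.034, 0.206),
    ("Syntactic dependencies", -0.059, 0.089),
]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


r = pearson([a for _, a, _ in GAINS], [c for _, _, c in GAINS])
print(f"r = {r:.2f}")
```

The negative sign reflects what the table shows directly: the subsets that help most on task A (punctuation and case) tend to hurt or do nothing on task C, and vice versa for character trigrams.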

Author clustering / intrusion tasks
- Using MaxEnt as an unsupervised classifier
  - Method proposed by DePauw and Wagacha, 2008
- Principles:
  - Training: all paragraphs as training items
    - Class value = paragraph ID
  - Reclassifying: every paragraph processed by the trained classifier
    - Result = square matrix of probabilities (Mp)
    - Distance matrix between paragraphs: Md = -log(Mp)
  - Clustering: regroup similar paragraphs
    - Hierarchical ascending (agglomerative) clustering on Md
    - Result: highest level clusters
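The last two steps (distance matrix, then hierarchical clustering) can be sketched in plain Python. Several choices here are assumptions, since the slides do not specify them: the toy matrix `MP` is invented, Md is symmetrised by averaging the two directions, average linkage is used, and merging stops at two clusters to mimic reading off the top split of the dendrogram. The MaxEnt training and reclassification that would produce Mp are not shown.

```python
from math import log
from typing import List

Matrix = List[List[float]]

# Invented example: MP[i][j] = probability that paragraph i is "paragraph j".
# Paragraphs 0-2 resemble each other; paragraph 3 is the intruder.
MP: Matrix = [
    [0.70, 0.18, 0.10, 0.02],
    [0.16, 0.70, 0.12, 0.02],
    [0.11, 0.13, 0.74, 0.02],
    [0.03, 0.03, 0.04, 0.90],
]


def distance_matrix(mp: Matrix, eps: float = 1e-12) -> Matrix:
    """Md = -log(Mp), symmetrised by averaging (an assumption of this sketch)."""
    n = len(mp)
    md = [[-log(max(p, eps)) for p in row] for row in mp]
    return [[0.0 if i == j else (md[i][j] + md[j][i]) / 2 for j in range(n)]
            for i in range(n)]


def agglomerate(md: Matrix, n_clusters: int = 2) -> List[List[int]]:
    """Bottom-up clustering with average linkage until `n_clusters` remain."""
    clusters = [[i] for i in range(len(md))]

    def linkage(a: List[int], b: List[int]) -> float:
        return sum(md[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


top_level = agglomerate(distance_matrix(MP))
```

On this toy input the top-level split isolates paragraph 3 from the other three, which is the kind of result the intrusion task looks for.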

Sample dendrogram
- Task F, Text 4, Run CLLE-ERSS1 (correct guess)

Conclusions
- Average results for traditional tasks, quite disappointing
- Good results for paragraph intrusions
- Overall, rich features are once more proven to be an improvement over character trigrams
- There’s still room for improvement with feature selection
  - Feature efficiency varies greatly across tasks and authors
  - Very small linguistic feature subsets can be sufficient
