SLIDE 1 GLAD: Groningen Lightweight Authorship Detection
PAN, Authorship verification, 2015
Manuela Hürlimann, Benno Weck, Esther van den Berg, Simon Šuster, Malvina Nissim
SLIDE 2
The challenge
- given: a set of Known documents written by the same Author A_K
- given: one Unknown document written by an unknown Author A_U
- task: determine whether A_U = A_K
SLIDE 3
How can we recognise different authors?
SLIDE 4
How can we recognise different authors?
Shorter sentences? Unusual word choice? More complex grammar?
SLIDE 5
How can we recognise different authors?
individual_vector(feat1, feat2…) individual_vector(feat1, feat2…) individual_vector(feat1, feat2…)
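A minimal sketch of what such a per-document vector could look like. The name `individual_vector` comes from the slide; the three features below (mean sentence length, type-token ratio, mean word length) are illustrative assumptions echoing the questions on the previous slide, not necessarily GLAD's actual feature set.

```python
import re

def individual_vector(text):
    """Return a small, hypothetical stylistic feature vector for one document."""
    # Naive sentence split on runs of '.', '!' or '?' plus trailing whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    if not sentences or not words:
        return [0.0, 0.0, 0.0]
    mean_sentence_len = len(words) / len(sentences)          # words per sentence
    type_token_ratio = len(set(words)) / len(words)          # vocabulary richness
    mean_word_len = sum(len(w) for w in words) / len(words)  # characters per word
    return [mean_sentence_len, type_token_ratio, mean_word_len]
```

One vector of this kind would be computed per document, which is what the three repeated `individual_vector(...)` entries on the slide suggest.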
SLIDE 6
How can we then differentiate between authors?
SLIDE 7
How can we then differentiate between authors?
Different sentence length? Different word choice? Different grammar?
SLIDE 8
How can we then differentiate between authors?
similarity_vector(feat1, feat2, …)
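As a rough illustration, a similarity vector can be built by comparing two per-document feature vectors one feature at a time. The name `similarity_vector` is from the slide; the relative-difference formula below is an assumption chosen for simplicity, not the measure the system is known to use.

```python
def similarity_vector(vec_known, vec_unknown):
    """Compare two equal-length feature vectors feature by feature."""
    if len(vec_known) != len(vec_unknown):
        raise ValueError("vectors must have the same length")
    sims = []
    for k, u in zip(vec_known, vec_unknown):
        largest = max(abs(k), abs(u))
        # 1.0 means identical values; for non-negative features the
        # result stays in [0, 1] and shrinks as the values diverge.
        sims.append(1.0 if largest == 0 else 1.0 - abs(k - u) / largest)
    return sims
```

Each Known/Unknown pair then yields one such vector, which can serve as a single training or test instance for a classifier.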
SLIDE 9 Our approach
- machine learning approach, trained on PAN 2015 data
- using an SVM for a two-class classification task
- a set of features
- feature ablation studies to tune the system to each language
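The ablation idea in the last bullet can be sketched as a leave-one-group-out loop. This is a generic illustration under stated assumptions: `ablation_study` and the `evaluate` callable are hypothetical names, and `evaluate` stands in for whatever trains and scores the classifier on a given set of feature groups.

```python
def ablation_study(feature_groups, evaluate):
    """Score the system with each feature group left out in turn.

    `evaluate` is a hypothetical callable: it receives a list of group
    names, trains and tests a classifier on them, and returns a score.
    """
    baseline = evaluate(list(feature_groups))
    deltas = {}
    for group in feature_groups:
        remaining = [g for g in feature_groups if g != group]
        # Negative delta: the score dropped without this group, so it helps.
        # Positive delta: the score rose without it, so it is harmful.
        deltas[group] = evaluate(remaining) - baseline
    return baseline, deltas
```

Running this once per language, with `evaluate` bound to that language's training data, would give a per-language picture of which groups to keep.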
SLIDE 11 The aim
Input in any language
[Diagram: training instances]
SLIDE 12 The aim
Input in any language
Features should be easy to extract
[Diagram: training instances, model]
SLIDE 13 The aim
Input in any language
Features should be easy to extract
Training & testing time should be fast
[Diagram: training instances, model, prediction]
SLIDE 14
Our features
SLIDE 15
Our features
similarity_vector(entropy_of_known, visual_features, …)
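For a feature such as `entropy_of_known`, one plausible reading is the Shannon entropy of the Known text. The sketch below assumes entropy over the character distribution; the slide does not say which unit (characters, words, or something else) is used, and `char_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the relative frequency p of each character.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A feature like this needs no tokeniser or tagger, which fits the stated aim of features that are easy to extract for any input language.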
SLIDE 16
Our features
To determine relevance: grouping
SLIDE 17 Our features
Individual Individual Joint
Vector_U(feat1,feat2) Vector_K(feat1,feat2) Vector_Joint(feat1,feat2)
SLIDE 18
Comparing features
SLIDE 19
Results of ablation & single-feature experiments: Helpful features
Comparing features
SLIDE 20 Side note:
Visual features
- Punctuation
- Line ending
- Letter case
- Line length
- Block size
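A minimal sketch of how the five visual feature groups might be measured on raw text. The concrete definitions here (ratios, means, blocks as runs of lines separated by a blank line) are assumptions for illustration; the slide only names the groups, and `visual_features` is a hypothetical function name.

```python
import string

def visual_features(text):
    """Hypothetical measurements for the five visual feature groups."""
    lines = [ln for ln in text.split("\n") if ln.strip()]
    blocks = [b for b in text.split("\n\n") if b.strip()]
    letters = [c for c in text if c.isalpha()]
    n_chars = len(text) or 1  # guard against division by zero on empty input
    # Note: string.punctuation covers ASCII punctuation only.
    return {
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars,
        "lines_ending_in_punct": (
            sum(ln.rstrip()[-1] in string.punctuation for ln in lines) / len(lines)
            if lines else 0.0
        ),
        "uppercase_ratio": (
            sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
        ),
        "mean_line_length": (
            sum(len(ln) for ln in lines) / len(lines) if lines else 0.0
        ),
        # Block size is taken here as non-empty lines per block.
        "mean_block_size": len(lines) / len(blocks) if blocks else 0.0,
    }
```

Computing these for the Known and Unknown documents separately gives two small vectors that can be compared like any other feature.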
SLIDE 21 Side note:
Visual features
- Punctuation
- Line ending
- Letter case
- Line length
- Block size
Con
characteristic
linguistic feature
SLIDE 22 Side note:
Visual features
- Punctuation
- Line ending
- Letter case
- Line length
- Block size
Con
characteristic
linguistic feature
Pro
author-specific for some genres
“Pa-pa, pa-pa, pa-pa!
Here, stop her. She’ll fall down. Here, turn around. Walk this way. Ma-ma, ma-ma, ma-ma;
Oh, I think you are a darling. Mer-ry Christ-mas! Mer-ry Christmas.”
SLIDE 23
Results of ablation & single-feature experiments: Harmful features
Comparing features
SLIDE 24
Results of ablation & single-feature experiments:
Comparing features
Features that are harmful, helpful, or helpful-depending-on-the-language
SLIDE 26
Results of ablation & single-feature experiments:
Comparing features
Differences are subtle
SLIDE 28
Resulting groups
SLIDE 29
Results
SLIDE 30 Results
- Simple similarity features work
SLIDE 31 Results
- Simple similarity features work in unison
SLIDE 32 Results
- Simple similarity features work in unison, independent of language (except Greek)
SLIDE 33 Results
- Simple similarity features work in unison, independent of language (except Greek)
- System works fast (average runtime: 1 minute)
SLIDE 34
Final conclusion
GLAD …
- is a light and fast language-independent system
- allows language adaptation via feature selection
- involves innovative visual features which appear useful (especially for English data) and could be investigated further