Vote/Veto Meta-Classifier for Authorship Identification
Roman Kern, Christin Seifert, Mario Zechner, Michael Granitzer
Institute of Knowledge Management, Graz University of Technology
{rkern, christin.seifert}@tugraz.at
Know-Center GmbH
◮ Apply NLP techniques
◮ Annotate the plain text
◮ Multiple feature spaces
◮ Each should encode specific aspects
◮ Integrate feature weighting
◮ Base classifiers
◮ Record performance while training
◮ Selectively use the output for combined result
◮ Text lines - characters terminated by a newline
◮ Text blocks - consecutive lines separated by empty lines
◮ All subsequent annotations operate on blocks only
◮ Natural language annotations
◮ Slang-word annotations
◮ Grammar annotations
◮ Split sentences
◮ Tokenize
◮ Part-of-speech tags
◮ Predefined list
◮ Heuristics (numbers, non-letter characters)
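The slang detection described above could be sketched as follows; the list contents and the function name are illustrative, and the heuristic rules are a guess at what "numbers, non-letter characters" amounts to in practice:

```python
# Stand-in for the predefined slang list mentioned on the slide.
SLANG_LIST = {"imho", "imo", "lol"}

def looks_like_slang(token: str) -> bool:
    """Flag a token as slang if it appears in the predefined list,
    or if it contains a digit or any other non-letter character."""
    t = token.lower()
    if t in SLANG_LIST:
        return True
    return any(not ch.isalpha() for ch in t)

print(looks_like_slang("gr8"), looks_like_slang("hello"))  # True False
```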
◮ Emoticons: :-) :)
◮ Abbreviations: imho imm imma imnerho imnl imnshmfo imnsho imo
◮ Stanford parser
◮ Structure and complexity of sentences
◮ Richness of grammatical constructs
◮ Open American National Corpus (OANC)
◮ Apply a linear text segmentation algorithm
◮ About 70,000 documents
◮ About 200,000 terms
◮ w_binary = sgn(tf_x)
◮ w_local = √tf_x
◮ External corpus, modified BM25 (Kern and Granitzer [2010])
◮ w_ext = √tf_x · log((N − df_x + 0.5) / (df_x + 0.5)) · 1/√length · DP(x)^(−0.3)
◮ Training set as corpus
◮ w_global = √tf_x · log((N − df_x + 0.5) / (df_x + 0.5)) · 1/√length
◮ Combine all documents of an author into one big document
◮ w_purity = √tf_x · log((|A| − af_x + 0.5) / (af_x + 0.5)) · 1/√length
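As a minimal sketch, the w_global weighting above (square-rooted term frequency, a BM25-style inverse document frequency, and length normalisation) can be written as:

```python
import math

def w_global(tf_x, df_x, N, length):
    """Sketch of the slide's w_global weight: sqrt(tf) times a
    BM25-style idf, normalised by sqrt(document length).
    tf_x: term frequency, df_x: document frequency,
    N: number of documents, length: document length."""
    idf = math.log((N - df_x + 0.5) / (df_x + 0.5))
    return math.sqrt(tf_x) * idf / math.sqrt(length)

# A rare term (df_x = 1) in a 100-word document gets a high weight.
print(round(w_global(tf_x=4, df_x=1, N=100, length=100), 3))  # 0.839
```

The purity variant w_purity follows the same shape, swapping document frequency for author frequency af_x over the set of authors A.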
◮ Basic statistics
◮ Token statistics
◮ Grammar statistics
◮ Slang words
◮ Pronouns
◮ Stop words
◮ Pure unigrams
◮ Bigrams
◮ Intro-outro
◮ Unigrams
◮ Statistical feature spaces
◮ Bagging with random forests
◮ Vector space models
◮ L2-regularized logistic regression, LibLINEAR
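A hedged scikit-learn sketch of the two base-classifier families; the slides used LibLINEAR directly, and scikit-learn's "liblinear" solver is a close stand-in, while the toy data and hyperparameters are purely illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Statistical feature spaces: random forests (bagged decision trees).
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Vector space models: L2-regularised logistic regression; the
# "liblinear" solver wraps the LibLINEAR library named on the slide.
lr = LogisticRegression(penalty="l2", solver="liblinear")

# Toy data: class 0 has a high second feature, class 1 a high first.
X = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]
rf.fit(X, y)
lr.fit(X, y)
print(rf.predict([[0.05, 0.95]]), lr.predict([[0.05, 0.95]]))
```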
◮ Records the performance of all base classifiers during training
◮ 10-fold cross-validation
◮ If precision > t_p, the base classifier may vote for a class
◮ If recall > t_r, the base classifier may veto against a class
◮ Apply all base classifiers, record posterior probabilities
◮ If (may vote AND probability > p_p) → vote for this class: W_c = W_c + w_c^i · p_c^i
◮ If (may veto AND probability < p_r) → veto against this class: W_c = W_c − w_c^i · p_c^i
◮ The final base classifier is treated differently, the probabilities …
◮ Class with the highest W_c wins
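The vote/veto combination above can be sketched as follows; the threshold names (p_p, p_r) follow the slide, while the data layout, weights, and default values are assumptions for illustration:

```python
def vote_veto(classifiers, x, p_p=0.6, p_r=0.3):
    """Combine base classifiers by vote/veto (sketch of slide 14).

    Each classifier is a dict with:
      may_vote[c]/may_veto[c]  -- granted if training precision/recall
                                  exceeded the thresholds t_p / t_r
      weight[c]                -- per-class weight w_c^i
      predict_proba(x)[c]      -- posterior probability p_c^i
    """
    W = {}
    for clf in classifiers:
        probs = clf["predict_proba"](x)
        for c, p in probs.items():
            W.setdefault(c, 0.0)
            if clf["may_vote"].get(c) and p > p_p:
                W[c] += clf["weight"][c] * p   # vote for class c
            if clf["may_veto"].get(c) and p < p_r:
                W[c] -= clf["weight"][c] * p   # veto against class c
    return max(W, key=W.get)                    # highest W_c wins

clf = {
    "may_vote": {"a": True, "b": True},
    "may_veto": {"a": True, "b": True},
    "weight": {"a": 1.0, "b": 1.0},
    "predict_proba": lambda x: {"a": 0.9, "b": 0.1},
}
print(vote_veto([clf], x=None))  # a
```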
[Figure: macro precision and macro recall (0.0–0.7) for the feature-space combinations Grammar, Basic, Unigrams, Unigrams + Basic, and All]
[Figure: macro precision and macro recall (0.0–0.7) for character 3-grams and 4-grams]