Quite Simple Approaches for Authorship Attribution, Intrinsic Plagiarism Detection and Sexual Predator Identification
Notebook for PAN at CLEF 2012
Anna Vartapetiance
Dr. Lee Gillam
Vartapetiance, A., Gillam, L.: “I don't know where he's not”: Does Deception Research yet offer a basis for Deception Detectives? In: Proceedings of the Workshop on Computational Approaches to Deception Detection, pp. 3-14, Avignon, France (2012)
helpful
measurable
negative words, Exclusive words, Action verbs
but inconsistent in both categories and membership
[Table: deception cues and the category that each of studies (1) to (7) assigns them to. Columns: Categories, Cues, (1), (2), (3), (4), (5), (6), (7). Recoverable row labels: Quantity (syllables, word, sentence, noun phrase); sensory details; affective terms; imagery; expressiveness (emotiveness index); activation; lexical diversity; immediacy (passive voice); typo errors. Cells hold the category codes Q, S, U, V, A, E, D and I, some marked with ** or +**; the per-study cell values are not recoverable here.]
Informality = I; Uncertainty = U; Vocabulary Complexity = VC; Grammatical Complexity = GC
(1) Burgoon & Qin, 2006; (2) Qin et al., 2005; (3) Qin, Burgoon & Nunamaker, 2004; (4) Zhou et al., 2004; (5) Zhou, Burgoon & Twitchell, 2003; (6) Zhou et al., 2003; (7) Burgoon et al., 2003
1. Top 10 most frequent words (English)
2. Regular expressions for all pairs of these words, with a specific window size
3. Create author profiles based on the patterns
4. Calculate frequency, mean and variance of the patterns for each author (mean-variance, following Church & Hanks, 1991)
5. Calculate frequency, mean and variance for each test document
6. Select the author with the closest matching values
Church, K., and Hanks, P. (1991). Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, Vol 16:1, pp. 22-29
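The six steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it swaps the regular expressions for a plain token loop, the word list `TOP_WORDS` is an illustrative guess at "top 10 most frequent English words", the window size of 5 is arbitrary, and the matching rule in `profile_distance` (sum of absolute differences in mean and variance) is an assumption, since the slides do not say how "closest match" is measured.

```python
import re
from statistics import mean, pvariance

# Assumption: an illustrative list of ten very frequent English words.
TOP_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "was", "it")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def pair_distances(tokens, words, window=5):
    """For each ordered pair (w1, w2) of the given words, collect the
    distances (in tokens) at which w2 follows w1 within the window."""
    wordset = set(words)
    distances = {}
    for i, w1 in enumerate(tokens):
        if w1 not in wordset:
            continue
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            if tokens[j] in wordset:
                distances.setdefault((w1, tokens[j]), []).append(d)
    return distances


def profile(text, words=TOP_WORDS, window=5):
    """Profile = (frequency, mean distance, variance of distance) per pair."""
    dists = pair_distances(tokenize(text), words, window)
    return {pair: (len(ds), mean(ds), pvariance(ds)) for pair, ds in dists.items()}


def profile_distance(p, q):
    """Assumed matching rule: sum of absolute differences in mean and
    variance over all pairs; a pair missing from one profile counts as 0."""
    total = 0.0
    for pair in set(p) | set(q):
        _, m1, v1 = p.get(pair, (0, 0.0, 0.0))
        _, m2, v2 = q.get(pair, (0, 0.0, 0.0))
        total += abs(m1 - m2) + abs(v1 - v2)
    return total


def closest_author(test_text, author_texts, words=TOP_WORDS, window=5):
    """Return the author whose profile is nearest to the test document's."""
    test_profile = profile(test_text, words, window)
    profiles = {a: profile(t, words, window) for a, t in author_texts.items()}
    return min(profiles, key=lambda a: profile_distance(test_profile, profiles[a]))


if __name__ == "__main__":
    known = {"A": "the cat and the dog", "B": "the big cat and the dog"}
    print(closest_author("the cat and the dog", known))
```

Frequency is kept in the profile but left out of the distance here, because raw counts depend on document length; a real system would need to decide how to normalise it.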
5, “NA”
1. 50 most frequent words for each file after removing stopwords
2. Determining frequency by paragraphs for these 50 words
3. Selecting (sequences of) paragraphs with fewer similarities (10)
longest sequence of paragraphs that
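One possible reading of the three steps for this task is sketched below. It is a hedged illustration only: the stopword list is a small stand-in, "similarity" is assumed to mean the number of distinct top words a paragraph contains, and the threshold `min_shared` is a hypothetical parameter (the slides give "(10)" without saying exactly what it measures). Function names are invented for the example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used by the authors.
STOPWORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "it",
    "for", "on", "with", "as", "by", "at", "an", "be", "this", "are",
}


def words_of(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_paragraphs(document):
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def top_words(document, n=50):
    """Step 1: the n most frequent non-stopwords in the whole file."""
    return [w for w, _ in Counter(words_of(document)).most_common(n)]


def paragraph_scores(document, n=50):
    """Step 2: for each paragraph, how many distinct top words it contains."""
    top = set(top_words(document, n))
    return [len(top & set(words_of(p))) for p in split_paragraphs(document)]


def suspicious_paragraphs(document, min_shared, n=50):
    """Step 3: indices of paragraphs sharing fewer than min_shared top words."""
    scores = paragraph_scores(document, n)
    return [i for i, s in enumerate(scores) if s < min_shared]


def group_runs(indices):
    """Merge consecutive paragraph indices into (start, end) sequences."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


if __name__ == "__main__":
    doc = "alpha beta alpha beta gamma\n\nalpha beta gamma alpha\n\ndelta epsilon zeta"
    flagged = suspicious_paragraphs(doc, min_shared=1, n=3)
    print(group_runs(flagged))
```

Counting distinct shared words rather than raw frequencies keeps long paragraphs from looking "similar" simply because they are long; other scoring choices are equally plausible.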
1. Steps 1 and 2 as in Task F
2. Select proper nouns from the top 50
3. Create a cluster and remove from consideration all other linked nouns
4. Where the paragraphs are not allocated
create a new cluster
allocated to the same cluster, (b) paragraphs between different clusters are allocated to the subsequent cluster
5. Results
"right" suspect rather than "all" the possible suspects”.
predators, or 13?)
a.vartapetiance@surrey.ac.uk
l.gillam@surrey.ac.uk
Department of Computing, University of Surrey