Mining E-mail Content for Author Identification Forensics O. de Vel, - PowerPoint PPT Presentation

Mining E-mail Content for Author Identification Forensics
O. de Vel, A. Anderson, M. Corney and G. Mohay
A presentation by Fabian Duffhauß

Reasons for Author Identification of E-mails
• Every day, 200 billion e-mails are sent → 90 % of them are spam
• Misuse of e-mails:
  • Distributing inappropriate messages or documents
  • Sending offensive or threatening material
• Senders try to hide their identity → the goal is to identify the author of e-mail misuse

E-mail Topics and Authors Used in the Experiments
Number of e-mails per topic and author category ACi (i = 1, 2, 3):

Topic Category   Author AC1   Author AC2   Author AC3   Topic Total
Movie            15           21           23           59
Food             12           21           25           58
Travel           3            21           15           39
Author Total     30           63           63           156

• Salutations, reply text, attachments and signatures are removed
• Their existence and position are stored

170 Style Marker Attribute Types
Notation: M = total number of words, V = total number of distinct words, C = total number of characters
• Number of blank lines/total number of lines
• Average sentence length
• Average word length (number of characters)
• Vocabulary richness, i.e., V/M
• Total number of function words/M
• Function word frequency distribution (122 features)
• Total number of short words/M
• Count of hapax legomena/M
• Count of hapax legomena/V
• Total number of characters in words/C
• Total number of alphabetic characters in words/C
• Total number of upper-case characters in words/C
• Total number of digit characters in words/C
• Total number of white-space characters/C
• Total number of space characters/C
• Total number of space characters/number of white-space characters
• Total number of tab spaces/C
• Total number of tab spaces/number of white-space characters
• Total number of punctuation characters/C
• Word length frequency distribution/M (30 features)
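A few of the style markers listed above can be sketched in Python as follows. This is a minimal illustration, not the parser used in the presentation: the tokenisation regex, the tiny `FUNCTION_WORDS` set (the slides use 122 function words that are not listed here) and the short-word threshold of three characters are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list; the slides use 122 words.
FUNCTION_WORDS = {"a", "and", "in", "of", "the", "to"}


def style_markers(body: str, short_word_max_len: int = 3) -> dict:
    """Compute a handful of style markers for one e-mail body."""
    # Assumed tokenisation: lower-cased runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", body.lower())
    counts = Counter(words)
    M = len(words)   # total number of words
    V = len(counts)  # total number of distinct words
    C = len(body)    # total number of characters
    if M == 0 or C == 0:
        raise ValueError("e-mail body contains no words")
    # Hapax legomena: words that occur exactly once in the body.
    hapax = sum(1 for n in counts.values() if n == 1)
    return {
        "avg_word_length": sum(len(w) for w in words) / M,
        "vocabulary_richness": V / M,
        "function_words_per_word": sum(counts[w] for w in FUNCTION_WORDS) / M,
        "short_words_per_word": sum(len(w) <= short_word_max_len for w in words) / M,
        "hapax_per_word": hapax / M,
        "hapax_per_distinct_word": hapax / V,
        "uppercase_per_char": sum(ch.isupper() for ch in body) / C,
        "digits_per_char": sum(ch.isdigit() for ch in body) / C,
        "whitespace_per_char": sum(ch.isspace() for ch in body) / C,
        "tabs_per_char": body.count("\t") / C,
    }


markers = style_markers("The cat and the dog")
print(markers["vocabulary_richness"])  # 4 distinct words out of 5
```

Every value is a ratio, which keeps e-mails of different lengths comparable.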

21 Structural Attribute Types
• Has a greeting acknowledgment
• Uses a farewell acknowledgment
• Contains signature text
• Number of attachments
• Position of requoted text within e-mail body
• HTML tag frequency distribution/total number of HTML tags (16 features)
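The HTML tag frequency distribution could be computed along the following lines. This is only a sketch using Python's standard `html.parser` module; the slides do not say which 16 tags are tracked, so the `tracked_tags` argument and the tags in the usage example are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in an HTML e-mail body."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        self.counts[tag] += 1


def html_tag_distribution(body: str, tracked_tags: list) -> list:
    """Return count(tag) / total number of tags for each tracked tag."""
    parser = TagCounter()
    parser.feed(body)
    parser.close()
    total = sum(parser.counts.values())
    if total == 0:
        return [0.0 for _ in tracked_tags]
    return [parser.counts[tag] / total for tag in tracked_tags]


# Hypothetical tag selection, for illustration only.
features = html_tag_distribution("<p>Hi <b>there</b></p><br>", ["p", "b", "br", "a"])
```

A plain-text e-mail without any tags yields an all-zero vector instead of a division by zero.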

Support Vector Machine Classifier
• Implementation: SVMlight
• An SVM separates objects into two different classes
• Best results were obtained with a polynomial kernel of degree 3
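For reference, a polynomial kernel of degree 3 can be written as below. The slides only state the degree; the scale `s` and offset `c` are assumptions set to 1 here, following the common form K(x, y) = (s·⟨x, y⟩ + c)^d.

```python
def polynomial_kernel(x, y, degree=3, s=1.0, c=1.0):
    """Polynomial kernel K(x, y) = (s * <x, y> + c) ** degree.

    s and c are assumed values; only degree=3 comes from the slides.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (s * dot + c) ** degree


# Dot product 1*3 + 2*4 = 11, so the kernel value is (11 + 1) ** 3.
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))
```

A cubic kernel lets the classifier weigh interactions of up to three attributes without building those combined features explicitly.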

Measuring Units
• C = set of objects that belong to a class
• A = set of objects the classifier has identified as belonging to the class

Recall: $R = \frac{|C \cap A|}{|C|}$
Precision: $P = \frac{|C \cap A|}{|A|}$
F-measure: $F = \frac{2RP}{R + P}$
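These three measures follow directly from the sets C and A. The sketch below is an illustration with made-up e-mail identifiers; returning 0.0 for an empty set is an assumption made to avoid division by zero.

```python
def precision_recall_f(true_members: set, predicted_members: set):
    """Return (precision, recall, F) for one class.

    true_members      -- C, the objects that really belong to the class
    predicted_members -- A, the objects the classifier assigned to the class
    """
    overlap = len(true_members & predicted_members)
    recall = overlap / len(true_members) if true_members else 0.0
    precision = overlap / len(predicted_members) if predicted_members else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return precision, recall, f_measure


# Made-up example: 4 e-mails belong to the class, 3 were predicted, 2 overlap.
p, r, f = precision_recall_f({1, 2, 3, 4}, {3, 4, 5})
# p = 2/3, r = 1/2, f = 4/7
```

F is the harmonic mean of R and P, so a classifier with perfect precision but low recall still receives a moderate F value.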

First Experiment
• Mixed topics
• Stratified 10-fold cross-validation procedure

Style markers and structural features, performance per author category ACi (i = 1, 2, 3):

Statistic   Author AC1   Author AC2   Author AC3
P           100.0 %      83.8 %       93.8 %
R           63.3 %       98.3 %       89.6 %
F           77.6 %       90.5 %       91.6 %

Only style markers, performance per author category ACi (i = 1, 2, 3):

Statistic   Author AC1   Author AC2   Author AC3
P           100.0 %      93.0 %       83.6 %
R           60.0 %       80.3 %       93.3 %
F           75.0 %       86.2 %       88.2 %
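A stratified 10-fold split keeps the proportion of each author roughly equal in every fold. The following is a minimal sketch of one way to build such folds; the round-robin assignment and the fixed seed are assumptions, and the author totals (30, 63, 63) are the ones reported for the corpus.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Return k lists of example indices with a similar class mix in each."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0  # carried across classes so fold sizes stay balanced
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the examples of this class to the folds in round-robin order.
        for position, index in enumerate(indices):
            folds[(offset + position) % k].append(index)
        offset = (offset + len(indices)) % k
    return folds


labels = ["AC1"] * 30 + ["AC2"] * 63 + ["AC3"] * 63
folds = stratified_folds(labels)
for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A classifier would be trained on train_idx and evaluated on test_idx here.
```

Each e-mail is used for testing exactly once, and every fold contains three of the 30 AC1 e-mails.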

Second Experiment
• Training set: e-mails with topic “Movie”
• Style markers and structural features

Categorisation performance results (in %) per author category ACi (i = 1, 2, 3):

              Author AC1              Author AC2              Author AC3
Topic Class   P       R      F        P      R       F        P       R       F
Food          100.0   16.7   28.6     77.8   100.0   87.5     85.2    92.0    88.5
Travel        100.0   33.3   50.0     90.9   100.0   95.2     100.0   100.0   100.0

Third Experiment
• Number of function words: 320 (instead of 122)
• Split into parts-of-speech words and others
• Result: no improvements

PAN-11 Author Identification Training Corpus

Training sets:

Name      Number of Authors   Number of Documents
Large     72                  9337
Small     26                  3001
Verify1   1                   42
Verify2   1                   55
Verify3   1                   47

Validation sets:

Name            Number of Authors   Number of Documents
LargeValid      66                  1298
LargeValid+     86                  1440
SmallValid      23                  518
SmallValid+     43                  601
Verify1Valid+   24                  104
Verify2Valid+   21                  95
Verify3Valid+   23                  100

Live Demonstration
• Parser in C++:
  • Reads a list of function words
  • Reads the e-mail bodies
  • Extracts style marker attributes
  • Creates training and test files
• SVMlight learn:
  • Reads the training file
  • Creates a model
• SVMlight classify:
  • Reads the model and the test file
  • Makes a prediction
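The training and test files use SVMlight's sparse text format, one example per line: a target of +1 or -1 followed by `feature:value` pairs with feature numbers in increasing order. The sketch below writes such lines in Python for brevity; it is not the C++ parser from the demonstration. The function names, the file names and the labelling of one author as +1 against all others as -1 are assumptions.

```python
def to_svmlight_line(target: int, features: list) -> str:
    """Format one example as '<target> <index>:<value> ...'.

    Feature indices start at 1; zero-valued features are omitted,
    which the sparse format permits.
    """
    if target not in (-1, 1):
        raise ValueError("target must be +1 or -1")
    pairs = [f"{i}:{v:g}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{target:+d}"] + pairs)


def write_examples(path: str, examples: list) -> None:
    """Write (target, features) pairs to a file, one example per line."""
    with open(path, "w", encoding="ascii") as handle:
        for target, features in examples:
            handle.write(to_svmlight_line(target, features) + "\n")


# Assumed labelling: +1 for the author in question, -1 for everyone else.
print(to_svmlight_line(1, [0.8, 0.0, 0.25]))  # +1 1:0.8 3:0.25
```

With hypothetical file names, learning and classification would then be run as `svm_learn -t 1 -d 3 train.dat model` followed by `svm_classify test.dat model predictions`, where `-t 1 -d 3` selects a polynomial kernel of degree 3.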
