Vote/Veto Classification, Ensemble Clustering and Sequence Classi - PowerPoint PPT Presentation

Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification
Roman Kern 1,2, Stefan Klampfl 2, Mario Zechner 2
1 Knowledge Management Institute - Graz University of Technology
2 Know-Center
rkern@tugraz.at, {rkern, sklampfl, mzechner}@know-center.at
PAN Workshop @ CLEF 2012 / 2012-09-20

Authorship Attribution - Approach
Vote/Veto Classification
◮ Same as last year ⇒ Compare data-sets
◮ Three different feature sets ⇒ Compare influence of unigram features vs. stylometric features

Authorship Attribution - Classification
Classification Algorithm
◮ Combine feature spaces via individual base classifiers
◮ Based on performance in training phase
◮ In classification phase combine results
Base Feature Spaces
◮ Basic statistics, token statistics, grammar statistics
◮ Stop-word terms, slang terms, pronoun terms
◮ Intro-outro terms, bigram terms, unigram terms, terms
Feature Space Combinations
◮ Terms
◮ Stylometric
◮ Statistics
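The combination step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide only states that base classifiers are combined according to their training-phase performance. The vote and veto rules, the thresholds (`vote_at`, `veto_at`, `min_quality`) and all example numbers below are assumptions made for the sake of the example.

```python
from collections import defaultdict


def vote_veto(posteriors, precision, recall,
              vote_at=0.5, veto_at=0.1, min_quality=0.5):
    """Combine base classifiers (one per feature space) for a single document.

    posteriors[space][author] -> probability assigned by that base classifier
    precision[space][author], recall[space][author] -> training-phase scores
    All thresholds are illustrative assumptions.
    """
    score = defaultdict(float)
    for space, probs in posteriors.items():
        for author, p in probs.items():
            if p >= vote_at and precision[space][author] >= min_quality:
                # Assumed rule: a confident classifier that was precise
                # in training votes for the author.
                score[author] += precision[space][author]
            elif p <= veto_at and recall[space][author] >= min_quality:
                # Assumed rule: a classifier with high training recall
                # that sees almost no evidence vetoes the author.
                score[author] -= recall[space][author]
    return max(score, key=score.get) if score else None


# Hypothetical numbers for two candidate authors and three feature spaces.
posteriors = {
    "terms": {"A": 0.80, "B": 0.20},
    "stylometric": {"A": 0.05, "B": 0.60},
    "statistics": {"A": 0.70, "B": 0.30},
}
precision = {
    "terms": {"A": 0.9, "B": 0.9},
    "stylometric": {"A": 0.4, "B": 0.4},
    "statistics": {"A": 0.6, "B": 0.6},
}
recall = {
    "terms": {"A": 0.8, "B": 0.8},
    "stylometric": {"A": 0.3, "B": 0.3},
    "statistics": {"A": 0.7, "B": 0.7},
}
best_author = vote_veto(posteriors, precision, recall)
```

In this toy setting the stylometric classifier performed poorly in training, so it neither votes nor vetoes, and the decision rests on the other two feature spaces.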

Authorship Attribution - Data-Sets

Basic Statistics
#  PAN 2011                  PAN 2012
1  Paragraph to lines ratio  Number of characters
2  Text to lines ratio       Number of words
3  Number of lines           Number of lines
4  Empty lines ratio         Number of stop-words
5  Number of paragraphs      Number of tokens

Token Statistics
#  PAN 2011                              PAN 2012
1  Likelihood of proper nouns            Number of tokens
2  Number of tokens                      Likelihood of proper nouns
3  Average token length                  Average verb length
4  Likelihood of infrequent word groups  Average token length
5  Likelihood of tokens of length 9      Likelihood of pronouns

Authorship Attribution - Feature Types
Comparison of configurations
[Figure: bar chart comparing the Terms, Statistics and Stylometric configurations; value axis from 0 to 70]

Authorship Clustering - Approach
Ensemble Clustering
◮ Multi-tier clustering
◮ Combine output of base clusters
◮ Only use stylometric features
Ensemble clustering is also known as consensus clustering or clustering aggregation.

Authorship Clustering - Features
Multiple feature spaces
◮ Basic statistics (same as for authorship attribution)
◮ Stylometric features (hapax-legomena, hapax-dislegomena, yules-k, simpsons-d, brunets-w, sichels-s, honores-h, ...)
◮ Stem-suffixes, stop-words, pronouns
◮ Character 1-grams, 2-grams, 3-grams
⇒ Total of 7 feature spaces
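A few of the stylometric measures named above can be computed directly from token frequencies. The sketch below uses the standard textbook definitions of hapax legomena, hapax dislegomena, Yule's K and Simpson's D. The function name and the whitespace tokenisation are assumptions, and the authors' exact variants and normalisation may differ.

```python
from collections import Counter


def stylometric_features(tokens):
    """Vocabulary-richness measures for a token list with at least 2 tokens."""
    n = len(tokens)
    freq = Counter(tokens)
    # spectrum[i] = number of distinct word types that occur exactly i times
    spectrum = Counter(freq.values())
    yules_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    simpsons_d = sum(v * i * (i - 1) for i, v in spectrum.items()) / (n * (n - 1))
    return {
        "hapax-legomena": spectrum[1],     # types occurring once
        "hapax-dislegomena": spectrum[2],  # types occurring twice
        "yules-k": yules_k,
        "simpsons-d": simpsons_d,
    }


features = stylometric_features("a b b c c c".split())
```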

Authorship Clustering - Clustering
Base clustering
◮ k-means clustering
◮ k-means++ seed selection
◮ Different relatedness measures for different feature spaces
  ◮ Cosine similarity
  ◮ Euclidean distance (after normalising the features)
Ensemble clustering
◮ Create a meta-space from the individual clustering solutions
◮ In meta-space the distance between instances depends on the agreement of the clustering solutions
◮ Give different base clusters different weight
◮ k-means clustering
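One way to realise such a meta-space is a weighted one-hot encoding of every base clustering, so that the squared Euclidean distance between two documents equals the summed weight of the base clusterings that put them in different clusters. The sketch below is an assumption about how this could look, not the authors' code. The encoding, the weights and the deterministic farthest-point seeding (a simplified stand-in for k-means++) are all illustrative choices.

```python
def meta_space(base_labelings, weights):
    """One row per document: weighted one-hot encoding of each base clustering."""
    n = len(base_labelings[0])
    rows = [[] for _ in range(n)]
    for labels, w in zip(base_labelings, weights):
        clusters = sorted(set(labels))
        scale = (w / 2) ** 0.5  # a disagreement then adds exactly w to the squared distance
        for i, lab in enumerate(labels):
            rows[i].extend(scale if lab == c else 0.0 for c in clusters)
    return rows


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding."""
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sq_dist(r, c) for c in centers)))
    assign = [0] * len(rows)
    for _ in range(iterations):
        # assignment step: nearest center for every document
        assign = [min(range(k), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        # update step: each center becomes the mean of its members
        for j in range(k):
            members = [r for r, a in zip(rows, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Three hypothetical base clusterings of four documents, with assumed weights.
base_labelings = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
weights = [1.0, 0.5, 1.0]
rows = meta_space(base_labelings, weights)
ensemble_labels = kmeans(rows, k=2)
```

Documents 0 and 2 are separated by the first and third base clustering, so their squared meta-space distance is 1.0 + 1.0 = 2.0, while documents 2 and 3 disagree only in the lower-weighted second clustering.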

Authorship Clustering - Evaluation
Ensemble clustering results

Feature Space          A vs B  C vs D  E vs F
1-grams                51.52%  53.98%  61.87%
2-grams                50.91%  54.46%  56.70%
3-grams                50.91%  51.33%  52.37%
Stop-Words & Pronouns  62.20%  50.72%  72.91%
Stem Suffixes          65.85%  63.01%  54.61%
Stylometry             52.74%  59.76%  64.25%
Basic Statistics       57.01%  56.87%  65.22%
Ensemble               66.10%  80.34%  78.44%

Sexual Predator Identification - Approach
Sequence classification
◮ Do not classify predators directly
◮ Classify individual messages/lines in chats
◮ Simple features

Sexual Predator Identification - Classes
Chat message classes/labels
◮ normal, predator, offending, reaction, post-offending

Line  Chat #1         Chat #2
1     normal          normal
2     predator        predator
3     normal          normal
4     normal          normal
5     offending       predator
6     reaction        normal
7     post-offending  predator
8     post-offending  predator
9     reaction        normal
10    reaction

[Figure: the two example chats are shown side by side; rotated side labels read "pre" and "post"]

Sexual Predator Identification - Features
Simple features
◮ Unigrams
◮ Double Metaphone
◮ isInitialAuthor, isLastAuthor, isMostVerboseAuthor, isFewerAuthors, hasTermFromPrevious
Classification algorithm
◮ Maximum entropy & beam search
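The decoding side of such a sequence classifier can be illustrated with a small beam search over the message labels. This is a sketch under stated assumptions: `log_prob` stands in for a trained maximum entropy model, `toy_log_prob` and its keyword rule are purely hypothetical, and the beam width is an arbitrary choice. None of these names come from the slides.

```python
import math

LABELS = ["normal", "predator", "offending", "reaction", "post-offending"]


def beam_search(messages, log_prob, beam_width=3):
    """Return the best label sequence found for the messages of one chat.

    log_prob(message, previous_label) -> {label: log probability}
    """
    beam = [(0.0, [])]  # (cumulative log probability, labels so far)
    for message in messages:
        candidates = []
        for score, labels in beam:
            previous = labels[-1] if labels else None
            for label, lp in log_prob(message, previous).items():
                candidates.append((score + lp, labels + [label]))
        # keep only the best `beam_width` partial sequences
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beam[0][1]


def toy_log_prob(message, previous_label):
    """Hypothetical stand-in model: a single keyword triggers 'predator'."""
    likely = "predator" if "secret" in message.split() else "normal"
    return {label: math.log(0.8 if label == likely else 0.05) for label in LABELS}


chat = ["hi", "keep it secret", "ok"]
labels = beam_search(chat, toy_log_prob)
```

Passing the previous label into the model is what makes this a sequence classifier rather than an independent per-message classifier; the toy model ignores it, a real model would condition on it.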

Sexual Predator Identification - Training
[Figure: slide content not preserved in the extracted text]

Sexual Predator Identification - Results

Class               Count  Precision  Recall
normal              3,117  0.955      0.995
predator            29     0.3        0.103
offending           52     0          0
post-offending      216    0.871      0.847
reaction            275    0.959      0.764
Identify predators  2      0.667      1
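For reference, per-class precision and recall of the kind reported above can be computed from gold and predicted message labels as sketched here. The function name and the toy label lists are assumptions for illustration and are unrelated to the evaluation data behind the table.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall of `label`, given parallel gold/predicted lists."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy labels.
gold = ["normal", "predator", "normal", "predator"]
predicted = ["normal", "normal", "normal", "predator"]
predator_scores = precision_recall(gold, predicted, "predator")
normal_scores = precision_recall(gold, predicted, "normal")
```

A class that is never predicted gets precision and recall of 0 under this convention.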

The End
Thank you!
Open-source code: https://www.knowminer.at/svn/opensource/projects/pan2012/trunk
Corresponding Author: Roman Kern <rkern@tugraz.at>
