
Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification

Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification. Roman Kern (1, 2), Stefan Klampfl (2), Mario Zechner (2). (1) Knowledge Management Institute, Graz University of Technology; (2) Know-Center.


  1. Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification
     Roman Kern (1, 2), Stefan Klampfl (2), Mario Zechner (2)
     (1) Knowledge Management Institute, Graz University of Technology
     (2) Know-Center
     rkern@tugraz.at, {rkern, sklampfl, mzechner}@know-center.at
     PAN Workshop @ CLEF 2012 / 2012-09-20
     Graz University of Technology

  2. Authorship Attribution - Approach
     Vote/Veto Classification
     ◮ Same as last year ⇒ Compare data-sets
     ◮ Three different feature sets ⇒ Compare influence of unigram features vs. stylometric features

  3. Authorship Attribution - Classification
     Classification Algorithm
     ◮ Combine feature-spaces via individual base classifiers
     ◮ Based on performance in training phase
     ◮ In classification phase combine results
     Base Feature Spaces
     ◮ Basic statistics, token statistics, grammar statistics
     ◮ Stop-word terms, slang terms, pronoun terms
     ◮ Intro-outro terms, bigram terms, unigram terms, terms
     Feature Space Combinations
     ◮ Terms
     ◮ Stylometric
     ◮ Statistics
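The slide states only that per-feature-space base classifiers are combined according to their training performance. The following is a minimal sketch of one way such a vote/veto combination could work, not the authors' implementation. The function name `combine_vote_veto`, the two thresholds, the use of per-class precision to gate votes and per-class recall to gate vetoes, and the toy numbers are all assumptions made for illustration.

```python
from collections import defaultdict


def combine_vote_veto(predictions, precision, recall,
                      vote_threshold=0.5, veto_threshold=0.3):
    """Combine base-classifier outputs into a single author label.

    predictions: {classifier: {author: probability}} for one document.
    precision, recall: {classifier: {author: value}} measured in training.
    The thresholds and the weighting scheme are assumptions of this sketch.
    """
    score = defaultdict(float)
    for name, probs in predictions.items():
        best = max(probs, key=probs.get)
        for author, p in probs.items():
            if author == best:
                # Vote: only classifiers that were precise for this author count.
                if precision[name].get(author, 0.0) >= vote_threshold:
                    score[author] += precision[name][author] * p
            elif recall[name].get(author, 0.0) >= veto_threshold:
                # Veto: a classifier that usually finds this author did not pick it.
                score[author] -= recall[name][author] * (1.0 - p)
    if not score:
        return None
    return max(score, key=score.get)


# Toy example with two hypothetical feature spaces and two authors.
predictions = {
    "terms": {"alice": 0.7, "bob": 0.3},
    "statistics": {"alice": 0.4, "bob": 0.6},
}
precision = {
    "terms": {"alice": 0.9, "bob": 0.8},
    "statistics": {"alice": 0.5, "bob": 0.4},
}
recall = {
    "terms": {"alice": 0.8, "bob": 0.7},
    "statistics": {"alice": 0.2, "bob": 0.3},
}
winner = combine_vote_veto(predictions, precision, recall)
```

In this toy run the "statistics" classifier prefers bob, but its training precision for bob is below the vote threshold, so its vote is ignored and the "terms" classifier decides.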

  4. Authorship Attribution - Data-Sets
     Basic Statistics

     | # | PAN 2011                 | PAN 2012             |
     |---|--------------------------|----------------------|
     | 1 | Paragraph to lines ratio | Number of characters |
     | 2 | Text to lines ratio      | Number of words      |
     | 3 | Number of lines          | Number of lines      |
     | 4 | Empty lines ratio        | Number of stop-words |
     | 5 | Number of paragraphs     | Number of tokens     |

     Token Statistics

     | # | PAN 2011                             | PAN 2012                   |
     |---|--------------------------------------|----------------------------|
     | 1 | Likelihood of proper nouns           | Number of tokens           |
     | 2 | Number of tokens                     | Likelihood of proper nouns |
     | 3 | Average token length                 | Average verb length        |
     | 4 | Likelihood of infrequent word groups | Average token length       |
     | 5 | Likelihood of tokens of length 9     | Likelihood of pronouns     |

  5. Authorship Attribution - Feature Types
     Comparison of configurations
     [Figure: bar chart comparing the Terms, Statistics and Stylometric configurations; vertical axis from 0 to 70]

  6. Authorship Clustering - Approach
     Ensemble Clustering
     ◮ Multi-tier clustering
     ◮ Combine output of base clusters
     ◮ Only use stylometric features
     Ensemble clustering is also known as consensus clustering or clustering aggregation

  7. Authorship Clustering - Features
     Multiple feature spaces
     ◮ Basic statistics (same as for authorship attribution)
     ◮ Stylometric features (hapax-legomena, hapax-dislegomena, yules-k, simpsons-d, brunets-w, sichels-s, honores-h, ...)
     ◮ Stem-suffixes, stop-words, pronouns
     ◮ Character 1-grams, 2-grams, 3-grams
     ⇒ Total of 7 feature spaces
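Several of the stylometric measures named on this slide are functions of the frequency spectrum of a text, that is, how many word types occur exactly once, twice, and so on. The sketch below computes a few of them using their common textbook definitions; whether the authors used exactly these variants is an assumption, and the function name `stylometric_features` and the whitespace tokenisation in the example are illustrative only.

```python
import math
from collections import Counter


def stylometric_features(tokens):
    """Compute a few vocabulary-richness measures from a token list."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    freqs = Counter(tokens)
    v = len(freqs)                      # vocabulary size (number of types)
    spectrum = Counter(freqs.values())  # spectrum[i] = types occurring i times
    v1 = spectrum.get(1, 0)             # hapax legomena
    v2 = spectrum.get(2, 0)             # hapax dislegomena
    yules_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    simpsons_d = sum(vi * i * (i - 1) for i, vi in spectrum.items()) / (n * (n - 1))
    # Honore's H is undefined when every type is a hapax legomenon.
    honores_h = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")
    return {
        "hapax_legomena": v1,
        "hapax_dislegomena": v2,
        "yules_k": yules_k,
        "simpsons_d": simpsons_d,
        "sichels_s": v2 / v,
        "honores_h": honores_h,
    }


features = stylometric_features("a a a b b c".split())
```

For the six-token example, the spectrum is one type each with frequency 1, 2 and 3, so Simpson's D is 8/30 and Sichel's S is 1/3.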

  8. Authorship Clustering - Clustering
     Base clustering
     ◮ k-means clustering
     ◮ k-means++ seed selection
     ◮ Different relatedness measures for different feature spaces
       ◮ Cosine similarity
       ◮ Euclidean distance (after normalising the features)
     Ensemble clustering
     ◮ Create a meta-space from the individual clustering solutions
     ◮ In meta-space the distance between instances depends on the agreement of the clustering solutions
     ◮ Give different base clusters different weight
     ◮ k-means clustering
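One common way to build a meta-space in which distance reflects agreement between clustering solutions is a weighted co-association matrix. The sketch below assumes that construction; the slide does not say how the authors define the meta-space or choose the weights, so the function name `meta_space_distances`, the distance formula and the example weights are hypothetical. The final k-means step over the meta-space is not shown.

```python
def meta_space_distances(clusterings, weights):
    """Pairwise distances derived from agreement between base clusterings.

    clusterings: list of label lists, one per base clustering, all of length n.
    weights: one non-negative weight per base clustering.
    Distance is 1 minus the weighted share of clusterings that put both
    instances in the same cluster (an assumed definition for this sketch).
    """
    n = len(clusterings[0])
    total = sum(weights)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            agree = sum(w for labels, w in zip(clusterings, weights)
                        if labels[i] == labels[j])
            dist[i][j] = dist[j][i] = 1.0 - agree / total
    return dist


# Three hypothetical base clusterings of four paragraphs; the first one
# is trusted twice as much as the others.
base = [[0, 0, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 0, 1]]
distances = meta_space_distances(base, weights=[2.0, 1.0, 1.0])
```

Each row of the resulting matrix can serve as the meta-space representation of one instance, which a final clustering step then groups.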

  9. Authorship Clustering - Evaluation
     Ensemble clustering results

     | Feature Space         | A vs B | C vs D | E vs F |
     |-----------------------|--------|--------|--------|
     | 1-grams               | 51.52% | 53.98% | 61.87% |
     | 2-grams               | 50.91% | 54.46% | 56.70% |
     | 3-grams               | 50.91% | 51.33% | 52.37% |
     | Stop-Words & Pronouns | 62.20% | 50.72% | 72.91% |
     | Stem Suffixes         | 65.85% | 63.01% | 54.61% |
     | Stylometry            | 52.74% | 59.76% | 64.25% |
     | Basic Statistics      | 57.01% | 56.87% | 65.22% |
     | Ensemble              | 66.10% | 80.34% | 78.44% |

  10. Sexual Predator Identification - Approach
      Sequence classification
      ◮ Do not classify predators directly
      ◮ Classify individual messages/lines in chats
      ◮ Simple features

  11. Sexual Predator Identification - Classes
      Chat message classes/labels
      ◮ normal, predator, offending, reaction, post-offending

      | Line | Chat #1        | Chat #2  |
      |------|----------------|----------|
      | 1    | normal         | normal   |
      | 2    | predator       | predator |
      | 3    | normal         | normal   |
      | 4    | normal         | normal   |
      | 5    | offending      | predator |
      | 6    | reaction       | normal   |
      | 7    | post-offending | predator |
      | 8    | post-offending | predator |
      | 9    | reaction       | normal   |
      | 10   | reaction       |          |

  12. Sexual Predator Identification - Features
      Simple features
      ◮ Unigrams
      ◮ Double Metaphone
      ◮ isInitialAuthor, isLastAuthor, isMostVerboseAuthor, isFewerAuthors, hasTermFromPrevious
      Classification algorithm
      ◮ Maximum entropy & beam search
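Beam search decodes a label sequence by keeping only the best few partial labelings after each message. The sketch below illustrates the idea for chat lines; `toy_score` is a hypothetical stand-in for a trained maximum entropy model, and its probabilities, the reduced label set and the beam width are invented for illustration rather than taken from the authors' system.

```python
import math


def beam_search(messages, score_fn, beam_width=3):
    """Return the most probable label sequence found by beam search.

    score_fn(messages, i, prev_label) must return {label: probability}
    for message i given the previous label (None for the first message).
    """
    beam = [(0.0, [])]  # (log-probability, labels so far)
    for i in range(len(messages)):
        candidates = []
        for logp, labels in beam:
            prev = labels[-1] if labels else None
            for label, p in score_fn(messages, i, prev).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), labels + [label]))
        # Keep only the beam_width best partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0][1] if beam else []


def toy_score(messages, i, prev):
    """Hypothetical scorer; a real system would query the trained model."""
    if "secret" in messages[i].lower():
        return {"predator": 0.8, "normal": 0.2}
    if prev == "predator":
        return {"normal": 0.6, "predator": 0.4}
    return {"normal": 0.9, "predator": 0.1}


chat = ["hi", "keep this secret", "ok"]
labels = beam_search(chat, toy_score)
```

Because the scorer sees the previous label, a line's label can depend on its context, which is the reason for decoding the chat as a sequence instead of classifying each line in isolation.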

  13. Sexual Predator Identification - Training

  14. Sexual Predator Identification - Results

      | Class              | Count | Precision | Recall |
      |--------------------|-------|-----------|--------|
      | normal             | 3,117 | 0.955     | 0.995  |
      | predator           | 29    | 0.3       | 0.103  |
      | offending          | 52    | 0         | 0      |
      | post-offending     | 216   | 0.871     | 0.847  |
      | reaction           | 275   | 0.959     | 0.764  |
      | Identify predators | 2     | 0.667     | 1      |
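Precision is the share of predicted items that are correct, and recall is the share of true items that were found. As a worked example, the "Identify predators" row is consistent with three authors being flagged, two of them correctly, and no true predator missed. These counts are inferred from the reported values and are not stated on the slide.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and
    false-negative counts; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Assumed counts: 2 true predators, all found, plus 1 false alarm.
p, r = precision_recall(tp=2, fp=1, fn=0)
print(round(p, 3), r)  # prints "0.667 1.0"
```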

  15. The End
      Thank you!
      Open-source code: https://www.knowminer.at/svn/opensource/projects/pan2012/trunk
      Corresponding Author: Roman Kern <rkern@tugraz.at>
