Graph-based and Lexical-Syntactic Approaches for the Authorship Attribution Task


  1. Graph-based and Lexical-Syntactic Approaches for the Authorship Attribution Task. Notebook for PAN at CLEF 2012. Esteban Castillo, Darnes Vilariño, David Pinto, Iván Olmos, Jesús A. González and Maya Carrillo. Universidad Autónoma de Puebla, BUAP NLP. September 12, 2012.

  2. Index
     • Introduction
     • Proposed approaches
     • Experimental settings and results
     • Conclusion

  3. Traditional Authorship Attribution
     • Authorship attribution assumes that texts carry unique and identifiable writeprints.
     • Finding the right features to characterize the signature, or particular writing style, of a given author is fundamental.

  4. Lexical-syntactic approach: features
     1 Phrase level features
       • Word prefixes ⋄ e.g. ad → { advance, adjunct, adulterate }
       • Word suffixes ⋄ e.g. est → { finest, toughest, biggest }
       • Stopwords ⋄ e.g. { and, the, but, did }
       • Trigrams of PoS ⋄ e.g. she:PRP drove:VBD a:DT silver:NN pt:NN cruiser:NN → { (PRP, VBD, DT), (VBD, DT, NN), (DT, NN, NN), (NN, NN, NN) }
     2 Character level features
       • Vowel combination ⋄ e.g. influential → iueia → iuea
       • Vowel permutation ⋄ e.g. influential → iueia
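
The feature examples on this slide can be reproduced with a few small helpers. This is a minimal sketch, not the authors' implementation: the function names, the affix lengths (2 for prefixes, 3 for suffixes) and the reading of "vowel combination" as "vowel sequence with repeats removed" are assumptions inferred only from the examples shown.

```python
from collections import Counter

VOWELS = set("aeiou")


def vowel_permutation(word):
    """Vowels of the word in their original order, e.g. influential -> iueia."""
    return "".join(ch for ch in word.lower() if ch in VOWELS)


def vowel_combination(word):
    """Vowel sequence with repeated vowels dropped (assumed reading of the slide)."""
    seen = []
    for ch in vowel_permutation(word):
        if ch not in seen:
            seen.append(ch)
    return "".join(seen)


def pos_trigrams(tags):
    """Sliding window of three consecutive PoS tags."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]


def prefix(word, n=2):
    """First n characters; n=2 is a hypothetical default, not given in the slides."""
    return word[:n].lower()


def suffix(word, n=3):
    """Last n characters; n=3 is a hypothetical default, not given in the slides."""
    return word[-n:].lower()


def stopword_counts(tokens, stopwords):
    """Frequency of each stopword that occurs in the token list."""
    return Counter(t.lower() for t in tokens if t.lower() in stopwords)


tags = ["PRP", "VBD", "DT", "NN", "NN", "NN"]  # tags from the slide's example
print(vowel_permutation("influential"))  # iueia
print(vowel_combination("influential"))  # iuea
print(pos_trigrams(tags))
print(prefix("advance"), suffix("finest"))  # ad est
```

Passing the six tags of the slide's example sentence yields the same four trigrams listed above.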

  5. Lexical-syntactic approach: text representation
     • Training stage: $(\underbrace{x_1, x_2, x_3, \ldots, x_s}_{\text{Feature } 1}, \cdots, \underbrace{y_1, y_2, y_3, \ldots, y_m}_{\text{Feature } n}, C)$
     • Testing stage: $(\underbrace{x_1, x_2, x_3, \ldots, x_s}_{\text{Feature } 1}, \cdots, \underbrace{y_1, y_2, y_3, \ldots, y_m}_{\text{Feature } n})$
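
One way to picture this representation is as a concatenation of one block of values per feature type, with the class appended only for training documents. The sketch below is illustrative: it assumes that each value is a raw frequency and that C is the author label, neither of which is stated on the slide, and `build_row` and the vocabularies are hypothetical names.

```python
from collections import Counter


def build_row(feature_counts, vocabularies, label=None):
    """Concatenate one block of counts per feature type.

    feature_counts: feature name -> Counter of observed items in one document.
    vocabularies:   feature name -> ordered list of items (fixes the column order).
    label:          class value appended for training rows, omitted for test rows.
    """
    row = []
    for name, vocab in vocabularies.items():
        counts = feature_counts.get(name, Counter())
        row.extend(counts.get(item, 0) for item in vocab)
    if label is not None:
        row.append(label)
    return row


# Hypothetical vocabularies and document counts, for illustration only.
vocabularies = {"prefix": ["ad", "in"], "stopword": ["and", "the", "but"]}
doc = {"prefix": Counter({"ad": 2}), "stopword": Counter({"the": 3, "and": 1})}

print(build_row(doc, vocabularies, label="author_1"))  # training row
print(build_row(doc, vocabularies))                    # test row
```

Fixing the vocabulary order up front keeps the columns of training and test rows aligned, which any classifier operating on such rows would need.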

  6. Lexical-syntactic approach: Classification process
     [Diagram with the elements: Training, Feature extraction, Classification algorithm, Classification model, Test, Result]

  7. Graph-based approach: features
     • This approach uses a graph-based representation of the text.
     • Each text paragraph is annotated with PoS tags using the TreeTagger tool.
     • Each word is stemmed using the Porter stemmer.
     • In the graph representation, each vertex is a stemmed word and each edge is labeled with the corresponding PoS tag.
     • The word sequence of each paragraph is preserved.
     • Once each paragraph is represented as a graph, we apply a data mining algorithm called SUBDUE to find the most representative words of an author.
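
The graph construction described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it takes already stemmed and tagged tokens as input (standing in for the Porter stemmer and TreeTagger output), links consecutive words with a directed edge, and labels each edge with the PoS tag of its source word. The slide does not say which word's tag labels an edge, so that choice is a guess, and `build_graph` is a hypothetical name.

```python
def build_graph(tagged_stems):
    """Build a word graph from a sequence of (stem, pos_tag) pairs.

    Returns (vertices, edges): vertices are unique stems in order of first
    appearance; edges are (source, target, label) triples between consecutive
    words, which preserves the word sequence of the paragraph.
    """
    vertices = []
    for stem, _ in tagged_stems:
        if stem not in vertices:
            vertices.append(stem)

    edges = []
    for (src, src_tag), (dst, _) in zip(tagged_stems, tagged_stems[1:]):
        # Assumption: the edge carries the tag of the word it leaves.
        edges.append((src, dst, src_tag))
    return vertices, edges


# Tokens and tags taken from the PoS example in the lexical-syntactic slide;
# the tokens are used as-is in place of real stemmer output.
sentence = [("she", "PRP"), ("drove", "VBD"), ("a", "DT"),
            ("silver", "NN"), ("pt", "NN"), ("cruiser", "NN")]

vertices, edges = build_graph(sentence)
print(vertices)
for edge in edges:
    print(edge)
```

A repeated word maps to a single vertex, so recurring words create shared structure that a subgraph miner such as SUBDUE could pick up.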

  8. Graph-based approach: example
     • “second qualifier long road leading 1998 world cup”.

  9. Graph-based approach: text representation
     • Training stage: $D = (\underbrace{x_1, x_2, x_3, \ldots, x_n}_{\text{Words obtained from SUBDUE}}, C)$
     • Testing stage: $D = (\underbrace{x_1, x_2, x_3, \ldots, x_n}_{\text{Words obtained from SUBDUE}})$

  10. Graph-based approach: Classification process
      [Diagram with the elements: Training, Classification algorithm, Classification model, Test, Result]

  11. Experimental settings
      • For SUBDUE, we extract the 30 most representative words.
      • For problems A, B, C, D, I and J, we used WEKA’s implementation of SVMs.
        • Kernel = polynomial mapping
      • For problems E and F, we used WEKA’s implementation of the K-means clustering method.
        • K = 2, 3 or 4 authors

  12. Results

      Results obtained in the traditional sub-task (correct / %)
      Approach             A          B       C        D          I           J
      Graph-based          5/83.333   6/60    5/62.5   4/23.529   8/57.142    13/81.25
      Lexical-syntactic    4/66.666   3/30    2/25     6/35.294   10/71.428   7/43.75

      Results obtained in the clustering sub-task (correct / %)
      Approach             E           F
      Graph-based          68/75.555   43/53.75
      Lexical-syntactic    61/67.777   51/63.75
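
As a quick arithmetic check on the results above, dividing each correct count by its percentage recovers the number of test items per problem. The slide does not state these sizes, so they are inferred here, and `implied_total` is a hypothetical helper name.

```python
# (correct, percent) pairs per problem: graph-based first, lexical-syntactic second.
results = {
    "A": [(5, 83.333), (4, 66.666)],
    "B": [(6, 60.0), (3, 30.0)],
    "C": [(5, 62.5), (2, 25.0)],
    "D": [(4, 23.529), (6, 35.294)],
    "I": [(8, 57.142), (10, 71.428)],
    "J": [(13, 81.25), (7, 43.75)],
    "E": [(68, 75.555), (61, 67.777)],
    "F": [(43, 53.75), (51, 63.75)],
}


def implied_total(correct, percent):
    """Number of test items implied by a correct count and its percentage."""
    return round(correct * 100 / percent)


totals = {}
for problem, pairs in results.items():
    sizes = {implied_total(correct, percent) for correct, percent in pairs}
    if len(sizes) != 1:
        raise ValueError(f"inconsistent totals for problem {problem}: {sizes}")
    totals[problem] = sizes.pop()

print(totals)
```

Both approaches imply the same size for every problem, so the reported counts and percentages are mutually consistent.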

  13. Concluding remarks
      1 Lessons learned
        • The lexical-syntactic features helped to represent the writing style.
        • The graph-based representation obtained better performance than the lexical-syntactic one on five of the eight problems (A, B, C, E and J). However, more investigation of the graph representation is still required.
      2 Current work
        • Other data sets and tasks
        • More lexical-syntactic features to design and use
        • A better understanding of the role of the graph representation
        • Experiments with different graph-based text representations that allow us to obtain much more complex patterns

  14. Thank you for your attention!
