Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference


  1. Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference
     Alberto Barrón Cedeño and Paolo Rosso, Universidad Politécnica de Valencia
     July 2008. LM for plagiarism detection, PAN’08, Patras, Greece

  2. Overview
     • Introduction
     • LM approach
     • Experiments
     • Discussion
     • Conclusions

  4. Introduction
     Plagiarise: to rob another person of the credit for their work; in text, it means including fragments from an author without giving the corresponding credit.
     In this work we describe our first attempt to detect plagiarised fragments in a text using statistical Language Models (LMs) and perplexity.

  6. Introduction
     1. Intrinsic plagiarism analysis [Meyer zu Eissen and Stein, 2006, 2007]
        • No reference corpus is exploited
        • Idea: search for variations (syntax, grammatical categories or text complexity) throughout the suspicious text
     2. Plagiarism analysis with reference [Si et al., 1997; Iyer and Singh, 2005]
        • A reference corpus of original documents is needed
        • Idea: compare fragments from the suspicious document with the original documents in the reference corpus

  7. Introduction
     We are interested in the second approach, but...
     Usually: the reference corpus is made up of original documents.

  8. Introduction
     We are interested in the second approach, but...
     Here: the reference corpus is made up of texts written by the author of the suspicious document.

  10. Introduction
      Statistical Language Model (LM): an LM “tries to predict a word given the previous words” [Manning and Schütze, 2000].
      Ideal calculation:
      $$P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \cdots w_{n-1})$$
      n-gram approach (case n = 3):
      $$P_3(W) = P(w_{n-2}) \cdot P(w_{n-1} \mid w_{n-2}) \cdot P(w_n \mid w_{n-2} w_{n-1})$$
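
The trigram factorisation can be illustrated with a minimal sketch. It assumes plain relative-frequency (maximum-likelihood) estimates from raw counts; the slides do not state how the probabilities were estimated, and the function names `ngram_counts` and `trigram_probability` are illustrative only.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def trigram_probability(tokens, w1, w2, w3):
    """Estimate P(w1) * P(w2 | w1) * P(w3 | w1 w2) by relative frequency.

    Assumption: unsmoothed maximum-likelihood estimates, so any trigram
    that never occurs in `tokens` gets probability 0.
    """
    uni = ngram_counts(tokens, 1)
    bi = ngram_counts(tokens, 2)
    tri = ngram_counts(tokens, 3)
    if tri[(w1, w2, w3)] == 0:
        return 0.0
    p_w1 = uni[(w1,)] / len(tokens)
    p_w2_given_w1 = bi[(w1, w2)] / uni[(w1,)]
    p_w3_given_w1_w2 = tri[(w1, w2, w3)] / bi[(w1, w2)]
    return p_w1 * p_w2_given_w1 * p_w3_given_w1_w2


corpus = "the cat sat on the mat the cat ran".split()
p_seen = trigram_probability(corpus, "the", "cat", "sat")
p_unseen = trigram_probability(corpus, "cat", "the", "mat")
```

Unsmoothed estimates assign zero probability to unseen trigrams, which is why practical LMs usually add some form of smoothing.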

  11. LM approach
      Basic idea
      • Compute the probability of n-grams in a corpus of texts from one author (a representation of vocabulary, grammatical frequency and writing style)
      • Compare these representations with other texts in order to look for plagiarism candidates

  13. LM approach
      Is a fragment f a plagiarism candidate?
      • Determine whether a text is similar to another one based on perplexity, which is frequently used to evaluate how well an LM describes a language (here, “our author language”):
        $$PP_2 = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
      • The lower the perplexity of a text, the more predictable its words are. In other words, the higher the perplexity, the greater the uncertainty about the following word in a sentence.
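
A minimal sketch of the bigram perplexity above, computed in log space for numerical stability. Add-one smoothing, the `<s>` start marker and the function names are assumptions made for illustration; the slides do not specify the smoothing or toolkit used.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Collect unigram and bigram counts from tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent  # "<s>" marks the sentence start (assumption)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_perplexity(sentence, unigrams, bigrams):
    """PP_2 = (prod_i 1 / P(w_i | w_{i-1}))^(1/N) for one tokenised sentence."""
    if not sentence:
        raise ValueError("perplexity is undefined for an empty sentence")
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen words
    padded = ["<s>"] + sentence
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        # Add-one smoothing (assumption): unseen bigrams keep a small probability.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(sentence))


uni, bi = train_bigram_lm([["a", "b"], ["a", "b"]])
pp_familiar = bigram_perplexity(["a", "b"], uni, bi)
pp_unfamiliar = bigram_perplexity(["b", "a"], uni, bi)
```

On this toy data the word order seen in training yields the lower perplexity, which is the behaviour the approach relies on.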

  15. LM approach
      Hypothesis: given an LM m computed over texts T written by author A, the perplexities of fragments g, h ∈ T′, where g was written by A and h was “plagiarised” from an author B, will be clearly different. Specifically,
      $$PP_m(g) \ll PP_m(h)$$

  16. Experiments: corpus
      We have carried out experiments over two different kinds of text:
      • Specialised: a corpus on lexicography topics written by a single author
      • Literature: a set of books written by Lewis Carroll and some passages from William Shakespeare's texts

  17. Experiments: corpus
      Corpora preprocessing: three versions of each text, intended to capture vocabulary and syntactic richness and morphosyntactic style:
      i. original text
      ii. part-of-speech
      iii. stemmed text

  19. Experiments: corpus
      • Training partition: used to compute the LMs
      • Test partition: contains randomly inserted fragments written by a different author
      To identify candidates, we calculate the perplexity of each sentence with respect to the LM associated with the author.
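
Once each test sentence has a perplexity score, the candidate-identification step can be sketched as a ranking. The threshold rule used here (a multiple of the mean perplexity µ) and the name `rank_candidates` are hypothetical; the slides do not state a decision rule.

```python
from statistics import mean


def rank_candidates(perplexities, factor=2.0):
    """Return indices of sentences whose perplexity exceeds factor * mean,
    ordered from highest to lowest perplexity.

    Assumption: the `factor * mean` cut-off is illustrative only.
    """
    mu = mean(perplexities)
    ranked = sorted(
        range(len(perplexities)),
        key=lambda i: perplexities[i],
        reverse=True,
    )
    return [i for i in ranked if perplexities[i] > factor * mu]


# Made-up per-sentence perplexities, for illustration only.
scores = [10.0, 12.0, 300.0, 11.0, 9.0, 150.0]
strict = rank_candidates(scores)              # cut-off at 2 * mean
loose = rank_candidates(scores, factor=1.0)   # cut-off at the mean
```

A lower `factor` flags more sentences, trading precision for recall.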

  20. Experiments: results
      Results over the literature corpus, considering the original text.
      [Figure: literature example, n = 3. Perplexity (y-axis, 0 to 7640) per sentence (x-axis, 0 to 950), with plagiarised sentences marked; µ = 319.]

  21. Experiments: results
      Results over the literature corpus, considering the stemmed text.
      [Figure: literature example, n = 3. Perplexity (y-axis, 0 to 1260) per sentence (x-axis, 0 to 950), with plagiarised sentences marked; µ = 88.]

  22. Experiments: results
      Results over the literature corpus, considering the POS of the text.
      [Figure: literature example, n = 3. Perplexity (y-axis, 0 to 26) per sentence (x-axis, 0 to 950), with plagiarised sentences marked; µ = 9.]

  23. Discussion
      This approach considers three of the five categories of stylometric features useful for the plagiarism detection task [Meyer zu Eissen and Stein, 2006]:
      • Original and stemmed text: syntactic features (writing style); special-word counting (vocabulary richness)
      • POS: quantification of part-of-speech classes
      • Not considered: text statistics (character level); structural features

  26. Discussion
      Perplexity (as we applied it) is not enough to discriminate plagiarised from “legal” fragments, but...
      • Is it a good idea to consider it?
      • What about the original text, POS and stemmed versions?

  27. Conclusions
      1. We have considered perplexity on three different levels: word, part-of-speech and stem.
      2. Unfortunately, there are non-plagiarised fragments that present high perplexity. However, plagiarised fragments seem to stand out in the highest positions when we consider these features.
      3. We know that the perplexity feature space of plagiarised and non-plagiarised segments is not completely separable, but we believe that including perplexity among other features may improve the results.

  28. References
      Iyer, P. and Singh, A. (2005). Document similarity analysis for a plagiarism detection system. 2nd Indian Int. Conf. on Artificial Intelligence (IICAI-2005), pages 2534–2544.
      Manning, C. D. and Schütze, H. (2000). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts and London, England.
      Meyer zu Eissen, S. and Stein, B. (2006). Intrinsic plagiarism detection. Lalmas et al. (Eds.): Advances in Information Retrieval, Proc. of the 28th European Conf. on IR Research, ECIR 2006, London, pages 565–569.
      Si, A., Leong, H. V., and Lau, R. W. H. (1997). Check: a document plagiarism detection system. Proc. of the 1997 ACM Symposium on Applied Computing.
