Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN)

Uncovering Plagiarism, Authorship, and Social Software Misuse. PAN 2011 Results [pan.webis.de]. The PAN Competition: Plagiarism Detection. The web is rife with text reuse: boilerplate, translations, paraphrases, summaries, and plagiarism.


  1. Uncovering Plagiarism, Authorship, and Social Software Misuse PAN 2011 Results [pan.webis.de]

  2. The PAN Competition: Plagiarism Detection
     The web is rife with text reuse: boilerplate, translations, paraphrases, summaries, and plagiarism.
     © www.webis.de

  3. The PAN Competition: Plagiarism Detection
     The web is rife with text reuse: boilerplate, translations, paraphrases, summaries, and plagiarism.
     Tasks:
     ❑ External Detection. Given a suspicious document and a set of potential source documents, the task is to find all plagiarized passages in the suspicious document and their corresponding source passages in the source documents.
     ❑ Intrinsic Detection. Given a suspicious document, the task is to extract all plagiarized passages based on clues extracted from the document itself.
     Corpus:
     ❑ PAN plagiarism corpus of 2010, 2011 [www.webis.de/research/corpora]
     ❑ 61 000 plagiarism cases hidden in about 27 000 documents
     ❑ 5 plagiarism-relevant parameters (length, language, task, obfuscation, fraction)

  4. The PAN Competition
     External plagiarism detection [Figure: bar charts of Plagdet, Precision, Recall, and Granularity per participant]: Grman, Grozea, Oberreuter, Cooke, Torrejón, Rao, Palkovskii, Nawab, Ghosh
     Intrinsic plagiarism detection: Oberreuter, Stamatatos, Kestemont, Akiva, Gupta
     ❑ Plagdet combines the measures as F / log2(1 + granularity).
     ❑ Granularity measures the average number of times a plagiarism case is detected.

  5. The PAN Competition: Authorship Identification
     Many texts on the web are of uncertain authorship.

  6. The PAN Competition: Authorship Identification
     Many texts on the web are of uncertain authorship.
     Tasks:
     ❑ Authorship Attribution. Given a document of uncertain authorship and documents from a set of candidate authors, the task is to attribute the document to its true author among the candidates.
     ❑ Authorship Verification. Given a document of uncertain authorship and a document from a specific author, the task is to determine whether the given text has been written by that author.
     Corpus:
     ❑ Subset of the Enron Email Dataset [www.cs.cmu.edu/~enron]
     ❑ More than 12 000 documents written by 118 authors
     ❑ 3 relevant parameters (task, candidate set size, closed vs. open candidate set)

  7. The PAN Competition
     Authorship attribution [Figure: bar charts of F, Precision, and Recall per participant]: Tanguy, Mikros, Escalante, Kourtis, Luyckx, Vilarino, Kern, Snyder, Ryan, Solorio, Eriksson, Noecker
     Authorship verification: Escalante, Snider, Kern, Eriksson, Tanguy, Vilarino, Mikros

  8. The PAN Competition: Wikipedia Vandalism Detection
     Every edit on Wikipedia has to be double-checked for integrity.

  9. The PAN Competition: Wikipedia Vandalism Detection
     Every edit on Wikipedia has to be double-checked for integrity.
     Task:
     ❑ Given a set of edits on Wikipedia articles, separate the ill-intentioned edits from the well-intentioned edits.
     Corpus:
     ❑ PAN Wikipedia vandalism corpus of 2010, 2011 [www.webis.de/research/corpora]
     ❑ About 2 800 vandalism cases among about 30 000 edits
     ❑ 3 languages with corpus annotations obtained from Mechanical Turk

  10. The PAN Competition: Wikipedia Vandalism Detection
      [Figure: precision-recall curves (x-axis Recall, y-axis Precision) in three panels: English, German, Spanish. Curve labels: West and Lee (PR-AUC 0.48938), Aksit (PR-AUC 0.22077), West and Lee (PR-AUC 0.82230), West and Lee (PR-AUC 0.70591), Drăgușanu et al. (PR-AUC 0.42464), Aksit (PR-AUC 0.18978).]
      [Figure: precision-recall curves for English, PAN-WVC-10. Curve labels: West and Lee (PR-AUC 0.75385), Mola Velasco (PR-AUC 0.66522), Adler et al. (PR-AUC 0.49263).]

  11. Quo Vadis PAN?

  12. Quo Vadis PAN? Lessons Learned and Outlook
      ❑ Focus & Simplicity
        ➜ Focus on specific aspects of the tasks.
        ➜ Reduced number of task variants.
        ➜ Reduced number of parameters and limited ranges.
      ❑ Realism & Scale
        ➜ New corpora for plagiarism detection and authorship identification.
        ➜ Scale up where necessary, scale down otherwise.
      ❑ Contributions & Challenges
        ➜ Inclusion of real plagiarism and real cases of disputed authorship.
        ➜ Distinguishing text reuse and plagiarism.
        ➜ Considering human performance.

  13. Thank you! Visit us at pan.webis.de. Mail us at pan@webis.de.
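
The evaluation measures named on slides 4 and 10 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the competition's evaluation code: all function names are hypothetical, plagdet is taken to be F / log2(1 + granularity) so that a granularity of 1 leaves F unchanged, granularity is averaged over detected cases only, and PR-AUC is approximated step-wise as average precision, which may differ from the interpolation used for the published numbers.

```python
import math
from typing import Iterable, Sequence


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def granularity(detections_per_case: Iterable[int]) -> float:
    """Average number of detections per detected plagiarism case.

    Assumption: cases with zero detections are ignored, and the value
    falls back to 1.0 when nothing was detected at all.
    """
    detected = [count for count in detections_per_case if count > 0]
    if not detected:
        return 1.0
    return sum(detected) / len(detected)


def plagdet(precision: float, recall: float, gran: float) -> float:
    """F divided by log2(1 + granularity); gran == 1 leaves F unchanged."""
    if gran < 1:
        raise ValueError("granularity must be at least 1")
    return f_measure(precision, recall) / math.log2(1 + gran)


def pr_auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    Edits are ranked by descending vandalism score; tied scores are not
    treated specially in this sketch.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(1 for _, is_vandalism in ranked if is_vandalism)
    if positives == 0:
        raise ValueError("at least one vandalism edit is required")
    hits = 0
    area = 0.0
    for rank, (_, is_vandalism) in enumerate(ranked, start=1):
        if is_vandalism:
            hits += 1
            # Precision at this rank, weighted by the recall step 1/positives.
            area += (hits / rank) / positives
    return area


if __name__ == "__main__":
    # Hypothetical numbers: three detected cases covered by six detections.
    gran = granularity([1, 2, 0, 3])
    print(plagdet(0.9, 0.6, gran))
    print(pr_auc([0.9, 0.8, 0.7, 0.1], [True, False, True, False]))
```

Dividing by log2(1 + granularity) penalizes a detector that reports one plagiarism case as several fragments: with perfect precision and recall, a granularity of 3 halves the score.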
