Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN)

slide-1
SLIDE 1

Uncovering Plagiarism, Authorship, and Social Software Misuse

PAN 2011 Results

[pan.webis.de]

slide-2
SLIDE 2

The PAN Competition

Plagiarism Detection

The web is rife with text reuse: boilerplate, translations, paraphrases, summaries, and plagiarism.

© www.webis.de

slide-3
SLIDE 3

The PAN Competition

Plagiarism Detection

The web is rife with text reuse: boilerplate, translations, paraphrases, summaries, and plagiarism.

Tasks:

❑ External Detection. Given a suspicious document and a set of potential source documents, the task is to find all plagiarized passages in the suspicious document and their corresponding source passages in the source documents.

❑ Intrinsic Detection. Given a suspicious document, the task is to extract all plagiarized passages based on clues extracted from the document itself.

Corpus:

❑ PAN plagiarism corpus of 2010, 2011 [www.webis.de/research/corpora]
❑ 61 000 plagiarism cases hidden in about 27 000 documents
❑ 5 plagiarism-relevant parameters (length, language, task, obfuscation, fraction)

slide-4
SLIDE 4

The PAN Competition

[Figure: bar charts of Plagdet, Precision, Recall, and Granularity per participant.]

External plagiarism detection: Grman, Grozea, Oberreuter, Cooke, Torrejón, Rao, Palkovskii, Nawab, Ghosh

Intrinsic plagiarism detection: Oberreuter, Stamatatos, Kestemont, Akiva, Gupta

❑ Plagdet combines the measures as F / log2(1 + granularity).
❑ Granularity measures the average number of times a plagiarism case is detected.
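The combined score can be illustrated with a minimal sketch, assuming the definition used in the PAN evaluation framework: plagdet = F1 / log2(1 + granularity), where F1 is the harmonic mean of precision and recall. The function and argument names below are illustrative only and are not part of any PAN tooling; the official measures are computed at the character level over whole corpora, which this sketch does not attempt.

```python
from math import log2


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def granularity(detections_per_case):
    """Average number of detections per detected plagiarism case.

    `detections_per_case` is a hypothetical input: one integer per
    plagiarism case, counting how many reported detections overlap it.
    Cases that were not detected at all (count 0) are skipped.
    """
    detected = [n for n in detections_per_case if n > 0]
    return sum(detected) / len(detected) if detected else 1.0


def plagdet(precision, recall, gran):
    """Assumed definition: F1 discounted by log2(1 + granularity)."""
    return f1(precision, recall) / log2(1 + gran)


# A detector that reports every case exactly once is not penalized ...
print(plagdet(0.5, 0.5, 1.0))
# ... while one that splits each case into three detections is.
print(plagdet(0.5, 0.5, 3.0))
```

With a granularity of 1 the denominator is log2(2) = 1, so Plagdet equals F1; any fragmentation of detections lowers the score.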

slide-6
SLIDE 6

The PAN Competition

Authorship Identification

Many texts on the web are of uncertain authorship.

Tasks:

❑ Authorship Attribution. Given a document of uncertain authorship and documents from a set of candidate authors, the task is to map the document onto its true authors among the candidates.

❑ Authorship Verification. Given a document of uncertain authorship and a document from a specific author, the task is to determine whether the given text has been written by that author.

Corpus:

❑ Subset of the Enron Email Dataset [www.cs.cmu.edu/~enron]
❑ More than 12 000 documents written by 118 authors.
❑ 3 relevant parameters (task, candidate set size, closed vs. open candidate set)
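To make the difference between a closed and an open candidate set concrete, here is a minimal sketch of the attribution task's input and output. The method (character trigram profiles compared by cosine similarity) is a simple stand-in chosen for illustration; it is not the approach of any PAN participant, and the function names, the threshold parameter, and the sample texts are all hypothetical.

```python
from collections import Counter
from math import sqrt


def profile(text, n=3):
    """Character n-gram counts of a text (illustrative feature choice)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def attribute(document, candidates, open_set_threshold=None):
    """Return the most similar candidate author.

    Closed set: the true author is assumed to be among the candidates,
    so the best match is always returned.
    Open set: if `open_set_threshold` is given and even the best match
    scores below it, return None ("none of the candidates").
    """
    doc = profile(document)
    scores = {
        author: cosine(doc, profile(" ".join(texts)))
        for author, texts in candidates.items()
    }
    best = max(scores, key=scores.get)
    if open_set_threshold is not None and scores[best] < open_set_threshold:
        return None
    return best


candidates = {
    "author_a": ["please find the attached report for review"],
    "author_b": ["lol see ya at the game tonight dude"],
}
print(attribute("the attached report is ready for review", candidates))
print(attribute("zzzz qqqq xxxx", candidates, open_set_threshold=0.2))
```

Verification can be seen as the open-set case with a single candidate: the answer is either that author or nobody.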

slide-7
SLIDE 7

The PAN Competition

[Figure: bar charts of F, Precision, and Recall per participant, for authorship attribution and for authorship verification. Participants shown: Escalante, Snider, Kern, Eriksson, Tanguy, Vilarino, Mikros, Kourtis, Luyckx, Snyder, Ryan, Solorio, Noecker.]

slide-9
SLIDE 9

The PAN Competition

Wikipedia Vandalism Detection

Every edit on Wikipedia has to be double-checked for integrity.

Task:

❑ Given a set of edits on Wikipedia articles, separate the ill-intentioned edits from the well-intentioned edits.

Corpus:

❑ PAN Wikipedia vandalism corpus of 2010, 2011 [www.webis.de/research/corpora]
❑ About 2 800 vandalism cases among about 30 000 edits
❑ 3 languages with corpus annotations obtained from Mechanical Turk.

slide-10
SLIDE 10

The PAN Competition

Wikipedia Vandalism Detection

[Figure: precision-recall curves, one panel per corpus.]

German: West and Lee (PR-AUC 0.70591), Aksit (PR-AUC 0.18978)

Spanish: West and Lee (PR-AUC 0.48938), Aksit (PR-AUC 0.22077)

English: West and Lee (PR-AUC 0.82230), Drăguşanu et al. (PR-AUC 0.42464)

English, PAN-WVC-10: West and Lee (PR-AUC 0.75385), Mola Velasco (PR-AUC 0.66522), Adler et al. (PR-AUC 0.49263)
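PR-AUC is the area under the precision-recall curve obtained by ranking edits by their vandalism score. A minimal sketch of one common estimate, average precision, follows. The slides do not say how the curve is interpolated for the official scores, so the choice of estimator here is an assumption, and the function name and the sample scores are hypothetical.

```python
def pr_auc(scores, labels):
    """Estimate the area under the precision-recall curve.

    `scores` are classifier confidences that an edit is vandalism;
    `labels` are 1 for vandalism and 0 for regular edits.
    Uses average precision: the mean of the precision values measured
    at the rank of each vandalism edit. Tied scores are kept in input
    order, which is a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    true_positives = 0
    area = 0.0
    for rank, (_, is_vandalism) in enumerate(ranked, start=1):
        if is_vandalism:
            true_positives += 1
            area += true_positives / rank  # precision at this recall step
    return area / total_positives


# A perfect ranking puts all vandalism first and scores 1.0.
print(pr_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
# One regular edit ranked above a vandalism edit lowers the area.
print(pr_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because vandalism is rare in the corpus (about 2 800 cases among about 30 000 edits), an area under the precision-recall curve is more informative than plain accuracy, which a classifier could inflate by labeling every edit as regular.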

slide-11
SLIDE 11

Quo Vadis PAN?

slide-12
SLIDE 12

Quo Vadis PAN?

Lessons Learned and Outlook

❑ Focus & Simplicity
➜ Focus on specific aspects of the tasks.
➜ Reduced number of task variants.
➜ Reduced number of parameters and limited ranges.

❑ Realism & Scale
➜ New corpora for plagiarism detection and authorship identification.
➜ Scale up where necessary, scale down otherwise.

❑ Contributions & Challenges
➜ Inclusion of real plagiarism and real cases of disputed authorship.
➜ Distinguishing text reuse and plagiarism.
➜ Considering human performance.

slide-13
SLIDE 13

Thank you!

Visit us at pan.webis.de. Mail us at pan@webis.de.