PAN 2010 Results: Uncovering Plagiarism, Authorship, and Social Software Misuse

SLIDE 1

PAN 2010 Results

Uncovering Plagiarism, Authorship, and Social Software Misuse

Bauhaus-Universität Weimar: Martin Potthast, Benno Stein, Andreas Eiselt, Teresa Holfeld

Universidad Politécnica de Valencia: Alberto Barrón-Cedeño, Paolo Rosso

University of the Aegean: Efstathios Stamatatos

Bar-Ilan University: Moshe Koppel

http://pan.webis.de

SLIDE 2

The PAN Competition

Information is nothing without Retrieval

© www.webis.de

SLIDE 4

The PAN Competition

2nd International Competition on Plagiarism Detection, PAN 2010

These days, plagiarism and text reuse are rife on the Web.

Task: Given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.

Corpus: PAN-PC-10

❑ 27 073 documents (obtained from 22 874 books from Project Gutenberg)
❑ 68 558 plagiarism cases (about 0-10 cases per document)
❑ 6 plagiarism-relevant parameters (length, language, task, obfuscation, topic, fraction)

[Potthast et al., COLING 2010]

SLIDE 6

The PAN Competition

Plagiarism Detection Results

Participant    Plagdet
Kasprzak       0.80
Zou            0.71
Muhr           0.69
Grozea         0.62
Oberreuter     0.61
Torrejón       0.59
Pereira        0.52
Palkovskii     0.51
Sobha          0.44
Gottron        0.26
Micol          0.22
Costa-jussà    0.21
Nawab          0.21
Gupta          0.20
Vania          0.14
Suàrez         0.06
Alzahrani      0.02
Iftene         0.00

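❑ Plagdet combines precision, recall, and granularity:

\[ \mathrm{plagdet}(S, R) = \frac{F_1}{\log_2(1 + \mathrm{gran}(S, R))} \]

\[ \mathrm{prec}(S, R) = \frac{1}{|R|} \sum_{r \in R} \frac{\left| \bigcup_{s \in S} (s \sqcap r) \right|}{|r|} \]

\[ \mathrm{rec}(S, R) = \frac{1}{|S|} \sum_{s \in S} \frac{\left| \bigcup_{r \in R} (s \sqcap r) \right|}{|s|} \]

❑ The granularity gran measures the average number of times a plagiarism case is detected.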

[Potthast et al., COLING 2010]
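
The measures above can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the organizers' evaluation code: S is taken to be the list of plagiarism cases and R the list of detections, each reduced to a set of character positions in a single suspicious document (source-side positions are ignored), s ⊓ r is approximated by plain set intersection, and granularity is averaged over the detected cases only. The helper names span, precision, recall, granularity, and plagdet are hypothetical.

```python
from math import log2


def span(start, end):
    """Character positions start..end-1 as a set (hypothetical helper)."""
    return set(range(start, end))


def precision(S, R):
    """Mean fraction of each detection r that is covered by some case s."""
    if not R:
        return 0.0
    return sum(len(set().union(*(s & r for s in S))) / len(r) for r in R) / len(R)


def recall(S, R):
    """Mean fraction of each case s that is covered by some detection r."""
    if not S:
        return 0.0
    return sum(len(set().union(*(s & r for r in R))) / len(s) for s in S) / len(S)


def granularity(S, R):
    """Average number of detections per detected case (assumed 1.0 if none)."""
    counts = [sum(1 for r in R if s & r) for s in S]
    detected = [c for c in counts if c > 0]
    return sum(detected) / len(detected) if detected else 1.0


def plagdet(S, R):
    """F1 of precision and recall, discounted by log2(1 + granularity)."""
    p, r = precision(S, R), recall(S, R)
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return f1 / log2(1 + granularity(S, R))


# Toy data: two cases; the first is detected in two pieces, the second half-detected.
S = [span(0, 100), span(200, 300)]
R = [span(0, 50), span(50, 100), span(250, 350)]
print(round(precision(S, R), 3), recall(S, R), granularity(S, R), round(plagdet(S, R), 3))
# prints 0.833 0.75 1.5 0.597
```

In the toy data, splitting the first case into two detections leaves precision and recall untouched but raises granularity to 1.5, which lowers plagdet from the plain F1 of about 0.79 to about 0.60.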

SLIDE 7

The PAN Competition

Plagiarism Detection Results

Participant    Recall   Precision   Granularity
Kasprzak       0.69     0.94         1.00
Zou            0.63     0.91         1.07
Muhr           0.71     0.84         1.15
Grozea         0.48     0.91         1.02
Oberreuter     0.48     0.85         1.01
Torrejón       0.45     0.85         1.00
Pereira        0.41     0.73         1.00
Palkovskii     0.39     0.78         1.02
Sobha          0.29     0.96         1.01
Gottron        0.32     0.51         1.87
Micol          0.24     0.93         2.23
Costa-jussà    0.30     0.18         1.07
Nawab          0.17     0.40         1.21
Gupta          0.14     0.50         1.15
Vania          0.26     0.91         6.78
Suàrez         0.07     0.13         2.24
Alzahrani      0.05     0.35        17.31
Iftene         0.00     0.60         8.68
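
As a quick consistency check, the plagdet scores reported earlier can be recomputed from the three component values on this slide. The sketch below assumes plagdet is the F1 of the two rates divided by log2(1 + granularity), as stated on the previous slide. Because F1 is symmetric, it does not matter which of the two rates is precision and which is recall. The function name plagdet_from_scores is hypothetical, and the inputs are the rounded two-decimal values from the slides, so agreement is expected only up to rounding.

```python
from math import log2


def plagdet_from_scores(rate_a, rate_b, gran):
    """Plagdet from the two rates (precision and recall, in any order) and granularity."""
    f1 = 2 * rate_a * rate_b / (rate_a + rate_b) if rate_a + rate_b > 0 else 0.0
    return f1 / log2(1 + gran)


# (rate, rate, granularity, reported plagdet), copied from the slides.
rows = {
    "Kasprzak": (0.94, 0.69, 1.00, 0.80),
    "Gottron": (0.51, 0.32, 1.87, 0.26),
    "Vania": (0.91, 0.26, 6.78, 0.14),
    "Alzahrani": (0.35, 0.05, 17.31, 0.02),
}

for name, (a, b, gran, reported) in rows.items():
    print(name, round(plagdet_from_scores(a, b, gran), 2), reported)
```

The rows for Vania and Alzahrani show how strongly a high granularity penalizes the score. Vania's rates of 0.91 and 0.26 alone would give an F1 of about 0.40, but a granularity of 6.78 brings plagdet down to about 0.14.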

SLIDE 8

The PAN Competition

Information is nothing without Retrieval

SLIDE 10

The PAN Competition

1st International Competition on Wikipedia Vandalism Detection, PAN 2010

Every edit on Wikipedia has to be double-checked for integrity, even if it affects just one character.

Task: Given a set of edits on Wikipedia articles, distinguish ill-intentioned edits from well-intentioned edits.

Corpus: PAN-WVC-10

❑ 32 452 edits (sampled from a week's worth of Wikipedia edit logs)
❑ 28 468 different edited articles (edit frequency resembles article importance)
❑ 2391 edits are vandalism (a ratio of about 7%, in line with the literature)

[Potthast, SIGIR 2010]

SLIDE 11

The PAN Competition

Vandalism Detection Results

[Figure: ROC curves (axes: TP rate, FP rate) for Random Detector, Iftene, White, Harpalani, Hegedüs, Seaward, Chichkov, Javanmardi, Adler, Mola Velasco, and the PAN'10 Meta Detector]

SLIDE 22

The PAN Competition

Vandalism Detection Results

[Figure: precision-recall curves (axes: Precision, Recall) for Random Detector, Iftene, White, Harpalani, Hegedüs, Seaward, Chichkov, Javanmardi, Adler, Mola Velasco, and the PAN'10 Meta Detector]

SLIDE 23

The PAN Competition

Vandalism Detection Results

ROC-AUC   ROC rank   PR-AUC    PR rank   Rank change   Detector
0.95690   –          0.77609   –         –             PAN’10 Meta Detector
0.92236   1          0.66522   1         –             Mola Velasco
0.90351   2          0.49263   3         ↓             Adler
0.89856   3          0.44756   4         ↓             Javanmardi
0.89377   4          0.56213   2         ⇈             Chichkov
0.87990   5          0.41365   7         ⇊             Seaward
0.87669   6          0.42203   5         ↑             Hegedüs
0.85875   7          0.41498   6         ↑             Harpalani
0.84340   8          0.39341   8         –             White
0.65404   9          0.12235   9         –             Iftene
0.50000   10         0.08490   10        –             Random Detector
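
The two ranking criteria in this table can be illustrated with a small, dependency-free sketch. It assumes each detector outputs one score per edit, with higher meaning more likely vandalism, and that labels are 1 for vandalism and 0 for regular edits. ROC-AUC is computed as the probability that a random vandalism edit outscores a random regular edit. PR-AUC is approximated here by average precision, which is one common estimate and not necessarily the one used to produce the numbers above. The function names and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Fraction of (vandalism, regular) pairs ranked correctly; ties count as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one edit of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each vandalism edit."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Toy data: 3 vandalism edits among 8, scored by an imaginary detector.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(round(roc_auc(labels, scores), 4), round(average_precision(labels, scores), 4))
```

A detector that assigns the same score to every edit gets a ROC-AUC of 0.5 under this definition, matching the Random Detector row. Its precision, by contrast, stays near the share of vandalism in the data, which is one reason the two criteria can rank the same detectors differently.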

SLIDE 24

The PAN Competition

Information is nothing without Retrieval

Retrieval is nothing without Evaluation
