slide-1
SLIDE 1
slide-2
SLIDE 2

Overview of the 3rd International Competition on Plagiarism Detection

Martin Potthast¹, Andreas Eiselt¹, Alberto Barrón-Cedeño², Benno Stein¹, Paolo Rosso²

¹ Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
² Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain

pan@webis.de http://pan.webis.de

slide-3
SLIDE 3

Introduction

Task:

  • Given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.

PAN @ CLEF 2011 3/11

slide-4
SLIDE 4

Introduction: Facts

Participation

                    2009     2010     2011
  groups              13       18       11
  countries           14       12       10

Corpus size

                    2009     2010     2011
  documents       41,223   27,073   26,939
  cases           94,202   68,558   61,064

Competition phases: training / test

                    2009     2010     2011
  training (weeks)    10        9        9
  test (weeks)         3        5        5


slide-13
SLIDE 13

The PAN Competition 2011: Corpus PAN-PC-11

Document length

  • short (1−10 pp.): 50%
  • medium (10−100 pp.): 35%
  • long (100−1000 pp.): 15%

Document purpose

  • source documents: 50%
  • suspicious documents: 50% (25% with plagiarism, 25% without plagiarism)

Plagiarism per document

  • hardly (5−20%): 57%
  • medium (20−50%): 15%
  • much (50−80%): 18%
  • entirely (>80%): 10%

Case length

  • short (<150 words): 35%
  • medium (150−1150 words): 38%
  • long (>1150 words): 27%

Obfuscation

  • none: 18%
  • paraphrasing: 71% (automatic (low) 32%, automatic (high) 31%, manual 8%)
  • translation: 11% (automatic 10%, automatic + m.c. 1%)


slide-14
SLIDE 14

The PAN Competition 2011: Evaluation

[Figure: a document shown as a character sequence, with plagiarized sections S = {s1, s2, s3} and detections R = {r1, r2, r3, r4, r5}. Legend: original characters, plagiarized characters, detected characters.]
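As a rough illustration of how detections R can be scored against plagiarized sections S at the character level, the sketch below computes precision, recall, and granularity for a single document. It assumes the macro-averaged definitions of the PAN evaluation framework, which the slide itself does not spell out. The function names, the half-open (start, end) character offsets, and the example spans are hypothetical.

```python
def chars(span):
    """Set of character offsets covered by a half-open (start, end) span."""
    start, end = span
    return set(range(start, end))


def precision(cases, detections):
    """Average fraction of each detection that lies inside some plagiarized case."""
    if not detections:
        return 0.0
    plagiarized = set().union(*(chars(s) for s in cases))
    return sum(len(chars(r) & plagiarized) / len(chars(r))
               for r in detections) / len(detections)


def recall(cases, detections):
    """Average fraction of each plagiarized case that is covered by detections."""
    if not cases:
        return 0.0
    detected = set().union(*(chars(r) for r in detections))
    return sum(len(chars(s) & detected) / len(chars(s))
               for s in cases) / len(cases)


def granularity(cases, detections):
    """Average number of detections per detected case (1.0 is ideal).

    Returning 1.0 when no case is detected is a convention assumed here.
    """
    counts = [sum(1 for r in detections if chars(r) & chars(s)) for s in cases]
    hit = [c for c in counts if c > 0]
    return sum(hit) / len(hit) if hit else 1.0


# Hypothetical example: three cases (s1..s3) and five detections (r1..r5).
cases = [(0, 100), (200, 300), (400, 500)]
detections = [(0, 50), (60, 100), (250, 350), (600, 650), (700, 720)]
print(precision(cases, detections), recall(cases, detections),
      granularity(cases, detections))
```

For these hypothetical spans, precision is 0.5 (two detections fall outside any case and one is half outside), recall is about 0.47 (the third case is missed entirely), and granularity is 1.5 (the first case is reported in two pieces).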

slide-15
SLIDE 15

Intrinsic Detection

Intrinsic Plagiarism Detection

[Figure: pipeline. Document d_q → Document chunking → Retrieval model → Outlier detection → Post-processing → Suspicious sections]
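The stages named on this slide can be sketched end to end. The following is a toy sketch only, not any participant's method: it assumes fixed-size word chunks, a single hypothetical style feature (average word length) standing in for the retrieval model, a z-score test for outlier detection, and merging of adjacent flagged chunks as post-processing. All function names and thresholds are illustrative assumptions.

```python
from statistics import mean, pstdev


def chunk_words(words, size):
    """Document chunking: non-overlapping (start, end) word ranges."""
    return [(i, min(i + size, len(words))) for i in range(0, len(words), size)]


def avg_word_length(words):
    """Toy style feature (assumption): mean word length of a chunk."""
    return sum(len(w) for w in words) / len(words)


def intrinsic_detect(text, size=10, z_threshold=2.0):
    """Return merged word ranges whose style deviates from the rest of the document."""
    words = text.split()
    spans = chunk_words(words, size)
    if not spans:
        return []
    feats = [avg_word_length(words[a:b]) for a, b in spans]
    mu, sigma = mean(feats), pstdev(feats)
    if sigma == 0:
        return []  # perfectly uniform style, nothing stands out
    # Outlier detection: flag chunks far from the document-wide average.
    flagged = [span for span, f in zip(spans, feats)
               if abs(f - mu) / sigma > z_threshold]
    # Post-processing: merge directly adjacent flagged chunks.
    merged = []
    for a, b in flagged:
        if merged and merged[-1][1] == a:
            merged[-1] = (merged[-1][0], b)
        else:
            merged.append((a, b))
    return merged


# Hypothetical document: uniform short words with one block of long words.
text = " ".join(["aa"] * 40 + ["aaaaaaaaaa"] * 10 + ["aa"] * 40)
print(intrinsic_detect(text))
```

Real systems use far richer style models (for example character n-gram profiles); the single feature here only keeps the example small.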


slide-16
SLIDE 16

Intrinsic Detection

Participant   plagdet   recall   precision   granularity
Oberreuter      0.33     0.34      0.31         1.00
Kestemont       0.17     0.43      0.11         1.03
Akiva           0.08     0.13      0.07         1.05
Rao             0.07     0.11      0.08         1.48

slide-17
SLIDE 17

External Detection

External Plagiarism Detection

[Figure: pipeline. Suspicious document d_q and reference collection D → Heuristic retrieval → Candidate documents → Detailed analysis → Knowledge-based post-processing → Suspicious sections]
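A minimal sketch of the three external detection stages, under loudly stated assumptions: heuristic retrieval keeps every source that shares at least one word trigram with the suspicious document, detailed analysis marks the suspicious word positions whose trigram also occurs in the source, and post-processing drops very short matches. None of this is taken from a participant's system; the names, the trigram size, and the thresholds are hypothetical.

```python
def ngrams(words, n=3):
    """Set of word n-grams of a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidates(suspicious, sources, n=3, min_shared=1):
    """Heuristic retrieval: sources sharing at least min_shared n-grams."""
    query = ngrams(suspicious.split(), n)
    return [name for name, text in sources.items()
            if len(query & ngrams(text.split(), n)) >= min_shared]


def detailed_analysis(suspicious, source, n=3):
    """Word ranges of the suspicious text whose n-grams occur in the source."""
    words = suspicious.split()
    source_grams = ngrams(source.split(), n)
    spans = []
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in source_grams:
            if spans and i <= spans[-1][1]:
                spans[-1] = (spans[-1][0], i + n)  # extend overlapping match
            else:
                spans.append((i, i + n))
    return spans


def detect(suspicious, sources, n=3, min_len=4):
    """Run all stages; post-processing keeps spans of at least min_len words."""
    result = {}
    for name in candidates(suspicious, sources, n):
        spans = [s for s in detailed_analysis(suspicious, sources[name], n)
                 if s[1] - s[0] >= min_len]
        if spans:
            result[name] = spans
    return result


# Hypothetical mini collection.
sources = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "lorem ipsum dolor sit amet",
}
suspicious = "we saw that the quick brown fox jumps today"
print(detect(suspicious, sources))
```

In this example only source "a" survives retrieval, and the reported range covers the five suspicious words "the quick brown fox jumps" (word offsets 3 to 8, end exclusive).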


slide-18
SLIDE 18

External Detection

Participant   plagdet   recall   precision   granularity
Grman           0.56     0.40      0.94         1.00
Grozea          0.42     0.34      0.81         1.22
Oberreuter      0.35     0.23      0.91         1.06
Cooke           0.25     0.15      0.71         1.01
Rodriguez       0.23     0.16      0.85         1.23
Rao             0.20     0.16      0.45         1.29
Palkovskii      0.19     0.14      0.44         1.17
Nawab           0.08     0.09      0.28         2.18
Ghosh           0.00     0.00      0.01         2.00
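The plagdet score reported here combines the other three measures. The slide does not state the formula; the sketch below assumes the definition from the PAN evaluation framework, in which the harmonic mean of precision and recall is divided by log2(1 + granularity). The function name and signature are illustrative.

```python
import math


def plagdet(precision, recall, granularity):
    """F1 of precision and recall, discounted by detection granularity."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Rounded inputs taken from the external detection results on this slide.
print(round(plagdet(0.94, 0.40, 1.00), 2))  # Grman
print(round(plagdet(0.81, 0.34, 1.22), 2))  # Grozea
```

Because the inputs are already rounded to two decimals, recomputed scores can differ from the reported ones in the last digit. With a granularity of 1.00 the denominator is exactly 1, so plagdet equals F1.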


slide-19
SLIDE 19

Summary

Overview paper

  • This year’s best practices for intrinsic and external detection.
  • Detection results with regard to every corpus parameter.
  • Comparison to PAN 2009 and PAN 2010.

Lessons & frontiers

  • Detection performance decreased because detection difficulty increased.
  • Intrinsic detection results may be biased by the nature of the corpus.
  • Both approaches are important (also to win the competition).
  • Short plagiarism cases remain the hardest to detect.
  • Manual translation proves much harder to detect than automatic translation (a less biased result).


slide-20
SLIDE 20

CL!TR: Cross-Language !ndian Text Reuse

  • Task on cross-language text re-use detection
  • Potential source texts in English, suspicious texts in Hindi
  • Document-level task (no specific fragments are expected to be identified)

http://users.dsic.upv.es/grupos/nle/fire-workshop-clitr.html


slide-24
SLIDE 24

Jean-François Millet (1854): Sheep Shearing Beneath a Tree
Vincent van Gogh (1889): The Sheep Shearers (after Millet)

“[I am] translating the black and white impressions into another language – that of colour”