Overview of the 3rd International Competition on Plagiarism Detection
Martin Potthast (1), Andreas Eiselt (1), Alberto Barrón-Cedeño (2), Benno Stein (1), Paolo Rosso (2)
(1) Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
(2) Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain
pan@webis.de http://pan.webis.de
Introduction
Task:
- Given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
PAN @ CLEF 2011 3/11
Introduction: Facts
Participation
             2009     2010     2011
groups         13       18       11
countries      14       12       10
Corpus size
             2009     2010     2011
documents  41,223   27,073   26,939
cases      94,202   68,558   61,064
Competition phases (weeks)
             2009     2010     2011
training       10        9        9
test            3        5        5
The PAN Competition 2011: Corpus PAN-PC-11
Document length
- 50% short (1-10 pp.)
- 35% medium (10-100 pp.)
- 15% long (100-1000 pp.)
Document purpose
- 50% source documents
- 50% suspicious documents (25% with plagiarism, 25% without plagiarism)
Plagiarism per document
- 57% hardly (5-20%)
- 15% medium (20-50%)
- 18% much (50-80%)
- 10% entirely (>80%)
Case length
- 35% short (<150 words)
- 38% medium (150-1150 words)
- 27% long (>1150 words)
Obfuscation
- 18% none
- 71% paraphrasing (32% automatic (low), 31% automatic (high), 8% manual)
- 11% translation (10% automatic, 1% automatic + m.c.)
The PAN Competition 2011: Evaluation
[Figure: a document as a character sequence, with original, plagiarized, and detected characters marked. S = {s1, s2, s3} are the plagiarized sections; R = {r1, ..., r5} are the detected sections.]
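The comparison of plagiarized sections S against detected sections R can be sketched at character level as follows. This is a minimal single-document sketch, not the official evaluation code: it assumes spans are given as (offset, length) pairs with positive length, ignores the source-document side of a detection, and uses the plagdet definition F1 / log2(1 + granularity) from the PAN evaluation framework. All function names are hypothetical.

```python
from math import log2

def chars(span):
    """Expand an (offset, length) span into the set of character positions it covers."""
    offset, length = span
    return set(range(offset, offset + length))

def covered(spans):
    """Union of all character positions covered by a list of spans."""
    return set().union(*(chars(x) for x in spans))

def precision(S, R):
    """Mean fraction of each detection r that lies inside some plagiarized section."""
    plag = covered(S)
    return sum(len(chars(r) & plag) / len(chars(r)) for r in R) / len(R) if R else 0.0

def recall(S, R):
    """Mean fraction of each plagiarized section s that is covered by detections."""
    det = covered(R)
    return sum(len(chars(s) & det) / len(chars(s)) for s in S) / len(S) if S else 0.0

def granularity(S, R):
    """Average number of detections per detected section (1.0 is ideal)."""
    counts = [sum(1 for r in R if chars(r) & chars(s)) for s in S]
    counts = [c for c in counts if c > 0]
    return sum(counts) / len(counts) if counts else 1.0

def plagdet(S, R):
    """Combine F1 of precision and recall with a penalty for fragmented detections."""
    p, r = precision(S, R), recall(S, R)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return f1 / log2(1 + granularity(S, R))

# Assumed toy data: two plagiarized sections, three detections.
S = [(0, 100), (200, 100)]
R = [(0, 50), (50, 50), (250, 100)]
print(precision(S, R), recall(S, R), granularity(S, R), plagdet(S, R))
```

In this toy example the first section is found in two pieces, so granularity is 1.5 and plagdet drops below the plain F1 score even though that section is fully covered.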
Intrinsic Detection
[Figure: intrinsic plagiarism detection pipeline. Suspicious document d_q -> document chunking -> retrieval model -> outlier detection -> post-processing -> suspicious sections.]
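The four pipeline stages named above (chunking, retrieval model, outlier detection, post-processing) can be illustrated with a toy sketch. It is not any participant's method: the character trigram profile, the cosine distance, the mean-plus-k-standard-deviations threshold, and all names and parameter values are assumptions chosen for brevity.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev

def chunk(text, size=200, step=100):
    """Document chunking: overlapping character windows as (offset, chunk_text)."""
    last = max(len(text) - size, 0)
    return [(i, text[i:i + size]) for i in range(0, last + 1, step)]

def profile(text, n=3):
    """Retrieval model: character n-gram counts as a crude writing-style vector."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_distance(a, b):
    """1 - cosine similarity of two count vectors (1.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def detect(text, size=200, step=100, k=1.0):
    """Outlier detection plus post-processing; returns (start, end) character spans."""
    doc = profile(text)
    chunks = chunk(text, size, step)
    dists = [cosine_distance(profile(c), doc) for _, c in chunks]
    threshold = mean(dists) + k * pstdev(dists)
    spans = []
    for (offset, c), d in zip(chunks, dists):
        if d > threshold:  # chunk style deviates from the document as a whole
            start, end = offset, offset + len(c)
            if spans and start <= spans[-1][1]:
                spans[-1] = (spans[-1][0], end)  # merge overlapping or adjacent chunks
            else:
                spans.append((start, end))
    return spans
```

Because every chunk is compared only with the document it comes from, a sketch like this needs no reference collection, which is the defining property of the intrinsic setting.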
Results
Participant   plagdet   recall   precision   granularity
Oberreuter       0.33     0.34        0.31          1.00
Kestemont        0.17     0.43        0.11          1.03
Akiva            0.08     0.13        0.07          1.05
Rao              0.07     0.11        0.08          1.48
External Detection
[Figure: external plagiarism detection pipeline. Suspicious document d_q and reference collection D -> heuristic retrieval -> candidate documents -> detailed analysis -> knowledge-based post-processing -> suspicious sections.]
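The three stages of the external pipeline (heuristic retrieval, detailed analysis, post-processing) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not a participant's system: it matches exact word trigrams only, so it would miss paraphrased or translated cases, and every name and threshold below is hypothetical.

```python
def ngrams(text, n=3):
    """Set of lowercased word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def heuristic_retrieval(suspicious, sources, top_k=2):
    """Rank source documents by shared n-grams; keep the top_k with any overlap."""
    q = ngrams(suspicious)
    scored = sorted(((len(q & ngrams(t)), name) for name, t in sources.items()),
                    reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

def detailed_analysis(suspicious, source, n=3):
    """Word positions in the suspicious text where a shared n-gram starts."""
    words = suspicious.lower().split()
    src = ngrams(source, n)
    return [i for i in range(len(words) - n + 1) if tuple(words[i:i + n]) in src]

def post_process(positions, n=3, max_gap=2):
    """Merge nearby matches into (start_word, end_word) sections, end exclusive."""
    sections = []
    for i in positions:
        if sections and i - sections[-1][1] <= max_gap:
            sections[-1] = (sections[-1][0], i + n)
        else:
            sections.append((i, i + n))
    return sections

# Assumed toy reference collection D and suspicious document.
D = {"a": "the quick brown fox jumps over the lazy dog",
     "b": "completely unrelated text about cooking pasta"}
d_q = "my essay says the quick brown fox jumps over the lazy dog and nothing else"

for name in heuristic_retrieval(d_q, D):
    print(name, post_process(detailed_analysis(d_q, D[name])))
```

Restricting the detailed analysis to the retrieved candidates is what keeps the comparison tractable when the reference collection is large.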
Results
Participant   plagdet   recall   precision   granularity
Grman            0.56     0.40        0.94          1.00
Grozea           0.42     0.34        0.81          1.22
Oberreuter       0.35     0.23        0.91          1.06
Cooke            0.25     0.15        0.71          1.01
Rodriguez        0.23     0.16        0.85          1.23
Rao              0.20     0.16        0.45          1.29
Palkovskii       0.19     0.14        0.44          1.17
Nawab            0.08     0.09        0.28          2.18
Ghosh            0.00     0.00        0.01          2.00
Summary
Overview paper
- This year’s best practices for intrinsic and external detection.
- Detection results with regard to every corpus parameter.
- Comparison to PAN 2009 and PAN 2010.
Lessons & frontiers
- Detection performance decreased because detection difficulty increased.
- Intrinsic detection results may be biased by the nature of the corpus.
- Both approaches are important (also to win the competition).
- Short plagiarism cases remain the hardest to detect.
- Manual translation proves much harder to detect than automatic translation (result less biased).
CL!TR: Cross-Language !ndian Text Reuse
- Task on cross-language text re-use detection
- Potential source texts in English, suspicious texts in Hindi
- Document-level task (no specific fragments are expected to be identified)
http://users.dsic.upv.es/grupos/nle/fire-workshop-clitr.html