Overview of the 3rd International Competition on Plagiarism Detection


  1. Overview of the 3rd International Competition on Plagiarism Detection. Martin Potthast 1, Andreas Eiselt 1, Alberto Barrón-Cedeño 2, Benno Stein 1, Paolo Rosso 2. 1 Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany. 2 Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain. pan@webis.de http://pan.webis.de

  2. Introduction
      Task:
      • Given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
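
To make the expected output concrete: each reported case is a character span in a suspicious document, optionally aligned with a span in a source document. A minimal sketch of that structure in Python; the class and field names are illustrative, not PAN's actual annotation format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    """One reported plagiarism case: a character span in a suspicious
    document, optionally aligned with a span in a source document."""
    suspicious_doc: str                   # e.g. "suspicious-document00042.txt" (hypothetical name)
    this_offset: int                      # start of the plagiarized span, in characters
    this_length: int                      # length of the span, in characters
    source_doc: Optional[str] = None      # None if no source section was identified
    source_offset: Optional[int] = None
    source_length: Optional[int] = None
```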

  3. Introduction: Facts
      Participation (groups / countries):    2009: 13 / 14     2010: 18 / 12     2011: 11 / 10
      Corpus size (documents / cases):       2009: 41,223 / 94,202     2010: 27,073 / 68,558     2011: 26,939 / 61,064
      Competition phases (training / test):  2009: 10 / 3 weeks     2010: 9 / 5 weeks     2011: 9 / 5 weeks

  4.-12. The PAN Competition 2011: Corpus PAN-PC-11
      Document length:         short (1-10 pp.) 50%, medium (10-100 pp.) 35%, long (100-1000 pp.) 15%
      Document purpose:        source documents 50%, suspicious documents 50% (25% with plagiarism, 25% without)
      Plagiarism per document: hardly (5-20%) 57%, medium (20-50%) 15%, much (50-80%) 18%, entirely (>80%) 10%
      Case length:             short (<150 words) 35%, medium (150-1150 words) 38%, long (>1150 words) 27%
      Obfuscation:             none 18%, paraphrasing 71% (automatic low 32%, automatic high 31%, manual 8%),
                               translation 11% (automatic 10%, automatic + manual correction 1%)

  13. The PAN Competition 2011: Evaluation [Figure: a document as a character sequence, with the actual plagiarized characters marked as cases s1, s2, s3 in S and the detected characters marked as detections r1, ..., r5 in R, overlaid on the original characters]
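
For reference, the character-level measures behind this figure, as defined in the PAN evaluation framework (Potthast et al., COLING 2010): $S$ is the set of plagiarism cases and $R$ the set of detections, each case and detection treated as a set of character references, and $s \sqcap r$ denotes $s \cap r$ if $r$ detects $s$ and $\emptyset$ otherwise:

$$
\mathrm{prec}(S,R) = \frac{1}{|R|}\sum_{r \in R}\frac{\bigl|\bigcup_{s \in S}(s \sqcap r)\bigr|}{|r|},
\qquad
\mathrm{rec}(S,R) = \frac{1}{|S|}\sum_{s \in S}\frac{\bigl|\bigcup_{r \in R}(s \sqcap r)\bigr|}{|s|}
$$

$$
\mathrm{gran}(S,R) = \frac{1}{|S_R|}\sum_{s \in S_R}|R_s|,
\qquad
\mathrm{plagdet}(S,R) = \frac{F_1}{\log_2\bigl(1 + \mathrm{gran}(S,R)\bigr)}
$$

where $S_R \subseteq S$ are the cases detected by at least one detection in $R$, $R_s \subseteq R$ are the detections of case $s$, and $F_1$ is the harmonic mean of precision and recall.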

  14. Intrinsic Plagiarism Detection: given only a suspicious document dq: document chunking → retrieval model → outlier detection → post-processing → suspicious sections
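
A minimal sketch of the outlier-detection idea behind this pipeline, assuming character 3-gram profiles with cosine similarity as the style model; the chunk size, features, and threshold are illustrative choices, not any participant's actual system:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def intrinsic_outliers(text, chunk_size=2500, threshold=2.0):
    """Flag chunks whose style profile deviates from the whole document:
    chunks whose similarity to the document profile falls more than
    `threshold` standard deviations below the mean are returned as
    suspicious character spans (start, end)."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    doc_profile = char_ngrams(text)
    sims = [cosine(char_ngrams(c), doc_profile) for c in chunks]
    mean = sum(sims) / len(sims)
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims)) or 1.0
    return [(i * chunk_size, min((i + 1) * chunk_size, len(text)))
            for i, s in enumerate(sims) if (mean - s) / std > threshold]
```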

  15. Intrinsic Detection: Results
      Team         plagdet  precision  recall  granularity
      Oberreuter    0.33     0.31       0.34    1.00
      Kestemont     0.17     0.11       0.43    1.03
      Akiva         0.08     0.07       0.13    1.05
      Rao           0.07     0.08       0.11    1.48

  16. External Plagiarism Detection: given a suspicious document dq and a reference collection D: heuristic retrieval → candidate documents → detailed analysis → knowledge-based post-processing → suspicious sections
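
A minimal sketch of the two external stages, assuming word 5-gram fingerprints for the heuristic retrieval and a simple merge of matching positions for the detailed analysis; all parameters are illustrative, not any participant's actual system:

```python
from collections import defaultdict

def shingles(words, n=5):
    """Map each word n-gram to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(words) - n + 1):
        index[tuple(words[i:i + n])].append(i)
    return index

def candidate_retrieval(susp_words, sources, n=5, min_shared=5):
    """Stage 1 (heuristic retrieval): keep source documents that share
    at least `min_shared` word n-grams with the suspicious document."""
    susp = set(shingles(susp_words, n))
    return [doc_id for doc_id, words in sources.items()
            if len(susp & set(shingles(words, n))) >= min_shared]

def detailed_analysis(susp_words, src_words, n=5, gap=50):
    """Stage 2 (detailed analysis): merge positions of n-grams that also
    occur in the source into contiguous passages (word-index spans in
    the suspicious document)."""
    src_index = shingles(src_words, n)
    seeds = [i for i in range(len(susp_words) - n + 1)
             if tuple(susp_words[i:i + n]) in src_index]
    passages, start = [], None
    for prev, cur in zip([None] + seeds, seeds):
        if start is None:
            start = cur
        elif cur - prev > gap:        # too far from the last match: close the passage
            passages.append((start, prev + n))
            start = cur
    if start is not None:
        passages.append((start, seeds[-1] + n))
    return passages
```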

  17. External Detection: Results
      Team         plagdet  precision  recall  granularity
      Grman         0.56     0.94       0.40    1.00
      Grozea        0.42     0.81       0.34    1.22
      Oberreuter    0.35     0.91       0.23    1.06
      Cooke         0.25     0.71       0.15    1.01
      Rodriguez     0.23     0.85       0.16    1.23
      Rao           0.20     0.45       0.16    1.29
      Palkovskii    0.19     0.44       0.14    1.17
      Nawab         0.08     0.28       0.09    2.18
      Ghosh         0.00     0.01       0.00    2.00

  18. Summary
      Overview paper:
      • This year's best practices for intrinsic and external detection.
      • Detection results with regard to every corpus parameter.
      • Comparison to PAN 2009 and PAN 2010.
      Lessons & frontiers:
      • Detection performance decreased because of the increased detection difficulty.
      • Intrinsic detection results may be biased due to the nature of the corpus.
      • Both approaches are important (also to win the competition).
      • Short plagiarism cases remain the hardest to detect.
      • Manual translation proves much harder to detect than automatic translation (a less biased result).

  19. CL!TR: Cross-Language !ndian Text Reuse
      • Task on cross-language text reuse detection.
      • Potential source texts in English, suspicious texts in Hindi.
      • Document-level task (no specific fragments need to be identified).
      http://users.dsic.upv.es/grupos/nle/fire-workshop-clitr.html
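
A common baseline for a cross-language, document-level task like this is translation followed by monolingual analysis: translate the suspicious Hindi text into English, then rank the English sources by vocabulary overlap. A minimal sketch; translate_to_english is a hypothetical stand-in for any MT system and is not part of CL!TR itself:

```python
def translate_to_english(hindi_text):
    """Hypothetical stand-in for a machine translation system."""
    raise NotImplementedError

def rank_sources(suspicious_hindi, sources):
    """Rank English source documents by Jaccard overlap between their
    vocabulary and that of the translated suspicious document."""
    susp_vocab = set(translate_to_english(suspicious_hindi).lower().split())
    ranked = []
    for doc_id, text in sources.items():
        src_vocab = set(text.lower().split())
        union = susp_vocab | src_vocab
        score = len(susp_vocab & src_vocab) / len(union) if union else 0.0
        ranked.append((doc_id, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```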

  20.-23. Jean-François Millet (1854), Sheep Shearing Beneath a Tree, and Vincent van Gogh (1889), The Sheep Shearers (after Millet): “[I am] translating the black and white impressions into another language – that of colour”
