Overview of the 3rd International Competition on Plagiarism Detection
Martin Potthast (1), Andreas Eiselt (1), Alberto Barrón-Cedeño (2), Benno Stein (1), Paolo Rosso (2)
(1) Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
(2) Natural Language Engineering Lab, ELiRF, Universidad Politécnica de Valencia, Spain
pan@webis.de
http://pan.webis.de

Introduction
Task:
• Given a set of suspicious documents and a set of source documents, find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections.
PAN @ CLEF 2011

Introduction: Facts
Participation
• 2009: 13 groups, 14 countries
• 2010: 18 groups, 12 countries
• 2011: 11 groups, 10 countries
Corpus size
• 2009: 41,223 documents, 94,202 cases
• 2010: 27,073 documents, 68,558 cases
• 2011: 26,939 documents, 61,064 cases
Competition phases (training / test)
• 2009: 10 weeks / 3 weeks
• 2010: 9 weeks / 5 weeks
• 2011: 9 weeks / 5 weeks

The PAN Competition 2011: Corpus PAN-PC-11
Document length
• short (1−10 pp.): 50%
• medium (10−100 pp.): 35%
• long (100−1000 pp.): 15%
Document purpose
• source documents: 50%
• suspicious documents: 50% (with plagiarism 25%, without plagiarism 25%)
Plagiarism per document
• hardly (5−20%): 57%
• medium (20−50%): 15%
• much (50−80%): 18%
• entirely (>80%): 10%
Case length
• short (<150 words): 35%
• medium (150−1150 words): 38%
• long (>1150 words): 27%
Obfuscation
• none: 18%
• paraphrasing: 71%, of which automatic (low) 32%, automatic (high) 31%, manual 8%
• translation: 11%, of which automatic 10%, automatic + m.c. 1%

The PAN Competition 2011: Evaluation
[Figure: a document shown as a character sequence, distinguishing original characters, plagiarized characters (cases s1, s2, s3 in S), and detected characters (detections r1 to r5 in R).]
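The evaluation figure compares plagiarized character spans S with detected character spans R. The slides do not spell out the measures, so the following is only a minimal sketch of character-level precision, recall, and granularity for a single document, following the macro-averaged definitions of the PAN evaluation framework as far as they are known to me. The span representation, the function names, the example offsets, and the convention of returning a granularity of 1 when nothing is detected are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
# Spans are assumed to be non-empty.
Span = Tuple[int, int]


def chars(span: Span) -> Set[int]:
    """Return the set of character positions covered by a span."""
    start, end = span
    return set(range(start, end))


def precision(cases: List[Span], detections: List[Span]) -> float:
    """Average, over detections, of the fraction of each detection that is plagiarized."""
    if not detections:
        return 0.0
    plagiarized = set().union(*(chars(s) for s in cases))
    return sum(len(chars(r) & plagiarized) / len(chars(r)) for r in detections) / len(detections)


def recall(cases: List[Span], detections: List[Span]) -> float:
    """Average, over cases, of the fraction of each case that is detected."""
    if not cases:
        return 0.0
    detected = set().union(*(chars(r) for r in detections))
    return sum(len(chars(s) & detected) / len(chars(s)) for s in cases) / len(cases)


def granularity(cases: List[Span], detections: List[Span]) -> float:
    """Average number of detections per detected case (1.0 is ideal)."""
    counts = []
    for s in cases:
        hits = sum(1 for r in detections if chars(s) & chars(r))
        if hits:
            counts.append(hits)
    # Assumption: report 1.0 when no case is detected at all.
    return sum(counts) / len(counts) if counts else 1.0


# Made-up offsets loosely mirroring the figure: three cases, five detections.
S = [(0, 100), (200, 300), (400, 450)]
R = [(0, 50), (60, 100), (250, 320), (500, 520), (600, 610)]

print(precision(S, R), recall(S, R), granularity(S, R))
```

In this made-up example the first case is covered by two separate detections, which is what pushes granularity above 1, and the third case is missed entirely, which lowers recall without affecting granularity.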

Intrinsic Detection
[Figure: intrinsic plagiarism detection pipeline. Suspicious document d_q → document chunking → retrieval model → outlier detection → post-processing → suspicious sections.]

Intrinsic Detection: Results
Participant   plagdet  recall  precision  granularity
Oberreuter    0.33     0.34    0.31       1.00
Kestemont     0.17     0.43    0.11       1.03
Akiva         0.08     0.13    0.07       1.05
Rao           0.07     0.11    0.08       1.48

External Detection
[Figure: external plagiarism detection pipeline. Suspicious document d_q and reference collection D → heuristic retrieval → candidate documents → detailed analysis → knowledge-based post-processing → suspicious sections.]

External Detection: Results
Participant   plagdet  recall  precision  granularity
Grman         0.56     0.40    0.94       1.00
Grozea        0.42     0.34    0.81       1.22
Oberreuter    0.35     0.23    0.91       1.06
Cooke         0.25     0.15    0.71       1.01
Rodriguez     0.23     0.16    0.85       1.23
Rao           0.20     0.16    0.45       1.29
Palkovskii    0.19     0.14    0.44       1.17
Nawab         0.08     0.09    0.28       2.18
Ghosh         0.00     0.00    0.01       2.00
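The slides report plagdet next to precision, recall, and granularity without stating how they combine. As a hedged sketch, assuming the definition used in the PAN evaluation framework (the F1 of precision and recall divided by log2(1 + granularity)), the reported plagdet scores can be approximately recomputed from the other three columns. The function name is hypothetical, and small deviations are expected because the slide values are rounded to two decimals.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """F1 of precision and recall, discounted by log2(1 + granularity).

    Assumed definition; the slides themselves do not state the formula.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# (precision, recall, granularity, reported plagdet) as shown on the slide.
rows = {
    "Grman": (0.94, 0.40, 1.00, 0.56),
    "Grozea": (0.81, 0.34, 1.22, 0.42),
    "Oberreuter": (0.91, 0.23, 1.06, 0.35),
}

for name, (p, r, g, reported) in rows.items():
    print(f"{name}: recomputed {plagdet(p, r, g):.3f}, reported {reported:.2f}")
```

With a granularity of exactly 1 the discount factor is log2(2) = 1, so plagdet reduces to plain F1; any granularity above 1 lowers the score, which penalizes detectors that report one case in several fragments.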

Summary
Overview paper
• This year’s best practices for intrinsic and external detection.
• Detection results with regard to every corpus parameter.
• Comparison to PAN 2009 and PAN 2010.
Lessons & frontiers
• Detection performance decreased as detection difficulty increased.
• Intrinsic detection results may be biased due to the nature of the corpus.
• Both approaches are important (also to win the competition).
• Short plagiarism cases remain the hardest to detect.
• Manual translation proves much harder to detect than automatic translation (result less biased).

CL!TR: Cross-Language !ndian Text Reuse
• Task on cross-language text re-use detection
• Potential source texts in English, suspicious texts in Hindi
• Document level task (no specific fragments are expected to be identified)
http://users.dsic.upv.es/grupos/nle/fire-workshop-clitr.html

Jean-François Millet (1854): Sheep Shearing Beneath a Tree
Vincent van Gogh (1889): The Sheep Shearers (after Millet)
“[I am] translating the black and white impressions into another language – that of colour”
