SLIDE 1

Overview of the 4th International Competition on Plagiarism Detection

Martin Potthast, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Benno Stein
Webis Group, Bauhaus-Universität Weimar (www.webis.de)

Parth Gupta, Paolo Rosso
NLEL Group, Universitat Politècnica de València (www.dsic.upv.es/grupos/nle)

Alberto Barrón-Cedeño
LSI Group, Universitat Politècnica de Catalunya (www.lsi.upc.edu)

SLIDE 2

Introduction


SLIDES 3-8

Introduction

[Figure: the plagiarism detection pipeline. A suspicious document (here, a thesis) is checked against a document collection in three steps: candidate retrieval yields candidate documents, detailed comparison yields suspicious passages, and knowledge-based post-processing produces the final report. Successive builds of this slide highlight where problems 1-4 below arise.]

Observations, problems:

  • 1. Representativeness: the corpus consists of books, many of which are very old, whereas today the Web is the predominant source for plagiarists.
  • 2. Scale: the corpus is too small to enforce a true candidate retrieval situation; most participants ran a complete detailed comparison on all O(n²) document pairs.
  • 3. Realism: plagiarized passages do not take the surrounding document into account, paraphrasing is mostly done by machines, and the Web is not used as a source.
  • 4. Comparability: evaluation frameworks must be developed, too, and ours kept changing over the years, rendering the obtained results incomparable across years.


SLIDES 9-11

Candidate Retrieval

[Figure: the pipeline again, with problems 1-4 marked; checkmarks appear as the considerations below address them.]

Considerations:

  • 1. PAN’12 employed the English part of the ClueWeb09 corpus (used in TREC 2009-2011 for several tracks) as a static Web snapshot. Size: 500 million web pages, 12.5 TB.
  • 2. Participants were given efficient corpus access via the API of the ChatNoir search engine. ClueWeb and ChatNoir ensure that experiments are reproducible and controllable.
  • 3. The new corpus: manually written, digestible texts; topically matching plagiarism cases; the Web as source (for both document synthesis and plagiarism detection).


SLIDE 12

Candidate Retrieval

[Figure: the pipeline, with three of the four problems now checked off.]

Candidate retrieval task:

❑ Humans write essays on given topics, plagiarizing from the ClueWeb and using the ChatNoir search engine for research.
❑ Detectors use ChatNoir to retrieve candidate documents from the ClueWeb.
❑ Detectors are expected to maximize recall while using ChatNoir in a cost-effective way.


SLIDES 13-14

Candidate Retrieval

About ChatNoir [chatnoir.webis.de]

❑ employs the BM25F retrieval model, sketched after this list
  (CMU’s Indri search engine, by contrast, is language-model-based)
❑ provides search facets capturing readability issues
❑ own index, based on externalized minimal perfect hash functions
❑ index built on a 40-node Hadoop cluster
❑ search engine currently running on 11 machines
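For reference, a common formulation of BM25F (following Zaragoza et al.; the slide does not give ChatNoir's exact variant or parameter settings): term frequencies are first combined across document fields f with weights and field-specific length normalization, then passed through the usual BM25 saturation:

$$\widetilde{tf}(t,d) = \sum_{f} w_f \cdot \frac{tf(t,f,d)}{1 + b_f \left( \frac{l_{f,d}}{\bar{l}_f} - 1 \right)}, \qquad \mathrm{score}(q,d) = \sum_{t \in q} \frac{\widetilde{tf}(t,d)}{k_1 + \widetilde{tf}(t,d)} \cdot \mathrm{idf}(t)$$

where $tf(t,f,d)$ is the frequency of term $t$ in field $f$ of document $d$, $l_{f,d}$ the field length, $\bar{l}_f$ the average field length, and $w_f$, $b_f$, $k_1$ free parameters.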


SLIDES 15-16

Candidate Retrieval

About Corpus Construction

❑ an essay has approx. 5000 words, i.e., 8-10 pages
❑ a custom web editor was developed for essay writing
❑ the writing is crowdsourced via oDesk

➜ full control over:
  – the plagiarized document
  – the set of used source documents
  – annotations of paraphrased passages
  – the writer's query log while researching the topic
  – the search results for each query
  – the click-through data for each query
  – browsing data for links clicked within the ClueWeb
  – the edit history of the document, covering all keystrokes
  – the work diary and screenshots as recorded by oDesk
➜ insights into how humans work when reusing text


SLIDE 17

Candidate Retrieval

Survey of Approaches: an analysis of the participants’ notebooks reveals a generic candidate retrieval process (a minimal code sketch follows the list):

  • 1. Chunking
    The suspicious document is divided into (possibly overlapping) passages of text. Each chunk is then processed individually.
  • 2. Keyphrase Extraction
    Given a chunk (or the entire suspicious document), keyphrases are extracted from it in order to formulate queries.
  • 3. Query Formulation
    Given the sets of keyphrases extracted from the chunks, queries are formulated and tailored to the API of the search engine used.
  • 4. Search Control
    Given a set of queries, the search controller schedules their submission to the search engine and directs the download of search results.
  • 5. Download Filtering
    Given a set of downloaded documents, all documents that are not worthwhile for a detailed comparison with the suspicious document are removed.
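A minimal sketch of this five-step process, in Python. All helper names are hypothetical; search(query) stands in for whatever engine API is used (e.g., ChatNoir's), and the keyphrase and filter heuristics are deliberately simplistic stand-ins, not any participant's method:

# Sketch of the generic candidate retrieval process described above.
from collections import Counter

def chunk(text, size=500, overlap=100):
    """1. Chunking: split the suspicious document into overlapping passages."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def keyphrases(chunk_text, k=10):
    """2. Keyphrase extraction: here simply the k most frequent long words."""
    words = [w.lower() for w in chunk_text.split() if len(w) > 6]
    return [w for w, _ in Counter(words).most_common(k)]

def formulate_queries(phrases, terms_per_query=4):
    """3. Query formulation: pack keyphrases into engine-sized queries."""
    return [" ".join(phrases[i:i + terms_per_query])
            for i in range(0, len(phrases), terms_per_query)]

def looks_promising(doc, chunk_text, threshold=0.2):
    """5. Download filter: keep documents sharing enough vocabulary with the chunk."""
    a, b = set(doc.snippet.lower().split()), set(chunk_text.lower().split())
    return len(a & b) / max(len(b), 1) >= threshold

def retrieve_candidates(suspicious_text, search, budget=50):
    """4./5. Search control and download filtering under a fixed query budget."""
    seen, candidates = set(), []
    for c in chunk(suspicious_text):
        for q in formulate_queries(keyphrases(c)):
            if budget == 0:
                return candidates
            budget -= 1
            for doc in search(q):            # hypothetical engine API call
                if doc.id not in seen:
                    seen.add(doc.id)
                    if looks_promising(doc, c):
                        candidates.append(doc)
    return candidates

The explicit budget reflects the task's cost constraint: recall matters, but every query and download counts against the detector.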


SLIDE 18

Candidate Retrieval

Evaluation Results

Team         Total Workload     Time to 1st Detection   Reported Sources      Downloaded Sources
             Queries   Dwnlds   Queries   Dwnlds        Precision   Recall    Precision   Recall
Gillam            63      527         5       26             0.63     0.25         0.01     0.56
Jayapal           67      174         9       14             0.66     0.28         0.07     0.43
Kong             551      327        81       28             0.57     0.24         0.02     0.37
Palkovskii        63     1027        27      319             0.44     0.12         0.00     0.21
Suchomel          13       95         6        2             0.52     0.21         0.08     0.35

❑ Suchomel et al. implement the best tradeoff between cost and quality.
❑ Jayapal implements the best approach in terms of precision and recall.


SLIDES 19-20

Detailed Comparison

[Figure: the pipeline, with all four problems now checked off.]

Detailed comparison task:

❑ Detectors are presented with a suspicious document and a candidate document, and are asked to extract the plagiarized passages.
❑ Developers submit their detection software instead of detection results.
❑ This allows for re-evaluating detectors, as well as measuring runtime and using private corpora.


SLIDE 21

Detailed Comparison

Software Submissions and Runtime Analysis

❑ Eleven participants, roughly the average number of previous years.

➜ Software submissions do not distract people from participating.

Team                  Submission   Operating   Programming    Average Runtime
                      Size [MB]    System      Language       [sec/comparison]
Rodríguez Torrejón         1.80    Linux       sh, C/C++                  0.19
Sánchez-Vega               0.04    Linux       C++                        2.48
Oberreuter                 0.19    Linux       Java                       2.58
Palkovskii                68.20    Windows     C#                         4.51
Grozea                     1.90    Linux       Perl, Octave               4.82
Suchomel                   0.02    Linux       Perl                       5.36
Kong                       2.60    Linux       Java                       5.91
Jayapal                   37.20    Linux       Java                       8.43
Gillam                     0.48    Linux       Python 2.7                 9.40
Küppers                   42.90    Linux       Java                      27.64
Ghosh                    554.50    Linux       sh, Java                      –

➜ Congratulations to Rodríguez Torrejón et al. for submitting the most efficient detailed comparison program.


SLIDE 22

Detailed Comparison

Survey of Approaches: an analysis of the participants’ notebooks reveals a generic detailed comparison process (a minimal code sketch follows the list):

  • 1. Seeding
    Given a suspicious document and a source document, matches (so-called "seeds") between the two documents are identified using some seed heuristic. Seed heuristics either identify exact matches or create matches by changing the underlying texts in a domain-specific or linguistically motivated way.
  • 2. Match Merging
    Given the seed matches identified between a suspicious document and a source document, they are merged into aligned text passages of maximal length between the two documents, which are then reported as plagiarism detections.
  • 3. Passage Filtering
    Given a set of aligned passages, a passage filter removes all passages that do not meet certain criteria.
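A minimal Python sketch of the seed-and-merge idea. The word n-gram seed heuristic, the gap-based merge rule, and all thresholds are illustrative assumptions, not any participant's settings:

# Sketch of the generic detailed comparison process described above.

def seeds(susp, src, n=8):
    """1. Seeding: exact word n-gram matches between the two documents."""
    def grams(words):
        return {" ".join(words[i:i + n]): i for i in range(len(words) - n + 1)}
    susp_w, src_w = susp.split(), src.split()
    src_grams = grams(src_w)
    return sorted((i, src_grams[g])          # (position in susp, position in src)
                  for g, i in grams(susp_w).items() if g in src_grams)

def merge(matches, max_gap=50):
    """2. Match merging: chain seeds whose positions are close in both texts."""
    passages = []                            # (start_susp, end_susp, start_src, end_src)
    for s_pos, r_pos in matches:
        if passages and s_pos - passages[-1][1] <= max_gap \
                    and abs(r_pos - passages[-1][3]) <= max_gap:
            p = passages[-1]
            passages[-1] = (p[0], s_pos, p[2], r_pos)   # extend current passage
        else:
            passages.append((s_pos, s_pos, r_pos, r_pos))
    return passages

def filter_passages(passages, min_len=30):
    """3. Passage filtering: drop alignments too short (in words) to report."""
    return [p for p in passages if p[1] - p[0] >= min_len]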


SLIDE 23

Detailed Comparison

TIRA evaluation platform

[Figure: two TIRA servers, Windows 7 (tira@localhost) and Ubuntu 12.04 (tira@buw).]

❑ TIRA takes locally executable programs and turns them into web services (a minimal illustration follows the list).
❑ TIRA assumes responsibility for storing and indexing execution results.
❑ For the PAN evaluation, TIRA servers are provided for two operating systems, Windows and Ubuntu.
❑ Participants submit their plagiarism detection software for deployment on the appropriate TIRA server.
❑ A third TIRA server controls the overall evaluation of all deployed submissions on the private test set and provides the overall results.
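To illustrate the first bullet (this is the basic idea only, not TIRA's actual code; the program path and parameters are hypothetical), a sketch of wrapping a locally executable program as a web service:

# Illustration only: expose a local command-line program over HTTP,
# the basic idea behind TIRA. "./detector" is a hypothetical program.
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

PROGRAM = "./detector"   # hypothetical locally executable detection program

class RunHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        args = params.get("args", [""])[0].split()
        # Execute the wrapped program and return its output over HTTP.
        result = subprocess.run([PROGRAM, *args], capture_output=True, text=True)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write(result.stdout.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8080), RunHandler).serve_forever()

A real deployment would also capture stderr and exit codes and persist them for later inspection, which is what the "storing and indexing execution results" bullet refers to.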


SLIDE 24

Detailed Comparison

Evaluation Corpus Construction

❑ As in previous years, based on books from Project Gutenberg.
❑ Divided into seven sub-corpora:

Sub-Corpus                     Number of Cases   Avg. Cosine Similarity
Real Cases                                  33                    0.161
Simulated                                  500                    0.364
Translation ({de, es} → en)                500                    0.018
Artificial (High)                          500                    0.392
Artificial (Low)                           500                    0.455
No Obfuscation                             500                    0.560
No Plagiarism                              500                    0.431
Overall                                   3033                    0.369

❑ The similarity of document pairs was taken into account this year (a generic cosine sketch follows).
❑ Real cases were taken from the Web; cross-language cases were constructed using the multilingual Europarl corpus.
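For reference, cosine similarity over term-frequency vectors, the usual reading of the table's similarity column (how the statistics were actually computed is not specified on the slide, so the vector construction here is an assumption):

# Generic cosine similarity between two documents over term-frequency vectors.
import math
from collections import Counter

def cosine(doc_a: str, doc_b: str) -> float:
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0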


SLIDE 25

Detailed Comparison

Evaluation Results: Overall Performance

Rank   Team                 PlagDet   Precision   Recall   Granularity
  1    Kong                   0.738       0.824    0.678          1.01
  2    Suchomel               0.682       0.893    0.552          1.00
  3    Grozea                 0.678       0.774    0.635          1.03
  4    Oberreuter             0.673       0.867    0.555          1.00
  5    Rodríguez Torrejón     0.625       0.834    0.500          1.00
  6    Palkovskii             0.538       0.574    0.523          1.02
  7    Küppers                0.349       0.776    0.282          1.26
  8    Sánchez-Vega           0.309       0.537    0.349          1.57
  9    Gillam                 0.308       0.898    0.190          1.02
 10    Jayapal                0.045       0.622    0.075          6.93

➜ Congratulations to Kong et al. for submitting the most effective detailed comparison program.
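For reference, PlagDet combines the F-measure with a granularity penalty, as defined for the PAN competitions (Potthast et al. 2010):

$$\mathrm{plagdet}(S, R) = \frac{F_1(S, R)}{\log_2\bigl(1 + \mathrm{gran}(S, R)\bigr)}$$

where $S$ is the set of plagiarism cases, $R$ the set of detections, $F_1$ the harmonic mean of precision and recall, and $\mathrm{gran}(S, R)$ the average number of detections covering a single case. Sanity check against the table: for Kong, $F_1 = 2 \cdot 0.824 \cdot 0.678 / (0.824 + 0.678) \approx 0.744$ and $\log_2(1 + 1.01) \approx 1.007$, giving $0.744 / 1.007 \approx 0.738$.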


SLIDE 26

Summary and Outlook

PAN 2012:

❑ Task-wise evaluation of plagiarism detectors.
❑ Candidate document retrieval at Web scale using ChatNoir.
❑ Software submissions for sustainable, repeatable evaluation using TIRA.
❑ A more realistic plagiarism corpus.
❑ New performance measures in addition to the traditional ones.

➜ A lot of fun! Thanks to everyone who volunteered to test our new setup!

PAN 2013 and beyond:

❑ Improvement and consolidation of the new tools.
❑ Use of the plagiarism corpus for detailed comparison as well.
❑ A community process to collect more plagiarism cases (real and manually created).

➜ Fully automatic plagiarism detection evaluations.
