Plagiarism Candidate Retrieval Using Selective Query Formulation and - PowerPoint PPT Presentation

Plagiarism Candidate Retrieval Using Selective Query Formulation and Discriminative Query Scoring
Osama Haggag and Samhaa El-Beltagy
Center for Informatics Science, Nile University, Egypt

Outline
• Introduction
• Problem Description
• Task Description
• Implementation
• Results
• Conclusion
• Future Work

Task Description

Task Description
• We are given a plagiarized dataset
• Plagiarized from the ClueWeb09 corpus
• There is little to no obfuscation
• Some passages and headlines are not plagiarized
• Documents are well written and punctuated
• Documents are organized into paragraphs, each focusing on a subtopic of the larger topic at hand

Task Description
• The goal is to:
  - Maximize retrieval performance and keep it well balanced
  - Minimize workload and runtime
• The plan is to broaden the search scope through topical segmentation
  - While introducing some form of search control over how the queries are used
  - It would be favorable to score queries that haven't been used yet against already downloaded documents
• The core of the problem is document downloads
  - Downloading irrelevant documents leads to more irrelevance
  - Downloading relevant documents minimizes the search effort and sharpens precision

Implementation

Implementation
• The slight obfuscation was disregarded due to its insignificance
• ChatNoir is the search engine of choice
• The system is made up of a number of phases:
  - Data preparation
  - Query formulation
  - Searching
  - Tuning the parameters

Implementation
• Data preparation:
[Diagram. Surviving labels: word frequencies ("obama": 23, "clan": 1); keyphrases ("barack obama", "michelle obama"; Keyphrase 1, Keyphrase 2, Keyphrase 3, ...); sentences (sent 1, sent 2, sent 3, sent 4, sent 5, sent 6, ...); sentence groups ([s1, s4, s11, s13], [s16, s19, s22, s25], ...).]

Implementation
• Query formulation:
  - For each segment we have: segment keywords and a keyphrase (KP)
  - For each 4-sentence chunk: word frequencies, split into Freq > 1 and Freq = 1
  - Query has to be < 10
  - Queries (Q1, Q2, ...) are stored as a list of strings per document
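The query formulation step can be illustrated with a minimal sketch. It is not the authors' code: the function name `formulate_queries`, the simple tokenizer, counting word frequencies over the segment, and reading "< 10" as a limit on the number of words are all assumptions made for illustration. Stop-word removal is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the slides do not specify one)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def formulate_queries(sentences, keyphrase, chunk_size=4,
                      freq_threshold=1, max_terms=9):
    """Build one query string per chunk of `chunk_size` sentences.

    Each query is the segment keyphrase followed by the chunk's words whose
    frequency is at most `freq_threshold`, capped at `max_terms` words
    (9 keeps the query under 10 words).
    """
    # Assumption: frequencies are counted over the whole segment.
    freqs = Counter(w for s in sentences for w in tokenize(s))
    kp_terms = tokenize(keyphrase)
    budget = max(max_terms - len(kp_terms), 0)

    queries = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences[start:start + chunk_size]
        terms = []
        for word in (w for s in chunk for w in tokenize(s)):
            is_rare = freqs[word] <= freq_threshold
            if is_rare and word not in terms and word not in kp_terms:
                terms.append(word)
        queries.append(" ".join(kp_terms + terms[:budget]))
    return queries


segment = [
    "Barack Obama spoke in Cairo.",
    "Obama praised the students.",
    "The speech covered trade.",
    "Obama left on Friday.",
    "Critics questioned the policy.",
]
queries = formulate_queries(segment, "barack obama")
```

With the defaults, the five sentences above yield two queries: one for the first four sentences and one for the remaining sentence. Repeated words such as "the" are dropped because their frequency exceeds the threshold.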

Implementation
• Searching (flowchart):
  - Given a list of queries per document: Query 1, Query 2, Query 3, ..., Query n
  - Snippet > 50%? Consider document a source
  - > 60%? Skip to next query

Implementation
• Tuning the parameters:
  - The system has a number of parameters that need tuning
  - Because a single experiment over the dataset is costly in time, it is difficult to optimize by iterating over parameter combinations
  - We use human intuition, common sense, and a small number of experiments to determine values that are good enough, but not necessarily optimal

Implementation
• Tuning the parameters (in processing):
  - TextTiling parameters:
    - Control over size of subdocuments
    - Tuning for a large number of segments of small size gives higher recall
    - Tuning for a small number of large topics is best for both precision and recall

Implementation
• Tuning the parameters (in processing):
  - Sentence chunk size selection:
    - A chunk size of 1 gives better recall at the cost of precision
    - A chunk size of 4 was found to do best
  - Frequency threshold:
    - Identifies the "unique" words in the query
    - The threshold of 1 was chosen after running experiments

Implementation
• Tuning the parameters (for search):
  - Number of results returned:
    - First result is often the most relevant one
  - Query vs. snippet score:
    - A score of 50% filtered search results nicely
    - A lower threshold meant higher recall; a higher threshold meant lower recall without an equivalent improvement in precision
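One simple way to realize a query vs. snippet score is the fraction of query words that also occur in the snippet. The slides do not define the scoring function, so the overlap measure below, and the names `overlap_score` and `passes_snippet_filter`, are assumptions; only the 50% threshold comes from the slide.

```python
import re

SNIPPET_THRESHOLD = 0.5  # the 50% value reported on the slide


def _terms(text):
    """Set of lowercase word tokens (tokenizer is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, text):
    """Fraction of distinct query terms that also occur in `text` (0.0 to 1.0)."""
    query_terms = _terms(query)
    if not query_terms:
        return 0.0
    return len(query_terms & _terms(text)) / len(query_terms)


def passes_snippet_filter(query, snippet, threshold=SNIPPET_THRESHOLD):
    """Keep a search result only if its snippet scores strictly above the threshold."""
    return overlap_score(query, snippet) > threshold


good = overlap_score("barack obama cairo speech", "Obama gave a speech in Cairo")
bad = overlap_score("barack obama cairo speech", "Obama visits Berlin")
```

Here the first snippet covers three of the four query words and passes the filter, while the second covers one of four and is rejected. Lowering the threshold lets more results through, which matches the recall trade-off described above.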

Implementation
• Tuning the parameters (for search):
  - Query vs. candidate document score:
    - Same rationale as scoring against snippets
    - 60% is a relatively good filter
    - Higher values are better for recall
  - Refer to Tables 1, 2, and 3 on page 6 of the paper for details

Results

Results
• Our system was evaluated using the measures set by PAN'13
• The system was determined to be one of the top three systems at PAN'13

Conclusion

Conclusion
• We have a system that can retrieve possible plagiarism sources with competitive performance at minimal workload
• This is done through careful formulation and discriminative elimination of queries
• The system employs two algorithms:
  - TextTiling: topical segmentation (Marti A. Hearst)
  - KPMiner: keyphrase extraction (Samhaa R. El-Beltagy)

Future Work
• There is room for improvement on the current system:
  - Optimize the parameters
  - Make use of ChatNoir's advanced search functions
  - Investigate obfuscation further
  - Add more intelligence to the scoring functions
• The code for our implementation is available on GitHub, under the MIT license
