
A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

Introduction
Plagiarism:
◦ Unauthorized use of text, code, ideas, …
Plagiarism detection has received increasing attention as a research area:
◦ The rapid growth of documents in different languages
◦ Increased accessibility of electronic documents
29/1/2017

Prototypical Plagiarism
Monolingual: copied or paraphrased
Cross-language: includes translation

Problem definition has two steps
Candidate document retrieval
◦ D: set of source documents
◦ d': suspicious document with fragments d'_{f'}
$$\text{Candidate documents}(D, d') = \{\, d \mid d \in D,\ \exists\, d_f \in d,\ d'_{f'} \in d',\ \mathrm{Sim}(d_f, d'_{f'}) > t \,\}$$
Pairwise document similarity
◦ d: source document with fragments d_f
◦ d': suspicious document with fragments d'_{f'}
$$\text{Copied pairs}(d, d') = \{\, (d_f, d'_{f'}) \mid d_f \in d,\ d'_{f'} \in d',\ \mathrm{Sim}(d_f, d'_{f'}) > t \,\}$$
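
The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the similarity function is a plain bag-of-words cosine, a pair counts as copied when its similarity exceeds a threshold, and the names `cosine`, `copied_pairs`, `candidate_documents` and the `corpus` layout are hypothetical.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words vectors of two fragments."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def copied_pairs(source_fragments, suspicious_fragments, threshold):
    """Pairwise step: all (source, suspicious) fragment pairs above the threshold."""
    return [
        (f, g)
        for f in source_fragments
        for g in suspicious_fragments
        if cosine(f, g) > threshold
    ]


def candidate_documents(corpus, suspicious_fragments, threshold):
    """Retrieval step: documents with at least one fragment similar to a suspicious one.

    `corpus` maps a document id to its list of fragments (assumed layout).
    """
    return [
        doc_id
        for doc_id, fragments in corpus.items()
        if copied_pairs(fragments, suspicious_fragments, threshold)
    ]
```

Retrieval is written here as a filter on top of the pairwise step, which mirrors the two definitions but would be too slow for a large corpus; a real system would likely use an index for the first step.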

Detailed analysis in a pair of documents
Possible errors in detecting plagiarism:
• Text that is not plagiarized might be erroneously reported
• Part or all of a plagiarized source or target text might go unreported
• Parts of one plagiarism case might be reported as separate cases

Evaluation Metrics
◦ S: set of true plagiarism cases, R: set of detections reported
$$\mathrm{Precision}(R, S) = \frac{1}{|R|} \sum_{r \in R} \frac{\left|\bigcup_{s \in S} (s \sqcap r)\right|}{|r|}$$
Fraction of reported detections (at character level) that are truly plagiarized
$$\mathrm{Recall}(R, S) = \frac{1}{|S|} \sum_{s \in S} \frac{\left|\bigcup_{r \in R} (s \sqcap r)\right|}{|s|}$$
Fraction of plagiarism cases (at character level) that are detected
$$\mathrm{Granularity}(R, S) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s|$$
Average number of reported detections per detected plagiarism case
$$\mathrm{Plagdet}(R, S) = \frac{F_1(R, S)}{\log_2(1 + \mathrm{Granularity}(R, S))}$$
Combined metric
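
A small Python sketch may make these character-level metrics concrete. It rests on simplifying assumptions that are not in the slides: every case and detection is a non-empty half-open `(start, end)` character span in the suspicious document only, a case counts as detected when at least one detection overlaps it, and granularity defaults to 1 when nothing is detected. All function names are hypothetical.

```python
import math


def chars(span):
    """Expand a half-open (start, end) span into its set of character offsets."""
    start, end = span
    return set(range(start, end))


def precision(detections, cases):
    """Mean fraction of each detection's characters that fall inside some true case."""
    if not detections:
        return 0.0
    total = 0.0
    for r in detections:
        covered = set().union(*(chars(s) & chars(r) for s in cases))
        total += len(covered) / len(chars(r))  # spans are assumed non-empty
    return total / len(detections)


def recall(detections, cases):
    """Mean fraction of each true case's characters that are detected.

    This is the precision formula with the two sets swapped.
    """
    return precision(cases, detections)


def granularity(detections, cases):
    """Average number of detections overlapping each detected case."""
    counts = [sum(1 for r in detections if chars(s) & chars(r)) for s in cases]
    detected = [c for c in counts if c > 0]
    return sum(detected) / len(detected) if detected else 1.0


def plagdet(p, r, gran):
    """Combine precision, recall and granularity: F1 / log2(1 + granularity)."""
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return f1 / math.log2(1 + gran)
```

With granularity equal to 1 the denominator is log2(2) = 1, so Plagdet reduces to the F1 score.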

Two-phase algorithm for identifying plagiarized text fragments
Candidate sentence selection:
• Finds many possibly plagiarized fragments
• Focuses on recall
Result filtering:
• Finds alignments between the identified passages
• Focuses on precision

Step 1: Candidate Sentence Selection

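Seeding: token extraction
◦ Fragments are obtained from the source document d (fragments d_f) and from the suspicious document d' (fragments d'_{f'})
◦ Each fragment is created from a sequence of k consecutive sentences using a sliding window
◦ Tokens are extracted using all words or keywords, giving the representative words K_1, …, K_n
[Diagram: Source d_f and Suspicious d'_{f'}, each passing through "Obtain fragments" and "Token extraction" to the representative words K_1 … K_n]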
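
The sliding-window fragmentation can be sketched as follows. The regular-expression sentence splitter and the `step` parameter are assumptions made for illustration; the slides only state that each fragment consists of consecutive sentences taken with a sliding window.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sliding_fragments(sentences, k, step=1):
    """Build fragments of k consecutive sentences, moving the window by `step`."""
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    if len(sentences) <= k:
        # A short document yields a single fragment (or none if it is empty).
        return [" ".join(sentences)] if sentences else []
    return [
        " ".join(sentences[i:i + k])
        for i in range(0, len(sentences) - k + 1, step)
    ]
```

A step of 1 makes neighbouring fragments overlap, which favours recall at the cost of producing more candidates.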

Similarity computation
◦ Check the existence of the representative words K_1, …, K_n in each source fragment d_f and each suspicious fragment d'_{f'} (identify the presence of representative terms)
◦ Create a vector for each fragment: sim(d_f, K_1), …, sim(d_f, K_n) for the source and sim(d'_{f'}, K_1), …, sim(d'_{f'}, K_n) for the suspicious document
◦ Compare the vectors using cosine similarity
Match merging: two detected fragments are merged to report a single plagiarism case if the number of characters between those fragments is below a proximity threshold in both the source and the suspicious document.
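
The merging rule can be illustrated with a short Python sketch. It assumes each match is a pair of half-open character spans `((src_start, src_end), (susp_start, susp_end))` and uses a single greedy pass over matches sorted by their position in the suspicious document; both the data layout and the greedy strategy are assumptions, not details given in the slides.

```python
def gap(a, b):
    """Number of characters between two half-open spans (0 if they touch or overlap)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def merge_matches(matches, proximity):
    """Merge neighbouring matches whose gaps in BOTH documents are below `proximity`."""
    merged = []
    for src, susp in sorted(matches, key=lambda m: (m[1][0], m[0][0])):
        if merged:
            prev_src, prev_susp = merged[-1]
            if gap(prev_src, src) < proximity and gap(prev_susp, susp) < proximity:
                # Extend the previous case so that it covers both matches.
                merged[-1] = (
                    (min(prev_src[0], src[0]), max(prev_src[1], src[1])),
                    (min(prev_susp[0], susp[0]), max(prev_susp[1], susp[1])),
                )
                continue
        merged.append((src, susp))
    return merged
```

Merging nearby matches into one case is what keeps the number of detections per plagiarism case low.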

Cross-lingual plagiarism detection
◦ The representative words K_1, …, K_n are translated into K'_1, …, K'_n
◦ Check the existence of K_1, …, K_n in the source fragments d_f and of K'_1, …, K'_n in the suspicious fragments d'_{f'}
◦ Create a vector for each fragment: sim(d_f, K_1), …, sim(d_f, K_n) for the source and sim(d'_{f'}, K'_1), …, sim(d'_{f'}, K'_n) for the suspicious document
◦ Similarity computation between the two vectors
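
As a rough illustration of the cross-lingual variant, the sketch below translates each representative word with a bilingual dictionary and builds a binary presence vector for a fragment in the other language. The toy dictionary, the binary weighting and the function names are all assumptions; the slides do not specify how the translation or the vector entries are computed.

```python
def translate_keywords(keywords, dictionary):
    """Map each keyword to its set of translations (empty set if none is known)."""
    return [set(dictionary.get(k, ())) for k in keywords]


def presence_vector(fragment, keyword_translations):
    """1 where any translation of the keyword occurs in the fragment, else 0."""
    tokens = set(fragment.lower().split())
    return [1 if tokens & alternatives else 0 for alternatives in keyword_translations]


# Toy English-to-German dictionary, purely illustrative.
toy_dictionary = {"house": ["haus"], "dog": ["hund", "köter"], "tree": ["baum"]}
translated = translate_keywords(["house", "dog", "tree"], toy_dictionary)
vector = presence_vector("der hund läuft um das haus", translated)
```

The resulting vector lines up position by position with the vector built from the untranslated keywords on the other side, so the two can be compared directly.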

Step 2: Result Filtering

Aligning segments within fragment pairs
• A fragment pair (d_f, d'_{f'}) is retrieved from the first step
• The fragments are split into smaller segments
• The segments are aligned using a dynamic programming algorithm
  • allowing 1:0, 0:1, 1:1, 2:1, 1:2, 3:1 and 1:3 alignments
  • excluding sentences at the start or end of a fragment with >50% content in 1:0 or 0:1 alignments

Alignment details
• S(i, j) represents the score of the optimal alignment from the beginning of the fragment to the i-th suspicious segment and the j-th source segment
• To penalize 1:0 and 0:1 alignments, and to make all scores comparable, we keep track of the number of alignments obtained so far and normalize the score at each step by that number
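
The recurrence itself is not reproduced in this transcript, so the following Python sketch is only one possible reading of the description. Its assumptions: an alignment step scores the Jaccard word overlap between the joined segments, 1:0 and 0:1 steps score zero, and each table cell keeps the running score sum and step count so that candidates are compared by their normalized (average) score. Choosing the best average greedily per cell is a heuristic and is not guaranteed to find the globally best average.

```python
# Allowed (suspicious, source) segment counts for one alignment step.
STEPS = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (3, 1), (1, 3)]


def jaccard(a, b):
    """Word-set overlap between two strings (assumed segment similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def align(suspicious, source, sim=jaccard):
    """Align two segment lists; return (aligned index groups, normalized score)."""
    n, m = len(suspicious), len(source)
    # best[i][j] = (score sum, number of steps, last step taken)
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0.0, 0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in STEPS:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0:
                    continue
                total, count, _ = best[pi][pj]
                # Skipped segments (1:0 or 0:1) add a step but no score.
                gain = 0.0 if 0 in (di, dj) else sim(
                    " ".join(suspicious[pi:i]), " ".join(source[pj:j])
                )
                cand = (total + gain, count + 1, (di, dj))
                cur = best[i][j]
                if cur is None or cand[0] / cand[1] > cur[0] / cur[1]:
                    best[i][j] = cand
    # Follow the stored steps back from the end to recover the alignment.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = best[i][j][2]
        pairs.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    pairs.reverse()
    total, count, _ = best[n][m]
    return pairs, (total / count if count else 0.0)
```

Because every skipped segment increases the step count without adding score, alignments with many 1:0 or 0:1 steps end up with a lower normalized score, which is the penalizing effect described above.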

Granularity level of alignment
Sentence level:
◦ Sentences are used as the granularity level of alignment
n-gram level:
◦ A plagiarized fragment may omit pieces from the source, but at least some of the smallest units are likely to be preserved
◦ n is the expected number of terms in each segment
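
For the n-gram level, one simple way to produce segments is to cut the token sequence into consecutive chunks of n terms. Treating the chunks as non-overlapping and exactly n terms long (apart from the last one) is an assumption made for this sketch; the slide only says that n is the expected number of terms per segment.

```python
def ngram_segments(fragment, n):
    """Split a fragment into consecutive, non-overlapping segments of n terms."""
    if n <= 0:
        raise ValueError("n must be positive")
    tokens = fragment.split()
    return [" ".join(tokens[i:i + n]) for i in range(0, len(tokens), n)]
```

Smaller segments give the alignment more chances to match pieces of a fragment whose sentences were shortened or reordered.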

Results
Detailed analysis sub-task on the PersianPlagdet2016 training corpus (t = similarity threshold, n = number of sentences):
Setting            Precision  Recall  Granularity  Plagdet
(t = 0.2, n = 5)   0.4004     0.8151  1            0.5370
(t = 0.3, n = 5)   0.7630     0.7486  1            0.7557
(t = 0.4, n = 5)   0.8532     0.5357  1            0.6582
(t = 0.3, n = 3)   0.7867     0.8304  1            0.8080
(t = 0.4, n = 3)   0.8604     0.6567  1            0.7449
Detailed analysis sub-task on the PersianPlagdet2016 test corpus:
Setting            Precision  Recall  Granularity  Plagdet  Runtime
(t = 0.3, n = 5)   0.7496     0.7050  1            0.7266   00:24:08

Results
Evaluation of the second phase, the result filtering step (t = 0.3, n = 3):
Setting                   Precision  Recall  Granularity  Plagdet
Without result filtering  0.6029     0.8602  1            0.7087
After result filtering    0.7867     0.8304  1            0.8080

Results
Evaluation of the seeding phase, using keywords:
Setting            Precision  Recall  Granularity  Plagdet
(t = 0.3, n = 3)   0.5118     0.8858  1            0.6487
(t = 0.4, n = 3)   0.6431     0.8928  1            0.7476
(t = 0.5, n = 3)   0.7475     0.8862  1            0.8110
(t = 0.6, n = 3)   0.8117     0.8459  1            0.8282
(t = 0.7, n = 3)   0.8522     0.7531  1            0.7996

Cross-lingual detailed analysis for plagiarism detection
Ehsan, N., Tompa, F.W., Shakery, A.: Using a dictionary and n-gram alignment to improve fine-grained cross-language plagiarism detection. In: Proceedings of the 2016 ACM Symposium on Document Engineering, pp. 59–68. ACM (2016).
Dataset                         Precision  Recall  Granularity  Plagdet
PAN2012 English-German dataset  0.9301     0.8193  1            0.8712

Summary
• The proposed method is a two-phase approach for identifying plagiarized fragments
• The first phase tries to find possibly plagiarized fragments
• The second phase tries to improve precision
• The framework is applicable to any language
• The approach can be adapted to the cross-language setting

Thanks for your attention
