
Apr 26, 2023


SLIDE 1

A Text Alignment Corpus for Persian Plagiarism Detection

Fatemeh Mashhadirajab, Mehrnoush Shamsfard, Razieh Adelkhah, Fatemeh Shafiee, Chakaveh Saedi

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

Persian Plagdet 2016

SLIDE 2

Outline

Introduction
Text Alignment Corpus Construction
Strategies for Plagiarism Types
Dataset Statistics
Conclusions
References


SLIDE 3

Introduction

A taxonomy of plagiarism [2]

SLIDE 4

Text Alignment Corpus Construction

Data Source Preparation
Documents Clustering
Set of Suspicious and Source Documents
Source and Suspicious Document Pairs Selection
Source Documents Segmentation
Segment Extraction
Segment Obfuscation
Obfuscated Segment Insertion

SLIDE 5

Text Alignment Corpus Construction

Data Source Preparation

Articles and theses in the fields of computer science and engineering and electrical engineering
1,500 documents from articles and theses available from online stores
4,500 documents from Wikipedia articles

SLIDE 6

Text Alignment Corpus Construction

Data Source Preparation

Our corpus contains 11,089 documents.

SLIDE 7

Text Alignment Corpus Construction

Documents Clustering

SLIDE 8

Text Alignment Corpus Construction

Set of Suspicious and Source Documents

SLIDE 9

Text Alignment Corpus Construction

Set of Suspicious and Source Documents

[Diagram labels: Suspicious Documents, Source Documents]

The suspicious and source documents are randomly selected from each cluster.
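The random selection from each cluster can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_clusters`, the per-cluster counts `n_susp` and `n_src`, and the choice to keep the two sets disjoint are all assumptions, since the slides only state that the documents are drawn at random from each cluster.

```python
import random


def split_clusters(clusters, n_susp, n_src, seed=0):
    """Randomly pick suspicious and source documents from each cluster.

    clusters: dict mapping a cluster id to a list of document ids.
    n_susp, n_src: documents to draw per cluster (hypothetical parameters).
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    split = {}
    for cluster_id, docs in clusters.items():
        shuffled = list(docs)
        rng.shuffle(shuffled)
        # Assumption: a document plays only one role within its cluster.
        suspicious = shuffled[:n_susp]
        source = shuffled[n_susp:n_susp + n_src]
        split[cluster_id] = {"suspicious": suspicious, "source": source}
    return split


if __name__ == "__main__":
    demo = {"networks": ["d1", "d2", "d3", "d4"], "nlp": ["d5", "d6", "d7"]}
    print(split_clusters(demo, n_susp=1, n_src=2))
```

Drawing both sets from the same cluster keeps suspicious and source documents topically related.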

SLIDE 10

Text Alignment Corpus Construction

Source and Suspicious Document Pairs Selection

[Diagram labels: Suspicious Documents, Source Documents]

SLIDE 11

Text Alignment Corpus Construction

Source and Suspicious Document Pairs Selection

[Diagram labels: Suspicious Documents, Source Documents, susp, Similarity Detection system, "if the similarity < 50%"]

SLIDE 12

Text Alignment Corpus Construction

Source and Suspicious Document Pairs Selection

[Diagram labels: Suspicious Documents, Source Documents, susp, src, Similarity Detection system, "if the similarity < 50%"]
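The pair selection step with its "similarity < 50%" condition might look like the sketch below. Two things here are assumptions: that a pair is accepted when its similarity falls below the threshold, and the similarity measure itself. The slides do not describe the similarity detection system, so a simple word-set Jaccard score stands in for it; the names `jaccard_similarity` and `select_pairs` are hypothetical.

```python
def jaccard_similarity(text_a, text_b):
    """Word-set Jaccard similarity in [0, 1]; a stand-in for the real system."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_pairs(suspicious, sources, threshold=0.5,
                 similarity=jaccard_similarity):
    """Return (susp_id, src_id) pairs whose similarity is below the threshold.

    suspicious, sources: dicts mapping document ids to their text.
    """
    pairs = []
    for susp_id, susp_text in suspicious.items():
        for src_id, src_text in sources.items():
            # Assumption: pairs under the 50% threshold are the ones kept.
            if similarity(susp_text, src_text) < threshold:
                pairs.append((susp_id, src_id))
    return pairs


if __name__ == "__main__":
    susp = {"susp1": "a b c d"}
    srcs = {"src1": "a b c d", "src2": "a e f g"}
    print(select_pairs(susp, srcs))
```

Passing `similarity` as a parameter lets a different detector be plugged in without changing the selection logic.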

SLIDE 13

Text Alignment Corpus Construction

Source Documents Segmentation

[Diagram labels: Suspicious Documents, Source Documents, src, Similarity Detection system, "if the similarity < 50%"]

SLIDE 14

Text Alignment Corpus Construction

Segment Extraction

[Diagram labels: Suspicious Documents, Source Documents, src, Similarity Detection system, "if the similarity < 50%"]

SLIDE 15

Text Alignment Corpus Construction

Segment Obfuscation

[Diagram labels: Suspicious Documents, Source Documents, src, Similarity Detection system, "if the similarity < 50%", Segment Obfuscation]

SLIDE 16

Text Alignment Corpus Construction

Obfuscated Segment Insertion

[Diagram labels: Suspicious Documents, Source Documents, susp, src, Similarity Detection system, "if the similarity < 50%", Segment Obfuscation]
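The last three steps (segment extraction, obfuscation, insertion) can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name `insert_segment`, the character-offset bookkeeping, and the random insertion position are hypothetical, and the default `obfuscate` callable is the identity function, which corresponds to an exact copy. A real pipeline would likely insert at sentence or paragraph boundaries.

```python
import random


def insert_segment(susp_text, src_text, src_offset, src_length,
                   obfuscate=lambda segment: segment, rng=None):
    """Copy a source segment into a suspicious document and record offsets.

    obfuscate: callable applied to the extracted segment (identity by
    default, i.e. an exact copy); any other strategy can be passed in.
    """
    rng = rng or random.Random(0)
    # Segment extraction: slice the chosen span out of the source document.
    segment = src_text[src_offset:src_offset + src_length]
    # Segment obfuscation: rewrite the span with the chosen strategy.
    obfuscated = obfuscate(segment)
    # Obfuscated segment insertion: pick a character position (assumption).
    this_offset = rng.randint(0, len(susp_text))
    new_text = susp_text[:this_offset] + obfuscated + susp_text[this_offset:]
    case = {
        "this_offset": this_offset,
        "this_length": len(obfuscated),
        "source_offset": src_offset,
        "source_length": len(segment),
    }
    return new_text, case


if __name__ == "__main__":
    text, case = insert_segment("one two three", "alpha beta gamma delta", 6, 10)
    print(text)
    print(case)
```

Recording both the source span and the inserted span is what later allows a detector's output to be scored against the ground truth.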

SLIDE 17

Strategies for Plagiarism Types

Exact Copy
Near Copy
Modified Copy
Text Manipulation (Paraphrasing)
Text Manipulation (Summarizing)
Automatic Translation
Manual Translation
Cyclic Translation
Idea Adoption (semantic-based meaning)

SLIDE 18

Strategies for Plagiarism Types

Exact Copy

[Diagram labels: Segment Obfuscation, susp, src]

SLIDE 19

Strategies for Plagiarism Types

Near Copy

Idea Adoption (semantic-based meaning)

SLIDE 27

Dataset Statistics

documents                      11,089
plagiarism cases               11,603
languages                      fa

Document purpose
  source documents             48%
  suspicious documents
    with plagiarism            28%
    w/o plagiarism             24%

Document length
  short (<10 pages)            64%
  medium (10-100 pages)        35%
  long (>100 pages)             1%

Plagiarism per document
  hardly (<20%)                25%
  medium (20%-50%)             20%
  much (50%-80%)               26%
  entirely (>80%)              29%

Case length
  short (<1k characters)       37%
  medium (1k-3k characters)    55%
  long (>3k characters)         8%
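The category thresholds in these statistics can be expressed as small bucketing functions, for example to recompute the distribution from a list of cases. This is only a sketch: the function names are hypothetical, and how values that fall exactly on a boundary (such as 3,000 characters or 50%) are assigned is an assumption, since the slide does not say.

```python
def case_length_bucket(n_chars):
    """Bucket a plagiarism case by its length in characters."""
    if n_chars < 1000:
        return "short"
    if n_chars <= 3000:  # assumption: exactly 3k counts as medium
        return "medium"
    return "long"


def plagiarism_ratio_bucket(ratio):
    """Bucket a suspicious document by its plagiarized fraction (0.0-1.0)."""
    if ratio < 0.20:
        return "hardly"
    if ratio < 0.50:  # assumption: exactly 50% counts as much
        return "medium"
    if ratio <= 0.80:
        return "much"
    return "entirely"


if __name__ == "__main__":
    print(case_length_bucket(1500), plagiarism_ratio_bucket(0.65))
```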

SLIDE 28

Conclusion

This article describes a methodology for building a Persian corpus for evaluating plagiarism detection systems.
The corpus is in PAN format.
Corpus construction focuses on simulating different types of plagiarism.
Different strategies are employed to create obfuscation in each plagiarism category.
A large volume of plagiarism cases of various types is created in this corpus.
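For readers unfamiliar with the PAN format, the sketch below writes one annotation file in the style commonly used by PAN text alignment corpora: a `document` element referring to the suspicious file, with one `feature` element per plagiarism case. The slides do not show the exact attributes used for this corpus, so the attribute set, the file names, and the obfuscation label here are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def pan_annotation(susp_name, cases):
    """Build a PAN-style XML annotation for one suspicious document.

    cases: list of dicts with offsets and lengths in characters.
    """
    root = ET.Element("document", {"reference": susp_name})
    for case in cases:
        ET.SubElement(root, "feature", {
            "name": "plagiarism",
            "obfuscation": case["obfuscation"],  # label is hypothetical
            "this_offset": str(case["this_offset"]),
            "this_length": str(case["this_length"]),
            "source_reference": case["source_reference"],
            "source_offset": str(case["source_offset"]),
            "source_length": str(case["source_length"]),
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    example_case = {
        "obfuscation": "exact-copy",
        "this_offset": 120, "this_length": 850,
        "source_reference": "source-document00042.txt",
        "source_offset": 3000, "source_length": 850,
    }
    print(pan_annotation("suspicious-document00001.txt", [example_case]))
```

Each suspicious and source document pair then has a machine-readable record of where the inserted passage starts and ends on both sides.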

SLIDE 29

References

1. Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Potthast, M. and Rosso, P. 2016. Overview of the PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus Construction. Notebook Papers of FIRE 2016, FIRE-2016, CEUR-WS.org.
2. Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 2.
3. Potthast, M., Stein, B. et al. 2010. An Evaluation Framework for Plagiarism Detection. Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010, Beijing. ACL.
4. Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard Text Preparation for Persian Language. CAASL3: Third Workshop on Computational Approaches to Arabic Script-based Languages.
5. Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from abstracts. North American Fuzzy Information Processing Society, NAFIPS 2003, Banff, Canada.
6. Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20th Computer Society of Iran computer conference.

SLIDE 30

References

7. Adelkhah, R., Sadeghi, R. and Shamsfard, M. 2016. Persian sentence understanding and generation: a mutual conversion. The 21st Computer Society of Iran computer conference.
8. Potthast, M., Göring, S. et al. 2015. Towards Data Submissions for Shared Task: First Experiences for the Task of Text Alignment. Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings (September 2015), ISSN 1613-0073.
9. Potthast, M., Hagen, M., Gollub, T. et al. 2013. Overview of the 5th International Competition on Plagiarism Detection. Working Notes Papers of the CLEF 2013 Evaluation Labs and Workshop (September 2013), ISBN 978-88-904810-3-1.
10. Manku, G. S., Jain, A. and Sarma, A. D. 2007. Detecting Near-Duplicates for Web Crawling. Data mining.
11. Kamran, K., Ahmadi, A. and Kazemivanhari, F. 2013. Plagiarism detection in Persian text using Fingerprint algorithms. The 21st Iranian Conference on Electrical Engineering.
12. Davarpanah, M. R., Sanji, M. and Aramideh, M. 2009. Farsi Lexical Analysis and Stop Word List. Library Hi Tech, vol. 27, pp. 435–449.

SLIDE 31

References

13. Iran Telecommunication Research Center (ITRC), 2013. Buali Sina University. http://217.218.62.234:8080/.
14. Shamsfard, M., Hesabi, A., Fadaei, H. et al. 2010. Semi Automatic Development of FarsNet; The Persian WordNet. Proceedings of 5th Global WordNet Conference.
15. Mashhadirajab, F. and Shamsfard, M. 2014. Plagiarism Detection in Persian documents. Master's thesis. Shahid Beheshti University.
16. Potthast, M., Eiselt, A. et al. 2011. Overview of the 3rd International Competition on Plagiarism Detection. Notebook Papers of CLEF 2011 Labs and Workshops (September 2011), ISBN 978-88-904810-1-7.
17. Potthast, M., Gollub, T. et al. 2012. Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers (September 2012), ISBN 978-88-904810-3-1.
18. Potthast, M., Hagen, M. et al. 2014. Overview of the 6th International Competition on Plagiarism Detection. CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers (September 2014).

SLIDE 32