SLIDE 1

A Text Alignment Corpus for Persian Plagiarism Detection

Fatemeh Mashhadirajab, Mehrnoush Shamsfard, Razieh Adelkhah, Fatemeh Shafiee, Chakaveh Saedi

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

Persian Plagdet 2016

SLIDE 2

Outline

  • Introduction
  • Text Alignment Corpus Construction
  • Strategies for Plagiarism Types
  • Dataset Statistics
  • Conclusions
  • References

SLIDE 3

Introduction

Figure: A taxonomy of plagiarism [2]

SLIDE 4

Text Alignment Corpus Construction

  • Data Source Preparation
  • Documents Clustering
  • Set of Suspicious and Source Documents
  • Source and Suspicious Document Pairs Selection
  • Source Documents Segmentation
  • Segment Extraction
  • Segment Obfuscation
  • Obfuscated Segment Insertion

SLIDE 5

Text Alignment Corpus Construction

Data Source Preparation

  • Articles or theses in the fields of computer science and engineering, and electrical engineering
  • 1,500 documents from articles and theses available from online stores
  • 4,500 documents from Wikipedia articles

SLIDE 6

Text Alignment Corpus Construction

Data Source Preparation

  • Our corpus contains 11,089 documents

SLIDE 7

Text Alignment Corpus Construction

Documents Clustering

SLIDE 8

Text Alignment Corpus Construction

Set of Suspicious and Source Documents

SLIDE 9

Text Alignment Corpus Construction

Set of Suspicious and Source Documents

Suspicious documents and source documents are randomly selected from each cluster.

SLIDE 10

Text Alignment Corpus Construction

Source and Suspicious Document Pairs Selection

Diagram labels: Suspicious Documents, Source Documents

SLIDE 11

Text Alignment Corpus Construction

Source and Suspicious Document Pairs Selection

Diagram labels: Suspicious Documents, Source Documents, susp, Similarity Detection system, "if the similarity < 50%"

SLIDE 12

Text Alignment Corpus Construction

Source and Suspicious Document Pairs Selection

Diagram labels: Suspicious Documents, Source Documents, susp, src, Similarity Detection system, "if the similarity < 50%"
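
The pair selection step can be sketched as follows. This is only an illustration under stated assumptions: the slides do not name the similarity detection system, so a word-set Jaccard score stands in for it, and the sketch assumes that pairs scoring below the 50% threshold are the ones kept. The function names and document ids are hypothetical.

```python
from itertools import product


def jaccard_similarity(text_a, text_b):
    """Word-set Jaccard similarity in [0, 1]; a stand-in for the real detector."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def select_pairs(suspicious, sources, threshold=0.5):
    """Return (susp_id, src_id) pairs whose similarity is below the threshold.

    Assumption: a pair is kept when its similarity is under the threshold.
    """
    return [
        (susp_id, src_id)
        for (susp_id, susp_text), (src_id, src_text) in product(
            suspicious.items(), sources.items()
        )
        if jaccard_similarity(susp_text, src_text) < threshold
    ]


# Toy English documents, used only to keep the example readable.
suspicious_docs = {"susp-1": "plagiarism detection in persian text"}
source_docs = {
    "src-1": "plagiarism detection in persian text documents",
    "src-2": "image segmentation with neural networks",
}
selected_pairs = select_pairs(suspicious_docs, source_docs)
```

In this toy run only the pair with `src-2` falls below the threshold, since `src-1` shares five of its six distinct words with the suspicious document.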

SLIDE 13

Text Alignment Corpus Construction

Source Documents Segmentation

Diagram labels: Suspicious Documents, Source Documents, src, Similarity Detection system, "if the similarity < 50%"

SLIDE 14

Text Alignment Corpus Construction

Segment Extraction

Diagram labels: Suspicious Documents, Source Documents, src, Similarity Detection system, "if the similarity < 50%"

SLIDE 15

Text Alignment Corpus Construction

Segment Obfuscation

Diagram labels: Suspicious Documents, Source Documents, src, Segment Obfuscation, Similarity Detection system, "if the similarity < 50%"

SLIDE 16

Text Alignment Corpus Construction

Obfuscated Segment Insertion

Diagram labels: Suspicious Documents, Source Documents, src, susp, Segment Obfuscation, Similarity Detection system, "if the similarity < 50%"
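
The segmentation, extraction, obfuscation, and insertion steps might be chained roughly as below. This is a sketch, not the authors' implementation: the slides do not say how segments are delimited or where they are inserted, so sentence-like segments and a random character position are assumed. `segment_source`, `build_case`, and the offset field names are hypothetical.

```python
import random
import re


def segment_source(text):
    """Split text into sentence-like segments as (offset, segment) pairs.

    Segments keep their leading whitespace so that offsets stay exact.
    """
    return [(m.start(), m.group()) for m in re.finditer(r"[^.!?؟]+[.!?؟]*", text)]


def build_case(source_text, suspicious_text, obfuscate=lambda s: s, seed=0):
    """Extract one source segment, obfuscate it, and insert it into the suspicious text."""
    rng = random.Random(seed)
    source_offset, segment = rng.choice(segment_source(source_text))
    obfuscated = obfuscate(segment)
    # Assumption: any character position is a valid insertion point.
    this_offset = rng.randint(0, len(suspicious_text))
    new_suspicious = (
        suspicious_text[:this_offset] + obfuscated + suspicious_text[this_offset:]
    )
    case = {
        "source_offset": source_offset,
        "source_length": len(segment),
        "this_offset": this_offset,
        "this_length": len(obfuscated),
    }
    return new_suspicious, case


source = "Corpora are useful. Plagiarism is hard to detect. Evaluation needs data."
suspicious = "This report discusses text reuse."
new_suspicious, case = build_case(source, suspicious)
```

With the default identity `obfuscate`, the recorded offsets point at identical text in both documents, which corresponds to an exact copy. Passing a different `obfuscate` function yields the other plagiarism types.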

SLIDE 17

Strategies for Plagiarism Types

  • Exact Copy
  • Near Copy
  • Modified Copy
  • Text Manipulation (Paraphrasing)
  • Text Manipulation (Summarizing)
  • Automatic Translation
  • Manual Translation
  • Cyclic Translation
  • Idea Adoption (semantic-based meaning)

SLIDE 18

Strategies for Plagiarism Types

Exact Copy

Diagram labels: src, Segment Obfuscation, susp

SLIDE 19

Strategies for Plagiarism Types

Near Copy

  • Insertion
  • Deletion
  • Substitution
  • Sentence split or join

Diagram labels: src, Segment Obfuscation, susp
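
The near-copy operations listed on this slide could look like the sketch below. The slide does not say whether edits apply to words or characters, nor how many are applied, so word-level edits and a small fixed count are assumed. The filler words, the connective, and all function names are hypothetical, and English text is used only for readability.

```python
import random


def near_copy(segment, rng, filler=("also", "indeed"), edits=2):
    """Apply a few random word-level insertions, deletions, and substitutions."""
    words = segment.split()
    for _ in range(edits):
        if not words:
            break
        operation = rng.choice(["insert", "delete", "substitute"])
        position = rng.randrange(len(words))
        if operation == "insert":
            words.insert(position, rng.choice(filler))
        elif operation == "delete" and len(words) > 1:
            del words[position]
        else:
            # Substitution, also used when a deletion would empty the segment.
            words[position] = rng.choice(filler)
    return " ".join(words)


def join_sentences(first, second, connective="and"):
    """Join two sentences by replacing the first full stop with a connective."""
    return f"{first.rstrip('. ')} {connective} {second[:1].lower()}{second[1:]}"


def split_sentence(sentence, connective=" and "):
    """Split one sentence into two at the first connective, if there is one."""
    head, found, tail = sentence.partition(connective)
    if not found:
        return [sentence]
    return [head.rstrip() + ".", tail[:1].upper() + tail[1:]]


edited = near_copy("the corpus contains many documents", random.Random(1))
```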

SLIDE 20

Strategies for Plagiarism Types

Modified Copy

Segment obfuscation uses the Persian sentence understanding and generation system introduced by Adelkhah et al. [7]:

  • Semantic representation (sentence understanding)
  • Sentence production based on the semantic representation (sentence generation)

Diagram labels: src, Segment Obfuscation, susp

SLIDE 21

Strategies for Plagiarism Types

Text Manipulation (Paraphrasing)

  • The Persian sentence understanding and generation system introduced by Adelkhah et al. [7]
  • Each word is replaced with a synonym retrieved from FarsNet or FavaNet

Diagram labels: src, Segment Obfuscation, susp
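
Synonym replacement can be illustrated with a minimal sketch. No FarsNet or FavaNet API is used here: a small hypothetical in-memory lexicon stands in for those lookups, English words are used for readability, and words without an entry are left unchanged, which is an assumption.

```python
import random

# Hypothetical toy lexicon standing in for FarsNet or FavaNet lookups.
SYNONYMS = {
    "big": ["large", "huge"],
    "fast": ["quick", "rapid"],
}


def paraphrase(segment, synonyms, rng):
    """Replace every word that has a known synonym; keep all other words."""
    return " ".join(
        rng.choice(synonyms[word]) if word in synonyms else word
        for word in segment.split()
    )


paraphrased = paraphrase("a big and fast system", SYNONYMS, random.Random(0))
```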

SLIDE 22

Strategies for Plagiarism Types

Text Manipulation (Summarizing)

  • Persian summarizer introduced by Shafiee et al. [6]

Diagram labels: src, Segment Obfuscation, susp

SLIDE 23

Strategies for Plagiarism Types

Automatic Translation

Diagram labels: src, Segment Obfuscation, susp, Persian, Google Translate (Persian to English), Hunspell spell checker, English

SLIDE 24

Strategies for Plagiarism Types

Manual Translation

Diagram labels: src, Segment Obfuscation, susp, Persian, Translation, English

SLIDE 25

Strategies for Plagiarism Types

Cyclic Translation

Diagram labels: src, Segment Obfuscation, susp, Persian, English, Google Translate, Negar spell checker, Google Translate, Hunspell
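
A cyclic translation round trip might be wired as in the sketch below. The order of steps is an assumption read off the diagram labels: Persian to English, an English spell check (Hunspell on the slide), English back to Persian, then a Persian spell check (Negar on the slide). No real translation or spell-checking API is called; the translator and checkers are injected as plain functions, and the toy dictionaries use transliterated placeholder words.

```python
def cyclic_translate(segment, translate, fix_english, fix_persian):
    """Persian -> English -> Persian, spell-checking after each translation step."""
    english = fix_english(translate(segment, "fa", "en"))
    return fix_persian(translate(english, "en", "fa"))


# Hypothetical stand-ins: transliterated toy dictionaries instead of a real translator.
TOY_FA_EN = {"ketab": "book", "khub": "good"}
TOY_EN_FA = {"book": "neveshteh", "good": "niku"}


def toy_translate(text, source_lang, target_lang):
    """Word-by-word lookup; unknown words pass through unchanged."""
    table = TOY_FA_EN if (source_lang, target_lang) == ("fa", "en") else TOY_EN_FA
    return " ".join(table.get(word, word) for word in text.split())


def no_spell_check(text):
    """Placeholder for a spell checker; returns the text unchanged."""
    return text


round_trip = cyclic_translate("ketab khub", toy_translate, no_spell_check, no_spell_check)
```

The round trip returns different words from the input, which is what makes the strategy useful as an obfuscation: the meaning is roughly preserved while the surface form changes.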

SLIDE 26

Strategies for Plagiarism Types

Idea Adoption (semantic-based meaning)

Diagram labels: src, Segment Obfuscation, susp

SLIDE 27

Dataset Statistics

  documents                    11,089
  plagiarism cases             11,603
  languages                    fa

Document purpose
  source documents             48%
  suspicious documents
    with plagiarism            28%
    w/o plagiarism             24%

Document length
  short (<10 pages)            64%
  medium (10-100 pages)        35%
  long (>100 pages)            1%

Plagiarism per document
  hardly (<20%)                25%
  medium (20%-50%)             20%
  much (50%-80%)               26%
  entirely (>80%)              29%

Case length
  short (<1k characters)       37%
  medium (1k-3k characters)    55%
  long (>3k characters)        8%
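
The categories in these statistics can be reproduced with simple binning, sketched below. The thresholds come from the slide, but the handling of exact boundary values (for example, whether exactly 50% counts as "medium" or "much") is not stated, so the choices here are assumptions. The function names are hypothetical.

```python
from collections import Counter


def case_length_bin(num_chars):
    """Bin a plagiarism case by its length in characters."""
    if num_chars < 1000:
        return "short"
    if num_chars <= 3000:
        return "medium"
    return "long"


def plagiarism_bin(ratio):
    """Bin a suspicious document by its plagiarized fraction (0.0 to 1.0)."""
    if ratio < 0.2:
        return "hardly"
    if ratio <= 0.5:  # assumption: exactly 50% counts as "medium"
        return "medium"
    if ratio <= 0.8:  # assumption: exactly 80% counts as "much"
        return "much"
    return "entirely"


def distribution(labels):
    """Percentage of items per label, rounded to whole percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * count / total) for label, count in counts.items()}


example_lengths = [500, 1500, 2500, 4000]
example_distribution = distribution(case_length_bin(n) for n in example_lengths)
```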

SLIDE 28

Conclusion

  • This article describes a methodology for building a Persian corpus for evaluating plagiarism detection systems.
  • The corpus is in PAN format.
  • In producing the corpus, the focus is on simulating different types of plagiarism.
  • Different strategies are employed to create obfuscation in each plagiarism category.
  • The corpus contains a large volume of cases covering a variety of plagiarism types.
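
For readers unfamiliar with the PAN format, an annotation file for one suspicious and source pair could be produced roughly as below. The attribute names follow the layout commonly seen in PAN text alignment corpora. The exact attribute set used in this corpus, the file names, and the obfuscation label are assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET


def pan_annotation(suspicious_name, source_name, case, obfuscation):
    """Serialize one plagiarism case as a PAN-style XML annotation string."""
    document = ET.Element("document", {"reference": suspicious_name})
    ET.SubElement(
        document,
        "feature",
        {
            "name": "plagiarism",
            "obfuscation": obfuscation,
            "this_offset": str(case["this_offset"]),
            "this_length": str(case["this_length"]),
            "source_reference": source_name,
            "source_offset": str(case["source_offset"]),
            "source_length": str(case["source_length"]),
        },
    )
    return ET.tostring(document, encoding="unicode")


# Hypothetical file names and offsets, for illustration only.
example_case = {
    "this_offset": 120,
    "this_length": 48,
    "source_offset": 300,
    "source_length": 50,
}
annotation_xml = pan_annotation(
    "suspicious-document00001.txt",
    "source-document00042.txt",
    example_case,
    "near-copy",
)
```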

SLIDE 29

References

  • 1. Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Potthast, M. and Rosso, P. 2016. Overview of the PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus Construction. Notebook Papers of FIRE 2016, CEUR-WS.org.
  • 2. Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 42, no. 2.
  • 3. Potthast, M., Stein, B. et al. 2010. An Evaluation Framework for Plagiarism Detection. Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Beijing, ACL.
  • 4. Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard Text Preparation for Persian Language. CAASL3: Third Workshop on Computational Approaches to Arabic Script-based Languages.
  • 5. Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from abstracts. North American Fuzzy Information Processing Society, NAFIPS 2003, Banff, Canada.
  • 6. Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20th Computer Society of Iran computer conference.

SLIDE 30

References

  • 7. Adelkhah, R., Sadeghi, R. and Shamsfard, M. 2016. Persian sentence understanding and generation: a mutual conversion. The 21st Computer Society of Iran computer conference.
  • 8. Potthast, M., Göring, S. et al. 2015. Towards Data Submissions for Shared Task: First Experiences for the Task of Text Alignment. Working Notes Papers of the CLEF 2015 Evaluation Labs, CEUR Workshop Proceedings (September 2015), ISSN 1613-0073.
  • 9. Potthast, M., Hagen, M., Gollub, T. et al. 2013. Overview of the 5th International Competition on Plagiarism Detection. Working Notes Papers of the CLEF 2013 Evaluation Labs and Workshop (September 2013), ISBN 978-88-904810-3-1.
  • 10. Manku, G. S., Jain, A. and Sarma, A. D. 2007. Detecting Near-Duplicates for Web Crawling. Data mining.
  • 11. Kamran, K., Ahmadi, A. and Kazemivanhari, F. 2013. Plagiarism detection in Persian text using Fingerprint algorithms. The 21st Iranian Conference on Electrical Engineering.
  • 12. Davarpanah, M. R., Sanji, M. and Aramideh, M. 2009. Farsi Lexical Analysis and Stop Word List. Library Hi Tech, vol. 27, pp. 435–449.

SLIDE 31

References

  • 13. Iran Telecommunication Research Center (ITRC). 2013. Buali Sina University. http://217.218.62.234:8080/.
  • 14. Shamsfard, M., Hesabi, A., Fadaei, H. et al. 2010. Semi Automatic Development of FarsNet; The Persian WordNet. Proceedings of the 5th Global WordNet Conference.
  • 15. Mashhadirajab, F. and Shamsfard, M. 2014. Plagiarism Detection in Persian documents. Master's thesis, Shahid Beheshti University.
  • 16. Potthast, M., Eiselt, A. et al. 2011. Overview of the 3rd International Competition on Plagiarism Detection. Notebook Papers of CLEF 2011 Labs and Workshops (September 2011), ISBN 978-88-904810-1-7.
  • 17. Potthast, M., Gollub, T. et al. 2012. Overview of the 4th International Competition on Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop, Working Notes Papers (September 2012), ISBN 978-88-904810-3-1.
  • 18. Potthast, M., Hagen, M. et al. 2014. Overview of the 6th International Competition on Plagiarism Detection. CLEF 2014 Evaluation Labs and Workshop, Working Notes Papers (September 2014).
