Deep Investigation of Cross-Language Plagiarism Detection Methods

Authors: Jérémy Ferrero, Laurent Besacier, Didier Schwab and Frédéric Agnès
BUCC - August 2017

What is Cross-Language Plagiarism Detection?
Cross-language plagiarism is plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically). Given a text in a language L, we must find similar passage(s) in other text(s) from a set of candidate texts in a language L' (cross-language textual similarity).

Why is it so important?
Sources:
- McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School.
- Josephson Institute. (2011). What would honest Abe Lincoln say?

Research Questions
• How do the state-of-the-art methods behave according to the characteristics of the compared texts?
• Do the methods depend on the characteristics of the compared texts? And if so, which characteristics?
• Are the state-of-the-art methods complementary?

State-of-the-Art Methods
• MT-Based Models: Translation + Monolingual Analysis [Muhr et al., 2010]
• Comparable Corpora-Based Models: CL-KGA, CL-ESA [Potthast et al., 2008]
• Parallel Corpora-Based Models: CL-ASA [Pinto et al., 2009], CL-LSI, CL-KCCA
• Dictionary-Based Models: CL-VSM, CL-CTS [Pataki, 2012]
• Syntax-Based Models: Length Model, CL-CnG [Potthast et al., 2011], Cognateness

CL-C3G [Potthast et al., 2011]
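The slide's illustration did not survive extraction. As a rough sketch of the general idea behind a character 3-gram model, the snippet below splits two texts into overlapping character 3-grams and compares the resulting count vectors with cosine similarity. The function names, the normalisation steps and the use of raw counts (rather than a weighting such as tf-idf) are assumptions made for illustration, not the authors' implementation.

```python
import math
import re
import unicodedata
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after light normalisation."""
    # Assumption: lowercase and strip diacritics so "é" and "e" share n-grams.
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep letters and digits only; everything else collapses to a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cl_c3g_similarity(source: str, target: str) -> float:
    """Hypothetical helper: similarity of two texts via character 3-grams."""
    return cosine(char_ngrams(source), char_ngrams(target))


if __name__ == "__main__":
    en = "The international organization adopted the resolution"
    fr = "L'organisation internationale a adopté la résolution"
    print(cl_c3g_similarity(en, fr))
```

Because the comparison works on surface characters, it benefits from language pairs that share many cognates and needs no translation resource.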

CL-CTS [Pataki, 2012]
We use DBNary [Sérasset, 2015] as the linked lexical resource.
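To make the role of a linked lexical resource concrete, here is a toy sketch in which words of each language are mapped to language-independent concept identifiers and two texts are compared by the overlap of their concept sets. The tiny `TOY_CONCEPTS` table is made up and merely stands in for a resource such as DBNary; the helper names and the choice of Jaccard overlap are likewise assumptions, not the method as implemented by the authors.

```python
# Made-up stand-in for a linked lexical resource: word -> concept identifier.
TOY_CONCEPTS = {
    "en": {"house": "c_house", "red": "c_red", "cat": "c_cat"},
    "fr": {"maison": "c_house", "rouge": "c_red", "chien": "c_dog"},
}


def to_concepts(text: str, lang: str) -> set:
    """Map the known words of a text to their concept identifiers."""
    lexicon = TOY_CONCEPTS[lang]
    return {lexicon[word] for word in text.lower().split() if word in lexicon}


def concept_jaccard(src: str, src_lang: str, tgt: str, tgt_lang: str) -> float:
    """Jaccard overlap of the two concept sets (0.0 when both are empty)."""
    a, b = to_concepts(src, src_lang), to_concepts(tgt, tgt_lang)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(concept_jaccard("the red house", "en", "la maison rouge", "fr"))
    print(concept_jaccard("the red cat", "en", "le chien rouge", "fr"))
```

Words missing from the lexicon are simply ignored here; a real system would also need lemmatisation and stop-word handling.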

CL-ASA [Pinto et al., 2009]

CL-ESA [Potthast et al., 2008]

T+MA [Muhr et al., 2010]

Evaluation Dataset [Ferrero et al., 2016]¹
• French, English and Spanish;
• Parallel and comparable (mix of Wikipedia, conference papers, product reviews, Europarl and JRC);
• Different granularities: document level, sentence level and chunk level;
• Human and machine translated texts;
• Obfuscated (to make the similarity detection more complicated) and without added noise;
• Written and translated by multiple types of authors;
• Cover various fields.
¹ A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of LREC 2016.
https://github.com/FerreroJeremy/Cross-Language-Dataset

First experiment: Evaluation Protocol
• We compare each textual unit to its corresponding unit in another language and to 999 other units randomly selected;
• We threshold the obtained distance matrix to find the threshold giving the best F1 score;
• We repeat these two steps 10 times, leading to a 10-fold validation;
• The final value is the average of the 10 F1 scores.
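The protocol above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' evaluation code: `best_f1` and `evaluate` are hypothetical names, the comparison function is expressed as a similarity (higher means more alike) rather than a distance, and the negatives are redrawn with a seeded random generator on each run.

```python
import random
from typing import Callable, Sequence


def best_f1(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Best F1 over all thresholds, predicting positive when score >= threshold."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best, tp, fp = 0.0, 0, 0
    for i, (score, is_pos) in enumerate(ranked):
        tp += is_pos
        fp += not is_pos
        # Evaluate only once every item tied at this score has been counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp:
            precision, recall = tp / (tp + fp), tp / total_pos
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


def evaluate(similarity: Callable[[str, str], float],
             sources: Sequence[str], targets: Sequence[str],
             n_negatives: int = 999, n_runs: int = 10, seed: int = 0) -> float:
    """Average, over n_runs, of the best F1 found by thresholding the scores."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(n_runs):
        scores, labels = [], []
        for i, src in enumerate(sources):
            # The aligned unit in the other language is the single positive.
            scores.append(similarity(src, targets[i]))
            labels.append(True)
            # Randomly selected other units act as negatives.
            others = [j for j in range(len(targets)) if j != i]
            for j in rng.sample(others, min(n_negatives, len(others))):
                scores.append(similarity(src, targets[j]))
                labels.append(False)
        run_scores.append(best_f1(scores, labels))
    return sum(run_scores) / len(run_scores)
```

With one positive for every 999 negatives, precision is hard to keep high, which is one reason to report F1 at the best threshold rather than accuracy.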

Results: Across Language Pairs
Table 1: Overall F1 score over all sub-corpora of the state-of-the-art methods for each language pair (EN: English; FR: French; ES: Spanish).
Columns (chunk level and sentence level): Methods | EN → FR | FR → EN | EN → ES | ES → EN | ES → FR | FR → ES
Values in extraction order (row alignment not recoverable):
0.4633 0.2694 0.3523 0.3576 CL-ASA 0.4575 0.4645 0.3204 0.3171 0.4734 0.3098 CL-CTS 0.4577 0.4577 0.3819 0.3819 0.4931 0.4931 CL-C3G 0.2531 0.2843 0.3505 0.3525 0.3673 0.3526 0.3692 CL-ESA 0.3760 T+MA 0.1383 0.1383 0.1337 0.1337 0.1430 0.1430 Chunk level
Sentence level 0.4250 0.4252 0.3140 CL-ASA 0.4169 0.4203 0.3881 0.3780 0.4116 CL-CTS 0.3941 0.4795 0.4795 0.4375 0.4375 0.5071 0.5071 CL-C3G 0.4083 0.4738 0.3736 T+MA 0.3158 0.3540 0.3279 0.3177 0.3730 0.3634 0.1520 0.1520 0.1476 0.1476 0.1499 0.1499 CL-ESA

Results: Across Language Pairs
Table 2: Top 3 methods by source and target language.
(a) Chunk granularity; (b) Sentence granularity
Entries in extraction order (cell alignment not recoverable):
CL-C3G T+MA CL-CTS T+MA CL-C3G T+MA CL-CTS CL-CTS CL-C3G CL-ASA CL-CTS CL-CTS CL-ASA CL-C3G CL-C3G
EN ↔ FR ES ↔ FR EN ↔ FR EN ↔ ES ES → FR EN ↔ ES FR → ES

Results: Across Language Pairs
Table 3: Pearson correlations of the overall F1 score over all sub-corpora of all methods between the different language pairs (EN: English; FR: French; ES: Spanish).
Rows and columns (chunk level and sentence level): EN → FR | FR → EN | EN → ES | ES → EN | ES → FR | FR → ES, plus an Overall row
Values in extraction order (cell alignment not recoverable):
0.991 0.981 0.989 0.924 0.931 1.000 0.971 0.982 0.922 1.000 0.929 1.000 1.000 Lang. Pair Overall
Sentence level 0.971 0.997 1.000 0.971 0.966 1.000 0.997 0.925 1.000 0.949 0.922 0.928 1.000 0.949 0.913 0.970 0.994 0.990 1.000 0.987 0.971 0.980 0.980 Overall 1.000 0.967 0.980 0.940 Lang. Pair 0.957 0.995 0.998 1.000 0.996 0.949 0.988 0.998
Chunk level 1.000 0.983 0.991 0.965 0.978 1.000
Strong correlation between languages!
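For reference, a Pearson correlation between two score lists can be computed as in the sketch below. The function name `pearson` and the sample vectors are made up for illustration; they are not values from the tables.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


if __name__ == "__main__":
    # Made-up F1 scores of the same methods on two language pairs.
    pair_a = [0.45, 0.40, 0.35, 0.15, 0.38]
    pair_b = [0.47, 0.41, 0.33, 0.14, 0.36]
    print(pearson(pair_a, pair_b))
```

A coefficient close to 1 means the methods rank and scale similarly on both language pairs, which is how the "strong correlation" remarks on these slides should be read.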

Results: Across Language Pairs
Table 4: Pearson correlations of the results of all methods on all sub-corpora, between the chunk and the sentence granularity, by language pair (EN: English; FR: French; ES: Spanish) (calculated from Table 1).
Lang. Pair: EN → FR | FR → EN | EN → ES | ES → EN | ES → FR | FR → ES
Correlation: 0.939 | 0.932 | 0.838 | 0.833 | 0.946 | 0.907
Strong correlation between granularities!

Results: Across Language Pairs
Table 5: Pearson correlations of the results on all sub-corpora on all language pairs, between the chunk and the sentence granularity, by methods (calculated from Table 1).
Methods: CL-C3G | CL-CTS | CL-ASA | CL-ESA | T+MA
Correlation: 0.996 | 0.970 | 0.649 | 0.515 | 0.780
Strong correlation between granularities!
