Using Word Embedding for Cross-Language Plagiarism Detection
Authors: Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab
EACL - April 2017

What is Cross-Language Plagiarism Detection?
Cross-language plagiarism is plagiarism by translation: a text is plagiarized by translating it, manually or automatically, from another language. Given a text in a language L, the task is to find similar passage(s) among a set of candidate texts in another language L’ (cross-language textual similarity).

Why is it so important?
Sources:
- McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School.
- Josephson Institute. (2011). What would honest Abe Lincoln say?

Research Questions
• Are Word Embeddings useful for cross-language plagiarism detection?
• Is syntax weighting in distributed representations of sentences useful for text entailment?
• Are cross-language plagiarism detection methods complementary?

State-of-the-Art Methods
• MT-Based Models: Translation + Monolingual Analysis [Muhr et al., 2010, Barrón-Cedeño, 2012]
• Comparable Corpora-Based Models: CL-KGA, CL-ESA [Gabrilovich and Markovitch, 2007, Potthast et al., 2008]
• Parallel Corpora-Based Models: CL-ASA [Barrón-Cedeño et al., 2008, Pinto et al., 2009], CL-LSI, CL-KCCA
• Dictionary-Based Models: CL-VSM, CL-CTS [Gupta et al., 2012, Pataki, 2012]
• Syntax-Based Models: Length Model, CL-CnG [McNamee and Mayfield, 2004, Potthast et al., 2011], Cognateness

Augmented CL-CTS
We use DBNary [Sérasset, 2015] as the linked lexical resource.

CL-WES: Cross-Language Word Embedding-based Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec).
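
The slide names the measure without spelling out the computation. The following is a minimal sketch, assuming each sentence is represented by the sum of its word vectors in a shared bilingual embedding space and that two sentences are compared by cosine similarity. The toy `EMBEDDINGS` table and all function names are hypothetical illustrations, not MultiVec's API.

```python
import math

# Hypothetical 2-dimensional bilingual embeddings (English and French words
# share one vector space). Real vectors would come from a trained model.
EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "chat": [0.9, 0.1],
    "sleeps": [0.0, 1.0],
    "dort": [0.1, 0.9],
}


def sentence_vector(tokens, embeddings):
    """Sum the vectors of all known tokens; unknown tokens are skipped."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for token in tokens:
        word_vec = embeddings.get(token)
        if word_vec is None:
            continue
        for i, x in enumerate(word_vec):
            vec[i] += x
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cl_wes(source_tokens, target_tokens, embeddings=EMBEDDINGS):
    """Similarity between a source-language and a target-language sentence."""
    return cosine(
        sentence_vector(source_tokens, embeddings),
        sentence_vector(target_tokens, embeddings),
    )


similar = cl_wes(["cat", "sleeps"], ["chat", "dort"])
dissimilar = cl_wes(["cat"], ["dort"])
```

With these toy vectors the translated pair scores close to 1 and the unrelated pair scores much lower, which is the behaviour a plagiarism detector would threshold on.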

CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec).
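
The slide does not detail how syntax enters the measure. One plausible reading, assumed here, is that each word vector is scaled by a weight tied to its part-of-speech tag before the sentence vectors are summed and compared by cosine similarity. The weights, tags, toy embeddings and function names below are all hypothetical.

```python
import math

# Hypothetical part-of-speech weights: content words count more than
# function words. These values are illustrative only.
POS_WEIGHTS = {"NOUN": 2.0, "VERB": 1.0, "DET": 0.1}

# Hypothetical 2-dimensional bilingual embeddings.
EMBEDDINGS = {
    "the": [0.0, 1.0],
    "cat": [1.0, 0.0],
    "chat": [1.0, 0.0],
}


def weighted_sentence_vector(tagged_tokens, embeddings, pos_weights):
    """Sum word vectors, each scaled by the weight of its POS tag.

    Tags missing from pos_weights fall back to a weight of 1.0.
    """
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for word, pos in tagged_tokens:
        word_vec = embeddings.get(word)
        if word_vec is None:
            continue
        weight = pos_weights.get(pos, 1.0)
        for i, x in enumerate(word_vec):
            vec[i] += weight * x
    return vec


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cl_wess(source, target, embeddings=EMBEDDINGS, pos_weights=POS_WEIGHTS):
    """Syntax-weighted similarity between two POS-tagged sentences."""
    return cosine(
        weighted_sentence_vector(source, embeddings, pos_weights),
        weighted_sentence_vector(target, embeddings, pos_weights),
    )


english = [("the", "DET"), ("cat", "NOUN")]
french = [("chat", "NOUN")]
weighted = cl_wess(english, french)
uniform = cl_wess(english, french, pos_weights={})  # every tag weighs 1.0
```

In this toy case, down-weighting the determiner moves the score from about 0.71 to nearly 1. The evaluation protocol reserves folds for tuning CL-WESS, which suggests that such weights are learned from data rather than set by hand.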

Evaluation Dataset [Ferrero et al., 2016]
• French, English and Spanish;
• Parallel and comparable (mix of Wikipedia, conference papers, product reviews, Europarl and JRC);
• Different granularities: document level, sentence level and chunk level;
• Human and machine translated texts;
• Obfuscated (to make the similarity detection more complicated) and without added noise;
• Written and translated by multiple types of authors;
• Cover various fields.
[Ferrero et al., 2016] A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of LREC 2016.
https://github.com/FerreroJeremy/Cross-Language-Dataset

Evaluation Protocol
• We compare each English textual unit to its corresponding French unit and to 999 other units randomly selected;
• We threshold the obtained distance matrix to find the threshold giving the best F1 score;
• We repeat these two steps 10 times, leading to 10 folds:
- 2 folds for tuning (CL-WESS) and fusion (Decision Tree)
- 8 folds for validation
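
The thresholding step can be sketched as a sweep over candidate thresholds. This is a minimal illustration, not the authors' evaluation code: it assumes each compared pair has a similarity score (higher means more similar) and a gold label, and the function name and sample data are hypothetical.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximising F1.

    A pair is predicted positive when its score is >= threshold.
    Every distinct score is tried as a candidate threshold.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        if tp == 0:
            f1 = 0.0
        else:
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical scores for five pairs; label 1 marks a true translation pair.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
threshold, f1 = best_f1_threshold(scores, labels)
```

On this sample the sweep settles on a threshold of 0.3, which recovers all three true pairs at the cost of one false positive.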

Results

Table: Average F1 scores of methods applied on EN → FR sub-corpora.

Overall (%)                 Chunk-Level   Sentence-Level
State-of-the-Art Methods
CL-C3G                      50.76         49.34
CL-CTS                      42.84         47.50
CL-ASA                      47.32         35.81
CL-ESA                      14.81         14.44
T+MA                        37.12         37.42
New Proposed Methods
CL-CTS-WE                   46.67         50.69
CL-WES                      41.95         41.43
CL-WESS                     53.73         56.35
Decision Tree               88.50         89.15

• CL-CTS-WE boosts CL-CTS (+3.83% on chunks and +3.19% on sentences);
• CL-WESS boosts CL-WES (+11.78% on chunks and +14.92% on sentences);
• CL-WESS is better than CL-C3G (+2.97% on chunks and +7.01% on sentences);
• Decision Tree fusion significantly improves the results.

CL-C3G: Cross-Language Character 3-Gram
CL-CTS-WE: Cross-Language Conceptual Thesaurus-based Similarity with Word-Embedding
CL-WES: Cross-Language Word Embedding-based Similarity
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
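
The gains quoted in the bullets are absolute differences between F1 scores, that is, percentage points rather than relative improvements. A short check using only the figures reported on the slide (the `F1` dictionary layout and the `gain` helper are illustrative):

```python
# (chunk-level F1, sentence-level F1) as reported on the slide, in percent.
F1 = {
    "CL-C3G": (50.76, 49.34),
    "CL-CTS": (42.84, 47.50),
    "CL-CTS-WE": (46.67, 50.69),
    "CL-WES": (41.95, 41.43),
    "CL-WESS": (53.73, 56.35),
}


def gain(method, baseline):
    """Absolute F1 difference (chunk, sentence), rounded to two decimals."""
    return tuple(round(m - b, 2) for m, b in zip(F1[method], F1[baseline]))


cts_gain = gain("CL-CTS-WE", "CL-CTS")
wess_vs_wes = gain("CL-WESS", "CL-WES")
wess_vs_c3g = gain("CL-WESS", "CL-C3G")
```

Each pair of differences matches the corresponding bullet on the slide.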
