

SLIDE 1

Text Reuse Detection Using a Composition of Text Similarity Measures

Bär, Zesch, Gurevych (2012)
HS Computational Study of Linguistic Differences (HS LingDiff)

Sabrina Galasso

sabrina.galasso@student.uni-tuebingen.de

SLIDE 2

Outline

  • 1. Introduction

What is meant by “text reuse”? How and why should text reuse be detected?

  • 2. Text Similarity Measures

How can text similarity be measured? What types of measures exist?

  • 3. Experiments & Results

How do the measures perform on different datasets? How do individual measures perform? How can they be combined?

  • 4. Summary

What can we conclude from the experiments? What can be done as future work?

SLIDE 3

What is text reuse?

  • Examples of text reuse:
  • Mirroring texts on different websites
  • Reusing texts in public blogs
  • Problems with text reuse:
  • Systems used in a collaborative manner, e.g., Wikipedia
  • Users should avoid content duplication
  • Idea: supporting authors of collaborative text collections by means of automatic text reuse detection

SLIDE 4

Text reuse detection

  • Applications:
  • Detection of journalistic text reuse
  • Identification of rewrite sources for ancient texts
  • Analysis of text reuse in blogs or web pages
  • Plagiarism detection
  • Near-duplicate detection of websites (web search and crawling)
  • Few NLP techniques have been used so far
SLIDE 5

Text reuse detection

  • Common approach: computation of similarity based on surface-level or semantic features → only the text's content is considered
  • Idea: investigation of three similarity dimensions:
  • content
  • structure
  • style
SLIDE 6

Text reuse detection

  • Verbatim reuse vs. use of similar words or phrases
→ detectable by content-centric measures
→ But: what about structural and stylistic similarity?
  • In the example below, the source text was split into two sentences
  • Similar vocabulary richness

Source Text. PageRank is a link analysis algorithm used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of “measuring” its relative importance within the set.

Text Reuse. The PageRank algorithm is used to designate every aspect of a set of hyperlinked documents with a numerical weighting. It is used by the Google search engine to estimate the relative importance of a web page according to this weighting.

SLIDE 7

Text Similarity Measures: Content Similarity

  • Detecting verbatim copying: string measures on substring sequences:
  • Longest Common Substring: length of the longest contiguous sequence of characters, normalized by the text length
  • Longest Common Subsequence: allows for insertions/deletions (a sketch follows after this list)
  • Greedy String Tiling: determines a set of shared contiguous substrings → can deal with reordered parts
  • Other string similarity measures, e.g., Levenshtein distance
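To make the string measures concrete, here is a minimal sketch of a normalized Longest Common Subsequence similarity in Python. The dynamic-programming recurrence is the textbook one; normalizing by the length of the first text is an illustrative assumption, not necessarily the paper's exact normalization.

```python
# Minimal sketch of a normalized Longest Common Subsequence similarity.
# Normalizing by len(text1) is an illustrative assumption; the paper's
# exact normalization may differ.

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)          # extend the subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(text1: str, text2: str) -> float:
    """LCS length scaled to [0, 1] by the length of the first text."""
    return lcs_length(text1, text2) / len(text1) if text1 else 0.0

# Example: a high score despite an insertion in the middle.
print(lcs_similarity("text reuse detection", "text reuse (and) detection"))
```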
SLIDE 8

Text Similarity Measures: Content Similarity

  • tf-idf: measuring similarity based on the importance of individual words
  • word n-grams
  • character n-grams (see the sketch below)
  • Semantic similarity measures using WordNet
  • Latent Semantic Analysis (LSA)
  • Explicit Semantic Analysis (ESA) using WordNet, Wikipedia, and Wiktionary
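As an illustration of these content measures, here is a minimal sketch of a tf-idf weighted character n-gram similarity using scikit-learn; the n-gram size and weighting settings are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: tf-idf weighted character n-gram profiles compared via
# cosine similarity (scikit-learn). The n-gram size and weighting scheme
# are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def char_ngram_similarity(text1: str, text2: str, n: int = 3) -> float:
    """Cosine similarity of tf-idf weighted character n-gram vectors."""
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(n, n))
    tfidf = vectorizer.fit_transform([text1, text2])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

print(char_ngram_similarity("the PageRank algorithm",
                            "the algorithm called PageRank"))
```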

SLIDE 9

Text Similarity Measures: Structural Similarity

  • Assumption: “Two independently written texts about the same topic are likely to make use of a common vocabulary to a certain extent.”
→ content similarity is not sufficient
→ inclusion of structural aspects
  • Often only content words are exchanged:
→ comparison of stopword n-grams (see the sketch after this list)
→ comparison of part-of-speech n-grams
  • Two words are likely to occur again in the same order (with any number of words in between):
  • word pair order
  • word pair distance
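A minimal sketch of the stopword n-gram idea referenced above: keep only the stopwords of each text (in order) and compare the resulting n-gram sets. The short stopword list and the Jaccard overlap are illustrative assumptions; the paper's exact comparison may differ.

```python
# Minimal sketch of structural similarity via stopword n-grams: keep only
# stopwords (in order) and compare the resulting n-gram sets. The short
# stopword list and the Jaccard overlap are illustrative assumptions.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "it", "by", "and",
             "that", "with", "for", "on", "as", "its", "this"}

def stopword_ngrams(text: str, n: int) -> set:
    """n-grams over the sequence of stopwords occurring in the text."""
    seq = [tok for tok in text.lower().split() if tok in STOPWORDS]
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def stopword_ngram_similarity(text1: str, text2: str, n: int = 3) -> float:
    """Jaccard overlap of the two stopword n-gram sets."""
    g1, g2 = stopword_ngrams(text1, n), stopword_ngrams(text2, n)
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0
```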
SLIDE 10

Text Similarity Measures: Stylistic Similarity

  • Ideas partly adopted from authorship attribution
  • Investigation of statistical properties of a text
  • Type-token ratio (TTR)
→ sensitive to text length
→ assumes textual homogeneity
  • Sequential TTR: computes the mean length of token sequences that maintain a TTR above a fixed threshold (a sketch follows below)
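A minimal sketch of both statistics: the plain type-token ratio and a sequential variant that averages the length of token runs whose running TTR stays above a threshold. The threshold value is an illustrative assumption.

```python
# Minimal sketch of the type-token ratio (TTR) and a sequential variant:
# the mean length of maximal token runs whose running TTR stays above a
# threshold. The 0.72 threshold is an illustrative assumption.
def type_token_ratio(tokens: list) -> float:
    """Distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def sequential_ttr(tokens: list, threshold: float = 0.72) -> float:
    """Mean length of token runs that keep the running TTR above the threshold."""
    lengths, start = [], 0
    for end in range(1, len(tokens) + 1):
        if type_token_ratio(tokens[start:end]) < threshold:
            lengths.append(end - start)  # run dropped below threshold: close it
            start = end
    if len(tokens) > start:              # remaining partial run
        lengths.append(len(tokens) - start)
    return sum(lengths) / len(lengths) if lengths else 0.0
```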

SLIDE 11

Text Similarity Measures: Stylistic Similarity

  • sentence length ratio
  • token length ratio
  • function word frequencies: makes use of a set of 70 function words identified by Mosteller and Wallace (1964); a sketch follows below
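A minimal sketch of a function word frequency measure: each text becomes a vector of relative function word frequencies, compared here with cosine similarity. The short word list stands in for the 70 Mosteller and Wallace words, and the cosine comparison is an illustrative assumption.

```python
# Minimal sketch of stylistic similarity via function word frequencies.
# The short word list stands in for the 70 function words of Mosteller
# and Wallace (1964); cosine comparison is an illustrative assumption.
import math

FUNCTION_WORDS = ["a", "an", "the", "and", "but", "or", "of", "to", "in",
                  "on", "by", "with", "from", "this", "that", "is", "it"]

def function_word_profile(text: str) -> list:
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    return [tokens.count(w) / total for w in FUNCTION_WORDS]

def function_word_similarity(text1: str, text2: str) -> float:
    """Cosine similarity of the two function word profiles."""
    p1, p2 = function_word_profile(text1), function_word_profile(text2)
    dot = sum(x * y for x, y in zip(p1, p2))
    norm = math.hypot(*p1) * math.hypot(*p2)
    return dot / norm if norm else 0.0
```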

SLIDE 12

Experiments & Results: Experimental Setup

  • Three datasets:
  – Wikipedia Rewrite Corpus (Clough and Stevenson, 2011) → plagiarism detection
  – METER Corpus (Gaizauskas et al., 2001) → journalistic text reuse
  – Webis Crowd Paraphrase Corpus (Burrows et al., 2012) → paraphrase recognition

SLIDE 13

Experiments & Results: Experimental Setup

  • Computation of text similarity scores
  • Machine learning classifiers: Naive Bayes and decision tree classifier
  • Three sets of experiments using 10-fold cross-validation:
  – Performance of individual features
  – Performance of feature combinations within dimensions
  – Performance of feature combinations across dimensions
  • Comparison baselines:
  – Majority class baseline
  – Word trigram similarity measure (Ferret)
  • Evaluation in terms of accuracy and the F̄1 score (arithmetic mean of the F1 scores across all classes); a sketch of this setup follows below
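A minimal sketch of this evaluation setup with scikit-learn, using random placeholder data in place of the real corpus feature vectors (one similarity score per measure); the classifier choices mirror the slide, everything else is an assumption for illustration.

```python
# Minimal sketch of the evaluation setup: similarity scores as features,
# Naive Bayes and decision tree classifiers, 10-fold cross-validation,
# accuracy plus macro-averaged F1 (the mean of the per-class F1 scores).
# X and y are random placeholders for the real corpus features/labels.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((100, 5))       # placeholder: one column per similarity measure
y = rng.integers(0, 2, 100)    # placeholder: binary reuse labels

for clf in (GaussianNB(), DecisionTreeClassifier()):
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    f1 = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")
    print(f"{type(clf).__name__}: accuracy={acc.mean():.3f}, mean F1={f1.mean():.3f}")
```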

SLIDE 14

Wikipedia Rewrite Corpus
Dataset

  • 100 pairs of short texts (193 words on average)
  • Topics from computer science
  • Source texts: manually created from Wikipedia texts
  • Reused texts: generated by participants according to 4 rewrite levels:
  – Cut & paste
  – Light revision
  – Heavy revision
  – No plagiarism

SLIDE 15

Wikipedia Rewrite Corpus
Comparison to other approaches

  • Results for the best classification (combining measures across dimensions)
  • Features used in Clough and Stevenson (2011):
  • word n-gram containment (n = 1, 2, ..., 5); see the sketch after this list
  • longest common subsequence
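A minimal sketch of word n-gram containment as used by Clough and Stevenson (2011): the fraction of the suspicious text's n-grams that also appear in the source text. Whitespace tokenization is an illustrative simplification.

```python
# Minimal sketch of word n-gram containment: the share of the suspicious
# text's n-grams that also occur in the source text. Whitespace
# tokenization is an illustrative simplification.
def word_ngrams(text: str, n: int) -> set:
    """Set of word n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_containment(suspicious: str, source: str, n: int) -> float:
    """Fraction of the suspicious text's n-grams contained in the source."""
    grams = word_ngrams(suspicious, n)
    return len(grams & word_ngrams(source, n)) / len(grams) if grams else 0.0

# The comparison uses n = 1, 2, ..., 5:
# [ngram_containment(reused, original, n) for n in range(1, 6)]
```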

SLIDE 16

Wikipedia Rewrite Corpus
Consideration of individual measures

  • Reasonable performance of some content measures
  • Structural measures: at most F̄1 = 0.554
  • Stylistic measures: only slightly better than the baseline

SLIDE 17

Wikipedia Rewrite Corpus
Performance within and across dimensions

  • Content outperforms structural and stylistic similarity
  • Best performance by a combination across content and structure:
  – longest common subsequence (content)
  – stopword 10-grams (structure)
  – character 5-gram profiles (content)

SLIDE 18

Wikipedia Rewrite Corpus
Error analysis

  • 15 out of 95 texts were classified incorrectly
  • Light vs. heavy revision → 67% of all misclassifications
  • Annotation study: only “fair” inter-annotator agreement for this distinction
  • F̄1 scores: 0.811, 0.859, 0.967

SLIDE 19

METER Corpus
Dataset

  • Source texts: news reports from the UK Press Association (PA)
  • Derived texts: articles from 9 newspapers that reused PA source texts
  • 2 domains: law & court reporting and show business
  • 253 pairs of short texts
  • Binary classification:
  – 181 (wholly or partially) reused texts
  – 72 non-reused texts

SLIDE 20

METER Corpus
Individual measures vs. combinations

→ Individual measures often cannot exceed the majority baseline
→ Improvement through measure combination

SLIDE 21

METER Corpus
Comparison to other approaches

  • Sánchez-Vega et al. (2010):
  – Length and frequency of common word sequences
  – Relevance of individual words

SLIDE 22

METER Corpus
Error analysis

  • 50 out of 253 texts were classified incorrectly
  • Cause of many of the 30 errors: lower similarity ⇏ no reuse, e.g., due to text length (introduction of new facts, ideas, etc.)
→ similarity measures could be computed per section rather than per document
→ detection of text reuse for partially matching texts
  • Still sufficient performance for providing authors with suggestions of potential reuse instances

SLIDE 23

Webis Crowd Paraphrase Corpus
Dataset

  • 7,859 pairs of texts (original book excerpt from Project Gutenberg + paraphrase acquired via crowdsourcing); manual assignment:
  – 52% positive samples: good paraphrases, e.g., synonym use, changes between active and passive voice
  – 48% negative samples: bad paraphrases, e.g., near-duplicates

SLIDE 24

Webis Crowd Paraphrase Corpus
Comparison to other approaches

  • Burrows et al. (2012): 10 similarity measures on string sequences

SLIDE 25

Webis Crowd Paraphrase Corpus
Performance of individual measures

  • Many measures individually achieve very reasonable performance (> 0.7)

SLIDE 26

Webis Crowd Paraphrase Corpus
Performance of measure combinations

  • Content alone is stronger than content + structure
  • Content performs as well as Burrows et al. (2012)
  • Content + structure + style: combination of 16 features

SLIDE 27

Webis Crowd Paraphrase Corpus
Error Analysis

  • 15% were classified incorrectly
  • The 759 false positives are less severe, as users can still decide on them
  • For the other 2 corpora it holds that: higher similarity ⇒ higher degree of reuse
  • For Webis: higher similarity can be annotated as a bad paraphrase (the negative class also includes empty samples and unrelated texts)
→ highly elaborate definition of positive and negative cases
→ difficult to learn a proper model

SLIDE 28

Summary: Hypothesis

Hypothesis: content alone is not a reliable indicator of text reuse because of possible modifications such as:

  • split sentences
  • changed order of reused parts
  • stylistic variance

Investigation of three characteristic dimensions: content, structure, and style

SLIDE 29

Summary: Evaluation

Evaluation based on three datasets:

  • Wikipedia Rewrite Corpus
  • METER Corpus
  • Webis Crowd Paraphrase Corpus

Text reuse is best detected when measures are combined across dimensions

SLIDE 30

Summary: Conclusion

  • Choice of dimensions should depend on the type of text reuse:
  – Stylistic similarity performs poorly on the Wikipedia Rewrite Corpus
  – Stylistic similarity performs well on the other 2 datasets
  • Dimensions should be addressed explicitly in the annotation process

SLIDE 31

Summary: Future work

  • Consideration of a dimensional representation should be beneficial in other tasks, e.g.:
  – paraphrase recognition
  – automatic essay grading (might also include measures for grammar analysis, lexical complexity, or discourse)
  • Choice of dimensions is task-dependent
SLIDE 32

Thanks for your attention!

Any questions?

→ All the references used in this presentation can be found in the paper's references