Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference. Alberto Barrón Cedeño and Paolo Rosso, Universidad Politécnica de Valencia. July, 2008. PAN’08, Patras, Greece.


SLIDE 1

Towards the Exploitation of Statistical Language Models for Plagiarism Detection with Reference

Alberto Barrón Cedeño and Paolo Rosso

Universidad Politécnica de Valencia

July, 2008

LM for plagiarism detection PAN’08, Patras Greece 1/20

SLIDE 2

Overview

  • Introduction
  • LM approach
  • Experiments
  • Discussion
  • Conclusions

SLIDE 4

Introduction

Plagiarise: to rob another person of the credit for their work; in text, it means including text fragments from an author without giving him the corresponding credit.

In this work we describe our first attempt to detect plagiarised fragments in a text by employing statistical Language Models (LMs) and perplexity.

SLIDE 6

Introduction

1 Intrinsic plagiarism analysis [Meyer zu Eissen and Stein, 2006, 2007]

  • No reference corpus is exploited
  • Idea: search for variations (syntax, grammatical categories or text complexity) through the suspicious text

2 Plagiarism analysis with reference [Si et al., 1997, Iyer and Singh, 2005]

  • A reference corpus of original documents is needed
  • Idea: to compare fragments from the suspicious document with the original documents in the reference corpus

SLIDE 7

Introduction

We are interested in the second approach, but...

Usually: the reference corpus is made up of original documents.

SLIDE 8

Introduction

We are interested in the second approach, but...

Here: the reference corpus is made up of texts written by the author of the suspicious document.

SLIDE 10

Introduction

Statistical Language Model (LM): an LM “tries to predict a word given the previous words” [Manning and Schütze, 2000].

Ideal calculation:

\[ P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \cdots w_{n-1}) \]

n-grams approach (case n = 3):

\[ P_3(W) = P(w_{n-2}) \cdot P(w_{n-1} \mid w_{n-2}) \cdot P(w_n \mid w_{n-2} w_{n-1}) \]
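As an illustration of the n = 3 case, the following minimal sketch estimates the three factors from raw counts (maximum likelihood, no smoothing). The function names and the toy token list are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def trigram_probability(trigram, tokens):
    """Estimate P(w1) * P(w2 | w1) * P(w3 | w1 w2) from raw counts."""
    w1, w2, w3 = trigram
    uni = Counter(ngrams(tokens, 1))
    bi = Counter(ngrams(tokens, 2))
    tri = Counter(ngrams(tokens, 3))
    if tri[(w1, w2, w3)] == 0:
        return 0.0  # unseen trigram: this sketch applies no smoothing
    p_w1 = uni[(w1,)] / len(tokens)            # P(w1)
    p_w2 = bi[(w1, w2)] / uni[(w1,)]           # P(w2 | w1)
    p_w3 = tri[(w1, w2, w3)] / bi[(w1, w2)]    # P(w3 | w1 w2)
    return p_w1 * p_w2 * p_w3


# Toy data, invented for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
print(trigram_probability(("the", "cat", "sat"), tokens))
```

Unseen trigrams get probability zero here, which is why practical LMs add some form of smoothing.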

SLIDE 11

LM approach

Basic idea

  • Computing the probability of n-grams in a corpus of texts from one author (a representation of vocabulary, grammatical frequency and writing style)
  • These representations can be compared to other texts in order to look for plagiarism candidates

SLIDE 13

LM approach

Is a fragment f a plagiarism candidate?

  • Determine if a text is similar to another one based on perplexity, frequently used to evaluate how well an LM describes a language (here, “our author's language”):

\[ PP_2 = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \]

  • The lower the perplexity of a text, the more predictable its words are. In other words, the higher the perplexity, the greater the uncertainty about the following word in a sentence
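The bigram perplexity can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say which smoothing was used, so add-one smoothing is swapped in here to avoid zero probabilities. N is taken as the number of bigram transitions in the text, and all names and toy strings are hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(tokens):
    """Collect unigram and bigram counts plus the vocabulary size."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(unigrams)


def bigram_prob(prev, word, lm):
    """Add-one smoothed P(word | prev); the smoothing is an assumption."""
    unigrams, bigrams, vocab_size = lm
    # The extra +1 in the denominator reserves mass for unseen words.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size + 1)


def perplexity(tokens, lm):
    """N-th root of the product of 1 / P(w_i | w_{i-1}), in log space."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(-math.log(bigram_prob(p, w, lm)) for p, w in pairs)
    return math.exp(log_sum / len(pairs))


# Toy data, invented for illustration only.
lm = train_bigram_lm("the cat sat on the mat the cat sat on the rug".split())
familiar = "the cat sat on the mat".split()
scrambled = "mat the on sat cat the".split()
print(perplexity(familiar, lm), perplexity(scrambled, lm))
```

Working in log space avoids numerical underflow when the product runs over long texts.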

SLIDE 15

LM approach

Hypothesis: given an LM m calculated over texts T written by author A, the perplexities of fragments g, h ∈ T′, where g has been written by A and h has been “plagiarised” from an author B, will be clearly different. Specifically,

\[ PP_m(g) \ll PP_m(h) \]

SLIDE 16

Experiments: corpus

We have carried out experiments over two different kinds of texts:

Specialised: a corpus about Lexicography topics written by only one author

Literature: a set of books written by Lewis Carroll and some passages from William Shakespeare's texts

SLIDE 17

Experiments: corpus

Corpora preprocessing (vocabulary and syntactic richness; morphosyntactic style):

  i original text
  ii part-of-speech
  iii stemmed text

SLIDE 19

Experiments: corpus

The training partition has been used to calculate the LMs.

The test partition contains randomly inserted fragments written by a different author.

In order to identify candidates, we calculate the perplexity of each sentence with respect to the LM associated with the author.
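The sentence-level step can be sketched roughly as below. Several things here are assumptions rather than the authors' setup: the bigram model with add-one smoothing, the rule of flagging sentences whose perplexity exceeds the mean, whitespace tokenisation, and the invented toy strings. Every sentence is assumed to have at least two tokens.

```python
import math
from collections import Counter
from statistics import mean


def sentence_perplexity(sentence, unigrams, bigrams):
    """Bigram perplexity of one sentence under add-one smoothed counts."""
    tokens = sentence.lower().split()
    vocab = len(unigrams) + 1  # one extra slot for unseen words
    logs = [
        -math.log((bigrams[(p, w)] + 1) / (unigrams[p] + vocab))
        for p, w in zip(tokens, tokens[1:])
    ]
    return math.exp(sum(logs) / len(logs))


def flag_candidates(train_text, test_sentences):
    """Return (sentence, perplexity) pairs whose perplexity exceeds the mean."""
    tokens = train_text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = [(s, sentence_perplexity(s, unigrams, bigrams)) for s in test_sentences]
    mu = mean(pp for _, pp in scored)  # hypothetical threshold: the mean
    return [(s, pp) for s, pp in scored if pp > mu]


# Toy strings, invented for illustration only.
train = "alice was tired of sitting by her sister alice was tired of having nothing to do"
test = [
    "alice was tired of sitting",
    "alice was tired of having nothing",
    "shall i compare thee to a summer day",
]
for sentence, pp in flag_candidates(train, test):
    print(sentence, round(pp, 1))
```

A fixed mean threshold is only one possible decision rule; ranking sentences by perplexity and inspecting the highest ones is another.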

SLIDE 20

Experiments: results

Results over the literature corpus, considering the original text:

[Figure: perplexity (y-axis) per sentence (x-axis), literature example, n=3; plagiarised sentences marked; µ=319]

SLIDE 21

Experiments: results

Results over the literature corpus, considering the stemming of the text:

[Figure: perplexity (y-axis) per sentence (x-axis), literature example, n=3; plagiarised sentences marked; µ=88]

SLIDE 22

Experiments: results

Results over the literature corpus, considering the POS of the text:

[Figure: perplexity (y-axis) per sentence (x-axis), literature example, n=3; plagiarised sentences marked; µ=9]

SLIDE 23

Discussion

This approach considers three of the five categories of stylometric features useful for the plagiarism detection task [Meyer zu Eissen and Stein, 2006]:

  • Original and stemmed: syntactic features (writing style); special words counting (vocabulary richness)
  • POS: part-of-speech classes quantification

Not considered:

  • Text statistics (character level)
  • Structural features

SLIDE 26

Discussion

Perplexity (as we applied it) is not enough to discriminate plagiarised from “legal” fragments, but...

  • Is it a good idea to consider it?
  • What about the original text, POS and stem versions?

SLIDE 27

Conclusions

1 We have considered perplexity on three different levels: word, part-of-speech and stem.

2 Unfortunately, there are non-plagiarised fragments that present high perplexity. However, plagiarised fragments seem to stand out in the highest positions when we consider these features.

3 We know that the perplexity feature space of plagiarised and non-plagiarised segments is not completely separable, but we believe that including perplexity among other features may improve the results.

SLIDE 28

References

Iyer, P. and Singh, A. (2005). Document similarity analysis for a plagiarism detection system. 2nd Indian Int. Conf. on Artificial Intelligence (IICAI-2005), pages 2534–2544.

Manning, C. D. and Schütze, H. (2000). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts and London, England.

Meyer zu Eissen, S. and Stein, B. (2006). Intrinsic plagiarism detection. Lalmas et al. (Eds.): Advances in Information Retrieval, Proc. of the 28th European Conf. on IR research, ECIR 2006, London, pages 565–569.

Si, A., Leong, H. V., and Lau, R. W. H. (1997). Check: a document plagiarism detection system. Proc. of the 1997 ACM Symposium on Applied Computing,
