Intrinsic Plagiarism Detection Using Character n-gram Profiles - PowerPoint PPT Presentation



slide-1
SLIDE 1

Intrinsic Plagiarism Detection Using Character n-gram Profiles

Efstathios Stamatatos

University of the Aegean

slide-2
SLIDE 2

Talk Layout

  • Introduction
  • The style change function
  • Detecting plagiarism
  • Evaluation
  • Conclusions

slide-3
SLIDE 3

Intrinsic Plagiarism Detection

  • Ambitious and demanding task
  • It can be used:

– When no appropriate reference corpus is available
– When the reference corpus is too large (web)

  • Closely related to authorship verification
  • Detection of irregularities of stylistic nature

– However, not all stylistic irregularities are caused by plagiarism

slide-4
SLIDE 4

Representing Writing Style

  • Lexical features
  • Character features
  • Syntactic features
  • Semantic features
  • Application-specific features

slide-5
SLIDE 5

Character n-grams

  • Can be easily measured in any text
  • Language-independent
  • Domain-independent
  • Require no text preprocessing
  • Very effective in authorship attribution
  • Robust to noise

– Obfuscation in plagiarism can be considered as noise insertion

slide-6
SLIDE 6

The Proposed Approach

  • The variation of document style is represented by the style change function

– Using a sliding window over the text length

  • Writing style is represented by character n-gram profiles

– The set of different character n-grams encountered in the text and their normalized frequencies

  • A set of heuristic rules:

– Decide whether or not the document is plagiarism-free
– Detect the plagiarized section boundaries
– Detect irrelevant stylistic inconsistencies
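A character n-gram profile, as described on this slide, can be sketched in a few lines of Python. This is only an illustration: the function name `ngram_profile` is hypothetical, and normalizing by the total number of n-grams in the text is an assumption about what "normalized frequencies" means here.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Map each distinct character n-gram of `text` to its relative frequency.

    Assumption: frequencies are normalized by the total number of
    n-grams in the text, so the values of a non-empty profile sum to 1.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    # An empty `grams` list yields an empty Counter, so no division happens.
    return {g: c / len(grams) for g, c in counts.items()}


profile = ngram_profile("abcabc", n=3)
# The 3-grams are abc, bca, cab, abc, so "abc" has frequency 2/4.
print(profile["abc"])  # 0.5
```

Because the profile needs nothing beyond the raw character sequence, the same function applies to any language or domain without tokenization.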

slide-7
SLIDE 7

Representing Stylistic Changes

[Diagram: a sliding window (length, step) moves over the document; the profile of the text window and the profile of the whole document feed a distance estimation.]

  • High value means stylistic anomaly
  • Low value means stylistic consistency

slide-8
SLIDE 8

Distance Estimation

  • The sliding window text is shorter (or much shorter) than the whole document
  • An accurate and robust function for imbalanced profiles is proposed by (Stamatatos, 2007):

$$d_1(A, B) = \sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2$$

where f_A(g) and f_B(g) are the normalized frequencies of n-gram g in texts A and B, and P(A) is the profile of A.

  • This is not a symmetric function

– dissimilarity rather than distance measure
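The asymmetry noted above can be checked with a small sketch of d1. It assumes the sum runs only over the n-grams in the profile of the first text, P(A), and that an n-gram absent from the second profile has frequency 0. The function name `d1` mirrors the slide's symbol, and the toy profiles are invented for illustration.

```python
def d1(profile_a: dict[str, float], profile_b: dict[str, float]) -> float:
    """Dissimilarity of profile A from profile B, summed over n-grams of A only."""
    total = 0.0
    for gram, fa in profile_a.items():
        fb = profile_b.get(gram, 0.0)  # assumption: missing n-gram -> frequency 0
        # fa > 0 for every n-gram in A's profile, so the denominator is never 0.
        total += (2.0 * (fa - fb) / (fa + fb)) ** 2
    return total


# Toy profiles (hypothetical values, not taken from the talk).
a = {"ab": 1.0}
b = {"ab": 0.5, "cd": 0.5}

print(d1(a, b))  # only "ab" contributes
print(d1(b, a))  # "cd" is missing from a and adds the maximum term, 4
```

Swapping the arguments changes which n-grams are summed over, which is why the slide calls d1 a dissimilarity rather than a distance measure.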

slide-9
SLIDE 9

Style Change Function

  • d1 is normalized over the profile length:

$$nd_1(A, B) = \frac{\displaystyle\sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2}{4\,|P(A)|}$$

  • Then, the style change function sc of a document D is:

sc(i, D) = nd1(w_i, D), i = 1…|w|

  • |w| depends on the text-length:

$$|w| = \left\lfloor 1 + \frac{x - l}{s} \right\rfloor$$

– x: text-length
– l: sliding window length
– s: sliding window step
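Putting the pieces of this slide together, the style change function might be computed as in the following sketch. The helper names (`ngram_profile`, `nd1`, `window_count`, `style_change`) are hypothetical. Windows are taken over raw characters, and a document shorter than one window is assumed to yield no windows, a case the slide does not cover.

```python
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Relative frequencies of the character n-grams of `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


def nd1(profile_a: dict[str, float], profile_b: dict[str, float]) -> float:
    """d1 divided by 4 * |P(A)|, which keeps the value between 0 and 1."""
    if not profile_a:
        return 0.0
    total = sum(
        (2.0 * (fa - profile_b.get(g, 0.0)) / (fa + profile_b.get(g, 0.0))) ** 2
        for g, fa in profile_a.items()
    )
    return total / (4.0 * len(profile_a))


def window_count(x: int, l: int, s: int) -> int:
    """|w| = floor(1 + (x - l) / s); assumption: 0 windows when x < l."""
    return max(0, 1 + (x - l) // s)


def style_change(text: str, n: int = 3, l: int = 1000, s: int = 200) -> list[float]:
    """sc(i, D) = nd1(w_i, D) for every window w_i of document D."""
    doc_profile = ngram_profile(text, n)
    return [
        nd1(ngram_profile(text[i * s:i * s + l], n), doc_profile)
        for i in range(window_count(len(text), l, s))
    ]


print(window_count(10_000, 1_000, 200))  # 46
```

For a 10,000-character document with l = 1,000 and s = 200, the formula gives 1 + 9,000 / 200 = 46 windows, so consecutive windows overlap by 800 characters.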

slide-10
SLIDE 10

An Example

[Figure: style change function (y-axis, 0.00 to 0.50) against sliding window position (x-axis, 200 to 800) for IPAT-DC document #5.]

slide-11
SLIDE 11

A Plagiarism-free Example

[Figure: style change function (y-axis, 0.00 to 0.50) against sliding window position (x-axis, 100 to 600) for IPAT-DC document #17.]

slide-12
SLIDE 12

Detecting Plagiarism on the Document Level

  • This is crucial to keep precision high
  • Two options:

– Pre-processing
– Post-processing

  • Plagiarism-free criterion: S < t1, where

– S: the standard deviation of the style change function
– t1: a predefined threshold (0.02)

  • Deficiencies:

– Very short documents tend to have low sc values
– Very long documents may contain stylistically inconsistent sections (high variance of sc)
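The plagiarism-free criterion is a single comparison, sketched below. The function name is hypothetical and the sc values are made up for illustration. The slide does not say whether S is the population or the sample standard deviation, so the use of `statistics.pstdev` (population) is an assumption.

```python
from statistics import pstdev


def is_plagiarism_free(sc_values: list[float], t1: float = 0.02) -> bool:
    """Document-level criterion S < t1 on the style change function.

    Assumption: S is the population standard deviation of the sc values.
    """
    return pstdev(sc_values) < t1


# Hypothetical style change values, not taken from the IPAT-DC corpus.
flat = [0.10, 0.11, 0.10, 0.09]    # nearly constant style
spiky = [0.10, 0.10, 0.30, 0.10]   # one window stands out

print(is_plagiarism_free(flat), is_plagiarism_free(spiky))
```

A short document produces few windows with similar values, which keeps S small; that is one way to read the first deficiency listed above.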

slide-13
SLIDE 13

A False Negative Example

[Figure: style change function (y-axis, 0.00 to 0.50) against sliding window position (x-axis, 50 to 150) for IPAT-DC document #34.]

slide-14
SLIDE 14

Identifying Plagiarized Passages

  • It is assumed that at least half of the text is not plagiarized

– The average sc value would correspond to the style of the alleged author

  • In general, the amount of plagiarized text is not known

– All sc values greater than M+S (the mean of sc plus its standard deviation) are removed
– M′ and S′ are then calculated from the remaining values

  • Plagiarized passage criterion: sc(i′,D) > M′ + a*S′

– a determines the sensitivity of the method (set to 2.0)
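The two-pass procedure on this slide can be sketched as follows. Several details are assumptions rather than statements from the talk: M and S are taken to be the mean and population standard deviation of sc, M′ and S′ are computed on the values that survive the first pass, and the final criterion is applied to every window, including those removed in the first pass. The function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def flag_plagiarized_windows(sc_values: list[float], a: float = 2.0) -> list[int]:
    """Return indices of windows whose sc value exceeds M' + a * S'."""
    m, s = mean(sc_values), pstdev(sc_values)
    # Pass 1: drop likely-plagiarized windows so they do not distort the
    # estimate of the alleged author's style. The minimum is always kept.
    kept = [v for v in sc_values if v <= m + s]
    m_prime, s_prime = mean(kept), pstdev(kept)
    # Pass 2: flag every window that stands out from the cleaned estimate.
    threshold = m_prime + a * s_prime
    return [i for i, v in enumerate(sc_values) if v > threshold]


# Hypothetical style change values with two outlying windows.
sc = [0.10, 0.11, 0.09, 0.10, 0.30, 0.32, 0.10, 0.11]
print(flag_plagiarized_windows(sc))  # [4, 5]
```

Raising `a` pushes the threshold up, so fewer windows are flagged; this is the precision and recall trade-off controlled by the sensitivity parameter.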

slide-15
SLIDE 15

An Example

[Figure: style change function (y-axis, 0.00 to 0.50) against sliding window position (x-axis, 200 to 800) for IPAT-DC document #5.]

slide-16
SLIDE 16

Another Example

[Figure: style change function (y-axis, 0.00 to 0.50) against sliding window position (x-axis, 100 to 400) for IPAT-DC document #22.]

slide-17
SLIDE 17

Detecting Irrelevant Style Changes

  • Not all stylistic changes are caused by plagiarism

– Text formatting affects style
– Genre affects style
– …

  • To reduce the formatting factor:

– All text is transformed to lowercase
– Every character n-gram that contains no letter characters (a-z) is removed from the profile
– The sliding window parameters operate on letter characters

    • each window has the same number of letter characters (window length l) but a different number of total characters (real window length l′)
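The three formatting rules above might look like this in code. The names `letter_profile` and `letter_windows` are hypothetical. Frequencies are assumed to be renormalized over the n-grams that remain after filtering, and the returned offsets index into the lowercased text.

```python
from collections import Counter


def is_letter(ch: str) -> bool:
    return "a" <= ch <= "z"


def letter_profile(text: str, n: int = 3) -> dict[str, float]:
    """Lowercase the text and drop n-grams that contain no letter a-z."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    grams = [g for g in grams if any(is_letter(ch) for ch in g)]
    counts = Counter(grams)
    # Assumption: normalize over the retained n-grams only.
    return {g: c / len(grams) for g, c in counts.items()}


def letter_windows(text: str, l: int = 1000, s: int = 200) -> list[tuple[int, int]]:
    """Character spans (start, end) that each contain exactly l letters.

    Consecutive windows start s letters apart; end - start is the real
    window length l', which grows when the span holds many non-letters.
    """
    positions = [i for i, ch in enumerate(text.lower()) if is_letter(ch)]
    spans = []
    for k in range(0, len(positions) - l + 1, s):
        spans.append((positions[k], positions[k + l - 1] + 1))
    return spans


print(letter_windows("ab  cd", l=3, s=1))  # [(0, 5), (1, 6)]
```

In the toy call each window holds three letters but spans five characters, so l = 3 while l′ = 5.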

slide-18
SLIDE 18

Detecting Irrelevant Style Changes

  • To reduce the multiple genre factor:

– Special Section Criterion: l′ < t2, where
– l′: the real window length
– t2: a predefined threshold (1,500)
– It combines with the plagiarized passage criterion

  • Weaknesses

– One can insert multiple non-letter characters to obfuscate a plagiarized section
– All special sections (table-of-contents, index) are considered plagiarism-free
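One reading of how the two criteria combine is sketched below: a window counts as plagiarized only if its sc value exceeds the passage threshold (M′ + a*S′) and its real length l′ stays below t2. Treating the combination as a logical AND is an interpretation of the slide, and the function and parameter names are hypothetical.

```python
def is_plagiarized_window(
    sc_value: float,
    passage_threshold: float,
    real_length: int,
    t2: int = 1500,
) -> bool:
    """Combine the plagiarized passage criterion with the special section criterion.

    passage_threshold stands for M' + a * S'; real_length is l'.
    """
    is_outlier = sc_value > passage_threshold
    is_regular_text = real_length < t2  # large l' suggests a table of contents, index, ...
    return is_outlier and is_regular_text


# Hypothetical windows: the same outlying sc value, with and without padding.
print(is_plagiarized_window(0.30, 0.12, real_length=1200))
print(is_plagiarized_window(0.30, 0.12, real_length=2600))
```

The second call illustrates the first weakness listed above: padding a passage with non-letter characters inflates l′ and lets the window pass as a special section.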

slide-19
SLIDE 19

An Example

IPAT‐DC Document #46

slide-20
SLIDE 20

Summary of Parameter Settings

Description                               Symbol   Value
Character n-gram length                   n        3
Sliding window length                     l        1,000
Sliding window step                       s        200
Threshold of plagiarism-free criterion    t1       0.02
Real window length threshold              t2       1,500
Sensitivity of plagiarism detection       a        2

  • Empirically derived, not optimized
slide-21
SLIDE 21

Evaluation on the Document Level

Guess \ Actual     Plagiarism-free   Plagiarized   Plagiarized passages
Plagiarism-free    1102              545           22%
Plagiarized        443               1001          78% (upper bound for recall)

  • Results on IPAT‐DC
slide-22
SLIDE 22

False Negatives

  • The majority of false negatives are relatively short documents (<30K chars)
  • The shorter a document, the more likely it is to be a false negative

[Figure: number of documents (y-axis, 200 to 1600), false negatives against all documents, by text length in chars (<10K, 10K-30K, 30K-100K, >100K).]

slide-23
SLIDE 23

Evaluation on the Passage Level

Corpus          IPAT-DC   IPAT-CC
Recall          0.4552    0.4607
Precision       0.2183    0.2321
F-score         0.2876    0.3086
Granularity     1.22      1.25
Overall score   0.2358    0.2462

  • Performance remains stable for both corpora
slide-24
SLIDE 24

Recall and Precision vs. Text-length

  • Recall is affected by decreasing text-length

– A result of false negative distribution

[Figure: recall and precision (y-axis, 10 to 60) by text length in chars (<10K, 10K-30K, 30K-100K, >100K).]

slide-25
SLIDE 25

Conclusions

  • A fully-automated approach

– Easy to follow (no text preprocessing)
– Able to detect plagiarism-free documents
– Able to detect plagiarized passage boundaries

  • Nearly half of plagiarized passages are detected while precision remains low

– An increased a value can improve precision (and harm recall)

  • Window length determines the shortest plagiarized passage that can be detected

slide-26
SLIDE 26

Future Work

  • Definition of more sophisticated criteria
  • Parameter settings can be optimized by machine learning algorithms
  • Different schemes to acquire style change function

– Comparison of text window with the window complement
– Comparison of text window with all the other text windows