SLIDE 1

Intrinsic Plagiarism Detection Using Character n‐gram Profiles

Efstathios Stamatatos

University of the Aegean

SLIDE 2

Talk Layout

Introduction
The style change function
Detecting plagiarism
Evaluation
Conclusions

SLIDE 3

Intrinsic Plagiarism Detection

Ambitious and demanding task
It can be used:

– When no appropriate reference corpus is available
– When the reference corpus is too large (web)

Closely related to authorship verification

Detection of irregularities of stylistic nature

– However, not all stylistic irregularities are caused by plagiarism

SLIDE 4

Representing Writing Style

Lexical features
Character features
Syntactic features
Semantic features
Application‐specific features

SLIDE 5

Character n‐grams

Can be easily measured in any text
Language‐independent
Domain‐independent
Require no text‐preprocessing
Very effective in authorship attribution
Robust to noise

– Obfuscation in plagiarism can be considered as noise insertion

SLIDE 6

The Proposed Approach

The variation of document style is represented by the style change function

– Using a sliding window over the text‐length

Writing style is represented by character n‐gram profiles

– The set of different character n‐grams encountered in the text and their normalized frequencies

A set of heuristic rules:

– Decide whether or not the document is plagiarism‐free
– Detect the plagiarized section boundaries
– Detect irrelevant stylistic inconsistencies
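
A character n-gram profile as described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function name `char_ngram_profile` is hypothetical, and normalizing by the total number of n-grams in the text is an assumption about what "normalized frequencies" means here.

```python
from collections import Counter


def char_ngram_profile(text: str, n: int = 3) -> dict[str, float]:
    """Map each character n-gram in `text` to its normalized frequency."""
    # Slide a window of n characters over the text and count every n-gram.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        # Text shorter than n characters: no n-grams, empty profile.
        return {}
    # Assumption: frequencies are normalized by the total n-gram count.
    return {gram: count / total for gram, count in counts.items()}


profile = char_ngram_profile("abcabc", n=3)
print(profile)
```

The frequencies of a non-empty profile sum to 1, so profiles of texts with very different lengths stay comparable.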

SLIDE 7

Representing Stylistic Changes

[Diagram: a sliding window (length, step) moves over the document; the profile of the text window and the profile of the whole document feed the distance estimation]

High value means stylistic anomaly
Low value means stylistic consistency

SLIDE 8

Distance Estimation

The sliding window text is shorter (or much shorter) than the whole document

An accurate and robust function for imbalanced profiles is proposed by (Stamatatos, 2007):

$$ d_1(A, B) = \sum_{g \in P(A)} \left( \frac{2 \, (f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2 $$

where P(A) is the profile of text A, and f_A(g), f_B(g) are the normalized frequencies of n‐gram g in texts A and B

This is not a symmetric function

– dissimilarity rather than distance measure
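
The asymmetry of d1 is easy to see in code. The sketch below assumes profiles are plain dictionaries from n-gram to normalized frequency and that an n-gram missing from B has frequency 0; the function name `d1` and the toy profiles are illustrative only.

```python
def d1(profile_a: dict[str, float], profile_b: dict[str, float]) -> float:
    """Dissimilarity of A from B, summed over the n-grams of A only."""
    total = 0.0
    for gram, f_a in profile_a.items():
        # Assumption: n-grams absent from B contribute with frequency 0.
        f_b = profile_b.get(gram, 0.0)
        # f_a > 0 for every n-gram in P(A), so the denominator is never 0.
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    return total


# Toy profiles (hypothetical values, not taken from any corpus).
short_text = {"abc": 1.0}
long_text = {"abc": 0.5, "xyz": 0.5}

print(d1(short_text, long_text))  # sums over the single n-gram of short_text
print(d1(long_text, short_text))  # also counts "xyz", which short_text lacks
```

Because the sum runs only over P(A), swapping the arguments changes the result, which is why d1 is a dissimilarity rather than a distance measure.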

SLIDE 9

Style Change Function

d1 is normalized over the profile length:

$$ nd_1(A, B) = \frac{1}{4 \, |P(A)|} \sum_{g \in P(A)} \left( \frac{2 \, (f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2 $$

Then, the style change function sc of a document D is:

sc(i,D)=nd1(wi, D), i=1…|w|

|w| depends on the text‐length:

$$ |w| = \left\lfloor 1 + \frac{x - l}{s} \right\rfloor $$

– x: text‐length
– l: sliding window length
– s: sliding window step
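
A minimal sketch of the style change function, under stated assumptions: windows are cut on raw characters (no formatting normalization), n-grams missing from the document profile count as frequency 0, and a text shorter than one window yields no windows. All function names are hypothetical.

```python
from collections import Counter


def profile(text, n=3):
    """Character n-gram profile: n-gram -> normalized frequency."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def nd1(profile_a, profile_b):
    """d1 divided by 4|P(A)|, which keeps the value between 0 and 1."""
    if not profile_a:
        return 0.0
    total = 0.0
    for gram, f_a in profile_a.items():
        f_b = profile_b.get(gram, 0.0)
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    return total / (4.0 * len(profile_a))


def num_windows(x, l, s):
    """|w| = floor(1 + (x - l) / s); assumption: 0 windows when x < l."""
    return max(0, 1 + (x - l) // s)


def style_change(text, l=1000, s=200, n=3):
    """sc(i, D) = nd1(w_i, D) for every sliding window w_i of document D."""
    document_profile = profile(text, n)
    count = num_windows(len(text), l, s)
    return [
        nd1(profile(text[i * s:i * s + l], n), document_profile)
        for i in range(count)
    ]


sample = "the quick brown fox " * 200  # 4,000 characters of toy text
sc = style_change(sample)
print(len(sc), max(sc))
```

Each squared term is at most 4, so dividing by 4|P(A)| bounds every sc value by 1, with higher values indicating a window that departs from the style of the whole document.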

SLIDE 10

An Example

[Figure: style change function (0.00 to 0.50) vs. sliding window position (200 to 800), IPAT‐DC document #5]

SLIDE 11

A Plagiarism‐free Example

[Figure: style change function (0.00 to 0.50) vs. sliding window position (100 to 600), IPAT‐DC document #17]

SLIDE 12

Detecting Plagiarism on the Document Level

This is crucial to keep precision high
Two options:

– Pre‐processing
– Post‐processing

Plagiarism‐free criterion: S<t1

where
S: the standard deviation of the style change function
t1: a predefined threshold (0.02)

Deficiencies:

– Very short documents tend to have low sc values
– Very long documents may contain stylistically inconsistent sections (high variance of sc)
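
The plagiarism-free criterion S<t1 reduces to a one-line check. In this sketch the population standard deviation is used, which is an assumption (the slides do not say whether the population or sample form is meant), and a document with fewer than two windows is treated as plagiarism-free, which is also an assumption. The sc values are made up for illustration.

```python
from statistics import pstdev


def is_plagiarism_free(sc_values, t1=0.02):
    """Apply the criterion S < t1 to the style change function values."""
    if len(sc_values) < 2:
        # Assumption: too few windows to measure any stylistic variation.
        return True
    # Assumption: S is the population standard deviation.
    return pstdev(sc_values) < t1


steady = [0.10, 0.11, 0.10, 0.09]   # hypothetical sc values, little variation
spiky = [0.10, 0.10, 0.30, 0.10]    # hypothetical sc values with one peak

print(is_plagiarism_free(steady), is_plagiarism_free(spiky))
```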

SLIDE 13

A False Negative Example

[Figure: style change function (0.00 to 0.50) vs. sliding window position (50 to 150), IPAT‐DC Document #34]

SLIDE 14

Identifying Plagiarized Passages

It is assumed that at least half of the text is not plagiarized

– The average sc value would correspond to the style of the alleged author

In general, the amount of plagiarized text is not known

– All sc values greater than M+S are removed (M: mean of sc, S: standard deviation of sc)
– M′ and S′ are then calculated from the remaining values

Plagiarized passage criterion: sc(i′,D) >M′+a*S′

– a determines the sensitivity of the method (set to 2.0)
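
The two-pass procedure on this slide can be sketched as follows. Assumptions: M and S are the mean and population standard deviation of sc, and the final criterion is tested against every window, including those removed in the first pass. The function name and the sc values are hypothetical.

```python
from statistics import mean, pstdev


def plagiarized_windows(sc_values, a=2.0):
    """Return indices of windows whose sc value exceeds M' + a * S'."""
    if not sc_values:
        return []
    m, s = mean(sc_values), pstdev(sc_values)
    # First pass: drop the values above M + S, the likely plagiarized windows.
    kept = [v for v in sc_values if v <= m + s]
    # Second pass: M' and S' describe the style of the alleged author.
    m_prime, s_prime = mean(kept), pstdev(kept)
    # Assumption: the criterion is checked for all windows, not only the kept ones.
    return [i for i, v in enumerate(sc_values) if v > m_prime + a * s_prime]


sc = [0.10, 0.11, 0.09, 0.10, 0.30, 0.31, 0.10, 0.11, 0.09, 0.10]
print(plagiarized_windows(sc))
```

Removing the peaks before computing M′ and S′ keeps a large plagiarized section from inflating the threshold that is supposed to expose it.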

SLIDE 15

An Example

[Figure: style change function (0.00 to 0.50) vs. sliding window position (200 to 800), IPAT‐DC document #5]

SLIDE 16

Another Example

[Figure: style change function (0.00 to 0.50) vs. sliding window position (100 to 400), IPAT‐DC Document #22]

SLIDE 17

Detecting Irrelevant Style Changes

Not all stylistic changes are caused by plagiarism

– Text formatting affects style
– Genre affects style
– …

To reduce the formatting factor:

– All text is transformed to lowercase
– Every character n‐gram that contains no letter characters (a‐z) is removed from the profile
– The sliding window parameters operate on letter characters: each window has the same number of letter characters (window length l) but different number of total characters (real window length l′)

SLIDE 18

Detecting Irrelevant Style Changes

To reduce the multiple genre factor:

– Special Section Criterion: l′<t2, where
  – l′: the real window length
  – t2: a predefined threshold (1,500)
– It combines with the plagiarized passage criterion

Weaknesses

– One can insert multiple non‐letter characters to obfuscate a plagiarized section
– All special sections (table‐of‐contents, index) are considered plagiarism‐free
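
One reading of how the two criteria combine is that a window is flagged only when its sc value exceeds the passage threshold and its real length l′ stays below t2. The sketch below encodes that reading as an assumption; `flag_windows`, the threshold value, and the sample numbers are hypothetical.

```python
def flag_windows(sc_values, real_lengths, threshold, t2=1500):
    """Indices of windows that pass both the passage and special-section checks."""
    flagged = []
    for i, (sc_value, real_length) in enumerate(zip(sc_values, real_lengths)):
        # Assumption: windows with l' >= t2 are treated as special sections
        # (e.g. table of contents, index) and are never flagged.
        if sc_value > threshold and real_length < t2:
            flagged.append(i)
    return flagged


sc_values = [0.30, 0.30, 0.10]       # hypothetical style change values
real_lengths = [1200, 1800, 1100]    # hypothetical l' values in characters
print(flag_windows(sc_values, real_lengths, threshold=0.20))
```

The second window has a high sc value but is skipped because its real length suggests many non-letter characters, which is also the loophole listed under Weaknesses.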

SLIDE 19

An Example

[Figure: IPAT‐DC Document #46]

SLIDE 20

Summary of Parameter Settings

Description                               Symbol  Value
Character n‐gram length                   n       3
Sliding window length                     l       1,000
Sliding window step                       s       200
Threshold of plagiarism‐free criterion    t1      0.02
Real window length threshold              t2      1,500
Sensitivity of plagiarism detection       a       2

Empirically derived, not optimized
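
For experimentation, the settings in the table can be kept together in one immutable object. This is only a convenience sketch; the class name `Settings` and the variant with a higher a value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    n: int = 3        # character n-gram length
    l: int = 1000     # sliding window length
    s: int = 200      # sliding window step
    t1: float = 0.02  # threshold of the plagiarism-free criterion
    t2: int = 1500    # real window length threshold
    a: float = 2.0    # sensitivity of plagiarism detection


DEFAULTS = Settings()
# Hypothetical variant: a larger a trades recall for precision.
stricter = Settings(a=3.0)
print(DEFAULTS, stricter)
```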

SLIDE 21

Evaluation on the Document Level

Guess \ Actual     Plagiarism‐free   Plagiarized
Plagiarism‐free    1102              545 (22%)
Plagiarized        443               1001 (78%)

Percentages in parentheses: share of plagiarized passages; 78% is the upper bound for recall

Results on IPAT‐DC
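
Document-level precision, recall, and accuracy are not stated on the slide, but they follow from the four counts in the table. The sketch below treats "plagiarized" as the positive class; the function name is hypothetical and the derived figures are a reader's calculation, not results reported by the author.

```python
def document_level_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy for the 'plagiarized' class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Counts taken from the table: rows are guesses, columns are actual labels.
precision, recall, accuracy = document_level_metrics(
    tp=1001,  # guessed plagiarized, actually plagiarized
    fp=443,   # guessed plagiarized, actually plagiarism-free
    fn=545,   # guessed plagiarism-free, actually plagiarized
    tn=1102,  # guessed plagiarism-free, actually plagiarism-free
)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

With these counts, document-level precision is roughly 0.69, recall roughly 0.65, and accuracy roughly 0.68.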

SLIDE 22

False Negatives

The majority of false negatives are relatively short documents (<30K chars)
The shorter a document, the more likely it is to be a false negative

[Figure: number of documents (200 to 1600) by text length (chars): <10K, 10K-30K, 30K-100K, >100K; series: false negatives, all documents]

SLIDE 23

Evaluation on the Passage Level

Corpus          IPAT‐DC   IPAT‐CC
Recall          0.4552    0.4607
Precision       0.2183    0.2321
F‐score         0.2876    0.3086
Granularity     1.22      1.25
Overall score   0.2358    0.2462

Performance remains stable for both corpora

SLIDE 24

Recall and Precision vs. Text‐length

Recall is affected by decreasing text‐length

– A result of false negative distribution

[Figure: recall and precision (10 to 60) by text length (chars): <10K, 10K-30K, 30K-100K, >100K]

SLIDE 25

Conclusions

A fully‐automated approach

– Easy to follow (no text preprocessing)
– Able to detect plagiarism‐free documents
– Able to detect plagiarized passage boundaries

Nearly half of plagiarized passages are detected while precision remains low

– An increased a value can improve precision (and harm recall)

Window length determines the shortest plagiarized passage that can be detected

SLIDE 26

Future Work

Definition of more sophisticated criteria
Parameter settings can be optimized by machine learning algorithms
Different schemes to acquire style change function

– Comparison of text window with the window complement
– Comparison of text window with all the other text windows