Intrinsic Plagiarism Detection Using Character n-gram Profiles
Efstathios Stamatatos
University of the Aegean
Talk Layout
- Introduction
- The style change function
- Detecting plagiarism
- Evaluation
- Conclusions
Intrinsic Plagiarism Detection
- Ambitious and demanding task
- It can be used:
– When no appropriate reference corpus is available
– When the reference corpus is too large (web)
- Closely related to authorship verification
- Detection of irregularities of stylistic nature
– However, not all stylistic irregularities are caused by plagiarism
Representing Writing Style
- Lexical features
- Character features
- Syntactic features
- Semantic features
- Application-specific features
Character n-grams
- Can be easily measured in any text
- Language-independent
- Domain-independent
- Require no text preprocessing
- Very effective in authorship attribution
- Robust to noise
– Obfuscation in plagiarism can be considered as noise insertion
The Proposed Approach
- The variation of document style is represented by the style change function
– Using a sliding window over the text length
- Writing style is represented by character n-gram profiles
– The set of different character n-grams encountered in the text and their normalized frequencies
- A set of heuristic rules:
– Decide whether or not the document is plagiarism-free
– Detect the plagiarized section boundaries
– Detect irrelevant stylistic inconsistencies
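A character n-gram profile as described above can be sketched in Python. This is only an illustration: the function name `ngram_profile` is hypothetical, and normalizing each count by the total number of n-grams in the text is an assumption, since the slides do not spell out the normalization.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Map each character n-gram in `text` to its normalized frequency.

    Assumption: frequencies are normalized by the total number of
    n-grams in the text, so the values of a non-empty profile sum to 1.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}  # text shorter than n characters
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


# "abcabc" contains the 3-grams abc, bca, cab, abc
profile = ngram_profile("abcabc", n=3)
```

The profile is a plain dictionary, so looking up an n-gram that never occurs can be handled with `profile.get(g, 0.0)`.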
Representing Stylistic Changes
[Diagram: a sliding window (length, step) moves over the document; the profile of each text window is compared with the profile of the whole document by distance estimation]
- High value means stylistic anomaly
- Low value means stylistic consistency
Distance Estimation
- The sliding window text is shorter (or much shorter) than the whole document
- An accurate and robust function for imbalanced profiles is proposed by (Stamatatos, 2007):

$$d_1(A, B) = \sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2$$

- This is not a symmetric function
– dissimilarity rather than distance measure
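The dissimilarity d1 can be sketched as a short Python function. The sketch assumes that a profile is a dictionary from n-gram to normalized frequency, that P(A) is the set of n-grams occurring in A, and that an n-gram missing from B has frequency 0; the name `d1` and the toy profiles are illustrative only.

```python
def d1(profile_a, profile_b):
    """Sum over the n-grams of A of (2 (fA - fB) / (fA + fB))^2.

    Only n-grams that occur in A are visited, so fA > 0 and the
    denominator can never be zero.
    """
    total = 0.0
    for gram, fa in profile_a.items():
        fb = profile_b.get(gram, 0.0)  # assumed 0 when B lacks the n-gram
        total += (2.0 * (fa - fb) / (fa + fb)) ** 2
    return total


# Toy profiles (hypothetical values) that show the asymmetry
short_text = {"a": 1.0}
long_text = {"a": 0.5, "b": 0.5}
forward = d1(short_text, long_text)
backward = d1(long_text, short_text)
```

Because the sum runs only over the n-grams of the first argument, swapping the arguments changes the result, which is why the function is a dissimilarity and not a distance.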
Style Change Function
- d1 is normalized over the profile length:

$$nd_1(A, B) = \frac{\sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2}{4\,|P(A)|}$$

- Then, the style change function sc of a document D is:

$$sc(i, D) = nd_1(w_i, D), \quad i = 1, \ldots, |w|$$

- |w| depends on the text length:

$$|w| = \left\lfloor 1 + \frac{x - l}{s} \right\rfloor$$

– x: text length
– l: sliding window length
– s: sliding window step
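Putting the pieces together, the style change function might be computed as in the following sketch. It is a simplified illustration, not the author's implementation: windows are taken over raw characters, the helper names are hypothetical, and returning no windows for a text shorter than l is an assumption.

```python
from collections import Counter


def ngram_profile(text, n=3):
    """Normalized frequency of every character n-gram in `text`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


def nd1(profile_a, profile_b):
    """d1 divided by 4 * |P(A)|; each term is at most 4, so 0 <= nd1 <= 1."""
    if not profile_a:
        return 0.0
    total = 0.0
    for gram, fa in profile_a.items():
        fb = profile_b.get(gram, 0.0)
        total += (2.0 * (fa - fb) / (fa + fb)) ** 2
    return total / (4.0 * len(profile_a))


def window_count(x, l, s):
    """|w| = floor(1 + (x - l) / s); assumed 0 when x < l."""
    return 0 if x < l else (x - l) // s + 1


def style_change(text, l=1000, s=200, n=3):
    """sc(i, D) = nd1(w_i, D) for every sliding window w_i of the text."""
    doc_profile = ngram_profile(text, n)
    return [
        nd1(ngram_profile(text[k * s:k * s + l], n), doc_profile)
        for k in range(window_count(len(text), l, s))
    ]


sample = "the quick brown fox jumps over the lazy dog. " * 50
sc = style_change(sample)
```

Dividing by 4|P(A)| keeps every sc value between 0 and 1, which makes values comparable across windows whose profiles differ in size.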
An Example
[Figure: style change function vs. sliding window position, IPAT-DC document #5]
A Plagiarism-free Example
[Figure: style change function vs. sliding window position, IPAT-DC document #17]
Detecting Plagiarism on the Document Level
- This is crucial to keep precision high
- Two options:
– Pre-processing
– Post-processing
- Plagiarism-free criterion: S < t1, where
– S: the standard deviation of the style change function
– t1: a predefined threshold (0.02)
- Deficiencies:
– Very short documents tend to have low sc values
– Very long documents may contain stylistically inconsistent sections (high variance of sc)
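The document-level criterion amounts to a single comparison, sketched below. The function name is hypothetical, and two details are assumptions because the slides do not state them: the population standard deviation is used, and a document with no windows at all is treated as plagiarism-free.

```python
from statistics import pstdev


def is_plagiarism_free(sc_values, t1=0.02):
    """Return True when the standard deviation S of sc is below t1."""
    if not sc_values:
        return True  # assumption: nothing to analyse, nothing to flag
    return pstdev(sc_values) < t1


# Hypothetical sc values: a flat curve and a curve with one clear peak
flat = [0.20, 0.20, 0.21, 0.20]
peaked = [0.10, 0.10, 0.40, 0.10]
flat_result = is_plagiarism_free(flat)
peaked_result = is_plagiarism_free(peaked)
```

A flat style change curve has a small standard deviation and passes the test, while a curve with a clear peak does not.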
A False Negative Example
[Figure: style change function vs. sliding window position, IPAT-DC document #34]
Identifying Plagiarized Passages
- It is assumed that at least half of the text is not plagiarized
– The average sc value would correspond to the style of the alleged author
- In general, the amount of plagiarized text is not known
– All sc values greater than M+S are removed (M: mean, S: standard deviation of sc)
– M′ and S′ are then calculated from the remaining values
- Plagiarized passage criterion: sc(i′,D) > M′+a*S′
– a determines the sensitivity of the method (set to 2.0)
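The two-pass thresholding above can be sketched as follows. The function name is hypothetical, the population standard deviation is an assumption, and so is the choice to test every window of the document (not only the retained ones) against M′ + a·S′.

```python
from statistics import mean, pstdev


def plagiarized_windows(sc_values, a=2.0):
    """Return the indices of windows whose sc value exceeds M' + a * S'."""
    if not sc_values:
        return []
    m, s = mean(sc_values), pstdev(sc_values)
    # First pass: drop values above M + S so that likely plagiarized
    # windows do not inflate the estimate of the author's own style.
    kept = [v for v in sc_values if v <= m + s]
    m_prime, s_prime = mean(kept), pstdev(kept)
    # Second pass: flag windows that stand out from the cleaned statistics.
    return [i for i, v in enumerate(sc_values) if v > m_prime + a * s_prime]


# Hypothetical sc curve with two anomalous windows at positions 5 and 6
curve = [0.10, 0.11, 0.10, 0.11, 0.11, 0.40, 0.42, 0.10, 0.11, 0.10]
flagged = plagiarized_windows(curve)
```

The smallest sc value never exceeds the mean, so the list of retained values is never empty for a non-empty input.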
An Example
[Figure: style change function vs. sliding window position, IPAT-DC document #5]
Another Example
[Figure: style change function vs. sliding window position, IPAT-DC document #22]
Detecting Irrelevant Style Changes
- Not all stylistic changes are caused by plagiarism
– Text formatting affects style
– Genre affects style
– …
- To reduce the formatting factor:
– All text is transformed to lowercase
– Every character n-gram that contains no letter characters (a-z) is removed from the profile
– The sliding window parameters operate on letter characters: each window has the same number of letter characters (window length l) but a different number of total characters (real window length l′)
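The formatting-related steps above might look like the following sketch. All names are hypothetical, renormalizing the frequencies after the filtering step is an assumption, and windows are returned as (start, end) character offsets so that the real window length l′ is simply end minus start.

```python
from collections import Counter


def has_letter(gram):
    """True if the n-gram contains at least one letter a-z."""
    return any("a" <= c <= "z" for c in gram)


def filtered_profile(text, n=3):
    """Lowercase the text and keep only n-grams that contain a letter."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    grams = [g for g in grams if has_letter(g)]
    if not grams:
        return {}
    counts = Counter(grams)
    # Assumption: frequencies are renormalized after filtering.
    return {g: c / len(grams) for g, c in counts.items()}


def letter_windows(text, l=1000, s=200):
    """Windows holding exactly l letters each, starting s letters apart."""
    text = text.lower()
    letter_pos = [i for i, c in enumerate(text) if "a" <= c <= "z"]
    windows = []
    k = 0
    while k + l <= len(letter_pos):
        start = letter_pos[k]
        end = letter_pos[k + l - 1] + 1  # one past the window's last letter
        windows.append((start, end))
        k += s
    return windows


spans = letter_windows("ab, cd!! ef", l=4, s=2)
real_lengths = [end - start for start, end in spans]
small_profile = filtered_profile("AB12345", n=3)
```

In the toy input both windows contain four letters, yet their real lengths differ because of the punctuation and spaces between the letters.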
Detecting Irrelevant Style Changes
- To reduce the multiple genre factor:
– Special section criterion: l′ < t2, where
– l′: the real window length
– t2: a predefined threshold (1,500)
– It combines with the plagiarized passage criterion
- Weaknesses:
– One can insert multiple non-letter characters to obfuscate a plagiarized section
– All special sections (table of contents, index) are considered plagiarism-free
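One possible reading of how the two criteria combine is sketched below. It assumes that a window is flagged only when both sc > M′ + a·S′ and l′ < t2 hold, so that windows with l′ ≥ t2 are treated as special sections. This interpretation matches the weaknesses listed above but is not stated explicitly on the slides, and the function name and sample values are hypothetical.

```python
def is_plagiarized_window(sc_value, real_length, m_prime, s_prime,
                          a=2.0, t2=1500):
    """Combine the plagiarized passage and special section criteria."""
    if real_length >= t2:
        # Assumed special section (e.g. table of contents, index):
        # many non-letter characters stretch the real window length.
        return False
    return sc_value > m_prime + a * s_prime


# Hypothetical statistics M' = 0.105 and S' = 0.005
normal_text = is_plagiarized_window(0.40, 1230, 0.105, 0.005)
special_section = is_plagiarized_window(0.40, 2100, 0.105, 0.005)
```

Under this reading, padding a copied passage with non-letter characters pushes l′ above t2 and hides it, which is the first weakness noted above.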
An Example
IPAT‐DC Document #46
Summary of Parameter Settings
| Description | Symbol | Value |
|---|---|---|
| Character n-gram length | n | 3 |
| Sliding window length | l | 1,000 |
| Sliding window step | s | 200 |
| Threshold of plagiarism-free criterion | t1 | 0.02 |
| Real window length threshold | t2 | 1,500 |
| Sensitivity of plagiarism detection | a | 2 |
- Empirically derived, not optimized
Evaluation on the Document Level
| Guess \ Actual | Plagiarism-free | Plagiarized |
|---|---|---|
| Plagiarism-free | 1102 | 545 (22%) |
| Plagiarized | 443 | 1001 (78%) |
– Percentages refer to plagiarized passages; the 78% figure is the upper bound for recall
- Results on IPAT-DC
False Negatives
- The majority of false negatives are relatively short documents (<30K chars)
- The shorter a document, the more likely it is to be a false negative
[Figure: number of documents (false negatives vs. all documents) by text length in chars: <10K, 10K-30K, 30K-100K, >100K]
Evaluation on the Passage Level
| Corpus | IPAT-DC | IPAT-CC |
|---|---|---|
| Recall | 0.4552 | 0.4607 |
| Precision | 0.2183 | 0.2321 |
| F-score | 0.2876 | 0.3086 |
| Granularity | 1.22 | 1.25 |
| Overall score | 0.2358 | 0.2462 |
- Performance remains stable for both corpora
Recall and Precision vs. Text Length
- Recall is affected by decreasing text length
– A result of the false negative distribution
[Figure: recall and precision by text length in chars: <10K, 10K-30K, 30K-100K, >100K]
Conclusions
- A fully-automated approach
– Easy to follow (no text preprocessing)
– Able to detect plagiarism-free documents
– Able to detect plagiarized passage boundaries
- Nearly half of plagiarized passages are detected while precision remains low
– An increased a value can improve precision (and harm recall)
- Window length determines the shortest plagiarized passage that can be detected
Future Work
- Definition of more sophisticated criteria
- Parameter settings can be optimized by machine learning algorithms
- Different schemes to acquire style change