SLIDE 1

A plagiarism detection procedure in three steps: selection, matches and “squares”

Chiara Basile - basile@dm.unibo.it

Mathematics Department University of Bologna, Italy

PAN'09 Workshop, San Sebastián - Donostia, 10/09/2009

Joint work with Dario Benedetto, Emanuele Caglioti, Giampaolo Cristadoro, Mirko Degli Esposti

Chiara Basile (University of Bologna) Plagiarism detection in three steps San Sebastián, 10/09/2009 1 / 12

SLIDE 8

Introduction

Once upon a time...

03/05/09: A group of mathematicians from the Universities of Bologna and Rome La Sapienza learns of the Plagiarism Competition and decides to try some preliminary experiments on the external plagiarism corpus, using methods developed for different tasks, like authorship recognition and text categorization.

The competition deadline: 07/06/09 - just one month...

...and a few documents: “just” 14,428!

Therefore, two imperatives:

1. be (not only computationally) fast
2. use heuristics

SLIDE 10

Introduction

Where do we come from?

Various problems of classification and clustering of symbolic sequences (authorship attribution, classification of biological or genetic sequences, ...)

The Gramsci Project:

  • C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, “An example of mathematical authorship attribution”, Journal of Mathematical Physics 49, 125211 (2008).

SLIDE 16

Introduction

Where do we come from?

Various problems of classification and clustering of symbolic sequences (authorship attribution, classification of biological or genetic sequences, ...) are tackled using ideas from Information Theory, Dynamical Systems, Statistical Mechanics..., usually by defining some similarity metric(s) to estimate the “distance” between pairs of sequences.

Given two texts x, y, their n-gram distance is:

$$
d_n(x, y) := \frac{1}{|D_n(x)| + |D_n(y)|} \sum_{\omega \in D_n(x) \cup D_n(y)} \left( \frac{f_x(\omega) - f_y(\omega)}{f_x(\omega) + f_y(\omega)} \right)^2
$$

where:

◮ f_x(ω) = frequency of the (character) n-gram ω in x;
◮ D_n(x) = set of all the n-grams with non-zero frequency in x.
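The n-gram distance defined above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the use of relative frequencies (counts divided by the total number of n-grams) is an assumption, since the slide does not say whether frequencies are normalised.

```python
from collections import Counter


def ngram_frequencies(text, n):
    """Relative frequency of every character n-gram occurring in text.

    Assumption: frequencies are normalised by the number of n-grams.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def ngram_distance(x, y, n):
    """d_n(x, y): mean squared relative frequency difference over the n-gram union."""
    fx, fy = ngram_frequencies(x, n), ngram_frequencies(y, n)
    if not fx and not fy:
        return 0.0  # both texts are shorter than n: nothing to compare
    total = 0.0
    for gram in fx.keys() | fy.keys():
        a, b = fx.get(gram, 0.0), fy.get(gram, 0.0)
        total += ((a - b) / (a + b)) ** 2
    return total / (len(fx) + len(fy))
```

With this definition two identical texts are at distance 0, and two texts sharing no n-gram are at distance 1, because every term of the sum is then equal to 1.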

SLIDE 17

Introduction

Corpus statistics

[Figure: distribution of text lengths. Axes: text length (characters), percentage of texts (logarithmic scale). Series: suspicious texts (competition), suspicious texts (training), source texts (competition), source texts (training).]

SLIDE 18

Introduction

Corpus statistics

[Figure: distribution of plagiarized passage lengths. Axes: length (characters), percentage of plagiarized passages.]

SLIDE 27

Our method

1 - Selection

First of all: reduce the search space by selecting, for each suspicious text, a small number of suitable candidate sources of plagiarism.

Can we use the n-gram distance for this task? Maybe, but there are not enough statistics using the “normal” alphabet, and it takes too long ⇒ reduce the alphabet!

We converted all texts into word lengths (up to 9):

To be or not to be: that is the question → 2223224238
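A minimal sketch of this conversion, under two assumptions that the slide does not spell out: a word is taken to be a maximal run of letters, and “up to 9” is read as capping longer words at 9 so that every word maps to a single digit. The function name is hypothetical.

```python
import re


def word_length_encode(text, cap=9):
    """Replace every word by its length, capped at `cap` (assumed reading of "up to 9")."""
    words = re.findall(r"[^\W\d_]+", text)  # maximal runs of letters
    return "".join(str(min(len(w), cap)) for w in words)


print(word_length_encode("To be or not to be: that is the question"))
```

Run on the sentence above, this prints the same digit string as the slide; the reduced alphabet has only 9 symbols, which is what makes 8-gram statistics feasible.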

The value n = 8 was chosen as a compromise between

◮ acceptable computational time (2.3 days for the whole corpus)
◮ a good recall (81% of the plagiarized characters come from the first 10 neighbours → very good! 13% of translated plagiarism...)
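The selection of the first 10 neighbours can be pictured as a plain nearest-neighbour query. The sketch below is illustrative only: `select_candidates` is a hypothetical helper, the `distance` argument stands for the 8-gram distance computed on the word-length encoding, and the toy distance in the usage lines is there just to make the example runnable.

```python
import heapq


def select_candidates(suspicious, sources, distance, k=10):
    """Return the names of the k source texts closest to the suspicious text.

    sources maps a document name to its encoded text; distance is any
    function taking two encoded texts and returning a number.
    """
    return heapq.nsmallest(
        k, sources, key=lambda name: distance(suspicious, sources[name])
    )


# Toy usage: count differing positions instead of the real 8-gram distance.
toy_sources = {"a": "1111", "b": "1122", "c": "2222"}
toy_distance = lambda u, v: sum(p != q for p, q in zip(u, v))
print(select_candidates("1111", toy_sources, toy_distance, k=2))
```

Only the k selected sources are passed on to the detailed matching phase, which is what keeps the later steps tractable.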

SLIDE 32

Our method

2 - Matches

Now we can perform a detailed analysis on the 7214 × 10 pairs of texts, looking for common subsequences (matches) longer than a fixed threshold (e.g. 15 characters).

A new conversion: T9 encoding. Why T9?

◮ “almost unique” translation for long enough sequences (10-15 characters);
◮ it reduces the alphabet to 10 symbols ⇒ speeds up the indexing phase of the matching algorithm.

Computation time for the whole corpus: 40 hours.
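A possible T9-style encoder is sketched below. The letter groups are those of a standard phone keypad; mapping whitespace to 0 and every other character to 1 is an assumption, chosen because it reproduces the encoded example shown in the summary of the procedure (where the space becomes 0 and the comma becomes 1). The names are hypothetical.

```python
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: digit for digit, letters in KEYPAD.items() for ch in letters}


def t9_encode(text):
    """Map letters to keypad digits, whitespace to 0, anything else to 1 (assumed)."""
    out = []
    for ch in text.lower():
        if ch.isspace():
            out.append("0")
        else:
            out.append(LETTER_TO_DIGIT.get(ch, "1"))
    return "".join(out)


print(t9_encode("The Constance letters of Charles Chapin, edited by"))
```

Several letters share a digit, so a single symbol is ambiguous, but a run of 10-15 digits rarely corresponds to more than one piece of real text, which is the “almost unique” property mentioned above.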

SLIDE 33

Our method

2 - Matches (continued)

[Figure: matches plotted by position, suspiciousdocument00814.txt vs. sourcedocument04005.txt]

SLIDE 34

Our method

2 - Matches (continued)

[Figure: matches plotted by position, suspiciousdocument00814.txt vs. sourcedocument03464.txt]

SLIDE 41

Our method

3-“Squares”

How to identify the “squares” which are so evident in this picture?

[Figure: matches plotted by position, suspiciousdocument00814.txt vs. sourcedocument04005.txt]

We need scalability! Join two matches if the following conditions hold simultaneously:

1. the matches are consecutive in the suspicious file;
2. the matches do not overlap in the suspicious file, and their distance in the suspicious file is not larger than the length of the longer of the two matches, scaled by δx;
3. the same as 2 (with a possibly different δy) in the source file.

Then: repeatedly merge overlapping segments, and run the algorithm above again with smaller parameters δ′x and δ′y.
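The three joining conditions can be sketched as follows. This is a reading of the rules above, not the authors' code: the `Match` record, the helper names and the way the distance between two matches is measured (gap between the two intervals, in either order in the source file) are all assumptions, and the later merging of overlapping segments and the second pass with δ′x, δ′y are left out.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    x: int       # start offset in the suspicious file
    y: int       # start offset in the source file
    length: int  # number of matching characters


def gap(a_start, a_len, b_start, b_len):
    """Distance between two intervals; a negative value means they overlap."""
    return max(a_start, b_start) - min(a_start + a_len, b_start + b_len)


def can_join(m1, m2, delta_x, delta_y):
    """Conditions 2 and 3: no overlap and a gap scaled by the longer match."""
    longer = max(m1.length, m2.length)
    gap_x = gap(m1.x, m1.length, m2.x, m2.length)
    gap_y = gap(m1.y, m1.length, m2.y, m2.length)
    return 0 <= gap_x <= delta_x * longer and 0 <= gap_y <= delta_y * longer


def join_matches(matches, delta_x, delta_y) -> List[List[Match]]:
    """Condition 1: only matches consecutive in the suspicious file are compared."""
    groups: List[List[Match]] = []
    for m in sorted(matches):  # sorted by offset in the suspicious file
        if groups and can_join(groups[-1][-1], m, delta_x, delta_y):
            groups[-1].append(m)
        else:
            groups.append([m])
    return groups
```

Because the allowed gap is proportional to the match length, long matches can bridge long gaps while short ones cannot, which is one way to read the “scalability” requirement.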

SLIDE 46

Our method

Summary of the procedure

1 - Selection

The Constance letters of Charles Chapin, edited by Eleanor Early and Constance... → 397276627539...

suspicious-document00814, ranked by the 8-gram distance:

1) source-document04005
2) source-document04080
3) source-document02123
4) source-document02648
5) source-document03464
6) source-document02737
7) source-document03876
8) source-document05012
9) source-document04456
10) source-document04223

SLIDE 48

Our method

Summary of the procedure

2 - Matches

The Constance letters of Charles Chapin, edited by Eleanor Early and Constance... → 843026678262305388377063024275370242746103348330290353266703275902630266782623...

[Figure: matches plotted by position, suspiciousdocument00814.txt vs. sourcedocument04005.txt]

SLIDE 52

Our method

Summary of the procedure

3 - “Squares”

[Figure: matches plotted by position, suspiciousdocument00814.txt vs. sourcedocument04005.txt]

1496 matches → 244 pieces → 16 passages → 8 suspected plagiarisms


Comparison with the associated xml file... ok!

SLIDE 61

Conclusions

Results and conclusions

Results on the competition corpus, with δx = δy = 3, δ′x = δ′y = 0.5:

◮ Precision: 0.6727
◮ Recall: 0.6272
◮ F-measure: 0.6491
◮ Granularity: 1.0745
◮ Overall score: 0.6041

i.e. the third overall score, after the 0.6957 and 0.6093 of the first two.

Many possible improvements:

◮ fewer heuristics in the tuning of δx, δy, δ′x, δ′y... density of matches? Maybe they can be used to control precision, recall and granularity according to the task...
◮ there are certainly better ideas for the selection phase...
◮ try other/standard clustering algorithms

And... what about the internal plagiarism problem?
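The reported figures can be cross-checked with a little arithmetic. The sketch below assumes that the F-measure is the harmonic mean of precision and recall, and that the overall score is the F-measure divided by log2(1 + granularity); neither formula is stated on the slides, so both are assumptions here, and the function names are hypothetical.

```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (assumed definition)."""
    return 2 * precision * recall / (precision + recall)


def overall_score(f, granularity):
    """Assumed definition: F-measure penalised by log2(1 + granularity)."""
    return f / math.log2(1 + granularity)


print(round(f_measure(0.6727, 0.6272), 4))      # close to the reported 0.6491
print(round(overall_score(0.6491, 1.0745), 4))  # does not give 0.6041
print(round(0.6491 / 1.0745, 4))                # gives the reported 0.6041
```

Under these assumptions the reported overall score is reproduced by dividing the F-measure by 1.0745 directly, which may suggest that the value listed as granularity is already log2(1 + granularity); this cannot be confirmed from the slides alone.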

SLIDE 62

Conclusions

To conclude

Thank you!

SLIDE 68

Appendix

Our matching algorithm

Phase 1: every source document s of length N is indexed (once and for all) by two vectors:

◮ index has length N, and its i-th element is the index of the previous occurrence in s of the 7-gram s_i, ..., s_{i+6};
◮ last has length 10^7, and its j-th element is the index of the last occurrence in s of the 7-gram j (padded with zeroes on the left, if needed).

N.B. The minimum length for detected matches is 7.

Phase 2: every suspicious document t (of length M) is run through once, and for each k = 0, ..., M − 1 the indexes p = last(t_k, ..., t_{k+6}) and index(p) are used to retrieve the positions of the possible matches in s without running through it again.

Total cost: M + N for each suspicious-source pair.
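The two phases can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `last` is a dictionary keyed by the 7-gram instead of an array of length 10^7, -1 marks “no occurrence”, and the function names are hypothetical. Extending each 7-gram hit to a maximal match is left out.

```python
def build_index(s, n=7):
    """Phase 1: chain together the occurrences of every n-gram of s."""
    last = {}              # n-gram -> position of its last occurrence in s
    index = [-1] * len(s)  # index[i] -> previous occurrence of s[i:i+n], or -1
    for i in range(len(s) - n + 1):
        gram = s[i:i + n]
        index[i] = last.get(gram, -1)
        last[gram] = i
    return index, last


def candidate_positions(t, index, last, n=7):
    """Phase 2: yield (k, p) whenever t[k:k+n] equals s[p:p+n]."""
    for k in range(len(t) - n + 1):
        p = last.get(t[k:k + n], -1)
        while p != -1:  # walk back through earlier occurrences
            yield k, p
            p = index[p]


index, last = build_index("1234567891234567")
print(list(candidate_positions("001234567", index, last)))
```

The source is scanned once to build the chains and the suspicious text once to query them, which is where the M + N cost comes from; following a chain adds work proportional to the number of hits found.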
