Approaches for Source Retrieval and Text Alignment of Plagiarism Detection



SLIDE 1

Approaches for Source Retrieval and Text Alignment of Plagiarism Detection

Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan

www.hljit.edu.cn
PAN@CLEF2013

SLIDE 5

Who are we?

Heilongjiang Institute of Technology
Harbin, Heilongjiang Province, China

SLIDE 6

Our University

Heilongjiang Institute of Technology, Kong Leilei PAN@CLEF2013

SLIDE 12

Index

• Approaches for Source Retrieval
• Approaches for Text Alignment
• Further work

SLIDE 13

Source Retrieval

[Figure: pipeline overview. Labels: Suspicious document, Query keywords, Source Retrieval, Internet Resource, Document Set, Candidate Documents, Text Alignment, Suspicious plagiarism text]

SLIDE 15

Two problems of source retrieval

Two core problems of source retrieval:

• The retrieval source is millions of documents from the Internet
  • This work was done by PAN
• The query keywords of the suspicious document, which would be used for retrieval, are not specified
  • How to extract query keywords is one of the important issues of our work

SLIDE 16

Query Keywords Extraction

• Query Keywords Extraction Based on TF-IDF
• Query Keywords Extraction Based on Weighted TF-IDF
• Adjacent Query Keywords Extraction by PatTree
• Combination of Queries and Execution of Retrieval

SLIDE 17

Keywords Based on TF-IDF

• TF (term frequency) denotes the frequency of term i in document j: $tf_{i,j}$
• IDF (inverse document frequency): $\mathrm{IDF}_i = \log_2(N / df_i)$, where $df_i$ is the number of documents that contain term i and N is the total number of documents
• The TF-IDF of term i is: $\text{TF-IDF}_{i,j} = tf_{i,j} \times \mathrm{IDF}_i$
• Tip: we found that the top 10 terms with the highest TF-IDF give good results
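
The TF-IDF ranking on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `top_tfidf_terms`, the toy corpus, and the choice to treat unseen terms as having a document frequency of 1 are all assumptions made for the example.

```python
import math
from collections import Counter


def top_tfidf_terms(doc_tokens, corpus, k=10):
    """Return the k terms of doc_tokens with the highest TF-IDF score."""
    n_docs = len(corpus)
    # df[t] = number of corpus documents that contain term t
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scores = {}
    for term, freq in tf.items():
        # Assumption: a term absent from the corpus is treated as df = 1
        # so that the division is always defined.
        idf = math.log2(n_docs / max(df[term], 1))
        scores[term] = freq * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "plagiarism detection needs source retrieval".split(),
]
suspicious = "source retrieval finds the source of plagiarism".split()
keywords = top_tfidf_terms(suspicious, corpus, k=3)
```

With `k=10` the function returns the ten highest-scoring terms, matching the tip on the slide.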

SLIDE 18

Keywords Based on Weighted TF-IDF

• Weighted TF-IDF, where weight is a weighting parameter: we calculate the weight of term i according to its location
• Tip: keyword extraction based on the weighted TF-IDF is sometimes useful and sometimes not
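
The slide does not preserve the weighting formula, so the following is only a sketch of one plausible reading: the TF-IDF score of a term is multiplied by a weight that depends on where the term first appears. The rule in `position_weight` (a boost of 1.5 for terms first seen in the first 20% of the document) is a hypothetical choice for illustration and is not taken from the source.

```python
import math
from collections import Counter


def position_weight(first_index, doc_length, early_fraction=0.2, boost=1.5):
    """Hypothetical location weight: boost terms that first appear early."""
    return boost if first_index < early_fraction * doc_length else 1.0


def weighted_tfidf(doc_tokens, df, n_docs):
    """Score each term as weight * TF * IDF, with IDF = log2(N / df)."""
    tf = Counter(doc_tokens)
    # Record the index of the first occurrence of every term.
    first_seen = {}
    for index, term in enumerate(doc_tokens):
        first_seen.setdefault(term, index)
    scores = {}
    for term, freq in tf.items():
        # Assumption: terms missing from df are treated as df = 1.
        idf = math.log2(n_docs / max(df.get(term, 0), 1))
        weight = position_weight(first_seen[term], len(doc_tokens))
        scores[term] = weight * freq * idf
    return scores


doc = "alignment of text needs seeds and text merging and filtering".split()
df = {"alignment": 1, "text": 2, "merging": 1, "of": 4, "and": 4}
scores = weighted_tfidf(doc, df, n_docs=4)
```

Here "alignment" is boosted because it opens the document, while "text" keeps its plain TF-IDF score.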

SLIDE 19

Adjacent Query Keywords Extraction by PatTree

• An adjacent string with high frequency is more important than a single word
• We use PatTree, an efficient data structure, to get the adjacent strings and their frequencies

[Figure: example]
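
To make the idea of frequent adjacent strings concrete, the sketch below counts word n-grams with a plain `collections.Counter`. This is a deliberate stand-in and not a PatTree: it produces the same kind of output (adjacent strings with their frequencies) but without the efficiency of the tree structure. The function name and the `min_count` threshold are assumptions.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count=2):
    """Count adjacent n-word strings and keep those seen at least min_count times."""
    # zip over n shifted copies of the token list yields every n-gram.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return [(gram, count) for gram, count in counts.most_common() if count >= min_count]


tokens = "source retrieval and text alignment follow source retrieval".split()
bigrams = frequent_ngrams(tokens, n=2)
```

Calling the function with `n=3`, `n=4`, or `n=5` gives candidates for the longer n-gram queries.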

SLIDE 20

Combination of Queries

Table 1: Query Combination and Group

Query   Query Keywords
1       Top 1 to 5 query keywords based on TF-IDF
2       Top 2 to 10 query keywords based on TF-IDF
3       2-Gram query keywords based on PatTree
4       3-Gram query keywords based on PatTree
5       4-Gram query keywords based on PatTree
6       4-Gram query keywords based on PatTree
7       Top 1 to 5 query keywords based on weighted TF-IDF
8       Top 6 to 10 query keywords based on weighted TF-IDF
9       5-Gram query keywords based on PatTree
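
One way to assemble a group of queries from ranked keywords and n-gram strings is sketched below. The helper names, the rank ranges used in `build_queries`, and the de-duplication step are assumptions for illustration; the sketch does not reproduce the exact grouping in Table 1.

```python
def rank_slice_query(ranked_keywords, first, last):
    """Join the keywords ranked first..last (1-based, inclusive) into one query."""
    return " ".join(ranked_keywords[first - 1:last])


def build_queries(tfidf_ranked, weighted_ranked, ngram_by_n):
    """Assemble a list of query strings from the three keyword sources."""
    queries = [
        rank_slice_query(tfidf_ranked, 1, 5),
        rank_slice_query(weighted_ranked, 1, 5),
    ]
    # One query per n-gram length, in increasing n.
    queries += [ngram_by_n[n] for n in sorted(ngram_by_n)]
    # Drop empty queries and duplicates while preserving order.
    return list(dict.fromkeys(q for q in queries if q))


ranked = ["a", "b", "c", "d", "e", "f"]
queries = build_queries(ranked, ranked, {3: "x y z", 2: "x y"})
```

Each resulting string would then be submitted to the search engine in turn; that retrieval step is outside this sketch.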

SLIDE 21

Results: Source Retrieval subtask

Table 2: Results of the PAN@CLEF2013 Source Retrieval subtask

Measure                  Queries   Downloads
Workload                 48.50     5691.47
Time to 1st Detection    2.46      285.66

Retrieved Performance    Precision 0.01, Recall 0.65
No Detection             3

SLIDE 24

Text Alignment

[Figure: pipeline overview. Labels: Suspicious document, Query keywords, Source Retrieval, Internet Resource, Document Set, Candidate Documents, Text Alignment, Suspicious plagiarism text]

SLIDE 26

Text Alignment

SLIDE 28

Text Alignment

[Figure: text alignment steps. Labels: Seeding]

SLIDE 30

Text Alignment

[Figure: text alignment steps. Labels: Seeding, Match Merging]

SLIDE 32

Text Alignment

[Figure: text alignment steps. Labels: Seeding, Match Merging, Extraction Filtering]

SLIDE 39

Text Alignment

[Figure: text alignment steps. Labels: Seeding, Match Merging, Extraction Filtering]

• Bilateral Alternating Merging Algorithm
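
The slides name the Bilateral Alternating Merging Algorithm but do not describe it, so it cannot be reproduced here. As a rough illustration of what a match-merging step does in general, the sketch below uses a simple greedy gap rule instead: seed matches that lie close together in both the suspicious and the source document are joined into one passage. The seed tuple layout, the function name, and the `max_gap` value are all assumptions.

```python
def merge_seeds(seeds, max_gap=50):
    """Greedily merge seed matches that are close in both documents.

    Each seed is (susp_start, susp_end, src_start, src_end) in character offsets.
    This is a generic gap-based merge, not the authors' algorithm.
    """
    merged = []
    for seed in sorted(seeds):
        if merged:
            s0, s1, r0, r1 = merged[-1]
            close_in_susp = seed[0] - s1 <= max_gap
            # The source side must not jump backwards or too far ahead.
            close_in_src = seed[2] - r0 >= 0 and seed[2] - r1 <= max_gap
            if close_in_susp and close_in_src:
                merged[-1] = (s0, max(s1, seed[1]), r0, max(r1, seed[3]))
                continue
        merged.append(seed)
    return merged


seeds = [(0, 20, 100, 120), (30, 50, 125, 150), (400, 420, 900, 920)]
passages = merge_seeds(seeds)
```

The first two seeds are joined because they are within `max_gap` characters of each other on both sides; the third stays separate.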

SLIDE 43

Performance on the PAN2012 test corpus

Table 3: Overall evaluation results for the final test corpus

SLIDE 44

Performance on the PAN2012 test corpus

Table 4: Results for the 02-no-obfuscation sub-corpus

SLIDE 45

Performance on the PAN2012 test corpus

Table 5: Results for the 03-random-obfuscation

SLIDE 46

Performance on the PAN2012 test corpus

Table 6: Results for the 04-translation-obfuscation

SLIDE 47

Performance on the PAN2012 test corpus

Table 7: Evaluation results for the 05-summary-obfuscation

SLIDE 48

Further work

• Use different methods to deal with different plagiarism problems to obtain better performance
• Query keywords extraction and ranking

SLIDE 49

Thank you for your attention!