Approaches for Source Retrieval and Text Alignment of Plagiarism Detection

Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan (www.hljit.edu.cn), PAN@CLEF2013


  1. Approaches for Source Retrieval and Text Alignment of Plagiarism Detection. Kong Leilei, Qi Haoliang, Du Cuixia, Wang Mingxing, Han Zhongyuan. www.hljit.edu.cn. PAN@CLEF2013

  5. Who are we? Heilongjiang Institute of Technology, Harbin, Heilongjiang Province, China

  6. Our University PAN@CLEF2013 Heilongjiang Institute of Technology, Kong Leilei 3

  12. Index
      - Approaches for Source Retrieval
      - Approaches for Text Alignment
      - Further work

  13. Source Retrieval [Pipeline diagram; labels: Suspicious document, Query keywords, Source Retrieval, Candidate Documents, Text Alignment, Suspicious plagiarism text, Internet Document Resource Set]

  15. Two problems of source retrieval
      - The retrieval source is millions of documents from the Internet. This work was done by PAN.
      - The query keywords of the suspicious document, which are used for retrieval, are not specified. How to extract query keywords is one of the important issues of our work.

  16. Query Keywords Extraction
      - Query Keywords Extraction Based on TF-IDF
      - Query Keywords Extraction Based on Weighted TF-IDF
      - Adjacent Query Keywords Extraction by PatTree
      - Combination of Queries and Execution of Retrieval

  17. Keywords Based on TF-IDF
      - TF (term frequency): $\mathrm{TF}_{i,j}$ denotes the frequency of term $i$ in document $j$.
      - IDF (inverse document frequency): $\mathrm{IDF}_i = \log_2(N / df_i)$, where $N$ is the number of documents and $df_i$ is the number of documents containing term $i$.
      - TF-IDF of term $i$ in document $j$: $\text{TF-IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_i$
      - Tip: we found that the 10 terms with the highest TF-IDF obtain good results.
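
The TF-IDF ranking on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the tokenizer, the function name `top_tfidf_terms`, the toy corpus, and the use of raw counts for TF are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def top_tfidf_terms(doc, corpus, k=10):
    """Return the k terms of `doc` with the highest TF-IDF.

    TF is the raw count of the term in `doc`; IDF is log2(N / df), where N is
    the corpus size and df the number of corpus documents containing the term.
    `corpus` is assumed to include `doc`, so df is always at least 1.
    """
    n_docs = len(corpus)
    doc_sets = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(doc))
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        scores[term] = freq * math.log2(n_docs / df)
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, invented for illustration only.
corpus = [
    "plagiarism detection needs source retrieval and text alignment",
    "source retrieval finds candidate documents",
    "text alignment compares two documents",
]
keywords = top_tfidf_terms(corpus[0], corpus, k=3)
```

With `k=10`, the same call would give the top-10 list that the slide reports as working well. In practice a stop-word filter would normally run before scoring, since this sketch ranks words like "and" as highly as content words when they are rare in the corpus.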

  18. Keywords Based on Weighted TF-IDF
      - Weighted TF-IDF: [formula not preserved], where weight is a weighting parameter; we calculate the weight of term $i$ according to its location.
      - Tip: keyword extraction based on weighted TF-IDF is sometimes useful and sometimes not.
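
The slide's weighting formula did not survive, so the following is only one possible reading, given as a sketch. It assumes the weight multiplies TF-IDF and that earlier terms get a larger weight; the linear decay in `location_weight`, the `boost` parameter, and all names are hypothetical and not taken from the presentation.

```python
import math
from collections import Counter


def location_weight(first_index, n_tokens, boost=1.0):
    """Hypothetical weight: 1 + boost at the start of the document,
    decaying linearly towards 1 at the end."""
    return 1.0 + boost * (1.0 - first_index / n_tokens)


def weighted_tfidf(tokens, doc_freq, n_docs, boost=1.0):
    """Score each term as weight * TF * IDF (an assumed multiplicative form).

    `doc_freq` maps a term to the number of corpus documents containing it.
    """
    tf = Counter(tokens)
    first_seen = {}
    for i, tok in enumerate(tokens):
        first_seen.setdefault(tok, i)  # remember only the first position
    scores = {}
    for term, freq in tf.items():
        idf = math.log2(n_docs / doc_freq[term])
        weight = location_weight(first_seen[term], len(tokens), boost)
        scores[term] = weight * freq * idf
    return scores


# Toy input, invented for illustration only.
tokens = ["plagiarism", "detection", "uses", "retrieval"]
doc_freq = {"plagiarism": 1, "detection": 1, "uses": 2, "retrieval": 1}
scores = weighted_tfidf(tokens, doc_freq, n_docs=4)
```

Here "plagiarism" and "retrieval" have the same TF and IDF, but the earlier term scores higher purely because of its position, which is the kind of effect a location-based weight is meant to produce.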

  19. Adjacent Query Keywords Extraction by PatTree
      - An adjacent string with high frequency is more important than a single word.
      - We use PatTree, an efficient data structure, to get the adjacent strings and their frequencies.
      - [Example figure not preserved]
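
To make the idea of frequent adjacent strings concrete, the sketch below counts word n-grams with a plain `Counter`. This deliberately swaps in a simpler technique: it is not a PatTree, it only produces the same kind of output (adjacent strings with their frequencies) on small inputs. The function name `frequent_ngrams` and the `min_count` threshold are assumptions.

```python
from collections import Counter


def frequent_ngrams(tokens, n, min_count=2):
    """Count every run of n adjacent words; keep those seen at least min_count times."""
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(g) for g in grams)
    kept = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Most frequent first, ties broken alphabetically.
    return sorted(kept, key=lambda gc: (-gc[1], gc[0]))


# Toy text, invented for illustration only.
text = "source retrieval and text alignment need source retrieval and text alignment tools"
tokens = text.split()
bigrams = frequent_ngrams(tokens, 2)
fourgrams = frequent_ngrams(tokens, 4)
```

A PAT tree indexes all suffixes once and can report frequencies for every length from the same structure, whereas this sketch recounts for each `n`; for a single suspicious document that difference is mainly one of efficiency.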

  20. Combination of Queries

      Table 1: Query Combination and Group

      | Query | Query Keywords |
      |-------|----------------|
      | 1 | Top 1 to 5 query keywords based on TF-IDF |
      | 2 | Top 2 to 10 query keywords based on TF-IDF |
      | 3 | 2-Gram query keywords based on PatTree |
      | 4 | 3-Gram query keywords based on PatTree |
      | 5 | 4-Gram query keywords based on PatTree |
      | 6 | 4-Gram query keywords based on PatTree |
      | 7 | Top 1 to 5 query keywords based on weighted TF-IDF |
      | 8 | Top 6 to 10 query keywords based on weighted TF-IDF |
      | 9 | 5-Gram query keywords based on PatTree |
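
One way to turn the rows of Table 1 into an ordered query list is sketched below. The plan mirrors the table rows as listed; everything else is an assumption of this example, including the input shapes (`ngrams_by_n` maps n to one chosen n-gram string), the names `QUERY_PLAN` and `build_queries`, and the choice to skip empty or repeated queries.

```python
# One entry per row of Table 1: ranked-keyword rows carry a 1-based,
# inclusive rank range; n-gram rows carry the n-gram length.
QUERY_PLAN = [
    ("tfidf", 1, 5),
    ("tfidf", 2, 10),
    ("ngram", 2),
    ("ngram", 3),
    ("ngram", 4),
    ("ngram", 4),
    ("weighted", 1, 5),
    ("weighted", 6, 10),
    ("ngram", 5),
]


def build_queries(tfidf_ranked, weighted_ranked, ngrams_by_n):
    """Build query strings in table order from ranked keywords and n-grams."""
    ranked = {"tfidf": tfidf_ranked, "weighted": weighted_ranked}
    queries = []
    for kind, *spec in QUERY_PLAN:
        if kind == "ngram":
            query = ngrams_by_n.get(spec[0], "")
        else:
            first, last = spec
            query = " ".join(ranked[kind][first - 1:last])
        # Skip empty queries and exact repeats (an assumption of this sketch).
        if query and query not in queries:
            queries.append(query)
    return queries


# Placeholder inputs, invented for illustration only.
tfidf_ranked = ["t1", "t2", "t3", "t4", "t5", "t6"]
weighted_ranked = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]
ngrams_by_n = {2: "source retrieval", 4: "retrieval and text alignment"}
queries = build_queries(tfidf_ranked, weighted_ranked, ngrams_by_n)
```

Each resulting string would then be submitted to the search engine in order; how the downloaded results are filtered before text alignment is not covered by this sketch.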

  21. Results of the Source Retrieval subtask

      Table 2: Results of the PAN@CLEF2013 Source Retrieval subtask

      | Category | Measure | Value |
      |----------|---------|-------|
      | Workload | Queries | 48.50 |
      | Workload | Downloads | 5691.47 |
      | Time to 1st Detection | Queries | 2.46 |
      | Time to 1st Detection | Downloads | 285.66 |
      | Retrieval Performance | Precision | 0.01 |
      | Retrieval Performance | Recall | 0.65 |
      | No Detection | | 3 |

  24. Text Alignment [Pipeline diagram; labels: Suspicious document, Query keywords, Source Retrieval, Candidate Documents, Text Alignment, Suspicious plagiarism text, Internet Document Resource Set]

  28. Text Alignment: Seeding

  30. Text Alignment: Seeding, Match Merging

  32. Text Alignment: Seeding, Match Merging, Extraction Filtering

  39. Text Alignment: Seeding, Match Merging, Extraction Filtering
      - Bilateral Alternating Merging Algorithm

  43. Performance on the PAN2012 test corpus. Table 3: Overall evaluation results for the final test corpus

  44. Performance on the PAN2012 test corpus. Table 4: Results for the 02-no-obfuscation sub-corpus

  45. Performance on the PAN2012 test corpus. Table 5: Results for the 03-random-obfuscation sub-corpus

  46. Performance on the PAN2012 test corpus. Table 6: Results for the 04-translation-obfuscation sub-corpus

  47. Performance on the PAN2012 test corpus. Table 7: Evaluation results for the 05-summary-obfuscation sub-corpus

  48. Further work
      - Use different methods for different kinds of plagiarism to obtain better performance
      - Query keyword extraction and ranking

  49. Thank you for your attention!
