
INSTITUTO POLITÉCNICO NACIONAL Centro de Investigación en Computación



  1. INSTITUTO POLITÉCNICO NACIONAL Centro de Investigación en Computación. Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh. Tuesday, 16 September 2014

  2. Outline: 1. Task 2. Methodology 3. Adaptive behavior 4. Results 5. Conclusions 6. Future Work

  3. Text Alignment: Given a pair of documents, the task is to identify all contiguous maximal-length passages of reused text between them. [Figure: plagiarism detection pipeline. Labels: suspicious document, collection of documents, source retrieval, source documents, text alignment, suspicious passages]

  4. Preprocessing; Seeding; Extension; Filtering

  5. Sentence splitting (Kiss pretrained Punkt model); tokenizing (Treebank word tokenizer); keeping tokens that start with a letter or digit; reducing to lowercase; stemming (Porter algorithm); joining small sentences (1-2 words) with the next one
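
The preprocessing steps on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: a naive regular-expression splitter and tokenizer stand in for the Punkt model and the Treebank tokenizer, stemming is omitted entirely, and the names `split_sentences`, `tokenize`, `join_short`, and `preprocess` are hypothetical.

```python
import re

def split_sentences(text):
    # Naive stand-in for the Punkt sentence splitter (assumption).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Naive stand-in for the Treebank tokenizer (assumption).
    raw = re.findall(r"\w+|[^\w\s]", sentence)
    # Keep tokens that start with a letter or digit, then lowercase them.
    return [t.lower() for t in raw if t[0].isalnum()]

def join_short(sentences, max_short=2):
    # Merge sentences of at most `max_short` tokens into the next sentence.
    merged, carry = [], []
    for tokens in sentences:
        tokens = carry + tokens
        if len(tokens) <= max_short:
            carry = tokens
        else:
            merged.append(tokens)
            carry = []
    if carry:
        # A trailing short sentence has no successor; keeping it as its own
        # sentence is an assumption of this sketch.
        merged.append(carry)
    return merged

def preprocess(text):
    tokenized = [tokenize(s) for s in split_sentences(text)]
    return join_short([t for t in tokenized if t])

result = preprocess("Hello there. The 2 cats sat on the mat! Indeed.")
```

Here the two-word sentence "Hello there." is folded into the sentence that follows it, while the final one-word sentence is left alone because nothing follows it.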

  6. [Figure: histogram of sentence lengths (in words) in the PAN 2014 training corpus]

  7. Vector representation of sentences: TF-IDF, where sentences are the "documents," thus called TF-ISF (inverse sentence frequency). "Documents": the union of the sentences of both docs. Vector similarity: cosine similarity above threshold th1 AND Dice similarity above threshold th2
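
A minimal sketch of the TF-ISF weighting and the two similarity measures, assuming the common form weight = tf * log(N / sf), where N is the number of sentences in both documents and sf is the number of sentences containing the term. The exact weighting formula is not given on the slide, so this form, the choice to compute Dice over terms with nonzero weight, and all function names are assumptions.

```python
import math
from collections import Counter

def tf_isf_vectors(sentences):
    # `sentences`: token lists from BOTH documents taken together.
    n = len(sentences)
    sentence_freq = Counter()
    for tokens in sentences:
        sentence_freq.update(set(tokens))
    vectors = []
    for tokens in sentences:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / sentence_freq[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dice(a, b):
    # Dice coefficient over the sets of terms with nonzero weight.
    terms_a = {t for t, w in a.items() if w != 0}
    terms_b = {t for t, w in b.items() if w != 0}
    total = len(terms_a) + len(terms_b)
    return 2 * len(terms_a & terms_b) / total if total else 0.0

vectors = tf_isf_vectors([["the", "cat", "sat"],
                          ["the", "dog", "sat"],
                          ["the", "bird", "flew"]])
```

In this toy input "the" occurs in every sentence, so its weight is zero and it contributes nothing to either measure, which is one way to read the "soft" stopword removal mentioned in the conclusions.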

  8. Seeds: pairs of similar sentences
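
Seed selection can be sketched as a filter over precomputed similarity matrices. The strict `>` comparison, the threshold values in the example, and the name `find_seeds` are assumptions; the slide only states that both thresholds th1 and th2 must be satisfied.

```python
def find_seeds(cos_sim, dice_sim, th1, th2):
    # cos_sim[i][j] and dice_sim[i][j] compare sentence i of one document
    # with sentence j of the other. A pair is a seed only when BOTH
    # similarities pass their thresholds.
    return [(i, j)
            for i, row in enumerate(cos_sim)
            for j, c in enumerate(row)
            if c > th1 and dice_sim[i][j] > th2]

# Toy matrices with illustrative values (not taken from the slides).
cos_sim = [[0.9, 0.1],
           [0.4, 0.8]]
dice_sim = [[0.7, 0.2],
            [0.1, 0.6]]
seeds = find_seeds(cos_sim, dice_sim, th1=0.3, th2=0.3)
```

The pair (1, 0) passes the cosine threshold but fails the Dice threshold, so it is not kept as a seed.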

  9-13. Grouping [figure, animation frames: group left, group right, group left]

  14-19. Example: Grouping with maxGap = 1 [figure, animation frames: group left, group right, group left]
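
The alternating left/right grouping shown in these frames might be sketched as below. Several details are assumptions rather than facts from the slides: a seed is a pair of sentence indices (left, right), the gap between two seeds is the number of sentences skipped between them on the side being grouped, and grouping stops once neither side splits a group any further. The function names are hypothetical.

```python
def split_by_gap(seeds, side, max_gap):
    # Sort seeds by their index on one side and cut wherever more than
    # `max_gap` sentences are skipped between neighbouring seeds.
    ordered = sorted(seeds, key=lambda s: s[side])
    groups = [[ordered[0]]]
    for seed in ordered[1:]:
        skipped = seed[side] - groups[-1][-1][side] - 1
        if skipped > max_gap:
            groups.append([seed])
        else:
            groups[-1].append(seed)
    return groups

def group_seeds(seeds, max_gap, side=0, stable=0):
    # side 0 = left index, side 1 = right index.
    # `stable` counts consecutive passes that produced no split.
    if not seeds:
        return []
    groups = split_by_gap(seeds, side, max_gap)
    if len(groups) == 1:
        if stable >= 1:
            # Neither side splits this group any more: it is final.
            return [sorted(seeds)]
        return group_seeds(seeds, max_gap, 1 - side, stable + 1)
    result = []
    for group in groups:
        result.extend(group_seeds(group, max_gap, 1 - side, 0))
    return result

seeds = [(0, 0), (1, 1), (2, 10), (3, 11), (9, 2)]
groups = group_seeds(seeds, max_gap=1)
```

In this toy input the first left pass separates seed (9, 2), and the following right pass separates the seeds that are adjacent on the left but far apart on the right.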

  20. Grouping

      Iteration   No plagiarism   None   Random   Translation   Summary
      1                     674   6803     6436          7637      3074
      2                       3    278      180           246       294
      3                       0      7        7             3         3
      4                       0      1        0             0         0

  21. Example: Validation, maxGap = 2

  22. Example: Validation, maxGap = 2. Cosine similarity: if cosine similarity < th3, regroup with maxGap - 1

  23. Validation
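
The validation loop could look roughly like this sketch: each group is scored, and a group scoring below th3 is regrouped with maxGap - 1. The grouping and scoring functions are passed in as parameters. The two toy stand-ins used here (`split_by_left_gap`, a one-sided gap split, and `left_density`, which replaces the cosine similarity between text fragments) are hypothetical, and discarding a group that still fails at maxGap = 0 is an assumption.

```python
def validate(seeds, max_gap, th3, regroup, fragment_similarity):
    accepted = []
    for group in regroup(seeds, max_gap):
        if fragment_similarity(group) >= th3:
            accepted.append(group)
        elif max_gap > 0:
            # Too dissimilar: tolerate fewer gaps and try again.
            accepted.extend(validate(group, max_gap - 1, th3,
                                     regroup, fragment_similarity))
        # else: the group is dropped (assumption of this sketch).
    return accepted

def split_by_left_gap(seeds, max_gap):
    # Toy one-sided grouping used only for this demonstration.
    ordered = sorted(seeds)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for seed in ordered[1:]:
        if seed[0] - groups[-1][-1][0] - 1 > max_gap:
            groups.append([seed])
        else:
            groups[-1].append(seed)
    return groups

def left_density(group):
    # Toy score: share of left-side sentences in the span covered by seeds.
    rows = {i for i, _ in group}
    return len(rows) / (max(rows) - min(rows) + 1)

seeds = [(0, 0), (1, 1), (4, 4), (5, 5)]
cases = validate(seeds, max_gap=2, th3=0.8,
                 regroup=split_by_left_gap,
                 fragment_similarity=left_density)
```

With maxGap = 2 all four seeds form one group whose score is 4/6, below th3, so the group is regrouped with maxGap = 1 and splits into two dense groups that both pass.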

  24. 1. Resolving overlapping cases A and B [formula lost in extraction]. 2. Removing small cases: if the number of characters on the left side OR the right side is < minPlagLength, the case is removed
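
The small-case removal step is simple enough to sketch directly. The `Case` structure with end-exclusive character offsets and the threshold value in the example are assumptions made for illustration; the slide gives only the rule itself.

```python
from typing import NamedTuple

class Case(NamedTuple):
    # Character offsets of one detected case (end-exclusive, an assumption).
    left_start: int
    left_end: int
    right_start: int
    right_end: int

def remove_small_cases(cases, min_plag_length):
    # Drop a case when EITHER side is shorter than min_plag_length characters.
    return [c for c in cases
            if c.left_end - c.left_start >= min_plag_length
            and c.right_end - c.right_start >= min_plag_length]

cases = [
    Case(0, 400, 100, 520),    # both sides long enough: kept
    Case(500, 560, 700, 1000), # left side has only 60 characters: removed
    Case(900, 1200, 50, 90),   # right side has only 40 characters: removed
]
kept = remove_small_cases(cases, min_plag_length=150)  # illustrative value
```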

  25. [Figure: cumulative histogram of plagiarism case passages. Panels: source documents, suspicious documents]

  26. Text alignment task: best result of all 11 participating systems, thanks to: 1. TF-ISF (inverse sentence frequency) measure for "soft" removal of stopwords. 2. Recursive extension algorithm: dynamic adjustment of tolerance to gaps. 3. Algorithm for resolution of overlapping cases by comparison of competing cases. 4. Dynamic adjustment of parameters by type of obfuscation (summary vs. other types)

  27. Text reuse focused on paraphrase; soft cosine to measure similarity between features; new strategy to resolve overlapping

  28. Thanks! http://www.gelbukh.com/plagiarism-detection/PAN-2014
