INSTITUTO POLITÉCNICO NACIONAL
Centro de Investigación en Computación
Tuesday, 16 September 2014
Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh
[Diagram labels: Collection of documents, Source Retrieval, Source documents, Suspicious document, Text Alignment, Suspicious passages]
Text Alignment: Given a pair of documents, the task is to identify all contiguous maximal-length passages of reused text between them.
Preprocessing, Seeding, Extension, Filtering
Preprocessing
Sentence splitting (Kiss pretrained Punkt model)
Tokenizing (Treebank word tokenizer)
Keeping tokens starting with a letter or digit
Reducing to lowercase
Stemming (Porter algorithm)
Joining small sentences (1-2 words) with the next one
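The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regular-expression sentence splitter and tokenizer are naive stand-ins for the Punkt model and the Treebank tokenizer, stemming is omitted, and the function names are hypothetical. Keeping a trailing small sentence as its own sentence is also an assumption, since the slide does not say what happens when there is no next sentence.

```python
import re

def keep_and_lowercase(tokens):
    """Keep tokens that start with a letter or digit, reduced to lowercase."""
    return [t.lower() for t in tokens if t and t[0].isalnum()]

def join_small_sentences(sentences, max_small=2):
    """Join sentences of at most `max_small` words with the next one."""
    merged, carry = [], []
    for sent in sentences:
        current = carry + sent
        if len(current) <= max_small:
            carry = current          # still too small: carry it forward
        else:
            merged.append(current)
            carry = []
    if carry:                        # assumption: no next sentence, keep as is
        merged.append(carry)
    return merged

def preprocess(text):
    """Naive stand-in pipeline: split, tokenize, filter, lowercase, join."""
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())   # stand-in for Punkt
    tokenized = [re.findall(r"\w+|[^\w\s]", s) for s in raw_sentences]  # stand-in for Treebank
    cleaned = [keep_and_lowercase(tokens) for tokens in tokenized]
    # A Porter stemmer would be applied to each token here; omitted in this sketch.
    return join_small_sentences(cleaned)

sentences = preprocess("Yes. The cats sat on the mat! Really? It was 9 pm.")
```

In this toy input the one-word sentences "Yes." and "Really?" are merged into the sentences that follow them, so two sentences remain.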
[Figure: PAN 2014 training corpus, sentence length histogram (words)]
Seeds: pairs of similar sentences
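One way to picture seeding is the sketch below. It assumes that sentence similarity is a plain cosine over term counts and that a single hypothetical `threshold` decides which pairs become seeds; the slide only says "similar sentences", so the actual measure, weighting, and threshold values may differ.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_seeds(left_sents, right_sents, threshold):
    """Return (i, j) index pairs of sentences whose cosine exceeds `threshold`."""
    left_vecs = [Counter(s) for s in left_sents]
    right_vecs = [Counter(s) for s in right_sents]
    return [
        (i, j)
        for i, lv in enumerate(left_vecs)
        for j, rv in enumerate(right_vecs)
        if cosine(lv, rv) > threshold
    ]

left = [["cat", "sat", "mat"], ["dog", "ran"]]
right = [["bird", "flew"], ["cat", "sat", "rug"]]
seeds = find_seeds(left, right, threshold=0.5)  # threshold value is illustrative
```

Here only the first left sentence and the second right sentence share enough terms, so the result is a single seed pairing those two sentences.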
Grouping
Group left
Group right
Example: maxGap = 1
Group left
Group right
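The alternating left/right grouping in this example could be sketched as below. The sketch assumes that a seed is a pair (left sentence index, right sentence index), that a gap is the number of sentences skipped between consecutive seeds on one side, and that grouping alternates sides until nothing splits any further. The function names are hypothetical and the details are an illustration, not the authors' implementation.

```python
def group_by_side(seeds, side, max_gap):
    """Cluster seeds whose indices on `side` skip at most `max_gap` sentences."""
    clusters = []
    for seed in sorted(seeds, key=lambda s: s[side]):
        if clusters and seed[side] - clusters[-1][-1][side] - 1 <= max_gap:
            clusters[-1].append(seed)
        else:
            clusters.append([seed])
    return clusters

def extend(seeds, max_gap):
    """Alternate grouping on side 0 (left) and side 1 (right) until stable."""
    clusters = [list(seeds)]
    side, stable_passes = 0, 0
    while stable_passes < 2:          # one unchanged pass per side
        regrouped = []
        for cluster in clusters:
            regrouped.extend(group_by_side(cluster, side, max_gap))
        stable_passes = stable_passes + 1 if len(regrouped) == len(clusters) else 0
        clusters = regrouped
        side = 1 - side
    return clusters

# Toy seeds: (left index, right index)
seeds = [(0, 0), (1, 1), (2, 10), (3, 11), (10, 2)]
clusters = extend(seeds, max_gap=1)
```

With maxGap = 1, the first left pass separates the seed at left sentence 10, and the following right pass splits the remaining seeds into two groups because right sentences 1 and 10 are too far apart.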
Iteration  No plagiarism  None  Random  Translation  Summary
1          674            6803  6436    7637         3074
2          3              278   180     246          294
Iteration 3: 7, 7, 3, 3; iteration 4: 1 (column positions not recoverable)
Grouping
Example: maxGap = 2
Validation
Cosine similarity
If cosine similarity < th3, regroup with maxGap - 1
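The validation rule can be illustrated with the sketch below. Several details are assumptions rather than statements from the slides: each side of a case is taken to be all sentences between its first and last seed, similarity is a cosine over term counts, regrouping is simplified to the left side only, and a case that still fails at maxGap = 0 is dropped. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def side_counts(case, sents, side):
    """Term counts of every sentence spanned by the case on one side."""
    lo = min(seed[side] for seed in case)
    hi = max(seed[side] for seed in case)
    counts = Counter()
    for k in range(lo, hi + 1):
        counts.update(sents[k])
    return counts

def regroup(case, max_gap):
    """Simplified regrouping on the left side only (assumption)."""
    ordered = sorted(case)
    groups = [[ordered[0]]]
    for seed in ordered[1:]:
        if seed[0] - groups[-1][-1][0] - 1 <= max_gap:
            groups[-1].append(seed)
        else:
            groups.append([seed])
    return groups

def validate(case, left, right, max_gap, th3):
    """Keep the case if similar enough, else regroup with max_gap - 1."""
    sim = cosine(side_counts(case, left, 0), side_counts(case, right, 1))
    if sim >= th3:
        return [case]
    if max_gap == 0:
        return []                     # assumption: nothing left to try
    validated = []
    for sub in regroup(case, max_gap - 1):
        validated.extend(validate(sub, left, right, max_gap - 1, th3))
    return validated

left = [["cat", "sat"], ["on", "mat"], ["unrelated", "filler"], ["dog", "ran"]]
right = [["cat", "sat"], ["on", "mat"], ["completely", "different"], ["dog", "ran"]]
validated = validate([(0, 0), (1, 1), (3, 3)], left, right, max_gap=1, th3=0.8)
```

In this toy case the whole span has a cosine of about 0.75, below the illustrative th3 of 0.8, so it is regrouped with a gap of 0 and splits into two cases that each pass validation.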
If the number of characters in the left side OR the right side is < minPlagLength, the case is removed
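As a small illustration of this rule, the sketch below assumes each case is stored as a pair of character-offset ranges, ((left_start, left_end), (right_start, right_end)). The representation, the function name, and the threshold value of 150 are hypothetical.

```python
def remove_small_cases(cases, min_plag_length):
    """Drop cases where either side is shorter than min_plag_length characters."""
    kept = []
    for (l_start, l_end), (r_start, r_end) in cases:
        too_short = (l_end - l_start) < min_plag_length or (r_end - r_start) < min_plag_length
        if not too_short:
            kept.append(((l_start, l_end), (r_start, r_end)))
    return kept

cases = [
    ((0, 400), (10, 380)),      # both sides long enough
    ((500, 560), (900, 1200)),  # left side only 60 characters
]
kept = remove_small_cases(cases, min_plag_length=150)  # illustrative value
```

Because the condition uses OR, the second case is removed even though its right side is long.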
[Figure: cumulative histogram of plagiarism case passages; panels: Source documents, Suspicious documents]
Text reuse focused on paraphrase
Soft cosine to measure similarity between features
New strategy to resolve overlapping
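For readers unfamiliar with the soft cosine mentioned above, a minimal sketch of its usual definition follows. It is given for orientation only and is not taken from the slides: the feature similarity matrix `sim` is a hypothetical input, and how feature similarities would be obtained is left open.

```python
from math import sqrt

def soft_cosine(a, b, sim):
    """Soft cosine of vectors a and b, where sim[i][j] is the similarity of features i and j."""
    n = len(a)

    def form(x, y):
        return sum(sim[i][j] * x[i] * y[j] for i in range(n) for j in range(n))

    denom = sqrt(form(a, a)) * sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0

# Two vectors with no shared feature, but the features are 50% similar.
related = soft_cosine([1, 0], [0, 1], [[1.0, 0.5], [0.5, 1.0]])
# With an identity matrix the measure reduces to the ordinary cosine.
unrelated = soft_cosine([1, 0], [0, 1], [[1.0, 0.0], [0.0, 1.0]])
```

The first call yields a nonzero similarity because the two features are treated as related, while the second behaves like the standard cosine and yields zero.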