INSTITUTO POLITÉCNICO NACIONAL, Centro de Investigación en Computación

Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh. Tuesday, 16 September 2014.


slide-1
SLIDE 1

INSTITUTO POLITÉCNICO NACIONAL

Centro de Investigación en Computación

Tuesday, 16 September 2014

Miguel A. Sanchez-Perez, Grigori Sidorov, Alexander Gelbukh

slide-2
SLIDE 2

1. Task
2. Methodology
3. Adaptive behavior
4. Results
5. Conclusions
6. Future Work

slide-3
SLIDE 3

[Diagram: Source Retrieval (collection of documents → source documents), then Text Alignment (suspicious document and source documents → suspicious passages)]

Text Alignment: Given a pair of documents, the task is to identify all contiguous maximal-length passages of reused text between them.

slide-4
SLIDE 4

  • Preprocessing
  • Seeding
  • Extension
  • Filtering

slide-5
SLIDE 5

  • Sentence splitting (Kiss pretrained Punkt model)
  • Tokenizing (Treebank word tokenizer)
  • Keeping tokens starting with a letter or digit
  • Reducing to lowercase
  • Stemming (Porter algorithm)
  • Joining small sentences (1-2 words) with the next one
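The token filtering, lowercasing, and small-sentence joining steps above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: it assumes the sentences are already split and tokenized, takes the stemmer as a pluggable callable (identity by default, standing in for Porter), and attaches a trailing small sentence to the previous one, which the slide does not specify. The function names and the `min_words` parameter are hypothetical.

```python
def normalize_tokens(tokens, stem=lambda token: token):
    """Keep tokens that start with a letter or digit, lowercase, then stem."""
    kept = [t for t in tokens if t and t[0].isalnum()]
    return [stem(t.lower()) for t in kept]


def join_small_sentences(sentences, min_words=3):
    """Merge sentences shorter than min_words (i.e. 1-2 words) into the next one."""
    merged = []
    carry = []  # tokens of small sentences waiting for the next sentence
    for sentence in sentences:
        current = carry + sentence
        if len(current) < min_words:
            carry = current
        else:
            merged.append(current)
            carry = []
    if carry:
        # Assumption: a trailing small sentence has no "next one",
        # so it is appended to the previous sentence instead.
        if merged:
            merged[-1] = merged[-1] + carry
        else:
            merged.append(carry)
    return merged


# Toy input: already sentence-split and tokenized.
raw = [["Hello", ",", "World", "!"], ["Yes", "."], ["It", "'s", "2014", "now"]]
clean = join_small_sentences([normalize_tokens(s) for s in raw])
```

Punctuation tokens and the clitic `'s` are dropped by the first-character test, and the two short sentences end up merged into one.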

slide-6
SLIDE 6

[Figure: PAN 2014 training corpus, histogram of sentence lengths (in words)]

slide-7
SLIDE 7

  • Vector representation of sentences: TF-IDF where sentences are the “documents,” thus called TF-ISF (inverse sentence frequency)
  • “Documents”: union of the sentences of both documents
  • Vector similarity: cosine similarity threshold th1 AND Dice similarity threshold th2
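A minimal sketch of this representation, under stated assumptions: the weight is taken as tf × log(N / sf), the usual TF-IDF form with sentences in place of documents; Dice is computed over the sets of terms with non-zero weight; and a pair passes when both similarities exceed their thresholds. The slide fixes none of these details, and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Build one sparse TF-ISF vector (dict term -> weight) per sentence.

    `sentences` is the union of the token lists of both documents.
    """
    n = len(sentences)
    sentence_freq = Counter()
    for sentence in sentences:
        sentence_freq.update(set(sentence))
    vectors = []
    for sentence in sentences:
        tf = Counter(sentence)
        vectors.append({t: tf[t] * math.log(n / sentence_freq[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dice(u, v):
    """Dice coefficient over the terms with non-zero weight (assumed form)."""
    a = {t for t, w in u.items() if w != 0}
    b = {t for t, w in v.items() if w != 0}
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0


def is_seed(u, v, th1, th2):
    """A sentence pair is a seed when it passes both thresholds."""
    return cosine(u, v) > th1 and dice(u, v) > th2


vecs = tf_isf_vectors([
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["the", "dog", "barked"],
])
```

In this toy input, "the" occurs in every sentence, so its weight is log(1) = 0: frequent words fade out without an explicit stopword list.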

slide-8
SLIDE 8

Seeds: pairs of similar sentences

slide-9
SLIDE 9

Grouping: group left

slide-10
SLIDE 10

Grouping: group left

slide-11
SLIDE 11

Grouping: group right

slide-12
SLIDE 12

Grouping: group right

slide-13
SLIDE 13

Grouping: group left

slide-14
SLIDE 14

Grouping: group left (example with maxGap = 1)

slide-15
SLIDE 15

Grouping: group left (example with maxGap = 1)

slide-16
SLIDE 16

Grouping: group right (example with maxGap = 1)

slide-17
SLIDE 17

Grouping: group right (example with maxGap = 1)

slide-18
SLIDE 18

Grouping: group left (example with maxGap = 1)

slide-19
SLIDE 19

Grouping: group left (example with maxGap = 1)
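The alternating left/right grouping shown in these frames can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: seeds are (suspicious sentence index, source sentence index) pairs, "left" is taken to be the suspicious side, and two consecutive seeds stay in one group when at most maxGap sentences lie between them. The function names are hypothetical.

```python
def split_by_gap(seeds, side, max_gap):
    """Split seeds into runs along one side (0 = left, 1 = right).

    A new run starts when more than max_gap sentences lie between two
    consecutive seeds on that side.
    """
    ordered = sorted(seeds, key=lambda seed: seed[side])
    runs = [[ordered[0]]]
    for seed in ordered[1:]:
        skipped = seed[side] - runs[-1][-1][side] - 1
        if skipped > max_gap:
            runs.append([seed])
        else:
            runs[-1].append(seed)
    return runs


def group_seeds(seeds, max_gap, side=0, other_side_stable=False):
    """Alternate between sides until a group no longer splits on either."""
    if not seeds:
        return []
    runs = split_by_gap(seeds, side, max_gap)
    if len(runs) == 1:
        if other_side_stable:
            # Neither side splits this group any further.
            return [sorted(seeds)]
        return group_seeds(seeds, max_gap, 1 - side, True)
    groups = []
    for run in runs:
        groups.extend(group_seeds(run, max_gap, 1 - side, False))
    return groups


seeds = [(0, 10), (1, 11), (2, 30), (3, 31), (8, 12)]
groups = group_seeds(seeds, max_gap=1)
# groups == [[(0, 10), (1, 11)], [(2, 30), (3, 31)], [(8, 12)]]
```

Grouping on the left first separates (8, 12); grouping the remaining seeds on the right then separates the two source regions, and a final pass on each side confirms that nothing splits further.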

slide-20
SLIDE 20

Grouping

Iteration   No plagiarism   None   Random   Translation   Summary
1           674             6803   6436     7637          3074
2           3               278    180      246           294

Remaining values as extracted (row and column alignment lost): 3 7 7 3 3 4 1

slide-21
SLIDE 21

Validation (example with maxGap = 2)
slide-22
SLIDE 22

Validation (example with maxGap = 2)

Cosine similarity: if cosine similarity < th3, regroup with maxGap - 1
slide-23
SLIDE 23

Validation
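The validation rule (regroup with a smaller maxGap when the similarity falls below th3) can be sketched as a recursion. This is an illustration only: the grouping and similarity functions are passed in as callables, the toy stand-ins below are hypothetical, and dropping a group that still fails at the smallest gap is an assumption the slides do not state.

```python
def extend_with_validation(seeds, max_gap, th3, group_fn, similarity_fn, min_gap=0):
    """Group seeds, then re-group any group whose similarity is below th3."""
    accepted = []
    for group in group_fn(seeds, max_gap):
        if similarity_fn(group) >= th3:
            accepted.append(group)
        elif max_gap > min_gap:
            # Validation failed: retry this group with less tolerance to gaps.
            accepted.extend(extend_with_validation(
                group, max_gap - 1, th3, group_fn, similarity_fn, min_gap))
        # Assumption: a group that still fails at min_gap is discarded.
    return accepted


# Toy stand-ins (hypothetical) so the sketch runs on its own.
def group_by_left(seeds, max_gap):
    """Group on the left index only."""
    ordered = sorted(seeds)
    groups = [[ordered[0]]]
    for seed in ordered[1:]:
        if seed[0] - groups[-1][-1][0] - 1 > max_gap:
            groups.append([seed])
        else:
            groups[-1].append(seed)
    return groups


def density(group):
    """Stand-in for the cosine similarity: share of covered sentences."""
    return len(group) / (group[-1][0] - group[0][0] + 1)


result = extend_with_validation(
    [(0, 0), (1, 1), (4, 4), (5, 5)], max_gap=2, th3=0.9,
    group_fn=group_by_left, similarity_fn=density)
# result == [[(0, 0), (1, 1)], [(4, 4), (5, 5)]]
```

With maxGap = 2 the four seeds form one group that fails validation; with maxGap = 1 they split into two groups that pass.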
slide-24
SLIDE 24

1. Resolving overlapping
2. Removing small cases: if the number of characters on the left side OR the right side is < minPlagLength, the case is removed
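The small-case rule can be sketched as a simple filter. The case representation below, a tuple of half-open character offsets (left start, left end, right start, right end), is an assumption made for illustration, as is the function name; the threshold value in the usage is arbitrary.

```python
def remove_small_cases(cases, min_plag_length):
    """Drop cases where either side is shorter than min_plag_length characters."""
    kept = []
    for left_start, left_end, right_start, right_end in cases:
        left_len = left_end - left_start
        right_len = right_end - right_start
        if left_len < min_plag_length or right_len < min_plag_length:
            continue  # too short on at least one side
        kept.append((left_start, left_end, right_start, right_end))
    return kept


cases = [(0, 200, 50, 260), (300, 340, 10, 400), (500, 800, 0, 90)]
kept = remove_small_cases(cases, min_plag_length=150)
# kept == [(0, 200, 50, 260)]
```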
slide-25
SLIDE 25

[Figure: cumulative histogram of plagiarism case passages, for source documents and suspicious documents]

slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29

Text alignment task: best result of all 11 participating systems, thanks to:

1. TF-ISF (inverse sentence frequency) measure for “soft” removal of stopwords
2. Recursive extension algorithm: dynamic adjustment of tolerance to gaps
3. Algorithm for resolution of overlapping cases by comparison of competing cases
4. Dynamic adjustment of parameters by type of obfuscation (summary vs. other types)
slide-30
SLIDE 30

  • Text reuse focused on paraphrase
  • Soft cosine to measure similarity between features
  • New strategy to resolve overlapping
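For reference, the soft cosine measure generalizes cosine similarity with a feature-to-feature similarity matrix. The sketch below uses the commonly cited form, with dense lists for simplicity; how the authors would define feature similarity is not stated in the slides, so the matrix here is purely illustrative.

```python
import math


def _bilinear(a, b, sim):
    """Sum over i, j of sim[i][j] * a[i] * b[j]."""
    return sum(sim[i][j] * a[i] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def soft_cosine(a, b, sim):
    """Soft cosine of vectors a and b under feature similarity matrix sim."""
    denom = math.sqrt(_bilinear(a, a, sim)) * math.sqrt(_bilinear(b, b, sim))
    return _bilinear(a, b, sim) / denom if denom else 0.0


identity = [[1.0, 0.0], [0.0, 1.0]]
related = [[1.0, 0.5], [0.5, 1.0]]  # illustrative: the two features are similar
plain = soft_cosine([1.0, 0.0], [0.0, 1.0], identity)  # ordinary cosine: 0.0
soft = soft_cosine([1.0, 0.0], [0.0, 1.0], related)    # 0.5
```

With the identity matrix the measure reduces to the ordinary cosine; with related features, vectors that share no feature can still be similar.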

slide-31
SLIDE 31

Thanks!

http://www.gelbukh.com/plagiarism-detection/PAN-2014