
The ENCOPLOT Similarity Measure for Automatic Detection of Plagiarism



  1. The ENCOPLOT Similarity Measure for Automatic Detection of Plagiarism
     Cristian Grozea (1), Marius Nicolae Popescu (2)
     cristian.grozea@brainsignals.de
     (1) Fraunhofer Institute FIRST, Berlin
     (2) University of Bucharest, Romania
     September 23, 2011
     C.Grozea, M.Popescu: ENCOPLOT measure. Fraunhofer Institute FIRST, Berlin; University of Bucharest, Romania

  2. I’ll be short... Thank you!
     Our extended paper: http://brainsignals.de/encsimTR.pdf

  3. Results
     External plagiarism, same language.
     ◮ 2009: 1st
     ◮ 2010: 4th (2nd with the 2011 version)
     ◮ 2011: 2nd (1st?)
     ◮ best score on the manual paraphrasing
     ◮ best recall on the non-translated corpus

  4. Encoplot and the Similarity Measure
     [Figure: encoplot of a document pair. x-axis: Source Document Position (0 to 700000); y-axis: Suspicious Document Position (0 to 350000).]

  5. Encoplot Features
     ◮ Guaranteed linear time; Dotplot is quadratic.
     ◮ Extremely fast, highly optimized open-source implementation for N-grams up to N=16 on 64-bit CPUs.
     Grozea et al. (PAN 2009)

  6. The Parallel Encoplot
     ◮ Open source, licensed under Apache APL: http://code.google.com/p/parallel-encoplot/
     ◮ Includes the parallelization with BSC SMPSs
     ◮ Scalable, tested on a machine with 256 cores (HPC Europa2). You can have that too!

  7. Ranking - 2010
     [Figure: recall as a function of the number of document pairs examined. x-axis: Document pairs (x 10^6, 0 to 7); y-axis: Recall (0 to 1). Curves: Standard − ranking sources; Standard − ranking destinations; Encoplot − global rank.]

  8. Ranking - 2011
     [Figure: recall as a function of the number of document pairs examined. x-axis: Document pairs (logarithmic scale, 10^5 to 10^6); y-axis: Recall (0 to 0.8). Curves: Standard − min rank; Encoplot − min rank.]

  9. Ranking - 2010 P-R
     [Figure: precision-recall curves. x-axis: Precision (0 to 1); y-axis: Recall (0 to 1). Curves: Standard − global rank; Standard − min rank; Standard − ranking sources; Standard − ranking destinations; Encoplot − global rank; Encoplot − min rank.]

  10. Who’s the Thief?
     [Figure: encoplot of a document pair. x-axis: Position in source (x 10^5, 0 to 3); y-axis: Position in destination (x 10^5, 0 to 7).]
     Grozea and Popescu (CICLING 2010): 75%

  11. Found anything useful to you? Thank you again!

  12. Reserve slides

  13. 2010 duplicates
     [Figure: encoplot of a document pair. x-axis: Source Document Position (0 to 140000); y-axis: Suspicious Document Position (0 to 350000).]

  14. 2011 Corpus
     Table: Results on 2011 Competition

     Data Subset            Size    Recall  Precision  F-score  Granularity  Plagdet score
     Entire corpus          49,621  0.34    0.81       0.48     1.22         0.42
     No paraphrasing           976  0.90    0.84       0.86     1.02         0.85
     Manual paraphrasing     4,609  0.36    0.96       0.53     1.06         0.50
     Automatic low          19,779  0.58    0.90       0.71     1.27         0.60
     Automatic high         19,115  0.08    0.64       0.14     1.19         0.13
     Manual translation        433  0.08    0.25       0.12     1.01         0.12
     Automatic translation   4,709  0.23    0.40       0.29     1.07         0.28
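The F-score and plagdet columns can be cross-checked against the recall (R), precision (P) and granularity columns. The formulas below are not stated on the slide; they are the usual definitions from the PAN competition and are assumed here. The worked line uses the "Entire corpus" row.

```latex
\[
F = \frac{2\,P\,R}{P + R}, \qquad
\mathrm{plagdet} = \frac{F}{\log_2(1 + \mathrm{granularity})}
\]
\[
\text{Entire corpus: } F = \frac{2 \cdot 0.81 \cdot 0.34}{0.81 + 0.34} \approx 0.48, \qquad
\mathrm{plagdet} \approx \frac{0.479}{\log_2(2.22)} \approx \frac{0.479}{1.151} \approx 0.42
\]
```

Under this assumption a granularity of 1 leaves the F-score unchanged, and larger granularity (one plagiarized passage reported as several fragments) lowers the score.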

  15. Other Bits
     ◮ 2011, no obfuscation: 976 = 1.97% of 49 621 total (vs. 40%). The 18% includes about 10 000 from the intrinsic corpus.
     ◮ 2010 multiplicity problem: maximum multiplicity = 17 (source 8584, suspicious 3283). Of the 55 723 external plagiarism instances, 10 694 have multiplicity ≥ 2 (20% of total) and 3 483 have multiplicity at least 3. Being able to handle multiplicity up to 4 would leave out only 506 instances.
     ◮ 2010 performance: plagdet score 0.72 (first team: 0.78), with recall 0.66 and precision 0.86, without handling the translated cases (14%).

  16. N-Gram Coincidence Plot Algorithm
     Input: sequences A and B to compare
     Output: list of pairs (x,y) of positions in A and B, respectively, where exactly the same N-gram occurs
     Steps:
       1. Extract the N-grams from A and B.
       2. Sort these two lists of N-grams.
       3. Compare the two lists in a modified mergesort merge step. Whenever the two smallest N-grams are equal, output the position in A and the one in B.
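The three steps of the N-gram coincidence plot can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: it sorts with the standard `qsort`, so it runs in O(n log n) rather than the linear time the deck obtains with radix sort, and the names `pair_t`, `sorted_ngrams` and `encoplot_pairs` are hypothetical. It assumes that a match consumes one N-gram from each list, which is what the deck's small example suggests. Positions are 0-based.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One match: 0-based start positions of the same N-gram in A and in B. */
typedef struct { size_t a, b; } pair_t;

/* Comparator context; file scope because qsort takes no user-data argument. */
static const char *g_text;
static size_t g_n;

static int cmp_ngram(const void *p, const void *q) {
    size_t i = *(const size_t *)p, j = *(const size_t *)q;
    int c = memcmp(g_text + i, g_text + j, g_n);
    if (c != 0) return c;
    return (i > j) - (i < j);          /* equal N-grams: keep text order */
}

/* Steps 1 and 2: list all N-gram start positions of s, sorted by N-gram. */
static size_t *sorted_ngrams(const char *s, size_t len, size_t n) {
    size_t count = len - n + 1;
    size_t *idx = malloc(count * sizeof *idx);
    assert(idx != NULL);
    for (size_t i = 0; i < count; i++) idx[i] = i;
    g_text = s;
    g_n = n;
    qsort(idx, count, sizeof *idx, cmp_ngram);
    return idx;
}

/* Step 3: merge the two sorted lists. Stores at most max_out pairs in out
   and returns the number of pairs stored. */
size_t encoplot_pairs(const char *A, const char *B, size_t n,
                      pair_t *out, size_t max_out) {
    size_t la = strlen(A), lb = strlen(B);
    assert(n > 0);
    if (la < n || lb < n) return 0;
    size_t na = la - n + 1, nb = lb - n + 1;
    size_t *ia = sorted_ngrams(A, la, n);
    size_t *ib = sorted_ngrams(B, lb, n);
    size_t i = 0, j = 0, found = 0;
    while (i < na && j < nb && found < max_out) {
        int c = memcmp(A + ia[i], B + ib[j], n);
        if (c < 0) {
            i++;                       /* smallest N-gram of A has no partner */
        } else if (c > 0) {
            j++;                       /* smallest N-gram of B has no partner */
        } else {
            out[found].a = ia[i];      /* same N-gram: emit one pair ...      */
            out[found].b = ib[j];
            found++;
            i++;                       /* ... and consume it on both sides    */
            j++;
        }
    }
    free(ia);
    free(ib);
    return found;
}
```

Because each match advances both lists, the number of output pairs is bounded by the length of the shorter input, which is what keeps the merge step linear. Called on A=abcabd and B=xabdy with N=2, the sketch stores the 0-based pairs (0,1) and (4,2).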

  17. Small example
     A=abcabd, B=xabdy (each pair lists the position in A, the position in B, and the shared N-gram)

     N=2   Encoplot pairs   Dotplot pairs
           1 2 ab           1 2 ab
                            4 2 ab
           5 4 bd           5 4 bd

     N=3   Encoplot pairs   Dotplot pairs
           4 2 abd          4 2 abd
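For comparison with the example above, a brute-force dotplot can be sketched as below. This is an illustrative assumption about what "Dotplot pairs" means here (every position pair whose N-grams coincide), not code from the deck; `dot_t` and `dotplot_pairs` are hypothetical names and positions are 0-based.

```c
#include <assert.h>
#include <string.h>

/* One dot: 0-based start positions of the same N-gram in A and in B. */
typedef struct { size_t a, b; } dot_t;

/* Compares every N-gram of A with every N-gram of B (quadratic work).
   Stores at most max_out pairs in out and returns the number stored. */
size_t dotplot_pairs(const char *A, const char *B, size_t n,
                     dot_t *out, size_t max_out) {
    size_t la = strlen(A), lb = strlen(B), found = 0;
    assert(n > 0);
    if (la < n || lb < n) return 0;
    for (size_t i = 0; i + n <= la; i++) {
        for (size_t j = 0; j + n <= lb; j++) {
            if (found < max_out && memcmp(A + i, B + j, n) == 0) {
                out[found].a = i;
                out[found].b = j;
                found++;
            }
        }
    }
    return found;
}
```

On A=abcabd, B=xabdy with N=2 this stores three 0-based pairs, (0,1), (3,1) and (4,2). The middle one is the second "ab" of A matched against the only "ab" of B, which is the pair the encoplot column omits.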

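  18. Fast Radix Sort for N-Grams

      /* The slide shows only the loop body. The include, the function wrapper,
         the definitions of t_int, RANGE and DEPTH, and the initial values of
         pin and pout are assumptions added so that the fragment compiles. */
      #include <string.h>

      typedef int t_int;                 /* assumed index type */
      enum { RANGE = 256, DEPTH = 16 };  /* assumed: byte alphabet, N = 16 */

      /* Radix sort of the NN N-grams (DEPTH bytes each) starting at x[0..NN-1].
         The input is x (NN + DEPTH - 1 bytes), the output rank is ix;
         ox is scratch space for NN entries. */
      void radix_sort_ngrams(const unsigned char *x, t_int NN, t_int *ix, t_int *ox)
      {
          t_int counters[RANGE], startpos[RANGE];
          const unsigned char *pout = x, *pin = x + NN;

          for (t_int i = 0; i < NN; i++) ix[i] = i;
          for (int k = 0; k < RANGE; k++) counters[k] = 0;
          for (t_int i = 0; i < NN; i++) counters[x[i]]++;   /* histogram of byte 0 */

          for (int j = 0; j < DEPTH; j++) {
              int ofs = j;               /* little endian: lowest byte first */
              t_int sp = 0;
              for (int k = 0; k < RANGE; k++) {
                  startpos[k] = sp;      /* first output slot for byte value k */
                  sp += counters[k];
              }
              for (t_int i = 0; i < NN; i++) {   /* stable distribution pass */
                  unsigned char c = x[ofs + ix[i]];
                  ox[startpos[c]++] = ix[i];
              }
              memcpy(ix, ox, (size_t)NN * sizeof(ix[0]));
              /* update counters: slide the histogram window one byte to the right */
              if (j < DEPTH - 1) {
                  counters[*pout++]--;
                  counters[*pin++]++;
              }
          }
      }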

  19. ◮ Who’s the Thief? Automatic Detection of the Direction of Plagiarism, C. Grozea and M. Popescu, CICLING 2010, LNCS 6008, DOI 10.1007/978-3-642-12116-6, 2010.
      ◮ ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection, C. Grozea, C. Gehl, and M. Popescu. In Proceedings of the 3rd PAN Workshop: Uncovering Plagiarism, Authorship and Social Software Misuse, San Sebastian, Spain, 2009. Universidad Politecnica de Valencia, 2009.
      ◮ Encoplot: Performance in the Second International Plagiarism Detection Challenge, C. Grozea and M. Popescu, Lab Report for PAN at CLEF 2010.
      ◮ Plagiarism Detection with State of the Art Compression Programs, C. Grozea, Report CDMTCS-247, Centre for Discrete Mathematics and Theoretical Computer Science, University of Auckland, Auckland, New Zealand, 2004.

  20. Self-plagiarism
     [Figure: encoplot of a document pair. x-axis: Source Document Position (x 10^5, 0 to 8); y-axis: Suspicious Document Position (x 10^5, 0 to 9).]
