Improving performance of a plagiarism detection system. Andrzej Sobecki, Marcin Kępa, IKC 2017



SLIDE 1

Improving performance of a plagiarism detection system

Andrzej Sobecki, Marcin Kępa IKC 2017

SLIDE 2

Plagiarism detection problem

  • Text documents in unstructured form
  • Finding potential source documents based on the suspected document
  • Searching in many repositories
  • The process must be fast and accurate
SLIDE 3

How we do that

Parsing → Hashing → Filtering → Calculating similarities

The crucial stage for performance
The important stage for accuracy

SLIDE 4

Filtering – current solution

  • Doc → hash function h(x) → doc profile (one hash per sentence)
  • Suspected doc → suspected doc profile
  • Repositories → available document profiles
  • Count identical hash values between the suspected doc profile and each available profile
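The filtering step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sentence splitting rule, the normalization, the choice of SHA-256 as h(x), and the names sentence_hashes, count_identical, filter_candidates and min_shared are all assumptions made for the example. The profile is stored as a set, so repeated sentences collapse to one hash.

```python
import hashlib
import re


def sentence_hashes(text: str) -> set[str]:
    """Build a document profile: one hash per normalized sentence."""
    profile = set()
    # Naive sentence splitting on '.', '!' and '?' (an assumption of this sketch).
    for sentence in re.split(r"[.!?]+", text):
        # Lowercase and collapse whitespace so trivial edits keep the same hash.
        normalized = " ".join(sentence.lower().split())
        if normalized:
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            profile.add(digest)
    return profile


def count_identical(suspected: set[str], candidate: set[str]) -> int:
    """Count hash values that occur in both profiles."""
    return len(suspected & candidate)


def filter_candidates(suspected_text: str,
                      repository: dict[str, str],
                      min_shared: int = 1) -> dict[str, int]:
    """Return the repository documents sharing at least min_shared sentence hashes."""
    suspected = sentence_hashes(suspected_text)
    scores = {}
    for doc_id, text in repository.items():
        shared = count_identical(suspected, sentence_hashes(text))
        if shared >= min_shared:
            scores[doc_id] = shared
    return scores
```

In a real system the repository profiles would be computed once and stored, not rebuilt on every query; the loop above recomputes them only to keep the sketch short.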

SLIDE 5

Filtering – possible solutions

  • Algorithms dedicated to digital libraries,
  • Available search engines, e.g., Elasticsearch,
  • Components of the Hadoop ecosystem,
  • What effect do precision and recall values have on the performance and accuracy of the plagiarism detection process?

SLIDE 6

Problem class: detecting similarities

  • Unstructured text documents,
  • Most of the documents available in the repositories must be analyzed,
  • New documents are continuously added to the repositories,
  • Effective filtering with high values of recall and precision,
  • Finding similar sentences is more important than matching keywords.
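For reference, precision and recall of a filter can be computed as below. Here "retrieved" is the set of documents the filter passes on to the similarity stage and "relevant" is the set of true source documents; the function name and the document IDs are hypothetical. With low precision, more irrelevant documents reach the expensive similarity calculation, which hurts performance; with low recall, true sources are dropped before they can be compared, which hurts accuracy.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for a filter's output.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the filter returns four documents, two of which
# are among the three true sources.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```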
SLIDE 7

Models described in the article

  • KASKADA HashMap,
  • HDFS,
  • HDFS HashMap,
  • HBase.
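The slides do not describe the internals of these models. Purely as an illustration of how a hash-map-based filter might work, the sketch below keeps an in-memory map from each sentence hash to the documents that contain it, so a query touches only the hashes of the suspected document instead of scanning every profile. The class HashIndex and its methods are hypothetical and are not taken from KASKADA, HDFS or HBase.

```python
from collections import Counter, defaultdict
from typing import Hashable, Iterable


class HashIndex:
    """In-memory inverted index: sentence hash -> IDs of documents containing it."""

    def __init__(self) -> None:
        self._index: defaultdict[Hashable, set[str]] = defaultdict(set)

    def add(self, doc_id: str, profile: Iterable[Hashable]) -> None:
        """Register a document profile; new documents can be added at any time."""
        for h in profile:
            self._index[h].add(doc_id)

    def query(self, suspected_profile: Iterable[Hashable]) -> Counter:
        """Count, per document, how many distinct hashes it shares with the query."""
        counts: Counter = Counter()
        for h in set(suspected_profile):
            # .get avoids creating empty entries for unknown hashes.
            for doc_id in self._index.get(h, ()):
                counts[doc_id] += 1
        return counts


# Small integers stand in for real sentence hashes here.
index = HashIndex()
index.add("a", {1, 2, 3})
index.add("b", {3, 4})
shared = index.query({2, 3, 9})
```

The trade-off is that the index must be built and kept up to date as documents arrive, in exchange for cheaper lookups at query time.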
SLIDE 8

Results – documents with fixed length

SLIDE 9

Results – documents with different lengths

SLIDE 10

Results – parallel tasks

SLIDE 11

Results – scalability

SLIDE 12

Results – cost of preparing structures

SLIDE 13

Summary

  • Do you have any questions?