External and Intrinsic Plagiarism Detection using a Cross-Lingual Retrieval and Segmentation System Markus Muhr, Roman Kern, Mario Zechner, Michael Granitzer { mmuhr, rkern, mzechner, mgrani } @know-center.at CLEF 2010 / PAN / 2010-09-22

SLIDE 1

External and Intrinsic Plagiarism Detection using a Cross-Lingual Retrieval and Segmentation System

Markus Muhr, Roman Kern, Mario Zechner, Michael Granitzer {mmuhr, rkern, mzechner, mgrani}@know-center.at CLEF 2010 / PAN / 2010-09-22

SLIDE 2

Graz University of Technology

Overview

Hybrid System

◮ External
  ◮ Based on information retrieval techniques
  ◮ Post-processing based on sequence analysis
◮ Intrinsic
  ◮ Detect style change
◮ Cross-lingual plagiarism detection
◮ No heuristics for high obfuscation
  ◮ No word reordering
  ◮ No synonym resolution

Focus

◮ Simulate a production system
◮ Scalable architecture

SLIDE 3

System Overview

Flowchart

[Flowchart of the hybrid pipeline; node labels as extracted from the figure]

◮ Indexing: Source Documents → For each source document → English? (No: Translate Words) → Segment into overlapping blocks → Add blocks to index → Block Index
◮ Retrieval: Suspicious Document → Segment into small overlapping blocks → Use block terms as queries & apply heuristics for fast retrieval → Search block index → Blocks found?
◮ External (blocks found: Yes): Token based sequence matching → Merge neighboring sequences → Similarity & heuristics filtering → Detected passages
◮ Intrinsic (blocks found: No, or if no external passages are detected): Segment document into coherent segments → Filtering on stylometric features → Detected passages

SLIDE 4

External Plagiarism Detection

Overview

◮ Two step approach
  ◮ Search for potentially matching suspicious document blocks
  ◮ Apply heuristic post-processing on the potential matches

Work-Flow

◮ Build index out of source documents
  ◮ Build overlapping blocks (40 terms)
◮ Split suspicious documents into blocks (16 terms)
  ◮ Transform blocks into queries
  ◮ Search source index for matching source blocks
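
The block-building step in the work-flow above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide gives only the block sizes (40 terms for source documents, 16 terms for suspicious documents), so the step width `step`, the handling of the document tail, and the function name are assumptions.

```python
from typing import List


def overlapping_blocks(tokens: List[str], size: int, step: int) -> List[List[str]]:
    """Split a token list into blocks of `size` tokens, starting every `step` tokens.

    `step` < `size` yields overlapping blocks. The amount of overlap is an
    assumption; the slide only states the block sizes.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if not tokens:
        return []
    if len(tokens) <= size:
        return [list(tokens)]

    blocks = [tokens[start:start + size]
              for start in range(0, len(tokens) - size + 1, step)]
    # Add one final block so the last tokens of the document are covered
    # (assumed behaviour).
    if (len(tokens) - size) % step != 0:
        blocks.append(tokens[-size:])
    return blocks


tokens = [f"t{i}" for i in range(100)]
source_blocks = overlapping_blocks(tokens, size=40, step=20)    # index side
suspicious_blocks = overlapping_blocks(tokens, size=16, step=8)  # query side
```

A smaller block on the suspicious side means each query covers a short window of the text, while the larger indexed blocks make it more likely that a matching window falls completely inside one source block.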

SLIDE 5

External Plagiarism Detection

Query Construction

◮ For each block in the suspicious document build a query
◮ Sort query terms by document frequency
◮ Join the low-frequency terms by AND
◮ Join the remaining terms by OR
◮ Additional heuristics to keep the number of queries low
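
The query construction listed above might look like the sketch below. Several details are assumptions because the slide does not state them: the number of AND-ed terms (`num_required`), dropping terms that do not occur in the index, and how the AND group and the OR group are combined. `BlockQuery` and `build_query` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockQuery:
    must: List[str]    # rare terms, joined by AND
    should: List[str]  # remaining terms, joined by OR

    def render(self) -> str:
        """One possible textual form; the combination of both groups is assumed."""
        required = " AND ".join(self.must)
        optional = " OR ".join(self.should)
        if not self.should:
            return required
        if not self.must:
            return optional
        return f"({required}) AND ({optional})"


def build_query(block_terms: List[str], doc_freq: Dict[str, int],
                num_required: int = 3) -> BlockQuery:
    # De-duplicate while keeping order; skip terms unknown to the index,
    # since AND-ing them could never match (assumed heuristic).
    known = [t for t in dict.fromkeys(block_terms) if doc_freq.get(t, 0) > 0]
    # Sort by document frequency, rarest first; ties broken alphabetically.
    ranked = sorted(known, key=lambda t: (doc_freq[t], t))
    return BlockQuery(must=ranked[:num_required], should=ranked[num_required:])


doc_freq = {"the": 1000, "plagiarism": 12, "detection": 40, "of": 900, "cross": 55}
query = build_query(["the", "plagiarism", "detection", "of", "cross"], doc_freq,
                    num_required=2)
print(query.render())
```

Requiring the rarest terms keeps the candidate set small, while the frequent terms only widen or rank the result.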

SLIDE 6

External Plagiarism Detection

Post-Processing

◮ Starting with query-block pairs
  ◮ Expand the text around the query and the block
  ◮ Build token by token matrix
  ◮ Match for 3 consecutive tokens (and at least 10 characters); other thresholds for translated documents
◮ Process the sequences
  ◮ Merged by a neighborhood criterion
  ◮ Finally a similarity between merged sequences is calculated
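
The token-by-token matrix matching can be illustrated with a short dynamic-programming sketch. It assumes exact token equality and uses the thresholds from the slide (3 consecutive tokens, at least 10 characters) as defaults; the text expansion, the merging by neighborhood, and the final similarity computation are not covered. `find_matches` is a hypothetical name.

```python
from typing import List, Tuple

# (start in suspicious tokens, start in source tokens, length in tokens)
Match = Tuple[int, int, int]


def find_matches(susp: List[str], src: List[str],
                 min_tokens: int = 3, min_chars: int = 10) -> List[Match]:
    """Report maximal runs of identical consecutive tokens.

    Conceptually fills a len(susp) x len(src) matrix in which cell (i, j)
    holds the length of the common run ending at susp[i-1] and src[j-1];
    only the previous row is kept in memory.
    """
    matches: List[Match] = []
    prev = [0] * (len(src) + 1)
    for i in range(1, len(susp) + 1):
        cur = [0] * (len(src) + 1)
        for j in range(1, len(src) + 1):
            if susp[i - 1] != src[j - 1]:
                continue
            cur[j] = prev[j - 1] + 1
            # The run is maximal if it cannot be extended along the diagonal.
            run_ends = (i == len(susp) or j == len(src) or susp[i] != src[j])
            if run_ends:
                length = cur[j]
                start = i - length
                chars = sum(len(t) for t in susp[start:i])
                if length >= min_tokens and chars >= min_chars:
                    matches.append((start, j - length, length))
        prev = cur
    return matches


susp = "we present a hybrid system for plagiarism detection today".split()
src = "this is a hybrid system for plagiarism detection using retrieval".split()
print(find_matches(susp, src))
```

The character threshold filters out runs made of very short tokens, which would otherwise match by chance.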

SLIDE 7

Cross-lingual Plagiarism Detection

Overview

◮ Approach: Normalize all documents to English
◮ Multiple alternative translations
  ◮ Not the single-best translation, but multiple candidates
◮ Word translations
  ◮ First step of a complete statistical machine translation system

SLIDE 8

Cross-lingual Plagiarism Detection

Word translations

◮ Sentence aligned multi-lingual corpus
  ◮ Europarl v5 (Koehn [2005])
◮ Apply word alignment algorithm
  ◮ BerkeleyAligner (Liang et al. [2006])
◮ Number of translation candidates sorted by probability
◮ Replace each non-English word by up to 5 translation candidates

task            time
no translation  7 ms
translation     9.38 ms
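
Replacing each word by its most probable translation candidates could be sketched as below. The translation table here is a hand-made toy example; in the system described above the candidates and probabilities would come from word alignment on Europarl. The function name, the lower-casing, and keeping unknown words unchanged are assumptions.

```python
from typing import Dict, List, Tuple

# word -> list of (English candidate, translation probability)
TranslationTable = Dict[str, List[Tuple[str, float]]]


def expand_with_translations(tokens: List[str], table: TranslationTable,
                             max_candidates: int = 5) -> List[str]:
    """Replace each token by up to `max_candidates` English candidates."""
    expanded: List[str] = []
    for token in tokens:
        candidates = table.get(token.lower())
        if not candidates:
            # No entry in the table: keep the token as it is (assumption).
            expanded.append(token)
            continue
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        expanded.extend(word for word, _ in ranked[:max_candidates])
    return expanded


# Toy table, for illustration only.
toy_table: TranslationTable = {
    "das": [("the", 0.90), ("that", 0.08)],
    "haus": [("house", 0.70), ("home", 0.20), ("building", 0.05)],
}
print(expand_with_translations(["Das", "Haus"], toy_table, max_candidates=2))
```

Keeping several candidates per word trades some precision for recall: the retrieval step can still match a source block even when the single most probable translation is not the word used in the source.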

SLIDE 9

Intrinsic Plagiarism Detection

Overview

◮ Style change detection
◮ Focus on features without semantics

Work-Flow

◮ Identify regions within a document
◮ Build feature centroid vector
◮ Compare regions with centroid
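
The centroid comparison in the work-flow above can be sketched as follows. The slides name neither the stylometric features nor the comparison measure, so the three features, the cosine similarity, and the threshold `min_similarity` are all assumptions chosen for illustration.

```python
import math
import re
from typing import List


def style_features(text: str) -> List[float]:
    """A few content-independent features (illustrative choice only)."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    n_sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    return [
        sum(len(w) for w in words) / n_words,                   # mean word length
        n_words / n_sentences,                                  # words per sentence
        sum(text.count(c) for c in ",;:") / max(len(text), 1),  # punctuation density
    ]


def centroid(vectors: List[List[float]]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suspicious_regions(vectors: List[List[float]],
                       min_similarity: float = 0.9) -> List[int]:
    """Indices of regions whose feature vector deviates from the document centroid."""
    center = centroid(vectors)
    return [i for i, v in enumerate(vectors) if cosine(v, center) < min_similarity]


regions = [[1.0, 0.10], [1.0, 0.12], [1.0, 0.09], [0.1, 1.0]]
print(suspicious_regions(regions))
```

In practice the features would need to be scaled to comparable ranges before comparison, otherwise the feature with the largest values dominates the cosine.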

SLIDE 10

Intrinsic Plagiarism Detection

Region Detection

◮ First idea: Split document into blocks of equal size
◮ Approach: Linear text-segmentation algorithm
  ◮ Build blocks of coherent topics
  ◮ Stop-word filtered stems as features
◮ TextSegFault (Kern and Granitzer [2009])
  ◮ Efficient O(n)
  ◮ Open-source

SLIDE 11

Result

Candidate Retrieval Step

◮ How many false positives are retrieved by the block candidate selection?
◮ Left: Based on 500 suspicious documents in the development corpus
◮ Right: Based on the evaluation corpus

Left (development corpus):

task        hit    all    ratio
high        2543   3676   0.6918
low         6614   6988   0.9465
none        9381   9592   0.9780
translated  2349   2543   0.9237

Right (evaluation corpus):

task        hit    all    ratio
high        13348  14756  0.9046
low         14832  14883  0.9966
none        16784  16784  1.0
translated  5462   6314   0.8651

SLIDE 12

Result

Overall System Performance

◮ Performance results of detected plagiarism separated by different sub-tasks for the hybrid evaluation corpus

task                 Precision  Recall   Granularity  Score
non-translated all   0.9299     0.8967   1.0553       0.8785
non-translated none  -          0.9497   1.0025       -
non-translated low   -          0.9207   1.0968       -
non-translated high  -          0.8122   1.0771       -
translated           0.8036     0.61616  2.1655       0.4195
external             0.9053     0.8631   1.1611       0.7949
intrinsic            0.212      0.1566   1.0          0.1802
Overall              0.8417     0.7057   1.1508       0.6948
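
The slide does not state how the Score column is computed. Assuming it is the PAN plagdet measure, the harmonic mean of precision and recall divided by log2(1 + granularity), the reported rows can be checked with a few lines. The function name is an illustrative choice.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """F1 of precision and recall, penalised by log2(1 + granularity).

    A granularity of 1.0 (each plagiarised passage detected exactly once)
    leaves the F1 value unchanged, because log2(2) = 1.
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Rows taken from the table above: (precision, recall, granularity, reported score)
rows = {
    "non-translated all": (0.9299, 0.8967, 1.0553, 0.8785),
    "external": (0.9053, 0.8631, 1.1611, 0.7949),
    "intrinsic": (0.212, 0.1566, 1.0, 0.1802),
    "Overall": (0.8417, 0.7057, 1.1508, 0.6948),
}
for task, (p, r, g, reported) in rows.items():
    print(f"{task}: computed {plagdet(p, r, g):.4f}, reported {reported}")
```

Under this assumption the computed values agree with the reported scores to about three decimal places, and the high granularity of the translated sub-task (2.1655) explains why its score falls well below its F1.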

SLIDE 13

Conclusions

◮ Hybrid system
  ◮ External plagiarism detection
  ◮ Support for cross-lingual plagiarism detection
  ◮ Intrinsic (style-based) plagiarism detection
◮ Issues
  ◮ Scalable (but slow implementation)
◮ Outlook
  ◮ We plan to build a web service initialized with Wikipedia as source

SLIDE 14

The End

Thank you!

SLIDE 15

References

R. Kern and M. Granitzer. Efficient linear text segmentation based on information retrieval techniques. In MEDES ’09, pages 167–171. ACM, 2009. ISBN 978-1-60558-829-2.

P. Koehn. Europarl: A parallel corpus for statistical machine translation. MT summit, 5:12–16, 2005.

P. Liang, B. Taskar, and D. Klein. Alignment by agreement. In Proceedings of the

Human Language Technology Conference of the NAACL, pages 104–111, June 2006.