Developing a corpus of plagiarized short answers - Björn Rudzewitz


  1. Developing a corpus of plagiarized short answers [Clough and Stevenson, 2011]
     Björn Rudzewitz (brzdwtz@sfs.uni-tuebingen.de), University of Tübingen
     Hauptseminar Language Variation and Stylometrics, WS 15/16
     December 16, 2015

  2. Outline
     - Introduction
     - Plagiarism Typology
     - Corpus Creation
     - Data Analysis
       - Individual Differences
       - Data Observations
     - Automatic Plagiarism Detection
       - N-Gram Overlap
       - LCS
       - Baselines
       - L1 vs L2
       - Classification
     - Conclusion
     - Discussion
     - References
     To avoid the objection of plagiarism: ideas and examples in this presentation are taken from Clough and Stevenson [2011].

  3. Motivation
     - correlation between the availability of electronic resources and plagiarism
     - plagiarism detection as a field suffers from a lack of standardized evaluation resources
     - previous corpus creation efforts were suboptimal:
       - lack of data ('deception': how to find plagiarized text)
       - lack of gold labels (authors deny judgments)
       - lack of a legal and ethical basis for data publication
       - lack of transparency in data preparation
       (→ Leech's maxims for corpus creation)

  4. Impact and application
     Desired effects of the corpus:
     - new resource for comparative evaluation and pedagogical methods
     - enable new work on plagiarism detection and task strategies

  5. Related work
     - Microsoft Research Paraphrase Corpus [Dolan et al., 2004]
     - Multiple-Translation Chinese Corpus [Pang et al., 2003]
     - METER corpus [Gaizauskas et al., 2001]
     - Corpus for plagiarism detection [Zu Eissen et al., 2007]
     - PAN plagiarism detection corpus [Eiselt and Rosso, 2009]
     More related resources exist in Machine Translation evaluation and Short Answer Assessment.

  6. High-level perspective on approaches
     - extrinsic
       - comparison of source and (potentially) plagiarized text
       - authorship attribution approaches
     - intrinsic
       - comparison of text passages in one document with each other
       - stylometric approaches
     Problem: documents can plagiarize n ∈ ℕ₀ other documents in different ways
     → interaction between extrinsic and intrinsic analysis desirable
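
The extrinsic setting (comparing a source with a potentially plagiarized text) can be illustrated with a word n-gram containment score. This is only a minimal sketch: the whitespace tokenization, the choice of n = 3, and the function names `word_ngrams` and `containment` are assumptions made for illustration, not the measure as configured by Clough and Stevenson.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suspicious, source, n=3):
    """Share of the suspicious text's n-grams that also occur in the source."""
    suspicious_ngrams = word_ngrams(suspicious, n)
    if not suspicious_ngrams:
        # Text shorter than n words: no n-grams to compare.
        return 0.0
    shared = suspicious_ngrams & word_ngrams(source, n)
    return len(shared) / len(suspicious_ngrams)


# Invented example texts, not taken from the corpus.
source_text = "the quick brown fox jumps over the lazy dog"
answer_text = "the quick brown fox sleeps"
print(containment(answer_text, source_text))
```

A score near 1 suggests that most of the answer's word sequences appear in the source; an intrinsic approach would instead compare passages within a single document.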

  10. Plagiarism Techniques: How to plagiarize
      Goal: produce an answer of 200-300 words to a question
      - Near copy
        - copy-and-paste (parts of) Wikipedia article
      - Light revision
        - like near copy, but with the possibility to replace words with synonyms and to paraphrase (lexically/morphosyntactically)
        - information structure preserved
      - Heavy revision
        - rephrasing/paraphrasing of Wikipedia article, n-to-m sentence alignment
      - Non-plagiarism
        - no access to Wikipedia
        - participants read material, then answer the question with their (partly freshly) acquired knowledge

  11. Corpus Creation
      - 19 participants, CS students
      - each participant wrote an answer for each task (non-plagiarism twice)
        → 95 answers + 5 articles = 100 documents (19,995 tokens)
      - Graeco-Latin square design for systematic randomization and rotation of revision types per participant and question
      - participant metadata: native language, familiarity with answer, perceived difficulty of task
      - tokens per answer: μ = 208, σ = 64.91
      - types per answer: μ = 113, σ = 30.11
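
The document counts on this slide can be reproduced with a simple rotation of revision types over participants and tasks. The sketch below uses a plain cyclic (Latin-square-style) rotation as a stand-in; it is not the Graeco-Latin square used by Clough and Stevenson, and the names `SLOTS` and `assign` are hypothetical.

```python
from collections import Counter

# Five slots per participant: each revision type once, non-plagiarism twice.
SLOTS = [
    "near copy",
    "light revision",
    "heavy revision",
    "non-plagiarism",
    "non-plagiarism",
]
N_PARTICIPANTS = 19
N_TASKS = 5


def assign(n_participants=N_PARTICIPANTS, n_tasks=N_TASKS):
    """Map (participant, task) to a slot by shifting the slot list per participant."""
    plan = {}
    for participant in range(n_participants):
        for task in range(n_tasks):
            plan[(participant, task)] = SLOTS[(participant + task) % len(SLOTS)]
    return plan


plan = assign()
counts = Counter(plan.values())
print(len(plan), dict(counts))  # number of answers and answers per category
print(len(plan) + N_TASKS)      # answers plus one source article per task
```

Every participant passes through all five slots exactly once, so the totals are 19 answers for each revision type and 38 non-plagiarized answers, 95 in all; adding the 5 source articles gives the 100 documents. With 19 participants the per-task distribution is not perfectly balanced under this simple rotation.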

  12. Data Analysis: Individual Differences
      - statistically significant difference (p < 0.01) between native and non-native speakers with respect to mean knowledge and perceived difficulty (two-sample t-test)
        → difference in population means of two independent samples
      - positive Pearson correlation of r = 0.344 between knowledge and perceived difficulty
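
Both statistics on this slide can be computed in a few lines. The sketch below is illustrative only: the ratings are invented, and the pooled-variance (Student) form of the two-sample t statistic is an assumption, since the slide does not say which variant was used.

```python
import math
from statistics import mean, variance


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def pooled_t(a, b):
    """Two-sample t statistic for independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))


# Invented ratings on a 1-5 scale, NOT the corpus data.
native_knowledge = [4, 5, 3, 4, 5, 4]
non_native_knowledge = [2, 3, 2, 3, 1, 2]
knowledge = native_knowledge + non_native_knowledge
difficulty = [2, 1, 3, 2, 1, 2, 4, 3, 4, 3, 5, 4]

print(pooled_t(native_knowledge, non_native_knowledge))
print(pearson_r(knowledge, difficulty))
```

The t statistic is then compared against a t distribution with n_a + n_b - 2 degrees of freedom to obtain the p-value; a statistics library would normally handle that step.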

  13. Data Analysis: Observations

  14. Data Analysis: Observations
