
TRACER TUTORIAL: TEXT REUSE DETECTION FEATURING (Marco Büchler, Emily Franzini and Greta Franzini)



  1. TRACER TUTORIAL: TEXT REUSE DETECTION FEATURING Marco Büchler, Emily Franzini and Greta Franzini

  2. TABLE OF CONTENTS
      1. What is featuring?
      2. Featuring techniques
      3. Hacking
      4. Conclusion and revision

  3. REMINDER: CURRENT APPROACH

  4. WHAT IS FEATURING?

  5. QUESTION What do you associate with featuring?

  6. A VISUALISATION OF FEATURING From biometry:

  7. SOME VOCABULARY

  8. FEATURING

  9. FEATURING TECHNIQUES

  10. FEATURING: AN EXAMPLE
      V1 = {s1, s2, s3, s4, s5}
      Features V2 = {A, B, ..., J, K}
      s1: A B C D E
      s2: A C E F G
      s3: G F A C D
      s4: C F A G E
      s5: D H I J K

  11. FEATURING: MATRIX STYLE
      s1: A B C D E
      s2: A C E F G
      s3: G F A C D
      s4: C F A G E
      s5: D H I J K

            A C D F G E B H I J K
      s1    1 1 1 0 0 1 1 0 0 0 0
      s2    1 1 0 1 1 1 0 0 0 0 0
      s3    1 1 1 1 1 0 0 0 0 0 0    = M
      s4    1 1 0 1 1 1 0 0 0 0 0
      s5    0 0 1 0 0 0 0 1 1 1 1
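
The matrix on slide 11 can be rebuilt from the five units on slide 10. The following is a minimal Python sketch, not TRACER code: the names `units`, `columns` and `feature_matrix` are illustrative assumptions, and the column order is simply the one printed on the slide.

```python
# Units s1..s5 and their features, copied from the slide example.
units = {
    "s1": ["A", "B", "C", "D", "E"],
    "s2": ["A", "C", "E", "F", "G"],
    "s3": ["G", "F", "A", "C", "D"],
    "s4": ["C", "F", "A", "G", "E"],
    "s5": ["D", "H", "I", "J", "K"],
}

# Column order as printed on the slide.
columns = ["A", "C", "D", "F", "G", "E", "B", "H", "I", "J", "K"]


def feature_matrix(units, columns):
    """Return one 0/1 row per unit and one column per feature."""
    rows = []
    for features in units.values():
        present = set(features)
        rows.append([1 if c in present else 0 for c in columns])
    return rows


M = feature_matrix(units, columns)
for name, row in zip(units, M):
    print(name, *row)
```

Each row records which features a unit contains; the order of features inside a unit is lost, which is why s2 and s4 end up with identical rows.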

  12. HACKING: CONFIGURATION

  13. HACKING

  14. HACKING Tasks:
      • Run on your own texts ...
          1. ... n-gram shingling with n=2, 3
          2. ... words as features
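
To make the two tasks concrete, here is a small Python sketch of what n-gram shingling and words-as-features produce for a single sentence. It is an illustration only and does not reproduce TRACER's own implementation; `tokenize`, `ngram_shingles` and `word_features` are hypothetical helper names, and the whitespace tokenisation is an assumption.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()


def ngram_shingles(tokens, n):
    """Return all overlapping word n-grams of length n, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_features(tokens):
    """Words as features: every token is its own feature."""
    return list(tokens)


tokens = tokenize("In the beginning God created the heaven and the earth")
bigrams = ngram_shingles(tokens, 2)    # n = 2
trigrams = ngram_shingles(tokens, 3)   # n = 3
words = word_features(tokens)

print(bigrams[:3])
print(trigrams[:3])
print(words[:3])
```

A sentence of k tokens yields k - n + 1 shingles, so larger n gives fewer but more specific features per unit.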

  15. HACKING Questions:
      • Run the aforementioned tasks. Compare the resulting "tail distributions" (in the featuring folder you'll find all this information in, e.g., KJV.meta).
      • Compare the .train files of n-gram shingling with those of hash-breaking and of words as features (use Excel or OpenOffice to open the .train file; sort by columns B and C).

  16. CONFIGURING THE TRAINING IMPL PARAMETER Hint: The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml

  17. CONFIGURING THE TRAINING IMPL PARAMETER Hint:
      • The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml
      • eu.etrap.tracer.featuring.syntactical.shingle.TriGramShinglingTrainingImpl
      • eu.etrap.tracer.featuring.syntactical.shingle.BiGramShinglingTrainingImpl
      • eu.etrap.tracer.featuring.semantic.WordBasedTrainingImpl

  18. GAP BETWEEN KNOWLEDGE AND EXPERIENCE

  19. CONCLUSION AND REVISION

  20. CHECK Question: How does the number of features change with the changing feature size (e.g. bigrams, trigrams)?
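
One way to explore this question empirically is to count total and distinct features for several values of n. The Python sketch below does this on a two-sentence toy corpus; the corpus and the helper names `ngrams` and `feature_counts` are assumptions made for illustration, and results on real texts will differ.

```python
def ngrams(tokens, n):
    """Return all overlapping n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def feature_counts(units, n):
    """Return (total features, distinct features) over all units."""
    all_features = [g for unit in units for g in ngrams(unit.split(), n)]
    return len(all_features), len(set(all_features))


# Toy corpus, invented for this example.
units = ["the cat sat on the mat", "the dog sat on the log"]

for n in (1, 2, 3):
    total, distinct = feature_counts(units, n)
    print(f"n={n}: total={total}, distinct={distinct}")
```

On such a tiny corpus the totals shrink as n grows because each unit has fewer positions left for an n-gram to start; running the same function on your own texts gives a more meaningful picture.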

  21. CHECK
            A C D F G E B H I J K
      s1    1 1 1 0 0 1 1 0 0 0 0
      s2    1 1 0 1 1 1 0 0 0 0 0
      s3    1 1 1 1 1 0 0 0 0 0 0    = M
      s4    1 1 0 1 1 1 0 0 0 0 0
      s5    0 0 1 0 0 0 0 1 1 1 1
      Question: What is the Digital Fingerprint of a reuse unit?

  22. CHECK (same matrix M as on slide 21) Question: How does preprocessing influence F?

  23. CHECK (same matrix M as on slide 21) Question: How can you compute the feature frequency?
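
One possible reading of the last question is that the feature frequency is the column sum of M, i.e. the number of units in which a feature occurs. The Python sketch below computes it under that assumption; `feature_frequencies` is a hypothetical helper name, and M is copied from the slide.

```python
# Column order and matrix M, copied from the slide.
columns = ["A", "C", "D", "F", "G", "E", "B", "H", "I", "J", "K"]
M = [
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],  # s1
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # s2
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # s3
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # s4
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],  # s5
]


def feature_frequencies(matrix, columns):
    """Sum each column: in how many units does each feature occur?"""
    return {c: sum(row[j] for row in matrix) for j, c in enumerate(columns)}


freq = feature_frequencies(M, columns)
print(freq)
```

Because M is binary, a feature that occurs several times within one unit would still be counted once for that unit.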

  24. IMPORTANCE OF FEATURING
      • Featuring defines the unit to measure similarity.
      • Most featuring techniques "generate" a power-law distribution:
          • A few features occur very often;
          • At least 50% of all features occur just once;
          • Most features are rare.
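
The shape of such a distribution can be inspected by counting how many features occur once, twice, and so on. The Python sketch below does this for the words of a single verse; the helper name `frequency_spectrum` is an assumption, and a one-sentence sample only illustrates the mechanics, not the statistics of a full corpus.

```python
from collections import Counter


def frequency_spectrum(features):
    """Return (feature counts, frequency-of-frequencies, share occurring once)."""
    counts = Counter(features)
    spectrum = Counter(counts.values())
    singleton_share = spectrum[1] / len(counts) if counts else 0.0
    return counts, spectrum, singleton_share


tokens = "in the beginning god created the heaven and the earth".split()
counts, spectrum, share = frequency_spectrum(tokens)

print(counts.most_common(1))  # the most frequent feature
print(dict(spectrum))         # how many features occur k times
print(share)                  # share of features that occur exactly once
```

Here one word ("the") occurs three times while the other seven occur once, a small-scale picture of "a few features occur very often; most features are rare".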

  25. FINITO!

  26. CONTACT Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us: http://www.etrap.eu, contact@etrap.eu

  27. LICENCE The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
