

  1. TRACER TUTORIAL: TEXT REUSE DETECTION PREPROCESSING Marco Büchler, Emily Franzini and Greta Franzini

  2. TABLE OF CONTENTS 1. What is preprocessing? 2. Preprocessing techniques 3. Hacking 4. Conclusion and revision 2/100

  3. HACKING, INSTALLATION & CONFIGURATION GUIDE FOR TRACER 1. Download TRACER from http://etrap.eu/tracer/ to your storage folder, e.g. /roedel/mbuechler 2. Using the command line, navigate to your storage folder with the cd command 3. Unzip the archive: gunzip tracer.tar.gz 4. Untar the archive: tar -xvf tracer.tar 5. Change to the TRACER folder: cd TRACER 6. Open the configuration file with vim conf/tracer_config.xml 7. Configure your input file: 3/100

  4. HACKING: STARTING TRACER Start the tool with the command: java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar Explanation: • -Xmx600m (up to 600 MB of memory); • -Dfile.encoding sets the encoding of your input file (optional); • -Deu.etrap.medusa.config.ClassConfig (configuration file). 4/100

  5. WHAT IS PREPROCESSING?

  6. REMINDER: CURRENT APPROACH 6/100

  7. PRE-STEP: SEGMENTATION - AN EXAMPLE 7/100

  8. PRE-STEP: SEGMENTATION 8/100

  9. QUESTION What do you associate with preprocessing ? 9/100

  10. FOUNDATIONS FOR PREPROCESSING: ZIPFIAN LAW 10/100

  11. IMPLICATIONS OF THE ZIPFIAN LAW • Approx. 50% of all words occur only once • Approx. 16% of all words occur only twice • Approx. 8% of all words occur three times • ... • Approx. 90% of all words in a corpus occur 10 times or less. These proportions follow the frequency spectrum s(f) = 1/(f·(f+1)), with the cumulative share s_n = Σ_{f=1}^{n} 1/(f·(f+1)) = n/(n+1). • The top 300-700 most frequent words already cover about 50% of all tokens (depending on the language) 11/100
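The percentages on this slide follow directly from the frequency spectrum s(f) = 1/(f·(f+1)); a minimal sketch that checks the claimed figures:

```python
# Frequency spectrum implied by Zipf's law: the share of word types
# that occur exactly f times in a corpus is s(f) = 1 / (f * (f + 1)).
def spectrum_share(f: int) -> float:
    return 1.0 / (f * (f + 1))

def cumulative_share(n: int) -> float:
    # Telescoping sum: sum_{f=1}^{n} 1/(f(f+1)) = n / (n + 1)
    return sum(spectrum_share(f) for f in range(1, n + 1))

print(spectrum_share(1))     # 0.5    -> ~50% of types occur only once
print(spectrum_share(2))     # ~0.167 -> ~16% occur twice
print(spectrum_share(3))     # ~0.083 -> ~8% occur three times
print(cumulative_share(10))  # ~0.909 -> ~90% occur 10 times or less
```

Note that these are the idealised Zipfian proportions; real corpora deviate somewhat from them.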

  12. QUESTION What does lemmatisation mean for this plot? 12/100

  13. PREPROCESSING TECHNIQUES

  14. PREPROCESSING 14/100

  15. PREPROCESSING: DIRECTED GRAPH NORMALISATION E.g. lemmatisation 15/100

  16. PREPROCESSING: UNDIRECTED GRAPH NORMALISATION E.g. synonyms, string similarity 16/100
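The distinction between the two graph types can be sketched with toy data (this is an illustration, not TRACER's implementation): lemmatisation follows a directed edge from each inflected form to its base form, while synonym normalisation groups words into undirected equivalence classes and replaces each member by one canonical representative.

```python
# Directed normalisation: each surface form points to exactly one lemma.
lemma_of = {"ran": "run", "running": "run", "runs": "run"}

def lemmatise(token: str) -> str:
    return lemma_of.get(token, token)  # unknown tokens pass through unchanged

# Undirected normalisation: synonym sets are symmetric, so every member
# of a set is mapped to the same canonical representative.
synonym_sets = [{"big", "large", "huge"}, {"street", "road"}]
representative = {}
for group in synonym_sets:
    canon = min(group)  # arbitrary but deterministic choice of representative
    for word in group:
        representative[word] = canon

def normalise_synonym(token: str) -> str:
    return representative.get(token, token)

print(lemmatise("running"))        # run
print(normalise_synonym("large"))  # big
print(normalise_synonym("huge"))   # big
```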

  17. HACKING

  18. HACKING Tasks: • Run TRACER on your texts ... 1. ... without preprocessing 2. ... as 1) plus lemmatisation 3. ... as 2) plus synonym replacement 18/100

  19. HACKING Questions: • Compare the input file with the *.prep file for all preprocessing techniques. Which methods seem to work best for you? Which make no sense for the dataset? • Compare all *.meta files, which contain some statistics! How many words were changed, and by which method? • (optional and advanced) What is the number of word types for each preprocessing technique? (This can be derived from the first column of *.prep.inv.) 19/100
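For the advanced question, counting word types could look like this minimal sketch, assuming the first whitespace-separated column of *.prep.inv holds one word type per line (the exact column layout of the file is an assumption here, so check it against your own output first):

```python
import io

def count_word_types(lines) -> int:
    """Count distinct entries in the first whitespace-separated column."""
    types = set()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            types.add(parts[0])
    return len(types)

# Toy stand-in for a *.prep.inv file; in practice pass open("file.prep.inv").
sample = io.StringIO("lord\t12\ngod\t9\nday\t3\n")
print(count_word_types(sample))  # 3
```

Running this once per preprocessing variant lets you compare how strongly each technique shrinks the vocabulary.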

  20. PREPROCESSING 1) WITHOUT PREPROCESSING Hint: • The configuration file can be found in: $TRACER_HOME/conf/tracer_config.xml • All preprocessing switches are set to false. 20/100

  21. PREPROCESSING 2) REMOVING DIACRITICS Hint: • boolRemoveDiachritics is switched on with the value true. 21/100

  22. PREPROCESSING 4) LEMMATISING TEXT Hint: • boolLemmatisation is switched on with the value true. • Lemmatisation is configured via: <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma"/> 22/100
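For orientation, a base-form file is plausibly a plain-text mapping from inflected form to lemma, one pair per line; the two-column, tab-separated layout shown below is an assumption for illustration, not taken from the TRACER documentation:

```
came	come
went	go
sayest	say
```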

  23. PREPROCESSING 5) SYNONYM HANDLING Hint: • boolReplaceSynonyms is switched on with the value true. • Synonyms are configured via: <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns"/> 23/100

  24. PREPROCESSING 6) STRING SIMILARITY FOR NORMALISING VARIANTS Hint: • boolReplaceStringSimilarWords is switched on with the value true. • Thresholds: <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns"/> 24/100

  25. OPEN ISSUE: FRAGMENTARY WORDS 25/100

  26. OPEN ISSUE: FRAGMENTARY WORDS - DEALING WITH GAPS AND LEIDEN CONVENTIONS 26/100

  27. GAP BETWEEN KNOWLEDGE AND EXPERIENCE 27/100

  28. CONCLUSION AND REVISION

  29. CHECK Statement: • "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus." Good or bad? 29/100

  30. CHECK Fact file: • Language variants • Different writing styles • (Some) dialects • Diacritics • OCR errors Question: What's the difference for you? 30/100

  31. CHECK Fact file: • Language variants • Different writing styles • (Some) dialects • Diacritics • OCR errors Question: What do you think is the difference for the computer? 31/100

  32. IMPORTANCE OF PREPROCESSING • Cleaning and harmonising the data. • When working with a new corpus (not only a new language, but also the same language in different epochs or geographical regions), cleaning and harmonising the data can take up to 70% of the overall time. Preprocessing mantra: garbage in, garbage out. 32/100

  33. FINITO! 33/100

  34. CONTACT Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us at http://www.etrap.eu contact@etrap.eu 34/100

  35. LICENCE The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP. 35/100
