TRACER TUTORIAL: TEXT REUSE DETECTION PREPROCESSING
Marco Büchler, Emily Franzini and Greta Franzini
TABLE OF CONTENTS
1. What is preprocessing?
2. Preprocessing techniques
3. Hacking
4. Conclusion and revision

HACKING,
to your storage folder, e.g. /roedel/mbuechler
command
Start the tool with the command: java -Xmx600m
Explanation:
What do you associate with preprocessing?
$s(f) = \frac{1}{f \cdot (f + 1)}$
$s_n(f) = n \cdot \frac{1}{f \cdot (f + 1)}$
all tokens (depending on the language)
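A minimal sketch of what the first formula implies, assuming s(f) is read as the Zipf-style estimate of the share of word types that occur exactly f times in a corpus. The function names are illustrative and not part of TRACER. Exact fractions are used so the values can be checked by hand.

```python
from fractions import Fraction


def share_of_types(f: int) -> Fraction:
    """Estimated share of word types occurring exactly f times: 1 / (f * (f + 1))."""
    if f < 1:
        raise ValueError("frequency f must be a positive integer")
    return Fraction(1, f * (f + 1))


def cumulative_share(max_f: int) -> Fraction:
    """Estimated share of types occurring at most max_f times (sum of s(1)..s(max_f))."""
    return sum((share_of_types(f) for f in range(1, max_f + 1)), Fraction(0))


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, share_of_types(f))   # 1 1/2, 2 1/6, 3 1/12
    print(cumulative_share(3))        # 3/4
```

Under this reading, about half of all types occur only once and about three quarters occur at most three times. Any step that merges word forms therefore moves mass away from these low-frequency classes.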
What does lemmatisation mean for this plot?
E.g. lemmatisation
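A minimal sketch of lemmatisation as a plain lookup, assuming a table that maps inflected word forms to base forms. The table below is a hypothetical toy example, not TRACER data, and the function is not TRACER's own implementation.

```python
def lemmatise(tokens, baseforms):
    """Replace each token by its base form if one is known, otherwise keep the token."""
    return [baseforms.get(token, token) for token in tokens]


# Hypothetical toy lookup table (word form -> base form); not taken from TRACER.
BASEFORMS = {"went": "go", "goes": "go", "going": "go", "houses": "house"}

if __name__ == "__main__":
    first = lemmatise("she goes to the houses".split(), BASEFORMS)
    second = lemmatise("she went to the house".split(), BASEFORMS)
    print(first)            # ['she', 'go', 'to', 'the', 'house']
    print(first == second)  # True
```

The two sentences differ on the surface but become identical token sequences after the lookup, which is the effect a reuse detector relies on.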
E.g. synonyms, string similarity
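A minimal sketch of string similarity as a way to relate spelling variants, using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for illustration, not the similarity measure TRACER uses, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on the matching character blocks of a and b."""
    return SequenceMatcher(None, a, b).ratio()


def similar_words(word, vocabulary, threshold=0.8):
    """Return all other vocabulary entries whose similarity to word reaches the threshold."""
    return [cand for cand in vocabulary
            if cand != word and similarity(word, cand) >= threshold]


if __name__ == "__main__":
    vocabulary = ["color", "colour", "house"]
    print(similar_words("colour", vocabulary))  # ['color']
```

A lower threshold links more variants but also more unrelated words, so the value needs to be checked against the dataset at hand.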
Tasks:
Questions:
make no sense for the dataset?
words have changed and through which method?
preprocessing technique (can be derived from the first column of *.prep.inv).
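A minimal sketch for tallying the values in the first column of a `*.prep.inv` file. It assumes the columns are tab-separated, which the slides do not state, and the sample rows are invented placeholders rather than real TRACER output.

```python
from collections import Counter


def count_first_column(lines, delimiter="\t"):
    """Count how often each first-column value occurs, skipping empty lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        counts[line.split(delimiter, 1)[0]] += 1
    return counts


if __name__ == "__main__":
    # Invented placeholder rows; the real column layout may differ.
    sample = ["A\tword1\tword2\n", "A\tword3\tword4\n", "B\tword5\tword6\n"]
    for value, n in count_first_column(sample).most_common():
        print(value, n)
```

For a real file, the same function accepts an open file handle, for example `count_first_column(open(path, encoding="utf-8"))`, where `path` points at the `*.prep.inv` file.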
Hint:
$TRACER_HOME/conf/tracer_conf.xml
Hint:
Hint:
<property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
Hint:
<property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />
Hint:
<property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />
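A minimal sketch for checking which files the configuration points to, based on the `<property name="..." value="..." />` entries quoted in the hints. The enclosing `<config>` element in the sample is hypothetical, and the property names are written with underscores on the assumption that these were lost when the slides were converted to text.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text: str) -> dict:
    """Collect name -> value for every <property> element, at any nesting depth."""
    root = ET.fromstring(xml_text)
    return {prop.get("name"): prop.get("value")
            for prop in root.iter("property")
            if prop.get("name") is not None}


# Hypothetical wrapper element; only the <property> lines come from the hints.
SAMPLE = """<config>
  <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
  <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />
</config>"""

if __name__ == "__main__":
    for name, value in read_properties(SAMPLE).items():
        print(name, "=", value)
```

Because the function searches at any depth, it does not depend on how the real configuration file nests its property entries.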
Statement:
80% of all tokens in a corpus.” Good or bad?
Fact file:
Question: What’s the difference for you?
Fact file:
Question: What do you think is the difference for the computer?
same language in different epochs or geographical regions
Cleaning/harmonising the data can take up to 70% of the overall time.
Preprocessing mantra: Garbage in, garbage out
Team: Marco Büchler, Greta Franzini and Emily Franzini.
Visit us: http://www.etrap.eu
Contact: contact@etrap.eu
The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.