slide-1
SLIDE 1

TRACER TUTORIAL: TEXT REUSE DETECTION PREPROCESSING

Marco Büchler, Emily Franzini and Greta Franzini

slide-2
SLIDE 2

TABLE OF CONTENTS

  • 1. What is preprocessing?
  • 2. Preprocessing techniques
  • 3. Hacking
  • 4. Conclusion and revision

2/100

slide-3
SLIDE 3

HACKING, INSTALLATION & CONFIGURATION GUIDE FOR TRACER

  • 1. Download TRACER from http://etrap.eu/tracer/ to your storage folder, e.g. /roedel/mbuechler
  • 2. Using the command line, navigate to your storage folder with the cd command
  • 3. Unzip the archive: gunzip tracer.tar.gz
  • 4. Untar the archive: tar -xvf tracer.tar
  • 5. Change to the TRACER folder: cd TRACER
  • 6. Open the configuration file with vim conf/tracer_config.xml
  • 7. Configure your input file.
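The steps above can be run as one short shell session. The sketch below demonstrates the unpacking steps on a fabricated tracer.tar.gz (so it can be tried anywhere); with the real download, the archive comes from http://etrap.eu/tracer/ instead, and the archive layout assumed here is taken from the slide.

```shell
set -e
WORK=$(mktemp -d)          # stand-in for your storage folder
cd "$WORK"
# -- stand-in for step 1: fabricate an archive with the expected layout --
mkdir -p TRACER/conf
echo '<config/>' > TRACER/conf/tracer_config.xml
tar -cf tracer.tar TRACER && gzip tracer.tar && rm -rf TRACER
# -- steps 3-6 from the slide --
gunzip tracer.tar.gz       # step 3: tracer.tar.gz -> tracer.tar
tar -xvf tracer.tar        # step 4: unpack the tar archive
cd TRACER                  # step 5: enter the TRACER folder
ls conf/tracer_config.xml  # step 6 edits this file (vim conf/tracer_config.xml)
```

Note that gunzip replaces tracer.tar.gz with tracer.tar in place.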

3/100

slide-4
SLIDE 4

HACKING: STARTING TRACER

Start the tool with the command:

java -Xmx600m -Deu.etrap.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar

Explanation:

  • -Xmx600m allocates up to 600 MB of memory;
  • -Dfile.encoding sets the encoding of your input file (optional);
  • -Deu.etrap.medusa.config.ClassConfig points to the configuration file.

4/100

slide-5
SLIDE 5

WHAT IS PREPROCESSING?

slide-6
SLIDE 6

REMINDER: CURRENT APPROACH

6/100

slide-7
SLIDE 7

PRE-STEP: SEGMENTATION - AN EXAMPLE

7/100

slide-8
SLIDE 8

PRE-STEP: SEGMENTATION

8/100

slide-9
SLIDE 9

QUESTION

What do you associate with preprocessing?

9/100

slide-10
SLIDE 10

FOUNDATIONS FOR PREPROCESSING: ZIPFIAN LAW

10/100

slide-11
SLIDE 11

IMPLICATIONS OF THE ZIPFIAN LAW

  • Approx. 50% of all words occur only once
  • Approx. 16% of all words occur only twice
  • Approx. 8% of all words occur three times
  • ...
  • Approx. 90% of all words in a corpus occur 10 times or less

s(f) = 1 / (f ∗ (f + 1)),    s_n = Σ_{f=1}^{n} 1 / (f ∗ (f + 1))

  • The top 300-700 most frequent words already cover about 50% of all tokens (depending on the language)
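The percentages above follow directly from the formula s(f) = 1 / (f ∗ (f + 1)); a minimal check using awk:

```shell
# s(f) = 1 / (f * (f + 1)) estimates the share of word types occurring
# exactly f times; summing over f = 1..n gives the cumulative share.
S1=$(awk 'BEGIN { printf "%.3f", 1 / (1 * 2) }')    # occur once
S2=$(awk 'BEGIN { printf "%.3f", 1 / (2 * 3) }')    # occur twice
S3=$(awk 'BEGIN { printf "%.3f", 1 / (3 * 4) }')    # occur three times
CUM=$(awk 'BEGIN { c = 0; for (f = 1; f <= 10; f++) c += 1 / (f * (f + 1)); printf "%.3f", c }')
echo "once: $S1  twice: $S2  three times: $S3  ten times or less: $CUM"
```

This reproduces the approx. 50%, 16%, 8% and 90% figures from the bullets above.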

11/100

slide-12
SLIDE 12

QUESTION

What does lemmatisation mean for this plot?

12/100

slide-13
SLIDE 13

PREPROCESSING TECHNIQUES

slide-14
SLIDE 14

PREPROCESSING

14/100

slide-15
SLIDE 15

PREPROCESSING: DIRECTED GRAPH NORMALISATION

E.g. lemmatisation

15/100

slide-16
SLIDE 16

PREPROCESSING: UNDIRECTED GRAPH NORMALISATION

E.g. synonyms, string similarity

16/100

slide-17
SLIDE 17

HACKING

slide-18
SLIDE 18

HACKING

Tasks:

  • Run on your texts ...
  • 1. ... without preprocessing
  • 2. ... 1) + lemmatisation
  • 3. ... 2) + synonym replacement

18/100

slide-19
SLIDE 19

HACKING

Questions:

  • Compare the input file with the *.prep file for all preprocessing techniques. Which methods seem to work best for you? Which make no sense for the dataset?
  • Compare all *.meta files containing some numbers! How many words have changed, and through which method?
  • (optional and advanced) What is the number of word types for each preprocessing technique? (This can be derived from the first column of *.prep.inv.)
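For the optional task, a possible one-liner; the sample file below is fabricated, and the format (one entry per line with the word type in the first whitespace-separated column) is an assumption about *.prep.inv:

```shell
# Fabricated sample standing in for a real *.prep.inv file:
printf 'lord 12\ngod 9\nking 3\n' > sample.prep.inv
# Count the distinct word types in the first column:
TYPES=$(cut -d' ' -f1 sample.prep.inv | sort -u | wc -l | tr -d ' ')
echo "word types: $TYPES"
```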

19/100

slide-20
SLIDE 20

PREPROCESSING 1) WITHOUT PREPROCESSING

Hint:

  • The configuration file can be found in:

$TRACER_HOME/conf/tracer_config.xml

  • All preprocessing values are set to false.
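Put together, run 1) corresponds to switching off all of the flags introduced on the following slides; a sketch of the relevant lines (property names as they appear on slides 21-24, surrounding XML structure assumed):

```xml
<property name="boolRemoveDiachritics" value="false" />
<property name="boolLemmatisation" value="false" />
<property name="boolReplaceSynonyms" value="false" />
<property name="boolReplaceStringSimilarWords" value="false" />
```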

20/100

slide-21
SLIDE 21

PREPROCESSING 2) REMOVING DIACRITICS

Hint:

  • boolRemoveDiachritics is switched on by value true.

21/100

slide-22
SLIDE 22

PREPROCESSING 4) LEMMATISING TEXT

Hint:

  • boolLemmatisation is switched on by value true.
  • Lemmatisation can be configured by:

<property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />

22/100

slide-23
SLIDE 23

PREPROCESSING 5) SYNONYM HANDLING

Hint:

  • boolReplaceSynonyms is switched on by value true.
  • Synonyms can be configured by:

<property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />

23/100

slide-24
SLIDE 24

PREPROCESSING 6) STRING SIMILARITY FOR NORMALISING VARIANTS

Hint:

  • boolReplaceStringSimilarWords is switched on by value true.
  • Thresholds:

<property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />

24/100

slide-25
SLIDE 25

OPEN ISSUE: FRAGMENTARY WORDS

25/100

slide-26
SLIDE 26

OPEN ISSUE: FRAGMENTARY WORDS - DEALING WITH GAPS AND LEIDEN CONVENTIONS

26/100

slide-27
SLIDE 27

GAP BETWEEN KNOWLEDGE AND EXPERIENCE

27/100

slide-28
SLIDE 28

CONCLUSION AND REVISION

slide-29
SLIDE 29

CHECK

Statement:

  • "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus." Good or bad?

29/100

slide-30
SLIDE 30

CHECK

Fact file:

  • Language variants
  • Different writing styles
  • (Some) dialects
  • Diacritics
  • OCR errors

Question: What’s the difference for you?

30/100

slide-31
SLIDE 31

CHECK

Fact file:

  • Language variants
  • Different writing styles
  • (Some) dialects
  • Diacritics
  • OCR errors

Question: What do you think is the difference for the computer?

31/100

slide-32
SLIDE 32

IMPORTANCE OF PREPROCESSING

  • Cleaning and harmonising the data.
  • When working with a new corpus (not only a new language, but also the same language in different epochs or geographical regions), cleaning and harmonising the data can take up to 70% of the overall time.

Preprocessing mantra: Garbage in, garbage out.

32/100

slide-33
SLIDE 33

FINITO!

33/100

slide-34
SLIDE 34

CONTACT

Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us at http://www.etrap.eu or write to contact@etrap.eu.

34/100

slide-35
SLIDE 35

LICENCE

The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.


35/100