
SLIDE 1

20 October 2015

2015 DH Estonia – Text Reuse Hackathon

TRACER - Preprocessing

Marco Büchler, Emily Franzini, Greta Franzini, Maria Moritz
eTRAP Research Group
Göttingen Centre for Digital Humanities
Institute of Computer Science
Georg August University Göttingen, Germany

SLIDE 2

Hacking – Installation & configuration guide for TRACER

1) Copy Tracer from /storage/tracer.tar.gz to your storage folder, such as /storage/mbuechler
2) Change to your storage folder with the cd command
3) Unzip the archive: gunzip tracer.tar.gz
4) Untar the archive: tar -xvf tracer.tar
5) Change to the Tracer folder: cd Tracer
6) Open the config file with vim conf/tracer_config.xml
7) Configure your input file:

SLIDE 3

Hacking - Starting TRACER

1) Start the tool with the command:

java -Xmx600m -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar

Explanation:

– -Xmx600m allows up to 600 MB of memory
– -Dfile.encoding (optional) sets the encoding of your input file
– -Dde.gcdh.medusa.config.ClassConfig points to the configuration file

SLIDE 4

Overview

What is preprocessing?
Overview of preprocessing techniques
Hacking
Conclusion with some test questions

SLIDE 5

Reminder: Current approach

SLIDE 6

Pre-step: Segmentation - an example

SLIDE 7

Pre-step: Segmentation

SLIDE 8

Question

What do you associate with preprocessing?

SLIDE 9

Foundations for preprocessing – Zipfian Law

SLIDE 10

Implications of the Zipfian Law

Approx. 50% of all words occur only once
Approx. 16% of all words occur only twice
Approx. 8% of all words occur three times
...
Approx. 90% of all words in a corpus occur 10 times or less
The top 300–700 most frequent words already cover about 50% of all tokens (depending on the language)
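
These proportions can be checked on any tokenised text. The following Python sketch computes the share of word types that occur only once and the share of tokens covered by the most frequent types. The function name, the whitespace tokenisation and the toy sentence are assumptions for illustration; a corpus of realistic size is needed before the figures approach those listed above.

```python
from collections import Counter


def frequency_profile(tokens, top_k=300):
    """Return (share of types occurring once, share of tokens covered by the top_k types)."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_tokens = sum(counts.values())
    # Word types that occur exactly once in the text.
    hapax_share = sum(1 for c in counts.values() if c == 1) / n_types
    # Tokens accounted for by the top_k most frequent types.
    top_share = sum(c for _, c in counts.most_common(top_k)) / n_tokens
    return hapax_share, top_share


# Toy input; whitespace splitting is a simplification.
text = "the cat and the dog and the bird saw the old grey fox"
hapax_share, top_share = frequency_profile(text.split(), top_k=2)
print(f"types occurring once: {hapax_share:.0%}")
print(f"tokens covered by the top 2 types: {top_share:.0%}")
```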

SLIDE 11

Question

What does lemmatisation mean for this plot?

SLIDE 12

Preprocessing

SLIDE 13

Preprocessing: Directed Graph Normalisation

e.g. lemmatisation
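
A directed normalisation maps each word form onto a target, and the target does not map back. A minimal Python sketch of that idea follows; the lookup table and function name are hypothetical and do not reflect TRACER's lemma file format.

```python
# Hypothetical word form -> base form pairs; a real lemma list would be far larger.
LEMMA_OF = {"went": "go", "goes": "go", "going": "go", "houses": "house"}


def lemmatise(tokens, lemma_of=LEMMA_OF):
    """Replace each token by its base form; unknown tokens stay unchanged."""
    return [lemma_of.get(tok, tok) for tok in tokens]


print(lemmatise("she goes to the houses".split()))
```

Several word forms point to the same base form, which is why the number of distinct word types can only shrink or stay equal after this step.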

SLIDE 14

Preprocessing: Undirected Graph Normalisation

e.g. synonyms, string similarity
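
In an undirected relation such as synonymy, neither word is the natural target, so some rule has to pick a representative. The Python sketch below groups words linked by synonym pairs and maps every member of a group to its alphabetically smallest word. Both the grouping rule and the choice of representative are assumptions for illustration, not a description of how TRACER resolves synonyms.

```python
def canonical_forms(pairs):
    """Map every word to one representative of its group of linked words."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # shorten the path while walking up
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller root so the result is deterministic.
            small, large = sorted((root_a, root_b))
            parent[large] = small
    return {word: find(word) for word in list(parent)}


# Invented synonym pairs; the relation is symmetric, so pair order does not matter.
SYNONYM_PAIRS = [("lord", "master"), ("master", "ruler"), ("colour", "color")]
canon = canonical_forms(SYNONYM_PAIRS)
print(canon)
```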

SLIDE 15

Hacking

Tasks:

– Run on your texts ...

1) ... without preprocessing
2) ... 1) + lemmatisation
3) ... 2) + synonym replacement

SLIDE 16

Hacking

Questions:

– Compare the input file with the *.prep file for all preprocessing techniques. Which method seems to work best for you? Which makes no sense for the dataset?
– Compare all *.meta files, which contain some numbers. How many words have changed, and by which method?
– (Optional and advanced) What is the number of word types for each preprocessing technique? (It can be derived from the first column of *.prep.inv.)
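
For the optional task, counting the distinct values in the first column is enough. The Python sketch below assumes a tab-separated file with one entry per line; the separator, the function name and the sample content are assumptions, so check them against your own *.prep.inv file before relying on the result.

```python
import io


def count_word_types(lines, sep="\t"):
    """Count the distinct non-empty values in the first column."""
    types = set()
    for line in lines:
        first = line.rstrip("\n").split(sep, 1)[0]
        if first:
            types.add(first)
    return len(types)


# Invented sample standing in for a *.prep.inv file; the real layout may differ.
sample = io.StringIO("go\t1\ngo\t7\nhouse\t3\n")
print(count_word_types(sample))
```

To run it on a real file, pass an open file handle instead of the sample, for example the result of open(path, encoding="utf-8").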

SLIDE 17

Preprocessing – 1) without preprocessing

Hint:

– The configuration file can be found in ${TRACER_HOME}/conf/tracer_config.xml
– All values show "false"

SLIDE 18

Preprocessing – 2) Removing diacritics

Hint:

– BoolRemoveDiachritics is switched on by setting its value to true
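
What removing diacritics does to a text can be sketched in Python with the standard unicodedata module. This is a generic Unicode approach and an assumption about the effect, not TRACER's own implementation; the function name is hypothetical.

```python
import unicodedata


def remove_diacritics(text):
    """Strip combining marks (accents, breathings, umlauts) from every character."""
    # NFD splits a precomposed character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("Büchler"))
print(remove_diacritics("ἄνθρωπος"))
```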

SLIDE 19

Preprocessing – 3) Lower case

Hint:

– boolMakeAllLowerCase is switched on by setting its value to true

SLIDE 20

Preprocessing – 4) Lemmatising text

Hint:

– boolLemmatisation is switched on by setting its value to true
– Lemmatisation can be configured by <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />

SLIDE 21

Preprocessing – 5) Synonym handling

Hint:

– boolReplaceSynonyms is switched on by setting its value to true
– Synonyms can be configured by <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />

SLIDE 22

Preprocessing – 6) String similarity for normalising variants

Hint:

– boolReplaceStringSimilarWords is switched on by setting its value to true
– Thresholds:
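
How a similarity threshold separates spelling variants from unrelated words can be sketched in Python. The similarity measure (the ratio from the standard difflib module), the threshold of 0.8 and the function name are assumptions for illustration, not TRACER's actual measure or settings.

```python
from difflib import SequenceMatcher


def is_variant(a, b, threshold=0.8):
    """Treat two words as variants if their similarity ratio reaches the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(is_variant("colour", "color"))
print(is_variant("color", "cold"))
```

A lower threshold merges more variants but also more unrelated words, so the value has to be tuned per corpus.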

SLIDE 23

Open issue: Fragmentary words

SLIDE 24

Open issue: Fragmentary words – dealing with gaps and Leiden Convention

Οὐιβίῳ Ἀλεξά̤[ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνά̤ου τοῦ καὶ Φιλαντι[νό]οά̤υά̤ Ἀντωνίοά̤[υ Ῥωμανο]ῦά̤ Τραιανείου τοῦ καά̤[ὶ Στρα]τά̤είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθά̤[η]νά̤, ἐά̤πι- τρόπων [μέγιστ]εά̤, μέ[τριος] καὶ ἀπρά̤γά̤μων ὢνά̤ ἄνθρά̤[ωπος,] εά̤ἰ μὴ [ὓβρι]ν τὴν μά̤[εγ]ίστηνά̤ ἐπά̤επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωά̤μογρα[μ]μά̤ατέως Φ[ι]λαδελφείά̤[ας τῆ]ς Ἡρακλεά̤ίά̤δου μερίδοά̤[ς] τά̤οῦ Ἀρά̤σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πει- ρημένα ἑαά̤[υτὸ]νά̤ ἐνσείσανά̤τα εἰς τὴν κωμο- γραμματείανά̤ [μ]ήτε σιτολογήσαντα μήτε πρά̤[α]κτορεύσαντά̤α παντελῶς ἄπορον ὄν[τ]αά̤. δι᾽ ἣά̤ν αἰτίαν κά̤αὶ πρότερον οὐ διέλιπον ἐντυγ- χά̤νων καὶ νῦά̤ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσά̤αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τά̤ῆά̤ςά̤ ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν᾽ ὦ ὑπὸά̤ [σ]οά̤ῦά̤ κατὰά̤ πά̤άντα βά̤εά̤βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμνά̤οά̤ς ἐπιδέδωκα.
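
One simple way to turn such an edition into plain running text is to drop the editorial signs and keep everything the editor supplied. The Python sketch below does this for square brackets (restored text), round brackets (expanded abbreviations), underdots (uncertain letters) and hyphenated line breaks. It is an assumption-laden simplification: it treats restorations as if they were transmitted text, which is not appropriate for every task, and it covers only a few Leiden signs.

```python
import re
import unicodedata

UNDERDOT = "\u0323"  # combining dot below, marks an uncertain letter


def flatten_leiden(text):
    """Remove a few Leiden signs and keep the supplied letters."""
    decomposed = unicodedata.normalize("NFD", text)
    decomposed = decomposed.replace(UNDERDOT, "")      # uncertain letters become plain
    decomposed = re.sub(r"[\[\]()]", "", decomposed)   # keep restored and expanded text
    decomposed = re.sub(r"-\s+", "", decomposed)       # rejoin words split at line ends
    return unicodedata.normalize("NFC", decomposed)


print(flatten_leiden("[οὐκ ἂν] εἰς τοῦτο"))
print(flatten_leiden("ἐπι- τρόπων"))
```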

SLIDE 25

Gap between knowledge and experience

SLIDE 26

Test questions

Statement:

– "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus."

Good or bad?
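
One way to approach the question is to compare coverage of tokens with coverage of word types. The Python sketch below uses an invented ten-token text and a hypothetical tool that knows only the two most frequent words; the names and data are assumptions for illustration.

```python
from collections import Counter


def coverage(tokens, known):
    """Return (share of tokens covered, share of word types covered) by a word list."""
    counts = Counter(tokens)
    token_cov = sum(c for w, c in counts.items() if w in known) / sum(counts.values())
    type_cov = sum(1 for w in counts if w in known) / len(counts)
    return token_cov, type_cov


# Invented text: two very frequent words and two rare ones.
tokens = ["the"] * 6 + ["and"] * 2 + ["papyrus", "strategos"]
token_cov, type_cov = coverage(tokens, known={"the", "and"})
print(f"tokens covered: {token_cov:.0%}, word types covered: {type_cov:.0%}")
```

Because a few frequent words account for most tokens, a high token coverage can hide a much lower coverage of word types.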

SLIDE 27

Test questions

Fact file:

– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors

Question: What is the difference for you?

SLIDE 28

Test questions

Fact file:

– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors

Question: What do you think is the difference for the computer?

SLIDE 29

Importance of preprocessing

Cleaning and harmonising the data when working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region) can take up to 70% of the overall time.

Preprocessing mantra: Garbage in, garbage out.

SLIDE 30

Thank you!
