- 20 October 2015
TRACER - Preprocessing
Marco Büchler, Emily Franzini, Greta Franzini, Maria Moritz
eTRAP Research Group
Göttingen Centre for Digital Humanities
Institute of Computer Science
Georg August University Göttingen, Germany
2015 DH Estonia
2015 DH Estonia – Text Reuse Hackathon
Hacking – Installation & configuration guide for TRACER
1) Copy Tracer from /storage/tracer.tar.gz to your storage folder such as /storage/mbuechler
2) Change to your storage folder with cd command
3) Unzip archive: gunzip tracer.tar.gz
4) Untar archive: tar -xvf tracer.tar
5) Change to tracer folder: cd Tracer
6) Open the config file with vim conf/tracer_config.xml
7) Configure your input file:
Hacking - Starting TRACER
1) Start the tool with the command:
java -Xmx600m -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar
Explanation:
- -Xmx600m: up to 600 MB memory
- -Dfile.encoding: sets the encoding of your input file (optional)
- -Dde.gcdh.medusa.config.ClassConfig: configuration file
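When TRACER is run several times with different configuration files, the start command can be assembled programmatically. The following is a minimal Python sketch, assuming `java` is on the PATH and the script is started from inside the Tracer folder. The helper name `build_tracer_command` and its parameters are illustrative and not part of TRACER.

```python
import subprocess


def build_tracer_command(config="conf/tracer_config.xml", max_heap="600m", encoding=None):
    """Assemble the java call from the slide as an argument list."""
    cmd = ["java", f"-Xmx{max_heap}"]
    if encoding:
        # Optional: tell the JVM which encoding the input file uses.
        cmd.append(f"-Dfile.encoding={encoding}")
    # JVM options must come before -jar, otherwise they are passed to the program.
    cmd.append(f"-Dde.gcdh.medusa.config.ClassConfig={config}")
    cmd += ["-jar", "tracer.jar"]
    return cmd


if __name__ == "__main__":
    command = build_tracer_command(encoding="UTF-8")
    print(" ".join(command))
    # Uncomment to actually start TRACER from inside the Tracer folder:
    # subprocess.run(command, check=True)
```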
Overview
- What is preprocessing?
- Overview of preprocessing techniques
- Hacking
- Conclusion with some test questions
Reminder: Current approach
Pre-step: Segmentation - an example
Pre-step: Segmentation
Question
What do you associate with preprocessing?
Foundations for preprocessing – Zipfian Law
Implications of the Zipfian Law
- Approx. 50% of all words occur only once
- Approx. 16% of all words occur only twice
- Approx. 8% of all words occur three times
- ...
- Approx. 90% of all words in a corpus occur 10 times or less
- The top 300–700 most frequent words already cover about 50% of all tokens (depending on the language)
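These proportions can be checked on any tokenised corpus. The following Python sketch, with the illustrative function name `frequency_profile`, computes the share of word types that occur only once and the share of all tokens covered by the `top_k` most frequent types. The example sentence is far too short to show Zipfian behaviour and only demonstrates the call.

```python
from collections import Counter


def frequency_profile(tokens, top_k=300):
    """Return type/token counts, the hapax share among types and the top_k token coverage."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_tokens = sum(counts.values())
    # Word types that occur exactly once (hapax legomena).
    hapax = sum(1 for c in counts.values() if c == 1)
    # Tokens accounted for by the top_k most frequent word types.
    top_tokens = sum(c for _, c in counts.most_common(top_k))
    return {
        "types": n_types,
        "tokens": n_tokens,
        "hapax_share": hapax / n_types,
        "top_k_coverage": top_tokens / n_tokens,
    }


if __name__ == "__main__":
    text = "in the beginning god created the heaven and the earth and the earth was without form"
    print(frequency_profile(text.split(), top_k=2))
```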
Question
- What does lemmatisation mean for this plot?
Preprocessing
Preprocessing: Directed Graph Normalisation
- e.g. lemmatisation
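In a directed normalisation every word form points to exactly one target, here its base form, and never the other way round. The following Python sketch uses a toy lookup table invented for illustration. TRACER itself reads its lemma data from a file whose layout is not described on these slides.

```python
def lemmatise(tokens, lemma_of):
    """Replace each token by its base form; unknown tokens stay unchanged."""
    return [lemma_of.get(token, token) for token in tokens]


if __name__ == "__main__":
    # Toy lookup table (word form -> base form), invented for illustration only.
    lemma_of = {"went": "go", "goes": "go", "houses": "house"}
    print(lemmatise(["she", "goes", "to", "houses"], lemma_of))
```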
Preprocessing: Undirected Graph Normalisation
- e.g. synonyms, string similarity
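An undirected relation such as synonymy has no natural target word, so one representative has to be chosen for each group of connected words. The sketch below shows one possible choice, the alphabetically smallest member of each group, using a small union-find structure. This strategy is an assumption made for illustration and not necessarily what TRACER does.

```python
def build_representatives(pairs):
    """Map every word in an undirected relation to one representative per connected group."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller root so the result is deterministic.
            small, large = sorted((root_a, root_b))
            parent[large] = small

    return {word: find(word) for word in parent}


if __name__ == "__main__":
    # Toy synonym pairs, invented for illustration only.
    representative = build_representatives([("glad", "happy"), ("happy", "joyful")])
    tokens = ["a", "joyful", "day"]
    print([representative.get(t, t) for t in tokens])
```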
Hacking
- Tasks:
– Run on your texts ...
1) ... without preprocessing
2) ... 1) + lemmatisation
3) ... 2) + synonym replacement
Hacking
- Questions:
– Compare the input file with the *.prep file for all preprocessing techniques. Which methods seem to work best for you? Which make no sense for the dataset?
– Compare all *.meta files, which contain some numbers. How many words have changed, and by which method?
– (Optional and advanced) What is the number of word types for each preprocessing technique? (It can be derived from the first column of *.prep.inv.)
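The two counts asked for here, changed words and word types, can be sketched independently of TRACER's output files. The Python functions below work on already tokenised, aligned lists, because the layout of the *.prep, *.meta and *.prep.inv files is not described on these slides. The function names are illustrative.

```python
def count_changes(original_tokens, processed_tokens):
    """Count positions where preprocessing altered a token (lists must be aligned)."""
    if len(original_tokens) != len(processed_tokens):
        raise ValueError("token lists must have the same length")
    return sum(
        1 for before, after in zip(original_tokens, processed_tokens) if before != after
    )


def count_types(tokens):
    """Number of distinct word forms in a token list."""
    return len(set(tokens))


if __name__ == "__main__":
    original = ["The", "houses", "and", "the", "house"]
    processed = ["the", "house", "and", "the", "house"]
    print("changed tokens:", count_changes(original, processed))
    print("types before:", count_types(original))
    print("types after:", count_types(processed))
```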
Preprocessing – 1) without preprocessing
- Hint:
– Configuration file can be found in ${TRACER_HOME}/conf/tracer_config.xml
– All values show "false"
Preprocessing – 2) Removing diacritics
- Hint:
– BoolRemoveDiachritics is switched on by setting its value to true
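What removing diacritics does to a text can be illustrated with Unicode normalisation. The sketch below decomposes each character (NFD), drops the combining marks and recomposes the rest. It shows the general technique only, and TRACER's own implementation may treat individual characters differently.

```python
import unicodedata


def remove_diacritics(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for accents, breathings and similar marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["Göttingen", "Büchler", "λόγος"]:
        print(word, "->", remove_diacritics(word))
```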
Preprocessing – 3) Lower case
- Hint:
– boolMakeAllLowerCase is switched on by setting its value to true
Preprocessing – 4) Lemmatising text
- Hint:
– boolLemmatisation is switched on by setting its value to true
– Lemmatisation can be configured by <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
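Switches like this one can also be set by a script instead of an editor. The Python sketch below changes a property value with the standard `xml.etree.ElementTree` module. Only the `<property name="..." value="..." />` shape is taken from the slide. The `<config>` wrapper element and the assumption that the boolean switches use the same property shape are hypothetical, so check them against your own conf/tracer_config.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the <property .../> shape is taken from the slide.
FRAGMENT = """
<config>
  <property name="boolLemmatisation" value="false" />
  <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
</config>
"""


def set_property(root, name, value):
    """Set the value attribute of the first <property> with the given name."""
    for prop in root.iter("property"):
        if prop.get("name") == name:
            prop.set("value", value)
            return True
    return False  # no property with that name was found


if __name__ == "__main__":
    root = ET.fromstring(FRAGMENT)
    set_property(root, "boolLemmatisation", "true")
    print(ET.tostring(root, encoding="unicode"))
```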
Preprocessing – 5) Synonym handling
- Hint:
– boolReplaceSynonyms is switched on by setting its value to true
– Synonyms can be configured by <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />
Preprocessing – 6) String similarity for normalising variants
- Hint:
– boolReplaceStringSimilarWords is switched on by setting its value to true
– Thresholds:
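The role of a similarity threshold can be sketched with Python's standard `difflib` module: a token is replaced by its closest match in a reference vocabulary only if the similarity reaches the threshold. The measure (`SequenceMatcher.ratio`), the threshold of 0.8 and the function names are assumptions chosen for illustration and are not TRACER's actual measure or values.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def normalise_variants(tokens, vocabulary, threshold=0.8):
    """Replace each token by the most similar vocabulary entry at or above the threshold."""
    result = []
    for token in tokens:
        best = max(vocabulary, key=lambda word: similarity(token, word), default=None)
        if best is not None and similarity(token, best) >= threshold:
            result.append(best)
        else:
            result.append(token)  # nothing similar enough: keep the token as it is
    return result


if __name__ == "__main__":
    print(normalise_variants(["colour", "honour", "xyz"], ["color", "honor"]))
```

A higher threshold replaces fewer words but is less likely to merge unrelated ones, which is why the threshold needs tuning for each corpus.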
Open issue: Fragmentary words
Open issue: Fragmentary words – dealing with gaps and Leiden Convention
Οὐιβίῳ Ἀλεξά̤[ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνά̤ου τοῦ καὶ Φιλαντι[νό]οά̤υά̤ Ἀντωνίοά̤[υ Ῥωμανο]ῦά̤ Τραιανείου τοῦ καά̤[ὶ Στρα]τά̤είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθά̤[η]νά̤, ἐά̤πι- τρόπων [μέγιστ]εά̤, μέ[τριος] καὶ ἀπρά̤γά̤μων ὢνά̤ ἄνθρά̤[ωπος,] εά̤ἰ μὴ [ὓβρι]ν τὴν μά̤[εγ]ίστηνά̤ ἐπά̤επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωά̤μογρα[μ]μά̤ατέως Φ[ι]λαδελφείά̤[ας τῆ]ς Ἡρακλεά̤ίά̤δου μερίδοά̤[ς] τά̤οῦ Ἀρά̤σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πει- ρημένα ἑαά̤[υτὸ]νά̤ ἐνσείσανά̤τα εἰς τὴν κωμο- γραμματείανά̤ [μ]ήτε σιτολογήσαντα μήτε πρά̤[α]κτορεύσαντά̤α παντελῶς ἄπορον ὄν[τ]αά̤. δι᾽ ἣά̤ν αἰτίαν κά̤αὶ πρότερον οὐ διέλιπον ἐντυγ- χά̤νων καὶ νῦά̤ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσά̤αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τά̤ῆά̤ςά̤ ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν᾽ ὦ ὑπὸά̤ [σ]οά̤ῦά̤ κατὰά̤ πά̤άντα βά̤εά̤βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμνά̤οά̤ς ἐπιδέδωκα.
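To make the problem concrete, the Python sketch below flattens such a transcription in the most naive way: it rejoins words split by a hyphen at a line break and drops the bracket characters while keeping their content. This is not a solution to the open issue, because it treats every editorial restoration as certain and throws away the information about what is actually preserved. The function name is illustrative.

```python
import re


def strip_leiden_markup(text):
    """Naively flatten a Leiden-style transcription into plain running text."""
    # Rejoin words split across lines, e.g. "ἐπι- τρόπων" -> "ἐπιτρόπων".
    text = re.sub(r"-\s+", "", text)
    # Drop square brackets (restored text) and parentheses (expanded
    # abbreviations), keeping whatever they enclose.
    text = re.sub(r"[\[\]()]", "", text)
    return text


if __name__ == "__main__":
    print(strip_leiden_markup("κατὰ πάντα βεβοηθ(ημένος)"))
```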
Gap between knowledge and experience
Test questions
- Statement:
– "My lemmatisation tool <XYZ> is able to compute the base forms of 80% of all tokens in a corpus."
Good or bad?
Test questions
- Fact file:
– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors
- Question: What is the difference for you?
Test questions
- Fact file:
– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors
- Question: What do you think is the difference for the computer?
Importance of preprocessing
- Cleaning and harmonising the data
- When working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region), this can take up to 70% of the overall time.
- Preprocessing mantra: Garbage in, garbage out.