SLIDE 1
  • 20 October 2015

2015 DH Estonia – Text Reuse Hackathon

TRACER - Preprocessing

Marco Büchler, Emily Franzini, Greta Franzini, Maria Moritz
eTRAP Research Group
Göttingen Centre for Digital Humanities
Institute of Computer Science
Georg August University Göttingen, Germany

SLIDE 2

Hacking – Installation & configuration guide for TRACER

1) Copy Tracer from /storage/tracer.tar.gz to your storage folder such as /storage/mbuechler
2) Change to your storage folder with cd command
3) Unzip archive: gunzip tracer.tar.gz
4) Untar archive: tar -xvf tracer.tar
5) Change to tracer folder: cd Tracer
6) Open the config file with vim conf/tracer_config.xml
7) Configure your input file:

SLIDE 3

Hacking - Starting TRACER

1) Start the tool with the command:

java -Xmx600m -Dde.gcdh.medusa.config.ClassConfig=conf/tracer_config.xml -jar tracer.jar

Explanation:

  • -Xmx600m: up to 600 MB of memory
  • -Dfile.encoding: sets the encoding of your input file (optional)
  • -Dde.gcdh.medusa.config.ClassConfig: the configuration file
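
For scripted runs, the same command line can be assembled programmatically. The following is a minimal Python sketch, assuming tracer.jar and conf/tracer_config.xml are reachable from the current folder as in the installation steps. The helper name build_tracer_command is hypothetical and not part of TRACER, and the UTF-8 value is only an example encoding.

```python
import subprocess


def build_tracer_command(config="conf/tracer_config.xml", heap_mb=600,
                         encoding=None, jar="tracer.jar"):
    """Assemble the java command line from the slide (hypothetical helper)."""
    cmd = ["java", f"-Xmx{heap_mb}m"]
    if encoding is not None:
        # Optional: tell the JVM which encoding the input file uses.
        cmd.append(f"-Dfile.encoding={encoding}")
    cmd.append(f"-Dde.gcdh.medusa.config.ClassConfig={config}")
    cmd += ["-jar", jar]
    return cmd


if __name__ == "__main__":
    command = build_tracer_command(encoding="UTF-8")
    print(" ".join(command))
    # Uncomment to actually launch TRACER from inside the Tracer folder:
    # subprocess.run(command, check=True)
```

Keeping the arguments in a list rather than a single string avoids shell-quoting problems when a path contains spaces.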
SLIDE 4

Overview

  • What is preprocessing?
  • Overview of preprocessing techniques
  • Hacking
  • Conclusion with some test questions
SLIDE 5

Reminder: Current approach

SLIDE 6

Pre-step: Segmentation - an example

SLIDE 7

Pre-step: Segmentation

SLIDE 8

Question

What do you associate with preprocessing?

SLIDE 9

Foundations for preprocessing – Zipfian Law

SLIDE 10

Implications of the Zipfian Law

  • Approx. 50% of all words (word types) occur only once
  • Approx. 16% of all words occur only twice
  • Approx. 8% of all words occur three times
  • ...
  • Approx. 90% of all words in a corpus occur 10 times or fewer
  • The top 300 – 700 most frequent words already cover about 50% of all tokens (depending on the language)
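
These proportions can be checked on your own corpus. The following is a minimal Python sketch, assuming the text is already tokenised into a list of strings. The function name frequency_profile and the toy sentence are illustrative only.

```python
from collections import Counter


def frequency_profile(tokens, top_k=300):
    """Summarise how token frequencies are distributed over word types."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("need at least one token")
    n_types = len(counts)
    n_tokens = sum(counts.values())
    # Frequency spectrum: how many types occur exactly once, twice, ...
    spectrum = Counter(counts.values())
    rare_types = sum(n for freq, n in spectrum.items() if freq <= 10)
    # Running tokens covered by the top_k most frequent types.
    top_tokens = sum(freq for _, freq in counts.most_common(top_k))
    return {
        "types": n_types,
        "tokens": n_tokens,
        "share_types_once": spectrum[1] / n_types,
        "share_types_up_to_10": rare_types / n_types,
        "top_k_token_coverage": top_tokens / n_tokens,
    }


if __name__ == "__main__":
    toy = "the cat saw the dog and the bird".split()
    print(frequency_profile(toy, top_k=1))
```

On a toy sentence the numbers are of course not Zipfian; the shares listed above only emerge on corpora of realistic size.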
SLIDE 11

Question

  • What does lemmatisation mean for this plot?
SLIDE 12

Preprocessing

SLIDE 13

Preprocessing: Directed Graph Normalisation

  • e.g. lemmatisation
SLIDE 14

Preprocessing: Undirected Graph Normalisation

  • e.g. synonyms, string similarity
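
In a directed relation every word form points to one target (its lemma), so normalisation is a plain lookup. In an undirected relation (A is a synonym of B, B is similar to A) no target is given, so a representative has to be chosen for each group of connected words. The following Python sketch shows one possible way to do that with a union-find structure. It is an illustration only and not necessarily how TRACER resolves such relations; the similarity threshold of 0.85 and the "alphabetically first" rule are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(vocabulary, threshold=0.85):
    """Undirected pairs of word types whose similarity ratio reaches the
    (arbitrary, illustrative) threshold."""
    return [(a, b) for a, b in combinations(sorted(set(vocabulary)), 2)
            if SequenceMatcher(None, a, b).ratio() >= threshold]


def build_classes(pairs):
    """Merge symmetric pairs into classes and map every member to one
    representative (here: the alphabetically first word of its class)."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller root as the representative.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
    return {word: find(word) for word in parent}


def normalise_undirected(tokens, pairs):
    """Replace every token by the representative of its class, if any."""
    representative = build_classes(pairs)
    return [representative.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    synonyms = [("begin", "start"), ("commence", "start")]
    print(normalise_undirected(["they", "commence", "to", "start"], synonyms))
    print(similar_pairs(["colour", "color", "cat"]))
```

Because classes are merged transitively, a chain of pairwise similar words ends up with one representative even when the two ends of the chain are not similar to each other; whether that is desirable depends on the dataset.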
SLIDE 15

Hacking

  • Tasks:

– Run on your texts ...

1) ... without preprocessing
2) ... 1) + lemmatisation
3) ... 2) + synonym replacement

SLIDE 16

Hacking

  • Questions:

– Compare the input file with the *.prep file for all preprocessing techniques. Which methods seem to work best for you? Which make no sense for the dataset?
– Compare all *.meta files, which contain some numbers. How many words have changed, and by which method?
– (optional and advanced) What is the number of word types for each preprocessing step? (It can be derived from the first column of *.prep.inv.)
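
For the optional question, the number of word types can be counted as the number of distinct values in the first column. The following is a minimal Python sketch. The layout of *.prep.inv is not described here, so the tab separator and the one-entry-per-line structure are assumptions, and the file name in the usage example is a placeholder.

```python
def count_word_types(lines, delimiter="\t"):
    """Count distinct values in the first column of delimiter-separated lines.

    Assumption: one entry per line, columns separated by tabs, and the first
    column identifies the word type.
    """
    types = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        types.add(line.split(delimiter, 1)[0])
    return len(types)


if __name__ == "__main__":
    import sys

    # Usage: python count_types.py path/to/your_file.prep.inv
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(count_word_types(handle))
```

Running this once per preprocessing configuration gives a simple series of type counts to compare against the numbers in the *.meta files.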

SLIDE 17

Preprocessing – 1) without preprocessing

  • Hint:

– The configuration file can be found in ${TRACER_HOME}/conf/tracer_config.xml
– All values show "false"

SLIDE 18

Preprocessing – 2) Removing diacritics

  • Hint:

– BoolRemoveDiachritics is switched on by value true
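
To see what this switch does to a text, diacritics removal can be approximated with Unicode normalisation. The following Python sketch is a generic illustration and not necessarily identical to TRACER's own implementation: it decomposes each character, drops the combining marks, and recomposes the rest.

```python
import unicodedata


def remove_diacritics(text):
    """Strip combining marks (accents, breathings, umlauts) from text."""
    # NFD splits a character such as "ö" into "o" plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["Göttingen", "Büchler", "ἄνθρωπος"]:
        print(word, "->", remove_diacritics(word))
```

Note that this merges words that differ only in their diacritics, which is the intended effect for some corpora and a loss of information for others.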

SLIDE 19

Preprocessing – 3) Lower case

  • Hint:

– boolMakeAllLowerCase is switched on by value true

SLIDE 20

Preprocessing – 4) Lemmatising text

  • Hint:

– boolLemmatisation is switched on by value true
– Lemmatisation can be configured by <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
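
Conceptually, lemmatisation with a baseform file is a lookup from word form to lemma. The following is a minimal Python sketch of that idea. The layout of the lemma file is not shown here, so the assumption of one tab-separated "word form, base form" entry per line is hypothetical, as are the function names.

```python
def load_baseforms(lines):
    """Build a word-form -> lemma mapping from tab-separated lines.

    Assumption: column 1 is the word form, column 2 the base form; lines
    with fewer than two columns are ignored. The first entry for a form wins.
    """
    baseforms = {}
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) >= 2 and parts[0] not in baseforms:
            baseforms[parts[0]] = parts[1]
    return baseforms


def lemmatise(tokens, baseforms):
    """Replace known word forms by their lemma and count the replacements."""
    result, changed = [], 0
    for token in tokens:
        lemma = baseforms.get(token, token)  # unknown words stay unchanged
        if lemma != token:
            changed += 1
        result.append(lemma)
    return result, changed


if __name__ == "__main__":
    forms = load_baseforms(["went\tgo\n", "goes\tgo\n"])
    print(lemmatise("she went home".split(), forms))
```

The count of changed tokens is the kind of figure to compare with the numbers reported in the *.meta files.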

SLIDE 21

Preprocessing – 5) Synonym handling

  • Hint:

– boolReplaceSynonyms is switched on by value true
– Synonyms can be configured by <property name="SYNONYMS_FILE_NAME" value="data/corpora/Bible/Bible.syns" />
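
When several runs with different switches are needed, the configuration can also be changed by script instead of by hand. The following Python sketch assumes that the boolean switches are stored as <property name="..." value="..." /> elements, like the file-name properties shown in the hints. The surrounding <config> element and the exact content of SAMPLE are invented for illustration and do not reproduce the real tracer_config.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature configuration, for illustration only.
SAMPLE = """<config>
  <property name="boolLemmatisation" value="false" />
  <property name="boolReplaceSynonyms" value="false" />
  <property name="BASEFORM_FILE_NAME" value="data/corpora/Bible/Bible.lemma" />
</config>"""


def set_properties(root, updates):
    """Set value="..." on every <property> whose name is in updates.

    Returns the set of requested names that were not found.
    """
    missing = set(updates)
    for prop in root.iter("property"):
        name = prop.get("name")
        if name in updates:
            prop.set("value", updates[name])
            missing.discard(name)
    return missing


root = ET.fromstring(SAMPLE)
not_found = set_properties(root, {"boolLemmatisation": "true"})

if __name__ == "__main__":
    print(ET.tostring(root, encoding="unicode"))
    print("not found:", sorted(not_found))
```

For a real file, ET.parse(path) followed by tree.write(path, encoding="utf-8") would be used instead of the sample string. ElementTree discards XML comments when parsing by default, so for a heavily commented configuration file, editing in vim as described in the installation steps may be the safer route.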

SLIDE 22

Preprocessing – 6) String similarity for normalising variants

  • Hint:

– boolReplaceStringSimilarWords is switched on by value true
– Thresholds:

SLIDE 23

Open issue: Fragmentary words

SLIDE 24

Open issue: Fragmentary words – dealing with gaps and Leiden Convention

Οὐιβίῳ Ἀλεξά̤[ν]δρῳ τῷ κρατίστῳ ἐπιστρατήγῳ παρὰ Ἀντ[ωνίου Δ]όμνά̤ου τοῦ καὶ Φιλαντι[νό]οά̤υά̤ Ἀντωνίοά̤[υ Ῥωμανο]ῦά̤ Τραιανείου τοῦ καά̤[ὶ Στρα]τά̤είου Ἀντινοέως. [οὐκ ἂν] εἰς τοῦτο προήχθά̤[η]νά̤, ἐά̤πι- τρόπων [μέγιστ]εά̤, μέ[τριος] καὶ ἀπρά̤γά̤μων ὢνά̤ ἄνθρά̤[ωπος,] εά̤ἰ μὴ [ὓβρι]ν τὴν μά̤[εγ]ίστηνά̤ ἐπά̤επόνθ[ειν ὑπὸ] Ὡρίωνο[ς κ]ωά̤μογρα[μ]μά̤ατέως Φ[ι]λαδελφείά̤[ας τῆ]ς Ἡρακλεά̤ίά̤δου μερίδοά̤[ς] τά̤οῦ Ἀρά̤σινοίτου. [οὗ χά]ριν μην[ύ]ω παρὰ τ[ὰ ἀ]πει- ρημένα ἑαά̤[υτὸ]νά̤ ἐνσείσανά̤τα εἰς τὴν κωμο- γραμματείανά̤ [μ]ήτε σιτολογήσαντα μήτε πρά̤[α]κτορεύσαντά̤α παντελῶς ἄπορον ὄν[τ]αά̤. δι᾽ ἣά̤ν αἰτίαν κά̤αὶ πρότερον οὐ διέλιπον ἐντυγ- χά̤νων καὶ νῦά̤ν ἀξιῶ, ἐάν σου τῇ τύχῃ δόξ[ῃ], ἀκοῦσά̤αί μου π[ρ]ὸς αὐτὸν πρὸς τὸ τυχεῖν με τά̤ῆά̤ςά̤ ἀπὸ σοῦ [μι]σοπονήρου ἐγδ[ι]κίας, ἵν᾽ ὦ ὑπὸά̤ [σ]οά̤ῦά̤ κατὰά̤ πά̤άντα βά̤εά̤βοηθ(ημένος). διευτύχει Ἀντώνιος Δόμνά̤οά̤ς ἐπιδέδωκα.
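
One naive way to make such a transcription searchable is to accept all editorial restorations and flatten the markup. The following Python sketch illustrates that option only; it is not TRACER functionality and it deliberately throws away the information about what is actually preserved on the papyrus. It assumes square brackets mark restored text, round brackets mark expanded abbreviations, a combining dot below (U+0323) marks uncertain letters, and a hyphen followed by whitespace marks a word split across lines.

```python
import re
import unicodedata


def flatten_leiden(text):
    """Accept restorations and expansions and remove the Leiden markup."""
    # Rejoin words that were split across line ends ("epi- tropon").
    text = re.sub(r"(\w)-\s+(\w)", r"\1\2", text)
    # Keep restored and expanded letters, drop the brackets themselves.
    text = re.sub(r"[\[\]()]", "", text)
    # Drop the combining dot below that marks uncertain readings.
    text = unicodedata.normalize("NFD", text).replace("\u0323", "")
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(flatten_leiden("Ph[i]ladelph(ia)"))
    print(flatten_leiden("[οὐκ ἂν] εἰς τοῦτο"))
```

Whether restored text should count as evidence for text reuse is an editorial decision, which is why the issue is listed as open.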

SLIDE 25

Gap between knowledge and experience

SLIDE 26

Test questions

  • Statement:

– "My lemmatisation tool <XYZ> is able to compute the baseforms of 80% of all tokens in a corpus."

Good or bad???

SLIDE 27

Test questions

  • Fact file:

– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors

  • Question: What is the difference for you?
SLIDE 28

Test questions

  • Fact file:

– Language variants
– Different writing styles
– (some) Dialects
– Diacritics
– OCR errors

  • Question: What do you think is the difference for the computer?
SLIDE 29

Importance of preprocessing

  • Cleaning and harmonising the data
  • When working with a new corpus (not only a new language, but also the same language in a different epoch or geographical region), preprocessing can take up to 70% of the overall time.

  • Preprocessing mantra: Garbage in, garbage out.
SLIDE 30

Thank you!

"Stealing from one is plagiarism, stealing from many is research" (Wilson Mizner, 1876-1933)

Visit us at http://etrap.gcdh.de