TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK
Marco Büchler, Emily Franzini and Greta Franzini
METHODOLOGY
Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.
2/26
Source: Stefan Jänicke, eTRACES project, University of Leipzig.
3/26
birth-day vs. birthday
back-bone vs. backbone
zareth-shahar vs. zarethshahar
ambush vs. ambushment
shimite vs. shimites
bearing vs. childbearing
sea-beast vs. sea-monster (synonym)
sea-gull vs. sea-mew vs. sea-hawk (cohyponym)
apple-tree vs. citron-tree (cohyponym)
4/26
anathothite vs. anethothite vs. anetothite vs. annethothite vs. antothite
not defer (ASV, KJV, Webster), without loss of time (Basic), not delay (Darby, YLT), and not wait (WEB)
5/26
6/26
Hint: The results are ALWAYS compared between the natural texts and the randomised texts as a whole.
7/26
Signal-to-Noise Ratio, adapted from signal and satellite engineering:
    SNR = P_signal / P_noise
Signal-to-Noise Ratio scaled, unit is dB:
    SNR_dB = 10 · log10(P_signal / P_noise)
Adapted to make distinctions between natural-language structures/patterns and random noise given a model with the same parameters:
    L_Quant(Θ) = 10 · log10( |E_{D_s, φ_Θ}| / max(1, |E_{D_s^m, φ_Θ}|) ) dB
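The dB scaling can be sketched in Python. This is a minimal illustration, not TRACER's implementation; the function names and the reading of the L_Quant operands as result-set sizes on the natural text vs. the randomised text are assumptions for illustration.

```python
import math

def snr_db(p_signal, p_noise):
    """Signal-to-Noise Ratio scaled to decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(p_signal / p_noise)

def l_quant_db(n_natural, n_randomised):
    """L_Quant-style ratio: compares result-set sizes on the natural text
    and on the randomised text; max(1, .) guards against an empty
    randomised result set (log of zero)."""
    return 10 * math.log10(n_natural / max(1, n_randomised))

print(snr_db(1000, 10))     # 20.0 dB
print(l_quant_db(5000, 0))  # still defined when the randomised result is empty
```

The `max(1, ·)` guard is what keeps the measure defined even when the randomised Digital Library yields no results at all.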
8/26
Motivation for randomisation by word shuffling: 1) the word order, and with it any reuse structure, is destroyed, while 2) the word frequency distribution is preserved. Changes in the results therefore depend solely on the destruction of 1) and are not induced by changes of distributions.
This is verified with the entropy test: ΔH_n = H_max − H_n. Choosing n ∈ [180, 183] ensures an accuracy of ΔH_n ≤ 10^-3 bit for the entropy test.
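Word shuffling itself is straightforward; a minimal sketch (function and variable names are hypothetical) showing that the shuffled text keeps the exact unigram distribution while the word order is destroyed:

```python
import random
from collections import Counter

def shuffle_words(text, seed=42):
    """Randomise a text by word shuffling: word order (and hence any
    reuse structure) is destroyed, while the word frequency
    distribution is preserved exactly."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

text = "in the beginning god created the heaven and the earth"
shuffled = shuffle_words(text)

# Identical unigram distribution, different order.
assert Counter(text.split()) == Counter(shuffled.split())
print(shuffled)
```

Because the unigram counts are untouched, any drop in detected reuse on the shuffled corpus can be attributed to the lost word order alone.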
9/26
C_Θ = ( Σ_{j=1}^{m} Σ_{i=1}^{n} θ_Θ(S_i, S_j) ) / (n · m)
10/26
Question: Why is the result of a randomised Digital Library typically not empty?
11/26
Corpus size in sentences (average sentence length is ca. 18 words). LGL is the threshold for the Log-Likelihood-Ratio.
12/26
Why does the use of the Bible make sense?
13/26
- language;
- Greek texts, multiple authors through death of Darby;
- (16th Cent.);
14/26
Example: book Genesis, chapter 1, verse 1. Reduced Bibles: all seven reduced Bible versions contain "only" the 28,632 verses present in all seven editions.
15/26
Segmentation: disjoint, verse-wise segmentation.
Selection: max. pruning with a Feature Density of 0.8.
Linking: Inter-Digital-Library Linking (different Bible editions).
Scoring: Broder's Resemblance with a threshold of 0.6.
Post-processing: not used.
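The scoring step can be sketched with Broder's Resemblance, i.e. the Jaccard coefficient over shingle sets of the two verses. The shingle width (word bigrams) and the helper names are assumptions for illustration, not TRACER's configuration:

```python
def shingles(text, w=2):
    """Word w-shingles (contiguous word n-grams) of a verse."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=2):
    """Broder's resemblance: |S(A) & S(B)| / |S(A) | S(B)| over shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

v1 = "In the beginning God created the heaven and the earth"
v2 = "In the beginning God created the heavens and the earth"
score = resemblance(v1, v2)
print(score, score >= 0.6)  # link between the verses is kept only above 0.6
```

With the 0.6 threshold above, two verses from different editions are linked only when a clear majority of their shingles coincide.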
16/26
17/26
18/26
19/26
20/26
F-Measure: WBS, ASV, DBY, WEB, YLT, BBE
NCE: WBS, ASV, DBY, WEB, BBE, YLT
21/26
Source: Stefan Jänicke, eTRACES project, University of Leipzig.
22/26
23/26
24/26
Team: Marco Büchler, Greta Franzini and Emily Franzini.
Visit us: http://www.etrap.eu
Contact: contact@etrap.eu
25/26
The theme on which this presentation is based is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
26/26