TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK


SLIDE 1

TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK

Marco Büchler, Emily Franzini and Greta Franzini

SLIDE 2

METHODOLOGY

Basic idea: Embed historical text reuse in Shannon’s Noisy Channel theorem.

SLIDE 3

MICROVIEW II

Source: Stefan Jänicke, eTRACES project, University of Leipzig.

SLIDE 4

NOISY CHANNEL MINING I

  • Hyphen: birth-day vs. birthday; back-bone vs. backbone; zareth-shahar vs. zarethshahar
  • Prefix (one word is a prefix of the other): ambush vs. ambushment; shimite vs. shimites
  • Suffix (one word is a suffix of the other): bearing vs. childbearing
  • Composition: sea-beast vs. sea-monster (synonym); sea-gull vs. sea-mew vs. sea-hawk (cohyponym); apple-tree vs. citron-tree (cohyponym)
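
The surface categories above can be told apart with simple string tests. The following is a minimal sketch, not the procedure used by the authors: the function name `classify_pair` and the rule order are assumptions made for illustration, and the synonym/cohyponym distinction inside "composition" is not attempted because it needs lexical knowledge.

```python
def classify_pair(a: str, b: str) -> str:
    """Assign a word pair to one surface category (hypothetical helper)."""
    if a == b:
        return "identical"
    # Hyphen: the words are equal once hyphens are removed.
    if a.replace("-", "") == b.replace("-", ""):
        return "hyphen"
    shorter, longer = sorted((a, b), key=len)
    # Prefix: the shorter word starts the longer one.
    if longer.startswith(shorter):
        return "prefix"
    # Suffix: the shorter word ends the longer one.
    if longer.endswith(shorter):
        return "suffix"
    # Composition: two hyphenated compounds that share a component.
    if "-" in a and "-" in b and set(a.split("-")) & set(b.split("-")):
        return "composition"
    return "other"


pairs = [
    ("birth-day", "birthday"),
    ("ambush", "ambushment"),
    ("bearing", "childbearing"),
    ("sea-beast", "sea-monster"),
]
labels = [classify_pair(a, b) for a, b in pairs]
```

Pairs such as punishment vs. torment fall into "other" here, since nothing in their spelling relates them.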

SLIDE 5

NOISY CHANNEL MINING II

  • Orthographically similar words: anathothite vs. anethothite vs. anetothite vs. annethothite vs. antothite
  • Some 4000 word pairs containing noise are extracted but not classified. But also: punishment vs. torment
  • Any kind of negation (e.g. book Genesis, chapter 34, verse 19): not defer (ASV, KJV, Webster), without loss of time (Basic), not delay (Darby, YLT), and not wait (WEB)
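
The orthographic variants listed above differ by only one or two characters. One common way to quantify this is the Levenshtein edit distance; the sketch below is an illustration under that assumption and is not claimed to be how the pairs on the slide were mined. The names `levenshtein` and `similar_pairs` are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


def similar_pairs(words, max_distance=1):
    """Return all word pairs whose edit distance is at most max_distance."""
    return [(a, b) for a, b in combinations(words, 2)
            if levenshtein(a, b) <= max_distance]


variants = ["anathothite", "anethothite", "anetothite", "annethothite", "antothite"]
close = similar_pairs(variants, max_distance=1)
```

A pair such as punishment vs. torment has a large edit distance, so a purely orthographic measure of this kind would not relate the two words.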

SLIDE 6

METHODOLOGY

Basic idea: Embed historical text reuse in Shannon’s Noisy Channel theorem.

SLIDE 7

METHODOLOGY: NOISY CHANNEL EVALUATION I

Hint: The results are ALWAYS compared between the natural texts and the randomised texts as a whole.

SLIDE 8

METHODOLOGY: NOISY CHANNEL EVALUATION II

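Signal-to-noise ratio, adapted from signal and satellite techniques:

\[ \mathrm{SNR} = \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \]

Signal-to-noise ratio scaled, unit is dB:

\[ \mathrm{SNR}_{\mathrm{dB}} = 10 \cdot \log_{10} \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \]

  • Mining Ability (in dB): The Mining Ability describes the power of a method to make distinctions between natural-language structures/patterns and random noise given a model with the same parameters.

\[ L_{\mathrm{Quant}}(\Theta) = 10 \cdot \log_{10} \frac{|E^{D}_{s,\varphi_\Theta}|}{\max(1, |E^{D_m}_{s,\varphi_\Theta}|)} \,\mathrm{dB} \]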
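
As a worked illustration of the two decibel formulas above, the sketch below assumes that the numerator of the Mining Ability is the number of results found on the natural text and the denominator the number found on the randomised text, both produced with the same parameters. That reading, and the names `snr_db` and `mining_ability_db`, are assumptions and are not spelled out on the slide.

```python
import math


def snr_db(p_signal: float, p_noise: float) -> float:
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(p_signal / p_noise)


def mining_ability_db(n_natural: int, n_randomised: int) -> float:
    """Mining Ability in dB (assumed reading: result counts on natural vs. randomised text).

    max(1, ...) keeps the ratio defined when the randomised text yields no results.
    n_natural must be positive, otherwise the logarithm is undefined.
    """
    return 10.0 * math.log10(n_natural / max(1, n_randomised))


# 5000 results on natural text vs. 50 on shuffled text: a ratio of 100, i.e. 20 dB.
example = mining_ability_db(5000, 50)
```

On this reading, every additional 10 dB means the method finds ten times more structure in natural text than in noise.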

SLIDE 9

METHODOLOGY: NOISY CHANNEL EVALUATION III

Motivation for randomisation by Word Shuffling:

  1. Syntax and distributional semantics are randomised and "destroyed".
  2. Distributions of words and sentence lengths remain unchanged; changes depend ONLY on the destruction in 1) and are not induced by changes of distributions.
  3. Easy measurement of the "randomness" of the randomising method with the entropy test: \Delta H_n = H_{\max} - H_n. Choosing n ∈ [180, 183] ensures an accuracy of \Delta H_n \le 10^{-3} bit for the entropy test.
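
The first two properties above can be made concrete with a small sketch: all tokens of the corpus are pooled, shuffled, and poured back into sentences of the original lengths. The function name `shuffle_words`, the fixed seed and the toy corpus are assumptions for illustration; the entropy test itself is not reproduced here because the slide does not define H_n.

```python
import random
from collections import Counter


def shuffle_words(sentences, seed=0):
    """Shuffle tokens across the whole corpus, keeping every sentence length."""
    rng = random.Random(seed)
    tokens = [tok for sent in sentences for tok in sent]
    rng.shuffle(tokens)  # destroys word order, and with it syntax and co-occurrence
    shuffled, start = [], 0
    for sent in sentences:
        shuffled.append(tokens[start:start + len(sent)])
        start += len(sent)
    return shuffled


# Toy corpus of tokenised sentences (illustrative only).
corpus = [
    ["in", "the", "beginning"],
    ["and", "the", "earth", "was", "without", "form"],
    ["and", "there", "was", "light"],
]
randomised = shuffle_words(corpus, seed=42)

# Word distribution and sentence lengths are unchanged by construction.
same_words = Counter(t for s in corpus for t in s) == Counter(t for s in randomised for t in s)
same_lengths = [len(s) for s in corpus] == [len(s) for s in randomised]
```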

SLIDE 10

METHODOLOGY: TEXT RE-USE COMPRESSION

\[ C_\Theta = \frac{\sum_{j=1}^{m} \sum_{i=1}^{n} \theta_\Theta(S_i, S_j)}{n \cdot m} \]
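
Read as code, the formula above sums a pairwise function over all n · m segment pairs and divides by the number of pairs. The sketch below assumes that θ returns 1 when a pair is linked as text reuse and 0 otherwise, and that the S_i and S_j come from two segment collections of sizes n and m; neither assumption is stated on the slide, and `text_reuse_compression` is a hypothetical name.

```python
def text_reuse_compression(segments_a, segments_b, theta):
    """Share of all n * m segment pairs for which theta reports a link."""
    n, m = len(segments_a), len(segments_b)
    if n == 0 or m == 0:
        raise ValueError("both segment collections must be non-empty")
    total = sum(theta(s_i, s_j) for s_j in segments_b for s_i in segments_a)
    return total / (n * m)


def exact_match(s_i, s_j):
    """Toy stand-in for theta: 1 for identical segments, else 0."""
    return 1 if s_i == s_j else 0


# 2 linked pairs out of 2 * 4 = 8 candidate pairs.
c = text_reuse_compression(["x", "y"], ["x", "z", "y", "w"], exact_match)
```

Under these assumptions a small value means that only a small fraction of candidate pairs has to be inspected.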

SLIDE 11

RANDOMNESS & STRUCTURE

Question: Why is the result of a randomised Digital Library typically not empty?

SLIDE 12

RANDOMNESS & STRUCTURE: IMPACTS

Corpus size in sentences (average sentence length is ca. 18 words). LGL is the threshold for the Log-Likelihood-Ratio.

SLIDE 13

TEXT REUSE IN ENGLISH BIBLE VERSIONS

Why does the use of the Bible make sense?

  • The Bible is easy to evaluate.
  • There are different editions written for different purposes.

SLIDE 14

TEXT REUSE IN ENGLISH BIBLE VERSIONS

  1. American Standard Version (ASV): 20th century, focus is USA;
  2. Bible in Basic English (BBE): verses are written in a simplified language;
  3. Darby Version (DBY): created in the 19th century from Hebrew and Greek texts, multiple authors owing to the death of Darby;
  4. King James Version (KJV): one of the oldest English Bible versions (17th century, first published 1611);
  5. Webster's Revision (WBS): revision of the KJV in the 19th century;
  6. World English Bible (WEB): 21st century, global focus;
  7. Young's Literal Translation (YLT): verses in Hebrew syntax.

SLIDE 15

TEXT REUSE ON ENGLISH BIBLE VERSIONS: EVALUATION

Example: book Genesis, chapter 1, verse 1.

Reduced Bibles: all seven reduced Bible versions contain "only" the 28632 verses present in all seven editions.
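
The reduction described above amounts to intersecting the verse identifiers of all editions and keeping only the shared ones. The following sketch assumes each edition is a mapping from a verse identifier to its text; the function name `reduce_bibles`, the identifier format and the toy data are hypothetical.

```python
def reduce_bibles(bibles):
    """Keep only the verses whose identifier occurs in every edition.

    Expects at least one edition; each edition maps verse id -> verse text.
    """
    shared = set.intersection(*(set(verses) for verses in bibles.values()))
    return {
        name: {verse_id: verses[verse_id] for verse_id in sorted(shared)}
        for name, verses in bibles.items()
    }


# Toy editions with placeholder texts (not real verse wordings).
toy = {
    "A": {"GEN 1:1": "text a1", "GEN 1:2": "text a2", "GEN 1:3": "text a3"},
    "B": {"GEN 1:1": "text b1", "GEN 1:3": "text b3"},
    "C": {"GEN 1:1": "text c1", "GEN 1:2": "text c2", "GEN 1:3": "text c3"},
}
reduced = reduce_bibles(toy)
```

After the reduction every edition has the same verse identifiers, so a verse in one edition can be evaluated against its counterpart in every other edition.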

SLIDE 16

TEXT REUSE IN ENGLISH BIBLE VERSIONS: SETUP

  • Segmentation: disjoint and verse-wise segmentation;
  • Selection: max pruning with a Feature Density of 0.8;
  • Linking: Inter-Digital Library Linking (different Bible editions);
  • Scoring: Broder's Resemblance with a threshold of 0.6;
  • Post-processing: not used.
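
Broder's resemblance of two segments is the size of the intersection of their feature sets divided by the size of the union. The sketch below illustrates the scoring step with the 0.6 threshold from the setup above. It assumes lower-cased word types as features, which is a simplification chosen for illustration; the selection step (max pruning) is not modelled, and the function names are hypothetical.

```python
import re


def features(text: str) -> set:
    """Lower-cased word types of a segment (assumed feature definition)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def resemblance(a: str, b: str) -> float:
    """Broder's resemblance: |F(a) & F(b)| / |F(a) | F(b)|."""
    fa, fb = features(a), features(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


def is_reuse(a: str, b: str, threshold: float = 0.6) -> bool:
    """Accept a pair as text reuse when the resemblance reaches the threshold."""
    return resemblance(a, b) >= threshold


# Genesis 1:1 in two of the editions, used as illustrative input.
kjv = "In the beginning God created the heaven and the earth."
asv = "In the beginning God created the heavens and the earth."
score = resemblance(kjv, asv)  # 7 shared word types out of 9 in the union
```

With these features the two verses share 7 of 9 word types, so the pair passes the 0.6 threshold even though "heaven" and "heavens" count as different features.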

SLIDE 17

TEXT REUSE IN ENGLISH BIBLE VERSIONS: RESULTS RECALL

SLIDE 18

TEXT REUSE IN ENGLISH BIBLE VERSIONS: RECALL VS. TEXT REUSE COMPRESSION

[Figure labels: With, Without]

SLIDE 19

TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION I

SLIDE 20

TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION II

SLIDE 21

TEXT REUSE IN ENGLISH BIBLE VERSIONS: F-MEASURE VS. NOISY CHANNEL EVAL. I

F-Measure: WBS, ASV, DBY, WEB, YLT, BBE

NCE: WBS, ASV, DBY, WEB, BBE, YLT

SLIDE 22

MICROVIEW I

Source: Stefan Jänicke, eTRACES project, University of Leipzig.

SLIDE 23

DEPENDENCY OF RECALL AND TR COMPRESSION

SLIDE 24

FINITO!

SLIDE 25

CONTACT

Team: Marco Büchler, Greta Franzini and Emily Franzini.

Visit us: http://www.etrap.eu

Contact: contact@etrap.eu

SLIDE 26

LICENCE

The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
