

  1. TRACER TUTORIAL: TEXT REUSE DETECTION RECENT WORK Marco Büchler, Emily Franzini and Greta Franzini

  2. METHODOLOGY Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.

  3. MICROVIEW II Source: Stefan Jänicke, eTRACES project, University of Leipzig.

  4. NOISY CHANNEL MINING I
     • Suffix: ambush vs. ambushment; shimite vs. shimites
     • Prefix: bearing vs. childbearing
     • Hyphen: birth-day vs. birthday; back-bone vs. backbone; zareth-shahar vs. zarethshahar
     • Composition: sea-beast vs. sea-monster (synonym); sea-gull vs. sea-mew vs. sea-hawk (cohyponym); apple-tree vs. citron-tree (cohyponym)

  5. NOISY CHANNEL MINING II
     • Orthographically similar words: anathothite vs. anethothite vs. anetothite vs. annethothite vs. antothite
     • Some 4,000 word pairs containing noise are extracted but not classified. But also: punishment vs. torment
     • Any kind of negation (e.g. book Genesis, chapter 34, verse 19): not defer (ASV, KJV, Webster), without loss of time (Basic), not delay (Darby, YLT), and not wait (WEB)
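Orthographic variants like the spellings of "anathothite" above can be grouped automatically by string edit distance. A minimal sketch of that idea (this is an illustration, not TRACER's actual implementation):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two spellings, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Spelling variants of the same Biblical name, from the slide above.
variants = ["anathothite", "anethothite", "anetothite", "annethothite", "antothite"]
for v in variants[1:]:
    print(v, edit_distance(variants[0], v))
```

Pairs with a small distance relative to their length are candidate orthographic variants; pairs like "punishment" vs. "torment" would not be caught this way, which is why such semantic noise needs separate treatment.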

  6. METHODOLOGY Basic idea: Embed historical text reuse in Shannon's Noisy Channel theorem.

  7. METHODOLOGY: NOISY CHANNEL EVALUATION I Hint: The results are ALWAYS compared between the natural texts and the randomised texts as a whole.

  8. METHODOLOGY: NOISY CHANNEL EVALUATION II Signal-to-Noise Ratio, adapted from signal and satellite techniques: SNR = P_signal / P_noise. Signal-to-Noise Ratio scaled, unit is dB: SNR_dB = 10 · log10(P_signal / P_noise). Mining Ability (in dB): the Mining Ability describes the power of a method to make distinctions between natural-language structures/patterns and random noise given a model with the same parameters: L_Quant(Θ) = 10 · log10( |E_{D_s,φ_Θ}| / max(1, |E_{D_{m_s},φ_Θ}|) ) dB
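Both quantities above are simple log-ratios: the SNR compares signal power to noise power, and the Mining Ability compares result counts on the natural corpus against the randomised baseline. A minimal sketch (function names and counts are illustrative, not TRACER's API):

```python
import math

def snr_db(p_signal: float, p_noise: float) -> float:
    """Signal-to-Noise Ratio on the decibel scale: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(p_signal / p_noise)

def mining_ability_db(edges_natural: int, edges_random: int) -> float:
    """Mining Ability in dB: how much 'louder' the natural text's result set is
    than the randomised baseline. The max(1, .) guard keeps the ratio defined
    when randomisation yields no results at all."""
    return 10 * math.log10(edges_natural / max(1, edges_random))

print(snr_db(1000, 10))            # 20.0 dB
print(mining_ability_db(5000, 50)) # 20.0 dB
```

A method that finds 100 times as many links in natural text as in shuffled text thus scores 20 dB, regardless of corpus size, which is what makes the measure comparable across models with the same parameters.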

  9. METHODOLOGY: NOISY CHANNEL EVALUATION III Motivation for randomisation by Word Shuffling: 1. Syntax and distributional semantics are randomised and "destroyed". 2. Distributions of words and sentence lengths remain unchanged; changes depend ONLY on the destruction of 1) and are not induced by changes of distributions. 3. Easy measurement of the "randomness" of the randomising method with the entropy test: ΔH_n = H_max − H_n. Choosing n ∈ [180, 183] ensures an accuracy of ΔH_n ≤ 10⁻³ bit for the entropy test.
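The two key properties of word shuffling, destroyed syntax but unchanged word and sentence-length distributions, can be demonstrated directly. A minimal sketch, assuming a corpus represented as a list of tokenised sentences (not TRACER's actual randomiser):

```python
import math
import random
from collections import Counter

def shuffle_words(sentences):
    """Randomise a corpus by shuffling all words globally, then refilling the
    original sentence slots: word order (syntax, distributional semantics) is
    destroyed, but word and sentence-length distributions are preserved."""
    words = [w for s in sentences for w in s]
    random.shuffle(words)
    out, i = [], 0
    for s in sentences:
        out.append(words[i:i + len(s)])
        i += len(s)
    return out

def unigram_entropy(sentences):
    """Shannon entropy of the word distribution, in bits."""
    counts = Counter(w for s in sentences for w in s)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Because the word multiset is untouched, the unigram entropy is identical before and after shuffling; only higher-order (n-gram) entropies drop, which is what the ΔH_n entropy test above measures.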

  10. METHODOLOGY: TEXT RE-USE COMPRESSION C_Θ = ( Σ_{i=1}^{n} Σ_{j=1}^{m} θ_Θ(S_i, S_j) ) / (n · m)
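The compression score is the average pairwise reuse score over all n segments of one library against all m segments of another. A minimal sketch, where `theta` stands in for the slide's θ_Θ (the concrete indicator below is a hypothetical shared-word test, not TRACER's scoring function):

```python
def reuse_compression(lib_a, lib_b, theta):
    """C_Theta: mean of theta over the full n x m cross-product of segments."""
    n, m = len(lib_a), len(lib_b)
    total = sum(theta(si, sj) for si in lib_a for sj in lib_b)
    return total / (n * m)

# Hypothetical theta: 1 if the two segments share at least one word, else 0.
def shares_word(a, b):
    return int(bool(set(a.split()) & set(b.split())))
```

With an indicator-valued θ_Θ, C_Θ is simply the fraction of segment pairs linked as reuse, so higher compression means a denser reuse graph between the two libraries.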

  11. RANDOMNESS & STRUCTURE Question: Why is the result of a randomised Digital Library typically not empty?

  12. RANDOMNESS & STRUCTURE: IMPACTS Corpus size in sentences (average sentence length is ca. 18 words). LGL is the threshold for the Log-Likelihood-Ratio.

  13. TEXT REUSE IN ENGLISH BIBLE VERSIONS Why does the use of the Bible make sense? • The Bible is easy to evaluate. • There are different editions written for different purposes.

  14. TEXT REUSE IN ENGLISH BIBLE VERSIONS 1. American Standard Version (ASV): 20th century, focus is the USA; 2. Bible in Basic English (BBE): verses are written in a simplified language; 3. Darby Version (DBY): created in the 19th century from Hebrew and Greek texts, completed by multiple authors after Darby's death; 4. King James Version (KJV): one of the oldest English Bible versions (early 17th cent.); 5. Webster's Revision (WBS): revision of the KJV in the 19th century; 6. World English Bible (WEB): 21st century, global focus; 7. Young's Literal Translation (YLT): verses follow Hebrew syntax.

  15. TEXT REUSE IN ENGLISH BIBLE VERSIONS: EVALUATION Example: book Genesis, chapter 1, verse 1. Reduced Bibles: all seven reduced Bible versions contain "only" the 28,632 verses present in all seven editions.

  16. TEXT REUSE IN ENGLISH BIBLE VERSIONS: SETUP Segmentation: disjoint, verse-wise segmentation. Selection: max pruning with a Feature Density of 0.8. Linking: Inter-Digital Library Linking (different Bible editions). Scoring: Broder's Resemblance with a threshold of 0.6. Post-processing: not used.
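Broder's Resemblance, the scoring step above, is the Jaccard similarity of the two verses' shingle (word n-gram) sets. A minimal sketch with word bigrams (the shingle size and example verses are illustrative; TRACER's feature set is configurable):

```python
def shingles(text: str, n: int = 2):
    """Set of word n-grams (shingles) of a verse."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def resemblance(a: str, b: str, n: int = 2) -> float:
    """Broder's Resemblance: |S(a) & S(b)| / |S(a) | S(b)|."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

v_asv = "in the beginning god created the heavens and the earth"
v_kjv = "in the beginning god created the heaven and the earth"
print(resemblance(v_asv, v_kjv))  # 7 shared of 11 distinct bigrams, ~0.636
```

With the setup's threshold of 0.6, this ASV/KJV verse pair (differing only in "heavens" vs. "heaven") scores about 0.64 and is kept as a reuse link; a more heavily rewritten verse would fall below the threshold.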

  17. TEXT REUSE IN ENGLISH BIBLE VERSIONS: RESULTS: RECALL

  18. TEXT REUSE IN ENGLISH BIBLE VERSIONS: RECALL VS. TEXT REUSE COMPRESSION

  19. TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION I

  20. TEXT REUSE IN ENGLISH BIBLE VERSIONS: DEPENDENCY OF RECALL & TR COMPRESSION II

  21. TEXT REUSE IN ENGLISH BIBLE VERSIONS: F-MEASURE VS. NOISY CHANNEL EVAL. I F-Measure ranking: WBS, ASV, DBY, WEB, YLT, BBE. NCE ranking: WBS, ASV, DBY, WEB, BBE, YLT.

  22. MICROVIEW I Source: Stefan Jänicke, eTRACES project, University of Leipzig.

  23. DEPENDENCY OF RECALL AND TR COMPRESSION

  24. FINITO!

  25. CONTACT Team: Marco Büchler, Greta Franzini and Emily Franzini. Visit us at http://www.etrap.eu or write to contact@etrap.eu

  26. LICENCE The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
