TRACER TUTORIAL: TEXT REUSE DETECTION FEATURING

SLIDE 1

TRACER TUTORIAL: TEXT REUSE DETECTION FEATURING

Marco Büchler, Emily Franzini and Greta Franzini

SLIDE 2

TABLE OF CONTENTS

  • 1. What is featuring?
  • 2. Featuring techniques
  • 3. Hacking
  • 4. Conclusion and revision

SLIDE 3

REMINDER: CURRENT APPROACH

SLIDE 4

WHAT IS FEATURING?

SLIDE 5

QUESTION

What do you associate with featuring?

SLIDE 6

A VISUALISATION OF FEATURING

From biometry:

SLIDE 7

SOME VOCABULARY

SLIDE 8

FEATURING

SLIDE 9

FEATURING TECHNIQUES

SLIDE 10

FEATURING: AN EXAMPLE

V1 = {s1, s2, s3, s4, s5}
Features: V2 = {A, B, ..., J, K}

s1 : A B C D E
s2 : A C E F G
s3 : G F A C D
s4 : C F A G E
s5 : D H I J K

SLIDE 11

FEATURING: MATRIX STYLE

s1 : A B C D E
s2 : A C E F G
s3 : G F A C D
s4 : C F A G E
s5 : D H I J K

M =
        A  C  D  F  G  E  B  H  I  J  K
    s1  1  1  1  0  0  1  1  0  0  0  0
    s2  1  1  0  1  1  1  0  0  0  0  0
    s3  1  1  1  1  1  0  0  0  0  0  0
    s4  1  1  0  1  1  1  0  0  0  0  0
    s5  0  0  1  0  0  0  0  1  1  1  1
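The matrix M can be rebuilt mechanically from the five units. The following is a minimal Python sketch, not TRACER code: the names `units`, `columns` and `build_matrix` are illustrative assumptions, and the column order is simply the one printed on the slide. Empty cells of the slide are written as 0.

```python
# Reuse units and their features, copied from the slide.
units = {
    "s1": ["A", "B", "C", "D", "E"],
    "s2": ["A", "C", "E", "F", "G"],
    "s3": ["G", "F", "A", "C", "D"],
    "s4": ["C", "F", "A", "G", "E"],
    "s5": ["D", "H", "I", "J", "K"],
}

# Column order as printed on the slide (an arbitrary but fixed order).
columns = list("ACDFGEBHIJK")


def build_matrix(units, columns):
    """Return one 0/1 row per unit: 1 if the unit contains the column's feature.

    Hypothetical helper for illustration only; it is not part of TRACER.
    """
    matrix = {}
    for name, features in units.items():
        present = set(features)
        matrix[name] = [1 if feature in present else 0 for feature in columns]
    return matrix


M = build_matrix(units, columns)

# Print the matrix in the same layout as the slide.
print("   " + " ".join(columns))
for name, row in M.items():
    print(name + " " + " ".join(str(value) for value in row))
```

Each row contains exactly five 1s, one per feature of the unit; the order of features inside a unit (compare s2 and s4) does not affect the row.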

SLIDE 12

HACKING: CONFIGURATION

SLIDE 13

HACKING

SLIDE 14

HACKING

Tasks:

  • Run on your own texts ...
  • 1. ... n-gram shingling with n=2, 3
  • 2. ... words as features
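To make the two task variants concrete, here is a small Python sketch of words as features and of n-gram shingling with n=2 and n=3. It is an illustration under assumptions, not TRACER's implementation: it assumes shingles are built over whitespace-separated words, and the function names and the sample sentence are hypothetical choices.

```python
def word_features(text):
    """Words as features: lower-cased, whitespace-separated tokens."""
    return text.lower().split()


def shingles(tokens, n):
    """Overlapping n-grams (shingles) over a token list.

    Returns an empty list when the unit has fewer than n tokens.
    """
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Sample reuse unit (illustrative only).
sentence = "in the beginning god created the heaven and the earth"

tokens = word_features(sentence)
bigrams = shingles(tokens, 2)
trigrams = shingles(tokens, 3)

print(len(tokens), "word features")
print(len(bigrams), "bigram shingles, first:", bigrams[0])
print(len(trigrams), "trigram shingles, first:", trigrams[0])
```

A unit of k tokens yields k - n + 1 shingles, so longer shingles produce slightly fewer features per unit while each feature becomes more specific.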

SLIDE 15

HACKING

Questions:

  • Run the aforementioned tasks. Compare the resulting "tail distributions" (in the featuring folder you'll find all this information in e.g. KJV.meta).
  • Compare the .train files of n-gram shingling with those of hash-breaking and of words as features (use Excel or OpenOffice to open the .train file; sort by columns B and C).

SLIDE 16

CONFIGURING THE TRAINING IMPL PARAMETER

Hint: The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml

SLIDE 17

CONFIGURING THE TRAINING IMPL PARAMETER

Hint:

  • The configuration file can be found in: $TRACER_HOME/conf/tracer_conf.xml
  • eu.etrap.tracer.featuring.syntactical.shingle.TriGramShinglingTrainingImpl
  • eu.etrap.tracer.featuring.syntactical.shingle.BiGramShinglingTrainingImpl
  • eu.etrap.tracer.featuring.semantic.WordBasedTrainingImpl

SLIDE 18

GAP BETWEEN KNOWLEDGE AND EXPERIENCE

SLIDE 19

CONCLUSION AND REVISION

SLIDE 20

CHECK

Question: How does the number of features change with the changing feature size (e.g. bigrams, trigrams)?

SLIDE 21

CHECK

M: the unit-by-feature matrix from slide 11 (FEATURING: MATRIX STYLE).

Question: What is the Digital Fingerprint of a reuse unit?

SLIDE 22

CHECK

M: the unit-by-feature matrix from slide 11 (FEATURING: MATRIX STYLE).

Question: How does preprocessing influence F?

SLIDE 23

CHECK

M: the unit-by-feature matrix from slide 11 (FEATURING: MATRIX STYLE).

Question: How can you compute the feature frequency?
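One possible way to approach this question, sketched in Python: if M is stored as a list of 0/1 rows, the frequency of a feature is the sum of its column. The matrix below is the slide's example written out by hand with empty cells as 0; the variable names are assumptions for illustration.

```python
# Column order and rows of M as on the slide (empty cells written as 0).
columns = list("ACDFGEBHIJK")
M = [
    [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],  # s1
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # s2
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # s3
    [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0],  # s4
    [0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1],  # s5
]

# Feature frequency = column sum = number of units containing the feature.
feature_frequency = {
    feature: sum(row[j] for row in M) for j, feature in enumerate(columns)
}

for feature, frequency in feature_frequency.items():
    print(feature, frequency)
```

Because M is binary, this counts the number of units in which a feature occurs, not how many times it occurs inside a single unit.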

SLIDE 24

IMPORTANCE OF FEATURING

  • Featuring defines the unit to measure similarity.
  • Most featuring techniques "generate" a power-law distribution:
      • A few features occur very often;
      • At least 50% of all features occur just once;
      • Most features are rare.
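The shape of such a distribution can be inspected by counting how many features occur once, twice, and so on. The Python sketch below does this for the five example units from the featuring slides; the helper names are hypothetical, and the example is far too small to show a real power law, so it only illustrates the procedure one might apply to the features of a full corpus.

```python
from collections import Counter


def feature_frequencies(units):
    """Count, for each feature, the number of units it occurs in."""
    counts = Counter()
    for features in units:
        counts.update(set(features))
    return counts


def frequency_spectrum(counts):
    """Map each frequency value to the number of features having it."""
    return Counter(counts.values())


# The five example units s1..s5 from the featuring slides.
units = [list("ABCDE"), list("ACEFG"), list("GFACD"), list("CFAGE"), list("DHIJK")]

counts = feature_frequencies(units)
spectrum = frequency_spectrum(counts)

# Share of features that occur exactly once.
hapax_share = spectrum[1] / len(counts)

for frequency in sorted(spectrum, reverse=True):
    print(frequency, "occurrence(s):", spectrum[frequency], "feature(s)")
print("share of features occurring once:", round(hapax_share, 2))
```

In this toy example 5 of the 11 features occur once; on real texts the tail of rare features is typically much longer, which is what the "tail distributions" in the hacking tasks let you compare.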

SLIDE 25

FINITO!

SLIDE 26

CONTACT

Team: Marco Büchler, Greta Franzini and Emily Franzini.
Visit us: http://www.etrap.eu
Contact: contact@etrap.eu

SLIDE 27

LICENCE

The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
