InfoTracker: Pedigree Tracking in the Face of Ancillary Content

InfoTracker: Pedigree Tracking in the Face of Ancillary Content Eugene Creswick, Terrance Goan and Emi Fujioka Stottler Henke Associates Inc. 1107 NE 45th St., Suite 310, Seattle, WA 98105 206-675-1169 FAX: 206-545-7227


SLIDE 1

InfoTracker:

Pedigree Tracking in the Face of Ancillary Content

Eugene Creswick, Terrance Goan and Emi Fujioka
Stottler Henke Associates Inc.

1107 NE 45th St., Suite 310, Seattle, WA 98105
206-675-1169 FAX: 206-545-7227
rcreswick@stottlerhenke.com
http://www.stottlerhenke.com

SLIDE 2

Track Document Pedigree

SLIDE 4

Applications

  • Security Policies
  • Plagiarism
  • Information Flow

SLIDE 5

The Challenge

SLIDE 6

Common content confuses comparisons

SLIDE 9

Related Work

  • Hoad & Zobel's Fingerprints
  • Suffix Tree Document Models
  • Fuzzy Fingerprints

SLIDE 10

Solution

SLIDE 11

Ignore the ancillary content

SLIDE 12

How?

SLIDE 13

How? Use Contrasting Corpora

Open Content

  • Headers
  • Resumes
  • Introductions
  • Web Content
  • Footers
  • Published work

Sensitive Content

  • Secrets
  • Homework Assignments
  • Intellectual Property

SLIDE 14

Algorithm

SLIDE 15

Index Both Corpora with one Suffix Tree
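
As a rough illustration of indexing both corpora in one structure, the sketch below uses a naive word-level suffix trie as a stand-in for a real suffix tree. The class names, the `max_depth` cap, the corpus labels and the two sample documents are all assumptions made for this example; the slides do not describe the actual data structure in this detail.

```python
from collections import defaultdict


class Node:
    def __init__(self):
        self.children = {}
        # corpus label ("open" / "sensitive") -> ids of documents passing through this node
        self.docs = defaultdict(set)


class SuffixTrie:
    """Naive word-level suffix trie; a simplified stand-in for a suffix tree."""

    def __init__(self, max_depth=8):
        self.root = Node()
        self.max_depth = max_depth  # hypothetical cap to keep the trie small

    def add(self, doc_id, label, text):
        words = text.lower().split()
        # Insert every suffix of the document, truncated to max_depth words.
        for start in range(len(words)):
            node = self.root
            for word in words[start:start + self.max_depth]:
                node = node.children.setdefault(word, Node())
                node.docs[label].add(doc_id)

    def lookup(self, phrase):
        """Return {label: doc ids} for documents containing the phrase."""
        node = self.root
        for word in phrase.lower().split():
            node = node.children.get(word)
            if node is None:
                return {}
        return {label: set(ids) for label, ids in node.docs.items()}


# Hypothetical documents, one per corpus, stored in the same index.
index = SuffixTrie()
index.add("news-1", "open", "They booked hotel rooms downtown")
index.add("memo-7", "sensitive", "used hotel rooms as their hideout")
```

Because every node records the corpus label alongside the document id, a single lookup can tell whether a phrase occurs in open content, sensitive content, or both.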

SLIDE 16

Search for a document

Query: “Hotel rooms as their hideout”

SLIDE 17

Search for a document

Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”

SLIDE 18

Search for a document

Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”

SLIDE 19

Search for a document

Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”
Sensitive: “as their hideout”

SLIDE 20

Search for a document

Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”
Sensitive: “as their hideout”
Open: “their hideout”
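
The search walk-through can be approximated with the sketch below: for each starting word of the query it reports the longest phrase found in each corpus. It uses plain sets of word n-grams rather than a suffix tree, and the function names, the `max_len` cap and the sample corpora are assumptions for illustration only.

```python
def ngrams(text, max_len=8):
    """All word n-grams of the text, up to max_len words long."""
    words = text.lower().split()
    return {
        tuple(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, min(len(words), i + max_len) + 1)
    }


def longest_matches(query, corpora, max_len=8):
    """For each start word in the query, the longest phrase found per corpus."""
    words = query.lower().split()
    grams = {}
    for label, docs in corpora.items():
        grams[label] = set()
        for doc in docs:
            grams[label] |= ngrams(doc, max_len)

    hits = []
    for start in range(len(words)):
        for label, known in grams.items():
            length = 0
            # Extend the phrase one word at a time while it is still known.
            while (length < max_len and start + length < len(words)
                   and tuple(words[start:start + length + 1]) in known):
                length += 1
            if length:
                hits.append((start, label, " ".join(words[start:start + length])))
    return hits


# Hypothetical corpora chosen so the query behaves like the slide example.
corpora = {
    "open": ["cheap hotel rooms available", "the cave was their hideout"],
    "sensitive": ["they used a warehouse as their hideout"],
}
hits = longest_matches("Hotel rooms as their hideout", corpora)
```

With these sample corpora, "hotel rooms" is matched only in the open corpus, while "as their hideout" is matched only in the sensitive one.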

SLIDE 22

Filter the resulting string overlaps

[Figure: aligned character strings. Labels: Query Doc., Sens. Overlap, Open Overlap, Resulting Overlap(s), Too Short]
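
One possible reading of this filtering step is sketched below: character spans that the query shares with open content are cut out of the spans it shares with sensitive content, and any remainder shorter than a threshold is dropped. The interval representation, the function names and the `min_len` value are assumptions; the slide only shows the labels of the figure.

```python
def subtract_spans(sensitive, open_spans):
    """Remove every open span from every sensitive span (half-open intervals)."""
    result = []
    for start, end in sensitive:
        pieces = [(start, end)]
        for o_start, o_end in open_spans:
            next_pieces = []
            for p_start, p_end in pieces:
                if o_end <= p_start or o_start >= p_end:
                    next_pieces.append((p_start, p_end))  # no intersection
                    continue
                if p_start < o_start:
                    next_pieces.append((p_start, o_start))  # part left of the open span
                if o_end < p_end:
                    next_pieces.append((o_end, p_end))  # part right of the open span
            pieces = next_pieces
        result.extend(pieces)
    return result


def filter_overlaps(sensitive, open_spans, min_len=10):
    """Keep only the sensitive remainders that are at least min_len characters long."""
    remaining = subtract_spans(sensitive, open_spans)
    return [(s, e) for s, e in remaining if e - s >= min_len]


# Hypothetical spans: one 40-character sensitive overlap, two open overlaps inside it.
kept = filter_overlaps([(0, 40)], [(10, 15), (32, 40)], min_len=12)
```

Here the remainder (0, 10) falls under the threshold and is discarded, leaving only (15, 32).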

SLIDE 23

Ranking

SLIDE 24

Overlap-based Ranking

SLIDE 25

Overlap-based Ranking

[Figure: query Q with documents A, B and C]

SLIDE 26

Overlap Frequency for Ranking

A: the Indonesian island of Sumatra.
B: Northwest coast of the
C: the Indonesian island of Sumatra.

  • Unique text: lower frequency, greater impact
  • Common text: higher frequency, less impact
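
The frequency idea can be sketched as follows: each overlap is weighted by the inverse of the number of documents that share it, so text unique to one document counts for more than text found in many. The 1/frequency weighting and the function name are assumptions for this example; the slides do not give the actual scoring formula.

```python
from collections import defaultdict


def rank_by_overlap(overlaps):
    """overlaps: dict mapping doc id -> list of overlap strings shared with the query."""
    # Count how many documents contain each overlap string.
    doc_freq = defaultdict(int)
    for texts in overlaps.values():
        for text in set(texts):
            doc_freq[text] += 1

    # Rare overlaps contribute more to a document's score than common ones.
    scores = {
        doc: sum(1.0 / doc_freq[text] for text in texts)
        for doc, texts in overlaps.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Overlap strings taken from the slide; the document ids follow its A/B/C labels.
ranked = rank_by_overlap({
    "A": ["the Indonesian island of Sumatra."],
    "B": ["Northwest coast of the"],
    "C": ["the Indonesian island of Sumatra."],
})
```

Under this weighting, B ranks first because its overlap is shared with no other document, while A and C split the weight of their common text.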

SLIDE 27

Evaluation

SLIDE 28

InfoTracker was compared to Vector Space

  • No stop words
  • Cosine Similarity
  • TF-IDF weighted vectors
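
For readers unfamiliar with the baseline, a minimal vector space sketch is shown below: documents become TF-IDF weighted term vectors and are compared by cosine similarity. The whitespace tokenizer, the smoothed IDF formula and the sample documents are assumptions, and the sketch does no stop-word handling because the slide does not say how stop words were treated.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF variant: log(N / df) + 1, so terms in every document keep some weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents: the first two share vocabulary, the third does not.
vectors = tfidf_vectors([
    "suffix tree document model",
    "suffix tree index for proposals",
    "quarterly budget summary",
])
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

Because this model compares whole-document term weights, shared boilerplate such as headers and footers raises similarity just as shared sensitive text does.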

SLIDE 29

Data Set

Open Content

  • Headers
  • Resumes
  • Web Content (on-line news, blogs, etc.)
  • Footers
  • Published work
  • Related Work

Sensitive Content

  • Intellectual Property

SLIDE 30

Data Set

  • 38 query proposals
  • 272 SBIR proposals
  • 234 historical proposals

SLIDE 31

Oracle

Image from: http://www.marketoracle.co.uk

SLIDE 32

Results

SLIDE 33

InfoTracker improved precision / recall

Algorithm      Precision   Recall
Vector Space   0.119       0.764
InfoTracker    0.167       0.913

SLIDE 34

Contributions / Future Work

SLIDE 35

Ancillary content can be managed

  • Detecting document sections
  • Contrasting corpora
  • Manual/actively learned tags

SLIDE 36

(re)Evaluate on Open data

  • Compare with differing corpora
  • The Linux Doc. Project

SLIDE 37

Algorithmic Improvements

  • Active Learning
  • Document time stamps
  • Overlap size / encapsulation

SLIDE 38

Questions?

SLIDE 40

Calculating Precision / Recall

SLIDE 41

Calculating Precision / Recall

Consider the top 23 results (to allow for perfect recall).
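
The calculation at a fixed cutoff can be sketched as below. The function name and the choice to divide precision by the cutoff `k` are assumptions; the slides state only that the top 23 results were considered.

```python
def precision_recall_at_k(ranked, relevant, k=23):
    """ranked: doc ids in rank order; relevant: set of doc ids judged relevant."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / k  # fraction of the k inspected results that are relevant
    recall = hits / len(relevant) if relevant else 0.0  # fraction of relevant docs found
    return precision, recall


# Hypothetical ranking with a small cutoff to keep the arithmetic visible.
precision, recall = precision_recall_at_k(list("abcdefgh"), {"a", "c", "z"}, k=4)
```

In this toy case two of the four inspected results are relevant and two of the three relevant documents are found. A fixed cutoff larger than the typical number of relevant documents caps precision well below 1 even for a perfect ranking.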

SLIDE 42

Trimming Results

Ranking Scores Plummet Quickly

SLIDE 44

Trimming improves precision, retains recall
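
One simple way to trim a ranked list whose scores fall off sharply is sketched below: results scoring under a fixed fraction of the top score are cut. The rule, the function name and the `min_ratio` value are assumptions for illustration; the slides do not state which trimming criterion was used.

```python
def trim_results(ranked, min_ratio=0.25):
    """ranked: list of (doc_id, score) pairs sorted by descending score."""
    if not ranked:
        return []
    # Hypothetical rule: keep results scoring at least min_ratio of the best score.
    cutoff = ranked[0][1] * min_ratio
    return [(doc, score) for doc, score in ranked if score >= cutoff]


# Hypothetical scores that drop quickly after the second result.
trimmed = trim_results([("a", 10.0), ("b", 6.0), ("c", 1.0)], min_ratio=0.25)
```

Dropping the low-scoring tail removes likely false positives, which raises precision, while the strongly matching documents near the top are kept.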