SLIDE 1
InfoTracker: Pedigree Tracking in the Face of Ancillary Content
Eugene Creswick, Terrance Goan and Emi Fujioka
Stottler Henke Associates Inc.
1107 NE 45th St., Suite 310, Seattle, WA 98105
206-675-1169 FAX: 206-545-7227
SLIDE 2
SLIDE 3
Track Document Pedigree
SLIDE 4
Applications
Security Policies
Plagiarism
Information Flow
SLIDE 5
The Challenge
SLIDE 6
Common content confuses comparisons
SLIDE 7
Common content confuses comparisons
SLIDE 8
Common content confuses comparisons
SLIDE 9
Related Work
Hoad & Zobel's Fingerprints
Suffix Tree Document Models
Fuzzy Fingerprints
SLIDE 10
Solution
SLIDE 11
Ignore the ancillary content
SLIDE 12
How?
SLIDE 13
How? Use Contrasting Corpora
Headers Resumes Introductions Web Content Footers Published work Secrets Homework Assignments Intellectual Property
Open Content Sensitive Content
SLIDE 14
Algorithm
SLIDE 15
Index Both Corpora with one Suffix Tree
SLIDE 16
Search for a document
Query: “Hotel rooms as their hideout”
SLIDE 17
Search for a document
Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
SLIDE 18
Search for a document
Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”
SLIDE 19
Search for a document
Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”
Sensitive: “as their hideout”
SLIDE 20
Search for a document
Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”
Sensitive: “as their hideout”
Open: “their hideout”
SLIDE 21
Search for a document
Query: “Hotel rooms as their hideout”
Open: “Hotel rooms”
Open: “rooms”
Sensitive: “as their hideout”
Open: “their hideout”
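The search step on these slides can be sketched roughly as follows. This is a minimal illustration, not the InfoTracker implementation: it replaces the suffix tree with a naive word-level scan, and the names `tokenize`, `contains` and `longest_matches`, along with the sample corpora, are assumptions made for the example. For each starting word of the query it reports the longest phrase found in a corpus.

```python
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def contains(docs: List[List[str]], phrase: List[str]) -> bool:
    """True if the word sequence `phrase` occurs contiguously in any document."""
    n = len(phrase)
    return any(doc[i:i + n] == phrase
               for doc in docs
               for i in range(len(doc) - n + 1))


def longest_matches(query: str, corpus: List[str]) -> List[Tuple[int, int]]:
    """For each query word position, return the (start, end) word span of the
    longest phrase starting there that appears somewhere in the corpus."""
    q = tokenize(query)
    docs = [tokenize(d) for d in corpus]
    spans = []
    for start in range(len(q)):
        end = start
        # Extend the phrase one word at a time while it still matches.
        while end < len(q) and contains(docs, q[start:end + 1]):
            end += 1
        if end > start:
            spans.append((start, end))
    return spans


# Hypothetical corpora built around the query used on the slides.
query = "Hotel rooms as their hideout"
open_corpus = ["Hotel rooms are cheap", "their hideout was found"]
sensitive_corpus = ["they used the cabin as their hideout"]

open_spans = longest_matches(query, open_corpus)
sensitive_spans = longest_matches(query, sensitive_corpus)
```

A real suffix tree over both corpora would answer the same question without rescanning every document for every phrase; the naive scan is used here only to keep the sketch short.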
SLIDE 22
Filter the resulting string overlaps
[Figure: aligned character strings of the query document. Sens. Overlap - Open Overlap = Resulting Overlap(s); overlaps that are too short are dropped.]
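One way to read the filtering step is as interval subtraction followed by a length threshold. The sketch below works under that assumption; the function name `filter_overlaps`, the half-open `(start, end)` span convention and the `min_len` parameter are illustrative choices, not details taken from the slides.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open (start, end) positions in the query


def filter_overlaps(sensitive: List[Span], open_: List[Span],
                    min_len: int) -> List[Span]:
    """Remove positions covered by open overlaps from the sensitive overlaps,
    then keep only the remaining runs that are at least `min_len` long."""
    sens_pos = {i for s, e in sensitive for i in range(s, e)}
    open_pos = {i for s, e in open_ for i in range(s, e)}
    remaining = sorted(sens_pos - open_pos)

    # Merge consecutive positions back into contiguous runs.
    runs: List[Span] = []
    for pos in remaining:
        if runs and runs[-1][1] == pos:
            runs[-1] = (runs[-1][0], pos + 1)
        else:
            runs.append((pos, pos + 1))
    return [(s, e) for s, e in runs if e - s >= min_len]


# A sensitive overlap of 10 positions with an open overlap cut out of it.
kept = filter_overlaps([(0, 10)], [(3, 5)], min_len=4)
```

With these inputs the open overlap splits the sensitive overlap into a run of length 3 and a run of length 5, and only the longer run survives the threshold.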
SLIDE 23
Ranking
SLIDE 24
Overlap-based Ranking
SLIDE 25
Overlap-based Ranking
[Figure: query Q and documents A, B and C]
SLIDE 26
Overlap Frequency for Ranking
A: the Indonesian island of Sumatra.
B: Northwest coast of the
C: the Indonesian island of Sumatra.
Unique text: lower frequency, greater impact
Common text: higher frequency, less impact
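The idea of weighting overlaps by how often they occur can be sketched as below. The scoring formula (overlap length in words divided by the number of documents sharing that overlap) is an assumption chosen for illustration; the slides state only that rarer text should have greater impact. The name `rank_by_overlap` is likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_by_overlap(overlaps: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Score each document by its overlaps with the query, discounting
    overlap text that also appears in many other documents."""
    # Frequency = number of documents that share a given overlap string.
    freq = Counter(text for texts in overlaps.values() for text in set(texts))
    scores = {
        doc: sum(len(text.split()) / freq[text] for text in set(texts))
        for doc, texts in overlaps.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Overlaps with the query, using the example strings from the slide.
ranking = rank_by_overlap({
    "A": ["the Indonesian island of Sumatra."],
    "B": ["Northwest coast of the"],
    "C": ["the Indonesian island of Sumatra."],
})
```

Under this assumed formula, B ranks first because its overlap is unique, while A and C split the weight of the text they share.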
SLIDE 27
Evaluation
SLIDE 28
InfoTracker was compared to Vector Space
No stop words
Cosine Similarity
TF-IDF weighted vectors
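For readers unfamiliar with the baseline, a vector space comparison of this kind can be sketched in a few lines. The sketch reads "No stop words" as stop-word removal, which is an assumption, and the tiny `STOP_WORDS` set, the whitespace tokenizer and the plain `tf * log(N / df)` weighting are illustrative choices rather than the configuration used in the evaluation.

```python
import math
from collections import Counter
from typing import Dict, List

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per document (tf * log(N / df))."""
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS]
                 for d in docs]
    df = Counter(w for tokens in tokenized for w in set(tokens))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


docs = ["the island of sumatra", "sumatra island coast", "hotel rooms hideout"]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
```

Because this model compares whole-document term weights, shared boilerplate raises similarity just as shared substantive text does, which is the weakness the deck attributes to common content.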
SLIDE 29
Data Set
Headers Resumes Web Content (on-line news, blogs, etc...) Footers Published work Intellectual Property
Open Content Sensitive Content
Related Work
SLIDE 30
Data Set
38 query proposals
272 SBIR proposals
234 historical proposals
SLIDE 31
Oracle
Image from: http://www.marketoracle.co.uk
SLIDE 32
Results
SLIDE 33
InfoTracker improved precision / recall
Algorithm      Precision  Recall
Vector Space   0.119      0.764
InfoTracker    0.167      0.913
SLIDE 34
Contributions / Future Work
SLIDE 35
Ancillary content can be managed
Detecting document sections
Contrasting corpora
Manual/actively learned tags
SLIDE 36
(re)Evaluate on Open data
Compare with differing corpora
The Linux Doc. Project
SLIDE 37
Algorithmic Improvements
Active Learning
Document time stamps
Overlap size / encapsulation
SLIDE 38
Questions?
SLIDE 39
SLIDE 40
Calculating Precision / Recall
SLIDE 41
Calculating Precision / Recall
Consider the top 23 results.
(to allow for perfect recall)
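Precision and recall over a fixed-size result list can be computed as in the sketch below. The cutoff of 23 comes from the slide; the function name `precision_recall_at_k` and the sample data are assumptions. Reading the slide's note, a cutoff of 23 presumably means no query has more than 23 relevant documents, so perfect recall is attainable, though the deck does not spell this out.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(ranked: List[str], relevant: Set[str],
                          k: int = 23) -> Tuple[float, float]:
    """Precision and recall computed over the top-k ranked results."""
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranking with two of three relevant documents retrieved.
precision, recall = precision_recall_at_k(
    ["a", "b", "c", "d"], {"a", "c", "z"}, k=4)
```

With a fixed cutoff much larger than the typical number of relevant documents, precision is bounded well below 1 even for a perfect ranking, which is worth keeping in mind when reading the precision figures reported in the results.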
SLIDE 42
Ranking Scores Plummet Quickly
SLIDE 43
Ranking Scores Plummet Quickly
SLIDE 44