InfoTracker: Pedigree Tracking in the Face of Ancillary Content

Eugene Creswick, Terrance Goan and Emi Fujioka, Stottler Henke Associates Inc., 1107 NE 45th St., Suite 310, Seattle, WA 98105; 206-675-1169, FAX: 206-545-7227


  1. InfoTracker: Pedigree Tracking in the Face of Ancillary Content. Eugene Creswick, Terrance Goan and Emi Fujioka, Stottler Henke Associates Inc., 1107 NE 45th St., Suite 310, Seattle, WA 98105; 206-675-1169, FAX: 206-545-7227; rcreswick@stottlerhenke.com; http://www.stottlerhenke.com

  2. Track Document Pedigree

  4. Applications > Plagiarism; Information Flow; Security Policies

  5. The Challenge

  6. The Challenge > Common content confuses comparisons

  9. The Challenge > Related Work: Suffix Tree Document Models; Fuzzy Fingerprints; Hoad & Zobel's Fingerprints

  10. Solution

  11. Solution > Ignore the ancillary content

  12. Solution > How?

  13. Solution > How? Use Contrasting Corpora: Open Content vs. Sensitive Content. Content types shown in the figure: Intellectual Property; Resumes; Published work; Footers; Introductions; Secrets; Web Content; Homework Assignments; Headers

  14. Algorithm

  15. Algorithm > Index Both Corpora with one Suffix Tree

  16. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”

  17. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”

  18. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”; Open: “rooms”

  19. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”; Open: “rooms”; Sensitive: “as their hideout”

  20. Algorithm > Search for a document. Query: “Hotel rooms as their hideout”. Open: “Hotel rooms”; Open: “rooms”; Sensitive: “as their hideout”; Open: “their hideout”
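
The search walk-through on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a naive word n-gram index stands in for the single suffix tree named on slide 15, the function names `index_corpora` and `longest_matches` are hypothetical, and the three corpus sentences are invented so that the query reproduces the Open / Sensitive labels shown above.

```python
from collections import defaultdict


def index_corpora(corpora):
    """Map every word n-gram to the set of corpus labels it occurs in.

    Assumption: a naive n-gram index replaces the suffix tree from the slides.
    """
    index = defaultdict(set)
    for label, documents in corpora.items():
        for text in documents:
            words = text.lower().split()
            for start in range(len(words)):
                for end in range(start + 1, len(words) + 1):
                    index[tuple(words[start:end])].add(label)
    return index


def longest_matches(query, index):
    """For each query position, report the longest indexed run per corpus."""
    words = query.lower().split()
    matches = []
    for start in range(len(words)):
        best = {}
        for end in range(start + 1, len(words) + 1):
            for label in index.get(tuple(words[start:end]), ()):
                best[label] = (start, end)  # longer runs overwrite shorter ones
        for label, (s, e) in sorted(best.items()):
            matches.append((label, " ".join(words[s:e])))
    return matches


# Invented example documents (not from the presentation).
corpora = {
    "open": ["cheap hotel rooms downtown", "they left their hideout"],
    "sensitive": ["suspects used the flat as their hideout"],
}
index = index_corpora(corpora)
matches = longest_matches("Hotel rooms as their hideout", index)
for label, text in matches:
    print(f"{label}: {text!r}")
```

Because both corpora share one index, a single pass over the query yields the open and sensitive overlaps together, which is what makes the later filtering step possible.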

  22. Algorithm > Filter the resulting string overlaps (figure: aligned character strings labeled Query Doc., Sens. Overlap, Open Overlap, and Resulting Overlap(s), with one result marked "Too Short")
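
One plausible reading of the filtering figure is interval subtraction: remove every character range that also matches open content from the sensitive overlaps, then discard the pieces that are too short. The sketch below follows that reading. The helper names and the `min_length` value are assumptions for illustration; the slides do not give the actual threshold or the exact rule.

```python
def span_of(text, phrase):
    """Return the half-open (start, end) character span of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def subtract_spans(sensitive, open_spans):
    """Remove every open-corpus span from the sensitive spans."""
    result = []
    for span in sensitive:
        pieces = [span]
        for o_start, o_end in open_spans:
            next_pieces = []
            for p_start, p_end in pieces:
                if o_end <= p_start or o_start >= p_end:
                    next_pieces.append((p_start, p_end))  # no overlap
                    continue
                if p_start < o_start:
                    next_pieces.append((p_start, o_start))  # left remainder
                if o_end < p_end:
                    next_pieces.append((o_end, p_end))  # right remainder
            pieces = next_pieces
        result.extend(pieces)
    return result


def filter_overlaps(sensitive, open_spans, min_length):
    """Keep only the sensitive remainders that are at least min_length long."""
    remaining = subtract_spans(sensitive, open_spans)
    return [(s, e) for s, e in remaining if e - s >= min_length]


query = "Hotel rooms as their hideout"
sensitive = [span_of(query, "as their hideout")]
open_spans = [span_of(query, "Hotel rooms"), span_of(query, "their hideout")]

# min_length=5 is a hypothetical threshold chosen for the example.
kept = filter_overlaps(sensitive, open_spans, min_length=5)
print(kept)
```

With the example query, only "as " survives the subtraction, which falls under the threshold, so nothing is reported as a sensitive overlap.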

  23. Algorithm > Ranking

  24. Algorithm > Ranking > Overlap-based Ranking

  25. Algorithm > Ranking > Overlap-based Ranking (figure: query Q and documents A, B, C)

  26. Algorithm > Ranking > Overlap Frequency for Ranking. A: "the Indonesian island of Sumatra." B: "Northwest coast of the" C: "the Indonesian island of Sumatra." Unique text: lower frequency, greater impact. Common text: higher frequency, less impact.
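
The idea that rarer overlaps should count for more can be sketched as follows. The scoring formula here (overlap length divided by the number of documents that contain the overlap) is an assumption made for illustration; the slides only state that lower-frequency text has greater impact. The function name `rank_by_overlap_frequency` is hypothetical.

```python
from collections import Counter


def rank_by_overlap_frequency(overlaps):
    """Rank documents by their overlaps with the query.

    overlaps maps a document id to the list of strings it shares with the
    query. Each overlap contributes its length divided by the number of
    documents containing it (an assumed weighting, not the authors' formula).
    """
    frequency = Counter()
    for texts in overlaps.values():
        frequency.update(set(texts))
    scores = {
        doc: sum(len(text) / frequency[text] for text in texts)
        for doc, texts in overlaps.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


overlaps = {
    "A": ["the Indonesian island of Sumatra."],
    "B": ["Northwest coast of the"],
    "C": ["the Indonesian island of Sumatra."],
}
ranking = rank_by_overlap_frequency(overlaps)
for doc, score in ranking:
    print(doc, score)
```

In this example B ranks first even though its overlap is shorter, because its text is unique to one document while the text shared by A and C is discounted.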

  27. Evaluation

  28. Evaluation > InfoTracker was compared to Vector Space Cosine Similarity: TF-IDF weighted vectors; no stop words
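
For readers unfamiliar with the baseline, here is a minimal sketch of vector-space cosine similarity over TF-IDF weighted vectors. Several details are assumptions, since the slide does not specify them: "no stop words" is read as stop words being removed, the stop list is a tiny illustrative one, tokenization is plain whitespace splitting, and the IDF uses a smoothed variant.

```python
import math
from collections import Counter

# Tiny illustrative stop list (assumption; the real list is not given).
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "as"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "cheap hotel rooms downtown",
    "hotel rooms as their hideout",
    "the indonesian island of sumatra",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Unlike the overlap-based approach, this baseline treats every shared term the same way regardless of which corpus it comes from, so common boilerplate can raise the similarity of unrelated documents.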

  29. Evaluation > Data Set: Open Content vs. Sensitive Content. Content types shown in the figure: Intellectual Property; Resumes; Web Content (on-line news, blogs, etc...); Footers; Related Work; Published work; Headers

  30. Evaluation > Data Set: 272 SBIR proposals (234 historical proposals, 38 query proposals)

  31. Evaluation > Oracle Image from: http://www.marketoracle.co.uk

  32. Evaluation > Results

  33. Evaluation > Results > InfoTracker improved precision / recall

      Algorithm      Precision   Recall
      Vector Space   0.119       0.764
      InfoTracker    0.167       0.913

  34. Contributions / Future Work

  35. Contributions / Future Work > Ancillary content can be managed: Contrasting corpora; Manual/actively learned tags; Detecting document sections

  36. Contributions / Future Work > (Re)evaluate on open data: Compare with differing corpora; The Linux Doc. Project

  37. Contributions / Future Work > Algorithmic Improvements: Active Learning; Document time stamps; Overlap size / encapsulation

  38. Questions?

  39. Evaluation > Calculating Precision / Recall

  40. Evaluation > Calculating Precision / Recall: Consider the top 23 results (to allow for perfect recall).
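
Precision and recall over a fixed cutoff can be computed as in the sketch below. The cutoff of 23 comes from the slide; the likely reason (an inference, not stated in the deck) is that a cutoff at least as large as the largest set of relevant documents is needed for a recall of 1.0 to be reachable. The document ids and relevance judgments in the example are invented.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Compute precision and recall over the top k ranked documents.

    ranked is a list of document ids, best first; relevant is the set of
    ids judged relevant for the query.
    """
    top = ranked[:k]
    hits = sum(1 for doc in top if doc in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 30 ranked results, 4 relevant documents.
ranked = [f"doc{i}" for i in range(30)]
relevant = {"doc0", "doc5", "doc22", "doc29"}

precision, recall = precision_recall_at_k(ranked, relevant, k=23)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A fixed cutoff this generous keeps recall attainable but caps precision whenever a query has far fewer than 23 relevant documents, which is consistent with the low absolute precision values reported in the results.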

  41. Trimming Results > Ranking Scores Plummet Quickly

  43. Trimming Results > Trimming improves precision, retains recall
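
The slides do not state how results are trimmed. One simple possibility, shown purely as a hypothetical sketch, is to cut the ranked list once scores fall below a fraction of the top score, which exploits the observation that ranking scores plummet quickly. The function name `trim_results`, the `min_fraction` parameter, and its default value are all assumptions.

```python
def trim_results(ranked, min_fraction=0.1):
    """Keep results scoring at least min_fraction of the best score.

    ranked is a list of (doc_id, score) pairs sorted best first.
    The relative-threshold rule is an assumed trimming strategy.
    """
    if not ranked:
        return []
    top_score = ranked[0][1]
    if top_score <= 0:
        return []
    cutoff = min_fraction * top_score
    return [(doc, score) for doc, score in ranked if score >= cutoff]


# Invented scores that drop off sharply after the second result.
ranked = [("a", 100.0), ("b", 40.0), ("c", 3.0), ("d", 1.0)]
trimmed = trim_results(ranked)
print(trimmed)
```

If the relevant documents sit above the drop-off, trimming removes mostly irrelevant results, which would raise precision while leaving recall largely unchanged.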
