
Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms
Tobias Berka, Marian Vajteršič
April 30, 2011


  1. Title: Dimensionality Reduction for Information Retrieval using Vector Replacement of Rare Terms. Tobias Berka, Marian Vajteršič. April 30, 2011.

  2. Outline: 1 Introduction (Dimensionality Reduction); 2 Rare Term Vector Replacement (Zipf's Law, Replacement Vectors, Rare Term Replacement); 3 Evaluation (Retrieval Performance, Computational Performance, Stability); 4 Summary & Conclusions.

  3. Introduction

  4. Goals: Reduce dimensionality. Preserve or improve pair-wise distances, cross-class scatter, and retrieval / clustering / classification performance. Detect contributing factors, individual components, signals or noise.

  5. Methods. Great classics (linear methods): Singular Value Decomposition, Principal Component Analysis (PCA), Non-negative Matrix Factorization(s), Independent Component Analysis. Canonical extension: kernel methods. Maps: mesh fitting, self-organization. Manifold learning: local linearization, local non-linear reduction.

  6. My Interest: Better retrieval, more complete retrieval. Dynamic searching, less reliance on static indices.

  7. My Interest: Interactive semi-supervised clustering, exploratory data analysis, search.

  8. My Interest: Sparse → dense. Good for super-scalar CPUs, more efficient parallelism.

  9. Rare Term Vector Replacement

  10. Zipf's Law: "The [document] frequency of a word is reciprocally proportional to its frequency rank," i.e. f_i ∝ 1 / rank(f_i). For example, the second most frequent term occurs roughly half as often as the most frequent one.

  11. Zipf's Law in Practice: "Most words occur only in a very small number of documents."

  12. Zipf's Law in Pictures. [Figure: term occurrence counts (log scale) vs. relative feature index; Q1 = 1, Q2 = 2, Q3 = 7, mean = 75.93, cut-off = 694.]

  13. Zipf's Law in Pictures. [Figure: term occurrence counts (log scale) vs. relative feature index; Q1 = 1, Q2 = 1, Q3 = 4, mean = 6.52, cut-off = 10.]

  14. Zipf's Law vs. Dimensionality Reduction: Eliminate rare terms? They are of high importance for information retrieval! Can we compress them instead?

  15. Replacement Vectors: Let's compute replacement vectors!

  16. Centroid Summarization: We operate on a corpus in vector form.

  17. Centroid Summarization: Select the vectors containing a rare term.

  18. Centroid Summarization: Compute the centroid.

  19. Vector Truncation: Discard the rare features. [Figure: term occurrence counts (log scale) vs. relative feature index, as on slide 13; Q1 = 1, Q2 = 1, Q3 = 4, mean = 6.52, cut-off = 10.]

  20. Computing Replacement Vectors. For all rare terms, we compute the following: select all documents containing the rare term, compute the (weighted) average vector, and truncate all rare terms from this vector.
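A minimal NumPy sketch of this step, not taken from the slides: it assumes a dense document-term matrix X of TF-IDF-style weights, a boolean mask is_rare over the features, an unweighted average as the centroid, and a hypothetical helper name compute_replacement_vectors.

```python
import numpy as np

def compute_replacement_vectors(X, is_rare):
    """One replacement vector per rare term, defined over the common features only.

    X       : (n_docs, n_terms) document-term matrix of term weights.
    is_rare : boolean array of length n_terms marking the rare features.
    """
    common = ~is_rare
    replacements = {}
    for t in np.where(is_rare)[0]:
        docs = X[:, t] > 0                      # documents containing the rare term
        if docs.any():
            centroid = X[docs].mean(axis=0)     # (here: unweighted) average vector
            replacements[t] = centroid[common]  # truncate all rare features
    return replacements
```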

  21. A More Efficient Algorithm. For all documents: for all rare terms occurring in the document, add the document's common terms to that term's running average vector.
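A sketch of this single-pass variant under the same assumptions: the corpus is scanned once, and each document's common-term part is accumulated into the running sum of every rare term it contains; the function name is again illustrative.

```python
import numpy as np

def compute_replacement_vectors_single_pass(X, is_rare):
    """Same result as above, but with a single pass over the documents."""
    common = ~is_rare
    rare_idx = np.where(is_rare)[0]
    sums = {t: np.zeros(common.sum()) for t in rare_idx}   # running sums per rare term
    counts = {t: 0 for t in rare_idx}
    for doc in X:                                  # one pass over the corpus
        for t in rare_idx[doc[rare_idx] > 0]:      # rare terms in this document
            sums[t] += doc[common]                 # add the common-term part
            counts[t] += 1
    return {t: sums[t] / counts[t] for t in rare_idx if counts[t] > 0}
```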

  22. New Document Representation. For all documents, we compute the following: truncate all rare terms from the document vector (i.e. retain only the common terms); add the linear combination of the replacement vectors of all rare terms in the document, each scaled by its weighted term frequency; normalize the result to unit length.
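A sketch of this projection step under the same assumptions (NumPy, one weighted term-frequency vector per document, replacement vectors as computed above); transform_document is a hypothetical name.

```python
import numpy as np

def transform_document(doc, is_rare, replacements):
    """Map a document vector into the reduced, common-terms-only space."""
    common = ~is_rare
    reduced = doc[common].astype(float).copy()      # truncate all rare terms
    for t in np.where(is_rare)[0]:
        w = doc[t]                                  # weighted term frequency
        if w > 0 and t in replacements:
            reduced += w * replacements[t]          # scaled replacement vector
    norm = np.linalg.norm(reduced)
    return reduced / norm if norm > 0 else reduced  # normalize to unit length
```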

  23. Subsequent Rank Reduction. Once we have computed the replacement vectors, we compute a rank-reduced PCA: it reduces the number of features by 50% and improves the retrieval performance. Low number of features, dense data matrix: use a symmetric eigensolver (in LAPACK terms: xSPEVX, xSYEV, etc.).
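A minimal sketch of such a PCA via a dense symmetric eigensolver, assuming the new document vectors form the rows of X; scipy.linalg.eigh wraps LAPACK's symmetric eigensolvers (the same family as xSYEV / xSPEVX), and the target rank k and the helper name are illustrative choices, not values from the slides.

```python
import numpy as np
from scipy.linalg import eigh

def pca_rank_reduce(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = (Xc.T @ Xc) / (X.shape[0] - 1)      # small, dense, symmetric covariance matrix
    vals, vecs = eigh(C)                    # LAPACK symmetric eigensolver (ascending order)
    top = vecs[:, -k:][:, ::-1]             # top-k eigenvectors, largest eigenvalue first
    return Xc @ top
```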

  24. Evaluation

  25. Reuters Corpus: Reuters corpus, training set, all categories. [Figure: mean precision vs. hit list rank (up to 100) for sparse TF-IDF (47,236 features), vector replacement (535 features), and rank-reduced vector replacement (392 features); mean precision axis from 0.5 to 0.8.]
