  1. Latent Semantic Indexing (Mandar Haldekar, CMSC 676)

  2. Introduction to LSI
  • Retrieval based on word overlap between document and query is not enough
  • Synonymy decreases recall
  • Polysemy decreases precision
  • Retrieval based on the underlying concept or topic is important

  3. Introduction to LSI
  • Assumption: there is some underlying latent/hidden semantic structure in the corpus
  • LSI projects both documents and terms into a lower-dimensional space that represents the latent semantic concepts/topics in the corpus

  4. Technical Details
  • For a term-document matrix A of size t x d and rank r, there exists a factorization via the singular value decomposition (SVD): A = U Σ V^T, where U is t x r, Σ is r x r, and V^T is r x d
  • U and V have orthonormal columns; Σ is an r x r diagonal matrix containing the singular values of A in descending order
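
A minimal numerical sketch of this factorization using NumPy; the toy term-document counts below are invented for illustration and are not from the slides.

```python
import numpy as np

# Toy term-document matrix A (t terms x d documents); the counts are invented.
A = np.array([
    [1, 0, 2, 0],
    [0, 1, 1, 0],
    [3, 0, 0, 1],
    [0, 2, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, with the singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # True: columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # True: columns of V are orthonormal
print(np.allclose(A, U @ np.diag(s) @ Vt))          # True: A is reconstructed exactly
```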

  5. Low Rank Approximation
  • Full decomposition (terms as rows, documents as columns): A (t x d) = U (t x r) Σ (r x r) V^T (r x d)
  • Keeping only the k largest singular values gives the rank-k approximation: A_k (t x d) = U_k (t x k) Σ_k (k x k) V_k^T (k x d)
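
Continuing the NumPy sketch above, the truncation keeps only the k largest singular values; k = 2 is an arbitrary choice for illustration.

```python
k = 2                       # number of latent concepts to keep (illustrative)

U_k  = U[:, :k]             # t x k
S_k  = np.diag(s[:k])       # k x k
Vt_k = Vt[:k, :]            # k x d

# A_k = U_k Σ_k V_k^T has the same shape as A but only rank k.
A_k = U_k @ S_k @ Vt_k
print(A_k.shape, np.linalg.matrix_rank(A_k))   # (5, 4) 2
```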

  6. Query Processing
  • The query q must be projected into the k-dimensional space: q_k = Σ_k^{-1} U_k^T q
  • The required number of top-ranking similar documents is then retrieved
  • The k-dimensional representation captures semantic structure, so queries are placed near related terms and documents in the semantic space even when they share no words with them
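
A sketch of this projection, continuing the variables above; the toy query vector and the cosine ranking are illustrative additions, not part of the slides.

```python
# Query as a term vector over the same t terms (invented example).
q = np.array([1, 0, 0, 0, 1], dtype=float)

# Project the query into the k-dimensional latent space: q_k = inv(Σ_k) U_k^T q.
q_k = np.linalg.inv(S_k) @ U_k.T @ q

# Under this projection, document j's latent coordinates are row j of V_k.
doc_coords = Vt_k.T                      # d x k

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(q_k, d) for d in doc_coords]
print(np.argsort(scores)[::-1])          # document indices, most similar first
```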

  7. Applications
  • Information Filtering
  ◦ Compute the SVD on an initial set of documents
  ◦ Represent the user's interest as one or more document vectors in the latent semantic space
  ◦ New documents matching these vectors are returned
  • Cross-Language Retrieval
  ◦ Apply the SVD to a bilingual corpus to generate a semantic space, then process queries in this space without any query translation
  • Text Summarization
  ◦ Construct a term-sentence matrix and, for each latent pattern (singular vector), select the sentence with the highest weight (see the sketch below)
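
A minimal sketch of the summarization idea, in the spirit of Gong and Liu (2001) as cited in the references: take the SVD of a term-sentence matrix and, for each of the first k right singular vectors, keep the sentence with the largest weight. The matrix values and the choice of k are invented.

```python
import numpy as np

# Term-sentence matrix (terms x sentences); the counts are invented.
term_sentence = np.array([
    [2, 0, 1, 0, 0],
    [0, 1, 0, 2, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 3, 0, 1],
], dtype=float)

_, _, Vt = np.linalg.svd(term_sentence, full_matrices=False)

k = 2                                    # desired summary length in sentences
summary = []
for pattern in Vt[:k]:                   # one row of V^T per latent pattern
    best = int(np.argmax(np.abs(pattern)))
    if best not in summary:
        summary.append(best)
print(summary)                           # indices of the sentences selected for the summary
```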

  8. Current State of Research
  • Issue: scaling LSI to large collections
  • Some recent steps toward it:
  ◦ Sparse LSA, 2010
   - Uses L1 regularization to enforce sparsity constraints on the projection matrix
   - Yields a compact representation
  ◦ Regularized LSI, 2011
   - A new model in which the term-document matrix is represented as the product of two matrices: term-topic and topic-document
   - Also uses regularization to constrain the solution (see the objective sketched below)
   - Main advantage: it can be parallelized
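
As a rough sketch of the Regularized LSI objective referred to above (my reading of Wang et al., 2011; the notation is mine and should be checked against the paper): the term-document matrix D with columns d_n is approximated by a term-topic matrix U (columns u_k) times a topic-document matrix V (columns v_n), with an L1 penalty that makes the topics sparse and an L2 penalty on the document representations. Because the objective decomposes over documents, the updates can be distributed, which is the parallelization advantage noted above.

```latex
\min_{U,\,V}\;\sum_{n=1}^{N}\bigl\lVert d_n - U v_n \bigr\rVert_2^2
\;+\;\lambda_1\sum_{k=1}^{K}\lVert u_k\rVert_1
\;+\;\lambda_2\sum_{n=1}^{N}\lVert v_n\rVert_2^2
```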

  9. References
  S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, September 1990.
  H. Zha and H. D. Simon. On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782-791, 1999.
  Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, United States, pp. 19-25, 2001.
  M. Littman, S. T. Dumais, and T. K. Landauer. Automatic cross-language information retrieval using latent semantic indexing. AAAI-97 Spring Symposium Series: Cross-Language Text and Speech Retrieval, pp. 18-24, Stanford University, 1997.
  Q. Wang, J. Xu, H. Li, and N. Craswell. Regularized latent semantic indexing. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), pp. 685-694, New York, NY, USA, 2011.
  X. Chen, B. Bai, Y. Qi, Q. Lin, and J. Carbonell. Sparse latent semantic analysis. NIPS Workshop, 2010.
  M. Berry, S. Dumais, and G. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, December 1995.

  10. Thank you! Questions?
