Latent Semantic Indexing: A Regularized Approach to Large-Scale Modeling
Parth Guntoorkar | parth.gun@umbc.edu | WI52610
INTRODUCTION
- LSI finds the hidden (latent) relationships between words (semantics) in order to
improve information understanding (indexing).
- Document similarity is defined by the patterns in which those words occur or do
not occur together.
- LSI computes a low-rank approximation of the term-document matrix (typical rank
100-300).
- Retrieval is based on the underlying meaning or subject of a document rather than
on exact keyword matches.
GENERAL IDEA
- Map documents (and terms) to a low-dimensional representation.
- Design a mapping such that the low-dimensional space reflects semantic
associations (latent semantic space).
- Compute document similarity based on the inner product in this latent
semantic space.
- It uses SVD (Singular Value Decomposition).
- SVD decomposes a matrix into a product of three matrices. For a term-document
matrix A of size t x d and rank r, the SVD factorization is

  A = U Σ V^T

where U and V are the left and right singular matrices respectively, and Σ is an
r x r diagonal matrix containing the singular values of A in descending order.
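As a concrete illustration, the short NumPy sketch below computes the SVD of a small term-document matrix and keeps only the top k singular triplets, which gives the best rank-k approximation in the Frobenius norm. The matrix values and the choice k = 2 are illustrative assumptions, not taken from the poster.

  import numpy as np

  # Illustrative term-document matrix (terms x documents); values are made up.
  A = np.array([[2., 0., 1., 0.],
                [1., 1., 0., 0.],
                [0., 2., 0., 1.],
                [0., 0., 1., 2.]])

  # Full SVD, then truncate to the top k singular triplets.
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 2
  U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

  # A_k is the best rank-k approximation of A in the Frobenius norm.
  A_k = U_k @ S_k @ Vt_k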
BUILDING LSI
- 1. Preprocess the collection of documents:
  a. Stemming
  b. Removing stop words
- 2. Build the frequency matrix.
- 3. Apply pre-weights (e.g., tf-idf).
- 4. Decompose the frequency matrix into U, S, V via SVD.
- 5. Project queries into the latent space (see the sketch below).
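A minimal sketch of step 5, continuing the earlier NumPy example: a query q (a term-frequency vector over the same vocabulary) is folded into the latent space as q_hat = S_k^-1 U_k^T q and compared to documents by cosine similarity. The query vector here is an assumed example.

  import numpy as np

  # Same illustrative matrix and truncated SVD as in the earlier sketch.
  A = np.array([[2., 0., 1., 0.],
                [1., 1., 0., 0.],
                [0., 2., 0., 1.],
                [0., 0., 1., 2.]])
  U, s, Vt = np.linalg.svd(A, full_matrices=False)
  k = 2
  U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

  # Fold a query into the k-dimensional latent space: q_hat = S_k^{-1} U_k^T q.
  q = np.array([1., 0., 1., 0.])                     # assumed query term vector
  q_hat = np.linalg.inv(S_k) @ (U_k.T @ q)

  # Documents live in the columns of Vt_k; rank them by cosine similarity.
  sims = (Vt_k.T @ q_hat) / (
      np.linalg.norm(Vt_k, axis=0) * np.linalg.norm(q_hat) + 1e-12)
  ranking = np.argsort(-sims)                        # most similar documents first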
WHY USE LSI
- Provides a defense against ‘keyword stuffing’.
- LSI targets synonymy and polysemy.
- It generally produces better retrieval results and better-ranked pages.
ISSUE AND SOLUTION
- The main issue with LSI is scalability: scaling the SVD computation to larger
document collections via parallelization is difficult.
- A few alternatives exist, such as PLSI (Probabilistic LSI) and LDA (Latent
Dirichlet Allocation), but most solutions require drastic steps such as vastly
reducing the input vocabulary.
- Regularized LSI addresses this problem by representing the term-document matrix
as the product of two matrices: a term-topic matrix and a topic-document matrix.
- It also uses regularization to constrain the solution.
- The main advantage is that it can be parallelized.
REGULARIZED LSI (RLSI)
- RLSI is different from LSI in that it uses regularization instead of orthogonality
to constrain the solution.
- Two methods of RLSI:
  ○ Batch Regularized Latent Semantic Indexing (bRLSI)
  ○ Online Regularized Latent Semantic Indexing (oRLSI)
- Both methods are formalized as the minimization of a quadratic loss function
regularized by the ℓ1 and/or ℓ2 norm (see the objective sketched below).
- The collection is represented as a term-document matrix, where each entry
represents the occurrence (or tf-idf score) of a term in a document.
- The term-document matrix is then approximated by the product of two
matrices: a term-topic matrix and a topic-document matrix.
  ○ Term-topic matrix: represents the latent topics with terms.
  ○ Topic-document matrix: represents the documents with topics.
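To make this concrete, the sketch below implements one round of alternating minimization for the Uℓ1-Vℓ2 variant of the objective

  min over U, V of  ||D - UV||_F^2 + λ1 ||U||_1 + λ2 ||V||_F^2

following the formulation of Wang et al. [1]. The V-update is a closed-form ridge regression; the U-update here uses a generic proximal-gradient (ISTA) step with soft-thresholding, which is a simplified stand-in, not the paper's exact update rules. The λ values and iteration counts are assumptions.

  import numpy as np

  def rlsi_round(D, U, V, lam1=0.5, lam2=1.0, ista_steps=20):
      """One alternating-minimization round for
         min ||D - UV||_F^2 + lam1*||U||_1 + lam2*||V||_F^2."""
      k = U.shape[1]
      # V-update: ridge regression, closed form (U^T U + lam2 I)^{-1} U^T D.
      V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
      # U-update: proximal gradient (ISTA) on the l1-penalized least squares.
      L = 2.0 * np.linalg.norm(V @ V.T, 2) + 1e-12  # Lipschitz constant of grad
      for _ in range(ista_steps):
          grad = 2.0 * (U @ V - D) @ V.T            # grad of ||D - UV||_F^2 in U
          U = U - grad / L
          U = np.sign(U) * np.maximum(np.abs(U) - lam1 / L, 0.0)  # soft-threshold
      return U, V

Because the V-update decomposes over document columns and the U-update over term rows, both steps can be distributed across machines, which is the parallelization advantage noted above.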
PERFORMANCE OF RLSI
- TREC datasets are used to compare different RLSI regularization strategies
and to compare RLSI with existing topic modeling methods.
- TREC datasets used were AP, WSJ, and OHSUMED, which are widely used
in relevance ranking experiments.
- Different regularization strategies were compared on (batch) RLSI, e.g.
RLSI (Uℓ1-Vℓ2), RLSI (Uℓ2-Vℓ1), RLSI (Uℓ1-Vℓ1), and RLSI (Uℓ2-Vℓ2).
- [Table: Topics Discovered by RLSI Variants on AP]
- Average topic compactness is defined as the average ratio of terms with nonzero
weights per topic (a one-line computation is sketched below).
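A minimal sketch of this metric, assuming U is a term-topic matrix with one column per topic:

  import numpy as np

  def avg_topic_compactness(U):
      # Fraction of vocabulary terms with nonzero weight, averaged over topics.
      return (np.count_nonzero(U, axis=0) / U.shape[0]).mean()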
- [Tables: Retrieval Performance of RLSI Variants on AP and on WSJ]
- Topic-matching scores were combined with term-matching scores given by the
conventional IR model BM25 (a sketch of the combination follows this list).
- Normalized Discounted Cumulative Gain (NDCG) is the measure of ranking quality used.
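The sketch below shows one plausible way to combine the two scores by linear interpolation, together with an NDCG@k computation. The mixing weight alpha and the function names are illustrative assumptions, not the paper's exact interpolation scheme.

  import numpy as np

  def combined_score(bm25_score, topic_score, alpha=0.5):
      # Linear interpolation of term matching (BM25) and topic matching;
      # alpha is an assumed mixing weight, tuned on held-out queries in practice.
      return alpha * topic_score + (1.0 - alpha) * bm25_score

  def ndcg_at_k(relevances, k=10):
      # relevances: graded labels of all results, in ranked order.
      gains = 2.0 ** np.asarray(relevances, dtype=float) - 1.0
      discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
      dcg = np.sum(gains[:k] * discounts[:k])
      ideal = np.sort(gains)[::-1]                  # best possible ordering
      idcg = np.sum(ideal[:k] * discounts[:k])
      return dcg / idcg if idcg > 0 else 0.0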
- [Table: Retrieval Performance of different methods on the AP dataset]
- RLSI variants were compared in terms of topic readability, topic compactness,
and retrieval performance.
- It is better practice to apply the ℓ1 norm on U and the ℓ2 norm on V in RLSI
to achieve good topic readability, topic compactness, and retrieval performance,
where U is the term-topic matrix and V is the topic-document matrix.
APPLICATIONS
- Cross-Language Retrieval
  ○ Apply SVD on a bilingual corpus to generate a semantic space, then process
queries in this semantic space without any query translation.
- Text Summarization (see the sketch after this list)
  ○ Construct a term-sentence matrix and select the sentence with the highest
weight for each of the leading singular patterns.
- Search Engine Optimization (SEO)
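As an illustration of the summarization idea, here is a minimal sketch of one standard selection rule (a Gong & Liu-style heuristic, which is an assumption about the intended method): take the SVD of the term-sentence matrix and, for each leading singular pattern, keep the sentence with the largest weight in the corresponding right-singular row.

  import numpy as np

  def lsa_summarize(term_sentence, n_sentences=3):
      # term_sentence: weighted term-sentence matrix (terms x sentences).
      U, s, Vt = np.linalg.svd(term_sentence, full_matrices=False)
      chosen = []
      # For each top singular pattern, keep the sentence that contributes
      # most strongly to that pattern.
      for row in Vt:
          idx = int(np.argmax(np.abs(row)))
          if idx not in chosen:
              chosen.append(idx)
          if len(chosen) == n_sentences:
              break
      return sorted(chosen)    # sentence indices in document order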
REFERENCES
- 1. Wang, Q., Xu, J., Li, H., & Craswell, N. (2013). Regularized Latent Semantic
Indexing. ACM Transactions on Information Systems, 31(1), 1–44.
DOI: 10.1145/2414782.2414787
- 2. Atreya, A., & Elkan, C. (2011). Latent Semantic Indexing (LSI) Fails for TREC
Collections. ACM SIGKDD Explorations Newsletter, 12(2), 5.
DOI: 10.1145/1964897.1964900
- 3. Chen, X., Qi, Y., Bai, B., Lin, Q., & Carbonell, J. G. (2011). Sparse Latent
Semantic Analysis. Proceedings of the 2011 SIAM International Conference
on Data Mining. DOI: 10.1137/1.9781611972818.41
- 4. Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction
and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation
and Beyond. In Mining Text Data. Springer.