SLIDE 1

Latent Semantic Indexing: A Regularized Approach to Large-Scale Modeling

Parth Guntoorkar parth.gun@umbc.edu WI52610

SLIDE 2

INTRODUCTION

  • LSI finds the hidden (latent) relationships between words (semantics) in order to improve indexing and retrieval.
  • Document similarity is defined by the ways in which those words do or do not co-occur.
  • LSI performs a low-rank approximation of the term-document matrix (typical rank 100-300).
  • Retrieval is based on the underlying meaning or subject of a document, not on exact keyword matches.
SLIDE 3

EXAMPLE

SLIDE 4

GENERAL IDEA

  • Map documents (and terms) to a low-dimensional representation.
  • Design a mapping such that the low-dimensional space reflects semantic associations (the latent semantic space).
  • Compute document similarity based on the inner product in this latent semantic space.
  • The mapping uses the SVD (Singular Value Decomposition).
SLIDE 5
  • SVD decomposes a matrix into the product of three matrices. For a term-document matrix A of size t x d and rank r, there exists a factorization

        A = U Σ Vᵀ

  • where U (t x r) and V (d x r) are the left and right singular matrices respectively, and Σ is an r x r diagonal matrix containing the singular values of A in descending order.
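A minimal sketch of this decomposition in Python (NumPy), assuming a toy term-document matrix; the matrix values and term labels are illustrative. The rank-k truncation keeps only the k largest singular values, which is exactly the low-rank approximation LSI relies on:

```python
import numpy as np

# Toy term-document matrix A (t terms x d documents); entries are raw counts.
A = np.array([
    [2, 0, 1, 0],   # "car"
    [1, 0, 2, 0],   # "automobile"
    [0, 3, 0, 1],   # "flower"
    [0, 1, 0, 2],   # "petal"
], dtype=float)

# Full SVD: A = U @ np.diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation (k latent topics); LSI typically uses k in the hundreds.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k   # best rank-k approximation of A

print(np.round(A_k, 2))
```

Note how "car" and "automobile" end up with similar rows in the rank-2 reconstruction even though they never co-occur in the same document: the truncation smears weight across terms that share contexts.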

SLIDE 6

BUILDING LSI

  • 1. Preprocess the collection of documents.
    ○ a. Stemming
    ○ b. Removing stop words
  • 2. Build the frequency matrix (FM)
  • 3. Apply pre-weights
  • 4. Decompose the FM into U, S, V
  • 5. Project queries (see the sketch below)
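A compact sketch of steps 2-5, assuming scikit-learn's TfidfVectorizer for stop-word removal and pre-weighting (stemming is omitted) and the standard LSI fold-in formula q̂ = Sₖ⁻¹ Uₖᵀ q for query projection; the documents and query are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Steps 1-3: stop-word removal and tf-idf weighting (stemming omitted here).
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a flower grows in the garden",
]
vec = TfidfVectorizer(stop_words="english")
A = vec.fit_transform(docs).T.toarray()   # term-document matrix (t x d)

# Step 4: decompose the frequency matrix into U, S, V and truncate to k topics.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], np.diag(s[:k])

# Step 5: fold a query into the latent space: q_hat = S_k^-1 U_k^T q.
q = vec.transform(["car on the highway"]).T.toarray().ravel()
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Rank documents by cosine similarity to the query in topic space.
docs_k = Vt[:k, :].T                      # each row: a document in topic space
norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_hat) + 1e-12
sims = docs_k @ q_hat / norms
print(np.argsort(-sims))                  # document indices, best match first
```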
SLIDE 7

WHY USE LSI

  • Provides a defense against ‘keyword stuffing’.
  • LSI targets synonymy and polysemy.
  • It also gives better results and better-ranked pages.
SLIDE 8

ISSUE AND SOLUTION

  • The main issue with LSI is scalability: scaling to larger document collections via parallelization is difficult.
  • A few alternatives are available, such as PLSI (Probabilistic LSI) and LDA (Latent Dirichlet Allocation), but most solutions require a drastic step such as vastly reducing the input vocabulary.
  • Regularized LSI is a solution to this problem, in which the term-document matrix is represented as the product of two matrices: term-topic and topic-document.
  • It also uses regularization to constrain the solution.
  • The main advantage is that it can be parallelized.
SLIDE 9

REGULARIZED LSI (RLSI)

  • RLSI differs from LSI in that it uses regularization instead of orthogonality to constrain the solution.
  • Two methods of RLSI:
    ○ batch Regularized Latent Semantic Indexing (bRLSI)
    ○ online Regularized Latent Semantic Indexing (oRLSI)
  • Both methods are formalized as the minimization of a quadratic loss function regularized by the ℓ1 and/or ℓ2 norm (the objective is sketched below).
  • The collection is represented as a term-document matrix, where each entry is the occurrence count (or tf-idf score) of a term in a document.
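Concretely, with D the term-document matrix, U the term-topic matrix, and V the topic-document matrix, the objective (in the Uℓ1-Vℓ2 variant that the experiments below end up favoring) can be written as follows; this is a reading of Wang et al. (2013), reference 1:

    min over U, V:  ‖D − UV‖²_F + λ₁ Σₖ ‖uₖ‖₁ + λ₂ Σₙ ‖vₙ‖²₂

where uₖ is the k-th column of U (a topic), vₙ is the n-th column of V (a document's topic representation), and λ₁, λ₂ control topic sparsity and shrinkage respectively.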

SLIDE 10
  • The term-document matrix is then approximated by the product of two matrices: a term-topic matrix and a topic-document matrix.
    ○ term-topic matrix: represents the latent topics with terms
    ○ topic-document matrix: represents the documents with topics
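A minimal sketch of the alternating updates under the Uℓ1-Vℓ2 objective above, assuming NumPy only: with U fixed, each document column of V has a closed-form ridge solution; with V fixed, each row of U is a lasso problem, solved here by plain coordinate descent with soft-thresholding. The function names and hyperparameters are illustrative, and this is an illustrative reading of the method rather than the paper's optimized implementation:

```python
import numpy as np

def soft_threshold(x, t):
    """Elementwise soft-thresholding operator for the l1 (lasso) update."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def rlsi(D, k, lam1=0.1, lam2=0.1, iters=20):
    """Factor the term-document matrix D (t x d) into U (t x k) @ V (k x d)."""
    t, d = D.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((t, k)) * 0.01
    V = np.zeros((k, d))
    for _ in range(iters):
        # V-update (ridge, closed form): independent per document column,
        # so the columns could be solved in parallel across machines.
        V = np.linalg.solve(U.T @ U + lam2 * np.eye(k), U.T @ D)
        # U-update (lasso via coordinate descent): independent per term row.
        S, R = V @ V.T, V @ D.T        # k x k and k x t sufficient statistics
        for _ in range(5):             # a few coordinate-descent sweeps
            for j in range(k):
                resid = R[j] - S[j] @ U.T + S[j, j] * U[:, j]
                U[:, j] = soft_threshold(resid, lam1 / 2) / (S[j, j] + 1e-12)
    return U, V
```

Because the V-update touches one document at a time and the U-update one term at a time, both loops distribute naturally, which is the parallelization advantage claimed on the previous slide.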

SLIDE 11

Performance of RLSI

  • TREC datasets are used to compare different RLSI regularization strategies and to compare RLSI with existing topic modeling methods.
  • The TREC datasets used were AP, WSJ, and OHSUMED, which are widely used in relevance ranking experiments.
  • Different regularization strategies were compared on (batch) RLSI, for example RLSI (Uℓ1-Vℓ2), RLSI (Uℓ2-Vℓ1), RLSI (Uℓ1-Vℓ1), and RLSI (Uℓ2-Vℓ2).

SLIDE 12
  • Topics Discovered by RLSI Variants on AP
  • Average topic compactness is defined as the average ratio of terms with nonzero weights per topic.
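Under that definition, compactness is directly computable from the term-topic matrix; a short sketch assuming a NumPy array `U` like the one produced above:

```python
import numpy as np

def avg_topic_compactness(U, eps=1e-12):
    """Fraction of terms with nonzero weight, averaged over the k topics."""
    return float((np.abs(U) > eps).mean(axis=0).mean())
```

The ℓ1-regularized variants drive most entries of U to exactly zero, so they score far lower (more compact) on this measure than the ℓ2-only variants.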

SLIDE 13

Retrieval Performance of RLSI Variants on AP and WSJ

  • Topic-matching scores were combined with term-matching scores given by the conventional IR model BM25 (see the sketch below).
  • Normalized Discounted Cumulative Gain (NDCG) is a measure of ranking quality.
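A hedged sketch of both ideas, assuming the combination is a simple linear interpolation between a BM25 score and cosine similarity in topic space (the weight `alpha` and the function names are illustrative); the NDCG helper is simplified in that the ideal ranking is computed from the same truncated list of judged relevances:

```python
import numpy as np

def combined_score(bm25_score, q_topic, d_topic, alpha=0.5):
    """Interpolate a term-matching score (BM25) with a topic-matching score."""
    cos = q_topic @ d_topic / (
        np.linalg.norm(q_topic) * np.linalg.norm(d_topic) + 1e-12
    )
    return alpha * bm25_score + (1.0 - alpha) * cos

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranked list divided by DCG of its ideal reordering."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    idcg = float((np.sort(rel)[::-1] * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```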
SLIDE 14

Retrieval Performance of Different Methods on the AP Dataset

SLIDE 15
  • Compared RLSI variants in terms of topic readability, topic compactness, and retrieval performance.
  • It is better practice to apply the ℓ1 norm on U and the ℓ2 norm on V in RLSI to achieve good topic readability, topic compactness, and retrieval performance.
  • Here U is the term-topic matrix and V is the topic-document matrix.
SLIDE 16

APPLICATIONS

  • Cross-Language Retrieval
    ○ Apply SVD to a bilingual corpus to generate a shared semantic space, then process queries in this semantic space without any query translation.
  • Text Summarization
    ○ Construct a term-sentence matrix and select the sentence with the highest singular value for each pattern (see the sketch below).
  • Search Engine Optimization (SEO)
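A minimal sketch of the SVD-based summarization idea the second bullet describes: build a term-sentence matrix, then for each of the strongest singular vectors (patterns) pick the sentence that loads highest on it. The use of CountVectorizer and the function name are illustrative choices:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def lsa_summarize(sentences, n_pick=2):
    """Pick one sentence per top singular vector (pattern) as the summary."""
    A = CountVectorizer(stop_words="english").fit_transform(sentences).T.toarray()
    _, _, Vt = np.linalg.svd(A, full_matrices=False)   # Vt: patterns x sentences
    chosen = []
    for row in Vt[:n_pick]:                 # strongest patterns first
        idx = int(np.argmax(np.abs(row)))   # sentence loading highest on pattern
        if idx not in chosen:
            chosen.append(idx)
    return [sentences[i] for i in sorted(chosen)]
```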
SLIDE 17
REFERENCES

  • 1. Wang, Q., Xu, J., Li, H., & Craswell, N. (2013). Regularized Latent Semantic Indexing. ACM Transactions on Information Systems, 31(1), 1–44. DOI: 10.1145/2414782.2414787
  • 2. Atreya, A., & Elkan, C. (2011). Latent semantic indexing (LSI) fails for TREC collections. ACM SIGKDD Explorations Newsletter, 12(2), 5. DOI: 10.1145/1964897.1964900
  • 3. Chen, X., Qi, Y., Bai, B., Lin, Q., & Carbonell, J. G. (2011). Sparse Latent Semantic Analysis. Proceedings of the 2011 SIAM International Conference on Data Mining. DOI: 10.1137/1.9781611972818.41
  • 4. Crain, S. P., Zhou, K., Yang, S.-H., & Zha, H. (2012). Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond. In Aggarwal, C., & Zhai, C. (Eds.), Mining Text Data. Springer, Boston, MA.

SLIDE 18

Thank You!!