
Latent Semantic Indexing
Information Systems M, Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/



  1. Two major problems plague the Vector Space Model:
     - Synonymy: there are many ways to refer to the same object, e.g. car and automobile. Synonymy leads to poor recall.
     - Polysemy: most words have more than one distinct meaning, e.g. model, python, chip. Polysemy leads to poor precision.
     [Figure: example term vectors for doc 1, doc 2, doc 3, contrasting synonymy (low similarity, but related documents) with polysemy (high similarity, but unrelated documents).]

  2. Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA) when not applied to IR, was proposed at the end of the 1980s as a way to solve these problems, http://lsi.argreenhouse.com/lsi/LSI.html
     - The basic observation is that terms are an unreliable means to assess the relevance of a document wrt a query, because of synonymy and polysemy.
     - Thus, one would like to represent documents in a more semantically accurate way, i.e., in terms of "concepts".
     - LSI achieves this by analyzing the whole term-document matrix W and projecting it into a lower-dimensional "latent" space spanned by the relevant "concepts".
     - More precisely, LSI uses a linear algebra technique, called Singular Value Decomposition (SVD), before performing dimensionality reduction.
     Eigenvalues and eigenvectors:
     - Given a square m x m matrix S, a non-null vector v is an eigenvector of S if there exists a scalar $\lambda$, called the eigenvalue of v, such that $S v = \lambda v$.
     - The linear transformation associated with S does not change the directions of its eigenvectors, which are just stretched/shrunk by an amount given by the corresponding eigenvalue.
     - There are at most m distinct eigenvalues, which are the solutions of the characteristic equation $\det(S - \lambda I) = 0$.
     - For each eigenvalue there are infinitely many corresponding eigenvectors: if v is an eigenvector, so is kv for any $k \neq 0$. Thus, we can consider normalized eigenvectors, $\|v\| = 1$ (see the sketch below).
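As a small numeric check (not part of the original slides; it assumes numpy is available), the sketch below computes the eigenvalues and normalized eigenvectors of a symmetric matrix and verifies $S v = \lambda v$:

```python
import numpy as np

# The symmetric matrix used in the example on the next slide.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# numpy.linalg.eigh is specialized for real symmetric matrices: it returns
# real eigenvalues (in ascending order) and orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(S)
print(eigenvalues)  # [1. 3.]

# Each column of `eigenvectors` is a normalized eigenvector: S v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(S @ v, lam * v)
    assert np.isclose(np.linalg.norm(v), 1.0)
```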

  3. Real and symmetric matrices. If S is a real and symmetric matrix, then:
     - all its eigenvalues are real;
     - all the (normalized) eigenvectors of distinct eigenvalues are mutually orthogonal (thus, linearly independent).
     Example: for
     $$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
     the characteristic equation is $|S - \lambda I| = (2 - \lambda)^2 - 1 = 0$, which yields
     $$\lambda_1 = 3,\ v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}; \qquad \lambda_2 = 1,\ v_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$$
     If S has m linearly independent eigenvectors, then it can be written as
     $$S = U \Lambda U^T, \qquad s_{i,j} = \sum_{c=1}^{m} u_{i,c}\, \lambda_c\, u_{j,c}$$
     where:
     - $\Lambda$ is a diagonal matrix, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$;
     - the columns of U are the corresponding eigenvectors;
     - U is a column-orthonormal matrix.
     Example: with the matrix S above,
     $$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix} = U \Lambda U^T$$
     and, for instance,
     $$s_{1,2} = \sum_{c=1}^{2} u_{1,c}\, \lambda_c\, u_{2,c} = \tfrac{1}{\sqrt{2}} \cdot 3 \cdot \tfrac{1}{\sqrt{2}} + \tfrac{1}{\sqrt{2}} \cdot 1 \cdot \left(-\tfrac{1}{\sqrt{2}}\right) = \tfrac{3}{2} - \tfrac{1}{2} = 1$$
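A minimal sketch (again assuming numpy) that verifies the spectral decomposition $S = U \Lambda U^T$ and the element-wise sum for $s_{1,2}$ shown on this slide:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of U are orthonormal eigenvectors; reorder so lambda_1 >= lambda_2.
eigenvalues, U = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
Lam = np.diag(eigenvalues[order])   # Lam = diag(3, 1)
U = U[:, order]

# Spectral decomposition: S = U Lam U^T.
assert np.allclose(S, U @ Lam @ U.T)

# Element-wise form: s_{1,2} = sum_c u_{1,c} * lambda_c * u_{2,c} = 1.
s12 = sum(U[0, c] * Lam[c, c] * U[1, c] for c in range(2))
assert np.isclose(s12, S[0, 1])
```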

  4. Singular Value Decomposition. Consider the M x N term-document weight matrix W. If W has rank $r \le \min\{M, N\}$, then W can be factorized as
     $$W = T \Lambda D^T, \qquad w_{i,j} = \sum_{c=1}^{r} t_{i,c}\, \lambda_c\, d_{j,c}$$
     where:
     - T is an M x r column-orthonormal matrix ($T^T T = I$);
     - $\Lambda$ is an r x r diagonal matrix;
     - D is an N x r column-orthonormal matrix ($D^T D = I$).
     $\Lambda$ is also called the "concept matrix", T is the "term-concept similarity matrix", and D is the "document-concept similarity matrix".
     SVD thus represents both terms and documents using a set of latent concepts: the weight of $t_i$ in doc j is expressed as a "linear combination of term-concept and doc-concept weights" (a numpy sketch follows).
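A minimal sketch of the factorization itself, assuming numpy. Note that numpy's convention $W = U\,\mathrm{diag}(s)\,V^T$ maps onto the slide's notation as T = U, $\Lambda = \mathrm{diag}(s)$, D = V; the tiny 3 x 3 matrix here is only a stand-in:

```python
import numpy as np

# Any M x N term-document matrix works; a tiny 3 x 3 one stands in here.
W = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

T, s, Dt = np.linalg.svd(W, full_matrices=False)  # Dt is D^T
Lam = np.diag(s)                                  # singular values, descending
D = Dt.T

r = len(s)
assert np.allclose(W, T @ Lam @ D.T)       # W = T Lambda D^T
assert np.allclose(T.T @ T, np.eye(r))     # T is column-orthonormal
assert np.allclose(D.T @ D, np.eye(r))     # D is column-orthonormal
```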

  5. Example (1). Consider the 12 x 9 weight matrix below, whose rank is r = 9, in which two "groups" of documents are present:

     W          C1  C2  C3  C4  C5  G1  G2  G3  G4
     Human       1   0   0   1   0   0   0   0   0
     Interface   1   0   1   0   0   0   0   0   0
     Computer    1   1   0   0   0   0   0   0   0
     User        0   1   1   0   1   0   0   0   0
     System      0   1   1   2   0   0   0   0   0
     Response    0   1   0   0   1   0   0   0   0
     Time        0   1   0   0   1   0   0   0   0
     EPS         0   0   1   1   0   0   0   0   0
     Survey      0   1   0   0   0   0   0   0   1
     Tree        0   0   0   0   0   1   1   1   0
     Graph       0   0   0   0   0   0   1   1   1
     Minors      0   0   0   0   0   0   0   1   1

     Example (2). For this matrix, $\Lambda = \mathrm{diag}(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.35)$.
     [The original slide also shows the full numeric matrices T (12 x 9, term-concept weights) and D (9 x 9, document-concept weights); their row/column layout is not recoverable from this extraction.]
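To reproduce $\Lambda$ from Example (2), one can feed the matrix of Example (1) to numpy's SVD (a sketch, assuming numpy; the printed singular values should match the diagonal of $\Lambda$ above, up to rounding):

```python
import numpy as np

# The 12 x 9 term-document matrix of Example (1); rows follow the term order
# Human, Interface, Computer, User, System, Response, Time, EPS, Survey,
# Tree, Graph, Minors; columns are C1..C5, G1..G4.
W = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(W, full_matrices=False)
print(np.round(s, 2))  # compare with Lambda = diag(3.34, 2.54, ...) above
```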

  6. Properties of the SVD (1). Since both T and D are column-orthonormal matrices, we have
     $$W W^T = (T \Lambda D^T)(T \Lambda D^T)^T = (T \Lambda D^T)(D \Lambda T^T) = T \Lambda^2 T^T$$
     $$W^T W = (D \Lambda T^T)(T \Lambda D^T) = D \Lambda^2 D^T$$
     - $W W^T$ is the (real and symmetric) M x M term-term similarity matrix, and the columns of T are the eigenvectors of this matrix.
     - $W^T W$ is the N x N document-document similarity matrix, and the columns of D are the eigenvectors of this matrix.
     - $\Lambda^2$ is a diagonal matrix with the eigenvalues of $W W^T$ (and of $W^T W$).
     Properties of the SVD (2). Since $T^T W = \Lambda D^T$, we can view this as a "projection" of the documents (the columns of W) into the r-dimensional "concept space" spanned by the columns of T (i.e., the rows of $T^T$).
     - In this space, documents are represented by the columns of $\Lambda D^T$ (i.e., the rows of $D \Lambda$).
     - It follows that $W^T W = (D \Lambda)(\Lambda D^T)$ amounts to computing the similarity between documents as the inner product in this r-dimensional latent semantic space.
     - Similarly, in this space terms are represented by the columns of $\Lambda T^T$ (i.e., the rows of $T \Lambda$), as in the sketch below.
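A minimal sketch of the projection step (assuming numpy; a random matrix stands in for W), checking that documents in concept space are the columns of $\Lambda D^T = T^T W$ and that inner products there reproduce $W^T W$:

```python
import numpy as np

# A random full-rank matrix stands in for the term-document matrix W.
rng = np.random.default_rng(0)
W = rng.random((12, 9))

T, s, Dt = np.linalg.svd(W, full_matrices=False)  # Dt is D^T

# Documents in the r-dimensional concept space: columns of Lambda @ D^T,
# which equals T^T @ W because T is column-orthonormal.
doc_concepts = np.diag(s) @ Dt
assert np.allclose(doc_concepts, T.T @ W)

# Inner products in concept space reproduce the document-document
# similarities W^T W = (D Lambda)(Lambda D^T).
assert np.allclose(doc_concepts.T @ doc_concepts, W.T @ W)

# Symmetrically, terms are the columns of Lambda @ T^T (rows of T @ Lambda).
term_concepts = np.diag(s) @ T.T
assert np.allclose(term_concepts, Dt @ W.T)
```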
