
Latent Semantic Indexing
Information Systems M, Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/



  1. Two major problems plague the Vector Space Model:
     - Synonymy: there are many ways to refer to the same object, e.g. car and automobile. Synonymy leads to poor recall.
     - Polysemy: most words have more than one distinct meaning, e.g. model, python, chip. Polysemy leads to poor precision.
     [Figure: example term vectors for doc 1, doc 2, doc 3, contrasting synonymy (low similarity, but related documents) with polysemy (high similarity, but unrelated documents).]

  2. Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA) when not applied to IR, was proposed at the end of the 1980s as a way to solve these problems, http://lsi.argreenhouse.com/lsi/LSI.html
     - The basic observation is that terms are an unreliable means to assess the relevance of a document wrt a query, because of synonymy and polysemy.
     - Thus, one would like to represent documents in a more semantically accurate way, i.e., in terms of "concepts".
     - LSI achieves this by analyzing the whole term-document matrix W and projecting it into a lower-dimensional "latent" space spanned by the relevant "concepts".
     - More precisely, LSI uses a linear algebra technique, called Singular Value Decomposition (SVD), before performing dimensionality reduction.
     Eigenvalues and eigenvectors:
     - Given a square m x m matrix S, a non-null vector v is an eigenvector of S if there exists a scalar $\lambda$, called the eigenvalue of v, such that $S v = \lambda v$.
     - The linear transformation associated with S does not change the directions of its eigenvectors, which are just stretched/shrunk by an amount given by the corresponding eigenvalue.
     - There are at most m distinct eigenvalues, which are the solutions of the characteristic equation $\det(S - \lambda I) = 0$.
     - For each eigenvalue there are infinitely many corresponding eigenvectors: if v is an eigenvector, so is kv for any $k \neq 0$. Thus, we can consider normalized eigenvectors, $\|v\| = 1$ (see the sketch below).
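As a small numeric check (not part of the original slides; it assumes numpy is available), the sketch below computes the eigenvalues and normalized eigenvectors of a symmetric matrix and verifies $S v = \lambda v$:

```python
import numpy as np

# The symmetric matrix used in the example on the next slide.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# numpy.linalg.eigh is specialized for real symmetric matrices: it returns
# real eigenvalues (in ascending order) and orthonormal eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(S)
print(eigenvalues)  # [1. 3.]

# Each column of `eigenvectors` is a normalized eigenvector: S v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(S @ v, lam * v)
    assert np.isclose(np.linalg.norm(v), 1.0)
```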

  3. Real and symmetric matrices. If S is a real and symmetric matrix, then:
     - all its eigenvalues are real;
     - all the (normalized) eigenvectors of distinct eigenvalues are mutually orthogonal (thus, linearly independent).
     Example: for
     $$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
     the characteristic equation is $|S - \lambda I| = (2 - \lambda)^2 - 1 = 0$, which yields
     $$\lambda_1 = 3,\ v_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}; \qquad \lambda_2 = 1,\ v_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$$
     If S has m linearly independent eigenvectors, then it can be written as
     $$S = U \Lambda U^T, \qquad s_{i,j} = \sum_{c=1}^{m} u_{i,c}\, \lambda_c\, u_{j,c}$$
     where:
     - $\Lambda$ is a diagonal matrix, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m)$, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m$;
     - the columns of U are the corresponding eigenvectors;
     - U is a column-orthonormal matrix.
     Example: with the matrix S above,
     $$S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix} = U \Lambda U^T$$
     and, for instance,
     $$s_{1,2} = \sum_{c=1}^{2} u_{1,c}\, \lambda_c\, u_{2,c} = \tfrac{1}{\sqrt{2}} \cdot 3 \cdot \tfrac{1}{\sqrt{2}} + \tfrac{1}{\sqrt{2}} \cdot 1 \cdot \left(-\tfrac{1}{\sqrt{2}}\right) = \tfrac{3}{2} - \tfrac{1}{2} = 1$$
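A minimal sketch (again assuming numpy) that verifies the spectral decomposition $S = U \Lambda U^T$ and the element-wise sum for $s_{1,2}$ shown on this slide:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of U are orthonormal eigenvectors; reorder so lambda_1 >= lambda_2.
eigenvalues, U = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
Lam = np.diag(eigenvalues[order])   # Lam = diag(3, 1)
U = U[:, order]

# Spectral decomposition: S = U Lam U^T.
assert np.allclose(S, U @ Lam @ U.T)

# Element-wise form: s_{1,2} = sum_c u_{1,c} * lambda_c * u_{2,c} = 1.
s12 = sum(U[0, c] * Lam[c, c] * U[1, c] for c in range(2))
assert np.isclose(s12, S[0, 1])
```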

  4. Singular Value Decomposition. Consider the M x N term-document weight matrix W. If W has rank $r \le \min\{M, N\}$, then W can be factorized as
     $$W = T \Lambda D^T, \qquad w_{i,j} = \sum_{c=1}^{r} t_{i,c}\, \lambda_c\, d_{j,c}$$
     where:
     - T is an M x r column-orthonormal matrix ($T^T T = I$);
     - $\Lambda$ is an r x r diagonal matrix;
     - D is an N x r column-orthonormal matrix ($D^T D = I$).
     $\Lambda$ is also called the "concept matrix", T is the "term-concept similarity matrix", and D is the "document-concept similarity matrix".
     SVD thus represents both terms and documents using a set of latent concepts: the weight of $t_i$ in doc j is expressed as a "linear combination of term-concept and doc-concept weights" (a numpy sketch follows).
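A minimal sketch of the factorization itself, assuming numpy. Note that numpy's convention $W = U\,\mathrm{diag}(s)\,V^T$ maps onto the slide's notation as T = U, $\Lambda = \mathrm{diag}(s)$, D = V; the tiny 3 x 3 matrix here is only a stand-in:

```python
import numpy as np

# Any M x N term-document matrix works; a tiny 3 x 3 one stands in here.
W = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

T, s, Dt = np.linalg.svd(W, full_matrices=False)  # Dt is D^T
Lam = np.diag(s)                                  # singular values, descending
D = Dt.T

r = len(s)
assert np.allclose(W, T @ Lam @ D.T)       # W = T Lambda D^T
assert np.allclose(T.T @ T, np.eye(r))     # T is column-orthonormal
assert np.allclose(D.T @ D, np.eye(r))     # D is column-orthonormal
```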

  5. Example (1). Consider the 12 x 9 weight matrix below, whose rank is r = 9, in which two "groups" of documents are present:

     W          C1  C2  C3  C4  C5  G1  G2  G3  G4
     Human       1   0   0   1   0   0   0   0   0
     Interface   1   0   1   0   0   0   0   0   0
     Computer    1   1   0   0   0   0   0   0   0
     User        0   1   1   0   1   0   0   0   0
     System      0   1   1   2   0   0   0   0   0
     Response    0   1   0   0   1   0   0   0   0
     Time        0   1   0   0   1   0   0   0   0
     EPS         0   0   1   1   0   0   0   0   0
     Survey      0   1   0   0   0   0   0   0   1
     Tree        0   0   0   0   0   1   1   1   0
     Graph       0   0   0   0   0   0   1   1   1
     Minors      0   0   0   0   0   0   0   1   1

     Example (2). For this matrix, $\Lambda = \mathrm{diag}(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.35)$.
     [The original slide also shows the full numeric matrices T (12 x 9, term-concept weights) and D (9 x 9, document-concept weights); their row/column layout is not recoverable from this extraction.]
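To reproduce $\Lambda$ from Example (2), one can feed the matrix of Example (1) to numpy's SVD (a sketch, assuming numpy; the printed singular values should match the diagonal of $\Lambda$ above, up to rounding):

```python
import numpy as np

# The 12 x 9 term-document matrix of Example (1); rows follow the term order
# Human, Interface, Computer, User, System, Response, Time, EPS, Survey,
# Tree, Graph, Minors; columns are C1..C5, G1..G4.
W = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(W, full_matrices=False)
print(np.round(s, 2))  # compare with Lambda = diag(3.34, 2.54, ...) above
```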

  6. Properties of the SVD (1). Since both T and D are column-orthonormal matrices, we have
     $$W W^T = (T \Lambda D^T)(T \Lambda D^T)^T = (T \Lambda D^T)(D \Lambda T^T) = T \Lambda^2 T^T$$
     $$W^T W = (D \Lambda T^T)(T \Lambda D^T) = D \Lambda^2 D^T$$
     - $W W^T$ is the (real and symmetric) M x M term-term similarity matrix, and the columns of T are the eigenvectors of this matrix.
     - $W^T W$ is the N x N document-document similarity matrix, and the columns of D are the eigenvectors of this matrix.
     - $\Lambda^2$ is a diagonal matrix with the eigenvalues of $W W^T$ (and of $W^T W$).
     Properties of the SVD (2). Since $T^T W = \Lambda D^T$, we can view this as a "projection" of the documents (the columns of W) into the r-dimensional "concept space" spanned by the columns of T (i.e., the rows of $T^T$).
     - In this space, documents are represented by the columns of $\Lambda D^T$ (i.e., the rows of $D \Lambda$).
     - It follows that $W^T W = (D \Lambda)(\Lambda D^T)$ amounts to computing the similarity between documents as the inner product in this r-dimensional latent semantic space.
     - Similarly, in this space terms are represented by the columns of $\Lambda T^T$ (i.e., the rows of $T \Lambda$), as in the sketch below.
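A minimal sketch of the projection step (assuming numpy; a random matrix stands in for W), checking that documents in concept space are the columns of $\Lambda D^T = T^T W$ and that inner products there reproduce $W^T W$:

```python
import numpy as np

# A random full-rank matrix stands in for the term-document matrix W.
rng = np.random.default_rng(0)
W = rng.random((12, 9))

T, s, Dt = np.linalg.svd(W, full_matrices=False)  # Dt is D^T

# Documents in the r-dimensional concept space: columns of Lambda @ D^T,
# which equals T^T @ W because T is column-orthonormal.
doc_concepts = np.diag(s) @ Dt
assert np.allclose(doc_concepts, T.T @ W)

# Inner products in concept space reproduce the document-document
# similarities W^T W = (D Lambda)(Lambda D^T).
assert np.allclose(doc_concepts.T @ doc_concepts, W.T @ W)

# Symmetrically, terms are the columns of Lambda @ T^T (rows of T @ Lambda).
term_concepts = np.diag(s) @ T.T
assert np.allclose(term_concepts, Dt @ W.T)
```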
