Latent Semantic Indexing (LSI)
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

Vector space model: pros
- Partial matching of queries and docs: dealing with the case where no doc contains all search terms
- Term weighting: improves retrieval performance
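Partial matching in the vector space model can be sketched numerically (a hedged example; the toy vocabulary and term counts are invented for illustration): a doc still gets a positive cosine score even when it contains only some of the query terms.

```python
# Minimal sketch of vector-space partial matching, assuming raw tf
# vectors over a toy vocabulary (not from the slides).
import numpy as np

vocab = ["car", "engine", "repair", "price"]
doc = np.array([2.0, 1.0, 0.0, 1.0])    # doc contains "car", "engine", "price"
query = np.array([1.0, 0.0, 1.0, 0.0])  # query asks for "car" AND "repair"

def cosine(x, y):
    # cosine similarity of two vectors
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

score = cosine(doc, query)  # positive although "repair" is absent from the doc
```

No doc needs to contain all search terms: any shared term yields a nonzero score, and ranking by score handles large result sets.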
- Compute doc similarity based on the inner product in this latent semantic space.
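The retrieval step above can be sketched with NumPy (a hedged sketch: the term-doc matrix A, the query, and k = 2 are toy values, not from the slides; the query is folded into the latent space with the standard mapping q_k = Σ_k⁻¹ U_kᵀ q):

```python
# Sketch of LSI retrieval: map docs and a query into a k-dim latent
# space via SVD, then rank docs by inner product there.
import numpy as np

# toy terms x docs matrix (values invented for this example)
A = np.array([[1., 0., 1., 0.],
              [1., 0., 0., 0.],
              [0., 1., 0., 1.],
              [0., 1., 1., 1.]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

docs_k = np.diag(sk) @ Vtk            # k x n_docs: docs in latent space
q = np.array([1., 1., 0., 0.])        # query over the same term space
q_k = np.diag(1.0 / sk) @ Uk.T @ q    # fold the query into the latent space

scores = q_k @ docs_k                 # inner products in latent space
```

Note that docs 1 and 3 are identical columns of A, so they receive identical latent representations and identical scores.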
Low-rank approximation

- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: given matrix A, find a matrix A_k of rank k (e.g. minimizing the Frobenius norm of the error A − A_k).
- Keeping the k largest singular values and setting all others to zero results in A_k = U Σ_k V^T.
- No matrix of rank k can approximate A better than A_k.
- The Frobenius error of A_k is ||A − A_k||_F = sqrt(Σ_{i>k} σ_i²): suggests why the Frobenius error drops as k increases.
Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211–218.
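The optimality claims above can be checked numerically (a hedged sketch: A is a random matrix invented for this check, not from the slides):

```python
# Sketch: truncated SVD gives the rank-k approximation A_k, and
# ||A - A_k||_F equals the square root of the sum of the squared
# discarded singular values, so the error shrinks as k grows.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

errs = []
for k in range(1, 5):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncation
    err = np.linalg.norm(A - A_k, "fro")
    # identity: Frobenius error = sqrt of the discarded sigma_i^2
    assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
    errs.append(err)
```

The recorded errors decrease strictly as k increases, matching the identity above.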
- The omitted details may be noise → reduced LSI is a better representation.
- Details make things dissimilar that should be similar → reduced LSI is a better representation.
- Standard vector space: synonyms contribute nothing to doc similarity.
- Desired effect of LSI: synonyms contribute strongly to doc similarity.
- Dimensionality reduction forces us to map different words (= different dimensions of the full space) to the same dimension in the reduced space.
- Thus, it maps synonyms or semantically related words to the same dimension.
- The "cost" of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words.
- Thus, LSI will avoid doing that for unrelated words.
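The synonym effect can be seen on a toy term-doc matrix (a hedged example; the terms and counts are invented): "ship" and "boat" never co-occur, so their full-space term vectors are orthogonal, but both co-occur with "ocean", so the reduction maps them onto the same latent dimension, while the unrelated "wood" stays apart.

```python
# Sketch: synonyms get similar reduced representations under LSI.
import numpy as np

terms = ["ship", "boat", "ocean", "wood", "tree"]
A = np.array([[1., 0., 1., 0., 0.],   # ship
              [0., 1., 0., 0., 0.],   # boat
              [1., 1., 0., 0., 0.],   # ocean
              [0., 0., 0., 1., 1.],   # wood
              [0., 0., 0., 1., 0.]])  # tree  (terms x docs, toy counts)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
T = U[:, :k] * s[:k]           # term representations in the latent space

full_sim = cosine(A[0], A[1])  # ship vs boat in the full space: 0
lsi_sim = cosine(T[0], T[1])   # ship vs boat in the latent space: high
```

Ship and boat end up nearly parallel in the reduced space, while ship vs wood stays near zero: exactly the "least costly" mapping described above.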
- Since ||A − A_k||²_F = Σ_{i>k} σ_i², we can measure how much of A a rank-k truncation discards.
- Running times of ~ one day on tens of thousands of docs [still an obstacle to use!]
- Reducing k improves recall; under 200 reported unsatisfactory.
- Top scorer on almost 20% of TREC topics