SLIDE 1

Latent Semantic Indexing (LSI)

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Vector space model: pros

• Partial matching of queries and docs
    • dealing with the case where no doc contains all search terms
• Ranking according to similarity score
• Term weighting schemes
    • improves retrieval performance
• Various extensions
    • Relevance feedback (modifying query vector)
    • Doc clustering and classification
SLIDE 3

Problems with lexical semantics

• Ambiguity and association in natural language
• Polysemy: words often have a multitude of meanings and different types of usage
    • More severe in very heterogeneous collections.
• The vector space model is unable to discriminate between different meanings of the same word.
SLIDE 4

Problems with lexical semantics

• Synonymy: different terms may have identical or similar meanings (weaker: words indicating the same topic).
• No associations between words are made in the vector space representation.
SLIDE 5

Polysemy and context

• Doc similarity on the single-word level is affected by polysemy and context.

[Figure: two contexts sharing the ambiguous term "saturn". Meaning 1: planet (ring, jupiter, space, voyager, planet, ...); meaning 2: car company (dodge, ford, ...). The shared word contributes to similarity if used in the 1st meaning, but not if used in the 2nd.]
SLIDE 6

SVD

[Diagram: the SVD factorization $D = V\Sigma W^T$.]
SLIDE 7

Latent Semantic Indexing (LSI)

• Perform a low-rank approximation of the doc-term matrix (typical rank 100-300) by SVD
    • a latent semantic space
• Term-doc matrices are very large, but the number of topics that people talk about is small (in some sense)
• General idea: map docs (and terms) to a low-dimensional space
    • Design the mapping such that the low-dimensional space reflects semantic associations
    • Compute doc similarity based on the inner product in this latent semantic space (see the sketch below)
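To make this concrete, here is a minimal sketch of the whole idea using scikit-learn's TruncatedSVD on a tiny hypothetical corpus (the three docs and all variable names are illustrative, not from the course material):

```python
# Minimal LSI sketch: count matrix -> low-rank projection -> similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "ship ocean voyage",
    "boat ocean trip",
    "car truck engine",
]

# Term counts (scikit-learn's convention: rows are docs, columns are terms).
X = CountVectorizer().fit_transform(docs)

# Map docs into a 2-dimensional latent semantic space.
lsi = TruncatedSVD(n_components=2, random_state=0)
docs_2d = lsi.fit_transform(X)

# Similarity in the latent space: the two nautical docs end up close
# even though they share only the term "ocean".
print(cosine_similarity(docs_2d))
```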
SLIDE 8

Singular Value Decomposition (SVD)

For an $M \times N$ matrix $B$ of rank $r$ there exists a factorization (the SVD):

$B = V \Sigma W^T$

where $V$ is $M \times M$, $\Sigma$ is $M \times N$, and $W$ is $N \times N$.

• The columns of $V$ are orthogonal eigenvectors of $BB^T$.
• The columns of $W$ are orthogonal eigenvectors of $B^T B$.
• The eigenvalues $\lambda_1, \ldots, \lambda_r$ of $BB^T$ are also the eigenvalues of $B^T B$.
• Singular values: $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_i = \sqrt{\lambda_i}$.
• Typically, the singular values are arranged in decreasing order.
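A quick numerical check of these properties, using NumPy on a small random matrix (the matrix itself is arbitrary, just for illustration):

```python
# Verify: V and W have orthonormal columns, and the singular values
# are the square roots of the eigenvalues of B^T B.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))      # an M x N matrix with M=4, N=3

V, sigma, Wt = np.linalg.svd(B)      # B = V @ diag(sigma) @ Wt

assert np.allclose(V.T @ V, np.eye(4))     # V orthogonal
assert np.allclose(Wt @ Wt.T, np.eye(3))   # W orthogonal

# Eigenvalues of B^T B, largest first:
eigvals = np.sort(np.linalg.eigvalsh(B.T @ B))[::-1]
assert np.allclose(sigma, np.sqrt(eigvals))
```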
SLIDE 9

Singular Value Decomposition (SVD)

• Truncated SVD: retain only the first $\min(M,N)$ singular triplets in $B = V\Sigma W^T$, so that (shapes illustrated below):
    • $V$ is $M \times \min(M,N)$
    • $\Sigma$ is $\min(M,N) \times \min(M,N)$
    • $W^T$ is $\min(M,N) \times N$
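In NumPy this corresponds to full_matrices=False, which returns exactly these truncated shapes (a small shape check):

```python
import numpy as np

B = np.ones((4, 2))                        # M=4, N=2, so min(M,N)=2
V, sigma, Wt = np.linalg.svd(B, full_matrices=False)
print(V.shape, sigma.shape, Wt.shape)      # (4, 2) (2,) (2, 2)
```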
SLIDE 10

SVD example

For $B = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$ (so $M = 3$, $N = 2$), the SVD is:

$B = \begin{pmatrix} 0 & 2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & 1/\sqrt{6} & -1/\sqrt{3} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$

or equivalently, in truncated form:

$B = \begin{pmatrix} 0 & 2/\sqrt{6} \\ 1/\sqrt{2} & -1/\sqrt{6} \\ 1/\sqrt{2} & 1/\sqrt{6} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$
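The worked example can be checked numerically; the arrays below are the truncated factors written out above:

```python
# Verify that V @ Sigma @ W^T reproduces B for the slide's example.
import numpy as np

B = np.array([[1., -1.],
              [0.,  1.],
              [1.,  0.]])

s2, s6 = np.sqrt(2), np.sqrt(6)
V = np.array([[0.,      2 / s6],
              [1 / s2, -1 / s6],
              [1 / s2,  1 / s6]])
Sigma = np.diag([1.0, np.sqrt(3)])
Wt = np.array([[1 / s2,  1 / s2],
               [1 / s2, -1 / s2]])

assert np.allclose(V @ Sigma @ Wt, B)
```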
SLIDE 11

Example

We use a non-weighted matrix here to simplify the example.
SLIDE 12

Example of $D = V\Sigma W^T$: all four matrices

[Figure: the example term-doc matrix $D$ and its factors $V$, $\Sigma$, $W^T$.]
SLIDE 13

Example of $D = V\Sigma W^T$: the matrix $V$

One row per term, one column per $\min(M,N)$ dimension. The columns are "semantic" dimensions (distinct topics like politics, sports, ...). Entry $v_{jk}$ indicates how strongly related term $j$ is to the topic in column $k$.
SLIDE 14

Example of $D = V\Sigma W^T$: the matrix $\Sigma$

$\Sigma$ is a square, diagonal $\min(M,N) \times \min(M,N)$ matrix. Each singular value measures the importance of the corresponding semantic dimension; we'll make use of this by omitting unimportant dimensions.
SLIDE 15

Example of $D = V\Sigma W^T$: the matrix $W^T$

One column per doc, one row per $\min(M,N)$ dimension. The columns of $W$ are again the "semantic" dimensions. Entry $w_{jk}$ indicates how strongly related doc $j$ is to the topic in column $k$.
SLIDE 16

Matrix decomposition: Summary

• We've decomposed the term-doc matrix $D$ into a product of three matrices:
    • $V$: one (row) vector for each term
    • $W^T$: one (column) vector for each doc
    • $\Sigma$: diagonal matrix of singular values, reflecting the importance of each dimension
• Next: why are we doing this?
SLIDE 17

Low-rank approximation

• Solution via SVD: set the smallest $r - k$ singular values to zero, i.e. retain only $k$ singular values:

$B_k = V \,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, W^T$

Dropping the zeroed rows and columns leaves factors of sizes $(M \times k)(k \times k)(k \times N)$ for the $M \times N$ matrix $B_k$.

• In column notation, $B_k$ is a sum of $k$ rank-1 matrices:

$B_k = \sum_{i=1}^{k} \sigma_i v_i w_i^T$
SLIDE 18

Low-rank approximation

• Approximation problem: given matrix $B$, find a matrix $B_k$ of rank $k$ (e.g. a matrix with $k$ linearly independent rows or columns) that minimizes the Frobenius norm of the error, where $B_k$ and the candidates $Y$ are all $M \times N$ matrices:

$B_k = \arg\min_{Y:\,\mathrm{rank}(Y)=k} \|B - Y\|_F$

• SVD can be used to compute optimal low-rank approximations: keeping the $k$ largest singular values and setting all others to zero gives the optimal approximation [Eckart-Young].
• No matrix of rank $k$ can approximate $B$ better than $B_k$.
• Typically, we want $k \ll r$.
SLIDE 19

Approximation error

• How good (bad) is this approximation?
• It's the best possible, measured by the Frobenius norm of the error:

$\min_{Y:\,\mathrm{rank}(Y)=k} \|B - Y\|_F = \|B - B_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$

where $B_k = V\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\,W^T$ and the $\sigma_j$ are ordered such that $\sigma_j \geq \sigma_{j+1}$.

• This suggests why the Frobenius error drops as $k$ increases (a numerical check follows below).
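A small NumPy sketch of the rank-$k$ truncation and its Frobenius error (the matrix is random and purely illustrative):

```python
# Build B_k by zeroing all but the k largest singular values, check
# the rank-1 expansion, and check the Frobenius error formula.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 4))
k = 2

V, sigma, Wt = np.linalg.svd(B, full_matrices=False)

sigma_k = np.concatenate([sigma[:k], np.zeros(len(sigma) - k)])
B_k = V @ np.diag(sigma_k) @ Wt

# Equivalently: sum of k rank-1 matrices sigma_i * v_i w_i^T.
B_k_alt = sum(sigma[i] * np.outer(V[:, i], Wt[i]) for i in range(k))
assert np.allclose(B_k, B_k_alt)

# Frobenius error of the best rank-k approximation:
assert np.isclose(np.linalg.norm(B - B_k, "fro"),
                  np.sqrt(np.sum(sigma[k:] ** 2)))
```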
SLIDE 20

SVD Low-rank approximation

• A term-doc matrix $D$ may have $M = 50{,}000$ terms and $N = 10^6$ docs, with rank close to 50,000.
• Construct an approximation $D_{100}$ with rank 100.
    • Of all rank-100 matrices, it has the lowest Frobenius error.
• Great ... but why would we?
• Answer: Latent Semantic Indexing.

C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
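At that scale one would not compute the full SVD at all; SciPy's iterative svds (a Lanczos-type method, in the spirit of the code used in the experiments later in this deck) extracts only the top singular triplets of a sparse matrix. A sketch on a much smaller, hypothetical stand-in:

```python
# Truncated SVD of a sparse "term-doc" matrix without densifying it.
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Hypothetical stand-in for a 50,000 x 1,000,000 matrix (shrunk so
# the sketch runs quickly).
D = sp.random(5000, 20000, density=1e-3, format="csr", random_state=0)

V100, sigma100, Wt100 = svds(D, k=100)   # top 100 singular triplets
# Note: svds returns singular values in *ascending* order.
print(sigma100[::-1][:5])                # the five largest
```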
SLIDE 21

Goals of LSI

• SVD on the term-doc matrix
• Similar terms map to similar locations in the low-dimensional space
• Noise reduction by dimension reduction
SLIDE 22

Term-document matrix

This matrix is the basis for computing similarity between docs and queries. Can we transform it so that we get a better measure of similarity between docs and queries?
SLIDE 23

Recall the unreduced decomposition $D = V\Sigma W^T$
SLIDE 24

Reducing the dimensionality to 2

SLIDE 25

Reducing the dimensionality to 2

SLIDE 26

Original matrix $D$ vs. reduced $D_2 = V\Sigma_2 W^T$

We can view $D_2$ as a two-dimensional representation of $D$: dimensionality reduction to two dimensions.
SLIDE 27

Why is the reduced matrix "better"?

Similarity of d2 and d3 in the original space: 0.
Similarity of d2 and d3 in the reduced space:
0.52 · 0.28 + 0.36 · 0.16 + 0.72 · 0.36 + 0.12 · 0.20 + (−0.39) · (−0.08) ≈ 0.52
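The same dot product, repeated in NumPy with the five coordinates quoted above (the d2 and d3 columns of the reduced matrix):

```python
import numpy as np

d2 = np.array([0.52, 0.36, 0.72, 0.12, -0.39])
d3 = np.array([0.28, 0.16, 0.36, 0.20, -0.08])
print(np.dot(d2, d3))   # 0.5176, i.e. approximately 0.52
```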
SLIDE 28

Why is the reduced matrix "better"?

"boat" and "ship" are semantically similar. The "reduced" similarity measure reflects this. What property of the SVD reduction is responsible for the improved similarity?
SLIDE 29

Example

[Example from Dumais et al.]
SLIDE 30

Example

[Example from Dumais et al.]
SLIDE 31

Example (k=2)

[Example from Dumais et al.: the truncated factors $V_k$, $\Sigma_k$, $W_k^T$.]
SLIDE 32

[Figure: two-dimensional plot of the example. Squares are terms (graph, tree, minor, survey, time, response, user, computer, interface, human, EPS, system); circles are docs.]
SLIDE 33

[Example from Dumais et al.]
SLIDE 34

LSI: Summary

• Decompose the term-doc matrix $D$ into a product of matrices using the SVD: $D = V\Sigma W^T$
• Use the columns of $V$ and $W$ that correspond to the largest values in the diagonal matrix $\Sigma$ as term and document dimensions in the new space
SLIDE 35

How we use the SVD in LSI

• Key property of the SVD: each singular value tells us how important its dimension is.
• By setting less important dimensions to zero, we keep the important information but get rid of the "details".
• These details may
    • be noise ⇒ reduced LSI is a better representation
    • make things dissimilar that should be similar ⇒ reduced LSI is a better representation because it represents similarity better.
SLIDE 36

How does LSI address synonymy and semantic relatedness?

• Docs may be semantically similar but not similar in the vector space (when we talk about the same topics but use different words).
• Desired effect of LSI: synonyms contribute strongly to doc similarity.
    • Standard vector space: synonyms contribute nothing to doc similarity.
• LSI (via SVD) selects the "least costly" mapping:
    • Different words (= different dimensions of the full space) are mapped to the same dimension in the reduced space.
    • Thus, it maps synonyms or semantically related words to the same dimension.
    • The "cost" of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words, so LSI avoids doing that for unrelated words.
SLIDE 37

Performing the maps

• Each row and column of $D$ gets mapped into the $k$-dimensional LSI space by the SVD.
• A query $r$ is also mapped into this space: since $W_k^T = \Sigma_k^{-1} V_k^T D_k$, we should transform a query $r$ to

$r_k = \Sigma_k^{-1} V_k^T r$

    • Note that the mapped query is no longer a sparse vector.
• Claim: this is not only the mapping with the best (Frobenius-error) approximation to $D$, but it also improves retrieval.
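A numerical check of this fold-in identity on a toy matrix (the matrix and names are illustrative):

```python
# Check W_k^T = Sigma_k^{-1} V_k^T D_k, then fold a query in the same way.
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((8, 5))             # toy term-doc matrix
k = 2

V, sigma, Wt = np.linalg.svd(D, full_matrices=False)
Vk, Sk, Wtk = V[:, :k], np.diag(sigma[:k]), Wt[:k]
Dk = Vk @ Sk @ Wtk                          # rank-k approximation

assert np.allclose(np.linalg.inv(Sk) @ Vk.T @ Dk, Wtk)

# Hence a query vector r in term space maps to:
r = rng.standard_normal(8)
r_k = np.linalg.inv(Sk) @ Vk.T @ r
print(r_k)
```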
SLIDE 38

Implementation

• Compute the SVD of the term-doc matrix.
• Map the docs to the reduced space.
• Map the query into the reduced space: $r_k = \Sigma_k^{-1} V_k^T r$
• Compute the similarity of $r_k$ with all reduced docs in $W_k$.
• Output a ranked list of docs as usual (a minimal sketch of these steps follows below).
• What is the fundamental problem with this approach?
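A minimal end-to-end sketch of these steps, assuming a toy term-doc matrix D with terms as rows and docs as columns (all data and names are illustrative):

```python
# LSI retrieval: truncated SVD, fold in the query, rank docs by cosine.
import numpy as np

rng = np.random.default_rng(3)
D = rng.integers(0, 3, size=(10, 6)).astype(float)   # 10 terms, 6 docs
k = 2

# 1. SVD of the term-doc matrix, truncated to rank k.
V, sigma, Wt = np.linalg.svd(D, full_matrices=False)
Vk, Sk_inv, Wtk = V[:, :k], np.diag(1.0 / sigma[:k]), Wt[:k]

# 2. Docs in the reduced space: the columns of W_k^T.
docs_k = Wtk                                   # shape (k, num_docs)

# 3. Map the query: r_k = Sigma_k^{-1} V_k^T r.
r = np.zeros(10)
r[[1, 4]] = 1.0                                # query containing two terms
r_k = Sk_inv @ Vk.T @ r

# 4. Cosine similarity of r_k with every reduced doc, best first.
sims = (docs_k.T @ r_k) / (
    np.linalg.norm(docs_k, axis=0) * np.linalg.norm(r_k) + 1e-12)
print(np.argsort(-sims))
```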
SLIDE 39

Empirical evidence

• Experiments on TREC 1/2/3 – Dumais
• Lanczos SVD code (available on netlib) due to Berry was used in these experiments
    • Running times of ~ one day on tens of thousands of docs [still an obstacle to use]
• Dimensions: various values in the range 250-350 reported
    • Reducing k improves recall; under 200 reported unsatisfactory
• Generally expect recall to improve – what about precision?
SLIDE 40

Empirical evidence

• Precision at or above median TREC precision
    • Top scorer on almost 20% of TREC topics
• Slightly better on average than straight vector spaces
• Effect of dimensionality:

    Dimensions | Precision
    250        | 0.367
    300        | 0.371
    346        | 0.374
SLIDE 41

But why is this clustering?

• We've talked about docs, queries, retrieval, and precision here.
• What does this have to do with clustering?
• Intuition: dimension reduction through LSI brings together "related" axes in the vector space.
SLIDE 42

Simplistic picture

[Figure: a simplistic picture with docs grouped into Topic 1, Topic 2, and Topic 3.]
SLIDE 43

Reference

• Chapter 18 of the IIR book