SLIDE 1

Latent Semantic Indexing (LSI)

CE-324: Modern Information Retrieval

Sharif University of Technology

  • M. Soleymani

Spring 2020

Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Vector space model: pros

• Partial matching of queries and docs
    • dealing with the case where no doc contains all search terms
• Ranking according to similarity score
• Term weighting schemes
    • improves retrieval performance
• Various extensions
    • Relevance feedback (modifying query vector)
    • Doc clustering and classification
SLIDE 3

Problems with lexical semantics

• Ambiguity and association in natural language
• Polysemy: words often have a multitude of meanings and different types of usage
    • More severe in very heterogeneous collections.
• The vector space model is unable to discriminate between different meanings of the same word.
SLIDE 4

Problems with lexical semantics

• Synonymy: different terms may have identical or similar meanings (weaker: words indicating the same topic).
• No associations between words are made in the vector space representation.
SLIDE 5

Polysemy and context

• Doc similarity on the single-word level is affected by polysemy and context.

[Figure: two contexts sharing the ambiguous term "saturn". Meaning 1: planet (ring, jupiter, space, voyager, planet, ...); meaning 2: car company (dodge, ford, ...). The shared word contributes to similarity if used in the 1st meaning, but not if used in the 2nd.]
SLIDE 6

SVD

[Diagram: the SVD factorization $D = V\Sigma W^T$.]
SLIDE 7

Latent Semantic Indexing (LSI)

• Perform a low-rank approximation of the doc-term matrix (typical rank 100-300) by SVD
    • a latent semantic space
• Term-doc matrices are very large, but the number of topics that people talk about is small (in some sense)
• General idea: map docs (and terms) to a low-dimensional space
    • Design the mapping such that the low-dimensional space reflects semantic associations
    • Compute doc similarity based on the inner product in this latent semantic space (see the sketch below)
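To make this concrete, here is a minimal sketch of the whole idea using scikit-learn's TruncatedSVD on a tiny hypothetical corpus (the three docs and all variable names are illustrative, not from the course material):

```python
# Minimal LSI sketch: count matrix -> low-rank projection -> similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "ship ocean voyage",
    "boat ocean trip",
    "car truck engine",
]

# Term counts (scikit-learn's convention: rows are docs, columns are terms).
X = CountVectorizer().fit_transform(docs)

# Map docs into a 2-dimensional latent semantic space.
lsi = TruncatedSVD(n_components=2, random_state=0)
docs_2d = lsi.fit_transform(X)

# Similarity in the latent space: the two nautical docs end up close
# even though they share only the term "ocean".
print(cosine_similarity(docs_2d))
```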
SLIDE 8

Singular Value Decomposition (SVD)

For an $M \times N$ matrix $B$ of rank $r$ there exists a factorization (the SVD):

$B = V \Sigma W^T$

where $V$ is $M \times M$, $\Sigma$ is $M \times N$, and $W$ is $N \times N$.

• The columns of $V$ are orthogonal eigenvectors of $BB^T$.
• The columns of $W$ are orthogonal eigenvectors of $B^T B$.
• The eigenvalues $\lambda_1, \ldots, \lambda_r$ of $BB^T$ are also the eigenvalues of $B^T B$.
• Singular values: $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ with $\sigma_i = \sqrt{\lambda_i}$.
• Typically, the singular values are arranged in decreasing order.
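A quick numerical check of these properties, using NumPy on a small random matrix (the matrix itself is arbitrary, just for illustration):

```python
# Verify: V and W have orthonormal columns, and the singular values
# are the square roots of the eigenvalues of B^T B.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))      # an M x N matrix with M=4, N=3

V, sigma, Wt = np.linalg.svd(B)      # B = V @ diag(sigma) @ Wt

assert np.allclose(V.T @ V, np.eye(4))     # V orthogonal
assert np.allclose(Wt @ Wt.T, np.eye(3))   # W orthogonal

# Eigenvalues of B^T B, largest first:
eigvals = np.sort(np.linalg.eigvalsh(B.T @ B))[::-1]
assert np.allclose(sigma, np.sqrt(eigvals))
```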
SLIDE 9

Singular Value Decomposition (SVD)

• Truncated SVD: retain only the first $\min(M,N)$ singular triplets in $B = V\Sigma W^T$, so that (shapes illustrated below):
    • $V$ is $M \times \min(M,N)$
    • $\Sigma$ is $\min(M,N) \times \min(M,N)$
    • $W^T$ is $\min(M,N) \times N$
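In NumPy this corresponds to full_matrices=False, which returns exactly these truncated shapes (a small shape check):

```python
import numpy as np

B = np.ones((4, 2))                        # M=4, N=2, so min(M,N)=2
V, sigma, Wt = np.linalg.svd(B, full_matrices=False)
print(V.shape, sigma.shape, Wt.shape)      # (4, 2) (2,) (2, 2)
```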
SLIDE 10

SVD example

For $B = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$ (so $M = 3$, $N = 2$), the SVD is:

$B = \begin{pmatrix} 0 & 2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & 1/\sqrt{6} & -1/\sqrt{3} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$

or equivalently, in truncated form:

$B = \begin{pmatrix} 0 & 2/\sqrt{6} \\ 1/\sqrt{2} & -1/\sqrt{6} \\ 1/\sqrt{2} & 1/\sqrt{6} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$
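The worked example can be checked numerically; the arrays below are the truncated factors written out above:

```python
# Verify that V @ Sigma @ W^T reproduces B for the slide's example.
import numpy as np

B = np.array([[1., -1.],
              [0.,  1.],
              [1.,  0.]])

s2, s6 = np.sqrt(2), np.sqrt(6)
V = np.array([[0.,      2 / s6],
              [1 / s2, -1 / s6],
              [1 / s2,  1 / s6]])
Sigma = np.diag([1.0, np.sqrt(3)])
Wt = np.array([[1 / s2,  1 / s2],
               [1 / s2, -1 / s2]])

assert np.allclose(V @ Sigma @ Wt, B)
```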
SLIDE 11

Example

We use a non-weighted matrix here to simplify the example.
SLIDE 12

Example of $D = V\Sigma W^T$: all four matrices

[Figure: the example term-doc matrix $D$ and its factors $V$, $\Sigma$, $W^T$.]
SLIDE 13

Example of $D = V\Sigma W^T$: the matrix $V$

One row per term, one column per $\min(M,N)$ dimension. The columns are "semantic" dimensions (distinct topics like politics, sports, ...). Entry $v_{jk}$ indicates how strongly related term $j$ is to the topic in column $k$.
SLIDE 14

Example of $D = V\Sigma W^T$: the matrix $\Sigma$

$\Sigma$ is a square, diagonal $\min(M,N) \times \min(M,N)$ matrix. Each singular value measures the importance of the corresponding semantic dimension; we'll make use of this by omitting unimportant dimensions.
SLIDE 15

Example of $D = V\Sigma W^T$: the matrix $W^T$

One column per doc, one row per $\min(M,N)$ dimension. The columns of $W$ are again the "semantic" dimensions. Entry $w_{jk}$ indicates how strongly related doc $j$ is to the topic in column $k$.
SLIDE 16

Matrix decomposition: Summary

• We've decomposed the term-doc matrix $D$ into a product of three matrices:
    • $V$: one (row) vector for each term
    • $W^T$: one (column) vector for each doc
    • $\Sigma$: diagonal matrix of singular values, reflecting the importance of each dimension
• Next: why are we doing this?
SLIDE 17

Low-rank approximation

• Solution via SVD: set the smallest $r - k$ singular values to zero, i.e. retain only $k$ singular values:

$B_k = V \,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, W^T$

Dropping the zeroed rows and columns leaves factors of sizes $(M \times k)(k \times k)(k \times N)$ for the $M \times N$ matrix $B_k$.

• In column notation, $B_k$ is a sum of $k$ rank-1 matrices:

$B_k = \sum_{i=1}^{k} \sigma_i v_i w_i^T$
SLIDE 18

Low-rank approximation

• Approximation problem: given matrix $B$, find a matrix $B_k$ of rank $k$ (e.g. a matrix with $k$ linearly independent rows or columns) that minimizes the Frobenius norm of the error, where $B_k$ and the candidates $Y$ are all $M \times N$ matrices:

$B_k = \arg\min_{Y:\,\mathrm{rank}(Y)=k} \|B - Y\|_F$

• SVD can be used to compute optimal low-rank approximations: keeping the $k$ largest singular values and setting all others to zero gives the optimal approximation [Eckart-Young].
• No matrix of rank $k$ can approximate $B$ better than $B_k$.
• Typically, we want $k \ll r$.
SLIDE 19

Approximation error

• How good (bad) is this approximation?
• It's the best possible, measured by the Frobenius norm of the error:

$\min_{Y:\,\mathrm{rank}(Y)=k} \|B - Y\|_F = \|B - B_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$

where $B_k = V\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\,W^T$ and the $\sigma_j$ are ordered such that $\sigma_j \geq \sigma_{j+1}$.

• This suggests why the Frobenius error drops as $k$ increases (a numerical check follows below).
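A small NumPy sketch of the rank-$k$ truncation and its Frobenius error (the matrix is random and purely illustrative):

```python
# Build B_k by zeroing all but the k largest singular values, check
# the rank-1 expansion, and check the Frobenius error formula.
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 4))
k = 2

V, sigma, Wt = np.linalg.svd(B, full_matrices=False)

sigma_k = np.concatenate([sigma[:k], np.zeros(len(sigma) - k)])
B_k = V @ np.diag(sigma_k) @ Wt

# Equivalently: sum of k rank-1 matrices sigma_i * v_i w_i^T.
B_k_alt = sum(sigma[i] * np.outer(V[:, i], Wt[i]) for i in range(k))
assert np.allclose(B_k, B_k_alt)

# Frobenius error of the best rank-k approximation:
assert np.isclose(np.linalg.norm(B - B_k, "fro"),
                  np.sqrt(np.sum(sigma[k:] ** 2)))
```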
SLIDE 20

SVD Low-rank approximation

• A term-doc matrix $D$ may have $M = 50{,}000$ terms and $N = 10^6$ docs, with rank close to 50,000.
• Construct an approximation $D_{100}$ with rank 100.
    • Of all rank-100 matrices, it has the lowest Frobenius error.
• Great ... but why would we?
• Answer: Latent Semantic Indexing.

C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
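At that scale one would not compute the full SVD at all; SciPy's iterative svds (a Lanczos-type method, in the spirit of the code used in the experiments later in this deck) extracts only the top singular triplets of a sparse matrix. A sketch on a much smaller, hypothetical stand-in:

```python
# Truncated SVD of a sparse "term-doc" matrix without densifying it.
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Hypothetical stand-in for a 50,000 x 1,000,000 matrix (shrunk so
# the sketch runs quickly).
D = sp.random(5000, 20000, density=1e-3, format="csr", random_state=0)

V100, sigma100, Wt100 = svds(D, k=100)   # top 100 singular triplets
# Note: svds returns singular values in *ascending* order.
print(sigma100[::-1][:5])                # the five largest
```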
SLIDE 21

Goals of LSI

• SVD on the term-doc matrix
• Similar terms map to similar locations in the low-dimensional space
• Noise reduction by dimension reduction
SLIDE 22

Term-document matrix

This matrix is the basis for computing similarity between docs and queries. Can we transform it so that we get a better measure of similarity between docs and queries?
SLIDE 23

Recall the unreduced decomposition $D = V\Sigma W^T$
SLIDE 24

Reducing the dimensionality to 2

SLIDE 25

Reducing the dimensionality to 2

SLIDE 26

Original matrix $D$ vs. reduced $D_2 = V\Sigma_2 W^T$

We can view $D_2$ as a two-dimensional representation of $D$: dimensionality reduction to two dimensions.
SLIDE 27

Why is the reduced matrix "better"?

Similarity of d2 and d3 in the original space: 0.
Similarity of d2 and d3 in the reduced space:
0.52 · 0.28 + 0.36 · 0.16 + 0.72 · 0.36 + 0.12 · 0.20 + (−0.39) · (−0.08) ≈ 0.52
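The same dot product, repeated in NumPy with the five coordinates quoted above (the d2 and d3 columns of the reduced matrix):

```python
import numpy as np

d2 = np.array([0.52, 0.36, 0.72, 0.12, -0.39])
d3 = np.array([0.28, 0.16, 0.36, 0.20, -0.08])
print(np.dot(d2, d3))   # 0.5176, i.e. approximately 0.52
```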
SLIDE 28

Why is the reduced matrix "better"?

"boat" and "ship" are semantically similar. The "reduced" similarity measure reflects this. What property of the SVD reduction is responsible for the improved similarity?
SLIDE 29

Example

[Example from Dumais et al.]
SLIDE 30

Example

[Example from Dumais et al.]
SLIDE 31

Example (k=2)

[Example from Dumais et al.: the truncated factors $V_k$, $\Sigma_k$, $W_k^T$.]
SLIDE 32

[Figure: two-dimensional plot of the example. Squares are terms (graph, tree, minor, survey, time, response, user, computer, interface, human, EPS, system); circles are docs.]
SLIDE 33

[Example from Dumais et al.]
SLIDE 34

LSI: Summary

• Decompose the term-doc matrix $D$ into a product of matrices using the SVD: $D = V\Sigma W^T$
• Use the columns of $V$ and $W$ that correspond to the largest values in the diagonal matrix $\Sigma$ as term and document dimensions in the new space
SLIDE 35

How we use the SVD in LSI

• Key property of the SVD: each singular value tells us how important its dimension is.
• By setting less important dimensions to zero, we keep the important information but get rid of the "details".
• These details may
    • be noise ⇒ reduced LSI is a better representation
    • make things dissimilar that should be similar ⇒ reduced LSI is a better representation because it represents similarity better.
SLIDE 36

How does LSI address synonymy and semantic relatedness?

• Docs may be semantically similar but not similar in the vector space (when we talk about the same topics but use different words).
• Desired effect of LSI: synonyms contribute strongly to doc similarity.
    • Standard vector space: synonyms contribute nothing to doc similarity.
• LSI (via SVD) selects the "least costly" mapping:
    • Different words (= different dimensions of the full space) are mapped to the same dimension in the reduced space.
    • Thus, it maps synonyms or semantically related words to the same dimension.
    • The "cost" of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words, so LSI avoids doing that for unrelated words.
SLIDE 37

Performing the maps

• Each row and column of $D$ gets mapped into the $k$-dimensional LSI space by the SVD.
• A query $r$ is also mapped into this space: since $W_k^T = \Sigma_k^{-1} V_k^T D_k$, we should transform a query $r$ to

$r_k = \Sigma_k^{-1} V_k^T r$

    • Note that the mapped query is no longer a sparse vector.
• Claim: this is not only the mapping with the best (Frobenius-error) approximation to $D$, but it also improves retrieval.
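A numerical check of this fold-in identity on a toy matrix (the matrix and names are illustrative):

```python
# Check W_k^T = Sigma_k^{-1} V_k^T D_k, then fold a query in the same way.
import numpy as np

rng = np.random.default_rng(2)
D = rng.standard_normal((8, 5))             # toy term-doc matrix
k = 2

V, sigma, Wt = np.linalg.svd(D, full_matrices=False)
Vk, Sk, Wtk = V[:, :k], np.diag(sigma[:k]), Wt[:k]
Dk = Vk @ Sk @ Wtk                          # rank-k approximation

assert np.allclose(np.linalg.inv(Sk) @ Vk.T @ Dk, Wtk)

# Hence a query vector r in term space maps to:
r = rng.standard_normal(8)
r_k = np.linalg.inv(Sk) @ Vk.T @ r
print(r_k)
```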
SLIDE 38

Implementation

• Compute the SVD of the term-doc matrix.
• Map the docs to the reduced space.
• Map the query into the reduced space: $r_k = \Sigma_k^{-1} V_k^T r$
• Compute the similarity of $r_k$ with all reduced docs in $W_k$.
• Output a ranked list of docs as usual (a minimal sketch of these steps follows below).
• What is the fundamental problem with this approach?
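A minimal end-to-end sketch of these steps, assuming a toy term-doc matrix D with terms as rows and docs as columns (all data and names are illustrative):

```python
# LSI retrieval: truncated SVD, fold in the query, rank docs by cosine.
import numpy as np

rng = np.random.default_rng(3)
D = rng.integers(0, 3, size=(10, 6)).astype(float)   # 10 terms, 6 docs
k = 2

# 1. SVD of the term-doc matrix, truncated to rank k.
V, sigma, Wt = np.linalg.svd(D, full_matrices=False)
Vk, Sk_inv, Wtk = V[:, :k], np.diag(1.0 / sigma[:k]), Wt[:k]

# 2. Docs in the reduced space: the columns of W_k^T.
docs_k = Wtk                                   # shape (k, num_docs)

# 3. Map the query: r_k = Sigma_k^{-1} V_k^T r.
r = np.zeros(10)
r[[1, 4]] = 1.0                                # query containing two terms
r_k = Sk_inv @ Vk.T @ r

# 4. Cosine similarity of r_k with every reduced doc, best first.
sims = (docs_k.T @ r_k) / (
    np.linalg.norm(docs_k, axis=0) * np.linalg.norm(r_k) + 1e-12)
print(np.argsort(-sims))
```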
SLIDE 39

Empirical evidence

• Experiments on TREC 1/2/3 – Dumais
• Lanczos SVD code (available on netlib) due to Berry was used in these experiments
    • Running times of ~ one day on tens of thousands of docs [still an obstacle to use]
• Dimensions: various values in the range 250-350 reported
    • Reducing k improves recall; under 200 reported unsatisfactory
• Generally expect recall to improve – what about precision?
SLIDE 40

Empirical evidence

• Precision at or above median TREC precision
    • Top scorer on almost 20% of TREC topics
• Slightly better on average than straight vector spaces
• Effect of dimensionality:

    Dimensions | Precision
    250        | 0.367
    300        | 0.371
    346        | 0.374
SLIDE 41

But why is this clustering?

• We've talked about docs, queries, retrieval, and precision here.
• What does this have to do with clustering?
• Intuition: dimension reduction through LSI brings together "related" axes in the vector space.
SLIDE 42

Simplistic picture

[Figure: a simplistic picture with docs grouped into Topic 1, Topic 2, and Topic 3.]
SLIDE 43

Reference

• Chapter 18 of the IIR book