

SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 10: SVD and Latent Semantic Indexing

Paul Ginsparg

Cornell University, Ithaca, NY

30 Sep 2010

1 / 58

SLIDE 2

Administrativa

Assignment 2 due Sat 9 Oct, 1pm (late submission permitted until Sun 10 Oct at 11pm).
No class Tue 12 Oct (midterm break).
The midterm examination is Thu 14 Oct, 11:40–12:55, in Olin 165. It will be open book. Topics examined include assignments, lectures, and discussion class readings before the midterm break. (Review of topics next Thurs, 7 Oct.)
According to the registrar (http://registrar.sas.cornell.edu/Sched/EXFA.html), the final exam is Fri 17 Dec, 2:00–4:30pm (location TBD). An early opportunity to take the exam will be Mon 13 Dec, 2:00pm.

2 / 58

SLIDE 3

Discussion 4, Tue/Thu 5 and 7 Oct 2010

Read and be prepared to discuss the following paper: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Volume 41, Issue 6, 1990. http://www3.interscience.wiley.com/cgi-bin/issuetoc?ID=10049584

Note that to access this paper from Wiley InterScience, you need to use a computer with a Cornell IP address. (Also at /readings/jasis90f.pdf)

The paper's notation corresponds to ours as: X = T0 S0 D′ ⟺ C = U Σ V^T, and X̂ = T S D′ ⟺ C_k = U Σ_k V^T.

3 / 58

SLIDE 4

Overview

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

4 / 58

SLIDE 5

Outline

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

5 / 58

SLIDE 6

Symmetric diagonalization theorem

Let S be a square, symmetric, real-valued M × M matrix with M linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition S = Q Λ Q^{-1}, where the columns of Q are the orthogonal and normalized (unit length, real) eigenvectors of S, and Λ is the diagonal matrix whose entries are the eigenvalues of S. All entries of Q are real, and Q^{-1} = Q^T. We will use this to build low-rank approximations to term-document matrices, using C C^T.
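A minimal matlab sketch of the theorem, in the spirit of the matlab one-liners on the example slides below (the toy matrix C here is an illustrative choice, not from the slides):

C = [1 0 1; 0 1 1; 1 1 0];    % toy term-document matrix (illustrative)
S = C*C';                     % square, symmetric, real-valued
[Q,Lambda] = eig(S);          % columns of Q: orthonormal eigenvectors
norm(S - Q*Lambda*Q', 'fro')  % ~0: S = Q*Lambda*Q^{-1}, with Q^{-1} = Q'
norm(Q'*Q - eye(3), 'fro')    % ~0: Q is orthogonal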

6 / 58

SLIDE 7

Outline

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

7 / 58

SLIDE 8

SVD

Let C be an M × N matrix of rank r, and C^T its N × M transpose. CC^T and C^TC have the same r non-zero eigenvalues λ1, . . . , λr. Let U = the M × M matrix whose columns are the orthogonal eigenvectors of CC^T, and V = the N × N matrix whose columns are the orthogonal eigenvectors of C^TC. Then there is a singular value decomposition (SVD) C = U Σ V^T, where the M × N matrix Σ has Σ_ii = σ_i (= √λ_i) for 1 ≤ i ≤ r, and zero otherwise. The σ_i are called the singular values of C.
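A short matlab check of these relations, using the matrix from Example 1 below (a sketch, not part of the original slides):

C = [1 0 0; 0 1 0; 0 0 1; 1 0 0; 0 1 0];  % the 5x3 matrix of Example 1
[U,S,V] = svd(C);            % C = U*S*V', with U 5x5, S 5x3, V 3x3
norm(C - U*S*V', 'fro')      % ~0
diag(S).^2                   % squared singular values: 2, 2, 1
sort(eig(C'*C), 'descend')   % the same: eigenvalues of C'C (and of CC')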

8 / 58

SLIDE 9

Compare to S = Q Λ Q^T

C = U Σ V^T ⇒ CC^T = U Σ V^T V Σ U^T = U Σ² U^T (and C^TC = V Σ U^T U Σ V^T = V Σ² V^T). The l.h.s. is square, symmetric, and real-valued, and the r.h.s. is a symmetric diagonal decomposition.

CC^T (C^TC) is a square matrix with rows and columns corresponding to each of the M terms (N documents). The i, j entry measures the overlap between the ith and jth terms (documents), based on document (term) co-occurrence. This depends on the term weighting: in the simplest (1,0) case, the i, j entry counts the number of documents in which both terms i and j occur (the number of terms which occur in both documents i and j).

9 / 58

SLIDE 10

Illustration of SVD

Upper: C has M > N; lower: C has M < N. [Figure: schematic shapes of the factors in C = U Σ V^T for the two cases.]

10 / 58

SLIDE 11

4 × 2 Example

Example: singular value decomposition of a 4 × 2 matrix of rank 2:

$$C = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \\ -1 & 1 \end{pmatrix} = U\Sigma V^T = \begin{pmatrix} -0.632 & 0.000 \\ 0.316 & -0.707 \\ -0.316 & -0.707 \\ 0.632 & 0.000 \end{pmatrix} \begin{pmatrix} 2.236 & 0.000 \\ 0.000 & 1.000 \end{pmatrix} \begin{pmatrix} -0.707 & 0.707 \\ -0.707 & -0.707 \end{pmatrix}$$

(shown in reduced form; the full U is 4 × 4 with two additional columns, and Σ is then 4 × 2).

Σ11 = 2.236 and Σ22 = 1. Setting Σ22 = 0 gives the rank-1 approximation

$$C_1 = \begin{pmatrix} 1 & -1 \\ -0.5 & 0.5 \\ 0.5 & -0.5 \\ -1 & 1 \end{pmatrix}$$

11 / 58

SLIDE 12

Low Rank Approximations

Given an M × N matrix C and a positive integer k, find the M × N matrix C_k of rank ≤ k which minimizes the Frobenius norm of the difference X = C − C_k:

$$\|X\|_F = \sqrt{\sum_{i=1}^{M} \sum_{j=1}^{N} X_{ij}^2}$$

(minimize the discrepancy between C and C_k for fixed k smaller than the rank r of C). Use the SVD: given C, construct the SVD C = U Σ V^T; form Σ_k by setting the smallest r − k singular values in Σ to 0; then C_k = U Σ_k V^T is the rank-k approximation to C. Theorem (Eckart-Young): this yields the matrix of rank k with the lowest possible Frobenius error, $\sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2}$ (the error in the spectral norm is σ_{k+1}).
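A matlab sketch of the construction, checking the Frobenius error formula (it reuses the Example 3 matrix from the later slides):

C = [0 1 1; 1 0 1; 1 1 0; 1 1 1; 1 1 1];  % Example 3's matrix (below)
k = 1;
[U,S,V] = svd(C);
Sk = S;  Sk(k+1:end, k+1:end) = 0;  % zero the smallest r-k singular values
Ck = U*Sk*V';                       % rank-k approximation
s = diag(S);
[norm(C - Ck, 'fro'), sqrt(sum(s(k+1:end).^2))]   % the two agree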

12 / 58

SLIDE 13

Illustration of low rank approximation

Matrix entries affected by "zeroing out" the smallest singular value are indicated by dashed boxes. [Figure: schematic of C_k = U Σ_k V^T, with the suppressed entries of Σ_k and the corresponding parts of U and V^T dashed.]

13 / 58

SLIDE 14

Example 1

      1 1 1 1 1       =

14 / 58

SLIDE 15

Example 1, cont’d

      1 1 1 1 1       =       

1 √ 2

· · · · · ·

1 √ 2

· · · · · · 1 · · · · · ·

1 √ 2

· · · · · ·

1 √ 2

· · · · · ·              √ 2 √ 2 1         1 1 1  

http://www.wolframalpha.com/input/?i=svd{{1,0,0},{0,1,0},{0,0,1},{1,0,0},{0,1,0}}

matlab: [U,S,V]=svd([1 0 0; 0 1 0; 0 0 1; 1 0 0; 0 1 0])

15 / 58

SLIDE 16

Example 2

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1       =

16 / 58

SLIDE 17

Example 2, cont’d

      1 1 1 1 1 1 1 1 1 1 1 1 1 1 1       =           

  • 1

5

· · · · · · · · · · · ·

  • 1

5

· · · · · · · · · · · ·

  • 1

5

· · · · · · · · · · · ·

  • 1

5

· · · · · · · · · · · ·

  • 1

5

· · · · · · · · · · · ·                  √ 15        

1 √ 3 1 √ 3 1 √ 3

· · · · · · · · · · · · · · · · · ·  

http://www.wolframalpha.com/input/?i=svd{{1,1,1},{1,1,1},{1,1,1},{1,1,1},{1,1,1}}

matlab: [U,S,V]=svd([1 1 1; 1 1 1; 1 1 1; 1 1 1; 1 1 1])

17 / 58

SLIDE 18

Example 3

Rows labelled by the terms tea, coffee, cocoa, drink, beverage; columns are three documents:

$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} \sqrt{2/15} & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} & \cdots \\ \sqrt{2/15} & 0 & -\sqrt{2/3} & \cdots \\ \sqrt{2/15} & -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{6}} & \cdots \\ \sqrt{3/10} & 0 & 0 & \cdots \\ \sqrt{3/10} & 0 & 0 & \cdots \end{pmatrix} \begin{pmatrix} \sqrt{10} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \\ -\frac{1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{6}} & \sqrt{2/3} & -\frac{1}{\sqrt{6}} \end{pmatrix}$$

http://www.wolframalpha.com/input/?i=svd{{0,1,1},{1,0,1},{1,1,0},{1,1,1},{1,1,1}}

matlab: [U,S,V]=svd([0 1 1; 1 0 1; 1 1 0; 1 1 1; 1 1 1])

18 / 58

SLIDE 19

Example 3, cont’d

Rows again labelled tea, coffee, cocoa, drink, beverage. Keeping only the largest singular value √10:

$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \Rightarrow \begin{pmatrix} \sqrt{2/15} & \cdots \\ \sqrt{2/15} & \cdots \\ \sqrt{2/15} & \cdots \\ \sqrt{3/10} & \cdots \\ \sqrt{3/10} & \cdots \end{pmatrix} \begin{pmatrix} \sqrt{10} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \\ \cdots & \cdots & \cdots \\ \cdots & \cdots & \cdots \end{pmatrix} = \begin{pmatrix} 2/3 & 2/3 & 2/3 \\ 2/3 & 2/3 & 2/3 \\ 2/3 & 2/3 & 2/3 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

19 / 58

SLIDE 20

Reduced or truncated SVD

Represent Σ as an r × r matrix with the singular values on the diagonal (rest 0). Omit the rightmost M − r columns of U (which multiply the omitted rows of Σ), and omit the rightmost N − r columns of V (in V^T, the rows multiplied by the now-omitted N − r columns of zeros in Σ). Truncating Example 3 to its top singular value:

$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \Rightarrow \begin{pmatrix} \sqrt{2/15} \\ \sqrt{2/15} \\ \sqrt{2/15} \\ \sqrt{3/10} \\ \sqrt{3/10} \end{pmatrix} \left( \sqrt{10} \right) \begin{pmatrix} \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{pmatrix} = \begin{pmatrix} 2/3 & 2/3 & 2/3 \\ 2/3 & 2/3 & 2/3 \\ 2/3 & 2/3 & 2/3 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

20 / 58

SLIDE 21

Further intuition

$$C_k = U \Sigma_k V^T = U \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_k & \\ & & & 0 \\ & & & & \ddots \end{pmatrix} V^T = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T$$

where u_i, v_i are the ith columns of U, V. Each u_i v_i^T is a rank-1 matrix, so C_k is the sum of k rank-1 matrices weighted by singular values, where σ_i decreases with i. Truncation removes the last few terms in the sum.

21 / 58

SLIDE 22

Outline

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

22 / 58

SLIDE 23

Recall: Term-document matrix

            Anthony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra   Caesar   Tempest
anthony        5.25      3.18     0.0       0.0       0.0       0.35
brutus         1.21      6.10     0.0       1.0       0.0       0.0
caesar         8.59      2.54     0.0       1.51      0.25      0.0
calpurnia      0.0       1.54     0.0       0.0       0.0       0.0
cleopatra      2.85      0.0      0.0       0.0       0.0       0.0
mercy          1.51      0.0      1.90      0.12      5.25      0.88
worser         1.37      0.0      0.11      4.15      0.25      1.95

This matrix is the basis for computing the similarity between documents and queries. But: synonymy and polysemy. Today: can we transform this matrix so that we get a better measure of similarity between documents and queries?

23 / 58

SLIDE 24

Latent semantic indexing: Overview

We will decompose the term-document matrix into a product of matrices. The particular decomposition we'll use is the singular value decomposition (SVD): C = U Σ V^T (where C = term-document matrix). We will then use the SVD to compute a new, improved term-document matrix C′. We'll get better similarity values out of C′ (compared to C). Using the SVD for this purpose is called latent semantic indexing, or LSI.

24 / 58

SLIDE 25

Example of C = U Σ V^T: The matrix C

C       d1   d2   d3   d4   d5   d6
ship     1    0    1    0    0    0
boat     0    1    0    0    0    0
ocean    1    1    0    0    0    0
wood     1    0    0    1    1    0
tree     0    0    0    1    0    1

This is a standard term-document matrix. Actually, we use a non-weighted matrix here to simplify the example.

25 / 58

SLIDE 26

Example of C = U Σ V^T: All four matrices

C       d1   d2   d3   d4   d5   d6
ship     1    0    1    0    0    0
boat     0    1    0    0    0    0
ocean    1    1    0    0    0    0
wood     1    0    0    1    1    0
tree     0    0    0    1    0    1

=

U         1       2       3       4       5
ship    −0.44   −0.30    0.57    0.58    0.25
boat    −0.13   −0.33   −0.59    0.00    0.73
ocean   −0.48   −0.51   −0.37    0.00   −0.61
wood    −0.70    0.35    0.15   −0.58    0.16
tree    −0.26    0.65   −0.41    0.58   −0.09

×

Σ       1      2      3      4      5
1     2.16   0.00   0.00   0.00   0.00
2     0.00   1.59   0.00   0.00   0.00
3     0.00   0.00   1.28   0.00   0.00
4     0.00   0.00   0.00   1.00   0.00
5     0.00   0.00   0.00   0.00   0.39

×

V^T      d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41
3       0.28   −0.75    0.45   −0.20    0.12   −0.33
4       0.00    0.00    0.58    0.00   −0.58    0.58
5      −0.53    0.29    0.63    0.19    0.41   −0.22

26 / 58

SLIDE 27

Example of C = U Σ V^T: The matrix U

U         1       2       3       4       5
ship    −0.44   −0.30    0.57    0.58    0.25
boat    −0.13   −0.33   −0.59    0.00    0.73
ocean   −0.48   −0.51   −0.37    0.00   −0.61
wood    −0.70    0.35    0.15   −0.58    0.16
tree    −0.26    0.65   −0.41    0.58   −0.09

One row per term, one column for each of the min(M, N) dimensions, where M is the number of terms and N is the number of documents. This is an orthonormal matrix: (i) row vectors have unit length; (ii) any two distinct row vectors are orthogonal to each other. Think of the columns as labelled by "semantic" dimensions that capture distinct topics like politics, sports, economics. Each number U_ij in the matrix indicates how strongly term i is related to the topic represented by semantic dimension j.

27 / 58

SLIDE 28

Example of C = U Σ V^T: The matrix Σ

Σ       1      2      3      4      5
1     2.16   0.00   0.00   0.00   0.00
2     0.00   1.59   0.00   0.00   0.00
3     0.00   0.00   1.28   0.00   0.00
4     0.00   0.00   0.00   1.00   0.00
5     0.00   0.00   0.00   0.00   0.39

The M × N matrix Σ always reduces to a square, diagonal matrix of dimensionality min(M, N) × min(M, N). The diagonal consists of the singular values of C. The magnitude of a singular value measures the importance of the corresponding semantic dimension. We'll next make use of this by omitting the three least important dimensions.

28 / 58

SLIDE 29

Example of C = U Σ V^T: The matrix V^T

V^T      d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41
3       0.28   −0.75    0.45   −0.20    0.12   −0.33
4       0.00    0.00    0.58    0.00   −0.58    0.58
5      −0.53    0.29    0.63    0.19    0.41   −0.22
6       0.00    0.00    0.00   −0.58    0.58    0.58

One column per document, one row for each of the min(M, N) dimensions, where M is the number of terms and N is the number of documents. Again an orthonormal matrix: (i) column vectors have unit length; (ii) any two distinct column vectors are orthogonal to each other. The rows are the "semantic" dimensions from the term matrix U that capture distinct topics like politics, sports, economics. Each number V_ij in the (untransposed) matrix indicates how strongly document i is related to the topic represented by semantic dimension j.

29 / 58

SLIDE 30

LSI: Summary

We've decomposed the term-document matrix C into a product of three matrices. The term matrix U consists of one (row) vector for each term. The document matrix V^T consists of one (column) vector for each document. The singular value matrix Σ is a diagonal matrix of singular values, reflecting the importance of each dimension. Next: Why are we doing this?

30 / 58

SLIDE 31

Outline

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

31 / 58

SLIDE 32

How we use the SVD in LSI

Key property: each singular value tells us how important its dimension is. By setting less important dimensions to zero, we keep the important information, but get rid of the "details". These details may:

- be noise; in that case, reduced LSI is a better representation because it is less noisy
- make things dissimilar that should be similar; again, reduced LSI is a better representation because it represents similarity better

Analogy for "fewer details is better": compare an image of a bright red flower with an image of a black and white flower; omitting color makes it easier to see the similarity.

32 / 58

SLIDE 33

Selection of singular values

C_k = U_k Σ_k V_k^T, with shapes t × d = (t × k)(k × k)(k × d), truncated from the full t × d = (t × m)(m × m)(m × d). Here m is the original rank of C, and k is the number of singular values chosen to represent the concepts in the set of documents. Usually, k ≪ m. Σ_k^{-1} is defined only on the k-dimensional subspace.

33 / 58

SLIDE 34

Reducing the dimensionality to 2

U2        1       2
ship    −0.44   −0.30
boat    −0.13   −0.33
ocean   −0.48   −0.51
wood    −0.70    0.35
tree    −0.26    0.65

Σ2      1      2      3      4      5
1     2.16   0.00   0.00   0.00   0.00
2     0.00   1.59   0.00   0.00   0.00
3     0.00   0.00   0.00   0.00   0.00
4     0.00   0.00   0.00   0.00   0.00
5     0.00   0.00   0.00   0.00   0.00

V2^T     d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41

Actually, we only zero out singular values in Σ. This has the effect of setting the corresponding dimensions in U and V^T to zero when computing the product C = U Σ V^T.
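A matlab sketch reproducing this reduction (the resulting C2 appears on the next slide):

C = [1 0 1 0 0 0;    % ship
     0 1 0 0 0 0;    % boat
     1 1 0 0 0 0;    % ocean
     1 0 0 1 1 0;    % wood
     0 0 0 1 0 1];   % tree
[U,S,V] = svd(C);
S2 = S;  S2(3:end, 3:end) = 0;   % keep only sigma_1 = 2.16, sigma_2 = 1.59
C2 = U*S2*V'                     % matches the C2 table (up to rounding)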

34 / 58

SLIDE 35

Reducing the dimensionality to 2

C2       d1      d2      d3      d4      d5      d6
ship    0.85    0.52    0.28    0.13    0.21   −0.08
boat    0.36    0.36    0.16   −0.20   −0.02   −0.18
ocean   1.01    0.72    0.36   −0.04    0.16   −0.21
wood    0.97    0.12    0.20    1.03    0.62    0.41
tree    0.12   −0.39   −0.08    0.90    0.41    0.49

=

U         1       2       3       4       5
ship    −0.44   −0.30    0.57    0.58    0.25
boat    −0.13   −0.33   −0.59    0.00    0.73
ocean   −0.48   −0.51   −0.37    0.00   −0.61
wood    −0.70    0.35    0.15   −0.58    0.16
tree    −0.26    0.65   −0.41    0.58   −0.09

×

Σ2      1      2      3      4      5
1     2.16   0.00   0.00   0.00   0.00
2     0.00   1.59   0.00   0.00   0.00
3     0.00   0.00   0.00   0.00   0.00
4     0.00   0.00   0.00   0.00   0.00
5     0.00   0.00   0.00   0.00   0.00

×

V^T      d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41
3       0.28   −0.75    0.45   −0.20    0.12   −0.33
4       0.00    0.00    0.58    0.00   −0.58    0.58
5      −0.53    0.29    0.63    0.19    0.41   −0.22

35 / 58

SLIDE 36

Recall unreduced decomposition C = U Σ V^T

C       d1   d2   d3   d4   d5   d6
ship     1    0    1    0    0    0
boat     0    1    0    0    0    0
ocean    1    1    0    0    0    0
wood     1    0    0    1    1    0
tree     0    0    0    1    0    1

=

U         1       2       3       4       5
ship    −0.44   −0.30    0.57    0.58    0.25
boat    −0.13   −0.33   −0.59    0.00    0.73
ocean   −0.48   −0.51   −0.37    0.00   −0.61
wood    −0.70    0.35    0.15   −0.58    0.16
tree    −0.26    0.65   −0.41    0.58   −0.09

×

Σ       1      2      3      4      5
1     2.16   0.00   0.00   0.00   0.00
2     0.00   1.59   0.00   0.00   0.00
3     0.00   0.00   1.28   0.00   0.00
4     0.00   0.00   0.00   1.00   0.00
5     0.00   0.00   0.00   0.00   0.39

×

V^T      d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41
3       0.28   −0.75    0.45   −0.20    0.12   −0.33
4       0.00    0.00    0.58    0.00   −0.58    0.58
5      −0.53    0.29    0.63    0.19    0.41   −0.22

36 / 58

SLIDE 37

Original matrix C vs. reduced C2 = U Σ2 V^T

C       d1   d2   d3   d4   d5   d6
ship     1    0    1    0    0    0
boat     0    1    0    0    0    0
ocean    1    1    0    0    0    0
wood     1    0    0    1    1    0
tree     0    0    0    1    0    1

C2       d1      d2      d3      d4      d5      d6
ship    0.85    0.52    0.28    0.13    0.21   −0.08
boat    0.36    0.36    0.16   −0.20   −0.02   −0.18
ocean   1.01    0.72    0.36   −0.04    0.16   −0.21
wood    0.97    0.12    0.20    1.03    0.62    0.41
tree    0.12   −0.39   −0.08    0.90    0.41    0.49

We can view C2 as a two-dimensional representation of the matrix: we have performed a dimensionality reduction to two dimensions (marine, arboreal).

Note: the matrix called C2 in example 18.4 of the course text MRS (p. 381 in the printed book) is not C2 as defined there (eq. 18.17), but is instead Σ2 V^T (or equivalently U^T C2), which is why only its first two rows are non-vanishing. (The authors promise to correct this in the next edition.)

37 / 58

SLIDE 38

Why the reduced matrix is “better”

C       d1   d2   d3   d4   d5   d6
ship     1    0    1    0    0    0
boat     0    1    0    0    0    0
ocean    1    1    0    0    0    0
wood     1    0    0    1    1    0
tree     0    0    0    1    0    1

C2       d1      d2      d3      d4      d5      d6
ship    0.85    0.52    0.28    0.13    0.21   −0.08
boat    0.36    0.36    0.16   −0.20   −0.02   −0.18
ocean   1.01    0.72    0.36   −0.04    0.16   −0.21
wood    0.97    0.12    0.20    1.03    0.62    0.41
tree    0.12   −0.39   −0.08    0.90    0.41    0.49

Similarity of d2 and d3 in the original space: 0. Similarity of d2 and d3 in the reduced space: 0.52 × 0.28 + 0.36 × 0.16 + 0.72 × 0.36 + 0.12 × 0.20 + (−0.39) × (−0.08) ≈ 0.52. "boat" and "ship" are semantically similar, and the "reduced" similarity measure reflects this. What property of the SVD reduction is responsible for the improved similarity?

38 / 58

SLIDE 39

Documents in concept space

Consider the original term-document matrix C, and let e^(j) = jth basis vector (single 1 in jth position, 0 elsewhere). Then d^(j) = C e^(j) gives the components of the jth document, considered as a column vector. Since C = U Σ V^T, we can consider V^T e^(j) as the components of the document vector in concept space, before U maps it into word space (up to rescaling of the axes by Σ). Note: we can also consider the original d^(j) to be a vector in word space, and since left multiplication by U maps from concept space to word space, we can apply U^{-1} = U^T to map d^(j) into concept space, giving

U^T d^(j) = U^T C e^(j) = U^T U Σ V^T e^(j) = Σ V^T e^(j),

i.e., the same answer as before, up to rescaling of the axes by Σ (a convention to be considered more systematically in a few slides).
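A one-line matlab check of this identity on the running example (a sketch):

C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
[U,S,V] = svd(C);
j = 2;  e = zeros(6,1);  e(j) = 1;
norm(U'*(C*e) - S*(V'*e))        % ~0: U' d(j) = Sigma V' e(j)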

39 / 58

SLIDE 40

Rank 2 reduced V T

V^T      d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41
3       0.28   −0.75    0.45   −0.20    0.12   −0.33
4       0.00    0.00    0.58    0.00   −0.58    0.58
5      −0.53    0.29    0.63    0.19    0.41   −0.22

V2^T     d1      d2      d3      d4      d5      d6
1      −0.75   −0.28   −0.20   −0.45   −0.33   −0.12
2      −0.29   −0.53   −0.19    0.63    0.22    0.41

40 / 58

SLIDE 41

Documents in V2^T space (Fig 18.3)

Use the first two rows of V2^T as coordinates for documents in the reduced semantic space, rescaled by Σ = diag(2.16, 1.59). Note that d2 and d3 are not orthogonal, and note the clustering of d1, d2, d3 versus d4, d5, d6. [Figure 18.3: the six documents plotted in the two-dimensional semantic space.]

d(1) = Σ(−0.75, −0.29) = (−1.62, −0.46)
d(2) = Σ(−0.28, −0.53) = (−0.61, −0.84)
d(3) = Σ(−0.20, −0.19) = (−0.44, −0.30)
d(4) = Σ(−0.45, 0.63) = (−0.97, 1.00)
d(5) = Σ(−0.33, 0.22) = (−0.70, 0.35)
d(6) = Σ(−0.12, 0.41) = (−0.26, 0.65)

41 / 58

SLIDE 42

Outline

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

42 / 58

SLIDE 43

Why we use LSI in information retrieval

LSI takes documents that are semantically similar (= talk about the same topics) but are not similar in the vector space (because they use different words), and re-represents them in a reduced vector space in which they have higher similarity. Thus, LSI addresses the problems of synonymy and semantic relatedness. Standard vector space: synonyms contribute nothing to document similarity. Desired effect of LSI: synonyms contribute strongly to document similarity.

43 / 58

SLIDE 44

How LSI addresses synonymy and semantic relatedness

The dimensionality reduction forces us to omit a lot of "detail": we have to map different words (= different dimensions of the full space) to the same dimension in the reduced space. The "cost" of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words, and the SVD selects the "least costly" mapping (see below). Thus, it will map synonyms to the same dimension, but it will avoid doing that for unrelated words. LSI is like soft clustering: it interprets each dimension of the reduced space as a cluster, and the value of a document on that dimension as its fractional membership in that cluster.

44 / 58

SLIDE 45

LSI: Comparison to other approaches

Recap: relevance feedback and query expansion are used to increase recall in information retrieval, when query and documents have (in the extreme case) no terms in common; see Chapter 9 of the course text. LSI increases recall and hurts precision. Thus, it addresses the same problems as (pseudo) relevance feedback and query expansion, and it has the same problems.

45 / 58

SLIDE 46

Implementation

Compute the SVD of the term-document matrix. Reduce the space and compute reduced document representations. Map the query into the reduced space:

q_k = q U Σ_k^{-1}.

This follows from (more details starting in two slides): C_k = U Σ_k V^T ⇒ C_k^T = V Σ_k U^T ⇒ C^T U Σ_k^{-1} = V_k.

(Note: it is intuitive to translate the query into concept space using the same transformation as used on documents. Let the jth column of V^T represent the components of document j in concept space, d̂(j)_i = V_ji. Then d(j) = U_k Σ_k d̂(j) and d̂(j) = Σ_k^{-1} U_k^T d(j). The same transformation on the query vector q gives q̂ = Σ_k^{-1} U_k^T q, to be compared with other concept space vectors via cos(q̂, d̂(j)).)

Compute the similarity of q_k with all reduced documents in V_k. Output a ranked list of documents as usual. Exercise: what is the fundamental problem with this approach?
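A matlab sketch of the whole pipeline on the running example (the query vector here is an illustrative choice, containing "ship" and "boat"):

C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
[U,S,V] = svd(C);
k  = 2;
Uk = U(:,1:k);  Sk = S(1:k,1:k);  Vk = V(:,1:k);
q  = [1 1 0 0 0]';                % query terms: "ship", "boat"
qk = Sk \ (Uk'*q);                % q mapped into concept space
D  = Vk';                         % columns: reduced document representations
sims = (qk'*D) ./ (norm(qk) * sqrt(sum(D.^2)));   % cosine similarities
[~, ranking] = sort(sims, 'descend')              % ranked list of documents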

46 / 58

SLIDE 47

Optimality

SVD is optimal in the following sense: keeping the k largest singular values and setting all others to zero gives the optimal approximation of the original matrix C (the Eckart-Young theorem). Optimal: no other matrix of the same rank (= with the same underlying dimensionality) approximates C better, where the measure of approximation is the Frobenius norm

$$\|C\|_F = \sqrt{\sum_i \sum_j c_{ij}^2}.$$

So LSI uses the "best possible" matrix. Caveat: there is only a tenuous relationship between the Frobenius norm and the cosine similarity between documents.

47 / 58

SLIDE 48

Outline

1. Recap
2. Singular value decomposition
3. Latent semantic indexing
4. Dimensionality reduction
5. LSI in information retrieval
6. Redux of Comparisons

48 / 58

SLIDE 49

Term–term Comparison

To compare two terms, take the dot product between two rows of C, which measures the extent to which they have a similar pattern of occurrence across the full set of documents. The i, j entry of CC^T is equal to the dot product between the i, j rows of C. Since CC^T = U Σ V^T V Σ U^T = U Σ² U^T = (UΣ)(UΣ)^T, the i, j entry is also the dot product between the i, j rows of UΣ. Hence the rows of UΣ can be considered as coordinates for terms, whose dot products give comparisons between terms. (Σ just rescales the coordinates.)
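A quick matlab check of this identity (a sketch, again on the running example):

C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
[U,S,V] = svd(C, 'econ');
T = U*S;                      % rows: term coordinates
norm(C*C' - T*T', 'fro')      % ~0: same term-term dot products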

49 / 58

SLIDE 50

Document–document Comparison

To compare two documents, take the dot product between two columns of C, which measures the extent to which the two documents have a similar profile of terms. The i, j entry of C^TC is equal to the dot product between the i, j columns of C. Since C^TC = V Σ U^T U Σ V^T = V Σ² V^T = (VΣ)(VΣ)^T, the i, j entry is also the dot product between the i, j rows of VΣ. Hence the rows of VΣ can be considered as coordinates for documents, whose dot products give comparisons between documents. (Σ again just rescales the coordinates.)
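And the document-side analogue, as a sketch:

C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
[U,S,V] = svd(C, 'econ');
D = V*S;                      % rows: document coordinates
norm(C'*C - D*D', 'fro')      % ~0: same document-document dot products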

50 / 58

SLIDE 51

Term–document Comparison

To compare a term and a document, use directly the value of the i, j entry of C = U Σ V^T. This is the dot product between the ith row of U Σ^{1/2} and the jth row of V Σ^{1/2}, so use U Σ^{1/2} and V Σ^{1/2} as coordinates. Recall UΣ for term-term and VΣ for document-document comparisons: we can't use a single set of coordinates to make both between-term-and-document and within-term-or-document comparisons, but the difference is only a Σ^{1/2} stretch.

51 / 58

SLIDE 52

Pseudo-document – document Comparison

How should we represent "pseudo-documents", and how do we compute comparisons? E.g., given a novel query, find its location in concept space and its cosine w.r.t. existing documents, or other documents not in the original analysis (SVD). A query q is a vector of terms, like the columns of C, hence considered a pseudo-document. We derive a representation for any term vector q to be used in the document comparison formulas (like a row of V, as earlier). Constraint: for a real document, q = d(j) (= jth column C_ij), and before truncation (i.e., for C_k = C), this should give the corresponding row of V. Use q(s) = q U Σ^{-1} for comparing pseudo-documents to documents.

52 / 58

SLIDE 53

Pseudo-document – document Comparison: q(s) = q U Σ^{-1}

Consider the j, i component of C^T U Σ^{-1} = (V Σ U^T) U Σ^{-1} = V. By inspection, the jth row of the l.h.s. corresponds to the case q = d(j):

(C^T U Σ^{-1})_ji = (d(j) U Σ^{-1})_i,

and the r.h.s. V_ji is the jth row of V, as desired for comparing docs. So use q(s) = q U Σ^{-1}, which sums the corresponding rows of UΣ, hence corresponds to placing the pseudo-document at the centroid of the corresponding term points (up to rescaling of the rows by Σ). (Just as a row of V, scaled by Σ^{1/2} or Σ, can be used in semantic space for making term-doc or doc-doc comparisons.) Note: all of the above applies after any preprocessing used to construct C.

53 / 58

SLIDE 54

More on query document comparison

query = vector q in term space, with components q_i = 1 if term i is in the query, and 0 otherwise. Any query terms not in the original term vector space are ignored. In the VSM, the similarity between query q and the jth document d(j) is given by the "cosine measure". Using the term-document matrix C_ij, this dot product is given by the jth component of q · C, with d(j) = C e(j) (e(j) = jth basis vector, single 1 in jth position, 0 elsewhere). Hence

$$\mathrm{Similarity}(\vec q, \vec d^{(j)}) = \cos\theta = \frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C\, \vec e^{(j)}}{|\vec q|\,|C \vec e^{(j)}|}. \qquad (1)$$

54 / 58

SLIDE 55

Now approximate C → C_k

In the LSI approximation, use C_k (the rank-k approximation to C), so the similarity measure between query and document becomes

$$\frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C\, \vec e^{(j)}}{|\vec q|\,|C \vec e^{(j)}|} \;\Longrightarrow\; \frac{\vec q \cdot C_k\, \vec e^{(j)}}{|\vec q|\,|C_k \vec e^{(j)}|} = \frac{\vec q \cdot \vec d^{*(j)}}{|\vec q|\,|\vec d^{*(j)}|}, \qquad (2)$$

where d*(j) = C_k e(j) = U_k Σ_k V_k^T e(j) is the LSI representation of the jth document vector in the original term-document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the j = 1, . . . , N documents, and returning the best matches.

55 / 58

SLIDE 56

Pseudo-document

To see that this agrees with the prescription given in the course text (and the original LSI article), recall that the jth column of V_k^T represents document j in "concept space":

d̂(j) = V_k^T e(j),

and the query q is considered a "pseudo-document" in this space. The LSI document vector in term space was given above as

d*(j) = C_k e(j) = U_k Σ_k V_k^T e(j) = U_k Σ_k d̂(j),

so it follows that d̂(j) = Σ_k^{-1} U_k^T d*(j). The "pseudo-document" query vector q is translated into the concept space using the same transformation: q̂ = Σ_k^{-1} U_k^T q.

56 / 58

SLIDE 57

Compare documents in concept space

Recall that the i, j entry of C^TC is the dot product between the i, j columns of C (the term vectors for documents i and j). In the truncated space,

$$C_k^T C_k = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T.$$

Thus the i, j entry is the dot product between the i, j columns of (V_k Σ_k)^T = Σ_k V_k^T. In concept space, the comparison between pseudo-document q̂ and document d̂(j) is thus given by the cosine between Σ_k q̂ and Σ_k d̂(j):

$$\frac{(\Sigma_k \hat q) \cdot (\Sigma_k \hat d^{(j)})}{|\Sigma_k \hat q|\,|\Sigma_k \hat d^{(j)}|} = \frac{(\vec q^{\,T} U_k \Sigma_k^{-1} \Sigma_k)(\Sigma_k \Sigma_k^{-1} U_k^T \vec d^{*(j)})}{|U_k^T \vec q|\,|U_k^T \vec d^{*(j)}|} = \frac{\vec q \cdot \vec d^{*(j)}}{|U_k^T \vec q|\,|\vec d^{*(j)}|}, \qquad (3)$$

in agreement with (2), up to an overall q-dependent normalization which doesn't affect similarity rankings.
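A matlab sketch checking that (2) and (3) produce the same document ranking on the running example (illustrative query):

C = [1 0 1 0 0 0; 0 1 0 0 0 0; 1 1 0 0 0 0; 1 0 0 1 1 0; 0 0 0 1 0 1];
[U,S,V] = svd(C);  k = 2;
Uk = U(:,1:k);  Sk = S(1:k,1:k);  Vk = V(:,1:k);
Ck = Uk*Sk*Vk';                   % rank-2 approximation
q  = [1 1 0 0 0]';                % illustrative query: "ship", "boat"
cos2 = (q'*Ck) ./ (norm(q) * sqrt(sum(Ck.^2)));         % eq (2)
qh = Sk \ (Uk'*q);  D = Sk*Vk';                         % concept space
cos3 = ((Sk*qh)'*D) ./ (norm(Sk*qh) * sqrt(sum(D.^2))); % eq (3)
[~, r2] = sort(cos2, 'descend');  [~, r3] = sort(cos3, 'descend');
isequal(r2, r3)                   % same ranking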

57 / 58

SLIDE 58

58 / 58