INFO 4300 / CS4300 Information Retrieval


slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 11: Latent Semantic Indexing

Paul Ginsparg

Cornell University, Ithaca, NY

1 Oct 2009

1 / 44

slide-2
SLIDE 2

Overview

1. Recap
2. LSI in information retrieval
3. Edit distance
4. Spelling correction
5. Soundex

2 / 44

slide-3
SLIDE 3

Discussion 4, 6 Oct 2009

Read and be prepared to discuss the following paper: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, “Indexing by latent semantic analysis”. Journal of the American Society for Information Science, Volume 41, Issue 6, 1990. http://www3.interscience.wiley.com/cgi-bin/issuetoc?ID=10049584

Note that to access this paper from Wiley InterScience, you need to use a computer with a Cornell IP address. (also at /readings/jasis90f.pdf)

The paper’s notation corresponds to ours as follows:

X = T_0 S_0 D_0′ ⇔ C = U Σ V^T
X̂ = T S D′ ⇔ C_k = U Σ_k V^T

3 / 44

slide-4
SLIDE 4

Outline

1. Recap
2. LSI in information retrieval
3. Edit distance
4. Spelling correction
5. Soundex

4 / 44

slide-5
SLIDE 5

SVD

Let C be an M × N matrix of rank r, and C^T its N × M transpose. CC^T and C^T C have the same r eigenvalues λ1, . . . , λr.

U = the M × M matrix whose columns are the orthogonal eigenvectors of CC^T
V = the N × N matrix whose columns are the orthogonal eigenvectors of C^T C

Then there is a singular value decomposition (SVD)

C = U Σ V^T

where the M × N matrix Σ has Σ_ii = σ_i for 1 ≤ i ≤ r, and zero otherwise. The σ_i are called the singular values of C.
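The decomposition can be checked numerically. A minimal sketch with NumPy (the matrix is a made-up example, not from the slides):

```python
import numpy as np

# A small term-document-style matrix (hypothetical counts).
C = np.array([[1., 0., 1.],
              [0., 1., 0.],
              [1., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.]])

# Thin SVD: U has orthonormal columns, sigma holds the singular values.
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Reassemble: C equals U diag(sigma) V^T up to floating-point error.
C_rebuilt = U @ np.diag(sigma) @ Vt
assert np.allclose(C, C_rebuilt)

# Singular values come back non-negative and in decreasing order.
assert np.all(sigma >= 0) and np.all(sigma[:-1] >= sigma[1:])
```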

5 / 44

slide-6
SLIDE 6

Illustration of low rank approximation

Matrix entries affected by “zeroing out” the smallest singular value are indicated by dashed boxes in the slide’s figure of the factorization C_k = U Σ_k V^T.

6 / 44

slide-7
SLIDE 7

LSI: Summary

We’ve decomposed the term-document matrix C into a product of three matrices:

The term matrix U: consists of one (row) vector for each term
The document matrix V^T: consists of one (column) vector for each document
The singular value matrix Σ: a diagonal matrix of singular values, reflecting the importance of each dimension

Next: Why are we doing this?

7 / 44

slide-8
SLIDE 8

Why the reduced matrix is “better”

C        d1    d2    d3    d4    d5    d6
ship      1     0     1     0     0     0
boat      0     1     0     0     0     0
ocean     1     1     0     0     0     0
wood      1     0     0     1     1     0
tree      0     0     0     1     0     1

C2       d1    d2    d3    d4    d5    d6
ship    0.85  0.52  0.28  0.13  0.21 −0.08
boat    0.36  0.36  0.16 −0.20 −0.02 −0.18
ocean   1.01  0.72  0.36 −0.04  0.16 −0.21
wood    0.97  0.12  0.20  1.03  0.62  0.41
tree    0.12 −0.39 −0.08  0.90  0.41  0.49

Similarity of d2 and d3 in the original space: 0.

Similarity of d2 and d3 in the reduced space: 0.52 ∗ 0.28 + 0.36 ∗ 0.16 + 0.72 ∗ 0.36 + 0.12 ∗ 0.20 + (−0.39) ∗ (−0.08) ≈ 0.52

“boat” and “ship” are semantically similar. The “reduced” similarity measure reflects this. What property of the SVD reduction is responsible for the improved similarity?
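The reduced similarity can be recomputed directly. A sketch with NumPy, assuming the standard ship/boat/ocean/wood/tree example matrix (the zero entries are my reconstruction of the blanks in the table):

```python
import numpy as np

# Term-document matrix (rows: ship, boat, ocean, wood, tree).
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)

# Rank-2 approximation: keep the two largest singular values.
k = 2
C2 = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# Dot-product similarity of d2 and d3 (columns 1 and 2).
sim_orig = C[:, 1] @ C[:, 2]       # 0.0: no terms in common
sim_reduced = C2[:, 1] @ C2[:, 2]  # clearly positive in the reduced space
print(sim_orig, round(sim_reduced, 2))
```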

8 / 44

slide-9
SLIDE 9

Documents in V_2^T space

(Figure: the documents plotted in the two-dimensional reduced space.)

9 / 44

slide-10
SLIDE 10

Outline

1. Recap
2. LSI in information retrieval
3. Edit distance
4. Spelling correction
5. Soundex

10 / 44

slide-11
SLIDE 11

Why we use LSI in information retrieval

LSI takes documents that are semantically similar (= talk about the same topics), but are not similar in the vector space (because they use different words), and re-represents them in a reduced vector space in which they have higher similarity.

Thus, LSI addresses the problems of synonymy and semantic relatedness.

Standard vector space: Synonyms contribute nothing to document similarity.
Desired effect of LSI: Synonyms contribute strongly to document similarity.

11 / 44

slide-12
SLIDE 12

How LSI addresses synonymy and semantic relatedness

The dimensionality reduction forces us to omit a lot of “detail”. We have to map different words (= different dimensions of the full space) to the same dimension in the reduced space. The “cost” of mapping synonyms to the same dimension is much less than the cost of collapsing unrelated words. SVD selects the “least costly” mapping (see below). Thus, it will map synonyms to the same dimension, but it will avoid doing that for unrelated words.

LSI is like soft clustering: it interprets each dimension of the reduced space as a cluster, and the value of a document on that dimension as its fractional membership in that cluster.

12 / 44

slide-13
SLIDE 13

LSI: Comparison to other approaches

Recap: Relevance feedback and query expansion are used to increase recall in information retrieval, in case query and documents have (in the extreme case) no terms in common. LSI increases recall and hurts precision. Thus, it addresses the same problems as (pseudo) relevance feedback and query expansion, and it has the same problems.

13 / 44

slide-14
SLIDE 14

Implementation

Compute the SVD of the term-document matrix.
Reduce the space and compute reduced document representations.
Map the query into the reduced space:

q_k = q U Σ_k^{-1}

This follows from C_k = U Σ_k V^T ⇒ C_k^T = V Σ_k U^T ⇒ C^T U Σ_k^{-1} = V_k.

(Note: it is intuitive to translate the query into concept space using the same transformation as is used on documents. Let the jth column of V^T represent the components of document j in concept space, d̂(j)_i = V_ji. Then d(j) = U_k Σ_k d̂(j) and d̂(j) = Σ_k^{-1} U_k^T d(j). The same transformation on the query vector q gives q̂ = Σ_k^{-1} U_k^T q, which is compared with the other concept-space vectors via cos(q̂, d̂(j)).)

Compute the similarity of q_k with all reduced documents in V_k. Output a ranked list of documents as usual.

Exercise: What is the fundamental problem with this approach?
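The query-folding step can be sketched in NumPy. The function names and the example query are mine; the transformation is q̂ = Σ_k^{-1} U_k^T q, the column-vector form of the slide’s q_k = q U Σ_k^{-1}:

```python
import numpy as np

# Example term-document matrix (rows: ship, boat, ocean, wood, tree).
C = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

def fold_in(q):
    """Map a term-space vector into the k-dimensional concept space."""
    return np.linalg.inv(Sk) @ Uk.T @ q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([1, 1, 0, 0, 0], dtype=float)  # query: "ship boat"
q_hat = fold_in(q)

# Documents already live in concept space as the columns of V_k^T.
scores = [cosine(q_hat, Vtk[:, j]) for j in range(C.shape[1])]
print(np.argsort(scores)[::-1])  # the nautical documents rank first
```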

14 / 44

slide-15
SLIDE 15

Optimality

SVD is optimal in the following sense: keeping the k largest singular values and setting all others to zero gives you the optimal approximation of the original matrix C (the Eckart-Young theorem). Optimal: no other matrix of the same rank (= with the same underlying dimensionality) approximates C better.

The measure of approximation is the Frobenius norm:

||C||_F = sqrt( Σ_i Σ_j c_ij^2 )

So LSI uses the “best possible” matrix. Caveat: there is only a tenuous relationship between the Frobenius norm and cosine similarity between documents.
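The Eckart-Young optimality has a concrete numerical counterpart: the Frobenius error of the rank-k truncation equals the square root of the sum of the squared discarded singular values. A short NumPy check on a random matrix (illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((6, 4))

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Ck = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# ||C - C_k||_F equals sqrt(sigma_{k+1}^2 + ... + sigma_r^2).
err = np.linalg.norm(C - Ck)  # Frobenius norm is the default for matrices
assert np.isclose(err, np.sqrt(np.sum(sigma[k:] ** 2)))
```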

15 / 44

slide-16
SLIDE 16

Outline

1. Recap
2. LSI in information retrieval
3. Edit distance
4. Spelling correction
5. Soundex

16 / 44

slide-17
SLIDE 17

Spelling correction

Two principal uses

Correcting documents being indexed Correcting user queries

Two different methods for spelling correction Isolated word spelling correction

Check each word on its own for misspelling Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky

Context-sensitive spelling correction

Look at surrounding words Can correct form/from error above

17 / 44

slide-18
SLIDE 18

Correcting documents

We’re not interested in interactive spelling correction of documents (e.g., MS Word) in this class. In IR, we use document correction primarily for OCR’ed documents. The general philosophy in IR is: don’t change the documents.

18 / 44

slide-19
SLIDE 19

Correcting queries

First: isolated word spelling correction Premise 1: There is a list of “correct words” from which the correct spellings come. Premise 2: We have a way of computing the distance between a misspelled word and a correct word. Simple spelling correction algorithm: return the “correct” word that has the smallest distance to the misspelled word. Example: informaton → information We can use the term vocabulary of the inverted index as the list of correct words. Why is this problematic?

19 / 44

slide-20
SLIDE 20

Alternatives to using the term vocabulary

A standard dictionary (Webster’s, OED etc.) An industry-specific dictionary (for specialized IR systems) The term vocabulary of the collection, appropriately weighted

20 / 44

slide-21
SLIDE 21

Distance between misspelled word and “correct” word

We will study several alternatives. Edit distance and Levenshtein distance Weighted edit distance k-gram overlap

21 / 44

slide-22
SLIDE 22

Edit distance

The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.

Levenshtein distance: the admissible basic operations are insert, delete, and replace.

Levenshtein distance dog-do: 1
Levenshtein distance cat-cart: 1
Levenshtein distance cat-cut: 1
Levenshtein distance cat-act: 2
Damerau-Levenshtein distance cat-act: 1

Damerau-Levenshtein includes transposition as a fourth possible operation.

22 / 44

slide-23
SLIDE 23

Levenshtein distance: Computation

Computing the Levenshtein distance of cats and fast; each entry is the minimum edit cost for the corresponding prefixes:

         f    a    s    t
    0    1    2    3    4
c   1    1    2    3    4
a   2    2    1    2    3
t   3    3    2    2    2
s   4    4    3    2    3

The distance is the bottom-right entry: 3.

23 / 44

slide-24
SLIDE 24

Levenshtein distance: Example

The same computation, where each cell shows (top row) the replace/copy cost from the upper-left neighbor and the delete cost from the upper neighbor, and (bottom row) the insert cost from the left neighbor and the minimum:

         f         a         s         t
    0    1         2         3         4

c   1    1 2       2 3       3 4       4 5
         2 1       2 2       3 3       4 4

a   2    2 2       1 3       3 4       4 5
         3 2       3 1       2 2       3 3

t   3    3 3       3 2       2 3       2 4
         4 3       4 2       3 2       3 2

s   4    4 4       4 3       2 3       3 3
         5 4       5 3       4 2       3 3

24 / 44

slide-25
SLIDE 25

Each cell of Levenshtein matrix

Each cell of the Levenshtein matrix contains:

the cost of getting here from my upper left neighbor (copy or replace)
the cost of getting here from my upper neighbor (delete)
the cost of getting here from my left neighbor (insert)
the minimum of the three possible “movements”; the cheapest way of getting here

25 / 44

slide-26
SLIDE 26

Levenshtein distance: algorithm

LevenshteinDistance(s1, s2)
 1  for i ← 0 to |s1|
 2    do m[i, 0] = i
 3  for j ← 0 to |s2|
 4    do m[0, j] = j
 5  for i ← 1 to |s1|
 6    do for j ← 1 to |s2|
 7      do if s1[i] = s2[j]
 8        then m[i, j] = min{m[i − 1, j] + 1, m[i, j − 1] + 1, m[i − 1, j − 1]}
 9        else m[i, j] = min{m[i − 1, j] + 1, m[i, j − 1] + 1, m[i − 1, j − 1] + 1}
10  return m[|s1|, |s2|]

Operations: insert, delete, replace, copy
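The pseudocode translates almost line for line into Python (a sketch; 0-based indexing replaces the slide’s 1-based string indexing):

```python
def levenshtein_distance(s1, s2):
    """Dynamic-programming edit distance with insert, delete, replace, copy."""
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        m[i][0] = i                                    # i deletions
    for j in range(len(s2) + 1):
        m[0][j] = j                                    # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else 1   # copy is free, replace costs 1
            m[i][j] = min(m[i - 1][j] + 1,             # delete
                          m[i][j - 1] + 1,             # insert
                          m[i - 1][j - 1] + sub)       # copy / replace
    return m[len(s1)][len(s2)]

# The distances quoted earlier in the lecture:
assert levenshtein_distance("dog", "do") == 1
assert levenshtein_distance("cat", "cart") == 1
assert levenshtein_distance("cat", "act") == 2
assert levenshtein_distance("cats", "fast") == 3
assert levenshtein_distance("oslo", "snow") == 3
```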

26 / 44


slide-31
SLIDE 31

Weighted edit distance

As above, but the weight of an operation depends on the characters involved. This is meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q. Therefore, replacing m by n incurs a smaller edit cost than replacing m by q. We now require a weight matrix as input and modify the dynamic programming to handle weights.
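The modification is small: the three +1 costs become lookups in weight tables. A sketch (the cost function and neighbor set below are hypothetical, just to make the idea concrete):

```python
def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Levenshtein DP where the replacement cost depends on the character pair."""
    m = [[0.0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        m[i][0] = m[i - 1][0] + del_cost
    for j in range(1, len(s2) + 1):
        m[0][j] = m[0][j - 1] + ins_cost
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            m[i][j] = min(m[i - 1][j] + del_cost,
                          m[i][j - 1] + ins_cost,
                          m[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
    return m[len(s1)][len(s2)]

# Toy weights: identical characters are free, keyboard neighbors are cheap.
NEIGHBORS = {("m", "n"), ("n", "m")}

def sub_cost(a, b):
    if a == b:
        return 0.0
    return 0.5 if (a, b) in NEIGHBORS else 1.0

print(weighted_edit_distance("form", "forn", sub_cost))  # 0.5: m mistyped as n
print(weighted_edit_distance("form", "forq", sub_cost))  # 1.0: m/q is a full replace
```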

31 / 44

slide-32
SLIDE 32

Using edit distance for spelling correction

Given a query, first enumerate all character sequences within a preset (possibly weighted) edit distance. Intersect this set with our list of “correct” words. Then suggest terms in the intersection to the user. Or do automatic correction, but this is potentially expensive and disempowers the user.

32 / 44

slide-33
SLIDE 33

Example: Levenshtein distance oslo – snow

Each cell shows (top row) the replace/copy cost from the upper-left neighbor and the delete cost from the upper neighbor, and (bottom row) the insert cost from the left neighbor and the minimum:

         s         n         o         w
    0    1         2         3         4

o   1    1 2       2 3       2 4       4 5
         2 1       2 2       3 2       3 3

s   2    1 2       2 3       3 3       3 4
         3 1       2 2       3 3       4 3

l   3    3 2       2 3       3 4       4 4
         4 2       3 2       3 3       4 4

o   4    4 3       3 3       2 4       4 5
         5 3       4 3       4 2       3 3

One optimal sequence of operations (total cost 3):

cost   operation   input   output
 1     delete        o       *
 0     (copy)        s       s
 1     replace       l       n
 0     (copy)        o       o
 1     insert        *       w

33 / 44

slide-34
SLIDE 34

Outline

1. Recap
2. LSI in information retrieval
3. Edit distance
4. Spelling correction
5. Soundex

34 / 44

slide-35
SLIDE 35

k-gram indexes for spelling correction

Enumerate all k-grams in the query term.
Use the k-gram index to retrieve “correct” words that match query term k-grams.
Threshold by number of matching k-grams, e.g., keep only vocabulary terms that differ by at most 3 k-grams.

Example: bigram index, misspelled word bordroom. Bigrams: bo, or, rd, dr, ro, oo, om
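The bigram filtering step can be sketched directly (the vocabulary below is hypothetical):

```python
def kgrams(term, k=2):
    """The set of k-grams of a term, e.g. the bigrams 'bo', 'or', ... of 'bordroom'."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

vocab = ["boardroom", "border", "bedroom", "broom", "tree"]
query = "bordroom"

q_grams = kgrams(query)
assert q_grams == {"bo", "or", "rd", "dr", "ro", "oo", "om"}  # as on the slide

# Keep only vocabulary terms that share enough bigrams with the query;
# the threshold 4 is an arbitrary choice for this toy example.
candidates = [t for t in vocab if len(kgrams(t) & q_grams) >= 4]
print(candidates)  # ['boardroom', 'bedroom']
```

In practice the candidate set would come from a k-gram index (a postings list per bigram) rather than a scan over the whole vocabulary.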

35 / 44

slide-36
SLIDE 36

k-gram indexes for spelling correction: bordroom

Postings lists in the bigram index for three of the bigrams of bordroom:

rd → aboard, ardent, boardroom, border
or → border, lord, morbid, sordid
bo → aboard, about, boardroom, border

36 / 44

slide-37
SLIDE 37

Context-sensitive spelling correction

Our example was: an asteroid that fell form the sky. How can we correct form here? One idea: hit-based spelling correction.

Retrieve “correct” terms close to each query term. For flew form munich: flea for flew, from for form, munch for munich. Now try all possible resulting phrases as queries, with one word “fixed” at a time:

Try query “flea form munich”
Try query “flew from munich”
Try query “flew form munch”

The correct query “flew from munich” has the most hits.

Suppose we have 7 alternatives for flew, 19 for form and 3 for munich, how many “corrected” phrases will we enumerate?

37 / 44

slide-38
SLIDE 38

Context-sensitive spelling correction

The “hit-based” algorithm we just outlined is not very efficient. More efficient alternative: look at “collection” of queries, not documents

38 / 44

slide-39
SLIDE 39

General issues in spelling correction

User interface

automatic vs. suggested correction
“Did you mean” only works for one suggestion. What about multiple possible corrections?
Tradeoff: simple vs. powerful UI

Cost

Spelling correction is potentially expensive. Avoid running on every query? Maybe just on queries that match few documents. Guess: Spelling correction of major search engines is efficient enough to be run on every query.

39 / 44

slide-40
SLIDE 40

Peter Norvig’s spell corrector

http://norvig.com/spell-correct.html also http://www.facebook.com/video/video.php?v=644326502463

40 / 44

slide-41
SLIDE 41

Outline

1. Recap
2. LSI in information retrieval
3. Edit distance
4. Spelling correction
5. Soundex

41 / 44

slide-42
SLIDE 42

Soundex

Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.

Example: chebyshev / tchebyscheff Algorithm:

Turn every token to be indexed into a 4-character reduced form Do the same with query terms Build and search an index on the reduced forms

42 / 44

slide-43
SLIDE 43

Soundex algorithm

1. Retain the first letter of the term.
2. Change all occurrences of the following letters to ’0’ (zero): ’A’, ’E’, ’I’, ’O’, ’U’, ’H’, ’W’, ’Y’
3. Change letters to digits as follows:
   B, F, P, V to 1
   C, G, J, K, Q, S, X, Z to 2
   D, T to 3
   L to 4
   M, N to 5
   R to 6
4. Repeatedly remove one out of each pair of consecutive identical digits.
5. Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.
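The five steps can be sketched as a small Python function (a direct transcription of the simplified algorithm above, not a full industrial Soundex):

```python
def soundex(term):
    """Soundex code per the five steps above: a letter plus three digits."""
    term = term.upper()
    first = term[0]
    codes = {**{c: "0" for c in "AEIOUHWY"},
             **{c: "1" for c in "BFPV"},
             **{c: "2" for c in "CGJKQSXZ"},
             **{c: "3" for c in "DT"},
             "L": "4", "M": "5", "N": "5", "R": "6"}
    digits = "".join(codes[c] for c in term[1:] if c in codes)
    # Step 4: collapse runs of consecutive identical digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 5: drop zeros, pad with trailing zeros, keep four positions.
    code = "".join(d for d in collapsed if d != "0")
    return (first + code + "000")[:4]

assert soundex("HERMAN") == "H655"   # the worked example on the next slide
assert soundex("HERMANN") == "H655"  # so HERMANN maps to the same code
```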

43 / 44

slide-44
SLIDE 44

Example: Soundex of HERMAN

Retain H
ERMAN → 0RM0N (step 2: vowels, H, W, Y → 0)
0RM0N → 06505 (step 3: letters → digits)
06505 → 06505 (step 4: no consecutive identical digits to remove)
06505 → 655 (step 5: remove zeros)
Return H655

Will HERMANN generate the same code?

44 / 44

slide-45
SLIDE 45

How useful is Soundex?

Not very – for information retrieval Ok for “high recall” tasks in other applications (e.g., Interpol) Zobel and Dart (1996) suggest better alternatives for phonetic matching in IR.

45 / 44