

slide-1
SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 12: LSA wrap-up, Relevance Feedback and Query Expansion

Paul Ginsparg

Cornell University, Ithaca, NY

6 Oct 2011

1 / 60

slide-2
SLIDE 2

Administrativa

Assignment 2 due Sat 8 Oct, 1pm (late submission permitted until Sun 9 Oct at 11 p.m.)
Office hour: Saeed, F 3:30-4:30 (or cs4300-l, or Piazza)
No class Tue 11 Oct (midterm break)
Remember the midterm is one week from today, Thu Oct 13, 11:40-12:55, in Kimball B11. It will be open book. Email me by tomorrow if you will be out of town.
Topics examined include assignments, lectures, and discussion-class readings before the midterm break: term-doc matrix, tf.idf, precision-recall graph, LSA and recommender systems, word statistics (Heaps and Zipf). For a sample, see

http://www.infosci.cornell.edu/Courses/info4300/2011fa/exams.html

2 / 60

slide-3
SLIDE 3

Assignment related issues

Include instructions on how to compile and run your code.
Comment all files to include name and netID.
Follow common-sense programming practices (this refers to things like static paths, no visible prompts, and "some generally bizarre interfaces encountered").
Make sure that code runs in the CSUG Lab environment (e.g., at least remotely on the Linux machines: everyone should have a CSUG Lab account and should be able to access the lab machines remotely).
Use of external libraries (i.e., other than those already specified in the assignment description) is discouraged; it should be justified and cleared with the graders beforehand.

3 / 60

slide-4
SLIDE 4

Overview

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

4 / 60

slide-5
SLIDE 5

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

5 / 60

slide-6
SLIDE 6

Documents in concept space

Consider the original term–document matrix C, and let e^(j) = jth basis vector (single 1 in the jth position, 0 elsewhere). Then d^(j) = C e^(j) gives the components of the jth document, considered as a column vector. Since C = UΣV^T, we can consider ΣV^T e^(j) as the components of the document vector in concept space, before U maps it into word space. Note: we can also consider the original d^(j) to be a vector in word space, and since left multiplication by U maps from concept space to word space, we can apply U^{-1} = U^T to map d^(j) into concept space, giving U^T d^(j) = U^T C e^(j) = U^T U Σ V^T e^(j) = Σ V^T e^(j), as above.
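
A minimal numpy sketch of this identity; the small term–document matrix below is made up for illustration:

```python
import numpy as np

# Toy term-document matrix C: rows = terms, columns = documents (made-up counts).
C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])

# Thin SVD: C = U @ diag(s) @ Vt, with orthonormal columns in U and rows in Vt.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

j = 1                                  # pick the j-th document
e_j = np.eye(C.shape[1])[:, j]         # j-th basis vector
d_j = C @ e_j                          # document vector in term (word) space

# Map the document into concept space two ways:
via_U = U.T @ d_j                      # U^T d^(j)
via_SV = np.diag(s) @ Vt @ e_j         # Sigma V^T e^(j)

print(np.allclose(via_U, via_SV))      # True: both give the concept-space coordinates
```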

6 / 60

slide-7
SLIDE 7

Term–term Comparison

To compare two terms, take the dot product between two rows of C, which measures the extent to which they have a similar pattern of occurrence across the full set of documents. The (i, j) entry of CC^T is equal to the dot product between rows i and j of C. Since CC^T = UΣV^T VΣU^T = UΣ²U^T = (UΣ)(UΣ)^T, the (i, j) entry is the dot product between rows i and j of UΣ. Hence the rows of UΣ can be considered as coordinates for terms, whose dot products give comparisons between terms. (Σ just rescales the coordinates.)

7 / 60

slide-8
SLIDE 8

Document–document Comparison

To compare two documents, take the dot product between two columns of C, which measures the extent to which the two documents have a similar profile of terms. The (i, j) entry of C^T C is equal to the dot product between columns i and j of C. Since C^T C = VΣU^T UΣV^T = VΣ²V^T = (VΣ)(VΣ)^T, the (i, j) entry is the dot product between rows i and j of VΣ. Hence the rows of VΣ can be considered as coordinates for documents, whose dot products give comparisons between documents. (Σ again just rescales the coordinates.)
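
A similar sketch (again with a made-up matrix) checking that the rows of UΣ and VΣ reproduce the term–term and document–document dot products:

```python
import numpy as np

C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])                  # toy term-document matrix

U, s, Vt = np.linalg.svd(C, full_matrices=False)
term_coords = U @ np.diag(s)                  # rows: term coordinates (U Sigma)
doc_coords = Vt.T @ np.diag(s)                # rows: document coordinates (V Sigma)

print(np.allclose(term_coords @ term_coords.T, C @ C.T))   # term-term comparisons
print(np.allclose(doc_coords @ doc_coords.T, C.T @ C))     # document-document comparisons
```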

8 / 60

slide-9
SLIDE 9

Term–document Comparison

To compare a term and a document, use directly the value of the (i, j) entry of C = UΣV^T. This is the dot product between the ith row of UΣ^{1/2} and the jth row of VΣ^{1/2}, so use UΣ^{1/2} and VΣ^{1/2} as coordinates. Recall UΣ for term–term and VΣ for document–document comparisons — we can't use a single set of coordinates to make both between-term-and-document and within-term-or-document comparisons, but the difference is only a Σ^{1/2} stretch.

9 / 60

slide-10
SLIDE 10

Recall query document comparison

query = vector q in term space; components q_i = 1 if term i is in the query, and 0 otherwise; any query terms not in the original term vector space are ignored. In the VSM, the similarity between query q and the jth document d^(j) is given by the "cosine measure" q · d^(j) / (|q| |d^(j)|). Using the term–document matrix C_ij, this dot product is given by the jth component of q · C: d^(j) = C e^(j) (e^(j) = jth basis vector, single 1 in the jth position, 0 elsewhere). Hence

$$\mathrm{Similarity}(\vec q, \vec d^{(j)}) = \cos\theta = \frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C\,\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|}. \qquad (1)$$
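
A minimal sketch of equation (1) in numpy, with an invented term–document matrix and a binary query vector:

```python
import numpy as np

C = np.array([[1., 0., 2.],      # term 0 counts in docs 0..2 (made-up)
              [0., 1., 1.],      # term 1
              [3., 1., 0.],      # term 2
              [0., 2., 1.]])     # term 3

q = np.array([1., 0., 1., 0.])   # query contains terms 0 and 2

# Cosine similarity between q and every document column d^(j) = C e^(j)
sims = (q @ C) / (np.linalg.norm(q) * np.linalg.norm(C, axis=0))
print(sims)                       # one score per document
print(np.argsort(-sims))          # documents ranked by similarity
```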

10 / 60

slide-11
SLIDE 11

Now approximate C → Ck

In the LSI approximation we use C_k (the rank-k approximation to C), so the similarity measure between query and document becomes

$$\frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C\,\vec e^{(j)}}{|\vec q|\,|C\vec e^{(j)}|} \;\Longrightarrow\; \frac{\vec q \cdot C_k\,\vec e^{(j)}}{|\vec q|\,|C_k\vec e^{(j)}|} = \frac{\vec q \cdot \vec d^{*(j)}}{|\vec q|\,|\vec d^{*(j)}|}, \qquad (2)$$

where d^{*(j)} = C_k e^(j) = U_k Σ_k V_k^T e^(j) is the LSI-reduced representation of the jth document vector in the original term space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the j = 1, ..., N documents and returning the best matches.
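
A sketch of (2) under the same kind of toy data: truncate the SVD to rank k and rank documents against the columns of C_k (the value k = 2 is arbitrary here):

```python
import numpy as np

C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])
q = np.array([1., 0., 1., 0.])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2                                          # keep the k largest singular values
Ck = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation of C

# d*_(j) = Ck e^(j): the LSI-reduced document vectors are just the columns of Ck
sims = (q @ Ck) / (np.linalg.norm(q) * np.linalg.norm(Ck, axis=0))
print(np.argsort(-sims))                       # ranking under the LSI approximation
```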

11 / 60

slide-12
SLIDE 12

Pseudo-document

To see that this agrees with the prescription given in the course text (and the original LSI article), recall: the jth column of V_k^T represents document j in "concept space": d_c^(j) = V_k^T e^(j); the query q is considered a "pseudo-document" in this space. The LSI document vector in term space was given above as d^{*(j)} = C_k e^(j) = U_k Σ_k V_k^T e^(j) = U_k Σ_k d_c^(j), so it follows that d_c^(j) = Σ_k^{-1} U_k^T d^{*(j)}. The "pseudo-document" query vector q is translated into the concept space using the same transformation: q_c = Σ_k^{-1} U_k^T q.
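
A sketch of this fold-in with numpy (toy data; k chosen arbitrarily): documents live in concept space as the columns of V_k^T, and the query is mapped there by Σ_k^{-1} U_k^T:

```python
import numpy as np

C = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [3., 1., 0.],
              [0., 2., 1.]])
q = np.array([1., 0., 1., 0.])

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

docs_concept = Vtk                 # column j is d_c^(j) = V_k^T e^(j)
q_concept = (Uk.T @ q) / sk        # q_c = Sigma_k^{-1} U_k^T q  ("fold-in")

print(q_concept)                   # k-dimensional representation of the query
print(docs_concept)                # k-dimensional representations of the documents
```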

12 / 60

slide-13
SLIDE 13

More document–document comparison in concept space

Recall the (i, j) entry of C^T C is the dot product between columns i and j of C (the term vectors for documents i and j). In the truncated space,

$$C_k^T C_k = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T.$$

Thus the (i, j) entry is the dot product between columns i and j of (V_k Σ_k)^T = Σ_k V_k^T.

In concept space, the comparison between pseudo-document q_c and document d_c^(j) is thus given by the cosine between Σ_k q_c and Σ_k d_c^(j):

$$\frac{(\Sigma_k \vec q_c)\cdot(\Sigma_k \vec d_c^{(j)})}{|\Sigma_k \vec q_c|\,|\Sigma_k \vec d_c^{(j)}|} = \frac{(\vec q^{\,T} U_k \Sigma_k^{-1}\Sigma_k)\,(\Sigma_k \Sigma_k^{-1} U_k^T \vec d^{*(j)})}{|U_k^T \vec q|\,|U_k^T \vec d^{*(j)}|} = \frac{\vec q \cdot \vec d^{*(j)}}{|U_k^T \vec q|\,|\vec d^{*(j)}|}, \qquad (3)$$

in agreement with (2), up to an overall q-dependent normalization which doesn't affect similarity rankings.

13 / 60

slide-14
SLIDE 14

Pseudo-document – document comparison summary

So given a novel query, find its location in concept space, and find its cosine w.r.t. existing documents, or other documents not in the original analysis (SVD). We've just learned how to represent "pseudo-documents", and how to compute comparisons:

A query q is a vector of terms, like the columns of C, hence considered a pseudo-document.
Derive a representation for any term vector q to be used in the document comparison formulas (like a row of V, as earlier).
Constraint: for a real document q = d^(j) (= jth column of C_ij), and before truncation (i.e., for C_k = C), it should give the corresponding row of V.
Use q_c = q U Σ^{-1} for comparing pseudo-docs to docs.

14 / 60

slide-15
SLIDE 15

Pseudo-document – document Comparison: q_c = q U Σ^{-1}

q_c = q U Σ^{-1} sums the corresponding rows of UΣ, hence corresponds to placing the pseudo-document at the centroid of the corresponding term points (up to a rescaling of the rows by Σ). (Just as a row of V scaled by Σ^{1/2} or Σ can be used in semantic space for making term–doc or doc–doc comparisons.) Note: all of the above applies after any preprocessing used to construct C.

15 / 60

slide-16
SLIDE 16

Recommendation to new user

Let 1_k be the diagonal matrix with the first k entries equal to 1 (i.e., the projection or truncation onto the first k dimensions). Then since X = UΣV^T and Σ_k = Σ1_k, the usual rank reduction can be written X_k = (UΣ)1_k V^T = (XV)1_k V^T = X(V 1_k V^T), where the rows of X_k contain the recommendations for existing users. (V_k^T = 1_k V^T and V_k = V 1_k, so X_k = X(V V_k^T).)

We are looking for a transformation of a new user vector n which would have the same effect. From the above, right-multiplying any row of X by V 1_k V^T turns it into the corresponding row of the "improved" X_k, so we use n V 1_k V^T to make recommendations for a new user who is not already contained in X.
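
A small numpy sketch of this fold-in for a new user; the ratings matrix X and the new-user row n are invented for illustration:

```python
import numpy as np

# Invented user x item ratings matrix (rows = existing users, columns = items)
X = np.array([[5., 4., 0., 1.],
              [4., 5., 1., 0.],
              [1., 0., 5., 4.],
              [0., 1., 4., 5.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
k = 2

# Rank-k smoothing for existing users: Xk = X (V 1_k V^T)
P = V[:, :k] @ V[:, :k].T          # V 1_k V^T, a projection in item space
Xk = X @ P                         # rows: smoothed recommendations for existing users

# New user not in X: apply the same right-multiplication to their rating row n
n = np.array([4., 5., 0., 1.])     # invented partial ratings of the new user
recs = n @ P                       # predicted/smoothed scores for the new user
print(np.round(recs, 2))
```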

16 / 60

slide-17
SLIDE 17

17 / 60

slide-18
SLIDE 18

Discussion 3

Original LSA article: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman, "Indexing by latent semantic analysis". Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.

Some questions:
Explain the name "latent semantic analysis".
What problems is LSA attempting to solve? Does it succeed?
What criteria were used in selecting the SVD of the term–doc matrix?
Explain the meaning of the matrices in the SVD C = UΣV^T.
What does the rank reduction C ≈ C_k = U_k Σ_k V_k^T (keeping only the first k elements of Σ) have to do with latent semantics?
Fig. 1: what aspect of LSA does this illustrate? (Which docs are closer to the query vector in concept space despite not containing words in common with the query?)

18 / 60

slide-19
SLIDE 19
Fig. 4: (a) LSI-100 does better at the right of this graph than the left — what does this have to do with synonymy and polysemy?
Describe the methodology of the MED experiment. Why were the authors surprised that TERM and SMART gave similar results? The results on CISI were not as strong; possible explanations?
Fig. 5: what data does the graph plot? What conclusions can you draw?
The article states "the only way documents can be retrieved is by an exhaustive comparison of a query vector against all stored document vectors". Explain the statement. Is it a serious problem?

19 / 60

slide-20
SLIDE 20

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

20 / 60

slide-21
SLIDE 21

Independent Parameters

Examples:
real 2×2 matrix (a b; c d): 4 parameters
real symmetric 2×2 matrix (a b; b c): 3 parameters
real symmetric 2×2 matrix with equal diagonal elements (a b; b a): 2 parameters
real orthogonal 2×2 matrix (cos θ  sin θ; −sin θ  cos θ): 1 parameter
(constraints a² + b² = 1, c² + d² = 1, ac + bd = 0, i.e. 3 constraints)

But notice that the result of multiplying a general real 2×2 matrix (a b; c d) by any of the above still has a total of only four independent parameters.

  • 21 / 60
slide-22
SLIDE 22

Example: Just one Feature (from lecture 8, slide 9)

Suppose there is only 1 feature, overall quality, and 1 corresponding user tendency to rate high/low.
Three users: U_u = (1, 2, 3)
Five movies: V_m = (1, 1, 3, 2, 1)
Predicted rating matrix:

P_um = U_u V_m =
( 1 1 3 2 1
  2 2 6 4 2
  3 3 9 6 3 )

'Explain' 15 data points with only 7 parameters (two unit vectors plus one overall scale = 2 + 4 + 1).
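
This prediction matrix is just an outer product; a quick numpy check:

```python
import numpy as np

U = np.array([1, 2, 3])            # per-user tendency to rate high/low
V = np.array([1, 1, 3, 2, 1])      # per-movie overall quality
P = np.outer(U, V)                 # predicted rating P_um = U_u * V_m
print(P)                           # 15 entries explained by 3 + 5 - 1 = 7 free parameters
```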

22 / 60

slide-23
SLIDE 23

In general

           1 . . . 1 2 . . . 1 1 . . . 2 . . . 2 . . . 4 5 . . . 1 1 . . . 2           

  • n×r

   1 1 . . . 2 1 . . . . . . . . . . . . . . . 2 5 . . . 1 2   

  • r×m

=            3 6 . . . 3 3 4 7 . . . 5 4 5 11 . . . 4 5 . . . 10 22 . . . 8 10 7 10 . . . 11 7 5 11 . . . 4 4           

  • n×m

=            1 . . . 1 2 . . . 1 1 . . . 2 . . . 2 . . . 4 5 . . . 1 1 . . . 2           

  • n×r

  R−1  

  • r×r

  R  

  • r×r

   1 1 . . . 2 1 . . . . . . . . . . . . . . . 2 5 . . . 1 2   

  • r×m

nr + mr − r 2 parameters
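
A quick numpy check of this invariance, with random made-up factors A and B and an arbitrary invertible R (the names A, B, R are mine, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, m = 6, 2, 5
A = rng.normal(size=(n, r))        # n x r factor
B = rng.normal(size=(r, m))        # r x m factor
R = rng.normal(size=(r, r))        # an (almost surely) invertible r x r matrix

X = A @ B
X2 = (A @ np.linalg.inv(R)) @ (R @ B)    # same product with rescaled factors
print(np.allclose(X, X2))                # True: only n*r + m*r - r^2 parameters matter
```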

23 / 60

slide-24
SLIDE 24

Alternatively

Number of free parameters in an M × M orthogonal matrix: (1/2)M(M − 1).
(OO^T = 1 is automatically symmetric, so there are M [normalization] constraints on the diagonal and (1/2)M(M − 1) off-diagonal [orthogonality] constraints, giving a total of (1/2)M(M + 1) constraints and leaving M² − (1/2)M(M + 1) = (1/2)M(M − 1) free.)

If only the first r columns of U contribute, then there are rM entries subject to r normalization conditions (columns are length-1 vectors) and (1/2)r(r − 1) orthogonality conditions (different columns have vanishing dot product), leaving rM − (1/2)r(r + 1) free.

Similarly, the r relevant columns of V (rows of V^T) give rN − (1/2)r(r + 1), plus the r singular values in Σ; these sum to

rM − (1/2)r(r + 1) + rN − (1/2)r(r + 1) + r = r(M + N) − r²,

as expected for an M × r matrix times an r × N matrix, invariant under insertion of an arbitrary R⁻¹R in between (previous slide). Note that for r = 1 this gives M + N − 1 (two slides back).

24 / 60

slide-25
SLIDE 25

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

25 / 60

slide-26
SLIDE 26

How can we improve recall in search?

Main topic today: two ways of improving recall: relevance feedback and query expansion Example

Query q: [aircraft] Document d contains “plane”, but doesn’t contain “aircraft”. A simple IR system will not return d for q. Even if d is the most relevant document for q!

Options for improving recall

Local: Do a “local”, on-demand analysis for a user query

Main local method: relevance feedback

Global: Do a global analysis once (e.g., of collection) to produce thesaurus

Use thesaurus for query expansion

26 / 60

slide-27
SLIDE 27

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

27 / 60

slide-28
SLIDE 28

Relevance feedback: Basic idea

The user issues a (short, simple) query. The search engine returns a set of documents. User marks some docs as relevant, some as nonrelevant. Search engine computes a new representation of the information need – should be better than the initial query. Search engine runs new query and returns new results. New results have (hopefully) better recall.

28 / 60

slide-29
SLIDE 29

Relevance feedback

We can iterate this: several rounds of relevance feedback. We will use the term ad hoc retrieval to refer to regular retrieval without relevance feedback. We will now look at three different examples of relevance feedback that highlight different aspects of the process.

29 / 60

slide-30
SLIDE 30

Example 3: A real (non-image) example

Initial query: new space satellite applications
Results for the initial query (r = rank):
+ 1  0.539  NASA Hasn't Scrapped Imaging Spectrometer
+ 2  0.533  NASA Scratches Environment Gear From Satellite Plan
  3  0.528  Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4  0.526  A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5  0.525  Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6  0.524  Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7  0.516  Arianespace Receives Satellite Launch Pact From Telesat Canada
+ 8  0.509  Telecommunications Tale of Two Companies
The user then marks relevant documents with "+".

30 / 60

slide-31
SLIDE 31

Expanded query after relevance feedback

2.074 new, 15.106 space, 30.816 satellite, 5.660 application, 5.991 nasa, 5.196 eos, 4.196 launch, 3.972 aster, 3.516 instrument, 3.446 arianespace, 3.004 bundespost, 2.806 ss, 2.790 rocket, 2.053 scientist, 2.003 broadcast, 1.172 earth, 0.836 oil, 0.646 measure

31 / 60

slide-32
SLIDE 32

Results for expanded query

* 1  0.513  NASA Scratches Environment Gear From Satellite Plan
* 2  0.500  NASA Hasn't Scrapped Imaging Spectrometer
  3  0.493  When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4  0.493  NASA Uses 'Warm' Superconductors For Fast Circuit
* 5  0.492  Telecommunications Tale of Two Companies
  6  0.491  Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7  0.490  Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8  0.490  Rescue of Satellite By Space Agency To Cost $90 Million

32 / 60

slide-33
SLIDE 33

Key concept for relevance feedback: Centroid

The centroid is the center of mass of a set of points. Recall that we represent documents as points in a high-dimensional space. Thus: we can compute centroids of documents. Definition:

$$\vec\mu(D) = \frac{1}{|D|} \sum_{d \in D} \vec v(d)$$

where D is a set of documents and v(d) = d is the vector we use to represent document d.
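
The definition in numpy, assuming documents have already been turned into vectors (the vectors below are made up):

```python
import numpy as np

# Made-up document vectors v(d), one per row
D = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])

centroid = D.mean(axis=0)          # mu(D) = (1/|D|) * sum of document vectors
print(centroid)                    # [1.0, 0.6667, 1.0]
```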

33 / 60

slide-34
SLIDE 34

Centroid: Examples

[Figure: two example sets of points, x's and ⋄'s.]

34 / 60

slide-35
SLIDE 35

Rocchio algorithm

The Rocchio algorithm implements relevance feedback in the vector space model. Rocchio chooses the query q_opt that maximizes

$$\vec q_{opt} = \arg\max_{\vec q}\,[\,\mathrm{sim}(\vec q, \mu(D_r)) - \mathrm{sim}(\vec q, \mu(D_{nr}))\,]$$

This is closely related to maximum separation between relevant and nonrelevant docs. Making some additional assumptions, we can rewrite q_opt as:

$$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})]$$

D_r: set of relevant docs; D_nr: set of nonrelevant docs.

35 / 60

slide-36
SLIDE 36

Rocchio algorithm

The optimal query vector is:

$$\vec q_{opt} = \mu(D_r) + [\mu(D_r) - \mu(D_{nr})] = \frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j + \left[\frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}} \vec d_j\right]$$

We move the centroid of the relevant documents by the difference between the two centroids.

36 / 60

slide-37
SLIDE 37

Exercise: Compute Rocchio vector

[Figure: circles (relevant documents) and x's (nonrelevant documents); compute the Rocchio vector.]

37 / 60

slide-38
SLIDE 38

Rocchio illustrated

[Figure: relevant documents (circles), nonrelevant documents (x's), the two centroids, and the resulting Rocchio query.]

µ_R: centroid of relevant documents
µ_NR: centroid of nonrelevant documents
µ_R − µ_NR: difference vector
Add the difference vector to µ_R to get q_opt.
q_opt separates relevant/nonrelevant perfectly.

38 / 60

slide-39
SLIDE 39

Rocchio 1971 algorithm (SMART)

Used in practice:

$$\vec q_m = \alpha \vec q_0 + \beta\,\mu(D_r) - \gamma\,\mu(D_{nr}) = \alpha \vec q_0 + \beta \frac{1}{|D_r|}\sum_{\vec d_j \in D_r} \vec d_j - \gamma \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}} \vec d_j$$

q_m: modified query vector; q_0: original query vector; D_r and D_nr: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term. The new query moves towards relevant documents and away from nonrelevant documents. Tradeoff α vs. β/γ: if we have a lot of judged documents, we want a higher β/γ. Set negative term weights to 0 (a "negative weight" for a term doesn't make sense in the vector space model).
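
A minimal sketch of this update in numpy; the weights and judged documents below are illustrative, not SMART's actual settings:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio (1971) relevance-feedback update in the vector space model."""
    qm = alpha * q0
    if len(relevant):                        # move towards the relevant centroid
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):                     # move away from the nonrelevant centroid
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)               # clip negative term weights to 0

# Made-up query and judged document vectors
q0 = np.array([1.0, 0.0, 1.0, 0.0])
rel = np.array([[0.9, 0.1, 0.8, 0.4]])
nonrel = np.array([[0.1, 0.9, 0.0, 0.7]])

print(rocchio(q0, rel, nonrel))
```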

39 / 60

slide-40
SLIDE 40

Positive vs. negative relevance feedback

Positive feedback is more valuable than negative feedback. For example, set β = 0.75, γ = 0.25 to give higher weight to positive feedback. Many systems only allow positive feedback.

40 / 60

slide-41
SLIDE 41

Relevance feedback: Assumptions

When can relevance feedback enhance recall? Assumption A1: The user knows the terms in the collection well enough for an initial query. Assumption A2: Relevant documents contain similar terms (so I can “hop” from one relevant document to a different one when giving relevance feedback).

41 / 60

slide-42
SLIDE 42

Violation of A1

Violation of assumption A1: The user knows the terms in the collection well enough for an initial query. Mismatch of searcher’s vocabulary and collection vocabulary Example: cosmonaut / astronaut

42 / 60

slide-43
SLIDE 43

Violation of A2

Violation of A2: Relevant documents are not similar. Example query: contradictory government policies. Several unrelated "prototypes":

Subsidies for tobacco farmers vs. anti-smoking campaigns
Aid for developing countries vs. high tariffs on imports from developing countries

Relevance feedback on tobacco docs will not help with finding docs on developing countries.

43 / 60

slide-44
SLIDE 44

Relevance feedback: Evaluation

Pick one of the evaluation measures from an earlier lecture, e.g., precision in the top 10: P@10.
Compute P@10 for the original query q0.
Compute P@10 for the modified relevance-feedback query q1.
In most cases: q1 is spectacularly better than q0! Is this a fair evaluation?

44 / 60

slide-45
SLIDE 45

Relevance feedback: Evaluation

Fair evaluation must be on the "residual" collection: docs not yet judged by the user. Studies have shown that relevance feedback is successful when evaluated this way. Empirically, one round of relevance feedback is often very useful. Two rounds are marginally useful.

45 / 60

slide-46
SLIDE 46

Evaluation: Caveat

True evaluation of usefulness must compare to other methods taking the same amount of time. Alternative to relevance feedback: User revises and resubmits query. Users may prefer revision/resubmission to having to judge relevance of documents. There is no clear evidence that relevance feedback is the “best use” of the user’s time.

46 / 60

slide-47
SLIDE 47

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

47 / 60

slide-48
SLIDE 48

Relevance feedback: Problems

Relevance feedback is expensive.

Relevance feedback creates long modified queries. Long queries are expensive to process.

Users are reluctant to provide explicit feedback. It’s often hard to understand why a particular document was retrieved after applying relevance feedback. Excite had full relevance feedback at one point, but abandoned it later.

48 / 60

slide-49
SLIDE 49

Pseudo-relevance feedback

Pseudo-relevance feedback automates the "manual" part of true relevance feedback. Pseudo-relevance algorithm:

Retrieve a ranked list of hits for the user's query.
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio).

This works very well on average, but it can go horribly wrong for some queries. Several iterations can cause query drift.
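
A sketch of one round of pseudo-relevance feedback, built from the cosine ranking and a positive-only Rocchio update sketched earlier; all data and the choice k = 2 are made up:

```python
import numpy as np

def rank(q, C):
    """Rank documents (columns of C) by cosine similarity to query q."""
    sims = (q @ C) / (np.linalg.norm(q) * np.linalg.norm(C, axis=0) + 1e-12)
    return np.argsort(-sims)

def rocchio(q0, relevant, alpha=1.0, beta=0.75):
    """Positive-only Rocchio update (no judged nonrelevant docs in pseudo-RF)."""
    return np.maximum(alpha * q0 + beta * relevant.mean(axis=0), 0.0)

C = np.array([[1., 0., 2., 0.],    # toy term-document matrix
              [0., 1., 1., 1.],
              [3., 1., 0., 2.],
              [0., 2., 1., 0.]])
q = np.array([1., 0., 1., 0.])

k = 2
top_k = rank(q, C)[:k]                     # retrieve, then assume the top k are relevant
q_expanded = rocchio(q, C[:, top_k].T)     # feedback using those pseudo-relevant docs
print(rank(q_expanded, C))                 # re-ranked results for the expanded query
```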

49 / 60

slide-50
SLIDE 50

Pseudo-relevance feedback at TREC4

Cornell SMART system. Results show the number of relevant documents in the top 100 for 50 queries (so the total number of documents considered is 5000):

method          relevant documents
lnc.ltc         3210
lnc.ltc-PsRF    3634
Lnu.ltu         3709
Lnu.ltu-PsRF    4350

The results contrast two length-normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used added only 20 terms to the query (Rocchio will add many more). This demonstrates that pseudo-relevance feedback is effective on average.

50 / 60

slide-51
SLIDE 51

Outline

1. Recap
2. Reduction in number of parameters
3. Motivation for query expansion
4. Relevance feedback: Details
5. Pseudo Relevance Feedback
6. Query expansion

51 / 60

slide-52
SLIDE 52

Query expansion

Query expansion is another method for increasing recall. We use “global query expansion” to refer to “global methods for query reformulation”. In global query expansion, the query is modified based on some global resource, i.e. a resource that is not query-dependent. Main information we use: (near-)synonymy A publication or database that collects (near-)synonyms is called a thesaurus. We will look at two types of thesauri: manually created and automatically created.

52 / 60

slide-53
SLIDE 53

Query expansion: Example

53 / 60

slide-54
SLIDE 54

Types of user feedback

User gives feedback on documents.

More common in relevance feedback

User gives feedback on words or phrases.

More common in query expansion

54 / 60

slide-55
SLIDE 55

Types of query expansion

Manual thesaurus (maintained by editors, e.g., PubMed)
Automatically derived thesaurus (e.g., based on co-occurrence statistics)
Query equivalence based on query-log mining (common on the web, as in the "palm" example)

55 / 60

slide-56
SLIDE 56

Thesaurus-based query expansion

For each term t in the query, expand the query with words the thesaurus lists as semantically related with t. Example from earlier: hospital → medical.
Generally increases recall.
May significantly decrease precision, particularly with ambiguous terms: interest rate → interest rate fascinate.
Widely used in specialized search engines for science and engineering.
It's very expensive to create a manual thesaurus and to maintain it over time.
A manual thesaurus is roughly equivalent to annotation with a controlled vocabulary.
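
A minimal sketch of thesaurus-based expansion; the tiny thesaurus dictionary here is invented for illustration:

```python
# Invented toy thesaurus mapping a term to (near-)synonyms
THESAURUS = {
    "hospital": ["medical", "clinic"],
    "aircraft": ["plane", "airplane"],
}

def expand_query(terms):
    """Expand each query term with the words the thesaurus lists as related."""
    expanded = []
    for t in terms:
        expanded.append(t)
        expanded.extend(THESAURUS.get(t, []))
    return expanded

print(expand_query(["aircraft", "engine"]))
# ['aircraft', 'plane', 'airplane', 'engine']
```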

56 / 60

slide-57
SLIDE 57

Example for manual thesaurus: PubMed

57 / 60

slide-58
SLIDE 58

Automatic thesaurus generation

Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents. The fundamental notion is similarity between two words.

Definition 1: Two words are similar if they co-occur with similar words. ("car" and "motorcycle" co-occur with "road", "gas" and "license", so they must be similar.)

Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. (You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar.)

Co-occurrence is more robust; grammatical relations are more accurate.
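
A sketch of Definition 1: compare the co-occurrence profiles of two words with a cosine. The counts below are made up, and a real system would typically reweight them (e.g., with PMI) before comparing:

```python
import numpy as np

terms = ["car", "motorcycle", "apple", "road", "gas", "license", "pear"]
# Made-up co-occurrence counts: entry (i, j) = how often terms[i] and terms[j] co-occur
X = np.array([
    [0, 2, 0, 9, 7, 5, 0],   # car
    [2, 0, 0, 8, 6, 4, 0],   # motorcycle
    [0, 0, 0, 0, 0, 0, 6],   # apple
    [9, 8, 0, 0, 3, 2, 0],   # road
    [7, 6, 0, 3, 0, 1, 0],   # gas
    [5, 4, 0, 2, 1, 0, 0],   # license
    [0, 0, 6, 0, 0, 0, 0],   # pear
], dtype=float)

def similarity(i, j):
    """Cosine between co-occurrence profiles: similar if they co-occur with similar words."""
    vi, vj = X[i], X[j]
    return vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-12)

print(round(similarity(0, 1), 2))   # car vs. motorcycle: high (shared road/gas/license)
print(round(similarity(0, 2), 2))   # car vs. apple: low (no shared contexts)
```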

58 / 60

slide-59
SLIDE 59

Co-occurrence-based thesaurus: Examples

Word: nearest neighbors
absolutely: absurd, whatsoever, totally, exactly, nothing
bottomed: dip, copper, drops, topped, slide, trimmed
captivating: shimmer, stunningly, superbly, plucky, witty
doghouse: dog, porch, crawling, beside, downstairs
makeup: repellent, lotion, glossy, sunscreen, skin, gel
mediating: reconciliation, negotiate, case, conciliation
keeping: hoping, bring, wiping, could, some, would
lithographs: drawings, Picasso, Dali, sculptures, Gauguin
pathogens: toxins, bacteria, organisms, bacterial, parasite
senses: grasp, psyche, truly, clumsy, naive, innate

59 / 60

slide-60
SLIDE 60

Summary

Relevance feedback and query expansion increase recall. In many cases, precision is decreased, often significantly. Log-based query modification (which is more complex than simple query expansion) is more common on the web than relevance feedback.

60 / 60