SLIDE 1

INFO 4300 / CS4300 Information Retrieval, slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 13: Query Expansion and Probabilistic Retrieval

Paul Ginsparg

Cornell University, Ithaca, NY

8 Oct 2009

SLIDE 2

Administrativa

• No office hours tomorrow, Fri 9 Oct; e-mail questions, doubts, concerns, and problems to cs4300-l@lists.cs.cornell.edu
• Remember: the mid-term is one week from today, Thu 15 Oct. For more info, see http://www.infosci.cornell.edu/Courses/info4300/2009fa/exams.html

SLIDE 3

Overview

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 4

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 5

Selection of singular values

The full SVD factors the term-document matrix as

$$\underset{t \times d}{C} \;=\; \underset{t \times m}{U}\;\underset{m \times m}{\Sigma}\;\underset{m \times d}{V^T}$$

Keeping only the $k$ largest singular values gives

$$\underset{t \times d}{C_k} \;=\; \underset{t \times k}{U_k}\;\underset{k \times k}{\Sigma_k}\;\underset{k \times d}{V_k^T}$$

$m$ is the original rank of $C$. $k$ is the number of singular values chosen to represent the concepts in the set of documents. Usually, $k \ll m$. $\Sigma_k^{-1}$ is defined only on the $k$-dimensional subspace.
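As a concrete illustration, here is a minimal NumPy sketch of this truncation (not the lecture's code); the matrix C and the sizes t, d, k are hypothetical stand-ins:

    # Minimal sketch of the rank-k SVD truncation, on a random stand-in matrix.
    import numpy as np

    t, d, k = 200, 100, 10                 # hypothetical sizes, with k << rank(C)
    C = np.random.rand(t, d)               # stand-in term-document matrix

    U, s, Vt = np.linalg.svd(C, full_matrices=False)  # C = U @ np.diag(s) @ Vt
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]          # keep k largest singular values
    Ck = Uk @ np.diag(sk) @ Vtk                       # rank-k approximation C_k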

SLIDE 6

Now approximate $C \to C_k$

In the LSI approximation, use $C_k$ (the rank-$k$ approximation to $C$), so the similarity measure between query and document becomes

$$\frac{\vec q \cdot \vec d^{(j)}}{|\vec q|\,|\vec d^{(j)}|} = \frac{\vec q \cdot C e^{(j)}}{|\vec q|\,|C e^{(j)}|} \;\Longrightarrow\; \frac{\vec q \cdot C_k e^{(j)}}{|\vec q|\,|C_k e^{(j)}|} = \frac{\vec q \cdot \vec d^{\,*(j)}}{|\vec q|\,|\vec d^{\,*(j)}|}\,, \qquad (2)$$

where $\vec d^{\,*(j)} = C_k e^{(j)} = U_k \Sigma_k V_k^T e^{(j)}$ is the LSI representation of the $j$th document vector in the original term-document space. Finding the closest documents to a query in the LSI approximation thus amounts to computing (2) for each of the $j = 1, \dots, N$ documents and returning the best matches.
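Continuing the sketch from the previous slide, computing (2) for all documents at once might look as follows; q is a hypothetical query vector in the original t-dimensional term space:

    # Rank documents by the cosine in (2); column j of Ck is d*_(j) = Ck e_(j).
    q = np.random.rand(t)                              # stand-in query vector

    num = q @ Ck                                       # q . d*_(j) for every j
    den = np.linalg.norm(q) * np.linalg.norm(Ck, axis=0)
    ranked = np.argsort(-(num / den))                  # best matches first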

SLIDE 7

Compare documents in concept space

Recall that the $i,j$ entry of $C^T C$ is the dot product between the $i$th and $j$th columns of $C$ (the term vectors for documents $i$ and $j$). In the truncated space,

$$C_k^T C_k = (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) = V_k \Sigma_k U_k^T U_k \Sigma_k V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T$$

Thus the $i,j$ entry is the dot product between the $i$th and $j$th columns of $(V_k \Sigma_k)^T = \Sigma_k V_k^T$.

In concept space, the comparison between pseudo-document $\hat q$ and document $\hat d^{(j)}$ is thus given by the cosine between $\Sigma_k \hat q$ and $\Sigma_k \hat d^{(j)}$:

$$\frac{(\Sigma_k \hat q) \cdot (\Sigma_k \hat d^{(j)})}{|\Sigma_k \hat q|\,|\Sigma_k \hat d^{(j)}|} = \frac{(\vec q^{\,T} U_k \Sigma_k^{-1} \Sigma_k)(\Sigma_k \Sigma_k^{-1} U_k^T \vec d^{\,*(j)})}{|U_k^T \vec q|\,|U_k^T \vec d^{\,*(j)}|} = \frac{\vec q \cdot \vec d^{\,*(j)}}{|U_k^T \vec q|\,|\vec d^{\,*(j)}|}\,, \qquad (3)$$

in agreement with (2), up to an overall $\vec q$-dependent normalization which doesn't affect similarity rankings.
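A sketch of the same comparison carried out in the k-dimensional concept space, reusing the hypothetical Uk, sk, Ck, q from the previous sketches; by (3) it produces the same ranking as (2):

    # Pseudo-query q^ = Sigma_k^{-1} U_k^T q; documents likewise, one per column.
    q_hat = (Uk.T @ q) / sk
    D_hat = (Uk.T @ Ck) / sk[:, None]

    sq = sk * q_hat                        # Sigma_k q^  (equals U_k^T q)
    sD = sk[:, None] * D_hat               # Sigma_k d^_(j), one column per document
    cos = (sq @ sD) / (np.linalg.norm(sq) * np.linalg.norm(sD, axis=0))
    # np.argsort(-cos) reproduces the ranking computed via (2) above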

SLIDE 8

SLIDE 9

Rocchio illustrated

[Figure: relevant and nonrelevant documents plotted in term space, with the centroids $\vec\mu_R$, $\vec\mu_{NR}$ and the optimal query $\vec q_{opt}$.]

• $\vec\mu_R$: centroid of relevant documents
• $\vec\mu_{NR}$: centroid of nonrelevant documents
• $\vec\mu_R - \vec\mu_{NR}$: difference vector
• Add the difference vector to $\vec\mu_R$ to get $\vec q_{opt}$ (sketched below)
• $\vec q_{opt}$ separates relevant/nonrelevant perfectly.
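A minimal sketch of the construction in the figure, assuming rel and nonrel are hypothetical arrays of document vectors (one row per document); this is the ideal q_opt, not the practical weighted Rocchio update:

    import numpy as np

    def rocchio_optimal_query(rel, nonrel):
        """q_opt: relevant centroid pushed away from the nonrelevant centroid."""
        mu_R = rel.mean(axis=0)            # centroid of relevant documents
        mu_NR = nonrel.mean(axis=0)        # centroid of nonrelevant documents
        return mu_R + (mu_R - mu_NR)       # q_opt = mu_R + (mu_R - mu_NR)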

SLIDE 10

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 11

Relevance feedback: Problems

• Relevance feedback is expensive.
• Relevance feedback creates long modified queries, and long queries are expensive to process.
• Users are reluctant to provide explicit feedback.
• It's often hard to understand why a particular document was retrieved after applying relevance feedback.
• Excite had full relevance feedback at one point, but abandoned it later.

SLIDE 12

Pseudo-relevance feedback

Pseudo-relevance feedback automates the "manual" part of true relevance feedback.

Pseudo-relevance algorithm:

• Retrieve a ranked list of hits for the user's query.
• Assume that the top k documents are relevant.
• Do relevance feedback (e.g., Rocchio); see the sketch below.

Works very well on average, but can go horribly wrong for some queries; several iterations can cause query drift.
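A hedged sketch of one round of the loop just described; the search function and the weights alpha, beta are hypothetical stand-ins, not the SMART system's actual implementation:

    import numpy as np

    def pseudo_relevance_feedback(q, search, k=10, alpha=1.0, beta=0.75):
        """Retrieve, assume the top k hits are relevant, and do a
        Rocchio-style positive update (no negative term)."""
        docs = search(q)                       # ranked list of document vectors
        centroid = np.array(docs[:k]).mean(axis=0)
        return alpha * q + beta * centroid     # expanded query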

SLIDE 13

Pseudo-relevance feedback at TREC4

Cornell SMART system. Results show the number of relevant documents out of the top 100 for 50 queries (so the total number of documents is 5000):

    method         number of relevant documents
    lnc.ltc                 3210
    lnc.ltc-PsRF            3634
    Lnu.ltu                 3709
    Lnu.ltu-PsRF            4350

The results contrast two length-normalization schemes (L vs. l) and pseudo-relevance feedback (PsRF). The pseudo-relevance feedback method used here added only 20 terms to the query (Rocchio will add many more). This demonstrates that pseudo-relevance feedback is effective on average.

SLIDE 14

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 15

Query expansion

Query expansion is another method for increasing recall. We use "global query expansion" to refer to "global methods for query reformulation": the query is modified based on some global resource, i.e., a resource that is not query-dependent.

• Main information we use: (near-)synonymy.
• A publication or database that collects (near-)synonyms is called a thesaurus.
• We will look at two types of thesauri: manually created and automatically created.

SLIDE 16

Query expansion: Example

SLIDE 17

Types of user feedback

User gives feedback on documents.

More common in relevance feedback

User gives feedback on words or phrases.

More common in query expansion

SLIDE 18

Types of query expansion

• Manual thesaurus (maintained by editors, e.g., PubMed)
• Automatically derived thesaurus (e.g., based on co-occurrence statistics)
• Query-equivalence based on query log mining (common on the web, as in the "palm" example)

SLIDE 19

Thesaurus-based query expansion

• For each term t in the query, expand the query with words the thesaurus lists as semantically related to t (see the toy sketch below).
• Example from earlier: hospital → medical
• Generally increases recall
• May significantly decrease precision, particularly with ambiguous terms: interest rate → interest rate fascinate
• Widely used in specialized search engines for science and engineering
• It's very expensive to create a manual thesaurus and to maintain it over time.
• A manual thesaurus is roughly equivalent to annotation with a controlled vocabulary.
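The toy sketch below illustrates the expansion step with a hypothetical two-entry thesaurus, including how the ambiguous term "interest" drags in "fascinate":

    # Hypothetical thesaurus mapping terms to (near-)synonyms.
    thesaurus = {"hospital": ["medical"], "interest": ["fascinate"]}

    def expand(query_terms, thesaurus):
        expanded = list(query_terms)
        for t in query_terms:
            expanded.extend(thesaurus.get(t, []))   # add words related to t
        return expanded

    print(expand(["interest", "rate"], thesaurus))  # ['interest', 'rate', 'fascinate']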

SLIDE 20

Example for manual thesaurus: PubMed

SLIDE 21

Automatic thesaurus generation

Attempt to generate a thesaurus automatically by analyzing the distribution of words in documents. The fundamental notion is the similarity between two words.

• Definition 1: Two words are similar if they co-occur with similar words. ("Car" and "motorcycle" co-occur with "road", "gas", and "license", so they must be similar.)
• Definition 2: Two words are similar if they occur in a given grammatical relation with the same words. (You can harvest, peel, eat, prepare, etc. both apples and pears, so apples and pears must be similar.)

Co-occurrence is more robust; grammatical relations are more accurate. A sketch of the co-occurrence approach follows below.
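A minimal sketch of Definition 1, assuming a hypothetical word-by-context co-occurrence count matrix X; words whose rows have similar co-occurrence profiles come out as nearest neighbors (random data here, so the neighbors are meaningless, but the computation is the point):

    import numpy as np

    X = np.random.rand(1000, 5000)             # hypothetical word x context counts
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                            # cosine similarity between word pairs
    nearest = np.argsort(-sim[42])[1:6]        # 5 nearest neighbors of word 42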

SLIDE 22

Co-occurrence-based thesaurus: Examples

    Word          Nearest neighbors
    absolutely    absurd, whatsoever, totally, exactly, nothing
    bottomed      dip, copper, drops, topped, slide, trimmed
    captivating   shimmer, stunningly, superbly, plucky, witty
    doghouse      dog, porch, crawling, beside, downstairs
    makeup        repellent, lotion, glossy, sunscreen, skin, gel
    mediating     reconciliation, negotiate, case, conciliation
    keeping       hoping, bring, wiping, could, some, would
    lithographs   drawings, Picasso, Dali, sculptures, Gauguin
    pathogens     toxins, bacteria, organisms, bacterial, parasite
    senses        grasp, psyche, truly, clumsy, naive, innate

SLIDE 23

Summary

• Relevance feedback and query expansion increase recall.
• In many cases, precision is decreased, often significantly.
• Log-based query modification (which is more complex than simple query expansion) is more common on the web than relevance feedback.

SLIDE 24

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 25

Basics of probability theory

• $A$ = event; $0 \le p(A) \le 1$
• Joint probability: $p(A, B) = p(A \cap B)$
• Conditional probability: $p(A|B) = p(A, B)/p(B)$
• Note $p(A, B) = p(A|B)\,p(B) = p(B|A)\,p(A)$, which gives the posterior probability of $A$ after seeing the evidence $B$. Bayes' theorem:

$$p(A|B) = \frac{p(B|A)\,p(A)}{p(B)}$$

• In the denominator, use $p(B) = p(B, A) + p(B, \bar A) = p(B|A)\,p(A) + p(B|\bar A)\,p(\bar A)$
• Odds: $O(A) = \dfrac{p(A)}{p(\bar A)} = \dfrac{p(A)}{1 - p(A)}$

A numeric check follows below.
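A quick numeric check of these definitions with made-up numbers, p(A) = 0.3, p(B|A) = 0.8, p(B|Ā) = 0.1:

    pA, pB_A, pB_notA = 0.3, 0.8, 0.1
    pB = pB_A * pA + pB_notA * (1 - pA)   # total probability: 0.24 + 0.07 = 0.31
    pA_B = pB_A * pA / pB                 # posterior p(A|B) = 0.24/0.31 ≈ 0.774
    odds_A = pA / (1 - pA)                # prior odds O(A) = 0.3/0.7 ≈ 0.429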

SLIDE 26

Probability Ranking Principle (PRP)

For query $q$ and document $d$, let $R_{d,q}$ be a binary indicator of whether $d$ is relevant to $q$: $R = 1$ if relevant, else $R = 0$.

Order documents according to the estimated probability of relevance with respect to the information need, $p(R = 1|d, q)$.

(Bayes optimal decision rule) $d$ is relevant iff $p(R = 1|d, q) > p(R = 0|d, q)$.

SLIDE 27

Binary independence model (BIM)

Represent documents and queries as binary term incidence vectors:

$$\vec x = (x_1, \dots, x_M)\,, \qquad \vec q = (q_1, \dots, q_M)$$

($x_t = 1$ if term $t$ is present in document $d$, else $x_t = 0$.)

Assume independence: no association between terms in the binary bag of words (not true of course, but it works OK to a first approximation).

We want to determine

$$p(R = 1|\vec x, \vec q) = \frac{p(\vec x|R = 1, \vec q)\, p(R = 1|\vec q)}{p(\vec x|\vec q)}\,, \qquad p(R = 0|\vec x, \vec q) = \frac{p(\vec x|R = 0, \vec q)\, p(R = 0|\vec q)}{p(\vec x|\vec q)}$$

(On the r.h.s. are the probabilities that if a document is retrieved, then its representation is $\vec x$, and the prior probabilities of retrieving a relevant or nonrelevant document; also $p(R = 1|\vec x, \vec q) + p(R = 0|\vec x, \vec q) = 1$.)

SLIDE 28

Ranking function

Order by descending $p(R = 1|d, q)$, using the BIM estimate $p(R = 1|\vec x, \vec q)$. Use the odds of relevance:

$$O(R|\vec x, \vec q) = \frac{p(R = 1|\vec x, \vec q)}{p(R = 0|\vec x, \vec q)} = \frac{p(\vec x|R = 1, \vec q)\, p(R = 1|\vec q)\,/\,p(\vec x|\vec q)}{p(\vec x|R = 0, \vec q)\, p(R = 0|\vec q)\,/\,p(\vec x|\vec q)} = \frac{p(R = 1|\vec q)}{p(R = 0|\vec q)} \cdot \frac{p(\vec x|R = 1, \vec q)}{p(\vec x|R = 0, \vec q)}$$

How do we estimate the last term?

SLIDE 29

Naive Bayes conditional independence assumption

$$\frac{p(\vec x|R = 1, \vec q)}{p(\vec x|R = 0, \vec q)} = \prod_{t=1}^{M} \frac{p(x_t|R = 1, \vec q)}{p(x_t|R = 0, \vec q)}\,, \qquad\text{so}\qquad O(R|\vec x, \vec q) = O(R|\vec q) \cdot \prod_{t=1}^{M} \frac{p(x_t|R = 1, \vec q)}{p(x_t|R = 0, \vec q)}$$

Let the probabilities of a term appearing in relevant/nonrelevant documents w.r.t. the query be $p_t = p(x_t = 1|R = 1, \vec q)$ and $u_t = p(x_t = 1|R = 0, \vec q)$:

                              relevant (R = 1)    nonrelevant (R = 0)
    term present (x_t = 1)          p_t                  u_t
    term absent  (x_t = 0)        1 − p_t              1 − u_t

SLIDE 30

One more approximation

Also assume that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: $p_t = u_t$ if $q_t = 0$. Then we only need to retain the query terms ($q_t = 1$):

$$O(R|\vec x, \vec q) = O(R|\vec q) \cdot \prod_{t=1}^{M} \frac{p(x_t|R = 1, \vec q)}{p(x_t|R = 0, \vec q)} = O(R|\vec q) \cdot \prod_{t:\, x_t = 1,\, q_t = 1} \frac{p_t}{u_t} \cdot \prod_{t:\, x_t = 0,\, q_t = 1} \frac{1 - p_t}{1 - u_t}$$
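A hedged sketch of scoring with this formula in log space, with hypothetical per-term estimates p_t, u_t supplied as dicts; the prior odds O(R|q) are the same for every document, so they can be dropped when ranking:

    import math

    def bim_log_odds(doc_terms, query_terms, p, u):
        """Log of the product over query terms; higher = more likely relevant."""
        score = 0.0
        for t in query_terms:
            if t in doc_terms:                                 # x_t = 1, q_t = 1
                score += math.log(p[t] / u[t])
            else:                                              # x_t = 0, q_t = 1
                score += math.log((1 - p[t]) / (1 - u[t]))
        return score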

SLIDE 31

Estimate

$df_t$ is the number of documents that contain term $t$:

                           relevant      nonrelevant                 total
    present (x_t = 1)         s           df_t − s                   df_t
    absent  (x_t = 0)       S − s      (N − df_t) − (S − s)        N − df_t
    total                     S             N − S                      N

Hence $p_t = s/S$ and $u_t = (df_t - s)/(N - S)$. In practice, if the relevant documents are a small percentage of the collection, then $u_t \approx df_t/N$ and

$$\log\frac{1 - u_t}{u_t} = \log\frac{N - df_t}{df_t} \approx \log\frac{N}{df_t}\,,$$

reproducing the idf weight (a numeric check follows below).
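The numeric check promised above, with made-up numbers N = 10^6 and df_t = 100:

    import math

    N, df_t = 1_000_000, 100
    u_t = df_t / N                       # probability of t in nonrelevant docs
    print(math.log((1 - u_t) / u_t))     # 9.2102...
    print(math.log(N / df_t))            # 9.2103..., essentially the idf weight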

SLIDE 32

Outline

1. Recap
2. Pseudo Relevance Feedback
3. Query expansion
4. Probabilistic Retrieval
5. Discussion

SLIDE 33

Discussion 4

Original LSA article: Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, Volume 41, Issue 6, 1990.

Some questions:

• Explain the name "latent semantic analysis".
• What problems is LSA attempting to solve? Does it succeed?
• What criteria were used in selecting the SVD of the term-document matrix?
• Explain the meaning of the matrices in the SVD $C = U\Sigma V^T$.
• What does the rank reduction $C_k = U_k \Sigma_k V_k^T \approx C$ (keeping only the first $k$ elements of $\Sigma$) have to do with latent semantics?
• Fig. 1: What aspect of LSA does this illustrate? (Which docs are closer to the query vector in concept space despite not containing words in common with the query?)

SLIDE 34
• Fig. 4: (a) LSI-100 does better at the right of this graph than at the left. What does this have to do with synonymy and polysemy?
• Describe the methodology of the MED experiment. Why were the authors surprised that TERM and SMART gave similar results? The results on CISI were not as strong; what are possible explanations?
• Fig. 5: What data does the graph plot? What conclusions can you draw?
• The article states "the only way documents can be retrieved is by an exhaustive comparison of a query vector against all stored document vectors". Explain the statement. Is it a serious problem?