
Retrieval Strategies: Vector Space Model
(COSC 488)
Nazli Goharian
nazli@cs.georgetown.edu

• An IR strategy is a technique by which a relevance measure is obtained between a query and a document.


© Goharian, Grossman, Frieder 2002, 2011

Retrieval Strategies
• Manual Systems
  – Boolean, Fuzzy Set
• Automatic Systems
  – Vector Space Model
  – Language Models
  – Latent Semantic Indexing
• Adaptive
  – Probabilistic, Genetic Algorithms, Neural Networks, Inference Networks

Vector Space Model
• One of the most commonly used strategies is the vector space model (proposed by Salton in 1975).
• Idea: the meaning of a document is conveyed by the words used in that document.
• Documents and queries are mapped into a term vector space; each dimension represents the tf-idf weight of one term.
• Documents are ranked by closeness to the query. Closeness is determined by a similarity score calculation.

Weights for Term Components
• Term weights are used to rank relevance. Parameters in calculating a weight for a document term or query term:
  – Term Frequency (tf): the number of times a term i appears in document j (tf_ij)
  – Document Frequency (df): the number of documents a term i appears in (df_i)
  – Inverse Document Frequency (idf): a measure of how discriminating term i is in the collection: idf_i = log10(n / df_i), where n is the number of documents
• Document and query representation in VSM (example): consider a two-term vocabulary, A and I, with query Q = "A I" and documents D1 = "A I", D2 = "A", D3 = "I".
• [Figure: in the two-dimensional term space, Q and D1 point in the same direction, D2 lies along the A axis, and D3 along the I axis.]
• Idea: a document and a query are similar when their vectors point in the same general direction.
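As a concrete illustration, the tf, df, and idf quantities defined above can be computed over the three-document corpus used in the VSM example later in these slides. This is a minimal sketch: the corpus dictionary and variable names are my own, and tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

# Toy corpus (the documents from the VSM example later in the slides).
docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}

# tf_ij: number of times term i appears in document j
tf = {d: Counter(text.split()) for d, text in docs.items()}

# df_i: number of documents term i appears in
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

# idf_i = log10(n / df_i), where n is the number of documents
n = len(docs)
idf = {term: math.log10(n / d) for term, d in df.items()}
```

Running this reproduces the idf values in the slide's table, e.g. idf("gold") = log10(3/2) ≈ 0.176 and idf("of") = log10(3/3) = 0.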

Weights for Term Components
• The classic approach is to use tf × idf. Common variations:
  – Incorporate idf in the query, the document, both, or neither.
  – Scale the idf with a log.
  – Scale the tf: (log tf + 1), or (tf / sum of tf over all terms in that document).
  – Augment the weight with some constant (e.g., w = (w)(0.5)).
• Many variations of term weight exist as a result of improving on basic tf-idf. A good one (log-tf with cosine normalization):

  w_ij = ((log(tf_ij) + 1.0) * idf_j) / sqrt( Σ_{j=1..t} [ (log(tf_ij) + 1.0) * idf_j ]^2 )

• Some efforts suggest using different weightings for document terms and query terms.

Similarity Measures
• A Similarity Coefficient (SC) identifies the similarity between query Q and document D_i. Common choices:
  – Inner product (dot product)
  – Cosine
  – Pivoted cosine

Similarity Measures (Inner Product)

  SC(Q, D_i) = Σ_{j=1..t} w_qj * d_ij

• Problem: longer documents score very high because they have more chances to match query words.

Similarity Measures (Cosine)

  SC(Q, D_i) = ( Σ_{j=1..t} w_qj * d_ij ) / sqrt( Σ_{j=1..t} d_ij^2 * Σ_{j=1..t} w_qj^2 )

• Normalizes the weight by considering document length.
• Assumption: document length has no impact on relevance.
• Problem: longer documents are somewhat penalized, even though they might indeed have more components that are relevant [Singhal, 1997 - TREC].
• [Figure: probability of relevance and probability of retrieval plotted against document length; the point where the two curves cross is the pivot, and the slope of the retrieval curve motivates the pivoted correction.]
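The cosine measure and the log-tf weight variant above can be sketched in a few lines. Function names are illustrative; the formulas follow the slide's definitions, with vectors represented as plain lists of weights.

```python
import math

def log_tf_idf(tf, idf):
    # w_ij = (log(tf_ij) + 1.0) * idf_j for tf > 0, else 0
    return (math.log(tf) + 1.0) * idf if tf > 0 else 0.0

def cosine(q, d):
    # SC(Q, D_i) = sum(w_qj * d_ij) / sqrt(sum(d_ij^2) * sum(w_qj^2))
    num = sum(wq * wd for wq, wd in zip(q, d))
    den = math.sqrt(sum(wd * wd for wd in d) * sum(wq * wq for wq in q))
    return num / den if den else 0.0
```

Note that with the cosine measure, scaling a document vector by any constant leaves its score unchanged; that is exactly the length normalization the slide describes.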

Pivoted Cosine Normalization
• Compare the likelihood of retrieval against the likelihood of relevance in a collection to identify the pivot, and thus the new correction factor:

  SC(Q, D_i) = ( Σ_{j=1..t} w_qj * d_ij ) / [ (1.0 - s) * avgn + s * sqrt( Σ_{j=1..t} d_ij^2 ) ]

  avgn: average document normalization factor over the entire collection
  s (slope): can be obtained empirically

• Pivoted cosine normalization works well for short and moderately long documents.
• Problem: extremely long documents are favored.

Pivoted Unique Normalization

  SC(Q, D_i) = ( Σ_{j=1..t} w_qj * d_ij ) / [ (1.0 - s) * p + s * |d_i| ]

  d_ij = ((1 + log(tf_ij)) * idf_j) / (1 + log(atf)), where atf is the average tf in the document
  |d_i|: number of unique terms in a document
  p: average number of unique terms per document over the entire collection
  s: can be obtained empirically

VSM Example
• Q: "gold silver truck"
• D1: "Shipment of gold damaged in a fire"
• D2: "Delivery of silver arrived in a silver truck"
• D3: "Shipment of gold arrived in a truck"

  Id  Term      df  idf
  1   a         3   0
  2   arrived   2   0.176
  3   damaged   1   0.477
  4   delivery  1   0.477
  5   fire      1   0.477
  6   gold      2   0.176
  7   in        3   0
  8   of        3   0
  9   silver    1   0.477
  10  shipment  2   0.176
  11  truck     2   0.176

VSM Example (dot product)

  doc  t1  t2    t3    t4    t5    t6    t7  t8  t9    t10   t11
  D1   0   0     .477  0     .477  .176  0   0   0     .176  0
  D2   0   .176  0     .477  0     0     0   0   .954  0     .176
  D3   0   .176  0     0     0     .176  0   0   0     .176  .176
  Q    0   0     0     0     0     .176  0   0   .477  0     .176

• Computing SC using the inner product:
  SC(Q, D1) = (0)(0) + (0)(0) + (0)(.477) + (0)(0) + (0)(.477) + (.176)(.176) + (0)(0) + (0)(0) + (.477)(0) + (0)(.176) + (.176)(0) ≈ 0.031

Algorithm for Vector Space
• Assume t.idf gives the idf of any term t, and q.tf gives the tf of any query term.

  Begin
    Score[] ← 0
    For each term t in query Q
      Obtain posting list l
      For each entry p in l
        Score[p.docid] = Score[p.docid] + (p.tf * t.idf) * (q.tf * t.idf)

• Now we have a Score array that is unsorted.
• Sort the Score array and display the top x results.
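The term-at-a-time scoring loop above can be sketched in Python. The tiny inverted index layout (term -> list of (docid, tf) pairs), the idf values, and the query are illustrative assumptions drawn from the slides' "gold silver truck" example.

```python
from collections import defaultdict

# Illustrative inverted index: term -> posting list of (docid, tf) pairs,
# with idf values from the slide's VSM example table.
index = {
    "gold":   [("D1", 1), ("D3", 1)],
    "silver": [("D2", 2)],
    "truck":  [("D2", 1), ("D3", 1)],
}
idf = {"gold": 0.176, "silver": 0.477, "truck": 0.176}
query_tf = {"gold": 1, "silver": 1, "truck": 1}   # Q: "gold silver truck"

score = defaultdict(float)                 # Score[] <- 0
for t, q_tf in query_tf.items():           # for each term t in query Q
    for docid, p_tf in index[t]:           # for each entry p in posting list l
        # Score[p.docid] += (p.tf * t.idf) * (q.tf * t.idf)
        score[docid] += (p_tf * idf[t]) * (q_tf * idf[t])

# Sort the unsorted Score array and take the top x results.
ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
```

D2 ranks first because "silver" occurs twice in it, which is the 0.954 entry in the doc-term matrix above.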

Summary: Vector Space Model
• Pros
  – Fairly cheap to compute
  – Yields decent effectiveness
  – Very popular
• Cons
  – No theoretical foundation
  – Weights in the vectors are arbitrary
  – Assumes term independence; considers each document as a bag of words

Boolean Retrieval
• For many years, most commercial systems were Boolean only.
• Most old library systems and Lexis/Nexis have a long history of Boolean retrieval.
• Users who are experts at a complex query language can find what they are looking for, e.g.:
  (t1 AND t2) OR (t3 AND t7) WITHIN 2 Sentences
  (t4 AND t5) NOT (t9 OR t10)
• Query syntax:
  Expression :=
    term
    ( expr )
    NOT expr (not recommended)
    expr AND expr
    expr OR expr
• Example: (cost OR price) AND paper AND NOT article

Boolean Example

  doc  t1  t2  t3  t4  t5  t6  t7  t8  t9  t10  t11
  D1   0   0   1   0   1   1   0   0   0   1    0
  D2   1   1   0   1   0   0   0   0   1   0    1
  D3   1   1   0   0   0   1   0   0   0   1    1
  D4   0   0   0   0   0   1   0   0   1   0    1

• Q: t1 AND t2 AND NOT t4
• Reading each term's column as a bit vector over D1..D4:
  0110 AND 0110 AND NOT(0100) = 0110 AND 0110 AND 1011 = 0010, i.e., D3.

Processing Boolean Queries
• The doc-term matrix is too sparse, so an inverted index is used instead.
• Query optimization in Boolean retrieval: the order in which posting lists are accessed!
• Algorithm for t1 AND t2:
  – Find t1 in the index (lexicon) and retrieve its posting list.
  – Find t2 in the index (lexicon) and retrieve its posting list.
  – Intersect (merge) the posting lists.
  – The matching DocIDs are added to the result list.
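The intersect (merge) step for "t1 AND t2" can be sketched as a linear merge over sorted DocID lists, a standard two-pointer technique (the function name is my own):

```python
def intersect(p1, p2):
    """Linear merge of two sorted DocID posting lists (the AND of two terms)."""
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # DocID present in both lists: a match
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    return result

# e.g. intersect([1, 3, 5, 8], [2, 3, 8, 9]) -> [3, 8]
```

Because each pointer only moves forward, the merge runs in time linear in the combined length of the two lists, which is why posting lists are kept sorted by DocID.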

Processing Boolean Query
• t1 AND t2 AND t3: what is the best order to process this?
• Process in order of increasing document frequency, i.e., smaller posting lists first!
• Thus, if t1 and t2 have smaller posting lists than t3, process as (t1 AND t2) AND t3.

Intersection of Posting Lists
• Algorithm:
  – Sort the query terms by document frequency.
  – Merge the smallest posting list with the next smallest and create the result set.
  – Merge the next smallest posting list with the result set, updating the result set.
  – Continue until no terms are left.

Processing Boolean Query (with disjuncts)
• (t1 OR t2) AND (t3 OR t4) AND (t5 OR t6)
• Use document frequency to estimate the size of each disjunct.
• Order the conjuncts so that smaller disjuncts are processed first.

Boolean Retrieval (limitations)
• AND returns too few documents (low recall).
• OR returns too many documents (low precision).
• NOT eliminates many good documents (low recall).
• Proximity information is not supported.
• Term weights are not incorporated.

Extended (Weighted) Boolean Retrieval
• Extended Boolean supports term weights and proximity information.
• Example of incorporating term weight, ranking by term frequency (Sony search engine):
  x AND y: tf_x * tf_y
  x OR y:  tf_x + tf_y
  NOT x:   0 if tf_x > 0, 1 if tf_x = 0
• Users may also assign term weights, e.g., "cost and +paper".

Summary of Boolean Retrieval
• Pros
  – Can express very restrictive searches
  – Makes experienced users happy
• Cons
  – Simple queries do not work well
  – Complex query language, confusing to end users
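The smallest-list-first ordering described above can be sketched as follows. Names are illustrative; posting lists are assumed to be sorted DocID lists, and at least one term is assumed present.

```python
def intersect(p1, p2):
    # Linear merge of two sorted DocID posting lists.
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def and_query(postings):
    """AND together posting lists, shortest (lowest df) first."""
    ordered = sorted(postings, key=len)   # df order = posting-list length
    result = ordered[0]
    for plist in ordered[1:]:
        result = intersect(result, plist) # the running result can only shrink
        if not result:                    # early exit: the AND is already empty
            break
    return result
```

Starting from the shortest list keeps every intermediate result small, and the early exit stops as soon as the conjunction is provably empty.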
