

  1. INFO 4300 / CS4300 Information Retrieval
     Slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
     IR 7: Scores and Evaluation
     Paul Ginsparg, Cornell University, Ithaca, NY, 15 Sep 2011

  2. Administrativa
     Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
     Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
     Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
     Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
     Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail the instructor to schedule an appointment
     Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email; use cs4300-l@lists.cs.cornell.edu
     Course text: Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze, at http://informationretrieval.org/
     See also: Information Retrieval, S. Büttcher, C. Clarke, G. Cormack, http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

  3. Discussion 2, 20 Sep
     For this class, read and be prepared to discuss the following:
     K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval". Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf
     Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf
     The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)

  4. Overview
     1. Recap
     2. Implementation
     3. Unranked evaluation
     4. Ranked evaluation
     5. SVD Intuition
     6. Incremental Numerics

  5. Outline: 1. Recap, 2. Implementation, 3. Unranked evaluation, 4. Ranked evaluation, 5. SVD Intuition, 6. Incremental Numerics

  6. Cluster pruning
     Cluster docs in a preprocessing step:
     - Pick √N documents as "leaders"
     - For each non-leader, find its nearest leader (expect about √N followers per leader)
     - For query q, find the closest leader L (√N computations)
     - Rank L and its followers
     Or generalize: attach each document to its b1 closest leaders, and rank the followers of the b2 leaders closest to the query.
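
A minimal sketch of this cluster-pruning scheme (not from the course code), assuming dense document vectors and cosine similarity; the names `build_leaders` and `cluster_prune_search` are illustrative.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity of two dense vectors (assumes non-zero norms).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def build_leaders(docs, rng):
    # Preprocessing: pick sqrt(N) random leaders and attach each non-leader
    # to its nearest leader, so each leader gets roughly sqrt(N) followers.
    N = len(docs)
    leader_ids = rng.choice(N, size=int(np.sqrt(N)), replace=False)
    followers = {int(l): [] for l in leader_ids}
    for d in range(N):
        if d in followers:
            continue
        nearest = max(followers, key=lambda l: cosine(docs[d], docs[l]))
        followers[nearest].append(d)
    return followers

def cluster_prune_search(q, docs, followers, k=10):
    # Query time: find the closest leader (sqrt(N) comparisons), then rank
    # only that leader and its followers instead of the whole collection.
    L = max(followers, key=lambda l: cosine(q, docs[l]))
    candidates = [L] + followers[L]
    return sorted(candidates, key=lambda d: cosine(q, docs[d]), reverse=True)[:k]

# Toy usage: 10,000 random 50-dimensional "documents" and one random query.
rng = np.random.default_rng(0)
docs = rng.random((10_000, 50))
followers = build_leaders(docs, rng)
print(cluster_prune_search(rng.random(50), docs, followers, k=5))
```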

  7. (Figure-only slide, no text.)

  8. Outline: 1. Recap, 2. Implementation, 3. Unranked evaluation, 4. Ranked evaluation, 5. SVD Intuition, 6. Incremental Numerics

  9. Non-docID ordering of postings lists
     So far: postings lists have been ordered according to docID.
     Alternative: a query-independent measure of "goodness" of a page.
     Example: PageRank g(d) of page d, a measure of how many "good" pages hyperlink to d.
     Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > ...
     Define the composite score of a document: net-score(q, d) = g(d) + cos(q, d)
     This scheme supports early termination: we do not have to process postings lists in their entirety to find the top k.

  10. Non-docID ordering of postings lists (2)
     Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > ...
     Define the composite score of a document: net-score(q, d) = g(d) + cos(q, d)
     Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d we are currently processing; (iii) the smallest top-k score we have found so far is 1.2.
     Then all subsequent net-scores will be < 1.1, so we have already found the top k and can stop processing the remainder of the postings lists.
     Questions?
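
A small sketch of this early-termination rule under the stated assumptions (postings sorted by decreasing g, g in [0, 1], cosine bounded by 1); the function name and the example postings are made up.

```python
import heapq

def topk_with_early_termination(postings, k):
    # postings: (g, cos) pairs in decreasing order of g, with g in [0, 1].
    # net-score = g + cos and cos <= 1, so once g + 1 can no longer beat the
    # current k-th best net-score, no later document can enter the top k.
    best_k = []  # min-heap holding the k best net-scores seen so far
    for g, cos in postings:
        if len(best_k) == k and g + 1.0 < best_k[0]:
            break  # early termination: skip the rest of the postings list
        score = g + cos
        if len(best_k) < k:
            heapq.heappush(best_k, score)
        elif score > best_k[0]:
            heapq.heapreplace(best_k, score)
    return sorted(best_k, reverse=True)

# Toy example mirroring the slide: the top-2 net-scores found so far are 1.3
# and 1.2; once g(d) drops below 0.1, every remaining net-score is below 1.1.
postings = [(0.9, 0.4), (0.7, 0.5), (0.3, 0.6), (0.05, 0.99), (0.01, 0.2)]
print([round(s, 2) for s in topk_with_early_termination(postings, k=2)])  # [1.3, 1.2]
```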

  11. Outline: 1. Recap, 2. Implementation, 3. Unranked evaluation, 4. Ranked evaluation, 5. SVD Intuition, 6. Incremental Numerics

  12. Measures for a search engine
     How fast does it index? e.g., number of bytes per hour
     How fast does it search? e.g., latency as a function of queries per second
     What is the cost per query? in dollars

  13. Measures for a search engine
     All of the preceding criteria are measurable: we can quantify speed / size / money.
     However, the key measure for a search engine is user happiness. What is user happiness? Factors include:
     - Speed of response
     - Size of index
     - Uncluttered UI
     - Most important: relevance (actually, maybe even more important: it's free)
     Note that none of these is sufficient: blindingly fast but useless answers won't make a user happy.
     How can we quantify user happiness?

  14. Who is the user?
     Who is the user we are trying to make happy?
     - Web search engine: searcher. Success: searcher finds what she was looking for. Measure: rate of return to this search engine.
     - Web search engine: advertiser. Success: searcher clicks on ad. Measure: clickthrough rate.
     - Ecommerce: buyer. Success: buyer buys something. Measures: time to purchase, fraction of "conversions" of searchers to buyers.
     - Ecommerce: seller. Success: seller sells something. Measure: profit per item sold.
     - Enterprise: CEO. Success: employees are more productive (because of effective search). Measure: profit of the company.

  15. Most common definition of user happiness: relevance
     User happiness is equated with the relevance of search results to the query. But how do you measure relevance?
     The standard methodology in information retrieval consists of three elements:
     - A benchmark document collection
     - A benchmark suite of queries
     - An assessment of the relevance of each query-document pair

  16. Relevance: query vs. information need
     Relevance to what? First take: relevance to the query.
     "Relevance to the query" is very problematic.
     Information need i: "I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine." This is an information need, not a query.
     Query q: [red wine white wine heart attack]
     Consider document d′: "At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving."
     d′ is an excellent match for query q ...
     d′ is not relevant to the information need i.

  17. Relevance: query vs. information need
     User happiness can only be measured by relevance to an information need, not by relevance to queries.
     Terminology is sloppy here and in the course text: we say "query-document" relevance judgments even though we mean "information-need-document" relevance judgments.

  18. Precision and recall
     Precision (P) is the fraction of retrieved documents that are relevant:
         Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)
     Recall (R) is the fraction of relevant documents that are retrieved:
         Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)

  19. Precision and recall
                        Relevant               Nonrelevant
     Retrieved          true positives (TP)    false positives (FP)
     Not retrieved      false negatives (FN)   true negatives (TN)

     P = TP / (TP + FP)
     R = TP / (TP + FN)
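
As a worked example, a minimal helper (the name `precision_recall` is hypothetical) that derives P and R from retrieved and relevant document-ID sets using exactly these TP/FP/FN counts.

```python
def precision_recall(retrieved, relevant):
    # retrieved, relevant: sets of doc IDs. Returns (P, R).
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 4 docs retrieved, 3 of them relevant, 6 relevant docs in total.
print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9}))  # (0.75, 0.5)
```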

  20. Precision/recall tradeoff
     You can increase recall by returning more docs. Recall is a non-decreasing function of the number of docs retrieved. A system that returns all docs has 100% recall!
     The converse is also true (usually): it's easy to get high precision for very low recall.
     Suppose the document with the largest score is relevant. How can we maximize precision?

  21. A combined measure: F
     Frequently used: balanced F, the harmonic mean of P and R:
         1/F = (1/2) (1/P + 1/R),  or equivalently  F = 2PR / (P + R)
     Extremes: If P ≪ R, then F ≈ 2P. If R ≪ P, then F ≈ 2R. So F is automatically sensitive to the one that is much smaller. If P ≈ R, then F ≈ P ≈ R.
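
A tiny sketch of the balanced F measure that also illustrates the extremes above; the numeric values are made up.

```python
def f1(p, r):
    # Harmonic mean of precision and recall (balanced F).
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

print(f1(0.75, 0.5))   # 0.6
print(f1(0.01, 0.99))  # ~0.0198, i.e. about 2P when P is much smaller than R
print(f1(0.5, 0.5))    # 0.5: F is close to P and R when they are close to each other
```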

  22. Outline: 1. Recap, 2. Implementation, 3. Unranked evaluation, 4. Ranked evaluation, 5. SVD Intuition, 6. Incremental Numerics

  23. Precision-recall curve
     Precision/recall/F are measures for unranked sets.
     We can easily turn set measures into measures of ranked lists: just compute the set measure for each "prefix", i.e. the top 1, top 2, top 3, top 4, etc. results.
     Doing this for precision and recall gives you a precision-recall curve.

  24. A precision-recall curve
     [Figure: precision (y-axis) vs. recall (x-axis), with the interpolated curve in red.]
     Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, ...).
     Interpolation (in red): take the maximum of all future points.
     Rationale for interpolation: the user is willing to look at more stuff if both precision and recall get better.
     Questions?
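
A sketch of how the curve and its interpolation can be computed from a ranked result list, assuming a set of relevance judgments; the names `pr_curve` and `interpolate` and the example judgments are illustrative. Interpolated precision at each point is the maximum precision at any equal-or-higher recall (the "maximum of all future points" rule).

```python
def pr_curve(ranked, relevant):
    # ranked: list of doc IDs in ranked order; relevant: set of relevant IDs.
    # Returns (recall, precision) for each prefix top-1, top-2, ...
    points, hits = [], 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

def interpolate(points):
    # Interpolated precision at recall r = max precision at any recall >= r.
    interp, best = [], 0.0
    for recall, prec in reversed(points):
        best = max(best, prec)
        interp.append((recall, best))
    return list(reversed(interp))

# Made-up ranking and relevance judgments.
ranked = [3, 7, 12, 4, 9, 21]
relevant = {3, 4, 9, 30}
print(interpolate(pr_curve(ranked, relevant)))
```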

  25. 11-point interpolated average precision
     Recall   Interpolated Precision
     0.0      1.00
     0.1      0.67
     0.2      0.63
     0.3      0.55
     0.4      0.45
     0.5      0.41
     0.6      0.36
     0.7      0.29
     0.8      0.13
     0.9      0.10
     1.0      0.08
     11-point average: ≈ 0.425
     How can precision at recall 0.0 be > 0?
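
A sketch of the 11-point measure for a single query, assuming (recall, precision) prefix points like those produced by the previous sketch: interpolated precision is sampled at recall levels 0.0, 0.1, ..., 1.0 and averaged. Since the interpolated value at recall 0.0 is the maximum precision anywhere on the curve, it can be greater than 0, which is the answer to the question above.

```python
def eleven_point_average(points):
    # points: (recall, precision) pairs for the top-1, top-2, ... prefixes.
    # Interpolated precision at level r = max precision at any recall >= r;
    # average that value over r = 0.0, 0.1, ..., 1.0.
    levels = [i / 10 for i in range(11)]
    interp = []
    for r in levels:
        later = [p for rec, p in points if rec >= r]
        interp.append(max(later) if later else 0.0)
    return sum(interp) / len(levels), list(zip(levels, interp))

# Made-up prefix points (as returned by pr_curve in the previous sketch).
points = [(0.25, 1.0), (0.25, 0.5), (0.25, 1/3), (0.5, 0.5), (0.75, 0.6), (0.75, 0.5)]
avg, table = eleven_point_average(points)
print(round(avg, 3))  # about 0.545 for this toy example
```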

  26. Averaged 11-point precision/recall graph
     [Figure: averaged interpolated precision (y-axis) vs. recall (x-axis).]
     Compute interpolated precision at recall levels 0.0, 0.1, 0.2, ...
     Do this for each of the queries in the evaluation benchmark, then average over queries.
     This measure reports performance at all recall levels. The curve is typical of performance levels at TREC. Note that performance is not very good!

  27. Outline: 1. Recap, 2. Implementation, 3. Unranked evaluation, 4. Ranked evaluation, 5. SVD Intuition, 6. Incremental Numerics
