INFO 4300 / CS4300 Information Retrieval slides adapted from - PowerPoint PPT Presentation

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch¨ utze’s, linked from http://informationretrieval.org/ IR 8: Evaluation & Result Summaries Paul Ginsparg Cornell University, Ithaca, NY 22 Sep 2009 1 / 62

Discussion 3 Read and be prepared to discuss the following: K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval”. Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/ ∼ ser/idfpapers/ksj orig.pdf Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/ ∼ ser/idfpapers/letters.pdf The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.) 2 / 62

Overview Recap 1 Unranked evaluation 2 Ranked evaluation 3 Evaluation benchmarks 4 Result summaries 5 3 / 62

Outline Recap 1 Unranked evaluation 2 Ranked evaluation 3 Evaluation benchmarks 4 Result summaries 5 4 / 62

Pivot normalization source: Lillian Lee 6 / 62

Heuristics for finding the top k even faster Document-at-a-time processing We complete computation of the query-document similarity score of document d i before starting to compute the query-document similarity score of d i +1 . Requires a consistent ordering of documents in the postings lists Term-at-a-time processing We complete processing the postings list of query term t i before starting to process the postings list of t i +1 . Requires an accumulator for each document “still in the running” The most effective heuristics switch back and forth between term-at-a-time and document-at-a-time processing. 7 / 62

Use min heap for selecting top k ouf of N Use a binary min heap A binary min heap is a binary tree in which each node’s value is less than the values of its children. It takes O ( N log k ) operations to construct the k -heap containing the k largest values (where N is the number of documents). Essentially linear in N for small k and large N . 8 / 62

Cluster pruning Cluster docs in preprocessing step √ Pick N “leaders” √ For non-leaders, find nearest leader (expect N / leader) √ For query q , find closest leader L ( N computations) Rank L and followers or generalize: b 1 closest leaders, and then b 2 leaders closest to query 9 / 62

Tiered index auto Doc2 best Tier 1 car Doc1 Doc3 insurance Doc2 Doc3 auto best Doc1 Doc3 Tier 2 car insurance auto Doc1 best Tier 3 car Doc2 insurance 10 / 62

Complete search system 11 / 62

Components we have introduced thus far Document preprocessing (linguistic and otherwise) Positional indexes Tiered indexes Spelling correction k-gram indexes for wildcard queries and spelling correction Query processing Document scoring Term-at-a-time processing 12 / 62

Components we haven’t covered yet Document cache: we need this for generating snippets (= dynamic summaries) Zone indexes: They separate the indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields etc Machine-learned ranking functions Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other) Query parser 13 / 62

Vector space retrieval: Complications How do we combine phrase retrieval with vector space retrieval? We do not want to compute document frequency – idf for every possible phrase. Why? How do we combine Boolean retrieval with vector space retrieval? For example: “+”-constraints and “-”-constraints Postfiltering is simple, but can be very inefficient – no easy answer. How do we combine wild cards with vector space retrieval? Again, no easy answer 14 / 62

Outline Recap 1 Unranked evaluation 2 Ranked evaluation 3 Evaluation benchmarks 4 Result summaries 5 15 / 62

Measures for a search engine How fast does it index e.g., number of bytes per hour How fast does it search e.g., latency as a function of queries per second What is the cost per query? in dollars 16 / 62

Measures for a search engine All of the preceding criteria are measurable: we can quantify speed / size / money However, the key measure for a search engine is user happiness. What is user happiness? Factors include: Speed of response Size of index Uncluttered UI Most important: relevance (actually, maybe even more important: it’s free) Note that none of these is sufficient: blindingly fast, but useless answers won’t make a user happy. How can we quantify user happiness? 17 / 62

Who is the user? Who is the user we are trying to make happy? Web search engine: searcher. Success: Searcher finds what she was looking for. Measure: rate of return to this search engine Web search engine: advertiser. Success: Searcher clicks on ad. Measure: clickthrough rate Ecommerce: buyer. Success: Buyer buys something. Measures: time to purchase, fraction of “conversions” of searchers to buyers Ecommerce: seller. Success: Seller sells something. Measure: profit per item sold Enterprise: CEO. Success: Employees are more productive (because of effective search). Measure: profit of the company 18 / 62

Most common definition of user happiness: Relevance User happiness is equated with the relevance of search results to the query. But how do you measure relevance? Standard methodology in information retrieval consists of three elements. A benchmark document collection A benchmark suite of queries An assessment of the relevance of each query-document pair 19 / 62

Relevance: query vs. information need Relevance to what? First take: relevance to the query “Relevance to the query” is very problematic. Information need i : “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.” This is an information need, not a query. Query q : [red wine white wine heart attack] Consider document d ′ : At heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving. d ′ is an excellent match for query q . . . d ′ is not relevant to the information need i . 20 / 62

Relevance: query vs. information need User happiness can only be measured by relevance to an information need, not by relevance to queries. Terminology is sloppy here and in course text: “query-document” relevance judgments even though we mean “information-need–document” relevance judgments. 21 / 62

Precision and recall Precision ( P ) is the fraction of retrieved documents that are relevant Precision = #(relevant items retrieved) = P (relevant | retrieved) #(retrieved items) Recall ( R ) is the fraction of relevant documents that are retrieved Recall = #(relevant items retrieved) = P (retrieved | relevant) #(relevant items) 22 / 62

Precision and recall Relevant Nonrelevant Retrieved true positives (TP) false positives (FP) Not retrieved false negatives (FN) true negatives (TN) P = TP / ( TP + FP ) R = TP / ( TP + FN ) 23 / 62

Precision/recall tradeoff You can increase recall by returning more docs. Recall is a non-decreasing function of the number of docs retrieved. A system that returns all docs has 100% recall! The converse is also true (usually): It’s easy to get high precision for very low recall. Suppose the document with the largest score is relevant. How can we maximize precision? 24 / 62

A combined measure: F F allows us to trade off precision against recall. = ( β 2 + 1) PR 1 β 2 = 1 − α F = where α 1 P + (1 − α ) 1 β 2 P + R α R α ∈ [0 , 1] and thus β 2 ∈ [0 , ∞ ] Most frequently used: balanced F with β = 1 or α = 0 . 5 1 F = 1 2 ( 1 P + 1 This is the harmonic mean of P and R : R ) What value range of β weights recall higher than precision? 25 / 62

F: Example relevant not relevant retrieved 20 40 60 not retrieved 60 1,000,000 1,000,060 80 1,000,040 1,000,120 P = 20 / (20 + 40) = 1 / 3 R = 20 / (20 + 60) = 1 / 4 1 F 1 = 2 = 2 / 7 1 + 1 1 1 3 4 26 / 62

Accuracy Why do we use complex measures like precision, recall, and F ? Why not something simple like accuracy? Accuracy is the fraction of decisions (relevant/nonrelevant) that are correct. In terms of the contingency table above, accuracy = ( TP + TN ) / ( TP + FP + FN + TN ). Why is accuracy not a useful measure for web information retrieval? 27 / 62

Exercise Compute precision, recall and F 1 for this result set: relevant not relevant retrieved 18 2 not retrieved 82 1,000,000,000 The snoogle search engine below always returns 0 results (“0 matching results found”), regardless of the query. Why does snoogle demonstrate that accuracy is not a useful measure in IR? 28 / 62

INFO 4300 / CS4300 Information Retrieval slides adapted from - PowerPoint PPT Presentation

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from http://informationretrieval.org/ IR 8: Evaluation & Result Summaries Paul Ginsparg Cornell University, Ithaca, NY 22 Sep 2009 1 / 62

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Sch utzes, linked from

Fourth Grade Measurement and Data 2015-11-23 www.njctl.org Slide 3 / 100 Table of Contents

Natural Language Processing Machine Translation Machine Translation Dan Klein UC Berkeley

CS224N/Ling284 Lecture 6: Language Models and Recurrent Neural Networks Abigail See Overview

srs r

Text Reuse Detection Using a Composition of Text Similarity Measures Br, Zesch, Gurevych 2012

Machine Translation Evaluation Sara Stymne Partly based on Philipp Koehns slides for chapter 8

The ENCOPLOT Similarity Measure for Automatic Detection of Plagiarism Cristian Grozea 1 Marius

Accelerated Natural Language Processing Lecture 5 N-gram models, entropy Sharon Goldwater (some