

SLIDE 1

Evaluation

Experimental protocols, datasets, metrics

Web Search

SLIDE 2

What makes a good search engine?

  • Efficiency: It replies to user queries without noticeable delays.
  • 1 sec is the “limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer”.
  • Miller, R. B. (1968). Response time in man-computer conversational transactions. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 267-277.
  • Effectiveness: It replies to user queries with relevant answers.
  • This depends on the interpretation of the user query and the stored information.

SLIDE 3

Efficiency metrics

  • Elapsed indexing time: Measures the amount of time necessary to build a document index on a particular system.
  • Indexing processor time: Measures the CPU seconds used in building a document index. This is similar to elapsed time, but does not count time waiting for I/O or speed gains from parallelism.
  • Query throughput: Number of queries processed per second.
  • Query latency: The amount of time a user must wait after issuing a query before receiving a response, measured in milliseconds. This can be measured using the mean, but is often more instructive when used with the median or a percentile bound (see the sketch below).
  • Indexing temporary space: Amount of temporary disk space used while creating an index.
  • Index size: Amount of storage necessary to store the index files.
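The latency discussion above is easy to make concrete. Below is a minimal sketch, with invented latency values, of how the mean, the median and a simple nearest-rank 95th-percentile bound can be reported for a batch of queries (all variable names are mine):

```python
import statistics

# Hypothetical per-query latencies in milliseconds (illustrative values only).
latencies_ms = [120, 95, 310, 88, 102, 97, 1450, 110, 99, 105]

mean_ms = statistics.mean(latencies_ms)      # sensitive to outliers (e.g., the 1450 ms query)
median_ms = statistics.median(latencies_ms)  # more robust summary of the typical experience

# Crude nearest-rank approximation of the 95th-percentile bound.
p95_ms = sorted(latencies_ms)[int(0.95 * (len(latencies_ms) - 1))]

print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms, p95={p95_ms} ms")
```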

SLIDE 4

What makes a good search engine?

  • Efficiency: It replies to user queries without noticeable delays.
  • 1 sec is the “limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer”.
  • Miller, R. B. (1968). Response time in man-computer conversational transactions. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 267-277.
  • Effectiveness: It replies to user queries with relevant answers.
  • This depends on the interpretation of the user query and the stored information.

SLIDE 5

Essential aspects of a sound evaluation

  • Experimental protocol
  • Is the task/problem clear? Is it a standard task?
  • Detailed description of the experimental setup:
  • identify all steps of the experiments.
  • Reference dataset
  • Use a well known dataset if possible.
  • If not, how was the data obtained?
  • Clear separation between training and test set.
  • Evaluation metrics
  • Prefer the commonly used metrics by the community.
  • Check which statistical test is most adequate.

SLIDE 6

Experimental setups

  • There are experimental setups made available by different organizations:
  • TREC: http://trec.nist.gov/tracks.html
  • CLEF: http://clef2017.clef-initiative.eu/
  • SemEVAL: http://alt.qcri.org/semeval2017/
  • Visual recognition: http://image-net.org/challenges/LSVRC/
  • These experimental setups define a protocol and a dataset (documents and relevance judgments), and suggest a set of metrics to evaluate performance.

SLIDE 7

What is a standard task?

  • Experimental setups are designed to develop a search engine to address a specific task.

  • Retrieval by keyword
  • Retrieval by example
  • Ranking annotations
  • Interactive retrieval
  • Search query categorization
  • Real-time summarization
  • Datasets exist for all the above tasks.

SLIDE 8

Examples of standard tasks in IR

  • For example, TRECVID tasks include:
  • Video shot-detection
  • Video news story segmentation
  • High-level feature task (concept detection)
  • Automatic and semi-automatic video search
  • Exploratory analysis (unsupervised)
  • Other forums exist with different tasks:
  • TREC: Blog search, opinion leader, patent search, Web search, document categorization...
  • CLEF: Plagiarism detection, expert search, Wikipedia mining, multimodal image tagging, medical image search...
  • Others: Japanese, Russian, Spanish, etc.

SLIDE 9

A retrieval evaluation setup

[Diagram: queries and a data collection are fed to the system, which returns ranked results; the ranked results are compared against the ground truth using evaluation metrics.]

SLIDE 10

Essential aspects of a sound evaluation

  • Experimental protocol
  • Is the task/problem clear? Is it a standard task?
  • Detailed description of the experimental setup:
  • identify all steps of the experiments.
  • Reference dataset
  • Use a well known dataset if possible.
  • If not, how was the data obtained?
  • Clear separation between training and test set.
  • Evaluation metrics
  • Prefer the commonly used metrics by the community.
  • Check which statistical test is most adequate.

SLIDE 11

Reference datasets

  • A reference dataset is made of:
  • a collection of documents
  • a set of training queries
  • a set of test queries
  • the relevance judgments for the query-document pairs.
  • Reference datasets are as important as metrics for evaluating the proposed method.
  • Many different datasets exist for standard tasks.
  • Reference datasets set the difficulty level of the task.
  • They allow a fair comparison across different methods.

SLIDE 12

Ground-truth (relevance judgments)

  • Ground-truth tells the scientist how the method must behave.
  • The ultimate goal is to devise a method that produces exactly the same output as the ground-truth.

                   Ground-truth: True                Ground-truth: False
  Method: True     True positive                     False positive (Type I error)
  Method: False    False negative (Type II error)    True negative

SLIDE 13

Annotate these pictures with keywords:


SLIDE 14

Relevance judgments

[Keywords assigned to the pictures from the previous slide: People, Nepal, Mother, Baby, Colorful dress, Fence, Sunset, Horizon, Clouds, Orange, Desert, Flowers, Yellow, Nature, Beach, Sea, Palm tree, White sand, Clear sky.]

SLIDE 15

Relevance judgments

  • Judgments can be obtained from experts or by crowdsourcing.
  • Human relevance judgments can be incorrect and inconsistent.
  • How do we measure the quality of human judgments? A common measure is the kappa statistic (below).
  • Kappa values above 0.8 are considered good.
  • Values between 0.67 and 0.8 are considered fair.
  • Values below 0.67 are considered dubious.

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) is the proportion of times the human judges agreed and P(E) is the probability that they would agree by chance.
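To make the agreement computation concrete, here is a minimal sketch that computes kappa for two judges assigning binary relevance labels; the judge labels are invented example data and the function name is mine:

```python
def kappa(judge_a, judge_b):
    """Kappa agreement for two judges giving binary relevance labels (1 = relevant)."""
    n = len(judge_a)
    # P(A): proportion of documents on which the two judges agree.
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # P(E): probability of agreeing by chance, from each judge's label frequencies.
    pa_rel = sum(judge_a) / n
    pb_rel = sum(judge_b) / n
    p_chance = pa_rel * pb_rel + (1 - pa_rel) * (1 - pb_rel)
    return (p_agree - p_chance) / (1 - p_chance)

# Invented example labels: agreement on 6 of 8 documents.
print(kappa([1, 1, 0, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0]))  # 0.5 -> dubious agreement
```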

SLIDE 16

Essential aspects of a sound evaluation

  • Experimental protocol
  • Is the task/problem clear? Is it a standard task?
  • Detailed description of the experimental setup:
  • identify all steps of the experiments.
  • Reference dataset
  • Use a well known dataset if possible.
  • If not, how was the data obtained?
  • Clear separation between training and test set.
  • Evaluation metrics
  • Prefer the commonly used metrics by the community.
  • Check which statistical test is most adequate.

SLIDE 17

Evaluation metrics

  • Complete relevance judgments
  • Ranked relevance judgments
  • Binary relevance judgments
  • Incomplete relevance judgments (Web-scale evaluation)
  • Binary relevance judgments
  • Multi-level relevance judgments

SLIDE 18

Ranked relevance evaluation metrics

  • Spearman’s rank correlation:
  • Example:
  • Another popular rank correlation metric is the Kendall-Tau.

Example: two rankings of the same four items, with rank pairs (1, 1), (2, 3), (3, 4) and (4, 2).

$$r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

$$r_s = 1 - \frac{6\left[(1-1)^2 + (2-3)^2 + (3-4)^2 + (4-2)^2\right]}{4(4^2 - 1)} = 1 - \frac{36}{60} = 0.4$$
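A minimal sketch of the same computation, using the rank pairs from the worked example (the function name is mine):

```python
def spearman(ranks_a, ranks_b):
    """Spearman's rank correlation between two rankings of the same items."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [1, 3, 4, 2]))  # 0.4, as in the worked example
```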

SLIDE 19

Binary relevance judgments

In Portuguese: exatidão (accuracy), precisão (precision) and abrangência (recall).

                   Ground-truth: True     Ground-truth: False
  Method: True     True positive          False positive
  Method: False    False negative         True negative

$$Accuracy = \frac{truePos + trueNeg}{truePos + falsePos + trueNeg + falseNeg}$$

$$Precision = \frac{truePos}{truePos + falsePos} \qquad Recall = \frac{truePos}{truePos + falseNeg}$$

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$
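A minimal sketch of these set-based measures for a single query, assuming the retrieved and relevant documents are given as sets of document ids (the ids are invented):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1

# Illustrative document ids: 2 of the 4 retrieved documents are relevant.
print(precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))  # (0.5, 0.667, 0.571)
```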

SLIDE 20

Precision-recall graphs for ranked results

[Figure: precision-recall curves for Systems A, B and C, built from three ranked result lists (S1, S2, S3); the curves illustrate what improved recall, improved precision and an improved F-measure look like.]

SLIDE 21

Interpolated precision-recall graphs

[Figure: the same three ranked result lists (S1, S2, S3), now plotted as interpolated precision-recall curves.]
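Interpolated precision at a recall level r is commonly taken as the maximum precision observed at any recall level greater than or equal to r. A minimal sketch under that assumption, with invented relevance labels, is:

```python
def precision_recall_points(ranked_rel, total_relevant):
    """(recall, precision) after each relevant document in a ranked list of 0/1 labels."""
    points, hits = [], 0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / i))
    return points

def interpolate(points, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Interpolated precision: max precision at any recall >= the given level."""
    return {r: max((p for rec, p in points if rec >= r), default=0.0) for r in levels}

pts = precision_recall_points([0, 1, 0, 1, 0, 1, 0, 0], total_relevant=4)
print(interpolate(pts))
```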

SLIDE 22

Average Precision

  • Web systems favor high-precision methods (P@20)
  • Another, more robust metric is AP:

Example: a ranked list of 8 results, with relevant documents at positions 2, 4 and 6 (4 relevant documents exist in total).

$$AP = \frac{1}{\#relevant} \cdot \sum_{k \in \text{positions of the relevant docs}} P@k$$

$$AP = \frac{1}{4}\left(\frac{1}{2} + \frac{2}{4} + \frac{3}{6}\right) = 0.375$$
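A minimal sketch of the same AP computation over a ranked list of binary relevance labels; with four relevant documents in total it reproduces the 0.375 above (the function name and labels are mine):

```python
def average_precision(ranked_rel, total_relevant):
    """AP: mean of P@k over the ranks k at which relevant documents appear."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # P@k at this relevant document
    return sum(precisions) / total_relevant if total_relevant else 0.0

# Relevant documents retrieved at positions 2, 4 and 6; one relevant document not retrieved.
print(average_precision([0, 1, 0, 1, 0, 1, 0, 0], total_relevant=4))  # 0.375
```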

SLIDE 23

Average Precision

  • Average precision is the area under the P-R curve

$$AP = \frac{1}{\#relevant} \cdot \sum_{k \in \text{positions of the relevant docs}} P@k$$

SLIDE 24

Mean Average Precision (MAP)

  • MAP evaluates the system for a given range of queries.
  • It summarizes the global system performance in one single value.
  • It is the mean of the average precision of a set of n queries:

[Figure: three ranked result lists, one per query, each yielding AP(q1), AP(q2) and AP(q3).]

$$MAP = \frac{AP(q_1) + AP(q_2) + AP(q_3) + \dots + AP(q_n)}{n}$$
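A minimal, self-contained sketch: MAP is just the arithmetic mean of the per-query AP values (the AP helper repeats the earlier sketch and the query data are invented):

```python
def average_precision(ranked_rel, total_relevant):
    """AP for one query, as sketched on the previous slides."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """MAP: arithmetic mean of the per-query AP values."""
    return sum(average_precision(r, t) for r, t in runs) / len(runs)

# Two invented queries: AP = 0.375 and AP = 1.0, so MAP = 0.6875.
print(mean_average_precision([([0, 1, 0, 1, 0, 1, 0, 0], 4),
                              ([1, 1, 0, 0, 0, 0, 0, 0], 2)]))
```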

SLIDE 25

Web scale evaluation

  • It is impossible to know all relevant documents.
  • It is too expensive or time-consuming.
  • DCG, BPref and Inferred AP are three measures for evaluating a system with incomplete ground-truth.
  • These metrics use the concept of pooled results.

  • E. Yilmaz and J. A. Aslam, Estimating average precision with incomplete and imperfect judgments, ACM CIKM 2006.
  • C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. ACM SIGIR 2004.
SLIDE 26

Results pooling

  • This technique is used when the dataset is too large to be completely examined.
  • Considering the results of 10 systems:
  • Examine the top 100 results of each system.
  • Label all of these documents according to their relevance.
  • Use the labeled results as ground-truth to evaluate all systems (a sketch of the pooling step follows below).
  • Drawback: recall, AP and MAP cannot be computed exactly, because relevant documents outside the pool are never judged.
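A minimal sketch of the pooling step above, assuming each system's run is a ranked list of document ids; the runs and the pool depth are illustrative, and the actual relevance labelling is done by human assessors:

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each system's ranked run."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])
    return pool  # these documents are then judged by human assessors

# Illustrative runs from three systems (normally 10 systems and depth 100).
runs = [["d1", "d5", "d9"], ["d5", "d2", "d7"], ["d3", "d1", "d2"]]
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd3', 'd5']
```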

SLIDE 27

DCG: Incomplete multi-level relevance

  • Useful when some documents are more relevant than others.
  • Documents need to have ground-truth with different levels of relevance.
  • A common metric is the Discounted Cumulative Gain:

  • K. Jarvelin, J. Kekalainen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems 20(4), 422–446 (2002).

$$DCG_n = \sum_{i=1}^{n} \frac{2^{rel_i} - 1}{\log_2(1 + i)}, \qquad rel_i = 0, 1, 2, 3, \dots$$

$$nDCG_n = \frac{DCG_n}{bestDCG_n}$$

where bestDCG_n is the DCG of the ideal (best possible) ordering of the results.
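A minimal sketch of DCG and nDCG over graded relevance labels, following the formula above (function names and the example grades are mine):

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked list of graded relevance labels (0, 1, 2, ...)."""
    return sum((2 ** rel - 1) / math.log2(1 + i) for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same labels."""
    best = dcg(sorted(rels, reverse=True))
    return dcg(rels) / best if best > 0 else 0.0

print(ndcg([2, 0, 3, 1, 0]))  # example grades; 1.0 would mean the ideal ordering
```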

SLIDE 28

BPref: Incomplete binary relevance

  • When only incomplete binary relevance judgments are available, BPref is a popular metric:

$$BPref = \frac{1}{R} \sum_{d_r} \left(1 - \frac{N_{d_r}}{R}\right)$$

  • where R is the total number of relevant documents for a given query
  • d_r is a relevant document
  • N_{d_r} is the number of non-relevant documents ranked higher than d_r
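A minimal sketch of BPref as defined above, over a ranked list whose judgments are 1 (relevant), 0 (judged non-relevant) or None (unjudged, i.e. outside the pool); the data are invented:

```python
def bpref(ranked_judgments):
    """BPref over a ranked list of judgments: 1 = relevant, 0 = non-relevant, None = unjudged."""
    R = sum(1 for j in ranked_judgments if j == 1)
    if R == 0:
        return 0.0
    score, nonrel_above = 0.0, 0
    for j in ranked_judgments:
        if j == 0:
            nonrel_above += 1  # judged non-relevant documents seen so far
        elif j == 1:
            # Slide's formula; trec_eval additionally caps nonrel_above at R.
            score += 1 - nonrel_above / R
    return score / R

print(bpref([1, None, 0, 1, 0, None, 1]))  # unjudged documents are simply ignored
```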

SLIDE 29

Diversity and novelty

  • Diversity and novelty are difficult to evaluate.
  • There is no de facto standard method to measure them.
  • The goal is to measure how diverse and novel the information contained in the retrieved documents is.
  • The assessment focus is not at the level of whole documents, but of the information they contain.

SLIDE 30

Nuggets or information facts

  • A nugget is an information fact.
  • Documents contain many nuggets.
  • The same nugget can be present in many different documents.
  • The goal is to retrieve a ranked list with many different nuggets at the top of the list.
  • Repeated nuggets will have a decreasing importance.

SLIDE 31

The -nDCG metric for diversity and novelty

  • The relevance of a document is determined by its nuggets and by the nuggets that occurred previously in the ranked results.

  • A popular metric is the α-nDCG, where each document at position k is judged by its nuggets.
  • A typical setting is α = 0.5.
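As an illustration of the idea rather than the exact formula on the slide, the sketch below computes the novelty-discounted gain that α-nDCG assigns to each ranked document: every nugget contributes 1 the first time it appears and is discounted by (1 − α) for each earlier document that already contained it; these gains then replace the plain relevance grades in the DCG formula. All names and data are mine:

```python
def alpha_gains(ranked_docs_nuggets, alpha=0.5):
    """Novelty-discounted gain per rank: a repeated nugget contributes (1 - alpha)^(times seen before)."""
    seen = {}   # nugget -> number of earlier documents containing it
    gains = []
    for nuggets in ranked_docs_nuggets:
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in nuggets)
        for n in nuggets:
            seen[n] = seen.get(n, 0) + 1
        gains.append(gain)
    return gains  # plug these gains into the DCG formula above

# Illustrative nugget sets per ranked document.
print(alpha_gains([{"fares", "routes"}, {"routes"}, {"reviews"}]))  # [2.0, 0.5, 1.0]
```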

SLIDE 32

Example

  • Top results for query “Norwegian Cruise Lines”
  • The relevance of each document is:
  • What would be the ideal ordering?

SLIDE 33

System quality and user utility

  • The discussed evaluation procedures only measure the system performance on a given task.
  • It can overfit.
  • It might be distant from what users expect.
  • Only real users can actually assess the system:
  • How expressive is its query language?
  • How large is its collection?
  • How effective are the results?
  • A/B testing:
  • Make a small variation of the system and direct a proportion of users to that system.
  • Evaluate the frequency with which users click on top results.

SLIDE 34

Qualitative discussion

  • Relevance depends on:
  • Task objective
  • User knowledge
  • Time
  • Not all people “see” the same
  • Binary relevance judgments
  • Multi-level relevance judgments
  • Ranked relevance judgments
  • Incomplete relevance judgments

The notion of relevance is a subjective concept. There is no relation between AP and user satisfaction.

SLIDE 35

Summary

  • Metrics for complete relevance judgments
  • Binary: Precision, Recall, F-measure, Average Precision, Mean AP
  • Ranked: Spearman, Kendall tau
  • Metrics for incomplete relevance judgments
  • Binary: BPref, Inferred AP
  • Multi-valued: Normalized DCG
  • Evaluation collections / resources
  • See TRECVID and ImageCLEF for multimedia datasets.
  • See TREC and CLEF forums for Web and large-scale datasets.
  • User search interaction, Geographic IR, Expert finding, Blog search, Plagiarism, ...
  • Use the trec_eval application to evaluate your system.

Chapter 8