  1. Evaluation: Experimental protocols, datasets, metrics (Web Search 1)

  2. What makes a good search engine?
  • Efficiency: it replies to user queries without noticeable delays.
    • 1 second is the "limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer".
    • Miller, R. B. (1968). Response time in man-computer conversational transactions. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 267-277.
  • Effectiveness: it replies to user queries with relevant answers.
    • This depends on the interpretation of the user query and the stored information.

  3. Efficiency metrics
  • Elapsed indexing time: the amount of time necessary to build a document index on a particular system.
  • Indexing processor time: the CPU seconds used in building a document index. Similar to elapsed time, but does not count time waiting for I/O or speed gains from parallelism.
  • Query throughput: the number of queries processed per second.
  • Query latency: the amount of time a user must wait after issuing a query before receiving a response, measured in milliseconds. It can be reported as a mean, but is often more instructive as a median or a percentile bound.
  • Indexing temporary space: the amount of temporary disk space used while creating an index.
  • Index size: the amount of storage necessary to store the index files.
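
  A minimal Python sketch (not from the slides) of how query latency percentiles and throughput might be measured; search_fn is a hypothetical placeholder for the engine's query entry point:

      import time
      import statistics

      def latency_report(search_fn, queries):
          # Time each query; search_fn is a placeholder for the engine's query entry point.
          latencies_ms = []
          for q in queries:
              start = time.perf_counter()
              search_fn(q)                               # results are discarded here
              latencies_ms.append((time.perf_counter() - start) * 1000.0)
          latencies_ms.sort()
          return {
              "mean_ms": statistics.mean(latencies_ms),
              "median_ms": statistics.median(latencies_ms),
              "p95_ms": latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
              # throughput here is sequential; a concurrent load test would measure it differently
              "throughput_qps": len(latencies_ms) / (sum(latencies_ms) / 1000.0),
          }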

  4. What makes a good search engine?
  • Efficiency: it replies to user queries without noticeable delays.
    • 1 second is the "limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer".
    • Miller, R. B. (1968). Response time in man-computer conversational transactions. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 267-277.
  • Effectiveness: it replies to user queries with relevant answers.
    • This depends on the interpretation of the user query and the stored information.

  5. Essential aspects of a sound evaluation
  • Experimental protocol
    • Is the task/problem clear? Is it a standard task?
    • Detailed description of the experimental setup: identify all steps of the experiments.
  • Reference dataset
    • Use a well-known dataset if possible. If not, how was the data obtained?
    • Clear separation between training and test set.
  • Evaluation metrics
    • Prefer the metrics commonly used by the community.
    • Check which statistical test is most adequate.

  6. Experimental setups
  • There are experimental setups made available by different organizations:
    • TREC: http://trec.nist.gov/tracks.html
    • CLEF: http://clef2017.clef-initiative.eu/
    • SemEval: http://alt.qcri.org/semeval2017/
    • Visual recognition: http://image-net.org/challenges/LSVRC/
  • These experimental setups define a protocol and a dataset (documents and relevance judgments), and suggest a set of metrics to evaluate performance.

  7. What is a standard task?
  • Experimental setups are designed to develop a search engine to address a specific task:
    • Retrieval by keyword
    • Retrieval by example
    • Ranking annotations
    • Interactive retrieval
    • Search query categorization
    • Real-time summarization
  • Datasets exist for all the above tasks.

  8. Examples of standard tasks in IR
  • For example, TRECVID tasks include:
    • Video shot detection
    • Video news story segmentation
    • High-level feature task (concept detection)
    • Automatic and semi-automatic video search
    • Exploratory analysis (unsupervised)
  • Other forums exist with different tasks:
    • TREC: blog search, opinion leader, patent search, Web search, document categorization...
    • CLEF: plagiarism detection, expert search, Wikipedia mining, multimodal image tagging, medical image search...
    • Others: Japanese, Russian, Spanish, etc.

  9. A retrieval evaluation setup
  [Diagram: queries and a document collection are fed to the system, which produces ranked results; these are compared against the ground truth using evaluation metrics.]

  10. Essential aspects of a sound evaluation
  • Experimental protocol
    • Is the task/problem clear? Is it a standard task?
    • Detailed description of the experimental setup: identify all steps of the experiments.
  • Reference dataset
    • Use a well-known dataset if possible. If not, how was the data obtained?
    • Clear separation between training and test set.
  • Evaluation metrics
    • Prefer the metrics commonly used by the community.
    • Check which statistical test is most adequate.

  11. Reference datasets
  • A reference dataset is made of:
    • a collection of documents
    • a set of training queries
    • a set of test queries
    • the relevance judgments for the query-document pairs.
  • Reference datasets are as important as metrics for evaluating the proposed method.
  • Many different datasets exist for standard tasks.
  • Reference datasets set the difficulty level of the task and allow a fair comparison across different methods.

  12. Ground-truth (relevance judgments)
  • Ground-truth tells the scientist how the method must behave.
  • The ultimate goal is to devise a method that produces exactly the same output as the ground-truth.

                        | Ground-truth: True              | Ground-truth: False
    Method: True        | true positive                   | false positive (Type I error)
    Method: False       | false negative (Type II error)  | true negative

  13. Annotate these pictures with keywords: [slide shows a set of example photographs]

  14. Relevance judgments
  • Candidate keywords: People, Sunset, Nepal, Horizon, Mother, Clouds, Baby, Orange, Colorful dress, Desert, Fence, Flowers, Beach, Yellow, Sea, Nature, Palm tree, White sand, Clear sky

  15. Relevance judgments
  • Judgments can be obtained by experts or by crowdsourcing.
  • Human relevance judgments can be incorrect and inconsistent.
  • How do we measure the quality of human judgments? The kappa agreement statistic:

      kappa = (P(A) - P(E)) / (1 - P(E))

    where P(A) is the proportion of times the humans agreed and P(E) is the probability of agreeing by chance.
  • Values above 0.8 are considered good.
  • Values between 0.67 and 0.8 are considered fair.
  • Values below 0.67 are considered dubious.
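
  A minimal sketch of this agreement measure, assuming two annotators label the same items and that the chance agreement P(E) is estimated from each annotator's label distribution (Cohen's variant); the example judgments are hypothetical:

      from collections import Counter

      def kappa(labels_a, labels_b):
          # Agreement between two annotators, corrected for chance: (P(A) - P(E)) / (1 - P(E)).
          n = len(labels_a)
          p_agree = sum(a == b for a, b in zip(labels_a, labels_b)) / n      # P(A)
          freq_a, freq_b = Counter(labels_a), Counter(labels_b)
          labels = set(labels_a) | set(labels_b)
          p_chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in labels)  # P(E)
          return (p_agree - p_chance) / (1 - p_chance)

      # Two hypothetical assessors judging the same 8 query-document pairs (1 = relevant):
      print(kappa([1, 1, 0, 1, 0, 0, 1, 0],
                  [1, 0, 0, 1, 0, 1, 1, 0]))   # 0.5 -> dubious agreement (< 0.67)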

  16. Essential aspects of a sound evaluation
  • Experimental protocol
    • Is the task/problem clear? Is it a standard task?
    • Detailed description of the experimental setup: identify all steps of the experiments.
  • Reference dataset
    • Use a well-known dataset if possible. If not, how was the data obtained?
    • Clear separation between training and test set.
  • Evaluation metrics
    • Prefer the metrics commonly used by the community.
    • Check which statistical test is most adequate.

  17. Evaluation metrics
  • Complete relevance judgments
    • Ranked relevance judgments
    • Binary relevance judgments
  • Incomplete relevance judgments (Web-scale evaluation)
    • Binary relevance judgments
    • Multi-level relevance judgments

  18. Ranked relevance evaluation metrics
  • Spearman's rank correlation:

      r_s = 1 - (6 * sum_i d_i^2) / (n * (n^2 - 1))

    where d_i is the difference between the two ranks assigned to item i and n is the number of ranked items.
  • Example: comparing the ranking (1, 2, 3, 4) with the ranking (1, 3, 4, 2):

      r_s = 1 - 6 * [(1-1)^2 + (2-3)^2 + (3-4)^2 + (4-2)^2] / (4 * (4^2 - 1)) = 1 - 36/60 = 0.4

  • Another popular rank correlation metric is the Kendall tau.
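
  A minimal sketch of this computation (assuming no tied ranks, so the simple d_i^2 formula applies); the two rankings reproduce the slide's example:

      def spearman_rho(ranks_a, ranks_b):
          # Spearman's rank correlation between two rankings of the same n items (no ties).
          n = len(ranks_a)
          d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
          return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

      print(spearman_rho([1, 2, 3, 4], [1, 3, 4, 2]))   # 0.4, as in the slide's example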

  19. Binary relevance judgments

                        | Ground-truth: True   | Ground-truth: False
    Method: True        | true positive        | false positive
    Method: False       | false negative       | true negative

      Accuracy  = (truePos + trueNeg) / (truePos + falsePos + trueNeg + falseNeg)
      Precision = truePos / (truePos + falsePos)
      Recall    = truePos / (truePos + falseNeg)
      2/F1      = 1/P + 1/R   (F1 is the harmonic mean of precision and recall)

  • In Portuguese: exatidão (accuracy), precisão (precision) and abrangência (recall).
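
  A minimal sketch of these set-based measures for a single query; the document ids and collection size are hypothetical:

      def binary_metrics(retrieved, relevant, n_judged):
          # Set-based metrics for one query; `retrieved` and `relevant` are sets of doc ids,
          # `n_judged` is the total number of judged documents (needed only for accuracy).
          tp = len(retrieved & relevant)
          fp = len(retrieved - relevant)
          fn = len(relevant - retrieved)
          tn = n_judged - tp - fp - fn
          precision = tp / (tp + fp) if retrieved else 0.0
          recall = tp / (tp + fn) if relevant else 0.0
          f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          accuracy = (tp + tn) / n_judged
          return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

      # Hypothetical query: 4 documents retrieved, 3 of the 5 relevant ones among them,
      # out of 100 judged documents.
      print(binary_metrics({1, 2, 3, 4}, {1, 2, 3, 7, 9}, 100))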

  20. Precision-recall graphs for ranked results
  [Figure: the ranked result lists of three systems (A, B and C) and their precision-recall graph; arrows mark the directions of improved precision, improved recall and improved F-measure.]

  21. Interpolated precision-recall graphs
  [Figure: the same ranked result lists with their interpolated precision-recall curves.]
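
  The interpolation itself is only shown graphically on the slide; a minimal sketch of the usual 11-point rule (interpolated precision at recall level r is the maximum precision observed at any recall >= r), with a hypothetical ranking and judgment set:

      def precision_recall_points(ranking, relevant):
          # (recall, precision) after each retrieved document.
          points, hits = [], 0
          for k, doc in enumerate(ranking, start=1):
              if doc in relevant:
                  hits += 1
              points.append((hits / len(relevant), hits / k))
          return points

      def interpolated_precision(points, levels=tuple(i / 10 for i in range(11))):
          # Interpolated precision at recall level r = max precision at any recall >= r.
          return [max((p for r, p in points if r >= level), default=0.0)
                  for level in levels]

      ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]   # hypothetical ranked output
      relevant = {"d3", "d1", "d2", "d8"}              # hypothetical relevance judgments
      print(interpolated_precision(precision_recall_points(ranking, relevant)))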

  22. Average Precision
  • Web systems favor high-precision methods (e.g. P@20).
  • Another, more robust metric is AP:

      AP = (1 / #relevant) * sum of P@k over the rank positions k of the relevant documents

  • Example: 4 relevant documents, of which 3 are retrieved at ranks 2, 4 and 6:

      AP = (1/2 + 2/4 + 3/6) / 4 = 0.375
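
  A minimal sketch of AP over a single ranked list; the document ids are hypothetical and reproduce the slide's example (4 relevant documents, 3 of them retrieved at ranks 2, 4 and 6):

      def average_precision(ranking, relevant):
          # AP = (1 / #relevant) * sum of P@k over the ranks k of the retrieved relevant docs;
          # relevant documents that are never retrieved contribute 0.
          hits, total = 0, 0.0
          for k, doc in enumerate(ranking, start=1):
              if doc in relevant:
                  hits += 1
                  total += hits / k          # P@k at this relevant document
          return total / len(relevant)

      # The slide's example: 4 relevant documents, 3 retrieved at ranks 2, 4 and 6.
      ranking = ["n1", "r1", "n2", "r2", "n3", "r3", "n4", "n5"]   # hypothetical doc ids
      relevant = {"r1", "r2", "r3", "r4"}                          # r4 is never retrieved
      print(average_precision(ranking, relevant))                  # 0.375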

  23. Average Precision
  • Average precision is the area under the precision-recall curve:

      AP = (1 / #relevant) * sum of P@k over the rank positions k of the relevant documents

  24. Mean Average Precision (MAP)
  • MAP evaluates the system over a set of queries.
  • It summarizes the global system performance in one single value.
  • It is the mean of the average precision of a set of n queries:

      MAP = (AP(q1) + AP(q2) + AP(q3) + ... + AP(qn)) / n

  [Figure: the ranked result lists of three queries, each contributing its own AP(q1), AP(q2), AP(q3).]
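
  A minimal sketch, reusing the average_precision function from the earlier sketch; the three query runs are hypothetical:

      def mean_average_precision(runs):
          # `runs` is a list of (ranking, relevant_set) pairs, one per query.
          return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)

      runs = [
          (["d1", "d2", "d3"], {"d1", "d3"}),          # AP = (1/1 + 2/3) / 2
          (["d4", "d5", "d6"], {"d6"}),                # AP = (1/3) / 1
          (["d7", "d8", "d9"], {"d7", "d8", "d9"}),    # AP = (1 + 1 + 1) / 3
      ]
      print(mean_average_precision(runs))              # about 0.72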

