
slide-1
SLIDE 1

Lecture 6: Evaluation

Information Retrieval
Computer Science Tripos Part II

Helen Yannakoudakis¹

Natural Language and Information Processing (NLIP) Group helen.yannakoudakis@cl.cam.ac.uk

2018

¹Based on slides from Simone Teufel and Ronan Cummins

1

slide-2
SLIDE 2

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

2

slide-3
SLIDE 3

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

slide-4
SLIDE 4

Recap: Ranked retrieval

In VSM, one represents documents and queries as weighted tf-idf vectors.
Compute the cosine similarity between the vectors to rank.
Language models rank based on the probability of a document model generating the query.
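To make the recap concrete, here is a minimal Python sketch of tf-idf weighting and cosine ranking. The toy documents, the query, and the (1 + log tf)·log(N/df) weighting scheme are illustrative assumptions, not something fixed by the lecture:

import math
from collections import Counter

# Toy collection; a real system would tokenise and normalise properly.
docs = {
    "d1": "red wine heart attack risk",
    "d2": "white wine tasting notes",
    "d3": "heart attack symptoms and treatment",
}
query = "red wine heart attack"

N = len(docs)
tokenised = {d: text.split() for d, text in docs.items()}
df = Counter(term for toks in tokenised.values() for term in set(toks))

def tfidf_vector(tokens):
    """Weighted tf-idf vector: (1 + log tf) * log(N / df)."""
    tf = Counter(tokens)
    return {t: (1 + math.log(tf[t])) * math.log(N / df[t]) for t in tf if df.get(t)}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(query.split())
ranking = sorted(docs, key=lambda d: cosine(q_vec, tfidf_vector(tokenised[d])), reverse=True)
print(ranking)  # documents ordered by cosine similarity to the query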

3

slide-5
SLIDE 5

Today

[Diagram: IR system architecture — query (with query normalisation), document collection (with document normalisation), indexer, indexes, ranking/matching module and UI — producing the set of relevant documents.]

Today: evaluation

4

slide-6
SLIDE 6

Today

[Diagram: the same IR system architecture, now with an Evaluation component attached.]

Today: how good are the returned documents?

5

slide-7
SLIDE 7

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

slide-8
SLIDE 8

Measures for a search engine

How fast does it index?

e.g., number of bytes per hour

How fast does it search?

e.g., latency as a function of queries per second

What is the cost per query?

in dollars

All of the preceding criteria are measurable: we can quantify speed / size / money

6

slide-9
SLIDE 9

Measures for a search engine

However, the key measure for a search engine is user happiness.

7

slide-10
SLIDE 10

Measures for a search engine

However, the key measure for a search engine is user happiness. What is user happiness?

7

slide-11
SLIDE 11

Measures for a search engine

However, the key measure for a search engine is user happiness. What is user happiness? Factors include:

Speed of response
Size of index
Uncluttered UI

We can measure:

Rate of return to this search engine
Whether something was bought
Whether ads were clicked

7

slide-12
SLIDE 12

Measures for a search engine

However, the key measure for a search engine is user happiness. What is user happiness? Factors include:

Speed of response
Size of index
Uncluttered UI

We can measure:

Rate of return to this search engine
Whether something was bought
Whether ads were clicked

Most important: relevance (actually, maybe even more important: it’s free)

7

slide-13
SLIDE 13

Measures for a search engine

However, the key measure for a search engine is user happiness. What is user happiness? Factors include:

Speed of response
Size of index
Uncluttered UI

We can measure:

Rate of return to this search engine
Whether something was bought
Whether ads were clicked

Most important: relevance (actually, maybe even more important: it’s free)

User happiness is equated with the relevance of search results to the query. Note that none of the other measures is sufficient: blindingly fast, but useless answers won’t make a user happy.

7

slide-14
SLIDE 14

Most common definition of user happiness: Relevance

But how do you measure relevance? Standard methodology in information retrieval consists of three elements:

1 A benchmark document collection

2 A benchmark suite of queries

3 A set of relevance judgments for each query–document pair (gold standard or ground truth judgement of relevance)

We need to hire/pay “judges” or assessors to do this.

8

slide-15
SLIDE 15

Relevance: query vs. information need

Relevance to what? The query?

9

slide-16
SLIDE 16

Relevance: query vs. information need

Relevance to what? The query?

Information need “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”

9

slide-17
SLIDE 17

Relevance: query vs. information need

Relevance to what? The query?

Information need “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”

translated into:

Query q [red wine white wine heart attack]

9

slide-18
SLIDE 18

Relevance: query vs. information need

Relevance to what? The query?

Information need “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”

translated into:

Query q [red wine white wine heart attack]

So what about the following document:

Document d′ At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.

9

slide-19
SLIDE 19

Relevance: query vs. information need

Relevance to what? The query?

Information need “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”

translated into:

Query q [red wine white wine heart attack]

So what about the following document:

Document d′ At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.

d′ is an excellent match for query q . . .

9

slide-20
SLIDE 20

Relevance: query vs. information need

Relevance to what? The query?

Information need “I am looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.”

translated into:

Query q [red wine white wine heart attack]

So what about the following document:

Document d′ At the heart of his speech was an attack on the wine industry lobby for downplaying the role of red and white wine in drunk driving.

d′ is an excellent match for query q . . . d′ is not relevant to the information need.

9

slide-21
SLIDE 21

Relevance: query vs. information need

User happiness can only be measured by relevance to an information need, not by relevance to queries. Sloppy terminology here and elsewhere in the literature: we talk about query–document relevance judgments even though we mean information-need–document relevance judgments.

10

slide-22
SLIDE 22

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

slide-23
SLIDE 23

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant:

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)

11

slide-24
SLIDE 24

Precision and recall

Precision (P) is the fraction of retrieved documents that are relevant:

Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant|retrieved)

Recall (R) is the fraction of relevant documents that are retrieved:

Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved|relevant)
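These two set-based definitions translate directly into code. A minimal sketch, with made-up document IDs for illustration:

def precision_recall(retrieved, relevant):
    """Set-based precision and recall as defined above."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                      # relevant items retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of the 4 retrieved documents are relevant,
# but only 3 of the 6 relevant documents were found.
print(precision_recall({"d1", "d2", "d3", "d9"}, {"d1", "d2", "d3", "d4", "d5", "d6"}))
# (0.75, 0.5)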

11

slide-25
SLIDE 25

Precision and recall: 2 × 2 contingency table

                        THE TRUTH
WHAT THE SYSTEM THINKS  Relevant                Non-relevant
  Retrieved             true positives (TP)     false positives (FP)
  Not retrieved         false negatives (FN)    true negatives (TN)

[Diagram: Venn diagram of the Relevant and Retrieved sets, showing the TP, FP, FN and TN regions.]

12

slide-26
SLIDE 26

Precision and recall: 2 × 2 contingency table

                        THE TRUTH
WHAT THE SYSTEM THINKS  Relevant                Non-relevant
  Retrieved             true positives (TP)     false positives (FP)
  Not retrieved         false negatives (FN)    true negatives (TN)

[Diagram: Venn diagram of the Relevant and Retrieved sets, showing the TP, FP, FN and TN regions.]

P = TP/(TP + FP)

12

slide-27
SLIDE 27

Precision and recall: 2 × 2 contingency table

                        THE TRUTH
WHAT THE SYSTEM THINKS  Relevant                Non-relevant
  Retrieved             true positives (TP)     false positives (FP)
  Not retrieved         false negatives (FN)    true negatives (TN)

[Diagram: Venn diagram of the Relevant and Retrieved sets, showing the TP, FP, FN and TN regions.]

P = TP/(TP + FP)
R = TP/(TP + FN)

12

slide-28
SLIDE 28

Precision/recall trade-off

Recall is a non-decreasing function of the number of docs retrieved. You can increase recall by returning more docs.

13

slide-29
SLIDE 29

Precision/recall trade-off

Recall is a non-decreasing function of the number of docs retrieved. You can increase recall by returning more docs. A system that returns all docs has 100% recall! (but very low precision)

13

slide-30
SLIDE 30

Precision/recall trade-off

Recall is a non-decreasing function of the number of docs retrieved. You can increase recall by returning more docs. A system that returns all docs has 100% recall! (but very low precision) The converse is also true (usually): It’s easy to get high precision for very low recall.

13

slide-31
SLIDE 31

A combined measure: F measure

F measure: single measure that allows us to trade off precision against recall (weighted harmonic mean):

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β^2 + 1)PR / (β^2 P + R),   where β^2 = (1 − α)/α

α ∈ [0, 1] and thus β^2 ∈ [0, ∞]

Most frequently used: balanced F1 with β = 1 (or α = 0.5). This is the harmonic mean of P and R:

F1 = 2PR / (P + R)

Using β, you can control whether you want to pay more attention to P or R.

14

slide-32
SLIDE 32

A combined measure: F measure

F measure: single measure that allows us to trade off precision against recall (weighted harmonic mean):

F = 1 / (α(1/P) + (1 − α)(1/R)) = (β^2 + 1)PR / (β^2 P + R),   where β^2 = (1 − α)/α

α ∈ [0, 1] and thus β^2 ∈ [0, ∞]

Most frequently used: balanced F1 with β = 1 (or α = 0.5). This is the harmonic mean of P and R:

F1 = 2PR / (P + R)

Using β, you can control whether you want to pay more attention to P or R.

Why don’t we use the arithmetic mean?
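A small sketch of the weighted F measure, and of why the harmonic mean is used rather than the arithmetic mean: the harmonic mean is dragged down by whichever of P and R is small. The P and R values below are invented for illustration:

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean: F = (beta^2 + 1) P R / (beta^2 P + R)."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

p, r = 1.0, 0.01                  # e.g. return a single relevant doc: high P, tiny R
print((p + r) / 2)                # arithmetic mean ~0.505 -- looks deceptively good
print(f_measure(p, r))            # F1 ~0.0198 -- harmonic mean exposes the poor recall
print(f_measure(p, r, beta=2.0))  # beta > 1 weights recall more heavily, lower still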

14

slide-33
SLIDE 33

Example for precision, recall, F1

                relevant   not relevant
retrieved       20         40              60
not retrieved   60         1,000,000       1,000,060
                80         1,000,040       1,000,120

15

slide-34
SLIDE 34

Example for precision, recall, F1

                relevant   not relevant
retrieved       20         40              60
not retrieved   60         1,000,000       1,000,060
                80         1,000,040       1,000,120

P = TP/(TP + FP) = 20/(20 + 40) = 1/3

15

slide-35
SLIDE 35

Example for precision, recall, F1

                relevant   not relevant
retrieved       20         40              60
not retrieved   60         1,000,000       1,000,060
                80         1,000,040       1,000,120

P = TP/(TP + FP) = 20/(20 + 40) = 1/3

R = TP/(TP + FN) = 20/(20 + 60) = 1/4

15

slide-36
SLIDE 36

Example for precision, recall, F1

                relevant   not relevant
retrieved       20         40              60
not retrieved   60         1,000,000       1,000,060
                80         1,000,040       1,000,120

P = TP/(TP + FP) = 20/(20 + 40) = 1/3

R = TP/(TP + FN) = 20/(20 + 60) = 1/4

F1 = (2 × 1/3 × 1/4) / (1/3 + 1/4) = 2/7

15

slide-37
SLIDE 37

Recall-criticality and precision-criticality

Inverse relationship between precision and recall forces general systems to go for compromise between them. But some tasks particularly need good precision whereas others need good recall:

                                             Precision-critical task          Recall-critical task
Time                                         matters                          matters less
Tolerance to cases of overlooked information a lot                            none
Information redundancy                       There may be many equally        Information is typically found
                                             good answers                     in only one document
Examples                                     web search                       legal search, patent search

16

slide-38
SLIDE 38

Difficulties in using precision, recall and F

We need relevance judgments for information-need–document pairs – but they are expensive to produce. We should always average over a large set of queries.

There is no such thing as a “typical” or “representative” query.

For alternatives to using precision/recall and having to produce relevance judgments – see end of this lecture.

17

slide-39
SLIDE 39

Why not accuracy?

Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy?

18

slide-40
SLIDE 40

Why not accuracy?

Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy? Accuracy is the fraction of decisions (relevant/non-relevant) that are correct. In terms of the contingency table above:

accuracy = (TP + TN) / (TP + FP + FN + TN)

18

slide-41
SLIDE 41

Why not accuracy?

Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy? Accuracy is the fraction of decisions (relevant/non-relevant) that are correct. In terms of the contingency table above:

accuracy = (TP + TN) / (TP + FP + FN + TN)

Limit case:

                relevant   not relevant
retrieved       0          0
not retrieved   10         90

18

slide-42
SLIDE 42

Why not accuracy?

Why do we use complex measures like precision, recall, and F? Why not something simple like accuracy? Accuracy is the fraction of decisions (relevant/non-relevant) that are correct. In terms of the contingency table above:

accuracy = (TP + TN) / (TP + FP + FN + TN)

Limit case:

                relevant   not relevant
retrieved       0          0
not retrieved   10         90

High accuracy, but the system hasn’t returned anything! Not suitable when the data is extremely skewed.

18

slide-43
SLIDE 43

Why not accuracy?

In IR, normally over 99.9% of the documents are in the non-relevant category. You then get 99.9% accuracy on most queries by simply saying that all documents are not relevant. Searchers on the web (and in IR in general) want to find something and have a certain tolerance for junk. It’s better to return some bad hits as long as you return something. → We use precision, recall, and F for evaluation, not accuracy.
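A tiny sketch of why accuracy misleads on skewed data, using the limit case above: a system that retrieves nothing still scores 90% accuracy on the 10-relevant/90-non-relevant toy collection, yet its recall is zero:

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# "Return nothing" on a collection with 10 relevant and 90 non-relevant documents:
tp, fp, fn, tn = 0, 0, 10, 90
print(accuracy(tp, fp, fn, tn))   # 0.9 -- high accuracy
print(tp / (tp + fn))             # 0.0 -- recall: the user gets nothing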

19

slide-44
SLIDE 44

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

slide-45
SLIDE 45

Moving from unranked to ranked evaluation

Precision/recall/F are measures for unranked sets. We can easily turn set measures into measures of ranked lists. Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4 etc. results.

20

slide-46
SLIDE 46

Moving from unranked to ranked evaluation

Precision/recall/F are measures for unranked sets. We can easily turn set measures into measures of ranked lists. Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4 etc. results. This is called Precision/Recall @ Rank. Rank statistics give some indication of how quickly the user will find relevant documents from a ranked list.

20

slide-47
SLIDE 47

Precision/Recall @ Rank

Rank n   Doc
1        d12
2        d123
3        d4
4        d57
5        d157
6        d222
7        d24
8        d26
9        d77
10       d90

Blue documents are relevant.

21

slide-48
SLIDE 48

Precision/Recall @ Rank

Rank n   Doc
1        d12
2        d123
3        d4
4        d57
5        d157
6        d222
7        d24
8        d26
9        d77
10       d90

Blue documents are relevant.

P@n: P@3 = 0.33, P@5 = 0.2, P@8 = 0.25

21

slide-49
SLIDE 49

Precision/Recall @ Rank

Rank n   Doc
1        d12
2        d123
3        d4
4        d57
5        d157
6        d222
7        d24
8        d26
9        d77
10       d90

Blue documents are relevant.

P@n: P@3 = 0.33, P@5 = 0.2, P@8 = 0.25
R@n: R@3 = 0.33, R@5 = 0.33, R@8 = 0.66
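A sketch of Precision/Recall @ Rank for the ranking above. The colour coding is lost in this text version, so the code assumes the relevant documents are d4, d26 and d90 (ranks 3, 8 and 10), three relevant documents in total — one assignment consistent with the P@n and R@n values quoted:

ranking = ["d12", "d123", "d4", "d57", "d157", "d222", "d24", "d26", "d77", "d90"]
relevant = {"d4", "d26", "d90"}   # assumed relevant set (ranks 3, 8, 10)

def precision_at(k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(k):
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

for k in (3, 5, 8):
    print(f"P@{k} = {precision_at(k):.2f}   R@{k} = {recall_at(k):.2f}")
# P@3 = 0.33  R@3 = 0.33;  P@5 = 0.20  R@5 = 0.33;  P@8 = 0.25  R@8 = 0.67 (the slide rounds to 0.66)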

21

slide-50
SLIDE 50

Another idea: Precision @ Recall r

Rank   S1   S2
1      X
2           X
3      X
4
5           X
6      X    X
7           X
8           X
9      X
10     X

           S1     S2
P@r 0.2    1.0    0.5
P@r 0.4    0.67   0.4
P@r 0.6    0.5    0.5
P@r 0.8    0.44   0.57
P@r 1.0    0.5    0.63

X denotes the relevant documents.

22

slide-51
SLIDE 51

11-point Interpolated Average Precision

Compute (interpolated) precision at recall levels / recall points 0.0, 0.1, 0.2, 0.3, . . . , 1.0
Do this for each of the queries in the evaluation benchmark.
For each recall level, average over queries.
Figure: example graph of such results from a representative good system at TREC (more later).

23

slide-52
SLIDE 52

11-point Interpolated Average Precision more formally

P_11pt = (1/11) Σ_{j=0}^{10} (1/N) Σ_{i=1}^{N} P̃_i(r_j)

where P̃_i(r_j) is the precision at the jth recall level for the ith query (out of N)

Define 11 standard recall points r_j = j/10: r_0 = 0, r_1 = 0.1, ..., r_10 = 1

24

slide-53
SLIDE 53

11-point Interpolated Average Precision more formally

P_11pt = (1/11) Σ_{j=0}^{10} (1/N) Σ_{i=1}^{N} P̃_i(r_j)

where P̃_i(r_j) is the precision at the jth recall level for the ith query (out of N)

Define 11 standard recall points r_j = j/10: r_0 = 0, r_1 = 0.1, ..., r_10 = 1

To get P̃_i(r_j), we can use P_i(R = r_j) – but what if there is no point with r_j recall (i.e., there is no relevant document at exactly r_j)?

24

slide-54
SLIDE 54

Worked Example avg-11-pt prec: Query 1, measured data points

[Plot: precision–recall curve for Query 1 (blue); bold circles mark the measured data points.]

Query 1 (relevant documents marked X):
Rank  1: X   R = 0.2   P = 1.00   →   P̃_1(r2) = 1.00
Rank  3: X   R = 0.4   P = 0.67   →   P̃_1(r4) = 0.67
Rank  6: X   R = 0.6   P = 0.50   →   P̃_1(r6) = 0.50
Rank 10: X   R = 0.8   P = 0.40   →   P̃_1(r8) = 0.40
Rank 20: X   R = 1.0   P = 0.25   →   P̃_1(r10) = 0.25
(ranks 2, 4–5, 7–9 and 11–19 are not relevant)

Five of the r_j (r2, r4, r6, r8, r10) coincide directly with a measured data point.

25

slide-55
SLIDE 55

11-point Interpolated Average Precision more formally

P_11pt = (1/11) Σ_{j=0}^{10} (1/N) Σ_{i=1}^{N} P̃_i(r_j)

where P̃_i(r_j) is the precision at the jth recall level for the ith query (out of N)

Define 11 standard recall points r_j = j/10: r_0 = 0, r_1 = 0.1, ..., r_10 = 1

To get P̃_i(r_j), we can use P_i(R = r_j) – but what if there is no data point with r_j recall (i.e., there is no relevant document at exactly r_j)?

26

slide-56
SLIDE 56

11-point Interpolated Average Precision more formally

P_11pt = (1/11) Σ_{j=0}^{10} (1/N) Σ_{i=1}^{N} P̃_i(r_j)

where P̃_i(r_j) is the precision at the jth recall level for the ith query (out of N)

Define 11 standard recall points r_j = j/10: r_0 = 0, r_1 = 0.1, ..., r_10 = 1

To get P̃_i(r_j), we can use P_i(R = r_j) – but what if there is no data point with r_j recall (i.e., there is no relevant document at exactly r_j)?

Interpolated precision: the highest precision found for any recall level r′ ≥ r_j:

P̃_i(r_j) = max_{r′ ≥ r_j} P_i(r′)

Now we have a value for every recall level. Note that P_i(R = 1) can always be measured.
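A sketch of the interpolation step, computing P̃(r_j) at the 11 standard recall points from the ranks of the retrieved relevant documents. Using Query 1 of the worked example (relevant documents at ranks 1, 3, 6, 10, 20, with 5 relevant documents in total) reproduces the interpolated values of the next slide:

def interpolated_11pt(relevant_ranks, total_relevant):
    """Precision at r_j = j/10 for j = 0..10, interpolated as max precision at recall >= r_j."""
    # Measured (recall, precision) points: one per retrieved relevant document.
    points = [((i + 1) / total_relevant, (i + 1) / rank)
              for i, rank in enumerate(sorted(relevant_ranks))]
    out = []
    for j in range(11):
        r_j = j / 10
        candidates = [p for r, p in points if r >= r_j]
        out.append(max(candidates) if candidates else 0.0)
    return out

query1 = interpolated_11pt([1, 3, 6, 10, 20], total_relevant=5)
print([round(p, 2) for p in query1])
# [1.0, 1.0, 1.0, 0.67, 0.67, 0.5, 0.5, 0.4, 0.4, 0.25, 0.25] -- matches the worked example
print(sum(query1) / 11)   # this query's contribution to the 11-point average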

26

slide-57
SLIDE 57

Worked Example avg-11-pt prec: Query 1, interpolation

[Plot: precision–recall curve for Query 1 (blue); bold circles are measured points, thin circles interpolated.]

Query 1 (relevant documents at ranks 1, 3, 6, 10, 20):
Measured:      R = .20  P = 1.00;   R = .40  P = .67;   R = .60  P = .50;   R = .80  P = .40;   R = 1.00  P = .25
Interpolated:  P̃_1(r0) = 1.00, P̃_1(r1) = 1.00, P̃_1(r2) = 1.00, P̃_1(r3) = .67, P̃_1(r4) = .67, P̃_1(r5) = .50,
               P̃_1(r6) = .50, P̃_1(r7) = .40, P̃_1(r8) = .40, P̃_1(r9) = .25, P̃_1(r10) = .25

The six other r_j (r0, r1, r3, r5, r7, r9) are interpolated.

(Worked avg-11-pt prec example for supervisions at the end of slides.)

27

slide-58
SLIDE 58

Another example

Each point corresponds to a result for the top k ranked hits (k = 1, 2, 3, 4, . . .).
Interpolation (in red): take the maximum of all future points.
Rationale for interpolation: the user is willing to look at a few more documents if that would increase both precision and recall.

28

slide-59
SLIDE 59

Mean Average Precision (MAP)

Also called “average precision at seen relevant documents” Determine precision at each point when a new relevant document gets retrieved Calculate average precision for each query, then average over queries:

MAP = (1/N) Σ_{j=1}^{N} (1/Q_j) Σ_{i=1}^{Q_j} P(doc_i)

where:

Q_j        number of relevant documents for query j
N          number of queries
P(doc_i)   precision at the ith relevant document

Use P = 0 for each relevant document that was not retrieved

29

slide-60
SLIDE 60

Mean Average Precision: example (MAP = (0.564 + 0.623)/2 = 0.594)

Query 1 (relevant documents marked X):
Rank  1: X   P(doc_i) = 1.00
Rank  3: X   P(doc_i) = 0.67
Rank  6: X   P(doc_i) = 0.50
Rank 10: X   P(doc_i) = 0.40
Rank 20: X   P(doc_i) = 0.25
AVG: 0.564

Query 2 (relevant documents marked X):
Rank  1: X   P(doc_i) = 1.00
Rank  3: X   P(doc_i) = 0.67
Rank 15: X   P(doc_i) = 0.2
AVG: 0.623

No need for fixed recall levels, and no interpolation.
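A sketch of MAP computed from the ranks of the relevant documents in the two example rankings. Note the slide averages precisions rounded to two decimals (0.564, 0.623, 0.594); full-precision arithmetic gives 0.563, 0.622, 0.593:

def average_precision(relevant_ranks, total_relevant):
    """Mean of the precision values at each retrieved relevant document (P = 0 if never retrieved)."""
    precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / total_relevant

ap1 = average_precision([1, 3, 6, 10, 20], total_relevant=5)   # Query 1
ap2 = average_precision([1, 3, 15], total_relevant=3)          # Query 2
print(round(ap1, 3), round(ap2, 3))   # 0.563 0.622 (slide: 0.564 / 0.623, from 2-d.p. rounding)
print(round((ap1 + ap2) / 2, 3))      # 0.593, i.e. the slide's MAP of 0.594 up to rounding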

30

slide-61
SLIDE 61

ROC curve (Receiver Operating Characteristic)

y-axis: TPR (true positive rate) = TP / total actual positives (also called sensitivity ≡ recall)
x-axis: FPR (false positive rate) = FP / total actual negatives

FPR = fall-out = 1 − specificity (TNR; true negative rate)

But we are only interested in the small area in the lower left corner (blown up by the precision–recall graph).
For a good system, the graph climbs steeply on the left side.
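A small sketch of the two ROC axes, reusing the contingency-table counts from the earlier precision/recall example; the tiny FPR illustrates why only the lower left corner of the ROC plot is interesting in IR:

def roc_point(tp, fp, fn, tn):
    tpr = tp / (tp + fn)   # true positive rate = sensitivity = recall
    fpr = fp / (fp + tn)   # false positive rate = fall-out = 1 - specificity
    return fpr, tpr

# Counts from the earlier worked example (20 TP, 40 FP, 60 FN, 1,000,000 TN):
print(roc_point(20, 40, 60, 1_000_000))   # (~0.00004, 0.25): tiny fall-out, recall 0.25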

31

slide-62
SLIDE 62

Variance of measures like precision/recall

For a test collection, it is usual that a system does badly on some information needs (e.g., P = 0.2 at R = 0.1) and really well on others (e.g., P = 0.95 at R = 0.1). Indeed, it is usually the case that the variance of the same system across queries is much greater than the variance of different systems on the same query. That is, there are easy information needs and hard ones.

32

slide-63
SLIDE 63

Overview

1 Recap/Catchup 2 Introduction 3 Unranked evaluation 4 Ranked evaluation 5 Benchmarks 6 Other types of evaluation

slide-64
SLIDE 64

What we need for a benchmark

A collection of documents

Documents must be representative of the documents we expect to see in reality.

33

slide-65
SLIDE 65

What we need for a benchmark

A collection of documents

Documents must be representative of the documents we expect to see in reality.

A collection of information needs, expressible as queries

. . . which we will often incorrectly refer to as queries
Information needs must be representative of the information needs we expect to see in reality.

33

slide-66
SLIDE 66

What we need for a benchmark

A collection of documents

Documents must be representative of the documents we expect to see in reality.

A collection of information needs, expressible as queries

. . . which we will often incorrectly refer to as queries
Information needs must be representative of the information needs we expect to see in reality.

Human relevance assessments (relevance assessed relative to the information need)

We need to hire/pay “judges” or assessors to do this. Expensive, time-consuming.
Judges must be representative of the users we expect to see in reality.

33

slide-67
SLIDE 67

First standard relevance benchmark: Cranfield

Pioneering: first testbed allowing precise quantitative measures of information retrieval effectiveness
Late 1950s, UK
1,398 abstracts of aerodynamics journal articles, a set of 225 queries, exhaustive relevance judgments of all query–document pairs
Too small, too untypical for serious IR evaluation today

34

slide-68
SLIDE 68

Second-generation relevance benchmark: TREC

TREC = Text REtrieval Conference
Organized by the U.S. National Institute of Standards and Technology (NIST)
TREC is actually a set of several different relevance benchmarks.
Best known: TREC Ad Hoc, used for the first 8 TREC evaluations between 1992 and 1999
1.89 million documents, mainly newswire articles, 450 information needs
No exhaustive relevance judgments – too expensive
Rather, NIST assessors’ relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation for which the information need was developed.

35

slide-69
SLIDE 69

Sample TREC Query

<num> Number: 508
<title> hair loss is a symptom of what diseases
<desc> Description:
Find diseases for which hair loss is a symptom.
<narr> Narrative:
A document is relevant if it positively connects the loss of head hair in humans with a specific disease. In this context, “thinning hair” and “hair loss” are synonymous. Loss of body and/or facial hair is irrelevant, as is hair loss caused by drug therapy.

36

slide-70
SLIDE 70

TREC Relevance Judgements

Humans decide which document–query pairs are relevant.

37

slide-71
SLIDE 71

Example of more recent benchmark: ClueWeb09

1 billion web pages
25 terabytes (compressed: 5 terabytes)
Collected January/February 2009
10 languages
Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)

38

slide-72
SLIDE 72

Inter-judge agreement at TREC

information need   docs judged   number of disagreements
51                 211           6
62                 400           157
67                 400           68
95                 400           110
127                400           106

39

slide-73
SLIDE 73

Impact of inter-judge disagreement

Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless?

40

slide-74
SLIDE 74

Impact of inter-judge disagreement

Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless? No.

40

slide-75
SLIDE 75

Impact of inter-judge disagreement

Judges disagree a lot. Does that mean that the results of information retrieval experiments are meaningless? No.

Large impact on absolute performance numbers
Virtually no impact on ranking of systems
Suppose we want to know if algorithm A is better than algorithm B
An information retrieval experiment will give us a reliable answer to this question . . .
. . . even if there is a lot of disagreement between judges.

40

slide-76
SLIDE 76

Overview

1 Recap/Catchup
2 Introduction
3 Unranked evaluation
4 Ranked evaluation
5 Benchmarks
6 Other types of evaluation

slide-77
SLIDE 77

Evaluation at large search engines

Recall is difficult to measure on the web
Search engines often use precision at top k, e.g., k = 10 . . .
. . . or use measures that reward a system more for getting rank 1 right than for getting rank 10 right.

41

slide-78
SLIDE 78

Evaluation at large search engines

Recall is difficult to measure on the web
Search engines often use precision at top k, e.g., k = 10 . . .
. . . or use measures that reward a system more for getting rank 1 right than for getting rank 10 right.

Search engines also use non-relevance-based measures:

Clickthrough on first result (frequency with which people click on the top result)
Not very reliable if you look at a single clickthrough (you may realize after clicking that the summary was misleading and the document is non-relevant) . . .
. . . but pretty reliable in the aggregate.
A/B testing

41

slide-79
SLIDE 79

A/B testing

Purpose: Test a single innovation
Pre-requisite: You have a large search engine up and running.
Have most users use old system
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation
Evaluate with an “automatic” measure like clickthrough on first result
Now we can directly see if the innovation does improve user happiness.
Probably the evaluation methodology that large search engines trust most

42

slide-80
SLIDE 80

Take-away

Focused on evaluation for ad-hoc retrieval

Precision, Recall, F-measure
More complex measures for ranked retrieval
Other issues arise when evaluating different tracks, e.g. Question Answering (QA), although these typically still use P/R-based measures

Evaluation for interactive tasks is more involved

Significance testing is an issue:
Could a good result have occurred by chance?
Is the result robust across different document sets?
Slowly becoming more common
Underlying population distributions unknown, so apply non-parametric tests such as the sign test (a sketch follows below)
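A minimal sketch of the sign test mentioned above, applied to paired per-query scores of two systems A and B. The scores are invented for illustration; ties are discarded and the two-sided p-value comes from a Binomial(n, 0.5) distribution:

from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired per-query scores; returns the p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]   # drop ties
    n = len(diffs)
    wins_a = sum(d > 0 for d in diffs)
    k = min(wins_a, n - wins_a)
    # P(at most k successes) under Binomial(n, 0.5), doubled for a two-sided test.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Hypothetical average precision per query for systems A and B:
a = [0.61, 0.45, 0.72, 0.33, 0.58, 0.49, 0.66, 0.52, 0.71, 0.44]
b = [0.55, 0.41, 0.70, 0.35, 0.50, 0.43, 0.60, 0.47, 0.69, 0.40]
print(sign_test(a, b))   # ~0.021: A's advantage is unlikely to be chance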

43

slide-81
SLIDE 81

Reading

MRS, Chapter 8

44

slide-82
SLIDE 82

Worked Example avg-11-pt prec: Query 1, measured data points

[Plot: precision–recall curve for Query 1 (blue); bold circles mark the measured data points.]

Query 1 (relevant documents marked X):
Rank  1: X   R = 0.2   P = 1.00   →   P̃_1(r2) = 1.00
Rank  3: X   R = 0.4   P = 0.67   →   P̃_1(r4) = 0.67
Rank  6: X   R = 0.6   P = 0.50   →   P̃_1(r6) = 0.50
Rank 10: X   R = 0.8   P = 0.40   →   P̃_1(r8) = 0.40
Rank 20: X   R = 1.0   P = 0.25   →   P̃_1(r10) = 0.25
(ranks 2, 4–5, 7–9 and 11–19 are not relevant)

Five of the r_j (r2, r4, r6, r8, r10) coincide directly with a measured data point.

45

slide-83
SLIDE 83

Worked Example avg-11-pt prec: Query 1, interpolation

[Plot: precision–recall curve for Query 1 (blue); bold circles are measured points, thin circles interpolated.]

Query 1 (relevant documents at ranks 1, 3, 6, 10, 20):
Measured:      R = .20  P = 1.00;   R = .40  P = .67;   R = .60  P = .50;   R = .80  P = .40;   R = 1.00  P = .25
Interpolated:  P̃_1(r0) = 1.00, P̃_1(r1) = 1.00, P̃_1(r2) = 1.00, P̃_1(r3) = .67, P̃_1(r4) = .67, P̃_1(r5) = .50,
               P̃_1(r6) = .50, P̃_1(r7) = .40, P̃_1(r8) = .40, P̃_1(r9) = .25, P̃_1(r10) = .25

The six other r_j (r0, r1, r3, r5, r7, r9) are interpolated.

46

slide-84
SLIDE 84

Worked Example avg-11-pt prec: Query 2, measured data points

[Plot: precision–recall curves for Query 1 (blue) and Query 2 (red); bold circles are measured points, thin circles interpolated.]

Query 2 (relevant documents marked X):
Rank  1: X   R = .33   P = 1.00
Rank  3: X   R = .67   P = .67
Rank 15: X   R = 1.0   P = .2    →   P̃_2(r10) = .20
(ranks 2 and 4–14 are not relevant)

Only r10 coincides with a measured data point

47

slide-85
SLIDE 85

Worked Example avg-11-pt prec: Query 2, interpolation

[Plot: precision–recall curves for Query 1 (blue) and Query 2 (red); bold circles are measured points, thin circles interpolated.]

Query 2 (relevant documents at ranks 1, 3, 15):
Measured:      R = .33  P = 1.00;   R = .67  P = .67;   R = 1.0  P = .2
Interpolated:  P̃_2(r0) = 1.00, P̃_2(r1) = 1.00, P̃_2(r2) = 1.00, P̃_2(r3) = 1.00, P̃_2(r4) = .67, P̃_2(r5) = .67,
               P̃_2(r6) = .67, P̃_2(r7) = .20, P̃_2(r8) = .20, P̃_2(r9) = .20, P̃_2(r10) = .20

10 of the r_j are interpolated

48

slide-86
SLIDE 86

Worked Example avg-11-pt prec: averaging

[Plot: the two interpolated precision–recall curves for Query 1 and Query 2.]

Now average at each recall point r_j over N (number of queries) → 11 averages

49

slide-87
SLIDE 87

Worked Example avg-11-pt prec: area/result

[Plots: the 11 averaged precision values plotted against the recall points, and the resulting averaged precision–recall curve.]

End result: 11-point average precision
Approximation of the area under the precision–recall curve

50