

SLIDE 1

Evaluation

Experimental protocols, datasets, metrics

Web Search

SLIDE 2

What makes a good search engine?

  • Efficiency: It replies to user queries without noticeable delays.
  • 1 sec is the “limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer”.
  • Miller, R. B. (1968). Response time in man-computer conversational transactions. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 267-277.
  • Effectiveness: It replies to user queries with relevant answers.
  • This depends on the interpretation of the user query and the stored information.

SLIDE 3

Efficiency metrics

  • Elapsed indexing time: Measures the amount of time necessary to build a document index on a particular system.
  • Indexing processor time: Measures the CPU seconds used in building a document index. This is similar to elapsed time, but does not count time waiting for I/O or speed gains from parallelism.
  • Query throughput: Number of queries processed per second.
  • Query latency: The amount of time a user must wait after issuing a query before receiving a response, measured in milliseconds. This can be measured using the mean, but is often more instructive when used with the median or a percentile bound (see the sketch below).
  • Indexing temporary space: Amount of temporary disk space used while creating an index.
  • Index size: Amount of storage necessary to store the index files.
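The latency discussion above is easy to make concrete. Below is a minimal sketch, with invented latency values, of how the mean, the median and a simple nearest-rank 95th-percentile bound can be reported for a batch of queries (all variable names are mine):

```python
import statistics

# Hypothetical per-query latencies in milliseconds (illustrative values only).
latencies_ms = [120, 95, 310, 88, 102, 97, 1450, 110, 99, 105]

mean_ms = statistics.mean(latencies_ms)      # sensitive to outliers (e.g., the 1450 ms query)
median_ms = statistics.median(latencies_ms)  # more robust summary of the typical experience

# Crude nearest-rank approximation of the 95th-percentile bound.
p95_ms = sorted(latencies_ms)[int(0.95 * (len(latencies_ms) - 1))]

print(f"mean={mean_ms:.1f} ms, median={median_ms:.1f} ms, p95={p95_ms} ms")
```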

SLIDE 4

What makes a good search engine?

  • Efficiency: It replies to user queries without noticeable delays.
  • 1 sec is the “limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer”.
  • Miller, R. B. (1968). Response time in man-computer conversational transactions. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 267-277.
  • Effectiveness: It replies to user queries with relevant answers.
  • This depends on the interpretation of the user query and the stored information.

SLIDE 5

Essential aspects of a sound evaluation

  • Experimental protocol
  • Is the task/problem clear? Is it a standard task?
  • Detailed description of the experimental setup:
  • identify all steps of the experiments.
  • Reference dataset
  • Use a well known dataset if possible.
  • If not, how was the data obtained?
  • Clear separation between training and test set.
  • Evaluation metrics
  • Prefer the commonly used metrics by the community.
  • Check which statistical test is most adequate.

SLIDE 6

Experimental setups

  • There are experimental setups made available by different organizations:
  • TREC: http://trec.nist.gov/tracks.html
  • CLEF: http://clef2017.clef-initiative.eu/
  • SemEVAL: http://alt.qcri.org/semeval2017/
  • Visual recognition: http://image-net.org/challenges/LSVRC/
  • These experimental setups define a protocol and a dataset (documents and relevance judgments), and suggest a set of metrics to evaluate performance.

SLIDE 7

What is a standard task?

  • Experimental setups are designed to develop a search engine to address a specific task.

  • Retrieval by keyword
  • Retrieval by example
  • Ranking annotations
  • Interactive retrieval
  • Search query categorization
  • Real-time summarization
  • Datasets exist for all the above tasks.

SLIDE 8

Examples of standard tasks in IR

  • For example, TRECVID tasks include:
  • Video shot-detection
  • Video news story segmentation
  • High-level feature task (concept detection)
  • Automatic and semi-automatic video search
  • Exploratory analysis (unsupervised)
  • Other forums exist with different tasks:
  • TREC: Blog search, opinion leader, patent search, Web search, document categorization...
  • CLEF: Plagiarism detection, expert search, Wikipedia mining, multimodal image tagging, medical image search...
  • Others: Japanese, Russian, Spanish, etc.

SLIDE 9

A retrieval evaluation setup

[Diagram: queries and a data collection are fed to the system, which returns ranked results; the ranked results are compared against the ground truth using evaluation metrics.]

SLIDE 10

Essential aspects of a sound evaluation

  • Experimental protocol
  • Is the task/problem clear? Is it a standard task?
  • Detailed description of the experimental setup:
  • identify all steps of the experiments.
  • Reference dataset
  • Use a well known dataset if possible.
  • If not, how was the data obtained?
  • Clear separation between training and test set.
  • Evaluation metrics
  • Prefer the commonly used metrics by the community.
  • Check which statistical test is most adequate.

SLIDE 11

Reference datasets

  • A reference dataset is made of:
  • a collection of documents
  • a set of training queries
  • a set of test queries
  • the relevance judgments for the query-document pairs.
  • Reference datasets are as important as metrics for evaluating the proposed method.
  • Many different datasets exist for standard tasks.
  • Reference datasets set the difficulty level of the task.
  • They allow a fair comparison across different methods.

SLIDE 12

Ground-truth (relevance judgments)

  • Ground-truth tells the scientist how the method must behave.
  • The ultimate goal is to devise a method that produces exactly the same output as the ground-truth.

                   Ground-truth: True                Ground-truth: False
  Method: True     True positive                     False positive (Type I error)
  Method: False    False negative (Type II error)    True negative

SLIDE 13

Annotate these pictures with keywords:


SLIDE 14

Relevance judgments

[Keywords assigned to the pictures from the previous slide: People, Nepal, Mother, Baby, Colorful dress, Fence, Sunset, Horizon, Clouds, Orange, Desert, Flowers, Yellow, Nature, Beach, Sea, Palm tree, White sand, Clear sky.]

SLIDE 15

Relevance judgments

  • Judgments can be obtained from experts or by crowdsourcing.
  • Human relevance judgments can be incorrect and inconsistent.
  • How do we measure the quality of human judgments? A common measure is the kappa statistic (below).
  • Kappa values above 0.8 are considered good.
  • Values between 0.67 and 0.8 are considered fair.
  • Values below 0.67 are considered dubious.

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) is the proportion of times the human judges agreed and P(E) is the probability that they would agree by chance.
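To make the agreement computation concrete, here is a minimal sketch that computes kappa for two judges assigning binary relevance labels; the judge labels are invented example data and the function name is mine:

```python
def kappa(judge_a, judge_b):
    """Kappa agreement for two judges giving binary relevance labels (1 = relevant)."""
    n = len(judge_a)
    # P(A): proportion of documents on which the two judges agree.
    p_agree = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # P(E): probability of agreeing by chance, from each judge's label frequencies.
    pa_rel = sum(judge_a) / n
    pb_rel = sum(judge_b) / n
    p_chance = pa_rel * pb_rel + (1 - pa_rel) * (1 - pb_rel)
    return (p_agree - p_chance) / (1 - p_chance)

# Invented example labels: agreement on 6 of 8 documents.
print(kappa([1, 1, 0, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0]))  # 0.5 -> dubious agreement
```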

SLIDE 16

Essential aspects of a sound evaluation

  • Experimental protocol
  • Is the task/problem clear? Is it a standard task?
  • Detailed description of the experimental setup:
  • identify all steps of the experiments.
  • Reference dataset
  • Use a well known dataset if possible.
  • If not, how was the data obtained?
  • Clear separation between training and test set.
  • Evaluation metrics
  • Prefer the commonly used metrics by the community.
  • Check which statistical test is most adequate.

SLIDE 17

Evaluation metrics

  • Complete relevance judgments
  • Ranked relevance judgments
  • Binary relevance judgments
  • Incomplete relevance judgments (Web-scale evaluation)
  • Binary relevance judgments
  • Multi-level relevance judgments

SLIDE 18

Ranked relevance evaluation metrics

  • Spearman’s rank correlation:
  • Example:
  • Another popular rank correlation metric is the Kendall-Tau.

Example: two rankings of the same four items, with rank pairs (1, 1), (2, 3), (3, 4) and (4, 2).

$$r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

$$r_s = 1 - \frac{6\left[(1-1)^2 + (2-3)^2 + (3-4)^2 + (4-2)^2\right]}{4(4^2 - 1)} = 1 - \frac{36}{60} = 0.4$$
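A minimal sketch of the same computation, using the rank pairs from the worked example (the function name is mine):

```python
def spearman(ranks_a, ranks_b):
    """Spearman's rank correlation between two rankings of the same items."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [1, 3, 4, 2]))  # 0.4, as in the worked example
```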

SLIDE 19

Binary relevance judgments

In Portuguese: exatidão (accuracy), precisão (precision) and abrangência (recall).

                   Ground-truth: True     Ground-truth: False
  Method: True     True positive          False positive
  Method: False    False negative         True negative

$$Accuracy = \frac{truePos + trueNeg}{truePos + falsePos + trueNeg + falseNeg}$$

$$Precision = \frac{truePos}{truePos + falsePos} \qquad Recall = \frac{truePos}{truePos + falseNeg}$$

$$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$$
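A minimal sketch of these set-based measures for a single query, assuming the retrieved and relevant documents are given as sets of document ids (the ids are invented):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    f1 = 2 / (1 / precision + 1 / recall) if precision and recall else 0.0
    return precision, recall, f1

# Illustrative document ids: 2 of the 4 retrieved documents are relevant.
print(precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"}))  # (0.5, 0.667, 0.571)
```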

SLIDE 20

Precision-recall graphs for ranked results

[Figure: precision-recall curves for Systems A, B and C, built from three ranked result lists (S1, S2, S3); the curves illustrate what improved recall, improved precision and an improved F-measure look like.]

SLIDE 21

Interpolated precision-recall graphs

[Figure: the same three ranked result lists (S1, S2, S3), now plotted as interpolated precision-recall curves.]
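Interpolated precision at a recall level r is commonly taken as the maximum precision observed at any recall level greater than or equal to r. A minimal sketch under that assumption, with invented relevance labels, is:

```python
def precision_recall_points(ranked_rel, total_relevant):
    """(recall, precision) after each relevant document in a ranked list of 0/1 labels."""
    points, hits = [], 0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            points.append((hits / total_relevant, hits / i))
    return points

def interpolate(points, levels=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Interpolated precision: max precision at any recall >= the given level."""
    return {r: max((p for rec, p in points if rec >= r), default=0.0) for r in levels}

pts = precision_recall_points([0, 1, 0, 1, 0, 1, 0, 0], total_relevant=4)
print(interpolate(pts))
```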

SLIDE 22

Average Precision

  • Web systems favor high-precision methods (P@20)
  • Another, more robust metric is AP:

Example: a ranked list of 8 results, with relevant documents at positions 2, 4 and 6 (4 relevant documents exist in total).

$$AP = \frac{1}{\#relevant} \cdot \sum_{k \in \text{positions of the relevant docs}} P@k$$

$$AP = \frac{1}{4}\left(\frac{1}{2} + \frac{2}{4} + \frac{3}{6}\right) = 0.375$$
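A minimal sketch of the same AP computation over a ranked list of binary relevance labels; with four relevant documents in total it reproduces the 0.375 above (the function name and labels are mine):

```python
def average_precision(ranked_rel, total_relevant):
    """AP: mean of P@k over the ranks k at which relevant documents appear."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # P@k at this relevant document
    return sum(precisions) / total_relevant if total_relevant else 0.0

# Relevant documents retrieved at positions 2, 4 and 6; one relevant document not retrieved.
print(average_precision([0, 1, 0, 1, 0, 1, 0, 0], total_relevant=4))  # 0.375
```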

SLIDE 23

Average Precision

  • Average precision is the area under the P-R curve

$$AP = \frac{1}{\#relevant} \cdot \sum_{k \in \text{positions of the relevant docs}} P@k$$

SLIDE 24

Mean Average Precision (MAP)

  • MAP evaluates the system for a given range of queries.
  • It summarizes the global system performance in one single value.
  • It is the mean of the average precision of a set of n queries:

[Figure: three ranked result lists, one per query, each yielding AP(q1), AP(q2) and AP(q3).]

$$MAP = \frac{AP(q_1) + AP(q_2) + AP(q_3) + \dots + AP(q_n)}{n}$$
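A minimal, self-contained sketch: MAP is just the arithmetic mean of the per-query AP values (the AP helper repeats the earlier sketch and the query data are invented):

```python
def average_precision(ranked_rel, total_relevant):
    """AP for one query, as sketched on the previous slides."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """MAP: arithmetic mean of the per-query AP values."""
    return sum(average_precision(r, t) for r, t in runs) / len(runs)

# Two invented queries: AP = 0.375 and AP = 1.0, so MAP = 0.6875.
print(mean_average_precision([([0, 1, 0, 1, 0, 1, 0, 0], 4),
                              ([1, 1, 0, 0, 0, 0, 0, 0], 2)]))
```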

SLIDE 25

Web scale evaluation

  • It is impossible to know all relevant documents.
  • It is too expensive or time-consuming.
  • DCG, BPref and Inferred AP are three measures for evaluating a system with incomplete ground-truth.
  • These metrics use the concept of pooled results.

  • E. Yilmaz and J. A. Aslam, Estimating average precision with incomplete and imperfect judgments, ACM CIKM 2006.
  • C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. ACM SIGIR 2004.
SLIDE 26

Results pooling

  • This technique is used when the dataset is too large to be completely examined.
  • Considering the results of 10 systems:
  • Examine the top 100 results of each system.
  • Label all of these documents according to their relevance.
  • Use the labeled results as ground-truth to evaluate all systems (a sketch of the pooling step follows below).
  • Drawback: recall, AP and MAP cannot be computed exactly, because relevant documents outside the pool are never judged.
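A minimal sketch of the pooling step above, assuming each system's run is a ranked list of document ids; the runs and the pool depth are illustrative, and the actual relevance labelling is done by human assessors:

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each system's ranked run."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])
    return pool  # these documents are then judged by human assessors

# Illustrative runs from three systems (normally 10 systems and depth 100).
runs = [["d1", "d5", "d9"], ["d5", "d2", "d7"], ["d3", "d1", "d2"]]
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd3', 'd5']
```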

SLIDE 27

DCG: Incomplete multi-level relevance

  • Useful when some documents are more relevant than others.
  • Documents need to have ground-truth with different levels of relevance.
  • A common metric is the Discounted Cumulative Gain:

  • K. Jarvelin, J. Kekalainen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems 20(4), 422–446 (2002).

$$DCG_n = \sum_{i=1}^{n} \frac{2^{rel_i} - 1}{\log_2(1 + i)}, \qquad rel_i = 0, 1, 2, 3, \dots$$

$$nDCG_n = \frac{DCG_n}{bestDCG_n}$$

where bestDCG_n is the DCG of the ideal (best possible) ordering of the results.
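A minimal sketch of DCG and nDCG over graded relevance labels, following the formula above (function names and the example grades are mine):

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked list of graded relevance labels (0, 1, 2, ...)."""
    return sum((2 ** rel - 1) / math.log2(1 + i) for i, rel in enumerate(rels, start=1))

def ndcg(rels):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same labels."""
    best = dcg(sorted(rels, reverse=True))
    return dcg(rels) / best if best > 0 else 0.0

print(ndcg([2, 0, 3, 1, 0]))  # example grades; 1.0 would mean the ideal ordering
```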

SLIDE 28

BPref: Incomplete binary relevance

  • When only incomplete binary relevance judgments are available, BPref is a popular metric:

$$BPref = \frac{1}{R} \sum_{d_r} \left(1 - \frac{N_{d_r}}{R}\right)$$

  • where R is the total number of relevant documents for a given query
  • d_r is a relevant document
  • N_{d_r} is the number of non-relevant documents ranked higher than d_r
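A minimal sketch of BPref as defined above, over a ranked list whose judgments are 1 (relevant), 0 (judged non-relevant) or None (unjudged, i.e. outside the pool); the data are invented:

```python
def bpref(ranked_judgments):
    """BPref over a ranked list of judgments: 1 = relevant, 0 = non-relevant, None = unjudged."""
    R = sum(1 for j in ranked_judgments if j == 1)
    if R == 0:
        return 0.0
    score, nonrel_above = 0.0, 0
    for j in ranked_judgments:
        if j == 0:
            nonrel_above += 1  # judged non-relevant documents seen so far
        elif j == 1:
            # Slide's formula; trec_eval additionally caps nonrel_above at R.
            score += 1 - nonrel_above / R
    return score / R

print(bpref([1, None, 0, 1, 0, None, 1]))  # unjudged documents are simply ignored
```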

SLIDE 29

Diversity and novelty

  • Diversity and novelty are difficult to evaluate.
  • There is no de facto standard method to measure them.
  • The goal is to measure how diverse and novel the information contained in the retrieved documents is.
  • The assessment focus is not at the level of whole documents, but of the information they contain.

SLIDE 30

Nuggets or information facts

  • A nugget is an information fact.
  • Documents contain many nuggets.
  • The same nugget can be present in many different documents.
  • The goal is to retrieve a ranked list with many different nuggets at the top of the list.
  • Repeated nuggets will have a decreasing importance.

SLIDE 31

The -nDCG metric for diversity and novelty

  • The relevance of a document is determined by its nuggets and by the nuggets that occurred previously in the ranked results.

  • A popular metric is the α-nDCG, where each document at position k is judged by its nuggets.
  • A typical setting is α = 0.5.
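As an illustration of the idea rather than the exact formula on the slide, the sketch below computes the novelty-discounted gain that α-nDCG assigns to each ranked document: every nugget contributes 1 the first time it appears and is discounted by (1 − α) for each earlier document that already contained it; these gains then replace the plain relevance grades in the DCG formula. All names and data are mine:

```python
def alpha_gains(ranked_docs_nuggets, alpha=0.5):
    """Novelty-discounted gain per rank: a repeated nugget contributes (1 - alpha)^(times seen before)."""
    seen = {}   # nugget -> number of earlier documents containing it
    gains = []
    for nuggets in ranked_docs_nuggets:
        gain = sum((1 - alpha) ** seen.get(n, 0) for n in nuggets)
        for n in nuggets:
            seen[n] = seen.get(n, 0) + 1
        gains.append(gain)
    return gains  # plug these gains into the DCG formula above

# Illustrative nugget sets per ranked document.
print(alpha_gains([{"fares", "routes"}, {"routes"}, {"reviews"}]))  # [2.0, 0.5, 1.0]
```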

SLIDE 32

Example

  • Top results for query “Norwegian Cruise Lines”
  • The relevance of each document is:
  • What would be the ideal ordering?

SLIDE 33

System quality and user utility

  • The discussed evaluation procedures only measure the system performance on a given task.
  • It can overfit.
  • It might be distant from what users expect.
  • Only real users can actually assess the system:
  • How expressive is its query language?
  • How large is its collection?
  • How effective are the results?
  • A/B testing:
  • Make a small variation of the system and direct a proportion of users to that system.
  • Evaluate the frequency with which users click on top results.

SLIDE 34

Qualitative discussion

  • Relevance depends on:
  • Task objective
  • User knowledge
  • Time
  • Not all people “see” the same
  • Binary relevance judgments
  • Multi-level relevance judgments
  • Ranked relevance judgments
  • Incomplete relevance judgments

The notion of relevance is a subjective concept. There is no relation between AP and user satisfaction.

SLIDE 35

Summary

  • Metrics for complete relevance judgments
  • Binary: Precision, Recall, F-measure, Average Precision, Mean AP
  • Ranked: Spearman, Kendall tau
  • Metrics for incomplete relevance judgments
  • Binary: BPref, Inferred AP
  • Multi-valued: Normalized DCG
  • Evaluation collections / resources
  • See TRECVID and ImageCLEF for multimedia datasets.
  • See TREC and CLEF forums for Web and large-scale datasets.
  • User search interaction, Geographic IR, Expert finding, Blog search, Plagiarism, ...
  • Use the trec_eval application to evaluate your system.

Chapter 8