Evaluating search engines
CE-324: Modern Information Retrieval
Sharif University of Technology
M. Soleymani
Spring 2020
Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
} How fast does it index? E.g., number of documents/hour
} Support for incremental indexing
} How satisfied is each user with the obtained results?
} The most common proxy to measure human satisfaction is the relevance of the search results to the posed information need
} Should we use stop lists? Should we stem? Should we use synonyms?
1. A benchmark document collection
2. A benchmark suite of queries
3. An assessment of either relevant or non-relevant for each query-document pair
} Cost of getting these relevance judgments
} Crowdsourcing relevance judgments: hope that this is cheaper than hiring qualified assessors
} Query: pool cleaner
} Must be germane to docs available
} Must be representative of actual user needs
} Random query terms from the documents are generally not a good idea
} Sample from query logs if available
} Low query rates – not enough query logs
} Experts hand-craft “user needs”
} Or at least for the subset of docs that some system (participating in the evaluation) returned for that query
} Precision: fraction of retrieved docs that are relevant
} Recall: fraction of relevant docs that are retrieved
} Accuracy: the evaluation measure commonly used in machine learning classification
} Accuracy = (tp + tn) / (tp + fp + fn + tn)
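The relationship between these counts and the measures can be sketched in a few lines of Python; the function and variable names below are illustrative, not from any particular library:

    # Set-based measures from binary relevance judgments (illustrative sketch).
    # `retrieved` and `relevant` are sets of doc IDs; `collection_size` is the
    # total number of docs in the collection.
    def precision_recall_accuracy(retrieved, relevant, collection_size):
        tp = len(retrieved & relevant)        # relevant and retrieved
        fp = len(retrieved - relevant)        # retrieved but not relevant
        fn = len(relevant - retrieved)        # relevant but not retrieved
        tn = collection_size - tp - fp - fn   # neither retrieved nor relevant

        precision = tp / (tp + fp) if retrieved else 0.0
        recall = tp / (tp + fn) if relevant else 0.0
        accuracy = (tp + tn) / collection_size
        return precision, recall, accuracy

    # Toy example: 1,000,000 docs, 50 relevant, the system returns 40 docs of
    # which 30 are relevant: precision = 0.75, recall = 0.6, accuracy ≈ 0.99997.
    # Returning nothing at all would still give accuracy 0.99995, which is why
    # accuracy is a poor measure for IR.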
} The snoogle search engine always returns 0 results (“No results found”) for any query, yet it scores very high on accuracy
} Since there are many more non-relevant docs than relevant ones, a system tuned to maximize accuracy can do well by returning nothing
} You can get high recall (but low precision) by retrieving all docs for all queries
} In a good system, precision decreases as the number of docs retrieved (or recall) increases; this is not a theorem, but a result with strong empirical confirmation
} F measure: allows us to trade off precision against recall
} Weighted harmonic mean of P and R
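A small sketch of the standard F_beta formula (the function name is just illustrative); beta controls the weighting between P and R:

    def f_measure(precision, recall, beta=1.0):
        """Weighted harmonic mean of precision and recall.
        beta=1 gives the balanced F1 = 2PR/(P+R);
        beta>1 favors recall, beta<1 favors precision."""
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta ** 2
        return (b2 + 1) * precision * recall / (b2 * precision + recall)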
} Why not use a different mean of P and R? E.g., the arithmetic mean
} We want to punish really bad performance on either precision or recall; taking the minimum achieves this
} F (harmonic mean) is a kind of smooth minimum
Combined Measures
[Figure: minimum, maximum, arithmetic, geometric, and harmonic means of P and R as precision varies from 0 to 100%, with recall fixed at 70%]
} We can easily turn set measures into measures of ranked lists.
} Taking various numbers of top returned docs (recall levels)
} Sets of retrieved docs are given by the top k retrieved docs.
} Just compute the set measure for each “prefix”: the top 1, top 2, top 3, top 4, etc., results
} Doing this for precision and recall gives you a precision-recall curve
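A small sketch of this “prefix” computation, assuming a ranked list of doc IDs and a set of relevant doc IDs (names are illustrative):

    def precision_recall_points(ranked_docs, relevant):
        """Precision and recall after each prefix (top 1, top 2, ...) of a ranking."""
        points, hits = [], 0
        for k, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
            points.append((hits / k, hits / len(relevant)))  # (precision@k, recall@k)
        return points

    # E.g., relevant docs at ranks 1, 3, 5 give Prec@3 = 2/3, Prec@4 = 2/4,
    # Prec@5 = 3/5, matching the worked example later in these slides.
    print(precision_recall_points(["d3", "d7", "d1", "d9", "d2"], {"d3", "d1", "d2", "d8"}))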
} Precision-Recall curve
} Precision@K (P@K)
} Mean Average Precision (MAP)
} Mean Reciprocal Rank (MRR)
} Normalized Discounted Cumulative Gain (NDCG)
} Interpolated precision: take the maximum precision at any recall level at least as high, i.e., p_interp(r) = max over r' ≥ r of p(r')
} A precision-recall graph for a single query isn’t a very sensible thing to look at; we need to average performance over a whole set of queries
} Precision-recall curves only place some points on the graph
} How do you determine a value (interpolate) between the points?
} 11-point interpolated average precision
} Take the precision at 11 levels of recall, varying from 0 to 1 by tenths, using interpolation, and average them
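A sketch of the 11-point computation, reusing the (precision, recall) prefix points from the earlier sketch; interpolated precision at recall r is the maximum precision at any recall ≥ r:

    def eleven_point_interpolated_precision(pr_points):
        """pr_points: list of (precision, recall) pairs for one query's ranking."""
        interpolated = []
        for level in [i / 10 for i in range(11)]:      # recall levels 0.0, 0.1, ..., 1.0
            candidates = [p for p, r in pr_points if r >= level]
            interpolated.append(max(candidates) if candidates else 0.0)
        return interpolated  # average these 11 values, then average over all queries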
} 11pt precision from TREC 8 (1999)
[Figure: precision-recall curve; x-axis recall (0–1), y-axis precision (0–1)]
} Set a rank threshold K
} Compute the fraction of relevant docs in the top K
} Ignores documents ranked lower than K
} Perhaps appropriate for most of web search: people want good matches on the first one or two results pages
} But: averages badly and has an arbitrary parameter K
} Prec@3 of 2/3
} Prec@4 of 2/4
} Prec@5 of 3/5
} Consider the rank positions K1, K2, … KR of the R relevant docs
} Average precision: the average of the precision values obtained for the top k docs, each time a relevant doc is retrieved
} MAP for a query collection is the arithmetic average of average precision over queries
} Macro-averaging: each query counts equally
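A minimal sketch of average precision and MAP under this convention (relevant docs that are never retrieved contribute a precision of 0); names are illustrative:

    def average_precision(ranked_docs, relevant):
        """Average of Precision@K over the ranks K at which relevant docs appear."""
        hits, precisions = 0, []
        for k, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                precisions.append(hits / k)
        return sum(precisions) / len(relevant) if relevant else 0.0

    def mean_average_precision(runs):
        """runs: list of (ranked_docs, relevant_set) pairs, one per query (macro-average)."""
        return sum(average_precision(docs, rel) for docs, rel in runs) / len(runs)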
MAP(Q) = (1/|Q|) · Σ_j (1/m_j) · Σ_k Precision(R_jk), where m_j is the number of relevant docs for query j and R_jk is the set of top-ranked results down to the k-th relevant doc
} MAP assumes the user is interested in finding many relevant docs for each query
} MAP requires many relevance judgments in the text collection
} Scenarios with a single relevant doc: known-item search, navigational queries, looking for a fact
} measures a user’s effort
} Consider the rank position K of the first relevant doc (could be the only clicked doc)
} Reciprocal Rank score = 1/K; MRR is the mean RR across multiple queries
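A corresponding sketch for MRR (again with illustrative names):

    def mean_reciprocal_rank(runs):
        """runs: list of (ranked_docs, relevant_set) pairs;
        RR = 1 / rank of the first relevant doc (0 if none is retrieved)."""
        scores = []
        for ranked_docs, relevant in runs:
            rank = next((k for k, d in enumerate(ranked_docs, start=1) if d in relevant), None)
            scores.append(1.0 / rank if rank else 0.0)
        return sum(scores) / len(scores)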
[Example: graded relevance judgments, e.g., Good / Fair]
} More than two relevance levels (i.e., not just relevant and non-relevant)
} With base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
} We may use any base for the logarithm
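Written out (standard definitions, stated in the convention these slides use, where the doc at rank 1 is not discounted and rel_i is the graded relevance of the doc at rank i; IDCG_p is the DCG of the ideal ordering, introduced on the following slides):

    \mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}
    \qquad
    \mathrm{NDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}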
} Used by some web search companies
} Emphasis on retrieving highly relevant documents
} Ideal ranking: first returns docs with the highest relevance level, then the next highest relevance level, and so on
i   Ground Truth       Ranking Function 1    Ranking Function 2
    Document   r_i     Document   r_i        Document   r_i
1   d4         2       d3         2          d3         2
2   d3         2       d4         2          d2         1
3   d2         1       d2         1          d4         2
4   d1         0       d1         0          d1         0
NDCG_GT = 1.00         NDCG_RF1 = 1.00       NDCG_RF2 = 0.9203
DCG_GT  = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF1 = 2 + 2/log2(2) + 1/log2(3) + 0/log2(4) = 4.6309
DCG_RF2 = 2 + 1/log2(2) + 2/log2(3) + 0/log2(4) = 4.2619
MaxDCG  = DCG_GT = 4.6309
NDCG_RF1 = 4.6309 / 4.6309 = 1.00
NDCG_RF2 = 4.2619 / 4.6309 = 0.9203
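The computation above can be checked with a few lines of Python (the relevance gains come straight from the table: d4 = 2, d3 = 2, d2 = 1, d1 = 0):

    import math

    def dcg(gains):
        """DCG with the discount used above: rank 1 undiscounted, rank i >= 2 divided by log2(i)."""
        return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

    ideal = [2, 2, 1, 0]   # ground-truth ordering d4, d3, d2, d1
    rf1   = [2, 2, 1, 0]   # ranking function 1:   d3, d4, d2, d1
    rf2   = [2, 1, 2, 0]   # ranking function 2:   d3, d2, d4, d1

    max_dcg = dcg(ideal)                   # 4.6309
    print(dcg(rf1) / max_dcg)              # 1.0
    print(round(dcg(rf2) / max_dcg, 4))    # 0.9203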
} Ground-truth relevance levels of the docs, sorted (ideal order): 3, 3, 3, 2, 2, 2, 1
} Ideal (maximum) DCG at ranks 1–7: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88
} DCG of the evaluated ranking at ranks 1–7: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99
} NDCG (= DCG / ideal DCG) at successive ranks: 1, 0.83, 0.87, 0.76, 0.71, 0.69
} NDCG ≤ 1 at any rank position
} Between raters
} Over time
} Rating vis-à-vis query, vs underlying need
Adapted from: CIKM'09 Tutorial, Hong Kong, China
} Adapt ranking to user clicks?
} e.g., navigational and informational queries
[Figure: click percentage by rank position, for normal vs. reversed impression orderings]
} Doesn’t mean that DocA is relevant to the query
} So most users use the old system, while a small proportion of traffic is diverted to the new ranking being evaluated
} Start randomly with ranking A or ranking B to even out the presentation bias
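A minimal sketch of the interleaving idea (in the spirit of balanced interleaving; an illustrative simplification, not the exact production algorithm):

    import random

    def interleave(ranking_a, ranking_b):
        """Merge two rankings for presentation. A coin flip decides which ranking
        contributes the first result (to even out presentation bias); then the two
        rankings alternate, skipping docs that are already in the merged list."""
        a_first = random.random() < 0.5
        lists = [ranking_a, ranking_b] if a_first else [ranking_b, ranking_a]
        merged, seen, idx, turn = [], set(), [0, 0], 0
        while idx[0] < len(lists[0]) or idx[1] < len(lists[1]):
            current = lists[turn]
            while idx[turn] < len(current) and current[idx[turn]] in seen:
                idx[turn] += 1                      # skip docs already shown
            if idx[turn] < len(current):
                doc = current[idx[turn]]
                merged.append(doc)
                seen.add(doc)
                idx[turn] += 1
            turn = 1 - turn                         # other ranking's turn
        return merged, a_first   # clicks can then be credited to ranking A or B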
} Document collection
} Query set
} Assessment methodology
} These
} Different
} Usually, a list of doc titles plus a short summary (snippet)
} Or field and zone
} This description is crucial.
} User can identify good/relevant hits based on the description.
} Static: the same regardless of the query
} Dynamic: customized to the query; attempts to explain why the doc was retrieved for the query at hand
} Simplest heuristic: e.g., title & the first 50 words of the doc
} Summary cached at indexing time
} More sophisticated: extract from each doc a set of “key” sentences
} Simple NLP heuristics score each sentence, and the summary is made up of the top-scoring sentences
} Most sophisticated: NLP used to synthesize a summary
} Seldom used in IR; cf. text summarization work
} “KWIC” snippets: Keyword in Context
} Requires a lot of disk space to save the docs, or at least their prefixes
} However, they can greatly improve the usability of IR systems.
} Requires fast window lookup in a doc cache
} Use various features such as window width, position, etc.
} Combine features through a scoring function
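A toy sketch of such a scoring function over candidate windows of the cached doc text (the feature set and names are purely illustrative):

    def best_snippet_window(doc_terms, query_terms, width=40):
        """Slide a fixed-width window over the cached doc and pick the window that
        contains the most distinct query terms, preferring earlier windows on ties.
        A stand-in for the richer feature-based scoring functions mentioned above."""
        query_terms = set(query_terms)
        best_start, best_score = 0, -1
        for start in range(max(1, len(doc_terms) - width + 1)):
            window = doc_terms[start:start + width]
            score = len(query_terms & set(window))   # distinct query terms in the window
            if score > best_score:
                best_start, best_score = start, score
        return doc_terms[best_start:best_start + width]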
} Pairwise comparisons rather than binary relevance assessments
} User’s need is likely satisfied on www.united.com
} Quicklinks provide navigational cues on that home page