Why recall matters
Stephen Robertson, Microsoft Research Cambridge
Traditional ideas
Assume binary relevance
Assume (unranked, exact-match) set retrieval
Note: although I will refer to metrics such as NDCG, which can deal with graded relevance, I will not discuss the issue further in the present talk.
Traditional ideas
Devices:
things you might do to improve results
Recall device:
something to increase/improve Recall
(that is, increase the size of the retrieved set by allowing the query to match more items)
Precision device:
similarly, something to improve precision
(that is, reduce the size of the retrieved set by making the query more specific)
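In set-retrieval terms the two measures are simple ratios over the retrieved set and the relevant set. A minimal sketch in Python (the document identifiers are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of an unranked retrieved set (binary relevance)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A recall device enlarges the retrieved set (more matches, so recall can
# only go up); a precision device shrinks it by making the query more specific.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
# p = 2/4 = 0.5, r = 2/3
```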
The inverse relationship
Recall devices reduce precision; precision devices reduce recall
Hence recall and precision are in some sense in opposition (the recall-precision curve)
The user should choose his/her emphasis
User orientation
High-recall user:
I want high recall; I’m not so interested in precision
High-precision user:
I want high precision; I’m not so interested in recall
Scoring and ranking
Replace set retrieval with a scoring function
measuring how well each document matches the query … and rank the results in descending score order
Now think of stepping down the ranked list as a recall device
… and stopping early as a precision device … leading to the usual recall-precision curve
As a simplification, think of ranking the entire corpus
Note that there are many other devices, which interact with the scoring in complex ways
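Stepping down the ranking can be made concrete: at each cut-off k we get one (recall, precision) point, and together these trace the recall-precision curve. A sketch with invented document identifiers; note that recall never decreases as k grows, while precision moves either way:

```python
def recall_precision_points(ranking, relevant):
    """(recall, precision) at each cut-off, stepping down a ranked list."""
    relevant = set(relevant)
    hits, points = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

pts = recall_precision_points(["d3", "d7", "d1", "d9", "d2"], {"d3", "d1", "d2"})
# recall climbs 1/3 -> 1/3 -> 2/3 -> 2/3 -> 1; precision oscillates
```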
Recall–Precision curve
Single topic curves
[Figure: single-topic recall-precision curves; both axes run 0 to 1]
Scoring and ranking
Note an asymmetry:
as you step down, recall must increase
(or at least not decrease)
… but precision may go either way
its tendency to decrease is not a logical property, but a statistical one
High-precision search
Assume that the user really only wants to see a small number of (highly) relevant items
extreme case: just one would suffice
metrics commonly used: Precision at 5, Mean Reciprocal Rank, NDCG@1, …
common view: recall is of no consequence
what the eye does not see…
web search generally thought of this way
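The metrics named above are easy to state for binary relevance; a hedged sketch (with binary relevance, NDCG@1 degenerates to whether the single top-ranked item is relevant):

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top k that is relevant (Precision at 5 with k=5)."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant item; the mean over queries gives MRR."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_1(ranking, relevant):
    """With binary relevance, NDCG@1 is 1 iff the top-ranked item is relevant."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0
```

None of these looks past the first few ranks, which is why, on this view, recall seems not to matter.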
Recall-oriented search
Some search tasks/environments are seen to be recall-oriented:
– E-discovery: documents required for disclosure in a legal case
– Prior-art patent search: looking for existing patents which might invalidate a new patent application
– Evidence-based medicine: assessing all the evidence on alternative approaches
But these are often thought of as exceptions, strange special cases
It’s the web that’s strange
Peculiarities of the (English) web:
– size
– variety of material
– variety of authorship
– lack of standardisation (of anything)
– linguistic variety
– variety of anchor text
– variety of quality
– variety of level
– scale of search activity
– monetisation of search engines
Example: enterprise search
Enterprise search environment
– much more limited
– much less variety
– various kinds of standardisation
– few searches
Even for high-precision search, need to think about recall issues
– e.g. using the right terminology
Recall
So for all non-web search…
some attention to recall is necessary
recall devices may be useful
But even in web search
query modifications and suggestions are often recall devices
e.g. spelling correction, singular/plural, related queries…
only of value if they increase recall
(that is, lead you to relevant items you would not otherwise have found)
some kinds of queries/web environments need particular attention to recall
e.g. minority languages
The recall-fallout curve
(Another way of thinking about the performance curve)
Signal detection systems
System is trying to tell us which items are relevant
Relevant item = signal; non-relevant item = noise
Recall is the true positive rate; fallout is the false positive rate
Operating characteristic (OC, ROC) curve: recall against fallout
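In these signal-detection terms, fallout differs from precision in that its denominator is the count of non-relevant items in the whole collection, not in the retrieved set. A sketch (the identifiers are invented):

```python
def recall_fallout(retrieved, relevant, collection):
    """Recall = true positive rate; fallout = false positive rate."""
    retrieved, relevant = set(retrieved), set(relevant)
    recall = len(retrieved & relevant) / len(relevant)
    non_relevant = set(collection) - relevant
    fallout = len(retrieved - relevant) / len(non_relevant)
    return recall, fallout

# Ten-document collection, two relevant items, three retrieved (one a hit):
collection = [f"d{i}" for i in range(10)]
r, f = recall_fallout({"d0", "d2", "d3"}, {"d0", "d1"}, collection)
# r = 1/2, f = 2/8 = 0.25
```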
The recall-fallout curve
[Figure: recall-fallout (ROC) curve]
As in IR, distinguish between rank order and decision point
– although the distinction is stronger in most IR contexts
– IR: the system only ranks; the user makes an on-the-fly decision about the stopping point
– Signal detection: the system normally has to have a set threshold for acceptance/rejection
Nevertheless, consider harder stopping points in IR
– e.g. for filtering, legal discovery
The recall-fallout curve
A single measure: area under ROC curve
= probability of pairwise success
– choose a random pair of signal–noise instances (one of each), measure the probability that the signal instance is ranked before the noise instance
Contrast Average Precision
= area under the recall-precision curve
can also be interpreted as a top-weighted probability of pairwise success
Recall/fallout graphs have little practical use in IR
– but the idea does provide useful insights
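Both pairwise interpretations above can be checked by brute force on a full ranking. A sketch, assuming (as in the talk's simplification) that the entire collection is ranked, so every relevant item appears in the list:

```python
def auc_pairwise(ranking, relevant):
    """Area under the ROC curve as a probability of pairwise success:
    pick one relevant (signal) and one non-relevant (noise) item at random;
    how often is the signal ranked above the noise?"""
    rel = [i for i, d in enumerate(ranking) if d in relevant]
    non = [i for i, d in enumerate(ranking) if d not in relevant]
    wins = sum(r < n for r in rel for n in non)
    return wins / (len(rel) * len(non))

def average_precision(ranking, relevant):
    """AP: mean of precision at the rank of each relevant item (top-weighted)."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

# AP weights early ranks more heavily than AUC does:
# for ["r1", "n1", "r2", "n2"], AUC = 3/4 but AP = (1 + 2/3)/2 = 5/6
```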
Single topic curves
[Figure: single-topic performance curves; both axes run 0 to 1]
A simplified view
Assume (for the purpose of this discussion):
– All queries look the same
– The entire collection is scored/ranked
– Evaluation metrics are continuous functions of the score
– The signal-to-noise ratio (generality) is reasonable
Normally get a smooth concave curve
The recall-fallout curve
[Figure: smooth concave recall-fallout curve under the simplified assumptions]
A touch of realism
Now where is precision?
Precision combines information from both noise and signal
– and thus may be seen as an overall measure
– but only for a single point on the graph
If we fix other things
– e.g. the generality of the query = ratio of signal to noise
… then we can see a form of precision on the graph:
The recall-fallout curve
[Figure: lines of equal precision overlaid on the recall-fallout graph]
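With generality g (the proportion of the collection that is relevant) held fixed, precision at any operating point is determined by recall R and fallout F: per collection item, the expected relevant retrieved is g·R and the expected non-relevant retrieved is (1−g)·F. A sketch of that relationship (the numbers are illustrative):

```python
def precision_from_operating_point(recall, fallout, generality):
    """Precision at a point on the recall-fallout graph, given generality
    g = proportion of the collection that is relevant."""
    signal = generality * recall        # relevant retrieved, per collection item
    noise = (1 - generality) * fallout  # non-relevant retrieved, per item
    return signal / (signal + noise)

# At low generality, even a good operating point yields modest precision:
# g = 0.01, R = 0.8, F = 0.1  ->  precision ~ 0.075
```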
Devices revisited
Recall device: be more inclusive
intention: increase recall
necessarily increases both recall and fallout
probably reduces precision, because of the curvature
Fallout device (what we called a "precision device"): be more selective
intention: reduce fallout
necessarily reduces both recall and fallout
should increase precision, because of the curvature
User orientation
High-recall user:
I want high recall; I’m less worried by high fallout
Low-fallout user (the "high-precision user"):
I want low fallout; I’m less interested in high recall
The challenge of recall
(Some) recall is necessary, even for precision!
Recall challenges:
– measuring/estimating recall
– discovering recall failures
– improving recall!
– providing the user with evidence about recall
– providing the user with guidance on how far to go (optimising the stopping point)