SLIDE 1

Why recall matters

Stephen Robertson

Microsoft Research Cambridge

SLIDE 2

Traditional ideas

  • Assume binary relevance
  • … and also assume (unranked, exact match) set retrieval

Note: although I will refer to metrics such as NDCG, which can deal with graded relevance, I will not discuss the issue further in the present talk.

SLIDE 3

Traditional ideas

Devices: things you might do to improve results

Recall device: something to increase/improve recall
(that is, increase the size of the retrieved set by allowing the query to match more items)

Precision device: similarly, something to improve precision
(that is, reduce the size of the retrieved set by making the query more specific)
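
To make the set-retrieval picture concrete, a minimal Python sketch (not from the talk: the toy corpus, the judgements and the any/all matching are invented for illustration). Matching any query term plays the role of a recall device; requiring all terms plays the role of a precision device.

    # Minimal set-retrieval sketch over an invented toy corpus.
    docs = {
        1: "retrieval of ranked documents",
        2: "document retrieval systems",
        3: "cooking with garlic",
        4: "evaluation of retrieval systems",
        5: "systems for search",
    }
    relevant = {2, 4, 5}  # invented relevance judgements for the query

    def retrieve(query_terms, mode):
        # mode="any": match any term (recall device, larger set)
        # mode="all": require all terms (precision device, smaller set)
        match = any if mode == "any" else all
        return {d for d, text in docs.items()
                if match(t in text.split() for t in query_terms)}

    def precision_recall(retrieved):
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 1.0
        recall = hits / len(relevant)
        return precision, recall

    query = ["retrieval", "systems"]
    print(precision_recall(retrieve(query, "all")))  # (1.0, 0.67)
    print(precision_recall(retrieve(query, "any")))  # (0.75, 1.0)

On this toy data the precision device gives (P, R) = (1.0, 0.67) and the recall device (0.75, 1.0): broadening buys recall at the cost of precision, which is exactly the inverse relationship of the next slide.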

SLIDE 4

The inverse relationship

Recall devices reduce precision; precision devices reduce recall.
Hence recall and precision are in some sense in opposition (the recall/precision curve).
The user should choose his/her emphasis.

SLIDE 5

User orientation

High-recall user:

I want high recall; I’m not so interested in precision

High-precision user:

I want high precision; I’m not so interested in recall

SLIDE 6

Scoring and ranking

Replace set retrieval with a scoring function measuring how well each document matches the query… and rank the results in descending score order.

Now think of stepping down the ranked list as a recall device… and stopping early as a precision device… leading to the usual recall–precision curve.

As a simplification, think of ranking the entire corpus. Note that there are many other devices, which interact with the scoring in complex ways.
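
A sketch of the same idea (the ranked list of judgements is invented): walking down the ranking and recording precision and recall at every cutoff yields exactly the points of the recall–precision curve.

    # Precision and recall at each rank cutoff of one ranked list.
    # rels[i] is True if the document at rank i+1 is relevant (invented).
    rels = [True, False, True, True, False, False, True, False]
    total_relevant = sum(rels)

    hits = 0
    for k, rel in enumerate(rels, start=1):
        hits += rel
        print(f"rank {k}: "
              f"precision={hits / k:.2f} "            # may rise or fall
              f"recall={hits / total_relevant:.2f}")  # never decreases

The output also previews the asymmetry discussed below: the recall column never decreases as you step down, while the precision column moves both ways.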

SLIDE 7

Recall–Precision curve

SLIDE 8

Single topic curves

[Figure: recall–precision curves for individual topics; axes marked 0.1–1]

SLIDE 9

Scoring and ranking

Note an asymmetry:

as you step down, recall must increase

(or at least not decrease)

… but precision may go either way

its tendency to decrease is not a logical property, but a statistical one

SLIDE 10

High-precision search

Assume that the user really only wants to see a small number of (highly) relevant items

  • extreme case: just one would suffice
  • metrics commonly used: Precision at 5, Mean Reciprocal Rank, NDCG@1, …
  • common view: recall is of no consequence
    (what the eye does not see…)

Web search is generally thought of this way.
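
For concreteness, a sketch of these three metrics for a single ranked list (the graded judgements are invented; MRR is the mean of reciprocal rank over many topics, so only the single-topic reciprocal rank is shown, and the DCG formulation is one common variant):

    import math

    # rels[i] = graded relevance of the document at rank i+1 (invented).
    rels = [0, 2, 1, 0, 3]

    def precision_at(k, rels):
        # Fraction of the top k that is relevant (binary: grade > 0).
        return sum(r > 0 for r in rels[:k]) / k

    def reciprocal_rank(rels):
        # 1/rank of the first relevant document; 0 if none is retrieved.
        for i, r in enumerate(rels, start=1):
            if r > 0:
                return 1 / i
        return 0.0

    def ndcg_at(k, rels):
        # DCG@k normalised by the ideal DCG@k; handles graded relevance.
        dcg = sum(r / math.log2(i + 1)
                  for i, r in enumerate(rels[:k], start=1))
        ideal = sorted(rels, reverse=True)
        idcg = sum(r / math.log2(i + 1)
                   for i, r in enumerate(ideal[:k], start=1))
        return dcg / idcg if idcg else 0.0

    print(precision_at(5, rels))   # 0.6
    print(reciprocal_rank(rels))   # 0.5
    print(ndcg_at(1, rels))        # 0.0 (top document is non-relevant)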

SLIDE 11

Recall-oriented search

Some search tasks/environments are seen to be recall-oriented:

– E-discovery: documents required for disclosure in a legal case
– Prior-art patent search: looking for existing patents which might invalidate a new patent application
– Evidence-based medicine: assessing all the evidence on alternative approaches

But these are often thought of as exceptions, strange special cases.

SLIDE 12

It’s the web that’s strange

Peculiarities of the (English) web:

size

variety of material

variety of authorship

lack of standardisation (of anything)

linguistic variety

variety of anchor text

variety of quality

variety of level

scale of search activity

monetisation of search engines

SLIDE 13

Example: enterprise search

Enterprise search environment

much more limited

much less variety

various kinds of standardisation

few searches

Even for high-precision search, need to think about recall issues

e.g. using the right terminology

SLIDE 14

Recall

So for all non-web search…

  • some attention to recall is necessary
  • recall devices may be useful

But even in web search…

  • query modifications and suggestions are often recall devices
    (e.g. spelling correction, singular/plural, related queries…)
  • these are only of value if they increase recall
    (that is, lead you to relevant items you would not otherwise have found)
  • some kinds of queries/web environments need particular attention to recall
    (e.g. minority languages)

SLIDE 15

The recall-fallout curve

(Another way of thinking about the performance curve)

Signal detection systems

System is trying to tell us which items are relevant:

  • relevant item = signal, non-relevant item = noise
  • recall is the true positive rate; fallout is the false positive rate
  • operating characteristic (OC, ROC) curve: recall against fallout
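
In the same sketch style (invented judgements), stepping down a ranking while counting true and false positives traces out the points of the ROC curve:

    # Recall (true positive rate) and fallout (false positive rate)
    # at each cutoff of a ranked list; invented relevance judgements.
    rels = [True, True, False, True, False, False]
    n_signal = sum(rels)
    n_noise = len(rels) - n_signal

    tp = fp = 0
    for k, rel in enumerate(rels, start=1):
        tp += rel
        fp += not rel
        print(f"rank {k}: recall={tp / n_signal:.2f} "
              f"fallout={fp / n_noise:.2f}")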

SLIDE 16

The recall-fallout curve

SLIDE 17

The recall-fallout curve

As in IR, distinguish between rank order and decision point

although the distinction is stronger in most IR contexts

IR: system only ranks, user makes on-the-fly decision about stopping point

Signal detection: system normally has to have a set threshold for acceptance/rejection

Nevertheless, consider harder stopping points in IR

e.g. for filtering, legal discovery

SLIDE 18

The recall-fallout curve

A single measure: area under the ROC curve = probability of pairwise success

– choose a random pair of signal–noise instances (one of each), and measure the probability that the signal instance is ranked before the noise instance

Contrast Average Precision = area under the recall–precision curve

– can also be interpreted as a top-weighted probability of pairwise success

Recall/fallout graphs have little practical use in IR, but the idea does provide useful insights.
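
A sketch of both interpretations on invented scores: counting concordant signal–noise pairs gives the area under the ROC curve directly, and Average Precision over the same ranking shows the contrast.

    from itertools import product

    # Invented scores for signal (relevant) and noise (non-relevant) items.
    signal = [2.1, 1.7, 1.4, 0.9]
    noise = [1.8, 1.2, 0.8, 0.5, 0.3]

    # Probability that a random signal instance outscores a random noise
    # instance (ties count half); this equals the area under the ROC curve.
    wins = sum((s > n) + 0.5 * (s == n) for s, n in product(signal, noise))
    auc = wins / (len(signal) * len(noise))
    print(f"AUC = {auc:.3f}")   # 0.800

    # Contrast: Average Precision over the corresponding ranking.
    ranked = sorted([(s, True) for s in signal] +
                    [(n, False) for n in noise], reverse=True)
    hits, ap = 0, 0.0
    for k, (_, rel) in enumerate(ranked, start=1):
        if rel:
            hits += 1
            ap += hits / k          # precision at each relevant rank
    ap /= len(signal)
    print(f"AP  = {ap:.3f}")    # 0.771, top-weighted

AP weights early pairwise successes more heavily than AUC does, which is the top-weighting mentioned above.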

SLIDE 19

Single topic curves

[Figure: single-topic curves, as before; axes marked 0.1–1]

SLIDE 20

A simplified view

Assume (for the purpose of this discussion):

  • All queries look the same
  • Entire collection is scored/ranked
  • Evaluation metrics are continuous functions of the score
  • Signal-to-noise ratio (generality) is reasonable

Normally get a smooth concave curve.
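
One way to see where the smooth concave shape comes from (a sketch, not from the talk): draw scores from two overlapping normal distributions, as in classical signal-detection theory, and sweep a threshold; the resulting recall/fallout pairs lie on a smooth concave curve.

    import random

    random.seed(0)
    # Toy signal-detection model: scores are normally distributed, with
    # signal (relevant) items scoring higher on average than noise.
    signal = [random.gauss(1.0, 1.0) for _ in range(2_000)]
    noise = [random.gauss(0.0, 1.0) for _ in range(20_000)]

    for threshold in [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]:
        recall = sum(s >= threshold for s in signal) / len(signal)
        fallout = sum(n >= threshold for n in noise) / len(noise)
        print(f"threshold {threshold:+.1f}: "
              f"recall={recall:.2f} fallout={fallout:.2f}")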

SLIDE 21

The recall-fallout curve

SLIDE 22

The recall-fallout curve

SLIDE 23

A touch of realism

SLIDE 24

Now where is precision?

Precision combines information from both noise and signal:

  • and thus may be seen as an overall measure
  • but only for a single point on the graph

If we fix other things, e.g. the generality of the query (= ratio of signal to noise), then we can see a form of precision on the graph:
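
In symbols (notation mine, not the slides'): writing g for generality, R for recall and F for fallout, precision is P = gR / (gR + (1-g)F); fixing P and g makes R/F constant, so an iso-precision contour is a straight line through the origin of the recall–fallout graph. A small sketch:

    def precision(recall, fallout, generality):
        # P = g*R / (g*R + (1-g)*F), straight from the definitions:
        # numerator counts retrieved signal, denominator all retrieved.
        g = generality
        signal_retrieved = g * recall
        noise_retrieved = (1 - g) * fallout
        return signal_retrieved / (signal_retrieved + noise_retrieved)

    print(precision(0.8, 0.1, 0.01))  # ~0.075

The example also shows why low-generality queries force fallout to be tiny: with g = 0.01, even a fallout of 0.1 drags precision below 0.08 despite recall of 0.8.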

SLIDE 25

The recall-fallout curve

SLIDE 26

The recall-fallout curve

SLIDE 27

Devices revisited

Recall device: be more inclusive

  • intention: increase recall
  • necessarily increases both recall and fallout
  • probably reduces precision, because of the curvature

Fallout device (the old “precision device”): be more selective

  • intention: reduce fallout
  • necessarily reduces both recall and fallout
  • should increase precision, because of the curvature

SLIDE 28

User orientation

High-recall user:

I want high recall; I’m less worried by high fallout

Low-fallout user (the old “high-precision user”):

I want low fallout; I’m less interested in high recall

SLIDE 29

The challenge of recall

(Some) recall is necessary, even for precision!

Recall challenges:

  • measuring/estimating recall (see the sketch below)
  • discovering recall failures
  • improving recall!
  • providing the user with evidence about recall
  • providing the user with guidance on how far to go (optimising stopping point)
  • predicting recall-at-a-stopping-point
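
On the first challenge, measuring/estimating recall: the denominator (the total number of relevant items) is unknown. One common way round this, sketched here on invented data (the collection, judgements and sample sizes are all hypothetical), is to assess a uniform random sample of the collection and estimate that total from the sample.

    import random

    random.seed(1)
    # Invented ground truth, unknown to the searcher: which of 100,000
    # documents are relevant, and which set the system retrieved.
    collection = range(100_000)
    truly_relevant = set(random.sample(collection, 500))
    retrieved = set(random.sample(collection, 2_000)) | set(
        random.sample(sorted(truly_relevant), 300))

    # Assess a uniform random sample of the whole collection to estimate
    # the total number of relevant documents (the unknown denominator).
    sample = random.sample(collection, 5_000)
    rel_rate = sum(d in truly_relevant for d in sample) / len(sample)
    est_total_relevant = rel_rate * len(collection)

    # The numerator only needs assessment of the retrieved set itself.
    est_recall = len(retrieved & truly_relevant) / est_total_relevant
    true_recall = len(retrieved & truly_relevant) / len(truly_relevant)
    print(f"estimated recall={est_recall:.2f} "
          f"true recall={true_recall:.2f}")

With low generality the sample contains few relevant documents, so the estimate is noisy; this is precisely why measuring recall is listed as a challenge.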