Why recall matters
Stephen Robertson, Microsoft Research Cambridge
Traditional ideas
Assume binary relevance
Assume (unranked, exact-match) set retrieval
Note: although I will refer to metrics such as NDCG, which can deal with graded relevance, I will not discuss the issue further in the present talk.
Traditional ideas
Devices:
things you might do to improve results
Recall device:
something to increase/improve Recall
(that is, increase the size of the retrieved set by allowing the query to match more items)
Precision device:
similarly, something to improve precision
(that is, reduce the size of the retrieved set by making the query more specific)
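In set-retrieval terms the two measures are simple ratios over the retrieved set and the relevant set. A minimal sketch in Python (the document identifiers are invented for illustration):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of an unranked retrieved set (binary relevance)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A recall device enlarges the retrieved set (more matches, so recall can
# only go up); a precision device shrinks it by making the query more specific.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
# p = 2/4 = 0.5, r = 2/3
```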
The inverse relationship
Recall devices reduce precision; precision devices reduce recall
Hence recall and precision are in some sense in opposition (the recall-precision curve)
The user should choose his/her emphasis
User orientation
High-recall user:
I want high recall; I’m not so interested in precision
High-precision user:
I want high precision; I’m not so interested in recall
Scoring and ranking
Replace set retrieval with a scoring function
measuring how well each document matches the query … and rank the results in descending score order
Now think of stepping down the ranked list as a recall device
… and stopping early as a precision device … leading to the usual recall-precision curve
As a simplification, think of ranking the entire corpus
Note that there are many other devices, which interact with the scoring in complex ways
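Stepping down the ranking can be made concrete: at each cut-off k we get one (recall, precision) point, and together these trace the recall-precision curve. A sketch with invented document identifiers; note that recall never decreases as k grows, while precision moves either way:

```python
def recall_precision_points(ranking, relevant):
    """(recall, precision) at each cut-off, stepping down a ranked list."""
    relevant = set(relevant)
    hits, points = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

pts = recall_precision_points(["d3", "d7", "d1", "d9", "d2"], {"d3", "d1", "d2"})
# recall climbs 1/3 -> 1/3 -> 2/3 -> 2/3 -> 1; precision oscillates
```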
Recall–Precision curve
Single topic curves
[Figure: single-topic recall-precision curves; both axes run 0 to 1]
Scoring and ranking
Note an asymmetry:
as you step down, recall must increase
(or at least not decrease)
… but precision may go either way
its tendency to decrease is not a logical property, but a statistical one
High-precision search
Assume that the user really only wants to see a small number of (highly) relevant items
extreme case: just one would suffice
metrics commonly used: Precision at 5, Mean Reciprocal Rank, NDCG@1, …
common view: recall is of no consequence
what the eye does not see…
web search generally thought of this way
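The metrics named above are easy to state for binary relevance; a hedged sketch (with binary relevance, NDCG@1 degenerates to whether the single top-ranked item is relevant):

```python
def precision_at_k(ranking, relevant, k=5):
    """Fraction of the top k that is relevant (Precision at 5 with k=5)."""
    return sum(doc in relevant for doc in ranking[:k]) / k

def reciprocal_rank(ranking, relevant):
    """1/rank of the first relevant item; the mean over queries gives MRR."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

def ndcg_at_1(ranking, relevant):
    """With binary relevance, NDCG@1 is 1 iff the top-ranked item is relevant."""
    return 1.0 if ranking and ranking[0] in relevant else 0.0
```

None of these looks past the first few ranks, which is why, on this view, recall seems not to matter.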
Recall-oriented search
Some search tasks/environments are seen to be recall-oriented:
– E-discovery: documents required for disclosure in a legal case
– Prior-art patent search: looking for existing patents which might invalidate a new patent application
– Evidence-based medicine: assessing all the evidence on alternative approaches
But these are often thought of as exceptions, strange special cases
It’s the web that’s strange
Peculiarities of the (English) web:
– size
– variety of material
– variety of authorship
– lack of standardisation (of anything)
– linguistic variety
– variety of anchor text
– variety of quality
– variety of level
– scale of search activity
– monetisation of search engines
Example: enterprise search
Enterprise search environment
– much more limited
– much less variety
– various kinds of standardisation
– few searches
Even for high-precision search, need to think about recall issues
– e.g. using the right terminology
Recall
So for all non-web search…
some attention to recall is necessary
recall devices may be useful
But even in web search
query modifications and suggestions are often recall devices
e.g. spelling correction, singular/plural, related queries…
only of value if they increase recall
(that is, lead you to relevant items you would not otherwise have found)
some kinds of queries/web environments need particular attention to recall
e.g. minority languages
The recall-fallout curve
(Another way of thinking about the performance curve)
Signal detection systems
System is trying to tell us which items are relevant
Relevant item = signal; non-relevant item = noise
Recall is the true positive rate; fallout is the false positive rate
Operating characteristic (OC, ROC) curve: recall against fallout
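In these signal-detection terms, fallout differs from precision in that its denominator is the count of non-relevant items in the whole collection, not in the retrieved set. A sketch (the identifiers are invented):

```python
def recall_fallout(retrieved, relevant, collection):
    """Recall = true positive rate; fallout = false positive rate."""
    retrieved, relevant = set(retrieved), set(relevant)
    recall = len(retrieved & relevant) / len(relevant)
    non_relevant = set(collection) - relevant
    fallout = len(retrieved - relevant) / len(non_relevant)
    return recall, fallout

# Ten-document collection, two relevant items, three retrieved (one a hit):
collection = [f"d{i}" for i in range(10)]
r, f = recall_fallout({"d0", "d2", "d3"}, {"d0", "d1"}, collection)
# r = 1/2, f = 2/8 = 0.25
```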
The recall-fallout curve
[Figure: recall-fallout (ROC) curve]
As in IR, distinguish between rank order and decision point
– although the distinction is stronger in most IR contexts
– IR: the system only ranks; the user makes an on-the-fly decision about the stopping point
– Signal detection: the system normally has to have a set threshold for acceptance/rejection
Nevertheless, consider harder stopping points in IR
– e.g. for filtering, legal discovery
The recall-fallout curve
A single measure: area under ROC curve
= probability of pairwise success
– choose a random pair of signal–noise instances (one of each), measure the probability that the signal instance is ranked before the noise instance
Contrast Average Precision
= area under the recall-precision curve
can also be interpreted as a top-weighted probability of pairwise success
Recall/fallout graphs have little practical use in IR
– but the idea does provide useful insights
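Both pairwise interpretations above can be checked by brute force on a full ranking. A sketch, assuming (as in the talk's simplification) that the entire collection is ranked, so every relevant item appears in the list:

```python
def auc_pairwise(ranking, relevant):
    """Area under the ROC curve as a probability of pairwise success:
    pick one relevant (signal) and one non-relevant (noise) item at random;
    how often is the signal ranked above the noise?"""
    rel = [i for i, d in enumerate(ranking) if d in relevant]
    non = [i for i, d in enumerate(ranking) if d not in relevant]
    wins = sum(r < n for r in rel for n in non)
    return wins / (len(rel) * len(non))

def average_precision(ranking, relevant):
    """AP: mean of precision at the rank of each relevant item (top-weighted)."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

# AP weights early ranks more heavily than AUC does:
# for ["r1", "n1", "r2", "n2"], AUC = 3/4 but AP = (1 + 2/3)/2 = 5/6
```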
Single topic curves
[Figure: single-topic performance curves; both axes run 0 to 1]
A simplified view
Assume (for the purpose of this discussion):
– All queries look the same
– The entire collection is scored/ranked
– Evaluation metrics are continuous functions of the score
– The signal-to-noise ratio (generality) is reasonable
Normally get a smooth concave curve
The recall-fallout curve
[Figure: smooth concave recall-fallout curve under the simplified assumptions]
A touch of realism
Now where is precision?
Precision combines information from both noise and signal
– and thus may be seen as an overall measure
– but only for a single point on the graph
If we fix other things
– e.g. the generality of the query = ratio of signal to noise
… then we can see a form of precision on the graph:
The recall-fallout curve
[Figure: lines of equal precision overlaid on the recall-fallout graph]
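With generality g (the proportion of the collection that is relevant) held fixed, precision at any operating point is determined by recall R and fallout F: per collection item, the expected relevant retrieved is g·R and the expected non-relevant retrieved is (1−g)·F. A sketch of that relationship (the numbers are illustrative):

```python
def precision_from_operating_point(recall, fallout, generality):
    """Precision at a point on the recall-fallout graph, given generality
    g = proportion of the collection that is relevant."""
    signal = generality * recall        # relevant retrieved, per collection item
    noise = (1 - generality) * fallout  # non-relevant retrieved, per item
    return signal / (signal + noise)

# At low generality, even a good operating point yields modest precision:
# g = 0.01, R = 0.8, F = 0.1  ->  precision ~ 0.075
```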
Devices revisited
Recall device: be more inclusive
intention: increase recall
necessarily increases both recall and fallout
probably reduces precision, because of the curvature
Fallout device (what we called a "precision device"): be more selective
intention: reduce fallout
necessarily reduces both recall and fallout
should increase precision, because of the curvature
User orientation
High-recall user:
I want high recall; I’m less worried by high fallout
Low-fallout user (the "high-precision user"):
I want low fallout; I’m less interested in high recall
The challenge of recall
(Some) recall is necessary, even for precision!
Recall challenges:
– measuring/estimating recall
– discovering recall failures
– improving recall!
– providing the user with evidence about recall
– providing the user with guidance on how far to go (optimising the stopping point)