
Evaluating ADM on a Three-Level Relevance Scale Document Set from NTCIR

Vincenzo Della Mea, Luca Di Gaspero, Stefano Mizzaro*

Department of Mathematics and Computer Science, University of Udine
http://www.dimi.uniud.it/~mizzaro
mizzaro@dimi.uniud.it
NTCIR-4, Tokyo, 2 June 2004

Evaluating ADM on a Four-Level Relevance Scale Document Set from NTCIR

Vincenzo Della Mea, Luca Di Gaspero, Stefano Mizzaro*

Department of Mathematics and Computer Science, University of Udine
http://www.dimi.uniud.it/~mizzaro
mizzaro@dimi.uniud.it
NTCIR-4, Tokyo, 2 June 2004

  • S. Mizzaro - ADM

3

The idea

ADM: an IR effectiveness measure based on continuous relevance

Relevance:
  • Binary {0, 1}
  • Categories {low, medium, high}
  • Continuous [0..1]

Retrieval is graded too (Boolean, vector space, …)

  • V. Della Mea, S. Mizzaro (2004). Measuring Retrieval Effectiveness: A New Proposal and a First Experimental Validation. JASIST, 55(6):530-543


Outline

  • Definition
    • The URS/SRS plane
    • ADM (Average Distance Measure)
    • Examples
  • Conceptual analysis
    • Problems with precision and recall
  • Experimental analysis
    • TREC data
      • ADM is as good as TREC measures
      • ADM is effective with less data than TREC measures
    • NTCIR data: preliminary results


From binary relevance…

[Diagram: the documents database split into retrieved/not retrieved vs. relevant/not relevant quadrants; after Salton & McGill, '84]


… to continuous relevance

[Diagram: the same plane with continuous axes, from "less" to "more" retrieved and from "less" to "more" relevant]


The URS/SRS plane

[Figure: the URS/SRS plane. x-axis: SRS (System Relevance Score), y-axis: URS (User Relevance Score), both in [0..1]; documents α, β, γ, δ, u, s plotted as points]


SRS and URS

SRS (System Relevance Score): the relevance value given by the IRS
URS (User Relevance Score): the relevance value given by the user

Both are real numbers in the [0..1] range. The SRS is different from the RSV (Retrieval Status Value), which is insensitive to rank-preserving transformations: the SRS is an estimate of the probability of relevance.


A step backward: P & R

[Figure: the URS/SRS plane binarized at 0.5 into four quadrants: Retrieved & relevant, Nonretrieved & relevant, Retrieved & nonrelevant, Nonretrieved & nonrelevant]

P = RetRel / (RetRel + RetNRel)
R = RetRel / (RetRel + NRetRel)
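Under this binarization, P and R follow directly from the quadrant counts. A minimal sketch (the 0.5 threshold default and the example score lists are illustrative assumptions, not taken from the slides):

```python
def precision_recall(srs, urs, t=0.5):
    """Binarize continuous scores at threshold t, then compute P and R
    from the four quadrants of the URS/SRS plane."""
    ret_rel = sum(1 for s, u in zip(srs, urs) if s >= t and u >= t)
    ret_nrel = sum(1 for s, u in zip(srs, urs) if s >= t and u < t)
    nret_rel = sum(1 for s, u in zip(srs, urs) if s < t and u >= t)
    p = ret_rel / (ret_rel + ret_nrel) if (ret_rel + ret_nrel) else 0.0
    r = ret_rel / (ret_rel + nret_rel) if (ret_rel + nret_rel) else 0.0
    return p, r

# Illustrative scores: one true positive, one false positive, one miss
print(precision_recall([0.9, 0.6, 0.2], [1.0, 0.3, 0.8]))  # (0.5, 0.5)
```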


The “right” places…

[Figure: the same plane, with documents α, β, γ, δ, u, s in their "right" places]


ADM: Average Distance Measure



ADM: Average Distance Measure

ADM(D_q) = 1 - \frac{1}{|D_q|} \sum_{d_i \in D_q} \left| SRS_q(d_i) - URS_q(d_i) \right|

ADM for one query: 1 minus the average distance between SRS and URS over all the documents
ADM for one IRS: the average of the per-query ADM over the queries


An example

Docs | d1  | d2  | d3  | ADM
URS  | 0.8 | 0.4 | 0.1 |
IRS1 | 0.9 | 0.5 | 0.2 | 0.9
IRS2 | 1.0 | 0.6 | 0.3 | 0.8
IRS3 | 0.8 | 0.4 | 1.0 | 0.7

[Figure: the three IRSs' scores plotted against the URS on the URS/SRS plane]
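These ADM values can be reproduced directly from the definition; a minimal sketch in Python (function and variable names are mine):

```python
def adm(srs, urs):
    """ADM for one query: 1 minus the average |SRS - URS| over the documents."""
    assert len(srs) == len(urs) and srs
    return 1 - sum(abs(s - u) for s, u in zip(srs, urs)) / len(srs)

urs = [0.8, 0.4, 0.1]                        # user scores for d1, d2, d3
print(round(adm([0.9, 0.5, 0.2], urs), 3))   # IRS1 -> 0.9
print(round(adm([1.0, 0.6, 0.3], urs), 3))   # IRS2 -> 0.8
print(round(adm([0.8, 0.4, 1.0], urs), 3))   # IRS3 -> 0.7
```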


Outline

  • Definition
    • The URS/SRS plane
    • ADM (Average Distance Measure)
    • Examples
  • Conceptual analysis
    • Problems with precision and recall
  • Experimental analysis
    • TREC data
      • ADM is as good as TREC measures
      • ADM is effective with less data than TREC measures
    • NTCIR data: preliminary results


ADM vs. P & R

Precision and recall are:

  • Hyper-sensitive to the relevant/nonrelevant and retrieved/nonretrieved thresholds (0.49 and 0.51 are two very similar values, but the outcome is very different…)
  • Insensitive to variations within particular areas (0.99 and 0.51 are very different values, but the outcome is the same…)


Hyper-sensitiveness: 3 similar IRS

[Figure: three similar IRSs on the URS/SRS plane, with scores straddling the 0.49/0.51 boundary]

IRS  | P    | R   | E    | ADM
IRS1 | 0.67 | 1.0 | 0.84 | 0.83
IRS2 | 1.0  | 0.5 | 0.75 | 0.83
IRS3 | 0.5  | 0.5 | 0.5  | 0.826

P, R, E: unstable; ADM: stable


Insensitiveness: 2 different IRS

[Figure: two different IRSs on the URS/SRS plane]

IRS  | P | R | E | ADM
IRS1 | 1 | 1 | 1 | 1
IRS2 | 1 | 1 | 1 | 0.5

P, R, E: stable; ADM: unstable (it tells the two IRSs apart)


Problem: arbitrary & wrong thresholds

[Figure, left: the URS/SRS plane split at 0.5 into Retrieved & relevant, Nonretrieved & relevant, Retrieved & nonrelevant, Nonretrieved & nonrelevant. Figure, right: the same plane split at an arbitrary threshold t into Over Evaluated, Under Evaluated, and Correctly Evaluated regions]


ADM variants

ADM for precision and recall:
  • R: on the over-evaluated documents only
  • P: on the under-evaluated documents only

ADM with non-continuous SRSs and URSs

[Figure: the URS/SRS plane with the over- and under-evaluated regions]


What do we need for ADM?

Ideal situation: continuous SRS & URS
Worst situation: "binarized" ADM
  • All the documents in (0,0), (0,1), (1,0), (1,1)
  • Docs in (0,1) and (1,1) only: R
  • Docs in (1,0) and (1,1) only: P
Intermediate situations: "discrete" ADM
  • Categories, combinations, …

[Figures: five URS/SRS planes, from fully continuous to fully binarized scores]


Outline

  • Definition
    • The URS/SRS plane
    • ADM (Average Distance Measure)
    • Examples
  • Conceptual analysis
    • Problems with precision and recall
  • Experimental analysis
    • TREC data
      • ADM is as good as TREC measures
      • ADM is effective with less data than TREC measures
    • NTCIR data: preliminary results


ADM on TREC data

ADM variants (simplifying…): URSs are binary (either relevant or nonrelevant); SRSs are not reliable → we used the ranking:

Rank | 1st | 2nd   | 3rd   | 4th   | … | 999th | 1000th | 1001st | …
SRS  | 1.0 | 0.999 | 0.998 | 0.997 | … | 0.002 | 0.001  | 0.000  | …
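The table's rank-to-SRS mapping is a linear step of 0.001 per rank position; a small sketch of that mapping (the function name and the clamp at zero for very deep ranks are my assumptions):

```python
def rank_to_srs(rank, step=0.001):
    """Map a retrieval rank to a surrogate SRS: 1st -> 1.0, 2nd -> 0.999, ...
    Scores for ranks deeper than the table covers are clamped to 0.0."""
    return max(0.0, 1.0 - (rank - 1) * step)

print(rank_to_srs(1))    # 1.0
print(rank_to_srs(3))    # ~0.998
```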


ADM is as good as TREC measures

Kendall correlations:

        | ADM   | Rel-Ret | AvgPrec | R-Prec
ADM     | 1     |         |         |
Rel-Ret | 0.891 | 1       |         |
AvgPrec | 0.876 | 0.824   | 1       |
R-Prec  | 0.844 | 0.807   | 0.902   | 1

Correlations (graphically): [scatter plots of ADM' and ADM'' against Rel-Ret, AvgPrec, and R-Prec]


ADM is effective with less data than TREC measures

Correlations between "global" ADM (on the TREC pool docs.) and ADM on subsets:

Set (Ret, Rel, topics) | N. docs (approx.) | ADM
(100%, 100%, 100%)     | 53000             | 1
(100%, 100%, 50%)      | 26000             | 0.852
(50%, 50%, 100%)       | 26000             | 0.910
(10%, 10%, 100%)       | 5000              | 0.802
(50%, 50%, 50%)        | 13000             | 0.807
(100%, 0%, 100%)       | 50000             | 0.935


ADM on NTCIR-4 data

PRELIMINARY RESULTS!

URS: 4 categories → 4 values (…)
SRS:
  • Continuous scores → linear normalization into SRSs
  • Rank, as in TREC
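The linear normalization of continuous system scores into [0..1] SRSs can be sketched as a min-max rescaling (whether the authors used exactly this form is my assumption):

```python
def normalize(scores):
    """Min-max rescale raw system scores into the [0..1] SRS range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:                      # degenerate case: all scores equal
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

print(normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```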


URS and SRS distributions: in theory…

[Chart: theoretical distribution of the number of docs over the URS categories S, A, B, C]


…and in practice (good)…


…and bad


Some results: low correlations…

No correlation between ADM and the standard measures:

  • Standard measures are not sensitive to how well an IRS approximates the URS distribution
  • A good IRS according to the standard measures = a good rank
  • A good IRS according to ADM = a good approximation of the URS distribution shape


…and some high correlations

Rank-based ADM on the first N retrieved documents:

N       | 5     | 10    | 20    | 50
AvgPrec | 0.747 | 0.792 | 0.8   | 0.788
R-Prec  | 0.755 | 0.802 | 0.816 | 0.799
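Restricting the measure to the first N retrieved documents can be sketched as computing ADM only over the top of the ranking (a hedged reading of "rank-based ADM on the first N retrieved documents"; names and data are mine):

```python
def adm_at_n(ranking, srs, urs, n):
    """ADM computed only on the first n documents of the ranking.
    `ranking` lists doc ids in retrieval order; srs/urs map id -> score."""
    top = ranking[:n]
    return 1 - sum(abs(srs[d] - urs[d]) for d in top) / len(top)

srs = {"d1": 0.9, "d2": 0.5, "d3": 0.2}   # illustrative system scores
urs = {"d1": 0.8, "d2": 0.4, "d3": 0.1}   # illustrative user scores
print(round(adm_at_n(["d1", "d2", "d3"], srs, urs, 2), 3))  # 0.9
```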


Summary

  • Definition
    • The URS/SRS plane
    • ADM (Average Distance Measure)
    • Examples
  • Conceptual analysis
    • Problems with precision and recall
  • Experimental analysis
    • TREC data
      • ADM is as good as TREC measures
      • ADM is effective with less data than TREC measures
    • NTCIR data: preliminary results


Future work

Carefully analyze NTCIR-4 data

A proposal: IRSs participating in the next NTCIR-5 could be evaluated by ADM too
  • SRSs normalized in [0..1]
  • Carefully decide how to compute the SRSs
  • Try to better approximate the URS distribution
  • Continuous URS?

Distributed IR, data fusion, meta-search, …