

SLIDE 1

Stability of INEX 2007 Evaluation Measures

Sukomal Pal Mandar Mitra Arnab Chakraborty

{sukomal r, mandar}@isical.ac.in arnabc@stanfordalumni.org

Information Retrieval Lab, CVPR Unit Indian Statistical Institute Kolkata - 700108, India.

ISI @ EVIA ’08 – p. 1/29

SLIDE 2

Outline

Introduction

ISI @ EVIA ’08 – p. 2/29

SLIDE 3

Outline

Introduction Test Environment

ISI @ EVIA ’08 – p. 2/29

SLIDE 4

Outline

Introduction Test Environment Experiments & Results

ISI @ EVIA ’08 – p. 2/29

SLIDE 5

Outline

Introduction Test Environment Experiments & Results Limitations & Future Work

ISI @ EVIA ’08 – p. 2/29

SLIDE 6

Outline

Introduction Test Environment Experiments & Results Limitations & Future Work Conclusion

ISI @ EVIA ’08 – p. 2/29


SLIDE 8

Introduction: Content-oriented XML retrieval

  • a new domain in IR
  • XML as a standard document format on the web & in digital libraries (DL)
  • growth in XML information repositories → increase in XML-IR systems

Two aspects of XML-IR systems:

  • content (text/image/music/video info)
  • structure (info about the tags)

ISI @ EVIA ’08 – p. 3/29

SLIDE 9

Introduction: Content-oriented XML retrieval

  • from whole-document to document-part retrieval
  • a new evaluation framework (corpus, topics, relevance judgments, metrics) needed
  • Initiative for the Evaluation of XML retrieval, INEX ('02 - ..)
  • our stability study on the metrics of the INEX 2007 Ad Hoc Focused task

[Figure 1: A book example]

ISI @ EVIA ’08 – p. 4/29


SLIDE 11

Outline

Introduction Test Environment Experiments & Results Limitations & Future Work Conclusion

ISI @ EVIA ’08 – p. 5/29

SLIDE 12

Test Environment: Collection

XML-ified version of English Wikipedia

  • 659,388 documents
  • 4.6 GB

INEX 2007 topic set

  • 130 queries (414 - 543)

Relevance Judgment

  • 107 queries

Runs

  • 79 valid runs (ranked list acc. to relevance-score)
  • max. 1500 passages/elements per topic

ISI @ EVIA ’08 – p. 6/29

SLIDE 13

Test Environment: Measures

Precision
$$\text{precision} = \frac{\text{amount of relevant text retrieved}}{\text{total amount of retrieved text}} = \frac{\text{length of relevant text retrieved}}{\text{total length of retrieved text}}$$

Recall
$$\text{recall} = \frac{\text{length of relevant text retrieved}}{\text{total length of relevant text}}$$

ISI @ EVIA ’08 – p. 7/29

SLIDE 14

Test Environment: Measures

p_r = document part at rank r
size(p_r) = total number of characters in p_r
rsize(p_r) = length of relevant text in p_r
Trel(q) = total amount of relevant text for topic q

Precision at rank r:
$$P[r] = \frac{\sum_{i=1}^{r} rsize(p_i)}{\sum_{i=1}^{r} size(p_i)}$$

Recall at rank r:
$$R[r] = \frac{\sum_{i=1}^{r} rsize(p_i)}{Trel(q)}$$
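To make the length-based definitions concrete, here is a minimal Python sketch (not the official INEX evaluation tool) that computes P[r] and R[r] for one topic; the per-rank character counts `sizes`, `rsizes` and the total `trel` are assumed inputs.

```python
# Minimal sketch: P[r] and R[r] for one topic from per-rank character counts.
def precision_recall_at_ranks(sizes, rsizes, trel):
    """sizes[i]/rsizes[i]: characters (total / relevant) in the part at rank i+1;
    trel: Trel(q), total relevant characters for the topic."""
    P, R = [], []
    cum_size = cum_rsize = 0
    for size, rsize in zip(sizes, rsizes):
        cum_size += size            # total text retrieved up to this rank
        cum_rsize += rsize          # relevant text retrieved up to this rank
        P.append(cum_rsize / cum_size if cum_size else 0.0)
        R.append(cum_rsize / trel if trel else 0.0)
    return P, R

# Example: three retrieved elements of 100, 50, 200 characters containing
# 40, 50, 0 relevant characters; 200 relevant characters exist in total.
P, R = precision_recall_at_ranks([100, 50, 200], [40, 50, 0], trel=200)
# P = [0.40, 0.60, 0.257...],  R = [0.20, 0.45, 0.45]
```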

ISI @ EVIA ’08 – p. 8/29

SLIDE 15

Test Environment: Measures

Drawback

  • rank alone not easily interpretable for passages/elements (retrieval granularity not fixed)
  • recall level used instead

Interpolated Precision at recall level x:
$$iP[x] = \begin{cases} \max\limits_{1 \le r \le |L_q|,\; R[r] \ge x} P[r] & \text{if } x \le R[|L_q|] \\ 0 & \text{if } x > R[|L_q|] \end{cases}$$
(L_q = ranked list returned for topic q, |L_q| ≤ 1500)

e.g.
  • iP[0.00] = interpolated precision for the first unit retrieved
  • iP[0.01] = interpolated precision at 1% recall for a topic
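A small sketch of the interpolation step under the definition above; `P` and `R` are per-rank lists as in the previous sketch, and the values in the example are toy numbers.

```python
# Sketch of iP[x] per the definition above (illustrative, not the official tool).
def interpolated_precision(P, R, x):
    """max P[r] over ranks with R[r] >= x; 0 when x exceeds the run's final recall."""
    if not R or x > R[-1]:
        return 0.0
    return max(p for p, rec in zip(P, R) if rec >= x)

P = [0.40, 0.60, 0.26]          # toy per-rank precision values
R = [0.20, 0.45, 0.45]          # toy per-rank recall values (non-decreasing)
print(interpolated_precision(P, R, 0.01))   # -> 0.60 (iP[0.01], the official metric)
print(interpolated_precision(P, R, 0.50))   # -> 0.0  (recall 0.50 never reached)
```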

ISI @ EVIA ’08 – p. 9/29

SLIDE 16

Test Environment: Measures

Average interpolated precision for topic t:
$$AiP(t) = \frac{1}{101} \sum_{x \in \{0.00, 0.01, \ldots, 1.00\}} iP[x](t)$$

Overall interpolated precision at recall level x:
$$iP[x]_{\text{overall}} = \frac{1}{n} \sum_{t=1}^{n} iP[x](t)$$

Mean Average Interpolated Precision:
$$MAiP = \frac{1}{n} \sum_{t=1}^{n} AiP(t)$$

Reported metrics for the INEX 2007 Ad Hoc Focused task:

  • iP[0.00], iP[0.01], iP[0.05], iP[0.10] & MAiP
  • official metric: iP[0.01]
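A sketch of AiP and MAiP built on the `interpolated_precision` function from the previous sketch; the per-topic (P, R) lists are assumed to have been computed already.

```python
# Sketch of AiP(t) and MAiP, assuming interpolated_precision from the
# previous sketch and precomputed per-topic (P, R) lists.
def aip(P, R):
    """Average of iP[x] over the 101 recall points 0.00, 0.01, ..., 1.00."""
    points = [i / 100.0 for i in range(101)]
    return sum(interpolated_precision(P, R, x) for x in points) / 101.0

def maip(per_topic):
    """Mean AiP over all topics; per_topic maps topic id -> (P, R)."""
    return sum(aip(P, R) for P, R in per_topic.values()) / len(per_topic)
```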

ISI @ EVIA ’08 – p. 10/29

SLIDE 17

Test Environment: Experimental setup

relevance judgment

  • NOT just boolean indicator
  • relevant psg. with start & end-offset in xpath

db of start & end offsets for each element of entire corpus

  • size ∼ 14 GB

a subset of the db, representing the rel-jdg file, stored
Out of 79 runs, 62 chosen

  • runs ranked 1-21, 31-50, 59-79 according to iP[0.01] taken
  • run files consulted against the db to get offsets, then compared with the stored rel-jdg file
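The offset comparison at the heart of this setup can be sketched as an interval-overlap computation; the function and the example offsets below are illustrative, not the actual db schema or file formats used.

```python
# Illustrative sketch: rsize of a retrieved element given its character offsets
# and the relevant passages (start, end offsets) stored for the topic.
def rsize(elem_start, elem_end, relevant_spans):
    """Characters of [elem_start, elem_end) covered by the (assumed disjoint) relevant spans."""
    covered = 0
    for r_start, r_end in relevant_spans:
        lo, hi = max(elem_start, r_start), min(elem_end, r_end)
        if lo < hi:
            covered += hi - lo
    return covered

# Example: element occupies characters 1000-1500 of a file; relevant passages
# cover 1200-1400 and 1450-1600.
print(rsize(1000, 1500, [(1200, 1400), (1450, 1600)]))   # -> 250
```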

ISI @ EVIA ’08 – p. 11/29

SLIDE 18

Outline

Introduction Test Environment Experiments & Results Limitations & Future Work Conclusion

ISI @ EVIA ’08 – p. 12/29

SLIDE 19

Experiments

3 categories: Pool Sampling

  • evaluate using incomplete relevance judgments
  • some rel. passages made irrel. for each topic

Query Sampling

  • evaluate using smaller subsets of topics
  • complete rel-jdg info for a topic, if selected

Error Rate

  • offshoot of query sampling
  • study of pairwise runs with topic set reduced

ISI @ EVIA ’08 – p. 13/29

SLIDE 20

Experiments: Pool Sampling

Pool generated from the participants' runs, collaboratively judged by the participants

  • relevant passages highlighted
  • no highlighting ⇒ NOT relevant

Qrel

  • start and end points of highlighted passages given by xpath
  • db consulted to get the offsets, stored in a sorted file
  • no entries for assessed non-relevant text
  • contained 107 topics

ISI @ EVIA ’08 – p. 14/29

SLIDE 21

Experiments: Pool Sampling

Algorithm:

  • 1. 99 topics having ≥ 10 relevant units selected
  • 2. 80% of the relevant passages sampled (SRSWOR) for each topic → new qrel
  • 3. 62 runs evaluated with the reduced sample qrel
  • 4. Kendall tau (τ) computed between the 2 rankings for each metric (i.e. ranking by original qrel vs. reduced qrel)
  • 5. 10 iterations of the above steps 1-4 at the 80% sample

Steps 1-5 done at the 60%, 40%, 20% samples as well (a code sketch of one iteration follows)
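A hedged sketch of one such iteration, not the authors' actual scripts: the qrel layout, and the `full_scores`/`reduced_scores` dicts of per-run metric values (e.g. iP[0.01]) are hypothetical placeholders; the metric computation itself is assumed to happen elsewhere.

```python
import random
from scipy.stats import kendalltau

def sample_qrel(qrel, fraction, rng=random):
    """qrel: {topic: [(file, start, end), ...]} -> reduced qrel, SRSWOR per topic
    (only topics with >= 10 relevant units are kept, as in step 1)."""
    reduced = {}
    for topic, passages in qrel.items():
        if len(passages) >= 10:
            k = max(1, round(fraction * len(passages)))
            reduced[topic] = rng.sample(passages, k)
    return reduced

def rank_correlation(full_scores, reduced_scores):
    """Kendall tau between the system rankings induced by two {run_id: score}
    dicts (e.g. iP[0.01] under the original and the reduced qrel)."""
    runs = sorted(full_scores)                 # fixed run order
    tau, _ = kendalltau([full_scores[r] for r in runs],
                        [reduced_scores[r] for r in runs])
    return tau
```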

ISI @ EVIA ’08 – p. 15/29

SLIDE 22

Results: Pool Sampling

[Figure: Rank correlation with partial relevance judgments — Kendall tau vs. %-age of total relevant documents used for evaluation, for iP[0.00], iP[0.01], iP[0.05], iP[0.10] and MAiP]

ISI @ EVIA ’08 – p. 16/29

SLIDE 23

Results: Pool Sampling

  • sampling level ↓ → correlation ↓ → curve droops
  • precision scores affected non-uniformly across systems (depending upon the ranks of retrieved text missing from the pool)
  • τ drops faster for iP[0.00] and iP[0.01] than for iP[0.05], iP[0.10] or MAiP
  • sampling level ↓ → error bars ↑
  • sampling level ↓ → overlap among the samples at a fixed n% ↓ → irregular precision scores
  • MAiP: least variation in τ across different pool sizes and across samples at a fixed pool size

ISI @ EVIA ’08 – p. 17/29

SLIDE 24

Experiments: Query Sampling

Algorithm:

  • 1. All 107 topics considered
  • 2. 80% of the topics selected at random (SRSWOR)
  • 3. if a topic is selected, its entire relevance judgment is taken → new reduced qrel
  • 4. 62 runs evaluated with the reduced sample qrel
  • 5. Kendall tau (τ) computed between the 2 rankings for each metric (i.e. ranking by original qrel vs. reduced qrel)
  • 6. 10 iterations of the above steps at the 80% sample

Steps 1-5 done at the 60%, 40%, 20% samples as well (a sketch of one iteration follows)
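A sketch of one query-sampling iteration (illustrative), reusing `rank_correlation` from the pool-sampling sketch; `per_topic_scores` (run → topic → metric value) is a hypothetical input table, and averaging per-topic scores over the topic subset stands in for re-running the evaluation.

```python
import random

def query_sample_scores(per_topic_scores, fraction, rng=random):
    """Average each run's per-topic scores over a random topic subset (SRSWOR)."""
    topics = sorted(next(iter(per_topic_scores.values())))
    subset = rng.sample(topics, max(1, round(fraction * len(topics))))
    return {run: sum(scores[t] for t in subset) / len(subset)
            for run, scores in per_topic_scores.items()}

# tau = rank_correlation(full_scores, query_sample_scores(per_topic_scores, 0.8))
```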

ISI @ EVIA ’08 – p. 18/29

SLIDE 25

Results: Query Sampling

[Figure: Rank correlation with a subset of all queries — Kendall tau vs. size of sample (%-age of total queries), for iP[0.00], iP[0.01], iP[0.05], iP[0.10] and MAiP]

ISI @ EVIA ’08 – p. 19/29

SLIDE 26

Results: Query Sampling

  • Similar characteristics compared to Pool Sampling
  • τ drops faster for iP[0.00] and iP[0.01] than for iP[0.05], iP[0.10] or MAiP
  • sampling level ↓ → error bars ↑
  • MAiP best: least variation in τ across different sample sizes and across samples at a fixed size
  • Curves are more stable than those in Pool Sampling (i.e. system rankings agree more with the original rankings)
  • if a topic is selected, its entire relevance judgment is used → the topic contributes to precision scores uniformly across systems
  • τ drops only because systems respond differently to different queries

ISI @ EVIA ’08 – p. 20/29

SLIDE 27

Experiments: Error Rate

Algorithm:

  • 1. Acc. to Buckley & Voorhees 2000, but with modification
  •    participants' systems not available
  •    results of systems under varying query formulations NOT possible
  • 2. Samples of the query set with full qrel per topic
  •    partitioning of the query set (SRSWOR) → upper bound on error rate
  •    subsets of the query set (SRSWR) → lower bound on error rate
  • 3. 10 samples (SRSWR) at 20%, 40%, 60%, 80% of the 107 queries

ISI @ EVIA ’08 – p. 21/29

SLIDE 28

Experiments: Error Rate

Error Rate (Buckley & Voorhees '00)

$$\text{Error rate} = \frac{\min(|A > B|, |A < B|)}{|A > B| + |A < B| + |A == B|}$$

|A > B| = number of times (out of 10) system A scores better than system B at a fixed sampling level.
Note: A counts as better than B only if it exceeds B by ≥ 5%; otherwise A == B.
62 systems ⇒ $\binom{62}{2} = \frac{62 \cdot 61}{2} = 1891$ pairs
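A sketch of the error-rate computation for one system pair, assuming `a_scores` and `b_scores` hold the metric values of systems A and B over the 10 query samples at one sampling level.

```python
# Sketch: per-pair error rate with the 5% fuzziness margin.
def error_rate(a_scores, b_scores, fuzz=0.05):
    a_wins = b_wins = ties = 0
    for a, b in zip(a_scores, b_scores):
        if a > b * (1 + fuzz):        # A counts as better only beyond the 5% margin
            a_wins += 1
        elif b > a * (1 + fuzz):
            b_wins += 1
        else:
            ties += 1
    return min(a_wins, b_wins) / (a_wins + b_wins + ties)

# The per-pair values are then aggregated over all 1891 system pairs.
```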

ISI @ EVIA ’08 – p. 22/29

SLIDE 29

Results: Error Rate

[Figure: Error rates with a subset of queries — error rate vs. size of sample (%-age of total queries), for iP[0.00], iP[0.01], iP[0.05], iP[0.10] and MAiP]

ISI @ EVIA ’08 – p. 23/29

SLIDE 30

Results: Error Rate

  • Error rates high for small query sets
  • progressively ↓ as overlap among query samples ↑
  • 40% of the topics sufficient to achieve less than 5% error
  • early-precision measures more error-prone
  • MAiP has the least error rate
  • MAiP best, as it also has the least variation in τ

ISI @ EVIA ’08 – p. 24/29

SLIDE 31

Outline

Introduction Previous Work Test Environment Experiments & Results Limitations & Future Work Conclusion

ISI @ EVIA ’08 – p. 25/29

SLIDE 32

Limitations & Future Work

  • Observations based only on the INEX 2007 test collection
  • Not all 79 valid runs could be considered; 62 of them were used
  • Runs drawn from non-random influencing categories (passage/element, CO/CAS, short/long, hard/easy queries, etc.)
  • No knowledge of the top-n retrieved units used to create the pool (future task)
  • Bias of the qrels towards participating runs (future task)
  • Error rates: no explanation yet for why the steady behaviour was disturbed; we considered a 5% error rate; a lot more study needed
  • MAiP averages well across topics, is more shock-absorbing than other metrics, and is the most reliable metric for a static test environment

ISI @ EVIA ’08 – p. 26/29

SLIDE 33

Outline

Introduction Previous Work Test Environment Experiments & Results Limitations & Future Work Conclusion

ISI @ EVIA ’08 – p. 27/29

SLIDE 34

Conclusion

  • XML retrieval evaluation is a gruelling challenge
  • Various metrics tried from INEX '02 to '06; precision-recall based metrics since INEX '07
  • Validation of previous findings in the XML retrieval domain: similar results → intrinsic properties of the metrics
  • MAiP averages well across topics, is more shock-absorbing than other metrics, and is the most reliable metric for a static test environment

ISI @ EVIA ’08 – p. 28/29

SLIDE 35

Acknowledgments

  • work: DIT, Govt. of India
  • trip: NTCIR, Japan & Google Inc., USA

!! THANK YOU !!

ISI @ EVIA ’08 – p. 29/29