Improving the Accuracy of System Performance Estimation by Using Shards - PowerPoint PPT Presentation



SLIDE 1

Improving the Accuracy of System Performance Estimation by Using Shards

Nicola Ferro & Mark Sanderson

SLIDE 2

IR evaluation is noisy

[Figure: two panels of per-topic system scores, axis 0.00–1.00]

SLIDE 3

ANOVA

Data = Model + Error
Model: linear mixture of factors

SLIDE 4

First go

Tague-Sutcliffe and Blustein, 1995

Factors: Topics, Systems
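A minimal sketch of this two-factor decomposition, in the spirit of the Tague-Sutcliffe and Blustein model: with one score per (topic, system) cell, the total variation splits into a topic effect, a system effect, and a residual. The scores below are invented for illustration, not data from the paper.

```python
def two_way_anova(scores):
    """scores[t][s] = effectiveness of system s on topic t (one score per cell).
    Returns sums of squares for topics, systems, and residual error."""
    T = len(scores)            # number of topics
    S = len(scores[0])         # number of systems
    grand = sum(sum(row) for row in scores) / (T * S)
    topic_means = [sum(row) / S for row in scores]
    sys_means = [sum(scores[t][s] for t in range(T)) / T for s in range(S)]

    ss_topic = S * sum((m - grand) ** 2 for m in topic_means)
    ss_system = T * sum((m - grand) ** 2 for m in sys_means)
    ss_total = sum((scores[t][s] - grand) ** 2
                   for t in range(T) for s in range(S))
    # with one observation per cell, what is left over is interaction + noise
    ss_error = ss_total - ss_topic - ss_system
    return ss_topic, ss_system, ss_error

scores = [[0.20, 0.40, 0.30],   # topic 1
          [0.60, 0.80, 0.70],   # topic 2
          [0.10, 0.50, 0.30]]   # topic 3
ss_t, ss_s, ss_e = two_way_anova(scores)
```

The design choice this slide points at: without replication, the Topic*System interaction cannot be separated from `ss_error`, which motivates the rest of the talk.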

SLIDE 5

Question

Can we do better?
Add a Topic*System factor?

SLIDE 6

New system

[Figure: three panels of per-topic system scores, axis 0.00–1.00]

SLIDE 7

Partition collections

Shards

SLIDE 8

Replicates

[Figure: two panels of per-topic system scores, axis 0.00–1.00]

SLIDE 9

Replicates

[Figure: two panels of per-topic system scores, axis 0.00–1.00]

SLIDE 10

Replicates

[Figure: two panels of per-topic system scores, axis 0.00–1.00]

  • E. M. Voorhees, D. Samarov, and I. Soboroff. Using Replicates in Information Retrieval Evaluation. ACM Transactions on Information Systems (TOIS), 36(2): 12:1–12:21, September 2017.

SLIDE 11

Past ANOVA Factors

Topics
Systems
Topic*System interactions

SLIDE 12

Our ANOVA Factors

Topics
Systems
Shards
Topic*System
System*Shard
Topic*Shard
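One way to read these factors: with shards supplying replicates, each score decomposes into a grand mean, three main effects, and pairwise interactions. A hedged pure-Python sketch of the effect estimates (toy numbers; this is not the paper's MD6 code, just the standard mean-based estimates for a full topic x system x shard grid):

```python
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def effects(y):
    """y[t][s][h] = score of system s on topic t, restricted to shard h.
    Returns grand mean, main effects, and the Topic*Shard interaction."""
    T, S, H = len(y), len(y[0]), len(y[0][0])
    mu = mean(y[t][s][h] for t in range(T) for s in range(S) for h in range(H))
    topic = [mean(y[t][s][h] for s in range(S) for h in range(H)) - mu
             for t in range(T)]
    system = [mean(y[t][s][h] for t in range(T) for h in range(H)) - mu
              for s in range(S)]
    shard = [mean(y[t][s][h] for t in range(T) for s in range(S)) - mu
             for h in range(H)]
    # Topic*Shard interaction: cell mean minus grand mean and main effects
    ts_inter = [[mean(y[t][s][h] for s in range(S)) - mu - topic[t] - shard[h]
                 for h in range(H)] for t in range(T)]
    return mu, topic, system, shard, ts_inter

y = [
    [[0.1, 0.3], [0.2, 0.4]],   # topic 0: systems x shards
    [[0.5, 0.7], [0.6, 0.8]],   # topic 1
]
mu, topic, system, shard, ts_inter = effects(y)
```

On this purely additive toy grid the Topic*Shard interaction terms come out zero; the talk's point is that on real collections they do not.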

SLIDE 13

Models

SLIDE 14

IR evaluation is noisy

[Figure: two panels of per-topic system scores, axis 0.00–1.00]

SLIDE 15

Hard vs Easy Topics?

[Figure: per-topic score distributions (axis 0.00–1.00) for runs k8alx and INQ604]

SLIDE 16

Few vs Many QRELs

SLIDE 17

x

This paper

SLIDE 18

Proof in the paper

Include topic*shard factor?
Value of x is not important; we choose x = 0
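A small demonstration of the slide's claim (the formal proof is in the paper): when a topic has no relevant documents in a shard, every system's score for that (topic, shard) cell is filled with the same constant x. Because the filled cells are identical across systems, changing x shifts every system mean by the same amount, so the differences between estimated system effects are invariant to x. Cell layout and scores below are invented.

```python
def system_effects(cells, missing, x):
    """cells[s] maps (topic, shard) -> observed score for system s;
    'missing' is the set of (topic, shard) cells filled with constant x
    (same cells for every system). Returns per-system effect estimates."""
    keys = sorted(set(cells[0]) | missing)
    means = []
    for sys_scores in cells:
        vals = [x if k in missing else sys_scores[k] for k in keys]
        means.append(sum(vals) / len(vals))
    grand = sum(means) / len(means)
    return [m - grand for m in means]

cells = [
    {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.6},   # system A scores
    {(0, 0): 0.3, (0, 1): 0.5, (1, 0): 0.7},   # system B scores
]
missing = {(1, 1)}   # topic 1 has no relevant documents in shard 1

eff_zero = system_effects(cells, missing, x=0.0)
eff_other = system_effects(cells, missing, x=0.7)
# eff_zero and eff_other are identical, so x = 0 is as good as any value
```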

SLIDE 19

Few vs Many QRELs?

  • G. V. Cormack and T. R. Lynam. Statistical Precision of Information Retrieval Evaluation. In E. N. Efthimiadis, S. Dumais, D. Hawking, and K. Järvelin, editors, Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 533–540. ACM Press, New York, USA, 2006.
  • S. Robertson. On document populations and measures of IR effectiveness. In Proceedings of the 1st International Conference on the Theory of Information Retrieval (ICTIR'07), Foundation for Information Society, pages 9–22, 2007.

Should topics be 'treated' equally?

SLIDE 20

Compare MD6 factors

[Figure: two panels of per-topic system scores, axis 0.00–1.00]

SLIDE 21

Experiments

TREC-8, Adhoc, 129 runs
TREC-9, Web, 104 runs
TREC-27, Common Core, 72 runs
Original run rankings (τ)
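The τ here is Kendall's tau between two orderings of the same runs. A minimal sketch for rankings without ties (run names are invented, not actual TREC run tags):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same items (no ties):
    (concordant pairs - discordant pairs) / total pairs."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # is this pair ordered the same way in both rankings?
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

official = ["runA", "runB", "runC", "runD"]
sharded  = ["runA", "runC", "runB", "runD"]
tau = kendall_tau(official, sharded)   # one swapped pair out of six
```

Identical rankings give τ = 1, a full reversal gives τ = -1; the experiments compare rankings produced under the shard-based models against the original run rankings.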

SLIDE 22

SLIDE 23

MD6

SLIDE 24

Other parts of the paper

Confidence intervals calculated with Tukey HSD
Details of the proof on the zero value for shards & MD6
Code: https://bitbucket.org/frrncl/sigir2019-fs-code/
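A minimal sketch of what a Tukey HSD interval computes: a common half-width q·sqrt(MSE/n) applied to every pairwise comparison of group means. This is not the paper's code; the studentized-range critical value `q_crit` is a parameter that must come from tables or a stats package, and the MSE/n numbers below are illustrative.

```python
import math

def hsd_halfwidth(ms_error, n_per_group, q_crit):
    """Half-width of a Tukey HSD confidence interval: q * sqrt(MSE / n).
    q_crit: studentized-range critical value for (k groups, error df),
    from tables or software; ms_error: mean squared error of the ANOVA;
    n_per_group: observations per group mean."""
    return q_crit * math.sqrt(ms_error / n_per_group)

# Illustrative numbers: MSE = 0.04, n = 25 cells per system, q = 3.5
half = hsd_halfwidth(0.04, 25, 3.5)   # ≈ 0.14
# Two system means closer than 2 * half-width... actually closer than
# the half-width itself are not separated at the chosen confidence level.
```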

SLIDE 25

Conclusions

Can we do better than past ANOVA? Yes: MD6
Topic*Shard interaction is strong; its impact has not been observed when measuring performance
Test collections are expensive to build; we can get substantially more signal out of three collections

SLIDE 26

Future work

UQV100: query test collection
Compare to Voorhees, Samarov, and Soboroff, 2017
Metric: not significant differences but predictive power
Create new collections with fewer judgments/topics