
Why Batch and User Evaluations Do Not Give the Same Results

A. Turpin, Curtin University of Technology, Perth, Australia, and W. Hersh, Oregon Health Sciences University, Portland, Oregon. Presented at SIGIR 2001, New Orleans.


  1. Why Batch and User Evaluations Do Not Give the Same Results
     A. Turpin, Curtin University of Technology, Perth, Australia
     W. Hersh, Oregon Health Sciences University, Portland, Oregon
     Presented at SIGIR 2001, New Orleans

  2. TREC ad-hoc [Figure: MAP by year, 1995 to 2000; y-axis ticks from 0.2 to 0.4]

  3. Experimental method
     1. Set baseline system to basic Cosine Vector weights
     2. Identify “super” system using batch experiments
     3. Run 24 users on the 2 systems with same topics
     4. Send results off to NIST
     5. Get relevance judgments
     6. Analyse user results
     7. Check batch results

  4. Example instance recall query
     Number: 414i
     Title: Cuba, sugar, imports
     Description: What countries import Cuban sugar?
     Instances: In the time allotted, please find as many DIFFERENT countries of the sort described above as you can. Please save at least one document for EACH such DIFFERENT country. If one document discusses several such countries, then you need not save other documents that repeat those, since your goal is to identify as many DIFFERENT countries of the sort described above as possible.
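
The instance recall task above can be read as a coverage measure: the fraction of known instances (here, countries) that appear in at least one saved document. The sketch below assumes that reading. The function name, the document IDs, and the country judgments are all hypothetical and are not taken from the study.

```python
def instance_recall(saved_docs, doc_instances, all_instances):
    """Fraction of known instances covered by at least one saved document.

    saved_docs: document IDs the user saved.
    doc_instances: maps a document ID to the set of instances it mentions.
    all_instances: every instance the assessors identified for the topic.
    """
    if not all_instances:
        raise ValueError("all_instances must not be empty")
    covered = set()
    for doc_id in saved_docs:
        # Documents without judgments contribute no instances.
        covered |= doc_instances.get(doc_id, set())
    return len(covered & set(all_instances)) / len(all_instances)


# Hypothetical judgments for a topic like "What countries import Cuban sugar?"
doc_instances = {
    "doc1": {"Russia", "China"},
    "doc2": {"China"},
    "doc3": {"Japan"},
}
all_instances = {"Russia", "China", "Japan", "Canada"}

print(instance_recall(["doc1", "doc2"], doc_instances, all_instances))  # 0.5
```

Saving "doc2" after "doc1" adds nothing here, which mirrors the instruction that a user need not save documents that repeat countries already found.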

  5. Experiment 1 - Instance Recall [Figure: bar chart comparing Baseline and Improved on Pre-batch MAP, User IR, and Post-batch MAP; bar values shown: 0.390, 0.385, 0.330, 0.324, 0.275, 0.213]

  6. 8 Q&A queries
     1) What are the names of three US national parks where one can find redwoods?
     2) Identify a site with Roman ruins in present-day France.
     3) Name four films in which Orson Welles appeared.
     4) Name three countries that imported Cuban sugar during the period of time covered by the document collection.

  7. 8 Q&A queries
     5) Which children's TV program was on the air longer: the original Mickey Mouse Club or the original Howdy Doody Show?
     6) Which painting did Edvard Munch complete first: Vampire or Puberty?
     7) Which was the last dynasty of China: Qing or Ming?
     8) Is Denmark larger or smaller in population than Norway?

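  8. Experiment 2 - Question Answering [Figure: bar chart comparing Baseline and Improved on Pre-batch MAP, User QA, and Post-batch MAP; percentage labels: Baseline 66%, Improved 60%; MAP values shown: 0.354, 0.327, 0.270, 0.228]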

  9. Results Summary
     Instance recall: predicted 81%, actual 15% (p = 0.27)
     Question answering: predicted 58%, actual -6% (p = 0.41)
     Why?
     1. Systems no different on topics and collection used
     2. There was a difference, but users ignored it
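
The predicted and actual figures above read as relative improvements of the improved system over the baseline. A minimal sketch of that arithmetic follows. Pairing 0.213 (baseline) with 0.385 (improved) from the Experiment 1 slide is an assumption, chosen only because it reproduces the 81% prediction; the function name is hypothetical.

```python
def relative_improvement(baseline, improved):
    """Percentage change of `improved` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (improved - baseline) / baseline


# Assumed pairing of two pre-batch MAP values from the Experiment 1 slide.
predicted = relative_improvement(0.213, 0.385)
print(round(predicted))  # 81
```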

  10. Precision metrics on user queries and collection [Figure: bar chart of MAP, p@10, and p@50 for Baseline vs Improved, in the Inst. Recall experiment and the QA experiment; y-axis from 0.00 to 0.60; difference annotations: 47%, 57%, 68%, 33%, 100%, 40%; p-values: 0.02, 0.03, 0.001, 0.14, 0.001, 0.02]
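
For readers unfamiliar with the metrics named on this slide, the sketch below uses the standard textbook definitions of precision at k (p@10, p@50) and mean average precision (MAP). It is not the evaluation code used in the study; the function names and the toy ranking are hypothetical, and each ranking is assumed to contain no duplicate document IDs.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of average_precision over (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ranking = ["d1", "d2", "d3", "d4"]
print(precision_at_k(ranking, {"d1", "d3"}, 2))  # 0.5
print(average_precision(ranking, {"d1", "d3"}))
```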

  11. Number of instances on user queries and collection [Figure: bar chart of Num. inst. @10 and Num. inst. @50 for Baseline vs Improved; y-axis from 0.00 to 14.00; annotations: 30% (p=0.28), 105% (p=0.04)]
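
One plausible reading of "Num. inst. @10" and "Num. inst. @50" is the number of distinct instances mentioned anywhere in the top k ranked documents. The sketch below assumes that reading; the function name and the judgments are hypothetical.

```python
def instances_at_k(ranking, doc_instances, k):
    """Count distinct instances mentioned in the top k ranked documents."""
    found = set()
    for doc_id in ranking[:k]:
        # Unjudged documents contribute no instances.
        found |= doc_instances.get(doc_id, set())
    return len(found)


# Hypothetical instance judgments per document.
judged = {
    "doc1": {"Russia", "China"},
    "doc2": {"China"},
    "doc3": {"Japan"},
}
ranked = ["doc2", "doc1", "doc3"]

print(instances_at_k(ranked, judged, 2))  # 2
print(instances_at_k(ranked, judged, 3))  # 3
```

Under this definition a document that only repeats already-seen instances raises precision but not the instance count, which is one way extra relevant documents can give no new information.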

  12. So what happens to the difference?
     • Users compensate for the lack of relevant docs within time limit
     • Users ignore high ranked relevant documents
       – Maybe obscure document titles?
       – Don’t read the list from the top?
     • “Extra” relevant docs give no new information

  13. Number of queries per topic [Figure: bar chart for Baseline vs Improved in the IR and QA experiments; y-axis from 0 to 5; difference annotations: 33%, 16%; p-values: 0.04, 0.16]

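  14. Number of docs retrieved [Figure: bar chart of Relevant and Irrelevant documents retrieved, Baseline vs Improved, in the IR and QA experiments; y-axis from 0 to 150; annotations: 35% (p=0.01), 2% (p=0.93), 35% (p=0.02), 0% (p=0.97)]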

  15. Number top 10 relevant docs ignored [Figure: bar chart for Baseline vs Improved in the IR and QA experiments; y-axis from 0% to 60%; difference annotations: 87%, 24%; p-values: 0.002, 0.22]

  16. Conclusion
     • In these two tasks there is no use providing users with a good weighting scheme because
       – They will ignore high ranking relevant docs
       – They will happily issue a few extra queries
     • They find answers just as well with old technology
     • User interface effects?
     • Task effect?

  17. Basic cosine: $\sum_{t \in T_{q,d}} \frac{TF(t,d) \cdot IDF(t)}{\sqrt{\sum_{t \in T_d} TF(t,d)^2 \cdot IDF(t)^2}}$
     Okapi: $\sum_{t \in T_{q,d}} \frac{f_{d,t}}{f_{d,t} + W_d}$
     Pivoted Okapi: $\sum_{t \in T_{q,d}} f_{q,t} \cdot \ln\left(\frac{N - f_t}{f_t}\right) \cdot \frac{f_{d,t}}{f_{d,t} + W'_d}$
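
This slide names three weighting schemes: basic cosine, Okapi, and pivoted Okapi. The sketch below contrasts a cosine-normalised TF-IDF score with an Okapi-style score whose term-frequency part saturates as f / (f + W). It is an illustration under stated assumptions, not the formulas the authors ran: the TF definition (1 + ln f), the IDF definition ln(1 + N / f_t), the parameter `doc_norm` (a stand-in for the document-length term W_d), and all function names and counts are hypothetical.

```python
import math


def cosine_tfidf_score(query_terms, doc_tf, doc_freq, num_docs):
    """Sum of TF*IDF over shared terms, divided by the document vector length.

    Assumes TF = 1 + ln(f_dt) and IDF = ln(1 + N / f_t); every term in
    doc_tf must have a count >= 1 and an entry in doc_freq.
    """
    def weight(term):
        tf = 1.0 + math.log(doc_tf[term])
        idf = math.log(1.0 + num_docs / doc_freq[term])
        return tf * idf

    doc_len = math.sqrt(sum(weight(t) ** 2 for t in doc_tf))
    if doc_len == 0.0:
        return 0.0
    shared = [t for t in set(query_terms) if t in doc_tf]
    return sum(weight(t) for t in shared) / doc_len


def okapi_style_score(query_terms, doc_tf, doc_freq, num_docs, doc_norm):
    """Sum over shared terms of ln((N - f_t) / f_t) * f_dt / (f_dt + doc_norm).

    doc_norm is an assumed stand-in for the document-length term.
    """
    score = 0.0
    for term in set(query_terms):
        f_dt = doc_tf.get(term, 0)
        if f_dt == 0:
            continue
        idf = math.log((num_docs - doc_freq[term]) / doc_freq[term])
        # The fraction approaches 1 as f_dt grows, so repeats add less and less.
        score += idf * f_dt / (f_dt + doc_norm)
    return score


# Hypothetical counts for one document in a 1000-document collection.
doc_tf = {"cuba": 3, "sugar": 2, "export": 1}
doc_freq = {"cuba": 10, "sugar": 20, "export": 50}
query = ["cuba", "sugar", "imports"]

print(cosine_tfidf_score(query, doc_tf, doc_freq, 1000))
print(okapi_style_score(query, doc_tf, doc_freq, 1000, doc_norm=1.0))
```

The design difference the sketch highlights is where document length enters: the cosine score divides the whole sum by one vector length, while the Okapi-style score folds a length term into each term's denominator.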
