9. Evaluation
Outline
9.1. Cranfield Paradigm & TREC
9.2. Non-Traditional Measures
9.3. Incomplete Judgments
9.4. Low-Cost Evaluation
9.5. Crowdsourcing
9.6. Online Evaluation
9.1. Cranfield Paradigm & TREC
๏ IR evaluation typically follows the Cranfield paradigm
๏ named after two studies conducted in the 1960s by Cyril Cleverdon, a librarian at the College of Aeronautics, Cranfield, England
๏ Key Ideas:
๏ provide a document collection
๏ define a set of topics (queries) upfront
๏ obtain results for the topics from different participating systems (runs)
๏ collect relevance assessments for topic-result pairs
๏ measure system effectiveness (e.g., using MAP)
TREC
๏ Text REtrieval Conference (TREC) organized by the National Institute of Standards and Technology (NIST) since 1992
๏ from 1992–1999 focus on ad-hoc information retrieval (TREC 1–8) and document collections mostly consisting of news articles (Disks 1–5)
๏ topic development and relevance assessment conducted by retired information analysts from the National Security Agency (NSA)
๏ nowadays much broader scope including tracks on web retrieval, question answering, blogs, and temporal summarization
Evaluation Process
๏ TREC process to evaluate participating systems:
๏ (1) Release of document collection and topics
๏ (2) Participants submit runs, i.e., results obtained for the topics using a specific system configuration
๏ (3) Runs are pooled on a per-topic basis, i.e., documents returned (within the top-k) by any run are merged
๏ (4) Relevance assessments are conducted; each (topic, document) pair is judged by one assessor
๏ (5) Runs are ranked according to their overall performance across all topics using an agreed-upon effectiveness measure
[Figure: evaluation pipeline: Document Collection → Topics → Pooling → Relevance Assessments → Run Ranking]
9.2. Non-Traditional Measures
๏ Traditional effectiveness measures (e.g., Precision, Recall, MAP)
assume binary relevance assessments (relevant/irrelevant)
๏ Heterogeneous document collections like the Web and complex
information needs demand graded relevance assessments
๏ User behavior exhibits strong click bias in favor of top-ranked
results and tendency not to go beyond first few relevant results
๏ Non-traditional effectiveness measures (e.g., RBP, nDCG, ERR) consider graded relevance assessments and/or are based on more complex models of user behavior
Position Models vs. Cascade Models
๏ Position models assume that the user inspects each rank with a fixed probability that is independent of the other ranks
๏ Example: Precision@k corresponds to the user inspecting each rank 1…k with uniform probability 1/k
๏ Cascade models assume that the user inspects each rank with a probability that depends on the relevance of the documents at higher ranks
๏ Example: α-nDCG assumes that the user inspects rank k with probability P[n ∉ d1] × … × P[n ∉ dk−1]
[Figure: position model: ranks 1…k inspected with probabilities P[d1], P[d2], …, P[dk]; cascade model: ranks inspected with probabilities P[d1], P[d2 | d1], P[d3 | d1, d2], …]
Rank-Biased Precision
๏ Moffat and Zobel [9] propose rank-biased precision (RBP) as
an effectiveness measure based on a more realistic user model
๏ Persistence parameter p: the user moves on to inspect the next result with probability p and stops with probability (1 − p); ri ∈ {0,1} indicates the relevance of the result at rank i
$\mathrm{RBP} = (1 - p) \cdot \sum_{i=1}^{d} r_i \cdot p^{\,i-1}$
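As a quick illustration (not from the slides), a minimal Python sketch of this formula; the binary relevance vector and the persistence value p are assumed inputs:

```python
def rbp(relevances, p=0.8):
    """Rank-biased precision for a binary relevance vector (1 = relevant, 0 = irrelevant)."""
    return (1 - p) * sum(r * p ** i for i, r in enumerate(relevances))

# Relevant results at ranks 1 and 3, persistence p = 0.8
print(rbp([1, 0, 1, 0, 0], p=0.8))  # 0.2 * (1 + 0.64) = 0.328
```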
Normalized Discounted Cumulative Gain
๏ Discounted Cumulative Gain (DCG) considers
๏ graded relevance judgments (e.g., 2: relevant, 1: marginal, 0: irrelevant)
๏ position bias (i.e., results close to the top are preferred)
๏ Considering the top-k results with R(q,m) as the grade of the m-th result
๏ Normalized DCG (nDCG) is obtained through normalization with the idealized DCG (iDCG) of a fictitious optimal top-k result
$\mathrm{DCG}(q, k) = \sum_{m=1}^{k} \frac{2^{R(q,m)} - 1}{\log(1 + m)} \qquad \mathrm{nDCG}(q, k) = \frac{\mathrm{DCG}(q, k)}{\mathrm{iDCG}(q, k)}$
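A small Python sketch of the two formulas above; for simplicity, the ideal ordering is assumed to be the retrieved grades sorted in descending order, and the natural logarithm is used:

```python
import math

def dcg(grades, k):
    """DCG over graded judgments (e.g., 0/1/2) of the top-k results."""
    return sum((2 ** g - 1) / math.log(1 + m) for m, g in enumerate(grades[:k], start=1))

def ndcg(grades, k):
    """nDCG: DCG normalized by the DCG of the ideal (descending-grade) ranking."""
    return dcg(grades, k) / dcg(sorted(grades, reverse=True), k)

# Top-5 results with grades 2, 0, 1, 2, 0
print(ndcg([2, 0, 1, 2, 0], k=5))
```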
Expected Reciprocal Rank
๏ Chapelle et al. [6] propose expected reciprocal rank (ERR) as the expected reciprocal time to find a relevant result, with Ri as the probability that the user sees a relevant result at rank i and decides to stop inspecting results
๏ Ri can be estimated from graded relevance assessments as shown below
๏ ERR is equivalent to reciprocal rank (RR) for binary estimates of Ri
$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r \qquad R_i = \frac{2^{g(i)} - 1}{2^{g_{\max}}}$
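A short sketch of ERR under these definitions; the list of grades g(i) and the maximum grade g_max are assumed inputs:

```python
def err(grades, g_max=2):
    """Expected reciprocal rank from graded relevance judgments."""
    err_value, p_reach = 0.0, 1.0          # p_reach: probability of reaching rank r
    for r, g in enumerate(grades, start=1):
        R = (2 ** g - 1) / 2 ** g_max      # probability of stopping at rank r
        err_value += p_reach * R / r
        p_reach *= (1 - R)
    return err_value

print(err([2, 1, 0, 2], g_max=2))
```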
9.3. Incomplete Judgments
๏ TREC and other initiatives typically make their document
collections, topics, and relevance assessments available to foster further research
๏ Problem: When evaluating a new system that did not contribute to the pool of assessed results, one typically also retrieves results that have not been judged
๏ Naïve Solution: Results without an assessment are assumed to be irrelevant
๏ corresponds to applying a majority classifier (most results are irrelevant)
๏ induces a bias against new systems
Bpref
๏ Bpref assumes binary relevance assessments and evaluates a system based only on judged results, with R and N as the sets of judged relevant and irrelevant results
๏ Intuition: For every retrieved relevant result compute a penalty
reflecting how many irrelevant results were ranked higher
$\mathrm{bpref} = \frac{1}{|R|} \sum_{d \in R} \left( 1 - \frac{\min(|\{d' \in N \text{ ranked higher than } d\}|,\, |R|)}{\min(|R|, |N|)} \right)$
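A sketch computing bpref from a ranked list and the judged sets R and N (both assumed non-empty); unjudged results are simply skipped:

```python
def bpref(ranking, relevant, nonrelevant):
    """Bpref over a ranked result list, counting only judged documents."""
    R, N = set(relevant), set(nonrelevant)
    score, n_above = 0.0, 0                # n_above: judged-irrelevant results seen so far
    for d in ranking:
        if d in N:
            n_above += 1
        elif d in R:
            score += 1 - min(n_above, len(R)) / min(len(R), len(N))
    return score / len(R)

print(bpref(["d1", "d7", "d9", "d2"], relevant={"d1", "d2"}, nonrelevant={"d7", "d9"}))
```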
Condensed Lists
๏ Sakai [10] proposes a more general approach to the problem of
incomplete judgments, namely to condense result lists by removing all unjudged results
๏ can be used with any effectiveness measure (e.g., MAP, nDCG)
๏ Experiments on runs submitted to the Cross-Lingual Information
Retrieval tracks of NTCIR 3&5 suggest that the condensed list approach is at least as robust as bpref and its variants
[Figure: result list ⟨d1, d7, d9, d2⟩ with d9 unjudged condenses to ⟨d1, d7, d2⟩; legend: relevant / irrelevant / unknown]
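Condensing a list is a simple filter; a sketch matching the figure's example (document names are placeholders):

```python
def condense(ranking, judged):
    """Condensed list: drop all results without a relevance judgment."""
    return [d for d in ranking if d in judged]

print(condense(["d1", "d7", "d9", "d2"], judged={"d1", "d7", "d2"}))  # ['d1', 'd7', 'd2']
```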
Kendall’s τ
๏ Kendall’s τ coefficient measures the rank correlation between
two permutations πi and πj of the same set of elements with n as the number of elements
๏ Example: π1 = ⟨a b c d⟩ and π2 = ⟨d b a c⟩
๏ concordant pairs: (a,c) (b,c)
๏ discordant pairs: (a,b) (a,d) (b,d) (c,d)
๏ Kendall’s τ: −2/6
$\tau = \frac{(\#\ \text{concordant pairs}) - (\#\ \text{discordant pairs})}{\frac{1}{2} \cdot n \cdot (n - 1)}$
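A brief sketch that reproduces the example above by counting concordant and discordant pairs:

```python
from itertools import combinations

def kendall_tau(perm1, perm2):
    """Kendall's tau between two permutations of the same elements."""
    pos1 = {x: i for i, x in enumerate(perm1)}
    pos2 = {x: i for i, x in enumerate(perm2)}
    concordant = discordant = 0
    for x, y in combinations(perm1, 2):
        if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) > 0:
            concordant += 1                # same relative order in both permutations
        else:
            discordant += 1
    n = len(perm1)
    return (concordant - discordant) / (0.5 * n * (n - 1))

print(kendall_tau(["a", "b", "c", "d"], ["d", "b", "a", "c"]))  # -0.333…
```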
Experiments
๏ Sakai [10] compares the condensed-list approach, applied to several effectiveness measures, against bpref in terms of robustness
๏ Setup: Remove a random fraction of relevance assessments
and compare the resulting system ranking in terms of Kendall’s τ against the original system ranking with all relevance assessments
Label Prediction
๏ Büttcher et al. [3] examine the effect of incomplete judgments
based on runs submitted to the TREC 2006 Terabyte track
๏ They also examine the amount of bias against new systems
by removing judged results solely contributed by one system
[Figure: Kendall’s τ against the original system ranking as a function of the size of the qrels file (compared to the original), for RankEff, bpref, AP, P@20, and nDCG@20]
| | MRR | P@10 | P@20 | nDCG@20 | Avg. Prec. | bpref | P@20(j) | RankEff |
|---|---|---|---|---|---|---|---|---|
| Avg. absolute rank difference | 0.905 | 1.738 | 2.095 | 2.143 | 1.524 | 2.000 | 2.452 | 0.857 |
| Max. rank difference | 0↑/15↓ | 1↑/16↓ | 0↑/12↓ | 0↑/14↓ | 0↑/10↓ | 14↑/1↓ | 22↑/1↓ | 4↑/3↓ |
| RMS error | 0.0130 | 0.0207 | 0.0243 | 0.0223 | 0.0105 | 0.0346 | 0.0258 | 0.0143 |
| Runs with significant diff. (p < 0.05) | 4.8% | 38.1% | 50.0% | 54.8% | 95.2% | 90.5% | 61.9% | 81.0% |
Label Prediction
๏ Idea: Predict missing labels using classification methods
๏ Classifier based on Kullback-Leibler divergence
๏ estimate a unigram language model θR from the relevant documents
๏ a document d with language model θd is considered relevant if the condition below holds, with the threshold ψ estimated such that exactly |R| documents in the training data satisfy it and are thus considered relevant
$\mathrm{KL}(\theta_d \,\|\, \theta_R) < \psi$
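A rough sketch of such a KLD-based classifier; the tokenization, the additive smoothing parameter mu, and the epsilon floor for unseen terms are assumptions rather than details from Büttcher et al. [3]:

```python
import math
from collections import Counter

def unigram_lm(docs, mu=1.0):
    """Smoothed unigram language model over a collection of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total, vocab = sum(counts.values()), len(counts)
    return {t: (c + mu) / (total + mu * vocab) for t, c in counts.items()}

def kl_divergence(theta_d, theta_r, epsilon=1e-9):
    """KL(theta_d || theta_r), flooring unseen terms in theta_r at epsilon."""
    return sum(p * math.log(p / theta_r.get(t, epsilon)) for t, p in theta_d.items())

# A document d is predicted relevant if kl_divergence(unigram_lm([d]), theta_r) < psi
```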
Label Prediction
๏ Classifier based on a Support Vector Machine (SVM) with w ∈ ℝⁿ and b ∈ ℝ as parameters and x as the document vector
๏ consider the 10⁶ globally most frequent terms as features
๏ feature values determined using tf.idf weighting
$\mathrm{sign}(\mathbf{w}^{T} \mathbf{x} + b)$
Label Prediction
๏ Prediction performance for varying amounts of training data
๏ Bias against new systems when predicting the relevance of results solely contributed by one system
| Training data | Test data | KLD Precision | KLD Recall | KLD F1 | SVM Precision | SVM Recall | SVM F1 |
|---|---|---|---|---|---|---|---|
| 5% | 95% | 0.718 | 0.195 | 0.238 | 0.777 | 0.162 | 0.174 |
| 10% | 90% | 0.549 | 0.252 | 0.293 | 0.760 | 0.212 | 0.243 |
| 20% | 80% | 0.455 | 0.291 | 0.327 | 0.742 | 0.246 | 0.307 |
| 40% | 60% | 0.403 | 0.329 | 0.356 | 0.754 | 0.354 | 0.420 |
| 60% | 40% | 0.403 | 0.353 | 0.370 | 0.792 | 0.386 | 0.455 |
| 80% | 20% | 0.413 | 0.338 | 0.355 | 0.812 | 0.413 | 0.474 |
| Automatic-only | Rest | 0.331 | 0.318 | 0.262 | 0.613 | 0.339 | 0.355 |
| Manual-only | Rest | 0.233 | 0.400 | 0.231 | 0.503 | 0.419 | 0.364 |
KLD classifier:

| | MRR | P@10 | P@20 | nDCG@20 | Avg. Prec. | bpref | P@20(j) | RankEff |
|---|---|---|---|---|---|---|---|---|
| Avg. absolute rank diff. | 0.976 | 0.929 | 1.000 | 1.214 | 0.667 | 1.119 | 1.000 | 1.071 |
| Max. rank difference | 9↑/8↓ | 2↑/11↓ | 7↑/7↓ | 7↑/8↓ | 3↑/8↓ | 5↑/9↓ | 7↑/7↓ | 5↑/5↓ |
| RMS error | 0.0499 | 0.0245 | 0.0238 | 0.0442 | 0.0067 | 0.0179 | 0.0238 | 0.0103 |
| % significant (p < 0.05) | 14.3% | 19.1% | 28.6% | 40.5% | 54.8% | 64.3% | 28.6% | 52.4% |

SVM classifier:

| | MRR | P@10 | P@20 | nDCG@20 | Avg. Prec. | bpref | P@20(j) | RankEff |
|---|---|---|---|---|---|---|---|---|
| Avg. absolute rank diff. | 0.595 | 0.500 | 0.619 | 0.691 | 0.691 | 0.667 | 0.619 | 0.643 |
| Max. rank difference | 1↑/7↓ | 0↑/4↓ | 1↑/6↓ | 4↑/5↓ | 3↑/7↓ | 2↑/5↓ | 1↑/6↓ | 1↑/4↓ |
| RMS error | 0.0071 | 0.0086 | 0.0088 | 0.0078 | 0.0046 | 0.0068 | 0.0088 | 0.0028 |
| % significant (p < 0.05) | 2.4% | 7.1% | 16.7% | 33.3% | 35.7% | 16.7% | 16.7% | 26.2% |
9.4. Low-Cost Evaluation
๏ Collecting relevance assessments is laborious and expensive
๏ Assuming that we know the returned results, have decided on an effectiveness measure (e.g., P@k), and are only interested in the relative order of (two) systems: Can we pick a minimal-size set of results to judge?
๏ Can we avoid collecting relevance assessments altogether?
20
Advanced Topics in Information Retrieval / Evaluation
Minimal Test Collections
๏ Carterette et al. [4] show how a minimal set of results to judge can
be selected so as to determine the relative order of two systems
๏ Example: System 1 and System 2 compared under P@3
๏ determine the sign of ΔP@3(S1, S2)
๏ judging a document only provides additional information if it is within the top-k of exactly one of the two systems
S1 = ⟨A, B, E, D, C⟩, S2 = ⟨C, B, D, A, E⟩
$\Delta P@k \;=\; \frac{1}{k} \sum_{i=1}^{n} x_i \cdot \mathbb{1}(\mathrm{rank}_1(i) \le k) \;-\; \frac{1}{k} \sum_{i=1}^{n} x_i \cdot \mathbb{1}(\mathrm{rank}_2(i) \le k) \;=\; \frac{1}{k} \sum_{i=1}^{n} x_i \cdot \left[\mathbb{1}(\mathrm{rank}_1(i) \le k) - \mathbb{1}(\mathrm{rank}_2(i) \le k)\right]$
Advanced Topics in Information Retrieval / Evaluation
Minimal Test Collections
๏ iteratively judge documents for which the condition below holds
๏ determine an upper and a lower bound on ΔP@k(S1, S2) after every judgment (in the example below: upper bound if C is irrelevant, lower bound if C is relevant)
๏ terminate collecting relevance assessments as soon as the upper bound is smaller than 0 or the lower bound is larger than 0
$\mathbb{1}(\mathrm{rank}_1(i) \le k) - \mathbb{1}(\mathrm{rank}_2(i) \le k) \neq 0$
[Figure: judged documents marked on S1 = ⟨A, B, E, D, C⟩ and S2 = ⟨C, B, D, A, E⟩; after the first judgments ΔP@3(S1, S2) = 2/3 − 0/3, then ΔP@3(S1, S2) ≤ 2/3 − 0/3 (upper bound) and 2/3 − 1/3 ≤ ΔP@3(S1, S2) (lower bound)]
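A small sketch of the bounding idea for binary relevance (an illustration, not code from Carterette et al. [4]); unjudged documents that appear in exactly one top-k contribute their best and worst case to the bounds:

```python
def delta_p_at_k_bounds(run1, run2, judgments, k=3):
    """Lower/upper bounds on ΔP@k(S1, S2) given partial binary judgments (1/0)."""
    top1, top2 = set(run1[:k]), set(run2[:k])
    lower = upper = 0.0
    for d in top1 | top2:
        weight = (d in top1) - (d in top2)      # +1, -1, or 0 (document cancels out)
        if weight == 0:
            continue
        if d in judgments:                      # judged: fixed contribution
            lower += weight * judgments[d] / k
            upper += weight * judgments[d] / k
        else:                                   # unjudged: worst case vs. best case
            lower += min(weight, 0) / k
            upper += max(weight, 0) / k
    return lower, upper

# Rankings from the example; A and B judged relevant, C, D, E still unjudged
print(delta_p_at_k_bounds(["A", "B", "E", "D", "C"], ["C", "B", "D", "A", "E"],
                          {"A": 1, "B": 1}, k=3))
```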
Automatic Assessments
๏ Efron [8] proposes to assess the relevance of results automatically
๏ Key Idea: The same information need can be expressed by many query articulations (aspects)
๏ Approach (see the sketch below):
๏ determine for each topic t a set of aspects a1 … am
๏ retrieve top-k results Rk(ai) with a baseline system for each ai
๏ consider all results in the union of the Rk(ai) relevant
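In code, the pseudo-qrels are just the union of per-aspect result sets; retrieve_top_k stands for an assumed baseline retrieval function:

```python
def automatic_qrels(aspects, retrieve_top_k):
    """Pseudo-relevant documents: union of top-k results over all query aspects."""
    relevant = set()
    for aspect in aspects:
        relevant |= set(retrieve_top_k(aspect))   # Rk(ai) for aspect ai
    return relevant
```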
Automatic Assessments
๏ How to determine query articulations (aspects)?
๏ manually, by giving users the topic description, letting them search on Google, Yahoo, and Wikipedia, and recording their query terms
๏ automatically, by using automatic query expansion methods based on pseudo-relevance feedback
๏ Experiments on TREC-3, TREC-7, TREC-8 with
๏ two manual aspects (A1, A2) per topic (by the author and an assistant)
๏ two automatic aspects (A3, A4) derived from A1 and A2
๏ Okapi BM25 as the baseline retrieval model
Automatic Assessments
๏ Kendall’s τ between original system ranking under MAP and
system ranking determined with automatic assessments
๏ Performance of query aspects A1…A4 when used in isolation
| Data | TREC-3 | TREC-7 | TREC-8 |
|---|---|---|---|
| aMAP τ | 0.852 | 0.867 | 0.77 |

[Figure: scatter plots of aMAP against system rank for TREC-3, TREC-7, and TREC-8]

| Data | A1 | A2 | A3 | A4 | Union |
|---|---|---|---|---|---|
| TREC-3 | 0.773 | 0.857 | 0.778 | 0.827 | 0.852 |
| TREC-7 | 0.78 | 0.796 | 0.772 | 0.801 | 0.867 |
| TREC-8 | 0.747 | 0.77 | 0.72 | 0.709 | 0.77 |
9.5. Crowdsourcing
๏ Crowdsourcing platforms provide a cheap and readily available
alternative to hiring skilled workers for relevance assessments
๏ Amazon Mechanical Turk (AMT) (mturk.com)
๏ CrowdFlower (crowdflower.com)
๏ oDesk (odesk.com)
๏ Human Intelligence Tasks (HITs) are small tasks that are easy
for humans but difficult for machines (e.g., labeling an image)
๏ workers are paid a small amount (often $0.01–$0.05) per HIT
๏ workers come from all over the globe and have different demographics
Example HIT
[Figure: example HIT]
Crowdsourcing Best Practices
๏ Alonso [1] describes best practices for crowdsourcing
๏ clear instructions and a description of the task in simple language
๏ use highlighting (bold, italics) and show examples
๏ ask for a justification of the input (e.g., why do you think it is relevant?)
๏ provide an “I don’t know” option
Crowdsourcing Best Practices
๏ assign the same task to multiple workers and use majority voting
๏ continuous quality monitoring and control of the workforce
๏ before launch: use a qualification test or an approval-rate threshold
๏ during execution: use honey pots (tasks with a known answer), ban workers who provide unsatisfactory input
๏ after execution: check assessor agreement (if applicable), filter out input that was provided too quickly
Cohen’s Kappa
๏ Cohen’s kappa measures agreement between two assessors
๏ Intuition: How much does the actual agreement P[A] deviate from the expected agreement P[E]?
๏ Example: Assessors Ai, Categories Cj
๏ actual agreement: 20/35
๏ expected agreement: 10/35 · 8/35 + 10/35 · 11/35 + 15/35 · 16/35
๏ Cohen’s kappa: ≈ 0.34
$\kappa = \frac{P[A] - P[E]}{1 - P[E]}$

| A1 \ A2 | C1 | C2 | C3 |
|---|---|---|---|
| C1 | 5 | 2 | 3 |
| C2 | 2 | 5 | 3 |
| C3 | 1 | 4 | 10 |
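A sketch that reproduces the example from the contingency table above (rows: assessor A1, columns: assessor A2):

```python
def cohens_kappa(table):
    """Cohen's kappa from a contingency table of two assessors' category assignments."""
    total = sum(sum(row) for row in table)
    p_a = sum(table[i][i] for i in range(len(table))) / total      # actual agreement
    row_marg = [sum(row) / total for row in table]
    col_marg = [sum(col) / total for col in zip(*table)]
    p_e = sum(r * c for r, c in zip(row_marg, col_marg))            # expected agreement
    return (p_a - p_e) / (1 - p_e)

print(cohens_kappa([[5, 2, 3], [2, 5, 3], [1, 4, 10]]))  # ≈ 0.34
```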
Fleiss’ Kappa
๏ Fleiss’ kappa measures agreement between
a fixed number of assessors
๏ Intuition: How much does the actual agreement P[ A ]
deviate from expected agreement P[ E ]
๏ Definition: Assessors Ai, Subjects Sj, Categories Ck
and njk as the number of assessors who assigned Sj to Ck
๏ Probability pk that category Ck is assigned
$\kappa = \frac{P[A] - P[E]}{1 - P[E]} \qquad p_k = \frac{1}{|S|\,|A|} \sum_{j=1}^{|S|} n_{jk}$
Fleiss’ Kappa
๏ Probability Pj that two assessors agree on the category for subject Sj
๏ Actual agreement as the average agreement over all subjects
๏ Expected agreement between two assessors
$P_j = \frac{1}{|A|(|A| - 1)} \sum_{k=1}^{|C|} n_{jk}(n_{jk} - 1) \qquad P[A] = \frac{1}{|S|} \sum_{j=1}^{|S|} P_j \qquad P[E] = \sum_{k=1}^{|C|} p_k^2$
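The three formulas combined in a short sketch; the agreement matrix in the example is hypothetical (4 subjects, 3 assessors, 2 categories):

```python
def fleiss_kappa(n):
    """Fleiss' kappa; n[j][k] = number of assessors who assigned subject j to category k."""
    n_subjects, n_assessors = len(n), sum(n[0])
    p_k = [sum(row[k] for row in n) / (n_subjects * n_assessors) for k in range(len(n[0]))]
    P_j = [sum(c * (c - 1) for c in row) / (n_assessors * (n_assessors - 1)) for row in n]
    P_A = sum(P_j) / n_subjects
    P_E = sum(p * p for p in p_k)
    return (P_A - P_E) / (1 - P_E)

print(fleiss_kappa([[3, 0], [2, 1], [1, 2], [3, 0]]))  # ≈ 0.11
```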
Crowdsourcing vs. TREC
๏ Alonso and Mizzaro [2] investigate whether crowdsourced
relevance assessments can replace TREC assessors
๏ 10 topics from TREC-7 and TREC-8, 22 documents per topic
๏ 5 binary assessments per (topic, document) pair from AMT
๏ Fleiss’ kappa among AMT workers: 0.195 (slight)
๏ Fleiss’ kappa among AMT workers and TREC assessor: 0.229 (fair)
๏ Cohen’s kappa between the majority vote among AMT workers and the TREC assessor: 0.478 (moderate)
9.6. Online Evaluation
๏ Cranfield paradigm not suitable when evaluating online systems
๏ need for rapid testing of small innovations
๏ some innovations (e.g., result layout) do not affect the ranking
๏ some innovations (e.g., personalization) are hard to assess by others
๏ hard to represent the user population in 50, 100, 500 queries
A/B Testing
๏ A/B testing exposes two large-enough user populations to
products A and B and measures differences in behavior
๏ has its roots in marketing (e.g., pick the best box for a cereal)
๏ deploy the innovation on a small fraction of users (e.g., 1%)
๏ define a performance indicator (e.g., click-through on the first result)
๏ compare performance against the rest of the users (the other 99%) and test for statistical significance
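To illustrate the last step, a minimal sketch of a two-sided two-proportion z-test on click-through rates; the numbers are made up, and a production A/B framework would involve more (variance reduction, multiple-testing control, etc.):

```python
from math import sqrt
from statistics import NormalDist

def ab_ztest(clicks_a, users_a, clicks_b, users_b):
    """Two-sided two-proportion z-test on the click-through rates of variants A and B."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p = (clicks_a + clicks_b) / (users_a + users_b)          # pooled proportion
    se = sqrt(p * (1 - p) * (1 / users_a + 1 / users_b))     # standard error
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical 1% bucket on variant B vs. the remaining users on variant A
print(ab_ztest(clicks_a=48_500, users_a=990_000, clicks_b=560, users_b=10_000))
```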
Interleaving
๏ Idea: Given result rankings A = (a1…ak) and B = (b1…bk)
๏ construct an interleaved ranking I which mixes A and B
๏ show I to users and record the number of clicks on individual results
๏ a click on a result scores a point for A, B, or both
๏ derive the users’ preference for A or B based on the total number of clicks
๏ Team-Draft Interleaving Algorithm:
๏ flip a coin to decide whether A or B starts selecting results (players)
๏ A and B take turns and select yet-unselected results
๏ the interleaved result I is based on the order in which results are picked
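A sketch of the algorithm as described above (strict alternation after one initial coin flip; commonly described team-draft variants re-flip the coin in every round):

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Interleave two rankings; returns the mixed list and which team picked each result."""
    interleaved, team_of = [], {}
    turn_a = random.random() < 0.5                      # coin flip: which ranking starts

    def remaining(ranking):
        return [d for d in ranking if d not in team_of]

    while remaining(ranking_a) or remaining(ranking_b):
        candidates = remaining(ranking_a if turn_a else ranking_b)
        if candidates:                                  # pick the highest-ranked new result
            doc = candidates[0]
            team_of[doc] = "A" if turn_a else "B"
            interleaved.append(doc)
        turn_a = not turn_a                             # the other ranking picks next
    return interleaved, team_of                         # team_of[d] credits clicks on d

print(team_draft_interleave(["a", "b", "c"], ["b", "d", "a"]))
```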
Summary
๏ Cranfield paradigm for IR evaluation (provide documents,
topics, and relevance assessments) goes back to 1960s
๏ Non-traditional effectiveness measures handle graded
relevance assessments and implement more realistic user models
๏ Incomplete judgments can be dealt with by using (modified)
effectiveness measures or by predicting assessments
๏ Low-cost evaluation seeks to reduce the number of relevance assessments required to determine a system ranking
๏ Crowdsourcing is a possible alternative to skilled assessors but requires redundancy and careful test design
๏ A/B testing and interleaving as forms of online evaluation
References
[1] O. Alonso: Implementing crowdsourcing-based relevance experimentation: an industrial perspective, Information Retrieval 16:101–120, 2013
[2] O. Alonso and S. Mizzaro: Using crowdsourcing for TREC relevance assessment, Information Processing & Management 48:1053–1066, 2012
[3] S. Büttcher, C. L. A. Clarke, P. C. K. Yeung: Reliable Information Retrieval Evaluation with Incomplete and Biased Judgments, SIGIR 2007
[4] B. Carterette, J. Allan, R. Sitaraman: Minimal Test Collections for Information Retrieval, SIGIR 2006
[5] B. Carterette and J. Allan: Semiautomatic Evaluation of Retrieval Systems Using Document Similarities, CIKM 2007
[6] O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan: Expected Reciprocal Rank for Graded Relevance, CIKM 2009
[7] O. Chapelle, T. Joachims, F. Radlinski, Y. Yue: Large-Scale Validation and Analysis of Interleaved Search Evaluation, ACM TOIS 30(1), 2012
[8] M. Efron: Using Multiple Query Aspects to Build Test Collections without Human Relevance Judgments, ECIR 2009
[9] A. Moffat and J. Zobel: Rank-Biased Precision for Measurement of Retrieval Effectiveness, ACM TOIS 27(1), 2008
[10] T. Sakai: Alternatives to Bpref, SIGIR 2007