Modelling A User Population for Designing Information Retrieval Metrics

Modelling A User Population for Designing Information Retrieval Metrics Tetsuya Sakai (NewsWatch, Inc.) tetsuyasakai@acm.org Stephen Robertson (Microsoft Research Cambridge) ser@microsoft.com EVIA 2008, December 16, 2008@NII, Tokyo

TALK OUTLINE
1. Objectives
2. Normalised Cumulative Precision (NCP)
3. Normalised Cumulative Utility (NCU)
4. Evaluating Evaluation Metrics: Resemblance to Average Precision
5. Evaluating Evaluation Metrics: Discriminative Power
6. Conclusions

Average Precision (AP)
$\mathrm{AP} = \frac{1}{R} \sum_{r} I(r)\, P(r)$
where R is the number of relevant docs, I(r) = 1 iff the doc at rank r is relevant, and P(r) is the precision at rank r.
• Used widely since the advent of TREC
• Mean over topics is referred to as “MAP”
• Cannot handle graded relevance (but many IR researchers just love it)
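As a minimal sketch of the AP computation described on this slide, the function below sums the precision at every rank that holds a relevant document and divides by the number of relevant documents R. The function name, the binary judgment list and the value R = 4 are illustrative assumptions, not taken from the talk.

```python
def average_precision(rels, num_relevant):
    """AP = (1/R) * sum over ranks r of I(r) * P(r)."""
    hits = 0
    total = 0.0
    for r, rel in enumerate(rels, start=1):
        if rel:                  # I(r) = 1 iff the doc at rank r is relevant
            hits += 1
            total += hits / r    # P(r): precision at rank r
    return total / num_relevant


# Hypothetical binary judgments for a 7-document ranked list.
rels = [1, 1, 0, 1, 0, 0, 1]
# Assumption: these 4 documents are all the relevant documents (R = 4).
ap = average_precision(rels, num_relevant=4)
print(round(ap, 4))
```

Because the divisor is R rather than the number of relevant documents retrieved, a relevant document that is never retrieved simply contributes zero.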

Criticisms of (Mean) Average Precision ((M)AP)
• AP may be a poor measure of user performance/satisfaction [Turpin/Scholer SIGIR 06 etc.]
• AP lacks a user model
 - “there is no single user application that directly motivates MAP” [Buckley/Voorhees TREC book]
 - “there is no plausible search model that corresponds to MAP, because no user knows in advance the number of relevant answers present in the collection…” [Moffat/Webber/Zobel SIGIR 07]

“AP lacks a user model?” Rubbish! [Robertson SIGIR 08]

Objectives
• Robertson showed that AP is a special case of Normalised Cumulative Precision (NCP), which models a population of users.
• We generalise NCP and introduce Normalised Cumulative Utility (NCU), and show that
 - AP and Q-measure are in fact reasonable metrics!
 - A version of NCU, which utilises graded relevance in a novel way, has high discriminative power!

Information need: “I need the latest information on EVIA!” (submitted to the system as a query)

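Example ranked list (relevance level of the doc at each rank):
Rank 1: L1 (partially relevant)
Rank 2: L3 (highly relevant)
Rank 3: L0 (not relevant)
Rank 4: L2 (relevant)
Rank 5: L0 (not relevant)
Rank 6: L0 (not relevant)
Rank 7: L3 (highly relevant)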

Where do users stop scanning the list?
Rank 1 (L1): “I stop at L1”
Rank 2 (L3): “I stop at L3”
Rank 3 (L0)
Rank 4 (L2): “I stop at L2”
Rank 5 (L0)
Rank 6 (L0)
Rank 7 (L3): “I stop at L3”

p_u: Uniform Distribution over Relevant Documents
ASSUMPTIONS:
 - Users stop at a relevant doc;
 - The stopping probability is uniform across all relevant docs

p_rb: Rank-Biased Distribution over Relevant Docs
ASSUMPTIONS:
 - Users stop at a relevant doc;
 - Users tend to stop near the top rather than near the bottom

p_gu: Graded-Uniform Distribution over Relevant Docs
ASSUMPTIONS:
 - Users stop at a relevant doc;
 - Users tend to stop at a highly relevant doc rather than at a partially relevant doc

Robertson’s Normalised Cumulative Precision (NCP)
$\mathrm{NCP} = \sum_{n} p_s(n)\, P(n)$
• p_s(n): probability that the user stops at the (relevant) document at rank n
• P(n): precision at n, i.e. utility/cost given the stopping point, where utility = #relevant seen so far and cost = #docs seen so far
• The sum is an expectation over a user population

p_u: Uniform Distribution over Relevant Documents
ASSUMPTIONS:
 - Users stop at a relevant doc;
 - The stopping probability is uniform across all relevant docs
Let p_s(n) = p_u(n) = 1/R for every rank n that has a relevant doc. Then NCP reduces to AP (= NCP_u)!
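The reduction of NCP to AP under the uniform stopping distribution can be checked numerically. The sketch below computes NCP as the stopping-probability-weighted sum of precision values, then compares it with AP computed directly. The helper names, the judgment list and R = 4 are illustrative assumptions.

```python
def precision_at(rels, n):
    """P(n): fraction of the top n docs that are relevant."""
    return sum(rels[:n]) / n


def ncp(rels, stop_prob):
    """NCP = sum over ranks n of p_s(n) * P(n); stop_prob[n-1] holds p_s(n)."""
    return sum(p * precision_at(rels, n)
               for n, p in enumerate(stop_prob, start=1) if p > 0)


def uniform_stop_prob(rels, num_relevant):
    """p_u(n) = 1/R at every rank holding a relevant doc, 0 elsewhere."""
    return [1 / num_relevant if rel else 0.0 for rel in rels]


# Hypothetical binary judgments; assumption: R = 4, all relevant docs retrieved.
rels = [1, 1, 0, 1, 0, 0, 1]
R = 4

ncp_u = ncp(rels, uniform_stop_prob(rels, R))
ap = sum(precision_at(rels, n) for n, rel in enumerate(rels, 1) if rel) / R
print(round(ncp_u, 4), round(ap, 4))
```

The two printed values agree, since multiplying each precision value by 1/R and summing is the same as averaging over the R relevant documents.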

That is, • AP is a special case of NCP. • It is an expectation of utility/cost over a user population whose stopping probability is uniform across all relevant documents. • Hence, AP is in fact a reasonable metric!

We generalise NCP in two ways
• Stopping probability:
 - p_u (uniform)
 - p_rb (rank-biased)
 - p_gu (graded-uniform)
• Normalised utility:
 - BR(n) (blended ratio), which generalises P(n)

γ: top-heaviness parameter for p_rb
• γ = 1 reduces p_rb to p_u
[Figure: stopping probability plotted over the relevant documents found in the ranked list]
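The slide does not state the functional form of the rank-biased distribution, so the following is only a sketch under a loud assumption: the i-th relevant document found in the list receives weight γ^(i-1), and the weights are normalised over the relevant documents found. This form is chosen because it is top-heavy for γ < 1 and flat for γ = 1, matching the slide; it is not claimed to be the authors' exact definition.

```python
def rank_biased_stop_prob(rels, gamma):
    """Assumed form: weight gamma**(i-1) for the i-th relevant doc found,
    normalised so the probabilities over the ranked list sum to 1."""
    weights = []
    found = 0
    for rel in rels:
        if rel:
            weights.append(gamma ** found)
            found += 1
        else:
            weights.append(0.0)   # users never stop at a non-relevant doc
    total = sum(weights)
    return [w / total for w in weights] if total else weights


# Hypothetical binary judgments for a 7-document ranked list.
rels = [1, 1, 0, 1, 0, 0, 1]
for gamma in (1.0, 0.9, 0.7, 0.5):
    probs = rank_biased_stop_prob(rels, gamma)
    print(gamma, [round(p, 3) for p in probs])
```

With γ = 1 every relevant document found gets the same probability, which is the uniform case when all relevant documents are retrieved; smaller γ shifts probability mass towards the top of the list.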

p_gu: Graded-Uniform Distribution over Relevant Docs
ASSUMPTIONS:
 - Users stop at a relevant doc;
 - Users tend to stop at a highly relevant doc rather than at a partially relevant doc
Stopping weights:
 - stop(L3):stop(L2):stop(L1) = 3:2:1
 - stop(L3):stop(L2):stop(L1) = 10:5:1
 - (stop(L3):stop(L2):stop(L1) = 1:1:1 reduces p_gu to p_u)
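To make the stopping weights concrete, the sketch below turns a list of relevance levels into a graded-uniform stopping distribution by normalising the per-level weights. It assumes, for illustration only, that the relevant documents in the example list are all the relevant documents for the topic; the function and variable names are hypothetical.

```python
def graded_uniform_stop_prob(grades, stop_weight):
    """Each relevant doc gets the weight of its relevance level;
    weights are normalised so the probabilities sum to 1."""
    weights = [float(stop_weight.get(g, 0)) for g in grades]
    total = sum(weights)
    return [w / total for w in weights] if total else weights


# Relevance levels of the example ranked list: L1 L3 L0 L2 L0 L0 L3.
grades = [1, 3, 0, 2, 0, 0, 3]

p_321 = graded_uniform_stop_prob(grades, {3: 3, 2: 2, 1: 1})
p_1051 = graded_uniform_stop_prob(grades, {3: 10, 2: 5, 1: 1})
p_flat = graded_uniform_stop_prob(grades, {3: 1, 2: 1, 1: 1})

print([round(p, 3) for p in p_321])
print([round(p, 3) for p in p_1051])
print(p_flat)
```

With weights 3:2:1 each L3 document is three times as likely a stopping point as the L1 document, while the flat weights give every relevant document the same probability, which is p_u.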

Blended Ratio (BR)
BR blends precision with normalised cumulative gain in order to handle graded relevance.
• A large β represents a very persistent user; β = 0 reduces BR to P
• BR is suitable as a utility/cost function because, given the stopping point n, it does NOT matter where the relevant documents are within top n.
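The formula itself did not survive on this slide. As a hedged sketch, the code below uses the blended ratio as I understand it from Sakai's Q-measure work: BR(n) = (C(n) + β·cg(n)) / (n + β·cig(n)), where C(n) counts relevant docs in the top n, cg(n) is the cumulative gain and cig(n) is the cumulative gain of an ideal list. The gain values (3, 2, 1 for L3, L2, L1) and the assumption that the ideal list contains exactly the relevant documents of the example are illustrative choices.

```python
def blended_ratio(grades, ideal_grades, n, beta, gain=None):
    """BR(n) = (C(n) + beta * cg(n)) / (n + beta * cig(n))."""
    gain = gain or {0: 0, 1: 1, 2: 2, 3: 3}        # assumed gain values
    top = grades[:n]
    count = sum(1 for g in top if g > 0)           # C(n): relevant docs in top n
    cg = sum(gain[g] for g in top)                 # cumulative gain
    cig = sum(gain[g] for g in ideal_grades[:n])   # ideal cumulative gain
    return (count + beta * cg) / (n + beta * cig)


grades = [1, 3, 0, 2, 0, 0, 3]                     # L1 L3 L0 L2 L0 L0 L3
ideal = sorted((g for g in grades if g > 0), reverse=True)

print(blended_ratio(grades, ideal, n=4, beta=0))   # plain precision at 4
print(blended_ratio(grades, ideal, n=4, beta=1))
# Reordering docs inside the top 4 leaves BR(4) unchanged.
shuffled = [0, 2, 3, 1, 0, 0, 3]
print(blended_ratio(shuffled, ideal, n=4, beta=1))
```

The last two calls return the same value, which illustrates the point on the slide: for a fixed stopping point n, BR depends only on which documents are in the top n, not on their order.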

NCU family
• Stopping probability:
 - p_rb (rank-biased) with top-heaviness parameter γ (γ = 1 reduces p_rb to p_u)
 - p_gu (graded-uniform) with stopping weights (flat weights reduce p_gu to p_u)
• Normalised utility given the stopping point:
 - BR(n) (blended ratio) with persistence parameter β (β = 0 reduces BR(n) to P(n))
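Putting the two components together, a member of the NCU family can be sketched as the expectation of the normalised utility over the stopping distribution: NCU = Σ_n p(n)·BR(n). The code below is an illustration under stated assumptions: the gain of a document equals its relevance level (3, 2, 1), the ideal list holds exactly the relevant documents of the example, and the blended ratio takes the form (C(n) + β·cg(n)) / (n + β·cig(n)).

```python
def ncu(grades, ideal_grades, stop_prob, beta):
    """NCU = sum over ranks n of stop_prob(n) * BR(n)."""
    score = 0.0
    count = cg = cig = 0
    for n, (grade, p) in enumerate(zip(grades, stop_prob), start=1):
        count += 1 if grade > 0 else 0            # C(n)
        cg += grade                               # assumed gain = grade
        cig += ideal_grades[n - 1] if n <= len(ideal_grades) else 0
        score += p * (count + beta * cg) / (n + beta * cig)
    return score


grades = [1, 3, 0, 2, 0, 0, 3]                    # L1 L3 L0 L2 L0 L0 L3
ideal = sorted((g for g in grades if g > 0), reverse=True)

uniform = [0.25 if g > 0 else 0.0 for g in grades]          # p_u
graded = [g / sum(grades) for g in grades]                  # p_gu, weights 3:2:1

print(round(ncu(grades, ideal, uniform, beta=0), 4))  # equals AP
print(round(ncu(grades, ideal, uniform, beta=1), 4))
print(round(ncu(grades, ideal, graded, beta=0), 4))
```

With the uniform distribution and β = 0 the score is AP. With the uniform distribution and β = 1 it should, as far as I understand the definitions, coincide with Q-measure.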

Comparing a system ranking by Metric M to that by AP
• Kendall’s rank correlation
 - Monotonic function of the probability that a randomly chosen system pair is ordered identically in two rankings
• Yilmaz/Aslam/Robertson (YAR) rank correlation [SIGIR 08]
 - Monotonic function of the probability that a randomly chosen system and one ranked above it are ordered identically in two rankings
 - Assumes that the top ranks are the most important
 - Not symmetrical, but is almost symmetrical in practice
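The difference between the two correlations is easiest to see on a toy example. The sketch below implements Kendall's tau and the YAR correlation in the form I recall from the SIGIR 08 paper, τ_ap = 2/(N-1) · Σ_{i=2..N} C(i)/(i-1) - 1, where C(i) is the number of systems above rank i that the reference ranking also places above the system at rank i. Both functions assume rankings without ties; the system names are hypothetical.

```python
from itertools import combinations


def kendall_tau(ranking, reference):
    """Both arguments list the same system ids, best first, no ties."""
    pos = {s: i for i, s in enumerate(reference)}
    n = len(ranking)
    total = n * (n - 1) // 2
    concordant = sum(1 for x, y in combinations(ranking, 2) if pos[x] < pos[y])
    return (2 * concordant - total) / total


def tau_ap(ranking, reference):
    """YAR correlation of `ranking` with respect to `reference`."""
    pos = {s: i for i, s in enumerate(reference)}
    n = len(ranking)
    acc = 0.0
    for i in range(1, n):                      # 0-based index of ranks 2..N
        correct = sum(1 for s in ranking[:i] if pos[s] < pos[ranking[i]])
        acc += correct / i
    return 2 * acc / (n - 1) - 1


reference = ["A", "B", "C", "D"]
swap_top = ["B", "A", "C", "D"]
swap_bottom = ["A", "B", "D", "C"]

print(kendall_tau(swap_top, reference), kendall_tau(swap_bottom, reference))
print(tau_ap(swap_top, reference), tau_ap(swap_bottom, reference))
```

Kendall's tau scores both swaps identically, whereas the YAR correlation penalises the swap at the top more heavily, which is the "top ranks are the most important" assumption from the slide.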

YAR rank correlation with AP (NCU_u, β = 0): NTCIR-6J
[Bar chart: YAR rank correlation with AP for Q, for NCU_rb (γ = 1, 0.9, 0.7, 0.5) and for NCU_gu (stop weights 3:2:1), each with β = 0 and β = 1; labelled bar values below 1 range from 0.589 to 0.96]
• Q, NCU_gu (β = 0) and NCU_gu (β = 1) produce rankings that are very similar to that by AP
• Heavy rank bias produces very unconventional system rankings

YAR rank correlation with AP (NCU_u, β = 0): TREC03
[Bar chart: YAR rank correlation with AP for Q, for NCU_rb (γ = 1, 0.9, 0.7, 0.5) and for NCU_gu (stop weights 3:2:1), each with β = 0 and β = 1; labelled bar values below 1 range from 0.524 to 0.925]
• Q, NCU_gu (β = 0) and NCU_gu (β = 1) produce rankings that are very similar to that by AP
• Heavy rank bias produces very unconventional system rankings

Measuring discriminative power of metrics [Sakai SIGIR 06]
• Given a set of systems and a significance level α, for how many system pairs can a metric detect statistical significance? (α is the probability of a Type I error; α = 0.05 ⇔ 95% confidence)
• Sakai’s method uses the bootstrap test, and can also estimate the absolute performance difference required to achieve statistical significance (e.g. “a difference of 0.20 is usually statistically significant”)
• Sakai’s method and the Voorhees/Buckley swap method [SIGIR 02] give similar results in practice
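The idea of counting significantly different system pairs can be sketched as follows. This is a simplified paired bootstrap test on per-topic score differences (studentised statistic, differences centred to enforce the null hypothesis), written in the spirit of the method named on the slide rather than as a faithful reproduction of it. The run names, per-topic scores and the number of bootstrap samples are all hypothetical.

```python
import random
from itertools import combinations
from statistics import mean, stdev


def bootstrap_asl(scores_x, scores_y, num_samples=1000, seed=0):
    """Achieved significance level of a paired, two-sided bootstrap test."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    n = len(diffs)
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        return 1.0 if m == 0 else 0.0
    t_obs = abs(m) / (sd / n ** 0.5)
    centred = [d - m for d in diffs]          # shift so the null hypothesis holds
    exceed = 0
    for _ in range(num_samples):
        sample = [rng.choice(centred) for _ in range(n)]
        s = stdev(sample)
        if s == 0:                            # degenerate resample
            exceed += 1 if mean(sample) != 0 else 0
        elif abs(mean(sample)) / (s / n ** 0.5) >= t_obs:
            exceed += 1
    return exceed / num_samples


def discriminative_power(runs, alpha=0.05):
    """Fraction of run pairs whose difference is significant at level alpha."""
    pairs = list(combinations(sorted(runs), 2))
    hits = sum(1 for a, b in pairs if bootstrap_asl(runs[a], runs[b]) < alpha)
    return hits / len(pairs)


# Hypothetical per-topic scores of three runs over the same ten topics.
runs = {
    "runA": [0.50, 0.60, 0.55, 0.70, 0.65, 0.60, 0.58, 0.62, 0.66, 0.59],
    "runB": [0.30, 0.38, 0.37, 0.49, 0.46, 0.40, 0.35, 0.45, 0.46, 0.38],
    "runC": [0.51, 0.59, 0.57, 0.68, 0.66, 0.59, 0.60, 0.60, 0.67, 0.58],
}
print(discriminative_power(runs))
```

A metric with higher discriminative power yields a larger fraction of significant pairs for the same set of runs and topics; comparing metrics then amounts to recomputing the per-topic scores with each metric and repeating the count.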

Discriminative power at α = 0.05: NTCIR-6J
[Bar chart: discriminative power of AP, Q, NCU_rb (γ = 1, 0.9, 0.7, 0.5) and NCU_gu, each with β = 0 and β = 1; labelled bar values range from 48.9 to 64.4]
• AP, Q, NCU_gu (β = 0) and NCU_gu (β = 1) have high discriminative power
• Heavy rank bias hurts discriminative power (by ignoring low-ranked docs)

Discriminative power at α = 0.05: TREC03
[Bar chart: discriminative power of AP, Q, NCU_rb (γ = 1, 0.9, 0.7, 0.5) and NCU_gu, each with β = 0 and β = 1; labelled bar values range from 40.8 to 68.3]
• AP, Q, NCU_gu (β = 0) and NCU_gu (β = 1) have high discriminative power
• Heavy rank bias hurts discriminative power (by ignoring low-ranked docs)

Effect of γ on discriminative power: TREC03
[Figure: achieved significance level (ASL) plotted for run pairs sorted by ASL]
• Heavy rank bias hurts discriminative power (by ignoring low-ranked docs)
