Extending average precision to graded relevance judgements

Extending average precision to graded relevance judgements
Stephen Robertson, Evangelos Kanoulas, Emine Yilmaz

Motivation
• For binary relevance, we have a plethora of effectiveness metrics
  – most commonly used by far: (M)AP
  – we know quite a lot about AP and its relation to other binary metrics
• For graded relevance, we have very few
  – only commonly used one is nDCG

Some points
Both metrics are top-heavy
• AP naturally so
• nDCG requires definition of a rank discount function
  (we would like better understanding of discounting by rank)
nDCG also requires a gain function
• something similar is almost certainly required for a graded measure
  ... but alternative interpretations are available
Primary question: Can we define a graded analogue of AP?

Outline
• What do graded judgements mean
  – queries and needs in evaluation experiments
• Definition of Graded Average Precision
  – some properties
• Evaluating GAP
  – informativeness
  – discrimination
  – use for learning to rank

What do graded judgements mean?
Cranfield 2 used a 5-point scale (4+1)
Quote from one of the judges:
  "I believe that I have 'scored' the documents roughly in proportion to the degree of irritation I should feel if a librarian produced them in response to my original query"
... but few later experiments have used more than 2 or 3 (1+1 or 2+1)
... until the web search engine era
  – where we seem to be back to about 5

Aside: queries and needs
Cranfield/TREC tradition:
  – define/catch information need
  – construct query
  – judge against need
Web search engine practice:
  – catch query
  – allow for multiple needs represented by the same query
  – judge accordingly

What do graded judgements mean?
One document is more useful than another
One possible meaning: one document is useful to more users than another
Hence the following:
  assume grades of relevance...
  ... but that user has a threshold relevance grade which defines a binary view
  different users have different thresholds
  described by a probability distribution over users

Graded Average Precision
Assume relevance grades {0...c}
  – 0 for non-relevant, + c positive grades
g_i = P(user threshold is at i) for i ∈ {1...c}
  i.e. user regards grades {i...c} as relevant, grades {0...(i-1)} as not relevant
  the g_i sum to one
Step down the ranked list, stopping at documents that may be relevant
  – then calculate expected precision at each of these
    (expected over the population of users)
  – formula in the paper
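The slide defers the formula to the paper, so the following is only a sketch reconstructed from the user model described above, not the authors' reference implementation. It assumes GAP is the expected sum of precisions at "relevant" documents divided by the expected number of "relevant" documents, both taken over the distribution of user thresholds. The names graded_average_precision, grades, g and total_per_grade are hypothetical.

```python
from typing import Optional, Sequence


def graded_average_precision(
    grades: Sequence[int],
    g: Sequence[float],
    total_per_grade: Optional[Sequence[int]] = None,
) -> float:
    """Sketch of GAP under the threshold user model (an assumption, see lead-in).

    grades: relevance grade (0..c) of the document at each rank, top first.
    g: g[i-1] = P(user threshold is at i), for i in 1..c; should sum to one.
    total_per_grade: total_per_grade[i] = number of documents of grade i in the
        whole collection; defaults to the counts found in the ranked list.
    """
    c = len(g)
    # cum[k] = P(threshold <= k) = P(a user regards a grade-k document as relevant)
    cum = [0.0]
    for p in g:
        cum.append(cum[-1] + p)

    if total_per_grade is None:
        total_per_grade = [sum(1 for x in grades if x == i) for i in range(c + 1)]

    # Expected number of documents a random user regards as relevant.
    expected_relevant = sum(total_per_grade[i] * cum[i] for i in range(1, c + 1))
    if expected_relevant == 0:
        return 0.0

    expected_precision_sum = 0.0
    for n, grade_n in enumerate(grades, start=1):
        if grade_n == 0:
            continue  # never regarded as relevant, so we do not stop here
        # A document at rank m <= n counts for users whose threshold is at most
        # min(grade_m, grade_n): they regard both documents as relevant.
        joint = sum(cum[min(grade_m, grade_n)] for grade_m in grades[:n])
        expected_precision_sum += joint / n

    return expected_precision_sum / expected_relevant


if __name__ == "__main__":
    print(graded_average_precision([2, 1], [0.5, 0.5]))
    print(graded_average_precision([1, 2], [0.5, 0.5]))
    print(graded_average_precision([1, 0, 1], [1.0]))
```

Under these assumptions, a perfectly ordered list such as [2, 1] scores 1, swapping the two documents lowers the score to 5/6, and with a single positive grade (g = [1.0]) the value coincides with ordinary AP.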

Properties of GAP
• Generalises AP
  – similarly top-heavy
• Behaves correctly under swaps
• Has a similar probabilistic interpretation
• Is the area under a generalised recall-precision curve
• Can be justified under a similar user model
  – simple but moderately plausible!
  – the g_i may be estimated from user click behaviour

Probabilistic interpretation
For AP:
  1. Choose at random a relevant document d1
  2. Choose at random a document d2 ranked at or above d1
  3. AP = P(d2 is relevant)
For GAP:
  Do the same as above, under the GAP user model
  • replace relevant by "relevant" in the above, i.e. considered relevant by a user
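The three steps can be checked by exact enumeration. The sketch below reads step 2 as choosing d2 uniformly from the ranks at or above d1 (d1 included), which is the reading under which the probability equals AP. For GAP it assumes that step 1 picks a (user threshold, document) pair with probability proportional to g_j among pairs where that user regards the document as relevant, and that every relevant document appears in the ranked list. The function name gap_probabilistic is hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def gap_probabilistic(grades: Sequence[int], g: Sequence[Fraction]) -> Fraction:
    """P(d2 is "relevant") under the assumed GAP user model, by enumeration.

    grades: relevance grade (0..c) at each rank, top first.
    g: g[j-1] = P(user threshold is at j), given as Fractions for exactness.
    """
    c = len(g)
    # Step 1 candidates: (threshold j, rank n, weight g_j) such that a user with
    # threshold j regards the document at rank n as relevant.
    candidates = [
        (j, n, Fraction(g[j - 1]))
        for n, grade in enumerate(grades, start=1)
        for j in range(1, c + 1)
        if grade >= j
    ]
    total_weight = sum((w for _, _, w in candidates), Fraction(0))
    if total_weight == 0:
        return Fraction(0)

    probability = Fraction(0)
    for j, n, weight in candidates:
        # Step 2: d2 is uniform over ranks 1..n.
        # Step 3: chance that the same user regards d2 as relevant.
        relevant_at_or_above = sum(1 for grade in grades[:n] if grade >= j)
        probability += (weight / total_weight) * Fraction(relevant_at_or_above, n)
    return probability


if __name__ == "__main__":
    half = Fraction(1, 2)
    print(gap_probabilistic([1, 0, 1], [Fraction(1)]))  # binary case: plain AP
    print(gap_probabilistic([1, 2], [half, half]))
```

With one positive grade the enumeration reduces to the AP procedure above: for the list [1, 0, 1] it gives 5/6, which is (1/1 + 2/3) / 2.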

Area under curve
Define graded recall
  – expected number of "relevant" documents at some threshold / expected total
Define graded precision
  – similarly
Plot curve
  – GAP is approximately area under curve
Note: very different from Gain-based definitions of Kekäläinen and Järvelin

Evaluating GAP
Compare with nDCG
• Informativeness
  – should summarise the quality of a search engine well
• Discriminative power
  – should identify significant differences between systems
• Learning to rank objective
  – should lead to good test results with other metrics

Informativeness
If you know the value of a metric
  – how much does this tell you about the performance of the system?
metric → probability-at-rank distribution → RP curve
Maximum entropy method (Aslam et al)
  – find the maxent prob-at-rank distribution
    given the metric value and the total rel at each grade
  – infer the maxent RP curve
  – compare to the actual RP curve
    mean RMS error
Do this for GAP and nDCG

TREC9 – rel + highly rel
choose g_i values 0.5, 0.5 to maximise informativeness

TREC9 – highly rel only

Discriminative power
How well does the metric discriminate between systems?
  given the set of queries, which metric can better identify significant differences between systems?
Bootstrap method (Sakai)
  mixed results – sometimes nDCG, sometimes GAP
  GAP appears to do better on the best performing systems
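The slide names the bootstrap method without giving details, so the following is only a generic paired bootstrap significance test for one pair of systems, offered as an assumption about what such a test looks like rather than as the exact procedure used in the talk. The function name paired_bootstrap_asl and its parameters are hypothetical. A metric's discriminative power could then be summarised as the share of system pairs whose achieved significance level falls below a chosen cut-off.

```python
import math
import random
import statistics
from typing import Sequence


def paired_bootstrap_asl(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    num_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Achieved significance level for the per-query difference of two systems.

    scores_a, scores_b: one metric value per query (e.g. GAP or nDCG), with
    the queries in the same order for both systems.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)

    def t_statistic(sample: Sequence[float]) -> float:
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        if sd == 0.0:
            # Degenerate sample: all values identical.
            return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
        return mean / (sd / math.sqrt(n))

    observed = abs(t_statistic(diffs))

    # Shift the differences to have zero mean, so that resampling simulates
    # the null hypothesis of no difference between the systems.
    mean_diff = statistics.mean(diffs)
    shifted = [d - mean_diff for d in diffs]

    hits = 0
    for _ in range(num_resamples):
        resample = [rng.choice(shifted) for _ in range(n)]
        if abs(t_statistic(resample)) >= observed:
            hits += 1
    return hits / num_resamples


if __name__ == "__main__":
    system_a = [0.60, 0.72, 0.55, 0.80, 0.66, 0.71, 0.59, 0.77, 0.63, 0.70]
    system_b = [0.40, 0.50, 0.37, 0.59, 0.47, 0.51, 0.36, 0.60, 0.43, 0.49]
    print(paired_bootstrap_asl(system_a, system_b))
```

The fixed seed is there only to make the sketch reproducible; the number of resamples and the significance cut-off are free choices.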

Learning to rank
Two methods (SoftRank and LambdaRank)
  over two datasets (OHSUMED from Letor and a Web collection with 5,000 queries)
  2+1 relevance grades
  5-fold cross-validation for OHSUMED,
Optimise GAP / AP / nDCG
  test on nDCG / AP / Prec@10
For both datasets, both methods, and all three test metrics:
  optimising on GAP gave better results than the other two

Conclusions
We can generalise AP to graded relevance judgements
  – with a particular interpretation of relevance grades
    as a probability distribution over users
    ... each user having a binary view of relevance
The resulting metric inherits many desirable properties from AP
  and is at least competitive with nDCG
