Reusable Test Collections Through Experimental Design

Ben Carterette, University of Delaware
Evangelos Kanoulas, University of Sheffield
Virgil Pavlu, Northeastern University
Hui Fang, University of Delaware

Reusability
• Test collections have many uses:
  – Evaluating systems
  – Training systems
  – Failure analysis
  – …
• Reusability: using the topics and relevance judgments from an evaluation experiment for purposes beyond that initial experiment
• We need some reusability to be able to use a test collection for these purposes

Experimental Design in IR

Query 1    0.353    0.421    0.331
Query 2    0.251    0.122    0.209
Query 3    0.487    0.434    0.421
Query 4    0.444    0.446    0.386
Query 5    0.320    0.290    0.277
Query 6    0.299    0.302    0.324

Pooling and Reusability
• Deep judging during data collection produces a reusable test collection
• How do we know?
  – Leave-systems-out experiments:
    • Choose a group of systems to be held out of judgment collection
    • Simulate pooling in remaining systems
    • Use that pool to evaluate all systems
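The leave-systems-out procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the run names, document ids, pool depth, and the choice of precision at k as the measure are all assumptions made for the example, and unjudged documents are treated as non-relevant.

```python
from typing import Dict, List, Set

Run = Dict[str, List[str]]  # topic id -> ranked list of document ids


def build_pool(runs: Dict[str, Run], depth: int) -> Dict[str, Set[str]]:
    """Union of the top-`depth` documents of every run, per topic."""
    pool: Dict[str, Set[str]] = {}
    for run in runs.values():
        for topic, ranking in run.items():
            pool.setdefault(topic, set()).update(ranking[:depth])
    return pool


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k documents that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def leave_systems_out(runs, qrels, held_out, depth, k):
    """Score every system with only the judgments the reduced pool would contain."""
    remaining = {name: run for name, run in runs.items() if name not in held_out}
    pool = build_pool(remaining, depth)
    scores = {}
    for name, run in runs.items():
        per_topic = []
        for topic, ranking in run.items():
            # Relevant documents outside the reduced pool count as unjudged.
            relevant = qrels.get(topic, set()) & pool.get(topic, set())
            per_topic.append(precision_at_k(ranking, relevant, k))
        scores[name] = sum(per_topic) / len(per_topic)
    return scores


# Toy data (hypothetical): three runs, one topic, two relevant documents.
runs = {
    "A": {"t1": ["d1", "d2", "d3"]},
    "B": {"t1": ["d2", "d4", "d1"]},
    "C": {"t1": ["d5", "d1", "d6"]},
}
qrels = {"t1": {"d1", "d5"}}

full = leave_systems_out(runs, qrels, held_out=set(), depth=2, k=2)
reduced = leave_systems_out(runs, qrels, held_out={"C"}, depth=2, k=2)
```

In the toy data, holding out run C removes d5 from the pool, so C's score drops while A's is unchanged. That gap between `full` and `reduced` is what a leave-systems-out experiment looks for.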

Low-Cost Experimental Design

Query 1    0.353    0.421    0.331
Query 2    0.251    0.122    0.209
Query 3    0.487    0.434    0.421
Query 4    0.444    0.446    0.386
Query 5    0.320    0.290    0.277
Query 6    0.299    0.302    0.324

Sampling and Reusability
• Does sampling produce a reusable collection?
  – We don’t know…
  – … and we can’t simulate it
• Holding systems out would produce a different sample
  – Meaning we would need judgments that we don’t have

Experimenting on Reusability
• Our goal is to define an experimental design that will allow us to simultaneously:
  – Acquire relevance judgments
  – Test hypotheses about differences between systems
  – Test reusability of the topics and judgments
• What does it mean to “test reusability”?
  – Test a null hypothesis that the collection is reusable
  – Reject that hypothesis if the data demands it
  – Never accept that hypothesis

Reusability for Evaluation
• We focus on evaluation (rather than training, failure analysis, etc.)
• Three types of evaluation:
  – Within-site: a group wants to internally evaluate their systems
  – Between-site: a group wants to compare their systems to those of another group
  – Participant-comparison: a group wants to compare their systems to those that participated in the original experiment (e.g. TREC)
• We want data for each of these

Blocking on Leave-One-Out

Query 1    0.353    0.421    0.331
Query 2    0.251    0.122    0.209
Query 3    0.487    0.434    0.421
Query 4    0.444    0.446    0.386
Query 5    0.320    0.290    0.277
Query 6    0.299    0.302    0.324

[Design table. Columns: subset, topic, Site 1 to Site 6. Rows: subset T0 (topics 1 … n), subset T1 (topics n+1 to n+15), subset T2 (starting at topic n+16). Labeled regions: All-Site Baseline (T0), Within-Site Reuse, Within-Site Baseline (for site 6).]

[Design table. Columns: subset, topic, Site 1 to Site 6. Rows: subset T0 (topics 1 … n), subset T1 (topics n+1 to n+15), subset T2 (starting at topic n+16). Labeled regions: All-Site Baseline (T0), Between-Site Reuse, Between-Site Baseline (for sites 5 and 6).]

[Design table. Columns: subset, topic, Site 1 to Site 6. Rows: subset T0 (topics 1 … n), subset T1 (topics n+1 to n+15), subset T2 (starting at topic n+16). Labeled regions: All-Site Baseline (T0), Participant Comparison.]

Design Parameters
• Number of sites: m
• Total number of topics: N
• Min. size of baseline topic set: n_0
• Number to hold out: k
• Number of topic groups: b
• Size of all-site baseline: n
• Size of within-site baseline:
• Size of between-site baseline:
• Size of within-site reuse set:
• Size of between-site reuse set:
• Size of participant-comparison set:

Statistical Analysis
• Goal of statistical analysis is to try to reject the hypothesis about reusability
  – Show that the judgments are not reusable
• Three approaches:
  – Show that measures such as average precision on the baseline sets do not match measures on the reuse sets
  – Show that significance tests in the baseline sets do not match significance tests in the reuse sets
  – Show that rankings in the baseline sets do not match rankings in the reuse sets
• Note: within confidence intervals!
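For the third approach, one common way to quantify how well two system rankings match is Kendall's tau. The slides do not say which ranking measure is used, so the sketch below is only an assumed illustration. The system names and scores are invented, and a raw tau value does not address the "within confidence intervals" caveat.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between the rankings induced by two score dictionaries."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both sets order the pair the same way.
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean scores of three runs on a baseline set and a reuse set.
baseline = {"run1": 0.30, "run2": 0.25, "run3": 0.20}
reuse = {"run1": 0.28, "run2": 0.18, "run3": 0.22}

tau = kendall_tau(baseline, reuse)
```

Here run2 and run3 swap places between the two sets, so two of three pairs are concordant and one is discordant.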

Agreement in Significance
• Perform significance tests on:
  – all pairs of systems in a baseline set
  – all pairs of systems in a reuse set
• If the aggregate outcomes of the tests disagree significantly, reject reusability

Within-Site Example
• Some site submitted five runs to the TREC 2004 Robust track
• Within-site baseline: 210 topics
• Within-site reuse: 39 topics
• Perform 5*4/2 = 10 paired t-tests with each group of topics
• Aggregate agreement in a contingency table
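The all-pairs testing step can be sketched with the standard library alone. This is a sketch under stated assumptions: the per-topic scores are synthetic placeholders rather than TREC data, and the two-sided p-value comes from numerically integrating the Student's t density, not from a statistics package.

```python
import math
import random
from itertools import combinations
from statistics import mean, stdev


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value for Student's t, via Simpson integration of the density."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2 * total * h / 3  # probability of |T| <= |t|
    return max(0.0, 1.0 - central_mass)


def paired_t_test(x, y):
    """p-value of a paired t-test on two equal-length lists of per-topic scores."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        return 1.0 if mean(diffs) == 0 else 0.0
    t = mean(diffs) / (sd / math.sqrt(n))
    return t_two_sided_p(t, n - 1)


def all_pairs_p_values(scores):
    """Run a paired t-test for every unordered pair of runs."""
    return {(a, b): paired_t_test(scores[a], scores[b])
            for a, b in combinations(sorted(scores), 2)}


# Synthetic placeholder data: five runs scored on 39 topics.
rng = random.Random(0)
scores = {
    f"run{i}": [min(1.0, max(0.0, rng.gauss(0.25 + 0.02 * i, 0.1)))
                for _ in range(39)]
    for i in range(1, 6)
}
reuse_p = all_pairs_p_values(scores)  # 5*4/2 = 10 entries
```

Calling `all_pairs_p_values` once on the baseline topics and once on the reuse topics gives the two sets of ten p-values that the contingency table compares.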

Within-Site Example

                     baseline tests
reuse tests      p < 0.05    p ≥ 0.05
p’ < 0.05            6           0
p’ ≥ 0.05            3           1

• 3 significant differences in baseline set that are not significant in reuse set
  – 70% agreement
• … is that bad?
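The 70% figure is the share of the ten system pairs on the table's diagonal, (6 + 1) / 10. A small sketch of the tally follows. The p-values are placeholders chosen only so that the counts match the table; they are not the values from the actual tests.

```python
def contingency(baseline_p, reuse_p, alpha=0.05):
    """Count pairs by (significant in reuse set, significant in baseline set)."""
    table = {(r, b): 0 for r in (True, False) for b in (True, False)}
    for p, p_prime in zip(baseline_p, reuse_p):
        table[(p_prime < alpha, p < alpha)] += 1
    return table


def agreement(table):
    """Fraction of pairs where both sets reach the same significance decision."""
    total = sum(table.values())
    return (table[(True, True)] + table[(False, False)]) / total


# Placeholder p-values reproducing the counts 6, 0, 3, 1.
baseline_p = [0.01] * 6 + [0.02] * 3 + [0.40]
reuse_p = [0.01] * 6 + [0.20] * 3 + [0.50]

table = contingency(baseline_p, reuse_p)
rate = agreement(table)  # (6 + 1) / 10
```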

Expected Errors
• Compare observed error rate to expected error rate
• To estimate expected error rate, use power analysis (Cohen, 1992)
  – What is the probability that the observed difference over 210 topics would be found significant?
  – What is the probability that the observed difference over 39 topics would be found significant?
  – Call these probabilities q_1, q_2
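The two probabilities can be approximated as follows. This is a rough sketch that assumes a normal approximation to the power of a two-sided paired test. The standardized effect size `d` (mean per-topic difference divided by its standard deviation) is a hypothetical value, and the slides may compute power differently.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_topics, alpha=0.05):
    """Normal-approximation power of a two-sided paired test at level alpha."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(effect_size) * sqrt(n_topics)
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


d = 0.2  # hypothetical standardized difference between two runs
q1 = approx_power(d, 210)  # chance of significance over the 210 baseline topics
q2 = approx_power(d, 39)   # chance of significance over the 39 reuse topics
```

For the same underlying difference, `q2` is smaller than `q1` because the reuse set has fewer topics. Some disagreement between the two sets of tests is therefore expected even if the collection is reusable.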
