Overview of the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019)
Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, Ze Zhong Wu
Vision
Source: saveur.com
The ultimate candy store for information retrieval researchers!
Source: Wikipedia (Candy)
See a result you like? Click a button to recreate those results!
Really, any result?
(not quite… let’s start with batch ad hoc retrieval experiments on standard test collections)
Repeatability: you can recreate your own results again
Reproducibility: others can recreate your results (with code they rewrite)
Replicability: others can recreate your results (with your code)
ACM Artifact Review and Badging Guidelines
Repeatability: we get this "for free"
Reproducibility: a stepping stone…
Replicability: our focus
Armstrong et al. (CIKM 2009): little empirical progress made from 1998 to 2009. Why? Researchers compare against weak baselines.
Yang et al. (SIGIR 2019): researchers still compare against weak baselines.
Open-Source Code!
A good start, but far from enough…
TREC 2015 “Open Runs”
Voorhees et al. Promoting Repeatability Through Open Runs. EVIA 2016.
79 submitted runs…
[Chart: number of runs successfully replicated]
Open-Source Code!
A good start, but far from enough…
Ask developers to show us how!
Open-Source IR Reproducibility Challenge (OSIRRC), part of the SIGIR 2015 Workshop on Reproducibility, Inexplicability, and Generalizability of Results (RIGOR)
Participants contributed end-to-end scripts for replicating ad hoc retrieval experiments
Lin et al. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. ECIR 2016.
7 participating systems, GOV2 collection
[Figure: MAP by system/model (y-axis 0.00 to 0.75): Terrier (BM25, DPH, DPH+Bo1 QE, DPH+Prox SD), Galago (QL, SDM), JASS (2.5M P, 1B P), Indri (QL, SDM), MG4J (B, B+, BM25), ATIRE (BM25, Quant. BM25), Lucene (BM25 Count, BM25 Pos.)]
System Effectiveness
7 participating systems, GOV2 collection
[Figure: per-query search time in ms (log scale, 1 to 100,000) for the same 17 runs across JASS, MG4J, ATIRE, Lucene, Terrier, Galago, and Indri]
System Efficiency
7 participating systems, GOV2 collection
[Figure: scatter plot of MAP (.28 to .34) vs. search time (100 to 10,000 ms, log scale) for all 17 runs: ATIRE (BM25, Quant. BM25), Galago (QL, SDM), Indri (QL, SDM), JASS (1B P, 2.5M P), Lucene (BM25 Count, BM25 Pos.), MG4J (B, B+, BM25), Terrier (BM25, DPH, DPH+Bo1 QE, DPH+Prox SD)]
Effectiveness/Efficiency Tradeoff
Open-Source Code!
A good start, but far from enough…
Ask developers to show us how!
It worked, but…
We actually pulled it off!
Technical infrastructure was brittle
Replication scripts were too under-constrained
Source: Wikipedia (Burj Khalifa)
[Diagram: virtual machines vs. containers]
Virtual machines: each app runs in its own guest OS inside a VM, on a hypervisor, on the physical machine
Containers: apps run in containers that share the host OS, managed by a container engine on the physical machine
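This layering is why containers are attractive for replicability: an image pins the entire software stack (OS libraries, toolchain, and retrieval system code), so an experiment can be re-run on any machine with a container engine. A minimal sketch using the Docker SDK for Python; the image name, command, and mount point are hypothetical placeholders, not an actual OSIRRC image:

```python
import docker  # Docker SDK for Python: pip install docker

client = docker.from_env()

# Pinning a specific image tag (or, better, a digest) fixes the whole
# stack that the retrieval run depends on. The image name, command, and
# mount point below are hypothetical.
logs = client.containers.run(
    "example/ir-system:v1.0",  # hypothetical pinned image
    command=["/search", "--topics", "/input/topics.robust04.txt"],
    volumes={"/data/robust04": {"bind": "/input", "mode": "ro"}},
    remove=True,               # clean up the container after the run
)
print(logs.decode())
```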
Docker-based infrastructure for ad hoc retrieval experiments: the "jig"
(encourage adoption, broaden to other tasks, etc.)
[Diagram: interaction between the jig and a Docker image]
Prepare phase: the user specifies <image>:<tag>; the jig starts the image and triggers the init and index hooks; the image creates a snapshot.
Search phase: the jig triggers the search hook with the snapshot <image>:<tag>; the image produces run files, which are scored with trec_eval.
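A minimal sketch, in Python, of how a jig-like driver could orchestrate these hooks through the Docker CLI. This is not the actual jig implementation; the hook paths (/init, /index, /search), the mount points (/input, /topics, /output), and the choice of reporting only MAP are illustrative assumptions:

```python
import subprocess
from pathlib import Path

def sh(*cmd):
    """Run a command, raising on failure and capturing its output."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

def prepare(image, tag, collection, snapshot_tag):
    # Prepare phase (sketch): start the image, trigger the init and index
    # hooks, then commit the container as a snapshot so the expensive
    # index build happens only once.
    cid = sh("docker", "run", "-d",
             "-v", f"{collection}:/input:ro",
             f"{image}:{tag}", "sleep", "infinity").stdout.strip()
    sh("docker", "exec", cid, "/init")   # init hook: fetch code, models, etc.
    sh("docker", "exec", cid, "/index")  # index hook: build the index
    sh("docker", "commit", cid, f"{image}:{snapshot_tag}")
    sh("docker", "rm", "-f", cid)

def search(image, snapshot_tag, topics, qrels, out_dir):
    # Search phase (sketch): trigger the search hook on the snapshot; the
    # image writes TREC run files to /output, scored here with trec_eval.
    sh("docker", "run", "--rm",
       "-v", f"{topics}:/topics:ro",
       "-v", f"{out_dir}:/output",
       f"{image}:{snapshot_tag}", "/search", "--topics", "/topics")
    for run in sorted(Path(out_dir).iterdir()):
        print(sh("trec_eval", "-m", "map", qrels, str(run)).stdout)
```

Committing the container after indexing is the key design point: every search run starts from the same frozen snapshot, so results cannot drift with the environment.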
Source: Flickr (https://www.flickr.com/photos/m00k/15789986125/)
Focus on newswire collections: Robust04, Core17, Core18
Official runs on Microsoft Azure
Thanks Microsoft for free credits!
Anserini (University of Waterloo)
Anserini-bm25prf (Waseda University)
ATIRE (University of Otago)
Birch (University of Waterloo)
Elastirini (University of Waterloo)
EntityRetrieval (Ryerson University)
Galago (University of Massachusetts)
ielab (University of Queensland)
Indri (TU Delft)
IRC-CENTRE2019 (Technische Hochschule Köln)
JASS (University of Otago)
JASSv2 (University of Otago)
NVSM (University of Padua)
OldDog (Radboud University)
PISA (New York University and RMIT University)
Solrini (University of Waterloo)
Terrier (TU Delft and University of Glasgow)
Images captured diverse models:
query expansion and relevance feedback
conjunctive and efficiency-oriented query processing
neural ranking models
Source: Time Magazine
Source: Washington Post
TREC best: 0.333
TREC median (title): 0.258
Docker-based infrastructure for ad hoc retrieval experiments: the "jig"
(encourage adoption, broaden to other tasks, etc.)
Source: flickr (https://www.flickr.com/photos/39414578@N03/16042029002)