Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft
SIGIR 2019 · July 23rd · Paris
Picture by dalbera
Current Statistical Testing Practice
According to surveys by Sakai & Carterette:
– 60-75% of IR papers use significance testing
– In the paired case (2 systems, same topics):
[Timeline, 1980 to 2020: van Rijsbergen; Hull @SIGIR; Savoy @IP&M; Wilbur @JIS; Zobel @SIGIR; Voorhees & Buckley @SIGIR; Voorhees @SIGIR; Sakai @SIGIR; Smucker et al. @SIGIR; Sakai @SIGIR; Parapar et al. @JASIST; Cormack & Lynam @SIGIR; Smucker et al. @CIKM; Urbano & Nagler @SIGIR; Sanderson & Zobel @SIGIR; Urbano et al. @SIGIR; Carterette @TOIS; Carterette @ICTIR; Sakai @SIGIR Forum; Urbano @JIR]
1st Period
– Statistical testing unpopular
– Theoretical arguments around test assumptions
2nd Period
– Empirical studies appear
– Resampling-based tests and t-test
3rd Period
– Wide adoption of statistical testing
– Long-pending discussion about statistical practice
system scores
scores on new, random topics (no content, only scores)
can be fit to existing data to make it realistic
[Diagram: TREC IR systems; Model; AP; test; p-values]
Urbano & Nagler, SIGIR 2018
Experimental Baseline
models, which separate:
distributions, of individual systems
and control over H0
structure, among systems
μE μB
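A minimal sketch of the simulation idea, not the model from Urbano & Nagler. It assumes a Gaussian copula for the structure among systems and Kumaraswamy distributions for the individual systems, both chosen here only because they need nothing beyond the Python standard library. The function names and all parameter values are hypothetical. Giving both systems the same marginal sets μE = μB by construction.

```python
import math
import random
from statistics import NormalDist, fmean


def kumaraswamy_ppf(u, a, b):
    """Inverse CDF of a Kumaraswamy(a, b) distribution on [0, 1]."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def kumaraswamy_mean(a, b):
    """Closed-form mean of Kumaraswamy(a, b): b * Beta(1 + 1/a, b)."""
    return b * math.gamma(1 + 1 / a) * math.gamma(b) / math.gamma(1 + 1 / a + b)


def simulate_pair(n_topics, rho, marg_b, marg_e, rng):
    """Draw paired scores for n_topics new, random topics (scores only).

    rho sets the dependence between the two systems (Gaussian copula);
    marg_b and marg_e are (a, b) shapes of the baseline and experimental
    score distributions. All of these are illustrative assumptions.
    """
    phi = NormalDist().cdf
    baseline, experimental = [], []
    for _ in range(n_topics):
        # Correlated standard normals carry the dependence structure
        z_b = rng.gauss(0, 1)
        z_e = rho * z_b + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        # Map to uniforms, then through each system's own distribution
        baseline.append(kumaraswamy_ppf(phi(z_b), *marg_b))
        experimental.append(kumaraswamy_ppf(phi(z_e), *marg_e))
    return baseline, experimental


rng = random.Random(1)
b_scores, e_scores = simulate_pair(50, 0.8, (2, 3), (2, 3), rng)
print(fmean(b_scores), fmean(e_scores), kumaraswamy_mean(2, 3))
```

Because the dependence and the per-system distributions are set separately, the true means are known exactly, which is what makes it possible to say whether a test result is an error.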
and 2010-13 Web
[Diagram: TREC Systems and Topics; Experimental and Baseline; Tests; p-values; μE = μB]
– 1,667,000 times
– ≈8.3 million 2-tailed p-values
– ≈8.3 million 1-tailed p-values
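The repeat-and-count loop behind a Type I error rate can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the score model (shared topic difficulty plus independent noise) and its parameters are hypothetical, and the exact sign test stands in for the full set of tests because its p-value needs only the standard library.

```python
import math
import random


def sign_test_p(diffs):
    """Exact 2-tailed sign test p-value for paired differences (zeros dropped)."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in nonzero)
    # Probability of a split at least as extreme as the one observed
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def type_i_rate(n_trials, n_topics, alpha, rng):
    """Fraction of trials with p < alpha when μE = μB holds by construction."""
    rejections = 0
    for _ in range(n_trials):
        diffs = []
        for _ in range(n_topics):
            topic = rng.gauss(0.3, 0.1)        # hypothetical topic difficulty
            b = topic + rng.gauss(0, 0.05)     # baseline score
            e = topic + rng.gauss(0, 0.05)     # experimental score, same mean
            diffs.append(e - b)
        rejections += sign_test_p(diffs) < alpha
    return rejections / n_trials


print(type_i_rate(2000, 50, 0.05, random.Random(42)))
```

A well-behaved test should reject in roughly a fraction α of trials; the sign test is discrete, so its rate sits somewhat below α.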
Not so interested in specific points but in trends
the sampling distribution better
behavior across measures, even with small sample size
[Diagram: TREC Systems and Topics; Experimental and Baseline; Tests; p-values; μE = μB+δ]
– 167,000 times
– ≈8.3 million 2-tailed p-values
– ≈8.3 million 1-tailed p-values
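With μE = μB+δ the null hypothesis is false, so the fraction of significant results estimates power rather than Type I errors. A rough sketch, assuming normally distributed per-topic differences with a hypothetical standard deviation of 0.07 and using a Monte Carlo paired permutation test (sign flipping) as the example test; the function names and defaults are illustrative only.

```python
import random
from statistics import fmean


def permutation_p(diffs, n_perm, rng, tails=2):
    """Monte Carlo paired permutation test on the mean difference."""
    observed = fmean(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the sign of each paired difference is exchangeable
        flipped = fmean(d if rng.random() < 0.5 else -d for d in diffs)
        if tails == 2:
            hits += abs(flipped) >= abs(observed)
        else:
            hits += flipped >= observed
    return (hits + 1) / (n_perm + 1)


def power(delta, n_trials, n_topics, alpha, rng, tails=2, n_perm=200):
    """Fraction of trials with p < alpha when the true difference is delta."""
    rejections = 0
    for _ in range(n_trials):
        diffs = [delta + rng.gauss(0, 0.07) for _ in range(n_topics)]
        rejections += permutation_p(diffs, n_perm, rng, tails) < alpha
    return rejections / n_trials


rng = random.Random(7)
for delta in (0.01, 0.05):
    print(delta, power(delta, 100, 25, 0.05, rng))
```

Calling `power(..., tails=1)` gives the 1-tailed counterpart for the same settings.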
ideally
(it’s indeed more efficient with some asymmetric distributions)
– But more Type I errors!
– We observe a positive result, E > B
– We run a 2-tailed test, H0: μE = μB
– Find p < α, so we reject and conclude μE > μB
– But H1 is non-directional
– What if we just got lucky, and really μE < μB?
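The scenario above can be simulated by fixing a true difference with a known sign and counting how often a significant 2-tailed result points the other way. A sketch under assumptions: the rate is taken here as conditional on significance, the per-topic differences are normal with a hypothetical standard deviation of 0.07, and the test is a Wilcoxon signed-rank test with the normal approximation (no continuity or tie correction).

```python
import math
import random
from statistics import NormalDist, fmean


def wilcoxon_p(diffs):
    """2-tailed Wilcoxon signed-rank p-value via the normal approximation."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    ranked = sorted(nonzero, key=abs)
    # Sum of the ranks of the positive differences
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


def type_iii_rate(true_delta, n_trials, n_topics, alpha, rng):
    """Among significant results, fraction whose sign contradicts true_delta."""
    significant = wrong_sign = 0
    for _ in range(n_trials):
        diffs = [true_delta + rng.gauss(0, 0.07) for _ in range(n_topics)]
        if wilcoxon_p(diffs) < alpha:
            significant += 1
            wrong_sign += (fmean(diffs) > 0) != (true_delta > 0)
    return wrong_sign / significant if significant else 0.0


rng = random.Random(3)
print(type_iii_rate(-0.005, 2000, 25, 0.05, rng))
```

The smaller the true difference, the larger the share of significant results that have the wrong sign.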
– Improvement of +0.01 over the baseline
– 2-tailed t-test comes up significant
– 7.3% probability that it is a Type III error and your system is actually worse
– Is that too high?
– Paired test: Student’s t, Wilcoxon, Sign, Bootstrap-shift, Permutation
– Measure: AP, nDCG@20, ERR@20, P@10, RR
– Topic set size: 25, 50, 100
– Effect size: 0.01, 0.02, …, 0.1
– Significance level: 0.001, …, 0.1
– Tails: 1 and 2
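The factors above form a full grid of conditions, which might be enumerated as below. This is only an illustration and is not taken from the linked repository. The list of significance levels is elided in the slide, so only its two stated endpoints are used here and the intermediate levels are deliberately left out.

```python
from itertools import product

tests = ["Student's t", "Wilcoxon", "Sign", "Bootstrap-shift", "Permutation"]
measures = ["AP", "nDCG@20", "ERR@20", "P@10", "RR"]
topic_set_sizes = [25, 50, 100]
effect_sizes = [round(0.01 * i, 2) for i in range(1, 11)]  # 0.01, 0.02, ..., 0.1
alphas = [0.001, 0.1]  # endpoints only; the levels in between are not given
tails = [1, 2]

# One tuple per combination of factor levels
conditions = list(
    product(tests, measures, topic_set_sizes, effect_sizes, alphas, tails)
)
print(len(conditions))
```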
https://github.com/julian-urbano/sigir2019-statistical
The t-test is simple, the most robust, behaves as expected w.r.t. Type I errors, and is nearly as powerful as the Bootstrap. Keep using it.