SLIDE 1

Julián Urbano, Harlley Lima, Alan Hanjalic @TU Delft

SIGIR 2019 · July 23rd · Paris

Picture by dalbera

SLIDE 2

Current Statistical Testing Practice

  • According to surveys by Sakai & Carterette:
    – 60-75% of IR papers use significance testing
    – In the paired case (2 systems, same topics):
      • 65% use the paired t-test
      • 25% use the Wilcoxon test
      • 10% use others, like the Sign, Bootstrap and Permutation tests

2

SLIDE 3

The t-test and Wilcoxon are the de facto choice. Is this a good choice?

3

SLIDE 4

Our Journey

4

[Timeline figure, 1980-2020: van Rijsbergen, Hull @SIGIR, Savoy @IP&M, Wilbur @JIS, Zobel @SIGIR, Voorhees & Buckley @SIGIR, Voorhees @SIGIR, Sakai @SIGIR, Smucker et al. @SIGIR, Sakai @SIGIR, Parapar et al. @JASIST, Cormack & Lynam @SIGIR, Smucker et al. @CIKM, Urbano & Nagler @SIGIR, Sanderson & Zobel @SIGIR, Urbano et al. @SIGIR, Carterette @TOIS, Carterette @ICTIR, Sakai @SIGIR Forum, Urbano @JIR]

SLIDE 5

Our Journey

4

1st Period

Statistical testing is unpopular. Theoretical arguments revolve around test assumptions.

SLIDE 6

Our Journey

4

2nd Period

Empirical studies appear: resampling-based tests and the t-test.

SLIDE 7

Our Journey

4

3rd Period

Wide adoption of statistical testing. Long-pending discussion about statistical practice.

SLIDE 8

Our Journey

  • Theoretical and empirical arguments for and against specific tests
  • 2-tailed tests at α=.05 with AP and P@10, almost exclusively
  • Limited data, resampling from the same topics
  • No control over the null hypothesis
  • Discordances or conflicts among tests, but no actual error rates

5

SLIDE 9

Main reason? No control over the data-generating process.

6

SLIDE 10

PROPOSAL FROM SIGIR 2018

SLIDE 11

Stochastic Simulation

  • Build a generative model of the joint distribution of system scores
  • So that we can simulate scores on new, random topics (no content, only scores)
  • Unlimited data
  • Full control over H0

9

[Diagram: Model → simulated AP scores → test → p-values]

Urbano & Nagler, SIGIR 2018

SLIDE 12

Stochastic Simulation

  • Build a generative model of the joint distribution of system scores
  • So that we can simulate scores on new, random topics (no content, only scores)
  • Unlimited data
  • Full control over H0
  • The model is flexible, and can be fit to existing data to make it realistic

10

[Diagram: TREC IR systems → AP scores → Model → simulated AP scores → test → p-values]

Urbano & Nagler, SIGIR 2018

SLIDE 13

Stochastic Simulation

11

  • We use copula models, which separate:
    1. Marginal distributions of individual systems, which give us full knowledge and control over H0
    2. Dependence structure among systems

[Figure: score distributions of the Experimental and Baseline systems, with means μE and μB]
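As a concrete illustration of the separation above, here is a minimal Gaussian-copula sketch (not the paper's actual model, which fits richer copula families to TREC data): correlated standard normals supply the dependence structure, and Beta marginals, a hypothetical choice, yield AP-like scores in [0, 1]. Each component can then be controlled independently.

```python
import numpy as np
from scipy import stats

def simulate_paired_scores(n_topics, rho=0.8, marg_b=(2, 5), marg_e=(2, 5), seed=0):
    """Simulate per-topic scores for a Baseline and an Experimental system
    with a Gaussian copula (dependence) and Beta marginals (margins)."""
    rng = np.random.default_rng(seed)
    # 1) Dependence structure: correlated standard normals -> uniforms in [0,1]^2
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n_topics)
    u = stats.norm.cdf(z)
    # 2) Marginal distributions: transform each margin with a Beta inverse CDF
    b = stats.beta.ppf(u[:, 0], *marg_b)   # Baseline scores
    e = stats.beta.ppf(u[:, 1], *marg_e)   # Experimental scores
    return b, e

b, e = simulate_paired_scores(1000)
```

With identical marginals for both systems, μE = μB holds by construction (an H0 world); replacing the Experimental marginal injects a known, controlled effect.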


SLIDE 15

Research Question

  • Which is the test that…
    1. maintains Type I errors at the α level,
    2. has the highest statistical power,
    3. across measures and sample sizes,
    4. with IR-like data?

12

SLIDE 16

Factors Under Study

  • Paired test: Student’s t, Wilcoxon, Sign, Bootstrap-shift, Permutation
  • Measure: AP, nDCG@20, ERR@20, P@10, RR
  • Topic set size n: 25, 50, 100
  • Effect size δ: 0.01, 0.02, …, 0.1
  • Significance level α: 0.001, …, 0.1
  • Tails: 1 and 2
  • Data to fit stochastic models: TREC 5-8 Ad Hoc and 2010-13 Web

13
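For reference, the t-test and Wilcoxon come off the shelf in SciPy, and a sign-flip permutation test is easy to write directly. A minimal sketch on hypothetical per-topic AP scores (array names and parameters are illustrative, not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-topic AP scores for a baseline and an experimental system
ap_b = rng.beta(2, 5, size=50)
ap_e = np.clip(ap_b + rng.normal(0.02, 0.05, size=50), 0, 1)
d = ap_e - ap_b                                # per-topic differences

# Paired t-test and Wilcoxon signed-rank test (SciPy)
p_t = stats.ttest_rel(ap_e, ap_b).pvalue
p_w = stats.wilcoxon(ap_e, ap_b).pvalue

# Sign-flip permutation test on the mean difference (2-tailed)
def permutation_pvalue(d, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    obs = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm = np.abs((signs * d).mean(axis=1))    # null distribution under H0
    return (1 + np.sum(perm >= obs)) / (1 + n_perm)

p_p = permutation_pvalue(d)
```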

SLIDE 17

We report results on >500 million p-values. 1.5 years of CPU time ¯\_(ツ)_/¯

14

SLIDE 18

TYPE I ERRORS

SLIDE 19

Simulation such that μE = μB

16

[Diagram: TREC systems and topics used to fit the Experimental and Baseline models]


SLIDE 23

Simulation such that μE = μB

16

[Diagram: Experimental and Baseline models (with μE = μB) → simulated scores → tests → p-values]

SLIDE 24

Simulation such that μE = μB

  • Repeat for each measure and topic set size n
    – 1,667,000 times
    – ≈8.3 million 2-tailed p-values
    – ≈8.3 million 1-tailed p-values

  • Grand total of >250 million p-values
  • Any p<α corresponds to a Type I error
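The Type I error computation above can be sketched as a Monte Carlo loop: simulate many replications under H0 (here a simple normal model stands in for the copula model; all parameters are hypothetical), run a 2-tailed paired t-test on each, and count how often p < α. A well-behaved test rejects in roughly an α fraction of replications.

```python
import numpy as np
from scipy import stats

def type1_rate(n_topics=50, n_reps=4000, alpha=0.05, seed=1):
    """Fraction of 2-tailed paired t-tests with p < alpha when H0 is true."""
    rng = np.random.default_rng(seed)
    errors = 0
    for _ in range(n_reps):
        # Under H0 the per-topic score differences have mean 0
        d = rng.normal(0.0, 0.1, size=n_topics)
        t = d.mean() / (d.std(ddof=1) / np.sqrt(n_topics))
        p = 2 * stats.t.sf(abs(t), df=n_topics - 1)   # 2-tailed p-value
        errors += (p < alpha)
    return errors / n_reps

rate = type1_rate()
```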
SLIDE 25

Type I Errors by α | n (2-tailed)

18

We are not so interested in specific points as in trends.

SLIDE 26

Type I Errors by α | n (2-tailed)

20

SLIDE 27

Type I Errors by α | n (2-tailed)

20

  • Wilcoxon and Sign have higher error rates than expected
  • Wilcoxon is better with P@10 and RR because of their symmetry
  • Even worse as sample size increases (with RR too)
SLIDE 28

Type I Errors by α | n (2-tailed)

20

  • Bootstrap has high error rates too
  • Tends to correct with sample size because it estimates the sampling distribution better

SLIDE 29

Type I Errors by α | n (2-tailed)

20

  • Permutation and t-test have nearly ideal behavior
  • Permutation is very slightly sensitive to sample size
  • t-test is remarkably robust to it
SLIDE 30

Type I Errors - Summary

  • Wilcoxon, Sign and Bootstrap tests tend to make more errors than expected
  • Increasing sample size helps Bootstrap, but hurts Wilcoxon and Sign even more
  • Permutation and t-test have nearly ideal behavior across measures, even with small sample sizes
  • t-test is remarkably robust
  • Same conclusions with 1-tailed tests

21

SLIDE 31

TYPE II ERRORS

SLIDE 32

Simulation such that μE = μB + δ

23

[Diagram: TREC systems and topics used to fit the Experimental and Baseline models]


SLIDE 36

Simulation such that μE = μB + δ

23

[Diagram: Experimental and Baseline models (with μE = μB + δ) → simulated scores → tests → p-values]

SLIDE 37

Simulation such that μE = μB + δ

  • Repeat for each measure, topic set size n and effect size δ
    – 167,000 times
    – ≈8.3 million 2-tailed p-values
    – ≈8.3 million 1-tailed p-values

  • Grand total of >250 million p-values
  • Any p>α corresponds to a Type II error
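Power can be estimated the same way, except the simulation injects a known effect size δ, so p < α is a correct rejection and p > α a Type II error. A sketch with a simple normal model and hypothetical parameters (not the paper's copula model):

```python
import numpy as np
from scipy import stats

def power(delta, n_topics=50, n_reps=3000, alpha=0.05, seed=2):
    """Fraction of 2-tailed paired t-tests that reject when mu_E = mu_B + delta."""
    rng = np.random.default_rng(seed)
    # Per-topic score differences with a true mean of delta
    d = rng.normal(delta, 0.1, size=(n_reps, n_topics))
    t = d.mean(axis=1) / (d.std(axis=1, ddof=1) / np.sqrt(n_topics))
    p = 2 * stats.t.sf(np.abs(t), df=n_topics - 1)
    return np.mean(p < alpha)

low, high = power(0.01), power(0.05)
```

As expected, estimated power grows with the effect size (and with the topic set size n).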
SLIDE 38

Power by δ | n (α=.05, 2-tailed)

25

[Plot annotation: “ideally”, marking the ideal power curve]

SLIDE 39

Power by δ | n (α=.05, 2-tailed)

26

  • Clear effect of effect size δ
  • Clear effect of sample size n
  • Clear effect of measure (via σ)
SLIDE 40

Power by δ | n (α=.05, 2-tailed)

26

  • Sign test is consistently the least powerful (it disregards magnitudes)
  • Bootstrap test is consistently the most powerful, especially for small n
SLIDE 41

Power by δ | n (α=.05, 2-tailed)

26

  • Permutation and t-test are almost identical again
  • Very close to Bootstrap as sample size increases
SLIDE 42

Power by δ | n (α=.05, 2-tailed)

26

  • Wilcoxon is very similar to Permutation and t-test
  • Even slightly better with small n or δ, especially for AP, nDCG and ERR (it is indeed more efficient with some asymmetric distributions)

SLIDE 43

Power by α | δ (n=50, 2-tailed)

27

SLIDE 44

Power by α | δ (n=50, 2-tailed)

27

  • With small δ, Wilcoxon and Bootstrap are consistently the most powerful
  • With large δ, Permutation and t-test catch up with Wilcoxon
SLIDE 45

Type II Errors - Summary

  • All tests, except Sign, behave very similarly
  • Bootstrap and Wilcoxon are consistently a bit more powerful across significance levels
    – But more Type I errors!
  • With larger effect sizes and sample sizes, Permutation and t-test catch up with Wilcoxon, but not with Bootstrap
  • Same conclusions with 1-tailed tests

28

SLIDE 46

TYPE III ERRORS

SLIDE 47

Type III what?

  • A wrong directional decision based on the correct rejection of a non-directional hypothesis
  • Example:
    – We observe a positive result, E > B
    – We run a 2-tailed test, H0: μE = μB
    – Find p < α, so we reject and conclude μE > μB
    – But H1 is non-directional
    – What if we just got lucky, and really μE < μB?

30
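A Type III error rate can be estimated by simulating with a true positive effect δ, keeping only the significant 2-tailed outcomes, and counting how many of them point in the wrong direction (a negative observed mean difference). A sketch with a simple normal model and hypothetical parameters:

```python
import numpy as np
from scipy import stats

def type3_rate(delta=0.01, n_topics=50, n_reps=20000, alpha=0.05, seed=3):
    """Among significant 2-tailed t-tests, fraction concluding the wrong direction."""
    rng = np.random.default_rng(seed)
    d = rng.normal(delta, 0.1, size=(n_reps, n_topics))   # true effect is +delta
    m = d.mean(axis=1)
    t = m / (d.std(axis=1, ddof=1) / np.sqrt(n_topics))
    p = 2 * stats.t.sf(np.abs(t), df=n_topics - 1)
    significant = p < alpha
    wrong_sign = significant & (m < 0)   # rejected H0, but E looks worse than B
    return wrong_sign.sum() / significant.sum()

r = type3_rate()
```

With a small δ relative to the noise, a non-negligible share of the rejections lands on the wrong side, which is exactly the risk the slide describes.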

SLIDE 48

Type III Errors by δ | n (α=.05)

31

  • Clear effect of δ and n
  • P@10 and RR are substantially more problematic because of their higher σ
SLIDE 49

Type III Errors by δ | n (α=.05)

31

  • Bootstrap tends to correct with sample size
  • Wilcoxon stays the same, and Sign test gets even worse
SLIDE 51

Type III Errors in Practice

  • How much of a problem could this be?
  • Example: AP and n=50 topics
    – Improvement of +0.01 over the baseline
    – The 2-tailed t-test comes up significant
    – 7.3% probability that it is a Type III error and your system is actually worse
    – Is that too high?

32

SLIDE 52

CONCLUSIONS

SLIDE 53

What We Did

  • First empirical study of actual error rates with IR-like data
  • Comprehensive:
    – Paired test: Student’s t, Wilcoxon, Sign, Bootstrap-shift, Permutation
    – Measure: AP, nDCG@20, ERR@20, P@10, RR
    – Topic set size: 25, 50, 100
    – Effect size: 0.01, 0.02, …, 0.1
    – Significance level: 0.001, …, 0.1
    – Tails: 1 and 2
  • More than 500 million p-values
  • All data and many more plots are available online
    https://github.com/julian-urbano/sigir2019-statistical

34

SLIDE 54

Recommendations

  • Don’t use the Wilcoxon or Sign tests anymore
  • For statistics other than the mean, use the Permutation test, and the Bootstrap only if you have many topics
  • For typical tests about mean scores, the t-test is simple, the most robust, behaves as expected w.r.t. Type I errors, and is nearly as powerful as the Bootstrap. Keep using it

35
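Following the recommendation for statistics other than the mean, a sign-flip permutation test needs no distributional assumptions and works for any paired statistic, e.g. the median difference. A minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def paired_permutation_test(d, stat=np.median, n_perm=10000, seed=0):
    """2-tailed sign-flip permutation test on per-topic score differences d.

    Under H0 the sign of each per-topic difference is exchangeable, so the
    null distribution is built by randomly flipping signs and recomputing
    the statistic (here the median difference, but any paired statistic
    that accepts an axis argument works).
    """
    rng = np.random.default_rng(seed)
    obs = abs(stat(d))
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs(stat(signs * d, axis=1))
    return (1 + np.sum(null >= obs)) / (1 + n_perm)

rng = np.random.default_rng(7)
d = rng.normal(0.03, 0.05, size=50)   # hypothetical per-topic differences
p = paired_permutation_test(d)
```

Swapping `stat=np.mean` recovers the usual permutation test on mean scores discussed throughout the talk.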