Statistical Significance Testing In Theory and In Practice

SLIDE 1

Statistical Significance Testing In Theory and In Practice

Ben Carterette
University of Delaware
http://ir.cis.udel.edu/ICTIR15tutorial

SLIDE 2

Hypotheses and Experiments

Hypothesis:

– Using an SVM for classification will give better accuracy than using Naïve Bayes
– A “Symbol-Refined Tree Substitution Grammar” will give better parsing results than a simple TSG
– Expanding a short keyword query with synonyms will improve search engine effectiveness

Experiment:

– Build a baseline system
– Modify it based on your hypothesis
– Test both systems on one or more datasets

SLIDE 3

Experimental Results

from Shindo et al., Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing, ACL 2012

SLIDE 4

So What?

“Do these results support my hypothesis?”
“Are these results meaningful?”
“Is it possible that my results are just random?”

→ statistical significance testing!

SLIDE 5

Overview of This Tutorial

Part 1: Testing Statistical Significance

– May be a review for some of you

Part 2: Fundamentals of Significance Testing
Part 3: Applications, or, Why Bother With Fundamentals?
Part 4: Myths and Misconceptions
Part 5: Significance Testing in IR Research

SLIDE 6

Using R

R is a software environment for statistical computing

Includes built-in implementations of many common tests

– Also has its own programming language for implementing your own

Download from http://r-project.org

– Download TREC-7 evaluation data from http://ir.cis.udel.edu/ICTIR15tutorial/trec7.RData

SLIDE 7

Background: Experimentation in IR

The standard experimental setting in IR is called the Cranfield paradigm

Two components: test collections and effectiveness measures

– A test collection comprises:
    A corpus of documents
    A set of information needs/tasks/topics/queries
    Relevance judgments

– Effectiveness measures such as:
    Precision@10, average precision, nDCG@10, alpha-nDCG@10, etc.

SLIDE 8

Background: Cranfield

            A      B      C      D
query 1    0.3    0.2    0.4    0.1
query 2    0.4    0.3    0.4    0.5
query 3    0.1    0.1    0.3    0.4
query 4    0.5    0.2    0.1    0.3
query 5    0.3    0.3    0.2    0.1

SLIDE 9

Background: Cranfield

            A      B      C      D      mean
query 1    0.3    0.2    0.4    0.1    0.250
query 2    0.4    0.3    0.4    0.5    0.400
query 3    0.1    0.1    0.3    0.4    0.225
query 4    0.5    0.2    0.1    0.3    0.275
query 5    0.3    0.3    0.2    0.1    0.225
mean       0.32   0.22   0.28   0.28
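The per-query and per-system averages in the table can be recomputed with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions, and the tutorial itself uses R.

```python
# Effectiveness scores from the slide: one row per query, one column per system.
systems = ["A", "B", "C", "D"]
scores = {
    "query 1": [0.3, 0.2, 0.4, 0.1],
    "query 2": [0.4, 0.3, 0.4, 0.5],
    "query 3": [0.1, 0.1, 0.3, 0.4],
    "query 4": [0.5, 0.2, 0.1, 0.3],
    "query 5": [0.3, 0.3, 0.2, 0.1],
}

# Average over systems for each query (rightmost column of the table).
query_means = {q: sum(row) / len(row) for q, row in scores.items()}

# Average over queries for each system (bottom row of the table).
system_means = {
    name: sum(row[j] for row in scores.values()) / len(scores)
    for j, name in enumerate(systems)
}
```

The system means are what a typical IR paper reports; the significance tests that follow ask whether differences between them could plausibly arise from the choice of queries alone.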

SLIDE 10

TESTING STATISTICAL SIGNIFICANCE

Part 1

SLIDE 11

Commonly-Used Tests

Non-parametric:

– Sign test/binomial test
– Wilcoxon signed rank test

Parametric:

– Student’s t-test
– ANOVA

Distribution-free:

– Randomization test
– Bootstrap test

SLIDE 12

Sign Test

Query    A      B      B-A     sign(B-A)
1       .25    .35    +.10     +1
2       .43    .84    +.41     +1
3       .39    .15    -.24     -1
4       .75    .75     0        0
5       .43    .68    +.25     +1
6       .15    .85    +.70     +1
7       .20    .80    +.60     +1
8       .52    .50    -.02     -1
9       .49    .58    +.09     +1
10      .50    .75    +.25     +1

7 “successes” in 9 complete trials
What if each +1/-1 was just the result of flipping a fair coin?
What is the probability we would see 7 or more heads if the coin is fair?

SLIDE 13

Binomial Distribution

What is the probability we would see 7 or more heads if the coin is fair?

  P(7 heads | 9 trials, ½ probability)
+ P(8 heads | 9 trials, ½ probability)
+ P(9 heads | 9 trials, ½ probability)
= 0.09

p-value = 0.09
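The binomial tail sum above is easy to check directly. The following is a minimal Python sketch (the function name is an assumption made for illustration); it drops the tied query, counts the positive signs, and sums the upper tail of a fair-coin binomial.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Differences B-A for the ten queries in the sign test table.
diffs = [0.10, 0.41, -0.24, 0.0, 0.25, 0.70, 0.60, -0.02, 0.09, 0.25]

nonzero = [d for d in diffs if d != 0]            # ties are discarded
n_trials = len(nonzero)                           # complete trials
n_success = sum(1 for d in nonzero if d > 0)      # queries where B beat A

p_value = binomial_upper_tail(n_success, n_trials)
```

With 7 successes in 9 trials the sum is 46/512, which rounds to the 0.09 shown on the slide.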

SLIDE 14

Wilcoxon Signed-Rank Test

Query    A      B      B-A
1       .25    .35    +.10
2       .43    .84    +.41
3       .39    .15    -.24
4       .75    .75     0
5       .43    .68    +.25
6       .15    .85    +.70
7       .20    .80    +.60
8       .52    .50    -.02
9       .49    .58    +.09
10      .50    .75    +.25

Rank    B-A
1       -.02
2       +.09
3       +.10
4       -.24
5.5     +.25
5.5     +.25
7       +.41
8       +.60
9       +.70

W = 2 + 3 + 5.5 + 5.5 + 7 + 8 + 9
W = 40
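The ranking step can be sketched in Python as follows. This is an illustration under stated assumptions, not the tutorial's own code: zero differences are dropped, tied absolute differences share the average of their ranks, and the one-sided p-value is computed exactly by enumerating every way of assigning signs to the nine ranks. The helper name `signed_ranks` is hypothetical.

```python
from itertools import product

# Differences B-A for the ten queries.
diffs = [0.10, 0.41, -0.24, 0.0, 0.25, 0.70, 0.60, -0.02, 0.09, 0.25]

def signed_ranks(values):
    """Return (difference, rank) pairs, ranking by absolute value.

    Zero differences are dropped; ties receive the average of their ranks.
    """
    ordered = sorted((v for v in values if v != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return list(zip(ordered, ranks))

pairs = signed_ranks(diffs)
ranks = [r for _, r in pairs]

# Test statistic: sum of the ranks attached to positive differences.
W = sum(r for d, r in pairs if d > 0)

# Exact null distribution: under H0 each rank is positive with probability 1/2.
at_least_as_large = sum(
    1
    for signs in product((0, 1), repeat=len(ranks))
    if sum(r for r, s in zip(ranks, signs) if s) >= W
)
p_value = at_least_as_large / 2 ** len(ranks)
```

Enumeration is feasible here because there are only 2^9 = 512 sign assignments; for larger samples a normal approximation is the usual substitute.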

SLIDE 15

Wilcoxon Signed-Rank Test

[Figure: null distribution of W (x-axis: W, y-axis: density), with the observed W = 40 marked]

p-value = 0.02

SLIDE 16

Student’s t-test

Query    A      B      B-A
1       .25    .35    +.10
2       .43    .84    +.41
3       .39    .15    -.24
4       .75    .75     0
5       .43    .68    +.25
6       .15    .85    +.70
7       .20    .80    +.60
8       .52    .50    -.02
9       .49    .58    +.09
10      .50    .75    +.25

\hat{\mu} = \overline{B - A} = 0.214
\hat{\sigma}_{B-A} = 0.291
t = \frac{\hat{\mu}}{\hat{\sigma}_{B-A} / \sqrt{n}} = 2.33
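The three quantities above can be reproduced from the B-A column with Python's standard library. This sketch stops at the test statistic; turning t into a p-value requires the t distribution with n - 1 degrees of freedom, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, stdev

# Differences B-A for the ten queries.
diffs = [0.10, 0.41, -0.24, 0.0, 0.25, 0.70, 0.60, -0.02, 0.09, 0.25]

n = len(diffs)
mu_hat = mean(diffs)        # sample mean of the differences
sigma_hat = stdev(diffs)    # sample standard deviation (n - 1 denominator)

# Paired t statistic: mean difference divided by its standard error.
t_stat = mu_hat / (sigma_hat / sqrt(n))
```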

SLIDE 17

Student’s t-test

\hat{\mu} = \overline{B - A} = 0.214
\hat{\sigma}_{B-A} = 0.291
t = \frac{\hat{\mu}}{\hat{\sigma}_{B-A} / \sqrt{n}} = 2.33

p-value = 0.02

SLIDE 18

Randomization Test

Original data:

Query    A      B      B-A
1       .25    .35    +.10
2       .43    .84    +.41
3       .39    .15    -.24
4       .75    .75     0
5       .43    .68    +.25
6       .15    .85    +.70
7       .20    .80    +.60
8       .52    .50    -.02
9       .49    .58    +.09
10      .50    .75    +.25

\hat{\mu}_0 = \overline{B - A} = 0.214

Permutation 1:

Query    A      B      B-A
1       .35    .25    -.10
2       .43    .84    +.41
3       .39    .15    -.24
4       .75    .75     0
5       .68    .43    -.25
6       .85    .15    -.70
7       .20    .80    +.60
8       .50    .52    +.02
9       .49    .58    +.09
10      .50    .75    +.25

\hat{\mu}_1 = -0.008

Permutation 2:

Query    A      B      B-A
1       .25    .35    +.10
2       .84    .43    -.41
3       .39    .15    -.24
4       .75    .75     0
5       .68    .43    -.25
6       .15    .85    +.70
7       .80    .20    -.60
8       .50    .52    +.02
9       .58    .49    -.09
10      .75    .50    -.25

\hat{\mu}_2 = -0.093
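Swapping the A and B scores for a query is the same as flipping the sign of that query's difference, so the randomization distribution can be built from the differences alone. The sketch below is an illustration rather than the tutorial's code: with only ten queries it enumerates all 2^10 sign assignments instead of sampling them, and it works in integer hundredths so that comparisons with the observed sum are exact.

```python
from itertools import product

# Observed differences B-A, in hundredths to avoid floating-point ties.
diffs_hundredths = [10, 41, -24, 0, 25, 70, 60, -2, 9, 25]
observed_sum = sum(diffs_hundredths)              # 214, i.e. a mean of 0.214

total = 0
at_least_as_large = 0
for signs in product((1, -1), repeat=len(diffs_hundredths)):
    # Each sign vector corresponds to one way of swapping system labels.
    permuted_sum = sum(s * d for s, d in zip(signs, diffs_hundredths))
    total += 1
    if permuted_sum >= observed_sum:
        at_least_as_large += 1

# One-sided p-value: share of relabelings with a mean at least as large.
p_value = at_least_as_large / total
```

For realistic topic counts (50 or more) full enumeration is infeasible, and the usual approach is to draw a large random sample of sign assignments instead.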

SLIDE 19

Randomization Test

[Figure: randomization distribution of the mean difference, with the observed \hat{\mu}_0 = \overline{B - A} = 0.214 marked]

p-value = 0.02

SLIDE 20

Bootstrap Test

Query    A      B      B-A
1       .25    .35    +.10
2       .43    .84    +.41
3       .39    .15    -.24
4       .75    .75     0
5       .43    .68    +.25
6       .15    .85    +.70
7       .20    .80    +.60
8       .52    .50    -.02
9       .49    .58    +.09
10      .50    .75    +.25

Bootstrap samples of B-A:

s1      s2      s3
-.24    +.25    -.24
+.41    +.10    +.60
-.02    +.25    -.70
 0      +.60    +.25
+.25    +.70    +.70
+.10    -.02    +.41
+.25    +.10    -.02
+.10    +.25    -.24
+.25     0      +.70
+.10    -.02    +.25

SLIDE 21

Bootstrap Distribution

[Figure: bootstrap distribution of the mean difference]

p-value = 0.005
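The slides do not spell out how the bootstrap p-value is computed, so the following Python sketch makes an assumption: it resamples the ten differences with replacement and reports the share of bootstrap means that fall at or below zero. Other variants exist (for example, shifting the sample to have mean zero first); the function name, seed, and number of resamples are illustrative choices.

```python
import random

# Differences B-A for the ten queries.
diffs = [0.10, 0.41, -0.24, 0.0, 0.25, 0.70, 0.60, -0.02, 0.09, 0.25]

def bootstrap_p_value(values, n_resamples=10_000, seed=0):
    """Share of bootstrap sample means that are <= 0 (one-sided)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    n = len(values)
    at_or_below_zero = 0
    for _ in range(n_resamples):
        sample = rng.choices(values, k=n)   # draw n values with replacement
        if sum(sample) / n <= 0:
            at_or_below_zero += 1
    return at_or_below_zero / n_resamples

p_value = bootstrap_p_value(diffs)
```

Because the resampling is random, the result varies slightly with the seed and the number of resamples; it should land near the small value reported on the slide rather than match it exactly.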

SLIDE 22

Comparing TREC-7 Submissions

Let’s compare the three submissions from UMass Amherst

– All three used the InQuery retrieval engine
– Named INQ501, INQ502, INQ503
– We’ll use all 5 tests discussed so far

SLIDE 23

Empirical Comparisons

[Figure: scatter plot of sign test p-values against Wilcoxon signed-rank test p-values; both axes run from 0.0 to 1.0]

SLIDE 24

Empirical Comparisons

[Figure: scatter plot of sign test p-values against t-test p-values; both axes run from 0.0 to 1.0]

SLIDE 25

Empirical Comparisons

[Figure: scatter plot of Wilcoxon signed-rank test p-values against t-test p-values; both axes run from 0.0 to 1.0]

SLIDE 26

Empirical Comparisons

[Figure: scatter plot of t-test p-values against randomization test p-values; both axes run from 0.0 to 1.0]

SLIDE 27

Empirical Comparisons

[Figure: scatter plot of t-test p-values against bootstrap test p-values; both axes run from 0.0 to 1.0]

SLIDE 28

ANOVA

Compare variance due to system to variance due to topic

Query    A      B      B-A
1       .25    .35    +.10
2       .43    .84    +.41
3       .39    .15    -.24
4       .75    .75     0
5       .43    .68    +.25
6       .15    .85    +.70
7       .20    .80    +.60
8       .52    .50    -.02
9       .49    .58    +.09
10      .50    .75    +.25

\hat{\sigma}^2 = MSE = 0.042
\hat{\sigma}_S^2 = MST = 0.229
F = \frac{MST}{MSE} = 5.41
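The MSE, MST, and F values can be recovered from the A and B columns with a two-way decomposition (systems as treatments, queries as blocks). The Python sketch below assumes that layout, which is an inference from the numbers rather than something the slide states; the variable names are illustrative.

```python
# Scores for systems A and B on the ten queries.
a = [0.25, 0.43, 0.39, 0.75, 0.43, 0.15, 0.20, 0.52, 0.49, 0.50]
b = [0.35, 0.84, 0.15, 0.75, 0.68, 0.85, 0.80, 0.50, 0.58, 0.75]
systems = [a, b]

k = len(systems)   # number of systems (treatments)
n = len(a)         # number of queries (blocks)

grand_mean = sum(sum(s) for s in systems) / (k * n)
system_means = [sum(s) / n for s in systems]
topic_means = [sum(s[i] for s in systems) / k for i in range(n)]

# Sum of squares explained by the system effect.
ss_system = n * sum((m - grand_mean) ** 2 for m in system_means)

# Residual sum of squares after removing system and topic effects.
ss_error = sum(
    (systems[j][i] - system_means[j] - topic_means[i] + grand_mean) ** 2
    for j in range(k)
    for i in range(n)
)

mst = ss_system / (k - 1)                 # mean square for systems
mse = ss_error / ((k - 1) * (n - 1))      # mean squared error
f_stat = mst / mse
```

With only two systems, F equals the square of the paired t statistic (2.33² ≈ 5.41), which is one way to see that ANOVA generalizes the t-test.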

SLIDE 29

ANOVA

ANOVA is a generalization of the t-test
Allows comparison of more than just 2 systems

– And across more factors than just system and topic

Let’s use ANOVA to compare all three INQ systems

SLIDE 30

Summary

These are 6 of the most common tests seen in IR experimentation

– Many others in the literature:
    Chi-squared
    Proportion test
    ANCOVA/MANOVA/MANCOVA

All have in common:

– The use of some probability distribution, computation of a p-value from that distribution

All produce p-values that are highly correlated

– Though they do not always agree about which pairs are significant

SLIDE 31

FUNDAMENTALS OF SIGNIFICANCE TESTING

Part 2

SLIDE 32

Testing Paradigms

Ronald Fisher
Jerzy Neyman
Egon Pearson
Harold Jeffreys

SLIDE 33

What Are Tests Really Telling Us?

Formal set-up:

– Two-sided:   H0: μ = 0    H1: μ ≠ 0
– One-sided:   H0: μ = 0    H1: μ > 0

The null hypothesis is a model

– We are looking to prove the model wrong

The p-value is the probability that you would have seen a result at least as extreme if H0 were true

– If that probability is low, conclude H0 is false

SLIDE 34

What Are Tests Really Telling Us?

Fisher: p-value is the likelihood of the data under H0

– The p-value is a conclusion about this particular experiment only
– Nothing more, nothing less

Neyman-Pearson: p < 0.05 means we can reject H0 as being unlikely to be true

– p-values lead to inference about the population
– The p-value itself is not interesting; the inference is
– Note that we do not accept that H1 is true!

Jeffreys: posterior probability of H0 being true can be compared to posterior probability of other models

SLIDE 35

What Are Tests NOT Telling Us?

NOT the “probability that the results are due to chance”
NOT whether the experiment is reliable
NOT the probability that H0 is true
NOT that H0 is false if the p-value is low

SLIDE 36

Low p-value ≠ False H0

Suppose we observe a difference of μ = 0.02 with σ = 0.16, and our t-test gives a p-value < 0.05

– This has a ~14% chance of occurring

What is P(H0 true | p < 0.05)?

– Let’s set up an alternative hypothesis H1, that μ is sampled from a normal distribution centered at 0.02

SLIDE 37

Low p-value ≠ False H0

P(H_0 \mid p < 0.05) = \frac{P(p < 0.05 \mid H_0) \, P(H_0)}{P(p < 0.05 \mid H_1) \, P(H_1) + P(p < 0.05 \mid H_0) \, P(H_0)}

P(p < 0.05 | H0) = 0.05
P(p < 0.05 | H1) = 0.14
P(H0) = 0.5
P(H1) = 0.5

P(H0 | p < 0.05) = 0.27
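The Bayes' rule calculation can be checked in a few lines of Python. Plugging in the rounded inputs shown on the slide gives about 0.26; the slide's 0.27 presumably comes from an unrounded value of P(p < 0.05 | H1), which is an assumption on my part since the exact figure is not given.

```python
# Inputs as printed on the slide (P(p < 0.05 | H1) is rounded to 0.14).
p_sig_given_h0 = 0.05   # false positive rate under H0
p_sig_given_h1 = 0.14   # chance of p < 0.05 under the alternative
prior_h0 = 0.5
prior_h1 = 0.5

# Bayes' rule: posterior probability that H0 is true given p < 0.05.
evidence = p_sig_given_h1 * prior_h1 + p_sig_given_h0 * prior_h0
posterior_h0 = p_sig_given_h0 * prior_h0 / evidence
```

Either way, roughly one in four "significant" results in this scenario would come from a true null hypothesis, far more than the 5% a naive reading of p < 0.05 suggests.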

SLIDE 38

Terms and Definitions

            A      B      C      D
query 1    0.3    0.2    0.4    0.1
query 2    0.4    0.3    0.4    0.5
query 3    0.1    0.1    0.3    0.4
query 4    0.5    0.2    0.1    0.3
query 5    0.3    0.3    0.2    0.1

“subjects”: the queries (rows)
“treatments”: the systems A to D (columns)
“measurements”: the effectiveness values in the cells

SLIDE 39

Terms and Definitions

Single-sample vs two-sample tests

– A single-sample test is generally based on applying one or more “treatments” to a single sample of “subjects”
– In a two-sample test, each treatment is applied to a different sample

Paired vs unpaired

– Paired tests are a special case of single-sample tests: subtract evaluation results for each example to obtain the measurements to summarize
– Unpaired tests can be single-sample too

SLIDE 40

Terms and Definitions

One-tailed vs two-tailed

– All the examples done to this point were one-tailed tests
    Computing the p-value from the right (upper) tail of the test statistic distribution
– Two-tailed tests compute the p-value from both tails
– Result is generally a higher p-value

SLIDE 41

Test Statistics and Distributions

Test statistic

– A summary of the data, usually designed to have specific distribution guarantees (asymptotically)

Parametric vs non-parametric

– If the test statistic distribution has any free parameters, the test is said to be “parametric”

Confidence interval

SLIDE 42

Sizes and Values

Sample size

– The number of subjects/examples in the experiment
– Assumed to be sampled i.i.d. from a much larger population

Effect size

– A measure of the difference between two “treatments” or algorithms in the population
– Independent of sample size
– H0: no effect

p-value

– The likelihood of observing the effect in the sample assuming H0 is true

Critical value

– The minimum test statistic value necessary to obtain p < α with a given sample size
– α usually = 0.05

SLIDE 43

Variance

Total variance

– The sum of the square differences between measurements and the overall mean

Within-group variance

– Variance due to subjects/topics
– Paired tests subtract this variance out

Between-group variance

– Variance due to the treatments/systems

SLIDE 44

Accuracy and Power

Accuracy

– The probability of getting p ≥ α when H0 is actually true
– Probability of correctly not rejecting H0
– Equal to 1 minus the false positive rate

Power

– The probability of getting p < α when the null hypothesis is actually false
– The probability of correctly rejecting H0
– True positive rate

Most tests are defined to have a false positive rate of α when H0 is true

– Achieving a certain power level involves estimating effect size and sample size

test result ↓ / H0 →     true                   false
not rejected             accuracy (1-α)         Type II error (1-β)
rejected                 Type I error (α)       power (β)

SLIDE 45

The Linear Model

Statistical tests are classifiers

– Like classifiers, they are based on an underlying model
– Unlike classifiers, we cannot evaluate them directly

ANOVA is based on the linear regression model

– And therefore the t-test is too

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

SLIDE 46

All models are wrong, but some are useful.

George E. P. Box

SLIDE 47

APPLICATIONS, OR, WHY BOTHER WITH FUNDAMENTALS?

Part 3

SLIDE 48

What is a Statistical Significance Test?

A statistical test consists of four things:

– A null hypothesis
– A test statistic
– A null distribution for the test statistic
– A critical value in the null distribution

You can invent any test you like!

– … as long as you can compute a test statistic and its null distribution

SLIDE 49

Tests Specific to IR

Sources of variance specific to IR:

– Assessor error and disagreement
– Missing relevance judgments
– Total number of relevant documents
– Topic/task type
– Properties of document corpus
– Properties of effectiveness measures
– Low-level system features (stemmer/stopwords/tokenization/etc)
– …

None of these are included in standard test models

SLIDE 50

Tests Specific to IR

Effectiveness measures are not measurements in the standard sense

– They are statistics computed from relevance judgments and ranked lists

An IR-specific test should look at individual relevance judgments

– Null hypothesis: two systems are equally good at presenting relevant documents to users

SLIDE 51

Likelihood Ratio Test

Really a framework for testing
Needed: a hypothesized null distribution and a hypothesized “alternative” distribution
Compute the likelihood ratio between the two
If the ratio is above some threshold, reject H0

SLIDE 52

ANOVA as a Likelihood Ratio

ANOVA is based on the linear model:

y_{ij} \sim N(\mu + \alpha_i + \beta_j, \sigma^2)

In words, the observed effectiveness of system j on topic i is sampled randomly

– Sampled from a normal distribution with mean influenced by system and topic

L_0 = \prod_{i,j} P(y_{ij} \mid \mu = 0, \sigma = \hat{\sigma})
L_1 = \prod_{i,j} P(y_{ij} \mid \mu = \hat{\mu}, \sigma = \hat{\sigma})

SLIDE 53

A Test for IR

Instead of the likelihood of effectiveness measure values, compute the likelihood of the actual relevance judgments

Suppose the following:

– Relevance is generated by flipping a biased coin
– The coin’s probability of coming up heads is biased by the system and the topic

If one system biases the coin more than another, even in the presence of topic bias, that system is more effective at finding relevant documents

SLIDE 54

Test Model

Likelihood is based on Bernoulli probabilities
This model is still linear in system and topic effects, but fixes some problems with the t-test

x_{ijk} \sim \mathrm{Bernoulli}(p_{ij})
\mathrm{logit}\, p_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}
\epsilon_{ij} \sim N(0, \sigma^2)

SLIDE 55

p-value Comparison

SLIDE 56

Power Analysis

High statistical power is desirable

– β ≈ 0.8 is generally considered good power
    80% chance of rejecting H0 when it is false

But you don’t know statistical power a priori

– Unlike accuracy, power cannot be guaranteed

Power analysis is a technique to estimate the necessary sample size to achieve high power

SLIDE 57

[Figure: the null distribution and the “actual” distribution of the test statistic, with the critical value marked and regions labeled α and 1 - β]

SLIDE 58

Power Analysis Example

“I want to be able to detect an effect of size 0.1 or higher with 80% probability”

– In other words, I want to be able to reject H0 even when the effect is relatively small

Steps (using t-test as example):

– Pick a value of n
– Let c_α be the critical value for sample size n
– Let t = 0.1 × √n
– Compute β = P(T ≥ c_α | n, t)

Search for the smallest value of n that results in β ≥ 0.8
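The search over n can be sketched in Python. To stay within the standard library, this sketch swaps the t distribution for a normal approximation (reasonable for large n) and treats the effect size 0.1 as measured in standard-deviation units, which is what t = 0.1 × √n implies. Both choices, along with the function names, are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def approximate_power(n, effect_size=0.1, alpha=0.05):
    """One-tailed power under a normal approximation to the t-test."""
    critical_value = STANDARD_NORMAL.inv_cdf(1 - alpha)   # c_alpha
    shift = effect_size * sqrt(n)                         # t = 0.1 * sqrt(n)
    # P(T >= c_alpha) when T is centered at `shift` instead of 0.
    return 1 - STANDARD_NORMAL.cdf(critical_value - shift)

def smallest_sample_size(target_power=0.8, effect_size=0.1, alpha=0.05):
    """Linear search for the smallest n that reaches the target power."""
    n = 2
    while approximate_power(n, effect_size, alpha) < target_power:
        n += 1
    return n

required_n = smallest_sample_size()
```

Under these assumptions, detecting a standardized effect of 0.1 with 80% power takes several hundred topics, far more than the 50 in a typical TREC collection.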

SLIDE 59

A Bad Idea

“I have a positive result but it’s too small to be significant. What should I do?”

We already saw that increasing the sample size is likely to decrease your p-value

– So get 50 more queries and test your systems over all 100
– Still not significant? Get 50 more and test your systems over 150
– …

Note: I am NOT recommending you do this!

SLIDE 60

MYTHS AND MISCONCEPTIONS

Part 4

SLIDE 61

Myths and Misconceptions

Significance tests lend rigor to our experimentation

– Without them, the usual differences of < 5% would be difficult to interpret

But they are widely misunderstood

– p-values can be incorrectly interpreted
– p-values can be easily manipulated (even unintentionally)

They are fundamentally no more rigorous than any AI approach to classification

– Though they may have a much deeper theoretical basis

SLIDE 62

Myth: H0 is a Realistic Model

The first and biggest misconception: the null hypothesis is sometimes true

– That is, there is a chance that there really is no effect

In AI, the null hypothesis is almost never true

– Really only when the experimenter made a mistake

The only question is how big of a sample size will it take to reject it

– There is always some sample big enough to reject it

SLIDE 63

Myth: Rejecting H0 Means it is False

First, H0 is always false
But even if it were true, we could still reject it for many reasons:

– something about our sample
– violations of test model assumptions
– failure to model important sources of variance
– unintentional overfitting

Rejecting H0 should not be taken to mean our system is definitely better

SLIDE 64

Myth: Test Assumptions Are Important

Consider the t-test based on the linear model
Assumptions:

– y is unbounded
– linearity and additivity
– homoscedasticity
– normality of errors
– (note: normality of data is not an assumption)

All of these are false!

– But we can evaluate how much their falseness affects accuracy and power

SLIDE 65

Myth: Test Assumptions Are Important

OK, so t-test assumptions are false. Why not use a different test?

Every test is based on some model, and every model is false

– Even so-called “assumption-free” tests like Fisher’s exact test or the bootstrap actually do involve assumptions

The tradeoff is generally between simplicity and power

– Fewer assumptions → less power → fewer significant results

t-test is popular because it is powerful, robust to violations of its assumptions, and computationally easy

SLIDE 66

Myth: p-Values Have Intrinsic Meaning

p < 0.05 is often taken as a “gold standard” of proof

Two things to keep in mind:

– The p-value comes out of a model; “all models are wrong”
– 0.05 is an arbitrary value that was probably first used as an example

Any meaning given to a p-value is extrinsic

– Usually granted by a community of scientists

SLIDE 67

Myth: p-Values Have Intrinsic Meaning

The real gold standard is whether it helps users

Any IR evaluation based on the Cranfield paradigm cannot directly answer that

But using a priori power analysis to determine appropriate sample size comes closer than looking at p-values

SLIDE 68

Myth: Lower p-Values are Better

If a p-value of 0.04 is better than a p-value of 0.06, then a p-value of 0.02 is even better, right?

A p-value can be lower for three reasons:

– The effect size is bigger (good)
– The sample size is bigger (bad)
– “Randomness”

There’s no way to know which of these is the reason

SLIDE 69

Myth: Lower p-Values are Better

p-value = P(data | H0, test model, inputs)
Any change to the underlying model results in a different probability distribution

– That includes changes to the systems being tested

p-values should not be compared directly

– Fisher and Neyman/Pearson would have agreed on this!

SLIDE 70

Myth: Running Many Tests is OK

AI experimentation often happens like this:
1. Modify a system, compare to baseline, run test
2. Significant?
    No: go back to step 1
    Yes: start writing a paper

How many tests does it take to get to the endpoint?

SLIDE 71

Sequential Testing

Suppose (hypothetically) that the null hypothesis is actually true

The probability of concluding it is false after one test is α (normally 0.05)

– The probability of concluding it is false after two tests is .05 + .95 × .05 = .0975
– After three tests, .05 + .95 × .05 + .95 × .95 × .05 = .143
– After 14 tests, ~0.5
– After 27 tests, ~0.75
– After 90 tests, ~0.99
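The sums above have a closed form: the chance of at least one false rejection in k independent tests is 1 - (1 - α)^k. A short Python sketch (the function name is illustrative, and independence between tests is assumed) reproduces the figures on the slide.

```python
def prob_false_rejection(num_tests, alpha=0.05):
    """P(at least one test rejects a true H0) over independent tests."""
    return 1 - (1 - alpha) ** num_tests

# Values for the numbers of tests mentioned on the slide.
table = {k: prob_false_rejection(k) for k in (1, 2, 3, 14, 27, 90)}
```

The same formula covers the multiple-testing case where several researchers each run their own experiments on the same null hypothesis: what matters is the total number of tests run.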

SLIDE 72

Multiple Testing

Suppose three different people have the same null hypothesis

– If each of them does one experiment, probability that there will be at least one false positive is 0.143
– If each of them does three experiments, probability goes to ~0.4

Result: very high probability that any given published result is false!

– “Why Most Published Research Findings Are False”, Ioannidis, PLoS Medicine, 2005

SLIDE 73

Multiple Testing

SLIDE 74

Correcting for Multiple Comparisons

We should adjust our p-values up for the fact that we have made multiple comparisons

Many different approaches:

– Bonferroni correction
– Tukey’s Honest Significant Differences
– Multivariate t test
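Of the approaches listed, the Bonferroni correction is the simplest to write down: multiply each p-value by the number of comparisons m, capping the result at 1. The Python sketch below is a minimal illustration; the function name and the three input p-values are hypothetical and are not taken from the tutorial's data.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical unadjusted p-values from three pairwise comparisons.
raw_p_values = [0.01, 0.04, 0.30]
adjusted = bonferroni_adjust(raw_p_values)

# Which comparisons remain significant at alpha = 0.05 after adjustment?
still_significant = [p < 0.05 for p in adjusted]
```

In this made-up example the second comparison is significant before adjustment (0.04) but not after (0.12), which is the kind of change in conclusions the slide is pointing at.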

SLIDE 75

Tukey’s HSD

Omnibus hypothesis:

– H0: S1 = S2 = S3 = … = Sn
– ANOVA fits a linear model to all data; rejects null if there is any difference between any pair of systems

The maximum difference is the one most likely to be the cause of rejection

Tukey: compute a distribution of maximum difference, base all p-values on that

SLIDE 76

Tukey’s HSD

SLIDE 77

Effect on TREC-8 Evaluation

SLIDE 78

Families of Experiments

p-values should be adjusted based on “families” of experiments

– All experiments testing the same hypothesis

How do we define a family of experiments?

– Suppose we are testing hypotheses about clustering for IR
    H: augmenting LM retrieval with clusters improves ad hoc retrieval
    H: augmenting BM25 retrieval with clusters improves ad hoc retrieval
    H: augmenting any ranking function with clusters improves ad hoc retrieval
    H: clusters are good for retrieval

SLIDE 79

What is a “Family”?

In TREC data, families could be:

– All pairs of submitted systems
– Pairs of systems submitted by each participating group in the context of the full set of systems
– Pairs of systems submitted by each participating group in the context of just that group’s systems

p-values can be corrected based on each family type

– Which results in different adjustments for each

The third is the least “honest”, yet is really the only thing you can do on your own

SLIDE 80

Summary

Significance tests are just models

– When we use them “out of the box”, we fail to model many sources of variance in IR
    Variance in relevance, in user behavior, in interactions between system components, …
– The things we do model are probably being modeled wrong
    In particular additivity of system and topic effects
– More correct models could change our conclusions about systems

We know that modeling multiple testing changes our conclusions dramatically

– Most other concerns are extremely minor in comparison
– But we don’t really truly know how to adjust for multiple comparisons
    It depends very much on what other researchers are thinking and doing

The one thing I want you to take away from this talk:

– Never trust a p-value, even one you computed yourself

SLIDE 81

SIGNIFICANCE TESTING IN IR RESEARCH

Part 5

SLIDE 82

What Does it Mean?

You can always find significance

– With the right sample, the right sample size, the right test, enough iterations of testing
– “Fishing expeditions”

Significance is only a rough proxy for “interestingness”

– A heuristic

Looking for a recommendation of what test to use?

– I’ll always say the t-test, others will say Wilcoxon or randomization or bootstrap
– The truth is, it doesn’t matter much

SLIDE 83

Searching for Interesting Results

How do we use significance tests in research?

– Conference program committees/journal editors use them as a guide for determining what to publish
    Publication determines research directions that people follow
    Published systems are implemented as baselines
– Essentially as a heuristic in a search for the best algorithms

They can easily be used as a substitute for human judgment

– Like most AI, they should be used as an aid to human judgment
– There isn’t one right way to do it
    No Free Lunch Theorem applies to significance testing

SLIDE 84

Searching for Interesting Results

What if significance was granted more conservatively? e.g. by:

– Correcting for multiple comparisons
– Using tests that make fewer assumptions
– Using a lower value of alpha (0.01 for instance)

Is a more conservative heuristic always better?

SLIDE 85

The State of Research Today

[Figure: diagram of overlapping sets labeled “All hypotheses”, “Statistically significant results”, “Published results”, and “Interesting results”]

SLIDE 86

The State of Research When “Significance” is Granted More Conservatively

[Figure: diagram of overlapping sets labeled “All hypotheses”, “Statistically significant results”, “Published results”, and “Interesting results”]

Fewer publications overall
… which means fewer uninteresting publications
… but also that fewer truly interesting results can be published

SLIDE 87

Thought Experiment

Suppose statistical significance is a necessary and sufficient condition for publication

– Consequences:
    Many published papers are not interesting
    Some interesting results are not published
    Most uninteresting results are not published
– Published uninteresting papers ->
    time wasted reading, re-implementing
– Unpublished interesting results ->
    time wasted each time results are re-discovered
– Unpublished uninteresting papers ->
    time wasted each time experiment is tried and fails

SLIDE 88

Example: NLP for IR

NLP generally doesn’t work for IR

– Maybe in some domains (like QA), for some tasks, but in general not

But almost every IR grad student has had some idea for using NLP to improve IR

– Result: a handful of published papers from a very large number of experiments, mostly due to randomness
    e.g. multiple testing
– … which gives hope to the next generation of students (who don’t know about the very low success rate)
– … which results in a lot of wasted time as they re-do experiments already done by every previous generation

SLIDE 89

Example: NLP for IR

Would we have been better off had that handful of papers never been published?
Would we have been better off if all those negative results had been published?
Or are we better off with grad students having done the work to gain some intuition about why it doesn’t work?

SLIDE 90

Reproducibility

A major topic of discussion recently

– Panel discussions at IR conferences, a track at ECIR, RIGOR workshop at SIGIR

Questions:

– What does “reproducibility” mean?
– Why do so many results in IR seem hard to reproduce?
– What can we do to make it easier to reproduce them?

SLIDE 91

Reproducibility

What does “reproducibility” mean?

– SIGIR RIGOR workshop tiers:
    Repeatability: repeat a previous result under identical conditions
    Reproducibility: reproduce a previous result under similar conditions
    Generalizability: apply a technique under different conditions

“Repeatability” means duplicating the result exactly
“Reproducibility” means adding new sources of variability and randomness

– Expect results to be statistically similar, that is, within confidence intervals

SLIDE 92

Reproducibility

As a binary indicator, statistical significance might be likely to generalize

– In particular, if statistical significance means “results are not due to chance”, does it follow that a statistically significant result will reproduce in different experimental settings?

NO

– Significance does NOT mean “results are not due to chance”
– Significance can fail to reproduce for many reasons, all of which we would attribute to “randomness” in our current testing models

In fact, we should expect many significant results to fail to reproduce solely because of this “randomness”

– Says nothing about the honesty/integrity of the researchers whose results didn’t reproduce
– It is only when a result reproduces again and again that we should accept it

SLIDE 93

Experimental Validity

[Figure labels: significance; internal validity; external validity; construct validity]

SLIDE 94

Takeaways

Always do significance tests

– But don’t worry too much about which tests to use
– The t-test is always a good option
– Correcting for multiple testing is probably not necessary

Don’t just report p-values or * to indicate significance

– Always report estimated effect sizes and confidence intervals

Always take results of tests with a grain of salt

– Especially when the effect size is low
– Don’t expect them to generalize a priori
– Build your intuition and use it

Significance must be interpreted against the internal and external validity of the experiment

– Cranfield: very strong internal validity; research on its external validity is entirely inconclusive
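The takeaway about reporting effect sizes and confidence intervals alongside p-values could look like the following sketch. It uses a sign-flip randomization test and a percentile bootstrap interval, both written from scratch with the Python standard library. The data are made-up per-topic differences, and the function names, resample counts, and the choice of a standardized effect size (mean difference divided by its standard deviation) are assumptions of this example.

```python
import random
import statistics

def randomization_p_value(diffs, n_resamples=10000, seed=0):
    """Two-sided sign-flip randomization test for H0: mean difference = 0."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so ties are not lost to floating-point rounding.
        if abs(sum(flipped) / n) >= observed - 1e-12:
            hits += 1
    return hits / n_resamples

def bootstrap_ci(diffs, level=0.95, n_resamples=10000, seed=0):
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

def report(diffs):
    """Format effect size, confidence interval, and p-value together."""
    mean = sum(diffs) / len(diffs)
    d = mean / statistics.stdev(diffs)  # standardized effect size
    low, high = bootstrap_ci(diffs)
    p = randomization_p_value(diffs)
    return (f"mean diff = {mean:.3f} (95% CI [{low:.3f}, {high:.3f}]), "
            f"d = {d:.2f}, p = {p:.3f}")

# Hypothetical per-topic differences (system B minus system A).
diffs = [0.05, -0.02, 0.10, 0.03, 0.07, -0.01, 0.04, 0.08, 0.00, 0.06]
summary = report(diffs)
print(summary)
```

Reporting the mean difference and its interval next to the p-value lets a reader judge whether a significant result is also large enough to matter.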
