Lecture 7.1: Multiple Comparisons
(A ‘non-quiz’ topic)
- Examples of the need for multiple comparisons
- The problem with multiple comparisons post hoc; an outline of a solution
- Specific solutions: Fisher LSD, Tukey HSD, Holm
Contrast weights for the organiser study:

Group      some.no   pre.post
no.org        -2         0
pre.org        1        -1
post.org       1         1
Predicted ordering of paw-lick latencies: SM = MM′ > MM ≈ SS > MS?
Groups and contrasts:

Morphine → Saline (MS)
Morphine → Morphine (MM)
Saline → Saline (SS)
Saline → Morphine (SM)
Morphine → Morphine, new environment (MM′)

Contrast 1: new vs same environment
Contrast 2: Training, M vs S
Contrast 3: Test, M vs S
Contrast 4: Training × Test
Contrast 5: NA!
(After Siegel, 1975 – See Howell, 6th ed., p. 346)
Group     λcon   λtr   λte   λT×T
1 = MM      1     1     1     1
2 = MS      1     1    -1    -1
3 = SM      1    -1     1    -1
4 = SS      1    -1    -1     1
5 = MM′    -4     0     0     0
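As a quick check (an R sketch; the matrix below transcribes the contrast weights for the five groups given above), each weight vector sums to zero and every pair of contrasts is orthogonal:

```r
# Contrast weights for the 5 morphine-study groups (MM, MS, SM, SS, MM'),
# transcribed from the weights above; one column per contrast.
L <- cbind(con = c(1, 1, 1, 1, -4),   # new vs same environment
           tr  = c(1, 1, -1, -1, 0),  # Training: M vs S
           te  = c(1, -1, 1, -1, 0),  # Test: M vs S
           TxT = c(1, -1, -1, 1, 0))  # Training x Test
colSums(L)    # each contrast sums to 0
crossprod(L)  # off-diagonal entries are 0: all pairs orthogonal
```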
The classical approach to multiple comparisons relies on the concepts of Type I and Type II errors. False Discovery Rate (FDR) is a new approach to the multiple comparisons problem. Instead of controlling the chance of any false positives, i.e., Prob(at least 1 false positive) [as Bonferroni or other methods do], FDR controls the expected proportion of false positives among voxels that are judged to be suprathreshold. This turns out to be a relatively lenient metric for false positives, and it leads to an increase in power. The FDR approach is well-suited to the case of “many, many tests.” Later we will show how FDR thresholds are determined from the observed p-value distribution.
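As a concrete sketch (with made-up p-values, not data from the lecture), the Benjamini-Hochberg rule finds the largest i with p(i) ≤ (i/m)·q and declares the i smallest p-values suprathreshold:

```r
# Benjamini-Hochberg FDR thresholding on m = 10 hypothetical p-values;
# q is the desired false discovery rate.
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.471)
m <- length(p)
q <- 0.05

ps     <- sort(p)
below  <- ps <= (seq_len(m) / m) * q   # BH criterion: p(i) <= (i/m)*q
k      <- if (any(below)) max(which(below)) else 0
thresh <- if (k > 0) ps[k] else 0      # reject all p-values <= thresh

# Equivalent adjusted p-values via the built-in p.adjust():
p.bh <- p.adjust(p, method = "BH")
```

Here the BH threshold is .008, so two tests survive, whereas Bonferroni (.05/10 = .005) would keep only one: the FDR criterion is the more lenient of the two.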
                          Outcome of other participant
Dependent variable        Unknown   Better   Worse   Equal
Outcome satisfaction
  Voice                   5.1a,b    2.6c     4.1b    5.4a
  No voice                3.1d      2.8c     4.2b    5.3a
Outcome fairness
  Voice                   5.1b      2.3c     2.0c    6.1a
  No voice                3.0d      2.4c,d   2.1c    6.1a
Note: For each dependent variable, means with no subscripts in common differ significantly, as indicated by a least significant difference test for multiple comparisons between means (p < .05).
# Organiser study: Tukey HSD approach
contrasts(d00$group, 2) = contr.treatment(3, base = 2, contrasts = TRUE)
rs3 = aov(score ~ group, data = d00)
rs30 = TukeyHSD(rs3)
print(rs30)
# [You may need to define a 'group' variable]
    95% family-wise confidence level

Fit: aov(formula = score ~ group, data = d00)

$group
                 diff         lwr      upr     p adj
pre.org-no.org    0.1 -1.57075607 1.770756 0.9879376
post.org-no.org   1.7  0.02924393 3.370756 0.0455236
post.org-pre.org  1.6 -0.07075607 3.270756 0.0624878
# Organiser study: Holm’s approach
rs31 = pairwise.t.test(d00$score, d00$group)
print(rs31)
    Pairwise comparisons using t tests with pooled SD

data:  d00$score and d00$group

         no.org pre.org
pre.org  0.883  -
post.org 0.054  0.054

P value adjustment method: holm
(Holm’s procedure for controlling the familywise Type I error rate will be introduced in a later slide.)
False Alarm, aka False Discovery or False Rejection; Miss, aka False Non-Discovery.
β = P(Retain H0 | H0 false); 1 − β = Power.
Suppose all m H0’s are true, ‘rejection’ means rejecting at least one H0, α is P(rejection) on each test, and the tests are independent. We want the familywise probability

αF = 1 − P(no rejections) = 1 − (1 − α)^m, implying α = 1 − (1 − αF)^(1/m).

Below we show αF as a function of α and m.

           m = 2       5      10      20      50
α = .050  0.0975  0.2262  0.4013  0.6415  0.9231
α = .010  0.0199  0.0490  0.0956  0.1821  0.3950
α = .001  0.0020  0.0050  0.0100  0.0198  0.0488
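The table can be reproduced directly from the formula; a small R sketch:

```r
# Familywise error rate for m independent tests, each at level alpha:
# alphaF = 1 - (1 - alpha)^m (reproduces the table above).
fwe <- function(alpha, m) 1 - (1 - alpha)^m
round(outer(c(.05, .01, .001), c(2, 5, 10, 20, 50), fwe), 4)

# Inverse: the per-test alpha needed for a target familywise rate.
per.test <- function(alphaF, m) 1 - (1 - alphaF)^(1 / m)
```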
αF = 1 − (1 − α)^m ≈ mα, for small α
(using a range, or Q, statistic due to Ryan, Einot, Gabriel & Welsch)
At one extreme (Dunn/Bonferroni), each test uses a critical value of αFWE/m, which is conservative; at the other, each test uses α uncorrected, and the resulting familywise error rate can be very large. The Ryan and Holm procedures are intermediate, with the latter being more conservative. We now reprise the ‘morphine tolerance’, or ‘paw-licking’, study.
Contrast    t      p0     pdb    pun    phm    prq   ddb  dun  dhm  drq
MM-SS     0.354  0.725  0.005  0.05  0.050  0.050    0    0    0    0
SM-McM    1.768  0.086  0.005  0.05  0.025  0.045    0    0    0    0
MS-MM     2.121  0.041  0.005  0.05  0.017  0.040    0    1    0    0
MS-SS     2.475  0.018  0.005  0.05  0.012  0.035    0    1    0    1
SS-SM     4.596  0.000  0.005  0.05  0.010  0.030    1    1    1    1
MM-SM     4.950  0.000  0.005  0.05  0.008  0.025    1    1    1    1
SS-McM    6.364  0.000  0.005  0.05  0.007  0.020    1    1    1    1
MM-McM    6.717  0.000  0.005  0.05  0.006  0.015    1    1    1    1
MS-SM     7.071  0.000  0.005  0.05  0.006  0.010    1    1    1    1
MS-McM    8.839  0.000  0.005  0.05  0.005  0.005    1    1    1    1
FWE = 0.05. ‘db’ = Dunn/Bonferroni; ‘un’ = uncontrolled FWE; ‘hm’ = Holm; ‘rq’ = Ryan/FDR; the ‘d’ columns give the reject (1) / retain (0) decisions. Note that ‘rq’ and ‘hm’ lie between ‘db’ and ‘un’. See Howell, p. 397, Table 12.7. [Figure: critical p-values vs. p-rank for the 4 procedures.]
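The ‘phm’ column above follows Holm’s step-down rule: compare the i-th largest p-value with α/i (equivalently, the i-th smallest with α/(m − i + 1)). A sketch of the critical values for m = 10:

```r
# Critical p-values for m = 10 comparisons at FWE = .05, matching the
# 'pdb', 'pun' and 'phm' columns above (rows ordered largest p first).
m <- 10; alpha <- 0.05
holm.crit <- alpha / (1:m)      # .050 .025 .0167 ... .005 (Holm)
bonf.crit <- rep(alpha / m, m)  # Dunn/Bonferroni: .005 throughout
unc.crit  <- rep(alpha, m)      # uncontrolled: .05 throughout
```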
Howell, pp. 396-7; http://en.wikipedia.org/wiki/False_discovery_rate
– BMC Bioinformatics 2008, 9:303. doi:10.1186/1471-2105-9-303
– This article is quite technical
– log(Y), or log(Y + ε) to remove 0’s (ε = .5?)
– sqrt(Y)
– 1/Y: for Y = Reaction Time, 1/Y is ‘speed’
– arcsin(Y), if Y is a proportion
lphapp = log(Pasthapp + .5); sphapp = sqrt(Pasthapp)
# Compare variances on score for all 3 groups
rs3a = bartlett.test(Pasthapp ~ memtype, na.action = na.omit, data = dat0)
print(rs3a)
rs3b = bartlett.test(lphapp ~ memtype, na.action = na.omit, data = dat0)
print(rs3b)
rs3c = bartlett.test(sphapp ~ memtype, na.action = na.omit, data = dat0)
print(rs3c)
In the ‘organiser’ study, we might be interested in the difference between the pre-organiser and post-organiser population means; the weights a1 = 0, a2 = 1, a3 = -1 specify the appropriate contrast: λ1 = 0·µ1 + 1·µ2 + (-1)·µ3 = µ2 - µ3
To contrast the control group with the average of the two treatment groups, the weights would be a1 = 2, a2 = -1, a3 = -1 (or 1, -.5, -.5); this choice affects only the scale or size (and perhaps the sign) of the contrast, not its primary meaning.
If Level 1 = control, Level 2 = treatment 1, and Level 3 = treatment 2, then the weights (-2, 1, 1) and (0, 1, -1) define, respectively, a contrast of ‘control’ vs ‘the average of the two treatments’ and a contrast of ‘treat-1’ vs ‘treat-2’. Also, as we shall see, these contrasts are orthogonal.
Group   X1   X2
  1     a1   b1
  2     a2   b2
  …     …    …
  k     ak   bk
Suppose that X1 and X2 are the weights defining 2 contrasts among the k groups. What is the ‘correlation’ between X1 and X2? We need the “Sum of Products”, SP, which we would put in the numerator of a familiar formula, r = SP/√(SS1·SS2); r = 0 iff SP = 0.

SP = Σ_{i=1}^{k} a_i b_i − (Σ_{i=1}^{k} a_i)(Σ_{i=1}^{k} b_i) / k

Now, if X1 and X2 are true contrasts, Σ_{i=1}^{k} a_i = 0 = Σ_{i=1}^{k} b_i, so SP = Σ_{i=1}^{k} a_i b_i. Thus the condition for orthogonality, Σ_{i=1}^{k} a_i b_i = 0, is also the condition for SP = 0, i.e., r = 0.
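A quick numeric check of this condition, using the ‘control vs treatments’ and ‘treat-1 vs treat-2’ weights from earlier as an example:

```r
# Two contrasts among k = 3 groups; both sum to 0 (true contrasts),
# and their sum of products is 0, so SP = 0 and r = 0: orthogonal.
a <- c(-2, 1, 1)  # control vs average of the two treatments
b <- c(0, 1, -1)  # treatment 1 vs treatment 2
c(sum(a), sum(b))  # both 0
sum(a * b)         # 0
```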
         pre.org post.org
no.org         0        0
pre.org        1        0
post.org       0        1
score ~ train + d.lin + d.quad + train:d.lin + train:d.quad
Predictor       Coefficient
Intercept       b0
train2          b1
d.lin           b2
d.quad          b3
train2:d.lin    b4
train2:d.quad   b5
Center the difficulty predictor with I(difficulty - k) for some constant k, or standardize it with scale(difficulty), in lm().
t² = F(1, df), so F can also be used to test H0.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.500      0.297  15.151 5.08e-15 ***
grpone1        0.300      0.210   1.428    0.164

Residual standard error: 1.627 on 28 degrees of freedom
Multiple R-squared: 0.06792, Adjusted R-squared: 0.035
F-statistic: 2.04 on 1 and 28 DF, p-value: 0.1642
Note: t² = 1.428² = 2.04 = F! 2*(1 - pt(1.428, 28)) = 0.1644, as given above.
   score    group
1      5   no.org
2      4   no.org
3      6   no.org
4      2   no.org
5      2   no.org
6      2   no.org
7      6   no.org
8      4   no.org
9      3   no.org
10     5   no.org
11     4  pre.org
12     5  pre.org
13     3  pre.org
…      …        …
27     6 post.org
28     4 post.org
29     4 post.org
30     7 post.org
# Custom-made function
mean.sd = function(x) {
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}
# Data summary by group
rs0 = by(d$score, d$group, mean.sd)
d$group: no.org
    mean       sd
3.900000 1.595131
------------------------
d$group: pre.org
    mean       sd
4.000000 1.333333
------------------------
d$group: post.org
    mean       sd
5.600000 1.577621

Each of the sd’s, 1.595, 1.333 and 1.578, is an independent estimate of the within-group sd, σ. MSw is the best, pooled estimate.
To test whether mean scores are the same for ‘post.org’ and ‘pre.org’, we would do an independent-samples t-test. But what estimate of within-group variance should we use? We could pool over just the ‘post.org’ and ‘pre.org’ groups, or use MSw, which pools over all three. The df of the former is n1 + n2 – 2 = 18; the df of MSw is N – k = 30 – 3 = 27 > 18, so MSw is more precise.
Analysis of Variance Table

Response: score
          Df Sum Sq Mean Sq F value  Pr(>F)
group      2  18.20    9.10  4.0082 0.02991 *
Residuals 27  61.30    2.27
2*(1 - pt(2.37, 27)) = 0.025; so p = .025, 2-tailed.
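Putting the pieces together (means and MSw taken from the output above; n = 10 per group), an R sketch of the MSw-based t-test:

```r
# t-test of post.org vs pre.org using MSw (df = 27) as the pooled
# within-group variance, rather than the two-group estimate (df = 18).
m.pre <- 4.0; m.post <- 5.6; MSw <- 2.27; n <- 10
t.stat  <- (m.post - m.pre) / sqrt(MSw * (1/n + 1/n))  # about 2.37
p.2tail <- 2 * (1 - pt(t.stat, df = 27))               # about 0.025
```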
We haven’t yet considered if this t-test was planned or post-hoc!
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.9000     0.4765   8.185 8.64e-09 ***
grouppre.org    0.1000     0.6739   0.148   0.8831
grouppost.org   1.7000     0.6739   2.523   0.0178 *
# Default coding is Dummy Coding
print(contrasts(d00$group))
         pre.org post.org
no.org         0        0
pre.org        1        0
post.org       0        1
– If ‘group’ has k levels, there are k-1 predictors, X1, …, Xk-1
– Designate one level, say level 1, as the baseline. Each obs in level 1, the control condition, is assigned a value of 0 on all predictors: X1 = 0, …, Xk-1 = 0
– Each obs in level 2 is assigned a value of 1 on X1, and 0 on all other predictors
– Each obs in level 3 is assigned a value of 1 on X2, and 0 on all other predictors
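These rules can be inspected directly in R; a small sketch using a hypothetical factor with the organiser-study level names:

```r
# Dummy (treatment) coding for a k = 3 level factor: 'no.org' is the
# baseline; X1 flags pre.org and X2 flags post.org.
g <- factor(c("no.org", "pre.org", "post.org"),
            levels = c("no.org", "pre.org", "post.org"))
contrasts(g)       # the k - 1 = 2 dummy predictors
model.matrix(~ g)  # intercept column plus the two dummies
```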
Analysis of Variance Table

Response: score
          Df Sum Sq Mean Sq F value  Pr(>F)
group      2  18.20    9.10  4.0082 0.02991 *
Residuals 27  61.30    2.27
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.9000     0.4765   8.185 8.64e-09 ***
grouppre.org    0.1000     0.6739   0.148   0.8831
grouppost.org   1.7000     0.6739   2.523   0.0178 *
contrasts(d00$group, 2) = contr.treatment(3, base = 2, contrasts = TRUE)
rs1 = lm(score ~ group, data = d00)
print(summary(rs1))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.0000     0.4765   8.395 5.25e-09 ***
group1       -0.1000     0.6739  -0.148   0.8831
group3        1.6000     0.6739   2.374   0.0249 *
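With ‘pre.org’ as the baseline (base = 2), the intercept is the pre.org mean and each slope is a difference from it; a quick check against the group means reported earlier (3.9, 4.0, 5.6):

```r
# Treatment coding with pre.org as baseline: the coefficients equal
# the baseline mean and mean differences (means from the data summary).
m <- c(no.org = 3.9, pre.org = 4.0, post.org = 5.6)
b0 <- unname(m["pre.org"])                  # intercept: 4.0
b1 <- unname(m["no.org"] - m["pre.org"])    # group1:   -0.1
b3 <- unname(m["post.org"] - m["pre.org"])  # group3:    1.6
```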