

SLIDE 1

Lecture 7.1: Multiple Comparisons

(A 'non-quiz' topic)

  • Examples of the need for multiple comparisons
  • The problem with multiple comparisons post hoc; an outline of a solution
  • Specific solutions: Fisher LSD, Tukey HSD, Holm, Dunn/Bonferroni, Ryan (REGWQ; related to the False Discovery Rate, FDR)

SLIDE 2

Multiple Comparisons

  • Occasionally, e.g., at the start of a research project, we do not have a priori theories and contrasts and, therefore, cannot use the 'surgical' approach of planned comparisons. We simply want to see whether the different 'treatments' are all the same.
  • If the omnibus F ratio is significant, we may want to know after the fact (or post hoc) which treatments seem to 'work'. This leads to multiple comparisons, e.g., (a) between every 'treatment' and the 'control' group (k-1 comps), or (b) between every pair of 'treatments' (k(k-1)/2 comps).

SLIDE 3

  • Ex: Ss are randomly assigned to one of 3 conditions: No organiser ('no.org'), Organiser before lecture ('pre.org'), and Organiser after lecture ('post.org').
  • We might plan to examine 2 orthog contrasts, but we might also wish to compare 'post' with 'no', even though we have only 2 df between groups.

              some.no   pre.post   no.post
  no.org        -2          0        -1
  pre.org        1         -1         0
  post.org       1          1         1

SLIDE 4

A Paw-Licking Example

  • Morphine (M) reduces a rat's sensitivity to pain – under M for the 1st time, it takes them longer to lick their paws (signalling pain) when they are put on an uncomfortably warm surface. So 'time to lick' is also an index of M-tolerance (= 0 on 1st trial).
  • Group MM receives M for 3 trials, then M on the critical 4th trial in the same lab setting. M-tolerance has developed, so RT is 'normal'.
  • Group MS receives Saline on the 4th trial – they expected M but got S, so they are hypersensitive to pain and RT is very short.

SLIDE 5

A Paw-Licking Example

  • Group MM' receives Morphine on the 4th trial, but in a different setting. The usual cues are absent on the 4th trial, so the rat shd not show M tolerance, and RT shd be long.
  • Group SM receives Saline for 3 trials and Morphine on the 4th trial, but in the same setting. Rat shd not show M tolerance, and RT shd be long.
  • The 5th group was SS. Predictions for RT are: SM = MM' > MM ? SS > MS
  • Tr = M vs S on 1st 3 trials; Test = M vs S on 4th trial

SLIDE 6

A Paw-Licking Example

The five (train → test) groups:

  Morphine → Saline (MS)
  Morphine → Morphine (MM)
  Saline → Saline (SS)
  Saline → Morphine (SM)
  Morphine → Morphine, New Envt (MM')

Contrast 1: new v same; Contrast 2: Tr, M v S; Contrast 3: Test, M v S; Contrast 4: Tr * Test; Contrast 5: NA!

(After Siegel, 1975 – See Howell, 6th ed., p. 346)

SLIDE 7

Orthogonal contrasts for a (2x2 + 1) = 5-group design

  • The (train, test) groups in the 'paw-lick' study are MM, MS, SM, SS and MM' (where M' = M in a new context). The 1st 4 groups conform to a tidy 2x2 design. Interpret each contrast below!

  Group    lcon   ltr   lte   lT*T
  1=MM       1     1     1     1
  2=MS       1     1    -1    -1
  3=SM       1    -1     1    -1
  4=SS       1    -1    -1     1
  5=MM'     -4     0     0     0
SLIDE 8

  • Ex: 'Paw-lick' study of Morphine tolerance, with (train, test) groups, MM, MS, SM, SS, MM' (where M' = M in a new context; S = saline). The 1st 4 groups conform to a tidy 2x2 design, and yield 3 orthog contrasts. But we might be interested also in comparing MM with MM'.

  Group    lcon   ltr   lte   lT*T   lKara
  1=MM       1     1     1     1       1
  2=MS       1     1    -1    -1       0
  3=SM       1    -1     1    -1       0
  4=SS       1    -1    -1     1       0
  5=MM'     -4     0     0     0      -1
SLIDE 9

The problem of Type I errors

  • Measure 10 variables on n = 100 Ss, and examine the correl matrix for sig correls. Assume true r = 0. How many observed r's do we expect to be sig (where |r|crit = 0.20, p = .05)? (Ans. E = Np = 45*0.05 = 2.25. Why?)
  • What is the P(at least 1 sig correl)? Ans. P(at least 1) = 1 – P(none) = 1 – (.95)^45 = .90. We're almost certain to find at least 1 sig r! This is the problem with multiple comparisons!
  • Suppose we used α = .001, instead of .05. Then |r|crit = 0.32, and P(at least 1 sig r) = 1 – (.999)^45 = .044, which is much more acceptable.
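These probabilities are easy to verify; a quick sketch in Python (my own check of the slide's arithmetic, not code from the lecture):

```python
# Expected number of 'significant' correlations, and the chance of at
# least one, when all 45 true correlations are 0 and the 45 tests are
# treated as independent (an approximation, since the r's share variables).
m = 45  # 10 variables -> 10*9/2 = 45 correlations

def p_at_least_one(alpha, m):
    """P(at least 1 false alarm) = 1 - P(no false alarms)."""
    return 1 - (1 - alpha) ** m

print(m * 0.05)                            # expected count E = m*alpha = 2.25
print(round(p_at_least_one(0.05, m), 3))   # ~0.90: nearly certain
print(round(p_at_least_one(0.001, m), 3))  # ~0.044: much more acceptable
```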

SLIDE 10

The problem of Type I errors

  • Decreasing the Type I error rate from .05 to α = .001 raises the critical value from 0.2 to |r|crit = 0.32.
  • But then we would retain H0 in cases of 'seemingly large' r, e.g., r = 0.27! That is, we would fail to detect violations of H0 more often; i.e., our power would decrease.
  • How to decrease α without sacrificing too much power (assuming that sample size, n, is fixed)?
  • Recall that power depends on (i) α, (ii) the difference in parameter value (e.g., µ, ρ) between H0 and H1, and (iii) measurement error.

SLIDE 11

SLIDE 12

The classical approach to multiple comparisons relies on the concepts of Type I and Type II errors. False Discovery Rate (FDR) is a new approach to the multiple comparisons problem. Instead of controlling the chance of any false positives, i.e., Prob(at least 1 false positive) [as Bonferroni or other methods do], FDR controls the expected proportion of false positives among voxels that are judged to be suprathreshold. This turns out to be a relatively lenient metric for false positives, and it leads to an increase in power. The FDR approach is well-suited to the case of “many, many tests.” Later we will show how FDR thresholds are determined from the observed p-value distribution.

SLIDE 13

Outline of a Solution

  • To ensure that P(at least 1 sig r) is acceptably low (e.g., .05 or .10), each individual test has to be done with a very stringent level of α (e.g., .01 or .001).
  • To proceed formally, let us label P(at least 1 sig r) as the family-wise Type I error rate, αF; α is, as before, the Type I error rate for each individual test. If we wish αF to be 'small' (e.g., .1), what should α be? If we set α at, e.g., .01, what is the resulting αF?
  • In sum, what is the relationship between αF and α? Which approaches 'optimise' this reln?

SLIDE 14

R packages, with examples

  • Most post-hoc comparisons fall into 1 of 2 categories.
  • Compare every 'treatment' to a 'control' group: download with install.packages('multcomp'), and use Dunnett's test.
  • Compare each treatment with every other treatment: use TukeyHSD(model) and pairwise.t.test(score, group).
  • Other approaches include Fisher's Least Significant Difference (LSD) approach, and the use of the False Discovery Rate (FDR).

SLIDE 15

Table 1: Mean outcome judgments as a function of Procedure (Voice vs No voice) and Outcome of Other Participant (Expt. 1)

                                    Outcome of other participant
Dependent Variable                Unknown   Better   Worse   Equal
Outcome satisfaction   Voice       5.1a,b    2.6c     4.1b    5.4a
                       No voice    3.1d      2.8c     4.2b    5.3a
Outcome fairness       Voice       5.1b      2.3c     2.0c    6.1a
                       No voice    3.0d      2.4c,d   2.1c    6.1a

Note: For each dependent variable, means with no subscripts in common differ significantly, as indicated by a least significant difference test for multiple comparisons between means (p < .05).

SLIDE 16

# Organiser study: Tukey HSD approach
contrasts(d00$group, 2) = contr.treatment(3, base=2, contrasts=TRUE)
rs3 = aov(score ~ group, data=d00)
rs30 = TukeyHSD(rs3)
print(rs30)
[You may need to define a 'group' variable]

  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = score ~ group, data = d00)
$group
                 diff         lwr      upr     p adj
pre.org-no.org    0.1 -1.57075607 1.770756 0.9879376
post.org-no.org   1.7  0.02924393 3.370756 0.0455236
post.org-pre.org  1.6 -0.07075607 3.270756 0.0624878

SLIDE 17

  • plot(rs30)
SLIDE 18

# Organiser study: Holm's approach
rs31 = pairwise.t.test(d00$score, d00$group)
print(rs31)

  Pairwise comparisons using t tests with pooled SD
data: d00$score and d00$group
         no.org pre.org
pre.org  0.883  -
post.org 0.054  0.054
P value adjustment method: holm

(Holm’s procedure for controlling the familywise Type I error rate will be introduced in a later slide.)

SLIDE 19

Error Rates in Multiple Hypothesis Testing

  • For a single test of a null hypothesis, H0,
  • α = P(Reject H0 | H0 true), the Type I error rate, and
  • β = P(Retain H0 | H0 false), the Type II error rate
  • Power = 1 – β
  • How to define "error rate" when we test m hypotheses simultaneously?

SLIDE 20

                          Decision
            Retain                    Reject
H0 True     Correct Retention         False Alarm (Type I error)
H0 False    Miss (Type II error)      Correct Rejection

False Alarm aka False Discovery or False Rejection. Miss aka False Non-Discovery.

α = False Alarm rate = P(Reject H0 | H0 True)
β = P(Retain H0 | H0 False); 1 – β = Power

SLIDE 21

Testing m null hypotheses

  • If we test the (45) correlations among 10 variables for significance, with α = .05, we wd expect about 5% of them, i.e., about 2 or 3 r's, to be significant, even if H0 is true everywhere; and the prob of at least 1 False Alarm wd be much greater than 0.05.
  • The prob of at least 1 Type I error when testing m null hypotheses is called the familywise Type I error rate, αF. What is the relation between α and αF?

SLIDE 22

m independent tests

Suppose all m H0's are true, and 'rejection' = 'rej'. α is P(rej) on each test, and the tests are indep. We wish the prob of at least 1 rej in m tests, αF, which equals 1 – P(no rej's):

αF = 1 – P(no rej's) = 1 – (1 – α)^m; implying that α = 1 – (1 – αF)^(1/m).

Below we show αF as a function of α and m.

            m = 2      5       10      20      50
α = .050   0.0975   0.2262   0.4013   0.6415   0.9231
α = .010   0.0199   0.0490   0.0956   0.1821   0.3950
α = .001   0.0020   0.0050   0.0100   0.0198   0.0488

SLIDE 23

  • To a good approx for small α, αF = mα.
  • To protect against familywise Type I errors, we need to set very conservative single-test α levels (e.g., .01 or .001, rather than .05).
  • However, using very conservative α levels leads to failure to reject some false null hypotheses - i.e., it reduces our power! Are there ways to protect the familywise rate while keeping power at reasonably high levels?

SLIDE 24

  • αF = 1 – (1 – α)^m ≈ mα, for small α, when the tests are independent.
  • When the tests are not necessarily independent, Bonferroni's Inequality states that the prob of at least 1 of m events occurring is less than the sum of the probs of occurrence of the individual events; i.e., αF ≤ mα.
  • Dunn's test, aka Bonferroni's Test, uses sig level α = αF/m for each individual test, so as to control the familywise rate at αF.

SLIDE 25

  • Dunn's Test is conservative and, therefore, has relatively low power. Holm's Test is a multistage version of Dunn's (or Bonferroni's) Test that has more power. Holm's procedure is:
    – Rank the p-values of the m tests from smallest (i.e., most sig) to largest.
    – Compare p(1), the smallest, with αF/m.
    – If p(1) > αF/m, retain H0 for all tests.
    – If p(1) < αF/m, reject H0 for the 1st test and compare p(2) with αF/(m – 1). And so on.
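Holm's step-down rule is easy to sketch in code. A minimal Python illustration (function and variable names are my own, not from the slides):

```python
def holm_reject(pvals, alpha_F=0.05):
    """Step-down Holm test: returns reject/retain decisions in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    reject = [False] * m
    for step, i in enumerate(order):              # step = 0, 1, ..., m-1
        if pvals[i] <= alpha_F / (m - step):      # alphaF/m, alphaF/(m-1), ...
            reject[i] = True
        else:
            break                                 # first retention stops testing
    return reject

# Smallest p clears alphaF/4 = .0125; the next fails alphaF/3 = .0167,
# so it and everything after it are retained.
print(holm_reject([0.001, 0.020, 0.030, 0.400]))  # [True, False, False, False]
```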

SLIDE 26

  • The rationale for Holm's adjustment of p-values is that, after we have rejected the 1st H0, we have only m – 1 possibly true H0's, so we shd replace m by m – 1. Thus the critical p-value used in the 2nd test is greater (i.e., less conservative) than that used in the 1st test.
  • The critical p-value at the i'th test (i = 1, 2, …, m) = αF/(m – i + 1),
    – which increases from αF/m to αF.

SLIDE 27

Ryan Test (REGWQ; using a range, or Q, statistic due to Ryan, Einot, Gabriel & Welsch)

  • In the Ryan procedure, the critical p-value at the i'th test (i = 1, 2, …, m) = (i*αF)/m,
    – which also increases from αF/m to αF,
    – but it does so at a constant rate whereas, under the Holm procedure, the rate of increase is increasing. The desirable properties have been established by simulations.
    – A slightly better rule (REGWQ) is to replace (i*αF)/m by 1 – (1 – αF)^(i/m).

SLIDE 28

At one extreme, if each test uses a critical value of FWE/m, or αF/m, the FWE is at most αF. This is a v conservative procedure. At the other extreme, using a critical value of FWE, or αF, for each test yields a FWE of at most m*αF, a value that can be very large. The Ryan and Holm procedures are intermediate, with the latter being more conservative. We now reprise the 'morphine tolerance', or 'paw-licking', study.

SLIDE 29

Morphine Tolerance (Howell, Ch. 12)

5 groups of rats (MM, SS, SM, MS, MM') have different levels of induced tolerance of pain, and we wish to test which induction procedures were effective. There are 10 paired comparisons of group means, each yielding a t statistic and a corresponding p-value. We rank the p-values, then test which pairs of groups are significantly different - using the 4 procedures from the previous slide. In this example, the results are similar across procedures. This will not be true when m is very large.

SLIDE 30

Pair       t      p0     pdb    pun   phm    prq    ddb  dun  dhm  drq
MM-SS    0.354  0.725  0.005  0.05  0.050  0.050    0    0    0    0
SM-McM   1.768  0.086  0.005  0.05  0.025  0.045    0    0    0    0
MS-MM    2.121  0.041  0.005  0.05  0.017  0.040    0    1    0    0
MS-SS    2.475  0.018  0.005  0.05  0.012  0.035    0    1    0    1
SS-SM    4.596  0.000  0.005  0.05  0.010  0.030    1    1    1    1
MM-SM    4.950  0.000  0.005  0.05  0.008  0.025    1    1    1    1
SS-McM   6.364  0.000  0.005  0.05  0.007  0.020    1    1    1    1
MM-McM   6.717  0.000  0.005  0.05  0.006  0.015    1    1    1    1
MS-SM    7.071  0.000  0.005  0.05  0.006  0.010    1    1    1    1
MS-McM   8.839  0.000  0.005  0.05  0.005  0.005    1    1    1    1

FWE = 0.05. 'db' = Dunn/Bonferroni; 'un' = Uncontrolled FWE; 'hm' = Holm; 'rq' = Ryan/FDR. The 'p' columns give the critical p-values vs. p-rank for the 4 procs; the 'd' columns give the resulting reject (1) / retain (0) decisions. Note that 'rq' & 'hm' lie between 'db' and 'un'. See Howell, p. 397, Table 12.7.
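The four columns of critical p-values can be regenerated from the rules on the earlier slides; a sketch in Python for m = 10 tests at FWE = 0.05 (ranks counted from the smallest p):

```python
m, alpha_F = 10, 0.05
# For the i-th smallest p-value (i = 1 is the smallest):
#   Dunn/Bonferroni: alphaF/m for every test
#   Uncontrolled:    alphaF for every test
#   Holm:            alphaF/(m - i + 1)
#   Ryan/FDR:        i*alphaF/m
for i in range(m, 0, -1):   # table rows run from largest p (i = m) down
    pdb = alpha_F / m
    pun = alpha_F
    phm = alpha_F / (m - i + 1)
    prq = i * alpha_F / m
    print(i, round(pdb, 3), pun, round(phm, 3), round(prq, 3))
```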

SLIDE 31

False Discovery Rate:

Howell, pp. 396-7; http://en.wikipedia.org/wiki/False_discovery_rate

SLIDE 32

  • m0 is the number of true null hypotheses
  • m – m0 is the number of false null hypotheses
  • U is the number of true negatives
  • V is the number of false positives
  • T is the number of false negatives
  • S is the number of true positives
  • H1...Hm are the null hypotheses being tested
SLIDE 33

  • In m hypothesis tests of which m0 are true null hypotheses, R is an observable random variable, and S, T, U, and V are unobservable random variables.
  • The false discovery rate is given by E(V/R); and one wants to keep this value below a threshold α*. (V/R is defined to be 0 when R = 0.) The Type I error rate, α = E(V/m0).

SLIDE 34

FDR versus α

  • FDR = E(V/R), whereas α = E(V/m0).
  • The traditional focus on α as the error rate to be controlled (e.g., at .05 or .01) is attributed mainly to Fisher and Tukey. But we can agree that it wd be sensible to focus on controlling FDR instead, especially if this 'works' better!
  • This seems to be the case, especially when m is 'large'.

SLIDE 35

A simple FDR procedure

  • Suppose we follow the Ryan procedure in which the critical p-value at the i'th test (i = 1, 2, …, m) = (i*αF)/m.
  • Benjamini & Hochberg have shown that the FDR is at most αF.
  • The behavior of this Ryan/FDR procedure was shown in an earlier slide. In imaging and microarray analyses, FDR has proven to be useful.
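The Benjamini–Hochberg step-up rule behind this FDR control can be sketched in Python (illustrative names of my own, not code from the slides):

```python
def bh_reject(pvals, alpha_F=0.05):
    """Benjamini-Hochberg: with p's sorted ascending, reject the hypotheses
    at ranks 1..k for the largest k with p(k) <= k*alphaF/m.
    Controls the FDR at alphaF."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha_F / m:
            k = rank                      # remember the largest passing rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# More lenient than Holm on the same p-values: the thresholds are
# .0125, .025, .0375, .05, and the step-up rule rejects the first three.
print(bh_reject([0.001, 0.020, 0.030, 0.400]))  # [True, True, True, False]
```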

SLIDE 36

Uber-Reference for FDR

  • http://www.biomedcentral.com/1471-2105/9/303
  • "A unified approach to false discovery rate estimation", by Korbinian Strimmer, University of Leipzig
    – BMC Bioinformatics 2008, 9:303; doi:10.1186/1471-2105-9-303
    – This article is quite technical
  • See also Howell, Ch. 12, p. 396, for a sufficient treatment

SLIDE 37

Homogeneity of variance, a key assumption in lm()

  • The t-test and tests done with lm() assume that all groups have the same within-group variance. What remedial steps to take if this assn is violated? (Handout 1, pp. 8-13)

SLIDE 38

  • One solution is to transform the DV and hope that stabilizes the variance.
    – log(Y), or log(Y + ε) to remove 0's (ε = .5?)
    – sqrt(Y)
    – 1/Y, for Y = Reaction Time, is 'speed'
    – arcsin(Y), if Y is a proportion

lphapp = log(Pasthapp + .5); sphapp = sqrt(Pasthapp)
rs3a = bartlett.test(Pasthapp ~ memtype, na.action=na.omit, data=dat0)  # compare variances on score for all 3 groups
print(rs3a)
rs3b = bartlett.test(lphapp ~ memtype, na.action=na.omit, data=dat0)
print(rs3b)
rs3c = bartlett.test(sphapp ~ memtype, na.action=na.omit, data=dat0)
print(rs3c)

SLIDE 39

SLIDE 40

Transforming variables in Regression

  • In a regression context (i.e., X quantitative, instead of categorical as in ANOVA), how to linearise a non-linear relation? Should we transform X or Y, and how?
  • The "Rule of the Bulge" is a rule-of-thumb (pardon the mixed metaphor) that helps us pick the transformation that is likely to 'linearise' a non-linear plot:

SLIDE 41

  • Consider the circle below as 4 curves, 1 in each of quadrants I-IV. Find the curve that has the same shape as your scatterplot. Then transform X or Y in the direction indicated by the 'sign' of X or Y in the quadrant containing the matching curve.

SLIDE 42

  • Transform X or Y in the direction indicated by the 'sign' of X or Y in the quadrant containing the matching curve.
  • For example, if the curve in quadrant II resembles your plot, transform X 'up' (because, in II, x > 0), or transform Y 'down' (because, in II, y < 0).
  • An example of an 'up' (i.e., convex) transform is X' = X^2. Examples of 'down' (i.e., concave) transforms are X' = log(X), X' = sqrt(X), and X' = -1/X.

SLIDE 43

Contrasts

  • In a new area of research, we may simply want to know if a given manipulation (e.g., the grouping variable in a 1-way design) affects the dependent variable. The omnibus F-test answers this question.
  • However, in many situations, we expect a significant F and are more interested in specific comparisons among the groups. In this case, we ought to define:
    – a linear contrast to test each specific comparison,
    – the contrasts so that they are, if possible, mutually orthogonal.

SLIDE 44

  • A contrast is a weighted sum (or average) of certain group means, where the weights sum to 0. It is a specific explanatory variable (with 1 df) for accounting for differences among the groups.
  • If the Group factor has k levels, then Group has k – 1 df, and there are at most k – 1 mutually independent contrasts that can be defined.
  • Each contrast can be tested separately for statistical significance. Defining contrasts so that they are orthogonal (to be defined) assures that the separate tests of the contrasts are approximately independent.

SLIDE 45

In the 'organizer' study, we might be interested in the difference between the pre-organizer and post-organizer population means; the weights, a1 = 0, a2 = 1, a3 = -1, specify the appropriate contrast: l1 = 0*µ1 + 1*µ2 + (-1)*µ3 = µ2 – µ3

SLIDE 46

  • If we are interested in the difference between no organizer and some organizer, the appropriate weights would be a1 = 2, a2 = -1, a3 = -1 (or 1, -.5, -.5):
  • l2 = 2*µ1 + (-1)*µ2 + (-1)*µ3 = 2*µ1 – (µ2 + µ3)
  • For both contrasts, the {ai} sum to 0.
  • Multiplying all {ai} by a constant affects only the scale or size (and perhaps the sign) of the contrast, not its primary meaning.
  • In general, if Level 1 = control, Level 2 = treatment 1, and Level 3 = treatment 2, then the weights (-2, 1, 1) and (0, 1, -1) define, respectively, a contrast of 'treatments' vs 'control,' and a contrast of 'treat-1' vs 'treat-2'. Also, as we shall see, these contrasts are orthogonal.

SLIDE 47

Orthogonal Contrasts

Two contrasts, l1 = ∑ai*xi and l2 = ∑bi*xi (sums over i = 1, …, k), are orthogonal if the group sample sizes are equal, i.e., if ni = n, and if ∑ai*bi = 0.

(The definition of orthogonality for the case of unequal ni is complicated.) Orthogonal contrasts answer "independent" questions. Knowing if one contrast is significant provides no information about the value of an orthogonal contrast.

SLIDE 48

"Orthogonal" = "Uncorrelated"

Group   X1   X2
1       a1   b1
2       a2   b2
…       …    …
k       ak   bk

Suppose that X1 and X2 are the weights defining 2 contrasts among the k groups. What is the 'correlation' between X1 and X2? We need the "Sum of Products", SP, which we wd put in the Numerator of a familiar formula, r = SP/√(SS1*SS2). r = 0 iff SP = 0.

SP = ∑ai*bi – (∑ai)(∑bi)/k (sums over i = 1, …, k).

Now, if X1 and X2 are true contrasts, ∑ai = 0 = ∑bi, and SP = ∑ai*bi. Thus the condition for orthogonality, ∑ai*bi = 0, is also the condition for SP = 0, i.e., r = 0.
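The SP condition is easy to verify numerically; a small Python sketch using the organiser weights from an earlier slide (helper names are mine):

```python
def sum_of_products(a, b):
    """SP = sum(ai*bi) - sum(ai)*sum(bi)/k; SP = 0 means r = 0."""
    k = len(a)
    return sum(x * y for x, y in zip(a, b)) - sum(a) * sum(b) / k

a = (-2, 1, 1)   # 'treatments' vs 'control' -- a true contrast (sums to 0)
b = (0, 1, -1)   # 'treat-1' vs 'treat-2'    -- also a contrast
print(sum(a), sum(b), sum_of_products(a, b))   # sums are 0, SP = 0: orthogonal

d1, d2 = (0, 1, 0), (0, 0, 1)   # dummy codes: not contrasts (each sums to 1)
print(sum_of_products(d1, d2))  # -1/3, so the dummy codes are correlated
```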

SLIDE 49

           pre.org  post.org
no.org        0         0
pre.org       1         0
post.org      0         1

  • Even though ∑ai*bi = 0 in dummy coding, 'pre.org' and 'post.org' are not orthogonal, because they are not contrasts; indeed ∑ai = ∑bi = 1 ≠ 0. Intuitively, the 2 effects are correlated, i.e. not orthogonal, because they have a common term, namely, the mean of the baseline group.
  • SP = ∑ai*bi – (∑ai*∑bi)/k = -1/3 ≠ 0; thus the correlation is not 0.

SLIDE 50

Different sets of orthogonal contrasts when k = 4 groups:

(a) A is quantitative. (b) 2 levels of A, 2 levels of B (e.g., 1g. of A, 2 of A, 1 of B, 2 of B). (c) 2x2 factorial (A,B) design.

SLIDE 51

Much more on the Algebra of regression equations in the ANOVA context; Simple Effects; Contrasts, t-tests, …

SLIDE 52

  • score ~ train + d.lin + d.quad + train:d.lin + train:d.quad

Predictor        Coefficient
Intercept        b0
train2           b1
d.lin            b2
d.quad           b3
train2:d.lin     b4
train2:d.quad    b5

SLIDE 53

  • Suppose, when 'difficulty' = k, d.lin = 0 = d.quad. Then score = b0 + b1*train2 describes the simple effect of 'train' when 'difficulty' = k. To obtain this simple effect, use I(difficulty - k) in lm(). If k = mean(difficulty), can also use scale(difficulty) in lm().

SLIDE 54

  • Suppose, when 'train' = target, train2 = 0. Then score = b0 + b2*d.lin + b3*d.quad describes the simple effect of 'difficulty' for the target group. To obtain this simple effect, code 'train' so that the target group has train2 = 0.

SLIDE 55

Review of the t-test

In a 1-sample t-test, s² is the estimate of the popn variance, σ², and s²/n is the estimate of the variance of the sample mean. The s.e. of a statistic is the square root of its variance. Hence,

t_df = (X̄ – µ)/est.se(X̄), where df = df of the estimate of s.e.(X̄).
est.se(X̄) = √(s²/n) = s/√n, with df = n – 1, and t_{n-1} = (X̄ – µ)/(s/√n).
t_df² = F(1, df), so F can also be used to test H0.

SLIDE 56

t-tests in Regression

In regression, the statistic of interest is the estimated regression coefficient, bi, and we wish to test if E(bi) = 0. Computer output gives s.e.(bi), as well as t. The df of t is the df of SSresid, and this too is given in the ANOVA table from the regression. In general, the df of SSresid = N – p – 1, where N is the sample size and p is the number of predictors. An example of output with p = 1 follows.

t_df = bi/est.se(bi), with est.se(bi) given in the output.

SLIDE 57

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    4.500      0.297  15.151 5.08e-15 ***
grpone1        0.300      0.210   1.428    0.164

Residual standard error: 1.627 on 28 degrees of freedom
Multiple R-squared: 0.06792, Adjusted R-squared: 0.035
F-statistic: 2.04 on 1 and 28 DF, p-value: 0.1642

Note: t² = 1.428² = 2.04 = F! 2*(1 - pt(1.428, 28)) = 0.1644, as given above.

SLIDE 58

t-tests in ANOVA

  • Does giving students explicit training on how to organise material (an "organiser") in a class improve scores on tests?
  • Ss are randomly assigned to one of 3 conditions: No organiser ('no.org'), Organiser before lecture ('pre.org'), and Organiser after lecture ('post.org').
  • This is a between-S design, and a 1-way ANOVA is appropriate.
  • Analysis of SS: SSt = SSb + SSw, F = MSb/MSw, etc.

SLIDE 59

Data, d, in long form

   score    group
1      5   no.org
2      4   no.org
3      6   no.org
4      2   no.org
5      2   no.org
6      2   no.org
7      6   no.org
8      4   no.org
9      3   no.org
10     5   no.org
11     4  pre.org
12     5  pre.org
13     3  pre.org
…      …        …
27     6 post.org
28     4 post.org
29     4 post.org
30     7 post.org

# Custom-made function
mean.sd = function(x) {
  c(mean = mean(x, na.rm=T), sd = sd(x, na.rm=T))
}
# Data summary by group
rs0 = by(d$score, d$group, mean.sd)

SLIDE 60

d$group: no.org
    mean       sd
3.900000 1.595131
d$group: pre.org
    mean       sd
4.000000 1.333333
d$group: post.org
    mean       sd
5.600000 1.577621

Each of the sd's, 1.595, 1.33 and 1.578, is an independent estimate of the within-group sd, σ. MSw is the best, pooled est of σ².

To test if mean scores are the same for 'post.org' and 'pre.org', we would do an indep samples t-test. But what estimate of within-group variance should we use?

  • Ans. Better to use MSw than the pooled est from only the 'post.org' and 'pre.org' groups. The df of the latter is n1 + n2 – 2 = 18; the df of MSw is N – k = 30 – 3 = 27 > 18! (More precise)

SLIDE 61

Analysis of Variance Table
Response: score
          Df Sum Sq Mean Sq F value  Pr(>F)
group      2  18.20    9.10  4.0082 0.02991 *
Residuals 27  61.30    2.27

Est of var(X̄1 – X̄2) = MSw*(1/n1 + 1/n2) = MSw/5 = 2.27/5 = 0.454.
Est of se(X̄1 – X̄2) = √0.454 = 0.674.
For 'post.org' vs 'pre.org', t27 = (5.6 – 4.0)/0.674 = 2.37.

2*(1 - pt(2.37, 27)) = 0.025; so p = .025, 2-tailed.

We haven’t yet considered if this t-test was planned or post-hoc!
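The arithmetic above can be checked directly; a Python sketch with MSw = 2.27 and n = 10 per group:

```python
import math

# Pooled within-group variance from the ANOVA table, 10 Ss per group
MSw, n1, n2 = 2.27, 10, 10

var_diff = MSw * (1 / n1 + 1 / n2)   # = MSw/5 = 0.454
se_diff = math.sqrt(var_diff)        # ~ 0.674
t27 = (5.6 - 4.0) / se_diff          # ~ 2.37, on the 27 df of MSw
print(round(var_diff, 3), round(se_diff, 3), round(t27, 2))
```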

SLIDE 62

The 3 group means are: no-org, 3.9; pre-org, 4.0; and post-org, 5.6.

rs1 = lm(score ~ group, data=d00)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.9000     0.4765   8.185 8.64e-09 ***
grouppre.org    0.1000     0.6739   0.148   0.8831
grouppost.org   1.7000     0.6739   2.523   0.0178 *

# Default coding is Dummy Coding
print(contrasts(d00$group))
         pre.org post.org
no.org         0        0
pre.org        1        0
post.org       0        1

SLIDE 63

         pre.org post.org
no.org         0        0
pre.org        1        0
post.org       0        1

  • The implied regression in lm() is: Y = b0 + b1*pre.org + b2*post.org + e
  • Use this eqn. to get the mean score, mj, for the j'th group, j = 1, 2, 3:

no.org:   m1 = b0 + b1*0 + b2*0; b0 = m1
pre.org:  m2 = b0 + b1*1 + b2*0; b1 = m2 – m1
post.org: m3 = b0 + b1*0 + b2*1; b2 = m3 – m1
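The coefficient algebra can be confirmed with the three group means from the output; a quick Python sketch:

```python
# Group means: no.org (baseline), pre.org, post.org
m1, m2, m3 = 3.9, 4.0, 5.6

b0 = m1        # Intercept = baseline mean
b1 = m2 - m1   # 'grouppre.org'  = pre.org mean minus baseline mean
b2 = m3 - m1   # 'grouppost.org' = post.org mean minus baseline mean
print(b0, round(b1, 1), round(b2, 1))  # 3.9 0.1 1.7, matching the lm() output
```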

SLIDE 64

  • Y = b0 + b1*pre.org + b2*post.org + e

         pre.org post.org
no.org         0        0
pre.org        1        0
post.org       0        1

  • Define Dummy Coding:
    – If 'group' has k levels, there are k – 1 predictors, X1, …, Xk-1.
    – Designate one level, say, level 1, as the baseline. Each obs in level 1, the control condition, is assigned a value of 0 on all predictors - X1 = 0, …, Xk-1 = 0.
    – Each obs in level 2 is assigned a value of 1 on X1, and 0 on all other predictors - X1 = 1, X2 = 0, …, Xk-1 = 0.
    – Each obs in level 3 is assigned a value of 1 on X2, and 0 on all other predictors - X1 = 0, X2 = 1, …, Xk-1 = 0. And so on.
SLIDE 65

Compare output from anova() and summary():

rs1 = lm(score ~ group, data=d00)
print(anova(rs1))

Analysis of Variance Table
Response: score
          Df Sum Sq Mean Sq F value  Pr(>F)
group      2  18.20    9.10  4.0082 0.02991 *
Residuals 27  61.30    2.27

print(summary(rs1))

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.9000     0.4765   8.185 8.64e-09 ***
grouppre.org    0.1000     0.6739   0.148   0.8831
grouppost.org   1.7000     0.6739   2.523   0.0178 *

SLIDE 66

  • anova(rs1) gives the familiar omnibus F. There are no regression coefficients, nor any of the interpretive issues associated with them!
  • summary(lm()) yields also a set of regression coefficients and associated p-values. What are the 'predictor' variables in this regression, and what algebraic interpretation do the coefficients have?
  • The interpretation of the coeffs depends critically on how the predictors are defined.
  • The coefficients in the regression will often correspond to the contrasts among group means that we have already discussed.

SLIDE 67

  • The predictors will, in general, not look like contrasts among means. However, when the contrasts are orthogonal, the predictors are the same as the contrasts (Benoît).
  • Note that R calls predictors 'contrasts'.
  • What predictors or 'contrasts' are used in lm() as the default? Ans: 'grouppre.org' and 'grouppost.org'.
  • How to (a) specify our own 'contrasts', and (b) derive algebraically the interpretation of the coeffs in terms of the group means?

SLIDE 68

  • The default set of predictors or 'contrasts' used in lm() is obtained with contr.treatment, a particular kind of dummy coding in which level 1 of 'treatment' is treated as the baseline or control group (base = 1). If another level of 'treatment', e.g., 2, is the baseline, we would specify base = 2; the predictors or 'contrasts' would then be 'group1' and 'group3'.
  • With contr.treatment, the resulting effects are not really contrasts (the {ai} do not always sum to 0), and they are not orthogonal, as will be shown.
SLIDE 69

contrasts(d00$group, 2) = contr.treatment(3, base=2, contrasts=TRUE)
rs1 = lm(score ~ group, data=d00)
print(summary(rs1))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.0000     0.4765   8.395 5.25e-09 ***
group1       -0.1000     0.6739  -0.148   0.8831
group3        1.6000     0.6739   2.374   0.0249 *

  • When the contrasts are defined by 'contr.treatment', the Intercept is equal to the mean of the baseline level - level 1 (3.9) in the 1st analysis, and level 2 (4.0) above.