Lecture 5: ANOVA and Correlation
Ani Manichaikul (amanicha@jhsph.edu)
23 April 2007


SLIDE 1

Lecture 5: ANOVA and Correlation

Ani Manichaikul amanicha@jhsph.edu 23 April 2007

SLIDE 2

Comparing Multiple Groups

- Continuous data: comparing means
  - Analysis of variance
- Binary data: comparing proportions
  - Pearson's chi-square tests for r × 2 tables: independence, goodness of fit, homogeneity
- Categorical data: r × c tables
  - Pearson chi-square tests
  - Odds ratio and relative risk

SLIDE 3

ANOVA: Definition

- ANOVA = ANalysis Of VAriance
- Statistical technique for comparing means across multiple populations
- Partitions the total variation in a data set into components defined by specific sources

SLIDE 4

ANOVA: Concepts

- Estimate group means
- Assess magnitude of variation attributable to specific sources
- Extension of the two-sample t-test to multiple groups
- Population model
- Sample model: estimates, standard errors
- Partition of variability

SLIDE 5

Types of ANOVA

- One-way ANOVA: one factor, e.g. smoking status
- Two-way ANOVA: two factors, e.g. gender and smoking status
- Three-way ANOVA: three factors, e.g. gender, smoking, and beer consumption

SLIDE 6

Emphasis

- One-way ANOVA is an extension of the t-test to 3 or more samples; the analysis focuses on group differences
- Two-way ANOVA (and higher) focuses on the interaction of factors: does the effect due to one factor change as the level of another factor changes?

SLIDE 7

ANOVA Rationale I

Variation in all observations = Variation between each observation and its group mean + Variation between each group mean and the overall mean

In other words:

Total sum of squares = Within-group sum of squares + Between-groups sum of squares

SLIDE 8

ANOVA Rationale II

In shorthand: SST = SSW + SSB

If the group means are not very different, the variation between them and the overall mean (SSB) will not be much more than the variation among the observations within a group (SSW)
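As a quick numerical check of this identity, here is a minimal Python sketch (using made-up data, not data from the lecture) that computes all three sums of squares directly from their definitions:

```python
import numpy as np

# Made-up observations in k = 3 groups
groups = [np.array([5.1, 4.8, 6.0, 5.5]),
          np.array([6.2, 7.1, 6.8]),
          np.array([4.0, 3.5, 4.4, 3.9, 4.2])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# SSW: each observation vs its own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
# SSB: each group mean vs the grand mean, weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SST: each observation vs the grand mean
sst = ((all_obs - grand_mean) ** 2).sum()

assert np.isclose(sst, ssw + ssb)  # SST = SSW + SSB
```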

SLIDE 9

ANOVA: One-Way

SLIDE 10

MSW

We can pool the estimates of σ² across groups and use an overall estimate for the population variance:

Variation within groups = σ̂²_W = SSW / (N − k) = MSW

MSW is called the "within groups mean square"

SLIDE 11

MSB

We can also look at systematic variation among groups:

Variation between groups = σ̂²_B = SSB / (k − 1) = MSB

MSB is called the "between groups mean square"

SLIDE 12

An ANOVA table

Suppose there are k groups (e.g. if smoking status has categories current, former, or never, then k = 3). We calculate our test statistic using the sum of squares values as follows:
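The table itself does not reproduce well in text, so here is a sketch in Python of how its entries are assembled, using made-up data for three hypothetical smoking-status groups; SciPy's one-way ANOVA serves as a cross-check:

```python
import numpy as np
from scipy import stats

# Made-up data for k = 3 smoking-status groups
groups = [np.array([4.9, 5.3, 5.1, 5.6]),   # current
          np.array([5.8, 6.1, 5.9]),        # former
          np.array([6.5, 6.9, 6.7, 6.4])]   # never

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

msw = ssw / (N - k)   # within groups mean square, df = N - k
msb = ssb / (k - 1)   # between groups mean square, df = k - 1
f_stat = msb / msw

# Agrees with SciPy's built-in one-way ANOVA
f_scipy, p_value = stats.f_oneway(*groups)
assert np.isclose(f_stat, f_scipy)
```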

SLIDE 13

Hypothesis testing with ANOVA

In performing ANOVA, we may want to ask: is there truly a difference in means across groups? Formally, we specify the hypotheses:

H0 : µ1 = µ2 = · · · = µk
Ha : at least one of the µi's is different

The null hypothesis specifies a global relationship. If the result of the test is significant, then perform individual comparisons.

SLIDE 14

Goal of the comparisons

Compare the two variability estimates, MSW and MSB.

If Fobs = MSB / MSW = σ̂²_B / σ̂²_W is small, then variability between groups is negligible compared to variation within groups ⇒ the grouping does not explain much variation in the data

SLIDE 15

The F-statistic

For our observations, we assume X ∼ N(µgp, σ²), where

µgp = E(X | gp) = β0 + β1 · I(group = 2) + β2 · I(group = 3) + · · ·

and I(group = i) is an indicator denoting whether or not an individual belongs to the ith group.

Note: we have assumed the same variance σ² for all groups; it is important to check this assumption.

Under these assumptions, we know the null distribution of the statistic F = MSB / MSW. This distribution is called an F-distribution.

SLIDE 16

The F-distribution

Remember that a χ² distribution is specified by its degrees of freedom. An F-distribution is obtained by taking the quotient of two independent χ² random variables, each divided by its degrees of freedom. When we specify an F-distribution, we must state two parameters, corresponding to the degrees of freedom of the two χ² distributions.

If X1 ∼ χ²_df1 and X2 ∼ χ²_df2, we write:

(X1 / df1) / (X2 / df2) ∼ F_df1, df2
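A quick simulation illustrates this construction; the degrees of freedom below are arbitrary choices, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
df1, df2 = 4, 20

# Quotient of two independent chi-square variables, each divided
# by its own degrees of freedom, follows an F(df1, df2) distribution
x1 = rng.chisquare(df1, size=100_000)
x2 = rng.chisquare(df2, size=100_000)
f_samples = (x1 / df1) / (x2 / df2)

# The mean of an F(df1, df2) distribution is df2 / (df2 - 2) for df2 > 2,
# so the sample mean should land near 20 / 18
print(f_samples.mean())
```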

SLIDE 17

Back to the hypothesis test . . .

Knowing the null distribution of MSB / MSW, we can define a decision rule to test the hypothesis for ANOVA:

Reject H0 if F ≥ F_α; k−1, N−k
Fail to reject H0 if F < F_α; k−1, N−k
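In practice the critical value comes from an F table or software. For instance, using SciPy with the degrees of freedom from the HDL example later in the lecture:

```python
from scipy.stats import f

# alpha = 0.05, df1 = k - 1 = 2, df2 = N - k = 116 (HDL example)
f_crit = f.ppf(1 - 0.05, dfn=2, dfd=116)
print(round(f_crit, 2))  # 3.07, matching the rejection region quoted later

# Decision rule: reject H0 when the observed F exceeds the critical value
f_obs = 13.0
print(f_obs >= f_crit)  # True, so H0 is rejected
```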

SLIDE 18

ANOVA: F-tests I

SLIDE 19

ANOVA: F-tests II

SLIDE 20

Example: ANOVA for HDL

Study design: randomized controlled trial. 132 men randomized to one of three arms:

- Diet + exercise
- Diet
- Control

Follow-up one year later: 119 men remained in the study.

Outcome: mean change in plasma levels of HDL cholesterol from baseline to one-year follow-up in the three groups

SLIDE 21

Model for HDL outcomes

We model the means for each group as follows:

µc = E(HDL | gp = c) = mean change in the control group
µd = E(HDL | gp = d) = mean change in the diet group
µde = E(HDL | gp = de) = mean change in the diet + exercise group

We could also write the model as E(HDL | gp) = β0 + β1 I(gp = d) + β2 I(gp = de), where I(gp = d) and I(gp = de) are 0/1 group indicators.

SLIDE 22

HDL ANOVA Table

We obtain the following results from the HDL experiment:

SLIDE 23

HDL ANOVA results

F-test:

H0 : µc = µd = µde (or H0 : β1 = β2 = 0)
Ha : at least one mean is different from the others

Test statistic: Fobs = 13.0
df1 = k − 1 = 3 − 1 = 2
df2 = N − k = 119 − 3 = 116

SLIDE 24

HDL ANOVA Conclusions

Rejection region: F > F0.05;2,116 = 3.07 Since Fobs = 13.0 > 3.07, we reject H0 We conclude that at least one of the group means is different from the others

SLIDE 25

Which groups are different?

We might proceed to make individual comparisons, conducting a two-sample t-test for each pair of groups:

t = (θ̂ − θ0) / SE(θ̂) = (X̄i − X̄j − 0) / √(s²p/ni + s²p/nj)

where s²p is the pooled variance estimate.

SLIDE 26

Multiple Comparisons

Performing individual comparisons requires multiple hypothesis tests. If α = 0.05 for each comparison, each comparison has a 5% chance of being falsely called significant. Overall, the probability of Type I error is elevated above 5%.

Question: how can we address this multiple comparisons issue?

SLIDE 27

Bonferroni adjustment

- A possible correction for multiple comparisons
- Test each hypothesis at level α∗ = α/3 = 0.0167
- Adjustment ensures the overall Type I error rate does not exceed α = 0.05
- However, this adjustment may be too conservative

SLIDE 28

Multiple comparisons α

Hypothesis                        α∗ = α/3
H0 : µc = µd (or β1 = 0)          0.0167
H0 : µc = µde (or β2 = 0)         0.0167
H0 : µd = µde (or β1 − β2 = 0)    0.0167

Overall α = 0.05

SLIDE 29

HDL: Pairwise comparisons I

Control and Diet groups

H0 : µc = µd (or β1 = 0)

t = (−0.05 − 0.02) / √(0.028/40 + 0.028/40) = −1.87

p-value = 0.06

SLIDE 30

HDL: Pairwise comparisons II

Control and Diet + exercise groups

H0 : µc = µde (or β2 = 0)

t = (−0.05 − 0.14) / √(0.028/40 + 0.028/39) = −5.05

p-value = 4.4 × 10⁻⁷

SLIDE 31

HDL: Pairwise comparisons III

Diet and Diet + exercise groups

H0 : µd = µde (or β1 − β2 = 0)

t = (0.02 − 0.14) / √(0.028/40 + 0.028/39) = −3.19

p-value = 0.0014

SLIDE 32

Bonferroni corrected p-values

Hypothesis        p-value       adjusted p-value
H0 : µc = µd      0.06          0.18
H0 : µc = µde     4.4 × 10⁻⁷    1.3 × 10⁻⁶
H0 : µd = µde     0.0014        0.0042

Overall α = 0.05

Conclusion: significant difference in HDL change for the diet + exercise group compared to the other groups
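These pairwise statistics and the Bonferroni adjustment can be reproduced with a short script. The mean changes, group sizes, and pooled variance 0.028 are read off the slides; a normal approximation is assumed for the p-values, which approximately matches the figures quoted here:

```python
from math import sqrt, erf

def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal approximation."""
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - phi(abs(t_stat)))

# Mean HDL changes and group sizes from the slides;
# 0.028 is the pooled variance used in the t statistics
means = {"c": -0.05, "d": 0.02, "de": 0.14}
sizes = {"c": 40, "d": 40, "de": 39}
s2p = 0.028

for a, b in [("c", "d"), ("c", "de"), ("d", "de")]:
    se = sqrt(s2p / sizes[a] + s2p / sizes[b])
    t = (means[a] - means[b]) / se
    p = two_sided_p(t)
    p_adj = min(1.0, 3 * p)  # Bonferroni: multiply by number of comparisons
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.2g}, adjusted p = {p_adj:.2g}")
```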

SLIDE 33

Two-way ANOVA

- Uses the same idea as one-way ANOVA: partitioning variability
- Allows us to look at the interaction of factors: does the effect due to one factor change as the level of another factor changes?

SLIDE 34

Example: Public health students’ medical expenditures

Study design: in an observational study, total medical expenditures and various demographic characteristics were recorded for 200 public health students

Goal: determine how gender and smoking status affect total medical expenditures in this population

SLIDE 35

Example: Set-up

Y = total medical expenditures
F = indicator of female gender: 1 if gender = female, 0 otherwise
S = indicator of smoking: 1 if smoked 100 cigarettes or more, 0 otherwise

SLIDE 36

Interaction model

We assume the model Y ∼ N(µ, σ²), where

µ = E(Y) = β0 + β1 F + β2 S + β3 F·S

What are the interpretations of β0, β1, β2, and β3?

SLIDE 37

Two-way ANOVA: Interactions

Mean model: µ = E(Y) = β0 + β1 F + β2 S + β3 F·S

Gender    Non-smoker    Smoker
Male      β0            β0 + β2
Female    β0 + β1       β0 + β1 + β2 + β3

SLIDE 38

Mean Model

E(Expenditure | male, non-smoker) = β0 + β1 · 0 + β2 · 0 + β3 · 0 = β0
E(Expenditure | female, non-smoker) = β0 + β1 · 1 + β2 · 0 + β3 · 0 = β0 + β1
E(Expenditure | male, smoker) = β0 + β1 · 0 + β2 · 1 + β3 · 0 = β0 + β2
E(Expenditure | female, smoker) = β0 + β1 · 1 + β2 · 1 + β3 · 1 = β0 + β1 + β2 + β3
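Plugging in the coefficient estimates reported on a later slide, the four cell means can be generated mechanically (values in dollars):

```python
# Coefficient estimates from the two-way ANOVA slides later in the lecture
b0, b1, b2, b3 = 5049, 1784, 907, 6239

def expected_expenditure(female: int, smoker: int) -> int:
    """Mean model with interaction: E(Y) = b0 + b1*F + b2*S + b3*F*S."""
    return b0 + b1 * female + b2 * smoker + b3 * female * smoker

print(expected_expenditure(0, 0))  # male non-smoker: b0 = 5049
print(expected_expenditure(1, 0))  # female non-smoker: b0 + b1 = 6833
print(expected_expenditure(0, 1))  # male smoker: b0 + b2 = 5956
print(expected_expenditure(1, 1))  # female smoker: all four terms = 13979
```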

SLIDE 39

Medical Expenditures: ANOVA table

Source of Variation       Sum of Squares   df    Mean Square   F       p-value
Model (between groups)    1.7 × 10⁹        3     5.6 × 10⁸     28.11   < 0.001
Error (within groups)     3.9 × 10⁹        196   2.0 × 10⁷
Total                     5.6 × 10⁹        199

SLIDE 40

Medical Expenditures: Results

Overall model F-test:

H0 : β1 = β2 = β3 = 0
Ha : at least one group is different

Test statistic: Fobs = 28.11
df1 = k − 1 = 3
df2 = N − k = 196
p-value < 0.001

SLIDE 41

Medical Expenditures: Overall Conclusions

The medical expenditures are different in at least one of the groups Now we can figure out which ones. . .

SLIDE 42

Medical Expenditures: Two-way ANOVA I

Table of coefficient estimates:

Coefficient            Estimate   Standard Error
β0 (baseline)          5049       597
β1 (female effect)     1784       765
β2 (smoker effect)     907        1062
β3 (female × smoker)   6239       1422

SLIDE 43

Medical Expenditures: Two-way ANOVA II

Test statistics and confidence intervals:

Coefficient   t      P > |t|   95% Confidence interval
β0            8.45   0.000     (3870, 6228)
β1            2.33   0.021     (276, 3292)
β2            0.85   0.394     (−1187, 3001)
β3            4.39   0.000     (3434, 9043)

SLIDE 44

Medical Expenditures: Group-wise Conclusions

- In this population, an average male non-smoker spends about $5000 on medical costs per year
- Males who smoked were estimated to have spent about $900 more than male non-smokers, but this difference was not statistically significant
- Female non-smokers spent about $1700 more than their non-smoking male counterparts
- Female smokers spent about $8900 (= β1 + β2 + β3) more than non-smoking males

SLIDE 45

Association and Correlation I

Association expresses the relationship between two variables. It can be measured in different ways, depending on the nature of the variables. For now, we'll focus on continuous variables (e.g. height, weight).

Important note: association does not imply causation

SLIDE 46

Association and Correlation II

Describing the relationship between two continuous variables:

- Correlation analysis: measures the strength of the relationship between two variables and specifies its direction
- Regression analysis: concerns prediction or estimation of an outcome variable based on the value of another variable (or variables)

SLIDE 47

Correlation analysis

Plot the data (or have a computer do so) and visually inspect the relationship between the two continuous variables:

- Is there a linear relationship (correlation)?
- Are there outliers?
- Are the distributions skewed?

SLIDE 48

Correlation Coefficient I

Measures the strength and direction of the linear relationship between two variables X and Y.

Population correlation coefficient:

ρ = cov(X, Y) / √(var(X) · var(Y))
  = E[(X − µX)(Y − µY)] / √( E[(X − µX)²] · E[(Y − µY)²] )

SLIDE 49

Correlation Coefficient II

The correlation coefficient, ρ, takes values between −1 and +1:

−1: perfect negative linear relationship
 0: no linear relationship
+1: perfect positive linear relationship

SLIDE 50

Correlation Coefficient III

Sample correlation coefficient, obtained by plugging sample estimates into the population correlation coefficient:

r = sample cov(X, Y) / √(s²X · s²Y)
  = [ Σⁿᵢ₌₁ (Xi − X̄)(Yi − Ȳ) / (n − 1) ] / √( [Σⁿᵢ₌₁ (Xi − X̄)² / (n − 1)] · [Σⁿᵢ₌₁ (Yi − Ȳ)² / (n − 1)] )

The factors of 1/(n − 1) cancel, leaving

r = Σᵢ (Xi − X̄)(Yi − Ȳ) / √( Σᵢ (Xi − X̄)² · Σᵢ (Yi − Ȳ)² )
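A direct implementation of this formula on made-up paired data, checked against NumPy's built-in estimator:

```python
import numpy as np

# Made-up paired measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample correlation coefficient from the definition
r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# Agrees with NumPy's built-in estimate
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(round(r, 3))
```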

SLIDE 51

Correlation Coefficient IV

Plot standardized Y versus standardized X Observe an ellipse (elongated circle) Correlation is the slope of the major axis

SLIDE 52

Correlation Notes

Other names for r:

- Pearson correlation coefficient
- Product-moment correlation coefficient

Characteristics of r:

- Measures *linear* association
- The value of r is independent of the units used to measure the variables
- The value of r is sensitive to outliers
- r² tells us what proportion of the variation in Y is explained by the linear relationship with X

SLIDE 53

Several levels of correlation

SLIDE 54

Examples of the Correlation Coefficient I

Perfect positive correlation, r ≈ 1

SLIDE 55

Examples of the Correlation Coefficient II

Perfect negative correlation, r ≈ -1

SLIDE 56

Examples of the Correlation Coefficient III

Imperfect positive correlation, 0 < r < 1

SLIDE 57

Examples of the Correlation Coefficient IV

Imperfect negative correlation, −1 < r < 0

SLIDE 58

Examples of the Correlation Coefficient V

No relation, r ≈ 0

SLIDE 59

Examples of the Correlation Coefficient VI

Some relation but little *linear* relationship, r ≈ 0

SLIDE 60

Association and Causality

In general, association between two variables means there is some form of relationship between them. The relationship is not necessarily causal: association does not imply causation, no matter how much we would like it to.

Example: hot days, ice cream consumption, and drownings are all associated, yet ice cream does not cause drowning; hot weather drives both.

SLIDE 61

Sir Bradford Hill’s Criteria for Causality I

- Strength: magnitude of association
- Consistency: repeated observation of the association in different situations
- Specificity: uniqueness of the association
- Temporality: cause precedes effect

SLIDE 62

Sir Bradford Hill’s Criteria for Causality II

- Biologic gradient: dose-response relationship
- Biologic plausibility: known mechanisms
- Coherence: makes sense based on other known facts
- Experimental evidence: from designed (randomized) experiments
- Analogy: with other known associations
