
Lecture 5: ANOVA and Correlation. Ani Manichaikul, amanicha@jhsph.edu



  1. Lecture 5: ANOVA and Correlation. Ani Manichaikul, amanicha@jhsph.edu, 23 April 2007

  2. Comparing Multiple Groups
     - Continuous data (comparing means): analysis of variance
     - Binary data (comparing proportions): Pearson's chi-square tests for r × 2 tables (independence, goodness of fit, homogeneity)
     - Categorical data (r × c tables): Pearson chi-square tests; odds ratio and relative risk

  3. ANOVA: Definition
     - Statistical technique for comparing means for multiple populations
     - Partitions the total variation in a data set into components defined by specific sources
     - ANOVA = ANalysis Of VAriance

  4. ANOVA: Concepts
     - Estimate group means
     - Assess magnitude of variation attributable to specific sources
     - Extension of the 2-sample t-test to multiple groups
     - Population model
     - Sample model: estimates, standard errors
     - Partition of variability

  5. Types of ANOVA
     - One-way ANOVA: one factor, e.g. smoking status
     - Two-way ANOVA: two factors, e.g. gender and smoking status
     - Three-way ANOVA: three factors, e.g. gender, smoking and beer

  6. Emphasis
     - One-way ANOVA is an extension of the t-test to 3 or more samples; it focuses the analysis on group differences
     - Two-way ANOVA (and higher) focuses on the interaction of factors: does the effect due to one factor change as the level of another factor changes?

  7. ANOVA Rationale I
     - Variation in all observations = variation between each observation and its group mean + variation between each group mean and the overall mean
     - In other words: total sum of squares = within-group sum of squares + between-groups sum of squares

  8. ANOVA Rationale II
     - In shorthand: $SST = SSW + SSB$
     - If the group means are not very different, the variation between them and the overall mean (SSB) will not be much more than the variation between the observations within a group (SSW)
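
The partition of the total sum of squares can be checked numerically. The following is a minimal sketch on made-up toy data (the three groups below are hypothetical and not from the lecture); it computes SST, SSW and SSB directly from their definitions.

```python
# Toy data (hypothetical): three groups of three observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


all_obs = [x for g in groups for x in g]
grand_mean = mean(all_obs)

# Total: each observation versus the overall mean.
sst = sum((x - grand_mean) ** 2 for x in all_obs)
# Within: each observation versus its own group mean.
ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
# Between: each group mean versus the overall mean, weighted by group size.
ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

print(f"SST={sst:.4f}  SSW={ssw:.4f}  SSB={ssb:.4f}  SSW+SSB={ssw + ssb:.4f}")
```

Here the group means differ a lot relative to the spread inside each group, so SSB is much larger than SSW.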

  9. ANOVA: One-Way

  10. MSW
      - We can pool the estimates of $\sigma^2$ across groups and use an overall estimate for the population variance:
        $$\text{Variation within a group} = \hat\sigma^2_W = \frac{SSW}{N-k} = MSW$$
      - MSW is called the "within groups mean square"

  11. MSB
      - We can also look at systematic variation among groups:
        $$\text{Variation between groups} = \hat\sigma^2_B = \frac{SSB}{k-1} = MSB$$

  12. An ANOVA table
      - Suppose there are k groups (e.g. if smoking status has categories current, former or never, then k = 3)
      - We calculate our test statistic using the sums of squares as follows:

  13. Hypothesis testing with ANOVA
      - In performing ANOVA, we may want to ask: is there truly a difference in means across groups?
      - Formally, we can specify the hypotheses:
        $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
        $H_a$: at least one of the $\mu_i$'s is different
      - The null hypothesis specifies a global relationship
      - If the result of the test is significant, then perform individual comparisons

  14. Goal of the comparisons
      - Compare the two variability estimates, MSW and MSB
      - If $F_{obs} = \frac{MSB}{MSW} = \frac{\hat\sigma^2_B}{\hat\sigma^2_W}$ is small, then variability between groups is negligible compared to variation within groups
      - $\Rightarrow$ The grouping does not explain much variation in the data
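
As an illustration of how the two mean squares combine into the observed F ratio, here is a short sketch. The function name `one_way_f` and the toy data are assumptions made for this example; only the formulas MSW = SSW/(N − k) and MSB = SSB/(k − 1) come from the slides.

```python
def one_way_f(groups):
    """Return (MSB, MSW, F) for a list of groups of numeric observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Within-group and between-group sums of squares.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))

    msw = ssw / (n_total - k)  # pooled within-group variance estimate
    msb = ssb / (k - 1)        # between-group variance estimate
    return msb, msw, msb / msw


# Hypothetical toy data: k = 3 groups, N = 9 observations.
msb, msw, f_obs = one_way_f([[4.0, 5.0, 6.0], [5.0, 6.0, 7.0], [7.0, 8.0, 9.0]])
print(f"MSB={msb:.3f}  MSW={msw:.3f}  F={f_obs:.3f}")
```

A small F would mean the group means vary about as much as expected from within-group noise alone.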

  15. The F-statistic
      - For our observations, we assume $X \sim N(\mu_{gp}, \sigma^2)$, where
        $$\mu_{gp} = E(X \mid gp) = \beta_0 + \beta_1 \cdot I(\text{group}=2) + \beta_2 \cdot I(\text{group}=3) + \cdots$$
        and $I(\text{group}=i)$ is an indicator to denote whether or not each individual is in the $i$th group
      - Note: we have assumed the same variance $\sigma^2$ for all groups; it is important to check this assumption
      - Under these assumptions, we know the null distribution of the statistic $F = \frac{MSB}{MSW}$
      - The distribution is called an F-distribution

  16. The F-distribution
      - Remember that a $\chi^2$ distribution is always specified by its degrees of freedom
      - An F-distribution is the distribution of the quotient of two independent $\chi^2$ random variables, each divided by its own degrees of freedom
      - When we specify an F-distribution, we must state two parameters, which correspond to the degrees of freedom for the two $\chi^2$ distributions
      - If $X_1 \sim \chi^2_{df_1}$ and $X_2 \sim \chi^2_{df_2}$ we write:
        $$\frac{X_1 / df_1}{X_2 / df_2} \sim F_{df_1, df_2}$$
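
The construction of an F-distributed variable as a ratio of scaled chi-square variables can be illustrated by simulation. This is only a sketch: the degrees of freedom (3 and 20), the seed and the number of draws are arbitrary choices, and it relies on the standard facts that a chi-square variable with df degrees of freedom is a Gamma variable with shape df/2 and scale 2, and that the mean of an F-distribution is df2/(df2 − 2) when df2 > 2.

```python
import random


def simulate_f(df1, df2, rng):
    """Draw one value as (X1/df1) / (X2/df2) with independent chi-square X1, X2."""
    # chi-square(df) is Gamma(shape=df/2, scale=2)
    x1 = rng.gammavariate(df1 / 2, 2)
    x2 = rng.gammavariate(df2 / 2, 2)
    return (x1 / df1) / (x2 / df2)


rng = random.Random(0)  # fixed seed so the run is repeatable
df1, df2 = 3, 20        # arbitrary illustrative degrees of freedom
draws = [simulate_f(df1, df2, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
theoretical_mean = df2 / (df2 - 2)  # mean of F(df1, df2), valid for df2 > 2
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {theoretical_mean:.3f}")
```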

  17. Back to the hypothesis test . . .
      - Knowing the null distribution of $\frac{MSB}{MSW}$, we can define a decision rule to test the hypothesis for ANOVA:
        - Reject $H_0$ if $F \ge F_{\alpha;\,k-1,\,N-k}$
        - Fail to reject $H_0$ if $F < F_{\alpha;\,k-1,\,N-k}$

  18. ANOVA: F-tests I

  19. ANOVA: F-tests II

  20. Example: ANOVA for HDL
      - Study design: randomized controlled trial
      - 132 men randomized to one of: diet + exercise, diet, control
      - Follow-up one year later: 119 men remaining in study
      - Outcome: mean change in plasma levels of HDL cholesterol from baseline to one-year follow-up in the three groups

  21. Model for HDL outcomes
      - We model the means for each group as follows:
        $\mu_c = E(HDL \mid gp = c)$ = mean change in control group
        $\mu_d = E(HDL \mid gp = d)$ = mean change in diet group
        $\mu_{de} = E(HDL \mid gp = de)$ = mean change in diet and exercise group
      - We could also write the model as
        $$E(HDL \mid gp) = \beta_0 + \beta_1 I(gp = d) + \beta_2 I(gp = de)$$
      - Recall that $I(gp = d)$, $I(gp = de)$ are 0/1 group indicators

  22. HDL ANOVA Table
      - We obtain the following results from the HDL experiment:

  23. HDL ANOVA results
      - F-test
        $H_0: \mu_c = \mu_d = \mu_{de}$ (or $H_0: \beta_1 = \beta_2 = 0$)
        $H_a$: at least one mean is different from the others
      - Test statistic
        $F_{obs} = 13$
        $df_1 = k - 1 = 3 - 1 = 2$
        $df_2 = N - k = 116$

  24. HDL ANOVA Conclusions
      - Rejection region: $F > F_{0.05;\,2,\,116} = 3.07$
      - Since $F_{obs} = 13.0 > 3.07$, we reject $H_0$
      - We conclude that at least one of the group means is different from the others
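
The critical value 3.07 can be reproduced without statistical tables in this particular case. The sketch below relies on a closed form that holds only when the numerator degrees of freedom equal 2: for F with (2, df2) degrees of freedom, P(F > x) = (1 + 2x/df2)^(−df2/2). The function names are made up for this example; for other numerator degrees of freedom a statistics library would be needed.

```python
def f_pvalue_df1_is_2(f_obs, df2):
    """Upper-tail probability P(F > f_obs) for an F(2, df2) distribution only."""
    return (1 + 2 * f_obs / df2) ** (-df2 / 2)


def f_crit_df1_is_2(alpha, df2):
    """Critical value x with P(F > x) = alpha for an F(2, df2) distribution only."""
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


k, n_total = 3, 119  # three groups, 119 men at follow-up
f_obs = 13.0         # observed F statistic reported for the HDL data

crit = f_crit_df1_is_2(0.05, n_total - k)
p_value = f_pvalue_df1_is_2(f_obs, n_total - k)
reject = f_obs >= crit

print(f"critical value = {crit:.2f}, p-value = {p_value:.2e}, reject H0: {reject}")
```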

  25. Which groups are different?
      - We might proceed to make individual comparisons
      - Conduct two-sample t-tests for each pair of groups:
        $$t = \frac{\hat\theta - \theta_0}{SE(\hat\theta)} = \frac{\bar X_i - \bar X_j - 0}{\sqrt{\frac{s_p^2}{n_i} + \frac{s_p^2}{n_j}}}$$

  26. Multiple Comparisons
      - Performing individual comparisons requires multiple hypothesis tests
      - If $\alpha = 0.05$ for each comparison, there is a 5% chance that each comparison will falsely be called significant
      - Overall, the probability of Type I error is elevated above 5%
      - Question: how can we address this multiple comparisons issue?
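
To see how far the overall Type I error rate can rise, the following sketch computes the chance of at least one false positive among three tests at level 0.05. It assumes the three tests are independent, which is a simplification made for illustration: pairwise comparisons among the same three groups are in fact correlated, so the exact figure differs.

```python
alpha = 0.05  # per-comparison significance level
m = 3         # number of pairwise comparisons among three groups

# P(at least one false rejection) when all nulls are true and tests are independent.
fwer = 1 - (1 - alpha) ** m
print(f"family-wise error rate with {m} independent tests: {fwer:.4f}")
```

Under this assumption the overall error rate is about 14%, nearly three times the nominal 5%.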

  27. Bonferroni adjustment
      - A possible correction for multiple comparisons
      - Test each hypothesis at level $\alpha^* = \alpha/3 = 0.0167$
      - Adjustment ensures overall Type I error rate does not exceed $\alpha = 0.05$
      - However, this adjustment may be too conservative

  28. Multiple comparisons

      | Hypothesis | $\alpha^* = \alpha/3$ |
      |---|---|
      | $H_0: \mu_c = \mu_d$ (or $\beta_1 = 0$) | 0.0167 |
      | $H_0: \mu_c = \mu_{de}$ (or $\beta_2 = 0$) | 0.0167 |
      | $H_0: \mu_d = \mu_{de}$ (or $\beta_1 - \beta_2 = 0$) | 0.0167 |

      Overall $\alpha = 0.05$

  29. HDL: Pairwise comparisons I
      - Control and Diet groups
      - $H_0: \mu_c = \mu_d$ (or $\beta_1 = 0$)
        $$t = \frac{-0.05 - 0.02}{\sqrt{\frac{0.028}{40} + \frac{0.028}{40}}} = -1.87$$
      - p-value = 0.06

  30. HDL: Pairwise comparisons II
      - Control and Diet + exercise groups
      - $H_0: \mu_c = \mu_{de}$ (or $\beta_2 = 0$)
        $$t = \frac{-0.05 - 0.14}{\sqrt{\frac{0.028}{40} + \frac{0.028}{39}}} = -5.05$$
      - p-value = $4.4 \times 10^{-7}$

  31. HDL: Pairwise comparisons III
      - Diet and Diet + exercise groups
      - $H_0: \mu_d = \mu_{de}$ (or $\beta_1 - \beta_2 = 0$)
        $$t = \frac{0.02 - 0.14}{\sqrt{\frac{0.028}{40} + \frac{0.028}{39}}} = -3.19$$
      - p-value = 0.0014
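
The three pairwise statistics can be recomputed from the numbers on the slides. This sketch makes a few assumptions that should be read as such: the group means are taken as −0.05 (control), 0.02 (diet, the value used in the control-versus-diet comparison and the one that reproduces t = −3.19) and 0.14 (diet + exercise); the group sizes as 40, 40 and 39; the pooled variance as 0.028. The p-values use a two-sided normal approximation, which appears to match the reported values; a t-distribution with 116 degrees of freedom would give slightly larger p-values.

```python
import math


def pooled_t(mean_i, mean_j, s2_pooled, n_i, n_j):
    """Two-sample t statistic using a pooled variance estimate."""
    se = math.sqrt(s2_pooled / n_i + s2_pooled / n_j)
    return (mean_i - mean_j) / se


def two_sided_normal_p(t):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2))


s2 = 0.028                               # pooled variance from the slides
mean_c, mean_d, mean_de = -0.05, 0.02, 0.14  # assumed group mean changes
n_c, n_d, n_de = 40, 40, 39              # assumed group sizes (total 119)

t_c_d = pooled_t(mean_c, mean_d, s2, n_c, n_d)
t_c_de = pooled_t(mean_c, mean_de, s2, n_c, n_de)
t_d_de = pooled_t(mean_d, mean_de, s2, n_d, n_de)

for label, t in [("c vs d", t_c_d), ("c vs de", t_c_de), ("d vs de", t_d_de)]:
    print(f"{label}: t = {t:.2f}, p = {two_sided_normal_p(t):.2g}")
```

Note that the control versus diet + exercise statistic comes out negative (about −5.05), since the control mean is the smaller of the two.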

  32. Bonferroni corrected p-values

      | Hypothesis | p-value | adjusted p-value |
      |---|---|---|
      | $H_0: \mu_c = \mu_d$ | 0.06 | 0.18 |
      | $H_0: \mu_c = \mu_{de}$ | $4.4 \times 10^{-7}$ | $1.3 \times 10^{-6}$ |
      | $H_0: \mu_d = \mu_{de}$ | 0.0014 | 0.0042 |

      Overall $\alpha = 0.05$

      Conclusion: Significant difference in HDL change for DE group compared to other groups
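
The adjusted p-values follow from multiplying each raw p-value by the number of comparisons (capping at 1), which is equivalent to testing each raw p-value at level α/3. A minimal sketch, with the helper name `bonferroni` and the dictionary labels chosen for this example:

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Raw p-values from the three pairwise HDL comparisons.
raw = {"c vs d": 0.06, "c vs de": 4.4e-7, "d vs de": 0.0014}

adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
significant = [name for name, p in adjusted.items() if p < 0.05]

for name, p in adjusted.items():
    print(f"{name}: adjusted p = {p:.2g}")
print("significant at overall alpha = 0.05:", significant)
```

Only the two comparisons involving the diet + exercise group remain significant after adjustment.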

  33. Two-way ANOVA
      - Uses the same idea as one-way ANOVA by partitioning variability
      - Allows us to look at interaction of factors
      - Does the effect due to one factor change as the level of another factor changes?

  34. Example: Public health students' medical expenditures
      - Study design: in an observational study, total medical expenditures and various demographic characteristics were recorded for 200 public health students
      - Goal: determine how gender and smoking status affect total medical expenditures in this population

  35. Example: Set-up
      - Y = total medical expenditures
      - F = indicator of female: 1 if gender = female, 0 otherwise
      - S = indicator of smoking: 1 if smoked 100 cigarettes or more, 0 otherwise

  36. Interaction model
      - We assume the model $Y \sim N(\mu, \sigma^2)$ where
        $$\mu = E(Y) = \beta_0 + \beta_1 F + \beta_2 S + \beta_3 F \cdot S$$
      - What are the interpretations of $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$?

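  37. Two-way ANOVA: Interactions
      - Mean model: $\mu = E(Y) = \beta_0 + \beta_1 F + \beta_2 S + \beta_3 F \cdot S$

      | Gender | Smoker: No | Smoker: Yes |
      |---|---|---|
      | Male | $\beta_0$ | $\beta_0 + \beta_2$ |
      | Female | $\beta_0 + \beta_1$ | $\beta_0 + \beta_1 + \beta_2 + \beta_3$ |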

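  38. Mean Model
      - $E(\text{Expenditure} \mid \text{Male, non-smoker}) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \beta_0$
      - $E(\text{Expenditure} \mid \text{Female, non-smoker}) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 = \beta_0 + \beta_1$
      - $E(\text{Expenditure} \mid \text{Male, smoker}) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 1 + \beta_3 \cdot 0 = \beta_0 + \beta_2$
      - $E(\text{Expenditure} \mid \text{Female, smoker}) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 1 + \beta_3 \cdot 1 = \beta_0 + \beta_1 + \beta_2 + \beta_3$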
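
The four cell means, and the role of the interaction coefficient, can be made concrete with a short sketch. The coefficient values below are invented purely for illustration and are not estimates from the medical expenditures study; the function name `expected_expenditure` is also an assumption of this example.

```python
def expected_expenditure(female, smoker, b0, b1, b2, b3):
    """Mean model E(Y) = b0 + b1*F + b2*S + b3*F*S with 0/1 indicators F and S."""
    return b0 + b1 * female + b2 * smoker + b3 * female * smoker


# Hypothetical coefficients, chosen only to illustrate the model.
betas = dict(b0=1000.0, b1=200.0, b2=500.0, b3=-150.0)

cells = {
    (f, s): expected_expenditure(f, s, **betas)
    for f in (0, 1)
    for s in (0, 1)
}

# Effect of smoking within each gender: b2 for males, b2 + b3 for females.
smoking_effect_male = cells[(0, 1)] - cells[(0, 0)]
smoking_effect_female = cells[(1, 1)] - cells[(1, 0)]

print(cells)
print("smoking effect (male):", smoking_effect_male)
print("smoking effect (female):", smoking_effect_female)
```

The difference between the two smoking effects equals the interaction coefficient, which is why a nonzero interaction means the effect of one factor changes with the level of the other.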

  39. Medical Expenditures: ANOVA table

      | Source of variation | Sum of squares | df | Mean square | F | p-value |
      |---|---|---|---|---|---|
      | Model (between groups) | $1.7 \times 10^9$ | 3 | $5.6 \times 10^8$ | 28.11 | < 0.001 |
      | Error (within groups) | $3.9 \times 10^9$ | 196 | $2.0 \times 10^7$ | | |
      | Total | $5.6 \times 10^9$ | 199 | | | |
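
The columns of this table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and F is the ratio of the two mean squares. The sketch below recomputes them from the rounded sums of squares shown in the table, so the result is only approximately equal to the reported 28.11, which was presumably computed from unrounded values.

```python
# Rounded values as printed in the ANOVA table.
ss_model, df_model = 1.7e9, 3
ss_error, df_error = 3.9e9, 196

ms_model = ss_model / df_model  # between-groups mean square
ms_error = ss_error / df_error  # within-groups mean square
f_stat = ms_model / ms_error

ss_total = ss_model + ss_error
df_total = df_model + df_error

print(f"MS model = {ms_model:.3g}, MS error = {ms_error:.3g}, F = {f_stat:.1f}")
print(f"SS total = {ss_total:.3g}, df total = {df_total}")
```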
