STAT 213: Indicator Variables in MLR, by Colin Reimer Dawson (Oberlin College)


  1. STAT 213: Indicator Variables in MLR
     Colin Reimer Dawson, Oberlin College
     February 28, 2018

  2. Outline
     • CHOOSE step
     • FIT step
     • Indicator Variables
     • ASSESS: Nested F-tests

  3. The Four-Step Process: Multiple Regression
     1. CHOOSE a form of the model
        • Select predictors
        • Choose any transformations of predictors
     2. FIT: Estimate
        • coefficients: $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_K$
        • residual variance: $\hat\sigma^2_\varepsilon$
     3. ASSESS the fit
        • Examine residuals (may need to return to step 1)
        • Test individual predictors (t-tests)
        • Test/measure overall fit (ANOVA, $R^2$)
        • Model comparison/selection
     4. USE the model
        • Make predictions
        • Construct CIs and PIs

  4. Outline: CHOOSE step, FIT step, Indicator Variables, ASSESS: Nested F-tests

  5. CHOOSE: Active Pulse Rate

         library(Stat2Data); data(Pulse)
         head(Pulse, n = 3)

           Active Rest Smoke Sex Exercise Hgt Wgt
         1     97   78     0   1        1  63 119
         2     82   68     1   0        3  70 225
         3     88   62     0   0        3  72 175

     $$\mathrm{Active}_i = \beta_0 + \beta_1 \cdot \mathrm{Rest}_i + \beta_2 \cdot \mathrm{Hgt}_i + \beta_3 \cdot \mathrm{Wgt}_i + \varepsilon_i$$

  6. Outline: CHOOSE step, FIT step, Indicator Variables, ASSESS: Nested F-tests

  7. FIT: Estimate Coefficients
     The Multiple Regression Population Model:
     $$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + \varepsilon_i$$
     The Multiple Regression Fitted Model:
     $$Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_K X_{iK} + \hat\varepsilon_i$$
     How to choose the $\hat\beta_k$s? Minimize SSE! (Requires linear algebra / vector calculus)
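
The "minimize SSE" step can be sketched in base R. This is a minimal illustration on simulated data: the variables `x1`, `x2`, `y` and the coefficients used to generate them are made up for this example and are not the Pulse data. It solves the normal equations directly and compares the result with `lm()`.

```r
# Simulated data (hypothetical; not the Pulse data set)
set.seed(213)
n  <- 50
x1 <- rnorm(n, mean = 70, sd = 10)
x2 <- rnorm(n, mean = 68, sd = 4)
y  <- 10 + 1.1 * x1 - 0.5 * x2 + rnorm(n, sd = 5)

# Design matrix: a column of 1s for the intercept, then the predictors
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) b = X'y for the SSE-minimizing b
betaHat <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() should reach the same estimates (it uses a QR decomposition internally)
fit <- lm(y ~ x1 + x2)
print(rbind(normalEquations = betaHat, lm = coef(fit)))

# Nudging a coefficient away from betaHat should only increase the SSE
sse <- function(b) sum((y - X %*% b)^2)
print(c(atMinimum = sse(betaHat), nudged = sse(betaHat + c(0, 0.01, 0))))
```

Solving the normal equations by hand is shown only to make the idea concrete; in practice `lm()` is the tool to use.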

  8. FIT: Estimate Coefficients

         library(dplyr)  # provides the %>% pipe
         pulseModel <- lm(Active ~ Rest + Hgt + Wgt, data = Pulse)
         coef(pulseModel) %>% round(digits = 2)

         (Intercept)        Rest         Hgt         Wgt
               57.26        1.13       -0.88        0.11

     $$\mathrm{Active}_i = 57.26 + 1.13 \cdot \mathrm{Rest}_i - 0.88 \cdot \mathrm{Hgt}_i + 0.11 \cdot \mathrm{Wgt}_i + \varepsilon_i$$

  9. FIT: Estimate Residual Variance
     Recall Variance Decomposition for Regression:
     $$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2$$
     $$SS_{Total} = SS_{Model} + SS_{Error}$$
     Recall ANOVA Table:
     $$MS_{Model} = SS_{Model} / df_{Model} \qquad MS_{Error} = SS_{Error} / df_{Error}$$
     where $MS_{Error}$ represents $\hat\sigma^2_\varepsilon$.
     So... what are $df_{Model}$ and $df_{Error}$?

  10. Regression Degrees of Freedom
      $df_{Model} = K$, where $K$ is the number of predictors.
      This is the number of extra "free parameters" (compared to the null model).
      $df_{Error} = N - K - 1$, where $N$ is the sample size.
      This is the number of "pieces of information" we have about the sizes of the residuals. (Can fit any $K + 1$ points exactly with $K + 1$ coefficients, including the intercept.)
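
The parenthetical remark about fitting K + 1 points exactly can be checked with a tiny example. The three observations below are invented for illustration: with K = 2 predictors and N = 3 points, nothing is left over to estimate the residual variance.

```r
# Three hypothetical observations, two predictors (K = 2), so N = K + 1
tiny <- data.frame(y  = c(5, 1, 4),
                   x1 = c(1, 2, 3),
                   x2 = c(2, 1, 5))
tinyFit <- lm(y ~ x1 + x2, data = tiny)

# With as many coefficients as points, the fit passes through every point
print(round(resid(tinyFit), 10))

# Error degrees of freedom: N - K - 1 = 3 - 2 - 1
print(df.residual(tinyFit))
```

Because the error degrees of freedom are zero here, `summary(tinyFit)` cannot report standard errors; each additional observation beyond K + 1 adds one degree of freedom for the residuals.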

  11. FIT: Estimate Residual Variance
      $$\hat\sigma^2_\varepsilon = MS_{Error} = \frac{SS_{Error}}{df_{Error}} = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - K - 1}$$
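
As a rough check of the formulas for the variance decomposition and the residual variance, the quantities can be computed by hand and compared with what R reports. The data are simulated for this sketch (the names `x1`, `x2`, `y` are placeholders, not Pulse variables).

```r
# Simulated data (hypothetical) with K = 2 predictors
set.seed(213)
N  <- 40
K  <- 2
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 2 + 3 * x1 - x2 + rnorm(N, sd = 2)
fit  <- lm(y ~ x1 + x2)
yHat <- fitted(fit)

# Variance decomposition: SSTotal = SSModel + SSError
ssTotal <- sum((y - mean(y))^2)
ssModel <- sum((yHat - mean(y))^2)
ssError <- sum((y - yHat)^2)
print(c(ssTotal = ssTotal, ssModelPlusError = ssModel + ssError))

# Residual variance estimate: MSError = SSError / (N - K - 1)
dfError  <- N - K - 1
msError  <- ssError / dfError
sigmaHat <- sqrt(msError)

# sigma() reports the estimated residual standard deviation
print(c(byHand = sigmaHat, fromR = sigma(fit)))
```

Note that `sigma()` returns the standard deviation, so it is compared with the square root of MSError rather than MSError itself.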

  12. FIT: Estimate Residual Variance

          ## Coefficients w/ standard errors and t-tests
          summary(pulseModel) %>% coef() %>% round(digits = 2)

                      Estimate Std. Error t value Pr(>|t|)
          (Intercept)    57.26      25.01    2.29     0.02
          Rest            1.13       0.10   11.09     0.00
          Hgt            -0.88       0.41   -2.17     0.03
          Wgt             0.11       0.05    2.31     0.02

          ## The estimated standard deviation of the residuals
          sigma(pulseModel) %>% round(digits = 2)

          [1] 14.91

  13. FIT: The Final Model
      $$\mathrm{Active}_i = 57.26 + 1.13 \cdot \mathrm{Rest}_i - 0.88 \cdot \mathrm{Hgt}_i + 0.11 \cdot \mathrm{Wgt}_i + \varepsilon_i$$
      where $\varepsilon_i \sim N(0, 14.91)$; here 14.91 is the estimated standard deviation of the residuals, not the variance.

  14. Next
      • Binary Predictors and Indicator Variables
      • ASSESSing MLR models

  15. Outline: CHOOSE step, FIT step, Indicator Variables, ASSESS: Nested F-tests

  16. Pulse Rates Revisited

          library(Stat2Data); data(Pulse)
          library(dplyr)  # provides mutate()
          PulseWithBMI <- mutate(
            Pulse,
            BMI = Wgt / Hgt^2 * 703,
            InvActive = 1 / Active,
            InvRest = 1 / Rest,
            Male = 1 - Sex)

  17. Active Pulse Rate by Sex

          ### Male = 1 for males, 0 for others
          ### factor() tells R this represents categories
          pulseBySex <- lm(Active ~ factor(Male), data = PulseWithBMI)
          coef(pulseBySex) %>% round(digits = 2)

            (Intercept) factor(Male)1
                  94.82         -6.70

      What is the model here? What does the coefficient for Male mean?

  18. Active Pulse Rate by Sex: t-test

          summary(pulseBySex) %>% coef() %>% round(digits = 2)

                        Estimate Std. Error t value Pr(>|t|)
          (Intercept)      94.82       1.77   53.58     0.00
          factor(Male)1    -6.70       2.44   -2.74     0.01

      What does the t-test tell us?
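
One way to explore these questions is to compare a single-indicator regression with the group means directly. The sketch below uses simulated data (the variables `male` and `active` and their generating values are invented, not the Pulse data): the intercept matches the mean of the 0 group, the indicator coefficient matches the difference in group means, and its t statistic matches the pooled two-sample t-test.

```r
# Simulated data (hypothetical): 30 observations with male = 0, 25 with male = 1
set.seed(213)
male   <- rep(c(0, 1), times = c(30, 25))
active <- 95 - 7 * male + rnorm(length(male), sd = 15)

fit <- lm(active ~ factor(male))

# Group means computed directly
meanOthers <- mean(active[male == 0])
meanMale   <- mean(active[male == 1])

print(coef(fit))
print(c(meanOthers = meanOthers, maleMinusOthers = meanMale - meanOthers))

# The t statistic for the indicator equals the pooled (equal-variance) t-test
lmT     <- summary(fit)$coefficients["factor(male)1", "t value"]
pooledT <- unname(t.test(active[male == 1], active[male == 0],
                         var.equal = TRUE)$statistic)
print(c(lmT = lmT, pooledT = pooledT))
```

The equivalence holds only for the equal-variance version of the two-sample test; R's default Welch test (`var.equal = FALSE`) generally gives a slightly different statistic.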

  19. Pair Discussion
      (3 min.) An environmental expert is interested in modeling the concentration of various chemicals in well water. Write down a regression model in which the amount of lead (Lead) depends on whether the well has been cleaned (Iclean, a 0/1 variable).
      (5 min.) Can you write down a single regression model that you could use to predict the amount of lead (Lead) in a well based on Year and on whether the well has been cleaned? How do you interpret each coefficient?

  20. Combining Quantitative and Indicator Variables

          pulseBySexAndRest <- lm(Active ~ Rest + factor(Male), data = PulseWithBMI)
          pulseBySexAndRest %>% coef() %>% round(2)

            (Intercept)          Rest factor(Male)1
                  16.47          1.12         -2.99

      $$\widehat{\mathrm{Active}} = 16.47 + 1.12 \cdot \mathrm{Rest} - 2.99 \cdot \mathrm{Male}$$
      Now what does the Male coefficient tell us?

  21. Plotting the Model

          ## CAUTION: don't try to use this with multiple quantitative
          ## predictors; it won't make sense
          plotModel(pulseBySexAndRest) +
            scale_color_discrete(
              name = "Sex",
              labels = c("0" = "Others", "1" = "Male"))

      [Figure: scatterplot of Active (y-axis, roughly 50 to 150) against Rest (x-axis, 40 to 100), with points colored by Sex: Others, Male]

  22. One Model, Two Prediction Equations
      $$\widehat{\mathrm{Active}} = 16.47 + 1.12 \cdot \mathrm{Rest} - 2.99 \cdot \mathrm{Male}$$
      Females: $\widehat{\mathrm{Active}} = 16.47 + 1.12 \cdot \mathrm{Rest}$
      Males: $\widehat{\mathrm{Active}} = (16.47 - 2.99) + 1.12 \cdot \mathrm{Rest}$
      The t-test for the Male coefficient tests whether the intercepts are different.
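
The "two parallel lines" reading can be checked numerically. This sketch fits the same form of model to simulated data (the data frame `sim` and its generating values are assumptions for illustration, not the Pulse data) and confirms that, at any resting rate, the predicted gap between the two groups equals the indicator coefficient.

```r
# Simulated data (hypothetical) with one quantitative and one 0/1 predictor
set.seed(213)
n      <- 60
rest   <- rnorm(n, mean = 68, sd = 9)
male   <- rbinom(n, size = 1, prob = 0.5)
active <- 16 + 1.1 * rest - 3 * male + rnorm(n, sd = 12)
sim    <- data.frame(active, rest, male)

fit <- lm(active ~ rest + factor(male), data = sim)
b   <- coef(fit)

# Predictions for both groups at the same resting rates
grid       <- data.frame(rest = c(50, 70, 90))
predOthers <- predict(fit, newdata = transform(grid, male = 0))
predMale   <- predict(fit, newdata = transform(grid, male = 1))

# The gap is the same at every resting rate: the lines are parallel
print(predMale - predOthers)
print(b[["factor(male)1"]])
```

Because the slope on `rest` is shared, the indicator only shifts the line up or down, which is why its t-test is a test of equal intercepts.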

  23. Combining Quantitative and Indicator Variables: t-tests

          summary(pulseBySexAndRest) %>% coef() %>% round(digits = 2)

                        Estimate Std. Error t value Pr(>|t|)
          (Intercept)      16.47       7.19    2.29     0.02
          Rest              1.12       0.10   11.12     0.00
          factor(Male)1    -2.99       2.00   -1.50     0.14

  24. Non-Parallel Lines

          twoLinesModel <- lm(Active ~ Rest + factor(Male) + Rest:factor(Male),
                              data = PulseWithBMI)
          coef(twoLinesModel) %>% round(digits = 2)

                 (Intercept)               Rest      factor(Male)1 Rest:factor(Male)1
                       11.98               1.18               6.82              -0.14

      $$\widehat{\mathrm{Active}} = 11.98 + 1.18 \cdot \mathrm{Rest} + 6.82 \cdot \mathrm{Male} - 0.14 \cdot \mathrm{Rest} \cdot \mathrm{Male}$$
      Now what does the Male coefficient tell us? The last coefficient?
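
As with the parallel-lines model, substituting Male = 0 and Male = 1 into the fitted equation gives one prediction equation per group. The worked arithmetic below uses the rounded coefficients reported on the slide, so the last digits are approximate.

```latex
\begin{align*}
\text{Male} = 0:\quad \widehat{\text{Active}} &= 11.98 + 1.18 \cdot \text{Rest} \\
\text{Male} = 1:\quad \widehat{\text{Active}} &= (11.98 + 6.82) + (1.18 - 0.14) \cdot \text{Rest} \\
  &= 18.80 + 1.04 \cdot \text{Rest}
\end{align*}
```

Read this way, the Male coefficient (6.82) is the difference between the two intercepts, that is, the predicted gap at Rest = 0, which lies far outside the observed resting rates. The interaction coefficient (-0.14) is the difference between the two slopes.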
