
R04 - Regression with Categorical Explanatory Variables (STAT 587)



  1. R04 - Regression with Categorical Explanatory Variables. STAT 587 (Engineering), Iowa State University, October 26, 2020.

  2. Categorical explanatory variables: Binary explanatory variable. Recall the simple linear regression model $Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$. If we have a binary explanatory variable, i.e. the explanatory variable has only two levels, say level A and level B, we can code it as $X_i = \mathrm{I}(\text{observation } i \text{ is level A})$, where $\mathrm{I}(\text{statement})$ is an indicator function that is 1 when the statement is true and 0 otherwise. Then $\beta_0$ is the expected response for level B, $\beta_0 + \beta_1$ is the expected response for level A, and $\beta_1$ is the expected difference in response (level A minus level B).
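
As a minimal sketch of the indicator coding on this slide, the R code below fits the model to a small made-up data set and checks that the intercept equals the level B mean and the slope equals the level A mean minus the level B mean. The data values and the names `d`, `y`, and `level` are illustrative assumptions, not part of the slides.

```r
# Toy data (hypothetical, not from the slides): three responses per level
d <- data.frame(
  y     = c(10, 12, 14, 20, 22, 24),
  level = c("B", "B", "B", "A", "A", "A")
)

# Indicator coding: X = 1 when the observation is level A, 0 otherwise
d$X <- as.numeric(d$level == "A")

m <- lm(y ~ X, data = d)
b <- coef(m)

# Group means computed directly, for comparison with the coefficients
mean_A <- mean(d$y[d$level == "A"])
mean_B <- mean(d$y[d$level == "B"])

# beta_0 should equal the level B mean; beta_1 should equal (A mean - B mean)
beta0_matches <- isTRUE(all.equal(unname(b["(Intercept)"]), mean_B))
beta1_matches <- isTRUE(all.equal(unname(b["X"]), mean_A - mean_B))
print(c(beta0_matches = beta0_matches, beta1_matches = beta1_matches))
```

Swapping which level gets the indicator only flips the sign of the slope and moves the intercept to the other group's mean.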

  3. Mice lifetimes (Sleuth3::case0501). [Figure: Lifetime (months) by Diet for the N/R50 and R/R50 groups.]

  4. Regression model for mice lifetimes. Let $Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$, where $Y_i$ is the lifetime of the $i$th mouse and $X_i = \mathrm{I}(\text{Diet}_i = \text{N/R50})$. Then
     $E[\text{Lifetime} \mid \text{R/R50}] = E[Y_i \mid X_i = 0] = \beta_0$
     $E[\text{Lifetime} \mid \text{N/R50}] = E[Y_i \mid X_i = 1] = \beta_0 + \beta_1$
     and
     $E[\text{Lifetime difference}] = E[\text{Lifetime} \mid \text{N/R50}] - E[\text{Lifetime} \mid \text{R/R50}] = (\beta_0 + \beta_1) - \beta_0 = \beta_1$.

  5. R code

     library("Sleuth3")  # provides the case0501 data set
     # Indicator for the N/R50 diet; the fit below keeps only the R/R50 and N/R50 mice
     case0501$X <- ifelse(case0501$Diet == "N/R50", 1, 0)
     (m <- lm(Lifetime ~ X, data = case0501, subset = Diet %in% c("R/R50","N/R50")))

     Call:
     lm(formula = Lifetime ~ X, data = case0501, subset = Diet %in% c("R/R50", "N/R50"))

     Coefficients:
     (Intercept)            X
         42.8857      -0.5885

     confint(m)
                     2.5 %    97.5 %
     (Intercept) 40.952257 44.819172
     X           -3.174405  1.997342

     predict(m, data.frame(X=1), interval = "confidence") # Expected lifetime on N/R50
            fit      lwr     upr
     1 42.29718 40.58007 44.0143

  6. Mice lifetimes. [Figure: Lifetime (months) by Diet for the N/R50 and R/R50 groups.]

  7. Equivalence to a two-sample t-test. Recall that our two-sample t-test had the model $Y_{ij} \stackrel{ind}{\sim} N(\mu_j, \sigma^2)$ for groups $j = 0, 1$. This is equivalent to our current regression model, where $\mu_0 = \beta_0$ and $\mu_1 = \beta_0 + \beta_1$, assuming $\mu_0$ represents the mean for the R/R50 group and $\mu_1$ represents the mean for the N/R50 group. When two models are effectively the same but have different parameters, we say the model is reparameterized.

  8. Equivalence

     summary(m)$coefficients[2,4] # p-value
     [1] 0.6531748

     confint(m)
                     2.5 %    97.5 %
     (Intercept) 40.952257 44.819172
     X           -3.174405  1.997342

     t.test(Lifetime ~ Diet, data = case0501, subset = Diet %in% c("R/R50","N/R50"), var.equal=TRUE)

         Two Sample t-test

     data:  Lifetime by Diet
     t = -0.45044, df = 125, p-value = 0.6532
     alternative hypothesis: true difference in means is not equal to 0
     95 percent confidence interval:
      -3.174405  1.997342
     sample estimates:
     mean in group N/R50 mean in group R/R50
                42.29718            42.88571
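
The same agreement can be checked programmatically rather than by reading printed output. The sketch below uses made-up data (the values and the names `d`, `y`, and `group` are assumptions for illustration) and compares the p-value and confidence interval for $\beta_1$ with those from the pooled-variance two-sample t-test.

```r
# Toy data (hypothetical, not from the slides): two groups of unequal size
d <- data.frame(
  y     = c(41, 45, 39, 44, 47, 38, 42, 40, 43),
  group = factor(c("A", "A", "A", "A", "B", "B", "B", "B", "B"))
)
d$X <- as.numeric(d$group == "A")  # indicator for group A

m  <- lm(y ~ X, data = d)
tt <- t.test(y ~ group, data = d, var.equal = TRUE)  # pooled-variance t-test

# p-value for beta_1 versus the t-test p-value
p_lm <- summary(m)$coefficients["X", "Pr(>|t|)"]
p_tt <- tt$p.value

# Interval for beta_1 versus the t-test interval (both are group A minus group B)
ci_lm <- unname(confint(m)["X", ])
ci_tt <- as.numeric(tt$conf.int)

print(all.equal(p_lm, p_tt))
print(all.equal(ci_lm, ci_tt))
```

The match relies on `var.equal = TRUE`; the default Welch test does not assume a common $\sigma^2$ and generally gives a different interval.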

  9. Many levels: Using a categorical variable as an explanatory variable. [Figure: Lifetime (months) by Diet for the N/N85, N/R40, N/R50, NP, R/R50, and lopro groups.]

  10. Regression with a categorical variable
      1. Choose one of the levels as the reference level, e.g. N/N85.
      2. Construct dummy variables using indicator functions, i.e.
         $\mathrm{I}(A) = \begin{cases} 1 & A \text{ is TRUE} \\ 0 & A \text{ is FALSE} \end{cases}$
         for the other levels, e.g.
         $X_{i,1} = \mathrm{I}(\text{diet for observation } i \text{ is N/R40})$
         $X_{i,2} = \mathrm{I}(\text{diet for observation } i \text{ is N/R50})$
         $X_{i,3} = \mathrm{I}(\text{diet for observation } i \text{ is NP})$
         $X_{i,4} = \mathrm{I}(\text{diet for observation } i \text{ is R/R50})$
         $X_{i,5} = \mathrm{I}(\text{diet for observation } i \text{ is lopro})$
      3. Estimate the parameters of a multiple regression model using these dummy variables.
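
The dummy-variable construction in step 2 can be sketched with base R's `model.matrix()`, which builds one indicator column per non-reference level of a factor. The diet labels below come from the slides, but the six-observation vector (one observation per level) is a hypothetical stand-in for the real data.

```r
# Diet labels from the slides; one hypothetical observation per level
diet_levels <- c("N/N85", "N/R40", "N/R50", "NP", "R/R50", "lopro")
diet <- factor(diet_levels, levels = diet_levels)  # first level is the reference

# model.matrix() returns the intercept column plus one indicator
# column for each non-reference level (default treatment contrasts)
X <- model.matrix(~ diet)
print(colnames(X))
print(X)

# Manual construction of the first dummy, X_{i,1} = I(diet is N/R40)
X1 <- as.numeric(diet == "N/R40")
```

The row for the reference level (N/N85) has zeros in every indicator column, and every other row has exactly one indicator equal to 1.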

  11. Regression model. Our regression model becomes
      $Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \beta_4 X_{i,4} + \beta_5 X_{i,5}, \sigma^2)$
      where
      $\beta_0$ is the expected lifetime for the N/N85 group,
      $\beta_0 + \beta_1$ is the expected lifetime for the N/R40 group,
      $\beta_0 + \beta_2$ is the expected lifetime for the N/R50 group,
      $\beta_0 + \beta_3$ is the expected lifetime for the NP group,
      $\beta_0 + \beta_4$ is the expected lifetime for the R/R50 group,
      $\beta_0 + \beta_5$ is the expected lifetime for the lopro group,
      and thus $\beta_p$ for $p > 0$ is the difference in expected lifetimes between one group and the reference group.

  12. R code

      library("dplyr")  # provides %>% and mutate()
      # Logical dummy variables for every level except the reference level N/N85
      case0501 <- case0501 %>%
        mutate(X1 = Diet == "N/R40",
               X2 = Diet == "N/R50",
               X3 = Diet == "NP",
               X4 = Diet == "R/R50",
               X5 = Diet == "lopro")
      m <- lm(Lifetime ~ X1 + X2 + X3 + X4 + X5, data = case0501)
      m

      Call:
      lm(formula = Lifetime ~ X1 + X2 + X3 + X4 + X5, data = case0501)

      Coefficients:
      (Intercept)       X1TRUE       X2TRUE       X3TRUE       X4TRUE       X5TRUE
           32.691       12.425        9.606       -5.289       10.194        6.994

      confint(m)
                      2.5 %    97.5 %
      (Intercept) 30.951394 34.431062
      X1TRUE       9.995893 14.854984
      X2TRUE       7.269897 11.942013
      X3TRUE      -7.848142 -2.730232
      X4TRUE       7.723030 12.665943
      X5TRUE       4.523030  9.465943

  13. R code (cont.)

      summary(m)

      Call:
      lm(formula = Lifetime ~ X1 + X2 + X3 + X4 + X5, data = case0501)

      Residuals:
           Min       1Q   Median       3Q      Max
      -25.5167  -3.3857   0.8143   5.1833  10.0143

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
      X1TRUE       12.4254     1.2352  10.059  < 2e-16 ***
      X2TRUE        9.6060     1.1877   8.088 1.06e-14 ***
      X3TRUE       -5.2892     1.3010  -4.065 5.95e-05 ***
      X4TRUE       10.1945     1.2565   8.113 8.88e-15 ***
      X5TRUE        6.9945     1.2565   5.567 5.25e-08 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

      Residual standard error: 6.678 on 343 degrees of freedom
      Multiple R-squared:  0.4543, Adjusted R-squared:  0.4463
      F-statistic:  57.1 on 5 and 343 DF,  p-value: < 2.2e-16

  14. Interpretation
      - $\beta_0 = E[Y_i \mid \text{reference level}]$, i.e. the expected response for the reference level.
        Note: $X_{i,1} = \cdots = X_{i,p} = 0$, i.e. all indicators are zero, only at the reference level.
      - $\beta_p$, $p > 0$: the expected change in the response when moving from the reference level to the level associated with the $p$th dummy variable.
        Note: the only way for $X_{i,p}$ to increase by one is if initially $X_{i,1} = \cdots = X_{i,p} = 0$ and now $X_{i,p} = 1$.
      For example:
      - The expected lifetime for mice on the N/N85 diet is 32.7 (31.0, 34.4) months.
      - The expected increase in lifetime for mice on the N/R40 diet compared to the N/N85 diet is 12.4 (10.0, 14.9) months.
      - The model explains 45% of the variability in mice lifetimes.
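
To illustrate how estimates and intervals of this kind can be read off a fitted model, here is a small sketch on made-up data with three groups. The data and the names `d`, `y`, `g`, and `new_levels` are hypothetical. It passes the factor directly to `lm()`, which builds the dummy variables itself, instead of creating `X1`, `X2`, ... by hand.

```r
# Toy data (hypothetical, not from the slides): three groups, "ref" is the reference
d <- data.frame(
  y = c(30, 32, 34, 42, 44, 46, 25, 27, 29),
  g = factor(rep(c("ref", "a", "b"), each = 3), levels = c("ref", "a", "b"))
)

m <- lm(y ~ g, data = d)  # lm() constructs the dummy variables from the factor
b <- coef(m)

# Expected response for each level, with 95% confidence intervals
new_levels <- data.frame(g = factor(levels(d$g), levels = levels(d$g)))
est <- predict(m, new_levels, interval = "confidence")
rownames(est) <- levels(d$g)
print(est)

# Intercept (reference-level mean) and differences from the reference level,
# each with its 95% confidence interval
print(cbind(estimate = b, confint(m)))

# Proportion of variability explained by the model
print(summary(m)$r.squared)
```

The first table answers "what is the expected response in each group", while the second answers "how does each group differ from the reference group".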

  15. Using a categorical variable as an explanatory variable. [Figure: Lifetime (months) by Diet for the N/N85, N/R40, N/R50, NP, R/R50, and lopro groups, annotated with $\beta_0$ and $\beta_1$ through $\beta_5$.]

  16. Equivalence to multiple group model. Recall that we had a multiple group model $Y_{ij} \stackrel{ind}{\sim} N(\mu_j, \sigma^2)$ for groups $j = 0, 1, 2, \ldots, 5$. Our regression model is a reparameterization of the multiple group model:
      N/N85: $\mu_0 = \beta_0$
      N/R40: $\mu_1 = \beta_0 + \beta_1$
      N/R50: $\mu_2 = \beta_0 + \beta_2$
      NP: $\mu_3 = \beta_0 + \beta_3$
      R/R50: $\mu_4 = \beta_0 + \beta_4$
      lopro: $\mu_5 = \beta_0 + \beta_5$
      assuming the groups are labeled appropriately.
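
One way to see the reparameterization numerically is to fit both forms and compare. In the sketch below (made-up data, hypothetical names), `y ~ g` gives the $\beta$ parameterization and `y ~ 0 + g`, which drops the intercept, gives one mean per group, i.e. the $\mu$ parameterization.

```r
# Toy data (hypothetical, not from the slides): three groups, "ref" is the reference
d <- data.frame(
  y = c(30, 32, 34, 42, 44, 46, 25, 27, 29),
  g = factor(rep(c("ref", "a", "b"), each = 3), levels = c("ref", "a", "b"))
)

m_ref  <- lm(y ~ g, data = d)      # beta form: reference mean plus differences
m_cell <- lm(y ~ 0 + g, data = d)  # mu form: one mean per group

beta <- coef(m_ref)
mu   <- coef(m_cell)

# mu_0 = beta_0 and mu_j = beta_0 + beta_j for j > 0
mu_from_beta <- c(beta[1], beta[1] + beta[-1])

print(all.equal(unname(mu), unname(mu_from_beta)))
print(all.equal(fitted(m_ref), fitted(m_cell)))
```

The two fits have identical fitted values and residuals. Note that `summary()` reports a different R-squared for the no-intercept fit, because R measures it against a zero baseline rather than the overall mean.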

  17. Summary
      1. Choose one of the levels as the reference level.
      2. Construct dummy variables using indicator functions for all other levels, e.g. $X_i = \mathrm{I}(\text{observation } i \text{ is <some non-reference level>})$.
      3. Estimate the parameters of a multiple regression model using these dummy variables.
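
In practice the three steps can be left to R: when a factor appears in an `lm()` formula, its first level serves as the reference level, and base R's `relevel()` changes which level that is. The sketch below uses made-up data (the names `d`, `y`, `g`, and `g2` are hypothetical) to show how the coefficients change with the reference level.

```r
# Toy data (hypothetical, not from the slides): three groups
d <- data.frame(
  y = c(30, 32, 34, 42, 44, 46, 25, 27, 29),
  g = factor(rep(c("a", "b", "c"), each = 3))
)

# Step 1 (default): the first factor level, "a", is the reference level
m1 <- lm(y ~ g, data = d)

# Step 1 (alternative): make "c" the reference level instead
d$g2 <- relevel(d$g, ref = "c")
m2 <- lm(y ~ g2, data = d)

# Steps 2 and 3 happen inside lm(): dummy variables are built and estimated
print(coef(m1))  # intercept is the "a" mean; others are differences from "a"
print(coef(m2))  # intercept is the "c" mean; others are differences from "c"
```

Both fits describe the same group means; only the comparison expressed by each coefficient changes.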
