R04 - Regression with Categorical Explanatory Variables (STAT 587)

SLIDE 1

R04 - Regression with Categorical Explanatory Variables

STAT 587 (Engineering) Iowa State University

October 26, 2020

SLIDE 2

Categorical explanatory variables Binary explanatory variable

Binary explanatory variable

Recall the simple linear regression model

$Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$.

If we have a binary explanatory variable, i.e. the explanatory variable has only two levels, say level A and level B, we can code it as

$X_i = \mathrm{I}(\text{observation } i \text{ is level A})$

where $\mathrm{I}(\text{statement})$ is an indicator function that is 1 when the statement is true and 0 otherwise. Then

$\beta_0$ is the expected response for level B,
$\beta_0 + \beta_1$ is the expected response for level A, and
$\beta_1$ is the expected difference in response (level A minus level B).
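The indicator coding can be checked numerically. The sketch below uses simulated data (the levels "A" and "B", the sample sizes, and the means are all made up for illustration, not taken from the slides) and shows that the fitted intercept equals the level B sample mean and the fitted slope equals the difference in sample means.

```r
# Hypothetical simulated data with two levels, A and B (not from the slides)
set.seed(1)
level <- rep(c("A", "B"), each = 20)
y     <- rnorm(40, mean = ifelse(level == "A", 12, 10), sd = 2)

# Indicator coding: X = 1 for level A, 0 for level B
X <- as.numeric(level == "A")

fit <- lm(y ~ X)
b   <- unname(coef(fit))

# The intercept matches the level B sample mean
c(intercept = b[1], mean_B = mean(y[level == "B"]))

# The slope matches the difference in sample means (A minus B)
c(slope = b[2], diff_A_minus_B = mean(y[level == "A"]) - mean(y[level == "B"]))
```

Swapping which level gets the indicator only changes which group the intercept describes and flips the sign of the slope.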

SLIDE 3

Mice lifetimes

[Figure: lifetime (months) by diet (N/R50, R/R50); data: Sleuth3::case0501]

SLIDE 4

Regression model for mice lifetimes

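Let

$Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1 X_i, \sigma^2)$

where $Y_i$ is the lifetime of the $i$th mouse and $X_i = \mathrm{I}(\text{Diet}_i = \text{N/R50})$. Then

$E[\text{Lifetime} \mid \text{R/R50}] = E[Y_i \mid X_i = 0] = \beta_0$
$E[\text{Lifetime} \mid \text{N/R50}] = E[Y_i \mid X_i = 1] = \beta_0 + \beta_1$

and

$E[\text{Lifetime difference}] = E[\text{Lifetime} \mid \text{N/R50}] - E[\text{Lifetime} \mid \text{R/R50}] = (\beta_0 + \beta_1) - \beta_0 = \beta_1$.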

SLIDE 5

R code

library("Sleuth3") # provides the case0501 data set

case0501$X <- ifelse(case0501$Diet == "N/R50", 1, 0)
(m <- lm(Lifetime ~ X, data = case0501, subset = Diet %in% c("R/R50","N/R50")))

Call:
lm(formula = Lifetime ~ X, data = case0501, subset = Diet %in% c("R/R50", "N/R50"))

Coefficients:
(Intercept)            X
    42.8857      -0.5885

confint(m)

                2.5 %    97.5 %
(Intercept) 40.952257 44.819172
X           -3.174405  1.997342

predict(m, data.frame(X=1), interval = "confidence") # Expected lifetime on N/R50

       fit      lwr     upr
1 42.29718 40.58007 44.0143

SLIDE 6

Mice lifetimes

[Figure: lifetime (months) by diet (N/R50, R/R50)]

SLIDE 7

Equivalence to a two-sample t-test

Recall that our two-sample t-test had the model

$Y_{ij} \stackrel{ind}{\sim} N(\mu_j, \sigma^2)$

for groups $j = 0, 1$. This is equivalent to our current regression model where

$\mu_0 = \beta_0$
$\mu_1 = \beta_0 + \beta_1$

assuming $\mu_0$ represents the mean for the R/R50 group and $\mu_1$ represents the mean for the N/R50 group. When two models are effectively the same but have different parameters, we say the model is reparameterized.
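A minimal sketch of this equivalence, using simulated data rather than the mice data (the group sizes, means, and standard deviation are arbitrary assumptions): the p-value and confidence interval for the slope from lm() should agree with the pooled-variance two-sample t-test.

```r
# Hypothetical simulated data: group 0 and group 1 with a common variance
set.seed(2)
group <- rep(c(0, 1), times = c(15, 25))
y     <- rnorm(40, mean = 40 + 3 * group, sd = 5)

# Regression on the group indicator
fit <- lm(y ~ group)

# Pooled-variance two-sample t-test of (group 1 mean) minus (group 0 mean)
tt <- t.test(y[group == 1], y[group == 0], var.equal = TRUE)

# p-values for the difference in means
p_lm <- summary(fit)$coefficients["group", "Pr(>|t|)"]
p_tt <- tt$p.value
c(lm = p_lm, t_test = p_tt)

# 95% confidence intervals for the difference in means
ci_lm <- unname(confint(fit)["group", ])
ci_tt <- as.numeric(tt$conf.int)
rbind(lm = ci_lm, t_test = ci_tt)
```

The agreement relies on var.equal = TRUE; the default Welch test in t.test() does not assume a common variance and will generally give slightly different results.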

SLIDE 8

Equivalence

summary(m)$coefficients[2,4] # p-value

[1] 0.6531748

confint(m)

                2.5 %    97.5 %
(Intercept) 40.952257 44.819172
X           -3.174405  1.997342

t.test(Lifetime ~ Diet, data = case0501, subset = Diet %in% c("R/R50","N/R50"), var.equal=TRUE)

Two Sample t-test

data: Lifetime by Diet
t = -0.45044, df = 125, p-value = 0.6532
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.174405  1.997342
sample estimates:
mean in group N/R50 mean in group R/R50
           42.29718            42.88571

SLIDE 9

Categorical explanatory variables Many levels

Using a categorical variable as an explanatory variable.

[Figure: lifetime (months) by diet (N/N85, N/R40, N/R50, NP, R/R50, lopro)]

SLIDE 10

Regression with a categorical variable

  • 1. Choose one of the levels as the reference level, e.g. N/N85.

  • 2. Construct dummy variables using indicator functions, i.e.

$\mathrm{I}(A) = \begin{cases} 1 & A \text{ is TRUE} \\ 0 & A \text{ is FALSE} \end{cases}$

for the other levels, e.g.

$X_{i,1} = \mathrm{I}(\text{diet for observation } i \text{ is N/R40})$
$X_{i,2} = \mathrm{I}(\text{diet for observation } i \text{ is N/R50})$
$X_{i,3} = \mathrm{I}(\text{diet for observation } i \text{ is NP})$
$X_{i,4} = \mathrm{I}(\text{diet for observation } i \text{ is R/R50})$
$X_{i,5} = \mathrm{I}(\text{diet for observation } i \text{ is lopro})$

  • 3. Estimate the parameters of a multiple regression model using these dummy variables.
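The first two steps can be sketched in base R with model.matrix(), which builds the indicator columns from a factor. The six-element vector below is a toy input containing one observation per diet label, not the actual data; the explicit levels argument is an assumption made so that the result does not depend on the locale's sort order.

```r
# Toy input: one observation per diet label (labels from the slides, not real data)
labels <- c("N/N85", "N/R40", "N/R50", "NP", "R/R50", "lopro")
diet   <- factor(labels, levels = labels)

# Step 1: choose the reference level
diet <- relevel(diet, ref = "N/N85")

# Step 2: construct dummy variables for all other levels
X <- model.matrix(~ diet)
X

# The reference-level row has zeros in every dummy column;
# every other row has exactly one dummy equal to 1
rowSums(X[, -1])
```

Passing a factor directly in an lm() formula performs this construction internally, so the columns rarely need to be built by hand.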

SLIDE 11

Regression model

Our regression model becomes

$Y_i \stackrel{ind}{\sim} N(\beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \beta_4 X_{i,4} + \beta_5 X_{i,5}, \sigma^2)$

where

$\beta_0$ is the expected lifetime for the N/N85 group
$\beta_0 + \beta_1$ is the expected lifetime for the N/R40 group
$\beta_0 + \beta_2$ is the expected lifetime for the N/R50 group
$\beta_0 + \beta_3$ is the expected lifetime for the NP group
$\beta_0 + \beta_4$ is the expected lifetime for the R/R50 group
$\beta_0 + \beta_5$ is the expected lifetime for the lopro group

and thus $\beta_p$ for $p > 0$ is the difference in expected lifetimes between one group and the reference group.

SLIDE 12

R code

library("dplyr") # provides %>% and mutate()

case0501 <- case0501 %>%
  mutate(X1 = Diet == "N/R40",
         X2 = Diet == "N/R50",
         X3 = Diet == "NP",
         X4 = Diet == "R/R50",
         X5 = Diet == "lopro")
m <- lm(Lifetime ~ X1 + X2 + X3 + X4 + X5, data = case0501)
m

Call:
lm(formula = Lifetime ~ X1 + X2 + X3 + X4 + X5, data = case0501)

Coefficients:
(Intercept)       X1TRUE       X2TRUE       X3TRUE       X4TRUE       X5TRUE
     32.691       12.425        9.606       -5.289       10.194        6.994

confint(m)

                2.5 %    97.5 %
(Intercept) 30.951394 34.431062
X1TRUE       9.995893 14.854984
X2TRUE       7.269897 11.942013
X3TRUE      -7.848142 -2.730232
X4TRUE       7.723030 12.665943
X5TRUE       4.523030  9.465943
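As a hedged aside, the manual dummy variables are not strictly required: lm() builds them when given a factor. The sketch below uses simulated data with three hypothetical groups ("ref", "t1", "t2") and checks that both approaches give the same coefficient estimates.

```r
# Hypothetical simulated data with three groups (not the mice data)
set.seed(3)
g <- factor(rep(c("ref", "t1", "t2"), each = 10), levels = c("ref", "t1", "t2"))
y <- rnorm(30, mean = c(30, 40, 35)[as.integer(g)], sd = 4)

# Manual dummy variables, in the style of the slide
X1 <- g == "t1"
X2 <- g == "t2"
m_manual <- lm(y ~ X1 + X2)

# Letting lm() construct the dummies from the factor
m_factor <- lm(y ~ g)

# Same estimates, different coefficient names
coef(m_manual)
coef(m_factor)
```

The reference level is the first level of the factor, so relevel() or an explicit levels argument controls which group the intercept describes.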

SLIDE 13

R code (cont.)

summary(m)

Call:
lm(formula = Lifetime ~ X1 + X2 + X3 + X4 + X5, data = case0501)

Residuals:
     Min       1Q   Median       3Q      Max
-25.5167  -3.3857   0.8143   5.1833  10.0143

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  32.6912     0.8846  36.958  < 2e-16 ***
X1TRUE       12.4254     1.2352  10.059  < 2e-16 ***
X2TRUE        9.6060     1.1877   8.088 1.06e-14 ***
X3TRUE       -5.2892     1.3010  -4.065 5.95e-05 ***
X4TRUE       10.1945     1.2565   8.113 8.88e-15 ***
X5TRUE        6.9945     1.2565   5.567 5.25e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.678 on 343 degrees of freedom
Multiple R-squared: 0.4543, Adjusted R-squared: 0.4463
F-statistic: 57.1 on 5 and 343 DF, p-value: < 2.2e-16

SLIDE 14

Interpretation

$\beta_0 = E[Y_i \mid \text{reference level}]$, i.e. the expected response for the reference level.
Note: the only way $X_{i,1} = \cdots = X_{i,p} = 0$ is if all indicators are zero, i.e. at the reference level.

$\beta_p$, $p > 0$: the expected change in the response when moving from the reference level to the level associated with the $p$th dummy variable.
Note: the only way for $X_{i,p}$ to increase by one is if initially $X_{i,1} = \cdots = X_{i,p} = 0$ and now $X_{i,p} = 1$.

For example,

The expected lifetime for mice on the N/N85 diet is 32.7 (31.0, 34.4) months.
The expected increase in lifetime for mice on the N/R40 diet compared to the N/N85 diet is 12.4 (10.0, 14.9) months.
The model explains 45% of the variability in mice lifetimes.

SLIDE 15

Using a categorical variable as an explanatory variable.

[Figure: lifetime (months) by diet (N/N85, N/R40, N/R50, NP, R/R50, lopro), annotated with β0 and β1 through β5]

SLIDE 16

Equivalence to multiple group model

Recall that we had a multiple group model

$Y_{ij} \stackrel{ind}{\sim} N(\mu_j, \sigma^2)$

for groups $j = 0, 1, 2, \ldots, 5$. Our regression model is a reparameterization of the multiple group model:

N/N85: $\mu_0 = \beta_0$
N/R40: $\mu_1 = \beta_0 + \beta_1$
N/R50: $\mu_2 = \beta_0 + \beta_2$
NP: $\mu_3 = \beta_0 + \beta_3$
R/R50: $\mu_4 = \beta_0 + \beta_4$
lopro: $\mu_5 = \beta_0 + \beta_5$

assuming the groups are labeled appropriately.
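One way to see the reparameterization in R, sketched with simulated data for three hypothetical groups: removing the intercept (0 + g) makes lm() estimate one mean per group, and those means can be recovered from the reference-level coefficients.

```r
# Hypothetical simulated data with three groups (not the mice data)
set.seed(4)
g <- factor(rep(c("g0", "g1", "g2"), each = 8))
y <- rnorm(24, mean = c(10, 12, 15)[as.integer(g)], sd = 1)

m_ref  <- lm(y ~ g)      # beta parameterization, reference level g0
m_cell <- lm(y ~ 0 + g)  # mu parameterization, one mean per group

beta <- unname(coef(m_ref))
mu   <- unname(coef(m_cell))

# mu_0 = beta_0 and mu_j = beta_0 + beta_j for j > 0
cbind(mu = mu, from_beta = c(beta[1], beta[1] + beta[-1]))
```

Both fits produce identical fitted values and residuals; only the meaning of the coefficients differs. R-squared as reported by summary() is computed differently for models without an intercept, so that value should not be compared across the two fits.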

SLIDE 17

Categorical explanatory variables Summary

Summary

  • 1. Choose one of the levels as the reference level.
  • 2. Construct dummy variables using indicator functions for all other levels, e.g. $X_i = \mathrm{I}(\text{observation } i \text{ is <some non-reference level>})$.

  • 3. Estimate the parameters of a multiple regression model using these dummy variables.