STAT 213 Indicator Variables in MLR, Colin Reimer Dawson, Oberlin College



SLIDE 1

Outline CHOOSE step FIT step Indicator Variables ASSESS: Nested F -tests

STAT 213 Indicator Variables in MLR

Colin Reimer Dawson

Oberlin College

February 28, 2018

SLIDE 2

Outline

  • CHOOSE step
  • FIT step
  • Indicator Variables
  • ASSESS: Nested F-tests

SLIDE 3

The Four-Step Process: Multiple Regression

  • 1. CHOOSE a form of the model
      • Select predictors
      • Choose any transformations of predictors
  • 2. FIT: Estimate
      • coefficients: $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_K$
      • residual variance: $\hat\sigma^2_\varepsilon$
  • 3. ASSESS the fit
      • Examine residuals (may need to return to step 1)
      • Test individual predictors (t-tests)
      • Test/measure overall fit (ANOVA, $R^2$)
      • Model comparison/selection
  • 4. USE the model
      • Make predictions
      • Construct CIs and PIs


SLIDE 5

CHOOSE: Active Pulse Rate

    library(Stat2Data); data(Pulse)
    head(Pulse, n = 3)

      Active Rest Smoke Sex Exercise Hgt Wgt
    1 97 78 1 1 63 119
    2 82 68 1 3 70 225
    3 88 62 3 72 175

$\text{Active}_i = \beta_0 + \beta_1 \cdot \text{Rest}_i + \beta_2 \cdot \text{Hgt}_i + \beta_3 \cdot \text{Wgt}_i + \varepsilon_i$


SLIDE 7

FIT: Estimate Coefficients

The Multiple Regression Population Model

$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + \varepsilon_i$

The Multiple Regression Fitted Model

$Y_i = \hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_K X_{iK} + \hat\varepsilon_i$

How to choose the $\hat\beta_k$s? Minimize SSE! (Requires linear algebra / vector calculus)
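To make "minimize SSE" concrete, here is a minimal sketch in base R. It assumes synthetic stand-in data (the variables `x1`, `x2`, and `y` are invented for illustration, not the Pulse data) and solves the normal equations $(X^\top X)\hat\beta = X^\top Y$ directly, then compares the result with `lm()`.

```r
# Synthetic stand-in data (NOT the Pulse data): two predictors plus noise
set.seed(213)
n  <- 50
x1 <- rnorm(n, mean = 70, sd = 10)
x2 <- rnorm(n, mean = 150, sd = 25)
y  <- 10 + 1.1 * x1 + 0.1 * x2 + rnorm(n, sd = 5)

# Design matrix: a column of 1s for the intercept, then the predictors
X <- cbind("(Intercept)" = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) b = X'y; the solution minimizes SSE
betaHat <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() should agree up to numerical error
fit <- lm(y ~ x1 + x2)
print(round(cbind(normalEq = betaHat, lm = coef(fit)), 4))
```

In practice `lm()` is the tool to use; the explicit solve is only meant to show where the estimates come from.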

SLIDE 8

FIT: Estimate Coefficients

    library(dplyr)  # provides the %>% pipe
    pulseModel <- lm(Active ~ Rest + Hgt + Wgt, data = Pulse)
    coef(pulseModel) %>% round(digits = 2)

    (Intercept)        Rest         Hgt         Wgt
          57.26        1.13       -0.88        0.11

$\text{Active}_i = 57.26 + 1.13 \cdot \text{Rest}_i - 0.88 \cdot \text{Hgt}_i + 0.11 \cdot \text{Wgt}_i + \varepsilon_i$

SLIDE 9

FIT: Estimate Residual Variance

Recall Variance Decomposition for Regression:

$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2$

SSTotal = SSModel + SSError

Recall ANOVA Table:

$MSModel = SSModel / df_{Model}$
$MSError = SSError / df_{Error}$

where MSError represents $\hat\sigma^2_\varepsilon$. So... what are $df_{Model}$ and $df_{Error}$?

SLIDE 10

Regression Degrees of Freedom

$df_{Model} = K$, where K is the number of predictors.
This is the number of extra “free parameters” (compared to the null model).

$df_{Error} = N - K - 1$, where N is the sample size.
This is the number of “pieces of information” we have about the sizes of the residuals. (Can fit any K + 1 points exactly with K + 1 coefficients including the intercept.)
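As a worked instance for the three-predictor Pulse model (Rest, Hgt, Wgt): the nested F-test output on the final slide reports Res.Df = 230 for the one-predictor model Active ~ Rest, which implies the sample size below. This assumes the same rows are used in both fits, which should hold because PulseWithBMI only adds columns to Pulse.

```latex
\begin{align*}
N &= 230 + 1 + 1 = 232 \\
df_{Model} &= K = 3 \\
df_{Error} &= N - K - 1 = 232 - 3 - 1 = 228
\end{align*}
```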

SLIDE 11

FIT: Estimate Residual Variance

$\hat\sigma^2_\varepsilon = MSError = \frac{SSError}{df_{Error}} = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - K - 1}$
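The formula can be checked by hand against R's built-in `sigma()`. The sketch below uses synthetic stand-in data (the names `rest`, `hgt`, and `active` are invented for illustration and are not the Pulse columns).

```r
# Synthetic stand-in data (NOT the Pulse data)
set.seed(213)
n      <- 40
rest   <- rnorm(n, mean = 68, sd = 9)
hgt    <- rnorm(n, mean = 68, sd = 4)
active <- 15 + 1.1 * rest + 0.1 * hgt + rnorm(n, sd = 12)
fit    <- lm(active ~ rest + hgt)

N <- length(active)            # sample size
K <- length(coef(fit)) - 1     # number of predictors (exclude the intercept)

SSError   <- sum(residuals(fit)^2)
dfError   <- N - K - 1
sigma2Hat <- SSError / dfError # MSError, the estimated residual variance

# sigma() reports the residual standard deviation, i.e. sqrt(MSError)
print(c(byHand = sqrt(sigma2Hat), sigma = sigma(fit)))
```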

SLIDE 12

FIT: Estimate Residual Variance

    ## Coefficients w/ standard errors and t-tests
    summary(pulseModel) %>% coef() %>% round(digits = 2)

                Estimate Std. Error t value Pr(>|t|)
    (Intercept)    57.26      25.01    2.29     0.02
    Rest            1.13       0.10   11.09     0.00
    Hgt            -0.88       0.41   -2.17     0.03
    Wgt             0.11       0.05    2.31     0.02

    ## The estimated standard deviation of the residuals
    sigma(pulseModel) %>% round(digits = 2)

    [1] 14.91

SLIDE 13

FIT: The Final Model

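$\text{Active}_i = 57.26 + 1.13 \cdot \text{Rest}_i - 0.88 \cdot \text{Hgt}_i + 0.11 \cdot \text{Wgt}_i + \varepsilon_i$

where $\varepsilon_i \sim N(0, 14.91)$ (here 14.91 is the estimated residual standard deviation $\hat\sigma_\varepsilon$, not the variance)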
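As an illustration of using the fitted equation, take a hypothetical person (values invented for this example) with Rest = 70, Hgt = 68, and Wgt = 150. Using the rounded coefficients, the predicted active pulse rate is roughly:

```latex
\begin{align*}
\widehat{\text{Active}} &= 57.26 + 1.13 \cdot 70 - 0.88 \cdot 68 + 0.11 \cdot 150 \\
                        &= 57.26 + 79.10 - 59.84 + 16.50 \\
                        &= 93.02
\end{align*}
```

Because the coefficients are rounded to two digits, `predict(pulseModel, ...)` would likely give a slightly different value.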

SLIDE 14

Next

  • Binary Predictors and Indicator Variables
  • ASSESSing MLR models



SLIDE 16

Pulse Rates Revisited

    library(Stat2Data); data(Pulse)
    library(dplyr)  # provides mutate() and %>%
    PulseWithBMI <- mutate(
      Pulse,
      BMI = Wgt / Hgt^2 * 703,
      InvActive = 1 / Active,
      InvRest = 1 / Rest,
      Male = 1 - Sex)

SLIDE 17

Active Pulse Rate by Sex

    ### Male = 1 for males, 0 for others
    ### factor() tells R this represents categories
    pulseBySex <- lm(Active ~ factor(Male), data = PulseWithBMI)
    coef(pulseBySex) %>% round(digits = 2)

      (Intercept) factor(Male)1
            94.82         -6.70

What is the model here? What does the coefficient for Male mean?
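One way to explore these questions: with a single 0/1 indicator as the only predictor, the least-squares intercept is the mean of the group coded 0, and the indicator's coefficient is the difference between the two group means. The sketch below checks this on synthetic stand-in data (`male` and `active` are invented, not the Pulse columns).

```r
# Synthetic stand-in data (NOT the Pulse data): a 0/1 indicator and a response
set.seed(213)
male   <- rep(c(0, 1), times = c(30, 25))
active <- 95 - 7 * male + rnorm(length(male), sd = 15)

fit <- lm(active ~ factor(male))

meanOthers <- mean(active[male == 0])  # mean of the reference group (coded 0)
meanMale   <- mean(active[male == 1])  # mean of the group coded 1

print(coef(fit))
print(c(meanOthers = meanOthers, difference = meanMale - meanOthers))
```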

SLIDE 18

    summary(pulseBySex) %>% coef() %>% round(digits = 2)

                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)      94.82       1.77   53.58     0.00
    factor(Male)1    -6.70       2.44   -2.74     0.01

What does the t-test tell us?
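A hint toward the answer: in a model with only an indicator, the t-test for its coefficient coincides with the pooled (equal-variance) two-sample t-test comparing the group means. The sketch below demonstrates this on synthetic stand-in data (names invented, not the Pulse data).

```r
# Synthetic stand-in data (NOT the Pulse data)
set.seed(213)
male   <- rep(c(0, 1), times = c(30, 25))
active <- 95 - 7 * male + rnorm(length(male), sd = 15)

# t-test for the indicator coefficient from the regression
fit      <- lm(active ~ factor(male))
slopeRow <- coef(summary(fit))["factor(male)1", ]

# Pooled two-sample t-test (group coded 1 minus group coded 0)
pooled <- t.test(active[male == 1], active[male == 0], var.equal = TRUE)

print(round(slopeRow, 4))
print(round(c(t = unname(pooled$statistic), p = pooled$p.value), 4))
```

Note that `var.equal = TRUE` matters here; R's default Welch test does not assume equal variances and generally gives a slightly different statistic.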

SLIDE 19

Pair Discussion

(3 min.) An environmental expert is interested in modeling the concentration of various chemicals in well water. Write down a regression model in which the amount of lead (Lead) depends on whether the well has been cleaned (Iclean, a 0/1 variable).

(5 min.) Can you write down a single regression model that you could use to predict the amount of lead (Lead) in a well based on Year and on whether the well has been cleaned? How do you interpret each coefficient?

SLIDE 20

Combining Quantitative and Indicator Variables

    pulseBySexAndRest <- lm(Active ~ Rest + factor(Male), data = PulseWithBMI)
    pulseBySexAndRest %>% coef() %>% round(2)

      (Intercept)          Rest factor(Male)1
            16.47          1.12         -2.99

$\widehat{\text{Active}} = 16.47 + 1.12 \cdot \text{Rest} - 2.99 \cdot \text{Male}$

Now what does the Male coefficient tell us?

SLIDE 21

    ## plotModel() comes from the mosaic package
    ## CAUTION: don't try to use this with multiple quantitative
    ## predictors; it won't make sense
    plotModel(pulseBySexAndRest) +
      scale_color_discrete(
        name = "Sex",
        labels = c("0" = "Others", "1" = "Male"))

[Figure: scatterplot of Active (y-axis) against Rest (x-axis), colored by Sex (Others, Male), with the fitted lines from pulseBySexAndRest]

SLIDE 22

One Model, Two Prediction Equations

$\widehat{\text{Active}} = 16.47 + 1.12 \cdot \text{Rest} - 2.99 \cdot \text{Male}$

Females:

$\widehat{\text{Active}} = 16.47 + 1.12 \cdot \text{Rest}$

Males:

$\widehat{\text{Active}} = (16.47 - 2.99) + 1.12 \cdot \text{Rest}$

The t-test for the Male coefficient tests whether the intercepts are different.
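The "two prediction equations" idea can be checked numerically: in a parallel-lines model, the gap between the two groups' predictions is the same at every value of the quantitative predictor and equals the indicator's coefficient. A minimal sketch on synthetic stand-in data (all names invented, not the Pulse data):

```r
# Synthetic stand-in data (NOT the Pulse data)
set.seed(213)
n      <- 60
rest   <- rnorm(n, mean = 68, sd = 9)
male   <- rbinom(n, size = 1, prob = 0.5)
active <- 16 + 1.1 * rest - 3 * male + rnorm(n, sd = 12)
d      <- data.frame(active, rest, male)

fit <- lm(active ~ rest + factor(male), data = d)

# Predict for both groups at a few resting pulse rates
restGrid <- c(50, 70, 90)
newData  <- expand.grid(rest = restGrid, male = c(0, 1))
pred     <- predict(fit, newdata = newData)

# Vertical gap between the two lines at each value of rest
gap <- unname(pred[newData$male == 1] - pred[newData$male == 0])
print(gap)
print(coef(fit)["factor(male)1"])
```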

SLIDE 23

    summary(pulseBySexAndRest) %>% coef() %>% round(digits = 2)

                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)      16.47       7.19    2.29     0.02
    Rest              1.12       0.10   11.12     0.00
    factor(Male)1    -2.99       2.00   -1.50     0.14

SLIDE 24

Non-Parallel Lines

    twoLinesModel <- lm(Active ~ Rest + factor(Male) + Rest:factor(Male),
                        data = PulseWithBMI)
    coef(twoLinesModel) %>% round(digits = 2)

           (Intercept)               Rest      factor(Male)1
                 11.98               1.18               6.82
    Rest:factor(Male)1
                 -0.14

$\widehat{\text{Active}} = 11.98 + 1.18 \cdot \text{Rest} + 6.82 \cdot \text{Male} - 0.14 \cdot \text{Rest} \cdot \text{Male}$

Now what does the Male coefficient tell us? The last coefficient?

SLIDE 25

    ## CAUTION: don't try to use this with multiple quantitative
    ## predictors; it won't make sense
    plotModel(twoLinesModel) +
      scale_color_discrete(
        name = "Sex",
        labels = c("0" = "Others", "1" = "Male"))

[Figure: scatterplot of Active (y-axis) against Rest (x-axis), colored by Sex (Others, Male), with the fitted lines from twoLinesModel]

SLIDE 26

Non-Parallel Lines

  • The Male coefficient is the difference in intercepts
  • The interaction term is the difference in slopes

$\widehat{\text{Active}} = 11.98 + 1.18 \cdot \text{Rest} + 6.82 \cdot \text{Male} - 0.14 \cdot \text{Rest} \cdot \text{Male}$

Females:

$\widehat{\text{Active}} = 11.98 + 1.18 \cdot \text{Rest}$

Males:

$\widehat{\text{Active}} = (11.98 + 6.82) + (1.18 - 0.14) \cdot \text{Rest}$

The t-test for the Male · Rest coefficient tests whether the slopes are different.
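One consequence worth seeing directly: a model with an indicator, a quantitative predictor, and their interaction reproduces the lines you would get by fitting each group separately. The sketch below checks this on synthetic stand-in data (all names invented, not the Pulse data).

```r
# Synthetic stand-in data (NOT the Pulse data)
set.seed(213)
n      <- 60
rest   <- rnorm(n, mean = 68, sd = 9)
male   <- rbinom(n, size = 1, prob = 0.5)
active <- 12 + 1.2 * rest + 7 * male - 0.15 * rest * male + rnorm(n, sd = 12)
d      <- data.frame(active, rest, male)

# One model with different intercepts and slopes
fitBoth <- lm(active ~ rest + factor(male) + rest:factor(male), data = d)
b <- coef(fitBoth)

# Line for the group coded 1, assembled from the single model's coefficients
maleLine <- c(intercept = b[["(Intercept)"]] + b[["factor(male)1"]],
              slope     = b[["rest"]] + b[["rest:factor(male)1"]])

# Separate simple regressions within each group
fitOthers <- lm(active ~ rest, data = subset(d, male == 0))
fitMale   <- lm(active ~ rest, data = subset(d, male == 1))

print(rbind(fromInteractionModel = maleLine, separateFit = coef(fitMale)))
```

The fitted lines match, but the single model pools both groups to estimate one residual variance, so standard errors and tests can differ from the separate fits.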

SLIDE 27

    summary(twoLinesModel) %>% coef() %>% round(digits = 2)

                       Estimate Std. Error t value Pr(>|t|)
    (Intercept)           11.98       9.58    1.25     0.21
    Rest                   1.18       0.14    8.74     0.00
    factor(Male)1          6.82      13.96    0.49     0.63
    Rest:factor(Male)1    -0.14       0.20   -0.71     0.48

SLIDE 28

Caution

  • If the slopes are allowed to differ, the test for different intercepts only tells us about the difference between the groups when x = 0
  • Non-parallel lines might meet when x = 0 but be far apart at other x values!

SLIDE 29

Centering a Predictor

    PulseWithBMI <- mutate(PulseWithBMI, RestCentered = Rest - mean(Rest))
    twoLinesModel <- lm(Active ~ RestCentered + factor(Male) +
                          RestCentered:factor(Male),
                        data = PulseWithBMI)
    coef(twoLinesModel) %>% round(digits = 2)

                   (Intercept)               RestCentered
                         92.76                       1.18
                 factor(Male)1 RestCentered:factor(Male)1
                         -3.01                      -0.14

$\widehat{\text{Active}} = 92.76 + 1.18 \cdot \text{RestCentered} - 3.01 \cdot \text{Male} - 0.14 \cdot \text{RestCentered} \cdot \text{Male}$

Now what does the Male coefficient tell us?
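To see what centering changes and what it leaves alone, the sketch below fits the interaction model with a raw and a centered predictor on synthetic stand-in data (all names invented, not the Pulse data). The slopes are unchanged, while the indicator's coefficient becomes the gap between the two lines at the mean of the predictor rather than at zero.

```r
# Synthetic stand-in data (NOT the Pulse data)
set.seed(213)
n      <- 60
rest   <- rnorm(n, mean = 68, sd = 9)
male   <- rbinom(n, size = 1, prob = 0.5)
active <- 12 + 1.2 * rest + 7 * male - 0.15 * rest * male + rnorm(n, sd = 12)
d      <- data.frame(active, rest, male)
d$restCentered <- d$rest - mean(d$rest)

raw <- lm(active ~ rest * factor(male), data = d)
cen <- lm(active ~ restCentered * factor(male), data = d)
bRaw <- coef(raw)
bCen <- coef(cen)

# Gap between the two raw-scale lines, evaluated at the mean of rest
gapAtMean <- bRaw[["factor(male)1"]] +
  bRaw[["rest:factor(male)1"]] * mean(d$rest)

print(c(centeredMaleCoef = bCen[["factor(male)1"]], gapAtMean = gapAtMean))
print(c(rawInteraction = bRaw[["rest:factor(male)1"]],
        centeredInteraction = bCen[["restCentered:factor(male)1"]]))
```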

SLIDE 30

    plotModel(twoLinesModel) +
      scale_color_discrete(
        name = "Sex",
        labels = c("0" = "Others", "1" = "Male"))

[Figure: scatterplot of Active (y-axis) against RestCentered (x-axis), colored by Sex (Others, Male), with the fitted lines from twoLinesModel]

SLIDE 31

Pair Discussion Revisited

Can you write down a single regression model that you could use to predict the amount of lead (Lead) in a well based on Year, but where the trend line is different depending on whether or not the well has been cleaned (Iclean)? What coefficients do you need and what is their interpretation?


SLIDE 33

Testing multiple (but not all) predictors

We can test:

  • One term at a time (t-test)

    $H_0: \beta_k = 0$    $H_1: \beta_k \neq 0$

  • All terms at once (F-test)

    $H_0: \beta_1 = \beta_2 = \cdots = \beta_K = 0$    $H_1: \text{some } \beta_k \neq 0$

  • What if we want to test a subset of the βs together?

SLIDE 34

Nested Models

If Model B has all the terms in Model A and then some, we say that Model A is nested in Model B.

Model A: $\text{Active} = \beta_0 + \beta_1 \text{Rest}$
Model B: $\text{Active} = \beta_0 + \beta_1 \text{Rest} + \beta_2 \text{Male} + \beta_3 \text{Male} \cdot \text{Rest}$

Model A is nested in Model B.

SLIDE 35

Comparing Nested Models

  • Is there evidence that the additional predictors in Model B are helpful, controlling for the predictors in Model A?
  • Some of SSError for the simpler model moves to SSModel for the complex model.
  • Nested F-test: is this difference more than we would expect by chance?
  • $H_0: \beta_{K_A + 1} = \cdots = \beta_{K_B} = 0$

$F_{Comparison} = \frac{MS_{Comparison}}{MSE_{Full}} = \frac{\text{Increase in SSModel} / \text{Increase in } df_{Model}}{MSE_{Full}}$
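The comparison statistic can be computed by hand and checked against `anova()`. The sketch below uses synthetic stand-in data (all names invented, not the Pulse data) and relies on the fact that the increase in SSModel equals the decrease in SSError, since SSTotal is the same for both models.

```r
# Synthetic stand-in data (NOT the Pulse data)
set.seed(213)
n      <- 60
rest   <- rnorm(n, mean = 68, sd = 9)
male   <- rbinom(n, size = 1, prob = 0.5)
active <- 12 + 1.2 * rest + 7 * male - 0.15 * rest * male + rnorm(n, sd = 12)
d      <- data.frame(active, rest, male)

modelA <- lm(active ~ rest, data = d)                                    # reduced
modelB <- lm(active ~ rest + factor(male) + rest:factor(male), data = d) # full

sseA <- sum(residuals(modelA)^2)
sseB <- sum(residuals(modelB)^2)

dfExtra <- df.residual(modelA) - df.residual(modelB)  # increase in df for Model
mseFull <- sseB / df.residual(modelB)                 # MSE of the full model

fStat  <- ((sseA - sseB) / dfExtra) / mseFull
pValue <- pf(fStat, dfExtra, df.residual(modelB), lower.tail = FALSE)

tab <- anova(modelA, modelB)
print(c(byHandF = fStat, anovaF = tab$F[2]))
print(c(byHandP = pValue, anovaP = tab$`Pr(>F)`[2]))
```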

SLIDE 36

Nested F-test

    modelA <- lm(Active ~ Rest, data = PulseWithBMI)
    modelB <- lm(Active ~ Rest + factor(Male) + factor(Male):Rest,
                 data = PulseWithBMI)
    anova(modelA, modelB)

    Analysis of Variance Table

    Model 1: Active ~ Rest
    Model 2: Active ~ Rest + factor(Male) + factor(Male):Rest
      Res.Df   RSS Df Sum of Sq      F Pr(>F)
    1    230 51953
    2    228 51335  2    617.27 1.3708  0.256

Conclusion: Little evidence that males and non-males need a different model
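The F statistic in the table can be recovered from the other entries using the nested F-test formula. The increase in SSModel is the "Sum of Sq" column (617.27), the increase in df is 2, and MSE for the full model is its RSS divided by its residual df:

```latex
\begin{align*}
MS_{Comparison} &= \frac{617.27}{2} \approx 308.64 \\
MSE_{Full}      &= \frac{51335}{228} \approx 225.15 \\
F_{Comparison}  &= \frac{308.64}{225.15} \approx 1.37
\end{align*}
```

This agrees with the reported F = 1.3708 up to rounding of the displayed sums of squares.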