STAT 213: Multicollinearity and Model Selection (Colin Reimer Dawson)




SLIDE 1

Outline Multicollinearity Model Selection

STAT 213 Multicollinearity and Model Selection

Colin Reimer Dawson

Oberlin College

7 April 2016

SLIDE 2

Outline

  • Multicollinearity
  • Model Selection

SLIDE 3

Reflection Questions

  • Do ANOVA and MLR give the same equation if the same set of data is used?
  • Is the MSE in a nested F-test equal to SSE/(n − k − 1)?
  • When you see nonlinear data, how do you decide between transforming the data and adding terms (e.g., quadratic)?

SLIDE 4

Reading Quiz

Suppose we have six candidate predictor variables that we might use to build a multiple regression model. How many models will we need to consider in total to find the best two-predictor model according to forward selection?

SLIDE 5

For Tuesday

  • Read: Ch. 6.1
  • Write: Ex. 4.4, 4.6
  • Answer: Ex. 6.2, 6.8(a,b,d)
  • Soon: Project 2
SLIDE 6

Correlated Predictors

Worksheet

SLIDE 7

Correlated Variables

plot(Scores)

[Scatterplot matrix of the Midterm, Final, and Quiz scores]

SLIDE 8

Correlated Variables

cor(Scores)
          Midterm     Final      Quiz
Midterm 1.0000000 0.7334905 0.9745957
Final   0.7334905 1.0000000 0.7397381
Quiz    0.9745957 0.7397381 1.0000000

SLIDE 9

SLR Model: Midterm Only

summary(m.midterm <- lm(Final ~ Midterm, data = Scores))

Call:
lm(formula = Final ~ Midterm, data = Scores)

Residuals:
     Min       1Q   Median       3Q      Max
-15.0320  -2.7025  -0.1945   3.3716  15.0110

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.68490    5.57328   3.891 0.000182 ***
Midterm      0.72769    0.06812  10.683  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.474 on 98 degrees of freedom
Multiple R-squared:  0.538,  Adjusted R-squared:  0.5333
F-statistic: 114.1 on 1 and 98 DF,  p-value: < 2.2e-16

SLIDE 10

SLR Model: Quiz Only

summary(m.quiz <- lm(Final ~ Quiz, data = Scores))

Call:
lm(formula = Final ~ Quiz, data = Scores)

Residuals:
     Min       1Q   Median       3Q      Max
-14.0811  -2.8279   0.0806   3.3445  13.9445

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8043     5.4604   3.993 0.000126 ***
Quiz          2.9149     0.2678  10.883  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.419 on 98 degrees of freedom
Multiple R-squared:  0.5472,  Adjusted R-squared:  0.5426
F-statistic: 118.4 on 1 and 98 DF,  p-value: < 2.2e-16

SLIDE 11

MLR Model: Midterm and Quiz

summary(m.both <- lm(Final ~ Midterm + Quiz, data = Scores))

Call:
lm(formula = Final ~ Midterm + Quiz, data = Scores)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4826  -2.9728   0.0513   3.1453  14.1414

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.0855     5.5388   3.807 0.000247 ***
Midterm       0.2481     0.3016   0.823 0.412717
Quiz          1.9545     1.1979   1.632 0.105993
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.428 on 97 degrees of freedom
Multiple R-squared:  0.5503,  Adjusted R-squared:  0.5411
F-statistic: 59.36 on 2 and 97 DF,  p-value: < 2.2e-16

SLIDE 12

Confidence Intervals

confint(m.midterm)
                 2.5 %     97.5 %
(Intercept) 10.6249111 32.7448870
Midterm      0.5925106  0.8628613

confint(m.quiz)
                2.5 %    97.5 %
(Intercept) 10.968290 32.640322
Quiz         2.383376  3.446427

confint(m.both)
                 2.5 %     97.5 %
(Intercept) 10.0924950 32.0784591
Midterm     -0.3504585  0.8466639
Quiz        -0.4229139  4.3319161

SLIDE 13

Confidence Ellipse

confidenceEllipse(m.both)

[Joint confidence ellipse for the Midterm coefficient (horizontal axis) and the Quiz coefficient (vertical axis)]

SLIDE 14

Elliptical Axes

dplyr::select(Scores, Midterm, Quiz) %>% cov() %>% eigen()
$values
[1] 69.161619  0.195581

$vectors
           [,1]       [,2]
[1,] -0.9710244  0.2389805
[2,] -0.2389805 -0.9710244

Scores.augmented <- mutate(Scores,
  V1 = 0.9710244 * Midterm + 0.2389805 * Quiz,
  V2 = 0.2389805 * Midterm - 0.9710244 * Quiz)
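The same rotation can be sketched in Python with NumPy standing in for R's cov()/eigen(); the data below are simulated stand-ins for the Scores data, not the course data. Projecting onto the eigenvectors of the covariance matrix produces new predictors whose sample correlation is zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for two highly correlated predictors
midterm = rng.normal(75, 10, size=100)
quiz = 0.25 * midterm + rng.normal(0, 0.5, size=100)
X = np.column_stack([midterm, quiz])

# Eigendecomposition of the 2x2 sample covariance matrix
values, vectors = np.linalg.eigh(np.cov(X, rowvar=False))

# Project the data onto the eigenvectors: rotated predictors V1, V2
V = X @ vectors

# The rotated columns are uncorrelated (their covariance matrix is diagonal)
print(abs(np.corrcoef(V, rowvar=False)[0, 1]) < 1e-8)
```

This is exactly why the correlation between V1 and V2 on the later slide is essentially zero (−3.0e−07, floating-point noise).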

SLIDE 15

Elliptical Axes

plot(Scores.augmented)

[Scatterplot matrix of Midterm, Final, Quiz, V1, and V2]

SLIDE 16

Elliptical Axes

cor(Scores.augmented)
           Midterm      Final       Quiz            V1            V2
Midterm 1.00000000  0.7334905  0.9745957  9.999144e-01  1.308627e-02
Final   0.73349045  1.0000000  0.7397381  7.348815e-01 -1.014838e-01
Quiz    0.97459573  0.7397381  1.0000000  9.774433e-01 -2.111984e-01
V1      0.99991437  0.7348815  0.9774433  1.000000e+00 -3.036446e-07
V2      0.01308627 -0.1014838 -0.2111984 -3.036446e-07  1.000000e+00

SLIDE 17

Orthogonal Predictors

summary(m.rotated <- lm(Final ~ V1 + V2, data = Scores.augmented))

Call:
lm(formula = Final ~ V1 + V2, data = Scores.augmented)

Residuals:
     Min       1Q   Median       3Q      Max
-14.4826  -2.9728   0.0513   3.1453  14.1414

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.08548    5.53880   3.807 0.000247 ***
V1           0.70800    0.06559  10.794  < 2e-16 ***
V2          -1.83858    1.23350  -1.491 0.139327
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.428 on 97 degrees of freedom
Multiple R-squared:  0.5503,  Adjusted R-squared:  0.5411
F-statistic: 59.36 on 2 and 97 DF,  p-value: < 2.2e-16

SLIDE 18

Orthogonal Predictors

confidenceEllipse(m.rotated)

[Joint confidence ellipse for the V1 coefficient (horizontal axis) and the V2 coefficient (vertical axis)]

SLIDE 19

Multicollinearity

When one predictor is highly predictable from the other predictors, the model suffers from multicollinearity.

One measure: the R² from a model predicting Xj using X1, ..., Xj−1, Xj+1, ..., Xk.

Rough rule: if this R² is > 0.80, tests/intervals for individual coefficients may not be meaningful. Equivalently:

    VIF = 1 / (1 − R²) > 5
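As a quick numerical illustration of this rule, here is a small Python sketch (simulated data; the names x1 and x2 are made up for the example) that computes this R² by least squares and then the resulting VIF:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Simulated predictors: x2 is nearly determined by x1 (multicollinearity)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)

def r_squared(y, X):
    """R^2 from regressing y on the columns of X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# R^2 for predicting x1 from the remaining predictors (here, just x2)
r2 = r_squared(x1, x2.reshape(-1, 1))
vif = 1 / (1 - r2)
print(r2 > 0.80, vif > 5)  # both versions of the rule flag this pair
```

With only two predictors, this R² is just the squared correlation between them, which is why the Midterm/Quiz VIFs on the next slide match 1/(1 − 0.9498368).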

SLIDE 20

Variance Inflation Factor

m.midterm <- lm(Midterm ~ Quiz, data = Scores)
summary(m.midterm)$r.squared
[1] 0.9498368
m.quiz <- lm(Quiz ~ Midterm, data = Scores)
summary(m.quiz)$r.squared
[1] 0.9498368

vif(m.both)
 Midterm     Quiz
19.93495 19.93495

vif(m.rotated)
V1 V2
 1  1

SLIDE 21

Remedies for Multicollinearity

  • 1. Remove redundant predictors
  • 2. Combine predictors into a scale
  • 3. Use the multicollinear model anyway, but don't use tests/intervals for individual coefficients

SLIDE 22

Model Selection

Six predictor-selection methods:

  • 1. Domain knowledge (+ a few F-tests)
  • 2. Best subset
  • 3. Forward selection
  • 4. Backward elimination
  • 5. Stepwise selection
  • 6. Cross-validation
SLIDE 23

Criteria to "score" models

  • 1. High R² / low SSE / low σ̂²ε: always prefers more complex models
  • 2. Adjusted R²: balances fit and complexity
  • 3. Mallows' Cp / Akaike Information Criterion (AIC): estimates mean squared prediction error based on σ̂²ε from a "full" model

SLIDE 24

Mallows' Cp / AIC

Two measures that reduce to the same thing in the case of MLR with independent, equal-variance, Normal residuals. For a model with p coefficients (including the intercept), selected from a pool of predictors, fit using n observations:

    Cp = SSE_reduced / MSE_full + 2p − n    (1)
       = p + SSE_diff / MSE_full            (2)

Should we prefer larger or smaller values?
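Smaller values are better: an adequately-fitting model has Cp close to p. The arithmetic of formula (1) can be shown in a couple of lines of Python (the numbers below are made up for illustration):

```python
# Mallows' Cp per formula (1): Cp = SSE_reduced / MSE_full + 2p - n
def mallows_cp(sse_reduced, mse_full, p, n):
    return sse_reduced / mse_full + 2 * p - n

# Hypothetical values: n = 100 observations, full-model MSE = 4.0,
# and a reduced model with p = 3 coefficients and SSE = 420
print(mallows_cp(420.0, 4.0, p=3, n=100))  # 11.0
```

Since 11.0 is well above p = 3, this hypothetical reduced model would be flagged as leaving out something useful.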

SLIDE 25

Methods to Explore the Space of Combinations

  • 1. Best subset: consider all possible combinations (2^k)
  • 2. Forward selection: start with the null model and consider adding one predictor at a time
  • 3. Backward elimination: start with the full model and consider removing one predictor at a time
  • 4. Stepwise regression: alternate forward selection and backward elimination

Note: choose the best step based on adj-R² or Cp/AIC, not based on P-values.
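Forward selection, scored by adjusted R² as the note suggests, can be sketched in a few lines of Python; the data are simulated and the function names are made up for illustration:

```python
import numpy as np

def adj_r2(y, X):
    """Adjusted R^2 of an OLS fit of y on the columns of X (intercept included)."""
    n, k = len(y), X.shape[1]
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(y, X):
    """Start from the null model; greedily add the predictor that most
    improves adjusted R^2, stopping when no addition helps."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], -np.inf
    while remaining:
        scores = {j: adj_r2(y, X[:, chosen + [j]]) for j in remaining}
        j = max(scores, key=scores.get)
        if scores[j] <= best:
            break
        chosen.append(j)
        remaining.remove(j)
        best = scores[j]
    return chosen, best

# Simulated data: y truly depends on columns 0 and 2 only
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(size=200)

chosen, best = forward_select(y, X)
print(sorted(chosen[:2]))  # the true predictors: [0, 2]
```

Note that each step fits only k models (one per candidate addition), which is why forward selection is so much cheaper than best subset's 2^k fits.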
SLIDE 26

Example: Baseball Win %

library(Stat2Data); data("MLB2007Standings")
library(leaps)  ## may need to install.packages()
subsets <- regsubsets(WinPct ~ HR + BattingAvg + OBP + SLG + ERA +
                        Walks + StrikeOuts, data = MLB2007Standings)
plot(subsets, scale = "adjr2")

[regsubsets plot, scale = "adjr2": candidate predictors HR, BattingAvg, OBP, SLG, ERA, Walks, StrikeOuts; adj-R² values 0.41, 0.73, 0.78, 0.79, 0.80, 0.80, 0.80 for models of increasing size]

SLIDE 27

Example: Baseball Win %

plot(subsets, scale = "Cp")

[regsubsets plot, scale = "Cp": Cp values 51, 9.5, 8, 6.1, 4.3, 3, 2 for models of increasing size]

SLIDE 28

Example: Baseball Win %

library(HH)  ## may need to install
summaryHH(subsets)
            model p   rsq    rss adjr2    cp    bic stderr
1               E 2 0.426 0.0545 0.406 51.30  -9.86 0.0441
2             O-E 3 0.751 0.0236 0.733  9.54 -31.50 0.0296
3           H-B-E 4 0.822 0.0169 0.802  1.96 -38.20 0.0255
4         H-B-E-W 5 0.829 0.0162 0.802  3.03 -35.98 0.0255
5      H-B-E-W-SO 6 0.834 0.0157 0.800  4.32 -33.52 0.0256
6   H-B-SL-E-W-SO 7 0.836 0.0156 0.793  6.14 -30.36 0.0260
7 H-B-O-SL-E-W-SO 8 0.837 0.0155 0.785  8.00 -27.15 0.0265

Model variables with abbreviations:
  E                ERA
  O-E              OBP-ERA
  H-B-E            HR-BattingAvg-ERA
  H-B-E-W          HR-BattingAvg-ERA-Walks
  H-B-E-W-SO       HR-BattingAvg-ERA-Walks-StrikeOuts
  H-B-SL-E-W-SO    HR-BattingAvg-SLG-ERA-Walks-StrikeOuts
  H-B-O-SL-E-W-SO  HR-BattingAvg-OBP-SLG-ERA-Walks-StrikeOuts

model with largest adjr2

SLIDE 29

Backward elimination

full <- lm(WinPct ~ HR + BattingAvg + OBP + SLG + ERA +
             Walks + StrikeOuts, data = MLB2007Standings)
step(full, direction = "backward", scale = summary(full)$sigma^2)

Stepwise regression

none <- lm(WinPct ~ 1, data = MLB2007Standings)  ## null model
step(none, scope = list(upper = full), scale = summary(full)$sigma^2)