Outline Multicollinearity Model Selection
STAT 213 Multicollinearity and Model Selection
Colin Reimer Dawson
Oberlin College
7 April 2016
plot(Scores)
[Figure: scatterplot matrix of the Scores variables (Midterm, Final, Quiz)]

cor(Scores)
          Midterm     Final      Quiz
Midterm 1.0000000 0.7334905 0.9745957
Final   0.7334905 1.0000000 0.7397381
Quiz    0.9745957 0.7397381 1.0000000

summary(m.midterm <- lm(Final ~ Midterm, data = Scores))

Call:
lm(formula = Final ~ Midterm, data = Scores)

Residuals:
    Min      1Q  Median      3Q     Max
                         3.3716 15.0110

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.68490    5.57328   3.891 0.000182 ***
Midterm      0.72769    0.06812  10.683  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.474 on 98 degrees of freedom
Multiple R-squared:  0.538,  Adjusted R-squared:  0.5333
F-statistic: 114.1 on 1 and 98 DF,  p-value: < 2.2e-16

summary(m.quiz <- lm(Final ~ Quiz, data = Scores))

Call:
lm(formula = Final ~ Quiz, data = Scores)

Residuals:
    Min      1Q  Median      3Q     Max
                 0.0806  3.3445 13.9445

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.8043     5.4604   3.993 0.000126 ***
Quiz          2.9149     0.2678  10.883  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.419 on 98 degrees of freedom
Multiple R-squared:  0.5472,  Adjusted R-squared:  0.5426
F-statistic: 118.4 on 1 and 98 DF,  p-value: < 2.2e-16

summary(m.both <- lm(Final ~ Midterm + Quiz, data = Scores))

Call:
lm(formula = Final ~ Midterm + Quiz, data = Scores)

Residuals:
    Min      1Q  Median      3Q     Max
                 0.0513  3.1453 14.1414

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  21.0855     5.5388   3.807 0.000247 ***
Midterm       0.2481     0.3016   0.823 0.412717
Quiz          1.9545     1.1979   1.632 0.105993
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.428 on 97 degrees of freedom
Multiple R-squared:  0.5503,  Adjusted R-squared:  0.5411
F-statistic: 59.36 on 2 and 97 DF,  p-value: < 2.2e-16
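
The pattern in the three fits above (each predictor clearly significant alone, neither significant together, overall F still tiny) can be reproduced with simulated data. The sketch below is only an illustration: the variables x1, x2, y and the helper se() are hypothetical stand-ins, not the Scores data, and the specific numbers depend on the seed.

```r
# Hypothetical simulated data (NOT the Scores data set from the slides).
set.seed(213)
n  <- 100
x1 <- rnorm(n, mean = 80, sd = 8)        # stand-in for a midterm-like score
x2 <- 0.25 * x1 + rnorm(n, sd = 0.45)    # nearly a linear function of x1
y  <- 20 + 0.7 * x1 + rnorm(n, sd = 5)   # response driven by the shared signal

# Hypothetical helper: pull one coefficient's standard error from a fit.
se <- function(model, term) coef(summary(model))[term, "Std. Error"]

m.alone <- lm(y ~ x1)        # x1 by itself
m.joint <- lm(y ~ x1 + x2)   # x1 together with its near-copy x2

r.x1x2   <- cor(x1, x2)
se.alone <- se(m.alone, "x1")
se.joint <- se(m.joint, "x1")
se.ratio <- se.joint / se.alone   # inflation of the standard error of x1

print(c(cor = r.x1x2, se.alone = se.alone, se.joint = se.joint, ratio = se.ratio))
```

Adding the near-duplicate predictor barely changes the fit, but the standard error on x1 grows several-fold, which is what pushes the individual t tests toward non-significance.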

confint(m.midterm)
                 2.5 %     97.5 %
(Intercept) 10.6249111 32.7448870
Midterm      0.5925106  0.8628613

confint(m.quiz)
                2.5 %    97.5 %
(Intercept) 10.968290 32.640322
Quiz         2.383376  3.446427

confint(m.both)
                 2.5 %     97.5 %
(Intercept) 10.0924950 32.0784591
Midterm                 0.8466639
Quiz                    4.3319161
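
library(car) ## provides confidenceEllipse(); may need to install.packages()
confidenceEllipse(m.both)
[Figure: joint confidence ellipse for the two slopes; x-axis: Midterm coefficient (about -0.5 to 1.0), y-axis: Quiz coefficient (about -1 to 5)]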

library(dplyr) ## for %>% and mutate(); may need to install.packages()
dplyr::select(Scores, Midterm, Quiz) %>% cov() %>% eigen()
$values
[1] 69.161619  0.195581

$vectors
           [,1]       [,2]
[1,] -0.9710244  0.2389805
[2,] -0.2389805 -0.9710244

## rotate onto the eigenvector directions (signs flipped, which does not change the axes)
Scores.augmented <- mutate(Scores,
    V1 = 0.9710244 * Midterm + 0.2389805 * Quiz,
    V2 = 0.2389805 * Midterm - 0.9710244 * Quiz)
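
The rotation above can also be done without copying eigenvector entries by hand. The following is a minimal base-R sketch on simulated data (x1, x2, and the matrix names are hypothetical, not from the slides): multiplying the predictor matrix by the eigenvector matrix of its covariance gives new coordinates that are uncorrelated, with variances equal to the eigenvalues.

```r
# Hypothetical simulated predictors (NOT the Scores data set).
set.seed(213)
n  <- 100
x1 <- rnorm(n, mean = 80, sd = 8)
x2 <- 0.25 * x1 + rnorm(n, sd = 0.45)
X  <- cbind(x1 = x1, x2 = x2)

e <- eigen(cov(X))        # eigenvalues and orthonormal eigenvectors of the covariance
V <- X %*% e$vectors      # coordinates of each row along the eigenvector directions
colnames(V) <- c("V1", "V2")

rot.cor  <- cor(V[, "V1"], V[, "V2"])   # essentially zero
rot.vars <- apply(V, 2, var)            # should match e$values

print(rot.cor)
print(rbind(variance = rot.vars, eigenvalue = e$values))
```

Using the full-precision eigenvectors makes the correlation zero up to floating-point error; typing in coefficients rounded to seven digits, as on the slide, leaves a tiny residual correlation instead.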

plot(Scores.augmented)
[Figure: scatterplot matrix of Midterm, Final, Quiz, V1, and V2]

cor(Scores.augmented)
           Midterm      Final       Quiz            V1            V2
Midterm 1.00000000  0.7334905  0.9745957  9.999144e-01  1.308627e-02
Final   0.73349045  1.0000000  0.7397381  7.348815e-01 -1.014838e-01
Quiz    0.97459573  0.7397381  1.0000000  9.774433e-01 -2.111984e-01
V1      0.99991437  0.7348815  0.9774433  1.000000e+00 -3.036446e-07
V2      0.01308627 -0.1014838 -0.2111984 -3.036446e-07  1.000000e+00

summary(m.rotated <- lm(Final ~ V1 + V2, data = Scores.augmented))

Call:
lm(formula = Final ~ V1 + V2, data = Scores.augmented)

Residuals:
    Min      1Q  Median      3Q     Max
                 0.0513  3.1453 14.1414

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.08548    5.53880   3.807 0.000247 ***
V1           0.70800    0.06559  10.794  < 2e-16 ***
V2                      1.23350
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.428 on 97 degrees of freedom
Multiple R-squared:  0.5503,  Adjusted R-squared:  0.5411
F-statistic: 59.36 on 2 and 97 DF,  p-value: < 2.2e-16

confidenceEllipse(m.rotated)
[Figure: joint confidence ellipse for the rotated model; x-axis: V1 coefficient (about 0.55 to 0.85), y-axis: V2 coefficient (about -5 to 1)]
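
Rule of thumb: multicollinearity is a concern when $\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2} > 5$, where $R_j^2$ is the $R^2$ from regressing predictor $j$ on the other predictors.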

m.midterm <- lm(Midterm ~ Quiz, data = Scores)
summary(m.midterm)$r.squared
[1] 0.9498368
m.quiz <- lm(Quiz ~ Midterm, data = Scores)
summary(m.quiz)$r.squared
[1] 0.9498368
vif(m.both)
 Midterm     Quiz
19.93495 19.93495
vif(m.rotated)
V1 V2
 1  1
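
The vif() values above follow directly from the R-squared of each predictor regressed on the others. The sketch below first plugs in the R-squared reported on this slide, then applies the same formula to simulated data; vif.by.hand() and the data frame sim are hypothetical names introduced only for illustration, and the cross-check uses the standard identity that VIFs are the diagonal of the inverse correlation matrix of the predictors.

```r
# Worked arithmetic with the R-squared reported above for Midterm ~ Quiz.
r2        <- 0.9498368
vif.slide <- 1 / (1 - r2)   # about 19.93, in line with vif(m.both)
print(vif.slide)

# Hypothetical helper: VIF of every column of a data frame of predictors,
# computed by regressing each column on all of the others.
vif.by.hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit    <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Hypothetical simulated predictors (NOT the Scores data set).
set.seed(213)
sim    <- data.frame(x1 = rnorm(100, mean = 80, sd = 8))
sim$x2 <- 0.25 * sim$x1 + rnorm(100, sd = 0.45)   # nearly collinear with x1
sim$x3 <- rnorm(100)                              # unrelated predictor

vifs      <- vif.by.hand(sim)
vif.check <- diag(solve(cor(sim)))   # same quantities via the inverse correlation matrix
print(rbind(by.hand = vifs, inverse.cor = vif.check))
```

In the simulated data the two near-duplicate predictors get large VIFs while the unrelated one stays close to 1, mirroring the contrast between vif(m.both) and vif(m.rotated).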
ε: always prefers more complex
ε from
library(Stat2Data); data("MLB2007Standings")
library(leaps) ## may need to install.packages()
subsets <- regsubsets(WinPct ~ HR + BattingAvg + OBP + SLG + ERA + Walks + StrikeOuts,
                      data = MLB2007Standings)
plot(subsets, scale = "adjr2")
[Figure: regsubsets plot; rows are the best model of each size ordered by adjr2 (0.41, 0.73, 0.78, 0.79, 0.8, 0.8, 0.8); columns: (Intercept), HR, BattingAvg, OBP, SLG, ERA, Walks, StrikeOuts]

plot(subsets, scale = "Cp")
[Figure: regsubsets plot; rows are the best model of each size ordered by Cp (51, 9.5, 8, 6.1, 4.3, 3, 2); columns: (Intercept), HR, BattingAvg, OBP, SLG, ERA, Walks, StrikeOuts]

library(HH) ## may need to install
summaryHH(subsets)
            model p   rsq    rss adjr2    cp    bic stderr
1               E 2 0.426 0.0545 0.406 51.30
2             O-E 3 0.751 0.0236 0.733  9.54 -31.50 0.0296
3           H-B-E 4 0.822 0.0169 0.802  1.96 -38.20 0.0255
4         H-B-E-W 5 0.829 0.0162 0.802  3.03 -35.98 0.0255
5      H-B-E-W-SO 6 0.834 0.0157 0.800  4.32 -33.52 0.0256
6   H-B-SL-E-W-SO 7 0.836 0.0156 0.793  6.14 -30.36 0.0260
7 H-B-O-SL-E-W-SO 8 0.837 0.0155 0.785  8.00 -27.15 0.0265

Model variables with abbreviations
                                                      model
E                                                       ERA
O-E                                                 OBP-ERA
H-B-E                                     HR-BattingAvg-ERA
H-B-E-W                             HR-BattingAvg-ERA-Walks
H-B-E-W-SO               HR-BattingAvg-ERA-Walks-StrikeOuts
H-B-SL-E-W-SO        HR-BattingAvg-SLG-ERA-Walks-StrikeOuts
H-B-O-SL-E-W-SO  HR-BattingAvg-OBP-SLG-ERA-Walks-StrikeOuts

model with largest adjr2
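
The adjr2 and cp columns above can be recomputed from the p, rsq, and rss columns using the usual formulas, adjusted R-squared = 1 - (1 - R²)(n - 1)/(n - p) and Mallows' Cp = RSS_p / σ̂²_full + 2p - n. The sketch below does this arithmetic; n = 30 is an assumption (it is not printed in the table, but it agrees with the full model's stderr, since 0.0265² × 22 ≈ 0.0155), and because the inputs are rounded the results match the table only approximately.

```r
# Values copied from the summaryHH() table above (rounded as printed).
p   <- 2:8
rsq <- c(0.426, 0.751, 0.822, 0.829, 0.834, 0.836, 0.837)
rss <- c(0.0545, 0.0236, 0.0169, 0.0162, 0.0157, 0.0156, 0.0155)

# ASSUMPTION: n = 30 observations; inferred from the table, not stated in it.
n <- 30

# Adjusted R-squared penalizes R-squared by the number of estimated coefficients p.
adjr2 <- 1 - (1 - rsq) * (n - 1) / (n - p)

# Mallows' Cp uses the error-variance estimate from the full (p = 8) model.
sigma2.full <- rss[7] / (n - p[7])
cp          <- rss / sigma2.full + 2 * p - n

print(round(cbind(p, adjr2, cp), 3))
```

By construction Cp equals p for the full model, which is why the last row of the table shows cp = 8.00; a submodel with Cp well below its own p, such as the three-predictor model here, is one the criterion favors.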

full <- lm(WinPct ~ HR + BattingAvg + OBP + SLG + ERA + Walks + StrikeOuts,
           data = MLB2007Standings)
step(full, direction = "backward", scale = summary(full)$sigma^2)

none <- lm(WinPct ~ 1, data = MLB2007Standings) ## null model
step(none, scope = list(upper = full), scale = summary(full)$sigma^2)
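
For readers without the Stat2Data package, the same two calls can be tried on R's built-in mtcars data. This is only a sketch: the choice of response and predictors is arbitrary, and the object names (full.cars, null.cars, s2, back, fwd) are hypothetical. Supplying scale as the full model's error variance makes step() compare models by Mallows' Cp rather than AIC, and trace = 0 suppresses the step-by-step printout.

```r
# Built-in data set; the predictors here are an arbitrary illustrative choice.
full.cars <- lm(mpg ~ wt + hp + disp + qsec + drat, data = mtcars)
s2        <- summary(full.cars)$sigma^2   # error variance estimate from the full model

# Backward elimination starting from the full model.
back <- step(full.cars, direction = "backward", scale = s2, trace = 0)

# Forward selection starting from the intercept-only model.
null.cars <- lm(mpg ~ 1, data = mtcars)
fwd <- step(null.cars,
            scope = list(lower = ~ 1, upper = formula(full.cars)),
            direction = "forward", scale = s2, trace = 0)

print(formula(back))
print(formula(fwd))
```

The two searches need not end at the same model, since each only explores one-term moves from its own starting point; an exhaustive search such as regsubsets() is the way to check whether either stopped at the overall best subset.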