Gov 2000: 12. Troubleshooting the Linear Model
Matthew Blackwell
Fall 2016
1. Outliers, leverage points, and influential observations
2. Heteroskedasticity
3. Nonlinearity of the regression function
Where are we?

▶ OLS has good properties under the Gauss-Markov assumptions (and sometimes conditional Normality).
▶ What happens when these assumptions fail? How can we tell? Can we fix it?

The linear model: y_i = x_i'β + u_i, where (y_i, x_i) are an iid sample from the population, E[u_i | x_i] = 0, and V[u_i | x_i] = σ²_u.

Three issues today:

▶ Outliers and influential observations → possible inefficiency and bias
▶ Heteroskedasticity → SEs are biased (usually downward)
▶ Nonlinearity → biased/inconsistent estimates
Running example: election-day Buchanan votes across Florida counties in 2000 (flvote data).

[Figure: scatterplot of Buchanan votes against total votes by Florida county]

[Figure: the same scatterplot with county labels; one county sits far above the rest]

mod <- lm(edaybuchanan ~ edaytotal, data = flvote)
summary(mod)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   49.14146    1.10     0.27
## edaytotal    0.00232    0.00031    7.48  2.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333 on 65 degrees of freedom
## Multiple R-squared: 0.463, Adjusted R-squared: 0.455
## F-statistic: 56 on 1 and 65 DF,  p-value: 2.42e-10

▶ Outliers — observations far from the rest of the conditional distribution — can cause inefficiency and possibly bias
[Figure: regression fits with the full sample vs. without the leverage point]
Leverage and the hat matrix:

β̂ = (X'X)⁻¹X'y
û = y − Xβ̂ = y − X(X'X)⁻¹X'y ≡ y − Hy = (I − H)y
ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy

where H = X(X'X)⁻¹X' is the "hat matrix" (it puts the hat on y):

▶ H is an n × n symmetric matrix
▶ H is idempotent: HH = H

Each fitted value is a weighted sum of the outcomes,

ŷ_i = Σ_{j=1}^{n} h_ij y_j

with weights from the hat matrix. The diagonal entries h_i = h_ii are the hat values; for simple regression,

h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^{n} (x_j − x̄)²

▶ → h_i measures how far x_i is from the center of the x distribution
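The hat-matrix algebra above can be checked numerically. A minimal sketch in Python/NumPy (the slides use R; the data here are synthetic, not the Florida data):

```python
import numpy as np

# Build H = X(X'X)^{-1}X' for a simple regression and verify its properties.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
X = np.column_stack([np.ones(20), x])        # n x (k+1) design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix

# H is symmetric and idempotent (HH = H)
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)

# For simple regression, the hat values reduce to
# h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
n = len(x)
h_formula = 1/n + (x - x.mean())**2 / ((x - x.mean())**2).sum()
assert np.allclose(np.diag(H), h_formula)
print("hat matrix checks pass")
```

The assertions confirm both matrix properties and the closed-form hat values for the one-regressor case.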
head(hatvalues(mod), 5)
##       1       2       3       4       5
## 0.04179 0.02285 0.22066 0.01556 0.01493

[Figure: dotchart of hat values (roughly 0.05 to 0.25) by Florida county]
[Figure: regression fits with the full sample vs. without the outlier]
▶ Problem: the û_i are not identically distributed.
▶ Variance of the ith residual:

V[û_i | X] = σ²_u(1 − h_i)

▶ Standardized residuals rescale by this variance:

û′_i = û_i / (σ̂ √(1 − h_i))

▶ |û′_i| > 2 will be relatively rare.
▶ |û′_i| > 4–5 should definitely be checked.
std.resids <- rstandard(mod)

[Figure: index plot of the standardized residuals; Palm Beach stands out far above the rest]
▶ Leave-one-out approach: drop each observation and see how the regression behaves without it.
▶ Estimate the model omitting observation i:

β̂(−i) = (X'(−i)X(−i))⁻¹ X'(−i) y(−i)

▶ Out-of-sample prediction for the omitted observation: ỹ_i = x_i'β̂(−i)
▶ Leave-one-out residual: ũ_i = y_i − ỹ_i
▶ Handy identity linking it to the usual residual:

ũ_i = û_i / (1 − h_i)
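The leave-one-out identity can be verified by brute force — refit the model n times and compare. A Python/NumPy sketch on assumed synthetic data:

```python
import numpy as np

# Verify u~_i = u^_i / (1 - h_i) by actually dropping each observation.
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta                                        # usual residuals
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # hat values

# Refit without each observation and form the out-of-sample residual
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo[i] = y[i] - X[i] @ b_i

assert np.allclose(loo, resid / (1 - h))
print("LOO identity holds")
```

The identity is what makes leave-one-out diagnostics cheap: no refitting is actually required.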
[Figure: regression fits with the full sample vs. without the influence point]

▶ An influence point is an observation that is both an outlier and a leverage point.
▶ Influence can be measured by the difference between the fitted value and the predicted leave-one-out value: ŷ_i − ỹ_i
▶ This is equivalent to ũ_i h_i, which is just the "outlier-ness × leverage"
▶ Cook's distance normalizes this quantity:

D_i = ũ_i² / ((k+1)σ̂²) × h_i

▶ Basically: "normalized outlier-ness × leverage"
▶ D_i > 4/(n − k − 1) considered "large", but cutoffs are arbitrary

Influence plots:

▶ x-axis: hat values, h_i
▶ y-axis: standardized residuals, û′_i
plot(mod, which = 5, labels.id = flvote$county)

[Figure: standardized residuals vs. leverage for lm(edaybuchanan ~ edaytotal), with Cook's distance contours; Palm Beach, Miami-Dade, and Broward are flagged]
What to do about outliers and influential points?

▶ Fix the observation (obvious data entry errors)
▶ Remove the observation
▶ Be transparent either way
▶ Transform the dependent variable (log(y))
▶ Use a method that is robust to outliers (robust regression, least absolute deviations)
Heteroskedasticity and the variance of OLS:

β̂ = (X'X)⁻¹X'y
V[β̂ | X] = (X'X)⁻¹ X'ΣX (X'X)⁻¹

Under homoskedasticity this collapses to

V[β̂ | X] = σ²_u (X'X)⁻¹

so plugging in σ̂²_u will give us our estimate of the covariance matrix.

Homoskedasticity: a constant variance down the diagonal,

V[u | X] = σ²_u I = diag(σ²_u, σ²_u, …, σ²_u)

Heteroskedasticity: each unit can have its own error variance,

V[u | X] = diag(σ²_1, σ²_2, …, σ²_n)
[Figure: side-by-side scatterplots of Y on X under heteroskedasticity and homoskedasticity]

Consequences of heteroskedasticity:

▶ β̂ is still unbiased and consistent for β
▶ but the usual standard errors are wrong, so tests and confidence intervals will be misleading
Diagnosing heteroskedasticity with residual plots:

▶ Residuals vs. fitted values: in R, plot(mod, which = 1); the residuals should have the same variance across the x-axis
▶ Scale-location plot: y-axis is the square root of the absolute value of the standardized residuals, x-axis is the fitted values; usually has a loess trend curve, which should be flat. In R, plot(mod, which = 3)

plot(mod, which = 1, lwd = 3)
plot(mod, which = 3, lwd = 3)

[Figure: residuals vs. fitted and scale-location plots for mod, with observations 3, 9, and 64 flagged]
Two ways to deal with heteroskedasticity:

▶ Model the variance and reweight: weighted least squares (WLS)
▶ Use an estimator of V[β̂ | X] that is robust to heteroskedasticity
One fix: transform the dependent variable. Taking logs often stabilizes the variance:

mod2 <- lm(log(edaybuchanan) ~ log(edaytotal), data = flvote)
summary(mod2)
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  0.400          3.5e-09 ***
## log(edaytotal)    0.729      0.038   19.15  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.469 on 65 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.847
## F-statistic: 367 on 1 and 65 DF,  p-value: <2e-16

plot(mod2, which = 3)

[Figure: scale-location plot for lm(log(edaybuchanan) ~ log(edaytotal)); the loess curve is now much flatter, with observations 39, 55, and 64 flagged]
Weighted least squares (WLS):

▶ Suppose the heteroskedasticity is known up to a multiplicative constant: V[u_i | X] = a_i σ²_u, where a_i = a_i(x_i) is a positive and known function of x_i
▶ Reweight each term in the model by 1/√a_i:

y_i/√a_i = β₀(1/√a_i) + β₁(x_i1/√a_i) + ⋯ + β_k(x_ik/√a_i) + u_i/√a_i

▶ The reweighted errors are homoskedastic:

V[(1/√a_i)u_i | X] = (1/a_i)V[u_i | X] = (1/a_i)a_i σ²_u = σ²_u

▶ This makes the reweighted model homoskedastic and, thus, BLUE again
In matrix form, define the weighting matrix

W = diag(1/√a_1, 1/√a_2, …, 1/√a_n)

and transform the model:

Wy = WXβ + Wu   ⟺   y* = X*β + u*

The transformed model satisfies the Gauss-Markov assumptions, so running OLS on it gives the WLS estimator of β:

β̂_W = (X'W'WX)⁻¹X'W'Wy
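The equivalence between "OLS on rescaled data" and the weighted normal equations can be demonstrated directly. A Python/NumPy sketch with assumed synthetic data, where the variance constant a_i is proportional to x_i (mimicking the ballots example):

```python
import numpy as np

# WLS two ways: (1) OLS on rows rescaled by 1/sqrt(a_i),
# (2) weighted normal equations (X'WX)^{-1} X'Wy with W = diag(1/a_i).
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1, 10, size=n)
a = x                                      # known variance constants a_i
y = 1 + 0.5 * x + np.sqrt(a) * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

w = 1 / np.sqrt(a)
Xs, ys = X * w[:, None], y * w             # transformed data X*, y*
beta_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

W = np.diag(1 / a)
beta_check = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
assert np.allclose(beta_wls, beta_check)
print(beta_wls.round(2))
```

Both routes give the same coefficients because X*'X* = X'WX and X*'y* = X'Wy.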
▶ In R, the weights argument of lm() takes the inverse of the standard deviation squared: 1/a_i
▶ Here, suppose the error variance is proportional to the total number of ballots cast:

mod.wls <- lm(edaybuchanan ~ edaytotal, weights = 1/edaytotal, data = flvote)
summary(mod.wls)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.06785    8.50723    3.18   0.0022 **
## edaytotal    0.00263    0.00025   10.50  1.2e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.565 on 65 degrees of freedom
## Multiple R-squared: 0.629, Adjusted R-squared: 0.624
## F-statistic: 110 on 1 and 65 DF,  p-value: 1.22e-15
plot(mod, which = 3, lwd = 2, sub = "")
plot(mod.wls, which = 3, lwd = 2, sub = "")

[Figure: scale-location plots for OLS (mod) and WLS (mod.wls) side by side, with observations 3, 9, and 64 flagged in each]
Heteroskedasticity-robust ("sandwich") standard errors:

Under arbitrary heteroskedasticity,

V[u | X] = Σ = diag(σ²_1, σ²_2, …, σ²_n)

and the sampling variance of OLS takes the sandwich form

V[β̂ | X] = (X'X)⁻¹ X'ΣX (X'X)⁻¹

If we have an estimate of Σ, we can plug it in:

V̂[β̂ | X] = (X'X)⁻¹ X'Σ̂X (X'X)⁻¹
▶ White's insight: estimate the diagonal of Σ with the squared residuals û²_i:

Σ̂ = diag(û²_1, û²_2, …, û²_n)

▶ Plug Σ̂ into the sandwich formula to obtain the HC/robust estimator:

V̂[β̂ | X] = (X'X)⁻¹ X'Σ̂X (X'X)⁻¹   (HC0)

▶ A degrees-of-freedom correction gives the HC1 version:

V̂[β̂ | X] = (n/(n − k − 1)) ⋅ (X'X)⁻¹ X'Σ̂X (X'X)⁻¹   (HC1)
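The sandwich estimator is a short computation once the residuals are in hand. A Python/NumPy sketch on assumed synthetic heteroskedastic data (the slides use R's vcovHC for this):

```python
import numpy as np

# HC0: plug Sigma-hat = diag(uhat_i^2) into (X'X)^{-1} X' Sigma X (X'X)^{-1}.
rng = np.random.default_rng(3)
n = 500
x = rng.uniform(1, 10, size=n)
y = 1 + 0.5 * x + x * rng.normal(size=n)   # error variance grows with x
X = np.column_stack([np.ones(n), x])

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
u = y - X @ beta

meat = X.T @ (X * u[:, None]**2)           # X' diag(u^2) X without forming diag
V_hc0 = XtX_inv @ meat @ XtX_inv
se_hc0 = np.sqrt(np.diag(V_hc0))

# HC1 rescales the variance by n/(n - k - 1)
k = X.shape[1] - 1
se_hc1 = np.sqrt(n / (n - k - 1)) * se_hc0
print(se_hc0, se_hc1)
```

Note the estimate β̂ itself is untouched; only the standard errors change.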
In R, robust SEs come from the lmtest and sandwich packages:

library(lmtest)
library(sandwich)

coeftest(mod)
##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   49.14146    1.10     0.27
## edaytotal    0.00232    0.00031    7.48  2.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(mod, vcovHC(mod, type = "HC0"))
##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.22945   40.61283    1.34   0.1864
## edaytotal    0.00232    0.00087    2.67   0.0096 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coeftest(mod, vcovHC(mod, type = "HC1"))
##
## t test of coefficients:
##
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.229453  41.232904    1.32    0.193
## edaytotal    0.002323   0.000884    2.63    0.011 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
WLS vs. robust SEs:

▶ With known weights, WLS is efficient and ŜE[β̂_WLS] is consistent
▶ but the weights usually aren't known
▶ Robust SEs don't change the estimate β̂
▶ Consistent for V[β̂] under any form of heteroskedasticity
▶ Because it relies on consistency, it is a large-sample result, best with large n
▶ For small n, performance might be poor
Nonlinearity: add the absentee Buchanan vote as a second covariate:

mod3 <- lm(edaybuchanan ~ edaytotal + absnbuchanan, data = flvote)
summary(mod3)
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  55.19635               0.5969
## edaytotal     0.00110    0.00048    2.29   0.0253 *
## absnbuchanan  6.89546    2.12942    3.24   0.0019 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 317 on 61 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared: 0.536, Adjusted R-squared: 0.521
## F-statistic: 35.2 on 2 and 61 DF,  p-value: 6.71e-11

▶ Added-variable plots check linearity covariate by covariate: plot the residuals of y (given the other covariates) against the residuals of x_j
▶ If linearity holds, the points should scatter around a line with slope β̂_j and 0 intercept
par(mfrow = c(1, 2))
lines(loess.smooth(x = out$edaytotal[, 1], y = out$edaytotal[, 2]),
      col = "dodgerblue", lwd = 2)
lines(loess.smooth(x = out2$absnbuchanan[, 1], y = out2$absnbuchanan[, 2]),
      col = "dodgerblue", lwd = 2)

[Figure: added-variable plots, edaybuchanan | others against edaytotal | others and absnbuchanan | others, with loess smooths overlaid]
▶ Generalized additive models and splines allow the data to tell us what the functional form is.
▶ Complicated math, but important ideas.

Basis functions:

▶ Represent the regression function with transformations h_m(x_i) of x_i included in the model
▶ Examples we've seen: h_m(x_i) = x_i, h_m(x_i) = x_i², h_m(x_i) = log(x_i)
▶ Indicator ("bin") bases capture non-linearity by making the fit piecewise constant: h_1 = 1, h_2 = I(c_1 < x_i < c_2), h_3 = I(x_i > c_2)
[Figure: piecewise-constant (binned) fit to a scatterplot of y on x]
▶ We can let the slope vary within each bin by adding interactions: h_1(x_i) = 1, h_2(x_i) = x_i, h_3(x_i) = I(c_1 < x_i < c_2), h_4(x_i) = x_i·I(c_1 < x_i < c_2), h_5(x_i) = I(x_i ≥ c_2), h_6(x_i) = x_i·I(x_i ≥ c_2)

[Figure: piecewise-linear fit with jumps at the bin boundaries]
▶ Linear splines force the fit to be a continuous piecewise-linear function of x_i: h_1(x_i) = 1, h_2(x_i) = x_i, h_3(x_i) = (x_i − c_1)₊, h_4(x_i) = (x_i − c_2)₊, where (z)₊ = max(z, 0)
The linear spline regression is

y_i = β₀ + β₁x_i + β₂(x_i − c_1)₊ + β₃(x_i − c_2)₊ + u_i

At the first knot, the fit from the left is β₀ + β₁c_1, and from the right it is β₀ + β₁c_1 + β₂(c_1 − c_1)₊ = β₀ + β₁c_1, so the function doesn't jump — only the slope can change:

▶ β₁ = slope when x_i < c_1
▶ β₁ + β₂ = slope when c_1 < x_i < c_2
▶ β₁ + β₂ + β₃ = slope when x_i > c_2
▶ Function is continuous at cutpoints
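The continuity and segment-slope claims can be verified numerically. A Python/NumPy sketch with assumed synthetic data and knots at -1.5 and 1.5 (matching the R example on these slides):

```python
import numpy as np

# Linear spline basis with knots c1, c2: check continuity at the knots and
# slopes b1, b1+b2, b1+b2+b3 on the three segments.
c1, c2 = -1.5, 1.5
rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=300)
y = np.sin(x) + 0.1 * rng.normal(size=300)

pos = lambda z: np.maximum(z, 0.0)                    # (z)_+ truncation
X = np.column_stack([np.ones_like(x), x, pos(x - c1), pos(x - c2)])
b = np.linalg.lstsq(X, y, rcond=None)[0]

def f(z):
    return b[0] + b[1]*z + b[2]*pos(z - c1) + b[3]*pos(z - c2)

# Continuity at the first knot: left and right limits agree
eps = 1e-9
assert abs(f(c1 - eps) - f(c1 + eps)) < 1e-6
# Slopes on each segment, from finite differences within the segment
assert np.isclose((f(-2.0) - f(-2.5)) / 0.5, b[1])
assert np.isclose((f(0.5) - f(0.0)) / 0.5, b[1] + b[2])
assert np.isclose((f(2.5) - f(2.0)) / 0.5, b[1] + b[2] + b[3])
print("spline basis checks pass")
```

The (z)₊ truncation is what makes the slope change while the level stays continuous.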
h2 <- x
h3 <- 1 * (x > -1.5) * (x - -1.5)
h4 <- 1 * (x > 1.5) * (x - 1.5)
reg <- lm(y ~ h2 + h3 + h4)

[Figure: linear spline fit with knots at -1.5 and 1.5]
▶ Piecewise-linear fits are continuous but have kinks — we probably want "smooth" functions.
▶ What does smooth mean? Continuous derivatives!
▶ → use higher-order polynomials in the basis functions

Cubic splines are piecewise cubics with continuous first and second derivatives:

h_1(x_i) = 1, h_2(x_i) = x_i, h_3(x_i) = x_i², h_4(x_i) = x_i³, h_5(x_i) = (x_i − c_1)³₊, h_6(x_i) = (x_i − c_2)³₊

▶ The pieces have to connect and be smooth at the knots.
▶ Ensure this by allowing only the coefficient on the cubic term to change at the knot point.
h2 <- x
h3 <- x^2
h4 <- x^3
h5 <- 1 * (x > -1.5) * (x - -1.5)^3
h6 <- 1 * (x > 1.5) * (x - 1.5)^3
reg <- lm(y ~ h2 + h3 + h4 + h5 + h6)

[Figure: cubic spline fit with knots at -1.5 and 1.5]

For comparison, a global cubic polynomial has no knots:

h2 <- x
h3 <- x^2
h4 <- x^3
rr <- lm(y ~ h2 + h3 + h4)

[Figure: global cubic vs. local cubic spline fits on the same data]
▶ More knot points → "rougher" function, less in-sample bias, more variance.
▶ Fewer knot points → "smoother" function, more in-sample bias, less variance.
▶ Get this trade-off wrong and the out-of-sample predictions can be terrible.
▶ There are other ways of representing this trade-off other than knots.
Choosing the number of knots by cross-validation:

▶ Leave out observation i and fit the spline with m knots.
▶ Predict the left-out observation, ŷ^(−i)_m, and calculate the squared prediction error: (y_i − ŷ^(−i)_m)².
▶ Average these errors over the observations to get the MSE with m knots.
▶ Choose the number of knots that has the lowest MSE.
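The four steps above can be sketched directly. A Python/NumPy version with assumed synthetic data, evenly spaced knots, and a linear spline basis (all assumptions for illustration):

```python
import numpy as np

# Leave-one-out CV for choosing the number of knots in a linear spline.
rng = np.random.default_rng(5)
n = 100
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)

def spline_design(x, knots):
    # basis: intercept, x, and one truncated term (x - c)_+ per knot
    cols = [np.ones_like(x), x] + [np.maximum(x - c, 0.0) for c in knots]
    return np.column_stack(cols)

def loo_mse(num_knots):
    knots = np.linspace(-2, 2, num_knots)
    err = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        Xk = spline_design(x[keep], knots)
        b = np.linalg.lstsq(Xk, y[keep], rcond=None)[0]
        pred = spline_design(np.array([x[i]]), knots) @ b
        err += (y[i] - pred[0])**2
    return err / n

scores = {m: loo_mse(m) for m in range(1, 6)}
best = min(scores, key=scores.get)
print("best number of knots:", best)
```

This is the brute-force version of the procedure; smooth.spline in R picks its smoothness by a generalized form of the same idea.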
smth <- smooth.spline(x, y)
plot(x, y, ylim = c(-3, 3), pch = 19, col = "grey50", bty = "n")
lines(smth, col = "indianred", lwd = 2)

[Figure: smoothing spline fit overlaid on the scatterplot of y on x]
Generalized additive models (GAMs):

▶ GAMs estimate the spline of any particular variable in the regression.
▶ Each spline is additive: y_i = f_1(x_i1) + f_2(x_i2) + u_i
▶ Very useful for checking nonlinearity of the functional form.
library(mgcv)  ## GAM package
out <- gam(edaybuchanan ~ s(edaytotal) + s(absnbuchanan), data = flvote,
           subset = county != "Palm Beach")
summary(out)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## edaybuchanan ~ s(edaytotal) + s(absnbuchanan)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   221.84       6.41    34.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                  edf Ref.df    F  p-value
## s(edaytotal)    6.85   7.82 10.6  1.6e-09 ***
## s(absnbuchanan) 2.95   3.64 22.6  1.6e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.95  Deviance explained = 95.8%
## GCV = 3129  Scale est. = 2592.3  n = 63

plot(out, shade = TRUE, residual = TRUE, pch = 1)

[Figure: estimated smooths s(edaytotal, 6.85) and s(absnbuchanan, 2.95) with shaded confidence bands and partial residuals]
Wrapping up:

▶ Check your data! summary(), plot(), etc.
▶ Use transformations to make assumptions more plausible
▶ Weaken linearity when you need to.