  1. BUS41100 Applied Regression Analysis
     Week 7: Regression Issues
     Standardized and Studentized residuals, outliers and leverage, nonconstant variance, non-normality, nonlinearity, transformations, multicollinearity
     Max H. Farrell, The University of Chicago Booth School of Business

  2. Model assumptions
     Y | X ∼ N(β₀ + β₁X, σ²)
     Key assumptions of our linear regression model:
     (i) The conditional mean of Y is linear in X.
     (ii) The additive errors (deviations from the line)
        ◮ are Normally distributed
        ◮ independent from each other
        ◮ identically distributed (i.e., they have constant variance)
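     To make these assumptions concrete, here is a minimal R sketch that simulates data satisfying (i) and (ii), so we know what a well-behaved dataset and fit look like. The object names and values (n, beta0, beta1, sigma, x, y) are illustrative choices, not from the slides.

     # Sketch: simulate data from the SLR model with Normal, mean-zero,
     # constant-variance, independent errors.
     set.seed(41100)
     n <- 100
     beta0 <- 3; beta1 <- 0.5; sigma <- 1
     x <- runif(n, 4, 20)
     y <- beta0 + beta1*x + rnorm(n, mean=0, sd=sigma)
     fit <- lm(y ~ x)
     plot(x, y, pch=20); abline(fit)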

  3. Inference and prediction rely on this model being true!
     If the model assumptions do not hold, then all bets are off:
        ◮ prediction can be systematically biased
        ◮ standard errors and confidence intervals are wrong (but how wrong?)
     We will focus on using graphical methods (plots!) to detect violations of the model assumptions. You'll see that
        ◮ it is more of an art than a science,
        ◮ but it is grounded in mathematics.

  4. Example model violations
     Anscombe's quartet comprises four datasets that have similar statistical properties . . .

     > attach(anscombe <- read.csv("anscombe.csv"))
     > c(x.m1=mean(x1), x.m2=mean(x2), x.m3=mean(x3), x.m4=mean(x4))
     x.m1 x.m2 x.m3 x.m4
        9    9    9    9
     > c(y.m1=mean(y1), y.m2=mean(y2), y.m3=mean(y3), y.m4=mean(y4))
         y.m1     y.m2     y.m3     y.m4
     7.500909 7.500909 7.500000 7.500909
     > c(x.sd1=sd(x1), x.sd2=sd(x2), x.sd3=sd(x3), x.sd4=sd(x4))
        x.sd1    x.sd2    x.sd3    x.sd4
     3.316625 3.316625 3.316625 3.316625
     > c(y.sd1=sd(y1), y.sd2=sd(y2), y.sd3=sd(y3), y.sd4=sd(y4))
        y.sd1    y.sd2    y.sd3    y.sd4
     2.031568 2.031657 2.030424 2.030579
     > c(cor1=cor(x1,y1), cor2=cor(x2,y2), cor3=cor(x3,y3), cor4=cor(x4,y4))
          cor1      cor2      cor3      cor4
     0.8164205 0.8162365 0.8162867 0.8165214

  5. . . . but vary considerably when graphed.
     [Scatterplots of the four datasets: y1 vs x1, y2 vs x2, y3 vs x3, y4 vs x4.]

  6. Similarly, let's consider linear regression for each dataset.
     [The same four scatterplots, with fitted regression lines overlaid.]

  7. The regression lines and even R² values are the same...

     > ansreg <- list(reg1=lm(y1~x1), reg2=lm(y2~x2),
     +                reg3=lm(y3~x3), reg4=lm(y4~x4))
     > attach(ansreg)
     > cbind(reg1$coef, reg2$coef, reg3$coef, reg4$coef)
                      [,1]     [,2]      [,3]      [,4]
     (Intercept) 3.0000909 3.000909 3.0024545 3.0017273
     x1          0.5000909 0.500000 0.4997273 0.4999091
     > smry <- lapply(ansreg, summary)
     > c(smry$reg1$r.sq, smry$reg2$r.sq,
     +   smry$reg3$r.sq, smry$reg4$r.sq)
     [1] 0.6665425 0.6662420 0.6663240 0.6667073

  8. ...but the residuals (plotted against Ŷ) look totally different.
     [Plots of reg1–reg4 residuals against their fitted values.]
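     These residual-vs-fitted panels can be reproduced along the following lines (a sketch; the 2×2 layout and plotting options are my own choices, not from the slides).

     # Sketch: residuals vs. fitted values for each of the four regressions.
     par(mfrow=c(2,2))                                  # 2x2 grid of plots
     for (reg in ansreg) {
       plot(reg$fitted, reg$residuals); abline(h=0, lty=2)
     }
     par(mfrow=c(1,1))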

  9. Plotting e vs Ŷ is your #1 tool for finding fit problems.
     Why?
        ◮ Because it gives a quick visual indicator of whether or not the model assumptions are true.
     What should we expect to see if they are true?
     1. Each εᵢ has the same variance (σ²).
     2. Each εᵢ has the same mean (0).
     3. The εᵢ collectively have the same Normal distribution.
     Remember: Ŷ is made from X in SLR and MLR, so one plot summarizes across the X's.

  10. How do we check these?
      Well, the true errors εᵢ are unknown, so we must look instead at the least squares estimated residuals.
        ◮ We estimate Yᵢ = b₀ + b₁Xᵢ + eᵢ, such that the sample least squares regression residuals are
          eᵢ = Yᵢ − Ŷᵢ.
      What should the eᵢ look like if the SLR model is true?
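      For example, in R (a sketch using dataset 1; residuals() is the built-in accessor):

      # Sketch: the estimated residuals are just the observed Y minus the fitted values.
      e1 <- y1 - reg1$fitted           # by hand
      all.equal(e1, residuals(reg1))   # should agree with what R stores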

  11. If the SLR model is true, it turns out that:
      eᵢ ∼ N(0, σ²[1 − hᵢ]),   where   hᵢ = 1/n + (Xᵢ − X̄)² / Σⱼ₌₁ⁿ (Xⱼ − X̄)².
      The hᵢ term is referred to as the i-th observation's leverage:
        ◮ It is that point's share of the data (1/n) plus its proportional contribution to variability in X.
      Notice that as n → ∞, hᵢ → 0 and the residuals eᵢ "obtain" the same distribution as the unknown errors εᵢ, i.e., eᵢ ∼ N(0, σ²).
      —————————————
      See handout on course page for derivations.
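      In R, the leverages can be computed directly from this formula or pulled from the fit with hatvalues() (a sketch using dataset 1):

      # Sketch: leverage by the formula above vs. R's hatvalues().
      h.byhand <- 1/length(x1) + (x1 - mean(x1))^2 / sum((x1 - mean(x1))^2)
      cbind(by.hand = h.byhand, from.R = hatvalues(reg1))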

  12. Understanding Leverage
      The hᵢ leverage term measures sensitivity of the estimated least squares regression line to changes in Yᵢ.
      The term "leverage" provides a mechanical intuition:
        ◮ The farther you are from a pivot joint, the more torque you have pulling on a lever.
      Online illustration of leverage: https://rstudio-class.chicagobooth.edu
      Outliers do more damage if they have high leverage!
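      To see this intuition numerically, one illustrative experiment (a sketch; the bump size and object names are arbitrary, not from the slides) is to nudge a single Y value at a low-leverage X and at a high-leverage X and compare how far the fitted slope moves.

      # Sketch: the same perturbation moves the slope more at a high-leverage point.
      h1 <- hatvalues(reg1)
      lo <- which.min(h1); hi <- which.max(h1)   # lowest- and highest-leverage observations
      bump <- 2                                  # arbitrary perturbation
      y.lo <- y1; y.lo[lo] <- y.lo[lo] + bump
      y.hi <- y1; y.hi[hi] <- y.hi[hi] + bump
      c(original             = coef(reg1)[2],
        bumped.low.leverage  = coef(lm(y.lo ~ x1))[2],
        bumped.high.leverage = coef(lm(y.hi ~ x1))[2])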

  13. Standardized residuals
      Since eᵢ ∼ N(0, σ²[1 − hᵢ]), we know that
          eᵢ / ( σ √(1 − hᵢ) ) ∼ N(0, 1).
      These transformed eᵢ's are called the standardized residuals.
        ◮ They all have the same distribution if the SLR model assumptions are true.
        ◮ They are almost (close enough) independent ( iid ∼ N(0, 1) ).
        ◮ Estimate σ² using σ̂² or s².
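      In R, standardized residuals can be built from the pieces above or obtained directly with rstandard() (a sketch using dataset 1; the by-hand version plugs s in for σ):

      # Sketch: standardized residuals by hand vs. R's rstandard().
      s  <- summary(reg1)$sigma        # s, our estimate of sigma
      h1 <- hatvalues(reg1)
      std.byhand <- residuals(reg1) / (s * sqrt(1 - h1))
      cbind(by.hand = std.byhand, from.R = rstandard(reg1))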

  14. About estimating s under sketchy SLR assumptions ...
      We want to see whether any particular eᵢ is "too big", but we don't want a single outlier to make s artificially large.

      > plot(x3, y3, col=3, pch=20, cex=1.5)
      > abline(reg3, col=3)

      [Scatterplot of y3 vs x3 with the fitted line for reg3.]
        ◮ One big outlier can make s overestimate σ.
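      A quick way to see the effect in R (a sketch; the outlier index is found from the residuals, not assumed):

      # Sketch: s from the full reg3 fit vs. s after dropping the biggest outlier.
      big <- which.max(abs(residuals(reg3)))
      c(s.with.outlier    = summary(reg3)$sigma,
        s.without.outlier = summary(lm(y3[-big] ~ x3[-big]))$sigma)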

  15. Studentized residuals
      We thus define a Studentized residual as
          rᵢ = eᵢ / ( s₋ᵢ √(1 − hᵢ) ),
      where s²₋ᵢ = Σⱼ≠ᵢ eⱼ² / (n − p − 1) is σ̂² calculated without eᵢ.
      These are easy to get in R with the rstudent() function.

      > rstudent(reg3)
       [1]   -0.4390554   -0.1855022 1203.5394638   -0.3138441
       [5]   -0.5742948   -1.1559818    0.0664074    0.3618514
       [9]   -0.7356770   -0.0657680    0.2002633
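      As a check, the formula can be verified for a single, illustrative observation (a sketch; this reads "σ̂ calculated without eᵢ" as the residual standard error from refitting with observation i deleted, which is what rstudent() uses):

      # Sketch: Studentized residual for one observation i, by hand.
      i  <- 1                                             # illustrative index
      h3 <- hatvalues(reg3)
      s.minus.i <- summary(lm(y3[-i] ~ x3[-i]))$sigma     # sigma-hat without observation i
      residuals(reg3)[i] / (s.minus.i * sqrt(1 - h3[i]))  # should match rstudent(reg3)[i]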

  16. Outliers and Studentized residuals
      Since the Studentized residuals should be ≈ N(0, 1), we should be concerned about any rᵢ outside of about [−3, 3].
      [Plots of reg3's raw residuals and Studentized residuals against the fitted values.]
      These aren't hard and fast cutoffs. As n gets bigger, we will expect to see some very rare events (big εᵢ) and not get worried unless |rᵢ| > 4.
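      In R, flagging the suspicious observations takes one line (a sketch using reg3):

      # Sketch: which observations have Studentized residuals outside [-3, 3]?
      r3 <- rstudent(reg3)
      which(abs(r3) > 3)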

  17. How to deal with outliers
      When should you delete outliers?
        ◮ Only when you have a really good reason!
      There is nothing wrong with running a regression with and without potential outliers to see whether results are significantly impacted. Any time outliers are dropped, the reasons for doing so should be clearly noted.
        ◮ I maintain that both a statistical and a non-statistical reason are required. (What?)
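      One way to do the with-and-without comparison in R (a sketch; it assumes at least one point was flagged on the previous slide):

      # Sketch: refit without the flagged point(s) and report both sets of coefficients.
      drop <- which(abs(rstudent(reg3)) > 3)    # assumed non-empty here
      reg3.sub <- lm(y3[-drop] ~ x3[-drop])
      cbind(with.outliers = coef(reg3), without.outliers = coef(reg3.sub))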
