Assumptions: EDUC 7610, Chapters 16 and 17. Diagnostics (Detecting Irregularities) and miscellaneous stuff. Tyson S. Barrett, PhD



  1. Assumptions…

  2. EDUC 7610 Chapter 16 and 17 Diagnostics (Detecting Irregularities) and miscellaneous stuff Tyson S. Barrett, PhD

  3. This is one of the most important chapters. The regression results’ validity depends on whether the assumptions of the model hold or not. Assumptions of the model (for each, ask: does it hold, or is it violated?):
     1. Linear relationship
     2. Homoscedasticity of residuals
     3. Normally-distributed residuals with mean 0
     4. No omitted variables
     5. Independence of residuals
     6. Variance of X > 0

  4. This is one of the most important chapters. The regression results’ validity depends on whether the assumptions of the model hold or not. Violations usually occur because of extreme cases: cases with high leverage, high distance, or high influence.

  5. Leverage: the atypicalness of a case’s pattern of values on the regressors in the model. A point with high leverage could be:
     • A 55-year-old pregnant female. In a general population, is being 55 strange by itself? What about being pregnant?
     • A high-income individual receiving welfare assistance. Is having a high income strange by itself? What about being on welfare assistance?
     We must consider the combination of the variables to know if a case has high leverage. Measured with h_i, “case i’s hat value.”
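The hat value on this slide can be made concrete. A minimal pure-Python sketch for a one-regressor model (the ages are made up to echo the slide's 55-year-old example; with one regressor, h_i = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)²):

```python
# Hat values (leverage) for a single regressor, pure Python.
# The data are illustrative, not from the slides.

def hat_values(x):
    n = len(x)
    xbar = sum(x) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)
    # h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2; ranges from 1/n to 1
    return [1 / n + (xi - xbar) ** 2 / ssx for xi in x]

ages = [25.0, 27.0, 30.0, 31.0, 33.0, 55.0]  # the 55-year-old is atypical
h = hat_values(ages)
print([round(v, 3) for v in h])  # the last case has by far the largest hat value
```

A useful check on the computation: the hat values always sum to the number of estimated parameters (here 2, intercept plus slope).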

  6. Distance: how far case i’s Z_i value deviates from the predicted value Ẑ_i. Often measured with the residual: f_i = Z_i − Ẑ_i. But an outlier pulls Ẑ_i toward itself, so we can adjust the residual using h_i: f_i / (1 − h_i). Turns out, with mathemagic, we can see that h_i equals the proportion by which case i lowers its own residual by pulling the regression surface toward itself.
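The adjustment f_i / (1 − h_i) has a nice interpretation: for OLS it is exactly the residual case i would get from a model fit without case i (sometimes called the deleted or PRESS residual). A pure-Python sketch with made-up data, where the last case is a high-leverage outlier:

```python
# The adjusted residual f_i / (1 - h_i) equals the residual case i would
# have under a model fit WITHOUT case i. Data are illustrative.

def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b  # intercept, slope

x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 2.0]   # last case: outlier with high leverage

a, b = ols(x, y)
i = 4
f_i = y[i] - (a + b * x[i])                       # ordinary residual

mx = sum(x) / len(x)
h_i = 1 / len(x) + (x[i] - mx) ** 2 / sum((xj - mx) ** 2 for xj in x)
adjusted = f_i / (1 - h_i)                        # the slide's adjustment

a2, b2 = ols(x[:i], y[:i])                        # refit without case i
check = y[i] - (a2 + b2 * x[i])
print(round(adjusted, 3), round(check, 3))        # identical: -7.8 -7.8
```

Note how the ordinary residual (about −0.62) badly understates how far the case sits from the pattern of the other points, because the case drags the line toward itself (h_i = 0.92 here).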

  7. Influence: the extent to which a case’s inclusion changes the regression solution or some aspect of it. Which extreme point (A, B, or C) changes the solution the most?

  8. Influence: the extent to which a case’s inclusion changes the regression solution or some aspect of it (total influence vs. partial influence). Which extreme point (A, B, or C) changes the solution the most? Measured with Cook’s Distance:
     D_i = Σ_{j=1}^{N} d_ij² / [(k + 1) × MS_residual]
     where d_ij is the change in the value of case j’s residual when case i is deleted from the model, k is the number of regressors, and MS_residual is the error variance from the model.
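Cook's distance can be computed by brute force exactly as the slide describes: delete case i, refit, and sum the squared changes across all cases, scaled by (k + 1) times the error variance. A sketch with made-up data (k = 1 regressor; the change in each fitted value equals the change in each residual up to sign, so the squared sums agree):

```python
# Brute-force Cook's distance: refit without case i, sum squared changes
# in fitted values, scale by (k + 1) * MS_residual. Data are illustrative.

def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 2.0]   # last case: high leverage AND high distance

a, b = ols(x, y)
n, k = len(x), 1
mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - k - 1)

def cooks_d(i):
    a2, b2 = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])  # refit without case i
    change = sum(((a + b * xj) - (a2 + b2 * xj)) ** 2 for xj in x)
    return change / ((k + 1) * mse)

d = [round(cooks_d(i), 2) for i in range(n)]
print(d)   # the high-leverage, high-distance case dominates
```

In practice you would use a library routine (e.g. R's cooks.distance), which computes the same quantity without refitting N times; the brute-force version just mirrors the definition on the slide.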

  9. Approaching Diagnostics: diagnostic statistics are estimates of different features of the observations. Does anything look weird in this data? Any extreme values? Eye-balling a data set (especially a larger one) is not very productive.

  10. Approaching Diagnostics: instead of eye-balling, let’s use the diagnostics. Which case has a high distance? Compare the regular residual with the residual computed with that case removed from the estimate of Ẑ_i.

  11. Approaching Diagnostics: which case has high leverage? The hat value measures leverage and ranges from 1/N to 1.

  12. Approaching Diagnostics: which case has high influence? Cook’s distance measures influence.

  13. Assumptions: the book provides four basic assumptions of regression; we make explicit two implicit ones below. Assumptions of the model:
     1. Linear relationship
     2. Homoscedasticity of residuals
     3. Normally-distributed residuals with mean 0
     4. No omitted variables
     5. Independence of residuals
     6. Variance of X > 0
     Note: there are a lot of “tests” for these assumptions, but they usually have their own assumptions, so we won’t discuss them here.

  14. Assumptions (continued): 4. No omitted variables. This is largely theoretically based; omitted variables can show up as weird residuals (maybe we are missing an effect), as we’ll see in the next slide. 6. Variance of X > 0. This one is easy to test: are there at least two values in X?

  15. Assumptions (continued): 1. Linear relationship. In the example, the residuals were not good until we added a quadratic effect (omitted variable bias). Use a scatterplot of t-residuals against X.
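The residuals-against-X check can be seen in a tiny example. Here is a sketch (made-up, purely quadratic data): a straight-line fit leaves a U-shaped residual pattern, positive at the extremes of x and negative in the middle, which is the signature of an omitted quadratic term.

```python
# Fit a straight line to purely quadratic data and inspect the residuals.
# Data are illustrative.

def ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]               # y = x^2, no noise

a, b = ols(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(resid)  # [2.0, -1.0, -2.0, -1.0, 2.0]: a U-shape, so add a quadratic term
```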

  16. Assumptions (continued): 2. Homoscedasticity of residuals. Check with scatterplots of residuals on X, or of Y on X.

  17. Assumptions (continued): 3. Normally-distributed residuals with mean 0. This one is somewhat tricky but important: the assumption is that the residuals at each point of x are normally distributed. Are more points closer to the line than far away? (Figure: scatterplot of y against x with the fitted line.)

  18. Assumptions (continued): normality of residuals is usually tested with a Q-Q plot. (Figure: Normal Q-Q plot of standardized residuals against theoretical quantiles from lm(y ~ x), with cases 60, 81, and 34 flagged as extreme.)
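The numbers behind a Q-Q plot are simple to compute. A standard-library Python sketch (the residuals are made up, and (i − 0.5)/n is one common plotting-position convention among several): sort the standardized residuals and pair them with the corresponding quantiles of a standard normal. If the residuals are roughly normal, the pairs fall near the 45-degree line.

```python
# The numbers behind a normal Q-Q plot, standard library only.
from statistics import NormalDist, mean, stdev

resid = [-1.3, -0.6, -0.2, 0.1, 0.4, 0.5, 0.9, 1.6]   # made-up residuals
n = len(resid)
z = sorted((r - mean(resid)) / stdev(resid) for r in resid)  # standardized
theo = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for t, s in zip(theo, z):       # each pair is one point on the Q-Q plot
    print(round(t, 2), round(s, 2))
```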

  19. Assumptions (continued): 5. Independence of residuals. Generally theoretical in nature:
     • Are the observations (participants) independent?
     • Are the observations connected in some way?
     • Time-series data almost always violate this assumption because previous time points are correlated with current ones.
     • In many cases, we can use Multilevel Modeling here.

  20. Dealing with Irregularities: the book provides four basic ways to deal with irregularities; we add a fifth.
     • Correction: correct the error that led to the extreme value.
     • Elimination: remove the extreme value. If you do this, I recommend reporting the results both with and without the extreme value in any publication.
     • Transformation: transform the outcome or predictors using a monotonic transformation (log, square root, etc.).
     • Robustification: use an alternative approach that is less sensitive to the extreme value.
     • Generalized Linear Models: a family of approaches that can assess categorical, ordinal, and otherwise strange outcomes. Chapter 18 is about these and we’ll discuss them much more then.

  21. Robustification: use an alternative approach that is less sensitive to the extreme value. Two main ways of robustifying:
     1. Use an alternative way of estimating the coefficients.
     2. Use an alternative way of estimating the standard error (or, more generally speaking, the uncertainty). We’ll talk about this one.

  22. Robustification: use an alternative approach that is less sensitive to the extreme value.
     • Heteroscedasticity-Consistent Standard Errors: adjust the SEs to be less sensitive to extreme values using sandwich estimators. They are consistent (they get closer and closer to the right value as sample size increases). Many versions exist; HC3 and HC4 are best.
     • Bootstrapping: resamples from the sample with replacement to come up with an empirical distribution of the estimate. Can obtain SEs, CIs, and p-values. Works with any statistic.
     • Permutation: randomly shuffles the data and re-runs the model many times. The proportion of these shuffled models that are as big or bigger than the original model gives us info on p-values. Works with any statistic.
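Bootstrapping the slope of a simple regression takes only a few lines. A pure-Python sketch (the data, the seed, and the choice of 2000 resamples are all illustrative): resample the (x, y) pairs with replacement, recompute the slope each time, and take the standard deviation of the resampled slopes as the empirical SE.

```python
# Bootstrap standard error for a regression slope: resample cases with
# replacement, refit, and summarize. Data and settings are illustrative.
import random

def ols_slope(pairs):
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    num = sum((px - mx) * (py - my) for px, py in pairs)
    den = sum((px - mx) ** 2 for px, py in pairs)
    return num / den

random.seed(1)
data = [(float(xi), 2 * xi + random.gauss(0, 1)) for xi in range(20)]

# 2000 resamples with replacement, one slope estimate per resample
boot = [ols_slope([random.choice(data) for _ in data]) for _ in range(2000)]

m = sum(boot) / len(boot)
se = (sum((b - m) ** 2 for b in boot) / (len(boot) - 1)) ** 0.5
print(round(m, 2), round(se, 3))   # empirical distribution's mean and SE
```

Percentile confidence intervals come from the same resampled slopes (e.g. the 2.5th and 97.5th percentiles of `boot`), which is why the slide says bootstrapping works with any statistic.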

  23. Robustification (continued): the same three approaches as the previous slide, with a callout noting that they can be used whether or not homoscedasticity exists.
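A permutation test is just as short. A pure-Python sketch (data, seed, and 1000 shuffles are illustrative): shuffle y to break any x-y association, refit, and the p-value is the proportion of shuffled slopes as big as or bigger (in absolute value) than the observed one, exactly as the slide describes.

```python
# Permutation p-value for a regression slope: shuffle the outcome, refit,
# and count extreme results. Data and settings are illustrative.
import random

def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
           sum((xi - mx) ** 2 for xi in x)

random.seed(2)
x = [float(i) for i in range(15)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]   # a real x-y relationship

observed = abs(slope(x, y))

shuffles = 1000
ys = y[:]
count = 0
for _ in range(shuffles):
    random.shuffle(ys)                  # break any x-y association
    if abs(slope(x, ys)) >= observed:   # as big or bigger than the original
        count += 1

p = count / shuffles
print(p)   # small p: the observed slope is unlikely under "no relationship"
```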

  24. Remember: No model is “correct” but some models are useful

  25. Some Miscellaneous Stuff: measurement error, specification error, power, non-interval outcomes, missing data.
