201ab Quantitative methods: Linear model diagnostics
Model assumptions, in order of importance
(1) Validity
(2) Additivity and linearity
(3) Independent errors
(4) Normal errors
(5) Homoscedastic errors
(6) Error in y, not in x
Validity & Generalization
- What conclusions are drawn from a data analysis, and how do they relate to the data and the analysis?
– How do the measured / manipulated variables correspond to the concepts in the conclusions?
– Which aspects of the desired generalization are represented in the measured variability?
– Are the premises and logic of your analysis sound?
“Availability” of words starting with k vs. words with k in the third position? Subjects? Stimuli? Manipulations? Linking assumptions? Their justifiability?
Additivity and Linearity
- The linear model assumes linearity + additivity:
y = β0 + β1·x1 + β2·x2 + …
- Important violations to beware of:
– Lots of measures are fundamentally not linear (need for linearizing transforms, etc.; see the sketch below)
– Lots of effects are fundamentally not linear (e.g., dose–response curves cannot be linear)
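A minimal sketch of a linearizing transform (simulated data; all names here are hypothetical, not from the slides): a multiplicative growth process is non-linear in y but linear in log(y).

set.seed(1)
x <- runif(100, 0, 5)
y <- 2 * exp(0.5 * x) * exp(rnorm(100, 0, 0.2))  # multiplicative noise
m.raw <- lm(y ~ x)        # misspecified: residuals show curvature
m.log <- lm(log(y) ~ x)   # linearized: log(y) = log(2) + 0.5*x + error
plot(m.raw, which = 1)    # residuals ~ fitted: systematic pattern
plot(m.log, which = 1)    # residuals ~ fitted: roughly flat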
Independent errors
- Standard linear model assumes i.i.d. errors:
y = … + ε,   ε ~ N(0, σe)
- Critical violations:
– Measuring the same person many times (repeated measures)
– Measuring a fixed set of stimuli (item random effects)
– Measuring over time/space (smoothness/autocorrelation)
– Error correlates with explanatory variable (endogeneity)
In these cases you need to use models that can handle it (a sketch follows below).
- Less critical violations:
– Weak correlations orthogonal to explanatory variables
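A minimal sketch of the repeated-measures violation (simulated data; names hypothetical): each subject contributes a shared error to all of their trials, so plain lm() understates uncertainty; a mixed-effects model such as lme4::lmer is one kind of model that can handle it.

set.seed(1)
subject <- factor(rep(1:20, each = 5))   # 20 subjects, 5 trials each
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(20, 0, 2)[subject] + rnorm(100)  # shared subject error
summary(lm(y ~ x))        # SEs ignore the clustering: too optimistic
library(lme4)
summary(lmer(y ~ x + (1 | subject)))     # subject random intercepts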
Normal, homoscedastic errors
- Small deviations from normality / homoscedasticity are often not a big deal.
- Large deviations from normality, in particular extreme outliers, may yield large errors in estimated coefficients that are not captured by our measures of uncertainty. This undermines generalization.
Error in y, not in x
- Error in x will cause us to underestimate coefficients.
- Not really a big deal.
- Errors-in-variables models deal with this if need be.
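A minimal sketch of the attenuation effect (simulated data; names hypothetical): adding measurement noise to x shrinks the estimated slope toward zero by the factor var(x) / (var(x) + var(noise)).

set.seed(1)
x.true <- rnorm(1000)
y <- 2 * x.true + rnorm(1000)
x.noisy <- x.true + rnorm(1000)   # error in x, sd = 1
coef(lm(y ~ x.true))    # slope near 2
coef(lm(y ~ x.noisy))   # slope near 2 * 1/(1+1) = 1: attenuated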
Model assumptions, in order of importance
(1) Validity
(2) Additivity and linearity
(3) Independent errors
(4) Normal errors
(5) Homoscedastic errors
(6) Error in y, not in x
Caveat: “Importance” here determined by my estimate of the expected magnitude of the problems caused by violations of these assumptions in the kinds of analyses people in this class will typically undertake in their research.
Diagnostics you should undertake
- Look at marginal histograms
– Check for outliers
– Check for skew and heavy tails
- Look at scatterplots.
– Check for major 2d non-linearities
– Check for outliers
– Check for generalized weirdness
- Check various plots of residuals
– Residuals ~ y_hat to check for non-linearities.
Checking for non-linearity
[Figure: residual ~ x and residual ~ y.hat plots; the residual plots highlight the non-linearity.]
For high dimensional data, only Residual ~ y.hat is really possible to look at.
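A minimal sketch (simulated data; names hypothetical): a quadratic trend that the fitted line misses shows up as a U shape in residual ~ y.hat.

set.seed(1)
x <- runif(100, -2, 2)
y <- 1 + x + x^2 + rnorm(100, 0, 0.5)
m <- lm(y ~ x)
plot(fitted(m), residuals(m))   # residual ~ y.hat: U-shaped pattern
plot(m, which = 1)              # same plot, with a smoother added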
Diagnostics you could undertake
- Look at marginal histograms
– Check for outliers
– Check for skew and heavy tails
- Look at scatterplots.
– Check for major 2d non-linearities
– Check for outliers
- Check various plots of residuals
– Residuals ~ y_hat to check for non-linearities.
– Absolute residuals ~ y_hat to check for homoscedasticity.
Checking for homoscedasticity
Homoscedasticity: variance of residuals is constant
spreadLevelPlot(lm(y~x))
plot(lm(y~x), which=3)
|residual| ~ y.hat
ncvTest(lm(y~x))
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 10.68375   Df = 1   p = 0.00108081
Test for non-constant variance (heteroscedasticity) based on a regression of squared errors as a function of fitted y values: the “Breusch–Pagan test”. (A different, somewhat more powerful procedure exists for categorical predictors.)
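A minimal sketch (simulated heteroscedastic data; names hypothetical) putting these diagnostics together with the car package:

set.seed(1)
x <- runif(100, 1, 10)
y <- 2 * x + rnorm(100, 0, x)   # error sd grows with x
m <- lm(y ~ x)
library(car)
ncvTest(m)              # score test on squared residuals ~ fitted values
spreadLevelPlot(m)      # |studentized residual| ~ fitted, on log-log axes
plot(m, which = 3)      # sqrt(|standardized residuals|) ~ fitted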
Diagnostics you could undertake
- Check various plots of residuals
– Residuals ~ y_hat to check for non-linearities.
– Absolute residuals ~ y_hat to check for homoscedasticity.
– Standardized residual QQ plots check for Normality.
Studentized / Standardized residuals
ε̂i = yi − ŷi
Residuals (estimated error): deviation of the observed y value from the fitted line.
ε̂i(S) = ε̂i / sr
Standardized residuals: residual divided by the standard deviation of the residuals.
These should be t distributed, so we can compare them to the t distribution to look for abnormalities / outliers.
qqPlot(lm(y~x))
Large deviations from theoretical t distribution can be tested for (via t-test!) and extreme outliers will be evident this way.
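A minimal sketch (assuming y and x from a fit as above): rstandard() gives standardized residuals, rstudent() gives the leave-one-out (studentized) version that should follow a t distribution under the model.

m <- lm(y ~ x)
head(rstandard(m))   # residual / its estimated sd
head(rstudent(m))    # leave-one-out version; ~ t distributed under the model
library(car)
qqPlot(m)            # QQ plot of studentized residuals against t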
Checking for normal residuals
Look at qq plot, test with Kolmogorov-Smirnov test
qqPlot(lm(y~x))
Generally though, it’s fine to ignore slight but significant deviations
ks.test(rstudent(lm(y~x)), "pt", length(y)-2)
One-sample Kolmogorov-Smirnov test
data: rstudent(lm(y ~ x))
D = 0.1398, p-value = 0.04002
alternative hypothesis: two-sided
Diagnostics you could undertake
- Check various plots of residuals
– Residuals ~ y_hat to check for non-linearities.
– Absolute residuals ~ y_hat to check for homoscedasticity.
– Standardized residual QQ plots check for Normality.
– Standardized residual ~ leverage for outlier effects, Cook’s dist.
Testing for outliers
These tests for outliers tend to be less sensitive than the eye: if there is a significant outlier, we will be able to see it, but if we can see it, it may still not be significant (usually due to the heavy tails of low-df t distributions).
outlierTest(lm(y~x))
#    studentized error   uncorrected p-value   Bonferroni p-value
6     4.31                0.0004                0.0088
16   -4.31                0.0004                0.0088
Leverage
Leverage in statistics is like leverage in physics: with a long enough lever (a predictor far enough away from the mean) you can make a regression line do whatever you want.
Leverage is potential influence.
With many predictors what matters is ~Mahalanobis distance: distance from the center of mass scaled by the covariance matrix. This is hard to visualize, so it’s useful to just look at the leverage numbers, and particularly, whether there are large residuals at large leverage – that is bad.
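A minimal sketch (assuming y and x from a fit as above): hat values are the leverages; one common rule of thumb flags points whose leverage exceeds two to three times the average, (k+1)/n.

m <- lm(y ~ x)
h <- hatvalues(m)            # leverages
k <- length(coef(m)) - 1     # number of predictors
n <- length(h)
which(h > 2 * (k + 1) / n)   # indices of high-leverage points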
Cook’s distance
A data point with a lot of leverage and large residuals is exerting undue influence on the regression. Cook’s distance measures this. Several, equally correct, ways to think about Cook’s distance:
(1) How much will my regression coefficients change without this data point?
(2) How much will the predicted Y values change without this data point?
(3) A combination of leverage and residual to ascertain a point’s influence.
plot(lm(y~x), which=5)
Outliers and extreme influence
Data points with large residuals, and/or high leverage
How do we measure this apparent extreme influence?
- Outlier detection: qqPlot, outlierTest
- Look at residuals as a function of leverage: plot(lm(y~x), which=5)
- Compute Cook’s distance: plot(lm(y~x), which=4)
Cook’s distance
We can just look at the Cook’s distance for different data points, to see if some are extremely influential.
How much influence is too much?
(a) D > 1?
(b) D > 4/n?
(c) D > 4/(n−k−1)?
Different folks have different standards…
plot(lm(y~x), which=4)
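A minimal sketch (assuming y and x from a fit as above) computing Cook’s distances directly and applying the 4/(n−k−1) cutoff:

m <- lm(y ~ x)
d <- cooks.distance(m)
n <- length(d)
k <- length(coef(m)) - 1
which(d > 4 / (n - k - 1))   # candidate influential points
plot(m, which = 4)           # Cook's distance by observation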
Diagnostics you could undertake
- Check various plots of residuals
– Residuals ~ y_hat to check for non-linearities.
– Absolute residuals ~ y_hat to check for homoscedasticity.
– Standardized residual QQ plots check for Normality.
– Standardized residual ~ leverage for outlier effects, Cook’s dist.
– Residual as a function of observation to look for autocorrelation.
Autocorrelated errors
- Something fishy…
plot(x,y)
- Residuals as a function of observation number.
plot(residuals(lm(y~x)))
- Autocorrelation function.
acf(residuals(lm(y~x)))
Checking for autocorrelated errors
Sometimes errors might be autocorrelated
(when there is a particular dependence in sample acquisition)
This is rarely considered unless we are dealing with clearly time-based data (although our subjects vary over the quarter!).
Check for this by looking at residuals ~ observation_number.
If very concerned:
- Look at autocorrelation plots of residuals with acf
- Test for this via car::durbinWatsonTest
(default: tests for lag-1 autocorrelation, can consider higher lags)
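A minimal sketch of both checks (simulated AR(1) errors; names hypothetical):

set.seed(1)
x <- 1:100
e <- as.numeric(arima.sim(list(ar = 0.7), n = 100))  # autocorrelated errors
y <- 2 + 0.1 * x + e
m <- lm(y ~ x)
acf(residuals(m))      # slow decay in the acf => autocorrelation
library(car)
durbinWatsonTest(m)    # lag-1 by default; max.lag= for higher lags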
Checking for linear model assumptions
Linearity, homoscedasticity, uncorrelated residuals, Normal residuals.
If you are really paranoid about making sure all assumptions are valid, you can even consider the “Global validation test for linear model assumptions”:
library(gvlma)
gvlma(lm(y~x))

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call: gvlma(x = lm(y ~ x))

                     Value   p-value                   Decision
Global Stat        65.6446 1.882e-13 Assumptions NOT satisfied!
Skewness           21.3914 3.745e-06 Assumptions NOT satisfied!
Kurtosis           43.8742 3.502e-11 Assumptions NOT satisfied!
Link Function       0.2748 6.002e-01   Assumptions acceptable.
Heteroscedasticity  0.1043 7.467e-01   Assumptions acceptable.
The global statistic combines statistics measuring: skewness and kurtosis of residuals (for non-normality, outliers), link function linearity (based on residuals being consistent across y.hat values), constant variance, and uncorrelated variance (based on squared residuals as a function of observation order). With enough real data, it will ~always tell you assumptions are violated.