Experimental Software Engineering – Linear Regression
Fernando Brito e Abreu (fba@di.fct.unl.pt)
Universidade Nova de Lisboa (http://www.unl.pt)
QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)


  1. Summary
     - Purpose of regression?
     - Linear regression - purpose
     - First order linear model
     - Probabilistic linear relationship
     - Residuals
     - Least squares method
     - Linear model assumptions
     - Normal Probability Plot and Residuals Plot
     - Goodness of fit
     - Residuals independence: Durbin-Watson test
     - Multiple regression model
     - Linear regression validation
     - Inference testing (regression ANOVA)
     - Assessing the influence of each X
     - Variable selection methods

  2. Purpose of regression?
     - To determine whether the values of one or more variables are related to the response variable
     - To predict the value of one variable based on the values of one or more other variables
     Definitions:
     - Dependent variable (aka response or endogenous): the variable that is being predicted or explained
     - Independent variable (aka explanatory, regressor or exogenous): the variable that is doing the predicting or explaining

     Correlation or regression?
     - Use correlation if you are interested only in whether a relationship exists
     - Use regression if you are interested in:
       - building a mathematical model that can predict the response (dependent) variable
       - the relative effectiveness of several variables in predicting the response variable

  3. Linear regression - purpose
     - Is there an association between the two variables?
       - Is defect density (DD) change related to code complexity (CC) change?
     - Estimation of impact
       - How much DD change occurs per CC unit change?
     - Prediction
       - If a program loses 20 CC units (e.g. by refactoring it), how much of a drop in DD can be expected?
     SPSS: Analyze / Regression / Linear

     First order linear model (aka simple regression model)
     - A deterministic mathematical model between y and x:
       y = β0 + β1·x
     - β0 is the intercept with the y axis, the point at which x = 0
     - β1 is the slope of the line, the ratio of the rise divided by the run
       - It measures the change in y for one unit of change in x
     [Figure: regression line through the observations, showing the intercept, rise and run]
     (A fitting and prediction sketch in Python follows below.)
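The slides do this fitting in SPSS (Analyze / Regression / Linear). As a rough equivalent, here is a minimal Python sketch, assuming numpy is available and using made-up DD/CC values, that estimates the first order model and answers the prediction question above:

```python
# A minimal sketch (not from the slides, which use SPSS): fitting the first
# order linear model y = b0 + b1*x with numpy. The DD/CC values are invented
# for illustration only.
import numpy as np

cc = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 14.0])   # code complexity (hypothetical)
dd = np.array([1.1, 1.9, 2.4, 3.2, 3.8, 4.7])      # defect density (hypothetical)

b1, b0 = np.polyfit(cc, dd, deg=1)   # least-squares slope and intercept
print(f"Estimated model: dd = {b0:.3f} + {b1:.3f} * cc")

# Prediction: expected change in DD if a program loses 20 CC units
print(f"Expected DD change for -20 CC units: {b1 * -20:.2f}")
```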

  4. Probabilistic linear relationship
     - The relationship between x and y is not always exact
       - Observations do not always fall on a straight line
     - To accommodate this, we introduce a random error term referred to as epsilon:
       y = β0 + β1·x + ε
     - ε reflects how individuals deviate from others with the same value of x

     - The task of regression analysis is then to estimate the parameters b0 and b1 in the equation:
       ŷ = b0 + b1·x
       so that the difference between y and ŷ is minimized
     - This is called the estimated simple linear regression equation
     - Notes:
       - b0 is the estimate for β0 and b1 is the estimate for β1
       - ŷ is the estimated (predicted) value of y for a given x value

  5. Residuals
     - The distance between each observation (red dot) and the solid line (estimated regression line) is called a residual
     - Residuals are deviations due to the random error term
     - The regression line is determined by minimizing the sum of the squared residuals
     [Figure: observations scattered around the regression line, with one residual highlighted]

     Least squares method
     - Criterion: choose b0 and b1 to minimize the sum of squared residuals:
       S = Σ (yi − b0 − b1·xi)²
     - The regression line is the one that minimizes the sum of the squared distances of each observation to that line
     - Slope:     b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
     - Intercept: b0 = ȳ − b1·x̄
     (These formulas are coded directly in the sketch below.)
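The slope and intercept formulas translate directly into code. A minimal sketch, assuming numpy and reusing the hypothetical data from the previous example:

```python
# Least squares "by hand", following the slide's formulas:
#   b1 = sum((xi - x_bar) * (yi - y_bar)) / sum((xi - x_bar)^2)
#   b0 = y_bar - b1 * x_bar
import numpy as np

x = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 14.0])   # hypothetical data
y = np.array([1.1, 1.9, 2.4, 3.2, 3.8, 4.7])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
b0 = y_bar - b1 * x_bar                                            # intercept

residuals = y - (b0 + b1 * x)
S = np.sum(residuals ** 2)   # the quantity the least squares criterion minimizes
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, S = {S:.3f}")
```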

  6. Linear model assumptions (1)
     - Y variable:
       - is continuous (measured at least on an interval scale)
     - X variables:
       - can be continuous or indicator variables (nominal / ordinal)
       - do not need to be normally distributed
     - Note: this assumption is valid both for simple (one X) and multiple (several Xs) regression models

     Linear model assumptions (2)
     - ε ~ N(0, σ)
       - The probability distribution of the error is Normal with mean 0
       - The variance is constant, finite and does not depend on the X values
       - The latter is called the homoscedasticity property
         - Homoscedastic = having equal statistical variances [homo (same) + Greek skedastos ("able to be scattered")]
     - Rephrasing: for each value of X there is a population of Y's that are normally distributed with the same variability (simulated in the sketch below)
       - The population means form a straight line
       - Each population has the same variance σ²
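To make the homoscedasticity assumption concrete, here is a small simulation (all parameters invented) that generates data exactly as the model describes: for each value of x, a normally distributed population of y's whose means lie on a straight line and whose spread is the same σ everywhere:

```python
# Simulating data that satisfies the linear model assumptions:
# y = beta0 + beta1*x + eps, with eps ~ N(0, sigma) for every x.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 0.5, 0.3, 0.4                  # hypothetical parameters

x = np.repeat(np.arange(4, 15, 2), 50)               # a population of Y's per X value
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)  # mean 0, constant variance
y = beta0 + beta1 * x + eps

# Means fall on a straight line; standard deviations are all close to sigma
for xv in np.unique(x):
    sub = y[x == xv]
    print(f"x={xv:2d}: mean={sub.mean():.2f}, sd={sub.std(ddof=1):.2f}")
```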

  7. Normal Probability Plot
     - The Normal Probability Plot compares the percentage of errors falling in particular bins (e.g. deciles) to the percentage expected from the Normal distribution
     - If the regression assumption is met, the plot should look like a straight line (see the sketch below)
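SPSS produces this plot as part of its regression output. The sketch below, assuming scipy and matplotlib are available, draws the same kind of plot for a vector of residuals:

```python
# A minimal sketch of a normal probability plot of residuals.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
residuals = rng.normal(size=100)   # stand-in for regression residuals

# Points hugging the reference line suggest the normality assumption is met
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```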

  8. Residual Plot
     [Figure: residual plot - residuals (y axis, -10 to 10) against coding productivity (x axis, 4 to 10)]
     - Allows us to check whether the residuals have:
       - a mean of zero
         - observations will be evenly scattered on both sides of the line
       - a constant standard deviation (not dependent on the X value)
         - observations will be evenly scattered horizontally (across the X axis)

     Linear model assumptions (3)
     - A final regression assumption is that the residuals are independent
     - This is equivalent to saying that for all possible pairs Xi, Xj (i ≠ j; i,j = 1, …, n), the errors εi, εj are not autocorrelated, that is, their covariance is null
     - This assumption can be evaluated with the Durbin-Watson statistic, which allows testing the following hypotheses:
       - H0: Cov(εi, εj) = 0 (i ≠ j; i,j = 1, …, n) - residuals are not autocorrelated (they are independent)
       - H1: Cov(εi, εj) ≠ 0 - residuals are autocorrelated (their covariance is not negligible)
     - Note: the Durbin-Watson statistic ranges in value from 0 to 4 (computed in the sketch below)
     SPSS: Analyze / Regression / Linear / Statistics / Residuals / Durbin-Watson
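The Durbin-Watson statistic itself is simple to compute: the sum of squared successive residual differences divided by the sum of squared residuals. A minimal sketch (statsmodels ships the same computation as statsmodels.stats.stattools.durbin_watson):

```python
# Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# It ranges from 0 to 4; values near 2 suggest no autocorrelation.
import numpy as np

def durbin_watson(residuals: np.ndarray) -> float:
    diff = np.diff(residuals)   # successive differences e_t - e_{t-1}
    return np.sum(diff ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(2)
e = rng.normal(size=71)         # stand-in for 71 regression residuals
print(f"d = {durbin_watson(e):.3f}")
```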

  9. Residuals independence: Durbin-Watson test
     - The hypothesis decision is made by comparing the observed value of the statistic (d) with tabled critical values: an upper bound (dU) and a lower bound (dL)
       - SPSS help includes tables by Savin and White for several values of α
       - Table entries are the sample size (n) and the number of independent variables (k)
     - Decision (encoded in the sketch below):
       - d < dL: we reject H0 and accept H1 - residuals are not independent (they are autocorrelated)
       - d > dU: we accept H0 and reject H1 - residuals are independent
       - dL ≤ d ≤ dU: the test is inconclusive

     Durbin-Watson statistic - example
     - d = 1.723 (observed value of the Durbin-Watson statistic)
     - sample size (n) = 71
     - number of independent variables (k) = 2
     - From the Savin and White table for α = 0.01 (in SPSS):
       - dL(2, 70) = 1.400
       - dU(2, 70) = 1.514
     - Conclusion: since d > dU, we accept H0 and reject H1
       - Residuals are independent! The assumption is met ☺
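The decision rule is easy to encode. The sketch below applies it to the example's numbers; it mirrors the slide's rule and reports the inconclusive region explicitly:

```python
# Durbin-Watson decision rule with the slide's critical values
# (dL = 1.400, dU = 1.514 for n = 70, k = 2, alpha = 0.01).
def dw_decision(d: float, dL: float, dU: float) -> str:
    if d < dL:
        return "reject H0: residuals are autocorrelated"
    if d > dU:
        return "accept H0: residuals are independent"
    return "inconclusive: d falls between dL and dU"

print(dw_decision(1.723, dL=1.400, dU=1.514))
# -> accept H0: residuals are independent
```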

  10. Multiple regression model
      - These are models of the type:
        y = β0 + β1·x1 + β2·x2 + … + βp·xp + ε
      - These models require that the independent variables are close to orthogonal (statistically uncorrelated)
        - aka absence of multicollinearity
      - Note: to compare the magnitude of the influences of each independent variable (xi) we must consider the standardized coefficients (aka Beta coefficients)
      SPSS: Analyze / Regression / Linear

      How to detect collinearity problems? (see the VIF sketch below)
      - Method 1: bivariate correlation analysis
        - Collinearity problem: correlations > 75% (typical criterion)
        - This method is only indicative; VIF or Tolerance should also be used
        - SPSS (M1): Analyze / Descriptive Statistics / Crosstabs
      - Method 2: Variance Inflation Factor (VIF)
        - Collinearity problem: VIF > 5 (other authors consider 10)
      - Method 3: Tolerance (T = 1/VIF)
        - Collinearity problem: T < 0.2 (other authors consider 0.1)
      - SPSS (M2, M3): Analyze / Regression / Linear / Statistics / Collinearity diagnostics
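In SPSS these diagnostics come from the Collinearity diagnostics checkbox. An equivalent sketch in Python, assuming statsmodels is installed and using two deliberately correlated, made-up regressors:

```python
# Computing VIF and Tolerance for each regressor. VIF > 5 (or 10) and
# T < 0.2 (or 0.1) flag collinearity problems, per the slide's criteria.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
x1 = rng.normal(size=71)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=71)   # deliberately correlated with x1

X = sm.add_constant(np.column_stack([x1, x2]))   # design matrix with intercept
for i, name in enumerate(["x1", "x2"], start=1): # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")
```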
