Experimental Software Engineering

– Linear Regression –

Fernando Brito e Abreu (fba@di.fct.unl.pt) Universidade Nova de Lisboa (http://www.unl.pt)

QUASAR Research Group (http://ctp.di.fct.unl.pt/QUASAR)

Experimental Software Engineering / Fernando Brito e Abreu 12-May-08

Summary

- Purpose of regression?
- Linear regression - purpose
- First order linear model
- Probabilistic linear relationship
- Residuals
- Least squares method
- Linear model assumptions
- Normal Probability Plot and Residuals Plot
- Residuals independence: Durbin-Watson test
- Multiple regression model
- Linear regression validation
- Inference testing (Regression ANOVA)
- Assessing the influence of each X
- Variable selection methods
- Goodness of fit


Purpose of regression?

To determine whether values of one or more variables are related to the response variable

To predict the value of one variable based on the value of one or more variables

Definitions:

- Dependent variable (aka response or endogenous): the variable that is being predicted or explained
- Independent variable (aka explanatory, regressor or exogenous): the variable that is doing the predicting or explaining


Correlation or Regression?

Use correlation if you are interested only in whether a relationship exists

Use regression if you are interested in:

- building a mathematical model that can predict the response (dependent) variable
- the relative effectiveness of several variables in predicting the response variable


Linear regression - purpose

Is there an association between the two variables?
- Is defect density (DD) change related to code complexity (CC) change?

Estimation of impact
- How much DD change occurs per CC unit change?

Prediction
- If a program loses 20 CC units (e.g. by refactoring it), how much of a drop in DD can be expected?

SPSS: Analyse / Regression / Linear


First order linear model (aka simple regression model)

A deterministic mathematical model between y and x:

y = β0 + β1 * x

β0 is the intercept with the y axis, the point at which x = 0

β1 is the slope of the line, the ratio of the rise divided by the run. It measures the change in y for one unit of change in x

[Figure: observations, regression line, intercept, rise and run]


Probabilistic linear relationship

The relationship between x and y is not always exact: observations do not always fall on a straight line

To accommodate this, we introduce a random error term referred to as epsilon:

y = β0 + β1 * x + ε

ε reflects how individuals deviate from others with the same value of x


Probabilistic linear relationship

The task of regression analysis then is to estimate the parameters b0 and b1 in the equation ŷ = b0 + b1 * x so that the difference between y and ŷ is minimized. This is called the estimated simple linear regression equation

Notes:
- b0 is the estimate for β0 and b1 is the estimate for β1
- ŷ is the estimated (predicted) value of y for a given x value


Residuals

The distance between each observation (red dot) and the solid line (estimated regression line) is called a residual

Residuals are deviations due to the random error term

The regression line is determined by minimizing the sum of the squared residuals

[Figure: observations, regression line, residual]


Least squares method

Criterion: choose b0 and b1 to minimize the sum of squared residuals:

S = Σ (yi − b0 − b1 * xi)²

The regression line is the one that minimizes the sum of the squared distances of each observation to that line

Slope: b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

Intercept: b0 = ȳ − b1 * x̄
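As an illustrative sketch (not part of the original slides), the slope and intercept formulas above translate directly into Python with NumPy; the function name and data points are made up for the example:

```python
import numpy as np

def least_squares_fit(x, y):
    """Estimate b0 and b1 by minimizing the sum of squared residuals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the line passes through the point of means
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Points lying exactly on y = 2 + 3x recover the coefficients exactly
b0, b1 = least_squares_fit([0, 1, 2, 3], [2, 5, 8, 11])
```
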


Linear model assumptions (1)

Y variable:
- is continuous (measured at least on an interval scale)

X variables:
- can be continuous or indicator variables (nominal / ordinal)
- do not need to be normally distributed

Note: this assumption is valid both for simple (one X) and multiple (several Xs) regression models


Linear model assumptions (2)

ε ~ N(0, σ)
- The probability distribution of the error is Normal with mean 0
- The variance is constant, finite and does not depend on the X values

The latter is called the homoscedasticity property
- Homoscedastic = having equal statistical variances [homo (same) + Greek skedastos ("able to be scattered")]

Rephrasing: for each value of X there is a population of Y's that are normally distributed with the same variability
- The population means form a straight line
- Each population has the same variance σ²


Normal Probability Plot

A normal probability plot compares the percent of errors falling in particular bins (e.g. deciles) to the percentage expected from the Normal distribution

If the regression assumption is met, then the plot should look like a straight line
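For instance, SciPy's `probplot` builds exactly this kind of plot; the `r` value it returns measures how close the points lie to a straight line. The residuals below are simulated as a stand-in for real regression errors:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 1.0, size=200)  # stand-in for regression residuals

# probplot pairs ordered residuals with theoretical normal quantiles;
# r close to 1 indicates the points lie near a straight line
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
```

If the residuals were drawn from a skewed distribution instead, `r` would drop noticeably below 1.
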


Residual Plot

Allows us to observe whether residuals have:

- a mean of zero: observations will be evenly scattered on both sides of the line
- a constant standard deviation (not dependent on the X value): observations will be evenly scattered horizontally (across the X axis)

[Figure: residual plot of residuals vs. coding productivity]


Linear model assumptions (3)

A final regression assumption is that residuals are independent

This is equivalent to saying that for all possible pairs of Xi, Xj (i≠j; i,j = 1, …, n), the errors εi, εj are not autocorrelated, that is, their covariance is null

This assumption can be evaluated with the Durbin-Watson statistic, which allows testing the following hypothesis:

H0: Cov(εi, εj) = 0 (i≠j; i,j = 1, …, n): residuals are not autocorrelated (they are independent)

H1: Cov(εi, εj) ≠ 0: residuals are autocorrelated (their covariance is not negligible)

Note: the Durbin-Watson statistic ranges in value from 0 to 4

SPSS: Analyze / Regression / Linear / Statistics / Residuals / Durbin-Watson


Residuals independence: Durbin-Watson test

The hypothesis acceptance decision is performed by comparing the observed value of the statistic (d) with tabled critical values: upper (dU) and lower (dL) bounds

SPSS help includes tables by Savin and White for several values of α
- Table entries are the sample size (n) and the number of independent variables (k)

Decision:
- If d < dL, we reject H0 and accept H1: residuals are not independent (they are autocorrelated)
- If d > dU, we accept H0 and reject H1: residuals are independent

Durbin-Watson statistic

Example:
- d = 1.723 (observed value of the Durbin-Watson statistic)
- sample size (n) = 71
- number of independent variables (k) = 2

From the Savin and White table for α = 0.01 (in SPSS):
- dL (2, 70) = 1.400
- dU (2, 70) = 1.514

Conclusion: since d > dU, we accept H0 and reject H1

Residuals are independent! The assumption is met ☺
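The statistic itself is the ratio of the sum of squared successive residual differences to the sum of squared residuals; for independent residuals it lands near 2. A minimal Python/NumPy sketch (the simulated residuals are illustrative):

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
# Independent (uncorrelated) residuals should give d close to 2
d = durbin_watson(rng.normal(size=500))
```

Positively autocorrelated residuals push d toward 0; negatively autocorrelated residuals push it toward 4.
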


Multiple regression model

These are models of the type:

y = β0 + β1 * x1 + β2 * x2 + … + βp * xp + ε

These models require that the independent variables are nearly orthogonal (statistically uncorrelated)
- Aka: absence of multicollinearity

SPSS: Analyse / Regression / Linear

Note: to compare the magnitude of the influences of each independent variable (xi) we must consider the standardized coefficients (aka Beta coefficients)


How to detect collinearity problems?

Method 1: Bivariate correlation analysis
- Collinearity problem: correlations > 75% (typical criterion)
- This method is only indicative; VIF or Tolerance should also be used
- SPSS (M1): Analyze / Correlate / Bivariate

Method 2: Variance Inflation Factor (VIF)
- Collinearity problem: VIF > 5 (other authors consider 10)

Method 3: Tolerance (T = 1/VIF)
- Collinearity problem: T < 0.2 (other authors consider 0.1)

SPSS (M2, M3): Analyze / Regression / Linear / Statistics / Collinearity diagnostics
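A hand-rolled sketch of Method 2 in Python with NumPy (the helper function and simulated predictors are illustrative, not SPSS output): the VIF for column i is 1 / (1 − Ri²), where Ri² comes from regressing that column on the remaining ones:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the remaining columns."""
    X = np.asarray(X, float)
    n, k = X.shape
    out = []
    for i in range(k):
        xi = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xi, rcond=None)
        resid = xi - others @ beta
        r2 = 1.0 - resid.var() / xi.var()   # R^2 of xi on the other predictors
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)               # independent of both
vifs = vif(np.column_stack([x1, x2, x3]))
```

Here the first two VIFs blow up past the VIF > 5 criterion, while the third stays near 1 (tolerance close to 1).
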


Linear regression validation

Consider the total variability of Y: SST = SSR + SSE

- SST = total sum of squares of deviations relative to the mean of Y (observations and mean of observations)
- SSR = sum of squares of deviations due to regression (predicted values and mean of observations)
- SSE = sum of squares of deviations due to error (predicted values and observations)

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²

[Figure: observation, regression line, mean of Y; deviation from the mean of Y splits into a deviation due to regression plus a deviation due to error]
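The decomposition can be checked numerically; a small Python/NumPy sketch with simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)

# Fit the least-squares line, then split the total variability of y
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)   # total
ssr = np.sum((y_hat - y.mean()) ** 2)   # due to regression
sse = np.sum((y - y_hat) ** 2)      # due to error
# sst equals ssr + sse (up to floating-point rounding)
```
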


Estimating residuals parameters

Mean square error (MSE) is an estimate of the residuals variance σ²:

s² = MSE = SSE / (n − 2)

where:

SSE = Σ (yi − ŷi)² = Σ (yi − b0 − b1 * xi)²

Root mean square error is an estimate of the residuals standard deviation σ
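A short Python/NumPy sketch (simulated data with a known error standard deviation of 1, so the root mean square error should land near 1):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 10, n)
y = 4.0 + 1.5 * x + rng.normal(0, 1.0, n)   # true sigma = 1

# Fit by least squares, then estimate sigma^2 from the residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)

mse = sse / (n - 2)     # estimate of sigma^2
rmse = np.sqrt(mse)     # estimate of sigma
```
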


Inference testing (Regression ANOVA)

The hypothesis that the regression equation explains the variation in Y can be tested using the F test (F statistic)

H0: β1 = β2 = … = βp = 0
H1: ∃ i: βi ≠ 0 (i = 1, …, p)

F = MSR / MSE

Notes:
- Small MSR -> the deviations due to regression are negligible
- Large MSE -> observations are far from the regression line (too scattered)

If p ≤ α (significant F statistic): reject H0 and accept H1 (the regression explains the variation in Y)

If p > α (not significant F statistic): accept H0 and therefore reject H1 (invalid regression)
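A minimal sketch of the test in Python (SciPy assumed for the F distribution; data simulated, with MSR = SSR/p and MSE = SSE/(n − p − 1) as in standard regression ANOVA):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0, 10, n)
y = 1.0 + 0.8 * x + rng.normal(0, 1, n)   # a genuine linear relationship

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

p = 1                                            # one independent variable
msr = np.sum((y_hat - y.mean()) ** 2) / p        # mean square due to regression
mse = np.sum((y - y_hat) ** 2) / (n - p - 1)     # mean square error
F = msr / mse
p_value = stats.f.sf(F, p, n - p - 1)            # upper-tail probability
```

Since the data really do follow a linear model, the p-value falls below any reasonable α and H0 is rejected.
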


Assessing the influence of each X

If the F statistic is significant, we can only deduce that at least one of the independent variables affects Y
- But it doesn't say which X (one or more) does affect it

To assess the individual relevance of each X variable to explain the behavior of Y we can use the t statistic!


Student's t statistic

The hypothesis that a coefficient βi (intercept or Xi coefficient) is zero can be tested using the Student's t statistic:

t = bi / σ(bi)

This hypothesis must be tested for all i ∈ [1, …, p]
- p is the regression order (the number of X variables)

Note: the standard error σ(bi) depends on (i) sample size, (ii) how well the estimated line fits the points and (iii) how spread out the range of x values is

Hypothesis:
H0: βi = 0 (there is no relation between x and y)
H1: βi ≠ 0 (a relation between x and y exists)


Student's t statistic: significance testing

It is possible to demonstrate that the t statistic has a Student's t distribution with n − p degrees of freedom
- n is the number of observations
- p is the number of independent variables

Significant t statistic:
- Condition: |t| > tα/2
- In this case we reject H0 and accept H1: we can say that the coefficient affects Y

tα/2 (cut-off value) is based on a Student's t distribution with n − p degrees of freedom (df), which is usually tabled


Tail probabilities of the t-Student distribution

One tail:   0.10      0.05      0.025     0.01      0.005     0.001    0.0005
Two tails:  0.20      0.10      0.05      0.02      0.01      0.002    0.001

 df
  1   3.077684  6.313752  12.70620  31.82052  63.65674  318.300  636.6192
  2   1.885618  2.919986   4.30265   6.96456   9.92484   22.330   31.5991
  3   1.637744  2.353363   3.18245   4.54070   5.84091   10.210   12.9240
  4   1.533206  2.131847   2.77645   3.74695   4.60409    7.173    8.6103
  5   1.475884  2.015048   2.57058   3.36493   4.03214    5.893    6.8688
  6   1.439756  1.943180   2.44691   3.14267   3.70743    5.208    5.9588
  7   1.414924  1.894579   2.36462   2.99795   3.49948    4.785    5.4079
  8   1.396815  1.859548   2.30600   2.89646   3.35539    4.501    5.0413
  9   1.383029  1.833113   2.26216   2.82144   3.24984    4.297    4.7809
 10   1.372184  1.812461   2.22814   2.76377   3.16927    4.144    4.5869
 11   1.363430  1.795885   2.20099   2.71808   3.10581    4.025    4.4370
 12   1.356217  1.782288   2.17881   2.68100   3.05454    3.930    4.3178
 13   1.350171  1.770933   2.16037   2.65031   3.01228    3.852    4.2208
 14   1.345030  1.761310   2.14479   2.62449   2.97684    3.787    4.1405
 15   1.340606  1.753050   2.13145   2.60248   2.94671    3.733    4.0728
 16   1.336757  1.745884   2.11991   2.58349   2.92078    3.686    4.0150
 17   1.333379  1.739607   2.10982   2.56693   2.89823    3.646    3.9651
 18   1.330391  1.734064   2.10092   2.55238   2.87844    3.610    3.9216
 19   1.327728  1.729133   2.09302   2.53948   2.86093    3.579    3.8834
 20   1.325341  1.724718   2.08596   2.52798   2.84534    3.552    3.8495
 21   1.323188  1.720743   2.07961   2.51765   2.83136    3.527    3.8193
 22   1.321237  1.717144   2.07387   2.50832   2.81876    3.505    3.7921
 23   1.319460  1.713872   2.06866   2.49987   2.80734    3.485    3.7676
 24   1.317836  1.710882   2.06390   2.49216   2.79694    3.467    3.7454
 25   1.316345  1.708141   2.05954   2.48511   2.78744    3.450    3.7251
 26   1.314972  1.705618   2.05553   2.47863   2.77871    3.435    3.7066
 27   1.313703  1.703288   2.05183   2.47266   2.77068    3.421    3.6896
 28   1.312527  1.701131   2.04841   2.46714   2.76326    3.408    3.6739
 29   1.311434  1.699127   2.04523   2.46202   2.75639    3.396    3.6594
 30   1.310415  1.697261   2.04227   2.45726   2.75000    3.385    3.6460
 32   1.309     1.694      2.037     2.449     2.738      3.365    3.622
 34   1.307     1.691      2.032     2.441     2.728      3.348    3.601
 36   1.306     1.688      2.028     2.434     2.719      3.333    3.582
 38   1.304     1.686      2.024     2.429     2.712      3.319    3.566
 40   1.303     1.684      2.021     2.423     2.704      3.307    3.551
 42   1.302     1.682      2.018     2.418     2.698      3.296    3.538
 44   1.301     1.680      2.015     2.414     2.692      3.286    3.526
 46   1.300     1.679      2.013     2.410     2.687      3.277    3.515
 48   1.299     1.677      2.011     2.407     2.682      3.269    3.505
 50   1.299     1.676      2.009     2.403     2.678      3.261    3.496
 55   1.297     1.673      2.004     2.396     2.668      3.245    3.476
 60   1.296     1.671      2.000     2.390     2.660      3.232    3.460
 65   1.295     1.669      1.997     2.385     2.654      3.220    3.447
 70   1.294     1.667      1.994     2.381     2.648      3.211    3.435
 80   1.292     1.664      1.990     2.374     2.639      3.195    3.416
100   1.290     1.660      1.984     2.364     2.626      3.174    3.390
120   1.289     1.658      1.980     2.358     2.617      3.160    3.373
150   1.287     1.655      1.976     2.351     2.609      3.145    3.357
200   1.286     1.653      1.972     2.345     2.601      3.131    3.340
  ∞   1.281552  1.644854   1.95996   2.32635   2.57583    3.090    3.2905


Student's t statistic: significance testing

Example 1: 3 independent variables, 30 cases, 95% confidence interval
- α = 0.05 => α/2 = 0.025 (one-tail probability)
- df = n − p = 30 − 3 = 27
- => t(0.975, df=27) = 2.052

Example 2: 3 independent variables, 110 cases, 99% confidence interval
- df = n − p = 110 − 3 = 107 (we take the closest tabled lower n)
- α = 0.01 => α/2 = 0.005 (one-tail probability)
- => t(0.995, df=100) = 2.626
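Instead of the printed table, the cut-off values can be obtained from the t distribution directly; a SciPy sketch reproducing both examples:

```python
from scipy import stats

# Example 1: alpha = 0.05 two-sided, df = 27
t1 = stats.t.ppf(1 - 0.05 / 2, 27)    # cut-off close to 2.052

# Example 2: alpha = 0.01 two-sided, df = 100
t2 = stats.t.ppf(1 - 0.01 / 2, 100)   # cut-off close to 2.626
```

Note that `ppf` takes the cumulative probability (1 − α/2), which matches the t(0.975, …) and t(0.995, …) notation used above.
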


Variable selection methods (1/2)

Method selection allows you to specify how independent variables are entered into the analysis
- Using different methods, it is possible to construct a variety of regression models from the same set of variables

Enter (Regression): all variables in a block are entered in a single step

Stepwise: at each step, the independent variable not in the equation which has the smallest probability of F is entered, if that probability is sufficiently small
- Variables already in the regression equation are removed if their probability of F becomes sufficiently large. The method terminates when no more variables are eligible for inclusion or removal.

Remove: all variables in a block are removed in a single step


Variable selection methods (2/2)

Backward Elimination: all variables are entered into the equation and then sequentially removed
- The variable with the smallest partial correlation with the dependent variable is considered first for removal. If it meets the criterion for elimination, it is removed. After the first variable is removed, the variable remaining in the equation with the smallest partial correlation is considered next. The procedure stops when there are no variables in the equation that satisfy the removal criteria.

Forward Selection: variables are sequentially entered into the model
- The first variable considered for entry into the equation is the one with the largest positive or negative correlation with the dependent variable. This variable is entered into the equation only if it satisfies the criterion for entry. If the first variable is entered, the independent variable not in the equation that has the largest partial correlation is considered next. The procedure stops when there are no variables that meet the entry criterion.


Goodness of fit

The best-known goodness of fit (adjustment) measure of a regression model is the coefficient of determination R²:

R² = SSR / SST

- Corresponds to the percentage of the variation of Y explained by X
- 1 − R² is the variation left unexplained
- A small value of R² means Y is not related to X, or is related in a non-linear fashion

What is considered a good adjustment?
- Exact sciences: R² > 90%
- Social sciences: R² > 50% (acceptable, not necessarily good)


Goodness of fit (continued)

The addition of a new variable tends to increase R²
- This may lead to over-specified models

Adjusted R² shows the value of R² after adjustment for a given degree of freedom (sample size)
- The addition of a new variable only increases the Adjusted R² if the model produces a better fit (lower error variance)
- It protects against having an artificially high R² obtained by increasing the number of variables in the model
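The usual adjustment formula is Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1); a small Python sketch showing the penalty growing with the number of predictors (the numbers are illustrative):

```python
def adjusted_r2(r2, n, p):
    """Penalize R^2 for the number of predictors p, given sample size n."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# The same raw R^2 = 0.60 is worth less when more predictors were needed
a1 = adjusted_r2(0.60, n=30, p=2)    # mild penalty
a2 = adjusted_r2(0.60, n=30, p=10)   # heavy penalty
```
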


Test the model using the Jack-Knifing technique

It allows comparing the estimated values with the real ones, obtaining the relative error:

(i) remove one of the cases from the sample
(ii) recompute the parameters of the estimation models
(iii) using the models obtained in the previous step, compute the estimates for the case that was left out in the first step
(iv) compare the estimates produced in the previous step with the observed "real" values
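The four steps above can be sketched for a simple regression model in Python with NumPy (data and names illustrative):

```python
import numpy as np

def jackknife_relative_errors(x, y):
    """Leave one case out, refit the line, predict it, record the relative error."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    errors = []
    for i in range(len(x)):
        # (i) remove case i; (ii) recompute b0, b1 on the remaining cases
        xs, ys = np.delete(x, i), np.delete(y, i)
        b1 = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
        b0 = ys.mean() - b1 * xs.mean()
        # (iii) estimate the left-out case; (iv) compare with the real value
        y_hat = b0 + b1 * x[i]
        errors.append(abs(y_hat - y[i]) / abs(y[i]))
    return errors

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 30)
y = 5.0 + 2.0 * x + rng.normal(0, 0.5, 30)
errs = jackknife_relative_errors(x, y)
```

A model that generalizes well yields small relative errors for every left-out case.
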