ST 430/514 Introduction to Regression Analysis/Statistics for Management and the Social Sciences II

Residual Analysis

Inferences about a regression model are valid only under assumptions about the random errors in the observations.

Objectives:
Show how residuals reveal departures from assumptions;
Suggest procedures for coping with such departures.

Regression Residuals

The random errors $\epsilon$ satisfy $Y = E(Y) + \epsilon$, or $\epsilon = Y - E(Y)$. We observe $Y$, but we do not know $E(Y)$, so we cannot calculate $\epsilon$. We estimate $E(Y)$ by $\hat{y}$, the predicted (or fitted) value. We approximate the random errors by regression residuals: $\hat{\epsilon}_i = y_i - \hat{y}_i$, $i = 1, 2, \ldots, n$.
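
As a minimal sketch of this definition, the R code below fits a straight line to simulated data and checks that residuals() returns observed minus fitted values. The data and the names x, y, fit and res_by_hand are illustrative assumptions, not part of the course data sets.

```r
# Simulated data: illustrative only, not one of the course data sets.
set.seed(430)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)

fit <- lm(y ~ x)

# Residuals computed from the definition: observed minus fitted.
res_by_hand <- y - fitted(fit)

# They agree with what residuals() reports (names are dropped before comparing).
same <- all.equal(unname(res_by_hand), unname(residuals(fit)))
print(same)
```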

Properties of residuals

If the model contains an intercept, the sum of the residuals, and also their mean, is zero:

$\sum_{i=1}^{n} \hat{\epsilon}_i = 0$, and so $\bar{\hat{\epsilon}} = 0$.

The covariance of the residuals and any term in the regression model is zero:

$\sum_{i=1}^{n} \hat{\epsilon}_i x_{i,j} = 0, \quad j = 1, 2, \ldots, k.$
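
These two properties can be checked numerically. The sketch below uses simulated data with two predictors (x1, x2 and the coefficient values are made up for illustration) and shows that both sums are zero up to rounding error when the model has an intercept.

```r
# Simulated data: illustrative only.
set.seed(514)
n <- 40
x1 <- rnorm(n)
x2 <- runif(n)
y <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)   # lm() includes an intercept by default
e <- residuals(fit)

sum_e    <- sum(e)        # sum of residuals
sum_e_x1 <- sum(e * x1)   # sum of residual times first predictor
sum_e_x2 <- sum(e * x2)   # sum of residual times second predictor
print(c(sum_e, sum_e_x1, sum_e_x2))

# Without an intercept the residuals need not sum to zero.
fit0 <- lm(y ~ 0 + x1 + x2)
print(sum(residuals(fit0)))
```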

Detecting Lack of Fit

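A misspecified model is one that leaves out a relevant predictor. For such a model the errors do not have mean zero at every value of x, so the residuals show a systematic pattern when plotted against x (even though, with an intercept, they still average to zero overall). Example: serum cholesterol (y) and dietary fat (x) in Olympic athletes.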

ath <- read.table("Text/Exercises&Examples/OLYMPIC.txt", header = TRUE)
pairs(ath)

Suppose we ignore the graph, and fit a first-order model:

l1 <- lm(CHOLESTEROL ~ FAT, ath)
summary(l1)
plot(ath$FAT, residuals(l1))

The summary of the fitted model looks reasonable. But the graph of the residuals against x shows that the assumption $E(\epsilon) = 0$ is violated. Because this is a straight-line model, this graph is effectively the same as the “residuals versus fitted value” graph from plot(l1).

Because of the curvature, we could fit the second-order (quadratic) model:

l2 <- lm(CHOLESTEROL ~ FAT + I(FAT^2), ath)
summary(l2)
plot(ath$FAT, residuals(l2))

The residual plot suggests that the model is satisfactory. The quadratic term is highly significant.
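
The same diagnosis can be tried without the OLYMPIC.txt file. The sketch below simulates data with a curved mean (the column names mimic the example, but the values and coefficients are invented). It then compares the average residual in the middle of the FAT range with the average at the two ends, for the first-order and second-order fits.

```r
# Simulated data with a curved mean; values are made up for illustration.
set.seed(1)
n <- 100
sim <- data.frame(FAT = runif(n, 0, 10))
sim$CHOLESTEROL <- 5 + 2 * sim$FAT - 0.3 * sim$FAT^2 + rnorm(n, sd = 0.5)

fit1 <- lm(CHOLESTEROL ~ FAT, sim)              # first-order (misspecified)
fit2 <- lm(CHOLESTEROL ~ FAT + I(FAT^2), sim)   # second-order

# Average residual in the middle of the FAT range versus the two ends.
middle <- sim$FAT > 3 & sim$FAT < 7
mid_mean1 <- mean(residuals(fit1)[middle])
end_mean1 <- mean(residuals(fit1)[!middle])
mid_mean2 <- mean(residuals(fit2)[middle])
print(c(mid_mean1, end_mean1, mid_mean2))

# F test for the added quadratic term.
p_quad <- anova(fit1, fit2)[2, "Pr(>F)"]
print(p_quad)

# Residual plot for the first-order fit (only drawn in an interactive session).
if (interactive()) {
  plot(sim$FAT, residuals(fit1))
  abline(h = 0, lty = 2)
}
```

For the straight-line fit the residuals are systematically positive in the middle and negative at the ends, which is the curvature the residual plot reveals; for the quadratic fit that pattern should largely disappear.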

Partial residuals

Sometimes the effect of an independent variable is better described by a transformed version: log(x), 1/x, etc. The partial residual plot can help identify the transformation.

The partial residuals for independent variable $x_j$ are $\hat{\epsilon}^* = \hat{\epsilon} + \hat{\beta}_j x_j$.

Plot $\hat{\epsilon}^*$ against $x_j$. This is also known as a “Component + Residual” plot.
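
A minimal sketch of the partial-residual formula in base R follows. The data are simulated, with a price effect that acts through 1/PRICE; the column names and coefficients are assumptions chosen for illustration. The check at the end uses the fact that regressing the partial residuals on $x_j$ returns the same slope $\hat{\beta}_j$, because the ordinary residuals are uncorrelated with $x_j$.

```r
# Simulated demand data; the true price effect is through 1/PRICE.
set.seed(2)
n <- 60
sim <- data.frame(PRICE = runif(n, 1, 5), AD = runif(n, 0, 10))
sim$DEMAND <- 100 + 300 / sim$PRICE + 4 * sim$AD + rnorm(n, sd = 5)

fit <- lm(DEMAND ~ PRICE + AD, sim)

# Partial residuals for PRICE: residual plus the fitted PRICE component.
b_price <- coef(fit)[["PRICE"]]
partial_price <- residuals(fit) + b_price * sim$PRICE

# Regressing the partial residuals on PRICE recovers the same slope.
slope_check <- coef(lm(partial_price ~ sim$PRICE))[[2]]
print(c(b_price, slope_check))

# Partial residual plot with a smooth curve (interactive sessions only).
if (interactive()) {
  plot(sim$PRICE, partial_price)
  lines(lowess(sim$PRICE, partial_price), lty = 2)
}
```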

Example: effect of price (p) and advertising ($x_2$) on demand (y) for coffee.

coffee <- read.table("Text/Exercises&Examples/COFFEE2.txt", header = TRUE)
pairs(coffee)

Try a first-order model:

l1 <- lm(DEMAND ~ PRICE + AD, coffee)
summary(l1)
plot(coffee$PRICE, residuals(l1))

The residual plot shows misspecification.

The Component + Residual plot:

library(car)
crPlot(l1, variable = "PRICE")

The curve suggests either adding PRICE^2, or transforming to log(PRICE) or 1/PRICE.

$R^2$ and $R^2_a$ are highest for 1/PRICE.

Note: the partial regression plot (also called an added-variable plot) is a different diagnostic.
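
One way to compare the candidate forms is to fit each and tabulate $R^2$ and $R^2_a$. The sketch below does this on simulated data, not on COFFEE2.txt. Because the simulated demand is generated from 1/PRICE, that form would be expected to score best here; this is a property of the made-up data, not a result about the coffee data set.

```r
# Simulated demand data (illustrative values only).
set.seed(3)
n <- 60
sim <- data.frame(PRICE = runif(n, 1, 5), AD = runif(n, 0, 10))
sim$DEMAND <- 100 + 300 / sim$PRICE + 4 * sim$AD + rnorm(n, sd = 5)

# Candidate specifications for the PRICE effect.
fits <- list(
  linear    = lm(DEMAND ~ PRICE + AD, sim),
  quadratic = lm(DEMAND ~ PRICE + I(PRICE^2) + AD, sim),
  log       = lm(DEMAND ~ log(PRICE) + AD, sim),
  inverse   = lm(DEMAND ~ I(1 / PRICE) + AD, sim)
)

# R-squared and adjusted R-squared for each candidate.
r2     <- sapply(fits, function(f) summary(f)$r.squared)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(round(cbind(r2, adj_r2), 4))
```

Adjusted $R^2$ is the fairer comparison here because the quadratic model uses one more parameter than the others.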

Detecting Unequal Variances

Homoscedasticity versus heteroscedasticity; that is, constant variance versus varying variance.

When the variance is not constant, it is most often related to the mean.

For Poisson-distributed data (counts), $\mathrm{var}(Y) = E(Y)$.

When errors are multiplicative, $Y = E(Y) \times (1 + \epsilon)$, and $\mathrm{var}(Y) \propto E(Y)^2$.

Sometimes the variance can be made constant by transforming $Y$. For example, with multiplicative errors,

$\log(Y) = \log[E(Y) \times (1 + \epsilon)] = \log[E(Y)] + \log[1 + \epsilon] \approx \log[E(Y)] + \epsilon$

when $\epsilon$ is small, so $\mathrm{var}(\log Y)$ is (approximately) constant.

In other cases a transformation can make the variance constant, but a different method may be better. For example, with Poisson-distributed counts, $\sqrt{Y}$ has approximately constant variance, but a generalized linear model may be more satisfactory.
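
Both variance-stabilizing claims can be illustrated by simulation. In the sketch below the mean levels (10 and 100), the error standard deviation (0.1) and the sample size are arbitrary choices made for illustration.

```r
set.seed(4)
n <- 5000

# Multiplicative errors at a low and a high mean level.
y_low  <- 10  * (1 + rnorm(n, sd = 0.1))
y_high <- 100 * (1 + rnorm(n, sd = 0.1))
sd_ratio_raw <- sd(y_high) / sd(y_low)            # close to 100 / 10
sd_ratio_log <- sd(log(y_high)) / sd(log(y_low))  # close to 1
print(c(sd_ratio_raw, sd_ratio_log))

# Poisson counts: the variance tracks the mean on the raw scale,
# and is roughly 1/4 at both levels after taking square roots.
c_low  <- rpois(n, lambda = 10)
c_high <- rpois(n, lambda = 100)
var_raw  <- c(var(c_low), var(c_high))
var_sqrt <- c(var(sqrt(c_low)), var(sqrt(c_high)))
print(rbind(var_raw, var_sqrt))
```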

Example: salary and experience for social workers.

workers <- read.table("Text/Exercises&Examples/SOCWORK.txt", header = TRUE)
pairs(workers)

Try a second-order model:

l2 <- lm(SALARY ~ EXP + I(EXP^2), workers)
summary(l2)
plot(l2)

The “Residuals vs Fitted” plot shows a fan-shaped scatter, and the “Scale-Location” plot shows an upward trend. It suggests $\mathrm{sd}(Y) \propto E(Y)$, hence $\mathrm{var}(Y) \propto E(Y)^2$, so try logarithms.

Second-order model for log(SALARY):

lLog2 <- lm(log(SALARY) ~ EXP + I(EXP^2), workers)
summary(lLog2)

The quadratic term is not significant, so try a first-order model:

lLog1 <- lm(log(SALARY) ~ EXP, workers)
summary(lLog1)
plot(lLog1)

The residual plots are more satisfactory.
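
The effect of the log transformation can be reproduced on simulated data with multiplicative errors. In the sketch below the column names follow the example but all values are invented, and spread_trend is a hypothetical helper defined only for this illustration. It uses the rank correlation between absolute residuals and fitted values as a rough numerical stand-in for the trend in a Scale-Location plot.

```r
# Simulated salaries with multiplicative errors (illustrative values only).
set.seed(5)
n <- 500
sim <- data.frame(EXP = runif(n, 1, 30))
sim$SALARY <- exp(9.8 + 0.05 * sim$EXP + rnorm(n, sd = 0.2))

fit_raw <- lm(SALARY ~ EXP + I(EXP^2), sim)   # second-order model, raw scale
fit_log <- lm(log(SALARY) ~ EXP, sim)         # first-order model, log scale

# Hypothetical helper: rank correlation between |residual| and fitted value.
spread_trend <- function(fit) {
  cor(abs(residuals(fit)), fitted(fit), method = "spearman")
}
trend_raw <- spread_trend(fit_raw)
trend_log <- spread_trend(fit_log)
print(c(trend_raw, trend_log))

# Scale-Location plots for both fits (interactive sessions only).
if (interactive()) {
  plot(fit_raw, which = 3)
  plot(fit_log, which = 3)
}
```

A clearly positive value on the raw scale and a value near zero on the log scale correspond to the fan shape disappearing after the transformation.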

Simple Test for Heteroscedasticity

Divide the data set in two, for instance low fitted values versus high fitted values. Fit the model separately to each part, and compare the MSEs (Mean Squares for Error).

Under $H_0$: variance is constant,

$F^* = \frac{\mathrm{MSE}_1}{\mathrm{MSE}_2}$

has the F-distribution with $\nu_1 = n_1 - (k + 1)$ and $\nu_2 = n_2 - (k + 1)$ degrees of freedom.

This is usually a two-sided test; $H_a$: variance is not constant. Reject $H_0$ at level $\alpha$ if $F^*$ differs too far from 1 in either direction; that is, if $F^* < F_{1-\alpha/2}(\nu_1, \nu_2)$, the lower $\alpha/2$-point of the distribution, or $F^* > F_{\alpha/2}(\nu_1, \nu_2)$, the upper $\alpha/2$-point of the distribution.

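Note: $F_{1-\alpha/2}(\nu_1, \nu_2) = 1/F_{\alpha/2}(\nu_2, \nu_1)$, so an equivalent method is based on

$F = \frac{\text{Larger MSE}}{\text{Smaller MSE}} = \max\left(F^*, \frac{1}{F^*}\right).$

Then we reject $H_0$ if $F > F_{\alpha/2}(\nu_{\text{Larger}}, \nu_{\text{Smaller}})$.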
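
The test can be sketched in R as follows. split_f_test is a hypothetical helper written for this illustration, not a function from any package. Splitting at the median fitted value is one possible choice of "low versus high", and the data are simulated. In R, qf(1 - alpha/2, ...) gives the upper $\alpha/2$-point $F_{\alpha/2}$ and qf(alpha/2, ...) gives the lower $\alpha/2$-point $F_{1-\alpha/2}$.

```r
# Two-sided split-sample F test for constant variance (illustrative helper).
split_f_test <- function(formula, data, alpha = 0.05) {
  full <- lm(formula, data)
  high <- fitted(full) > median(fitted(full))   # low versus high fitted values
  fit1 <- lm(formula, data[!high, ])
  fit2 <- lm(formula, data[high, ])
  mse1 <- summary(fit1)$sigma^2
  mse2 <- summary(fit2)$sigma^2
  nu1 <- df.residual(fit1)                      # n1 - (k + 1)
  nu2 <- df.residual(fit2)                      # n2 - (k + 1)
  f_star <- mse1 / mse2
  lower <- qf(alpha / 2, nu1, nu2)              # lower alpha/2 point
  upper <- qf(1 - alpha / 2, nu1, nu2)          # upper alpha/2 point
  # Equivalent form: larger MSE over smaller MSE, with df in matching order.
  if (f_star >= 1) {
    f_big <- f_star
    crit_big <- qf(1 - alpha / 2, nu1, nu2)
  } else {
    f_big <- 1 / f_star
    crit_big <- qf(1 - alpha / 2, nu2, nu1)
  }
  list(f_star = f_star, nu = c(nu1, nu2), lower = lower, upper = upper,
       reject = f_star < lower || f_star > upper,
       f_big = f_big, reject_big = f_big > crit_big)
}

# Simulated salaries with multiplicative errors (illustrative values only).
set.seed(6)
n <- 200
sim <- data.frame(EXP = runif(n, 1, 30))
sim$SALARY <- exp(9.8 + 0.05 * sim$EXP + rnorm(n, sd = 0.2))

res_raw <- split_f_test(SALARY ~ EXP + I(EXP^2), sim)
res_log <- split_f_test(log(SALARY) ~ EXP, sim)
print(res_raw[c("f_star", "lower", "upper", "reject")])
print(res_log[c("f_star", "lower", "upper", "reject")])
```

The two decision rules returned as reject and reject_big agree by construction, which mirrors the reciprocal identity in the note above.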
