Regression Diagnostics Introduction to Regression 1 Why do we need - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Regression Diagnostics

Introduction to Regression

1

slide-2
SLIDE 2

Why do we need to do all this?

  • Theory based on assumptions
  • Focuses on the residuals and fitted values
  • Validate the model
  • Gives us clues how to change the model
  • Is it appropriate?
  • Lots of statistical tests based on certain assumptions

2

slide-3
SLIDE 3

What shall we look at?

  • Calculate residuals for each case
  • Observed value – Predicted value
  • Standardise them by dividing by their SD (approx.)

  • Different types of standardised residuals
  • We need to do a series of plots
  • Remember the model
  • Constant variance
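A minimal sketch of the residual calculations listed above, using NumPy only. The data are invented for illustration, and reading "(approx.)" as "divide by the residual SD alone" versus "also adjust for leverage" is an assumption about what the slide intends.

```python
import numpy as np

# Invented data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3])

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares coefficients
fitted = X @ beta
resid = y - fitted                               # observed value - predicted value

n, k = X.shape                                   # k = number of predictors + 1
s = np.sqrt(resid @ resid / (n - k))             # residual standard deviation

# Leverage h_i: diagonal of the hat matrix X (X'X)^{-1} X'
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

crude = resid / s                                # approximate version: divide by s only
standardised = resid / (s * np.sqrt(1.0 - h))    # adjusts each residual for its leverage
```

The two versions differ most for cases with high leverage, where the raw residual tends to understate how unusual the point is.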

3

slide-4
SLIDE 4

4

slide-5
SLIDE 5

Standardised Residuals

  • Should be
    – Size???
    – Independent
    – Normally distributed
    – Constant
    – Unrelated to the fitted values
    – Unrelated to the independent variables

5

slide-6
SLIDE 6

Plots to compute

  • Normal probability plot of residuals
  • Look for large standardised residuals
  • Check values
  • Plot residuals vs fitted/predicted values
  • Plot residuals vs each independent variable
  • Plot residuals against time (if that is appropriate)
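The coordinates behind a normal probability plot can be sketched without any plotting library. The function name `normal_plot_points` and the Blom plotting positions are choices made for this illustration (statistical packages use slightly different conventions), and the residuals are simulated.

```python
import numpy as np
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (theoretical normal quantiles, ordered residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Blom plotting positions; one of several common conventions.
    probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    q = np.array([NormalDist().inv_cdf(p) for p in probs])
    return q, r

rng = np.random.default_rng(0)
sample = rng.normal(size=200)             # simulated residuals, illustration only
q, r = normal_plot_points(sample)

# Correlation between the two axes: close to 1 when the points lie near a line.
straightness = np.corrcoef(q, r)[0, 1]
```

Plotting `r` against `q` gives the usual picture; marked curvature or a few points far off the line are the things to look for.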

6

slide-7
SLIDE 7

7

slide-8
SLIDE 8

8

slide-9
SLIDE 9

9

slide-10
SLIDE 10

10

slide-11
SLIDE 11

The regression equation is sqrtrooms = 0.200 + 1.90 sqrtcrews

11

slide-12
SLIDE 12

Leverage

  • Measures the distance from the x-values to the mean of the x-values
  • May influence results
  • p predictors
  • High values for leverage: > 2(p+1)/n

  • Be careful here
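A short sketch of the leverage rule, assuming the cutoff is 2(p+1)/n with p predictors and n cases. The x-values are invented, with the last one placed far from the rest, and the helper name `leverage` is hypothetical.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix for a design matrix X that includes the intercept."""
    Q, _ = np.linalg.qr(X)               # thin QR, so the hat matrix is Q Q'
    return np.sum(Q**2, axis=1)

# Invented x-values; the last one sits far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
n, p = len(x), 1                         # p = number of predictors

h = leverage(X)
cutoff = 2 * (p + 1) / n
flagged = np.where(h > cutoff)[0]        # indices of high-leverage cases

# With one predictor, leverage is a distance from the mean of x:
# h_i = 1/n + (x_i - xbar)^2 / Sxx
h_formula = 1 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
```

The closed form makes the first bullet concrete: leverage depends only on how far each x-value is from the mean of the x-values, not on y.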

12

slide-13
SLIDE 13

Outliers and Bad leverage points

  • Examine them and see if they are different
  • Flag a problem with model
  • Consider fitting another model

13

slide-14
SLIDE 14

Cook's distance

  • Measures the influence of an observation on the set of regression coefficients. Influential observations can be leverage points, outliers, or both.
  • Look for gaps
  • Function of leverage and standardised residuals
  • Suggested cutoffs are 4/(n-2).
  • What happens when you omit points

14

slide-15
SLIDE 15

Made-up data

SRESID   Leverage   Cooks
 1.54    0.26       0.42
-4.35    0.26       3.35

15
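The SRESID, Leverage and Cooks figures quoted for the made-up data can be checked against the usual formula for Cook's distance as a function of leverage and the standardised residual. This sketch assumes a single predictor (p = 1); the function name is hypothetical.

```python
def cooks_distance(std_resid, leverage, p):
    """Cook's distance from a standardised residual and a leverage value.

    p is the number of predictors, so p + 1 parameters are estimated.
    """
    return (std_resid**2 / (p + 1)) * (leverage / (1.0 - leverage))

# Inputs taken from the slide, which quotes them to two decimals.
d_first = cooks_distance(1.54, 0.26, p=1)    # slide reports 0.42
d_second = cooks_distance(4.35, 0.26, p=1)   # slide reports 3.35; the sign of the residual does not matter
```

The first value rounds to the quoted 0.42. The second comes out slightly below the quoted 3.35, which would be consistent with the slide's inputs having been rounded before display.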

slide-16
SLIDE 16

16

slide-17
SLIDE 17

Results

17

Including all points

slide-18
SLIDE 18

18

slide-19
SLIDE 19

And more

19

Without x=20, and y=10 point

slide-20
SLIDE 20

20

slide-21
SLIDE 21

21

Without point x=20, y=95

slide-22
SLIDE 22

DFITS

  • Measures the influence of each observation on the fitted values
  • Roughly the number of standard deviations that the fitted value changes when each observation is removed from the data set and the model is refit.
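The "remove and refit" description can be followed literally and compared with the standard closed form, which needs only one fit. This is a sketch on invented data; both function names are hypothetical, and statistical packages may differ in small details.

```python
import numpy as np

def _hat_diag(X):
    return np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

def dfits_by_refit(X, y):
    """Drop each case, refit, and scale the change in its fitted value."""
    n, k = X.shape
    h = _hat_diag(X)
    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid_i = y[keep] - X[keep] @ beta_i
        s_i = np.sqrt(resid_i @ resid_i / (n - 1 - k))   # SD estimated without case i
        out[i] = (X[i] @ beta_full - X[i] @ beta_i) / (s_i * np.sqrt(h[i]))
    return out

def dfits_closed_form(X, y):
    """Same quantity from a single fit, using leverage and deleted residuals."""
    n, k = X.shape
    h = _hat_diag(X)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    s2 = e @ e / (n - k)
    s2_del = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)
    t = e / np.sqrt(s2_del * (1 - h))                    # deleted (externally studentised) residual
    return t * np.sqrt(h / (1 - h))

# Invented data with one unusual case at the end.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([3.0, 5.0, 6.0, 9.0, 10.0, 13.0, 14.0, 10.0])
X = np.column_stack([np.ones_like(x), x])

by_refit = dfits_by_refit(X, y)
closed = dfits_closed_form(X, y)
```

The two vectors agree, and the last case dominates because it combines high leverage with a poor fit.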

22

slide-23
SLIDE 23

23

slide-24
SLIDE 24

What model to fit?

  • Suppose we start with
  • Salaries = α+β1Experience+ε – linear model
  • And look at residuals vs fitted values

24

slide-25
SLIDE 25

25

slide-26
SLIDE 26

So what model should we fit?

  • Salaries = α+β1Experience+ε – linear model
  • Should create a new variable, Experience*Experience, and add it to the model
  • Salaries = α+β1Experience+β2Experience*Experience+ε
  • Polynomial model
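A sketch of the step from the linear to the polynomial model. The salaries are simulated with curvature built in, so the numbers are not those behind the slides; the helper `fit` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
experience = np.linspace(0, 30, 60)
# Simulated salaries with a built-in quadratic term; illustration only.
salary = 30 + 3.0 * experience - 0.06 * experience**2 + rng.normal(0, 2, experience.size)

def fit(X, y):
    """Least-squares fit; returns coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

ones = np.ones_like(experience)
X_lin = np.column_stack([ones, experience])
X_quad = np.column_stack([ones, experience, experience**2])   # new variable: Experience*Experience

beta_lin, resid_lin = fit(X_lin, salary)
beta_quad, resid_quad = fit(X_quad, salary)

sse_lin = resid_lin @ resid_lin      # residual sum of squares, linear model
sse_quad = resid_quad @ resid_quad   # residual sum of squares, polynomial model
```

Plotting `resid_lin` against the fitted values would show the curved pattern that motivates the extra term; `resid_quad` should look like random scatter.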

26

slide-27
SLIDE 27

27

slide-28
SLIDE 28

28

slide-29
SLIDE 29

Oregon Housing

  • Description
  • 76 single-family homes in Eugene, Oregon during 2005
  • Estate agents have their methods of determining price
  • Seller wanted a method of determining asking price

slide-30
SLIDE 30

Variables

  • Price (thousands of $)
  • Floor size (thousands of sq ft)
  • Age of house
  • Number of bedrooms
  • Number of bathrooms
  • Garage size
  • School area
  • Lot size (categories 1 to 11), an interesting variable too
slide-31
SLIDE 31

Coding of Lot size

Category   Lot size
1          0-3k
2          3-5k
3          5-7k
4          7-10k
5          10-15k
6          15-20k
7          20k-1 acre
8          1-3 ac
9          3-5 ac
10         5-10 ac
11         10-20 ac

0-3k = 0-3,000 sq ft; 1 acre = 43,560 sq ft

31

slide-32
SLIDE 32

Model

  • Going to focus on three variables: Price, Size and Age

  • Age is coded as (Year – 70)/10
  • Going to fit two models
  • Price = α+β1*Size+ β2*Age+ε
  • Price = α+β1*Size+ β2*Age+ β3*Age*Age+ ε
  • First we draw some graphs

32

slide-33
SLIDE 33

33

slide-34
SLIDE 34

First model

34

slide-35
SLIDE 35

Residuals vs Age

35

slide-36
SLIDE 36

Second model

36

slide-37
SLIDE 37

Article

37

Iain Pardoe (Lundquist College of Business, University of Oregon), "Modeling Home Prices Using Realtor Data", Journal of Statistics Education, Volume 16, Number 2 (2008), www.amstat.org/publications/jse/v16n2/datasets.pardoe.html

slide-38
SLIDE 38

Conclusion

  • Be sure to run diagnostics
  • Examine the plots
  • Check funny points
  • Try out some changes

38

slide-39
SLIDE 39

Added variable plots

  • Added-variable plots enable us to visually assess the effect of each predictor, having adjusted for the effects of the other predictors.

  • Y and two predictor variables X and Z
  • Regress Y on X – calculate residuals – Set 1
  • Regress Z on X – calculate residuals – Set 2
  • Plot Set 1 residuals vs Set 2 residuals

39

slide-40
SLIDE 40

And more…

  • Residuals from Y and X = part of Y not predicted by X
  • Residuals from Z and X = part of Z not predicted by X
  • Added-variable plot for predictor variable Z shows that part of Y that is not predicted by X against that part of Z that is not predicted by X
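The recipe for an added-variable plot can be sketched as follows, on simulated data with one response Y and two predictors X and Z. The helper `residuals` is hypothetical. A useful check is that the least-squares slope through the plotted points equals the coefficient of Z in the full regression of Y on X and Z.

```python
import numpy as np

def residuals(y, X):
    """Residuals from a least-squares regression of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)                 # Z is correlated with X
y = 1.0 + 2.0 * x - 1.5 * z + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
X_only = np.column_stack([ones, x])

set1 = residuals(y, X_only)    # part of Y not predicted by X
set2 = residuals(z, X_only)    # part of Z not predicted by X

# Slope of Set 1 on Set 2 (both sets have mean zero, so no intercept is needed).
av_slope = (set1 @ set2) / (set2 @ set2)

# Coefficient of Z when Y is regressed on X and Z together.
beta_full, *_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)
```

Plotting `set1` against `set2` gives the added-variable plot for Z; with more predictors, X is replaced by all the other predictors.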

40

slide-41
SLIDE 41

Another dataset

  • Price = the price (in $US) of dinner (including one drink and a tip)
  • Food = customer rating of the food (out of 30)
  • Décor = customer rating of the decor (out of 30)
  • Service = customer rating of the service (out of 30)
  • East = 1 (0) if the restaurant is east (west) of Fifth Avenue

41

slide-42
SLIDE 42

Added Variable plots

  • For Food variable
  • Regress Price on Décor, Service and East – calculate residuals
  • Regress Food on Décor, Service and East – calculate residuals

  • Plot residuals against each other

42

slide-43
SLIDE 43

43

slide-44
SLIDE 44

44