Lecture 8: Model assessment, nested models, and hypothesis testing




Lecture 8: Model assessment, nested models, and hypothesis testing

Ani Manichaikul amanicha@jhsph.edu 27 April 2007


Another Example: Mortality

• British smoke, pollution & mortality data

[Scatterplot matrix: airborne smoke particles, SO2 concentration, and London mortality]

Mortality Example: Model

Let:
• Y = the daily mortality for London (deaths)
• X1 = airborne smoke particles (mg/m3) (smoke)
• X2 = SO2 (ppm) (so2)

Model:
• 1) Yi = β0 + β1(X1 - 2) + β2(X2 - .5) + εi
• 2) εi ~ N(0, σ2)
• Mortality is a linear function of the concentration of airborne smoke particles AND the SO2 level

Mortality Example: Interpretations

Model:
• E( Y | X ) = β0 + β1(X1 - 2) + β2(X2 - .5)

• β0:
  E( Y | X1 = 2, X2 = .5 ) = β0 + β1(0) + β2(0) = β0
• Therefore: β0 = the mean number of deaths per day when smoke particle concentrations are 2 mg/m3 and SO2 concentrations are 0.5 ppm

Mortality Example: Interpretations

• β1:
  E( Y | X1 + 1, X2 ) = β0 + β1(X1 - 1) + β2(X2 - .5)
  E( Y | X1, X2 )     = β0 + β1(X1 - 2) + β2(X2 - .5)
  Δ E( Y | X ) = β1
• Therefore: β1 = expected change in mortality on days when particles are 1 mg/m3 higher, if SO2 is unchanged

Mortality Example: Interpretations

• β2:
  E( Y | X1 = ?, X2 = ? ) =
  E( Y | X1 = ?, X2 = ? ) =
  Δ E( Y | X ) = β2
• Therefore: β2 =

Mortality Example: Results

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  2,    12) =   36.57
       Model |  205097.531     2  102548.765           Prob > F      =  0.0000
    Residual |  33654.2025    12  2804.51687           R-squared     =  0.8590
-------------+------------------------------           Adj R-squared =  0.8355
       Total |  238751.733    14  17053.6952           Root MSE      =  52.958

      deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 smokecenter |  -220.3244   58.14314    -3.79   0.003    -347.0074   -93.64135
   so2center |   1051.816   212.5959     4.95   0.000     588.6096    1515.023
       _cons |   174.7703   29.16174     5.99   0.000     111.2323    238.3083

Mortality Example: Inference

• Overall F-test: are ANY of the covariates significant?
• H0: β1 = β2 = 0
• Fobs(2, 12) = 36.57; p-val = 0.0000
• Decision: reject H0; at least one of the β's is nonzero

Parameter Estimates (95% C.I.) & individual t-tests

β0:
• b0 = 174.8 (111.2, 238.3)
• H0: β0 = 0; tobs(12) = 5.99; p-val = 0.000

Parameter Estimates (95% C.I.) & individual t-tests

β1:
• b1 = -220.3 (-347.0, -93.6)
• H0: β1 = 0; tobs(12) = -3.79; p-val = 0.003

Parameter Estimates (95% C.I.) & individual t-tests

β2:
• b2 = 1051.8 (588.6, 1515.0)
• H0: β2 = 0; tobs(12) = 4.95; p-val = 0.000 (meaning p-val < 0.001)
• Note: s2 = MSE = 2805; s = √MSE = 'Root MSE' ≈ 53
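The Root MSE relationship can be verified directly; a minimal Python check using the MSE from the Stata output above:

```python
import math

# Residual mean square (MSE) from the Stata output
mse = 2804.51687

# The residual standard deviation s is the square root of MSE;
# Stata labels this 'Root MSE'
s = math.sqrt(mse)

print(round(s, 3))  # matches the reported Root MSE of 52.958
```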


Parameter Interpretations: with Estimates

• b0: when smoke particles and SO2 are around their average levels (2 mg/m3 and 0.5 ppm respectively), the estimated mean number of deaths is 174.8/day
• b1: the estimated mean mortality is 22 deaths/day lower on days when particles are 0.1 mg/m3 higher, if SO2 is unchanged
• b2: (You do!)

Estimating

• Suppose we were interested in the estimated mean number of deaths when smoke particle concentrations were 3 mg/m3 and SO2 levels were 0.65 ppm
  E( Y | X ) = β0 + β1(X1 - 2) + β2(X2 - .5), so:
• E(Deaths) = b0 + b1(smoke - 2) + b2(so2 - .5)
            = 174.8 - 220 (3 - 2) + 1052 (.65 - .5)
            ≈ 112 deaths
• How about if smoke particle concentrations were 3 mg/m3 and SO2 levels were 0.45 ppm?
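The plug-in calculation above (and the follow-up question) can be scripted; a small Python sketch using the full-precision coefficients from the Stata output:

```python
# Plug-in prediction from the fitted MLR (coefficients from the Stata output)
b0, b1, b2 = 174.7703, -220.3244, 1051.816

def expected_deaths(smoke, so2):
    """E(Deaths) under the centered model b0 + b1*(smoke-2) + b2*(so2-0.5)."""
    return b0 + b1 * (smoke - 2) + b2 * (so2 - 0.5)

print(round(expected_deaths(3, 0.65)))  # about 112 deaths
print(round(expected_deaths(3, 0.45)))  # the follow-up question on the slide
```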


Association

• The estimate for airborne smoke particles is b1 = -220, implying that smoke particles and mortality have a negative relationship
• i.e. an increase in smoke particles is associated with a decrease in mortality, after adjusting for SO2 levels.

Negative Association??

• BUT WAIT! Look at the plot of deaths vs smoke presented previously. Shouldn't the relationship be positive instead?!
• Let's run Simple Linear Regressions (SLRs) of mortality on smoke & SO2 and see what we get.

Simple Linear Regression

Same notation:
• Y = the daily mortality for London (deaths)
• X1 = airborne smoke particles (mg/m3) (smoke)
• X2 = SO2 (ppm) (so2)

SLR Models

• Smoke:
  1) Yi = β0 + β1(X1 - 2) + εi
  2) εi ~ N(0, σ2)
• SO2:
  1) Yi = β0* + β1*(X2 - .5) + εi*
  2) εi* ~ N(0, σ2*)

SLR: Deaths ~ Smoke

[Scatter plot: London mortality vs. airborne smoke particles]

Death ~ Smoke: Results

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  1,    13) =   17.34
       Model |  136449.517     1  136449.517           Prob > F      =  0.0011
    Residual |  102302.216    13  7869.40127           R-squared     =  0.5715
-------------+------------------------------           Adj R-squared =  0.5386
       Total |  238751.733    14  17053.6952           Root MSE      =   88.71

      deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 smokecenter |   63.76092   15.31226     4.16   0.001     30.68078    96.84105
       _cons |   299.3407   24.64457    12.15   0.000     246.0993     352.582

Parameter estimates: b0 = 299.3, b1 = 63.8 (b1 is positive?!!)
Amount of variation described: R2 = SSM / SST = 57%
Residual variability left over (undescribed by this SLR): SSE = 102302.216
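The summary lines above can be recomputed from the ANOVA table; a quick Python check:

```python
# Verify R-squared = SSM / SST from the SLR Stata output above
ssm = 136449.517   # Model sum of squares
sse = 102302.216   # Residual sum of squares
sst = 238751.733   # Total sum of squares

r_squared = ssm / sst
print(round(r_squared, 4))  # 0.5715, as reported

# Sums of squares decompose: SSM + SSE = SST
print(abs((ssm + sse) - sst) < 0.001)  # True
```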


SLR: Death ~ SO2

[Scatter plot: London mortality vs. SO2 concentration]

Death ~ SO2: Results

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  1,    13) =   28.99
       Model |  164827.112     1  164827.112           Prob > F      =  0.0001
    Residual |  73924.6211    13  5686.50932           R-squared     =  0.6904
-------------+------------------------------           Adj R-squared =  0.6666
       Total |  238751.733    14  17053.6952           Root MSE      =  75.409

      deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   so2center |   256.2356   47.59353     5.38   0.000     153.416     359.0551
       _cons |   272.2286   19.57285    13.91   0.000     229.944     314.5131

Parameter estimates: b0 = 272.2, b1 = 256.2
Amount of variation described: R2 = SSM / SST = 69%
Residual variability left over (undescribed by this SLR): SSE = 73924.6211

Confounding in this Example

Recall our parameter interpretations:
• β1 = expected change in mortality on days when particles are 1 mg/m3 higher, if SO2 is unchanged
• Suppose we examine the relationship between smoke particle concentrations and SO2 levels (SLR):

SLR: Smoke ~ SO2

[Scatter plot: airborne smoke particles vs. SO2 concentration]

Confounding

• Smoke particle concentrations and SO2 levels are highly related! How can we talk about changing smoke particle concentrations while leaving SO2 levels unchanged??
• This phenomenon is called 'confounding': both covariates are related to the outcome and to each other.
• Confounding is the reason we found differences between the SLR models and the MLR model.
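The sign flip can be reproduced with simulated data; a Python sketch (illustrative numbers echoing the fitted coefficients, not the actual London dataset):

```python
import numpy as np

# Illustrative simulation: two strongly related predictors, where the true
# coefficient on smoke is negative in the two-predictor model, yet the
# SLR slope on smoke comes out positive because smoke tracks SO2.
rng = np.random.default_rng(0)
so2 = rng.uniform(0.2, 1.5, 200)
smoke = 3.6 * so2 + rng.normal(0, 0.3, 200)              # smoke tracks SO2
deaths = 174.8 - 220 * (smoke - 2) + 1052 * (so2 - 0.5)  # exact linear outcome

# SLR: deaths ~ smoke
X_slr = np.column_stack([np.ones_like(smoke), smoke])
b_slr = np.linalg.lstsq(X_slr, deaths, rcond=None)[0]

# MLR: deaths ~ smoke + so2
X_mlr = np.column_stack([np.ones_like(smoke), smoke, so2])
b_mlr = np.linalg.lstsq(X_mlr, deaths, rcond=None)[0]

print(b_slr[1] > 0)                   # SLR slope on smoke is positive
print(abs(b_mlr[1] - (-220)) < 1e-6)  # MLR recovers the true -220
```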


Residuals: part “left over”


Residuals

• Residuals are deviations: what's 'left over' in the response, Y, from what was expected given the predictor, X
• The residuals are the part of Y that can't be predicted by X!

Adjusted Variable Plots

Idea:
• Explain all that we can in London daily mortality using SO2 levels
• Explain all that we can in smoke particle concentrations using SO2 levels
• Explain everything that's 'left over' in mortality with everything that's 'left over' in smoke particle concentrations. The slope of this line will be the MLR coefficient!

Adjusted Variable Plot

[AVP: residuals of DEATHS ~ SO2 plotted against residuals of SMOKE ~ SO2]

Recipe for AVP

Recipe for obtaining the MLR slope for X1 from an AVP (adjusted for X2):
1. Regress Y on X2, save residuals as: RY|X2
2. Regress X1 on X2, save residuals as: RX1|X2
3. Plot RY|X2 vs RX1|X2 (Adjusted Variable Plot), then regress RY|X2 on RX1|X2:
   RY|X2 = β0* + β1* RX1|X2 + ε
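The three-step recipe can be sketched with numpy on simulated data (the coefficients here are made up, not the mortality fit); the residual-on-residual slope matches the MLR coefficient exactly:

```python
import numpy as np

# Simulated data for the AVP recipe (illustrative, not the actual dataset)
rng = np.random.default_rng(1)
n = 50
x2 = rng.normal(size=n)
x1 = 2.0 * x2 + rng.normal(size=n)
y = 5.0 - 3.0 * x1 + 4.0 * x2 + rng.normal(size=n)

def residuals(y, x):
    """Residuals from an SLR of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return y - X @ beta

r_y = residuals(y, x2)    # step 1: regress Y on X2, keep residuals
r_x1 = residuals(x1, x2)  # step 2: regress X1 on X2, keep residuals

# step 3: regress RY|X2 on RX1|X2; the slope is the MLR coefficient
avp_slope = np.linalg.lstsq(
    np.column_stack([np.ones_like(r_x1), r_x1]), r_y, rcond=None)[0][1]

# Full MLR of Y on X1 and X2, for comparison
X = np.column_stack([np.ones(n), x1, x2])
mlr_b1 = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(abs(avp_slope - mlr_b1) < 1e-8)  # True: the two slopes agree
```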


Notes on AVPs

• β1* is identical to the coefficient of X1 from an MLR of Y on X1 and X2
• β0* is zero (zero intercept)
• The AVP display may be misleading if Y and/or X1 are not linearly related to the other predictors

AVP for Mortality Example

• Regress deaths on (centered) SO2, save residuals
  • removes the effects of SO2 on mortality
  Deaths = 272 + 256 SO2c + RY|X2
• Regress smoke on SO2 (both centered), save residuals
  • removes the effects of SO2 on smoke particles
  Smokec = -.44 + 3.6 SO2c + RX1|X2
• Regress RY|X2 on RX1|X2
  • regress deaths adjusted for SO2 on smoke particles adjusted for SO2
  RY|X2 = 0.0 - 220 RX1|X2

AVP Interpretation

• The parameter from this last regression, β1* = -220, is the same as the related parameter from the MLR of deaths on smoke particles and SO2:
  E(Deaths) = β0 + β1(smoke - 2) + β2(so2 - .5)
            = 174.8 - 220 (smoke - 2) + 1052 (SO2 - 0.5)
• This aids in our interpretation of β1: the effect of airborne smoke particles on daily mortality after having removed (or adjusted out) all the effects of SO2 levels.
• This is what is usually meant by the term 'adjustment'

MLR and Scientific Inference

• The single most important idea today may be the realization that MLR can shift interpretations markedly!
• From SLR of the air pollution data:
  E(Deaths) = 299 + 64(smoke - 2)
• Expected deaths increase by an estimated 64 per mg/m3 increase in British smoke

MLR and Scientific Inference

• From MLR of the air pollution data:
  E(Deaths) = 174.8 - 220(smoke - 2) + 1052(SO2 - .5)
• Controlling for SO2, expected deaths decrease 220 per mg/m3 of British smoke
• Interpretation and value of a regression coefficient depends critically on what other variables are in the model!!

Simple Linear Regression

[Scatter plot: London mortality vs. airborne smoke particles (SLR: DEATHS ~ SMOKE)]

Multiple Linear Regression

[AVP: residuals of DEATHS ~ SO2 plotted against residuals of SMOKE ~ SO2]

MLR Lesson:

• Interpretation and value of a regression coefficient depends critically on what other variables are in the model

Types of predictors

• primary predictor
  • always in model
• other predictor(s)
  • can we improve prediction after adjusting for the primary predictor?
  • interaction may be a component here
• potential confounder(s) (e.g. demographics)
  • only important if they change the effect of the primary predictor

Nested models

• One model is nested within another if the larger (parent) model contains all of the variables in the nested model, plus one or more additional variables.

Difference in assessing variables: "nested models"

• other predictor(s)
  • assess with t test if a single variable defines the predictor
  • assess with F test (today) if two or more variables are needed to define the predictor
• potential confounder(s)
  • compare CI of primary predictor to see whether the new parameter is significantly different (Lecture 23)

Dataset

• Class health dataset
• Outcome: number of credits
• Primary predictor: housing (on or off campus)
• Other predictors:
  • health status (good/excellent or fair/poor)
  • year in school

Parent model (Model 1)

. reg credits housing

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  1,    24) =    0.06
       Model |  .176282088     1  .176282088           Prob > F      =  0.8074
    Residual |  69.6333335    24   2.9013889           R-squared     =  0.0025
-------------+------------------------------           Adj R-squared = -0.0390
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.7033

     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |   .1666667   .6761572     0.25   0.807    -1.228853    1.562187
       _cons |       16.2   .5135783    31.54   0.000     15.14003    17.25997

Fitted model: Ŷi = β̂0 + β̂1(Housing), where Housing = 1 if on-campus, 0 if off-campus

Extended model (Model 2)

. reg credits housing healthgood

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  2,    23) =    0.20
       Model |   1.1834815     2  .591740751           Prob > F      =  0.8215
    Residual |  68.6261341    23  2.98374496           R-squared     =  0.0170
-------------+------------------------------           Adj R-squared = -0.0685
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.7274

     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |   .1541237   .6860262     0.22   0.824    -1.26503     1.573277
  healthgood |   .4139175   .7124214     0.58   0.567    -1.059838    1.887673
       _cons |    15.9366   .6904955    23.08   0.000     14.5082       17.365

Fitted model: Ŷi = β̂0 + β̂1(Housing) + β̂2(Healthgood), where Healthgood = 1 if excellent/good, 0 if fair/poor

Comparing models 1 and 2

• If we remove healthgood from model 2, we are left with model 1
• Model 1 is nested in model 2
• To decide whether model 2 is better than model 1, use the t test for the new variable, healthgood
• p = 0.567 > α = 0.05 for the test of H0: β2 = 0
• Fail to reject H0
• Conclude model 2 is no better than model 1

What if we add more than one variable?

• The t test on each row only tests that variable in the presence of everything else in the model
• When more than one variable is added at a time, the t test is not sufficient
• The t test only tests one variable at a time
• Use the F test instead to compare nested models that differ by more than one variable

When would more than one variable need to be added?

• Many modeling scenarios require adding more than one variable at once to go from the parent model to the extended model
• One that arises frequently is when a categorical variable needs to be added

Coding a categorical predictor

• A categorical predictor (such as year in program) cannot be added as a single variable
• If we add year (1, 2, 3, or 4) to the model in its original form, then software thinks it is a continuous predictor
• As a continuous predictor, the difference in mean number of credits taken would be assumed to change by a constant amount for each additional year

Coding a categorical predictor

• A categorical predictor should always be recoded as a set of dummy variables
• Choose one category (year = 1) as the reference group
• For each other category (such as year = 2), create a dummy variable for membership in that category:
  gen year2=1 if year==2
  replace year2=0 if year~=2
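The Stata gen/replace recipe above can be mirrored in any language; a small Python sketch with made-up year values (not the class dataset):

```python
# Mirror of the Stata dummy coding above, on illustrative data
year = [1, 2, 2, 3, 4, 1]

year2 = [1 if y == 2 else 0 for y in year]        # gen/replace year2
year34 = [1 if y in (3, 4) else 0 for y in year]  # years 3 and 4 combined

print(year2)   # [0, 1, 1, 0, 0, 0]
print(year34)  # [0, 0, 0, 1, 1, 0]
```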


Example

• Year 1 = reference group (no dummy variable for this group)
• Year2 = 1 for those in year 2, 0 else
• Year34 = 1 for those in year 3/4, 0 else
  • very few observations, so categories were combined
• For someone in year 3: Year2 = 0, Year34 = 1
• For a first year: Year2 = 0, Year34 = 0

Model 3

. reg credits housing year2 year34

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  3,    22) =    2.94
       Model |  19.9853465     3  6.66178216           Prob > F      =  0.0555
    Residual |  49.8242691    22  2.26473951           R-squared     =  0.2863
-------------+------------------------------           Adj R-squared =  0.1890
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.5049

     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |  -1.402299   .8537457    -1.64   0.115    -3.172859    .3682613
       year2 |   .7068966   .7215468     0.98   0.338    -.7894999    2.203293
      year34 |   -2.10197   1.087462    -1.93   0.066    -4.357228    .1532874
       _cons |   17.34483   .9268436    18.71   0.000     15.42267    19.26698

Fitted model: Ŷi = β̂0 + β̂1(Housing) + β̂2(Year2) + β̂3(Year34)

• What is the mean number of credits taken by second year students who live on campus?
  Ŷ = 17.3 - 1.4(Housing) + 0.7(Year2) - 2.1(Year34)
    = 17.3 - 1.4(1) + 0.7(1) - 2.1(0) = 16.6
• What is the mean number of credits taken by first year students who live on campus?
  Ŷ = 17.3 - 1.4(1) + 0.7(0) - 2.1(0) = 15.9
• β̂2 = 0.7 is the difference in mean number of credits
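Under the rounded Model 3 coefficients (17.3, -1.4, 0.7, -2.1), the two questions above can be answered by plugging in; a quick Python check:

```python
# Fitted Model 3, with coefficients rounded to one decimal place
def credits_hat(housing, year2, year34):
    return 17.3 - 1.4 * housing + 0.7 * year2 - 2.1 * year34

second_on = credits_hat(1, 1, 0)  # second year, on campus
first_on = credits_hat(1, 0, 0)   # first year, on campus

print(round(second_on, 1))             # 16.6
print(round(first_on, 1))              # 15.9
print(round(second_on - first_on, 1))  # 0.7, the coefficient on Year2
```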

Interpretation

• β̂0: first year students who live off campus take an average of 17.3 credits
• β̂2: after adjusting for housing, second year students take an average of 0.71 more credits than first year students
• β̂3: after adjusting for housing, 3rd and 4th year students take an average of 2.1 fewer credits than first year students

Notice

• Coding: Year2 = 0 for anyone not in year 2
• Interpretation: the coefficient for Year2 compares second year students to first year students (the reference category)
  • (not to anyone not in year 2)

Evaluation

• We cannot evaluate Year using the t test for each row, because two variables are needed to define Year and the t tests are separate
• We must use an F test to evaluate Year, by comparing the residual sums of squares (RSS) in the parent model and in the extended model.

Parent model (Model 1)

. reg credits housing

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  1,    24) =    0.06
       Model |  .176282088     1  .176282088           Prob > F      =  0.8074
    Residual |  69.6333335    24   2.9013889           R-squared     =  0.0025
-------------+------------------------------           Adj R-squared = -0.0390
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.7033

     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |   .1666667   .6761572     0.25   0.807    -1.228853    1.562187
       _cons |       16.2   .5135783    31.54   0.000     15.14003    17.25997

RSSparent = 69.6333335 (residual dfparent = 24)

Extended model (Model 3)

. reg credits housing year2 year34

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  3,    22) =    2.94
       Model |  19.9853465     3  6.66178216           Prob > F      =  0.0555
    Residual |  49.8242691    22  2.26473951           R-squared     =  0.2863
-------------+------------------------------           Adj R-squared =  0.1890
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.5049

     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |  -1.402299   .8537457    -1.64   0.115    -3.172859    .3682613
       year2 |   .7068966   .7215468     0.98   0.338    -.7894999    2.203293
      year34 |   -2.10197   1.087462    -1.93   0.066    -4.357228    .1532874
       _cons |   17.34483   .9268436    18.71   0.000     15.42267    19.26698

RSSextended = 49.8242691 (residual dfextended = 22)

The F test

Numerator of F-statistic: (RSSparent - RSSextended) / (num. vars. added)
Denominator of F-statistic: RSSextended / (residual dfextended)

Fobs = ((69.6 - 49.8) / 2) / (49.8 / 22) = 4.4

H0: all new β's = 0 in the population
HA: at least one new β is not 0 in the population
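The arithmetic above, as a Python sketch using the exact sums of squares from the two Stata fits:

```python
# Nested-model F statistic from the parent (Model 1) and extended (Model 3) fits
rss_parent = 69.6333335    # Model 1: credits ~ housing
rss_extended = 49.8242691  # Model 3: credits ~ housing year2 year34
vars_added = 2             # year2 and year34
df_extended = 22           # residual df in the extended model

f_obs = ((rss_parent - rss_extended) / vars_added) / (rss_extended / df_extended)
print(round(f_obs, 1))  # 4.4
```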

slide-58
SLIDE 58

58

The F table

• Recall: the F distribution is very similar to the χ2 distribution
• The F distribution is automatically 2-sided (like χ2)
• df change the shape of the F distribution (like χ2), but now there are two sets of df: the numerator df and the denominator df

The F table

• numerator df: # of variables added = 2
• denominator df: residual dfextended = 22
• Using α = 0.05, find Fcr
  • 1 - α = 0.95
  • find the quantile in R, using the appropriate degrees of freedom
• Fcr = 3.49 is shown in the 2nd row of the table
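The slide finds this quantile in R; an equivalent lookup in Python (assuming scipy is available). The exact 0.95 quantile at (2, 22) df comes out around 3.44, slightly below the tabled 3.49 (which matches a table row at denominator df 20); Fobs = 4.4 exceeds either cutoff, so the conclusion on the next slide is unchanged.

```python
from scipy.stats import f

# 0.95 quantile of the F distribution with (2, 22) degrees of freedom
f_crit = f.ppf(0.95, dfn=2, dfd=22)
print(round(f_crit, 2))  # about 3.44
```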


Conclusion

• Fcr = 3.49 < Fobs = 4.4, so p < α = 0.05
• Reject H0: conclude that adding year improves prediction after adjusting for housing
• Notice: neither individual t test was statistically significant, but the F test was still significant
• Must always use the F test to evaluate multiple X's at once

The F test: notes

• The F test can be used to compare any two nested models
• If only one variable is added, it's easier to compare the models using the t test for that variable
  • t2 = F if one variable is added
• For any regression, the estimated variance of the residuals is RSS/(residual df)
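The identity t2 = F for a single added variable can be checked against the Model 1 and Model 2 output above (healthgood is the one added variable); a Python sketch:

```python
# t statistic for healthgood: coef / std. err. from the Model 2 output
t = 0.4139175 / 0.7124214

# F statistic comparing Model 1 (parent) and Model 2 (one variable added)
rss_parent = 69.6333335    # Model 1 residual SS
rss_extended = 68.6261341  # Model 2 residual SS (residual df = 23)
f_obs = (rss_parent - rss_extended) / (rss_extended / 23)

print(abs(t**2 - f_obs) < 1e-4)  # True: t^2 = F when one variable is added
```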


Nested Models

• Comparing nested models
  • 1 new variable: use t test for that variable
  • 2+ new variables: use F test
• Categorical predictor
  • set one group as reference
  • create a dummy variable for each other group
  • include/exclude all dummy variables together
  • evaluate the categorical predictor with an F test