Lecture 8: Model assessment, nested models, and hypothesis testing
Ani Manichaikul
amanicha@jhsph.edu
27 April 2007
Another Example: Mortality
- British Smoke, Pollution & Mortality Data
[Figure: scatterplot matrix of airborne smoke particles, SO2 concentration, and London mortality]
Mortality Example: Model
Let:
- Y = the daily mortality for London (deaths)
- X1 = airborne smoke particles (mg/m3) (smoke)
- X2 = SO2 (ppm) (so2)
Model:
- 1) Yi = β0 + β1(X1 - 2) + β2(X2 - 0.5) + εi
- 2) εi ~ N(0, σ2)
- Mortality is a linear function of the concentration of airborne smoke particles AND the SO2 level
Mortality Example: Interpretations
Model: E( Y | X ) = β0 + β1(X1 - 2) + β2(X2 - 0.5)
- β0: E( Y | X1 = 2, X2 = 0.5 ) = β0 + β1(0) + β2(0) = β0
- Therefore: β0 = the mean number of deaths per day when smoke particle concentrations are 2 mg/m3 and SO2 concentrations are 0.5 ppm
Mortality Example: Interpretations
- β1:
  E( Y | X1 = x + 1, X2 ) = β0 + β1(x - 1) + β2(X2 - 0.5)
  E( Y | X1 = x, X2 ) = β0 + β1(x - 2) + β2(X2 - 0.5)
  Δ E( Y | X ) = β1
- Therefore: β1 = expected change in mortality on days when particles are 1 mg/m3 higher, if SO2 is unchanged
Mortality Example: Interpretations
- β2:
  E( Y | X1 = ?, X2 = ? ) =
  E( Y | X1 = ?, X2 = ? ) =
  Δ E( Y | X ) = β2
- Therefore: β2 =
Mortality Example: Results

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  2,    12) =   36.57
       Model |  205097.531     2  102548.765           Prob > F      =  0.0000
    Residual |  33654.2025    12  2804.51687           R-squared     =  0.8590
-------------+------------------------------           Adj R-squared =  0.8355
       Total |  238751.733    14  17053.6952           Root MSE      =  52.958

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 smokecenter |  -220.3244   58.14314    -3.79   0.003    -347.0074   -93.64135
   so2center |   1051.816   212.5959     4.95   0.000     588.6096    1515.023
       _cons |   174.7703   29.16174     5.99   0.000     111.2323    238.3083
------------------------------------------------------------------------------
Mortality Example: Inference
- Overall F-test: are ANY of the covariates significant?
- H0: β1 = β2 = 0
- Fobs(2, 12) = 36.57; p-val = 0.0000
- Decision: at least one of the β's is nonzero
Parameter Estimates (95% C.I.) & individual t-tests
β0:
- b0 = 174.8 (111.2, 238.3)
- H0: β0 = 0; tobs(12) = 5.99; p-val = 0.000
Parameter Estimates (95% C.I.) & individual t-tests
β1:
- b1 = -220.3 (-347.0, -93.6)
- H0: β1 = 0; tobs(12) = -3.79; p-val = 0.003
Parameter Estimates (95% C.I.) & individual t-tests
β2:
- b2 = 1051.8 (588.6, 1515.0)
- H0: β2 = 0; tobs(12) = 4.95; p-val = 0.000 (means p-val < 0.001)
- Note: s2 = MSE = 2805; s = √MSE = 'Root MSE' = 53
Parameter Interpretations: with Estimates
- b0: when smoke particles and SO2 are around their average levels (2 mg/m3 and 0.5 ppm, respectively), the estimated mean number of deaths is 174.8 / day
- b1: since b1 = -220 deaths per mg/m3, the estimated mean mortality is 22 deaths/day lower on days when particles are 0.1 mg/m3 higher, if SO2 is unchanged
- b2: (You do!)
Estimating
- Suppose we were interested in the estimated mean number of deaths when smoke particle concentrations were 3 mg/m3 and SO2 levels were 0.65 ppm
  E( Y | X ) = β0 + β1(X1 - 2) + β2(X2 - 0.5), so:
- E(Deaths) = b0 + b1(smoke - 2) + b2(so2 - 0.5)
            = 174.8 - 220 (3 - 2) + 1052 (0.65 - 0.5)
            ≈ 113 deaths
- How about if smoke particle concentrations were 3 mg/m3 and SO2 levels were 0.45 ppm?
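In Stata, one way to get this fitted value, along with a 95% CI, is lincom after the MLR; a minimal sketch, assuming the dataset with the centered variables smokecenter and so2center from the output above is in memory:

. reg deaths smokecenter so2center
. * smoke = 3 is 1 above the centering point (2); so2 = .65 is .15 above (.5)
. lincom _cons + 1*smokecenter + 0.15*so2center    // about 113 deaths/day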
Association
- The estimate for airborne smoke particles is b1 = -220, implying that smoke particles and mortality have a negative relationship
- i.e. an increase in smoke particles is associated with a decrease in mortality, after adjusting for SO2 levels.
Negative Association??
- BUT WAIT! Look at the plot of deaths vs smoke presented previously. Shouldn't the relationship be positive instead?!
- Let's run Simple Linear Regressions (SLRs) of mortality on smoke & SO2 and see what we get.
Simple Linear Regression
Same notation:
- Y = the daily mortality for London (deaths)
- X1 = airborne smoke particles (mg/m3) (smoke)
- X2 = SO2 (ppm) (so2)
SLR Models
- Smoke:
  1) Yi = β0 + β1(X1 - 2) + εi
  2) εi ~ N(0, σ2)
- SO2:
  1) Yi = β0* + β1*(X2 - 0.5) + εi*
  2) εi* ~ N(0, σ2*)
SLR: Deaths ~ Smoke
[Figure: scatterplot of London mortality vs. airborne smoke particles]
Death ~ Smoke: Results

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  1,    13) =   17.34
       Model |  136449.517     1  136449.517           Prob > F      =  0.0011
    Residual |  102302.216    13  7869.40127           R-squared     =  0.5715
-------------+------------------------------           Adj R-squared =  0.5386
       Total |  238751.733    14  17053.6952           Root MSE      =   88.71

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 smokecenter |   63.76092   15.31226     4.16   0.001     30.68078    96.84105
       _cons |   299.3407   24.64457    12.15   0.000     246.0993     352.582
------------------------------------------------------------------------------

Parameter estimates: b0 = 299.3, b1 = 63.8 (b1 is positive?!!)
Amount of variation described: R2 = SSM / SST = 57%
Residual variability left over (undescribed by this SLR): SSE = 102302.216
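As a quick check of the R2 arithmetic, the ratio of SSM to SST can be computed directly, e.g. with a Stata one-liner:

. display 136449.517/238751.733    // .5715, i.e. R-squared ≈ 57%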
SLR: Death ~ SO2
[Figure: scatterplot of London mortality vs. SO2 concentration]
Death ~ SO2: Results

      Source |       SS       df       MS              Number of obs =      15
-------------+------------------------------           F(  1,    13) =   28.99
       Model |  164827.112     1  164827.112           Prob > F      =  0.0001
    Residual |  73924.6211    13  5686.50932           R-squared     =  0.6904
-------------+------------------------------           Adj R-squared =  0.6666
       Total |  238751.733    14  17053.6952           Root MSE      =  75.409

------------------------------------------------------------------------------
      deaths |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   so2center |   256.2356   47.59353     5.38   0.000      153.416    359.0551
       _cons |   272.2286   19.57285    13.91   0.000      229.944    314.5131
------------------------------------------------------------------------------

Parameter estimates: b0 = 272.2, b1 = 256.2
Amount of variation described: R2 = SSM / SST = 69%
Residual variability left over (undescribed by this SLR): SSE = 73924.6211
Confounding in this Example
Recall our parameter interpretations:
- β1 = expected change in mortality on days when particles are 1 mg/m3 higher, if SO2 is unchanged
- Suppose we examine the relationship between smoke particle concentrations and SO2 levels (SLR):
SLR: Smoke ~ SO2
[Figure: scatterplot of airborne smoke particles vs. SO2 concentration]
Confounding
- Smoke particle concentrations and SO2 levels are highly related! How can we talk about changing smoke particle concentrations while leaving SO2 levels unchanged??
- This phenomenon is called 'confounding': both covariates are related to the outcome and to each other.
- Confounding is the reason we found differences between the SLR models and the MLR model.
Residuals: part “left over”
Residuals
- Residuals are deviations: what's 'left over' in the response, Y, from what was expected given the predictor, X
- The residuals are the part of Y that can't be predicted by X!
Adjusted Variable Plots
Idea:
- Explain all that we can in London daily mortality using SO2 levels
- Explain all that we can in smoke particle concentrations using SO2 levels
- Explain everything that's 'left over' in mortality with everything that's 'left over' in smoke particle concentrations. The slope of this line will be the MLR coefficient!
Adjusted Variable Plot
[Figure: AVP of deaths vs. smoke; residuals of DEATHS ~ SO2 plotted against residuals of SMOKE ~ SO2]
Recipe for AVP
Recipe for obtaining the MLR slope for X1 from an AVP (adjusted for X2):
1. Regress Y on X2, save residuals as RY|X2
2. Regress X1 on X2, save residuals as RX1|X2
3. Plot RY|X2 vs RX1|X2 (Adjusted Variable Plot)
4. Regress RY|X2 on RX1|X2: RY|X2 = β0* + β1* RX1|X2 + ε
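In Stata the recipe is a few lines; a sketch assuming deaths, smokecenter, and so2center are in memory (the residual variable names ry_x2 and rx1_x2 are just illustrative):

. reg deaths so2center           // step 1: regress Y on X2
. predict ry_x2, residuals       //         save residuals R_Y|X2
. reg smokecenter so2center      // step 2: regress X1 on X2
. predict rx1_x2, residuals      //         save residuals R_X1|X2
. scatter ry_x2 rx1_x2           // step 3: adjusted variable plot
. reg ry_x2 rx1_x2               // step 4: slope = the MLR coefficient for smoke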
Notes on AVPs
- β1* is identical to the coefficient of X1 from an MLR of Y on X1 and X2
- β0* is zero (zero intercept)
- The AVP display may be misleading if Y and/or X1 are not linearly related to the other predictors
AVP for Mortality Example
- Regress deaths on (centered) SO2, save residuals
  - Removes the effects of SO2 on mortality
  - Deaths = 272 + 256 SO2c + RY|X2
- Regress smoke on SO2 (both centered), save residuals
  - Removes the effects of SO2 on smoke particles
  - Smokec = -.44 + 3.6 SO2c + RX1|X2
- Regress RY|X2 on RX1|X2
  - i.e. regress deaths adjusted for SO2 on smoke particles adjusted for SO2
  - RY|X2 = 0.0 - 220 RX1|X2
AVP Interpretation
- The parameter from this last regression, β1* = -220, is the same as the related parameter from the MLR of deaths on smoke particles and SO2:
  E(Deaths) = β0 + β1(smoke - 2) + β2(so2 - 0.5) = 174.8 - 220 (smoke - 2) + 1052 (SO2 - 0.5)
- This aids in our interpretation of β1: the effect of airborne smoke particles on daily mortality after having removed (or adjusted out) all the effects of SO2 levels.
- This is what is usually meant by the term 'adjustment'
MLR and Scientific Inference
- The single most important idea today may be the realization that MLR can shift interpretations markedly!
- From SLR of the air pollution data:
  E(Deaths) = 299 + 64(smoke - 2)
- Expected deaths increase by an estimated 64 per mg/m3 increase in British smoke
MLR and Scientific Inference
- From MLR of the air pollution data:
  E(Deaths) = 174.8 - 220(smoke - 2) + 1052(SO2 - 0.5)
- Controlling for SO2, expected deaths decrease by 220 per mg/m3 of British smoke
- The interpretation and value of a regression coefficient depend critically on what other variables are in the model!!
Simple Linear Regression
[Figure: SLR of deaths on smoke; scatterplot of London mortality vs. airborne smoke particles]
Multiple Linear Regression
[Figure: AVP of deaths vs. smoke; residuals of DEATHS ~ SO2 plotted against residuals of SMOKE ~ SO2]
MLR Lesson:
- The interpretation and value of a regression coefficient depend critically on what other variables are in the model
Types of predictors
- primary predictor
  - always in model
- other predictor(s)
  - can we improve prediction after adjusting for the primary predictor?
  - interaction may be a component here
- potential confounder(s) (e.g. demographics)
  - only important if they change the effect of the primary predictor
Nested models
- One model is nested within another if the second model contains all of the variables of the first model plus one or more additional variables
- Here, the parent model is nested within the extended model
Difference in assessing variables: "nested models"
- other predictor(s)
  - assess with a t test if a single variable defines the predictor
  - assess with an F test (today) if two or more variables are needed to define the predictor
- potential confounder(s)
  - compare the CI of the primary predictor to see whether the new parameter is significantly different (Lecture 23)
Dataset
- Class health dataset
- Outcome: number of credits
- Primary predictor: housing (on or off campus)
- Other predictors:
  - health status (good/excellent or fair/poor)
  - year in school
Parent model (Model 1)

. reg credits housing

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  1,    24) =    0.06
       Model |  .176282088     1  .176282088           Prob > F      =  0.8074
    Residual |  69.6333335    24   2.9013889           R-squared     =  0.0025
-------------+------------------------------           Adj R-squared = -0.0390
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.7033

------------------------------------------------------------------------------
     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |   .1666667   .6761572     0.25   0.807    -1.228853    1.562187
       _cons |       16.2   .5135783    31.54   0.000     15.14003    17.25997
------------------------------------------------------------------------------

Ŷi = b0 + b1(Housing)i, where Housing = 1 if on-campus, 0 if off-campus
Extended model (Model 2)

. reg credits housing healthgood

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  2,    23) =    0.20
       Model |   1.1834815     2  .591740751           Prob > F      =  0.8215
    Residual |  68.6261341    23  2.98374496           R-squared     =  0.0170
-------------+------------------------------           Adj R-squared = -0.0685
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.7274

------------------------------------------------------------------------------
     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |   .1541237   .6860262     0.22   0.824     -1.26503    1.573277
  healthgood |   .4139175   .7124214     0.58   0.567    -1.059838    1.887673
       _cons |    15.9366   .6904955    23.08   0.000      14.5082      17.365
------------------------------------------------------------------------------

Ŷi = b0 + b1(Housing)i + b2(Healthgood)i, where Healthgood = 1 if excellent/good, 0 if fair/poor
Comparing models 1 and 2
- If we remove healthgood from model 2, we are left with model 1
- Model 1 is nested in model 2
- To decide whether model 2 is better than model 1, use the t test for the new variable, healthgood
  - p = 0.567 > α = 0.05 for the test of H0: β2 = 0
  - Fail to reject H0
  - Conclude model 2 is no better than model 1
What if we add more than one variable?
- The t test on each row only tests that variable in the presence of everything else in the model
- When more than one variable is added at a time, the t test is not sufficient
  - The t test only tests one variable at a time
- Use the F test instead to compare nested models that differ by more than one variable
When would more than one variable need to be added??
- Many modeling scenarios require adding more than one variable at once to go from the parent model to the extended model
- One that arises frequently is when a categorical variable needs to be added
Coding a categorical predictor
- A categorical predictor (such as year in program) cannot be added as a single variable
- If we add year (1, 2, 3, or 4) to the model in its original form, then software thinks it is a continuous predictor
- As a continuous predictor, the difference in mean number of credits taken would be assumed to change by a constant amount for each additional year
Coding a categorical predictor
- A categorical predictor should always be recoded as a set of dummy variables
- Choose one category (year = 1) as the reference group
- For each other category (such as year = 2), create a dummy variable for membership in that category:

. gen year2=1 if year==2
. replace year2=0 if year~=2
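The same pattern builds the dummy for the combined 3rd/4th-year category used below; a sketch (the combining of years 3 and 4 is described on the next slide):

. gen year34=1 if year==3 | year==4
. replace year34=0 if year~=3 & year~=4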
Example
- Year 1 = reference group (no dummy variable for this group)
- Year2 = 1 for those in year 2, 0 else
- Year34 = 1 for those in yr 3/4, 0 else
  - very few observations, so these categories were combined
- For someone in year 3: Year2 = 0, Year34 = 1
- For a first year: Year2 = 0, Year34 = 0
Model 3

. reg credits housing year2 year34

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  3,    22) =    2.94
       Model |  19.9853465     3  6.66178216           Prob > F      =  0.0555
    Residual |  49.8242691    22  2.26473951           R-squared     =  0.2863
-------------+------------------------------           Adj R-squared =  0.1890
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.5049

------------------------------------------------------------------------------
     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |  -1.402299   .8537457    -1.64   0.115    -3.172859    .3682613
       year2 |   .7068966   .7215468     0.98   0.338    -.7894999    2.203293
      year34 |   -2.10197   1.087462    -1.93   0.066    -4.357228    .1532874
       _cons |   17.34483   .9268436    18.71   0.000     15.42267    19.26698
------------------------------------------------------------------------------

Ŷi = b0 + b1(Housing)i + b2(Year2)i + b3(Year34)i
- What is the mean number of credits taken by second year students who live on campus?
  Ŷ = 17.3 - 1.4(Housing) + 0.7(Year2) - 2.1(Year34) = 17.3 - 1.4(1) + 0.7(1) - 2.1(0) = 16.6
- What is the mean number of credits taken by first year students who live on campus?
  Ŷ = 17.3 - 1.4(Housing) + 0.7(Year2) - 2.1(Year34) = 17.3 - 1.4(1) + 0.7(0) - 2.1(0) = 15.9
- b2 = 16.6 - 15.9 = 0.7 is the difference in mean number of credits
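These two fitted means can also be read off with lincom after fitting Model 3; a sketch:

. reg credits housing year2 year34
. lincom _cons + housing + year2    // 2nd year, on campus: about 16.6 credits
. lincom _cons + housing            // 1st year, on campus: about 15.9 credits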
Interpretation
- b0: first year students who live off campus take an average of 17.3 credits
- b2: after adjusting for housing, second year students take an average of 0.71 more credits than first year students
- b3: after adjusting for housing, 3rd and 4th year students take an average of 2.1 fewer credits than first year students
Notice
- Coding: Year2 = 0 for anyone not in year 2
- Interpretation: the coefficient for Year2 compares second year students to first year students (the reference category), not to everyone who is not in year 2
Evaluation
- We cannot evaluate Year using the t test for each row, because two variables are needed to define Year and the t tests are separate
- We must use an F test to evaluate Year, by comparing the residual sums of squares (RSS) in the parent model and in the extended model.
Parent model (Model 1)

. reg credits housing

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  1,    24) =    0.06
       Model |  .176282088     1  .176282088           Prob > F      =  0.8074
    Residual |  69.6333335    24   2.9013889           R-squared     =  0.0025
-------------+------------------------------           Adj R-squared = -0.0390
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.7033

------------------------------------------------------------------------------
     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |   .1666667   .6761572     0.25   0.807    -1.228853    1.562187
       _cons |       16.2   .5135783    31.54   0.000     15.14003    17.25997
------------------------------------------------------------------------------

RSSparent = 69.63 (the Residual SS); residual dfparent = 24
Extended model (Model 3)

. reg credits housing year2 year34

      Source |       SS       df       MS              Number of obs =      26
-------------+------------------------------           F(  3,    22) =    2.94
       Model |  19.9853465     3  6.66178216           Prob > F      =  0.0555
    Residual |  49.8242691    22  2.26473951           R-squared     =  0.2863
-------------+------------------------------           Adj R-squared =  0.1890
       Total |  69.8096156    25  2.79238462           Root MSE      =  1.5049

------------------------------------------------------------------------------
     credits |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     housing |  -1.402299   .8537457    -1.64   0.115    -3.172859    .3682613
       year2 |   .7068966   .7215468     0.98   0.338    -.7894999    2.203293
      year34 |   -2.10197   1.087462    -1.93   0.066    -4.357228    .1532874
       _cons |   17.34483   .9268436    18.71   0.000     15.42267    19.26698
------------------------------------------------------------------------------

RSSextended = 49.82 (the Residual SS); residual dfextended = 22
The F test
Numerator of F-statistic: (RSSparent - RSSextended) / (number of variables added)
Denominator of F-statistic: RSSextended / (residual dfextended)

Fobs = [(69.6 - 49.8) / 2] / (49.8 / 22) = 4.4

H0: all new β's = 0 in the population
HA: at least one new β is not 0 in the population
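Stata computes this same partial F test directly with the test command after fitting the extended model; a sketch:

. reg credits housing year2 year34
. test year2 year34    // joint H0: both year coefficients are 0; reports F(2, 22) ≈ 4.4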
The F table
- Recall: the F distribution is very similar to the χ2 distribution
- The F distribution is automatically 2-sided (like the χ2)
- df change the shape of the F distribution (like the χ2), but now there are two sets of df: the numerator df and the denominator df
The F table
- numerator df: # of variables added = 2
- denominator df: residual dfextended = 22
- Using α = 0.05, find Fcr
  - 1 - α = 0.95
  - Find the quantile in R, using the appropriate degrees of freedom
- Fcr = 3.49 is shown in the 2nd row of the table (this is the nearest printed row, denominator df = 20; the exact value for df = 22 is about 3.44)
Conclusion
- Fcr = 3.49 < Fobs = 4.4, so p < α
- Reject H0: conclude that adding year improves prediction after adjusting for housing
- Notice: neither individual t test was statistically significant, but the F test was still significant
- We must always use the F test to evaluate multiple X's at once
The F test: notes
- The F test can be used to compare any two nested models
- If only one variable is added, it's easier to compare the models using the t test for that variable
  - t2 = F if one variable is added
- For any regression, the estimated variance of the residuals is RSS / (residual df)
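The t2 = F identity is easy to verify with Model 2 from earlier; a sketch:

. reg credits housing healthgood
. test healthgood    // F(1, 23) = (0.58)^2 ≈ 0.34, with the same p-value (0.567) as the t test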
Nested Models
- Comparing nested models
  - 1 new variable: use the t test for that variable
  - 2+ new variables: use the F test
- Categorical predictor
  - set one group as the reference
  - create a dummy variable for each other group
  - include/exclude all dummy variables together
  - evaluate the categorical predictor with the F test