Lecture 9: Interactions, Quadratic terms and Splines
Ani Manichaikul
amanicha@jhsph.edu
30 April 2007
Effect Modification
- The phenomenon in which the relationship between the primary predictor and the outcome varies across levels of another predictor
- We say the other predictor modifies the effect of the primary predictor on the outcome
- In linear regression, coded by including an interaction term between the primary predictor and the other predictor
Reminder: Nested models
- Parent model
  - contains one set of variables
- Extended model
  - adds one or more new variables to the parent model
  - one variable added: compare models with t test
  - two or more variables added: compare models with F test
- Return to the example of wage versus experience
Model 1
- This model allows the average wage to differ for men and women, but the difference in average wage between men and women is always the same regardless of experience level.

$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i)$$
Model 1
      Source |       SS       df       MS              Number of obs =     534
-------------+------------------------------           F(  2,   531) =   61.62
       Model |  2651.49936     2  1325.74968           Prob > F      =  0.0000
    Residual |  11425.1992   531  21.5163827           R-squared     =  0.1884
-------------+------------------------------           Adj R-squared =  0.1853
       Total |  14076.6985   533  26.4103162           Root MSE      =  4.6386

------------------------------------------------------------------------------
      wagehr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     educyrs |   .7512834   .0768225     9.78   0.000     .6003701    .9021966
      gender |  -2.124057   .4028322    -5.27   0.000    -2.915397   -1.332716
       _cons |   .2178312   1.036322     0.21   0.834    -1.817962    2.253624
------------------------------------------------------------------------------
Model 2
- What is the interaction variable?

$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) + \hat\beta_3(\text{Gender}_i \times \text{Experience}_i)$$
Model 2: Creating the interaction variable
- gender:
  - 0 for men
  - 1 for women
- gender*experience
  - = 0*experience = 0 for men
  - = 1*experience = experience for women
Model 2: output
. generate gender_educ = gender*educ
. reg wagehr educyrs gender gender_educ

      Source |       SS       df       MS              Number of obs =     534
-------------+------------------------------           F(  3,   530) =   41.50
       Model |  2677.43224     3  892.477414           Prob > F      =  0.0000
    Residual |  11399.2663   530  21.5080496           R-squared     =  0.1902
-------------+------------------------------           Adj R-squared =  0.1856
       Total |  14076.6985   533  26.4103162           Root MSE      =  4.6377

------------------------------------------------------------------------------
      wagehr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     educyrs |   .6831451   .0987423     6.92   0.000     .4891708    .8771194
      gender |   -4.37045   2.085057    -2.10   0.037    -8.466441   -.2744591
 gender_educ |   .1725303   .1571232     1.10   0.273    -.1361305     .481191
       _cons |   1.104571   1.313655     0.84   0.401    -1.476038    3.685181
------------------------------------------------------------------------------
Model 2: Interpretation
- Equation for men:

$$E[\text{Wage}_i] = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) = 1.10 + 0.68(\text{Experience}_i)$$

- Equation for women:

$$E[\text{Wage}_i] = (\hat\beta_0 + \hat\beta_2) + (\hat\beta_1 + \hat\beta_3)(\text{Experience}_i) = (1.10 - 4.37) + (0.68 + 0.17)(\text{Experience}_i)$$

- $\hat\beta_2$: change in mean wage for women vs. men with no experience
- $\hat\beta_3$: change in slope (of experience) for women vs. men
Model 2: Predictions by gender, no experience
- Men with no experience

$$E[\text{Wage}_i] = 1.10 + 0.68(0) - 4.37(0) + 0.17(0 \times 0) = 1.10 = \hat\beta_0$$

- Women with no experience

$$E[\text{Wage}_i] = 1.10 + 0.68(0) - 4.37(1) + 0.17(1 \times 0) = 1.10 - 4.37 = \hat\beta_0 + \hat\beta_2$$

- $\hat\beta_2$ is the difference in mean wage between women and men with no experience
Model 2: Predictions by gender, 1 year of experience
- Men with 1 year of experience

$$E[\text{Wage}_i] = 1.10 + 0.68(1) - 4.37(0) + 0.17(0 \times 1) = 1.10 + 0.68 = \hat\beta_0 + \hat\beta_1$$

- Women with 1 year of experience

$$E[\text{Wage}_i] = 1.10 + 0.68(1) - 4.37(1) + 0.17(1 \times 1) = 1.10 + 0.68 - 4.37 + 0.17 = \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3$$

- $\hat\beta_2 + \hat\beta_3$ is the difference in mean wage between women and men with one year of experience
Model 2: Predictions by gender, 2 years of experience
- Men with 2 years of experience

$$E[\text{Wage}_i] = 1.10 + 0.68(2) - 4.37(0) + 0.17(0 \times 2) = 1.10 + 0.68(2) = \hat\beta_0 + 2\hat\beta_1$$

- Women with 2 years of experience

$$E[\text{Wage}_i] = 1.10 + 0.68(2) - 4.37(1) + 0.17(1 \times 2) = 1.10 + 0.68(2) - 4.37 + 0.17(2) = \hat\beta_0 + 2\hat\beta_1 + \hat\beta_2 + 2\hat\beta_3$$

- $\hat\beta_2 + 2\hat\beta_3$ is the difference in mean wage between women and men with two years of experience
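The prediction slides can be reproduced with a few lines of Python. This is only a sketch: it uses the rounded coefficients quoted on the slides (1.10, 0.68, -4.37, 0.17), and the function names `predicted_wage` and `gender_gap` are hypothetical helpers, not part of the lecture.

```python
# Rounded Model 2 coefficients as quoted on the slides (assumed inputs).
B0, B1, B2, B3 = 1.10, 0.68, -4.37, 0.17


def predicted_wage(gender, experience):
    """Mean hourly wage under Model 2 (gender: 0 = men, 1 = women)."""
    return B0 + B1 * experience + B2 * gender + B3 * gender * experience


def gender_gap(experience):
    """Women-minus-men difference in mean wage at a given experience level."""
    return predicted_wage(1, experience) - predicted_wage(0, experience)


for years in (0, 1, 2):
    men = predicted_wage(0, years)
    women = predicted_wage(1, years)
    print(f"{years} yr: men {men:.2f}, women {women:.2f}, gap {gender_gap(years):.2f}")
```

With an interaction in the model the gap is B2 + B3 * experience, so it changes with every additional year instead of staying fixed as in Model 1.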
Model 2: Interpretation
- β0: The average wage for men with no experience
- β1: The difference in average wage for a one year increase in experience among men
- β2: The difference in average wage between women and men with no experience
- β3: The difference of the difference in average wage for a one year increase in experience between women and men
  - the change in slope between women and men
  - the slope for women is β1 + β3
Compare to model 1
- In the parent model
  - β1 was slope for both men and women
  - β2 was difference between women & men at every experience level
- In the extended model (with interaction)
  - β1 is slope for men
  - β2 is difference between women & men for experience = 0
  - β3 is the change in slope (wage per year of experience) between men & women
Is the change in slope statistically significant?
- Test model 1 vs. model 2
  - only 1 variable added
  - use t test for that variable to compare models
- H0: β3 = 0 in the population
- From the t-statistic, p = 0.27
- Fail to reject H0
- Conclude that model 1 is better
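As a check on the nested-model reminder, the t statistic for the single added variable can be compared with the F statistic computed from the two residual sums of squares. The sketch below only re-uses numbers printed in the Model 1 and Model 2 output; the variable names are illustrative assumptions.

```python
# Values copied from the Model 1 and Model 2 regression output above.
SSE_PARENT = 11425.1992          # residual SS, Model 1 (parent)
SSE_EXTENDED = 11399.2663        # residual SS, Model 2 (extended)
DF_EXTENDED = 530                # residual df, Model 2
COEF, SE = 0.1725303, 0.1571232  # gender_educ row of the Model 2 output

# t test for the one added variable.
t_stat = COEF / SE

# Nested-model F test with one added variable.
f_stat = ((SSE_PARENT - SSE_EXTENDED) / 1) / (SSE_EXTENDED / DF_EXTENDED)

print(f"t = {t_stat:.2f}, t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
```

When exactly one variable is added, the squared t statistic equals the nested F statistic, which is why the t test is enough here.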
Model 3: Interaction of two binary predictors
- Model 2:
  - continuous X, binary X, their interaction
  - slope changes by group
- Model 3:
  - binary X, binary X, their interaction
  - difference in mean changes by group
Model 3: output
      Source |       SS       df       MS              Number of obs =     534
-------------+------------------------------           F(  3,   530) =   13.94
       Model |  1029.58518     3  343.195059           Prob > F      =  0.0000
    Residual |  13047.1134   530   24.617195           R-squared     =  0.0731
-------------+------------------------------           Adj R-squared =  0.0679
       Total |  14076.6985   533  26.4103162           Root MSE      =  4.9616

------------------------------------------------------------------------------
      wagehr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.0951139   .7350696    -0.13   0.897    -1.539121    1.348894
     married |   2.521311   .6121088     4.12   0.000     1.318854    3.723768
gender_mar~d |  -3.097184    .907319    -3.41   0.001    -4.879567   -1.314802
       _cons |   8.354752   .4936948    16.92   0.000     7.384914    9.324591
------------------------------------------------------------------------------
Model 3: Creating the interaction variable
- gender:
  - 0 for men
  - 1 for women
- married:
  - 0 if unmarried
  - 1 if married
- gender*married
  - = 0*0 = 0 for unmarried men
  - = 1*0 = 0 for unmarried women
  - = 0*1 = 0 for married men
  - = 1*1 = 1 for married women
Graph for Model 3
[Figure: mean hourly wage (y-axis, 2 to 12) for unmarried men, unmarried women, married men and married women. Annotated differences: β1, β2, β1 + β3 and β2 + β3; β3 = difference of differences.]
Model 3: Interpretation
- β0: The average wage for unmarried men
- β1: The difference in average wage between unmarried women and unmarried men
- β1 + β3: The difference in average wage between married women and married men
- β3: The difference of the difference in average wage between married women and married men and between unmarried women and unmarried men
Model 3: Interpretation
- β0: The average wage for unmarried men
- β2: The difference in average wage between married men and unmarried men
- β2 + β3: The difference in average wage between married women and unmarried women
- β3: The difference of the difference in average wage between married women and unmarried women and between married men and unmarried men
Model 3: conclusion
- The coefficient of the interaction variable is statistically significantly different from 0
  - (p = 0.001, CI: -4.9 to -1.3)
- The difference in mean hourly wage between women and men is greater for married people than for unmarried people.
-or-
- The difference in mean hourly wage between married people and unmarried people is greater for men than for women.
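The four group means behind this conclusion follow directly from the Model 3 coefficients. The sketch below plugs the printed coefficients into the model; `mean_wage` is a hypothetical helper name used only for illustration.

```python
# Coefficients copied from the Model 3 output: _cons, gender, married, gender*married.
B0, B_GENDER, B_MARRIED, B_INTERACT = 8.354752, -0.0951139, 2.521311, -3.097184


def mean_wage(gender, married):
    """Fitted mean hourly wage (gender: 1 = women; married: 1 = married)."""
    return B0 + B_GENDER * gender + B_MARRIED * married + B_INTERACT * gender * married


# Women-minus-men difference within each marital status.
gap_unmarried = mean_wage(1, 0) - mean_wage(0, 0)
gap_married = mean_wage(1, 1) - mean_wage(0, 1)

print(f"unmarried: women - men = {gap_unmarried:.2f}")
print(f"married:   women - men = {gap_married:.2f}")
print(f"difference of differences = {gap_married - gap_unmarried:.2f}")
```

The difference of differences printed on the last line is the interaction coefficient itself.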
Summary
- Interaction
  - interaction = var1*var2
  - interaction variable changes interpretation of entire model
  - with interaction, the effect of one variable changes according to the level of the second variable
- Test for interaction by testing new variable
  - if significant (p < α, 0 not in CI), keep
  - if not significant, go back to parent model without interaction variable
Flexibility in linear models
- In linear regression, we assume the outcome, Y, has a linear relationship with the predictors, X
- However, we have flexibility in defining the predictors
  - transform X, such as X^2 or X^3
  - use linear splines to fit "broken arrow" models
Example: Hospital Expenditures ($$)
- The data are similar to an example from the book by Pagano and Gauvreau: Principles of Biostatistics
Data:
- Y - Average hospital expenditure ($s) per admission
- X1 - Average length of stay (days)
- X2 - Average employee salary ($s)
- n = 51; 50 U.S. states + DC
Scientific Question
- How is expenditure per admission (Y) related to:
  - Length of stay (X1)
  - Employee salary (X2)
Model
- We might formulate a MLR:
  1) Y = β0 + β1X1 + β2X2 + ε
  2) ε ~ N(0, σ^2)
where:
- Y = Expenditures per admission in $s
- X1 = Length of stay (LOS) in days
- X2 = Salary in $s
Model: E( Y | X ) = β0 + β1X1 + β2X2
Parameter Interpretations:
- β0: expected expenditure when LOS = 0 and salary = 0 (Need to center the model!)
- β1: difference in expected expenditure ($s) for two states with the same average salary but LOS that differs by one day
- β2: difference in expected expenditure ($s) for two states with the same average LOS but salary that differs by one dollar
Basic Model
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  2,    48) =   46.08
       Model |  25555145.4     2  12777572.7           Prob > F      =  0.0000
    Residual |  13311254.7    48  277317.807           R-squared     =  0.6575
-------------+------------------------------           Adj R-squared =  0.6432
       Total |  38866400.2    50  777328.003           Root MSE      =  526.61

------------------------------------------------------------------------------
      expend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         los |   313.5297   73.44155     4.27   0.000     165.8656    461.1938
      salary |    .333249   .0379306     8.79   0.000     .2569844    .4095137
       _cons |  -4662.343   808.7017    -5.77   0.000    -6288.346   -3036.339
------------------------------------------------------------------------------
Check for curvature & other patterns of interest:
[Figure: added-variable plots (AVPlots) of e(expend | X) against e(los | X) and against e(salary | X), and residual plots of standardized residuals against length of stay (days) and against salary ($).]
Diagnosis
- The Alaskan outlier appears here as well as some curvature in the salary relationship
- There appears to be a non-linear relationship between expenditures (Y) and salary (X2).
- How could we incorporate this in our model?
  - Define a new variable, salary2 (salary squared), and include it in the model:
New Model
E( Y | X ) = β0 + β1X1 + β2X2 + β3X2^2
- Linear relationship with X1
- Quadratic relationship with X2
Quadratic Term
- Expenditures are linearly related to length of stay, but have a quadratic relationship with salary.
- Define a new variable:
  salary2 = salary^2
  and include it in the regression.
Model Output
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =  142.76
       Model |  17552265.1     3  5850755.03           Prob > F      =  0.0000
    Residual |  1885257.79    46  40983.8651           R-squared     =  0.9030
-------------+------------------------------           Adj R-squared =  0.8967
       Total |  19437522.9    49   396684.14           Root MSE      =  202.44

------------------------------------------------------------------------------
      expend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         los |   441.9992   29.34269    15.06   0.000     382.9354     501.063
      salary |  -2.883287   .2929512    -9.84   0.000    -3.472967   -2.293607
     salary2 |   .0001002   9.58e-06    10.46   0.000     .0000809    .0001195
       _cons |   19724.65   2206.543     8.94   0.000     15283.11    24166.19
------------------------------------------------------------------------------
Interpretations
- β0: ???
- β1: We estimate that expected expenditures per admission will be $442 higher (95% CI: $383-$501) in a state whose average LOS is one day longer than another state with the same average employee salary
- β2: ???
- β3: ???
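One way to start thinking about β2 and β3 is to note that, with a quadratic term, the slope for salary is no longer a single number: it is β2 + 2·β3·salary. The sketch below evaluates that slope with the fitted coefficients from the output; it is an illustration under that reading, not the slide's own answer, and `salary_slope` is a hypothetical helper.

```python
# Coefficients copied from the quadratic model output (salary in $).
B_SALARY, B_SALARY2 = -2.883287, 0.0001002


def salary_slope(salary):
    """Change in expected expenditure per extra dollar of salary, holding LOS fixed."""
    return B_SALARY + 2 * B_SALARY2 * salary


# Salary at which the fitted parabola bottoms out (slope equals zero).
turning_point = -B_SALARY / (2 * B_SALARY2)

for salary in (10000, 15000, 20000):
    print(f"salary ${salary}: slope = {salary_slope(salary):.3f} $ per $")
print(f"slope is zero near salary = ${turning_point:.0f}")
```

The sign of the salary slope therefore depends on where in the salary range two states are being compared.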
Inferences
- Is salary related to expenditures?
- Could test:
  - H0: β2 = 0?
  - H0: β3 = 0?
- But really want
  - H0: β2 = β3 = 0
  - overall test for salary
Hospital Example
- Recall Model:
E( Y | X ) = β0 + β1X1 + β2X2 + β3X2^2
H0: β2 = β3 = 0
(Test by hand: need SSE_E, SSE_F)
Full Model Results
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =  142.76
       Model |  17552265.1     3  5850755.03           Prob > F      =  0.0000
    Residual |  1885257.79    46  40983.8651           R-squared     =  0.9030
-------------+------------------------------           Adj R-squared =  0.8967
       Total |  19437522.9    49   396684.14           Root MSE      =  202.44

------------------------------------------------------------------------------
      expend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         los |   441.9992   29.34269    15.06   0.000     382.9354     501.063
      salary |  -2.883287   .2929512    -9.84   0.000    -3.472967   -2.293607
     salary2 |   .0001002   9.58e-06    10.46   0.000     .0000809    .0001195
       _cons |   19724.65   2206.543     8.94   0.000     15283.11    24166.19
------------------------------------------------------------------------------
SSE_F = 1885257.79, n-p-s-1 = 50-1-2-1 = 46
Null Model Results
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   47.04
       Model |  9621038.76     1  9621038.76           Prob > F      =  0.0000
    Residual |  9816484.12    48  204510.086           R-squared     =  0.4950
-------------+------------------------------           Adj R-squared =  0.4845
       Total |  19437522.9    49   396684.14           Root MSE      =  452.23

------------------------------------------------------------------------------
      expend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         los |   443.3567   64.63975     6.86   0.000     313.3898    573.3236
       _cons |  -786.6091   490.4083    -1.60   0.115    -1772.641    199.4228
------------------------------------------------------------------------------
"Null" model: E( Y | X ) = β0 + β1X1
SSE_E = 9816484.12, s = 2
F-test Results
F-test:

$$F_{2,46} = \frac{7931226.3 / 2}{1885257.79 / (50 - 1 - 2 - 1)} = 96.76$$

(p < 0.001; F_{.05,2,46} = 3.2)
Reject the null: conclude that the salary effects were statistically significant in the regression model
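The by-hand F test can be checked with the two residual sums of squares printed in the full and null model output. A minimal sketch, with variable names chosen here for illustration:

```python
# Residual sums of squares copied from the two regression outputs.
SSE_NULL = 9816484.12   # null model: los only
SSE_FULL = 1885257.79   # full model: los, salary, salary2

N = 50  # observations
P = 1   # predictors in the null model
S = 2   # predictors added in the full model (salary, salary2)

df_full = N - P - S - 1
f_stat = ((SSE_NULL - SSE_FULL) / S) / (SSE_FULL / df_full)

print(f"F({S},{df_full}) = {f_stat:.2f}")
```

The result agrees with the 96.76 reported on the slide, far above the critical value of 3.2.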
Linear Splines: set-up
- The broken arrow model
- Example:
  - A researcher tells you most Health Maintenance Organizations (HMOs) will usually pay for the first week of a hospital stay only
  - She expects expenditures to increase dramatically if LOS is longer than one week
  - How should we set up the model?
The researcher thought the LOS regression line should look like:
[Figure: "Broken Arrow Model", expenditures (y-axis, 2000 to 3500) against length of stay in days (x-axis, 3 to 9).]
Defining a New Variable
- Similar to what we did in ANCOVA, we could just define a new variable that lets us check whether the slope is indeed different when LOS is greater than 7.
- Idea, include a term:
  (LOS-7)+ = (LOS - 7) if LOS > 7
           = 0         if LOS <= 7
The spline allows you to change the magnitude of the slope!
When to use a spline?
- When a continuous predictor is used, a typical regression equation assumes there is a straight-line relationship between X and Y in the population.
- If the relationship between X and Y is
  - a bent line
  - a curve
  adding a spline may more accurately model the relationship between X and Y
Visualizing the Model
[Figure: "Broken Arrow Model", expenditures (y-axis, 2000 to 3500) against length of stay in days (x-axis, 3 to 9), with slope = β1 before the bend and slope = β1 + β2 after it.]
The Model
- Model:
E(expenditures) = β0 + β1LOS + β2(LOS-7)+
Where:
(LOS-7)+ = (LOS - 7) if LOS > 7
         = 0         if LOS <= 7
Then:
E(expenditures | LOS <= 7) = β0 + β1LOS
E(expenditures | LOS > 7)  = β0 + β1LOS + β2(LOS - 7)
                           = (β0 - β2⋅7) + (β1 + β2)LOS
                           = β0* + β1*LOS
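The spline term and the two-piece algebra above can be sketched in Python. The coefficient values below are made-up placeholders (not estimates from the data), and `hinge` and `broken_arrow` are hypothetical helper names.

```python
def hinge(x, knot):
    """Linear spline term (x - knot)+: 0 up to the knot, x - knot beyond it."""
    return max(0.0, x - knot)


def broken_arrow(los, b0, b1, b2, knot=7.0):
    """E(expenditures) = b0 + b1*LOS + b2*(LOS - knot)+."""
    return b0 + b1 * los + b2 * hinge(los, knot)


# Placeholder coefficients, chosen only to illustrate the algebra.
b0, b1, b2 = 1000.0, 100.0, 200.0

# Below the knot the spline term vanishes: b0 + b1*LOS.
below = broken_arrow(5, b0, b1, b2)

# Above the knot the line is (b0 - 7*b2) + (b1 + b2)*LOS.
above = broken_arrow(9, b0, b1, b2)
above_by_algebra = (b0 - 7 * b2) + (b1 + b2) * 9

print(below, above, above_by_algebra)
```

Because the spline term is zero at the knot itself, the two line segments meet at LOS = 7 and the fitted line bends without jumping.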
New Model
E(Y | X) = β0 + β1X1 + β2(X1 - 7)+ + β3X2 + β4X2^2
- Broken Arrow relationship with X1
- Quadratic relationship with X2
Adding Spline to Quadratic
- Expenditures have a different linear relationship before and after a 7 day length of stay, and have a quadratic relationship with salary.
- We'll just define a new variable:
  los7 = (los-7)*(los>7)
  and include it in the regression.
Results
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =  126.01
       Model |  17844348.0     4  4461087.00           Prob > F      =  0.0000
    Residual |  1593174.87    45  35403.8861           R-squared     =  0.9180
-------------+------------------------------           Adj R-squared =  0.9108
       Total |  19437522.9    49   396684.14           Root MSE      =  188.16

------------------------------------------------------------------------------
      expend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         los |   212.5361   84.41545     2.52   0.015     42.51468    382.5576
        los7 |   347.7778   121.0805     2.87   0.006     103.9091    591.6465
      salary |  -3.143061   .2869069   -10.95   0.000    -3.720921   -2.565201
     salary2 |   .0001082   9.32e-06    11.60   0.000     .0000894    .0001269
       _cons |   23276.97   2394.892     9.72   0.000     18453.41    28100.53
------------------------------------------------------------------------------
Centering LOS in the expenditures model
- Y: Average hospital expenditure ($s) per admission
- X1: Average length of stay (days)
- X2: Average employee salary ($1000s)
Centered Model:
E(Y|X) = β0 + β1(X1-7) + β2(X1-7)+ + β3(X2-15) + β4(X2-15)^2
Final Model for Expenditures
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =  126.01
       Model |  17844345.3     4  4461086.31           Prob > F      =  0.0000
    Residual |  1593177.63    45  35403.9473           R-squared     =  0.9180
-------------+------------------------------           Adj R-squared =  0.9108
       Total |  19437522.9    49   396684.14           Root MSE      =  188.16

------------------------------------------------------------------------------
      expend |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        losc |   212.5366   84.41552     2.52   0.015       42.515    382.5582
       losc7 |   347.7772   121.0806     2.87   0.006     103.9083     591.646
        salc |   101.6865   19.69614     5.16   0.000     62.01645    141.3566
       salc2 |   108.1581   9.324742    11.60   0.000     89.37714    126.9391
       _cons |   1954.413   68.69979    28.45   0.000     1816.045    2092.782
------------------------------------------------------------------------------
E( Y | X ) = 1954 + 213(X1-7) + 348(X1-7)+ + 102(X2-15) + 108(X2-15)^2
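To see what the centered model predicts, the fitted coefficients can be wrapped in a small function. This sketch assumes salary is entered in $1000s, as stated on the centering slide; `expected_expend` is a hypothetical name.

```python
# Coefficients copied from the final model output: _cons, losc, losc7, salc, salc2.
B0, B_LOS, B_LOS7, B_SAL, B_SAL2 = 1954.413, 212.5366, 347.7772, 101.6865, 108.1581


def expected_expend(los, salary_thousands):
    """Fitted expenditure per admission ($) for a given LOS (days) and salary ($1000s)."""
    losc = los - 7                 # LOS centered at 7 days
    losc7 = max(0.0, losc)         # spline term (LOS - 7)+
    salc = salary_thousands - 15   # salary centered at $15,000
    return B0 + B_LOS * losc + B_LOS7 * losc7 + B_SAL * salc + B_SAL2 * salc ** 2


# At the centering point the prediction is just the intercept.
print(f"LOS 7, salary 15: {expected_expend(7, 15):.0f}")

# Slope per extra day of LOS below and above the 7-day knot (salary held at 15).
slope_below = expected_expend(6, 15) - expected_expend(5, 15)
slope_above = expected_expend(9, 15) - expected_expend(8, 15)
print(f"slope below 7 days: {slope_below:.0f}, above 7 days: {slope_above:.0f}")
```

Centering makes the intercept interpretable: it is the expected expenditure for a state with a 7-day average LOS and a $15,000 average salary.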
Back to modelling wages
[Figure: standardized residuals (y-axis, -2 to 4) against age (x-axis, 20 to 60).]
We removed an outlier, but do we still need a spline?
How should we add the spline?
- Goal: let the regression line bend
- Model:
  E(Wagei) = β0 + β1(age-35) + β2(age-35)+
- What is (age-35)+ ?
  - 0 if age < 35
  - (age-35) if age >= 35
Fitted model with spline at 35
      Source |       SS       df       MS              Number of obs =     533
-------------+------------------------------           F(  2,   530) =   28.18
       Model |  1231.65577     2  615.827885           Prob > F      =  0.0000
    Residual |  11584.1395   530  21.8568669           R-squared     =  0.0961
-------------+------------------------------           Adj R-squared =  0.0927
       Total |  12815.7952   532  24.0898407           Root MSE      =  4.6751

------------------------------------------------------------------------------
      wagehr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    age_cent |   .3328909   .0470853     7.07   0.000     .2403943    .4253876
  age_spline |   -.374546   .0663082    -5.65   0.000     -.504805   -.2442869
       _cons |   10.45389   .3577241    29.22   0.000     9.751156    11.15662
------------------------------------------------------------------------------
Fitted Graph (with spline)
[Figure: fitted wage ($/hour, y-axis, 10 to 50) against age (x-axis, 20 to 60).]
- E(Wagei) = 10.45 + 0.33(age-35) - 0.37(age-35)+
- For a person under 35:
  - E(Wagei) = 10.45 + 0.33(age-35) - 0.37(age-35)+
             = 10.45 + 0.33(age-35), since (age-35)+ = 0
- For a person 35 or older:
  - E(Wagei) = 10.45 + 0.33(age-35) - 0.37(age-35)+
             = 10.45 - 0.04(age-35)
  - β1 + β2 = new slope for those over 35
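The two slopes can be recovered from the fitted coefficients with a short sketch. The coefficients are copied from the output above; `expected_wage` is a hypothetical helper name.

```python
# Coefficients copied from the fitted spline model: _cons, age_cent, age_spline.
B0, B1, B2 = 10.45389, 0.3328909, -0.374546


def expected_wage(age):
    """Fitted mean hourly wage with a linear spline (knot) at age 35."""
    centered = age - 35
    return B0 + B1 * centered + B2 * max(0.0, centered)


slope_under_35 = B1
slope_over_35 = B1 + B2

print(f"slope under 35: {slope_under_35:.2f} $/hour per year")
print(f"slope over 35:  {slope_over_35:.2f} $/hour per year")
for age in (25, 35, 45):
    print(f"age {age}: expected wage = {expected_wage(age):.2f}")
```

The over-35 slope is the sum of the two coefficients, which is the -0.04 shown in the simplified equation.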
Interpretation
- β0 is the average wage for people who are 35 years old
- β1 is the change in average wage per additional year of age for those under 35
- β2 is the difference in the change in average wage per additional year of age for those over age 35 as compared to those under age 35
  - β2 is the change in the slope for over 35 vs. under 35
Better Interpretation
- The average wage for people who are 35 years old is $10.45/hour (95% CI: $9.75, $11.16)
- For each additional year of age, those under age 35 earn an average of $0.33 more per hour (95% CI: $0.24, $0.43)
- For each additional year of age, those over age 35 earn an average of $0.04 less per hour (95% CI: -$0.10, $0.01)
Is the change in slope statistically significant?
- One variable was added to create the change in slope
- compare nested models with t test
. regress wagehr age_cent age_spline if sres_age<6

      Source |       SS       df       MS              Number of obs =     533
-------------+------------------------------           F(  2,   530) =   28.18
       Model |  1231.65577     2  615.827885           Prob > F      =  0.0000
    Residual |  11584.1395   530  21.8568669           R-squared     =  0.0961
-------------+------------------------------           Adj R-squared =  0.0927
       Total |  12815.7952   532  24.0898407           Root MSE      =  4.6751

------------------------------------------------------------------------------
      wagehr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    age_cent |   .3328909   .0470853     7.07   0.000     .2403943    .4253876
  age_spline |   -.374546   .0663082    -5.65   0.000     -.504805   -.2442869
       _cons |   10.45389   .3577241    29.22   0.000     9.751156    11.15662
------------------------------------------------------------------------------
- H0: spline is not needed (no change in slope in the population)
- p < 0.001 or CI does not include 0: reject H0
- Conclude slope differs for those over vs. under 35 in population
“L” – Linear relationship
- With the spline, there is no longer any pattern in the residuals
- After removing the one outlier, no others appear to stand out
“I” – Independence
- We cannot check this by looking at the data
“N” – Normality of the residuals
- The residuals are slightly skewed to positive values
  - the estimated regression coefficients are still correct
  - their confidence intervals may be misleading
“E” – Equal variance of the residuals across X
- The vertical spread of the residuals may be smaller for those under 25 years of age
  - the estimated regression coefficients are still correct
  - their confidence intervals may be misleading
Conclusion
- The increase in hourly wage with increasing age is statistically significant for those who recently entered the workforce (ages 18-35): for each additional year, these workers earn an average of 33 cents more per hour.
- However, this increase in wage with increasing age levels off for those over age 35, so that no appreciable increase in average wage is observed for those over age 35.
- One 21-year-old had much higher earnings ($44.50 per hour) than other young workers. This person's results were so unlike the rest of the sample that the observation was dropped from the analysis. It is possible that the data were incorrectly entered for this person, but we are unable to assess the data entry since the original completed surveys are unavailable.
Splines
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically
  - the actual slope above and below the