Lecture 12: Effect modification, and confounding in logistic regression
Ani Manichaikul
amanicha@jhsph.edu
4 May 2007
Today
- Categorical predictor
  - create dummy variables
  - just like for linear regression
- Comparing nested models that differ by two or more variables for logistic regression
  - χ² Test of Deviance
  - analogous to the F test in linear regression
- Effect Modification and Confounding
Example
- Mean SAT scores were compared for the 50 US states. The goal of the study was to compare overall SAT scores using state-wide predictors such as per-pupil expenditures and average teachers’ salary. The investigators also considered the proportion of students eligible to take the SAT who actually took the examination.
Variables
- Outcome
  - Total SAT score [sat_low]
  - 1 = low, 0 = high
- Primary predictor
  - Average expenditures per pupil [expen], in thousands of dollars
  - Continuous, range: 3.65-9.77, mean: 5.9
Variables
- Secondary predictors
  - Percent of pupils taking the SAT, in quartiles
    - percent1 – lowest quartile
    - percent2 – 2nd quartile
    - percent3 – 3rd quartile
    - percent4 – highest quartile
  - Mean teacher salary in thousands, in quartiles
    - salary1 – lowest quartile
    - salary2 – 2nd quartile
    - salary3 – 3rd quartile
    - salary4 – highest quartile
Modifications to variables
- Expenditures: continuous, doesn’t include 0: center at $5,000 per pupil
- Percent: four dummy variables for four categories; must exclude one category to create a reference group
- Salary: four dummy variables for four categories; must exclude one category to create a reference group
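The recoding described above can be sketched in Python as follows. This is a minimal illustration rather than the lecture's own code: the helper names `center_expenditure` and `quartile_dummies` are hypothetical, and the lowest quartile is assumed to be the reference group (the slides do not say which category was dropped).

```python
def center_expenditure(expen, center=5.0):
    """Center per-pupil expenditure (in $1,000s) at $5,000."""
    return expen - center


def quartile_dummies(quartile, prefix, reference=1):
    """Return 0/1 indicators for quartiles 1-4, omitting the reference group.

    Assumption: quartile 1 is the reference; any one category could be dropped.
    """
    if quartile not in (1, 2, 3, 4):
        raise ValueError("quartile must be 1, 2, 3 or 4")
    return {
        f"{prefix}{k}": int(quartile == k)
        for k in (1, 2, 3, 4)
        if k != reference
    }


# A hypothetical state spending $5,900 per pupil, in the 3rd quartile of percent
centered = center_expenditure(5.9)
dummies = quartile_dummies(3, "percent")
print(round(centered, 2), dummies)
```

With this coding, the intercept refers to a state spending $5,000 per pupil in the reference quartile, and each dummy coefficient compares its quartile with that reference.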
Plan
- Assess primary relationship
- Add each secondary predictor separately
- Determine which secondary predictor is more statistically significant
- Add other secondary predictor to model with “better” secondary predictor
The χ² Test of Deviance
- We would like to consider adding salary quartiles to our model
- We want to compare the parent model to an extended model, which differs by the three dummy variables for the four salary quartiles
- The χ² test of deviance compares nested models
  - We use it for nested models that differ by two or more variables, because the Wald z test of a single coefficient cannot be used in that situation
1. Get the log likelihood from both models
- The log likelihood is shown in the header of the logit or logistic output
  - Null model: LL = -28.94
  - Extended model B: LL = -28.25
2. Find the deviance for each model
- Deviance = -2 × (log likelihood)
- Deviance is analogous to the residual sum of squares (RSS) in linear regression; it measures the variation left unexplained by the model, relative to a saturated model
  - A saturated model is one in which every Y is perfectly predicted
- Null model:
  - Deviance = -2(-28.94) = 57.88
- Extended model B:
  - Deviance = -2(-28.25) = 56.50
3. Find the change in deviance between the nested models
- Null model: Deviance = 57.88
- Extended model B: Deviance = 56.50
- Change in deviance = deviance(null) – deviance(extended) = 57.88 - 56.50 = 1.38
4. Evaluate the change in deviance
- The change in deviance from the parent model to the extended model is an observed chi-square statistic
  - df = # of variables added
  - H0: all new β’s are 0 in the population
  - or H0: the parent model is better
4. Evaluate the change in deviance
- H0: After adjusting for per-pupil expenditures, teachers’ salary is not an important predictor of SAT score.
- Observed χ² = 1.38
- df = 3
- With 3 df and α = 0.05, the critical χ² is 7.81
- Fail to reject H0
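The four steps can be traced in a few lines of Python. This is a sketch using only the standard library; the helper `chi2_sf_df3` is a hypothetical name and uses the closed-form upper-tail probability that holds for exactly 3 degrees of freedom, so it would need replacing for other df.

```python
import math


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


ll_parent = -28.94    # log likelihood, parent ("null") model
ll_extended = -28.25  # log likelihood, extended model B

dev_parent = -2 * ll_parent
dev_extended = -2 * ll_extended
change = dev_parent - dev_extended   # observed chi-square statistic
df = 3                               # three salary dummies added

p_value = chi2_sf_df3(change)
reject = change > 7.81               # critical value for 3 df, alpha = 0.05
print(f"change in deviance = {change:.2f}, df = {df}, "
      f"p = {p_value:.2f}, reject H0: {reject}")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision.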
Notes about deviance test
- The deviance test gives us a framework in which to add several predictors to a model simultaneously
- Can only handle nested models
- Analogous to F-test for linear regression
- Also known as a "likelihood ratio test"
Conclusions
- Per-pupil expenditure is associated with SAT score
- After adjusting for per-pupil expenditure
  - Percent of students taking the SAT is statistically significant
  - Teachers’ salary is not statistically significant
- Is salary significant after adjusting for both expenditure and percent?
Possible ways to improve this model:
- Add an interaction variable
  - Does the effect of expenditures on odds of low mean SAT score vary between states with low and high percentages of students taking the SAT?
- Add a spline
  - Does the effect of expenditures on odds of low mean SAT score vary over the level of expenditures?
Effect Modification in Logistic Regression
Heart Disease Smoking and Coffee
Effect modification
- Just like with linear regression, we may want to allow different relationships between the primary predictor and outcome across levels of another covariate
- Can model such relationships by fitting interaction terms
- Modelling effect modification will require dealing with two or more covariates
Logistic models with two covariates
- logit(p) = β0 + β1X1 + β2X2
  Then:
  logit(p | X1 = x1 + 1, X2 = x2) = β0 + β1(x1 + 1) + β2x2
  logit(p | X1 = x1, X2 = x2) = β0 + β1x1 + β2x2
  ∆ in log-odds = β1
- β1 is the change in log-odds for a 1 unit change in X1, provided X2 is held constant.
Interpretation in General
- Also:
  $$\log\left[\frac{\mathrm{odds}(Y = 1 \mid X_1 = x_1 + 1, X_2 = x_2)}{\mathrm{odds}(Y = 1 \mid X_1 = x_1, X_2 = x_2)}\right] = \beta_1$$
- And: OR = exp(β1) !!
- exp(β1) is the multiplicative change in odds for a 1 unit increase in X1, provided X2 is held constant.
- The result is similar for X2
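A quick numerical check of this interpretation is sketched below. The coefficient values are hypothetical, chosen only for illustration; the point is that the odds ratio for a 1 unit increase in X1 equals exp(β1) whatever value X2 is held at.

```python
import math


def odds(b0, b1, b2, x1, x2):
    """Odds of Y = 1 under logit(p) = b0 + b1*x1 + b2*x2."""
    return math.exp(b0 + b1 * x1 + b2 * x2)


# Hypothetical coefficients, not estimates from the lecture
b0, b1, b2 = -1.0, 0.5, 1.1

for x2 in (0, 1):
    ratio = odds(b0, b1, b2, x1=1, x2=x2) / odds(b0, b1, b2, x1=0, x2=x2)
    print(f"X2 = {x2}: odds ratio = {ratio:.4f}, exp(b1) = {math.exp(b1):.4f}")
```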
Risk of CHD from Smoking and Coffee
n = 151
Study Information
- Study Facts:
  - Case-Control study
  - 40-50 year-old males previously in good health
- Study questions:
  - Is smoking and/or coffee related to an increased odds of CHD?
  - Is the association of coffee with CHD higher among smokers? That is, is smoking an effect modifier of the coffee-CHD association?
Fraction with CHD by smoking and coffee
Pooled data, ignoring smoking
Odds ratio = (40 * 50) / (26 * 35) = 2.2
95% CI = (1.14, 4.24)
Among Non-Smokers
Odds ratio = (15 * 42) / (15 * 21) = 2.0
95% CI = (0.82, 4.9)
Among Smokers
Odds ratio = (25 * 8) / (11 * 14) = 1.3
95% CI = (0.42, 4.0)
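The odds ratios and intervals above can be reproduced from the four counts in each formula. The sketch below assumes a log-based (Woolf) interval; the slides do not state which interval method was used, although this one matches the reported limits. The function name `odds_ratio_ci` is hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a log-based (Woolf) 95% interval."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_hat) - z * se_log)
    hi = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lo, hi


# Counts taken from each odds-ratio formula, written as (a * d) / (b * c)
tables = {
    "pooled": (40, 26, 35, 50),
    "non-smokers": (15, 15, 21, 42),
    "smokers": (25, 11, 14, 8),
}
results = {name: odds_ratio_ci(*cells) for name, cells in tables.items()}
for name, (or_hat, lo, hi) in results.items():
    print(f"{name}: OR = {or_hat:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```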
Plot Odds Ratios and 95% CIs
Define Variables
- Yi = 1 if CHD case, 0 if control
- COFi = 1 if Coffee Drinker, 0 if not
- SMKi = 1 if Smoker, 0 if not
- pi = Pr(Yi = 1)
- ni = Number observed at pattern i of Xs
Logistic Regression Model
- Yi are from a Binomial(ni, pi) distribution
- Yi are independent
- log odds(Yi = 1) (or, logit(Yi = 1)) is a function of
  - Coffee
  - Smoking
  - and coffee x smoking interaction
Logistic Regression Model
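- The model for the log odds is
  $$\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 \mathrm{COF}_i + \beta_2 \mathrm{SMK}_i + \beta_3 \mathrm{COF}_i \times \mathrm{SMK}_i$$
- Which implies that Pr(Yi = 1) is the logistic function (writing X1i = COFi and X2i = SMKi)
  $$p_i = \frac{e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}}}{1 + e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}}}$$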
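Probabilities of CHD as a function of coffee and smoking history

Smoke  Coffee  Pr(CHD)
No     No      $e^{\beta_0} / (1 + e^{\beta_0})$
No     Yes     $e^{\beta_0 + \beta_1} / (1 + e^{\beta_0 + \beta_1})$
Yes    No      $e^{\beta_0 + \beta_2} / (1 + e^{\beta_0 + \beta_2})$
Yes    Yes     $e^{\beta_0 + \beta_1 + \beta_2 + \beta_3} / (1 + e^{\beta_0 + \beta_1 + \beta_2 + \beta_3})$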
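Among Non-Smokers:
$$\mathrm{Odds}(\text{Case} \mid \text{Coffee}) = \frac{e^{\beta_0 + \beta_1} / (1 + e^{\beta_0 + \beta_1})}{1 / (1 + e^{\beta_0 + \beta_1})} = e^{\beta_0 + \beta_1}$$
$$\mathrm{Odds}(\text{Case} \mid \text{No Coffee}) = \frac{e^{\beta_0} / (1 + e^{\beta_0})}{1 / (1 + e^{\beta_0})} = e^{\beta_0}$$
$$\text{Odds Ratio} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$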
Interpretations
- exp{β1}: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among non-smokers
- exp{β1 + β3}: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among smokers
Interpretations
- exp{β2}: odds ratio of being a CHD case for smokers -vs- non-smokers among non-coffee drinkers
- exp{β2 + β3}: odds ratio of being a CHD case for smokers -vs- non-smokers among coffee drinkers
Interpretations
- $e^{\beta_0} / (1 + e^{\beta_0})$: fraction of cases among non-smoking, non-coffee-drinking individuals in the sample (determined by sampling plan)
- exp{β3}: ratio of odds ratios
exp{β3} Interpretations
- exp{β3}: factor by which odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers is multiplied for smokers as compared to non-smokers
or
- exp{β3}: factor by which odds ratio of being a CHD case for smokers -vs- non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers
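The "ratio of odds ratios" reading can be checked against the numbers in this lecture. The sketch below takes the stratum-specific coffee odds ratios from the 2x2 formulas and the interaction coefficient from the fitted interaction model reported elsewhere in the lecture; the variable names are hypothetical.

```python
import math

or_nonsmokers = (15 * 42) / (15 * 21)   # coffee OR among non-smokers
or_smokers = (25 * 8) / (11 * 14)       # coffee OR among smokers

ratio_of_ors = or_smokers / or_nonsmokers
beta3_hat = -0.4317824                  # cof_smk coefficient, interaction model

print(f"ratio of odds ratios = {ratio_of_ors:.3f}")
print(f"exp(beta3)           = {math.exp(beta3_hat):.3f}")
```

The two quantities agree because the interaction model has one parameter for each of the four coffee-by-smoking patterns.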
Some Special Cases
- Given
  $$\log\frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \beta_0 + \beta_1 \mathrm{COF} + \beta_2 \mathrm{SMK} + \beta_3 \mathrm{COF} \times \mathrm{SMK}$$
- If β1 = β2 = β3 = 0
  - Neither smoking nor coffee drinking is associated with increased risk of CHD
Some Special Cases
- Given
  $$\log\frac{\Pr(Y = 1)}{\Pr(Y = 0)} = \beta_0 + \beta_1 \mathrm{COF} + \beta_2 \mathrm{SMK} + \beta_3 \mathrm{COF} \times \mathrm{SMK}$$
- If β1 = β3 = 0
  - Smoking, but not coffee drinking, is associated with increased risk of CHD
Some Special Cases
- If β3 = 0
  - Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-smoking is the same at both levels of coffee
  - Smoking and coffee drinking are both associated with risk of CHD, but the odds ratio of CHD-coffee is the same at both levels of smoking.
CHD ~ Coffee: Coefficients
Logit estimates                                   Number of obs   =        151
                                                  LR chi2(1)      =       5.65
                                                  Prob > chi2     =     0.0175
Log likelihood = -100.64332                       Pseudo R2       =     0.0273

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cof |   .7874579   .3347123     2.35   0.019     .1314338    1.443482
       _cons |  -.6539265   .2417869    -2.70   0.007     -1.12782   -.1800329
------------------------------------------------------------------------------
Adding Smoke: Coefficients
. logit chd cof smk

Logit estimates                                   Number of obs   =        151
                                                  LR chi2(2)      =      15.19
                                                  Prob > chi2     =     0.0005
Log likelihood = -95.869718                       Pseudo R2       =     0.0734

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cof |   .5269764   .3541932     1.49   0.137    -.1672295    1.221182
         smk |   1.101978   .3609954     3.05   0.002     .3944404    1.809516
       _cons |  -.9572328   .2703086    -3.54   0.000    -1.487028   -.4274377
------------------------------------------------------------------------------
Adding Interaction: Coefficient
. logit chd cof smk cof_smk

Logit estimates                                   Number of obs   =        151
                                                  LR chi2(3)      =      15.55
                                                  Prob > chi2     =     0.0014
Log likelihood = -95.694169                       Pseudo R2       =     0.0751

------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cof |   .6931472   .4525062     1.53   0.126    -.1937487    1.580043
         smk |   1.348073   .5535208     2.44   0.015     .2631923    2.432954
     cof_smk |  -.4317824   .7294515    -0.59   0.554    -1.861481    .9979163
       _cons |  -1.029619   .3007926    -3.42   0.001    -1.619162   -.4400768
------------------------------------------------------------------------------
Comparing Models
Variable     Est    se      z
Model 1
Intercept   -.65   .24   -2.7
Coffee       .79   .33    2.4
Model 2
Intercept   -.96   .27   -3.5
Coffee       .53   .35    1.5
Smoking     1.10   .36    3.1
Question:
Is smoking a confounder of the coffee-CHD association?
Confounding
- In epidemiological terms, Z is a “confounder” of the relationship of Y with X if Z is related to both X and Y and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a “confounder” of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X
Confounding
- For example, consider the two models
  $$Y = \beta_0 + \beta_1 X + \varepsilon_1$$
  $$Y = \beta_0' + \beta_1' X + \beta_2' Z + \varepsilon_2$$
- then Z is a confounder of the X, Y relationship if $\beta_1 \neq \beta_1'$
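Applied to the coffee-CHD example, the statistical definition amounts to comparing the coffee coefficient before and after smoking enters the model. The sketch below uses the two coffee coefficients from the fitted logistic models reported in this lecture; expressing the change as a percentage is an illustrative choice, not something the slides prescribe.

```python
import math

beta_unadjusted = 0.7874579  # cof coefficient, model with coffee only
beta_adjusted = 0.5269764    # cof coefficient, model with coffee and smoking

relative_change = (beta_unadjusted - beta_adjusted) / beta_unadjusted
print(f"log odds ratio shrinks by {100 * relative_change:.0f}%")
print(f"OR: {math.exp(beta_unadjusted):.1f} -> {math.exp(beta_adjusted):.1f}")
```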
Comparing Models
Variable     Est    se      z
Model 1
Intercept   -.65   .24   -2.7
Coffee       .79   .33    2.4
Model 2
Intercept   -.96   .27   -3.5
Coffee       .53   .35    1.5
Smoking     1.10   .36    3.1
Look at Confidence Intervals
- Without Smoking
  OR = e^0.79 = 2.2
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: (e^0.13, e^1.44) = (1.14, 4.24)
Look at Confidence Intervals
- With Smoking (adjusting for smoking)
  OR = e^0.53 = 1.7
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: (e^-0.17, e^1.22) = (0.85, 3.39)
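The two interval calculations follow the same recipe: build the interval on the log odds scale, then exponentiate the endpoints. A minimal sketch, using the unrounded coefficients and standard errors from the Stata output (the function name `wald_or_ci` is hypothetical):

```python
import math


def wald_or_ci(coef, se, z=1.96):
    """Odds ratio and 95% Wald interval from a log-odds coefficient and its SE."""
    lo, hi = coef - z * se, coef + z * se
    return math.exp(coef), math.exp(lo), math.exp(hi)


unadjusted = wald_or_ci(0.7874579, 0.3347123)  # coffee only
adjusted = wald_or_ci(0.5269764, 0.3541932)    # coffee, adjusting for smoking

for label, (or_hat, lo, hi) in (("without smoking", unadjusted),
                                ("with smoking", adjusted)):
    print(f"{label}: OR = {or_hat:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Using the unrounded estimates explains why the slide reports 4.24 for the upper limit even though e^1.44 alone is slightly smaller.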
Conclusion
- So, ignoring smoking, the CHD and coffee OR is 2.2 (95% CI: 1.14 - 4.24)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association
Question:
Is smoking an effect modifier of CHD-coffee association?
Interaction Model
Model 3
Variable          Est    se      z
Intercept        -1.0   .30   -3.4
Coffee            .69   .45    1.5
Smoking           1.3   .55    2.4
Coffee*Smoking   -.43   .73   -.59
Testing Interaction Term
- Among non-smokers:
  OR = e^0.69 = 1.99
- 95% CI for log(OR): 0.69 ± 1.96(0.45) = (-0.19, 1.58)
- 95% CI for OR: (e^-0.19, e^1.58) = (0.82, 4.86)
Testing Interaction Term
- Among smokers
  OR = e^(0.69 - 0.43) = e^0.26 = 1.30
- 95% CI for log(OR): 0.26 ± 1.96(.57) = (-0.86, 1.38)
- 95% CI for OR: (e^-0.86, e^1.38) = (0.42, 3.99)
Testing Interaction Term
- Z = -0.59, p-value = 0.554
- 95% confidence interval for exp{β1 + β3}
  - (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!
Note
- Calculating the SE for $\hat{\beta}_1 + \hat{\beta}_3$:
  .57 = sqrt(.329)
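The variance of a sum of two estimates is Var(b1) + Var(b3) + 2 Cov(b1, b3). The slides do not show the covariance, which would normally be read from the estimated covariance matrix of the fit. The sketch below instead backs it out under a stated assumption: smokers and non-smokers are independent strata, so the coffee log odds ratio among smokers (b1 + b3) is uncorrelated with the one among non-smokers (b1), which implies Cov(b1, b3) = -Var(b1).

```python
import math

se_b1 = 0.4525062   # SE of cof coefficient (interaction model)
se_b3 = 0.7294515   # SE of cof_smk coefficient (interaction model)

# Assumption (not shown on the slides): independent smoking strata,
# hence Cov(b1, b3) = -Var(b1) in this fully interacted model.
cov_b1_b3 = -se_b1 ** 2

var_sum = se_b1 ** 2 + se_b3 ** 2 + 2 * cov_b1_b3
se_sum = math.sqrt(var_sum)
print(f"Var(b1 + b3) = {var_sum:.3f}, SE = {se_sum:.2f}")
```

The result is close to the .329 and .57 quoted on the slide; small differences are consistent with rounding of the inputs.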
Question:
What model should we choose to describe the relationship of coffee and smoking with CHD?
Fitted Values
- We can use the logistic models to calculate fitted values for comparison with observed frequencies using each of the three models
- Model 1:
  $$\hat{p} = \frac{e^{-.65 + .79\,\text{Coffee}}}{1 + e^{-.65 + .79\,\text{Coffee}}}$$
Fitted Values
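- Model 2:
  $$\hat{p} = \frac{e^{-.96 + .53\,\text{Coffee} + 1.1\,\text{Smoking}}}{1 + e^{-.96 + .53\,\text{Coffee} + 1.1\,\text{Smoking}}}$$
- Model 3:
  $$\hat{p} = \frac{e^{-1.03 + .69\,\text{Coffee} + 1.3\,\text{Smoking} - .43(\text{Coffee} \times \text{Smoking})}}{1 + e^{-1.03 + .69\,\text{Coffee} + 1.3\,\text{Smoking} - .43(\text{Coffee} \times \text{Smoking})}}$$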
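The fitted probabilities for each coffee-by-smoking pattern can be computed directly from the three sets of coefficients. This is a sketch using the unrounded estimates from the Stata output; the helper names `expit` and `fitted` are hypothetical.

```python
import math


def expit(eta):
    """Inverse logit: e^eta / (1 + e^eta)."""
    return math.exp(eta) / (1 + math.exp(eta))


def fitted(coffee, smoking):
    """Fitted Pr(CHD) under Models 1-3, using the reported coefficients."""
    m1 = expit(-0.6539265 + 0.7874579 * coffee)
    m2 = expit(-0.9572328 + 0.5269764 * coffee + 1.101978 * smoking)
    m3 = expit(-1.029619 + 0.6931472 * coffee + 1.348073 * smoking
               - 0.4317824 * coffee * smoking)
    return m1, m2, m3


for smoking in (0, 1):
    for coffee in (0, 1):
        m1, m2, m3 = fitted(coffee, smoking)
        print(f"smoking={smoking} coffee={coffee}: "
              f"Model 1 {m1:.3f}  Model 2 {m2:.3f}  Model 3 {m3:.3f}")
```

Model 1 gives the same fitted value for smokers and non-smokers because smoking does not appear in it.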
Observed vs Fitted Values
Saturated Model
- Note that fitted values from Model 3 exactly match the observed values, indicating a “saturated” model that gives perfect predictions
- Although the saturated model will always result in a perfect fit, it is usually not the best model (e.g., when there are continuous covariates or many covariates)
Likelihood Ratio Test
- The Likelihood Ratio Test will help decide whether or not additional term(s) “significantly” improve the model fit
- The Likelihood Ratio Test (LRT) statistic for comparing nested models is
  - -2 times the difference between the log likelihoods (LLs) for the Null -vs- Extended models
  - the χ² statistic obtained plays the same role as the F statistic from an analysis of variance test for linear regression models
Likelihood Ratio Test
Deviance is a term used for the difference in -2 × log likelihood relative to the best possible value from a perfectly predicting model. Change in deviance is the same as change in -2LL.
LRT Example
Model comparisons using likelihood ratio test
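As an illustration of such comparisons, the sketch below computes the LRT statistic for each successive pair of the three coffee/smoking models, using the log likelihoods from their Stata output. It is not a reconstruction of the slide's own table. Each step adds one term, so each test has 1 df, and the helper `chi2_sf_df1` (a hypothetical name) uses the closed form valid for 1 df only.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))


# Log likelihoods from the three fitted models
ll = {
    "coffee": -100.64332,
    "coffee + smoking": -95.869718,
    "coffee + smoking + interaction": -95.694169,
}
comparisons = [
    ("coffee", "coffee + smoking"),
    ("coffee + smoking", "coffee + smoking + interaction"),
]

lrt = {}
for parent, extended in comparisons:
    stat = -2 * (ll[parent] - ll[extended])  # change in -2LL, df = 1
    p = chi2_sf_df1(stat)
    lrt[(parent, extended)] = (stat, p)
    print(f"{parent} -> {extended}: LRT = {stat:.2f}, p = {p:.3f}")
```

The first statistic equals the difference between the two LR chi2 values printed by Stata (15.19 - 5.65, up to rounding).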
Summary
- A case-control study was conducted with 151 subjects, 66 (44%) of whom had CHD, to assess the relative importance of smoking and coffee drinking as risk factors. The observed fractions of CHD cases were tabulated by smoking and coffee strata.
Summary: Unadjusted ORs
- The odds of CHD were estimated to be 3.4 times as high among smokers as among non-smokers
  - 95% CI: (1.7, 7.9)
- The odds of CHD were estimated to be 2.2 times as high among coffee drinkers as among non-coffee drinkers
  - 95% CI: (1.1, 4.2)
Summary: Adjusted ORs
- Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7 with 95% CI: (.85, 3.4).
- Hence, the evidence in these data is insufficient to conclude coffee has an independent effect on CHD beyond that of smoking.
Summary
- Finally, we estimated the coffee odds ratio separately for smokers and non-smokers to assess whether smoking is an effect modifier of the coffee-CHD relationship. For the smokers and non-smokers, the coffee odds ratio was estimated to be 1.3 (95% CI: .42, 4.0) and 2.0 (95% CI: .82, 4.9) respectively. There is little evidence of effect modification in these data.
Note: Retrospective Studies
- Ratio of odds of CHD for coffee vs. non-coffee drinkers is equivalent to ratio of odds of coffee drinking for cases of CHD vs. controls
- Thus, can estimate odds ratio of CHD (prospective question) using retrospective data -- key property of odds ratios
- This is one reason why logistic regression is