Lecture 11: Interpreting logistic regression models
Ani Manichaikul
amanicha@jhsph.edu
3 May 2007
Logistic regression
n Framework and ideas of linear
modelling similar to linear regression
n Still have a systematic and probabilistic
part to any model
n Coefficients have a new interpretation,
based on log(odds) and log(odds ratios)
The logit function
n In logistic regression, we are always
modelling the outcome log(p/(1-p))
n We define the function:
logit(p)= log(p/(1-p))
n We often use the name logit for
convenience
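As a small illustration of the definition above, the logit and its inverse can be sketched in Python. The function names `logit` and `inv_logit` are chosen here for illustration and are not part of the lecture.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Probability corresponding to the log odds x."""
    return 1 / (1 + math.exp(-x))

# A probability of 1/2 corresponds to even odds, so the log odds are 0.
print(logit(0.5))
# Applying inv_logit after logit recovers the probability (up to rounding).
print(inv_logit(logit(0.2)))
```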
Example: Public health graduate students
n 323 graduate students in introductory
biostatistics took a health survey. Current smoking status was gathered, which we will predict with gender.
n Associating demographics with smoking is vital to
planning public health programs.
n Information was also collected on age, exercise,
and history of smoking; potential confounders of the association between gender and current smoking.
n Today, we will focus only on the association
between gender and current smoking status.
Coding
n Outcome:
n smoking = 1 for current smokers, 0 for current nonsmokers
n Primary predictor:
n gender = 1 for men, 0 for women
Recall
n In linear regression, if we had only one
binary X like gender, we would be predicting two means:
n $\beta_0$ – the mean outcome when X = 0
n $\beta_0 + \beta_1$ – the mean outcome when X = 1
n $\beta_1$ – the difference in mean outcome when X = 1 vs. when X = 0
Output
Logit estimates                                   Number of obs   =        323
                                                  LR chi2(1)      =       4.46
                                                  Prob > chi2     =     0.0348
Log likelihood = -75.469757                       Pseudo R2       =     0.0287

------------------------------------------------------------------------------
       smoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |    .967966   .4547931     2.13   0.033     .0765879    1.859344
       _cons |  -3.058707   .3235656    -9.45   0.000    -3.692884    -2.42453
------------------------------------------------------------------------------
Predictions by gender
$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Gender}) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(\text{Gender})$
n For women, gender = 0: $\ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(0) = -3.1$
n For men, gender = 1: $\ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(1) = -2.1$
n $\beta_1$ is the difference: $\beta_1$ is the change in log odds
Interpretation 1: log(odds)
n $\beta_0$: the log odds of smoking for women
n $\beta_0 + \beta_1$: the log odds of smoking for men
n $\beta_1$: the difference in the log odds of smoking for men as compared to women
But, we really wanted to predict P(Y= 1), not the log odds…
n We can start to “untransform” the equation
n If $\ln(a) = b$, then $a = e^b$
n For women, X = 0: ln(odds) = $\beta_0 + \beta_1(0) = \beta_0$
n For men, X = 1: ln(odds) = $\beta_0 + \beta_1(1)$
$\text{odds of smoking for men} = e^{\beta_0 + \beta_1} = e^{-3.1 + 1.0} = e^{-2.1} = .12$
$\text{odds of smoking for women} = e^{\beta_0} = e^{-3.1} = .05$
Interpretation 2: odds
n $e^{\beta_0}$: the odds of smoking for women (when X = 0)
n $e^{\beta_0 + \beta_1}$: the odds of smoking for men (when X = 1)
n In the past, we’ve compared two sets of odds by dividing to find the odds ratio (OR)
Comparing odds
n If we subtract the log odds, mathematically
that’s equivalent to dividing inside the log:
n ln(a) – ln(b) = ln(a/b)
n So, if
n $e^{\beta_0 + \beta_1} = e^{-3.1 + 1.0} = e^{-2.1} = .12$ is the odds when X = 1, and
n $e^{\beta_0} = e^{-3.1} = .05$ is the odds when X = 0, then
n we want to divide them in order to compare
$\text{Odds Ratio} = \frac{\text{odds for men}}{\text{odds for women}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = \frac{.12}{.05} = 2.4$
Interpreting the odds ratio
n The odds of smoking are about 2½ times as high for men as for women.
n Based on this study, smoking cessation
programs should be targeted toward men, while perhaps smoking prevention programs should be targeted toward women.
Useful math
n We can usually simplify an equation like this:
$\text{Odds Ratio} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{(\beta_0 + \beta_1) - \beta_0} = e^{\beta_1}$
because $\frac{e^a}{e^b} = e^{a-b}$
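The odds and odds ratio for the smoking example can be checked numerically. The sketch below plugs in the unrounded coefficients from the Stata output (`_cons` and `gender`); the variable names are illustrative. With unrounded coefficients the odds ratio comes out near 2.63, whereas dividing the rounded odds (.12/.05) gives 2.4.

```python
import math

b0 = -3.058707   # _cons: log odds of smoking for women
b1 = 0.967966    # gender: difference in log odds, men vs. women

odds_women = math.exp(b0)        # odds when X = 0
odds_men = math.exp(b0 + b1)     # odds when X = 1
odds_ratio = odds_men / odds_women

print("odds for women:", round(odds_women, 3))
print("odds for men:  ", round(odds_men, 3))
# Dividing the two odds gives the same number as exponentiating b1 directly.
print("odds ratio:    ", round(odds_ratio, 2), round(math.exp(b1), 2))
```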
Odds and odds ratio
n $e^{\beta_0}$: the odds when X = 0
n $e^{\beta_0 + \beta_1}$: the odds when X = 1
n $\frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$: the odds ratio comparing the odds when X = 1 vs. X = 0
Note on the computer output
n R does not give $e^{\beta_0}$ in the output
n This is because logistic regression is so often used for case-control studies
n the odds aren’t appropriate for a case-control study, because the investigators determine the ratio of cases to controls
n the odds ratio is appropriate regardless of whether exposure or outcome was gathered first (by invariance of the odds ratio)
Types of interpretation
n $\beta_0 + \beta_1$ = ln(odds) (for X = 1)
n $\beta_1$ = difference in log odds
n $e^{\beta_0 + \beta_1}$ = odds (for X = 1)
n $e^{\beta_1}$ = odds ratio
n But we started with P(Y = 1)
n Can we find that?
More useful math
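n $\text{odds} = \frac{\text{probability}}{1 - \text{probability}}$
n $\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$
n so $\text{probability for } X = 1 \text{ is } \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$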
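The two conversions between odds and probability are easy to check with a short sketch. The helper names below are hypothetical and chosen only for this illustration.

```python
def probability_to_odds(p):
    """odds = probability / (1 - probability), for 0 <= p < 1."""
    return p / (1 - p)

def odds_to_probability(odds):
    """probability = odds / (1 + odds), for odds >= 0."""
    return odds / (1 + odds)

# Even odds (1 to 1) correspond to a probability of one half.
print(odds_to_probability(1.0))
# Converting to odds and back returns the starting probability.
print(odds_to_probability(probability_to_odds(0.2)))
```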
Finding the probability
Find the log odds:
For X = 0: ln(odds) = $\beta_0$
For X = 1: ln(odds) = $\beta_0 + \beta_1$
Find odds:
For X = 0: odds = $e^{\beta_0}$
For X = 1: odds = $e^{\beta_0 + \beta_1}$
Finding the probability
Transform odds into probability:
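$p = \frac{\text{odds}}{1 + \text{odds}}$
For X = 0: $\text{probability} = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
For X = 1: $\text{probability} = \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$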
We could even go one step further
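n For X = 1: $P(\text{smoke} \mid \text{male}) = \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$
n For X = 0: $P(\text{smoke} \mid \text{female}) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
n Relative Risk (RR) $= \frac{p_1}{p_2}$
n Relative Risk for Men vs. Women: $\frac{p_1}{p_2} = \frac{e^{\beta_0 + \beta_1} / \left(1 + e^{\beta_0 + \beta_1}\right)}{e^{\beta_0} / \left(1 + e^{\beta_0}\right)}$
n no way to simplify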
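To make the last step concrete, here is a minimal sketch that turns the fitted log odds from the smoking example into probabilities and a relative risk. It assumes the unrounded coefficients from the Stata output; the helper name `odds_to_probability` is illustrative.

```python
import math

b0 = -3.058707   # _cons: log odds of smoking for women
b1 = 0.967966    # gender coefficient

def odds_to_probability(odds):
    return odds / (1 + odds)

p_women = odds_to_probability(math.exp(b0))       # P(smoke | female)
p_men = odds_to_probability(math.exp(b0 + b1))    # P(smoke | male)
relative_risk = p_men / p_women
odds_ratio = math.exp(b1)

print("P(smoke | female):", round(p_women, 3))
print("P(smoke | male):  ", round(p_men, 3))
print("relative risk:    ", round(relative_risk, 2))
print("odds ratio:       ", round(odds_ratio, 2))
```

Because smoking is fairly rare in this sample, the relative risk and the odds ratio are close, with the odds ratio slightly further from 1.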
Remember to consider study design
n We always can calculate the relative
risk
n The relative risk is not appropriate for
case-control studies
n Again, because the investigators decide the
number of cases and controls to study
n The odds ratio is appropriate for all
study designs
Types of interpretation
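n $\beta_0 + \beta_1$ = ln(odds) (for X = 1)
n $\beta_1$ = difference in log odds
n $e^{\beta_0 + \beta_1}$ = odds (for X = 1)
n $e^{\beta_1}$ = odds ratio
n $\frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$ = probability for X = 1
n $\frac{e^{\beta_0 + \beta_1} / \left(1 + e^{\beta_0 + \beta_1}\right)}{e^{\beta_0} / \left(1 + e^{\beta_0}\right)}$ = Relative Risk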
Interpretation Tips
n If the equation includes $\beta_0$, then it is usually for a particular set of people
n log odds
n odds
n probability
n exception: the equation for RR will include $\beta_0$, because that equation cannot be simplified
n If the equation does not include $\beta_0$, then it must compare two groups
n difference of log odds = log odds ratio
n odds ratio
In General
n Logistic regression for a binary outcome
n Left side of equation is log odds
n Can transform the equation to find
n odds
n probability
n Can compare two groups
n difference of log odds = log odds ratio
n odds ratio
n relative risk
n Everything we learned before applies
Useful math for logistic regression
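n If $\ln(a) = b$, then $a = e^b$
n X = 1: ln(odds) = $\beta_0 + \beta_1(1)$
n so odds for X = 1 is $e^{\beta_0 + \beta_1}$
n ln(a) – ln(b) = ln(a/b)
n so ln(odds|X = 1) – ln(odds|X = 0) = ln(OR for X = 1 vs. X = 0)
n $\frac{e^a}{e^b} = e^{a-b}$, so $\frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$
n $\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$, so probability for X = 1 is $\frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$
n Also: $e^a \times e^b = e^{a+b}$, so $e^{\beta_1} \times e^{\beta_1} = e^{\beta_1 + \beta_1} = \left(e^{\beta_1}\right)^2$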
Another Example
n Regular physical examination is an
important preventative public health measure
n We’ll study this outcome using the public
health graduate student dataset.
n Outcome: No physical exam in the past two years
n Primary predictor: age
n Secondary predictor and potential confounder: regularly taking a multivitamin
Problem
n The original “phys” variable was meant to be
continuous, but it was collected categorically.
n time since last physician visit
n Since it is now categorical and we wish to use
it as the outcome for a regression model, we have to make it binary and use logistic regression.
Creating a new variable
n Phys_no = 1 if over 2 years, 0 if 2 years or less

. tab phys

 Length of time since last |
                  check-up |      Freq.     Percent        Cum.
---------------------------+-----------------------------------
      Within the past year |        182       54.17       54.17
 Within the past 1-2 years |         72       21.43       75.60
 Within the past 2-5 years |         53       15.77       91.37
           5 or more years |         29        8.63      100.00
---------------------------+-----------------------------------
                     Total |        336      100.00
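The recoding can be sketched in Python using the category counts from the frequency table. The dictionary and variable names are illustrative; the lecture itself works in Stata.

```python
# Category counts copied from the "tab phys" output.
counts = {
    "Within the past year": 182,
    "Within the past 1-2 years": 72,
    "Within the past 2-5 years": 53,
    "5 or more years": 29,
}

# Phys_no = 1 when the last check-up was more than 2 years ago.
over_two_years = {"Within the past 2-5 years", "5 or more years"}

n_no_visit = sum(n for label, n in counts.items() if label in over_two_years)
n_total = sum(counts.values())

print("Phys_no = 1:", n_no_visit)
print("Phys_no = 0:", n_total - n_no_visit)
print("proportion with Phys_no = 1:", round(n_no_visit / n_total, 3))
```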
Goals
n Predict Phys (no physician visit within the
past two years= 1) with Age (continuous)
n After adjusting for age, is taking a
multivitamin (1= yes) a statistically significant predictor for not regularly visiting a physician?
n Is taking a multivitamin a confounder for the
age-physician visit relationship?
Null Model: Coefficients
Logit estimates                                   Number of obs   =        336
                                                  LR chi2(1)      =       0.00
                                                  Prob > chi2     =     0.9567
Log likelihood = -186.71399                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
     phys_no |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agec |  -.0009585   .0176509    -0.05   0.957    -.0355536    .0336365
       _cons |  -1.130428   .1270539    -8.90   0.000    -1.379449   -.8814066
------------------------------------------------------------------------------
agec = age-30 (centered age)
Predictions by age
$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -1.13 - .001(\text{Age} - 30)$
n For a 30-year-old:
n $\ln\left(\frac{p}{1-p}\right) = -1.13 - .001(30 - 30) = -1.13$
n For a 31-year-old:
n $\ln\left(\frac{p}{1-p}\right) = -1.13 - .001(31 - 30) = -1.13 - .001 = -1.131$
n Difference is $\beta_1$
Interpretation: log odds
n $\beta_0$: the log odds of not visiting a physician for a 30-year-old
n $\beta_1$: the difference in the log odds of not visiting a physician for a one year increase in age
Recall:
n ln(a) – ln(b) = ln(a/b)
n so ln(odds|X= 31) – ln(odds|X= 30) = ln(OR
for X= 31 vs. X= 30)
n difference of log odds = log odds ratio
n Alternate interpretation for $\beta_1$:
n The log odds ratio of not visiting a
physician corresponding to a one year increase in age
Interpretation: log(odds ratio) for one year age difference
$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - .001(\text{Age} - 30)}$
n For a 31-year-old:
n $\frac{p}{1-p} = e^{-1.13 - .001(31 - 30)} = e^{-1.131} = .3227$
n For a 30-year-old:
n $\frac{p}{1-p} = e^{-1.13} = .3230$
n Ratio = $\frac{.3227}{.3230} = .999 = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$
Interpretation: odds ratio for one year age difference
n $e^{\beta_0}$ is the odds of not visiting a physician for 30-year-olds
n $e^{\beta_0 + \beta_1}$ is the odds of not visiting a physician for 31-year-olds
n $e^{\beta_1}$ is the odds ratio of not visiting a physician corresponding to a one year increase in age
Interpretation: odds ratio for two year age difference
$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - .001(\text{Age} - 30)}$
n For a 32-year-old:
n $\frac{p}{1-p} = e^{-1.13 - .001(32 - 30)} = e^{-1.13 - .001 \times 2} = e^{-1.132} = .3224$
n For a 30-year-old:
n $\frac{p}{1-p} = e^{-1.13} = .3230$
n Ratio = $\frac{.3224}{.3230} = .998 = \frac{e^{\beta_0 + 2\beta_1}}{e^{\beta_0}} = e^{2\beta_1} = \left(e^{\beta_1}\right)^2$
Interpretation: odds ratio for 10 year age difference
$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - .001(\text{Age} - 30)}$
n For a 40-year-old:
n $\frac{p}{1-p} = e^{-1.13 - .001(40 - 30)} = e^{-1.13 - .01} = e^{-1.14} = .3198$
n For a 30-year-old:
n $\frac{p}{1-p} = e^{-1.13} = .3230$
n Ratio = $\frac{.3198}{.3230} = .990 = \frac{e^{\beta_0 + 10\beta_1}}{e^{\beta_0}} = e^{10\beta_1} = \left(e^{\beta_1}\right)^{10}$
n $e^{\beta_1}$ is the factor by which the odds of not visiting a physician are multiplied for a one year increase in age
n $e^{10\beta_1} = \left(e^{\beta_1}\right)^{10}$ is the factor by which the odds of not visiting a physician are multiplied for a ten year increase in age
n Multiplying two one-year odds ratios: $\frac{\text{odds for 31-yr-old}}{\text{odds for 30-yr-old}} \times \frac{\text{odds for 31-yr-old}}{\text{odds for 30-yr-old}}$
n Math fact: $e^a \times e^b = e^{a+b}$, so $e^{\beta_1} \times e^{\beta_1} = e^{\beta_1 + \beta_1} = \left(e^{\beta_1}\right)^2$
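The pattern in the one, two, and ten year comparisons is that the odds ratio for a k-year age difference is $e^{k\beta_1}$. A minimal sketch, assuming the unrounded `agec` coefficient from the age-only model output; the function name is hypothetical.

```python
import math

b1 = -0.0009585   # agec coefficient from the age-only model

def odds_ratio_for_age_gap(years, slope=b1):
    """Odds ratio of not visiting a physician for an age gap of `years`."""
    return math.exp(slope * years)

for years in (1, 2, 10):
    print(years, "year(s):", round(odds_ratio_for_age_gap(years), 3))

# The k-year odds ratio equals the one-year odds ratio raised to the power k.
one_year = odds_ratio_for_age_gap(1)
print(abs(odds_ratio_for_age_gap(10) - one_year ** 10) < 1e-12)
```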
$p = \text{probability of not visiting a physician} = \frac{e^{-1.13 - .001(\text{Age} - 30)}}{1 + e^{-1.13 - .001(\text{Age} - 30)}}$
n For a 40-year-old: $p = \frac{e^{-1.13 - .001(40 - 30)}}{1 + e^{-1.13 - .001(40 - 30)}} = \frac{e^{-1.13 - .01}}{1 + e^{-1.13 - .01}} = \frac{e^{-1.14}}{1 + e^{-1.14}} = .2423$
n For a 30-year-old: $p = \frac{e^{-1.13 - .001(0)}}{1 + e^{-1.13 - .001(0)}} = \frac{e^{-1.13}}{1 + e^{-1.13}} = .2442$
n The ratio (RR) cannot be simplified: $\frac{p_1}{p_2} = \frac{.2423}{.2442} = .992$ (40-year-olds vs. 30-year-olds)
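These probabilities and their ratio can be reproduced with a short sketch. It assumes the unrounded coefficients from the age-only model output, so the last digit may differ slightly from the hand calculation with rounded coefficients; `prob_no_visit` is a name made up for this example.

```python
import math

b0 = -1.130428    # _cons: log odds for a 30-year-old
b1 = -0.0009585   # agec: change in log odds per year of age

def prob_no_visit(age):
    """Predicted probability of no physician visit in the past two years."""
    log_odds = b0 + b1 * (age - 30)
    odds = math.exp(log_odds)
    return odds / (1 + odds)

p30 = prob_no_visit(30)
p40 = prob_no_visit(40)
relative_risk = p40 / p30

print("p at age 30:", round(p30, 4))
print("p at age 40:", round(p40, 4))
print("relative risk, 40 vs. 30:", round(relative_risk, 3))
```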
Interpretation: probability
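n $\frac{e^{\beta_0}}{1 + e^{\beta_0}}$ is the probability of not visiting a physician for 30-year-olds
n $\frac{e^{\beta_0 + \beta_1 \times 10}}{1 + e^{\beta_0 + \beta_1 \times 10}}$ is the probability of not visiting a physician for 40-year-olds
n $\frac{e^{\beta_0 + \beta_1 \times 10} / \left(1 + e^{\beta_0 + \beta_1 \times 10}\right)}{e^{\beta_0} / \left(1 + e^{\beta_0}\right)}$ is the relative risk of not visiting a physician for 40-year-olds vs. 30-year-olds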
Goals
n Predict Phys (no physician visit within the
past two years= 1) with Age (continuous)
n After adjusting for age, is taking a
multivitamin (1= yes) a statistically significant predictor for not regularly visiting a physician?
n Is taking a multivitamin a confounder for the
age-physician visit relationship?
Nested models
n Adding a single new variable to the model
n null model: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30)$
n full model: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$
Comparing nested models that differ by one variable
n Compare models with p-value or CI
n What method is this?
n The Wald test, a test that applies the CLT, like
n Z test comparing proportions in 2x2 table
n $\chi^2$ test for independence in 2x2 table
n analogous to the t test for linear regression
n H0: the new variable is not needed
n or H0: $\beta_{new} = 0$ in the population
Full Model: Coefficients
Logit estimates                                   Number of obs   =        317
                                                  LR chi2(2)      =       7.87
                                                  Prob > chi2     =     0.0195
Log likelihood = -171.80997                       Pseudo R2       =     0.0224

------------------------------------------------------------------------------
     phys_no |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agec |   .0012855   .0192619     0.07   0.947    -.0364671    .0390381
    multivit |  -.7808889   .2871247    -2.72   0.007    -1.343643   -.2181349
       _cons |  -.8571962    .159519    -5.37   0.000    -1.169848   -.5445446
------------------------------------------------------------------------------
Conclusion from the Wald test
n The p-value for multivitamin is 0.007 (< 0.05)
and the CI for coefficient multivitamin does not include 0 (CI for OR doesn’t include 1)
n Reject H0 n Conclude that the larger model is better:
after adjusting for age, multivitamin use is still an important predictor of physician visits in the population
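The Wald z statistic, p-value, and confidence interval for the multivitamin coefficient can be recomputed from the coefficient and standard error in the output. This is a sketch using only the Python standard library and the normal approximation; the variable names are illustrative.

```python
import math

coef = -0.7808889   # multivit coefficient from the full model
se = 0.2871247      # its standard error

# Wald statistic: estimate divided by its standard error.
z = coef / se

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval on the log odds scale, then on the odds ratio scale.
z_crit = 1.959964   # 97.5th percentile of the standard normal
ci_log_odds = (coef - z_crit * se, coef + z_crit * se)
ci_odds_ratio = (math.exp(ci_log_odds[0]), math.exp(ci_log_odds[1]))

print("z =", round(z, 2))
print("p =", round(p_value, 3))
print("95% CI for coefficient:", [round(x, 3) for x in ci_log_odds])
print("95% CI for odds ratio: ", [round(x, 2) for x in ci_odds_ratio])
```

The interval for the coefficient excludes 0 exactly when the interval for the odds ratio excludes 1, since exponentiation preserves order.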
n A 30-year-old non-vitamin user:
n log odds = -0.86
n A 31-year-old non-vitamin user:
n log odds = -0.86 + 0.001
n A 30-year-old vitamin user:
n log odds = -0.86 – 0.78
n A 31-year-old vitamin user:
n log odds = -0.86 + 0.001 – 0.78
$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin}) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -.86 + .001(\text{Age} - 30) - .78(\text{Multivitamin})$
Interpretation - log odds
n $\beta_0$: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
n $\beta_1$: the log odds ratio of not visiting a physician for a one year increase in age controlling for multivitamin use
n $\beta_2$: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age
$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-.86 + .001(\text{Age} - 30) - .78(\text{Multivitamin})}$
n A 30-year-old non-vitamin user:
n odds = exp{ -0.86} = 0.4232
n A 31-year-old non-vitamin user:
n odds = exp{ -0.86 + 0.001} = 0.4236
n A 30-year-old vitamin user:
n odds = exp{ -0.86 – 0.78} = 0.1940
n A 31-year-old vitamin user:
n odds = exp{ -0.86 + 0.001 – 0.78} = 0.1942
Interpretation – odds and odds ratio
n exp{$\beta_0$}: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
Interpretation – odds and odds ratio
n exp{$\beta_1$}: after adjusting for multivitamin use, the odds of not visiting a physician change by a factor of exp{$\beta_1$} = 1.001 for each additional year of age
n additional age is associated with lower
frequency of physician visits in these students, but the association is not statistically significant (p> 0.05)
Interpretation – odds and odds ratio
n exp{$\beta_2$}: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp{$\beta_2$} = 0.46, adjusting for age
n taking multivitamins is associated with regular
physician visits (p= 0.007)
$p = \text{probability of not visiting a physician} = \frac{e^{-.86 + .001(\text{Age} - 30) - .78(\text{Multivitamin})}}{1 + e^{-.86 + .001(\text{Age} - 30) - .78(\text{Multivitamin})}}$
n For a 30-year-old non vitamin user: $p = \frac{e^{-.86 + .001(0) - .78(0)}}{1 + e^{-.86 + .001(0) - .78(0)}} = .30$
n For a 40-year-old vitamin user: $p = \frac{e^{-.86 + .001(40 - 30) - .78(1)}}{1 + e^{-.86 + .001(40 - 30) - .78(1)}} = .16$
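The two predicted probabilities can be reproduced with a small function of age and multivitamin use. The sketch assumes the unrounded coefficients from the full model output; `prob_no_visit` is a hypothetical helper name.

```python
import math

b0 = -0.8571962   # _cons: 30-year-old who does not take multivitamins
b1 = 0.0012855    # agec: per year of age, centered at 30
b2 = -0.7808889   # multivit: 1 = regularly takes a multivitamin

def prob_no_visit(age, multivitamin):
    """Predicted probability of no physician visit in the past two years."""
    log_odds = b0 + b1 * (age - 30) + b2 * multivitamin
    return 1 / (1 + math.exp(-log_odds))

print("30-year-old, no multivitamin:", round(prob_no_visit(30, 0), 2))
print("40-year-old, multivitamin:   ", round(prob_no_visit(40, 1), 2))
```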
Goals
n Predict Phys (no physician visit within the
past two years= 1) with Age (continuous)
n After adjusting for age, is taking a
multivitamin (1= yes) a statistically significant predictor for not regularly visiting a physician?
n Is taking a multivitamin a confounder
for the age-physician visit relationship?
Was multivitamin use a confounder?
n CI for $\beta_1$ in parent model: (-0.036, 0.034)
n Estimate for $\beta_1$ in nested model: 0.001
n CI for exp{$\beta_1$} in parent model: (exp{-0.036}, exp{0.034}) = (0.97, 1.03)
n Estimate for exp{$\beta_1$} in nested model: exp{0.001} = 1.001
n Estimate is in original CI: multivitamin use is not a statistically significant confounder
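The rule of thumb used here (does the age coefficient after adjustment fall inside the confidence interval from the age-only model?) can be written as a short check. The numbers are copied from the two Stata outputs; the variable names are illustrative.

```python
import math

# 95% CI for the age coefficient in the age-only model.
ci_low, ci_high = -0.0355536, 0.0336365

# Age coefficient after adding multivitamin use to the model.
b1_adjusted = 0.0012855

inside = ci_low <= b1_adjusted <= ci_high
print("adjusted estimate inside original CI:", inside)

# The same comparison on the odds ratio scale.
print("CI for exp(b1):", round(math.exp(ci_low), 2), round(math.exp(ci_high), 2))
print("adjusted exp(b1):", round(math.exp(b1_adjusted), 3))
```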
Interpretation
n The factor by which the odds of
irregular physician visits changes for each additional year of age does not change appreciably when we adjust for multivitamin use.
n The “slope” is roughly the same before
and after adjusting for multivitamin use.
Goals: conclusions
n Predict Phys (no physician visit
within the past two years= 1) with Age (continuous)
n There is no statistically significant effect of
age on physician visits in the population
Goals: conclusions
n After adjusting for age, is taking a
multivitamin (1= yes) a statistically significant predictor for not regularly visiting a physician?
n After adjusting for age, those who
regularly take a multivitamin are also more likely to have visited a physician during the past two years (p= 0.007)
Goals: conclusions
n Is taking a multivitamin a
confounder for the age-physician visit relationship?
n The effect of age on physician visit is still
nonsignificant after adjusting for multivitamin use: multivitamin use is not a confounder
Overall evaluation of the model
n Pseudo R2 = 0.02
n Only about 2% of the total variability in
physician visit has been explained by age and multivitamin use.
n Other important predictors probably exist.
Summary
n Nested models
n Test statistical significance of new variable after adjusting for null model
n t test or CI for $\beta$
n Test whether new X is a confounder for an original X
n is nested $\beta_1$ in CI for parent $\beta_1$?
Summary
n For a continuous X, exp{$\beta$} is the factor by which the odds or odds ratio changes for each unit change of X
n Pseudo R2 provides overall evaluation of the