Lecture 11: Interpreting logistic regression models Ani Manichaikul - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Lecture 11: Interpreting logistic regression models

Ani Manichaikul amanicha@jhsph.edu 3 May 2007

slide-2
SLIDE 2

Logistic regression

• Framework and ideas of linear modelling similar to linear regression
• Still have a systematic and probabilistic part to any model
• Coefficients have a new interpretation, based on log(odds) and log(odds ratios)

slide-3
SLIDE 3

The logit function

• In logistic regression, we are always modelling the outcome log(p/(1-p))
• We define the function: logit(p) = log(p/(1-p))
• We often use the name logit for convenience
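The logit definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `logit` and `inv_logit` are chosen here for convenience.

```python
import math

def logit(p: float) -> float:
    """Log odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Inverse of logit: maps a log odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

print(logit(0.5))             # even odds correspond to a log odds of zero
print(inv_logit(logit(0.2)))  # the round trip recovers 0.2 up to rounding
```

The inverse function is what later slides use, in the form odds/(1 + odds), to get back from log odds to a probability.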

slide-4
SLIDE 4

Example: Public health graduate students

• 323 graduate students in introductory biostatistics took a health survey. Current smoking status was gathered, which we will predict with gender.
• Associating demographics with smoking is vital to planning public health programs.
• Information was also collected on age, exercise, and history of smoking; potential confounders of the association between gender and current smoking.
• Today, we will focus only on the association between gender and current smoking status.

slide-5
SLIDE 5

Coding

• Outcome:
  • smoking = 1 for current smokers, 0 for current nonsmokers
• Primary predictor:
  • gender = 1 for men, 0 for women

slide-6
SLIDE 6

Recall

• In linear regression, if we had only one binary X like gender, we would be predicting two means:
  • $\beta_0$ – the mean outcome when X = 0
  • $\beta_0 + \beta_1$ – the mean outcome when X = 1
  • $\beta_1$ – the difference in mean outcome when X = 1 vs. when X = 0

slide-7
SLIDE 7

Output

Logit estimates                                   Number of obs   =        323
                                                  LR chi2(1)      =       4.46
                                                  Prob > chi2     =     0.0348
Log likelihood = -75.469757                       Pseudo R2       =     0.0287

------------------------------------------------------------------------------
       smoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |    .967966   .4547931     2.13   0.033     .0765879    1.859344
       _cons |  -3.058707   .3235656    -9.45   0.000    -3.692884    -2.42453
------------------------------------------------------------------------------

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Gender}) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(\text{Gender})$$

slide-8
SLIDE 8

Predictions by gender

• For women, gender = 0: $\ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(0) = -3.1$
• For men, gender = 1: $\ln\left(\frac{p}{1-p}\right) = -3.1 + 1.0(1) = -2.1$
• $\beta_1$ is the difference: $\beta_1 = 1.0$ is the change in log odds

slide-9
SLIDE 9

Interpretation 1: log(odds)

• $\beta_0$: the log odds of smoking for women
• $\beta_0 + \beta_1$: the log odds of smoking for men
• $\beta_1$: the difference in the log odds of smoking for men as compared to women

slide-10
SLIDE 10

But, we really wanted to predict P(Y= 1), not the log odds…

• We can start to “untransform” the equation
• If $\ln(a) = b$, then $a = e^{b}$
• For women, X = 0: ln(odds) = $\beta_0 + \beta_1(0) = \beta_0$
• For men, X = 1: ln(odds) = $\beta_0 + \beta_1(1)$

$$\text{odds of smoking for women} = e^{\beta_0} = e^{-3.1} = 0.05$$

$$\text{odds of smoking for men} = e^{\beta_0+\beta_1} = e^{-3.1+1.0} = e^{-2.1} = 0.12$$
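The "untransform" step can be checked numerically. The sketch below uses the unrounded coefficients from the Stata output (`_cons` and `gender`); the variable names are chosen here for illustration only.

```python
import math

b0 = -3.058707  # _cons: log odds of smoking for women (gender = 0)
b1 = 0.967966   # gender: difference in log odds, men vs. women

# Exponentiating a log odds value gives the odds.
odds_women = math.exp(b0)
odds_men = math.exp(b0 + b1)

print(round(odds_women, 3), round(odds_men, 3))
```

With the full-precision coefficients the odds differ slightly from the slide's 0.05 and 0.12, which were computed from coefficients rounded to one decimal place.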

slide-11
SLIDE 11

Interpretation 2: odds

• $e^{\beta_0}$: the odds of smoking for women (when X = 0)
• $e^{\beta_0+\beta_1}$: the odds of smoking for men (when X = 1)
• In the past, we’ve compared two sets of odds by dividing to find the odds ratio (OR)

slide-12
SLIDE 12

Comparing odds

• If we subtract the log odds, mathematically that’s equivalent to dividing inside the log:
  • ln(a) – ln(b) = ln(a/b)
• So, if
  • $e^{\beta_0+\beta_1} = e^{-3.1+1.0} = e^{-2.1} = 0.12$ is the odds when X = 1, and
  • $e^{\beta_0} = e^{-3.1} = 0.05$ is the odds when X = 0, then
  • we want to divide them in order to compare

$$\text{Odds Ratio} = \frac{\text{odds for men}}{\text{odds for women}} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = \frac{0.12}{0.05} = 2.4$$

slide-13
SLIDE 13

Interpreting the odds ratio

• The odds of smoking are about 2½ times as high for men as for women.
• Based on this study, smoking cessation programs should be targeted toward men, while perhaps smoking prevention programs should be targeted toward women.

slide-14
SLIDE 14

Useful math

• We can usually simplify an equation like this

$$\text{Odds Ratio} = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{(\beta_0+\beta_1)-\beta_0} = e^{\beta_1}$$

because

$$\frac{e^{a}}{e^{b}} = e^{a-b}$$
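A quick numerical check of this simplification, as a sketch using the unrounded gender-model coefficients and confidence limits from the Stata output (variable names are illustrative):

```python
import math

b0 = -3.058707  # _cons
b1 = 0.967966   # gender coefficient

# Dividing the two odds and exponentiating b1 directly give the same number.
or_by_division = math.exp(b0 + b1) / math.exp(b0)
or_direct = math.exp(b1)

# Exponentiating the reported 95% CI limits for b1 gives a CI for the odds ratio.
or_ci = (math.exp(0.0765879), math.exp(1.859344))

print(round(or_direct, 2), tuple(round(x, 2) for x in or_ci))
```

Using unrounded coefficients, the odds ratio comes out near 2.6 rather than the 2.4 obtained by dividing the rounded odds 0.12 and 0.05; both are consistent with "about 2½".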

slide-15
SLIDE 15
Odds and odds ratio

• $e^{\beta_0}$: the odds when X = 0
• $e^{\beta_0+\beta_1}$: the odds when X = 1
• $\frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}$: the odds ratio comparing the odds when X = 1 vs. X = 0

slide-16
SLIDE 16

Note on the computer output

• R does not give $e^{\beta_0}$ in the output
• This is because logistic regression is so often used for case-control studies
  • the odds aren’t appropriate for a case-control study, because the investigators determine the ratio of cases to controls
  • the odds ratio is appropriate regardless of whether exposure or outcome was gathered first (by invariance of the odds ratio)
slide-17
SLIDE 17

Types of interpretation

• $\beta_0 + \beta_1$ = ln(odds) (for X = 1)
• $\beta_1$ = difference in log odds
• $e^{\beta_0+\beta_1}$ = odds (for X = 1)
• $e^{\beta_1}$ = odds ratio
• But we started with P(Y = 1)
• Can we find that?
slide-18
SLIDE 18

More useful math

• $\text{odds} = \frac{\text{probability}}{1 - \text{probability}}$
• $\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$
• so $\text{probability (for X = 1)} = \frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$
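The step from the odds formula to the probability formula is a short piece of algebra. Written out, starting only from the definition of odds given on this slide:

```latex
\begin{align*}
\text{odds} &= \frac{p}{1-p} \\
\text{odds}\,(1-p) &= p \\
\text{odds} &= p\,(1+\text{odds}) \\
p &= \frac{\text{odds}}{1+\text{odds}}
\end{align*}
```

Substituting $\text{odds} = e^{\beta_0+\beta_1}$ into the last line gives the probability for X = 1.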

slide-19
SLIDE 19

Finding the probability

Find the log odds:

For X = 0: ln(odds) = $\beta_0$
For X = 1: ln(odds) = $\beta_0 + \beta_1$

Find odds:

For X = 0: odds = $e^{\beta_0}$
For X = 1: odds = $e^{\beta_0+\beta_1}$
slide-20
SLIDE 20

Finding the probability

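Transform odds into probability:

$$p = \frac{\text{odds}}{1 + \text{odds}}$$

For X = 0: $\text{probability} = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
For X = 1: $\text{probability} = \frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$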

slide-21
SLIDE 21

We could even go one step further

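• For X = 0: $P(\text{smoke} \mid \text{female}) = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
• For X = 1: $P(\text{smoke} \mid \text{male}) = \frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$
• Relative Risk (RR) $= \frac{p_1}{p_2}$
• Relative Risk for Men vs. Women:

$$\frac{p_1}{p_2} = \frac{\left(\dfrac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}\right)}{\left(\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}\right)}$$

  • no way to simplify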
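The probabilities and the relative risk for the smoking example can be computed directly. This is a sketch using the unrounded coefficients from the gender model; the helper name `prob_from_log_odds` is introduced here and is not from the slides.

```python
import math

b0 = -3.058707  # _cons: log odds of smoking for women
b1 = 0.967966   # gender: difference in log odds, men vs. women

def prob_from_log_odds(log_odds: float) -> float:
    """Convert log odds to a probability via odds / (1 + odds)."""
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

p_women = prob_from_log_odds(b0)
p_men = prob_from_log_odds(b0 + b1)

relative_risk = p_men / p_women
odds_ratio = math.exp(b1)

print(round(p_women, 3), round(p_men, 3))
print(round(relative_risk, 2), round(odds_ratio, 2))
```

Because smoking is fairly rare in this sample, the relative risk and the odds ratio are close, with the relative risk the smaller of the two.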

slide-22
SLIDE 22

Remember to consider study design

• We always can calculate the relative risk
• The relative risk is not appropriate for case-control studies
  • Again, because the investigators decide the number of cases and controls to study
• The odds ratio is appropriate for all study designs

slide-23
SLIDE 23

Types of interpretation

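• $\beta_0 + \beta_1$ = ln(odds) (for X = 1)
• $\beta_1$ = difference in log odds
• $e^{\beta_0+\beta_1}$ = odds (for X = 1)
• $e^{\beta_1}$ = odds ratio
• $\frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$ = probability (for X = 1)
• $\left(\frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}\right) \Big/ \left(\frac{e^{\beta_0}}{1 + e^{\beta_0}}\right)$ = Relative Risk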

slide-24
SLIDE 24

Interpretation Tips

• If the equation includes $\beta_0$, then it is usually for a particular set of people
  • log odds
  • odds
  • probability
  • exception: the equation for RR will include $\beta_0$, because that equation cannot be simplified
• If the equation does not include $\beta_0$, then it must compare two groups
  • difference of log odds = log odds ratio
  • odds ratio

slide-25
SLIDE 25

In General

• Logistic regression for a binary outcome
• Left side of equation is log odds
• Can transform the equation to find
  • odds
  • probability
• Can compare two groups
  • difference of log odds = log odds ratio
  • odds ratio
  • relative risk
• Everything we learned before applies

slide-26
SLIDE 26

Useful math for logistic regression

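• If $\ln(a) = b$, then $a = e^{b}$
• X = 1: ln(odds) = $\beta_0 + \beta_1(1)$
• so odds (for X = 1) $= e^{\beta_0+\beta_1}$
• ln(a) – ln(b) = ln(a/b)
• so ln(odds|X = 1) – ln(odds|X = 0) = ln(OR for X = 1 vs. X = 0)
• $\frac{e^{a}}{e^{b}} = e^{a-b}$, so $\frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}$
• $\text{probability} = \frac{\text{odds}}{1 + \text{odds}}$, so $\text{probability (for X = 1)} = \frac{e^{\beta_0+\beta_1}}{1 + e^{\beta_0+\beta_1}}$
• Also: $e^{a+b} = e^{a} \times e^{b}$, so $e^{2\beta_1} = e^{\beta_1} \times e^{\beta_1} = (e^{\beta_1})^{2}$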

slide-27
SLIDE 27

Another Example

• Regular physical examination is an important preventative public health measure
• We’ll study this outcome using the public health graduate student dataset.
  • Outcome: No physical exam in the past two years
  • Primary predictor: age
  • Secondary predictor and potential confounder: regularly taking a multivitamin

slide-28
SLIDE 28

Problem

• The original “phys” variable was meant to be continuous, but it was collected categorically.
  • time since last physician visit
• Since it is now categorical and we wish to use it as the outcome for a regression model, we have to make it binary and use logistic regression.

slide-29
SLIDE 29

Creating a new variable

• Phys_no = 1 if over 2 years, 0 if 2 years or less

. tab phys

   Length of time since last |
                    check-up |      Freq.     Percent        Cum.
-----------------------------+-----------------------------------
        Within the past year |        182       54.17       54.17
   Within the past 1-2 years |         72       21.43       75.60
   Within the past 2-5 years |         53       15.77       91.37
             5 or more years |         29        8.63      100.00
-----------------------------+-----------------------------------
                       Total |        336      100.00

slide-30
SLIDE 30

Goals

• Predict Phys (no physician visit within the past two years = 1) with Age (continuous)
• After adjusting for age, is taking a multivitamin (1 = yes) a statistically significant predictor for not regularly visiting a physician?
• Is taking a multivitamin a confounder for the age-physician visit relationship?

slide-31
SLIDE 31

Null Model: Coefficients

Logit estimates                                   Number of obs   =        336
                                                  LR chi2(1)      =       0.00
                                                  Prob > chi2     =     0.9567
Log likelihood = -186.71399                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
     phys_no |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agec |  -.0009585   .0176509    -0.05   0.957    -.0355536    .0336365
       _cons |  -1.130428   .1270539    -8.90   0.000    -1.379449   -.8814066
------------------------------------------------------------------------------

agec = age-30 (centered age)
slide-32
SLIDE 32

Predictions by age

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -1.13 - 0.001(\text{Age} - 30)$$

• For a 30-year-old: $\ln\left(\frac{p}{1-p}\right) = -1.13 - 0.001(30 - 30) = -1.13$
• For a 31-year-old: $\ln\left(\frac{p}{1-p}\right) = -1.13 - 0.001(31 - 30) = -1.13 - 0.001 = -1.131$
• Difference is $\beta_1 = -0.001$

slide-33
SLIDE 33

Interpretation: log odds

• $\beta_0$: the log odds of not visiting a physician for a 30-year-old
• $\beta_1$: the difference in the log odds of not visiting a physician for a one year increase in age

slide-34
SLIDE 34

Recall:

• ln(a) – ln(b) = ln(a/b)
• so ln(odds|X = 31) – ln(odds|X = 30) = ln(OR for X = 31 vs. X = 30)
• difference of log odds = log odds ratio
• Alternate interpretation for $\beta_1$:
  • The log odds ratio of not visiting a physician corresponding to a one year increase in age

slide-35
SLIDE 35

Interpretation: log(odds ratio) for one year age difference

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - 0.001(\text{Age} - 30)}$$

• For a 31-year-old: $\frac{p}{1-p} = e^{-1.13 - 0.001(31 - 30)} = e^{-1.131} = 0.3227$
• For a 30-year-old: $\frac{p}{1-p} = e^{-1.13} = 0.3230$
• Ratio $= \frac{0.3227}{0.3230} = 0.999 = \frac{e^{\beta_0+\beta_1}}{e^{\beta_0}} = e^{\beta_1}$

slide-36
SLIDE 36

Interpretation: odds ratio for one year age difference

• $e^{\beta_0}$ is the odds of not visiting a physician for 30-year-olds
• $e^{\beta_0+\beta_1}$ is the odds of not visiting a physician for 31-year-olds
• $e^{\beta_1}$ is the odds ratio of not visiting a physician corresponding to a one year increase in age
slide-37
SLIDE 37

Interpretation: odds ratio for two year age difference

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - 0.001(\text{Age} - 30)}$$

• For a 32-year-old: $\frac{p}{1-p} = e^{-1.13 - 0.001(32 - 30)} = e^{-1.13 - 0.001 \times 2} = e^{-1.132} = 0.3224$
• For a 30-year-old: $\frac{p}{1-p} = e^{-1.13} = 0.3230$
• Ratio $= \frac{0.3224}{0.3230} = 0.998 = \frac{e^{\beta_0+2\beta_1}}{e^{\beta_0}} = e^{2\beta_1} = (e^{\beta_1})^{2}$

slide-38
SLIDE 38

Interpretation: odds ratio for 10 year age difference

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-1.13 - 0.001(\text{Age} - 30)}$$

• For a 40-year-old: $\frac{p}{1-p} = e^{-1.13 - 0.001(40 - 30)} = e^{-1.13 - 0.01} = e^{-1.14} = 0.3198$
• For a 30-year-old: $\frac{p}{1-p} = e^{-1.13} = 0.3230$
• Ratio $= \frac{0.3198}{0.3230} = 0.990 = \frac{e^{\beta_0+10\beta_1}}{e^{\beta_0}} = e^{10\beta_1} = (e^{\beta_1})^{10}$

slide-39
SLIDE 39

• $e^{\beta_1}$ is the proportional increase of the odds of not visiting a physician corresponding to a one year increase in age
• $e^{10\beta_1} = (e^{\beta_1})^{10}$ is the proportional increase of the odds of not visiting a physician corresponding to a ten year increase in age

Math fact: $e^{a+b} = e^{a} \times e^{b}$, so $e^{2\beta_1} = e^{\beta_1} \times e^{\beta_1} = (e^{\beta_1})^{2}$

$$e^{\beta_1} \times e^{\beta_1} = \frac{(\text{odds for 31 yr old})}{(\text{odds for 30 yr old})} \times \frac{(\text{odds for 31 yr old})}{(\text{odds for 30 yr old})}$$
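The rule that a k-year age difference multiplies the odds by $(e^{\beta_1})^{k}$ can be verified numerically. A small sketch, using the unrounded `agec` coefficient from the age-only model; the function name `odds_ratio_for` is illustrative.

```python
import math

b1 = -0.0009585  # agec coefficient from the age-only model

def odds_ratio_for(years: float) -> float:
    """Odds ratio of not visiting a physician for an age difference of `years`."""
    return math.exp(b1 * years)

for k in (1, 2, 10):
    # Each extra year multiplies the odds by the same one-year factor.
    print(k, round(odds_ratio_for(k), 4), round(odds_ratio_for(1) ** k, 4))
```

The two printed columns agree for every k, which is the identity $e^{k\beta_1} = (e^{\beta_1})^{k}$ used on the slides.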

slide-40
SLIDE 40

$$\text{probability of not visiting a physician} = p = \frac{e^{-1.13 - 0.001(\text{Age} - 30)}}{1 + e^{-1.13 - 0.001(\text{Age} - 30)}}$$

• For a 40-year-old: $p = \frac{e^{-1.13 - 0.001(40 - 30)}}{1 + e^{-1.13 - 0.001(40 - 30)}} = \frac{e^{-1.13 - 0.01}}{1 + e^{-1.13 - 0.01}} = \frac{e^{-1.14}}{1 + e^{-1.14}} = 0.2423$
• For a 30-year-old: $p = \frac{e^{-1.13 - 0.001(0)}}{1 + e^{-1.13 - 0.001(0)}} = \frac{e^{-1.13}}{1 + e^{-1.13}} = 0.2442$
• The ratio (RR) cannot be simplified: $\frac{p_1}{p_2} = \frac{0.2423}{0.2442} = 0.992$
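The same probabilities can be reproduced in code. This sketch uses the unrounded coefficients from the age-only model, so the results may differ from the slide values in the last digit; `prob_no_visit` is a name introduced here for illustration.

```python
import math

b0 = -1.130428   # _cons: log odds for a 30-year-old (age is centered at 30)
b1 = -0.0009585  # agec: change in log odds per additional year of age

def prob_no_visit(age: float) -> float:
    """Predicted probability of no physician visit in the past two years."""
    log_odds = b0 + b1 * (age - 30)
    return 1.0 / (1.0 + math.exp(-log_odds))  # same as odds / (1 + odds)

p30 = prob_no_visit(30)
p40 = prob_no_visit(40)
relative_risk = p40 / p30

print(round(p30, 4), round(p40, 4), round(relative_risk, 3))
```

Unlike the odds ratio, the relative risk depends on which two ages are compared, because the intercept does not cancel.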

slide-41
SLIDE 41

Interpretation: probability

• $\frac{e^{\beta_0}}{1 + e^{\beta_0}}$ is the probability of not visiting a physician for 30-year-olds
• $\frac{e^{\beta_0+10\beta_1}}{1 + e^{\beta_0+10\beta_1}}$ is the probability of not visiting a physician for 40-year-olds
• $\left(\frac{e^{\beta_0+10\beta_1}}{1 + e^{\beta_0+10\beta_1}}\right) \Big/ \left(\frac{e^{\beta_0}}{1 + e^{\beta_0}}\right)$ is the relative risk of not visiting a physician for 40-year-olds vs. 30-year-olds

slide-42
SLIDE 42

Goals

• Predict Phys (no physician visit within the past two years = 1) with Age (continuous)
• After adjusting for age, is taking a multivitamin (1 = yes) a statistically significant predictor for not regularly visiting a physician?
• Is taking a multivitamin a confounder for the age-physician visit relationship?

slide-43
SLIDE 43

Nested models

• Adding a single new variable to the model
  • null model: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30)$
  • full model: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin})$

slide-44
SLIDE 44

Comparing nested models that differ by one variable

• Compare models with p-value or CI
• What method is this?
  • The Wald test, a test that applies the CLT, like
    • Z test comparing proportions in 2x2 table
    • X2 test for independence in 2x2 table
    • analogous to the t test for linear regression
• H0: the new variable is not needed
• or H0: $\beta_{new} = 0$ in the population

slide-45
SLIDE 45

Full Model: Coefficients

Logit estimates                                   Number of obs   =        317
                                                  LR chi2(2)      =       7.87
                                                  Prob > chi2     =     0.0195
Log likelihood = -171.80997                       Pseudo R2       =     0.0224

------------------------------------------------------------------------------
     phys_no |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        agec |   .0012855   .0192619     0.07   0.947    -.0364671    .0390381
    multivit |  -.7808889   .2871247    -2.72   0.007    -1.343643   -.2181349
       _cons |  -.8571962    .159519    -5.37   0.000    -1.169848   -.5445446
------------------------------------------------------------------------------

slide-46
SLIDE 46

Conclusion from the Wald test

• The p-value for multivitamin is 0.007 (below 0.05) and the CI for coefficient multivitamin does not include 0 (CI for OR doesn’t include 1)
• Reject H0
• Conclude that the larger model is better: after adjusting for age, multivitamin use is still an important predictor of physician visits in the population
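The Wald statistic, p-value, and confidence interval for the multivitamin coefficient can be recomputed from the coefficient and standard error in the full-model output. This is a sketch assuming the usual two-sided normal approximation; the critical value 1.959964 is the standard normal 97.5th percentile.

```python
import math

coef = -0.7808889  # multivit coefficient from the full model
se = 0.2871247     # its standard error

# Wald z statistic: estimate divided by its standard error.
z = coef / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2.0))

# 95% confidence interval for the coefficient, then for the odds ratio.
z_crit = 1.959964
ci_coef = (coef - z_crit * se, coef + z_crit * se)
ci_or = (math.exp(ci_coef[0]), math.exp(ci_coef[1]))

print(round(z, 2), round(p_value, 3))
print(tuple(round(x, 3) for x in ci_coef), tuple(round(x, 2) for x in ci_or))
```

The interval for the coefficient excludes 0 exactly when the interval for the odds ratio excludes 1, since exponentiation preserves order.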

slide-47
SLIDE 47

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(\text{Age} - 30) + \beta_2(\text{Multivitamin}) \;\Rightarrow\; \ln\left(\frac{p}{1-p}\right) = -0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})$$

• A 30-year-old non-vitamin user:
  • log odds = -0.86
• A 31-year-old non-vitamin user:
  • log odds = -0.86 + 0.001
• A 30-year-old vitamin user:
  • log odds = -0.86 – 0.78
• A 31-year-old vitamin user:
  • log odds = -0.86 + 0.001 – 0.78

slide-48
SLIDE 48

Interpretation - log odds

• $\beta_0$: the log odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins
• $\beta_1$: the log odds ratio of not visiting a physician for a one year increase in age controlling for multivitamin use
• $\beta_2$: the log odds ratio of not visiting a physician for those who take multivitamins compared with those who do not, adjusting for age

slide-49
SLIDE 49

$$\text{odds of not visiting a physician} = \frac{p}{1-p} = e^{-0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})}$$

• A 30-year-old non-vitamin user:
  • odds = exp{-0.86} = 0.4232
• A 31-year-old non-vitamin user:
  • odds = exp{-0.86 + 0.001} = 0.4236
• A 30-year-old vitamin user:
  • odds = exp{-0.86 – 0.78} = 0.1940
• A 31-year-old vitamin user:
  • odds = exp{-0.86 + 0.001 – 0.78} = 0.1942

slide-50
SLIDE 50

Interpretation – odds and odds ratio

• exp{$\beta_0$}: the odds of not visiting a physician for a 30-year-old person who reports not regularly taking multivitamins

slide-51
SLIDE 51

Interpretation – odds and odds ratio

• exp{$\beta_1$}: after adjusting for multivitamin use, the odds of not visiting a physician change by a factor of exp{$\beta_1$} = 1.001 for each additional year of age
  • additional age is associated with lower frequency of physician visits in these students, but the association is not statistically significant (p > 0.05)

slide-52
SLIDE 52

Interpretation – odds and odds ratio

• exp{$\beta_2$}: the odds ratio of not visiting a physician for those who take multivitamins compared with those who do not is exp{$\beta_2$} = 0.46, adjusting for age
  • taking multivitamins is associated with regular physician visits (p = 0.007)

slide-53
SLIDE 53

$$\text{probability of not visiting a physician} = p = \frac{e^{-0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})}}{1 + e^{-0.86 + 0.001(\text{Age} - 30) - 0.78(\text{Multivitamin})}}$$

• For a 30-year-old non vitamin user: $p = \frac{e^{-0.86 + 0.001(0) - 0.78(0)}}{1 + e^{-0.86 + 0.001(0) - 0.78(0)}} = 0.30$
• For a 40-year-old vitamin user: $p = \frac{e^{-0.86 + 0.001(40 - 30) - 0.78(1)}}{1 + e^{-0.86 + 0.001(40 - 30) - 0.78(1)}} = 0.16$
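Predictions from the two-predictor model follow the same recipe: build the log odds, then convert to a probability. The sketch below uses the unrounded full-model coefficients; the function name and argument names are illustrative.

```python
import math

b0 = -0.8571962  # _cons: log odds for a 30-year-old non-vitamin user
b1 = 0.0012855   # agec: change in log odds per year of age
b2 = -0.7808889  # multivit: change in log odds for multivitamin users

def prob_no_visit(age: float, multivitamin: int) -> float:
    """Predicted probability of no physician visit in the past two years."""
    log_odds = b0 + b1 * (age - 30) + b2 * multivitamin
    return 1.0 / (1.0 + math.exp(-log_odds))

p_30_nonuser = prob_no_visit(30, 0)
p_40_user = prob_no_visit(40, 1)

print(round(p_30_nonuser, 2), round(p_40_user, 2))
```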

slide-54
SLIDE 54

Goals

• Predict Phys (no physician visit within the past two years = 1) with Age (continuous)
• After adjusting for age, is taking a multivitamin (1 = yes) a statistically significant predictor for not regularly visiting a physician?
• Is taking a multivitamin a confounder for the age-physician visit relationship?

slide-55
SLIDE 55

Was multivitamin use a confounder?

• CI for $\beta_1$ in parent model: (-0.036, 0.034)
• Estimate for $\beta_1$ in nested model: 0.001
• CI for exp{$\beta_1$} in parent model: (exp{-0.036}, exp{0.034}) = (0.97, 1.03)
• Estimate for exp{$\beta_1$} in nested model: exp{0.001} = 1.001
• Estimate is in original CI: multivitamin use is not a statistically significant confounder
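The confounding check described here, whether the adjusted age coefficient falls inside the confidence interval from the age-only model, can be written as a short comparison. A sketch using the unrounded values from the two Stata outputs; variable names are chosen for illustration.

```python
import math

# 95% CI for the age coefficient in the age-only model
ci_age_only = (-0.0355536, 0.0336365)

# Age coefficient after adding multivitamin use to the model
b1_adjusted = 0.0012855

# Check on the log odds scale
inside_log_scale = ci_age_only[0] <= b1_adjusted <= ci_age_only[1]

# Equivalent check on the odds ratio scale (exponentiation preserves order)
ci_or = (math.exp(ci_age_only[0]), math.exp(ci_age_only[1]))
inside_or_scale = ci_or[0] <= math.exp(b1_adjusted) <= ci_or[1]

print(inside_log_scale, inside_or_scale)
```

Both checks necessarily agree, so either scale can be used for this rule.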

slide-56
SLIDE 56

Interpretation

• The factor by which the odds of irregular physician visits changes for each additional year of age does not change appreciably when we adjust for multivitamin use.
• The “slope” is roughly the same before and after adjusting for multivitamin use.

slide-57
SLIDE 57

Goals: conclusions

• Predict Phys (no physician visit within the past two years = 1) with Age (continuous)
  • There is no statistically significant effect of age on physician visits in the population

slide-58
SLIDE 58

Goals: conclusions

• After adjusting for age, is taking a multivitamin (1 = yes) a statistically significant predictor for not regularly visiting a physician?
  • After adjusting for age, those who regularly take a multivitamin are also more likely to have visited a physician during the past two years (p = 0.007)

slide-59
SLIDE 59

Goals: conclusions

• Is taking a multivitamin a confounder for the age-physician visit relationship?
  • The effect of age on physician visit is still nonsignificant after adjusting for multivitamin use: multivitamin use is not a confounder

slide-60
SLIDE 60

Overall evaluation of the model

• Pseudo R2 = 0.02
  • Only about 2% of the total variability in physician visit has been explained by age and multivitamin use.
  • Other important predictors probably exist.
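The Pseudo R2 in the Stata output can be recovered from the other numbers in the header. The sketch below assumes the reported value is McFadden's pseudo R2, which compares the fitted model's log likelihood with that of an intercept-only model, and uses the fact that the LR chi-square equals twice the difference of the two log likelihoods. Here "intercept-only" means a model with no predictors at all, which is not the same as the age-only model the slides call the null model.

```python
ll_full = -171.80997  # log likelihood of the model with age and multivitamin
lr_chi2 = 7.87        # LR chi2(2) from the same output

# LR chi2 = 2 * (ll_full - ll_intercept_only), so solve for the intercept-only value.
ll_intercept_only = ll_full - lr_chi2 / 2.0

# McFadden's pseudo R2: proportional improvement in log likelihood.
pseudo_r2 = 1.0 - ll_full / ll_intercept_only

print(round(pseudo_r2, 4))
```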

slide-61
SLIDE 61

Summary

• Nested models
• Test statistical significance of new variable after adjusting for null model
  • Wald (z) test or CI for $\beta_{new}$
• Test whether new X is a confounder for an original X
  • is nested $\beta_1$ in CI for parent $\beta_1$?

slide-62
SLIDE 62

Summary

• For a continuous X, exp{$\beta$} is the factor by which the odds or odds ratio changes for each unit change of X
• Pseudo R2 provides overall evaluation of the model