University of Copenhagen, Department of Biostatistics

Logistic regression

Susanne Rosthøj

Section of Biostatistics, Institute of Public Health, University of Copenhagen
sr@biostat.ku.dk

Outline

  • Risk, odds and odds-ratio
  • Simple logistic regression:
      • One binary explanatory variable
      • One categorical
      • One quantitative
  • Multiple logistic regression:
      • Two binary + interaction

Example 1: gender and CHD

Is the risk of CHD different for males and females?

                CHD = 0          CHD = 1        Total
  Females     616 (85.6%)      104 (14.4%)       720
  Males       479 (74.5%)      164 (25.5%)       643
  Total      1095 (80.3%)      268 (19.7%)      1363

The hypothesis of no difference in risk for the genders is rejected (p < 0.0001, chi-square test).
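
The test quoted above can be reproduced from the four cell counts alone. A minimal sketch in R, assuming the table is entered by hand (the object names `tab` and `res` are illustrative; note that `chisq.test` applies a continuity correction to 2x2 tables by default):

```r
# 2x2 table of counts from the slide: rows = gender, columns = CHD status
tab <- matrix(c(616, 104,
                479, 164),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Females", "Males"),
                              CHD    = c("0", "1")))

# Pearson chi-square test of no association between gender and CHD
res <- chisq.test(tab)
print(res$p.value)

# Row percentages, for comparison with the percentages in the table
print(round(100 * prop.table(tab, margin = 1), 1))
```

Passing `correct = FALSE` to `chisq.test` gives the uncorrected statistic; with counts this large the conclusion is the same either way.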

Quantifying the difference

Risk of CHD for males: p1 ≈ 164/643 = 0.26
Risk of CHD for females: p2 ≈ 104/720 = 0.14

Odds of CHD for males: p1/(1 − p1) ≈ 164/479 = 0.34 (≈ 1 : 3)
Odds of CHD for females: p2/(1 − p2) ≈ 104/616 = 0.17 (≈ 1 : 6)

Quantification of the difference in risk:
  • Absolute Risk Reduction (ARR): |p1 − p2| ≈ 0.11
  • Risk Ratio (RR): p1/p2 ≈ 1.77
  • Odds-ratio (OR): [p1/(1 − p1)] / [p2/(1 − p2)] ≈ 2.03

When p1 and p2 are small (< 0.1): RR ≈ OR.

We have seen that there is a difference for males and females: p1 ≠ p2, i.e. ARR > 0, RR ≠ 1, OR ≠ 1.
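
The quantities on this slide follow from the four counts in the gender-by-CHD table. A short sketch of the arithmetic in R (all variable names are illustrative):

```r
# Counts from the gender-by-CHD table
chd_m <- 164; n_m <- 643   # males: CHD cases, total
chd_f <- 104; n_f <- 720   # females: CHD cases, total

p1 <- chd_m / n_m          # risk for males
p2 <- chd_f / n_f          # risk for females

odds1 <- p1 / (1 - p1)     # odds for males, equals 164/479
odds2 <- p2 / (1 - p2)     # odds for females, equals 104/616

arr        <- abs(p1 - p2) # absolute difference in risk
risk_ratio <- p1 / p2
odds_ratio <- odds1 / odds2

print(round(c(p1 = p1, p2 = p2, odds1 = odds1, odds2 = odds2), 2))
print(round(c(ARR = arr, RR = risk_ratio, OR = odds_ratio), 2))
```

Because neither risk is below 0.1 here, the odds ratio is noticeably larger than the risk ratio.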

The purpose of a logistic regression analysis

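Relate a binary outcome variable, e.g.

$$ Y_i = \begin{cases} 1 & \text{if } i \text{ developed CHD} \\ 0 & \text{if } i \text{ did not develop CHD} \end{cases} $$

to explanatory variables for individual i.

In logistic regression we formulate models for the log-odds:

$$ \log\left( \frac{p_i}{1 - p_i} \right) $$

where p_i = P(Y_i = 1) is the risk for individual i.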

Odds and log-odds

[Figure: two panels for p between 0 and 1. One shows the odds p/(1−p) against p; the other shows the log-odds against p.]
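
The two transformations named in the slide title can be tabulated directly. A small sketch in R; the grid of risks is an arbitrary choice for illustration:

```r
# Risks on a grid strictly inside (0, 1); the odds are undefined at p = 1
p <- seq(0.05, 0.95, by = 0.15)

odds     <- p / (1 - p)    # odds take values in (0, Inf)
log_odds <- log(odds)      # log-odds (logit) take values in (-Inf, Inf)

# Base R provides the logit as qlogis() and its inverse as plogis()
print(data.frame(p = p,
                 odds = round(odds, 3),
                 log_odds = round(log_odds, 3)))
```

The log-odds are symmetric around p = 0.5 (where they equal 0) and unbounded in both directions, which is why they are a convenient scale for a linear model.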

The logistic regression model

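Explanatory variable:

$$ \text{male}_i = \begin{cases} 0 & i \text{ is female} \\ 1 & i \text{ is male} \end{cases} $$

Model:

$$ \log\left( \frac{p_i}{1 - p_i} \right) = a + b \cdot \text{male}_i = \begin{cases} a & i \text{ is female} \\ a + b & i \text{ is male} \end{cases} $$

Determine a and b by hand.

The difference in log-odds between males and females is b = (?)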

Calculating OR using logistic regression

$$ \log\left( \frac{p_i}{1 - p_i} \right) = a + b \cdot \text{male}_i = \begin{cases} a & i \text{ is female} \\ a + b & i \text{ is male} \end{cases} $$

b = (a + b) − a
  = log(odds for males) − log(odds for females)
  = log(OR for males vs. females)

i.e. exp(b) = OR for males vs. females = (?)

Now determine the OR of CHD for females vs. males.
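
One way to carry out the by-hand calculation is to read the odds straight off the gender-by-CHD table of Example 1. A sketch in R, with illustrative variable names:

```r
# Odds of CHD from the Example 1 table (CHD cases / non-cases)
odds_f <- 104 / 616        # females (male_i = 0)
odds_m <- 164 / 479        # males   (male_i = 1)

a <- log(odds_f)           # log-odds for females
b <- log(odds_m) - a       # difference in log-odds, males vs. females

or_m_vs_f <- exp(b)        # OR for males vs. females
or_f_vs_m <- exp(-b)       # reversing the comparison flips the sign of b

print(round(c(a = a, b = b), 4))
print(round(c(males_vs_females = or_m_vs_f,
              females_vs_males = or_f_vs_m), 3))
```

The two odds ratios are reciprocals of each other, so changing the reference group never requires refitting the model.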

Logistic regression in R

glm = Generalized Linear Model.

> library(foreign)   # provides read.dbf()
> d <- read.dbf('framingham.dbf')
> glm1 <- glm( chd01 ~ factor(sex), data=d, family=binomial )
> summary( glm1 )

Call:
glm(formula = chd01 ~ factor(sex), family = binomial, data = d)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7674  -0.7674  -0.5586  -0.5586   1.9672

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.7789     0.1060 -16.780  < 2e-16 ***
factor(sex)1   0.7070     0.1394   5.073 3.92e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1324.9  on 1361  degrees of freedom
  (43 observations deleted due to missingness)
AIC: 1328.9

Number of Fisher Scoring iterations: 4
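
The data file is not included here, but with a single binary covariate the estimates depend only on the four counts in the Example 1 table. A sketch, assuming those counts (1363 individuals) are the observations used in the fit above; the data frame `tab` and the model name `fit` are illustrative:

```r
# Grouped data: one row per gender, with counts of CHD cases and non-cases
tab <- data.frame(sex    = c(0, 1),      # 0 = female, 1 = male
                  chd    = c(104, 164),
                  no_chd = c(616, 479))

# Binomial GLM on grouped counts: the response is cbind(successes, failures)
fit <- glm(cbind(chd, no_chd) ~ factor(sex), family = binomial, data = tab)

# Estimates and standard errors match an individual-level fit on the same
# observations; the absolute deviances differ because the grouped model
# is saturated
print(round(coef(summary(fit)), 4))
```

Only differences in deviance between nested models are comparable across the grouped and individual-level forms of the data.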

Finding OR and CI

> # Estimates in terms of log-odds
> coef( glm1 )
 (Intercept) factor(sex)1
  -1.7788561    0.7070219
>
> # ORs:
> exp( coef( glm1 ) )
 (Intercept) factor(sex)1
   0.1688312    2.0279428
>
> # Confidence intervals:
> exp( confint.default( glm1 ) )
                 2.5 %    97.5 %
(Intercept)  0.1371558 0.2078218
factor(sex)1 1.5432055 2.6649413
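
`confint.default` returns Wald intervals: estimate ± 1.96 · SE on the log-odds scale, which are then exponentiated. A sketch of the same calculation by hand, using the estimate and the (rounded) standard error printed for factor(sex)1; variable names are illustrative:

```r
est <- 0.7070219           # log(OR), males vs. females
se  <- 0.1394              # standard error as printed by summary(glm1)

z <- qnorm(0.975)          # approximately 1.96 for a 95% interval

# Interval on the log-odds scale, then transformed to the OR scale
ci_log <- est + c(-1, 1) * z * se
ci_or  <- exp(ci_log)

print(round(c(OR = exp(est), lower = ci_or[1], upper = ci_or[2]), 3))
```

The interval is symmetric around the estimate on the log-odds scale but not on the OR scale. Calling `confint` instead of `confint.default` on a glm object gives profile-likelihood intervals, which can differ slightly.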

Logistic regression with a quantitative variable

The model for log-odds is linear:

$$ \log\left( \frac{p_i}{1 - p_i} \right) = a + b \cdot \text{age}_i $$

Compare two individuals aged 51 and 50:

$$ \text{OR} = \frac{\text{odds at 51 years}}{\text{odds at 50 years}} $$

log(OR) = log(odds 51 years) − log(odds 50 years) = (a + 51 · b) − (a + 50 · b) = b

i.e. OR = exp(b) = exp(0.066) = 1.068.

Interpretation?
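
The same cancellation of the intercept works for any age gap, not only one year. A small sketch in R; the helper `or_for_gap` is a hypothetical name introduced here, and the five-year gap is an arbitrary illustration:

```r
b <- 0.066                 # estimated log-OR per year of age (from the slide)

# OR comparing two individuals whose ages differ by k years: exp(k * b).
# The intercept a cancels, so the OR does not depend on the baseline age.
or_for_gap <- function(k, b) exp(k * b)

print(round(or_for_gap(1, b), 3))   # one-year difference
print(round(or_for_gap(5, b), 3))   # five-year difference
```

Since exp(k · b) = exp(b)^k, the odds are multiplied by the same factor for every additional year of age.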

Exercise : Odds ratios

Determine the OR comparing two individuals with a difference in age of two years. Three years? Ten years?

Discuss how to assess whether the linear model is plausible.
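
One possible approach to the last question, among several, is to extend the linear predictor and test whether the extra flexibility is needed. The sketch below uses simulated data, not the Framingham data; the sample size, the coefficients used to simulate, and all object names are assumptions made purely for illustration:

```r
set.seed(1)

# Simulated data (hypothetical values, not the Framingham data)
n   <- 2000
age <- runif(n, min = 30, max = 70)
p   <- plogis(-5 + 0.066 * age)          # true log-odds are linear in age
chd <- rbinom(n, size = 1, prob = p)
sim <- data.frame(chd = chd, age = age)

# Linear model and an extension with a quadratic term (age centred at 50)
fit_lin  <- glm(chd ~ age, family = binomial, data = sim)
fit_quad <- glm(chd ~ age + I((age - 50)^2), family = binomial, data = sim)

# Likelihood-ratio test of the quadratic term; a small p-value would
# suggest that the log-odds are not linear in age
lrt <- anova(fit_lin, fit_quad, test = "Chisq")
print(lrt)
```

Other options include replacing age by age groups (e.g. via `cut`) and checking whether the estimated log-odds per group increase roughly linearly with age.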

Risk of CHD according to the model

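Predictions:

$$ p_i = \frac{\exp(a + b \cdot \text{age}_i)}{1 + \exp(a + b \cdot \text{age}_i)} $$

[Figure: predicted risk p (0 to 1) plotted against age (axis ticks at 50, 100 and 150 years).]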
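
The prediction formula is the inverse of the log-odds transformation. A sketch in R: the slope 0.066 is the estimate quoted on the earlier slide, while the intercept is not given in the slides, so the value used here is purely hypothetical; `predict_risk` is likewise an illustrative name:

```r
# Inverse logit: converts log-odds into a risk between 0 and 1
predict_risk <- function(age, a, b) {
  eta <- a + b * age                     # linear predictor (log-odds)
  exp(eta) / (1 + exp(eta))
}

a <- -5                                  # HYPOTHETICAL intercept, not from the slides
b <- 0.066                               # slope per year of age (from the slides)

ages <- c(40, 50, 60, 70)
risk <- predict_risk(ages, a, b)
print(round(risk, 3))
```

Base R's `plogis(a + b * age)` computes the same quantity, and for a fitted model `predict(fit, newdata, type = "response")` returns risks on this scale.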

The additive model

Consider the additive model:

$$ \log\left( \frac{p_i}{1 - p_i} \right) = a + b \times \text{male}_i + c \times \text{hypertension}_i $$

or put in tabular form:

  log-odds   no hypertension   hypertension
  females    a                 a + c
  males      a + b             a + b + c

OR of CHD, hypertension vs no hypertension:
  Males: exp(c)
  Females: exp(c)
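
The table of log-odds can be generated mechanically from the model formula. A sketch in R with hypothetical coefficient values, chosen only to make the structure visible (they are not estimates from the data):

```r
# HYPOTHETICAL coefficients, for illustration only
a <- -2.0; b <- 0.8; c_hyp <- 1.0

# Log-odds for each combination: rows = gender, columns = hypertension
log_odds <- outer(c(female = 0, male = 1),
                  c(no_hyp = 0, hyp = 1),
                  function(male, hyp) a + b * male + c_hyp * hyp)
print(log_odds)

# OR for hypertension vs. no hypertension, separately for each gender
or_by_gender <- exp(log_odds[, "hyp"] - log_odds[, "no_hyp"])
print(or_by_gender)
```

Both genders get the same odds ratio exp(c), which is exactly what "additive on the log-odds scale" means.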

Interaction

Is there an interaction between gender and hypertension?

The interaction model:

  log-odds   no hypertension   hypertension
  females    a                 a + c
  males      a + b             a + b + c + d

OR of CHD, hypertension vs no hypertension:
  Males: ?
  Females: ?
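
In R's formula notation the interaction model corresponds to `male * hypertension`, and the extra parameter d multiplies the product of the two indicators. A sketch showing the design matrix for the four covariate patterns; the coefficient values are hypothetical and serve only to illustrate how the table is filled in:

```r
# The four covariate patterns from the table
grid <- expand.grid(male = c(0, 1), hypertension = c(0, 1))

# Design matrix for the interaction model: the last column is the product
# male * hypertension, so d only enters for hypertensive males
X <- model.matrix(~ male * hypertension, data = grid)
print(X)

# HYPOTHETICAL coefficients (a, b, c, d), for illustration only
beta <- c(a = -2.0, b = 0.8, c = 1.0, d = -0.5)
grid$log_odds <- as.vector(X %*% beta)
print(grid)
```

Setting d = 0 recovers the additive model, which is why a test of d = 0 is a test of no interaction.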

Exercise

On the following slides you find three model outputs. Study the outputs and fill in the blanks on this and the next slide.

We use model ____ to test whether there is an interaction between sex and hypertension.

No interaction, i.e. d = 0
Estimated interaction term (d) and SE: ____
Test statistic: Wald: W = d / SE = ____ , P = ____
Conclude: ____

Exercise cont.

We use model ____ to compute ORs of CHD, hypertension vs no hypertension, for each gender.

Males OR: ____
Females OR: ____

Model 1

> glm1 <- glm( chd01 ~ factor(sex)*factor(hyper), data=d, family=binomial )
> summary( glm1 )

Call:
glm(formula = chd01 ~ factor(sex) * factor(hyper), family = binomial, data = d)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -2.5244     0.1867 -13.524  < 2e-16 ***
factor(sex)1                  1.2147     0.2196   5.532 3.16e-08 ***
factor(hyper)1                1.3812     0.2300   6.005 1.92e-09 ***
factor(sex)1:factor(hyper)1  -0.6815     0.2977  -2.289   0.0221 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1271.7  on 1359  degrees of freedom
  (43 observations deleted due to missingness)
AIC: 1279.7

Number of Fisher Scoring iterations: 5

> exp( coef( glm1 ) )
                (Intercept)                factor(sex)1
                 0.08010336                  3.36922654
             factor(hyper)1 factor(sex)1:factor(hyper)1
                 3.97957459                  0.50585702
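
The z value and p-value that R prints for each coefficient are a Wald test of that coefficient being zero. A sketch of the calculation by hand for the interaction term, using the rounded estimate and standard error from the output above (variable names are illustrative):

```r
d_hat <- -0.6815           # interaction estimate, factor(sex)1:factor(hyper)1
se_d  <-  0.2977           # its standard error

# Wald statistic and two-sided p-value from the standard normal distribution
w       <- d_hat / se_d
p_value <- 2 * pnorm(-abs(w))

print(round(c(W = w, P = p_value), 4))
```

A likelihood-ratio test comparing the models with and without the interaction (e.g. via `anova(..., test = "Chisq")`) is an alternative that usually gives a similar answer.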

Model 2

> glm2 <- glm( chd01 ~ factor(sex) + factor(sex):factor(hyper), data=d, family=binomial )
> summary( glm2 )

Call:
glm(formula = chd01 ~ factor(sex) + factor(sex):factor(hyper), family = binomial, data = d)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -2.5244     0.1867 -13.524  < 2e-16 ***
factor(sex)1                  1.2147     0.2196   5.532 3.16e-08 ***
factor(sex)0:factor(hyper)1   1.3812     0.2300   6.005 1.92e-09 ***
factor(sex)1:factor(hyper)1   0.6997     0.1890   3.701 0.000214 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1271.7  on 1359  degrees of freedom
  (43 observations deleted due to missingness)
AIC: 1279.7

Number of Fisher Scoring iterations: 5

> exp( coef( glm2 ) )
                (Intercept)                factor(sex)1
                 0.08010336                  3.36922654
factor(sex)0:factor(hyper)1 factor(sex)1:factor(hyper)1
                 3.97957459                  2.01309573

Model 3

> glm3 <- glm( chd01 ~ factor(hyper) + factor(sex):factor(hyper), data=d, family=binomial )
> summary( glm3 )

Call:
glm(formula = chd01 ~ factor(hyper) + factor(sex):factor(hyper), family = binomial, data = d)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                  -2.5244     0.1867 -13.524  < 2e-16 ***
factor(hyper)1                1.3812     0.2300   6.005 1.92e-09 ***
factor(hyper)0:factor(sex)1   1.2147     0.2196   5.532 3.16e-08 ***
factor(hyper)1:factor(sex)1   0.5332     0.2011   2.652  0.00801 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1351.2  on 1362  degrees of freedom
Residual deviance: 1271.7  on 1359  degrees of freedom
  (43 observations deleted due to missingness)
AIC: 1279.7

Number of Fisher Scoring iterations: 5

> exp( coef( glm3 ) )
                (Intercept)              factor(hyper)1
                 0.08010336                  3.97957459
factor(hyper)0:factor(sex)1 factor(hyper)1:factor(sex)1
                 3.36922654                  1.70434689
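
The three outputs report identical deviance and AIC, which indicates that they are the same fitted model under different parametrizations. A sketch of how the coefficients relate, using the rounded estimates printed for Model 1; variable names are illustrative:

```r
# Estimates from Model 1 (main effects plus interaction)
b     <-  1.2147           # factor(sex)1
c_hyp <-  1.3812           # factor(hyper)1
d     <- -0.6815           # factor(sex)1:factor(hyper)1

# Model 2 nests hypertension within sex: one hypertension effect per gender
hyper_in_sex0 <- c_hyp             # factor(sex)0:factor(hyper)1
hyper_in_sex1 <- c_hyp + d         # factor(sex)1:factor(hyper)1

# Model 3 nests sex within hypertension: one sex effect per hypertension group
sex_in_hyper0 <- b                 # factor(hyper)0:factor(sex)1
sex_in_hyper1 <- b + d             # factor(hyper)1:factor(sex)1

print(round(c(hyper_in_sex0, hyper_in_sex1), 4))
print(round(c(sex_in_hyper0, sex_in_hyper1), 4))
```

The nested parametrizations are convenient because each stratum-specific log-OR comes with its own standard error, so confidence intervals can be read off directly.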
