STAT 215 Multiple Logistic Regression, Colin Reimer Dawson, Oberlin (PowerPoint PPT Presentation)
SLIDE 1

STAT 215 Multiple Logistic Regression

Colin Reimer Dawson

Oberlin College

November 16, 2017

SLIDE 2

Outline

  • Multiple Predictors
  • Nested Model Tests
  • Model Selection

SLIDE 3

Logistic Regression With Multiple Predictors

We are combining logistic regression (Ch. 9) with multiple regression (Chs. 3-4). Nothing fundamentally new: all of the “usual” options for predictors are available:

  • Quantitative variables
  • Powers of variables (e.g., second-order models)
  • Other transformations of variables (e.g., log)
  • Interactions (products) of variables
  • Indicator variables for binary predictors
  • Collections of k − 1 indicators for categorical predictors with k levels
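As a sketch of the last option: R builds the k − 1 indicator columns automatically when a factor appears in a model formula. A minimal illustration with a toy factor (not from the slides):

```r
# Sketch: R creates k - 1 indicator columns for a factor with k levels
# (toy data, for illustration only)
g <- factor(c("a", "b", "c", "a", "b"))  # k = 3 levels
X <- model.matrix(~ g)                   # intercept plus k - 1 = 2 indicators
ncol(X)                                  # 3 columns: (Intercept), gb, gc
colnames(X)
```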

SLIDE 4

Two Equivalent Forms of (Multiple) Logistic Regression

Probability Form:

    π = e^(β0 + β1X1 + ··· + βkXk) / (1 + e^(β0 + β1X1 + ··· + βkXk))

Logit Form:

    log(π / (1 − π)) = β0 + β1X1 + ··· + βkXk
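A quick numerical check that the two forms agree, using hypothetical coefficients chosen only for illustration (not fitted to any data):

```r
# Sketch: the probability form is the inverse of the logit form
# (hypothetical coefficients b0, b1 and value x; for illustration only)
b0 <- -2; b1 <- 0.05; x <- 40
logit <- b0 + b1 * x                     # logit form: log(pi / (1 - pi))
pi.hat <- exp(logit) / (1 + exp(logit))  # probability form
pi.hat                                   # 0.5 here, since the logit is 0
```

Base R's `plogis()` computes the same inverse-logit transformation.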

SLIDE 5

Example: Survival in ICU

  • Response: Survive (1 = Lived, 0 = Died)

  • Predictors:
  • Age
  • SysBP (Systolic Blood Pressure)
  • Pulse


SLIDE 6

Simple Logistic Models

library("Stat2Data"); data("ICU")
m1 <- glm(Survive ~ Age, family = "binomial", data = ICU)
plotModel(m1)

[Plot: fitted logistic curve of Survive vs. Age]
SLIDE 7

Simple Logistic Models

m2 <- glm(Survive ~ SysBP, family = "binomial", data = ICU)
plotModel(m2)

[Plot: fitted logistic curve of Survive vs. SysBP]
SLIDE 8

Simple Logistic Models

m3 <- glm(Survive ~ Pulse, family = "binomial", data = ICU)
plotModel(m3)

[Plot: fitted logistic curve of Survive vs. Pulse]
SLIDE 9

Simple Logistic Models

m3 <- glm(Survive ~ Pulse + I(Pulse^2), family = "binomial", data = ICU)
plotModel(m3)

[Plot: fitted curve of Survive vs. Pulse with a quadratic term in the logit]
SLIDE 10

Multiple Predictor Model

full.model <- glm(Survive ~ Age + SysBP, family = "binomial", data = ICU)
summary(full.model)$coefficients %>% round(digits = 3)

            Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.962      1.000   0.962    0.336
Age           -0.028      0.011  -2.637    0.008
SysBP          0.017      0.006   2.873    0.004

How to interpret tests of individual coefficients? Just as in linear regression: is the predictor adding something over the others?

SLIDE 11

Checking For Multicollinearity

The same issues with multicollinearity can arise!

dplyr::select(ICU, Age, SysBP, Pulse) %>% cor() %>% round(digits = 2)

       Age SysBP Pulse
Age   1.00  0.04  0.04
SysBP 0.04  1.00 -0.06
Pulse 0.04 -0.06  1.00

vif(full.model)

     Age    SysBP
1.001818 1.001818

But no worries in this case.

SLIDE 12

Overall and Nested LR Tests

pulse.quad.model <- glm(Survive ~ Age + SysBP + Pulse + I(Pulse^2),
                        family = "binomial", data = ICU)
no.pulse.model <- glm(Survive ~ Age + SysBP, family = "binomial", data = ICU)
anova(no.pulse.model, pulse.quad.model, test = "LRT")

Analysis of Deviance Table

Model 1: Survive ~ Age + SysBP
Model 2: Survive ~ Age + SysBP + Pulse + I(Pulse^2)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       197     183.25
2       195     182.57  2  0.68431   0.7102

Test statistic: G = −2(log P(Data | Reduced) − log P(Data | Full))
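The statistic G can also be computed directly as the drop in residual deviance between the nested models. A sketch on simulated data (simulated so it runs without the Stat2Data package; not the ICU data from the slides):

```r
# Sketch: G is the drop in residual deviance between nested logistic models
# (simulated data, for illustration only)
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x1))        # x2 is an irrelevant predictor
reduced <- glm(y ~ x1, family = binomial)
full    <- glm(y ~ x1 + x2, family = binomial)
G <- deviance(reduced) - deviance(full)    # = -2 * (logLik reduced - logLik full)
pchisq(G, df = 1, lower.tail = FALSE)      # same p-value anova(..., test = "LRT") reports
```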

SLIDE 13

Overall and Nested LR Tests

xpchisq(0.68431, df = 2, lower.tail = FALSE)

[1] 0.7102381

[Plot: chi-square density with df = 2, upper tail beyond 0.684 shaded; areas ≈ .29 and .71]

SLIDE 14

One vs. Two Curves

Is Sex an important predictor, controlling for BP?

full.model <- glm(Survive ~ SysBP + factor(Sex) + SysBP:factor(Sex),
                  family = 'binomial', data = ICU)
summary(full.model)$coefficients

                      Estimate  Std. Error   z value    Pr(>|z|)
(Intercept)        -1.43930431 1.021041657 -1.409643 0.158645099
SysBP               0.02299392 0.008325432  2.761889 0.005746799
factor(Sex)1        1.45516591 1.525558283  0.953858 0.340155546
SysBP:factor(Sex)1 -0.01301957 0.011964883 -1.088148 0.276529569

reduced.model <- glm(Survive ~ SysBP, family = 'binomial', data = ICU)
anova(reduced.model, full.model, test = "LRT")

Analysis of Deviance Table

Model 1: Survive ~ SysBP
Model 2: Survive ~ SysBP + factor(Sex) + SysBP:factor(Sex)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     191.34
2       196     189.99  2   1.3421   0.5112

SLIDE 15

One vs. Two Curves

plotModel(full.model)

[Plot: fitted curves of Survive vs. SysBP, one per Sex]

  • Curves are not significantly different

SLIDE 16

One vs. Two Curves

f.hat.full <- makeFun(full.model)
f.hat.reduced <- makeFun(reduced.model)
xyplot(Survive ~ SysBP, data = ICU, groups = factor(Sex))
plotFun(f.hat.full(SysBP, Sex) ~ SysBP, Sex = 0, xlim = c(0, 300), col = 1, add = TRUE)
plotFun(f.hat.full(SysBP, Sex) ~ SysBP, Sex = 1, add = TRUE, col = 2)
plotFun(f.hat.reduced(SysBP) ~ SysBP, add = TRUE, lty = 2)

[Plot: data and fitted curves by Sex, with the pooled one-curve fit dashed]

  • Curves are not significantly different

SLIDE 17

Parallel vs. Non-Parallel logit lines

full.model <- glm(Survive ~ SysBP + factor(Infection) + SysBP:factor(Infection),
                  family = 'binomial', data = ICU)
summary(full.model)$coefficients %>% round(digits = 3)

                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                 1.123      1.195   0.940    0.347
SysBP                       0.005      0.009   0.601    0.548
factor(Infection)1         -2.934      1.589  -1.846    0.065
SysBP:factor(Infection)1    0.018      0.012   1.436    0.151

reduced.model <- glm(Survive ~ SysBP + factor(Infection), family = 'binomial', data = ICU)

SLIDE 18

One vs. Two Curves

plotModel(full.model)

[Plot: fitted curves of Survive vs. SysBP, one per Infection status]

  • Curves do not have significantly different “slopes”

SLIDE 19

Parallel vs. Non-parallel logit lines

[Plot: fitted curves of Survive vs. SysBP by Infection from the no-interaction (parallel-logit) model]

  • Lines are not significantly non-parallel

SLIDE 20

Model Selection Criteria

The usual metrics no longer apply:

  • adj. R²
  • Mallow’s Cp

Instead:

  • Akaike Information Criterion (AIC): Deviance + 2p (lower is better)
  • (Hard or Soft) Prediction Error (only evaluate out-of-sample)
  • Hard: how many cases did the model yield π̂ on the wrong side of 1/2?
  • Soft: sum of the absolute differences between the π̂i and the Yi
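For 0/1 responses the saturated log-likelihood is zero, so the residual deviance equals −2 logLik and R's AIC() matches Deviance + 2p exactly. A sketch on simulated data (not the ICU data from the slides):

```r
# Sketch: AIC = Deviance + 2p for a Bernoulli logistic model
# (simulated data; p counts all coefficients, including the intercept)
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
m <- glm(y ~ x, family = binomial)
p <- length(coef(m))
AIC(m) - (deviance(m) + 2 * p)          # 0, up to floating-point rounding
# Hard prediction error: share of cases with pi-hat on the wrong side of 1/2
# (computed in-sample here; the slides say to evaluate out-of-sample)
mean((fitted(m) > 0.5) != (y == 1))
```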

SLIDE 21

Stepwise Regression and Cross-Validation Demo
