STAT 213 Logistic Regression: Assessment and Testing



SLIDE 1

Outline Assessing Conditions Tests and Intervals

STAT 213 Logistic Regression: Assessment and Testing

Colin Reimer Dawson

Oberlin College

April 13, 2020

SLIDE 2

Outline

Assessing Conditions
  • Checking Linearity: Binned Data
  • Alternative Residuals
  • Checking Linearity: Unbinned Data

Tests and Intervals
  • Test of Coefficients
  • Intervals for Coefficients
  • Intervals for Specific Predictors


SLIDE 4

Conditions for Logistic Regression

  • 1. Logit-Linearity (the log odds depend linearly on X)
  • 2. Independence (no clustering or time/space dependence)
  • 3. Random (data come from a random sample or random assignment)
  • 4. Normality no longer applies! (The response is binary, so it can't be Normal)
  • 5. Constant Variance no longer required! (In fact, the variance is largest when $\hat{\pi}$ is near 0.5)
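
Condition 5 can be made concrete with the variance of a binary (Bernoulli) response. The numbers below are just a worked illustration of that standard formula, not values from the course data:

```latex
\operatorname{Var}(Y_i) = \pi_i (1 - \pi_i), \qquad
\pi_i = 0.5 \;\Rightarrow\; 0.25, \quad
\pi_i = 0.9 \;\Rightarrow\; 0.09, \quad
\pi_i = 0.99 \;\Rightarrow\; 0.0099
```

The variance is tied to the mean, so it cannot be constant unless every observation has the same $\pi_i$.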

SLIDE 5

Checking Linearity

  • Can't just transform the response via the logit to check linearity...
      • logit(0) = −∞
      • logit(1) = ∞
  • ...unless the data are binned; then we can take the logit of the proportion per bin
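
A minimal sketch of the problem in base R, where `qlogis()` computes the logit. The bin proportions here are made-up values chosen only to show that proportions strictly between 0 and 1 give finite logits:

```r
# Raw binary responses: every logit is infinite, so nothing can be plotted
y <- c(0, 1, 1, 0)
logit_raw <- qlogis(y)            # qlogis(p) = log(p / (1 - p))
print(logit_raw)                  # -Inf  Inf  Inf -Inf

# Binned data: proportions strictly between 0 and 1 (hypothetical values)
bin_props <- c(0.8, 0.5, 0.2)
logit_binned <- qlogis(bin_props)
print(round(logit_binned, 3))     # 1.386  0.000 -1.386
```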

SLIDE 6

Example: Golf Putts

Distance (ft)      3      4      5      6      7
# Made            84     88     61     61     44
# Missed          17     31     47     64     90
Odds            4.94   2.84   1.30   0.95   0.49
Log Odds        1.60   1.04   0.26  −0.05  −0.71

library("mosaic")
Putts <- data.frame(
    Distance = 3:7,
    Made     = c(84, 88, 61, 61, 44),
    Missed   = c(17, 31, 47, 64, 90)) %>%
  mutate(
    Total    = Made + Missed,
    PropMade = Made / Total)
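
The Odds and Log Odds rows can be reproduced from the counts with base R alone; this is a small check, not part of the original slides. Note that the unrounded log odds at 7 ft is about −0.716, which rounds to −0.72. The table's −0.71 is what one gets by taking the log of the already-rounded odds, 0.49.

```r
made   <- c(84, 88, 61, 61, 44)
missed <- c(17, 31, 47, 64, 90)

odds     <- made / missed     # odds of making the putt at each distance
log_odds <- log(odds)         # natural log, i.e. the empirical logit

print(round(odds, 2))         # 4.94 2.84 1.30 0.95 0.49
print(round(log_odds, 2))     # 1.60 1.04 0.26 -0.05 -0.72
print(round(log(0.49), 2))    # -0.71 (log of the rounded odds)
```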

SLIDE 7

Binned Data

xyplot(logit(PropMade) ~ Distance, data = Putts, type = c("p","r"))

[Figure: scatterplot of logit(PropMade) against Distance (3 to 7 ft) with a fitted regression line]

  • Logits are fairly linear

SLIDE 8

Equivalent Model Code for Binned Data

m2 <- glm(cbind(Made, Missed) ~ Distance, data = Putts, family = "binomial")
m2

Call:  glm(formula = cbind(Made, Missed) ~ Distance, family = "binomial",
    data = Putts)

Coefficients:
(Intercept)     Distance
     3.2568      -0.5661

Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
Null Deviance:      81.39
Residual Deviance: 1.069    AIC: 30.18
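
One way to read these coefficients, sketched in base R (the data frame is re-created here under the lowercase name `putts` so the block runs on its own). Exponentiating the slope gives the multiplicative change in odds per extra foot, and `plogis()` converts fitted log odds back to probabilities:

```r
putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)
m2 <- glm(cbind(Made, Missed) ~ Distance, data = putts, family = "binomial")
b  <- coef(m2)

# Odds of making the putt are multiplied by this factor per additional foot
or_per_foot <- exp(b[["Distance"]])
print(round(or_per_foot, 2))      # about 0.57

# Compare observed proportions with probabilities implied by the fitted line
observed_prob <- putts$Made / (putts$Made + putts$Missed)
fitted_prob   <- plogis(b[["(Intercept)"]] + b[["Distance"]] * putts$Distance)
print(round(cbind(Distance = putts$Distance, observed_prob, fitted_prob), 3))
```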

SLIDE 9

Deviance Residuals

  • Total log likelihood:

    $$\ell := \log P(\text{Data} \mid \text{Model})$$

  • Deviance measures "total discrepancy" between data and model:

    $$\text{Deviance} := -2\ell = -2 \log P(\text{Data} \mid \text{Model})$$

  • In linear regression, we had

    $$\text{SSE} = \sum_{i=1}^{N} \varepsilon_i^2 = -2 \log p(\text{Data} \mid \text{Model})$$

    (with Normal errors, up to a scale factor and an additive constant)

  • Deviance residuals $d_i$ are "reverse engineered" so that

    $$\text{Deviance} = \sum_{i=1}^{N} d_i^2$$
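
A sketch of the "reverse engineering" for a 0/1 response, using simulated data (the variables `x` and `y` and the coefficients used to generate them are made up for illustration). Each $d_i$ is the signed square root of −2 times that observation's log likelihood contribution, so the squares add up to the deviance:

```r
set.seed(213)                       # simulated data, for illustration only
x <- rnorm(50)
y <- rbinom(50, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = "binomial")

pi_hat <- fitted(fit)

# Log likelihood contribution of each binary observation
loglik_i <- y * log(pi_hat) + (1 - y) * log(1 - pi_hat)

# Deviance residual: sign of (y - pi_hat) times sqrt(-2 * contribution)
d <- sign(y - pi_hat) * sqrt(-2 * loglik_i)

# These three checks should all print TRUE
print(all.equal(unname(d), unname(residuals(fit, type = "deviance"))))
print(all.equal(sum(d^2), deviance(fit)))
print(all.equal(sum(d^2), -2 * as.numeric(logLik(fit))))
```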

SLIDE 10

Checking for Outliers

### Model of med school acceptance probability by MCAT score
library(Stat2Data); data(MedGPA)
mcatModel <- glm(Acceptance ~ MCAT, data = MedGPA, family = "binomial")
## Check for outliers by plotting residual distribution
## (Note: will almost always be bimodal; *not* expecting normality)
residuals(mcatModel, type = "deviance") %>% histogram()

[Figure: histogram of the deviance residuals (x-axis from −2 to 2, y-axis Density)]

SLIDE 11

Pearson Residuals

Another way to conceive of residuals is by "standardized distance" from the predicted value:

$$\text{Pearson residual}_i = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}$$

residuals(mcatModel, type = "pearson") %>% histogram()

[Figure: histogram of the Pearson residuals (x-axis from −2 to 2, y-axis Density)]
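
A worked example of the formula, with a made-up fitted probability of $\hat{\pi}_i = 0.8$ (not a value from the MedGPA model):

```latex
\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)} = \sqrt{0.8 \times 0.2} = 0.4
\\[4pt]
Y_i = 1: \quad \frac{1 - 0.8}{0.4} = 0.5
\qquad\qquad
Y_i = 0: \quad \frac{0 - 0.8}{0.4} = -2
```

Each case has only two possible residual values, one positive and one negative, which is why the histogram of residuals is typically bimodal rather than bell-shaped.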

SLIDE 12

Pearson Residuals vs. Fitted Values Plot

Can check logit-linearity for unbinned data by binning the residuals and plotting the average residual in each bin against the fitted values

library("arm") ## for binnedplot()
binnedplot(fitted(mcatModel),
           residuals(mcatModel, type = "pearson"),
           nclass = 10 # number of bins to use
           )

[Figure: "Binned residual plot"; x-axis Expected Values (0.2 to 0.8), y-axis Average residual (−1.5 to 1.0)]
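
To show the idea behind a binned residual plot, here is a rough base-R sketch on simulated data. It is an approximation of the technique, not a reimplementation of `arm::binnedplot()`, whose exact binning rules and error bounds may differ; the data and the names `fitted_vals`, `pearson_res` and `binned` are assumptions made for this example.

```r
set.seed(213)                        # simulated data, for illustration only
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = "binomial")

fitted_vals <- fitted(fit)
pearson_res <- residuals(fit, type = "pearson")

# Split observations into 10 bins of roughly equal size by fitted value
n_bins <- 10
breaks <- quantile(fitted_vals, probs = seq(0, 1, length.out = n_bins + 1))
bin    <- cut(fitted_vals, breaks = breaks, include.lowest = TRUE)

# Average fitted value and average residual within each bin
binned <- data.frame(
  mean_fitted   = as.numeric(tapply(fitted_vals, bin, mean)),
  mean_residual = as.numeric(tapply(pearson_res, bin, mean)),
  n             = as.numeric(table(bin))
)
print(round(binned, 3))
```

Plotting `mean_residual` against `mean_fitted` and looking for a systematic curve (rather than scatter around zero) is the check for logit-linearity.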

SLIDE 13

Linear vs. Logistic Regression

Goal               Linear                      Logistic
Estimate coefs     Minimize SSE                Maximize Likelihood
Check conditions   Linearity/Const. var.:      Logit linearity:
                     Residual vs. Fitted         Binned residuals vs. fitted
                   Normality: QQ Plots


SLIDE 15

Hypothesis Test for $\beta_1$

In linear regression, we computed the test statistic:

$$t_{obs} = \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)}$$

(the number of standard errors $\hat{\beta}_1$ is from 0).

P-value: probability of getting a test statistic at least this large in magnitude by chance if $H_0$ is true (i.e., $\beta_1 = 0$)

SLIDE 16

Hypothesis Test for $\beta_1$

In logistic regression we can do the same thing, but with a Normal instead of a t distribution:

$$z_{obs} = \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)}$$

P-value: probability of a test statistic at least this large in magnitude if $H_0$ is true

SLIDE 17

In R

summary(mcatModel) %>% coef() %>% round(3)

            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -8.712      3.236  -2.692    0.007
MCAT           0.246      0.089   2.752    0.006

Only a 0.6% chance we'd get $|\hat{\beta}_1| \geq 0.246$ if the association is due solely to chance sampling
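
The z value and P-value in this output can be checked by hand from the printed numbers. This sketch uses only the rounded figures shown above, so the recomputed z differs slightly from the printed 2.752:

```r
estimate  <- 0.246    # MCAT slope, as printed in the summary table
std_error <- 0.089    # its standard error, as printed
z_printed <- 2.752    # z value R computed from the unrounded numbers

# Number of standard errors the estimate is from 0
z_from_rounded <- estimate / std_error
print(round(z_from_rounded, 2))   # 2.76 (rounding in the inputs explains the gap)

# Two-sided P-value from the standard Normal distribution
p_value <- 2 * pnorm(-abs(z_printed))
print(round(p_value, 3))          # 0.006
```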

SLIDE 18

Linear vs. Logistic Regression

Goal               Linear                      Logistic
Estimate coefs     Minimize SSE                Maximize Likelihood
Check conditions   Linearity/Const. var.:      Logit linearity:
                     Residual vs. Fitted         Binned residuals vs. fitted
                   Normality: QQ Plots
Test coefs         Measure SEs from 0,         Measure SEs from 0,
                     P-value using t             P-value using Normal

SLIDE 19

Confidence Interval for $\beta_1$

The same principle applies for a confidence interval...

$$CI(\Delta\text{logit}): \quad \hat{\beta}_1 \pm z^* \cdot \widehat{se}(\hat{\beta}_1)$$

confint(mcatModel) %>% round(2)

             2.5 % 97.5 %
(Intercept) -15.77  -3.04
MCAT          0.09   0.44

But... $\beta_1$ is the rate of change of the log odds, which is hard to understand. It is more common to report a CI for the odds ratio ($e^{\beta_1}$):

$$CI(OR): \quad \left(e^{\beta_1^{(lwr)}},\ e^{\beta_1^{(upr)}}\right)$$
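
Applying the $\hat{\beta}_1 \pm z^* \cdot \widehat{se}$ formula to the printed estimate and standard error gives an interval close to, but not identical to, the `confint()` output. That is expected: for a `glm` fit, `confint()` reports profile-likelihood limits rather than this Wald formula. A sketch using the rounded values from the summary table:

```r
estimate  <- 0.246    # MCAT slope, as printed in the summary table
std_error <- 0.089    # its standard error, as printed

z_star  <- qnorm(0.975)                            # about 1.96 for 95% confidence
wald_ci <- estimate + c(-1, 1) * z_star * std_error

print(round(wald_ci, 2))        # 0.07 0.42  (log-odds scale)
print(round(exp(wald_ci), 2))   # 1.07 1.52  (odds-ratio scale)
```

On a fitted model, `confint.default(mcatModel)` computes the same Wald interval from the unrounded estimates.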

SLIDE 20

In R...

confint(mcatModel) %>% round(2)

             2.5 % 97.5 %
(Intercept) -15.77  -3.04
MCAT          0.09   0.44

confint(mcatModel) %>% exp() %>% round(2)

            2.5 % 97.5 %
(Intercept)  0.00   0.05
MCAT         1.09   1.55

"We are 95% confident that the odds (not the probability) of admittance increase by a factor of (are multiplied by) between 1.09 and 1.55 for each additional point of MCAT score"

SLIDE 21

Linear vs. Logistic Regression

Goal                  Linear                      Logistic
Estimate coefs        Minimize SSE                Maximize Likelihood
Check conditions      Linearity/Const. var.:      Logit linearity:
                        Residual vs. Fitted         Binned residuals vs. fitted
                      Normality: QQ Plots
Test coefs            Measure SEs from 0,         Measure SEs from 0,
                        P-value using t             P-value using Normal
Intervals for Params  Slope: $\beta_1$            Odds Ratio: $e^{\beta_1}$

SLIDE 22

CIs at specific values

Arguably easier to interpret: CIs for $\pi$ at a few specific X values

source("http://colindawson.net/stat213/code/helper_functions.R")
## functions made with regular makeFun() give point values but not
## intervals with logistic models, so I wrote a custom function
f.hat <- makeFun.logistic(mcatModel)
quartiles <- quantile(~MCAT, data = MedGPA)
f.hat(MCAT = quartiles, interval = "confidence", level = 0.95) %>% round(2)

     MCAT pi.hat  lwr  upr
0%     18   0.01 0.00 0.26
25%    34   0.41 0.26 0.58
50%    36   0.54 0.39 0.67
75%    39   0.71 0.52 0.84
100%   48   0.96 0.72 0.99

Interpretation: "We are 95% confident that the probability of acceptance for students with an MCAT score of 39 is between 52% and 84%"
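
For readers without the helper file, one common base-R route to such intervals is to build the interval on the log-odds scale with `predict(..., se.fit = TRUE)` and then back-transform with `plogis()`. This is a sketch on simulated data and is not necessarily how `makeFun.logistic()` works internally; `score`, `accepted` and the generating coefficients are hypothetical.

```r
set.seed(213)                                    # simulated data, for illustration only
score    <- round(rnorm(60, mean = 36, sd = 5))  # hypothetical test scores
accepted <- rbinom(60, size = 1, prob = plogis(-8 + 0.23 * score))
fit <- glm(accepted ~ score, family = "binomial")

# Predicted log odds and their standard errors at chosen X values
new_x <- data.frame(score = c(30, 36, 42))
pred  <- predict(fit, newdata = new_x, type = "link", se.fit = TRUE)

# 95% interval on the log-odds scale, mapped back to probabilities
z_star <- qnorm(0.975)
ci <- data.frame(
  score  = new_x$score,
  pi.hat = plogis(pred$fit),
  lwr    = plogis(pred$fit - z_star * pred$se.fit),
  upr    = plogis(pred$fit + z_star * pred$se.fit)
)
print(round(ci, 2))
```

Working on the log-odds scale first keeps the limits inside (0, 1), which a symmetric interval built directly on the probability scale would not guarantee.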

SLIDE 23

Confidence Bands

## Also requires sourcing helper_functions.R
## Can supply level=, xlim=, xlab= and ylab= to customize graph
plot.logistic.bands(mcatModel)

[Figure: confidence bands for P(Acceptance = 1) (y-axis, 0.0 to 0.8) against MCAT (x-axis, 20 to 45)]

SLIDE 24

Linear vs. Logistic Regression

Goal                  Linear                      Logistic
Estimate coefs        Minimize SSE                Maximize Likelihood
Check conditions      Linearity/Const. var.:      Logit linearity:
                        Residual vs. Fitted         Binned residuals vs. fitted
                      Normality: QQ Plots
Test coefs            Measure SEs from 0,         Measure SEs from 0,
                        P-value using t             P-value using Normal
Intervals for Params  Slope: $\beta_1$            Odds Ratio: $e^{\beta_1}$
Intervals for         Confidence and              Confidence intervals only
  Fitted Vals.          prediction intervals