STAT 213 Logistic Regression: Assessment and Testing, by Colin Reimer Dawson

  1. STAT 213 Logistic Regression: Assessment and Testing. Colin Reimer Dawson, Oberlin College, April 13, 2020.

  2. Outline
     Assessing Conditions
       • Checking Linearity: Binned Data
       • Alternative Residuals
       • Checking Linearity: Unbinned Data
     Tests and Intervals
       • Test of Coefficients
       • Intervals for Coefficients
       • Intervals for Specific Predictors

  4. Conditions for Logistic Regression
     1. Logit-linearity (log odds depend linearly on X)
     2. Independence (no clustering or time/space dependence)
     3. Random (data come from a random sample or random assignment)
     4. Normality no longer applies! (The response is binary, so it can't be Normal.)
     5. Constant variance no longer required! (In fact, there is more variance when $\hat{\pi}$ is near 0.5.)

  5. Checking Linearity
     • Can't just transform the response via logit to check linearity...
         • logit(0) = −∞
         • logit(1) = ∞
     • ...unless the data are binned; then we can take the logit of the proportion in each bin

  6. Example: Golf Putts

     Distance (ft)      3      4      5      6      7
     # Made            84     88     61     61     44
     # Missed          17     31     47     64     90
     Odds            4.94   2.84   1.30   0.95   0.49
     Log Odds        1.60   1.04   0.26  -0.05  -0.71

         library("mosaic")
         Putts <- data.frame(
             Distance = 3:7,
             Made = c(84, 88, 61, 61, 44),
             Missed = c(17, 31, 47, 64, 90)) %>%
           mutate(
             Total = Made + Missed,
             PropMade = Made / Total)
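The Odds and Log Odds rows of the putting table can be recomputed directly from the counts. The sketch below uses base R only (no mosaic), so the column names `Odds` and `LogOdds` are illustrative choices rather than anything from the slides.

```r
# Putting data from the slide, rebuilt with base R only
Putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)

# Odds of making the putt in each distance bin, and the empirical logit
Putts$Odds    <- Putts$Made / Putts$Missed
Putts$LogOdds <- log(Putts$Odds)

# Equivalent route: logit of the proportion made (qlogis is base R's logit)
Putts$LogitProp <- qlogis(Putts$Made / (Putts$Made + Putts$Missed))

print(round(Putts, 2))
```

Computed from the unrounded odds, the 7 ft log odds round to -0.72; the slide's -0.71 appears to come from taking the log of the already-rounded odds 0.49.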

  7. Binned Data

         xyplot(logit(PropMade) ~ Distance, data = Putts, type = c("p", "r"))

     [Figure: scatterplot of logit(PropMade) against Distance (3 to 7 ft) with a fitted regression line]
     Logits are fairly linear.

  8. Equivalent Model Code for Binned Data

         m2 <- glm(cbind(Made, Missed) ~ Distance, data = Putts, family = "binomial")
         m2

         Call:  glm(formula = cbind(Made, Missed) ~ Distance, family = "binomial",
             data = Putts)

         Coefficients:
         (Intercept)     Distance
              3.2568      -0.5661

         Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
         Null Deviance:      81.39
         Residual Deviance:  1.069     AIC: 30.18
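One way to see why the `cbind(Made, Missed)` form is "equivalent" is to expand the binned counts into one row per putt and fit an ordinary 0/1 logistic regression. This is a sketch in base R; the names `long`, `Success`, `binned`, and `unbinned` are made up for the illustration.

```r
Putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)

# Binned fit, as on the slide: successes and failures per distance
binned <- glm(cbind(Made, Missed) ~ Distance, data = Putts, family = binomial)

# Expand to one row per putt: first all made putts, then all missed putts
long <- data.frame(
  Distance = rep(rep(Putts$Distance, 2), times = c(Putts$Made, Putts$Missed)),
  Success  = rep(c(1, 0), times = c(sum(Putts$Made), sum(Putts$Missed)))
)
unbinned <- glm(Success ~ Distance, data = long, family = binomial)

# The coefficient estimates agree (up to numerical tolerance)
print(rbind(binned = coef(binned), unbinned = coef(unbinned)))
```

The estimated coefficients match, but the reported deviances do not, because the two fits are compared against different saturated models.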

  9. Deviance Residuals
     • Total log likelihood: $\ell := \log P(\text{Data} \mid \text{Model})$
     • Deviance measures "total discrepancy" between data and model:
       $$\text{Deviance} := -2\ell = -2 \log P(\text{Data} \mid \text{Model})$$
     • In linear regression, we had (up to an additive constant and a scaling by the error variance)
       $$\text{SSE} = \sum_{i=1}^{N} \varepsilon_i^2 = -2 \log p(\text{Data} \mid \text{Model})$$
     • Deviance residuals $d_i$ are "reverse engineered" so that
       $$\text{Deviance} = \sum_{i=1}^{N} d_i^2$$
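For a binary (0/1) response, each observation contributes $-2\log$ of the probability the model assigned to what actually happened, and the deviance residual is the signed square root of that contribution. The sketch below checks this against R's own output on simulated data; the data and all variable names are invented for the illustration.

```r
set.seed(213)                          # made-up data, for illustration only
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.8 * x))
fit <- glm(y ~ x, family = binomial)
pi.hat <- unname(fitted(fit))

# Per-observation contribution to -2 * log likelihood
contrib <- -2 * (y * log(pi.hat) + (1 - y) * log(1 - pi.hat))

# Deviance residual: signed square root of that contribution
d <- sign(y - pi.hat) * sqrt(contrib)

# Matches residuals(type = "deviance"), and the squares sum to the deviance
print(all.equal(d, unname(residuals(fit, type = "deviance"))))
print(all.equal(sum(d^2), deviance(fit)))
print(all.equal(deviance(fit), as.numeric(-2 * logLik(fit))))
```

The last check holds here because the response is ungrouped 0/1 data, for which the saturated model has log likelihood zero.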

  10. Checking for Outliers

          ### Model of med school acceptance probability by MCAT score
          library(Stat2Data); data(MedGPA)
          mcatModel <- glm(Acceptance ~ MCAT, data = MedGPA, family = "binomial")
          ## Check for outliers by plotting residual distribution
          ## (Note: will almost always be bimodal; *not* expecting normality)
          residuals(mcatModel, type = "deviance") %>% histogram()

      [Figure: density histogram of the deviance residuals, ranging from about −2 to 2]

  11. Pearson Residuals
      Another way to conceive of residuals is by "standardized distance" from the predicted value:
      $$\text{Pearson's residual}_i = \frac{Y_i - \hat{\pi}_i}{\sqrt{\hat{\pi}_i (1 - \hat{\pi}_i)}}$$

          residuals(mcatModel, type = "pearson") %>% histogram()

      [Figure: density histogram of the Pearson residuals, ranging from about −2 to 2]
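The Pearson residual formula can be checked by hand against `residuals(..., type = "pearson")`. This is a minimal sketch on simulated data (not the MedGPA data, which needs the Stat2Data package); the variable names are illustrative.

```r
set.seed(42)                           # made-up data, for illustration only
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)
pi.hat <- unname(fitted(fit))

# Raw residual divided by the estimated standard deviation of a
# Bernoulli outcome with success probability pi.hat
pearson.manual <- (y - pi.hat) / sqrt(pi.hat * (1 - pi.hat))

print(all.equal(pearson.manual, unname(residuals(fit, type = "pearson"))))
```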

  12. Pearson Residuals vs. Fitted Values Plot
      Can check logit-linearity for unbinned data by binning the residuals and constructing a plot of fitted values vs. (average) residuals.

          library("arm") ## for binnedplot()
          binnedplot(fitted(mcatModel),
                     residuals(mcatModel, type = "pearson"),
                     nclass = 10 # number of bins to use
                     )

      [Figure: binned residual plot, average residual against expected values (about 0.2 to 0.8)]
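The idea behind a binned residual plot can be sketched in base R: sort observations into bins by fitted value, then average the fitted values and residuals within each bin. This is only an approximation of the idea, on simulated data; it is not claimed to reproduce the exact binning rule or the error bounds that `arm::binnedplot()` draws.

```r
set.seed(213)                          # made-up data, for illustration only
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

pi.hat <- fitted(fit)
res    <- residuals(fit, type = "pearson")

# Ten bins with roughly equal numbers of observations (deciles of pi.hat)
breaks <- quantile(pi.hat, probs = seq(0, 1, length.out = 11))
bin    <- cut(pi.hat, breaks = breaks, include.lowest = TRUE)

binned <- data.frame(
  n           = as.vector(table(bin)),
  mean.fitted = as.vector(tapply(pi.hat, bin, mean)),
  mean.resid  = as.vector(tapply(res, bin, mean))
)
print(round(binned, 3))

# Average residuals should scatter around zero with no clear trend
plot(binned$mean.fitted, binned$mean.resid,
     xlab = "Average fitted value", ylab = "Average Pearson residual")
abline(h = 0, lty = 2)
```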

  13. Linear vs. Logistic Regression

      Goal              | Linear                                       | Logistic
      Estimate coefs    | Minimize SSE                                 | Maximize likelihood
      Check conditions  | Linearity/const. var.: residuals vs. fitted  | Logit linearity: binned residuals vs. fitted
                        | Normality: QQ plots                          |

  15. Hypothesis Test for β1
      In linear regression, we computed the test statistic
      $$t_{\text{obs}} = \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)}$$
      (the number of standard errors $\hat{\beta}_1$ is from 0).
      P-value: the probability of getting a test statistic at least this large in magnitude by chance if $H_0$ is true (i.e., $\beta_1 = 0$).

  16. Hypothesis Test for β1
      In logistic regression we can do the same thing, but with the Normal distribution instead of the t distribution:
      $$z_{\text{obs}} = \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)}$$
      and get the P-value: the probability of a test statistic at least this large in magnitude if $H_0$ is true.

  17. In R

          summary(mcatModel) %>% coef() %>% round(3)

                      Estimate Std. Error z value Pr(>|z|)
          (Intercept)   -8.712      3.236  -2.692    0.007
          MCAT           0.246      0.089   2.752    0.006

      Only a 0.6% chance we'd get $|\hat{\beta}_1| \geq 0.246$ if the association is due solely to chance sampling.
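The z value and two-sided P-value in the MCAT row can be approximately reproduced from the rounded estimate and standard error. A minimal sketch follows; because the inputs are the rounded values printed on the slide, the z statistic comes out slightly different from the 2.752 that R computed from unrounded values.

```r
est <- 0.246    # rounded slope estimate for MCAT, from the slide
se  <- 0.089    # rounded standard error, from the slide

z.obs <- (est - 0) / se               # number of SEs the estimate is from 0
p.val <- 2 * pnorm(-abs(z.obs))       # two-sided tail area under the Normal

print(round(z.obs, 2))
print(round(p.val, 3))
```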

  18. Linear vs. Logistic Regression

      Goal              | Linear                                       | Logistic
      Estimate coefs    | Minimize SSE                                 | Maximize likelihood
      Check conditions  | Linearity/const. var.: residuals vs. fitted  | Logit linearity: binned residuals vs. fitted
                        | Normality: QQ plots                          |
      Test coefs        | Measure SEs from 0, P-value using t          | Measure SEs from 0, P-value using Normal

  19. Confidence Interval for β1
      The same principle applies for a confidence interval:
      $$\text{CI}(\Delta\,\text{logit}): \quad \hat{\beta}_1 \pm z^* \cdot \widehat{se}(\hat{\beta}_1)$$

          confint(mcatModel) %>% round(2)

                       2.5 % 97.5 %
          (Intercept) -15.77  -3.04
          MCAT          0.09   0.44

      But β1 is the rate of change of the log odds, which is hard to understand. It is more common to report a CI for the odds ratio ($e^{\beta_1}$):
      $$\text{CI}(OR): \quad \left( e^{\beta_1^{(\text{lwr})}},\; e^{\beta_1^{(\text{upr})}} \right)$$
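The ± formula can be worked through by hand from the rounded MCAT estimate (0.246) and standard error (0.089) reported by `summary(mcatModel)`. The sketch below gives roughly 0.07 to 0.42, close to but not identical to the `confint()` output; that is expected, since `confint()` on a `glm` reports profile-likelihood intervals rather than this Wald-style interval (base R's `confint.default()` would give the Wald version).

```r
est <- 0.246                  # rounded slope estimate for MCAT
se  <- 0.089                  # rounded standard error

z.star <- qnorm(0.975)        # critical value for 95% confidence
wald   <- est + c(-1, 1) * z.star * se

print(round(wald, 2))         # interval for the change in log odds
print(round(exp(wald), 2))    # same interval on the odds-ratio scale
```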

  20. In R...

          confint(mcatModel) %>% round(2)

                       2.5 % 97.5 %
          (Intercept) -15.77  -3.04
          MCAT          0.09   0.44

          confint(mcatModel) %>% exp() %>% round(2)

                       2.5 % 97.5 %
          (Intercept)   0.00   0.05
          MCAT          1.09   1.55

      "We are 95% confident that the odds (not probability) of admittance increase by a factor of (are multiplied by) between 1.09 and 1.55 for each additional point of MCAT score."

  21. Linear vs. Logistic Regression

      Goal                  | Linear                                       | Logistic
      Estimate coefs        | Minimize SSE                                 | Maximize likelihood
      Check conditions      | Linearity/const. var.: residuals vs. fitted  | Logit linearity: binned residuals vs. fitted
                            | Normality: QQ plots                          |
      Test coefs            | Measure SEs from 0, P-value using t          | Measure SEs from 0, P-value using Normal
      Intervals for params  | Slope: $\beta_1$                             | Odds ratio: $e^{\beta_1}$

  22. CIs at specific values
      Arguably easier to interpret: CIs for π at a few specific X values.

          source("http://colindawson.net/stat213/code/helper_functions.R")
          ## functions made with regular makeFun() give point values but not
          ## intervals with logistic models, so I wrote a custom function
          f.hat <- makeFun.logistic(mcatModel)
          quartiles <- quantile(~MCAT, data = MedGPA)
          f.hat(MCAT = quartiles, interval = "confidence", level = 0.95) %>% round(2)

               MCAT pi.hat  lwr  upr
          0%     18   0.01 0.00 0.26
          25%    34   0.41 0.26 0.58
          50%    36   0.54 0.39 0.67
          75%    39   0.71 0.52 0.84
          100%   48   0.96 0.72 0.99

      Interpretation: "We are 95% confident that the probability of acceptance for students with an MCAT score of 39 is between 52% and 84%."
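A common base-R route to intervals for π at chosen X values is to build a Normal-theory interval on the logit scale with `predict(..., type = "link", se.fit = TRUE)` and then back-transform the endpoints with `plogis()`. The sketch below applies this to the golf putts model, since that data is fully given in the slides; it is an assumption that this resembles what the custom `makeFun.logistic()` helper does, as its source is not shown here.

```r
Putts <- data.frame(
  Distance = 3:7,
  Made     = c(84, 88, 61, 61, 44),
  Missed   = c(17, 31, 47, 64, 90)
)
m <- glm(cbind(Made, Missed) ~ Distance, data = Putts, family = binomial)

# Predicted log odds and their standard errors at a few distances
new  <- data.frame(Distance = c(3, 5, 7))
pred <- predict(m, newdata = new, type = "link", se.fit = TRUE)

# 95% interval on the logit scale, mapped back to probabilities
z.star <- qnorm(0.975)
ci <- data.frame(
  Distance = new$Distance,
  pi.hat   = plogis(pred$fit),
  lwr      = plogis(pred$fit - z.star * pred$se.fit),
  upr      = plogis(pred$fit + z.star * pred$se.fit)
)
print(round(ci, 2))
```

Building the interval on the logit scale and transforming afterwards keeps both endpoints between 0 and 1, which a ± interval computed directly on the probability scale would not guarantee.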

  23. Confidence Bands

          ## Also requires sourcing helper_functions.R
          ## Can supply level=, xlim=, xlab= and ylab= to customize graph
          plot.logistic.bands(mcatModel)

      [Figure: fitted logistic curve with confidence bands, P(Acceptance = 1) against MCAT (about 20 to 45)]

  24. Linear vs. Logistic Regression

      Goal                        | Linear                                       | Logistic
      Estimate coefs              | Minimize SSE                                 | Maximize likelihood
      Check conditions            | Linearity/const. var.: residuals vs. fitted  | Logit linearity: binned residuals vs. fitted
                                  | Normality: QQ plots                          |
      Test coefs                  | Measure SEs from 0, P-value using t          | Measure SEs from 0, P-value using Normal
      Intervals for params        | Slope: $\beta_1$                             | Odds ratio: $e^{\beta_1}$
      Intervals for fitted vals.  | Confidence and prediction intervals          | Confidence intervals only
