Linear and logisitic regression models
Søren Højsgaard Department of Mathematical Sciences Aalborg University, Denmark August 22, 2012
Printed: August 22, 2012 File: models-slides.tex
Linear and logisitic regression models Sren Hjsgaard Department of - - PowerPoint PPT Presentation
Linear and logisitic regression models Sren Hjsgaard Department of Mathematical Sciences Aalborg University, Denmark August 22, 2012 Printed: August 22, 2012 File: models-slides.tex 2: August 22, 2012 Contents 1 Linear normal models 3
Printed: August 22, 2012 File: models-slides.tex
2: August 22, 2012
1 Linear normal models 3 2 Linear regression 4 2.1 Fitting linear regression model with lm() . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Printing the object: print() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Model objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.4 Extractor methods and methods for further computing . . . . . . . . . . . . . . . . . . . . . 13 2.5 Plotting the object: plot() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.6 Summary information: summary() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.7 Confidence interval for model parameters: confint() . . . . . . . . . . . . . . . . . . . . . . 19 2.8 Predicting new cases: predict() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3 Regression with a factor 23 3.1 Transforming data using transform() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4 Model comparison 30 4.1 Comparing two models with anova() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2 Three commonly used tables for model comparisons . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 Sequential ANOVA table: anova() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.4 On interpreting the anova() output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.5 Dropping each term in turn using drop1() . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.6 On interpreting the drop1() output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.7 Investigating parameter estimates using coef() . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.8 Which table to use?* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5 Residuals and model checking 42 5.1 Interpreting diagnostic plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6 Logistic regression 46
3: August 22, 2012
R> lm(y ~ x1 + x2 + x3, dataset)
4: August 22, 2012
5: August 22, 2012 R> data(hellung,package="ISwR") R> head(hellung,6) glucose conc diameter 1 1 631000 21.2 2 1 592000 21.5 3 1 563000 21.3 4 1 475000 21.0 5 1 461000 21.5 6 1 416000 21.3 R> sapply(hellung, class) glucose conc diameter "integer" "integer" "numeric"
6: August 22, 2012 R> par(mfrow=c(1,2)) R> plot(diameter ~ conc, data=hellung, + col=glucose, pch=as.character(glucose)) R> plot(log(diameter) ~ log(conc), data=hellung, + col=glucose, pch=as.character(glucose))
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0e+00 3e+05 6e+05 19 21 23 25 conc diameter 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 10 11 12 13 2.95 3.05 3.15 3.25 log(conc) log(diameter)
7: August 22, 2012
8: August 22, 2012
R> hm <- lm(log(diameter) ~ log(conc), data=hellung)
9: August 22, 2012
R> print(hm) Call: lm(formula = log(diameter) ~ log(conc), data = hellung) Coefficients: (Intercept) log(conc) 3.74703
R> hm Call: lm(formula = log(diameter) ~ log(conc), data = hellung) Coefficients: (Intercept) log(conc) 3.74703
10: August 22, 2012
R> class(hm) [1] "lm" R> names(hm) [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model"
11: August 22, 2012
R> hm$coefficients (Intercept) log(conc) 3.74702641 -0.05451464 R> hm$residuals 1 2 3 4 5 0.0350211011 0.0455948627 0.0335108932 0.0100606873 0.0319602834 6 7 8 9 10 0.0170150708 -0.0352928345 0.0665403222 0.0089021709 0.0182022593 11 12 13 14 15 0.0349529425 0.0214114524 0.0462117137 0.0518996737 0.0275888916 16 17 18 19 20 0.0191915976 0.0218841508 0.0061947069 -0.0004889143 0.0306645283 21 22 23 24 25 0.0059028969 0.0565348153 0.0212271549 0.0298302213 0.0260690999 26 27 28 29 30
0.0471016029 0.0193724964 0.0218186112 0.0256750284 31 32 33 34 35 0.0096540212 0.0298367056 -0.0641562641 -0.0611421290 -0.0683058568 36 37 38 39 40
41 42 43 44 45
0.0014197167 46 47 48 49 50
51
12: August 22, 2012
13: August 22, 2012
14: August 22, 2012 R> coef(hm) (Intercept) log(conc) 3.74702641 -0.05451464 R> residuals(hm) 1 2 3 4 5 0.0350211011 0.0455948627 0.0335108932 0.0100606873 0.0319602834 6 7 8 9 10 0.0170150708 -0.0352928345 0.0665403222 0.0089021709 0.0182022593 11 12 13 14 15 0.0349529425 0.0214114524 0.0462117137 0.0518996737 0.0275888916 16 17 18 19 20 0.0191915976 0.0218841508 0.0061947069 -0.0004889143 0.0306645283 21 22 23 24 25 0.0059028969 0.0565348153 0.0212271549 0.0298302213 0.0260690999 26 27 28 29 30
0.0471016029 0.0193724964 0.0218186112 0.0256750284 31 32 33 34 35 0.0096540212 0.0298367056 -0.0641562641 -0.0611421290 -0.0683058568 36 37 38 39 40
41 42 43 44 45
0.0014197167 46 47 48 49 50
51
15: August 22, 2012
16: August 22, 2012
R> par(mfrow=c(2,2),mar=c(2,4.5,2,2)) R> plot(hm)
3.05 3.10 3.15 3.20 −0.10 0.00 Residuals
40 35 8
−1 1 2 −2 1 2 Standardized residuals
Normal Q−Q
40 35 8
3.05 3.10 3.15 3.20 0.0 0.5 1.0 1.5 Standardized residuals
40 35 8
0.00 0.02 0.04 0.06 −3 −1 1 2 Standardized residuals
Residuals vs Leverage
33 34 35
17: August 22, 2012
R> sumhm <- summary(hm) Call: lm(formula = log(diameter) ~ log(conc), data = hellung) Residuals: Min 1Q Median 3Q Max
0.009654 0.028710 0.066540 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.747026 0.047415 79.03 <2e-16 *** log(conc)
0.004178
<2e-16 ***
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.03822 on 49 degrees of freedom Multiple R-squared: 0.7765, Adjusted R-squared: 0.7719 F-statistic: 170.2 on 1 and 49 DF, p-value: < 2.2e-16
18: August 22, 2012
19: August 22, 2012
R> confint(hm) 2.5 % 97.5 % (Intercept) 3.65174215 3.84231066 log(conc)
20: August 22, 2012
R> class(sumhm) [1] "summary.lm" R> names(sumhm) [1] "call" "terms" "residuals" "coefficients" [5] "aliased" "sigma" "df" "r.squared" [9] "adj.r.squared" "fstatistic" "cov.unscaled"
R> coef(sumhm) Estimate
t value Pr(>|t|) (Intercept) 3.74702641 0.047415122 79.02598 2.467047e-53 log(conc)
21: August 22, 2012
22: August 22, 2012 R> pred.data<-data.frame(conc=seq(1000,2000,3000)) R> pred1<-predict(hm, newdata = pred.data, interval= 'confidence') R> head(pred1) fit lwr upr 1 3.370453 3.332237 3.408668 R> pred2<-predict(hm, newdata = pred.data, interval= 'prediction') R> head(pred2) fit lwr upr 1 3.370453 3.284669 3.456237
23: August 22, 2012
R> head(hellung,6) glucose conc diameter 1 1 631000 21.2 2 1 592000 21.5 3 1 563000 21.3 4 1 475000 21.0 5 1 461000 21.5 6 1 416000 21.3
24: August 22, 2012 R> gluc <- factor(hellung$glucose, labels=c('Yes','No')) R> hellung$gluc <- gluc R> gluc[c(1:4,41:44)] [1] Yes Yes Yes Yes No No No No Levels: Yes No R> head(hellung) glucose conc diameter gluc 1 1 631000 21.2 Yes 2 1 592000 21.5 Yes 3 1 563000 21.3 Yes 4 1 475000 21.0 Yes 5 1 461000 21.5 Yes 6 1 416000 21.3 Yes
25: August 22, 2012
R> hel <- lm(log(diameter) ~ gluc + log(conc) + gluc:log(conc), + data=hellung)
26: August 22, 2012 R> summary(hel) Call: lm(formula = log(diameter) ~ gluc + log(conc) + gluc:log(conc), data = hellung) Residuals: Min 1Q Median 3Q Max
0.000129 0.008675 0.040543 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.756307 0.031957 117.543 <2e-16 *** glucNo 0.007869 0.054559 0.144 0.886 log(conc)
0.002807 -18.954 <2e-16 *** glucNo:log(conc) -0.006480 0.004821
0.185
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.02086 on 47 degrees of freedom Multiple R-squared: 0.9361, Adjusted R-squared: 0.9321 F-statistic: 229.6 on 3 and 47 DF, p-value: < 2.2e-16
27: August 22, 2012
R> model.matrix(hel)[c(1:3, 35:37),] (Intercept) glucNo log(conc) glucNo:log(conc) 1 1 13.35506 0.00000 2 1 13.29126 0.00000 3 1 13.24103 0.00000 35 1 1 12.71289 12.71289 36 1 1 12.56024 12.56024 37 1 1 12.21106 12.21106
28: August 22, 2012
29: August 22, 2012
R> data(hellung,package="ISwR") R> hellung<-transform(hellung, + glucose=factor(glucose, labels=c("Yes","No")), + ldiam=log(diameter),lconc=log(conc))
30: August 22, 2012
31: August 22, 2012
32: August 22, 2012
R> hel.int<-lm(ldiam~glucose+lconc+glucose:lconc,data=hellung) R> hel.add<-lm(ldiam~ glucose + lconc,data=hellung) R> anova(hel.add,hel.int) Analysis of Variance Table Model 1: ldiam ~ glucose + lconc Model 2: ldiam ~ glucose + lconc + glucose:lconc Res.Df RSS Df Sum of Sq F Pr(>F) 1 48 0.021234 2 47 0.020448 1 0.00078615 1.807 0.1853
33: August 22, 2012
34: August 22, 2012
R> anova(hel.int) Analysis of Variance Table Response: ldiam Df Sum Sq Mean Sq F value Pr(>F) glucose 1 0.042591 0.042591 97.896 4.508e-13 *** lconc 1 0.256353 0.256353 589.237 < 2.2e-16 *** glucose:lconc 1 0.000786 0.000786 1.807 0.1853 Residuals 47 0.020448 0.000435
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
35: August 22, 2012
Effect H0 model H1 model 1 glucose 1 1 + glucose 2 lconc 1 + glucose 1 + glucose + lconc 3 glucose:lconc 1 + glucose + lconc 1 + glucose + lconc + glucose:lconc
36: August 22, 2012
R> drop1(hel.add,test='F') Single term deletions Model: ldiam ~ glucose + lconc Df Sum of Sq RSS AIC F value Pr(>F) <none> 0.021234 -390.98 glucose 1 0.050335 0.071569 -331.01 113.78 2.932e-14 *** lconc 1 0.256353 0.277587 -261.89 579.49 < 2.2e-16 ***
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
37: August 22, 2012
38: August 22, 2012
R> drop1(hel.int,test='F') Single term deletions Model: ldiam ~ glucose + lconc + glucose:lconc Df Sum of Sq RSS AIC F value Pr(>F) <none> 0.020448 -390.91 glucose:lconc 1 0.00078615 0.021234 -390.98 1.807 0.1853
39: August 22, 2012
R> coef(summary(hel.int)) Estimate
t value Pr(>|t|) (Intercept) 3.756307427 0.031956957 117.5427121 1.058597e-59 glucoseNo 0.007869203 0.054559221 0.1442323 8.859337e-01 lconc
glucoseNo:lconc -0.006480457 0.004820897
40: August 22, 2012
R> coef(summary(hel.add)) Estimate
t value Pr(>|t|) (Intercept) 3.78114962 0.026289701 143.82627 6.440775e-65 glucoseNo
lconc
41: August 22, 2012
42: August 22, 2012
i = (yi − ˆ
i ∼A N(0, 1)
R> residuals(hel.add) R> rstandard(hel.add)
43: August 22, 2012
R> par(mfrow=c(2,2)) R> plot(hel.add, pch = c(17,2)[hellung$glucose], + col = c(17,2)[hellung$glucose])
44: August 22, 2012
3.00 3.05 3.10 3.15 3.20 3.25 −0.06 −0.02 0.02 Fitted values Residuals
Residuals vs Fitted
7 40 8
−2 −1 1 2 −3 −2 −1 1 2 Theoretical Quantiles Standardized residuals
Normal Q−Q
7 40 8
3.00 3.05 3.10 3.15 3.20 3.25 0.0 0.5 1.0 1.5 Fitted values Standardized residuals
Scale−Location
7 40 8
0.00 0.02 0.04 0.06 0.08 0.10 −3 −2 −1 1 2 Leverage Standardized residuals Cook's distance
0.5
Residuals vs Leverage
7 40 45
45: August 22, 2012
46: August 22, 2012
R> budworm sex dose ndead ntotal 1 male 1 1 20 2 male 2 4 20 3 male 4 9 20 4 male 8 13 20 5 male 16 18 20 6 male 32 20 20 7 female 1 20 8 female 2 2 20 9 female 4 6 20 10 female 8 10 20 11 female 16 12 20 12 female 32 16 20
47: August 22, 2012
R> p <- 6/20
R> O <- (6/20)/(14/20)
48: August 22, 2012
49: August 22, 2012 R> library(lattice) R> print(xyplot(log((ndead+.5)/(20-ndead+.5))~log(dose), groups=sex, data=budworm, + type="b", auto.key=T))
log(dose) log((ndead + 0.5)/(20 − ndead + 0.5))
−4 −2 2 4 1 2 3
male
R> budworm <- transform(budworm, logdose=log(dose)) R> mm1 <- glm(cbind(ndead,ntotal-ndead)~sex+logdose, + data=budworm, family=binomial(link=logit))
50: August 22, 2012 R> summary(mm1) Call: glm(formula = cbind(ndead, ntotal - ndead) ~ sex + logdose, family = binomial(link = logit), data = budworm) Deviance Residuals: Min 1Q Median 3Q Max
0.48471 1.42944 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept)
0.4685
sexmale 1.1007 0.3558 3.093 0.00198 ** logdose 1.5353 0.1891 8.119 4.70e-16 ***
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 124.8756
degrees of freedom Residual deviance: 6.7571
9 degrees of freedom AIC: 42.867 Number of Fisher Scoring iterations: 4
51: August 22, 2012
R> mm0 <- update(mm1, .~. - sex) R> anova(mm1, mm0, test="Chisq") Analysis of Deviance Table Model 1: cbind(ndead, ntotal - ndead) ~ sex + logdose Model 2: cbind(ndead, ntotal - ndead) ~ logdose
1 9 6.7571 2 10 16.9840 -1
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1