
Linear and logistic regression models

Søren Højsgaard
Department of Mathematical Sciences, Aalborg University, Denmark
August 22, 2012



Contents

1 Linear normal models
2 Linear regression
  2.1 Fitting linear regression model with lm()
  2.2 Printing the object: print()
  2.3 Model objects
  2.4 Extractor methods and methods for further computing
  2.5 Plotting the object: plot()
  2.6 Summary information: summary()
  2.7 Confidence interval for model parameters: confint()
  2.8 Predicting new cases: predict()
3 Regression with a factor
  3.1 Transforming data using transform()
4 Model comparison
  4.1 Comparing two models with anova()
  4.2 Three commonly used tables for model comparisons
  4.3 Sequential ANOVA table: anova()
  4.4 On interpreting the anova() output
  4.5 Dropping each term in turn using drop1()
  4.6 On interpreting the drop1() output
  4.7 Investigating parameter estimates using coef()
  4.8 Which table to use?*
5 Residuals and model checking
  5.1 Interpreting diagnostic plots
6 Logistic regression


1 Linear normal models

  • Linear normal models (regression models, analysis of variance models, analysis of covariance models, etc.) are fitted using the lm() function.
  • The lm() function is typically called as:

R> lm(y ~ x1 + x2 + x3, dataset)

  • The result from calling the lm() function is an object (also called a model object) with a specific class.
  • Further analysis of the model is typically via additional R functions applied to the model object (see the sketch below).
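As a minimal sketch (with a small, purely hypothetical data frame, not part of the slides), the workflow looks like this:

R> dat <- data.frame(y=rnorm(10), x1=rnorm(10), x2=rnorm(10), x3=rnorm(10))
R> m <- lm(y ~ x1 + x2 + x3, data=dat)   # fit the model
R> class(m)                              # the returned object has class "lm"
R> coef(m)                               # further analysis via functions applied to the object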


2 Linear regression

The hellung dataset has 51 rows and 3 columns. Diameter and concentration of Tetrahymena cells with (coded as 1) and without (coded as 2) glucose added to growth medium. Tetrahymena cells are often used as model organisms in experimental biology.

R> data(hellung, package="ISwR")
R> head(hellung, 6)
  glucose   conc diameter
1       1 631000     21.2
2       1 592000     21.5
3       1 563000     21.3
4       1 475000     21.0
5       1 461000     21.5
6       1 416000     21.3
R> sapply(hellung, class)
  glucose      conc  diameter
"integer" "integer" "numeric"

R> par(mfrow=c(1,2))
R> plot(diameter ~ conc, data=hellung,
+      col=glucose, pch=as.character(glucose))
R> plot(log(diameter) ~ log(conc), data=hellung,
+      col=glucose, pch=as.character(glucose))

[Figure: diameter against conc (left) and log(diameter) against log(conc) (right); points are labelled 1/2 by glucose group.]

  • On a log scale the curves look linear, but are they parallel?

For now we ignore the glucose treatment. In the following let y = log(diameter) and x = log(conc). The plots suggest an approximately linear relationship on the log scale, which we can capture by a linear regression model

  yᵢ = α + βxᵢ + eᵢ,   eᵢ ∼ N(0, σ²),   i = 1, . . . , 51


2.1 Fitting linear regression model with lm()

R> hm <- lm(log(diameter) ~ log(conc), data=hellung)

Now hm is a linear model object.


2.2 Printing the object: print()

The print() method gives some information about the object:

R> print(hm)

Call:
lm(formula = log(diameter) ~ log(conc), data = hellung)

Coefficients:
(Intercept)    log(conc)
    3.74703     -0.05451

Instead of calling print() we may simply type the object's name:

R> hm

Call:
lm(formula = log(diameter) ~ log(conc), data = hellung)

Coefficients:
(Intercept)    log(conc)
    3.74703     -0.05451

2.3 Model objects

Technically, a model object is a list containing all sorts of information: for example the model specification, the data, the parameter estimates and so on.

R> class(hm)
[1] "lm"
R> names(hm)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

The names of the list are called attributes or slots.


We may extract values from the model object using the $–operator as follows

R> hm$coefficients
(Intercept)   log(conc)
 3.74702641 -0.05451464
R> hm$residuals
            1             2             3             4             5
 0.0350211011  0.0455948627  0.0335108932  0.0100606873  0.0319602834
            6             7             8             9            10
 0.0170150708 -0.0352928345  0.0665403222  0.0089021709  0.0182022593
           11            12            13            14            15
 0.0349529425  0.0214114524  0.0462117137  0.0518996737  0.0275888916
           16            17            18            19            20
 0.0191915976  0.0218841508  0.0061947069 -0.0004889143  0.0306645283
           21            22            23            24            25
 0.0059028969  0.0565348153  0.0212271549  0.0298302213  0.0260690999
           26            27            28            29            30
-0.0098984554  0.0471016029  0.0193724964  0.0218186112  0.0256750284
           31            32            33            34            35
 0.0096540212  0.0298367056 -0.0641562641 -0.0611421290 -0.0683058568
           36            37            38            39            40
-0.0177867867 -0.0368224409 -0.0443737505 -0.0468146393 -0.0932894058
           41            42            43            44            45
-0.0149983152 -0.0164825214 -0.0395395538 -0.0230984646  0.0014197167
           46            47            48            49            50
-0.0295345605 -0.0402017510 -0.0534922053 -0.0320022390 -0.0401489903
           51
-0.0533796004

2.4 Extractor methods and methods for further computing

For some of the attributes there exist extractor functions, for example:

R> coef(hm)
(Intercept)   log(conc)
 3.74702641 -0.05451464
R> residuals(hm)
            1             2             3             4             5
 0.0350211011  0.0455948627  0.0335108932  0.0100606873  0.0319602834
            6             7             8             9            10
 0.0170150708 -0.0352928345  0.0665403222  0.0089021709  0.0182022593
           11            12            13            14            15
 0.0349529425  0.0214114524  0.0462117137  0.0518996737  0.0275888916
           16            17            18            19            20
 0.0191915976  0.0218841508  0.0061947069 -0.0004889143  0.0306645283
           21            22            23            24            25
 0.0059028969  0.0565348153  0.0212271549  0.0298302213  0.0260690999
           26            27            28            29            30
-0.0098984554  0.0471016029  0.0193724964  0.0218186112  0.0256750284
           31            32            33            34            35
 0.0096540212  0.0298367056 -0.0641562641 -0.0611421290 -0.0683058568
           36            37            38            39            40
-0.0177867867 -0.0368224409 -0.0443737505 -0.0468146393 -0.0932894058
           41            42            43            44            45
-0.0149983152 -0.0164825214 -0.0395395538 -0.0230984646  0.0014197167
           46            47            48            49            50
-0.0295345605 -0.0402017510 -0.0534922053 -0.0320022390 -0.0401489903
           51
-0.0533796004

Moreover, various methods are available for model objects, and each method performs a specific task. Some of these methods are print(), summary(), plot(), coef(), fitted(), predict(), ... A quick way to see what is available is shown below.
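As a quick sketch (not on the slides), the methods() function lists the methods defined for a given class:

R> methods(class="lm")   # all methods available for lm objects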


2.5 Plotting the object: plot()

The plot() method for lm–objects produces illustrative diagnostic plots:

R> par(mfrow=c(2,2), mar=c(2,4.5,2,2))
R> plot(hm)

[Figure: the four diagnostic plots for hm: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance contours). Observations 8, 35 and 40 are flagged in the first three panels; 33, 34 and 35 in the last.]


2.6 Summary information: summary()

A general summary of the fit is obtained by

R> sumhm <- summary(hm)
R> sumhm

Call:
lm(formula = log(diameter) ~ log(conc), data = hellung)

Residuals:
      Min        1Q    Median        3Q       Max
-0.093289 -0.033648  0.009654  0.028710  0.066540

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.747026   0.047415   79.03   <2e-16 ***
log(conc)   -0.054515   0.004178  -13.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03822 on 49 degrees of freedom
Multiple R-squared: 0.7765, Adjusted R-squared: 0.7719
F-statistic: 170.2 on 1 and 49 DF,  p-value: < 2.2e-16


  • Residual standard error: Estimate of σ.
  • Multiple R-squared: The proportion of variation in data explained by the model.
  • F-statistic, p-value: Test of whether all parameters except the intercept are zero.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      3.75        0.05    79.03      0.00
log(conc)       -0.05        0.00   -13.05      0.00

  • Estimate: The estimates of the intercept and slope.
  • Std. Error: Standard errors of the estimates.
  • t value, Pr(>|t|): Tests of whether these parameters are equal to zero.

2.7 Confidence interval for model parameters: confint()

R> confint(hm)
                  2.5 %      97.5 %
(Intercept)  3.65174215  3.84231066
log(conc)   -0.06291165 -0.04611763

A call to summary() returns an object which is a list

R> class(sumhm)
[1] "summary.lm"
R> names(sumhm)
 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

These elements of the list can be accessed directly using $ and in some cases by accessor methods:

R> coef(sumhm)
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)  3.74702641 0.047415122  79.02598 2.467047e-53
log(conc)   -0.05451464 0.004178499 -13.04647 1.465068e-17

Notice that coef() acts differently when applied to a summary.lm and a lm object.
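As a small illustration (not on the slides): applied to the lm object, coef() returns the vector of estimates; applied to the summary, it returns the full coefficient matrix, from which e.g. the standard errors can be extracted:

R> coef(hm)                      # named vector of estimates
R> coef(sumhm)[, "Std. Error"]   # standard errors from the coefficient matrix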


2.8 Predicting new cases: predict()

Having fitted the model one can use the model formula and the parameter estimates to obtain predicted values for any values of the explanatory variables: ŷ = α̂ + β̂x. Create a data frame with values for the explanatory variables and invoke the predict() method:

R> pred.data <- data.frame(conc=seq(1000,2000,3000))
R> pred1 <- predict(hm, newdata=pred.data, interval='confidence')
R> head(pred1)
       fit      lwr      upr
1 3.370453 3.332237 3.408668
R> pred2 <- predict(hm, newdata=pred.data, interval='prediction')
R> head(pred2)
       fit      lwr      upr
1 3.370453 3.284669 3.456237
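A natural follow-up (a sketch, not from the slides; the grid of concentrations is an assumption) is to predict over a whole range of values and draw the fitted line with its confidence band:

R> nd <- data.frame(conc=seq(10000, 650000, length.out=50))   # assumed grid
R> ci <- predict(hm, newdata=nd, interval="confidence")
R> plot(log(diameter) ~ log(conc), data=hellung)
R> matlines(log(nd$conc), ci, lty=c(1,2,2), col=1)   # fit, lower, upper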


3 Regression with a factor

R> head(hellung, 6)
  glucose   conc diameter
1       1 631000     21.2
2       1 592000     21.5
3       1 563000     21.3
4       1 475000     21.0
5       1 461000     21.5
6       1 416000     21.3

Hence glucose defines a partitioning of the data into two groups (and the values 1 and 2 are completely arbitrary). It is more informative to recode 1 as "Yes" and 2 as "No". We need the concept of a factor because factors are treated differently from covariates in a modelling context. First we create a factor and append it to the data frame:

R> gluc <- factor(hellung$glucose, labels=c('Yes','No'))
R> hellung$gluc <- gluc
R> gluc[c(1:4, 41:44)]
[1] Yes Yes Yes Yes No  No  No  No
Levels: Yes No
R> head(hellung)
  glucose   conc diameter gluc
1       1 631000     21.2  Yes
2       1 592000     21.5  Yes
3       1 563000     21.3  Yes
4       1 475000     21.0  Yes
5       1 461000     21.5  Yes
6       1 416000     21.3  Yes


Perhaps the following model is adequate:

  yᵢ = μ + α_g(i) + βxᵢ + β_g(i)xᵢ + eᵢ

which reads:

  • When glucose=Yes: yᵢ = μ + α_Yes + βxᵢ + β_Yes·xᵢ + eᵢ
  • When glucose=No: yᵢ = μ + α_No + βxᵢ + β_No·xᵢ + eᵢ

The model is fitted by

R> hel <- lm(log(diameter) ~ gluc + log(conc) + gluc:log(conc), + data=hellung)

R> summary(hel)

Call:
lm(formula = log(diameter) ~ gluc + log(conc) + gluc:log(conc),
    data = hellung)

Residuals:
      Min        1Q    Median        3Q       Max
-0.061530 -0.011254  0.000129  0.008675  0.040543

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.756307   0.031957 117.543   <2e-16 ***
glucNo            0.007869   0.054559   0.144    0.886
log(conc)        -0.053196   0.002807 -18.954   <2e-16 ***
glucNo:log(conc) -0.006480   0.004821  -1.344    0.185
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02086 on 47 degrees of freedom
Multiple R-squared: 0.9361, Adjusted R-squared: 0.9321
F-statistic: 229.6 on 3 and 47 DF,  p-value: < 2.2e-16


The meaning of a factor is to create new (dummy) covariates:

R> model.matrix(hel)[c(1:3, 35:37),]
   (Intercept) glucNo log(conc) glucNo:log(conc)
1            1      0  13.35506          0.00000
2            1      0  13.29126          0.00000
3            1      0  13.24103          0.00000
35           1      1  12.71289         12.71289
36           1      1  12.56024         12.56024
37           1      1  12.21106         12.21106

Notice: To make the parameters unique, one chooses a reference level and sets the corresponding parameter equal to 0. R chooses the first level of a factor as the reference. Thereby glucNo is the difference in intercept between the two regression lines. Likewise, glucNo:log(conc) is the difference in slope between the two regression lines. A sketch of how to change the reference level is given below.
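As a small sketch (not on the slides), the reference level can be changed with relevel(); this reparametrizes the model without changing the fit:

R> hellung$gluc <- relevel(hellung$gluc, ref="No")   # make "No" the reference level
R> hel2 <- lm(log(diameter) ~ gluc + log(conc) + gluc:log(conc), data=hellung)
R> coef(hel2)   # same fitted model, parameters now expressed relative to gluc=No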


The parameter estimate table:

                  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)           3.76        0.03   117.54      0.00
glucNo                0.01        0.05     0.14      0.89
log(conc)            -0.05        0.00   -18.95      0.00
glucNo:log(conc)     -0.01        0.00    -1.34      0.19

  • (Intercept): Intercept when gluc=Yes
  • glucNo: Difference in intercept between the gluc=No and gluc=Yes groups
  • log(conc): Slope when gluc=Yes
  • glucNo:log(conc): Difference in slope between the gluc=No and gluc=Yes groups

These group-wise intercepts and slopes can be computed directly from the estimates, as sketched below.
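A brief sketch (not on the slides) recovering the two regression lines from the coefficient vector:

R> b <- coef(hel)
R> c(Yes = b[["(Intercept)"]], No = b[["(Intercept)"]] + b[["glucNo"]])          # intercepts
R> c(Yes = b[["log(conc)"]],   No = b[["log(conc)"]] + b[["glucNo:log(conc)"]])  # slopes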


3.1 Transforming data using transform()

We supplement the data with log-transformed values and declare glucose as a factor in one step using transform():

R> data(hellung, package="ISwR")
R> hellung <- transform(hellung,
+    glucose=factor(glucose, labels=c("Yes","No")),
+    ldiam=log(diameter), lconc=log(conc))


4 Model comparison

We consider comparing two models for the hellung data:

  • The interaction model
    hel.int : ldiam ∼ glucose + lconc + glucose:lconc
    describes a model with non-parallel lines.
  • The simpler additive model
    hel.add : ldiam ∼ glucose + lconc
    defines a model with parallel lines.
  • The additive model hel.add is a submodel of the interaction model hel.int.

  • We can consider the model comparison as a hypothesis test:

    H0 : Model hel.add is appropriate.
    H1 : Model hel.int is appropriate.

  • This may alternatively be written by characterizing the parameter to be dropped:

    H0 : β_No = 0,   H1 : β_No ≠ 0.


4.1 Comparing two models with anova()

R> hel.int <- lm(ldiam ~ glucose + lconc + glucose:lconc, data=hellung)
R> hel.add <- lm(ldiam ~ glucose + lconc, data=hellung)
R> anova(hel.add, hel.int)
Analysis of Variance Table

Model 1: ldiam ~ glucose + lconc
Model 2: ldiam ~ glucose + lconc + glucose:lconc
  Res.Df      RSS Df  Sum of Sq     F Pr(>F)
1     48 0.021234
2     47 0.020448  1 0.00078615 1.807 0.1853

  • Df: difference in the number of model parameters (= 48 − 47).
  • Sum of Sq: difference in the residual sums of squares (RSS).
  • F: value of the F-statistic.
  • Pr(>F): p-value; the probability, under H0, of an F-statistic at least as large as the one observed.

4.2 Three commonly used tables for model comparisons

Most statistical programs produce standard output tables, which provide tests of several hypotheses. We will consider three of these tables, which are available in R (Tab. 1).

Table 1: Three commonly used tables.

  R-function             Description
  anova(model)           Sequential ANOVA table (SAS: TYPE 1)
  drop1(model)           Dropping effects (SAS: TYPE 2)
  coef(summary(model))   Parameter estimate table


4.3 Sequential ANOVA table: anova()

The effects are tested sequentially as they appear in the model equation: it is tested whether an effect is necessary, given that all the preceding effects are in the model.

R> anova(hel.int)
Analysis of Variance Table

Response: ldiam
              Df   Sum Sq  Mean Sq F value    Pr(>F)
glucose        1 0.042591 0.042591  97.896 4.508e-13 ***
lconc          1 0.256353 0.256353 589.237 < 2.2e-16 ***
glucose:lconc  1 0.000786 0.000786   1.807    0.1853
Residuals     47 0.020448 0.000435
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


4.4 On interpreting the anova() output

Table 2: ANOVA (type I) table for model hel.int.

  Effect           H0 model              H1 model
  1 glucose        1                     1 + glucose
  2 lconc          1 + glucose           1 + glucose + lconc
  3 glucose:lconc  1 + glucose + lconc   1 + glucose + lconc + glucose:lconc

Notice: The table only shows that the log-concentration effect is significant after glucose has been included. The other test, whether glucose is significant after log-concentration has been fitted, cannot be concluded from this ANOVA table. One could obtain it by fitting the model with the positions of glucose and log-concentration swapped, as sketched below. The test results depend on the order of the effects in the model equation if the design is not balanced.
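A brief sketch (not on the slides) of this swapped fit; only the order of the terms in the formula changes:

R> hel.int2 <- lm(ldiam ~ lconc + glucose + lconc:glucose, data=hellung)
R> anova(hel.int2)   # now tests glucose after lconc has been fitted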


4.5 Dropping each term in turn using drop1()

The drop1() function provides a table where effects are removed in turn, assuming that all the other effects are contained in the model.

R> drop1(hel.add, test='F')
Single term deletions

Model:
ldiam ~ glucose + lconc
        Df Sum of Sq      RSS     AIC F value    Pr(>F)
<none>               0.021234 -390.98
glucose  1  0.050335 0.071569 -331.01  113.78 2.932e-14 ***
lconc    1  0.256353 0.277587 -261.89  579.49 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


4.6 On interpreting the drop1() output

For model hel.add such an analysis table is given in Table 3.

Table 3: drop1 (type II) table for model hel.add.

  Effect    H0 model      H1 model
  glucose   1 + lconc     1 + glucose + lconc
  lconc     1 + glucose   1 + glucose + lconc


Notice: When the drop1() function is applied to the interaction model hel.int, only the test for the interaction is returned.

R> drop1(hel.int, test='F')
Single term deletions

Model:
ldiam ~ glucose + lconc + glucose:lconc
              Df  Sum of Sq      RSS     AIC F value Pr(>F)
<none>                      0.020448 -390.91
glucose:lconc  1 0.00078615 0.021234 -390.98   1.807 0.1853

This is because R obeys the principle of marginality for factors: do not test for a main effect if it is contained in a higher-order interaction.


4.7 Investigating parameter estimates using coef()

Some insight can be gained by looking at the parameter estimates, their standard errors and derived quantities:

R> coef(summary(hel.int))
                    Estimate  Std. Error     t value     Pr(>|t|)
(Intercept)      3.756307427 0.031956957 117.5427121 1.058597e-59
glucoseNo        0.007869203 0.054559221   0.1442323 8.859337e-01
lconc           -0.053196257 0.002806628 -18.9537980 1.193636e-23
glucoseNo:lconc -0.006480457 0.004820897  -1.3442429 1.853226e-01

However, care must be taken here: Both glucoseNo and glucoseNo:lconc appear insignificant. From this one cannot conclude that both parameters can be excluded from the model (i.e. that the two regression lines coincide), because each test in the summary table assumes that all the other effects remain in the model.


To drive home this point, consider

R> coef(summary(hel.add))
               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)  3.78114962 0.026289701 143.82627 6.440775e-65
glucoseNo   -0.06502011 0.006095455 -10.66698 2.932214e-14
lconc       -0.05539270 0.002301060 -24.07269 1.917681e-28

Hence, when the interaction glucose:lconc (different slopes) has been removed then glucose (different intercepts) is highly significant.


4.8 Which table to use?*

All three tables provide a collection of tests. The most useful table is the TYPE II table (produced by drop1() in R) because it tests, for each effect in turn, whether it is necessary. This feature is especially useful if one starts with a large model containing several effects. The sequential ANOVA table is less useful because the tests depend on the sequence of the model terms. The least useful table (with respect to testing) is the parameter estimate table. It provides (Wald) tests that each parameter is equal to zero. Such a test can only be used as a test that the complete effect can be dropped if the effect is either

  • a factor with only two levels, or
  • a continuous covariate.

5 Residuals and model checking

Plots of residuals are an important tool for model checking. There are different types of residuals available:

  • Pearson residuals: rᵢ = yᵢ − ŷᵢ
  • Standardized residuals: rˢᵢ = (yᵢ − ŷᵢ)/√σ̂²

If the assumptions behind the linear normal model are satisfied then

  • the Pearson residuals are asymptotically rᵢ ∼ N(0, σ²),
  • the standardized residuals are asymptotically rˢᵢ ∼ N(0, 1),
  • the residuals are asymptotically independent.

R> residuals(hel.add)
R> rstandard(hel.add)
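As a small check (a sketch, not from the slides), the Pearson residuals are simply the observed minus the fitted values on the model scale:

R> ## should be TRUE: residuals are ldiam minus the fitted values
R> all.equal(unname(residuals(hel.add)),
+            unname(hellung$ldiam - fitted(hel.add)))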


The plot() method for lm–objects produces several diagnostic plots based on the residuals.

R> par(mfrow=c(2,2))
R> plot(hel.add, pch = c(17,2)[hellung$glucose],
+       col = c(17,2)[hellung$glucose])


[Figure: the four diagnostic plots for hel.add: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage (with Cook's distance contours). Observations 7, 8 and 40 are flagged in the first three panels; 7, 40 and 45 in the last.]

Figure 1: Residuals from model hel.add


5.1 Interpreting diagnostic plots

  • Residuals vs Fitted: The Pearson residuals are plotted against the fitted values and a smooth line is added. For fitted values in the central range the model seems to underfit the data because the residuals there are positive.
  • Normal Q-Q: The distribution of the residuals looks rather 'normal' (the points wiggle randomly around the straight line). Deviations from the line at the extremes are observed even for ideally normally distributed data.
  • Scale-Location: If the variance were independent of the mean, one would expect the points to wiggle unsystematically around a horizontal line at about 0.82.
  • Residuals vs Leverage: Points with high leverage may change the parameter estimates dramatically if omitted from the data. Especially points with both high leverage and a large residual should be scrutinized. For the present data no such point exists.


6 Logistic regression

The number of killed budworms when exposed to an insecticide:

R> budworm
      sex dose ndead ntotal
1    male    1     1     20
2    male    2     4     20
3    male    4     9     20
4    male    8    13     20
5    male   16    18     20
6    male   32    20     20
7  female    1     0     20
8  female    2     2     20
9  female    4     6     20
10 female    8    10     20
11 female   16    12     20
12 female   32    16     20

  • The number of dead budworms (out of a total of 20) increases with increasing dose.
  • There cannot be fewer than 0 or more than 20 dead budworms.

  • For females at dose 4 there are 6 dead budworms. The probability of dying can be estimated as

R> p <- 6/20

  • It is common to work on the odds scale: with 6 out of 20 budworms dead there are 14 out of 20 alive, so the odds of dying are

R> O <- (6/20)/(14/20)

  • The interpretation of O = 6/14 ≈ 0.43 ≈ 4/10 is that for roughly every 4 budworms that die, 10 survive.
  • More generally,

    O = p/(1 − p)

    so the odds can be anything from 0 to ∞.


  • In practice it is more convenient to work with log–odds (for historical reasons often denoted by η):

    η = log(p/(1 − p))

  • It is not a big deal: if we know η we can get back to the probability

    p = e^η/(1 + e^η)

  • In logistic regression we relate the log–odds to the explanatory variables; a small sketch of the conversions is given below.
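A minimal sketch (not on the slides) of these conversions; qlogis() and plogis() are base R's logit and inverse-logit functions:

R> p <- 6/20
R> eta <- log(p/(1 - p))      # log-odds by hand
R> exp(eta)/(1 + exp(eta))    # back to the probability p = 0.3
R> qlogis(p)                  # the same log-odds
R> plogis(eta)                # the same probability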
R> library(lattice)
R> print(xyplot(log((ndead+.5)/(20-ndead+.5)) ~ log(dose), groups=sex, data=budworm,
+               type="b", auto.key=TRUE))

[Figure: empirical log-odds log((ndead + 0.5)/(20 − ndead + 0.5)) against log(dose), grouped by sex (female/male).]

  • Very vanilla logistic regression.

R> budworm <- transform(budworm, logdose=log(dose))
R> mm1 <- glm(cbind(ndead, ntotal-ndead) ~ sex + logdose,
+            data=budworm, family=binomial(link=logit))

R> summary(mm1)

Call:
glm(formula = cbind(ndead, ntotal - ndead) ~ sex + logdose,
    family = binomial(link = logit), data = budworm)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.10540  -0.65343   0.02225   0.48471   1.42944

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.4732     0.4685  -7.413 1.23e-13 ***
sexmale       1.1007     0.3558   3.093  0.00198 **
logdose       1.5353     0.1891   8.119 4.70e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 124.8756  on 11  degrees of freedom
Residual deviance:   6.7571  on  9  degrees of freedom
AIC: 42.867

Number of Fisher Scoring iterations: 4


  • There appears to be a significant effect of sex.
  • The regression coefficient for logdose is 1.53. So when the log-dose increases by one unit (that is, when the dose is multiplied by e ≈ 2.72, since the natural logarithm is used), the log-odds increase by 1.53. So if

    1.53 = log O₂ − log O₁ = log(O₂/O₁)

    then exp(1.53) = 4.62 = O₂/O₁, i.e. the odds of dying increase by a factor of 4.62; see the sketch below.
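As a one-line sketch (not on the slides), the same odds-ratio can be computed directly from the fitted model:

R> exp(coef(mm1)["logdose"])   # odds-ratio per unit increase in log-dose, about 4.6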

R> mm0 <- update(mm1, . ~ . - sex)
R> anova(mm1, mm0, test="Chisq")
Analysis of Deviance Table

Model 1: cbind(ndead, ntotal - ndead) ~ sex + logdose
Model 2: cbind(ndead, ntotal - ndead) ~ logdose
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)
1         9     6.7571
2        10    16.9840 -1  -10.227  0.001384 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1