Lecture 11. Modelling Process and Model Diagnostics (cont.) Nan Ye - - PowerPoint PPT Presentation

lecture 11 modelling process and model diagnostics cont
SMART_READER_LITE
LIVE PREVIEW

Lecture 11. Modelling Process and Model Diagnostics (cont.) Nan Ye - - PowerPoint PPT Presentation

Lecture 11. Modelling Process and Model Diagnostics (cont.) Nan Ye School of Mathematics and Physics University of Queensland 1 / 24 Recall: Some key modelling activities model class use fit validate model model model data Check


slide-1
SLIDE 1

Lecture 11. Modelling Process and Model Diagnostics (cont.) Nan Ye

School of Mathematics and Physics University of Queensland

1 / 24

slide-2
SLIDE 2

Recall: Some key modelling activities model class data fit model validate model use model

  • Check model assumption
  • Check goodness of fit, residual plot et al on training set.
  • A good fit on the training set may mean overfitting.
  • Check predictive performance
  • Check cross-validation score, validation set performance.
  • Reconsider model class or data if checks are not satisfactory.

2 / 24

slide-3
SLIDE 3

This Lecture

  • Checking model assumption
  • Checking predictive performance

3 / 24

slide-4
SLIDE 4

Residual Plots

  • Plot Pearson residuals/deviance residuals against link (i.e. linear

predictor).

  • If the model is correct, the points should be roughly uniformly

scattered around 0.

  • Plotting against the fitted mean (i.e. response) can be helpful but

less popular.

4 / 24

slide-5
SLIDE 5

Example Consider plots of Pearson residuals againt the link (linear predictor) for models on the blood clotting time example. Recall the following models

> fit.ig.inv = glm(time ~ lot * log(conc), data=clot, family=inverse.gaussian(link='inverse')) > fit.ig.invquad = glm(time ~ lot * log(conc), data=clot, family=inverse.gaussian) > fit.ig.log = glm(time ~ lot * log(conc), data=clot, family=inverse.gaussian(link='log')) > fit.gam.inv = glm(time ~ lot * log(conc), data=clot, family=Gamma) ...

5 / 24

slide-6
SLIDE 6

Comparison of link functions

  • 0.02

0.04 0.06 0.08 −0.010 0.000 0.010 link pearson

(a) fit.ig.inv

  • 0.000

0.001 0.002 0.003 0.004 0.005 −0.04 0.00 0.04 link pearson

(b) fit.ig.invquad

6 / 24

slide-7
SLIDE 7
  • Residual decreases as link increases for inverse quadratic link
  • No such obvious pattern for inverse link.
  • Inverse link model is thus likely to be better.
  • This is consistent with conclusions obtained using likelihood or

residual deviance (see previous lectures).

7 / 24

slide-8
SLIDE 8

Comparison of variance functions

  • 0.02

0.04 0.06 0.08 −0.010 0.000 0.010 link pearson

(a) fit.ig.inv

  • 0.02

0.04 0.06 0.08 −0.06 0.00 0.04 0.08 link pearson

(b) fit.gam.inv

8 / 24

slide-9
SLIDE 9
  • Residuals on the RHS are close to 0 for Gamma.
  • No such obvious pattern for inverse Gaussian.
  • Inverse Gaussian thus likely has a better variance structure.
  • This is consistent with conclusions obtained using likelihood.

> logLik(fit.gam.inv) 'log Lik.' -26.59759 (df=5) > logLik(fit.ig.inv) 'log Lik.' -25.33805 (df=5)

9 / 24

slide-10
SLIDE 10

Link scale vs. mean scale

  • 0.02

0.04 0.06 0.08 −0.010 0.000 0.010 link pearson

(a) link scale

  • 20

40 60 80 100 120 −0.010 0.000 0.010 response pearson

(b) mean scale

  • Both are Pearson residual plots for fit.ig.inv.
  • The mean scale spreads out the rightmost two points too much.
  • These two points appear to be outliers on the mean scale, but not
  • n the link scale.

10 / 24

slide-11
SLIDE 11

Deviance residual plots

  • 0.02

0.04 0.06 0.08 −0.010 0.000 0.010 link deviance

(a) fit.ig.inv

  • 0.000

0.001 0.002 0.003 0.004 0.005 −0.04 0.00 0.04 link deviance

(b) fit.ig.invquad

11 / 24

slide-12
SLIDE 12
  • 0.02

0.04 0.06 0.08 −0.010 0.000 0.010 link deviance

(a) fit.ig.inv

  • 0.02

0.04 0.06 0.08 −0.06 0.00 0.04 0.08 link deviance

(b) fit.gam.inv

12 / 24

slide-13
SLIDE 13
  • We get roughly the same plots, and thus roughly the same

conclusions as using the Pearson residual plots.

  • In fact, the Pearson residuals and the deviance residuals are almost

the same for the models considered here.

13 / 24

slide-14
SLIDE 14

Analysis of Deviance

  • We successively fit a sequence of models by adding one term to the

model.

  • The deviance of a term is the difference between the deviance of

the first model that contains it and the deviance of the previous model.

  • Thus the deviance of a term depends on when it is added.

14 / 24

slide-15
SLIDE 15

Example

> anova(fit.ig.inv) Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 17 0.247884 lot 1 0.034159 16 0.213725 492.04 2.630e-12 *** log(conc) 1 0.203628 15 0.010097 2933.14 < 2.2e-16 *** lot:log(conc) 1 0.009122 14 0.000975 131.40 1.679e-08 ***

  • Signif. codes:

0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

  • The deviance of a term is F-distributed under the null hypothesis

that the term is not significant.

  • All terms are significant in this example.
  • log(conc) has the largest contribution in the model.

15 / 24

slide-16
SLIDE 16

> fit.ig.inv1 = glm(time ~ log(conc)*lot, data=clot, family=inverse.gaussian(link='inverse')) > anova(fit.ig.inv1) Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev F Pr(>F) NULL 17 0.247884 log(conc) 1 0.206543 16 0.041341 2975.13 < 2.2e-16 *** lot 1 0.031244 15 0.010097 450.06 4.829e-12 *** log(conc):lot 1 0.009122 14 0.000975 131.40 1.679e-08 ***

  • Signif. codes:

0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

  • The order of lot and log(conc) are swapped.
  • The deviances are slightly different.
  • However, we have the same qualitative conclusion about the

signifiance of the terms.

16 / 24

slide-17
SLIDE 17
  • Often, we need to decide whether a factor should be included.
  • This can be done by comparing the deviances of before and after

including it.

  • Again, the conclusion depends on the model on which the factor is

added.

17 / 24

slide-18
SLIDE 18

> fit1 = glm(time ~ log(conc), data=clot, family=inverse.gaussian(link='inverse')) > fit2 = glm(time ~ lot*log(conc), data=clot, family=inverse.gaussian(link='inverse')) > anova(fit1, fit2, test='F') Analysis of Deviance Table Model 1: time ~ log(conc) Model 2: time ~ lot * log(conc)

  • Resid. Df Resid. Dev Df Deviance

F Pr(>F) 1 16 0.041341 2 14 0.000975 2 0.040367 290.73 3.971e-12 ***

  • Signif. codes:

0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

The lot factor is significant.

18 / 24

slide-19
SLIDE 19

Checking Predictive Performance

Overfitting

  • A model satisfying the model assumption does not necessarily

make good predictions on test data.

  • In particular, when there are many covariates, a model which

better fits the training data may have poorer performance than one which fits less well.

  • Overfitting: as model complex increases, the model fits the training

set better and better, but the test set performance first improves and then drops.

19 / 24

slide-20
SLIDE 20

Measuring predictive performance

  • The validation set approach
  • If we have enough data, we can split the dataset into a training set

a validation set.

  • Train models using the training set, and pick the one with best

predictive performance on the validation set.

  • Cross-validation (CV)
  • We split the dataset into K folds (parts).
  • For each model class, train K models by leaving one fold out each

time, and make predictions on the left-out fold.

  • The performance of predictions obtained using CV is the predictive

performance of the model class.

20 / 24

slide-21
SLIDE 21

> library(caret) > train(time ~ lot*log(conc), method="glm", data=clot, family=inverse.gaussian(link='inverse'), trControl=trainControl(method="LOOCV")) Resampling results: RMSE Rsquared MAE 15.98637 0.9575552 5.65666

  • In leave-one-out CV, each fold has only one example.
  • The caret library provides a simple way to do CV for many models,

including GLMs.

21 / 24

slide-22
SLIDE 22

> train(time ~ lot*log(conc), method="glm", data=clot, family=inverse.gaussian(link='log'), trControl=trainControl(method="LOOCV")) Resampling results: RMSE Rsquared MAE 13.34795 0.8315472 6.159968

  • Using the log link improves RMS, but decreases R2 and MAE.
  • This is what usually happens: no single model performs the best

for all performance measures.

22 / 24

slide-23
SLIDE 23

> train(time ~ lot*log(conc), method="glm", data=clot, family=inverse.gaussian(link='1/mu^2'), trControl=trainControl(method="LOOCV")) Resampling results: RMSE Rsquared MAE 5.791858 0.9130303 3.973965 Warning messages: 1: In sqrt(eta) : NaNs produced 2: In sqrt(eta) : NaNs produced

  • Inverse quadratic link is only legitimate when one can ensure on a

new x, β⊤x > 0.

  • In this example, it happens that this positivity constraint is violated

twice (eta refers to the linear predictor).

23 / 24

slide-24
SLIDE 24

What You Need to Know

  • Checking model assumption: residual plots, analysis of deviance.
  • Checking predictive performance: validation set, cross-validation.

24 / 24