Workshop 3 Building from Linear Models to Generalised Linear Models - - PowerPoint PPT Presentation

workshop 3
SMART_READER_LITE
LIVE PREVIEW

Workshop 3 Building from Linear Models to Generalised Linear Models - - PowerPoint PPT Presentation

1 Workshop 3 Building from Linear Models to Generalised Linear Models Part 2: GLMs 2 2 What are linear models? Stage 1: response continuous - General Linear Model Procedure Response Predictors Single linear regression y ~ x Continuous


slide-1
SLIDE 1

1

Workshop 3

Building from Linear Models to Generalised Linear Models Part 2: GLMs

slide-2
SLIDE 2

2

What are linear models?

Procedure Response Predictors Single linear regression y ~ x Continuous 1 Continuous/discrete Two-sample t-test y ~ x Continuous 1 categorical (2 levels) One-way ANOVA y ~ x Continuous 1 categorical (2 or more levels) Two-way ANOVA y ~ x1*x2 Continuous 2 categorical (2 or more levels each)

Stage 1: response continuous - General Linear Model Stage 2: incl other types of response - Generalised Linear Model 2

slide-3
SLIDE 3

3

Key points

T-tests, ANOVA and regression are fundamentally the same Collectively ‘general linear model’ Can be extended to ‘generalised linear model’ for different types of response

3

slide-4
SLIDE 4

4

What are Generalised linear models?

  • Response can be continuous, discrete or

categorical

  • Predictors can be continuous or categorical
  • Type of response determines type of GLM
slide-5
SLIDE 5

5

Two types of GLM

Counts (“Poisson”)

number of cases, individuals

Binary (“Binomial”)

true-false, yes-no, proportions

These are the two types of GLM we will consider

slide-6
SLIDE 6

6

GLM

  • We have to specify which one

lm(y ~ x)

Same as: glm(y ~ x, family = gaussian)

glm(y ~ x, family = poisson) glm(y ~ x, family = binomial)

slide-7
SLIDE 7

7

LM – GLM summary

  • Linear models (lm)

– Response assumed normal – Residuals assumed normal and homoscedastic – Residual SS measures fit

  • GLM

– Response can be other than normal – Variance need not be constant – Deviance measures fit

Levels of competence

Minimum required Above the minimum Advanced

slide-8
SLIDE 8

8

GLM - poisson

  • Response is a count
  • Explanatory is continuous, categorical or

combination

  • If explanatory is category then you have a

chi-squared

slide-9
SLIDE 9

9

Example: Poisson GLM

  • Does proximity to a nuclear power plant alter the risk of

cancer?

  • Number of cases of cancer reported by a clinic by its distance

from nuclear plant (note not good epidemiology)

> str(cases) 'data.frame': 43 obs. of 2 variables: $ cancers : int 0 0 4 0 0 0 0 1 0 0 ... $ distance: num 154.37 93.14 3.83 60.83 142.61 ...

Each row is a clinic

slide-10
SLIDE 10

10

Example: Poisson GLM

ggplot(cases, aes(x = distance, y = cancers)) + geom_point() > str(cases) 'data.frame': 43 obs. of 2 variables: $ cancers : int 0 0 4 0 0 0 0 1 0 0 ... $ distance: num 154.37 93.14 3.83 60.83 142.61 ...

Each row is a clinic

slide-11
SLIDE 11

11

Example: Poisson GLM

mod <- glm(cancers ~ distance, data = cases, family = poisson) summary(mod) Call: glm(formula = cancers ~ distance, family = poisson, data = cases) Deviance Residuals: Min 1Q Median 3Q Max

  • 1.8423 -0.7437 -0.4834 0.4206 1.8933

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.019169 0.308709 3.301 0.000962 *** distance -0.021495 0.005034 -4.270 1.96e-05 ***

  • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1) Null deviance: 54.522 on 42 degrees of freedom Residual deviance: 31.790 on 41 degrees of freedom AIC: 78.163 Number of Fisher Scoring iterations: 5

Estimate is –’ve Cancers decrease with increasing distance... and significantly

Estimate is logged!

“Estimate is on the scale of the link function” Selecting and applying

slide-12
SLIDE 12

12

Example: Poisson GLM

mod <- glm(cancers ~ distance, data = cases, family = poisson) anova(mod,test=”Chisq”) Analysis of Deviance Table Model: poisson, link: log Response: cancers Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 42 54.522 distance 1 22.732 41 31.790 1.862e-06 ***

  • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

glm object anova(mod, test = ”Chisq”) summary(mod) lm object anova(mod, test = ”F”) summary(mod)

anova() on the object more useful for models with more than

  • ne explanatory variable
slide-13
SLIDE 13

13

Example: Poisson GLM

> exp(coef(mod)) (Intercept) distance 2.770892 0.978734 ggplot(cases, aes(x = distance, y = cancers)) + geom_point()+ stat_smooth(method="glm", method.args = list(family = "poisson"), col="black",se=FALSE) + xlab("Distance (km)") + ylab("Number of Cancer cases per year")+ theme_bw()

2.77 cancer cases are expected at 0km For every km from the nuclear plant, a clinic reports 0.97 of the cases reported at previous km

Need to antilog estimates to understand

slide-14
SLIDE 14

14

GLM - Binomial

  • Response is a binary variable1

– (0 and 1, Y and N, T and F, red and blue)

  • Explanatory is continuous or combination

1. actually, response can be in several equivalent formats but here we use 0 and 1

slide-15
SLIDE 15

15

Example: Binomial GLM

  • Do fibrinogen levels affect whether a person’s erythrocyte

sedimentation rate (ESR) is classed is healthy or unhealthy

  • ESR classed as 0 (healthy) or 1 (ill) by fibrinogen levels
  • aka logistic regression

> str(plasma) 'data.frame': 32 obs. of 2 variables: $ fibrinogen: num 2.52 2.56 2.19 2.18 3.41 2.46 3.22 2.21 3.15 2.6 ... $ ESR : num 0 0 0 0 0 0 0 0 0 0 ...

Each row is an individual

slide-16
SLIDE 16

16

Example: Binomial GLM

ggplot(plasma, aes(x = fibrinogen, y = ESR)) + geom_point()

slide-17
SLIDE 17

17

Example: Binomial GLM

mod2 <- glm(ESR ~ fibrinogen,data = plasma, family = binomial) summary(mod2) Call: glm(formula = ESR ~ fibrinogen, family = binomial, data = plasma) Deviance Residuals: Min 1Q Median 3Q Max

  • 0.9298 -0.5399 -0.4382 -0.3356 2.4794

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.8451 2.7703 -2.471 0.0135 * fibrinogen 1.8271 0.9009 2.028 0.0425 *

  • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1) Null deviance: 30.885 on 31 degrees of freedom Residual deviance: 24.840 on 30 degrees of freedom AIC: 28.84 Number of Fisher Scoring iterations: 5

Estimate is +’ve Probability of ‘ill’ (1) is higher as fibrinogen increases

Estimate is on logit scale!

“Estimate is on the scale of the link function”

slide-18
SLIDE 18

18

Example: Binomial GLM

ggplot(plasma, aes(x = fibrinogen, y = ESR)) + geom_point()+ stat_smooth(method = "glm", method.args = list(family = "binomial"), col = "black",se = FALSE) + xlab("Fibrinogen") + ylab("ESR class")+ theme_bw() newdat fibrinogen pr 1 1 0.006574281 2 2 0.039509110 3 3 0.203618144 4 4 0.613784514 5 5 0.908072941 6 6 0.983974363 fibv <- seq(1, 6, 1) newdat <- data.frame(fibrinogen = fibv) newdat$pr <- predict(mod2, type = "response", newdata = newdat)

Estimates are odds ratios and hard to understand, instead: Use predict() to understand

slide-19
SLIDE 19

19

LM – GLM summary

  • Linear models (lm)

– response assumed continuous; Residuals assumed normal and homoscedastic

  • GLM

– Response can be other than normal; variance need not be constant – Specify the family

  • poisson for counts
  • binomial for binary outcome

mod <- glm(response ~ explanatory1 * explanatory2, data = mydata, family = family) summary(mod) anova(mod,test = “Chisq”) mod <- glm(response ~ explanatory1, data = mydata, family = family) summary(mod)