1
Workshop 3 Building from Linear Models to Generalised Linear Models - - PowerPoint PPT Presentation
Workshop 3 Building from Linear Models to Generalised Linear Models - - PowerPoint PPT Presentation
1 Workshop 3 Building from Linear Models to Generalised Linear Models Part 2: GLMs 2 2 What are linear models? Stage 1: response continuous - General Linear Model Procedure Response Predictors Single linear regression y ~ x Continuous
2
What are linear models?
Procedure Response Predictors Single linear regression y ~ x Continuous 1 Continuous/discrete Two-sample t-test y ~ x Continuous 1 categorical (2 levels) One-way ANOVA y ~ x Continuous 1 categorical (2 or more levels) Two-way ANOVA y ~ x1*x2 Continuous 2 categorical (2 or more levels each)
Stage 1: response continuous - General Linear Model Stage 2: incl other types of response - Generalised Linear Model 2
3
Key points
T-tests, ANOVA and regression are fundamentally the same Collectively ‘general linear model’ Can be extended to ‘generalised linear model’ for different types of response
3
4
What are Generalised linear models?
- Response can be continuous, discrete or
categorical
- Predictors can be continuous or categorical
- Type of response determines type of GLM
5
Two types of GLM
Counts (“Poisson”)
number of cases, individuals
Binary (“Binomial”)
true-false, yes-no, proportions
These are the two types of GLM we will consider
6
GLM
- We have to specify which one
lm(y ~ x)
Same as: glm(y ~ x, family = gaussian)
glm(y ~ x, family = poisson) glm(y ~ x, family = binomial)
7
LM – GLM summary
- Linear models (lm)
– Response assumed normal – Residuals assumed normal and homoscedastic – Residual SS measures fit
- GLM
– Response can be other than normal – Variance need not be constant – Deviance measures fit
Levels of competence
Minimum required Above the minimum Advanced
8
GLM - poisson
- Response is a count
- Explanatory is continuous, categorical or
combination
- If explanatory is category then you have a
chi-squared
9
Example: Poisson GLM
- Does proximity to a nuclear power plant alter the risk of
cancer?
- Number of cases of cancer reported by a clinic by its distance
from nuclear plant (note not good epidemiology)
> str(cases) 'data.frame': 43 obs. of 2 variables: $ cancers : int 0 0 4 0 0 0 0 1 0 0 ... $ distance: num 154.37 93.14 3.83 60.83 142.61 ...
Each row is a clinic
10
Example: Poisson GLM
ggplot(cases, aes(x = distance, y = cancers)) + geom_point() > str(cases) 'data.frame': 43 obs. of 2 variables: $ cancers : int 0 0 4 0 0 0 0 1 0 0 ... $ distance: num 154.37 93.14 3.83 60.83 142.61 ...
Each row is a clinic
11
Example: Poisson GLM
mod <- glm(cancers ~ distance, data = cases, family = poisson) summary(mod) Call: glm(formula = cancers ~ distance, family = poisson, data = cases) Deviance Residuals: Min 1Q Median 3Q Max
- 1.8423 -0.7437 -0.4834 0.4206 1.8933
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.019169 0.308709 3.301 0.000962 *** distance -0.021495 0.005034 -4.270 1.96e-05 ***
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1) Null deviance: 54.522 on 42 degrees of freedom Residual deviance: 31.790 on 41 degrees of freedom AIC: 78.163 Number of Fisher Scoring iterations: 5
Estimate is –’ve Cancers decrease with increasing distance... and significantly
Estimate is logged!
“Estimate is on the scale of the link function” Selecting and applying
12
Example: Poisson GLM
mod <- glm(cancers ~ distance, data = cases, family = poisson) anova(mod,test=”Chisq”) Analysis of Deviance Table Model: poisson, link: log Response: cancers Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 42 54.522 distance 1 22.732 41 31.790 1.862e-06 ***
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
glm object anova(mod, test = ”Chisq”) summary(mod) lm object anova(mod, test = ”F”) summary(mod)
anova() on the object more useful for models with more than
- ne explanatory variable
13
Example: Poisson GLM
> exp(coef(mod)) (Intercept) distance 2.770892 0.978734 ggplot(cases, aes(x = distance, y = cancers)) + geom_point()+ stat_smooth(method="glm", method.args = list(family = "poisson"), col="black",se=FALSE) + xlab("Distance (km)") + ylab("Number of Cancer cases per year")+ theme_bw()
2.77 cancer cases are expected at 0km For every km from the nuclear plant, a clinic reports 0.97 of the cases reported at previous km
Need to antilog estimates to understand
14
GLM - Binomial
- Response is a binary variable1
– (0 and 1, Y and N, T and F, red and blue)
- Explanatory is continuous or combination
1. actually, response can be in several equivalent formats but here we use 0 and 1
15
Example: Binomial GLM
- Do fibrinogen levels affect whether a person’s erythrocyte
sedimentation rate (ESR) is classed is healthy or unhealthy
- ESR classed as 0 (healthy) or 1 (ill) by fibrinogen levels
- aka logistic regression
> str(plasma) 'data.frame': 32 obs. of 2 variables: $ fibrinogen: num 2.52 2.56 2.19 2.18 3.41 2.46 3.22 2.21 3.15 2.6 ... $ ESR : num 0 0 0 0 0 0 0 0 0 0 ...
Each row is an individual
16
Example: Binomial GLM
ggplot(plasma, aes(x = fibrinogen, y = ESR)) + geom_point()
17
Example: Binomial GLM
mod2 <- glm(ESR ~ fibrinogen,data = plasma, family = binomial) summary(mod2) Call: glm(formula = ESR ~ fibrinogen, family = binomial, data = plasma) Deviance Residuals: Min 1Q Median 3Q Max
- 0.9298 -0.5399 -0.4382 -0.3356 2.4794
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -6.8451 2.7703 -2.471 0.0135 * fibrinogen 1.8271 0.9009 2.028 0.0425 *
- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1) Null deviance: 30.885 on 31 degrees of freedom Residual deviance: 24.840 on 30 degrees of freedom AIC: 28.84 Number of Fisher Scoring iterations: 5
Estimate is +’ve Probability of ‘ill’ (1) is higher as fibrinogen increases
Estimate is on logit scale!
“Estimate is on the scale of the link function”
18
Example: Binomial GLM
ggplot(plasma, aes(x = fibrinogen, y = ESR)) + geom_point()+ stat_smooth(method = "glm", method.args = list(family = "binomial"), col = "black",se = FALSE) + xlab("Fibrinogen") + ylab("ESR class")+ theme_bw() newdat fibrinogen pr 1 1 0.006574281 2 2 0.039509110 3 3 0.203618144 4 4 0.613784514 5 5 0.908072941 6 6 0.983974363 fibv <- seq(1, 6, 1) newdat <- data.frame(fibrinogen = fibv) newdat$pr <- predict(mod2, type = "response", newdata = newdat)
Estimates are odds ratios and hard to understand, instead: Use predict() to understand
19
LM – GLM summary
- Linear models (lm)
– response assumed continuous; Residuals assumed normal and homoscedastic
- GLM
– Response can be other than normal; variance need not be constant – Specify the family
- poisson for counts
- binomial for binary outcome
mod <- glm(response ~ explanatory1 * explanatory2, data = mydata, family = family) summary(mod) anova(mod,test = “Chisq”) mod <- glm(response ~ explanatory1, data = mydata, family = family) summary(mod)