Lecture 3 Residual Analysis + Generalized Linear Models Colin - - PowerPoint PPT Presentation

lecture 3
SMART_READER_LITE
LIVE PREVIEW

Lecture 3 Residual Analysis + Generalized Linear Models Colin - - PowerPoint PPT Presentation

Lecture 3 Residual Analysis + Generalized Linear Models Colin Rundel 1/23/2017 1 Residual Analysis 2 3 Atmospheric CO 2 (ppm) from Mauna Loa 360 co2 350 1988 1992 1996 date Where to start? Well, it looks like stuff is going up on


slide-1
SLIDE 1

Lecture 3

Residual Analysis + Generalized Linear Models

Colin Rundel 1/23/2017

1

slide-2
SLIDE 2

Residual Analysis

2

slide-3
SLIDE 3

Atmospheric CO2 (ppm) from Mauna Loa

350 360 1988 1992 1996

date co2

3

slide-4
SLIDE 4

Where to start?

Well, it looks like stuff is going up on average …

4

slide-5
SLIDE 5

Where to start?

Well, it looks like stuff is going up on average …

350 360 1988 1992 1996

date co2

−2.5 0.0 2.5 1988 1992 1996

date resid

4

slide-6
SLIDE 6

and then?

Well there is some periodicity lets add the month …

5

slide-7
SLIDE 7

and then?

Well there is some periodicity lets add the month …

−2.5 0.0 2.5 1988 1992 1996

date resid

−1.0 −0.5 0.0 0.5 1.0 1988 1992 1996

date resid2

5

slide-8
SLIDE 8

and then and then?

Maybe there is some different effect by year …

6

slide-9
SLIDE 9

and then and then?

Maybe there is some different effect by year …

−1.0 −0.5 0.0 0.5 1.0 1988 1992 1996

date resid2

−0.8 −0.4 0.0 0.4 0.8 1988 1992 1996

date resid3

6

slide-10
SLIDE 10

Too much

(lm = lm(co2~date + month + as.factor(year), data=co2_df)) ## ## Call: ## lm(formula = co2 ~ date + month + as.factor(year), data = co2_df) ## ## Coefficients: ## (Intercept) date monthAug ##

  • 2.645e+03

1.508e+00

  • 4.177e+00

## monthDec monthFeb monthJan ##

  • 3.612e+00
  • 2.008e+00
  • 2.705e+00

## monthJul monthJun monthMar ##

  • 2.035e+00
  • 3.251e-01
  • 1.227e+00

## monthMay monthNov monthOct ## 4.821e-01

  • 4.838e+00
  • 6.135e+00

## monthSep as.factor(year)1986 as.factor(year)1987 ##

  • 6.064e+00
  • 2.585e-01

9.722e-03 ## as.factor(year)1988 as.factor(year)1989 as.factor(year)1990 ## 1.065e+00 9.978e-01 7.726e-01 ## as.factor(year)1991 as.factor(year)1992 as.factor(year)1993 ## 7.067e-01 1.236e-02

  • 7.911e-01

## as.factor(year)1994 as.factor(year)1995 as.factor(year)1996 ##

  • 4.146e-01

1.119e-01 3.768e-01 ## as.factor(year)1997 ## NA

7

slide-11
SLIDE 11

350 360 1988 1992 1996

date co2

8

slide-12
SLIDE 12

Generalized Linear Models

9

slide-13
SLIDE 13

Background

A generalized linear model has three key components:

  • 1. a probability distribution (from the exponential family) that describes

your response variable

  • 2. a linear predictor η = Xβ,
  • 3. and a link function g such that g(E(Y|X)) = µ = η.

10

slide-14
SLIDE 14

Poisson Regression

11

slide-15
SLIDE 15

Model Specification

A generalized linear model for count data where we assume the outcome variable follows a poisson distribution (mean = variance). Yi ∼ Poisson(λi) log E(Y|X) = log λ = Xβ

12

slide-16
SLIDE 16

Example - AIDS in Belgium

50 100 150 200 250 1985 1990

year cases

AIDS cases in Belgium

13

slide-17
SLIDE 17

Frequentist glm fit

g = glm(cases~year, data=aids, family=poisson) pred = data_frame(year=seq(1981,1993,by=0.1)) pred$cases = predict(g, newdata=pred, type = ”response”)

100 200 300 1985 1990

year cases

AIDS cases in Belgium

14

slide-18
SLIDE 18

Residuals

Standard residuals: ri = Yi − ˆ Yi = Yi − ˆ

λi

Pearson residuals: ri = Yi − E(Yi|X)

Var(Yi|X) = Yi − ˆ

λi

ˆ λi

Deviance residuals: di = sign(yi − λi)

2(yi log(yi/ˆ

λi) − (yi − ˆ λi))

15

slide-19
SLIDE 19

Deviance and deviance residuals

Deviance can be interpreted as the difference between your model’s fit and the fit of an ideal model (where E(ˆ Yi) = Yi). D = 2(L(Y|θbest) − L(Y|ˆ

θ)) =

n

i=1

di

2

Deviance is a measure of goodness of fit in a similar way to the residual sum of squares (which is just the sum of squared standard residuals).

16

slide-20
SLIDE 20

Deviance residuals derivation

17

slide-21
SLIDE 21

Residual plots

standard pearson deviance 1985 1990 1985 1990 1985 1990 −5.0 −2.5 0.0 2.5 −2.5 0.0 2.5 −50

year residual

18

slide-22
SLIDE 22

print(aids_fit)

100 200 300 1985 1990

year cases

AIDS cases in Belgium

19

slide-23
SLIDE 23

Quadratic fit

g2 = glm(cases~year+I(year^2), data=aids, family=poisson) pred2 = data_frame(year=seq(1981,1993,by=0.1)) pred2$cases = predict(g2, newdata=pred, type = ”response”)

50 100 150 200 250 1985 1990

year cases

AIDS cases in Belgium

20

slide-24
SLIDE 24

Quadratic fit - residuals

standard pearson deviance 1985 1990 1985 1990 1985 1990 −1 1 −1 1 −10 10 20

year residual

21

slide-25
SLIDE 25

Bayesian Poisson Regression Model

## model{ ## # Likelihood ## for(i in 1:length(Y)){ ## Y[i] ~ dpois(lambda[i]) ## log(lambda[i]) <- beta[1] + beta[2]*X[i] ## ## # In-sample prediction ## Y_hat[i] ~ dpois(lambda[i]) ## } ## ## # Prior for beta ## for(j in 1:2){ ## beta[j] ~ dnorm(0,1/100) ## } ## }

22

slide-26
SLIDE 26

Bayesian Model fit

m = jags.model( textConnection(poisson_model1), quiet = TRUE, data = list(Y=aids$cases, X=aids$year) ) update(m, n.iter=1000, progress.bar=”none”) samp = coda.samples( m, variable.names=c(”beta”,”lambda”,”Y_hat”), n.iter=5000, progress.bar=”none” )

23

slide-27
SLIDE 27

MCMC Diagnostics

2000 3000 4000 5000 6000 7000 −5 −3 −1 Iterations

Trace of beta[1]

−6 −4 −2 0.0 0.2

Density of beta[1]

N = 5000 Bandwidth = 0.3209 2000 3000 4000 5000 6000 7000 0.0020 0.0040 Iterations

Trace of beta[2]

0.002 0.003 0.004 0.005 400

Density of beta[2]

N = 5000 Bandwidth = 0.0001615

24

slide-28
SLIDE 28

Model fit?

100 200 300 1985 1990

year cases

AIDS cases in Belgium

25

slide-29
SLIDE 29

What went wrong?

summary(g) ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784

  • 1.5013
  • 0.2636

2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 1.546e+01

  • 25.68

<2e-16 *** ## year 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206

  • n 12

degrees of freedom ## Residual deviance: 80.686

  • n 11

degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4

26

slide-30
SLIDE 30

What went wrong?

summary(g) ## ## Call: ## glm(formula = cases ~ year, family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784

  • 1.5013
  • 0.2636

2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.971e+02 1.546e+01

  • 25.68

<2e-16 *** ## year 2.021e-01 7.771e-03 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206

  • n 12

degrees of freedom ## Residual deviance: 80.686

  • n 11

degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4

26

slide-31
SLIDE 31

summary(glm(cases~I(year-1981), data=aids, family=poisson)) ## ## Call: ## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -4.6784

  • 1.5013
  • 0.2636

2.1760 2.7306 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 3.342711 0.070920 47.13 <2e-16 *** ## I(year - 1981) 0.202121 0.007771 26.01 <2e-16 *** ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 872.206

  • n 12

degrees of freedom ## Residual deviance: 80.686

  • n 11

degrees of freedom ## AIC: 166.37 ## ## Number of Fisher Scoring iterations: 4

27

slide-32
SLIDE 32

Revising the model

## model{ ## # Likelihood ## for(i in 1:length(Y)){ ## Y[i] ~ dpois(lambda[i]) ## log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) ## ## # In-sample prediction ## Y_hat[i] ~ dpois(lambda[i]) ## } ## ## # Prior for beta ## for(j in 1:2){ ## beta[j] ~ dnorm(0,1/100) ## } ## }

28

slide-33
SLIDE 33

MCMC Diagnostics

2000 3000 4000 5000 6000 7000 3.1 3.3 3.5 Iterations

Trace of beta[1]

3.1 3.2 3.3 3.4 3.5 3.6 2 4

Density of beta[1]

N = 5000 Bandwidth = 0.01336 2000 3000 4000 5000 6000 7000 0.18 0.21 Iterations

Trace of beta[2]

0.18 0.19 0.20 0.21 0.22 0.23 20 40

Density of beta[2]

N = 5000 Bandwidth = 0.001456

29

slide-34
SLIDE 34

Model fit

100 200 300 1985 1990

year cases

(Linear Poisson Model)

AIDS cases in Belgium

30

slide-35
SLIDE 35

Bayesian Residual Plots

standard pearson deviance 1985 1990 1985 1990 1985 1990 −6 −4 −2 2 −6 −4 −2 2 4 −100 −50 50

year post_mean

31

slide-36
SLIDE 36

Model fit

100 200 300 1985 1990

year cases

(Quadratic Poisson Model)

AIDS cases in Belgium

32

slide-37
SLIDE 37

Bayesian Residual Plots

standard pearson deviance 1985 1990 1985 1990 1985 1990 −2 −1 1 2 −2 −1 1 2 −20 20 40

year post_mean

33

slide-38
SLIDE 38

Negative Binomial Regression

34

slide-39
SLIDE 39

Overdispersion

One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ. If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implactions for both inference and prediction.

mean(aids$cases) ## [1] 124.7692 var(aids$cases) ## [1] 8124.526

35

slide-40
SLIDE 40

Overdispersion

One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ. If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implactions for both inference and prediction.

mean(aids$cases) ## [1] 124.7692 var(aids$cases) ## [1] 8124.526

35

slide-41
SLIDE 41

Negative binomial regession

If we define Yi|Zi ∼ Pois(λi Zi) Zi ∼ Gamma(θi, θi) then the marginal distribution of Yi will be negative binomial with, E(Yi) = λi Var(Yi) = λi + λ2

i /θi 36

slide-42
SLIDE 42

Model

## model{ ## for(i in 1:length(Y)) ## { ## Z[i] ~ dgamma(theta, theta) ## log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) + beta[3]*(X[i] - 1981)^2 ## ## lambda_Z[i] <- Z[i]*lambda[i] ## ## Y[i] ~ dpois(lambda_Z[i]) ## Y_hat[i] ~ dpois(lambda_Z[i]) ## } ## ## for(j in 1:3){ ## beta[j] ~ dnorm(0, 1/100) ## } ## ## log_theta ~ dnorm(0, 1/100) ## theta <- exp(log_theta) ## }

37

slide-43
SLIDE 43

Negative Binomial Model fit

100 200 300 1985 1990

year cases

(Quadratic Poisson Model)

AIDS cases in Belgium

100 200 300 1985 1990

year cases

(Quadratic Negative Binomial Model)

AIDS cases in Belgium

38

slide-44
SLIDE 44

Bayesian Residual Plots

standard pearson 1985 1990 1985 1990 −2 −1 1 2 −20 20 40

year post_mean

39