Lecture 3: Residual Analysis + Generalized Linear Models
Colin Rundel
1/23/2018
Residual Analysis
[Figure: Atmospheric CO2 (ppm) from Mauna Loa; co2 vs. date]
Where to start?
Well, it looks like stuff is going up on average …
[Figure: co2 vs. date with a linear trend; residuals from that trend (resid) vs. date]
and then?
Well, there is some periodicity; let's add the month …
[Figure: residuals from the linear trend (resid) and residuals after adding month (resid2), vs. date]
and then and then?
There is still some long-term trend in the data; maybe a fancy polynomial can help …
[Figure: resid2 and residuals after adding a degree-5 polynomial in date (resid3), vs. date]
Putting it all together …
l_final = lm(co2 ~ date + month + poly(date, 5), data = co2_df)
summary(l_final)
## 
## Call:
## lm(formula = co2 ~ date + month + poly(date, 5), data = co2_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72022 -0.19169 -0.00638  0.17565  1.06026 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    -2.587e+03  1.460e+01 -177.174  < 2e-16 ***
## date            1.479e+00  7.334e-03  201.649  < 2e-16 ***
## monthAug       -4.155e+00  1.346e-01  -30.880  < 2e-16 ***
## monthDec       -3.566e+00  1.350e-01  -26.404  < 2e-16 ***
## monthFeb       -2.022e+00  1.345e-01  -15.041  < 2e-16 ***
## monthJan       -2.729e+00  1.345e-01  -20.286  < 2e-16 ***
## monthJul       -2.018e+00  1.345e-01  -15.003  < 2e-16 ***
## monthJun       -3.136e-01  1.345e-01   -2.332 0.021117 *  
## monthMar       -1.233e+00  1.344e-01   -9.175 5.54e-16 ***
## monthMay        4.881e-01  1.344e-01    3.631 0.000396 ***
## monthNov       -4.799e+00  1.349e-01  -35.577  < 2e-16 ***
## monthOct       -6.102e+00  1.348e-01  -45.282  < 2e-16 ***
## monthSep       -6.036e+00  1.346e-01  -44.832  < 2e-16 ***
## poly(date, 5)1         NA         NA       NA       NA    
## poly(date, 5)2 -1.920e+00  3.427e-01   -5.602 1.09e-07 ***
## poly(date, 5)3  3.920e+00  3.451e-01   11.358  < 2e-16 ***
## poly(date, 5)4  8.946e-01  3.428e-01    2.609 0.010062 *  
## poly(date, 5)5 -4.340e+00  3.462e-01  -12.535  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3427 on 139 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.9966
## F-statistic:  2872 on 16 and 139 DF,  p-value: < 2.2e-16
Final fit + Residuals
[Figure: final model fit over co2 vs. date; final residuals (resid_final) vs. date]
Generalized Linear Models
Background
A generalized linear model has three key components:

1. a probability distribution (from the exponential family) that describes your response variable,
2. a linear predictor η = Xβ,
3. and a link function g such that g(E(Y|X)) = η.
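As a concrete sketch of how these three components map onto R's glm() (the simulated data and the names x, y, g here are illustrative, not from the lecture):

```r
# Simulated count data whose log-mean is linear in x
set.seed(1)
x = runif(100)
y = rpois(100, lambda = exp(1 + 2 * x))

# family = poisson(link = "log") supplies the distribution (1) and the
# link function (3); the formula y ~ x defines the linear predictor (2)
g = glm(y ~ x, family = poisson(link = "log"))
coef(g)  # estimates of beta, on the linear-predictor (log) scale
```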
Poisson Regression
This is a special case of a generalized linear model for count data, where we assume the outcome variable follows a Poisson distribution (mean = variance).

Y_i ∼ Poisson(λ_i)

log E(Y_i | X_i·) = log λ_i = X_i· β
Example - AIDS in Belgium
These data represent the total number of new AIDS cases reported in Belgium during the early stages of the epidemic.
[Figure: AIDS cases in Belgium; cases vs. year]
Frequentist glm fit
g = glm(cases ~ year, data = aids, family = poisson)
pred = data_frame(year = seq(1981, 1993, by = 0.1)) %>%
  mutate(cases = predict(g, newdata = ., type = "response"))
[Figure: AIDS cases in Belgium with the fitted Poisson regression curve]
Residuals?

The naive approach is to use standard residuals,

r_i = Y_i − E(Y_i | X) = Y_i − λ̂_i

aids = aids %>%
  mutate(pred = predict(g, newdata = ., type = "response")) %>%
  mutate(standard = cases - pred)

ggplot(aids, aes(x = year, y = standard)) +
  geom_point() +
  geom_segment(aes(xend = year, yend = 0))

[Figure: standard residuals vs. year]
Accounting for variability

Pearson residuals:

r_i = (Y_i − E(Y_i | X)) / √Var(Y_i | X) = (Y_i − λ̂_i) / √λ̂_i

aids = aids %>%
  mutate(pearson = (cases - pred) / sqrt(pred))

ggplot(aids, aes(x = year, y = pearson)) +
  geom_point() +
  geom_segment(aes(xend = year, yend = 0))

[Figure: Pearson residuals vs. year]
Deviance
Deviance is a way of measuring the difference between your glm's fit and the fit of a perfect model (one where E(Ŷ_i | X) = Y_i, i.e. each fitted value equals the corresponding observation).

It is defined as twice the log of the ratio between the likelihood of the perfect model and the likelihood of the given model,

D = 2 log( L(θ_best | Y) / L(θ̂ | Y) ) = 2 ( ℓ(θ_best | Y) − ℓ(θ̂ | Y) )
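This identity is easy to verify numerically; here is a minimal sketch on simulated Poisson data (variable names illustrative), computing D directly from the two log-likelihoods and comparing against R's deviance():

```r
set.seed(1)
x = runif(50)
y = rpois(50, lambda = exp(1 + x))
fit = glm(y ~ x, family = poisson)
mu = fitted(fit)

# Log-likelihood of the fitted model vs. the perfect (saturated) model,
# where each fitted mean equals the observed count
ll_fit = sum(dpois(y, mu, log = TRUE))
ll_sat = sum(dpois(y, y, log = TRUE))

D = 2 * (ll_sat - ll_fit)
all.equal(D, deviance(fit))  # TRUE
```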
Derivation - Normal
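The derivation itself did not survive the transcript; a standard version for the Normal case (treating σ² as known), consistent with the deviance definition above, is:

```latex
\begin{aligned}
\ell(\mu \mid y) &= -\frac{n}{2}\log(2\pi\sigma^2)
  - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \mu_i)^2 \\
\ell(\mu_{best} \mid y) &= -\frac{n}{2}\log(2\pi\sigma^2)
  \qquad \text{(perfect model: } \mu_i = y_i\text{)} \\
D &= 2\left(\ell(\mu_{best} \mid y) - \ell(\hat\mu \mid y)\right)
  = \frac{1}{\sigma^2}\sum_{i=1}^{n} (y_i - \hat\mu_i)^2
\end{aligned}
```

So for the Normal model the deviance is just the residual sum of squares scaled by 1/σ².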
Derivation - Poisson
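Likewise for the Poisson case (again reconstructed, as the slide content is missing from the transcript):

```latex
\begin{aligned}
\ell(\lambda \mid y) &= \sum_{i=1}^{n}\left( y_i \log \lambda_i - \lambda_i - \log y_i! \right) \\
\ell(\lambda_{best} \mid y) &= \sum_{i=1}^{n}\left( y_i \log y_i - y_i - \log y_i! \right)
  \qquad \text{(perfect model: } \lambda_i = y_i\text{)} \\
D &= 2\left(\ell(\lambda_{best} \mid y) - \ell(\hat\lambda \mid y)\right)
  = 2\sum_{i=1}^{n}\left( y_i \log(y_i / \hat\lambda_i) - (y_i - \hat\lambda_i) \right)
\end{aligned}
```

Note that each summand is the square of the Poisson deviance residual defined on a later slide.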
glm output
summary(g)
## 
## Call:
## glm(formula = cases ~ year, family = poisson, data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.971e+02  1.546e+01  -25.68   <2e-16 ***
## year         2.021e-01  7.771e-03   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4
Deviance residuals

We can therefore think of deviance as D = ∑_{i=1}^{n} d_i², where d_i is a generalized residual. So in the Poisson case we can define

d_i = sign(y_i − λ̂_i) √( 2 ( y_i log(y_i / λ̂_i) − (y_i − λ̂_i) ) )

dev_resid = function(obs, pred) {
  sign(obs - pred) * sqrt(2 * (obs * log(obs / pred) - (obs - pred)))
}

aids = aids %>% mutate(deviance = dev_resid(cases, pred))

ggplot(aids, aes(x = year, y = deviance)) +
  geom_point() +
  geom_segment(aes(xend = year, yend = 0))

[Figure: deviance residuals vs. year]
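As a sanity check, a hand-rolled dev_resid of this form matches R's built-in deviance residuals; here is a sketch on simulated data (names illustrative; the means are kept large so no count is zero, since the formula takes log(obs/pred)):

```r
dev_resid = function(obs, pred) {
  sign(obs - pred) * sqrt(2 * (obs * log(obs / pred) - (obs - pred)))
}

set.seed(1)
x = runif(50)
y = rpois(50, lambda = exp(3 + x))
fit = glm(y ~ x, family = poisson)

# Compare against residuals(fit, type = "deviance")
all.equal(unname(dev_resid(y, fitted(fit))),
          unname(residuals(fit, type = "deviance")))  # TRUE
```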
Comparing Residuals
[Figure: standard, Pearson, and deviance residuals vs. year, side by side]
Updating the model
Quadratic fit
g2 = glm(cases ~ year + I(year^2), data = aids, family = poisson)
pred2 = data_frame(year = seq(1981, 1993, by = 0.1)) %>%
  mutate(cases = predict(g2, newdata = ., type = "response"))
[Figure: AIDS cases in Belgium with the fitted quadratic Poisson regression curve]
Quadratic fit - residuals
[Figure: standard, Pearson, and deviance residuals for the linear and quadratic fits]
Bayesian Model
Bayesian Poisson Regression Model
poisson_model = "model{
  # Likelihood
  for (i in 1:length(Y)) {
    Y[i] ~ dpois(lambda[i])
    log(lambda[i]) <- beta[1] + beta[2] * X[i]

    # In-sample prediction
    Y_hat[i] ~ dpois(lambda[i])
  }

  # Prior for beta
  for (j in 1:2) {
    beta[j] ~ dnorm(0, 1/100)
  }
}"
Fit Model
n_burn = 1000; n_iter = 5000

m = rjags::jags.model(
  textConnection(poisson_model), quiet = TRUE,
  data = list(Y = aids$cases, X = aids$year)
)
update(m, n.iter = n_burn, progress.bar = "none")
samp = rjags::coda.samples(
  m, variable.names = c("beta", "lambda", "Y_hat", "Y", "X"),
  n.iter = n_iter, progress.bar = "none"
)
Model Fit?
tidybayes::spread_samples(samp, Y_hat[i], X[i], Y[i]) %>%
  ungroup() %>%
  ggplot(aes(x = X, y = Y)) +
  tidybayes::stat_lineribbon(aes(y = Y_hat), alpha = 0.5) +
  geom_point()
[Figure: posterior predictive ribbons (50/80/95%) for Y_hat over the observed data]
MCMC Diagnostics
tidybayes::gather_samples(samp, beta[i]) %>%
  mutate(param = paste0(term, "[", i, "]")) %>%
  ggplot(aes(x = .iteration, y = estimate)) +
  geom_line() +
  facet_wrap(~param, ncol = 1, scale = "free_y")
[Figure: trace plots for beta[1] and beta[2]]
Now what?
Maybe more iterations will fix everything …
[Figure: trace plots for beta[1] and beta[2] after 500,000 iterations]
What went wrong?
summary(g)
## 
## Call:
## glm(formula = cases ~ year, family = poisson, data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.971e+02  1.546e+01  -25.68   <2e-16 ***
## year         2.021e-01  7.771e-03   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4

Note the intercept: on the raw year scale the estimate is about −397, far outside the dnorm(0, 1/100) prior (standard deviation 10) placed on beta[1], and strongly correlated with the slope since the years are far from 0.
A simple fix
summary(glm(cases ~ I(year - 1981), data = aids, family = poisson))
## 
## Call:
## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6784  -1.5013  -0.2636   2.1760   2.7306  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.342711   0.070920   47.13   <2e-16 ***
## I(year - 1981) 0.202121   0.007771   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
## 
## Number of Fisher Scoring iterations: 4
Revising the jags model
poisson_model2 = "model{
  # Likelihood
  for (i in 1:length(Y)) {
    Y[i] ~ dpois(lambda[i])
    log(lambda[i]) <- beta[1] + beta[2] * (X[i] - 1981)

    # In-sample prediction
    Y_hat[i] ~ dpois(lambda[i])
  }

  # Prior for beta
  for (j in 1:2) {
    beta[j] ~ dnorm(0, 1/100)
  }
}"
MCMC Diagnostics
tidybayes::gather_samples(samp2, beta[i]) %>%
  mutate(param = paste0(term, "[", i, "]")) %>%
  ggplot(aes(x = .iteration, y = estimate)) +
  geom_line() +
  facet_wrap(~param, ncol = 1, scale = "free_y")
[Figure: trace plots for beta[1] and beta[2], centered model]
Model Fit
tidybayes::spread_samples(samp2, Y_hat[i], lambda[i], X[i], Y[i]) %>%
  ungroup() %>%
  tidyr::gather(param, value, Y_hat, lambda) %>%
  ggplot(aes(x = X, y = Y)) +
  tidybayes::stat_lineribbon(aes(y = value), alpha = 0.5) +
  geom_point() +
  facet_wrap(~param)
[Figure: posterior ribbons (50/80/95%) for lambda and Y_hat over the observed data]
Residual Plots
[Figure: standard, Pearson, and deviance residuals by year for the Bayesian linear model]
Quadratic Fit
[Figure: posterior ribbons for lambda and Y_hat, quadratic model]
MCMC Diagnostics
tidybayes::gather_samples(samp3, beta[i]) %>%
  mutate(param = paste0(term, "[", i, "]")) %>%
  ggplot(aes(x = .iteration, y = estimate)) +
  geom_line() +
  facet_wrap(~param, ncol = 1, scale = "free_y")
[Figure: trace plots for beta[1], beta[2], and beta[3]]
Residual Plots
[Figure: standard2, pearson2, and deviance2 residuals by year for the Bayesian quadratic model]