Assessing Model Fit: Correlation and Regression in R



slide-1
SLIDE 1

Assessing Model Fit

CORRELATION AND REGRESSION IN R

Ben Baumer

Assistant Professor at Smith College

slide-2
SLIDE 2

CORRELATION AND REGRESSION IN R

How well does our textbook model fit?

ggplot(data = textbooks, aes(x = amazNew, y = uclaNew)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

slide-3
SLIDE 3

CORRELATION AND REGRESSION IN R

How well does our possum model fit?

ggplot(data = possum, aes(y = totalL, x = tailL)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

slide-4
SLIDE 4

CORRELATION AND REGRESSION IN R

Sums of squared deviations
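
The formula on this slide did not survive extraction. Assuming the usual least-squares notation (observed value $y_i$, fitted value $\hat{y}_i$, residual $e_i$, none of which appear in the surviving text), the quantity computed on the next slide is conventionally written as follows. The last equality assumes the model has an intercept, so the residuals average to zero; it corresponds to the `SSE_also` column in the code.

```latex
\mathrm{SSE} \;=\; \sum_{i=1}^{n} e_i^2
\;=\; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\;=\; (n - 1)\,\operatorname{Var}(e)
```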

slide-5
SLIDE 5

CORRELATION AND REGRESSION IN R

SSE

library(broom)
library(dplyr)  # provides %>%, summarize() and n()

mod_possum <- lm(totalL ~ tailL, data = possum)
mod_possum %>%
  augment() %>%
  summarize(SSE = sum(.resid^2),
            SSE_also = (n() - 1) * var(.resid))

   SSE SSE_also
1 1301     1301

slide-6
SLIDE 6

CORRELATION AND REGRESSION IN R

RMSE
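
The slide body is missing here. A common definition for simple linear regression, stated as an assumption rather than as the slide's exact wording, divides the SSE by the residual degrees of freedom ($n - 2$ with one slope and one intercept) before taking the square root. Plugging in the possum figures reported elsewhere in this deck (SSE of 1301 on 102 degrees of freedom) reproduces the residual standard error of about 3.57.

```latex
\mathrm{RMSE} \;=\; \sqrt{\frac{\mathrm{SSE}}{n - 2}}
\qquad\text{e.g.}\qquad
\sqrt{\frac{1301}{102}} \;\approx\; 3.57
```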

slide-7
SLIDE 7

CORRELATION AND REGRESSION IN R

Residual standard error (possums)

summary(mod_possum)

Call:
lm(formula = totalL ~ tailL, data = possum)

Residuals:
   Min     1Q Median     3Q    Max
-9.210 -2.326  0.179  2.777  6.790

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    41.04       6.66    6.16  1.4e-08
tailL           1.24       0.18    6.93  3.9e-10

Residual standard error: 3.57 on 102 degrees of freedom
Multiple R-squared:  0.32,  Adjusted R-squared:  0.313
F-statistic: 48 on 1 and 102 DF,  p-value: 3.94e-10
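
As a rough check of where the "Residual standard error" line comes from, the sketch below recomputes it by hand. It uses R's built-in `mtcars` data as a stand-in (an assumption made so the snippet runs without the `possum` data), and the variable names are illustrative only.

```r
# Stand-in data: built-in mtcars, NOT the possum data used on the slide.
mod <- lm(mpg ~ wt, data = mtcars)

sse  <- sum(residuals(mod)^2)   # sum of squared residuals
df   <- df.residual(mod)        # n - 2: one slope plus one intercept
rmse <- sqrt(sse / df)          # residual standard error, by hand

# sigma() extracts the value that summary() prints as
# "Residual standard error"; the two numbers should agree.
print(c(by_hand = rmse, from_model = sigma(mod)))
```

The result is in the units of the response variable, which is why the possum and textbook values cannot be compared directly.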

slide-8
SLIDE 8

CORRELATION AND REGRESSION IN R

Residual standard error (textbooks)

lm(uclaNew ~ amazNew, data = textbooks) %>%
  summary()

Call:
lm(formula = uclaNew ~ amazNew, data = textbooks)

Residuals:
   Min     1Q Median     3Q    Max
-34.78  -4.57   0.58   4.01  39.00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.9290     1.9354    0.48     0.63
amazNew       1.1990     0.0252   47.60   <2e-16

Residual standard error: 10.5 on 71 degrees of freedom
Multiple R-squared:  0.97,  Adjusted R-squared:  0.969
F-statistic: 2.27e+03 on 1 and 71 DF,  p-value: <2e-16

slide-9
SLIDE 9

Let's practice!

CORRELATION AND REGRESSION IN R

slide-10
SLIDE 10

Comparing model fits

CORRELATION AND REGRESSION IN R

Ben Baumer

Assistant Professor at Smith College

slide-11
SLIDE 11

CORRELATION AND REGRESSION IN R

How well does our textbook model fit?

ggplot(data = textbooks, aes(x = amazNew, y = uclaNew)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

slide-12
SLIDE 12

CORRELATION AND REGRESSION IN R

How well does our possum model fit?

ggplot(data = possum, aes(y = totalL, x = tailL)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

slide-13
SLIDE 13

CORRELATION AND REGRESSION IN R

Null (average) model

For all observations…
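
The equation that followed this line is missing from the transcript. Since the null model is fitted later in the deck as `lm(totalL ~ 1, data = possum)`, an intercept-only model, its fitted value is the same for every observation: the sample mean of the response. In assumed notation ($\bar{y}$ for that mean):

```latex
\hat{y}_i \;=\; \bar{y} \qquad \text{for all } i = 1, \dots, n
```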

slide-14
SLIDE 14

CORRELATION AND REGRESSION IN R

Visualization of null model

slide-15
SLIDE 15

CORRELATION AND REGRESSION IN R

SSE, null model

mod_null <- lm(totalL ~ 1, data = possum)
mod_null %>%
  augment(possum) %>%
  summarize(SSE = sum(.resid^2))

   SSE
1 1914

slide-16
SLIDE 16

CORRELATION AND REGRESSION IN R

SSE, our model

mod_possum <- lm(totalL ~ tailL, data = possum)
mod_possum %>%
  augment() %>%
  summarize(SSE = sum(.resid^2))

   SSE
1 1301

slide-17
SLIDE 17

CORRELATION AND REGRESSION IN R

Coefficient of determination

SST is the SSE for the null model
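
The defining formula is missing from the transcript. Under the usual definition (an assumption about what the slide showed), the coefficient of determination compares the model's SSE with the null model's SSE. The worked numbers use the two possum values reported on the preceding slides and agree with the `Multiple R-squared: 0.32` line in the model summary.

```latex
R^2 \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
\qquad\text{e.g.}\qquad
1 - \frac{1301}{1914} \;\approx\; 0.32
```

Read this as the share of the variability in the response that the model accounts for, relative to always predicting the mean.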

slide-18
SLIDE 18

CORRELATION AND REGRESSION IN R

Connection to correlation

For simple linear regression...
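
The relationship on this slide was lost in extraction; for a regression with a single predictor, the standard result is that the squared correlation between the two variables equals $R^2$. The sketch below checks that numerically on R's built-in `mtcars` data, a stand-in chosen so the code runs without the course datasets. All object names are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the deck's possum or textbooks data.
mod      <- lm(mpg ~ wt, data = mtcars)   # one-predictor model
mod_null <- lm(mpg ~ 1,  data = mtcars)   # null (average) model

sse <- sum(residuals(mod)^2)       # SSE of the fitted model
sst <- sum(residuals(mod_null)^2)  # SSE of the null model, i.e. SST

r2_by_hand  <- 1 - sse / sst
r2_from_cor <- cor(mtcars$mpg, mtcars$wt)^2
r2_reported <- summary(mod)$r.squared

# All three values should match for simple linear regression.
print(c(by_hand = r2_by_hand, from_cor = r2_from_cor, reported = r2_reported))
```

The equality with the squared correlation holds only for one predictor; with several predictors, only the SSE/SST form applies.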

slide-19
SLIDE 19

CORRELATION AND REGRESSION IN R

Summary

summary(mod_possum)

Call:
lm(formula = totalL ~ tailL, data = possum)

Residuals:
   Min     1Q Median     3Q    Max
-9.210 -2.326  0.179  2.777  6.790

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    41.04       6.66    6.16  1.4e-08
tailL           1.24       0.18    6.93  3.9e-10

Residual standard error: 3.57 on 102 degrees of freedom
Multiple R-squared:  0.32,  Adjusted R-squared:  0.313
F-statistic: 48 on 1 and 102 DF,  p-value: 3.94e-10

slide-20
SLIDE 20

CORRELATION AND REGRESSION IN R

Over-reliance on R-squared

slide-21
SLIDE 21

Let's practice!

CORRELATION AND REGRESSION IN R

slide-22
SLIDE 22

Unusual Points

CORRELATION AND REGRESSION IN R

Ben Baumer

Assistant Professor at Smith College

slide-23
SLIDE 23

CORRELATION AND REGRESSION IN R

Unusual points

regulars <- mlbBat10 %>%
  filter(AB > 400)

ggplot(data = regulars, aes(x = SB, y = HR)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

slide-27
SLIDE 27

CORRELATION AND REGRESSION IN R

Leverage
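
The slide's formula is missing. For simple linear regression, the textbook expression for the leverage of observation $i$ (written here in assumed notation, with $x_i$ the explanatory value and $\bar{x}$ its mean) depends only on how far $x_i$ lies from the center of the explanatory variable, not on the response:

```latex
h_i \;=\; \frac{1}{n} \;+\; \frac{\left( x_i - \bar{x} \right)^2}{\sum_{j=1}^{n} \left( x_j - \bar{x} \right)^2}
```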

slide-29
SLIDE 29

CORRELATION AND REGRESSION IN R

Leverage computations

library(broom)

mod <- lm(HR ~ SB, data = regulars)
mod %>%
  augment() %>%
  arrange(desc(.hat)) %>%
  select(HR, SB, .fitted, .resid, .hat) %>%
  head()

  HR SB .fitted  .resid    .hat
1  1 68   2.383  -1.383 0.13082  # Juan Pierre
2  2 52   6.461  -4.461 0.07034
3  5 50   6.971  -1.971 0.06417
4 19 47   7.736  11.264 0.05550
5  5 47   7.736  -2.736 0.05550
6  1 42   9.010  -8.010 0.04261
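
The `.hat` column above can be reproduced without `broom`. The following sketch, which assumes base R only and substitutes the built-in `mtcars` data for the baseball data, computes simple-regression leverage by hand and compares it with `hatvalues()`.

```r
# Stand-in data: built-in mtcars, NOT the mlbBat10 data from the slide.
mod <- lm(mpg ~ wt, data = mtcars)

x <- mtcars$wt
n <- length(x)

# Leverage by hand: distance of each x from the mean of x,
# scaled by the total squared spread of x, plus 1/n.
h_by_hand <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Base R's equivalent of the .hat column returned by augment().
h_builtin <- hatvalues(mod)

# Show the highest-leverage rows, as the slide does with arrange(desc(.hat)).
print(head(sort(h_builtin, decreasing = TRUE)))
```

Leverages from a model with one slope and one intercept sum to 2, so a value far above 2/n marks a point that is unusual in the explanatory variable.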

slide-30
SLIDE 30

CORRELATION AND REGRESSION IN R

Consider Rickey Henderson…

slide-34
SLIDE 34

CORRELATION AND REGRESSION IN R

Influence via Cook's distance

mod <- lm(HR ~ SB, data = regulars_plus)
mod %>%
  augment() %>%
  arrange(desc(.cooksd)) %>%
  select(HR, SB, .fitted, .resid, .hat, .cooksd) %>%
  head()

  HR SB .fitted .resid     .hat .cooksd
1 28 65   5.770 22.230 0.105519 0.33430  # Henderson
2 54  9  17.451 36.549 0.006070 0.04210
3 34 26  13.905 20.095 0.013150 0.02797
4 19 47   9.525  9.475 0.049711 0.02535
5 39  0  19.328 19.672 0.010479 0.02124
6 42 14  16.408 25.592 0.006061 0.02061
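
Cook's distance combines the two ingredients shown in the table: a large residual and high leverage. As a sketch of how the `.cooksd` column can be derived, the code below applies the standard formula by hand and checks it against base R's `cooks.distance()`. It runs on the built-in `mtcars` data as a stand-in (an assumption; `regulars_plus` is not available here), and the object names are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the regulars_plus data from the slide.
mod <- lm(mpg ~ wt, data = mtcars)

e  <- residuals(mod)       # raw residuals
h  <- hatvalues(mod)       # leverage of each observation
p  <- length(coef(mod))    # number of coefficients (2 here)
s2 <- sigma(mod)^2         # estimated residual variance

# Cook's distance: squared residual scaled by the residual variance,
# weighted by a factor that grows quickly as leverage approaches 1.
d_by_hand <- e^2 / (p * s2) * h / (1 - h)^2
d_builtin <- cooks.distance(mod)

# Most influential observations first, mirroring arrange(desc(.cooksd)).
print(head(sort(d_builtin, decreasing = TRUE)))
```

A point with high leverage but a small residual, like Juan Pierre in the earlier table, ends up with a modest Cook's distance under this formula.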

slide-35
SLIDE 35

Let's practice!

CORRELATION AND REGRESSION IN R

slide-36
SLIDE 36

Dealing with Outliers

CORRELATION AND REGRESSION IN R

Ben Baumer

Assistant Professor at Smith College

slide-37
SLIDE 37

CORRELATION AND REGRESSION IN R

Dealing with outliers

ggplot(data = regulars_plus, aes(x = SB, y = HR)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

slide-40
SLIDE 40

CORRELATION AND REGRESSION IN R

The full model

coef(lm(HR ~ SB, data = regulars_plus))

(Intercept)          SB
    19.3282     -0.2086

slide-41
SLIDE 41

CORRELATION AND REGRESSION IN R

Removing outliers that don't fit

regulars <- regulars_plus %>%
  filter(!(SB > 60 & HR > 20)) # remove Henderson

coef(lm(HR ~ SB, data = regulars))

(Intercept)          SB
    19.7169     -0.2549

What is the justification? How does the scope of inference change?

slide-42
SLIDE 42

CORRELATION AND REGRESSION IN R

Removing outliers that do fit

regulars_new <- regulars %>%
  filter(SB < 60) # remove Pierre

coef(lm(HR ~ SB, data = regulars_new))

(Intercept)          SB
    19.6870     -0.2514

What is the justification? How does the scope of inference change?
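
The comparison on the last three slides (fit, drop a point, refit, compare coefficients) can be sketched generically. The example below is illustrative only: it uses the built-in `mtcars` data, and it picks the point to drop by largest Cook's distance, which is an assumption of this sketch and not the deck's rule for choosing Henderson or Pierre.

```r
# Stand-in data: built-in mtcars, NOT the baseball data from the slides.
mod_full <- lm(mpg ~ wt, data = mtcars)

# Hypothetical selection rule for illustration: the single observation
# with the largest Cook's distance.
worst <- which.max(cooks.distance(mod_full))

# Refit without that observation.
trimmed  <- mtcars[-worst, ]
mod_trim <- lm(mpg ~ wt, data = trimmed)

# Side-by-side coefficients show how much one point moves the line.
print(rbind(full = coef(mod_full), trimmed = coef(mod_trim)))
```

The code only measures the effect of removal. Whether removal is justified, and how it narrows the population the model describes, remain the questions posed on the slide.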

slide-43
SLIDE 43

Let's practice!

CORRELATION AND REGRESSION IN R

slide-44
SLIDE 44

Conclusion

CORRELATION AND REGRESSION IN R

Ben Baumer

Assistant Professor at Smith College

slide-45
SLIDE 45

CORRELATION AND REGRESSION IN R

Graphical: scatterplots

slide-46
SLIDE 46

CORRELATION AND REGRESSION IN R

Numerical: correlation

slide-47
SLIDE 47

CORRELATION AND REGRESSION IN R

Numerical: correlation

slide-48
SLIDE 48

CORRELATION AND REGRESSION IN R

Modular: linear regression

slide-49
SLIDE 49

CORRELATION AND REGRESSION IN R

Focus on interpretation

slide-50
SLIDE 50

CORRELATION AND REGRESSION IN R

Objects and formulas

summary(mod)

Call:
lm(formula = uclaNew ~ amazNew, data = textbooks)

Residuals:
   Min     1Q Median     3Q    Max
-34.78  -4.57   0.58   4.01  39.00

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.9290     1.9354    0.48     0.63
amazNew       1.1990     0.0252   47.60   <2e-16

Residual standard error: 10.5 on 71 degrees of freedom
Multiple R-squared:  0.97,  Adjusted R-squared:  0.969
F-statistic: 2.27e+03 on 1 and 71 DF,  p-value: <2e-16

slide-51
SLIDE 51

CORRELATION AND REGRESSION IN R

Model fit

slide-52
SLIDE 52

Let's practice!

CORRELATION AND REGRESSION IN R