[PPT] - STAT 213 Regression Inference II Colin Reimer Dawson Oberlin PowerPoint Presentation

SLIDE 1

Outline Key Ideas: Last Time Simulation Approaches Partitioning Variability

STAT 213 Regression Inference II

Colin Reimer Dawson

Oberlin College

18 February 2016

SLIDE 2


Outline

Key Ideas: Last Time
    Influence and Outliers
    Regression Inference
Simulation Approaches
Partitioning Variability

SLIDE 3


Reading Quiz

A regression equation was fit to a set of data for which the correlation, r, between X and Y was 0.6. Which of the following must be true?
(a) The slope of the regression line is 0.6.
(b) The regression model explains 60% of the variability in Y.
(c) The regression model explains 36% of the variability in Y.
(d) At least half of the residuals are smaller than 0.6 in absolute value.

SLIDE 4


For Tuesday...

Write and turn in: Ex. 1.10, 1.12, 1.26, 2.14a, 2.34
Read: Ch. 2.4, 4.6
Answer:
1. Exercise 2.5
2. Exercise 2.6
3. In a randomization distribution to test whether a regression slope is significantly different from zero, the P-value is the proportion of ____ obtained by ____ that exceed ____?

SLIDE 5


Transformations and Outliers

Data Transformations

Can be used to

Address non-linearity
Stabilize (homogenize) variance
“Unskew” residual distribution
Reduce influence of outliers

SLIDE 6


Brain and Body Weight of Terrestrial Mammals

library(mosaic)
BrainBodyWeight <- read.file("http://colinreimerdawson.com/data/BrainBodyWeight.csv")
xyplot(brain.weight.grams ~ body.weight.kilograms,
       data = BrainBodyWeight, type = c("p", "r"))

Figure: Scatterplot of brain.weight.grams against body.weight.kilograms with fitted regression line

SLIDE 7


Brain and Body Weight of Terrestrial Mammals

brain.model <- lm(brain.weight.grams ~ body.weight.kilograms, data = BrainBodyWeight)
par(mfrow = c(1, 2))            # to create a 1-by-2 plotting grid
plot(brain.model, which = 1)    # residuals by predicted
plot(brain.model, which = 2)    # quantile-quantile

Figure: Residuals vs Fitted (left) and Normal Q-Q (right) for brain.model; cases 1, 5, and 34 are labeled

SLIDE 8


Log Brain and Log Body Weight

xyplot(log(brain.weight.grams) ~ log(body.weight.kilograms),
       data = BrainBodyWeight, type = c("p", "r"))

Figure: Scatterplot of log(brain.weight.grams) against log(body.weight.kilograms) with fitted regression line

SLIDE 9


Log Brain and Log Body Weight

log.brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms),
                      data = BrainBodyWeight)
par(mfrow = c(1, 2))
plot(log.brain.model, which = 1)    # residuals by predicted
plot(log.brain.model, which = 2)    # quantile-quantile

Figure: Residuals vs Fitted (left) and Normal Q-Q (right) for log.brain.model; cases 34, 50, and 61 are labeled

SLIDE 10


Percent Brain Weight by Body Weight

library(mosaic)
transform(BrainBodyWeight,
          percent.brain = brain.weight.grams / (body.weight.kilograms * 1000)) %>%
  xyplot(log(percent.brain) ~ log(body.weight.kilograms),
         data = ., type = c("p", "r"))

Figure: Scatterplot of log(percent.brain) against log(body.weight.kilograms) with fitted regression line

SLIDE 11


Percent Brain Weight By Body Weight

Figure: Residuals vs Fitted (left) and Normal Q-Q (right) for the percent brain weight model; cases 34, 50, and 61 are labeled

SLIDE 12


Unusual Cases

Detecting Unusual Cases

Residual plots
Standardized/Studentized residuals
Leverage measurement

SLIDE 13


Men’s Long Jump

library(Stat2Data)
data(LongJumpOlympics)
xyplot(Gold ~ Year, data = LongJumpOlympics, type = c("p", "r"),
       groups = (Year == 1968)    ## highlight the outlier
)

Figure: Scatterplot of Gold against Year with fitted regression line; the 1968 case is highlighted

SLIDE 14


Men’s Long Jump: Residuals

long.jump.model <- lm(Gold ~ Year, data = LongJumpOlympics)
par(mfrow = c(1, 2))
plot(long.jump.model, which = 1)
plot(long.jump.model, which = 2)

Figure: Residuals vs Fitted (left) and Normal Q-Q (right) for long.jump.model; cases 12, 16, and 26 are labeled


SLIDE 16


Influence

Two characteristics contribute to influence of a data point on regression line:

1. Distance in Y from trend (think: residual for line fit w/o that point)
2. Distance of X from $\bar{X}$ (think: distance from center on a see-saw)

SLIDE 17


Standardized and Studentized Residuals

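Standardized Residuals

$$\frac{y_i - \hat{y}_i}{\hat{\sigma}_\varepsilon} \tag{1}$$

“Studentized” Residuals

$$\frac{y_i - \hat{y}_i}{\hat{\sigma}_\varepsilon^{(i)}} \tag{2}$$

where $\hat{\sigma}_\varepsilon^{(i)}$ is the standard deviation of all residuals other than $i$.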
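A minimal sketch of formulas (1) and (2) on simulated data (the data and all variable names are hypothetical, not from the slides). It assumes that the leave-one-out standard deviation in (2) is the residual standard error of the model refit without case i. Note that R's built-in `rstandard()` and `rstudent()` additionally divide by the square root of one minus the leverage, so they are not identical to the simplified formulas above.

```r
# Simulated data (hypothetical; not from the slides)
set.seed(213)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

res <- resid(fit)
sigma.hat <- summary(fit)$sigma    # residual standard error of the full fit

# Formula (1): residual divided by the residual standard error
standardized <- res / sigma.hat

# Formula (2): residual divided by the residual standard error
# of the model refit with case i left out (an assumption, see above)
sigma.loo <- sapply(seq_along(x), function(i) {
  summary(lm(y[-i] ~ x[-i]))$sigma
})
studentized <- res / sigma.loo

# R's built-in versions also rescale by sqrt(1 - leverage)
h <- hatvalues(fit)
all.equal(unname(standardized / sqrt(1 - h)), unname(rstandard(fit)))
all.equal(unname(studentized / sqrt(1 - h)), unname(rstudent(fit)))
```

Cases with a large absolute studentized residual (say, beyond 2 or 3) are the ones worth a closer look.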

SLIDE 18


Leverage

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2} \tag{3}$$

measures the influence of the predictor value on the regression line.
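As a quick check of formula (3), the sketch below computes leverage by hand on simulated data and compares it with R's `hatvalues()`. The data are hypothetical; one predictor value is deliberately placed far from the rest.

```r
# Simulated data (hypothetical): the last case has an extreme predictor value
set.seed(213)
x <- c(runif(19, 0, 10), 25)
y <- 1 + 0.8 * x + rnorm(20)
fit <- lm(y ~ x)

# Formula (3), computed directly
n <- length(x)
h.manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Agrees with the built-in leverage values for simple linear regression
all.equal(h.manual, unname(hatvalues(fit)))

# The case farthest from the mean of x has the largest leverage
which.max(h.manual)
```

Leverage depends only on the predictor values, so it flags cases that could be influential regardless of their response values.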

SLIDE 19


Sample vs Population “Best-Fit” Line

For a sample: choose intercept and slope to minimize sum of squared errors.
But this does not yield the “correct” (or even “best”) model for the population, due to sampling error.

Figure: Husband's Age against Wife's Age, with fitted lines for the Population and for Sample 1, Sample 2, Sample 3, Sample 4
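To make "minimize the sum of squared errors" concrete, here is a small sketch on simulated data (hypothetical values and names, not the marriage-age data). It uses the standard closed-form least-squares solution and checks that nudging either coefficient away from it increases the sum of squared errors.

```r
# Simulated sample (hypothetical; not the data shown on the slide)
set.seed(213)
x <- runif(25, 20, 70)
y <- 2 + x + rnorm(25, sd = 5)

# Sum of squared errors for a candidate intercept b0 and slope b1
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Same answer as lm()
all.equal(c(b0, b1), unname(coef(lm(y ~ x))))

# Moving away from the least-squares line increases the SSE
sse(b0, b1) < sse(b0 + 0.5, b1)
sse(b0, b1) < sse(b0, b1 + 0.01)
```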

SLIDE 20


Reminder: Sampling Distributions

Sampling Distribution

The sampling distribution of a sample statistic (e.g., $\hat{\beta}_1$ for $\beta_1$, or $\bar{Y}$ for $\mu_Y$) is the distribution that statistic has across all possible samples from the population.
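The definition can be approximated by simulation when the population is known. The sketch below builds a hypothetical population (true slope 1; all names and parameter values are assumptions), repeatedly draws samples of size 10, and records the fitted slope each time.

```r
# Hypothetical population with a true slope of 1
set.seed(213)
N <- 5000
population <- data.frame(Wife = runif(N, 20, 70))
population$Husband <- 2 + population$Wife + rnorm(N, sd = 5)

# Draw many samples of size 10 and record the fitted slope from each
slopes <- replicate(1000, {
  s <- population[sample(N, 10), ]
  coef(lm(Husband ~ Wife, data = s))[["Wife"]]
})

# Center and spread of the approximate sampling distribution
mean(slopes)
sd(slopes)
quantile(slopes, c(0.025, 0.975))
```

The standard deviation of `slopes` approximates the standard error of the slope for samples of this size.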

SLIDE 21


Marriage Ages of Male-Female Couples

Figure: Density of the slope predicting male partner's age from female partner's age

SLIDE 22


Two Methods for Estimating Sampling Distribution

1. t-distribution: assumes Normal residuals (along with other regression conditions)
2. Bootstrap distribution: no Normal assumption needed

SLIDE 23


Normal Residuals

If $\varepsilon \sim N(0, \sigma_\varepsilon)$, then

$$\frac{\hat{\beta}_i - \beta_i}{SE_{\hat{\beta}_i}} \sim t_{n-2}$$

Figure: Density of $(\hat{\beta}_1 - \beta_1) / SE_{\hat{\beta}_1}$

SLIDE 24


t-based Confidence Interval

$$CI_{1-\alpha}: \hat{\beta}_i \pm t^{*(1-\alpha/2)}_{n-2} \cdot SE_{\hat{\beta}_i} \tag{4}$$

where $t^{*(1-\alpha/2)}_{n-2}$ represents the $1 - \alpha/2$ quantile of the $t_{n-2}$ distribution.

SLIDE 25

sample.model <- lm(Husband ~ Wife, data = sample1)
summary(sample.model)

Call:
lm(formula = Husband ~ Wife, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3577 -4.9705  0.7395  3.3295  7.5107

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.7455     7.1773  -1.915   0.0918 .
Wife          1.5486     0.1989   7.786  5.3e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.774 on 8 degrees of freedom
Multiple R-squared:  0.8834, Adjusted R-squared:  0.8689
F-statistic: 60.63 on 1 and 8 DF,  p-value: 5.304e-05

MoE.95 <- qt(0.975, df = 8) * 0.1989
(CI.95 <- c(1.5486 - MoE.95, 1.5486 + MoE.95))
[1] 1.089936 2.007264

SLIDE 26


confint(sample.model, level = 0.95)
                 2.5 %   97.5 %
(Intercept) -30.296476 2.805433
Wife          1.089947 2.007218

SLIDE 27


Correlation Test and Interval

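Can also estimate the sampling distribution for the correlation $r$ using $t_{n-2}$, where

$$SE_r = \sqrt{\frac{1 - r^2}{n - 2}} \tag{5}$$

$$CI_{1-\alpha}: r \pm t^{*(1-\alpha/2)}_{n-2} \cdot SE_r \tag{6}$$

$$t_{obs} = \frac{r - 0}{SE_r} \tag{7}$$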
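A short sketch of formulas (5) through (7) on simulated data (hypothetical values and names). The hand-computed t statistic and two-sided P-value should match `cor.test()`, and the t statistic is the same one R reports for the slope in the corresponding simple regression. One caveat: the confidence interval that `cor.test()` reports is based on Fisher's z transformation, so it will generally differ from interval (6).

```r
# Simulated data (hypothetical)
set.seed(213)
n <- 20
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

r <- cor(x, y)
se.r <- sqrt((1 - r^2) / (n - 2))                 # formula (5)
t.star <- qt(0.975, df = n - 2)
ci.95 <- c(r - t.star * se.r, r + t.star * se.r)  # formula (6)
t.obs <- (r - 0) / se.r                           # formula (7)
p.value <- 2 * pt(abs(t.obs), df = n - 2, lower.tail = FALSE)

# Compare with the built-in test
ct <- cor.test(x, y)
all.equal(t.obs, unname(ct$statistic))
all.equal(p.value, ct$p.value)

# Same t statistic as the slope test in the simple regression
all.equal(t.obs, summary(lm(y ~ x))$coefficients["x", "t value"])
```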

SLIDE 28


Bootstrap Distribution

SLIDE 29


Bootstrap Distribution

Figure: Our actual sample Figure: Our simulated population
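One way the bootstrap idea might be sketched in R: treat the sample as a stand-in for the population, resample its rows with replacement, and refit the model each time. The data frame `sim.sample` and its values are simulated for illustration; they are not the `sample1` data used elsewhere in these slides, and the percentile interval is just one common choice.

```r
# A small simulated sample (hypothetical stand-in for real data)
set.seed(213)
sim.sample <- data.frame(Wife = runif(10, 20, 70))
sim.sample$Husband <- 2 + sim.sample$Wife + rnorm(10, sd = 5)

# Resample rows with replacement and refit the slope each time
boot.slopes <- replicate(2000, {
  rows <- sample(nrow(sim.sample), replace = TRUE)
  coef(lm(Husband ~ Wife, data = sim.sample[rows, ]))[["Wife"]]
})

# Bootstrap standard error and a 95% percentile interval for the slope
sd(boot.slopes, na.rm = TRUE)
quantile(boot.slopes, c(0.025, 0.975), na.rm = TRUE)
```

No Normal assumption on the residuals is used here; the spread of `boot.slopes` comes entirely from resampling.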

SLIDE 30


Illustrated Simulation

http://lock5stat.com/statkey

SLIDE 31


Permutation Test: Slope

To test $H_0: \beta_1 = 0$, we want the probability that a random $\hat{\beta}_1$ is as large or larger than the observed $\hat{\beta}_1$, assuming $H_0$ is true: $\beta_1 = 0$.

Permutation Test: Slope

1. Simulate $H_0$ by randomly pairing X and Y values, and computing $\hat{\beta}_1$ for each pseudodataset.
2. Repeat many times.
3. Calculate the proportion of random $\hat{\beta}_1$ that exceed the actual $\hat{\beta}_1$. This is the P-value of the test.
4. If $P < \alpha$ for predetermined $\alpha$, reject $H_0$.
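The steps above can be sketched in base R as follows. The data are simulated and all names are hypothetical. The one-sided P-value follows step 3 as written; the two-sided version, which compares absolute values, is a common alternative and is an addition here rather than something the slide specifies.

```r
# Simulated data (hypothetical) with a real positive relationship
set.seed(213)
n <- 30
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)

obs.slope <- coef(lm(y ~ x))[["x"]]

# Steps 1-2: shuffle y to break the X-Y pairing, refit, repeat many times
perm.slopes <- replicate(2000, {
  coef(lm(sample(y) ~ x))[["x"]]
})

# Step 3: proportion of shuffled slopes at least as large as the observed one
p.upper <- mean(perm.slopes >= obs.slope)

# Two-sided alternative: at least as extreme in either direction
p.two.sided <- mean(abs(perm.slopes) >= abs(obs.slope))

# Step 4: compare with a predetermined alpha
alpha <- 0.05
c(p.upper = p.upper, p.two.sided = p.two.sided, reject = p.upper < alpha)
```

The same loop works for the correlation by replacing the slope with `cor(x, sample(y))`.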

SLIDE 32


Permutation Test: Correlation

To test $H_0: \rho = 0$, we want the probability that a random $r$ is as large or larger than the observed $r$, assuming $H_0$ is true: $\rho = 0$.

Permutation Test

1. Simulate $H_0$ by randomly pairing X and Y values, and computing $r$ for each pseudodataset.
2. Repeat many times.
3. Calculate the proportion of random $r$ that exceed the actual $r$. This is the P-value of the test.
4. If $P < \alpha$ for predetermined $\alpha$, reject $H_0$.

SLIDE 33


Illustrated Simulation

http://lock5stat.com/statkey

SLIDE 34


ANOVA for Regression

$$Y = f(X) + \varepsilon$$

DATA = PATTERN + IDIOSYNCRASIES

Total Variation = Explained Variation + Unexplained Variation

$$Y - \bar{Y} = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$$

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + 0 + \sum_i (Y_i - \hat{Y}_i)^2$$

$$SSTotal = SSModel + SSError$$
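The decomposition can be checked numerically. In the sketch below (simulated data, hypothetical names), the three sums of squares are computed directly from a least-squares fit; the cross-product term, which is the "+ 0" in the middle of the identity, vanishes up to rounding error.

```r
# Simulated data (hypothetical)
set.seed(213)
x <- runif(40, 0, 10)
y <- 3 + 0.7 * x + rnorm(40)
fit <- lm(y ~ x)
y.hat <- fitted(fit)

ss.total <- sum((y - mean(y))^2)        # total variation
ss.model <- sum((y.hat - mean(y))^2)    # explained variation
ss.error <- sum((y - y.hat)^2)          # unexplained variation

# Cross term from expanding the square; zero for a least-squares fit
cross.term <- 2 * sum((y.hat - mean(y)) * (y - y.hat))

all.equal(ss.total, ss.model + ss.error)
cross.term
```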

SLIDE 35


“Omnibus” F-test for a Regression Model

$$F = \frac{SSModel / df_{Model}}{SSError / df_{Error}} = \frac{MSModel}{MSError}$$

This statistic has an F distribution with the corresponding df if the null model is correct (i.e., $Y = \beta_0 + \varepsilon$).

brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms),
                  data = BrainBodyWeight)
anova(brain.model)

Analysis of Variance Table

Response: log(brain.weight.grams)
                           Df Sum Sq Mean Sq F value    Pr(>F)
log(body.weight.kilograms)  1 336.19  336.19  697.42 < 2.2e-16 ***
Residuals                  60  28.92    0.48
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
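To connect the formula for F with the ANOVA table, the following sketch computes the statistic and its P-value by hand on simulated data (hypothetical names and values, not the brain weight data) and compares them with `anova()`. With a single predictor, the model has 1 degree of freedom and the error has n - 2.

```r
# Simulated data (hypothetical)
set.seed(213)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 0.7 * x + rnorm(n)
fit <- lm(y ~ x)

ss.model <- sum((fitted(fit) - mean(y))^2)
ss.error <- sum(resid(fit)^2)
df.model <- 1        # one predictor
df.error <- n - 2    # n cases minus two estimated coefficients

F.obs <- (ss.model / df.model) / (ss.error / df.error)
p.value <- pf(F.obs, df.model, df.error, lower.tail = FALSE)

# Compare with the ANOVA table
tab <- anova(fit)
all.equal(F.obs, tab[["F value"]][1])
all.equal(p.value, tab[["Pr(>F)"]][1])

# In simple regression, F is the square of the slope's t statistic
all.equal(F.obs, summary(fit)$coefficients["x", "t value"]^2)
```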

SLIDE 36


Proportion of Variability Explained

The Coefficient of Determination (R2)

The coefficient of determination, or $R^2$ value, associated with a linear model is the percent reduction in prediction uncertainty achieved by the regression model compared to the null model. I.e., what proportion of the variation (variance) in y is “explained”:

$$R^2 = \frac{SSModel}{SSTotal} \tag{8}$$

Turns out to just be the square of the correlation! (Show this algebraically)
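One possible route to the algebraic result, for simple linear regression with a single predictor x. The shorthand $S_{xx} = \sum_i (x_i - \bar{x})^2$, $S_{yy} = \sum_i (Y_i - \bar{Y})^2 = SSTotal$, and $S_{xy} = \sum_i (x_i - \bar{x})(Y_i - \bar{Y})$ is introduced here for convenience and does not appear in the slides.

```latex
\begin{align*}
\hat{Y}_i - \bar{Y} &= \hat{\beta}_1 (x_i - \bar{x})
  && \text{since } \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x} \\
SSModel &= \sum_i (\hat{Y}_i - \bar{Y})^2
  = \hat{\beta}_1^2 \, S_{xx}
  = \frac{S_{xy}^2}{S_{xx}}
  && \text{since } \hat{\beta}_1 = S_{xy} / S_{xx} \\
R^2 &= \frac{SSModel}{SSTotal}
  = \frac{S_{xy}^2}{S_{xx} \, S_{yy}}
  = \left( \frac{S_{xy}}{\sqrt{S_{xx} \, S_{yy}}} \right)^2
  = r^2
\end{align*}
```

For the reading-quiz setting with r = 0.6, this gives $R^2 = 0.36$.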

SLIDE 37


Regression Summary

summary(brain.model)

Call:
lm(formula = log(brain.weight.grams) ~ log(body.weight.kilograms),
    data = BrainBodyWeight)

Residuals:
     Min       1Q   Median       3Q      Max
-1.71550 -0.49228 -0.06162  0.43597  1.94829

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.13479    0.09604   22.23   <2e-16 ***
log(body.weight.kilograms)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared:  0.9208, Adjusted R-squared:  0.9195
F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16
