STAT 215: Regression Inference
Colin Reimer Dawson, Oberlin College
October 12, 2017


SLIDE 1

STAT 215 Regression Inference

Colin Reimer Dawson

Oberlin College

October 12, 2017

SLIDE 2

Outline

  • Regression Inference
  • Simulation Approaches
  • Partitioning Variability

SLIDE 3

Sample vs Population “Best-Fit” Line

  • For a sample: choose intercept and slope to minimize the sum of squared errors.

  • But this does not yield the “correct” (or even “best”) model for the population, due to sampling error.

[Figure: scatterplot of log10(Price ($K)) vs. Area (sq. ft.), showing the population line and best-fit lines from Samples 1–4]

SLIDE 4

Reminder: Sampling Distributions

Sampling Distribution

The sampling distribution of a sample statistic (e.g., β̂1 for β1, or Ȳ for µY) is the distribution that statistic has across all possible samples from the population.

SLIDE 5

Predicting Home Prices in Ames, Iowa

[Figure: histogram of sample slopes (count vs. Sample Slope) from repeated samples]

SLIDE 6

Two Methods for Estimating Sampling Distribution

  • 1. t-distribution: assumes Normal residuals (along with the other regression conditions)
  • 2. Bootstrap distribution: no Normal assumption needed

SLIDE 7

Normal Residuals

If ε ∼ N(0, σε), then

(β̂i − βi) / SE(β̂i) ∼ t(n−2)

[Figure: density of the standardized slope (β̂1 − β1) / SE(β̂1), which follows the t(n−2) curve]

SLIDE 8

t-based Confidence Interval

CI(1−α): β̂i ± t*(1−α/2, n−2) · SE(β̂i),

where t*(1−α/2, n−2) represents the 1 − α/2 quantile of the t(n−2) distribution.

SLIDE 9

sample10 <- sample(Ames, 10)  ## This would just be our dataset
sample.model <- lm(Price ~ Area, data = sample10)
summary(sample.model)$coefficients %>% round(digits = 3)

              Estimate Std. Error t value Pr(>|t|)
(Intercept) 154146.420  34633.509   4.451    0.002
Area            16.231     15.503   1.047    0.326

MoE.95 <- qt(0.975, df = 10 - 2) * 15.503
CI.95 <- c(16.231 - MoE.95, 16.231 + MoE.95)
CI.95
[1] -19.51898  51.98098

SLIDE 10

confint(sample.model, level = 0.95)
                   2.5 %       97.5 %
(Intercept) 74281.40618 234011.43423
Area          -19.51942     51.98123

SLIDE 11

Correlation Test and Interval

Can also estimate the sampling distribution of the correlation r using t(n−2), where

SE(r) = sqrt((1 − r²) / (n − 2))    (1)

CI(1−α): r ± t*(1−α/2, n−2) · SE(r)    (2)

t_obs = (r − 0) / SE(r)    (3)
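As a sketch of how formulas (1)–(3) combine in practice, here they are applied in base R. The built-in cars data (stopping distance vs. speed) is used as a stand-in, since the slides' Ames data is not bundled with R.

```r
## Correlation inference by hand, following formulas (1)-(3).
## Sketch using R's built-in cars data as a stand-in dataset.
r <- cor(cars$speed, cars$dist)
n <- nrow(cars)

SE.r  <- sqrt((1 - r^2) / (n - 2))                     # (1) standard error of r
CI.95 <- r + c(-1, 1) * qt(0.975, df = n - 2) * SE.r   # (2) t-based 95% CI
t.obs <- (r - 0) / SE.r                                # (3) test statistic for H0: rho = 0

p.value <- 2 * pt(-abs(t.obs), df = n - 2)             # two-sided P-value
```

For comparison, cor.test(cars$speed, cars$dist) reports the same t statistic and P-value (its confidence interval uses the Fisher z transformation, so it differs slightly from the t-based interval above).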

SLIDE 12

Bootstrap Distribution


SLIDE 13

Bootstrap Distribution

Figure: Our actual sample
Figure: Our simulated population

SLIDE 14

Illustrated Simulation

http://lock5stat.com/statkey
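The StatKey bootstrap demo can be reproduced in a few lines of base R. This is a sketch using the built-in cars data as a stand-in (the slides use the Ames home-price data): resample cases with replacement, refit the line, and collect the slopes.

```r
## Bootstrap distribution of the sample slope (sketch; base R, cars data).
set.seed(215)
boot.slopes <- replicate(5000, {
  rows <- sample(nrow(cars), replace = TRUE)          # resample cases with replacement
  coef(lm(dist ~ speed, data = cars[rows, ]))["speed"]
})

## Percentile bootstrap 95% CI for the slope
CI.boot <- quantile(boot.slopes, c(0.025, 0.975))
```

hist(boot.slopes) shows the bootstrap distribution; compare CI.boot to the t-based interval from confint(lm(dist ~ speed, data = cars)).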

SLIDE 15

Permutation Test: Slope

To test H0: β1 = 0, we want the probability that a random β̂1 is as large as or larger than the observed β̂1, assuming H0 is true (β1 = 0).

Permutation Test: Slope

  • 1. Simulate H0 by randomly pairing X and Y values, and computing β̂1 for each pseudodataset.
  • 2. Repeat many times.
  • 3. Calculate the proportion of random β̂1 values that exceed the actual β̂1. This is the P-value of the test.
  • 4. If P < α for a predetermined α, reject H0.
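The four steps above can be sketched in base R (again using the built-in cars data as a stand-in for the slides' examples):

```r
## Permutation test for H0: beta1 = 0 (sketch; base R, cars data).
set.seed(215)
b1.obs <- coef(lm(dist ~ speed, data = cars))["speed"]   # observed slope

## Steps 1-2: shuffle Y to break any X-Y link, refit, repeat many times
perm.slopes <- replicate(5000, {
  y.shuffled <- sample(cars$dist)                        # random re-pairing
  coef(lm(y.shuffled ~ cars$speed))[2]
})

## Step 3: proportion of random slopes at least as large as observed
p.value <- mean(perm.slopes >= b1.obs)
```

For a two-sided test, compare |permuted slope| to |observed slope| instead (step 4 then proceeds the same way).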

SLIDE 16

Permutation Test: Correlation

To test H0: ρ = 0, we want the probability that a random r is as large as or larger than the observed r, assuming H0 is true (ρ = 0).

Permutation Test

  • 1. Simulate H0 by randomly pairing X and Y values, and computing r for each pseudodataset.
  • 2. Repeat many times.
  • 3. Calculate the proportion of random r values that exceed the actual r. This is the P-value of the test.
  • 4. If P < α for a predetermined α, reject H0.

SLIDE 17

Illustrated Simulation

http://lock5stat.com/statkey

SLIDE 18

ANOVA for Regression

Y = f(X) + ε
DATA = PATTERN + IDIOSYNCRASIES

Total Variation = Explained Variation + Unexplained Variation

Y − Ȳ = (Ŷ − Ȳ) + (Y − Ŷ)

Σi (Yi − Ȳ)² = Σi (Ŷi − Ȳ)² + 0 + Σi (Yi − Ŷi)²

(the middle 0 is the cross-product term, which sums to zero)

SSTotal = SSModel + SSError
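The identity SSTotal = SSModel + SSError is easy to check numerically. A sketch in base R, using the built-in cars data as a stand-in:

```r
## Numerical check of SSTotal = SSModel + SSError (base R, cars data).
model <- lm(dist ~ speed, data = cars)
y     <- cars$dist
y.hat <- fitted(model)

SS.total <- sum((y - mean(y))^2)      # total variation
SS.model <- sum((y.hat - mean(y))^2)  # explained variation
SS.error <- sum((y - y.hat)^2)        # unexplained variation

all.equal(SS.total, SS.model + SS.error)   # the cross term vanishes
```

The decomposition holds exactly (up to floating point) because least squares makes the residuals orthogonal to the fitted values.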

SLIDE 19

“Omnibus” F-test for a Regression Model

F = (SSModel / dfModel) / (SSError / dfError) = MSModel / MSError

This statistic has an F distribution with the corresponding df if the null model is correct (i.e., Y = β0 + ε).

BrainBodyWeight <-
  read.file("http://colindawson.net/data/BrainBodyWeight.csv")
brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms),
                  data = BrainBodyWeight)
anova(brain.model)

Analysis of Variance Table

Response: log(brain.weight.grams)
                           Df Sum Sq Mean Sq F value    Pr(>F)
log(body.weight.kilograms)  1 336.19  336.19  697.42 < 2.2e-16 ***
Residuals                  60  28.92    0.48
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SLIDE 20

Proportion of Variability Explained

The Coefficient of Determination (R2)

The coefficient of determination, or R² value, associated with a linear model is the proportional reduction in prediction uncertainty achieved by the regression model compared to the null model, i.e., the proportion of the variation (variance) in Y that is “explained”:

R² = SSModel / SSTotal    (4)

It turns out to be just the square of the correlation! (Show this algebraically.)
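The claim that R² is the square of the correlation can also be verified numerically. A sketch in base R, with the built-in cars data as a stand-in:

```r
## Check R^2 = SSModel/SSTotal = r^2 for simple regression (base R, cars data).
model <- lm(dist ~ speed, data = cars)
r     <- cor(cars$speed, cars$dist)

SS.total <- sum((cars$dist - mean(cars$dist))^2)
SS.model <- sum((fitted(model) - mean(cars$dist))^2)

R.squared <- SS.model / SS.total   # formula (4)
all.equal(R.squared, r^2)                       # TRUE
all.equal(R.squared, summary(model)$r.squared)  # TRUE
```

Note that R² = r² holds only for simple (one-predictor) regression; with multiple predictors, R² is instead the squared correlation between Y and Ŷ.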

SLIDE 21

Example: Restaurant Tips

library("Lock5Data"); library("mosaic")
data("RestaurantTips")
null.tip.model <- lm(Tip ~ 1, data = RestaurantTips)
tip.model.using.bill <- lm(Tip ~ Bill, data = RestaurantTips)

[Figure: Tip ($) vs. Total Bill ($), with the null.tip.model and tip.model.using.bill fit lines]

SLIDE 22

Example: Restaurant Tips

[Figure: histograms of residual tips. Null model: σ̂ε² = 5.861; Bill model: σ̂ε² = 0.953]

SLIDE 23

Regression Summary

summary(brain.model)

Call:
lm(formula = log(brain.weight.grams) ~ log(body.weight.kilograms),
   data = BrainBodyWeight)

Residuals:
     Min       1Q   Median       3Q      Max
-1.71550 -0.49228 -0.06162  0.43597  1.94829

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.13479    0.09604   22.23   <2e-16 ***
log(body.weight.kilograms)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared:  0.9208, Adjusted R-squared:  0.9195
F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16