STAT 215: Regression Inference
Colin Reimer Dawson, Oberlin College
October 12, 2017


SLIDE 1

STAT 215 Regression Inference

Colin Reimer Dawson

Oberlin College

October 12, 2017

SLIDE 2

Outline

  • Regression Inference
  • Simulation Approaches
  • Partitioning Variability

SLIDE 3

Sample vs Population “Best-Fit” Line

  • For a sample: choose intercept and slope to minimize the sum of squared errors.

  • But this does not yield the “correct” (or even “best”) model for the population, due to sampling error.

[Figure: scatterplot of log10(Price ($K)) vs. Area (sq. ft.), showing the population line and best-fit lines from Samples 1–4]

SLIDE 4

Reminder: Sampling Distributions

Sampling Distribution

The sampling distribution of a sample statistic (e.g., β̂1 for β1, or Ȳ for µY) is the distribution that statistic has across all possible samples from the population.

SLIDE 5

Predicting Home Prices in Ames, Iowa

[Figure: histogram of sample slopes (count vs. Sample Slope) from repeated samples]

SLIDE 6

Two Methods for Estimating Sampling Distribution

  • 1. t-distribution: assumes Normal residuals (along with the other regression conditions)
  • 2. Bootstrap distribution: no Normal assumption needed

SLIDE 7

Normal Residuals

If ε ∼ N(0, σε), then

(β̂i − βi) / SE(β̂i) ∼ t(n−2)

[Figure: density of the standardized slope (β̂1 − β1) / SE(β̂1), which follows the t(n−2) curve]

SLIDE 8

t-based Confidence Interval

CI(1−α): β̂i ± t*(1−α/2, n−2) · SE(β̂i),

where t*(1−α/2, n−2) represents the 1 − α/2 quantile of the t(n−2) distribution.

SLIDE 9

sample10 <- sample(Ames, 10)  ## This would just be our dataset
sample.model <- lm(Price ~ Area, data = sample10)
summary(sample.model)$coefficients %>% round(digits = 3)

              Estimate Std. Error t value Pr(>|t|)
(Intercept) 154146.420  34633.509   4.451    0.002
Area            16.231     15.503   1.047    0.326

MoE.95 <- qt(0.975, df = 10 - 2) * 15.503
CI.95 <- c(16.231 - MoE.95, 16.231 + MoE.95)
CI.95
[1] -19.51898  51.98098

SLIDE 10

confint(sample.model, level = 0.95)
                   2.5 %       97.5 %
(Intercept) 74281.40618 234011.43423
Area          -19.51942     51.98123

SLIDE 11

Correlation Test and Interval

Can also estimate the sampling distribution of the correlation r using t(n−2), where

SE(r) = sqrt((1 − r²) / (n − 2))    (1)

CI(1−α): r ± t*(1−α/2, n−2) · SE(r)    (2)

t_obs = (r − 0) / SE(r)    (3)
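As a sketch of how formulas (1)–(3) combine in practice, here they are applied in base R. The built-in cars data (stopping distance vs. speed) is used as a stand-in, since the slides' Ames data is not bundled with R.

```r
## Correlation inference by hand, following formulas (1)-(3).
## Sketch using R's built-in cars data as a stand-in dataset.
r <- cor(cars$speed, cars$dist)
n <- nrow(cars)

SE.r  <- sqrt((1 - r^2) / (n - 2))                     # (1) standard error of r
CI.95 <- r + c(-1, 1) * qt(0.975, df = n - 2) * SE.r   # (2) t-based 95% CI
t.obs <- (r - 0) / SE.r                                # (3) test statistic for H0: rho = 0

p.value <- 2 * pt(-abs(t.obs), df = n - 2)             # two-sided P-value
```

For comparison, cor.test(cars$speed, cars$dist) reports the same t statistic and P-value (its confidence interval uses the Fisher z transformation, so it differs slightly from the t-based interval above).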

SLIDE 12

Bootstrap Distribution


SLIDE 13

Bootstrap Distribution

Figure: Our actual sample
Figure: Our simulated population

SLIDE 14

Illustrated Simulation

http://lock5stat.com/statkey
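The StatKey bootstrap demo can be reproduced in a few lines of base R. This is a sketch using the built-in cars data as a stand-in (the slides use the Ames home-price data): resample cases with replacement, refit the line, and collect the slopes.

```r
## Bootstrap distribution of the sample slope (sketch; base R, cars data).
set.seed(215)
boot.slopes <- replicate(5000, {
  rows <- sample(nrow(cars), replace = TRUE)          # resample cases with replacement
  coef(lm(dist ~ speed, data = cars[rows, ]))["speed"]
})

## Percentile bootstrap 95% CI for the slope
CI.boot <- quantile(boot.slopes, c(0.025, 0.975))
```

hist(boot.slopes) shows the bootstrap distribution; compare CI.boot to the t-based interval from confint(lm(dist ~ speed, data = cars)).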

SLIDE 15

Permutation Test: Slope

To test H0: β1 = 0, we want the probability that a random β̂1 is as large as or larger than the observed β̂1, assuming H0 is true (β1 = 0).

Permutation Test: Slope

  • 1. Simulate H0 by randomly pairing X and Y values, and computing β̂1 for each pseudodataset.
  • 2. Repeat many times.
  • 3. Calculate the proportion of random β̂1 values that exceed the actual β̂1. This is the P-value of the test.
  • 4. If P < α for a predetermined α, reject H0.
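The four steps above can be sketched in base R (again using the built-in cars data as a stand-in for the slides' examples):

```r
## Permutation test for H0: beta1 = 0 (sketch; base R, cars data).
set.seed(215)
b1.obs <- coef(lm(dist ~ speed, data = cars))["speed"]   # observed slope

## Steps 1-2: shuffle Y to break any X-Y link, refit, repeat many times
perm.slopes <- replicate(5000, {
  y.shuffled <- sample(cars$dist)                        # random re-pairing
  coef(lm(y.shuffled ~ cars$speed))[2]
})

## Step 3: proportion of random slopes at least as large as observed
p.value <- mean(perm.slopes >= b1.obs)
```

For a two-sided test, compare |permuted slope| to |observed slope| instead (step 4 then proceeds the same way).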

SLIDE 16

Permutation Test: Correlation

To test H0: ρ = 0, we want the probability that a random r is as large as or larger than the observed r, assuming H0 is true (ρ = 0).

Permutation Test

  • 1. Simulate H0 by randomly pairing X and Y values, and computing r for each pseudodataset.
  • 2. Repeat many times.
  • 3. Calculate the proportion of random r values that exceed the actual r. This is the P-value of the test.
  • 4. If P < α for a predetermined α, reject H0.

SLIDE 17

Illustrated Simulation

http://lock5stat.com/statkey

SLIDE 18

ANOVA for Regression

Y = f(X) + ε
DATA = PATTERN + IDIOSYNCRASIES

Total Variation = Explained Variation + Unexplained Variation

Y − Ȳ = (Ŷ − Ȳ) + (Y − Ŷ)

Σi (Yi − Ȳ)² = Σi (Ŷi − Ȳ)² + 0 + Σi (Yi − Ŷi)²

(the middle 0 is the cross-product term, which sums to zero)

SSTotal = SSModel + SSError
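The identity SSTotal = SSModel + SSError is easy to check numerically. A sketch in base R, using the built-in cars data as a stand-in:

```r
## Numerical check of SSTotal = SSModel + SSError (base R, cars data).
model <- lm(dist ~ speed, data = cars)
y     <- cars$dist
y.hat <- fitted(model)

SS.total <- sum((y - mean(y))^2)      # total variation
SS.model <- sum((y.hat - mean(y))^2)  # explained variation
SS.error <- sum((y - y.hat)^2)        # unexplained variation

all.equal(SS.total, SS.model + SS.error)   # the cross term vanishes
```

The decomposition holds exactly (up to floating point) because least squares makes the residuals orthogonal to the fitted values.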

SLIDE 19

“Omnibus” F-test for a Regression Model

F = (SSModel / dfModel) / (SSError / dfError) = MSModel / MSError

This statistic has an F distribution with the corresponding df if the null model is correct (i.e., Y = β0 + ε).

BrainBodyWeight <-
  read.file("http://colindawson.net/data/BrainBodyWeight.csv")
brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms),
                  data = BrainBodyWeight)
anova(brain.model)

Analysis of Variance Table

Response: log(brain.weight.grams)
                           Df Sum Sq Mean Sq F value    Pr(>F)
log(body.weight.kilograms)  1 336.19  336.19  697.42 < 2.2e-16 ***
Residuals                  60  28.92    0.48
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SLIDE 20

Proportion of Variability Explained

The Coefficient of Determination (R2)

The coefficient of determination, or R² value, associated with a linear model is the proportional reduction in prediction uncertainty achieved by the regression model compared to the null model, i.e., the proportion of the variation (variance) in Y that is “explained”:

R² = SSModel / SSTotal    (4)

It turns out to be just the square of the correlation! (Show this algebraically.)
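The claim that R² is the square of the correlation can also be verified numerically. A sketch in base R, with the built-in cars data as a stand-in:

```r
## Check R^2 = SSModel/SSTotal = r^2 for simple regression (base R, cars data).
model <- lm(dist ~ speed, data = cars)
r     <- cor(cars$speed, cars$dist)

SS.total <- sum((cars$dist - mean(cars$dist))^2)
SS.model <- sum((fitted(model) - mean(cars$dist))^2)

R.squared <- SS.model / SS.total   # formula (4)
all.equal(R.squared, r^2)                       # TRUE
all.equal(R.squared, summary(model)$r.squared)  # TRUE
```

Note that R² = r² holds only for simple (one-predictor) regression; with multiple predictors, R² is instead the squared correlation between Y and Ŷ.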

SLIDE 21

Example: Restaurant Tips

library("Lock5Data"); library("mosaic")
data("RestaurantTips")
null.tip.model <- lm(Tip ~ 1, data = RestaurantTips)
tip.model.using.bill <- lm(Tip ~ Bill, data = RestaurantTips)

[Figure: Tip ($) vs. Total Bill ($), with the null.tip.model and tip.model.using.bill fit lines]

SLIDE 22

Example: Restaurant Tips

[Figure: histograms of residual tips. Null model: σ̂ε² = 5.861; Bill model: σ̂ε² = 0.953]

SLIDE 23

Regression Summary

summary(brain.model)

Call:
lm(formula = log(brain.weight.grams) ~ log(body.weight.kilograms),
   data = BrainBodyWeight)

Residuals:
     Min       1Q   Median       3Q      Max
-1.71550 -0.49228 -0.06162  0.43597  1.94829

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.13479    0.09604   22.23   <2e-16 ***
log(body.weight.kilograms)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared:  0.9208, Adjusted R-squared:  0.9195
F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16