STAT 213 Regression Inference II, Colin Reimer Dawson, Oberlin College



SLIDE 1

Outline Key Ideas: Last Time Simulation Approaches Partitioning Variability

STAT 213 Regression Inference II

Colin Reimer Dawson

Oberlin College

18 February 2016

SLIDE 2

Outline

  • Key Ideas: Last Time
      • Influence and Outliers
      • Regression Inference
  • Simulation Approaches
  • Partitioning Variability

SLIDE 3

Reading Quiz

A regression equation was fit to a set of data for which the correlation, r, between X and Y was 0.6. Which of the following must be true?

  (a) The slope of the regression line is 0.6.
  (b) The regression model explains 60% of the variability in Y.
  (c) The regression model explains 36% of the variability in Y.
  (d) At least half of the residuals are smaller than 0.6 in absolute value.

SLIDE 4

For Tuesday...

  • Write and turn in: Ex. 1.10, 1.12, 1.26, 2.14a, 2.34
  • Read: Ch. 2.4, 4.6
  • Answer:
      1. Exercise 2.5
      2. Exercise 2.6
      3. In a randomization distribution to test whether a regression slope is significantly different from zero, the P-value is the proportion of ______ obtained by ______ that exceed ______?

SLIDE 5

Transformations and Outliers

Data Transformations

Can be used to

  • Address non-linearity
  • Stabilize (homogenize) variance
  • “Unskew” residual distribution
  • Reduce influence of outliers
SLIDE 6

Brain and Body Weight of Terrestrial Mammals

library(mosaic)
BrainBodyWeight <- read.file("http://colinreimerdawson.com/data/BrainBodyWeight.csv")
xyplot(
  brain.weight.grams ~ body.weight.kilograms,
  data = BrainBodyWeight,
  type = c("p", "r"))

[Figure: scatterplot of brain.weight.grams against body.weight.kilograms with the fitted regression line]

SLIDE 7

Brain and Body Weight of Terrestrial Mammals

brain.model <- lm(brain.weight.grams ~ body.weight.kilograms, data = BrainBodyWeight)
par(mfrow = c(1,2))          # to create a 1-by-2 plotting grid
plot(brain.model, which = 1) # residuals by predicted
plot(brain.model, which = 2) # quantile-quantile

[Figure: Residuals vs Fitted plot and Normal Q-Q plot of standardized residuals; cases 1, 5, and 34 are labeled]

SLIDE 8

Log Brain and Log Body Weight

xyplot(
  log(brain.weight.grams) ~ log(body.weight.kilograms),
  data = BrainBodyWeight,
  type = c("p", "r"))

[Figure: scatterplot of log(brain.weight.grams) against log(body.weight.kilograms) with the fitted regression line]

SLIDE 9

Log Brain and Log Body Weight

log.brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms), data = BrainBodyWeight)
par(mfrow = c(1,2))
plot(log.brain.model, which = 1) # residuals by predicted
plot(log.brain.model, which = 2) # quantile-quantile

[Figure: Residuals vs Fitted plot and Normal Q-Q plot of standardized residuals; cases 34, 50, and 61 are labeled]

SLIDE 10

Percent Brain Weight by Body Weight

library(mosaic)
transform(
  BrainBodyWeight,
  percent.brain = brain.weight.grams / (body.weight.kilograms * 1000)
) %>%
  xyplot(
    log(percent.brain) ~ log(body.weight.kilograms),
    data = .,
    type = c("p", "r"))

[Figure: scatterplot of log(percent.brain) against log(body.weight.kilograms) with the fitted regression line]

SLIDE 11

Percent Brain Weight By Body Weight

[Figure: Residuals vs Fitted plot and Normal Q-Q plot of standardized residuals; cases 34, 50, and 61 are labeled]

SLIDE 12

Unusual Cases

Detecting Unusual Cases

  • Residual plots
  • Standardized/Studentized residuals
  • Leverage measurement
SLIDE 13

Men’s Long Jump

library(Stat2Data)
data(LongJumpOlympics)
xyplot(
  Gold ~ Year,
  data = LongJumpOlympics,
  type = c("p", "r"),
  groups = (Year == 1968)  ## highlight the outlier
)

[Figure: scatterplot of Gold against Year with the fitted regression line; the 1968 point is highlighted]

SLIDE 14

Men’s Long Jump: Residuals

long.jump.model <- lm(Gold ~ Year, data = LongJumpOlympics)
par(mfrow = c(1,2))
plot(long.jump.model, which = 1)
plot(long.jump.model, which = 2)

[Figure: Residuals vs Fitted plot and Normal Q-Q plot of standardized residuals; cases 12, 16, and 26 are labeled]

SLIDE 16

Influence

Two characteristics contribute to the influence of a data point on the regression line:

  1. Distance in Y from the trend (think: residual for the line fit without that point)
  2. Distance of X from $\bar{X}$ (think: distance from the center on a see-saw)

SLIDE 17

Standardized and Studentized Residuals

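Standardized Residuals

$$\frac{y_i - \hat{y}_i}{\hat{\sigma}_\varepsilon} \qquad (1)$$

“Studentized” Residuals

$$\frac{y_i - \hat{y}_i}{\hat{\sigma}^{(i)}_\varepsilon} \qquad (2)$$

where $\hat{\sigma}^{(i)}_\varepsilon$ is the standard deviation of all residuals other than $i$.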
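
A minimal sketch of how formulas (1) and (2) might be computed in R. It uses the built-in `cars` data as a stand-in (an assumption; it is not one of the lecture's datasets), and the variable names are illustrative. Note that R's own `rstandard()` and `rstudent()` additionally divide by $\sqrt{1 - h_i}$, where $h_i$ is the leverage, so they differ slightly from the simplified formulas.

```r
# Stand-in data: the built-in 'cars' dataset (not from the lecture)
model <- lm(dist ~ speed, data = cars)

res       <- resid(model)
sigma.hat <- summary(model)$sigma     # estimated residual SD from the full fit

# Formula (1): residual divided by the overall residual SD
simple.standardized <- res / sigma.hat

# Formula (2): divide by the residual SD from the fit that leaves out case i
sigma.loo <- influence(model)$sigma   # leave-one-out residual SD for each case
simple.studentized <- res / sigma.loo

# R's built-in versions also adjust each residual for its leverage h_i
h <- hatvalues(model)
builtin.standardized <- rstandard(model)  # res / (sigma.hat * sqrt(1 - h))
builtin.studentized  <- rstudent(model)   # res / (sigma.loo * sqrt(1 - h))

head(cbind(simple.standardized, builtin.standardized,
           simple.studentized,  builtin.studentized))
```

Cases with a large absolute value in either column are candidates for a closer look.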

SLIDE 18

Leverage

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2} \qquad (3)$$

measures the influence of the predictor value on the regression line.
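
As a quick check of formula (3), the leverage can be computed by hand and compared with R's `hatvalues()`. This is a sketch using the built-in `cars` data as a stand-in (an assumption, not a lecture dataset).

```r
# Stand-in data: the built-in 'cars' dataset (not from the lecture)
model <- lm(dist ~ speed, data = cars)

x <- cars$speed
n <- length(x)

# Formula (3): 1/n plus the squared distance from the mean of x,
# as a share of the total squared distance
h.by.hand <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Leverage as computed by R
h.builtin <- hatvalues(model)

all.equal(h.by.hand, unname(h.builtin))

# In simple linear regression the leverages sum to 2
# (one for the intercept, one for the slope)
sum(h.by.hand)
```

Points whose x value is far from the mean get the largest $h_i$, which matches the see-saw picture.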

SLIDE 19

Sample vs Population “Best-Fit” Line

  • For a sample: choose intercept and slope to minimize the sum of squared errors.
  • But this does not yield the “correct” (or even “best”) model for the population, due to sampling error.

[Figure: Husband's Age against Wife's Age, showing the population line and the lines fit to Samples 1 through 4]

SLIDE 20

Reminder: Sampling Distributions

Sampling Distribution

The sampling distribution of a sample statistic (e.g., $\hat{\beta}_1$ for $\beta_1$, or $\bar{Y}$ for $\mu_Y$) is the distribution that statistic has across all possible samples from the population.
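
The definition can be illustrated by simulation. The sketch below builds a made-up population of couples (the ages, the true slope of 1, and the noise level are all assumptions chosen for illustration, not the lecture's data), draws many samples of size 10, and records the fitted slope from each.

```r
set.seed(213)

# Hypothetical population of 10,000 couples (illustrative values only)
N <- 10000
wife    <- runif(N, min = 20, max = 70)
husband <- 2 + 1.0 * wife + rnorm(N, sd = 5)
population <- data.frame(wife, husband)

# Fit the regression to one random sample of size n and return the slope
sample.slope <- function(n) {
  rows <- sample(nrow(population), n)
  coef(lm(husband ~ wife, data = population[rows, ]))[["wife"]]
}

# Approximate the sampling distribution with 2000 samples of size 10
slopes <- replicate(2000, sample.slope(10))

mean(slopes)                       # centered near the population slope
sd(slopes)                         # spread = standard error of the slope
quantile(slopes, c(0.025, 0.975))  # middle 95% of sample slopes
```

In practice only one sample is available, which is why the sampling distribution has to be estimated.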

SLIDE 21

Marriage Ages of Male-Female Couples

[Figure: density of the slope predicting male partner's age from female partner's age]

SLIDE 22

Two Methods for Estimating Sampling Distribution

  1. t-distribution: assumes Normal residuals (along with other regression conditions)
  2. Bootstrap distribution: no Normal assumption needed
SLIDE 23

Normal Residuals

If $\varepsilon \sim N(0, \sigma_\varepsilon)$, then

$$\frac{\hat{\beta}_i - \beta_i}{SE_{\hat{\beta}_i}} \sim t_{n-2}$$

[Figure: density of $(\hat{\beta}_1 - \beta_1) / SE_{\hat{\beta}_1}$]

SLIDE 24

t-based Confidence Interval

$$CI_{1-\alpha}: \hat{\beta}_i \pm t^{*(1-\alpha/2)}_{n-2} \cdot SE_{\hat{\beta}_i} \qquad (4)$$

where $t^{*(1-\alpha/2)}_{n-2}$ represents the $1 - \alpha/2$ quantile of the $t_{n-2}$ distribution.

SLIDE 25

sample.model <- lm(Husband ~ Wife, data = sample1)
summary(sample.model)

Call:
lm(formula = Husband ~ Wife, data = sample1)

Residuals:
    Min      1Q  Median      3Q     Max
-7.3577 -4.9705  0.7395  3.3295  7.5107

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -13.7455     7.1773  -1.915   0.0918 .
Wife          1.5486     0.1989   7.786  5.3e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.774 on 8 degrees of freedom
Multiple R-squared:  0.8834,    Adjusted R-squared:  0.8689
F-statistic: 60.63 on 1 and 8 DF,  p-value: 5.304e-05

MoE.95 <- qt(0.975, df = 8) * 0.1989
(CI.95 <- c(1.5486 - MoE.95, 1.5486 + MoE.95))
[1] 1.089936 2.007264

SLIDE 26

confint(sample.model, level = 0.95)
                 2.5 %   97.5 %
(Intercept) -30.296476 2.805433
Wife          1.089947 2.007218

SLIDE 27

Correlation Test and Interval

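Can also estimate the distribution for the correlation $r$ using $t_{n-2}$, where

$$SE_r = \sqrt{\frac{1 - r^2}{n - 2}} \qquad (5)$$

$$CI_{1-\alpha}: r \pm t^{*(1-\alpha/2)}_{n-2} \cdot SE_r \qquad (6)$$

$$t_{obs} = \frac{r - 0}{SE_r} \qquad (7)$$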
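
A small sketch of formulas (5) through (7) in R, again using the built-in `cars` data as a stand-in (an assumption). The test statistic from (7) should agree with `cor.test()`. The interval from (6) will not match the one `cor.test()` reports, because that function builds its interval with Fisher's z transformation instead.

```r
# Stand-in data: the built-in 'cars' dataset (not from the lecture)
x <- cars$speed
y <- cars$dist
n <- length(x)

r     <- cor(x, y)
SE.r  <- sqrt((1 - r^2) / (n - 2))   # formula (5)
t.obs <- (r - 0) / SE.r              # formula (7)

# Two-sided P-value from the t distribution with n - 2 df
p.value <- 2 * pt(abs(t.obs), df = n - 2, lower.tail = FALSE)

# Formula (6): t-based 95% interval for the correlation
t.star <- qt(0.975, df = n - 2)
CI.95  <- c(r - t.star * SE.r, r + t.star * SE.r)

# Compare the hand-computed test with R's built-in correlation test
ct <- cor.test(x, y)
c(by.hand = t.obs, cor.test = unname(ct$statistic))
```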

SLIDE 28

Bootstrap Distribution

SLIDE 29

Bootstrap Distribution

Figure: Our actual sample
Figure: Our simulated population
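
One way to sketch the bootstrap idea in base R: treat the sample as a stand-in for the population, resample its rows with replacement, and refit the model each time. The `cars` data and the helper name `boot.slope` are assumptions for illustration; the lecture may use different tools for this.

```r
set.seed(213)

# Stand-in data: the built-in 'cars' dataset (not from the lecture)
n <- nrow(cars)
observed.slope <- coef(lm(dist ~ speed, data = cars))[["speed"]]

# Resample n rows with replacement and return the refitted slope
boot.slope <- function() {
  rows <- sample(n, n, replace = TRUE)
  coef(lm(dist ~ speed, data = cars[rows, ]))[["speed"]]
}

# Bootstrap distribution of the slope
boot.slopes <- replicate(2000, boot.slope())

# Percentile 95% interval from the bootstrap distribution
boot.CI <- quantile(boot.slopes, c(0.025, 0.975))
boot.CI

# t-based interval for comparison
confint(lm(dist ~ speed, data = cars))["speed", ]
```

The percentile interval is only one of several ways to turn a bootstrap distribution into an interval; it is used here because it needs no Normal assumption.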

SLIDE 30

Illustrated Simulation

http://lock5stat.com/statkey

SLIDE 31

Permutation Test: Slope

To test $H_0: \beta_1 = 0$, we want the probability that a random $\hat{\beta}_1$ is as large as or larger than the observed $\hat{\beta}_1$, assuming $H_0$ is true: $\beta_1 = 0$.

Permutation Test: Slope

  1. Simulate $H_0$ by randomly pairing X and Y values, and computing $\hat{\beta}_1$ for each pseudodataset.
  2. Repeat many times.
  3. Calculate the proportion of random $\hat{\beta}_1$ that exceed the actual $\hat{\beta}_1$. This is the P-value of the test.
  4. If $P < \alpha$ for a predetermined $\alpha$, reject $H_0$.
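
The steps above can be sketched in base R as follows. The `cars` data and the helper name `perm.slope` are assumptions for illustration. Shuffling the response breaks any link between X and Y, which is what $H_0$ claims.

```r
set.seed(213)

# Stand-in data: the built-in 'cars' dataset (not from the lecture)
observed <- coef(lm(dist ~ speed, data = cars))[["speed"]]

# Step 1: randomly re-pair X and Y by shuffling Y, then refit the slope
perm.slope <- function() {
  shuffled <- sample(cars$dist)
  coef(lm(shuffled ~ cars$speed))[[2]]
}

# Step 2: repeat many times
perm.slopes <- replicate(2000, perm.slope())

# Step 3: proportion of random slopes at least as large as the observed one
p.one.sided <- mean(perm.slopes >= observed)

# A two-sided version compares absolute values instead
p.two.sided <- mean(abs(perm.slopes) >= abs(observed))

# Step 4: compare with a predetermined alpha
alpha <- 0.05
p.two.sided < alpha
```

The same loop works for the correlation test by replacing the slope with `cor()` of the shuffled pairs.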
SLIDE 32

Permutation Test: Correlation

To test $H_0: \rho = 0$, we want the probability that a random $r$ is as large as or larger than the observed $r$, assuming $H_0$ is true: $\rho = 0$.

Permutation Test

  1. Simulate $H_0$ by randomly pairing X and Y values, and computing $r$ for each pseudodataset.
  2. Repeat many times.
  3. Calculate the proportion of random $r$ that exceed the actual $r$. This is the P-value of the test.
  4. If $P < \alpha$ for a predetermined $\alpha$, reject $H_0$.
SLIDE 33

Illustrated Simulation

http://lock5stat.com/statkey

SLIDE 34

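ANOVA for Regression

$$Y = f(X) + \varepsilon$$

DATA = PATTERN + IDIOSYNCRASIES

Total Variation = Explained Variation + Unexplained Variation

$$Y - \bar{Y} = (\hat{Y} - \bar{Y}) + (Y - \hat{Y})$$

$$\sum_i (Y_i - \bar{Y})^2 = \sum_i (\hat{Y}_i - \bar{Y})^2 + 0 + \sum_i (Y_i - \hat{Y}_i)^2$$

(The 0 is the cross-product term, which vanishes for a least-squares fit.)

SSTotal = SSModel + SSError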
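
The decomposition can be checked numerically. This sketch uses the built-in `cars` data as a stand-in (an assumption) and computes each sum of squares directly, along with the cross-product term that appears as 0 in the identity.

```r
# Stand-in data: the built-in 'cars' dataset (not from the lecture)
model <- lm(dist ~ speed, data = cars)

y     <- cars$dist
y.hat <- fitted(model)
y.bar <- mean(y)

SSTotal <- sum((y - y.bar)^2)      # total variation
SSModel <- sum((y.hat - y.bar)^2)  # explained variation
SSError <- sum((y - y.hat)^2)      # unexplained variation

# Cross-product term from expanding the square; zero for least squares
cross <- 2 * sum((y.hat - y.bar) * (y - y.hat))

c(SSTotal = SSTotal, SSModel.plus.SSError = SSModel + SSError, cross = cross)
```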

SLIDE 35

“Omnibus” F-test for a Regression Model

$$F = \frac{SSModel / df_{Model}}{SSError / df_{Error}} = \frac{MSModel}{MSError}$$

This statistic has an F distribution with the corresponding df if the null model is correct (i.e., $Y = \beta_0 + \varepsilon$).

brain.model <- lm(log(brain.weight.grams) ~ log(body.weight.kilograms), data = BrainBodyWeight)
anova(brain.model)

Analysis of Variance Table

Response: log(brain.weight.grams)
                           Df Sum Sq Mean Sq F value    Pr(>F)
log(body.weight.kilograms)  1 336.19  336.19  697.42 < 2.2e-16 ***
Residuals                  60  28.92    0.48
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
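
As a worked check, the F statistic can be recomputed from the sums of squares and degrees of freedom printed in the table. The inputs below are the rounded values from the output, so the result is close to, but not exactly, the reported 697.42.

```r
# Values copied from the ANOVA table (rounded as printed)
SSModel <- 336.19
dfModel <- 1
SSError <- 28.92
dfError <- 60

MSModel <- SSModel / dfModel
MSError <- SSError / dfError

# F is the ratio of the two mean squares
F.stat <- MSModel / MSError

# Upper-tail probability under the F distribution with (1, 60) df
p.value <- pf(F.stat, dfModel, dfError, lower.tail = FALSE)

c(F = F.stat, p.value = p.value)
```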

SLIDE 36

Proportion of Variability Explained

The Coefficient of Determination ($R^2$)

The coefficient of determination, or $R^2$ value, associated with a linear model is the percent reduction in prediction uncertainty achieved by the regression model compared to the null model. I.e., what proportion of the variation (variance) in y is “explained”:

$$R^2 = \frac{SSModel}{SSTotal} \qquad (8)$$

Turns out to just be the square of the correlation! (Show this algebraically)
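
A numerical check (not the algebraic proof the slide asks for) that the three routes to $R^2$ agree in simple linear regression. The `cars` data is a stand-in (an assumption).

```r
# Stand-in data: the built-in 'cars' dataset (not from the lecture)
model <- lm(dist ~ speed, data = cars)

# Sums of squares from the ANOVA table
a <- anova(model)
SSModel <- a["speed", "Sum Sq"]
SSError <- a["Residuals", "Sum Sq"]

# Formula (8): proportion of total variation explained by the model
R2 <- SSModel / (SSModel + SSError)

# Squared correlation between predictor and response
r <- cor(cars$speed, cars$dist)

c(R2 = R2, r.squared = r^2, from.summary = summary(model)$r.squared)
```

So a correlation of 0.6, for example, corresponds to $R^2 = 0.36$.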

SLIDE 37

Regression Summary

summary(brain.model)

Call:
lm(formula = log(brain.weight.grams) ~ log(body.weight.kilograms),
    data = BrainBodyWeight)

Residuals:
     Min       1Q   Median       3Q      Max
-1.71550 -0.49228 -0.06162  0.43597  1.94829

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                 2.13479    0.09604   22.23   <2e-16 ***
log(body.weight.kilograms)  0.75169    0.02846   26.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6943 on 60 degrees of freedom
Multiple R-squared:  0.9208,    Adjusted R-squared:  0.9195
F-statistic: 697.4 on 1 and 60 DF,  p-value: < 2.2e-16

SLIDE 38

Worksheet