Outline Linear Models Inference for Regression Slope Confidence and Prediction Intervals for Regression
STAT 113: Analytic Inference for Regression
Colin Reimer Dawson
Oberlin College
21-24 April 2017
Outline
- Linear Models
- Inference for Regression Slope
- Confidence and Prediction Intervals for Regression
Prediction
- Correlations give us a description of the relationship between
two numeric variables.
- However, when two variables are related, we can go further and
use knowledge of one to make predictions about the other.
- Examples:
  - Use SAT scores to predict college GPA
  - Use economic indicators to predict stock prices
  - Use biomarkers to predict disease progression
  - Use bill total to predict percent tip
What’s a Good Prediction?
[Figure: scatterplot of Y against X]
- Pretty much the simplest
model we can have is a straight line.
- Two things determine what line we have:
  - The intercept
  - The slope
Review: Intercept Slope Form
- The intercept and slope are the parameters of our regression
model.
- The general equation for a line is:
f(x) = a + bx
- In statistics notation, we write $\hat{y}$ (“y hat”) to represent a predicted (or fitted) value.
- Given a value $x_i$, we predict using:
$$\hat{y} = a + b x_i$$
The Simple Linear Model
Prediction Function
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
The Population Model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where $\varepsilon$ is a residual, specific to each case.
Sample vs Population “Best-Fit” Line
- For a sample: choose intercept and slope to minimize sum of
squared errors.
- But this does not yield the “correct” (or even “best”) model for
the population, due to sampling error.
[Figure: Y against X, showing the population line and the best-fit lines from Sample 1, Sample 2, Sample 3 and Sample 4]
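A small simulation can make the sampling-error point concrete. The sketch below assumes a hypothetical population line (the values of `beta0`, `beta1` and `sigma.eps` are chosen only for illustration and do not come from the slides) and fits a least-squares line to four independent samples.

```r
set.seed(113)  # arbitrary seed, for reproducibility

# Hypothetical population parameters (illustrative assumptions)
beta0 <- 10
beta1 <- 0.8
sigma.eps <- 8

# Draw one sample of size n from Y = beta0 + beta1 * X + eps,
# then return the least-squares intercept and slope
fit.one.sample <- function(n = 30) {
  x <- runif(n, min = 20, max = 70)
  y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma.eps)
  coef(lm(y ~ x))
}

# Four samples give four different "best-fit" lines
sample.fits <- t(replicate(4, fit.one.sample()))
colnames(sample.fits) <- c("intercept", "slope")
rownames(sample.fits) <- paste("Sample", 1:4)
print(round(sample.fits, 3))
```

Each row is the best line for its own sample, yet none of them reproduces the population intercept and slope exactly.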
The Sample Linear Model
Prediction Function
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
The Sample Model
$$Y = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\varepsilon}$$
where $\hat{\varepsilon}$ is the estimated residual for the case, used in fitting.
Tests and Intervals for the Slope
- We can bootstrap sample the (x, y) points to get a CI for
slope.
- We can randomly re-pair the xs and ys, refitting the regression line after each re-pairing, to get a randomization distribution.
- StatKey
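The two resampling ideas above can be sketched in base R as well as in StatKey. This is only an illustration: it uses the built-in `cars` data as a stand-in (not a dataset from the slides), the helper `slope()` is a name introduced here, and the percentile interval is one of several possible bootstrap intervals.

```r
set.seed(113)                       # arbitrary seed, for reproducibility
x <- cars$speed                     # built-in data, used only as a stand-in
y <- cars$dist
n <- length(x)

# Hypothetical helper: least-squares slope of y on x
slope <- function(x, y) unname(coef(lm(y ~ x))[2])
observed.slope <- slope(x, y)

# Bootstrap: resample (x, y) pairs with replacement, refit each time
boot.slopes <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  slope(x[idx], y[idx])
})
boot.ci <- quantile(boot.slopes, c(0.025, 0.975))  # 95% percentile interval

# Randomization: shuffle y to break any x-y link, refit each time
rand.slopes <- replicate(1000, slope(x, sample(y)))
p.value <- mean(abs(rand.slopes) >= abs(observed.slope))  # two-sided

print(boot.ci)
print(p.value)
```

The bootstrap keeps each x with its own y, so it estimates variability around the observed slope; the randomization breaks the pairing, so it shows what slopes look like when the null hypothesis of no relationship is true.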
Analytic Tests and Intervals for the Slope
The bootstrap and randomization distributions are well-modeled using a t-distribution under certain conditions:
1. The linear model is appropriate
2. The residuals have constant variance across all x
3. The residuals are Normally distributed, or the sample size is large enough
Refining the Linear Model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where the $\varepsilon$ are distributed as $N(0, \sigma_\varepsilon)$ for a $\sigma_\varepsilon$ that does not depend on x.
Ways The Conditions Can Be Violated
Figure: From the textbook (Fig. 9.5)
Analytic Tests and Intervals for the Slope
- The SE expression is not given in the book, but when the conditions are met, it is:
$$SE = \sqrt{\frac{s_\varepsilon^2 / s_x^2}{n - 2}}$$
where $s_\varepsilon^2$ is the variance of the residuals, and $s_x^2$ is the variance of the X variable.
- We will not need to use this by hand, but there is insight to be gained by examining it.
- Then, use a t-distribution with n − 2 df (why?) for either the CI or the test.
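As a quick check on the SE expression, the sketch below computes it by hand and compares it with the standard error that `lm()` reports. It uses the built-in `cars` data purely as a stand-in, and takes "variance" to mean the usual sample variance with an n − 1 denominator, which is what `var()` computes.

```r
fit <- lm(dist ~ speed, data = cars)  # built-in data, used only as a stand-in
n <- nrow(cars)

s2.eps <- var(resid(fit))   # variance of the residuals (n - 1 denominator)
s2.x   <- var(cars$speed)   # variance of the X variable

# SE of the slope from the expression above
se.by.hand <- sqrt((s2.eps / s2.x) / (n - 2))

# SE of the slope as reported in the regression summary
se.from.lm <- summary(fit)$coefficients["speed", "Std. Error"]

print(c(by.hand = se.by.hand, from.lm = se.from.lm))
```

Reading the expression this way shows where the insight lies: the SE shrinks when the residuals are less variable, when the x values are more spread out, and when the sample is larger.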
Test of $H_0: \beta_1 = 0$: Restaurant Tips
Demo
Testing Correlation vs. Slope
- The correlation and the slope are different, but the tests yield the exact same results.
- This is not a coincidence. The slope is zero if and only if the correlation is zero, so the two null hypotheses are equivalent.
- In fact
$$\text{Slope} = r \cdot \frac{s_y}{s_x}$$
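Both claims can be verified numerically. The sketch below again uses the built-in `cars` data as a stand-in (an assumption, not a dataset from the slides): it compares the fitted slope with $r \cdot s_y / s_x$, and the t statistic for the slope with the t statistic from `cor.test()`.

```r
x <- cars$speed   # built-in data, used only as a stand-in
y <- cars$dist

fit <- lm(y ~ x)
r <- cor(x, y)

# Slope two ways: from the fitted model, and from r, s_y and s_x
slope.from.lm <- unname(coef(fit)["x"])
slope.from.r  <- r * sd(y) / sd(x)

# t statistic two ways: test of the slope, and test of the correlation
t.slope <- summary(fit)$coefficients["x", "t value"]
t.cor   <- unname(cor.test(x, y)$statistic)

print(c(slope.from.lm = slope.from.lm, slope.from.r = slope.from.r))
print(c(t.slope = t.slope, t.cor = t.cor))
```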
A 95% CI for the Slope
Demo
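The in-class demo itself is not reproduced here. As a rough sketch of what such an interval involves, the code below builds a 95% CI for the slope as estimate ± t* · SE with n − 2 df and compares it with `confint()`. The built-in `cars` data is a stand-in, not the data used in the demo.

```r
fit <- lm(dist ~ speed, data = cars)  # built-in data, used only as a stand-in

est <- summary(fit)$coefficients["speed", "Estimate"]
se  <- summary(fit)$coefficients["speed", "Std. Error"]
df  <- df.residual(fit)               # n - 2 for simple linear regression

t.star <- qt(0.975, df)               # cuts off 2.5% in each tail
ci.by.hand <- est + c(-1, 1) * t.star * se

# The same interval from R's built-in helper
ci.from.r <- confint(fit, "speed", level = 0.95)

print(ci.by.hand)
print(ci.from.r)
```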
Measuring the Predictive Power of the Model
- How can we measure how useful it is to have x when trying to
predict y, based on a slope of 0.049 (or whatever it is)?
- Correlation measures the strength of association, which is
close to what we want.
- We can use the coefficient of determination: “How much
better does our prediction for y get when we know x?”
The Coefficient of Determination
The Coefficient of Determination ($R^2$)
The coefficient of determination, or $R^2$ value, associated with a linear model is the percent reduction in prediction uncertainty achieved by knowing the predictor vs. just predicting the mean of y. I.e., what fraction of the variation (variance) in y is predictable via x? It turns out to be just the square of the correlation!
Example: Restaurant Tips
[Figure: Tip ($) against Total Bill ($), with the fitted lines for null.tip.model and tip.model.using.bill]
Example: Restaurant Tips
library("Lock5Data")  # provides the RestaurantTips data
data(RestaurantTips)
tip.model <- lm(Tip ~ Bill, data = RestaurantTips)
summary(tip.model)

Call:
lm(formula = Tip ~ Bill, data = RestaurantTips)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3911 -0.4891 -0.1108  0.2839  5.9738

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.292267   0.166160  -1.759   0.0806 .
Bill         0.182215   0.006451  28.247   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9795 on 155 degrees of freedom
Multiple R-squared:  0.8373, Adjusted R-squared:  0.8363
F-statistic: 797.9 on 1 and 155 DF,  p-value: < 2.2e-16
Example: Restaurant Tips
[Figure: histograms of residual tip ($). Null model: $\hat{\sigma}^2_\varepsilon = 5.861$. Bill model: $\hat{\sigma}^2_\varepsilon = 0.953$.]
Regression Summary
R.squared <- 1 - 0.953 / 5.861; R.squared
[1] 0.8373998

summary(tip.model.using.bill)

Call:
lm(formula = Tip ~ Bill, data = RestaurantTips)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3911 -0.4891 -0.1108  0.2839  5.9738

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.292267   0.166160  -1.759   0.0806 .
Bill         0.182215   0.006451  28.247   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9795 on 155 degrees of freedom
Multiple R-squared:  0.8373, Adjusted R-squared:  0.8363
F-statistic: 797.9 on 1 and 155 DF,  p-value: < 2.2e-16
Example: Restaurant Tips
r <- cor(Tip ~ Bill, data = RestaurantTips)
r^2  ## It really is the same number
[1] 0.8373334
Intervals at a particular X
- A confidence interval for the slope is useful, but if our goal is a
predictive model, we want to be able to make statements about Y values at particular X values.
- I should be able to estimate
  1. What the mean Y value is at that X in the population
  2. Where the particular Y is likely to be for this one new observation
- Note: These are different things, in the same way that a 95% confidence interval does not tell us where 95% of the individual cases are.
Confidence and Prediction Intervals for a Linear Model
(Population) linear model: Y = β0 + β1X + ε = f(X) + ε
1. A confidence interval (for a particular X) is an estimate (with a margin of error) of f(X).
2. A prediction interval (for a particular X) is an estimate about Y.
Confidence vs. Prediction Intervals
Which is wider? The prediction interval is wider, because it includes the uncertainty about ε in addition to the uncertainty about f(X).
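Base R's `predict()` can return either kind of interval, which makes the comparison easy to see. In the sketch below, the built-in `cars` data and the value `speed = 15` are stand-ins chosen for illustration only.

```r
fit <- lm(dist ~ speed, data = cars)  # built-in data, used only as a stand-in
new.x <- data.frame(speed = 15)       # hypothetical X value of interest

# Interval for the mean of Y at this X
conf.int <- predict(fit, newdata = new.x, interval = "confidence", level = 0.95)

# Interval for a single new Y at this X
pred.int <- predict(fit, newdata = new.x, interval = "prediction", level = 0.95)

conf.width <- unname(conf.int[, "upr"] - conf.int[, "lwr"])
pred.width <- unname(pred.int[, "upr"] - pred.int[, "lwr"])

print(conf.int)
print(pred.int)
print(c(conf.width = conf.width, pred.width = pred.width))
```

Both intervals share the same center (the fitted value); only the margin of error differs.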
A Subtlety Re: Prediction Intervals
Interpreting Prediction Intervals
A coverage level of 95% for a prediction interval does not mean that, having fit a model from a particular sample, we will make successful predictions 95% of the time going forward. The worse our line, the lower the %.
What we can say is that the average success rate across all possible samples is 95%.
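A simulation can illustrate this subtlety. The sketch below assumes a hypothetical population (the values of `beta0`, `beta1`, `sigma.eps` and `x.star` are illustrative, not from the slides) with Normal errors. For each simulated sample it fits a line, forms the 95% prediction interval at `x.star`, and then computes that interval's true long-run success rate from the known population.

```r
set.seed(113)  # arbitrary seed, for reproducibility

# Hypothetical population (illustrative assumptions)
beta0 <- 2
beta1 <- 0.5
sigma.eps <- 1
x.star <- 8    # X value at which we predict

# Fit a model to one sample and return the true probability that its
# 95% prediction interval at x.star captures a new observation
success.rate.one.sample <- function(n = 20) {
  x <- runif(n, 0, 10)
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma.eps)
  fit <- lm(y ~ x)
  pred.int <- predict(fit, data.frame(x = x.star), interval = "prediction")
  true.mean <- beta0 + beta1 * x.star
  # A new Y at x.star is N(true.mean, sigma.eps) in this population
  unname(pnorm(pred.int[, "upr"], true.mean, sigma.eps) -
           pnorm(pred.int[, "lwr"], true.mean, sigma.eps))
}

rates <- replicate(2000, success.rate.one.sample())
print(range(rates))  # success rates differ from sample to sample
print(mean(rates))   # the average across samples is what "95%" describes
```

Samples that happen to produce a poor line have success rates below 95%, and samples that produce a good line have rates above it.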
Confidence and Prediction Bands
Intervals for all x in the range are called “confidence / prediction bands”.
[Figure: PctTip against Bill, with confidence and prediction bands]
- Why the hourglass shape? Changes in the sample have more
influence on the extremes of the line
Calculating Confidence and Prediction Intervals
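Both types of intervals are of the form
$$\text{Point Estimate} \pm t^{*(1-\alpha/2)}_{n-2} \cdot SE$$
Confidence Interval:
$$\hat{f}(X^*) \pm t^{*(1-\alpha/2)}_{n-2} \cdot \sqrt{\hat{\sigma}^2_{\hat{f}(X^*)}}$$
where
$$\hat{\sigma}^2_{\hat{f}(X^*)} = \hat{\sigma}^2_\varepsilon \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2} \right)$$
Prediction Interval:
$$\hat{Y}^* \pm t^{*(1-\alpha/2)}_{n-2} \cdot \sqrt{\hat{\sigma}^2_{\hat{f}(X^*)} + \hat{\sigma}^2_\varepsilon}$$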
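The interval formulas can be checked against `predict()`. The sketch below is an illustration under two assumptions: the built-in `cars` data stands in for the course data, and $\hat{\sigma}^2_\varepsilon$ is taken to be the squared residual standard error (n − 2 denominator), which is the estimate `predict()` itself uses.

```r
fit <- lm(dist ~ speed, data = cars)  # built-in data, used only as a stand-in
x <- cars$speed
n <- length(x)
x.star <- 15                          # hypothetical X value of interest

# Assumption: sigma-hat^2_eps is the squared residual standard error
sigma2.eps <- summary(fit)$sigma^2
y.hat  <- unname(coef(fit)[1] + coef(fit)[2] * x.star)
t.star <- qt(0.975, df = n - 2)       # 95% level, so 1 - alpha/2 = 0.975

# Variance of the estimated mean response at x.star
sigma2.f <- sigma2.eps * (1 / n + (x.star - mean(x))^2 / sum((x - mean(x))^2))

ci.by.hand <- y.hat + c(-1, 1) * t.star * sqrt(sigma2.f)
pi.by.hand <- y.hat + c(-1, 1) * t.star * sqrt(sigma2.f + sigma2.eps)

# The same intervals from predict()
new.x <- data.frame(speed = x.star)
ci.from.r <- predict(fit, new.x, interval = "confidence")[, c("lwr", "upr")]
pi.from.r <- predict(fit, new.x, interval = "prediction")[, c("lwr", "upr")]

print(rbind(ci.by.hand, ci.from.r, pi.by.hand, pi.from.r))
```

The $(x^* - \bar{x})^2$ term in `sigma2.f` grows as `x.star` moves away from the mean of x, so both intervals widen toward the ends of the x range.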
R code for a confidence/prediction bands plot:
library("mosaic"); library("Lock5Data")
data("RestaurantTips")
xyplot(PctTip ~ Bill, data = RestaurantTips,
       panel = panel.lmbands,  # Note, no quotes
       level = 0.90,           # The confidence level
       ## OPTIONAL: band.lty = what kind of lines to use
       ## format: c(conf.linetype, pred.linetype), where
       ## 1 = solid, 2 = dashed, 3 = dotted
       band.lty = c(1, 2),
       ## OPTIONAL: band.col: what color lines to use
       ## format: c(conf.color, pred.color)
       band.col = c("royalblue", "blueviolet")
)
[Figure: PctTip against Bill, with confidence and prediction bands]