Week 2: Inference for SLR. Inference: sampling distributions, testing, confidence intervals, and prediction intervals.



SLIDE 1

BUS41100 Applied Regression Analysis

Week 2: Inference for SLR

Inference: sampling distributions, testing confidence intervals, and prediction intervals Max H. Farrell The University of Chicago Booth School of Business

SLIDE 2

Back to House Prices

Understand the relationship between price and size. How? Last week we fit a line through a bunch of points: price = 39 + 35 × size.

[Scatterplot of price (60–160) vs. size (1.0–3.5) with the fitted line]

SLIDE 3

CAPM

Another example of conditional distributions: individual returns given the market return. The Capital Asset Pricing Model (CAPM) for asset A relates its return R_At = (V_At − V_A,t−1)/V_A,t−1 to the "market" return, R_Mt. In particular, the relationship is given by the regression model

R_At = α + β R_Mt + ε

with observations at times t = 1, . . . , T (more on (α, β) vs. (b0, b1) vs. (β0, β1) in a minute). When asset A is a mutual fund, this CAPM regression can be used as a performance benchmark for fund managers.
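As a small sketch (not from the slides, with a hypothetical price series), the return formula R_t = (V_t − V_{t−1})/V_{t−1} can be computed in R like this:

```r
# Compute the return series R_t = (V_t - V_{t-1}) / V_{t-1}
# from a (hypothetical) vector of asset values V_t.
V <- c(100, 102, 99, 105)        # hypothetical asset values V_1, ..., V_4
ret <- diff(V) / head(V, -1)     # returns for t = 2, ..., 4
round(ret, 4)                    # 0.0200 -0.0294  0.0606
```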

SLIDE 4

> mfund <- read.csv("mfunds.csv", stringsAsFactors=TRUE)
> mu <- apply(mfund, 2, mean)
> mu
     drefus       fidel     keystne    Putnminc     scudinc
0.006767000 0.004696739 0.006542550 0.005517072 0.004432333
    windsor     valmrkt       tbill
0.010021906 0.006812983 0.005978333
> stdev <- apply(mfund, 2, sd)
> stdev
     drefus       fidel     keystne    Putnminc     scudinc
0.047237111 0.056587091 0.084236450 0.030079074 0.035969261
    windsor     valmrkt       tbill
0.048639473 0.048000146 0.002522863

SLIDE 5

> plot(mu, stdev, col=0)
> text(x=mu, y=stdev, labels=names(mfund), col=4)

[Scatterplot of stdev vs. mu with each fund labeled: drefus, fidel, keystne, Putnminc, scudinc, windsor, valmrkt, tbill]

SLIDE 6

Let's look at just windsor (which dominates the market).

> windsor.reg <- lm(mfund$windsor ~ mfund$valmrkt)
> plot(mfund$valmrkt, mfund$windsor, pch=20)
> abline(windsor.reg, col="green")

[Scatterplot of mfund$windsor vs. mfund$valmrkt with the fitted line: b_0 = 0.0036, b_1 = 0.9357]

SLIDE 7

What is a good line?

Statistics version!

In a happy coincidence, the least squares line makes good statistical sense too. To see why, we need a model, and we need to remember the conditional distribution. We will also use the model to talk about uncertainty.

Okay, so lm(Y ∼ X) makes a great line, but how "likely" is it that our answer is useful?

◮ The concept of a sampling distribution is the fundamental idea in all of statistics, and understanding it is our main job today.

SLIDE 8

Normal Distribution – Quick Review

Why do we like the Normal distribution?
◮ Symmetric
◮ Concentration around the mean!
  → 95% of the data within 2 s.d.

[Standard Normal density: 95% of the area lies between −2 s.d. and +2 s.d. (z_0.025 and z_0.975); 2.5% in each tail]
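As a quick sanity check of the two-s.d. rule (a sketch, not from the slides), R's pnorm() gives the standard Normal CDF:

```r
# Check the "95% within 2 s.d." rule: P(-2 < Z < 2) for Z ~ N(0, 1).
within2 <- pnorm(2) - pnorm(-2)
round(within2, 4)   # 0.9545
```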

SLIDE 9

Simple linear regression (SLR) model

Y = β0 + β1X + ε,   ε ∼ N(0, σ²)

What's important?
◮ It is a model, so we are assuming this relationship holds for some fixed but unknown values of β0, β1.
◮ It is linear.
◮ The error ε is independent and mean zero.

  • 1. E[ε] = 0 ⇔ E[Y|X] = β0 + β1X
  • 2. Fixed but unknown variance σ²; constant over X
  • 3. Most things are approx. Normal (Central Limit Theorem)
  • 4. ε represents anything left over, not captured in a linear function of X

◮ It just works! This is a very robust model for the world.
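To make the model concrete, here is a minimal sketch (not from the slides; the parameter values are hypothetical, loosely matching the house data) that simulates one dataset from Y = β0 + β1X + ε and fits it with lm():

```r
# Simulate from the SLR model Y = beta0 + beta1*X + eps, eps ~ N(0, sigma^2),
# then estimate (b0, b1) by least squares.
set.seed(41100)                      # hypothetical seed, for reproducibility
n <- 200
beta0 <- 39; beta1 <- 35; sigma <- 14
X <- runif(n, 1, 3.5)                # sizes in 1000 sq.ft. (hypothetical range)
Y <- beta0 + beta1 * X + rnorm(n, 0, sigma)
fit <- lm(Y ~ X)
coef(fit)                            # b0 and b1 land near 39 and 35
```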

SLIDE 10

Remember the two types of regression questions:

  • 1. Prediction
  • 2. Model

Ŷ = b0 + b1X;   Y = β0 + β1X + ε;   Y = b0 + b1X + e

  • 1. Predicting Y

◮ Best guess for Y given (or “conditional on”) X.

  • 2. Properties of βk

◮ Sign: Does Y go up when X goes up?
◮ Magnitude: By how much?

SLIDE 11

Conditional distributions

Regression models are really all about modeling the conditional distribution of Y given X.

Why are conditional distributions important? We want to develop models for forecasting. What we are doing is exploiting the information in the conditional distribution of Y given X.

The conditional distribution is obtained by "slicing" the point cloud in the scatterplot to obtain the distribution of Y conditional on various ranges of X values.

SLIDE 12

Conditional v. marginal distribution

Consider a regression of house price on size:

[Left: scatterplot of price (100–400) vs. size (0.5–3.5) with the regression line. Right: boxplots of price, marginal ("marg") and conditional on size bins 1–1.5, 1.5–2, 2–2.5, 2.5–3, 3–3.5. Each "slice" of data gives a conditional distribution, e.g. f(price | 3 < size < 3.5); "marg" is the marginal distribution f(price).]

SLIDE 13

Key observations from these plots:
◮ Conditional distributions answer the forecasting problem: if I know that a house is between 1,000 and 1,500 sq.ft., then the conditional distribution (second boxplot) gives me a point forecast (the mean) and a prediction interval.
◮ The conditional means (medians) seem to line up along the regression line.
◮ The conditional distributions have much smaller dispersion than the marginal distribution.

SLIDE 14

This suggests two general points:
◮ If X has no forecasting power, then the marginal and conditionals will be the same.
◮ If X has some forecasting information, then the conditional means will differ from the marginal (overall) mean, and the conditional standard deviation of Y given X will be less than the marginal standard deviation of Y.

SLIDE 15

Intuition from an example where X has no predictive power.

[Boxplots of house price (Y, 100–400) against the number of stop signs (X = 1, 2, 3, 4) within a two-block radius of a house, plus the marginal ("marg") boxplot]

See that in this case the marginal and conditionals are not all that different.

SLIDE 16

Before looking at any data, the model specifies
◮ how Y varies with X on average: E[Y|X] = β0 + β1X, i.e. what's the trend?
◮ and the influence of factors other than X: ε ∼ N(0, σ²), independently of X.

[Scatterplot of Y vs. X showing the line E[Y|X] = β0 + β1X, with ε the vertical deviation of a point from the line]

SLIDE 17

The variance σ² controls the dispersion of Y around β0 + β1X.
◮ think signal-to-noise

[Two panels, Y vs. X: small dispersion (left) and large dispersion (right) around the same line]

SLIDE 18

IMPORTANT! β0 is not b0, β1 is not b1, and εi is not ei

[Scatterplot of Y vs. X showing both the true line E[Y|X] = β0 + β1X with error εi, and the fitted line Ŷ = b0 + b1X with residual ei]

(We use Greek letters to remind us.)

SLIDE 19

What is a good line?

Statistics version!

We assume the model E[Y|X] = β0 + β1X. Now let's estimate the parameters β0 and β1.

cov(X, Y) = cov(X, β0 + β1X + ε) = cov(X, β1X) = β1 var(X)

Re-write this to get:

β1 = cov(X, Y)/var(X) = corr(X, Y) σ_y/σ_x.

◮ Interpretation: β1 is the correlation, in units of Y per units of X. But it is still unknown! Replace with analogues from the data:

β̂1 = sample cov(X, Y)/sample var(X) = r_xy s_y/s_x ≡ b1!

SLIDE 20

To summarize: R's lm(Y ∼ X) function gives a good line because:

  • 1. It's sensible to minimize the SSE = Σ_{i=1}^n e_i².
  • 2. Equivalent to: corr(e, X) = 0 and (1/n) Σ_{i=1}^n e_i = 0.
  • 3. Equivalent to: estimating the model E[Y|X] = β0 + β1X.

The least squares formulas are b1 = r_xy s_y/s_x and b0 = Ȳ − b1X̄.
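A quick sketch (not from the slides, with hypothetical simulated data) checking that lm()'s coefficients match the least squares formulas b1 = r_xy s_y/s_x and b0 = Ȳ − b1X̄:

```r
# Verify that lm() returns exactly the textbook least squares formulas.
set.seed(1)                                  # hypothetical data for illustration
x <- runif(30, 1, 3.5)
y <- 39 + 35 * x + rnorm(30, 0, 14)
b1 <- cor(x, y) * sd(y) / sd(x)              # = sample cov(x, y) / sample var(x)
b0 <- mean(y) - b1 * mean(x)
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))      # TRUE: same line
```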

SLIDE 21

Context from the house data example

E[Y|X] is the average price of houses with size X, and σ² is the spread around that average. When we specify the SLR model we say that
◮ the average house price is linear in its size, but we don't know the coefficients.
◮ Some houses could have a higher than expected value, some lower, but the amount by which they differ from average is unknown and
  ◮ is independent of the size,
  ◮ and is Normal.

Question: At an open house, is this house priced fairly?

SLIDE 22

Context from the CAPM example

E[Y|X] is the average return of the asset when the market return is X, and σ² is the spread around that average. When we specify the SLR model we say that
◮ the average asset return is linear in the market return, but we don't know the coefficients.
◮ Some days could have a higher than expected value, some lower, but the amount by which they differ from average is unknown and
  ◮ is independent of the market return,
  ◮ and is Normal.

Question: Does this asset follow the market? (Is β = 1?)

SLIDE 23

Detour / example: Oracle v. SAP

Uncertainty Matters!

SLIDE 24

> sap <- read.csv("sap.csv")
> m.sap <- mean(sap$ROE)
> m.I <- mean(sap$IndustryROE)
> m.sap / m.I
[1] 0.8049701

That's the mean, what about the spread?

> summary(sap[,4:5])
      ROE          IndustryROE
 Min.   :-91.80   Min.   : 2.6
 1st Qu.:  6.20   1st Qu.:10.2
 Median : 13.40   Median :14.0
 Mean   : 12.64   Mean   :15.7
 3rd Qu.: 22.80   3rd Qu.:19.5
 Max.   :116.40   Max.   :48.8

SLIDE 25

What's going on here?
◮ SAP ROE is more variable than average Industry ROE.
  → Makes sense: averages are less variable than atoms.
◮ What about large values (positive and negative)?

[Histogram and boxplots of ROE from −100 to 100: SAP vs. Industry average]

SLIDE 26

Uncertainty matters! Do we even think that SAP use is correlated with lower ROE?
◮ Probably not, given the above results.

But even beyond statistical uncertainty:
◮ Does SAP use cause ROE to fall?
◮ Were the SAP ROEs selected at random in the industry?

Statistical uncertainty is the only kind we can quantify. In any analysis there is a lot we aren't sure about:
◮ Do we have the right data?
◮ Do we have the "right" (useful?) model?
◮ What assumptions are we making?

SLIDE 27

Sampling distribution of LS estimates

We think of the data as being one possible realization of data that could have been generated from the model Y|X ∼ N(β0 + β1X, σ²).

◮ How much do our estimates depend on the particular random sample that we happen to observe?

◮ Different data ⇒ different b0 and b1.
◮ Always the same β0 and β1.

If the estimates don’t vary much from sample to sample, then it doesn’t matter which sample you happen to observe. If the estimates do vary a lot, then it matters which sample you happen to observe.

SLIDE 28

How do we know what would happen with other realizations? We pretend!

  • 1. Randomly draw new data
  • 2. Compute the estimates b0 and b1
  • 3. Repeat

Or we use statistics to tell us:
◮ What the sampling distribution is . . .
◮ . . . and how to use it to measure uncertainty.

◮ Testing, confidence intervals, etc.

But first let’s see it!
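Steps 1 to 3 above can be sketched in a few lines of R (a hypothetical setup, not the slides' own code; the model values roughly match the pictures on the next slides, with variance 2):

```r
# "We pretend!": repeatedly draw new data from an assumed model,
# refit the LS line, and look at the spread of the slope estimates b1.
set.seed(2)                          # hypothetical seed
beta0 <- 0; beta1 <- 1; sigma <- sqrt(2); n <- 50
b1.draws <- replicate(1000, {
  x <- rnorm(n)
  y <- beta0 + beta1 * x + rnorm(n, 0, sigma)
  coef(lm(y ~ x))[2]                 # estimate b1 on this fresh sample
})
c(mean = mean(b1.draws), sd = sd(b1.draws))   # centered near beta1 = 1
```

A histogram of b1.draws is exactly the sampling distribution we are about to picture.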

SLIDE 29
[Four scatterplots: repeated samples of n = 5 points from the model (var = 2), each panel showing the true model line and the fitted LS line]
SLIDE 30
[Four scatterplots: repeated samples of n = 50 points from the model (var = 2), each panel showing the true model line and the fitted LS line]

SLIDE 31

Sampling distribution of LS estimates

What did we just do?
◮ We "imagined" through simulation the sampling distribution of a LS line.

What did we learn?
◮ Looked pretty Normal!
◮ When n = 5, some lines are close, others aren't: we need to get lucky.
◮ The lines are much closer to the truth when n = 50.
◮ The variance σ² matters a lot!

SLIDE 32

What happens in real life?
◮ We get just one data set, and we don't know the true generating model.
◮ But we can still imagine . . . . . . and use statistics!
◮ Quantify how n and σ² matter
◮ Quantify uncertainty, but only within our model.

SLIDE 33

Sampling distribution of b1

It turns out that b1 is Normally distributed: b1 ∼ N(β1, σ²_b1).

◮ b1 is unbiased: E[b1] = β1.
◮ The sampling s.d. σ_b1 determines the precision of b1:

σ²_b1 = var(b1) = σ²/Σ_{i=1}^n (X_i − X̄)² = σ²/((n − 1)s²_x).

It depends on three factors:

  • 1. sample size (n),
  • 2. error variance (σ² = σ²_ε), and
  • 3. X-spread (s²_x).

(We don't have time to do detailed proofs, but there is an extensive handout on my website; see also the Sheather book.)
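In place of a proof, here is a simulation sketch (not from the slides; model values hypothetical) checking the variance formula: hold the X's fixed, redraw the errors many times, and compare the simulated s.d. of b1 to σ/√((n − 1)s²_x):

```r
# Simulation check of sigma^2_b1 = sigma^2 / ((n - 1) * s_x^2).
set.seed(3)                                    # hypothetical seed
n <- 50; sigma <- sqrt(2)
x <- rnorm(n)                                  # fixed X's across replications
b1.draws <- replicate(2000, {
  y <- 1 + 1 * x + rnorm(n, 0, sigma)          # redraw only the errors
  coef(lm(y ~ x))[2]
})
sd.formula <- sigma / sqrt((n - 1) * var(x))   # (n-1)*var(x) = sum((x - xbar)^2)
c(simulated = sd(b1.draws), formula = sd.formula)   # nearly identical
```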

SLIDE 34

Sampling distribution of b0

The intercept is also Normal and unbiased: b0 ∼ N(β0, σ²_b0), where

σ²_b0 = var(b0) = σ² ( 1/n + X̄²/((n − 1)s²_x) ).

What is the intuition here?

var(Ȳ − X̄b1) = var(Ȳ) + X̄² var(b1) − 2X̄ cov(Ȳ, b1)

◮ Ȳ and b1 are uncorrelated, because the slope (b1) is invariant if you shift the data up or down (Ȳ).

SLIDE 35

Joint distribution of b0 and b1

We know that b0 and b1 can be dependent, i.e., E[(b0 − β0)(b1 − β1)] ≠ 0. This means that estimation error in the slope is correlated with estimation error in the intercept.

cov(b0, b1) = −σ² X̄/((n − 1)s²_x)

◮ Usually, if the slope estimate is too high, the intercept estimate is too low (negative correlation).
◮ The correlation decreases with more X spread (s²_x).

SLIDE 36

Estimation of error variance

The formulas aren't practicable, since they involve an unknown quantity: σ = σ_ε. Replace it with

σ̂² = (1/n) Σ_{i=1}^n e_i²   or   s² = (1/(n − p)) Σ_{i=1}^n e_i² = SSE/(n − p)

(p is the number of regression coefficients, i.e. 2 for β0 + β1). It is often convenient to report σ̂ or s, which are in the same units as Y.

Plug in for σ in any formula, e.g.

σ²_b1 = σ²/((n − 1)s²_x)  ⇒  s²_b1 = s²/((n − 1)s²_x)

◮ Small s²_bj values mean high info/precision/accuracy.
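A sketch (not from the slides, hypothetical simulated data with n = 15 as on the next slide) checking that s² = SSE/(n − p) is exactly R's "Residual standard error":

```r
# Verify s = sqrt(SSE / (n - p)) against summary.lm()'s sigma.
set.seed(4)                                # hypothetical data
x <- runif(15, 1, 3.5)
y <- 39 + 35 * x + rnorm(15, 0, 14)
fit <- lm(y ~ x)
s <- sqrt(sum(resid(fit)^2) / (15 - 2))    # SSE / (n - p), with p = 2
all.equal(s, summary(fit)$sigma)           # TRUE
```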

SLIDE 37

Example: revisit the house price/size data

> summary(house.reg)

Call:
lm(formula = price ~ size)

Residuals:
    Min      1Q  Median      3Q     Max
-30.425  -8.618   0.575  10.766  18.498

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   38.885      9.094   4.276 0.000903 ***
size          35.386      4.494   7.874 2.66e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.14 on 13 degrees of freedom
Multiple R-squared: 0.8267, Adjusted R-squared: 0.8133
F-statistic: 62 on 1 and 13 DF, p-value: 2.66e-06

SLIDE 38

Glossary and Equations

◮ LS estimators: b1 = r_xy s_y/s_x = s_xy/s²_x and b0 = Ȳ − b1X̄.
◮ Ŷ_i = b0 + b1X_i is the ith fitted value.
◮ e_i = Y_i − Ŷ_i is the ith residual.
◮ σ̂, s: standard error of the regression residuals (≈ σ = σ_ε).
◮ s_bj: standard error of the regression coefficients:

s_b1 = √( s²/((n − 1)s²_x) )   and   s_b0 = s √( 1/n + X̄²/((n − 1)s²_x) )

SLIDE 39

◮ α is the significance level (prob. of type 1 error).
◮ z_{α/2} is the value such that, for Z ∼ N(0, 1), P[Z > z_{α/2}] = P[Z < −z_{α/2}] = α/2.
◮ z_bj is the standardized coefficient: under H0 : βj = β⁰_j,

z_bj = (bj − β⁰_j)/s_bj ∼ N(0, 1).

◮ The (1 − α) × 100% confidence interval for βj is bj ± z_{α/2} s_bj.
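A sketch (not from the slides, hypothetical data) building the z-based 95% CI bj ± z_{α/2} s_bj for the slope, and comparing it with R's confint(), which uses t quantiles and so is slightly wider in small samples:

```r
# z-based 95% CI for the slope vs. R's t-based confint().
set.seed(5)                                   # hypothetical data
x <- runif(15, 1, 3.5)
y <- 39 + 35 * x + rnorm(15, 0, 14)
fit <- lm(y ~ x)
b1 <- coef(fit)[2]
s.b1 <- coef(summary(fit))[2, "Std. Error"]   # the standard error s_b1
z <- qnorm(0.975)                             # z_{alpha/2} for alpha = 0.05
ci.z <- b1 + c(-1, 1) * z * s.b1              # slide's Normal-based CI
ci.t <- confint(fit)[2, ]                     # R's t-based CI, a bit wider
```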

SLIDE 40

◮ Ŷ_f = b0 + b1X_f is a forecast prediction, with standard error

se(Ŷ_f) = s_fit = s √( 1/n + (X_f − X̄)²/((n − 1)s²_x) ).

◮ The forecast residual is e_f = Y_f − Ŷ_f, and var(e_f) = s² + s²_fit. That is, the predictive standard error is

s_pred = s √( 1 + 1/n + (X_f − X̄)²/((n − 1)s²_x) ),

and Ŷ_f ± z_{α/2} s_pred is the (1 − α) × 100% prediction interval at X_f.
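The s_pred formula can be checked by hand (a sketch with hypothetical data and a hypothetical forecast point X_f = 2.2) against R's predict(..., interval = "prediction"), which uses t quantiles where the slide uses the Normal z:

```r
# Manual prediction interval at x = xf vs. predict.lm's built-in one.
set.seed(6)                                  # hypothetical data
n <- 15
x <- runif(n, 1, 3.5)
y <- 39 + 35 * x + rnorm(n, 0, 14)
fit <- lm(y ~ x)
xf <- 2.2                                    # hypothetical forecast point
s <- summary(fit)$sigma
s.pred <- s * sqrt(1 + 1/n + (xf - mean(x))^2 / ((n - 1) * var(x)))
yf.hat <- sum(coef(fit) * c(1, xf))          # b0 + b1 * xf
manual <- yf.hat + c(-1, 1) * qt(0.975, n - 2) * s.pred
auto <- predict(fit, newdata = data.frame(x = xf), interval = "prediction")
all.equal(manual, unname(auto[1, c("lwr", "upr")]))   # TRUE
```

With the t quantile, the two intervals agree exactly; using qnorm(0.975) instead reproduces the slide's z-based interval, which is a good approximation for large n.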