

slide-1
SLIDE 1

BUS41100 Applied Regression Analysis

Week 3: Finish SLR Inference Then Multiple Linear Regression

  • I. Confidence and Prediction Intervals
  • II. Polynomials, log transformation, categorical variables, interactions & main effects

Max H. Farrell
The University of Chicago Booth School of Business

slide-2
SLIDE 2

Quick Recap

  • I. We drew a line through a cloud of points

Ŷ = b0 + b1·X   and   Y − Ŷ = e

◮ It was a good line because:

  • 1. It minimized the SSE
  • 2. It extracted all linear information
  • 3. It implemented the model
  • II. The regression model helped us understand uncertainty

Y = β0 + β1·X + ε,   ε ∼ N(0, σ²)

◮ Sampling distribution: estimates change as data changes

    b1 ∼ N(β1, σb1²),   where   σb1² = σ² / [(n − 1)·sx²]

1

slide-3
SLIDE 3

Our work today

  • I. Finish SLR

◮ Put sampling distribution to work
◮ Communicable summaries of uncertainty

  • II. Multiple Linear Regression

Y = β0 + β1·X1 + β2·X2 + ε,   ε ∼ N(0, σ²)

◮ Everything carries over from SLR
◮ Interpretation requires one extra piece

2

slide-4
SLIDE 4

Summarizing the sampling distribution

Remember the two types of regression questions:

  • 1. Model
  • 2. Prediction

Y = β0 + β1·X + ε        Ŷ = b0 + b1·X        Y = b0 + b1·X + e

  • 1. Properties of βk

◮ Sign: Does Y go up when X goes up?
◮ Magnitude: By how much?

⇒ A confidence interval captures uncertainty about β

  • 2. Predicting Y

◮ Best guess for Y given (or “conditional on”) X.

⇒ A prediction interval captures uncertainty about Y

3

slide-5
SLIDE 5

Confidence Intervals and Testing

Suppose we think that the true βj is equal to some value βj⁰ (often 0). Does the data support that guess?

We can rephrase this in terms of competing hypotheses:

    (Null)          H0 : βj = βj⁰
    (Alternative)   H1 : βj ≠ βj⁰

Our hypothesis test will either reject or fail to reject the null hypothesis.
◮ If the hypothesis test rejects the null hypothesis, we have statistical support for our claim
◮ Gives only a “yes” or “no” answer!
◮ You choose the “probability” of false rejection: α

4

slide-6
SLIDE 6

We use bj for our test about βj.
◮ Reject H0 if bj is “far” from βj⁰; assume H0 when it is close
◮ What we really care about is: how many standard errors bj is away from βj⁰
◮ standard error = sb1, cf. σb1

The t-statistic for this test is

    zbj = (bj − βj⁰) / sbj,   which under H0  ∼ N(0, 1).

“Big” |zbj| makes our guess βj⁰ look silly ⇒ reject

◮ If H0 is true, then P[ |zbj| > 2 ] < 0.05 = α.  But:

    |zbj| = |bj − βj⁰| / sbj > 2   ⇔   βj⁰ ∉ (bj ± 2·sbj)

5

slide-7
SLIDE 7

Confidence intervals

Since bj ∼ N(βj, σbj²),

    1 − α = P[ zα/2 < (bj − βj)/sbj < z1−α/2 ] = P[ βj ∈ (bj ± zα/2·sbj) ]

Why should we care about confidence intervals?

◮ The confidence interval completely captures the information in the data about the parameter.
◮ Center is your estimate
◮ Length is how sure you are about your estimate
◮ Any value outside would be rejected by a test!

6
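To see the interval formula in action, here is a minimal R sketch that builds bj ± zα/2·sbj by hand; the estimate and standard error are illustrative placeholder numbers, and confint() on a fitted lm object returns the same kind of interval automatically.

bj  <- 0.936                                 ## hypothetical coefficient estimate
sbj <- 0.029                                 ## hypothetical standard error
bj + c(-1, 1) * qnorm(1 - 0.05/2) * sbj      ## 95% CI, roughly (0.879, 0.993)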

slide-8
SLIDE 8

Real life or pretend?

    P[ β1 ∈ (b1 ± 2·σb1) ] = 95%     or     P[ β1 ∈ (b1 ± 2·σb1) ] = 0 or 1 ?

[Figure: confidence intervals from many repeated samples, plotted against the true β1.]

7
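The distinction can be checked by simulation. A small sketch, with a made-up true model (all the numbers below are assumptions, not course data): across repeated samples about 95% of the intervals cover the true β1, but any one realized interval either covers it or it does not.

set.seed(41100)
beta0 <- 1; beta1 <- 2; sigma <- 1; n <- 50     ## made-up truth for the simulation
covered <- replicate(1000, {
  x <- rnorm(n)
  y <- beta0 + beta1*x + rnorm(n, sd=sigma)
  ci <- confint(lm(y ~ x))["x", ]               ## 95% CI for the slope in this sample
  ci[1] < beta1 & beta1 < ci[2]                 ## did this interval cover the true beta1?
})
mean(covered)                                   ## close to 0.95 across repeated samples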

slide-9
SLIDE 9

Level, Size, and p-values

The p-value is P[ |Z| > |zbj| ].
◮ A test with size/level equal to the p-value almost rejects
◮ A CI of level 1 − (p-value) just excludes the null value βj⁰

[Figure: standard Normal density with area 1 − α between zα/2 and z1−α/2, and tail areas p/2 beyond ±|zbj|.]

8

slide-10
SLIDE 10

Example: revisit the CAPM regression for the Windsor fund. Does Windsor have a non-zero intercept? (i.e., does it make/lose money independent of the market?)

    H0 : β0 = 0        H1 : β0 ≠ 0

◮ Recall: the intercept estimate b0 is the stock’s “alpha”

> summary(windsor.reg)  ## output abbreviated
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.003647   0.001409   2.588   0.0105 *
mfund$valmrkt   0.935717   0.029150  32.100   <2e-16 ***
> 2*pnorm(-abs(0.003647/.001409))
[1] 0.009643399

We reject the null at α = .05; Windsor does have an “alpha” over the market.

◮ Why set α = .05? What about at α = 0.01?

9

slide-11
SLIDE 11

Now let’s ask whether or not Windsor moves in a different way than the market (e.g., is it more conservative?).
◮ Recall that the estimate of the slope b1 is the “beta” of the stock.

This is a rare case where the null hypothesis is not zero:

    H0 : β1 = 1,  Windsor is just the market (+ alpha).
    H1 : β1 ≠ 1,  Windsor softens or exaggerates market moves.

This time, R’s output t/p values are not what we want (why?).

> summary(windsor.reg)  ## output abbreviated
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.003647   0.001409   2.588   0.0105 *
mfund$valmrkt   0.935717   0.029150  32.100   <2e-16 ***

10

slide-12
SLIDE 12

But we can get the appropriate values easily: ◮ Test and p-value:

> b1 <- 0.935717; sb1 <- 0.029150
> zb1 <- (b1 - 1)/sb1
> zb1
[1] -2.205249
> 2*pnorm(-abs(zb1))
[1] 0.02743665

◮ Confidence Interval

> confint(windsor.reg, level=0.95)
                     2.5 %      97.5 %
(Intercept)   0.000865657 0.006428105
mfund$valmrkt 0.878193149 0.993240873

Reject at α = .05, so Windsor softens market moves (it is more conservative than the market).

◮ What about other values of α?

confint(windsor.reg, level=0.99)
confint(windsor.reg, level=(1 - 2*pt(-abs(zb1), df=178)))

11

slide-13
SLIDE 13

Forecasting & Prediction Intervals

The conditional forecasting problem:
◮ Given a covariate value Xf and sample data {Xi, Yi}, i = 1, . . . , n, predict the “future” observation Yf.

The solution is to use our LS fitted value: Ŷf = b0 + b1·Xf.
◮ That’s the easy bit.

The hard (and very important!) part of forecasting is assessing uncertainty about our predictions. One method is to specify a prediction interval
◮ a range of Y values that are likely, given an X value.

12

slide-14
SLIDE 14

The least squares line is a prediction rule: read Ŷ off the line for a new X.
◮ It’s not a perfect prediction: Ŷ is what we expect.

[Figure: housing data, price vs. size with the fitted line; at a new X, the prediction Ŷ is read off the line.]

13

slide-15
SLIDE 15

If we use Ŷf, our prediction error has two pieces:

    ef = Yf − Ŷf = Yf − b0 − b1·Xf

[Figure: at Xf, the gap between Yf and the true line E[Yf|Xf] = β0 + β1·Xf is ε, and the gap between the true line and the fitted value Ŷf = b0 + b1·Xf is the fit error.]

14

slide-16
SLIDE 16

We can decompose ef into two sources of error:
◮ Inherent idiosyncratic randomness (due to ε).
◮ Estimation error in the intercept and slope (i.e., discrepancy between our line and “the truth”).

    ef = Yf − Ŷf = (Yf − E[Yf|Xf]) + (E[Yf|Xf] − Ŷf)
                 = εf + (E[Yf|Xf] − Ŷf)
                 = εf + (β0 − b0) + (β1 − b1)·Xf.

The variance of our prediction error is thus

    var(ef) = var(εf) + var(E[Yf|Xf] − Ŷf) = σ² + var(Ŷf)

15

slide-17
SLIDE 17

From the sampling distributions derived earlier, var(Ŷf) is

    var(b0 + b1·Xf) = var(b0) + Xf²·var(b1) + 2·Xf·cov(b0, b1)
                    = σ² · [ 1/n + (Xf − X̄)² / ((n − 1)·sx²) ].

Replacing σ² with s² gives the standard error for Ŷf. And hence the variance of our predictive error is

    var(ef) = σ² · [ 1 + 1/n + (Xf − X̄)² / ((n − 1)·sx²) ].

16
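As a sanity check on the formula, a minimal R sketch computing sfit by hand; it assumes the housing fit reg of price on size from the earlier example, and should agree with the se.fit value returned by predict().

s  <- summary(reg)$sigma                                    ## residual standard error, s
n  <- length(size)
Xf <- 2.5
s * sqrt(1/n + (Xf - mean(size))^2 / ((n - 1)*var(size)))   ## sfit at Xf
## compare with: predict(reg, data.frame(size=2.5), se.fit=TRUE)$se.fit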

slide-18
SLIDE 18

Putting it all together, we have that

    Ŷf ∼ N( Yf,  σ² · [ 1 + 1/n + (Xf − X̄)² / ((n − 1)·sx²) ] )

A (1 − α)·100% confidence/prediction interval for Yf is thus

    b0 + b1·Xf ± zα/2 × s·√[ 1 + 1/n + (Xf − X̄)² / ((n − 1)·sx²) ].

17

slide-19
SLIDE 19

Looking closer at what we’ll call

    spred = s·√[ 1 + 1/n + (Xf − X̄)² / ((n − 1)·sx²) ] = √( s² + sfit² ).

A large predictive error variance (high uncertainty) comes from
◮ Large s (i.e., large ε’s).
◮ Small n (not enough data).
◮ Small sx (not enough observed spread in covariates).
◮ Large (Xf − X̄).

The first three are familiar... what about the last one?

18

slide-20
SLIDE 20

For Xf far from our X̄, the space between lines is magnified ...

[Figure: estimated line vs. true line; both pass near the point of means (X̄, Ȳ), so the gap between them is small when (Xf − X̄) is small and magnified when (Xf − X̄) is large.]

19

slide-21
SLIDE 21

⇒ The prediction (conf.) interval needs to widen away from X̄

20

slide-22
SLIDE 22

Returning to our housing data for an example ...

> Xf <- data.frame(size=c(mean(size), 2.5, max(size)))
> cbind(Xf, predict(reg, newdata=Xf, interval="prediction"))
      size      fit       lwr      upr
1 1.853333 104.4667  72.92080 136.0125
2 2.500000 127.3496  95.18501 159.5142
3 3.500000 162.7356 127.36982 198.1013

◮ interval="prediction" gives lwr and upr, otherwise we just get fit
◮ spred is not shown in this output

21

slide-23
SLIDE 23

We can get spred from the predict output.

> p <- predict(reg, newdata=Xf, se.fit=TRUE)
> s <- p$residual.scale
> sfit <- p$se.fit
> spred <- sqrt(s^2 + sfit^2)
> b <- reg$coef
> b[1] + b[2]*Xf[1,] + c(0, -1, 1)*qnorm(.975)*spred[1]
         [,1]     [,2]     [,3]
[1,] 104.4667 75.84713 133.0862
> b[1] + b[2]*Xf[1,] + c(0, -1, 1)*qt(.975, df=n-2)*spred[1]
[1,] 104.4667 72.92080 136.0125

◮ Or, we can calculate it by hand [see R code].
—————
Notice that spred = √( s² + sfit² ); you need to square before summing.

22

slide-24
SLIDE 24

Summary

Uncertainty matters! Captured by the Sampling Distribution.
◮ Quantifies uncertainty from the data
◮ . . . only within the model, assumed before we see data.
◮ Which factors matter for signal-to-noise?

Reporting
◮ Confidence Interval: completely captures the information in the data about the parameter.
◮ Testing/p-value: only a yes/no answer. (Don’t abuse p-values)

23

slide-25
SLIDE 25

Multiple Linear Regression

24

slide-26
SLIDE 26

Multiple vs simple linear regression

Fundamental model is the same. Basic concepts and techniques translate directly from SLR.
◮ Individual parameter inference and estimation is the same, conditional on the rest of the model.
◮ We still use lm, summary, predict, etc.

The hardest part would be moving to matrix algebra to translate all of our equations. Luckily, R does all that for you.

25

slide-27
SLIDE 27

Polynomial regression

A nice bridge between SLR and MLR is polynomial regression. Still only one X variable, but we add powers of X:

    E[Y | X] = β0 + β1·X + β2·X² + · · · + βm·X^m

You can fit any mean function if m is big enough.
◮ Usually, m = 2 does the trick.

This is our first “multiple linear regression”!

26

slide-28
SLIDE 28

Example: telemarketing/call-center data.
◮ How does length of employment (months) relate to productivity (number of calls placed per day)?

> attach(telemkt <- read.csv("telemarketing.csv"))
> tele1 <- lm(calls ~ months)
> xgrid <- data.frame(months = 10:30)
> par(mfrow=c(1,2))
> plot(months, calls, pch=20, col=4)
> lines(xgrid$months, predict(tele1, newdata=xgrid))
> plot(months, tele1$residuals, pch=20, col=4)
> abline(h=0, lty=2)

27

slide-29
SLIDE 29
[Figure: left panel: calls vs. months with the fitted line from tele1; right panel: residuals vs. months, showing a clear curved pattern.]

It looks like there is a polynomial shape to the residuals.
◮ We are leaving some predictability on the table . . . just not linear predictability.

28

slide-30
SLIDE 30

Testing for nonlinearity

To see if you need more nonlinearity, try the regression which includes the next polynomial term, and see if it is significant. For example, to see if you need a quadratic term,
◮ run the regression E[Y | X] = β0 + β1·X + β2·X².
◮ If your test implies β2 ≠ 0, you need X² in your model.

Note: p-values are calculated “given the other β’s are nonzero”; i.e., conditional on X being in the model.

29

slide-31
SLIDE 31

Test for a quadratic term:

> months2 <- months^2
> tele2 <- lm(calls ~ months + months2)
> summary(tele2)  ## abbreviated output
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.140471    2.32263  -0.060    0.952
months       2.310202    0.25012   9.236 4.90e-08 ***
months2     -0.040118    0.00633  -6.335 7.47e-06 ***

The quadratic months2 term has a very significant t-value, so a better model is

    calls = β0 + β1·months + β2·months² + ε.

30
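As an aside, the same quadratic fit can be written directly in the lm formula, without creating months2 by hand; two equivalent sketches, both of which should give the same fit as tele2:

tele2b <- lm(calls ~ months + I(months^2))          ## I() protects the ^ inside a formula
tele2c <- lm(calls ~ poly(months, 2, raw=TRUE))     ## raw (not orthogonal) polynomial terms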

slide-32
SLIDE 32

Everything looks much better with the quadratic mean model.

> xgrid <- data.frame(months=10:30, months2=(10:30)^2)
> par(mfrow=c(1,2))
> plot(months, calls, pch=20, col=4)
> lines(xgrid$months, predict(tele2, newdata=xgrid))
> plot(months, tele2$residuals, pch=20, col=4)
> abline(h=0, lty=2)

[Figure: left panel: calls vs. months with the fitted quadratic curve; right panel: residuals vs. months, now patternless around zero.]

31

slide-33
SLIDE 33

A few words of caution

We can always add higher powers (cubic, etc.) if necessary.
◮ If you add a higher order term, the lower order term is kept regardless of its individual t-stat.

(see handout on website)

Be very careful about predicting outside the data range;
◮ the curve may do unintended things beyond the data.

Watch out for over-fitting.
◮ You can get a “perfect” fit with enough polynomial terms,
◮ but that doesn’t mean it will be any good for prediction or understanding.

32

slide-34
SLIDE 34

Beyond SLR

Many problems involve more than one independent variable or factor which affects the dependent or response variable.
◮ Multi-factor asset pricing models (beyond CAPM).
◮ Demand for a product given prices of competing brands, advertising, household attributes, etc.
◮ More than size to predict house price!

In SLR, the conditional mean of Y depends on X. The multiple linear regression (MLR) model extends this idea to include more than one independent variable.

33

slide-35
SLIDE 35

The MLR Model

The MLR model is the same as always, but with more covariates:

    Y | X1, . . . , Xd ∼ N( β0 + β1·X1 + · · · + βd·Xd, σ² )

Recall the key assumptions of our linear regression model:
(i) The conditional mean of Y is linear in the Xj variables.
(ii) The additive errors (deviations from the line)
    ◮ are Normally distributed
    ◮ independent from each other and all the Xj
    ◮ identically distributed (i.e., they have constant variance)

34

slide-36
SLIDE 36

Our interpretation of regression coefficients can be extended from the simple single covariate regression case:

    βj = ∂E[Y | X1, . . . , Xd] / ∂Xj

◮ Holding all other variables constant, βj is the average change in Y per unit change in Xj.
—————

∂ is from calculus and means “change in”

35

slide-37
SLIDE 37

If d = 2, we can plot the regression surface in 3D. Consider sales of a product as predicted by price of this product (P1) and the price of a competing product (P2).

◮ Everything on log scale ⇒ -1.0 & 1.1 are elasticities

36

slide-38
SLIDE 38

Obtaining these estimates in R is very easy:

> salesdata <- read.csv("sales.csv")
> attach(salesdata)
> salesMLR <- lm(Sales ~ P1 + P2)
> salesMLR

Call:
lm(formula = Sales ~ P1 + P2)

Coefficients:
(Intercept)           P1           P2
      1.003       -1.006        1.098

37

slide-39
SLIDE 39

Same Least Squares Principles

How do we estimate the MLR model parameters?
◮ fitted values: Ŷi = b0 + b1·X1i + b2·X2i + · · · + bd·Xdi
◮ residuals: ei = Yi − Ŷi
◮ residual variance: s² = (1/(n − p)) · Σᵢ eᵢ²,  where p = d + 1

Then find the best fitting plane, i.e., coefs b0, b1, b2, . . . , bd, by minimizing the sum of squared errors, s².

38
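For the curious, here is a minimal sketch of the matrix algebra R is doing behind lm(); it assumes the sales data from the earlier example, with columns named Sales, P1, and P2 as in the lm call.

X  <- cbind(1, salesdata$P1, salesdata$P2)   ## design matrix with an intercept column
y  <- salesdata$Sales
b  <- solve(t(X) %*% X, t(X) %*% y)          ## least squares: b = (X'X)^{-1} X'y
e  <- y - X %*% b                            ## residuals
s2 <- sum(e^2) / (nrow(X) - ncol(X))         ## residual variance, s^2 = SSE/(n - p)
cbind(b, coef(salesMLR))                     ## the two columns should match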

slide-40
SLIDE 40

Residuals in MLR

As in the SLR model, the residuals in multiple regression are purged of any linear relationship to the independent variables. We decompose Y into the part predicted by X and the part due to idiosyncratic error.

Y = Ŷ + e,     corr(Xj, e) = 0,     corr(Ŷ, e) = 0

These hold by construction, just like SLR.

39
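A quick check of this construction, assuming the salesMLR fit and salesdata from the earlier example; each correlation is zero up to rounding error.

cor(fitted(salesMLR), resid(salesMLR))   ## fitted values vs. residuals
cor(salesdata$P1, resid(salesMLR))       ## each covariate vs. residuals
cor(salesdata$P2, resid(salesMLR))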

slide-41
SLIDE 41

Inference for coefficients

As before in SLR, the LS linear coefficients are random (different for each sample) and correlated with each other. The sampling distribution for the whole vector b = [b0, b1, · · · , bd] is a multivariate Normal:

b ∼ Nd+1(β, Σb)

◮ With mean β = [β0, β1, · · · , βd]′ (unbiased, as before)
◮ Variance-covariance matrix Σb
◮ Same as last week:

    [ b0 ]         ( [ β0 ]    [ σb0²           cov(b0, b1) ] )
    [ b1 ]   ∼  N2 ( [ β1 ] ,  [ cov(b0, b1)    σb1²        ] )

40
slide-42
SLIDE 42

Coefficient covariance matrix

Σb = var(b): the p × p covariance matrix for the random vector b

    Σb = [ var(b0)        cov(b0, b1)     · · ·                           ]
         [ cov(b1, b0)    var(b1)                                         ]
         [     ⋮                ⋱                                         ]
         [                      var(bd−1)          cov(bd−1, bd)          ]
         [                      cov(bd, bd−1)      var(bd)                ]

◮ Variance decreases with n and var(X); increases with σ².
⇒ Standard errors are the square roots of the diagonal of Σb.

41
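In R, this matrix is available directly from a fitted model via vcov(); a short sketch using the salesMLR fit from before:

Sb <- vcov(salesMLR)    ## the p x p estimated covariance matrix of b
Sb
sqrt(diag(Sb))          ## standard errors, matching summary(salesMLR)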

slide-43
SLIDE 43

Inference for individual coefficients

Intervals and t-statistics are exactly the same as in SLR.
◮ A (1 − α)·100% C.I. for βj is bj ± zα/2·sbj.
◮ zbj = (bj − βj⁰)/sbj ∼ N(0, 1) is the number of standard errors between the LS estimate and the null value.

Intervals/testing via bj & sbj are one-at-a-time procedures:
◮ You are evaluating the jth coefficient conditional on the other X’s being in the model, but regardless of the values you’ve estimated for the other b’s.

42

slide-44
SLIDE 44

Conveniently, R’s summary gives you all the standard errors.

(or do it manually, see week3-Rcode.R)

> summary(salesMLR)  ## abbreviated output
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.002688   0.007445   134.7   <2e-16 ***
P1          -1.005900   0.009385  -107.2   <2e-16 ***
P2           1.097872   0.006425   170.9   <2e-16 ***

Residual standard error: 0.01453 on 97 degrees of freedom
Multiple R-squared: 0.998,     Adjusted R-squared: 0.9979
F-statistic: 2.392e+04 on 2 and 97 DF,  p-value: < 2.2e-16

43

slide-45
SLIDE 45

Forecasting in MLR

Prediction follows exactly the same methodology as in SLR. For new data xf = [X1,f, · · · , Xd,f]′,
◮ Ŷf = b0 + b1·X1,f + · · · + bd·Xd,f
◮ var[Yf | xf] = var(Ŷf) + var(εf) = sfit² + s² = spred²
◮ the (1 − α) level prediction interval is still Ŷf ± zα/2·spred.

44

slide-46
SLIDE 46

The syntax in R is also exactly the same as before:

> predict(salesMLR, data.frame(P1=1, P2=1),
+         interval="prediction", level=0.95)
       fit      lwr      upr
1 1.094661 1.064015 1.125306
> predict(salesMLR, data.frame(P1=1, P2=1),
+         se.fit=TRUE)$se.fit
[1] 0.005227347

45
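Just as in the SLR example, the interval can be rebuilt by hand from the pieces predict() returns; a sketch assuming the salesMLR fit above:

p     <- predict(salesMLR, data.frame(P1=1, P2=1), se.fit=TRUE)
s     <- p$residual.scale                      ## s: residual standard error
sfit  <- p$se.fit                              ## sfit
spred <- sqrt(s^2 + sfit^2)                    ## spred = sqrt(s^2 + sfit^2)
p$fit + c(-1, 1) * qt(.975, df=p$df) * spred   ## matches the lwr and upr shown above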

slide-47
SLIDE 47

Glossary and Equations

◮ LS Estimators: b1 = rxy·(sy/sx) = sxy/sx²  and  b0 = Ȳ − b1·X̄.
◮ Ŷi = b0 + b1·Xi is the ith fitted value.
◮ ei = Yi − Ŷi is the ith residual.
◮ σ̂, s: standard error of regression residuals (≈ σ = σε).
◮ sbj: standard error of regression coefficients:

    sb1 = √[ s² / ((n − 1)·sx²) ]          sb0 = s·√[ 1/n + X̄² / ((n − 1)·sx²) ]

46

slide-48
SLIDE 48

◮ α is the significance level (prob of type 1 error).
◮ zα/2 is the value such that, for Z ∼ N(0, 1), P[Z > −zα/2] = P[Z < zα/2] = α/2.
◮ zbj is the standardized coefficient: zbj = (bj − βj⁰)/sbj, which under H0 ∼ N(0, 1).
◮ The (1 − α)·100% confidence interval for βj is bj ± zα/2·sbj

47

slide-49
SLIDE 49

◮ Ŷf = b0 + b1·Xf is a forecast prediction, with

    se(Ŷf) = sfit = s·√[ 1/n + (Xf − X̄)² / ((n − 1)·sx²) ]

◮ The forecast residual is ef = Yf − Ŷf, and var(ef) = s² + sfit². That is, the predictive standard error is

    spred = s·√[ 1 + 1/n + (Xf − X̄)² / ((n − 1)·sx²) ],

and Ŷf ± zα/2·spred is the (1 − α)·100% prediction interval at Xf.

48

slide-50
SLIDE 50

Glossary and equations

MLR:
◮ Model: Y | X1, . . . , Xd ∼ N(β0 + β1·X1 + · · · + βd·Xd, σ²), independently across observations
◮ Prediction: Ŷi = b0 + b1·X1i + b2·X2i + · · · + bd·Xdi
◮ b ∼ Np(β, Σb)
◮ Interaction: Yi = β0 + β1·X1i + β2·X2i + β3·(X1i·X2i) + . . . + ε, so that
      ∂E[Y | X1, X2] / ∂X1 = β1 + β3·X2

49
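In R, an interaction and its main effects can be requested together in the formula. A generic sketch; the data frame df and the variables Y, X1, X2 here are placeholders, not the course data:

fit <- lm(Y ~ X1*X2, data=df)        ## X1*X2 expands to X1 + X2 + X1:X2
b   <- coef(fit)
b["X1"] + b["X1:X2"] * 2             ## partial effect of X1 when X2 = 2: b1 + b3*X2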

slide-51
SLIDE 51

Glossary and equations

R² and related terms
◮ SST = Σᵢ (Yi − Ȳ)²
◮ SSR = Σᵢ (Ŷi − Ȳ)²
◮ SSE = Σᵢ (Yi − Ŷi)²
◮ (Watch out: sometimes SSE is called SSR or RSS!)
◮ R² = SSR/SST = cor²(Ŷ, Y) = rŷy²

Elasticity is the slope in a log-log model: β1 ≈ d%Y / d%X.

(See handout on course website.)

50
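These identities are easy to verify on the sales fit; a short sketch assuming salesMLR and salesdata from earlier (all three numbers should agree with the R-squared reported by summary(salesMLR)):

y    <- salesdata$Sales
yhat <- fitted(salesMLR)
SST  <- sum((y - mean(y))^2)
SSR  <- sum((yhat - mean(y))^2)
SSE  <- sum((y - yhat)^2)
c(SSR/SST, 1 - SSE/SST, cor(y, yhat)^2)   ## three equivalent ways to get R^2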