Week 4: Multiple Linear Regression – Causation, Categorical Variables, Interactions, Log Transformation

SLIDE 1

BUS41100 Applied Regression Analysis

Week 4: Multiple Linear Regression

Causation, Categorical Variables, Interactions, Log Transformation

Max H. Farrell
The University of Chicago Booth School of Business

SLIDE 2

Causality

When does correlation ⇒ causation?
◮ We have been careful to never say that X causes Y . . .
◮ . . . but we’ve really wanted to.
◮ We want to find a “real” underlying mechanism:

What’s the change in Y as T moves independently of all other influences?

But how can we do this in regression?
◮ First we’ll look at the Gold Standard: experiments
◮ Watch out for multiple testing
◮ Then see how this works in regression

1

SLIDE 3

Randomized Experiments

We want to know the effect of treatment T on outcome Y.

What’s the problem with “regular” data? Selection.
◮ People choose their treatments

◮ Eg: (i) Firm investment & tax laws; (ii) people & training/education; (iii) . . . .

Experiments are the best way to find a true causal effect. Why?

The key is randomization:
◮ No systematic relationship between units and treatments

◮ T moves independently by design.

◮ T is discrete, usually binary.

◮ Classic: drug vs. placebo
◮ Newer: website experience (A/B testing)

◮ Experiments are important (& common) in their own right

2

SLIDE 4

The fundamental question: Is Y better on average with T?

E[Y | T = 1] > E[Y | T = 0] ?

We need a model for E[Y | T]
◮ T is just a special X variable: E[Y | T] = β0 + βT·T
◮ βT is the Average Treatment Effect (ATE)
◮ This is not a prediction problem, . . .
◮ . . . it’s an inference problem, about a single coefficient.

Estimation: bT = β̂T = ȲT=1 − ȲT=0

Can’t usually do better than this. (Be wary of any claims.)
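As a quick aside (not from the slides), a minimal simulated R sketch showing that the OLS slope on a binary treatment is exactly the difference in group means; the data and variable names are hypothetical, with treat playing the role of T:

set.seed(41100)                        # hypothetical simulated example
n <- 1000
treat <- rbinom(n, 1, 0.5)             # randomized binary treatment (the slides' T)
Y <- 2 + 1.5*treat + rnorm(n)          # true ATE = 1.5

mean(Y[treat == 1]) - mean(Y[treat == 0])   # difference in group means
coef(lm(Y ~ treat))["treat"]                # OLS slope: numerically the same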

3

SLIDE 5

Why do we care about the average Y?

First, we might care about Y directly, for an individual unit:
◮ Does Y = earnings increase after T = training?
◮ e.g. does getting an MBA increase earnings?
◮ Do firms benefit from consulting?
◮ Do people live longer with a medication/procedure?
◮ Do people stay longer on my website with the new design?

Or, we might care about aggregate measures:
◮ Y = purchase yes/no, then profit is P = price × Y
◮ Average profit per customer: E[P × Y]
◮ Total profit: (No. customers) × E[P × Y]
◮ Higher price means fewer customers, but perhaps more profit overall? (Ignore Giffen goods)

4

SLIDE 6

Profit Maximization

Data from an online recruiting service
◮ Customers are firms looking to hire
◮ Fixed price is charged for access
◮ Post job openings, find candidates, etc.

Question is: what price to charge?

Profit at price P = Quantity(P) × (P - Cost)

Arriving customers are shown a random price P
◮ P is our treatment variable T
◮ How to randomize matters:
◮ Why not do: P1 in June, P2 in July, . . . ? What’s wrong?

Data set includes
◮ P = price – price they were shown, $99 or $249
◮ Y = buy – did this firm sign up for service: yes/no

5

SLIDE 7

Let’s see the data

> price.data <- read.csv("priceExperiment.csv")
> summary(price.data)
> head(price.data)

Note that Y = buy is binary. That’s okay! E[Y] = P[Y = 1]

Computing the ATE and Profit:

> purchases <- by(price.data$buy, price.data$price, mean)
> purchases[2] - purchases[1]
-0.1291639

> 249*purchases[2] - 99*purchases[1]
4.311221

-0.13 what? 4.31 what? For whom? How many?

6

SLIDE 8

Regression version: computing ATE

> summary(lm(price.data$buy ~ price.data$price))
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       0.3284017  0.0195456  16.802   <2e-16 ***
price.data$price -0.0008611  0.0001039  -8.287   <2e-16 ***

Careful with how you code the variables!

> summary(lm(price.data$buy ~ (price.data$price==249)))
Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                   0.24315    0.01091  22.285   <2e-16 ***
price.data$price == 249TRUE  -0.12916    0.01559  -8.287   <2e-16 ***

What’s so special about T = 0/1?

7

SLIDE 9

Regression version: computing profit

> profit <- price.data$buy*price.data$price
> summary(lm(profit ~ (price.data$price==249)))
Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                   24.072      1.820  13.226   <2e-16 ***
price.data$price == 249TRUE    4.311      2.600   1.658   0.0974 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 63.18 on 2361 degrees of freedom
Multiple R-squared: 0.001163, Adjusted R-squared: 0.0007402
F-statistic: 2.75 on 1 and 2361 DF, p-value: 0.09741

◮ Same profit estimate, thanks to the transformed Y variable
◮ Tiny R²! Why?
◮ What’s 24.072?

8

SLIDE 10

What about variables other than Y and T?

We usually have information (some X’s) other than Y and T
◮ Key: when was the information recorded?
◮ Useful other X variables are “pre-treatment”: not affected by treatment or even treatment assignment
◮ Useful for targeting, heterogeneity (see homework)

Important idea: Randomized means randomized for every value of X

> table(price.data$customerSize)
   0    1    2
1897  216  250

⇒ Nothing wrong with

> summary(lm(buy ~ (price==249), data=price.data[price.data$customerSize==2,]))
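Because the price was randomized within every customer size, the ATE can also be estimated separately for each subgroup. A minimal sketch (mine, not from the deck), assuming price.data has the buy, price, and customerSize columns used above:

# ATE (difference in purchase rates) within each customerSize level
by(price.data, price.data$customerSize,
   function(d) mean(d$buy[d$price == 249]) - mean(d$buy[d$price == 99]))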

SLIDE 11

Causality Without Randomization

We want to find: the change in Y caused by T moving independently of all other influences.

Our MLR interpretation of E[Y | T, X]: the change in Y associated with T, holding fixed all X variables.

⇒ We need T to be randomly assigned given X
◮ X must include enough variables so T is random.
◮ Requires a lot of knowledge!
◮ No systematic relationship between units and treatments, conditional on X.
◮ It’s OK if X is predictive of Y.

10

SLIDE 12

The model is the same as always:

E[Y | T, X] = β0 + βT·T + β1X1 + · · · + βdXd.

But the assumptions change:
◮ This is a structural model: it says something true about the real world.
◮ Need X to control for all sources of non-randomness.
◮ Even possible?

Then the interpretation changes: βT is the average treatment effect
◮ Continuous “treatments” are easy.
◮ Not a “conditional average treatment effect”
◮ What happens to βT as the variables change? To bT?
◮ No T × X interactions, why? What would these mean?

11

SLIDE 13

Example: Bike Sharing & Weather: does a change in humidity cause a change in bike rentals?

From Capital Bikeshare (D.C.’s Divvy) we have daily bike rentals & weather info.
◮ Y1 = registered – # rentals by registered users
◮ Y2 = casual – # rentals by non-registered users
◮ T = humidity – relative humidity (continuous!)

Possible controls/confounders:
◮ season
◮ holiday – Is the day a holiday?
◮ workingday – Is it a work day (not holiday, not weekend)?
◮ weather – coded 1=nice, 2=OK, 3=bad
◮ temp – degrees Celsius
◮ feels.like – “feels like” in Celsius
◮ windspeed

12

SLIDE 14

Is humidity randomly assigned to days?

[Scatterplots of Casual Rentals and Registered Rentals (0–6000) against Humidity (20–100), shown for all data and for Humidity > 0.]

humidity ↑ ⇒ rentals ↓! Or is this because of something else?

13

SLIDE 15

The “randomized experiment” coefficient

> summary(casual.reg <- lm(casual ~ humidity, data=bike))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1092.719    114.116   9.576  < 2e-16 ***
humidity      -5.652      1.912  -2.957  0.00327 **

. . . is pretty similar to the effect with controls. So what?

> summary(casual.reg.main <- lm(casual ~ humidity + season + holiday
+     + workingday + weather + temp + windspeed, data=bike))
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    716.964    203.273   3.527 0.000464 ***
humidity        -6.845      1.496  -4.574  6.2e-06 ***
seasonspring   -94.041     82.189  -1.144 0.253152
seasonsummer   182.964     53.249   3.436 0.000646 ***
seasonwinter    57.194     68.849   0.831 0.406578
holiday       -285.327    103.757  -2.750 0.006203 **
workingday    -796.933     37.381 -21.319  < 2e-16 ***
weathernice    308.495    100.633   3.066 0.002305 **
weatherok      264.843     92.695   2.857 0.004475 **
temp            39.430      4.045   9.747  < 2e-16 ***
windspeed      -10.912      3.071  -3.554 0.000420 ***

14

SLIDE 16

The bottom line:

You only get causal effects with strong assumptions.
◮ Real-world concerns take precedence over statistics.
◮ Is there an economic/business/etc. justification for your choice of X?
◮ The data/computer cannot help. Only assumptions.

Causal inference from observational data may be the hardest problem in statistics.
◮ We are just scratching the surface in terms of ideas, methods, applications, . . .
◮ Still an active area of research in econometrics, statistics, & machine learning.

15

SLIDE 17

Moving on. . .

That’s all we can really say about causation, and all anyone can really say conceptually.
◮ From now on, no regression is causal unless you want to assume it is.

We will just remember our MLR interpretation (“controlling for”) and we’ll build
◮ fancier regressions,
◮ different data types,
◮ more tools.

16

SLIDE 18

Categorical effects/dummy variables

To represent qualitative factors in multiple regression, we use dummy, binary, or indicator variables.
◮ temporal effects (1 if Holiday season, 0 if not)
◮ spatial (1 if in Midwest, 0 if not)

If a factor X takes R possible levels, we use R − 1 dummies
◮ Allow the intercept to shift by taking on the value 0 or 1
◮ 1[X=r] = 1 if X = r, 0 if X ≠ r.

E[Y|X] = β0 + β1·1[X=2] + β2·1[X=3] + · · · + βR−1·1[X=R]

What is E[Y|X = 1]?
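As an illustration (not in the slides), R’s model.matrix shows the R − 1 dummy columns that lm builds from a factor; the toy factor below is hypothetical:

x <- factor(c("Dodge", "Ford", "GMC", "Ford"))   # hypothetical factor with R = 3 levels
model.matrix(~ x)
# columns: (Intercept), xFord, xGMC; that is 3 - 1 = 2 dummies, with "Dodge" as the baseline level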

17

SLIDE 19

Example: back to the pickup truck data. Does price vary by make?

> attach(pickup <- read.csv("pickup.csv"))
> c(mean(price[make=="Dodge"]), mean(price[make=="Ford"]), mean(price[make=="GMC"]))
[1] 6554.200 8867.917 7996.208

◮ GMC seems lower on average, but lots of overlap.
◮ Not much of a pattern.

[Plot of price (roughly $5,000–$15,000) by make: Dodge, Ford, GMC.]

18

SLIDE 20

Now fit with linear regression:

E[price|make] = β0 + β1·1[make=Ford] + β2·1[make=GMC]

Easy in R (if make is a factor variable)

> summary(trucklm1 <- lm(price ~ make, data=pickup))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     6554       1787   3.667 0.000671 ***
makeFord        2314       2420   0.956 0.344386
makeGMC         1442       2127   0.678 0.501502

The coefficient values correspond to our dummy variables. What are the p-values?
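A small follow-up (my addition), assuming the trucklm1 fit above: the intercept is the Dodge mean and the dummy coefficients are shifts from it, so the fit reproduces the three make-specific means from the previous slide.

b <- coef(trucklm1)
c(Dodge = b[["(Intercept)"]],
  Ford  = b[["(Intercept)"]] + b[["makeFord"]],
  GMC   = b[["(Intercept)"]] + b[["makeGMC"]])
# recovers 6554.200, 8867.917, 7996.208 (up to the rounding shown above)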

19

SLIDE 21

What if you also want to include mileage?
◮ No problem.

> pickup$miles <- pickup$miles/10000
> trucklm2 <- lm(price ~ make + miles, data=pickup)
> summary(trucklm2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12761.8     1746.6   7.307 5.31e-09 ***
makeFord      2185.7     1842.9   1.186    0.242
makeGMC       2298.8     1627.0   1.413    0.165
miles         -654.1      115.3  -5.671 1.18e-06 ***

For all three makes, the expected price drops by about $654 per 10,000 miles.

20

SLIDE 22

Different intercepts, same slope!

> plot(miles, price, pch=20, col=make, xlab="miles (10k)", ylab="price ($)")
> abline(a=coef(trucklm2)[1], b=coef(trucklm2)[4], col=1)
> abline(a=(coef(trucklm2)[1]+coef(trucklm2)[2]), b=coef(trucklm2)[4], col=2)
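The snippet above draws only the Dodge and Ford lines; assuming the same trucklm2 fit, the GMC line could be added the same way (my addition, not from the slides):

> abline(a=(coef(trucklm2)[1]+coef(trucklm2)[3]), b=coef(trucklm2)[4], col=3)  # GMC intercept shift, common slope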

[Scatterplot of price ($) vs. miles (10k), points colored by make (Dodge, Ford, GMC), with the fitted parallel lines.]

◮ Dodge trucks affect all slopes!

21

SLIDE 23

Variable interaction

So far we have considered the impact of each independent variable in an additive way. We can extend this notion and include interaction effects through multiplicative terms.

Yi = β0 + β1·X1i + β2·X2i + β3·(X1i × X2i) + · · · + εi

∂E[Y | X1, X2] / ∂X1 = β1 + β3·X2

22

SLIDE 24

Interactions with dummy variables

Dummy variables separate out categories
◮ Different intercept for each category

Interactions with dummies separate out trends
◮ Different slope for each category

Yi = β0 + β1·1{X1i=1} + β2·X2i + β3·(1{X1i=1} × X2i) + · · · + εi

∂E[Y | X1 = 0, X2] / ∂X2 = β2
∂E[Y | X1 = 1, X2] / ∂X2 = β2 + β3

23

SLIDE 25

Same slope, different intercept
◮ Price difference does not depend on mileage!

> trucklm2 <- lm(price ~ make + miles, data=pickup)

[Scatterplot of price ($) vs. miles (10k) by make (Dodge, Ford, GMC), with parallel fitted lines.]

◮ Dodge trucks affect all slopes!

24

SLIDE 26

Now add individual slopes!
◮ Price difference varies with miles!

> trucklm3 <- lm(price ~ make*miles, data=pickup)

[Scatterplot of price ($) vs. miles (10k) by make (Dodge, Ford, GMC), with a separate fitted line per make.]

◮ Dodge doesn’t affect the Ford or GMC b0, b1

25

SLIDE 27

What do the numbers show?

> summary(trucklm3)
               Estimate Std. Error t value Pr(>|t|)
(Intercept)        8862       2508     3.5    0.001 **
makeFord           5216       3707     1.4    0.167
makeGMC            8360       3080     2.7    0.010 **
miles              -243        225    -1.1    0.287
makeFord:miles     -317        347    -0.9    0.366
makeGMC:miles      -611        268    -2.3    0.028 *

> c(coef(trucklm3)[1], coef(trucklm3)[4])  ## (b_0, b_1) Dodge
(Intercept)       miles
  8862.1987   -243.1789

> c((coef(trucklm3)[1]+coef(trucklm3)[2]),  ## b_0 Ford
+   (coef(trucklm3)[4]+coef(trucklm3)[5]))  ## b_1 Ford
(Intercept)       miles
 14078.6715   -560.5871

26

SLIDE 28

What do the numbers show?

> summary(trucklm3)
               Estimate Std. Error t value Pr(>|t|)
(Intercept)        8862       2508     3.5    0.001 **
makeFord           5216       3707     1.4    0.167
makeGMC            8360       3080     2.7    0.010 **
miles              -243        225    -1.1    0.287
makeFord:miles     -317        347    -0.9    0.366
makeGMC:miles      -611        268    -2.3    0.028 *

> price.Ford <- price[make=="Ford"]
> miles.Ford <- miles[make=="Ford"]
> summary(lm(price.Ford ~ miles.Ford))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14078.7     3094.6   4.549  0.00106 **
miles.Ford    -560.6      299.3  -1.873  0.09054 .

27

SLIDE 29

Individual slopes, and X²!
◮ Price difference varies with miles!

> trucklm5 <- lm(price ~ make*miles + I(miles^2), data=pickup)
(see week3-Rcode.R for graphing)

[Scatterplot of price ($) vs. miles (10k) by make (Dodge, Ford, GMC), with fitted curves.]

◮ Common quadratic. Interpretation?
◮ Extrapolation danger!
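week3-Rcode.R is not reproduced here; one way to draw the fitted curves (a sketch of mine, assuming the pickup data and the trucklm5 fit above) is to predict over a grid of mileages for each make:

mk <- sort(unique(pickup$make))
grid <- seq(min(pickup$miles), max(pickup$miles), length.out=100)
plot(pickup$miles, pickup$price, pch=20, col=as.numeric(factor(pickup$make)),
     xlab="miles (10k)", ylab="price ($)")
for (i in seq_along(mk)) {
  yhat <- predict(trucklm5, newdata=data.frame(make=mk[i], miles=grid))
  lines(grid, yhat, col=i)    # one fitted curve per make
}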

28

SLIDE 30

Individual slopes and X²!
◮ Price difference varies with miles!

> trucklm5 <- lm(price ~ make*miles + I(miles^2), data=pickup)
(see week3-Rcode.R for graphing)

[Scatterplot of price ($) vs. miles (10k) by make (Dodge, Ford, GMC), with fitted curves.]

◮ Common quadratic. Interpretation?
◮ Extrapolation danger!

29

SLIDE 31

Individual slopes and X²!
◮ Price difference varies with miles!

> trucklm6 <- lm(price ~ make*(miles+I(miles^2)), data=pickup)
(see week3-Rcode.R for graphing)

[Scatterplot of price ($) vs. miles (10k) by make (Dodge, Ford, GMC), with a separate fitted curve per make.]

◮ Different quadratic. Interpretation?
◮ Extrapolation danger!

30

SLIDE 32

Interactions with continuous variables

Example: connection between college & MBA grades.

A model to predict Booth GPA from college GPA could be

GPA_MBA = β0 + β1·GPA_Bach + ε.

> grades <- read.csv("grades.csv")
> summary(grades)  # output not shown
> attach(grades)
> summary(lm(MBAGPA ~ BachGPA))  ## severely abbreviated
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.58985    0.31206   8.299  1.2e-11 ***
BachGPA      0.26269    0.09244   2.842  0.00607 **

◮ For every 1 point increase in college GPA, your expected GPA at Booth increases by about 0.26 points.

31

SLIDE 33

However, this model assumes that the marginal effect of College GPA is the same for any age.
◮ But I’d guess that how you did in college has less effect on your MBA GPA as you get older (farther from college).

We can account for this intuition with an interaction term:

GPA_MBA = β0 + β1·GPA_Bach + β2·(Age × GPA_Bach) + ε

Now, the college effect is

∂E[GPA_MBA | GPA_Bach, Age] / ∂GPA_Bach = β1 + β2·Age.

⇒ Depends on Age!

32

SLIDE 34

Fitting interactions in R is easy:

lm(Y ~ X1*X2) fits E[Y ] = β0 + β1X1 + β2X2 + β3X1X2.

Here, we want the interaction but do not want to include the main effect of age (should age matter individually?).

> summary(lm(MBAGPA ~ BachGPA*Age - Age))
Coefficients:  ## output abbreviated
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.820494   0.296928   9.499 1.23e-13 ***
BachGPA      0.455750   0.103026   4.424 4.07e-05 ***
BachGPA:Age -0.009377   0.002786  -3.366  0.00132 **

33

SLIDE 35

Without the interaction term:
◮ Marginal effect of College GPA is b1 = 0.26.

With the interaction term:
◮ Marginal effect is b1 + b2·Age = 0.46 − 0.0094·Age.

Age   Marginal Effect
25    0.22
30    0.17
35    0.13
40    0.08
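A small sketch (mine) of how this table can be reproduced from the fitted coefficients, assuming the interaction fit from the previous slide is stored as gpa.fit:

gpa.fit <- lm(MBAGPA ~ BachGPA*Age - Age)            # fit from the previous slide
b <- coef(gpa.fit)
b["BachGPA"] + b["BachGPA:Age"] * c(25, 30, 35, 40)  # marginal effects at each Age
# roughly 0.22, 0.17, 0.13, 0.08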

34

SLIDE 36

The log-log model

A common covariate transform is log(X).
◮ When X-values are bunched up, log(X) helps spread them out and reduces the leverage of extreme values.
◮ Recall that both reduce s_b1.

In practice, this is often used in conjunction with a log(Y) response transformation.
◮ The log-log model is log(Y) = β0 + β1·log(X) + ε.
◮ It is super useful, and has some special properties ...

35

SLIDE 37

Recall that
◮ log is always natural log, with base e = 2.718 . . ., and
◮ log(ab) = log(a) + log(b)
◮ log(a^b) = b·log(a).

Consider the multiplicative model E[Y|X] = A·X^B.

Take logs of both sides to get

log(E[Y|X]) = log(A) + log(X^B) = log(A) + B·log(X) ≡ β0 + β1·log(X).

The log-log model is appropriate whenever things are linearly related on a multiplicative, or percentage, scale.

(See handout on course website.)
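To make the multiplicative story concrete, a quick simulated sketch (mine, not part of the deck): data generated as Y = A·X^B with multiplicative noise are linear only after logging both sides.

set.seed(1)                                # hypothetical simulated example
A <- 2; B <- 0.7
X <- exp(runif(200, 0, 4))                 # bunched-up, right-skewed X values
Y <- A * X^B * exp(rnorm(200, 0, 0.2))     # multiplicative errors
coef(lm(log(Y) ~ log(X)))                  # slope near B = 0.7, intercept near log(A) = 0.69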

36

SLIDE 38

Consider a country’s GDP as a function of IMPORTS:
◮ Since trade multiplies, we might expect to see %GDP increase with %IMPORTS.

[Side-by-side scatterplots: GDP vs. IMPORTS on the raw scale and log(GDP) vs. log(IMPORTS), with points labeled by country (Argentina, Australia, ..., United States).]

37

SLIDE 39

Elasticity and the log-log model

In a log-log model, the slope β1 is sometimes called an elasticity. An elasticity is (roughly) the % change in Y per 1% change in X:

β1 ≈ d%Y / d%X

For example, economists often assume that GDP has import elasticity of 1. Indeed:

> coef(lm(log(GDP) ~ log(IMPORTS)))
 (Intercept) log(IMPORTS)
      1.8915       0.9693

(Can we test for 1%?)
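One way to carry out that test (a sketch of mine, assuming the trade data are loaded as GDP and IMPORTS): compare the estimated elasticity to 1 using its standard error.

trade.fit <- lm(log(GDP) ~ log(IMPORTS))
est <- coef(summary(trade.fit))["log(IMPORTS)", "Estimate"]
se  <- coef(summary(trade.fit))["log(IMPORTS)", "Std. Error"]
tstat <- (est - 1) / se                            # H0: elasticity = 1
2 * pt(-abs(tstat), df = trade.fit$df.residual)    # two-sided p-value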

38

SLIDE 40

Price elasticity

In marketing, the slope coefficient β1 in the regression

log(sales) = β0 + β1·log(price) + ε

is called price elasticity:
◮ the % change in sales per 1% change in price.

The model implies that E[sales|price] = A·price^β1, such that β1 is the constant rate of change.

Economists have “demand elasticity” curves, which are just more general and harder to measure.

39

SLIDE 41

Example: we have Nielsen SCANTRACK data on supermarket sales of a canned food brand produced by Consolidated Foods.

> attach(confood <- read.csv("confood.csv"))
> par(mfrow=c(1,2))
> plot(Price, Sales, pch=20)
> plot(log(Price), log(Sales), pch=20)

[Side-by-side scatterplots: Sales vs. Price (roughly $0.60–$0.85) and log(Sales) vs. log(Price).]

40

SLIDE 42

Run the regression to determine price elasticity:

> confood.reg <- lm(log(Sales) ~ log(Price))
> coef(confood.reg)
(Intercept)  log(Price)
   4.802877   -5.147687

> plot(log(Price), log(Sales), pch=20)
> abline(confood.reg, col=4)

[Scatterplot of log(Sales) vs. log(Price) with the fitted regression line.]

◮ Sales decrease by about 5% for every 1% price increase.
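As a rough check (mine, using the coefficient above): the fitted model says expected sales scale like price^(-5.15), so a 1% price increase multiplies expected sales by about 0.95, i.e. roughly a 5% drop.

1.01^coef(confood.reg)["log(Price)"]   # approximately 0.95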

41

SLIDE 43

Summary

Multiple linear regression is just like SLR . . . with one important tweak. Interpretation is crucial:
◮ Holding other variables fixed
◮ Causation?
◮ Polynomials, interactions, log transformations, . . .

42