Week 1: Introduction to Remote Learning: Format, Policies, Guiding Principles - PowerPoint PPT Presentation



SLIDE 1

BUS 41100 Applied Regression Analysis

Week 1: Introduction to Remote Learning

Format, Policies, Guiding Principles

Max H. Farrell
The University of Chicago Booth School of Business

SLIDE 2

Remote Instruction

Guiding Principles
◮ Be patient
◮ Be flexible
◮ Learn something
◮ Student interaction

When in doubt, ask! I haven’t thought of everything, and everyone’s needs are different.

SLIDE 3

What is class going to look like?

Synchronous (but recorded)
◮ Lectures: live during class, format will evolve over time
◮ Office hours: twice a week, times TBD

Your work
◮ Group work: homework & project. Randomly assigned groups to facilitate interaction.
◮ Midterm exam: on your own.

Resources
◮ Course website: slides, data, etc.
◮ Piazza: Q & A
◮ Textbook: Sheather. Recommended, not required; see syllabus

SLIDE 4

Your work

Turned-in work: clear, concise, and on message
◮ Fewer plots usually better
◮ Results and analysis, not output/code

Homework
◮ Not exam practice! Not similar at all
◮ Reinforce & extend ideas, challenge you
◮ Open-ended analysis

Exams
◮ Narrower scope
◮ Test core concepts/abilities
◮ Look at sample exams to get a sense of style

Project: Your glimpse at real life!

SLIDE 5

Course Overview

Rough outline
◮ Weeks 1–4: Simple and Multiple Linear Regression
◮ Weeks 5–6: Panel and Time Series Data
◮ Week 7: Logistic Regression
◮ Weeks 8–9: Model Building
◮ Week 10: Presentations

But . . . we will be flexible and patient
◮ Cover the material we can learn well
◮ Fit an exam in somewhere

SLIDE 6

BUS 41100 Applied Regression Analysis

Week 1: Introduction, Simple Linear Regression

Data visualization, conditional distributions, correlation, and least squares regression

Max H. Farrell
The University of Chicago Booth School of Business

SLIDE 7

The basic problem

◮ Available data on two or more variables
◮ Formulate a model to predict or estimate a value of interest
◮ Use the estimate to make a (business) decision

SLIDE 8

Regression: What is it?

◮ Simply: the most widely used statistical tool for understanding relationships among variables
◮ A conceptually simple method for investigating relationships between one or more factors and an outcome of interest
◮ The relationship is expressed in the form of an equation or a model connecting the outcome to the factors

SLIDE 9

Regression in business

◮ Optimal portfolio choice:
  • Predict the future joint distribution of asset returns
  • Construct an optimal portfolio (choose weights)
◮ Determining price and marketing strategy:
  • Estimate the effect of price and advertisement on sales
  • Decide what is the optimal price and ad campaign
◮ Credit scoring model:
  • Predict the future probability of default using known characteristics of the borrower
  • Decide whether or not to lend (and if so, how much)

SLIDE 10

Regression in everything

Straight prediction questions:
◮ What price should I charge for my car?
◮ What will the interest rates be next month?
◮ Will this person like that movie?

Explanation and understanding:
◮ Does your income increase if you get an MBA?
◮ Will tax incentives change purchasing behavior?
◮ Is my advertising campaign working?

SLIDE 11

Data Visualization

Example: pickup truck prices on Craigslist. We have 4 dimensions to consider.

> data <- read.csv("pickup.csv")
> names(data)
[1] "year"  "miles" "price" "make"

A simple summary is

> summary(data)
      year          miles            price            make
 Min.   :1978   Min.   :  1500   Min.   : 1200   Dodge:10
 1st Qu.:1996   1st Qu.: 70958   1st Qu.: 4099   Ford :12
 Median :2000   Median : 96800   Median : 5625   GMC  :24
 Mean   :1999   Mean   :101233   Mean   : 7910
 3rd Qu.:2003   3rd Qu.:130375   3rd Qu.: 9725
 Max.   :2008   Max.   :215000   Max.   :23950

SLIDE 12

First, the simple histogram (for each continuous variable).

> par(mfrow=c(1,3))
> hist(data$year)
> hist(data$miles)
> hist(data$price)

[Figure: histograms of data$year, data$miles, and data$price]

Data is “binned” and the plotted bar height is the count in each bin.

SLIDE 13

We can use scatterplots to compare two dimensions.

> par(mfrow=c(1,2))
> plot(data$year, data$price, pch=20)
> plot(data$miles, data$price, pch=20)

[Figure: scatterplots of price vs. year and price vs. miles]

SLIDE 14

Add color to see another dimension.

> par(mfrow=c(1,2))
> plot(data$year, data$price, pch=20, col=data$make)
> legend("topleft", levels(data$make), fill=1:3)
> plot(data$miles, data$price, pch=20, col=data$make)

[Figure: the same scatterplots, with points colored by make (Dodge, Ford, GMC)]

SLIDE 15

Boxplots are also super useful.

> year <- data$year
> year_boxplot <- factor(1*(year<1995) + 2*(1995<=year & year<2000)
+                        + 3*(2000<=year & year<2005) + 4*(2005<=year & year<2009),
+                        labels=c("<1995", "'95-'99", "2000-'04", "'05-'09"))
> boxplot(data$price ~ data$make, ylab="Price ($)", main="Make")
> boxplot(data$price ~ year_boxplot, ylab="Price ($)", main="Year")

[Figure: boxplots of price by make (Dodge, Ford, GMC) and by year bin (<1995, '95-'99, 2000-'04, '05-'09)]

The box is the Interquartile Range (IQR; i.e., 25th to 75th percentile), with the median in bold. The whiskers extend to the most extreme point that is no more than 1.5 times the IQR width from the box.
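The box-and-whisker rule just described can be checked numerically. Below is a minimal Python sketch (the course itself uses R; the price list is hypothetical, not the pickup data), using `statistics.quantiles` for the quartiles:

```python
import statistics

# Hypothetical prices in $ (illustration only, not the pickup data)
prices = [1200, 3000, 4099, 5625, 7910, 9725, 15000, 23950]

q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1  # the "box" spans q1 to q3, with the median inside

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lo_whisker = min(p for p in prices if p >= lo_fence)
hi_whisker = max(p for p in prices if p <= hi_fence)
outliers = [p for p in prices if p < lo_fence or p > hi_fence]

print(lo_whisker, hi_whisker, outliers)
```

With these numbers the top price (23950) falls beyond the upper fence, so a boxplot would draw it as an individual point rather than extend the whisker to it.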

SLIDE 16

Regression is what we’re really here for.

> plot(data$year, data$price, pch=20, col=data$make)
> abline(lm(price ~ year, data=data), lwd=1.5)

[Figure: price vs. year and price vs. miles scatterplots, colored by make, with the fitted least squares line]

◮ Fit a line through the points, but how?
◮ lm stands for linear model
◮ Rest of the course: formalize and explore this idea

SLIDE 17

Predicting house prices

Problem:
◮ Predict market price based on observed characteristics

Solution:
◮ Look at property sales data where we know the price and some observed characteristics.
◮ Build a decision rule that predicts price as a function of the observed characteristics.

⇒ We have to define the variables of interest and develop a specific quantitative measure of these variables

SLIDE 18

What characteristics do we use?
◮ Many factors or variables affect the price of a house
  • size of house
  • number of baths
  • garage, air conditioning, etc.
  • size of land
  • location
◮ Easy to quantify price and size, but what about other variables such as location, aesthetics, workmanship, etc.?

SLIDE 19

To keep things super simple, let’s focus only on the size of the house.

The value that we seek to predict is called the dependent (or output) variable, and we denote this as
◮ Y = price of house (e.g., thousands of dollars)

The variable that we use to guide prediction is the explanatory (or input) variable, and this is labelled
◮ X = size of house (e.g., thousands of square feet)

SLIDE 20

What do the data look like?

> size <- c(.8,.9,1,1.1,1.4,1.4,1.5,1.6,
+           1.8,2,2.4,2.5,2.7,3.2,3.5)
> price <- c(70,83,74,93,89,58,85,114,
+           95,100,138,111,124,161,172)
> plot(size, price, pch=20)

[Figure: scatterplot of price vs. size]

SLIDE 21

There appears to be a linear relationship between price and size:
◮ as size goes up, price goes up.

Fitting a line by the “eyeball” method:

> abline(35, 40, col="red")

[Figure: the same scatterplot with the red “eyeball” line, intercept 35 and slope 40]

SLIDE 22

Recall that the equation of a line is:

Ŷ = b0 + b1 X

where b0 is the intercept and b1 is the slope.

In the house price example:
◮ our “eyeball” line has b0 = 35, b1 = 40.
◮ we can predict the price of a house when we know only its size:
◮ just read the value off the line that we’ve drawn.
◮ The intercept value is in units of Y ($1,000).
◮ The slope is in units of Y per unit of X ($1,000 per 1,000 sq ft).
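To see the units in action, here is a tiny Python sketch using the eyeball coefficients from this slide (b0 = 35, b1 = 40); the function name is just for illustration:

```python
b0, b1 = 35, 40  # eyeball line: intercept in $1,000s, slope in $1,000 per 1,000 sq ft

def predict_price(size_ksqft):
    """Read the prediction off the line: Yhat = b0 + b1 * X."""
    return b0 + b1 * size_ksqft

# A 2,000 sq ft house (X = 2): 35 + 40 * 2 = 115, i.e. $115,000
print(predict_price(2.0))
```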

SLIDE 23

Recall how the slope (b1) and intercept (b0) work together graphically.

[Figure: the line Y = b0 + b1 X, with intercept b0 where the line crosses the Y axis and slope b1 as the rise per unit run]

SLIDE 24

What is a good line?

Can we do better than the eyeball method?

We desire a strategy for estimating the slope and intercept parameters in the model Ŷ = b0 + b1 X.

That involves
◮ choosing a criterion, i.e., quantifying how good a line is,
◮ and matching that with a solution, i.e., finding the best line subject to that criterion.

SLIDE 25

Although there are lots of ways to choose a criterion,
◮ only a small handful lead to solutions that are “easy” to compute,
◮ and which have nice statistical properties (more later).

Most reasonable criteria involve measuring the amount by which the fitted value obtained from the line differs from the observed value of the response in the data.

This amount is called the residual.
◮ Good lines produce small residuals.
◮ Good lines produce accurate predictions.

SLIDE 26
[Figure: scatterplot showing one point (Xi, Yi), its fitted value Ŷi on the line, and the residual ei = Yi − Ŷi]

The line gives our predictions or fitted values: Ŷi = b0 + b1 Xi.

The residual ei is the discrepancy between the fitted value Ŷi and the observed value Yi.

◮ Note that we can write Yi = Ŷi + (Yi − Ŷi) = Ŷi + ei.
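For one concrete observation, a short Python sketch (using the “eyeball” line b0 = 35, b1 = 40 from earlier, not the least squares fit) of the fitted value and residual:

```python
b0, b1 = 35, 40      # the earlier "eyeball" line, not the least squares fit

x_i, y_i = 1.4, 58   # one observed house: size 1,400 sq ft, price $58,000
yhat_i = b0 + b1 * x_i   # fitted value, about 91
e_i = y_i - yhat_i       # residual, about -33: the line overpredicts this house

# The decomposition Yi = Yhat_i + ei holds by construction
print(yhat_i, e_i, abs((yhat_i + e_i) - y_i) < 1e-9)
```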

SLIDE 27

Least squares

A reasonable goal is to minimize the size of all residuals:
◮ If they were all zero, we would have a perfect line.
◮ There is a trade-off between moving closer to some points and at the same time moving away from other points.

Since some residuals are positive and some are negative, we need one more ingredient.
◮ |ei| treats positives and negatives equally.
◮ So does ei², which is easier to work with mathematically.

Least squares chooses b0 and b1 to minimize ∑_{i=1}^n ei².
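The least squares criterion is easy to compute directly. A small Python sketch (the slides use R) that evaluates the sum of squared residuals, SSE, for the eyeball line on the house data given earlier:

```python
size  = [.8, .9, 1, 1.1, 1.4, 1.4, 1.5, 1.6, 1.8, 2, 2.4, 2.5, 2.7, 3.2, 3.5]
price = [70, 83, 74, 93, 89, 58, 85, 114, 95, 100, 138, 111, 124, 161, 172]

def sse(b0, b1):
    """Sum of squared residuals for the candidate line Yhat = b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))

print(round(sse(35, 40)))  # SSE of the eyeball line: 3136
```

Least squares simply searches for the (b0, b1) pair that makes this number as small as possible.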

SLIDE 28

Least squares visualization

[Figure: scatterplot of Y vs. X with a candidate line and its residuals ei]

SLIDE 29

Least squares visualization

[Figure: the same scatterplot with a different candidate line and its residuals]

SLIDE 30

Least squares visualization

[Figure: another candidate line with its squared residuals]

SLIDE 31

Least squares visualization

[Figure: another candidate line]

SLIDE 32

Least squares visualization

[Figure: candidate line with intercept = 0.5, slope = 0.75, SSE = 6.73]

SLIDE 33

Least squares visualization

[Figure: candidate line with intercept = 0.5, slope = 0.5, SSE = 6.09]

SLIDE 34

Least squares visualization

[Figure: candidate line with intercept = 0.87, slope = 0.53, SSE = 5.25]

SLIDE 35

Least squares chooses b0 and b1 to minimize

∑_{i=1}^n ei² = ∑_{i=1}^n (Yi − Ŷi)² = ∑_{i=1}^n (Yi − [b0 + b1 Xi])².

R’s lm command provides a least squares fit.

> reg <- lm(price ~ size)
> reg

Call:
lm(formula = price ~ size)

Coefficients:
(Intercept)         size
      38.88        35.39

◮ lm stands for “linear model”; it’ll be our workhorse

SLIDE 36

> abline(reg, col="green")
> legend("bottomright", c("eyeball", "LS"),
+        col=c("red", "green"), lty=1)

[Figure: scatterplot with both the red eyeball line and the green least squares line]

◮ The least squares line is different from our eyeballed line
◮ . . . but why do we like it better?
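One way to answer: compare the two lines on the least squares criterion itself. A Python sketch using the coefficients from the slides (35 and 40 for the eyeball line; 38.88 and 35.39 as reported by lm):

```python
size  = [.8, .9, 1, 1.1, 1.4, 1.4, 1.5, 1.6, 1.8, 2, 2.4, 2.5, 2.7, 3.2, 3.5]
price = [70, 83, 74, 93, 89, 58, 85, 114, 95, 100, 138, 111, 124, 161, 172]

def sse(b0, b1):
    """Sum of squared residuals for the line Yhat = b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(size, price))

sse_eyeball = sse(35, 40)     # roughly 3136
sse_ls = sse(38.88, 35.39)    # smaller: no other line beats the LS fit
print(sse_eyeball > sse_ls)
```

The least squares line wins on exactly the criterion we chose, which is the formal sense in which we "like it better."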

SLIDE 37

Properties of the least squares fit

Developing techniques for model validation and criticism requires a deeper understanding of the least squares line.

The fitted values (Ŷi) and residuals (ei) obtained from the least squares line have some special properties.
◮ From now on, “obtained from the least squares line” will be implied (and therefore not repeated) whenever we talk about Ŷi and ei.

Let’s look at the housing data analysis to figure out what some of these properties are . . .

. . . but first, review covariance and correlation.

SLIDE 38

Covariance and correlation

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

[Figure: scatterplot of y vs. x with dashed lines marking E[X] and E[Y]]

X and Y vary with each other around their means.

SLIDE 39

corr(X, Y) = Cov(X, Y) / (σX σY)

[Figure: pairs of scatterplots illustrating corr = 1, corr = .5, corr = .8, and corr = −.8]

SLIDE 40

Warning 1: Correlation only measures linear relationships:
◮ corr(X, Y) = 0 does not mean the variables are unrelated!

[Figure: two scatterplots, one with corr = 0.01 and one with corr = 0.72]

Also be careful with influential observations.

SLIDE 41

Warning 2: Correlation is not causation:
◮ We really want to interpret a regression as a change in X causing a change in Y

[Figure: scatterplot of Sales vs. Advertising]

But what do we really learn? Can we ever say X causes Y?

SLIDE 42

Least squares properties

1. The fitted values are perfectly correlated with the inputs.

> plot(size, reg$fitted, pch=20, xlab="X",
+      ylab="Fitted Values")
> text(x=3, y=80, col=2, cex=1.5,
+      paste("corr(y.hat, x) =", cor(size, reg$fitted)))

[Figure: fitted values vs. X lie exactly on a line; corr(y.hat, x) = 1]

SLIDE 43
2. The residuals have zero correlation with the inputs, i.e., they are “stripped of all linearity”.

> plot(size, price - reg$fitted, pch=20, xlab="X", ylab="Residuals")
> text(x=3.1, y=26, col=2, cex=1.5,
+      paste("corr(e, x) =", round(cor(size, price - reg$fitted),2)))
> text(x=3.1, y=19, col=4, cex=1.5,
+      paste("mean(e) =", round(mean(price - reg$fitted),0)))
> abline(h=0, col=8, lty=2)

[Figure: residuals vs. X scattered around the zero line; corr(e, x) = 0, mean(e) = 0]

SLIDE 44

Intuition for the relationship between Ŷ, e, and X?
◮ Let’s consider some “crazy” alternative line:

[Figure: scatterplot of Y vs. X showing the LS line, 38.9 + 35.4 X, and the crazy line, 10 + 50 X]

SLIDE 45

This is a bad fit! We are underestimating the value of small houses and overestimating the value of big houses.

[Figure: residuals from the crazy line vs. X, trending downward; corr(e, x) = −0.7, mean(e) = 1.8]

◮ Clearly, we have left some predictive ability on the table!

SLIDE 46

As long as the correlation between e and X is non-zero, we could always adjust our prediction rule to do better:

min ∑_{i=1}^n ei²   is equivalent to   corr(e, X) = 0  and  (1/n) ∑_{i=1}^n ei = 0.

We need to exploit all of the (linear!) predictive power in the X values and put this into Ŷ,
◮ leaving no “Xness” in the residuals.

In summary: Y = Ŷ + e, where:
◮ Ŷ is “made from X”; corr(X, Ŷ) = 1;
◮ e is unrelated to X; corr(X, e) = 0.

SLIDE 47

To summarize: R’s lm(Y ~ X) function
◮ finds the coefficients b0 and b1 characterizing the “least squares” line Ŷ = b0 + b1 X.
◮ That is, it minimizes ∑_{i=1}^n (Yi − Ŷi)² = ∑_{i=1}^n ei².
◮ Equivalent to: corr(e, X) = 0 and (1/n) ∑_{i=1}^n ei = 0.

The least squares formulas are

b1 = rxy (sy / sx)   and   b0 = Ȳ − b1 X̄.

SLIDE 48

Steps in a regression analysis

1. State the problem
2. Data collection
3. Model fitting & estimation (this class)
   3.1 Model specification (linear? logistic?)
   3.2 Select potentially relevant variables
   3.3 Model fitting (least squares)
   3.4 Model validation and criticism
   3.5 Back to 3.1? Back to 2?
4. Answering the posed questions

But that oversimplifies a bit;
◮ it is more iterative, and can be more art than science.

SLIDE 49

Glossary of symbols

◮ X = input, explanatory variable, or covariate.
◮ Y = output, dependent variable, or response.
◮ sxy is the covariance, rxy is the correlation, and sx and sy are the standard deviations of X and Y, respectively.
◮ rxy = sxy / (sx sy).
◮ b0 = least squares estimate of the intercept.
◮ b1 = least squares estimate of the slope.
◮ Ŷ is the fitted value b0 + b1 X.
◮ ei is the residual Yi − Ŷi.
