SLIDE 1

Gov 2000: 8. Simple Linear Regression

Matthew Blackwell

Fall 2016

SLIDE 2
  • 1. Assumptions of the Linear Regression Model
  • 2. Sampling Distribution of the OLS Estimator
  • 3. Sampling Variance of the OLS Estimator
  • 4. Large Sample Properties of OLS
  • 5. Exact Inference for OLS
  • 6. Hypothesis Tests and Confidence Intervals
  • 7. Goodness of Fit

SLIDE 3

Where are we? Where are we going?

  • Last week:

▶ Using the CEF to explore relationships
▶ Practical estimation concerns led us to OLS/lines of best fit.

  • This week:

▶ Inference for OLS: sampling distribution.
▶ Is there really a relationship? Hypothesis tests
▶ Can we get a range of plausible slope values? Confidence intervals
▶ ⇝ how to read regression output.

SLIDE 4

More narrow goal

##
## Call:
## lm(formula = logpgp95 ~ logem4, data = ajr)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.7130 -0.5333  0.0195  0.4719  1.4467
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  10.6602     0.3053   34.92  < 2e-16 ***
## logem4       -0.5641     0.0639   -8.83  2.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.756 on 79 degrees of freedom
##   (82 observations deleted due to missingness)
## Multiple R-squared: 0.497, Adjusted R-squared: 0.49
## F-statistic: 78 on 1 and 79 DF, p-value: 2.09e-13
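
A sketch of the call that produces output like this, taken from the Call line above (it assumes the AJR data are loaded as a data frame named ajr):

  # regress log GDP per capita in 1995 on log settler mortality
  fit <- lm(logpgp95 ~ logem4, data = ajr)
  summary(fit)  # coefficient table, R-squared, F-statistic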

SLIDE 5

1/ Assumptions of the Linear Regression Model

SLIDE 6

Simple linear regression model

  • We are going to assume a linear model:

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

  • Data:

▶ Dependent variable: $Y_i$
▶ Independent variable: $X_i$

  • Population parameters:

▶ Population intercept: $\beta_0$
▶ Population slope: $\beta_1$

  • Error/disturbance: $u_i$

▶ Represents all unobserved factors influencing $Y_i$ other than $X_i$.

SLIDE 7

Causality and regression

$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

  • Last week we showed there is always a population linear regression, which we called the linear projection.

▶ No notion of causality, and it may not even be the CEF.

  • Traditional regression approach: assume the slope parameters are causal or structural.

▶ $\beta_1$ is the effect of a one-unit change in $X_i$ holding all other factors ($u_i$) constant.

  • Regression will always consistently estimate a linear association between $Y_i$ and $X_i$.

  • Today: when will regression say something causal?

▶ GOV 2001/2002 has more on a formal language of causality.

SLIDE 8

Linear regression model

  • In order to investigate the statistical properties of OLS, we need to make some statistical assumptions:

Linear Regression Model

The observations, $(Y_i, X_i)$, come from a random (i.i.d.) sample and satisfy the linear regression equation,
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad \mathbb{E}[u_i \mid X_i] = 0.$$
The independent variable is assumed to have non-zero variance, $\mathbb{V}[X_i] > 0$.

SLIDE 9

Linearity

Assumption 1: Linearity

The population regression function is linear in the parameters:
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

  • Violation of the linearity assumption:

$$Y_i = \frac{1}{\beta_0 + \beta_1 X_i} + u_i$$

  • Not a violation of the linearity assumption:

$$Y_i = \beta_0 + \beta_1 X_i^2 + u_i$$

  • In future weeks, we'll talk about how to allow for non-linearities in $X_i$.

SLIDE 10

Random sample

Assumption 2: Random Sample

We have an i.i.d. random sample of size $n$, $\{(Y_i, X_i) : i = 1, 2, \ldots, n\}$, from the population regression model above.

  • Violations: time series, selected samples.
  • Think about the weight example from last week, where $Y_i$ was my weight on a given day and $X_i$ was my number of active minutes the day before:
$$\text{weight}_i = \beta_0 + \beta_1 \text{activity}_i + u_i$$
  • What if I only weighed myself on the weekdays?

SLIDE 11

A non-i.i.d. sample

[Figure: scatterplot of Y against X for a non-i.i.d. sample.]

SLIDE 12

Variation in X

Assumption 3: Variation in $X$

There is in-sample variation in $X_i$, so that
$$\sum_{i=1}^{n} (X_i - \bar{X})^2 > 0.$$

  • OLS is not well-defined if there is no in-sample variation in $X_i$.
  • Remember the formula for the OLS slope estimator:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

  • What happens here when $X_i$ doesn't vary?
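
A tiny sketch of the problem in R (my own illustration): with a constant regressor, the denominator above is zero and lm() cannot identify a slope.

  x <- rep(2, 10)      # no variation: sum((x - mean(x))^2) is 0
  y <- rnorm(10)
  coef(lm(y ~ x))      # the slope is reported as NA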

SLIDE 13

Stuck in a moment

  • Why does this matter? How would you draw the line of best fit through this scatterplot, which is a violation of this assumption?

[Figure: scatterplot of Y against X in which X takes only a single value.]

SLIDE 14

Stuck in a moment

  • Why does this matter? How would you draw the line of best fit through this scatterplot, which is a violation of this assumption?

[Figure: the same scatterplot as on the previous slide.]

SLIDE 15

Zero conditional mean

Assumption 4: Zero conditional mean of the errors

The error, $u_i$, has expected value of 0 given any value of the independent variable:
$$\mathbb{E}[u_i \mid X_i = x] = 0 \quad \forall x.$$

  • ⇝ implies the weaker condition that $u_i$ and $X_i$ are uncorrelated:
$$\text{Cov}[u_i, X_i] = \mathbb{E}[u_i X_i] = 0$$

  • ⇝ $\mathbb{E}[Y_i \mid X_i] = \beta_0 + \beta_1 X_i$ is the CEF.

SLIDE 16

Violating the zero conditional mean assumption

  • How does this assumption get violated? Let's generate data from the following model:

$$Y_i = 1 + 0.5 X_i + u_i$$

  • But let's compare two situations:
  • 1. Where the mean of $u_i$ depends on $X_i$ (they are correlated)
  • 2. No relationship between them (satisfies the assumption)

[Figure: two scatterplots of Y against X with fitted lines; left panel "Assumption 4 violated", right panel "Assumption 4 not violated".]
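
A minimal sketch of this comparison (my own illustration, not the slides' code): in the first scenario the error mean shifts with $X_i$, in the second it does not.

  n <- 100
  x <- runif(n, 0, 5)

  # Assumption 4 violated: the error mean depends on x
  u.bad <- rnorm(n, mean = x - mean(x))
  y.bad <- 1 + 0.5 * x + u.bad
  coef(lm(y.bad ~ x))   # slope biased away from 0.5

  # Assumption 4 satisfied: error mean is 0 at every x
  u.good <- rnorm(n, mean = 0)
  y.good <- 1 + 0.5 * x + u.good
  coef(lm(y.good ~ x))  # slope close to 0.5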

SLIDE 17

More examples of zero conditional mean in the error

  • Think about the weight example from last week, where $Y_i$ was my weight on a given day and $X_i$ was my number of active minutes the day before:
$$\text{weight}_i = \beta_0 + \beta_1 \text{activity}_i + u_i$$

  • What might be in $u_i$ here? Amount of food eaten, workload, etc.

  • We have to assume that all of these factors have the same mean, no matter what my level of activity was. Plausible?

  • When is this assumption most plausible? When $X_i$ is randomly assigned.

SLIDE 18

2/ Sampling Distribution of the OLS Estimator

SLIDE 19

What is OLS?

  • Ordinary least squares (OLS) is an estimator for the slope and the intercept of the regression line.

  • Where does it come from? Minimizing the sum of the squared residuals:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{b_0, b_1}{\arg\min} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$

  • Leads to:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

SLIDE 20

Intuition of the OLS estimator

  • Regression line goes through the sample means $(\bar{Y}, \bar{X})$:
$$\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}$$

  • Slope is the ratio of the sample covariance to the sample variance of $X_i$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\widehat{\text{Cov}}(X_i, Y_i)}{\widehat{\mathbb{V}}[X_i]} = \frac{\text{Sample covariance between } X \text{ and } Y}{\text{Sample variance of } X}$$
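
A quick sketch of this identity in R, for any numeric vectors x and y (hypothetical data):

  b1 <- cov(x, y) / var(x)      # slope: sample covariance over sample variance
  b0 <- mean(y) - b1 * mean(x)  # line passes through the sample means
  c(b0, b1)                     # agrees with coef(lm(y ~ x))

Note that cov() and var() both use the n - 1 denominator, which cancels in the ratio.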

SLIDE 21

The sample linear regression function

  • The estimated or sample regression function is:
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

  • Estimated intercept: $\hat{\beta}_0$
  • Estimated slope: $\hat{\beta}_1$
  • Predicted/fitted values: $\hat{Y}_i$
  • Residuals:
$$\hat{u}_i = Y_i - \hat{Y}_i$$

  • You can think of the residuals as the prediction errors of our estimates.

SLIDE 22

OLS slope as a weighted sum of the outcomes

  • One useful derivation that we'll use moving forward is to write the OLS estimator for the slope as a weighted sum of the outcomes:

$$\hat{\beta}_1 = \sum_{i=1}^{n} W_i Y_i$$

  • Where the weights, $W_i$, are:

$$W_i = \frac{X_i - \bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

  • Estimation error (proof on slide 83):

$$\hat{\beta}_1 - \beta_1 = \sum_{i=1}^{n} W_i u_i$$

  • ⇝ $\hat{\beta}_1$ is a sum of random variables.

SLIDE 23

Sampling distribution of the OLS estimator

  • Remember: OLS is an estimator, a machine that we plug data into and get estimates out of.

Sample 1: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ → OLS → $(\hat{\beta}_0, \hat{\beta}_1)_1$
Sample 2: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ → OLS → $(\hat{\beta}_0, \hat{\beta}_1)_2$
⋮
Sample $k-1$: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ → OLS → $(\hat{\beta}_0, \hat{\beta}_1)_{k-1}$
Sample $k$: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ → OLS → $(\hat{\beta}_0, \hat{\beta}_1)_k$

  • Just like the sample mean, sample difference in means, or the sample variance
  • It has a sampling distribution, with a sampling variance/standard error, etc.

SLIDE 24

Simulation procedure

  • Let's take a simulation approach to demonstrate:

▶ Pretend that the AJR data represent the population of interest
▶ See how the line varies from sample to sample

  • 1. Draw a random sample of size $n = 30$ with replacement using sample()
  • 2. Use lm() to calculate the OLS estimates of the slope and intercept
  • 3. Plot the estimated regression line (see the sketch below)
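
A sketch of this procedure, assuming the AJR data are in a data frame named ajr with the variables logem4 and logpgp95 (names taken from the regression output earlier):

  set.seed(02138)   # for reproducibility
  sims <- 1000
  slopes <- rep(NA, sims)
  plot(ajr$logem4, ajr$logpgp95, xlab = "Log Settler Mortality",
       ylab = "Log GDP per capita")
  for (s in 1:sims) {
    # 1. draw a random sample of size n = 30 with replacement
    rows <- sample(seq_len(nrow(ajr)), size = 30, replace = TRUE)
    # 2. OLS estimates on the resampled data (lm() drops missing rows)
    fit <- lm(logpgp95 ~ logem4, data = ajr[rows, ])
    slopes[s] <- coef(fit)[2]
    # 3. add the estimated line to the plot
    abline(fit, col = "grey70")
  }
  hist(slopes)  # the simulated sampling distribution of the slope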

SLIDE 25

Population Regression

[Figure: scatterplot of log GDP per capita against log settler mortality, with the population regression line.]

SLIDE 26

Randomly sample from AJR

[Figure: the population scatterplot with the OLS line estimated from one random sample of size 30.]

SLIDE 27

Randomly sample from AJR

[Figure: the OLS line estimated from another random sample.]

SLIDE 28

Randomly sample from AJR

[Figure: the OLS line estimated from another random sample.]

SLIDE 29

Randomly sample from AJR

[Figure: the OLS line estimated from another random sample.]

SLIDE 30

Randomly sample from AJR

[Figure: the OLS line estimated from another random sample.]

SLIDE 31

Randomly sample from AJR

[Figure: the OLS line estimated from another random sample.]

SLIDE 32

Randomly sample from AJR

[Figure: the OLS line estimated from another random sample.]

SLIDE 33

Sampling distribution of OLS

  • You can see that the estimated slopes and intercepts vary from sample to sample, but that the "average" of the lines looks about right.

[Figure: "Sampling distribution of intercepts": histogram of $\hat{\beta}_0$ across the simulated samples. "Sampling distribution of slopes": histogram of $\hat{\beta}_1$ across the simulated samples.]

SLIDE 34

Sample mean properties review

  • Last couple of weeks we derived the properties of $\bar{X}_n$ under one assumption: i.i.d. random samples.
  • In large samples, we derived the sampling distribution:

$$\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

  • Unbiasedness: $\mathbb{E}[\bar{X}_n] = \mu$
  • Sampling variance: $\sigma^2 / n$
  • Standard error: $\sigma / \sqrt{n}$
  • ⇝ allows us to do hypothesis tests and calculate confidence intervals.

SLIDE 35

Our goal

  • What is the sampling distribution of the OLS slope?

$$\hat{\beta}_1 \sim \;?(?, ?)$$

  • Mean of the sampling distribution: ??
  • Sampling variance: ??
  • Standard error: ??
  • Distribution: ??

SLIDE 36

Mean of the OLS sampling distribution

  • Remember the 4 assumptions:
  • 1. Linearity: $Y_i = \beta_0 + \beta_1 X_i + u_i$
  • 2. Random (i.i.d.) sample
  • 3. Variation in $X_i$
  • 4. Zero conditional mean of the errors: $\mathbb{E}[u_i \mid X_i = x] = 0$
  • Letting $\mathbf{X} = (X_1, \ldots, X_n)$:

Unbiasedness of OLS

Under assumptions 1-4, the OLS estimator is conditionally and unconditionally unbiased:
$$\mathbb{E}[\hat{\beta}_1 \mid \mathbf{X}] = \mathbb{E}[\hat{\beta}_1] = \beta_1$$

SLIDE 37

Unbiasedness proof

  • Remember the estimation error:

$$\hat{\beta}_1 - \beta_1 = \sum_{i=1}^{n} W_i u_i$$

  • $W_i = (X_i - \bar{X}) / \left(\sum_{j=1}^{n} (X_j - \bar{X})^2\right)$.
  • Use this to prove conditional unbiasedness:

$$\mathbb{E}[\hat{\beta}_1 - \beta_1 \mid \mathbf{X}] = \mathbb{E}\left[\sum_{i=1}^{n} W_i u_i \,\Big|\, \mathbf{X}\right] = \sum_{i=1}^{n} \mathbb{E}[W_i u_i \mid \mathbf{X}] = \sum_{i=1}^{n} W_i \mathbb{E}[u_i \mid \mathbf{X}] = \sum_{i=1}^{n} W_i \times 0 = 0$$

▶ The $W_i$ come out of the expectations because they are functions of $\mathbf{X}$ alone.

  • True for any realization of the independent variables.
  • Use iterated expectations to get unconditional unbiasedness:

$$\mathbb{E}[\hat{\beta}_1] = \mathbb{E}[\mathbb{E}[\hat{\beta}_1 \mid \mathbf{X}]] = \mathbb{E}[\beta_1] = \beta_1$$

SLIDE 38

3/ Sampling Variance of the OLS Estimator

SLIDE 39

Where are we?

  • Now we know that, under Assumptions 1-4,

$$\hat{\beta}_1 \sim \;?(\beta_1, ?)$$

  • That is, we know that the sampling distribution is centered on the true population slope, but we don't know the population sampling variance:
$$\mathbb{V}[\hat{\beta}_1] = \;??$$

SLIDE 40

Sampling variance of estimated slope

  • It is easiest to derive the sampling variance under one additional assumption:

  • 1. Linearity
  • 2. Random (i.i.d.) sample
  • 3. Variation in $X_i$
  • 4. Zero conditional mean of the errors
  • 5. Homoskedasticity

SLIDE 41

Homoskedasticity

Assumption 5: Homoskedasticity

The conditional variance of $Y_i$ given $X_i$ is constant:
$$\mathbb{V}(Y_i \mid X_i = x) = \mathbb{V}(u_i \mid X_i = x) = \sigma_u^2.$$

  • $\mathbb{V}[Y_i \mid X_i = x]$ is sometimes called the skedastic function, thus the name homoskedasticity.
  • Under homoskedasticity (proof on slide 84):

$$\mathbb{V}[\hat{\beta}_1 \mid \mathbf{X}] = \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

  • Standard error:

$$\text{se}[\hat{\beta}_1 \mid \mathbf{X}] = \sqrt{\mathbb{V}[\hat{\beta}_1 \mid \mathbf{X}]} = \frac{\sigma_u}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$

SLIDE 42

Violations of homoskedasticity

  • Violations: the magnitude of $u_i$ differs at different levels of $X_i$.

[Figure: two scatterplots of Y against X with fitted lines; left panel "Heteroskedastic", right panel "Homoskedastic".]

SLIDE 43

Derive the sampling variance

$$\mathbb{V}[\hat{\beta}_1 \mid \mathbf{X}] = \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} = \frac{\sigma_u^2}{(n-1) S_X^2}$$

(where $S_X^2$ is the sample variance of $X_i$)

  • What drives the sampling variability of the OLS estimator?

▶ The higher the variance of $Y_i$ (that is, $\sigma_u^2$), the higher the sampling variance
▶ The lower the variance of $X_i$, the higher the sampling variance
▶ As we increase $n$, the denominator gets large while the numerator is fixed, and so the sampling variance shrinks to 0.

SLIDE 44

Variance in X -> SEs

[Figure: two scatterplots of Y against X with fitted lines; left panel "High V[X]", right panel "Low V[X]".]

SLIDE 45

Variation in X -> SEs

[Figure: the same "High V[X]" and "Low V[X]" panels as on the previous slide.]

SLIDE 46

Estimating the sampling variance/standard error

  • But we don't observe $\sigma_u^2$; it is the variance of the unobserved errors.
  • Estimate it with the residuals:

$$\hat{\sigma}_u^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2$$

  • Why $n-2$ instead of $n$ or $n-1$? To correct for OLS slightly underestimating the variance.

▶ We already used the data twice, to estimate $\hat{\beta}_0$ and $\hat{\beta}_1$.

  • Estimated standard error of the OLS slope:

$$\widehat{\text{se}}[\hat{\beta}_1 \mid \mathbf{X}] = \frac{\sqrt{\hat{\sigma}_u^2}}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}} = \frac{\hat{\sigma}_u}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}}$$
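
A sketch of these formulas in R, assuming a fitted model fit <- lm(y ~ x) where x has no missing values (hypothetical names); the results should match the "Residual standard error" and the slope's "Std. Error" from summary(fit):

  u.hat <- residuals(fit)
  n <- length(u.hat)
  sigma2.hat <- sum(u.hat^2) / (n - 2)       # estimated error variance
  sqrt(sigma2.hat)                           # "Residual standard error"
  sqrt(sigma2.hat / sum((x - mean(x))^2))    # estimated SE of the slope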

SLIDE 47

Where are we?

  • Under Assumptions 1-5, we know that

$$\hat{\beta}_1 \sim \;?\left(\beta_1, \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$

  • Now we know the mean and sampling variance of the sampling distribution.
  • How does this compare to other estimators for the population slope?

SLIDE 48

OLS is BLUE :(

Gauss-Markov Theorem

Under assumptions 1-5, the OLS estimator is BLUE, or the Best Linear Unbiased Estimator, in the sense that if $\tilde{\beta}_1$ is any other linear unbiased estimator of the population slope, it has variance at least as big as that of OLS: $\mathbb{V}[\hat{\beta}_1 \mid \mathbf{X}] \leq \mathbb{V}[\tilde{\beta}_1 \mid \mathbf{X}]$.

  • Assumptions 1-5: the "Gauss-Markov assumptions"
  • Fails to hold when the assumptions are violated!

SLIDE 49

4/ Large Sample Properties of OLS

SLIDE 50

Where are we?

  • Under Assumptions 1-5, we know that

$$\hat{\beta}_1 \sim \;?\left(\beta_1, \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$

  • And we know that $\sigma_u^2 / \sum_{i=1}^{n}(X_i - \bar{X})^2$ is the lowest variance of any linear unbiased estimator of $\beta_1$.

  • What about the last question mark? What's the form of the distribution? Uniform? $t$? Normal? Exponential? Hypergeometric?

SLIDE 51

Consistency

  • To see consistency of OLS, first remember:

$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} W_i u_i$$

  • Under i.i.d. sampling, we have:

$$\sum_{i=1}^{n} W_i u_i = \frac{\sum_{i=1}^{n}(X_i - \bar{X}) u_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \xrightarrow{p} \frac{\text{Cov}(X_i, u_i)}{\mathbb{V}[X_i]}$$

  • Under zero conditional mean of the errors, $\text{Cov}[X_i, u_i] = 0$, so as long as $\mathbb{V}[X_i] > 0$ we'll have $\hat{\beta}_1 \xrightarrow{p} \beta_1$.

SLIDE 52

Large-sample distribution of OLS estimators

  • The OLS estimator is a weighted sum of random variables:

$$\hat{\beta}_1 = \sum_{i=1}^{n} W_i Y_i$$

  • Weighted sum of r.v.'s ⇝ central limit theorem (notice that we replace the sample variance of $X_i$ with the population variance):

$$\hat{\beta}_1 \xrightarrow{d} N\left(\beta_1, \frac{\sigma_u^2}{(n-1)\mathbb{V}[X_i]}\right)$$

  • So, in large samples:

$$\frac{\hat{\beta}_1 - \beta_1}{\text{se}[\hat{\beta}_1]} \sim N(0, 1)$$

  • Can also replace the se with an estimate:

$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\text{se}}[\hat{\beta}_1]} \sim N(0, 1)$$

SLIDE 53

Where are we?

Under Assumptions 1-5 and in large samples, we know that

$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\hat{\sigma}_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$

SLIDE 54

5/ Exact Inference for OLS

SLIDE 55

Sampling distribution in small samples

  • What if we have a small sample? What can we do then? Back to:

$$\hat{\beta}_1 \sim \;?\left(\beta_1, \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\right)$$

  • Can't get something for nothing, but we can make progress if we make another assumption:

  • 1. Linearity
  • 2. Random (i.i.d.) sample
  • 3. Variation in $X_i$
  • 4. Zero conditional mean of the errors
  • 5. Homoskedasticity
  • 6. Errors are conditionally normal

SLIDE 56

Normal errors

Assumption 6: Conditionally Normal Errors

The conditional distribution of $u_i$ given $X_i$ is normal with mean 0 and variance $\sigma_u^2$.

  • This implies that the distribution of $Y_i$ given $X_i$ is:
$$Y_i \mid X_i \sim N(\beta_0 + \beta_1 X_i, \sigma_u^2).$$

SLIDE 57

Conditional normal errors

[Figure: scatterplot of Y against X.]

SLIDE 58

Conditional normal errors

[Figure: the scatterplot with the normal conditional density of Y at X = 0.25, centered at μ(0.25).]

SLIDE 59

Conditional normal errors

[Figure: normal conditional densities of Y at X = 0.25 and X = 1.]

SLIDE 60

Conditional normal errors

[Figure: normal conditional densities of Y at X = 0.25, 1, and 1.75.]

SLIDE 61

Conditional normal errors

[Figure: normal conditional densities of Y at X = 0.25, 1, 1.75, and 2.5.]

SLIDE 62

Conditional not marginal!

[Figure: the scatterplot with the marginal distribution of Y, $f_Y$, drawn alongside.]

SLIDE 63

Non-normal errors

[Figure: scatterplot of Y against X, now generated with non-normal errors.]

SLIDE 64

Non-normal errors

[Figure: the scatterplot with the non-normal conditional density of Y at X = 0.25, centered at μ(0.25).]

SLIDE 65

Non-normal errors

[Figure: non-normal conditional densities of Y at X = 0.25 and X = 1.]

SLIDE 66

Non-normal errors

[Figure: non-normal conditional densities of Y at X = 0.25, 1, and 1.75.]

SLIDE 67

Non-normal errors

[Figure: non-normal conditional densities of Y at X = 0.25, 1, 1.75, and 2.5.]

SLIDE 68

Marginals are deceiving!

[Figure: the marginal distribution of Y, $f_Y$, from the normal-errors example next to the marginal distribution from the non-normal-errors example.]

SLIDE 69

Sampling distribution of OLS slope

  • If $Y_i$ given $X_i$ is distributed $N(\beta_0 + \beta_1 X_i, \sigma_u^2)$, then we have the following at any sample size:

$$\frac{\hat{\beta}_1 - \beta_1}{\text{se}[\hat{\beta}_1]} \sim N(0, 1)$$

  • Furthermore, if we replace the true standard error with the estimated standard error, then we get the following:

$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\text{se}}[\hat{\beta}_1]} \sim t_{n-2}$$

  • The standardized coefficient follows a $t$ distribution with $n - 2$ degrees of freedom. We take off an extra degree of freedom because we had to estimate one more parameter than just the sample mean.
  • All of this depends on normal errors! We can check whether the residuals look normal, as in the sketch below.
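
A quick sketch of that check in R, assuming a fitted model fit (hypothetical name):

  # if the errors are normal, the points should fall near the reference line
  qqnorm(residuals(fit))
  qqline(residuals(fit))
  hist(residuals(fit))  # rough check of symmetry and tail behavior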

SLIDE 70

Where are we?

  • Under Assumptions 1-5 and in large samples, we know that

$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\text{se}}[\hat{\beta}_1]} \sim N(0, 1)$$

  • Under Assumptions 1-6 and in any sample, we know that

$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\text{se}}[\hat{\beta}_1]} \sim t_{n-2}$$

SLIDE 71

6/ Hypothesis Tests and Confidence Intervals

SLIDE 72

Null and alternative hypotheses review

  • Null: $H_0: \beta_1 = 0$

▶ The null is the straw man we want to knock down.
▶ With regression, almost always a null of no relationship

  • Alternative: $H_a: \beta_1 \neq 0$

▶ The claim we want to test
▶ Almost always "some effect"
▶ Could do a one-sided test, but you shouldn't, for reasons we've already discussed

  • Notice these are statements about the population parameters, not the OLS estimates.

SLIDE 73

Test statistic

  • Under the null of $H_0: \beta_1 = c$, we can use the following familiar test statistic:

$$T = \frac{\hat{\beta}_1 - c}{\widehat{\text{se}}[\hat{\beta}_1]}$$

  • Under the null hypothesis:

▶ Large samples: $T \sim N(0, 1)$.
▶ Any sample size, plus conditionally normal errors: $T \sim t_{n-2}$
▶ Conservative to use $t_{n-2}$ in either case, since $t_{n-2} \rightsquigarrow N(0, 1)$

  • Thus, under the null, we know the distribution of $T$ and can use that to formulate a critical value and calculate p-values as usual.

SLIDE 74

R output

  • By default, R shows you $T_{obs}$, the test statistic under the null of $\beta_1 = 0$, which is just the estimate divided by its standard error:

$$T_{obs} = \frac{\hat{\beta}_1 - 0}{\widehat{\text{se}}[\hat{\beta}_1]} = \frac{\hat{\beta}_1}{\widehat{\text{se}}[\hat{\beta}_1]}$$

  • R also calculates the p-values for you.
  • In the AJR data:

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  10.6602    0.30528   34.92 8.759e-50
## logem4       -0.5641    0.06389   -8.83 2.094e-13
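
A sketch of how the logem4 row is computed, using the numbers above and the $n - 2 = 79$ degrees of freedom reported earlier:

  t.obs <- -0.5641 / 0.06389    # t value: about -8.83
  2 * pt(-abs(t.obs), df = 79)  # two-sided p-value: about 2.09e-13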

SLIDE 75

Confidence intervals

  • Large-sample CIs relying on asymptotic normality:

$$\hat{\beta}_1 \pm z_{\alpha/2} \cdot \widehat{\text{se}}[\hat{\beta}_1]$$

  • Exact CIs relying on normality of the errors:

$$\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot \widehat{\text{se}}[\hat{\beta}_1]$$

  • "In 95% of repeated samples, the confidence interval for $\beta_1$ will cover the true value."
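
A sketch of the exact 95% CI in R, assuming a fitted model fit (hypothetical name):

  b1 <- coef(fit)[2]
  se.b1 <- sqrt(vcov(fit)[2, 2])   # estimated SE of the slope
  b1 + c(-1, 1) * qt(0.975, df.residual(fit)) * se.b1
  confint(fit, level = 0.95)       # built-in equivalent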

SLIDE 76

7/ Goodness of Fit

SLIDE 77

Prediction error

  • How do we judge how well a line fits the data? Is there some way to judge?
  • One way is to find out how much better we do at predicting $Y_i$ once we include $X_i$ in the regression model.
  • Prediction errors without $X_i$: the best prediction is the mean, so our squared errors, or the total sum of squares ($SS_{tot}$), would be:

$$SS_{tot} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

  • Prediction errors with $X_i$: the sum of squared residuals, or $SS_{res}$:

$$SS_{res} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

SLIDE 78

Total SS vs SSR

[Figure: the AJR scatterplot of log GDP per capita against log settler mortality, showing the total prediction errors (deviations from the mean).]

SLIDE 79

Total SS vs SSR

[Figure: the same scatterplot, showing the residuals (deviations from the fitted regression line).]

SLIDE 80

R-square

  • By definition, the residuals have to be smaller than the deviations from the mean, so we might ask: how much lower is the $SS_{res}$ compared to the $SS_{tot}$?
  • We quantify this with the coefficient of determination, or $R^2$:

$$R^2 = \frac{SS_{tot} - SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}$$

  • This is the fraction of the total prediction error eliminated by providing information on $X_i$.
  • Common interpretation: $R^2$ is the fraction of the variation in $Y_i$ that is "explained by" $X_i$.

▶ $R^2 = 0$ means no linear relationship
▶ $R^2 = 1$ implies a perfect linear fit
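
A sketch computing $R^2$ from its definition, assuming an outcome vector y and a model fit fitted to the same rows (hypothetical names):

  ss.tot <- sum((y - mean(y))^2)   # squared errors predicting with the mean
  ss.res <- sum(residuals(fit)^2)  # squared errors predicting with the line
  1 - ss.res / ss.tot              # matches summary(fit)$r.squared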

SLIDE 81

Is R-squared useful?

  • Can be very misleading. Each of these samples has the same $R^2$ even though they are vastly different:

[Figure: four scatterplots of Y against X that share the same $R^2$ even though the patterns of data are vastly different.]

SLIDE 82

Review of Assumptions

  • What assumptions do we need to make what claims with OLS?

  • 1. Data description: variation in $X_i$.
  • 2. Unbiasedness/consistency: linearity, i.i.d. sample, variation in $X_i$, zero conditional mean of the errors.
  • 3. Large-sample inference: linearity, i.i.d. sample, variation in $X_i$, zero conditional mean of the errors, homoskedasticity.
  • 4. Small-sample inference: linearity, i.i.d. sample, variation in $X_i$, zero conditional mean of the errors, homoskedasticity, normal errors.

  • Can we weaken these? In some cases, yes.
  • Next week: adding another variable to the regression.

SLIDE 83

Estimation error proof

  • Key facts:

▶ $\sum_{i=1}^{n} W_i = 0$ because $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$
▶ $\sum_{i=1}^{n} W_i X_i = 1$ because $\sum_{i=1}^{n} X_i (X_i - \bar{X}) = \sum_{i=1}^{n} (X_i - \bar{X})^2$

  • Proof:

$$\hat{\beta}_1 = \sum_{i=1}^{n} W_i Y_i = \sum_{i=1}^{n} W_i (\beta_0 + \beta_1 X_i + u_i) = \beta_0 \left(\sum_{i=1}^{n} W_i\right) + \beta_1 \left(\sum_{i=1}^{n} W_i X_i\right) + \sum_{i=1}^{n} W_i u_i = \beta_1 + \sum_{i=1}^{n} W_i u_i$$

SLIDE 84

Variance proof

  • Proof:

$$\mathbb{V}[\hat{\beta}_1 \mid \mathbf{X}] = \mathbb{V}\left[\sum_{i=1}^{n} W_i u_i \,\Big|\, \mathbf{X}\right] = \sum_{i=1}^{n} \mathbb{V}[W_i u_i \mid \mathbf{X}] = \sum_{i=1}^{n} W_i^2 \, \mathbb{V}[u_i \mid \mathbf{X}] = \sum_{i=1}^{n} W_i^2 \, \sigma_u^2 = \sigma_u^2 \sum_{i=1}^{n} W_i^2 = \sigma_u^2 \, \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right)^2} = \frac{\sigma_u^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}$$

▶ The second equality, that the variance of the sum is the sum of the variances, uses the independence of the observations (i.i.d. sampling).
