Gov 2000: 8. Simple Linear Regression
Matthew Blackwell
Fall 2016

1. Assumptions of the Linear Regression Model
2. Sampling Distribution of the OLS Estimator
3. Sampling Variance of the OLS Estimator
4. Large Sample Properties of OLS
▶ Using the CEF to explore relationships.
▶ Practical estimation concerns led us to OLS/lines of best fit.
▶ Inference for OLS: the sampling distribution.
▶ Is there really a relationship? Hypothesis tests.
▶ Can we get a range of plausible slope values? Confidence intervals.
▶ ⇝ how to read regression output.
##
## Call:
## lm(formula = logpgp95 ~ logem4, data = ajr)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.7130 -0.5333  0.0195  0.4719  1.4467
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  10.6602     0.3053   34.92  < 2e-16 ***
## logem4       -0.5641     0.0639   -8.83  2.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.756 on 79 degrees of freedom
##   (82 observations deleted due to missingness)
## Multiple R-squared: 0.497, Adjusted R-squared: 0.49
## F-statistic: 78 on 1 and 79 DF, p-value: 2.09e-13
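The output above can be reproduced with a short R call — a minimal sketch, assuming the AJR dataset is loaded as a data frame named ajr with columns logpgp95 and logem4 (the names in the Call above):

## Fit log GDP per capita on log settler mortality (AJR data)
ajr.mod <- lm(logpgp95 ~ logem4, data = ajr)
## Estimates, standard errors, t values, and p-values
summary(ajr.mod)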
The population linear regression model:
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

▶ Dependent variable: $Y_i$
▶ Independent variable: $X_i$
▶ Population intercept: $\beta_0$
▶ Population slope: $\beta_1$
▶ Error term $u_i$: represents all unobserved factors influencing $Y_i$ other than $X_i$.
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

▶ One interpretation: this is the population analog of the regression we called the linear projection. No notion of causality, and it may not even be the CEF.
▶ Another interpretation: the model is causal or structural. Then $\beta_1$ is the effect of a one-unit change in $X_i$ holding all other factors ($u_i$) constant.
▶ Without the causal interpretation, $\beta_1$ simply measures the linear association between $Y_i$ and $X_i$.
▶ Gov 2001/2002 has more on a formal language of causality.
▶ To make inferences about the population line, we need to make some statistical assumptions:

Linear Regression Model
The observations, $(Y_i, X_i)$, come from a random (i.i.d.) sample and satisfy the linear regression equation,
$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad \mathbb{E}[u_i \mid X_i] = 0.$$
The independent variable is assumed to have non-zero variance, $\mathbb{V}[X_i] > 0$.
Assumption 1: Linearity
The population regression function is linear in the parameters:
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$

▶ This is a statement about the parameters, not the variables. A model such as
$$Y_i = \frac{1}{\beta_0 + \beta_1 X_i} + u_i$$
violates the assumption, while
$$Y_i = \beta_0 + \beta_1 X_i^2 + u_i$$
does not: linearity in the parameters still allows non-linearities in $X_i$.
Assumption 2: Random Sample
We have an i.i.d. random sample of size $n$, $\{(Y_i, X_i) : i = 1, 2, \ldots, n\}$, from the population regression model above.

▶ Violated when observations are dependent over time. Suppose $Y_i$ were my weight on a given day and $X_i$ were my number of active minutes the day before:
$$\text{weight}_i = \beta_0 + \beta_1 \text{activity}_i + u_i$$
[Figure: scatterplot of Y against X]
Assumption 3: Variation in $X$
There is in-sample variation in $X_i$, so that
$$\sum_{i=1}^{n} (X_i - \overline{X})^2 > 0.$$

▶ Recall the OLS slope estimator:
$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2}$$
▶ Without variation in $X_i$, the denominator is 0 and the slope is undefined.
▶ How would we fit a line of best fit through this scatterplot, which is a violation of this assumption?

[Figure: scatterplot of Y against X with no variation in X]
▶ Any number of lines fit equally well through this scatterplot, which is a violation of this assumption — there is no unique line of best fit.

[Figure: the same scatterplot with several candidate lines drawn through it]
Assumption 4: Zero Conditional Mean of the Errors
The error, $u_i$, has an expected value of 0 given any value of the independent variable:
$$\mathbb{E}[u_i \mid X_i = x] = 0 \quad \forall\, x.$$

▶ This implies that the errors are uncorrelated with the independent variable:
$$\mathrm{Cov}[u_i, X_i] = \mathbb{E}[u_i X_i] = 0$$
▶ Here are two simulated datasets, each generated from the following model:
$$Y_i = 1 + 0.5 X_i + u_i$$

[Figure: two scatterplots of Y against X — left panel: Assumption 4 violated; right panel: Assumption 4 not violated]
▶ Again suppose $Y_i$ is my weight on a given day and $X_i$ is my number of active minutes the day before:
$$\text{weight}_i = \beta_0 + \beta_1 \text{activity}_i + u_i$$
▶ The error $u_i$ contains all other factors affecting my weight: diet, stress, etc.
▶ Assumption 4 says these factors have the same (zero) mean, no matter what my level of activity was. Plausible?
▶ The assumption holds by design in a randomized experiment, where the values of $X_i$ are randomly assigned.
▶ How do we estimate the slope and the intercept of the regression line?
▶ Ordinary least squares (OLS) chooses the line that minimizes the sum of the squared residuals:
$$(\widehat{\beta}_0, \widehat{\beta}_1) = \underset{b_0,\, b_1}{\arg\min} \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2$$
▶ The solutions:
$$\widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X}, \qquad \widehat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2}$$
▶ The estimated regression line: $\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X$
▶ The slope is the ratio of the sample covariance to the sample variance of $X$:
$$\widehat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2} = \frac{\widehat{\mathrm{Cov}}(X_i, Y_i)}{\widehat{\mathbb{V}}[X_i]} = \frac{\text{sample covariance between } X \text{ and } Y}{\text{sample variance of } X}$$
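A quick numerical check of this covariance-over-variance identity in R — a sketch with simulated (hypothetical) data:

## Simulated data for illustration
set.seed(2138)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)

## Slope as sample covariance over sample variance of x
b1 <- cov(x, y) / var(x)
## Intercept from the sample means
b0 <- mean(y) - b1 * mean(x)

## These match the built-in OLS fit
c(b0, b1)
coef(lm(y ~ x))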
▶ Fitted values: $\widehat{Y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 X_i$
▶ Keep the population parameters, $\beta_0$ and $\beta_1$, distinct from the OLS estimates, $\widehat{\beta}_0$ and $\widehat{\beta}_1$.
▶ Residuals: $\widehat{u}_i = Y_i - \widehat{Y}_i$, the in-sample prediction errors of the estimates.
▶ It will be useful later to rewrite the OLS estimator for the slope as a weighted sum of the outcomes:
$$\widehat{\beta}_1 = \sum_{i=1}^{n} W_i Y_i, \qquad W_i = \frac{X_i - \overline{X}}{\sum_{i=1}^{n} (X_i - \overline{X})^2}$$
▶ With a little algebra (see the proof in the appendix), we can also write:
$$\widehat{\beta}_1 - \beta_1 = \sum_{i=1}^{n} W_i u_i$$
▶ ⇝ the deviation of $\widehat{\beta}_1$ from the truth $\beta_1$ is a sum of random variables.
▶ Remember: OLS is an estimator — a machine we plug data into and we get out estimates.

Sample 1: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ ⟶ OLS ⟶ $(\widehat{\beta}_0, \widehat{\beta}_1)_1$
Sample 2: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ ⟶ OLS ⟶ $(\widehat{\beta}_0, \widehat{\beta}_1)_2$
⋮
Sample $k-1$: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ ⟶ OLS ⟶ $(\widehat{\beta}_0, \widehat{\beta}_1)_{k-1}$
Sample $k$: $\{(Y_1, X_1), \ldots, (Y_n, X_n)\}$ ⟶ OLS ⟶ $(\widehat{\beta}_0, \widehat{\beta}_1)_k$

▶ Just like the sample mean, the OLS estimates vary from sample to sample: they have a sampling distribution, with a sampling variance/standard error, etc.
▶ Simulation to build intuition (see the R sketch below):
▶ Pretend that the AJR data represent the population of interest.
▶ Draw repeated random samples from this population with sample().
▶ In each sample, estimate the slope and intercept by OLS.
▶ See how the line varies from sample to sample.
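A minimal sketch of this resampling exercise, assuming the ajr data frame from earlier; the sample size and number of replications are hypothetical choices:

## Treat the observed AJR data as the "population"
pop <- na.omit(ajr[, c("logpgp95", "logem4")])

n.sims <- 1000  # number of repeated samples
n <- 30         # size of each sample
coefs <- matrix(NA, nrow = n.sims, ncol = 2)

for (s in 1:n.sims) {
  ## Draw rows at random, then rerun OLS on the subsample
  rows <- sample(seq_len(nrow(pop)), size = n, replace = TRUE)
  coefs[s, ] <- coef(lm(logpgp95 ~ logem4, data = pop[rows, ]))
}

## Sampling distributions of the intercepts and slopes
hist(coefs[, 1], main = "Sampling distribution of intercepts")
hist(coefs[, 2], main = "Sampling distribution of slopes")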
[Figures: a sequence of scatterplots of Log GDP per capita against Log Settler Mortality, each showing the OLS line fit to a different random sample]
▶ The lines vary from sample to sample, but the “average” of the lines looks about right.

[Figure: histograms of the simulated sampling distributions of the intercepts ($\widehat{\beta}_0$) and the slopes ($\widehat{\beta}_1$)]
▶ Remember the sampling distribution of the sample mean:
$$\overline{X}_n \sim N\!\left(\mu, \frac{\sigma^2}{n}\right)$$
▶ We used this sampling distribution to construct hypothesis tests and confidence intervals.
▶ What is the sampling distribution of the OLS slope?
$$\widehat{\beta}_1 \sim \;?\,(\,?, \;?\,)$$
▶ We need to fill in: the mean, the variance, and the shape of the distribution.
Unbiasedness of OLS
Under Assumptions 1–4, the OLS estimator is conditionally and unconditionally unbiased:
$$\mathbb{E}[\widehat{\beta}_1 \mid X] = \mathbb{E}[\widehat{\beta}_1] = \beta_1$$
▶ Start with the weighted-sum representation:
$$\widehat{\beta}_1 - \beta_1 = \sum_{i=1}^{n} W_i u_i$$
▶ Condition on $X$; the weights $W_i$ are functions of $X$ alone, so they pass through the conditional expectation:
$$\mathbb{E}[\widehat{\beta}_1 - \beta_1 \mid X] = \mathbb{E}\!\left[\sum_{i=1}^{n} W_i u_i \,\middle|\, X\right] = \sum_{i=1}^{n} \mathbb{E}[W_i u_i \mid X] = \sum_{i=1}^{n} W_i\, \mathbb{E}[u_i \mid X] = \sum_{i=1}^{n} W_i \times 0 = 0$$
▶ Unconditional unbiasedness follows from the law of iterated expectations:
$$\mathbb{E}[\widehat{\beta}_1] = \mathbb{E}\big[\mathbb{E}[\widehat{\beta}_1 \mid X]\big] = \mathbb{E}[\beta_1] = \beta_1$$
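A quick simulation check of unbiasedness — a sketch with hypothetical true parameters:

## Simulate many samples from a known model and average the OLS slopes
set.seed(42)
b1.true <- 0.5
slopes <- replicate(2000, {
  x <- rnorm(50)
  y <- 1 + b1.true * x + rnorm(50)
  coef(lm(y ~ x))[2]
})
## Should be close to the truth, 0.5
mean(slopes)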
38 / 84
▶ Where are we? Unbiasedness pins down the center of the sampling distribution:
$$\widehat{\beta}_1 \sim \;?\,(\beta_1, \;?\,)$$
▶ The distribution is centered on the true population slope, but we don’t know the population sampling variance:
$$\mathbb{V}[\widehat{\beta}_1] = \;??$$
▶ To derive the sampling variance of the OLS estimates, we need one additional assumption:
Assumption 5: Homoskedasticity
The conditional variance of $Y_i$ given $X_i$ is constant:
$$\mathbb{V}(Y_i \mid X_i = x) = \mathbb{V}(u_i \mid X_i = x) = \sigma_u^2$$

▶ “Constant error variance” goes by the name homoskedasticity.
▶ Under Assumptions 1–5 (proof in the appendix), the conditional sampling variance and standard error of the slope are:
$$\mathbb{V}[\widehat{\beta}_1 \mid X] = \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}, \qquad \mathrm{se}[\widehat{\beta}_1 \mid X] = \sqrt{\mathbb{V}[\widehat{\beta}_1 \mid X]} = \frac{\sigma_u}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2}}$$
[Figure: two scatterplots of Y against X — left panel: Heteroskedastic errors; right panel: Homoskedastic errors]
▶ Rewrite the sampling variance in terms of the sample variance of $X$, $s_X^2$:
$$\mathbb{V}[\widehat{\beta}_1 \mid X] = \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2} = \frac{\sigma_u^2}{(n-1)\, s_X^2}$$
▶ The higher the variance of $Y_i$ (the error variance $\sigma_u^2$), the higher the sampling variance.
▶ The lower the variance of $X_i$, the higher the sampling variance.
▶ As we increase $n$, the denominator gets large while the numerator is fixed, and so the sampling variance shrinks to 0.
[Figures: paired scatterplots of Y against X comparing a sample with high V[X] to one with low V[X]]
▶ One quantity in the sampling variance is unknown: $\sigma_u^2$ — it is the variance of the errors.
▶ We can estimate it from the residuals:
$$\widehat{\sigma}_u^2 = \frac{1}{n-2} \sum_{i=1}^{n} \widehat{u}_i^2$$
▶ Why divide by $n - 2$ rather than $n$? Dividing by $n$ would lead to underestimating the variance: we already used the data twice, to estimate $\widehat{\beta}_0$ and $\widehat{\beta}_1$.
▶ Plugging in gives the estimated standard error:
$$\widehat{\mathrm{se}}[\widehat{\beta}_1 \mid X] = \sqrt{\frac{\widehat{\sigma}_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}} = \frac{\widehat{\sigma}_u}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2}}$$
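A sketch of these plug-in formulas in R, reusing the simulated x and y from above:

fit <- lm(y ~ x)
n <- length(residuals(fit))

## Error variance estimated from the residuals, with the n - 2 correction
sigma2.hat <- sum(residuals(fit)^2) / (n - 2)

## Estimated standard error of the slope
se.b1 <- sqrt(sigma2.hat / sum((x - mean(x))^2))

## Matches what summary.lm() reports
c(manual = se.b1, lm = summary(fit)$coefficients["x", "Std. Error"])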
▶ Where are we now?
$$\widehat{\beta}_1 \sim \;?\left(\beta_1,\; \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}\right)$$
▶ We know the mean and the variance of the sampling distribution.
▶ But is this sampling variance large or small relative to other possible estimators of the population slope?
Gauss-Markov Theorem
Under Assumptions 1–5, the OLS estimator is BLUE — the Best Linear Unbiased Estimator — in the sense that if $\widetilde{\beta}_1$ is any other linear unbiased estimator of the population slope, its variance is at least as big as that of OLS:
$$\mathbb{V}[\widehat{\beta}_1 \mid X] \le \mathbb{V}[\widetilde{\beta}_1 \mid X]$$
▶ Where are we now?
$$\widehat{\beta}_1 \sim \;?\left(\beta_1,\; \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}\right)$$
▶ By Gauss-Markov, $\frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}$ is the lowest variance of any linear estimator of $\beta_1$.
▶ What about the shape of the sampling distribution? Uniform? $t$? Normal? Exponential? Hypergeometric?
▶ Start again from the weighted-sum representation:
$$\widehat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} W_i u_i$$
▶ By the law of large numbers,
$$\sum_{i=1}^{n} W_i u_i = \frac{\sum_{i=1}^{n} (X_i - \overline{X})\, u_i}{\sum_{i=1}^{n} (X_i - \overline{X})^2} \;\xrightarrow{p}\; \frac{\mathrm{Cov}(X_i, u_i)}{\mathbb{V}[X_i]}$$
▶ Assumption 4 implies $\mathrm{Cov}(X_i, u_i) = 0$, so as long as $\mathbb{V}[X_i] > 0$, we’ll have
$$\widehat{\beta}_1 \xrightarrow{p} \beta_1$$
▶ ⇝ OLS is consistent for the population slope.
▶ The OLS slope is a (weighted) sum of random variables:
$$\widehat{\beta}_1 = \sum_{i=1}^{n} W_i Y_i$$
▶ So a central limit theorem applies in large samples (replacing the sample variance of $X_i$ with the population variance):
$$\widehat{\beta}_1 \;\xrightarrow{d}\; N\!\left(\beta_1,\; \frac{\sigma_u^2}{(n-1)\,\mathbb{V}[X_i]}\right)$$
▶ Standardizing, in large samples:
$$\frac{\widehat{\beta}_1 - \beta_1}{\mathrm{se}[\widehat{\beta}_1]} \sim N(0, 1) \qquad \text{and} \qquad \frac{\widehat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}[\widehat{\beta}_1]} \sim N(0, 1)$$
▶ Under Assumptions 1–5 and in large samples, we know that
$$\widehat{\beta}_1 \;\overset{a}{\sim}\; N\!\left(\beta_1,\; \frac{\widehat{\sigma}_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}\right)$$
▶ In large samples, the CLT fills in the question mark here:
$$\widehat{\beta}_1 \sim \;?\left(\beta_1,\; \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}\right)$$
▶ What about in small samples? For that, we make another assumption:
Assumption 6: Conditionally Normal Errors
The conditional distribution of $u_i$ given $X_i$ is normal with mean 0 and variance $\sigma_u^2$.

▶ Equivalently, the conditional distribution of $Y_i$ given $X_i$ is $N(\beta_0 + \beta_1 X_i,\; \sigma_u^2)$.
[Figures: scatterplots of Y against X overlaying the conditional distributions of Y at X = 0.25, 1, 1.75, and 2.5, along with the implied marginal distributions of Y, for two different error distributions]
▶ If we add Assumption 6 — $u_i \mid X_i \sim N(0, \sigma_u^2)$ — then we have the following at any sample size:
$$\frac{\widehat{\beta}_1 - \beta_1}{\mathrm{se}[\widehat{\beta}_1]} \sim N(0, 1)$$
▶ If we replace the true standard error with the estimated standard error, then we get the following:
$$\frac{\widehat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}[\widehat{\beta}_1]} \sim t_{n-2}$$
▶ That is, a $t$ distribution with $n - 2$ degrees of freedom. We take off an extra degree of freedom because we had to estimate one more parameter than just the sample mean.
▶ This exact result is only as trustworthy as Assumption 6 — worth checking whether the residuals do look normal.
▶ To sum up the distribution of the standardized slope:
▶ In large samples (Assumptions 1–5): $\dfrac{\widehat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}[\widehat{\beta}_1]} \sim N(0, 1)$
▶ At any sample size, with normal errors (Assumptions 1–6): $\dfrac{\widehat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}[\widehat{\beta}_1]} \sim t_{n-2}$
▶ Null hypothesis: the straw man we want to knock down. With regression, this is almost always the null of no relationship, $H_0\colon \beta_1 = 0$.
▶ Alternative hypothesis: the claim we want to test, almost always “some effect,” $H_1\colon \beta_1 \ne 0$. We could do a one-sided test, but you shouldn’t, for reasons we’ve already discussed.
▶ Note that these hypotheses are statements about the population parameters, not the OLS estimates.
▶ To test the null $H_0\colon \beta_1 = c$, we use a familiar test statistic:
$$T = \frac{\widehat{\beta}_1 - c}{\widehat{\mathrm{se}}[\widehat{\beta}_1]}$$
▶ Null distribution of the test statistic:
▶ Large samples: $T \sim N(0, 1)$.
▶ Any sample size, plus conditionally normal errors: $T \sim t_{n-2}$.
▶ Conservative to use $t_{n-2}$ in either case, since $t_{n-2} \rightsquigarrow N(0, 1)$.
▶ We use the null distribution to formulate a critical value and calculate p-values as usual.
▶ R reports the test statistic for the null of $\beta_1 = 0$, which is just the estimate divided by the standard error:
$$T_{obs} = \frac{\widehat{\beta}_1 - 0}{\widehat{\mathrm{se}}[\widehat{\beta}_1]} = \frac{\widehat{\beta}_1}{\widehat{\mathrm{se}}[\widehat{\beta}_1]}$$

##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  10.66017    0.30528   34.92 8.759e-50
## logem4       -0.56410    0.06389   -8.83 2.086e-13
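As a sketch, the same t value and p-value can be computed by hand from the coefficients table of the AJR fit (ajr.mod from earlier):

ctab <- summary(ajr.mod)$coefficients

## t statistic for the null beta_1 = 0
t.obs <- ctab["logem4", "Estimate"] / ctab["logem4", "Std. Error"]

## Two-sided p-value from the t distribution with n - 2 degrees of freedom
p.val <- 2 * pt(abs(t.obs), df = ajr.mod$df.residual, lower.tail = FALSE)
c(t.obs, p.val)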
▶ Confidence intervals take the usual form:
▶ Large samples: $\widehat{\beta}_1 \pm z_{\alpha/2} \cdot \widehat{\mathrm{se}}[\widehat{\beta}_1]$
▶ Any sample size, with normal errors: $\widehat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot \widehat{\mathrm{se}}[\widehat{\beta}_1]$
▶ Interpretation as usual: “If we repeated the sampling procedure many times, 95% of intervals constructed this way will cover the true value.”
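In R, confint() computes the t-based interval directly; a manual version for comparison, reusing ctab from above:

## t-based 95% confidence interval for the slope
confint(ajr.mod, "logem4", level = 0.95)

## The same interval by hand
est <- ctab["logem4", "Estimate"]
se  <- ctab["logem4", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = ajr.mod$df.residual) * se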
▶ How well does the regression line fit the data — is there a way to judge?
▶ One idea: measure how much better we predict $Y_i$ once we include $X_i$ in the regression model.
▶ Without the regression, our best guess is $\overline{Y}$, and the total prediction errors would be the total sum of squares:
$$SS_{tot} = \sum_{i=1}^{n} (Y_i - \overline{Y})^2$$
▶ With the regression, the prediction errors are the residuals, summarized by the sum of squared residuals, $SS_{res}$:
$$SS_{res} = \sum_{i=1}^{n} (Y_i - \widehat{Y}_i)^2$$
[Figure: AJR scatterplot of Log GDP per capita against Log Settler Mortality showing the total prediction errors (deviations from the mean)]
[Figure: the same scatterplot showing the residuals (deviations from the regression line)]
▶ The residuals are deviations from the regression line, while the total prediction errors are deviations from the mean, so we might ask the following: how much lower is the $SS_{res}$ compared to the $SS_{tot}$?
▶ The answer is the coefficient of determination, or $R^2$:
$$R^2 = \frac{SS_{tot} - SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}$$
▶ $R^2$ measures how much better our predictions get once $X_i$ provides information — the fraction of the variation in $Y_i$ that is “explained by” $X_i$.
▶ $R^2 = 0$ means no linear relationship; $R^2 = 1$ implies a perfect linear fit.
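A minimal sketch of computing R² from the sums of squares in R, reusing the simulated y and fit from earlier:

## Residual and total sums of squares
ss.res <- sum(residuals(fit)^2)
ss.tot <- sum((y - mean(y))^2)

## Coefficient of determination, matching summary(fit)$r.squared
c(manual = 1 - ss.res / ss.tot, lm = summary(fit)$r.squared)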
▶ Warning: all four of these datasets have exactly the same $R^2$, even though they are vastly different:

[Figure: four scatterplots of Y against X with identical $R^2$ values but very different patterns]
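R's built-in anscombe data makes the same point — four regressions with essentially identical R² but wildly different scatterplots:

## R^2 for each of Anscombe's four x-y pairs
sapply(1:4, function(i) {
  f <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
  summary(f)$r.squared
})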
▶ What assumptions did we need for each property of OLS?
▶ Unbiasedness: linearity, random (i.i.d.) sample, variation in $X_i$, and zero conditional mean error.
▶ Sampling variance formula and Gauss-Markov (BLUE): add homoskedasticity.
▶ Exact small-sample inference: add Normal errors.
Return

▶ Two facts about the weights $W_i$:
▶ $\sum_{i=1}^{n} W_i = 0$, because $\sum_{i=1}^{n} (X_i - \overline{X}) = 0$
▶ $\sum_{i=1}^{n} W_i X_i = 1$, because $\sum_{i=1}^{n} X_i (X_i - \overline{X}) = \sum_{i=1}^{n} (X_i - \overline{X})^2$
▶ Therefore:
$$\widehat{\beta}_1 = \sum_{i=1}^{n} W_i Y_i = \sum_{i=1}^{n} W_i (\beta_0 + \beta_1 X_i + u_i) = \beta_0 \left(\sum_{i=1}^{n} W_i\right) + \beta_1 \left(\sum_{i=1}^{n} W_i X_i\right) + \sum_{i=1}^{n} W_i u_i = \beta_1 + \sum_{i=1}^{n} W_i u_i$$
Return

$$\mathbb{V}[\widehat{\beta}_1 \mid X] = \mathbb{V}\!\left[\sum_{i=1}^{n} W_i u_i \,\middle|\, X\right] = \sum_{i=1}^{n} \mathbb{V}[W_i u_i \mid X] = \sum_{i=1}^{n} W_i^2\, \mathbb{V}[u_i \mid X] = \sum_{i=1}^{n} W_i^2 \sigma_u^2 = \sigma_u^2 \sum_{i=1}^{n} W_i^2 = \sigma_u^2\, \frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{\left(\sum_{i=1}^{n} (X_i - \overline{X})^2\right)^2} = \frac{\sigma_u^2}{\sum_{i=1}^{n} (X_i - \overline{X})^2}$$