PS 405 Week 5 Section: OLS Regression and Its Assumptions
D.J. Flynn
February 11, 2014
Today’s plan
◮ Basic OLS set-up
◮ Estimation/interpretation of OLS models in R
◮ Gauss-Markov Assumptions
Basic set-up
◮ Scalar:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_K X_{Ki} + \epsilon_i$
◮ Matrix:
$Y_i = X_i \beta + \epsilon_i$
◮ Y is a (quasi-)continuous outcome, the Xs are independent variables, and $\epsilon$ is the error term (estimated by the residual)
◮ We’ll use matrix form and assume X could include k = 1, 2, ... variables
◮ Our goal: specify a model (pick Xs) and estimate parameters ($\beta_0, \beta_1, \dots, \beta_K$) such that error is minimized
Re-cap from last week
$Y_i = X_i \beta + \epsilon_i$,
where i = 1, 2, ..., N and k = 1, 2, ..., K. For each term:
◮ vector or matrix?
◮ size?
◮ why are some (not all) terms indexed by i?
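One way to check your answers to the re-cap questions is to build the pieces in R and look at their sizes. This is only a sketch: it assumes the built-in USArrests data as a stand-in dataset, and the object names `Y` and `X` are chosen here for illustration.

```r
# Assumption: USArrests (built into R) stands in for a generic dataset.
Y <- USArrests$Murder                                              # outcome vector
X <- model.matrix(~ Assault + Rape + UrbanPop, data = USArrests)   # design matrix

length(Y)   # N: one outcome per observation
dim(X)      # N rows; one column per regressor plus the intercept column
X[1, ]      # X_i for i = 1: a single row, so it carries the index i

# beta has one entry per column of X and is shared by all observations,
# so it carries no index i.
```

Terms that vary by observation (Y, X, the error) are indexed by i; the parameter vector is not.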
Estimating OLS
◮ Collect data on Y and X.
◮ Estimate the model and obtain parameter estimates: $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_K$.
◮ Make predictions for each observation’s outcome ($\hat{Y}$) via linear combination.
Suppose our model is:
$\text{Turnout}_i = \beta_0 + \beta_1 \text{Competitiveness}_i + \beta_2 \text{AdSpending}_i + \epsilon_i$,
where Turnout is measured 0-100, Competitiveness is a dummy, and AdSpending is measured 1-5. We estimate the model in R and get these coefficients: $\hat{\beta}_0 = 11$, $\hat{\beta}_1 = 25$, $\hat{\beta}_2 = 6.25$.
Now we can predict turnout in any election given competitiveness and ad spending data. For a competitive election with lots of spending (5/5), the predicted level of turnout is:
$\hat{Y}_i = 11 + 25(1) + 6.25(5) = 67.25\%$.
Suppose true turnout in that election was 71%. Then the residual is:
$u_i = Y_i - \hat{Y}_i = 71 - 67.25 = 3.75$ percentage points.
Recall, OLS estimates parameters such that the sum of these squared errors is minimized over the whole dataset:
$\min \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2$
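The turnout arithmetic above can be reproduced in a few lines of R. The coefficient values are the hypothetical ones from the example, not estimates from real data, and the variable names are made up for this sketch.

```r
# Hypothetical coefficients from the turnout example (not real estimates)
b0     <- 11      # intercept
b_comp <- 25      # Competitiveness (dummy)
b_ads  <- 6.25    # AdSpending (1-5)

# A competitive election with maximum ad spending
competitive <- 1
ad_spending <- 5

y_hat   <- b0 + b_comp * competitive + b_ads * ad_spending  # predicted turnout
y_obs   <- 71                                               # observed turnout
resid_i <- y_obs - y_hat                                    # residual for this election

y_hat     # 67.25
resid_i   # 3.75
```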
Estimating/Interpreting OLS in R
◮ Practice estimating a model using the USArrests dataset:
    library(datasets)
    summary(USArrests)
    murder.model <- lm(Murder ~ Assault + Rape + UrbanPop, data = USArrests)
    summary(murder.model)
◮ Thanks to the linearity assumption, we can interpret each coefficient as the effect of a one-unit increase in that X on Y, holding the other Xs constant.
◮ Thus, we MUST know units for X and Y to interpret.
◮ Check out the description of variable codings (in R: ?USArrests).
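To make the point about units concrete, here is a small sketch using the same model. In USArrests, Murder and Assault are both arrests per 100,000 residents, so the Assault coefficient reads as "murder arrests per 100,000 for each additional assault arrest per 100,000". The name `b_assault` is just for this illustration.

```r
murder.model <- lm(Murder ~ Assault + Rape + UrbanPop, data = USArrests)

# Coefficient on Assault: change in Murder (per 100,000) for a
# one-unit increase in Assault (per 100,000), other Xs held constant
b_assault <- coef(murder.model)[["Assault"]]
b_assault

# A one-unit change is tiny on this scale; rescale to a more
# meaningful change of 100 assault arrests per 100,000
b_assault * 100
```

Rescaling like this does not change the model, only how the same coefficient is described.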
R output
Call:
lm(formula = Murder ~ Assault + Rape + UrbanPop, data = USArrests)

Residuals:
    Min      1Q  Median      3Q     Max
-4.3990 -1.9127 -0.3444  1.2557  7.4279

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.276639   1.737997   1.885   0.0657 .
Assault      0.039777   0.005912   6.729 2.33e-08 ***
Rape         0.061399   0.055740   1.102   0.2764
UrbanPop    -0.054694   0.027880  -1.962   0.0559 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let’s look at fitted values:
    fitted.values(murder.model)
    predict.lm(murder.model, interval="confidence")
    plot(fitted.values(murder.model), USArrests$Murder)
Residuals:
    resid(murder.model)
    plot(murder.model)
Other helpful commands:
    coef(murder.model)
    murder.model$coef[1]
    confint(murder.model)
Later: lots of diagnostics for checking assumptions
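It can help to see that the fitted values and residuals R reports are nothing more than the linear combination and the difference described earlier. A minimal sketch, refitting the same model so the block runs on its own (the `_by_hand` names are illustrative):

```r
murder.model <- lm(Murder ~ Assault + Rape + UrbanPop, data = USArrests)

X <- model.matrix(murder.model)   # design matrix, including the intercept column
b <- coef(murder.model)           # estimated coefficients

fitted_by_hand <- as.vector(X %*% b)                     # Y-hat = X * beta-hat
resid_by_hand  <- USArrests$Murder - fitted_by_hand      # u = Y - Y-hat

# Both should agree with R's own accessors
all.equal(fitted_by_hand, unname(fitted.values(murder.model)))
all.equal(resid_by_hand,  unname(resid(murder.model)))

# With an intercept in the model, OLS residuals sum to (numerically) zero
sum(resid_by_hand)
```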
Key point on interpretation
◮ ANOVA = does a factor (regardless of which category you’re in) predict the outcome?
◮ OLS = does some variable, X, affect the outcome relative to a baseline (omitted category)?
◮ Helpful example: estimating treatment effects in experiments.

DV: Policy Support (1-7)
                 Coefficient (SE)
EE Treatment     −0.311     (0.263)
J Treatment      −0.609**   (0.248)
HA Treatment     −0.621**   (0.254)
Constant          5.508***  (0.186)
Observations      272
*p<0.1; **p<0.05; ***p<0.01
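The "relative to baseline" reading can be checked on simulated data. Everything in this sketch is hypothetical: the condition labels are borrowed from the table, but the sample sizes, effect sizes, and noise are made up, so the numbers will not match the table.

```r
set.seed(405)

# Hypothetical experiment: labels, sample sizes, and effect sizes are made up.
levels_used <- c("Control", "EE", "J", "HA")
condition   <- factor(rep(levels_used, each = 50), levels = levels_used)
true_shift  <- c(Control = 0, EE = -0.3, J = -0.6, HA = -0.6)
support     <- 5.5 + unname(true_shift[as.character(condition)]) + rnorm(200)

fit         <- lm(support ~ condition)               # Control is the omitted category
group_means <- sapply(split(support, condition), mean)

coef(fit)                                    # intercept = control-group mean
group_means[-1] - group_means[["Control"]]   # matches the three treatment coefficients
anova(fit)                                   # one F-test for the factor as a whole
```

Each treatment coefficient is the difference between that group's mean and the control mean, while the ANOVA table asks only whether condition matters at all.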
Gauss-Markov
Under certain assumptions, OLS is the Best Linear Unbiased Estimator (BLUE) of β. Assumptions [1]:
1. Linearity
2. Homoskedasticity
3. Error terms are i.i.d.
4. Strict Exogeneity
5. Errors are Normally Distributed
6. No (Perfect) Multicollinearity
[1] Note: every regression text you read will express/refer to these differently.
Assumption 1: Linearity
◮ Y is a linear function of the data:
$\hat{Y}_i = X_i \beta$
◮ Typically OK if DV is continuous.
◮ Categorical/limited DVs break linearity and require more advanced (non-linear) models, which you’ll learn in 407.
◮ Common case that breaks linearity: a binary DV, which calls for a binary response model (e.g., logit). Notice, the function is non-linear:
$\hat{Y}_i = \frac{1}{1 + e^{-X_i \beta}}$
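A quick numerical look at the logit function shows why it is non-linear: equal steps in the linear predictor do not give equal steps in the prediction. The grid of values below is arbitrary and chosen only for illustration.

```r
# Arbitrary grid of values for the linear predictor X_i * beta
xb <- seq(-4, 4, by = 2)

# Logit (inverse-logit) prediction, written out as in the formula
p_hat <- 1 / (1 + exp(-xb))

# Same thing via R's built-in logistic distribution function
all.equal(p_hat, plogis(xb))

p_hat         # predictions stay between 0 and 1
diff(p_hat)   # equal steps in xb give unequal steps in p_hat
```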
Assumption 2: Homoskedasticity
◮ homoskedasticity: constant error variance = errors approximately the same size for subgroups of data:
$\mathrm{var}(\epsilon \mid X) = \sigma^2$, where $\sigma^2$ is some constant.
◮ heteroskedasticity: non-constant error variance = errors differ across subgroups of data
◮ easily testable/fixable (later this quarter)
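Before the formal tests arrive, a rough way to see the difference is to simulate both cases and compare residual variance across subgroups. This is an informal sketch on made-up data, not one of the formal tests; `resid_var_by_half` is a helper defined here for illustration.

```r
set.seed(405)
n <- 500
x <- runif(n, 1, 10)

# Simulated outcomes (all parameter values are made up)
y_homo   <- 2 + 3 * x + rnorm(n, sd = 2)         # constant error SD
y_hetero <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # error SD grows with x

# Residual variance for observations below vs. above the median of x
resid_var_by_half <- function(y) {
  e <- resid(lm(y ~ x))
  tapply(e, x > median(x), var)
}

resid_var_by_half(y_homo)     # two similar variances
resid_var_by_half(y_hetero)   # much larger variance in the high-x half
```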
Assumption 3: Error terms are i.i.d
◮ no correlation between error terms on different observations:
$E(\epsilon_i \epsilon_j) = 0, \quad i \neq j$
◮ common violation: autocorrelation
◮ easy fix: use time series models (not simple OLS)
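Autocorrelation is easy to see in a simulation: generate errors that depend on their own past, fit OLS anyway, and correlate each residual with the one before it. The AR(1) coefficient of 0.8 and the other values are arbitrary choices for this sketch.

```r
set.seed(405)
n <- 300
x <- rnorm(n)

# Errors: independent vs. AR(1) with coefficient 0.8 (arbitrary choice)
e_iid <- rnorm(n)
e_ar  <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))

# Correlation between each OLS residual and the previous one
lag1_cor <- function(e) {
  y     <- 1 + 2 * x + e
  e_hat <- resid(lm(y ~ x))
  cor(e_hat[-1], e_hat[-n])
}

lag1_cor(e_iid)   # close to 0
lag1_cor(e_ar)    # clearly positive
```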
Assumption 4: Strict Exogeneity
◮ Many ways to express. Usually:
$E(\epsilon_i \mid X_i) = 0$
◮ Jay will write it this way (same idea):
$X \perp \epsilon$
◮ Xs are determined outside the model, uncorrelated with error term
◮ challenging assumption for political scientists (e.g., democracy/GDP, media choice/political knowledge, etc.)
◮ possible solution: instrumental variables regression (next quarter)
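One common way exogeneity fails is an omitted variable that drives both X and Y. The simulation below is a sketch of that case only: `z` is a hypothetical unobserved confounder and all coefficients are made up, with the true effect of `x` set to 2.

```r
set.seed(405)
n <- 5000

z <- rnorm(n)                        # hypothetical unobserved confounder
x <- 0.8 * z + rnorm(n)              # x depends on z
y <- 1 + 2 * x + 3 * z + rnorm(n)    # true effect of x on y is 2

# Omitting z pushes it into the error term, which is then correlated with x
b_short <- coef(lm(y ~ x))[["x"]]

# Including z restores exogeneity in this simulated setup
b_long <- coef(lm(y ~ x + z))[["x"]]

b_short   # well above 2
b_long    # close to 2
```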
Assumption 5: Errors are Normally Distributed
◮ given your data and model, errors are Normal:
$\epsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is some constant.
◮ depends on distribution of your variables and model
◮ easy problem to detect: normal probability plots:
    plot(murder.model, which=2)
◮ if violated, coefficient estimates are still OK (unbiased), but hypothesis tests that rely on Normality may be invalid, especially in small samples
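Alongside the normal probability plot, a numerical check on the residuals can be useful. The sketch below uses the Shapiro-Wilk test from base R, which is one option among several and is not named in the slides; it refits the model so the block runs on its own.

```r
murder.model <- lm(Murder ~ Assault + Rape + UrbanPop, data = USArrests)
e_hat <- resid(murder.model)

# Coordinates of the normal probability plot, without drawing it
qq <- qqnorm(e_hat, plot.it = FALSE)
cor(qq$x, qq$y)   # closer to 1 means the points lie closer to a straight line

# Shapiro-Wilk test: the null hypothesis is that the residuals are Normal
sw <- shapiro.test(e_hat)
sw$statistic
sw$p.value
```

A small p-value would be evidence against Normality; with only 50 observations the test has limited power, so it is best read together with the plot.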
Assumption 6: No (Perfect) Multicollinearity
◮ multicollinearity: correlation among independent variables in a model (e.g., ideology, PID)
◮ perfect multicollinearity: two variables perfectly predict one another (e.g., dummies for male, female), so we can’t estimate the effect of one relative to the other
◮ challenging assumption for political scientists (especially behavioralists)
◮ What does R do with perfectly multicollinear regressors?
    dep.var <- rnorm(100, 10, 2)
    female <- rbinom(100, 1, .51)
    male <- ifelse(female == 1, 0, 1)
    perf.collin.model <- lm(dep.var ~ female + male)
    summary(perf.collin.model)

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  10.0323     0.2757  36.388   <2e-16 ***
female       -0.2203     0.3861  -0.571     0.57
male              NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.93 on 98 degrees of freedom
Multiple R-squared:  0.003313, Adjusted R-squared:  -0.006858
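The NA row appears because the intercept column, female, and male are linearly dependent (female + male = 1 for every observation). A short sketch of how to confirm this and the usual fix of dropping one dummy; the seed is added here for reproducibility, so the estimates will differ from the output above.

```r
set.seed(405)   # added for reproducibility; not in the original example
dep.var <- rnorm(100, 10, 2)
female  <- rbinom(100, 1, 0.51)
male    <- ifelse(female == 1, 0, 1)

# The two dummies always sum to the intercept column of 1s
all(female + male == 1)

# So the design matrix has 3 columns but only rank 2
X <- cbind(intercept = 1, female = female, male = male)
qr(X)$rank

# R drops the redundant regressor and reports NA for it
coef(lm(dep.var ~ female + male))

# Fix: keep one dummy; male becomes the baseline (omitted) category
fit <- lm(dep.var ~ female)
coef(fit)
```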