

SLIDE 1

Section 1: Regression Review

Yotam Shem-Tov Fall 2014

SLIDE 2

Contact information

Yotam Shem-Tov, PhD student in economics
E-mail: shemtov@berkeley.edu
Office hours: Wednesday 2-4

SLIDE 3

There are two general approaches to regression

1. Regression as a model: a data generating process (DGP)
2. Regression as an algorithm, i.e. as a predictive model

These two approaches are different and make different assumptions.

SLIDE 4

Regression as a prediction

We have an input vector X^T = (X1, X2, . . . , Xp) (the data matrix X has dimensions n × p) and an output vector Y with dimensions n × 1. The linear regression model has the form:

f(X) = β0 + Σ_{j=1}^{p} Xj βj

We can pick the coefficients β = (β0, β1, . . . , βp)^T in a variety of ways, but OLS is by far the most common: it minimizes the residual sum of squares (RSS),

RSS(β) = Σ_{i=1}^{N} (yi − f(xi))² = Σ_{i=1}^{N} (yi − β0 − Σ_{j=1}^{p} xij βj)²
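As a concrete illustration of regression as an algorithm, here is a minimal R sketch (simulated data; the DGP and all object names are hypothetical, not from the slides): it minimizes the RSS numerically with optim() and compares the result with lm().

## Minimize the RSS directly and compare with lm() on simulated data
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X   <- cbind(1, x1, x2)                       # design matrix with an intercept column
rss <- function(beta) sum((y - X %*% beta)^2)

fit_optim <- optim(par = rep(0, 3), fn = rss, method = "BFGS")
round(rbind(optim = fit_optim$par, lm = coef(lm(y ~ x1 + x2))), 4)  # nearly identical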

SLIDE 5

Regression as a prediction

SLIDE 6

Regression as a prediction: Deriving the Algorithm

Denote by X the N × (p + 1) matrix whose rows are the input vectors (with a 1 in the first position), and by y the output vector. Write the RSS as:

RSS(β) = (y − Xβ)^T (y − Xβ)

Differentiate with respect to β:

∂RSS/∂β = −2 X^T (y − Xβ)    (1)

Assume that X is of full rank (no perfect collinearity among the independent variables) and set the first derivative to 0:

X^T (y − Xβ) = 0

Solve for β:

β̂ = (X^T X)^(-1) X^T y
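A short R check of the closed-form solution (simulated data, hypothetical names): compute β̂ = (X^T X)^(-1) X^T y directly and compare it with lm().

## Closed-form OLS coefficients vs. lm()
set.seed(2)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 3 + 1.5 * x1 - 2 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                        # N x (p + 1) design matrix, first column of 1s
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # (X'X)^{-1} X'y, computed via solve()

cbind(closed_form = beta_hat, lm = coef(lm(y ~ x1 + x2)))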

SLIDE 7

Regression as a prediction: Deriving the Algorithm

What happens if X is not of full rank? Then X^T X is singular, there are infinitely many values of β that satisfy the F.O.C., and the algorithm does not have a unique solution. The matrix X is also referred to as the design matrix.

SLIDE 8

Regression as a prediction: Making a Prediction

The hat matrix, or projection matrix, is H = X(X^T X)^(-1) X^T, with H̃ = I − H. We use the hat matrix to find the fitted values:

Ŷ = Xβ̂ = X(X^T X)^(-1) X^T Y = HY

We can now write the residuals as e = (I − H)Y. HY is the part of Y that projects onto the column space of X; H̃Y is the part of Y that does not, which is the residual part of Y. Therefore e = H̃Y is the part of Y that is not a linear combination of the columns of X.
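A brief R sketch (simulated data) of the hat matrix in action: HY reproduces the fitted values and (I − H)Y the residuals from lm().

## Hat matrix, fitted values, and residuals
set.seed(3)
n <- 50
x <- rnorm(n)
y <- 2 + x + rnorm(n)

X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)        # hat / projection matrix

fit <- lm(y ~ x)
all.equal(as.vector(H %*% y), unname(fitted(fit)))                 # TRUE: HY are the fitted values
all.equal(as.vector((diag(n) - H) %*% y), unname(residuals(fit)))  # TRUE: (I - H)Y are the residuals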

SLIDE 9

Regression as a prediction: Deriving the Algorithm

Do we make any assumption on the distribution of Y? No!
Can the dependent variable (the response) Y be a binary variable, i.e. Y ∈ {0, 1}? Yes!
Do we assume homoskedasticity, i.e. that Var(Yi) = σ² for all i? No!
Are the residuals e correlated with the covariates? Do we need any additional assumption in order for corr(e, X) = 0? No! The OLS algorithm always yields residuals that are uncorrelated with the covariates.
The procedure we have discussed so far is an algorithm that solves an optimization problem (minimizing a squared loss function). The algorithm requires a full-rank assumption in order to yield a unique solution, but it does not require any assumption about the distribution or the type of the response variable Y.
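A small R check (hypothetical simulated DGP) that the algorithm needs no distributional assumptions: even with a binary response, the OLS residuals are numerically uncorrelated with the covariates.

## OLS with a binary response; residuals are uncorrelated with the covariates by construction
set.seed(4)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 * x1 - x2))  # binary outcome

e <- residuals(lm(y ~ x1 + x2))                          # plain OLS (a linear probability model)
round(c(cor(e, x1), cor(e, x2)), 10)                     # both essentially zero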

SLIDE 10

Regression as a model: From algorithm to model

Now we make stronger assumptions. Most importantly, we assume a data generating process (henceforth DGP), i.e. a functional form for the relationship between Y and X.
Is Y a linear function of the covariates? No, it is a linear function of β.
What are the classic assumptions of the regression model?

SLIDE 11

Regression as a model: The classic assumptions of the regression model

1. The dependent variable is linearly related to the coefficients of the model and the model is correctly specified: Y = Xβ + ε
2. The independent variables X are fixed, i.e. they are not random variables (this can be relaxed to Cov(X, ε) = 0)
3. The conditional mean of the error term is zero: E(ε|X) = 0
4. Homoskedasticity: the error term has a constant variance, i.e. V(εi) = σ²
5. The error terms are uncorrelated with each other: Cov(εi, εj) = 0 for i ≠ j
6. The design matrix X has full rank
7. The error term is normally distributed, i.e. ε ∼ N(0, σ²) (the mean and variance follow from (3) and (4))

SLIDE 12

Discussion of the classic assumptions of the regression model

The assumption that E(ε|X) = 0 will always be satisfied when there is an intercept term in the model, i.e. when the design matrix contains a constant column.
When X ⊥ ε it follows that Cov(X, ε) = 0.
The normality assumption on εi is required for hypothesis testing on β. The assumption can be relaxed for sufficiently large sample sizes, as by the CLT β̂_OLS converges to a normal distribution when N → ∞. What is a sufficiently large sample size?

SLIDE 13

Properties of the OLS estimators: Unbiased estimator

The OLS estimator of β is

β̂ = (X^T X)^(-1) X^T Y = (X^T X)^(-1) X^T (Xβ + ε) = (X^T X)^(-1) X^T X β + (X^T X)^(-1) X^T ε = β + (X^T X)^(-1) X^T ε

We know that β̂ is unbiased if E(β̂) = β:

E(β̂|X) = E(β + (X^T X)^(-1) X^T ε | X) = E(β|X) + E((X^T X)^(-1) X^T ε | X) = β + (X^T X)^(-1) E(ε|X)

where E(ε|X) = E(ε) = 0, hence E(β̂) = β.
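A Monte Carlo sketch of this result (simulated data, hypothetical values; the covariate is held fixed across replications, in line with assumption (2)):

## Average of beta_hat over many replications is close to the true slope
set.seed(13)
n <- 30
x <- rnorm(n)                                  # covariates fixed across replications
beta_hats <- replicate(2000, {
  y <- 1 + 2 * x + rnorm(n)                    # new error draw each replication
  unname(coef(lm(y ~ x))[2])
})
mean(beta_hats)                                # close to the true slope, 2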

SLIDE 14

Properties of the OLS estimators: Unbiased estimator

What assumptions are used for the proof that β̂_OLS is an unbiased estimator?
Assumption (1): the model is correct.
Assumption (2): the covariates are independent of the error term.

SLIDE 15

Properties of the OLS estimators: The variance of β̂_OLS

Recall:

β̂ = (X^T X)^(-1) X^T Y = (X^T X)^(-1) X^T (Xβ + ε) ⇒ β̂ − β = (X^T X)^(-1) X^T ε

Plugging this into the covariance equation:

Cov(β̂|X) = E[(β̂ − β)(β̂ − β)^T | X]
= E[((X^T X)^(-1) X^T ε)((X^T X)^(-1) X^T ε)^T | X]
= E[(X^T X)^(-1) X^T ε ε^T X (X^T X)^(-1) | X]
= (X^T X)^(-1) X^T E(ε ε^T | X) X (X^T X)^(-1)        where E(ε ε^T | X) = σ² I_{n×n}
= (X^T X)^(-1) X^T σ² I X (X^T X)^(-1)
= σ² (X^T X)^(-1) X^T X (X^T X)^(-1)
= σ² (X^T X)^(-1)
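A quick R check (simulated data) that σ̂² (X^T X)^(-1) matches the covariance matrix reported by lm():

## Var(beta_hat) = sigma^2 (X'X)^{-1}, compared with vcov()
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 3)

fit <- lm(y ~ x)
X   <- cbind(1, x)
s2  <- sum(residuals(fit)^2) / (n - 2)         # sigma^2 estimated with n - p degrees of freedom

all.equal(unname(s2 * solve(t(X) %*% X)), unname(vcov(fit)))  # TRUE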

SLIDE 16

Estimating σ²

We estimate σ² by dividing the sum of squared residuals by the degrees of freedom, because the ei are generally smaller than the εi: β̂ was chosen to make the sum of squared residuals as small as possible.

σ̂²_OLS = (1/(n − p)) Σ_{i=1}^{n} ei²

Compare the above estimator to the classic variance estimator:

σ̂²_classic = (1/(n − 1)) Σ_{i=1}^{n} (Yi − Ȳ)²

Is one estimator always preferable to the other? If not, when is each estimator preferable?
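Both estimators are easy to compare in R (simulated data, hypothetical values); summary(lm)$sigma^2 reproduces the degrees-of-freedom version:

## Residual-based estimate (divides by n - p) vs. the classic variance of Y
set.seed(6)
n <- 80
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
p   <- length(coef(fit))                             # number of estimated coefficients

sigma2_ols     <- sum(residuals(fit)^2) / (n - p)
sigma2_classic <- sum((y - mean(y))^2) / (n - 1)     # ignores the covariate entirely
c(ols = sigma2_ols, lm = summary(fit)$sigma^2, classic = sigma2_classic)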

SLIDE 17

Measurement error

Consider the following DGP (data generating process):

n = 200
x1 = rnorm(n, mean = 10, sd = 1)
epsilon = rnorm(n, mean = 0, sd = 2)
y = 10 + 5 * x1 + epsilon
### measurement error:
noise = rnorm(n, mean = 0, sd = 2)
x1_noise = x1 + noise

The true model contains x1, but we observe only x1_noise. We will investigate the effect of the noise, and of the distribution of the noise, on the OLS estimate of β1. The true value of the parameter of interest is β1 = 5.
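Continuing the DGP above, a quick look at what the noisy regressor does to the estimate (attenuation toward zero):

## Fit with the true and with the noisy regressor
coef(lm(y ~ x1))        # slope close to the true beta1 = 5
coef(lm(y ~ x1_noise))  # slope attenuated toward 0, because Var(noise) > 0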

SLIDE 18

Measurement error: noise ∼ N(µ = 0, σ = 2)

[Figure: scatter plot of y against the noisy x1; panel title "measurement error with mean 0"]

SLIDE 19

Measurement error: noise ∼ N(µ = 5, σ = 2)

[Figure: scatter plot of y against the noisy x1; panel title "measurement error with mean 5"]

SLIDE 20

Measurement error: noise ∼ N(µ =?, σ = 2)

[Figure: estimated β1 (vertical axis) against the expectation of the noise (horizontal axis)]

SLIDE 21

Measurement error: noise ∼ N(µ = 5, σ =?)

[Figure: estimated β1 (vertical axis) against the variance of the noise (horizontal axis)]

SLIDE 22

Measurement error: noise ∼ exp(λ =?)

[Figure: estimated β1 (vertical axis) against the rate parameter of the exponential noise (horizontal axis, labeled "Rate parameter (variance = 1/Rate)")]

SLIDE 23

Measurement error

Could we reach the same conclusions as in the simulations from analytical derivations? Yes.
As we saw before,

E(β̂_OLS) = Cov(y, x1_noise) / V(x1_noise) = Cov(y, x1 + noise) / V(x1 + noise) = Cov(y, x1) / (V(x1) + V(noise))

Therefore, as V(noise) → ∞ the expectation of the OLS estimator of β1 converges to zero:

V(noise) → ∞  ⇒  E(β̂_OLS) = Cov(y, x1) / (V(x1) + V(noise)) → 0
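A simulation sketch of this attenuation formula (hypothetical noise scales), comparing the fitted slope with the analytic value 5·V(x1)/(V(x1) + V(noise)):

## Attenuation of beta1_hat as the measurement-error variance grows
set.seed(7)
n  <- 10000
x1 <- rnorm(n, mean = 10, sd = 1)
y  <- 10 + 5 * x1 + rnorm(n, sd = 2)

sapply(c(0.5, 1, 2, 4, 8), function(s) {
  x1_noise <- x1 + rnorm(n, sd = s)
  c(sd_noise = s,
    beta_hat = unname(coef(lm(y ~ x1_noise))[2]),
    analytic = 5 * 1 / (1 + s^2))                # 5 * V(x1) / (V(x1) + V(noise))
})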

SLIDE 24

Measurement error in the dependent variable

Consider the situation in which yi is not observed, but yi^noise is observed. There is no measurement error in x1. The model (DGP) is:

yi = 10 + 5·x1i + εi
yi^noise = yi + noisei

Will the OLS estimator of β1 be unbiased? Yes:

E(β̂_OLS) = Cov(y^noise, x1) / V(x1) = Cov(y + noise, x1) / V(x1) = Cov(y, x1) / V(x1) = β1

This model is equivalent to the model yi = 10 + 5·x1i + (εi + noisei), where yi is observed.
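A short simulation check (hypothetical noise scale) that additive noise in y leaves the slope estimate centered on β1 and only inflates its variance:

## Additive noise in the dependent variable does not bias beta1_hat
set.seed(8)
n  <- 10000
x1 <- rnorm(n, mean = 10, sd = 1)
y  <- 10 + 5 * x1 + rnorm(n, sd = 2)
y_noise <- y + rnorm(n, sd = 4)

c(true_y  = unname(coef(lm(y ~ x1))[2]),
  noisy_y = unname(coef(lm(y_noise ~ x1))[2]))   # both close to 5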

SLIDE 25

Measurement error in the dependent variable

Will the OLS estimator be unbiased if the measurement error is multiplicative instead of additive? Formally, if the DGP is:

yi = 10 + 5·x1i + εi
yi^noise = yi · noisei

Analytic derivation:

E(β̂_OLS) = Cov(y^noise, x1) / V(x1) = Cov(y · noise, x1) / V(x1)
Cov(y · noise, x1) = E(y · noise · x1) − E(y · noise) · E(x1) = E(noise) · Cov(y, x1)
⇒ E(β̂_OLS) = E(noise) · Cov(y, x1) / V(x1) = E(noise) · β1
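A matching simulation sketch (hypothetical noise distribution with E(noise) = 2): the fitted slope is close to E(noise)·β1 rather than β1.

## Multiplicative noise in y scales beta1_hat by roughly E(noise)
set.seed(9)
n  <- 10000
x1 <- rnorm(n, mean = 10, sd = 1)
y  <- 10 + 5 * x1 + rnorm(n, sd = 2)

noise   <- rnorm(n, mean = 2, sd = 1)            # E(noise) = 2, independent of x1 and epsilon
y_noise <- y * noise

c(estimate = unname(coef(lm(y_noise ~ x1))[2]),
  analytic = 2 * 5)                              # E(noise) * beta1 = 10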

SLIDE 26

Measurement error in the dependent variable

When there is multiplicative noise, the bias of β̂ is driven by E(noise), not by V(noise).

SLIDE 27

Gauss-Markov theorem: BLUE

The regression estimator is a linear estimator, β̂ = Cy, where C = (X^T X)^(-1) X^T. A linear estimator is any β̂j of the form β̂j = c1·y1 + c2·y2 + · · · + cn·yn.
The Gauss-Markov theorem: if assumptions (2), (3), (4) and (5) hold, the regression estimator is the best linear unbiased estimator (BLUE), in terms of MSE (mean squared error).

SLIDE 28

Frisch-Waugh-Lovell: Regression Anatomy

In the simple bivariate case: β1 = Cov(Yi, Xi) / Var(Xi).
In the multivariate case, βj is

βj = Cov(Yi, X̃ij) / Var(X̃ij)

where X̃ij is the residual from the regression of Xij on all the other covariates. The multiple regression coefficient β̂j represents the additional contribution of xj to y, after xj has been adjusted for 1, x1, . . . , xj−1, xj+1, . . . , xp.
What happens when xj is highly correlated with some of the other xk's?
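A short R sketch (simulated data) of this regression-anatomy result: the coefficient on x1 from the full regression equals the bivariate slope of y on the residualized x1.

## Frisch-Waugh-Lovell / regression anatomy
set.seed(10)
n  <- 300
x2 <- rnorm(n); x3 <- rnorm(n)
x1 <- 0.8 * x2 - 0.3 * x3 + rnorm(n)             # x1 correlated with the other covariates
y  <- 1 + 2 * x1 + x2 - x3 + rnorm(n)

x1_res <- residuals(lm(x1 ~ x2 + x3))            # x1 tilde: x1 purged of the other covariates
c(full_model = unname(coef(lm(y ~ x1 + x2 + x3))["x1"]),
  anatomy    = cov(y, x1_res) / var(x1_res))     # identical up to floating-point error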

SLIDE 29

Frisch-Waugh-Lovell: Regression Anatomy

Claim: βj = Cov(Ỹi, X̃ij) / Var(X̃ij), i.e. Cov(Yi, X̃ij) = Cov(Ỹi, X̃ij).
Proof: Let Ỹi and X̃ji be the residuals from regressing Yi and Xji, respectively, on all the other covariates:

Xji = β0 + β1·X1i + β2·X2i + · · · + βP·XPi + fi
Yi = α0 + α1·X1i + α2·X2i + · · · + αP·XPi + ei

Then êi = Ỹi and f̂i = X̃ji. It follows from the OLS algorithm that Cov(Xki, X̃ji) = 0 for all k ≠ j, since the residuals of a regression are not correlated with any of the covariates. Hence

Cov(Ỹi, X̃ij) = Cov(Yi − α̂0 − α̂1·X1i − α̂2·X2i − · · · − α̂P·XPi, X̃ij) = Cov(Yi, X̃ij)

SLIDE 30

Asymptotics of OLS

Is the OLS estimator of β consistent? Yes.
Proof: Denote the observed characteristics of observation i by xi. What are the dimensions of xi? 1 × p.

xi = (xi1, xi2, . . . , xip),  and xi^T is the corresponding p × 1 column vector.

xi^T xi is the p × p matrix

xi^T xi = [ xi1²       xi1·xi2   . . .  xi1·xip
            xi2·xi1    xi2²      . . .  xi2·xip
            . . .
            xip·xi1    xip·xi2   . . .  xip²     ]

SLIDE 31

Asymptotics of OLS

Verify at home that

X^T X = [ Σ_{i=1}^{n} xi1²       Σ_{i=1}^{n} xi1·xi2   . . .  Σ_{i=1}^{n} xi1·xip
          Σ_{i=1}^{n} xi2·xi1    Σ_{i=1}^{n} xi2²      . . .  Σ_{i=1}^{n} xi2·xip
          . . .
          Σ_{i=1}^{n} xip·xi1    Σ_{i=1}^{n} xip·xi2   . . .  Σ_{i=1}^{n} xip²     ]   (p × p)

Hence, X^T X = Σ_{i=1}^{n} xi^T xi.

Note (and verify at home) that

X^T y = ( Σ_{i=1}^{n} xi1·yi,  Σ_{i=1}^{n} xi2·yi,  . . . ,  Σ_{i=1}^{n} xip·yi )^T = Σ_{i=1}^{n} xi^T yi
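These identities are easy to verify numerically in R (random matrices, hypothetical sizes):

## X'X as a sum of outer products x_i' x_i, and X'y as a sum of x_i' y_i
set.seed(11)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

XtX_sum <- Reduce(`+`, lapply(1:n, function(i) X[i, ] %o% X[i, ]))  # sum of p x p outer products
Xty_sum <- Reduce(`+`, lapply(1:n, function(i) X[i, ] * y[i]))

all.equal(t(X) %*% X, XtX_sum)               # TRUE
all.equal(as.vector(t(X) %*% y), Xty_sum)    # TRUE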

SLIDE 32

Asymptotics of OLS

The OLS estimator is β̂ = (X^T X)^(-1) X^T y. Recall that (X · k)^(-1) = k^(-1) · X^(-1). Multiplying and dividing by 1/n yields:

β̂ = ( (1/n) X^T X )^(-1) ( (1/n) X^T y )
   = ( (1/n) Σ_{i=1}^{n} xi^T xi )^(-1) ( (1/n) Σ_{i=1}^{n} xi^T yi )
   → E(xi^T xi)^(-1) · E(xi^T yi)
   = E(xi^T xi)^(-1) · E(xi^T (xi β + εi))
   = E(xi^T xi)^(-1) · E(xi^T xi) · β + E(xi^T xi)^(-1) · E(xi^T εi)
   = β

The convergence of the sample averages to their expectations follows from the law of large numbers (LLN), and the last step uses E(xi^T εi) = 0.
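A tiny simulation sketch of consistency (hypothetical DGP with β1 = 2): the estimates concentrate around the true value as n grows.

## OLS slope estimates at increasing sample sizes
set.seed(12)
beta_hat_at <- function(n) {
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n, sd = 3)
  unname(coef(lm(y ~ x))[2])
}
sapply(c(50, 500, 5000, 50000), beta_hat_at)   # estimates approach 2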

SLIDE 33

Regression in Causal Analysis

Imagine we are analyzing a randomized experiment with a regression using the following model:

Yi = α + β1·Ti + Xi^T·β2 + εi

where Ti is an indicator variable for treatment status and Xi is a vector of pre-treatment characteristics.
Under this model, what is random? How do we interpret the coefficient β1?
