Section 1: Regression Review
Yotam Shem-Tov, Fall 2014 (STAT 239 / PS 236A)


  1. Section 1: Regression Review. Yotam Shem-Tov, Fall 2014. STAT 239 / PS 236A.

  2. Contact information. Yotam Shem-Tov, PhD student in economics. E-mail: shemtov@berkeley.edu. Office hours: Wednesday 2-4.

  3. There are two general approaches to regression: (1) regression as a model, i.e. a data generating process (DGP); (2) regression as an algorithm, i.e. as a predictive model. These two approaches are different and make different assumptions.

  4. Regression as a prediction. We have an input vector $X^T = (X_1, X_2, \dots, X_p)$, stacked over $n$ observations into a matrix of dimension $n \times p$, and an output vector $Y$ of dimension $n \times 1$. The linear regression model has the form
$$ f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j. $$
We can pick the coefficients $\beta = (\beta_0, \beta_1, \dots, \beta_p)^T$ in a variety of ways, but OLS is by far the most common; it minimizes the residual sum of squares (RSS):
$$ \mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{N} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2. $$
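As a concrete illustration, here is a minimal R sketch (simulated data and made-up coefficient values, not from the slides) showing that the OLS solution attains a smaller RSS than a perturbed coefficient vector:

    # Minimal sketch: OLS minimizes the residual sum of squares
    set.seed(1)
    n <- 100
    x <- rnorm(n)
    y <- 2 + 3 * x + rnorm(n)          # hypothetical DGP for illustration

    rss <- function(beta, X, y) sum((y - X %*% beta)^2)

    X <- cbind(1, x)                   # design matrix with a constant column
    beta_ols <- coef(lm(y ~ x))        # OLS coefficients

    rss(beta_ols, X, y)                # RSS at the OLS solution
    rss(beta_ols + c(0.1, -0.1), X, y) # any perturbation yields a larger RSS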

  5. Regression as a prediction.

  6. Regression as a prediction: Deriving the Algorithm. Denote by $X$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and let $y$ be the output vector. Write the RSS as
$$ \mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta). $$
Differentiate with respect to $\beta$:
$$ \frac{\partial \mathrm{RSS}}{\partial \beta} = -2 X^T (y - X\beta). \quad (1) $$
Assume that $X$ is full rank (no perfect collinearity among any of the independent variables) and set the first derivative to 0:
$$ X^T (y - X\beta) = 0. $$
Solve for $\beta$:
$$ \hat{\beta} = (X^T X)^{-1} X^T y. $$
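The closed-form solution can be checked numerically. A sketch in R with simulated data; solve(crossprod(X), crossprod(X, y)) solves the normal equations $(X^T X)\beta = X^T y$ directly, and lm() serves as the reference:

    # Sketch: normal-equation solution versus lm()
    set.seed(2)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 1 + 2 * x1 - x2 + rnorm(n)                 # hypothetical coefficients

    X <- cbind(1, x1, x2)                            # N x (p+1) design matrix
    beta_hat <- solve(crossprod(X), crossprod(X, y)) # solves (X'X) b = X'y

    cbind(beta_hat, coef(lm(y ~ x1 + x2)))           # the two agree up to rounding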

  7. Regression as a prediction: Deriving the Algorithm. What happens if $X$ is not full rank? Then $X^T X$ is singular and has no unique inverse, so the algorithm does not have a unique solution: there are infinitely many values of $\beta$ that satisfy the first-order condition (FOC). The matrix $X$ is also referred to as the design matrix.
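A short sketch of what rank deficiency looks like in practice (simulated data): with a perfectly collinear column, $X^T X$ is singular, and lm() signals the non-uniqueness by reporting an NA coefficient for the redundant column:

    # Sketch: a rank-deficient design matrix
    set.seed(3)
    n  <- 50
    x1 <- rnorm(n)
    x2 <- 2 * x1                # x2 is an exact linear combination of x1
    y  <- 1 + x1 + rnorm(n)

    X <- cbind(1, x1, x2)
    # solve(crossprod(X)) would fail here: X'X is computationally singular
    coef(lm(y ~ x1 + x2))       # lm() returns NA for x2: no unique solution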

  8. Regression as a prediction: Making a Prediction. The hat matrix, or projection matrix, is $H = X(X^T X)^{-1} X^T$, with $\tilde{H} = I - H$. We use the hat matrix to find the fitted values:
$$ \hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY. $$
We can now write $e = (I - H)Y$. Since $HY$ is the part of $Y$ that projects into the column space of $X$, $\tilde{H}Y$ is the part of $Y$ that does not, which is the residual part of $Y$. Therefore $\tilde{H}Y = e$ is the part of $Y$ that is not a linear combination of the columns of $X$.
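A sketch verifying the projection interpretation on simulated data (forming $H$ explicitly is fine at this scale, though real implementations avoid it):

    # Sketch: fitted values and residuals via the hat matrix
    set.seed(4)
    n <- 20
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n)

    X <- cbind(1, x)
    H <- X %*% solve(crossprod(X)) %*% t(X)  # H = X (X'X)^{-1} X'

    fitted_hat <- H %*% y                    # HY: the part of Y in the span of X
    resid_hat  <- (diag(n) - H) %*% y        # (I - H)Y: the residual part

    all.equal(c(fitted_hat), unname(fitted(lm(y ~ x))))  # TRUE
    all.equal(c(resid_hat),  unname(resid(lm(y ~ x))))   # TRUE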

  9. Regression as a prediction: Deriving the Algorithm. Do we make any assumption on the distribution of $Y$? No! Can the dependent variable (the response), $Y$, be a binary variable, i.e. $Y \in \{0, 1\}$? Yes! Do we assume homoskedasticity, i.e. that $\mathrm{Var}(Y_i) = \sigma^2$ for all $i$? No! Are the residuals, $e$, correlated with $Y$? Do we need to make any additional assumption in order for $\mathrm{corr}(e, X) = 0$? No! The OLS algorithm will always yield residuals that are uncorrelated with the covariates. The procedure we have discussed so far is an algorithm that solves an optimization problem (minimizing a squared loss function). The algorithm requires a full-rank assumption in order to yield a unique solution, but it does not require any assumption on the distribution or the type of the response variable, $Y$.
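A sketch of these points in R: the response below is binary, no distributional assumption is invoked, and the residuals are orthogonal to the design by construction:

    # Sketch: X'e = 0 holds mechanically, even for a binary response
    set.seed(5)
    n <- 200
    x <- rnorm(n)
    y <- rbinom(n, size = 1, prob = plogis(x))  # binary outcome

    fit <- lm(y ~ x)
    e   <- resid(fit)
    crossprod(cbind(1, x), e)   # numerically zero: residuals orthogonal to X
    cor(x, e)                   # hence uncorrelated with the covariate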

  10. Regression as a model: From algorithm to model. Now we make stronger assumptions; most importantly, we assume a data generating process (hence DGP), i.e. we assume a functional form for the relationship between $Y$ and $X$. Is $Y$ a linear function of the covariates? No, it is a linear function of $\beta$ (see the sketch below). What are the classic assumptions of the regression model?
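A sketch of the distinction: the model below is nonlinear in $x$ but linear in $\beta$, so it is still an OLS problem (hypothetical coefficient values):

    # Sketch: linear in beta, nonlinear in x
    set.seed(0)
    x <- runif(100, 1, 10)
    y <- 1 + 2 * x - 0.5 * x^2 + 3 * log(x) + rnorm(100)

    coef(lm(y ~ x + I(x^2) + log(x)))  # OLS recovers all four coefficients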

  11. Regression as a model: The classic assumptions of the regression model.
     1. The dependent variable is linearly related to the coefficients of the model and the model is correctly specified: $Y = X\beta + \epsilon$.
     2. The independent variables, $X$, are fixed, i.e. are not random variables (this can be relaxed to $\mathrm{Cov}(X, \epsilon) = 0$).
     3. The conditional mean of the error term is zero: $E(\epsilon \mid X) = 0$.
     4. Homoskedasticity: the error term has a constant variance, i.e. $V(\epsilon_i) = \sigma^2$.
     5. The error terms are uncorrelated with each other: $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$.
     6. The design matrix, $X$, has full rank.
     7. The error term is normally distributed, i.e. $\epsilon \sim N(0, \sigma^2)$ (the mean and variance follow from (3) and (4)).

  12. Discussion of the classic assumptions of the regression model. The assumption that $E(\epsilon \mid X) = 0$ will always be satisfied when there is an intercept term in the model, i.e. when the design matrix contains a constant term. When $X \perp \epsilon$, it follows that $\mathrm{Cov}(X, \epsilon) = 0$. The normality assumption on $\epsilon_i$ is required for hypothesis testing on $\beta$. This assumption can be relaxed for sufficiently large sample sizes: by the CLT, $\hat{\beta}_{OLS}$ converges to a normal distribution as $N \to \infty$. What is a sufficiently large sample size?
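A simulation sketch of the CLT point (made-up parameter values): even with strongly skewed errors, the sampling distribution of $\hat{\beta}_1$ looks approximately normal at a moderate sample size:

    # Sketch: approximate normality of beta_hat under non-normal errors
    set.seed(6)
    n <- 500
    x <- rnorm(n)               # fixed design across replications

    beta_hat <- replicate(2000, {
      eps <- rexp(n) - 1        # skewed errors with mean zero
      y   <- 1 + 2 * x + eps
      coef(lm(y ~ x))[2]
    })

    hist(beta_hat, breaks = 40) # roughly bell-shaped, centered near 2
    qqnorm(beta_hat); qqline(beta_hat)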

  13. Properties of the OLS estimators: Unbiased estimator. The OLS estimator of $\beta$ is
$$ \hat{\beta} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (X\beta + \epsilon) = (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \epsilon = \beta + (X^T X)^{-1} X^T \epsilon. $$
We know that $\hat{\beta}$ is unbiased if $E(\hat{\beta}) = \beta$:
$$ E(\hat{\beta} \mid X) = E\big(\beta + (X^T X)^{-1} X^T \epsilon \mid X\big) = E(\beta \mid X) + E\big((X^T X)^{-1} X^T \epsilon \mid X\big) = \beta + (X^T X)^{-1} X^T E(\epsilon \mid X), $$
where $E(\epsilon \mid X) = E(\epsilon) = 0$, so $E(\hat{\beta}) = \beta$.
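The algebra above can be checked by simulation; a sketch with hypothetical parameter values, averaging $\hat{\beta}_1$ over many replications of the error draw:

    # Sketch: unbiasedness of the OLS slope
    set.seed(7)
    n <- 100
    x <- rnorm(n)               # fixed design, as in the assumptions

    beta_hat <- replicate(5000, {
      y <- 1 + 2 * x + rnorm(n) # correctly specified model, beta_1 = 2
      coef(lm(y ~ x))[2]
    })

    mean(beta_hat)              # very close to the true value 2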

  14. Properties of the OLS estimators: Unbiased estimator. What assumptions are used in the proof that $\hat{\beta}_{OLS}$ is an unbiased estimator? Assumption (1), the model is correct, and Assumption (2), the covariates are independent of the error term.

  15. Properties of the OLS estimators: The variance of $\hat{\beta}_{OLS}$. Recall:
$$ \hat{\beta} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (X\beta + \epsilon) \;\Rightarrow\; \hat{\beta} - \beta = (X^T X)^{-1} X^T \epsilon. $$
Plugging this into the covariance equation:
$$ \mathrm{cov}(\hat{\beta} \mid X) = E\big[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T \mid X\big] = E\big[\big((X^T X)^{-1} X^T \epsilon\big)\big((X^T X)^{-1} X^T \epsilon\big)^T \mid X\big] = (X^T X)^{-1} X^T E(\epsilon \epsilon^T \mid X) X (X^T X)^{-1}, $$
where $E(\epsilon \epsilon^T \mid X) = \sigma^2 I$, so
$$ \mathrm{cov}(\hat{\beta} \mid X) = (X^T X)^{-1} X^T \sigma^2 I X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}. $$
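A sketch comparing the analytic formula with what lm() reports; note that vcov() plugs in the estimated $\hat{\sigma}^2$ where the derivation uses the true $\sigma^2$:

    # Sketch: sigma^2 (X'X)^{-1} versus lm()'s estimated covariance matrix
    set.seed(8)
    n <- 100; sigma <- 2
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n, sd = sigma)

    X   <- cbind(1, x)
    fit <- lm(y ~ x)

    sigma^2 * solve(crossprod(X))  # analytic covariance with the true sigma^2
    vcov(fit)                      # same formula with sigma_hat^2 plugged in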

  16. Estimating $\sigma^2$. We estimate $\sigma^2$ by dividing the sum of squared residuals by the degrees of freedom, because the $e_i$ are generally smaller than the $\epsilon_i$: $\hat{\beta}$ was chosen to make the sum of squared residuals as small as possible.
$$ \hat{\sigma}^2_{OLS} = \frac{1}{n - p} \sum_{i=1}^{n} e_i^2 $$
Compare the above estimator to the classic variance estimator:
$$ \hat{\sigma}^2_{classic} = \frac{1}{n - 1} \sum_{i=1}^{n} \big(Y_i - \bar{Y}\big)^2 $$
Is one estimator always preferable to the other? If not, when is each estimator preferable?
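A sketch computing both estimators on simulated data. Note that they target different quantities: $\hat{\sigma}^2_{OLS}$ estimates the error variance given the covariates, while the classic estimator estimates the marginal variance of $Y$, which also includes the variation explained by $X$:

    # Sketch: the two variance estimators side by side
    set.seed(9)
    n <- 100
    x <- rnorm(n)
    y <- 1 + 2 * x + rnorm(n, sd = 2)

    fit <- lm(y ~ x)
    p   <- length(coef(fit))        # number of estimated coefficients

    sum(resid(fit)^2) / (n - p)     # sigma_hat^2_OLS; equals summary(fit)$sigma^2
    var(y)                          # classic estimator: much larger here, since
                                    # it includes the variation explained by x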

  17. Measurement error. Consider the following DGP (data generating process):

    n <- 200
    x1 <- rnorm(n, mean = 10, sd = 1)
    epsilon <- rnorm(n, 0, 2)
    y <- 10 + 5 * x1 + epsilon
    ### measurement error:
    noise <- rnorm(n, 0, 2)
    x1_noise <- x1 + noise

The true model has $x_1$; however, we observe only x1_noise. We will investigate the effect of the noise, and of the distribution of the noise, on the OLS estimate of $\beta_1$. The true value of the parameter of interest is $\beta_1 = 5$.
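Continuing with the objects defined above, a sketch of the comparison the next slides illustrate. Under classical measurement error, the slope is attenuated toward zero by the reliability ratio $\mathrm{Var}(x_1) / (\mathrm{Var}(x_1) + \mathrm{Var}(\mathrm{noise}))$:

    # Sketch: attenuation bias from the noisy covariate (reuses the DGP above)
    coef(lm(y ~ x1))        # slope close to the true beta_1 = 5
    coef(lm(y ~ x1_noise))  # attenuated: roughly 5 * 1/(1 + 4) = 1,
                            # since Var(x1) = 1 and Var(noise) = 4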

  18. Measurement error: noise $\sim N(\mu = 0, \sigma = 2)$. [Figure: scatterplot of y (roughly 50 to 70) against the noisy x1 (roughly 6 to 15); title: "measurement error with mean 0".]

  19. Measurement error: noise $\sim N(\mu = 5, \sigma = 2)$. [Figure: scatterplot of y (roughly 50 to 70) against the noisy x1 (roughly 8 to 20); title: "measurement error with mean 5".]

  20. Measurement error: noise $\sim N(\mu = ?, \sigma = 2)$. [Figure: estimated slope (y-axis "beta", from 0 to 5) plotted against the expectation of the noise (x-axis from 0 to 100); the estimates are essentially flat across noise means.]
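The experiment behind this figure can be sketched as follows (parameter choices mirror the DGP above): sweep the expectation of the noise while holding its standard deviation fixed. Shifting the mean of the noise is absorbed by the intercept, so the slope estimate is driven by the noise variance, not its mean:

    # Sketch: the slope estimate as a function of the noise mean
    set.seed(10)
    n  <- 200
    x1 <- rnorm(n, mean = 10, sd = 1)
    y  <- 10 + 5 * x1 + rnorm(n, sd = 2)

    mus      <- seq(0, 100, by = 1)
    beta_hat <- sapply(mus, function(mu) {
      x1_noise <- x1 + rnorm(n, mean = mu, sd = 2)
      coef(lm(y ~ x1_noise))[2]
    })

    plot(mus, beta_hat, xlab = "Expectation of noise", ylab = "beta")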
