Simple Linear Regression
  1. Simple Linear Regression • Suppose we observe bivariate data (X, Y), but we do not know the regression function E(Y | X = x). In many cases it is reasonable to assume that the function is linear: E(Y | X = x) = α + βx. In addition, we assume that the distribution is homoscedastic, so that σ(Y | X = x) = σ. We have reduced the problem to three unknowns (parameters): α, β, and σ. Now we need a way to estimate these unknowns from the data.

  2. • For fixed values of α and β (not necessarily the true values), let rᵢ = Yᵢ − α − βXᵢ (rᵢ is called the residual at Xᵢ). Note that rᵢ is the vertical distance from Yᵢ to the line α + βx. This is illustrated in the following figure:

     [Figure: a bivariate data set with E(Y | X = x) = 3 + 2X, where the line Y = 2.5 + 1.5X is shown in blue. The residuals are the green vertical line segments.]

  3. • One approach to estimating the unknowns α and β is to consider the sum of squared residuals function, or SSR. The SSR is the function Σᵢ rᵢ² = Σᵢ (Yᵢ − α − βXᵢ)². When α and β are chosen so the fit to the data is good, the SSR will be small. If α and β are chosen so the fit to the data is poor, the SSR will be large.

     [Figure: left, a poor choice of α and β that gives high SSR; right, α and β that give nearly the smallest possible SSR.]
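
The slides contain no code, but the SSR is easy to compute directly. A minimal NumPy sketch; the true line 3 + 2x and the candidate line 2.5 + 1.5x are taken from the figures above, everything else (seed, sample size, noise level) is illustrative:

```python
import numpy as np

def ssr(alpha, beta, x, y):
    """Sum of squared residuals for the candidate line y = alpha + beta*x."""
    r = y - alpha - beta * x      # residuals: vertical distances to the line
    return np.sum(r ** 2)

# Synthetic data from the line E(Y | X = x) = 3 + 2x
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 3 + 2 * x + rng.normal(scale=1.0, size=50)

print(ssr(2.5, 1.5, x, y))   # an off-target choice of (alpha, beta): larger SSR
print(ssr(3.0, 2.0, x, y))   # near the true line: noticeably smaller SSR
```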

  4. • It is a fact that among all possible α and β, the following values minimize the SSR:

     β̂ = cov(X, Y) / var(X)
     α̂ = Ȳ − β̂X̄.

     These are called the least squares estimates of α and β. The estimated regression function is Ê(Y | X = x) = α̂ + β̂x, and the fitted values are Ŷᵢ = α̂ + β̂Xᵢ.
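
A sketch of these formulas in NumPy, on synthetic data; np.polyfit is used only as an independent cross-check:

```python
import numpy as np

def least_squares(x, y):
    """beta_hat = cov(X, Y) / var(X); alpha_hat = Ybar - beta_hat * Xbar."""
    beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    alpha_hat = y.mean() - beta_hat * x.mean()
    return alpha_hat, beta_hat

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3 + 2 * x + rng.normal(size=100)

alpha_hat, beta_hat = least_squares(x, y)
print(alpha_hat, beta_hat)     # close to the true values (3, 2)
print(np.polyfit(x, y, 1))     # same fit, reported as [beta_hat, alpha_hat]
```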

  5. • Some properties of the least squares estimates:

     1. β̂ = cor(X, Y) σ̂_Y / σ̂_X, so β̂ and cor(X, Y) always have the same sign: if the data are positively correlated, the estimated slope is positive, and if the data are negatively correlated, the estimated slope is negative.

     2. The fitted line α̂ + β̂x always passes through the overall mean (X̄, Ȳ).

     3. Since cov(cX, Y) = c · cov(X, Y) and var(cX) = c² · var(X), if we scale the X values by c then the slope is scaled by 1/c. If we scale the Y values by c then the slope is scaled by c (see the numeric check below).
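
The scaling property (item 3) can be verified numerically. A small sketch with an arbitrary scale factor c = 10:

```python
import numpy as np

def slope(x, y):
    # Least squares slope: cov(X, Y) / var(X)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1 - 2 * x + rng.normal(size=200)

c = 10.0
print(slope(x, y))        # approx -2
print(slope(c * x, y))    # slope scaled by 1/c: approx -0.2
print(slope(x, c * y))    # slope scaled by c: approx -20
```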

  6. • Once we have α̂ and β̂, we can compute the residuals rᵢ based on these estimates, i.e. rᵢ = Yᵢ − α̂ − β̂Xᵢ. The following is used to estimate σ:

     σ̂ = √( Σᵢ rᵢ² / (n − 2) ).
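
A sketch of this estimate; note the divisor n − 2 rather than n (two degrees of freedom are used up by α̂ and β̂):

```python
import numpy as np

def sigma_hat(x, y):
    """Estimate the error SD from the least squares residuals."""
    beta_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    alpha_hat = y.mean() - beta_hat * x.mean()
    r = y - alpha_hat - beta_hat * x
    return np.sqrt(np.sum(r ** 2) / (len(x) - 2))   # divide by n - 2, not n

rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = 1 - 2 * x + rng.normal(scale=2.0, size=500)
print(sigma_hat(x, y))   # approx 2, the true error SD
```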

  7. • It is also possible to formulate this problem in terms of a model, which is a complete description of the distribution that generated the data. The model for linear regression is written

     Yᵢ = α + βXᵢ + εᵢ,

     where α and β are the population regression coefficients, and the εᵢ are iid random variables with mean 0 and standard deviation σ. The εᵢ are called errors.
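
Drawing one sample from this model is straightforward. A sketch, with parameter values borrowed from the simulations on the later slides:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, sigma, n = 1.0, -2.0, 2.0, 200

x = rng.normal(scale=1.2, size=n)               # the X values
eps = rng.normal(loc=0.0, scale=sigma, size=n)  # iid errors, mean 0, SD sigma
y = alpha + beta * x + eps                      # Y_i = alpha + beta*X_i + eps_i
```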

  8. • Model assumptions:

     1. The means all fall on the line α + βX.
     2. The εᵢ are iid (no heteroscedasticity).
     3. The εᵢ have a normal distribution.

     Assumption 3 is not always necessary. The least squares estimates α̂ and β̂ are still valid when the εᵢ are not normal (as long as assumptions 1 and 2 are met). However, hypothesis tests, CIs, and PIs (derived below) depend on normality of the εᵢ.

  9. • Since α̂ and β̂ are functions of the data, which are random, they are random variables, and hence they have a distribution. This distribution reflects the sampling variation that causes α̂ and β̂ to differ somewhat from the population values α and β. The sampling variation is less if the sample size n is large, and if the error standard deviation σ is small. The sampling variation of β̂ is less if the Xᵢ values are more variable. We will derive formulas later. For now, we can look at histograms.

  10. [Figure: sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 200, and σ_X ≈ 1.2.]

  11. [Figure: sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 1/2, the sample size is n = 200, and σ_X ≈ 1.2.]

  12. [Figure: sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 50, and σ_X ≈ 1.2.]

  13. [Figure: sampling variation of α̂ (left) and β̂ (right) for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 50, and σ_X ≈ 2.2.]

  14. [Figure: sampling variation of σ̂ for 1000 replicates of the simple linear model Y = 1 − 2X + ε, where SD(ε) = 2, the sample size is n = 50 (left) and n = 200 (right), and σ_X ≈ 1.2.]
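
A sketch of the kind of simulation behind these histograms (plotting omitted; rerun with the n, SD(ε), and σ_X values from each slide to reproduce the corresponding panels):

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta, sigma_eps, sigma_x, n, reps = 1.0, -2.0, 2.0, 1.2, 200, 1000

alpha_hats = np.empty(reps)
beta_hats = np.empty(reps)
for k in range(reps):
    x = rng.normal(scale=sigma_x, size=n)
    y = alpha + beta * x + rng.normal(scale=sigma_eps, size=n)
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    beta_hats[k] = b
    alpha_hats[k] = y.mean() - b * x.mean()

# The spread of these replicates shrinks with larger n, smaller SD(eps),
# or more variable X, matching the histograms.
print(alpha_hats.std(), beta_hats.std())
```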

  15. Sampling properties of the least squares estimates • The following is an identity for the sample covariance:

     cov(X, Y) = (1/(n − 1)) Σᵢ (Yᵢ − Ȳ)(Xᵢ − X̄)
               = (1/(n − 1)) Σᵢ YᵢXᵢ − (n/(n − 1)) Ȳ X̄.

     The average of the products minus the product of the averages (almost).

  16. • A similar identity for the sample variance is

     var(Y) = (1/(n − 1)) Σᵢ (Yᵢ − Ȳ)²
            = (1/(n − 1)) Σᵢ Yᵢ² − (n/(n − 1)) Ȳ².

     The average of the squares minus the square of the average (almost).
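
Both identities are quick to confirm numerically. A sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=30)
y = rng.normal(size=30)
n = len(x)

# Covariance identity: sum(Y_i X_i)/(n-1) - (n/(n-1)) * Ybar * Xbar
print(np.isclose(np.cov(x, y, ddof=1)[0, 1],
                 np.sum(y * x) / (n - 1) - n / (n - 1) * y.mean() * x.mean()))

# Variance identity: sum(Y_i^2)/(n-1) - (n/(n-1)) * Ybar^2
print(np.isclose(np.var(y, ddof=1),
                 np.sum(y ** 2) / (n - 1) - n / (n - 1) * y.mean() ** 2))
```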

  17. • An identity for the regression model Yᵢ = α + βXᵢ + εᵢ: averaging both sides over i,

     (1/n) Σᵢ Yᵢ = (1/n) Σᵢ (α + βXᵢ + εᵢ)
     Ȳ = α + βX̄ + ε̄.

  18. • Let’s get the mean and variance of β̂. An equivalent way to write the least squares slope estimate is

     β̂ = (Σᵢ YᵢXᵢ − n Ȳ X̄) / (Σᵢ Xᵢ² − n X̄²).

     Now if we substitute Yᵢ = α + βXᵢ + εᵢ into the above we get

     β̂ = (Σᵢ (α + βXᵢ + εᵢ)Xᵢ − n(α + βX̄ + ε̄)X̄) / (Σᵢ Xᵢ² − n X̄²).

  19. Since

     Σᵢ (α + βXᵢ + εᵢ)Xᵢ = α Σᵢ Xᵢ + β Σᵢ Xᵢ² + Σᵢ εᵢXᵢ
                         = nα X̄ + β Σᵢ Xᵢ² + Σᵢ εᵢXᵢ,

     we can simplify the expression for β̂ to get

     β̂ = (β Σᵢ Xᵢ² − nβ X̄² + Σᵢ εᵢXᵢ − n ε̄ X̄) / (Σᵢ Xᵢ² − n X̄²),

     and further to

     β̂ = β + (Σᵢ εᵢXᵢ − n ε̄ X̄) / (Σᵢ Xᵢ² − n X̄²).

  20. To apply this result: by the assumption of the linear model, Eεᵢ = Eε̄ = 0, so E cov(X, ε) = 0, and we can conclude that Eβ̂ = β. This means that β̂ is an unbiased estimate of β: it is correct on average. If we observe an independent SRS every day for 1000 days from the same linear model, and we calculate β̂ᵢ each day for i = 1, ..., 1000, the daily β̂ᵢ may differ from the population β due to sampling variation, but the average Σᵢ β̂ᵢ / 1000 will be extremely close to β (see the simulation sketch below).
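
The 1000-day thought experiment is easy to run. A sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta, n, days = 1.0, -2.0, 100, 1000

beta_hats = np.empty(days)
for i in range(days):
    x = rng.normal(size=n)                               # a fresh SRS each day
    y = alpha + beta * x + rng.normal(scale=2.0, size=n)
    beta_hats[i] = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(beta_hats.mean())   # extremely close to beta = -2
```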

  21. • Now that we know Eβ̂ = β, the corresponding analysis for α̂ is straightforward. Since

     α̂ = Ȳ − β̂X̄,

     then

     Eα̂ = EȲ − βX̄,

     and since Ȳ = α + βX̄ + ε̄, EȲ = α + βX̄, thus

     Eα̂ = α + βX̄ − βX̄ = α,

     so α̂ is also unbiased.

  22. • Next we would like to calculate the standard deviation of β̂, which will allow us to produce a CI for β. Beginning with

     β̂ = β + (Σᵢ εᵢXᵢ − n ε̄ X̄) / (Σᵢ Xᵢ² − n X̄²)

     and applying the identity var(U − V) = var(U) + var(V) − 2 cov(U, V):

     var(β̂) = [var(Σᵢ εᵢXᵢ) + var(n ε̄ X̄) − 2 cov(Σᵢ εᵢXᵢ, n ε̄ X̄)] / (Σᵢ Xᵢ² − n X̄²)².

     Simplifying,

     var(β̂) = [Σᵢ Xᵢ² var(εᵢ) + n² X̄² var(ε̄) − 2n X̄ Σᵢ Xᵢ cov(εᵢ, ε̄)] / (Σᵢ Xᵢ² − n X̄²)².

  23. Next, using var(εᵢ) = σ², var(ε̄) = σ²/n, and

     cov(εᵢ, ε̄) = Σⱼ cov(εᵢ, εⱼ)/n = σ²/n,

     we get

     var(β̂) = [σ² Σᵢ Xᵢ² + nσ² X̄² − 2n X̄ Σᵢ Xᵢ σ²/n] / (Σᵢ Xᵢ² − n X̄²)²
             = [σ² Σᵢ Xᵢ² + nσ² X̄² − 2n X̄² σ²] / (Σᵢ Xᵢ² − n X̄²)².

  24. Almost done:

     var(β̂) = σ² (Σᵢ Xᵢ² − n X̄²) / (Σᵢ Xᵢ² − n X̄²)²
             = σ² / (Σᵢ Xᵢ² − n X̄²)
             = σ² / ((n − 1) var(X)),

     and

     sd(β̂) = σ / (√(n − 1) σ̂_X).
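
A sketch comparing the empirical SD of β̂ across replicates with this formula; the population value σ_X = 1.2 is used in place of the per-sample σ̂_X, so the match is approximate:

```python
import numpy as np

rng = np.random.default_rng(8)
beta, sigma, sigma_x, n, reps = -2.0, 2.0, 1.2, 50, 5000

beta_hats = np.empty(reps)
for k in range(reps):
    x = rng.normal(scale=sigma_x, size=n)
    y = 1.0 + beta * x + rng.normal(scale=sigma, size=n)
    beta_hats[k] = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(beta_hats.std())                     # empirical SD of beta_hat
print(sigma / (np.sqrt(n - 1) * sigma_x))  # formula value: a close match
```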

  25. • The slope SD formula is consistent with the three factors that influenced the precision of β̂ in the histograms:

     1. greater sample size reduces the SD
     2. greater σ² increases the SD
     3. greater X variability (σ̂_X) reduces the SD.

  26. • A similar analysis for α̂ yields

     var(α̂) = σ² (Σᵢ Xᵢ²/n) / ((n − 1) var(X)).

     Thus var(α̂) = var(β̂) · Σᵢ Xᵢ²/n. Due to the Σᵢ Xᵢ²/n term, the estimate will be more precise when the Xᵢ values are close to zero. Since α̂ is the intercept, it is easier to estimate when the data are close to the origin (a numeric check follows).
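
A matching check for the intercept formula; the X design is held fixed across replicates, since the formula treats the Xᵢ as fixed:

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, beta, sigma, n, reps = 1.0, -2.0, 2.0, 50, 5000
x = rng.normal(scale=1.2, size=n)     # one fixed design, reused every replicate

alpha_hats = np.empty(reps)
for k in range(reps):
    y = alpha + beta * x + rng.normal(scale=sigma, size=n)
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    alpha_hats[k] = y.mean() - b * x.mean()

print(alpha_hats.var())                                              # empirical
print(sigma ** 2 * np.mean(x ** 2) / ((n - 1) * np.var(x, ddof=1)))  # formula
```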
