
8.4.3 Linear Regression, Prof. Tesler, Math 283, Fall 2019 - PowerPoint PPT Presentation



  1. 8.4.3 Linear Regression. Prof. Tesler, Math 283, Fall 2019.

  2. Regression. Given n points (x₁, y₁), (x₂, y₂), ..., we want to determine a function y = f(x) that is close to them.
     [Figure: scatter plot of the data (x, y).]

  3. Regression. Based on knowledge of the underlying problem or on plotting the data, you have an idea of the general form of the function, such as:
     Polynomial: y = β₀ + β₁x + β₂x² + β₃x³
     Line: y = β₀ + β₁x
     Exponential decay: y = A e^(−Bx)
     Logistic curve: y = A / (1 + B/C^x)
     [Figures: example data with a fitted curve for each of the four forms.]
     Goal: Compute the parameters (β₀, β₁, ... or A, B, C, ...) that give a "best fit" to the data in some sense (least squares or MLEs).

  4. Regression. The methods we consider require the parameters to occur linearly. It is fine if (x, y) do not occur linearly.
     E.g., plugging (x, y) = (2, 3) into y = β₀ + β₁x + β₂x² + β₃x³ gives 3 = β₀ + 2β₁ + 4β₂ + 8β₃.
     For exponential decay, y = A e^(−Bx), parameter B does not occur linearly. Transform the equation to ln y = ln(A) − Bx = A′ − Bx. When we plug in (x, y) values, the parameters A′, B occur linearly.
     Transform the logistic curve y = A / (1 + B/C^x) to ln(A/y − 1) = ln(B) − x ln(C) = B′ + C′x, where A is determined from A = lim_{x→∞} y(x). Now B′, C′ occur linearly.
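A minimal sketch of the linearization idea in slide 4, in Python with NumPy. The data, the noise level, and the values A_true and B_true are made up for illustration; the point is only that after taking logs, the exponential-decay parameters can be estimated with an ordinary linear fit.

    import numpy as np

    # Synthetic exponential-decay data y = A*exp(-B*x) with multiplicative noise
    # (multiplicative noise keeps y > 0 so that log(y) is defined).
    rng = np.random.default_rng(0)
    A_true, B_true = 10.0, 0.3
    x = np.linspace(0, 20, 30)
    y = A_true * np.exp(-B_true * x) * np.exp(rng.normal(0, 0.05, x.size))

    # Transform: ln y = ln A - B*x, which is linear in the parameters (ln A, B).
    slope, intercept = np.polyfit(x, np.log(y), deg=1)
    B_hat = -slope
    A_hat = np.exp(intercept)
    print(A_hat, B_hat)   # should be close to 10.0 and 0.3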

  5. Least squares fit to a line. Given n points (x₁, y₁), (x₂, y₂), ..., we will fit them to a line ŷ = β₀ + β₁x.
     [Figure: scatter plot of the data with a fitted line.]
     Independent variable: x. We assume the x's are known exactly or have negligible measurement errors.
     Dependent variable: y. We assume the y's depend on the x's but fluctuate due to a random process. We do not have y = f(x), but instead y = f(x) + error.

  6. Least squares fit to a line. [Figure: scatter plot of the data with the fitted line and residuals.]
     Predicted y value (on the line): ŷᵢ = β₀ + β₁xᵢ
     Actual data (•): yᵢ = β₀ + β₁xᵢ + εᵢ
     Residual (actual y minus prediction): εᵢ = yᵢ − ŷᵢ = yᵢ − (β₀ + β₁xᵢ)

  7. Least squares fit to a line. [Figure: scatter plot of the data with the fitted line.]
     We will use the least squares method: pick parameters β₀, β₁ that minimize the sum of squares of the residuals,
     L = Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²

  8. Least squares fit to a line. L = Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))²
     To find β₀, β₁ that minimize this, solve ∇L = (∂L/∂β₀, ∂L/∂β₁) = (0, 0):
     ∂L/∂β₀ = −2 Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ)) = 0   ⟹   nβ₀ + β₁(Σᵢ xᵢ) = Σᵢ yᵢ
     ∂L/∂β₁ = −2 Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))xᵢ = 0   ⟹   β₀(Σᵢ xᵢ) + β₁(Σᵢ xᵢ²) = Σᵢ xᵢyᵢ
     which has solution (all sums are i = 1 to n)
     β₁ = [n(Σᵢ xᵢyᵢ) − (Σᵢ xᵢ)(Σᵢ yᵢ)] / [n(Σᵢ xᵢ²) − (Σᵢ xᵢ)²] = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)²
     β₀ = ȳ − β₁x̄
     Not shown: use 2nd derivatives to confirm it's a minimum rather than a maximum or saddle point.
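The closed-form solution on slide 8 translates directly into code. A minimal sketch in Python/NumPy, using made-up (x, y) data rather than the data in the slides; np.polyfit is included only as a cross-check.

    import numpy as np

    # Made-up data for illustration (not the data plotted in the slides).
    x = np.array([-15.0, -8.0, -3.0, 0.0, 4.0, 9.0, 13.0, 18.0, 24.0, 29.0])
    y = np.array([12.0, 15.0, 21.0, 24.0, 27.0, 30.0, 33.0, 38.0, 41.0, 47.0])
    n = len(x)

    # Slope and intercept from the normal equations.
    beta1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    beta0 = np.mean(y) - beta1 * np.mean(x)

    # Equivalent centered form of the slope.
    beta1_centered = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

    print(beta0, beta1, beta1_centered)
    print(np.polyfit(x, y, deg=1))   # [slope, intercept] from NumPy, for comparison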

  9. Best fitting line. [Figures: the data fit two ways. Left: y = β₀ + β₁x + ε, giving y = 24.9494 + 0.6180x (slope 0.6180). Right: x = α₀ + α₁y + ε, giving x = −28.2067 + 1.1501y (slope 0.8695 when drawn in the xy-plane).]
     Fitting y = β₀ + β₁x + error and fitting x = α₀ + α₁y + error give different lines!
     y = β₀ + β₁x + error assumes the x's are known exactly with no errors, while the y's have errors. x = α₀ + α₁y + error is the other way around.
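A quick numerical illustration of slide 9's point, with made-up data: regressing y on x and regressing x on y generally give different lines, even after converting the second fit back to slope/intercept form in the xy-plane.

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-20, 30, 10)
    y = 25 + 0.6 * x + rng.normal(0, 5, x.size)   # made-up data

    b1, b0 = np.polyfit(x, y, 1)   # fit y = b0 + b1*x
    a1, a0 = np.polyfit(y, x, 1)   # fit x = a0 + a1*y

    # Rewrite x = a0 + a1*y as a line in the xy-plane: y = -a0/a1 + (1/a1)*x
    print("y on x: slope", b1)
    print("x on y: slope in the xy-plane", 1 / a1)   # generally differs from b1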

  10. Total Least Squares / Principal Components Analysis. [Figures: the same data fit four ways. Top left: y = β₀ + β₁x + ε, giving y = 24.9494 + 0.6180x (slope 0.6180). Top right: x = α₀ + α₁y + ε, giving x = −28.2067 + 1.1501y (slope 0.8695). Bottom left: first principal component of the centered data, slope = 0.6934274, through (x̄, ȳ) = (1.685727, 25.99114). Bottom right: all three lines together.]
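A minimal sketch of the total-least-squares / first-principal-component fit from slide 10, using the SVD of the centered data matrix. The data is made up; the recovered line passes through (x̄, ȳ) and minimizes squared orthogonal distances.

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(-20, 30, 10)
    y = 25 + 0.6 * x + rng.normal(0, 5, x.size)   # made-up data

    X = np.column_stack([x, y])
    Xc = X - X.mean(axis=0)                  # center the data; the TLS line goes through (x̄, ȳ)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    direction = Vt[0]                        # first principal direction (largest singular value)
    slope = direction[1] / direction[0]      # slope of the PCA line in the xy-plane

    print("PCA slope:", slope, "through point:", X.mean(axis=0))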

  11. Least squares vs. PCA.
     Errors in data: Least squares (y = β₀ + β₁x + error) assumes the x's have no errors while the y's have errors. PCA assumes all coordinates have errors.
     For (xᵢ, yᵢ) data, we minimize the sum of: squared vertical distances from the points to the line (least squares); squared orthogonal distances from the points to the line (PCA). Due to centering the data, the lines all go through (x̄, ȳ). For multivariate data, lines are replaced by planes, etc.
     Different units/scaling on inputs (x) and outputs (y): Least squares gives equivalent solutions if you change units or scaling, while PCA is sensitive to these changes. Example: (a) x in seconds, y in cm vs. (b) x in seconds, y in mm give equivalent results for least squares, but inequivalent results for PCA. For PCA, a workaround is to convert the coordinates to Z-scores.

  12. Distribution of values at each x. [Figures: (a) homoscedastic data, (b) heteroscedastic data.]
     On repeated trials, at each x we get a distribution of values of y rather than a single value.
     In (a), the error term is a normal distribution with the same variance for every x. This is the case we will study: assume the errors are independent of x and have a normal distribution with mean 0 and SD σ.
     In (b), the variance changes for different values of x. Use a generalization called Weighted Least Squares.
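For the heteroscedastic case (b), the slides point to weighted least squares. A minimal sketch under the usual assumption that each point is weighted by its inverse variance 1/σᵢ², with made-up data and the per-point σᵢ taken as known.

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 50)
    sigma = 0.5 + 0.5 * x                    # noise SD grows with x (heteroscedastic)
    y = 5 + 3 * x + rng.normal(0, sigma)     # made-up data

    w = 1.0 / sigma**2                       # weights = inverse variances
    # Weighted normal equations for minimizing sum_i w_i*(y_i - b0 - b1*x_i)^2
    Sw, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    b1 = (Sw * Sxy - Sx * Sy) / (Sw * Sxx - Sx**2)
    b0 = (Sy - b1 * Sx) / Sw
    print(b0, b1)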

  13. Maximum Likelihood Estimate for best fitting line. The method of least squares uses a geometrical perspective. Now we'll assume the data has certain statistical properties.
     Simple linear model: Y = β₀ + β₁x + E
     Assume the x's are known (so lowercase) and E is Gaussian with mean 0 and standard deviation σ, making E, Y random variables.
     At each x, there is a distribution of possible y's, giving a conditional distribution f_{Y|X=x}(y). Assume the conditional distributions for different x's are independent.
     The means of these conditional distributions form a line y = E(Y | X = x) = β₀ + β₁x.
     Denote the MLE values by β̂₀, β̂₁, σ̂² to distinguish them from the true (hidden) values.

  14. Maximum Likelihood Estimate for best fitting line. Given data (x₁, y₁), ..., (xₙ, yₙ), we have yᵢ = β₀ + β₁xᵢ + εᵢ, where εᵢ = yᵢ − (β₀ + β₁xᵢ) has a normal distribution with mean 0 and standard deviation σ.
     The likelihood of the data is the product of the pdf of the normal distribution at εᵢ over all i:
     L = (1 / (σ√(2π))ⁿ) exp( −Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))² / (2σ²) )
     Finding β₀, β₁ that maximize L (or log L) is equivalent to minimizing Σᵢ₌₁ⁿ (yᵢ − (β₀ + β₁xᵢ))², so we get the same answer as using least squares!
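Slide 14's equivalence can be checked numerically: maximizing the normal log-likelihood over (β₀, β₁, σ) returns the same β₀, β₁ as the least-squares formulas. A minimal sketch with made-up data, assuming SciPy is available.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    x = np.linspace(-20, 30, 10)
    y = 25 + 0.6 * x + rng.normal(0, 5, x.size)   # made-up data

    def neg_log_likelihood(params):
        b0, b1, log_sigma = params                # optimize log(σ) so σ stays positive
        sigma = np.exp(log_sigma)
        return -np.sum(norm.logpdf(y, loc=b0 + b1 * x, scale=sigma))

    mle = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0])
    b1_ls, b0_ls = np.polyfit(x, y, 1)
    print("MLE      :", mle.x[:2])                # [β0, β1]
    print("Least sq.:", [b0_ls, b1_ls])           # should agree up to optimizer tolerance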
