SLIDE 1

8.4.3 Linear Regression

  • Prof. Tesler

Math 283 Fall 2019

SLIDE 2

Regression

Given n points (x1, y1), (x2, y2), . . . , we want to determine a function y = f(x) that is close to them.

[Figure: scatter plot of data (x, y)]

SLIDE 3

Regression

Based on knowledge of the underlying problem or on plotting the data, you have an idea of the general form of the function, such as:

  • Line: y = β0 + β1x
  • Polynomial: y = β0 + β1x + β2x^2 + β3x^3
  • Exponential decay: y = A e^(−Bx)
  • Logistic curve: y = A/(1 + B/C^x)

[Figure: example scatter plots with each of the four types of curve fit to them]

Goal: Compute the parameters (β0, β1, . . . or A, B, C, . . .) that give a “best fit” to the data in some sense (least squares or MLEs).

SLIDE 4

Regression

The methods we consider require the parameters to occur linearly. It is fine if (x, y) do not occur linearly. E.g., plugging (x, y) = (2, 3) into y = β0 + β1x + β2x^2 + β3x^3 gives 3 = β0 + 2β1 + 4β2 + 8β3.

For exponential decay, y = A e^(−Bx), parameter B does not occur linearly. Transform the equation to:

ln y = ln(A) − Bx = A′ − Bx

When we plug in (x, y) values, the parameters A′, B occur linearly.

Transform the logistic curve y = A/(1 + B/C^x) to:

ln(A/y − 1) = ln(B) − x ln(C) = B′ + C′x

where A is determined from A = lim_{x→∞} y(x). Now B′, C′ occur linearly.
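
A minimal R sketch of the exponential-decay linearization, on simulated data (the values of A and B below are made up for illustration):

# Simulated exponential decay y = A*exp(-B*x) with small multiplicative noise
set.seed(1)
x <- seq(0, 10, by = 0.5)
y <- 5 * exp(-0.3 * x) * exp(rnorm(length(x), sd = 0.05))

# Transform: ln y = ln(A) - B*x, so A' = ln(A) and B occur linearly
fit <- lm(log(y) ~ x)
A <- exp(coef(fit)[1])      # back-transform the intercept A' to recover A
B <- -coef(fit)[2]          # B is minus the slope
c(A = unname(A), B = unname(B))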

SLIDE 5

Least squares fit to a line

[Figure: scatter plot of data (x, y)]

Given n points (x1, y1), (x2, y2), . . . , we will fit them to a line ŷ = β0 + β1x.

  • Independent variable: x. We assume the x's are known exactly or have negligible measurement errors.
  • Dependent variable: y. We assume the y's depend on the x's but fluctuate due to a random process. We do not have y = f(x), but instead y = f(x) + error.

SLIDE 6

Least squares fit to a line

[Figure: scatter plot of data (x, y) with the fitted line]

Given n points (x1, y1), (x2, y2), . . . , we will fit them to a line ŷ = β0 + β1x.

  • Predicted y value (on the line): ŷi = β0 + β1xi
  • Actual data (•): yi = β0 + β1xi + εi
  • Residual (actual y minus prediction): εi = yi − ŷi = yi − (β0 + β1xi)

SLIDE 7

Least squares fit to a line

[Figure: scatter plot of data (x, y) with the fitted line]

We will use the least squares method: pick parameters β0, β1 that minimize the sum of squares of the residuals:

L = ∑_{i=1}^n (yi − (β0 + β1xi))^2

SLIDE 8

Least squares fit to a line

L = ∑_{i=1}^n (yi − (β0 + β1xi))^2

To find β0, β1 that minimize this, solve ∇L = (∂L/∂β0, ∂L/∂β1) = (0, 0):

∂L/∂β0 = −2 ∑_{i=1}^n (yi − (β0 + β1xi)) = 0   ⇒   n β0 + (∑_{i=1}^n xi) β1 = ∑_{i=1}^n yi
∂L/∂β1 = −2 ∑_{i=1}^n (yi − (β0 + β1xi)) xi = 0   ⇒   (∑_{i=1}^n xi) β0 + (∑_{i=1}^n xi^2) β1 = ∑_{i=1}^n xi yi

which has solution (all sums are i = 1 to n)

β1 = [n (∑_i xi yi) − (∑_i xi)(∑_i yi)] / [n (∑_i xi^2) − (∑_i xi)^2] = ∑_i (xi − x̄)(yi − ȳ) / ∑_i (xi − x̄)^2
β0 = ȳ − β1 x̄

Not shown: use 2nd derivatives to confirm it's a minimum rather than a maximum or saddle point.
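
A short R check of these formulas on simulated data, computing β1 and β0 directly from the sums and comparing with lm():

set.seed(2)
x <- runif(30, -20, 30)
y <- 25 + 0.6 * x + rnorm(30, sd = 5)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))      # intercept and slope agree with b0, b1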

SLIDE 9

Best fitting line

[Figure: two fits to the same data.
Left: y = β0 + β1x + ε, giving y = 24.9494 + 0.6180x (slope 0.6180).
Right: x = α0 + α1y + ε, giving x = −28.2067 + 1.1501y (slope 0.8695 when drawn in the (x, y) plane).]

The best fits for y = β0 + β1x + error and for x = α0 + α1y + error give different lines!

y = β0 + β1x + error assumes the x's are known exactly with no errors, while the y's have errors. x = α0 + α1y + error is the other way around.
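
A small R sketch of this on simulated data (the numbers here are illustrative, not the slide's data):

set.seed(3)
x <- rnorm(50, mean = 5, sd = 10)
y <- 25 + 0.6 * x + rnorm(50, sd = 8)

coef(lm(y ~ x))[2]       # slope of the y = b0 + b1*x fit
1 / coef(lm(x ~ y))[2]   # slope, in the (x, y) plane, of the x = a0 + a1*y fit
# The two slopes differ unless the points lie exactly on one line.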

SLIDE 10

Total Least Squares / Principal Components Analysis

[Figure: four panels on the same data.
(1) y = β0 + β1x + ε: y = 24.9494 + 0.6180x (slope 0.6180).
(2) x = α0 + α1y + ε: x = −28.2067 + 1.1501y (slope 0.8695).
(3) First principal component of centered data: slope 0.6934274.
(4) All three lines together, crossing at (x̄, ȳ) = (1.685727, 25.99114).]

SLIDE 11

Least squares vs. PCA

Errors in data:
  • Least squares: y = β0 + β1x + error assumes x's have no errors while y's have errors.
  • PCA: assumes all coordinates have errors.

For (xi, yi) data, we minimize the sum of . . .
  • Least squares: squared vertical distances from points to the line.
  • PCA: squared orthogonal distances from points to the line.

Due to centering data, the lines all go through (x̄, ȳ). For multivariate data, lines are replaced by planes, etc.

Different units/scaling on inputs (x) and outputs (y):
  • Least squares gives equivalent solutions if you change units or scaling, while PCA is sensitive to changes in these.
  • Example: (a) x in seconds, y in cm vs. (b) x in seconds, y in mm give equivalent results for least squares, but inequivalent for PCA. For PCA, a workaround is to convert coordinates to Z-scores.
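
A minimal R comparison of the two fits on simulated data; prcomp() gives the principal components, and the first one defines the PCA (total least squares) line:

set.seed(4)
x <- rnorm(100, sd = 10)
y <- 25 + 0.7 * x + rnorm(100, sd = 6)

ls_slope <- coef(lm(y ~ x))[2]                       # vertical-distance fit

pc <- prcomp(cbind(x, y), center = TRUE, scale. = FALSE)
pca_slope <- pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]   # orthogonal-distance fit

c(least_squares = unname(ls_slope), pca = pca_slope)
# Both lines pass through (mean(x), mean(y)).  Rescaling y (e.g. cm to mm) leaves the
# least squares fit equivalent but changes the PCA direction; scale. = TRUE (Z-scores)
# is the usual workaround.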

SLIDE 12

Distribution of values at each x

(a) Homoscedastic  (b) Heteroscedastic

[Figure: scatter plots of y vs. x for cases (a) and (b)]

On repeated trials, at each x we get a distribution of values of y rather than a single value. In (a), the error term is a normal distribution with the same variance for every x. This is the case we will study. Assume the errors are independent of x and have a normal distribution with mean 0, SD σ. In (b), the variance changes for different values of x. Use a generalization called Weighted Least Squares.
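
A sketch of case (b) in R with hypothetical heteroscedastic data, fit by weighted least squares via the weights argument of lm(); the weight 1/x^2 assumes the error SD grows proportionally to x:

set.seed(5)
x <- seq(1, 10, length.out = 80)
y <- 5 + 6 * x + rnorm(80, sd = 2 * x)      # error SD grows with x

fit_ols <- lm(y ~ x)                        # ordinary least squares
fit_wls <- lm(y ~ x, weights = 1 / x^2)     # weight proportional to 1/variance
rbind(ols = coef(fit_ols), wls = coef(fit_wls))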

SLIDE 13

Maximum Likelihood Estimate for best fitting line

The method of least squares uses a geometrical perspective. Now we'll assume the data has certain statistical properties.

  • Simple linear model: Y = β0 + β1x + E
  • Assume the x's are known (so lowercase) and E is Gaussian with mean 0 and standard deviation σ, making E, Y random variables.
  • At each x, there is a distribution of possible y's, giving a conditional distribution fY|X=x(y). Assume conditional distributions for different x's are independent.
  • The means of these conditional distributions form a line y = E(Y|X = x) = β0 + β1x.
  • Denote the MLE values by β̂0, β̂1, σ̂^2 to distinguish them from the true (hidden) values.

SLIDE 14

Maximum Likelihood Estimate for best fitting line

Given data (x1, y1), . . . , (xn, yn), we have yi = β0 + β1xi + εi, where εi = yi − (β0 + β1xi) has a normal distribution with mean 0 and standard deviation σ.

The likelihood of the data is the product of the pdf of the normal distribution at εi over all i:

L = (1 / (√(2π) σ)^n) · exp( −∑_{i=1}^n (yi − (β0 + β1xi))^2 / (2σ^2) )

Finding β0, β1 that maximize L (or log L) is equivalent to minimizing

∑_{i=1}^n (yi − (β0 + β1xi))^2

so we get the same answer as using least squares!
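
A small R sketch (with made-up data) verifying this numerically: minimizing the negative log-likelihood over (β0, β1) with optim() gives the same line as lm():

set.seed(6)
x <- runif(40, 0, 20)
y <- 3 + 2 * x + rnorm(40, sd = 4)

negloglik <- function(beta, sigma = 4) {     # sigma fixed; it doesn't affect the best beta
  mu <- beta[1] + beta[2] * x
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
mle <- optim(c(0, 0), negloglik)$par
rbind(mle = mle, least_squares = unname(coef(lm(y ~ x))))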

SLIDE 15

Confidence intervals

[Figure: sample data with the true line y = β0 + β1x + ε, the best fit line, and a 95% prediction interval; r^2 = 0.7683551]

The best fit line is different from the true line. We found point estimates of β0 and β1. Assuming errors are independent of x and normally distributed gives:

  • Confidence intervals for β0, β1.
  • A prediction interval to extrapolate y = f(x) at other x's. Warning: it may diverge from the true line when we go out too far.

Not shown: one can also do hypothesis tests on the values of β0 and β1, and on whether two samples give the same line.

SLIDE 16

Confidence intervals

The method of least squares gave point estimates of β0 and β1:

β̂1 = [n ∑_i xi yi − (∑_i xi)(∑_i yi)] / [n (∑_i xi^2) − (∑_i xi)^2] = ∑_i (xi − x̄)(yi − ȳ) / ∑_i (xi − x̄)^2
β̂0 = ȳ − β̂1 x̄

The sample variance of the residuals is

s^2 = (1/(n − 2)) ∑_{i=1}^n (yi − (β̂0 + β̂1 xi))^2   (with df = n − 2).

100(1 − α)% confidence intervals:

  • β0:  β̂0 ± t_{α/2, n−2} · s · √(∑_i xi^2) / √(n ∑_i (xi − x̄)^2)
  • β1:  β̂1 ± t_{α/2, n−2} · s / √(∑_i (xi − x̄)^2)
  • y at a new x:  (ŷ − w, ŷ + w) with ŷ = β̂0 + β̂1 x and w = t_{α/2, n−2} · s · √(1 + 1/n + (x − x̄)^2 / ∑_i (xi − x̄)^2)
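
A minimal R sketch (simulated data) producing these intervals with built-in functions: confint() for the β0, β1 intervals, and predict(..., interval = "prediction") for the interval at a new x:

set.seed(7)
x <- runif(25, 0, 25)
y <- 10 + 5 * x + rnorm(25, sd = 12)

fit <- lm(y ~ x)
confint(fit, level = 0.95)                  # CIs for beta0 (Intercept) and beta1 (x)
predict(fit, newdata = data.frame(x = 30),  # 95% prediction interval at x = 30
        interval = "prediction", level = 0.95)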

SLIDE 17

Correlation coefficient

Let X and Y be two random variables. Their correlation coefficient is

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

This is a normalized version of covariance, and is between ±1.

For a line Y = aX + b with a, b constants (a ≠ 0),

ρ(X, Y) = a Var(X) / √(Var(X) Var(aX)) = a σ^2 / (σ · |a|σ) = a/|a| = ±1   (sign of a)

ρ(X, Y) = ±1 iff Y = aX + b with a, b constants (a ≠ 0).
Closer to ±1: more linear. Closer to 0: less linear.
If X and Y are independent then ρ(X, Y) = 0. The converse is not valid: dependent variables can have ρ(X, Y) = 0.

SLIDE 18

Sample correlation coefficient r

ρ(X, Y) is estimated from data by the sample correlation coefficient (a.k.a. Pearson product-moment correlation coefficient):

r(x, y) = cov(x, y) / √(var(x) var(y)) = ∑_i (xi − x̄)(yi − ȳ) / √( ∑_i (xi − x̄)^2 · ∑_i (yi − ȳ)^2 )

People often report r^2 (between 0 and 1) instead of r.

The slopes of the least squares lines y = β1x + β0 + ε and x = α1y + α0 + ε′ are

β̂1 = ∑_i (xi − x̄)(yi − ȳ) / ∑_i (xi − x̄)^2      α̂1 = ∑_i (xi − x̄)(yi − ȳ) / ∑_i (yi − ȳ)^2

(the slope of the second line in normal orientation is 1/α̂1), so

r = ±√(α̂1 β̂1) = ±√( β̂1 / (1/α̂1) )   (with the same ± sign as the slopes)

is the square root of the ratio of the slopes of the lines. An aside: β̂1 = cov(x, y)/var(x).
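
A quick R check of these identities on simulated data: r from cor(), and r^2 recovered as the product of the two slopes β̂1 · α̂1:

set.seed(8)
x <- rnorm(60)
y <- 2 + 3 * x + rnorm(60, sd = 2)

r  <- cor(x, y)
b1 <- coef(lm(y ~ x))[2]     # slope of y on x
a1 <- coef(lm(x ~ y))[2]     # slope of x on y
c(r_squared = r^2, slope_product = unname(b1 * a1))   # these agree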

SLIDE 19

Sample correlation coefficient r

r2 is a biased estimator of ρ2. If the data comes from a bivariate normal distribution, then for large n, the estimate is good (asymptotically unbiased and efficient). See this Wikipedia article for more information on exceptions.

https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient#Sample_size

SLIDE 20

Sample correlation coefficient r

[Figure: grid of scatter plots labeled with their correlation coefficients (top row: 1, 0.8, 0.4, 0, −0.4, −0.8, −1; middle row: ±1 or undefined; bottom row: 0), from Wikipedia]
http://en.wikipedia.org/wiki/File:Correlation_examples2.svg
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Middle row: Perfect linear relation Y = aX + b gives
  • r = 1 for lines with positive slope (a > 0)
  • r = −1 for lines with negative slope (a < 0)
  • r undefined for a horizontal line (Y = b)
Other rows: coming up!

SLIDE 21

Interpretation of r2

Let ŷi = β̂1 xi + β̂0 be the predicted y-value for xi based on the least squares line. Write the deviation of yi from ȳ as

yi − ȳ = (yi − ŷi) + (ŷi − ȳ)
(Total deviation) = (Unexplained by line) + (Explained by line)

It can be shown that the sum of squared deviations for all y's is

∑_i (yi − ȳ)^2 = ∑_i (yi − ŷi)^2 + ∑_i (ŷi − ȳ)^2 + 2 ∑_i (yi − ŷi)(ŷi − ȳ)
(Total variation) = (Unexplained variation) + (Explained variation) + (cross term, = 0 by a miracle! Tedious algebra not shown)

and that

r^2 = ∑_i (ŷi − ȳ)^2 / ∑_i (yi − ȳ)^2 = Explained variation / Total variation

  • r = 1: 100% of the variation is explained by the line and 0% is due to other factors, and the slope is positive.
  • r = −.8: 64% of the variation is explained by the line and 36% is due to other factors, and the slope is negative.
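
A short R sketch (simulated data) verifying the decomposition and that the explained/total ratio equals r^2:

set.seed(9)
x <- runif(40, 0, 10)
y <- 1 + 4 * x + rnorm(40, sd = 5)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

total       <- sum((y - mean(y))^2)
unexplained <- sum((y - yhat)^2)
explained   <- sum((yhat - mean(y))^2)

c(total = total, sum_of_parts = unexplained + explained)   # equal: the cross term is 0
c(explained_over_total = explained / total, r_squared = cor(x, y)^2)   # equal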

SLIDE 22

Sample correlation coefficient r

[Figure: the same grid of scatter plots labeled with their correlation coefficients, from Wikipedia]
http://en.wikipedia.org/wiki/File:Correlation_examples2.svg
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Top row: Linear relations with varying r. Bottom: r = 0, yet X and Y are dependent in all of these (except possibly the last); it’s just that the relationship is not a line.

SLIDE 23

Correlation does not imply causation

High correlation between X and Y doesn’t mean X causes Y or vice-versa. It could be a coincidence. Or they could both be caused by a third variable. Website tylervigen.com plots many data sets (various quantities by year) against each other to find spurious correlations:

http://www.tylervigen.com/view_correlation?id=1703 http://tylervigen.com/view_correlation?id=1759

SLIDE 24

More about interpretation of correlation

Low r^2 does NOT guarantee independence; it just means that a line y = β0 + β1x is not a good fit to the data.

r is an estimate of ρ. The estimate improves with higher n. With additional assumptions on the underlying joint distribution of X, Y, we can use r to test H0: ρ = 0 vs. H1: ρ ≠ 0 (or other values).

Best fits and correlation generalize to other models, including
  • Polynomial regression: y = β0 + β1x + β2x^2 + · · · + βp x^p
  • Multiple linear regression: y = β0 + β1t + β2u + · · · + βp w, where t, u, . . . , w are multiple independent variables and y is the dependent variable
  • Weighted versions, when the variance is different at each value of the independent variables

SLIDE 25

Polynomial regression

Model y as a polynomial in x of degree p:

y = β0 + β1x + β2x^2 + · · · + βp x^p

The ith observation (xi, yi) gives

yi = β0 + β1xi + β2xi^2 + · · · + βp xi^p + εi

Matrix notation: y = Xβ + ε, with X the design matrix:

[y1]   [1  x1  x1^2  · · ·  x1^p]   [β0]   [ε1]
[y2] = [1  x2  x2^2  · · ·  x2^p] · [β1] + [ε2]
[..]   [..  ..   ..   · · ·   ..]   [..]   [..]
[yn]   [1  xn  xn^2  · · ·  xn^p]   [βp]   [εn]
n×1            n×(p+1)            (p+1)×1  n×1

MLE point estimate of β is β̂ = (X′X)^{−1} X′ y. Need X′X to be non-singular and n ≥ p + 1 (usually a lot bigger).
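
A small R sketch of the matrix form on simulated quadratic data: build the design matrix, compute β̂ = (X′X)^{−1}X′y directly, and compare with lm():

set.seed(10)
x <- -10:10
y <- 10 * x^2 - 3 * x + 6 + rnorm(length(x), sd = 100)

X <- cbind(1, x, x^2)                       # n x (p+1) design matrix, p = 2
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y via a linear solve
drop(beta_hat)
coef(lm(y ~ x + I(x^2)))                    # same estimates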

SLIDE 26

Multiple linear regression

Model one dependent variable as a constant plus a linear combination of p independent variables. The goal is a best fit for

y = β0 + β1x(1) + β2x(2) + · · · + βp x(p)

The ith observation (xi1, xi2, . . . , xip, yi) gives

yi = β0 + β1xi1 + β2xi2 + · · · + βp xip + εi

Matrix notation: y = Xβ + ε, with X the design matrix:

[y1]   [1  x11  x12  · · ·  x1p]   [β0]   [ε1]
[y2] = [1  x21  x22  · · ·  x2p] · [β1] + [ε2]
[..]   [..  ..    ..  · · ·   ..]  [..]   [..]
[yn]   [1  xn1  xn2  · · ·  xnp]   [βp]   [εn]
n×1           n×(p+1)            (p+1)×1  n×1

MLE point estimate of β is β̂ = (X′X)^{−1} X′ y. Need X′X to be non-singular and n ≥ p + 1 (usually a lot bigger).
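
A minimal R sketch with p = 2 hypothetical predictors; the same normal-equation formula applies, and lm() builds the design matrix automatically:

set.seed(11)
n  <- 50
x1 <- rnorm(n)
x2 <- runif(n, 0, 5)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 1)

X <- cbind(1, x1, x2)                  # design matrix with an intercept column
drop(solve(t(X) %*% X, t(X) %*% y))    # (X'X)^{-1} X'y
coef(lm(y ~ x1 + x2))                  # matches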

SLIDE 27

Example in Matlab


>> % Generate data with known x
>> % but random errors in y
>> x = (-10:10)';   % column vector
>> err = normrnd(0, 100, size(x));
>> y = 10*(x.^2) - 3*x + 6 + err;
>> % Point estimate (no conf. int.):
>> polyfit(x,y,2)
    9.5968   -0.6319   30.5096
>> % Interval estimate (with conf. int.)
>> % Create the design matrix
>> Xdesign = [ones(size(x)), x, x.^2]
Xdesign =
     1   -10   100
     1    -9    81
   ...
     1    10   100
>> [b, bint] = regress(y, Xdesign)
b =
   30.5096
   -0.6319
    9.5968
bint =
  -48.6394  109.6587
   -9.3294    8.0655
    7.9854   11.2082

Fit is y = 9.5968x^2 − 0.6319x + 30.5096

[Figure: Fitting a polynomial to data. True curve (hidden): y = 10x^2 − 3x + 6. Best fit quadratic: y = β̂2 x^2 + β̂1 x + β̂0.]

SLIDE 28

Example in R


> # Generate data with known x
> # but random errors in y
> x = -10:10;
> n = length(x);
> err = rnorm(n, 0, 100);
> y = 10*x^2 - 3*x + 6 + err;
> # Fit to y = b0 + b1*x + b2*x^2
> # intercept b0 is implied:
> bestfit = lm(y ~ I(x) + I(x^2));
> coefficients(bestfit)
(Intercept)        I(x)      I(x^2)
 30.5096087  -0.6319475   9.5968040
> confint(bestfit)
                  2.5 %     97.5 %
(Intercept) -48.639445 109.658662
I(x)         -9.329402   8.065507
I(x^2)        7.985427  11.208181

Fit is y = 9.5968040x^2 − 0.6319475x + 30.5096087

[Figure: Fitting a polynomial to data. True curve (hidden): y = 10x^2 − 3x + 6. Best fit quadratic: y = β̂2 x^2 + β̂1 x + β̂0.]
