LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
SIMPLE LINEAR REGRESSION
▸ Starting point
▸ Simplest parametric function
▸ Easy to interpret the parameters: intercept and coefficients (a unit change in x produces a coefficient-sized change in y)
▸ Can be very accurate in certain problems
▸ Least squares
▸ Insight: least squares maximises the (log) likelihood of the observations under a Gaussian noise model on y
[Figure: sales vs. TV advertising budget, with the least-squares fit]

$Y \approx \beta_0 + \beta_1 X$

$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$

$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$

$Y = \beta_0 + \beta_1 X + \epsilon$
$P(\text{data} \mid \text{model}) \propto \prod_{i=0}^{N-1} \exp\left[-\frac{1}{2}\left(\frac{y_i - y(x_i)}{\sigma_y}\right)^2\right]$
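As a sketch, the closed-form estimators above can be computed directly with numpy; the toy data below is made up for illustration:

    import numpy as np

    # Toy data standing in for TV budgets (x) and sales (y); values are illustrative
    x = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
    y = np.array([7.0, 10.5, 11.8, 16.2, 18.1])

    # Closed-form OLS estimates for simple linear regression
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()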
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
ACCURACY OF COEFFICIENTS
▸ Data come from a true relationship plus errors
▸ OLS gives us the line that fits the measurements most accurately
▸ The true and the estimated coefficients will differ!
▸ We can estimate the standard errors of the estimated parameters, assuming uncorrelated errors with a common variance σ²
▸ We can estimate the error variance from the data itself: the residual standard error, RSE
$Y = \beta_0 + \beta_1 X + \epsilon$

[Figure: simulated data sets with the true population line and the least-squares fits]

$\mathrm{SE}(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}$

$t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}$
            Coefficient   Std. error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         < 0.0001
TV          0.0475        0.0027       17.67         < 0.0001
The estimate of σ is known as the residual standard error:

$\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}, \qquad \mathrm{RSS} = \sum_i e_i^2$
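A minimal sketch of these standard-error formulas, assuming numpy and a fitted beta0, beta1 (e.g. from the previous snippet); σ is replaced by its RSE estimate:

    import numpy as np

    def coef_standard_errors(x, y, beta0, beta1):
        # Residual standard error as the estimate of sigma
        n = len(x)
        resid = y - (beta0 + beta1 * x)
        rse = np.sqrt(np.sum(resid ** 2) / (n - 2))
        sxx = np.sum((x - x.mean()) ** 2)
        se_b0 = np.sqrt(rse ** 2 * (1.0 / n + x.mean() ** 2 / sxx))
        se_b1 = np.sqrt(rse ** 2 / sxx)
        t_b1 = beta1 / se_b1  # t-statistic for H0: beta1 = 0
        return se_b0, se_b1, t_b1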
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
ACCURACY OF ESTIMATION
▸ How accurate is the fit?
▸ RSE, the residual standard error
▸ Closely related to the chi-square commonly used by physicists
▸ R², the proportion of variance explained
▸ For simple linear regression, R² is the square of Cor(X, Y)
▸ R² is more general: it also applies to multiple or nonlinear regression
$Y = \beta_0 + \beta_1 X + \epsilon$

$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2, \qquad \mathrm{RSE} = \sqrt{\frac{1}{n-2}\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat y_i)^2}$

$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_i (y_i - \bar y)^2$

TSS measures the total variance in the response.

[Figure: sales vs. TV advertising budget, with the least-squares fit]

$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar y)^2}}$
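A short sketch of RSE and R² from predictions, assuming numpy arrays y and y_hat:

    import numpy as np

    def rse_and_r2(y, y_hat):
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
        tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
        rse = np.sqrt(rss / (n - 2))        # residual standard error
        r2 = 1.0 - rss / tss                # proportion of variance explained
        return rse, r2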
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
MULTIPLE LINEAR REGRESSION
▸ Multiple x variables
▸ OLS
▸ Without the other variables, newspaper ads seem related to sales; with the others included, they do not
▸ Ad spending across the channels is correlated
▸ Multiple regression coefficients describe the effect of an input on the outcome with the other inputs held fixed
▸ Including all possible factors can reveal the real effect of a variable (adjusting for …)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
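A hedged sketch of fitting such a model by OLS with numpy's lstsq; the simulated data merely mimics the advertising example (coefficients chosen for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 300, size=(200, 3))  # stand-ins for TV, radio, newspaper
    y = 2.9 + 0.046 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0, 1.6, size=200)

    A = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS fit: beta0 ... betap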
            Coefficient   Std. error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   −0.001        0.0059       −0.18         0.8599

Correlation matrix:

            TV       radio    newspaper   sales
TV          1.0000   0.0548   0.0567      0.7822
radio                1.0000   0.3541      0.5762
newspaper                     1.0000      0.2283
sales                                     1.0000

[Figure: regression surface of sales over TV and radio budgets]

LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
QUALITATIVE INPUTS TO LINEAR REGRESSION
▸ X can be a category
▸ Gender, ethnicity, marital status, phone type, country, ..
▸ Binary inputs
▸ Multiple categories
▸ This is called one-hot encoding
$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}$

$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ -1 & \text{if the } i\text{th person is male} \end{cases}$

$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if not Asian} \end{cases}, \qquad x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if not Caucasian} \end{cases}$

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is African American} \end{cases}$
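A minimal one-hot encoding sketch in plain numpy (the helper name is ours; note that one column is kept out as the baseline category, just as African American is above, to avoid perfect collinearity with the intercept):

    import numpy as np

    def one_hot(labels):
        # Map each category label to a 0/1 indicator column
        cats = sorted(set(labels))
        return np.array([[1 if lab == c else 0 for c in cats] for lab in labels]), cats

    X, cats = one_hot(["Asian", "Caucasian", "African American", "Asian"])
    # Drop one column (the baseline category) before adding an intercept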
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
EXTENDING LINEAR REGRESSION: INTERACTIONS
▸ Linear regression is additive
▸ Best strategy? Spend all our money on radio ads?
▸ Some companies do that, but others have a more balanced strategy
▸ Interaction (synergy) between TV and radio
▸ TV × radio is simply treated as a new variable; OLS fitting proceeds as before
▸ Y is not a linear function of X, but it is linear in the β's, so the same formalism can be used
▸ β₃ can be interpreted as the increase in the effectiveness of TV ads for a one-unit increase in radio ads
[Figure: regression surface of sales over TV and radio budgets]

Additive model:

            Coefficient   Std. error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   −0.001        0.0059       −0.18         0.8599
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$

With the interaction term:

            Coefficient   Std. error   t-statistic   p-value
Intercept   6.7502        0.248        27.23         < 0.0001
TV          0.0191        0.002        12.70         < 0.0001
radio       0.0289        0.009        3.24          0.0014
TV×radio    0.0011        0.000        20.73         < 0.0001
$Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon$

$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{radio} \times \text{TV}) + \epsilon = \beta_0 + (\beta_1 + \beta_3 \times \text{radio}) \times \text{TV} + \beta_2 \times \text{radio} + \epsilon$
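A sketch of fitting the interaction model: the product column is appended to the design matrix and OLS proceeds as before (the toy numbers below only resemble the advertising data):

    import numpy as np

    tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
    radio = np.array([37.8, 39.3, 45.9, 41.3, 10.8])
    sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])  # illustrative values only

    # The interaction TV*radio is just one more input column
    A = np.column_stack([np.ones(len(tv)), tv, radio, tv * radio])
    coef, *_ = np.linalg.lstsq(A, sales, rcond=None)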
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
EXTENDING LINEAR REGRESSION: POLYNOMIAL REGRESSION
▸ Effects may be non-linear, e.g. very often saturating
▸ We can add polynomials of x as additional variables; OLS fitting proceeds as before
▸ Again, y is not a linear function of x, but it is linear in the β's, so the same formalism can be used
▸ In fact we can use any functions of x (log(x), cos(x), sin(x), etc.), as long as the model stays linear in the coefficients. E.g. we cannot use cos(a·x + b) in linear regression.
[Figure: miles per gallon vs. horsepower, with linear, degree-2 and degree-5 fits]

$\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon$

              Coefficient   Std. error   t-statistic   p-value
Intercept     56.9001       1.8004       31.6          < 0.0001
horsepower    −0.4662       0.0311       −15.0         < 0.0001
horsepower²   0.0012        0.0001       10.1          < 0.0001
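A sketch of degree-2 polynomial regression: horsepower² is added as an extra column and plain OLS is used, since the model is still linear in the coefficients (values made up, merely Auto-like):

    import numpy as np

    hp = np.array([70.0, 95.0, 130.0, 150.0, 165.0, 220.0])
    mpg = np.array([30.0, 24.0, 18.0, 16.0, 15.0, 14.0])  # illustrative values

    A = np.column_stack([np.ones(len(hp)), hp, hp ** 2])  # columns: 1, x, x^2
    coef, *_ = np.linalg.lstsq(A, mpg, rcond=None)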
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
DETECTING NON-LINEARITY, OUTLIERS, HIGH LEVERAGE
▸ Clear trends in the residuals indicate non-linearity
▸ Residual plots are also useful for identifying outliers
▸ Outliers could be mere measurement error, or indicate problems with the model itself
▸ High-leverage points have a strong effect on the coefficients
[Figure: residual plots for linear and quadratic fits; examples of outliers and high-leverage points]

$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^{n}(x_{i'} - \bar x)^2}$
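The leverage statistic above translates directly into code; a sketch for the simple-regression case:

    import numpy as np

    def leverage(x):
        # h_i grows with distance from the mean of x; large h_i = high leverage
        x = np.asarray(x, dtype=float)
        return 1.0 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)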
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
COLLINEARITY
▸ Some predictor variables can be highly correlated
▸ Their individual effects cannot be inferred
▸ With 3 or more variables it is harder to detect: multicollinearity
▸ Variance inflation factor, VIF
▸ Possible solutions: drop one, or combine them?
[Figure: Limit vs. Age and Limit vs. Rating scatter plots; contours of the coefficient estimates]

            Coefficient   Std. error   t-statistic   p-value
Model 1
Intercept   −173.411      43.828       −3.957        < 0.0001
age         −2.292        0.672        −3.407        0.0007
limit       0.173         0.005        34.496        < 0.0001
Model 2
Intercept   −377.537      45.254       −8.343        < 0.0001
rating      2.202         0.952        2.312         0.0213
limit       0.025         0.064        0.384         0.7012
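A sketch of the variance inflation factor: regress one predictor on the others and take 1/(1 − R²); the function name is ours:

    import numpy as np

    def vif(X, j):
        # Regress column j of X on the remaining columns
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        return 1.0 / (1.0 - r2)  # a large VIF (e.g. > 5-10) signals collinearity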
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
SOLVING MULTIPLE LINEAR REGRESSION
▸ Linear regression can usually be solved by matrix inversion
▸ But sometimes the normal equations are close to singular, and the inversion fails
Setting the gradient of χ² to zero:

$0 = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}\left[y_i - \sum_{j=1}^{M} a_j X_j(x_i)\right] X_k(x_i), \qquad k = 1, \ldots, M$

This gives the normal equations:

$\sum_{j=1}^{M} \alpha_{kj} a_j = \beta_k$

where $\alpha_{kj} = \sum_{i=1}^{N} \frac{X_j(x_i) X_k(x_i)}{\sigma_i^2}$ is an $M \times M$ matrix, and $\beta_k = \sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2}$.

In matrix form:

$(A^T \cdot A) \cdot a = A^T \cdot b, \qquad [\alpha] = A^T \cdot A, \qquad [\beta] = A^T \cdot b$

so

$a_j = \sum_{k=1}^{M} [\alpha]^{-1}_{jk} \beta_k = \sum_{k=1}^{M} C_{jk} \left[\sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2}\right]$

and from $C = [\alpha]^{-1}$ the variance associated with each estimate can be found.
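A direct sketch of the normal-equations solve; this is the inversion route that can fail when AᵀA is near-singular:

    import numpy as np

    def fit_normal_equations(A, b):
        alpha = A.T @ A            # [alpha] = A^T . A
        beta = A.T @ b             # [beta]  = A^T . b
        C = np.linalg.inv(alpha)   # C = [alpha]^-1 also carries the parameter variances
        return C @ beta, C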
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
SOLVING MULTIPLE LINEAR REGRESSION WITH SINGULAR VALUE DECOMPOSITION
▸ It can also be solved with SVD
▸ For over-determined systems (more data points than coefficients) SVD produces the solution with minimal least-squares error (hooray!)
▸ For under-determined systems (more coefficients than data points) SVD picks the coefficient vector with the smallest norm among the least-squares solutions (hooray! Small values instead of cancelling infinities!)
$\chi^2 = |A \cdot a - b|^2$

$a = \sum_{i=1}^{M} \left(\frac{U_{(i)} \cdot b}{w_i}\right) V_{(i)}$
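A sketch of the SVD route using numpy: reciprocals of tiny singular values are zeroed rather than inverted, which is exactly what tames near-singular and under-determined problems:

    import numpy as np

    def fit_svd(A, b, rcond=1e-12):
        U, w, Vt = np.linalg.svd(A, full_matrices=False)
        # Zero out 1/w_i for tiny singular values instead of dividing by them
        w_inv = np.where(w > rcond * w.max(), 1.0 / w, 0.0)
        return Vt.T @ (w_inv * (U.T @ b))  # a = sum_i (U_(i) . b / w_i) V_(i)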
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
REFERENCES
▸ ISLR: chapter 3
▸ A statistics-style treatment
▸ Numerical Recipes in C, chapter 15
▸ A physicist-style treatment
▸ SVD: Numerical Recipes in C, chapter 2.6