LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
SIMPLE LINEAR REGRESSION
▸ Starting point
▸ Simplest parametric function
▸ Easy to interpret the parameters: intercept and coefficients (a unit change in x produces a coefficient-sized change in y)
▸ Can be very accurate in certain problems
▸ Least squares
▸ Insight: least squares maximises the (log) likelihood of the observations under a Gaussian noise model on y
[Figure: sales vs. TV advertising budget, with the least-squares fit]

$Y \approx \beta_0 + \beta_1 X$

$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$

$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$

$Y = \beta_0 + \beta_1 X + \epsilon$
$P(\text{data} \mid \text{model}) \propto \prod_{i=0}^{N-1} \exp\left[-\frac{1}{2}\left(\frac{y_i - y(x_i)}{\sigma_y}\right)^2\right]$
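As a sketch, the closed-form estimators above can be computed directly with numpy; the toy data below is made up for illustration:

    import numpy as np

    # Toy data standing in for TV budgets (x) and sales (y); values are illustrative
    x = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
    y = np.array([7.0, 10.5, 11.8, 16.2, 18.1])

    # Closed-form OLS estimates for simple linear regression
    beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0 = y.mean() - beta1 * x.mean()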
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
ACCURACY OF COEFFICIENTS
▸ Data come from a true relationship plus errors
▸ OLS gives us the line that fits the measurements most accurately
▸ The true and the estimated coefficients will differ!
▸ We can estimate the standard errors of the estimated parameters, assuming uncorrelated errors with a common variance σ²
▸ We can estimate the error variance from the data itself: the residual standard error, RSE
$Y = \beta_0 + \beta_1 X + \epsilon$

[Figure: simulated data sets with the true population line and the least-squares fits]

$\mathrm{SE}(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right], \qquad \mathrm{SE}(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}$

$t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}$
            Coefficient   Std. error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         < 0.0001
TV          0.0475        0.0027       17.67         < 0.0001
The estimate of σ is known as the residual standard error:

$\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}, \qquad \mathrm{RSS} = \sum_i e_i^2$
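A minimal sketch of these standard-error formulas, assuming numpy and a fitted beta0, beta1 (e.g. from the previous snippet); σ is replaced by its RSE estimate:

    import numpy as np

    def coef_standard_errors(x, y, beta0, beta1):
        # Residual standard error as the estimate of sigma
        n = len(x)
        resid = y - (beta0 + beta1 * x)
        rse = np.sqrt(np.sum(resid ** 2) / (n - 2))
        sxx = np.sum((x - x.mean()) ** 2)
        se_b0 = np.sqrt(rse ** 2 * (1.0 / n + x.mean() ** 2 / sxx))
        se_b1 = np.sqrt(rse ** 2 / sxx)
        t_b1 = beta1 / se_b1  # t-statistic for H0: beta1 = 0
        return se_b0, se_b1, t_b1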
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
ACCURACY OF ESTIMATION
▸ How accurate is the fit?
▸ RSE, the residual standard error
▸ Closely related to the chi-square commonly used by physicists
▸ R², the proportion of variance explained
▸ For simple linear regression, R² is the square of Cor(X, Y)
▸ R² is more general: it also applies to multiple or nonlinear regression
$Y = \beta_0 + \beta_1 X + \epsilon$

$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2, \qquad \mathrm{RSE} = \sqrt{\frac{1}{n-2}\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat y_i)^2}$

$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}, \qquad \mathrm{TSS} = \sum_i (y_i - \bar y)^2$

TSS measures the total variance in the response.

[Figure: sales vs. TV advertising budget, with the least-squares fit]

$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar y)^2}}$
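A short sketch of RSE and R² from predictions, assuming numpy arrays y and y_hat:

    import numpy as np

    def rse_and_r2(y, y_hat):
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
        tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
        rse = np.sqrt(rss / (n - 2))        # residual standard error
        r2 = 1.0 - rss / tss                # proportion of variance explained
        return rse, r2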
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
MULTIPLE LINEAR REGRESSION
▸ Multiple x variables
▸ OLS
▸ Without the other variables, newspaper ads seem related to sales; with the others included, they do not
▸ Ad spending across the channels is correlated
▸ Multiple regression coefficients describe the effect of an input on the outcome with the other inputs held fixed
▸ Including all possible factors can reveal the real effect of a variable (adjusting for …)
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
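A hedged sketch of fitting such a model by OLS with numpy's lstsq; the simulated data merely mimics the advertising example (coefficients chosen for illustration only):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 300, size=(200, 3))  # stand-ins for TV, radio, newspaper
    y = 2.9 + 0.046 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0, 1.6, size=200)

    A = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # OLS fit: beta0 ... betap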
            Coefficient   Std. error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   −0.001        0.0059       −0.18         0.8599

Correlation matrix:

            TV       radio    newspaper   sales
TV          1.0000   0.0548   0.0567      0.7822
radio                1.0000   0.3541      0.5762
newspaper                     1.0000      0.2283
sales                                     1.0000

[Figure: regression surface of sales over TV and radio budgets]

LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
QUALITATIVE INPUTS TO LINEAR REGRESSION
▸ X can be a category
▸ Gender, ethnicity, marital status, phone type, country, ..
▸ Binary inputs
▸ Multiple categories
▸ This is called one-hot encoding
$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}$

$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ -1 & \text{if the } i\text{th person is male} \end{cases}$

$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if not Asian} \end{cases}, \qquad x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if not Caucasian} \end{cases}$

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is African American} \end{cases}$
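A minimal one-hot encoding sketch in plain numpy (the helper name is ours; note that one column is kept out as the baseline category, just as African American is above, to avoid perfect collinearity with the intercept):

    import numpy as np

    def one_hot(labels):
        # Map each category label to a 0/1 indicator column
        cats = sorted(set(labels))
        return np.array([[1 if lab == c else 0 for c in cats] for lab in labels]), cats

    X, cats = one_hot(["Asian", "Caucasian", "African American", "Asian"])
    # Drop one column (the baseline category) before adding an intercept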
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
EXTENDING LINEAR REGRESSION: INTERACTIONS
▸ Linear regression is additive
▸ Best strategy? Spend all our money on radio ads?
▸ Some companies do that, but others have a more balanced strategy
▸ Interaction (synergy) between TV and radio
▸ TV × radio is simply treated as a new variable; OLS fitting proceeds as before
▸ Y is not a linear function of X, but it is linear in the β's, so the same formalism can be used
▸ β₃ can be interpreted as the increase in the effectiveness of TV ads for a one-unit increase in radio ads
[Figure: regression surface of sales over TV and radio budgets]

Additive model:

            Coefficient   Std. error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   −0.001        0.0059       −0.18         0.8599
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$

With the interaction term:

            Coefficient   Std. error   t-statistic   p-value
Intercept   6.7502        0.248        27.23         < 0.0001
TV          0.0191        0.002        12.70         < 0.0001
radio       0.0289        0.009        3.24          0.0014
TV×radio    0.0011        0.000        20.73         < 0.0001
$Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon$

$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{radio} \times \text{TV}) + \epsilon = \beta_0 + (\beta_1 + \beta_3 \times \text{radio}) \times \text{TV} + \beta_2 \times \text{radio} + \epsilon$
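A sketch of fitting the interaction model: the product column is appended to the design matrix and OLS proceeds as before (the toy numbers below only resemble the advertising data):

    import numpy as np

    tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
    radio = np.array([37.8, 39.3, 45.9, 41.3, 10.8])
    sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])  # illustrative values only

    # The interaction TV*radio is just one more input column
    A = np.column_stack([np.ones(len(tv)), tv, radio, tv * radio])
    coef, *_ = np.linalg.lstsq(A, sales, rcond=None)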
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
EXTENDING LINEAR REGRESSION: POLYNOMIAL REGRESSION
▸ Effects may be non-linear, e.g. very often saturating
▸ We can add polynomials of x as additional variables; OLS fitting proceeds as before
▸ Again, y is not a linear function of x, but it is linear in the β's, so the same formalism can be used
▸ In fact we can use any functions of x (log(x), cos(x), sin(x), etc.), as long as the model stays linear in the coefficients. E.g. we cannot use cos(a·x + b) in linear regression.
[Figure: miles per gallon vs. horsepower, with linear, degree-2 and degree-5 fits]

$\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon$

              Coefficient   Std. error   t-statistic   p-value
Intercept     56.9001       1.8004       31.6          < 0.0001
horsepower    −0.4662       0.0311       −15.0         < 0.0001
horsepower²   0.0012        0.0001       10.1          < 0.0001
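A sketch of degree-2 polynomial regression: horsepower² is added as an extra column and plain OLS is used, since the model is still linear in the coefficients (values made up, merely Auto-like):

    import numpy as np

    hp = np.array([70.0, 95.0, 130.0, 150.0, 165.0, 220.0])
    mpg = np.array([30.0, 24.0, 18.0, 16.0, 15.0, 14.0])  # illustrative values

    A = np.column_stack([np.ones(len(hp)), hp, hp ** 2])  # columns: 1, x, x^2
    coef, *_ = np.linalg.lstsq(A, mpg, rcond=None)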
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
DETECTING NON-LINEARITY, OUTLIERS, HIGH LEVERAGE
▸ Clear trends in the residuals indicate non-linearity
▸ Residual plots are also useful for identifying outliers
▸ Outliers could be mere measurement error, or indicate problems with the model itself
▸ High-leverage points have a strong effect on the coefficients
[Figure: residual plots for linear and quadratic fits; examples of outliers and high-leverage points]

$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^{n}(x_{i'} - \bar x)^2}$
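The leverage statistic above translates directly into code; a sketch for the simple-regression case:

    import numpy as np

    def leverage(x):
        # h_i grows with distance from the mean of x; large h_i = high leverage
        x = np.asarray(x, dtype=float)
        return 1.0 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)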
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
COLLINEARITY
▸ Some predictor variables can be highly correlated
▸ Their individual effects cannot be inferred
▸ With 3 or more variables it is harder to detect: multicollinearity
▸ Variance inflation factor, VIF
▸ Possible solutions: drop one, or combine them?
[Figure: Limit vs. Age and Limit vs. Rating scatter plots; contours of the coefficient estimates]

            Coefficient   Std. error   t-statistic   p-value
Model 1
Intercept   −173.411      43.828       −3.957        < 0.0001
age         −2.292        0.672        −3.407        0.0007
limit       0.173         0.005        34.496        < 0.0001
Model 2
Intercept   −377.537      45.254       −8.343        < 0.0001
rating      2.202         0.952        2.312         0.0213
limit       0.025         0.064        0.384         0.7012
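A sketch of the variance inflation factor: regress one predictor on the others and take 1/(1 − R²); the function name is ours:

    import numpy as np

    def vif(X, j):
        # Regress column j of X on the remaining columns
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        return 1.0 / (1.0 - r2)  # a large VIF (e.g. > 5-10) signals collinearity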
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
SOLVING MULTIPLE LINEAR REGRESSION
▸ Linear regression can usually be solved by matrix inversion
▸ But sometimes the normal equations are close to singular, and the inversion fails
Setting the gradient of χ² to zero:

$0 = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}\left[y_i - \sum_{j=1}^{M} a_j X_j(x_i)\right] X_k(x_i), \qquad k = 1, \ldots, M$

This gives the normal equations:

$\sum_{j=1}^{M} \alpha_{kj} a_j = \beta_k$

where $\alpha_{kj} = \sum_{i=1}^{N} \frac{X_j(x_i) X_k(x_i)}{\sigma_i^2}$ is an $M \times M$ matrix, and $\beta_k = \sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2}$.

In matrix form:

$(A^T \cdot A) \cdot a = A^T \cdot b, \qquad [\alpha] = A^T \cdot A, \qquad [\beta] = A^T \cdot b$

so

$a_j = \sum_{k=1}^{M} [\alpha]^{-1}_{jk} \beta_k = \sum_{k=1}^{M} C_{jk} \left[\sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2}\right]$

and from $C = [\alpha]^{-1}$ the variance associated with each estimate can be found.
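A direct sketch of the normal-equations solve; this is the inversion route that can fail when AᵀA is near-singular:

    import numpy as np

    def fit_normal_equations(A, b):
        alpha = A.T @ A            # [alpha] = A^T . A
        beta = A.T @ b             # [beta]  = A^T . b
        C = np.linalg.inv(alpha)   # C = [alpha]^-1 also carries the parameter variances
        return C @ beta, C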
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
SOLVING MULTIPLE LINEAR REGRESSION WITH SINGULAR VALUE DECOMPOSITION
▸ It can also be solved with SVD
▸ For over-determined systems (more data points than coefficients) SVD produces the solution with minimal least-squares error (hooray!)
▸ For under-determined systems (more coefficients than data points) SVD picks the coefficient vector with the smallest norm among the least-squares solutions (hooray! Small values instead of cancelling infinities!)
$\chi^2 = |A \cdot a - b|^2$

$a = \sum_{i=1}^{M} \left(\frac{U_{(i)} \cdot b}{w_i}\right) V_{(i)}$
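A sketch of the SVD route using numpy: reciprocals of tiny singular values are zeroed rather than inverted, which is exactly what tames near-singular and under-determined problems:

    import numpy as np

    def fit_svd(A, b, rcond=1e-12):
        U, w, Vt = np.linalg.svd(A, full_matrices=False)
        # Zero out 1/w_i for tiny singular values instead of dividing by them
        w_inv = np.where(w > rcond * w.max(), 1.0 / w, 0.0)
        return Vt.T @ (w_inv * (U.T @ b))  # a = sum_i (U_(i) . b / w_i) V_(i)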
LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW
REFERENCES
▸ ISLR: chapter 3
▸ A statistics-style treatment
▸ Numerical Recipes in C, chapter 15
▸ A physicist-style treatment
▸ SVD: Numerical Recipes in C, chapter 2.6