

SLIDE 1

LINEAR REGRESSION

SLIDE 2

LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW

SIMPLE LINEAR REGRESSION

▸ Starting point
▸ Simplest parametric function
▸ Easy to interpret the parameters: intercept and coefficients (a unit change in x changes y by the coefficient times one unit)
▸ Can be very accurate in certain problems
▸ Least squares
▸ Insight: least squares maximises the (log) likelihood of the observations under a Gaussian distribution for y

[Figure: scatter of Sales vs. TV advertising budget with the fitted regression line]

$$Y \approx \beta_0 + \beta_1 X$$

$$\text{sales} \approx \beta_0 + \beta_1 \times \text{TV}$$

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$

$$Y = \beta_0 + \beta_1 X + \epsilon$$

$$P(\text{data} \mid \text{model}) \propto \prod_{i=0}^{N-1} \exp\left[-\frac{1}{2}\left(\frac{y_i - y(x_i)}{\sigma_y}\right)^2\right]$$
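As a quick illustration, here is a minimal NumPy sketch of the closed-form fit above. The data is synthetic; the generating intercept and slope are assumptions chosen to loosely resemble the TV-vs-sales example, not the real Advertising data.

```python
import numpy as np

# Synthetic data loosely resembling the TV-vs-sales example (assumed values).
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)                  # ad budget
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)    # response with Gaussian noise

# Closed-form OLS estimates from the formulas above.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(f"beta0 ~ {beta0:.3f}, beta1 ~ {beta1:.4f}")
```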

SLIDE 3


ACCURACY OF COEFFICIENTS

▸ Data comes from a true relationship plus errors
▸ We get the line which fits the measurements most accurately using OLS
▸ The true and the estimated coefficients will be different!
▸ We can estimate the standard errors of the estimated parameters (assuming uncorrelated errors with a common variance σ²)
▸ We can estimate the errors from the data itself: residual standard error, RSE

$$Y = \beta_0 + \beta_1 X + \epsilon$$

[Figure: simulated data showing the true population regression line and several least squares fits]

$$SE(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}\right], \qquad SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}$$

$$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}$$

            Coefficient   Std. error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         < 0.0001
TV          0.0475        0.0027       17.67         < 0.0001

The estimate of σ is known as the residual standard error:

$$RSE = \sqrt{RSS/(n-2)}, \qquad RSS = e_1^2 + e_2^2 + \cdots + e_n^2$$
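A sketch of the standard-error and t-statistic formulas above, again on synthetic data; in practice σ is unknown, so RSE is plugged in for it, as below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=n)

# OLS fit (closed form, as on the previous slide).
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Residual standard error estimates sigma from the data: RSE = sqrt(RSS/(n-2)).
residuals = y - (beta0 + beta1 * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard errors of the estimates, with RSE standing in for sigma.
sxx = np.sum((x - x.mean()) ** 2)
se_beta0 = rse * np.sqrt(1 / n + x.mean() ** 2 / sxx)
se_beta1 = rse / np.sqrt(sxx)
t_stat = beta1 / se_beta1          # t-statistic for H0: beta1 = 0
print(se_beta0, se_beta1, t_stat)
```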
SLIDE 4


ACCURACY OF ESTIMATION

▸ How accurate is the fit?
▸ RSE, the residual standard error
▸ Closely related to the chi-square commonly used by physicists
▸ R², the proportion of variance explained
▸ For simple linear regression, R² equals the squared correlation, Cor(X, Y)²
▸ R² is more general: it also applies to multiple or nonlinear regression

$$Y = \beta_0 + \beta_1 X + \epsilon$$

$$RSS = e_1^2 + e_2^2 + \cdots + e_n^2$$

$$RSE = \sqrt{\frac{1}{n-2}\,RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat y_i)^2}$$

$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}, \qquad TSS = \sum (y_i - \bar y)^2$$

TSS measures the total variance in the response Y.

[Figure: scatter of Sales vs. TV advertising budget with the fitted regression line]

$$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar y)^2}}$$
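Both accuracy measures fit in a few lines; a minimal sketch (the function name and the degrees-of-freedom parameter p are my own):

```python
import numpy as np

def rse_and_r2(y, y_hat, p=1):
    """Residual standard error and R^2 for a fitted linear model.

    p is the number of predictors; the RSE denominator n - p - 1
    reduces to the n - 2 of the simple-regression case when p = 1.
    """
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return np.sqrt(rss / (len(y) - p - 1)), 1 - rss / tss
```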
SLIDE 5


MULTIPLE LINEAR REGRESSION

▸ Multiple x variables
▸ OLS
▸ Without the other variables, newspaper ads seem to be related to sales; with the others included, they do not
▸ Ad spendings are correlated
▸ Multiple regression coefficients describe the effect of an input on the outcome, given that the other inputs are held fixed
▸ Including all possible factors can reveal the real effect of variables (adjusting for …)

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$

Simple regression of sales on radio:

            Coefficient   Std. error   t-statistic   p-value
Intercept   9.312         0.563        16.54         < 0.0001
radio       0.203         0.020        9.92          < 0.0001

Simple regression of sales on newspaper:

            Coefficient   Std. error   t-statistic   p-value
Intercept   12.351        0.621        19.88         < 0.0001
newspaper   0.055         0.017        3.30          0.00115

Multiple regression of sales on TV, radio, and newspaper:

            Coefficient   Std. error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   −0.001        0.0059       −0.18         0.8599

Correlation matrix:

            TV       radio    newspaper   sales
TV          1.0000   0.0548   0.0567      0.7822
radio                1.0000   0.3541      0.5762
newspaper                     1.0000      0.2283
sales                                     1.0000

[Figure: 3D plot of Sales as a function of Radio and TV]
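A sketch of the same pattern on synthetic data: newspaper is generated to be correlated with radio but has no true effect on sales, so its multiple-regression coefficient should come out near zero. All generating values are assumptions, not the real data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
newspaper = 0.6 * radio + rng.uniform(0, 30, n)    # correlated with radio
sales = 3.0 + 0.046 * tv + 0.19 * radio + rng.normal(0, 1.5, n)

# Design matrix with an intercept column; one OLS solve for all coefficients.
X = np.column_stack([np.ones(n), tv, radio, newspaper])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta)   # newspaper's coefficient should be near zero
```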
SLIDE 6


QUALITATIVE INPUTS TO LINEAR REGRESSION

▸ X can be a category
▸ Gender, ethnicity, marital status, phone type, country, ...
▸ Binary inputs
▸ Multiple categories
▸ It is called one-hot encoding

$$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}$$

or alternatively

$$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ -1 & \text{if the } i\text{th person is male} \end{cases}$$

For more than two categories:

$$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if not} \end{cases}, \qquad x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if not} \end{cases}$$

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is African American} \end{cases}$$
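A minimal sketch of the encoding on a hypothetical three-level category, with African American as the baseline level absorbed into the intercept:

```python
import numpy as np

# Hypothetical categorical input with three levels, as on the slide.
ethnicity = np.array(["Asian", "Caucasian", "African American", "Asian"])

# One dummy column per non-baseline level; the baseline gets (0, 0),
# so its mean is absorbed into the intercept beta0.
x1 = (ethnicity == "Asian").astype(float)
x2 = (ethnicity == "Caucasian").astype(float)
X = np.column_stack([np.ones(len(ethnicity)), x1, x2])
print(X)
```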

SLIDE 7


EXTENDING LINEAR REGRESSION: INTERACTIONS

▸ Linear regression is additive
▸ Best strategy? Spend all our money on radio ads?
▸ Some companies do that, but others have a more balanced strategy
▸ Interaction (synergy) between TV and radio
▸ TV × radio is just treated as a new variable; OLS fitting as before
▸ Y is not a linear function of X, but it is linear in the β's, so the same formalism can be used
▸ β₃ can be interpreted as the increase in the effectiveness of TV ads for a one-unit increase in radio ads

[Figure: 3D plot of Sales as a function of Radio and TV]

Without interaction (as on the previous slide):

            Coefficient   Std. error   t-statistic   p-value
Intercept   2.939         0.3119       9.42          < 0.0001
TV          0.046         0.0014       32.81         < 0.0001
radio       0.189         0.0086       21.89         < 0.0001
newspaper   −0.001        0.0059       −0.18         0.8599

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$$

With the TV×radio interaction:

            Coefficient   Std. error   t-statistic   p-value
Intercept   6.7502        0.248        27.23         < 0.0001
TV          0.0191        0.002        12.70         < 0.0001
radio       0.0289        0.009        3.24          0.0014
TV×radio    0.0011        0.000        20.73         < 0.0001

$$Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon$$

$$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{radio} \times \text{TV}) + \epsilon = \beta_0 + (\beta_1 + \beta_3 \times \text{radio}) \times \text{TV} + \beta_2 \times \text{radio} + \epsilon$$
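Fitting the interaction model is the same OLS solve with one extra column. A sketch on synthetic data whose generating coefficients are assumed values chosen to echo the table above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
# Synthetic sales with a genuine TV-radio synergy term (assumed values).
sales = (6.75 + 0.019 * tv + 0.029 * radio
         + 0.0011 * tv * radio + rng.normal(0, 1.0, n))

# The product tv*radio is just one more column; OLS fitting is unchanged.
X = np.column_stack([np.ones(n), tv, radio, tv * radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta)   # beta[3] estimates the interaction coefficient beta3
```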

SLIDE 8


EXTENDING LINEAR REGRESSION: POLYNOMIAL REGRESSION

▸ Effects may be non-linear, e.g. very often saturating
▸ We can add polynomials of x as different variables; OLS fitting as before
▸ Again, y is not a linear function of x, but it is linear in the β's, so the same formalism can be used
▸ Actually we can use any functions of x, such as log(x), cos(x), sin(x), etc., as long as the outcome is linear in the coefficients. E.g. we cannot use cos(a·x + b) in linear regression.

[Figure: Miles per gallon vs. Horsepower with linear, degree-2, and degree-5 polynomial fits]

$$\text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon$$

              Coefficient   Std. error   t-statistic   p-value
Intercept     56.9001       1.8004       31.6          < 0.0001
horsepower    −0.4662       0.0311       −15.0         < 0.0001
horsepower²   0.0012        0.0001       10.1          < 0.0001
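Polynomial terms work the same way: powers of x become extra columns and the model stays linear in the β's. A sketch with assumed generating values echoing the mpg table:

```python
import numpy as np

rng = np.random.default_rng(3)
hp = rng.uniform(50, 230, 200)                             # horsepower
mpg = 57 - 0.47 * hp + 0.0012 * hp ** 2 + rng.normal(0, 4, 200)

# horsepower^2 is just another column: still linear in the coefficients.
X = np.column_stack([np.ones_like(hp), hp, hp ** 2])
beta, *_ = np.linalg.lstsq(X, mpg, rcond=None)
print(beta)
```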

SLIDE 9


DETECTING NON-LINEARITY, OUTLIERS, HIGH LEVERAGE

▸ Clear trends in residuals indicate non-linearity
▸ Residual plots are also useful to identify outliers
▸ Outliers could be just measurement error, or they may indicate problems with the model itself
▸ High leverage points have a strong effect on the coefficients

[Figures: residual plots for the linear and quadratic fits; studentized residuals; examples of outlier and high-leverage points]

$$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{i'=1}^{n}(x_{i'} - \bar x)^2}$$
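The leverage statistic above is a one-liner; a minimal sketch (the helper name is mine):

```python
import numpy as np

def leverage(x):
    """Leverage h_i for simple linear regression, per the formula above."""
    x = np.asarray(x, dtype=float)
    return 1 / len(x) + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

x = np.array([1.0, 2.0, 2.5, 3.0, 10.0])   # last point sits far from the rest
print(leverage(x))                          # its h_i dominates
```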

SLIDE 10


COLLINEARITY

▸ Some predictor variables can be highly correlated
▸ Their individual effects cannot be inferred
▸ For 3 or more variables it is harder to detect: multicollinearity
▸ Variance inflation factor, VIF
▸ Possible solutions: drop one of them, or combine them?

[Figures: Credit data scatter plots (Limit vs. Age, Limit vs. Rating) and RSS contours over (β_Limit, β_Age) and (β_Limit, β_Rating)]

Model 1:

            Coefficient   Std. error   t-statistic   p-value
Intercept   −173.411      43.828       −3.957        < 0.0001
age         −2.292        0.672        −3.407        0.0007
limit       0.173         0.005        34.496        < 0.0001

Model 2:

            Coefficient   Std. error   t-statistic   p-value
Intercept   −377.537      45.254       −8.343        < 0.0001
rating      2.202         0.952        2.312         0.0213
limit       0.025         0.064        0.384         0.7012
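The VIF of a predictor is 1/(1 − R_j²), where R_j² comes from regressing that predictor on all the others. A minimal sketch (the function name is mine):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]                                      # regress column j ...
        others = np.column_stack(
            [np.ones(len(y)), np.delete(X, j, axis=1)])  # ... on the rest
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        r2 = 1 - np.sum((y - others @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)
```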

SLIDE 11


SOLVING MULTIPLE LINEAR REGRESSION

▸ Linear regression can usually be solved by matrix inversion
▸ But sometimes the normal equations can be close to singular, and the inversion fails

Setting the gradient of χ² to zero gives, for k = 1, …, M:

$$0 = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}\left[y_i - \sum_{j=1}^{M} a_j X_j(x_i)\right] X_k(x_i)$$

which can be written as the normal equations

$$\sum_{j=1}^{M} \alpha_{kj}\, a_j = \beta_k$$

where

$$\alpha_{kj} = \sum_{i=1}^{N} \frac{X_j(x_i)\, X_k(x_i)}{\sigma_i^2}$$

is an M × M matrix, and

$$\beta_k = \sum_{i=1}^{N} \frac{y_i\, X_k(x_i)}{\sigma_i^2}$$

In matrix form:

$$(A^T \cdot A) \cdot a = A^T \cdot b, \qquad [\alpha] = A^T \cdot A, \qquad [\beta] = A^T \cdot b$$

$$a_j = \sum_{k=1}^{M} [\alpha]^{-1}_{jk}\, \beta_k = \sum_{k=1}^{M} C_{jk} \left[\sum_{i=1}^{N} \frac{y_i\, X_k(x_i)}{\sigma_i^2}\right]$$

and from C = [α]⁻¹ the variance associated with the estimate â_j can be found: σ²(â_j) = C_jj.
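In NumPy the normal-equation route looks like this; a sketch assuming equal, known error bars σ_i and a basis of just {1, x}:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
sigma = np.full_like(x, 0.5)                 # assumed known error bars
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 100)

# Design matrix A_ij = X_j(x_i) / sigma_i with basis X_1 = 1, X_2 = x.
A = np.column_stack([np.ones_like(x), x]) / sigma[:, None]
b = y / sigma

alpha = A.T @ A                              # [alpha] = A^T . A
beta = A.T @ b                               # [beta]  = A^T . b
a = np.linalg.solve(alpha, beta)             # coefficients a_j
C = np.linalg.inv(alpha)                     # covariance: var(a_j) = C_jj
print(a, np.sqrt(np.diag(C)))
```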

SLIDE 12


SOLVING MULTIPLE LINEAR REGRESSION WITH SINGULAR VALUE DECOMPOSITION

▸ It can also be solved with SVD
▸ For over-determined systems (more data points than coefficients), SVD produces the solution with minimal least-squares error (hooray!)
▸ For under-determined systems (more coefficients than data points), SVD returns the solution with the smallest coefficient norm (hooray! Small values instead of cancelling infinities!)

$$\chi^2 = |A \cdot a - b|^2$$

$$a = \sum_{i=1}^{M} \left(\frac{U_{(i)} \cdot b}{w_i}\right) V_{(i)}$$

$$A = U \cdot \mathrm{diag}(w_1, \ldots, w_N) \cdot V^T \qquad (2.6.1)$$

$$U^T \cdot U = V^T \cdot V = 1 \qquad (2.6.4)$$

(The equation numbers refer to Numerical Recipes, chapter 2.6.)
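The SVD route in NumPy; a sketch that zeroes 1/w_i for near-zero singular values, which is exactly what rescues near-singular and under-determined systems:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 100)
A = np.column_stack([np.ones_like(x), x])

# a = sum_i (U_(i) . b / w_i) V_(i), with 1/w_i set to 0 for tiny w_i.
U, w, Vt = np.linalg.svd(A, full_matrices=False)
w_inv = np.zeros_like(w)
big = w > 1e-12 * w.max()                    # guard against near-singular w_i
w_inv[big] = 1.0 / w[big]
a = Vt.T @ (w_inv * (U.T @ y))
print(a)
```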
SLIDE 13


REFERENCES

▸ ISLR: chapter 3
▸ A statistics-style treatment
▸ Numerical Recipes in C: chapter 15
▸ A physicist-style treatment
▸ SVD: Numerical Recipes in C, chapter 2.6