SLIDE 1
  • Day 6: Model Selection II

Lucas Leemann

Essex Summer School

Introduction to Statistical Learning

SLIDE 2
  • 1. Repetition Week 1
  • 2. Regularization Approaches
    - Ridge Regression
    - Lasso
    - Lasso vs Ridge

SLIDE 3
  • Repetition: Fundamental Problem

Red: Test error. Blue: Training error. (Hastie et al, 2008: 220)

SLIDE 4
  • Tuesday: Linear Models
[Figure: A fitted regression line illustrating the linear model, with intercept $\alpha$, slope $\beta = \Delta Y / \Delta X$, and one observation with $y_i = 2.45$, fitted value $\hat{y}_i = 1.85$, and residual $\hat{u}_i = 0.6$.]

SLIDE 5
  • Wednesday: Classification

(James et al, 2013: 140)

SLIDE 6
  • Thursday: Resampling

(James et al, 2013: 181)

SLIDE 7
  • Friday: Model Selection I

Subset Selection:

1. Generate an empty model and call it $M_0$.
2. For $k = 1, \dots, p$:
   i) Generate all $\binom{p}{k}$ possible models with $k$ explanatory variables.
   ii) Determine the model with the best criterion value (e.g. $R^2$) and call it $M_k$.
3. Determine the best model within the set of these models: $M_0, \dots, M_p$.

  • Rely on a criterion like AIC, BIC, $R^2$, $C_p$, or use cross-validation to estimate the test error (a minimal R sketch of the procedure follows below).
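A minimal base-R sketch of this procedure (illustrative only; the data frame `dat` and the outcome name "y" are placeholders, not from the slides):

## Best subset selection, illustrative sketch (base R).
## Assumes a data frame `dat` with outcome column `y`; all other columns
## are treated as the p candidate predictors.
best_subset <- function(dat, outcome = "y") {
  preds  <- setdiff(names(dat), outcome)
  p      <- length(preds)
  models <- vector("list", p + 1)
  models[[1]] <- lm(reformulate("1", response = outcome), data = dat)  # M0: intercept-only model
  for (k in 1:p) {
    combos <- combn(preds, k, simplify = FALSE)                        # all choose(p, k) subsets
    fits   <- lapply(combos, function(v)
      lm(reformulate(v, response = outcome), data = dat))
    r2     <- vapply(fits, function(f) summary(f)$r.squared, numeric(1))
    models[[k + 1]] <- fits[[which.max(r2)]]                           # Mk: best k-variable model by R^2
  }
  ## Step 3: compare M0, ..., Mp with a criterion that penalizes model size.
  models[[which.min(vapply(models, BIC, numeric(1)))]]
}

For more than a handful of predictors this brute-force loop becomes expensive; the leaps package's regsubsets() performs the same search far more efficiently.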

SLIDE 8
  • Regularization Approaches
SLIDE 9
  • Shrinkage Methods

Ridge regression and Lasso

  • The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
  • As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
  • It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

SLIDE 10
  • Regularization
  • Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \dots, \beta_p$ using the values that minimize
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = \mathrm{RSS}$$
  • In contrast, the regularization approach minimizes
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda f(\beta_j) = \mathrm{RSS} + \lambda f(\beta_j),$$
    where $\lambda \geq 0$ is a tuning parameter, to be determined separately (a tiny illustrative R function follows below).
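As an illustration of this criterion, a tiny R sketch (all names are placeholders, not from the slides; the penalty f is passed in, so the same function covers both ridge and the lasso):

## Penalized least squares criterion: RSS + lambda * f(beta).
## X: n x p predictor matrix, y: response vector, beta0: intercept,
## beta: p slope coefficients (all placeholders for illustration).
penalized_rss <- function(beta0, beta, X, y, lambda,
                          penalty = function(b) sum(b^2)) {  # default f: ridge penalty
  rss <- sum((y - beta0 - X %*% beta)^2)                     # residual sum of squares
  rss + lambda * penalty(beta)                               # the lasso would use penalty = function(b) sum(abs(b))
}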

SLIDE 11
  • Ridge Regression
  • Ridge Regression minimizes this expression:
    $$\underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{standard OLS criterion (RSS)}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{penalty}}$$
  • $\lambda$ is a tuning parameter, i.e. different values of $\lambda$ lead to different models and predictions.
  • When $\lambda$ is very big, the estimates get pushed towards 0.
  • When $\lambda$ is 0, ridge regression and OLS are identical.
  • We can find an optimal value for $\lambda$ by relying on cross-validation (a minimal ridge sketch with glmnet follows below).
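A minimal sketch of fitting ridge regression over a grid of lambda values with the glmnet package (x and y are placeholder objects: x a numeric predictor matrix, y the response; choosing lambda by cross-validation is sketched on the final slide):

library(glmnet)

grid <- 10^seq(4, -2, length = 100)                  # lambda grid, from heavy shrinkage to near-OLS
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 selects the ridge penalty
plot(ridge.mod, xvar = "lambda")                     # coefficient paths: estimates shrink towards 0 as lambda grows
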
SLIDE 12
  • Example: Credit data

[Figure: Standardized ridge regression coefficients for the Credit data (Income, Limit, Rating, Student), plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right), where $\|\hat{\beta}\|_2 = \sqrt{\sum_{j=1}^{p} \hat{\beta}_j^2}$. (James et al, 2013: 216)]

SLIDE 13
  • Ridge Regression: Details
  • Shrinkage is not applied to the model constant $\beta_0$; the estimate of the conditional mean should remain un-shrunk.
  • Ridge regression is an example of $\ell_2$ regularization:
  • $\ell_1$: $f(\beta_j) = \sum_{j=1}^{p} |\beta_j|$
  • $\ell_2$: $f(\beta_j) = \sum_{j=1}^{p} \beta_j^2$
  • Predictors are standardized before fitting:
    $$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$

SLIDE 14
  • Ridge regression: scaling of predictors
  • The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same.
  • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
  • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula (a short R sketch follows below)
    $$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
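A short sketch of this standardization in R (x is a placeholder numeric predictor matrix; note that glmnet performs this scaling internally by default via standardize = TRUE):

## Divide each column by its standard deviation computed with the 1/n formula,
## i.e. sqrt( (1/n) * sum( (x_ij - xbar_j)^2 ) ), matching the formula above.
standardize <- function(x) {
  apply(x, 2, function(col) col / sqrt(mean((col - mean(col))^2)))
}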

SLIDE 15
  • Why Does Ridge Regression Improve Over Least Squares?

[Figure: Mean squared error for ridge regression on simulated data, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right). (James et al, 2013: 218)]

  • Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients.
  • Squared bias (black), variance (green), and test mean squared error (purple).
  • The purple crosses indicate the ridge regression models for which the MSE is smallest.
  • OLS with p variables is low bias but high variance; shrinkage lowers variance at the price of bias.

SLIDE 16
  • The Lasso
  • Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
  • The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize this quantity:
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
  • In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum_j |\beta_j|$.

SLIDE 17
  • The Lasso: continued
  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
  • However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.
  • Hence, much like best subset selection, the lasso performs variable selection.
  • We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.
  • As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice (a short glmnet sketch follows below).
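A minimal glmnet sketch of this sparsity (x and y are placeholders; alpha = 1 selects the lasso penalty; the two lambda values shown are arbitrary and data-dependent):

library(glmnet)

lasso.mod <- glmnet(x, y, alpha = 1)   # alpha = 1: lasso penalty, fitted over a whole lambda path
coef(lasso.mod, s = 10)                # at a large lambda, many coefficients are exactly 0
coef(lasso.mod, s = 0.01)              # at a small lambda, few (or no) coefficients are zeroed out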

SLIDE 18
  • Example: Credit data

[Figure: Standardized lasso coefficients for the Credit data (Income, Limit, Rating, Student), plotted against $\lambda$ (left) and against $\|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1$ (right). (James et al, 2013: 220)]

SLIDE 19
  • Example: Baseball Data
[Figure: Lasso coefficient paths plotted against log(Lambda) (left) and cross-validated mean squared error plotted against log(Lambda) (right) for the baseball data; the numbers along the top axis give the number of non-zero coefficients at each value of lambda.]

SLIDE 20
  • Lasso Example 4

> lasso.pred <- predict(lasso.mod, s = cv.out$lambda.1se, newx = x[test, ])  # s expects lambda on its original scale, so the 1-SE lambda from cv.glmnet is passed directly
> plot(lasso.pred, y[test], ylim = c(0, 2500), xlim = c(0, 2500),
+      ylab = "True Value in Test Data", xlab = "Predicted Value in Test Data")
> abline(coef = c(0, 1), lty = 2)  # dashed 45-degree reference line

[Figure: true values plotted against predicted values in the test data, scattered around the dashed 45-degree line.]

SLIDE 21
  • Comparing the Lasso and Ridge Regression

[Figure: Mean squared error for the lasso on simulated data, plotted against $\lambda$ (left) and against $R^2$ on the training data (right).]

  • Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set.
  • Right: comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed).
  • Both are plotted against their $R^2$ on the training data, as a common form of indexing.
  • The crosses in both plots indicate the lasso model for which the MSE is smallest.

SLIDE 22
  • Comparing the Lasso and Ridge Regression: continued

[Figure: Mean squared error for the lasso, plotted against $\lambda$ (left) and against $R^2$ on the training data (right), for a simulation in which only two predictors are related to the response.]

  • Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data are similar to those on the previous slide, except that now only two predictors are related to the response.
  • Right: comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed).
  • Both are plotted against their $R^2$ on the training data, as a common form of indexing.
SLIDE 23
  • Why does Lasso shrink to exactly 0?
  • Intuition: the $\ell_1$ constraint region is a diamond with corners on the coordinate axes, so the constrained solution often lands on a corner where one or more coefficients are exactly zero; the circular $\ell_2$ constraint region of ridge regression has no such corners.
SLIDE 24
  • Ridge vs Lasso
  • Ridge is preferred when some features are (strongly) correlated; Lasso may only pick one of them.
  • Elastic net, combining Lasso and Ridge (a glmnet sketch follows below):
    $$\tilde{\beta} = \arg\min \Big\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big) \Big\}$$
    We now have two tuning parameters: $\alpha$ and $\lambda$.
  • Details: Hastie et al. 2008. The Elements of Statistical Learning. Springer.
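A minimal elastic net sketch with glmnet (x and y are placeholders). Note that glmnet parameterizes the penalty as $\lambda \big[ (1-\alpha)\|\beta\|_2^2 / 2 + \alpha \|\beta\|_1 \big]$, so its alpha argument plays the role of $1 - \alpha$ in the formula above; alpha = 0 gives ridge and alpha = 1 gives the lasso:

library(glmnet)

enet.cv <- cv.glmnet(x, y, alpha = 0.5)  # fix the mixing parameter, cross-validate over lambda
coef(enet.cv, s = "lambda.min")          # coefficients at the lambda with the smallest CV error

In practice alpha itself can also be chosen by repeating cv.glmnet over a small grid of alpha values.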

SLIDE 25
  • Take away message
  • The two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
  • In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
  • However, the number of predictors that is related to the response is never known a priori for real data sets.
  • A technique such as cross-validation can be used in order to determine which approach is better on a particular data set (a minimal glmnet sketch follows below).
  • Ridge can be expected to work better than Lasso if some features are highly correlated.
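A minimal sketch of such a comparison with glmnet (x and y are placeholders):

library(glmnet)

cv.ridge <- cv.glmnet(x, y, alpha = 0)  # cross-validated ridge
cv.lasso <- cv.glmnet(x, y, alpha = 1)  # cross-validated lasso
min(cv.ridge$cvm)                       # smallest CV mean squared error for ridge
min(cv.lasso$cvm)                       # smallest CV mean squared error for the lasso; prefer the lower value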

SLIDE 26
  • Selecting the Tuning Parameter for Ridge Regression and Lasso
  • For ridge regression and lasso we require a method to determine which of the models under consideration is best.
  • That is, we require a method for selecting a value of the tuning parameter $\lambda$.
  • Cross-validation provides a simple way to tackle this problem: we choose a grid of $\lambda$ values and compute the cross-validation error rate for each value of $\lambda$.
  • We then select the tuning parameter value for which the cross-validation error is smallest.
  • Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter (a minimal glmnet sketch follows below).
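A minimal glmnet sketch of this procedure (x and y are placeholders; alpha = 1 for the lasso, alpha = 0 for ridge):

library(glmnet)

grid    <- 10^seq(4, -2, length = 100)               # grid of candidate lambda values
cv.out  <- cv.glmnet(x, y, alpha = 1, lambda = grid) # k-fold CV error for each lambda
plot(cv.out)                                         # CV error curve across the grid
bestlam <- cv.out$lambda.min                         # lambda with the smallest CV error

final.mod <- glmnet(x, y, alpha = 1, lambda = grid)  # re-fit on all observations
coef(final.mod, s = bestlam)                         # coefficients at the selected lambda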
