Day 6: Model Selection II
Lucas Leemann, Essex Summer School, Introduction to Statistical Learning


  1. Day 6: Model Selection II. Lucas Leemann, Essex Summer School. Introduction to Statistical Learning.

  2. Outline: 1. Repetition Week 1; 2. Regularization Approaches: Ridge Regression, Lasso, Lasso vs. Ridge.

  3. Repetition: Fundamental Problem. [Figure: red: test error; blue: training error (Hastie et al., 2008: 220).]

  4. Tuesday: Linear Models. [Figure: scatterplot with fitted regression line; intercept $\alpha$, slope $\beta = \Delta Y / \Delta X$, and an example observation with $Y_i = 2.45$, fitted value $\hat{Y}_i = 1.85$, and residual $\hat{u}_i = 0.6$.]

  5. Wednesday: Classification. [Figure: (James et al., 2013: 140).]

  6. Thursday: Resampling. [Figure: (James et al., 2013: 181).]

  7. Friday: Model Selection I. Best subset selection (a sketch in R follows below):
     1. Start with the null model, $M_0$, which contains no explanatory variables.
     2. For k = 1, ..., p:
        i) fit all $\binom{p}{k}$ possible models with exactly k explanatory variables;
        ii) determine the model with the best criterion value (e.g. $R^2$) and call it $M_k$.
     3. Determine the best model within the set $M_0, \dots, M_p$, relying on a criterion such as AIC, BIC, adjusted $R^2$, or $C_p$, or use cross-validation to estimate the test error.
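  A minimal sketch of this procedure in R, assuming the leaps package and a data frame dat whose column y is the outcome (all names illustrative):

    library(leaps)
    fit  <- regsubsets(y ~ ., data = dat, nvmax = ncol(dat) - 1)  # best M_k for each size k
    summ <- summary(fit)
    which.min(summ$bic)              # model size preferred by BIC
    which.max(summ$adjr2)            # model size preferred by adjusted R^2
    coef(fit, which.min(summ$bic))   # coefficients of the BIC-preferred model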

  8. Regularization Approaches

  9. Shrinkage Methods: Ridge Regression and the Lasso
     • The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
     • As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
     • It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

  10. Regularization
     • Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \dots, \beta_p$ using the values that minimize
       $$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{J}\beta_j x_{ij}\Big)^2 = \text{RSS}$$
     • In contrast, the regularization approach minimizes
       $$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{J}\beta_j x_{ij}\Big)^2 + \lambda f(\beta_j) = \text{RSS} + \lambda f(\beta_j),$$
       where $\lambda \geq 0$ is a tuning parameter, to be determined separately.
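  A minimal R illustration of this penalized objective, assuming a predictor matrix X, a response vector y, and a generic penalty function f (all names illustrative, not from the slides):

    penalized_rss <- function(beta0, beta, X, y, lambda, f) {
      rss <- sum((y - beta0 - X %*% beta)^2)  # residual sum of squares
      rss + lambda * sum(f(beta))             # plus the penalty term lambda * f(beta_j)
    }
    # Ridge corresponds to f(b) = b^2, the lasso to f(b) = abs(b), e.g.
    # penalized_rss(0, c(1, -2), X, y, lambda = 0.5, f = function(b) b^2)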

  11. Ridge Regression
     • Ridge regression minimizes this expression:
       $$\underbrace{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{J}\beta_j x_{ij}\Big)^2}_{\text{standard OLS criterion}} + \underbrace{\lambda\sum_{j=1}^{J}\beta_j^2}_{\text{penalty}}$$
     • $\lambda$ is a tuning parameter, i.e. different values of $\lambda$ lead to different models and predictions.
     • When $\lambda$ is very big, the estimates get pushed towards 0.
     • When $\lambda$ is 0, ridge regression and OLS are identical.
     • We can find an optimal value for $\lambda$ by relying on cross-validation (see the sketch below).
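  A minimal sketch of ridge regression with the glmnet package (alpha = 0 gives the ridge penalty), assuming a numeric predictor matrix x and response vector y (names illustrative):

    library(glmnet)
    ridge.mod <- glmnet(x, y, alpha = 0)      # fits ridge over a whole grid of lambda values
    cv.ridge  <- cv.glmnet(x, y, alpha = 0)   # 10-fold cross-validation over that grid
    cv.ridge$lambda.min                       # lambda with the smallest estimated test error
    predict(ridge.mod, type = "coefficients", s = cv.ridge$lambda.min)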

  12. Example: Credit data. [Figure: standardized ridge regression coefficients for Income, Limit, Rating, and Student, plotted against $\lambda$ (left) and against $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (right), where $\|\beta\|_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2}$ (James et al., 2013: 216).]

  13. Ridge Regression: Details
     • Shrinkage is not applied to the model constant $\beta_0$: the model's estimate of the conditional mean should remain un-shrunk.
     • Ridge regression is an example of $\ell_2$ regularization:
       • $\ell_1$: $f(\beta_j) = \sum_{j=1}^{J}|\beta_j|$
       • $\ell_2$: $f(\beta_j) = \sum_{j=1}^{J}\beta_j^2$
     • Predictors are standardized before fitting: $\tilde{x}_{ij} = x_{ij}\Big/\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}$

  14. Ridge Regression: Scaling of Predictors
     • The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant c simply leads to a scaling of the least squares coefficient estimates by a factor of 1/c. In other words, regardless of how the j-th predictor is scaled, $X_j\hat\beta_j$ will remain the same.
     • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients in the penalty part of the ridge regression objective function.
     • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula
       $$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2}}$$
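  A short sketch of this standardization in R, assuming x is a numeric predictor matrix (names illustrative):

    sd_pop <- apply(x, 2, function(col) sqrt(mean((col - mean(col))^2)))  # 1/n denominator, as in the formula
    x_std  <- sweep(x, 2, sd_pop, "/")                                    # divide each column by its scale
    # Note: glmnet standardizes internally by default (standardize = TRUE), so this manual
    # step mainly matters when comparing coefficients across differently scaled inputs.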

  15. Why Does Ridge Regression Improve Over Least Squares? [Figure: left panel plotted against $\lambda$, right panel against $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$ (James et al., 2013: 218).]
     • Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients.
     • Squared bias (black), variance (green), and test mean squared error (purple).
     • The purple crosses indicate the ridge regression models for which the MSE is smallest.
     • OLS with p variables is low bias but high variance; shrinkage lowers variance at the price of bias.

  16. The Lasso
     • Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
     • The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat\beta^L_\lambda$, minimize the quantity
       $$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \text{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$$
     • In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum|\beta_j|$.

  17. The Lasso: Continued
     • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
     • However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.
     • Hence, much like best subset selection, the lasso performs variable selection.
     • We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.
     • As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice (a sketch follows below).
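  A hedged sketch of choosing $\lambda$ for the lasso by cross-validation with glmnet (alpha = 1), assuming a predictor matrix x, response y, and a training index vector train; this setup is assumed, not taken from the slides, though the object names echo those in the prediction code further below:

    library(glmnet)
    lasso.mod <- glmnet(x[train, ], y[train], alpha = 1)     # alpha = 1 gives the lasso penalty
    cv.out    <- cv.glmnet(x[train, ], y[train], alpha = 1)  # cross-validate over the lambda grid
    plot(cv.out)                                             # CV error as a function of log(lambda)
    cv.out$lambda.min                                        # lambda minimizing the CV error
    cv.out$lambda.1se                                        # largest lambda within one SE of the minimum
    coef(lasso.mod, s = cv.out$lambda.1se)                   # sparse: some coefficients are exactly zero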

  18. Example: Credit data. [Figure: standardized lasso coefficients for Income, Limit, Rating, and Student, plotted against $\lambda$ (left) and against $\|\hat\beta^L_\lambda\|_1 / \|\hat\beta\|_1$ (right) (James et al., 2013: 220).]

  19. Example: Baseball Data. [Figure: lasso coefficient paths plotted against log(Lambda) (top) and cross-validated mean squared error against log(Lambda) (bottom); the numbers along the top axes give the number of nonzero coefficients at each value of lambda.]

  20. Lasso Example 4

    > lasso.pred <- predict(lasso.mod, s = log(cv.out$lambda.1se), newx = x[test, ])
    > plot(lasso.pred, y[test], ylim = c(0, 2500), xlim = c(0, 2500),
           ylab = "True Value in Test Data", xlab = "Predicted Value in Test Data")
    > abline(coef = c(0, 1), lty = 2)

  [Figure: predicted vs. true values in the test data, with the 45-degree line (dashed) for reference.]
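  A possible follow-up step, not shown on the slide: computing the test-set mean squared error of these predictions, assuming the same lasso.pred, y, and test objects as above.

    mean((lasso.pred - y[test])^2)   # test-set mean squared error of the lasso fit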

  21. Comparing the Lasso and Ridge Regression. [Figure: left panel indexed by $\lambda$, right panel indexed by $R^2$ on the training data.]
     • Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set.
     • Right: comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed).
     • Both are plotted against their $R^2$ on the training data, as a common form of indexing.
     • The crosses in both plots indicate the lasso model for which the MSE is smallest.
