SLIDE 1
  • Day 6: Model Selection II

Lucas Leemann

Essex Summer School

Introduction to Statistical Learning

SLIDE 2
  • 1. Repetition Week 1
  • 2. Regularization Approaches
    - Ridge Regression
    - Lasso
    - Lasso vs Ridge

SLIDE 3
  • Repetition: Fundamental Problem

Red: Test error. Blue: Training error. (Hastie et al, 2008: 220)

SLIDE 4
  • Tuesday: Linear Models
[Figure: A fitted regression line illustrating the linear model, with intercept $\alpha$, slope $\beta = \Delta Y / \Delta X$, and one observation with $y_i = 2.45$, fitted value $\hat{y}_i = 1.85$, and residual $\hat{u}_i = 0.6$.]

SLIDE 5
  • Wednesday: Classification

(James et al, 2013: 140)

SLIDE 6
  • Thursday: Resampling

(James et al, 2013: 181)

SLIDE 7
  • Friday: Model Selection I

Subset Selection:

1. Generate an empty model and call it $M_0$.
2. For $k = 1, \dots, p$:
   i) Generate all $\binom{p}{k}$ possible models with $k$ explanatory variables.
   ii) Determine the model with the best criterion value (e.g. $R^2$) and call it $M_k$.
3. Determine the best model within the set of these models: $M_0, \dots, M_p$.

  • Rely on a criterion like AIC, BIC, $R^2$, $C_p$, or use cross-validation to estimate the test error (a minimal R sketch of the procedure follows below).
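A minimal base-R sketch of this procedure (illustrative only; the data frame `dat` and the outcome name "y" are placeholders, not from the slides):

## Best subset selection, illustrative sketch (base R).
## Assumes a data frame `dat` with outcome column `y`; all other columns
## are treated as the p candidate predictors.
best_subset <- function(dat, outcome = "y") {
  preds  <- setdiff(names(dat), outcome)
  p      <- length(preds)
  models <- vector("list", p + 1)
  models[[1]] <- lm(reformulate("1", response = outcome), data = dat)  # M0: intercept-only model
  for (k in 1:p) {
    combos <- combn(preds, k, simplify = FALSE)                        # all choose(p, k) subsets
    fits   <- lapply(combos, function(v)
      lm(reformulate(v, response = outcome), data = dat))
    r2     <- vapply(fits, function(f) summary(f)$r.squared, numeric(1))
    models[[k + 1]] <- fits[[which.max(r2)]]                           # Mk: best k-variable model by R^2
  }
  ## Step 3: compare M0, ..., Mp with a criterion that penalizes model size.
  models[[which.min(vapply(models, BIC, numeric(1)))]]
}

For more than a handful of predictors this brute-force loop becomes expensive; the leaps package's regsubsets() performs the same search far more efficiently.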

SLIDE 8
  • Regularization Approaches
SLIDE 9
  • Shrinkage Methods

Ridge regression and Lasso

  • The subset selection methods use least squares to fit a linear model that contains a subset of the predictors.
  • As an alternative, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero.
  • It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

SLIDE 10
  • Regularization
  • Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \dots, \beta_p$ using the values that minimize
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 = \mathrm{RSS}$$
  • In contrast, the regularization approach minimizes
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda f(\beta_j) = \mathrm{RSS} + \lambda f(\beta_j),$$
    where $\lambda \geq 0$ is a tuning parameter, to be determined separately (a tiny illustrative R function follows below).
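As an illustration of this criterion, a tiny R sketch (all names are placeholders, not from the slides; the penalty f is passed in, so the same function covers both ridge and the lasso):

## Penalized least squares criterion: RSS + lambda * f(beta).
## X: n x p predictor matrix, y: response vector, beta0: intercept,
## beta: p slope coefficients (all placeholders for illustration).
penalized_rss <- function(beta0, beta, X, y, lambda,
                          penalty = function(b) sum(b^2)) {  # default f: ridge penalty
  rss <- sum((y - beta0 - X %*% beta)^2)                     # residual sum of squares
  rss + lambda * penalty(beta)                               # the lasso would use penalty = function(b) sum(abs(b))
}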

SLIDE 11
  • Ridge Regression
  • Ridge Regression minimizes this expression:
    $$\underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2}_{\text{standard OLS criterion (RSS)}} + \underbrace{\lambda \sum_{j=1}^{p} \beta_j^2}_{\text{penalty}}$$
  • $\lambda$ is a tuning parameter, i.e. different values of $\lambda$ lead to different models and predictions.
  • When $\lambda$ is very big, the estimates get pushed towards 0.
  • When $\lambda$ is 0, ridge regression and OLS are identical.
  • We can find an optimal value for $\lambda$ by relying on cross-validation (a minimal ridge sketch with glmnet follows below).
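A minimal sketch of fitting ridge regression over a grid of lambda values with the glmnet package (x and y are placeholder objects: x a numeric predictor matrix, y the response; choosing lambda by cross-validation is sketched on the final slide):

library(glmnet)

grid <- 10^seq(4, -2, length = 100)                  # lambda grid, from heavy shrinkage to near-OLS
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)  # alpha = 0 selects the ridge penalty
plot(ridge.mod, xvar = "lambda")                     # coefficient paths: estimates shrink towards 0 as lambda grows
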
SLIDE 12
  • Example: Credit data

[Figure: Standardized ridge regression coefficients for the Credit data (Income, Limit, Rating, Student), plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right), where $\|\hat{\beta}\|_2 = \sqrt{\sum_{j=1}^{p} \hat{\beta}_j^2}$. (James et al, 2013: 216)]

SLIDE 13
  • Ridge Regression: Details
  • Shrinkage is not applied to the model constant $\beta_0$; the estimate of the conditional mean should remain un-shrunk.
  • Ridge regression is an example of $\ell_2$ regularization:
  • $\ell_1$: $f(\beta_j) = \sum_{j=1}^{p} |\beta_j|$
  • $\ell_2$: $f(\beta_j) = \sum_{j=1}^{p} \beta_j^2$
  • Predictors are standardized before fitting:
    $$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$

SLIDE 14
  • Ridge regression: scaling of predictors
  • The standard least squares coefficient estimates are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$. In other words, regardless of how the $j$th predictor is scaled, $X_j \hat{\beta}_j$ will remain the same.
  • In contrast, the ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant, due to the sum of squared coefficients term in the penalty part of the ridge regression objective function.
  • Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula (a short R sketch follows below)
    $$\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}}$$
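A short sketch of this standardization in R (x is a placeholder numeric predictor matrix; note that glmnet performs this scaling internally by default via standardize = TRUE):

## Divide each column by its standard deviation computed with the 1/n formula,
## i.e. sqrt( (1/n) * sum( (x_ij - xbar_j)^2 ) ), matching the formula above.
standardize <- function(x) {
  apply(x, 2, function(col) col / sqrt(mean((col - mean(col))^2)))
}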

SLIDE 15
  • Why Does Ridge Regression Improve Over Least Squares?

[Figure: Mean squared error for ridge regression on simulated data, plotted against $\lambda$ (left) and against $\|\hat{\beta}^R_\lambda\|_2 / \|\hat{\beta}\|_2$ (right). (James et al, 2013: 218)]

  • Simulated data with n = 50 observations, p = 45 predictors, all having nonzero coefficients.
  • Squared bias (black), variance (green), and test mean squared error (purple).
  • The purple crosses indicate the ridge regression models for which the MSE is smallest.
  • OLS with p variables is low bias but high variance; shrinkage lowers variance at the price of bias.

SLIDE 16
  • The Lasso
  • Ridge regression does have one obvious disadvantage: unlike subset selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model.
  • The Lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize this quantity:
    $$\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$
  • In statistical parlance, the lasso uses an $\ell_1$ (pronounced "ell 1") penalty instead of an $\ell_2$ penalty. The $\ell_1$ norm of a coefficient vector $\beta$ is given by $\|\beta\|_1 = \sum_j |\beta_j|$.

SLIDE 17
  • The Lasso: continued
  • As with ridge regression, the lasso shrinks the coefficient estimates towards zero.
  • However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large.
  • Hence, much like best subset selection, the lasso performs variable selection.
  • We say that the lasso yields sparse models, that is, models that involve only a subset of the variables.
  • As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice (a short glmnet sketch follows below).
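A minimal glmnet sketch of this sparsity (x and y are placeholders; alpha = 1 selects the lasso penalty; the two lambda values shown are arbitrary and data-dependent):

library(glmnet)

lasso.mod <- glmnet(x, y, alpha = 1)   # alpha = 1: lasso penalty, fitted over a whole lambda path
coef(lasso.mod, s = 10)                # at a large lambda, many coefficients are exactly 0
coef(lasso.mod, s = 0.01)              # at a small lambda, few (or no) coefficients are zeroed out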

SLIDE 18
  • Example: Credit data

[Figure: Standardized lasso coefficients for the Credit data (Income, Limit, Rating, Student), plotted against $\lambda$ (left) and against $\|\hat{\beta}^L_\lambda\|_1 / \|\hat{\beta}\|_1$ (right). (James et al, 2013: 220)]

SLIDE 19
  • Example: Baseball Data
[Figure: Lasso coefficient paths plotted against log(Lambda) (left) and cross-validated mean squared error plotted against log(Lambda) (right) for the baseball data; the numbers along the top axis give the number of non-zero coefficients at each value of lambda.]

SLIDE 20
  • Lasso Example 4

> lasso.pred <- predict(lasso.mod, s = cv.out$lambda.1se, newx = x[test, ])  # s expects lambda on its original scale, so the 1-SE lambda from cv.glmnet is passed directly
> plot(lasso.pred, y[test], ylim = c(0, 2500), xlim = c(0, 2500),
+      ylab = "True Value in Test Data", xlab = "Predicted Value in Test Data")
> abline(coef = c(0, 1), lty = 2)  # dashed 45-degree reference line

[Figure: true values plotted against predicted values in the test data, scattered around the dashed 45-degree line.]

SLIDE 21
  • Comparing the Lasso and Ridge Regression

[Figure: Mean squared error for the lasso on simulated data, plotted against $\lambda$ (left) and against $R^2$ on the training data (right).]

  • Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso on a simulated data set.
  • Right: comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed).
  • Both are plotted against their $R^2$ on the training data, as a common form of indexing.
  • The crosses in both plots indicate the lasso model for which the MSE is smallest.

SLIDE 22
  • Comparing the Lasso and Ridge Regression: continued

[Figure: Mean squared error for the lasso, plotted against $\lambda$ (left) and against $R^2$ on the training data (right), for a simulation in which only two predictors are related to the response.]

  • Left: plots of squared bias (black), variance (green), and test MSE (purple) for the lasso. The simulated data are similar to those on the previous slide, except that now only two predictors are related to the response.
  • Right: comparison of squared bias, variance, and test MSE between lasso (solid) and ridge (dashed).
  • Both are plotted against their $R^2$ on the training data, as a common form of indexing.
SLIDE 23
  • Why does Lasso shrink to exactly 0?
  • Intuition: the $\ell_1$ constraint region is a diamond with corners on the coordinate axes, so the constrained solution often lands on a corner where one or more coefficients are exactly zero; the circular $\ell_2$ constraint region of ridge regression has no such corners.
SLIDE 24
  • Ridge vs Lasso
  • Ridge is preferred when some features are (strongly) correlated; Lasso may only pick one of them.
  • Elastic net, combining Lasso and Ridge (a glmnet sketch follows below):
    $$\tilde{\beta} = \arg\min \Big\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big) \Big\}$$
    We now have two tuning parameters: $\alpha$ and $\lambda$.
  • Details: Hastie et al. 2008. The Elements of Statistical Learning. Springer.
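A minimal elastic net sketch with glmnet (x and y are placeholders). Note that glmnet parameterizes the penalty as $\lambda \big[ (1-\alpha)\|\beta\|_2^2 / 2 + \alpha \|\beta\|_1 \big]$, so its alpha argument plays the role of $1 - \alpha$ in the formula above; alpha = 0 gives ridge and alpha = 1 gives the lasso:

library(glmnet)

enet.cv <- cv.glmnet(x, y, alpha = 0.5)  # fix the mixing parameter, cross-validate over lambda
coef(enet.cv, s = "lambda.min")          # coefficients at the lambda with the smallest CV error

In practice alpha itself can also be chosen by repeating cv.glmnet over a small grid of alpha values.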

SLIDE 25
  • Take away message
  • The two examples illustrate that neither ridge regression nor the lasso will universally dominate the other.
  • In general, one might expect the lasso to perform better when the response is a function of only a relatively small number of predictors.
  • However, the number of predictors that is related to the response is never known a priori for real data sets.
  • A technique such as cross-validation can be used in order to determine which approach is better on a particular data set (a minimal glmnet sketch follows below).
  • Ridge can be expected to work better than Lasso if some features are highly correlated.
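A minimal sketch of such a comparison with glmnet (x and y are placeholders):

library(glmnet)

cv.ridge <- cv.glmnet(x, y, alpha = 0)  # cross-validated ridge
cv.lasso <- cv.glmnet(x, y, alpha = 1)  # cross-validated lasso
min(cv.ridge$cvm)                       # smallest CV mean squared error for ridge
min(cv.lasso$cvm)                       # smallest CV mean squared error for the lasso; prefer the lower value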

SLIDE 26
  • Selecting the Tuning Parameter for Ridge Regression and Lasso
  • For ridge regression and lasso we require a method to determine which of the models under consideration is best.
  • That is, we require a method for selecting a value of the tuning parameter $\lambda$.
  • Cross-validation provides a simple way to tackle this problem: we choose a grid of $\lambda$ values and compute the cross-validation error rate for each value of $\lambda$.
  • We then select the tuning parameter value for which the cross-validation error is smallest.
  • Finally, the model is re-fit using all of the available observations and the selected value of the tuning parameter (a minimal glmnet sketch follows below).
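A minimal glmnet sketch of this procedure (x and y are placeholders; alpha = 1 for the lasso, alpha = 0 for ridge):

library(glmnet)

grid    <- 10^seq(4, -2, length = 100)               # grid of candidate lambda values
cv.out  <- cv.glmnet(x, y, alpha = 1, lambda = grid) # k-fold CV error for each lambda
plot(cv.out)                                         # CV error curve across the grid
bestlam <- cv.out$lambda.min                         # lambda with the smallest CV error

final.mod <- glmnet(x, y, alpha = 1, lambda = grid)  # re-fit on all observations
coef(final.mod, s = bestlam)                         # coefficients at the selected lambda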
