

SLIDE 1

Variable Selection Using Elastic Net

A Gentle Introduction to Penalized Regression

Mohamad Hindawi, PhD, FCAS

SLIDE 2

Antitrust Notice

  • The Casualty Actuarial Society is committed to adhering strictly to the letter and spirit of the antitrust laws. Seminars conducted under the auspices of the CAS are designed solely to provide a forum for the expression of various points of view on topics described in the programs or agendas for such meetings.

  • Under no circumstances shall CAS seminars be used as a means for competing companies or firms to reach any understanding, expressed or implied, that restricts competition or in any way impairs the ability of members to exercise independent business judgment regarding matters affecting competition.

  • It is the responsibility of all seminar participants to be aware of antitrust regulations, to prevent any written or verbal discussions that appear to violate these laws, and to adhere in every respect to the CAS antitrust compliance policy.

SLIDE 3

Have you ever…

  • …needed to build a realistic model without enough data?
  • …wanted to keep highly correlated variables that capture different characteristics in your model?
  • …had highly correlated variables that made your model unstable? (Was it easy to find the source of the problem?)
  • …had hundreds or thousands of highly redundant predictors to consider?
  • …felt you had too little time to build a model?

You came to the right place!

SLIDE 4

Agenda

  • The variable selection problem
    • Classic variable selection tools
    • Challenges
  • Introduction to penalized regression
    • Ridge regression
    • LASSO
    • Elastic Net
    • Extension to GLM
  • Appendix
    • Close relatives of LASSO and Elastic Net
    • Bayesian interpretation of penalized regression

SLIDE 5

Goals of predictive modeling

  • The goal is to build a model that ensures accurate prediction on future data
  • How:
    • Choose the correct model structure
    • Choose variables that are predictive
    • Obtain the coefficients
  • Many techniques:
    • Linear regression
    • GLM
    • Survival analysis (Cox's partial likelihood)
    • …and many more!
  • Variable selection:
    • Recover the true non-zero variables
    • Estimate coefficients close to their true values

SLIDE 6

Classic variable selection tools: Exhaustive methods

  • Brute-force search
    • For each k ∈ {1, 2, …, p}, find the "best" subset of variables of size k
    • For example: the subset with the smallest residual sum of squares (RSS)
  • Choosing k can be done using:
    • AIC
    • Cross-validation
  • You do not need to examine all possible subsets
    • "Leaps and bounds" technique by Furnival and Wilson (1974)
  • Still not practical beyond a small number of variables, even on small datasets (see the sketch below)
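A minimal R sketch of this search via the leaps package, which implements the Furnival-Wilson leaps-and-bounds algorithm; the simulated data frame dat is hypothetical, chosen to match the examples later in the deck:

    library(leaps)

    # Hypothetical data: 500 rows, 10 candidate predictors
    set.seed(1)
    dat <- data.frame(matrix(rnorm(500 * 10), 500, 10))
    dat$y <- 4*dat$X1 + 3*dat$X2 + 2*dat$X3 + dat$X4 + rnorm(500)

    # regsubsets() performs best-subset search by leaps and bounds
    best <- regsubsets(y ~ ., data = dat, nvmax = 10)
    summary(best)$which   # the best subset of each size k
    summary(best)$rss     # residual sum of squares for each k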

SLIDE 7

Classic variable selection tools: Greedy algorithms

  • More constrained than exhaustive methods
  • Forward stepwise selection
    • Starts with the intercept, then sequentially adds to the model the predictor that most improves the fit
  • Backward stepwise selection
    • Starts with the full model and sequentially deletes the predictor that has the least impact on the fit
  • Hybrid stepwise selection
    • Considers both forward and backward moves at each step
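A minimal sketch of the three strategies with base R's step(), which selects by AIC; it reuses the hypothetical dat from the previous sketch:

    # Forward: start from the intercept, add the best predictor each step
    null <- lm(y ~ 1, data = dat)
    full <- lm(y ~ ., data = dat)
    fwd  <- step(null, scope = formula(full), direction = "forward")

    # Backward: start from the full model, drop the weakest predictor
    bwd  <- step(full, direction = "backward")

    # Hybrid: consider both adding and dropping at each step
    both <- step(null, scope = formula(full), direction = "both")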

SLIDE 8

Challenges

  • A discrete process: variables are either retained or discarded, but nothing in between
  • Issues:
    • Unstable: small changes in the data produce changes in the chosen variables
    • Models built this way usually exhibit low prediction accuracy on future data
    • Computationally prohibitive when the number of predictors is large

SLIDE 9

Challenges

  • The classic tools severely limit the number of variables to include in a model, especially for models built on small datasets
    • Certain lines of business: boat, motorcycle, GL
    • Certain types of models: fraud models, retention models
  • Problems:
    • Over-fitting
    • Under-fitting
    • …and don't forget multicollinearity
  • Many regularization techniques provide a "more democratic" and smoother version of variable selection

SLIDE 10

Quick review of linear models

  • Target variable (y)
    • Profitability (pure premium, loss ratio)
    • Retention
    • Fraudulent claims
  • Predictive variables {x₁, x₂, …, x_p}
    • "Covariates" used to make predictions
    • Policy age, credit, vehicle type, etc.
  • Model structure:

    y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

  • The ordinary least squares (OLS) solution is given by:

    \hat{\beta}^{OLS} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

SLIDE 11

Penalization methods

  • Generally, a penalized regression problem can be described as:

    \hat{\beta}^{penalized} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \cdot J(\beta_1, \ldots, \beta_p)

    where J(⋅) is a positive penalty for coefficients β₁, …, β_p not equal to zero

  • Unlike subset selection methods, penalization methods are:
    • More continuous
    • Somewhat shielded from high variability
  • All methods shrink coefficients toward zero
  • Some methods also do variable selection

SLIDE 12

The classic bias-variance trade-off

  • Penalized regression produces coefficient estimates that are biased
  • The common dilemma: a reduction in variance at the price of increased bias

    MSE(\hat{\beta}) = Var(\hat{\beta}) + Bias(\hat{\beta})^2

  • If bias is a concern, use penalized regression to choose the variables and then fit an unpenalized model on them
  • Use cross-validation to see which method works better

SLIDE 13

Penalization methods

    \hat{\beta}^{penalized} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \cdot J(\beta_1, \ldots, \beta_p)

  • Different methods use different penalty functions:
    • Ridge regression: the L2 penalty
    • LASSO: the L1 penalty
    • Elastic Net: a combination of L1 and L2
  • To use penalized regression, the data needs to be normalized (see the sketch below):
    • Center y around zero
    • Center each x_j around zero and standardize it to have SD = 1
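A minimal sketch of this normalization step, assuming a hypothetical design matrix X and response y; note that glmnet, used later in the deck, performs the column standardization internally by default:

    y_centered <- y - mean(y)                        # center the response around zero
    X_std <- scale(X, center = TRUE, scale = TRUE)   # center each column, SD = 1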

SLIDE 14

Ridge regression

  • Ridge regression uses the L2 penalty function, i.e. a "sum of squares" penalty:

    \hat{\beta}^{Ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

  • Used to penalize large parameters
  • λ is a tuning parameter; for every λ there is a solution

SLIDE 15

Ridge regression

  • An equivalent way to write the ridge problem:

    \hat{\beta}^{Ridge} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t

  • Ridge regression shrinks parameters, but never forces any to be zero

[Figure: the unconstrained OLS solution, the ridge solution, and the sphere of radius t that constrains the domain of the ridge solution]

SLIDE 16

Ridge regression example using R

  • Simulated data with 10 variables and 500 observations
  • True model:

    y = 4 x_1 + 3 x_2 + 2 x_3 + x_4

  • Fit using lm.ridge from the MASS package in R (a sketch follows)

[Figure: ridge regression coefficient paths, t(x$coef) plotted against x$lambda for λ between 200 and 1000]
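A sketch of this example under the stated assumptions (the simulated data and seed are mine); the matplot call reproduces the slide's axes, x$lambda against t(x$coef):

    library(MASS)

    set.seed(1)
    n <- 500; p <- 10
    X <- matrix(rnorm(n * p), n, p)
    y <- 4*X[, 1] + 3*X[, 2] + 2*X[, 3] + X[, 4] + rnorm(n)
    dat <- data.frame(y, X)

    # One ridge fit for each value of lambda on a grid
    fit <- lm.ridge(y ~ ., data = dat, lambda = seq(0, 1000, by = 10))

    # Coefficient paths: each curve shows one coefficient shrinking as lambda grows
    matplot(fit$lambda, t(fit$coef), type = "l",
            xlab = "lambda", ylab = "coefficient")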

SLIDE 17

How to choose the tuning parameter λ?

  • Use cross-validation
  • How it works:
    • Randomly divide the data into N equal pieces ("folds")
    • For each piece, estimate the model from the other N − 1 pieces
    • Test the model fit (e.g., sum of squared errors) on the remaining piece
    • Add up the N sums of squared errors
    • Plot the total vs. λ
  • Recommendation: if possible, use separate years of data as the folds (see the sketch below)

[Figure: the data divided into five folds, one held out for testing and the rest used for training]
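A sketch of this procedure with cv.glmnet, reusing the X and y from the ridge sketch above (alpha = 0 gives ridge, alpha = 1 the LASSO); the foldid argument supports the year-based folds recommended above, with policy_year here being a hypothetical column:

    library(glmnet)

    cvfit <- cv.glmnet(X, y, alpha = 0, nfolds = 5)  # 5-fold CV for ridge
    plot(cvfit)         # mean-squared error vs. log(lambda)
    cvfit$lambda.min    # lambda with the smallest CV error
    cvfit$lambda.1se    # largest lambda within one SE of the minimum

    # Hypothetical year-based folds: supply your own fold assignment
    # cvfit <- cv.glmnet(X, y, foldid = as.integer(factor(policy_year)))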

SLIDE 18

How to choose the tuning parameter λ?

[Figure: cross-validated mean-squared error plotted against log(λ)]

SLIDE 19

Simple example: Ridge regression and multicollinearity

  • Ridge regression controls well for multicollinearity
    • It deals well with high correlations among predictors
  • Simple example:
    • True model: y = 2 + x₁
    • Assume x₂ is another variable such that x₂ = x₁
    • Notice that y = 2 + β₁·x₁ + (1 − β₁)·x₂ is an equivalent linear model for any β₁
    • Ridge regression fits the data while minimizing β₁² + β₂², so it splits the coefficient as equally as possible between the two variables:

      y = 2 + ½ x₁ + ½ x₂
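A small sketch of this splitting behavior, assuming two perfectly collinear simulated predictors; OLS cannot separate them (one coefficient comes back aliased), while ridge splits the coefficient roughly in half:

    library(MASS)

    set.seed(2)
    x1 <- rnorm(200)
    x2 <- x1                          # perfectly collinear copy
    y  <- 2 + x1 + rnorm(200, sd = 0.1)

    coef(lm(y ~ x1 + x2))                     # OLS: the x2 coefficient is NA (aliased)
    coef(lm.ridge(y ~ x1 + x2, lambda = 1))   # ridge: roughly 0.5 and 0.5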

SLIDE 20

Ridge regression summary

  • Uses the L2 penalty function
  • Shrinks all coefficients, but does not force any to be zero
  • Deals well with correlation between variables

SLIDE 21

LASSO

  • LASSO = Least Absolute Shrinkage and Selection Operator
  • Introduced by Tibshirani in 1996
  • Uses the L1 penalty function, i.e. a sum of absolute values:

    \hat{\beta}^{LASSO} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

  • As usual, the data needs to be normalized

SLIDE 22

LASSO

  • An equivalent way to write the LASSO problem:

    \hat{\beta}^{LASSO} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t

  • For every t, there is a unique solution
    • t → 0: constant model
    • t → ∞: OLS model

[Figure: the unconstrained OLS solution, the LASSO solution, and the cube of size t that constrains the domain of the LASSO solution]

SLIDE 23

LASSO

  • Example of LASSO domain in three dimensions

SLIDE 24

LASSO example using R

  • Simulated dataset with 10 variables and 500 observations
    • Corr(x_j, x_k) = 0.5 for every pair of predictors
  • True model:

    y = 4 x_1 + 3 x_2 + 2 x_3 + x_4

  • Fit using the "elasticnet" package in R (a sketch follows)
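A sketch of this fit under the stated assumptions (the equicorrelated simulation is mine); in elasticnet's enet(), the lambda argument is the quadratic penalty, so lambda = 0 gives the pure LASSO path:

    library(elasticnet)
    library(MASS)

    set.seed(3)
    n <- 500; p <- 10
    Sigma <- matrix(0.5, p, p); diag(Sigma) <- 1     # Corr(x_j, x_k) = 0.5
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    y <- 4*X[, 1] + 3*X[, 2] + 2*X[, 3] + X[, 4] + rnorm(n)

    fit <- enet(X, y, lambda = 0)   # lambda = 0: the LASSO
    plot(fit)                       # coefficient paths along the LASSO sequence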

SLIDE 25

LASSO example 2 using R

  • Fitting the LASSO curve for linear models is extremely fast
  • This example used 100k rows of simulated data and 100 variables

    LASSO sequence
    Computing X'X .....
    LARS Step 1: Variable 37 added
    LARS Step 2: Variable 12 added
    LARS Step 3: Variable 49 added
    LARS Step 4: Variable 82 added
    LARS Step 5: Variable 42 added
    LARS Step 6: Variable 19 added
    LARS Step 7: Variable 1 added
    LARS Step 8: Variable 7 added
    LARS Step 9: Variable 89 added
    LARS Step 10: Variable 22 added
    LARS Step 11: Variable 4 added
    LARS Step 12: Variable 50 added
    LARS Step 13: Variable 23 added
    LARS Step 14: Variable 65 added
    LARS Step 15: Variable 72 added
    LARS Step 16: Variable 60 added
    LARS Step 17: Variable 44 added
    LARS Step 18: Variable 94 added
    LARS Step 19: Variable 61 added
    LARS Step 20: Variable 55 added
    LARS Step 21: Variable 48 added
    LARS Step 22: Variable 79 added
    LARS Step 23: Variable 70 added
    LARS Step 24: Variable 81 added
    LARS Step 25: Variable 97 added
    LARS Step 26: Variable 17 added
    .......

SLIDE 26

Simple illustration: Orthonormal design matrix

  • The expressions on this slide hold only when XᵀX = I, i.e. the design matrix is orthonormal
  • Best subset selection of size k: keep the k largest coefficients in absolute value and set the rest to zero (hard thresholding)

    \hat{\beta}_j^{subset} = \hat{\beta}_j^{OLS} \cdot I\big( |\hat{\beta}_j^{OLS}| > \lambda \big)

  • Ridge regression: shrink all coefficients by a constant factor

    \hat{\beta}_j^{Ridge} = \frac{1}{1 + \lambda} \, \hat{\beta}_j^{OLS}

  • LASSO: translate and truncate (soft thresholding)

    \hat{\beta}_j^{LASSO} = \mathrm{sign}\big( \hat{\beta}_j^{OLS} \big) \big( |\hat{\beta}_j^{OLS}| - \lambda \big)_+
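A small sketch of the three rules as R functions of the OLS coefficients; the function names are mine, for illustration:

    hard_threshold <- function(b_ols, lambda) b_ols * (abs(b_ols) > lambda)              # subset selection
    ridge_shrink   <- function(b_ols, lambda) b_ols / (1 + lambda)                       # ridge
    soft_threshold <- function(b_ols, lambda) sign(b_ols) * pmax(abs(b_ols) - lambda, 0) # LASSO

    b <- c(-3, -0.5, 0.2, 1, 4)
    rbind(subset = hard_threshold(b, 1),
          ridge  = ridge_shrink(b, 1),
          lasso  = soft_threshold(b, 1))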

SLIDE 27

LASSO summary

  • Uses the L1 penalty function
  • Sets some coefficients to zero and shrinks the rest
  • If high correlations among predictors exist, the performance of the LASSO is dominated by Ridge regression (Tibshirani, 1996)
  • If there is a group of variables among which the pairwise correlations are very high, the LASSO tends to select only one variable from the group and does not care which one is selected

Is there a compromise between Ridge regression and LASSO?

SLIDE 28

First attempt to compromise between Ridge and LASSO

  • Use the Lq penalty function for 1 < q < 2:

    \hat{\beta}^{L_q} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j|^q \le t

SLIDE 29

“Naive” Elastic Net

  • Introduced by Zou and Hastie (2005) with a sum of L1 and L2 penalty functions:

    \hat{\beta}^{naive\;ENet} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2

  • The linear (L1) term of the penalty forces certain coefficients to be zero
  • The quadratic (L2) term of the penalty:
    • Relaxes the limitation on the number of selected variables
    • Encourages a grouping effect
    • Stabilizes the L1 regularization path and hence improves prediction

SLIDE 30

“Naive” Elastic Net

  • An equivalent way to write the Elastic Net problem:

    \hat{\beta}^{naive\;ENet} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad (1 - \alpha) \sum_{j=1}^{p} |\beta_j| + \alpha \sum_{j=1}^{p} \beta_j^2 \le t

  • Strict convexity guarantees the grouping effect even in the extreme situation of identical predictors

[Figure: the Elastic Net constraint region; the singularities at the vertices result in a sparse Elastic Net solution]

SLIDE 31

Deficiencies of naive Elastic Net

  • While it overcomes the limitations of both LASSO and Ridge regression, the naive Elastic Net does not perform satisfactorily unless it is close to one of them
  • The naive Elastic Net is a two-stage procedure:
    • Step 1: for each fixed λ₂, find the ridge regression coefficients
    • Step 2: apply LASSO-type shrinkage along the LASSO solution path
  • This amounts to double shrinkage, which does not help to reduce the variance and introduces extra bias

SLIDE 32

Moving from naiveté

  • The Elastic Net rescales the naive Elastic Net coefficients:

    \hat{\beta}^{ENet} = (1 + \lambda_2) \cdot \hat{\beta}^{naive\;ENet}

  • The Elastic Net:
    • Does automatic variable selection
    • Does continuous shrinkage
    • Handles multicollinearity
  • Similar to the previous example, when x₁ = x₂ the Elastic Net will include both variables
  • You could include all the variables desired in the initial model without worrying about multicollinearity or near-aliasing

Like a fishing net, the Elastic Net retains all the "big fish"

SLIDE 33

A simple illustration: Elastic Net vs. LASSO

  • Two independent "hidden" variables Z₁ and Z₂:

    Z_1 \sim U(0, 20) \quad \text{and} \quad Z_2 \sim U(0, 20)

  • Generate the response vector: y = Z₁ + 0.1·Z₂ + N(0, 1)
  • Suppose that the only predictors observed are:

    x_1 = Z_1 + \epsilon_1, \quad x_2 = -Z_1 + \epsilon_2, \quad x_3 = Z_1 + \epsilon_3
    x_4 = Z_2 + \epsilon_4, \quad x_5 = -Z_2 + \epsilon_5, \quad x_6 = Z_2 + \epsilon_6

    where \epsilon_1, \ldots, \epsilon_6 \sim N(0, 1/16)

  • Fit the model on (X, y)
  • An "oracle" would identify x₁, x₂ and x₃ (the Z₁ group) as the most important variables, and none of the Z₂ group
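A sketch of this simulation under the setup above (sample size and seed are mine); in glmnet, alpha mixes the two penalties, with alpha = 1 the LASSO and smaller alpha adding more of the L2 grouping behavior:

    library(glmnet)

    set.seed(4)
    n  <- 100
    Z1 <- runif(n, 0, 20); Z2 <- runif(n, 0, 20)   # hidden variables
    y  <- Z1 + 0.1 * Z2 + rnorm(n)

    eps <- matrix(rnorm(n * 6, sd = 1/4), n, 6)    # Var = 1/16
    X   <- cbind(Z1, -Z1, Z1, Z2, -Z2, Z2) + eps   # observed predictors x1..x6

    lasso <- glmnet(X, y, alpha = 1)    # tends to pick one of x1, x2, x3
    enet  <- glmnet(X, y, alpha = 0.5)  # tends to keep the whole Z1 group
    plot(lasso, xvar = "lambda")
    plot(enet,  xvar = "lambda")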

SLIDE 34

Elastic Net vs. LASSO

SLIDE 35

Elastic Net vs. LASSO

SLIDE 36

Elastic Net vs. LASSO

  • The Elastic Net includes more non-zero coefficients than the LASSO, but with smaller magnitudes

[Figure: coefficient paths plotted against log(λ) for the LASSO and the Elastic Net; the numbers along the top of each panel count the non-zero coefficients at each λ]

SLIDE 37

Extension to GLM

  • A GLM consists of three elements:
    • A dependent variable (y) assumed to come from a probability distribution in the exponential family
    • A linear predictor η = Xβ
    • A link function g such that E(y) = μ = g⁻¹(η)
  • The coefficients β are estimated by solving a set of equations to satisfy the maximum likelihood criterion:

    \hat{\beta}^{MLE} = \arg\max_{\beta} L(y; \beta) \quad \text{or equivalently} \quad \hat{\beta}^{MLE} = \arg\min_{\beta} \, -\log L(y; \beta)

SLIDE 38

Extension to GLM

  • For penalized regression, the coefficients are obtained by solving:

    \hat{\beta}^{penalized} = \arg\min_{\beta} \, -\log L(y; \beta) + \lambda \cdot J(\beta)

  • The optimization problem is harder and slower to solve
  • The regularization path is piecewise smooth rather than piecewise linear
  • Many algorithms have been developed to solve this problem
    • Park and Hastie developed an algorithm that finds the points where variables are added and then uses a piecewise linear approximation between them

SLIDE 39

Software to fit LASSO and Elastic Net

  • Several packages are currently available in R, including:
    • glmnet
    • elasticnet
    • lars
    • penalized
  • Models that are currently available:
    • Linear regression models
    • Logistic regression models
    • Multinomial regression models
    • Poisson regression models
    • Cox models
    • Alas, no gamma model yet, but it may be coming soon!
  • Currently not available in most other programs
    • SAS implemented LASSO for linear models
    • PROC GLMSELECT can be used to implement the Elastic Net for linear models
SLIDE 40


Appendix

SLIDE 41


A few extensions and close relatives to LASSO and Elastic Net

SLIDE 42

Some other extensions

  • Group LASSO
  • Sparse group LASSO
  • Adaptive LASSO
  • Adaptive Elastic Net

SLIDE 43

Group LASSO

  • Introduced by Yuan and Lin (2007)
  • Variables might come in groups, so we need to include or exclude an entire group at once:

    \hat{\beta}^{group\;LASSO} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{g=1}^{G} \sqrt{p_g} \, \lVert \boldsymbol{\beta}_g \rVert_2

    where β_g is the coefficient vector of group g and p_g is the size of the group

  • An all-or-nothing approach
    • Does not allow individual levels to have zero coefficients

SLIDE 44

Sparse group LASSO

  • Introduced by Friedman, Hastie and Tibshirani (2010)
  • A compromise between the group LASSO and the LASSO:

    \hat{\beta}^{sparse\;group\;LASSO} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{g=1}^{G} \sqrt{p_g} \, \lVert \boldsymbol{\beta}_g \rVert_2 + \lambda_2 \lVert \beta \rVert_1

[Figure: constraint regions for the group LASSO, the LASSO, and the sparse group LASSO, where x₁ and x₂ belong to the same group]

SLIDE 45

Adaptive LASSO

  • The LASSO shrinks all coefficients by the same amount λ
  • It is more reasonable to shrink small coefficients more than large coefficients
  • The Adaptive LASSO does exactly that:

    \hat{\beta}^{adaptive\;LASSO} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j|

    where the weights w_j are larger for coefficients believed to be small (e.g., w_j = |\hat{\beta}_j^{OLS}|^{-\delta})

  • The Adaptive LASSO exhibits oracle properties
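A sketch of the Adaptive LASSO via glmnet's penalty.factor argument, building the weights from hypothetical pilot OLS estimates with δ = 1:

    library(glmnet)

    pilot <- coef(lm(y ~ X))[-1]     # pilot OLS estimates (drop the intercept)
    w     <- 1 / abs(pilot)          # adaptive weights, delta = 1

    afit <- glmnet(X, y, alpha = 1, penalty.factor = w)  # weighted L1 penalty
    plot(afit, xvar = "lambda")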

SLIDE 46

Adaptive Elastic Net

  • The Adaptive Elastic Net is the analogous variation of the Adaptive LASSO:

    \hat{\beta}^{adaptive\;ENet} = (1 + \lambda_2) \cdot \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda_1 \sum_{j=1}^{p} w_j |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \Big\}

    where w_j = \big| \hat{\beta}_j^{ENet} \big|^{-\delta}

SLIDE 47


Bayesian interpretation of penalized regression

SLIDE 48

Bayes Theorem

  • Bayes' rule:

    P(B \mid C) = \frac{P(C \mid B) \, P(B)}{P(C)}

  • In the regression context:

    P(\beta \mid y) \propto P(y \mid \beta) \, P(\beta)

  • "Posterior is proportional to prior times likelihood"
  • For OLS, we assume no prior knowledge about β

SLIDE 49

Bayesian interpretation of Ridge regression

  • In Ridge regression, we expect a priori that the parameters will be small
  • A reasonable prior distribution is normal with mean zero:

    P(\beta) \propto e^{-\frac{1}{2\tau^2} \lVert \beta \rVert_2^2}

  • Then the posterior probability is:

    P(\beta \mid y) \propto e^{-\frac{1}{2} \lVert y - X\beta \rVert_2^2 \, - \, \frac{1}{2\tau^2} \lVert \beta \rVert_2^2}

  • The posterior mode minimizes

    \lVert y - X\beta \rVert_2^2 + \frac{1}{\tau^2} \lVert \beta \rVert_2^2,

    which is the Ridge solution with λ = 1/τ²

SLIDE 50

Bayesian interpretation of LASSO and Elastic Net

  • For the LASSO, the prior is given by:

    P(\beta) \propto e^{-\frac{\lambda}{2} \lVert \beta \rVert_1}

  • For the Elastic Net, the prior is given by:

    P(\beta) \propto e^{-\frac{1}{2} \big( \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2 \big)}

SLIDE 51

Contact information

If you would like additional information or references for this presentation, please contact:

Mohamad Hindawi, PhD, FCAS
Towers Watson
175 Powder Forest Dr.
Weatogue, CT 06089
860.843.7134
Mohamad.Hindawi@towerswatson.com