Variable Selection Using Elastic Net
A Gentle Introduction to Penalized Regression
Mohamad Hindawi, PhD, FCAS
Antitrust Notice
The Casualty Actuarial Society is committed to adhering strictly to the letter and spirit of the antitrust laws. Seminars conducted under the auspices of the CAS are designed solely to provide a forum for the expression of various points of view on topics described in the programs or agendas for such meetings.

Under no circumstances shall CAS seminars be used as a means for competing companies or firms to reach any understanding, expressed or implied, that restricts competition or in any way impairs the ability of members to exercise independent business judgment regarding matters affecting competition.
It is the responsibility of all seminar participants to be aware of antitrust regulations, to prevent any written or verbal discussions that appear to violate these laws, and to adhere in every respect to the CAS antitrust compliance policy.
Have you ever had to build a model on a large dataset with many potential predictors of different characteristics? When something goes wrong, is it easy to find the source of the problem? Which variables should you even consider? You came to the right place!
Classical stepwise selection works one variable at a time: forward selection adds, at each step, the predictor that most improves the fit, while backward elimination removes, at each step, the predictor with the least impact on the fit.
Drawbacks: stepwise selection is a discrete process, a variable is either in the model or out of it with nothing in between, and small changes in the data can change the set of selected variables.
Shrinkage offers an alternative: it is particularly valuable for models built on small datasets, and it provides a smoother version of variable selection.
Start with ordinary least squares. The linear model is

$$z = \beta + \gamma_1 \cdot y_1 + \cdots + \gamma_q \cdot y_q$$

and, given $O$ observations, the coefficients are estimated by

$$\hat{\boldsymbol{\gamma}}^{\mathrm{OLS}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2$$
Penalized regression adds a penalty term to the least-squares objective:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Penalized}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu \cdot K\!\left(\gamma_1,\dots,\gamma_q\right)$$

where $K(\cdots)$ is a positive penalty for $\gamma_1,\dots,\gamma_q$ not equal to zero.
The penalty deliberately trades bias for variance: since $\mathrm{MSE} = \mathrm{Var}(\hat{\gamma}) + \mathrm{Bias}(\hat{\gamma})^2$, accepting a little bias can buy a large reduction in variance. An alternative is to use penalized regression only to choose the variables and then fit an unpenalized model on the selected set; which method works better depends on the application.
Recall the general penalized objective:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Penalized}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu \cdot K\!\left(\gamma_1,\dots,\gamma_q\right)$$

Different choices of the penalty $K$ give different estimators.
Ridge regression penalizes the sum of squared coefficients:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Ridge}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu \sum_{k=1}^{q}\gamma_k^2$$
Equivalently, as a constrained optimization problem:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Ridge}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 \quad \text{subject to} \quad \sum_{k=1}^{q}\gamma_k^2 \le u$$

Ridge shrinks all coefficients toward zero but never forces any of them to be exactly zero.

[Figure: the unconstrained OLS solution versus the ridge solution; a sphere of radius $u$ constrains the domain for the ridge solution.]
Simulated example: a dataset with 500 observations and many candidate variables, where the true model is

$$z = 4 \cdot y_1 + 3 \cdot y_2 + 2 \cdot y_3 + y_4$$

fit by ridge regression in R.

[Figure: "Ridge regression" coefficient paths, plotting t(x$coef) against x$lambda for penalties up to 1000.]
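A minimal sketch of this kind of experiment with MASS::lm.ridge; the variable count, noise level, and penalty grid are assumptions, since the slide shows only the resulting plot:

```r
library(MASS)  # provides lm.ridge

set.seed(1)
n <- 500; p <- 20                      # 500 observations; p = 20 is an assumed variable count
y <- matrix(rnorm(n * p), n, p)        # candidate predictors y1..yp
z <- 4 * y[, 1] + 3 * y[, 2] + 2 * y[, 3] + y[, 4] + rnorm(n)

# Fit ridge regression over a grid of penalties and plot the coefficient paths
x <- lm.ridge(z ~ y, lambda = seq(0, 1000, by = 10))
matplot(x$lambda, t(x$coef), type = "l",
        xlab = "x$lambda", ylab = "t(x$coef)", main = "Ridge regression")
```

As the penalty grows, all the paths shrink smoothly toward zero, and none of them is ever set exactly to zero.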
How should the penalty be chosen? By cross-validation.

[Figure: k-fold cross-validation; the data is split into folds ("Training / Testing / Training / Training / Training"), with each fold serving once as the testing set while the remaining folds are used for training.]
[Figure: cross-validated mean-squared error as a function of log(Lambda).]
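A hedged sketch of producing such a curve with cv.glmnet from the glmnet package (the package choice and simulation settings are assumptions; the slide shows only the plot):

```r
library(glmnet)

set.seed(1)
n <- 500; p <- 20
y <- matrix(rnorm(n * p), n, p)
z <- 4 * y[, 1] + 3 * y[, 2] + 2 * y[, 3] + y[, 4] + rnorm(n)

# 10-fold cross-validation for the ridge penalty (alpha = 0)
cv <- cv.glmnet(y, z, alpha = 0, nfolds = 10)
plot(cv)          # mean-squared error versus log(lambda)
cv$lambda.min     # the penalty minimizing cross-validated error
```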
What does ridge do with correlated predictors? Suppose the true model is $z = 2 + y_1$, and the data contains a second variable $y_2$ that is an identical copy of $y_1$. Ridge regression splits the coefficient equally between the two variables:

$$z = 2 + \tfrac{1}{2}\,y_1 + \tfrac{1}{2}\,y_2$$
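A quick numerical check of this splitting behavior, again with lm.ridge (the sample size and noise level are assumptions):

```r
library(MASS)

set.seed(1)
n <- 1000
y1 <- rnorm(n)
y2 <- y1                              # an identical copy of y1
z  <- 2 + y1 + rnorm(n, sd = 0.1)

# OLS cannot separate the aliased copies; ridge splits the coefficient
coef(lm.ridge(z ~ y1 + y2, lambda = 1))
# intercept near 2, with roughly 0.5 on each of y1 and y2
```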
The LASSO instead penalizes the sum of absolute coefficients:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{LASSO}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu \sum_{k=1}^{q}\left|\gamma_k\right|$$
Equivalently, as a constrained optimization problem:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{LASSO}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 \quad \text{subject to} \quad \sum_{k=1}^{q}\left|\gamma_k\right| \le u$$

[Figure: the unconstrained OLS solution versus the LASSO solution; a cube of size $u$ constrains the domain for the LASSO solution.]
The same simulated example (500 observations, true model $z = 4 \cdot y_1 + 3 \cdot y_2 + 2 \cdot y_3 + y_4$), this time fit using the "elasticnet" package in R.
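A minimal sketch with elasticnet::enet; setting the quadratic penalty lambda to 0 reduces the elastic net to the LASSO (the simulation settings are assumptions):

```r
library(elasticnet)

set.seed(1)
n <- 500; p <- 20
y <- matrix(rnorm(n * p), n, p)
z <- 4 * y[, 1] + 3 * y[, 2] + 2 * y[, 3] + y[, 4] + rnorm(n)

# lambda = 0 gives the pure LASSO path
fit <- enet(y, z, lambda = 0)
plot(fit)   # coefficient paths as the L1 constraint is relaxed
```

Unlike the ridge paths, coefficients here stay exactly zero until their variable enters the model.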
The LARS algorithm, which computes the entire LASSO path, is extremely fast, even on a dataset with 100 variables. A typical trace looks like this:

```
LASSO sequence
Computing X'X .....
LARS Step 1 : Variable 37 added
LARS Step 2 : Variable 12 added
LARS Step 3 : Variable 49 added
LARS Step 4 : Variable 82 added
LARS Step 5 : Variable 42 added
LARS Step 6 : Variable 19 added
LARS Step 7 : Variable 1 added
LARS Step 8 : Variable 7 added
LARS Step 9 : Variable 89 added
LARS Step 10 : Variable 22 added
LARS Step 11 : Variable 4 added
LARS Step 12 : Variable 50 added
LARS Step 13 : Variable 23 added
LARS Step 14 : Variable 65 added
LARS Step 15 : Variable 72 added
LARS Step 16 : Variable 60 added
LARS Step 17 : Variable 44 added
LARS Step 18 : Variable 94 added
LARS Step 19 : Variable 61 added
LARS Step 20 : Variable 55 added
LARS Step 21 : Variable 48 added
LARS Step 22 : Variable 79 added
LARS Step 23 : Variable 70 added
LARS Step 24 : Variable 81 added
LARS Step 25 : Variable 97 added
LARS Step 26 : Variable 17 added
.......
```
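A sketch of the call that produces a trace like this, using the lars package (the simulated data and its size are assumptions; the slide shows only the printed steps):

```r
library(lars)

set.seed(1)
n <- 10000; p <- 100                 # the number of observations is an assumption
y <- matrix(rnorm(n * p), n, p)
z <- y %*% rnorm(p) + rnorm(n)

# trace = TRUE prints a line each time a variable enters the LASSO path
fit <- lars(y, z, type = "lasso", trace = TRUE)
```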
In the special case when $Y^{T}Y = I$, i.e. the design matrix is orthonormal, all three estimators have closed forms in terms of the OLS coefficients; a small R illustration follows the list.

- Best subset selection keeps the coefficients with the largest absolute values and sets the rest to zero (hard thresholding): $\hat{\gamma}_k^{\mathrm{Subset}} = \hat{\gamma}_k^{\mathrm{OLS}} \cdot I\!\left(\left|\hat{\gamma}_k^{\mathrm{OLS}}\right| > \mu\right)$
- Ridge shrinks every coefficient by the same proportion: $\hat{\gamma}_k^{\mathrm{Ridge}} = \frac{1}{1+\mu}\,\hat{\gamma}_k^{\mathrm{OLS}}$
- The LASSO translates coefficients toward zero and truncates (soft thresholding): $\hat{\gamma}_k^{\mathrm{LASSO}} = \mathrm{sign}\!\left(\hat{\gamma}_k^{\mathrm{OLS}}\right)\left(\left|\hat{\gamma}_k^{\mathrm{OLS}}\right| - \mu\right)_{+}$
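The three rules in base R (the example coefficients and the shared tuning constant are illustrative assumptions):

```r
# Closed-form solutions in the orthonormal case, as functions of the OLS estimates
hard_threshold <- function(b_ols, mu) b_ols * (abs(b_ols) > mu)                # best subset
ridge_shrink   <- function(b_ols, mu) b_ols / (1 + mu)                         # ridge
soft_threshold <- function(b_ols, mu) sign(b_ols) * pmax(abs(b_ols) - mu, 0)   # LASSO

b_ols <- c(-3, -0.5, 0.2, 1, 4)
rbind(ols    = b_ols,
      subset = hard_threshold(b_ols, mu = 1),
      ridge  = ridge_shrink(b_ols, mu = 1),
      lasso  = soft_threshold(b_ols, mu = 1))
```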
Limitations of the LASSO:

- If there are high correlations among the predictors, the prediction performance of the LASSO is dominated by Ridge regression (Tibshirani, 1996).
- If the pairwise correlations within a group of variables are very high, then the LASSO tends to select only one variable from the group and does not care which one is selected.

Is there a compromise between Ridge regression and LASSO?
One natural idea is to penalize a general power of the coefficients:

$$\hat{\boldsymbol{\gamma}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 \quad \text{subject to} \quad \sum_{k=1}^{q}\left|\gamma_k\right|^{r} \le u$$
The naive Elastic Net instead combines the two penalties in a single objective function:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Naive\ ENet}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu_1 \sum_{k=1}^{q}\left|\gamma_k\right| + \mu_2 \sum_{k=1}^{q}\gamma_k^2$$
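In R, glmnet parameterizes the same combined penalty through a mixing parameter alpha that interpolates between ridge (alpha = 0) and the LASSO (alpha = 1); using glmnet here is an assumption, since the slides demonstrate the elasticnet package:

```r
library(glmnet)

set.seed(1)
n <- 500; p <- 20
y <- matrix(rnorm(n * p), n, p)
z <- 4 * y[, 1] + 3 * y[, 2] + 2 * y[, 3] + y[, 4] + rnorm(n)

# alpha = 0.5 weights the L1 and L2 penalties equally
fit <- glmnet(y, z, alpha = 0.5)
plot(fit, xvar = "lambda")   # elastic net coefficient paths
```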
Equivalently, as a constrained optimization problem:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Naive\ ENet}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 \quad \text{subject to} \quad (1-\alpha)\sum_{k=1}^{q}\left|\gamma_k\right| + \alpha\sum_{k=1}^{q}\gamma_k^2 \le u$$

where $\alpha$ mixes the two penalties. The quadratic term makes the penalty strictly convex, which produces a grouping effect even in the extreme situation of identical predictors.

[Figure: the elastic net constraint region; singularities at the vertices result in a sparse ENet solution.]
Empirically, the naive Elastic Net does not perform satisfactorily unless it is close to either Ridge or the LASSO: applying both penalties shrinks the coefficients twice, which does reduce the variance but introduces extra bias.
The Elastic Net corrects this double shrinkage by rescaling the naive solution:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{ENet}} = (1+\mu_2)\cdot\hat{\boldsymbol{\gamma}}^{\mathrm{Naive\ ENet}}$$

With identical predictors, the Elastic Net splits the coefficient equally between both variables, so correlated variables can be kept in the model without worrying about multicollinearity or near-aliasing. Similar to a fishing net, the Elastic Net retains all the "big fish".
Simulated grouped-variable example: let $A_1 \sim U(0,20)$ and $A_2 \sim U(0,20)$ be independent, and build six predictors in two groups:

$$y_1 = A_1 + \vartheta_1,\quad y_2 = -A_1 + \vartheta_2,\quad y_3 = A_1 + \vartheta_3$$
$$y_4 = A_2 + \vartheta_4,\quad y_5 = -A_2 + \vartheta_5,\quad y_6 = A_2 + \vartheta_6$$

with the $\vartheta_i$ independent $N(0, 1/16)$. The response is driven mainly by $A_1$, so an ideal procedure would select the three $A_1$-group predictors as important variables, but none of the $A_2$ group variables.
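A sketch of this simulation comparing the two selectors (the response formula and the penalty values are assumptions; the slide specifies only the predictors):

```r
library(glmnet)

set.seed(1)
n <- 100
A1 <- runif(n, 0, 20); A2 <- runif(n, 0, 20)
eps <- matrix(rnorm(6 * n, sd = 1/4), n, 6)      # sd 1/4 gives variance 1/16
y <- cbind( A1 + eps[, 1], -A1 + eps[, 2],  A1 + eps[, 3],
            A2 + eps[, 4], -A2 + eps[, 5],  A2 + eps[, 6])
z <- A1 + 0.1 * A2 + rnorm(n)                    # assumed response, dominated by A1

coef(glmnet(y, z, alpha = 1),   s = 0.5)   # LASSO: tends to keep one A1 variable
coef(glmnet(y, z, alpha = 0.5), s = 0.5)   # Elastic Net: tends to keep the whole A1 group
```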
Side by side, the Elastic Net keeps more variables in the model than the LASSO, but with smaller magnitudes.

[Figure: coefficient paths against log(Lambda) for the two fits; the counts along the top of each panel give the number of nonzero coefficients, falling from 12 to 1 in one panel and from 47 to 20 in the other.]
Penalized regression extends beyond least squares to GLMs, where the response follows the exponential family of distributions. Without a penalty, the coefficients come from the maximum likelihood criterion:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{MLE}} = \arg\max_{\boldsymbol{\gamma}}\, L(z;\boldsymbol{\gamma}) \quad\text{or, equivalently,}\quad \hat{\boldsymbol{\gamma}}^{\mathrm{MLE}} = \arg\min_{\boldsymbol{\gamma}}\, -\log L(z;\boldsymbol{\gamma})$$
The penalized version minimizes the following equation:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Penalized}} = \arg\min_{\boldsymbol{\gamma}}\, -\log L(z;\boldsymbol{\gamma}) + \mu \cdot K(\boldsymbol{\gamma})$$

For GLMs the solution path is no longer piece-wise linear; path algorithms track where each variable is added and then use a piece-wise linear approximation to the path in between.
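A sketch of a penalized GLM fit in R; the Poisson frequency model and the use of glmnet's family argument are assumptions chosen for illustration:

```r
library(glmnet)

set.seed(1)
n <- 500; p <- 10
y <- matrix(rnorm(n * p), n, p)
counts <- rpois(n, lambda = exp(0.5 * y[, 1] - 0.25 * y[, 2]))

# Elastic Net penalized Poisson regression: the deviance replaces squared error
fit <- cv.glmnet(y, counts, family = "poisson", alpha = 0.5)
coef(fit, s = "lambda.min")
```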
For categorical predictors or other natural groupings of variables, we often want to include or exclude the entire group at once. The group LASSO penalizes the norm of each group's coefficient subvector:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Grp\ LASSO}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu \sum_{m=1}^{P}\sqrt{q_m}\,\left\|\boldsymbol{\gamma}_{m}\right\|_{2}$$

where the $q$ predictors are partitioned into $P$ groups, group $m$ contains $q_m$ predictors, and $\boldsymbol{\gamma}_{m}$ is its coefficient subvector. A sketch in R follows.
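One possible implementation uses the grpreg package (the package choice and data are assumptions; the slides do not name a group LASSO implementation):

```r
library(grpreg)

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 6), n, 6)
z <- X[, 1] + X[, 2] - X[, 3] + rnorm(n)

# The first three predictors form group 1, the last three form group 2
group <- c(1, 1, 1, 2, 2, 2)
fit <- grpreg(X, z, group = group, penalty = "grLasso")
plot(fit)   # whole groups enter or leave the model together
```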
The sparse group LASSO adds an overall $L_1$ penalty, so it selects groups and also individual variables within the selected groups:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Sparse\ Grp\ LASSO}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu_1 \sum_{m=1}^{P}\sqrt{q_m}\,\left\|\boldsymbol{\gamma}_{m}\right\|_{2} + \mu_2\left\|\boldsymbol{\gamma}\right\|_{1}$$

[Figure: constraint regions for the group LASSO, the LASSO, and the sparse group LASSO, in the case where $y_1$ and $y_2$ belong to the same group.]
The adaptive LASSO applies a different, data-dependent weight $w_k$ to each coefficient:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Adaptive\ LASSO}} = \arg\min_{\boldsymbol{\gamma}} \sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu \sum_{k=1}^{q} w_k\left|\gamma_k\right|$$
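A common two-stage recipe uses a pilot fit to set the weights and glmnet's penalty.factor argument to apply them; the weight formula $w_k = 1/|\hat{\gamma}_k^{\mathrm{pilot}}|$ is a standard choice, not something the slide specifies:

```r
library(glmnet)

set.seed(1)
n <- 500; p <- 10
y <- matrix(rnorm(n * p), n, p)
z <- 3 * y[, 1] + 2 * y[, 2] + rnorm(n)

# Stage 1: a pilot OLS fit supplies the adaptive weights
pilot <- coef(lm(z ~ y))[-1]   # drop the intercept
w <- 1 / abs(pilot)            # large pilot coefficients get small penalties

# Stage 2: penalty.factor rescales the L1 penalty on each coefficient
fit <- cv.glmnet(y, z, alpha = 1, penalty.factor = w)
coef(fit, s = "lambda.min")
```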
The adaptive Elastic Net combines the adaptive weights with the quadratic penalty and the rescaling:

$$\hat{\boldsymbol{\gamma}}^{\mathrm{Adaptive\ ENet}} = (1+\mu_2)\cdot\arg\min_{\boldsymbol{\gamma}} \left\{\sum_{j=1}^{O}\left(z_j - \beta - \sum_{k=1}^{q}\gamma_k y_{jk}\right)^2 + \mu_1 \sum_{k=1}^{q} w_k\left|\gamma_k\right| + \mu_2 \sum_{k=1}^{q}\gamma_k^2\right\}$$

with weights $w_k = \left|\hat{\gamma}_k^{\mathrm{ENet}}\right|^{-\delta}$ for some $\delta > 0$.
Penalized regression also has a Bayesian interpretation, starting from Bayes' theorem:

$$P(B \mid C) = \frac{P(C \mid B)\,P(B)}{P(C)}$$

Applied to the regression coefficients:

$$P(\gamma \mid z) \propto P(z \mid \gamma)\,P(\gamma)$$
A Gaussian prior centered at zero encodes the belief that the coefficients are small:

$$P(\gamma) \propto e^{-\frac{1}{2\tau^2}\left\|\gamma\right\|_2^2}$$

With Gaussian errors, the posterior becomes

$$P(\gamma \mid z) \propto e^{-\frac{1}{2}\left(\left\|z - Y\gamma\right\|_2^2 + \frac{1}{\tau^2}\left\|\gamma\right\|_2^2\right)}$$

so maximizing the posterior is the same as minimizing

$$\left\|z - Y\gamma\right\|_2^2 + \frac{1}{\tau^2}\left\|\gamma\right\|_2^2$$

which is the Ridge solution with $\mu = \frac{1}{\tau^2}$.
Similarly, the LASSO corresponds to a double-exponential (Laplace) prior,

$$P(\gamma) \propto e^{-\frac{\mu}{2}\left\|\gamma\right\|_1}$$

and the Elastic Net prior is given by

$$P(\gamma) \propto e^{-\frac{1}{2}\left(\mu_1\left\|\gamma\right\|_1 + \mu_2\left\|\gamma\right\|_2^2\right)}$$
If you would like additional information or references for this presentation, please contact:

Mohamad Hindawi, PhD, FCAS
Towers Watson
175 Powder Forest Dr.
Weatogue, CT 06089
860.843.7134
Mohamad.Hindawi@towerswatson.com