Lecture 9: Regularized/penalized regression
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
15th April 2019
Revisited: Expectation-Maximization (I)
New target function: Maximize

\[
\log p(X \mid \theta) = \mathbb{E}_{q(z)}\!\left[\log\frac{p(X, z \mid \theta)}{q(z)}\right] - \mathbb{E}_{q(z)}\!\left[\log\frac{p(z \mid X, \theta)}{q(z)}\right]
\]

with respect to q(z) and θ.

Note:
▶ The left-hand side is independent of q(z).
▶ The difference on the right-hand side always has the same value, irrespective of the chosen q(z). Choosing q(z) is therefore a trade-off between E_{q(z)}[log(p(X, z | θ)/q(z))] and E_{q(z)}[log(p(z | X, θ)/q(z))].
1/24
Revisited: Expectation-Maximization (II)
1. Expectation step: For given parameters θ^(i), the density q(z) = p(z | X, θ^(i)) minimizes the second term and thereby maximizes the first one. Set

\[
Q(\theta, \theta^{(i)}) = \mathbb{E}_{p(z \mid X, \theta^{(i)})}\!\left[\log\frac{p(X, z \mid \theta)}{p(z \mid X, \theta^{(i)})}\right]
\]

2. Maximization step: Maximize the first term with

\[
\theta^{(i+1)} = \arg\max_{\theta} Q(\theta, \theta^{(i)})
\]

Note: Since

\[
\mathbb{E}_{p(z \mid X, \theta^{(i)})}\!\left[\log\frac{p(z \mid X, \theta^{(i)})}{p(z \mid X, \theta^{(i)})}\right] = 0
\]

it follows that

\[
\log p(X \mid \theta^{(i)}) = \mathbb{E}_{p(z \mid X, \theta^{(i)})}\!\left[\log\frac{p(X, z \mid \theta^{(i)})}{p(z \mid X, \theta^{(i)})}\right]
\]
2/24
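To make the two steps concrete, here is a minimal numerical sketch of EM for a two-component univariate Gaussian mixture (an illustrative example added here, not from the slides; the helper em_gmm_1d and its initialisation are assumptions of this sketch):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """Minimal EM sketch for a two-component 1D Gaussian mixture."""
    # Crude initialisation of mixing weight, means and standard deviations
    pi, mu, sigma = 0.5, np.percentile(x, [25, 75]), np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(z_i = k | x_i, current parameters)
        dens = np.column_stack([
            (1 - pi) * norm.pdf(x, mu[0], sigma[0]),
            pi * norm.pdf(x, mu[1], sigma[1]),
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximise Q(theta, theta_old) -> weighted ML updates
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk[1] / len(x)
    return pi, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 0.5, 100)])
print(em_gmm_1d(x))
```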
Regularized/penalized regression
Remember ordinary least-squares (OLS)
Consider the model y = Xβ + ε where

▶ y ∈ R^n is the outcome, X ∈ R^{n×(p+1)} is the design matrix, β ∈ R^{p+1} are the regression coefficients, and ε ∈ R^n is the additive error.
▶ Five basic assumptions have to be checked: the underlying relationship is linear (1), and the errors have zero mean (2), are uncorrelated (3), have constant variance (4) and are (roughly) normally distributed (5).
▶ Centring ((1/n) Σ_{i=1}^n x_{ij} = 0) and standardisation ((1/n) Σ_{i=1}^n x_{ij}^2 = 1) of the predictors simplifies interpretation.
▶ Centring the outcome ((1/n) Σ_{i=1}^n y_i = 0) and the features removes the need to estimate the intercept (see the sketch below).
3/24
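As a small illustration of the last two points (my own synthetic example, not from the slides), the following sketch centres and standardises the predictors, centres the outcome, and then fits OLS without an intercept column:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.5 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Centre the predictors, then scale so that (1/n) sum_i x_ij^2 = 1
Xc = X - X.mean(axis=0)
Xc = Xc / np.sqrt((Xc ** 2).mean(axis=0))
# Centre the outcome
yc = y - y.mean()

# OLS on the centred data: no intercept column is needed
beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
print(beta_hat)
```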
Feature selection as motivation
An analytical solution exists when X^T X is invertible:

\[
\hat{\beta}_{\mathrm{OLS}} = (X^T X)^{-1} X^T y
\]

This can be unstable or fail in case of
▶ high correlation between predictors, or
▶ if p > n.

Solutions: Regularisation or feature selection. (A small numerical illustration follows below.)
4/24
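A quick numerical sketch of why the normal equations become problematic (a synthetic example of my own): with nearly collinear predictors the condition number of X^T X explodes and the coefficients become unstable, and with p > n the matrix X^T X cannot be inverted at all.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Highly correlated predictors: X^T X is near-singular, estimates blow up
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=n)
print(np.linalg.cond(X.T @ X))            # huge condition number
print(np.linalg.solve(X.T @ X, X.T @ y))  # unstable coefficient estimates

# p > n: X^T X (p x p) has rank at most n and cannot be inverted
Xwide = rng.normal(size=(10, 20))
print(np.linalg.matrix_rank(Xwide.T @ Xwide))  # at most 10 < 20
```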
Filtering for feature selection
▶ Choose features through pre-processing
  ▶ Features with maximum variance
  ▶ Use only the leading PCA components
▶ Examples of other useful measures
  ▶ Univariate criterion, e.g. F-score: features that correlate most with the response (see the sketch after this slide)
  ▶ Mutual information: reduction in uncertainty about the response after observing a feature
  ▶ Variable importance: determine variable importance with random forests
▶ Summary
  ▶ Pro: fast and easy
  ▶ Con: filtering mostly operates on single features and is not geared towards a certain method
  ▶ Care with cross-validation and multiple testing is necessary
  ▶ Filtering is often more of a pre-processing step and less of a proper feature selection step
5/24
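A minimal sketch of univariate filtering with an F-score, using scikit-learn's f_regression and SelectKBest on synthetic data (the example and all parameter choices are my own, purely for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(3)
n, p = 200, 50
X = rng.normal(size=(n, p))
# Only the first three features actually enter the response
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

# Univariate F-scores: one simple regression of y on each single feature
f_scores, p_values = f_regression(X, y)

# Keep the 5 features with the largest F-score
selector = SelectKBest(score_func=f_regression, k=5).fit(X, y)
print(np.sort(selector.get_support(indices=True)))
```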
Wrapping for feature selection
▶ Idea: Determine the best set of features by fitting models of different complexity and comparing their performance.
▶ Best subset selection: Try all possible (exponentially many) subsets of features and compare model performance with e.g. cross-validation.
▶ Forward selection: Start with just an intercept and add in each step the variable that improves the fit the most (greedy algorithm; see the sketch below).
▶ Backward selection: Start with all variables included and then sequentially remove the one with the least impact (greedy algorithm).
▶ As discrete procedures, all of these methods exhibit high variance (small changes in the data could lead to different predictors being selected, resulting in a potentially very different model).
6/24
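A greedy forward-selection sketch (my own illustration, not the lecture's code), scoring candidate features by cross-validated R² with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=5):
    """Greedily add the feature that improves cross-validated R^2 the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = [
            (np.mean(cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5)), j)
            for j in remaining
        ]
        score, j = max(scores)
        if score <= best_score:   # stop if no candidate improves the fit
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 20))
y = 2 * X[:, 3] - X[:, 7] + rng.normal(size=150)
print(forward_selection(X, y))
```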
Embedding for feature selection
▶ Embed/include the feature selection into the model estimation procedure.
▶ Ideally, penalize the number of included features:

\[
\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0)
\]

However, discrete optimization problems are hard to solve.
▶ Softer regularisation methods can help:

\[
\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_q^q
\]

where λ is a tuning parameter and q ≥ 1 or q = ∞. (A small sketch of this penalised objective follows below.)
7/24
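As a tiny sketch (the helper name penalized_rss is my own), the softer penalised objective can be written down directly; ridge regression and the lasso below are the special cases q = 2 and q = 1:

```python
import numpy as np

def penalized_rss(beta, X, y, lam, q):
    """||y - X beta||_2^2 + lam * ||beta||_q^q  (for finite q >= 1)."""
    rss = np.sum((y - X @ beta) ** 2)
    return rss + lam * np.sum(np.abs(beta) ** q)
```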
Constrained regression
The optimization problem

\[
\arg\min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_q^q \leq t
\]

for t > 0 is equivalent to

\[
\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_q^q
\]

when q ≥ 1. The latter is the Lagrangian of the constrained problem.

▶ Clear when q > 1: convex constraint and target function, and both are differentiable.
▶ Harder to prove for q = 1, but possible (e.g. with subgradients).
8/24
Ridge regression
For q = 2 the constrained problem is ridge regression

\[
\hat{\beta}_{\mathrm{ridge}}(\lambda) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
\]

where ‖β‖_2^2 = Σ_{j=1}^p β_j^2.

An analytical solution exists if X^T X + λI_p is invertible:

\[
\hat{\beta}_{\mathrm{ridge}}(\lambda) = (X^T X + \lambda I_p)^{-1} X^T y
\]

If X^T X = I_p, then

\[
\hat{\beta}_{\mathrm{ridge}}(\lambda) = \frac{\hat{\beta}_{\mathrm{OLS}}}{1 + \lambda},
\]

i.e. β̂_ridge(λ) is biased but has lower variance. (A short numerical check follows below.)
9/24
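A short numerical check of the closed-form ridge solution against scikit-learn (my own example; Ridge is called with fit_intercept=False so that it solves exactly the penalised problem above):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
lam = 2.0

# Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T y
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# scikit-learn's Ridge with alpha = lambda minimizes the same objective
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_closed, beta_sklearn))
```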
SVD and ridge regression
Recall: The SVD of a matrix X ∈ R^{n×p} is X = U D V^T. The analytical solution for ridge regression becomes (n ≥ p)

\[
\hat{\beta}_{\mathrm{ridge}}(\lambda)
= (X^T X + \lambda I_p)^{-1} X^T y
= (V D^2 V^T + \lambda I_p)^{-1} V D U^T y
= V (D^2 + \lambda I_p)^{-1} D U^T y
= \sum_{j=1}^{p} v_j \frac{d_j}{d_j^2 + \lambda} u_j^T y
\]

Ridge regression acts most strongly on principal components with small eigenvalues d_j^2, e.g. in the presence of correlation between features. (This is checked numerically below.)
10/24
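A numerical check of the SVD form of the ridge estimate against the closed-form solution (again a synthetic example of my own):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
lam = 1.5

# Thin SVD: X = U D V^T with U (n x p), d (p,), Vt (p x p)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# beta_ridge = sum_j v_j * d_j / (d_j^2 + lambda) * (u_j^T y)
beta_svd = Vt.T @ ((d / (d ** 2 + lam)) * (U.T @ y))

beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_svd, beta_closed))
```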
Effective degrees of freedom
Recall the hat matrix H = X (X^T X)^{-1} X^T in OLS. Its trace

\[
\mathrm{tr}(H) = \mathrm{tr}(X (X^T X)^{-1} X^T) = \mathrm{tr}(X^T X (X^T X)^{-1}) = \mathrm{tr}(I_p) = p
\]

equals the degrees of freedom used for the regression coefficients. In analogy, define for ridge regression

\[
H(\lambda) := X (X^T X + \lambda I_p)^{-1} X^T
\quad\text{and}\quad
\mathrm{df}(\lambda) := \mathrm{tr}(H(\lambda)) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},
\]

the effective degrees of freedom. (A small numerical sketch follows below.)
11/24
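A small sketch computing the effective degrees of freedom both from the trace of H(λ) and from the singular values (my own example):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 6))
lam = 3.0

# df(lambda) via the trace of the ridge hat matrix
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
df_trace = np.trace(H)

# df(lambda) via the singular values: sum_j d_j^2 / (d_j^2 + lambda)
d = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(d ** 2 / (d ** 2 + lam))

print(df_trace, df_svd)   # identical; both equal 6 when lambda = 0
```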
Lasso regression
For q = 1 the constrained problem is known as the lasso

\[
\hat{\beta}_{\mathrm{lasso}}(\lambda) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
\]

▶ Smallest q in the penalty such that the constraint is still convex
▶ Performs feature selection
12/24
Intuition for the penalties (I)
Assume the OLS solution β̂_OLS exists and set e = y − Xβ̂_OLS. For the residual sum of squares (RSS) it follows that

\[
\|y - X\beta\|_2^2
= \|(X\hat{\beta}_{\mathrm{OLS}} + e) - X\beta\|_2^2
= \|X(\beta - \hat{\beta}_{\mathrm{OLS}}) - e\|_2^2
= (\beta - \hat{\beta}_{\mathrm{OLS}})^T X^T X (\beta - \hat{\beta}_{\mathrm{OLS}}) - 2 e^T X (\beta - \hat{\beta}_{\mathrm{OLS}}) + e^T e
\]

Since the OLS residuals are orthogonal to the columns of X (X^T e = 0), the cross term vanishes, and the contours of the RSS are ellipses (at least in 2D) centred on β̂_OLS.
13/24
Intuition for the penalties (II)
The least-squares RSS is minimized at β̂_OLS. If a constraint is added (‖β‖_q^q ≤ t), then the RSS is minimized by the closest β that fulfils the constraint.

[Figure: RSS contours in the (β₁, β₂) plane with the constraint region; left panel: lasso constraint with solution β̂_lasso, right panel: ridge constraint with solution β̂_ridge. The blue lines are the contour lines of the RSS.]
14/24
Intuition for the penalties (III)
Depending on q, the different constraints lead to different solutions. If β̂_OLS lies in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.

[Figure: constraint regions in the (β₁, β₂) plane for q = 2, q = ∞, q = 0.7 and q = 1.]
15/24
Computational aspects of the Lasso (I)
What estimates does the lasso produce? Target function

\[
\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1
\]

Special case: X^T X = I_p. Then

\[
\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1
= \tfrac{1}{2} y^T y - \underbrace{y^T X}_{= \hat{\beta}_{\mathrm{OLS}}^T}\, \beta + \tfrac{1}{2} \beta^T \beta + \lambda \|\beta\|_1
= f(\beta)
\]

How do we find the solution β̂ in the presence of the non-differentiable penalisation ‖β‖₁?
16/24
Computational aspects of the Lasso (II)
For X^T X = I_p the target function can be written as

\[
\arg\min_{\beta} \sum_{j=1}^{p} \left( -\hat{\beta}_{\mathrm{OLS},j}\,\beta_j + \tfrac{1}{2}\beta_j^2 + \lambda |\beta_j| \right)
\]

This results in p uncoupled optimization problems.

▶ If β̂_OLS,j > 0, then β_j ≥ 0 at the minimum.
▶ If β̂_OLS,j ≤ 0, then β_j ≤ 0.

Each case results in

\[
\hat{\beta}_j = \mathrm{sign}(\hat{\beta}_{\mathrm{OLS},j})\,(|\hat{\beta}_{\mathrm{OLS},j}| - \lambda)_+ = \mathrm{ST}(\hat{\beta}_{\mathrm{OLS},j}, \lambda),
\]

where x_+ = x if x > 0 and 0 otherwise, and ST is the soft-thresholding operator (see the sketch below).
17/24
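A minimal sketch of the soft-thresholding solution on an orthonormal design, checked against scikit-learn's Lasso (my own example; note that scikit-learn minimizes (1/(2n))‖y − Xβ‖² + α‖β‖₁, so α = λ/n matches the objective used here):

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, lam):
    """ST(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(8)
n, p = 100, 5
# Orthonormal design: X^T X = I_p
X, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=n)

beta_ols = X.T @ y                       # OLS estimate when X^T X = I
lam = 1.0
beta_st = soft_threshold(beta_ols, lam)  # lasso solution in the orthonormal case

# alpha = lam / n matches the objective (1/2)||y - Xb||^2 + lam ||b||_1
beta_sklearn = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-8).fit(X, y).coef_
print(np.allclose(beta_st, beta_sklearn, atol=1e-6))
```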
Relation to OLS estimates
Both the ridge regression and the lasso estimates can be written as functions of β̂_OLS if X^T X = I_p:

\[
\hat{\beta}_{\mathrm{ridge},j} = \frac{\hat{\beta}_{\mathrm{OLS},j}}{1 + \lambda}
\quad\text{and}\quad
\hat{\beta}_{\mathrm{lasso},j} = \mathrm{sign}(\hat{\beta}_{\mathrm{OLS},j})\,(|\hat{\beta}_{\mathrm{OLS},j}| - \lambda)_+
\]

[Figure: visualisation of the transformations applied to the OLS estimates — proportional shrinkage for ridge, soft-thresholding by λ for the lasso.]
18/24
Shrinkage
When λ is fixed, the shrinkage of the lasso estimate β̂_lasso(λ) compared to the OLS estimate β̂_OLS is defined as

\[
s(\lambda) = \frac{\|\hat{\beta}_{\mathrm{lasso}}(\lambda)\|_1}{\|\hat{\beta}_{\mathrm{OLS}}\|_1}
\]

Note: s(λ) ∈ [0, 1], with s(λ) → 0 for increasing λ and s(λ) = 1 if λ = 0. (A small numerical illustration follows below.)
19/24
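A small illustration of the shrinkage factor (my own example; scikit-learn's alpha plays the role of λ up to a scaling by the sample size, so only the qualitative behaviour matters here):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ np.array([3, -2, 1.5, 0, 0, 0, 0, 0], dtype=float) + rng.normal(size=n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Shrinkage s = ||beta_lasso||_1 / ||beta_OLS||_1 decreases with the penalty
for alpha in [0.001, 0.05, 0.2, 1.0]:
    beta_lasso = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    s = np.sum(np.abs(beta_lasso)) / np.sum(np.abs(beta_ols))
    print(f"alpha={alpha}: shrinkage s = {s:.2f}")
```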
A regularisation path
Prostate cancer dataset (n = 67, p = 8). Red dashed lines indicate the λ selected by cross-validation.

[Figure: coefficient paths for ridge (plotted against the effective degrees of freedom and against log(λ)) and for the lasso (plotted against the shrinkage factor and against log(λ)).]

(A sketch of computing such a path follows below.)
20/24
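The lasso part of such a path can be computed with scikit-learn's lasso_path; below is a sketch on synthetic data, since the prostate dataset is not bundled with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(10)
n, p = 67, 8
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0]) + rng.normal(size=n)

# Coefficients along a grid of penalties, from strong to weak regularisation
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Shrinkage factor relative to the least regularised solution on the grid
s = np.abs(coefs).sum(axis=0) / np.abs(coefs[:, -1]).sum()
for a, sh in zip(alphas[::10], s[::10]):
    print(f"alpha={a:.3f}  shrinkage={sh:.2f}")
```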
Notes on the lasso
▶ In the general case, i.e. X^T X ≠ I_p, there is no explicit solution.
▶ A numerical solution is possible, e.g. with coordinate descent (a sketch follows below).
▶ As for ridge regression, the estimates are biased.
▶ But:
  ▶ Asymptotic consistency: If λ = o(n), then β̂_lasso → β_true for n → ∞.
  ▶ Model selection consistency: If λ ∝ n^{1/2}, then there is a non-zero probability of identifying the true model.
▶ Degrees of freedom: The degrees of freedom are equal to the number of non-zero coefficients.
21/24
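A compact coordinate-descent sketch for the lasso objective (1/2)‖y − Xβ‖² + λ‖β‖₁ (my own implementation for illustration, not the lecture's code):

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||_2^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that excludes feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 10))
y = X @ np.concatenate([[2.0, -1.0], np.zeros(8)]) + rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=20.0), 2))
```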
Potential caveats of the lasso (I)
▶ Sparsity of the true model:
  ▶ The lasso only works if the data is generated from a sparse process.
  ▶ However, a dense process with many variables and not enough data, or high correlation between predictors, can be unidentifiable either way.
▶ Correlations: Many non-relevant variables that are correlated with relevant variables can lead to the selection of the wrong model, even for large n.
▶ Irrepresentable condition: Split X such that X₁ contains all relevant variables and X₂ contains all irrelevant variables. If

\[
\left| X_2^T X_1 (X_1^T X_1)^{-1} \right| < 1 - \eta \quad \text{(element-wise)}
\]

for some η > 0, then the lasso is (almost) guaranteed to pick the true model. (A small numerical check follows below.)
22/24
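In a simulation, where the split into relevant and irrelevant variables is known, the condition can be checked numerically; below is a sketch assuming the element-wise form of the irrepresentable quantity as written above (ignoring the sign vector that appears in the Zhao & Yu formulation):

```python
import numpy as np

rng = np.random.default_rng(12)
n, p_rel, p_irr = 200, 3, 7

# X1: relevant variables; X2: irrelevant variables, mildly correlated with X1
X1 = rng.normal(size=(n, p_rel))
X2 = 0.3 * X1[:, [0]] + rng.normal(size=(n, p_irr))

# Irrepresentable quantity: X2^T X1 (X1^T X1)^{-1}, checked element-wise
M = X2.T @ X1 @ np.linalg.inv(X1.T @ X1)
print(np.abs(M).max())   # should stay below 1 for the condition to hold
```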
Potential caveats of the lasso (II)
In practice, neither the sparsity of the true model nor the irrepresentable condition can be verified.
▶ Assumptions and domain knowledge have to be used.
23/24
Take-home message
▶ Filtering and wrapping methods are useful for feature selection in practice but can be unprincipled or exhibit high variance.
▶ Penalisation gives stability to regression.
▶ The lasso performs variable selection and variance reduction.