

SLIDE 1

Lecture 9: Regularized/penalized regression

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data, 15th April 2019

SLIDE 2

Revisited: Expectation-Maximization (I)

New target function: Maximize
\[
\log p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathbb{E}_{q(\mathbf{Z})}\left[\log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}\right] - \mathbb{E}_{q(\mathbf{Z})}\left[\log \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}\right]
\]
with respect to $q(\mathbf{Z})$ and $\boldsymbol{\theta}$.

Note:

▶ The left-hand side is independent of $q(\mathbf{Z})$.
▶ The difference on the right-hand side always has the same value, irrespective of the chosen $q(\mathbf{Z})$. Choosing $q(\mathbf{Z})$ is therefore a trade-off between $\mathbb{E}_{q(\mathbf{Z})}[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) / q(\mathbf{Z})]$ and $\mathbb{E}_{q(\mathbf{Z})}[\log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) / q(\mathbf{Z})]$.

SLIDE 3

Revisited: Expectation-Maximization (II)

1. Expectation step: For given parameters $\boldsymbol{\theta}^{(n)}$ the density $q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})$ minimizes the second term and thereby maximizes the first one. Set
\[
Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(n)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\left[\log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\right]
\]

2. Maximization step: Maximize the first term with
\[
\boldsymbol{\theta}^{(n+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{(n)})
\]

Note: Since $\mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\left[\log \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\right] = 0$ it follows that
\[
\log p(\mathbf{X} \mid \boldsymbol{\theta}^{(n)}) = \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\left[\log \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}^{(n)})}{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{(n)})}\right]
\]
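To make the two steps concrete, here is a minimal numpy sketch (not from the original slides) of EM for a two-component univariate Gaussian mixture, where the latent $\mathbf{Z}$ are the component labels; all function and variable names are illustrative.

import numpy as np

def em_gaussian_mixture(x, n_iter=100):
    """Minimal EM for a two-component univariate Gaussian mixture."""
    x = np.sort(np.asarray(x, dtype=float))
    # Crude initialisation of theta^(0): split the sorted data in half
    mu = np.array([x[: len(x) // 2].mean(), x[len(x) // 2:].mean()])
    sigma2 = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])                      # mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities p(Z_i = k | x_i, theta^(n))
        dens = w / np.sqrt(2 * np.pi * sigma2) * np.exp(
            -0.5 * (x[:, None] - mu) ** 2 / sigma2
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximize Q(theta, theta^(n)), available in closed form here
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma2 = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, sigma2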

SLIDE 4

Regularized/penalized regression

SLIDE 5

Remember ordinary least-squares (OLS)

Consider the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where

▶ $\mathbf{y} \in \mathbb{R}^n$ is the outcome, $\mathbf{X} \in \mathbb{R}^{n \times (p+1)}$ is the design matrix, $\boldsymbol{\beta} \in \mathbb{R}^{p+1}$ are the regression coefficients, and $\boldsymbol{\varepsilon} \in \mathbb{R}^n$ is the additive error.
▶ Five basic assumptions have to be checked: the underlying relationship is linear (1), and the errors have zero mean (2), are uncorrelated (3), have constant variance (4) and are (roughly) normally distributed (5).
▶ Centring ($\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$) and standardisation ($\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$) of the predictors simplifies interpretation.
▶ Centring the outcome ($\frac{1}{n}\sum_{i=1}^{n} y_i = 0$) and the features removes the need to estimate the intercept.

SLIDE 6

Feature selection as motivation

An analytical solution exists when $\mathbf{X}^T\mathbf{X}$ is invertible:
\[
\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\]
This can be unstable or fail in case of

▶ high correlation between predictors, or
▶ $p > n$.

Solutions: Regularisation or feature selection.
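A short numpy sketch (not from the slides) of the closed-form OLS estimate, illustrating how nearly collinear predictors make it unstable; the data-generating setup is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)          # almost collinear with x1
X = np.column_stack([x1, x2])                # centred-ish predictors, no intercept
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

# Closed-form OLS: (X^T X)^{-1} X^T y; X^T X is nearly singular here, so the
# individual coefficients are typically huge and far from (1, 2), even though
# their sum is estimated well.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)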

SLIDE 7

Filtering for feature selection

▶ Choose features through pre-processing
  ▶ Features with maximum variance
  ▶ Use only the first $k$ PCA components
▶ Examples of other useful measures
  ▶ Use a univariate criterion, e.g. the F-score: features that correlate most with the response (see the sketch below)
  ▶ Mutual information: reduction in uncertainty about $\mathbf{x}$ after observing $y$
  ▶ Variable importance: determine variable importance with random forests
▶ Summary
  ▶ Pro: fast and easy
  ▶ Con: filtering mostly operates on single features and is not geared towards a certain method
  ▶ Care with cross-validation and multiple testing is necessary
  ▶ Filtering is often more of a pre-processing step and less of a proper feature selection step
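As an illustration of a univariate filter (not from the slides), a minimal numpy sketch that keeps the $k$ features most correlated with the response; using the absolute correlation as the score is an assumption, but for a single predictor it ranks features in the same order as the F-statistic of a univariate regression.

import numpy as np

def filter_top_k(X, y, k):
    """Keep the k columns of X with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    keep = np.argsort(corr)[::-1][:k]        # indices of the k highest scores
    return keep, X[:, keep]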

SLIDE 8

Wrapping for feature selection

▶ Idea: Determine the best set of features by fitting models of different complexity and comparing their performance
▶ Best subset selection: Try all possible (exponentially many) subsets of features and compare model performance, e.g. with cross-validation
▶ Forward selection: Start with just an intercept and, in each step, add the variable that improves the fit the most (greedy algorithm; see the sketch below)
▶ Backward selection: Start with all variables included and then sequentially remove the one with the least impact (greedy algorithm)
▶ As discrete procedures, all of these methods exhibit high variance (small changes in the data can lead to different predictors being selected, resulting in a potentially very different model)
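A minimal numpy sketch of greedy forward selection (not from the slides), scoring candidate variables by residual sum of squares on centred data so that no intercept is needed; in practice the number of selected variables would be chosen by cross-validation.

import numpy as np

def forward_selection(X, y, max_vars):
    """Greedily add the variable that reduces the RSS the most."""
    n, p = X.shape
    selected = []
    for _ in range(max_vars):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            Xs = X[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # least-squares fit
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected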

SLIDE 9

Embedding for feature selection

▶ Embed/include the feature selection into the model estimation procedure
▶ Ideally, penalisation on the number of included features
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{j=1}^{p} \mathbb{1}(\beta_j \neq 0)
\]
However, discrete optimization problems are hard to solve.
▶ Softer regularisation methods can help
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_q^q
\]
where $\lambda$ is a tuning parameter and $q \geq 1$ or $q = \infty$.

SLIDE 10

Constrained regression

The optimization problem
\[
\arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\beta}\|_q^q \leq t
\]
for $q > 0$ is equivalent to
\[
\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_q^q
\]
when $q \geq 1$. This is the Lagrangian of the constrained problem.

▶ Clear when $q > 1$: convex constraint and target function, and both are differentiable
▶ Harder to prove for $q = 1$, but possible (e.g. with subgradients)

SLIDE 11

Ridge regression

For π‘Ÿ = 2 the constrained problem is ridge regression Μ‚ 𝜸ridge(πœ‡) = arg min

𝜸

‖𝐳 βˆ’ π˜πœΈβ€–2

2 + πœ‡β€–πœΈβ€–2 2

where β€–πœΈβ€–2

2 = βˆ‘ π‘ž π‘˜=1 𝛾2 π‘˜ .

An analytical solution exists if π˜π‘ˆπ˜ + πœ‡π‰π‘ž is invertible Μ‚ 𝜸ridge(πœ‡) = (π˜π‘ˆπ˜ + πœ‡π‰π‘ž)βˆ’1π˜π‘ˆπ³ If π˜π‘ˆπ˜ = π‰π‘ž, then Μ‚ 𝜸ridge(πœ‡) = Μ‚ 𝜸OLS 1 + πœ‡, i.e. Μ‚ 𝜸ridge(πœ‡) is biased but has lower variance.

9/24
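A minimal numpy sketch of the closed-form ridge estimate above, assuming centred/standardised predictors and a centred outcome so that no intercept is needed; the function name is illustrative.

import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lambda I_p)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers OLS (when X^T X is invertible); increasing lam shrinks the
# coefficients towards zero and stabilises the nearly collinear example from
# the OLS slide.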

SLIDE 12

SVD and ridge regression

Recall: The SVD of a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ was $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$.

The analytical solution for ridge regression becomes (for $n \geq p$)
\[
\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}(\lambda)
= (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^T\mathbf{y}
= (\mathbf{V}\mathbf{D}^2\mathbf{V}^T + \lambda \mathbf{I}_p)^{-1}\mathbf{V}\mathbf{D}\mathbf{U}^T\mathbf{y}
= \mathbf{V}(\mathbf{D}^2 + \lambda \mathbf{I}_p)^{-1}\mathbf{D}\mathbf{U}^T\mathbf{y}
= \sum_{j=1}^{p} \frac{d_j}{d_j^2 + \lambda}\, \mathbf{v}_j \mathbf{u}_j^T \mathbf{y}
\]
Ridge regression acts most on principal components with lower eigenvalues, e.g. in the presence of correlation between features.
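The same estimate computed through the SVD, as a small numpy sketch (thin SVD, assuming $n \geq p$); it agrees with the closed-form ridge sketch above up to numerical error.

import numpy as np

def ridge_svd(X, y, lam):
    """Ridge estimate via X = U D V^T: V diag(d_j / (d_j^2 + lam)) U^T y."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((d / (d ** 2 + lam)) * (U.T @ y))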

SLIDE 13

Effective degrees of freedom

Recall the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ in OLS. The trace of $\mathbf{H}$,
\[
\mathrm{tr}(\mathbf{H}) = \mathrm{tr}(\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T) = \mathrm{tr}(\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}) = \mathrm{tr}(\mathbf{I}_p) = p,
\]
equals the degrees of freedom of the regression coefficients. In analogy, define for ridge regression
\[
\mathbf{H}(\lambda) := \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I}_p)^{-1}\mathbf{X}^T
\quad \text{and} \quad
\mathrm{df}(\lambda) := \mathrm{tr}(\mathbf{H}(\lambda)) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},
\]
the effective degrees of freedom.
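A one-function numpy sketch of the effective degrees of freedom, computed from the singular values of $\mathbf{X}$ as in the formula above; note that $\mathrm{df}(0) = p$ for full-rank $\mathbf{X}$ and $\mathrm{df}(\lambda) \to 0$ as $\lambda \to \infty$.

import numpy as np

def effective_df(X, lam):
    """df(lambda) = sum_j d_j^2 / (d_j^2 + lambda) for ridge regression."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return np.sum(d ** 2 / (d ** 2 + lam))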

SLIDE 14

Lasso regression

For π‘Ÿ = 1 the constrained problem is known as the lasso Μ‚ 𝜸ridge(πœ‡) = arg min

𝜸

‖𝐳 βˆ’ π˜πœΈβ€–2

2 + πœ‡β€–πœΈβ€–1

β–Ά Smallest π‘Ÿ in penalty such that constraint is still convex β–Ά Performs feature selection

12/24

SLIDE 15

Intuition for the penalties (I)

Assume the OLS solution $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ exists and set $\mathbf{r} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$. It follows for the residual sum of squares (RSS) that
\[
\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2
= \|(\mathbf{X}\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} + \mathbf{r}) - \mathbf{X}\boldsymbol{\beta}\|_2^2
= \|\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) - \mathbf{r}\|_2^2
= (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}})^T \mathbf{X}^T\mathbf{X} (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) - 2\mathbf{r}^T\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) + \mathbf{r}^T\mathbf{r},
\]
whose contours are ellipses (at least in 2D) centred on $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$.

SLIDE 16

Intuition for the penalties (II)

The least-squares RSS is minimized at $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$. If a constraint ($\|\boldsymbol{\beta}\|_q^q \leq t$) is added, then the RSS is minimized by the closest $\boldsymbol{\beta}$ that fulfils the constraint.

[Figure: constraint regions and RSS contours in the $(\beta_1, \beta_2)$ plane, showing $\hat{\boldsymbol{\beta}}_{\mathrm{lasso}}$ (lasso panel) and $\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}$ (ridge panel) relative to $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$. The blue lines are the contour lines for the RSS.]

SLIDE 17

Intuition for the penalties (III)

Depending on $q$, the different constraints lead to different solutions. If $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ lies in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.

[Figure: constraint regions in the $(\beta_1, \beta_2)$ plane for $q = 2$, $q = \infty$, $q = 0.7$ and $q = 1$.]

SLIDE 18

Computational aspects of the Lasso (I)

What estimates does the lasso produce? Target function:
\[
\arg\min_{\boldsymbol{\beta}} \tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1
\]
Special case: $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$. Then
\[
\tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1
= \tfrac{1}{2}\mathbf{y}^T\mathbf{y} - \underbrace{\mathbf{y}^T\mathbf{X}}_{=\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}^T}\,\boldsymbol{\beta} + \tfrac{1}{2}\boldsymbol{\beta}^T\boldsymbol{\beta} + \lambda \|\boldsymbol{\beta}\|_1 = g(\boldsymbol{\beta})
\]
How do we find the solution $\hat{\boldsymbol{\beta}}$ in the presence of the non-differentiable penalisation $\|\boldsymbol{\beta}\|_1$?

SLIDE 19

Computational aspects of the Lasso (II)

For π˜π‘ˆπ˜ = π‰π‘ž the target function can be written as arg min

𝜸 π‘ž

βˆ‘

π‘˜=1

βˆ’π›ΎOLS,π‘˜π›Ύ

π‘˜ + 1

2𝛾2

π‘˜ + πœ‡|𝛾 π‘˜|

This results in π‘ž uncoupled optimization problems.

β–Ά If 𝛾OLS,π‘˜ > 0, then 𝛾

π‘˜ > 0 to minimize the target

β–Ά If 𝛾OLS,π‘˜ ≀ 0, then 𝛾

π‘˜ ≀ 0

Each case results in Λ† 𝛾

π‘˜ = sign(𝛾OLS,π‘˜)(|𝛾OLS,π‘˜| βˆ’ πœ‡)+ = ST(𝛾OLS,π‘˜, πœ‡),

where 𝑦+ = {𝑦 𝑦 > 0

  • therwise

and ST is the soft-thresholding operator

17/24
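A minimal numpy sketch of the soft-thresholding operator and of the lasso estimate in the orthonormal case $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$; this special case is illustrative only, the general case needs an iterative solver (see the coordinate descent sketch further down).

import numpy as np

def soft_threshold(z, lam):
    """ST(z, lambda) = sign(z) * (|z| - lambda)_+, applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Orthonormal design (X^T X = I_p): beta_OLS = X^T y and the lasso estimate
# is simply the soft-thresholded OLS estimate.
def lasso_orthonormal(X, y, lam):
    return soft_threshold(X.T @ y, lam)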

SLIDE 20

Relation to OLS estimates

Both the ridge regression and the lasso estimates can be written as functions of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ if $\mathbf{X}^T\mathbf{X} = \mathbf{I}_p$:
\[
\hat{\beta}_{\mathrm{ridge},j} = \frac{\hat{\beta}_{\mathrm{OLS},j}}{1 + \lambda}
\quad \text{and} \quad
\hat{\beta}_{\mathrm{lasso},j} = \mathrm{sign}(\hat{\beta}_{\mathrm{OLS},j})\,(|\hat{\beta}_{\mathrm{OLS},j}| - \lambda)_+
\]

[Figure: visualisation of the transformations applied to the OLS estimates (ridge and lasso).]

SLIDE 21

Shrinkage

When πœ‡ is fixed, the shrinkage of the lasso estimate 𝜸lasso(πœ‡) compared to the OLS estimate 𝜸OLS is defined as 𝑑(πœ‡) = β€–πœΈlasso(πœ‡)β€–1 β€–πœΈOLSβ€–1 Note: 𝑑(πœ‡) ∈ [0, 1] with 𝑑(πœ‡) β†’ 0 for increasing πœ‡ and 𝑑(πœ‡) = 1 if πœ‡ = 0

19/24

SLIDE 22

A regularisation path

Prostate cancer dataset ($n = 67$, $p = 8$)

Red dashed lines indicate the $\lambda$ selected by cross-validation.

[Figure: regularisation paths of the coefficients. Ridge coefficients are plotted against the effective degrees of freedom and lasso coefficients against the shrinkage factor; both are also shown against $\log(\lambda)$.]

SLIDE 23

Notes on the lasso

▶ In the general case, i.e. $\mathbf{X}^T\mathbf{X} \neq \mathbf{I}_p$, there is no explicit solution.
▶ A numerical solution is possible, e.g. with coordinate descent (see the sketch below)
▶ As for ridge regression, the estimates are biased
▶ But
  ▶ Asymptotic consistency: If $\lambda = o(n)$ then $\hat{\boldsymbol{\beta}}_{\mathrm{lasso}} \to \boldsymbol{\beta}_{\mathrm{true}}$ for $n \to \infty$
  ▶ Model selection consistency: If $\lambda \propto n^{1/2}$, then there is a non-zero probability of identifying the true model
▶ Degrees of freedom: The degrees of freedom are equal to the number of non-zero coefficients
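A minimal numpy sketch of coordinate descent for the formulation $\tfrac{1}{2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_1$ from slide 18 (not the lecture's reference implementation); it assumes centred $\mathbf{y}$ and centred, ideally standardised, columns of $\mathbf{X}$, and uses a fixed number of sweeps instead of a proper convergence check.

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)        # x_j^T x_j for each column
    resid = y - X @ beta                 # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]   # partial residual without feature j
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]
    return beta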

SLIDE 24

Potential caveats of the lasso (I)

▶ Sparsity of the true model:
  ▶ The lasso only works if the data is generated from a sparse process.
  ▶ However, a dense process with many variables and not enough data, or high correlation between predictors, can be unidentifiable either way
▶ Correlations: Many non-relevant variables correlated with relevant variables can lead to the selection of the wrong model, even for large $n$
▶ Irrepresentable condition: Split $\mathbf{X}$ such that $\mathbf{X}_1$ contains all relevant variables and $\mathbf{X}_2$ contains all irrelevant variables. If, element-wise,
\[
|\mathbf{X}_2^T\mathbf{X}_1(\mathbf{X}_1^T\mathbf{X}_1)^{-1}| < 1 - \theta
\]
for some $\theta > 0$, then the lasso is (almost) guaranteed to pick the true model.
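The condition involves the (unknown) set of truly relevant variables, so it can only be evaluated in simulation studies; here is a small numpy sketch of the element-wise check stated above (the function name and the boolean-mask interface are illustrative).

import numpy as np

def irrepresentable_ok(X, relevant, theta=0.0):
    """Check |X2^T X1 (X1^T X1)^{-1}| < 1 - theta element-wise.

    relevant: boolean numpy mask marking the truly relevant columns of X.
    """
    X1, X2 = X[:, relevant], X[:, ~relevant]
    M = X2.T @ X1 @ np.linalg.inv(X1.T @ X1)
    return bool(np.all(np.abs(M) < 1 - theta))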

SLIDE 25

Potential caveats of the lasso (II)

In practice, neither the sparsity of the true model nor the irrepresentable condition can be checked.

▶ Assumptions and domain knowledge have to be used

SLIDE 26

Take-home message

▶ Filtering and wrapping methods are useful for feature selection in practice but can be unprincipled or have high variance
▶ Penalisation gives stability to regression
▶ The lasso performs variable selection and variance stabilisation at the same time