Lecture 9: Regularized/penalized regression


  1. Lecture 9: Regularized/penalized regression
     Felix Held, Mathematical Sciences
     MSA220/MVE440 Statistical Learning for Big Data
     15th April 2019

  2. Revisited: Expectation-Maximization (I)
     New target function: Maximize $\mathbb{E}_{q(Z)}[\log p(X, Z|\theta)]$ with respect to $q(Z)$ and $\theta$.
     Note:
     $\log p(X|\theta) = \mathbb{E}_{q(Z)}[\log p(X, Z|\theta)] - \mathbb{E}_{q(Z)}[\log p(Z|X, \theta)]$
     ▶ The left-hand side is independent of $q(Z)$.
     ▶ The difference on the right-hand side therefore always has the same value, irrespective of the chosen $q(Z)$.
     ▶ Choosing $q(Z)$ is therefore a trade-off between $\mathbb{E}_{q(Z)}[\log p(X, Z|\theta)]$ and $\mathbb{E}_{q(Z)}[\log p(Z|X, \theta)]$.

  3. Revisited: Expectation-Maximization (II)
     1. Expectation step: For given parameters $\theta^{(n)}$ the density $q(Z) = p(Z|X, \theta^{(n)})$ minimizes the second term and thereby maximizes the first one. Set
        $Q(\theta, \theta^{(n)}) = \mathbb{E}_{p(Z|X,\theta^{(n)})}[\log p(X, Z|\theta)]$
     2. Maximization step: Maximize the first term with respect to $\theta$, i.e.
        $\theta^{(n+1)} = \arg\max_{\theta} Q(\theta, \theta^{(n)})$
     Note: Since $q(Z) = p(Z|X, \theta^{(n)})$ was used, it follows that
        $\log p(X|\theta^{(n)}) = \mathbb{E}_{p(Z|X,\theta^{(n)})}[\log p(X, Z|\theta^{(n)})] - \mathbb{E}_{p(Z|X,\theta^{(n)})}[\log p(Z|X, \theta^{(n)})] = Q(\theta^{(n)}, \theta^{(n)}) - \mathbb{E}_{p(Z|X,\theta^{(n)})}[\log p(Z|X, \theta^{(n)})]$,
     i.e. the bound is tight at $\theta^{(n)}$ and the log-likelihood cannot decrease from one EM iteration to the next.
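
As a concrete illustration of these two steps, here is a minimal numerical sketch (not from the slides) of EM for a two-component univariate Gaussian mixture; the simulated data, starting values and variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated data from two univariate Gaussians (purely illustrative)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.5, 100)])

# Initial guesses for theta = (mixing weights, means, standard deviations)
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: q(Z) = p(Z | X, theta^(n)), i.e. the responsibilities
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: theta^(n+1) = argmax_theta Q(theta, theta^(n))
    nk = resp.sum(axis=0)
    pi = nk / x.size
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(pi, mu, sigma)
```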

  4. Regularized/penalized regression

  5. Remember ordinary least-squares (OLS)
     Consider the model $y = X\beta + \varepsilon$
     ▶ $y \in \mathbb{R}^n$ is the outcome, $X \in \mathbb{R}^{n \times (p+1)}$ is the design matrix, $\beta \in \mathbb{R}^{p+1}$ are the regression coefficients, and $\varepsilon \in \mathbb{R}^n$ is the additive error.
     ▶ Five basic assumptions have to be checked: the underlying relationship is linear (1); the errors have zero mean (2), are uncorrelated (3), have constant variance (4), and are (roughly) normally distributed (5).
     ▶ Centring ($\frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0$) and standardisation ($\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$) of the predictors simplifies interpretation.
     ▶ Centring the outcome ($\frac{1}{n}\sum_{i=1}^{n} y_i = 0$) and the features removes the need to estimate the intercept.
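
A short numpy sketch (my own illustration, not part of the slides) of the centring and standardisation described above; once both $y$ and the columns of $X$ are centred, the OLS fit needs no intercept. The simulated data and coefficient values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=n)

# Centre each predictor and rescale so that (1/n) * sum_i x_ij^2 = 1
Xc = X - X.mean(axis=0)
Xc /= np.sqrt((Xc ** 2).mean(axis=0))

# Centre the outcome; with centred data no intercept has to be estimated
yc = y - y.mean()

beta_hat, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
print(beta_hat)
```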

  6. Feature selection as motivation
     An analytical solution exists when $X^T X$ is invertible:
     $\hat{\beta}_{\mathrm{OLS}} = (X^T X)^{-1} X^T y$
     This can be unstable or fail in case of
     ▶ high correlation between predictors, or
     ▶ if $p > n$.
     Solutions: Regularisation or feature selection.
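
Both failure modes are easy to reproduce numerically; the sketch below (illustrative, with arbitrary dimensions) shows that $X^T X$ is rank-deficient when $p > n$ and badly conditioned under near-perfect correlation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Case p > n: X^T X has rank at most n and cannot be inverted
n, p = 20, 50
X = rng.normal(size=(n, p))
print(np.linalg.matrix_rank(X.T @ X))   # at most 20, far below p = 50

# Case of highly correlated predictors: (X^T X)^{-1} is numerically unstable
x1 = rng.normal(size=100)
X2 = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])
print(np.linalg.cond(X2.T @ X2))        # enormous condition number
```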

  7. Filtering for feature selection
     ▶ Choose features through pre-processing, e.g.
       ▶ Features with maximum variance
       ▶ Use only the first $k$ PCA components
       ▶ Use a univariate criterion, e.g. the F-score: features that correlate most with the response
     ▶ Examples of other useful measures
       ▶ Mutual information: reduction in uncertainty about $x_j$ after observing $y$
       ▶ Variable importance: determine variable importance with random forests
     ▶ Summary
       ▶ Pro: fast and easy
       ▶ Con: filtering mostly operates on single features and is not geared towards a certain method
       ▶ Care with cross-validation and multiple testing is necessary
       ▶ Filtering is often more of a pre-processing step and less of a proper feature selection step
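
Univariate filters such as the F-score and mutual information are readily available in scikit-learn; a minimal sketch, assuming scikit-learn is installed and using simulated data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.5, size=200)

# Keep the 5 features with the largest univariate F-score against the response
X_f = SelectKBest(score_func=f_regression, k=5).fit_transform(X, y)

# Alternatively, rank features by their mutual information with y
mi = mutual_info_regression(X, y)
top5 = np.argsort(mi)[-5:]
print(X_f.shape, top5)
```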

  8. Wrapping for feature selection
     ▶ Idea: Determine the best set of features by fitting models of different complexity and comparing their performance
     ▶ Best subset selection: Try all possible (exponentially many) subsets of features and compare model performance with e.g. cross-validation
     ▶ Forward selection: Start with just an intercept and add in each step the variable that improves the fit the most (greedy algorithm)
     ▶ Backward selection: Start with all variables included and then sequentially remove the one with the least impact (greedy algorithm)
     ▶ As discrete procedures, all of these methods exhibit high variance (small changes could lead to different predictors being selected, resulting in a potentially very different model)
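
A greedy forward-selection loop is short to write down. The sketch below is my own illustration (the function name is made up); for brevity it picks variables by in-sample RSS, whereas in practice one would compare candidate models by cross-validation as the slide suggests.

```python
import numpy as np

def forward_selection(X, y, max_features):
    """Greedy forward selection: start from the intercept-only model and
    add, in each step, the feature that reduces the RSS the most."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(max_features):
        best_j, best_rss = None, np.inf
        for j in remaining:
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```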

  9. Embedding for feature selection
     ▶ Embed/include the feature selection into the model estimation procedure.
     ▶ Ideally, penalize the number of included features:
       $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} 1(\beta_j \neq 0)$
       However, discrete optimization problems are hard to solve.
     ▶ Softer regularisation methods can help:
       $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_q^q$
       where $\lambda$ is a tuning parameter and $q \geq 1$ or $q = \infty$.

  10. Constrained regression
     The optimization problem
     $\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_q^q$
     for $q > 0$ is equivalent to
     $\arg\min_{\beta} \|y - X\beta\|_2^2$ subject to $\|\beta\|_q^q \leq t$
     when $q \geq 1$. The penalized problem is the Lagrangian of the constrained problem.
     ▶ Clear when $q > 1$: convex constraint + target function and both are differentiable
     ▶ Harder to prove for $q = 1$, but possible (e.g. with subgradients)

  11. Ridge regression
     For $q = 2$ the constrained problem is ridge regression
     $\hat{\beta}_{\mathrm{ridge}}(\lambda) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$
     where $\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$.
     An analytical solution exists if $X^T X + \lambda I_p$ is invertible:
     $\hat{\beta}_{\mathrm{ridge}}(\lambda) = (X^T X + \lambda I_p)^{-1} X^T y$
     If $X^T X = I_p$, then
     $\hat{\beta}_{\mathrm{ridge}}(\lambda) = \frac{1}{1 + \lambda} \hat{\beta}_{\mathrm{OLS}},$
     i.e. $\hat{\beta}_{\mathrm{ridge}}(\lambda)$ is biased but has lower variance.
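
The closed form above translates directly into code. A small numpy sketch (illustrative; the helper name and data are made up) that also checks the $1/(1+\lambda)$ shrinkage in the orthonormal case:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate (X^T X + lam * I_p)^{-1} X^T y; X and y assumed centred."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Orthonormal design: ridge equals the OLS estimate scaled by 1 / (1 + lam)
rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(100, 5)))      # Q^T Q = I_5
y = Q @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + rng.normal(scale=0.1, size=100)
beta_ols = Q.T @ y                                   # OLS when X^T X = I_p
print(ridge(Q, y, lam=1.0), beta_ols / (1.0 + 1.0))  # the two agree
```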

  12. SVD and ridge regression
     Recall: The SVD of a matrix $X \in \mathbb{R}^{n \times p}$ ($n \geq p$) was $X = U D V^T$.
     The analytical solution for ridge regression becomes
     $\hat{\beta}_{\mathrm{ridge}}(\lambda) = (X^T X + \lambda I_p)^{-1} X^T y = (V D^2 V^T + \lambda I_p)^{-1} V D U^T y = V (D^2 + \lambda I_p)^{-1} D U^T y = \sum_{j=1}^{p} v_j \frac{d_j}{d_j^2 + \lambda} u_j^T y$
     Ridge regression acts most on principal components with lower eigenvalues, e.g. in the presence of correlation between features.
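
The same estimate can be computed from the SVD, which makes the component-wise shrinkage factors $d_j/(d_j^2 + \lambda)$ explicit. A sketch (my own helper name; assumes $n \geq p$ and centred data):

```python
import numpy as np

def ridge_svd(X, y, lam):
    """Ridge via the SVD X = U diag(d) V^T:
    beta = sum_j v_j * d_j / (d_j^2 + lam) * (u_j^T y).
    Agrees with the normal-equation solution above."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((d / (d ** 2 + lam)) * (U.T @ y))
```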

  13. Effective degrees of freedom
     Recall the hat matrix $H = X(X^T X)^{-1} X^T$ in OLS, which maps $y$ to the fitted values $\hat{y} = Hy$. Its trace equals the degrees of freedom for the regression coefficients:
     $\mathrm{tr}(H) = \mathrm{tr}(X(X^T X)^{-1} X^T) = \mathrm{tr}(X^T X (X^T X)^{-1}) = \mathrm{tr}(I_p) = p$
     In analogy, define for ridge regression
     $H(\lambda) := X(X^T X + \lambda I_p)^{-1} X^T$ and $\mathrm{df}(\lambda) := \mathrm{tr}(H(\lambda)) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},$
     the effective degrees of freedom.
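
$\mathrm{df}(\lambda)$ follows from the same singular values; a two-line sketch (illustrative, hypothetical helper name):

```python
import numpy as np

def ridge_df(X, lam):
    """Effective degrees of freedom df(lam) = sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d ** 2 / (d ** 2 + lam))

# lam = 0 recovers p (the OLS degrees of freedom); df shrinks as lam grows
```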

  14. Lasso regression
     For $q = 1$ the constrained problem is known as the lasso:
     $\hat{\beta}_{\mathrm{lasso}}(\lambda) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
     ▶ Smallest $q$ in the penalty such that the constraint is still convex
     ▶ Performs feature selection
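
For computing lasso fits in practice one typically relies on an existing solver. A minimal sketch with scikit-learn's Lasso on simulated data (note that scikit-learn scales the RSS by $1/(2n)$, so its alpha is not on exactly the same scale as $\lambda$ here):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

fit = Lasso(alpha=0.1).fit(X, y)
print(fit.coef_)   # several coefficients are exactly zero -> feature selection
```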

  15. Intuition for the penalties (I)
     Assume the OLS solution $\hat{\beta}_{\mathrm{OLS}}$ exists and set $r = y - X\hat{\beta}_{\mathrm{OLS}}$. It follows for the residual sum of squares (RSS) that
     $\|y - X\beta\|_2^2 = \|(X\hat{\beta}_{\mathrm{OLS}} + r) - X\beta\|_2^2 = \|X(\beta - \hat{\beta}_{\mathrm{OLS}}) - r\|_2^2 = (\beta - \hat{\beta}_{\mathrm{OLS}})^T X^T X (\beta - \hat{\beta}_{\mathrm{OLS}}) - 2 r^T X (\beta - \hat{\beta}_{\mathrm{OLS}}) + r^T r,$
     whose contours form an ellipse (at least in 2D) centred on $\hat{\beta}_{\mathrm{OLS}}$.

  16. Intuition for the penalties (II)
     The least squares RSS is minimized at $\hat{\beta}_{\mathrm{OLS}}$. If a constraint $\|\beta\|_q^q \leq t$ is added, then the RSS is minimized by the closest $\beta$ that fulfills the constraint.
     [Figure: contour lines of the RSS (blue) around $\hat{\beta}_{\mathrm{OLS}}$ in the $(\beta_1, \beta_2)$-plane, with the lasso (left) and ridge (right) constraint regions and the resulting $\hat{\beta}_{\mathrm{lasso}}$ and $\hat{\beta}_{\mathrm{ridge}}$.]

  17. Intuition for the penalties (III)
     Depending on $q$ the different constraints lead to different solutions. If $\hat{\beta}_{\mathrm{OLS}}$ is in one of the coloured areas or on a line, the corresponding constrained solution will be at the dot.
     [Figure: panels for q = 0.7, q = 1, q = 2 and q = Inf in the $(\beta_1, \beta_2)$-plane, each showing the constraint region, the coloured areas, and the resulting solution marked by a dot.]

  18. Computational aspects of the Lasso (I)
     What estimates does the lasso produce? How do we find the solution $\hat{\beta}$ in the presence of the non-differentiable penalisation $\|\beta\|_1$?
     Target function:
     $\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$
     Special case: $X^T X = I_p$. Then
     $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 = \tfrac{1}{2} y^T y - \underbrace{y^T X}_{= \hat{\beta}_{\mathrm{OLS}}^T} \beta + \tfrac{1}{2}\beta^T \beta + \lambda \|\beta\|_1 = g(\beta)$

  19. Computational aspects of the Lasso (II)
     For $X^T X = I_p$ the target function can be written as
     $\arg\min_{\beta} \sum_{j=1}^{p} \left( \tfrac{1}{2}\beta_j^2 - \hat{\beta}_{\mathrm{OLS},j}\,\beta_j + \lambda |\beta_j| \right)$
     This results in $p$ uncoupled optimization problems.
     ▶ If $\hat{\beta}_{\mathrm{OLS},j} > 0$, then $\beta_j \geq 0$ minimizes the target
     ▶ If $\hat{\beta}_{\mathrm{OLS},j} \leq 0$, then $\beta_j \leq 0$
     Each case results in
     $\hat{\beta}_j = \mathrm{sign}(\hat{\beta}_{\mathrm{OLS},j})\,(|\hat{\beta}_{\mathrm{OLS},j}| - \lambda)_+ = \mathrm{ST}(\hat{\beta}_{\mathrm{OLS},j}, \lambda),$
     where $x_+ = x$ if $x > 0$ and $0$ otherwise, and ST is the soft-thresholding operator.
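
The soft-thresholding operator and the resulting orthonormal-design lasso solution take only a few lines of numpy; a sketch with made-up function names:

```python
import numpy as np

def soft_threshold(z, lam):
    """ST(z, lam) = sign(z) * (|z| - lam)_+ , applied element-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_orthonormal(X, y, lam):
    """Lasso solution of 0.5 * ||y - X beta||^2 + lam * ||beta||_1
    in the special case X^T X = I_p."""
    beta_ols = X.T @ y            # OLS estimate when X^T X = I_p
    return soft_threshold(beta_ols, lam)
```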
