  1. Lecture 10: Regularized/penalized regression (cont’d) Felix Held, Mathematical Sciences MSA220/MVE440 Statistical Learning for Big Data 2nd May 2019

  2. A short recap

  3. Goals of modelling
     1. Predictive strength: How well can we reconstruct the observed data? Has been most important so far.
     2. Model/variable selection: Which variables are part of the true model? This is about uncovering structure to allow for mechanistic understanding.

  4. Feature selection
     Feature selection can be addressed in multiple ways:
     ▶ Filtering: Remove variables before the actual model for the data is built
       ▶ Often crude but fast
       ▶ Typically only pays attention to one or two features at a time (e.g. F-score, MIC) or does not take the outcome variable into consideration (e.g. PCA)
     ▶ Wrapping: Consider the selected features as an additional hyper-parameter
       ▶ Computationally very heavy
       ▶ Most approximations are greedy algorithms
     ▶ Embedding: Include feature selection into parameter estimation through penalisation of the model coefficients
       ▶ The naive form is equally computationally heavy as wrapping
       ▶ Soft constraints create biased but useful approximations
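As a rough illustration of the filtering vs. embedding distinction, the sketch below (a minimal example not taken from the slides; the synthetic data and parameter values are arbitrary) scores features one at a time with a univariate F-test and then lets an L1-penalised model select features jointly.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(n)  # only features 0 and 3 matter

# Filtering: univariate F-scores, one feature at a time (ignores joint effects)
F, _ = f_regression(X, y)
print("top features by F-score:", np.argsort(F)[::-1][:5])

# Embedding: the lasso selects features while estimating the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)
print("features kept by the lasso:", np.flatnonzero(lasso.coef_))
```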

  5. Penalised regression
     The optimization problem
     $\hat{\beta} = \arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_q^q$
     for $q > 0$ is equivalent to
     $\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2$ subject to $\|\beta\|_q^q \le t$.
     ▶ For $q = 2$ known as ridge regression, for $q = 1$ known as the lasso
     ▶ Constraints are convex for all $q \ge 1$ but not differentiable in $\beta = 0$ for $q = 1$; the optimization problem remains convex when $q \ge 1$
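Both penalised problems can be fit with standard software. The snippet below is a minimal sketch using scikit-learn (not referenced in the slides); note that scikit-learn's `Lasso` minimises $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its `alpha` corresponds to $\lambda/n$ in the notation above, and `Ridge` uses $\|y - X\beta\|_2^2 + \alpha\|\beta\|_2^2$.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ np.array([3.0, -2.0] + [0.0] * (p - 2)) + rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients towards 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sets some coefficients exactly to 0

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
```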

  6. Intuition for the penalties (I)
     Assume the OLS solution $\beta_{OLS}$ exists and set $r = y - X\beta_{OLS}$. It follows for the residual sum of squares (RSS) that
     $\|y - X\beta\|_2^2 = \|(X\beta_{OLS} + r) - X\beta\|_2^2 = \|X(\beta - \beta_{OLS}) - r\|_2^2 = (\beta - \beta_{OLS})^T X^T X (\beta - \beta_{OLS}) - 2 r^T X (\beta - \beta_{OLS}) + r^T r$,
     which is an ellipse (at least in 2D) centred on $\beta_{OLS}$.
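A quick numerical check of this identity (a throwaway sketch on arbitrary simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS solution
r = y - X @ beta_ols                              # OLS residual

beta = rng.standard_normal(p)                     # any other coefficient vector
d = beta - beta_ols

rss = np.sum((y - X @ beta) ** 2)
expanded = d @ X.T @ X @ d - 2 * r @ X @ d + r @ r
print(np.isclose(rss, expanded))                  # True
# r is orthogonal to the columns of X, so the cross term is in fact zero
print(np.isclose(r @ X @ d, 0.0))                 # True
```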

  7. Intuition for the penalties (II)
     The least squares RSS is minimized for $\beta_{OLS}$. If a constraint is added ($\|\beta\|_q^q \le t$), then the RSS is minimized by the closest $\beta$ possible that fulfills the constraint. The blue lines are the contour lines of the RSS.
     [Figure: constraint regions in the $(\beta_1, \beta_2)$ plane for the lasso (left) and ridge (right), with $\beta_{OLS}$ and the constrained solutions $\beta_{lasso}$ and $\beta_{ridge}$ marked.]

  8. Intuition for the penalties (III)
     Depending on $q$, the different constraints lead to different solutions. If $\beta_{OLS}$ lies in one of the coloured areas or on a line, the constrained solution will be at the corresponding dot.
     ▶ Convexity only for $q \ge 1$
     ▶ Sparsity only for $q \le 1$
     [Figure: constraint regions in the $(\beta_1, \beta_2)$ plane for $q = 0.7$, $q = 1$, $q = 2$ and $q = \infty$.]

  9. Shrinkage and effective degrees of freedom
     When $\lambda$ is fixed, the shrinkage of the lasso estimate $\hat{\beta}_{lasso}(\lambda)$ compared to the OLS estimate $\hat{\beta}_{OLS}$ is defined as
     $s(\lambda) = \|\hat{\beta}_{lasso}(\lambda)\|_1 / \|\hat{\beta}_{OLS}\|_1$.
     Note: $s(\lambda) \in [0, 1]$ with $s(\lambda) \to 0$ for increasing $\lambda$ and $s(\lambda) = 1$ if $\lambda = 0$.
     For ridge regression define
     $H(\lambda) := X (X^T X + \lambda I_p)^{-1} X^T$ and $\mathrm{df}(\lambda) := \mathrm{tr}(H(\lambda)) = \sum_{k=1}^{p} \frac{d_k^2}{d_k^2 + \lambda}$,
     the effective degrees of freedom, where the $d_k$ are the singular values of $X$.
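A small sketch verifying the trace identity on simulated data (assuming nothing beyond the formula above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 5, 2.5
X = rng.standard_normal((n, p))

H = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T   # ridge hat matrix
d = np.linalg.svd(X, compute_uv=False)                    # singular values of X

df_trace = np.trace(H)
df_svd = np.sum(d**2 / (d**2 + lam))
print(np.isclose(df_trace, df_svd))   # True: both give the effective degrees of freedom
```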

  10. A regularisation path
      Prostate cancer dataset ($n = 67$, $p = 8$). Red dashed lines indicate the $\lambda$ selected by cross-validation.
      [Figure: coefficient paths for ridge (left) and lasso (right), plotted against the effective degrees of freedom / shrinkage (top row) and against $\log(\lambda)$ (bottom row).]
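Such coefficient paths can be traced numerically. The following sketch uses scikit-learn's `lasso_path` on simulated data (not the prostate data) to compute the lasso coefficients over a grid of penalties:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(4)
n, p = 67, 8
X = rng.standard_normal((n, p))
y = X @ np.array([0.7, -0.5, 0.3, 0, 0, 0, 0, 0]) + 0.5 * rng.standard_normal(n)

# alphas: penalty grid (largest first); coefs: p x len(alphas) coefficient paths
alphas, coefs, _ = lasso_path(X, y)
for a, c in zip(alphas[::10], coefs.T[::10]):
    print(f"alpha = {a:.3f}, nonzero coefficients = {np.count_nonzero(c)}")
```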

  11. Connection to classification

  12. Recall: Regularised Discriminant Analysis (RDA)
      Given training samples $(c_i, x_i)$, quadratic DA models $p(x \mid j) = N(x \mid \mu_j, \Sigma_j)$ and $p(j) = \pi_j$. Estimates $\hat{\mu}_j$, $\hat{\Sigma}_j$ and $\hat{\pi}_j$ are straightforward to find, but evaluating the normal density requires inversion of $\hat{\Sigma}_j$. If it is (near-)singular, this can lead to numerical instability. Penalisation can help here:
      ▶ Use $\hat{\Sigma}_j = \hat{\Sigma}_{j,QDA} + \lambda D$ for $\lambda > 0$ and a diagonal matrix $D$
      ▶ Use LDA (i.e. $\Sigma_j = \Sigma$) and $\hat{\Sigma} = \hat{\Sigma}_{LDA} + \lambda D$ for $\lambda > 0$
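A minimal illustration of why the added diagonal term helps (a sketch, not the lecture's code): with more features than samples in a class, the class covariance estimate is singular, but adding $\lambda D$ makes it invertible.

```python
import numpy as np

rng = np.random.default_rng(5)
n_j, p, lam = 10, 20, 0.1
Xj = rng.standard_normal((n_j, p))            # fewer samples than features in class j

S = np.cov(Xj, rowvar=False)                  # p x p sample covariance, rank <= n_j - 1
D = np.diag(np.diag(S))                       # diagonal matrix used for regularisation
print("rank of S:", np.linalg.matrix_rank(S)) # < p, so S cannot be inverted

S_reg = S + lam * D
print("regularised covariance invertible:", np.all(np.linalg.eigvalsh(S_reg) > 0))
```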

  13. Recall: Naive Bayes LDA
      Naive Bayes LDA means that we assume that $\hat{\Sigma} = \hat{D}$ for a diagonal matrix $\hat{D}$. The diagonal elements are estimated as
      $\hat{D}_{kk} = \frac{1}{n - K} \sum_{j=1}^{K} \sum_{i\,:\,c_i = j} (x_{ik} - \hat{\mu}_{j,k})^2$,
      which is the pooled within-class variance. Classification is performed by evaluating the discriminant functions
      $\delta_j(x) = -\tfrac{1}{2}(x - \hat{\mu}_j)^T \hat{D}^{-1} (x - \hat{\mu}_j) + \log(\hat{\pi}_j)$
      and by choosing $c(x) = \arg\max_j \delta_j(x)$ as the predicted class.
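A compact sketch of these formulas (helper functions and names are mine, not from the slides):

```python
import numpy as np

def naive_bayes_lda_fit(X, c):
    """Estimate class centroids, priors and the pooled within-class variances."""
    classes = np.unique(c)
    n, p = X.shape
    mu = np.array([X[c == j].mean(axis=0) for j in classes])   # K x p centroids
    pi = np.array([(c == j).mean() for j in classes])          # class priors
    ss = sum(((X[c == j] - mu[k]) ** 2).sum(axis=0)
             for k, j in enumerate(classes))
    D = ss / (n - len(classes))                                # pooled variances
    return classes, mu, pi, D

def naive_bayes_lda_predict(X, classes, mu, pi, D):
    """Evaluate the discriminant functions and pick the largest per sample."""
    delta = np.stack([-0.5 * ((X - mu[k]) ** 2 / D).sum(axis=1) + np.log(pi[k])
                      for k in range(len(classes))], axis=1)
    return classes[np.argmax(delta, axis=1)]
```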

  14. Shrunken centroids (I)
      In high-dimensional problems, centroids will
      ▶ contain noise
      ▶ be hard to interpret when all variables are active
      As in regression, we would like to perform variable selection and reduce noise.
      Note: The class centroids solve
      $\hat{\mu}_j = \arg\min_{m} \tfrac{1}{2} \sum_{i\,:\,c_i = j} \|x_i - m\|_2^2$.
      Nearest shrunken centroids performs variable selection and stabilises centroid estimates by solving
      $\hat{\mu}_j^{s} = \arg\min_{m} \tfrac{1}{2} \sum_{i\,:\,c_i = j} \|(\hat{D} + s_0 I_p)^{-1/2}(x_i - m)\|_2^2 + \lambda \sqrt{\tfrac{(n - n_j)\,n_j}{n}}\, \|m - \hat{\mu}_0\|_1$,
      where $\hat{\mu}_0$ denotes the overall centroid.

  15. Shrunken centroids (II)
      Nearest shrunken centroids
      $\hat{\mu}_j^{s} = \arg\min_{m} \tfrac{1}{2} \sum_{i\,:\,c_i = j} \|(\hat{D} + s_0 I_p)^{-1/2}(x_i - m)\|_2^2 + \lambda \sqrt{\tfrac{(n - n_j)\,n_j}{n}}\, \|m - \hat{\mu}_0\|_1$
      ▶ Penalises the distance of the class centroid to the overall centroid $\hat{\mu}_0$
      ▶ $\hat{D} + s_0 I_p$ is the diagonal regularised within-class covariance matrix. Leads to greater weights for variables that are less variable across samples (interpretability)
      ▶ $\sqrt{(n - n_j)\,n_j / n}$ is only there for technical reasons
      ▶ If the predictors are centred ($\hat{\mu}_0 = 0$) this is a scaled lasso problem

  16. Shrunken centroids (III)
      The solution for component $k$ can be derived using subdifferentials as
      $\hat{\mu}^{s}_{j,k} = \hat{\mu}_{0,k} + m_j (\hat{D}_{kk} + s_0)\, \mathrm{ST}(u_{j,k}, \lambda)$,
      where
      $u_{j,k} = \frac{\hat{\mu}_{j,k} - \hat{\mu}_{0,k}}{m_j (\hat{D}_{kk} + s_0)}$ and $m_j = \sqrt{\tfrac{1}{n_j} - \tfrac{1}{n}}$.
      Note: $\lambda$ is a tuning parameter and has to be determined through e.g. cross-validation.
      ▶ Typically, the misclassification rate improves first with increasing $\lambda$ and declines for too high values
      ▶ The larger $\lambda$, the more components will be equal to the respective component of the overall centroid
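A sketch of the component-wise update (an assumed implementation that follows the formula above; variable names are mine):

```python
import numpy as np

def soft_threshold(u, lam):
    """ST(u, lam) = sign(u) * max(|u| - lam, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def shrunken_centroids(X, c, s0, lam):
    """Shrink each class centroid towards the overall centroid, component by component."""
    classes = np.unique(c)
    n, p = X.shape
    K = len(classes)
    mu0 = X.mean(axis=0)                                   # overall centroid
    mu = np.array([X[c == j].mean(axis=0) for j in classes])
    n_j = np.array([(c == j).sum() for j in classes])
    # pooled within-class variances (diagonal of D-hat)
    D = sum(((X[c == j] - mu[k]) ** 2).sum(axis=0)
            for k, j in enumerate(classes)) / (n - K)
    m_j = np.sqrt(1.0 / n_j - 1.0 / n)                     # per-class scaling factor
    scale = m_j[:, None] * (D + s0)[None, :]               # m_j * (D_kk + s0)
    u = (mu - mu0) / scale                                  # standardised differences
    return mu0 + scale * soft_threshold(u, lam)             # shrunken centroids (K x p)
```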

  17. Application of nearest shrunken centroids (I)
      A gene expression data set with $n = 63$ and $p = 2308$. There are four classes (cancer subtypes) with $n_{BL} = 8$, $n_{EWS} = 23$, $n_{NB} = 12$, and $n_{RMS} = 20$.
      [Figure: 5-fold cross-validation curve of the misclassification rate against $\lambda$; the red dashed line marks the largest $\lambda$ that leads to the minimal misclassification rate.]

  18. Application of nearest shrunken centroids (II)
      [Figure: average expression per gene for the four classes BL, EWS, NB and RMS. Grey lines show the original centroids and red lines show the shrunken centroids.]

  19. General calculation of the lasso estimates

  20. Calculation of the lasso estimate
      Last lecture: When $X^T X = I_p$ and $\beta_{OLS}$ are the OLS estimates, then
      $\hat{\beta}_{lasso,k}(\lambda) = \mathrm{sign}(\beta_{OLS,k})\,(|\beta_{OLS,k}| - \lambda)_+ = \mathrm{ST}(\beta_{OLS,k}, \lambda)$,
      where $(y)_+ = \max(y, 0)$ and $\mathrm{ST}$ is the soft-thresholding operator.
      What about the general case?
      Coordinate descent: The lasso problem
      $\arg\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$
      can be written in coordinates (omitting terms not dependent on any $\beta_j$) as
      $\arg\min_{\beta_1,\ldots,\beta_p} \tfrac{1}{2} \sum_{j,k=1}^{p} \beta_j \beta_k\, x_j^T x_k - \sum_{i=1}^{n} \sum_{j=1}^{p} y_i x_{ij} \beta_j + \lambda \sum_{j=1}^{p} |\beta_j|$,
      where $x_j$ denotes the $j$-th column of $X$.
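A bare-bones coordinate descent for this objective (a sketch under the $\tfrac{1}{2}$ RSS scaling used above; no attempt is made to match any particular package's conventions):

```python
import numpy as np

def soft_threshold(u, lam):
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1 by cycling over coordinates."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                    # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding beta_j
            # 1D problem in beta_j: solved exactly by soft thresholding
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta
```

When $X^T X = I_p$, each coordinate update reduces to the soft-thresholding formula at the top of the slide.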

  21. Subderivative and subdifferential
      Let $g : J \to \mathbb{R}$ be a convex function on an open interval $J$ and $y_0 \in J$. A $d \in \mathbb{R}$ is called a subderivative of $g$ at $y_0$ if
      $g(y) - g(y_0) \ge d\,(y - y_0)$ for all $y \in J$.
      It can be shown that for
      $b = \lim_{y \to y_0^-} \frac{g(y) - g(y_0)}{y - y_0}$ and $c = \lim_{y \to y_0^+} \frac{g(y) - g(y_0)}{y - y_0}$
      all $d \in [b, c]$ are subderivatives. Call $\partial g(y_0) := [b, c]$ the subdifferential of $g$ at $y_0$.
      Example: Let $g(y) = |y|$, then
      $\partial g(y_0) = \{-1\}$ for $y_0 < 0$, $[-1, 1]$ for $y_0 = 0$, and $\{+1\}$ for $y_0 > 0$.
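To connect this to the lasso update: the one-dimensional problem $\min_y \tfrac{1}{2}(y - a)^2 + \lambda|y|$ has the subgradient condition $y - a + \lambda\,\partial|y| \ni 0$, and its solution is exactly $\mathrm{ST}(a, \lambda)$. A quick numerical check (a throwaway sketch):

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

a, lam = 0.8, 0.5
grid = np.linspace(-3, 3, 200001)                          # brute-force 1D minimisation
objective = 0.5 * (grid - a) ** 2 + lam * np.abs(grid)
print(grid[np.argmin(objective)], soft_threshold(a, lam))  # both close to 0.3
```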
