

1. STK-IN4300 Statistical Learning Methods in Data Science. Riccardo De Bin, debin@math.uio.no

2. Outline of the lecture:
- Shrinkage Methods
- Lasso
- Comparison of Shrinkage Methods
- More on Lasso and Related Path Algorithms

3. Shrinkage Methods: ridge regression and PCR

4. Shrinkage Methods: bias and variance

5. Lasso: Least Absolute Shrinkage and Selection Operator
The lasso is similar to ridge regression, with an $L_1$ penalty instead of the $L_2$ one:

$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2, \quad \text{subject to } \sum_{j=1}^{p} |\beta_j| \le t,$$

or, in the equivalent Lagrangian form,

$$\hat{\beta}^{\text{lasso}}(\lambda) = \operatorname*{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}.$$

Notes:
- $X$ must be standardized;
- $\beta_0$ is again not included in the penalty term.
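A minimal sketch of fitting this in Python with scikit-learn (not part of the slides; the synthetic data are an assumption, and scikit-learn's `alpha` corresponds to $\lambda$ only up to an internal $1/(2N)$ rescaling of the residual sum of squares):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three predictors are truly relevant.
rng = np.random.default_rng(0)
N, p = 100, 10
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=N)

# The slides require a standardized X; the intercept is unpenalized by default.
X_std = StandardScaler().fit_transform(X)
model = Lasso(alpha=0.1).fit(X_std, y)
print(model.coef_)  # several coefficients are exactly 0
```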

6. Lasso: constrained estimation

7. Lasso: remarks
Due to the structure of the $L_1$ norm:
- some estimates are forced to be exactly 0 (variable selection);
- there is no closed form for the estimator.
From a Bayesian perspective:
- with a Laplace$(0, \tau^2)$ prior on $\beta$, $\hat{\beta}^{\text{lasso}}(\lambda)$ is the posterior mode estimate;
- for more details, see Park & Casella (2008).
Extreme situations:
- $\lambda \to 0$: $\hat{\beta}^{\text{lasso}}(\lambda) \to \hat{\beta}^{\text{OLS}}$;
- $\lambda \to \infty$: $\hat{\beta}^{\text{lasso}}(\lambda) \to 0$.
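The two extreme situations can be checked numerically. A small sketch (the data and the specific `alpha` values are illustrative assumptions; `alpha` is scikit-learn's rescaled $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
near_ols = Lasso(alpha=1e-6, max_iter=500_000).fit(X, y)  # lambda -> 0
all_zero = Lasso(alpha=1e6).fit(X, y)                     # lambda -> infinity

print(np.abs(near_ols.coef_ - ols.coef_).max())  # close to 0
print(all_zero.coef_)                            # all coefficients exactly 0
```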

8. Lasso: shrinkage

9. Lasso: generalized linear models
The lasso (and ridge regression) can be used with any generalized linear model, e.g. logistic regression. In logistic regression, the lasso solution is the maximizer of

$$\max_{\beta_0, \beta} \Bigg\{ \sum_{i=1}^{N} \Big[ y_i (\beta_0 + \beta^T x_i) - \log\big( 1 + e^{\beta_0 + \beta^T x_i} \big) \Big] - \lambda \sum_{j=1}^{p} |\beta_j| \Bigg\}.$$

Note: penalized logistic regression can be applied to problems with high-dimensional data (see Section 18.4).
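A sketch of $L_1$-penalized logistic regression with scikit-learn (an illustration on assumed synthetic data; scikit-learn parameterizes the penalty through `C`, an inverse regularization strength playing roughly the role of $1/\lambda$):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome driven by the first two predictors only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
logits = 2.0 * X[:, 0] - 1.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# The liblinear solver supports the L1 penalty.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print((clf.coef_ != 0).sum(), "of 20 coefficients survive the L1 penalty")
```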

10. Comparison of Shrinkage Methods: coefficient profiles

11. Comparison of Shrinkage Methods: coefficient profiles

12. Comparison of Shrinkage Methods: coefficient profiles

13. More on Lasso and Related Path Algorithms: generalization
A generalization including both the lasso and ridge regression is bridge regression:

$$\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Bigg\}, \quad q \ge 0,$$

where:
- $q = 0$ → best subset selection;
- $q = 1$ → lasso;
- $q = 2$ → ridge regression.
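A rough numerical sketch of the bridge objective (not from the slides; the data are assumed, the intercept is omitted for brevity, and a generic optimizer is used, which is only appropriate for the differentiable cases $q > 1$; $q \le 1$ is non-smooth and needs dedicated algorithms):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=N)

def bridge_objective(beta, X, y, lam, q):
    # residual sum of squares plus the bridge penalty lam * sum(|beta_j|^q)
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta) ** q)

for q in (1.5, 2.0):  # differentiable cases only
    res = minimize(bridge_objective, x0=np.zeros(p), args=(X, y, 5.0, q))
    print(q, np.round(res.x, 3))
```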

14. More on Lasso and Related Path Algorithms: generalization
Note that:
- $0 < q \le 1$ → the penalty is not differentiable at 0;
- $1 < q < 2$ → a compromise between lasso and ridge (but the penalty is differentiable ⇒ no variable selection property);
- $q$ defines the shape of the constraint region;
- $q$ could be estimated from the data (as a tuning parameter);
- in practice this does not work well (it adds variance).

15. More on Lasso and Related Path Algorithms: elastic net
A different compromise between lasso and ridge regression is the elastic net:

$$\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta} \Bigg\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha |\beta_j| + (1 - \alpha) \beta_j^2 \big) \Bigg\}.$$

Idea:
- the $L_1$ penalty takes care of variable selection;
- the $L_2$ penalty helps in correctly handling correlation;
- $\alpha$ defines how much $L_1$ and $L_2$ penalty should be used:
  - it is a tuning parameter, to be chosen in addition to $\lambda$;
  - a grid search is discouraged;
  - in real experiments, it is often very close to 0 or 1.
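A sketch with scikit-learn's implementation (an illustration on assumed data; `l1_ratio` plays the role of $\alpha$ in the slide's notation and `alpha` that of $\lambda$, up to scikit-learn's internal rescaling of the objective):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)  # two strongly correlated predictors
y = X[:, 0] + rng.normal(size=100)

enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2
print(np.round(enet.coef_[:2], 3))  # correlated predictors tend to be shrunk together
```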

16. More on Lasso and Related Path Algorithms: elastic net
Comparing bridge regression and the elastic net:
- the penalties look very similar;
- there is a huge practical difference due to differentiability: the elastic net keeps the variable selection property, the bridge with $1 < q < 2$ does not.

17. More on Lasso and Related Path Algorithms: Least Angle Regression
Least Angle Regression (LAR):
- can be viewed as a "democratic" version of forward selection;
- sequentially adds new predictors to the model, each only "as much as it deserves";
- eventually reaches the least-squares solution;
- is strongly connected with the lasso:
  - the lasso can be seen as a special case of LAR;
  - LAR is often used to fit lasso models.

18. More on Lasso and Related Path Algorithms: LAR
Least Angle Regression:
1. Standardize the predictors (mean zero, unit norm). Initialize the residual $r = y - \bar{y}$ and the coefficient estimates $\hat\beta_1 = \cdots = \hat\beta_p = 0$.
2. Find the predictor $x_j$ most correlated with $r$.
3. Move $\hat\beta_j$ from 0 towards its least-squares coefficient $\langle x_j, r \rangle$, until some other predictor $x_k$ ($k \ne j$) has as much correlation with the current residual: $\text{corr}(x_k, r) = \text{corr}(x_j, r)$.
4. Add $x_k$ to the active set and move $\hat\beta_j$ and $\hat\beta_k$ towards their joint least-squares coefficient, until some other predictor $x_l$ has as much correlation with the current residual.
5. Continue until all $p$ predictors have been entered.
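A sketch of the path computation with scikit-learn's `lars_path` (an illustration on assumed data; the same routine also returns the lasso path, reflecting the connection mentioned on the previous slide):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

alphas, active, coefs = lars_path(X, y, method="lar")  # pure LAR
_, _, coefs_lasso = lars_path(X, y, method="lasso")    # lasso path via LAR

print("order of entry (LAR):", active)  # predictor indices, in order of entry
print("path shape:", coefs.shape)       # (p, n_steps): one column per step
```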

19. More on Lasso and Related Path Algorithms: comparison

20. More on Lasso and Related Path Algorithms: overfit

21. More on Lasso and Related Path Algorithms: other shrinkage methods
Group lasso. Sometimes predictors belong to the same group:
- genes that belong to the same molecular pathway;
- dummy variables derived from the same categorical variable; ...
Suppose the $p$ predictors are partitioned into $L$ groups; the group lasso minimizes

$$\min_{\beta} \Bigg\{ \Big\| y - \beta_0 \mathbf{1} - \sum_{\ell=1}^{L} X_\ell \beta_\ell \Big\|_2^2 + \lambda \sum_{\ell=1}^{L} \sqrt{p_\ell}\, \| \beta_\ell \|_2 \Bigg\},$$

where:
- $\sqrt{p_\ell}$ accounts for the group sizes;
- $\| \cdot \|_2$ denotes the (not squared) Euclidean norm, which is 0 only if all its components are 0;
- sparsity is encouraged at the group level.
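A minimal proximal-gradient sketch of the group lasso (an illustration, not the slides' method; the data and the 0.5 scaling of the squared error are assumptions): each iteration takes a gradient step on the squared error and then soft-thresholds each group's coefficient block as a whole.

```python
import numpy as np

def group_soft_threshold(b, thresh):
    """Shrink the block b towards 0; set it exactly to 0 if ||b||_2 <= thresh."""
    norm = np.linalg.norm(b)
    return np.zeros_like(b) if norm <= thresh else (1 - thresh / norm) * b

def group_lasso(X, y, groups, lam, n_iter=2000):
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)          # gradient of 0.5 * ||y - X beta||^2
        beta = beta - step * grad
        for g in groups:                     # g is an array of column indices
            w = np.sqrt(len(g))              # sqrt(p_l) group-size weight
            beta[g] = group_soft_threshold(beta[g], step * lam * w)
    return beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
y = X[:, :2] @ np.array([2.0, -1.5]) + rng.normal(size=100)
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(np.round(group_lasso(X, y, groups, lam=20.0), 3))  # whole groups drop to 0
```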

22. More on Lasso and Related Path Algorithms: other shrinkage methods
Non-negative garrote. The idea of the lasso originates from the non-negative garrote,

$$\hat{\beta}^{\text{garrote}} = \operatorname*{argmin}_{c} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} c_j \hat{\beta}_j x_{ij} \Big)^2, \quad \text{subject to } c_j \ge 0 \text{ and } \sum_{j} c_j \le t,$$

where the $\hat{\beta}_j$ are the OLS estimates. The non-negative garrote starts with the OLS estimates and shrinks them:
- by non-negative factors;
- the sum of the non-negative factors is constrained;
- for more information, see Breiman (1995).
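A sketch of the garrote as a constrained least-squares problem over the shrinkage factors $c_j$ (an illustration on assumed data, solved with a generic SLSQP optimizer rather than Breiman's original algorithm):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
N, p = 100, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + rng.normal(size=N)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # start from the OLS fit

def rss(c):
    return np.sum((y - X @ (c * beta_ols)) ** 2)  # shrink each beta_j by c_j

t = 2.0
res = minimize(rss, x0=np.ones(p),
               bounds=[(0, None)] * p,                                      # c_j >= 0
               constraints=[{"type": "ineq", "fun": lambda c: t - c.sum()}])  # sum c_j <= t
print(np.round(res.x, 3))  # factors hitting the bound 0 correspond to dropped variables
```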

23. More on Lasso and Related Path Algorithms: other shrinkage methods
In the case of an orthogonal design ($X^T X = I$), the garrote factors have the closed form

$$c_j(\lambda) = \bigg( 1 - \frac{\lambda}{\big(\hat{\beta}_j^{\text{OLS}}\big)^2} \bigg)_+,$$

where $\lambda$ is a tuning parameter (related to $t$) and $(\cdot)_+$ denotes the positive part. Note that the solution depends on $\hat{\beta}^{\text{OLS}}$:
- it cannot be applied in $p \gg N$ problems;
- it may be a problem when $\hat{\beta}^{\text{OLS}}$ behaves poorly;
- it has the oracle properties (Yuan & Lin, 2006) ← see soon.
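A tiny numerical illustration of the closed form above (the OLS values and $\lambda$ are hypothetical):

```python
import numpy as np

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])  # hypothetical OLS estimates
lam = 1.0
c = np.maximum(0.0, 1.0 - lam / beta_ols**2)  # c_j(lambda) = (1 - lambda / beta_j^2)_+
print(np.round(c, 3))             # small OLS coefficients get factor 0 -> dropped
print(np.round(c * beta_ols, 3))  # the resulting garrote estimates
```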

24. More on Lasso and Related Path Algorithms: other shrinkage methods
Comparison between the lasso (left) and the non-negative garrote (right). (Picture from Tibshirani, 1996.)

25. More on Lasso and Related Path Algorithms: the oracle property
Let:
- $\mathcal{A} := \{ j : \beta_j \ne 0 \}$ be the set of the truly relevant coefficients;
- $\delta$ be a fitting procedure (lasso, non-negative garrote, ...);
- $\hat{\beta}(\delta)$ be the coefficient estimator produced by $\delta$.
We would like $\delta$ to:
(a) identify the right subset model, $\{ j : \hat{\beta}(\delta)_j \ne 0 \} = \mathcal{A}$;
(b) attain the optimal estimation rate, $\sqrt{n} \big( \hat{\beta}(\delta)_{\mathcal{A}} - \beta_{\mathcal{A}} \big) \xrightarrow{d} N(0, \Sigma)$, where $\Sigma$ is the covariance matrix of the true subset model.
If $\delta$ asymptotically satisfies (a) and (b), it is called an oracle procedure.
