Ridge/Lasso Regression, Model Selection


  1. Ridge/Lasso Regression, Model Selection. Xuezhi Wang, Computer Science Department, Carnegie Mellon University. 10-701 recitation, Apr 22.

  2. Outline: (1) Ridge/Lasso Regression: Linear Regression; Regularization; Probabilistic Interpretation. (2) Model Selection: Variable Selection; Model Selection.

  3. Outline (repeated at the start of the Linear Regression subsection).

  4. Linear Regression. Data X: an N × P matrix; target y: an N × 1 vector (N samples, each with P features). We want θ such that Xθ is as close to y as possible, so we pick θ minimizing the cost function
     L(θ) = (1/2) Σ_i (y_i − X_i θ)^2 = (1/2) ||y − Xθ||_2^2.
     Using gradient descent, each coordinate is updated as
     θ_j^(t+1) = θ_j^t − step · ∂L/∂θ_j = θ_j^t − step · Σ_i (y_i − X_i θ)(−X_ij).
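
A minimal NumPy sketch of this gradient-descent update (the function name, step size, and iteration count are illustrative choices, not from the slides):

    import numpy as np

    def linear_regression_gd(X, y, step=0.01, n_iters=1000):
        """Minimize L(theta) = 0.5 * ||y - X theta||^2 by gradient descent."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iters):
            residual = y - X @ theta       # y_i - X_i theta for every sample
            grad = -X.T @ residual         # dL/dtheta_j = sum_i (y_i - X_i theta)(-X_ij)
            theta -= step * grad
        return theta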

  5. Linear Regression, matrix form:
     L(θ) = (1/2) Σ_i (y_i − X_i θ)^2 = (1/2) ||y − Xθ||_2^2
          = (1/2) (y − Xθ)^T (y − Xθ)
          = (1/2) (y^T y − y^T Xθ − θ^T X^T y + θ^T X^T Xθ).
     Take the derivative with respect to θ and set it to zero:
     ∂L/∂θ = (1/2) (−2 X^T y + 2 X^T X θ) = 0.
     Hence θ = (X^T X)^{-1} X^T y.
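
The same solution in NumPy (a sketch; np.linalg.solve is used instead of forming the inverse explicitly, which is numerically preferable but otherwise equivalent):

    import numpy as np

    def linear_regression_closed_form(X, y):
        """Solve the normal equations X^T X theta = X^T y."""
        return np.linalg.solve(X.T @ X, X.T @ y)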

  6. Comparison of iterative and matrix methods: the matrix (closed-form) method reaches the solution in a single step, but can be infeasible for real-time data or very large datasets; iterative methods can be used on large practical problems, but require choosing a learning rate. Any problems? Data X is an N × P matrix. Usually N > P, i.e., there are more data points than feature dimensions, and usually X has full column rank; in that case X^T X has rank P, i.e., it is invertible. What if X has less than full column rank?

  7. Outline (repeated at the start of the Regularization subsection).

  8. Regularization with the ℓ2 norm (Ridge Regression):
     min_θ (1/2) Σ_i (y_i − X_i θ)^2 + λ ||θ||_2^2.
     The solution is θ = (X^T X + λI)^{-1} X^T y. This results in a solution with small θ, and it solves the problem that X^T X is not invertible.
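
A NumPy sketch of the ridge closed form (the function and variable names are illustrative):

    import numpy as np

    def ridge_closed_form(X, y, lam):
        """theta = (X^T X + lambda I)^{-1} X^T y.
        Adding lam * I makes the system solvable even when X has
        less than full column rank."""
        P = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)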

  9. Regularization with the ℓ1 norm (Lasso Regression):
     min_θ (1/2) Σ_i (y_i − X_i θ)^2 + λ ||θ||_1.
     The solution is characterized by setting the subgradient to zero:
     Σ_i (y_i − X_i θ)(−X_ij) + λ t_j = 0,
     where t_j is the subgradient of the ℓ1 norm: t_j = sign(θ_j) if θ_j ≠ 0, and t_j ∈ [−1, 1] otherwise. The solution is sparse, i.e., θ will be a vector with more zero coordinates, which is good for high-dimensional problems.
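
The slide only states the optimality condition; one common solver built directly on it is cyclic coordinate descent with soft-thresholding (not the method covered on these slides, which present forward stagewise/LARS next). A sketch in NumPy:

    import numpy as np

    def soft_threshold(z, lam):
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def lasso_coordinate_descent(X, y, lam, n_iters=100):
        """Cyclic coordinate descent: each coordinate update solves the
        per-coordinate subgradient condition via soft-thresholding."""
        N, P = X.shape
        theta = np.zeros(P)
        col_sq = (X ** 2).sum(axis=0)                 # sum_i X_ij^2 per column
        for _ in range(n_iters):
            for j in range(P):
                # partial residual with coordinate j excluded
                r_j = y - X @ theta + X[:, j] * theta[j]
                z_j = X[:, j] @ r_j
                theta[j] = soft_threshold(z_j, lam) / col_sq[j]
        return theta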

  10. Solving Lasso regression. Efron et al. proposed LARS (least angle regression), which computes the LASSO path efficiently. Forward stagewise algorithm (assume X is standardized and y is centered; choose a small ε):
     1. Start with the initial residual r = y and θ_1 = ... = θ_P = 0.
     2. Find the predictor Z_j (the j-th column of X) most correlated with r.
     3. Update θ_j ← θ_j + δ_j, where δ_j = ε · sign(Z_j^T r).
     4. Set r ← r − δ_j Z_j and repeat steps 2 and 3.
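
A direct NumPy translation of this loop (a sketch; ε and the number of steps are illustrative):

    import numpy as np

    def forward_stagewise(X, y, eps=0.01, n_steps=1000):
        """Forward stagewise fitting. Assumes the columns of X are
        standardized and y is centered."""
        theta = np.zeros(X.shape[1])
        r = y.copy()                          # initial residual r = y
        for _ in range(n_steps):
            corr = X.T @ r                    # correlation of each predictor with r
            j = np.argmax(np.abs(corr))       # most correlated predictor Z_j
            delta = eps * np.sign(corr[j])    # delta_j = eps * sign(Z_j^T r)
            theta[j] += delta
            r -= delta * X[:, j]              # r <- r - delta_j * Z_j
        return theta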

  11. Comparison of Ridge and Lasso regression, two-dimensional case (figure).

  12. Comparison of Ridge and Lasso regression, higher-dimensional case (figure).

  13. Choosing λ: standard practice now is to use cross-validation.
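
One way to do this with scikit-learn (a sketch on synthetic data; note that scikit-learn calls the regularization strength alpha rather than λ):

    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    theta_true = np.array([2.0, -1.5, 0, 0, 0, 0, 0, 0, 0, 0])
    y = X @ theta_true + 0.5 * rng.normal(size=200)

    ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)  # leave-one-out CV over a grid
    lasso = LassoCV(cv=5).fit(X, y)                           # 5-fold CV over its own alpha path
    print("ridge lambda:", ridge.alpha_)
    print("lasso lambda:", lasso.alpha_)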

  14. Outline (repeated at the start of the Probabilistic Interpretation subsection).

  15. Probabilistic interpretation of linear regression. Assume y_i = X_i θ + ε_i, where ε_i is random noise with ε_i ∼ N(0, σ²). Then
     p(y_i | X_i; θ) = (1 / (√(2π) σ)) exp{ −(y_i − X_i θ)^2 / (2σ²) }.
     Since the data points are i.i.d., the likelihood is
     L(θ) = Π_{i=1}^N p(y_i | X_i; θ) ∝ exp{ −Σ_{i=1}^N (y_i − X_i θ)^2 / (2σ²) },
     so the log-likelihood is
     ℓ(θ) = −Σ_{i=1}^N (y_i − X_i θ)^2 / (2σ²) + const.
     Maximizing the log-likelihood is equivalent to minimizing Σ_{i=1}^N (y_i − X_i θ)^2, i.e., the loss function of linear regression.

  16. Probabilistic interpretation of ridge regression. Assume a Gaussian prior θ ∼ N(0, τ² I), i.e., p(θ) ∝ exp{ −θ^T θ / (2τ²) }. Now find the MAP estimate of θ:
     p(θ | X, y) ∝ p(y | X; θ) p(θ) = exp{ −Σ_{i=1}^N (y_i − X_i θ)^2 / (2σ²) } · exp{ −θ^T θ / (2τ²) }.
     The log posterior (up to constants) is
     ℓ(θ | X, y) = −Σ_{i=1}^N (y_i − X_i θ)^2 / (2σ²) − θ^T θ / (2τ²) + const,
     which matches min_θ (1/2) Σ_i (y_i − X_i θ)^2 + λ ||θ||_2^2, where λ is a constant determined by σ² and τ².
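
For reference, multiplying the negative log posterior by σ² (which does not change the minimizer) makes the constant explicit; its value is not stated on the slide:

\[
\sigma^2 \left[ \sum_{i=1}^N \frac{(y_i - X_i\theta)^2}{2\sigma^2} + \frac{\theta^\top\theta}{2\tau^2} \right]
= \frac{1}{2}\sum_{i=1}^N (y_i - X_i\theta)^2 + \frac{\sigma^2}{2\tau^2}\,\|\theta\|_2^2
\quad\Rightarrow\quad \lambda = \frac{\sigma^2}{2\tau^2}.
\]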

  17. Probabilistic interpretation of lasso regression. Assume an i.i.d. Laplace prior θ_i ∼ Laplace(0, t), i.e., p(θ_i) ∝ exp{ −|θ_i| / t }. Now find the MAP estimate of θ:
     p(θ | X, y) ∝ p(y | X; θ) p(θ) = exp{ −Σ_{i=1}^N (y_i − X_i θ)^2 / (2σ²) } · Π_i exp{ −|θ_i| / t }.
     The log posterior (up to constants) is
     ℓ(θ | X, y) = −Σ_{i=1}^N (y_i − X_i θ)^2 / (2σ²) − Σ_i |θ_i| / t + const,
     which matches min_θ (1/2) Σ_i (y_i − X_i θ)^2 + λ ||θ||_1, where λ is a constant determined by σ² and t.
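
The same rescaling as above gives the constant here (again not stated on the slide):

\[
\sigma^2 \left[ \sum_{i=1}^N \frac{(y_i - X_i\theta)^2}{2\sigma^2} + \frac{1}{t}\sum_j |\theta_j| \right]
= \frac{1}{2}\sum_{i=1}^N (y_i - X_i\theta)^2 + \frac{\sigma^2}{t}\,\|\theta\|_1
\quad\Rightarrow\quad \lambda = \frac{\sigma^2}{t}.
\]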

  18. Outline (repeated at the start of the Variable Selection subsection of Model Selection).

  19. Variable Selection. Considering all "best" subsets is of order O(2^P) (combinatorial explosion). Stepwise selection: a new variable may be added to the model even with only a small improvement in LMS, and when stepwise selection is applied to a perturbation of the data, a different set of variables will likely enter the model at each stage. LASSO produces sparse solutions, which takes care of model selection; we can even see when variables jump into the model by looking at the LASSO path, as sketched below.
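
One way to visualize that path is scikit-learn's lasso_path (a sketch on synthetic data):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import lasso_path

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

    alphas, coefs, _ = lasso_path(X, y)       # coefs has shape (P, n_alphas)
    for j in range(coefs.shape[0]):
        plt.plot(-np.log10(alphas), coefs[j], label=f"feature {j}")
    plt.xlabel("-log10(lambda)")              # variables "jump in" as lambda decreases
    plt.ylabel("coefficient")
    plt.legend()
    plt.show()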

  20. Outline (repeated at the start of the Model Selection subsection).

  21. Example. Suppose you have data Y_1, ..., Y_n and you want to model the distribution of Y. Some popular models are: the Exponential distribution, f(y; θ) = θ e^{−θy}; the Gaussian distribution, f(y; μ, σ²) ∼ N(μ, σ²); and so on. How do you know which model is better?

  22. AIC. Suppose we have models M_1, ..., M_k, where each model is a set of densities: M_j = { p(y; θ_j) : θ_j ∈ Θ_j }. We have data Y_1, ..., Y_n drawn from some density f (not necessarily one of these models). Define
     AIC(j) = ℓ_j(θ̂_j) − 2 d_j,
     where ℓ_j(θ_j) is the log-likelihood, θ̂_j is the parameter that maximizes it, and d_j is the dimension of Θ_j.
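
A sketch of this comparison for the two candidate models from the previous slide, using the AIC definition as given here (NumPy; the synthetic data and function names are illustrative):

    import numpy as np

    def exponential_fit(y):
        """MLE for f(y; theta) = theta * exp(-theta * y); returns (loglik, d)."""
        theta_hat = 1.0 / np.mean(y)
        loglik = len(y) * np.log(theta_hat) - theta_hat * np.sum(y)
        return loglik, 1

    def gaussian_fit(y):
        """MLE for N(mu, sigma^2); returns (loglik, d)."""
        n, var_hat = len(y), np.var(y)
        loglik = -0.5 * n * np.log(2 * np.pi * var_hat) - 0.5 * n
        return loglik, 2

    def aic(loglik, d):
        return loglik - 2 * d          # definition used on this slide; pick the largest

    y = np.random.default_rng(0).exponential(scale=2.0, size=200)
    for name, (ll, d) in [("exponential", exponential_fit(y)), ("gaussian", gaussian_fit(y))]:
        print(name, "AIC =", aic(ll, d))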

  23. BIC (Bayesian Information Criterion). We choose j to maximize
     BIC_j = ℓ_j(θ̂_j) − (d_j / 2) log n,
     which is similar to AIC but with a harsher penalty, so BIC tends to choose simpler models.
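
A minimal sketch contrasting the two penalties under the definitions given on these slides (the numeric threshold follows directly from those formulas):

    import math

    def aic(loglik, d):
        return loglik - 2 * d                   # AIC as defined on slide 22

    def bic(loglik, d, n):
        return loglik - 0.5 * d * math.log(n)   # BIC as defined on this slide

    # Per extra parameter, AIC subtracts 2 while BIC subtracts 0.5 * log(n),
    # so BIC penalizes more heavily once n > e^4 (about 55 samples).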

  24. Simple example. Let Y_1, ..., Y_n ∼ N(μ, 1). We want to compare two models: M_0: N(0, 1) and M_1: N(μ, 1).
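
The transcript stops here; for completeness, the quantities one would plug into AIC/BIC for this example (this continuation is not on the slide):

\[
\hat\mu = \bar Y, \qquad
\ell_0 = -\tfrac{1}{2}\sum_{i=1}^n Y_i^2 + c, \quad d_0 = 0, \qquad
\ell_1 = -\tfrac{1}{2}\sum_{i=1}^n (Y_i - \bar Y)^2 + c, \quad d_1 = 1,
\]

with \(c = -\tfrac{n}{2}\log(2\pi)\), so \(\ell_1 - \ell_0 = \tfrac{n}{2}\bar Y^2\): the criteria trade this gain in fit against the penalty for the one extra parameter.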
