

SLIDE 1

Ridge/Lasso Regression, Model Selection

Xuezhi Wang

Computer Science Department, Carnegie Mellon University

10-701 Recitation, Apr 22

SLIDE 2

Outline

1. Ridge/Lasso Regression
   - Linear Regression
   - Regularization
   - Probabilistic Interpretation

2. Model Selection
   - Variable Selection
   - Model Selection


SLIDE 4

Linear Regression

Data $X$: an $N \times P$ matrix; target $y$: an $N \times 1$ vector. There are $N$ samples, each with $P$ features.

We want to find $\theta$ so that $y$ and $X\theta$ are as close as possible, so we pick the $\theta$ that minimizes the cost function

$$L = \frac{1}{2}\sum_i (y_i - X_i\theta)^2 = \frac{1}{2}\|y - X\theta\|^2$$

Using gradient descent, each coordinate $j$ is updated as

$$\theta_j^{t+1} = \theta_j^t - \text{step} \cdot \frac{\partial L}{\partial \theta_j} = \theta_j^t - \text{step} \cdot \sum_i (y_i - X_i\theta)(-X_{ij})$$
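A minimal NumPy sketch of this gradient-descent update (the function name, step size, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def linear_regression_gd(X, y, step=0.01, n_iters=1000):
    """Minimize L = 0.5 * ||y - X theta||^2 by batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ theta      # (y_i - X_i theta) for every sample i
        grad = -X.T @ residual        # dL/dtheta_j = sum_i (y_i - X_i theta) * (-X_ij)
        theta = theta - step * grad
    return theta
```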

SLIDE 5

Linear Regression

Matrix form:

$$L = \frac{1}{2}\sum_i (y_i - X_i\theta)^2 = \frac{1}{2}\|y - X\theta\|^2 = \frac{1}{2}(y - X\theta)^\top (y - X\theta) = \frac{1}{2}\left(y^\top y - y^\top X\theta - \theta^\top X^\top y + \theta^\top X^\top X\theta\right)$$

Take the derivative with respect to $\theta$ and set it to zero:

$$\frac{\partial L}{\partial \theta} = \frac{1}{2}\left(-2X^\top y + 2X^\top X\theta\right) = 0$$

Hence we get $\theta = (X^\top X)^{-1} X^\top y$.
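A quick NumPy check of the closed-form solution (a sketch with synthetic data; np.linalg.solve is used instead of forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5
X = rng.normal(size=(N, P))
true_theta = rng.normal(size=P)
y = X @ true_theta + 0.1 * rng.normal(size=N)

# Normal equations theta = (X^T X)^{-1} X^T y, solved as a linear system
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(theta_hat, true_theta, atol=0.1))   # close to the generating theta
```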

SLIDE 6

Linear Regression

Comparison of iterative and matrix methods:
- Matrix methods reach the solution in a single step, but can be infeasible for real-time data or very large datasets.
- Iterative methods can be used on large practical problems, but require choosing a learning rate.

Any problems? The data $X$ is an $N \times P$ matrix. Usually $N > P$, i.e., there are more data points than feature dimensions, and usually $X$ has full column rank. In that case $X^\top X$ has rank $P$, i.e., it is invertible. What if $X$ has less than full column rank?
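A small NumPy illustration of the rank-deficient case (synthetic data; the duplicated column is the deliberate assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
X = np.column_stack([x1, 2 * x1, rng.normal(size=50)])   # first two columns are collinear

print(np.linalg.matrix_rank(X))        # 2 < P = 3: less than full column rank
print(np.linalg.matrix_rank(X.T @ X))  # also 2, so X^T X is singular and cannot be inverted
```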


SLIDE 8

Regularization: ℓ2 norm

Ridge Regression:

$$\min_\theta \; \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda \|\theta\|_2^2$$

The solution is given by $\theta = (X^\top X + \lambda I)^{-1} X^\top y$.

- Results in a solution with small $\theta$.
- Solves the problem that $X^\top X$ is not invertible.
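A minimal NumPy sketch of the ridge solution stated above (lam is an arbitrary illustrative value):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """theta = (X^T X + lambda I)^{-1} X^T y, computed via a linear solve."""
    P = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
```

Because $X^\top X + \lambda I$ is positive definite for $\lambda > 0$, the solve succeeds even when $X^\top X$ itself is singular.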

SLIDE 9

Regularization: ℓ1 norm

Lasso Regression:

$$\min_\theta \; \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda \|\theta\|_1$$

The solution is given by taking the subgradient:

$$\sum_i (y_i - X_i\theta)(-X_{ij}) + \lambda t_j$$

where $t_j$ is the subgradient of the $\ell_1$ norm: $t_j = \mathrm{sign}(\theta_j)$ if $\theta_j \neq 0$, and $t_j \in [-1, 1]$ otherwise.

This yields a sparse solution, i.e., $\theta$ will be a vector with more zero coordinates, which is good for high-dimensional problems.
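A direct (if slow) NumPy sketch of subgradient descent on this objective; the step size and iteration count are arbitrary illustrative values:

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam=0.1, step=0.001, n_iters=5000):
    """Subgradient descent on 0.5 * ||y - X theta||^2 + lam * ||theta||_1."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        t = np.sign(theta)                        # subgradient of ||theta||_1 (0 is a valid choice at theta_j = 0)
        grad = -X.T @ (y - X @ theta) + lam * t
        theta = theta - step * grad
    return theta
```

Plain subgradient steps rarely land exactly on zero, so practical solvers use soft-thresholding (coordinate descent / ISTA) or the path algorithm on the next slide to obtain exact sparsity.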

SLIDE 10

Solving Lasso regression

Efron et al. proposed LARS (Least Angle Regression), which computes the LASSO path efficiently.

Forward stagewise algorithm (assume $X$ is standardized and $y$ is centered; choose a small $\epsilon$):

1. Start with the initial residual $r = y$ and $\theta_1 = \dots = \theta_P = 0$.
2. Find the predictor $Z_j$ ($j$th column of $X$) most correlated with $r$.
3. Update $\theta_j \leftarrow \theta_j + \delta_j$, where $\delta_j = \epsilon \cdot \mathrm{sign}(Z_j^\top r)$.
4. Set $r \leftarrow r - \delta_j Z_j$, and repeat steps 2 and 3.
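A NumPy sketch of this forward stagewise loop (eps and n_steps are arbitrary illustrative values):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Forward stagewise regression; assumes X is standardized and y is centered."""
    theta = np.zeros(X.shape[1])
    r = y.copy()                          # step 1: initial residual r = y
    for _ in range(n_steps):
        corr = X.T @ r                    # step 2: correlation of each predictor with r
        j = np.argmax(np.abs(corr))       #         pick the most correlated predictor Z_j
        delta = eps * np.sign(corr[j])    # step 3: delta_j = eps * sign(Z_j^T r)
        theta[j] += delta
        r -= delta * X[:, j]              # step 4: update the residual and repeat
    return theta
```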

SLIDE 11

Comparison of Ridge and Lasso regression:

Two-dimensional case:

SLIDE 12

Comparison of Ridge and Lasso regression:

Higher dimensional case:

SLIDE 13

Choosing λ

Standard practice now is to use cross-validation
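A minimal sketch of choosing λ by cross-validation with scikit-learn's LassoCV (synthetic data; 5 folds is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
theta = np.zeros(20)
theta[:3] = [2.0, -1.0, 0.5]            # only the first three features matter
y = X @ theta + 0.1 * rng.normal(size=200)

model = LassoCV(cv=5).fit(X, y)         # cross-validates over a grid of lambda values
print(model.alpha_)                     # the lambda chosen by cross-validation
print(np.nonzero(model.coef_)[0])       # the features kept in the fitted model
```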


SLIDE 15

Probabilistic Interpretation of Linear Regression

Assume $y_i = X_i\theta + \epsilon_i$, where $\epsilon_i$ is random noise, and assume $\epsilon_i \sim N(0, \sigma^2)$. Then

$$p(y_i \mid X_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y_i - X_i\theta)^2}{2\sigma^2}\right\}$$

Since the data points are i.i.d., the data likelihood is

$$L(\theta) = \prod_{i=1}^N p(y_i \mid X_i; \theta) \propto \exp\left\{-\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2}\right\}$$

The log-likelihood is

$$\ell(\theta) = -\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2} + \text{const}$$

Maximizing the log-likelihood is equivalent to minimizing $\sum_{i=1}^N (y_i - X_i\theta)^2$, i.e., the loss function in linear regression!

SLIDE 16

Probabilistic Interpretation of Ridge Regression

Assume a Gaussian prior on $\theta$: $\theta \sim N(0, \tau^2 I)$, i.e., $p(\theta) \propto \exp\{-\theta^\top\theta / 2\tau^2\}$.

Now take the MAP estimate of $\theta$:

$$p(\theta \mid X, y) \propto p(y \mid X; \theta)\, p(\theta) = \exp\left\{-\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2}\right\} \exp\left\{-\theta^\top\theta / 2\tau^2\right\}$$

The log-posterior is

$$\ell(\theta \mid X, y) = -\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2} - \theta^\top\theta / 2\tau^2 + \text{const}$$

which matches $\min_\theta \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda\|\theta\|_2^2$, where $\lambda$ is a constant determined by $\sigma^2$ and $\tau^2$.
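Making the constant explicit: multiplying the negative log-posterior by $\sigma^2$ (a positive constant, so the minimizer is unchanged) gives

$$\frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \frac{\sigma^2}{2\tau^2}\,\theta^\top\theta,$$

so the correspondence holds with $\lambda = \sigma^2 / (2\tau^2)$.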

SLIDE 17

Probabilistic Interpretation of Lasso Regression

Assume a Laplace prior on each $\theta_i$: $\theta_i \overset{iid}{\sim} \mathrm{Laplace}(0, t)$, i.e., $p(\theta_i) \propto \exp\{-|\theta_i|/t\}$.

Now take the MAP estimate of $\theta$:

$$p(\theta \mid X, y) \propto p(y \mid X; \theta)\, p(\theta) = \exp\left\{-\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2}\right\} \exp\left\{-\sum_i |\theta_i|/t\right\}$$

The log-posterior is

$$\ell(\theta \mid X, y) = -\frac{\sum_{i=1}^N (y_i - X_i\theta)^2}{2\sigma^2} - \sum_i |\theta_i|/t + \text{const}$$

which matches $\min_\theta \frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \lambda\|\theta\|_1$, where $\lambda$ is a constant determined by $\sigma^2$ and $t$.
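As before, multiplying by $\sigma^2$ gives $\frac{1}{2}\sum_i (y_i - X_i\theta)^2 + \frac{\sigma^2}{t}\|\theta\|_1$, so here $\lambda = \sigma^2 / t$.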


SLIDE 19

Variable Selection

Considering all "best" subsets is of order $O(2^P)$ (combinatorial explosion).

Stepwise selection:
- A new variable may be added to the model even when it gives only a small improvement in LMS.
- When stepwise selection is applied to a perturbation of the data, a different set of variables will probably enter the model at each stage.

LASSO produces sparse solutions, which takes care of model selection. We can even see when variables jump into the model by looking at the LASSO path (see the sketch below).
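A sketch of inspecting when each variable enters along the LASSO path, using scikit-learn's lasso_path on synthetic data:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
theta = np.zeros(10)
theta[:3] = [3.0, -2.0, 1.0]                     # only three relevant variables
y = X @ theta + 0.1 * rng.normal(size=200)

# alphas is a decreasing grid of lambda values; coefs has shape (n_features, n_alphas)
alphas, coefs, _ = lasso_path(X, y)
for j in range(X.shape[1]):
    nonzero = np.nonzero(coefs[j] != 0)[0]
    if nonzero.size:
        print(f"variable {j} enters the model at lambda = {alphas[nonzero[0]]:.4f}")
```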


SLIDE 21

Example

Suppose you have data $Y_1, \dots, Y_n$ and you want to model the distribution of $Y$. Some popular models are:
- the Exponential distribution: $f(y; \theta) = \theta e^{-\theta y}$
- the Gaussian distribution: $f(y; \mu, \sigma^2) \sim N(\mu, \sigma^2)$
- ...

How do you know which model is better?

SLIDE 22

AIC

Suppose we have models $M_1, \dots, M_k$, where each model is a set of densities:

$$M_j = \{p(y; \theta_j) : \theta_j \in \Theta_j\}$$

We have data $Y_1, \dots, Y_n$ drawn from some density $f$ (not necessarily one of these models). Define

$$\mathrm{AIC}(j) = \ell_j(\hat\theta_j) - 2 d_j$$

where $\ell_j(\theta_j)$ is the log-likelihood, $\hat\theta_j$ is the parameter that maximizes the log-likelihood, and $d_j$ is the dimension of $\Theta_j$. We choose the model with the largest AIC.
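A sketch of scoring the two candidate families from the earlier example with this AIC definition (synthetic data; SciPy is used only for the log-densities):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
Y = rng.exponential(scale=2.0, size=500)          # pretend the true family is unknown

# Exponential model: MLE theta_hat = 1 / mean(Y), dimension d = 1
theta_hat = 1.0 / Y.mean()
ll_exp = stats.expon.logpdf(Y, scale=1.0 / theta_hat).sum()
aic_exp = ll_exp - 2 * 1

# Gaussian model: MLE (mu_hat, sigma_hat), dimension d = 2
mu_hat, sigma_hat = Y.mean(), Y.std()
ll_gauss = stats.norm.logpdf(Y, loc=mu_hat, scale=sigma_hat).sum()
aic_gauss = ll_gauss - 2 * 2

print("prefer", "Exponential" if aic_exp > aic_gauss else "Gaussian")
```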

SLIDE 23

BIC

Bayesian Information Criterion: we choose $j$ to maximize

$$\mathrm{BIC}_j = \ell_j(\hat\theta_j) - \frac{d_j}{2}\log n$$

which is similar to AIC, but the penalty is harsher, hence BIC tends to choose simpler models.

SLIDE 24

Simple example

Let $Y_1, \dots, Y_n \sim N(\mu, 1)$. We want to compare two models: $M_0: N(0, 1)$ and $M_1: N(\mu, 1)$.

SLIDE 25

Simple example: AIC

The log-likelihood (up to a constant) is

$$\ell = \log \prod_i e^{-(Y_i - \mu)^2/2} = -\sum_i (Y_i - \mu)^2/2$$

$$\mathrm{AIC}_0 = -\sum_i Y_i^2/2 - 0$$

$$\mathrm{AIC}_1 = -\sum_i (Y_i - \bar Y)^2/2 - 2 = -\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - 2$$

We choose model 1 if $\mathrm{AIC}_1 > \mathrm{AIC}_0$, i.e.,

$$-\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - 2 > -\sum_i Y_i^2/2 \quad\Longleftrightarrow\quad |\bar Y| > \sqrt{4/n}$$

SLIDE 26

Simple example: BIC

$$\mathrm{BIC}_0 = -\sum_i Y_i^2/2 - \frac{0}{2}\log n = -\sum_i Y_i^2/2$$

$$\mathrm{BIC}_1 = -\sum_i (Y_i - \bar Y)^2/2 - \frac{1}{2}\log n = -\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - \frac{1}{2}\log n$$

We choose model 1 if $\mathrm{BIC}_1 > \mathrm{BIC}_0$, i.e.,

$$-\sum_i Y_i^2/2 + \frac{n}{2}\bar Y^2 - \frac{1}{2}\log n > -\sum_i Y_i^2/2 \quad\Longleftrightarrow\quad |\bar Y| > \sqrt{\frac{\log n}{n}}$$
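A quick numerical check of the two decision rules derived above (synthetic data; n and the true mean are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu = 200, 0.15
Y = rng.normal(loc=mu, scale=1.0, size=n)
Ybar = Y.mean()

# AIC: prefer M1 iff AIC1 > AIC0, which is equivalent to |Ybar| > sqrt(4/n)
aic0 = -np.sum(Y**2) / 2
aic1 = -np.sum((Y - Ybar)**2) / 2 - 2
print((aic1 > aic0) == (abs(Ybar) > np.sqrt(4 / n)))          # the two tests agree

# BIC: prefer M1 iff BIC1 > BIC0, which is equivalent to |Ybar| > sqrt(log(n)/n)
bic0 = -np.sum(Y**2) / 2
bic1 = -np.sum((Y - Ybar)**2) / 2 - 0.5 * np.log(n)
print((bic1 > bic0) == (abs(Ybar) > np.sqrt(np.log(n) / n)))  # the two tests agree
```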

SLIDE 27

Comparison

Generally speaking, AIC (and cross-validation) finds the most predictive model, while BIC finds the true model with high probability; that is, BIC assumes that one of the models is true and tries to find the model most likely to be true in the Bayesian sense.
