

SLIDE 1

Regularization: Ridge Regression and the LASSO

Statistics 305: Autumn Quarter 2006/2007
Wednesday, November 29, 2006

SLIDE 2

Agenda

1. The Bias-Variance Tradeoff
2. Ridge Regression
   • Solution to the ℓ2 problem
   • Data Augmentation Approach
   • Bayesian Interpretation
   • The SVD and Ridge Regression
3. Cross Validation
   • K-Fold Cross Validation
   • Generalized CV
4. The LASSO
5. Model Selection, Oracles, and the Dantzig Selector
6. References

SLIDE 3

Part I: The Bias-Variance Tradeoff

SLIDE 4

Estimating β

As usual, we assume the model y = f(z) + ε, with ε ∼ (0, σ²)
In regression analysis, our major goal is to come up with a good regression function f̂(z) = z⊤β̂
So far, we have been dealing with β̂^ls, the least squares solution:
    β̂^ls = (Z⊤Z)⁻¹Z⊤y
β̂^ls has well-known properties (e.g., Gauss-Markov, ML)
But can we do better?

SLIDE 5

Choosing a good regression function

Suppose we have an estimator f̂(z) = z⊤β̂
To see if f̂(z) = z⊤β̂ is a good candidate, we can ask ourselves two questions:
1.) Is β̂ close to the true β?
2.) Will f̂(z) fit future observations well?

SLIDE 6

1.) Is β̂ close to the true β?

To answer this question, we might consider the mean squared error of our estimate β̂,
i.e., the expected squared distance of β̂ from the true β:
    MSE(β̂) = E[||β̂ − β||²] = E[(β̂ − β)⊤(β̂ − β)]
Example: In least squares (LS), we know that:
    E[(β̂^ls − β)⊤(β̂^ls − β)] = σ² tr[(Z⊤Z)⁻¹]

SLIDE 7

2.) Will f̂(z) fit future observations well?

Just because f̂(z) fits our data well, this doesn't mean that it will be a good fit to new data
In fact, suppose that we take new measurements yᵢ′ at the same zᵢ's:
    (z₁, y₁′), (z₂, y₂′), . . . , (zₙ, yₙ′)
So if f̂(·) is a good model, then f̂(zᵢ) should also be close to the new target yᵢ′
This is the notion of prediction error (PE)

SLIDE 8

Prediction error and the bias-variance tradeoff

So good estimators should, on average, have small prediction errors
Let's consider the PE at a particular target point z₀ (see the board for a derivation):
    PE(z₀) = E_{Y|Z=z₀}{(Y − f̂(Z))² | Z = z₀} = σ²_ε + Bias²(f̂(z₀)) + Var(f̂(z₀))
Such a decomposition is known as the bias-variance tradeoff
As the model becomes more complex (more terms included), local structure/curvature can be picked up
But coefficient estimates suffer from high variance as more terms are included in the model
So introducing a little bias in our estimate for β might lead to a substantial decrease in variance, and hence to a substantial decrease in PE
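To make the decomposition concrete, here is a minimal R simulation (not from the slides; the true function, noise level, and polynomial fits are invented for illustration) that estimates Bias², variance, and PE(z₀) at a single target point as model complexity grows:

    set.seed(305)
    f <- function(z) sin(2 * z)               # assumed "true" regression function
    sigma <- 0.3; n <- 50; z0 <- 0.8          # noise level, sample size, target point
    z <- seq(0, 1, length.out = n)
    degrees <- 1:8; nsim <- 500
    fhat0 <- matrix(NA, nsim, length(degrees))
    for (s in 1:nsim) {
      y <- f(z) + rnorm(n, 0, sigma)          # fresh training sample each round
      for (d in degrees) {
        fit <- lm(y ~ poly(z, d))             # polynomial fit of degree d
        fhat0[s, d] <- predict(fit, data.frame(z = z0))
      }
    }
    bias2 <- (colMeans(fhat0) - f(z0))^2      # squared bias at z0
    vars  <- apply(fhat0, 2, var)             # variance at z0
    round(cbind(degree = degrees, bias2, variance = vars,
                PE = sigma^2 + bias2 + vars), 4)

As the degree grows, the squared bias falls while the variance rises; their sum (plus σ²_ε) traces the U-shaped PE curve depicted on the next slide.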

SLIDE 9

Depicting the bias-variance tradeoff

[Figure: squared error vs. model complexity, with curves for prediction error, Bias², and variance.]

Figure: A graph depicting the bias-variance tradeoff.

SLIDE 10

Part II: Ridge Regression

SLIDE 11

Ridge regression as regularization

If the βⱼ's are unconstrained...
    They can explode, and hence are susceptible to very high variance
To control variance, we might regularize the coefficients,
i.e., control how large the coefficients grow
Might impose the ridge constraint:
    minimize ∑_{i=1}^n (yᵢ − β⊤zᵢ)²  s.t.  ∑_{j=1}^p βⱼ² ≤ t
    ⇔ minimize (y − Zβ)⊤(y − Zβ)  s.t.  ∑_{j=1}^p βⱼ² ≤ t
By convention (very important!):
    Z is assumed to be standardized (mean 0, unit variance)
    y is assumed to be centered

SLIDE 12

Ridge regression: ℓ2-penalty

Can write the ridge constraint as the following penalized residual sum of squares (PRSS):
    PRSS(β)_ℓ2 = ∑_{i=1}^n (yᵢ − zᵢ⊤β)² + λ ∑_{j=1}^p βⱼ²
               = (y − Zβ)⊤(y − Zβ) + λ||β||₂²
Its solution may have smaller average PE than β̂^ls
PRSS(β)_ℓ2 is convex, and hence has a unique solution
Taking derivatives, we obtain:
    ∂PRSS(β)_ℓ2/∂β = −2Z⊤(y − Zβ) + 2λβ

SLIDE 13

The ridge solutions

The solution to PRSS(β)_ℓ2 is now seen to be:
    β̂^ridge_λ = (Z⊤Z + λI_p)⁻¹Z⊤y
Remember that Z is standardized and y is centered
The solution is indexed by the tuning parameter λ (more on this later)
Inclusion of λ makes the problem non-singular even if Z⊤Z is not invertible
    This was the original motivation for ridge regression (Hoerl and Kennard, 1970)
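The closed form translates directly into base R. A minimal sketch, assuming a standardized Z and centered y (the function name and toy data are our own):

    ridge_solve <- function(Z, y, lambda) {
      p <- ncol(Z)
      # (Z'Z + lambda*I)^{-1} Z'y, computed as a linear solve rather than
      # an explicit matrix inversion
      solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))
    }

    # Toy usage on simulated data
    set.seed(1)
    Z <- scale(matrix(rnorm(50 * 5), 50, 5))    # standardized predictors
    y <- Z %*% c(2, 0, -1, 0, 0.5) + rnorm(50)
    y <- y - mean(y)                            # centered response
    ridge_solve(Z, y, lambda = 10)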

SLIDE 14

Tuning parameter λ

Notice that the solution is indexed by the parameter λ
    So for each λ, we have a solution
    Hence, the λ's trace out a path of solutions (see next page)
λ is the shrinkage parameter
    λ controls the size of the coefficients
    λ controls the amount of regularization
    As λ ↓ 0, we obtain the least squares solutions
    As λ ↑ ∞, we have β̂^ridge_{λ=∞} = 0 (intercept-only model)

SLIDE 15

Ridge coefficient paths

The λ's trace out a set of ridge solutions, as illustrated below

[Figure: ridge coefficient paths against effective degrees of freedom (2 to 10) for the ten predictors age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu.]

Figure: Ridge coefficient path for the diabetes data set found in the lars library in R.

SLIDE 16

Choosing λ

We need a disciplined way of selecting λ: that is, we need to "tune" the value of λ
In their original paper, Hoerl and Kennard introduced ridge traces:
    Plot the components of β̂^ridge_λ against λ
    Choose the λ for which the coefficients are not rapidly changing and have "sensible" signs
    No objective basis; heavily criticized by many
Standard practice now is to use cross-validation (discussion deferred until Part III)

SLIDE 17

Proving that β̂^ridge_λ is biased

Let R = Z⊤Z. Then:
    β̂^ridge_λ = (Z⊤Z + λI_p)⁻¹Z⊤y
              = (R + λI_p)⁻¹R(R⁻¹Z⊤y)
              = [R(I_p + λR⁻¹)]⁻¹R[(Z⊤Z)⁻¹Z⊤y]
              = (I_p + λR⁻¹)⁻¹R⁻¹R β̂^ls
              = (I_p + λR⁻¹)⁻¹ β̂^ls
So:
    E(β̂^ridge_λ) = E{(I_p + λR⁻¹)⁻¹ β̂^ls} = (I_p + λR⁻¹)⁻¹ β,
which equals β only when λ = 0; for λ > 0, the estimator is biased.

SLIDE 18

Data augmentation approach

The ℓ2 PRSS can be written as:
    PRSS(β)_ℓ2 = ∑_{i=1}^n (yᵢ − zᵢ⊤β)² + λ ∑_{j=1}^p βⱼ²
               = ∑_{i=1}^n (yᵢ − zᵢ⊤β)² + ∑_{j=1}^p (0 − √λ βⱼ)²
Hence, the ℓ2 criterion can be recast as another least squares problem for another data set

SLIDE 19

Data augmentation approach continued

The ℓ2 criterion is the RSS for the augmented data set obtained by appending √λ I_p below Z and p zeros below y:

    Z_λ = [ Z      ]        y_λ = [ y ]
          [ √λ I_p ]              [ 0 ]

That is, Z_λ is an (n + p) × p matrix and y_λ is an (n + p)-vector.

SLIDE 20

Solving the augmented data set

So the "least squares" solution for the augmented data set is:
    (Z_λ⊤Z_λ)⁻¹Z_λ⊤y_λ = (Z⊤Z + λI_p)⁻¹Z⊤y,
since Z_λ⊤Z_λ = Z⊤Z + λI_p and Z_λ⊤y_λ = Z⊤y
This is simply the ridge solution
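A quick numeric check of this identity in R (a sketch reusing the hypothetical ridge_solve, Z, and y from the earlier snippet):

    lambda <- 10
    Z_aug <- rbind(Z, sqrt(lambda) * diag(ncol(Z)))  # stack sqrt(lambda)*I_p under Z
    y_aug <- c(y, rep(0, ncol(Z)))                   # pad y with p zeros
    beta_aug <- qr.solve(Z_aug, y_aug)               # plain LS on the augmented data
    max(abs(beta_aug - ridge_solve(Z, y, lambda)))   # ~ 0, up to rounding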

SLIDE 21

Bayesian framework

Suppose we imposed a multivariate Gaussian prior for β:
    β ∼ N(0, (σ²/λ) I_p)
Then, with Gaussian errors, the posterior mean (and also posterior mode) of β is the ridge estimate:
    β̂^ridge_λ = (Z⊤Z + λI_p)⁻¹Z⊤y

SLIDE 22

Computing the ridge solutions via the SVD

Recall β̂^ridge_λ = (Z⊤Z + λI_p)⁻¹Z⊤y
When computing β̂^ridge_λ numerically, matrix inversion is avoided:
    Inverting Z⊤Z can be computationally expensive: O(p³)
Rather, the singular value decomposition is utilized; that is, Z = UDV⊤, where:
    U = (u₁, u₂, . . . , u_p) is an n × p orthogonal matrix
    D = diag(d₁, d₂, . . . , d_p) is a p × p diagonal matrix of the singular values, d₁ ≥ d₂ ≥ · · · ≥ d_p ≥ 0
    V = (v₁, v₂, . . . , v_p) is a p × p orthogonal matrix

SLIDE 23

Numerical computation of β̂^ridge_λ

Will show on the board that:
    β̂^ridge_λ = (Z⊤Z + λI_p)⁻¹Z⊤y = V diag( dⱼ/(dⱼ² + λ) ) U⊤y
The result uses the eigen (or spectral) decomposition of Z⊤Z:
    Z⊤Z = (UDV⊤)⊤(UDV⊤) = VD⊤U⊤UDV⊤ = VD⊤DV⊤ = VD²V⊤
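This formula is also a direct recipe in code. A minimal R sketch, reusing the hypothetical Z, y, and ridge_solve from the earlier snippets:

    ridge_svd <- function(Z, y, lambda) {
      s <- svd(Z)                                # Z = U D V'
      # V diag(d_j / (d_j^2 + lambda)) U'y, via elementwise scaling
      s$v %*% ((s$d / (s$d^2 + lambda)) * crossprod(s$u, y))
    }
    max(abs(ridge_svd(Z, y, 10) - ridge_solve(Z, y, 10)))  # ~ 0

One attraction of the SVD route is that a single decomposition serves every λ on a grid.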

SLIDE 24

ŷ^ridge_λ and principal components

A consequence is that:
    ŷ^ridge_λ = Z β̂^ridge_λ = ∑_{j=1}^p uⱼ [ dⱼ²/(dⱼ² + λ) ] uⱼ⊤y
Ridge regression has a relationship with principal components analysis (PCA):
    Fact: The derived variable γⱼ = Zvⱼ = uⱼdⱼ is the jth principal component (PC) of Z
    Hence, ridge regression projects y onto these components, favoring those with large dⱼ
    Ridge regression shrinks the coefficients of low-variance components

SLIDE 25

Orthonormal Z in ridge regression

If Z is orthonormal, so that Z⊤Z = I_p, then a couple of closed-form properties exist
Let β̂^ls denote the LS solution for our orthonormal Z; then
    β̂^ridge_λ = [1/(1 + λ)] β̂^ls
The optimal choice of λ, minimizing the expected prediction error, is:
    λ* = p σ² / ∑_{j=1}^p βⱼ²,
where β = (β₁, β₂, . . . , β_p) is the true coefficient vector

SLIDE 26

Smoother matrices and effective degrees of freedom

A smoother matrix S is a linear operator satisfying ŷ = Sy
    Smoothers put the "hats" on y
    So the fits are a linear combination of the yᵢ's, i = 1, . . . , n
Example: In ordinary least squares, recall the hat matrix H = Z(Z⊤Z)⁻¹Z⊤
For rank(Z) = p, we know that tr(H) = p, which is how many degrees of freedom are used in the model
By analogy, define the effective degrees of freedom (or effective number of parameters) for a smoother to be:
    df(S) = tr(S)

SLIDE 27

Degrees of freedom for ridge regression

In ridge regression, the fits are given by ŷ = Z(Z⊤Z + λI_p)⁻¹Z⊤y
So the smoother or "hat" matrix in ridge takes the form:
    S_λ = Z(Z⊤Z + λI_p)⁻¹Z⊤
So the effective degrees of freedom in ridge regression are given by:
    df(λ) = tr(S_λ) = tr[Z(Z⊤Z + λI_p)⁻¹Z⊤] = ∑_{j=1}^p dⱼ²/(dⱼ² + λ)
Note that df(λ) is monotone decreasing in λ
Question: What happens when λ = 0?
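In code, df(λ) needs only the singular values. A small sketch in the same style as the earlier snippets (df_ridge is our own name):

    df_ridge <- function(Z, lambda) {
      d <- svd(Z)$d
      sum(d^2 / (d^2 + lambda))  # tr(S_lambda) = sum_j d_j^2/(d_j^2 + lambda)
    }
    df_ridge(Z, 0)   # equals p for full-column-rank Z: the LS case
    df_ridge(Z, 10)  # strictly smaller: ridge "spends" fewer effective df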

SLIDE 28

Part III: Cross Validation

SLIDE 29

How do we choose λ?

We need a disciplined way of choosing λ
Obviously, we want to choose the λ that minimizes the mean squared error
This issue is part of the bigger problem of model selection

SLIDE 30

Training sets versus test sets

If we have a good model, it should predict well when we have new data
In machine learning terms, we compute our statistical model f̂(·) from the training set
A good estimator f̂(·) should then perform well on a new, independent set of data
We "test" or assess how well f̂(·) performs on the new data, which we call the test set

SLIDE 31

More on training and testing

Ideally, we would separate our available data into both training and test sets
Of course, this is not always possible, especially if we have few observations
We hope to come up with the best-trained algorithm that will stand up to the test
Example: The Netflix contest (http://www.netflixprize.com/)
How can we try to find the best-trained algorithm?

SLIDE 32

K-fold cross validation

The most common approach is K-fold cross validation:
(i) Partition the training data T into K separate sets of equal size, T = (T₁, T₂, . . . , T_K)
    Commonly chosen K's are K = 5 and K = 10
(ii) For each k = 1, 2, . . . , K, fit the model f̂^(λ)_{−k}(z) to the training set excluding the kth fold T_k
(iii) Compute the fitted values for the observations in T_k, based on the training data that excluded this fold
(iv) Compute the cross-validation (CV) error for the kth fold:
    (CV Error)^(λ)_k = |T_k|⁻¹ ∑_{(z,y)∈T_k} (y − f̂^(λ)_{−k}(z))²

SLIDE 33

K-fold cross validation (continued)

The model then has overall cross-validation error:
    (CV Error)^(λ) = K⁻¹ ∑_{k=1}^K (CV Error)^(λ)_k
Select λ* as the one with minimum (CV Error)^(λ)
Compute the chosen model f̂^(λ*)(z) on the entire training set T = (T₁, T₂, . . . , T_K)
Apply f̂^(λ*)(z) to the test set to assess the test error
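The recipe above is a few lines of R. A minimal sketch for tuning the ridge λ, reusing the hypothetical ridge_solve, Z, and y from earlier (the fold assignment and the λ grid are arbitrary choices):

    cv_ridge <- function(Z, y, lambdas, K = 10) {
      fold <- sample(rep(1:K, length.out = nrow(Z)))   # random fold labels
      cv_err <- sapply(lambdas, function(lam) {
        mean(sapply(1:K, function(k) {
          train <- fold != k
          beta <- ridge_solve(Z[train, , drop = FALSE], y[train], lam)
          mean((y[!train] - Z[!train, , drop = FALSE] %*% beta)^2)
        }))
      })
      lambdas[which.min(cv_err)]                       # lambda* minimizing CV error
    }
    lambda_star <- cv_ridge(Z, y, lambdas = 10^seq(-2, 3, length.out = 30))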

SLIDE 34

Plot of CV errors and standard error bands

[Figure: CV squared error vs. effective degrees of freedom, with standard error bands.]

Figure: Cross validation errors from a ridge regression example on spam data.

SLIDE 35

Cross validation with few observations

Remark: Our data set might be small, so we might not have enough observations to put aside a test set. In this case:
    Let all of the available data be our training set
    Still apply K-fold cross validation
    Still choose λ* as the minimizer of CV error
    Then refit the model with λ* on the entire training set

SLIDE 36

Leave-one-out CV

What happens when K = n? This is called leave-one-out cross validation
For squared error loss, there is a convenient approximation to CV(1), the leave-one-out CV error

SLIDE 37

Generalized CV for smoother matrices

Recall that a smoother matrix S satisfies ŷ = Sy
In many linear fitting methods (as in LS), we have:
    CV(1) = (1/n) ∑_{i=1}^n (yᵢ − f̂₋ᵢ(zᵢ))² = (1/n) ∑_{i=1}^n [ (yᵢ − f̂(zᵢ)) / (1 − Sᵢᵢ) ]²
A convenient approximation to CV(1) is called the generalized cross validation, or GCV, error:
    GCV = (1/n) ∑_{i=1}^n [ (yᵢ − f̂(zᵢ)) / (1 − tr(S)/n) ]²
Recall that tr(S) is the effective degrees of freedom, or effective number of parameters
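For ridge, the GCV criterion drops out of pieces we already have. A sketch reusing the hypothetical ridge_svd and df_ridge from the earlier snippets:

    gcv_ridge <- function(Z, y, lambda) {
      yhat <- Z %*% ridge_svd(Z, y, lambda)
      df <- df_ridge(Z, lambda)                   # tr(S_lambda)
      mean(((y - yhat) / (1 - df / nrow(Z)))^2)   # GCV criterion
    }
    grid <- 10^seq(-2, 3, length.out = 50)
    grid[which.min(sapply(grid, function(l) gcv_ridge(Z, y, l)))]  # GCV choice of lambda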

SLIDE 38

Part IV: The LASSO

SLIDE 39

The LASSO: ℓ1 penalty

Tibshirani (Journal of the Royal Statistical Society, 1996) introduced the LASSO: the least absolute shrinkage and selection operator
LASSO coefficients are the solutions to the ℓ1 optimization problem:
    minimize (y − Zβ)⊤(y − Zβ)  s.t.  ∑_{j=1}^p |βⱼ| ≤ t
This is equivalent to minimizing the loss function:
    PRSS(β)_ℓ1 = ∑_{i=1}^n (yᵢ − zᵢ⊤β)² + λ ∑_{j=1}^p |βⱼ|
               = (y − Zβ)⊤(y − Zβ) + λ||β||₁

SLIDE 40

λ (or t) as a tuning parameter

Again, we have a tuning parameter λ that controls the amount of regularization
There is a one-to-one correspondence with the threshold t: recall the constraint
    ∑_{j=1}^p |βⱼ| ≤ t
Hence, we have a "path" of solutions indexed by t
If t₀ = ∑_{j=1}^p |β̂^ls_j| (equivalently, λ = 0), we obtain no shrinkage (and hence obtain the LS solutions as our solution)
Often, the path of solutions is indexed by the shrinkage fraction t/t₀

SLIDE 41

Sparsity and exact zeros

Often, we believe that many of the βⱼ's should be 0
Hence, we seek a set of sparse solutions
A large enough λ (or small enough t) will set some coefficients exactly equal to 0!
So the LASSO will perform model selection for us!

SLIDE 42

Computing the LASSO solution

Unlike ridge regression, β̂^lasso_λ has no closed form
The original implementation involves quadratic programming techniques from convex optimization
But Efron et al. (Annals of Statistics, 2004) proposed LARS (least angle regression), which computes the LASSO path efficiently
    The lars package in R implements the LASSO
An interesting modification is called forward stagewise
    In many cases it gives the same path as the LASSO solution
    Forward stagewise is easy to implement: http://www-stat.stanford.edu/~hastie/TALKS/nips2005.pdf

SLIDE 43

Forward stagewise algorithm

As usual, assume Z is standardized and y is centered
Choose a small ε. The forward-stagewise algorithm then proceeds as follows:
1. Start with initial residual r = y, and β₁ = β₂ = · · · = β_p = 0.
2. Find the predictor Zⱼ (j = 1, . . . , p) most correlated with r.
3. Update βⱼ ← βⱼ + δⱼ, where δⱼ = ε · sign⟨r, Zⱼ⟩ = ε · sign(Zⱼ⊤r).
4. Set r ← r − δⱼZⱼ, and repeat Steps 2 and 3 many times.
Try implementing forward stagewise yourself! It's easy! (A sketch follows below.)
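Taking the slide up on its suggestion, here is a minimal R sketch (the function name, step size, and stopping rule are our own choices):

    forward_stagewise <- function(Z, y, eps = 0.01, n_steps = 5000) {
      Z <- scale(Z)                        # standardize predictors
      r <- y - mean(y)                     # centered response is the initial residual
      beta <- rep(0, ncol(Z))
      for (step in 1:n_steps) {
        cors <- drop(crossprod(Z, r))      # Z_j' r for every predictor j
        j <- which.max(abs(cors))          # most correlated predictor
        delta <- eps * sign(cors[j])       # tiny step in the sign of the correlation
        beta[j] <- beta[j] + delta
        r <- r - delta * Z[, j]            # update the residual
      }
      beta
    }
    beta_fs <- forward_stagewise(Z, y)     # Z, y from the earlier toy example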

SLIDE 44

Example: diabetes data

Example taken from the lars package documentation:

    Call: lars(x = x, y = y)
    R-squared: 0.518
    Sequence of LASSO moves:
         bmi ltg map hdl sex glu tc tch ldl age hdl hdl
    Var    3   9   4   7   2  10   5   8   6   1  -7   7
    Step   1   2   3   4   5   6   7   8   9  10  11  12

(A negative Var index in lars output marks a variable leaving the active set: hdl drops out at step 11 and re-enters at step 12.)
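To reproduce this output (assuming the lars package is installed):

    library(lars)
    data(diabetes)                        # provides x, y (and an expanded x2)
    fit <- lars(diabetes$x, diabetes$y)   # LASSO path via the LARS algorithm
    fit                                   # prints the sequence of moves above
    plot(fit)                             # coefficient paths, as on the next slide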

SLIDE 45

The LASSO, LARS, and Forward Stagewise paths

[Figure: three panels (LASSO, LAR, Forward Stagewise) of standardized coefficients plotted against |beta|/max|beta| for the diabetes data.]

Figure: Comparison of the LASSO, LARS, and Forward Stagewise coefficient paths for the diabetes data set.

SLIDE 46

Part V: Model Selection, Oracles, and the Dantzig Selector

SLIDE 47

Comparing LS, Ridge, and the LASSO

Even though Z⊤Z may not be of full rank, both ridge regression and the LASSO admit solutions
We have a problem when p ≫ n (more predictor variables than observations)
    But both ridge regression and the LASSO still have solutions
    Regularization tends to reduce prediction error

SLIDE 48

Variable selection

The ridge and LASSO solutions are indexed by the continuous parameter λ
Variable selection in least squares, by contrast, is "discrete":
    Perhaps consider "best" subsets, which is of order O(2^p) (a combinatorial explosion; compare to ridge and LASSO)
    Or stepwise selection
In stepwise procedures, a new variable may be added into the model even with a minuscule improvement in R²
When applying stepwise selection to a perturbation of the data, we would probably see a different set of variables enter the model at each stage
Many model selection techniques are based on Mallow's Cp, AIC, and BIC

SLIDE 49

More comments on variable selection

Now suppose p ≫ n
Of course, we would like a parsimonious model (Occam's Razor)
Ridge regression produces coefficient values for each of the p variables
But because of its ℓ1 penalty, the LASSO will set many of the variables exactly equal to 0!
    That is, the LASSO produces sparse solutions
So the LASSO takes care of model selection for us
And we can even see when variables jump into the model by looking at the LASSO path

SLIDE 50

Variants

Zou and Hastie (2005) propose the elastic net, which is a convex combination of the ridge and LASSO penalties
    The paper asserts that the elastic net can improve prediction error over the LASSO
    It still produces sparse solutions
Frank and Friedman (1993) introduce bridge regression, which generalizes to ℓq norms
Regularization ideas have been extended to other contexts:
    Park (Ph.D. Thesis, 2006) computes ℓ1-regularized paths for generalized linear models

SLIDE 51

High-dimensional data and underdetermined systems

In many modern data analysis problems, we have p ≫ n
These comprise "high-dimensional" problems
When fitting the model y = z⊤β, we can have many solutions
    i.e., our system is underdetermined
It is reasonable to suppose that most of the coefficients are exactly equal to 0

SLIDE 52

S-sparsity and Oracles

Suppose that only S elements of β are non-zero
    Candès and Tao call this S-sparsity
Now suppose we had an "Oracle" that told us which components of β = (β₁, β₂, . . . , β_p) are truly non-zero
Let β⋆ be the least squares estimate of this "ideal" estimator:
    β⋆ is 0 in every component where β is 0
    The non-zero elements of β⋆ are computed by regressing y on only the S important covariates

SLIDE 53

The Dantzig selector

Candès and Tao developed the Dantzig selector β̂^Dantzig:
    minimize ||β||_ℓ1  s.t.  ||Z⊤r||_ℓ∞ ≤ (1 + t⁻¹) √(2 log p) · σ
Here, r is the residual vector and t > 0 is a scalar
They showed that, with high probability,
    ||β̂^Dantzig − β||² = O(log p) · E(||β⋆ − β||²)
So the Dantzig selector does comparably well as someone who was told which S variables to regress on

SLIDE 54

Part VI: References

SLIDE 55

References

Candès, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. Available at http://www.acm.caltech.edu/~emmanuel/papers/DantzigSelector.pdf.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2): 409–499.

Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35: 109–148.

Hastie, T. and Efron, B. The lars package. Available from http://cran.r-project.org/src/contrib/Descriptions/lars.html.

SLIDE 56

References continued

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.

Hoerl, A.E. and Kennard, R. (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12: 55–67.

Seber, G. and Lee, A. (2003). Linear Regression Analysis, 2nd Edition. Wiley Series in Probability and Statistics.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67: 301–320.