Advanced Section #3: Methods of Regularization and their - PowerPoint PPT Presentation

Advanced Section #3: Methods of Regularization and their justifications Robbert Struyven and Pavlos Protopapas (viz. Camilo Fosco) CS109A Introduction to Data Science Pavlos Protopapas , Kevin Rader and Chris Tanner 1

Outline Motivation for regularization • • Generalization Instability • • Ridge estimator • Lasso estimator Elastic Net estimator • • Visualizations • Bayesian approach CS109A, P ROTOPAPAS , R ADER 2

Regularization: introduce additional information to solve ill- posed problems or avoid overfitting. CS109A, P ROTOPAPAS , R ADER 3

MOTIVATION Why do we regularize? 4

Generalization - Avoid overfitting. Reduce features that have weak predictive power. - Discourage the use of a model that is too complex. - Do not fit the noise! CS109A, P ROTOPAPAS , R ADER 5

Instability issues - Linear regression becomes unstable when p (degrees of freedom) is close to n (observations). - Think about each obs. as a piece of info about the model. What happens when n is close to the degrees of freedom? - Collinearity generates instability issues. If we want to understand the effect of 𝑌 " and 𝑌 # on Y, is it - easier when they vary together or when they vary separately? - Regularization helps combat instability by constraining the space of possible parameters. - Mathematically, instability can be seen through the = 𝜏 # 𝑌 + 𝑌 ," estimator’s variance: ( 𝑤𝑏𝑠 𝛾 CS109A, P ROTOPAPAS , R ADER 6

Instability issues = 𝜏 # 𝑌 + 𝑌 ," ( 𝑤𝑏𝑠 𝛾 var(Y) Inverse of Gram matrix The variance of the estimator is affected by the irreducible noise But the variance also depends on the of the model. We have no predictors themselves! This is the control over this. important part. if the eigenvalues of 𝑌 + 𝑌 are close to zero, our matrix is almost singular. One or - more eigenvalues of 𝑌 + 𝑌 ," can be extremely large. - In that case, on top of having large variance, we have numerical instability. In general, we want the condition number of 𝑌 + 𝑌 to be small (well-conditioning). - / 012 Remember that for 𝑌 + 𝑌: 𝜆 𝑌 + 𝑌 = / 034 CS109A, P ROTOPAPAS , R ADER 7

� Instability and the condition number More formally, instability can be analyzed through perturbation theory. Consider the following least-squares problem: min (𝑌 + 𝜀𝑌 𝛾 − (Y + 𝜀𝑍)‖ Perturbations 9 B is the solution of the original least squares problem, we If 𝛾 can prove that: B 𝛾 − 𝛾 𝜀𝑌 𝜆 𝑌 + 𝑌 ≤ Condition number of 𝑌 + 𝑌 𝛾 𝑌 Small 𝜆 𝑌 + 𝑌 tightens the bound on how much my coefficients can vary. CS109A, P ROTOPAPAS , R ADER 8

Instability visualized - Instability can be visualized by regressing on nearly colinear data, and observing the changes on the same data, slightly perturbed: Image from “Instability of Least Squares, Least Absolute Deviation and Least Median of Squares Linear Regression” , Ellis et al. (1998) CS109A, P ROTOPAPAS , R ADER 9

Motivation in short - We want less complex models to avoid overfitting and increase interpretability. - We want to be able to solve problems where p = n or p > n, and still generalize reasonably well. - We want to reduce instability (increase min eigenvalue/reduce condition number) in our estimators. We need to be better at estimating betas with colinear predictors. - In a nutshell, we want to avoid ill-posed problems (no solutions / solutions not unique / unstable solutions) CS109A, P ROTOPAPAS , R ADER 10

RIDGE REGRESSION Instability destroyer 11

What is the Ridge estimator? - Regularized estimator proposed by Hoerl and Kennard (1970). - Imposes L2 penalty on the magnitude of the coefficients. Regularization factor # + 𝜇| 𝛾 | # ( EFGHI = 𝑏𝑠𝑕𝑛𝑗𝑜 9 𝑌𝛾 − 𝑍 # # 𝛾 ( EFGHI = 𝑌 + 𝑌 + 𝜇𝐽 ," 𝑌 + 𝑍 𝛾 - In practice, the ridge estimator reduces the complexity of the model by shrinking the coefficients, but it doesn’t nullify them. - The lambda factor controls the amount of regularization. CS109A, P ROTOPAPAS , R ADER 12

Deriving the Ridge estimator 𝑌 + 𝑌 ," is considered unstable (or super-collinear) if eigenvalues are close to zero. 𝑌 + 𝑌 ," = 𝑅Λ ," 𝑅 ," Eigendecompostion ," 𝑙 " 0 0 Λ ," = 0 ⋱ 0 ," . 0 0 𝑙 V If the eigenvalues 𝑙 F are close to zero, Λ ," will have extremely large diagonal values. 𝑌 + 𝑌 ," will be very hard to find numerically. What can we do? CS109A, P ROTOPAPAS , R ADER 13

Deriving the Ridge estimator Just add a constant to the eigenvalues. 𝑅 Λ ," 𝑅 ," + 𝜇𝐽 𝑅 ," = 𝑅Λ ," 𝑅 ," + 𝜇𝑅𝑅 ," = 𝑌 + 𝑌 + 𝜇𝐽 Added constant 𝜇 We can find a new estimator: ( EFGHI = 𝑌 + 𝑌 + 𝜇𝐽 ," 𝑌 + 𝑍 𝛾 CS109A, P ROTOPAPAS , R ADER 14

Properties: shrinks the coefficients The Ridge estimator can be seen as a modification of the OLS estimator: ( EFGHI = 𝐽 + 𝜇 𝑌 + 𝑌 ," ," 𝛾 ( WXY 𝛾 Let’s look at an example to see its effect on the OLS betas: univariate case ( 𝑌 = (𝑦 " , … , 𝑦 ] ) ) with normalized predictor # = 𝑌 + 𝑌 = 1 ). ( 𝑌 # In this case, the ridge estimator is: ( WXY ( EFGHI = 𝛾 𝛾 1 + 𝜇 As we can see, Ridge regression shrinks the OLS predictors, but does not nullify them. No variable selection occurs at this stage. CS109A, P ROTOPAPAS , R ADER 15

Properties: closer to the real beta Interesting theorem: there always exists 𝜇 > 0 such that: • # < 𝐹 # ( E − 𝛾 # ( WXY − 𝛾 # 𝐹 𝛾 𝛾 Regardless of X and Y, there is a value of lambda for which • Ridge performs better than OLS in terms of MSE. Careful: we’re talking about MSE in estimating the true • coefficient (inference), not performance in terms of prediction. • OLS is unbiased, Ridge is not, however estimation is better: Ridge’s lower variance more than makes up for increase in bias. Good bias-variance tradeoff. CS109A, P ROTOPAPAS , R ADER 16

Good bias-variance tradeoff. OLS Ridge • Higher Variance (instable • Lower Variance Betas) • Adding some Bias • No Bias CS109A, P ROTOPAPAS , R ADER 17

Different perspectives on Ridge So far, we understand Ridge as a penalty on the • optimization objective: # + 𝜇| 𝛾 | # ( EFGHI = 𝑏𝑠𝑕𝑛𝑗𝑜 9 𝑌𝛾 − 𝑍 # 𝛾 # However, there are multiple ways to look at it: • Transformation (shrinkage) of OLS estimator. • Constraint for curvature on the loss function • Estimator obtained from increased eigenvalues • Regression with dummy data of 𝑌 + 𝑌 (better conditioning) • Special case of Tikhonov Regularization • Normal prior on coefficients (Bayesian • Constrained minimization interpretation) CS109A, P ROTOPAPAS , R ADER 18

Optimization perspective The ridge regression problem is equivalent to the following constrained optimization problem: # min b cd 𝑍 − 𝑌𝛾 # 9 b - From this perspective, we are doing regular least squares with a constraint on the magnitude of 𝛾 . - We can get from one expression to the other through Lagrange multipliers. # ( EFGHI ∗ Inverse relationship between 𝜆 and 𝜇 . Namely, 𝜆 = 𝛾 𝜇 - # CS109A, P ROTOPAPAS , R ADER 19

� Ridge, formal perspective Monsieur Ridge Ridge is a special case of Tikhonov Regularization: Tikhonov Matrix 𝟑 + 𝚫𝒚 𝟑 𝟑 𝑩𝒚 − 𝒄 𝟑 ,𝟐 𝐁 𝐔 𝐜 𝒚 = 𝑩 𝑼 𝑩 + 𝚫 𝐔 𝚫 If Γ = 𝜇 𝐽 , we have classic Ridge regression. Tikhonov regularization is interesting, as we can use Γ to generate other constraints, such as smoothness in the estimator values. CS109A, P ROTOPAPAS , R ADER 20

Ridge visualized Ridge estimator The values of the coefficients decrease as The ridge estimator is where the constraint lambda increases, but they are not nullified. and the loss intersect. CS109A, P ROTOPAPAS , R ADER 21

Ridge visualized Ridge curves the loss function in colinear problems, avoiding instability. CS109A, P ROTOPAPAS , R ADER 22

LASSO REGRESSION Yes, LASSO is an acronym 23

� What is LASSO? - Least Absolute Shrinkage and Selection Operator - Originally introduced in geophysics paper from 1986 but popularized by Robert Tibshirani (1996) - Idea: L1 penalization on the coefficients. # + 𝜇 𝛾 " 𝛾 XqYYW = argmin 𝑌𝛾 − 𝑍 # 9 Remember that 𝛾 " = ∑ |𝛾 F | - F - This looks deceptively similar to Ridge, but behaves very differently. Tends to zero-out coefficients. CS109A, P ROTOPAPAS , R ADER 24

Deriving the LASSO estimator The original LASSO definition comes from the constrained optimization problem: # 9 v cd 𝑌𝛾 − 𝑍 # min This is similar to Ridge. We should be able to easily find a closed form solution like Ridge, right? CS109A, P ROTOPAPAS , R ADER 25

No. CS109A, P ROTOPAPAS , R ADER 26

Advanced Section #3: Methods of Regularization and their - PowerPoint PPT Presentation

Advanced Section #3: Methods of Regularization and their justifications Robbert Struyven and Pavlos Protopapas (viz. Camilo Fosco) CS109A Introduction to Data Science Pavlos Protopapas , Kevin Rader and Chris Tanner 1 Outline Motivation for

Regularization Methods for System Identification Input Design Biqiang MU Academy of Mathematics

Regularization Overview Regularization Overview Problems & Multicollinearity We will

Introduction CSCE 970 CSCE 970 Lecture 3: Lecture 3: Regularization Regularization CSCE 970

Regularization Regularization is a general approach to add a complexity parameter to a

Module V: Vector Spaces Module V Math 237 Module V Section V.0 Section V.1 Section V.2

CS7015 (Deep Learning) : Lecture 8 Regularization: Bias Variance Tradeoff, l2 regularization,

Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 L. Rosasco

Iterative regularization for general inverse problems Guillaume Garrigos with L. Rosasco and S.

Regularization of optimal control problems Daniel Wachsmuth (RICAM Linz) joint work with Gerd

Manifold Regularization Lorenzo Rosasco MIT, 9.520 L. Rosasco Manifold Regularization About

Regularization Paths Boosting fits a regularization path toward a max-margin classifier.

LIC-Based Regularization of Multi-Valued Images David Tschumperl CNRS UMR 6072 (GREYC/ENSICAEN)

10. Regularization More on tradeoffs Regularization Effect of using different norms

Regularization for Multi-Output Learning Lorenzo Rosasco 9.520 L. Rosasco Regularization for

METHODS OF REGULARIZATION AND THEIR JUSTIFICATIONS WON (RYAN) LEE We turn to the question of

Half Year Results Presentation 2019 6 months ended 30 June 2019 Section 1 Section 2 Section 3

Student Employment THINK ACADEMY SUMMER Did you know? 91 % of employers prefer that their

PennyMac Mortgage PennyMac Mortgage PennyMac Mortgage PennyMac Mortgage Investment Trust

Chapter 6 : Informatics practices Conditional Class XI ( As per & CBSE Board) Looping

Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012 Overview About me! New Relics

Assessment of the Single Perturbation Load Approach on composite conical shells 25 March 2015,

DiamondTile Algorithm for High-Performance Wave Modeling Vadim Levchenko Anastasia Perepelkina

Finding the Right Exemplars for Reconstructing Single Image Super-Resolution Jiahuan Zhou , Ying

Kindergarten Curriculum -Environment rich in reading and writing opportunities English Language

Advanced Section #3: Methods of Regularization and their - PowerPoint PPT Presentation

Advanced Section #3: Methods of Regularization and their justifications Robbert Struyven and Pavlos Protopapas (viz. Camilo Fosco) CS109A Introduction to Data Science Pavlos Protopapas , Kevin Rader and Chris Tanner 1 Outline Motivation for

Regularization Methods for System Identification Input Design Biqiang MU Academy of Mathematics

Regularization Overview Regularization Overview Problems &amp; Multicollinearity We will

Introduction CSCE 970 CSCE 970 Lecture 3: Lecture 3: Regularization Regularization CSCE 970

Regularization Regularization is a general approach to add a complexity parameter to a

Module V: Vector Spaces Module V Math 237 Module V Section V.0 Section V.1 Section V.2

CS7015 (Deep Learning) : Lecture 8 Regularization: Bias Variance Tradeoff, l2 regularization,

Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 L. Rosasco

Iterative regularization for general inverse problems Guillaume Garrigos with L. Rosasco and S.

Regularization of optimal control problems Daniel Wachsmuth (RICAM Linz) joint work with Gerd

Manifold Regularization Lorenzo Rosasco MIT, 9.520 L. Rosasco Manifold Regularization About

Regularization Paths Boosting fits a regularization path toward a max-margin classifier.

LIC-Based Regularization of Multi-Valued Images David Tschumperl CNRS UMR 6072 (GREYC/ENSICAEN)

10. Regularization More on tradeoffs Regularization Effect of using different norms

Regularization for Multi-Output Learning Lorenzo Rosasco 9.520 L. Rosasco Regularization for

METHODS OF REGULARIZATION AND THEIR JUSTIFICATIONS WON (RYAN) LEE We turn to the question of

Half Year Results Presentation 2019 6 months ended 30 June 2019 Section 1 Section 2 Section 3

Student Employment THINK ACADEMY SUMMER Did you know? 91 % of employers prefer that their

PennyMac Mortgage PennyMac Mortgage PennyMac Mortgage PennyMac Mortgage Investment Trust

Chapter 6 : Informatics practices Conditional Class XI ( As per &amp; CBSE Board) Looping

Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012 Overview About me! New Relics

Assessment of the Single Perturbation Load Approach on composite conical shells 25 March 2015,

DiamondTile Algorithm for High-Performance Wave Modeling Vadim Levchenko Anastasia Perepelkina

Finding the Right Exemplars for Reconstructing Single Image Super-Resolution Jiahuan Zhou , Ying

Kindergarten Curriculum -Environment rich in reading and writing opportunities English Language

Regularization Overview Regularization Overview Problems & Multicollinearity We will

Chapter 6 : Informatics practices Conditional Class XI ( As per & CBSE Board) Looping