

SLIDE 1

Lecture #7: Regularization

Data Science 1: CS 109A, STAT 121A, AC 209A, E-109A
Pavlos Protopapas, Kevin Rader, Margo Levine, Rahul Dave

SLIDE 2

Lecture Outline

Review
Applications of Model Selection
Behind Ordinary Least Squares, AIC, BIC
Regularization: LASSO and Ridge
Bias vs Variance
Regularization Methods: A Comparison

2

SLIDE 3

Review

3

SLIDE 4

Model Selection

Model selection is the application of a principled method to determine the complexity of the model, e.g. choosing a subset of predictors, choosing the degree of the polynomial model etc. A strong motivation for performing model selection is to avoid overfitting, which we saw can happen when

▶ there are too many predictors:
  – the feature space has high dimensionality
  – the polynomial degree is too high
  – too many cross terms are considered
▶ the coefficient values are too extreme

4

SLIDE 5

Stepwise Variable Selection and Cross Validation

Last time, we addressed the issue of selecting optimal subsets of predictors (including choosing the degree of polynomial models) through:

▶ stepwise variable selection - iteratively building an optimal subset of predictors by optimizing a fixed model evaluation metric at each step,
▶ cross validation - selecting an optimal model by evaluating each model on multiple validation sets.

Today, we will address the issue of discouraging extreme values in model parameters.

5

SLIDE 11

Stepwise Variable Selection: Computational Complexity

How many models did we evaluate?

▶ 1st step: J models
▶ 2nd step: J − 1 models (add 1 predictor out of J − 1 possible)
▶ 3rd step: J − 2 models (add 1 predictor out of J − 2 possible)
...

In total this is J + (J − 1) + (J − 2) + ... = O(J²) models, and O(J²) ≪ 2^J, the number of models an exhaustive subset search would require, for large J (a small simulation of this count is sketched below).

7
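The count above can be checked with a small simulation. Below is a minimal sketch of forward stepwise selection; the synthetic data, the validation-set R² used as the fixed evaluation metric, and the use of scikit-learn are assumptions for illustration, not part of the lecture.

    # Minimal sketch of forward stepwise selection, counting how many models get fit.
    # The data, the validation-set R^2 metric, and scikit-learn are assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n, J = 200, 10
    X = rng.normal(size=(n, J))
    y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=n)   # only 2 of the J predictors matter

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    selected, remaining, fits = [], list(range(J)), 0
    while remaining:
        # the 1st pass tries J candidates, the 2nd pass J - 1, and so on
        scores = {}
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X_tr[:, cols], y_tr)
            scores[j] = model.score(X_val[:, cols], y_val)
            fits += 1
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)

    print(f"models fit: {fits}  (J(J+1)/2 = {J * (J + 1) // 2}, versus 2^J = {2 ** J})")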

SLIDE 12

Applications of Model Selection

8

SLIDE 16

Cross Validation. Why?

[Figure] R² linear = 0.78, R² quadratic = 0.64 on the validation set

9

SLIDE 17

Cross Validation

10

SLIDE 18

Predictor Selection: Cross Validation

Rather than choosing a subset of significant predictors using stepwise selection, we can use K-fold cross validation:

▶ create a collection of different subsets of the predictors,
▶ for each subset of predictors, compute the cross validation score for the model created using only that subset,
▶ select the subset (and the corresponding model) with the best cross validation score,
▶ evaluate the model one last time on the test set (see the sketch below).

11
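A minimal sketch of this procedure; the synthetic data and the small hand-picked collection of candidate subsets are assumptions for illustration:

    # Sketch of scoring candidate predictor subsets with 5-fold cross validation.
    # The data and the candidate subsets are illustrative assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 5))
    y = 2 * X[:, 0] + X[:, 2] + rng.normal(size=300)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    subsets = [[0], [0, 2], [0, 1, 2], [0, 1, 2, 3, 4]]   # collection of candidate subsets
    cv_scores = {tuple(s): cross_val_score(LinearRegression(), X_train[:, s], y_train, cv=5).mean()
                 for s in subsets}
    best = max(cv_scores, key=cv_scores.get)

    # evaluate the chosen model one last time on the test set
    final = LinearRegression().fit(X_train[:, list(best)], y_train)
    print("best subset:", best, " test R^2:", final.score(X_test[:, list(best)], y_test))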

SLIDE 19

Degree Selection: Stepwise

We can frame the problem of degree selection for polynomial models as a predictor selection problem: which of the predictors {x, x², ..., x^M} should we select for modeling? We can apply stepwise selection to determine the optimal subset of predictors.

12

SLIDE 20

Degree Selection: Cross Validation

We can also select the degree of a polynomial model using K-fold cross validation.

▶ consider a number of different degrees,
▶ for each degree, compute the cross validation score for a polynomial model of that degree,
▶ select the degree, and the corresponding model, with the best cross validation score,
▶ evaluate the model one last time on the test set (see the sketch below).

13
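A minimal sketch of degree selection by cross validation; the synthetic cubic data and the degree grid are assumptions for illustration:

    # Sketch of choosing a polynomial degree by 5-fold cross validation.
    # The data-generating function and the degree grid are assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(2)
    x = rng.uniform(-3, 3, size=(200, 1))
    y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=2.0, size=200)

    for degree in range(1, 7):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        score = cross_val_score(model, x, y, cv=5).mean()
        print(f"degree {degree}: mean CV R^2 = {score:.3f}")
    # keep the degree with the best CV score, then evaluate that model once on the test set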

SLIDE 21

kNN Revisited

Recall our first simple, intuitive, non-parametric model for regression - the kNN model. We saw that it is vitally important to select an appropriate k for the data. If k is too small, the model is very sensitive to noise (since a new prediction is based on very few observed neighbors); if k is too large, the model tends towards making constant predictions. A principled way to choose k is through K-fold cross validation.

14
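A minimal sketch of choosing k by cross validation; the synthetic data and the grid of candidate k values are assumptions for illustration:

    # Sketch of selecting k for kNN regression with 5-fold cross validation.
    # The data and the grid of candidate k values are assumptions.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    search = GridSearchCV(KNeighborsRegressor(),
                          param_grid={"n_neighbors": list(range(1, 51))},
                          cv=5)
    search.fit(X, y)
    print("best k:", search.best_params_["n_neighbors"])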

SLIDE 22

A Simple Example

15

SLIDE 23

Behind Ordinary Least Squares, AIC, BIC

16

SLIDE 24

Likelihood Functions

We have been using AIC/BIC to evaluate the explanatory power of models, and we have been using the following formulae to calculate these criteria:

AIC ≈ n · ln(RSS/n) + 2J
BIC ≈ n · ln(RSS/n) + J · ln(n)

where J is the number of predictors in the model. Intuitively, AIC/BIC is a loss function that depends both on the predictive error, RSS, and on the complexity of the model. We see that we prefer a model with few parameters and low RSS. But why do the formulae look this way - what is the justification?

17
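A small numerical sketch of these formulae; the synthetic data and the fitted model are assumptions for illustration:

    # Sketch of the RSS-based AIC/BIC approximations above.
    # The synthetic data and the fitted model are assumptions.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(4)
    n, J = 150, 4
    X = rng.normal(size=(n, J))
    y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(size=n)

    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)

    aic = n * np.log(rss / n) + 2 * J
    bic = n * np.log(rss / n) + J * np.log(n)
    print(f"AIC ~ {aic:.1f}, BIC ~ {bic:.1f}")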

SLIDE 25

Likelihood Functions

Recall that our statistical model for linear regression in vector notation is

y = β₀ + ∑_{j=1}^{J} βⱼxⱼ + ϵ = β⊤x + ϵ.

It is standard to suppose that ϵ ∼ N(0, σ²). In fact, in many analyses we have been making this assumption. Then,

y | β, x ∼ N(β⊤x, σ²).

Can you see why? Note that N(y; β⊤x, σ²) is naturally a function of the model parameters β, since the data is fixed. We call

L(β) = N(y; β⊤x, σ²)

the likelihood function, as it gives the likelihood of the observed data for a chosen model β.

17

SLIDE 26

18

SLIDE 27

Maximum Likelihood Estimators

Once we have a likelihood function, L(β), we have a strong incentive to seek values of β that maximize L. Can you see why? The model parameters that maximize L are called maximum likelihood estimators (MLE) and are denoted:

β_MLE = argmax_β L(β)

The model constructed with the MLE parameters assigns the highest likelihood to the observed data.

19

SLIDE 28

Maximum Likelihood Estimators

But how does one maximize a likelihood function? Fix a set of n observations of J predictors, X, and a set of corresponding response values, Y; consider a linear model Y = Xβ + ϵ. If we assume that ϵ ∼ N(0, σ²), then the likelihood for each observation is

Lᵢ(β) = N(yᵢ; β⊤xᵢ, σ²)

and the likelihood for the entire set of data is

L(β) = ∏_{i=1}^{n} N(yᵢ; β⊤xᵢ, σ²)

Through some algebra, we can show that maximizing L(β) is equivalent to minimizing MSE:

β_MLE = argmax_β L(β) = argmin_β (1/n) ∑_{i=1}^{n} |yᵢ − β⊤xᵢ|² = argmin_β RSS

Minimizing MSE or RSS is called ordinary least squares.

19
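This equivalence can be checked numerically. Below is a minimal sketch that minimizes the negative Gaussian log-likelihood and compares the result to the least squares solution; the synthetic data and the fixed σ = 1 are assumptions:

    # Numerical check that maximizing the Gaussian likelihood matches ordinary
    # least squares. The data and the fixed sigma = 1 are assumptions.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(5)
    n, J = 200, 3
    X = rng.normal(size=(n, J))
    y = X @ np.array([1.5, -0.7, 2.0]) + rng.normal(size=n)

    def neg_log_likelihood(beta, sigma=1.0):
        # -ln L(beta) up to an additive constant; minimizing it maximizes L(beta)
        resid = y - X @ beta
        return 0.5 * np.sum(resid ** 2) / sigma ** 2

    beta_mle = minimize(neg_log_likelihood, x0=np.zeros(J)).x
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(np.allclose(beta_mle, beta_ols, atol=1e-4))   # True: the two estimates agree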

SLIDE 29

Information Criteria Revisited

Using the likelihood function, we can reformulate the information criteria metrics for model fitness in very intuitive terms. For both AIC and BIC, we consider the likelihood of the data under the MLE model against the number of explanatory variables used in the model,

g(J) − ln(L(β_MLE))

where g is a function of the number of predictors J. Individually,

AIC = J − ln(L(β_MLE))
BIC = (1/2) J ln(n) − ln(L(β_MLE))

In the formulae we had been using for AIC/BIC, we approximate L(β_MLE) using the RSS.

20

SLIDE 30

Bias vs Variance

21

SLIDE 31

Variance

22


SLIDE 34

Bias vs Variance

23

SLIDE 35

The Bias/Variance Trade-off

24

SLIDE 36

Regularization: LASSO and Ridge

25

SLIDE 37

Regularization: An Overview

The idea of regularization revolves around modifying the loss function L; in particular, we add a regularization term that penalizes some specified properties of the model parameters:

L_reg(β) = L(β) + λ R(β),

where λ is a scalar that gives the weight (or importance) of the regularization term. Fitting the model using the modified loss function L_reg would result in model parameters with desirable properties (specified by R).

26
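As a minimal sketch of this modified loss in code; the MSE base loss, the example penalties, and the toy inputs are assumptions for illustration:

    # Sketch of a regularized loss L_reg(beta) = L(beta) + lambda * R(beta),
    # using MSE as the base loss L. The penalty R is passed in as a function.
    import numpy as np

    def regularized_loss(beta, X, y, lam, R):
        mse = np.mean((y - X @ beta) ** 2)   # base loss L(beta)
        return mse + lam * R(beta)           # penalty R weighted by lambda

    # two common choices of R, discussed on the next slides
    l1 = lambda b: np.sum(np.abs(b))   # LASSO penalty
    l2 = lambda b: np.sum(b ** 2)      # ridge penalty

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([1.0, -2.0, -1.0])
    beta = np.array([1.0, -2.0])
    print(regularized_loss(beta, X, y, lam=0.1, R=l1),
          regularized_loss(beta, X, y, lam=0.1, R=l2))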

SLIDE 38

LASSO Regression

Since we wish to discourage extreme values in the model parameters, we need to choose a regularization term that penalizes parameter magnitudes. For our loss function, we will again use MSE. Together, our regularized loss function is

L_LASSO(β) = (1/n) ∑_{i=1}^{n} |yᵢ − β⊤xᵢ|² + λ ∑_{j=1}^{J} |βⱼ|.

Note that ∑_{j=1}^{J} |βⱼ| is the ℓ1 norm of the vector β:

∑_{j=1}^{J} |βⱼ| = ∥β∥₁

Hence, we often say that L_LASSO is the loss function for ℓ1 regularization. Finding the model parameters β_LASSO that minimize the ℓ1 regularized loss function is called LASSO regression.

27
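A brief sketch of LASSO with scikit-learn; the synthetic data and the value of alpha (scikit-learn's name for λ, with a slightly different scaling of the MSE term) are assumptions for illustration:

    # Sketch of LASSO (l1-regularized) regression with scikit-learn.
    # The synthetic data and the alpha value are assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(6)
    X = rng.normal(size=(200, 10))
    y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=200)

    lasso = Lasso(alpha=0.5).fit(X, y)
    print(lasso.coef_)   # most coefficients are driven exactly to zero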

SLIDE 39

Ridge Regression

Alternatively, we can choose a regularization term that penalizes the squares of the parameter magnitudes. Then, our regularized loss function is

L_Ridge(β) = (1/n) ∑_{i=1}^{n} |yᵢ − β⊤xᵢ|² + λ ∑_{j=1}^{J} βⱼ².

Note that ∑_{j=1}^{J} βⱼ² is related to the ℓ2 norm of β:

∑_{j=1}^{J} βⱼ² = ∥β∥₂²

Hence, we often say that L_Ridge is the loss function for ℓ2 regularization. Finding the model parameters β_Ridge that minimize the ℓ2 regularized loss function is called ridge regression.

28
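A matching sketch for ridge regression; again the data and alpha are assumptions, and scikit-learn's Ridge penalizes the raw sum of squared errors rather than the MSE, so its alpha is scaled differently from λ above:

    # Sketch of ridge (l2-regularized) regression with scikit-learn.
    # The synthetic data and the alpha value are assumptions.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(7)
    X = rng.normal(size=(200, 10))
    y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=200)

    ridge = Ridge(alpha=10.0).fit(X, y)
    print(ridge.coef_)   # coefficients shrink toward zero but are rarely exactly zero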

SLIDE 40

Choosing λ

In both ridge and LASSO regression, we see that the larger our choice of the regularization parameter λ, the more heavily we penalize large values in β:

1. If λ is close to zero, we recover the MSE, i.e. ridge and LASSO regression are just ordinary regression.
2. If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force β_Ridge and β_LASSO to be close to zero.

To avoid ad-hoc choices, we should select λ using cross-validation (see the sketch below).

29
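A minimal sketch of selecting the regularization strength by cross validation with scikit-learn's built-in LassoCV and RidgeCV estimators; the alpha grid and the synthetic data are assumptions:

    # Sketch of selecting the regularization strength by cross validation.
    # The alpha grid and the synthetic data are assumptions.
    import numpy as np
    from sklearn.linear_model import LassoCV, RidgeCV

    rng = np.random.default_rng(8)
    X = rng.normal(size=(300, 10))
    y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=300)

    alphas = np.logspace(-4, 2, 50)
    lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
    ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
    print("LASSO alpha:", lasso_cv.alpha_, " ridge alpha:", ridge_cv.alpha_)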

SLIDE 41

Regularization Methods: A Comparison

30

SLIDE 42

The Geometry of Regularization

31

SLIDE 43

Variable Selection as Regularization

Since LASSO regression tends to produce zero estimates for a number of model parameters - we say that LASSO solutions are sparse - we consider LASSO to be a method for variable selection. Many prefer using LASSO for variable selection (as well as for suppressing extreme parameter values) rather than stepwise selection, as LASSO avoids the statistical problems that arise in stepwise selection.

32
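A small sketch of reading the selected variables off a LASSO fit; the synthetic data (with only predictors 1 and 7 truly relevant) and the alpha value are assumptions:

    # Sketch of variable selection via LASSO: predictors with nonzero coefficients
    # are the "selected" ones. The data and the alpha value are assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(9)
    X = rng.normal(size=(200, 20))
    y = X[:, 1] - 2 * X[:, 7] + rng.normal(size=200)

    lasso = Lasso(alpha=0.3).fit(X, y)
    selected = np.flatnonzero(lasso.coef_)
    print("selected predictors:", selected)   # typically close to [1, 7]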

SLIDE 44

A Comparative Example

33
