SLIDE 1

Lecture #6: Model Selection & Cross Validation

Data Science 1
CS 109A, STAT 121A, AC 209A, E-109A
Pavlos Protopapas, Kevin Rader, Margo Levine, Rahul Dave

SLIDE 2

Lecture Outline

Review
Multiple Regression with Interaction Terms
Model Selection: Overview
Stepwise Variable Selection
Cross Validation
Applications of Model Selection

SLIDE 3

Review

SLIDE 4

Multiple Linear and Polynomial Regression

Last time, we saw that we can build a linear model for multiple predictors, {X1, . . . , XJ},

y = β0 + β1x1 + . . . + βJxJ + ϵ.

Using vector notation,

\[
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\quad
X = \begin{pmatrix}
1 & x_{1,1} & \dots & x_{1,J} \\
1 & x_{2,1} & \dots & x_{2,J} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n,1} & \dots & x_{n,J}
\end{pmatrix},
\quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_J \end{pmatrix},
\]

we can express the regression coefficients as

\[
\widehat{\boldsymbol{\beta}}
= \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \; \mathrm{MSE}(\boldsymbol{\beta})
= (X^\top X)^{-1} X^\top Y.
\]
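Below is a minimal NumPy sketch (not from the lecture) of this closed-form solution; the toy data and variable names are illustrative assumptions.

```python
# A toy illustration (assumed data) of the closed-form OLS solution
# beta_hat = (X^T X)^{-1} X^T Y.
import numpy as np

rng = np.random.default_rng(0)
n, J = 100, 3
X_raw = rng.normal(size=(n, J))              # n observations, J predictors
true_beta = np.array([1.0, -0.5, 0.3])
y = 2.0 + X_raw @ true_beta + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), X_raw])     # prepend the intercept column of 1s
# Solve the normal equations X^T X beta = X^T y (more stable than an explicit inverse).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                              # roughly [2.0, 1.0, -0.5, 0.3]
```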

SLIDE 5

Multiple Linear and Polynomial Regression

We also saw that there are ways to generalize multiple linear regression:

▶ Polynomial regression

y = β0 + β1x + . . . + βM x^M + ϵ.

▶ Polynomial regression with multiple predictors

In each case, we treat each polynomial term x_j^m as a unique predictor and perform multiple linear regression.
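As a sketch of this idea (again with assumed toy data), a degree-3 polynomial fit can be run through the same least-squares machinery by building the columns x, x^2, x^3 by hand:

```python
# Degree-3 polynomial regression as multiple linear regression on x, x^2, x^3
# (synthetic, assumed data).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3 + rng.normal(scale=0.2, size=200)

M = 3
# Design matrix with columns 1, x, x^2, ..., x^M: each power is its own predictor.
X = np.column_stack([x**m for m in range(M + 1)])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                              # roughly [1.0, 0.5, -2.0, 0.3]
```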

SLIDE 6

Selecting Significant Predictors

When modeling with multiple predictors, we are interested in which predictor or sets of predictors have a significant effect on the response.

Significance of predictors can be measured in multiple ways:

▶ Hypothesis testing:

– Subsets of predictors with F-stats higher than 1 may be significant.
– Individual predictors with p-values smaller than an established threshold (e.g. 0.05) may be significant.

▶ Evaluating model fitness:

– Subsets of predictors with higher model R² should be more significant.
– Subsets of predictors with lower model AIC or BIC should be more significant.
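One way (a sketch on synthetic data, not the lecture's example) to read off these quantities is a statsmodels OLS fit, which reports per-coefficient p-values, the overall F-statistic, R², AIC and BIC:

```python
# Reading off significance measures with statsmodels OLS (synthetic data;
# only the first predictor actually affects the response).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.pvalues)                  # small p-value for x1, large for x2 and x3
print(fit.fvalue)                   # overall F-statistic of the fit
print(fit.rsquared, fit.aic, fit.bic)
```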

SLIDE 7

Example

SLIDE 8

Multiple Regression with Interaction Terms

SLIDE 9

Interacting Predictors

In our multiple linear regression model for the NYC taxi data, we considered two predictors, a rush-hour indicator x1 (0 or 1) and trip length x2 (in minutes):

y = β0 + β1x1 + β2x2.

This model assumes that each predictor has an independent effect on the response, e.g. regardless of the time of day, the fare depends on the length of the trip in the same way. In reality, we know that a 30 minute trip covers a shorter distance during rush hour than in normal traffic.

SLIDE 10

Interacting Predictors

A better model considers how the interaction between the two predictors impacts the response:

y = β0 + β1x1 + β2x2 + β3x1x2.

The term β3x1x2 is called the interaction term. It determines the effect on the response when we consider the predictors jointly. For example, the effect of trip length on cab fare in the absence of rush hour is β2x2. When combined with rush hour traffic (x1 = 1), the effect of trip length is (β2 + β3)x2.

SLIDE 11

Multiple Linear Regression with Interaction Terms

Multiple linear regression with interaction terms can be treated like a special form of multiple linear regression: we simply treat the cross terms (e.g. x1x2) as additional predictors. Given a set of observations {(x1,1, x1,2, y1), . . . , (xn,1, xn,2, yn)}, the data and the model can be expressed in vector notation,

\[
Y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix},
\quad
X = \begin{pmatrix}
1 & x_{1,1} & x_{1,2} & x_{1,1}x_{1,2} \\
1 & x_{2,1} & x_{2,2} & x_{2,1}x_{2,2} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n,1} & x_{n,2} & x_{n,1}x_{n,2}
\end{pmatrix},
\quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}.
\]

Again, minimizing the MSE using vector calculus yields

\[
\widehat{\boldsymbol{\beta}}
= \underset{\boldsymbol{\beta}}{\operatorname{argmin}} \; \mathrm{MSE}(\boldsymbol{\beta})
= (X^\top X)^{-1} X^\top Y.
\]
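A small sketch of fitting this interaction model numerically, with simulated (hypothetical) taxi-like data; the cross term is simply appended as an extra column of the design matrix:

```python
# Fitting y = b0 + b1*x1 + b2*x2 + b3*x1*x2 on simulated, taxi-like data
# (rush-hour indicator x1, trip length x2); all numbers are made up.
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.integers(0, 2, size=n).astype(float)        # rush hour indicator: 0 or 1
x2 = rng.uniform(5, 60, size=n)                      # trip length in minutes
y = 3.0 + 2.0 * x1 + 0.5 * x2 + 0.2 * x1 * x2 + rng.normal(scale=1.0, size=n)

# The interaction enters the design matrix as one extra column, x1*x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                      # roughly [3.0, 2.0, 0.5, 0.2]
```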

SLIDE 12

Generalized Polynomial Regression

We can generalize polynomial models by:

1. considering polynomial models with multiple predictors {X1, . . . , XJ}:
\[
y = \beta_0 + \beta_1 x_1 + \dots + \beta_M x_1^M + \dots + \beta_{(J-1)M+1} x_J + \dots + \beta_{JM} x_J^M
\]

2. considering polynomial models with multiple predictors {X1, X2} and cross terms:
\[
y = \beta_0 + \beta_1 x_1 + \dots + \beta_M x_1^M
  + \beta_{M+1} x_2 + \dots + \beta_{2M} x_2^M
  + \beta_{2M+1} (x_1 x_2) + \dots + \beta_{3M} (x_1 x_2)^M
\]

In each case, we treat each term x_j^m and each cross term x_1 x_2 as a unique predictor and apply linear regression.
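A sketch of the same expansion with scikit-learn (an assumed tooling choice, not part of the slides): PolynomialFeatures generates all powers and cross terms up to a chosen degree, and plain linear regression is then applied to the expanded design matrix.

```python
# Expanding two predictors into powers and cross terms with scikit-learn,
# then fitting ordinary linear regression on the expanded matrix (toy data).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))                        # two predictors x1, x2
y = (1.0 + X[:, 0] - 0.5 * X[:, 1]**2
     + 2.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=300))

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                       # columns: x1, x2, x1^2, x1*x2, x2^2
model = LinearRegression().fit(X_poly, y)
print(poly.get_feature_names_out())
print(model.intercept_, model.coef_)
```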

SLIDE 13

Model Selection: Overview

SLIDE 14

Overfitting: Another Motivation for Model Selection

Finding subsets of significant predictors is important for model interpretation. But there is another strong reason to model using the smaller set of significant predictors: to avoid overfitting.

Definition

Overfitting is the phenomenon where the model is unnecessarily complex, in the sense that portions of the model capture random noise in the observations rather than the relationship between predictor(s) and response. Overfitting causes the model to lose predictive power on new data.

SLIDE 15

An Example

SLIDE 16

Causes of Overfitting

As we saw, overfitting can happen when

▶ there are too many predictors:

– the feature space has high dimensionality
– the polynomial degree is too high
– too many cross terms are considered

▶ the coefficient values are too extreme

A sign of overfitting may be a high training R² (or low training MSE) combined with unexpectedly poor testing performance. Note: there is no 100% accurate test for overfitting and there is no 100% effective way to prevent it. Rather, we may use multiple techniques in combination to prevent overfitting and various methods to detect it.

SLIDE 17

Model Selection

Model selection is the application of a principled method to determine the complexity of the model, e.g. choosing a subset of predictors, choosing the degree of the polynomial model, etc. Model selection typically consists of the following steps:

1. split the training set into two subsets: training and validation
2. multiple models (e.g. polynomial models with different degrees) are fitted on the training set; each model is evaluated on the validation set
3. the model with the best validation performance is selected
4. the selected model is evaluated one last time on the testing set
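A minimal sketch of these four steps, assuming the complexity being selected is the degree of a polynomial model and using a synthetic dataset:

```python
# Train / validation / test model selection over polynomial degrees
# (all data simulated; degrees 1..10 are the candidate models).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=400)

# Hold out a test set, then split the remainder into training and validation (step 1).
x_rest, x_test, y_rest, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=0)

# Steps 2-3: fit one model per degree on the training set, keep the best validation MSE.
best = None
for degree in range(1, 11):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    val_mse = mean_squared_error(y_val, model.predict(poly.transform(x_val)))
    if best is None or val_mse < best[0]:
        best = (val_mse, degree, poly, model)

# Step 4: evaluate the selected model once on the test set.
val_mse, degree, poly, model = best
test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
print(degree, val_mse, test_mse)
```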

SLIDE 18

Stepwise Variable Selection

SLIDE 19

Exhaustive Selection

To find the optimal subset of predictors for modeling a response variable, we can

▶ compute all possible subsets of {X1, . . . , XJ},
▶ evaluate all the models constructed from the subsets of {X1, . . . , XJ},
▶ find the model that optimizes some metric.

While straightforward, exhaustive selection is computationally infeasible, since {X1, . . . , XJ} has 2^J possible subsets. Instead, we will consider methods that iteratively build the optimal set of predictors.

SLIDE 20

Variable Selection: Forward

In forward selection, we find an ‘optimal’ set of predictors by iteratively building up our set.

1. Start with the empty set P0 and construct the null model M0.
2. For k = 1, . . . , J:
   2.1 Let Mk−1 be the model constructed from the best set of k − 1 predictors, Pk−1.
   2.2 Select the predictor Xnk not in Pk−1 so that the model constructed from Pk = {Xnk} ∪ Pk−1 optimizes a fixed metric (this can be p-value or F-stat; validation MSE or R²; or AIC/BIC on the training set).
   2.3 Let Mk denote the model constructed from the optimal Pk.
3. Select the model M amongst {M0, M1, . . . , MJ} that optimizes a fixed metric (this can be validation MSE or R²; or AIC/BIC on the training set).
4. Evaluate the final model M on the testing set.
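A sketch of forward selection in code (not the lecture's implementation), using validation MSE as the fixed metric; predictors are simply column indices of a NumPy design matrix.

```python
# A sketch of forward stepwise selection (illustrative, assumed setup).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def forward_selection(X_train, y_train, X_val, y_val):
    remaining = list(range(X_train.shape[1]))
    selected = []
    path = []                                  # (predictor set P_k, validation MSE of M_k)
    while remaining:
        # Step 2.2: try adding each remaining predictor and keep the best one.
        scores = []
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X_train[:, cols], y_train)
            mse = mean_squared_error(y_val, model.predict(X_val[:, cols]))
            scores.append((mse, j))
        best_mse, best_j = min(scores)
        selected = selected + [best_j]
        remaining.remove(best_j)
        path.append((list(selected), best_mse))
    # Step 3: among M_1, ..., M_J, return the predictor set with the best metric.
    return min(path, key=lambda item: item[1])

# Example (hypothetical data): only columns 0 and 2 carry signal.
rng = np.random.default_rng(11)
X = rng.normal(size=(400, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.3, size=400)
X_train, X_val, y_train, y_val = X[:300], X[300:], y[:300], y[300:]
print(forward_selection(X_train, y_train, X_val, y_val))
```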

SLIDE 21

Variable Selection: Backward

In backward selection, we find an ‘optimal’ set of predictors by iteratively eliminating predictors.

1. Start with the full set of predictors PJ and construct the full model MJ.
2. For k = J, . . . , 1:
   2.1 Let Mk be the model constructed from the best set of k predictors, Pk.
   2.2 Select the predictor Xnk in Pk so that the model constructed from Pk−1 = Pk − {Xnk} optimizes a fixed metric (this can be p-value or F-stat; validation MSE or R²; or AIC/BIC on the training set).
   2.3 Let Mk−1 denote the model constructed from the optimal Pk−1.
3. Select the model M amongst {M0, M1, . . . , MJ} that optimizes a fixed metric (this can be validation MSE or R²; or AIC/BIC on the training set).
4. Evaluate the final model M on the testing set.

SLIDE 22

An Example

SLIDE 23

Cross Validation

SLIDE 24

Cross Validation: Motivation

Using a single validation set to select amongst multiple models can be problematic: there is the possibility of overfitting to the validation set.

One solution to the problems raised by using a single validation set is to evaluate each model on multiple validation sets and average the validation performance. One can randomly split the training set into training and validation multiple times, but randomly creating these sets can create the scenario where important features of the data never appear in our random draws.

SLIDE 25

Leave-One-Out

Given a data set {X1, . . . , Xn}, where each Xi = (xi,1, . . . , xi,J) contains J features, we want every observation to be included in at least one training set and at least one validation set. To ensure this, we create training/validation splits using the leave-one-out method:

▶ validation set: {Xi}
▶ training set: X−i := {X1, . . . , Xi−1, Xi+1, . . . , Xn}

for i = 1, . . . , n. We fit the model on each training set, denoted fX−i, and evaluate it on the corresponding validation set, fX−i(Xi). The cross validation score is the performance of the model averaged across all validation sets:

\[
\mathrm{CV}(\text{Model}) = \frac{1}{n} \sum_{i=1}^{n} L\big(f_{X_{-i}}(X_i)\big),
\]

where L is a loss function.
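A minimal scikit-learn sketch of leave-one-out cross validation, using squared error as the loss L; the data here are simulated for illustration.

```python
# Leave-one-out cross validation of a linear model; each fold's "MSE" is the
# squared error on the single held-out point (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)

scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
cv_score = -scores.mean()           # average squared-error loss over the n folds
print(cv_score)
```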

SLIDE 26

K-Fold Cross Validation

Rather than creating n training/validation splits, each time leaving out a single data point for the validation set, we can include more data in the validation set using K-fold cross validation:

▶ split the data into K uniformly sized chunks, {C1, . . . , CK}
▶ create K training/validation splits, each using one of the K chunks for validation and the rest for training.

We fit the model on each training set, denoted fC−i, and evaluate it on the corresponding validation set, fC−i(Ci). The cross validation score is the performance of the model averaged across all validation sets:

\[
\mathrm{CV}(\text{Model}) = \frac{1}{K} \sum_{i=1}^{K} L\big(f_{C_{-i}}(C_i)\big),
\]

where L is a loss function.
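The same computation with K = 5 folds, again as a hedged sketch on simulated data:

```python
# 5-fold cross validation of the same kind of model (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = 0.5 + X @ np.array([1.0, 0.0, -2.0]) + rng.normal(scale=0.3, size=200)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kfold, scoring="neg_mean_squared_error")
print(-scores.mean())               # CV score: validation MSE averaged over the K folds
```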

SLIDE 27

Applications of Model Selection

SLIDE 28

Predictor Selection: Cross Validation

Rather than choosing a subset of significant predictors using stepwise selection, we can use K-fold cross validation:

▶ create a collection of different subsets of the predictors
▶ for each subset of predictors, compute the cross validation score for the model created using only that subset
▶ select the subset (and the corresponding model) with the best cross validation score
▶ evaluate the model one last time on the test set
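A sketch of this procedure over a hypothetical collection of candidate subsets (here, all non-empty subsets of three predictors), scored by 5-fold cross validation:

```python
# Choosing a predictor subset by cross validation: enumerate candidate subsets,
# score each with 5-fold CV, keep the best (toy data with three predictors).
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.4, size=300)

best_subset, best_score = None, -np.inf
for k in range(1, X.shape[1] + 1):
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        score = cross_val_score(LinearRegression(), X[:, cols], y,
                                cv=5, scoring="neg_mean_squared_error").mean()
        if score > best_score:
            best_subset, best_score = cols, score
print(best_subset, -best_score)     # expect columns [0, 2] with a small MSE
```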

SLIDE 29

Degree Selection: Stepwise

We can frame the problem of degree selection for polynomial models as a predictor selection problem: which of the predictors {x, x^2, . . . , x^M} should we select for modeling? We can apply stepwise selection to determine the optimal subset of predictors.

SLIDE 30

Degree Selection: Cross Validation

We can also select the degree of a polynomial model using K-fold cross validation:

▶ consider a number of different degrees
▶ for each degree, compute the cross validation score for a polynomial model of that degree
▶ select the degree, and the corresponding model, with the best cross validation score
▶ evaluate the model one last time on the test set
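A sketch of degree selection by 5-fold cross validation, assuming a single predictor and candidate degrees 1 through 10:

```python
# Choosing the polynomial degree by 5-fold cross validation (one predictor,
# synthetic data, candidate degrees 1..10).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=300)

cv_mse = {}
for degree in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print(best_degree, cv_mse[best_degree])
```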

SLIDE 31

kNN Revisited

Recall our first simple, intuitive, non-parametric model for regression: the kNN model. We saw that it is vitally important to select an appropriate k for the data. If k is too small then the model is very sensitive to noise (since a new prediction is based on very few observed neighbors), and if k is too large, the model tends towards making constant predictions. A principled way to choose k is through K-fold cross validation.
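A sketch of choosing k by cross validation; GridSearchCV is one convenient (assumed) way to do the split/score/refit bookkeeping over a grid of k values:

```python
# Choosing k for kNN regression via 5-fold cross validation (synthetic data).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(10)
x = rng.uniform(0, 10, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.2, size=300)

search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": list(range(1, 31))},
                      cv=5, scoring="neg_mean_squared_error")
search.fit(x, y)
print(search.best_params_)          # the k with the best cross validation score
```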

SLIDE 32

A Simple Example

