The problems with holdout sets
MODEL VALIDATION IN PYTHON

Kasey Jones, Data Scientist

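Transition validation

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error as mae
    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data for out-of-sample evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
    out_of_sample = rf.predict(X_test)
    print(mae(y_test, out_of_sample))

    10.24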

Traditional training splits

    import pandas as pd

    cd = pd.read_csv("candy-data.csv")
    # Two samples of 60 rows that differ only in the random seed
    s1 = cd.sample(60, random_state=1111)
    s2 = cd.sample(60, random_state=1112)

Overlapping candies:

    print(len([i for i in s1.index if i in s2.index]))

    39

Traditional training splits

Chocolate candies:

    print(s1.chocolate.value_counts()[0])
    print(s2.chocolate.value_counts()[0])

    34
    30
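
The overlap between the two samples can be reasoned about directly. The sketch below uses a synthetic 85-row DataFrame as a stand-in (an assumption; the row count of candy-data.csv is not stated on the slides). Two samples of 60 rows drawn from 85 must share at least 60 + 60 - 85 = 35 rows, so they can never be fully independent of each other.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the candy data: 85 rows with an arbitrary column.
df = pd.DataFrame({"value": np.arange(85)})

# Two samples of 60 rows, differing only in the random seed.
s1 = df.sample(60, random_state=1111)
s2 = df.sample(60, random_state=1112)

# Count rows that appear in both samples.
overlap = len(s1.index.intersection(s2.index))

# By the pigeonhole principle the overlap is at least 60 + 60 - 85 = 35.
print(overlap >= 35)
```

The exact overlap depends on the seeds, but the lower bound holds for any pair of seeds.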

The split matters

Sample 1 testing error:

    print('Testing error: {0:.2f}'.format(mae(s1_y_test, rfr.predict(s1_X_test))))

    10.32

Sample 2 testing error:

    print('Testing error: {0:.2f}'.format(mae(s2_y_test, rfr.predict(s2_X_test))))

    11.56
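
A minimal sketch of the same effect on synthetic data (an assumption: make_regression stands in for the candy data, and the sizes and seeds are arbitrary). Only the split seed changes between runs, yet the testing error moves with it.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Small synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=85, n_features=8, noise=10.0, random_state=0)

errors = []
for seed in range(10):
    # Only the split seed changes; the model settings stay fixed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
    rfr.fit(X_train, y_train)
    errors.append(mean_absolute_error(y_test, rfr.predict(X_test)))

# The spread between the best and worst split shows how much a
# single holdout estimate can depend on the split itself.
print('Lowest testing error: {0:.2f}'.format(min(errors)))
print('Highest testing error: {0:.2f}'.format(max(errors)))
```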

Train, validation, test

    # First split holds out the final test set; second split holds out the validation set
    X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1111)
    X_train, X_val, y_train, y_val = train_test_split(..., random_state=1111)

    rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
    rfr.fit(X_train, y_train)
    print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))

    9.18

    print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))

    8.98

Round 2

    # Same procedure, different split seed
    X_temp, X_test, y_temp, y_test = train_test_split(..., random_state=1171)
    X_train, X_val, y_train, y_val = train_test_split(..., random_state=1171)

    rfr = RandomForestRegressor(n_estimators=25, random_state=1111, max_features=4)
    rfr.fit(X_train, y_train)
    print('Validation error: {0:.2f}'.format(mae(y_val, rfr.predict(X_val))))

    8.73

    print('Testing error: {0:.2f}'.format(mae(y_test, rfr.predict(X_test))))

    10.91
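
The slides elide the arguments to train_test_split. As a sketch only, the two-step split might look like the following, assuming a 60/20/20 division and toy arrays (both are assumptions, not values taken from the slides).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 features (illustrative only).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: hold out 20% of all rows as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

# Second split: 25% of the remaining 80 rows become the validation set,
# which is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The second test_size is relative to the rows left after the first split, which is why 0.25 rather than 0.2 is used to get equal validation and test sets.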

Holdout set exercises

Cross-validation


n_splits: number of cross-validation splits
shuffle: boolean indicating whether to shuffle the data before splitting
random_state: random seed

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.array(range(40))
    y = np.array([0] * 20 + [1] * 20)

    kf = KFold(n_splits=5)
    splits = kf.split(X)

    kf = KFold(n_splits=5)
    splits = kf.split(X)
    for train_index, test_index in splits:
        print(len(train_index), len(test_index))

    32 8
    32 8
    32 8
    32 8
    32 8

    # Print one of the index sets:
    print(train_index, test_index)

    [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...] [32 33 34 35 36 37 38 39]
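
Without shuffling, the last fold above is simply the final eight indices. A short sketch of the shuffle and random_state parameters on the same 40-element array (the seed 1111 is an arbitrary choice): the fold sizes stay the same, and every index still lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(range(40))

# shuffle=True randomizes which indices land in each fold;
# random_state makes that shuffling reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

fold_sizes = []
seen = []
for train_index, test_index in kf.split(X):
    fold_sizes.append((len(train_index), len(test_index)))
    seen.extend(test_index.tolist())

print(fold_sizes)  # [(32, 8), (32, 8), (32, 8), (32, 8), (32, 8)]

# Each of the 40 indices is used for testing exactly once.
print(sorted(seen) == list(range(40)))  # True
```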

    rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
    errors = []
    for train_index, val_index in splits:
        X_train, y_train = X[train_index], y[train_index]
        X_val, y_val = X[val_index], y[val_index]

        # Fit on the training fold, predict on the validation fold
        rfr.fit(X_train, y_train)
        predictions = rfr.predict(X_val)
        errors.append(<some_accuracy_metric>)

    print(np.mean(errors))

    4.25
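
A runnable version of this loop, as a sketch under stated assumptions: synthetic data from make_regression replaces the course data, and mean absolute error fills the metric placeholder. Note that kf.split() returns a generator, so it is called directly in the loop here; a generator that has already been iterated once would yield nothing the second time.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=1111)
rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

errors = []
for train_index, val_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]

    # Refit on each training fold and score on the matching validation fold.
    rfr.fit(X_train, y_train)
    predictions = rfr.predict(X_val)
    errors.append(mean_absolute_error(y_val, predictions))

# Average the five fold errors into one cross-validated estimate.
print(np.mean(errors))
```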

Practice time

sklearn's cross_val_score()

cross_val_score()

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    rfc = RandomForestClassifier()

estimator: the model to use
X: the predictor dataset
y: the response array
cv: the number of cross-validation splits

    cross_val_score(estimator=rfc, X=X, y=y, cv=5)
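
A self-contained sketch of this call, assuming synthetic classification data from make_classification (the dataset and model settings are illustrative, not from the slides). When no scoring argument is given, cross_val_score falls back to the estimator's own score method, which for a classifier is accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

rfc = RandomForestClassifier(n_estimators=25, random_state=1111)

# One accuracy value per fold is returned as a NumPy array.
scores = cross_val_score(estimator=rfc, X=X, y=y, cv=5)

print(scores.shape)  # (5,)
print(scores.mean())
```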

Using scoring and make_scorer

The cross_val_score scoring parameter:

    # Load the methods
    from sklearn.metrics import mean_absolute_error, make_scorer

    # Create a scorer
    mae_scorer = make_scorer(mean_absolute_error)

    # Use the scorer
    cross_val_score(<estimator>, <X>, <y>, cv=5, scoring=mae_scorer)
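
One detail worth seeing in practice is the sign of the results. The sketch below, on synthetic data (an assumption), compares the make_scorer approach with scikit-learn's built-in "neg_mean_absolute_error" scoring string: the custom scorer reports positive errors, while the built-in one reports the same values negated so that larger is always better.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
rfr = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)

# Custom scorer: returns the raw (positive) mean absolute error per fold.
mae_scorer = make_scorer(mean_absolute_error)
pos_scores = cross_val_score(rfr, X, y, cv=5, scoring=mae_scorer)

# Built-in scorer: returns the negated error per fold.
neg_scores = cross_val_score(rfr, X, y, cv=5, scoring="neg_mean_absolute_error")

# Same folds, same seeded model: the two results differ only in sign.
print(np.allclose(pos_scores, -neg_scores))
```

For reporting, the positive form is easier to read. For model selection tools that maximize a score, the negated form (or make_scorer with greater_is_better=False) is the safer choice.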

Load all of the sklearn methods:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.metrics import mean_squared_error, make_scorer

Create a model and a scorer:

    rfr = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)
    mse = make_scorer(mean_squared_error)

Run cross_val_score():

    cv_results = cross_val_score(rfr, X, y, cv=5, scoring=mse)

Accessing the results

    print(cv_results)

    [196.765, 108.563, 85.963, 222.594, 140.942]

Report the mean and standard deviation:

    print('The mean: {}'.format(cv_results.mean()))
    print('The std: {}'.format(cv_results.std()))

    The mean: 150.965
    The std: 51.676

Let's practice!

Leave-one-out cross-validation (LOOCV)

LOOCV

When to use LOOCV?

Use when:
  - The amount of training data is limited
  - You want the absolute best error estimate for new data

Be cautious when:
  - Computational resources are limited
  - You have a lot of data
  - You have a lot of parameters to test

LOOCV Example

    # One split per observation: each row is the validation set exactly once
    n = X.shape[0]
    mse = make_scorer(mean_squared_error)
    cv_results = cross_val_score(estimator, X, y, scoring=mse, cv=n)

    print(cv_results)

    [5.45, 10.52, 6.23, 1.98, 11.27, 9.21, 4.65, ... ]

    print(cv_results.mean())

    6.32
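
As a sketch on synthetic data (an assumption; the estimator settings are also illustrative), the same procedure can be written with scikit-learn's LeaveOneOut splitter instead of cv=n. For a regressor, an integer cv means an unshuffled KFold, so with n splits both forms hold out one row at a time in the same order.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset: LOOCV fits one model per row, so cost grows with n.
X, y = make_regression(n_samples=30, n_features=4, noise=5.0, random_state=0)
estimator = RandomForestRegressor(n_estimators=10, random_state=1111)
mse = make_scorer(mean_squared_error)

# Form used on the slide: as many splits as there are rows.
n = X.shape[0]
cv_int = cross_val_score(estimator, X, y, scoring=mse, cv=n)

# Equivalent form with an explicit splitter object.
cv_loo = cross_val_score(estimator, X, y, scoring=mse, cv=LeaveOneOut())

print(len(cv_loo))  # 30
print(np.allclose(cv_int, cv_loo))
print(cv_loo.mean())
```

Each entry is the squared error on a single held-out row, so individual values are noisy; the mean is the quantity to report.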

Let's practice
