
SLIDE 1

CSI5180. Machine Learning for Bioinformatics Applications

Fundamentals of Machine Learning — Training

by Marcel Turcotte

Version November 6, 2019

SLIDE 2

Preamble

SLIDE 3

Preamble

Fundamentals of Machine Learning — Training. In this lecture, we focus on training learning algorithms. This will include the need for two, three, or k data sets, tuning hyperparameter values, as well as concepts such as under- and over-fitting the data.

General objective:

  • Describe the fundamental concepts of machine learning

SLIDE 4

Learning objectives

  • Describe the role of the training, validation, and test sets
  • Clarify the concepts of under- and over-fitting the data
  • Explain the process of tuning hyperparameter values

Reading:

  • Chicco, D. Ten quick tips for machine learning in computational biology. BioData Mining 10:35 (2017).
  • Boulesteix, A.-L. Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol 11:e1004191 (2015).
  • Domingos, P. A few useful things to know about machine learning. Commun ACM 55(10):78–87 (2012).

SLIDE 5

Plan

  • 1. Preamble
  • 2. Problem
  • 3. Testing
  • 4. Under- and over-fitting
  • 5. 7-Steps workflow
  • 6. Prologue

SLIDE 6

The 7 Steps of Machine Learning

https://youtu.be/nKW8Ndu7Mjw

SLIDE 7

Problem

SLIDE 8

Supervised learning - regression

The data set is a collection of labelled examples:

$\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$

Each $\mathbf{x}_i$ is a feature vector with $D$ dimensions. $x_i^{(j)}$ is the value of the feature $j$ of the example $i$, for $j \in 1 \ldots D$ and $i \in 1 \ldots N$.

The label $y_i$ is a real number.

Problem: given the data set as input, create a “model” that can be used to predict the value of $y$ for an unseen $\mathbf{x}$.
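
To make the notation concrete, here is a minimal NumPy sketch of such a data set; the values are made up for illustration:

import numpy as np

# X has one row per example (N = 4) and one column per feature (D = 3);
# X[i, j] corresponds to x_i^(j) (0-indexed here).
X = np.array([[0.1, 2.0, 5.3],
              [0.7, 1.1, 4.8],
              [0.3, 0.9, 6.0],
              [0.5, 1.5, 5.1]])

# y holds one real-valued label per example.
y = np.array([3.2, 2.7, 4.1, 3.5])

N, D = X.shape  # N = 4 examples, D = 3 features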

SLIDE 9

QSAR

QSAR stands for Quantitative Structure-Activity Relationship.

As a machine learning problem:

  • Each $\mathbf{x}_i$ is a chemical compound
  • $y_i$ is the biological activity of the compound $\mathbf{x}_i$

Examples of biological activity include toxicity and biodegradability.

[Figure: a compound represented as a feature vector, (0.615, 0.125, 1.140, . . . , 0.941)]


SLIDES 10-14

HIV-1 reverse transcriptase inhibitors

Viira, B., García-Sosa, A. T. & Maran, U. Chemical structure and correlation analysis of HIV-1 NNRT and NRT inhibitors and database-curated, published inhibition constants with chemical structure in diverse datasets. J Mol Graph Model 76:205–223 (2017).

  • “Human immunodeficiency virus type 1 reverse transcriptase (HIV-1 RT) has been one of the most important targets for anti-HIV drug development due to it being an essential step in retroviral replication.”
  • “Many small molecule compounds (. . . ) have been studied over the years.”
  • “Due to mutations and other influencing factors, the search for new inhibitor molecules for HIV-1 is ongoing.”
  • “Our recent design, modelling, and synthesis effort in the search for new compounds has resulted in two new, small, low toxicity (. . . ) inhibitors.”

SLIDES 15-16

HIV Life Cycle

https://aidsinfo.nih.gov/understanding-hiv-aids


SLIDES 17-18

HIV-1 reverse transcriptase inhibitors

Each compound (example) in ChemDB has features such as the number of atoms, area, solvation, coulombic, molecular weight, XLogP, etc.

A possible solution, a model, would look something like this:

$\hat{y} = 44.418 - 35.133 \, x^{(1)} - 13.518 \, x^{(2)} + 0.766 \, x^{(3)}$
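
As a sketch, evaluating such a model is just a dot product plus a bias term; the coefficients come from the equation above, while the feature values below are hypothetical:

import numpy as np

# Coefficients of the example model above.
theta_0 = 44.418                              # bias term
theta = np.array([-35.133, -13.518, 0.766])   # feature weights

def predict(x):
    """Predict the activity for a feature vector x = (x1, x2, x3)."""
    return theta_0 + np.dot(theta, x)

# Hypothetical compound with three feature values:
print(predict(np.array([0.615, 0.125, 1.140])))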

SLIDE 19

Testing


SLIDES 20-26

Two sets!

Training set versus test set:

  • Rule of thumb: keep 80% of your data for training and use the remaining 20% of your data for testing.
  • Training set: the data set used for training your model.
  • Test set: an independent set that is used at the very end to evaluate the performance of your model.

Generalization error: the error rate on new cases.

In most cases, the training error will be low, because most learning algorithms are designed to find a set of values for their parameters (weights) such that the training error is low. However, the generalization error can still be high; we then say that the model is overfitting the training data. If the training error is high, we say that the model is underfitting the training data.
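
A minimal scikit-learn sketch of the 80/20 split, with X and y as in the earlier sketch:

from sklearn.model_selection import train_test_split

# Hold out 20% of the examples for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)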

SLIDE 27

Under- and over-fitting

SLIDE 28

Underfitting and overfitting

Underfitting and overfitting are two important concepts for machine learning projects. We will use a regression task to illustrate these two concepts.


SLIDES 29-32

Linear Regression

A linear model assumes that the value of the label, $\hat{y}_i$, can be expressed as a linear combination of the feature values, $x_i^{(j)}$:

$\hat{y}_i = h(\mathbf{x}_i) = \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}$

Here, $\theta_j$ is the $j$th parameter of the (linear) model, with $\theta_0$ being the bias term/parameter and $\theta_1 \ldots \theta_D$ being the feature weights.

Problem: find values for all the model parameters so that the model “best fits” the training data.

The Root Mean Square Error (RMSE) is a common performance measure for regression problems:

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ h(\mathbf{x}_i) - y_i \right]^2}$
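
This formula translates directly into a few lines of NumPy; a small sketch with made-up numbers:

import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between labels and predictions."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical example: true labels vs. model predictions.
print(rmse(np.array([3.0, 5.0, 2.0]), np.array([2.5, 5.5, 2.0])))  # ~0.408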

SLIDE 33

sklearn.linear_model.LinearRegression

[Figure: scatter plot of the data with the fitted regression line; x from -4 to 2, y from 5 to 35]

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

SLIDE 34

Source code

import numpy as np

X = 6 * np.random.rand(100, 1) - 4
y = X ** 2 - 4 * X + 5 + np.random.randn(100, 1)

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

X_new = np.array([[-4], [2]])
y_pred = lin_reg.predict(X_new)

SLIDE 35

Source code (continued)

plt.plot(X, y, "b.")
plt.plot(X_new, y_pred, "r-")
plt.xlabel("$x$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-4, 2, -1, 35])
save_fig("regression_linear-01")
plt.show()

SLIDE 36

Source code (continued)

import os

import matplotlib as mpl
import matplotlib.pyplot as plt

def save_fig(fig_id, tight_layout=True, fig_extension="pdf", resolution=300):
    path = os.path.join(fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)


SLIDES 37-39

The need for a validation set

Wait a minute! How do we know that a linear model is suitable for this application?

We don’t!

Solution: we might want to “test” alternative hypotheses.


SLIDES 40-44

Three sets!

Hyperparameter: a parameter whose value is not learnt by the algorithm, but set by the user.

Examples include the learning rate, the soft or hard margin for SVMs, the regularization weight for regression, the number of layers and the optimization algorithm for ANNs, but also the degree of a polynomial in the case of polynomial regression, and many more.

Validation set: a third data set used to determine the “optimal” values of the hyperparameters.

Rule of thumb: keep 70% of your data for training, use 15% of your data for validation, and 15% of your data for testing. For data sets comprising millions of examples, a 1 or 2% test set might be enough.
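
One way to obtain the three sets is two successive calls to train_test_split; a sketch, with the percentages and seed purely illustrative:

from sklearn.model_selection import train_test_split

# First carve off 30% of the data, then split that portion half-and-half
# into validation (15%) and test (15%) sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)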


SLIDES 45-47

Learning curves

One way to assess our models is to visualize the learning curves.

A learning curve shows the performance of our model, here measured using RMSE, on both the training set and the validation set. Multiple measurements are obtained by repeatedly training the model on larger and larger subsets of the data.
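
A sketch of how these measurements can be produced, assuming the rmse helper defined earlier and the X, y generated on SLIDE 34:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def learning_curve_rmse(model, X, y):
    """Train on growing subsets; return train/validation RMSE per size."""
    # Hold out a fixed validation set, then train on 1, 2, ..., m examples.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train) + 1):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(rmse(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(rmse(y_val, model.predict(X_val)))
    return train_errors, val_errors

train_err, val_err = learning_curve_rmse(LinearRegression(), X, y)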

SLIDE 48

Linear model - underfitting

[Figure: learning curves of the linear model. Source: Géron 2019]

SLIDE 49

sklearn.linear_model.LinearRegression

[Figure: the fitted linear model on the quadratic data, as on SLIDE 33]

SLIDES 50-54

Learning curves - underfitting

  • With only one (1) or two (2) examples, the model perfectly “fits” the training set.
  • As the size of the data set grows, it becomes impossible to fit the training set, since the examples used to produce this illustration were generated using a quadratic function. Accordingly, the RMSE on the training set reaches a plateau and stays there.
  • With few examples, the model performs badly on the validation set; the model generalizes poorly.
  • As the size of the data set grows, the performance on the validation set improves. Eventually, the performance does not improve any further.

“These learning curves are typical of a model that’s underfitting. Both curves have reached a plateau; they are close and fairly high.” [2]

SLIDES 55

Polynomial of degree 10 - overfitting

[Figure: learning curves of a degree-10 polynomial model. Source: Géron 2019]
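
For reference, one common way to fit such a polynomial model in scikit-learn is a feature-expansion pipeline; a sketch, not necessarily the exact code behind the figure:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Expand x into the powers x, x^2, ..., x^10, then fit a linear model
# on the expanded features.
poly_reg = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    LinearRegression())
poly_reg.fit(X, y)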

SLIDE 56

Overfitting and underfitting

[Figure. Source: Géron 2019]


SLIDES 57-58

Learning curves - overfitting

Here, the error on the training data is much lower. There is a gap between the two curves: the model performs significantly better on the training data than on the validation data.

SLIDE 59

Overfitting - deep nets - loss

[Figure: training versus validation loss. Source: Chollet 2018]

SLIDE 60

Overfitting - deep nets - accuracy

[Figure: training versus validation accuracy. Source: Chollet 2018]

SLIDE 61

Summary

Underfitting:

  • Your model is too simple (e.g., a linear model).
  • Uninformative features.
  • Poor performance on both the training and validation data.

Overfitting:

  • Your model is too complex (a tall decision tree, a deep and wide neural network, . . . ).
  • Too many features given the number of examples available.
  • Excellent performance on the training set, but poor performance on the validation set.


SLIDES 62-69

k sets! Cross-validation

What if I have a small number of examples?

The data is divided into k sets.

  • Each set is called a fold.
  • We talk about 3-fold, 5-fold, or 10-fold cross-validation.
  • The special case where k = N is called leave-one-out.

The training/validation procedure is run k times. Each time, one of the k sets is used for validation, whereas the rest of the data is used for training.

At the end, one calculates the mean and standard deviation of the metrics of interest: cost/loss function, precision/recall, etc. This opens the door to hypothesis testing.
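
A sketch of the k-fold loop itself in scikit-learn, here with k = 5 (sklearn.model_selection.LeaveOneOut can be swapped in for the k = N case):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, validate on the held-out fold.
    model = LinearRegression()
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    fold_rmse.append(np.sqrt(np.mean((pred - y[val_idx]) ** 2)))

print("Mean:", np.mean(fold_rmse), "Standard deviation:", np.std(fold_rmse))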

SLIDE 70

TO DO 2020

Insert a discussion about hypothesis testing.

SLIDE 71

sklearn.model_selection.cross_val_score

from sklearn.model_selection import cross_val_score

# Score with negative MSE, then convert to RMSE (scikit-learn's error
# scorers are negated so that greater is always better).
lin_scores = cross_val_score(lin_reg, X, y,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
print("Scores:", lin_rmse_scores)
print("Mean:", lin_rmse_scores.mean())
print("Standard deviation:", lin_rmse_scores.std())

Scores: [66782.73843989 66960.118071 70347.95244419 74739.57052552 68031.13388938 71193.84183426 64969.63056405 68281.61137997 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.674001798348

SLIDE 72

sklearn.model_selection.cross_val_score

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

tree_reg = DecisionTreeRegressor()
tree_scores = cross_val_score(tree_reg, X, y,
                              scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)
print("Scores:", tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())

Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782 71115.88230639 75585.14172901 70262.86139133 70273.6325285 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004


SLIDES 73-80

Grid search

  • Most learning algorithms have many hyperparameters that need tuning. In fact, this is one of the major disadvantages of machine learning algorithms.
  • Often, people manually explore various combinations. A grid search is a better approach.
  • Systematically enumerate all the possible combinations of hyperparameter values. For each combination, train a model on the training set and test its performance on the validation set.
  • Initially, you try powers of two (2) or ten (10). If time allows, conduct another grid search with values close to the optimal values found in the previous iteration. (A sketch of the enumeration follows below.)
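
A minimal sketch of such an enumeration over coarse, powers-of-ten grids, reusing the three-way split and the rmse helper from earlier; the model choice and grid values are illustrative:

from itertools import product
from sklearn.linear_model import Ridge

# Hypothetical coarse grids: powers of ten for the regularization weight,
# plus two choices of solver.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
solvers = ["cholesky", "lsqr"]

best_rmse, best_params = float("inf"), None
for alpha, solver in product(alphas, solvers):
    # Train on the training set, evaluate on the validation set.
    model = Ridge(alpha=alpha, solver=solver)
    model.fit(X_train, y_train)
    err = rmse(y_val, model.predict(X_val))
    if err < best_rmse:
        best_rmse, best_params = err, (alpha, solver)

print("Best:", best_params, best_rmse)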

SLIDE 81

sklearn.model_selection.GridSearchCV

See: Géron 2019 §2

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5)
grid_search.fit(X_train, y_train)
grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}

SLIDE 82

Randomized search

What if the number of combinations is large (many hyperparameters, many values for each)? Scikit-Learn provides RandomizedSearchCV.

  • The user can supply either a list of values for each hyperparameter or a probability distribution (a method for sampling values).
  • The user also specifies the number of iterations, that is, the number of combinations to try. This makes the execution time more predictable.
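
A sketch of the randomized counterpart of the grid search above; the distributions and iteration count are illustrative:

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Sample hyperparameter values from distributions instead of a fixed grid.
param_distributions = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions,
                                n_iter=10, cv=5, random_state=42)
rnd_search.fit(X_train, y_train)
rnd_search.best_params_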

SLIDE 83

7-Steps workflow

SLIDE 84

Workflow [Deep Learning with Python]

  • 1. Defining the problem and assembling the data
  • 2. Choosing a measure of success
  • 3. Choosing an evaluation protocol
  • 4. Preparing the data
  • 5. Developing an initial model
  • 6. Developing a model that overfits
  • 7. Regularization and hyperparameter tuning

Source: [4]

SLIDE 85

Prologue


SLIDES 86-89

Summary

  • We discussed the roles of the training, validation, and test sets
  • We also talked about cross-validation: k-fold and leave-one-out
  • Underfitting and overfitting are important concepts
  • We looked at grid search and randomized search

SLIDE 90

Next module

Training - gradient descent

SLIDE 91

References

[1] Nathalie Japkowicz and Mohak Shah. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, Cambridge, 2011.
[2] Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition, 2019.
[3] Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.
[4] François Chollet. Deep Learning with Python. Manning Publications, 2017.

SLIDE 92

Marcel Turcotte
Marcel.Turcotte@uOttawa.ca
School of Electrical Engineering and Computer Science (EECS)
University of Ottawa