slide-1
SLIDE 1
  • CSI5180. Machine Learning for Bioinformatics Applications

Regularized Linear Models

by

Marcel Turcotte

Version November 6, 2019

slide-2
SLIDE 2

Preamble 2/42

Preamble

slide-3
SLIDE 3

Preamble

Preamble 3/42

Regularized Linear Models

In this lecture, we introduce the concept of regularization. We consider the specific context of linear models: Ridge Regression, Lasso Regression, and Elastic Net. Finally, we discuss a simple technique called early stopping.

General objective:

Explain the concept of regularization in the context of linear regression and logistic regression.

slide-4
SLIDE 4

Learning objectives

Preamble 4/42

Explain the concept of regularization in the context of linear regression and logistic regression.

Reading:

Simon Dirmeier, Christiane Fuchs, Nikola S Mueller, and Fabian J Theis, netReg: network-regularized linear models for biological association studies, Bioinformatics 34 (2018), no. 5, 896–898.

slide-5
SLIDE 5

Plan

Preamble 5/42

  • 1. Preamble
  • 2. Introduction
  • 3. Polynomial Regression
  • 4. Regularization
  • 5. Logistic Regression
  • 6. Prologue
slide-6
SLIDE 6

Introduction 6/42

Introduction

slide-7
SLIDE 7

Supervised learning

Introduction 7/42

The data set is a collection of labelled examples.

$\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$

Each $\mathbf{x}_i$ is a feature vector with $D$ dimensions. $x_k^{(j)}$ is the value of the feature $j$ of the example $k$, for $j \in 1 \ldots D$ and $k \in 1 \ldots N$.

The label $y_i$ is either a class, taken from a finite list of classes, $\{1, 2, \ldots, C\}$, or a real number, or a more complex object (vector, matrix, tree, graph, etc.).

Problem: given the data set as input, create a “model” that can be used to predict the value of $y$ for an unseen $\mathbf{x}$.

Classification: $y_i \in \{\text{Positive}, \text{Negative}\}$, a binary classification problem.

Regression: $y_i$ is a real number.

slide-11
SLIDE 11

Linear Regression

Introduction 8/42

A linear model assumes that the value of the label, $\hat{y}_i$, can be expressed as a linear combination of the feature values, $x_i^{(j)}$:

$\hat{y}_i = h(\mathbf{x}_i) = \theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}$

Here, $\theta_j$ is the jth parameter of the (linear) model, with $\theta_0$ being the bias term/parameter, and $\theta_1 \ldots \theta_D$ being the feature weights.

Problem: find values for all the model parameters so that the model “best fits” the training data.

The Root Mean Square Error (RMSE) is a common performance measure for regression problems:

$\sqrt{\frac{1}{N} \sum_{i=1}^{N} [h(\mathbf{x}_i) - y_i]^2}$
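As a concrete illustration (a minimal NumPy sketch added here, not part of the original slides), the RMSE of a linear hypothesis can be computed as follows; the data-generating line mirrors the example shown a few slides below.

import numpy as np

def rmse(h, X, y):
    # Root Mean Square Error of the hypothesis h on the examples (X, y).
    return np.sqrt(np.mean((h(X) - y) ** 2))

# Illustrative linear hypothesis: theta0 is the bias, theta holds the feature weights.
theta0, theta = 4.0, np.array([3.0])
h = lambda X: theta0 + X @ theta

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X[:, 0] + np.random.randn(100)
print(rmse(h, X, y))  # close to 1.0, the standard deviation of the noise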

slide-12
SLIDE 12

Polynomial Regression 9/42

Polynomial Regression

slide-16
SLIDE 16

Polynomial Regression

Polynomial Regression 10/42

What if the data is more complex? In our discussion on underfitting and overfitting the training data, we did look at polynomial models, but did not discuss how to learn them. Can we use our linear model to “fit” non-linear data, and specifically data that would have been generated by a polynomial “process”?

How?

slide-20
SLIDE 20

sklearn.preprocessing.PolynomialFeatures

Polynomial Regression 11/42

A surprisingly simple solution consists of generating new features that are powers of existing ones!

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)

slide-21
SLIDE 21

Example fitting a linear model

Polynomial Regression 12/42

import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
# [4.07916603] [[2.90173949]]

y = 4 + 3x + noise
ŷ = 4.07916603 + 2.90173949x

slide-22
SLIDE 22

Example fitting a polynomial model

Polynomial Regression 13/42

import numpy as np
X = 6 * np.random.rand(100, 1) - 3
y = 2 + 0.5 * X**2 + X + np.random.randn(100, 1)

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
# [1.701144] [[1.02118676 0.55725864]]

y = 2.0 + 0.5x² + 1.0x + noise
ŷ = 1.701144 + 0.55725864x² + 1.02118676x

slide-25
SLIDE 25

Remarks

Polynomial Regression 14/42

For higher degrees, PolynomialFeatures adds all the combinations of features.

Given two features a and b, PolynomialFeatures generates a², a³, b², b³, but also ab, a²b, and ab².

Given n features and degree d, PolynomialFeatures produces $\frac{(n+d)!}{d!\,n!}$ combinations!
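As a quick sanity check (an illustration added here, not on the original slides), this count, which includes the bias column, matches what PolynomialFeatures generates when include_bias is left at its default:

from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n, d = 4, 3  # 4 features, degree 3
X = np.ones((1, n))
n_generated = PolynomialFeatures(degree=d).fit_transform(X).shape[1]
print(n_generated, comb(n + d, d))  # both print 35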

slide-26
SLIDE 26

Regularization 15/42

Regularization

slide-31
SLIDE 31

Bias/Variance trade-off

Regularization 16/42

From [2] §4:

“(. . . ) a model’s generalization error can be expressed as the sum of three very different errors:”

Bias: “is due to wrong assumptions”; “A high-bias model is most likely to underfit the training data.”

Variance: “the model’s excessive sensitivity to small variations in the training data”. A model with many parameters “is likely to have high variance and thus overfit the training data.”

Irreducible error: “noisiness of the data itself”

“Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance.”

slide-32
SLIDE 32

Overfitting and underfitting

Regularization 17/42

Source: Géron 2019

slide-33
SLIDE 33

Linear model - underfitting

Regularization 18/42

Source: Géron 2019

slide-34
SLIDE 34

Polynomial of degree 10 - overfitting

Regularization 19/42

Source: Géron 2019

slide-39
SLIDE 39

Regularization

Regularization 20/42

“Constraining a model to make it simpler and reduce the risk of overfitting is called regularization.” [2]

One way to regularize a polynomial model is to restrict its degree.

How would you do that?

Make the degree a hyperparameter, and use a hold-out set or cross-validation (see the sketch at the end of this slide).

Alternatively, we can constrain the weights of the model.
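Returning to the degree-as-hyperparameter idea above, a cross-validated search could look like the following sketch (an illustration added here, not from the original slides; the pipeline and grid names are arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# The polynomial degree is treated as a hyperparameter and selected by 5-fold CV.
pipeline = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("lin_reg", LinearRegression()),
])
param_grid = {"poly__degree": [1, 2, 3, 5, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")

X = 6 * np.random.rand(100, 1) - 3
y = 2 + 0.5 * X[:, 0] ** 2 + X[:, 0] + np.random.randn(100)
search.fit(X, y)
print(search.best_params_)  # typically {'poly__degree': 2}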

slide-40
SLIDE 40

Norm

Regularization 21/42

A norm is a function that assigns a number (length, size) to a vector.

ℓp-norm: $\|\theta\|_p = \left( \sum_{j=1}^{D} |\theta^{(j)}|^p \right)^{1/p}$

ℓ1-norm: $\|\theta\|_1 = \sum_{j=1}^{D} |\theta^{(j)}|$

ℓ2-norm: $\|\theta\|_2 = \sqrt{\sum_{j=1}^{D} |\theta^{(j)}|^2}$
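For concreteness (a small NumPy illustration added here, not on the original slides):

import numpy as np

theta = np.array([0.5, -2.0, 0.0, 3.0])

l1 = np.linalg.norm(theta, ord=1)  # sum of absolute values -> 5.5
l2 = np.linalg.norm(theta, ord=2)  # square root of the sum of squares -> ~3.64
print(l1, l2)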

slide-45
SLIDE 45

Ridge Regression

Regularization 22/42

You will remember the objective function, Mean Squared Error (MSE), used by our gradient descent:

$\frac{1}{N} \sum_{i=1}^{N} [h(\mathbf{x}_i) - y_i]^2$

In the case of Ridge Regression, the objective function becomes:

$\frac{1}{N} \sum_{i=1}^{N} [h(\mathbf{x}_i) - y_i]^2 + \frac{1}{2} \alpha \sum_{j=1}^{D} (\theta^{(j)})^2$

The regularization applies at learning time only.

α is a hyperparameter; with α = 0, Ridge Regression is equivalent to Linear Regression.

The penalty term $\frac{1}{2} \alpha \sum_{j=1}^{D} (\theta^{(j)})^2$ is based on the ℓ2-norm of the weight vector.
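To connect the formula to code, here is a minimal NumPy sketch of this penalized objective (an illustration added here, not from the original slides; note that the bias term θ0 is not penalized):

import numpy as np

def ridge_objective(theta0, theta, X, y, alpha):
    # MSE of the linear model plus the l2 penalty; theta0 (the bias) is not penalized.
    residuals = theta0 + X @ theta - y
    mse = np.mean(residuals ** 2)
    penalty = 0.5 * alpha * np.sum(theta ** 2)
    return mse + penalty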

slide-46
SLIDE 46

sklearn.linear_model.Ridge

Regularization 23/42

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)

slide-47
SLIDE 47

Ridge Regression

Regularization 24/42

Source: [2] Figure 4.17

slide-53
SLIDE 53

Lasso Regression

Regularization 25/42

Another popular regularization is the Least Absolute Shrinkage and Selection Operator Regression, Lasso Regression.

Its objective function is:

$\frac{1}{N} \sum_{i=1}^{N} [h(\mathbf{x}_i) - y_i]^2 + \alpha \sum_{j=1}^{D} |\theta^{(j)}|$

The regularization applies at learning time only.

α is a hyperparameter; with α = 0, Lasso Regression is equivalent to Linear Regression.

The penalty term $\alpha \sum_{j=1}^{D} |\theta^{(j)}|$ is based on the ℓ1-norm of the weight vector.

Lasso regression favors sparse models (models with few terms with non-zero weights).
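By analogy with the Ridge example shown earlier, a Lasso model could be fitted as follows (a sketch added here, not on the original slides; it assumes X and y from the earlier examples):

from sklearn.linear_model import Lasso

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)  # some coefficients are driven exactly to zero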

slide-54
SLIDE 54

Lasso Regression

Regularization 26/42

Source: [2] Figure 4.18

slide-55
SLIDE 55

Ridge and Lasso regression

Regularization 27/42

“Your role as the data analyst is to find such a value of the hyperparameter [α] that doesn’t increase the bias too much but reduces the variance to a level reasonable for the problem at hand.” [3]

In practice, the ℓ1-norm (Lasso) produces models that are sparse, thus acting as a feature selection mechanism. However, the ℓ2-norm (Ridge) usually gives better results in practice.

These norms are frequently used with other models/objective functions.

slide-59
SLIDE 59

Elastic Net

Regularization 28/42

Elastic Net is a mixture of Ridge Regression and Lasso Regression.

$\frac{1}{N} \sum_{i=1}^{N} [h(\mathbf{x}_i) - y_i]^2 + r \alpha \sum_{j=1}^{D} |\theta^{(j)}| + \frac{1 - r}{2} \alpha \sum_{j=1}^{D} (\theta^{(j)})^2$

It adds a second hyperparameter, r, to control the ratio of ℓ2 and ℓ1 regularization.

In all three cases, the summation over j starts at 1, i.e. the bias term (here, the intercept) is excluded from the regularization.

slide-60
SLIDE 60

sklearn.linear_model.ElasticNet

Regularization 29/42

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)

Source: [2] §4

slide-61
SLIDE 61

Early stopping

Regularization 30/42

Geoffrey Hinton called this the “beautiful free lunch”

Source: [2] Figure 4.20
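Early stopping interrupts training as soon as the validation error stops improving. A minimal sketch of the idea (an illustration added here, not from the original slides; X_train, y_train, X_val, and y_val are assumed to be defined) uses scikit-learn's SGDRegressor with warm_start=True so that each call to fit() continues from the current weights:

from copy import deepcopy

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       learning_rate="constant", eta0=0.0005)

best_val_error, best_model = float("inf"), None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)          # continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < best_val_error:         # keep the best model seen so far
        best_val_error, best_model = val_error, deepcopy(sgd_reg)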

slide-62
SLIDE 62

Remarks

Regularization 31/42

The criteria used to drive the optimization (training) can be different from the criteria used for the hyperparameter selection procedure.

Regularized models are known to be sensitive to the scale of the features, so the data should be “normalized” (see the sketch below).

“(. . . ) the fewer degrees of freedom it has, the harder it will be for it to overfit the data.”
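A common way to handle the scaling requirement (a sketch added here, not from the original slides; it assumes X and y from the earlier examples) is to chain a scaler and the regularized model in a pipeline:

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit the regularized model on the scaled data.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)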

slide-63
SLIDE 63

Logistic Regression 32/42

Logistic Regression

slide-68
SLIDE 68

Logistic (Logit) Regression

Logistic Regression 33/42

Despite its name, Logistic Regression is a classification algorithm.

The labels are binary values, $y_i \in \{0, 1\}$.

It is formulated to answer the question, “what is the probability that $\mathbf{x}_i$ is a positive example, i.e. $y_i = 1$?”

Just like Linear Regression, Logistic Regression computes a weighted sum of the input features:

$\theta_0 + \theta_1 x_i^{(1)} + \theta_2 x_i^{(2)} + \ldots + \theta_D x_i^{(D)}$

The image of this function is $-\infty$ to $\infty$!

slide-69
SLIDE 69

Logistic Regression

Logistic Regression 34/42

In mathematics, a standard logistic function maps a real value (R) to the interval (0, 1):

[Figure: plot of the standard logistic (sigmoid) curve]

Source: Wikipedia

$\sigma(t) = \frac{1}{1 + e^{-t}}$
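A one-line NumPy version of this function (a small illustration added here, not from the original slides):

import numpy as np

def sigmoid(t):
    # Standard logistic function, mapping R to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(np.array([-6.0, 0.0, 6.0])))  # approximately [0.0025, 0.5, 0.9975]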

slide-74
SLIDE 74

Logistic Regression

Logistic Regression 35/42

The Logistic Regression model, in its vectorized form, is:

$h_\theta(\mathbf{x}_i) = \sigma(\theta \cdot \mathbf{x}_i) = \frac{1}{1 + e^{-\theta \cdot \mathbf{x}_i}}$

Predictions are made as follows:

$y_i = 0$, if $h_\theta(\mathbf{x}_i) < 0.5$

$y_i = 1$, if $h_\theta(\mathbf{x}_i) \geq 0.5$

The values of θ are learnt using gradient descent.
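Putting the model and the decision rule together (a sketch added here, not from the original slides; the parameter values are purely illustrative):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(theta0, theta, X):
    # Return 1 when the estimated probability is at least 0.5, and 0 otherwise.
    proba = sigmoid(theta0 + X @ theta)
    return (proba >= 0.5).astype(int)

theta0, theta = -1.0, np.array([2.0, 0.5])  # illustrative parameters
X = np.array([[0.2, 0.1], [1.5, 2.0]])
print(predict(theta0, theta, X))  # [0 1]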

slide-75
SLIDE 75

2020

Logistic Regression 36/42

Include the derivation of the loss (objective) function.

slide-76
SLIDE 76

sklearn.linear_model.LogisticRegression

Logistic Regression 37/42

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
# ...
y_proba = log_reg.predict_proba(X_new)

slide-77
SLIDE 77

Prologue 38/42

Prologue

slide-86
SLIDE 86

Summary

Prologue 39/42

Regularization is the idea of constraining a model to make it simpler, and thus less prone to overfitting.

Limiting the complexity of the model is one way to add regularization, e.g. limiting the degree of the polynomial in the case of a polynomial model.

Often, penalty terms are added to the objective (cost) function.

Ridge: an ℓ2-norm term is added to the objective function.

Lasso: an ℓ1-norm term is added to the objective function.

Elastic Net: both ℓ2- and ℓ1-norm terms are added to the objective function.

Early stopping is an effective and fairly general regularization technique; it can be applied to iterative learning algorithms, such as batch gradient descent.

Contrary to Principal Component Analysis, the above techniques are assessed in terms of their impact on the performance of the learning algorithm (on the validation set).

slide-87
SLIDE 87

Next module

Prologue 40/42

Models related to decision trees

slide-88
SLIDE 88

References

Prologue 41/42

[1] Simon Dirmeier, Christiane Fuchs, Nikola S Mueller, and Fabian J Theis. netReg: network-regularized linear models for biological association studies. Bioinformatics, 34(5):896–898, 2018.

[2] Aurélien Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, 2nd edition, 2019.

[3] Andriy Burkov. The Hundred-Page Machine Learning Book. Andriy Burkov, 2019.

slide-89
SLIDE 89

Prologue 42/42

Marcel Turcotte

Marcel.Turcotte@uOttawa.ca School of Electrical Engineering and Computer Science (EECS) University of Ottawa