
CZECH TECHNICAL UNIVERSITY IN PRAGUE
Faculty of Electrical Engineering, Department of Cybernetics

Bias-variance trade-off. Crossvalidation. Regularization.

Petr Pošík

P. Pošík © 2015, Artificial Intelligence – 1 / 13


How to evaluate a predictive model?



Model evaluation


Fundamental question: What is a good measure of “model quality” from the machine-learning standpoint?

■ We have various measures of model error:
  ■ For regression tasks: MSE, MAE, ...
  ■ For classification tasks: misclassification rate, measures based on the confusion matrix, ...
■ Some of them can be regarded as finite approximations of the Bayes risk.
■ Are these functions good approximations when measured on the data the models were trained on?

[Figure: two models fitted to the same data points: f(x) = x and f(x) = x³ − 3x² + 3x]

Using MSE only, both models are equivalent!!!

[Figure: a linear fit f(x) = −0.09 + 0.99x and a cubic fit f(x) = 0.00 − 0.31x + 1.67x² − 0.51x³ to the same data]

Using MSE only, the cubic model is better than the linear one!!!

A basic method of evaluation is model validation on a different, independent data set from the same source, i.e. on testing data.
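To make these error measures concrete, here is a minimal NumPy sketch computing MSE, MAE and the misclassification rate; the numbers are made up purely for illustration.

```python
import numpy as np

# Regression: true targets vs. model predictions (made-up numbers).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.8, 3.4, 3.9])
mse = np.mean((y_true - y_pred) ** 2)    # mean squared error
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error

# Classification: true labels vs. predicted labels (made-up numbers).
c_true = np.array([0, 1, 1, 0, 1])
c_pred = np.array([0, 1, 0, 0, 1])
misclassification_rate = np.mean(c_true != c_pred)

print(mse, mae, misclassification_rate)  # 0.055 0.2 0.2
```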

Validation on testing data


Example: Polynomial regression with varying degree: X ∼ U(−1, 3), Y ∼ X² + N(0, 1)

[Figure: polynomial fits of increasing degree to the training data, with the testing data overlaid]
  Polynomial degree 0: training error 8.319, testing error 6.901
  Polynomial degree 1: training error 2.013, testing error 2.841
  Polynomial degree 2: training error 0.647, testing error 0.925
  Polynomial degree 3: training error 0.645, testing error 0.919
  Polynomial degree 5: training error 0.611, testing error 0.979
  Polynomial degree 9: training error 0.545, testing error 1.067
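The following is a minimal NumPy sketch of this experiment; the sample sizes and the random seed are assumptions, so the exact error values will differ from those listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # X ~ U(-1, 3), Y ~ X^2 + N(0, 1), as in the example above.
    x = rng.uniform(-1, 3, size=n)
    return x, x**2 + rng.normal(0, 1, size=n)

x_tr, y_tr = make_data(30)    # training data
x_te, y_te = make_data(30)    # testing data

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

for degree in (0, 1, 2, 3, 5, 9):
    model = np.poly1d(np.polyfit(x_tr, y_tr, degree))   # least-squares polynomial fit
    print(f"degree {degree}: tr. err. {mse(y_tr, model(x_tr)):.3f}, "
          f"test err. {mse(y_te, model(x_te)):.3f}")
```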


Training and testing error


[Figure: training and testing MSE as functions of the polynomial degree]

■ The training error decreases with increasing model flexibility.
■ The testing error is minimal for a certain degree of model flexibility.


Overfitting


Definition of overfitting:

■ Let H be a hypothesis space.
■ Let h ∈ H and h′ ∈ H be 2 different hypotheses from this space.
■ Let ErrTr(h) be the error of the hypothesis h measured on the training dataset (training error).
■ Let ErrTst(h) be the error of the hypothesis h measured on the testing dataset (testing error).
■ We say that h is overfitted if there is another h′ for which ErrTr(h) < ErrTr(h′) ∧ ErrTst(h) > ErrTst(h′).

[Figure: model error vs. model flexibility for training and testing data]

■ “When overfitted, the model works well for the training data, but fails for new (testing) data.”
■ Overfitting is a general phenomenon affecting all kinds of inductive learning.
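As a concrete illustration of this definition, the sketch below compares two hypotheses fitted to the quadratic example data: a flexible degree-9 polynomial h and a degree-2 polynomial h′; the data sizes, degrees and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 3, size=n)
    return x, x**2 + rng.normal(0, 1, size=n)

x_tr, y_tr = make_data(15)       # small training set, so a flexible model can overfit
x_te, y_te = make_data(1000)     # large testing set approximating the true error

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

h       = np.poly1d(np.polyfit(x_tr, y_tr, 9))   # very flexible hypothesis
h_prime = np.poly1d(np.polyfit(x_tr, y_tr, 2))   # less flexible hypothesis

# h is overfitted if it beats h' on the training data but loses on the testing data.
overfitted = (mse(y_tr, h(x_tr)) < mse(y_tr, h_prime(x_tr))
              and mse(y_te, h(x_te)) > mse(y_te, h_prime(x_te)))
print("h overfitted with respect to h':", overfitted)
```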

We want models and learning algorithms with a good generalization ability, i.e.

■ we want models that encode only the patterns valid in the whole domain, not the specifics of the training data,
■ we want algorithms able to find only the patterns valid in the whole domain and to ignore the specifics of the training data.


Bias vs Variance


[Figure: three polynomial fits to the same training data, with the testing data overlaid]
  Degree 1 (tr. err. 2.013, test err. 2.841): high bias, model not flexible enough (underfit)
  Degree 2 (tr. err. 0.647, test err. 0.925): “just right” (good fit)
  Degree 9 (tr. err. 0.545, test err. 1.067): high variance, model flexibility too high (overfit)

High bias problem:
■ ErrTr(h) is high
■ ErrTst(h) ≈ ErrTr(h)

High variance problem:
■ ErrTr(h) is low
■ ErrTst(h) ≫ ErrTr(h)

[Figure: model error vs. model flexibility for training and testing data]


Crossvalidation


Simple crossvalidation:

■ Split the data into training and testing subsets.
■ Train the model on the training data.
■ Evaluate the model error on the testing data.

K-fold crossvalidation:

■ Split the data into k folds (k is usually 5 or 10).
■ In each iteration:
  ■ Use k − 1 folds to train the model.
  ■ Use 1 fold to test the model, i.e. to measure the error.

  Iter. 1:  Training | Training | Testing
  Iter. 2:  Training | Testing  | Training
  ...
  Iter. k:  Testing  | Training | Training

■ Aggregate (average) the k error measurements to get the final error estimate.
■ Train the model on the whole data set.

Leave-one-out (LOO) crossvalidation:

■ k = |T|, i.e. the number of folds is equal to the training set size.
■ Time consuming for large |T|.
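Below is a minimal NumPy sketch of k-fold crossvalidation for the polynomial example; the data, sample size and k are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 3, size=40)
y = x**2 + rng.normal(0, 1, size=40)

def kfold_mse(x, y, degree, k=5):
    """Estimate the MSE of a degree-`degree` polynomial by k-fold crossvalidation."""
    folds = np.array_split(rng.permutation(len(x)), k)   # shuffle, then split into k folds
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = np.poly1d(np.polyfit(x[train_idx], y[train_idx], degree))
        errors.append(np.mean((y[test_idx] - model(x[test_idx])) ** 2))
    return np.mean(errors)   # aggregate the k error measurements

for degree in (1, 2, 3, 5, 9):
    print(f"degree {degree}: CV error estimate = {kfold_mse(x, y, degree):.3f}")

# Leave-one-out CV is the special case k = len(x).
```

In practice the same procedure is also available off the shelf, e.g. as cross_val_score in scikit-learn.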


How to determine a suitable model flexibility


Simply test models of varying complexities and choose the one with the best testing error, right?

■ The testing data are used here to tune a meta-parameter of the model.
■ The testing data are used to train (a part of) the model; they thus essentially become part of the training data.
■ The error on the testing data is then no longer an unbiased estimate of the model error; it underestimates it.
■ A new, separate data set is needed to estimate the model error.

Using simple crossvalidation:

1. Training data: use about 50 % of the data for model building.
2. Validation data: use about 25 % of the data to search for a suitable model flexibility.
3. Train the chosen model on the training + validation data.
4. Testing data: use about 25 % of the data for the final estimate of the model error.

Using k-fold crossvalidation:

1. Training data: use about 75 % of the data to find and train a suitable model using crossvalidation.
2. Testing data: use about 25 % of the data for the final estimate of the model error.

The ratios are not set in stone; there are other possibilities, e.g. 60:20:20, etc.
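A minimal sketch of the simple-crossvalidation variant (a 50/25/25 split) for choosing the polynomial degree; the data, split ratios and candidate degrees are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 3, size=200)
y = x**2 + rng.normal(0, 1, size=200)

# Split the data into training (50 %), validation (25 %) and testing (25 %) parts.
idx = rng.permutation(len(x))
tr, val, te = np.split(idx, [int(0.50 * len(x)), int(0.75 * len(x))])

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Use the validation data to search for a suitable model flexibility (the degree).
candidate_degrees = range(10)
val_errors = [mse(y[val], np.poly1d(np.polyfit(x[tr], y[tr], d))(x[val]))
              for d in candidate_degrees]
best_degree = int(np.argmin(val_errors))

# Re-train the chosen model on training + validation data.
trval = np.concatenate([tr, val])
final_model = np.poly1d(np.polyfit(x[trval], y[trval], best_degree))

# The testing data give the final estimate of the model error.
print(f"chosen degree: {best_degree}, estimated error: {mse(y[te], final_model(x[te])):.3f}")
```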


How to prevent overfitting?


1. Reduce the number of features.
   ■ Select manually which features to keep.
   ■ Try to identify a suitable subset of features during the learning phase.

2. Regularization.
   ■ Keep all the features, but reduce the magnitudes of the parameters w.
   ■ Works well if we have a lot of features, each of which contributes a bit to predicting y.


Regularization



Ridge regularization (a.k.a. Tikhonov regularization)


Ridge regularization penalizes the size of the model coefficients:

■ Modification of the optimization criterion:

  $$J(\mathbf{w}) = \frac{1}{|T|} \sum_{i=1}^{|T|} \left( y^{(i)} - h_{\mathbf{w}}(\mathbf{x}^{(i)}) \right)^2 + \alpha \sum_{d=1}^{D} w_d^2$$

■ The solution is given by a modified normal equation:

  $$\mathbf{w}^* = \left( \mathbf{X}^T \mathbf{X} + \alpha \mathbf{I} \right)^{-1} \mathbf{X}^T \mathbf{y}$$

■ As α → 0, w_ridge → w_OLS.
■ As α → ∞, w_ridge → 0.

Training and testing errors as functions of the regularization parameter:

[Figure: training and testing MSE vs. the regularization factor α, log scale]

The values of the coefficients as functions of the regularization parameter:

[Figure: coefficient sizes vs. the regularization factor α, log scale]
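A minimal sketch of ridge regression via the modified normal equation above, applied to polynomial features of the running example; the feature degree, sample size and α values are assumptions, and for simplicity the intercept coefficient is penalized together with the others.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Solve w* = (X^T X + alpha * I)^-1 X^T y without forming the inverse explicitly."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 3, size=50)
y = x**2 + rng.normal(0, 1, size=50)
X = np.vander(x, N=10, increasing=True)      # columns 1, x, x^2, ..., x^9

for alpha in (1e-8, 1e-2, 1.0, 1e2, 1e6):
    w = ridge_fit(X, y, alpha)
    mse = np.mean((y - X @ w) ** 2)
    # Larger alpha shrinks the coefficients towards zero (w_ridge -> 0 as alpha -> inf).
    print(f"alpha = {alpha:g}: training MSE = {mse:.3f}, ||w|| = {np.linalg.norm(w):.3f}")
```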


Lasso regularization


Lasso regularization penalizes the size of the model coefficients:

■ Modification of the optimization criterion:

  $$J(\mathbf{w}) = \frac{1}{|T|} \sum_{i=1}^{|T|} \left( y^{(i)} - h_{\mathbf{w}}(\mathbf{x}^{(i)}) \right)^2 + \alpha \sum_{d=1}^{D} |w_d|$$

■ The solution is usually found by quadratic programming.
■ As α → ∞, Lasso regularization decreases the number of non-zero coefficients.

Training and testing errors as functions of the regularization parameter:

[Figure: training and testing MSE vs. the regularization factor α, log scale]

The values of the coefficients as functions of the regularization parameter:

[Figure: coefficient sizes vs. the regularization factor α, log scale]
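The slides do not prescribe a particular solver; one readily available option is scikit-learn's Lasso, used in the sketch below. Note that it relies on coordinate descent rather than quadratic programming and scales the data term by 1/(2|T|), so its alpha is not numerically identical to the α above; the feature degree and alpha values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.uniform(-1, 3, size=50)
y = x**2 + rng.normal(0, 1, size=50)
X = np.vander(x, N=10, increasing=True)[:, 1:]   # features x, x^2, ..., x^9; intercept fit separately

for alpha in (1e-4, 1e-2, 1e-1, 1.0):
    # Unscaled polynomial features converge slowly, hence the generous max_iter.
    model = Lasso(alpha=alpha, max_iter=100_000).fit(X, y)
    n_nonzero = int(np.sum(model.coef_ != 0))
    # Increasing alpha drives more and more coefficients exactly to zero.
    print(f"alpha = {alpha:g}: non-zero coefficients = {n_nonzero}, "
          f"training MSE = {np.mean((y - model.predict(X)) ** 2):.3f}")
```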