Foundations of Machine Learning
CentraleSupélec, Fall 2017
4. Model evaluation & selection
Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Practical matters
- You should have received an email from me on Tuesday.
- Partial solution to Lab 1 at the end of the slides of Chapter 3.
- Pointers/refreshers re: (scientific) Python:
  – http://www.scipy-lectures.org/
  – https://github.com/chagaz/ml-notebooks/ → lsml2017
- Yes, I only put the slides online after the lecture.
Generalization

A good and useful approximation
- It is easy to build a model that performs well on the training data.
- But how well will it perform on new data?
- "Predictions are hard, especially about the future." (Niels Bohr)
- Hence we want to:
  – learn models that generalize well;
  – evaluate whether models generalize well.
Noise in the data
- Imprecision in recording the features.
- Errors in labeling the data points (teacher noise).
- Missing features (hidden or latent variables).
- Making no errors on the training set might not be possible.
Models of increasing complexity
[Figure: models of increasing complexity fit to the same data.]
Noise and model complexity
- Use simple models!
  – Easier to use: lower computational complexity.
  – Easier to train: lower space complexity.
  – Easier to explain: more interpretable.
  – Generalize better.
- Occam's razor: simpler explanations are more plausible.
Overfitting
- What are the empirical errors of the black and purple classifiers?
- Which model seems more likely to be correct?
Overfitting & Underfitting (Regression)
[Figure: regression fits illustrating underfitting and overfitting.]
Generalization error vs. model complexity
[Figure: prediction error vs. model complexity. The error on training data decreases with complexity; the error on new data is high in the underfitting regime, reaches a minimum, then rises again in the overfitting regime.]
Bias-variance tradeoff
- Bias: difference between the expected value of the estimator and the true value being estimated.
  – A simpler model has a higher bias.
  – High bias can cause underfitting.
- Variance: deviation of the estimates from their expected value.
  – A more complex model has a higher variance.
  – High variance can cause overfitting.
Bias-variance decomposition
- Mean squared error of an estimator $\hat{\theta}$ of $\theta$: $\mathrm{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2]$
- Decomposition:
  $$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}$$
- Proof: expand $(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2$; the cross-term vanishes because $\theta$ and y are deterministic, so $\mathbb{E}[\hat{\theta}] - \theta$ is a constant.
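The decomposition can be checked numerically. Below is a minimal numpy sketch (not from the slides): it repeatedly fits polynomials of increasing degree to fresh samples and estimates the bias² and variance of the prediction at one fixed point. The target function, noise level, and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)  # arbitrary deterministic target
x0 = 0.3                                  # point at which we study the estimator
n, n_repeats = 30, 500

for degree in (1, 3, 9):
    preds = np.empty(n_repeats)
    for b in range(n_repeats):
        x = rng.uniform(0, 1, n)                  # fresh training sample
        y = f_true(x) + rng.normal(0, 0.3, n)     # with observation noise
        preds[b] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - f_true(x0)) ** 2      # (E[f^(x0)] - f(x0))^2
    var = preds.var()                             # E[(f^(x0) - E[f^(x0)])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The simple model (degree 1) shows a large bias and a small variance; the complex model (degree 9) the opposite.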
Generalization error vs. model complexity
[Figure: prediction error vs. model complexity, on training data and on new data. Simple models: high bias, low variance. Complex models: low bias, high variance.]
Model selection & generalization
- Well-posed problems (Hadamard, on the mathematical modeling of physical phenomena):
  – a solution exists;
  – it is unique;
  – the solution changes continuously with the initial conditions.
- Learning is an ill-posed problem: data helps carve out the hypothesis space, but data alone is not sufficient to find a unique solution.
- Need for an inductive bias: assumptions about the hypothesis space.
  – Model selection: choose the "right" inductive bias.
How do we decide a model is good?
Learning objectives
After this lecture you should be able to design experiments to select and evaluate supervised machine learning models. Concepts:
- training and testing sets;
- cross-validation;
- bootstrap;
- measures of performance for classifiers and regressors;
- measures of model complexity.
Supervised learning setting
- Training set: $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\dots,n}$, with $x_i \in \mathcal{X}$.
- Classification: $y_i \in \{0, 1\}$.
- Regression: $y_i \in \mathbb{R}$.
- Goal: find $f: \mathcal{X} \rightarrow \mathcal{Y}$ such that $f(x_i) \approx y_i$.
- Empirical error of f on the training set, given a loss L:
  $$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)$$
  – E.g. 0/1 loss (classification): $L(f(x), y) = \mathbb{1}[f(x) \neq y]$
  – E.g. squared loss (regression): $L(f(x), y) = (f(x) - y)^2$
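As an illustration (my sketch, not from the slides): the empirical error is just the average loss over the training set, and only the loss function changes between classification and regression.

```python
import numpy as np

def empirical_error(predictions, y, loss):
    """Average loss of a predictor over the n training points."""
    return np.mean([loss(p, yi) for p, yi in zip(predictions, y)])

zero_one_loss = lambda y_pred, y_true: float(y_pred != y_true)  # classification
squared_loss = lambda y_pred, y_true: (y_pred - y_true) ** 2    # regression

print(empirical_error([1, 0, 1, 1], [1, 1, 1, 0], zero_one_loss))  # 0.5
print(empirical_error([1.2, 0.8], [1.0, 1.0], squared_loss))       # 0.04
```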
Generalization error
- The empirical error on the training set is a poor estimate of the generalization error (the expected error on new data).
  – If the model is overfitting, the generalization error can be arbitrarily large.
- We would like to estimate the generalization error on new data, which we do not have.
Validation sets
- Choose the model that performs best on a validation set separate from the training set.
- Because we have not used the validation data at any point during training, the validation set can be considered "new data": the error on the validation set is an estimate of the generalization error.
[Figure: data split into Training | Validation.]
Model selection
- What if we want to choose among k models?
  – Train each model on the training set.
  – Compute the prediction error of each model on the validation set.
  – Pick the model with the smallest prediction error on the validation set.
- What is the generalization error?
  – We don't know! The validation data was used to select the model.
  – We have "cheated" and looked at the validation data: it is not a good proxy for new, unseen data any more.
Validation sets
- Hence we need to set aside part of the data, the test set, that remains untouched during the entire procedure, and on which we'll estimate the generalization error.
- Model selection: pick the best model.
- Model assessment: estimate its prediction error on new data.
[Figure: data split into Training | Validation | Test.]
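A minimal sketch of the full procedure (assuming scikit-learn; the simulated data, the Ridge model family, and the candidate hyperparameters are arbitrary illustrations):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=600)

# 60/20/20 split: the test set stays untouched until the very end.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Model selection: train each candidate, keep the smallest validation error.
candidates = [Ridge(alpha=a).fit(X_train, y_train) for a in (0.01, 1.0, 100.0)]
best = min(candidates, key=lambda m: mean_squared_error(y_val, m.predict(X_val)))

# Model assessment: estimate the generalization error on the untouched test set.
print("test MSE:", mean_squared_error(y_test, best.predict(X_test)))
```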
- How much data should go in each of the training, validation, and test sets?
- How do we know we have enough data to evaluate the prediction and generalization errors?
- Empirical evaluation with sample re-use:
  – cross-validation;
  – bootstrap.
- Analytical tools:
  – Mallow's Cp, AIC, BIC;
  – MDL.
Sample re-use
Cross-validation
- Cut the training set into K separate folds.
- For each fold, train on the (K-1) remaining folds and validate on the held-out fold.
[Figure: K-fold split; each row uses a different fold for validation and the remaining folds for training.]
Cross-validated performance
- Cross-validation estimate of the prediction error:
  $$\widehat{\mathrm{CV}}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \hat{f}^{-k(i)}(x_i)\big)$$
  where $\hat{f}^{-k(i)}$ is computed with the k(i)-th part of the data removed, and k(i) is the fold in which i is.
- Estimates the expected prediction error $\mathbb{E}\big[L(Y, \hat{f}(X))\big]$, where (X, Y) is an (independent) test sample.
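A sketch of this estimator (mine, using scikit-learn's KFold for the splitting; the Ridge model and simulated data are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

def cv_error(model, X, y, K=5):
    """K-fold cross-validation estimate of the expected squared prediction error."""
    losses = []
    for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        m = model.fit(X[train_idx], y[train_idx])  # train on the K-1 other folds
        losses.append((m.predict(X[val_idx]) - y[val_idx]) ** 2)
    return np.mean(np.concatenate(losses))         # average over all n points

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
print("5-fold CV error:", cv_error(Ridge(alpha=1.0), X, y))
```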
Issues with cross-validation
- Training set size becomes (K-1)n/K. Why is this a problem?
  – Small training set ⇒ biased estimator of the error.
- Leave-one-out cross-validation: K = n.
  – Approximately unbiased estimator of the expected prediction error.
  – Potentially high variance (the training sets are very similar to each other).
  – Computation can become burdensome (n repeats).
- In practice: set K = 5 or K = 10.
Bootstrap
- Randomly draw datasets with replacement from the training data.
- Repeat B times (typically, B = 100) ⇒ B models.
- Leave-one-out bootstrap error:
  – For each training point i, predict with the b_i < B models that did not have i in their training set.
  – Average the prediction errors.
- Each bootstrap training set contains, on average, 0.632·n distinct examples: a given point is left out of a sample with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$.
  ⇒ Same issue as with cross-validation: the effective training sets are smaller than n.
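A sketch of the leave-one-out bootstrap error (my implementation, squared loss; Ridge and the simulated data are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge

def loo_bootstrap_error(model, X, y, B=100, seed=0):
    """Leave-one-out bootstrap estimate of the squared prediction error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = [[] for _ in range(n)]  # per-point out-of-bag squared errors
    for _ in range(B):
        idx = rng.integers(0, n, n)            # draw n points with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # points not in this sample
        m = model.fit(X[idx], y[idx])
        for i, pred in zip(oob, m.predict(X[oob])):
            errors[i].append((pred - y[i]) ** 2)
    # for each point, average over the b_i models that did not train on it
    return np.mean([np.mean(e) for e in errors if e])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
print("LOO-bootstrap error:", loo_bootstrap_error(Ridge(alpha=1.0), X, y))
```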
Evaluating model performance
Classification model evaluation
- Confusion matrix:

                        True class: -1     True class: +1
  Predicted class: -1   True Negatives     False Negatives
  Predicted class: +1   False Positives    True Positives

- False positives (false alarms) are also called type I errors.
- False negatives (misses) are also called type II errors.
- Sensitivity = Recall = True positive rate (TPR) = TP / (TP + FN) = TP / #positives
- Specificity = True negative rate (TNR) = TN / (TN + FP) = TN / #negatives
- Precision = Positive predictive value (PPV) = TP / (TP + FP) = TP / #predicted positives
- False discovery rate (FDR) = FP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- F1-score = harmonic mean of precision and sensitivity = 2·TP / (2·TP + FP + FN)
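All of these metrics follow from the four confusion-matrix counts; a small sketch (mine, not from the slides), applied to the Pap smear counts of the next slide:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from the four confusion-matrix counts."""
    return {
        "sensitivity (recall, TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "precision (PPV)": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }

# Pap smear example: TP=190, FP=210, TN=3590, FN=10
print(classification_metrics(tp=190, fp=210, tn=3590, fn=10))
```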
Example: Pap smear
- 4,000 apparently healthy women of age 40+.
- Tested for cervical cancer through Pap smear and histology (gold standard).
- What are the sensitivity, specificity, and PPV of the test?

                  Cancer   No cancer   Total
  Positive test      190         210     400
  Negative test       10        3590    3600
  Total               200        3800    4000
- In this population:
  – Sensitivity = Recall = TPR = 190/200 = 95.0 %
  – Specificity = TNR = 3590/3800 = 94.5 %
  – Precision = PPV = 190/400 = 47.5 %
- Prevalence of the disease = 200/4000 = 0.05.
- P(cancer | positive test) = PPV = 47.5 %
- P(no cancer | negative test) = 3590/3600 = 99.7 %
- Poor diagnostic tool (less than half of the positive tests are cancers).
- Good screening tool (very few cancers are missed).
ROC curves
- ROC = Receiver Operating Characteristic.
- Plot the true positive rate (TPR) against the false positive rate (FPR) for all possible decision thresholds:
  – threshold = smallest predicted value: everything is predicted positive, TPR = FPR = 1;
  – threshold = largest predicted value: (almost) everything is predicted negative, TPR ≈ FPR ≈ 0.
- Summarized by the area under the curve (AUROC):
  – perfect classifier: AUROC = 1.0;
  – random classifier: the diagonal, AUROC = 0.5;
  – our classifier: 0.5 < AUROC < 1.0.
[Figure: ROC curves; TPR (y-axis) vs. FPR (x-axis), both from 0 to 1; random classifier on the diagonal, perfect classifier through the top-left corner.]
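A minimal numpy sketch of the curve and its area (my implementation, with ties between scores handled naively):

```python
import numpy as np

def roc_points(scores, labels):
    """TPR and FPR for all thresholds (labels in {0, 1}), highest scores first."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()            # positives caught so far
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # negatives flagged so far
    return np.r_[0.0, fpr], np.r_[0.0, tpr]           # start at the (0, 0) corner

def auroc(scores, labels):
    fpr, tpr = roc_points(scores, labels)
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoid rule

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = labels + rng.normal(size=1000)  # informative but noisy scores
print("AUROC:", auroc(scores, labels))   # between 0.5 (random) and 1.0 (perfect)
```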
Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876-885.
- Which method outperforms the others?
- Is a low FPR or a high TPR preferable in a clinical setting?
  – High recall (TPR): fewer chances to miss a case.
  – High specificity (= 1 - FPR, i.e. low FPR): fewer false alarms.
[Figure: ROC curves for the three predictors.]
Precision-Recall curves
- Plot precision (= PPV) against recall (= sensitivity = TPR) for all possible decision thresholds.
[Figure: precision (y-axis) vs. recall (x-axis), both from 0 to 1; the top-right is the good corner, the bottom-left the bad corner.]
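A companion sketch to the ROC code above (again mine, ties handled naively):

```python
import numpy as np

def pr_points(scores, labels):
    """Precision and recall for all thresholds (labels in {0, 1})."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                          # true positives so far
    precision = tp / np.arange(1, len(labels) + 1)  # / number predicted positive
    recall = tp / labels.sum()                      # / number of actual positives
    return precision, recall
```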
Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876-885.
- Which method has the highest area under the PR curve?
- Is a high recall or a high precision preferable in a clinical setting?
  – High recall: fewer chances to miss a case.
  – High precision: substantially more true diagnoses than false alarms.
[Figure: precision-recall curves for the three predictors.]
Regression model evaluation
- Counting the number of errors is not reasonable:
  – What does "error" even mean for numerical values?
  – Not all errors are created equal.
Regression model evaluation
- Residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - f(x_i))^2$
- Root mean squared error: $\mathrm{RMSE} = \sqrt{\mathrm{RSS} / n}$
- Relative squared error: $\mathrm{RSE} = \mathrm{RSS} \,/\, \sum_{i=1}^{n} (y_i - \bar{y})^2$
- Coefficient of determination: $R^2 = 1 - \mathrm{RSE}$
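The same quantities in a short numpy sketch (mine):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RSS, RMSE, relative squared error, and R² from true and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean predictor
    return {"RSS": rss,
            "RMSE": np.sqrt(rss / len(y_true)),
            "RSE": rss / tss,
            "R2": 1 - rss / tss}

print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```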
Correlation between true and predicted values
[Figure: scatter plot of predicted vs. true values.]
Analytical tools and model complexity
Optimism terms
- Augmented error = empirical error + optimism term.
  – Correct the empirical error with an optimism term: a theoretical estimate of the discrepancy between training and test error.
- For linear models with d parameters (# parameters = # non-zero coefficients), the optimism terms are proportional to:
  – Mallow's Cp: $2\, d\, \hat{\sigma}^2 / n$
  – Akaike Information Criterion (AIC): same form as Cp for the squared loss.
  – Bayesian Information Criterion (BIC): $(\log n)\, d\, \hat{\sigma}^2 / n$
  Here $\hat{\sigma}^2$ is the variance of the residuals on the train set, so $\hat{\sigma}^2 / n$ is the squared standard error of the mean of the residuals.
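A sketch of the augmented error under these assumptions (squared loss, linear model; the constants follow the forms in ESL Chap 7, and σ̂² must be supplied, e.g. estimated from a low-bias model):

```python
import numpy as np

def augmented_errors(y_true, y_pred, d, sigma2):
    """Training error plus optimism terms for a linear model with d non-zero
    coefficients; sigma2 is the residual variance estimated on the train set."""
    n = len(y_true)
    err = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return {"train error": err,
            "Cp (~AIC)": err + 2 * d / n * sigma2,          # optimism = 2 d sigma²/n
            "BIC-style": err + np.log(n) * d / n * sigma2}  # log(n) replaces 2
```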
Minimum description length (MDL)
- Shortest code to transmit a random variable z: $-\log_2 P(z)$ bits [Shannon's source coding theorem].
  – Consider a discrete variable z. In the equiprobable case, use a fixed-length code. Otherwise, use a variable-length prefix code in which frequent values get shorter codes; since no codeword is the prefix of another, codewords can be separated without a delimiter.
- Assume:
  – a parametric model $f(\cdot\,; \theta)$;
  – the receiver knows the inputs X and the model family f.
- To transmit the outputs y, we need:
  (average code length to transmit the differences between model predictions and true outputs) + (average code length to transmit θ).
- Choose the model with the smallest Kolmogorov complexity (= minimum description length).
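For instance (an arbitrary toy distribution), the optimal per-value code lengths and the resulting average length:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # arbitrary toy distribution
lengths = -np.log2(p)                     # frequent values get shorter codes
print(lengths)                            # [1. 2. 3. 3.] bits
print("average length:", np.sum(p * lengths))  # 1.75 bits = the entropy of p
```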
Summary: model selection techniques
- Empirical: estimate the quality of generalization with
  – cross-validation;
  – bootstrap.
- Theoretical:
  – Estimate the difference between train error and generalization error with an optimism term, e.g. Mallow's Cp, Akaike's / Bayesian Information Criteria.
  – Minimum description length (MDL): choose the simplest model (according to Kolmogorov complexity).
References
- A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
  – Noise: Chap 2.3
  – Overfitting: Chap 2.4
  – Bias-variance tradeoff: Chap 5.9
  – Train and test sets: Chap 2.5
  – Cross-validation: Chap 5.6
  – Performance measures: Chap 5.5
- The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
  – Overfitting: Chap 7.1
  – Bias-variance tradeoff: Chap 2.9, 7.2–7.3
  – Cross-validation: Chap 7.10
  – Bootstrap: Chap 7.11
  – Mallow's Cp, AIC, BIC: Chap 7.7
  – MDL: Chap 7.8
- Entropy encoding: http://lesswrong.com/lw/o1/entropy_and_short_codes/
References for prerequisites
- Linear algebra: http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/
- Statistics & probabilities:
  – Probability theory: A primer (Jeremy Kun). http://jeremykun.com/2013/01/04/probability-theory-a-primer/
  – Probability Primer (Jeffrey Miller). https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4
Practical matters
- Make sure you have turned in HW01.
- HW02 is online, due Oct. 9.
- HW03 is online, due Oct. 13.
- Lab