4. Model evaluation & selection - Chloé-Agathe Azencott - PowerPoint PPT Presentation

SLIDE 1

4. Model evaluation & selection

Foundations of Machine Learning, CentraleSupélec, Fall 2017
Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Practical matters

  • You should have received an email from me on Tuesday.
  • Partial solution to Lab 1 at the end of the slides of Chapter 3.
  • Pointers/refreshers re: (scientific) python:
    – http://www.scipy-lectures.org/
    – https://github.com/chagaz/ml-notebooks/ → lsml2017
  • Yes, I only put the slides online after the lecture.
SLIDE 3

Generalization

A good and useful approximation

  • It's easy to build a model that performs well on the training data.
  • But how well will it perform on new data?
  • "Predictions are hard, especially about the future" — Niels Bohr.

  – Learn models that generalize well.
  – Evaluate whether models generalize well.

SLIDE 4

Noise in the data

  • Imprecision in recording the features
  • Errors in labeling the data points (teacher noise)
  • Missing features (hidden or latent)
  • Making no errors on the training set might not be possible.

SLIDE 5

Models of increasing complexity

SLIDE 6

Noise and model complexity

  • Use simple models!
    – Easier to use (lower computational complexity)
    – Easier to train (lower space complexity)
    – Easier to explain (more interpretable)
    – Generalize better

  Occam's razor: simpler explanations are more plausible.

SLIDE 7

Overfitting

  • What are the empirical errors of the black and purple classifiers?
  • Which model seems more likely to be correct?

SLIDE 8

Overfitting & Underfitting (Regression)

[Figure: regression fits illustrating underfitting and overfitting]

SLIDE 9

Generalization error vs. model complexity

[Figure: prediction error vs. model complexity, on training data and on new data; underfitting at low complexity, overfitting at high complexity]

SLIDE 10

Bias-variance tradeoff

  • Bias: difference between the expected value of the estimator and the true value being estimated.
    – A simpler model has a higher bias.
    – High bias can cause underfitting.
  • Variance: deviation from the expected value of the estimates.
    – A more complex model has a higher variance.
    – High variance can cause overfitting.

SLIDE 11

Bias-variance decomposition

  • Mean squared error of an estimator $\hat\theta$ of $\theta$: $\mathrm{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big]$
  • Proof?
SLIDE 12

Bias-variance decomposition

  • Mean squared error: $\mathbb{E}\big[(\hat\theta - \theta)^2\big] = \underbrace{(\mathbb{E}[\hat\theta] - \theta)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\text{variance}}$

  (x and y are deterministic.)
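A short proof, writing $\mu = \mathbb{E}[\hat\theta]$ and adding and subtracting $\mu$ inside the square; the cross term vanishes since $\mathbb{E}[\hat\theta - \mu] = 0$:

    \begin{align*}
    \mathbb{E}\big[(\hat\theta - \theta)^2\big]
      &= \mathbb{E}\big[(\hat\theta - \mu + \mu - \theta)^2\big] \\
      &= \mathbb{E}\big[(\hat\theta - \mu)^2\big]
         + 2(\mu - \theta)\,\mathbb{E}\big[\hat\theta - \mu\big]
         + (\mu - \theta)^2 \\
      &= \underbrace{\mathbb{E}\big[(\hat\theta - \mu)^2\big]}_{\text{variance}}
         + \underbrace{(\mu - \theta)^2}_{\text{bias}^2}
    \end{align*}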

SLIDE 13

Generalization error vs. model complexity

[Figure: prediction error vs. model complexity, on training data and on new data; high bias / low variance at low complexity, low bias / high variance at high complexity]

SLIDE 14

Model selection & generalization

  • Well-posed problems:
    – a solution exists;
    – it is unique;
    – the solution changes continuously with the initial conditions.
  • Learning is an ill-posed problem: data helps carve out the hypothesis space, but data is not sufficient to find a unique solution.
  • Need for inductive bias: assumptions about the hypothesis space. Model selection: choose the "right" inductive bias.

  (Hadamard, on the mathematical modelling of physical phenomena.)

SLIDE 15

How do we decide a model is good?

SLIDE 16

Learning objectives

After this lecture you should be able to design experiments to select and evaluate supervised machine learning models. Concepts:

  • training and testing sets;
  • cross-validation;
  • bootstrap;
  • measures of performance for classifiers and regressors;
  • measures of model complexity.
SLIDE 17

Supervised learning setting

  • Training set: $\mathcal{D} = \{(x^i, y^i)\}_{i=1,\dots,n}$
  • Classification: $y^i \in \{0, 1\}$
  • Regression: $y^i \in \mathbb{R}$
  • Goal: find $f : \mathcal{X} \to \mathcal{Y}$ such that $f(x^i) \approx y^i$
  • Empirical error of f on the training set, given a loss L:
    $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(y^i, f(x^i)\big)$
    – E.g. (classification) the 0/1 loss: $L(y, f(x)) = \mathbb{1}_{y \neq f(x)}$
    – E.g. (regression) the squared loss: $L(y, f(x)) = (y - f(x))^2$
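A minimal numpy sketch of the empirical error under both losses; the toy arrays y_true and y_pred are illustrative, not from the slides:

    import numpy as np

    def empirical_error(y_true, y_pred, loss):
        """Average loss of the predictions against the true labels."""
        return np.mean([loss(yt, yp) for yt, yp in zip(y_true, y_pred)])

    zero_one = lambda y, fx: float(y != fx)   # 0/1 loss (classification)
    squared = lambda y, fx: (y - fx) ** 2     # squared loss (regression)

    y_true = np.array([0, 1, 1, 0])
    y_pred = np.array([0, 1, 0, 0])
    print(empirical_error(y_true, y_pred, zero_one))  # 0.25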


SLIDE 24

Generalization error

  • The empirical error on the training set is a poor estimate of the generalization error (expected error on new data).
  • If the model is overfitting, the generalization error can be arbitrarily large.
  • We would like to estimate the generalization error on new data, which we do not have.
SLIDE 25

Validation sets

  • Choose the model that performs best on a validation set separate from the training set.
  • Because we have not used the validation data at any point during training, the validation set can be considered "new data", and the error on the validation set is an estimate of the generalization error.

[Diagram: data split into Training | Validation]

SLIDE 26

Model selection

  • What if we want to choose among k models?
    – Train each model on the training set.
    – Compute the prediction error of each model on the validation set.
    – Pick the model with the smallest prediction error on the validation set.
  • What is the generalization error?
    – We don't know!
    – Validation data was used to select the model.
    – We have "cheated" and looked at the validation data: it is not a good proxy for new, unseen data any more.

SLIDE 27

Validation sets

  • Hence we need to set aside part of the data, the test set, which remains untouched during the entire procedure and on which we'll estimate the generalization error.
  • Model selection: pick the best model.
  • Model assessment: estimate its prediction error on new data.

[Diagram: data split into Training | Validation | Test]
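A sketch of the full procedure with scikit-learn; the toy data, the k-NN candidates, and the 60/20/20 split are assumptions for illustration:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X, y = rng.randn(300, 5), rng.randint(0, 2, 300)  # toy data

    # 60% training / 20% validation / 20% test
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

    # Model selection: smallest validation error among k candidates
    candidates = [KNeighborsClassifier(n_neighbors=k) for k in (1, 5, 15)]
    val_err = [1 - m.fit(X_tr, y_tr).score(X_val, y_val) for m in candidates]
    best = candidates[int(np.argmin(val_err))]

    # Model assessment: error on the untouched test set
    test_err = 1 - best.score(X_te, y_te)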

SLIDE 28

  • How much data should go in each of the training, validation and test sets?
  • How do we know we have enough data to evaluate the prediction and generalization errors?
  • Empirical evaluation with sample re-use:
    – cross-validation
    – bootstrap
  • Analytical tools:
    – Mallow's Cp, AIC, BIC
    – MDL.

SLIDE 29

Sample re-use

SLIDE 30

Cross-validation

  • Cut the training set into k separate folds.
  • For each fold, train on the (k−1) remaining folds.

[Diagram: k train/validation splits; each fold serves once as the validation set]

SLIDE 31

Cross-validated performance

  • Cross-validation estimate of the prediction error:
    $\mathrm{CV}(\hat f) = \frac{1}{n} \sum_{i=1}^{n} L\big(y^i, \hat f^{-k(i)}(x^i)\big)$
    where $\hat f^{-k(i)}$ is computed with the k(i)-th part of the data removed, and k(i) is the fold in which i is.
  • Estimates the expected prediction error $\mathbb{E}\big[L(Y, \hat f(X))\big]$, where (Y, X) is an (independent) test sample.
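A minimal scikit-learn sketch of this estimate; the toy data and logistic regression are illustrative choices:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 3), rng.randint(0, 2, 100)  # toy data

    fold_errors = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = LogisticRegression().fit(X[train_idx], y[train_idx])
        # 0/1 loss on the held-out fold: the points i with k(i) = this fold
        fold_errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))

    cv_error = np.mean(fold_errors)  # CV estimate of the prediction error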

SLIDE 32

Issues with cross-validation

  • Training set size becomes (K−1)n/K.
  • Why is this a problem?

SLIDE 33

Issues with cross-validation

  • Training set size becomes (K−1)n/K
    – small training set ⇒ biased estimator of the error
  • Leave-one-out cross-validation: K = n
    – approximately unbiased estimator of the expected prediction error
    – potentially high variance (the training sets are very similar to each other)
    – computation can become burdensome (n repeats)
  • In practice: set K = 5 or K = 10.
SLIDE 34

Bootstrap

  • Randomly draw datasets with replacement from the training data.
  • Repeat B times (typically, B = 100) ⇒ B models.
  • Leave-one-out bootstrap error:
    – For each training point i, predict with the b_i < B models that did not have i in their training set.
    – Average prediction errors.
  • Each training set contains ?
SLIDE 35

Bootstrap

  • Randomly draw datasets with replacement from the training data.
  • Repeat B times (typically, B = 100) ⇒ B models.
  • Leave-one-out bootstrap error:
    – For each training point i, predict with the b_i < B models that did not have i in their training set.
    – Average prediction errors.
  • Each training set contains about 0.632·n distinct examples: the probability that a given point is never drawn in n draws with replacement is (1 − 1/n)^n ≈ e^{−1} ≈ 0.368.
    ⇒ same issue as with cross-validation.
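A sketch of the leave-one-out bootstrap error; the classifier and toy data are placeholders:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 3), rng.randint(0, 2, 100)  # toy data
    n, B = len(X), 100

    errors_per_point = [[] for _ in range(n)]
    for b in range(B):
        idx = rng.randint(0, n, n)  # n draws with replacement (~0.632 n distinct)
        model = KNeighborsClassifier().fit(X[idx], y[idx])
        out_of_bag = np.setdiff1d(np.arange(n), idx)  # points i not in this sample
        for i in out_of_bag:
            errors_per_point[i].append(model.predict(X[i:i + 1])[0] != y[i])

    # average over points left out by at least one bootstrap sample
    loo_bootstrap_error = np.mean([np.mean(e) for e in errors_per_point if e])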

SLIDE 36

Evaluating model performance

SLIDE 37

Classification model evaluation

  • Confusion matrix:

                          True class
                          −1                 +1
    Predicted  −1   True Negatives     False Negatives
    class      +1   False Positives    True Positives

  • False positives (false alarms) are also called type I errors.
  • False negatives (misses) are also called type II errors.
SLIDE 38

  • Sensitivity = Recall = True positive rate (TPR) = TP / (TP + FN)
  • Specificity = True negative rate (TNR) = TN / (TN + FP)
  • Precision = Positive predictive value (PPV) = TP / (TP + FP)
  • False discovery rate (FDR) = FP / (TP + FP)

SLIDE 39

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • F1-score = harmonic mean of precision and sensitivity = 2 · Precision · Recall / (Precision + Recall)
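All of these quantities follow from the confusion matrix counts; a minimal numpy sketch, with illustrative toy label vectors:

    import numpy as np

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    sensitivity = tp / (tp + fn)                 # recall, TPR
    specificity = tn / (tn + fp)                 # TNR
    precision = tp / (tp + fp)                   # PPV
    fdr = fp / (tp + fp)                         # false discovery rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)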

SLIDE 40

Example: Pap smear

  • 4,000 apparently healthy women of age 40+
  • Tested for cervical cancer through pap smear and histology (gold standard)
  • What are the sensitivity, specificity, and PPV of the test?

                    Cancer   No cancer   Total
    Positive test      190         210     400
    Negative test       10        3590    3600
    Total              200        3800    4000

SLIDE 41

  • Sensitivity = Recall = True positive rate (TPR)
  • Specificity = True negative rate (TNR)
  • Precision = Positive predictive value (PPV)

                    Cancer   No cancer   Total
    Positive test      190         210     400
    Negative test       10        3590    3600
    Total              200        3800    4000

SLIDE 42

  • In this population:
    Sensitivity = 190/200 = 95.0 %
    Specificity = 3590/3800 ≈ 94.5 %
    PPV = 190/400 = 47.5 %
  • Prevalence of the disease = 200/4000 = 0.05
  • P(cancer | positive test) = PPV = 47.5 %
  • P(no cancer | negative test) = 3590/3600 ≈ 99.7 %
  • Poor diagnostic tool
  • Good screening tool
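The same numbers, checked directly from the counts in the table:

    tp, fp, fn, tn = 190, 210, 10, 3590  # counts from the table above

    print(tp / (tp + fn))  # sensitivity: 190/200   = 0.95
    print(tn / (tn + fp))  # specificity: 3590/3800 ~ 0.945
    print(tp / (tp + fp))  # PPV:         190/400   = 0.475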

SLIDE 43

ROC curves

  • ROC = Receiver Operating Characteristic.
  • Summarized by the area under the curve (AUROC).
  • Plot TPR vs. FPR for all possible thresholds.

[Plot: ROC curve, true positive rate vs. false positive rate]

  threshold = ?

SLIDE 44

ROC curves

  • ROC = Receiver Operating Characteristic.
  • Summarized by the area under the curve (AUROC).
  • Plot TPR vs. FPR for all possible thresholds.

[Plot: ROC curve, true positive rate vs. false positive rate]

  threshold = smallest predicted value. threshold = ?

SLIDE 45

ROC curves

  • ROC = Receiver Operating Characteristic.
  • Summarized by the area under the curve (AUROC).
  • Plot TPR vs. FPR for all possible thresholds.

[Plot: ROC curve, true positive rate vs. false positive rate]

  threshold = smallest predicted value. threshold = largest predicted value.

  What is the ROC curve of:
  • a random classifier?
  • a perfect classifier?
SLIDE 46

ROC curves

  • ROC = Receiver Operating Characteristic.
  • Summarized by the area under the curve (AUROC).
  • Plot TPR vs. FPR for all possible thresholds.

[Plot: ROC curves of a random classifier and a perfect classifier]

  • Perfect classifier: AUROC = 1.0
  • Random classifier: AUROC = 0.5
  • Our classifier: 0.5 < AUROC < 1.0
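A minimal scikit-learn sketch that traces the curve over all thresholds and computes the AUROC; the toy data and logistic regression are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)  # toy labels

    scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    fpr, tpr, thresholds = roc_curve(y, scores)  # one (FPR, TPR) point per threshold
    auroc = roc_auc_score(y, scores)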

SLIDE 47

Predicting breast cancer risk based on mammography images, SNPs, or both.

Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876–885.

  • Which method outperforms the others?
  • Is a low FPR or a high TPR preferable in a clinical setting?

  (Specificity = 1 − FPR)

SLIDE 48

Predicting breast cancer risk based on mammography images, SNPs, or both.

Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876–885.

  High recall = fewer chances to miss a case.
  High specificity / low FPR = fewer false alarms.

  (Specificity = 1 − FPR)

SLIDE 49

Precision-Recall curves

[Plot: precision vs. recall, with a "good corner" and a "bad corner" marked]

  Sensitivity = Recall = True positive rate (TPR)
  Precision = Positive predictive value (PPV)
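The analogous scikit-learn sketch for the PR curve, with the same illustrative toy data as the ROC sketch; average precision is a common summary of the area under it:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, average_precision_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)  # toy labels

    scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y, scores)
    auprc = average_precision_score(y, scores)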

SLIDE 50

Predicting breast cancer risk based on mammography images, SNPs, or both.

Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876–885.

  • Which method has the highest area under the PR curve?
  • Is a high recall or a high precision preferable in a clinical setting?

  Sensitivity = Recall = True positive rate (TPR)
  Precision = Positive predictive value (PPV)

SLIDE 51

Predicting breast cancer risk based on mammography images, SNPs, or both.

Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876–885.

  High recall = fewer chances to miss a case.
  High precision = substantially more true diagnoses than false alarms.

  Sensitivity = Recall = True positive rate (TPR)
  Precision = Positive predictive value (PPV)

SLIDE 52

Regression model evaluation

  • Counting the number of errors is not reasonable. Why?
SLIDE 53

Regression model evaluation

  • Counting the number of errors is not reasonable:
    – What does an error even mean for numerical values?
    – Not all errors are created equal.

SLIDE 54

Regression model evaluation

  • Residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^{n} (y^i - f(x^i))^2$
  • Root-mean squared error: $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y^i - f(x^i))^2}$
  • Relative squared error: $\mathrm{RSE} = \frac{\sum_{i=1}^{n} (y^i - f(x^i))^2}{\sum_{i=1}^{n} (y^i - \bar{y})^2}$
  • Coefficient of determination: $R^2 = 1 - \mathrm{RSE}$
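A minimal numpy sketch computing the four quantities; the toy vectors are illustrative:

    import numpy as np

    y_true = np.array([3.0, 1.5, 2.2, 4.1])
    y_pred = np.array([2.8, 1.9, 2.0, 3.7])

    rss = np.sum((y_true - y_pred) ** 2)               # residual sum of squares
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))    # root-mean squared error
    rse = rss / np.sum((y_true - y_true.mean()) ** 2)  # relative squared error
    r2 = 1 - rse                                       # coefficient of determination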
SLIDE 55

Correlation between true and predicted values

SLIDE 56

Analytical tools and model complexity

SLIDE 57

Optimism terms

  Augmented error = empirical error + optimism term
  – Correct the empirical error with an optimism term.
  – Theoretical estimate of the discrepancy between training and test error.

  • For linear models, optimism terms proportional to:
    – Mallow's Cp
    – Akaike Information Criterion (AIC)
    – Bayesian Information Criterion (BIC)

  where the terms involve:
  – $\hat\sigma^2$: variance of the residuals on the training set
  – $\hat\sigma^2 / n$: squared standard error of the mean of the residuals
  – d: # parameters = # non-zero coefficients
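The slide's exact formulas did not survive extraction; as a sketch, the common Gaussian-likelihood textbook forms (see The Elements of Statistical Learning, Chap 7) can be computed as follows, with d the number of non-zero coefficients:

    import numpy as np

    def aic_bic(y_true, y_pred, d):
        """Gaussian-likelihood AIC and BIC for a model with d parameters."""
        n = len(y_true)
        rss = np.sum((y_true - y_pred) ** 2)
        # maximized log-likelihood under Gaussian noise, sigma^2 = RSS/n
        loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
        return -2 * loglik + 2 * d, -2 * loglik + np.log(n) * d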

SLIDE 58

Minimum description length (MDL)

  Consider a discrete variable z:
  – Equiprobable case: use a fixed-length code.
  – Otherwise: use a variable-length prefix code in which frequent values get shorter codes (the prefix separates codes).

  • Shortest code to transmit a random variable z: $-\log_2 P(z)$ bits [Shannon's source coding theorem]
  • Assume:
    – a parametric model $f_\theta$;
    – the receiver knows the inputs X and the model family f.
  • To transmit the outputs y, we need: the average code length to transmit the difference between the model predictions and the true outputs, plus the average code length to transmit θ.
  • Choose the model with the smallest Kolmogorov complexity (= MDL).
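A tiny numpy illustration of Shannon's optimal code lengths; the distribution is illustrative. Frequent values get short codes, and the expected length equals the entropy:

    import numpy as np

    p = np.array([0.5, 0.25, 0.125, 0.125])     # distribution of a discrete z
    code_lengths = -np.log2(p)                  # optimal lengths: 1, 2, 3, 3 bits
    expected_length = np.sum(p * code_lengths)  # = entropy = 1.75 bits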


SLIDE 60

Summary: model selection techniques

  • Empirical: estimate the quality of generalization with
    – cross-validation
    – bootstrap
  • Theoretical:
    – Estimate the difference between train error and generalization error with an optimism term, e.g. Mallow's Cp, Akaike's / Bayesian Information Criteria.
    – Minimum description length (MDL): choose the simplest model (according to Kolmogorov complexity).

SLIDE 61

References

  • A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
    – Noise: Chap 2.3
    – Overfitting: Chap 2.4
    – Bias-variance tradeoff: Chap 5.9
    – Train and test sets: Chap 2.5
    – Cross-validation: Chap 5.6
    – Performance measures: Chap 5.5
  • The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
    – Overfitting: Chap 7.1
    – Bias-variance tradeoff: Chap 2.9, 7.2–7.3
    – Cross-validation: Chap 7.10
    – Bootstrap: Chap 7.11
    – Mallow's Cp, AIC, BIC: Chap 7.7
    – MDL: Chap 7.8
  • Entropy encoding: http://lesswrong.com/lw/o1/entropy_and_short_codes/

SLIDE 62

References for prerequisites

  • Linear algebra:
    http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/
  • Statistics & probabilities:
    – Probability theory: A primer (Jeremy Kun)
      http://jeremykun.com/2013/01/04/probability-theory-a-primer/
    – Probability Primer (Jeffrey Miller)
      https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4

SLIDE 63

Practical matters

  • Make sure you have turned in HW01.
  • HW02 is online, due Oct. 9.
  • HW03 is online, due Oct. 13.
  • Lab: https://github.com/chagaz/ma2823_2017

SLIDE 64

Lab 2 – pointers

SLIDE 65

Minimization with Newton's method
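The slide content was not extracted; as a minimal sketch under the assumption that the lab minimizes a twice-differentiable function of one variable, Newton's method iterates x ← x − f′(x)/f″(x). The example function is illustrative:

    def newton_minimize(grad, hess, x0, n_iter=20):
        """Newton's method: converges to a stationary point of f."""
        x = x0
        for _ in range(n_iter):
            x = x - grad(x) / hess(x)
        return x

    # Example: f(x) = x^4 - 3x^2 + 2, started where f'' > 0
    x_star = newton_minimize(grad=lambda x: 4 * x**3 - 6 * x,
                             hess=lambda x: 12 * x**2 - 6,
                             x0=2.0)  # converges to sqrt(3/2) ~ 1.22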