Foundations of Machine Learning
CentraleSupélec, Fall 2017
4. Model evaluation & selection
Chloé-Agathe Azencott, Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
Practical matters
- You should have received an email from me on Tuesday.
- Partial solution to Lab 1 at the end of the slides of Chapter 3.
- Pointers/refreshers re: (scientific) Python:
  – http://www.scipy-lectures.org/
  – https://github.com/chagaz/ml-notebooks/ → lsml2017
- Yes, I only put the slides online after the lecture.
Generalization

A good and useful approximation
- It is easy to build a model that performs well on the training data.
- But how well will it perform on new data?
- "Predictions are hard, especially about the future." (Niels Bohr)
- Hence we want to:
  – learn models that generalize well;
  – evaluate whether models generalize well.
Noise in the data
- Imprecision in recording the features.
- Errors in labeling the data points (teacher noise).
- Missing features (hidden or latent variables).
- Making no errors on the training set might not be possible.
Models of increasing complexity
[Figure: models of increasing complexity fit to the same data.]
Noise and model complexity
- Use simple models!
  – Easier to use: lower computational complexity.
  – Easier to train: lower space complexity.
  – Easier to explain: more interpretable.
  – Generalize better.
- Occam's razor: simpler explanations are more plausible.
Overfitting
- What are the empirical errors of the black and purple classifiers?
- Which model seems more likely to be correct?
Overfitting & Underfitting (Regression)
[Figure: regression fits illustrating underfitting and overfitting.]
Generalization error vs. model complexity
[Figure: prediction error vs. model complexity. The error on training data decreases with complexity; the error on new data is high in the underfitting regime, reaches a minimum, then rises again in the overfitting regime.]
Bias-variance tradeoff
- Bias: difference between the expected value of the estimator and the true value being estimated.
  – A simpler model has a higher bias.
  – High bias can cause underfitting.
- Variance: deviation of the estimates from their expected value.
  – A more complex model has a higher variance.
  – High variance can cause overfitting.
Bias-variance decomposition
- Mean squared error of an estimator $\hat{\theta}$ of $\theta$: $\mathrm{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2]$
- Decomposition:
  $$\mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}$$
- Proof: expand $(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2$; the cross-term vanishes because $\theta$ and y are deterministic, so $\mathbb{E}[\hat{\theta}] - \theta$ is a constant.
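The decomposition can be checked numerically. Below is a minimal numpy sketch (not from the slides): it repeatedly fits polynomials of increasing degree to fresh samples and estimates the bias² and variance of the prediction at one fixed point. The target function, noise level, and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)  # arbitrary deterministic target
x0 = 0.3                                  # point at which we study the estimator
n, n_repeats = 30, 500

for degree in (1, 3, 9):
    preds = np.empty(n_repeats)
    for b in range(n_repeats):
        x = rng.uniform(0, 1, n)                  # fresh training sample
        y = f_true(x) + rng.normal(0, 0.3, n)     # with observation noise
        preds[b] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2 = (preds.mean() - f_true(x0)) ** 2      # (E[f^(x0)] - f(x0))^2
    var = preds.var()                             # E[(f^(x0) - E[f^(x0)])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The simple model (degree 1) shows a large bias and a small variance; the complex model (degree 9) the opposite.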
Generalization error vs. model complexity
[Figure: prediction error vs. model complexity, on training data and on new data. Simple models: high bias, low variance. Complex models: low bias, high variance.]
Model selection & generalization
- Well-posed problems (Hadamard, on the mathematical modeling of physical phenomena):
  – a solution exists;
  – it is unique;
  – the solution changes continuously with the initial conditions.
- Learning is an ill-posed problem: data helps carve out the hypothesis space, but data alone is not sufficient to find a unique solution.
- Need for an inductive bias: assumptions about the hypothesis space.
  – Model selection: choose the "right" inductive bias.
How do we decide a model is good?
Learning objectives
After this lecture you should be able to design experiments to select and evaluate supervised machine learning models. Concepts:
- training and testing sets;
- cross-validation;
- bootstrap;
- measures of performance for classifiers and regressors;
- measures of model complexity.
Supervised learning setting
- Training set: $\mathcal{D} = \{(x_i, y_i)\}_{i=1,\dots,n}$, with $x_i \in \mathcal{X}$.
- Classification: $y_i \in \{0, 1\}$.
- Regression: $y_i \in \mathbb{R}$.
- Goal: find $f: \mathcal{X} \rightarrow \mathcal{Y}$ such that $f(x_i) \approx y_i$.
- Empirical error of f on the training set, given a loss L:
  $$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)$$
  – E.g. 0/1 loss (classification): $L(f(x), y) = \mathbb{1}[f(x) \neq y]$
  – E.g. squared loss (regression): $L(f(x), y) = (f(x) - y)^2$
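As an illustration (my sketch, not from the slides): the empirical error is just the average loss over the training set, and only the loss function changes between classification and regression.

```python
import numpy as np

def empirical_error(predictions, y, loss):
    """Average loss of a predictor over the n training points."""
    return np.mean([loss(p, yi) for p, yi in zip(predictions, y)])

zero_one_loss = lambda y_pred, y_true: float(y_pred != y_true)  # classification
squared_loss = lambda y_pred, y_true: (y_pred - y_true) ** 2    # regression

print(empirical_error([1, 0, 1, 1], [1, 1, 1, 0], zero_one_loss))  # 0.5
print(empirical_error([1.2, 0.8], [1.0, 1.0], squared_loss))       # 0.04
```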
Generalization error
- The empirical error on the training set is a poor estimate of the generalization error (the expected error on new data).
  – If the model is overfitting, the generalization error can be arbitrarily large.
- We would like to estimate the generalization error on new data, which we do not have.
Validation sets
- Choose the model that performs best on a validation set separate from the training set.
- Because we have not used the validation data at any point during training, the validation set can be considered "new data": the error on the validation set is an estimate of the generalization error.
[Figure: data split into Training | Validation.]
Model selection
- What if we want to choose among k models?
  – Train each model on the training set.
  – Compute the prediction error of each model on the validation set.
  – Pick the model with the smallest prediction error on the validation set.
- What is the generalization error?
  – We don't know! The validation data was used to select the model.
  – We have "cheated" and looked at the validation data: it is not a good proxy for new, unseen data any more.
Validation sets
- Hence we need to set aside part of the data, the test set, that remains untouched during the entire procedure, and on which we'll estimate the generalization error.
- Model selection: pick the best model.
- Model assessment: estimate its prediction error on new data.
[Figure: data split into Training | Validation | Test.]
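A minimal sketch of the full procedure (assuming scikit-learn; the simulated data, the Ridge model family, and the candidate hyperparameters are arbitrary illustrations):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=600)

# 60/20/20 split: the test set stays untouched until the very end.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Model selection: train each candidate, keep the smallest validation error.
candidates = [Ridge(alpha=a).fit(X_train, y_train) for a in (0.01, 1.0, 100.0)]
best = min(candidates, key=lambda m: mean_squared_error(y_val, m.predict(X_val)))

# Model assessment: estimate the generalization error on the untouched test set.
print("test MSE:", mean_squared_error(y_test, best.predict(X_test)))
```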
- How much data should go in each of the training, validation, and test sets?
- How do we know we have enough data to evaluate the prediction and generalization errors?
- Empirical evaluation with sample re-use:
  – cross-validation;
  – bootstrap.
- Analytical tools:
  – Mallow's Cp, AIC, BIC;
  – MDL.
Sample re-use
Cross-validation
- Cut the training set into K separate folds.
- For each fold, train on the (K-1) remaining folds and validate on the held-out fold.
[Figure: K-fold split; each row uses a different fold for validation and the remaining folds for training.]
Cross-validated performance
- Cross-validation estimate of the prediction error:
  $$\widehat{\mathrm{CV}}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \hat{f}^{-k(i)}(x_i)\big)$$
  where $\hat{f}^{-k(i)}$ is computed with the k(i)-th part of the data removed, and k(i) is the fold in which i is.
- Estimates the expected prediction error $\mathbb{E}\big[L(Y, \hat{f}(X))\big]$, where (X, Y) is an (independent) test sample.
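A sketch of this estimator (mine, using scikit-learn's KFold for the splitting; the Ridge model and simulated data are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

def cv_error(model, X, y, K=5):
    """K-fold cross-validation estimate of the expected squared prediction error."""
    losses = []
    for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
        m = model.fit(X[train_idx], y[train_idx])  # train on the K-1 other folds
        losses.append((m.predict(X[val_idx]) - y[val_idx]) ** 2)
    return np.mean(np.concatenate(losses))         # average over all n points

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
print("5-fold CV error:", cv_error(Ridge(alpha=1.0), X, y))
```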
Issues with cross-validation
- Training set size becomes (K-1)n/K. Why is this a problem?
  – Small training set ⇒ biased estimator of the error.
- Leave-one-out cross-validation: K = n.
  – Approximately unbiased estimator of the expected prediction error.
  – Potentially high variance (the training sets are very similar to each other).
  – Computation can become burdensome (n repeats).
- In practice: set K = 5 or K = 10.
Bootstrap
- Randomly draw datasets with replacement from the training data.
- Repeat B times (typically, B = 100) ⇒ B models.
- Leave-one-out bootstrap error:
  – For each training point i, predict with the b_i < B models that did not have i in their training set.
  – Average the prediction errors.
- Each bootstrap training set contains, on average, 0.632·n distinct examples: a given point is left out of a sample with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$.
  ⇒ Same issue as with cross-validation: the effective training sets are smaller than n.
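A sketch of the leave-one-out bootstrap error (my implementation, squared loss; Ridge and the simulated data are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import Ridge

def loo_bootstrap_error(model, X, y, B=100, seed=0):
    """Leave-one-out bootstrap estimate of the squared prediction error."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = [[] for _ in range(n)]  # per-point out-of-bag squared errors
    for _ in range(B):
        idx = rng.integers(0, n, n)            # draw n points with replacement
        oob = np.setdiff1d(np.arange(n), idx)  # points not in this sample
        m = model.fit(X[idx], y[idx])
        for i, pred in zip(oob, m.predict(X[oob])):
            errors[i].append((pred - y[i]) ** 2)
    # for each point, average over the b_i models that did not train on it
    return np.mean([np.mean(e) for e in errors if e])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)
print("LOO-bootstrap error:", loo_bootstrap_error(Ridge(alpha=1.0), X, y))
```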
Evaluating model performance
Classification model evaluation
- Confusion matrix:

                        True class: -1     True class: +1
  Predicted class: -1   True Negatives     False Negatives
  Predicted class: +1   False Positives    True Positives

- False positives (false alarms) are also called type I errors.
- False negatives (misses) are also called type II errors.
- Sensitivity = Recall = True positive rate (TPR) = TP / (TP + FN) = TP / #positives
- Specificity = True negative rate (TNR) = TN / (TN + FP) = TN / #negatives
- Precision = Positive predictive value (PPV) = TP / (TP + FP) = TP / #predicted positives
- False discovery rate (FDR) = FP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- F1-score = harmonic mean of precision and sensitivity = 2·TP / (2·TP + FP + FN)
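All of these metrics follow from the four confusion-matrix counts; a small sketch (mine, not from the slides), applied to the Pap smear counts of the next slide:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from the four confusion-matrix counts."""
    return {
        "sensitivity (recall, TPR)": tp / (tp + fn),
        "specificity (TNR)": tn / (tn + fp),
        "precision (PPV)": tp / (tp + fp),
        "FDR": fp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),  # harmonic mean of precision and recall
    }

# Pap smear example: TP=190, FP=210, TN=3590, FN=10
print(classification_metrics(tp=190, fp=210, tn=3590, fn=10))
```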
Example: Pap smear
- 4,000 apparently healthy women of age 40+.
- Tested for cervical cancer through Pap smear and histology (gold standard).
- What are the sensitivity, specificity, and PPV of the test?

                  Cancer   No cancer   Total
  Positive test      190         210     400
  Negative test       10        3590    3600
  Total               200        3800    4000
- In this population:
  – Sensitivity = Recall = TPR = 190/200 = 95.0 %
  – Specificity = TNR = 3590/3800 = 94.5 %
  – Precision = PPV = 190/400 = 47.5 %
- Prevalence of the disease = 200/4000 = 0.05.
- P(cancer | positive test) = PPV = 47.5 %
- P(no cancer | negative test) = 3590/3600 = 99.7 %
- Poor diagnostic tool (less than half of the positive tests are cancers).
- Good screening tool (very few cancers are missed).
ROC curves
- ROC = Receiver Operating Characteristic.
- Plot the true positive rate (TPR) against the false positive rate (FPR) for all possible decision thresholds:
  – threshold = smallest predicted value: everything is predicted positive, TPR = FPR = 1;
  – threshold = largest predicted value: (almost) everything is predicted negative, TPR ≈ FPR ≈ 0.
- Summarized by the area under the curve (AUROC):
  – perfect classifier: AUROC = 1.0;
  – random classifier: the diagonal, AUROC = 0.5;
  – our classifier: 0.5 < AUROC < 1.0.
[Figure: ROC curves; TPR (y-axis) vs. FPR (x-axis), both from 0 to 1; random classifier on the diagonal, perfect classifier through the top-left corner.]
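A minimal numpy sketch of the curve and its area (my implementation, with ties between scores handled naively):

```python
import numpy as np

def roc_points(scores, labels):
    """TPR and FPR for all thresholds (labels in {0, 1}), highest scores first."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels) / labels.sum()            # positives caught so far
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()  # negatives flagged so far
    return np.r_[0.0, fpr], np.r_[0.0, tpr]           # start at the (0, 0) corner

def auroc(scores, labels):
    fpr, tpr = roc_points(scores, labels)
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)  # trapezoid rule

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
scores = labels + rng.normal(size=1000)  # informative but noisy scores
print("AUROC:", auroc(scores, labels))   # between 0.5 (random) and 1.0 (perfect)
```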
Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876-885.
- Which method outperforms the others?
- Is a low FPR or a high TPR preferable in a clinical setting?
  – High recall (TPR): fewer chances to miss a case.
  – High specificity (= 1 - FPR, i.e. low FPR): fewer false alarms.
[Figure: ROC curves for the three predictors.]
Precision-Recall curves
- Plot precision (= PPV) against recall (= sensitivity = TPR) for all possible decision thresholds.
[Figure: precision (y-axis) vs. recall (x-axis), both from 0 to 1; the top-right is the good corner, the bottom-left the bad corner.]
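A companion sketch to the ROC code above (again mine, ties handled naively):

```python
import numpy as np

def pr_points(scores, labels):
    """Precision and recall for all thresholds (labels in {0, 1})."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                          # true positives so far
    precision = tp / np.arange(1, len(labels) + 1)  # / number predicted positive
    recall = tp / labels.sum()                      # / number of actual positives
    return precision, recall
```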
Predicting breast cancer risk based on mammography images, SNPs, or both.
Liu J, Page D, Nassif H, et al. (2013). Genetic Variants Improve Breast Cancer Risk Prediction on Mammograms. AMIA Annual Symposium Proceedings, 876-885.
- Which method has the highest area under the PR curve?
- Is a high recall or a high precision preferable in a clinical setting?
  – High recall: fewer chances to miss a case.
  – High precision: substantially more true diagnoses than false alarms.
[Figure: precision-recall curves for the three predictors.]
Regression model evaluation
- Counting the number of errors is not reasonable:
  – What does "error" even mean for numerical values?
  – Not all errors are created equal.
Regression model evaluation
- Residual sum of squares: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - f(x_i))^2$
- Root mean squared error: $\mathrm{RMSE} = \sqrt{\mathrm{RSS} / n}$
- Relative squared error: $\mathrm{RSE} = \mathrm{RSS} \,/\, \sum_{i=1}^{n} (y_i - \bar{y})^2$
- Coefficient of determination: $R^2 = 1 - \mathrm{RSE}$
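The same quantities in a short numpy sketch (mine):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RSS, RMSE, relative squared error, and R² from true and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean predictor
    return {"RSS": rss,
            "RMSE": np.sqrt(rss / len(y_true)),
            "RSE": rss / tss,
            "R2": 1 - rss / tss}

print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```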
Correlation between true and predicted values
[Figure: scatter plot of predicted vs. true values.]
Analytical tools and model complexity
Optimism terms
- Augmented error = empirical error + optimism term.
  – Correct the empirical error with an optimism term: a theoretical estimate of the discrepancy between training and test error.
- For linear models with d parameters (# parameters = # non-zero coefficients), the optimism terms are proportional to:
  – Mallow's Cp: $2\, d\, \hat{\sigma}^2 / n$
  – Akaike Information Criterion (AIC): same form as Cp for the squared loss.
  – Bayesian Information Criterion (BIC): $(\log n)\, d\, \hat{\sigma}^2 / n$
  Here $\hat{\sigma}^2$ is the variance of the residuals on the train set, so $\hat{\sigma}^2 / n$ is the squared standard error of the mean of the residuals.
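A sketch of the augmented error under these assumptions (squared loss, linear model; the constants follow the forms in ESL Chap 7, and σ̂² must be supplied, e.g. estimated from a low-bias model):

```python
import numpy as np

def augmented_errors(y_true, y_pred, d, sigma2):
    """Training error plus optimism terms for a linear model with d non-zero
    coefficients; sigma2 is the residual variance estimated on the train set."""
    n = len(y_true)
    err = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return {"train error": err,
            "Cp (~AIC)": err + 2 * d / n * sigma2,          # optimism = 2 d sigma²/n
            "BIC-style": err + np.log(n) * d / n * sigma2}  # log(n) replaces 2
```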
Minimum description length (MDL)
- Shortest code to transmit a random variable z: $-\log_2 P(z)$ bits [Shannon's source coding theorem].
  – Consider a discrete variable z. In the equiprobable case, use a fixed-length code. Otherwise, use a variable-length prefix code in which frequent values get shorter codes; since no codeword is the prefix of another, codewords can be separated without a delimiter.
- Assume:
  – a parametric model $f(\cdot\,; \theta)$;
  – the receiver knows the inputs X and the model family f.
- To transmit the outputs y, we need:
  (average code length to transmit the differences between model predictions and true outputs) + (average code length to transmit θ).
- Choose the model with the smallest Kolmogorov complexity (= minimum description length).
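For instance (an arbitrary toy distribution), the optimal per-value code lengths and the resulting average length:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # arbitrary toy distribution
lengths = -np.log2(p)                     # frequent values get shorter codes
print(lengths)                            # [1. 2. 3. 3.] bits
print("average length:", np.sum(p * lengths))  # 1.75 bits = the entropy of p
```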
Summary: model selection techniques
- Empirical: estimate the quality of generalization with
  – cross-validation;
  – bootstrap.
- Theoretical:
  – Estimate the difference between train error and generalization error with an optimism term, e.g. Mallow's Cp, Akaike's / Bayesian Information Criteria.
  – Minimum description length (MDL): choose the simplest model (according to Kolmogorov complexity).
References
- A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
  – Noise: Chap 2.3
  – Overfitting: Chap 2.4
  – Bias-variance tradeoff: Chap 5.9
  – Train and test sets: Chap 2.5
  – Cross-validation: Chap 5.6
  – Performance measures: Chap 5.5
- The Elements of Statistical Learning. http://web.stanford.edu/~hastie/ElemStatLearn/
  – Overfitting: Chap 7.1
  – Bias-variance tradeoff: Chap 2.9, 7.2–7.3
  – Cross-validation: Chap 7.10
  – Bootstrap: Chap 7.11
  – Mallow's Cp, AIC, BIC: Chap 7.7
  – MDL: Chap 7.8
- Entropy encoding: http://lesswrong.com/lw/o1/entropy_and_short_codes/
References for prerequisites
- Linear algebra: http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/video-lectures/
- Statistics & probabilities:
  – Probability theory: A primer (Jeremy Kun). http://jeremykun.com/2013/01/04/probability-theory-a-primer/
  – Probability Primer (Jeffrey Miller). https://www.youtube.com/playlist?list=PL17567A1A3F5DB5E4
Practical matters
- Make sure you have turned in HW01.
- HW02 is online, due Oct. 9.
- HW03 is online, due Oct. 13.
- Lab