Learning Algorithm Evaluation


  1. Learning Algorithm Evaluation (title slide, with a confusion matrix for a disease test)

                     Disease +   Disease -
     test result +   TP          FP          TP+FP
     test result -   FN          TN          FN+TN
                     TP+FN       FP+TN

  2. Outline
     • Why? Overfitting
     • How? Holdout vs Cross-validation
     • What? Evaluation measures
     • Who wins? Statistical significance

  3. Quiz Is this a good model?

  4. Overfitting: a model may fit the training data perfectly and still perform badly on unseen data. A simpler model may be better.

  5. Outline
     • Why? Overfitting
     • How? Holdout vs Cross-validation
     • What? Evaluation measures
     • Who wins? Statistical significance

  6. A first evaluation measure: predictive accuracy
     • Success: instance's class is predicted correctly
     • Error: instance's class is predicted incorrectly
     • Error rate: #errors / #instances
     • Predictive accuracy: #successes / #instances
     • Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate?
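
     A minimal sketch of these definitions in Python, using the numbers from the quiz above (the variable names are illustrative):

```python
# Predictive accuracy and error rate from counts
# (quiz numbers: 50 examples, 10 classified incorrectly).
n_instances = 50
n_errors = 10

error_rate = n_errors / n_instances                  # 10/50 = 0.2
accuracy = (n_instances - n_errors) / n_instances    # 40/50 = 0.8

print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")
```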

  8. Rule #1 Never evaluate on training data!

  9. Holdout (Train and Test): hold out part of the data as a separate test set (a.k.a. holdout set) and train on the rest.

  14. Quiz Can I retry with other parameter settings?

  16. Rule #2 Never train/optimize on test data! (that includes parameter selection)

  17. Holdout (Train and Test): you need a separate OPTIMIZATION set to tune parameters, distinct from the TESTING set.

  18. Test data leakage
      • Never use test data to create the classifier
      • Can be tricky: e.g. social network data
      • Proper procedure uses three sets:
        • training set: train models
        • optimization/validation set: optimize algorithm parameters
        • test set: evaluate final model
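
      A minimal sketch of the three-set procedure, assuming scikit-learn; the dataset, the DecisionTreeClassifier, and the split ratios are illustrative, not taken from the slides:

```python
# Three-way split: training / validation / test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 60% train, 20% validation, 20% test (illustrative ratios)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Tune a parameter on the validation set only
best_depth, best_score = None, -1.0
for depth in (1, 2, 3, 5, 8):
    score = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_depth, best_score = depth, score

# Touch the test set only once, for the final estimate
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```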

  19. Holdout (Train and Test) Build final model on ALL data (more data, better model)

  20. Making the most of data
      • Once evaluation is complete and the algorithm/parameters are selected, all the data can be used to build the final classifier
      • Trade-off: performance <-> evaluation accuracy
        • More training data: better model (but returns diminish)
        • More test data: more accurate error estimate

  21. Issues
      • Small data sets: a random test set can be quite different from the training set (different data distribution)
      • Unbalanced class distributions: one class can be overrepresented in the test set
        • Serious problem for some domains:
          • medical diagnosis: 90% healthy, 10% disease
          • eCommerce: 99% don't buy, 1% buy
          • security: >99.99% of Americans are not terrorists

  22. Balancing unbalanced data Sample equal amounts from minority and majority class + ensure approximately equal proportions in train/test set

  23. Stratified Sampling: advanced class balancing. Sample so that each class is represented with approximately the same proportions in both subsets, e.g. take a stratified sample of 50 instances.
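
      A minimal sketch of a stratified holdout split, assuming scikit-learn's stratify option; the unbalanced dataset and the 50-instance test size are illustrative:

```python
# Stratified holdout: class proportions in the test sample mirror the full data.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative unbalanced data: roughly 90% / 10%
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, stratify=y, random_state=0)  # stratified sample of 50 instances

print("full data class counts:", Counter(y))
print("test set class counts: ", Counter(y_test))  # roughly the same 90/10 ratio
```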

  24. Repeated holdout method
      • Evaluation is still biased by the random test sample
      • Solution: repeat and average results
      • Random, stratified sampling, N times
      • Final performance = average of all performances
        (e.g. three train/test runs scoring 0.86, 0.74 and 0.8 average to 0.8)
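
      A minimal sketch of repeated stratified holdout, assuming scikit-learn; the dataset, classifier and N = 10 repetitions are illustrative:

```python
# Repeated stratified holdout: repeat the split N times and average the scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

scores = []
for seed in range(10):  # N = 10 repetitions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print("mean accuracy: %.3f (+/- %.3f)" % (np.mean(scores), np.std(scores)))
```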

  29. k-fold Cross-validation: split the data (stratified) into k folds. Use (k-1) folds for training and 1 for testing; repeat k times and average the results.

  30. Cross-validation
      • Standard method: stratified 10-fold cross-validation
        • Experimentally determined; removes most of the sampling bias
      • Even better: repeated stratified cross-validation
        • Popular: 10 x 10-fold CV, 2 x 3-fold CV
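
      A minimal sketch of stratified 10-fold CV and repeated (10 x 10-fold) CV, assuming scikit-learn; the dataset and classifier are illustrative:

```python
# Stratified 10-fold CV and repeated stratified CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import (cross_val_score, StratifiedKFold,
                                     RepeatedStratifiedKFold)
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
model = DecisionTreeClassifier(random_state=0)

cv10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print("10-fold CV accuracy:   ", cross_val_score(model, X, y, cv=cv10).mean())

cv10x10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
print("10x10-fold CV accuracy:", cross_val_score(model, X, y, cv=cv10x10).mean())
```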

  31. Leave-One-Out Cross-validation
      • A particular form of cross-validation: #folds = #instances
      • n instances: build the classifier n times
      • Makes best use of the data, no sampling bias
      • Computationally very expensive
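
      A minimal sketch of leave-one-out CV, assuming scikit-learn; the 100-instance dataset and classifier are illustrative:

```python
# Leave-one-out CV: one fold per instance, so n models are trained.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=0)  # n = 100 -> 100 models
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # one held-out instance per fold
print("LOO accuracy:", scores.mean())
```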

  32. Outline
      • Why? Overfitting
      • How? Holdout vs Cross-validation
      • What? Evaluation measures
      • Who wins? Statistical significance

  33. Some other evaluation measures
      • ROC: Receiver Operating Characteristic
      • Precision and Recall
      • Cost-sensitive learning
      • Evaluation for numeric predictions
      • MDL principle and Occam's razor

  34. ROC curves
      • Receiver Operating Characteristic
      • From signal processing: trade-off between hit rate and false alarm rate over a noisy channel
      • Method: plot True Positive rate against False Positive rate

  35. Confusion Matrix

                       actual +               actual -
      predicted +      TP (true positive)     FP (false positive)
      predicted -      FN (false negative)    TN (true negative)
                       TP+FN                  FP+TN

      TPrate (sensitivity) = TP / (TP+FN)
      FPrate (fall-out)    = FP / (FP+TN)

  36. ROC curves
      • Receiver Operating Characteristic
      • From signal processing: trade-off between hit rate and false alarm rate over a noisy channel
      • Method: plot True Positive rate against False Positive rate
        • Collect many points by varying the prediction threshold (for probabilistic algorithms, i.e. probabilistic predictions)
        • Non-probabilistic algorithms give a single point
        • Or, make the algorithm cost-sensitive and vary the costs (see below)

  37. ROC curves: example predictions

      inst   P(+)   actual
        1    0.8      +
        2    0.5      -
        3    0.45     +
        4    0.3      -

  38. ROC curves: predictions at each threshold (predict + when P(+) exceeds the threshold); each threshold gives one (FPrate, TPrate) point on the ROC curve

      threshold          0     0.3   0.45   0.5   0.8
      inst 1 (0.8,  +)   +     +     +      +     -
      inst 2 (0.5,  -)   +     +     +      -     -
      inst 3 (0.45, +)   +     +     -      -     -
      inst 4 (0.3,  -)   +     -     -      -     -
      TPrate             1     1     1/2    1/2   0
      FPrate             1     1/2   1/2    0     0
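
      A minimal sketch that recomputes the (FPrate, TPrate) points of the example above, assuming the "predict + when P(+) exceeds the threshold" rule shown in the table:

```python
# Compute TP rate and FP rate at each threshold for the 4-instance example.
p_pos = [0.8, 0.5, 0.45, 0.3]   # predicted P(+) per instance
actual = ['+', '-', '+', '-']   # actual classes

n_pos = actual.count('+')
n_neg = actual.count('-')

for threshold in [0, 0.3, 0.45, 0.5, 0.8]:
    predicted = ['+' if p > threshold else '-' for p in p_pos]
    tp = sum(1 for pred, act in zip(predicted, actual) if pred == '+' and act == '+')
    fp = sum(1 for pred, act in zip(predicted, actual) if pred == '+' and act == '-')
    print(f"threshold {threshold}: TPrate = {tp}/{n_pos}, FPrate = {fp}/{n_neg}")
```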
