

  1. STAT 339 Evaluating a Classifier 3 February 2017 Colin Reimer Dawson

  2. Questions/Administrative Business?
     ◮ Everyone enrolled who intends to be?
     ◮ Any technical difficulties?
     ◮ Anything else?

  3. Outline
     ◮ Evaluating a Supervised Learning Method
     ◮ Classification Performance
     ◮ Validation and Test Sets

  4. Types of Learning
     ◮ Supervised Learning: learning to make predictions when you have many examples of “correct answers”
       ◮ Classification: the answer is a category / label
       ◮ Regression: the answer is a number
     ◮ Unsupervised Learning: finding structure in unlabeled data
     ◮ Reinforcement Learning: finding actions that maximize long-run reward (not part of this course)

  5. Classification and Regression
     If t is a categorical output, then we are doing classification. If t is a quantitative output, then we are doing regression.
     NB: “Logistic regression” is really a classification method in this taxonomy.

  6. K-nearest neighbors algorithm
     1. Given a training set D = {(x_n, t_n)}, n = 1, ..., N, a test point x, and a distance function d, compute the distances {d_n = d(x, x_n)}, n = 1, ..., N.
     2. Find the K “nearest neighbors” to x in D.
     3. Classify the test point based on a “plurality vote” of the K nearest neighbors.
     4. In the event of a tie, apply a chosen tie-breaking procedure (e.g., choose the most frequent class, increase K, etc.).
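
To make the procedure concrete, here is a minimal Python sketch of the algorithm above. It is illustrative, not the course's reference code; the function name, the default Euclidean distance, and the closest-tied-neighbor tie-break are choices made for this example.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, t_train, x, K, dist=None):
    """Classify one test point x by a plurality vote of its K nearest training points."""
    if dist is None:
        dist = lambda a, b: np.linalg.norm(a - b)      # Euclidean (L2) by default
    t_train = np.asarray(t_train)
    # 1. Distances from x to every training point
    d = np.array([dist(x, x_n) for x_n in X_train])
    # 2. Indices of the K nearest neighbors, closest first
    nearest = np.argsort(d)[:K]
    # 3. Plurality vote among their labels
    counts = Counter(t_train[nearest]).most_common()
    best = counts[0][1]
    tied = {label for label, c in counts if c == best}
    # 4. Tie-break: among tied classes, return the one with the closest neighbor
    for idx in nearest:
        if t_train[idx] in tied:
            return t_train[idx]
```

For example, `knn_classify(X, t, x_new, K=5)` would reproduce the hard-classification rule from the slide with K = 5.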

  7. K-nearest-neighbors for Iris data
     [Figure: KNN decision regions on the Iris data, plotted as Sepal.Length vs. Sepal.Width, for K = 1, 3, 5, 11, 21, and N.]

  8. Flexibility vs. Robustness
     ◮ Small K: highly flexible – can fit arbitrarily complex patterns in the data – but not robust (highly sensitive to noise/specific sample properties)
     ◮ Larger K: mitigates sensitivity to noise, etc., but at the expense of flexibility

  9. Variants of KNN
     ◮ “Soft” KNN: Retain the vote share for each class, instead of simply taking the max, to do “soft” classification.
     ◮ “Kernel” KNN: Use a “kernel” function that decays with distance to weight the votes of the neighbors by their nearness.
     ◮ Beyond R^d: KNN can be used for objects such as strings, trees, and graphs by simply defining a suitable distance metric.
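
The first two variants can be combined in a few lines. The sketch below is only one possible reading of the slide: the Gaussian-style kernel exp(−γ·d²) and its parameter γ are illustrative assumptions, not something the lecture specifies.

```python
import numpy as np

def kernel_knn_vote(distances, labels, K, gamma=1.0):
    """Return per-class vote shares for one test point ('soft' classification),
    weighting each of the K nearest neighbors by a kernel that decays with distance."""
    labels = np.asarray(labels)
    nearest = np.argsort(distances)[:K]
    weights = np.exp(-gamma * np.asarray(distances)[nearest] ** 2)  # assumed kernel form
    shares = {}
    for label, w in zip(labels[nearest], weights):
        shares[label] = shares.get(label, 0.0) + w
    total = sum(shares.values())
    return {label: w / total for label, w in shares.items()}
```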

  10. Choices to Make Using KNN
     ◮ What distance measure? (Euclidean (L2), Manhattan (L1), Chebyshev (L∞), edit distance (L0), ...) Always standardize your features (e.g., convert to z-scores) so the dimensions are on comparable scales when computing distance!
     ◮ What value of K?
     ◮ What kernel (and what kernel parameters), if any?
     ◮ What tie-breaking procedure (if doing hard classification)?
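
The standardization advice is easy to get wrong: the scaling should be estimated from the training data and then applied to test points as well. A small sketch of that, offered as an illustration rather than anything prescribed by the slides:

```python
import numpy as np

def zscore_standardize(X_train, X_test):
    """Convert features to z-scores so all dimensions are on comparable scales.
    Means and standard deviations come from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```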

  11. Evaluating a Supervised Learning Method
     Two Kinds of Evaluation
       1. How do we select which free “parameters”, like K or kernel decay rate, are best?
       2. How do we know how good a job our final method has done?
     Two Choices To Be Made
       1. How do we quantify performance?
       2. What data do we use to measure performance?

  12. Quantifying Classification Performance: Misclassification Rate
     ◮ One possible metric: the misclassification rate: what proportion of cases does the classifier get incorrect?

           Misclassification Rate = (1/N) Σ_n I(t̂_n ≠ t_n)

       where t̂_n is the classifier’s output for training point n, and I(A) returns 1 if A is true, 0 otherwise.
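
As a quick illustration (an assumed helper, not course code), the formula is a one-liner once predictions and true labels are in arrays:

```python
import numpy as np

def misclassification_rate(t_hat, t):
    """(1/N) Σ_n I(t̂_n ≠ t_n): the proportion of points the classifier gets wrong."""
    t_hat, t = np.asarray(t_hat), np.asarray(t)
    return float(np.mean(t_hat != t))
```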

  13. Other Classification Measures
     For binary class problems with asymmetry between classes (e.g., positive and negative instances), there are four possibilities:

         Truth \ Classification     +                 −
         +                          True Positive     False Negative
         −                          False Positive    True Negative

     Table: Possible outcomes for a binary classifier

     We can measure four component success rates:

         Recall/Sensitivity = TP / (TP + FN)     Precision/Pos. Pred. Value = TP / (TP + FP)
         Specificity        = TN / (TN + FP)     Neg. Pred. Value           = TN / (TN + FN)
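
A sketch of how the four rates fall out of predicted and true labels; the function name and the `positive=1` convention are assumptions for the example:

```python
import numpy as np

def binary_rates(t_hat, t, positive=1):
    """Compute recall, precision, specificity, and negative predictive value
    from predicted labels t_hat and true labels t."""
    t_hat, t = np.asarray(t_hat), np.asarray(t)
    tp = np.sum((t_hat == positive) & (t == positive))
    fp = np.sum((t_hat == positive) & (t != positive))
    fn = np.sum((t_hat != positive) & (t == positive))
    tn = np.sum((t_hat != positive) & (t != positive))
    return {
        "recall/sensitivity": tp / (tp + fn),
        "precision/pos. pred. value": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "neg. pred. value": tn / (tn + fn),
    }
```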

  14. F-measures

         F_1 score = 2 · (1/Recall + 1/Precision)^(−1) = (2 · Recall · Precision) / (Recall + Precision)

         F_β score = (1 + β²) · (β²/Recall + 1/Precision)^(−1) = ((1 + β²) · Recall · Precision) / (Recall + β² · Precision)

     F_β aggregates recall (sensitivity / true positive rate) and precision (positive predictive value), with a “cost parameter” β to emphasize or de-emphasize recall.
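
Numerically, with Recall = 0.8 and Precision = 0.6, F_1 = 2 · 0.48 / 1.4 ≈ 0.686, while F_2 (which weights recall more heavily) = 0.75. A tiny helper, written only to mirror the slide's closed-form expression:

```python
def f_beta(recall, precision, beta=1.0):
    """F_β as on the slide: (1 + β²) · R · P / (R + β² · P).
    β > 1 emphasizes recall; β < 1 de-emphasizes it. β = 1 gives F_1."""
    return (1 + beta**2) * recall * precision / (recall + beta**2 * precision)

# Example (matches the numbers in the text above):
# f_beta(0.8, 0.6)          -> 0.6857...  (F_1)
# f_beta(0.8, 0.6, beta=2)  -> 0.75       (F_2, recall-heavy)
```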

  15. Receiver Operating Characteristic (ROC) Curve
     Figure: Example of an ROC curve. As the classifier becomes more willing to say “+”, both true positives and false positives go up. Ideally, false positives go up much more slowly (the curve hugs the upper left).
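
The curve is traced out by sweeping a decision threshold over some classifier score (for KNN, the “soft” vote share for “+” would serve). The sketch below makes that concrete; the score-based framing is an assumption layered on top of the slide, not something it states.

```python
import numpy as np

def roc_points(scores, t, positive=1):
    """Sweep a threshold over classifier scores and record (FPR, TPR) pairs.
    Higher score is taken to mean 'more willing to say +'."""
    scores, t = np.asarray(scores, dtype=float), np.asarray(t)
    is_pos, is_neg = (t == positive), (t != positive)
    points = []
    for thresh in np.sort(np.unique(scores))[::-1]:    # strictest threshold first
        predicted_pos = scores >= thresh
        tpr = np.sum(predicted_pos & is_pos) / np.sum(is_pos)   # true positive rate
        fpr = np.sum(predicted_pos & is_neg) / np.sum(is_neg)   # false positive rate
        points.append((fpr, tpr))
    return points
```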

  16. Overfitting and Test Set
     ◮ Fitting and evaluating on the same data (for most evaluation metrics) results in overfitting.
     ◮ Overfitting occurs when a learning algorithm mistakes noise for signal and incorporates idiosyncrasies of the training set into its decision rule.
     ◮ To combat overfitting, use different data for evaluation than for fitting. This “held-out data” is called a test set.

  17. Train vs. Test Error (KNN on Iris data)
     [Figure: training and test error rates plotted against K (0 to 50), with error rate from 0.0 to 1.0 on the vertical axis.]

  18. Validation vs. Test Set
     ◮ If we have decisions left to make, then we should not look at the final test set. (Why not?)
     ◮ If we are going to select the best version of our method by optimizing on the test set, then we have no measure of absolute performance: test set performance is overly optimistic because it is cherry-picked.
     ◮ Instead, take the training set and (randomly) subdivide it into a training set and a validation set. Use the training set to do classification; use the validation set to evaluate and to guide “higher-order” decisions.
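
A minimal sketch of the random subdivision described above; the 25% validation fraction and the fixed seed are illustrative choices, not values given in the lecture.

```python
import numpy as np

def train_validation_split(X, t, val_fraction=0.25, seed=0):
    """Randomly carve a validation set out of the training data,
    keeping the test set untouched."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], t[train_idx], X[val_idx], t[val_idx]
```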

  19. Validation vs. Test Error
     [Figure: training, validation, and test error rates plotted against K (0 to 50), with error rate from 0.0 to 1.0 on the vertical axis.]

  20. Drawbacks of Simple Validation Approach
     ◮ Sacrificing training data degrades performance.
     ◮ If the validation set is too small, decisions will be based on noisy information.
     ◮ Partial solution: divide the training set into K equal parts, or “folds”; give each fold the chance to serve as the validation set, and average generalization performance.
     ◮ This yields “K-fold cross-validation” (note: a completely separate choice of K here).

  21. K-fold Cross-Validation Algorithm
     A. For each method M under consideration:
        1. Divide the training set into K “folds” with (approximately) equal cases per fold. (Keep the test set “sealed.”)
        2. For k = 1, ..., K:
           (a) Designate fold k the “validation set” and folds 1, ..., k−1, k+1, ..., K the training set.
           (b) “Train” the algorithm on the training set to yield classification rule c_k, and compute the error rate Err_k on the validation set, e.g.

                   Err_k(M) = (1/|Validation|) Σ_{i ∈ Validation} I(c_k(x_i) ≠ t_i)

        3. Return the mean error rate across folds:

                   Err(M) = (1/K) Σ_{k=1}^{K} Err_k(M)

     B. Select the M with the lowest Err(M).
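
A compact sketch of the loop in step A, written generically so any classification rule can be plugged in. The `classify` callable, the default of 10 folds, and the fixed seed are assumptions made for this example (for instance, the `knn_classify` sketch earlier, with a fixed neighborhood size, could serve as `classify`).

```python
import numpy as np

def kfold_cv_error(X, t, classify, n_folds=10, seed=0):
    """Mean validation error across folds for one candidate method M.
    `classify(X_tr, t_tr, x)` must return a predicted label for test point x."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)              # ~equal cases per fold
    fold_errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        t_hat = np.array([classify(X[train], t[train], x) for x in X[val]])
        fold_errors.append(np.mean(t_hat != t[val]))  # Err_k(M)
    return float(np.mean(fold_errors))                # Err(M)

# Model selection: evaluate kfold_cv_error for each candidate method
# (e.g., each KNN neighborhood size) and keep the one with the lowest error.
```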

  22. Cross-Validation Error
     [Figure: training error, 10-fold CV error, and test error plotted against K (0 to 40), with error rate from 0.0 to 1.0 on the vertical axis.]
