STAT 339: Evaluating a Classifier
3 February 2017
Colin Reimer Dawson
Questions/Administrative Business?
◮ Everyone enrolled who intends to be?
◮ Any technical difficulties?
◮ Anything else?
Outline
Evaluating a Supervised Learning Method
◮ Classification Performance
◮ Validation and Test Sets
Types of Learning
◮ Supervised Learning: Learning to make predictions when you have many examples of “correct answers”
  ◮ Classification: answer is a category / label
  ◮ Regression: answer is a number
◮ Unsupervised Learning: Finding structure in unlabeled data
◮ Reinforcement Learning: Finding actions that maximize long-run reward (not part of this course)
Classification and Regression
If $t$ is a categorical output, then we are doing classification.
If $t$ is a quantitative output, then we are doing regression.
NB: “Logistic regression” is really a classification method, in this taxonomy.
K-Nearest neighbors algorithm
1. Given a training set $\mathcal{D} = \{(\mathbf{x}_n, t_n)\}$, $n = 1, \dots, N$, a test point $\mathbf{x}$, and a distance function $d$, compute the distances $\{d_n = d(\mathbf{x}, \mathbf{x}_n)\}$, $n = 1, \dots, N$.
2. Find the $K$ “nearest neighbors” in $\mathcal{D}$ to $\mathbf{x}$.
3. Classify the test point based on a “plurality vote” of the $K$ nearest neighbors.
4. In the event of a tie, apply a chosen tie-breaking procedure (e.g., choose the most frequent class, increase $K$, etc.).
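As a concrete illustration, here is a minimal sketch of these steps in Python with NumPy. The function name `knn_classify`, the Euclidean distance, and the tie-break behavior are my own choices for the sketch, not prescribed by the slides:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, t_train, x, K):
    """Classify a single test point x by a plurality vote of its
    K nearest neighbors in the training set (Euclidean distance)."""
    # Step 1: distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 2: indices of the K nearest neighbors
    nearest = np.argsort(dists)[:K]
    # Steps 3-4: plurality vote; Counter.most_common breaks ties in
    # favor of the class encountered first, i.e., the nearer neighbor
    votes = Counter(t_train[nearest])
    return votes.most_common(1)[0][0]
```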
K-nearest-neighbors for Iris data

[Figure: KNN decision regions on the Iris data (Sepal.Length vs. Sepal.Width) for K = 1, 3, 5, 11, 21, and N.]
Flexibility vs. Robustness
◮ Small K: highly flexible (can fit arbitrarily complex patterns in the data) but not robust (highly sensitive to noise and to properties of the specific sample)
◮ Larger K: mitigates sensitivity to noise, etc., but at the expense of flexibility
Variants of KNN
◮ “Soft” KNN: Retain the vote share for each class, instead of simply taking the max, to do “soft” classification.
◮ “Kernel” KNN: Use a “kernel” function that decays with distance to weight the votes of the neighbors by their nearness (sketched below).
◮ Beyond $\mathbb{R}^d$: KNN can be used for objects such as strings, trees, and graphs by simply defining a suitable distance metric.
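For instance, a kernel-weighted “soft” vote might look like the following sketch. The Gaussian kernel and the bandwidth parameter `h` are illustrative choices, not from the slides:

```python
import numpy as np

def kernel_knn_classify(X_train, t_train, x, K, h=1.0):
    """'Soft' kernel-weighted KNN: each of the K nearest neighbors
    votes with weight exp(-(d/h)^2); returns class vote shares."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    weights = np.exp(-(dists[nearest] / h) ** 2)  # Gaussian kernel
    classes = np.unique(t_train)
    # Sum the kernel weights for each class, then normalize to shares
    scores = np.array([weights[t_train[nearest] == c].sum() for c in classes])
    return dict(zip(classes, scores / scores.sum()))
```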
Choices to Make Using KNN
◮ What distance measure? (Euclidean ($L_2$), Manhattan ($L_1$), Chebyshev ($L_\infty$), edit distance, ...) Always standardize your features (e.g., convert to z-scores) so the dimensions are on comparable scales when computing distances (see the sketch after this list)!
◮ What value of $K$?
◮ What kernel (and what kernel parameters), if any?
◮ What tie-breaking procedure (if doing hard classification)?
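A minimal sketch of z-score standardization. Using statistics from the training set only is my own assumption about the intended practice; the slides do not spell this out:

```python
import numpy as np

def standardize(X_train, X_test):
    """Convert features to z-scores using training-set statistics,
    so that every dimension contributes on a comparable scale."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```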
Evaluating a Supervised Learning Method
Two Kinds of Evaluation
1. How do we select which free “parameters,” like $K$ or the kernel decay rate, are best?
2. How do we know how good a job our final method has done?

Two Choices To Be Made

1. How do we quantify performance?
2. What data do we use to measure performance?
Quantifying Classification Performance: Misclassification Rate
◮ One possible metric, the misclassification rate: what proportion of cases does the classifier get incorrect?

$$\text{Misclassification Rate} = \frac{1}{N} \sum_{n} I(\hat{t}_n \neq t_n)$$

where $\hat{t}_n$ is the classifier’s output for training point $n$, and $I(A)$ returns 1 if $A$ is true, 0 otherwise.
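In code this is a one-liner; a minimal sketch (the function name is mine):

```python
import numpy as np

def misclassification_rate(t_hat, t):
    """Proportion of points where the predicted label differs from
    the true label: (1/N) * sum_n I(t_hat_n != t_n)."""
    return np.mean(np.asarray(t_hat) != np.asarray(t))
```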
Other Classification Measures
For binary class problems with asymmetry between the classes (e.g., positive and negative instances), there are four possibilities:

                     Classified +       Classified −
    Truth +          True Positive      False Negative
    Truth −          False Positive     True Negative

Table: Possible outcomes for a binary classifier

We can measure four component success rates:

$$\text{Recall/Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Precision/Pos. Pred. Value} = \frac{TP}{TP + FP}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad \text{Neg. Pred. Value} = \frac{TN}{TN + FN}$$
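A sketch of these four rates computed from raw counts (the function and argument names are my own):

```python
def binary_rates(tp, fn, fp, tn):
    """The four component success rates for a binary classifier."""
    return {
        "recall/sensitivity": tp / (tp + fn),
        "precision/PPV": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "NPV": tn / (tn + fn),
    }
```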
F-measures
$$F_1 \text{ score} = \left( \frac{\frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}}{2} \right)^{-1} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$

$$F_\beta \text{ score} = \left( \frac{\beta^2 \cdot \frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}}{1 + \beta^2} \right)^{-1} = \frac{(1 + \beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \beta^2 \cdot \text{Precision}}$$

$F_\beta$ aggregates recall (sensitivity / true positive rate) and precision (positive predictive value), with a “cost parameter” $\beta$ to emphasize or de-emphasize recall.
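As a quick numeric check (my own example, not from the slides): with Recall = 0.8 and Precision = 0.5, $F_1 = 2(0.8)(0.5)/(0.8 + 0.5) \approx 0.615$. A sketch:

```python
def f_beta(recall, precision, beta=1.0):
    """F_beta = (1 + beta^2) * R * P / (R + beta^2 * P); beta = 1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * recall * precision / (recall + b2 * precision)

print(f_beta(0.8, 0.5))  # ~0.615
```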
Receiver Operating Characteristic (ROC) Curve
Figure: Example of an ROC curve. As the classifier becomes more willing to say “+”, both the true positive rate and the false positive rate go up. Ideally, false positives go up much more slowly (the curve hugs the upper left).
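One common way to trace out such a curve, sketched below, is to sweep a decision threshold over real-valued classifier scores and record the true and false positive rates at each threshold. The score-based setup is an assumption on my part; the slides show only the picture:

```python
import numpy as np

def roc_curve(scores, t):
    """Sweep a threshold over classifier scores for binary labels t
    (1 = positive, 0 = negative); return (FPR, TPR) per threshold."""
    thresholds = np.sort(np.unique(scores))[::-1]
    P, N = (t == 1).sum(), (t == 0).sum()
    fpr, tpr = [], []
    for thresh in thresholds:
        pred = scores >= thresh  # classify as "+" above the threshold
        tpr.append((pred & (t == 1)).sum() / P)
        fpr.append((pred & (t == 0)).sum() / N)
    return np.array(fpr), np.array(tpr)
```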
Overfitting and Test Set
◮ Fitting and evaluating on the same data (for most evaluation metrics) results in overfitting.
◮ Overfitting occurs when a learning algorithm mistakes noise for signal, and incorporates idiosyncrasies of the training set into its decision rule.
◮ To combat overfitting, use different data for evaluation vs. fitting. This “held-out” data is called a test set.
Train vs. Test Error (KNN on Iris data)
[Figure: Training and test error rates for KNN on the Iris data as a function of K.]
Validation vs. Test Set
◮ If we have decisions left to make, then we should not look at the final test set. (Why not?)
◮ If we are going to select the best version of our method by optimizing on the test set, then we have no measure of absolute performance: test set performance is overly optimistic because it is cherry-picked.
◮ Instead, take the training set and (randomly) subdivide it into a training set and a validation set. Use the training set to fit the classifier, and the validation set to guide “higher-order” decisions (as sketched below).
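A minimal sketch of such a random subdivision; the 60/20/20 proportions are an arbitrary illustration:

```python
import numpy as np

def train_val_test_split(X, t, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition the data into training, validation, and
    test sets (the test set stays 'sealed' until the very end)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test, val = idx[:n_test], idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])
```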
Validation vs. Test Error
[Figure: Training, validation, and test error rates for KNN as a function of K.]
Drawbacks of Simple Validation Approach
◮ Sacrificing training data degrades performance.
◮ If the validation set is too small, decisions will be based on noisy information.
◮ Partial solution: Divide the training set into K equal parts, or “folds”; give each fold the chance to serve as the validation set, and average the generalization performance.
◮ This yields “K-fold cross-validation” (note: this K is a completely separate choice from the K in KNN).
K-fold Cross Validation Algorithm
A. For each method, $M$, under consideration:
   1. Divide the training set into $K$ “folds” with (approximately) equal numbers of cases per fold. (Keep the test set “sealed.”)
   2. For $k = 1, \dots, K$:
      (a) Designate fold $k$ the “validation set,” and folds $1, \dots, k-1, k+1, \dots, K$ the training set.
      (b) “Train” the algorithm on the training set to yield classification rule $c_k$, and compute the error rate $\text{Err}_k$ on the validation set, e.g.,
      $$\text{Err}_k(M) = \frac{1}{|\text{Validation}|} \sum_{i \in \text{Validation}} I(c_k(\mathbf{x}_i) \neq t_i)$$
   3. Return the mean error rate across folds:
      $$\overline{\text{Err}}(M) = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k(M)$$
B. Select the $M$ with lowest $\overline{\text{Err}}(M)$.
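A sketch of this procedure; the `fit`/`predict` interface for a method `M` is my own assumption about how a candidate method would be packaged:

```python
import numpy as np

def kfold_cv_error(M, X, t, K=10, seed=0):
    """Mean validation error of method M across K folds.
    M must expose fit(X, t) and predict(X) (an assumed interface)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)  # ~equal folds
    errs = []
    for k in range(K):
        val = folds[k]  # fold k is the validation set
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        M.fit(X[train], t[train])
        errs.append(np.mean(M.predict(X[val]) != t[val]))  # Err_k(M)
    return np.mean(errs)  # select the M minimizing this
```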
Cross Validation Error
[Figure: Training error, 10-fold cross-validation error, and test error for KNN as a function of K.]