

SLIDE 1

STAT 339 Evaluating a Classifier

3 February 2017 Colin Reimer Dawson

SLIDE 2

Questions/Administrative Business?

◮ Everyone enrolled who intends to be?
◮ Any technical difficulties?
◮ Anything else?

SLIDE 3

Outline

◮ Evaluating a Supervised Learning Method
◮ Classification Performance
◮ Validation and Test Sets

SLIDE 4

Types of Learning

◮ Supervised Learning: Learning to make predictions when you have many examples of “correct answers”
  ◮ Classification: the answer is a category / label
  ◮ Regression: the answer is a number
◮ Unsupervised Learning: Finding structure in unlabeled data
◮ Reinforcement Learning: Finding actions that maximize long-run reward (not part of this course)

SLIDE 5

Classification and Regression

If t is a categorical output, then we are doing classification.
If t is a quantitative output, then we are doing regression.
NB: “Logistic regression” is really a classification method in this taxonomy.

SLIDE 6

K-Nearest Neighbors Algorithm

1. Given a training set D = {(x_n, t_n)}, n = 1, . . . , N, a test point x, and a distance function d, compute the distances d_n = d(x, x_n) for n = 1, . . . , N.
2. Find the K “nearest neighbors” in D to x, i.e., the K training points with the smallest d_n.
3. Classify the test point based on a “plurality vote” of the K nearest neighbors.
4. In the event of a tie, apply a chosen tie-breaking procedure (e.g., choose the most frequent class, increase K, etc.).
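As a concrete illustration, here is a minimal sketch of these steps in Python (not from the slides; knn_classify and all other names are my own, using Euclidean distance and breaking ties in favor of the class most frequent in the whole training set):

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, t_train, x, K):
        t = np.asarray(t_train)
        # Step 1: distances from the test point to every training point
        d = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
        # Step 2: indices of the K nearest neighbors
        nearest = np.argsort(d)[:K]
        # Step 3: plurality vote over the neighbors' labels
        votes = Counter(t[nearest])
        top = max(votes.values())
        tied = [c for c, n in votes.items() if n == top]
        if len(tied) == 1:
            return tied[0]
        # Step 4: one possible tie-break: prefer the class that is
        # most frequent in the whole training set
        overall = Counter(t)
        return max(tied, key=lambda c: overall[c])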

SLIDE 7

K-Nearest Neighbors for Iris Data

Figure: KNN classification regions for the Iris data (Sepal.Length vs. Sepal.Width), shown for K = 1, 3, 5, 11, 21, and K = N.

SLIDE 8

Flexibility vs. Robustness

◮ Small K: highly flexible (can fit arbitrarily complex patterns in the data) but not robust (highly sensitive to noise and to properties of the specific sample)
◮ Larger K: mitigates sensitivity to noise, etc., but at the expense of flexibility

SLIDE 9

Variants of KNN

◮ “Soft” KNN: Retain the vote share for each class, instead of simply taking the max, to do “soft” classification.
◮ “Kernel” KNN: Use a “kernel” function that decays with distance to weight the votes of the neighbors by their nearness. (Both variants are sketched after this list.)
◮ Beyond ℝ^d: KNN can be used for objects such as strings, trees, and graphs by simply defining a suitable distance metric.
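A minimal sketch combining the first two variants in Python (my own code; the Gaussian kernel and the gamma parameter are illustrative choices, not from the slides):

    import numpy as np

    def soft_kernel_knn(X_train, t_train, x, K, gamma=1.0):
        t = np.asarray(t_train)
        # Distances to all training points
        d = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
        nearest = np.argsort(d)[:K]
        # Kernel weights decay with distance (Gaussian kernel)
        w = np.exp(-gamma * d[nearest] ** 2)
        # Accumulate weighted votes, then normalize to vote shares
        shares = {}
        for label, wi in zip(t[nearest], w):
            shares[label] = shares.get(label, 0.0) + wi
        total = sum(shares.values())
        return {c: s / total for c, s in shares.items()}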

SLIDE 10

Choices to Make Using KNN

◮ What distance measure? (Euclidean (L2), Manhattan (L1), Chebyshev (L∞), edit distance, ...) Always standardize your features (e.g., convert to z-scores) so the dimensions are on comparable scales when computing distance! (A sketch follows this list.)
◮ What value of K?
◮ What kernel (and what kernel parameters), if any?
◮ What tie-breaking procedure (if doing hard classification)?
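For the standardization step, a minimal sketch (my own helper, not from the slides): compute each feature’s mean and standard deviation on the training set only, and apply that same transformation to the test data.

    import numpy as np

    def standardize(X_train, X_test):
        X_train = np.asarray(X_train, dtype=float)
        X_test = np.asarray(X_test, dtype=float)
        # Use training-set statistics for both splits, so no information
        # about the test set leaks into the scaling
        mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
        return (X_train - mu) / sigma, (X_test - mu) / sigma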

SLIDE 11

Evaluating a Supervised Learning Method

Two Kinds of Evaluation

1. How do we select which free “parameters,” like K or the kernel decay rate, are best?
2. How do we know how good a job our final method has done?

Two Choices To Be Made

1. How do we quantify performance?
2. What data do we use to measure performance?
SLIDE 12

Quantifying Classification Performance: Misclassification Rate

◮ One possible metric: the misclassification rate: what proportion of cases does the classifier get incorrect?

    Misclassification Rate = (1/N) Σ_n I(t̂_n ≠ t_n)

where t̂_n is the classifier’s output for training point n, and I(A) returns 1 if A is true, 0 otherwise.
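A one-line sketch in Python (the helper name is my own):

    import numpy as np

    def misclassification_rate(t_hat, t):
        # Mean of the 0/1 indicator that prediction != truth
        return np.mean(np.asarray(t_hat) != np.asarray(t))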

SLIDE 13

Other Classification Measures

For binary class problems with asymmetry between classes (e.g., positive and negative instances), there are four possibilities:

                 Classified +       Classified −
    Truth +      True Positive      False Negative
    Truth −      False Positive     True Negative

Table: Possible outcomes for a binary classifier

We can measure four component success rates:

    Recall / Sensitivity         = TP / (TP + FN)
    Precision / Pos. Pred. Value = TP / (TP + FP)
    Specificity                  = TN / (TN + FP)
    Neg. Pred. Value             = TN / (TN + FN)
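A minimal sketch computing all four rates in Python (my own code; the positive argument identifies which label counts as the positive class):

    import numpy as np

    def binary_rates(t_hat, t, positive=1):
        t_hat, t = np.asarray(t_hat), np.asarray(t)
        tp = np.sum((t_hat == positive) & (t == positive))
        fp = np.sum((t_hat == positive) & (t != positive))
        tn = np.sum((t_hat != positive) & (t != positive))
        fn = np.sum((t_hat != positive) & (t == positive))
        return {"recall": tp / (tp + fn),       # sensitivity
                "precision": tp / (tp + fp),    # positive predictive value
                "specificity": tn / (tn + fp),
                "npv": tn / (tn + fn)}          # negative predictive value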

SLIDE 14

F-measures

    F1 score = [ (1/Recall + 1/Precision) / 2 ]^(−1)
             = 2 · Recall · Precision / (Recall + Precision)

    Fβ score = [ (β² · (1/Recall) + 1/Precision) / (1 + β²) ]^(−1)
             = (1 + β²) · Recall · Precision / (Recall + β² · Precision)

Fβ aggregates recall (sensitivity / true positive rate) and precision (positive predictive value), with a “cost parameter” β to emphasize or de-emphasize recall.
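A minimal sketch in Python (my own helper; beta > 1 emphasizes recall, beta < 1 emphasizes precision, and beta = 1 recovers the ordinary F1 score):

    def f_beta(recall, precision, beta=1.0):
        # Weighted harmonic mean of recall and precision
        b2 = beta ** 2
        return (1 + b2) * recall * precision / (recall + b2 * precision)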

SLIDE 15

Receiver Operating Characteristic (ROC) Curve

Figure: Example of an ROC curve. As the classifier becomes more willing to say “+”, both true positives and false positives go up. Ideally, false positives go up much more slowly (the curve hugs the upper left).
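A minimal sketch of how such a curve can be traced (my own code; it assumes the classifier assigns each case a score, with higher scores meaning “more positive,” and sweeps the decision threshold from strict to lenient):

    import numpy as np

    def roc_points(scores, t, positive=1):
        scores, t = np.asarray(scores), np.asarray(t)
        order = np.argsort(-scores)     # most confidently "+" first
        is_pos = (t[order] == positive)
        tps = np.cumsum(is_pos)         # true positives at each threshold
        fps = np.cumsum(~is_pos)        # false positives at each threshold
        # Normalize counts to rates: TPR = TP / P, FPR = FP / N
        return fps / max(fps[-1], 1), tps / max(tps[-1], 1)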

SLIDE 16

Overfitting and Test Set

◮ Fitting and evaluating on the same data (for most evaluation metrics) results in overfitting.
◮ Overfitting occurs when a learning algorithm mistakes noise for signal and incorporates idiosyncrasies of the training set into its decision rule.
◮ To combat overfitting, use different data for evaluation vs. fitting. This “held-out” data is called a test set.
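A minimal sketch of a random train/test split (my own helper; the 80/20 proportion is just an illustrative default, and X and t are assumed to be numpy arrays):

    import numpy as np

    def train_test_split(X, t, test_frac=0.2, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))         # shuffle case indices
        n_test = int(test_frac * len(X))
        test, train = idx[:n_test], idx[n_test:]
        return X[train], t[train], X[test], t[test]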
SLIDE 17

Train vs. Test Error (KNN on Iris data)

Figure: Train error vs. test error for KNN on the Iris data, as a function of K.

SLIDE 18

Validation vs. Test Set

◮ If we have decisions left to make, then we should not look at the final test set. (Why not?)
◮ If we are going to select the best version of our method by optimizing on the test set, then we have no measure of absolute performance: test set performance is overly optimistic because it is cherry-picked.
◮ Instead, take the training set and (randomly) subdivide it into a training set and a validation set. Use the training set to do classification, and the validation set to guide “higher-order” decisions (sketched below).
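A sketch of this workflow in Python, reusing the hypothetical helpers above (all names my own): hold out a validation set from the training data, pick K by validation error, and keep the test set sealed until the very end.

    # Carve a validation set out of the training data
    X_tr, t_tr, X_val, t_val = train_test_split(X_train, t_train, test_frac=0.25)

    def val_error(K):
        preds = [knn_classify(X_tr, t_tr, x, K) for x in X_val]
        return misclassification_rate(preds, t_val)

    # Choose K by validation error (odd values reduce the chance of ties)
    best_K = min(range(1, 31, 2), key=val_error)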

SLIDE 19

Validation vs. Test Error

Figure: Train error, validation error, and test error for KNN as a function of K.

SLIDE 20

Drawbacks of Simple Validation Approach

◮ Sacrificing training data degrades performance.
◮ If the validation set is too small, decisions will be based on noisy information.
◮ Partial solution: Divide the training set into K equal parts, or “folds”; give each fold the chance to serve as the validation set, and average the generalization performance.
◮ This yields “K-fold cross-validation.” (Note: this K is a completely separate choice from the K in KNN.)

SLIDE 21

K-fold Cross Validation Algorithm

A. For each method M under consideration:
   1. Divide the training set into K “folds” with (approximately) equal numbers of cases per fold. (Keep the test set “sealed.”)
   2. For k = 1, . . . , K:
      (a) Designate fold k the “validation set,” and folds 1, . . . , k − 1, k + 1, . . . , K the training set.
      (b) “Train” the algorithm on the training set to yield classification rule c_k, and compute the error rate Err_k(M) on the validation set, e.g.,

          Err_k(M) = (1/|Validation|) Σ_{i ∈ Validation} I(c_k(x_i) ≠ t_i)

   3. Return the mean error rate across folds:

          Err(M) = (1/K) Σ_{k=1}^{K} Err_k(M)

B. Select the M with the lowest Err(M).
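A minimal sketch of this loop in Python (my own code; train_fn must fit on the supplied data and return a classification rule, i.e., a function from a point to a predicted label):

    import numpy as np

    def kfold_cv_error(X, t, train_fn, K=10, seed=0):
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(X)), K)
        errors = []
        for k in range(K):
            val = folds[k]                              # fold k validates
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            rule = train_fn(X[tr], t[tr])               # "train" on the rest
            preds = np.array([rule(x) for x in X[val]])
            errors.append(np.mean(preds != t[val]))     # Err_k
        return np.mean(errors)                          # mean across folds

For example, evaluating the hypothetical KNN classifier above could look like
kfold_cv_error(X_train, t_train, lambda Xb, tb: (lambda x: knn_classify(Xb, tb, x, K=5))).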
SLIDE 22

Cross Validation Error

Figure: Train error, 10-fold cross-validation error, and test error for KNN as a function of K.