STAT 339: Evaluating a Classifier
3 February 2017
Colin Reimer Dawson
Questions/Administrative Business?
◮ Everyone enrolled who intends to be?
◮ Any technical difficulties?
◮ Anything else?
Outline
Evaluating a Supervised Learning Method
◮ Classification Performance
◮ Validation and Test Sets
Types of Learning
◮ Supervised Learning: Learning to make predictions when you have many examples of “correct answers”
  ◮ Classification: answer is a category / label
  ◮ Regression: answer is a number
◮ Unsupervised Learning: Finding structure in unlabeled data
◮ Reinforcement Learning: Finding actions that maximize long-run reward (not part of this course)
Classification and Regression
If $t$ is a categorical output, then we are doing classification.
If $t$ is a quantitative output, then we are doing regression.
NB: “Logistic regression” is really a classification method, in this taxonomy.
K-Nearest neighbors algorithm
1. Given a training set $\mathcal{D} = \{(\mathbf{x}_n, t_n)\}$, $n = 1, \dots, N$, a test point $\mathbf{x}$, and a distance function $d$, compute the distances $\{d_n = d(\mathbf{x}, \mathbf{x}_n)\}$, $n = 1, \dots, N$.
2. Find the $K$ “nearest neighbors” in $\mathcal{D}$ to $\mathbf{x}$.
3. Classify the test point based on a “plurality vote” of the $K$ nearest neighbors.
4. In the event of a tie, apply a chosen tie-breaking procedure (e.g., choose the most frequent class, increase $K$, etc.).
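As a concrete illustration, here is a minimal sketch of these steps in Python with NumPy. The function name `knn_classify`, the Euclidean distance, and the tie-break behavior are my own choices for the sketch, not prescribed by the slides:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, t_train, x, K):
    """Classify a single test point x by a plurality vote of its
    K nearest neighbors in the training set (Euclidean distance)."""
    # Step 1: distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Step 2: indices of the K nearest neighbors
    nearest = np.argsort(dists)[:K]
    # Steps 3-4: plurality vote; Counter.most_common breaks ties in
    # favor of the class encountered first, i.e., the nearer neighbor
    votes = Counter(t_train[nearest])
    return votes.most_common(1)[0][0]
```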
K-nearest-neighbors for Iris data

[Figure: KNN decision regions on the Iris data (Sepal.Length vs. Sepal.Width) for K = 1, 3, 5, 11, 21, and N.]
Flexibility vs. Robustness
◮ Small K: highly flexible (can fit arbitrarily complex patterns in the data) but not robust (highly sensitive to noise and to properties of the specific sample)
◮ Larger K: mitigates sensitivity to noise, etc., but at the expense of flexibility
Variants of KNN
◮ “Soft” KNN: Retain the vote share for each class, instead of simply taking the max, to do “soft” classification.
◮ “Kernel” KNN: Use a “kernel” function that decays with distance to weight the votes of the neighbors by their nearness (sketched below).
◮ Beyond $\mathbb{R}^d$: KNN can be used for objects such as strings, trees, and graphs by simply defining a suitable distance metric.
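For instance, a kernel-weighted “soft” vote might look like the following sketch. The Gaussian kernel and the bandwidth parameter `h` are illustrative choices, not from the slides:

```python
import numpy as np

def kernel_knn_classify(X_train, t_train, x, K, h=1.0):
    """'Soft' kernel-weighted KNN: each of the K nearest neighbors
    votes with weight exp(-(d/h)^2); returns class vote shares."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    weights = np.exp(-(dists[nearest] / h) ** 2)  # Gaussian kernel
    classes = np.unique(t_train)
    # Sum the kernel weights for each class, then normalize to shares
    scores = np.array([weights[t_train[nearest] == c].sum() for c in classes])
    return dict(zip(classes, scores / scores.sum()))
```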
Choices to Make Using KNN
◮ What distance measure? (Euclidean ($L_2$), Manhattan ($L_1$), Chebyshev ($L_\infty$), edit distance, ...) Always standardize your features (e.g., convert to z-scores) so the dimensions are on comparable scales when computing distances (see the sketch after this list)!
◮ What value of $K$?
◮ What kernel (and what kernel parameters), if any?
◮ What tie-breaking procedure (if doing hard classification)?
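A minimal sketch of z-score standardization. Using statistics from the training set only is my own assumption about the intended practice; the slides do not spell this out:

```python
import numpy as np

def standardize(X_train, X_test):
    """Convert features to z-scores using training-set statistics,
    so that every dimension contributes on a comparable scale."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```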
Evaluating a Supervised Learning Method
Two Kinds of Evaluation
1. How do we select which free “parameters,” like $K$ or the kernel decay rate, are best?
2. How do we know how good a job our final method has done?

Two Choices To Be Made

1. How do we quantify performance?
2. What data do we use to measure performance?
Quantifying Classification Performance: Misclassification Rate
◮ One possible metric, the misclassification rate: what proportion of cases does the classifier get incorrect?

$$\text{Misclassification Rate} = \frac{1}{N} \sum_{n} I(\hat{t}_n \neq t_n)$$

where $\hat{t}_n$ is the classifier’s output for training point $n$, and $I(A)$ returns 1 if $A$ is true, 0 otherwise.
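In code this is a one-liner; a minimal sketch (the function name is mine):

```python
import numpy as np

def misclassification_rate(t_hat, t):
    """Proportion of points where the predicted label differs from
    the true label: (1/N) * sum_n I(t_hat_n != t_n)."""
    return np.mean(np.asarray(t_hat) != np.asarray(t))
```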
Other Classification Measures
For binary class problems with asymmetry between the classes (e.g., positive and negative instances), there are four possibilities:

                     Classified +       Classified −
    Truth +          True Positive      False Negative
    Truth −          False Positive     True Negative

Table: Possible outcomes for a binary classifier

We can measure four component success rates:

$$\text{Recall/Sensitivity} = \frac{TP}{TP + FN} \qquad \text{Precision/Pos. Pred. Value} = \frac{TP}{TP + FP}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad \text{Neg. Pred. Value} = \frac{TN}{TN + FN}$$
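A sketch of these four rates computed from raw counts (the function and argument names are my own):

```python
def binary_rates(tp, fn, fp, tn):
    """The four component success rates for a binary classifier."""
    return {
        "recall/sensitivity": tp / (tp + fn),
        "precision/PPV": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "NPV": tn / (tn + fn),
    }
```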
F-measures
$$F_1 \text{ score} = \left( \frac{\frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}}{2} \right)^{-1} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$

$$F_\beta \text{ score} = \left( \frac{\beta^2 \cdot \frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}}{1 + \beta^2} \right)^{-1} = \frac{(1 + \beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \beta^2 \cdot \text{Precision}}$$

$F_\beta$ aggregates recall (sensitivity / true positive rate) and precision (positive predictive value), with a “cost parameter” $\beta$ to emphasize or de-emphasize recall.
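As a quick numeric check (my own example, not from the slides): with Recall = 0.8 and Precision = 0.5, $F_1 = 2(0.8)(0.5)/(0.8 + 0.5) \approx 0.615$. A sketch:

```python
def f_beta(recall, precision, beta=1.0):
    """F_beta = (1 + beta^2) * R * P / (R + beta^2 * P); beta = 1 gives F1."""
    b2 = beta ** 2
    return (1 + b2) * recall * precision / (recall + b2 * precision)

print(f_beta(0.8, 0.5))  # ~0.615
```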
Receiver Operating Characteristic (ROC) Curve
Figure: Example of an ROC curve. As the classifier becomes more willing to say “+”, both the true positive rate and the false positive rate go up. Ideally, false positives go up much more slowly (the curve hugs the upper left).
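One common way to trace out such a curve, sketched below, is to sweep a decision threshold over real-valued classifier scores and record the true and false positive rates at each threshold. The score-based setup is an assumption on my part; the slides show only the picture:

```python
import numpy as np

def roc_curve(scores, t):
    """Sweep a threshold over classifier scores for binary labels t
    (1 = positive, 0 = negative); return (FPR, TPR) per threshold."""
    thresholds = np.sort(np.unique(scores))[::-1]
    P, N = (t == 1).sum(), (t == 0).sum()
    fpr, tpr = [], []
    for thresh in thresholds:
        pred = scores >= thresh  # classify as "+" above the threshold
        tpr.append((pred & (t == 1)).sum() / P)
        fpr.append((pred & (t == 0)).sum() / N)
    return np.array(fpr), np.array(tpr)
```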
Overfitting and Test Set
◮ Fitting and evaluating on the same data (for most evaluation metrics) results in overfitting.
◮ Overfitting occurs when a learning algorithm mistakes noise for signal, and incorporates idiosyncrasies of the training set into its decision rule.
◮ To combat overfitting, use different data for evaluation vs. fitting. This “held-out” data is called a test set.
Train vs. Test Error (KNN on Iris data)
[Figure: Training and test error rates for KNN on the Iris data as a function of K.]
Validation vs. Test Set
◮ If we have decisions left to make, then we should not look at the final test set. (Why not?)
◮ If we are going to select the best version of our method by optimizing on the test set, then we have no measure of absolute performance: test set performance is overly optimistic because it is cherry-picked.
◮ Instead, take the training set and (randomly) subdivide it into a training set and a validation set. Use the training set to fit the classifier, and the validation set to guide “higher-order” decisions (as sketched below).
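A minimal sketch of such a random subdivision; the 60/20/20 proportions are an arbitrary illustration:

```python
import numpy as np

def train_val_test_split(X, t, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition the data into training, validation, and
    test sets (the test set stays 'sealed' until the very end)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test, val = idx[:n_test], idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return (X[train], t[train]), (X[val], t[val]), (X[test], t[test])
```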
Validation vs. Test Error
[Figure: Training, validation, and test error rates for KNN as a function of K.]
Drawbacks of Simple Validation Approach
◮ Sacrificing training data degrades performance.
◮ If the validation set is too small, decisions will be based on noisy information.
◮ Partial solution: Divide the training set into K equal parts, or “folds”; give each fold the chance to serve as the validation set, and average the generalization performance.
◮ This yields “K-fold cross-validation” (note: this K is a completely separate choice from the K in KNN).
K-fold Cross Validation Algorithm
A. For each method, $M$, under consideration:
   1. Divide the training set into $K$ “folds” with (approximately) equal numbers of cases per fold. (Keep the test set “sealed.”)
   2. For $k = 1, \dots, K$:
      (a) Designate fold $k$ the “validation set,” and folds $1, \dots, k-1, k+1, \dots, K$ the training set.
      (b) “Train” the algorithm on the training set to yield classification rule $c_k$, and compute the error rate $\text{Err}_k$ on the validation set, e.g.,
      $$\text{Err}_k(M) = \frac{1}{|\text{Validation}|} \sum_{i \in \text{Validation}} I(c_k(\mathbf{x}_i) \neq t_i)$$
   3. Return the mean error rate across folds:
      $$\overline{\text{Err}}(M) = \frac{1}{K} \sum_{k=1}^{K} \text{Err}_k(M)$$
B. Select the $M$ with lowest $\overline{\text{Err}}(M)$.
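A sketch of this procedure; the `fit`/`predict` interface for a method `M` is my own assumption about how a candidate method would be packaged:

```python
import numpy as np

def kfold_cv_error(M, X, t, K=10, seed=0):
    """Mean validation error of method M across K folds.
    M must expose fit(X, t) and predict(X) (an assumed interface)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)  # ~equal folds
    errs = []
    for k in range(K):
        val = folds[k]  # fold k is the validation set
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        M.fit(X[train], t[train])
        errs.append(np.mean(M.predict(X[val]) != t[val]))  # Err_k(M)
    return np.mean(errs)  # select the M minimizing this
```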
Cross Validation Error
[Figure: Training error, 10-fold cross-validation error, and test error for KNN as a function of K.]