Evaluation of Classifiers




  1. Evaluation of Classifiers: ROC Curves, Reject Curves, Precision-Recall Curves, Statistical Tests – Estimating the error rate of a classifier – Comparing two classifiers – Estimating the error rate of a learning algorithm – Comparing two algorithms

  2. Cost-Sensitive Learning. In most applications, false positive and false negative errors are not equally important. We therefore want to adjust the tradeoff between them. Many learning algorithms provide a way to do this: – probabilistic classifiers: combine a cost matrix with decision theory to make classification decisions – discriminant functions: adjust the threshold for classifying into the positive class – ensembles: adjust the number of votes required to classify as positive
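
For the probabilistic case, here is a minimal sketch (the function name and example values are hypothetical) of combining a 2x2 cost matrix with decision theory to pick the classification threshold:

```python
import numpy as np

def cost_sensitive_predict(p_pos, cost_fp, cost_fn):
    # Predicting positive risks a false positive with expected cost
    # (1 - p) * cost_fp; predicting negative risks a false negative
    # with expected cost p * cost_fn. Predict positive when the latter
    # exceeds the former, i.e. when p > cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)
    return (np.asarray(p_pos) > threshold).astype(int)

# A false negative 10x as costly as a false positive lowers the
# threshold from 0.5 to 1/11, roughly 0.09.
print(cost_sensitive_predict([0.05, 0.12, 0.5, 0.9], cost_fp=1.0, cost_fn=10.0))
```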

  3. Example: 30 decision trees constructed by bagging. Classify as positive if K out of 30 trees predict positive. Vary K.
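
A rough sketch of that sweep, assuming scikit-learn and a synthetic dataset (all names below are illustrative, not from the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                        random_state=0).fit(X, y)

# Count positive votes per example across the 30 trees.
votes = sum(tree.predict(X) for tree in bag.estimators_)

# Sweep K: classify positive when at least K trees vote positive.
for K in range(0, 31, 5):
    pred = (votes >= K).astype(int)
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    print(f"K={K:2d}  FP={fp:3d}  FN={fn:3d}")
```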

  4. Directly Visualizing the Tradeoff. We can plot the false positives versus false negatives directly. If L(0,1) = R · L(1,0) (i.e., a FN is R times more expensive than a FP), then the best operating point will be tangent to a line with a slope of –R. If R = 1, we should set the threshold to 10; if R = 10, the threshold should be 29.
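
Continuing the hypothetical bagging sketch above (reusing its `votes` and `y`), the tradeoff curve could be drawn like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# `votes` and `y` come from the bagging sketch in the previous slide.
fps, fns = [], []
for K in range(31):
    pred = (votes >= K).astype(int)
    fps.append(int(np.sum((pred == 1) & (y == 0))))
    fns.append(int(np.sum((pred == 0) & (y == 1))))

plt.plot(fps, fns, marker="o")
plt.xlabel("False positives")
plt.ylabel("False negatives")
# For cost ratio R, pick the point where the curve touches a line
# of slope -R.
plt.show()
```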

  5. Receiver Operating Characteristic (ROC) Curve. It is traditional to plot this same information in a normalized form, with 1 – False Negative Rate (i.e., the True Positive Rate) plotted against the False Positive Rate. The optimal operating point is tangent to a line with a slope of R.
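
With scikit-learn, the curve can be generated from any real-valued scores; the tiny arrays below are made up for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.9]  # larger => more positive

fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (1 - FNR)")
plt.show()
```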

  6. Generating ROC Curves. Linear threshold units, sigmoid units, neural networks – adjust the classification threshold between 0 and 1. K nearest neighbor – adjust the number of votes (between 0 and k) required to classify as positive. Naïve Bayes, logistic regression, etc. – vary the probability threshold for classifying as positive. Support vector machines – require different margins for positive and negative examples.

  7. SVM: Asymmetric Margins. Minimize ||w||² + C ∑_i ξ_i subject to w · x_i + ξ_i ≥ R (positive examples) and −w · x_i + ξ_i ≥ 1 (negative examples).
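
scikit-learn does not expose this asymmetric-margin formulation directly; a common practical stand-in (a related but different knob) is to scale the slack penalty C per class via class_weight:

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# Penalize slack on positive examples R times as heavily, which pushes
# the boundary away from the positive class, much as requiring the
# larger margin R would.
R = 10.0
svm = SVC(kernel="linear", C=1.0, class_weight={0: 1.0, 1: R}).fit(X, y)
print(svm.predict([[1.5, 1.5]]))
```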

  8. ROC Convex Hull. If we have two classifiers h_1 and h_2 with (fp1, fn1) and (fp2, fn2), then we can construct a stochastic classifier that interpolates between them. Given a new data point x, we use classifier h_1 with probability p and h_2 with probability (1 − p). The resulting classifier has an expected false positive level of p fp1 + (1 − p) fp2 and an expected false negative level of p fn1 + (1 − p) fn2. This means that we can create a classifier that matches any point on the convex hull of the ROC curve.
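
A minimal sketch of the stochastic interpolation; the two always-predict stand-ins are hypothetical, and any two decision functions would do:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_predict(x, h1, h2, p):
    # Use h1 with probability p, otherwise h2. Over many points the
    # expected FP/FN rates interpolate linearly between the two.
    return h1(x) if rng.random() < p else h2(x)

h1 = lambda x: 1  # liberal: always predicts positive
h2 = lambda x: 0  # conservative: always predicts negative
preds = [stochastic_predict(x, h1, h2, p=0.3) for x in range(10)]
print(preds)  # roughly 30% positive predictions
```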

  9. ROC Convex Hull. [Figure: the original ROC curve with its convex hull overlaid.]

  10. Maximizing AUC. At learning time, we may not know the cost ratio R. In such cases, we can maximize the Area Under the ROC Curve (AUC). Efficient computation of AUC: – Assume h(x) returns a real quantity (larger values => class 1) – Sort the points according to h(x_i) and number them from 1 to N, so that r(i) = the rank of data point x_i – AUC = probability that a randomly chosen example from class 1 ranks above a randomly chosen example from class 0 = the Wilcoxon-Mann-Whitney statistic

  11. Computing AUC. Let S_1 = the sum of r(i) for y_i = 1 (the sum of the ranks of the positive examples). Then AUC = (S_1 − N_1(N_1 + 1)/2) / (N_0 N_1), where N_0 is the number of negative examples and N_1 is the number of positive examples.
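
A direct sketch of this rank formula (the function name is made up; SciPy's rankdata is assumed available for tie-aware ranking):

```python
import numpy as np
from scipy.stats import rankdata

def auc_from_ranks(scores, y):
    # Rank all scores, sum the ranks of the positives, then apply
    # AUC = (S1 - N1(N1 + 1)/2) / (N0 * N1).
    y = np.asarray(y)
    r = rankdata(scores)  # average ranks handle ties
    n1 = int(np.sum(y == 1))
    n0 = len(y) - n1
    s1 = r[y == 1].sum()
    return (s1 - n1 * (n1 + 1) / 2) / (n0 * n1)

scores = [0.1, 0.4, 0.35, 0.8]
y      = [0,   0,   1,    1  ]
print(auc_from_ranks(scores, y))  # 0.75, matching sklearn's roc_auc_score
```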

  12. Optimizing AUC. A hot topic in machine learning right now is developing algorithms for optimizing AUC. RankBoost: a modification of AdaBoost. The main idea is to define a “ranking loss” function and then penalize a training example x by the number of examples of the other class that are misranked (relative to x).
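
The sketch below only counts misranked pairs, the quantity that the ranking loss penalizes; it is not an implementation of RankBoost itself:

```python
import numpy as np

def ranking_loss(scores, y):
    # A pair is misranked when a positive example is scored at or
    # below a negative example; total such pairs over the data.
    s, y = np.asarray(scores, float), np.asarray(y)
    pos, neg = s[y == 1], s[y == 0]
    return int(np.sum(pos[:, None] <= neg[None, :]))

print(ranking_loss([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 1 (0.35 <= 0.4)
```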

  13. Rejection Curves. In most learning algorithms, we can specify a threshold for making a rejection decision: – probabilistic classifiers: adjust the cost of rejecting versus the costs of a FP and a FN – decision-boundary method: if a test point x is within θ of the decision boundary, then reject. This is equivalent to requiring that the “activation” of the best class be larger than that of the second-best class by at least θ.
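
A minimal sketch of the activation-margin rule (function name and the -1 "reject" convention are hypothetical):

```python
import numpy as np

def predict_with_reject(proba, theta):
    # Reject when the best-class activation beats the second best by
    # less than theta; otherwise return the argmax class.
    proba = np.asarray(proba)
    top2 = np.sort(proba, axis=1)[:, -2:]       # [second-best, best]
    margin = top2[:, 1] - top2[:, 0]
    pred = proba.argmax(axis=1)
    return np.where(margin >= theta, pred, -1)  # -1 marks "reject"

print(predict_with_reject([[0.55, 0.45], [0.9, 0.1]], theta=0.2))  # [-1  0]
```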

  14. Rejection Curves (2). Vary θ and plot the fraction correct versus the fraction rejected.

  15. Precision versus Recall. Information retrieval: – y = 1: document is relevant to the query – y = 0: document is irrelevant to the query – K: number of documents retrieved. Precision: the fraction of the K retrieved documents (ŷ = 1) that are actually relevant (y = 1) = TP / (TP + FP). Recall: the fraction of all relevant documents that are retrieved = TP / (TP + FN) = the true positive rate.
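
A short sketch of precision and recall at a retrieval cutoff K (names and data are illustrative):

```python
import numpy as np

def precision_recall_at_k(scores, y, k):
    # Retrieve the top-k documents by score; precision is the relevant
    # fraction of those k, recall the retrieved fraction of all
    # relevant documents.
    y = np.asarray(y)
    retrieved = np.argsort(scores)[::-1][:k]
    tp = int(y[retrieved].sum())
    return tp / k, tp / int(y.sum())

scores = [0.9, 0.2, 0.75, 0.4, 0.6]
y      = [1,   0,   1,    1,   0  ]
print(precision_recall_at_k(scores, y, k=3))  # (2/3, 2/3)
```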

  16. Precision-Recall Graph. Plot recall on the horizontal axis and precision on the vertical axis, varying the threshold for making positive predictions (or varying K).
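
With scikit-learn, the whole curve comes from one call (the small arrays are the same made-up scores used in the ROC sketch):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```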

