Evaluation of Classifiers




  1. Evaluation of Classifiers: ROC Curves, Reject Curves, Precision-Recall Curves, Statistical Tests – Estimating the error rate of a classifier – Comparing two classifiers – Estimating the error rate of a learning algorithm – Comparing two algorithms

  2. Cost-Sensitive Learning. In most applications, false positive and false negative errors are not equally important. We therefore want to adjust the tradeoff between them. Many learning algorithms provide a way to do this: – probabilistic classifiers: combine a cost matrix with decision theory to make classification decisions – discriminant functions: adjust the threshold for classifying into the positive class – ensembles: adjust the number of votes required to classify as positive
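
For the probabilistic case, here is a minimal sketch (the function name and example values are hypothetical) of combining a 2x2 cost matrix with decision theory to pick the classification threshold:

```python
import numpy as np

def cost_sensitive_predict(p_pos, cost_fp, cost_fn):
    # Predicting positive risks a false positive with expected cost
    # (1 - p) * cost_fp; predicting negative risks a false negative
    # with expected cost p * cost_fn. Predict positive when the latter
    # exceeds the former, i.e. when p > cost_fp / (cost_fp + cost_fn).
    threshold = cost_fp / (cost_fp + cost_fn)
    return (np.asarray(p_pos) > threshold).astype(int)

# A false negative 10x as costly as a false positive lowers the
# threshold from 0.5 to 1/11, roughly 0.09.
print(cost_sensitive_predict([0.05, 0.12, 0.5, 0.9], cost_fp=1.0, cost_fn=10.0))
```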

  3. Example: 30 decision trees constructed by bagging. Classify as positive if K out of 30 trees predict positive. Vary K.
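
A rough sketch of that sweep, assuming scikit-learn and a synthetic dataset (all names below are illustrative, not from the slides):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=30,
                        random_state=0).fit(X, y)

# Count positive votes per example across the 30 trees.
votes = sum(tree.predict(X) for tree in bag.estimators_)

# Sweep K: classify positive when at least K trees vote positive.
for K in range(0, 31, 5):
    pred = (votes >= K).astype(int)
    fp = int(np.sum((pred == 1) & (y == 0)))
    fn = int(np.sum((pred == 0) & (y == 1)))
    print(f"K={K:2d}  FP={fp:3d}  FN={fn:3d}")
```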

  4. Directly Visualizing the Tradeoff. We can plot the false positives versus false negatives directly. If L(0,1) = R · L(1,0) (i.e., a FN is R times more expensive than a FP), then the best operating point will be tangent to a line with a slope of –R. If R = 1, we should set the threshold to 10; if R = 10, the threshold should be 29.
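
Continuing the hypothetical bagging sketch above (reusing its `votes` and `y`), the tradeoff curve could be drawn like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# `votes` and `y` come from the bagging sketch in the previous slide.
fps, fns = [], []
for K in range(31):
    pred = (votes >= K).astype(int)
    fps.append(int(np.sum((pred == 1) & (y == 0))))
    fns.append(int(np.sum((pred == 0) & (y == 1))))

plt.plot(fps, fns, marker="o")
plt.xlabel("False positives")
plt.ylabel("False negatives")
# For cost ratio R, pick the point where the curve touches a line
# of slope -R.
plt.show()
```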

  5. Receiver Operating Characteristic (ROC) Curve. It is traditional to plot this same information in a normalized form, with 1 – False Negative Rate (i.e., the True Positive Rate) plotted against the False Positive Rate. The optimal operating point is tangent to a line with a slope of R.
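
With scikit-learn, the curve can be generated from any real-valued scores; the tiny arrays below are made up for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.9]  # larger => more positive

fpr, tpr, thresholds = roc_curve(y_true, scores)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (1 - FNR)")
plt.show()
```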

  6. Generating ROC Curves. Linear threshold units, sigmoid units, neural networks – adjust the classification threshold between 0 and 1. K nearest neighbor – adjust the number of votes (between 0 and k) required to classify as positive. Naïve Bayes, logistic regression, etc. – vary the probability threshold for classifying as positive. Support vector machines – require different margins for positive and negative examples.

  7. SVM: Asymmetric Margins. Minimize ||w||² + C ∑_i ξ_i subject to w · x_i + ξ_i ≥ R (positive examples) and −w · x_i + ξ_i ≥ 1 (negative examples).
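
scikit-learn does not expose this asymmetric-margin formulation directly; a common practical stand-in (a related but different knob) is to scale the slack penalty C per class via class_weight:

```python
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# Penalize slack on positive examples R times as heavily, which pushes
# the boundary away from the positive class, much as requiring the
# larger margin R would.
R = 10.0
svm = SVC(kernel="linear", C=1.0, class_weight={0: 1.0, 1: R}).fit(X, y)
print(svm.predict([[1.5, 1.5]]))
```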

  8. ROC Convex Hull. If we have two classifiers h_1 and h_2 with (fp1, fn1) and (fp2, fn2), then we can construct a stochastic classifier that interpolates between them. Given a new data point x, we use classifier h_1 with probability p and h_2 with probability (1 − p). The resulting classifier has an expected false positive level of p fp1 + (1 − p) fp2 and an expected false negative level of p fn1 + (1 − p) fn2. This means that we can create a classifier that matches any point on the convex hull of the ROC curve.
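
A minimal sketch of the stochastic interpolation; the two always-predict stand-ins are hypothetical, and any two decision functions would do:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_predict(x, h1, h2, p):
    # Use h1 with probability p, otherwise h2. Over many points the
    # expected FP/FN rates interpolate linearly between the two.
    return h1(x) if rng.random() < p else h2(x)

h1 = lambda x: 1  # liberal: always predicts positive
h2 = lambda x: 0  # conservative: always predicts negative
preds = [stochastic_predict(x, h1, h2, p=0.3) for x in range(10)]
print(preds)  # roughly 30% positive predictions
```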

  9. ROC Convex Hull. [Figure: the original ROC curve with its convex hull overlaid.]

  10. Maximizing AUC. At learning time, we may not know the cost ratio R. In such cases, we can maximize the Area Under the ROC Curve (AUC). Efficient computation of AUC: – Assume h(x) returns a real quantity (larger values => class 1) – Sort the points according to h(x_i) and number them from 1 to N, so that r(i) = the rank of data point x_i – AUC = probability that a randomly chosen example from class 1 ranks above a randomly chosen example from class 0 = the Wilcoxon-Mann-Whitney statistic

  11. Computing AUC. Let S_1 = the sum of r(i) for y_i = 1 (the sum of the ranks of the positive examples). Then AUC = (S_1 − N_1(N_1 + 1)/2) / (N_0 N_1), where N_0 is the number of negative examples and N_1 is the number of positive examples.
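
A direct sketch of this rank formula (the function name is made up; SciPy's rankdata is assumed available for tie-aware ranking):

```python
import numpy as np
from scipy.stats import rankdata

def auc_from_ranks(scores, y):
    # Rank all scores, sum the ranks of the positives, then apply
    # AUC = (S1 - N1(N1 + 1)/2) / (N0 * N1).
    y = np.asarray(y)
    r = rankdata(scores)  # average ranks handle ties
    n1 = int(np.sum(y == 1))
    n0 = len(y) - n1
    s1 = r[y == 1].sum()
    return (s1 - n1 * (n1 + 1) / 2) / (n0 * n1)

scores = [0.1, 0.4, 0.35, 0.8]
y      = [0,   0,   1,    1  ]
print(auc_from_ranks(scores, y))  # 0.75, matching sklearn's roc_auc_score
```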

  12. Optimizing AUC. A hot topic in machine learning right now is developing algorithms for optimizing AUC. RankBoost: a modification of AdaBoost. The main idea is to define a “ranking loss” function and then penalize a training example x by the number of examples of the other class that are misranked (relative to x).
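
The sketch below only counts misranked pairs, the quantity that the ranking loss penalizes; it is not an implementation of RankBoost itself:

```python
import numpy as np

def ranking_loss(scores, y):
    # A pair is misranked when a positive example is scored at or
    # below a negative example; total such pairs over the data.
    s, y = np.asarray(scores, float), np.asarray(y)
    pos, neg = s[y == 1], s[y == 0]
    return int(np.sum(pos[:, None] <= neg[None, :]))

print(ranking_loss([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 1 (0.35 <= 0.4)
```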

  13. Rejection Curves. In most learning algorithms, we can specify a threshold for making a rejection decision: – probabilistic classifiers: adjust the cost of rejecting versus the costs of a FP and a FN – decision-boundary method: if a test point x is within θ of the decision boundary, then reject. This is equivalent to requiring that the “activation” of the best class be larger than that of the second-best class by at least θ.
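
A minimal sketch of the activation-margin rule (function name and the -1 "reject" convention are hypothetical):

```python
import numpy as np

def predict_with_reject(proba, theta):
    # Reject when the best-class activation beats the second best by
    # less than theta; otherwise return the argmax class.
    proba = np.asarray(proba)
    top2 = np.sort(proba, axis=1)[:, -2:]       # [second-best, best]
    margin = top2[:, 1] - top2[:, 0]
    pred = proba.argmax(axis=1)
    return np.where(margin >= theta, pred, -1)  # -1 marks "reject"

print(predict_with_reject([[0.55, 0.45], [0.9, 0.1]], theta=0.2))  # [-1  0]
```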

  14. Rejection Curves (2). Vary θ and plot the fraction correct versus the fraction rejected.

  15. Precision versus Recall. Information retrieval: – y = 1: document is relevant to the query – y = 0: document is irrelevant to the query – K: number of documents retrieved. Precision: the fraction of the K retrieved documents (ŷ = 1) that are actually relevant (y = 1) = TP / (TP + FP). Recall: the fraction of all relevant documents that are retrieved = TP / (TP + FN) = the true positive rate.
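
A short sketch of precision and recall at a retrieval cutoff K (names and data are illustrative):

```python
import numpy as np

def precision_recall_at_k(scores, y, k):
    # Retrieve the top-k documents by score; precision is the relevant
    # fraction of those k, recall the retrieved fraction of all
    # relevant documents.
    y = np.asarray(y)
    retrieved = np.argsort(scores)[::-1][:k]
    tp = int(y[retrieved].sum())
    return tp / k, tp / int(y.sum())

scores = [0.9, 0.2, 0.75, 0.4, 0.6]
y      = [1,   0,   1,    1,   0  ]
print(precision_recall_at_k(scores, y, k=3))  # (2/3, 2/3)
```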

  16. Precision-Recall Graph. Plot recall on the horizontal axis and precision on the vertical axis, varying the threshold for making positive predictions (or varying K).
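
With scikit-learn, the whole curve comes from one call (the small arrays are the same made-up scores used in the ROC sketch):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```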

