Evaluating classifiers CS440 The 2-by-2 contingency table correct - PDF document

10/30/19 Evaluating classifiers CS440 The 2-by-2 contingency table correct not correct positive tp fp prediction negative fn tn prediction 1

10/30/19 Precision and recall • Precision : % of predicted items that are correct P = tp/ (tp + fp) • Recall : % of correct items that are predicted R = tp / (tp + fn) correct not correct positive tp fp prediction negative fn tn prediction Combining precision and recall with the F measure • A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean): 2 1 ( 1 ) PR β + F = = 1 1 2 P R β + ( 1 ) α + − α P R • The harmonic mean is a very conservative average (tends towards the minimum) • People usually use the F1 measure – i.e., with β = 1 (that is, α = ½ ): – F 1 = 2 PR /( P + R ) 2

10/30/19 Sec.14.5 Prediction problems with more than two classes • Dealing with muti-class problems – A document can belong one of several classes. • Given test doc d, – Evaluate it for membership in each class – d belongs to the class which has the maximum score Sec.14.5 Prediction problems with more than two classes • Dealing with muti-label problems – A document can belong to 0, 1, or >1 classes. • Given test doc d, – Evaluate it for membership in each class – d belongs to any class for which the score is higher than some threshold 3

10/30/19 Sec. 15.2.4 Evaluation: the Reuters-21578 Data Set • Most (over)used data set, 21,578 docs (each 90 types, 200 toknens) • 9603 training, 3299 test articles (ModApte/Lewis split) • 118 categories – An article can be in more than one category – Learn 118 binary category distinctions • Each document has 1.24 classes on average • Only about 10 out of 118 categories are large • Trade (369,119) • Earn (2877, 1087) • Interest (347, 131) • Acquisitions (1650, 179) Common categories • Ship (197, 89) • Money-fx (538, 179) (#train, #test) • Grain (433, 149) • Wheat (212, 71) • Corn (182, 56) • Crude (389, 189) Sec. 15.2.4 Reuters-21578 example document <REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE> <TOPICS><D>livestock</D><D>hog</D></TOPICS> <TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter </BODY></TEXT></REUTERS > 4

10/30/19 Confusion matrix • For each pair of classes <c 1 ,c 2 > how many documents from c 1 were assigned to c 2 ? Docs in test set Assigned Assigned Assigned Assigned Assigned Assigned UK poultry wheat coffee interest trade True UK 95 1 13 0 1 0 True poultry 0 1 0 0 0 0 True wheat 10 90 0 1 0 0 True coffee 0 0 0 34 3 7 True interest 0 1 2 13 26 5 True trade 0 0 2 14 5 10 Sec. 15.2.4 Per class evaluation measures c ii Recall : ∑ c ij Fraction of documentss in class i classified correctly j Precision : c ii Fraction of documents assigned class i that are actually ∑ c ji about class i j ∑ c ii Accuracy : (1 - error rate) i ∑ ∑ Fraction of docs classified correctly c ij j i 5

10/30/19 Sec. 15.2.4 Micro- vs. Macro-Averaging If we have two or more classes, how do we combine multiple performance measures into one quantity? • Macroaveraging : Compute performance for each class, then average. • Microaveraging : Collect decisions for all classes, compute contingency table, evaluate. Sec. 15.2.4 Micro- vs. Macro-Averaging: Example Class 1 Class 2 Micro Ave. Table Truth: Truth: Truth: Truth: Truth: Truth: yes no yes no yes no Classifier: yes 10 10 Classifier: yes 90 10 Classifier: yes 100 20 Classifier: no 10 970 Classifier: no 10 890 Classifier: no 20 1860 • Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 • Microaveraged precision: 100/120 = .83 • Microaveraged score is dominated by score on common classes 12 6

10/30/19 Validation sets and cross-validation Training Set ValidaGon Set Test Set • Metric: P/R/F1 or Accuracy • Unseen test set Training Set Val Set – avoid overfitting (‘tuning to the test set’) – more realistic estimate of performance Training Set Val Set Cross-validation over multiple splits – Pool results over each split Training Set Val Set – Compute pooled validation set performance Test Set Sec. 15.3.1 The Real World • Gee, I’m building a text classifier for real, now! • What should I do? 7

10/30/19 Sec. 15.3.1 No training data? Manually written rules If (wheat or grain) and not (whole or bread) then Categorize as grain • Need careful crafting – Human tuning on development data – Time-consuming Sec. 15.3.1 Very little data? • Use Naïve Bayes • Get more labeled data – Find clever ways to get humans to label data for you • Try semi-supervised training methods 8

10/30/19 Sec. 15.3.1 A reasonable amount of data? • Perfect for all the clever classifiers – SVM – Regularized Logistic Regression – Random forests and boosting classifiers – Deep neural networks • Look into decision trees – Considered a more interpretable classifier Sec. 15.3.1 A huge amount of data? • Can achieve high accuracy! • At a cost: – SVMs/deep neural networks (train time) or kNN (test time) can be too slow – Regularized logistic regression can be somewhat better • So Naïve Bayes can come back into its own again! 9

10/30/19 Sec. 15.3.1 Accuracy as a function of dataset size • Classification accuracy improves with amount of data Brill and Banko on spelling correcGon Real-world systems generally combine: • Automatic classification • Manual review of uncertain/difficult cases 10

10/30/19 Sec. 15.3.2 How to tweak performance • Domain-specific features and weights: very important in real performance • Sometimes need to collapse terms: – Part numbers, chemical formulas, … – But stemming generally doesn’t help • Upweighting: Counting a word as if it occurred twice: – title words (Cohen & Singer 1996) – first sentence of each paragraph (Murata, 1999) – In sentences that contain title words (Ko et al, 2002) 11

Evaluating classifiers CS440 The 2-by-2 contingency table correct - PDF document

10/30/19 Evaluating classifiers CS440 The 2-by-2 contingency table correct not correct positive tp fp prediction negative fn tn prediction 1 10/30/19 Precision and recall Precision : % of predicted items that are correct P = tp/

Nonlinear Classifiers II 2 Nonlinear Classifiers: Introduction Classifiers Supervised

Cognitive Modeling Unseen Examples 2 Bayes Classifiers Lecture 14: Naive Bayes Classifiers

Fusion of Continuous Output Classifiers Classifiers Jacob Hays Amit Pillay James DeFelice

Machine Learning Nave Bayes classifiers Types of classifiers We can divide the large

Occasion-level Classifiers or Event-level Classifiers? -Evidence from Child Language Acquisition

CS440/ECE448 Lecture 22: Including Slides by Svetlana Lazebnik, 10/2016 Linear Classifiers

Data Dependence in Data Dependence in Combining Classifiers Combining Classifiers Mohamed

Automatically Evading Classifiers A Case Study on PDF Malware Classifiers Weilin Xu

Evaluation of Classifiers Evaluation of Classifiers ROC Curves ROC Curves Reject Curves Reject

Linear Classifiers: Expressiveness Machine Learning 1 Lecture outline Linear models:

On Robust Trimming of Bayesian Network Classifiers YooJung Choi and Guy Van den Broeck UCLA

Visualization for Explainable Classifiers Yao MING THE HONG KONG UNIVERSITY OF SCIENCE AND

Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers

MAXIMUM MARGIN CLASSIFIERS MAXIMUM MARGIN CLASSIFIERS Matthieu R Bloch Tuesday, February 11,

Linear, Binary SVM Classifiers COMPSCI 371D Machine Learning COMPSCI 371D Machine

Off- -The The- -Shelf Classifiers Shelf Classifiers Off A method that can be applied directly

Open Cavity Resonators The Orpheus Experiment Gray Rybka, University of Washington Workshop on

2018 CCIM President Carole Brill, CCIM 2018 Commercial Real Estate Forecasts Presented by

Large-Scale Click- stream and transaction log mining in practice Uwe Mayer, Nish Parikh, Gyanit

Nave Bayes CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Terminology: bag-of-words

Morphology and Corpora: Introduction Marco Baroni University of Bologna Granada Morphology

Information Extraction and Question-Answering Systems Foundations and methods Dr. Gnter

HMM Can Find Pretty Good POS Taggers (When Given a Good Start) Yoav Goldberg Meni Adler Michael

higher genus partition functions from three dimensional gravity Henry Maxfield 1601.00980 with