An Empirical Comparison of Text Categorization Methods (PowerPoint presentation)

SLIDE 1

An Empirical Comparison of Text Categorization Methods

Ana Cardoso-Cachopo and Arlindo L. Oliveira

acardoso@gia.ist.utl.pt and aml@inesc-id.pt

Instituto Superior Técnico / ALGOS-INESC-ID

An Empirical Comparison of Text Categorization Methods – p. 1/16

SLIDE 2

Outline

- Data sets
- Information Retrieval methods
- Evaluation
- Experimental setup
- Results
- Conclusions

SLIDE 3

Data sets

C10 (in Portuguese)
- 461 help desk messages, with answers
- 10 classes, 34 to 58 messages each

mini20 (in English)
- 2000-message subset of 20Newsgroups
- 100 messages for each newsgroup

Pre-processing
- Discard words shorter than 3 characters
- Discard words longer than 20 characters
- Remove numbers and non-letter characters
- Case and special-character unification
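A minimal Python sketch of these pre-processing steps, assuming a regex tokenizer with ASCII-only letters (the exact tokenization and accent handling for the Portuguese C10 data are not specified on the slide):

```python
import re

def preprocess(text):
    """Filtering steps from the slide: case unification, removal of
    numbers and non-letter characters, and length-based filtering."""
    text = text.lower()                    # case unification
    tokens = re.findall(r"[a-z]+", text)   # drop numbers / non-letters
    # keep only words of 3 to 20 characters
    return [t for t in tokens if 3 <= len(t) <= 20]

print(preprocess("Re: ERROR 404 on node-17, I/O timeout!"))
# -> ['error', 'node', 'timeout']
```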

SLIDE 4

IR methods

- Vector model
- Latent Semantic Analysis/Indexing (LSA)
- Support Vector Machines (SVM)
- k-NN Vector
- k-NN LSA

SLIDE 5

IR methods — Vector

Words ~ terms

Docs are vectors in an N-dimensional space. Similarity between docs is the cosine of the angle formed by the vectors representing the docs. A doc's class is the class of the most similar doc.
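A minimal sketch of this nearest-neighbour rule over term-frequency vectors (plain counts; the slides do not name a term-weighting scheme, and the example docs are made up):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query_tokens, train):
    """Return the class of the single most similar training doc."""
    q = Counter(query_tokens)
    _, best_class = max(
        ((cosine(q, Counter(tokens)), cls) for tokens, cls in train),
        key=lambda pair: pair[0])
    return best_class

train = [(["printer", "jam", "paper"], "hardware"),
         (["password", "reset", "login"], "accounts")]
print(classify(["forgot", "login", "password"], train))  # -> accounts
```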

SLIDE 6

IR methods — LSA

Words ~ terms

Docs are vectors in an N-dimensional space. Apply Singular Value Decomposition to obtain an M-dimensional (M << N) space representing concepts. Similarity between docs is the cosine of the angle formed by the vectors representing the docs in this lower-dimensional space. A doc's class is the class of the most similar doc.
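A sketch of the SVD truncation step with NumPy on a toy term-document matrix (the matrix, the term labels, and M = 2 are illustrative assumptions, not from the paper's data):

```python
import numpy as np

# Toy term-document matrix: rows = N terms, columns = 4 docs.
A = np.array([[1., 1., 0., 0.],   # term "disk"
              [1., 0., 0., 0.],   # term "crash"
              [0., 1., 1., 0.],   # term "error"
              [0., 0., 1., 1.],   # term "login"
              [0., 0., 0., 1.]])  # term "vpn"

U, s, Vt = np.linalg.svd(A, full_matrices=False)

M = 2  # keep M << N singular values: the "concept" dimensions
docs_lsa = (np.diag(s[:M]) @ Vt[:M]).T  # each row: a doc in concept space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Doc 0 shares a term with doc 1 but none with doc 3, and the
# concept space preserves that: doc 0 is closer to doc 1.
print(cos(docs_lsa[0], docs_lsa[1]) > cos(docs_lsa[0], docs_lsa[3]))
# -> True
```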

SLIDE 7

IR methods — SVM

Words ~ terms

Docs are vectors in an N-dimensional space. Transform the space using a kernel function, then find a decision surface for each class that separates it from the others. Use a one-against-one or one-against-all approach for multiclass problems: a doc belongs to the class that receives the most "belongs" votes.
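The one-against-one voting can be sketched as below; `decide` stands in for a set of trained pairwise SVMs, and `toy_decide` is a hypothetical character-overlap stand-in for illustration, not a real kernel classifier:

```python
from itertools import combinations
from collections import Counter

def ovo_predict(x, classes, decide):
    """One-against-one multiclass voting.

    decide(a, b, x) is assumed to be a trained binary classifier for
    the class pair (a, b): it returns a when x falls on a's side of
    the decision surface, else b.  The doc gets the class with the
    most votes over all pairs.
    """
    votes = Counter(decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy stand-in for trained pairwise classifiers: vote for the class
# whose name shares more characters with the doc (illustration only).
def toy_decide(a, b, x):
    return a if len(set(a) & set(x)) >= len(set(b) & set(x)) else b

print(ovo_predict("spark plug", ["auto", "space", "sports"], toy_decide))
# -> space
```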

SLIDE 8

IR methods — k-NN Vector / k-NN LSA

Words ~ terms

Docs are vectors in an N-dimensional space. A doc's class is the most heavily weighted class among its k nearest neighbours. Each neighbour's weight is the cosine similarity in the Vector/LSA space.
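A sketch of the weighted vote in the term space, assuming cosine similarity as the weight (k-NN LSA would apply the same rule to the reduced-dimension vectors; the example docs are made up):

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query_tokens, train, k=3):
    """Each of the k most similar training docs votes for its class,
    weighted by its cosine similarity to the query."""
    q = Counter(query_tokens)
    neighbours = sorted(
        ((cosine(q, Counter(tokens)), cls) for tokens, cls in train),
        key=lambda pair: pair[0], reverse=True)[:k]
    weight = defaultdict(float)
    for sim, cls in neighbours:
        weight[cls] += sim
    return max(weight, key=weight.get)

train = [(["disk", "full", "error"], "hardware"),
         (["disk", "crash"], "hardware"),
         (["vpn", "login", "error"], "network"),
         (["vpn", "timeout"], "network")]
print(knn_classify(["disk", "error"], train, k=3))  # -> hardware
```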

SLIDE 9

IR methods — overview

[Diagram: words ~ terms define an N-dimensional term space; SVD reduces it to an M << N dimensional concept space. Vector and k-NN Vector work in the term space, LSA and k-NN LSA in the concept space, all using cosine similarity (directly or via kNN); SVM applies an RBF kernel in the term space with a voting strategy.]

SLIDE 10

Evaluation

Text Categorization task:
- Each document has exactly ONE category (so Recall is not important)
- The rank of the first correct answer matters (so Precision alone is not enough)
- Preferably a single number

Mean Reciprocal Rank (MRR): the MRR of an individual query is the reciprocal of the rank at which the first correct response was returned, or 0 if none of the returned responses contained a correct answer. The score for a sequence of queries is the mean of the individual queries' reciprocal ranks.
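The MRR definition above can be sketched directly:

```python
def mrr(ranked_answers, correct):
    """Mean Reciprocal Rank over a sequence of queries.

    ranked_answers[i] is the ranked list of predicted classes for
    query i; correct[i] is its single true class.  Each query scores
    1/rank of the first correct answer, or 0 if it never appears.
    """
    total = 0.0
    for ranking, truth in zip(ranked_answers, correct):
        for rank, answer in enumerate(ranking, start=1):
            if answer == truth:
                total += 1.0 / rank
                break
    return total / len(correct)

print(mrr([["a", "b"], ["b", "a"], ["c", "b"]], ["a", "a", "a"]))
# (1 + 1/2 + 0) / 3 -> 0.5
```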

SLIDE 11

Experimental setup

[Diagram: the IREP evaluation pipeline (read documents, filter documents, test IR models, write results), producing IREP documents, filtered documents, and IREP results. External packages supply the models: IGLU (Vector), FAQO (LSA/I), ..., LIBSVM (SVMs).]

SLIDE 12

Results – C10

[Figure: Mean Reciprocal Rank vs. number of terms (500, 1000, 1500, 2000, 2500, all terms) on C10, for LSA, k-NN LSA, Vector, k-NN Vector, and SVM; the MRR axis spans 0.65 to 1.]

SLIDE 13

Results – mini20

[Figure: Mean Reciprocal Rank vs. number of terms (500, 1000, 1500, 2000, 2500, all terms) on mini20, for LSA, k-NN LSA, Vector, k-NN Vector, and SVM; the MRR axis spans 0.35 to 0.85.]

SLIDE 14

Significance Tests – C10

Results of the t-test for dataset C10 (p-values):

             SVM    k-NN LSA   LSA      k-NN Vector   Vector
SVM          -      0.1581     0.0001   0.0004        0.0012
k-NN LSA            -          0.0001   0.0007        0.0020
LSA                            -        0.0026        0.0038
k-NN Vector                             -             0.7221
Vector                                                -

SVM and k-NN LSA are statistically indistinguishable (p = 0.1581), as are Vector and k-NN Vector (p = 0.7221); every other pairwise difference is significant. The slide's ranking diagram places k-NN LSA and SVM above Vector, k-NN Vector, and LSA.

SLIDE 15

Significance Tests – mini20

Results of the t-test for dataset mini20 (p-values):

             SVM    k-NN LSA   LSA      k-NN Vector   Vector
SVM          -      0.0010     0.0001   0.0000        0.0001
k-NN LSA            -          0.0001   0.0000        0.0000
LSA                            -        0.0004        0.0000
k-NN Vector                             -             0.0079
Vector                                                -

All pairwise differences are significant, yielding the ranking SVM > k-NN LSA > LSA > Vector > k-NN Vector.

SLIDE 16

Conclusions

- 2500 is a good upper bound for the number of terms
- k-NN LSA and SVM perform best; both are significantly better than the other methods
- MRR is a useful measure for one-class Text Categorization tasks
