An Empirical Comparison of Text Categorization Methods (PowerPoint presentation)

SLIDE 1

An Empirical Comparison of Text Categorization Methods

Ana Cardoso-Cachopo and Arlindo L. Oliveira

acardoso@gia.ist.utl.pt and aml@inesc-id.pt

Instituto Superior Técnico / ALGOS-INESC-ID

An Empirical Comparison of Text Categorization Methods – p. 1/16

SLIDE 2

Outline

- Data sets
- Information Retrieval methods
- Evaluation
- Experimental setup
- Results
- Conclusions

SLIDE 3

Data sets

C10 (in Portuguese)
- 461 help desk messages, with answers
- 10 classes, 34 to 58 messages each

mini20 (in English)
- 2000-message subset of 20Newsgroups
- 100 messages for each newsgroup

Pre-processing
- Discard words shorter than 3 characters
- Discard words longer than 20 characters
- Remove numbers and non-letter characters
- Case and special-character unification
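A minimal Python sketch of these pre-processing steps, assuming a regex tokenizer with ASCII-only letters (the exact tokenization and accent handling for the Portuguese C10 data are not specified on the slide):

```python
import re

def preprocess(text):
    """Filtering steps from the slide: case unification, removal of
    numbers and non-letter characters, and length-based filtering."""
    text = text.lower()                    # case unification
    tokens = re.findall(r"[a-z]+", text)   # drop numbers / non-letters
    # keep only words of 3 to 20 characters
    return [t for t in tokens if 3 <= len(t) <= 20]

print(preprocess("Re: ERROR 404 on node-17, I/O timeout!"))
# -> ['error', 'node', 'timeout']
```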

SLIDE 4

IR methods

- Vector model
- Latent Semantic Analysis/Indexing (LSA)
- Support Vector Machines (SVM)
- k-NN Vector
- k-NN LSA

SLIDE 5

IR methods — Vector

Words ~ terms

Docs are vectors in an N-dimensional space. Similarity between docs is the cosine of the angle formed by the vectors representing the docs. A doc's class is the class of the most similar doc.
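A minimal sketch of this nearest-neighbour rule over term-frequency vectors (plain counts; the slides do not name a term-weighting scheme, and the example docs are made up):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(query_tokens, train):
    """Return the class of the single most similar training doc."""
    q = Counter(query_tokens)
    _, best_class = max(
        ((cosine(q, Counter(tokens)), cls) for tokens, cls in train),
        key=lambda pair: pair[0])
    return best_class

train = [(["printer", "jam", "paper"], "hardware"),
         (["password", "reset", "login"], "accounts")]
print(classify(["forgot", "login", "password"], train))  # -> accounts
```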

SLIDE 6

IR methods — LSA

Words ~ terms

Docs are vectors in an N-dimensional space. Apply Singular Value Decomposition to obtain an M-dimensional (M << N) space representing concepts. Similarity between docs is the cosine of the angle formed by the vectors representing the docs in this lower-dimensional space. A doc's class is the class of the most similar doc.
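A sketch of the SVD truncation step with NumPy on a toy term-document matrix (the matrix, the term labels, and M = 2 are illustrative assumptions, not from the paper's data):

```python
import numpy as np

# Toy term-document matrix: rows = N terms, columns = 4 docs.
A = np.array([[1., 1., 0., 0.],   # term "disk"
              [1., 0., 0., 0.],   # term "crash"
              [0., 1., 1., 0.],   # term "error"
              [0., 0., 1., 1.],   # term "login"
              [0., 0., 0., 1.]])  # term "vpn"

U, s, Vt = np.linalg.svd(A, full_matrices=False)

M = 2  # keep M << N singular values: the "concept" dimensions
docs_lsa = (np.diag(s[:M]) @ Vt[:M]).T  # each row: a doc in concept space

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Doc 0 shares a term with doc 1 but none with doc 3, and the
# concept space preserves that: doc 0 is closer to doc 1.
print(cos(docs_lsa[0], docs_lsa[1]) > cos(docs_lsa[0], docs_lsa[3]))
# -> True
```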

SLIDE 7

IR methods — SVM

Words ~ terms

Docs are vectors in an N-dimensional space. Transform the space using a kernel function, then find a decision surface for each class that separates it from the others. Use a one-against-one or one-against-all approach for multiclass problems: a doc belongs to the class that receives the most "belongs" votes.
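The one-against-one voting can be sketched as below; `decide` stands in for a set of trained pairwise SVMs, and `toy_decide` is a hypothetical character-overlap stand-in for illustration, not a real kernel classifier:

```python
from itertools import combinations
from collections import Counter

def ovo_predict(x, classes, decide):
    """One-against-one multiclass voting.

    decide(a, b, x) is assumed to be a trained binary classifier for
    the class pair (a, b): it returns a when x falls on a's side of
    the decision surface, else b.  The doc gets the class with the
    most votes over all pairs.
    """
    votes = Counter(decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]

# Toy stand-in for trained pairwise classifiers: vote for the class
# whose name shares more characters with the doc (illustration only).
def toy_decide(a, b, x):
    return a if len(set(a) & set(x)) >= len(set(b) & set(x)) else b

print(ovo_predict("spark plug", ["auto", "space", "sports"], toy_decide))
# -> space
```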

SLIDE 8

IR methods — k-NN Vector / k-NN LSA

Words ~ terms

Docs are vectors in an N-dimensional space. A doc's class is the most heavily weighted class among its k nearest neighbours. Each neighbour's weight is the cosine similarity in the Vector/LSA space.
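A sketch of the weighted vote in the term space, assuming cosine similarity as the weight (k-NN LSA would apply the same rule to the reduced-dimension vectors; the example docs are made up):

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(query_tokens, train, k=3):
    """Each of the k most similar training docs votes for its class,
    weighted by its cosine similarity to the query."""
    q = Counter(query_tokens)
    neighbours = sorted(
        ((cosine(q, Counter(tokens)), cls) for tokens, cls in train),
        key=lambda pair: pair[0], reverse=True)[:k]
    weight = defaultdict(float)
    for sim, cls in neighbours:
        weight[cls] += sim
    return max(weight, key=weight.get)

train = [(["disk", "full", "error"], "hardware"),
         (["disk", "crash"], "hardware"),
         (["vpn", "login", "error"], "network"),
         (["vpn", "timeout"], "network")]
print(knn_classify(["disk", "error"], train, k=3))  # -> hardware
```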

SLIDE 9

IR methods — overview

[Diagram: words ~ terms define an N-dimensional term space; SVD reduces it to an M << N dimensional concept space. Vector and k-NN Vector work in the term space, LSA and k-NN LSA in the concept space, all using cosine similarity (directly or via kNN); SVM applies an RBF kernel in the term space with a voting strategy.]

SLIDE 10

Evaluation

Text Categorization task:
- Each document has exactly ONE category (so Recall is not important)
- The rank of the first correct answer matters (so Precision alone is not enough)
- Preferably a single number

Mean Reciprocal Rank (MRR): the MRR of an individual query is the reciprocal of the rank at which the first correct response was returned, or 0 if none of the returned responses contained a correct answer. The score for a sequence of queries is the mean of the individual queries' reciprocal ranks.
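The MRR definition above can be sketched directly:

```python
def mrr(ranked_answers, correct):
    """Mean Reciprocal Rank over a sequence of queries.

    ranked_answers[i] is the ranked list of predicted classes for
    query i; correct[i] is its single true class.  Each query scores
    1/rank of the first correct answer, or 0 if it never appears.
    """
    total = 0.0
    for ranking, truth in zip(ranked_answers, correct):
        for rank, answer in enumerate(ranking, start=1):
            if answer == truth:
                total += 1.0 / rank
                break
    return total / len(correct)

print(mrr([["a", "b"], ["b", "a"], ["c", "b"]], ["a", "a", "a"]))
# (1 + 1/2 + 0) / 3 -> 0.5
```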

SLIDE 11

Experimental setup

[Diagram: the IREP evaluation pipeline (read documents, filter documents, test IR models, write results), producing IREP documents, filtered documents, and IREP results. External packages supply the models: IGLU (Vector), FAQO (LSA/I), ..., LIBSVM (SVMs).]

SLIDE 12

Results – C10

[Figure: Mean Reciprocal Rank vs. number of terms (500, 1000, 1500, 2000, 2500, all terms) on C10, for LSA, k-NN LSA, Vector, k-NN Vector, and SVM; the MRR axis spans 0.65 to 1.]

SLIDE 13

Results – mini20

[Figure: Mean Reciprocal Rank vs. number of terms (500, 1000, 1500, 2000, 2500, all terms) on mini20, for LSA, k-NN LSA, Vector, k-NN Vector, and SVM; the MRR axis spans 0.35 to 0.85.]

SLIDE 14

Significance Tests – C10

Results of the t-test for dataset C10 (p-values):

             SVM    k-NN LSA   LSA      k-NN Vector   Vector
SVM          -      0.1581     0.0001   0.0004        0.0012
k-NN LSA            -          0.0001   0.0007        0.0020
LSA                            -        0.0026        0.0038
k-NN Vector                             -             0.7221
Vector                                                -

SVM and k-NN LSA are statistically indistinguishable (p = 0.1581), as are Vector and k-NN Vector (p = 0.7221); every other pairwise difference is significant. The slide's ranking diagram places k-NN LSA and SVM above Vector, k-NN Vector, and LSA.

SLIDE 15

Significance Tests – mini20

Results of the t-test for dataset mini20 (p-values):

             SVM    k-NN LSA   LSA      k-NN Vector   Vector
SVM          -      0.0010     0.0001   0.0000        0.0001
k-NN LSA            -          0.0001   0.0000        0.0000
LSA                            -        0.0004        0.0000
k-NN Vector                             -             0.0079
Vector                                                -

All pairwise differences are significant, yielding the ranking SVM > k-NN LSA > LSA > Vector > k-NN Vector.

SLIDE 16

Conclusions

- 2500 is a good upper bound for the number of terms
- k-NN LSA and SVM perform best; both are significantly better than the other methods
- MRR is a useful measure for one-class Text Categorization tasks
