Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization
Ana Cardoso-Cachopo, Arlindo Oliveira
Instituto Superior Técnico, Technical University of Lisbon / INESC-ID
EWLSATEL, March 2007


  1. Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization
     Ana Cardoso-Cachopo, Arlindo Oliveira
     Instituto Superior Técnico, Technical University of Lisbon / INESC-ID
     EWLSATEL, March 2007
     (IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo, EWLSATEL, March 2007

  2. Outline
     1 Introduction
     2 Classification Methods
     3 Combinations Between Methods
     4 Experimental Setup
     5 Experimental Results
     6 Conclusions and Future Work

  8. Introduction
     Text Classification
     Single-label
     Classification Methods
     ◮ Vector
     ◮ k-NN
     ◮ SVM
     ◮ LSI
     Goal: improve Accuracy

  16. Classification Methods
      ◮ Vector: cosine similarity in the p dimensional term space
      ◮ k-NN: cosine similarity + voting strategy in the p dimensional term space
      ◮ SVM: kernel in the p dimensional term space
      ◮ LSI: SVD from the p dimensional term space to an s << p dimensional concept space, then cosine similarity
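The k-NN entry on the slide above pairs cosine similarity with a voting strategy. A minimal sketch of that idea follows, assuming a simple majority vote among the k most similar training documents; the slides do not say which voting strategy was used, and the toy vectors, labels, and function names are made up for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query by majority vote (an assumption) among its k nearest documents."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy term space with p = 4 terms; counts are invented, not taken from the datasets.
train_vectors = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [3, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
]
train_labels = ["finance", "finance", "finance", "sports", "sports"]

prediction = knn_predict(train_vectors, train_labels, [1, 1, 0, 0], k=3)
```

With k = 1 the same function simply returns the label of the single most similar document.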

  17. Combinations Between Methods
      ◮ k-NN-LSI: SVD from the p dimensional term space to an s << p dimensional concept space, then cosine similarity + voting strategy
      ◮ SVM-LSI: SVD from the p dimensional term space to an s << p dimensional concept space, then kernel
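The k-NN-LSI combination can be sketched as follows, assuming NumPy is available: take a truncated SVD of the term-document matrix, fold the query into the s dimensional concept space, then apply cosine similarity and a vote there. The fold-in formula, the majority vote, the toy matrix, and all function names are assumptions for illustration, not details given on the slides.

```python
from collections import Counter

import numpy as np


def fit_lsi(term_doc, s):
    """Truncated SVD of a p x n term-document matrix (one column per document)."""
    U, sigma, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_s, sigma_s = U[:, :s], sigma[:s]
    doc_coords = Vt[:s, :].T  # one s-dimensional row per training document
    return U_s, sigma_s, doc_coords


def fold_in(query, U_s, sigma_s):
    """Map a p-dimensional vector into the concept space: q^T U_s diag(1 / sigma_s)."""
    return (query @ U_s) / sigma_s


def knn_lsi_predict(term_doc, labels, query, s=2, k=3):
    """k-NN with cosine similarity and majority vote, computed in the concept space."""
    U_s, sigma_s, doc_coords = fit_lsi(term_doc, s)
    q = fold_in(query, U_s, sigma_s)
    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q)
    sims = (doc_coords @ q) / np.where(norms == 0.0, 1.0, norms)
    nearest = np.argsort(-sims)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy data: p = 4 terms, n = 5 documents (invented counts).
term_doc = np.array(
    [
        [2, 1, 3, 0, 0],
        [1, 2, 1, 0, 1],
        [0, 0, 0, 2, 1],
        [0, 0, 1, 1, 2],
    ],
    dtype=float,
)
labels = ["finance", "finance", "finance", "sports", "sports"]

prediction = knn_lsi_predict(term_doc, labels, np.array([1.0, 1.0, 0.0, 0.0]))
```

Folding a training column back in reproduces that document's own concept-space coordinates, which is a quick sanity check on the projection. An SVM-LSI variant would train the SVM on `doc_coords` instead of voting.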

  18. Experimental Setup
      Methods (6 already mentioned + Dumb)
      Datasets
      ◮ Bank's Data - Bank37
      ◮ Reuters 21578 - R8, R52
      ◮ 20 Newsgroups - 20Ng
      ◮ Web Knowledge Base - Web4
      ◮ Cade - Cade12
      Evaluation Measure
      Accuracy = #Correctly classified documents / #Total documents
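The accuracy measure on the slide above is a plain ratio and can be computed as in the sketch below. The slides do not define the "Dumb" method; the sketch assumes it always predicts the most frequent training class, and the labels are invented toy data.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Accuracy = #correctly classified documents / #total documents."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def dumb_predict(train_labels, n_test):
    """Assumed baseline: predict the most frequent training class for every document."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_test


# Toy single-label data (invented for illustration).
train_labels = ["earn", "earn", "earn", "acq", "crude"]
test_labels = ["earn", "acq", "earn", "earn"]

baseline_accuracy = accuracy(dumb_predict(train_labels, len(test_labels)), test_labels)
```

Under that assumption the baseline gives a floor against which the other six methods can be read.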

  26. Characteristics of the Datasets

               Train    Test     Total    Smallest  Largest
               Docs     Docs     Docs     Class     Class
      Bank37     928      463     1391        5       346
      20Ng     11293     7528    18821      628       999
      R8        5485     2189     7674       51      3923
      R52       6532     2568     9100        3      3923
      Web4      2803     1396     4199      504      1641
      Cade12   27322    13661    40983      625      8473

      Numbers of documents for the datasets: number of training documents, number of test documents, total number of documents, number of documents in the smallest class, and number of documents in the largest class.
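One quantity that can be derived from the table above is the share of each dataset taken up by its largest class, which indicates how skewed the class distribution is. The sketch below only does arithmetic on the table's numbers and assumes the class counts are over the total documents; the variable names are illustrative.

```python
# (train docs, test docs, total docs, smallest class, largest class), copied from the table.
datasets = {
    "Bank37": (928, 463, 1391, 5, 346),
    "20Ng": (11293, 7528, 18821, 628, 999),
    "R8": (5485, 2189, 7674, 51, 3923),
    "R52": (6532, 2568, 9100, 3, 3923),
    "Web4": (2803, 1396, 4199, 504, 1641),
    "Cade12": (27322, 13661, 40983, 625, 8473),
}

largest_class_share = {}
for name, (train, test, total, smallest, largest) in datasets.items():
    # Sanity check: the train and test splits should add up to the total.
    assert train + test == total, name
    largest_class_share[name] = largest / total

for name, share in sorted(largest_class_share.items(), key=lambda item: -item[1]):
    print(f"{name:7s} largest class covers {share:.1%} of the documents")
```

By this measure R8 is the most skewed dataset and 20Ng the most balanced.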

  30. Experimental Results
      [Chart: accuracy (vertical axis, 0.0 to 1.0) of Dumb, Vector, k-NN, SVM, LSI, k-NN-LSI, and SVM-LSI on Bank37, 20Ng, R8, R52, Web4, and Cade12]
      Accuracy values for the six datasets using each method.
