Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization. Ana Cardoso-Cachopo, Arlindo Oliveira. Instituto Superior Técnico, Technical University of Lisbon / INESC-ID. EWLSATEL, March 2007.


slide-1
SLIDE 1

Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization

Ana Cardoso-Cachopo Arlindo Oliveira

Instituto Superior Técnico, Technical University of Lisbon / INESC-ID

EWLSATEL, March 2007

(IST-TULisbon/INESC-ID) Ana Cardoso-Cachopo EWLSATEL, March 2007 1 / 11

slide-2
SLIDE 2

Outline

1. Introduction
2. Classification Methods
3. Combinations Between Methods
4. Experimental Setup
5. Experimental Results
6. Conclusions and Future Work

slide-8
SLIDE 8

Introduction

Text Classification
Single-label Classification
Methods

◮ Vector
◮ k-NN
◮ SVM
◮ LSI

Goal: improve Accuracy

slide-16
SLIDE 16

Classification Methods

[Figure: diagram relating the methods SVM, k-NN, Vector, and LSI to the p-dimensional term space and the s-dimensional concept space (s << p). Labels in the diagram: Cosine similarity; k-NN + Cosine similarity; Kernel; Voting strategy; SVD; Cosine similarity.]
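The slide names SVD and cosine similarity in connection with LSI but gives no formulas. The following is a minimal sketch of that idea, assuming a dense term-by-document matrix and NumPy. The function names (`lsi_fit`, `lsi_project`, `cosine`) and the toy matrix are illustrative assumptions, not taken from the presentation.

```python
import numpy as np

def lsi_fit(term_doc, s):
    """Truncated SVD of a (p terms x n documents) matrix, keeping s concepts."""
    U, sigma, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_s = U[:, :s]                                  # p x s basis of the concept space
    doc_concepts = (np.diag(sigma[:s]) @ Vt[:s]).T  # n x s, one row per document
    return U_s, doc_concepts

def lsi_project(U_s, term_vec):
    """Map a p-dimensional term vector into the s-dimensional concept space."""
    return U_s.T @ term_vec

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy data (hypothetical): 4 terms x 4 documents, two topics of two documents each.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])
U_s, docs = lsi_fit(A, s=2)
print(cosine(A[:, 0], A[:, 1]))  # similarity of documents 0 and 1 in term space
print(cosine(docs[0], docs[1]))  # similarity of the same documents in concept space
```

In this toy matrix, documents 0 and 1 have cosine similarity 0.8 in term space and approximately 1 in the two-dimensional concept space, which illustrates how the reduced space groups documents that share a topic.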


slide-17
SLIDE 17

Combinations Between Methods

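[Figure: diagram of the combined methods k-NN-LSI and SVM-LSI, linking the p-dimensional term space to the s-dimensional concept space (s << p) through SVD. Labels in the diagram: k-NN + Cosine similarity; Kernel + Voting strategy; SVD.]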
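One way to read the k-NN-LSI combination is: reduce the documents with SVD, then run k-NN with cosine similarity in the concept space. The sketch below follows that reading. The slides do not give the values of s and k, the voting rule, or any tie-breaking, so the plain majority vote, the function name `knn_lsi_predict`, and the toy data are all assumptions.

```python
from collections import Counter
import numpy as np

def knn_lsi_predict(train_term_doc, train_labels, query_term_vec, s, k):
    """Classify one query by k-NN with cosine similarity in an s-dim concept space."""
    U, _, _ = np.linalg.svd(train_term_doc, full_matrices=False)
    U_s = U[:, :s]                           # p x s basis of the concept space
    train_c = (U_s.T @ train_term_doc).T     # n x s, training documents
    query_c = U_s.T @ query_term_vec         # s, query document
    norms = np.linalg.norm(train_c, axis=1) * np.linalg.norm(query_c)
    sims = np.divide(train_c @ query_c, norms,
                     out=np.zeros_like(norms), where=norms > 0)
    nearest = np.argsort(-sims)[:k]          # indices of the k most similar documents
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]        # plain majority vote (assumption)

# Toy data (hypothetical): 4 terms x 6 documents, classes "a" and "b".
toy_docs = np.array([
    [2.0, 1.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 1.0, 2.0, 1.0],
])
toy_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_lsi_predict(toy_docs, toy_labels, np.array([1.0, 1.0, 0.0, 0.0]), s=2, k=3))
```

Recomputing the SVD on every call keeps the sketch short; a real experiment would factor the training matrix once and reuse the basis for all test documents.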


slide-18
SLIDE 18

Experimental Setup

Methods (6 already mentioned + Dumb)

Datasets

◮ Bank's Data - Bank37
◮ Reuters 21578 - R8, R52
◮ 20 Newsgroups - 20Ng
◮ Web Knowledge Base - Web4
◮ Cade - Cade12

Evaluation Measure

$$\text{Accuracy} = \frac{\#\,\text{Correctly classified documents}}{\#\,\text{Total documents}}$$
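The evaluation measure is the fraction of documents that receive the correct label. A small sketch of that computation, where the function name `accuracy` and the example labels are illustrative only:

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels: three of the four predictions match.
print(accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"]))  # 0.75
```

Because each document has exactly one label in the single-label setting, this single number summarizes performance without separate precision and recall figures.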

slide-26
SLIDE 26

Characteristics of the Datasets

Dataset   Train Docs   Test Docs   Total Docs   Smallest Class   Largest Class
Bank37           928         463         1391                5             346
20Ng           11293        7528        18821              628             999
R8              5485        2189         7674               51            3923
R52             6532        2568         9100                3            3923
Web4            2803        1396         4199              504            1641
Cade12         27322       13661        40983              625            8473

Numbers of documents for the datasets: number of training documents, number of test documents, total number of documents, number of documents in the smallest class, and number of documents in the largest class.
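The counts in this table can be checked and summarized with a few lines of arithmetic. The sketch below copies the numbers from the table; the derived quantities (test fraction and largest-to-smallest class ratio) and the names `DATASETS` and `summarize` are assumptions added for illustration and do not appear in the slides.

```python
# (train, test, total, smallest class, largest class), copied from the table.
DATASETS = {
    "Bank37": (928, 463, 1391, 5, 346),
    "20Ng": (11293, 7528, 18821, 628, 999),
    "R8": (5485, 2189, 7674, 51, 3923),
    "R52": (6532, 2568, 9100, 3, 3923),
    "Web4": (2803, 1396, 4199, 504, 1641),
    "Cade12": (27322, 13661, 40983, 625, 8473),
}

def summarize(datasets):
    """Check train + test == total and derive simple split and imbalance figures."""
    summary = {}
    for name, (train, test, total, smallest, largest) in datasets.items():
        summary[name] = {
            "consistent": train + test == total,
            "test_fraction": test / total,
            "imbalance": largest / smallest,  # largest-to-smallest class size ratio
        }
    return summary

for name, row in summarize(DATASETS).items():
    print(f"{name:8s} test={row['test_fraction']:.3f} imbalance={row['imbalance']:.1f}")
```

The imbalance ratio makes the difference between the datasets visible: 20Ng has classes of similar size, while R52 and Bank37 pair very small classes with a much larger one.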

slide-30
SLIDE 30

Experimental Results

[Figure: plot of Accuracy (axis from 0.0 to 1.0) for the methods Dumb, Vector, k-NN, SVM, LSI, k-NN-LSI, and SVM-LSI on the datasets Bank37, 20Ng, R8, R52, Web4, and Cade12.]

Accuracy values for the six datasets using each method.

slide-36
SLIDE 36

Experimental Results

Dataset   Dumb     Vector   k-NN     SVM      LSI      k-NN-LSI   SVM-LSI
Bank37    0.2505   0.8359   0.8423   0.9071   0.8531   0.8488     0.9179
20Ng      0.0530   0.7240   0.7593   0.8284   0.7491   0.7557     0.7775
R8        0.4947   0.7889   0.8524   0.9698   0.9411   0.9488     0.9680
R52       0.4217   0.7687   0.8322   0.9377   0.9093   0.9100     0.9311
Web4      0.3897   0.6447   0.7256   0.8582   0.7357   0.7908     0.8897
Cade12    0.2083   0.4142   0.5120   0.5284   0.4329   0.4880     0.5465
Average   0.3030   0.6961   0.7540   0.8383   0.7702   0.7904     0.8385

Accuracy values for the six datasets using each method, and average Accuracy for each method over all the datasets.
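The Average row and the best method per dataset can be recomputed from the per-dataset values. The sketch below copies the numbers from the table; the names `METHODS`, `ACCURACY`, `average_per_method`, and `best_method_per_dataset` are illustrative assumptions.

```python
METHODS = ["Dumb", "Vector", "k-NN", "SVM", "LSI", "k-NN-LSI", "SVM-LSI"]

# Accuracy per dataset, in the column order of METHODS, copied from the table.
ACCURACY = {
    "Bank37": [0.2505, 0.8359, 0.8423, 0.9071, 0.8531, 0.8488, 0.9179],
    "20Ng": [0.0530, 0.7240, 0.7593, 0.8284, 0.7491, 0.7557, 0.7775],
    "R8": [0.4947, 0.7889, 0.8524, 0.9698, 0.9411, 0.9488, 0.9680],
    "R52": [0.4217, 0.7687, 0.8322, 0.9377, 0.9093, 0.9100, 0.9311],
    "Web4": [0.3897, 0.6447, 0.7256, 0.8582, 0.7357, 0.7908, 0.8897],
    "Cade12": [0.2083, 0.4142, 0.5120, 0.5284, 0.4329, 0.4880, 0.5465],
}

def average_per_method(table):
    """Unweighted mean Accuracy of each method over all datasets."""
    n = len(table)
    return {m: sum(row[i] for row in table.values()) / n
            for i, m in enumerate(METHODS)}

def best_method_per_dataset(table):
    """Name of the method with the highest Accuracy on each dataset."""
    return {d: METHODS[max(range(len(METHODS)), key=row.__getitem__)]
            for d, row in table.items()}

for method, avg in average_per_method(ACCURACY).items():
    print(f"{method:9s} {avg:.4f}")
print(best_method_per_dataset(ACCURACY))
```

On these values, SVM-LSI has the highest Accuracy on Bank37, Web4, and Cade12, and SVM on 20Ng, R8, and R52. The average is unweighted, so a small dataset such as Bank37 counts as much as Cade12.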

slide-41
SLIDE 41

Conclusions and Future Work

Very good Accuracy for some datasets.
It is worth pursuing this line of research by testing more combinations of the methods' parameters.

slide-43
SLIDE 43

Thank You. Any Questions?
