

SLIDE 1

Support Vector Machines for Large Scale Text Mining in R

Ingo Feinerer¹, Alexandros Karatzoglou²

¹Vienna University of Technology, Austria  ²Telefonica Research, Spain

COMPSTAT’2010

SLIDE 2

Motivation

◮ Machine learning and data mining require classification
◮ Large amounts of data
◮ Use R for data intensive operations
◮ Text mining is especially resource hungry
◮ Highly sparse matrices
◮ Need for scalable implementations

SLIDE 3

Large Scale Linear Support Vector Machines

Modified Finite Newton ℓ2-SVM

Given

◮ m binary labeled examples {x_i, y_i} with y_i ∈ {−1, +1}, and
◮ the SVM optimization problem

$$ w^{\ast} = \operatorname*{argmin}_{w \in \mathbb{R}^{d}} \; \frac{1}{2} \sum_{i=1}^{m} c_{i}\, \ell_{2}(y_{i} w^{\top} x_{i}) + \frac{\lambda}{2} \lVert w \rVert^{2} $$

where ℓ2(z) = max(0, 1 − z)² is the squared hinge loss, the modified finite Newton ℓ2-SVM method gives an efficient primal solution.

SLIDE 4

R Extension Package svmlin

Features

Implements the ℓ2-SVM algorithm.

◮ Extends the original C++ version of svmlin by Sindhwani and Keerthi (2007).
◮ Adds support for:
  ◮ multi-class classification (one-against-one and one-against-all voting schemes; see the sketch after this list),
  ◮ cross-validation, and
  ◮ a broad range of sparse matrix formats (SparseM, Matrix, slam).
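To make the voting schemes concrete, here is a conceptual R sketch of one-against-all classification. It is not the package's actual internals: `train_binary` is a hypothetical stand-in for a binary ℓ2-SVM trainer.

## Conceptual sketch of the one-against-all voting scheme.
## `train_binary` is a hypothetical stand-in for a binary l2-SVM
## trainer; it must return a model whose predict() yields a
## real-valued decision score per row of newx.
one_against_all <- function(x, y, train_binary) {
  classes <- sort(unique(y))
  ## one binary problem per class: class k (+1) vs. the rest (-1)
  models <- lapply(classes, function(k) train_binary(x, ifelse(y == k, +1, -1)))
  list(classes = classes, models = models)
}

predict_one_against_all <- function(fit, newx) {
  ## n x k matrix of decision scores (assumes newx has several rows)
  scores <- sapply(fit$models, function(m) predict(m, newx))
  ## pick the class with the largest score for each row
  fit$classes[max.col(scores)]
}

One-against-one works analogously: k(k − 1)/2 pairwise classifiers are trained and each casts a vote for the winning class.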

SLIDE 5

R Extension Package svmlin

Interface

model <- svmlin(matrix, labels, lambda = 0.1, cross = 3)

◮ Regularization parameter λ = 0.1
◮ 3-fold cross-validation
◮ model can be used with the predict() function
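Putting the interface together, a minimal sketch (the svmlin() and predict() calls follow the interface shown above; the toy sparse data and its Matrix format are assumptions for illustration):

library(Matrix)  # one of the sparse formats svmlin accepts

## Toy sparse document-term matrix: 4 documents x 3 terms
## (invented data, purely for illustration)
x <- sparseMatrix(i = c(1, 1, 2, 3, 4, 4),
                  j = c(1, 2, 2, 3, 1, 3),
                  x = c(2, 1, 3, 1, 1, 2))
labels <- factor(c("spam", "spam", "ham", "ham"))

## train with lambda = 0.1 and 3-fold cross-validation
model <- svmlin(x, labels, lambda = 0.1, cross = 3)

## classify documents with the fitted model
predict(model, x)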

SLIDE 6

R Extension Package tm

Text mining framework in R

◮ Functionality for managing text documents
◮ Abstracts the process of document manipulation
◮ Eases the usage of heterogeneous text formats (XML, …)
◮ Metadata management
◮ Preprocessing via transformations and filters

Exports

◮ (Sparse) term-document matrices
◮ Interfaces to string kernels

Available via CRAN
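For example, a sparse term-document matrix can be produced along these lines (a minimal sketch using tm's standard API; the two toy documents are invented):

library(tm)

## toy corpus (two invented documents, for illustration only)
docs <- c("The market rallied on merger news.",
          "Free offer!!! Click here to claim your prize.")
corpus <- VCorpus(VectorSource(docs))

## preprocessing via transformations
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

## export a sparse term-document matrix (backed by slam)
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)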

SLIDE 7

Data

Reuters-21578

◮ News articles by the Reuters news agency from 1987
◮ 21578 short to medium length documents in XML format
◮ Wide range of topics (M&A, finance, politics, …)

SpamAssassin

◮ Public mail corpus
◮ Authentic e-mail communication with classification into normal and unsolicited mail of various difficulty levels
◮ 4150 ham and 1896 spam documents

20 Newsgroups

◮ 19997 e-mail messages taken from 20 different newsgroups
◮ Wide field of topics, e.g., atheism, computer graphics, or motorcycles

SLIDE 8

Preprocessing

Creation of term-document matrices

◮ 42 seconds for Reuters-21578
◮ 31 seconds for SpamAssassin
◮ 75 seconds for 20 Newsgroups

Term-document matrix size

◮ Reuters-21578: 65973 terms, 21578 documents, 24 MB
◮ SpamAssassin: 151029 terms, 6046 documents, 24 MB
◮ 20 Newsgroups: 175685 terms, 19997 documents, 46 MB
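Figures like these can be measured along the following lines (a sketch; `corpus` is assumed to be one of the three corpora loaded into tm as above):

## time the construction of the term-document matrix
timing <- system.time(tdm <- TermDocumentMatrix(corpus))
timing["elapsed"]  # seconds, cf. the numbers above

## matrix dimensions and memory footprint
dim(tdm)  # c(number of terms, number of documents)
print(object.size(tdm), units = "MB")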

SLIDE 9

Protocol

Compare SVM implementations

◮ Runtime of svm (package e1071) vs. svmlin
◮ For svm we use a linear kernel and set the cost parameter to 1/λ
◮ Initially sample 1/10 of the data set for training
◮ Increase the training data in steps of 1/10
◮ Compare classification performance using 10-fold cross-validation (see the sketch below)
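One step of this protocol might look as follows (a sketch: the svm() arguments are e1071's documented interface, the svmlin() call follows the earlier slide, and `x`/`labels` are assumed to hold a training sample in a sparse format both packages accept):

library(e1071)

lambda <- 0.1

## e1071: linear kernel, cost = 1/lambda, 10-fold cross-validation
t_e1071 <- system.time(
  fit_e1071 <- svm(x, labels, kernel = "linear", cost = 1 / lambda, cross = 10)
)

## svmlin: same training sample, 10-fold cross-validation
t_svmlin <- system.time(
  fit_svmlin <- svmlin(x, labels, lambda = lambda, cross = 10)
)

## training times in seconds, side by side
rbind(e1071 = t_e1071["elapsed"], svmlin = t_svmlin["elapsed"])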

SLIDE 10

Results

SpamAssassin

[Figure: training time in seconds (2–12) vs. portion of data used for training (0.2–1.0); curves for e1071 and svmlin]

SLIDE 11

Results

20 Newsgroups

[Figure: training time in seconds (10–60) vs. portion of data used for training (0.2–1.0); curves for e1071 and svmlin]

SLIDE 12

Results

Reuters-21578

[Figure: training time in seconds (100–400) vs. portion of data used for training (0.2–1.0); curves for e1071 and svmlin]

SLIDE 13

Conclusion

◮ svmlin extension package
◮ Takes advantage of sparse data
◮ Computations are done in primal space (no kernel necessary)
◮ Comparison with a state-of-the-art SVM implementation (svm in e1071)
◮ Linear scaling, faster training times