Support Vector Machines for Large Scale Text Mining in R
  1. Support Vector Machines for Large Scale Text Mining in R. Ingo Feinerer (Vienna University of Technology, Austria), Alexandros Karatzoglou (Telefonica Research, Spain). COMPSTAT'2010

  2. Motivation ◮ Machine learning and data mining require classification ◮ Large amounts of data ◮ Use R for data intensive operations ◮ Text mining is especially resource hungry ◮ Highly sparse matrices ◮ Need for scalable implementations

  3. Large Scale Linear Support Vector Machines Modified Finite Newton l2-Svm Given ◮ m binary labeled examples {x_i, y_i} with y_i ∈ {−1, +1}, and ◮ the Svm optimization problem w* = argmin_{w ∈ R^d} (1/2) Σ_{i=1}^m c_i l_2(y_i w^T x_i) + (λ/2) ‖w‖²_2, the modified finite Newton l2-Svm method gives an efficient primal solution.
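For reference, the objective above rendered in LaTeX (my transcription; the slides do not spell out l_2, which in the l2-Svm literature is the squared hinge loss):

```latex
w^{*} = \operatorname*{argmin}_{w \in \mathbb{R}^{d}}
  \frac{1}{2} \sum_{i=1}^{m} c_i\, \ell_2\!\bigl(y_i\, w^{T} x_i\bigr)
  + \frac{\lambda}{2}\, \lVert w \rVert_2^{2},
\qquad \ell_2(t) = \max(0,\, 1 - t)^{2}.
```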

  4. R Extension Package svmlin Features Implements the l2-Svm algorithm. ◮ Extends the original C++ version of svmlin by Sindhwani and Keerthi (2007). Adds support for ◮ multi-class classification (one-against-one and one-against-all voting schemes), ◮ cross-validation, and ◮ a broad range of sparse matrix formats (SparseM, Matrix, slam).

  5. R Extension Package svmlin Interface model <- svmlin(matrix, labels, lambda = 0.1, cross = 3) ◮ Regularization parameter λ = 0.1 ◮ 3-fold cross-validation ◮ model can be used with the predict() function
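A minimal train-and-predict round trip, assuming the svmlin() signature and predict() support shown on this slide (the sparse matrix here is a toy stand-in for a real term-document matrix; the exact argument names beyond those on the slide are assumptions):

```r
library(Matrix)
# library(svmlin)  # the slides' package; interface as shown above

# Toy sparse document-term matrix: 4 documents, 3 terms
x <- sparseMatrix(i = c(1, 2, 3, 4), j = c(1, 2, 3, 1),
                  x = c(1, 1, 1, 2), dims = c(4, 3))
y <- c(-1, -1, 1, 1)  # binary labels as on slide 3

# Train with regularization lambda = 0.1 and 3-fold cross-validation
model <- svmlin(x, y, lambda = 0.1, cross = 3)

# Classify new (here: the training) documents
predict(model, x)
```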

  6. R Extension Package tm Text mining framework in R ◮ Functionality for managing text documents ◮ Abstracts the process of document manipulation ◮ Eases the usage of heterogeneous text formats (XML, ...) ◮ Metadata management ◮ Preprocessing via transformations and filters Exports ◮ (Sparse) term-document matrices ◮ Interfaces to string kernels Available via CRAN
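A sketch of the tm workflow described above, from raw documents to a sparse term-document matrix (function names follow recent tm versions; older releases spell some of them differently, e.g. Corpus() instead of VCorpus()):

```r
library(tm)

# Small in-memory corpus; real data would come from DirSource() etc.
docs <- c("The merger boosted quarterly profits",
          "Free offer, click now to claim your prize",
          "Parliament debated the new finance bill")
corpus <- VCorpus(VectorSource(docs))

# Preprocessing via tm transformations
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Export a sparse term-document matrix (slam representation)
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```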

  7. Data Reuters-21578 ◮ News articles by Reuters news agency from 1987 ◮ 21578 short to medium length documents in XML format ◮ Wide range of topics (M&A, finance, politics, . . . ) SpamAssassin ◮ Public mail corpus ◮ Authentic e-mail communication with classification into normal and unsolicited mail of various difficulty levels ◮ 4150 ham and 1896 spam documents 20 Newsgroups ◮ 19997 e-mail messages taken from 20 different newsgroups ◮ Wide field of topics, e.g., atheism, computer graphics, or motorcycles

  8. Preprocessing Creation of term-document matrices ◮ 42 seconds for Reuters-21578 ◮ 31 seconds for SpamAssassin ◮ 75 seconds for 20 Newsgroups Term-document matrix size ◮ Reuters-21578: 65973 terms, 21578 documents, 24 MB ◮ SpamAssassin: 151029 terms, 6046 documents, 24 MB ◮ 20 Newsgroups: 175685 terms, 19997 documents, 46 MB

  9. Protocol Compare Svm implementations ◮ Runtime of svm (package e1071) vs. svmlin ◮ For svm we use a linear kernel and set the cost parameter to 1/λ ◮ Initially sample 1/10 of the data set for training ◮ Increase the training data in steps of 1/10 ◮ Compare classification performance using 10-fold cross-validation
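The benchmark loop above can be sketched as follows for the e1071 side; the synthetic data is a stand-in for the corpora, and an analogous loop over svmlin() would produce the second timing curve:

```r
library(e1071)

# Toy data standing in for a term-document matrix
set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)
y <- factor(sign(x[, 1] + rnorm(200)))

lambda <- 0.1
portions <- seq(0.1, 1.0, by = 0.1)  # grow the training set in 1/10 steps

times <- sapply(portions, function(p) {
  idx <- sample(nrow(x), size = p * nrow(x))
  # Linear kernel, cost = 1/lambda, as in the protocol above
  system.time(
    svm(x[idx, ], y[idx], kernel = "linear", cost = 1 / lambda)
  )["elapsed"]
})
```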

  10. Results SpamAssassin [Figure: training time in seconds vs. portion of data used for training (0.2–1.0); one curve each for e1071 and svmlin]

  11. Results 20 Newsgroups [Figure: training time in seconds vs. portion of data used for training (0.2–1.0); one curve each for e1071 and svmlin]

  12. Results Reuters-21578 [Figure: training time in seconds vs. portion of data used for training (0.2–1.0); one curve each for e1071 and svmlin]

  13. Conclusion ◮ svmlin extension package ◮ Takes advantage of sparse data ◮ Computations are done in primal space (no kernel necessary) ◮ Comparison with a state-of-the-art Svm implementation (e1071) ◮ Linear scaling, faster training times
