SLIDE 1 Support Vector Machines for Large Scale Text Mining in R
Ingo Feinerer (Vienna University of Technology, Austria)
Alexandros Karatzoglou (Telefonica Research, Spain)
COMPSTAT’2010
SLIDE 2
Motivation
◮ Machine learning and data mining require classification
◮ Large amounts of data
◮ Use R for data intensive operations
◮ Text mining is especially resource hungry
◮ Highly sparse matrices
◮ Need for scalable implementations
SLIDE 3 Large Scale Linear Support Vector Machines
Modified Finite Newton l2-Svm
Given
◮ m binary labeled examples {x_i, y_i} with y_i ∈ {−1, +1}, and
◮ the Svm optimization problem

w* = argmin_{w ∈ R^d} (1/2) Σ_{i=1}^{m} c_i l2(y_i wᵀx_i) + (λ/2) ‖w‖²

the modified finite Newton l2-Svm method gives an efficient primal solution.
SLIDE 4
R Extension Package svmlin
Features
◮ Implements the modified finite Newton l2-Svm algorithm
◮ Extends the original C++ version of svmlin by Sindhwani and Keerthi (2007)
◮ Adds support for:
◮ multi-class classification (one-against-one and one-against-all voting schemes),
◮ cross-validation, and
◮ a broad range of sparse matrix formats (SparseM, Matrix, slam)
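The sparse matrix support can be illustrated with a minimal sketch using the Matrix package (which ships with R; SparseM and slam work analogously). The toy term-document data below is made up for illustration:

```r
library(Matrix)

# Toy document-term matrix: 3 documents, 5 terms, mostly zeros;
# only the non-zero entries (i, j, x) are stored.
dtm <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                    j = c(1, 4, 2, 3, 5),
                    x = c(2, 1, 3, 1, 4),
                    dims = c(3, 5))

dim(dtm)      # 3 documents x 5 terms
nnzero(dtm)   # only 5 non-zero entries stored
```

For highly sparse term-document matrices this representation keeps memory usage proportional to the number of non-zero entries rather than the full matrix size.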
SLIDE 5
R Extension Package svmlin
Interface
model <- svmlin(matrix, labels, lambda = 0.1, cross = 3)
◮ Regularization parameter λ = 0.1
◮ 3-fold cross-validation
◮ model can be used with the predict() function
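The call above can be sketched as a complete train/predict cycle. This assumes the svmlin package is installed and follows the interface shown on this slide; the toy matrix and label vector are made up for illustration:

```r
# Sketch, assuming the svmlin package interface shown on the slide.
library(svmlin)
library(Matrix)

# Toy sparse document-term matrix with binary labels (illustrative)
x <- sparseMatrix(i = c(1, 2, 3, 4), j = c(1, 2, 1, 2),
                  x = c(1, 1, 1, 1), dims = c(4, 2))
y <- factor(c(-1, 1, -1, 1))

# lambda: regularization parameter; cross: 3-fold cross-validation
model <- svmlin(x, y, lambda = 0.1, cross = 3)

# The fitted model works with the standard predict() generic
predict(model, x)
```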
SLIDE 6
R Extension Package tm
Text mining framework in R
◮ Functionality for managing text documents
◮ Abstracts the process of document manipulation
◮ Eases the usage of heterogeneous text formats (XML, . . . )
◮ Meta data management
◮ Preprocessing via transformations and filters
Exports
◮ (Sparse) term-document matrices
◮ Interfaces to string kernels
Available via CRAN
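The workflow above can be sketched end-to-end: build a corpus, apply transformations, and export a sparse term-document matrix. This assumes the tm package from CRAN is installed (current tm versions wrap base functions like tolower in content_transformer); the documents are toy examples:

```r
# Sketch of the tm workflow, assuming the CRAN package is installed.
library(tm)

docs <- c("The quick brown fox.", "Jumped over the lazy dog.")
corpus <- Corpus(VectorSource(docs))

# Preprocessing via transformations
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Export a sparse term-document matrix (stored in slam format)
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```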
SLIDE 7
Data
Reuters-21578
◮ News articles by the Reuters news agency from 1987
◮ 21578 short to medium length documents in XML format
◮ Wide range of topics (M&A, finance, politics, . . . )
SpamAssassin
◮ Public mail corpus
◮ Authentic e-mail communication classified into normal and unsolicited mail of various difficulty levels
◮ 4150 ham and 1896 spam documents
20 Newsgroups
◮ 19997 e-mail messages taken from 20 different newsgroups
◮ Wide field of topics, e.g., atheism, computer graphics, or motorcycles
SLIDE 8
Preprocessing
Creation of term-document matrices
◮ 42 seconds for Reuters-21578
◮ 31 seconds for SpamAssassin
◮ 75 seconds for 20 Newsgroups
Term-document matrix size
◮ Reuters-21578: 65973 terms, 21578 documents, 24 MB
◮ SpamAssassin: 151029 terms, 6046 documents, 24 MB
◮ 20 Newsgroups: 175685 terms, 19997 documents, 46 MB
SLIDE 9
Protocol
Compare Svm implementations
◮ Runtime of svm (package e1071) vs. svmlin
◮ For svm we use a linear kernel and set the cost parameter to 1/λ
◮ Initially sample 1/10 of the data set for training
◮ Increase the training data in steps of 1/10
◮ Compare classification performance using 10-fold cross-validation
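The protocol can be sketched as a timing loop. This assumes both packages are installed, and that x (a sparse term-document matrix) and y (the labels) have been prepared as on the previous slides; the loop structure is an illustrative reconstruction, not the authors' benchmark script:

```r
# Sketch of the benchmark protocol: train e1071's svm and svmlin on
# growing portions of the data and record training time.
library(e1071)
library(svmlin)

lambda <- 0.1
for (portion in seq(0.1, 1.0, by = 0.1)) {
  idx <- sample(nrow(x), portion * nrow(x))
  # e1071: linear kernel, cost = 1/lambda to match the regularization
  t1 <- system.time(svm(x[idx, ], y[idx],
                        kernel = "linear", cost = 1 / lambda))
  t2 <- system.time(svmlin(x[idx, ], y[idx], lambda = lambda))
  cat(portion, t1["elapsed"], t2["elapsed"], "\n")
}
```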
SLIDE 10 Results
SpamAssassin
[Figure: SpamAssassin — training time in seconds (2–12) vs. portion of data used for training (0.4–1.0); curves for e1071 and svmlin]
SLIDE 11 Results
20 Newsgroups
[Figure: 20 Newsgroups — training time in seconds (10–60) vs. portion of data used for training (0.4–1.0); curves for e1071 and svmlin]
SLIDE 12 Results
Reuters-21578
[Figure: Reuters-21578 — training time in seconds (100–400) vs. portion of data used for training (0.4–1.0); curves for e1071 and svmlin]
SLIDE 13
Conclusion
◮ svmlin extension package
◮ Takes advantage of sparse data
◮ Computations are done in primal space (no kernel necessary)
◮ Comparison with a state-of-the-art Svm implementation
◮ Linear scaling, faster training times