SVM vs Regularized Least Squares Classification

Peng Zhang and Jing Peng
Electrical Engineering and Computer Science Department
Tulane University, New Orleans, LA 70118, USA
{zhangp,jp}@eecs.tulane.edu

Abstract

Support vector machines (SVMs) and regularized least squares (RLS) are two recent promising techniques for classification. SVMs implement the structural risk minimization principle and use the kernel trick to extend it to the nonlinear case. RLS, on the other hand, minimizes a regularized functional directly in a reproducing kernel Hilbert space defined by a kernel. While both have a sound mathematical foundation, RLS is strikingly simple, whereas SVMs in general have a sparse representation of solutions. In addition, the performance of SVMs has been well documented, while little can be said of RLS. This paper applies the two techniques to a collection of data sets and presents results demonstrating virtually identical performance by the two methods.

1. Introduction

Support vector machines (SVMs) have been successfully used as a classification tool in a number of areas, ranging from object recognition to classification of cancer morphologies [4, 7, 8, 9, 10]. SVMs realize the structural risk minimization principle [10] by maximizing the margin between the separating plane and the data, and use the kernel trick to extend them to the nonlinear case. The regularized least squares (RLS) method [6], on the other hand, constructs classifiers by minimizing a regularized functional directly in a reproducing kernel Hilbert space (RKHS) induced by a kernel function [5, 6]. While both methods have a sound mathematical foundation, the performance of SVMs has been relatively well documented, yet little can be said of RLS. RLS is claimed to be fully comparable in performance to SVMs [6], but empirical evidence has been lacking thus far. We present in this paper the results of applying these two techniques to a collection of data sets. Our results demonstrate that the two methods are indeed similar in performance.

2. SVMs and RLS

Our learning problem is formulated as follows. Given a set of training data (xi, yi), where xi represents the ith feature vector in ℜn and yi ∈ ℜ the label of xi (in the binary case yi ∈ {−1, 1}), the goal of learning is to find a mapping f : X → Y that is predictive (i.e., generalizes well). The data (x, y) are drawn randomly according to an unknown probability measure ρ on the product space X × Y. There is a true input-output function fρ reflecting the environment that produces the data. Given any mapping function f, the measure of the error of f is

  ∫_X (f − fρ)² dρ_X,

where ρ_X is the measure on X induced by the marginal measure ρ. The objective of learning is to find f as close to fρ as possible. Given the training data z = {(xi, yi)}_{i=1}^m,

  R_SVM = (1/m) Σ_{i=1}^{m} |yi − fz(xi)|    (1)

represents the empirical error that fz makes on the data z, where the classifier fz is induced by SVMs from z. For RLS, on the other hand, the empirical error is

  R_RLS = (1/m) Σ_{i=1}^{m} (yi − fz(xi))².    (2)

Note that the main issue concerning learning is generalization: a good (predictive) classifier minimizes the error it makes on new (unseen) data, not on the training data. Also, learning starts from a hypothesis space from which f is chosen.
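As a concrete illustration (not part of the original paper), the two empirical errors (1) and (2) can be computed for any trained classifier. The sketch below assumes NumPy arrays and a generic real-valued decision function; the names are hypothetical.

```python
import numpy as np

def empirical_errors(f_z, X, y):
    """Empirical errors of a trained classifier f_z on data z = {(x_i, y_i)}.

    R_SVM (Eq. 1) uses absolute loss; R_RLS (Eq. 2) uses squared loss.
    `f_z` is any callable mapping an (m, n) array to m real-valued outputs.
    """
    pred = f_z(X)                        # real-valued outputs f_z(x_i)
    r_svm = np.mean(np.abs(y - pred))    # (1/m) * sum_i |y_i - f_z(x_i)|
    r_rls = np.mean((y - pred) ** 2)     # (1/m) * sum_i (y_i - f_z(x_i))^2
    return r_svm, r_rls
```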

2.1. SVMs

In the SVM framework, unlike typical classification methods that simply minimize R_SVM, SVMs minimize the following upper bound of the expected generalization error:

  R ≤ R_SVM + C(h),    (3)


where C represents the "VC confidence" and h the VC dimension. This can be accomplished by maximizing the margin between the separating plane and the data, which can be viewed as realizing the structural risk minimization principle [10]. The SVM solution produces a hyperplane having the maximum margin, where the margin is defined as 2/‖w‖. It is shown [1, 4, 10] that this hyperplane is optimal with respect to the maximum margin. The hyperplane, determined by its normal vector w, can be explicitly written as

  w = Σ_{i∈SV} αi yi xi,

where the αi are Lagrange coefficients that maximize

  L_D = Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xi · xj    (4)

and SV is the set of support vectors determined by the SVM. For the nonlinear case, the dot product can be replaced by kernel functions.
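For concreteness, a minimal sketch of solving the dual problem (4) with a Gaussian kernel is given below, using scikit-learn's SVC (which wraps LIBSVM). The synthetic data and the values of σ and C are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed synthetic data; sigma and C are illustrative, not the paper's values.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

sigma, C = 0.5, 10.0
# RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), i.e. gamma = 1/(2 sigma^2).
clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, i.e. the expansion
# w = sum_{i in SV} alpha_i y_i phi(x_i) from the dual problem (4).
print("number of support vectors:", len(clf.support_))
print("first few alpha_i * y_i:", clf.dual_coef_.ravel()[:5])
```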

2.2. RLS

Starting from the training data z = {(xi, yi)}_{i=1}^m and the unknown true function fρ, instead of looking for the empirical optimal classifier that minimizes (1/m) Σ_{i=1}^{m} (yi − fz(xi))², RLS focuses on the problem of estimating [5, 6]

  ∫_X (fz − fρ)² dρ_X.    (5)

In order to search for fz, it begins with a hypothesis space H. Define the "true optimum" fH relative to H, that is,

  fH = arg min_{f∈H} ∫_X (f − fρ)² dρ_X.

The problem above can then be decomposed as [6]

  ∫_X (fz − fρ)² dρ_X = S(z, H) + ∫_X (fH − fρ)² dρ_X,    (6)

where S(z, H) = ∫_X (fz − fρ)² dρ_X − ∫_X (fH − fρ)² dρ_X. On the right-hand side of (6), the first term is called the sample error (sometimes the estimation error), while the second term is called the approximation error [6].

The RLS algorithm chooses an RKHS as the hypothesis space H_K and minimizes the following regularized functional:

  (1/m) Σ_{i=1}^{m} (yi − f(xi))² + γ‖f‖²_K,    (7)

where ‖f‖²_K is the norm in H_K defined by the kernel K, and γ is a fixed parameter. The minimizer exists and is unique [6]. It turns out that the solution to the above optimization problem is quite simple: compute c = (c1, c2, · · · , cm)^t by solving the equation

  (mγI + K)c = y,    (8)

where K is the Gram (kernel) matrix and y = (y1, y2, · · · , ym)^t. The resulting classifier f is (in the appendix we show how to derive f)

  f(x) = Σ_{i=1}^{m} ci K(x, xi).    (9)

For the binary classification {−1, 1} case, if f(x) ≤ 0 the predicted class is −1; otherwise it is 1. Note that there is no issue of separability or nonseparability for this algorithm.
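A minimal Python sketch of this procedure — build the Gram matrix, solve the linear system (8), and predict with (9) — is given below. The Gaussian kernel and the NumPy-based implementation are assumptions for illustration, not code from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def rls_fit(X, y, sigma, gamma):
    """Solve (m*gamma*I + K) c = y, Eq. (8)."""
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(m * gamma * np.eye(m) + K, y.astype(float))

def rls_predict(X_train, c, X_test, sigma):
    """f(x) = sum_i c_i K(x, x_i), Eq. (9); f(x) <= 0 maps to class -1, else +1."""
    f = gaussian_kernel(X_test, X_train, sigma) @ c
    return np.where(f <= 0, -1, 1)
```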

2.3. Complexity

The bulk of the computational cost associated with SVMs is incurred by solving the quadratic programming problem (4). This optimization can be bounded by O(Ns^3 + Ns^2 m + Ns n m) [1], where Ns is the number of support vectors and n the dimension of the input data. In the worst case, Ns ≈ m, and we have O(nm²). On the other hand, solving the linear system of equations (8) has been studied for a long time, and efficient algorithms exist in numerical analysis (the condition number is good if mγ is large). In the worst case, it can be bounded by O(m^2.376) [3]. Overall, an RLS solution can be obtained much faster than one computed by SVMs. However, an SVM solution has a sparse representation, which can be advantageous in prediction.

3. Experiments

The RLS algorithm can be implemented in a straightforward way. For SVMs, we used the LIBSVM package [2]. For both algorithms we adopt the same kernel function, the Gaussian K(x, x′) = exp(−‖x − x′‖²/(2σ²)). The SVM algorithm has two procedural parameters: σ and C, the soft-margin parameter. Similarly, the RLS algorithm also has two parameters, σ and γ; σ is common to both. For model selection, ten-fold cross-validation is used. σ takes values in [10^-15, 10^15], C in [10^-15, 10^15], and γ in [10^-15, 10^5].
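An illustrative sketch of such a ten-fold cross-validated grid search for the SVM parameters is shown below, using scikit-learn's GridSearchCV. The synthetic data and the much coarser, log-spaced grid are assumptions made to keep the example cheap to run; the paper's search ranges are far wider.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Illustrative data and grids (assumptions, not the paper's ranges).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

sigmas = np.logspace(-2, 2, 5)
param_grid = {"gamma": 1.0 / (2.0 * sigmas**2),   # RBF gamma = 1 / (2 sigma^2)
              "C": np.logspace(-2, 2, 5)}

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # ten-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

For RLS, the same protocol applies with γ in place of C; one would wrap the rls_fit/rls_predict sketch from Section 2.2 in a small manual cross-validation loop over (σ, γ).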

3.1. Real Data Experiments

Twelve datasets from the UCI Machine Learning Repository were used for comparison: glass, cancer, cancer-w, credit card, heart cleveland, heart hungarian, ionosphere, iris, letter (only v and w are chosen), new thyroid, pima indian, and sonar. Some datasets have minor missing data; in that case, the missing data are removed. All features are normalized to lie between 0 and 1. For every dataset, we randomly choose 60% as training data and the remaining 40% as testing data. The process is repeated 10 times and the average error rates obtained by the two methods are reported.
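The evaluation protocol can be sketched as follows (an assumed implementation, not the authors' code): min-max normalize every feature to [0, 1], repeat a random 60/40 split ten times, and average the test error. The `fit_predict` callable is a hypothetical stand-in for either classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, y, fit_predict, n_repeats=10, seed=0):
    """Mean and std of test error over repeated random 60/40 splits."""
    # Normalize every feature to lie between 0 and 1.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    errors = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.6, random_state=seed + r)
        y_hat = fit_predict(X_tr, y_tr, X_te)       # train, then predict on test set
        errors.append(np.mean(y_hat != y_te))       # classification error rate
    return float(np.mean(errors)), float(np.std(errors))
```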


               RLS                 SVM
               µ        σ          µ        σ
  sonar        0.1524   0.0311     0.1354   0.0428
  glass        0.0576   0.0188     0.0624   0.0222
  creditcard   0.1227   0.0171     0.1300   0.0158
  heart-c      0.1805   0.0215     0.1797   0.0203
  heart-h      0.2120   0.0390     0.2248   0.0349
  iris         0.0825   0.0409     0.0475   0.0184
  ionosphere   0.0629   0.0184     0.0607   0.0155
  thyroid      0.0291   0.0126     0.0279   0.0199
  letter(u,w)  0.0022   0.0022     0.0016   0.0013
  pima         0.2404   0.0159     0.2430   0.0227
  cancer-w     0.0268   0.0072     0.0331   0.0069
  cancer       0.2755   0.0197     0.2855   0.0193
  Overall      0.1204   0.0204     0.1184   0.0195

Table 1. Average classification error rates.

Table 1 shows the average classification error rate on each dataset and the aggregated error rate for both methods. Overall, there is little difference between the two methods, which supports the claim made in [6].

4. Simulated Data Experiments

In order to further understand the performance of both algorithms, we also apply them to simulated data examples. We used three simulated data examples in two dimensions (the unit square). In the first example, shown in the left panel of Figure 1, areas 1 and 4 are class {−1} data, while areas 2 and 3 are class {1} data. The data in each area are uniformly randomly distributed. There is no overlap in this case.

The second example is extreme: the two classes are totally mixed with each other (both are uniformly randomly distributed over the same range). However, the proportion of one class is greater than the other; 80% of the data are in class {1}. The optimal Bayes classifier therefore classifies every data point as having label {1}, and the Bayes error rate is 0.2.

The third example is more interesting: part of the data is clearly separable and part of the data overlaps. As shown in the right panel of Figure 1, areas 1 and 4 are class {−1} data and areas 2 and 3 are class {1} data. Areas 5 and 6 overlap each other, and class {1} dominates: 70% of the data in these overlapped areas are in class {1}. The training data are generated uniformly randomly in each area. In the overlapped areas the data are randomly labeled as either class {1} or class {−1} according to the proportion described above. For each data set, 40 training data points and 2601 testing data points are randomly generated.
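The following sketch generates data in the spirit of two of these examples. The quadrant layout of areas 1–4 is an assumption for illustration, since the paper specifies the class assignment per area but not the exact geometry; the mixture example follows the stated 80/20 proportion.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_example(n):
    """Totally mixed example: both classes uniform on the unit square,
    80% of points in class +1, so the Bayes rule predicts +1 everywhere
    and the Bayes error rate is 0.2."""
    X = rng.uniform(size=(n, 2))
    y = np.where(rng.uniform(size=n) < 0.8, 1, -1)
    return X, y

def nonoverlapping_example(n):
    """Non-overlapping example, assuming areas 1-4 are the four quadrants
    of the unit square, with diagonal quadrants sharing a label."""
    X = rng.uniform(size=(n, 2))
    left = X[:, 0] < 0.5
    bottom = X[:, 1] < 0.5
    y = np.where(left == bottom, -1, 1)   # assumed checkerboard-style layout
    return X, y

X_train, y_train = mixture_example(40)      # 40 training points, as in Section 4
X_test, y_test = mixture_example(2601)      # 2601 test points
```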

Figure 1. Left panel: non-overlapping example. Right panel: overlapping example.

This process is repeated ten times to obtain the average classification error rates (Table 2) for both methods.

             RLS                 SVM
             µ        σ          µ        σ
  square     0.0792   0.0226     0.0753   0.0228
  overlap    0.1445   0.0186     0.1435   0.0454
  mix        0.0363   0.0549     0.1703   0.0752

Table 2. Average classification error rates on the simulated data.

Figure 2. Non-overlapping example. Left panel: Bayes prediction. Middle panel: RLS prediction. Right panel: SVM prediction.

The results show that for the separable and partially overlapped data, RLS and SVMs registered similar performance, but the SVM algorithm gave a larger variance on the partially overlapped data. For the mixture data, RLS performs significantly better than SVMs and approaches the Bayes error rate; in fact, RLS achieved the Bayes error rate in five out of ten runs. Generating testing data on a mesh, we can plot the corresponding predictions of the two methods (Figures 2–4). For each problem, from left to right, the predictions by the Bayes classifier, RLS, and SVMs are shown.


Figure 3. Mixture example. Left panel: Bayes prediction. Middle panel: RLS prediction. Right panel: SVM prediction.

Figure 4. Overlapping example. Left panel: Bayes prediction. Middle panel: RLS prediction. Right panel: SVM prediction.

As can be seen, the decision boundary generated by SVMs seems to be smoother than that generated by RLS. On the other hand, SVMs produce larger portions of wrong predictions in the mixture data case. The results also show that, after training, the c vector of RLS is not sparse. SVMs, by contrast, retain on average about 1/3 of the training data as support vectors, thereby producing a sparse representation of solutions. Sparseness is a nice property in that it requires less storage space and less time for actual prediction.
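This difference in representation size can be inspected directly after training. The sketch below is illustrative only: it reuses the RLS solve from Section 2.2 and scikit-learn's SVC on assumed synthetic data, not the paper's datasets or parameter settings.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative comparison: RLS keeps a dense coefficient vector c of length m,
# while the SVM keeps only its support vectors.
rng = np.random.default_rng(0)
m, sigma, gamma_rls = 120, 0.3, 1e-3
X = rng.uniform(size=(m, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma**2))
c = np.linalg.solve(m * gamma_rls * np.eye(m) + K, y.astype(float))   # Eq. (8)

clf = SVC(C=10.0, kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(X, y)

print("nonzero RLS coefficients:", int(np.sum(np.abs(c) > 1e-8)), "of", m)
print("SVM support vectors:     ", len(clf.support_), "of", m)
```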

5. Summary

This paper examines empirically two recent promising methods, SVMs and RLS, on a collection of data sets. The results show that there is little difference in performance between the two classification techniques. The most striking property of RLS is its simplicity: there are well-developed algorithms for solving RLS efficiently in numerical and computational analysis. On the other hand, SVMs in general have a compact representation of solutions, which may be important in time-critical applications; RLS may not have such guarantees. However, it is possible to design algorithms for achieving sparse RLS solutions.

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[3] D. Coppersmith and S. Winograd. On the asymptotic complexity of matrix multiplication. SIAM Journal on Computing, 11(3):472–492, Aug. 1982.
[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[5] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, pages 1–49, October 2001.
[6] T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Notices of the AMS, 50(5):537–544, May 2003.
[7] B. Schölkopf. The kernel trick for distances. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 301–307. The MIT Press, 2001.
[8] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[9] B. Schölkopf et al. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, September 1999.
[10] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

6. Appendix

Given any g ∈ H_K and t ∈ R, build the function

  F(t) = (1/m) Σ_{i=1}^{m} (yi − f(xi) − t g(xi))² + γ ‖f + t g‖²_K.

In order to find the minimizer, take the derivative of F(t):

  F′(t) = (2/m) Σ_{i=1}^{m} (yi − f(xi) − t g(xi))(−g(xi)) + 2γ⟨f, g⟩ + 2γt⟨g, g⟩.

At the minimizer, F′(0) = 0:

  F′(0) = −(2/m) Σ_{i=1}^{m} (yi − f(xi)) g(xi) + 2γ⟨f, g⟩ = 0,

which is equivalent to

  Σ_{i=1}^{m} [(yi − f(xi)) / (mγ)] g(xi) − ⟨f, g⟩ = 0.

Since this must hold for any g, take g = K_x and let ci = (yi − f(xi)) / (mγ); we then have

  ⟨f, K_x⟩ = Σ_{i=1}^{m} ci K_x(xi).    (10)

Because ⟨K_x, K_y⟩ = K(x, y), it follows that

  f(x) = Σ_{i=1}^{m} ci K(xi, x).
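The derivation can also be checked numerically: writing f(x) = Σ ci K(x, xi) turns the functional (7) into a quadratic in c whose gradient, (2/m) K ((K + mγI)c − y), vanishes exactly at the solution of (8). The sketch below (with illustrative data and parameter values, not the paper's) verifies this.

```python
import numpy as np

# Numerical check of the appendix result (illustrative values only).
rng = np.random.default_rng(1)
m, sigma, gamma = 50, 0.5, 1e-2
X = rng.uniform(size=(m, 2))
y = np.where(X[:, 0] > X[:, 1], 1.0, -1.0)

# Gaussian Gram matrix and the solution of Eq. (8).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma**2))
c = np.linalg.solve(m * gamma * np.eye(m) + K, y)

# Gradient of G(c) = (1/m) ||y - K c||^2 + gamma * c^T K c at the solution.
grad = (2.0 / m) * K @ ((K + m * gamma * np.eye(m)) @ c - y)
print("max |gradient| at the solution:", np.max(np.abs(grad)))   # ~ machine precision
```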
