SVM vs Regularized Least Squares Classification

Peng Zhang and Jing Peng
Electrical Engineering and Computer Science Department
Tulane University, New Orleans, LA 70118, USA
{zhangp,jp}@eecs.tulane.edu

Abstract

Support vector machines (SVMs) and regularized least squares (RLS) are two recent promising techniques for classification. SVMs implement the structural risk minimization principle and use the kernel trick to extend it to the nonlinear case. RLS, on the other hand, minimizes a regularized functional directly in a reproducing kernel Hilbert space defined by a kernel. While both have a sound mathematical foundation, RLS is strikingly simple, whereas SVMs in general have a sparse representation of solutions. In addition, the performance of SVMs has been well documented, while little can be said of RLS. This paper applies the two techniques to a collection of data sets and presents results demonstrating virtually identical performance by the two methods.

1. Introduction

Support vector machines (SVMs) have been successfully used as a classification tool in a number of areas, ranging from object recognition to classification of cancer morphologies [4, 7, 8, 9, 10]. SVMs realize the structural risk minimization principle [10] by maximizing the margin between the separating plane and the data, and use the kernel trick to extend them to the nonlinear case. The regularized least squares (RLS) method [6], on the other hand, constructs classifiers by minimizing a regularized functional directly in a reproducing kernel Hilbert space (RKHS) induced by a kernel function [5, 6]. While both methods have a sound mathematical foundation, the performance of SVMs has been relatively well documented, yet little can be said of RLS. RLS is claimed to be fully comparable in performance to SVMs [6], but empirical evidence has been lacking thus far. We present in this paper the results of applying these two techniques to a collection of data sets. Our results demonstrate that the two methods are indeed similar in performance.

2. SVMs and RLS

Our learning problem is formulated as follows. Given a set of training data (xi, yi), where xi represents the ith feature vector in ℜn and yi ∈ ℜ the label of xi (in the binary case yi ∈ {−1, 1}), the goal of learning is to find a mapping f : X → Y that is predictive (i.e., generalizes well). The data (x, y) are drawn randomly according to an unknown probability measure ρ on the product space X × Y. There is a true input-output function fρ reflecting the environment that produces the data. Given any mapping function f, the measure of the error of f is

  ∫_X (f − fρ)² dρ_X,

where ρ_X is the measure on X induced by the marginal measure ρ. The objective of learning is to find f as close to fρ as possible. Given the training data z = {(xi, yi)}_{i=1}^m,

  R_SVM = (1/m) Σ_{i=1}^{m} |yi − fz(xi)|    (1)

represents the empirical error that fz makes on the data z, where the classifier fz is induced by SVMs from z. For RLS, on the other hand, the empirical error is

  R_RLS = (1/m) Σ_{i=1}^{m} (yi − fz(xi))².    (2)

Note that the main issue concerning learning is generalization: a good (predictive) classifier minimizes the error it makes on new (unseen) data, not on the training data. Also, learning starts from a hypothesis space from which f is chosen.
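As a concrete illustration (not part of the original paper), the two empirical errors (1) and (2) can be computed for any trained classifier. The sketch below assumes NumPy arrays and a generic real-valued decision function; the names are hypothetical.

```python
import numpy as np

def empirical_errors(f_z, X, y):
    """Empirical errors of a trained classifier f_z on data z = {(x_i, y_i)}.

    R_SVM (Eq. 1) uses absolute loss; R_RLS (Eq. 2) uses squared loss.
    `f_z` is any callable mapping an (m, n) array to m real-valued outputs.
    """
    pred = f_z(X)                        # real-valued outputs f_z(x_i)
    r_svm = np.mean(np.abs(y - pred))    # (1/m) * sum_i |y_i - f_z(x_i)|
    r_rls = np.mean((y - pred) ** 2)     # (1/m) * sum_i (y_i - f_z(x_i))^2
    return r_svm, r_rls
```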

2.1. SVMs

In the SVM framework, unlike typical classification methods that simply minimize R_SVM, SVMs minimize the following upper bound of the expected generalization error:

  R ≤ R_SVM + C(h),    (3)


where C represents the "VC confidence" and h the VC dimension. This can be accomplished by maximizing the margin between the separating plane and the data, which can be viewed as realizing the structural risk minimization principle [10]. The SVM solution produces a hyperplane having the maximum margin, where the margin is defined as 2/‖w‖. It is shown [1, 4, 10] that this hyperplane is optimal with respect to the maximum margin. The hyperplane, determined by its normal vector w, can be explicitly written as

  w = Σ_{i∈SV} αi yi xi,

where the αi are Lagrange coefficients that maximize

  L_D = Σ_i αi − (1/2) Σ_{i,j} αi αj yi yj xi · xj    (4)

and SV is the set of support vectors determined by the SVM. For the nonlinear case, the dot product can be replaced by kernel functions.
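For concreteness, a minimal sketch of solving the dual problem (4) with a Gaussian kernel is given below, using scikit-learn's SVC (which wraps LIBSVM). The synthetic data and the values of σ and C are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed synthetic data; sigma and C are illustrative, not the paper's values.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

sigma, C = 0.5, 10.0
# RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), i.e. gamma = 1/(2 sigma^2).
clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma**2))
clf.fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, i.e. the expansion
# w = sum_{i in SV} alpha_i y_i phi(x_i) from the dual problem (4).
print("number of support vectors:", len(clf.support_))
print("first few alpha_i * y_i:", clf.dual_coef_.ravel()[:5])
```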

2.2. RLS

Starting from the training data z = {(xi, yi)}_{i=1}^m and the unknown true function fρ, instead of looking for the empirical optimal classifier that minimizes (1/m) Σ_{i=1}^{m} (yi − fz(xi))², RLS focuses on the problem of estimating [5, 6]

  ∫_X (fz − fρ)² dρ_X.    (5)

In order to search for fz, it begins with a hypothesis space H. Define the "true optimum" fH relative to H, that is,

  fH = arg min_{f∈H} ∫_X (f − fρ)² dρ_X.

The problem above can then be decomposed as [6]

  ∫_X (fz − fρ)² dρ_X = S(z, H) + ∫_X (fH − fρ)² dρ_X,    (6)

where S(z, H) = ∫_X (fz − fρ)² dρ_X − ∫_X (fH − fρ)² dρ_X. On the right-hand side of (6), the first term is called the sample error (sometimes the estimation error), while the second term is called the approximation error [6].

The RLS algorithm chooses an RKHS as the hypothesis space H_K and minimizes the following regularized functional:

  (1/m) Σ_{i=1}^{m} (yi − f(xi))² + γ‖f‖²_K,    (7)

where ‖f‖²_K is the norm in H_K defined by the kernel K, and γ is a fixed parameter. The minimizer exists and is unique [6]. It turns out that the solution to the above optimization problem is quite simple: compute c = (c1, c2, · · · , cm)^t by solving the equation

  (mγI + K)c = y,    (8)

where K is the Gram (kernel) matrix and y = (y1, y2, · · · , ym)^t. The resulting classifier f is (in the appendix we show how to derive f)

  f(x) = Σ_{i=1}^{m} ci K(x, xi).    (9)

For the binary classification {−1, 1} case, if f(x) ≤ 0 the predicted class is −1; otherwise it is 1. Note that there is no issue of separability or nonseparability for this algorithm.
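A minimal Python sketch of this procedure — build the Gram matrix, solve the linear system (8), and predict with (9) — is given below. The Gaussian kernel and the NumPy-based implementation are assumptions for illustration, not code from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def rls_fit(X, y, sigma, gamma):
    """Solve (m*gamma*I + K) c = y, Eq. (8)."""
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(m * gamma * np.eye(m) + K, y.astype(float))

def rls_predict(X_train, c, X_test, sigma):
    """f(x) = sum_i c_i K(x, x_i), Eq. (9); f(x) <= 0 maps to class -1, else +1."""
    f = gaussian_kernel(X_test, X_train, sigma) @ c
    return np.where(f <= 0, -1, 1)
```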

2.3. Complexity

The bulk of the computational cost associated with SVMs is incurred by solving the quadratic programming problem (4). This optimization can be bounded by O(Ns^3 + Ns^2 m + Ns n m) [1], where Ns is the number of support vectors and n the dimension of the input data. In the worst case, Ns ≈ m, and we have O(nm²). On the other hand, solving the linear system of equations (8) has been studied for a long time, and efficient algorithms exist in numerical analysis (the condition number is good if mγ is large). In the worst case, it can be bounded by O(m^2.376) [3]. Overall, an RLS solution can be obtained much faster than one computed by SVMs. However, an SVM solution has a sparse representation, which can be advantageous in prediction.

3. Experiments

The RLS algorithm can be implemented in a straightforward way. For SVMs, we used the LIBSVM package [2]. For both algorithms we adopt the same kernel function, the Gaussian K(x, x′) = exp(−‖x − x′‖²/(2σ²)). The SVM algorithm has two procedural parameters: σ and C, the soft-margin parameter. Similarly, the RLS algorithm also has two parameters, σ and γ; σ is common to both. For model selection, ten-fold cross-validation is used. σ takes values in [10^-15, 10^15], C in [10^-15, 10^15], and γ in [10^-15, 10^5].
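An illustrative sketch of such a ten-fold cross-validated grid search for the SVM parameters is shown below, using scikit-learn's GridSearchCV. The synthetic data and the much coarser, log-spaced grid are assumptions made to keep the example cheap to run; the paper's search ranges are far wider.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Illustrative data and grids (assumptions, not the paper's ranges).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

sigmas = np.logspace(-2, 2, 5)
param_grid = {"gamma": 1.0 / (2.0 * sigmas**2),   # RBF gamma = 1 / (2 sigma^2)
              "C": np.logspace(-2, 2, 5)}

cv = KFold(n_splits=10, shuffle=True, random_state=0)   # ten-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

For RLS, the same protocol applies with γ in place of C; one would wrap the rls_fit/rls_predict sketch from Section 2.2 in a small manual cross-validation loop over (σ, γ).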

3.1. Real Data Experiments

Twelve datasets from the UCI Machine Learning Repository were used for comparison: glass, cancer, cancer-w, credit card, heart cleveland, heart hungarian, ionosphere, iris, letter (only v and w are chosen), new thyroid, pima indian, and sonar. Some datasets have minor missing data; in that case, the missing data are removed. All features are normalized to lie between 0 and 1. For every dataset, we randomly choose 60% as training data and the remaining 40% as testing data. The process is repeated 10 times and the average error rates obtained by the two methods are reported.
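The evaluation protocol can be sketched as follows (an assumed implementation, not the authors' code): min-max normalize every feature to [0, 1], repeat a random 60/40 split ten times, and average the test error. The `fit_predict` callable is a hypothetical stand-in for either classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(X, y, fit_predict, n_repeats=10, seed=0):
    """Mean and std of test error over repeated random 60/40 splits."""
    # Normalize every feature to lie between 0 and 1.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    errors = []
    for r in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.6, random_state=seed + r)
        y_hat = fit_predict(X_tr, y_tr, X_te)       # train, then predict on test set
        errors.append(np.mean(y_hat != y_te))       # classification error rate
    return float(np.mean(errors)), float(np.std(errors))
```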


               RLS                 SVM
               µ        σ          µ        σ
  sonar        0.1524   0.0311     0.1354   0.0428
  glass        0.0576   0.0188     0.0624   0.0222
  creditcard   0.1227   0.0171     0.1300   0.0158
  heart-c      0.1805   0.0215     0.1797   0.0203
  heart-h      0.2120   0.0390     0.2248   0.0349
  iris         0.0825   0.0409     0.0475   0.0184
  ionosphere   0.0629   0.0184     0.0607   0.0155
  thyroid      0.0291   0.0126     0.0279   0.0199
  letter(u,w)  0.0022   0.0022     0.0016   0.0013
  pima         0.2404   0.0159     0.2430   0.0227
  cancer-w     0.0268   0.0072     0.0331   0.0069
  cancer       0.2755   0.0197     0.2855   0.0193
  Overall      0.1204   0.0204     0.1184   0.0195

Table 1. Average classification error rates.

Table 1 shows the average classification error rate on each dataset and the aggregated error rate for both methods. Overall, there is little difference between the two methods, which supports the claim made in [6].

4. Simulated Data Experiments

In order to further understand the performance of both algorithms, we also apply them to simulated data examples. We used three simulated data examples in two dimensions (the unit square). In the first example, shown in the left panel of Figure 1, areas 1 and 4 are class {−1} data, while areas 2 and 3 are class {1} data. The data in each area are uniformly randomly distributed. There is no overlap in this case.

The second example is extreme: the two classes are totally mixed with each other (both are uniformly randomly distributed over the same range). However, the proportion of one class is greater than the other; 80% of the data are in class {1}. The optimal Bayes classifier therefore classifies every data point as having label {1}, and the Bayes error rate is 0.2.

The third example is more interesting: part of the data is clearly separable and part of the data overlaps. As shown in the right panel of Figure 1, areas 1 and 4 are class {−1} data and areas 2 and 3 are class {1} data. Areas 5 and 6 overlap each other, and class {1} dominates: 70% of the data in these overlapped areas are in class {1}. The training data are generated uniformly randomly in each area. In the overlapped areas the data are randomly labeled as either class {1} or class {−1} according to the proportion described above. For each data set, 40 training data points and 2601 testing data points are randomly generated.
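The following sketch generates data in the spirit of two of these examples. The quadrant layout of areas 1–4 is an assumption for illustration, since the paper specifies the class assignment per area but not the exact geometry; the mixture example follows the stated 80/20 proportion.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixture_example(n):
    """Totally mixed example: both classes uniform on the unit square,
    80% of points in class +1, so the Bayes rule predicts +1 everywhere
    and the Bayes error rate is 0.2."""
    X = rng.uniform(size=(n, 2))
    y = np.where(rng.uniform(size=n) < 0.8, 1, -1)
    return X, y

def nonoverlapping_example(n):
    """Non-overlapping example, assuming areas 1-4 are the four quadrants
    of the unit square, with diagonal quadrants sharing a label."""
    X = rng.uniform(size=(n, 2))
    left = X[:, 0] < 0.5
    bottom = X[:, 1] < 0.5
    y = np.where(left == bottom, -1, 1)   # assumed checkerboard-style layout
    return X, y

X_train, y_train = mixture_example(40)      # 40 training points, as in Section 4
X_test, y_test = mixture_example(2601)      # 2601 test points
```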

Figure 1. Left panel: non-overlapping example. Right panel: overlapping example.

This process is repeated ten times to obtain the average classification error rates (Table 2) for both methods.

             RLS                 SVM
             µ        σ          µ        σ
  square     0.0792   0.0226     0.0753   0.0228
  overlap    0.1445   0.0186     0.1435   0.0454
  mix        0.0363   0.0549     0.1703   0.0752

Table 2. Average classification error rates on the simulated data.

Figure 2. Non-overlapping example. Left panel: Bayes prediction. Middle panel: RLS prediction. Right panel: SVM prediction.

The results show that for the separable and partially overlapped data, RLS and SVMs registered similar performance, but the SVM algorithm gave a larger variance on the partially overlapped data. For the mixture data, RLS performs significantly better than SVMs and approaches the Bayes error rate; in fact, RLS achieved the Bayes error rate in five out of ten runs. Generating testing data on a mesh, we can plot the corresponding predictions of the two methods (Figures 2–4). For each problem, from left to right, the predictions by the Bayes classifier, RLS, and SVMs are shown.


Figure 3. Mixture example. Left panel: Bayes prediction. Middle panel: RLS prediction. Right panel: SVM prediction.

Figure 4. Overlapping example. Left panel: Bayes prediction. Middle panel: RLS prediction. Right panel: SVM prediction.

As can be seen, the decision boundary generated by SVMs seems to be smoother than that generated by RLS. On the other hand, SVMs produce larger portions of wrong predictions in the mixture data case. The results also show that, after training, the c vector of RLS is not sparse. SVMs, by contrast, retain on average about 1/3 of the training data as support vectors, thereby producing a sparse representation of solutions. Sparseness is a nice property in that it requires less storage space and less time for actual prediction.
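This difference in representation size can be inspected directly after training. The sketch below is illustrative only: it reuses the RLS solve from Section 2.2 and scikit-learn's SVC on assumed synthetic data, not the paper's datasets or parameter settings.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative comparison: RLS keeps a dense coefficient vector c of length m,
# while the SVM keeps only its support vectors.
rng = np.random.default_rng(0)
m, sigma, gamma_rls = 120, 0.3, 1e-3
X = rng.uniform(size=(m, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma**2))
c = np.linalg.solve(m * gamma_rls * np.eye(m) + K, y.astype(float))   # Eq. (8)

clf = SVC(C=10.0, kernel="rbf", gamma=1.0 / (2 * sigma**2)).fit(X, y)

print("nonzero RLS coefficients:", int(np.sum(np.abs(c) > 1e-8)), "of", m)
print("SVM support vectors:     ", len(clf.support_), "of", m)
```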

5. Summary

This paper examines empirically two recent promising methods, SVMs and RLS, on a collection of data sets. The results show that there is little difference in performance between the two classification techniques. The most striking property of RLS is its simplicity: there are well-developed algorithms for solving RLS efficiently in numerical and computational analysis. On the other hand, SVMs in general have a compact representation of solutions, which may be important in time-critical applications; RLS may not have such guarantees. However, it is possible to design algorithms for achieving sparse RLS solutions.

References

[1] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001.
[3] D. Coppersmith and S. Winograd. On the asymptotic complexity of matrix multiplication. SIAM Journal on Computing, 11(3):472–492, Aug. 1982.
[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[5] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, pages 1–49, October 2001.
[6] T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Notices of the AMS, 50(5):537–544, May 2003.
[7] B. Schölkopf. The kernel trick for distances. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 301–307. The MIT Press, 2001.
[8] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[9] B. Schölkopf et al. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, September 1999.
[10] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

6. Appendix

Given any g ∈ H_K and t ∈ R, build the function

  F(t) = (1/m) Σ_{i=1}^{m} (yi − f(xi) − t g(xi))² + γ ‖f + t g‖²_K.

In order to find the minimizer, take the derivative of F(t):

  F′(t) = (2/m) Σ_{i=1}^{m} (yi − f(xi) − t g(xi))(−g(xi)) + 2γ⟨f, g⟩ + 2γt⟨g, g⟩.

At the minimizer, F′(0) = 0:

  F′(0) = −(2/m) Σ_{i=1}^{m} (yi − f(xi)) g(xi) + 2γ⟨f, g⟩ = 0,

which is equivalent to

  Σ_{i=1}^{m} [(yi − f(xi)) / (mγ)] g(xi) − ⟨f, g⟩ = 0.

Since this must hold for any g, take g = K_x and let ci = (yi − f(xi)) / (mγ); we then have

  ⟨f, K_x⟩ = Σ_{i=1}^{m} ci K_x(xi).    (10)

Because ⟨K_x, K_y⟩ = K(x, y), it follows that

  f(x) = Σ_{i=1}^{m} ci K(xi, x).
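The derivation can also be checked numerically: writing f(x) = Σ ci K(x, xi) turns the functional (7) into a quadratic in c whose gradient, (2/m) K ((K + mγI)c − y), vanishes exactly at the solution of (8). The sketch below (with illustrative data and parameter values, not the paper's) verifies this.

```python
import numpy as np

# Numerical check of the appendix result (illustrative values only).
rng = np.random.default_rng(1)
m, sigma, gamma = 50, 0.5, 1e-2
X = rng.uniform(size=(m, 2))
y = np.where(X[:, 0] > X[:, 1], 1.0, -1.0)

# Gaussian Gram matrix and the solution of Eq. (8).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma**2))
c = np.linalg.solve(m * gamma * np.eye(m) + K, y)

# Gradient of G(c) = (1/m) ||y - K c||^2 + gamma * c^T K c at the solution.
grad = (2.0 / m) * K @ ((K + m * gamma * np.eye(m)) @ c - y)
print("max |gradient| at the solution:", np.max(np.abs(grad)))   # ~ machine precision
```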
