  1. Support Vector Machines 290N, 2014

  2. Support Vector Machines (SVM)
  - Supervised learning methods for classification and regression.
  - They can represent non-linear functions and they have an efficient training algorithm.
  - Derived from statistical learning theory by Vapnik and Chervonenkis (COLT-92).
  - SVMs entered the mainstream because of their exceptional performance in handwritten digit recognition: a 1.1% error rate, comparable to a very carefully constructed (and complex) ANN.

  3. Two-Class Problem: Linearly Separable Case
  - Many decision boundaries can separate these two classes (Class 1 and Class 2).
  - Which one should we choose?
  [Figure: points from Class 1 and Class 2 with several candidate separating lines]

  4. Example of Bad Decision Boundaries
  [Figure: two plots of Class 1 and Class 2 points with poorly chosen decision boundaries]

  5. Another intuition
  - If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.

  6. Support Vector Machine (SVM)
  - SVMs maximize the margin around the separating hyperplane; they are a.k.a. large-margin classifiers.
  - The decision function is fully specified by a subset of the training samples, the support vectors.
  - Maximizing the margin is a quadratic programming problem.
  [Figure: separating hyperplane with the margin and the support vectors marked]

  7. Training examples for document ranking
  Two ranking signals are used: cosine text-similarity score and proximity of the term appearance window.
  DocID  Query                   Cosine score  Term proximity  Judgment
  37     linux operating system  0.032         3               relevant
  37     penguin logo            0.02          4               nonrelevant
  238    operating system        0.043         2               relevant
  238    runtime environment     0.004         2               nonrelevant
  1741   kernel layer            0.022         3               relevant
  2094   device driver           0.03          2               relevant
  3191   device driver           0.027         5               nonrelevant

  8. Proposed scoring function for ranking
  [Figure: the training examples plotted with term proximity on one axis and cosine score on the other; relevant (R) and nonrelevant (N) documents are separated by a proposed linear scoring function]
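  A small sketch of what such a linear scoring function over the two signals from the table could look like. The weights and bias below are hand-picked for illustration only; they are not values from the slides (a trained SVM would learn them from the labeled examples).

```python
import numpy as np

# Features: (cosine score, term proximity window); labels: +1 relevant, -1 nonrelevant.
X = np.array([[0.032, 3], [0.020, 4], [0.043, 2], [0.004, 2],
              [0.022, 3], [0.030, 2], [0.027, 5]])
y = np.array([+1, -1, +1, -1, +1, +1, -1])

# Hand-picked (hypothetical) weights: reward a high cosine score, penalize a wide proximity window.
w = np.array([100.0, -1.0])
b = 1.2

scores = X @ w + b
print(np.sign(scores).astype(int))   # matches the relevance judgments above
```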

  9. Formalization
  - w: weight coefficients
  - x_i: data point i
  - y_i: class label of data point i (+1 or -1)
  - The classifier is: f(x_i) = sign(w^T x_i + b)
  - The functional margin of x_i is: y_i (w^T x_i + b)
  - We can increase this margin simply by scaling w and b...
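  A minimal NumPy sketch of this formalization (not from the slides; the parameter values are arbitrary), showing the decision function, the functional margin, and why rescaling w and b inflates the functional margin:

```python
import numpy as np

def decision(w, b, x):
    """Classifier f(x) = sign(w^T x + b)."""
    return np.sign(w @ x + b)

def functional_margin(w, b, x, y):
    """Functional margin y * (w^T x + b); positive iff x is classified correctly."""
    return y * (w @ x + b)

w, b = np.array([2.0, -1.0]), 0.5              # arbitrary example parameters
x, y = np.array([1.0, 1.0]), +1.0
print(decision(w, b, x), functional_margin(w, b, x, y))

# Scaling (w, b) by any c > 0 scales the functional margin by c as well:
print(functional_margin(10 * w, 10 * b, x, y))
```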

  10. Linear Support Vector Machine (SVM)
  - Hyperplane: w^T x + b = 0
  - Margin boundaries through the support vectors x_a and x_b: w^T x_a + b = 1 and w^T x_b + b = -1
  - Support vectors: the data points that the margin pushes up against
  - Margin width: ρ = ||x_a - x_b||_2 = 2 / ||w||_2

  11. Geometric View: Margin of a Point
  - Distance from an example x to the separator: r = y (w^T x + b) / ||w||
  - Examples closest to the hyperplane are support vectors.
  - The margin ρ of the separator is the width of separation between the support vectors of the two classes.
  [Figure: hyperplane with a point x, its projection x′, the distance r, and the margin ρ]

  12. Geometric View of Margin
  - Distance to the separator: r = y (w^T x + b) / ||w||
  - Derivation: let x lie on the line w^T x + b = z and let x′ be its projection onto the hyperplane (so w^T x′ + b = 0). Then (w^T x + b) - (w^T x′ + b) = z - 0, i.e., w^T (x - x′) = z. Since x - x′ is parallel to w, ||w|| · ||x - x′|| = |z| = y (w^T x + b), and thus ||w|| · r = y (w^T x + b).

  13. Linear Support Vector Machine (SVM)
  - Hyperplane: w^T x + b = 0, with support vectors x_a and x_b on the margin boundaries w^T x_a + b = 1 and w^T x_b + b = -1.
  - Subtracting the two boundary equations implies w^T (x_a - x_b) = 2.
  - Therefore the margin is ρ = ||x_a - x_b||_2 = 2 / ||w||_2.
  - Support vectors: the data points that the margin pushes up against.

  14. Linear SVM Mathematically
  - Assume that all data is at least distance 1 from the hyperplane; then the following two constraints follow for a training set {(x_i, y_i)}:
      w^T x_i + b ≥ 1  if y_i = 1
      w^T x_i + b ≤ -1 if y_i = -1
  - For support vectors, the inequality becomes an equality.
  - Since each example's distance from the hyperplane is r = y (w^T x + b) / ||w||, the margin of the dataset is ρ = 2 / ||w||.
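  A small numeric illustration (made-up data, not from the slides) of the relationship above: once w and b are scaled so that the closest points satisfy y_i (w^T x_i + b) = 1, the margin width is 2 / ||w||.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w, b = np.array([0.5, 0.5]), -1.0            # chosen so the nearest points hit margin exactly 1

functional_margins = y * (X @ w + b)
print(functional_margins)                    # all >= 1; equality holds for the support vectors
print("margin width:", 2 / np.linalg.norm(w))
```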

  15. The Optimization Problem
  - Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i.
  - The decision boundary should classify all points correctly: y_i (w^T x_i + b) ≥ 1 for all i.
  - This gives a constrained optimization problem: minimize (1/2) ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for all i, where ||w||^2 = w^T w.

  16. Lagrangian of the Original Problem
  - The Lagrangian is L(w, b, α) = (1/2) w^T w - Σ_i α_i [ y_i (w^T x_i + b) - 1 ], with Lagrangian multipliers α_i ≥ 0.
  - Note that ||w||^2 = w^T w.
  - Setting the gradient of L with respect to w and b to zero gives w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0.

  17. The Dual Optimization Problem
  - We can transform the problem to its dual, in the new variables α_i (the Lagrangian multipliers): maximize Q(α) = Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j subject to α_i ≥ 0 and Σ α_i y_i = 0.
  - The data appear only through the dot products x_i^T x_j.
  - This is a convex quadratic programming (QP) problem, so the global maximum over the α_i can always be found.
  - There are well-established tools for solving this optimization problem (e.g., CPLEX).
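  A sketch of how this dual could be handed to a generic QP solver. It uses the open-source cvxopt package (the slide mentions commercial tools such as CPLEX) and assumes the data is linearly separable; the function name is mine, not from the slides.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(X, y):
    """Maximize Q(alpha) = sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j
    subject to alpha_i >= 0 and sum_i alpha_i y_i = 0, posed as a cvxopt minimization."""
    n = X.shape[0]
    K = X @ X.T                                   # dot products x_i^T x_j
    P = matrix(np.outer(y, y) * K)                # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(n))                       # maximizing sum(alpha) = minimizing -1^T alpha
    G = matrix(-np.eye(n))                        # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))    # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                     # the Lagrangian multipliers alpha_i
```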

  18. A Geometrical Interpretation
  [Figure: Class 1 and Class 2 points with the separating plane; most points have α_i = 0 (α_2, α_3, α_4, α_5, α_7, α_9, α_10), while the support vectors, the α's with values different from zero (α_1 = 0.8, α_6 = 1.4, α_8 = 0.6), hold up the separating plane]

  19. The Optimization Problem: Solution
  - The solution has the form: w = Σ α_i y_i x_i and b = y_k - w^T x_k for any x_k such that α_k ≠ 0.
  - Each non-zero α_i indicates that the corresponding x_i is a support vector.
  - The classifying function then has the form: f(x) = Σ α_i y_i x_i^T x + b.
  - Notice that it relies on an inner product between the test point x and the support vectors x_i; we will return to this later.
  - Also keep in mind that solving the optimization problem involved computing the inner products x_i^T x_j between all pairs of training points.
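  A sketch of turning the solution into a classifier, assuming alpha came from a QP solver such as the one sketched earlier (helper name and tolerance are mine):

```python
import numpy as np

def build_classifier(X, y, alpha, tol=1e-6):
    sv = alpha > tol                              # non-zero alpha_i -> support vectors
    w = (alpha[sv] * y[sv]) @ X[sv]               # w = sum_i alpha_i y_i x_i
    k = int(np.argmax(sv))                        # any x_k with alpha_k != 0 works
    b = y[k] - w @ X[k]                           # b = y_k - w^T x_k
    def f(x):
        # f(x) = sum_i alpha_i y_i x_i^T x + b, using only the support vectors
        return np.sign((alpha[sv] * y[sv]) @ (X[sv] @ x) + b)
    return w, b, f
```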

  20. Classification with SVMs
  - Given a new point (x_1, x_2), we can score its projection onto the hyperplane normal; in 2 dimensions, score = w_1 x_1 + w_2 x_2 + b.
  - I.e., compute the score w^T x + b = Σ α_i y_i x_i^T x + b.
  - Set a confidence threshold t: score > t means "yes", score < -t means "no", otherwise "don't know".
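  A direct transcription of this thresholded decision rule as a small helper; t is the user-chosen confidence threshold:

```python
import numpy as np

def classify_with_reject(w, b, x, t):
    """Return 'yes' / 'no' only when the score clears the confidence threshold t."""
    score = w @ x + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"
```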

  21. Soft Margin Classification
  - If the training set is not linearly separable, slack variables ξ_i can be added to allow misclassification of difficult or noisy examples.
  - Allow some errors: let some points be moved to where they belong, at a cost.
  - Still try to minimize training-set errors and to place the hyperplane "far" from each class (large margin).
  [Figure: separating hyperplane with two misclassified points and their slacks ξ_i and ξ_j]

  22. Soft Margin
  - We allow an "error" ξ_i in classification; it is based on the output of the discriminant function w^T x + b.
  - The sum of the ξ_i approximates the number of misclassified samples.
  - New objective function: minimize (1/2) w^T w + C Σ ξ_i.
  - C is a tradeoff parameter between error and margin, chosen by the user; a large C means a higher penalty on errors.
  [Figure: Class 1 and Class 2 separated by a soft-margin hyperplane]

  23. Soft Margin Classification, Mathematically
  - The old formulation: find w and b such that Φ(w) = (1/2) w^T w is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1.
  - The new formulation incorporating slack variables: find w and b such that Φ(w) = (1/2) w^T w + C Σ ξ_i is minimized and, for all {(x_i, y_i)}, y_i (w^T x_i + b) ≥ 1 - ξ_i with ξ_i ≥ 0 for all i.
  - The parameter C can be viewed as a way to control overfitting: a regularization term.
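  A minimal sketch of the soft-margin classifier using an off-the-shelf solver (scikit-learn's SVC with a linear kernel) rather than the QP machinery of these slides; the tiny dataset and the value of C are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Five collinear points; the negative point at (2.5, 2.5) sits between the positives,
# so the set is not linearly separable and some slack is unavoidable.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [2.5, 2.5]])
y = np.array([-1, -1, +1, +1, -1])

clf = SVC(kernel="linear", C=10.0)   # larger C = higher penalty on slack / training errors
clf.fit(X, y)
print(clf.coef_, clf.intercept_)     # the learned w and b
print(clf.support_)                  # indices of the support vectors
```

  Refitting with a much smaller C (say 0.1) widens the margin at the cost of tolerating more slack, which is exactly the error/margin tradeoff described on this slide.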

  24. The Optimization Problem
  - The dual of the problem is: maximize Q(α) = Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j subject to Σ α_i y_i = 0 and 0 ≤ α_i ≤ C.
  - w is again recovered as w = Σ α_i y_i x_i.
  - The only difference from the linearly separable case is that there is an upper bound C on the α_i.
  - Once again, a QP solver can be used to find the α_i efficiently!

  25. Soft Margin Classification: Solution
  - The dual problem for soft margin classification: find α_1 ... α_N such that Q(α) = Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i.
  - Neither the slack variables ξ_i nor their Lagrange multipliers appear in the dual problem!
  - Again, the x_i with non-zero α_i will be support vectors.
  - Solution to the dual problem: w = Σ α_i y_i x_i and b = y_k (1 - ξ_k) - w^T x_k where k = argmax_k α_k.
  - But w is not needed explicitly for classification: f(x) = Σ α_i y_i x_i^T x + b.
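  In the hard-margin QP sketch given earlier, the only change needed for this soft-margin dual is the box constraint 0 ≤ α_i ≤ C. A minimal version of that modification (helper name is mine):

```python
import numpy as np
from cvxopt import matrix

def box_constraints(n, C):
    """Inequality constraints for the soft-margin dual: -alpha <= 0 and alpha <= C."""
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    return G, h   # pass these as G, h to solvers.qp in place of the hard-margin ones
```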

  26. Linear SVMs: Summary
  - The classifier is a separating hyperplane.
  - The most "important" training points are the support vectors; they define the hyperplane.
  - Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e. have non-zero Lagrangian multipliers α_i.
  - Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
      find α_1 ... α_N such that Q(α) = Σ α_i - (1/2) ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized and (1) Σ α_i y_i = 0, (2) 0 ≤ α_i ≤ C for all α_i;
      f(x) = Σ α_i y_i x_i^T x + b.

  27. Non-linear SVMs
  - Datasets that are linearly separable (with some noise) work out great.
  - But what are we going to do if the dataset is just too hard?
  - How about mapping the data to a higher-dimensional space?
  [Figure: 1-D data on the x-axis that is not linearly separable becomes separable after mapping x to (x, x^2)]

  28. Non-linear SVMs: Feature Spaces
  - General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).
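  An assumed one-dimensional example in the spirit of the previous slide's figure (the data and threshold below are not taken from the slides): points that cannot be separated on the line become separable after the map φ(x) = (x, x²).

```python
import numpy as np

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])       # outer points vs. inner points: not separable in 1-D

phi = np.column_stack([x, x ** 2])               # map each point to the 2-D feature space (x, x^2)

# In feature space the horizontal line x^2 = 2.5 separates the two classes:
print(np.sign(phi[:, 1] - 2.5).astype(int) == y)
```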

  29. Transformation to Feature Space
  - "Kernel trick": make a non-separable problem separable.
  - Map the data into a better representational space.
  [Figure: points in the input space mapped via φ(·) into the feature space]

  30. Modification Due to the Kernel Function
  - Change all inner products to kernel functions.
  - For training, the original formulation uses the inner products x_i^T x_j; with a kernel function it uses K(x_i, x_j) = φ(x_i)^T φ(x_j).
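  In the dual QP sketch above, this modification amounts to replacing the Gram matrix of dot products with a kernel matrix. The RBF kernel below is just one common choice; the slides do not fix a particular kernel at this point.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    """Kernel matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)

# Before: K = X @ X.T                    (linear kernel, i.e. plain x_i^T x_j)
# After:  K = rbf_kernel_matrix(X)       then P = np.outer(y, y) * K exactly as before
```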

  31. Example Transformation
  - Consider a transformation φ(·) and define the kernel function K(x, y) as the corresponding inner product in feature space.
  - The inner product φ(x)^T φ(y) can then be computed by K without going through the map φ(·) explicitly!
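  The slide's specific transformation is not recoverable from this transcript; as a standard illustration, assume φ(x) = (x_1², √2·x_1·x_2, x_2²), for which K(x, y) = (x^T y)² reproduces the feature-space inner product:

```python
import numpy as np

def phi(v):
    # Assumed quadratic feature map (illustrative choice, not necessarily the slide's).
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def K(x, y):
    # Kernel computed entirely in the input space.
    return (x @ y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(y))   # inner product computed explicitly in feature space
print(K(x, y))           # same value, without ever applying phi
```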
