  1. Support Vector Machines. Léon Bottou. COS 424 – 4/1/2010

  2. Agenda
     – Goals: classification, clustering, regression, other.
     – Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
     – Capacity control: explicit (architecture, feature selection; regularization, priors) and implicit (approximate optimization; Bayesian averaging, ensembles).
     – Loss functions.
     – Operational considerations: budget constraints; online vs. offline.
     – Computational considerations: exact algorithms for small datasets; stochastic algorithms for big datasets; parallel algorithms.

  3. Summary
     1. Maximizing margins. 2. Soft margins. 3. Kernels. 4. Kernels everywhere.

  4. The curse of dimensionality
     Polynomial classifiers in dimension d. Discriminant function: f(x) = w⊤Φ(x) + b.
     Degree   Φ(x)                                    Dim Φ(x)
     1        Φ(x) = [xᵢ], 1 ≤ i ≤ d                  d
     2        Φ(x) += [xᵢxⱼ], 1 ≤ i ≤ j ≤ d           ≈ d²/2
     3        Φ(x) += [xᵢxⱼxₖ], 1 ≤ i ≤ j ≤ k ≤ d     ≈ d³/6
     ...
     n                                                ≈ dⁿ/n!
     The number of parameters increases quickly. Training such a classifier directly requires a number of examples that increases just as quickly as the number of parameters.
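
     A quick way to see the blow-up is to count the monomial features directly. The sketch below (an illustration, not from the slides) counts multisets of size k drawn from d variables, which is the exact quantity that the ≈ dⁿ/n! column approximates:

```python
from math import comb

def poly_feature_count(d, degree):
    """Number of monomials x_{i1}*...*x_{ik} with i1 <= ... <= ik,
    summed over k = 1..degree (multisets of size k from d variables)."""
    return sum(comb(d + k - 1, k) for k in range(1, degree + 1))

for d in (10, 100, 1000):
    print(d, [poly_feature_count(d, n) for n in (1, 2, 3)])
# d = 1000 already gives roughly 1.7e8 features at degree 3.
```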

  5. Beating the curse of dimensionality?
     Capacity ≪ number of parameters.
     Assume the patterns x₁ … x_{2l} are known beforehand; the classes are unknown. Let R = max ‖xᵢ‖.
     We say that a hyperplane w⊤x + b (w, x ∈ ℝᵈ, ‖w‖ = 1) separates the patterns with margin Δ if |w⊤xᵢ + b| ≥ Δ for all i = 1 … 2l.
     The family F of Δ-margin separating hyperplanes satisfies
        log N(F, D) ≤ h log(2le/h)   with   h ≤ min(R²/Δ², d) + 1.

  6. Maximizing margins
     Patterns xᵢ ∈ ℝᵈ, classes yᵢ = ±1.
        max_{w,b,Δ} Δ   subject to   ‖w‖ = 1 and ∀i: yᵢ(w⊤xᵢ + b) ≥ Δ.
     (Figure: the separating hyperplane with normal vector w and a margin band of total width 2Δ.)

  7. Maximizing margins
     Classic formulation:
        min_{w,b} ‖w‖²   subject to   ∀i: yᵢ(w⊤xᵢ + b) ≥ 1.
     This is a quadratic programming problem with linear constraints.
     (Figure: the supporting hyperplanes w⊤x + b = +1 and w⊤x + b = −1.)
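
     Because this is a QP with linear constraints, it can be handed to any off-the-shelf convex solver. A minimal sketch, assuming cvxpy is installed and using made-up linearly separable toy data:

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data, linearly separable by construction.
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Classic formulation: minimize ||w||^2 s.t. y_i (w.x_i + b) >= 1 for all i.
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints).solve()
print("w =", w.value, "b =", b.value)
```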

  8. Maximizing margins
     Equivalence between the formulations: let w′ = w/Δ and b′ = b/Δ.
     The constraint yᵢ(w⊤xᵢ + b) ≥ Δ becomes yᵢ(w′⊤xᵢ + b′) ≥ 1.
     The problem max_{w,b,Δ} Δ subject to ‖w‖ = 1 becomes min_{w′,b′} ‖w′‖.
     Both discriminant functions w⊤x + b and w′⊤x + b′ describe the same decision boundary.

  9. Primal and dual formulation
     Karush-Kuhn-Tucker theory
     – Refined theory for convex optimization under constraints.
     – Construct a dual optimization problem whose constraints are simpler and whose solution is related to the solution we seek.

  10. Primal and dual formulation
     Karush-Kuhn-Tucker theory
     – Refined theory for convex optimization under constraints.
     – Construct a dual optimization problem whose constraints are simpler and whose solution is related to the solution we seek.
     Primal formulation: maximize the margin between the classes. Dual formulation: minimize the distance between the convex hulls of the two classes (points A and B in the figure).

  11. Dual formulation
     Minimize the distance between the convex hulls (A in the positive hull, B in the negative hull):
     – Point A: Σ_{i∈Pos} βᵢxᵢ subject to βᵢ ≥ 0 and Σ_{i∈Pos} βᵢ = 1.
     – Point B: Σ_{i∈Neg} βᵢxᵢ subject to βᵢ ≥ 0 and Σ_{i∈Neg} βᵢ = 1.
     – Vector BA: Σᵢ yᵢβᵢxᵢ subject to βᵢ ≥ 0, Σᵢ βᵢ = 2, and Σᵢ yᵢβᵢ = 0.

  12. Dual formulation
     Minimize the squared length of vector BA:
        min_β Σᵢⱼ yᵢyⱼβᵢβⱼ xᵢ⊤xⱼ   subject to   ∀i: βᵢ ≥ 0,  Σᵢ yᵢβᵢ = 0,  Σᵢ βᵢ = 2.
     Then w = Σᵢ yᵢβᵢxᵢ, and b is easy to find by projecting all examples on w.
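
     A sketch of this convex-hull dual, again assuming cvxpy and reusing the toy data from the earlier sketch; the objective is written as ‖Σᵢ yᵢβᵢxᵢ‖², which is the same quadratic form:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

beta = cp.Variable(len(y))
ba = (y[:, None] * X).T @ beta                 # vector BA = sum_i y_i beta_i x_i
constraints = [beta >= 0, cp.sum(beta) == 2, y @ beta == 0]
cp.Problem(cp.Minimize(cp.sum_squares(ba)), constraints).solve()

w = X.T @ (y * beta.value)                     # w = sum_i y_i beta_i x_i
A = X[y > 0].T @ beta.value[y > 0]             # closest point in the positive hull
B = X[y < 0].T @ beta.value[y < 0]             # closest point in the negative hull
b = -w @ (A + B) / 2                           # the boundary bisects the segment AB
print("w =", w, "b =", b)
```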

  13. Dual formulation
     Classic formulation of the dual:
        max_α Σᵢ αᵢ − ½ Σᵢⱼ yᵢyⱼαᵢαⱼ xᵢ⊤xⱼ   subject to   ∀i: αᵢ ≥ 0 and Σᵢ yᵢαᵢ = 0.
     This is equivalent to the convex-hull formulation, with αᵢ = βᵢ/Δ², but the proof is nontrivial.

  14. Support Vector Machines
        min_β Σᵢⱼ yᵢyⱼβᵢβⱼ xᵢ⊤xⱼ   subject to   ∀i: βᵢ ≥ 0,  Σᵢ yᵢβᵢ = 0,  Σᵢ βᵢ = 2.
     The only nonzero βᵢ are those corresponding to the support vectors.

  15. Leave-One-Out
     Leave one out = n-fold cross-validation.
     – Compute classifiers fᵢ using the training set minus example (xᵢ, yᵢ).
     – Estimate the test misclassification rate as E_LOO = (1/n) Σᵢ 1{yᵢ fᵢ(xᵢ) ≤ 0}.
     Leave one out for the maximal margin classifier:
     – Removing a non-support vector does not change the classifier, hence
          E_LOO ≤ (# support vectors) / (# examples).
     – The important quantity is not the dimension but the number of support vectors.
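
     The bound is easy to read off a fitted SVM. A small sketch, assuming scikit-learn is available, with synthetic two-Gaussian data and a large C to approximate the hard-margin classifier:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)), rng.normal(-2, 1, size=(50, 2))])
y = np.r_[np.ones(50), -np.ones(50)]

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C approximates hard margins
bound = len(clf.support_) / len(y)               # E_LOO <= #SV / #examples
print(f"{len(clf.support_)} support vectors, leave-one-out error bound {bound:.2f}")
```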

  16. Soft margins
     When the examples are not linearly separable, the constraints yᵢ(w⊤xᵢ + b) ≥ 1 cannot all be satisfied. Adding slack variables ξᵢ:
        min_{w,b,ξ} ‖w‖² + C Σᵢ ξᵢ   subject to   ∀i: yᵢ(w⊤xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0.
     Parameter C controls the relative importance of correctly classifying all the training examples versus obtaining the separation with the largest margin. Reduces to hard margins when C = ∞.

  17. Soft margins and hinge loss
     The soft margin problem
        min_{w,b,ξ} ‖w‖² + C Σᵢ ξᵢ   subject to   ∀i: yᵢ(w⊤xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0
     is the same thing as the unconstrained problem
        min_{w,b} ‖w‖² + C Σᵢ ℓ(yᵢ(w⊤xᵢ + b))   with   ℓ(z) = max(0, 1 − z).
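
     A direct way to see the equivalence: at the optimum each slack takes exactly the value ξᵢ = max(0, 1 − yᵢ(w⊤xᵢ + b)). The small sketch below (an illustration, not from the slides) evaluates the unconstrained hinge-loss objective:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))  -- the hinge-loss form."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))   # optimal xi_i for this (w, b)
    return w @ w + C * slacks.sum()
```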

  18. Soft margins
     Primal formulation:
        min_{w,b,ξ} ‖w‖² + C Σᵢ ξᵢ   subject to   ∀i: yᵢ(w⊤xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0.
     Dual formulation:
        max_α Σᵢ αᵢ − ½ Σᵢⱼ yᵢyⱼαᵢαⱼ xᵢ⊤xⱼ   subject to   ∀i: 0 ≤ αᵢ ≤ C and Σᵢ yᵢαᵢ = 0.
     The primal and dual solutions obey the relation w = Σᵢ yᵢαᵢxᵢ. The threshold b is easy to find once w is known.
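
     A sketch of the soft-margin dual, again assuming cvxpy and the toy data used above; b is recovered from any example with 0 < αᵢ < C, which sits exactly on the margin:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

alpha = cp.Variable(len(y))
wa = (y[:, None] * X).T @ alpha                   # sum_i y_i alpha_i x_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(wa))
cp.Problem(objective, [alpha >= 0, alpha <= C, y @ alpha == 0]).solve()

a = alpha.value
w = X.T @ (y * a)                                 # w = sum_i y_i alpha_i x_i
on_margin = (a > 1e-6) & (a < C - 1e-6)           # margin support vectors
b = np.mean(y[on_margin] - X[on_margin] @ w)      # y_i (w.x_i + b) = 1 there
print("w =", w, "b =", b)
```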

  19. Soft margins
     [Figure: training points labeled by their dual variables. Points outside the margin have αᵢ = 0; points exactly on the margin have 0 < αᵢ < C; points with positive slack ξᵢ have αᵢ = C.]

  20. Beyond linear separation
     Reintroducing the Φ(x):
     – Define K(x, v) = Φ(x)⊤Φ(v).
     – Dual optimization problem:
          max_α Σᵢ αᵢ − ½ Σᵢⱼ yᵢyⱼαᵢαⱼ K(xᵢ, xⱼ)   subject to   ∀i: 0 ≤ αᵢ ≤ C and Σᵢ yᵢαᵢ = 0.
     – Discriminant function: f(x) = w⊤Φ(x) + b = Σᵢ yᵢαᵢ K(xᵢ, x) + b.
     Curious fact
     – We do not really need to compute Φ(x).
     – The dot products K(x, v) = Φ(x)⊤Φ(v) are enough.
     – Can we take advantage of this?
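
     In code, the kernelized discriminant only needs the support vectors, their coefficients, and the kernel itself. A minimal sketch (hypothetical names; any kernel function would do):

```python
import numpy as np

def decision_function(x, sv_x, sv_y, sv_alpha, b, kernel):
    """f(x) = sum_i y_i alpha_i K(x_i, x) + b, summed over support vectors only."""
    return sum(a * yi * kernel(xi, x) for xi, yi, a in zip(sv_x, sv_y, sv_alpha)) + b

# Example kernel: the degree-3 polynomial kernel of the following slides.
poly3 = lambda u, v: (1.0 + u @ v) ** 3
```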

  21. Quadratic kernel
     Quadratic basis:
        Φ(x) = ( [xᵢ]ᵢ , [xᵢ²]ᵢ , [√2 xᵢxⱼ]ᵢ<ⱼ )
     Dot product:
        Φ(x)⊤Φ(v) = Σᵢ xᵢvᵢ + Σᵢ xᵢ²vᵢ² + 2 Σᵢ<ⱼ xᵢvᵢxⱼvⱼ
     – Are there d(d + 3)/2 terms to add?

  22. Quadratic kernel
     Quadratic basis:
        Φ(x) = ( [xᵢ]ᵢ , [xᵢ²]ᵢ , [√2 xᵢxⱼ]ᵢ<ⱼ )
     Dot product:
        Φ(x)⊤Φ(v) = Σᵢ xᵢvᵢ + Σᵢ xᵢ²vᵢ² + 2 Σᵢ<ⱼ xᵢvᵢxⱼvⱼ
                  = Σᵢ xᵢvᵢ + Σᵢ,ⱼ xᵢvᵢxⱼvⱼ
                  = (x⊤v) + (Σᵢ xᵢvᵢ)²
                  = (x⊤v) + (x⊤v)²
     – There are only d terms to add!
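
     A numeric sanity check of the identity, with a made-up helper for the explicit quadratic feature map:

```python
import numpy as np
from itertools import combinations

def phi_quad(x):
    """Explicit quadratic basis: [x_i], [x_i^2], [sqrt(2) x_i x_j for i < j]."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, x ** 2, cross])

rng = np.random.default_rng(0)
x, v = rng.normal(size=5), rng.normal(size=5)
explicit = phi_quad(x) @ phi_quad(v)       # d(d + 3)/2 products
implicit = x @ v + (x @ v) ** 2            # only d products
print(np.isclose(explicit, implicit))      # True
```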

  23. Polynomial kernel
     Degree   Φ(x)⊤Φ(v)                         Dim Φ(x)
     1        (x⊤v)                              d
     2        (x⊤v) + (x⊤v)²                     ≈ d²/2
     3        (x⊤v) + (x⊤v)² + (x⊤v)³            ≈ d³/6
     ...
     n        (1 + x⊤v)ⁿ                         ≈ dⁿ/n!
     The number of parameters increases exponentially, but the total computation remains nearly constant.

  24. Linear [figure]

  25. Quadratic [figure]

  26. Polynomial degree 3 [figure]

  27. Polynomial degree 5 [figure]

  28. Polynomial kernels and more
     Weighted polynomial kernel: K_d(x, v) = Σᵢ₌₀..d (γⁱ/i!) (x⊤v)ⁱ.
     – This is a polynomial kernel.
     – Coefficient γ controls the relative importance of terms of various degrees.
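
     A direct transcription of the weighted kernel (function and parameter names are illustrative):

```python
import numpy as np
from math import factorial

def weighted_poly_kernel(x, v, d=3, gamma=1.0):
    """K_d(x, v) = sum_{i=0}^{d} gamma^i / i! * (x.v)^i."""
    s = x @ v
    return sum(gamma ** i / factorial(i) * s ** i for i in range(d + 1))
```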
