SVMs and Kernel Methods, Lecture 3
David Sontag, New York University


  1. SVMs and Kernel Methods Lecture 3 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

  2. Today’s lecture
  • Dual form of soft-margin SVM
  • Feature mappings & kernels
  • Convexity, Mercer’s theorem
  • (Time permitting) Extensions: imbalanced data, multi-class, other loss functions, L1 regularization

  3. Recap of dual SVM derivation
  Starting from the Lagrangian, we can solve for the optimal w, b as a function of α by setting the gradient to zero:
  ∂L/∂w = w − Σ_j α_j y_j x_j = 0, so w = Σ_j α_j y_j x_j.
  Substituting these values back in (and simplifying), we obtain the dual problem:
  maximize Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k) subject to α_j ≥ 0 and Σ_j α_j y_j = 0.
  So, in the dual formulation we will solve for α directly!
  • w and b are computed from α (if needed)
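As a concrete illustration (not from the slides), here is a minimal sketch of recovering w from α and evaluating the dual objective; X, y, and alpha are hypothetical numpy arrays.

```python
# Minimal sketch: X is (n, d), y is (n,) with entries in {-1, +1},
# alpha is (n,) of dual variables.
import numpy as np

def recover_w(alpha, X, y):
    # w = sum_j alpha_j * y_j * x_j
    return (alpha * y) @ X

def dual_objective(alpha, X, y):
    # sum_j alpha_j - 1/2 * sum_{j,k} alpha_j alpha_k y_j y_k (x_j . x_k)
    G = (X @ X.T) * np.outer(y, y)
    return alpha.sum() - 0.5 * alpha @ G @ alpha
```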

  4. Solving for the offset “b”
  From the Lagrangian, α_j > 0 for some j implies that the corresponding margin constraint is tight. We use this to obtain b:
  (1) choose any j with α_j > 0; (2) its constraint is tight, y_j (w · x_j + b) = 1; (3) solve for b = y_j − w · x_j.
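A small sketch of steps (1)-(3) in code (my own helper, not from the slides); averaging over all support vectors is an added numerical-stability choice.

```python
# Compute b from the tight constraints y_j (w . x_j + b) = 1 at the support vectors.
import numpy as np

def compute_b(alpha, w, X, y, tol=1e-8):
    sv = alpha > tol                      # support vectors have alpha_j > 0
    return np.mean(y[sv] - X[sv] @ w)     # b = y_j - w . x_j, averaged for stability
```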

  5. Dual formulation only depends on dot-products of the features!
  First, we introduce a feature mapping φ, replacing each x_j with φ(x_j) so that dot products x_j · x_k become φ(x_j) · φ(x_k).
  Next, replace the dot product with an equivalent kernel function K(x_j, x_k) = φ(x_j) · φ(x_k); the constraints α ≥ 0 are unchanged.
  Do kernels need to be symmetric?
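To make the substitution concrete, here is a sketch of a Gram-matrix helper (names are mine): any kernel function K can stand in wherever the dual uses x_j · x_k.

```python
# Build the matrix of kernel evaluations K(x_j, x_k), which replaces the matrix
# of dot products phi(x_j) . phi(x_k) without computing phi explicitly.
import numpy as np

def linear_kernel(u, v):
    return u @ v

def gram_matrix(X1, X2, kernel=linear_kernel):
    return np.array([[kernel(u, v) for v in X2] for u in X1])
```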

  6. Classification rule using dual solution
  Using the dual solution, ŷ = sign(Σ_j α_j y_j (x_j · x) + b): the rule depends only on dot products of the new example’s feature vector with the support vectors.
  Using a kernel function, predict with ŷ = sign(Σ_j α_j y_j K(x_j, x) + b).
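A sketch of the kernelized prediction rule as a function (a hypothetical helper, assuming alpha, b, and the training data are available):

```python
# Prediction needs only kernel evaluations against the support vectors.
import numpy as np

def predict(x, alpha, b, X, y, kernel, tol=1e-8):
    sv = alpha > tol
    k = np.array([kernel(xj, x) for xj in X[sv]])
    return np.sign(np.sum(alpha[sv] * y[sv] * k) + b)
```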

  7. Dual SVM interpretation: Sparsity
  [Figure: separating hyperplane w · x + b = 0 with margin boundaries w · x + b = +1 and w · x + b = −1]
  The final solution tends to be sparse:
  • α_j = 0 for most j
  • we don’t need to store these points to compute w or make predictions
  Non-support vectors: α_j = 0; moving them will not change w.
  Support vectors: α_j > 0.

  8. Soft-margin SVM
  Primal: minimize ½‖w‖² + C Σ_j ξ_j over w, b, ξ subject to y_j (w · x_j + b) ≥ 1 − ξ_j and ξ_j ≥ 0.
  Dual: solve for α: maximize Σ_j α_j − ½ Σ_{j,k} α_j α_k y_j y_k (x_j · x_k) subject to 0 ≤ α_j ≤ C and Σ_j α_j y_j = 0.
  What changed?
  • Added upper bound of C on α_i!
  • Intuitive explanation:
  • Without slack, α_i → ∞ when constraints are violated (points misclassified)
  • Upper bound of C limits the α_i, so misclassifications are allowed
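To make the box constraint 0 ≤ α_i ≤ C concrete, here is a hedged sketch that hands the soft-margin dual to a generic solver (scipy's SLSQP); a dedicated QP solver would normally be used instead.

```python
import numpy as np
from scipy.optimize import minimize

def solve_soft_margin_dual(X, y, C=1.0):
    n = len(y)
    G = (X @ X.T) * np.outer(y, y)
    obj = lambda a: 0.5 * a @ G @ a - a.sum()        # negated dual (we minimize)
    grad = lambda a: G @ a - np.ones(n)
    cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i alpha_i y_i = 0
    bounds = [(0.0, C)] * n                          # 0 <= alpha_i <= C
    return minimize(obj, np.zeros(n), jac=grad, bounds=bounds, constraints=cons).x
```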

  9. Common kernels
  • Polynomials of degree exactly d
  • Polynomials of degree up to d
  • Gaussian kernels
  • Sigmoid
  • And many others: a very active area of research!
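The kernels listed above, written as plain functions (a sketch; the parameter names d, sigma, eta, nu are my own):

```python
import numpy as np

def poly_exact(u, v, d):            # polynomials of degree exactly d: (u . v)^d
    return (u @ v) ** d

def poly_up_to(u, v, d):            # polynomials of degree up to d: (u . v + 1)^d
    return (u @ v + 1) ** d

def gaussian(u, v, sigma):          # Gaussian / RBF kernel
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(u, v, eta, nu):  # sigmoid kernel (not PSD for all parameters)
    return np.tanh(eta * (u @ v) + nu)
```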

  10. Polynomial kernel
  d = 1: φ(u) = (u_1, u_2), so φ(u) · φ(v) = u_1 v_1 + u_2 v_2 = u · v.
  d = 2: φ(u) = (u_1², √2 u_1 u_2, u_2²), so φ(u) · φ(v) = u_1² v_1² + 2 u_1 v_1 u_2 v_2 + u_2² v_2² = (u_1 v_1 + u_2 v_2)² = (u · v)².
  For any d (we will skip the proof): φ(u) · φ(v) = (u · v)^d, i.e. polynomials of degree exactly d.
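A quick numerical check of the d = 2 identity on arbitrary 2-D vectors (a sanity check of my own, not part of the slides):

```python
import numpy as np

def phi2(u):
    # Explicit feature map for the degree-2 polynomial kernel in 2-D.
    return np.array([u[0] ** 2, np.sqrt(2) * u[0] * u[1], u[1] ** 2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi2(u) @ phi2(v), (u @ v) ** 2)  # both equal (u . v)^2 = 1.0
```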

  11. Gaussian kernel
  [Figures: level sets of the decision function, i.e. points x with w · φ(x) = r for some r, with the support vectors highlighted. Credit: Cynthia Rudin; mblondel.org]

  12. Kernel algebra
  Q: How would you prove that the “Gaussian kernel” is a valid kernel?
  A: Expand the Euclidean norm as follows:
  exp(−‖u − v‖² / 2σ²) = exp(−‖u‖² / 2σ²) exp(−‖v‖² / 2σ²) exp(u · v / σ²), then apply (e) from above.
  To see that this is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c). The feature mapping is infinite dimensional! [Justin Domke]
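A sketch of the argument in symbols, assuming the closure rules (a)-(e) from the earlier kernel-algebra slide (sums, scalings, products, and multiplication by f(u)f(v) preserve kernels):

```latex
\[
  \exp\!\Big(-\tfrac{\|u-v\|^2}{2\sigma^2}\Big)
  = \underbrace{\exp\!\Big(-\tfrac{\|u\|^2}{2\sigma^2}\Big)\,
                \exp\!\Big(-\tfrac{\|v\|^2}{2\sigma^2}\Big)}_{f(u)\,f(v)}
    \exp\!\Big(\tfrac{u\cdot v}{\sigma^2}\Big),
  \qquad
  \exp\!\Big(\tfrac{u\cdot v}{\sigma^2}\Big)
  = \sum_{n=0}^{\infty}\frac{(u\cdot v)^n}{\sigma^{2n}\,n!}.
\]
% Each term (u . v)^n is a polynomial kernel, so the series is a kernel, and
% multiplying by f(u) f(v) keeps it a kernel; the implied feature map is
% infinite dimensional.
```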

  13. Overfitting?
  • Huge feature space with kernels: should we worry about overfitting?
  – SVM objective seeks a solution with large margin
  • Theory says that large margin leads to good generalization (we will see this in a couple of lectures)
  – But everything overfits sometimes!
  – Can control by:
  • Setting C
  • Choosing a better kernel
  • Varying parameters of the kernel (width of Gaussian, etc.)
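A hedged sketch of those control knobs using scikit-learn's SVC; the synthetic data and the grid of values are illustrative only (gamma plays the role of the inverse Gaussian width).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data, just to make the sweep runnable.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Sweep the slack penalty C and the Gaussian (RBF) width parameter gamma,
# then pick the setting that does best on held-out data.
for C in [0.01, 1.0, 100.0]:
    for gamma in [0.01, 0.1, 1.0]:
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        print(f"C={C}, gamma={gamma}, val acc={clf.score(X_val, y_val):.3f}")
```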

  14. How to deal with imbalanced data?
  • In many practical applications we may have imbalanced data sets
  • We may want errors to be equally distributed between the positive and negative classes
  • A slight modification to the SVM objective does the trick: class-specific weighting of the slack variables, e.g. minimize ½‖w‖² + C₊ Σ_{j: y_j = +1} ξ_j + C₋ Σ_{j: y_j = −1} ξ_j.
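One way to realize class-specific slack weighting in practice is scikit-learn's class_weight option, which rescales C per class (a sketch; the toy data and weights are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny imbalanced toy data: many negatives, few positives.
X = np.vstack([np.random.RandomState(0).randn(95, 2) - 2,
               np.random.RandomState(1).randn(5, 2) + 2])
y = np.array([-1] * 95 + [+1] * 5)

# Penalize slack on the rare positive class 10x more than on the negative class.
clf = SVC(kernel="linear", C=1.0, class_weight={+1: 10.0, -1: 1.0}).fit(X, y)
# Alternatively, class_weight="balanced" sets weights inversely to class frequency.
```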

  15. How do we do multi-class classification?

  16. One versus all classification
  Learn 3 classifiers:
  • − vs {o, +}, weights w_−
  • + vs {o, −}, weights w_+
  • o vs {+, −}, weights w_o
  Predict label using: ŷ = argmax_y (w_y · x + b_y).
  Any problems? Could we learn this (1-D) dataset? [Figure: 1-D dataset with points at −1, 0, and 1]
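A minimal one-versus-all sketch (my own, using scikit-learn's LinearSVC): train one binary classifier per class and predict with the highest signed score.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, C=1.0):
    classes = np.unique(y)
    # One binary problem per class: this class vs. all the others.
    return classes, {c: LinearSVC(C=C).fit(X, (y == c).astype(int)) for c in classes}

def predict_one_vs_all(X, classes, clfs):
    scores = np.column_stack([clfs[c].decision_function(X) for c in classes])
    return classes[np.argmax(scores, axis=1)]   # class with the highest score wins
```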

  17. Multi-class SVM
  Simultaneously learn 3 sets of weights w_+, w_−, w_o:
  • How do we guarantee the correct labels?
  • Need new constraints! The “score” of the correct class must be better than the “score” of the wrong classes:
  w_{y_j} · x_j + b_{y_j} ≥ w_{y′} · x_j + b_{y′} + 1 for all y′ ≠ y_j.

  18. Multi-class SVM
  As for the binary SVM, we introduce slack variables and maximize the margin:
  minimize ½ Σ_y ‖w_y‖² + C Σ_j ξ_j subject to w_{y_j} · x_j + b_{y_j} ≥ w_{y′} · x_j + b_{y′} + 1 − ξ_j for all y′ ≠ y_j, with ξ_j ≥ 0.
  To predict, we use: ŷ = argmax_y (w_y · x + b_y).
  Now can we learn it? [Figure: the 1-D dataset with points at −1, 0, and 1, now separable, e.g. with b_+ = −0.5]
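To see how the joint formulation can separate the 1-D example, here is a tiny sketch with hand-picked, hypothetical per-class weights and offsets (including b_+ = −0.5 as in the figure); prediction is the argmax of the class scores.

```python
import numpy as np

# Hypothetical weights/offsets for classes (-, o, +) on the 1-D dataset {-1, 0, 1}.
classes = np.array(["-", "o", "+"])
w = np.array([-1.0, 0.0, 1.0])   # one scalar weight per class (1-D inputs)
b = np.array([-0.5, 0.0, -0.5])  # note b_+ = -0.5, as in the slide's figure

def predict(x):
    return classes[np.argmax(w * x + b)]  # argmax_y (w_y * x + b_y)

print([predict(x) for x in (-1.0, 0.0, 1.0)])  # -> ['-', 'o', '+']
```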
