Support vector machines (SVMs), Lecture 6


  1. Support vector machines (SVMs), Lecture 6
  David Sontag, New York University
  Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

  2. Pegasos vs. Perceptron
  Pegasos Algorithm:
    Initialize: w_1 = 0, t = 0
    For iter = 1, 2, …, 20
      For j = 1, 2, …, |data|
        t = t + 1
        η_t = 1/(tλ)
        If y_j (w_t · x_j) < 1:
          w_{t+1} = (1 − η_t λ) w_t + η_t y_j x_j
        Else:
          w_{t+1} = (1 − η_t λ) w_t
    Output: w_{t+1}

  3. Pegasos vs. Perceptron
  Perceptron Algorithm (the same template as Pegasos: the margin threshold 1 becomes 0, and the regularization and step-size terms disappear):
    Initialize: w_1 = 0, t = 0
    For iter = 1, 2, …, 20
      For j = 1, 2, …, |data|
        t = t + 1
        If y_j (w_t · x_j) < 0:
          w_{t+1} = w_t + y_j x_j
        Else:
          w_{t+1} = w_t
    Output: w_{t+1}
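A minimal NumPy sketch of both inner loops may make the comparison concrete (the 20-epoch loop count is from the slides; the feature matrix X and labels y in {−1, +1} are illustrative assumptions):

```python
import numpy as np

def pegasos(X, y, lam=0.1, epochs=20):
    """Pegasos: stochastic subgradient descent on the regularized hinge loss."""
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for j in range(n):
            t += 1
            eta = 1.0 / (t * lam)          # decaying step size 1/(t*lambda)
            if y[j] * (w @ X[j]) < 1:      # margin violation: shrink w, step toward y_j x_j
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:                          # no violation: only the shrinkage step
                w = (1 - eta * lam) * w
    return w

def perceptron(X, y, epochs=20):
    """Perceptron: same loop, but threshold 0, no regularization, fixed step size."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for j in range(n):
            if y[j] * (w @ X[j]) < 0:      # only update on misclassification
                w = w + y[j] * X[j]
    return w
```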

  4. Much faster than previous methods
  • 3 datasets (provided by Joachims):
    – Reuters CCAT (800K examples, 47k features)
    – Physics ArXiv (62k examples, 100k features)
    – Covertype (581k examples, 54 features)
  Training time (in seconds):
                     Pegasos   SVM-Perf   SVM-Light
    Reuters               2        77       20,075
    Covertype             6        85       25,514
    Astro-Physics         2         5           80

  5. Running time guarantee: error decomposition [Shalev-Shwartz & Srebro '08]
  The prediction error err(w) decomposes relative to err(w*) and err(w_0). (Note: w_0 is redefined in this context, see below; it does not refer to the initial weight vector.)
  • Approximation error: the best error achievable by a large-margin predictor, i.e. the error of the population minimizer
      w_0 = argmin E[f(w)] = argmin λ|w|² + E_{x,y}[loss(⟨w,x⟩; y)]
  • Estimation error: extra error due to replacing E[loss] with the empirical loss, minimized by w* = argmin f_n(w)
  • Optimization error: extra error due to only optimizing to within finite precision

  6. Running time guarantee: error decomposition, continued [Shalev-Shwartz & Srebro '08]
  The same decomposition, now with Pegasos's guarantee: after T = Õ(1/(λε)) updates, with probability 1 − δ,
      err(w_T) ≤ err(w_0) + ε
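Written out, the decomposition on these two slides is a telescoping sum (a LaTeX restatement of the slide's picture, not text from the slides themselves):

```latex
\mathrm{err}(w)
  = \underbrace{\mathrm{err}(w_0)}_{\text{approximation}}
  + \underbrace{\mathrm{err}(w^*) - \mathrm{err}(w_0)}_{\text{estimation}}
  + \underbrace{\mathrm{err}(w) - \mathrm{err}(w^*)}_{\text{optimization}}
```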

  7. Extending to multi-class classification

  8. One versus all classification
  Learn 3 classifiers:
  • − vs. {o, +}, weights w−
  • + vs. {o, −}, weights w+
  • o vs. {+, −}, weights wo
  Predict the label whose classifier scores highest: y = argmax_y w_y · x
  Any problems? Could we learn this (1-D) dataset?
  (Figure: a 1-D dataset with points around −1, 0, and 1.)
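A sketch of the one-vs-all prediction rule (the weight matrix W holding one learned w_y per row is the assumed interface; training each row is a separate binary problem):

```python
import numpy as np

def ova_predict(W, x):
    """One-vs-all: W is (num_classes, num_features), one w_y per class."""
    scores = W @ x            # score of each "class vs. rest" classifier
    return np.argmax(scores)  # predict the class whose classifier scores highest
```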

  9. Multi-class SVM
  Simultaneously learn 3 sets of weights: w−, w+, wo
  • How do we guarantee the correct labels?
  • Need new constraints! The "score" of the correct class must be better than the "score" of the wrong classes:
      w_{y_j} · x_j ≥ w_{y'} · x_j + 1   for all y' ≠ y_j

  10. Multi-class SVM
  As for the binary SVM, we introduce slack variables and maximize the margin:
      min_{w,ξ}  Σ_y |w_y|² + C Σ_j ξ_j
      s.t.  w_{y_j} · x_j ≥ w_{y'} · x_j + 1 − ξ_j  for all y' ≠ y_j,  ξ_j ≥ 0
  To predict, we use: y = argmax_y w_y · x
  Now can we learn it?
  (Figure: the same 1-D dataset with points around −1, 0, and 1.)
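These constraints induce a multi-class hinge loss; a short sketch (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def multiclass_hinge_loss(W, x, y):
    """Slack needed for one example: how much the best wrong class
    violates the margin-1 constraint against the correct class y."""
    scores = W @ x
    correct = scores[y]
    scores[y] = -np.inf                      # exclude the correct class
    return max(0.0, 1.0 + scores.max() - correct)
```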

  11. How to deal with imbalanced data?
  • In many practical applications we may have imbalanced data sets
  • We may want errors to be equally distributed between the positive and negative classes
  • A slight modification to the SVM objective does the trick: class-specific weighting of the slack variables
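For instance, scikit-learn exposes exactly this class-specific slack weighting through the class_weight parameter (a usage sketch with a toy imbalanced dataset, not part of the lecture):

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.random.randn(100, 5)
y = np.array([1] * 90 + [-1] * 10)   # a 9:1 imbalanced toy dataset

# class_weight rescales C per class, i.e. weights the slack variables;
# 'balanced' uses weights inversely proportional to class frequencies.
clf = LinearSVC(C=1.0, class_weight="balanced").fit(X, y)
```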

  12. What if the data is not linearly separable?
  Use features of features of features of features…
      φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )
  Feature space can get really large really quickly!
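To see how quickly: the number of monomial features of degree up to d in n input dimensions is C(n+d, d), which blows up fast (a quick check, assuming polynomial features only):

```python
from math import comb

# Number of polynomial features of degree <= d in n input dimensions.
for n, d in [(10, 2), (100, 3), (1000, 4)]:
    print(n, d, comb(n + d, d))
# 10 2 66;  100 3 176851;  1000 4 ~4.2e10
```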

  13. Key idea #3: the kernel trick
  • High-dimensional feature spaces at no extra cost!
  • After every update (of Pegasos), the weight vector can be written in the form:
      w = Σ_i α_i y_i x_i
  • As a result, prediction can be performed with:
      ŷ = sign(w · φ(x))
        = sign((Σ_i α_i y_i φ(x_i)) · φ(x))
        = sign(Σ_i α_i y_i (φ(x_i) · φ(x)))
        = sign(Σ_i α_i y_i K(x_i, x)),   where K(x, x') = φ(x) · φ(x')
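A sketch of kernelized prediction (the kernel function K and the stored (α_i, y_i, x_i) triples are the assumed interface; φ(x) is never formed explicitly):

```python
import numpy as np

def kernel_predict(alphas, ys, Xs, K, x):
    """Predict sign(sum_i alpha_i * y_i * K(x_i, x))."""
    s = sum(a * yi * K(xi, x) for a, yi, xi in zip(alphas, ys, Xs))
    return np.sign(s)
```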

  14. Common kernels
  • Polynomials of degree exactly d
  • Polynomials of degree up to d
  • Gaussian kernels
  • Sigmoid
  • And many others: very active area of research!
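The standard forms of these kernels as small Python functions (the hyperparameter names c, sigma, kappa, theta are conventional choices, not from the slides):

```python
import numpy as np

def poly_exact(u, v, d):            # polynomials of degree exactly d
    return (u @ v) ** d

def poly_up_to(u, v, d, c=1.0):     # polynomials of degree up to d
    return (u @ v + c) ** d

def gaussian(u, v, sigma=1.0):      # a.k.a. the RBF kernel
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))

def sigmoid(u, v, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (u @ v) + theta)
```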

  15. Polynomial kernel
  d = 1:  φ(u) · φ(v) = (u1, u2) · (v1, v2) = u1 v1 + u2 v2 = u · v
  d = 2:  φ(u) · φ(v) = (u1², √2 u1 u2, u2²) · (v1², √2 v1 v2, v2²)
                      = u1² v1² + 2 u1 v1 u2 v2 + u2² v2²
                      = (u1 v1 + u2 v2)² = (u · v)²
  For any d (we will skip the proof): φ(u) · φ(v) = (u · v)^d
  These are polynomials of degree exactly d.
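A quick numeric check of the d = 2 identity (the random test vectors are an arbitrary choice):

```python
import numpy as np

u, v = np.random.randn(2), np.random.randn(2)
phi = lambda x: np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])
assert np.isclose(phi(u) @ phi(v), (u @ v) ** 2)   # explicit map agrees with (u.v)^2
```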

  16. Quadratic kernel [Tommi Jaakkola]

  17. Gaussian kernel
  (Figures: level sets of the decision function, i.e. points where w · φ(x) = r for some r, with the support vectors marked.) [Cynthia Rudin] [mblondel.org]

  18. Kernel algebra
  Q: How would you prove that the "Gaussian kernel" is a valid kernel?
  A: Expand the Euclidean norm in the exponent, then apply (e) from above. To see that the remaining factor is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c). The feature mapping is infinite dimensional! [Justin Domke]
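The expansion the answer refers to, written out (a standard derivation; the rule labels (a)–(e) come from a kernel-algebra slide that is not transcribed here):

```latex
K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}
         = \underbrace{e^{-\frac{\|x\|^2}{2\sigma^2}}}_{f(x)}
           \; e^{\frac{x \cdot x'}{\sigma^2}} \;
           \underbrace{e^{-\frac{\|x'\|^2}{2\sigma^2}}}_{f(x')},
\qquad
e^{\frac{x \cdot x'}{\sigma^2}}
  = \sum_{k=0}^{\infty} \frac{1}{k!} \left(\frac{x \cdot x'}{\sigma^2}\right)^k
```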

  19. Dual SVM interpretation: sparsity
  (Figure: separating hyperplane w · x + b = 0 with margin hyperplanes w · x + b = +1 and w · x + b = −1.)
  The final solution tends to be sparse:
  • α_j = 0 for most j
  • we don't need to store these points to compute w or make predictions
  Non-support vectors: α_j = 0; moving them will not change w.
  Support vectors: α_j ≥ 0.
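This sparsity is easy to observe in practice, e.g. with scikit-learn's SVC (the toy dataset is an arbitrary choice):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(200, 2)
y = np.sign(X[:, 0] + X[:, 1])        # a linearly separable toy problem

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# Only the support vectors have alpha_j != 0; the rest can be discarded.
print(len(clf.support_vectors_), "support vectors out of", len(X))
```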

  20. Overfitting?
  • Huge feature space with kernels: should we worry about overfitting?
    – The SVM objective seeks a solution with large margin
    – Theory says that large margin leads to good generalization (we will see this in a couple of lectures)
    – But everything overfits sometimes!
    – We can control overfitting by:
      • setting C
      • choosing a better kernel
      • varying the parameters of the kernel (width of the Gaussian, etc.)
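In practice these knobs are usually tuned by cross-validation, e.g. (a scikit-learn sketch; the parameter grid values and the undefined X, y are placeholders):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Tune C and the Gaussian (RBF) kernel width gamma by 5-fold cross-validation.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
# grid.fit(X, y); grid.best_params_
```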
