
Support Vector and Kernel Methods (Thorsten Joachims, Cornell)



  1. SIGIR 2003 Tutorial: Support Vector and Kernel Methods. Thorsten Joachims, Cornell University, Computer Science Department. tj@cs.cornell.edu, http://www.joachims.org

  2. Linear Classifiers
     Rules of the form: weight vector $w$, threshold $b$
     $h(x) = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right) = \begin{cases} +1 & \text{if } \sum_{i=1}^{N} w_i x_i + b > 0 \\ -1 & \text{else} \end{cases}$
     Geometric interpretation: the hyperplane defined by $w$ and $b$.
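A minimal numpy sketch of this rule; the weight vector, threshold, and test points are made-up values for illustration only.

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear classifier h(x) = sign(w . x + b), returning +1 or -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical weight vector and threshold for a 3-attribute problem.
w = np.array([0.5, -1.0, 2.0])
b = -0.25
print(linear_classify(np.array([1.0, 0.0, 0.5]), w, b))   # +1
print(linear_classify(np.array([0.0, 2.0, 0.0]), w, b))   # -1
```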

  3. Optimal Hyperplane (SVM Type 1)
     Assumption: the training examples are linearly separable.

  4. Maximizing the Margin
     The hyperplane with maximum margin $\delta$ corresponds (roughly, see later) to the hypothesis space with minimal VC-dimension according to structural risk minimization (SRM).
     Support vectors: the examples with minimal distance to the hyperplane.

  5. Example: Optimal Hyperplane vs. Perceptron
     [Plot: percent training/testing error of the perceptron (eta = 0.1) over 1-10 iterations, compared with the hard-margin SVM test error.]
     Trained on 1000 positive / 1000 negative examples for "acq" (Reuters-21578).

  6. Non-Separable Training Samples
     • For some training samples there is no separating hyperplane!
     • Complete separation is suboptimal for many training samples!
     => Minimize a trade-off between margin and training error.

  7. Soft-Margin Separation
     Idea: maximize the margin and minimize the training error simultaneously.
     Hard margin (separable):
       minimize $P(w, b) = \frac{1}{2}\, w \cdot w$
       s.t. $y_i (w \cdot x_i + b) \geq 1$
     Soft margin (training error):
       minimize $P(w, b, \xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i$
       s.t. $y_i (w \cdot x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
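A small numpy sketch that evaluates the soft-margin objective for a given hyperplane, using the slack values $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$ implied by the constraints; the data and the hyperplane $(w, b)$ are made-up and not the optimum.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """P(w, b, xi) = 1/2 w.w + C * sum(xi_i), with xi_i = max(0, 1 - y_i (w.x_i + b))."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum(), slacks

# Toy data and a hypothetical hyperplane (one example ends up inside the margin).
X = np.array([[2.0, 1.0], [0.5, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0
value, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(value, slacks)   # examples with slack > 0 violate the margin
```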

  8. Controlling Soft-Margin Separation
     Soft margin: minimize $P(w, b, \xi) = \frac{1}{2}\, w \cdot w + C \sum_{i=1}^{n} \xi_i$ s.t. $y_i (w \cdot x_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
     • $\sum \xi_i$ is an upper bound on the number of training errors.
     • C is a parameter that controls the trade-off between margin and training error (large C: fewer margin violations, smaller margin; small C: larger margin, more violations).
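A hedged sketch of this trade-off using scikit-learn (an assumption, not part of the tutorial) on overlapping synthetic data: a large C tolerates fewer violations and yields a narrower margin than a small C.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian classes, so some training error is unavoidable.
X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)          # geometric margin width 2 / ||w||
    train_err = 1.0 - clf.score(X, y)
    print(f"C={C}: margin={margin:.2f}, training error={train_err:.2%}, "
          f"support vectors={len(clf.support_)}")
```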

  9. Example: Reuters "acq", Varying C
     [Plot: percent training/testing error as a function of C (0.1 to 10), compared with the hard-margin SVM.]
     Observation: typically no local optima, but not necessarily...

  10. Properties of the Soft-Margin Dual OP
     Dual OP: maximize $D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
     s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$
     • typically a single solution (i.e. $\langle w, b \rangle$ is unique)
     • one factor $\alpha_i$ for each training example
     • the "influence" of a single training example is limited by C
     • $0 < \alpha_i < C$ <=> SV with $\xi_i = 0$
     • $\alpha_i = C$ <=> SV with $\xi_i > 0$
     • $\alpha_i = 0$ otherwise
     • based exclusively on inner products between training examples
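A small numpy illustration of evaluating the dual objective $D(\alpha)$ for a given multiplier vector; the data and $\alpha$ are made-up (feasible, but not necessarily optimal).

```python
import numpy as np

def dual_objective(alpha, X, y):
    """D(alpha) = sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = (X @ X.T) * np.outer(y, y)        # Gram matrix weighted by the labels
    return alpha.sum() - 0.5 * alpha @ G @ alpha

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.3, 0.1, 0.3, 0.1])    # feasible: 0 <= alpha_i <= C and sum alpha_i y_i = 0
print(dual_objective(alpha, X, y))
```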

  11. Primal <=> Dual
     Theorem: The primal OP and the dual OP have the same solution. Given the solution $\alpha_1^\circ, \ldots, \alpha_n^\circ$ of the dual OP,
       $w^\circ = \sum_{i=1}^{n} \alpha_i^\circ y_i x_i$,   $b^\circ = -\frac{1}{2} \left( w^\circ \cdot x_{pos} + w^\circ \cdot x_{neg} \right)$
     is the solution of the primal OP (with $x_{pos}$, $x_{neg}$ support vectors from the positive and negative class).
     Theorem: For any set of feasible points, $P(w, b) \geq D(\alpha)$.
     => Two alternative ways to represent the learning result:
     • weight vector and threshold $\langle w, b \rangle$
     • vector of "influences" $\alpha_1, \ldots, \alpha_n$
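A hedged scikit-learn sketch of this correspondence (the library use is an assumption): for a linear kernel, `dual_coef_` stores $\alpha_i y_i$ for the support vectors, so $w = \sum_i \alpha_i y_i x_i$ can be reconstructed and compared with the primal weight vector `coef_`.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, random_state=1)
y = 2 * y - 1                                   # labels in {-1, +1}

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# dual_coef_ holds alpha_i * y_i for the support vectors.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))      # True: both represent the same hyperplane
print("b =", clf.intercept_[0])
```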

  12. Non-Linear Problems
     Problem:
     • some tasks have non-linear structure
     • no hyperplane is sufficiently accurate
     How can SVMs learn non-linear classification rules?

  13. Example
     Input space (2 attributes): $x = (x_1, x_2)$
     Feature space (6 attributes): $\Phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1)$

  14. Extending the Hypothesis Space
     Idea: map the input space into a feature space via $\Phi$, and find a hyperplane in the feature space!
     Example: $(a, b, c) \stackrel{\Phi}{\to} (a, b, c, aa, ab, ac, bb, bc, cc)$
     => The separating hyperplane in feature space is a degree-two polynomial in input space.

  15. Kernels
     Problem: very many parameters! Polynomials of degree p over N attributes in input space lead to $O(N^p)$ attributes in feature space!
     Solution [Boser et al., 1992]: the dual OP needs only inner products => kernel functions $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
     Example: for $\Phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1)$, the kernel $K(x_i, x_j) = (x_i \cdot x_j + 1)^2$ computes the inner product in feature space.
     We do not need to represent the feature space explicitly!
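A short numpy check of this identity, using the degree-2 feature map from slide 13 as reconstructed here; the test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input (as on slide 13)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

def poly2_kernel(x, z):
    """K(x, z) = (x . z + 1)^2."""
    return (np.dot(x, z) + 1.0) ** 2

x, z = np.array([1.5, -0.5]), np.array([0.2, 2.0])
print(np.isclose(np.dot(phi(x), phi(z)), poly2_kernel(x, z)))   # True
```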

  16. SVM with Kernels
     Training: maximize $D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
     s.t. $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$
     Classification: for a new example x, $h(x) = \mathrm{sign}\left( \sum_{x_i \in SV} \alpha_i y_i K(x_i, x) + b \right)$
     New hypothesis spaces through new kernels:
     • Linear: $K(x_i, x_j) = x_i \cdot x_j$
     • Polynomial: $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$
     • Radial basis function: $K(x_i, x_j) = \exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$
     • Sigmoid: $K(x_i, x_j) = \tanh(\gamma\,(x_i \cdot x_j) + c)$
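The four kernels above written as plain Python functions, plus the kernelized classification rule; the parameter defaults ($d$, $\sigma$, $\gamma$, $c$) are arbitrary placeholders, not values from the tutorial.

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=2):
    return (np.dot(x, z) + 1.0) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.dot(x - z, x - z) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, gamma=0.1, c=0.0):
    # Note: the sigmoid "kernel" is not positive semi-definite for all gamma, c.
    return np.tanh(gamma * np.dot(x, z) + c)

def classify(x, sv_X, sv_y, alpha, b, kernel=rbf_kernel):
    """h(x) = sign( sum_i alpha_i y_i K(x_i, x) + b ) over the support vectors."""
    score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, sv_y, sv_X)) + b
    return 1 if score > 0 else -1
```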

  17. Example: SVM with Polynomial Kernel of Degree 2
     Kernel: $K(x_i, x_j) = (x_i \cdot x_j + 1)^2$
     [Plot by Bell SVM applet.]

  18. Example: SVM with RBF Kernel
     Kernel: $K(x_i, x_j) = \exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$
     [Plot by Bell SVM applet.]

  19. Two Reasons for Using a Kernel
     (1) Turn a linear learner into a non-linear learner (e.g. RBF, polynomial, sigmoid).
     (2) Make non-vectorial data accessible to the learner (e.g. string kernels for sequences).

  20. Summary: What is an SVM?
     Given:
     • training examples $(x_1, y_1), \ldots, (x_n, y_n)$ with $x_i \in \mathbb{R}^N$, $y_i \in \{-1, +1\}$
     • a hypothesis space defined by the kernel $K(x_i, x_j)$
     • a parameter C for trading off training error and margin size
     Training:
     • finds the hyperplane in the feature space generated by the kernel
     • the hyperplane has maximum margin in feature space with minimal training error (upper bound $\sum \xi_i$) given C
     • the result of training are $\alpha_1, \ldots, \alpha_n$; they determine $\langle w, b \rangle$
     Classification: for a new example x, $h(x) = \mathrm{sign}\left( \sum_{x_i \in SV} \alpha_i y_i K(x_i, x) + b \right)$
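A hedged end-to-end sketch mirroring this summary with scikit-learn (an assumption, not the tutorial's software): choose a kernel and C, train, and reproduce the classification rule $h(x)$ by hand from the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Non-linearly separable toy data (two concentric circles).
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
y = 2 * y - 1                                    # labels in {-1, +1}

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)

# Reproduce h(x) = sign( sum_{x_i in SV} alpha_i y_i K(x_i, x) + b ) manually.
x_new = np.array([[0.1, 0.2]])
K = np.exp(-1.0 * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))   # RBF with gamma = 1
h = np.sign(clf.dual_coef_ @ K + clf.intercept_)
print(int(h[0]), clf.predict(x_new)[0])          # both give the same label
```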

  21. Part 2: How to Use an SVM Effectively and Efficiently?
     • normalization of the input vectors
     • selecting C
     • handling unbalanced datasets
     • selecting a kernel
     • multi-class classification
     • selecting a training algorithm

  22. How to Assign Feature Values?
     Things to take into consideration:
     • the importance of a feature is monotonic in its absolute value: the larger the absolute value, the more influence the feature gets
       - typical problem: number of doors [0-5] vs. price [0-100000]
       - want relevant features large / irrelevant features small (e.g. IDF)
     • normalization to make features equally important (see the sketch below)
       - by mean and variance: $x_{norm} = \frac{x - \mathrm{mean}(X)}{\sqrt{\mathrm{var}(X)}}$
       - by another distribution
     • normalization to bring feature vectors onto the same scale (directional data, e.g. text classification)
       - by normalizing the length of the vector: $x_{norm} = \frac{x}{\lVert x \rVert}$ (some norm)
       - changes whether a problem is (linearly) separable or not
     • scale all vectors to a length that allows numerically stable training
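A numpy sketch of the two normalization schemes above; the feature matrix is made-up (column 1 plays the role of "number of doors", column 2 of "price").

```python
import numpy as np

X = np.array([[2.0, 40000.0],
              [4.0, 90000.0],
              [5.0, 15000.0]])        # two features on very different scales

# Per-feature normalization by mean and variance (z-scoring each column).
X_zscore = (X - X.mean(axis=0)) / np.sqrt(X.var(axis=0))

# Per-example length normalization for directional data (e.g. TF-IDF text vectors).
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_zscore)
print(X_unit)
```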

  23. Selecting a Kernel
     Things to take into consideration:
     • a kernel can be thought of as a similarity measure
       - examples in the same class should have a high kernel value
       - examples in different classes should have a low kernel value
       - ideal kernel: the equivalence relation $K(x_i, x_j) = \mathrm{sign}(y_i y_j)$
     • normalization also applies to the kernel
       - relative weight for implicit features
       - normalize per example for directional data: $K_{norm}(x_i, x_j) = \frac{K(x_i, x_j)}{\sqrt{K(x_i, x_i)\, K(x_j, x_j)}}$
       - potential problems with large numbers, for example the polynomial kernel $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$ for large d
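A numpy sketch of the per-example kernel normalization above, applied to a precomputed Gram matrix; it is the kernel-space analogue of scaling each $\Phi(x_i)$ to unit length. The data and kernel degree are arbitrary.

```python
import numpy as np

def normalize_gram(K):
    """K_norm[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# Example: degree-3 polynomial Gram matrix on made-up data.
X = np.array([[1.0, 2.0], [3.0, 0.5], [0.1, -1.0]])
K = (X @ X.T + 1.0) ** 3
K_norm = normalize_gram(K)
print(np.diag(K_norm))        # all ones after normalization
```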

  24. Selecting the Regularization Parameter C
     Common method:
     • a reasonable starting point and/or default value is $C_{def} = \frac{1}{\sum_i K(x_i, x_i)}$
     • search for C on a log scale, for example $C \in [10^{-4}\, C_{def}, \ldots, 10^{4}\, C_{def}]$
     • selection via cross-validation or via an approximation of leave-one-out [Jaakkola & Haussler, 1999] [Vapnik & Chapelle, 2000] [Joachims, 2000]
     Note: the optimal value of C scales with the feature values.
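A hedged scikit-learn sketch of this recipe (the library, dataset, and grid width are assumptions): start from the default value as reconstructed above, here with a linear kernel so that $K(x_i, x_i) = x_i \cdot x_i$, then scan a log-scale grid around it by cross-validation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Default value from the slide: C_def = 1 / sum_i K(x_i, x_i), linear kernel case.
C_def = 1.0 / np.sum(np.einsum("ij,ij->i", X, X))

# Search on a log scale around C_def and pick the best cross-validated value.
grid = [C_def * 10.0 ** k for k in range(-4, 5)]
scores = [cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5).mean() for C in grid]
best_C = grid[int(np.argmax(scores))]
print(f"C_def={C_def:.2e}, best C={best_C:.2e}")
```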

  25. Selecting Kernel Parameters
     Problem:
     • results are often very sensitive to the kernel parameters (e.g. the variance parameter of the RBF kernel)
     • need to optimize C simultaneously, since the optimal C typically depends on the kernel parameters
     Common method:
     • search for a combination of parameters via exhaustive search
     • selection of kernel parameters typically via cross-validation
     Advanced approach:
     • avoid exhaustive search for improved search efficiency [Chapelle et al., 2002]
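A hedged sketch of the exhaustive-search recipe with scikit-learn's GridSearchCV (an assumption), tuning C and the RBF width jointly; note that scikit-learn parameterizes the RBF kernel by $\gamma = 1/(2\sigma^2)$. The grid bounds are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Exhaustive search over C and the RBF width, selected by cross-validation.
param_grid = {
    "C": np.logspace(-2, 3, 6),
    "gamma": np.logspace(-4, 1, 6),   # gamma = 1 / (2 * sigma^2) in scikit-learn's RBF kernel
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```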
