SIGIR 2003 Tutorial
Support Vector and Kernel Methods
Thorsten Joachims
Cornell University, Computer Science Department
tj@cs.cornell.edu
http://www.joachims.org
Linear Classifiers

Rules of the form: weight vector $\vec{w}$, threshold $b$. Geometric interpretation: a hyperplane.

$$h(\vec{x}) = \mathrm{sign}\left(\sum_{i=1}^{N} w_i x_i + b\right) = \begin{cases} +1 & \text{if } \sum_{i=1}^{N} w_i x_i + b > 0 \\ -1 & \text{else} \end{cases}$$
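To make the rule concrete, here is a minimal sketch in Python/numpy; the weight vector, threshold, and data points are made up purely for illustration:

```python
import numpy as np

def linear_classify(x, w, b):
    """Linear classifier h(x) = sign(sum_i w_i * x_i + b), mapped to {+1, -1}."""
    return 1 if np.dot(w, x) + b > 0 else -1

# toy example: 2-dimensional weight vector and threshold
w = np.array([0.5, -1.0])
b = 0.25
print(linear_classify(np.array([1.0, 0.2]), w, b))   # +1
print(linear_classify(np.array([0.0, 1.0]), w, b))   # -1
```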
Optimal Hyperplane (SVM Type 1)
Assumption: The training examples are linearly separable.
Maximizing the Margin
The hyperplane with maximum margin $\delta$ corresponds (roughly, see later) to the hypothesis space with minimal VC-dimension according to SRM.
Support vectors: examples with minimal distance to the hyperplane.
Example: Optimal Hyperplane vs. Perceptron
Train on 1000 pos / 1000 neg examples for “acq” (Reuters-21578).
[Plot: training/testing error (%) over perceptron iterations (eta = 0.1), compared against the test error of the hard-margin SVM.]
Non-Separable Training Samples
- For some training samples there is no separating hyperplane!
- Complete separation is suboptimal for many training samples!
=> minimize trade-off between margin and training error.
Soft-Margin Separation
Idea: Maximize margin and minimize training error simultaneously.

Hard margin (separable): minimize
$$P(\vec{w}, b) = \frac{1}{2}\,\vec{w}\cdot\vec{w} \qquad \text{s.t.} \qquad y_i\,[\vec{w}\cdot\vec{x}_i + b] \ge 1$$

Soft margin (training error): minimize
$$P(\vec{w}, b, \vec{\xi}) = \frac{1}{2}\,\vec{w}\cdot\vec{w} + C\sum_{i=1}^{n}\xi_i \qquad \text{s.t.} \qquad y_i\,[\vec{w}\cdot\vec{x}_i + b] \ge 1 - \xi_i \;\;\text{and}\;\; \xi_i \ge 0$$
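Both optimization problems are ordinary quadratic programs, so they can be handed to a generic convex solver. The following is a rough sketch of the soft-margin primal using cvxpy on synthetic toy data; the solver choice, data, and variable names are assumptions, not part of the tutorial:

```python
import numpy as np
import cvxpy as cp

# toy data: n examples, N features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
n, N = X.shape
C = 1.0

w = cp.Variable(N)
b = cp.Variable()
xi = cp.Variable(n)

# minimize 1/2 w.w + C * sum(xi)  s.t.  y_i [w.x_i + b] >= 1 - xi_i  and  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("sum of slacks (upper bound on training errors):", xi.value.sum())
```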
Controlling Soft-Margin Separation
- $\sum_i \xi_i$ is an upper bound on the number of training errors (checked numerically in the code sketch below).
- C is a parameter that controls the trade-off between margin and training error.

Soft margin: minimize
$$P(\vec{w}, b, \vec{\xi}) = \frac{1}{2}\,\vec{w}\cdot\vec{w} + C\sum_{i=1}^{n}\xi_i \qquad \text{s.t.} \qquad y_i\,[\vec{w}\cdot\vec{x}_i + b] \ge 1 - \xi_i \;\;\text{and}\;\; \xi_i \ge 0$$

[Figure: resulting separation for large C vs. small C.]
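As a numerical illustration of the bound, one can train a soft-margin SVM, recover the slacks as $\xi_i = \max(0, 1 - y_i f(\vec{x}_i))$, and compare their sum with the number of training errors. The sketch below assumes scikit-learn; any implementation exposing the decision function $f(\vec{x}) = \vec{w}\cdot\vec{x} + b$ would do:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
y = 2 * y - 1                           # map labels {0,1} -> {-1,+1}

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    f = clf.decision_function(X)        # f(x) = w.x + b
    xi = np.maximum(0.0, 1.0 - y * f)   # slack variables
    errors = np.sum(y * f < 0)          # training errors
    print(f"C={C:>6}: sum(xi)={xi.sum():7.2f} >= training errors={errors}")
```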
Example Reuters “acq”: Varying C
Observation: The error as a function of C typically has no local optima, but not necessarily...
[Plot: training/testing error (%) as a function of C (log scale), compared against the hard-margin SVM.]
Properties of the Soft-Margin Dual OP
- typically a single solution (i.e. $\langle \vec{w}, b\rangle$ is unique)
- one factor $\alpha_i$ for each training example
- "influence" of a single training example is limited by C
- $0 < \alpha_i < C$ <=> SV with $\xi_i = 0$
- $\alpha_i = C$ <=> SV with $\xi_i > 0$
- $\alpha_i = 0$ else
- based exclusively on inner products between training examples

Dual OP: maximize
$$D(\vec{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,(\vec{x}_i\cdot\vec{x}_j) \qquad \text{s.t.} \qquad \sum_{i=1}^{n}\alpha_i y_i = 0 \;\;\text{and}\;\; 0 \le \alpha_i \le C$$
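These properties can be inspected in most SVM packages. Assuming scikit-learn (an assumption, not part of the tutorial), `dual_coef_` stores $y_i\alpha_i$ for the support vectors, so bounded ($\alpha_i = C$) and unbounded ($0 < \alpha_i < C$) support vectors can be counted directly:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, class_sep=0.5, random_state=1)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())     # dual_coef_ holds y_i * alpha_i for each SV
bounded = np.sum(np.isclose(alpha, C))     # alpha_i = C: SV that may violate the margin
unbounded = np.sum(alpha < C - 1e-8)       # 0 < alpha_i < C: SV exactly on the margin
print(f"support vectors: {len(alpha)} (bounded: {bounded}, unbounded: {unbounded})")
print(f"training examples with alpha_i = 0: {len(X) - len(alpha)}")
```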
Primal <=> Dual
Theorem: The primal OP and the dual OP have the same solution. Given the solution $\alpha_1^{\circ}, \ldots, \alpha_n^{\circ}$ of the dual OP,
$$\vec{w}^{\,\circ} = \sum_{i=1}^{n}\alpha_i^{\circ} y_i \vec{x}_i \qquad b^{\circ} = -\frac{1}{2}\left(\vec{w}^{\,\circ}\cdot\vec{x}_{pos} + \vec{w}^{\,\circ}\cdot\vec{x}_{neg}\right)$$
is the solution of the primal OP (with $\vec{x}_{pos}$ and $\vec{x}_{neg}$ unbounded support vectors from the positive and negative class).

Theorem: For any set of feasible points, $P(\vec{w}, b) \ge D(\vec{\alpha})$.

=> two alternative ways to represent the learning result:
- weight vector and threshold $\langle \vec{w}, b\rangle$
- vector of "influences" $\alpha_1, \ldots, \alpha_n$
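For a linear kernel, the two representations can be checked against each other: $\vec{w}$ reconstructed from the dual as $\sum_i \alpha_i y_i \vec{x}_i$ must match the weight vector reported by the learner. A sketch, again assuming scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=4, random_state=2)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual representation: influences alpha_i * y_i (dual_coef_) and the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # sum_i alpha_i y_i x_i

# primal representation: weight vector reported directly for the linear kernel
print(np.allclose(w_from_dual, clf.coef_))            # True
```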
Non-Linear Problems
Problem:
- some tasks have non-linear structure
- no hyperplane is sufficiently accurate
How can SVMs learn non-linear classification rules?
Example
Input space (2 attributes): $\vec{x} = (x_1, x_2)$
Feature space (6 attributes): $\Phi(\vec{x}) = (x_1^2,\; x_2^2,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; \sqrt{2}x_1x_2,\; 1)$
Extending the Hypothesis Space
Idea: Map the input space into a feature space via $\Phi$ and find the separating hyperplane in the feature space.
Example: $\Phi: (a, b, c) \mapsto (a, b, c, aa, ab, ac, bb, bc, cc)$
=> The separating hyperplane in feature space is a degree-two polynomial in input space.
Kernels
Problem: Very many parameters! Polynomials of degree p over N attributes in input space lead to $O(N^p)$ attributes in feature space!
Solution [Boser et al., 1992]: The dual OP needs only inner products => kernel functions
$$K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i)\cdot\Phi(\vec{x}_j)$$
Example: For $\Phi(\vec{x}) = (x_1^2,\; x_2^2,\; \sqrt{2}x_1,\; \sqrt{2}x_2,\; \sqrt{2}x_1x_2,\; 1)$, calculating
$$K(\vec{x}_i, \vec{x}_j) = [\vec{x}_i\cdot\vec{x}_j + 1]^2 = \Phi(\vec{x}_i)\cdot\Phi(\vec{x}_j)$$
gives the inner product in feature space. We do not need to represent the feature space explicitly!
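A quick numerical check of this identity, spelling out the explicit feature map $\Phi$ from the example above and comparing $\Phi(\vec{x})\cdot\Phi(\vec{z})$ with $[\vec{x}\cdot\vec{z} + 1]^2$ (the point values are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2 dimensions."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, 1.0])

def poly2_kernel(x, z):
    """K(x, z) = (x.z + 1)^2, computed without leaving the input space."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(np.dot(phi(x), phi(z)))   # inner product in feature space
print(poly2_kernel(x, z))       # same value via the kernel
```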
SVM with Kernels
Training: maximize
$$D(\vec{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,K(\vec{x}_i, \vec{x}_j) \qquad \text{s.t.} \qquad \sum_{i=1}^{n}\alpha_i y_i = 0 \;\;\text{and}\;\; 0 \le \alpha_i \le C$$

Classification: For a new example x
$$h(\vec{x}) = \mathrm{sign}\left(\sum_{\vec{x}_i \in SV}\alpha_i y_i K(\vec{x}_i, \vec{x}) + b\right)$$

New hypothesis spaces through new kernels:
- Linear: $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i\cdot\vec{x}_j$
- Polynomial: $K(\vec{x}_i, \vec{x}_j) = [\vec{x}_i\cdot\vec{x}_j + 1]^d$
- Radial Basis Function: $K(\vec{x}_i, \vec{x}_j) = \exp(-\|\vec{x}_i - \vec{x}_j\|^2 / \sigma^2)$
- Sigmoid: $K(\vec{x}_i, \vec{x}_j) = \tanh(\gamma\,(\vec{x}_i\cdot\vec{x}_j) + c)$
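A sketch of the four kernels and of the kernelized decision rule $h(\vec{x})$ in plain numpy; the support vectors, their $\alpha_i$, and $b$ are assumed to come from whatever dual solver was used for training, and all names here are hypothetical:

```python
import numpy as np

def k_linear(xi, xj):
    return np.dot(xi, xj)

def k_poly(xi, xj, d=2):
    return (np.dot(xi, xj) + 1.0) ** d

def k_rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

def k_sigmoid(xi, xj, gamma=0.01, c=0.0):
    return np.tanh(gamma * np.dot(xi, xj) + c)

def svm_predict(x, sv_x, sv_y, sv_alpha, b, kernel=k_rbf):
    """h(x) = sign( sum_{i in SV} alpha_i * y_i * K(x_i, x) + b )"""
    score = sum(a * y * kernel(xi, x) for a, y, xi in zip(sv_alpha, sv_y, sv_x)) + b
    return 1 if score > 0 else -1
```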
Example: SVM with Polynomial of Degree 2
Kernel: $K(\vec{x}_i, \vec{x}_j) = [\vec{x}_i\cdot\vec{x}_j + 1]^2$
[Plot by Bell SVM applet]
Example: SVM with RBF-Kernel
Kernel: $K(\vec{x}_i, \vec{x}_j) = \exp(-\|\vec{x}_i - \vec{x}_j\|^2 / \sigma^2)$
[Plot by Bell SVM applet]
Two Reasons for Using a Kernel
(1) Turn a linear learner into a non-linear learner (e.g. RBF, polynomial, sigmoid)
(2) Make non-vectorial data accessible to the learner (e.g. string kernels for sequences; a toy sketch follows below)
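As an illustration of point (2), a simple spectrum-style string kernel that counts shared substrings of length k; this is a deliberately simple toy example, not the specific string kernels referenced in the literature:

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """Inner product of the k-substring count vectors of two strings."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)

print(spectrum_kernel("kernel methods", "kernel machines", k=3))
```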
Summary: What is an SVM?

Given:
- Training examples $(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n)$ with $\vec{x}_i \in \Re^N$ and $y_i \in \{-1, +1\}$
- Hypothesis space according to kernel $K(\vec{x}_i, \vec{x}_j)$
- Parameter C for trading off training error and margin size

Training:
- Finds the hyperplane in the feature space generated by the kernel.
- The hyperplane has maximum margin in feature space with minimal training error (upper bound $\sum_i \xi_i$) given C.
- The results of training are $\alpha_1, \ldots, \alpha_n$. They determine $\langle \vec{w}, b\rangle$.

Classification: For a new example x
$$h(\vec{x}) = \mathrm{sign}\left(\sum_{\vec{x}_i \in SV}\alpha_i y_i K(\vec{x}_i, \vec{x}) + b\right)$$
Part 2: How to use an SVM effectively and efficiently?
- normalization of the input vectors
- selecting C
- handling unbalanced datasets
- selecting a kernel
- multi-class classification
- selecting a training algorithm
How to Assign Feature Values?
Things to take into consideration:
- importance of feature is monotonic in its absolute value
- the larger the absolute value, the more influence the feature gets
- typical problem: number of doors [0-5], price [0-100000]
- want relevant features large / irrelevant features low (e.g. IDF)
- normalization to make features equally important
- by mean and variance: $x_{norm} = \frac{x - \mathrm{mean}(X)}{\sqrt{\mathrm{var}(X)}}$ (see the sketch after this list)
- by other distribution
- normalization to bring feature vectors onto the same scale
- directional data: text classification
- by normalizing the length of the vector according to some norm: $\vec{x}_{norm} = \vec{x} / \|\vec{x}\|$
- changes whether a problem is (linearly) separable or not
- scale all vectors to a length that allows numerically stable training
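Both normalizations from the list above in numpy, as a small sketch with made-up feature values:

```python
import numpy as np

X = np.array([[2.0,  20000.0],     # e.g. number of doors, price
              [4.0,  90000.0],
              [5.0,  55000.0]])

# per-feature normalization by mean and standard deviation (columns become comparable)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# per-example length normalization (rows get unit L2 norm, for directional data)
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_std)
print(np.linalg.norm(X_unit, axis=1))   # all 1.0
```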
Selecting a Kernel
Things to take into consideration:
- kernel can be thought of as a similarity measure
- examples in the same class should have high kernel value
- examples in different classes should have low kernel value
- ideal kernel: equivalence relation $K(\vec{x}_i, \vec{x}_j) = \mathrm{sign}(y_i y_j)$
- normalization also applies to the kernel
- relative weight for implicit features
- normalize per example for directional data: $K(\vec{x}_i, \vec{x}_j) \leftarrow \frac{K(\vec{x}_i, \vec{x}_j)}{\sqrt{K(\vec{x}_i, \vec{x}_i)\,K(\vec{x}_j, \vec{x}_j)}}$
- potential problems with large numbers, for example the polynomial kernel $K(\vec{x}_i, \vec{x}_j) = [\vec{x}_i\cdot\vec{x}_j + 1]^d$ for large d
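The per-example normalization can be applied directly to a precomputed Gram matrix. A small sketch; the polynomial kernel and data points are chosen only for illustration:

```python
import numpy as np

def normalize_kernel(K):
    """K_ij <- K_ij / sqrt(K_ii * K_jj), so every example has self-similarity 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

# toy example with a degree-2 polynomial kernel
X = np.array([[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]])
K = (X @ X.T + 1.0) ** 2
K_norm = normalize_kernel(K)
print(np.diag(K_norm))   # all 1.0
```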
Selecting Regularization Parameter C
Common Method
- a reasonable starting point and/or default value is $C_{def} = \left[\frac{1}{n}\sum_{i=1}^{n} K(\vec{x}_i, \vec{x}_i)\right]^{-1}$
- search for C on a log scale, for example $C \in [10^{-4}\,C_{def}, \ldots, 10^{4}\,C_{def}]$
- selection via cross-validation or via approximation of the leave-one-out error [Jaakkola & Haussler, 1999] [Vapnik & Chapelle, 2000] [Joachims, 2000]

Note
- the optimal value of C scales with the feature values
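A sketch of this recipe, using the $C_{def}$ starting point from above (for the linear kernel, $K(\vec{x}, \vec{x}) = \vec{x}\cdot\vec{x}$) and scikit-learn's cross-validation, which is an assumption rather than part of the tutorial:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=3)

# default value: inverse of the average self-kernel value (linear kernel: K(x,x) = x.x)
C_def = 1.0 / np.mean(np.sum(X * X, axis=1))

# search on a log scale around C_def and pick the best value by cross-validation
grid = [C_def * 10.0 ** e for e in range(-4, 5)]
scores = [cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5).mean() for C in grid]
best_C = grid[int(np.argmax(scores))]
print(f"C_def = {C_def:.4g}, best C = {best_C:.4g}")
```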
Selecting Kernel Parameters
Problem
- results are often very sensitive to kernel parameters (e.g. the variance $\gamma$ in the RBF kernel)
- need to simultaneously optimize C, since the optimal C typically depends on the kernel parameters

Common Method
- search for the combination of parameters via exhaustive search
- selection of kernel parameters typically via cross-validation

Advanced Approach
- avoiding exhaustive search for improved search efficiency [Chapelle et al., 2002]
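The exhaustive search over C and a kernel parameter can be written as a straightforward grid search; the sketch below assumes scikit-learn's GridSearchCV and its RBF parameterization (its gamma corresponds roughly to $1/\sigma^2$ in the notation above):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=20, random_state=4)

param_grid = {
    "C":     [10.0 ** e for e in range(-2, 3)],
    "gamma": [10.0 ** e for e in range(-4, 1)],   # RBF width parameter
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```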
Handling Multi-Class / Multi-Label Problems
The standard classification SVM addresses binary problems, $y \in \{-1, +1\}$.

Multi-class classification, $y \in \{1, \ldots, k\}$:
- one-against-rest decomposition into k binary problems (sketched in code below)
  - learn one binary SVM $h^{(i)}$ per class i with $y^{(i)} = 1$ if $(y = i)$, $-1$ else
  - assign a new example to $y = \arg\max_i \left[h^{(i)}(\vec{x})\right]$
- pairwise decomposition into $k(k-1)/2$ binary problems
  - learn one binary SVM $h^{(i,j)}$ per class pair with $y^{(i,j)} = 1$ if $(y = i)$, $-1$ if $(y = j)$
  - assign a new example by majority vote
- reducing the number of classifications [Platt et al., 2000]
- multi-class SVM [Weston & Watkins, 1998]
- multi-class SVM via ranking [Crammer & Singer, 2001]
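A sketch of the one-against-rest scheme; the binary SVMs come from scikit-learn (an assumption), but any binary SVM returning a real-valued score $h^{(i)}(\vec{x})$ would work:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)          # k = 3 classes
classes = np.unique(y)

# learn one binary SVM per class: +1 for "this class", -1 for the rest
machines = {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1)) for c in classes}

def predict(x):
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in machines.items()}
    return max(scores, key=scores.get)      # assign to argmax_i h^(i)(x)

print(predict(X[0]), y[0])
```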