  1. Kernel Methods and Support Vector Machines
  Oliver Schulte, CMPT 726. Bishop PRML Ch. 6.

  2. Support Vector Machines: Defining Characteristics
  • Like logistic regression, good for continuous input features and a discrete target variable.
  • Like nearest neighbor, a kernel method: classification is based on weighted similar instances. The kernel defines the similarity measure.
  • Sparsity: tries to find a few important instances, the support vectors.
  • Intuition: the Netflix recommendation system.

  3. SVMs: Pros and Cons
  Pros
  • Very good classification performance, basically unbeatable.
  • Fast and scalable learning.
  • Pretty fast inference.
  Cons
  • No explicit model is built, so the classifier is a black box.
  • Not so applicable to discrete inputs.
  • Still need to specify a kernel function (like specifying basis functions).
  • Issues with multiple classes; a probabilistic version, the Relevance Vector Machine, can be used.

  4. Two Views of SVMs
  Theoretical view: linear separator
  • The SVM looks for a linear separator, but in a new feature space.
  • It uses a new criterion for choosing the line separating the classes: max-margin.
  User view: kernel-based classification
  • The user specifies a kernel function.
  • The SVM learns weights for instances.
  • Classification takes an average of the labels of other instances, weighted by (a) similarity and (b) instance weight.
  Nice demo on the web: http://www.youtube.com/watch?v=3liCbRZPrZA

  5. Example: X-OR
  • X-OR problem: the class of (x1, x2) is positive iff x1 · x2 > 0.
  • Use 6 basis functions: φ(x1, x2) = (1, √2·x1, √2·x2, x1^2, √2·x1·x2, x2^2).
  • Simple classifier: y(x1, x2) = φ5(x1, x2) = √2·x1·x2.
  • Linear in basis function space.
  • Dot product: φ(x)^T φ(z) = (1 + x^T z)^2 = k(x, z).
  • A quadratic kernel.
  • Let's check the SVM demo: http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
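The identity φ(x)^T φ(z) = (1 + x^T z)^2 is easy to verify numerically. A minimal Python sketch (not from the slides; function names are illustrative):

```python
import numpy as np

def phi(x):
    """Explicit 6-dimensional feature map for the quadratic kernel in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     x1 ** 2,
                     np.sqrt(2) * x1 * x2,
                     x2 ** 2])

def quad_kernel(x, z):
    """Quadratic kernel k(x, z) = (1 + x^T z)^2, computed without the feature map."""
    return (1.0 + np.dot(x, z)) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
assert np.isclose(np.dot(phi(x), phi(z)), quad_kernel(x, z))  # same value, up to rounding
```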

  6. Valid Kernels
  • A kernel k(·, ·) is valid if it satisfies:
  • Symmetry: k(xi, xj) = k(xj, xi).
  • Positive semi-definiteness: for any x1, ..., xN, the Gram matrix K, with entries Knm = k(xn, xm) for n, m = 1, ..., N, must be positive semi-definite.
  • Positive semi-definite means z^T K z ≥ 0 for every vector z.
  • Then k(·, ·) corresponds to a dot product in some feature space φ.
  • Also known as a Mercer kernel, admissible kernel, or reproducing kernel.
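A quick numerical sanity check of these two conditions on a finite sample of points (an illustrative sketch, not a proof of validity; helper names are assumptions):

```python
import numpy as np

def gram_matrix(k, X):
    """Gram matrix K with entries K[n, m] = k(X[n], X[m])."""
    return np.array([[k(xn, xm) for xm in X] for xn in X])

def looks_valid(k, X, tol=1e-9):
    """Check symmetry and positive semi-definiteness of the Gram matrix on the sample X."""
    K = gram_matrix(k, X)
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)  # all eigenvalues (numerically) non-negative
    return symmetric and psd

X = np.random.default_rng(1).normal(size=(20, 2))
print(looks_valid(lambda x, z: (1.0 + x @ z) ** 2, X))     # quadratic kernel: True
print(looks_valid(lambda x, z: -np.dot(x - z, x - z), X))  # negative squared distance: False
```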

  7. Examples of Kernels
  • Linear kernel: k(x1, x2) = x1^T x2.
  • Polynomial kernel: k(x1, x2) = (1 + x1^T x2)^d. Contains all polynomial terms up to degree d.
  • Gaussian kernel: k(x1, x2) = exp(−||x1 − x2||^2 / 2σ^2). Infinite-dimensional feature space.
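For concreteness, these three kernels in Python (a short sketch; the parameter defaults are arbitrary, not from the slides):

```python
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, d=2):
    return (1.0 + np.dot(x1, x2)) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    diff = np.asarray(x1) - np.asarray(x2)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```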

  8. Constructing Kernels
  • New valid kernels can be built from existing valid kernels k1 and k2:
  • k(x1, x2) = c·k1(x1, x2), with c > 0
  • k(x1, x2) = k1(x1, x2) + k2(x1, x2)
  • k(x1, x2) = k1(x1, x2)·k2(x1, x2)
  • k(x1, x2) = exp(k1(x1, x2))
  • The table on p. 296 of Bishop gives many such rules.
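These closure rules translate directly into higher-order functions that combine kernels; a small illustrative sketch (helper names are assumptions, not from the slides):

```python
import numpy as np

def scale(k1, c):        # c * k1, with c > 0
    return lambda x, z: c * k1(x, z)

def add(k1, k2):         # k1 + k2
    return lambda x, z: k1(x, z) + k2(x, z)

def multiply(k1, k2):    # k1 * k2
    return lambda x, z: k1(x, z) * k2(x, z)

def exponentiate(k1):    # exp(k1)
    return lambda x, z: np.exp(k1(x, z))

linear = lambda x, z: np.dot(x, z)
# Valid by the rules above: (x^T z)^2 + exp(0.1 * x^T z)
combined = add(multiply(linear, linear), exponentiate(scale(linear, 0.1)))
```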

  9. More Kernels
  • Stationary kernels are a function only of the difference between their arguments, k(x1, x2) = k(x1 − x2), and so are translation invariant in input space: k(x1, x2) = k(x1 + c, x2 + c).
  • Homogeneous kernels, a.k.a. radial basis functions, are a function only of the magnitude of the difference: k(x1, x2) = k(||x1 − x2||).
  • Kernels on sets: k(A1, A2) = 2^|A1 ∩ A2|, where |A| denotes the number of elements in A.
  • Domain-specific kernels: think hard about your problem, figure out what it means to be similar, define that as k(·, ·), and prove it is positive semi-definite.
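The set kernel counts the subsets common to both arguments (each shared subset acts as one feature); a short illustrative sketch:

```python
def subset_kernel(A1: set, A2: set) -> float:
    """Set kernel k(A1, A2) = 2^|A1 ∩ A2|, i.e. the number of subsets shared by A1 and A2."""
    return 2.0 ** len(A1 & A2)

print(subset_kernel({"a", "b", "c"}, {"b", "c", "d"}))  # intersection {b, c}: 2^2 = 4.0
```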

  10. The Kernel Classification Formula
  • Suppose we have a kernel function k and N labelled instances with weights an ≥ 0, n = 1, ..., N.
  • As with the perceptron, the target labels are tn = +1 for the positive class and tn = −1 for the negative class.
  • Then y(x) = Σn an·tn·k(x, xn) + b, summing over n = 1, ..., N.
  • x is classified as positive if y(x) > 0, negative otherwise.
  • If an > 0, then xn is a support vector.
  • The other vectors need not be stored.
  • a will be sparse: many zeros.
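A direct implementation of this decision rule (an illustrative sketch; argument names are assumptions). Only the instances with an > 0 need to be passed in:

```python
def svm_predict(x, support_X, support_t, support_a, b, kernel):
    """Kernel classification: y(x) = sum_n a_n t_n k(x, x_n) + b; the sign gives the class."""
    y = sum(a * t * kernel(x, xn)
            for xn, t, a in zip(support_X, support_t, support_a)) + b
    return +1 if y > 0 else -1
```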

  11. Examples
  • SVM with a Gaussian kernel.
  • Support vectors are circled in the figure; they are the instances closest to the other class.
  • Note the non-linear decision boundary in x space.

  12. Examples
  • From Burges, A Tutorial on Support Vector Machines for Pattern Recognition (1998).
  • SVM trained using a cubic polynomial kernel k(x1, x2) = (x1^T x2 + 1)^3.
  • Left figure: linearly separable. Note that the decision boundary is almost linear, even with the cubic polynomial kernel.
  • Right figure: not linearly separable, but separable using the polynomial kernel.

  13. Learning the Instance Weights
  • The max-margin classifier is found by solving the following problem: maximize with respect to a
    L̃(a) = Σn an − (1/2) Σn Σm an·am·tn·tm·k(xn, xm)
    subject to the constraints an ≥ 0 for n = 1, ..., N, and Σn an·tn = 0.
  • The objective is quadratic with linear constraints and concave in a, so maximizing it is a convex optimization problem.
  • It is bounded above since K is positive semi-definite, so an optimal a can be found.
  • With large datasets, descent strategies are employed.
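As a concrete illustration, the dual can be solved for a tiny, linearly separable dataset with a general-purpose constrained optimizer (a sketch only; the data are made up, and real implementations use specialized solvers such as SMO rather than SLSQP):

```python
import numpy as np
from scipy.optimize import minimize  # assumes SciPy is available

# Toy data: two separable classes in 2-D, with labels t in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
t = np.array([1.0, 1.0, -1.0, -1.0])
k = lambda u, v: np.dot(u, v)                          # linear kernel, for simplicity
K = np.array([[k(xn, xm) for xm in X] for xn in X])    # Gram matrix

def neg_dual(a):
    # Negative of L~(a) = sum_n a_n - 1/2 sum_{n,m} a_n a_m t_n t_m k(x_n, x_m)
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

res = minimize(neg_dual, x0=np.zeros(len(X)), method="SLSQP",
               bounds=[(0.0, None)] * len(X),                         # a_n >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])  # sum_n a_n t_n = 0
print(np.round(res.x, 3))  # nonzero entries mark the support vectors
```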

  14. Kernelized Regression
  • Many learning algorithms can be written using only dot products.
  • Kernelization: replace the dot products by a kernel.
  • E.g., the kernel solution for regularized least-squares regression is
    y(x) = k(x)^T (K + λ·I_N)^{-1} t
    versus the original version
    y(x) = φ(x)^T (Φ^T Φ + λ·I_M)^{-1} Φ^T t.
  • N is the number of data points (the size of the Gram matrix K); M is the number of basis functions (the size of the matrix Φ^T Φ).
  • The kernelized version is bad if N > M, but good otherwise.
  • k(x) = (k(x, x1), ..., k(x, xN)) is the vector of kernel values between x and the data points xn.
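A compact sketch of this kernelized regularized least-squares solution (the kernel, data, σ, and λ below are arbitrary choices for illustration, not from the slides):

```python
import numpy as np

def fit_kernel_ridge(X, tvec, kernel, lam):
    """Dual weights (K + lambda*I_N)^{-1} t for kernelized regularized least squares."""
    K = np.array([[kernel(xn, xm) for xm in X] for xn in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), tvec)

def predict_kernel_ridge(x, X, dual_w, kernel):
    """y(x) = k(x)^T (K + lambda*I_N)^{-1} t, with k(x) the vector of kernel values to the data."""
    kx = np.array([kernel(x, xn) for xn in X])
    return kx @ dual_w

rbf = lambda u, v, s=1.0: np.exp(-np.dot(u - v, u - v) / (2 * s ** 2))
X = np.linspace(-3, 3, 20).reshape(-1, 1)   # 20 one-dimensional inputs
tvec = np.sin(X).ravel()                    # noiseless targets, for illustration
w = fit_kernel_ridge(X, tvec, rbf, lam=1e-3)
print(predict_kernel_ridge(np.array([0.5]), X, w, rbf))  # should be close to sin(0.5) ≈ 0.479
```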

  15. Conclusion
  • Readings: Ch. 6.1-6.2 (pp. 291-297).
  • Non-linear features, or domain-specific similarity measurements, are useful.
  • Dot products of non-linear features, and similarity measurements, can be written as kernel functions.
  • Validity is established by positive semi-definiteness of the kernel function.
  • An algorithm can work in a non-linear feature space without actually mapping inputs into that space.
  • This is advantageous when the feature space is high-dimensional.
