SLIDE 1

SVM Support Vector Machine

Ricco Rakotomalala

Université Lumière Lyon 2

Supervised Learning - Classification

SLIDE 2

Outline

1. Binary classification – Linear classifier
2. Maximize the margin (I) – Primal form
3. Maximize the margin (II) – Dual form
4. Noisy labels – Soft margin
5. Nonlinear classification – Kernel trick
6. Estimating class membership probabilities
7. Feature selection
8. Extension to multiclass problems
9. SVM in practice – Tools and software
10. Conclusion – Pros and cons
11. References
SLIDE 3

LINEAR SVM

Binary classification

SLIDE 4

Binary classification

Data linearly separable

Supervised learning: Y = f(x1, x2, …, xp ; β) for a binary problem, i.e. Y ∈ {+, −} or Y ∈ {+1, −1}.

x1   x2    y
 1    3   -1
 2    1   -1
 4    5   -1
 6    9   -1
 8    7   -1
 5    1   +1
 7    1   +1
 9    4   +1
12    7   +1
13    6   +1

The aim is to find a hyperplane that perfectly separates the "+" and the "−" instances. The classifier takes the form of a linear combination of the variables.

[Scatter plot of the two classes in the (x1, x2) plane.]

The classification function is

$$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 = \beta_0 + x^T\beta$$

β = (β1, β2, …, βp) and β0 are the (p + 1) parameters (coefficients) to estimate.

SLIDE 5

Finding the “optimal” solution

[Scatter plot: several possible separating lines for the same data in the (x1, x2) plane.]

Once the "shape" of the decision boundary is defined, we still have to choose one solution among the infinite number of possible solutions. Two key issues, as always in the supervised learning framework: (1) choosing the "representation bias" (or "hypothesis bias"), i.e. defining the shape of the separator; (2) choosing the search bias, i.e. the way to select the best solution among all possible ones, which often boils down to setting the objective function to optimize.

Example: Linear Discriminant Analysis

[Scatter plot: the LDA separating hyperplane in the (x1, x2) plane.]

The separating hyperplane lies halfway between the two conditional centroids, in the sense of the Mahalanobis distance.
SLIDE 6

MAXIMIZE THE MARGIN (I)

Primal problem

SLIDE 7

[Scatter plot: the maximum margin hyperplane, the two margin hyperplanes, and the distance d from a point to the boundary in the (x1, x2) plane.]

Hard-margin principle

Intuitive layout

The optimal separating hyperplane separates the two classes and maximizes the distance to the closest point from either class (Vapnik, 1996) [HTF, page 132]

• The maximum margin is $\frac{2}{\|\beta\|}$.

• The instances on which the margin hyperplanes rest are the "support vectors". If we remove them from the sample, the optimal solution is modified.

• Several regions are defined in the representation space:
  f(x) = 0 is the maximum margin hyperplane;
  f(x) > 0 is the region of the « + » instances;
  f(x) < 0 is the region of the « - » instances;
  f(x) = +1 or −1 are the margin hyperplanes.

• The distance from any point x to the boundary (by orthogonal projection) is

$$d(x) = \frac{|\beta_0 + x^T\beta|}{\|\beta\|}$$

SLIDE 8

Maximizing the margin

Mathematical formulation

Maximizing the margin is equivalent to minimizing the norm of the parameter vector β:

$$\max \frac{2}{\|\beta\|} \;\Longleftrightarrow\; \min \|\beta\|$$

Note: the problem is also often written as $\min_{\beta,\beta_0} \frac{1}{2}\|\beta\|^2$

• Norm: $\|\beta\| = \sqrt{\beta_1^2 + \beta_2^2 + \cdots + \beta_p^2}$

• We have a convex optimization problem (quadratic objective function, linear constraints). A global optimum exists.

The primal problem:

$$\min_{\beta,\beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i f(x_i) \geq 1, \; i = 1,\ldots,n$$

• The constraints state that every point is on the correct side of its margin; at best, a point lies exactly on a margin hyperplane (support vector).

• But there is no closed-form solution: we must go through numerical optimization programs.
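As an aside (not part of the original slides), this small optimization problem can also be reproduced outside Excel; a minimal sketch with scipy, under the assumption that the SLSQP solver is accurate enough for this toy dataset:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], dtype=float)

def objective(w):
    beta = w[:2]                    # w = (beta_1, beta_2, beta_0)
    return 0.5 * beta @ beta        # (1/2) * ||beta||^2

def margin_constraints(w):
    beta, beta0 = w[:2], w[2]
    return y * (X @ beta + beta0) - 1.0   # y_i * f(x_i) - 1 >= 0 for every instance

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
print(res.x.round(3))   # expected to be close to (0.667, -0.667, -1.667)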

SLIDE 9

Maximizing the margin

A toy example under EXCEL (!)

We use the SOLVER to solve the optimization problem: one objective cell (the norm ‖β‖), (p + 1) variable cells, n = 10 constraints.

Estimated coefficients: beta.1 = 0.667, beta.2 = -0.667, beta.0 = -1.667; ‖β‖ (Norme.Beta) = 0.943.

n°   x1   x2    y     f(x)     f(x)·y
 1    1    3   -1   -3.000      3.000
 2    2    1   -1   -1.000      1.000
 3    4    5   -1   -2.333      2.333
 4    6    9   -1   -3.667      3.667
 5    8    7   -1   -1.000      1.000
 6    5    1   +1    1.000      1.000
 7    7    1   +1    2.333      2.333
 8    9    4   +1    1.667      1.667
 9   12    7   +1    1.667      1.667
10   13    6   +1    3.000      3.000

Saturated constraints: 3 support vectors were found (n° 2, 5 and 6).

The maximum margin hyperplane and the two margin hyperplanes ($\beta_0 + x^T\beta = 0, \pm 1$):

$$0.667\,x_1 - 0.667\,x_2 - 1.667 = 0$$
$$0.667\,x_1 - 0.667\,x_2 - 1.667 = +1$$
$$0.667\,x_1 - 0.667\,x_2 - 1.667 = -1$$

[Scatter plot: the data, the maximum margin hyperplane and the two margins in the (x1, x2) plane.]

SLIDE 10

Primal problem

Comments

Assignment rule for an instance i*, based on the estimated coefficients β̂j:

$$f(x_{i^*}) \geq 0 \Rightarrow \hat{y}_{i^*} = +1, \qquad f(x_{i^*}) < 0 \Rightarrow \hat{y}_{i^*} = -1$$

With beta.1 = 0.667, beta.2 = -0.667, beta.0 = -1.667:

x1   x2    y      f(x)    prediction
 1    3   -1   -3.0000        -1
 2    1   -1   -1.0000        -1
 4    5   -1   -2.3333        -1
 6    9   -1   -3.6667        -1
 8    7   -1   -1.0000        -1
 5    1   +1    1.0000        +1
 7    1   +1    2.3333        +1
 9    4   +1    1.6667        +1
12    7   +1    1.6667        +1
13    6   +1    3.0000        +1

Drawbacks of this primal form:

(1) The numerical optimization algorithms (quadratic programming) are not operational when "p" is large (more than a few hundred variables). This often happens with real problems (e.g. text mining, images: few examples, many descriptors).

(2) This primal form does not highlight the possibility of using "kernel" functions, which allow going beyond linear classifiers.
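A minimal numpy illustration (not from the slides) of this assignment rule, with the coefficients estimated above:

import numpy as np

beta = np.array([0.667, -0.667])       # estimated coefficients (beta_1, beta_2)
beta0 = -1.667

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)

f = X @ beta + beta0                   # decision values f(x)
y_pred = np.where(f >= 0, 1, -1)       # assignment rule: sign of f(x)
print(np.column_stack([f.round(3), y_pred]))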

SLIDE 11

MAXIMIZE THE MARGIN (II)

Dual problem

SLIDE 12

Dual problem

Lagrangian multiplier method

A convex optimization problem has a dual form, obtained with the Lagrange multipliers.

The primal problem

$$\min_{\beta,\beta_0}\; \frac{1}{2}\|\beta\|^2 \quad \text{s.t.}\quad y_i(x_i^T\beta + \beta_0) \geq 1,\; i = 1,\ldots,n$$

becomes, under the dual form, the Lagrangian

$$L_P = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{n}\alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right]$$

where the αi are the Lagrange multipliers.

Setting each partial derivative to zero:

$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^{n}\alpha_i y_i x_i \qquad\qquad \frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0$$

The solution must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\alpha_i \geq 0, \qquad y_i(x_i^T\beta + \beta_0) - 1 \geq 0, \qquad \alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right] = 0 \quad \forall i$$

We can obtain the parameters (coefficients) of the hyperplane from the Lagrange multipliers.

SLIDE 13

Dual problem

Optimization

Using the partial-derivative relations of the Lagrangian, the problem can be expressed in terms of the multipliers only:

$$\max_{\alpha}\; L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\langle x_i, x_{i'}\rangle$$

Subject to

$$\alpha_i \geq 0\;\;\forall i, \qquad \sum_{i=1}^{n}\alpha_i y_i = 0$$

• ⟨xi, xi'⟩ is the scalar product between the vectors of values of the instances i and i': $\langle x_i, x_{i'}\rangle = \sum_{j=1}^{p} x_{ij}\,x_{i'j}$
• The instances with αi > 0 are the important ones, i.e. the support vectors.
• Inevitably, there will be support vectors with different class labels, otherwise the constraint Σi αi yi = 0 could not be met.
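A minimal sketch (an illustration, not the author's spreadsheet) that solves this dual problem on the toy data with scipy and recovers the multipliers:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1, 3], [2, 1], [4, 5], [6, 9], [8, 7],
              [5, 1], [7, 1], [9, 4], [12, 7], [13, 6]], dtype=float)
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1], dtype=float)
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T      # G[i, i'] = y_i y_i' <x_i, x_i'>

def neg_dual(a):                               # we minimize -L_D(alpha)
    return 0.5 * a @ G @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}  # sum_i alpha_i y_i = 0
res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=[(0, None)] * n, constraints=[cons])
print(res.x.round(3))   # expected: nonzero weights only for n° 2, 5 and 6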

SLIDE 14

Example

Using Excel again

We use the SOLVER again: objective cell LD(α), variable cells αi, with the constraints αi ≥ 0 and Σi αi yi = 0.

n°   x1   x2    y    alpha      y·alpha
 1    1    3   -1   0           0
 2    2    1   -1   0.33333    -0.33333
 3    4    5   -1   0           0
 4    6    9   -1   0           0
 5    8    7   -1   0.11111    -0.11111
 6    5    1   +1   0.44444     0.44444
 7    7    1   +1   0           0
 8    9    4   +1   0           0
 9   12    7   +1   0           0
10   13    6   +1   0           0
Sum                 0.88889     ≈ 0

Only the support vectors have a "weight" αi > 0 (n° 2, 5 and 6). The matrix of the scalar products ⟨xi, xi'⟩ is called the Gram matrix. For the three support vectors, the terms αi αi' yi yi' ⟨xi, xi'⟩ are:

        n°2    n°5    n°6
n°2     0.6    0.9   -1.6
n°5     0.9    1.4   -2.3
n°6    -1.6   -2.3    5.1

$$\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\langle x_i, x_{i'}\rangle = 0.889 \qquad \sqrt{0.889} = 0.943 = \|\beta\|$$

$$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\langle x_i, x_{i'}\rangle = 0.889 - 0.444 = 0.444$$

Recall that the margin is equal to 2/‖β‖.

SLIDE 15

From the support vectors to the hyperplane coefficients (I)

Computing β from the α (Lagrange multipliers)

Since the primal and dual expressions are two facets of the same problem, one must be able to pass from one to the other.

From the partial derivative of the Lagrangian with respect to β:

$$\beta = \sum_{i=1}^{n}\alpha_i y_i x_i$$

From the KKT conditions, computed from any support vector (its αi must be > 0), we can obtain β0:

$$\alpha_i\left[y_i(x_i^T\beta + \beta_0) - 1\right] = 0 \;\Rightarrow\; \beta_0 = \frac{1 - y_i\,x_i^T\beta}{y_i}$$

Only the support vector points are involved in the calculation of the coefficients, since they are the only ones for which αi > 0.

Note: because yi ∈ {−1, +1}, we can also write β0 = yi − xi^Tβ.

SLIDE 16

From the support vectors to the hyperplane coefficients (II)

Numerical example

Support vectors:

n°   x1   x2    y    alpha
 2    2    1   -1    0.333
 5    8    7   -1    0.111
 6    5    1   +1    0.444

$$\beta_1 = 0.333\times(-1)\times 2 + 0.111\times(-1)\times 8 + 0.444\times(+1)\times 5 = 0.6667$$

For β1, only the variable x1 participates in the calculation; likewise β2 = −0.6667.

Using the support vector n° 2:

$$\beta_0 = y_2 - x_2^T\beta = -1 - \left[2\times 0.667 + 1\times(-0.667)\right] = -1.6667$$

The result is the same whatever the support vector used: β0 = −1.6667 from each of the three support vectors.
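The same computation in a few lines of numpy (an illustrative check, not the original spreadsheet):

import numpy as np

sv = np.array([[2, 1], [8, 7], [5, 1]], dtype=float)   # support vectors n° 2, 5, 6
y_sv = np.array([-1, -1, 1], dtype=float)
alpha = np.array([0.333, 0.111, 0.444])

beta = (alpha * y_sv) @ sv              # beta = sum_i alpha_i y_i x_i
beta0 = y_sv[0] - sv[0] @ beta          # from any support vector: beta_0 = y_i - x_i' beta
print(beta.round(3), round(beta0, 3))   # approx. [0.667, -0.667] and -1.667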

SLIDE 17

Classification of an unseen instance (I)

Using the support vectors

Use of the support vectors for the classification of unseen instances. This formulation will be important when we use kernel functions. The classification function can be written with the coefficients β or with the Lagrange multipliers α:

$$f(x) = x^T\beta + \beta_0 = \sum_{i=1}^{n}\alpha_i y_i\langle x, x_i\rangle + \beta_0 = \sum_{i\in S}\alpha_i y_i\langle x, x_i\rangle + \beta_0$$

S is the set of support vectors: they are the only instances with a weight αi > 0.

Only the support vectors participate in the classification process!

We have a kind of nearest neighbor scheme in which only the instances corresponding to the support vectors take part in the classification, each weighted by αi. The intercept β0 can be obtained from the KKT conditions applied to the support vectors (see the previous pages).

SLIDE 18

Classification of an unseen instance (II)

Numerical example

Support vectors (with β0 = −1.667):

n°   x1   x2    y    alpha
 2    2    1   -1    0.333
 5    8    7   -1    0.111
 6    5    1   +1    0.444

For the classification of instance n° 1 (x = (1, 3)), the 3 support vectors are used:

$$f(x) = 0.333\times(-1)\times(2\times 1 + 1\times 3) + 0.111\times(-1)\times(8\times 1 + 7\times 3) + 0.444\times(+1)\times(5\times 1 + 1\times 3) + (-1.667) = -3.0$$

The same computation applied to every instance:

n°   x1   x2    y     f(x)    prediction
 1    1    3   -1   -3.000        -1
 2    2    1   -1   -1.000        -1
 3    4    5   -1   -2.333        -1
 4    6    9   -1   -3.667        -1
 5    8    7   -1   -1.000        -1
 6    5    1   +1    1.000        +1
 7    7    1   +1    2.333        +1
 8    9    4   +1    1.667        +1
 9   12    7   +1    1.667        +1
10   13    6   +1    3.000        +1
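A minimal numpy sketch of this decision function (illustration only), reproducing the value obtained for instance n° 1:

import numpy as np

sv = np.array([[2, 1], [8, 7], [5, 1]], dtype=float)   # support vectors n° 2, 5, 6
y_sv = np.array([-1, -1, 1], dtype=float)
alpha = np.array([0.333, 0.111, 0.444])
beta0 = -1.667

def decision(x):
    # f(x) = sum_{i in S} alpha_i y_i <x, x_i> + beta_0
    return (alpha * y_sv) @ (sv @ x) + beta0

x_new = np.array([1.0, 3.0])            # instance n° 1
print(round(decision(x_new), 3))        # approx. -3.0  ->  predicted class: -1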

SLIDE 19

Dual form

Comments

1. This formulation is completely consistent with the primal form.
2. It highlights the role of the support vectors, through the weights αi.
3. It also highlights the importance of the scalar product ⟨xi, xi'⟩ in the calculations (Gram matrix).
4. When facing high dimensionality ("p" very large, e.g. text mining), this formulation keeps the calculations tractable for the optimization techniques.
SLIDE 20

SOFT MARGIN

Noisy dataset labels

SLIDE 21

‘’Slack variables’’

Using the slack variables ξi to handle misclassified instances

In real problems, a perfect classification is not feasible: some instances fall on the wrong side of the margins.

• ξ is a vector of size n
• ξi ≥ 0 measures by how much instance i violates its margin
• ξi = 0: the instance is on the right side of its margin
• ξi < 1: the instance is on the right side of the maximum margin hyperplane, but it violates its margin
• ξi > 1: the instance is misclassified, i.e. it is on the wrong side of the maximum margin hyperplane

[Scatter plot: instances with a nonzero slack ξi relative to their margin, in the (x1, x2) plane.]

SLIDE 22

Reformulation of the problem

Introduction of the cost parameter “C”

We penalize the errors more or less strongly, depending on how closely we want the model to fit the training data.

Primal form:

$$\min_{\beta,\beta_0,\xi}\; \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i(x_i^T\beta + \beta_0) \geq 1 - \xi_i,\;\; \xi_i \geq 0,\;\; i = 1,\ldots,n$$

Dual form:

$$\max_{\alpha}\; L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\langle x_i, x_{i'}\rangle \quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\;\; 0 \leq \alpha_i \leq C\;\;\forall i$$

The tolerance for errors is more or less accentuated with the C parameter (the "cost" parameter):
• C too high: overfitting
• C too low: underfitting

SLIDE 23

Soft-margin - An example

Primal form

• Minimization of the objective function w.r.t. β, β0 and ξ
• C is a parameter; we set C = 5

Estimated coefficients: beta.1 = 0.333, beta.2 = -0.333, beta.0 = -0.667; objective function = 30.111.

n°   x1   x2    y    ksi     1-ksi    y·f(x)
 1    1    1   -1   0.333    0.667     0.667
 2    4    5   -1   0        1         1.000
 3    6    9   -1   0        1         1.667
 4    8    7   -1   0.667    0.333     0.333
 5    7    1   -1   2.333   -1.333    -1.333
 6    1    3   +1   2.333   -1.333    -1.333
 7    5    1   +1   0.333    0.667     0.667
 8   13    6   +1   0        1         1.667
 9    9    4   +1   0        1         1.000
10   12    7   +1   0        1         1.000

• yi(xi^Tβ + β0) = 1 − ξi: saturated constraint → support vector (yellow background in the spreadsheet), i.e. if we removed the case, the solution would be different (8 instances here)
• ξi = 0: the case is on the right side of its margin
• ξi ≥ 1: the case is on the wrong side of the maximum margin hyperplane (2 misclassified instances here)
• 0 < ξi < 1: the instance is on the right side of the maximum margin hyperplane, but it violates its margin

The value of C is an important issue in practice

[Scatter plot: the soft-margin solution on the noisy dataset, instances n° 1 to 10 in the (x1, x2) plane.]

SLIDE 24

NONLINEAR CLASSIFICATION

Kernel trick

SLIDE 25

Transformed feature space

Feature construction

By performing appropriate transformations of variables, we can make linearly separable a problem which was not linearly separable in the original representation space.

Original data:

n°   x1    x2    y
 1   4.0   7.0  -1
 2   7.0   8.0  -1
 3   5.5   6.0  -1
 4   6.0   7.0  -1
 5   7.5   6.5  -1
 6   5.5   5.0  +1
 7   4.0   6.0  +1
 8   7.0   5.5  +1
 9   8.5   6.0  +1
10   9.0   6.5  +1

Transformed data:

n°   z1      z2      y
 1   16.00   28.00  -1
 2   49.00   56.00  -1
 3   30.25   33.00  -1
 4   36.00   42.00  -1
 5   56.25   48.75  -1
 6   30.25   27.50  +1
 7   16.00   24.00  +1
 8   49.00   38.50  +1
 9   72.25   51.00  +1
10   81.00   58.50  +1

$$z_1 = x_1^2, \qquad z_2 = x_1 x_2$$

But explicitly multiplying intermediate variables in the database is expensive, with no guarantee that the transformation obtained will be effective.

[Scatter plots: the data in the original (x1, x2) space, and in the transformed (z1 = x1², z2 = x1·x2) space where the classes become linearly separable.]

SLIDE 26

Kernel functions

Applied to scalar product

The dot product between vectors has an important place in the calculations (dual form). SVM can take advantage of the "kernel" functions.

Let φ(x) be a transformation of the initial variables. With the dual form, to optimize the Lagrangian we compute the scalar product ⟨φ(xi), φ(xi')⟩ for each pair of instances (i, i'). Example:

$$\varphi(x_1, x_2) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)$$

We would have to handle 3 variables instead of 2: the calculations are more expensive, not to mention the storage of the additional variables.

We can find a function K(·), called a kernel function, such that

$$K(x_i, x_{i'}) = \langle \varphi(x_i), \varphi(x_{i'})\rangle$$

The main consequence is that we simply calculate the scalar product <xi,xi’>, and we transform the result with the Kernel function.

We handle only the 2 initial variables for calculations. But the algorithm fits the classifier in a 3 dimensional space!

SLIDE 27

Polynomial kernel

Examples

Dot product between two instances (vectors) u and v with the following values:

$$u = (4, 7), \quad v = (2, 5), \qquad \langle u, v\rangle = 4\times 2 + 7\times 5 = 43$$

Transformation (1)

$$\varphi(x) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)$$

$$\varphi(u) = (16,\; 39.6,\; 49), \quad \varphi(v) = (4,\; 14.1,\; 25), \qquad \langle\varphi(u), \varphi(v)\rangle = 1849$$

Corresponding kernel function (1)

$$K(u, v) = \langle u, v\rangle^2 = 43^2 = 1849$$

The results are equivalent. With K(·), we work in a higher-dimensional space without having to explicitly create the variables.

Transformation (2)

$$\varphi(x) = \left(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\right)$$

$$\varphi(u) = (1,\; 5.7,\; 9.9,\; 16,\; 49,\; 39.6), \quad \varphi(v) = (1,\; 2.8,\; 7.1,\; 4,\; 25,\; 14.1), \qquad \langle\varphi(u), \varphi(v)\rangle = 1936$$

Corresponding kernel function (2)

$$K(u, v) = \left(1 + \langle u, v\rangle\right)^2 = 44^2 = 1936$$

We work in a 5-dimensional space in this configuration.
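A quick numerical check of this equivalence (an illustration in numpy):

import numpy as np

u, v = np.array([4.0, 7.0]), np.array([2.0, 5.0])

def phi(x):                       # explicit transformation (1)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

print(phi(u) @ phi(v))            # 1849.0  (dot product in the transformed space)
print((u @ v) ** 2)               # 1849.0  (kernel trick: K(u, v) = <u, v>^2)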

SLIDE 28

Dual form

Including the kernel function K(·)

Dual form – Soft margin:

$$\max_{\alpha}\; L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\,K(x_i, x_{i'}) \quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\;\; 0 \leq \alpha_i \leq C\;\;\forall i$$

It is no longer possible to obtain an explicit classification function: we must use the support vectors to assign a class to unseen instances, i.e. we must store them (values and weights) for deployment (see PMML).

$$f(x) = \sum_{i\in S}\alpha_i y_i\,K(x, x_i) + \beta_0$$

β0 can be obtained from the Karush-Kuhn-Tucker (KKT) conditions, again using the kernel function (see page 12 and following).

SLIDE 29

Some kernel functions

The most popular functions in tools (e.g. Scikit-learn package for Python - SVC)

Setting the right value of the parameters is the key issue, including the "cost" parameter C.

Polynomial:
$$K(u, v) = \left(\gamma\,\langle u, v\rangle + \text{coef0}\right)^{\text{degree}}$$
With degree = 1, coef0 = 0 (and γ = 1), we have the "linear" kernel.

Gaussian radial basis function (RBF):
$$K(u, v) = \exp\left(-\gamma\,\|u - v\|^2\right)$$
If γ is not specified, the tools set γ = 1/p by default (p: number of variables).

Hyperbolic tangent (sigmoid):
$$K(u, v) = \tanh\left(\gamma\,\langle u, v\rangle + \text{coef0}\right)$$

The parameter names vary a little from one tool to another, but these conventions have been popularized by the famous LIBSVM package, included in several data mining tools (Scikit-Learn for Python, e1071 for R, Tanagra, etc.).
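Written out as plain functions, these kernels look as follows (a minimal sketch assuming the LIBSVM-style parameter names above):

import numpy as np

def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * np.dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return np.tanh(gamma * np.dot(u, v) + coef0)

u, v = np.array([4.0, 7.0]), np.array([2.0, 5.0])
print(poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=2))   # 1849.0, as on the previous slides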

SLIDE 30

CLASS MEMBERSHIP PROBABILITIES

Scores and probabilities

SLIDE 31

Class membership probability

Output of SVM = scores, but they are not calibrated

The output of the classification function f(x) is used to assign a class to an instance:

$$f(x) \geq 0 \Rightarrow \hat{y} = +1, \qquad f(x) < 0 \Rightarrow \hat{y} = -1$$

We need an indication of the credibility of the response.

[Scatter plot: (X1) V10 vs. (X2) V17 by (Y) CLASS, "negative" vs. "positive", with two points on the positive side at different distances from the boundary.]

The two points are assigned to the "positive" class, but one is more positive than the other!

|f(x)| is already a good indication: it ranks the individuals according to their level of "positivity" (e.g. scoring, targeting). In many areas, however, we need an estimate of the class membership probability (e.g. for interpretation, combination with a cost matrix, or comparison with the outputs of other methods).

SLIDE 32

Platt scaling

Maximum likelihood estimation

We use a sigmoid function to map f(x) into the interval [0, 1]:

$$P(Y = 1/x) = \frac{1}{1 + \exp[-f(x)]}$$

We can develop a more sophisticated solution by using a parameterized expression and estimating the values of the coefficients by maximum likelihood:

$$P(Y = 1/x) = \frac{1}{1 + \exp[-(a\,f(x) + b)]}$$

A logistic regression program can easily estimate the values of "a" and "b".

SLIDE 33

Platt scaling

An example

Let us revisit our first toy example (page 9), with beta.1 = 0.667, beta.2 = -0.667, beta.0 = -1.667. The estimated coefficients of the sigmoid are a = 1.32 and b = 0.02:

$$P(Y = 1/x) = \frac{1}{1 + \exp[-(1.32\,f(x) + 0.02)]}$$

n°   x1   x2    y     f(x)    P(y=1/x)
 1    1    3   -1   -3.000      0.019
 2    2    1   -1   -1.000      0.214
 3    4    5   -1   -2.333      0.045
 4    6    9   -1   -3.667      0.008
 5    8    7   -1   -1.000      0.214
 6    5    1   +1    1.000      0.792
 7    7    1   +1    2.333      0.957
 8    9    4   +1    1.667      0.902
 9   12    7   +1    1.667      0.902
10   13    6   +1    3.000      0.982

The class membership probabilities are consistent with the position of each point and its distance from the boundary (maximum margin hyperplane).

[Scatter plot: each instance labeled with its probability P(y=1/x) in the (x1, x2) plane.]
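The mapping can be checked directly (a small numpy illustration using the a and b values above):

import numpy as np

# decision values f(x) of the 10 instances, then the Platt sigmoid with a = 1.32, b = 0.02
f = np.array([-3.0, -1.0, -2.333, -3.667, -1.0, 1.0, 2.333, 1.667, 1.667, 3.0])
proba = 1.0 / (1.0 + np.exp(-(1.32 * f + 0.02)))
print(proba.round(3))   # 0.019 0.214 0.045 0.008 0.214 0.792 0.957 0.902 0.902 0.982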

SLIDE 34

FEATURE SELECTION

Detecting the relevant variables

SLIDE 35

Non-embedded approaches

Filter and wrapper

Approaches that do not explicitly use the properties of the learning algorithm.

Filter methods: the selection is done before, and independently of, the subsequent learning algorithm. It is often based on the concept of "correlation" in the broad sense. Pros: fast and generic. Cons: not connected to the characteristics of the subsequent learning method; nothing guarantees that the selected variables will be the right ones.

Wrapper methods: the classifier is used as a black box. We search (e.g. forward, backward) for the best subset of variables, the one that maximizes a performance criterion (e.g. the cross-validation error rate). Pros: the selection is directly related to a performance criterion. Cons: very computationally intensive, risk of overfitting, and the internal characteristics of the learning algorithm (e.g. the maximum margin for SVM) are not used.

SLIDE 36

Embedded approach

Maximum margin criterion

Measuring the contribution of a variable "xj" to the classifier, without having to explicitly rerun the learning process without "xj":

$$\nabla^{(j)} = \sum_{i\in S}\sum_{i'\in S}\alpha_i\alpha_{i'}y_iy_{i'}\left[K(x_i, x_{i'}) - K(x_i^{(j)}, x_{i'}^{(j)})\right]$$

The variable xj is "disabled" by setting its values to zero (x^(j) denotes an instance with xj zeroed). For a linear kernel, this is equivalent to testing the significance of the coefficient βj (is it significantly different from zero?).

Two main issues: measuring the contribution of the variable to the margin, and organizing an algorithm around this criterion [ABE, page 192]. A backward search detects the best subset of relevant variables:

1. Compute γ0, the initial margin with all the features.
2. Find j* whose contribution ∇(j*) (relative to ‖β‖²) is minimal, and put it aside.
3. Launch the learning without xj*, and compute the new margin γ.
4. If the margin has not decreased significantly (the reduction stays below the tolerance δ), remove xj*, set γ0 = γ and go to 2; otherwise STOP the search.

Two important notes:
• Removing a variable always reduces the margin. The question is: how significant is the reduction? If it is significant, we cannot remove the variable.
• δ is a parameter of the algorithm (the higher δ, the fewer variables we obtain).
SLIDE 37

MULTICLASS SVM

Extension to multiclass problems (K number of classes, K > 2)

SLIDE 38

Multiclass SVM

‘’One-against-rest’’ approach

The SVM approach is formally defined for binary problems, Y ∈ {+, −}. How can it be extended (simply) to K-class problems, Y ∈ {y1, …, yK}? The most popular approaches reduce the multiclass problem to multiple binary classification problems.

• Build K binary classifiers, each distinguishing one of the classes yk from the rest (one vs. rest, or one vs. all): Y' ∈ {yk → +1, the other classes → −1}.
• We obtain K classification functions fk(x).
• For the prediction, we pick the class with the highest score, i.e.

$$\hat{y} = \arg\max_{k} f_k(x)$$

This strategy is consistent with the maximum a posteriori (MAP) scheme.

• Properties: K learning processes to perform on the dataset.
• Pros: simplicity.
• Cons: we can artificially introduce a class imbalance in the construction of the individual models; and if the scores are not well calibrated, comparing the outputs of the classification functions is biased.
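A minimal sketch of this one-vs-rest strategy (an illustration on the iris data, not taken from the slides):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)             # K = 3 classes
classes = np.unique(y)

models = []
for k in classes:                             # one binary problem per class
    yk = np.where(y == k, 1, -1)              # class k vs. the rest
    models.append(SVC(kernel="linear", C=1.0).fit(X, yk))

scores = np.column_stack([m.decision_function(X) for m in models])
y_pred = classes[np.argmax(scores, axis=1)]   # pick the class with the highest score
print((y_pred == y).mean())                   # resubstitution accuracy of the combined model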

SLIDE 39

Multiclass SVM

One vs. one (pairwise) approach

• Build K(K−1)/2 classifiers, each distinguishing one pair of classes: Y' ∈ {yk → +1, yj → −1}. We obtain K(K−1)/2 classification functions fk,j(x).
• For the prediction, we use a voting system: the predicted class is the one with the maximum number of wins.

$$D_k(x) = \sum_{j=1,\, j\neq k}^{K} \operatorname{sign}\left[f_{k,j}(x)\right]$$

Dk(x) provides the number of votes ("#votes") for the class yk, knowing that fj,k(x) = −fk,j(x).

$$\hat{y} = \arg\max_{k} D_k(x)$$

We assign the class with the maximum number of votes.

• Note: when two or more classes have the same number of votes, we use the sum of the scores fk,j(x) and select the class corresponding to the maximum.
• Properties: the approach is computationally intensive, but each classifier is learned on a subset of the whole dataset.
• Pros: fewer class-imbalance problems; the scores are better calibrated.
• Cons: computing time when K increases (e.g. K = 10 ⇒ 45 classifiers to build).
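And a matching sketch of the one-vs-one voting scheme (an illustration; ties would be broken with the scores fk,j, as noted above):

from itertools import combinations
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
votes = np.zeros((len(X), len(classes)))

for a, b in combinations(classes, 2):          # K(K-1)/2 pairwise problems
    mask = np.isin(y, [a, b])                  # keep only the two classes
    clf = SVC(kernel="linear", C=1.0).fit(X[mask], y[mask])
    pred = clf.predict(X)                      # vote of this pairwise classifier
    for k in classes:
        votes[:, k] += (pred == k)

y_pred = classes[np.argmax(votes, axis=1)]     # class with the most votes
print((y_pred == y).mean())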
SLIDE 40

SVM IN PRACTICE

Tools, packages, settings (Python, R and Tanagra)

SLIDE 41

Python

scikit-learn – SVC

#importing the training set
import pandas
dtrain = pandas.read_table("ionosphere-train.txt",sep="\t",header=0,decimal=".")
print(dtrain.shape)
y_app = dtrain.as_matrix()[:,32]
X_app = dtrain.as_matrix()[:,0:32]

#importing the module
from sklearn.svm import SVC
svm = SVC()   #instantiating the object
#displaying the settings (default kernel "rbf")
#the variables are not standardized (scaled)
print(svm)

#learning process
svm.fit(X_app,y_app)

#importing the test set
dtest = pandas.read_table("ionosphere-test.txt",sep="\t",header=0,decimal=".")
print(dtest.shape)
y_test = dtest.as_matrix()[:,32]
X_test = dtest.as_matrix()[:,0:32]

#prediction on the test set
y_pred = svm.predict(X_test)

#measuring the test error rate: 0.07
from sklearn import metrics
err = 1.0 - metrics.accuracy_score(y_test,y_pred)
print(err)

SLIDE 42

#grid search tool
from sklearn.grid_search import GridSearchCV

#parameters to evaluate – modifying the kernel and the 'cost' parameter C
parametres = {"kernel":['linear','poly','rbf','sigmoid'],"C":[0.1,0.5,1.0,2.0,10.0]}

#the classifier to use
svmc = SVC()

#creating the grid search object
grille = GridSearchCV(estimator=svmc,param_grid=parametres,scoring="accuracy")

#launching the exploration
resultats = grille.fit(X_app,y_app)

#best settings: {'kernel': 'rbf', 'C': 10.0}
print(resultats.best_params_)

#prediction with the best classifier
ypredc = resultats.predict(X_test)

#error rate on the test set = 0.045 (!)
err_best = 1.0 - metrics.accuracy_score(y_test,ypredc)
print(err_best)

Python

scikit-learn - GridSearchCV

Scikit-learn provides a mechanism for searching for the optimal parameters through cross-validation. The test sample is not used during this search, so we can still use it to estimate the generalization error rate.

SLIDE 43

R

e1071 – svm() from LIBSVM

#importing the learning set
dtrain <- read.table("ionosphere-train.txt",header=T,sep="\t")
dtest <- read.table("ionosphere-test.txt",header=T,sep="\t")

#package "e1071"
library(e1071)

#learning process
#the variables are automatically scaled
m1 <- svm(class ~ ., data = dtrain)

#displaying
print(m1)

#prediction
y1 <- predict(m1,newdata=dtest)

#confusion matrix and error rate = 0.04
mc1 <- table(dtest$class,y1)
err1 <- 1 - sum(diag(mc1))/sum(mc1)
print(err1)

Compared with scikit-learn, the variables are automatically standardized. This is preferable in most cases.

SLIDE 44

R

e1071 – tune()

#grid search using cross-validation
set.seed(1000)   #to obtain the same results for each session
obj <- tune(svm, class ~ ., data = dtrain,
            ranges = list(kernel = c('linear','polynomial','radial','sigmoid'),
                          cost = c(0.1,0.5,1.0,2.0,10)),
            tunecontrol = tune.control(sampling="cross"))

#displaying
print(obj)

#build the classifier with the new parameters
m2 <- svm(class ~ ., data = dtrain, kernel='radial', cost = 2)

#displaying
print(m2)

#prediction
y2 <- predict(m2,newdata=dtest)

#confusion matrix – test error rate = 0.035
mc2 <- table(dtest$class,y2)
err2 <- 1 - sum(diag(mc2))/sum(mc2)
print(err2)

Like scikit-learn, e1071 also provides a tool for searching for the "optimal" settings.

SLIDE 45

Tanagra

SVM

The SVM component provides an explicit model when using a linear kernel.

The data have been merged into a single file, with an additional column indicating the type of the sample (train or test).

With a linear SVM, TANAGRA provides the coefficients βj. Test error rate = 0.165: clearly, the "linear" kernel is not suitable for these data.

SLIDE 46

Tanagra

C-SVC

C-SVC comes from the famous LIBSVM library

The reported characteristics are limited to the number of support vectors (as in R). With the same settings (rbf kernel, scale = FALSE, C = 10), we obtain exactly the same classifier as under Python (scikit-learn).

SLIDE 47

OVERVIEW

Pros and cons of SVM

SLIDE 48

SVM – Pros and cons

Pros

  • Ability to handle high dimensional datasets (high #variables)
  • Robust even if the ratio "#observations / #variables" is inverted
  • Efficient processing of nonlinear problems thanks to the kernel mechanism
  • Nonparametric
  • Robust against outliers (controlled with the parameter C)
  • The number of support vectors gives a good indication of the complexity of the problem at hand
  • Often effective compared with other approaches
  • The many adjustable parameters allow flexibility (e.g. linear vs. nonlinear, regularization, etc.)

Cons

  • Identifying the optimal values of the parameters is not obvious (SVM may be highly sensitive to them)
  • Difficulty in processing large datasets (#instances)
  • Problems when dealing with noisy labels (proliferation of the number of support vectors)
  • No explicit model when a nonlinear kernel is used
  • Interpretation of the classifier with a nonlinear kernel is difficult: it is hard to identify the influence of each descriptor
  • The handling of the multiclass problem remains an open issue
SLIDE 49

SVM - Extensions

Popularized for the classification task, SVMs can be applied to other kinds of problems:

  • semi-supervised learning (partially labeled data)
  • support vector regression (regression problem)
  • support vector clustering (clustering problem)

The SVM approach is still very close to research, with frequent new developments and improvements. These new variants are not always available in the usual tools.

Specific kernel functions are developed for particular domains (text mining, image mining, speech recognition, …). They must be adapted to the notion of similarity between observations in the domain.

SLIDE 50

REFERENCES

Bibliography, tutorials

SLIDE 51

References

[ABE] Abe S., "Support Vector Machines for Pattern Classification", Springer, 2010; the whole book, and especially chapters 2 and 3.
[BLU] Biernat E., Lutz M., "Data Science : fondamentaux et études de cas", Eyrolles, 2015; chapter 13.
[BIS] Bishop C.M., "Pattern Recognition and Machine Learning", Springer, 2006; chapter 7.
[CST] Cristianini N., Shawe-Taylor J., "Support Vector Machines and other kernel-based learning methods", Cambridge University Press, 2000.
[HTF] Hastie T., Tibshirani R., Friedman J., "The Elements of Statistical Learning – Data Mining, Inference and Prediction", Springer, 2009; chapters 4 and 12.
… and many course materials found on the web.

SLIDE 52

Tutorials

Chang C.-C., Lin C.-J., "LIBSVM: a library for support vector machines", ACM Transactions on Intelligent Systems and Technology, 2(27), p. 1-27, 2011. The LIBSVM library is available in various data mining software.
Tanagra Tutorial, "Implementing SVM on large dataset", July 2009; comparison of various tools (Tanagra, Orange, RapidMiner, Weka) on a high dimensional dataset (31809 descriptors).
Tanagra Tutorial, "SVM using the LIBSVM library", November 2008; using the LIBSVM library from Tanagra.
Tanagra Tutorial, "CVM and BVM from the LIBCVM library", July 2012; extension of LIBSVM, this library can handle large datasets (high number of instances).
Tanagra Tutorial, "Support Vector Regression", April 2009; SVM in the regression context under Tanagra and R (e1071).