Elements - Chapter 12 - SVM Henry Tan Georgetown University April - PowerPoint PPT Presentation

Elements - Chapter 12 - SVM Henry Tan Georgetown University April 13, 2015 Georgetown University SVM 1

Introduction to Support Vector Machines General Idea We want to be able to classify inputs into one of 2 classes. It’s all about how you phrase the question, not how you solve it. First Steps - Support Vector Classification Solve a linearly separable problem (no overlap) using linear programming. Separating hyperplane must be a “flat” space. Extend to SVM Solve non-separable case by allowing some slack in the constraints. Non-linear separation using basis expansion and Kernel functions Georgetown University SVM 2

Support Vectors - Linear, Fully Separable Georgetown University SVM 3

Hyperplane Separation Training data - N pairs ( x 1 , y 1 ) ... ( x N , y N ) p dimensions - x i ∈ R p Class - y i ∈ {− 1 , 1 } { x : f ( x ) = x T β + β 0 = 0 } (12.1) || β || = 1 and some constant β 0 A straight line in 2D, a flat plane in 3D... Note: Equation numbering follows Elements print 10 pdf Georgetown University SVM 4

Hyperplane Separation 2 f ( x ) = x T β + β 0 gives the signed distance from point x to the hyperplane. Classification Rule Given the parameters of a hyperplane β, β 0 we can plug in any observation x i and get which ‘side’ of the plane it is on. G ( x ) = sign [ x T β + β 0 ] (12.2) Georgetown University SVM 5

Hyperplane Separation 3 f ( x ) = x T β + β 0 gives the signed distance from point x to the hyperplane. Since we assume that the classes are linearly separable, we know that there must exist a separating hyperplane, i.e., ∃ f ( x ) = x T β + β 0 such that y i f ( x i ) > 0 ∀ i Optimisation problem Find the hyperplane with the largest margin M between training points β,β 0 , || β || =1 M max (12.3) subject to y i ( x T i β + β 0 ) ≥ M , i = 1 , ..., N Georgetown University SVM 6

Hyperplanes for Non-Separable Case Question Grace What is the intuition/physical meanings of using the slack variables? Georgetown University SVM 7

Hyperplanes for Non-Separable Case For Non-Separable Classes Allow some points to be on the wrong side of the margin. Define slack variables ξ = ( ξ 1 , ..., ξ N ) y i ( x T i β + β 0 ) ≥ M (1 − ξ i ) ∀ i (12.6) Some observations are allowed to be on the wrong side of the margin, but we still attempt to maximize the margin. Georgetown University SVM 8

Hyperplanes for Non-Separable Case 2 More Constraints Slack must be positive - ξ i ≥ 0 N � Total slack is bound by some constant - ξ i ≤ k i =1 If ξ i > 1, that training sample is considered misclassified in the solution Note There is another way to introduce the slack but it leads to a nonconvex problem (I’m not too sure why). Georgetown University SVM 9

Hyperplanes for Non-Separable Case 3 The norm constraint on β can be dropped and M set to 1 / || β || to get the equivalent formulation min || β || y i ( x T subject to i β + β 0 ) ≥ 1 − ξ i ∀ i (12.7) � and ξ i ≥ 0 , ξ i ≤ constant We can see that correctly classified points far from the boundary, i.e. y i ( x T i β + β 0 ) = y i f ( x i ) > 1, do not matter in the constraints and therefore do not affect the solution. Georgetown University SVM 10

Lagrange and his equations Questions Yifang Could you walk us through the Lagrange function reduction, from 12.9 to 12.17? Sicong I am not clear about the Lagrange function in section 12.2, can you make some detail illustrations to it? Yuankai What is Lagrange function and how is it used in margin-based methods? Georgetown University SVM 11

Lagrange and his equations Questions Yifang Could you walk us through the Lagrange function reduction, from 12.9 to 12.17? Sicong I am not clear about the Lagrange function in section 12.2, can you make some detail illustrations to it? Yuankai What is Lagrange function and how is it used in margin-based methods? Disclaimer I don’t properly know Lagrange functions and the duals and all that. The following is what I could figure out in the last few days. Georgetown University SVM 11

A Detour - The Lagrange Function 1 Minimum distance to travel from M → P → C Note that the gradient of the ellipse at the solution is the same as the gradient of P at the solution. 1 Source - http://www.slimy.com/~steuard/teaching/tutorials/Lagrange.html Georgetown University SVM 12

The Lagrange Function - Lagrange Multipliers 2 Consider the following problem - f ( x , y ) = x 2 + y 2 minimize subject to the constraint g ( x , y ) = x + y − 2 = 0 1 Readings - http://www.cs.cmu.edu/~ggordon/lp.pdf Georgetown University SVM 13

The Lagrange Function - Lagrange Multipliers 2 Gradient of the objective function is a multiple of the gradient of the constraint. This can be re-stated as a set of simultaneous equations g ( x , y ) = 0 ← From Original Constraint ∇ f ( x , y ) = α ∇ g ( x , y ) ← New α is called the Lagrange multiplier . Georgetown University SVM 14

The Lagrange Function The Lagrangian This can be restated in a compact form as the Lagrangian L L ( x , y , α ) = f ( x , y ) + α g ( x , y ) where the equations are ∇ L = 0 Multiple Constraints → Multiple (Independent) Lagrange multipliers Minimize f ( x ) with constraints g i ( x ) = 0 for 1 ≤ i ≤ N yields the Lagrangian - L ( x , α ) = f ( x ) + � α i g i ( x ) 1 ≤ i ≤ N with lagrange multipliers α = ( α 1 , ..., α N ) Georgetown University SVM 15

The Lagrange Function - Inequality Constraints Theorem: Solution is at a Saddle Point The solution, if it exists, is one where the Lagrangian cannot be decreased further by changing the original variables, or increased by changing the multipliers. Inequality Constraints? Previously, all the constraints were equalities. To deal with the constraint g i ( x ) ≥ 0, we set p i ≤ 0 or g i ( x ) ≤ 0 → p i ≥ 0. This follows from the above theorem. Georgetown University SVM 16

Back onto linear SVM - The Lagrangian Reformulating 12.7 The previous equation can be converted into the following form N 1 2 || β || 2 + C � min ξ i (12.8) β,β 0 i =1 y i ( x T subject to ξ i ≥ 0 , i β + β 0 ) ≥ 1 − ξ i ∀ i Intuition Previously, we had the constraint � ξ i ≤ constant . Small || β || 2 → large margin. This means that more slack is required. Instead of bounding the total slack by a constant,12.8 minimizes || β || 2 , i.e., maximises the margin, while minimizing the slack. Georgetown University SVM 17

The Lagrangian Primal Form of SVM General Form min f ( x ) with constraints g i ( x ) = 0 for 1 ≤ i ≤ N � yields the Lagrangian - L ( x , α ) = f ( x ) + α i g i ( x ) 1 ≤ i ≤ N Original SVM Optimization Problem N 2 || β || 2 + C 1 � min ξ i β,β 0 i =1 y i ( x T subject to ξ i ≥ 0 , i β + β 0 ) ≥ 1 − ξ i ∀ i yields the lagrangian N N N L P = 1 2 || β || 2 + C � � � α i [ y i ( x T ξ i − i β + β 0 ) − (1 − ξ i )] − µ i ξ i (12.9) i =1 i =1 i =1 with lagrangian multipliers α i , µ i for 1 ≤ i ≤ N . Georgetown University SVM 18

SVM Lagrangian Breakdown Previously Note that from earlier, the Lagrangian is mainly a compact representation and what we actually want to solve is ∇ L = 0. These yield - N ∂ L p � ∂β = 0 = β − α i y i x i (12.10) i =1 N ∂ L p � = 0 = α i y i (12.11) ∂β 0 i =1 ∂ L p = 0 = C − µ i − α i ∀ i (12.12) ξ i And positivity constraints α i , µ i , ξ i ≥ 0 from previous constraints, or because the previous constraints were inequalities. Georgetown University SVM 19

SVM Lagrangian Breakdown - Wolfe Dual Substituting the previous equations into the Lagrangian yields the Wolfe Dual objective function. N N β = � α i y i x i , 0 = � α i y i , α i = C − µ i ∀ i i =1 i =1 Detailed workings done on the board if necessary N N N L P = 1 2 || β || 2 + C � � α i [ y i ( x T � ξ i − i β + β 0 ) − (1 − ξ i )] − µ i ξ i (12.9) i =1 i =1 i =1 N N 1 2 || β || 2 = 1 � � α i α j y i y j x T i x j 2 i =1 j =1 N N N � � � C ξ i − µ i ξ i = α i ξ i i =1 i =1 i =1 N N N N N � α i [ y i ( x T � � α i α j y i y j x T � � − i β + β 0 ) − (1 − ξ i )] = − i x j − α i ξ i + α i i =1 i =1 j =1 i =1 i =1 Georgetown University SVM 20

SVM Lagrangian Breakdown - Wolfe Dual 2 Putting the pieces together Simply adding it up yields- N N N α i − 1 � � � α i α j y i y j x T L D = i x j 2 i =1 i =1 i = j This also provides a lower bound to the objective function. However, why it yields the Wolfe Dual, or that it provides a lower bound I do not know. Georgetown University SVM 21

SVM Lagrangian Breakdown - Karush-Kuhn-Tucker Question Jiyun Why the Karush-Kuhn-Tucker conditions includes the constraints 12.14-12.16? Grace What are the intuition/physical meanings of the KKT conditions (formulas 12.14-12.16)? Georgetown University SVM 22

Elements - Chapter 12 - SVM Henry Tan Georgetown University April - PowerPoint PPT Presentation

Elements - Chapter 12 - SVM Henry Tan Georgetown University April 13, 2015 Georgetown University SVM 1 Introduction to Support Vector Machines General Idea We want to be able to classify inputs into one of 2 classes. Its all about how

SVM-flexible discriminant analysis Huimin Peng November 20, 2014 Outline SVM Nonlinear SVM =

Overview SVM theoretical framework ORACLE data mining technology SVM parameter

SVM on Intel Graphics Jesse Barnes Intel Open Source Technology Center 1 What is SVM?

Linear, Binary SVM Classifiers COMPSCI 371D Machine Learning COMPSCI 371D Machine

Machine Learning Theory CS 446 1. SVM risk SVM risk Consider the empirical and true/population

Lecture 5: SVM II Princeton University COS 495 Instructor: Yingyu Liang Review: SVM objective

Fitting SVM models in Matlab mdl = fitcsvm(X,y) fit a classifier using SVM X is a

An SVM- -based Masquerade Detection based Masquerade Detection An SVM Method with Online Update

SVM Kernels COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning SVM Kernels 1 /

Classication SVM algorithms with interval-valued training data using triangular and

Eye-blink Detection Based on SVM Wang Xiaoxing Shanghai Jiao Tong University figure1

Chapter 16 Chapter 16 The Elements: The he Elements: The d -Block -Block The d -Block

Elements of Future COP Elements of Future COP Elements of Future COP Elements of Future COP

Living organisms are composed of about 25 chemical elements Most Common Elements In the Human

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 11/27/2006 Chapter 13

SVM Learning of IP Address Structure for Latency Prediction Rob Beverly, Karen Sollins and Arthur

Optimization and Simulation Constrained optimization Michel Bierlaire Transport and Mobility

AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality

First order gravity on the light front Sergei Alexandrov Laboratoire Charles Coulomb

Variational Principles and Constraints in Continuum Dynamics Giovanni Romano, Raffaele Barretta

15.082J and 6.855J and ESD.78J Lagrangian Relaxation 2 Applications Algorithms

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 10 Slides adapted from

UNIT 2 Non-Linear Programming A NLP problem An engineering factory makes 4 products (PROD1 to

Nouveaux d eveloppements sur FreeFem++ et HPDDM Fr ed eric Nataf Laboratory J.L. Lions