
Notes on Support Vector Machines, COMP24111

Tingting Mu

tingtingmu@manchester.ac.uk
School of Computer Science
University of Manchester
Manchester M13 9PL, UK

We start by introducing the notation used in these notes. We denote a set of training samples by {(x1, y1), (x2, y2), ..., (xN, yN)}. The column vector xi ∈ R^d denotes the d-dimensional feature vector of the i-th training sample, which corresponds to a data point in a d-dimensional space. The scalar yi ∈ {−1, +1} denotes the class label of the i-th training sample, where +1 represents the positive class and −1 the negative class. Throughout, we focus on the binary classification problem.

1. Support Vector Machines


Figure 1: An illustration of the separation margin in a 2-dimensional space.

A support vector machine (SVM) classifier (Cortes and Vapnik, 1995; Cristianini and Shawe-Taylor, 2000) finds a hyperplane that separates the two classes of data points with the widest separation margin. To find this separating hyperplane means to decide its direction w ∈ R^d and its bias b ∈ R, such that w^T x + b = 0. The separation margin is the region bounded by the two parallel hyperplanes w^T x + b = +1 and w^T x + b = −1, and its width is computed by

    margin = 2ρ = 2 / ∥w∥_2 = 2 / √(w^T w).    (1)

Figure 1 illustrates a separating hyperplane, the two parallel hyperplanes associated with it, and its separation margin. In addition to margin maximisation, the SVM classifier attempts to prevent the two classes of data points from falling into the margin. To achieve this, it forces w^T x + b ≥ +1 for a sample (x, +1) from the positive class, and w^T x + b ≤ −1 for a sample (x, −1) from the negative class. This is equivalent to forcing y(w^T x + b) ≥ 1 for any given sample (x, y), where y ∈ {−1, +1}.

1.1 Hard Margin SVM

Maximising the separation margin in Eq. (1) is equivalent to minimising the term w^T w. Enforcing that all the training samples stay outside the margin is equivalent to setting the constraints yi(w^T xi + b) ≥ 1 for i = 1, 2, ..., N. Together, this gives the following constrained optimisation problem to solve:

    min_{w ∈ R^d, b ∈ R}  (1/2) w^T w,    (2)
    subject to  yi(w^T xi + b) ≥ 1,  for i = 1, 2, ..., N.    (3)

Eq. (2) is referred to as the objective function of the optimisation problem. The purpose of introducing the scaling factor 1/2 in Eq. (2) is to obtain a neat gradient form (w instead of 2w). SVM training refers to the process of finding the optimal values of w and b. The solution gives us the hard margin SVM classifier, which is also called the optimal margin classifier.
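To make the training problem concrete, below is a minimal sketch that solves Eqs. (2)-(3) numerically with the cvxpy modelling library on a hypothetical toy dataset (the library choice and the data are this sketch's assumptions, not part of the notes; any QP solver would do):

    import cvxpy as cp
    import numpy as np

    # Hypothetical linearly separable toy data: N = 4 samples, d = 2 features.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    w = cp.Variable(2)   # direction of the separating hyperplane
    b = cp.Variable()    # bias

    # Eq. (2): minimise (1/2) w^T w; Eq. (3): y_i (w^T x_i + b) >= 1.
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                         [cp.multiply(y, X @ w + b) >= 1])
    problem.solve()

    print("w* =", w.value, " b* =", b.value)
    print("margin =", 2.0 / np.linalg.norm(w.value))   # Eq. (1)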

1.2 1-Norm Soft Margin SVM

When processing real-world data, we often need to deal with non-separable data patterns (see Figure 2 for an example), for which it is impossible to find a hyperplane that separates the two classes perfectly. In this situation, it is helpful to allow the margin constraints in Eq. (3) to be violated.

We use the slack variable ξi ≥ 0 to measure the deviation from the ideal situation yi(w^T xi + b) ≥ 1 for the i-th training sample. A set of more relaxed constraints is used instead:

    yi(w^T xi + b) ≥ 1 − ξi,  for i = 1, 2, ..., N.    (4)

When 0 < ξi < 1, the corresponding sample point is allowed to fall within the margin region, but still has to be on the right side of the decision boundary. When ξi = 1, the corresponding point is allowed to lie on the decision boundary. When ξi > 1, the point is allowed to stay on the wrong side of the decision boundary.

Although we relax the constraints to allow Eq. (3) to be violated, the violation, measured by the magnitude of ξi (i.e., ∣ξi∣), should be small to maintain good accuracy. For instance, it is not wise to build a classifier that allows most training samples to fall on the wrong side of the separating hyperplane. Therefore, we minimise the term ∑_{i=1}^{N} ∣ξi∣, which is equal to ∑_{i=1}^{N} ξi since all the slack variables are non-negative, together with the margin term (1/2) w^T w. The modified constrained optimisation problem becomes

    min_{w ∈ R^d, b ∈ R, ξ ∈ R^N}  (1/2) w^T w + C ∑_{i=1}^{N} ξi,    (5)
    subject to  yi(w^T xi + b) ≥ 1 − ξi,  for i = 1, 2, ..., N,    (6)
                ξi ≥ 0,  for i = 1, 2, ..., N,    (7)

where C > 0 is called the regularisation parameter. Let the column vector ξ = [ξ1, ξ2, ..., ξN]^T store the slack variables; we refer to it as the slack vector. The second term in the objective function in Eq. (5) can then be expressed via the l1-norm as C∥ξ∥1. This modified SVM is accordingly named the 1-norm soft margin SVM, and is often referred to as the l1-SVM for short.
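As a practical aside (the editor's addition, not covered in the notes): the l1-SVM in Eqs. (5)-(7) with a linear kernel corresponds to scikit-learn's SVC class, whose parameter C plays exactly the role of the regularisation parameter above. A minimal sketch on hypothetical toy data:

    import numpy as np
    from sklearn.svm import SVC

    # Hypothetical non-separable toy data: the third positive sample
    # sits among the negative ones, so some slack is unavoidable.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.5, -1.5],
                  [-2.0, -2.0], [-3.0, -1.0], [-1.0, -1.0]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # Larger C penalises the slack sum in Eq. (5) more heavily,
    # trading margin width for fewer margin violations.
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print("w* =", clf.coef_[0], " b* =", clf.intercept_[0])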

Figure 2: Illustration of a case of non-separable data patterns.

1.3 Support Vectors

The training samples that (1) lie along one of the two parallel hyperplanes, or (2) fall within the margin, or (3) stay on the wrong side of the separating hyperplane are more challenging to classify. They contribute more significantly to determining the direction and position of the separating hyperplane, and are called support vectors (see the sketch after this section). The other samples, which not only stay on the right side of the separating hyperplane but also stay outside the margin, are called non-support vectors; they have less effect on the position of the separating hyperplane.
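Continuing the sketch above, a fitted scikit-learn model exposes which training samples ended up as support vectors, so the distinction made in this section can be inspected directly (clf and y are the objects from the previous sketch):

    # Indices and coordinates of the support vectors of the fitted model.
    print("support vector indices:", clf.support_)
    print("support vectors:", clf.support_vectors_)
    print("count per class:", clf.n_support_)

    # Non-support vectors are simply all the remaining training samples.
    non_sv = np.setdiff1d(np.arange(len(y)), clf.support_)
    print("non-support vector indices:", non_sv)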
2. Dual Problem of SVM (Optional Reading)

An SVM classifier can be trained by solving a quadratic programming (QP) problem. This QP problem is called the dual problem of an SVM. A QP problem is a special type of optimisation problem, for which many sophisticated solution approaches exist. Its objective function is a quadratic function of multiple input variables, and its constraint functions are linear functions of these variables. We will explain in this section how to derive this dual problem by using the method of Lagrange multipliers.

2.1 Lagrange Duality

We start by introducing some background on how to solve a constrained optimisation problem using the method of Lagrange multipliers. A general way to describe a constrained optimisation problem is

    min_{x ∈ R^n}  f(x),    (8)
    subject to  gi(x) ≤ 0,  for i = 1, 2, ..., K,    (9)
                hi(x) = 0,  for i = 1, 2, ..., L.    (10)

The above expression means that we would like to find the optimal value of the vector x so that the value of the function f(x): R^n → R is minimised, while the optimal x has to satisfy a total of K different inequality constraints {gi(x) ≤ 0}_{i=1}^{K} and a total of L different equality constraints {hi(x) = 0}_{i=1}^{L}. Here, gi(x): R^n → R and hi(x): R^n → R are constraint functions, which take the same input variables as f(x).

An effective way of solving the constrained optimisation problem in Eqs. (8)-(10) is the method of Lagrange multipliers. It first constructs a Lagrangian function by adding the constraint functions to the objective function, resulting in

    L(x, {λi}_{i=1}^{K}, {βi}_{i=1}^{L}) = f(x) + ∑_{i=1}^{K} λi gi(x) + ∑_{i=1}^{L} βi hi(x),    (11)

where we call {λi}_{i=1}^{K} and {βi}_{i=1}^{L} Lagrange multipliers. Storing the Lagrange multipliers in two column vectors such that λ = [λ1, λ2, ..., λK]^T and β = [β1, β2, ..., βL]^T, the Lagrangian function is denoted by L(x, λ, β).

The original constrained optimisation problem in Eqs. (8)-(10) is called the primal problem. To assist in solving it, a dual problem is posed using the above Lagrangian function. It is given as

    max_{λ ∈ R^K, β ∈ R^L}  O(λ, β),    (12)
    subject to  λi ≥ 0,  for i = 1, 2, ..., K,    (13)

where

    O(λ, β) = min_{x ∈ R^n} L(x, λ, β).    (14)

Under certain assumptions² on the objective and constraint functions f, {gi}_{i=1}^{K} and {hi}_{i=1}^{L}, there must exist a setting of x, λ and β, denoted by x∗, λ∗ and β∗, such that x∗ is the optimal solution of the primal problem in Eqs. (8)-(10), {λ∗, β∗} is the optimal solution of the dual problem in Eqs. (12)-(13), and x∗ is also the optimal solution of Eq. (14).

2. We do not discuss details of the assumptions. It is proven that the objective and constraint functions of the SVM problems satisfy these assumptions.

This means x∗, λ∗ and β∗ need to satisfy the following optimality conditions:

    ∂/∂xi L(x∗, λ∗, β∗) = 0,  for i = 1, 2, ..., n  (stationarity),    (15)
    hi(x∗) = 0,  for i = 1, 2, ..., L  (primal feasibility),    (16)
    gi(x∗) ≤ 0,  for i = 1, 2, ..., K  (primal feasibility),    (17)
    λ∗i ≥ 0,  for i = 1, 2, ..., K  (dual feasibility),    (18)
    λ∗i gi(x∗) = 0,  for i = 1, 2, ..., K  (complementary slackness).    (19)

These conditions are called the Karush-Kuhn-Tucker (KKT) conditions, where Eq. (15) is known as the stationarity condition, Eqs. (16) and (17) the primal feasibility, Eq. (18) the dual feasibility, and Eq. (19) the complementary slackness. If you can find some x, λ and β that satisfy the KKT conditions, then these must be solutions to the primal and dual problems.

To summarise, to solve a constrained optimisation problem using the method of Lagrange multipliers, its Lagrangian function is first constructed based on Eq. (11). Then, the dual problem is constructed based on Eqs. (12)-(14). During the optimisation, the KKT conditions are examined.

2.2 Dual Problem of Hard Margin SVM

2.2.1 Derive the Dual Problem

We derive the dual problem for the hard margin SVM in Eqs. (2)-(3) as an exercise. The variables to be optimised for the hard margin SVM are stored in the vector [w; b], constituting a (d + 1)-dimensional vector. The objective function is (1/2) w^T w. There are a total of K = N inequality constraint functions, one formed for each training sample, given as

    1 − yi(w^T xi + b) ≤ 0.    (20)

There is no equality constraint in the SVM problem. You could also treat this as setting all the equality constraint functions to zero: hi(x) = 0. Following Eq. (11), we define the Lagrangian function as

    L(w, b, {λi}_{i=1}^{N}) = (1/2) w^T w + ∑_{i=1}^{N} λi [1 − yi(w^T xi + b)].    (21)

Next, we show how to construct the dual problem. We first compute the objective function O({λi}_{i=1}^{N}) for the dual problem according to Eq. (14). We have learned that the minimum of a function is reached when its gradient reaches zero. Therefore, to minimise the Lagrangian function with respect to w and b, we first calculate the gradient. The Lagrangian function can be re-written as

    L(w, b, {λi}_{i=1}^{N}) = (1/2) w^T w − w^T (∑_{i=1}^{N} λi yi xi) + ∑_{i=1}^{N} λi (1 − b yi).    (22)

It includes one quadratic function (1/2) w^T w, one linear function w^T (∑_{i=1}^{N} λi yi xi), and one function that is constant with respect to w. Applying the gradient calculation rules for linear and quadratic functions as shown in the maths notes, we have

    ∂/∂w L(w, b, {λi}_{i=1}^{N}) = w − ∑_{i=1}^{N} λi yi xi.    (23)

Since the only term in the Lagrangian function that is related to b is −b ∑_{i=1}^{N} λi yi, we have

    ∂/∂b L(w, b, {λi}_{i=1}^{N}) = − ∑_{i=1}^{N} λi yi.    (24)

Setting the gradient to zero, so that ∂/∂w L(w, b, {λi}_{i=1}^{N}) = 0 and ∂/∂b L(w, b, {λi}_{i=1}^{N}) = 0, corresponds to the stationarity condition in Eq. (15), and it results in

    w = ∑_{i=1}^{N} λi yi xi,    (25)
    ∑_{i=1}^{N} λi yi = 0.    (26)

Replacing w with ∑_{i=1}^{N} λi yi xi and replacing ∑_{i=1}^{N} λi yi with zero in the Lagrangian function in Eq. (21), the optimised Lagrangian function with respect to w and b is obtained:

    O({λi}_{i=1}^{N}) = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λi λj yi yj xi^T xj − ∑_{i=1}^{N} ∑_{j=1}^{N} λi λj yi yj xi^T xj + ∑_{i=1}^{N} λi
                      = −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λi λj yi yj xi^T xj + ∑_{i=1}^{N} λi.    (27)

This is a function of the Lagrange multipliers, and it is the objective function of the dual problem of the hard margin SVM. According to the dual feasibility condition in Eq. (18), each λi needs to be non-negative. Also, Eq. (26) needs to hold. Subsequently, the complete formulation of the dual problem to be solved for the hard margin SVM is given as

    max_{{λi}_{i=1}^{N}}  −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} λi λj yi yj xi^T xj + ∑_{i=1}^{N} λi,    (28)
    subject to  λi ≥ 0,  for i = 1, 2, ..., N,    (29)
                ∑_{i=1}^{N} λi yi = 0.    (30)

This is a QP problem and its solution will be discussed later.
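As a sketch (cvxpy again assumed, with the toy data of the earlier primal sketch), the dual problem in Eqs. (28)-(30) can be solved directly. A convenient reformulation: the quadratic term equals ∥∑_{i} λi yi xi∥_2², which keeps the model convex by construction:

    import cvxpy as cp
    import numpy as np

    # Same hypothetical separable toy data as in the primal sketch.
    X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    N = len(y)

    lam = cp.Variable(N)
    # Eq. (28): sum of multipliers minus half the squared norm of sum_i lam_i y_i x_i.
    objective = cp.Maximize(cp.sum(lam)
                            - 0.5 * cp.sum_squares(X.T @ cp.multiply(lam, y)))
    constraints = [lam >= 0,                          # Eq. (29)
                   cp.sum(cp.multiply(lam, y)) == 0]  # Eq. (30)
    # For the l1-SVM dual in Eqs. (48)-(50), one would simply add: lam <= C.
    cp.Problem(objective, constraints).solve()
    print("lambda* =", np.round(lam.value, 4))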

2.2.2 From Complementary Slackness Condition to Support Vectors

Additionally, the complementary slackness condition λ∗i [1 − yi(xi^T w∗ + b∗)] = 0 needs to be satisfied for each training sample. This gives rise to the concept of the support vector. After the training, the training samples that are assigned non-zero Lagrange multipliers must satisfy 1 − yi(xi^T w∗ + b∗) = 0. These sample points lie along the two parallel hyperplanes, and are called support vectors. The remaining training samples, with λ∗i = 0, can have nonzero values of 1 − yi(xi^T w∗ + b∗); more specifically, yi(xi^T w∗ + b∗) > 1 according to Eq. (20). These sample points stay outside the margin and are called non-support vectors.

2.2.3 From Dual Solution to SVM Decision Function

The primal feasibility condition in Eq. (17) tells us yi(w^T xi + b) ≥ 1. Therefore, among the training samples from the positive class, the one possessing the lowest prediction function value, which is found by

    I+ = arg min_{yi=+1} xi^T w∗ + b∗,    (31)

should be the closest to the upper parallel hyperplane and should satisfy

    x_{I+}^T w∗ + b∗ = +1.    (32)

Similarly, among the training samples from the negative class, the one possessing the highest prediction function value, which is found by

    I− = arg max_{yi=−1} xi^T w∗ + b∗,    (33)

should be the closest to the lower parallel hyperplane and should satisfy

    x_{I−}^T w∗ + b∗ = −1.    (34)

The optimal value of b can then be approximated by adding Eqs. (32) and (34), which gives

    b∗ = − (x_{I+}^T w∗ + x_{I−}^T w∗) / 2.    (35)

An equivalent expression of the above b∗ is

    b∗ = − (min_{yi=+1} xi^T w∗ + max_{yi=−1} xi^T w∗) / 2.    (36)

Incorporating Eq. (25), the final prediction function of the SVM is computed by

    ŷ = x^T w∗ + b∗
      = ∑_{i=1}^{N} λ∗i yi x^T xi − (min_{yi=+1} xi^T w∗ + max_{yi=−1} xi^T w∗) / 2
      = ∑_{i=1}^{N} λ∗i yi x^T xi − (min_{yk=+1} ∑_{i=1}^{N} λ∗i yi xk^T xi + max_{yk=−1} ∑_{i=1}^{N} λ∗i yi xk^T xi) / 2.    (37)

The above equation shows how the optimal solution {λ∗i}_{i=1}^{N} of the QP problem can be used to build a hard margin SVM classifier.
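Continuing the dual sketch above, Eqs. (25), (36) and (37) translate into a few lines; the sign of ŷ gives the predicted class (w_star and predict are hypothetical names for this sketch):

    # Eq. (25): recover the hyperplane direction from the dual solution.
    w_star = X.T @ (lam.value * y)

    # Eq. (36): bias from the closest positive and closest negative samples.
    scores = X @ w_star
    b_star = -(scores[y == 1].min() + scores[y == -1].max()) / 2.0

    # Eq. (37): prediction function for a new point x.
    def predict(x):
        return np.sign(x @ w_star + b_star)

    print("w* =", w_star, " b* =", b_star)
    print("prediction for [2.5, 2.0]:", predict(np.array([2.5, 2.0])))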


2.3 Dual Problem of 1-Norm Soft Margin SVM (l1-SVM)

2.3.1 Derive the Dual Problem

In this section, we show how to derive the dual problem for the l1-SVM. The variables to be optimised are stored in the vector [w; ξ; b], constituting a (d + N + 1)-dimensional vector. The objective function is (1/2) w^T w + C∥ξ∥1. There are a total of K = 2N inequality constraint functions, where two inequality constraints are required for each of the N training samples:

    1 − ξi − yi(w^T xi + b) ≤ 0,    (38)
    −ξi ≤ 0.    (39)

Similar to the hard margin SVM, there is no equality constraint. We first construct the Lagrangian function by following Eq. (11):

    L(w, ξ, b, {αi}_{i=1}^{N}, {βi}_{i=1}^{N}) = (1/2) w^T w + C∥ξ∥1 + ∑_{i=1}^{N} αi [1 − ξi − yi(w^T xi + b)] + ∑_{i=1}^{N} βi (−ξi).    (40)

Here, a total of 2N Lagrange multipliers {αi ≥ 0, βi ≥ 0}_{i=1}^{N} are used, since there are a total of 2N inequality constraints. The vectors α = [α1, α2, ..., αN]^T and β = [β1, β2, ..., βN]^T are used to store the Lagrange multipliers.

Next we construct the dual problem. Because the objective function of the dual problem is obtained by minimising the Lagrangian function with respect to [w; ξ; b], we need to compute the gradient of the Lagrangian function, which is given by

    ∂/∂w L(w, ξ, b, {αi}_{i=1}^{N}, {βi}_{i=1}^{N}) = w − ∑_{i=1}^{N} αi yi xi,    (41)
    ∂/∂ξi L(w, ξ, b, {αi}_{i=1}^{N}, {βi}_{i=1}^{N}) = C − (αi + βi),  for i = 1, 2, ..., N,    (42)
    ∂/∂b L(w, ξ, b, {αi}_{i=1}^{N}, {βi}_{i=1}^{N}) = − ∑_{i=1}^{N} αi yi.    (43)

Here, we do not explain the gradient calculation in detail, as it follows a similar procedure to that of the hard margin SVM. Setting the gradient to zero, we have

    w = ∑_{i=1}^{N} αi yi xi,    (44)
    C = αi + βi,  for i = 1, 2, ..., N,    (45)
    ∑_{i=1}^{N} αi yi = 0.    (46)

Replacing w with ∑_{i=1}^{N} αi yi xi, replacing αi + βi with C, and replacing ∑_{i=1}^{N} αi yi with zero, the formulation of the optimised Lagrangian function with respect to w, ξ and b is obtained:

    O(α, β) = −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj xi^T xj + ∑_{i=1}^{N} αi.    (47)

Eq. (45) indicates that the values of αi and βi cannot exceed C, because the Lagrange multipliers αi and βi are non-negative. This is used as an extra constraint for each Lagrange multiplier. The final dual problem of the l1-SVM is given by

    max_{{αi}_{i=1}^{N}}  −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj xi^T xj + ∑_{i=1}^{N} αi,    (48)
    subject to  0 ≤ αi ≤ C,  for i = 1, 2, ..., N,    (49)
                ∑_{i=1}^{N} αi yi = 0.    (50)

Comparing the dual problem of the l1-SVM in Eqs. (48)-(50) and the dual problem of the hard margin SVM in Eqs. (28)-(30), it is interesting to see that they actually share the same objective function. The difference is that an extra upper bound C is applied to the Lagrange multipliers of the l1-SVM. The Lagrange multipliers {βi}_{i=1}^{N} of the l1-SVM do not play any role in the optimisation.
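A quick numerical check (scikit-learn assumed, as in the earlier sketch): SVC stores yi αi for its support vectors in dual_coef_, and αi = 0 for every other training sample, so Eqs. (49) and (50) can be verified on a fitted model:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.5, -1.5],
                  [-2.0, -2.0], [-3.0, -1.0], [-1.0, -1.0]])
    y = np.array([1, 1, 1, -1, -1, -1])
    C = 1.0
    clf = SVC(kernel="linear", C=C).fit(X, y)

    alpha_times_y = clf.dual_coef_[0]   # y_i * alpha_i for the support vectors
    alpha = np.abs(alpha_times_y)       # alpha_i itself, since alpha_i >= 0

    assert np.all(alpha <= C + 1e-8)             # Eq. (49): 0 <= alpha_i <= C
    assert np.isclose(alpha_times_y.sum(), 0.0)  # Eq. (50); non-SVs contribute 0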

2.3.2 From Complementary Slackness Condition to Support Vectors

The complementary slackness conditions to be satisfied for each training sample include

    α∗i [1 − ξ∗i − yi(xi^T w∗ + b∗)] = 0,    (51)
    β∗i ξ∗i = (C − α∗i) ξ∗i = 0.    (52)

We now examine the obtained value α∗i for each training sample.

Figure 3: An illustration of support vectors for linear, non-separable patterns. The figure is from Mu and Nandi (2007).

When 0 < α∗i < C, Eq. (52) indicates ξ∗i = 0 and Eq. (51) indicates 1 − ξ∗i − yi(xi^T w∗ + b∗) = 0. Combining these two equations, we have 1 − yi(xi^T w∗ + b∗) = 0. Therefore, the training samples that possess 0 < α∗i < C lie along one of the two parallel hyperplanes (either xi^T w∗ + b∗ = +1 or xi^T w∗ + b∗ = −1), and are called margin support vectors.

Those training samples possessing α∗i = C can have non-zero ξ∗i, which then have to satisfy 1 − ξ∗i − yi(xi^T w∗ + b∗) = 0 according to Eq. (51). Therefore, they either fall within the margin or stay on the wrong side of the separating hyperplane, because yi(xi^T w∗ + b∗) = 1 − ξ∗i < 1. Such training samples are called non-margin support vectors.

Those training samples possessing α∗i = 0 need to satisfy (C − 0)ξ∗i = 0 according to Eq. (52), which gives ξ∗i = 0. They also have 1 − ξ∗i − yi(xi^T w∗ + b∗) = 1 − yi(xi^T w∗ + b∗) < 0 according to Eqs. (51) and (38). Combining these, yi(xi^T w∗ + b∗) > 1 holds for these training samples. This means that these samples all stay outside the margin and on the right side of the separating hyperplane. These are called non-support vectors. An illustration of support vectors for linear, non-separable patterns is provided in Figure 3.
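With alpha and clf from the check above, the three categories of this section can be read off the fitted model: support vectors with αi strictly between 0 and C are margin support vectors, those with αi = C are non-margin support vectors, and samples absent from clf.support_ are non-support vectors. A sketch under the same assumptions as before:

    tol = 1e-8
    margin_sv = clf.support_[alpha < C - tol]       # 0 < alpha_i < C
    non_margin_sv = clf.support_[alpha >= C - tol]  # alpha_i = C
    print("margin support vectors:", margin_sv)
    print("non-margin support vectors:", non_margin_sv)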

3. Sequential Minimal Optimisation (Optional Reading)

We have shown in Section 2 that the training of an SVM, which corresponds to finding the optimal values of the direction w and bias b of the separating hyperplane, can be reduced to a dual problem of maximising a quadratic function subject to linear constraints, as in Eqs. (28)-(30) or Eqs. (48)-(50); this is known as a QP problem. There are many approaches for solving a QP problem. We will explain one particular algorithm called sequential minimal optimisation (SMO) (Platt, 1998), which is effective for solving the particular QP problem of the SVM. We will explain this approach using the l1-SVM as an example, whose dual problem is repeated below:

    max_{{αi}_{i=1}^{N}}  −(1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} αi αj yi yj xi^T xj + ∑_{i=1}^{N} αi,
    subject to  0 ≤ αi ≤ C,  for i = 1, 2, ..., N,
                ∑_{i=1}^{N} αi yi = 0.

The core idea of the SMO algorithm is to decompose the above complex optimisation problem into smaller and simpler problems. Assume we start from a random guess of the multipliers. Each time, the SMO algorithm tries to modify only two out of the N Lagrange multipliers, while holding the other multipliers constant. Therefore, the two key issues in the SMO algorithm are (1) how to find the modified values of the two chosen multipliers so that they increase the objective function while not violating the constraints, and (2) how to select which two multipliers to modify in each step.

We first look at issue (1): the optimisation of two multipliers. Since the order of the multipliers does not affect the problem, we assume α1 and α2 are the two selected multipliers.

The optimisation objective function that concerns only α1 and α2 is given as

    O(α1, α2) = −(1/2) ∑_{i=1}^{2} ∑_{j=1}^{2} αi αj yi yj xi^T xj + ∑_{i=1}^{2} αi
              = −(1/2) (α1² y1² ∥x1∥_2² + 2 α1 α2 y1 y2 x1^T x2 + α2² y2² ∥x2∥_2²) + α1 + α2
              = −(1/2) (α1² ∥x1∥_2² + 2 α1 α2 y1 y2 x1^T x2 + α2² ∥x2∥_2²) + α1 + α2.    (53)

The previous setting of the N multipliers satisfies the constraint

    α1^(old) y1 + α2^(old) y2 + α3^(old) y3 + ... + αN^(old) yN = 0.

When we modify one multiplier to a different value, the above constraint is immediately violated. Therefore, we need to modify at least one more multiplier's value to compensate for the change. This is why at least two multipliers' values are modified each time. Letting η = −(α3^(old) y3 + ... + αN^(old) yN), the modified values of α1 and α2 satisfy

    α1^(new) y1 + α2^(new) y2 = α1^(old) y1 + α2^(old) y2 = η.    (54)

Since y1, y2 ∈ {−1, +1} are class labels, the above condition is equivalent to

    α1^(new) = y1 (η − α2^(new) y2).    (55)

Incorporating Eq. (55) into Eq. (53), an objective function of one single variable is obtained:

    O(α2^(new)) = −(1/2) (y1² (η − α2^(new) y2)² ∥x1∥_2² + 2 y1 (η − α2^(new) y2) α2^(new) y1 y2 x1^T x2 + (α2^(new))² ∥x2∥_2²)
                  + y1 (η − α2^(new) y2) + α2^(new)
                = A2 (α2^(new))² + A1 α2^(new) + A0,    (56)

where

    A2 = −(1/2) (∥x1∥_2² − 2 x1^T x2 + ∥x2∥_2²) = −(1/2) ∥x1 − x2∥_2²,    (57)
    A1 = η y2 ∥x1∥_2² − η y2 x1^T x2 − y1 y2 + 1,    (58)
    A0 = −(1/2) η² ∥x1∥_2² + y1 η.    (59)

It is very easy to find the maximum of a single-variable quadratic function like f(x) = A2 x² + A1 x + A0 with A2 < 0, without considering any constraint on x. The maximum is reached at df/dx = 2 A2 x + A1 = 0, and this gives us a candidate value for the modified Lagrange multiplier:

    α2^(new) = − A1 / (2 A2).    (60)

However, the modified value has to satisfy the constraint 0 ≤ α2^(new) ≤ C. The same constraint also applies to α1^(new):

    0 ≤ α1^(new) = y1 (η − α2^(new) y2) ≤ C,    (61)

which is equivalent to

    y1 η − C ≤ α2^(new) y1 y2 ≤ y1 η.    (62)

Combining it with 0 ≤ α2^(new) ≤ C, we have

    max(0, y1 η − C) ≤ α2^(new) ≤ min(y1 η, C),    if y1 = y2,
    max(0, −y1 η) ≤ α2^(new) ≤ min(C − y1 η, C),    if y1 ≠ y2.    (63)

Since α1^(old) y1 + α2^(old) y2 = η as in Eq. (54), we can rewrite y1 η as

    y1 η = y1 (α1^(old) y1 + α2^(old) y2) = α1^(old) + α2^(old) y1 y2.    (64)

Incorporating Eq. (64) into Eq. (63) results in a final constraint for the modified value of α2^(new), expressed as L ≤ α2^(new) ≤ H, where

    L = max(0, α1^(old) + α2^(old) − C),  H = min(α1^(old) + α2^(old), C),    if y1 = y2,
    L = max(0, α2^(old) − α1^(old)),      H = min(C − α1^(old) + α2^(old), C),    if y1 ≠ y2.    (65)

We have computed earlier a candidate value for α2^(new) in Eq. (60). A projection approach can then be applied to re-shape α2^(new) so that it satisfies the constraint L ≤ α2^(new) ≤ H. The re-shaped α2^(new) is given as

    α2^(new) = L,             if −A1/(2A2) < L,
             = H,             if −A1/(2A2) > H,
             = −A1/(2A2),     otherwise.    (66)

The modified value α1^(new) can then be computed from α2^(new) using Eq. (55).

Now, we briefly look at issue (2). Heuristic selection of the two multipliers is applied. The SMO algorithm employs two separate heuristics to choose the first and the second Lagrange multipliers to be optimised. The algorithm goes through the training samples and finds one that violates the KKT conditions. The corresponding Lagrange multiplier of this chosen training sample is treated as α1 to be optimised. The second multiplier is chosen in such a way that a modification of the pair α1 and α2 can cause a large change. One way to evaluate the change is to first compute, for each training sample, the difference between the value of the prediction function and the class label (referred to as an error):

    Ei = ∑_{j=1}^{N} αj yj xj^T xi + b − yi,  for i = 1, 2, ..., N.    (67)

If the error for the training sample chosen in the first round of selection is positive (Esel1 > 0), the sample with the smallest error among the remaining training samples is chosen in the second round of selection; otherwise, the one with the largest error is chosen. The corresponding Lagrange multiplier of the sample chosen in the second round is treated as α2.

References

C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.

T. Mu and A. K. Nandi. Breast cancer detection from FNA using SVM with different parameter tuning systems and SOM–RBF classifier. Journal of the Franklin Institute, 344(3-4):285–311, 2007.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 185–208. MIT Press, 1998.