
slide-1
SLIDE 1

Support vector machines

Course of Machine Learning Master Degree in Computer Science University of Rome “Tor Vergata” Giorgio Gambosi a.a. 2018-2019

1

slide-2
SLIDE 2

Idea

The binary classification problem is approached in a direct way, that is: we try to find a plane that separates the classes in feature space (indeed, a “best” plane, according to a reasonable criterion). If this is not possible, we get creative in two ways:

  • We soften what we mean by “separates”, and
  • We enrich and enlarge the feature space so that separation is (more) possible

2

slide-3
SLIDE 3

Margins

A can be assigned to C1 with greater confidence than B and even greater confidence than C.

3

slide-4
SLIDE 4

Binary classifiers

Consider a binary classifier which, for any element x, returns a value y ∈ {−1, 1}, where we assume that x is assigned to C0 if y = −1 and to C1 if y = 1. Moreover, we consider linear classifiers of the form h(x) = g(wT φ(x) + w0), where g(z) = 1 if z ≥ 0 and g(z) = −1 if z < 0. The prediction on the class of x is then provided by deriving a value in {−1, 1}, just as in the case of a perceptron, that is with no estimation of the probabilities p(Ci|x) that x belongs to each class.
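As an illustration, here is a minimal sketch of such a classifier in Python (NumPy), assuming the identity feature map φ(x) = x; the parameter values are arbitrary and purely illustrative.

```python
import numpy as np

def h(x, w, w0):
    """Linear classifier: +1 if w^T phi(x) + w0 >= 0, else -1 (phi = identity here)."""
    z = np.dot(w, x) + w0
    return 1 if z >= 0 else -1

# arbitrary parameters, just to exercise the function
w, w0 = np.array([2.0, -1.0]), 0.5
print(h(np.array([1.0, 1.0]), w, w0))   # +1, since 2 - 1 + 0.5 >= 0
print(h(np.array([-1.0, 2.0]), w, w0))  # -1, since -2 - 2 + 0.5 < 0
```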

4

slide-5
SLIDE 5

Margins

For any training set item (xi, ti), the functional margin of (w, w0) wrt such an item is defined as

γ̂i = ti(wT φ(xi) + w0)

Observe that the resulting prediction is correct iff γ̂i > 0. Moreover, larger values of γ̂i denote greater confidence in the prediction. Given a training set T = {(x1, t1), . . . , (xn, tn)}, the functional margin of (w, w0) wrt T is the minimum functional margin over all items in T:

γ̂ = min_i γ̂i

5

slide-6
SLIDE 6

Margins

The geometric margin γi of a training set item (xi, ti) is defined as the product of ti and the distance from xi to the boundary hyperplane, that is, the length of the line segment from xi to its projection on the boundary hyperplane.

[Figure: a training point A, its projection B onto the boundary hyperplane, and the geometric margin γi]

6

slide-7
SLIDE 7

Margins

Since, in general, the distance of a point x from a hyperplane wT x = 0 is wT x / ||w||, it results

γi = ti ( (wT/||w||) φ(xi) + w0/||w|| ) = γ̂i / ||w||

So, differently from the functional margin γ̂i, the geometric margin γi is invariant wrt parameter scaling. In fact, by substituting cw for w and cw0 for w0, we get

γ̂i = ti(cwT φ(xi) + cw0) = c ti(wT φ(xi) + w0)

γi = ti ( (cwT/||cw||) φ(xi) + cw0/||cw|| ) = ti ( (wT/||w||) φ(xi) + w0/||w|| )

7

slide-8
SLIDE 8

Margins

  • The geometric margin wrt the training set T = {(x1, t1), . . . , (xn, tn)} is then defined as the smallest geometric margin over all items (xi, ti): γ = min_i γi
  • A useful interpretation of γ is as half the width of the largest strip, centered on the hyperplane wT φ(x) + w0 = 0, containing none of the points x1, . . . , xn
  • The hyperplanes on the boundary of such a strip, each at distance γ from the separating hyperplane and at least one of them passing through some point xi, are called maximum margin hyperplanes.
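As a small illustration, the sketch below (assuming NumPy arrays: the rows of X are the feature vectors φ(xi), t holds the labels in {−1, +1}) computes the functional and geometric margins of a given (w, w0) over a training set.

```python
import numpy as np

def margins(X, t, w, w0):
    """Return (min functional margin, min geometric margin) over the training set."""
    gamma_hat = t * (X @ w + w0)           # functional margins t_i (w^T phi(x_i) + w0)
    gamma = gamma_hat / np.linalg.norm(w)  # geometric margins
    return gamma_hat.min(), gamma.min()

# toy, linearly separable example
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
t = np.array([1.0, -1.0])
print(margins(X, t, w=np.array([1.0, 1.0]), w0=0.0))
```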

8

slide-9
SLIDE 9

Margins

[Figure]

9

slide-10
SLIDE 10

Optimal margin classifiers

Given a training set T, we wish to find the hyperplane which separates the two classes (if one exists) and has maximum γ: by making the distance between the hyperplane and the set of points corresponding to the elements as large as possible, the confidence in the provided classification increases. Assume the classes are linearly separable in the training set: hence, there exists a hyperplane (indeed, infinitely many) separating elements in C1 from elements in C2. In order to find, among those hyperplanes, the one which maximizes γ, we have to solve the following optimization problem:

max_{w,w0} γ   subject to   γi = (ti/||w||)(wT φ(xi) + w0) ≥ γ,   i = 1, . . . , n

That is,

max_{w,w0} γ   subject to   ti(wT φ(xi) + w0) ≥ γ ||w||,   i = 1, . . . , n

10

slide-11
SLIDE 11

Optimal margin classifiers

As observed, if all parameters are scaled by any constant c, all geometric margins γi between elements and the hyperplane are unchanged: we may then exploit this freedom to introduce the constraint

γ̂ = min_i ti(wT φ(xi) + w0) = 1

This can be obtained by assuming ||w|| = 1/γ, which corresponds to considering a scale where the maximum margin has width 2. This results, for each element (xi, ti), in a constraint

γ̂i = ti(wT φ(xi) + w0) ≥ 1

An element (point) is said to be active if the equality holds, that is if ti(wT φ(xi) + w0) = 1, and inactive if it does not. Observe that, by definition, there must exist at least one active point.

11

slide-12
SLIDE 12

Optimal margin classifiers

For any element (x, t):

1. t(wT φ(x) + w0) > 1 if φ(x) is in the region corresponding to its class, outside the margin strip
2. t(wT φ(x) + w0) = 1 if φ(x) is in the region corresponding to its class, on the maximum margin hyperplane
3. 0 < t(wT φ(x) + w0) < 1 if φ(x) is in the region corresponding to its class, inside the margin strip
4. t(wT φ(x) + w0) = 0 if φ(x) is on the separating hyperplane
5. −1 < t(wT φ(x) + w0) < 0 if φ(x) is in the region corresponding to the other class, inside the margin strip
6. t(wT φ(x) + w0) = −1 if φ(x) is in the region corresponding to the other class, on the maximum margin hyperplane
7. t(wT φ(x) + w0) < −1 if φ(x) is in the region corresponding to the other class, outside the margin strip

12

slide-13
SLIDE 13

Optimal margin classifiers

The optimization problem is then transformed into

max_{w,w0} γ = 1/||w||   subject to   ti(wT φ(xi) + w0) ≥ 1,   i = 1, . . . , n

Maximizing 1/||w|| is equivalent to minimizing ||w||²; we prefer minimizing ||w||² rather than ||w|| since it is smooth everywhere. Hence we may formulate the problem as

min_{w,w0} (1/2)||w||²   subject to   ti(wT φ(xi) + w0) ≥ 1,   i = 1, . . . , n

This is a convex quadratic optimization problem: the function to be minimized is convex, and the set of points satisfying the constraints is a convex polyhedron (intersection of half-spaces).

13

slide-14
SLIDE 14

Duality

From optimization theory it derives that, given the problem structure (linear constraints + convexity):

  • there exists a dual formulation of the problem
  • the optimum of the dual problem is the same as that of the original (primal) problem

14

slide-15
SLIDE 15

Karush-Kuhn-Tucker theorem

Consider the optimization problem

min_{x∈Ω} f(x)
subject to   gi(x) ≥ 0   i = 1, . . . , k
             hj(x) = 0   j = 1, . . . , k′

where f(x), gi(x), hj(x) are convex functions and Ω is a convex set. Define the Lagrangian

L(x, λ, µ) = f(x) + Σ_{i=1}^{k} λi gi(x) + Σ_{j=1}^{k′} µj hj(x)

and the minimum

θ(λ, µ) = min_x L(x, λ, µ)

Then, the solution of the original problem is the same as the solution of

max_{λ,µ} θ(λ, µ)   subject to   λi ≥ 0   i = 1, . . . , k

15

slide-16
SLIDE 16

Karush-Kuhn-Tucker theorem

The following necessary and sufficient conditions apply for the existence of an optimum (x∗, λ∗, µ∗):

∂L(x, λ, µ)/∂x |_{x∗,λ∗,µ∗} = 0
∂L(x, λ, µ)/∂λi |_{x∗,λ∗,µ∗} = gi(x∗) ≥ 0   i = 1, . . . , k
∂L(x, λ, µ)/∂µj |_{x∗,λ∗,µ∗} = hj(x∗) = 0   j = 1, . . . , k′
λ∗i ≥ 0   i = 1, . . . , k
λ∗i gi(x∗) = 0   i = 1, . . . , k

Note: the last condition states that a Lagrange multiplier λ∗i can be non-zero only if gi(x∗) = 0, that is if x∗ is “at the limit” for the constraint gi(x). In this case, the constraint is said to be active.

16

slide-17
SLIDE 17

Applying Kuhn-Tucker theorem

In our case:

  • f(x) corresponds to (1/2)||w||²
  • gi(x) corresponds to ti(wT φ(xi) + w0) − 1 ≥ 0
  • there is no hj(x)
  • Ω is the intersection of a set of half-spaces, that is a polyhedron, hence convex.

By the KKT theorem, the solution is then the same as the solution of

max_λ min_{w,w0} L(w, w0, λ) = max_λ min_{w,w0} ( (1/2) wT w − Σ_{i=1}^{n} λi ( ti(wT φ(xi) + w0) − 1 ) )
                             = max_λ min_{w,w0} ( (1/2) wT w − Σ_{i=1}^{n} λi ti (wT φ(xi) + w0) + Σ_{i=1}^{n} λi )

under the constraints λi ≥ 0, i = 1, . . . , n

17

slide-18
SLIDE 18

Applying the KKT conditions

Since the KKT conditions hold at the maximum point, it must be, at that point:

∂L(w, w0, λ)/∂w = w − Σ_{i=1}^{n} λi ti φ(xi) = 0
∂L(w, w0, λ)/∂w0 = −Σ_{i=1}^{n} λi ti = 0
ti(wT φ(xi) + w0) − 1 ≥ 0   i = 1, . . . , n
λi ≥ 0   i = 1, . . . , n
λi ( ti(wT φ(xi) + w0) − 1 ) = 0   i = 1, . . . , n

18

slide-19
SLIDE 19

Lagrange method: dual problem

We may apply the above relations to eliminate w and w0 from L(w, w0, λ) and from all constraints. As a result, we get a new, dual formulation of the problem:

max_λ L̃(λ) = max_λ ( Σ_{i=1}^{n} λi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi λj ti tj φ(xi)T φ(xj) )

subject to

λi ≥ 0   i = 1, . . . , n
Σ_{i=1}^{n} λi ti = 0

19

slide-20
SLIDE 20

Dual problem and kernel function

By defining the kernel function κ(xi, xj) = φ(xi)T φ(xj), the dual problem can be formulated as

max_λ L̃(λ) = max_λ ( Σ_{i=1}^{n} λi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi λj ti tj κ(xi, xj) )

subject to

λi ≥ 0   i = 1, . . . , n
Σ_{i=1}^{n} λi ti = 0
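As a rough illustration (not the reference implementation of the course), the dual can be solved numerically with a general-purpose constrained optimizer. The sketch below, with purely illustrative names, uses scipy.optimize.minimize on the negated dual objective, with the equality constraint Σi λi ti = 0 and the bounds λi ≥ 0; K is the Gram matrix with Kij = κ(xi, xj).

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(K, t):
    """Maximize L~(lam) = sum_i lam_i - 1/2 sum_ij lam_i lam_j t_i t_j K_ij
    subject to lam_i >= 0 and sum_i lam_i t_i = 0 (hard-margin case)."""
    n = len(t)
    Q = (t[:, None] * t[None, :]) * K            # Q_ij = t_i t_j kappa(x_i, x_j)

    def neg_dual(lam):
        return 0.5 * lam @ Q @ lam - lam.sum()

    def neg_dual_grad(lam):
        return Q @ lam - np.ones(n)

    constraint = {"type": "eq", "fun": lambda lam: lam @ t}
    res = minimize(neg_dual, np.zeros(n), jac=neg_dual_grad,
                   bounds=[(0, None)] * n, constraints=[constraint], method="SLSQP")
    return res.x                                  # optimal multipliers lambda*
```

For realistic data sets, dedicated QP or SMO-type solvers are used instead, as noted later in the slides.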

20

slide-21
SLIDE 21

Passing from primal to dual

Disadvantage: the number of variables increases from m to n (in particular, if φ(x) = x, from d to n).
Advantage: the number of variables which turn out to be relevant for classification is usually much smaller than n.

21

slide-22
SLIDE 22

Deriving coefficients

By solving the dual problem, the optimal values λ∗ of the Lagrange multipliers are obtained. The optimal values of the parameters w∗ are then derived through the relations

w∗i = Σ_{j=1}^{n} λ∗j tj φi(xj)   i = 1, . . . , m

The value of w∗0 can be obtained by observing that, for any support vector xk (characterized by the condition λ∗k > 0), it must be

1 = tk ( φ(xk)T w∗ + w∗0 )
  = tk ( Σ_{j=1}^{n} λ∗j tj φ(xj)T φ(xk) + w∗0 )
  = tk ( Σ_{j=1}^{n} λ∗j tj κ(xj, xk) + w∗0 )
  = tk ( Σ_{j∈S} λ∗j tj κ(xj, xk) + w∗0 )

where S is the set of indices of the support vectors.

22

slide-23
SLIDE 23

Deriving coefficients

As a consequence, since tk = ±1, in order to have a unitary product it must be

tk = Σ_{j∈S} λ∗j tj κ(xj, xk) + w∗0

and hence

w∗0 = tk − Σ_{j∈S} λ∗j tj κ(xj, xk)

A more precise solution can be obtained as the mean value computed over all support vectors:

w∗0 = (1/|S|) Σ_{i∈S} ( ti − Σ_{j∈S} λ∗j tj κ(xj, xi) )
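A small sketch of this averaging, assuming the multipliers lam returned by a dual solver (such as the earlier sketch), the Gram matrix K with K[j, i] = κ(xj, xi) and the label vector t; the support vectors are taken as the indices where λi exceeds a small tolerance.

```python
import numpy as np

def bias_from_support_vectors(lam, t, K, tol=1e-8):
    """w0* = (1/|S|) * sum_{i in S} ( t_i - sum_{j in S} lam_j t_j K[j, i] )"""
    S = np.where(lam > tol)[0]                    # indices of the support vectors
    return np.mean([t[i] - np.sum(lam[S] * t[S] * K[S, i]) for i in S])
```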

23

slide-24
SLIDE 24

Classification through SVM

A new element x can be classified, given a set of basis functions φ or a kernel function κ, by checking the sign of

y(x) = Σ_{i=1}^{m} w∗i φi(x) + w∗0 = Σ_{j=1}^{n} λ∗j tj κ(xj, x) + w∗0

As noticed, if xi is not a support vector, then it must be λ∗i = 0. Thus, the above sum can be written as

y(x) = Σ_{j∈S} λ∗j tj κ(xj, x) + w∗0

The classification performed through the dual formulation, using the kernel function, does not take into account all training set items, but only the support vectors, usually a quite small subset of the training set.
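Putting the previous sketches together, a minimal (illustrative) decision rule that uses only the support vectors could look like:

```python
import numpy as np

def predict(x, X_sv, t_sv, lam_sv, w0, kernel):
    """Sign of y(x) = sum_{j in S} lam_j t_j kappa(x_j, x) + w0."""
    y = sum(l * tj * kernel(xj, x) for l, tj, xj in zip(lam_sv, t_sv, X_sv)) + w0
    return 1 if y >= 0 else -1

# e.g. with a linear kernel kappa(a, b) = a^T b
linear_kernel = lambda a, b: float(np.dot(a, b))
```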

24

slide-25
SLIDE 25

Non separability in the training set

  • The linear separability hypothesis for the classes is quite restrictive
  • In general, a suitable set of base functions φ, or a suitable kernel function κ(x1, x2), may map all training set elements onto a higher-dimensional feature space where the classes turn out to be (at least approximately) linearly separable.

25

slide-26
SLIDE 26

Non separability in the training set

  • The approach described before, when applied to non linearly separable sets, does not provide acceptable solutions: it is in fact impossible to satisfy all constraints ti(wT φ(xi) + w0) ≥ 1, i = 1, . . . , n
  • These constraints must then be relaxed, allowing some of them not to hold, at the cost of some increase in the objective function to be minimized
  • A slack variable ξi is introduced for each constraint, to provide a measure of how much the constraint is violated

26

slide-27
SLIDE 27

Non separability in the training set

  • This can be formalized as

min_{w,w0,ξ} (1/2) wT w + C Σ_{i=1}^{n} ξi
subject to   ti(wT φ(xi) + w0) ≥ 1 − ξi   i = 1, . . . , n
             ξi ≥ 0                        i = 1, . . . , n

where ξ = (ξ1, . . . , ξn)

  • By introducing suitable multipliers, the following Lagrangian can be obtained

L(w, w0, ξ, λ, α) = (1/2) wT w + C Σ_{i=1}^{n} ξi − Σ_{i=1}^{n} λi ( ti(wT φ(xi) + w0) − 1 + ξi ) − Σ_{i=1}^{n} αi ξi
                  = (1/2) Σ_{i=1}^{m} wi² + Σ_{i=1}^{n} (C − αi − λi) ξi − Σ_{i=1}^{n} Σ_{j=1}^{m} λi ti wj φj(xi) − w0 Σ_{i=1}^{n} λi ti + Σ_{i=1}^{n} λi

where αi ≥ 0 and λi ≥ 0, for i = 1, . . . , n.

27

slide-28
SLIDE 28

KKT conditions

The Karush-Kuhn-Tucker conditions are now:

∂L(w, w0, ξ, λ, α)/∂w = 0                                     (null gradient)
∂L(w, w0, ξ, λ, α)/∂w0 = 0                                    (null gradient)
∂L(w, w0, ξ, λ, α)/∂ξ = 0                                     (null gradient)
ti(wT φ(xi) + w0) − 1 + ξi ≥ 0   i = 1, . . . , n              (constraints)
ξi ≥ 0   i = 1, . . . , n                                      (constraints)
λi ≥ 0   i = 1, . . . , n                                      (multipliers)
αi ≥ 0   i = 1, . . . , n                                      (multipliers)
λi ( ti(wT φ(xi) + w0) − 1 + ξi ) = 0   i = 1, . . . , n       (complementary slackness)
αi ξi = 0   i = 1, . . . , n                                   (complementary slackness)

28

slide-29
SLIDE 29

Deriving a dual formulation

From the null gradient conditions wrt wi, w0, ξi it derives

wi = Σ_{j=1}^{n} λj tj φi(xj)   i = 1, . . . , m
0 = Σ_{i=1}^{n} λi ti
λi = C − αi ≤ C   i = 1, . . . , n

29

slide-30
SLIDE 30

Deriving a dual formulation

By plugging the above relations into L(w, w0, ξ, λ, α), the dual problem results:

max_λ L̃(λ) = max_λ ( Σ_{i=1}^{n} λi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} λi λj ti tj κ(xi, xj) )

subject to

0 ≤ λi ≤ C   i = 1, . . . , n
Σ_{i=1}^{n} λi ti = 0

Observe that the only difference wrt the linearly separable case is that the constraints 0 ≤ λi are transformed into 0 ≤ λi ≤ C
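In the dual-solver sketch given earlier, this only changes the bounds passed to the optimizer; for instance (illustrative, following that sketch):

```python
# soft-margin case: box constraints 0 <= lambda_i <= C instead of lambda_i >= 0
bounds = [(0, C)] * n
```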

30

slide-31
SLIDE 31

Item characterization

Given a solution of the above problem, the elements of the training set can be partitioned into several subsets:

  • elements correctly classified and not relevant, the ones such that λi = 0 and ξi = 0: such elements are in the correct halfspace, in terms of classification, and do not lie on the maximum margin hyperplanes (they are not support vectors)
  • elements correctly classified and relevant, the ones such that λi > 0 and 0 ≤ ξi < 1: such elements are in the correct halfspace, in terms of classification, either on the maximum margin hyperplane (ξi = 0) or within the margin region (0 < ξi < 1)
  • elements incorrectly classified, the ones with λi > 0 and ξi > 1: such elements are in the wrong halfspace.

31

slide-32
SLIDE 32

Item characterization

Let xi be a training set element, then one of the following conditions holds:

1. ξi = 0, λi = 0 if φ(xi) is in the correct halfspace, outside the margin strip
2. ξi = 0, 0 < λi < C if φ(xi) is in the correct halfspace, on the maximum margin hyperplane
3. 0 < ξi < 1, λi = C if φ(xi) is in the correct halfspace, within the margin strip
4. ξi = 1, λi = C if φ(xi) is on the separating hyperplane
5. ξi > 1, λi = C if φ(xi) is in the wrong halfspace

32

slide-33
SLIDE 33

Item characterization

[Figure: training points annotated with the cases λi = 0, ξi = 0; λi > 0, ξi = 0; λi = C, 0 < ξi < 1; λi = C, ξi = 1; λi = C, ξi > 1]

33

slide-34
SLIDE 34

Classification

From the optimal solution λ∗ of the dual problem, the coefficients w∗ and w∗0 can be derived just as done in the linearly separable case. A new element x can then be classified, again, through the sign of

y(x) = Σ_{i=1}^{m} w∗i φi(x) + w∗0

or, equivalently, of

y(x) = Σ_{j∈S} λ∗j tj κ(xj, x) + w∗0

34

slide-35
SLIDE 35

Some comments

  • Training time of the standard SVM is O(n³) (solving the QP)
  • Can be prohibitive for large datasets
  • Lots of research has gone into speeding up the SVMs
  • Many approximate QP solvers are used to speed up SVMs
  • Online training (e.g., using stochastic gradient descent)
  • Several extensions exist
  • More than 2 classes (multiclass classification)
  • Real-valued outputs (support vector regression)

35

slide-36
SLIDE 36

Loss Functions for Linear Classification

  • Linear binary classification can be written as a general optimization problem:

argmin_{w,w0} L(w, w0) = argmin_{w,w0} Σ_{i=1}^{n} I( ti(wT φ(xi) + w0) < 0 ) + λ R(w, w0)

  • I is the indicator function (1 if the condition is true, 0 otherwise)
  • The objective is the sum of two parts: the loss function and the regularizer
  • We want to fit the training data well, and we also want simple solutions
  • The above loss function is called the 0-1 loss
  • It is hard to optimize

36

slide-37
SLIDE 37

Approximations to the 0-1 loss

  • We use loss functions that are convex approximations to the 0-1 loss
  • These are called surrogate loss functions
  • Examples of surrogate loss functions (see the sketch after this list):
  • Hinge loss: max(0, 1 − x)
  • Log loss: log(1 + e−x)
  • Exponential loss: e−x
  • All are convex upper bounds on the 0-1 loss
  • Minimizing a convex upper bound also pushes down the original function
  • Unlike the 0-1 loss, these loss functions depend on how far the examples are from the hyperplane
  • Apart from convexity, smoothness is the other desirable property for loss functions
  • Smoothness allows using gradient (or stochastic gradient) descent
  • Note: the hinge loss is not smooth at x = 1, but subgradient descent can be used
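A small numerical sketch of the 0-1 loss and the three surrogate losses, viewed as functions of the signed margin z = t(wT φ(x) + w0):

```python
import numpy as np

def zero_one_loss(z): return (z < 0).astype(float)      # I(z < 0)
def hinge_loss(z):    return np.maximum(0.0, 1.0 - z)   # max(0, 1 - z)
def log_loss(z):      return np.log1p(np.exp(-z))       # log(1 + e^{-z})
def exp_loss(z):      return np.exp(-z)                  # e^{-z}

z = np.linspace(-2, 2, 9)   # a few margin values
for f in (zero_one_loss, hinge_loss, log_loss, exp_loss):
    print(f.__name__, np.round(f(z), 3))
```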

37

slide-38
SLIDE 38

SVM and gradient descent

A different approach to the problem can be defined by observing that for each item xi it is possible to define a cost c(xi) as follows:

  • if ti(wT xi + w0) ≥ 1 then xi is well classified and external with respect to the margin strip: the cost is c(xi) = 0
  • else, either xi is well classified and internal to the strip, or xi is wrongly classified: in both cases, we consider as cost the distance of the item from the “correct” margin, c(xi) = 1 − ti(wT xi + w0)

38

slide-39
SLIDE 39

SVM and gradient descent

The formalization of the problem in the general case

min_{w,w0,ξ} (1/2) wT w + C Σ_{i=1}^{n} ξi
subject to   ti(wT φ(xi) + w0) ≥ 1 − ξi   i = 1, . . . , n
             ξi ≥ 0                        i = 1, . . . , n

corresponds to the minimization of (1/2) wT w while, at the same time, minimizing the sum of the costs c(φ(xi)), according to a ratio given by C. In an equivalent way, we can then define the cost function to be minimized as

C(w) = (α/2) wT w + Σ_{i=1}^{n} h( ti(wT φ(xi) + w0) )

where h(x) is the hinge function, defined as h(x) = 0 if x ≥ 1 and h(x) = 1 − x otherwise.

39

slide-40
SLIDE 40

SVM and gradient descent

The minimum of C(w) can be derived by gradient descent, noting however that h(x) is not differentiable everywhere (at x = 1), hence its gradient is not well defined. In any case, it is possible to refer to the subgradient, which provides a lower bound to the slope of h and is defined everywhere:

∇w = − Σ_{xi∈L} ti φ(xi)

where xi ∈ L iff c(φ(xi)) > 0. The (sub)gradient descent approach then derives

w(r+1) = w(r) − α w(r) + Σ_{xi∈L} ti φ(xi)
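A sketch of this batch subgradient descent for the hinge-loss formulation (identity feature map, bias term omitted, names and step size purely illustrative; with eta = 1 the update reduces to the rule above):

```python
import numpy as np

def train_svm_subgradient(X, t, alpha=0.01, eta=0.1, epochs=200):
    """Minimize C(w) = alpha/2 ||w||^2 + sum_i h(t_i w^T x_i) by subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        L = t * (X @ w) < 1                              # items with non-zero hinge cost
        subgrad = alpha * w - (t[L][:, None] * X[L]).sum(axis=0)
        w = w - eta * subgrad
    return w
```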

40

slide-41
SLIDE 41

Kernel methods motivation

  • Often we want to capture nonlinear patterns in the data
  • Nonlinear Regression: Input-output relationship may not be linear
  • Nonlinear Classification: Classes may not be separable by a linear boundary
  • Linear models (e.g., linear regression, linear SVM) are just not rich enough

  • Kernels: Make linear models work in nonlinear settings
  • By mapping data to higher dimensions where it exhibits linear patterns
  • Apply the linear model in the new input space
  • The mapping changes the feature representation
  • Note: Such mappings can be expensive to compute in general
  • Kernels give such mappings for (almost) free
  • In most cases, the mappings need not be even computed
  • .. using the Kernel Trick!

41

slide-42
SLIDE 42

Kernels: Formally Defined

  • Recall: each kernel κ has an associated basis function φ
  • φ takes an input x ∈ X (input space) and maps it to F (feature space)
  • The kernel κ(x1, x2) takes two inputs and gives their similarity in F space

φ : X → F
κ : X × X → ℝ
κ(x1, x2) = φ(x1)T φ(x2)

  • F needs to be a vector space with a dot product defined on it (a Hilbert space)

  • Can just any function be used as a kernel function?
  • No. It must satisfy Mercer’s Condition

42

slide-43
SLIDE 43

Mercer’s Condition

  • For κ to be a kernel function, there must exist a Hilbert space F for which κ defines a dot product
  • The above is true if κ is a positive definite function, that is

∫∫ f(x1) κ(x1, x2) f(x2) dx1 dx2 ≥ 0   for all f such that ∫_{−∞}^{+∞} f(x)² dx < ∞

43

slide-44
SLIDE 44

Constructing kernel functions

Example. Let x1, x2 ∈ ℝ²: is κ(x1, x2) = (x1 · x2)² a valid kernel function? This can be verified by observing that

κ(x1, x2) = (x11 x21 + x12 x22)²
          = x11² x21² + x12² x22² + 2 x11 x12 x21 x22
          = (x11², x12², x11x12, x11x12) · (x21², x22², x21x22, x21x22)
          = φ(x1) · φ(x2)

that is, by defining the base functions as φ(x) = (x1², x2², x1x2, x1x2)T.

44
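A quick numerical check of the identity above (helper names are illustrative):

```python
import numpy as np

def phi(x):
    # explicit feature map for kappa(x1, x2) = (x1 . x2)^2 in R^2
    return np.array([x[0]**2, x[1]**2, x[0]*x[1], x[0]*x[1]])

x1, x2 = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(x1, x2) ** 2)        # kernel computed directly: (3 - 2)^2 = 1
print(np.dot(phi(x1), phi(x2)))   # same value via the explicit feature map
```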

slide-45
SLIDE 45

Constructing kernel functions

  • In general, if x1, x2 ∈ ℝ^d, then κ(x1, x2) = (x1 · x2)² = φ(x1)T φ(x2), where φ(x) = (x1², . . . , xd², x1x2, . . . , x1xd, x2x1, . . . , xd xd−1)T
  • the d-dimensional input space is mapped onto a space of dimension m = d²
  • observe that computing κ(x1, x2) requires time O(d), while deriving it from φ(x1)T φ(x2) requires O(d²) steps

45

slide-46
SLIDE 46

Constructing kernel functions

The function κ(x1, x2) = (x1 · x2 + c)² is a kernel function. In fact,

κ(x1, x2) = (x1 · x2 + c)² = Σ_{i=1}^{d} Σ_{j=1}^{d} x1i x1j x2i x2j + Σ_{i=1}^{d} (√(2c) x1i)(√(2c) x2i) + c² = φ(x1)T φ(x2)

for φ(x) = (x1², . . . , xd², x1x2, . . . , x1xd, x2x1, . . . , xd xd−1, √(2c) x1, . . . , √(2c) xd, c)T

This implies a mapping from a d-dimensional to a (d + 1)²-dimensional space.

46

slide-47
SLIDE 47

Constructing kernel functions

The function κ(x1, x2) = (x1 · x2 + c)^t is a kernel function, corresponding to a mapping from a d-dimensional space to a space of dimension

m = Σ_{i=0}^{t} d^i = (d^{t+1} − 1)/(d − 1)

corresponding to all products xi1 xi2 · · · xil with 0 ≤ l ≤ t. Observe that, even if the feature space has dimension O(d^t), evaluating the kernel function requires just time O(d).

47

slide-48
SLIDE 48

Verifying a given function is a kernel

A necessary and sufficient condition for a function κ : ℝ^d × ℝ^d → ℝ to be a kernel is that, for every set of points (x1, . . . , xn), the Gram matrix K with kij = κ(xi, xj) is positive semidefinite, that is vT K v ≥ 0 for all vectors v.
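A sketch of this check on a finite sample of points: build the Gram matrix and verify that all of its eigenvalues are (numerically) non-negative.

```python
import numpy as np

def has_psd_gram(X, kernel, tol=1e-10):
    """Check positive semidefiniteness of K_ij = kappa(x_i, x_j) on the sample X."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.all(np.linalg.eigvalsh(K) >= -tol)   # K is symmetric

X = np.random.randn(20, 3)
print(has_psd_gram(X, lambda a, b: (a @ b) ** 2))             # valid kernel: True
print(has_psd_gram(X, lambda a, b: -np.linalg.norm(a - b)))   # not a kernel: False
```

Note that passing the check on one sample is only evidence, not a proof of the condition.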

48

slide-49
SLIDE 49

Techniques for constructing kernel functions

Given kernel functions κ1(x1, x2), κ2(x1, x2), the function κ(x1, x2) is a kernel in all the following cases:

  • κ(x1, x2) = e^{κ1(x1, x2)}
  • κ(x1, x2) = κ1(x1, x2) + κ2(x1, x2)
  • κ(x1, x2) = κ1(x1, x2) κ2(x1, x2)
  • κ(x1, x2) = c κ1(x1, x2), for any c > 0
  • κ(x1, x2) = x1T A x2, with A positive definite
  • κ(x1, x2) = f(x1) κ1(x1, x2) f(x2), for any f : ℝ^d → ℝ
  • κ(x1, x2) = p(κ1(x1, x2)), for any polynomial p with non-negative coefficients
  • κ(x1, x2) = κ3(φ(x1), φ(x2)), for any vector φ of m functions φi : ℝ^d → ℝ and for any kernel function κ3(x1, x2) in ℝ^m

49

slide-50
SLIDE 50

Constructing kernel functions

κ(x1, x2) = (x1 · x2 + c)^d is a kernel function. In fact,

1. x1 · x2 = x1T x2 is a kernel function, corresponding to the base functions φ = (φ1, . . . , φd) with φi(x) = xi
2. c is a kernel function, corresponding to the base functions φ = (φ1, . . . , φd) with φi(x) = √(c/d)
3. x1 · x2 + c is a kernel function, since it is the sum of two kernel functions
4. (x1 · x2 + c)^d is a kernel function, since it is a polynomial with non-negative coefficients (in particular p(z) = z^d) applied to a kernel function

50

slide-51
SLIDE 51

Constructing kernel functions

κ(x1, x2) = e^{−||x1 − x2||²/(2σ²)} is a kernel function. In fact,

1. since ||x1 − x2||² = x1T x1 + x2T x2 − 2 x1T x2, it results κ(x1, x2) = e^{−x1T x1/(2σ²)} e^{−x2T x2/(2σ²)} e^{x1T x2/σ²}
2. x1T x2 is a kernel function (see above)
3. then, x1T x2/σ² is a kernel function, being the product of a kernel function with a constant c = 1/σ²
4. e^{x1T x2/σ²} is the exponential of a kernel function, and as a consequence a kernel function itself
5. e^{−x1T x1/(2σ²)} e^{x1T x2/σ²} e^{−x2T x2/(2σ²)} is a kernel function, being the product of a kernel function with f(x1) = e^{−x1T x1/(2σ²)} and f(x2) = e^{−x2T x2/(2σ²)}
51

slide-52
SLIDE 52

Relevant kernel functions

1. Polynomial kernel: κ(x1, x2) = (x1 · x2 + 1)^d

2. Sigmoidal kernel: κ(x1, x2) = tanh(c1 x1 · x2 + c2)

3. Gaussian kernel: κ(x1, x2) = exp( −||x1 − x2||²/(2σ²) ), where σ ∈ ℝ

Observe that a Gaussian kernel can also be derived starting from another (non-linear) kernel function κ1(x1, x2) instead of x1T x2.

52
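For completeness, a compact sketch of these three kernels (parameter names are illustrative):

```python
import numpy as np

def polynomial_kernel(x1, x2, d=3):
    return (np.dot(x1, x2) + 1.0) ** d

def sigmoidal_kernel(x1, x2, c1=1.0, c2=0.0):
    return np.tanh(c1 * np.dot(x1, x2) + c2)

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2.0 * sigma ** 2))
```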