Support Vector Machines
Léon Bottou
COS 424 – 4/1/2010
Agenda
– Goals: classification, clustering, regression, other.
– Representation: parametric vs. kernels vs. nonparametric; probabilistic vs. nonprobabilistic; linear vs. nonlinear; deep vs. shallow.
– Capacity control: explicit (architecture, feature selection; regularization, priors) and implicit (approximate optimization; Bayesian averaging, ensembles).
– Operational considerations: loss functions, budget constraints, online vs. offline.
– Computational considerations: exact algorithms for small datasets, stochastic algorithms for big datasets, parallel algorithms.
Summary
1. Maximizing margins.
2. Soft margins.
3. Kernels.
4. Kernels everywhere.
The curse of dimensionality
Polynomial classifiers in dimension d
Discriminant function: f(x) = w⊤Φ(x) + b.

Degree 1:   Dim(Φ(x)) = d        Φ(x) = [x_i], 1 ≤ i ≤ d
Degree 2:   Dim(Φ(x)) ≈ d²/2     Φ(x) += [x_i x_j], 1 ≤ i ≤ j ≤ d
Degree 3:   Dim(Φ(x)) ≈ d³/6     Φ(x) += [x_i x_j x_k], 1 ≤ i ≤ j ≤ k ≤ d
…
Degree n:   Dim(Φ(x)) ≈ dⁿ/n!

The number of parameters increases quickly. Training such a classifier directly requires a number of examples that increases just as quickly.
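To make the growth concrete, here is a small counting sketch (the helper name is ours, not part of the lecture). It counts the monomials of degree 1 through n in d variables, which matches the Dim(Φ(x)) column above.

```python
from math import comb

def poly_feature_dim(d, degree):
    # Monomials of degree exactly k in d variables: C(d + k - 1, k).
    # Summing k = 1..degree gives the size of the feature vector Phi(x).
    return sum(comb(d + k - 1, k) for k in range(1, degree + 1))

for d in (10, 100, 1000):
    print(d, [poly_feature_dim(d, n) for n in (1, 2, 3)])
```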
Beating the curse of dimensionality?
Capacity ≪ number of parameters
Assume the patterns x_1 … x_2l are known beforehand; the classes are unknown. Let R = max_i ‖x_i‖. We say that a hyperplane w⊤x + b, with w, x ∈ R^d and ‖w‖ = 1, separates the patterns with margin ∆ if

∀i = 1 … 2l,  |w⊤x_i + b| ≥ ∆.

The family F of ∆-margin separating hyperplanes has

log N(F, D) ≤ h log(2le/h)   with   h ≤ min(R²/∆², d) + 1.
Maximizing margins
Patterns x_i ∈ R^d, classes y_i = ±1.
(Figure: separating hyperplane with normal w and margin width 2∆.)

max_{w,b,∆} ∆
subject to ‖w‖ = 1 and ∀i, y_i(w⊤x_i + b) ≥ ∆.
Maximizing margins
Classic formulation
(Figure: separating hyperplane w with margin lines w⊤x + b = +1 and w⊤x + b = −1.)

min_{w,b} ‖w‖²
subject to ∀i, y_i(w⊤x_i + b) ≥ 1.

This is a quadratic programming problem with linear constraints.
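As a rough illustration (this is not how SVM solvers work in practice), the primal QP can be handed to a generic constrained optimizer on a tiny, made-up 2-D dataset. A minimal sketch assuming SciPy and NumPy are available:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny, linearly separable toy data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(wb):                      # minimize ||w||^2
    w = wb[:-1]
    return w @ w

def constraints(wb):                    # y_i (w.x_i + b) - 1 >= 0 for all i
    w, b = wb[:-1], wb[-1]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]),   # feasible starting point
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": constraints}])
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```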
Maximizing margins
Equivalence between the formulations
Let w′ = w/∆ and b′ = b/∆.
The constraint y_i(w⊤x_i + b) ≥ ∆ becomes y_i(w′⊤x_i + b′) ≥ 1.
The problem max_{w,b,∆} ∆ subject to ‖w‖ = 1 becomes min_{w′,b′} ‖w′‖².
Both discriminant functions w⊤x + b and w′⊤x + b′ describe the same decision boundary.
Primal and dual formulation
Karush-Kuhn-Tucker theory
– Refined theory for convex optimization under constraints.
– Construct a dual optimization problem whose constraints are simpler, and whose solution is related to the solution we seek.
Primal formulation: maximize the margin between the classes.
Dual formulation: minimize the distance between the convex hulls of the two classes.
(Figure: convex hulls of the positive and negative examples, with closest points A and B.)
Dual formulation
Min distance between convex hulls
– Point A: Σ_{i∈Pos} β_i x_i, subject to β_i ≥ 0 and Σ_{i∈Pos} β_i = 1.
– Point B: Σ_{i∈Neg} β_i x_i, subject to β_i ≥ 0 and Σ_{i∈Neg} β_i = 1.
– Vector BA: Σ_i y_i β_i x_i, subject to β_i ≥ 0, Σ_i β_i = 2, and Σ_i y_i β_i = 0.
Dual formulation
Min distance between convex hulls

min_β Σ_{ij} y_i y_j β_i β_j x_i⊤x_j
subject to ∀i, β_i ≥ 0, Σ_i y_i β_i = 0, and Σ_i β_i = 2.

Then w = Σ_i y_i β_i x_i.
Then b is easy to find by projecting all examples on w.
Dual formulation
Classic formulation (same min-distance-between-convex-hulls picture)

max_α Σ_i α_i − (1/2) Σ_{ij} y_i y_j α_i α_j x_i⊤x_j
subject to ∀i, α_i ≥ 0 and Σ_i y_i α_i = 0.

This is equivalent, with α_i = β_i ∆⁻², but the proof is nontrivial.
Support Vector Machines
Min distance between convex hulls

min_β Σ_{ij} y_i y_j β_i β_j x_i⊤x_j
subject to ∀i, β_i ≥ 0, Σ_i y_i β_i = 0, and Σ_i β_i = 2.

The only nonzero β_i are those corresponding to support vectors.
Leave-One-Out
Leave one out = n-fold cross-validation
– Compute classifiers f_i using the training set minus example (x_i, y_i).
– Estimate the test misclassification rate as E_LOO = (1/n) Σ_{i=1}^{n} 1{y_i f_i(x_i) ≤ 0}.

Leave one out for the maximal margin classifier
– Removing a non support vector does not change the classifier, hence
E_LOO ≤ (#support vectors) / (#examples).
– The important quantity is not the dimension but the number of support vectors.
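A quick empirical check of this bound, sketched with scikit-learn on synthetic data (any reasonably separable two-class set would do); with a very large C the soft-margin SVC below behaves like a hard-margin classifier:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)   # separable toy data
clf = SVC(kernel="linear", C=1e6)                            # large C ~ hard margin

loo_error = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
clf.fit(X, y)
print("leave-one-out error:", loo_error)
print("bound #SV / #examples:", len(clf.support_) / len(X))
```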
Soft margins
When the examples are not linearly separable, the constraints y_i(w⊤x_i + b) ≥ 1 cannot all be satisfied. We therefore add slack variables ξ_i:

min_{w,b,ξ} ‖w‖² + C Σ_{i=1}^{n} ξ_i
subject to ∀i, y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

Parameter C controls the relative importance of:
– correctly classifying all the training examples,
– obtaining the separation with the largest margin.
This reduces to hard margins when C = ∞.
Soft margins and Hinge loss
The soft margin problem

min_{w,b,ξ} ‖w‖² + C Σ_{i=1}^{n} ξ_i
subject to ∀i, y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

is the same thing as the unconstrained problem

min_{w,b} ‖w‖² + C Σ_{i=1}^{n} ℓ(y_i(w⊤x_i + b))

with the hinge loss ℓ(z) = max(0, 1 − z).
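A minimal NumPy sketch of that unconstrained objective (the function and variable names are ours):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)): the hinge-loss form above."""
    margins = y * (X @ w + b)
    return w @ w + C * np.maximum(0.0, 1.0 - margins).sum()
```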
Soft Margins
Primal formulation

min_{w,b,ξ} ‖w‖² + C Σ_{i=1}^{n} ξ_i
subject to ∀i, y_i(w⊤x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

Dual formulation

max_α Σ_i α_i − (1/2) Σ_{ij} y_i y_j α_i α_j x_i⊤x_j
subject to ∀i, 0 ≤ α_i ≤ C and Σ_i y_i α_i = 0.

The primal and dual solutions obey the relation w = Σ_{i=1}^{n} y_i α_i x_i.
The threshold b is easy to find once w is known.
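This primal–dual relation can be checked numerically. A sketch with scikit-learn on made-up data: SVC solves the dual, stores y_i α_i for the support vectors in dual_coef_, and the reconstructed w matches coef_:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + 2.0, rng.randn(20, 2) - 2.0])   # toy two-class data
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w_from_dual = clf.dual_coef_ @ clf.support_vectors_   # sum_i y_i alpha_i x_i
print(np.allclose(w_from_dual, clf.coef_))            # should print True
print("b =", clf.intercept_[0])
```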
Soft Margins
(Figure: geometry of the soft-margin solution, labeling examples with α_i = 0, examples with 0 < α_i < C, and examples with α_i = C and slack ξ_i.)
Beyond linear separation
Reintroducing the Φ(x)
– Define K(x, v) = Φ(x)⊤Φ(v).
– Dual optimization problem:
max_α Σ_i α_i − (1/2) Σ_{ij} y_i y_j α_i α_j K(x_i, x_j)
subject to ∀i, 0 ≤ α_i ≤ C and Σ_i y_i α_i = 0.
– Discriminant function:
f(x) = w⊤Φ(x) + b = Σ_{i=1}^{n} y_i α_i K(x_i, x) + b.

Curious fact
– We do not really need to compute Φ(x).
– The dot products K(x, v) = Φ(x)⊤Φ(v) are enough.
– Can we take advantage of this?
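A sketch of that curious fact with scikit-learn (the toy data is made up): the decision function of a kernel SVM can be reproduced from the support vectors and the kernel alone, without ever forming Φ(x):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)             # a nonlinearly separable toy task

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
K = rbf_kernel(X, clf.support_vectors_, gamma=1.0)     # K(x, x_i) for the support vectors
f = K @ clf.dual_coef_.ravel() + clf.intercept_[0]     # sum_i y_i alpha_i K(x_i, x) + b
print(np.allclose(f, clf.decision_function(X)))        # should print True
```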
Quadratic Kernel
Quadratic basis
Φ(x) = ( [x_i]_i , [x_i²]_i , [√2 x_i x_j]_{i<j} )

Dot product
Φ(x)⊤Φ(v) = Σ_i x_i v_i + Σ_i x_i² v_i² + Σ_{i<j} 2 x_i v_i x_j v_j

– Are there d(d + 3)/2 terms to add?
In fact,
Φ(x)⊤Φ(v) = Σ_i x_i v_i + Σ_i x_i² v_i² + Σ_{i<j} 2 x_i v_i x_j v_j
          = Σ_i x_i v_i + Σ_{i,j} x_i v_i x_j v_j
          = Σ_i x_i v_i + (Σ_i x_i v_i)²
          = (x⊤v) + (x⊤v)².

– There are only d terms to add!
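A numerical check of this identity (a small sketch; the explicit map is written out only to show that it is never needed):

```python
import numpy as np

def phi(x):
    # Explicit quadratic basis: [x_i], [x_i^2], [sqrt(2) x_i x_j for i < j].
    d = len(x)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([x, x ** 2, np.array(cross)])

rng = np.random.RandomState(0)
x, v = rng.randn(5), rng.randn(5)
print(np.allclose(phi(x) @ phi(v), (x @ v) + (x @ v) ** 2))   # should print True
```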
Polynomial kernel
Degree 1:   Dim(Φ(x)) = d        Φ(x)⊤Φ(v) = (x⊤v)
Degree 2:   Dim(Φ(x)) ≈ d²/2     Φ(x)⊤Φ(v) = (x⊤v) + (x⊤v)²
Degree 3:   Dim(Φ(x)) ≈ d³/6     Φ(x)⊤Φ(v) = (x⊤v) + (x⊤v)² + (x⊤v)³
…
Degree n:   Dim(Φ(x)) ≈ dⁿ/n!    Φ(x)⊤Φ(v) = (1 + x⊤v)ⁿ

The number of parameters increases exponentially, but the total computation remains nearly constant.
(Figures: decision boundaries obtained with linear, quadratic, degree-3, and degree-5 polynomial kernels.)
Polynomial kernels and more
Weighted polynomial kernel: K_d(x, v) = Σ_{i=0}^{d} (γⁱ/i!) (x⊤v)ⁱ.
– This is a polynomial kernel.
– The coefficient γ controls the relative importance of terms of various degrees.
Exponential kernel: K∞(x, v) = Σ_{i=0}^{∞} (γⁱ/i!) (x⊤v)ⁱ = exp(γ x⊤v).
– This is no longer a polynomial kernel.
– The dimension of Φ(x) is infinite.
– The computation remains finite.
Radial Basis Function kernel
Radial Basis Functions
– Approximating functions with expressions of the form f_w(x) = Σ_i w_i F(‖x − x_i‖).
– Gaussian kernel: F(r) = exp(−γr²).

Radial Basis Kernel
– Running an SVM with kernel K(x, v) = exp(−γ‖x − v‖²) results in a discriminant function
f_w(x) = Σ_i y_i α_i exp(−γ‖x − x_i‖²).

Questions
– Is there a function Φ that corresponds to this kernel?
– Does this work?
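It does work in practice. A small sketch with scikit-learn on made-up data, sweeping γ as in the figures that follow (a larger γ means narrower Gaussian bumps and usually more support vectors):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # a toy "ring" problem

for gamma in (0.1, 1.0, 10.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    print("gamma =", gamma, " support vectors:", len(clf.support_))
```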
(Figures: RBF kernel decision boundaries for gamma = 0.1, 1, 10, and 100.)
Mercer kernel
Definition
– A kernel K(x, v) is a Mercer kernel iff it is
1. symmetric: ∀x, v, K(x, v) = K(v, x);
2. positive: ∀k, ∀x_1 … x_k, ∀c_1 … c_k, Σ_{i,j=1}^{k} c_i c_j K(x_i, x_j) ≥ 0.

Mercer theorem
– For any Mercer kernel K(x, v) there exists a vector space Ω and a function Φ : x → Φ(x) ∈ Ω such that K(x, v) = Φ(x)⊤Φ(v).

Practical consequences
– We can create models by specifying basis functions Φ(x).
– We can also create models by specifying kernels K(x, v).
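The two conditions can at least be checked empirically on a finite Gram matrix; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def empirically_mercer(K, tol=1e-9):
    # Symmetric and positive semi-definite (smallest eigenvalue >= 0 up to tolerance).
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

rng = np.random.RandomState(0)
X = rng.randn(30, 4)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
print(empirically_mercer(np.exp(-1.0 * sq_dists)))   # Gaussian kernel: True
print(empirically_mercer(X @ X.T))                   # linear kernel: True
```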
Usual and customary kernels
Kernel          K(x, v)                  Decision boundary   Dim(Φ-space)
linear          x⊤v                      hyperplanes         n
quadratic       x⊤v + (x⊤v)²             conics              n(n+3)/2
d-polynomial    (1 + x⊤v)^d              ?                   C(n+d, d)
gaussian        exp(−γ‖x − v‖²)          smooth              ∞
More kernels
Kernel                  K(x, v)
spline                  1 + x⊤v + Σ_{j=1}^{d} ∫_{−R}^{R} [x_j − t]_+ [v_j − t]_+ dt
multilayer perceptron   tanh(α x⊤v − β)
sum                     Σ_j λ_j K_j(x, v), with λ_j ≥ 0
tensor product          Π_j K_j(x_j, v_j)
Exotic kernels (1)
The input space need not be a vector space. Kernels defined on histograms and probability density functions:

Kernel      K(x, v)
Kullback    exp(−β(D(x‖v) + D(v‖x)))
Jensen      exp(−β(D(x‖(x+v)/2) + D(v‖(x+v)/2)))
Hellinger   exp(−β ∫ (√x(t) − √v(t))² dt)
Exotic kernels (2)
The input space need not be a vector space. Kernels defined on sequences:

Kernel      K(x, v)
Fisher      (∂ log L/∂λ (x))⊤ (∂ log L/∂λ (v)), where L(·) is the likelihood of an H.M.M.
string      number of common substrings of length d
rational    defined by certain finite state automata
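The string kernel in this table is easy to sketch: count common length-d substrings with multiplicity, which is an inner product of d-gram count vectors. The function and test strings below are our own illustration:

```python
def string_kernel(s, t, d=3):
    # K(s, t) = sum over length-d substrings u of count_s(u) * count_t(u).
    grams_s = [s[i:i + d] for i in range(len(s) - d + 1)]
    grams_t = [t[i:i + d] for i in range(len(t) - d + 1)]
    return sum(grams_s.count(u) for u in grams_t)

print(string_kernel("kernel machines", "kernel methods"))
```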
Kernels everywhere
Kernel Principal Component Analysis: compute principal subspaces in feature space.
– Eigenvectors in Φ-space are defined as E_p = Σ_i α_{i,p} Φ(x_i).
– One cannot in general find pre-images e_k such that E_k = Φ(e_k), but one can still extract the components along the principal subspace: s_k(x) = Σ_i α_{i,k} K(x, x_i).
– Related to Isomap, LLE, and Spectral Clustering.
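A sketch with scikit-learn's KernelPCA on made-up data: fit_transform returns the component scores s_k(x) described above (up to centering of the kernel).

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.randn(200, 2)   # noisy circle

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
scores = kpca.fit_transform(X)          # s_k(x) = sum_i alpha_{i,k} K(x, x_i)
print(scores.shape)                     # (200, 2)
```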
Kernels everywhere
One-Class Support Vector Machines: locate the support of the data distribution.
Assume ‖x_i‖ = 1. Minimize ‖w‖² subject to ∀i, w⊤x_i ≥ 1. Best done in Φ-space, of course.
Example: novelty detection.
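A novelty-detection sketch with scikit-learn's OneClassSVM (the data is made up); points far from the training distribution are flagged with −1:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                               # "normal" observations
detector = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
print(detector.predict(X_new))                            # +1 = inlier, -1 = novelty
```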
Kernels everywhere
More kernel algorithms. Kernelizing a standard algorithm was in fashion:
– SVR: Support Vector Regression
– KLDA: Kernel Linear Discriminant Analysis
– LS-SVM: Least Squares Support Vector Machine
– KLR: Kernel Logistic Regression
– …
and led to the rediscovery of old algorithms:
– Aizerman–Braverman potential functions → kernel perceptron, kernel Adatron, etc.
– Gaussian processes → kriging.
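For instance, the kernel perceptron keeps one coefficient per training example and predicts with f(x) = Σ_i α_i y_i K(x_i, x). A minimal sketch (our own toy implementation, not code from the lecture):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    # alpha[i] counts how many times example i was misclassified during training.
    n = len(X)
    alpha = np.zeros(n)
    K = kernel(X, X)                                   # precomputed Gram matrix
    for _ in range(epochs):
        for i in range(n):
            f_i = (alpha * y) @ K[:, i]                # f(x_i) = sum_j alpha_j y_j K(x_j, x_i)
            if y[i] * f_i <= 0:                        # mistake (or tie): update
                alpha[i] += 1
    return alpha
```

Any Mercer kernel from the previous slides can be plugged in as the `kernel` argument, for example a function returning an RBF Gram matrix.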
Conclusion
Soft-margin SVM:
– a classifier using the hinge loss,
– with a kernel representation,
– and capacity control through regularization.
Obvious variants: change the loss, change the representation, change the regularizer…
Outlook
Success stories
– Text categorization.
– Classification tasks in general: the best classifier can change a lot, but the SVM is rarely far away.
Weak points
– Computationally costly with noisy data.
– L2 regularization works poorly when irrelevant inputs abound.