

  1. NPFL129, Lecture 7: SMO Algorithm. Milan Straka, December 02, 2019. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

  2. Kernel Linear Regression. When the dimensionality of the input is $D$, one step of SGD takes $O(D)$. Surprisingly, we can do better under some circumstances. We start by noting that we can write the parameters $w$ as a linear combination of the input features $\varphi(x_i)$. By induction, $w = 0 = \sum_i 0 \cdot \varphi(x_i)$, and assuming $w = \sum_i \beta_i \varphi(x_i)$, after an SGD update we get
$$w \leftarrow w + \alpha \sum_i \big(t_i - w^T \varphi(x_i)\big)\varphi(x_i) = \sum_i \big(\beta_i + \alpha(t_i - w^T \varphi(x_i))\big)\varphi(x_i).$$
An individual update is therefore $\beta_i \leftarrow \beta_i + \alpha\big(t_i - w^T \varphi(x_i)\big)$, and substituting for $w$ we get
$$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum_j \beta_j \varphi(x_j)^T \varphi(x_i)\Big).$$

  3. Kernel Linear Regression. We can formulate an alternative linear regression algorithm (it would be called a dual formulation):
Input: dataset ($X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{N \times D}$, $t \in \mathbb{R}^N$), learning rate $\alpha \in \mathbb{R}^+$.
Set $\beta_i \leftarrow 0$.
Compute all values $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
Repeat: update the coordinates, either according to a full gradient update,
$$\beta \leftarrow \beta + \alpha(t - K\beta),$$
or alternatively use single-batch SGD, arriving at: for $i$ in a random permutation of $\{1, \ldots, N\}$,
$$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum_j \beta_j K(x_i, x_j)\Big).$$
In vector notation, we can write $\beta \leftarrow \beta + \alpha(t - K\beta)$.
The predictions are then performed by computing $y(x) = w^T \varphi(x) = \sum_i \beta_i \varphi(x_i)^T \varphi(x)$.
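To make the dual formulation concrete, the following is a minimal NumPy sketch of the algorithm above. The RBF kernel, the toy data and all hyperparameter values are illustrative assumptions, not part of the lecture.

    import numpy as np

    def rbf_kernel(A, B, gamma=1.0):
        # An assumed example kernel: K(x, z) = exp(-gamma * ||x - z||^2).
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def dual_linear_regression(X, t, alpha=0.05, epochs=500, kernel=rbf_kernel):
        N = len(X)
        K = kernel(X, X)                 # precompute all K(x_i, x_j)
        beta = np.zeros(N)               # set beta_i <- 0
        for _ in range(epochs):
            for i in np.random.permutation(N):        # single-batch SGD
                beta[i] += alpha * (t[i] - K[i] @ beta)
        # Predictions: y(x) = sum_i beta_i K(x_i, x).
        return lambda X_new: kernel(X_new, X) @ beta

    # Illustrative usage on toy one-dimensional data.
    X = np.linspace(-1, 1, 20).reshape(-1, 1)
    t = np.sin(3 * X[:, 0])
    predict = dual_linear_regression(X, t)
    print(np.abs(predict(X) - t).max())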

  4. Support Vector Machines. Assume we have a dataset $X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$, a feature map $\varphi$ and a model
$$y(x) \stackrel{\text{def}}{=} \varphi(x)^T w + b.$$
We already know that the distance of a point $x_i$ to the decision boundary is
$$\frac{|y(x_i)|}{\|w\|} = \frac{t_i y(x_i)}{\|w\|}.$$
We therefore want to maximize
$$\arg\max_{w,b} \frac{1}{\|w\|} \min_i \big[ t_i \big(\varphi(x_i)^T w + b\big) \big].$$
However, this problem is difficult to optimize directly. (Figure 4.1 of Pattern Recognition and Machine Learning.)
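For completeness, the distance formula follows from a standard geometric argument (in the spirit of Section 4.1 of Pattern Recognition and Machine Learning; this step is not spelled out on the slide). Decompose the feature vector into a component lying on the decision boundary and a component along $w$:
$$\varphi(x) = \varphi_\perp + r\,\frac{w}{\|w\|}, \qquad \text{where } w^T \varphi_\perp + b = 0.$$
Then
$$y(x) = w^T \varphi(x) + b = r\,\frac{w^T w}{\|w\|} = r\,\|w\|, \qquad\text{so}\qquad r = \frac{y(x)}{\|w\|},$$
and for a correctly classified point $x_i$ the signed distance $r$ has the same sign as $t_i$, hence $|y(x_i)| = t_i y(x_i)$.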

  5. Support Vector Machines. Because the model is invariant to multiplying $w$ and $b$ by a constant, we can say that for the points closest to the decision boundary it will hold that $t_i y(x_i) = 1$. Then for all the points we will have $t_i y(x_i) \geq 1$, and we can simplify
$$\arg\max_{w,b} \frac{1}{\|w\|} \min_i \big[ t_i \big(\varphi(x_i)^T w + b\big) \big]$$
to
$$\arg\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{given that} \quad t_i y(x_i) \geq 1,$$
because maximizing $1/\|w\|$ is equivalent to minimizing $\|w\|$, and the square together with the factor $\tfrac{1}{2}$ is added only for convenience of the later derivatives.

  6. Support Vector Machines. In order to solve the constrained problem of
$$\arg\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{given that} \quad t_i y(x_i) \geq 1,$$
we write the Lagrangian with multipliers $a = (a_1, \ldots, a_N)$ as
$$L = \frac{1}{2}\|w\|^2 - \sum_i a_i \big[ t_i y(x_i) - 1 \big].$$
Setting the derivatives with respect to $w$ and $b$ to zero, we get
$$w = \sum_i a_i t_i \varphi(x_i), \qquad 0 = \sum_i a_i t_i.$$
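The two conditions are obtained directly by differentiating the Lagrangian (a routine step the slide leaves implicit); recalling that $y(x_i) = \varphi(x_i)^T w + b$,
$$\frac{\partial L}{\partial w} = w - \sum_i a_i t_i \varphi(x_i) = 0 \;\Rightarrow\; w = \sum_i a_i t_i \varphi(x_i), \qquad \frac{\partial L}{\partial b} = -\sum_i a_i t_i = 0 \;\Rightarrow\; \sum_i a_i t_i = 0.$$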

  7. Support Vector Machines. Substituting these into the Lagrangian, we get
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$
with respect to the constraints $\forall i: a_i \geq 0$, $\sum_i a_i t_i = 0$, and the kernel $K(x, z) = \varphi(x)^T \varphi(z)$. The solution of this Lagrangian will fulfil the KKT conditions, meaning that
$$a_i \geq 0, \qquad t_i y(x_i) - 1 \geq 0, \qquad a_i \big( t_i y(x_i) - 1 \big) = 0.$$
Therefore, either a point is on the boundary, or $a_i = 0$. Given that the predictions for a point $x$ are given by $y(x) = \sum_i a_i t_i K(x, x_i) + b$, we need to keep only the points on the boundary, the so-called support vectors.
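The following is a minimal sketch of prediction using only the support vectors; the argument names (the dual variables a, the bias b and a kernel callable) are assumptions made for this example, not an API from the lecture.

    import numpy as np

    def svm_predict(x_new, X, t, a, b, kernel, tol=1e-8):
        # Support vectors are exactly the training points with a_i > 0;
        # all other terms of y(x) = sum_i a_i t_i K(x, x_i) + b vanish.
        support = a > tol
        k = np.array([kernel(x_new, x_i) for x_i in X[support]])
        return (a[support] * t[support]) @ k + b

In practice, only the support vectors (often a small fraction of the training set) need to be stored after training.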

  8. Support Vector Machines. The dual formulation allows us to use non-linear kernels. (Figure 7.2 of Pattern Recognition and Machine Learning.)

  9. Support Vector Machines for Non-linearly Separable Data. Until now, we assumed the data to be linearly separable – the hard-margin SVM variant. We now relax this condition to arrive at soft-margin SVM. The idea is to allow points to be in the margin or even on the wrong side of the decision boundary. We introduce slack variables $\xi_i \geq 0$, one for each training instance, defined as
$$\xi_i = \begin{cases} 0 & \text{for points fulfilling } t_i y(x_i) \geq 1, \\ |t_i - y(x_i)| & \text{otherwise.} \end{cases}$$
Therefore, $\xi_i = 0$ signifies a point outside of the margin, $0 < \xi_i < 1$ denotes a point inside the margin, $\xi_i = 1$ is a point on the decision boundary, and $\xi_i > 1$ indicates the point is on the opposite side of the separating hyperplane. We therefore want to optimize
$$\arg\min_{w,b} C \sum_i \xi_i + \frac{1}{2}\|w\|^2 \quad \text{given that} \quad t_i y(x_i) \geq 1 - \xi_i \ \text{and}\ \xi_i \geq 0.$$
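Note that for targets $t_i \in \{-1, 1\}$ the slack definition above reduces to the hinge loss $\xi_i = \max(0, 1 - t_i y(x_i))$. A small sketch of the resulting objective follows, assuming for simplicity a linear feature map $\varphi(x) = x$ (an illustrative assumption).

    import numpy as np

    def soft_margin_objective(w, b, X, t, C=1.0):
        y = X @ w + b                              # y(x_i) with phi(x) = x
        xi = np.maximum(0.0, 1.0 - t * y)          # slack variables (hinge loss)
        return C * xi.sum() + 0.5 * w @ w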

  10. Support Vector Machines for Non-linearly Separable Data. We again create a Lagrangian, this time with multipliers $a = (a_1, \ldots, a_N)$ and also $\mu = (\mu_1, \ldots, \mu_N)$:
$$L = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i a_i \big[ t_i y(x_i) - 1 + \xi_i \big] - \sum_i \mu_i \xi_i.$$
Solving for the critical points and substituting for $w$, $b$ and $\xi$ (obtaining an additional constraint $\mu_i = C - a_i$ compared to the previous case), we obtain the Lagrangian in the form
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j),$$
which is identical to the previous case, but the constraints are a bit different:
$$\forall i: \quad C \geq a_i \geq 0 \quad\text{and}\quad \sum_i a_i t_i = 0.$$
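The additional constraint comes from the stationarity condition with respect to $\xi_i$ (a short step not shown on the slide):
$$\frac{\partial L}{\partial \xi_i} = C - a_i - \mu_i = 0 \;\Rightarrow\; \mu_i = C - a_i,$$
and because $\mu_i \geq 0$ (it is the multiplier of the constraint $\xi_i \geq 0$), this implies $a_i \leq C$, which together with $a_i \geq 0$ gives the box constraint $0 \leq a_i \leq C$.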

  11. Support Vector Machines for Non-linearly Separable Data. Using the KKT conditions, we can see that the support vectors (examples with $a_i > 0$) are the ones with $t_i y(x_i) = 1 - \xi_i$, i.e., the examples on the margin boundary, inside the margin, and on the opposite side of the decision boundary. (Figure 7.4 of Pattern Recognition and Machine Learning.)

  12. Sequential Minimal Optimization Algorithm. To solve the dual formulation of an SVM, the Sequential Minimal Optimization algorithm (SMO; John Platt, 1998) is usually used. Before we introduce it, we start with the coordinate descent optimization algorithm. Consider solving the unconstrained optimization problem
$$\arg\min_w L(w_1, w_2, \ldots, w_D).$$
Instead of the usual SGD approach, we could optimize the weights one by one, using the following algorithm:
loop until convergence:
  for $i$ in $\{1, 2, \ldots, D\}$:
    $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$

  13. Sequential Minimal Optimization Algorithm.
loop until convergence:
  for $i$ in $\{1, 2, \ldots, D\}$:
    $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$
If the inner $\arg\min$ can be performed efficiently, the coordinate descent can be fairly efficient. Note that we might want to choose the coordinates $w_i$ in a different order, for example by trying to choose the $w_i$ providing the largest decrease of $L$. (CS229 Lecture 3 Notes, http://cs229.stanford.edu/notes/cs229-notes3.pdf)
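To make coordinate descent concrete, here is a small self-contained sketch that minimizes a quadratic $L(w) = \frac{1}{2} w^T A w - b^T w$ by solving the inner $\arg\min$ exactly for each coordinate; the quadratic objective and the data are illustrative assumptions, not part of the lecture.

    import numpy as np

    def coordinate_descent(A, b, iterations=100):
        # Minimize L(w) = 0.5 * w^T A w - b^T w for symmetric positive definite A.
        # The inner argmin has a closed form: dL/dw_i = A[i] @ w - b[i] = 0.
        D = len(b)
        w = np.zeros(D)
        for _ in range(iterations):
            for i in range(D):
                w[i] = (b[i] - A[i] @ w + A[i, i] * w[i]) / A[i, i]
        return w

    # Illustrative usage: the exact minimizer is A^{-1} b.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(coordinate_descent(A, b), np.linalg.solve(A, b))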

  14. Sequential Minimal Optimization Algorithm. In soft-margin SVM, we try to maximize
$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j),$$
such that
$$\forall i: \quad C \geq a_i \geq 0 \quad\text{and}\quad \sum_i a_i t_i = 0.$$
The KKT conditions for the solution can be reformulated (while staying equivalent) as
$$a_i > 0 \Rightarrow t_i y(x_i) \leq 1, \quad\text{because } a_i > 0 \Rightarrow t_i y(x_i) = 1 - \xi_i \text{ and we have } \xi_i \geq 0,$$
$$a_i < C \Rightarrow t_i y(x_i) \geq 1, \quad\text{because } a_i < C \Rightarrow \mu_i > 0 \Rightarrow \xi_i = 0 \text{ and } t_i y(x_i) \geq 1 - \xi_i,$$
$$0 < a_i < C \Rightarrow t_i y(x_i) = 1, \quad\text{a combination of both.}$$
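SMO uses these reformulated conditions to pick the multipliers to optimize: it repeatedly selects an $a_i$ that violates them (up to a tolerance) and optimizes it jointly with a second multiplier so that $\sum_i a_i t_i = 0$ keeps holding. A minimal sketch of such a violation check, with made-up argument names and tolerance, might look as follows.

    def violates_kkt(a_i, t_i, y_i, C, tol=1e-3):
        # Reformulated KKT conditions for the soft-margin dual:
        #   a_i > 0  =>  t_i * y(x_i) <= 1
        #   a_i < C  =>  t_i * y(x_i) >= 1
        margin = t_i * y_i
        return (a_i > 0 and margin > 1 + tol) or (a_i < C and margin < 1 - tol)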
