NPFL129, Lecture 7
SMO Algorithm
Milan Straka
December 02, 2019
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
When the dimensionality of the input is $D$, one step of SGD takes $O(D^3)$ (consider for example a cubic feature map $\varphi$, which produces $O(D^3)$ features). Surprisingly, we can do better under some circumstances.

We start by noting that we can write the parameters $w$ as a linear combination of the input features $\varphi(x_i)$. By induction, $w = 0 = \sum_i 0 \cdot \varphi(x_i)$, and assuming $w = \sum_i \beta_i \cdot \varphi(x_i)$, after an SGD update we get

$$w \leftarrow w + \alpha \sum_i \big(t_i - w^T \varphi(x_i)\big)\varphi(x_i) = \sum_i \big(\beta_i + \alpha(t_i - w^T \varphi(x_i))\big)\varphi(x_i).$$

An individual update is $\beta_i \leftarrow \beta_i + \alpha\big(t_i - w^T \varphi(x_i)\big)$, and substituting for $w$ we get

$$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum_j \beta_j \varphi(x_j)^T \varphi(x_i)\Big).$$
We can formulate an alternative linear regression algorithm (it would be called a dual formulation):

Input: Dataset ($X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{N \times D}$, $t \in \mathbb{R}^N$), learning rate $\alpha \in \mathbb{R}^+$.

- Set $\beta_i \leftarrow 0$.
- Compute all values $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
- Repeat:
  - Update the coordinates, either according to a full gradient update
    $$\beta \leftarrow \beta + \alpha(t - K\beta),$$
  - or alternatively update them one by one, i.e., for $i$ in a random permutation of $\{1, \ldots, N\}$:
    $$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum_j \beta_j K(x_i, x_j)\Big).$$

In vector notation, the update can be written as $\beta \leftarrow \beta + \alpha(t - K\beta)$. The predictions are then performed by computing

$$y(x) = w^T \varphi(x) = \sum_i \beta_i \varphi(x_i)^T \varphi(x).$$
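The dual updates above can be rendered in a few lines of NumPy. The following is a minimal sketch under the stated update rules, not the course's reference implementation; the function names, the epoch-based loop, and the default dot-product kernel are illustrative assumptions.

```python
import numpy as np

def dual_linear_regression(X, t, alpha=0.01, epochs=100, kernel=None):
    """Minimal sketch of the dual (kernelized) linear regression above.

    `kernel(a, b)` is a hypothetical callable; the default dot product
    recovers ordinary linear regression."""
    if kernel is None:
        kernel = lambda u, v: u @ v

    N = len(X)
    # Precompute the Gram matrix K[i, j] = K(x_i, x_j).
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

    beta = np.zeros(N)
    for _ in range(epochs):
        # Coordinate-wise updates in a random permutation of the examples.
        for i in np.random.permutation(N):
            beta[i] += alpha * (t[i] - K[i] @ beta)
    return beta

def dual_predict(beta, X_train, x, kernel=lambda u, v: u @ v):
    """Prediction y(x) = sum_i beta_i K(x_i, x)."""
    return sum(b_i * kernel(x_i, x) for b_i, x_i in zip(beta, X_train))
```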
Figure 4.1 of Pattern Recognition and Machine Learning.
Assume we have a dataset $X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$, a feature map $\varphi$ and the model

$$y(x) \stackrel{\text{def}}{=} \varphi(x)^T w + b.$$

We already know that the distance of a point $x_i$ to the decision boundary is

$$\frac{|y(x_i)|}{\lVert w \rVert} = \frac{t_i\, y(x_i)}{\lVert w \rVert}.$$

We therefore want to maximize

$$\arg\max_{w,b}\; \frac{1}{\lVert w \rVert} \min_i \big[t_i \big(\varphi(x_i)^T w + b\big)\big].$$

However, this problem is difficult to optimize directly.
Because the model is invariant to multiplying $w$ and $b$ by a constant, we can say that for the points closest to the decision boundary it will hold that $t_i\, y(x_i) = 1$. Then for all the points we will have $t_i\, y(x_i) \ge 1$, and we can simplify

$$\arg\max_{w,b}\; \frac{1}{\lVert w \rVert} \min_i \big[t_i \big(\varphi(x_i)^T w + b\big)\big]$$

to

$$\arg\min_{w,b}\; \tfrac{1}{2}\lVert w \rVert^2 \quad\text{given that}\quad t_i\, y(x_i) \ge 1.$$
In order to solve the constrained problem of

$$\arg\min_{w,b}\; \tfrac{1}{2}\lVert w \rVert^2 \quad\text{given that}\quad t_i\, y(x_i) \ge 1,$$

we write the Lagrangian with multipliers $a = (a_1, \ldots, a_N)$ as

$$L = \tfrac{1}{2}\lVert w \rVert^2 - \sum_i a_i \big[t_i\, y(x_i) - 1\big].$$

Setting the derivatives with respect to $w$ and $b$ to zero, we get

$$w = \sum_i a_i t_i \varphi(x_i),$$
$$0 = \sum_i a_i t_i.$$
Substituting these to the Lagrangian, we get

$$L = \sum_i a_i - \tfrac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$

with respect to the constraints $\forall i\colon a_i \ge 0$, $\sum_i a_i t_i = 0$, and kernel $K(x, z) = \varphi(x)^T \varphi(z)$.

The solution of this Lagrangian will fulfil the KKT conditions, meaning that

$$a_i \ge 0,$$
$$t_i\, y(x_i) - 1 \ge 0,$$
$$a_i \big(t_i\, y(x_i) - 1\big) = 0.$$

Therefore, either a point is on the boundary, or $a_i = 0$. Given that the predictions for a point $x$ are given by $y(x) = \sum_i a_i t_i K(x, x_i) + b$, we need to keep only the points on the boundary, the so-called support vectors.
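To make the role of the support vectors concrete, here is a small hedged sketch of how a prediction would be computed once the multipliers $a_i$ and the bias $b$ are known; only examples with $a_i > 0$ contribute. The function and variable names, and the example RBF kernel, are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def svm_predict(x, support_X, support_t, support_a, b, kernel):
    """y(x) = sum_i a_i t_i K(x, x_i) + b, summed over support vectors only.

    support_X, support_t, support_a are the training examples, targets and
    Lagrange multipliers restricted to indices with a_i > 0."""
    return sum(a_i * t_i * kernel(x, x_i)
               for a_i, t_i, x_i in zip(support_a, support_t, support_X)) + b

# Example of an RBF kernel that could be plugged in.
rbf = lambda x, z, gamma=1.0: np.exp(-gamma * np.sum((x - z) ** 2))
```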
The dual formulation allows us to use non-linear kernels.
Figure 7.2 of Pattern Recognition and Machine Learning.
Until now, we assumed the data to be linearly separable – the hard-margin SVM variant. We now relax this condition to arrive at the soft-margin SVM. The idea is to allow points to be in the margin or even on the wrong side of the decision boundary. We introduce slack variables $\xi_i \ge 0$, one for each training instance, defined as

$$\xi_i = \begin{cases} 0 & \text{for points fulfilling } t_i\, y(x_i) \ge 1,\\ |t_i - y(x_i)| & \text{otherwise.}\end{cases}$$

Therefore, $\xi_i = 0$ signifies a point outside of the margin, $0 < \xi_i < 1$ denotes a point inside the margin, $\xi_i = 1$ is a point on the decision boundary, and $\xi_i > 1$ indicates the point is on the wrong side of the decision boundary.

Therefore, we want to optimize

$$\arg\min_{w,b}\; C \sum_i \xi_i + \tfrac{1}{2}\lVert w \rVert^2 \quad\text{given that}\quad t_i\, y(x_i) \ge 1 - \xi_i \text{ and } \xi_i \ge 0.$$
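The slack definition is easy to compute directly; the following sketch (with an assumed helper name) evaluates it for raw predictions $y(x_i)$ and targets $t_i \in \{-1, 1\}$.

```python
import numpy as np

def slack_variables(y_pred, t):
    """Slack xi_i = 0 when t_i * y(x_i) >= 1, otherwise |t_i - y(x_i)|.

    A sketch of the definition above, not code from the lecture."""
    y_pred, t = np.asarray(y_pred), np.asarray(t)
    return np.where(t * y_pred >= 1, 0.0, np.abs(t - y_pred))

# Example: a point outside the margin gets xi = 0, a point inside the margin
# gets 0 < xi < 1, and a point on the wrong side of the boundary gets xi > 1.
print(slack_variables([1.5, 0.4, -0.3], [1, 1, 1]))  # [0.0, 0.6, 1.3]
```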
We again create a Lagrangian, this time with multipliers $a = (a_1, \ldots, a_N)$ and also $\mu = (\mu_1, \ldots, \mu_N)$:

$$L = \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i - \sum_i a_i \big[t_i\, y(x_i) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i.$$

Solving for the critical points and substituting for $w$, $b$ and $\xi$ (obtaining an additional constraint $\mu_i = C - a_i$ compared to the previous case), we obtain the Lagrangian in the form

$$L = \sum_i a_i - \tfrac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j),$$

which is identical to the previous case, but the constraints are a bit different:

$$\forall i\colon\quad C \ge a_i \ge 0 \quad\text{and}\quad \sum_i a_i t_i = 0.$$
Using the KKT conditions, we can see that the support vectors (examples with $a_i > 0$) are the examples with $t_i\, y(x_i) = 1 - \xi_i$, i.e., the examples on the margin boundary, inside the margin, and on the wrong side of the decision boundary.

Figure 7.4 of Pattern Recognition and Machine Learning.
To solve the dual formulation of an SVM, usually the Sequential Minimal Optimization (SMO; John Platt, 1998) algorithm is used. Before we introduce it, we start by introducing the coordinate descent optimization algorithm.

Consider solving the unconstrained optimization problem

$$\arg\min_w L(w_1, w_2, \ldots, w_D).$$

Instead of the usual SGD approach, we could optimize the weights one by one, using the following algorithm:

- loop until convergence
  - for $i$ in $\{1, 2, \ldots, D\}$:
    - $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$
CS229 Lecture 3 Notes, http://cs229.stanford.edu/notes/cs229-notes3.pdf
If the inner $\arg\min$ can be performed efficiently, coordinate descent can be fairly efficient.

Note that we might also want to choose the coordinates $w_i$ in a different order, for example by preferring the coordinate providing the largest decrease of $L$.
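As a concrete illustration, the sketch below runs coordinate descent on a simple quadratic, where the inner $\arg\min$ has a closed form; the quadratic objective and the helper name are assumptions made only for this example.

```python
import numpy as np

def coordinate_descent(A, b, iterations=100):
    """Minimize L(w) = 1/2 w^T A w - b^T w by exact coordinate updates.

    For this quadratic (A positive definite), the inner arg min over w_i has
    the closed form w_i = (b_i - sum_{j != i} A_ij w_j) / A_ii."""
    D = len(b)
    w = np.zeros(D)
    for _ in range(iterations):
        for i in range(D):
            w[i] = (b[i] - A[i] @ w + A[i, i] * w[i]) / A[i, i]
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(coordinate_descent(A, b))  # close to np.linalg.solve(A, b)
```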
In the soft-margin SVM, we maximize

$$L = \sum_i a_i - \tfrac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$

such that

$$\forall i\colon\quad C \ge a_i \ge 0 \quad\text{and}\quad \sum_i a_i t_i = 0.$$

The KKT conditions for the solution can be reformulated (while staying equivalent) as

- $a_i > 0 \Rightarrow t_i\, y(x_i) \le 1$, because $a_i > 0 \Rightarrow t_i\, y(x_i) = 1 - \xi_i$ and we have $\xi_i \ge 0$,
- $a_i < C \Rightarrow t_i\, y(x_i) \ge 1$, because $a_i < C \Rightarrow \mu_i > 0 \Rightarrow \xi_i = 0$ and $t_i\, y(x_i) \ge 1 - \xi_i$,
- $0 < a_i < C \Rightarrow t_i\, y(x_i) = 1$, a combination of both.
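A small sketch of how these reformulated conditions could be checked numerically with a tolerance (the function name and the tolerance handling are assumptions, not taken from the lecture):

```python
import numpy as np

def kkt_violations(a, t, y_pred, C, tol=1e-3):
    """Boolean mask of examples violating the reformulated KKT conditions.

    a: Lagrange multipliers, t: targets in {-1, 1}, y_pred: values y(x_i)."""
    a = np.asarray(a)
    ty = np.asarray(t) * np.asarray(y_pred)
    # a_i > 0 requires t_i y(x_i) <= 1, a_i < C requires t_i y(x_i) >= 1.
    return ((a > 0) & (ty > 1 + tol)) | ((a < C) & (ty < 1 - tol))
```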
At its core, the SMO algorithm is just coordinate descent. It tries to find multipliers $a_i$ fulfilling the KKT conditions – for the soft-margin SVM, the KKT conditions are sufficient conditions for optimality (the objective is concave and the inequality constraints are affine).

However, note that because of the constraint $\sum_i a_i t_i = 0$, we cannot optimize just one $a_i$, because a single $a_i$ is determined from the others. Therefore, in each step we pick two coefficients $a_i, a_j$ and try to maximize $L$ while fulfilling the constraints:

- loop until convergence (until $\forall i\colon a_i < C \Rightarrow t_i\, y(x_i) \ge 1$ and $a_i > 0 \Rightarrow t_i\, y(x_i) \le 1$)
  - for $i$ in $\{1, 2, \ldots, N\}$, for $j \ne i$ in $\{1, 2, \ldots, N\}$:
    - $a_i, a_j \leftarrow \arg\max_{a_i, a_j} L(a_1, a_2, \ldots, a_N)$ such that $C \ge a_i \ge 0$, $\sum_i a_i t_i = 0$
The SMO is an efficient algorithm, because the update of $a_i, a_j$ can be computed in closed form.

Assume that we are updating $a_i$ and $a_j$. Then from the condition $\sum_k a_k t_k = 0$ we can write $a_i t_i = -\sum_{k \ne i} a_k t_k$. Given that $t_i^2 = 1$ and denoting $\zeta = -\sum_{k \ne i, k \ne j} a_k t_k$, we get

$$a_i = t_i (\zeta - a_j t_j).$$

Optimizing $L(a)$ with respect to $a_i$ and $a_j$ then amounts to maximizing a quadratic function of $a_j$, which has an analytical solution.

Note that the real SMO algorithm has several heuristics for choosing $a_i, a_j$ such that $L$ can be improved the most.
Input: Dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$), kernel $K$, regularization parameter $C$, tolerance $tol$, $max\_passes\_without\_a\_changing$ value

- Initialize $a_i \leftarrow 0$, $b \leftarrow 0$, $passes \leftarrow 0$.
- while $passes < max\_passes\_without\_a\_changing$:
  - $changed\_as \leftarrow 0$
  - for $i$ in $1, 2, \ldots, N$:
    - $E_i \leftarrow y(x_i) - t_i$
    - if ($a_i < C$ and $t_i E_i < -tol$) or ($a_i > 0$ and $t_i E_i > tol$):
      - Choose $j \ne i$ randomly.
      - Update $a_i$, $a_j$ and $b$.
      - $changed\_as \leftarrow changed\_as + 1$
  - if $changed\_as = 0$: $passes \leftarrow passes + 1$
  - else: $passes \leftarrow 0$
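Rendered as Python, the outer loop of this sketch could look as follows; `predict` and `update_a_i_a_j_and_b` stand for the prediction function and the update rules of the following slides and are left abstract here – their names and signatures are illustrative assumptions.

```python
import random

def smo_outer_loop(X, t, C, tol, max_passes_without_a_changing,
                   predict, update_a_i_a_j_and_b):
    """Sketch of the SMO driver loop above.

    `predict(a, b, i)` should return y(x_i); `update_a_i_a_j_and_b(i, j, a, b)`
    should apply the closed-form updates in place and return the new bias
    (or None when no progress was possible for this pair)."""
    N = len(X)
    a, b, passes = [0.0] * N, 0.0, 0
    while passes < max_passes_without_a_changing:
        changed_as = 0
        for i in range(N):
            E_i = predict(a, b, i) - t[i]
            if (a[i] < C and t[i] * E_i < -tol) or (a[i] > 0 and t[i] * E_i > tol):
                j = random.choice([k for k in range(N) if k != i])
                new_b = update_a_i_a_j_and_b(i, j, a, b)
                if new_b is not None:
                    b = new_b
                    changed_as += 1
        passes = passes + 1 if changed_as == 0 else 0
    return a, b
```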
Update $a_i$, $a_j$ and $b$:

- Express $a_i$ using $a_j$.
- Find $a_j$.
- Clip $a_j$ so that $0 \le a_i, a_j \le C$.
- Compute the corresponding $a_i$.
- Compute $b$ matching the updated $a_i$, $a_j$.
We already know that

$$a_i = t_i (\zeta - a_j t_j).$$

To find $a_j$ optimizing $L$, we use

$$a_j^{new} \leftarrow a_j - \frac{\partial L / \partial a_j}{\partial^2 L / \partial a_j^2},$$

which is in fact one Newton-Raphson iteration step. Denoting $E_j \stackrel{\text{def}}{=} y(x_j) - t_j$, we can compute the first derivative as

$$\frac{\partial L}{\partial a_j} = t_j (E_i - E_j)$$

and the second derivative as

$$\frac{\partial^2 L}{\partial a_j^2} = 2K(x_i, x_j) - K(x_i, x_i) - K(x_j, x_j).$$
If the second derivative is negative, we know that the vertex is really a maximum, in which case we get

$$a_j^{new} \leftarrow a_j - t_j \frac{E_i - E_j}{2K(x_i, x_j) - K(x_i, x_i) - K(x_j, x_j)}.$$

We then clip $a_j$ so that $0 \le a_i, a_j \le C$, by clipping $a_j^{new}$ to the range $[L, H]$ with

- $t_i = t_j \Rightarrow L = \max(0, a_i + a_j - C),\; H = \min(C, a_i + a_j)$,
- $t_i \ne t_j \Rightarrow L = \max(0, a_j - a_i),\; H = \min(C, C + a_j - a_i)$.

Finally, we set

$$a_i^{new} \leftarrow a_i - t_i t_j \big(a_j^{new} - a_j\big).$$
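As a hedged sketch, the closed-form update with clipping could be written as follows; the helper name is hypothetical, and the kernel matrix `K` and errors `E` are assumed to be computed elsewhere.

```python
def update_a_i_a_j(i, j, a, t, K, E, C):
    """Closed-form SMO update of a[j] and a[i], following the formulas above.

    K is the precomputed kernel matrix, E[k] = y(x_k) - t_k.
    Returns (new_a_i, new_a_j), or None when no progress can be made."""
    second_derivative = 2 * K[i][j] - K[i][i] - K[j][j]
    if second_derivative >= 0:  # not a proper maximum, skip this pair
        return None

    # Unconstrained Newton-Raphson step for a_j.
    a_j_new = a[j] - t[j] * (E[i] - E[j]) / second_derivative

    # Clip a_j_new to [L, H] so that 0 <= a_i, a_j <= C stays satisfiable.
    if t[i] == t[j]:
        L, H = max(0, a[i] + a[j] - C), min(C, a[i] + a[j])
    else:
        L, H = max(0, a[j] - a[i]), min(C, C + a[j] - a[i])
    a_j_new = min(max(a_j_new, L), H)

    # Recompute a_i from the linear constraint sum_k a_k t_k = 0.
    a_i_new = a[i] - t[i] * t[j] * (a_j_new - a[j])
    return a_i_new, a_j_new
```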
To arrive at the bias update, we consider the KKT condition that for $0 < a_j^{new} < C$ it must hold that $t_j\, y(x_j) = 1$. Combining it with $b = E_j + t_j - \sum_l a_l t_l K(x_j, x_l)$, we get the following value:

$$b_j = b - E_j - t_i\big(a_i^{new} - a_i\big)K(x_i, x_j) - t_j\big(a_j^{new} - a_j\big)K(x_j, x_j).$$

Analogously, for $0 < a_i^{new} < C$ we get

$$b_i = b - E_i - t_i\big(a_i^{new} - a_i\big)K(x_i, x_i) - t_j\big(a_j^{new} - a_j\big)K(x_j, x_i).$$

Finally, if $a_j^{new}, a_i^{new} \in \{0, C\}$, we know that all values between $b_i$ and $b_j$ fulfil the KKT conditions, and we set

$$b^{new} = \begin{cases} b_i & \text{if } 0 < a_i^{new} < C,\\ b_j & \text{if } 0 < a_j^{new} < C,\\ \frac{b_i + b_j}{2} & \text{otherwise.}\end{cases}$$
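Continuing the previous sketch, the bias update could be written as below (again an illustration with assumed names, not the lecture's reference code).

```python
def update_b(i, j, a, a_new_i, a_new_j, t, K, E, b, C):
    """Bias update following the case analysis above."""
    b_j = b - E[j] - t[i] * (a_new_i - a[i]) * K[i][j] - t[j] * (a_new_j - a[j]) * K[j][j]
    b_i = b - E[i] - t[i] * (a_new_i - a[i]) * K[i][i] - t[j] * (a_new_j - a[j]) * K[j][i]
    if 0 < a_new_i < C:
        return b_i
    if 0 < a_new_j < C:
        return b_j
    return (b_i + b_j) / 2
```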
Figure 4.2 of Pattern Recognition and Machine Learning.
There are two general approaches for building a $K$-class classifier from binary classifiers:

- one-versus-rest: $K$ binary classifiers are constructed, the $i$-th separating the instances of class $i$ from the instances of all other classes; during prediction, the binary classifiers need to return calibrated probabilities (which is not the case for SVM),
- one-versus-one: $\binom{K}{2}$ binary classifiers are constructed, one for each pair of class indices $(i, j)$; during prediction, the class with the majority of votes wins (used by SVM), as sketched below.

However, both of the above approaches suffer from serious difficulties, because training the binary classifiers separately usually creates several regions which are ambiguous.
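As a small illustration of the one-versus-one voting scheme, assuming a hypothetical classifier interface where `pairwise_classifiers[(i, j)](x)` returns the winning class of the pair:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, pairwise_classifiers, num_classes):
    """Majority voting over K(K-1)/2 pairwise classifiers."""
    votes = Counter()
    for i, j in combinations(range(num_classes), 2):
        votes[pairwise_classifiers[(i, j)](x)] += 1
    # The class with the most votes wins; ties are broken arbitrarily here.
    return votes.most_common(1)[0][0]
```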