NPFL129, Lecture 6

Soft-margin SVM, SMO Algorithm, Decision Trees

Milan Straka

November 25, 2019

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics


Kernel Linear Regression

When the dimensionality of the input is $D$, one step of SGD takes $O(D^3)$ (a cubic feature map $\varphi$, for example, produces $O(D^3)$ features). Surprisingly, we can do better under some circumstances.

We start by noting that we can write the parameters $w$ as a linear combination of the input features $\varphi(x_i)$. By induction, $w = 0 = \sum_i 0 \cdot \varphi(x_i)$, and assuming $w = \sum_i \beta_i \cdot \varphi(x_i)$, after an SGD update we get

$$w \leftarrow w + \alpha \sum_i \big(t_i - w^T \varphi(x_i)\big)\varphi(x_i) = \sum_i \Big(\beta_i + \alpha\big(t_i - w^T \varphi(x_i)\big)\Big)\varphi(x_i).$$

An individual update is $\beta_i \leftarrow \beta_i + \alpha\big(t_i - w^T \varphi(x_i)\big)$, and substituting for $w$ we get

$$\beta_i \leftarrow \beta_i + \alpha\Big(t_i - \sum_j \beta_j \varphi(x_j)^T \varphi(x_i)\Big).$$


Kernel Linear Regression

We can formulate the alternative linear regression algorithm as follows (it would be called a dual formulation):

Input: Dataset ($X = \{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{N \times D}$, $t \in \mathbb{R}^N$), learning rate $\alpha \in \mathbb{R}^+$.

- Set $\beta_i \leftarrow 0$.
- Compute all values $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
- Repeat:
  - Update the coordinates, either according to a full gradient update: $\beta \leftarrow \beta + \alpha(t - K\beta)$,
  - or alternatively use single-batch SGD, arriving at: for $i$ in random permutation of $\{1, \ldots, N\}$: $\beta_i \leftarrow \beta_i + \alpha\big(t_i - \sum_j \beta_j K(x_i, x_j)\big)$.
    In vector notation, we can write $\beta \leftarrow \beta + \alpha(t - K\beta)$.

The predictions are then performed by computing $y(x) = w^T \varphi(x) = \sum_i \beta_i \varphi(x_i)^T \varphi(x)$.
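The following is a minimal NumPy sketch of this dual training loop (not part of the original slides); the RBF kernel choice, the learning rate and the synthetic data are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs of rows.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-gamma * sq)

def fit_dual_linear_regression(X, t, alpha=0.01, iterations=500, gamma=1.0):
    # Precompute the Gram matrix K with K[i, j] = K(x_i, x_j).
    K = rbf_kernel(X, X, gamma)
    beta = np.zeros(len(X))
    for _ in range(iterations):
        # Full gradient update in vector notation: beta <- beta + alpha * (t - K beta).
        beta += alpha * (t - K @ beta)
    return beta

def predict(X_train, beta, x_new, gamma=1.0):
    # y(x) = sum_i beta_i K(x_i, x).
    return rbf_kernel(x_new, X_train, gamma) @ beta

# Toy usage on a synthetic 1D regression problem.
rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(50, 1))
t = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=50)
beta = fit_dual_linear_regression(X, t, alpha=0.01, iterations=1000, gamma=5.0)
print(predict(X, beta, np.array([[0.0]]), gamma=5.0))
```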


Kernels

We define a kernel corresponding to a feature map $\varphi$ as a function

$$K(x, z) \stackrel{\text{def}}{=} \varphi(x)^T \varphi(z).$$

There is quite a lot of theory behind kernel construction. The most often used kernels are:

- polynomial kernel of degree $d$: $K(x, z) = (\gamma x^T z + 1)^d$, which corresponds to a feature map generating all combinations of up to $d$ input features;
- Gaussian (or RBF) kernel: $K(x, z) = e^{-\gamma ||x - z||^2}$, corresponding to a scalar product in an infinite-dimensional space (it is in a sense a combination of polynomial kernels of all degrees).
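As an illustration that the polynomial kernel indeed corresponds to an explicit feature map, the short sketch below (assuming $\gamma = 1$ and $d = 2$, with a hand-written quadratic feature map) verifies the identity $K(x, z) = \varphi(x)^T \varphi(z)$ numerically on made-up vectors.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2):
    # K(x, z) = (x^T z + 1)^d with gamma = 1.
    return (np.dot(x, z) + 1.0) ** degree

def quadratic_features(x):
    # Explicit feature map for d = 2 (and gamma = 1): all monomials of degree <= 2,
    # with coefficients chosen so that phi(x)^T phi(z) = (x^T z + 1)^2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z))                         # kernel value
print(quadratic_features(x) @ quadratic_features(z))   # the same value via phi
```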


Support Vector Machines

Figure 4.1 of Pattern Recognition and Machine Learning.

Assume we have a dataset $X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$, feature map $\varphi$ and model

$$y(x) \stackrel{\text{def}}{=} \varphi(x)^T w + b.$$

We already know that the distance of a point $x_i$ to the decision boundary is $\frac{|y(x_i)|}{||w||}$.

We therefore want to maximize

$$\arg\max_{w,b} \min_i \frac{t_i\, y(x_i)}{||w||} = \arg\max_{w,b} \frac{1}{||w||} \min_i \big[t_i \big(\varphi(x_i)^T w + b\big)\big].$$

However, this problem is difficult to optimize directly.


Support Vector Machines

Because the model is invariant to multiplying $w$ and $b$ by a constant, we can say that for the points closest to the decision boundary, it will hold that

$$t_i\, y(x_i) = 1.$$

Then for all the points we will have $t_i\, y(x_i) \ge 1$, and we can simplify

$$\arg\max_{w,b} \frac{1}{||w||} \min_i \big[t_i \big(\varphi(x_i)^T w + b\big)\big]$$

to

$$\arg\min_{w,b} \frac{1}{2} ||w||^2 \quad \text{given that} \quad t_i\, y(x_i) \ge 1.$$


Support Vector Machines

In order to solve the constrained problem of

$$\arg\min_{w,b} \frac{1}{2} ||w||^2 \quad \text{given that} \quad t_i\, y(x_i) \ge 1,$$

we write the Lagrangian with multipliers $a = (a_1, \ldots, a_N)$ as

$$L = \frac{1}{2} ||w||^2 - \sum_i a_i \big[t_i\, y(x_i) - 1\big].$$

Setting the derivatives with respect to $w$ and $b$ to zero, we get

$$w = \sum_i a_i t_i \varphi(x_i), \qquad 0 = \sum_i a_i t_i.$$


Support Vector Machines

Substituting these to the Lagrangian, we get

$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$

with respect to the constraints $\forall_i: a_i \ge 0$, $\sum_i a_i t_i = 0$, and kernel $K(x, z) = \varphi(x)^T \varphi(z)$.

The solution of this Lagrangian will fulfil the KKT conditions, meaning that

$$a_i \ge 0, \qquad t_i\, y(x_i) - 1 \ge 0, \qquad a_i \big(t_i\, y(x_i) - 1\big) = 0.$$

Therefore, either a point is on the boundary, or $a_i = 0$. Given that the predictions for point $x$ are given by $y(x) = \sum_i a_i t_i K(x, x_i) + b$, we need to keep only the points on the boundary, the so-called support vectors.
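To make the last point concrete, here is a small NumPy sketch of prediction with a trained dual SVM, where only the support vectors (examples with $a_i > 0$) contribute; the helper names, the linear kernel and the dual values are illustrative assumptions, not outputs of an actual training run.

```python
import numpy as np

def svm_predict(x, support_X, support_t, support_a, b, kernel):
    # y(x) = sum_i a_i t_i K(x, x_i) + b, summing over support vectors only.
    return sum(a_i * t_i * kernel(x, x_i)
               for x_i, t_i, a_i in zip(support_X, support_t, support_a)) + b

def extract_support_vectors(X, t, a, eps=1e-8):
    # Keep only the examples whose dual coefficient a_i is nonzero.
    mask = a > eps
    return X[mask], t[mask], a[mask]

# Toy usage with a linear kernel and made-up dual variables.
linear_kernel = lambda x, z: float(np.dot(x, z))
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
t = np.array([1.0, -1.0, 1.0])
a = np.array([0.7, 0.7, 0.0])          # the third example is not a support vector
sX, st, sa = extract_support_vectors(X, t, a)
print(svm_predict(np.array([1.0, 1.0]), sX, st, sa, b=0.1, kernel=linear_kernel))
```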


Support Vector Machines

The dual formulation allows us to use non-linear kernels.

Figure 7.2 of Pattern Recognition and Machine Learning.


Support Vector Machines for Non-linearly Separable Data

Figure 7.3 of Pattern Recognition and Machine Learning.

Until now, we assumed the data to be linearly separable – the hard-margin SVM variant. We now relax this condition to arrive at soft-margin SVM. The idea is to allow points to be in the margin or even on the wrong side of the decision boundary. We introduce slack variables $\xi_i$, one for each training instance, defined as

$$\xi_i = \begin{cases} 0 & \text{for points fulfilling } t_i\, y(x_i) \ge 1, \\ |t_i - y(x_i)| & \text{otherwise.} \end{cases}$$

Therefore, $\xi_i = 0$ signifies a point outside of the margin, $0 < \xi_i < 1$ denotes a point inside the margin, $\xi_i = 1$ is a point on the decision boundary, and $\xi_i > 1$ indicates the point is on the opposite side of the separating hyperplane.

Therefore, we want to optimize

$$\arg\min_{w,b}\; C \sum_i \xi_i + \frac{1}{2} ||w||^2 \quad \text{given that} \quad t_i\, y(x_i) \ge 1 - \xi_i \text{ and } \xi_i \ge 0.$$


Support Vector Machines for Non-linearly Separable Data

We again create a Lagrangian, this time with multipliers $a = (a_1, \ldots, a_N)$ and also $\mu = (\mu_1, \ldots, \mu_N)$:

$$L = \frac{1}{2} ||w||^2 + C \sum_i \xi_i - \sum_i a_i \big[t_i\, y(x_i) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i.$$

Solving for the critical points and substituting for $w$, $b$ and $\xi$ (obtaining an additional constraint $\mu_i = C - a_i$ compared to the previous case), we obtain the Lagrangian in the form

$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j),$$

which is identical to the previous case, but the constraints are a bit different:

$$\forall_i: C \ge a_i \ge 0 \quad \text{and} \quad \sum_i a_i t_i = 0.$$


Support Vector Machines for Non-linearly Separable Data

Using the KKT conditions, we can see that the support vectors (examples with $a_i > 0$) are the ones with $t_i\, y(x_i) = 1 - \xi_i$, i.e., the examples on the margin boundary, inside the margin and on the opposite side of the decision boundary.

Figure 7.4 of Pattern Recognition and Machine Learning.


SGD-like Formulation of Soft-Margin SVM

Note that the slack variables can be written as

$$\xi_i = \max\big(0, 1 - t_i\, y(x_i)\big),$$

so we can reformulate the soft-margin SVM objective using the hinge loss

$$L_{\text{hinge}}(t, y) \stackrel{\text{def}}{=} \max(0, 1 - ty)$$

to

$$\arg\min_{w,b}\; C \sum_i L_{\text{hinge}}\big(t_i, y(x_i)\big) + \frac{1}{2} ||w||^2.$$

Such formulation is analogous to a regularized loss, where $C$ is an inverse regularization strength, so $C = \infty$ implies no regularization and $C = 0$ ignores the data entirely.
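This primal formulation can be optimized directly by stochastic (sub)gradient descent. Below is a minimal NumPy sketch of such an SGD-like linear soft-margin SVM; the hinge-loss subgradient is standard, while the learning rate, epoch count and toy data are assumptions of the example.

```python
import numpy as np

def train_linear_svm_sgd(X, t, C=1.0, lr=0.01, epochs=100, seed=0):
    # Minimizes C * sum_i max(0, 1 - t_i (x_i^T w + b)) + 0.5 * ||w||^2 by SGD,
    # using the hinge-loss subgradient: -t_i x_i when the margin t_i y(x_i) < 1, else 0.
    rng = np.random.RandomState(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            margin = t[i] * (X[i] @ w + b)
            if margin < 1:
                w -= lr * (w - C * t[i] * X[i])
                b -= lr * (-C * t[i])
            else:
                w -= lr * w
    return w, b

# Toy linearly separable data with labels in {-1, 1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_linear_svm_sgd(X, t)
print(np.sign(X @ w + b))   # should recover the labels
```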


Comparison of Linear and Logistic Regression and SVM

For $f(x; w, b) \stackrel{\text{def}}{=} \varphi(x)^T w + b$, we have seen the following losses:

| Model | Objective | Per-Instance Loss |
|---|---|---|
| Linear Regression | $\arg\min_{w,b} \sum_i L_{\text{MSE}}(t_i, f(x_i)) + \frac{1}{2} \lambda \|w\|^2$ | $L_{\text{MSE}}(t, y) = \frac{1}{2}(t - y)^2$ |
| Logistic regression | $\arg\min_{w,b} \sum_i L_{\sigma\text{-NLL}}(t_i, f(x_i)) + \frac{1}{2} \lambda \|w\|^2$ | $L_{\sigma\text{-NLL}}(t, y) = -\log\big(\sigma(y)^t (1 - \sigma(y))^{1-t}\big)$ |
| Softmax regression | $\arg\min_{W,b} \sum_i L_{\text{s-NLL}}(t_i, f(x_i)) + \frac{1}{2} \lambda \|w\|^2$ | $L_{\text{s-NLL}}(t, y) = -\log \operatorname{softmax}(y)_t$ |
| SVM | $\arg\min_{w,b} C \sum_i L_{\text{hinge}}(t_i, f(x_i)) + \frac{1}{2} \|w\|^2$ | $L_{\text{hinge}}(t, y) = \max(0, 1 - ty)$ |

Note that $L_{\text{MSE}}(t, y) \propto -\log\big(\mathcal{N}(t; \mu = y, \sigma^2 = 1)\big)$ and that $L_{\sigma\text{-NLL}}(t, y) = L_{\text{s-NLL}}(t, [y, 0])$.


Binary Classification Loss Functions Comparison

To compare the various functions for binary classification, we need to formulate them all in the same setting, with $t \in \{-1, 1\}$:

- MSE: $(ty - 1)^2$, because it is $(y - 1)^2$ for $t = 1$ and $(-y - 1)^2$ for $t = -1$;
- LR: $\sigma(ty)$, because it is $\sigma(y)$ for $t = 1$ and $1 - \sigma(y) = \sigma(-y)$ for $t = -1$;
- SVM: $\max(0, 1 - ty)$.
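A short NumPy sketch evaluating the three quantities above on a grid of raw scores $y$, e.g. as a starting point for plotting them; the grid and the printed format are the example's own choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

t = 1                                  # try also t = -1
y = np.linspace(-3, 3, 7)              # raw model scores
mse = (t * y - 1) ** 2                 # (ty - 1)^2
lr = sigmoid(t * y)                    # sigma(ty), the probability of the correct class
svm = np.maximum(0, 1 - t * y)         # hinge loss
for row in zip(y, mse, lr, svm):
    print("y=%+.1f  MSE=%6.2f  sigma(ty)=%.3f  hinge=%.2f" % row)
```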


Sequential Minimal Optimization Algorithm

To solve the dual formulation of an SVM, usually the Sequential Minimal Optimization (SMO; John Platt, 1998) algorithm is used. Before we introduce it, we start by introducing the coordinate descent optimization algorithm.

Consider solving the unconstrained optimization problem

$$\arg\min_w L(w_1, w_2, \ldots, w_D).$$

Instead of the usual SGD approach, we could optimize the weights one by one, using the following algorithm:

- loop until convergence:
  - for $i$ in $\{1, 2, \ldots, D\}$:
    - $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$


Sequential Minimal Optimization Algorithm

CS229 Lecture 3 Notes, http://cs229.stanford.edu/notes/cs229-notes3.pdf

- loop until convergence:
  - for $i$ in $\{1, 2, \ldots, D\}$:
    - $w_i \leftarrow \arg\min_{w_i} L(w_1, w_2, \ldots, w_D)$

If the inner $\arg\min_{w_i}$ can be performed efficiently, the coordinate descent can be fairly efficient.

Note that we might want to choose the $w_i$ in a different order, for example by trying to choose the $w_i$ providing the largest decrease of $L$.
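To illustrate coordinate descent, the sketch below minimizes a small quadratic $L(w) = \frac{1}{2} w^T A w - b^T w$, for which each inner $\arg\min_{w_i}$ has a closed form; the particular $A$ and $b$ are made up.

```python
import numpy as np

def coordinate_descent_quadratic(A, b, iterations=50):
    # Minimizes L(w) = 0.5 * w^T A w - b^T w (A symmetric positive definite)
    # by repeatedly solving the argmin over a single coordinate in closed form:
    # w_i <- (b_i - sum_{j != i} A_ij w_j) / A_ii.
    w = np.zeros_like(b)
    for _ in range(iterations):
        for i in range(len(b)):
            w[i] = (b[i] - A[i] @ w + A[i, i] * w[i]) / A[i, i]
    return w

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # SPD matrix
b = np.array([1.0, 2.0])
w = coordinate_descent_quadratic(A, b)
print(w, np.linalg.solve(A, b))            # both should be close to the true minimizer
```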


Sequential Minimal Optimization Algorithm

In the soft-margin SVM, we try to maximize

$$L = \sum_i a_i - \frac{1}{2} \sum_i \sum_j a_i a_j t_i t_j K(x_i, x_j)$$

such that

$$\forall_i: C \ge a_i \ge 0 \quad \text{and} \quad \sum_i a_i t_i = 0.$$

The KKT conditions for the solution can be reformulated (while staying equivalent) as

- $a_i > 0 \Rightarrow t_i\, y(x_i) \le 1$, because $a_i > 0 \Rightarrow t_i\, y(x_i) = 1 - \xi_i$ and we have $\xi_i \ge 0$,
- $a_i < C \Rightarrow t_i\, y(x_i) \ge 1$, because $a_i < C \Rightarrow \mu_i > 0 \Rightarrow \xi_i = 0$ and $t_i\, y(x_i) \ge 1 - \xi_i$,
- $0 < a_i < C \Rightarrow t_i\, y(x_i) = 1$, a combination of both.


Sequential Minimal Optimization Algorithm

At its core, the SMO algorithm is just coordinate descent. It tries to find $a_i$ fulfilling the KKT conditions – for the soft-margin SVM, the KKT conditions are sufficient conditions for optimality (the objective is concave and the inequality constraints are affine).

However, note that because of the constraint $\sum_i a_i t_i = 0$, we cannot optimize just one $a_i$: a single $a_i$ is determined by the others. Therefore, in each step we pick two coefficients $a_i, a_j$ and try to maximize the Lagrangian with respect to them, while fulfilling the constraints.

- loop until convergence (until $\forall i: a_i < C \Rightarrow t_i\, y(x_i) \ge 1$ and $a_i > 0 \Rightarrow t_i\, y(x_i) \le 1$):
  - for $i$ in $\{1, 2, \ldots, N\}$, for $j \ne i$ in $\{1, 2, \ldots, N\}$:
    - $a_i, a_j \leftarrow \arg\max_{a_i, a_j} L(a_1, a_2, \ldots, a_N)$ such that $C \ge a_i \ge 0$, $\sum_i a_i t_i = 0$


Sequential Minimal Optimization Algorithm

SMO is an efficient algorithm, because the update of $a_i, a_j$ can be computed efficiently – a closed-form solution exists. Assume that we are updating $a_i$ and $a_j$. Then from the condition $\sum_k a_k t_k = 0$ we can write $a_i t_i = -\sum_{k \ne i} a_k t_k$. Given that $t_i^2 = 1$ and denoting $\zeta = -\sum_{k \ne i, k \ne j} a_k t_k$, we get

$$a_i = t_i (\zeta - a_j t_j).$$

Maximizing $L(a)$ with respect to $a_i$ and then $a_j$ then amounts to maximizing a quadratic function of $a_j$, which has an analytical solution.

Note that the real SMO algorithm has several heuristics for choosing $a_i, a_j$ such that $L$ can be maximized the most.


Sequential Minimal Optimization Algorithm Sketch

Input: Dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$), kernel $K$, regularization parameter $C$, tolerance $tol$, $max\_passes\_without\_a\_changing$ value

- Initialize $a_i \leftarrow 0$, $b \leftarrow 0$, $passes \leftarrow 0$
- while $passes < max\_passes\_without\_a\_changing$:
  - $changed\_as \leftarrow 0$
  - for $i$ in $1, 2, \ldots, N$:
    - $E_i \leftarrow y(x_i) - t_i$
    - if ($a_i < C$ and $t_i E_i < -tol$) or ($a_i > 0$ and $t_i E_i > tol$):
      - Choose $j \ne i$ randomly
      - Update $a_i$, $a_j$ and $b$
      - $changed\_as \leftarrow changed\_as + 1$
  - if $changed\_as = 0$: $passes \leftarrow passes + 1$
  - else: $passes \leftarrow 0$


Sequential Minimal Optimization Algorithm Sketch

Input: Dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{-1, 1\}^N$), kernel $K$, regularization parameter $C$, tolerance $tol$, $max\_passes\_without\_a\_changing$ value

Update $a_i$, $a_j$, $b$:
- Express $a_i$ using $a_j$
- Find $a_j$ optimizing the loss $L$, which is quadratic with respect to $a_j$
- Clip $a_j$ so that $0 \le a_i, a_j \le C$
- Compute the corresponding $a_i$
- Compute $b$ matching the updated $a_i$, $a_j$
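Putting the two sketches together, here is a hedged Python implementation of simplified SMO. The closed-form $a_j$ update (the clipping bounds, the second derivative $\eta$ and the two bias candidates) follows the simplified-SMO derivation from the CS229 notes referenced earlier rather than anything spelled out on these slides, and the toy data and hyperparameters are made up.

```python
import numpy as np

def smo_train(X, t, C=1.0, tol=1e-3, max_passes=10, kernel=None, seed=42):
    # Simplified SMO in the spirit of the sketch above (and the CS229 notes).
    rng = np.random.RandomState(seed)
    if kernel is None:
        kernel = lambda u, v: u @ v.T          # linear kernel by default
    N = len(X)
    K = kernel(X, X)                           # precomputed Gram matrix
    a, b, passes = np.zeros(N), 0.0, 0
    predict = lambda k: (a * t) @ K[:, k] + b  # y(x_k) on the training data

    while passes < max_passes:
        changed = 0
        for i in range(N):
            E_i = predict(i) - t[i]
            if (a[i] < C and t[i] * E_i < -tol) or (a[i] > 0 and t[i] * E_i > tol):
                j = rng.choice([k for k in range(N) if k != i])
                E_j = predict(j) - t[j]
                # Bounds keeping 0 <= a_i, a_j <= C and sum_k a_k t_k = 0.
                if t[i] != t[j]:
                    low, high = max(0, a[j] - a[i]), min(C, C + a[j] - a[i])
                else:
                    low, high = max(0, a[i] + a[j] - C), min(C, a[i] + a[j])
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # second derivative of L in a_j
                if low == high or eta >= 0:
                    continue
                # Closed-form unconstrained optimum of the quadratic in a_j, then clip.
                a_j_new = np.clip(a[j] - t[j] * (E_i - E_j) / eta, low, high)
                if abs(a_j_new - a[j]) < 1e-5:
                    continue
                a_i_new = a[i] + t[i] * t[j] * (a[j] - a_j_new)
                # Bias candidates derived from the KKT conditions for i and j.
                b1 = b - E_i - t[i] * (a_i_new - a[i]) * K[i, i] - t[j] * (a_j_new - a[j]) * K[i, j]
                b2 = b - E_j - t[i] * (a_i_new - a[i]) * K[i, j] - t[j] * (a_j_new - a[j]) * K[j, j]
                a[i], a[j] = a_i_new, a_j_new
                b = b1 if 0 < a[i] < C else b2 if 0 < a[j] < C else (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return a, b

# Toy usage on linearly separable data.
X = np.array([[2.0, 2.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
a, b = smo_train(X, t, C=1.0)
print(np.sign((a * t) @ (X @ X.T) + b))        # predictions on the training data
```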


Primal versus Dual Formulation

Assume we have a dataset with $N$ training examples, each with $D$ features. Also assume the used feature map $\varphi$ generates $F$ features.

| Property | Primal Formulation | Dual Formulation |
|---|---|---|
| Parameters | $F$ | $N$ |
| Model size | $F$ | $s \cdot D$ for $s$ support vectors |
| Usual training time | $c \cdot N \cdot F$ for $c$ iterations | between $\Omega(ND)$ and $O(N^2 D)$ |
| Inference time | $\Theta(F)$ | $\Theta(s \cdot D)$ for $s$ support vectors |


Decision Trees

The idea of decision trees is to partition the input space into (usually cuboid) regions and to solve each region with a simpler model. We focus on Classification and Regression Trees (CART; Breiman et al., 1984), but there are additional variants like ID3, C4.5, …

Figure 14.6 of Pattern Recognition and Machine Learning.

Figure 14.5 of Pattern Recognition and Machine Learning.


Regression Decision Trees

Assume we have an input dataset $X \in \mathbb{R}^{N \times D}$, $t \in \mathbb{R}^N$. At the beginning, the decision tree is just a single node and all input examples belong to this node. We denote $I_T$ the set of training example indices belonging to a leaf node $T$.

For each leaf, our model will predict the average of the training examples belonging to that leaf, $\hat t_T = \frac{1}{|I_T|} \sum_{i \in I_T} t_i$.

We will use a criterion $c_T$ telling us how uniform or homogeneous the training examples belonging to a leaf node $T$ are – for regression, we will employ the sum of squares error between the examples belonging to the node and the predicted value in that node; this is proportional to the variance of the training examples belonging to the leaf node $T$, multiplied by the number of the examples. Note that even though it is not a mean squared error, it is sometimes denoted as MSE.

$$c_{\text{SE}}(T) \stackrel{\text{def}}{=} \sum_{i \in I_T} (t_i - \hat t_T)^2, \quad \text{where} \quad \hat t_T = \frac{1}{|I_T|} \sum_{i \in I_T} t_i.$$
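A direct NumPy transcription of the $c_{\text{SE}}$ criterion, also checking the variance-times-count formulation mentioned above; the sample targets are made up.

```python
import numpy as np

def se_criterion(targets):
    # c_SE(T) = sum_i (t_i - mean(t))^2 over the examples in the leaf.
    return np.sum((targets - targets.mean()) ** 2)

t_leaf = np.array([1.0, 2.0, 2.0, 5.0])
print(se_criterion(t_leaf))               # 9.0
print(len(t_leaf) * np.var(t_leaf))       # the same value: |I_T| * variance
```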


Tree Construction

To split a node, the goal is to find a feature and its value such that when splitting a node $T$ into $T_L$ and $T_R$, the resulting regions decrease the overall criterion value the most, i.e., the difference $c_{T_L} + c_{T_R} - c_T$ is the lowest.

Usually we have several constraints; we mention the most common ones:
- maximum tree depth: we do not split nodes with this depth;
- minimum examples to split: we only split nodes with at least this many training examples;
- maximum number of leaf nodes.

The tree is usually built in one of two ways:
- if the number of leaf nodes is unlimited, we usually build the tree in a depth-first manner, recursively splitting every leaf until some of the above constraints is invalidated;
- if the maximum number of leaf nodes is given, we usually split the leaf where the criterion difference is the lowest.
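Below is a brute-force sketch of the split search for a regression tree: for every feature and every threshold between consecutive distinct feature values, it evaluates $c_{T_L} + c_{T_R} - c_T$ and keeps the minimum. The exhaustive strategy and the helper names are the example's own simplifications; real implementations are considerably more efficient.

```python
import numpy as np

def se_criterion(targets):
    # Sum-of-squares error of a leaf predicting the mean of its targets.
    return np.sum((targets - targets.mean()) ** 2) if len(targets) else 0.0

def best_split(X, t):
    # Returns (feature, threshold, criterion difference) of the best split,
    # minimizing c_{T_L} + c_{T_R} - c_T over all features and thresholds.
    c_T = se_criterion(t)
    best = (None, None, 0.0)
    for feature in range(X.shape[1]):
        values = np.unique(X[:, feature])
        for threshold in (values[:-1] + values[1:]) / 2:
            left = X[:, feature] <= threshold
            diff = se_criterion(t[left]) + se_criterion(t[~left]) - c_T
            if diff < best[2]:
                best = (feature, threshold, diff)
    return best

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
t = np.array([1.0, 1.2, 3.0, 3.1])
print(best_split(X, t))   # splits on feature 0 around 2.5
```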


Classification Decision Trees

For multi-class classification, we predict the class most frequent in the training examples belonging to a leaf $T$. To define the criteria, let us denote the average probability for class $k$ in a region $T$ as $p_T(k)$. For classification trees, one of the following two criteria is usually used:

- Gini index: $$c_{\text{Gini}}(T) \stackrel{\text{def}}{=} |I_T| \sum_k p_T(k)\big(1 - p_T(k)\big)$$
- Entropy criterion: $$c_{\text{entropy}}(T) \stackrel{\text{def}}{=} |I_T|\, H(p_T) = -|I_T| \sum_k p_T(k) \log p_T(k)$$
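Finally, a small NumPy sketch of the two criteria computed from the class labels in a leaf; the example labels are made up, and zero-probability classes are skipped in the entropy so that $0 \log 0$ is treated as $0$.

```python
import numpy as np

def gini_criterion(labels, num_classes):
    # c_Gini(T) = |I_T| * sum_k p_T(k) (1 - p_T(k))
    p = np.bincount(labels, minlength=num_classes) / len(labels)
    return len(labels) * np.sum(p * (1 - p))

def entropy_criterion(labels, num_classes):
    # c_entropy(T) = -|I_T| * sum_k p_T(k) log p_T(k), with 0 log 0 taken as 0.
    p = np.bincount(labels, minlength=num_classes) / len(labels)
    p = p[p > 0]
    return -len(labels) * np.sum(p * np.log(p))

leaf_labels = np.array([0, 0, 1, 2, 2, 2])
print(gini_criterion(leaf_labels, num_classes=3))
print(entropy_criterion(leaf_labels, num_classes=3))
```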
