 
DM825 Introduction to Machine Learning
Lecture 9: Support Vector Machines
Marco Chiarandini
Department of Mathematics & Computer Science
University of Southern Denmark

Overview

Support Vector Machines:
1. Functional and Geometric Margins
2. Optimal Margin Classifier
3. Lagrange Duality
4. Karush Kuhn Tucker Conditions
5. Solving the Optimal Margin
6. Kernels
7. Soft margins
8. SMO Algorithm

In This Lecture

1. Kernels
2. Soft margins
3. SMO Algorithm

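Resume

$$
\begin{aligned}
\max_{\vec{\alpha}}\ W(\vec{\alpha}) &= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \langle \vec{x}_i, \vec{x}_j \rangle \\
\text{s.t.}\quad \alpha_i &\geq 0 \qquad \forall i = 1 \ldots m \\
\sum_{i=1}^{m} \alpha_i y_i &= 0
\end{aligned}
$$

$$
\begin{aligned}
\vec{\theta} &= \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i \\
y_i(\vec{\theta}^T \vec{x}_i + \theta_0) &\geq 1 \qquad \forall i = 1 \ldots m \\
\alpha_i \left[ y_i(\vec{\theta}^T \vec{x}_i + \theta_0) - 1 \right] &= 0 \qquad \forall i = 1 \ldots m
\end{aligned}
$$

Prediction:
$$
h(\vec{\theta}, \vec{x}) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i \langle \vec{x}_i, \vec{x} \rangle + \theta_0 \right)
$$
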
Introduction

We saw:
1. $h(\vec{\theta}, \vec{x})$: $\vec{\theta}$ is fitted on the training data, then the training data are discarded.
2. $k$-NN: the training data are kept during the prediction phase. Memory-based method (fast to train, slower to predict).
3. Locally weighted linear regression:
$$
\vec{\theta} = \operatorname{argmin}_{\vec{\theta}} \sum_i w_i (y_i - \vec{\theta}^T \vec{x}_i)^2, \qquad w_i = \exp\left( -\frac{(\vec{x}_i - \vec{x})^T (\vec{x}_i - \vec{x})}{2\tau^2} \right)
$$
(linear parametric method where predictions are based on a linear combination of kernel functions evaluated at the training data)

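The weighting scheme of locally weighted linear regression can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: it assumes a one-dimensional input with an added intercept feature, and the name `lwr_predict` and the toy data are hypothetical.

```python
import math

def lwr_predict(xs, ys, query, tau):
    """Locally weighted linear regression for 1-D inputs (assumed setting)."""
    # Weights w_i = exp(-(x_i - x)^2 / (2 tau^2)), centred on the query point.
    ws = [math.exp(-((x - query) ** 2) / (2 * tau ** 2)) for x in xs]
    # Weighted normal equations for the features (1, x).
    s0 = sum(ws)
    s1 = sum(w * x for w, x in zip(ws, xs))
    s2 = sum(w * x * x for w, x in zip(ws, xs))
    t0 = sum(w * y for w, y in zip(ws, ys))
    t1 = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    det = s0 * s2 - s1 * s1
    intercept = (t0 * s2 - t1 * s1) / det
    slope = (s0 * t1 - s1 * t0) / det
    return intercept + slope * query

# Toy data lying exactly on y = 2x + 1, so any weighting recovers the line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
prediction = lwr_predict(xs, ys, query=1.5, tau=1.0)
```

Note that the training data are needed at prediction time, because the weights depend on the query point.
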
Kernels

Inputs $x_1, \ldots, x_D$. If we want all polynomial terms up to degree 2:
$$
\vec{\phi}(\vec{x}) = \begin{bmatrix} x_1^2 & x_2^2 & \ldots & x_D^2 & x_1 x_2 & x_1 x_3 & \ldots & x_{D-1} x_D \end{bmatrix}^T
$$
$\binom{D}{2} = O(D^2)$ terms.

In SVM we need $\langle \vec{\phi}(\vec{x}_i), \vec{\phi}(\vec{x}_j) \rangle = \vec{\phi}(\vec{x}_i)^T \cdot \vec{\phi}(\vec{x}_j)$, which costs $O(D^2)$ per product and is needed $m^2$ times.

For $D = 3$:
$$
\vec{\phi}(\vec{x}) = \begin{bmatrix} 1 & \sqrt{2} x_1 & \sqrt{2} x_2 & \sqrt{2} x_3 & x_1^2 & x_2^2 & x_3^2 & \sqrt{2} x_1 x_2 & \sqrt{2} x_1 x_3 & \sqrt{2} x_2 x_3 \end{bmatrix}^T
$$
$$
\vec{\phi}(\vec{x})^T \vec{\phi}(\vec{z}) = 1 + 2 \sum_{i=1}^{D} x_i z_i + \sum_{i=1}^{D} x_i^2 z_i^2 + 2 \sum_{i<j} x_i x_j z_i z_j
$$
Someone recognized that this is the same as $(1 + \vec{x}^T \cdot \vec{z})^2$, which can be computed in $O(D)$.

Kernel:
$$
k(\vec{x}, \vec{z}) = (1 + \vec{x}^T \cdot \vec{z})^s
$$
We may restrict ourselves to computing the kernel matrix.

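The identity between the explicit feature map for $D = 3$ and the degree-2 polynomial kernel can be checked numerically. The sketch below is only an illustration; the function names and the two test vectors are arbitrary choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for D = 3 (10 components)."""
    x1, x2, x3 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]

def poly_kernel(x, z, s=2):
    """k(x, z) = (1 + x^T z)^s, computed in O(D)."""
    return (1 + dot(x, z)) ** s

x = (1.0, 2.0, 3.0)
z = (0.5, -1.0, 2.0)
explicit = dot(phi(x), phi(z))   # scalar product in feature space
implicit = poly_kernel(x, z)     # same value without building phi
```

The two values agree up to floating-point rounding, while the kernel never forms the 10-dimensional vectors.
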
Kernels

For models with a fixed nonlinear feature space:

Definition (Kernel)
$$
k(\vec{x}, \vec{x}') = \vec{\phi}(\vec{x})^T \cdot \vec{\phi}(\vec{x}')
$$
It follows that $k(\vec{x}, \vec{x}') = k(\vec{x}', \vec{x})$.

Kernel Trick
If we have an algorithm in which the input vector $\vec{x}$ enters only in the form of scalar products, then we can replace the scalar product with some choice of kernel.
◮ This is our case with SVM: thanks to the dual formulation, both training and prediction can be done via scalar products.
◮ No need to define features

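As a small illustration of the kernel trick, the dual prediction rule can take the kernel as a parameter. The sketch below uses the XOR points, which are not linearly separable in the input space. The multipliers $\alpha_i = 1/8$ and $\theta_0 = 0$ are hand-picked assumptions for this toy set (they put every point exactly on the margin), not the output of a solver, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, z):
    # Degree-2 polynomial kernel (1 + x^T z)^2.
    return (1 + dot(x, z)) ** 2

def decision_value(x, xs, ys, alphas, theta0, kernel):
    # sum_i alpha_i y_i k(x_i, x) + theta_0: inputs enter only through the kernel.
    return sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, ys, xs)) + theta0

def predict(x, xs, ys, alphas, theta0, kernel):
    return 1 if decision_value(x, xs, ys, alphas, theta0, kernel) >= 0 else -1

# XOR data: the label is -1 when both coordinates have the same sign.
xs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
ys = [-1, 1, 1, -1]
alphas = [0.125, 0.125, 0.125, 0.125]  # assumed multipliers for this toy set
theta0 = 0.0

values = [decision_value(x, xs, ys, alphas, theta0, poly_kernel) for x in xs]
predictions = [predict(x, xs, ys, alphas, theta0, poly_kernel) for x in xs]
```

Replacing `poly_kernel` with `dot` gives back the linear prediction rule; nothing else in the code changes.
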
Constructing Kernels

A kernel must behave like a scalar product, as $k(\vec{x}, \vec{x}') = \vec{x}^T \cdot \vec{x}'$ does. Two approaches:
1. Define some basis functions $\vec{\phi}(\vec{x})$:
$$
k(\vec{x}, \vec{x}') = \vec{\phi}(\vec{x})^T \vec{\phi}(\vec{x}') = \sum_{i=1}^{D} \phi_i(\vec{x}) \phi_i(\vec{x}')
$$
2. Define the kernel directly, provided it is some scalar product in some feature space (maybe infinite-dimensional):
$$
k(\vec{x}, \vec{x}') = (1 + \vec{x}^T \cdot \vec{x}')^2
$$

Constructing Kernels

Following approach 2:

Theorem (Mercer's Kernel)
A necessary and sufficient condition for $k(\cdot)$ to be a valid kernel is that the Gram matrix $K$, whose elements are $k_{ij} = k(\vec{x}_i, \vec{x}_j)$, is positive semidefinite ($\forall \vec{x} \in \mathbb{R}^n$, $\vec{x}^T K \vec{x} \geq 0$) for all choices of the set $\{\vec{x}_i\}$.

Proof:
Symmetry: $k_{ij} = k(\vec{x}_i, \vec{x}_j) = \vec{\phi}(\vec{x}_i)^T \vec{\phi}(\vec{x}_j) = \vec{\phi}(\vec{x}_j)^T \vec{\phi}(\vec{x}_i) = k_{ji}$

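Mercer's condition can be spot-checked on a finite set of points. The sketch below builds the Gram matrix and tests symmetry and the sign of the quadratic form on a few vectors. Passing is only a necessary check on one point set, not a proof of validity. The helper names, the three sample points and the counterexample function `neg_sq_dist` are assumptions made for illustration.

```python
import math
import random

def rbf(x, z, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))

def neg_sq_dist(x, z):
    # Symmetric, but not a valid kernel: used here as a counterexample.
    return -sum((a - b) ** 2 for a, b in zip(x, z))

def gram(kernel, points):
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, v):
    n = len(v)
    return sum(v[i] * K[i][j] * v[j] for i in range(n) for j in range(n))

def passes_psd_spot_check(K, trials=200, seed=0, tol=1e-12):
    n = len(K)
    symmetric = all(abs(K[i][j] - K[j][i]) <= tol
                    for i in range(n) for j in range(n))
    rng = random.Random(seed)
    # Test the all-ones vector plus random Gaussian vectors.
    vectors = [[1.0] * n]
    vectors += [[rng.gauss(0, 1) for _ in range(n)] for _ in range(trials)]
    return symmetric and all(quadratic_form(K, v) >= -tol for v in vectors)

points = [(0.0,), (1.0,), (2.0,)]
rbf_ok = passes_psd_spot_check(gram(rbf, points))          # expected to pass
bad_ok = passes_psd_spot_check(gram(neg_sq_dist, points))  # fails on all-ones
```
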
Constructing Kernels

One easy way to construct kernels is by recombining building blocks. Known building blocks:

Linear: $k(\vec{x}, \vec{x}') = \vec{x}^T \vec{x}'$
Polynomial: $k(\vec{x}, \vec{x}') = (\vec{x}^T \vec{x}' + c)^s$
Radial basis: $k(\vec{x}, \vec{x}') = \exp(-\|\vec{x} - \vec{x}'\|^2 / 2\sigma^2)$ (has infinite dimensionality)
Sigmoid function: $k(\vec{x}, \vec{x}') = \tanh(\kappa\, \vec{x}^T \vec{x}' - \sigma)$

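The building blocks translate directly into small functions. In this sketch the default parameter values (`c`, `s`, `sigma`, `kappa`) are arbitrary assumptions, and `sum_kernel` illustrates one standard way of recombining blocks: the sum of two valid kernels is again a valid kernel.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, c=1.0, s=2):
    return (dot(x, z) + c) ** s

def rbf_kernel(x, z, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))

def sigmoid_kernel(x, z, kappa=1.0, sigma=0.0):
    return math.tanh(kappa * dot(x, z) - sigma)

def sum_kernel(k1, k2):
    # Recombination: k(x, z) = k1(x, z) + k2(x, z).
    return lambda x, z: k1(x, z) + k2(x, z)

x = (1.0, 2.0)
z = (3.0, 4.0)
combined = sum_kernel(linear_kernel, polynomial_kernel)
```
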
Soft margins

What if data are not separable?

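Soft margins

We allow some points to be on the wrong side and introduce slack variables $\vec{\xi} = (\xi_1, \ldots, \xi_m)$ in the formulation. The geometric margin becomes:
◮ $y_i(\vec{\theta}^T \vec{x}_i + \theta_0) > 0$ if predicted correctly
◮ $y_i(\vec{\theta}^T \vec{x}_i + \theta_0) > -\xi_i$ for the mispredicted points

In the formulation we modify $y_i(\vec{\theta}^T \vec{x}_i + \theta_0) > \gamma$ into $y_i(\vec{\theta}^T \vec{x}_i + \theta_0) > \gamma(1 - \xi_i)$ and include a regularization term to minimize:
$$
\begin{aligned}
\text{(OPT)}: \quad \min_{\vec{\theta}, \theta_0, \vec{\xi}}\ & \frac{1}{2} \|\vec{\theta}\|^2 + C \sum_{i=1}^{m} \xi_i \\
\alpha_i: \quad & 1 - \xi_i \leq y_i(\vec{\theta}^T \vec{x}_i + \theta_0) \qquad \forall i = 1, \ldots, m \\
\mu_i: \quad & \xi_i \geq 0 \qquad \forall i = 1, \ldots, m
\end{aligned}
$$
Here $\alpha_i$ and $\mu_i$ are the Lagrange multipliers attached to the two sets of constraints. This is still a convex optimization problem.
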
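For a fixed $(\vec{\theta}, \theta_0)$, the smallest slack that satisfies both constraints is $\xi_i = \max(0,\ 1 - y_i(\vec{\theta}^T \vec{x}_i + \theta_0))$, so the soft-margin objective can be evaluated directly. The sketch below does this for a hypothetical parameter vector and three toy points; none of the numbers come from the lecture.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slack(theta, theta0, x, y):
    # Smallest xi_i with 1 - xi_i <= y_i (theta^T x_i + theta_0) and xi_i >= 0.
    return max(0.0, 1.0 - y * (dot(theta, x) + theta0))

def soft_margin_objective(theta, theta0, xs, ys, C):
    slacks = [slack(theta, theta0, x, y) for x, y in zip(xs, ys)]
    value = 0.5 * dot(theta, theta) + C * sum(slacks)
    return value, slacks

theta, theta0 = (1.0, 0.0), 0.0
xs = [(2.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
ys = [1, 1, -1]
objective, slacks = soft_margin_objective(theta, theta0, xs, ys, C=1.0)
# First point: outside the margin (slack 0); second: inside the margin
# but correctly classified (slack 0.5); third: misclassified (slack 2).
```

A larger `C` penalizes the slack terms more heavily relative to the margin term $\frac{1}{2}\|\vec{\theta}\|^2$.
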
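$$
L(\vec{\theta}, \theta_0, \vec{\xi}, \vec{\alpha}, \vec{\mu}) = \frac{1}{2} \|\vec{\theta}\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i(\vec{\theta}^T \vec{x}_i + \theta_0) - (1 - \xi_i) \right] - \sum_{i=1}^{m} \mu_i \xi_i
$$
For fixed $\vec{\alpha}, \vec{\mu}$ we have the primal $L_P(\vec{\theta}, \theta_0, \vec{\xi})$, which we minimize in $\vec{\theta}, \theta_0, \vec{\xi}$:
$$
\begin{aligned}
\nabla_{\vec{\theta}} L_P = 0 &\implies \vec{\theta} = \sum_{i=1}^{m} \alpha_i y_i \vec{x}_i \\
\frac{\partial L_P}{\partial \theta_0} = 0 &\implies 0 = \sum_{i=1}^{m} \alpha_i y_i \\
\frac{\partial L_P}{\partial \xi_i} = 0 &\implies \alpha_i = C - \mu_i \qquad \forall i
\end{aligned}
$$
Lagrange dual:
$$
L_D = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \vec{x}_i^T \vec{x}_j
$$
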
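The step from the Lagrangian to the dual is a direct substitution of the three stationarity conditions. The following worked derivation spells it out, using the same symbols as the Lagrangian:

```latex
\begin{aligned}
L &= \tfrac{1}{2}\|\vec{\theta}\|^2
     + \sum_{i=1}^{m} (C - \alpha_i - \mu_i)\,\xi_i
     - \sum_{i=1}^{m} \alpha_i y_i \vec{\theta}^T \vec{x}_i
     - \theta_0 \sum_{i=1}^{m} \alpha_i y_i
     + \sum_{i=1}^{m} \alpha_i \\
  &= \tfrac{1}{2}\vec{\theta}^T\vec{\theta} - \vec{\theta}^T\vec{\theta}
     + \sum_{i=1}^{m} \alpha_i
     && \text{using } \alpha_i = C - \mu_i,\ \sum_i \alpha_i y_i = 0,\
        \vec{\theta} = \sum_i \alpha_i y_i \vec{x}_i \\
  &= \sum_{i=1}^{m} \alpha_i
     - \tfrac{1}{2} \sum_{i=1}^{m}\sum_{j=1}^{m}
       \alpha_i \alpha_j y_i y_j \vec{x}_i^T \vec{x}_j = L_D
\end{aligned}
```

The slack variables and $\mu_i$ drop out of the objective; they survive only through the bound $\alpha_i = C - \mu_i \leq C$.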
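$$
\begin{aligned}
\max_{\vec{\alpha}}\ L_D &= \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \vec{x}_i^T \vec{x}_j && (1) \\
0 \leq \alpha_i &\leq C && (2) \\
\sum_{i=1}^{m} \alpha_i y_i &= 0 && (3) \\
\alpha_i \left[ y_i(\vec{x}_i^T \vec{\theta} + \theta_0) - (1 - \xi_i) \right] &= 0 && (4) \\
\mu_i \xi_i &= 0 && (5) \\
y_i(\vec{x}_i^T \vec{\theta} + \theta_0) - (1 - \xi_i) &\geq 0 && (6) \\
\mu_i \geq 0, \quad \xi_i &\geq 0 && (7)
\end{aligned}
$$
From (5) and $\partial L_P / \partial \xi_i = 0$ (that is, $\alpha_i = C - \mu_i$), the support vectors are:
◮ the points that lie on the edge of the margin ($\xi_i = 0$), and hence $0 < \alpha_i < C$
◮ the points that violate the margin ($\xi_i > 0$, which includes the misclassified points), which have $\alpha_i = C$

The margin points can be used to solve (4) for $\theta_0$.
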
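For a margin point ($0 < \alpha_i < C$, $\xi_i = 0$), condition (4) gives $y_i(\vec{x}_i^T \vec{\theta} + \theta_0) = 1$, so $\theta_0 = y_i - \sum_j \alpha_j y_j \vec{x}_j^T \vec{x}_i$ because $y_i^2 = 1$. The sketch below averages this estimate over all margin points, a common choice for numerical stability that is an assumption here, not something stated in the lecture. The toy multipliers are hand-computed for two points.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def offset_from_margin_points(alphas, xs, ys, C, tol=1e-9):
    """Average theta_0 over points with 0 < alpha_i < C.

    Assumes at least one such margin point exists.
    """
    estimates = []
    for a_i, x_i, y_i in zip(alphas, xs, ys):
        if tol < a_i < C - tol:
            s = sum(a * y * dot(x, x_i) for a, y, x in zip(alphas, ys, xs))
            estimates.append(y_i - s)
    return sum(estimates) / len(estimates)

# Two points on the first axis; theta = (1, 0) separates them with margin 1.
xs = [(3.0, 0.0), (1.0, 0.0)]
ys = [1, -1]
alphas = [0.5, 0.5]  # hand-computed multipliers for this toy set
theta0 = offset_from_margin_points(alphas, xs, ys, C=1.0)
```

With $\vec{\theta} = \sum_i \alpha_i y_i \vec{x}_i = (1, 0)$ and the recovered offset, both points sit exactly on the margin.
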
Coordinate ascent

$$
\max_{\vec{\alpha}}\ W(\alpha_1, \alpha_2, \ldots, \alpha_m)
$$

repeat
    for i = 1, ..., m do
        $\alpha_i := \arg\max_{\hat{\alpha}_i} W(\alpha_1, \ldots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \ldots, \alpha_m)$
until convergence

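Coordinate ascent is easy to run on a concave quadratic, where each one-dimensional maximization has a closed form. The sketch below assumes $W(\vec{\alpha}) = \vec{b}^T \vec{\alpha} - \frac{1}{2} \vec{\alpha}^T Q \vec{\alpha}$ with a symmetric positive definite $Q$ and no constraints. This objective, the stopping rule and the toy numbers are assumptions chosen for illustration.

```python
def coordinate_ascent(Q, b, tol=1e-10, max_sweeps=1000):
    """Maximize W(a) = b^T a - 0.5 a^T Q a one coordinate at a time."""
    m = len(b)
    alpha = [0.0] * m
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(m):
            # Setting dW/d(alpha_i) = 0 with the other coordinates fixed:
            # alpha_i = (b_i - sum_{j != i} Q_ij alpha_j) / Q_ii
            rest = sum(Q[i][j] * alpha[j] for j in range(m) if j != i)
            new_value = (b[i] - rest) / Q[i][i]
            largest_change = max(largest_change, abs(new_value - alpha[i]))
            alpha[i] = new_value
        if largest_change < tol:  # a full sweep changed nothing: converged
            break
    return alpha

Q = [[2.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
alpha = coordinate_ascent(Q, b)  # the maximizer solves Q alpha = b
```

For the SVM dual this scheme cannot be used as is: changing a single $\alpha_i$ would violate $\sum_i \alpha_i y_i = 0$.
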
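Sequential Minimal Optimization

$$
\max_{\vec{\alpha}}\ W(\alpha_1, \alpha_2, \ldots, \alpha_m) \qquad \text{s.t.} \quad \sum_{i=1}^{m} y_i \alpha_i = 0
$$
Fix all but two $\alpha$s and change those two at a time.

repeat
    select $\alpha_i$ and $\alpha_j$ by some heuristic;
    hold all $\alpha_l$, $l \neq i, j$ fixed and optimize $W(\vec{\alpha})$ in $\alpha_i, \alpha_j$
until convergence

For example, with $i = 1$ and $j = 2$:
$$
\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{m} \alpha_i y_i = \zeta \implies \alpha_1 = (\zeta - \alpha_2 y_2)\, y_1
$$
where $\zeta$ is a constant (not the soft-margin parameter $C$), and multiplying by $y_1$ equals dividing by it because $y_1 \in \{-1, +1\}$.
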
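One pair update can be sketched as follows. This follows the update rule of the commonly taught simplified SMO variant, which is an assumption here since the lecture does not spell out the formulas: $\alpha_j$ is moved along the constraint line, clipped to the box $0 \leq \alpha \leq C$ of the soft-margin dual, and $\alpha_i$ is adjusted so that $\sum_i \alpha_i y_i$ stays unchanged. The pair-selection heuristic and the update of $\theta_0$ are omitted, and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smo_pair_step(i, j, alphas, xs, ys, C, kernel=dot):
    """Optimize W over (alpha_i, alpha_j), holding all other alphas fixed."""
    def decision_without_offset(x):
        return sum(a * y * kernel(xk, x) for a, y, xk in zip(alphas, ys, xs))

    # Errors E_k = f(x_k) - y_k; theta_0 cancels in E_i - E_j, so it is omitted.
    e_i = decision_without_offset(xs[i]) - ys[i]
    e_j = decision_without_offset(xs[j]) - ys[j]

    # Bounds that keep both multipliers in [0, C] on the constraint line.
    if ys[i] != ys[j]:
        low = max(0.0, alphas[j] - alphas[i])
        high = min(C, C + alphas[j] - alphas[i])
    else:
        low = max(0.0, alphas[i] + alphas[j] - C)
        high = min(C, alphas[i] + alphas[j])

    eta = 2 * kernel(xs[i], xs[j]) - kernel(xs[i], xs[i]) - kernel(xs[j], xs[j])
    if low == high or eta >= 0:
        return list(alphas)  # no progress possible on this pair

    a_j = alphas[j] - ys[j] * (e_i - e_j) / eta
    a_j = min(high, max(low, a_j))                       # clip to the box
    a_i = alphas[i] + ys[i] * ys[j] * (alphas[j] - a_j)  # keep sum alpha*y fixed

    updated = list(alphas)
    updated[i], updated[j] = a_i, a_j
    return updated

xs = [(3.0, 0.0), (1.0, 0.0)]
ys = [1, -1]
step1 = smo_pair_step(0, 1, [0.0, 0.0], xs, ys, C=1.0)
step2 = smo_pair_step(0, 1, step1, xs, ys, C=1.0)  # already optimal: no change
```

On this two-point toy set a single step reaches the optimum, and a second step leaves the multipliers unchanged.
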
Example (figure)

SVM for K-Classes

1. Train $K$ SVMs; each SVM separates one class from all the others.
2. Choose the indication of the SVM that makes the strongest prediction, that is, the one for which the input point lies furthest into the positive region.

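The selection rule in step 2 amounts to an argmax over the $K$ decision values. The sketch below assumes linear classifiers given as hypothetical `(theta, theta0)` pairs; the three classifiers and the query point are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def one_vs_rest_predict(x, classifiers):
    """classifiers[k] = (theta, theta0) of the SVM for class k versus the rest."""
    scores = [dot(theta, x) + theta0 for theta, theta0 in classifiers]
    # Strongest prediction: the largest (most positive) decision value.
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best, scores

classifiers = [
    ((1.0, 0.0), 0.0),    # class 0 vs rest
    ((-1.0, 0.0), 0.0),   # class 1 vs rest
    ((0.0, 1.0), -1.0),   # class 2 vs rest
]
label, scores = one_vs_rest_predict((0.5, 3.0), classifiers)
```

Here classes 0 and 2 both give a positive value, and the larger one decides.
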
SVM for regression

With a quantitative response we try to fit as many points as possible within the margin; hence we change the objective function in (OPT3) into:
$$
\min \sum_{i=1}^{m} V_\epsilon(y_i - f(\vec{x}_i)) + \frac{\lambda}{2} \|\vec{\theta}\|^2
$$
$$
V_\epsilon(r) = \begin{cases} 0 & \text{if } |r| < \epsilon \\ |r| - \epsilon & \text{otherwise} \end{cases}
$$

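The $\epsilon$-insensitive loss ignores residuals smaller than $\epsilon$ and grows linearly beyond that. The sketch below evaluates the regression objective under the assumption of a linear model $f(\vec{x}) = \vec{\theta}^T \vec{x} + \theta_0$; the parameter values and data are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def eps_insensitive(r, eps):
    # V_eps(r) = 0 if |r| < eps, |r| - eps otherwise.
    return 0.0 if abs(r) < eps else abs(r) - eps

def svr_objective(theta, theta0, xs, ys, eps, lam):
    # Assumes a linear model f(x) = theta^T x + theta_0.
    residuals = [y - (dot(theta, x) + theta0) for x, y in zip(xs, ys)]
    loss = sum(eps_insensitive(r, eps) for r in residuals)
    return loss + 0.5 * lam * dot(theta, theta)

theta, theta0 = (2.0,), 0.0
xs = [(1.0,), (2.0,), (3.0,)]
ys = [2.05, 4.5, 5.0]   # residuals of about 0.05, 0.5 and -1.0
objective = svr_objective(theta, theta0, xs, ys, eps=0.1, lam=1.0)
```

Only the second and third points contribute to the loss; the first lies inside the $\epsilon$-tube.
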
SVM as Regularized Function
