  1. Support Vector Machines. Charlie Frogner, MIT, 2011. Slides mostly stolen from Ryan Rifkin (Google).

  2. Plan. Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric derivation of SVMs. Practical issues.

  3. The Regularization Setting (Again). Given n examples (x_1, y_1), ..., (x_n, y_n), with x_i ∈ R^n and y_i ∈ {−1, 1} for all i. We can find a classification function by solving a regularized learning problem:

     argmin_{f ∈ H} (1/n) ∑_{i=1}^n V(y_i, f(x_i)) + λ ‖f‖²_H

  Note that in this class we are specifically considering binary classification.

  4. The Hinge Loss. The classical SVM arises by considering the specific loss function

     V(y, f(x)) ≡ (1 − y f(x))₊,

  where (k)₊ ≡ max(k, 0).

  5. The Hinge Loss. [Figure: the hinge loss (1 − y f(x))₊ plotted against y·f(x) over the range [−3, 3]; the loss is zero for y f(x) ≥ 1 and increases linearly as y f(x) decreases below 1.]
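A minimal sketch of this loss in Python (NumPy assumed; the function name hinge_loss is ours):

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss (1 - y * f(x))_+, elementwise."""
    return np.maximum(0.0, 1.0 - y * fx)

# Zero loss once a point is classified with margin >= 1:
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, 0.5])))
# [0.  0.7 1.5]
```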

  6. Substituting In The Hinge Loss. With the hinge loss, our regularization problem becomes

     argmin_{f ∈ H} (1/n) ∑_{i=1}^n (1 − y_i f(x_i))₊ + λ ‖f‖²_H

  Note that we don't have a 1/2 multiplier on the regularization term.

  7. Slack Variables. This problem is non-differentiable (because of the "kink" in V). So rewrite the "max" function using slack variables ξ_i:

     argmin_{f ∈ H} (1/n) ∑_{i=1}^n ξ_i + λ ‖f‖²_H
     subject to: ξ_i ≥ 1 − y_i f(x_i),  i = 1, ..., n
                 ξ_i ≥ 0,               i = 1, ..., n
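Why the slack formulation is equivalent (a short step not spelled out on the slide): the objective is increasing in each ξ_i, so at the optimum every ξ_i drops to the smallest value its two constraints allow,

```latex
\xi_i = \max\{\, 0,\; 1 - y_i f(x_i) \,\} = (1 - y_i f(x_i))_+ ,
```

which is exactly the hinge loss.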

  8. Applying The Representer Theorem. Substituting in

     f*(x) = ∑_{i=1}^n c_i K(x, x_i),

  we get a constrained quadratic programming problem:

     argmin_{c ∈ R^n, ξ ∈ R^n} (1/n) ∑_{i=1}^n ξ_i + λ c^T K c
     subject to: ξ_i ≥ 1 − y_i ∑_{j=1}^n c_j K(x_i, x_j),  i = 1, ..., n
                 ξ_i ≥ 0,                                  i = 1, ..., n
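Here K is the n × n Gram matrix, K_ij = K(x_i, x_j). A sketch of building it in Python; the Gaussian (RBF) kernel and the parameter gamma are illustrative choices of ours, not fixed by the slides:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)

X = np.random.randn(5, 2)
K = rbf_kernel(X, X)   # 5 x 5, symmetric positive semidefinite
```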

  9. Adding A Bias Term. Adding an unregularized bias term b (which presents some theoretical difficulties) we get the "primal" SVM:

     argmin_{c ∈ R^n, b ∈ R, ξ ∈ R^n} (1/n) ∑_{i=1}^n ξ_i + λ c^T K c
     subject to: ξ_i ≥ 1 − y_i (∑_{j=1}^n c_j K(x_i, x_j) + b),  i = 1, ..., n
                 ξ_i ≥ 0,                                        i = 1, ..., n

  10. Standard Notation. In most of the SVM literature, instead of λ, a parameter C is used to control regularization:

     C = 1 / (2λn).

  Using this definition (after multiplying our objective function by the constant 1/(2λ)), the regularization problem becomes

     argmin_{f ∈ H} C ∑_{i=1}^n V(y_i, f(x_i)) + (1/2) ‖f‖²_H

  Like λ, the parameter C also controls the tradeoff between classification accuracy and the norm of the function. The primal problem becomes...
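A quick numeric check of the conversion (the numbers are ours, purely illustrative): with n = 100 examples and λ = 0.005,

```latex
C = \frac{1}{2\lambda n} = \frac{1}{2 \cdot 0.005 \cdot 100} = 1 .
```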

  11. The Reparametrized Problem.

     argmin_{c ∈ R^n, b ∈ R, ξ ∈ R^n} C ∑_{i=1}^n ξ_i + (1/2) c^T K c
     subject to: ξ_i ≥ 1 − y_i (∑_{j=1}^n c_j K(x_i, x_j) + b),  i = 1, ..., n
                 ξ_i ≥ 0,                                        i = 1, ..., n

  12. How to Solve?

     argmin_{c ∈ R^n, b ∈ R, ξ ∈ R^n} C ∑_{i=1}^n ξ_i + (1/2) c^T K c
     subject to: ξ_i ≥ 1 − y_i (∑_{j=1}^n c_j K(x_i, x_j) + b),  i = 1, ..., n
                 ξ_i ≥ 0,                                        i = 1, ..., n

  This is a constrained optimization problem. The general approach: Form the primal problem – we did this. Lagrangian from primal – just like Lagrange multipliers. Dual – one dual variable associated to each primal constraint in the Lagrangian.

  13. Lagrangian. We derive the dual from the primal using the Lagrangian:

     L(c, ξ, b, α, ζ) = C ∑_{i=1}^n ξ_i + (1/2) c^T K c
                        − ∑_{i=1}^n α_i ( y_i {∑_{j=1}^n c_j K(x_i, x_j) + b} − 1 + ξ_i )
                        − ∑_{i=1}^n ζ_i ξ_i

  14. Dual I. The dual problem is:

     argmax_{α, ζ ≥ 0} inf_{c, ξ, b} L(c, ξ, b, α, ζ)

  First, minimize L w.r.t. (c, ξ, b):

     ∂L/∂c = 0    ⇒  c_i = α_i y_i                        (1)
     ∂L/∂b = 0    ⇒  ∑_{i=1}^n α_i y_i = 0                (2)
     ∂L/∂ξ_i = 0  ⇒  C − α_i − ζ_i = 0  ⇒  0 ≤ α_i ≤ C    (3)
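The step behind condition (1), spelled out (our addition; the slide states only the result): differentiating L with respect to c gives

```latex
\frac{\partial L}{\partial c}
  = K c - K\,\mathrm{diag}(y)\,\alpha
  = K\bigl(c - \mathrm{diag}(y)\,\alpha\bigr) = 0 ,
```

so c = diag(y) α, i.e. c_i = α_i y_i, is always a minimizer (the unique one when K is strictly positive definite).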

  15. Dual II. Dual:

     argmax_{α, ζ ≥ 0} inf_{c, ξ, b} L(c, ξ, b, α, ζ)

  Optimality conditions:

     c_i = α_i y_i           (1)
     ∑_{i=1}^n α_i y_i = 0   (2)
     α_i ∈ [0, C]            (3)

  Plug in (2) and (3):

     argmax_{α ≥ 0} inf_c L(c, α) = (1/2) c^T K c + ∑_{i=1}^n α_i ( 1 − y_i ∑_{j=1}^n K(x_i, x_j) c_j )

  16. Dual II. Dual:

     argmax_{α, ζ ≥ 0} inf_{c, ξ, b} L(c, ξ, b, α, ζ)

  Optimality conditions:

     c_i = α_i y_i           (1)
     ∑_{i=1}^n α_i y_i = 0   (2)
     α_i ∈ [0, C]            (3)

  Plug in (1):

     argmax_{α ≥ 0} L(α) = ∑_{i=1}^n α_i − (1/2) ∑_{i,j=1}^n α_i y_i K(x_i, x_j) α_j y_j
                         = ∑_{i=1}^n α_i − (1/2) α^T (diag Y) K (diag Y) α

  17. The Primal and Dual Problems Again.

     argmin_{c ∈ R^n, b ∈ R, ξ ∈ R^n} C ∑_{i=1}^n ξ_i + (1/2) c^T K c
     subject to: ξ_i ≥ 1 − y_i (∑_{j=1}^n c_j K(x_i, x_j) + b),  i = 1, ..., n
                 ξ_i ≥ 0,                                        i = 1, ..., n

     max_{α ∈ R^n} ∑_{i=1}^n α_i − (1/2) α^T Q α
     subject to: ∑_{i=1}^n y_i α_i = 0
                 0 ≤ α_i ≤ C,  i = 1, ..., n

  where Q = (diag Y) K (diag Y).
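A sketch of solving this dual with an off-the-shelf QP solver (CVXOPT here, one of many options; the function name solve_svm_dual is ours). The maximization is flipped into CVXOPT's minimization form, min (1/2) α^T Q α − 1^T α:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(K, y, C):
    """Solve: max 1'a - 0.5 a'Qa  s.t.  y'a = 0, 0 <= a_i <= C."""
    n = len(y)
    Q = np.outer(y, y) * K                          # Q = diag(Y) K diag(Y)
    P = matrix(Q)
    q = matrix(-np.ones(n))                         # negate for min form
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))  # -a_i <= 0, a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))      # y'a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()
```

Dedicated SVM solvers (e.g., libsvm's SMO) exploit the box-plus-one-equality structure and scale far better than a generic QP solver; the above is only meant to mirror the problem as written.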

  18. SVM Training. Basic idea: solve the dual problem to find the optimal α's, and use them to find b and c. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint, and the problem can be decomposed into a sequence of smaller problems (see appendix).

  19. Interpreting the solution. α tells us: c and b, and the identities of the misclassified points. How to analyze? Use the optimality conditions. Already used: the derivative of L w.r.t. (c, ξ, b) is zero at optimality. Haven't used: complementary slackness, primal/dual constraints.

  20. Optimality Conditions: all of them. All optimal solutions must satisfy:

     ∑_{j=1}^n c_j K(x_i, x_j) − ∑_{j=1}^n y_j α_j K(x_i, x_j) = 0,      i = 1, ..., n
     ∑_{i=1}^n α_i y_i = 0
     C − α_i − ζ_i = 0,                                                  i = 1, ..., n
     y_i ( ∑_{j=1}^n y_j α_j K(x_i, x_j) + b ) − 1 + ξ_i ≥ 0,            i = 1, ..., n
     α_i [ y_i ( ∑_{j=1}^n y_j α_j K(x_i, x_j) + b ) − 1 + ξ_i ] = 0,    i = 1, ..., n
     ζ_i ξ_i = 0,                                                        i = 1, ..., n
     ξ_i, α_i, ζ_i ≥ 0,                                                  i = 1, ..., n
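A sketch of verifying these conditions numerically for a candidate (α, b) (our helper; the tolerance is needed because solvers return approximate solutions):

```python
import numpy as np

def check_kkt(K, y, alpha, b, C, tol=1e-6):
    """Numerically check the KKT conditions for an SVM dual solution."""
    f = K @ (alpha * y) + b              # f(x_i), using c_i = alpha_i y_i
    xi = np.maximum(0.0, 1.0 - y * f)    # slacks at their optimal values
    zeta = C - alpha                     # from C - alpha_i - zeta_i = 0
    ok = abs(np.dot(alpha, y)) < tol                      # sum alpha_i y_i = 0
    ok &= np.all((alpha > -tol) & (alpha < C + tol))      # 0 <= alpha_i <= C
    ok &= np.all(np.abs(alpha * (y * f - 1 + xi)) < tol)  # compl. slackness
    ok &= np.all(np.abs(zeta * xi) < tol)                 # zeta_i xi_i = 0
    return bool(ok)
```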

  21. Optimality Conditions II. These optimality conditions are both necessary and sufficient for optimality: (c, ξ, b, α, ζ) satisfy all of the conditions if and only if they are optimal for both the primal and the dual. (Also known as the Karush-Kuhn-Tucker (KKT) conditions.)

  22. Interpreting the solution – c.

     ∂L/∂c = 0  ⇒  c_i = α_i y_i, ∀i

  23. Interpreting the solution – b. Suppose we have the optimal α_i's. Also suppose that there exists an i satisfying 0 < α_i < C. Then:

     α_i < C  ⇒  ζ_i > 0  ⇒  ξ_i = 0
              ⇒  y_i ( ∑_{j=1}^n y_j α_j K(x_i, x_j) + b ) − 1 = 0
              ⇒  b = y_i − ∑_{j=1}^n y_j α_j K(x_i, x_j)
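In code, recovering b might look like this (a sketch; averaging over all margin support vectors rather than picking a single i is our choice, for numerical stability):

```python
import numpy as np

def recover_bias(K, y, alpha, C, tol=1e-8):
    """b = y_i - sum_j y_j alpha_j K(x_i, x_j) for any 0 < alpha_i < C."""
    free = (alpha > tol) & (alpha < C - tol)  # assumes at least one margin SV
    f0 = K @ (alpha * y)                      # f(x_i) without the bias
    return float(np.mean(y[free] - f0[free]))
```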

  24. Interpreting the solution – sparsity. (Remember we defined f(x) = ∑_{i=1}^n y_i α_i K(x, x_i) + b.)

     y_i f(x_i) > 1  ⇒  (1 − y_i f(x_i)) < 0
                     ⇒  ξ_i ≠ (1 − y_i f(x_i))   (since ξ_i ≥ 0)
                     ⇒  α_i = 0                  (by complementary slackness)

  25. Interpreting the solution – support vectors.

     y_i f(x_i) < 1  ⇒  (1 − y_i f(x_i)) > 0  ⇒  ξ_i > 0  ⇒  ζ_i = 0  ⇒  α_i = C

  26. Interpreting the solution – support vectors. So y_i f(x_i) < 1 ⇒ α_i = C. Conversely, suppose α_i = C:

     α_i = C  ⇒  ξ_i = 1 − y_i f(x_i)  ⇒  y_i f(x_i) ≤ 1

  27. Interpreting the solution. Here are all of the derived conditions:

     α_i = 0      ⇒  y_i f(x_i) ≥ 1
     0 < α_i < C  ⇒  y_i f(x_i) = 1
     α_i = C      ⇒  y_i f(x_i) ≤ 1

     α_i = 0  ⇐  y_i f(x_i) > 1
     α_i = C  ⇐  y_i f(x_i) < 1
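These conditions let us read point categories straight off α (a sketch; the tol thresholds are ours):

```python
import numpy as np

def categorize(alpha, C, tol=1e-8):
    """Split training points by their alpha values."""
    non_sv = alpha <= tol                          # y_i f(x_i) >= 1
    margin_sv = (alpha > tol) & (alpha < C - tol)  # y_i f(x_i) == 1
    bound_sv = alpha >= C - tol                    # y_i f(x_i) <= 1
    return non_sv, margin_sv, bound_sv
```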

  28. Geometric Interpretation of Reduced Optimality Conditions. [Figure-only slide illustrating the conditions above.]

  29. Summary so far. The SVM is a Tikhonov regularization problem, using the hinge loss:

     argmin_{f ∈ H} (1/n) ∑_{i=1}^n (1 − y_i f(x_i))₊ + λ ‖f‖²_H

  Solving the SVM means solving a constrained quadratic program. Solutions can be sparse – some coefficients are zero. The nonzero coefficients correspond to points that aren't classified correctly enough – this is where the "support vector" in SVM comes from.
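Putting the sketches above together on a toy problem (this assumes the hypothetical helpers rbf_kernel, solve_svm_dual, and recover_bias defined earlier; all names and data here are ours):

```python
import numpy as np

# Toy 2-class data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)),
               rng.normal(1.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

C = 1.0
K = rbf_kernel(X, X, gamma=0.5)
alpha = solve_svm_dual(K, y, C)
b = recover_bias(K, y, alpha, C)

f = K @ (alpha * y) + b            # decision values f(x_i)
print("training accuracy:", np.mean(np.sign(f) == y))
print("support vectors:", int(np.sum(alpha > 1e-8)), "of", len(y))
```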
