  1. Machine Learning - MT 2017, 13: Support Vector Machines II. Christoph Haase, University of Oxford, November 6, 2017

  2. Last Time
     ◮ Primal Formulation of SVM
     ◮ Slack variables for linearly non-separable data

  3. SVM Formulation: Non-Separable Case
     minimise:   $\frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i$
     subject to: $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \ldots, N$
     Here $y_i \in \{-1, 1\}$
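
For concreteness, here is a minimal numpy sketch (not from the slides) that evaluates this objective and checks the constraints for a candidate $(\mathbf{w}, w_0, \boldsymbol{\zeta})$; the data, weights, and value of $C$ below are made-up values for illustration.

```python
import numpy as np

# Toy data and parameters (illustrative values only); labels are in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0, C = np.array([0.5, 0.5]), 0.0, 1.0

def primal_objective(w, w0, zeta, C):
    """(1/2) ||w||^2 + C * sum_i zeta_i."""
    return 0.5 * np.dot(w, w) + C * np.sum(zeta)

def feasible(w, w0, zeta, X, y, tol=1e-9):
    """Check y_i (w.x_i + w0) >= 1 - zeta_i and zeta_i >= 0 for every i."""
    margins = y * (X @ w + w0)
    return bool(np.all(margins >= 1 - zeta - tol) and np.all(zeta >= -tol))

# Smallest slacks that make (w, w0) feasible for this data
zeta = np.maximum(0.0, 1.0 - y * (X @ w + w0))
print(primal_objective(w, w0, zeta, C), feasible(w, w0, zeta, X, y))
```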

  4. SVM Formulation: Loss Function
     minimise:   $\underbrace{\tfrac{1}{2}\lVert \mathbf{w} \rVert_2^2}_{\text{Regularizer}} + \underbrace{C \sum_{i=1}^{N} \zeta_i}_{\text{Loss Function}}$
     subject to: $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \ldots, N$; here $y_i \in \{-1, 1\}$
     [Figure: hinge loss plotted against the margin $y(\mathbf{w} \cdot \mathbf{x} + w_0)$]
     Note that for the optimal solution, $\zeta_i = \max\{0,\ 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0)\}$
     Thus, SVM can be viewed as minimizing the hinge loss with regularization
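
Using the observation that the optimal slacks are $\zeta_i = \max\{0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0)\}$, the constrained problem can be rewritten as unconstrained hinge-loss minimisation with an $\ell_2$ regulariser. A small sketch of that equivalent objective (the helper names and example margins are mine):

```python
import numpy as np

def hinge_loss(margins):
    """Hinge loss max(0, 1 - m) applied elementwise to the margins m_i = y_i (w.x_i + w0)."""
    return np.maximum(0.0, 1.0 - margins)

def svm_objective(w, w0, X, y, C):
    """Equivalent unconstrained form: (1/2) ||w||^2 + C * sum_i hinge(y_i (w.x_i + w0))."""
    margins = y * (X @ w + w0)
    return 0.5 * np.dot(w, w) + C * np.sum(hinge_loss(margins))

# The hinge loss is zero once the margin exceeds 1 and grows linearly below that
print(hinge_loss(np.array([-1.0, 0.0, 0.5, 1.0, 2.0])))  # [2.  1.  0.5 0.  0. ]
```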

  5. Logistic Regression: Loss Function
     Here $y_i \in \{0, 1\}$, so to compare effectively to SVM, let $z_i = 2y_i - 1$:
     ◮ $z_i = 1$ if $y_i = 1$
     ◮ $z_i = -1$ if $y_i = 0$
     $\mathrm{NLL}(y_i; \mathbf{w}, \mathbf{x}_i) = -\left[ y_i \log\!\left(\frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}_i}}\right) + (1 - y_i) \log\!\left(\frac{1}{1 + e^{\mathbf{w} \cdot \mathbf{x}_i}}\right) \right] = \log\!\left(1 + e^{-z_i(\mathbf{w} \cdot \mathbf{x}_i)}\right) = \log\!\left(1 + e^{-(2y_i - 1)(\mathbf{w} \cdot \mathbf{x}_i)}\right)$
     [Figure: logistic loss plotted against the margin $(2y - 1)(\mathbf{w} \cdot \mathbf{x} + w_0)$]
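
For comparison, a short sketch (my own, not from the slides) that evaluates both losses as a function of the margin; both penalise confidently wrong predictions heavily, but the logistic loss never reaches exactly zero while the hinge loss does.

```python
import numpy as np

def hinge(z):
    """Hinge loss as a function of the margin z = y (w.x + w0), with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - z)

def logistic(z):
    """Logistic (NLL) loss log(1 + exp(-z)), with z = (2y - 1)(w.x + w0) and y in {0, 1}."""
    return np.log1p(np.exp(-z))

z = np.linspace(-6.0, 6.0, 7)
print(np.column_stack([z, hinge(z), logistic(z)]))  # columns: margin, hinge, logistic
```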

  6. Loss Functions
     [Figure: loss functions]

  7. Outline
     ◮ Dual Formulation of SVM
     ◮ Kernels

  8. SVM Formulation: Non-Separable Case
     What if your data looks like this?
     [Figure: a dataset that is not linearly separable]

  9. SVM Formulation: Constrained Minimisation
     minimise:   $\frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i$
     subject to: $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) - (1 - \zeta_i) \ge 0$ and $\zeta_i \ge 0$ for $i = 1, \ldots, N$
     Here $y_i \in \{-1, 1\}$

  10. Constrained Optimisation with Inequalities
      Primal Form
      minimise $F(\mathbf{z})$ subject to $g_i(\mathbf{z}) \ge 0$ for $i = 1, \ldots, m$ and $h_j(\mathbf{z}) = 0$ for $j = 1, \ldots, l$
      Lagrange Function
      $\Lambda(\mathbf{z}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = F(\mathbf{z}) - \sum_{i=1}^{m} \alpha_i g_i(\mathbf{z}) - \sum_{j=1}^{l} \mu_j h_j(\mathbf{z})$
      For convex problems (as defined before), the Karush-Kuhn-Tucker (KKT) conditions provide necessary and sufficient conditions for a critical point of $\Lambda$ to be the minimum of the original constrained optimisation problem
      For non-convex problems, they are necessary but not sufficient
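
To see this machinery on something small, here is a sketch that solves a toy convex problem with scipy's general-purpose SLSQP solver and then checks the KKT conditions at the reported optimum; the problem, tolerances, and variable names are my own and have nothing to do with SVMs yet.

```python
import numpy as np
from scipy.optimize import minimize

# Toy convex problem: minimise F(z) = (z1 - 2)^2 + (z2 - 1)^2 subject to g(z) = 1 - z1 - z2 >= 0
F = lambda z: (z[0] - 2.0) ** 2 + (z[1] - 1.0) ** 2
g = lambda z: 1.0 - z[0] - z[1]

res = minimize(F, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": g}])
z = res.x                                  # optimum is approximately (1, 0)

# KKT check: stationarity grad F = alpha * grad g with alpha >= 0, and alpha * g(z) = 0
grad_F = np.array([2.0 * (z[0] - 2.0), 2.0 * (z[1] - 1.0)])
grad_g = np.array([-1.0, -1.0])
alpha = grad_F[0] / grad_g[0]              # multiplier recovered from stationarity (approx. 2)
print(np.round(z, 4), round(alpha, 4), round(alpha * g(z), 6))
```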

  11. KKT Conditions
      Lagrange Function
      $\Lambda(\mathbf{z}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = F(\mathbf{z}) - \sum_{i=1}^{m} \alpha_i g_i(\mathbf{z}) - \sum_{j=1}^{l} \mu_j h_j(\mathbf{z})$
      For convex problems, the Karush-Kuhn-Tucker (KKT) conditions give necessary and sufficient conditions for a solution (critical point of $\Lambda$) to be optimal:
      Dual feasibility:        $\alpha_i \ge 0$ for $i = 1, \ldots, m$
      Primal feasibility:      $g_i(\mathbf{z}) \ge 0$ for $i = 1, \ldots, m$ and $h_j(\mathbf{z}) = 0$ for $j = 1, \ldots, l$
      Complementary slackness: $\alpha_i\, g_i(\mathbf{z}) = 0$ for $i = 1, \ldots, m$

  12. SVM Formulation
      minimise:   $\frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i$
      subject to: $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) - (1 - \zeta_i) \ge 0$ and $\zeta_i \ge 0$ for $i = 1, \ldots, N$; here $y_i \in \{-1, 1\}$
      Lagrange Function
      $\Lambda(\mathbf{w}, w_0, \boldsymbol{\zeta}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) - (1 - \zeta_i) \right) - \sum_{i=1}^{N} \mu_i \zeta_i$

  13. SVM Dual Formulation
      Lagrange Function
      $\Lambda(\mathbf{w}, w_0, \boldsymbol{\zeta}; \boldsymbol{\alpha}, \boldsymbol{\mu}) = \frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \left( y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) - (1 - \zeta_i) \right) - \sum_{i=1}^{N} \mu_i \zeta_i$
      We take derivatives with respect to $\mathbf{w}$, $w_0$ and $\zeta_i$:
      $\nabla_{\mathbf{w}} \Lambda = \mathbf{w} - \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
      $\frac{\partial \Lambda}{\partial w_0} = -\sum_{i=1}^{N} \alpha_i y_i$
      $\frac{\partial \Lambda}{\partial \zeta_i} = C - \alpha_i - \mu_i$
      For the (KKT) dual feasibility constraints, we require $\alpha_i \ge 0$, $\mu_i \ge 0$
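
Setting these derivatives to zero (the stationarity part of the KKT conditions) yields the relations that the next slide substitutes back into $\Lambda$; written out:

$$\nabla_{\mathbf{w}} \Lambda = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial \Lambda}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad \frac{\partial \Lambda}{\partial \zeta_i} = 0 \;\Rightarrow\; \alpha_i = C - \mu_i \le C \quad (\text{since } \mu_i \ge 0).$$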

  14. SVM Dual Formulation
      Setting the derivatives to 0, substituting the resulting expressions in $\Lambda$ (and simplifying), we get a function $g(\boldsymbol{\alpha})$ and some constraints:
      $g(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$
      Constraints: $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0$
      Finding critical points of $\Lambda$ satisfying the KKT conditions corresponds to finding the maximum of $g(\boldsymbol{\alpha})$ subject to the above constraints
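
To make the dual concrete, here is a minimal sketch (not from the lecture) that maximises $g(\boldsymbol{\alpha})$ for a tiny made-up dataset by handing the problem to scipy's generic SLSQP solver; real SVM libraries use specialised solvers such as SMO instead, and the data and variable names below are my own.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny separable toy set (made up for illustration); labels in {-1, +1}
X = np.array([[2.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, C = len(y), 10.0
K = X @ X.T                                # matrix of dot products x_i . x_j

def neg_g(alpha):
    """Negative dual objective; we minimise -g(alpha) in order to maximise g(alpha)."""
    ay = alpha * y
    return -(alpha.sum() - 0.5 * ay @ K @ ay)

res = minimize(neg_g, x0=np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x
print(np.round(alpha, 3))                  # alpha_i > 0 only for the support vectors
```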

  15. SVM: Primal and Dual Formulations
      Primal Form
      minimise:   $\frac{1}{2}\lVert \mathbf{w} \rVert_2^2 + C \sum_{i=1}^{N} \zeta_i$
      subject to: $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) \ge 1 - \zeta_i$ and $\zeta_i \ge 0$ for $i = 1, \ldots, N$
      Dual Form
      maximise:   $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j$
      subject to: $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$

  16. KKT Complementary Slackness Conditions
      ◮ For all $i$, $\alpha_i \left( y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) - (1 - \zeta_i) \right) = 0$
      ◮ If $\alpha_i > 0$, then $y_i(\mathbf{w} \cdot \mathbf{x}_i + w_0) = 1 - \zeta_i$
      ◮ Recall the form of the solution: $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
      ◮ Thus, only those datapoints $\mathbf{x}_i$ for which $\alpha_i > 0$ determine the solution
      ◮ This is why they are called support vectors
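
To illustrate how the multipliers single out the support vectors and determine $\mathbf{w}$ and $w_0$, here is a self-contained sketch; the 1-D dataset and its known dual solution $\alpha_1 = \alpha_2 = 0.5$ are a worked example of mine, not from the lecture.

```python
import numpy as np

# 1-D toy problem with a known solution: x = +1 labelled +1 and x = -1 labelled -1.
# The max-margin separator is w = 1, w0 = 0, and the dual optimum is alpha = (0.5, 0.5).
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
alpha, C = np.array([0.5, 0.5]), 10.0

sv = alpha > 1e-8                          # support vectors: alpha_i > 0
w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
# Any support vector with 0 < alpha_i < C has zeta_i = 0, so y_i (w.x_i + w0) = 1 gives w0:
on_margin = sv & (alpha < C - 1e-8)
w0 = float(np.mean(y[on_margin] - X[on_margin] @ w))
print(w, w0, np.where(sv)[0])              # [1.]  0.0  [0 1]
```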

  17. Support Vectors
      [Figure: support vectors]

  18. SVM Dual Formulation
      maximise:   $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^{\mathsf T} \mathbf{x}_j$
      subject to: $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0$
      ◮ The objective depends only on dot products between training inputs
      ◮ The dual formulation is particularly useful if inputs are high-dimensional
      ◮ The dual constraints are much simpler than the primal ones
      ◮ To make a new prediction, we only need dot products with the support vectors
      ◮ The solution is of the form $\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$
      ◮ And so $\mathbf{w} \cdot \mathbf{x}_{\text{new}} = \sum_{i=1}^{N} \alpha_i y_i\, \mathbf{x}_i \cdot \mathbf{x}_{\text{new}}$
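
A minimal sketch of the last two bullets: predicting on a new point needs only dot products with the support vectors. The function name and toy values continue the 1-D example above and are mine, not the lecture's.

```python
import numpy as np

def svm_predict(x_new, X_sv, y_sv, alpha_sv, w0):
    """Decision value w . x_new + w0, written purely via dot products with the support vectors."""
    return float(np.sum(alpha_sv * y_sv * (X_sv @ x_new)) + w0)

# Support vectors, multipliers, and bias from the 1-D toy example above
X_sv = np.array([[1.0], [-1.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.5, 0.5])
w0 = 0.0

print(np.sign(svm_predict(np.array([0.3]), X_sv, y_sv, alpha_sv, w0)))   # +1
print(np.sign(svm_predict(np.array([-2.0]), X_sv, y_sv, alpha_sv, w0)))  # -1
```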

  19. Outline
      ◮ Dual Formulation of SVM
      ◮ Kernels

  20. Gram Matrix
      If we put the inputs in a matrix $X$, where the $i$-th row of $X$ is $\mathbf{x}_i^{\mathsf T}$:
      $K = XX^{\mathsf T} = \begin{bmatrix} \mathbf{x}_1^{\mathsf T}\mathbf{x}_1 & \mathbf{x}_1^{\mathsf T}\mathbf{x}_2 & \cdots & \mathbf{x}_1^{\mathsf T}\mathbf{x}_N \\ \mathbf{x}_2^{\mathsf T}\mathbf{x}_1 & \mathbf{x}_2^{\mathsf T}\mathbf{x}_2 & \cdots & \mathbf{x}_2^{\mathsf T}\mathbf{x}_N \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_N^{\mathsf T}\mathbf{x}_1 & \mathbf{x}_N^{\mathsf T}\mathbf{x}_2 & \cdots & \mathbf{x}_N^{\mathsf T}\mathbf{x}_N \end{bmatrix}$
      ◮ The matrix $K$ is positive definite if the $\mathbf{x}_i$ are linearly independent (which requires $D \ge N$)
      ◮ If we perform basis expansion $\phi: \mathbb{R}^D \to \mathbb{R}^M$, then replace the entries by $\phi(\mathbf{x}_i)^{\mathsf T} \phi(\mathbf{x}_j)$
      ◮ We only need the ability to compute inner products to use SVM
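
A quick numpy illustration of the first bullet with random data of my own choosing: when $D > N$, the rows are almost surely linearly independent and $K = XX^{\mathsf T}$ is positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 10                               # more features than data points
X = rng.normal(size=(N, D))

K = X @ X.T                                # N x N Gram matrix of pairwise dot products
eigvals = np.linalg.eigvalsh(K)
print(K.shape, bool(eigvals.min() > 0))    # (5, 5) True: K is positive definite
```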

  21. Kernel Trick
      Suppose $\mathbf{x} \in \mathbb{R}^2$ and we perform a degree-2 polynomial expansion. We could use the map
      $\psi(\mathbf{x}) = \left[ 1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2 \right]^{\mathsf T}$
      But we could also use the map
      $\phi(\mathbf{x}) = \left[ 1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2 \right]^{\mathsf T}$
      If $\mathbf{x} = [x_1, x_2]^{\mathsf T}$ and $\mathbf{x}' = [x_1', x_2']^{\mathsf T}$, then
      $\phi(\mathbf{x})^{\mathsf T} \phi(\mathbf{x}') = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 (x_1')^2 + x_2^2 (x_2')^2 + 2 x_1 x_2 x_1' x_2' = (1 + x_1 x_1' + x_2 x_2')^2 = (1 + \mathbf{x} \cdot \mathbf{x}')^2$
      Instead of spending $\approx D^d$ time to compute inner products after degree-$d$ polynomial basis expansion, we only need $O(D)$ time
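
A minimal numerical check of this identity; the two test points are arbitrary.

```python
import numpy as np

def phi(x):
    """Feature map for x in R^2 whose inner product equals (1 + x . x')^2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)                       # explicit basis expansion
rhs = (1.0 + x @ xp) ** 2                    # kernel evaluation: only O(D) work
print(lhs, rhs, bool(np.isclose(lhs, rhs)))  # both equal 4.0
```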

  22. Kernel Trick
      We can use any symmetric positive semi-definite kernel matrix (Mercer kernel):
      $K = \begin{bmatrix} \kappa(\mathbf{x}_1, \mathbf{x}_1) & \kappa(\mathbf{x}_1, \mathbf{x}_2) & \cdots & \kappa(\mathbf{x}_1, \mathbf{x}_N) \\ \kappa(\mathbf{x}_2, \mathbf{x}_1) & \kappa(\mathbf{x}_2, \mathbf{x}_2) & \cdots & \kappa(\mathbf{x}_2, \mathbf{x}_N) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(\mathbf{x}_N, \mathbf{x}_1) & \kappa(\mathbf{x}_N, \mathbf{x}_2) & \cdots & \kappa(\mathbf{x}_N, \mathbf{x}_N) \end{bmatrix}$
      Here $\kappa(\mathbf{x}, \mathbf{x}')$ is some measure of similarity between $\mathbf{x}$ and $\mathbf{x}'$
      The dual program becomes
      maximise:   $\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K_{i,j}$
      subject to: $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
      To make a prediction on a new $\mathbf{x}_{\text{new}}$, we only need to compute $\kappa(\mathbf{x}_i, \mathbf{x}_{\text{new}})$ for the support vectors $\mathbf{x}_i$ (those with $\alpha_i > 0$)
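
A small sketch of how a kernel matrix and a kernelised decision function look in code; the helper names are mine, and with the linear kernel the decision value reduces to the dot-product SVM of the earlier 1-D toy example.

```python
import numpy as np

def kernel_matrix(X, kappa):
    """Build the N x N matrix K with K[i, j] = kappa(x_i, x_j)."""
    return np.array([[kappa(xi, xj) for xj in X] for xi in X])

def kernel_decision(x_new, X_sv, y_sv, alpha_sv, w0, kappa):
    """Kernelised decision value: sum_i alpha_i y_i kappa(x_i, x_new) + w0 over support vectors."""
    return sum(a * yi * kappa(xi, x_new) for a, yi, xi in zip(alpha_sv, y_sv, X_sv)) + w0

linear = lambda x, xp: float(x @ xp)       # linear kernel: plain dot product
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

print(kernel_matrix(X, linear))                                    # [[ 1. -1.] [-1.  1.]]
print(kernel_decision(np.array([0.3]), X, y, alpha, 0.0, linear))  # 0.3
```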

  23. Polynomial Kernels
      Rather than performing basis expansion, use $\kappa(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x} \cdot \mathbf{x}')^d$
      This gives all terms of degree up to $d$
      If we use $\kappa(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}')^d$, we get only degree-$d$ terms
      Linear kernel: $\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}'$
      All of these satisfy the Mercer (positive-definite) condition

  24. Gaussian or RBF Kernel
      Radial Basis Function (RBF) or Gaussian kernel:
      $\kappa(\mathbf{x}, \mathbf{x}') = \exp\!\left( \frac{-\lVert \mathbf{x} - \mathbf{x}' \rVert^2}{2\sigma^2} \right)$
      $\sigma^2$ is known as the bandwidth
      We used this with $\gamma = \frac{1}{2\sigma^2}$ when we studied kernel basis expansion for regression
      Can generalise to more general covariance matrices
      Results in a Mercer kernel
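
A direct sketch of this kernel (the default bandwidth and test points are my own choices):

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian / RBF kernel exp(-||x - x'||^2 / (2 sigma^2)); equivalently gamma = 1 / (2 sigma^2)."""
    return float(np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2)))

x, xp = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(rbf_kernel(x, x), rbf_kernel(x, xp))   # 1.0 and exp(-1), about 0.368
```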

  25. Kernels on Discrete Data: Cosine Kernel
      For text documents, let $\mathbf{x}$ denote the bag-of-words vector
      Cosine similarity: $\kappa(\mathbf{x}, \mathbf{x}') = \frac{\mathbf{x} \cdot \mathbf{x}'}{\lVert \mathbf{x} \rVert_2 \lVert \mathbf{x}' \rVert_2}$
      Term frequency: $\mathrm{tf}(c) = \log(1 + c)$, where $c$ is the count of some word $w$
      Inverse document frequency: $\mathrm{idf}(w) = \log\!\left( \frac{N}{1 + N_w} \right)$, where $N$ is the number of documents and $N_w$ the number of documents containing $w$
      $\text{tf-idf}(\mathbf{x})_w = \mathrm{tf}(x_w)\, \mathrm{idf}(w)$
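
A small sketch of these formulas on a made-up document-term count matrix (the counts and variable names are mine):

```python
import numpy as np

def tf_idf(counts):
    """counts: N x V matrix of raw word counts; returns tf-idf weighted bag-of-words vectors."""
    N = counts.shape[0]
    tf = np.log1p(counts)                            # tf(c) = log(1 + c)
    n_docs_with_word = (counts > 0).sum(axis=0)      # N_w
    idf = np.log(N / (1.0 + n_docs_with_word))       # idf(w) = log(N / (1 + N_w))
    return tf * idf

def cosine_kernel(x, xp):
    """Cosine similarity x . x' / (||x||_2 ||x'||_2)."""
    return float(x @ xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

# 4 documents, 3 words; word 2 appears in 3 of the 4 documents, so its idf weight is log(4/4) = 0
counts = np.array([[3, 0, 1], [0, 2, 1], [1, 1, 0], [0, 0, 2]])
Z = tf_idf(counts)
print(round(cosine_kernel(Z[0], Z[2]), 3))           # about 0.707
```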

  26. Kernels on Discrete Data: String Kernel
      Let $\mathbf{x}$ and $\mathbf{x}'$ be strings over some alphabet $A$, e.g. the 20 amino-acid letters
      $A = \{A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V\}$
      $\kappa(\mathbf{x}, \mathbf{x}') = \sum_s w_s\, \phi_s(\mathbf{x})\, \phi_s(\mathbf{x}')$
      $\phi_s(\mathbf{x})$ is the number of times $s$ appears in $\mathbf{x}$ as a substring
      $w_s$ is the weight associated with substring $s$
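
A minimal sketch of this kernel with uniform weights $w_s = 1$ and substrings of length at most 3; both of those choices, and the example strings, are mine for illustration.

```python
def substring_counts(x, max_len=3):
    """phi_s(x): count every substring s of x with 1 <= len(s) <= max_len."""
    counts = {}
    for i in range(len(x)):
        for j in range(i + 1, min(i + max_len, len(x)) + 1):
            s = x[i:j]
            counts[s] = counts.get(s, 0) + 1
    return counts

def string_kernel(x, xp, weight=lambda s: 1.0, max_len=3):
    """kappa(x, x') = sum_s w_s * phi_s(x) * phi_s(x'), summed over shared substrings."""
    cx, cxp = substring_counts(x, max_len), substring_counts(xp, max_len)
    return sum(weight(s) * cx[s] * cxp[s] for s in cx if s in cxp)

# Two short strings over the amino-acid alphabet
print(string_kernel("ARNDAR", "ARKAR"))
```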

  27. How to choose a good kernel?
      It is not always easy to tell whether a kernel function is a Mercer kernel
      Mercer condition: for any finite set of points, the kernel matrix must be positive semi-definite
      If the following hold:
      ◮ $\kappa_1, \kappa_2$ are Mercer kernels for points in $\mathbb{R}^D$
      ◮ $f: \mathbb{R}^D \to \mathbb{R}$
      ◮ $\phi: \mathbb{R}^D \to \mathbb{R}^M$
      ◮ $\kappa_3$ is a Mercer kernel on $\mathbb{R}^M$
      then the following are Mercer kernels:
      ◮ $\kappa_1 + \kappa_2$, $\kappa_1 \cdot \kappa_2$, and $\alpha \kappa_1$ for $\alpha \ge 0$
      ◮ $\kappa(\mathbf{x}, \mathbf{x}') = f(\mathbf{x}) f(\mathbf{x}')$
      ◮ $\kappa_3(\phi(\mathbf{x}), \phi(\mathbf{x}'))$
      ◮ $\kappa(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\mathsf T} A \mathbf{x}'$ for $A$ positive definite
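
The Mercer condition can be spot-checked numerically: build the kernel matrix for a handful of points and confirm its smallest eigenvalue is non-negative (up to floating-point error). The example kernels, points, and tolerance below are my own illustrative choices, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                               # a few random points in R^2

def gram(kappa):
    """Kernel matrix for the points in X under the kernel kappa."""
    return np.array([[kappa(a, b) for b in X] for a in X])

k1 = lambda x, xp: float(x @ xp)                          # linear kernel (Mercer)
k2 = lambda x, xp: float(np.exp(-np.sum((x - xp) ** 2)))  # RBF kernel (Mercer)
f = lambda x: float(np.sin(x[0]))                         # any function f: R^D -> R

combos = [("k1 + k2", lambda x, xp: k1(x, xp) + k2(x, xp)),
          ("k1 * k2", lambda x, xp: k1(x, xp) * k2(x, xp)),
          ("f(x) f(x')", lambda x, xp: f(x) * f(xp))]
for name, kappa in combos:
    eig_min = np.linalg.eigvalsh(gram(kappa)).min()
    print(name, bool(eig_min >= -1e-10))                  # PSD check passes for all three
```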
