
Statistics and learning: Support Vector Machines. Sébastien Gadat, Toulouse School of Economics, February 2017.


  1. Statistics and learning: Support Vector Machines. Sébastien Gadat, Toulouse School of Economics, February 2017.

  2. Linearly separable data. Intuition: how would you separate the white points from the black points?

  3.-5. Separation hyperplane (figure-only slides).

  6. Separation hyperplane (figure: margins M+ and M- around the hyperplane β).
     Any separating hyperplane can be written (β, β_0) such that:
       ∀ i = 1..N, β^T x_i + β_0 ≥ 0 if y_i = +1
       ∀ i = 1..N, β^T x_i + β_0 ≤ 0 if y_i = -1
     This can be written: ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 0.

  7. Separation hyperplane. But... y_i (β^T x_i + β_0) is the signed distance between point i and the hyperplane (β, β_0).
     Margin of a separating hyperplane: min_i y_i (β^T x_i + β_0)?
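
Not part of the slides, but a minimal numpy sketch of this margin computation: the toy data and the candidate hyperplane (beta, beta0) are made up for illustration, and beta is normalized so that y_i (β^T x_i + β_0) really is a signed distance.

```python
import numpy as np

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

# A made-up candidate hyperplane (beta, beta0), normalized to ||beta|| = 1
# so that y_i * (beta^T x_i + beta0) is a genuine signed distance.
beta = np.array([1.0, 1.0])
beta /= np.linalg.norm(beta)
beta0 = 0.0

signed_dist = y * (X @ beta + beta0)   # one signed distance per point
margin = signed_dist.min()             # margin of this candidate hyperplane
print(signed_dist, margin)
```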

  8. Separation hyperplane. Optimal separating hyperplane: maximize the margin M between the hyperplane and the data:
       max_{β,β_0} M
       such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ M and ‖β‖ = 1

  9. Separation hyperplane. Let's get rid of ‖β‖ = 1:
       ∀ i = 1..N, (1/‖β‖) y_i (β^T x_i + β_0) ≥ M
       ⇒ ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ M ‖β‖

  10. Separation hyperplane.
       ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ M ‖β‖
     If (β, β_0) satisfies this constraint, then ∀ α > 0, (αβ, αβ_0) does too.
     Let's choose to have ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1; then we need to set ‖β‖ = 1/M.

  11. Separation hyperplane. Now M = 1/‖β‖. Geometrical interpretation? So
       max_{β,β_0} M ⇔ min_{β,β_0} ‖β‖ ⇔ min_{β,β_0} (1/2) ‖β‖²

  12. Separation hyperplane. Optimal separating hyperplane (continued):
       min_{β,β_0} (1/2) ‖β‖²
       such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1
     Maximize the margin M = 1/‖β‖ between the hyperplane and the data.

  13. Optimal separating hyperplane:
       min_{β,β_0} (1/2) ‖β‖²
       such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1
     It's a QP problem!
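
For illustration (this is not in the slides): since the problem is a small quadratic program, it can be handed directly to a generic convex solver. A minimal sketch assuming cvxpy is available, on made-up toy data:

```python
import numpy as np
import cvxpy as cp

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N, n = X.shape

beta = cp.Variable(n)
beta0 = cp.Variable()

# min (1/2) ||beta||^2   s.t.   y_i (beta^T x_i + beta0) >= 1 for all i
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
problem.solve()

print("beta  =", beta.value)
print("beta0 =", beta0.value)
print("margin M = 1/||beta|| =", 1 / np.linalg.norm(beta.value))
```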

  14. Optimal separating hyperplane: min_{β,β_0} (1/2)‖β‖² s.t. ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1. It's a QP problem! Lagrangian:
       L_P(β, β_0, α) = (1/2)‖β‖² - Σ_{i=1}^{N} α_i [ y_i (β^T x_i + β_0) - 1 ]

  15. Optimal separating hyperplane: min_{β,β_0} (1/2)‖β‖² s.t. ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1. It's a QP problem!
       L_P(β, β_0, α) = (1/2)‖β‖² - Σ_{i=1}^{N} α_i [ y_i (β^T x_i + β_0) - 1 ]
     KKT conditions:
       ∂L_P/∂β = 0 ⇒ β = Σ_{i=1}^{N} α_i y_i x_i
       ∂L_P/∂β_0 = 0 ⇒ 0 = Σ_{i=1}^{N} α_i y_i
       ∀ i = 1..N, α_i [ y_i (β^T x_i + β_0) - 1 ] = 0
       ∀ i = 1..N, α_i ≥ 0

  16. Optimal separating hyperplane: min_{β,β_0} (1/2)‖β‖² s.t. ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1. It's a QP problem!
       ∀ i = 1..N, α_i [ y_i (β^T x_i + β_0) - 1 ] = 0
     Two possibilities:
       - α_i > 0: then y_i (β^T x_i + β_0) = 1; x_i is on the margin's boundary.
       - α_i = 0: then x_i is anywhere on the boundary or further away, but does not participate in β.
     β = Σ_{i=1}^{N} α_i y_i x_i. The x_i for which α_i > 0 are called support vectors.

  17. Optimal separating hyperplane: min_{β,β_0} (1/2)‖β‖² s.t. ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1. It's a QP problem!
     Dual problem:
       max_{α ∈ (R_+)^N} L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
       such that Σ_{i=1}^{N} α_i y_i = 0
     Solving the dual problem is a maximization in R^N, rather than a (constrained) minimization in R^n. Usual algorithm: SMO (Sequential Minimal Optimization).
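
Again as an illustration rather than part of the slides: on small problems a generic QP solver can stand in for SMO. A sketch assuming cvxopt is installed (cvxopt minimizes (1/2) a^T P a + q^T a subject to G a ≤ h and A a = b, so the dual is rewritten as a minimization); the toy data are made up:

```python
import numpy as np
from cvxopt import matrix, solvers

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N))  # P_ij = y_i y_j x_i^T x_j (+ tiny ridge for stability)
q = matrix(-np.ones(N))                  # maximizing sum(alpha) = minimizing -sum(alpha)
G = matrix(-np.eye(N))                   # -alpha_i <= 0, i.e. alpha_i >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))             # equality constraint: sum_i alpha_i y_i = 0
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
beta = (alpha * y) @ X                   # recover beta = sum_i alpha_i y_i x_i
print("alpha =", np.round(alpha, 4))
print("beta  =", beta)
```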

  18. Optimal separating hyperplane: min_{β,β_0} (1/2)‖β‖² s.t. ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1. It's a QP problem!
     And β_0? Solve α_i [ y_i (β^T x_i + β_0) - 1 ] = 0 for any i such that α_i > 0.

  19. Optimal separating hyperplane. Overall:
       β = Σ_{i=1}^{N} α_i y_i x_i
     with α_i > 0 only for the support vectors x_i.
     Prediction: f(x) = sign(β^T x + β_0) = sign( Σ_{i=1}^{N} α_i y_i x_i^T x + β_0 )
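
These quantities can also be read off a fitted scikit-learn SVC (the library choice is an assumption of this sketch, not something the slides use): support_ holds the indices of the support vectors and dual_coef_ stores the products α_i y_i. Made-up toy data; a very large C approximates the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

alpha_y = clf.dual_coef_.ravel()              # alpha_i * y_i, support vectors only
beta = alpha_y @ clf.support_vectors_         # beta = sum_i alpha_i y_i x_i
beta0 = clf.intercept_[0]

x_new = np.array([[1.0, 0.5]])
print("support vector indices:", clf.support_)
print("f(x_new) =", np.sign(x_new @ beta + beta0), "predict:", clf.predict(x_new))
```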

  20.-22. Non-linearly separable data? (figure-only slides).

  23. Non-linearly separable data? Slack variables ξ = (ξ_1, ..., ξ_N):
       y_i (β^T x_i + β_0) ≥ M - ξ_i    or    y_i (β^T x_i + β_0) ≥ M (1 - ξ_i)
       with ξ_i ≥ 0 and Σ_{i=1}^{N} ξ_i ≤ K

  24. Non-linearly separable data?
       y_i (β^T x_i + β_0) ≥ M (1 - ξ_i) ⇒ misclassification if ξ_i ≥ 1
       Σ_{i=1}^{N} ξ_i ≤ K ⇒ at most K misclassifications

  25. Non-linearly separable data? Optimal separating hyperplane:
       min_{β,β_0} ‖β‖
       such that ∀ i = 1..N: y_i (β^T x_i + β_0) ≥ 1 - ξ_i,  ξ_i ≥ 0,  Σ_{i=1}^{N} ξ_i ≤ K

  26. Non-linearly separable data? Optimal separating hyperplane:
       min_{β,β_0} (1/2) ‖β‖² + C Σ_{i=1}^{N} ξ_i
       such that ∀ i = 1..N: y_i (β^T x_i + β_0) ≥ 1 - ξ_i,  ξ_i ≥ 0
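
As with the hard-margin case, this primal can be written down almost verbatim for a generic solver. A minimal cvxpy sketch (an illustration, not part of the slides), on made-up data that are not linearly separable:

```python
import numpy as np
import cvxpy as cp

# Made-up toy data that are NOT linearly separable
# (the last negative point lies inside the positive cluster).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [2.5, 2.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0, -1.0])
N, n = X.shape
C = 1.0

beta = cp.Variable(n)
beta0 = cp.Variable()
xi = cp.Variable(N)

# min (1/2) ||beta||^2 + C * sum_i xi_i
# s.t. y_i (beta^T x_i + beta0) >= 1 - xi_i  and  xi_i >= 0
constraints = [cp.multiply(y, X @ beta + beta0) >= 1 - xi, xi >= 0]
objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
cp.Problem(objective, constraints).solve()

print("beta =", beta.value, "beta0 =", beta0.value)
print("slacks xi =", np.round(xi.value, 3))   # xi_i >= 1 means point i is misclassified
```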

  27. Optimal separating hyperplane. Again a QP problem.
       L_P = (1/2)‖β‖² + C Σ_{i=1}^{N} ξ_i - Σ_{i=1}^{N} α_i [ y_i (β^T x_i + β_0) - (1 - ξ_i) ] - Σ_{i=1}^{N} μ_i ξ_i
     KKT conditions:
       ∂L_P/∂β = 0 ⇒ β = Σ_{i=1}^{N} α_i y_i x_i
       ∂L_P/∂β_0 = 0 ⇒ 0 = Σ_{i=1}^{N} α_i y_i
       ∂L_P/∂ξ_i = 0 ⇒ α_i = C - μ_i
       ∀ i = 1..N, α_i [ y_i (β^T x_i + β_0) - (1 - ξ_i) ] = 0
       ∀ i = 1..N, μ_i ξ_i = 0
       ∀ i = 1..N, α_i ≥ 0, μ_i ≥ 0

  28. Optimal separating hyperplane. Dual problem:
       max_{α ∈ (R_+)^N} L_D(α) = Σ_{i=1}^{N} α_i - (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
       such that Σ_{i=1}^{N} α_i y_i = 0 and 0 ≤ α_i ≤ C
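
Compared with the hard-margin dual sketched earlier, only the inequality constraints change: each α_i is now also bounded above by C. A minimal cvxopt sketch, again only as an illustration, on made-up non-separable data:

```python
import numpy as np
from cvxopt import matrix, solvers

# Made-up toy data that are not linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
N, C = len(y), 1.0

P = matrix(np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N))  # tiny ridge for stability
q = matrix(-np.ones(N))
# Stack  -alpha_i <= 0  and  alpha_i <= C  into a single system G a <= h.
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()
print(np.round(alpha, 3))   # bounded support vectors sit at alpha_i = C
```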

  29. Optimal separating hyperplane.
       α_i [ y_i (β^T x_i + β_0) - (1 - ξ_i) ] = 0   and   β = Σ_{i=1}^{N} α_i y_i x_i
     Again:
       - α_i > 0: then y_i (β^T x_i + β_0) = 1 - ξ_i; x_i is a support vector. Among these:
           - ξ_i = 0: then 0 ≤ α_i ≤ C;
           - ξ_i > 0: then α_i = C (because μ_i = 0, because μ_i ξ_i = 0).
       - α_i = 0: then x_i does not participate in β.
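
With scikit-learn's SVC (again a library choice of this sketch, not of the slides), the same distinction can be read off the fitted model: |dual_coef_| gives α_i for each support vector, and α_i = C flags the points with ξ_i possibly positive:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy data that are not linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5], [2.5, 2.5]])
y = np.array([+1, +1, -1, -1, -1])
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_.ravel())   # |alpha_i y_i| = alpha_i
at_bound = alpha > C - 1e-8              # alpha_i = C: xi_i may be > 0
print("support vector indices:", clf.support_)
print("alpha:", np.round(alpha, 3))
print("on the margin (alpha < C):", ~at_bound)
print("at the bound (alpha = C):", at_bound)
```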

  30. Optimal separating hyperplane. Overall:
       β = Σ_{i=1}^{N} α_i y_i x_i
     with α_i > 0 only for the support vectors x_i.
     Prediction: f(x) = sign(β^T x + β_0) = sign( Σ_{i=1}^{N} α_i y_i x_i^T x + β_0 )

  31. Non-linear SVMs? Key remark: h : X → H, x ↦ h(x), is a mapping to a p-dimensional Euclidean space (p ≫ n, possibly infinite).
     SVM classifier in H (writing x' = h(x) and x'_i = h(x_i)):
       f(x') = sign( Σ_{i=1}^{N} α_i y_i ⟨x'_i, x'⟩ + β_0 ).
     Suppose K(x, x') = ⟨h(x), h(x')⟩. Then:
       f(x) = sign( Σ_{i=1}^{N} α_i y_i K(x_i, x) + β_0 ).
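
A sketch of this formula in action, assuming scikit-learn (SVC and its rbf_kernel helper) on made-up 1-D data: the decision value for a new point is rebuilt from the support vectors and kernel evaluations only, without ever computing h(x), and compared with the library's own prediction.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Made-up 1-D data that no linear classifier separates.
X = np.array([[-2.0], [-1.0], [1.0], [2.0], [0.0], [0.2], [-0.3], [3.0]])
y = np.array([-1, -1, -1, -1, +1, +1, +1, -1])

gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

# f(x) = sign( sum_i alpha_i y_i K(x_i, x) + beta_0 ), using support vectors only.
x_new = np.array([[0.1]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)   # column of K(x_i, x_new)
f = clf.dual_coef_ @ K + clf.intercept_                    # dual_coef_ holds alpha_i y_i
print(np.sign(f).ravel(), clf.predict(x_new))              # same sign as clf.predict
```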

  32. Kernels. K(x, y) = ⟨h(x), h(y)⟩ is called a kernel function.

  33. Kernels. K(x, y) = ⟨h(x), h(y)⟩ is called a kernel function.
     Example: X = R², H = R³, h(x) = (x_1², √2 x_1 x_2, x_2²)^T, and
       K(x, y) = h(x)^T h(y) = (x^T y)².
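
A quick numpy check of this example (not in the slides): the inner product of the explicit features equals (x^T y)² computed directly in R², which is what makes the kernel trick possible here.

```python
import numpy as np

def h(x):
    # Explicit feature map from the slide: R^2 -> R^3.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

lhs = h(x) @ h(y)     # <h(x), h(y)> computed in H = R^3
rhs = (x @ y) ** 2    # the same value computed directly in X = R^2
print(lhs, rhs)       # both equal 1.0 here, since (1*3 + 2*(-1))^2 = 1
```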

  34. Kernels. K(x, y) = ⟨h(x), h(y)⟩ is called a kernel function.
     What if we knew that K(·, ·) is a kernel, without explicitly building h? The SVM would be a linear classifier in H, but we would never have to compute h(x) for training or prediction! This is called the kernel trick.

  35. Kernels. K(x, y) = ⟨h(x), h(y)⟩ is called a kernel function.
     Under what conditions is K(·, ·) an acceptable kernel? Answer: if it is an inner product on a (separable) Hilbert space. More generally, we are interested in positive definite kernels:
     Positive definite kernels. K(·, ·) is a positive definite kernel on X if
       ∀ n ∈ N, ∀ x ∈ X^n, ∀ c ∈ R^n,  Σ_{i,j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0.
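
A numpy illustration of this definition (an addition, not from the slides): for a positive definite kernel, every Gram matrix K_ij = K(x_i, x_j) is positive semidefinite, which can be checked numerically through its eigenvalues. The Gaussian kernel is used here as a standard example.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    # Gaussian (RBF) kernel, a classical positive definite kernel.
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))   # arbitrary points x_1, ..., x_n in X = R^3

# Gram matrix K_ij = K(x_i, x_j); c^T K c >= 0 for all c iff K is PSD.
K = np.array([[gaussian_kernel(a, b) for b in pts] for a in pts])
eigenvalues = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigenvalues.min())   # >= 0 up to numerical error
```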
