Support Vector Machine and Kernel Methods

  1. Support Vector Machine and Kernel Methods. Jiayu Zhou, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA. February 26, 2017. (CSE 847 Machine Learning)

  2. Which Separator Do You Pick?

  3. Robustness to Noisy Data. Being robust to noise (measurement error) is good (remember regularization).

  4. Thicker Cushion Means More Robustness. We call such hyperplanes fat.

  5. Two Crucial Questions. (1) Can we efficiently find the fattest separating hyperplane? (2) Is a fatter hyperplane better than a thin one?

  6. Pulling Out the Bias.
  Before: $x \in \{1\} \times \mathbb{R}^d$, $w \in \mathbb{R}^{d+1}$, with $x = [1, x_1, \dots, x_d]^T$ and $w = [w_0, w_1, \dots, w_d]^T$ (where $w_0$ is the bias $b$); signal $= w^T x$.
  After: $x \in \mathbb{R}^d$, $b \in \mathbb{R}$, $w \in \mathbb{R}^d$, with $x = [x_1, \dots, x_d]^T$ and $w = [w_1, \dots, w_d]^T$; signal $= w^T x + b$.

  7. Separating the Data. Hyperplane $h = (b, w)$. "$h$ separates the data" means $y_n(w^T x_n + b) > 0$ for all $n$. By rescaling the weights and bias, we can require
  $\min_{n=1,\dots,N} y_n(w^T x_n + b) = 1$.

  8. Distance to the Hyperplane. $w$ is normal to the hyperplane (why?): for any $x_1, x_2$ on $h$, $w^T(x_2 - x_1) = w^T x_2 - w^T x_1 = -b + b = 0$. Scalar projection: $a^T b = \|a\|\|b\|\cos(a, b) \Rightarrow a^T b / \|b\| = \|a\|\cos(a, b)$. Let $x_\perp$ be the orthogonal projection of $x$ onto $h$; the distance to the hyperplane is given by the projection of $x - x_\perp$ onto $w$ (why?):
  $\mathrm{dist}(x, h) = \frac{1}{\|w\|} \, |w^T x - w^T x_\perp| = \frac{1}{\|w\|} \, |w^T x + b|$.
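A one-line numpy check of this formula, using the hyperplane $w = (1, -1)$, $b = -1$ that the toy example later in the deck arrives at:

```python
import numpy as np

def dist_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane {z : w^T z + b = 0}."""
    return abs(w @ x + b) / np.linalg.norm(w)

# Toy hyperplane from the later slides: w = (1, -1), b = -1.
w = np.array([1.0, -1.0])
b = -1.0
print(dist_to_hyperplane(np.array([0.0, 0.0]), w, b))  # 1/sqrt(2) ~ 0.707
```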

  9. Fatness of a Separating Hyperplane.
  $\mathrm{dist}(x_n, h) = \frac{1}{\|w\|} |w^T x_n + b| = \frac{1}{\|w\|} |y_n(w^T x_n + b)| = \frac{1}{\|w\|} \, y_n(w^T x_n + b)$.
  Fatness = distance to the closest point:
  Fatness $= \min_n \mathrm{dist}(x_n, h) = \frac{1}{\|w\|} \min_n y_n(w^T x_n + b) = \frac{1}{\|w\|}$.

  10. Maximizing the Margin. Formal definition of margin: $\gamma(h) = \frac{1}{\|w\|}$. NOTE: the bias $b$ does not appear in the margin. Objective maximizing the margin:
  $\min_{b, w} \frac{1}{2} w^T w$ subject to $\min_{n=1,\dots,N} y_n(w^T x_n + b) = 1$.
  An equivalent objective:
  $\min_{b, w} \frac{1}{2} w^T w$ subject to $y_n(w^T x_n + b) \ge 1$ for $n = 1, \dots, N$.

  11. Example: Our Toy Data Set.
  $\min_{b, w} \frac{1}{2} w^T w$ subject to $y_n(w^T x_n + b) \ge 1$ for $n = 1, \dots, N$.
  Training data:
  $X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$.
  What is the margin?

  12. Example: Our Toy Data Set.
  $\min_{b, w} \frac{1}{2} w^T w$ subject to $y_n(w^T x_n + b) \ge 1$ for $n = 1, \dots, N$.
  With the training data above, the constraints are:
  (1) $-b \ge 1$; (2) $-(2w_1 + 2w_2 + b) \ge 1$; (3) $2w_1 + b \ge 1$; (4) $3w_1 + b \ge 1$.
  (1) + (3) gives $w_1 \ge 1$; (2) + (3) gives $w_2 \le -1$; hence $\frac{1}{2} w^T w = \frac{1}{2}(w_1^2 + w_2^2) \ge 1$.
  Thus: $w_1 = 1$, $w_2 = -1$, $b = -1$.

  13. Example: Our Toy Data Set.
  Given data $X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}$, the optimal solution is $w^* = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$, $b^* = -1$.
  Optimal hyperplane: $g(x) = \mathrm{sign}(x_1 - x_2 - 1)$.
  Margin: $\frac{1}{\|w\|} = \frac{1}{\sqrt{2}} \approx 0.707$.
  For data points (1), (2) and (3), $y_n(x_n^T w^* + b^*) = 1$: these are the support vectors.
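A quick numpy verification of this solution (plain numpy, no solver; it recomputes $y_n(x_n^T w^* + b^*)$ and the margin):

```python
import numpy as np

X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, +1, +1], dtype=float)
w_star, b_star = np.array([1.0, -1.0]), -1.0

scores = y * (X @ w_star + b_star)   # y_n (w^T x_n + b)
print(scores)                        # [1. 1. 1. 2.] -> first three points are support vectors
print(1 / np.linalg.norm(w_star))    # margin = 1/sqrt(2) ~ 0.707
```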

  14. Solver: Quadratic Programming.
  $\min_{u \in \mathbb{R}^q} \frac{1}{2} u^T Q u + p^T u$ subject to $A u \ge c$.
  $u^* \leftarrow \mathrm{QP}(Q, p, A, c)$. ($Q = 0$ is linear programming.)
  http://cvxopt.org/examples/tutorial/qp.html
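A minimal sketch of such a QP call with cvxopt (the library linked on the slide). Note that cvxopt's `solvers.qp` minimizes $\frac{1}{2}u^T P u + q^T u$ subject to $Gu \le h$, so the slide's constraint $Au \ge c$ is passed as $G = -A$, $h = -c$; the wrapper name `qp` is mine:

```python
import numpy as np
from cvxopt import matrix, solvers

def qp(Q, p, A, c):
    """Solve min_u 0.5 u^T Q u + p^T u  s.t.  A u >= c,
    by mapping to cvxopt's convention G u <= h (G = -A, h = -c)."""
    sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))
    return np.array(sol['x']).ravel()
```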

  15. Maximum Margin Hyperplane is QP.
  SVM primal: $\min_{b, w} \frac{1}{2} w^T w$ subject to $y_n(w^T x_n + b) \ge 1, \forall n$.
  Generic QP: $\min_{u \in \mathbb{R}^q} \frac{1}{2} u^T Q u + p^T u$ subject to $A u \ge c$.
  Stack the bias and weights into $u = \begin{bmatrix} b \\ w \end{bmatrix} \in \mathbb{R}^{d+1}$. Then
  $\frac{1}{2} w^T w = \frac{1}{2} [b, w^T] \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix} \begin{bmatrix} b \\ w \end{bmatrix} = \frac{1}{2} u^T Q u$, so $Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}$, $p = 0_{d+1}$.
  Each constraint $y_n(w^T x_n + b) \ge 1$ becomes $[y_n, y_n x_n^T]\, u \ge 1$, so stacking over $n$:
  $A = \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix}, \quad c = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$.

  16. Back To Our Example. Exercise: with
  $X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad y = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$,
  the constraints are (1) $-b \ge 1$; (2) $-(2w_1 + 2w_2 + b) \ge 1$; (3) $2w_1 + b \ge 1$; (4) $3w_1 + b \ge 1$.
  Show the corresponding $Q, p, A, c$:
  $Q = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad p = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad A = \begin{bmatrix} -1 & 0 & 0 \\ -1 & -2 & -2 \\ 1 & 2 & 0 \\ 1 & 3 & 0 \end{bmatrix}, \quad c = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$.
  Use your QP solver to obtain $u^* = [b^*, w_1^*, w_2^*]^T = [-1, 1, -1]^T$.
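A sketch of this exercise with cvxopt, under one numerical workaround of my own: since the bias is not penalized, $Q$ is only positive semidefinite, so a tiny ridge is added to its $(0,0)$ entry to keep the solver's KKT system well-conditioned:

```python
import numpy as np
from cvxopt import matrix, solvers

Q = np.diag([0.0, 1.0, 1.0])
Q[0, 0] = 1e-8                      # tiny ridge on the unpenalized bias entry; workaround, not from the slides
p = np.zeros(3)
A = np.array([[-1.,  0.,  0.],
              [-1., -2., -2.],
              [ 1.,  2.,  0.],
              [ 1.,  3.,  0.]])
c = np.ones(4)

# cvxopt minimizes 0.5 u^T P u + q^T u  s.t.  G u <= h, so flip the sign of A u >= c.
sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))
u_star = np.array(sol['x']).ravel()
print(u_star)                       # approximately [-1, 1, -1] = [b*, w1*, w2*]
```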

  17. Primal QP Algorithm for Linear SVM.
  1. Let $p = 0_{d+1}$ be the $(d+1)$-vector of zeros and $c = 1_N$ the $N$-vector of ones. Construct matrices $Q$ and $A$, where
  $A = \begin{bmatrix} y_1 & y_1 x_1^T \\ \vdots & \vdots \\ y_N & y_N x_N^T \end{bmatrix}, \quad Q = \begin{bmatrix} 0 & 0_d^T \\ 0_d & I_d \end{bmatrix}$.
  2. Return $\begin{bmatrix} b^* \\ w^* \end{bmatrix} = u^* \leftarrow \mathrm{QP}(Q, p, A, c)$.
  3. The final hypothesis is $g(x) = \mathrm{sign}(x^T w^* + b^*)$.
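A compact sketch of this algorithm (assuming linearly separable data; the function name and the tiny ridge on the bias entry are my choices, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Primal hard-margin SVM via QP (assumes the data are linearly separable)."""
    N, d = X.shape
    Q = np.zeros((d + 1, d + 1))
    Q[1:, 1:] = np.eye(d)                          # only w is penalized, not the bias
    Q[0, 0] = 1e-8                                 # tiny ridge so the KKT system stays well-conditioned
    p = np.zeros(d + 1)
    A = np.hstack([y[:, None], y[:, None] * X])    # rows [y_n, y_n x_n^T]
    c = np.ones(N)
    sol = solvers.qp(matrix(Q), matrix(p), matrix(-A), matrix(-c))  # A u >= c  ->  -A u <= -c
    u = np.array(sol['x']).ravel()
    return u[1:], u[0]                             # w*, b*

# Toy data from the slides:
X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)
w, b = hard_margin_svm(X, y)
print(w, b)                          # approximately [1, -1], -1
g = lambda x: np.sign(x @ w + b)     # final hypothesis g(x) = sign(x^T w* + b*)
```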

  18. Link to Regularization. Regularization: $\min_w E_{in}(w)$ subject to $w^T w \le C$. Compare: the optimal hyperplane minimizes $w^T w$ subject to $E_{in} = 0$, while regularization minimizes $E_{in}$ subject to $w^T w \le C$; the objective and the constraint trade places.

  19. How to Handle Non-Separable Data? (a) Tolerate noisy data points: soft-margin SVM. (b) Inherent nonlinear boundary: non-linear transformation.

  20. Non-Linear Transformation.
  $\Phi_1(x) = (x_1, x_2)$
  $\Phi_2(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$
  $\Phi_3(x) = (x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$
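These transforms are straightforward to code; a small sketch (function names are mine):

```python
import numpy as np

def phi2(x):
    """2nd-order polynomial transform from the slide: R^2 -> R^5."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2])

def phi3(x):
    """3rd-order polynomial transform from the slide: R^2 -> R^9."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x1*x2, x2**2,
                     x1**3, x1**2*x2, x1*x2**2, x2**3])
```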

  21. Non-Linear Transformation. Using the nonlinear transform with the optimal hyperplane: given a transform $\Phi: \mathbb{R}^d \to \mathbb{R}^{\tilde{d}}$, let $z_n = \Phi(x_n)$. Solve the hard-margin SVM in the $\mathcal{Z}$-space to get $(\tilde{w}^*, \tilde{b}^*)$:
  $\min_{\tilde{b}, \tilde{w}} \frac{1}{2} \tilde{w}^T \tilde{w}$ subject to $y_n(\tilde{w}^T z_n + \tilde{b}) \ge 1, \forall n$.
  Final hypothesis: $g(x) = \mathrm{sign}(\tilde{w}^{*T} \Phi(x) + \tilde{b}^*)$.
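Composing the sketches above (this snippet assumes the hypothetical `phi2` and `hard_margin_svm` helpers defined earlier are in scope), the Z-space SVM is just the linear SVM run on the transformed inputs:

```python
import numpy as np

# Toy data from the slides; phi2 and hard_margin_svm come from the earlier sketches.
X = np.array([[0, 0], [2, 2], [2, 0], [3, 0]], dtype=float)
y = np.array([-1, -1, 1, 1], dtype=float)

Z = np.array([phi2(x) for x in X])                      # z_n = Phi(x_n)
w_tilde, b_tilde = hard_margin_svm(Z, y)                # hard-margin SVM in Z-space
g = lambda x: np.sign(phi2(x) @ w_tilde + b_tilde)      # g(x) = sign(w~^T Phi(x) + b~)
```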

  22. SVM and Non-Linear Transformation. The margin is shaded in yellow, and the support vectors are boxed. For $\Phi_2$, $\tilde{d}_2 = 5$, and for $\Phi_3$, $\tilde{d}_3 = 9$. $\tilde{d}_3$ is nearly double $\tilde{d}_2$, yet the resulting SVM separator is not severely overfitting with $\Phi_3$ (regularization?).

  23. Support Vector Machine Summary. A very powerful, easy-to-use linear model which comes with automatic regularization. To fully exploit SVM: kernels. Potential robustness to overfitting even after transforming to a much higher dimension. What about infinite-dimensional transforms? The kernel trick.

  24. SVM Dual: Formulation. Primal and dual in optimization: the dual view of SVM enables us to exploit the kernel trick. In the primal SVM problem we solve for $w \in \mathbb{R}^d$ and $b$, while in the dual problem we solve for $\alpha \in \mathbb{R}^N$:
  $\max_{\alpha \in \mathbb{R}^N} \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m$
  subject to $\sum_{n=1}^N y_n \alpha_n = 0$, $\alpha_n \ge 0, \forall n$,
  which is also a QP problem.
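A minimal cvxopt sketch of this dual QP (my own variable names; the maximization is negated into a minimization, and a tiny ridge is added to the Gram matrix for numerical stability, which is a workaround not mentioned on the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Hard-margin SVM dual: max sum(a) - 0.5 a^T (yy^T * XX^T) a,
    s.t. y^T a = 0, a >= 0 -- solved as a cvxopt minimization."""
    N = X.shape[0]
    P = np.outer(y, y) * (X @ X.T) + 1e-8 * np.eye(N)   # P_{nm} = y_n y_m x_n^T x_m (+ tiny ridge)
    q = -np.ones(N)                                     # maximize sum(a)  <=>  minimize -sum(a)
    G = -np.eye(N)                                      # -a <= 0, i.e. a >= 0
    h = np.zeros(N)
    A = y.reshape(1, -1).astype(float)                  # equality constraint y^T a = 0
    b = np.zeros(1)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b))
    return np.array(sol['x']).ravel()
```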

  25. SVM Dual: Prediction. We can recover the primal solution from the dual:
  $w^* = \sum_{n=1}^N y_n \alpha^*_n x_n$,
  where the support vectors are the points with $\alpha^*_n > 0$. The optimal hypothesis:
  $g(x) = \mathrm{sign}(w^{*T} x + b^*) = \mathrm{sign}\!\left( \sum_{n=1}^N y_n \alpha^*_n x_n^T x + b^* \right) = \mathrm{sign}\!\left( \sum_{\alpha^*_n > 0} y_n \alpha^*_n x_n^T x + b^* \right)$.
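A short numpy sketch of the recovery and prediction step. The slide does not spell out how $b^*$ is obtained; the standard choice used here is to read it off the support vectors, where $y_s(w^{*T} x_s + b^*) = 1$:

```python
import numpy as np

def recover_primal(alpha, X, y, tol=1e-6):
    """Recover (w*, b*) from the dual solution alpha.
    b* is taken from the support vectors (y_s (w^T x_s + b) = 1);
    this step is standard but not spelled out on the slide."""
    w = (alpha * y) @ X                  # w* = sum_n y_n alpha_n x_n
    sv = alpha > tol                     # support vectors: alpha_n > 0
    b = np.mean(y[sv] - X[sv] @ w)       # average over SVs for numerical robustness
    return w, b, sv

def predict(x, alpha, X, y, b, tol=1e-6):
    """g(x) = sign( sum_{alpha_n > 0} y_n alpha_n x_n^T x + b )."""
    sv = alpha > tol
    return np.sign((alpha[sv] * y[sv]) @ (X[sv] @ x) + b)
```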

  26. Dual SVM: Summary.
  $\max_{\alpha \in \mathbb{R}^N} \sum_{n=1}^N \alpha_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \alpha_n \alpha_m y_n y_m x_n^T x_m$
  subject to $\sum_{n=1}^N y_n \alpha_n = 0$, $\alpha_n \ge 0, \forall n$.
  $w^* = \sum_{n=1}^N y_n \alpha^*_n x_n$.

  27. Common SVM Basis Functions.
  $z_k$ = polynomial terms of $x_k$ of degree 1 to $q$;
  $z_k$ = radial basis functions of $x_k$: $z_k(j) = \phi_j(x_k) = \exp(-\|x_k - c_j\|^2 / \sigma^2)$;
  $z_k$ = sigmoid functions of $x_k$.
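A sketch of the radial-basis feature map above; the centers $c_j$ and width $\sigma$ in the example call are illustrative values of mine:

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """z_k(j) = exp(-||x_k - c_j||^2 / sigma^2), one feature per center c_j."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances to each center
    return np.exp(-d2 / sigma ** 2)

# Illustrative centers/width (not from the slides):
centers = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
z = rbf_features(np.array([1.0, 1.0]), centers, sigma=1.0)
print(z)
```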
