
Non-Parametric Methods and Support Vector Machines
Shan-Hung Wu (shwu@cs.nthu.edu.tw)
Department of Computer Science, National Tsing Hua University, Taiwan
Machine Learning


Parzen Windows and Kernels

Binary KNN classifier:
$$f(x) = \mathrm{sign}\Big(\sum_{i:\, x^{(i)} \in \mathrm{KNN}(x)} y^{(i)}\Big)$$
The "radius" of the voter boundary depends on the input $x$. We can instead use the Parzen window with a fixed radius:
$$f(x) = \mathrm{sign}\Big(\sum_i y^{(i)}\, \mathbf{1}\big(x^{(i)};\ \|x^{(i)} - x\| \le R\big)\Big)$$
Parzen windows also replace the hard boundary with a soft one:
$$f(x) = \mathrm{sign}\Big(\sum_i y^{(i)}\, k(x^{(i)}, x)\Big)$$
where $k(x^{(i)}, x)$ is a radial basis function (RBF) kernel whose value decreases as we move radially outward from $x$.
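To make the fixed-radius rule concrete, here is a minimal NumPy sketch of the hard Parzen-window classifier. The radius `R`, the toy data, and the tie-handling choice (return 0 when the window is empty) are illustrative assumptions, not part of the slides.

```python
import numpy as np

def parzen_window_predict(X_train, y_train, x, R=1.0):
    """Hard Parzen window: every training point within radius R of the
    query x casts a vote with its label y in {-1, +1}."""
    dists = np.linalg.norm(X_train - x, axis=1)            # ||x^(i) - x|| for all i
    votes = y_train[dists <= R]                             # indicator 1(||x^(i) - x|| <= R)
    return int(np.sign(votes.sum())) if votes.size else 0   # 0 if the window is empty

# Tiny usage example with made-up data
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0]])
y = np.array([1, 1, -1])
print(parzen_window_predict(X, y, np.array([0.1, 0.0]), R=0.5))  # -> 1
```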

Common RBF Kernels

How to act like a soft $K$-NN?
Gaussian RBF kernel: $k(x^{(i)}, x) = \mathcal{N}(x^{(i)} - x;\ \mathbf{0}, \sigma^2 I)$
Or simply
$$k(x^{(i)}, x) = \exp\big(-\gamma \|x^{(i)} - x\|^2\big)$$
$\gamma \ge 0$ (or $\sigma^2$) is a hyperparameter controlling the smoothness of $f$.
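A soft version of the Parzen-window vote using the Gaussian RBF kernel might look like the sketch below. The value of `gamma` is an illustrative assumption; a larger `gamma` makes each vote more local, i.e., $f$ less smooth.

```python
import numpy as np

def gaussian_rbf(xi, x, gamma=0.5):
    """k(x^(i), x) = exp(-gamma * ||x^(i) - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def soft_parzen_predict(X_train, y_train, x, gamma=0.5):
    """Soft Parzen window: every example votes, weighted by the RBF kernel."""
    weights = np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))
    return int(np.sign(y_train @ weights))
```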

Outline
1. Non-Parametric Methods: K-NN, Parzen Windows, Local Models
2. Support Vector Machines: SVC, Slacks, Nonlinear SVC, Dual Problem, Kernel Trick

Locally Weighted Linear Regression

In addition to majority voting and averaging, we can define local models for lazy predictions.
E.g., in (eager) linear regression, we find $w \in \mathbb{R}^{D+1}$ that minimizes the SSE:
$$\arg\min_{w} \sum_i \big(y^{(i)} - w^\top x^{(i)}\big)^2$$
Local model: find $w$ minimizing the SSE local to the point $x$ we want to predict:
$$\arg\min_{w} \sum_i k(x^{(i)}, x)\big(y^{(i)} - w^\top x^{(i)}\big)^2$$
where $k(\cdot,\cdot) \in \mathbb{R}$ is an RBF kernel.
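A minimal sketch of this local model under the definitions above. The bias handling (prepending a constant 1 so that $w \in \mathbb{R}^{D+1}$) and the `gamma` value are assumptions of the sketch; note that the fit is redone per query, which is exactly the lazy behavior described.

```python
import numpy as np

def lwlr_predict(X, y, x_query, gamma=0.5):
    """Fit w around x_query by minimizing
    sum_i k(x^(i), x_query) * (y^(i) - w^T x^(i))^2, then predict w^T x_query."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])             # bias column -> w in R^(D+1)
    xq = np.concatenate(([1.0], x_query))
    k = np.exp(-gamma * np.sum((X - x_query) ** 2, axis=1))   # RBF weights, local to x_query
    W = np.diag(k)
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)           # weighted normal equations
    return xq @ w
```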


Kernel Machines

Kernel machines:
$$f(x) = \sum_{i=1}^{N} c_i\, k(x^{(i)}, x) + c_0$$
For example:
Parzen windows: $c_i = y^{(i)}$ and $c_0 = 0$
Locally weighted linear regression: $c_i = (y^{(i)} - w^\top x^{(i)})^2$ and $c_0 = 0$
The variable $c \in \mathbb{R}^N$ can be learned in either an eager or lazy manner.
Pros: complex, but highly accurate if regularized well.
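As a sketch, the generic form can be written directly; the `kernel` argument stands for any kernel function such as the `gaussian_rbf` sketched earlier (both names are assumptions of these sketches, not from the slides).

```python
def kernel_machine(x, X_train, c, c0, kernel):
    """Generic kernel machine f(x) = sum_{i=1}^N c_i k(x^(i), x) + c_0."""
    return sum(ci * kernel(xi, x) for ci, xi in zip(c, X_train)) + c0

# Parzen windows as a special case: c_i = y^(i), c_0 = 0
# f_x = kernel_machine(x, X_train, y_train, 0.0, gaussian_rbf)
```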

Sparse Kernel Machines

To make a prediction, we need to store all examples. This may be infeasible due to a large dataset ($N$), a time limit, or a space limit.
Can we make $c$ sparse? I.e., make $c_i \ne 0$ for only a small fraction of examples, called support vectors. How?


Separating Hyperplane I

Model: $\mathcal{F} = \{f : f(x; w, b) = w^\top x + b\}$, a collection of hyperplanes
Prediction: $\hat{y} = \mathrm{sign}(f(x))$
Training: find $w$ and $b$ such that
$$w^\top x^{(i)} + b \ge 0 \text{ if } y^{(i)} = 1, \qquad w^\top x^{(i)} + b \le 0 \text{ if } y^{(i)} = -1,$$
or simply $y^{(i)}(w^\top x^{(i)} + b) \ge 0$.

Separating Hyperplane II

There are many feasible $w$'s and $b$'s when the classes are linearly separable. Which hyperplane is the best?

Support Vector Classification

The support vector classifier (SVC) picks the hyperplane with the largest margin:
$$y^{(i)}(w^\top x^{(i)} + b) \ge a \text{ for all } i$$
Margin: $2a / \|w\|$ [Homework]
Without loss of generality, we let $a = 1$ and solve the problem:
$$\arg\min_{w, b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to } y^{(i)}(w^\top x^{(i)} + b) \ge 1,\ \forall i$$
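One way to see that this is an ordinary quadratic program is to hand it to a generic convex solver. The sketch below uses cvxpy purely as an illustration (an assumption of this sketch, not something the slides prescribe) and assumes the classes are linearly separable so that the constraints are feasible.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svc(X, y):
    """argmin_{w,b} 0.5*||w||^2  s.t.  y^(i) (w^T x^(i) + b) >= 1 for all i."""
    N, D = X.shape
    w, b = cp.Variable(D), cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]   # margin constraints, one per example
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```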


Overlapping Classes

In practice, classes may be overlapping, due to, e.g., noise or outliers.
The problem
$$\arg\min_{w, b}\ \tfrac{1}{2}\|w\|^2 \quad \text{subject to } y^{(i)}(w^\top x^{(i)} + b) \ge 1,\ \forall i$$
has no solution in this case. How do we fix this?

Slacks

SVC tolerates slacks, examples that fall outside of the regions they ought to be in. Problem:
$$\arg\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to } y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0,\ \forall i$$
This favors a large margin but also fewer slacks.
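The slack version changes only the objective and the constraints. Continuing the cvxpy sketch (again an illustrative choice of tool, not the solver the slides use), a possible formulation is:

```python
import cvxpy as cp

def soft_margin_svc(X, y, C=1.0):
    """argmin_{w,b,xi} 0.5*||w||^2 + C * sum_i xi_i
    s.t. y^(i) (w^T x^(i) + b) >= 1 - xi_i  and  xi_i >= 0."""
    N, D = X.shape
    w, b, xi = cp.Variable(D), cp.Variable(), cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```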

Hyperparameter C

$$\arg\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i$$
The hyperparameter $C$ controls the tradeoff between maximizing the margin and minimizing the number of slacks.
It also provides a geometric explanation of weight decay.


Nonlinearly Separable Classes

In practice, classes may be nonlinearly separable. SVC (with slacks) gives "bad" hyperplanes due to underfitting. How do we make it nonlinear?

Feature Augmentation

Recall that in polynomial regression, we augment data features to make a linear regressor nonlinear. Likewise, we can define a function $\Phi(\cdot)$ that maps each data point to a high-dimensional space:
$$\arg\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to } y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0,\ \forall i$$
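For instance, a degree-2 map for $x \in \mathbb{R}^2$ (the same map that reappears with the polynomial kernel later in the deck) can be applied explicitly before running the linear soft-margin SVC. This is a sketch; the $\sqrt{2}$ scaling is one common convention, and `soft_margin_svc` refers to the earlier illustrative cvxpy function.

```python
import numpy as np

def phi_poly2(x):
    """Explicit degree-2 feature map for x in R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

# Nonlinear SVC = linear (soft-margin) SVC trained on the mapped data:
# Phi_X = np.apply_along_axis(phi_poly2, 1, X)
# w, b, xi = soft_margin_svc(Phi_X, y, C=1.0)
```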


Time Complexity

Nonlinear SVC:
$$\arg\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to } y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0,\ \forall i$$
The higher the augmented feature dimension, the more variables in $w$ to solve for.
Can we solve for $w$ in time complexity that is independent of the mapped dimension?

Dual Problem

Primal problem:
$$\arg\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{subject to } y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0,\ \forall i$$
Dual problem:
$$\arg\max_{\alpha, \beta}\ \min_{w, b, \xi} L(w, b, \xi, \alpha, \beta) \quad \text{subject to } \alpha \ge \mathbf{0},\ \beta \ge \mathbf{0},$$
where
$$L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$$
The primal problem is convex, so strong duality holds.

Solving Dual Problem I

$$L(w, b, \xi, \alpha, \beta) = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i + \sum_i \alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) + \sum_i \beta_i(-\xi_i)$$
The inner problem $\min_{w, b, \xi} L(w, b, \xi, \alpha, \beta)$ is convex in terms of $w$, $b$, and $\xi$. Let's solve it analytically:
$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y^{(i)} \Phi(x^{(i)}) = 0 \;\Rightarrow\; w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$$
$$\frac{\partial L}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0 \;\Rightarrow\; \sum_i \alpha_i y^{(i)} = 0$$
$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \;\Rightarrow\; \beta_i = C - \alpha_i$$

Solving Dual Problem II

Substituting $w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$ and $\beta_i = C - \alpha_i$ into $L(w, b, \xi, \alpha, \beta)$:
$$L(w, b, \xi, \alpha, \beta) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)}) - b\sum_i \alpha_i y^{(i)},$$
so
$$\min_{w, b, \xi} L(w, b, \xi, \alpha, \beta) = \begin{cases} \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)}), & \text{if } \sum_i \alpha_i y^{(i)} = 0,\\ -\infty, & \text{otherwise.} \end{cases}$$
Outer maximization problem:
$$\arg\max_{\alpha}\ \mathbf{1}^\top \alpha - \tfrac{1}{2}\alpha^\top K \alpha \quad \text{subject to } \mathbf{0} \le \alpha \le C\mathbf{1} \text{ and } y^\top \alpha = 0,$$
where $K_{i,j} = y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)})$. Note that $\beta_i = C - \alpha_i \ge 0$ implies $\alpha_i \le C$.
Equivalently, the dual minimization problem of SVC:
$$\arg\min_{\alpha}\ \tfrac{1}{2}\alpha^\top K \alpha - \mathbf{1}^\top \alpha \quad \text{subject to } \mathbf{0} \le \alpha \le C\mathbf{1} \text{ and } y^\top \alpha = 0$$
Number of variables to solve for? $N$, instead of the augmented feature dimension.
In practice, this problem is solved by specialized solvers such as sequential minimal optimization (SMO) [3], as $K$ is usually ill-conditioned.
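For small problems, the dual QP can also be solved directly with a generic solver. The sketch below continues the illustrative cvxpy approach (not the SMO solver the slides mention); the tiny diagonal jitter added to $K$ is an assumption of the sketch, intended to cope with the ill-conditioning noted above.

```python
import cvxpy as cp
import numpy as np

def svc_dual(gram, y, C=1.0):
    """Solve argmin_a 0.5*a^T K a - 1^T a  s.t.  0 <= a <= C, y^T a = 0,
    where gram[i, j] = k(x^(i), x^(j)) and K_ij = y_i y_j gram[i, j]."""
    N = len(y)
    K = np.outer(y, y) * gram + 1e-8 * np.eye(N)   # jitter: K is usually ill-conditioned
    a = cp.Variable(N)
    objective = cp.Minimize(0.5 * cp.quad_form(a, K) - cp.sum(a))
    constraints = [a >= 0, a <= C, y @ a == 0]
    cp.Problem(objective, constraints).solve()
    return a.value
```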

Making Predictions

Prediction: $\hat{y} = \mathrm{sign}(f(x)) = \mathrm{sign}(w^\top \Phi(x) + b)$
We have $w = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})$. How do we obtain $b$?
By the complementary slackness of the KKT conditions, we have:
$$\alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) = 0 \quad \text{and} \quad \beta_i(-\xi_i) = 0$$
For any $x^{(i)}$ having $0 < \alpha_i < C$, we have
$$\beta_i = C - \alpha_i > 0 \;\Rightarrow\; \xi_i = 0, \qquad 1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0 \;\Rightarrow\; b = y^{(i)} - w^\top \Phi(x^{(i)})$$
In practice, we usually take the average over all $x^{(i)}$'s having $0 < \alpha_i < C$ to avoid numeric error.
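Given the dual solution, here is a sketch of recovering $b$ from the free support vectors and of the resulting decision function. The numerical tolerance used to decide which $\alpha_i$ lie strictly between 0 and $C$ is an assumption of the sketch.

```python
import numpy as np

def recover_b(alpha, X, y, kernel, C=1.0, tol=1e-6):
    """b = average over free SVs (0 < alpha_i < C) of y^(i) - w^T Phi(x^(i)),
    where w^T Phi(x^(i)) = sum_j alpha_j y^(j) k(x^(j), x^(i))."""
    free = np.where((alpha > tol) & (alpha < C - tol))[0]
    K_free = np.array([[kernel(xj, X[i]) for xj in X] for i in free])  # |free| x N
    return float(np.mean(y[free] - K_free @ (alpha * y)))

def decision_value(x, alpha, X, y, b, kernel):
    """f(x) = sum_i alpha_i y^(i) k(x^(i), x) + b; the prediction is sign(f(x))."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
```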


Kernel as Inner Product

We need to evaluate $\Phi(x^{(i)})^\top \Phi(x^{(j)})$ when:
Solving the dual problem of SVC, where $K_{i,j} = y^{(i)} y^{(j)} \Phi(x^{(i)})^\top \Phi(x^{(j)})$
Making a prediction, where $f(x) = w^\top \Phi(x) + b = \sum_i \alpha_i y^{(i)} \Phi(x^{(i)})^\top \Phi(x) + b$
Time complexity? If we choose $\Phi$ carefully, we can evaluate $\Phi(x^{(i)})^\top \Phi(x) = k(x^{(i)}, x)$ efficiently.
Polynomial kernel: $k(a, b) = (a^\top b / \alpha + \beta)^\gamma$. E.g., let $\alpha = 1$, $\beta = 1$, $\gamma = 2$, and $a \in \mathbb{R}^2$; then
$$\Phi(a) = \big[\,1,\ \sqrt{2}a_1,\ \sqrt{2}a_2,\ a_1^2,\ a_2^2,\ \sqrt{2}a_1 a_2\,\big]^\top \in \mathbb{R}^6$$
Gaussian RBF kernel: $k(a, b) = \exp(-\gamma\|a - b\|^2)$, $\gamma \ge 0$. Since
$$k(a, b) = \exp(-\gamma\|a\|^2 + 2\gamma a^\top b - \gamma\|b\|^2) = \exp(-\gamma\|a\|^2 - \gamma\|b\|^2)\Big(1 + \frac{2\gamma a^\top b}{1!} + \frac{(2\gamma a^\top b)^2}{2!} + \cdots\Big),$$
for $a \in \mathbb{R}^2$ we have
$$\Phi(a) = \exp(-\gamma\|a\|^2)\Big[\,1,\ \sqrt{\tfrac{2\gamma}{1!}}a_1,\ \sqrt{\tfrac{2\gamma}{1!}}a_2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}a_1^2,\ \sqrt{\tfrac{(2\gamma)^2}{2!}}a_2^2,\ \sqrt{\tfrac{2(2\gamma)^2}{2!}}a_1 a_2,\ \cdots\Big]^\top \in \mathbb{R}^\infty$$
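A quick numerical check of the polynomial-kernel identity above: the $O(D)$ kernel evaluation agrees with the explicit inner product in $\mathbb{R}^6$. The test vectors below are arbitrary.

```python
import numpy as np

def poly_kernel(a, b, alpha=1.0, beta=1.0, gamma=2):
    """k(a, b) = (a^T b / alpha + beta)^gamma."""
    return (a @ b / alpha + beta) ** gamma

def phi(a):
    """Explicit feature map for alpha=1, beta=1, gamma=2, a in R^2."""
    a1, a2 = a
    return np.array([1.0, np.sqrt(2)*a1, np.sqrt(2)*a2, a1**2, a2**2, np.sqrt(2)*a1*a2])

a, b = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(np.isclose(poly_kernel(a, b), phi(a) @ phi(b)))  # True
```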

Kernel Trick

If we choose $\Phi$ induced by the polynomial or Gaussian RBF kernel, then $K_{i,j} = y^{(i)} y^{(j)} k(x^{(i)}, x^{(j)})$ takes only $O(D)$ time to evaluate, and
$$f(x) = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$$
takes $O(ND)$ time, independent of the augmented feature dimension.
The kernel parameters $\alpha$, $\beta$, and $\gamma$ are new hyperparameters.

Sparse Kernel Machines

SVC is a kernel machine:
$$f(x) = \sum_i \alpha_i y^{(i)} k(x^{(i)}, x) + b$$
It is surprising that SVC works like $K$-NN in some sense. However, SVC is a sparse kernel machine: only a small number of examples, those on the margin or violating it, become the support vectors ($\alpha_i > 0$).
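To see this sparsity in practice, one can inspect a fitted classifier. This sketch assumes scikit-learn is available and uses made-up circular data; the dataset, the `gamma` value, and `C` are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # roughly ring-shaped classes

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print(f"{clf.support_.size} of {len(X)} examples are support vectors")
# Predictions depend only on these alpha_i > 0 examples
# (see clf.support_vectors_ and clf.dual_coef_).
```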

KKT Conditions and Types of SVs

By the KKT conditions, we have:
Primal feasibility: $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$
Complementary slackness: $\alpha_i\big(1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i\big) = 0$ and $\beta_i(-\xi_i) = 0$
Depending on the value of $\alpha_i$, each example $x^{(i)}$ can be:
Non-SV ($\alpha_i = 0$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \ge 1$ (usually strict), since $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i \le 0$ and, because $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$.
Free SV ($0 < \alpha_i < C$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) = 1$, since $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0$ and, because $\beta_i = C - \alpha_i \ne 0$, we have $\xi_i = 0$.
Bounded SV ($\alpha_i = C$): $y^{(i)}(w^\top \Phi(x^{(i)}) + b) \le 1$ (usually strict), since $1 - y^{(i)}(w^\top \Phi(x^{(i)}) + b) - \xi_i = 0$ and, because $\beta_i = 0$, we have $\xi_i \ge 0$.
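A small helper that applies this classification to a dual solution $\alpha$; the numerical tolerance is an assumption of the sketch, since solvers rarely return exact zeros.

```python
import numpy as np

def sv_types(alpha, C, tol=1e-6):
    """Split examples by alpha_i: non-SVs, free SVs (exactly on the margin),
    and bounded SVs (inside the margin or misclassified)."""
    non_sv  = np.where(alpha <= tol)[0]
    free_sv = np.where((alpha > tol) & (alpha < C - tol))[0]
    bounded = np.where(alpha >= C - tol)[0]
    return non_sv, free_sv, bounded
```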

Remarks I

Pros of SVC:
Global optimality (convex problem)
