SVM Kernels
COMPSCI 371D — Machine Learning (presentation transcript)


  1. SVM Kernels
     COMPSCI 371D — Machine Learning

  2. Outline
     1 Linear Separability and Feature Augmentation
     2 Sample Complexity
     3 Computational Complexity
     4 Kernels and Nonlinear SVMs
     5 Mercer's Conditions
     6 Gaussian Kernels and Support Vectors

  3. Linear Separability and Feature Augmentation: Data Representations
     • Linear separability is a property of the data in a given representation
     • Example: a set that is not linearly separable, with boundary $x_2 = x_1^2$

  4. Linear Separability and Feature Augmentation: Feature Transformations
     • Transform $x = (x_1, x_2) \to z = (z_1, z_2) = (x_1^2, x_2)$
     • Now the set is linearly separable: the boundary becomes $z_2 = z_1$

  5. Linear Separability and Feature Augmentation: Feature Augmentation
     • Feature transformation: $x = (x_1, x_2) \to z = (z_1, z_2) = (x_1^2, x_2)$
     • Problem: we don't know the boundary, so we cannot guess the correct transformation
     • Feature augmentation: $x = (x_1, x_2) \to z = (z_1, z_2, z_3) = (x_1, x_2, x_1^2)$
     • Why is this better? We keep the original features and add many new ones, in the hope that some combination will help (see the sketch below)
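The following is a minimal numerical sketch of the augmentation idea; scikit-learn and the synthetic data are assumptions made for illustration and are not part of the slides. Labels follow the boundary $x_2 = x_1^2$, so a linear SVM does poorly on the raw features $(x_1, x_2)$ but separates the data once the single feature $x_1^2$ is appended.

```python
# Sketch: labels follow the boundary x2 = x1^2, so a linear SVM struggles on the raw
# features but separates the data once the feature x1^2 is appended. scikit-learn and
# the synthetic data are illustrative assumptions, not part of the slides.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = np.sign(X[:, 1] - X[:, 0] ** 2)          # +1 above the parabola, -1 below

raw = SVC(kernel="linear").fit(X, y)
Z = np.column_stack([X, X[:, 0] ** 2])       # augmented features z = (x1, x2, x1^2)
aug = SVC(kernel="linear").fit(Z, y)

print("raw features, training accuracy:      ", raw.score(X, y))   # noticeably below 1
print("augmented features, training accuracy:", aug.score(Z, y))   # at or very near 1
```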

  6. Linear Separability and Feature Augmentation: Not Really Just a Hope!
     • Add all monomials of $x_1, x_2$ up to some degree $k$
     • Example: $d = 2$, $k = 3$ gives $d' = \binom{d+k}{d} = \binom{2+3}{2} = 10$ monomials:
       $z = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$
     • From Taylor's theorem, we know that with $k$ high enough we can approximate any hypersurface by a linear combination of the features in $z$
     • Issue 1: Sample complexity: more dimensions, more training data (remember the curse)
     • Issue 2: Computational complexity: more features, more work
     • With SVMs, we can address both issues
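As a quick check of the monomial count, the sketch below computes $\binom{d+k}{d}$ and generates the degree-$\leq k$ monomials explicitly; scikit-learn's PolynomialFeatures is an illustrative assumption, not something the slides prescribe.

```python
# Sketch: the count C(d+k, d) of monomials of degree <= k in d variables, and the
# monomials themselves (PolynomialFeatures is an illustrative assumption).
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d, k = 2, 3
print(comb(d + k, d))                            # 10, as on the slide

poly = PolynomialFeatures(degree=k)              # all monomials up to degree k, incl. 1
z = poly.fit_transform(np.array([[2.0, 3.0]]))   # example point x = (x1, x2) = (2, 3)
print(poly.get_feature_names_out(["x1", "x2"]))
# ['1' 'x1' 'x2' 'x1^2' 'x1 x2' 'x2^2' 'x1^3' 'x1^2 x2' 'x1 x2^2' 'x2^3']
print(z)                                         # [[ 1.  2.  3.  4.  6.  9.  8. 12. 18. 27.]]
```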

  7. Sample Complexity: A Detour into Sample Complexity
     • The more training samples we have, the better we generalize
     • With a larger $N$, the training set $T$ represents the model $p(x, y)$ better
     • How to formalize this notion? Introduce a number $\epsilon$ that measures how far from optimal a classifier is
     • The smaller we want $\epsilon$ to be, the bigger $N$ needs to be
     • Easier to think about: the bigger $1/\epsilon$ ("exactitude"), the bigger $N$
     • The rate of growth of $N(1/\epsilon)$ is the sample complexity, more or less
     • Removing "more or less" requires care

  8. Sample Complexity: Various Risks Involved
     • We train a classifier on set $T$ by picking the best $h \in \mathcal{H}$: $\hat{h} = \mathrm{ERM}_T(\mathcal{H}) \in \arg\min_{h \in \mathcal{H}} L_T(h)$
     • Empirical risk actually achieved by $\hat{h}$: $L_T(\hat{h}) = L_T(\mathcal{H}) = \min_{h \in \mathcal{H}} L_T(h)$
     • When we deploy $\hat{h}$ we want its statistical risk to be small: $L_p(\hat{h}) = E_p[\ell(y, \hat{h}(x))]$
     • We can get some idea of $L_p(\hat{h})$ by testing $\hat{h}$; typically, $L_p(\hat{h}) > L_T(\hat{h})$
     • More importantly: how small can $L_p(\hat{h})$ conceivably be?
     • $L_p(\hat{h})$ is typically bigger than $L_p(\mathcal{H}) = \min_{h \in \mathcal{H}} L_p(h)$
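A small simulation can make the gap between empirical and statistical risk concrete. The sketch below (scikit-learn, the synthetic model $p(x, y)$, and 0-1 loss are illustrative assumptions, not part of the slides) trains on a small set $T$ and estimates $L_p(\hat{h})$ on a large held-out sample.

```python
# Sketch: the gap between empirical risk L_T(h_hat) and statistical risk L_p(h_hat),
# estimated on a large held-out sample drawn from the same synthetic model.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def sample(n):
    """Stand-in for the unknown model p(x, y): a noisy linear boundary."""
    X = rng.normal(size=(n, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n))
    return X, y

X_train, y_train = sample(100)               # the training set T
X_test, y_test = sample(100_000)             # large sample approximates p(x, y)

h_hat = SVC(kernel="linear").fit(X_train, y_train)
print("empirical risk L_T(h_hat):", 1 - h_hat.score(X_train, y_train))
print("estimated statistical risk L_p(h_hat):", 1 - h_hat.score(X_test, y_test))
# The second number is typically (not always) the larger one, as the slide states.
```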

  9. Sample Complexity: Risk Summary
     • The empirical training risk $L_T(\hat{h})$ is just a means to an end: it is what we minimize for training. Ignore it afterwards
     • Statistical risk achieved by $\hat{h}$: $L_p(\hat{h})$
     • Smallest statistical risk over all $h \in \mathcal{H}$: $L_p(\mathcal{H}) = \min_{h \in \mathcal{H}} L_p(h)$
     • Obviously $L_p(\hat{h}) \geq L_p(\mathcal{H})$ (by definition of the latter)
     • Typically, $L_p(\hat{h}) > L_p(\mathcal{H})$. Why? Because $T$ is a poor proxy for $p(x, y)$
     • Also, often $L_p(\mathcal{H}) > 0$. Why? Because $\mathcal{H}$ may not contain a perfect $h$
     • Example: a linear classifier for a problem that is not linearly separable

  10. Sample Complexity
     • Typically, $L_p(\hat{h}) > L_p(\mathcal{H}) \geq 0$
     • The best we can do is $L_p(\hat{h}) = L_p(\mathcal{H}) + \epsilon$ with small $\epsilon > 0$
     • High performance (large $1/\epsilon$) requires lots of data (large $N$)
     • Sample complexity measures how fast $N$ needs to grow as $1/\epsilon$ grows: it is the rate of growth of $N(1/\epsilon)$
     • Problem: $T$ is random, so even a huge $N$ might give poor performance once in a while if we have bad luck (a "statistical fluke")
     • We cannot guarantee that a large $N$ yields a small $\epsilon$; we can guarantee that this happens with high probability

  11. Sample Complexity, Cont'd
     • We can only give a probabilistic guarantee: given a probability $0 < \delta < 1$ (think of this as "small"), we can guarantee that if $N$ is large enough then the probability that $L_p(\hat{h}) \geq L_p(\mathcal{H}) + \epsilon$ is less than $\delta$:
       $P[L_p(\hat{h}) \geq L_p(\mathcal{H}) + \epsilon] \leq \delta$
     • The sample complexity for hypothesis space $\mathcal{H}$ is the function $N_{\mathcal{H}}(\epsilon, \delta)$ that gives the smallest $N$ for which this bound holds, regardless of the model $p(x, y)$
     • This is a tall order: typically, we can only give asymptotic bounds for $N_{\mathcal{H}}(\epsilon, \delta)$

  12. Sample Complexity for Linear Classifiers and SVMs
     • For a binary linear classifier, the sample complexity is $\Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon}\right)$
     • It grows linearly with $d$, the dimensionality of $X$, and with $1/\epsilon$
     • Not too bad: this is why linear classifiers are so successful
     • SVMs with a bounded data space $X$ do even better ("bounded" means contained in a hypersphere of finite radius)
     • For SVMs with bounded $X$, the sample complexity is independent of $d$. No curse!
     • We can augment features to our heart's content

  13. Computational Complexity: What About Computational Complexity?
     • Remember our plan: go from $x = (x_1, x_2)$ to $z = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$ in order to make the data separable
     • Can we do this without paying the computational cost?
     • Yes, with SVMs

  14. Computational Complexity: SVMs and the Representer Theorem
     • Recall the formulation of SVM training: minimize
       $f(w, \xi) = \frac{1}{2} \|w\|^2 + \gamma \sum_{n=1}^{N} \xi_n$
       subject to the constraints
       $y_n (w^T x_n + b) - 1 + \xi_n \geq 0$ and $\xi_n \geq 0$
     • Representer theorem: $w = \sum_{n \in \mathcal{A}(w, b)} \alpha_n y_n x_n$, so that
       $\|w\|^2 = w^T w = \sum_{m \in \mathcal{A}(w, b)} \sum_{n \in \mathcal{A}(w, b)} \alpha_m \alpha_n y_m y_n x_m^T x_n$
       (a numerical check of this is sketched below)
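The representer theorem can be checked numerically on a toy problem, as in the sketch below; scikit-learn's SVC and the synthetic data are assumptions for illustration, not part of the slides. SVC's dual_coef_ attribute stores $\alpha_n y_n$ for the support vectors, so summing $\alpha_n y_n x_n$ should reproduce the weight vector $w$ that the solver reports.

```python
# Sketch: check the representer theorem w = sum_{n in A(w,b)} alpha_n y_n x_n on a
# toy problem (scikit-learn and the synthetic data are illustrative assumptions).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + X[:, 1])               # a linearly separable toy problem

clf = SVC(kernel="linear", C=1.0).fit(X, y)  # C is the slack penalty (gamma on this slide)

# dual_coef_ stores alpha_n * y_n for the support vectors (the active set), so the
# weighted sum of the support vectors alone should reproduce the weight vector w.
w_from_representer = clf.dual_coef_ @ clf.support_vectors_

print(np.allclose(w_from_representer, clf.coef_))   # True
```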

  15. Kernels and Nonlinear SVMs: Using the Representer Theorem
     • Representer theorem: $w = \sum_{n \in \mathcal{A}(w, b)} \alpha_n y_n x_n$
     • In the constraint $y_n (w^T x_n + b) - 1 + \xi_n \geq 0$ we have
       $w^T x_n = \sum_{m \in \mathcal{A}(w, b)} \alpha_m y_m x_m^T x_n$
     • Summary: $x$ appears in an inner product, never alone:
       $\min_{w, b, \xi} \; \frac{1}{2} \sum_{m \in \mathcal{A}(u)} \sum_{n \in \mathcal{A}(u)} \alpha_m \alpha_n y_m y_n x_m^T x_n + C \sum_{n=1}^{N} \xi_n$
       subject to the constraints
       $y_n \left( \sum_{m \in \mathcal{A}(u)} \alpha_m y_m x_m^T x_n + b \right) - 1 + \xi_n \geq 0$ and $\xi_n \geq 0$

  16. Kernels and Nonlinear SVMs: The Kernel
     • Augment $x \in \mathbb{R}^d$ to $\varphi(x) \in \mathbb{R}^{d'}$, with $d' \gg d$ (typically):
       $\min_{w, b, \xi} \; \frac{1}{2} \sum_{m \in \mathcal{A}(u)} \sum_{n \in \mathcal{A}(u)} \alpha_m \alpha_n y_m y_n \varphi(x_m)^T \varphi(x_n) + C \sum_{n=1}^{N} \xi_n$
       subject to the constraints
       $y_n \left( \sum_{m \in \mathcal{A}(u)} \alpha_m y_m \varphi(x_m)^T \varphi(x_n) + b \right) - 1 + \xi_n \geq 0$ and $\xi_n \geq 0$
     • The value $K(x_m, x_n) \stackrel{\text{def}}{=} \varphi(x_m)^T \varphi(x_n)$ is a number
     • The optimization algorithm needs to know only $K(x_m, x_n)$, not $\varphi(x_n)$. $K$ is called a kernel
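To make the point that the optimizer needs only the numbers $K(x_m, x_n)$, the sketch below trains an SVM from a precomputed Gram matrix alone, never handing the learner the features or any $\varphi$. scikit-learn's "precomputed" kernel mode and the toy data are assumptions for illustration.

```python
# Sketch: the training algorithm sees only kernel values K(x_m, x_n), never phi(x).
import numpy as np
from sklearn.svm import SVC

def poly_kernel(A, B, k=3):
    """K(x, z) = (x^T z + 1)^k for all pairs of rows of A and B."""
    return (A @ B.T + 1.0) ** k

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.sign(X_train[:, 1] - X_train[:, 0] ** 2)    # boundary x2 = x1^2
X_test = rng.normal(size=(20, 2))

K_train = poly_kernel(X_train, X_train)      # a (100, 100) matrix of numbers
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

# Prediction also needs only kernel values, between test and training points.
K_test = poly_kernel(X_test, X_train)        # shape (20, 100)
print(clf.predict(K_test)[:5])
```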

  17. Kernels and Nonlinear SVMs: Decision Rule
     • The same holds for the decision rule: $\hat{y} = h(x) = \mathrm{sign}(w^T x + b)$ becomes
       $\hat{y} = h(x) = \mathrm{sign}\left( \sum_{m \in \mathcal{A}(w, b)} \alpha_m y_m x_m^T x + b \right)$
       because of the representer theorem $w = \sum_{n \in \mathcal{A}(w, b)} \alpha_n y_n x_n$
     • Therefore, after feature augmentation,
       $\hat{y} = h(x) = \mathrm{sign}\left( \sum_{m \in \mathcal{A}(w, b)} \alpha_m y_m \varphi(x_m)^T \varphi(x) + b \right)$
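The kernelized decision rule can be evaluated by hand from a trained classifier's support vectors, coefficients $\alpha_m y_m$, and offset $b$, and compared with the library's own output. The sketch below does this with scikit-learn (an assumption; the slides name no library), using the degree-3 polynomial kernel of the next slide.

```python
# Sketch: evaluate y_hat = sign( sum_m alpha_m y_m K(x_m, x) + b ) directly from the
# support vectors and compare with scikit-learn's own decision function.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 1] - X[:, 0] ** 2)          # boundary x2 = x1^2, not linearly separable

# sklearn's polynomial kernel is (gamma * x^T z + coef0)^degree; with gamma = 1 and
# coef0 = 1 it matches the kernel (x^T z + 1)^3 from the next slide.
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

def K(A, B):
    return (A @ B.T + 1.0) ** 3

X_new = rng.normal(size=(10, 2))
# dual_coef_ holds alpha_m * y_m for the support vectors x_m; intercept_ is b.
scores = (clf.dual_coef_ @ K(clf.support_vectors_, X_new) + clf.intercept_).ravel()

print(np.allclose(scores, clf.decision_function(X_new)))   # True
print(np.sign(scores))                                      # the predictions y_hat
```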

  18. Kernels and Nonlinear SVMs: Kernel Idea 1
     • Start with some $\varphi(x)$ and use the kernel to save computation
     • Example: $\varphi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, x_1^3, x_1^2 x_2, x_1 x_2^2, x_2^3)$
     • We don't know how to simplify that. Try this instead:
       $\varphi(x) = (1, \sqrt{3}\, x_1, \sqrt{3}\, x_2, \sqrt{3}\, x_1^2, \sqrt{6}\, x_1 x_2, \sqrt{3}\, x_2^2, x_1^3, \sqrt{3}\, x_1^2 x_2, \sqrt{3}\, x_1 x_2^2, x_2^3)$
     • One can show (see notes, and the sketch below) that $K(x, z) = \varphi(x)^T \varphi(z) = (x^T z + 1)^3$
     • Something similar works for any $d$ and $k$
     • 4 products and 2 sums instead of 10 products and 9 sums
     • Meager savings here, but the savings grow exponentially with $d$ and $k$, as we know
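The claimed identity $\varphi(x)^T \varphi(z) = (x^T z + 1)^3$ is easy to verify numerically, as in the sketch below (plain NumPy; the random test points are an illustrative assumption).

```python
# Sketch: numerical check that the scaled-monomial feature map phi satisfies
# phi(x)^T phi(z) = (x^T z + 1)^3 for x, z in R^2.
import numpy as np

def phi(x):
    x1, x2 = x
    s3, s6 = np.sqrt(3.0), np.sqrt(6.0)
    return np.array([1.0,
                     s3 * x1, s3 * x2,
                     s3 * x1**2, s6 * x1 * x2, s3 * x2**2,
                     x1**3, s3 * x1**2 * x2, s3 * x1 * x2**2, x2**3])

rng = np.random.default_rng(0)
for _ in range(5):
    x, z = rng.normal(size=2), rng.normal(size=2)
    lhs = phi(x) @ phi(z)              # explicit feature map: 10 products, 9 sums
    rhs = (x @ z + 1.0) ** 3           # kernel shortcut: 4 products, 2 sums
    print(np.isclose(lhs, rhs))        # True every time
```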

  19. Kernels and Nonlinear SVMs: Much Better Kernel Idea 2
     • Just come up with $K(x, z)$ without knowing the corresponding $\varphi(x)$
     • Not just any $K$: it must behave like an inner product
     • For instance, $x^T z = z^T x$ and $(x^T z)^2 \leq \|x\|^2 \|z\|^2$ (symmetry and Cauchy-Schwarz), so we need at least
       $K(x, z) = K(z, x)$ and $K^2(x, z) \leq K(x, x) K(z, z)$
     • These conditions are necessary, but they are not sufficient
     • Fortunately, there is a theory for this: Mercer's conditions
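A finite-sample version of these requirements is easy to test: for any set of points, the Gram matrix $K_{ij} = K(x_i, x_j)$ of a valid kernel must be symmetric and positive semidefinite. The sketch below compares a valid polynomial kernel with an invalid negative-distance "kernel"; NumPy and the two example kernels are illustrative assumptions, and passing this check on one sample is necessary evidence, not a proof.

```python
# Sketch: finite-sample check that a kernel's Gram matrix K_ij = K(x_i, x_j) is
# symmetric and positive semidefinite (a necessary condition for a valid kernel).
import numpy as np

def gram(kernel, X):
    return np.array([[kernel(a, b) for b in X] for a in X])

def poly_kernel(x, z):
    return (x @ z + 1.0) ** 3                # a valid kernel

def neg_distance(x, z):
    return -np.linalg.norm(x - z)            # not a valid kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))

for name, k in [("(x^T z + 1)^3", poly_kernel), ("-||x - z||", neg_distance)]:
    G = gram(k, X)
    symmetric = np.allclose(G, G.T)
    min_eig = np.linalg.eigvalsh(G).min()    # clearly negative => not PSD
    psd = min_eig >= -1e-8 * np.abs(G).max() # tolerance for floating-point noise
    print(f"{name:14s} symmetric: {symmetric}  PSD (up to tolerance): {psd}")
```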
