

1. Kernel-based Methods and Support Vector Machines
   Larry Holder, CptS 570 – Machine Learning
   School of Electrical Engineering and Computer Science, Washington State University

2. References
   • Muller et al., "An Introduction to Kernel-Based Learning Algorithms," IEEE Transactions on Neural Networks, 12(2):181-201, 2001.

3. Learning Problem
   • Estimate a function $f : \mathbb{R}^N \to \{-1, +1\}$ using $n$ training examples $(x_i, y_i)$ sampled from $P(x, y)$.
   • We want the $f$ that minimizes the expected error (risk):
     $R[f] = \int \text{loss}(f(x), y)\, dP(x, y)$
   • $P(x, y)$ is unknown, so we compute the empirical risk instead:
     $R_{emp}[f] = \frac{1}{n} \sum_{i=1}^{n} \text{loss}(f(x_i), y_i)$
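The empirical risk is just an average of per-example losses. A minimal sketch, not from the slides, assuming NumPy and the 0/1 loss as the loss function:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp[f]: the average loss of f over the n training pairs (here, 0/1 loss)."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)

# Toy sample: 1-D inputs with labels in {-1, +1} and a simple threshold classifier.
X = np.array([[-2.0], [-1.0], [0.5], [2.0]])
y = np.array([-1, -1, 1, 1])
f = lambda x: 1 if x[0] > 0 else -1
print(empirical_risk(f, X, y))   # 0.0 on this separable toy sample
```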

4. Overfit
   • Using $R_{emp}[f]$ to estimate $R[f]$ for small $n$ may lead to overfitting.

5. Overfit
   • We can restrict the class $F$ from which $f$ is drawn, i.e., restrict the VC dimension $h$ of $F$ (model selection).
   • Find $F$ such that the learned $f \in F$ minimizes the resulting overestimate (upper bound) of $R[f]$.
   • With probability $1 - \delta$ and $n > h$:
     $R[f] \le R_{emp}[f] + \sqrt{\dfrac{h\left(\ln\frac{2n}{h} + 1\right) - \ln(\delta/4)}{n}}$
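To get a feel for the bound, the square-root capacity term can be evaluated directly. A small illustrative sketch, assuming NumPy and made-up values for $n$, $h$, and $\delta$:

```python
import numpy as np

def vc_confidence(n, h, delta):
    """The square-root (capacity) term of the bound on slide 5."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(delta / 4)) / n)

# The term shrinks as n grows and grows with the VC dimension h (delta = 0.05 here).
for n in (100, 1000, 10000):
    print(n, round(vc_confidence(n, h=10, delta=0.05), 3))
```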

6. Overfit
   • There is a tradeoff between the empirical risk $R_{emp}[f]$ and the uncertainty in the estimate of $R[f]$.
   [Figure: expected risk, empirical risk, and the uncertainty term as functions of the complexity of $F$.]

7. Margins
   • Consider a training sample separable by the hyperplane $f(x) = (w \cdot x) + b$.
   • The margin is the minimal distance of a sample to the decision surface.
   • We can bound the VC dimension of the set of hyperplanes by bounding the margin.
   [Figure: a separating hyperplane with normal vector $w$ and the margin around it.]
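As an illustration, the margin of a given hyperplane $(w, b)$ on a sample is the smallest distance $|w \cdot x_i + b| / \|w\|$ over the training points. A minimal sketch with made-up values for $w$, $b$, and the sample:

```python
import numpy as np

# Distance of each sample to the decision surface f(x) = (w . x) + b,
# and the margin as the minimum of those distances (w, b, X are made-up values).
w = np.array([1.0, 1.0])
b = 0.0
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])

distances = np.abs(X @ w + b) / np.linalg.norm(w)
print(distances)        # per-sample distance to the hyperplane
print(distances.min())  # the margin of this sample with respect to (w, b)
```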

8. Nonlinear Algorithms
   • Using only hyperplanes in the input space is likely to underfit.
   • But we can map the data to a nonlinear feature space and use hyperplanes there:
     $\Phi : \mathbb{R}^N \to F,\quad x \mapsto \Phi(x)$

9. Curse of Dimensionality
   • The difficulty of learning increases with the dimensionality of the problem, i.e., it is harder to learn with more features.
   • But the difficulty also depends on the complexity of the learning algorithm and the VC dimension of the hypothesis class.
   • Hyperplanes are easy to learn.
   • Still, mapping to extremely high-dimensional spaces makes even hyperplane learning difficult.

10. Kernel Functions
   • For some feature spaces $F$ and mappings $\Phi$ there is a "trick" for efficiently computing scalar products.
   • Kernel functions compute scalar products in $F$ without mapping the data to $F$, or even knowing $\Phi$.

11. Kernel Functions
   • Example: the kernel $k(x, y) = (x \cdot y)^2$ with the mapping
     $\Phi : \mathbb{R}^2 \to \mathbb{R}^3,\quad (x_1, x_2) \mapsto (z_1, z_2, z_3) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$
   • Then
     $(\Phi(x) \cdot \Phi(y)) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)(y_1^2,\ \sqrt{2}\, y_1 y_2,\ y_2^2)^\top = ((x_1, x_2)(y_1, y_2)^\top)^2 = (x \cdot y)^2 = k(x, y)$
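A quick numeric check of this identity, assuming NumPy; the vectors x and y are arbitrary examples:

```python
import numpy as np

def phi(x):
    """The explicit feature map Phi : R^2 -> R^3 from slide 11."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    """The kernel (x . y)^2, computed without ever forming Phi(x)."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])    # arbitrary example vectors
y = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))   # 1.0
print(k(x, y))                  # 1.0, identical without mapping to R^3
```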

12. Kernel Functions
   • Gaussian RBF: $k(x, y) = \exp\!\left(\dfrac{-\|x - y\|^2}{c}\right)$
   • Polynomial: $k(x, y) = ((x \cdot y) + \theta)^d$
   • Sigmoidal: $k(x, y) = \tanh(\kappa (x \cdot y) + \theta)$
   • Inverse multiquadratic: $k(x, y) = \dfrac{1}{\sqrt{\|x - y\|^2 + c^2}}$
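For concreteness, the four kernels can be written as small functions. A sketch assuming NumPy; the parameter values (c, theta, d, kappa) are illustrative defaults, not values from the slides:

```python
import numpy as np

def gaussian_rbf(x, y, c=1.0):
    return np.exp(-np.sum((x - y) ** 2) / c)

def polynomial(x, y, theta=1.0, d=3):
    return (np.dot(x, y) + theta) ** d

def sigmoidal(x, y, kappa=1.0, theta=0.0):
    return np.tanh(kappa * np.dot(x, y) + theta)

def inverse_multiquadratic(x, y, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for kern in (gaussian_rbf, polynomial, sigmoidal, inverse_multiquadratic):
    print(kern.__name__, kern(x, y))
```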

13. Support Vector Machines
   • Supervised learning:
     $y_i((w \cdot x_i) + b) \ge 1,\quad i = 1, \ldots, n$
   • Mapping to a nonlinear space:
     $y_i((w \cdot \Phi(x_i)) + b) \ge 1,\quad i = 1, \ldots, n$   (Eq. 8)
   • Minimize (subject to Eq. 8):
     $\min_{w, b}\ \frac{1}{2}\|w\|^2$

14. Support Vector Machines
   • Problem: $w$ resides in $F$, where computation is difficult.
   • Solution: remove the dependency on $w$:
     - Introduce Lagrange multipliers $\alpha_i \ge 0$, $i = 1, \ldots, n$, one for each constraint in Eq. 8.
     - And use the kernel function.

15. Support Vector Machines
   $L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left( y_i((w \cdot \Phi(x_i)) + b) - 1 \right)$
   $\dfrac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$
   $\dfrac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i \Phi(x_i)$
   Substituting the last two equations into the first and replacing $(\Phi(x_i) \cdot \Phi(x_j))$ with the kernel function $k(x_i, x_j)$ …

16. Support Vector Machines
   $\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
   subject to:
   $\alpha_i \ge 0,\quad i = 1, \ldots, n$
   $\sum_{i=1}^{n} \alpha_i y_i = 0$
   This is a quadratic optimization problem.
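Because the dual is a quadratic program with simple constraints, a generic optimizer can solve it on toy data. A minimal sketch, assuming SciPy's SLSQP solver, a linear kernel, and a hypothetical separable data set; a dedicated QP or SMO solver would be used in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical separable 2-D toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

def k(a, b):                       # linear kernel; any kernel from slide 12 works
    return np.dot(a, b)

K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
Q = np.outer(y, y) * K             # Q_ij = y_i y_j k(x_i, x_j)

def neg_dual(alpha):               # negate the objective: maximization -> minimization
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(n), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x
print(np.round(alpha, 4))          # nonzero entries correspond to support vectors
```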

17. Support Vector Machines
   • Once we have $\alpha$, we have $w$ and can perform classification:
     $f(x) = \text{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i (\Phi(x_i) \cdot \Phi(x)) + b \right) = \text{sgn}\!\left( \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) + b \right)$, where
     $b = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{n} \alpha_j y_j k(x_i, x_j) \right)$
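The resulting classifier is a kernel expansion over the training points. A sketch using scikit-learn's SVC (which solves the same dual internally) to show that evaluating the expansion $\sum_i \alpha_i y_i k(x_i, x) + b$ from the fitted support vectors, dual coefficients, and intercept reproduces the library's decision function; the data and RBF parameter are made up, and the library computes $b$ its own way rather than with the averaging formula above:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data with labels in {-1, +1}; gamma and C are illustrative.
X = np.array([[2.0, 2.0], [2.0, 3.0], [1.0, 2.5],
              [-2.0, -2.0], [-3.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1e6).fit(X, y)   # large C approximates a hard margin

def decision(x):
    """Kernel expansion sum_i alpha_i y_i k(x_i, x) + b over the support vectors."""
    k_vals = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return clf.dual_coef_[0] @ k_vals + clf.intercept_[0]

x_new = np.array([1.5, 1.5])
print(np.sign(decision(x_new)), clf.predict([x_new])[0])                 # should agree
print(np.allclose(decision(x_new), clf.decision_function([x_new])[0]))  # True
```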

18. SVMs with Noise
   • Until now we have assumed the problem is linearly separable in some space.
   • But if noise is present, this may be a bad assumption.
   • Solution: introduce noise terms (slack variables $\xi_i$) into the classification constraints:
     $y_i((w \cdot x_i) + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \ldots, n$

19. SVMs with Noise
   • Now we want to minimize
     $\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
   • where $C > 0$ determines the tradeoff between empirical error and hypothesis complexity.

20. SVMs with Noise
   $\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
   subject to:
   $0 \le \alpha_i \le C,\quad i = 1, \ldots, n$
   $\sum_{i=1}^{n} \alpha_i y_i = 0$
   where $C$ limits the size of the Lagrange multipliers $\alpha_i$.
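The effect of the box constraint can be seen by varying $C$. A sketch using scikit-learn's SVC on hypothetical noisy data; the data set, kernel, and $C$ values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical noisy, overlapping two-class data; labels mapped to {-1, +1}.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = 2 * y - 1

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=0.5, C=C).fit(X, y)
    train_err = np.mean(clf.predict(X) != y)
    n_sv = clf.n_support_.sum()
    # Small C tolerates slack (more support vectors, higher training error);
    # large C penalizes slack heavily and fits the training data more closely.
    print(f"C={C:<7} training error={train_err:.3f} support vectors={n_sv}")
```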

21. Sparsity
   • Note that many training examples will lie outside the margin; therefore their optimal $\alpha_i = 0$.
   • In particular:
     $\alpha_i = 0 \;\Rightarrow\; y_i f(x_i) \ge 1 \text{ and } \xi_i = 0$
     $0 < \alpha_i < C \;\Rightarrow\; y_i f(x_i) = 1 \text{ and } \xi_i = 0$
     $\alpha_i = C \;\Rightarrow\; y_i f(x_i) \le 1 \text{ and } \xi_i \ge 0$
   • This reduces the optimization problem from $n$ variables down to the number of examples on or inside the margin.
   [Figure: a separating hyperplane with normal vector $w$ and its margin; examples outside the margin have $\alpha_i = 0$.]
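The sparsity shows up directly in a fitted model: most training examples get $\alpha_i = 0$ and drop out of the expansion. A sketch with scikit-learn on a hypothetical well-separated data set:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical well-separated two-class data; labels mapped to {-1, +1}.
X, y = make_blobs(n_samples=500, centers=2, cluster_std=0.8, random_state=1)
y = 2 * y - 1

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(f"{len(clf.support_)} support vectors out of {len(X)} training examples")
# Only these examples (clf.support_vectors_, clf.dual_coef_) enter the expansion
# sum_i alpha_i y_i k(x_i, x) + b; for all other examples alpha_i = 0.
```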

22. Kernel Methods
   • Fisher's linear discriminant: find a linear projection of the feature space such that the classes are well separated.
   • "Well separated" is defined as a large difference between the class means and a small variance along the discriminant.
   • Can be solved using kernel methods to find nonlinear discriminants.

23. Applications
   • Optical pattern and object recognition
     - An invariant SVM achieved the best error rate (0.6%) on the USPS handwritten digit recognition problem, better than humans (2.5%).
   • Text categorization
   • Time-series prediction

24. Applications
   • Gene expression profile analysis
   • DNA and protein analysis
     - An SVM method (13% error) for classifying DNA translation initiation sites outperforms the best neural network (15% error).
     - Virtual SVMs, incorporating prior biological knowledge, reached an 11-12% error rate.

25. Kernel Methods for Unsupervised Learning
   • Principal Components Analysis (PCA) is used in unsupervised learning.
   • PCA is a linear method.
   • Kernel-based PCA can extract non-linear components using standard kernel techniques.
   • Applied to the USPS data for noise reduction, kernel PCA showed a factor-of-8 performance improvement over the linear PCA method.
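A small sketch contrasting linear PCA with kernel PCA, assuming scikit-learn and a synthetic two-circles data set (illustrative only, not the USPS experiment from the slide):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Hypothetical nonlinear data: two concentric circles that no linear projection separates.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Along the first kernel principal component the two circles tend to separate,
# while linear PCA only rotates the data and leaves the classes mixed.
print("linear PCA, mean of component 1 per class:",
      linear[y == 0, 0].mean(), linear[y == 1, 0].mean())
print("kernel PCA, mean of component 1 per class:",
      kernel[y == 0, 0].mean(), kernel[y == 1, 0].mean())
```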

26. Summary
   • (+) Kernel-based methods allow linear-speed learning in non-linear spaces.
   • (+) Support vector machines ignore all but the most differentiating training data (the examples on or inside the margin).
   • (+) Kernel-based methods, and SVMs in particular, are among the best-performing classifiers on many learning problems.
   • (-) Choosing an appropriate kernel can be difficult.
   • (-) The high dimensionality of the original learning problem can still be a computational bottleneck.
