  1. Kernel Methods
  CE-717: Machine Learning, Sharif University of Technology, Fall 2019, Soleymani

  2. Not linearly separable data
  - Noisy data or overlapping classes (we discussed this: soft margin)
  - Near linearly separable
  - Non-linear decision surface: transform to a new feature space

  3. Nonlinear SVM
  - Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space:
    $$\boldsymbol{x} \to \boldsymbol{\phi}(\boldsymbol{x}) = [\phi_1(\boldsymbol{x}), \ldots, \phi_m(\boldsymbol{x})]$$
  - $\{\phi_1(\boldsymbol{x}), \ldots, \phi_m(\boldsymbol{x})\}$: set of basis functions (or features), each $\phi_i: \mathbb{R}^d \to \mathbb{R}$
  - Find a hyperplane in the transformed feature space: $\boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}) + w_0 = 0$

  4. Soft-margin SVM in a transformed space: Primal problem
  - Primal problem:
    $$\min_{\boldsymbol{w}, w_0} \ \frac{1}{2}\|\boldsymbol{w}\|^2 + C \sum_{n=1}^{N} \xi_n$$
    $$\text{s.t.} \quad y^{(n)}\left(\boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(n)}) + w_0\right) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, \ldots, N$$
  - $\boldsymbol{w} \in \mathbb{R}^m$: the weights that must be found
  - If $m \gg d$ (very high-dimensional feature space), there are many more parameters to learn

  5. Soft-margin SVM in a transformed space: Dual problem
  - Optimization problem:
    $$\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)})$$
    $$\text{s.t.} \quad \sum_{n=1}^{N} \alpha_n y^{(n)} = 0, \quad 0 \leq \alpha_n \leq C, \quad n = 1, \ldots, N$$
  - If we have the inner products $\boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]$ needs to be learned
  - It is not necessary to learn $m$ parameters, as opposed to the primal problem

  6. Classifying a new data point
  $$\hat{y} = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x})\right)$$
  where $\boldsymbol{w} = \sum_{\alpha_n > 0} \alpha_n y^{(n)} \boldsymbol{\phi}(\boldsymbol{x}^{(n)})$ and $w_0 = y^{(s)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(s)})$ for any support vector $\boldsymbol{x}^{(s)}$ on the margin.
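A minimal NumPy sketch of this recovery step (not from the slides; the names are illustrative, and `alpha`, `y`, and the mapped training matrix `Phi` are assumed to come from some dual solver):

```python
import numpy as np

def recover_primal(alpha, y, Phi, C, tol=1e-8):
    """Recover w and w0 from a dual solution.
    alpha: (N,) dual variables, y: (N,) labels in {-1,+1}, Phi: (N, m) mapped data."""
    sv = alpha > tol                              # support vectors: alpha_n > 0
    w = (alpha[sv] * y[sv]) @ Phi[sv]             # w = sum_n alpha_n y^(n) phi(x^(n))
    s = np.where(sv & (alpha < C - tol))[0][0]    # a margin SV: 0 < alpha_s < C
    w0 = y[s] - w @ Phi[s]                        # w0 = y^(s) - w^T phi(x^(s))
    return w, w0
```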

  7. Kernel SVM
  - Learns a linear decision boundary in a high-dimensional space without explicitly working on the mapped data
  - Let $\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = K(\boldsymbol{x}, \boldsymbol{x}')$ (the kernel)
  - Example: $\boldsymbol{x} = (x_1, x_2)$ and a second-order $\boldsymbol{\phi}$:
    $$\boldsymbol{\phi}(\boldsymbol{x}) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2)$$
    $$K(\boldsymbol{x}, \boldsymbol{x}') = 1 + x_1 x_1' + x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + x_1 x_2 x_1' x_2'$$

  8. Kernel trick
  - Compute $K(\boldsymbol{x}, \boldsymbol{x}')$ without transforming $\boldsymbol{x}$ and $\boldsymbol{x}'$
  - Example: consider $K(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T \boldsymbol{x}')^2$:
    $$(1 + \boldsymbol{x}^T \boldsymbol{x}')^2 = 1 + 2 x_1 x_1' + 2 x_2 x_2' + x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_2 x_1' x_2'$$
  - This is an inner product in:
    $$\boldsymbol{\phi}(\boldsymbol{x}) = (1, \sqrt{2}\,x_1, \sqrt{2}\,x_2, x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$$
    $$\boldsymbol{\phi}(\boldsymbol{x}') = (1, \sqrt{2}\,x_1', \sqrt{2}\,x_2', x_1'^2, x_2'^2, \sqrt{2}\,x_1' x_2')$$
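A quick numerical check of this identity (a NumPy sketch, not part of the slides):

```python
import numpy as np

def phi(x):
    """Explicit feature map matching the degree-2 polynomial kernel (d = 2)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, x2**2, np.sqrt(2)*x1*x2])

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
lhs = (1.0 + x @ xp) ** 2        # kernel trick: works in the input space
rhs = phi(x) @ phi(xp)           # explicit mapping, then inner product
print(np.isclose(lhs, rhs))      # True
```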

  9. Polynomial kernel: Degree two
  - We instead use $K(\boldsymbol{x}, \boldsymbol{x}') = (\boldsymbol{x}^T \boldsymbol{x}' + 1)^2$, which for a $d$-dimensional input $\boldsymbol{x} = (x_1, \ldots, x_d)$ corresponds to the feature map:
    $$\boldsymbol{\phi}(\boldsymbol{x}) = \left(1, \sqrt{2}\,x_1, \ldots, \sqrt{2}\,x_d, x_1^2, \ldots, x_d^2, \sqrt{2}\,x_1 x_2, \ldots, \sqrt{2}\,x_1 x_d, \sqrt{2}\,x_2 x_3, \ldots, \sqrt{2}\,x_{d-1} x_d\right)$$

  10. Polynomial kernel
  - This can similarly be generalized to $d$-dimensional $\boldsymbol{x}$ and $\phi$'s that are polynomials of order $M$:
    $$K(\boldsymbol{x}, \boldsymbol{x}') = (1 + \boldsymbol{x}^T \boldsymbol{x}')^M = (1 + x_1 x_1' + x_2 x_2' + \cdots + x_d x_d')^M$$
  - Example: SVM boundary for a polynomial kernel
    - $w_0 + \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}) = 0$
    - $\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \boldsymbol{\phi}(\boldsymbol{x}^{(i)})^T \boldsymbol{\phi}(\boldsymbol{x}) = 0$
    - $\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} k(\boldsymbol{x}^{(i)}, \boldsymbol{x}) = 0$
    - $\Rightarrow w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \left(1 + \boldsymbol{x}^{(i)T} \boldsymbol{x}\right)^M = 0$, so the boundary is a polynomial of order $M$

  11. Why kernel?
  - Kernel functions $K$ can indeed be efficiently computed, with a cost proportional to $d$ (the dimensionality of the input) instead of $m$
  - Example: consider the second-order polynomial transform:
    $$\boldsymbol{\phi}(\boldsymbol{x}) = (1, x_1, \ldots, x_d, x_1 x_1, x_1 x_2, \ldots, x_d x_d)^T, \quad m = 1 + d + d^2$$
    $$\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j' \qquad O(m)$$
  - Since $\sum_{i} \sum_{j} x_i x_j x_i' x_j' = \left(\sum_{i} x_i x_i'\right)\left(\sum_{j} x_j x_j'\right)$:
    $$\boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}') = 1 + \boldsymbol{x}^T \boldsymbol{x}' + (\boldsymbol{x}^T \boldsymbol{x}')^2 \qquad O(d)$$
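The two computations can be compared directly; a sketch where the explicit transform costs $O(d^2)$ arithmetic while the kernel costs $O(d)$:

```python
import numpy as np

d = 500
rng = np.random.default_rng(1)
x, xp = rng.standard_normal(d), rng.standard_normal(d)

def phi2(x):
    """Second-order transform: m = 1 + d + d^2 features."""
    return np.concatenate(([1.0], x, np.outer(x, x).ravel()))

explicit = phi2(x) @ phi2(xp)            # O(m) = O(d^2) work
trick = 1.0 + x @ xp + (x @ xp) ** 2     # O(d) work
print(np.isclose(explicit, trick))       # True
```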

  12. Gaussian or RBF kernel
  - If $K(\boldsymbol{x}, \boldsymbol{x}')$ is an inner product in some transformed space of $\boldsymbol{x}$, it is a valid kernel
  - $K(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}'\|^2}{\sigma^2}\right)$
  - Take the one-dimensional case with $\sigma = 1$:
    $$K(x, x') = \exp\left(-(x - x')^2\right) = \exp(-x^2)\,\exp(-x'^2)\,\exp(2xx')$$
    $$= \exp(-x^2)\,\exp(-x'^2) \sum_{k=0}^{\infty} \frac{2^k x^k x'^k}{k!}$$
  - So the Gaussian kernel corresponds to an inner product in an infinite-dimensional feature space
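A numerical sanity check of this expansion (a sketch; the infinite sum is truncated at 30 terms):

```python
import numpy as np
from math import factorial

def rbf_1d(x, xp):
    return np.exp(-(x - xp) ** 2)

def rbf_series(x, xp, terms=30):
    """Truncated feature-space expansion of the 1-D Gaussian kernel (sigma = 1)."""
    s = sum((2.0 ** k) * (x ** k) * (xp ** k) / factorial(k) for k in range(terms))
    return np.exp(-x ** 2) * np.exp(-xp ** 2) * s

x, xp = 0.7, -1.2
print(rbf_1d(x, xp), rbf_series(x, xp))   # the two values agree to high precision
```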

  13. Some common kernel functions
  - Linear: $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{x}'$
  - Polynomial: $k(\boldsymbol{x}, \boldsymbol{x}') = (\boldsymbol{x}^T \boldsymbol{x}' + 1)^M$
  - Gaussian: $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(-\dfrac{\|\boldsymbol{x} - \boldsymbol{x}'\|^2}{\sigma^2}\right)$
  - Sigmoid: $k(\boldsymbol{x}, \boldsymbol{x}') = \tanh(a\,\boldsymbol{x}^T \boldsymbol{x}' + b)$
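For concreteness, the four kernels as NumPy one-liners (a sketch; $M$, $\sigma$, $a$, $b$ are hyperparameters to be chosen):

```python
import numpy as np

linear  = lambda x, xp: x @ xp
poly    = lambda x, xp, M=2: (x @ xp + 1.0) ** M
gauss   = lambda x, xp, sigma=1.0: np.exp(-np.sum((x - xp) ** 2) / sigma ** 2)
sigmoid = lambda x, xp, a=1.0, b=0.0: np.tanh(a * (x @ xp) + b)
```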

  14. Kernel formulation of SVM
  - Optimization problem:
    $$\max_{\boldsymbol{\alpha}} \ \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y^{(i)} y^{(j)} k(\boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)})$$
    $$\text{s.t.} \quad \sum_{n=1}^{N} \alpha_n y^{(n)} = 0, \quad 0 \leq \alpha_n \leq C, \quad n = 1, \ldots, N$$
  - The quadratic term can be written as $-\frac{1}{2}\boldsymbol{\alpha}^T \boldsymbol{R}\,\boldsymbol{\alpha}$ with
    $$\boldsymbol{R} = \begin{bmatrix} y^{(1)} y^{(1)} K(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(1)}) & \cdots & y^{(1)} y^{(N)} K(\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(N)}) \\ \vdots & \ddots & \vdots \\ y^{(N)} y^{(1)} K(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(1)}) & \cdots & y^{(N)} y^{(N)} K(\boldsymbol{x}^{(N)}, \boldsymbol{x}^{(N)}) \end{bmatrix}$$
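A direct NumPy construction of this matrix (a sketch; `kernel` is any function implementing $k$):

```python
import numpy as np

def gram_matrix(X, y, kernel):
    """R[i, j] = y^(i) y^(j) K(x^(i), x^(j)), as in the dual objective above."""
    N = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    return np.outer(y, y) * K
```

With $\boldsymbol{R}$ in hand, the dual is a quadratic program in $\boldsymbol{\alpha}$ alone, independent of the feature-space dimension.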

  15. Classifying a new data point
  $$\hat{y} = \mathrm{sign}\left(w_0 + \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x})\right), \quad \boldsymbol{w} = \sum_{\alpha_n > 0} \alpha_n y^{(n)} \boldsymbol{\phi}(\boldsymbol{x}^{(n)}), \quad w_0 = y^{(s)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(s)})$$
  Substituting $\boldsymbol{\phi}(\boldsymbol{x}^{(n)})^T \boldsymbol{\phi}(\boldsymbol{x}) = k(\boldsymbol{x}^{(n)}, \boldsymbol{x})$:
  $$\hat{y} = \mathrm{sign}\left(w_0 + \sum_{\alpha_n > 0} \alpha_n y^{(n)} k(\boldsymbol{x}^{(n)}, \boldsymbol{x})\right), \quad w_0 = y^{(s)} - \sum_{\alpha_n > 0} \alpha_n y^{(n)} k(\boldsymbol{x}^{(n)}, \boldsymbol{x}^{(s)})$$
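A minimal sketch of kernelized prediction under the same assumptions (the names are illustrative; `alpha` comes from some QP solver):

```python
import numpy as np

def decision(x, alpha, y, X, kernel, w0, tol=1e-8):
    """w0 + sum over support vectors of alpha_n y^(n) k(x^(n), x)."""
    sv = alpha > tol
    return w0 + np.sum(alpha[sv] * y[sv] *
                       np.array([kernel(xn, x) for xn in X[sv]]))

def bias(alpha, y, X, kernel, C, tol=1e-8):
    """w0 = y^(s) - sum_n alpha_n y^(n) k(x^(n), x^(s)) for a margin SV s."""
    s = np.where((alpha > tol) & (alpha < C - tol))[0][0]
    return y[s] - decision(X[s], alpha, y, X, kernel, w0=0.0, tol=tol)

# prediction: np.sign(decision(x, alpha, y, X, kernel, w0))
```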

  16. Gaussian kernel
  - Example: SVM boundary for a Gaussian kernel
  - Considers a Gaussian function around each data point:
    $$w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\boldsymbol{x} - \boldsymbol{x}^{(i)}\|^2}{\sigma^2}\right) = 0$$
  - SVM + Gaussian kernel can classify any arbitrary training set
    - Training error is zero when $\sigma \to 0$: all samples become support vectors (likely overfitting)
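This overfitting behavior is easy to reproduce; a small experiment assuming scikit-learn is available (note that sklearn writes the RBF kernel as $\exp(-\gamma\|\boldsymbol{x}-\boldsymbol{x}'\|^2)$, so small $\sigma$ corresponds to large $\gamma$):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = rng.choice([-1, 1], size=100)          # random labels: no real structure

for gamma in [0.1, 1.0, 100.0]:            # large gamma <=> small sigma
    clf = SVC(kernel='rbf', C=1e6, gamma=gamma).fit(X, y)
    print(gamma, clf.score(X, y), len(clf.support_))
# As gamma grows, training accuracy approaches 1 and nearly every point
# becomes a support vector, illustrating the overfitting described above.
```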

  17. Hard-margin example
  - For a narrow Gaussian (small $\sigma$), even the protection of a large margin cannot suppress overfitting.
  [Y. Abu-Mostafa et al., 2012]

  18. SVM Gaussian kernel: Example
  $$g(\boldsymbol{x}) = w_0 + \sum_{\alpha_i > 0} \alpha_i y^{(i)} \exp\left(-\frac{\|\boldsymbol{x} - \boldsymbol{x}^{(i)}\|^2}{2\sigma^2}\right)$$
  This example has been adapted from Zisserman's slides.

  19–24. SVM Gaussian kernel: Example (continued)
  Figure-only slides; the plots are not recoverable from this extraction. These examples have been adapted from Zisserman's slides.

  25. Kernel trick: Idea
  - Kernel trick: extension of many well-known algorithms to kernel-based ones, by substituting the dot product with the kernel function
    - $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{\phi}(\boldsymbol{x})^T \boldsymbol{\phi}(\boldsymbol{x}')$
    - $k(\boldsymbol{x}, \boldsymbol{x}')$ gives the dot product of $\boldsymbol{x}$ and $\boldsymbol{x}'$ in the transformed space
  - Idea: when the input vectors appear only in the form of dot products, we can use the kernel trick
    - Solve the problem without explicitly mapping the data
    - Explicit mapping is expensive if $\boldsymbol{\phi}(\boldsymbol{x})$ is very high-dimensional
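As an illustration of this substitution (not from the slides): the perceptron touches the data only through dot products $\boldsymbol{x}_m^T \boldsymbol{x}_n$, so replacing them with $k(\boldsymbol{x}_m, \boldsymbol{x}_n)$ kernelizes it directly. A minimal sketch:

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Perceptron with every dot product replaced by a kernel evaluation."""
    N = X.shape[0]
    alpha = np.zeros(N)      # mistake counts: one coefficient per training point
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    for _ in range(epochs):
        for n in range(N):
            # f(x_n) = sum_m alpha_m y^(m) k(x^(m), x^(n))
            if y[n] * np.sum(alpha * y * K[:, n]) <= 0:
                alpha[n] += 1    # mistake: strengthen this point's contribution
    return alpha
```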

  26. Kernel trick: Idea (Cont'd)
  - Instead of using a mapping $\boldsymbol{\phi}: \mathcal{X} \to \mathcal{F}$ to represent $\boldsymbol{x} \in \mathcal{X}$ by $\boldsymbol{\phi}(\boldsymbol{x}) \in \mathcal{F}$, a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is used
  - We specify only an inner product function between points in the transformed space (not their coordinates)
  - In many cases, the inner product in the embedding space can be computed efficiently

  27. Constructing kernels
  - Construct kernel functions directly
    - Ensure that the result is a valid kernel, i.e., corresponds to an inner product in some feature space
  - Example: $k(\boldsymbol{x}, \boldsymbol{x}') = (\boldsymbol{x}^T \boldsymbol{x}')^2$ for $\boldsymbol{x} = (x_1, x_2)$
    - Corresponding mapping: $\boldsymbol{\phi}(\boldsymbol{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^T$
  - We need a way to test whether a kernel is valid without having to construct $\boldsymbol{\phi}(\boldsymbol{x})$
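A one-off numerical confirmation of this correspondence (sketch):

```python
import numpy as np

phi = lambda x: np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(2), rng.standard_normal(2)
print(np.isclose((x @ xp) ** 2, phi(x) @ phi(xp)))   # True
```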

  28. Constructing valid kernels
  Given valid kernels $k_1$ and $k_2$, the following are also valid kernels [Bishop]:
  - $k(\boldsymbol{x}, \boldsymbol{x}') = c\,k_1(\boldsymbol{x}, \boldsymbol{x}')$, where $c > 0$
  - $k(\boldsymbol{x}, \boldsymbol{x}') = f(\boldsymbol{x})\,k_1(\boldsymbol{x}, \boldsymbol{x}')\,f(\boldsymbol{x}')$, where $f(\cdot)$ is any function
  - $k(\boldsymbol{x}, \boldsymbol{x}') = q\left(k_1(\boldsymbol{x}, \boldsymbol{x}')\right)$, where $q(\cdot)$ is a polynomial with coefficients $\geq 0$
  - $k(\boldsymbol{x}, \boldsymbol{x}') = \exp\left(k_1(\boldsymbol{x}, \boldsymbol{x}')\right)$
  - $k(\boldsymbol{x}, \boldsymbol{x}') = k_1(\boldsymbol{x}, \boldsymbol{x}') + k_2(\boldsymbol{x}, \boldsymbol{x}')$
  - $k(\boldsymbol{x}, \boldsymbol{x}') = k_1(\boldsymbol{x}, \boldsymbol{x}')\,k_2(\boldsymbol{x}, \boldsymbol{x}')$
  - $k(\boldsymbol{x}, \boldsymbol{x}') = k_3\left(\boldsymbol{\phi}(\boldsymbol{x}), \boldsymbol{\phi}(\boldsymbol{x}')\right)$, where $\boldsymbol{\phi}$ is a function from $\boldsymbol{x}$ to $\mathbb{R}^M$ and $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$
  - $k(\boldsymbol{x}, \boldsymbol{x}') = \boldsymbol{x}^T \boldsymbol{A}\,\boldsymbol{x}'$, where $\boldsymbol{A}$ is a symmetric positive semi-definite matrix
  - $k(\boldsymbol{x}, \boldsymbol{x}') = k_a(\boldsymbol{x}_a, \boldsymbol{x}_a') + k_b(\boldsymbol{x}_b, \boldsymbol{x}_b')$ and $k(\boldsymbol{x}, \boldsymbol{x}') = k_a(\boldsymbol{x}_a, \boldsymbol{x}_a')\,k_b(\boldsymbol{x}_b, \boldsymbol{x}_b')$, where $\boldsymbol{x}_a$ and $\boldsymbol{x}_b$ are variables (not necessarily disjoint) with $\boldsymbol{x} = (\boldsymbol{x}_a, \boldsymbol{x}_b)$, and $k_a$ and $k_b$ are valid kernels over their respective spaces
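One practical test these rules suggest (a heuristic sketch, not from the slides): a valid kernel must produce a positive semi-definite Gram matrix on every point set, so sampling points and checking eigenvalues can expose invalid kernels:

```python
import numpy as np

def looks_valid(kernel, d=3, n=50, tol=-1e-9, seed=0):
    """Heuristic check: the Gram matrix of a valid kernel is always PSD."""
    X = np.random.default_rng(seed).standard_normal((n, d))
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.eigvalsh(K).min() >= tol    # PSD up to numerical tolerance

print(looks_valid(lambda a, b: (a @ b + 1.0) ** 2))      # True: polynomial kernel
print(looks_valid(lambda a, b: -np.sum((a - b) ** 2)))   # False: not a valid kernel
```

Passing this check on a sample does not prove validity, but a single negative eigenvalue is a definitive counterexample.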
