
Advanced Machine Learning: Learning Kernels. Mehryar Mohri (presentation slides)



  1. Advanced Machine Learning: Learning Kernels. Mehryar Mohri, MOHRI@ Courant Institute & Google Research.

  2. Outline. Kernel methods. Learning kernels: • scenario • learning bounds • algorithms.

  3. Machine Learning Components. [Diagram: the user defines features; the features and the sample feed the algorithm, which returns a hypothesis h.]

  4. Machine Learning Components. [Same diagram, annotated: defining the features is a critical task left to the user; the algorithm is the main focus of the ML literature.]

  5. Kernel Methods. Features are implicitly defined via the choice of a PDS kernel $K$, with feature map $\Phi \colon X \to H$ satisfying
     $$\forall x, y \in X, \quad \Phi(x) \cdot \Phi(y) = K(x, y),$$
     where $K$ is interpreted as a similarity measure.
     Flexibility: the PDS kernel can be chosen arbitrarily.
     Kernels help extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA.
     The PDS condition is directly related to the convexity of the optimization problem.
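To make the PDS condition concrete: on any finite sample, a PDS kernel induces a symmetric positive semidefinite Gram matrix. A minimal NumPy sketch (not part of the original slides; helper names are ours) checking this for a polynomial kernel:

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (x . y + c)^d, a PDS kernel for c >= 0 and integer d >= 1."""
    return (np.dot(x, y) + c) ** d

# Build the Gram (kernel) matrix for a small sample.
X = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
K = np.array([[polynomial_kernel(x, y) for y in X] for x in X])

# A PDS kernel yields a symmetric PSD Gram matrix on every finite
# sample: all eigenvalues are >= 0 (up to round-off).
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```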

  6. Example: Polynomial Kernels. Definition: $\forall x, y \in \mathbb{R}^N$, $K(x, y) = (x \cdot y + c)^d$, $c > 0$.
     Example: for $N = 2$ and $d = 2$,
     $$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.$$
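This identity is easy to check numerically. A short sketch, assuming NumPy, with a hypothetical helper `phi` implementing the explicit six-dimensional feature map for $N = 2$, $d = 2$:

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map for the kernel (x . y + c)^2 in dimension N = 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
lhs = (np.dot(x, y) + 1.0) ** 2   # direct kernel evaluation
rhs = np.dot(phi(x), phi(y))      # inner product of feature vectors
print(np.isclose(lhs, rhs))       # True
```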

  7. XOR Problem. Use the second-degree polynomial kernel with $c = 1$. Its feature map sends each point $(x_1, x_2)$ to $(x_1^2, x_2^2, \sqrt{2}\, x_1 x_2, \sqrt{2}\, x_1, \sqrt{2}\, x_2, 1)$:
     $(1, 1) \mapsto (1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$,
     $(-1, 1) \mapsto (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$,
     $(1, -1) \mapsto (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$,
     $(-1, -1) \mapsto (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$.
     The four points are linearly non-separable in the input space, but linearly separable in feature space by the hyperplane $x_1 x_2 = 0$.
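A small sketch (ours, assuming NumPy) confirming that the third feature coordinate, $\sqrt{2}\, x_1 x_2$, linearly separates the four XOR points:

```python
import numpy as np

def phi(x):
    # Feature map of the degree-2 polynomial kernel with c = 1.
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

X = np.array([(1, 1), (-1, -1), (-1, 1), (1, -1)], dtype=float)
y = np.array([+1, +1, -1, -1])  # XOR labels: sign of x1 * x2

# The third feature, sqrt(2) * x1 * x2, separates the classes linearly:
# the hyperplane "third coordinate = 0" classifies all four points.
features = np.array([phi(x) for x in X])
print(np.all(np.sign(features[:, 2]) == y))  # True
```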

  8. Other Standard PDS Kernels.
     Gaussian kernels: $$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2 \sigma^2}\right), \quad \sigma \neq 0.$$
     This is the normalized kernel of $(x, x') \mapsto \exp\left(\frac{x \cdot x'}{\sigma^2}\right)$.
     Sigmoid kernels: $$K(x, y) = \tanh(a (x \cdot y) + b), \quad a, b \geq 0.$$
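As an illustration (not from the slides), a NumPy sketch that builds the Gaussian Gram matrix and checks two of its properties:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix of K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel_matrix(X, sigma=2.0)
print(np.allclose(np.diag(K), 1.0))           # K(x, x) = 1: normalized kernel
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # PSD on this sample
```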

  9. SVM (Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992).
     Primal: $$\min_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^m \left(1 - y_i (w \cdot \Phi_K(x_i) + b)\right)_+.$$
     Dual: $$\max_{\alpha} \; \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
     $$\text{subject to: } 0 \leq \alpha_i \leq C \;\wedge\; \sum_{i=1}^m \alpha_i y_i = 0, \quad i \in [1, m].$$
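One way to experiment with this dual (a sketch, not the slides' own implementation): scikit-learn's `SVC` accepts a precomputed Gram matrix, so any PDS kernel can be plugged in directly. The toy data below reuses the XOR example:

```python
import numpy as np
from sklearn.svm import SVC

# Toy sample and a degree-2 polynomial Gram matrix, K = (X X^T + 1)^2.
X = np.array([(1, 1), (-1, -1), (-1, 1), (1, -1)], dtype=float)
y = np.array([1, 1, -1, -1])
K = (X @ X.T + 1.0) ** 2

# SVC solves the dual above; kernel="precomputed" lets us pass K directly.
clf = SVC(kernel="precomputed", C=10.0)
clf.fit(K, y)
print(clf.predict(K))  # [ 1  1 -1 -1]: all four XOR points classified
```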

  10. Kernel Ridge Regression (Hoerl and Kennard, 1970; Saunders et al., 1998).
     Primal: $$\min_{w, b} \; \lambda \|w\|^2 + \sum_{i=1}^m (w \cdot \Phi_K(x_i) + b - y_i)^2.$$
     Dual: $$\max_{\alpha \in \mathbb{R}^m} \; -\alpha^\top (K + \lambda I) \alpha + 2 \alpha^\top y.$$
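The dual is a concave quadratic in $\alpha$, so setting its gradient to zero gives the closed-form solution $\alpha = (K + \lambda I)^{-1} y$. A minimal sketch (assuming NumPy; the offset $b$ is dropped for simplicity, and the helper names are ours):

```python
import numpy as np

def krr_fit(K, y, lam):
    """Maximize -a^T (K + lam I) a + 2 a^T y in closed form:
    the gradient vanishes at a = (K + lam I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def krr_predict(alpha, K_test_train):
    """h(x) = sum_i alpha_i K(x_i, x) for each test point."""
    return K_test_train @ alpha

# Toy 1-D regression with a Gaussian kernel (2 sigma^2 = 0.5).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)
K = np.exp(-(X - X.T) ** 2 / 0.5)

alpha = krr_fit(K, y, lam=0.1)
print(np.mean((krr_predict(alpha, K) - y) ** 2))  # small training error
```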

  11. Questions. How should the user choose the kernel?
     • The problem is similar to that of selecting features for other learning algorithms.
     • With a poor choice, learning is made very difficult.
     • With a good choice, even weak learners could succeed.
     This requirement on the user is thus critical.
     • Can this requirement be lessened?
     • Is a more automatic selection of features possible?

  12. Outline. Kernel methods. Learning kernels: • scenario • learning bounds • algorithms.

  13. Standard Learning with Kernels. [Diagram: the user picks a kernel K; the kernel and the sample feed the algorithm, which returns a hypothesis h.]

  14. Learning Kernel Framework. [Diagram: the user supplies a kernel family $\mathcal{K}$; the family and the sample feed the algorithm, which returns a pair $(K, h)$.]

  15. Kernel Families. Most frequently used kernel families, for $q \geq 1$:
     $$\mathcal{K}_q = \left\{ K_\mu : K_\mu = \sum_{k=1}^p \mu_k K_k, \; \mu = (\mu_1, \ldots, \mu_p)^\top \in \Delta_q \right\},$$
     with $\Delta_q = \{\mu : \mu \geq 0, \; \|\mu\|_q = 1\}$.
     Hypothesis sets: $$H_q = \left\{ h \in \mathbb{H}_K : K \in \mathcal{K}_q, \; \|h\|_{\mathbb{H}_K} \leq 1 \right\}.$$
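A sketch (ours, assuming NumPy) of the $q = 1$ case: a convex combination of base Gram matrices, which stays PDS:

```python
import numpy as np

def combined_kernel_matrix(kernel_matrices, mu):
    """K_mu = sum_k mu_k K_k for a weight vector mu in the simplex
    (q = 1 case: mu >= 0, ||mu||_1 = 1)."""
    mu = np.asarray(mu)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)
    return sum(m_k * K_k for m_k, K_k in zip(mu, kernel_matrices))

# Two base kernels on the same sample: linear and degree-2 polynomial.
X = np.random.default_rng(0).normal(size=(6, 3))
K1 = X @ X.T
K2 = (X @ X.T + 1.0) ** 2
K_mu = combined_kernel_matrix([K1, K2], mu=[0.3, 0.7])

# A convex combination of PDS kernels is PDS.
print(np.linalg.eigvalsh(K_mu).min() >= -1e-8)  # True
```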

  16. Relation between Norms. Lemma: for $p, q \in (0, +\infty]$ with $p \leq q$, the following holds:
     $$\forall x \in \mathbb{R}^N, \quad \|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \|x\|_q.$$
     Proof: for the left inequality, observe that for $x \neq 0$, since each ratio $|x_i| / \|x\|_q$ is at most $1$ and $p \leq q$,
     $$\left(\frac{\|x\|_p}{\|x\|_q}\right)^p = \sum_{i=1}^N \left(\frac{|x_i|}{\|x\|_q}\right)^p \geq \sum_{i=1}^N \left(\frac{|x_i|}{\|x\|_q}\right)^q = 1.$$
     • The right inequality follows immediately from Hölder's inequality, with conjugate exponents $\frac{q}{q-p}$ and $\frac{q}{p}$:
     $$\|x\|_p^p = \sum_{i=1}^N 1 \cdot |x_i|^p \leq \left(\sum_{i=1}^N 1^{\frac{q}{q-p}}\right)^{1 - \frac{p}{q}} \left(\sum_{i=1}^N (|x_i|^p)^{\frac{q}{p}}\right)^{\frac{p}{q}} = N^{1 - \frac{p}{q}} \|x\|_q^p.$$
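The two inequalities are easy to sanity-check numerically; a small sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = rng.normal(size=N)
for p, q in [(1, 2), (2, 4), (1.5, 3)]:
    lo = np.linalg.norm(x, q)                       # ||x||_q
    mid = np.linalg.norm(x, p)                      # ||x||_p
    hi = N ** (1 / p - 1 / q) * np.linalg.norm(x, q)
    print(lo <= mid <= hi + 1e-12)                  # True for every p <= q
```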

  17. Single Kernel Guarantee (Koltchinskii and Panchenko, 2002). Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H$:
     $$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{\mathrm{Tr}[K]}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
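As read off the theorem, the slack decreases as $O(1/\sqrt{m})$ once $\mathrm{Tr}[K]$ grows linearly in $m$. A hypothetical helper (ours, assuming NumPy) computing the two slack terms:

```python
import numpy as np

def margin_bound_slack(K, rho, delta, m):
    """Complexity and confidence terms of the single-kernel margin bound:
    (2 / rho) * sqrt(Tr[K]) / m + sqrt(log(1 / delta) / (2 m))."""
    return ((2 / rho) * np.sqrt(np.trace(K)) / m
            + np.sqrt(np.log(1 / delta) / (2 * m)))

m = 1000
X = np.random.default_rng(0).normal(size=(m, 5))
K = X @ X.T  # Tr[K] grows linearly with m here
print(margin_bound_slack(K, rho=1.0, delta=0.05, m=m))
```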

  18. Multiple Kernel Guarantee (Cortes, MM, and Rostamizadeh, 2010). Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$:
     $$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{s \, \|u\|_s}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
     with $u = (\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p])^\top$.

  19. Proof. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$.
     $$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \Big[ \sup_{h \in H_q} \sum_{i=1}^m \sigma_i h(x_i) \Big] = \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \Big[ \sup_{\mu \in \Delta_q, \, \alpha^\top K_\mu \alpha \leq 1} \sum_{i,j=1}^m \sigma_i \alpha_j K_\mu(x_i, x_j) \Big]$$
     $$= \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \Big[ \sup_{\mu \in \Delta_q, \, \alpha^\top K_\mu \alpha \leq 1} \sigma^\top K_\mu \alpha \Big] = \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \Big[ \sup_{\mu \in \Delta_q, \, \|K_\mu^{1/2} \alpha\| \leq 1} \big\langle K_\mu^{1/2} \sigma, K_\mu^{1/2} \alpha \big\rangle \Big]$$
     $$= \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \Big[ \sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma} \Big] \quad \text{(Cauchy-Schwarz)}$$
     $$= \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \Big[ \sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u_\sigma} \Big] \quad \big(u_\sigma = (\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma)^\top\big)$$
     $$= \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \big[ \sqrt{\|u_\sigma\|_r} \big] \quad \text{(definition of the dual norm)}.$$
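The last line of the chain can be checked by Monte Carlo simulation; a rough sketch (ours, assuming NumPy) that estimates $\frac{1}{m} \mathbb{E}_\sigma[\sqrt{\|u_\sigma\|_r}]$ on random PSD matrices:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity bound
# R_S(H_q) = (1/m) E_sigma[ sqrt(||u_sigma||_r) ],
# with u_sigma = (sigma^T K_1 sigma, ..., sigma^T K_p sigma).
rng = np.random.default_rng(0)
m, p, r = 50, 3, 2
Ks = []
for _ in range(p):
    A = rng.normal(size=(m, m))
    Ks.append(A @ A.T)  # random PSD kernel matrices

total, n_draws = 0.0, 2000
for _ in range(n_draws):
    sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher variables
    u = np.array([sigma @ K @ sigma for K in Ks])
    total += np.sqrt(np.linalg.norm(u, ord=r))
print(total / (n_draws * m))  # estimate of R_S(H_q)
```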

  20. Lemma (Cortes, MM, and Rostamizadeh, 2010). Lemma: let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$,
     $$\operatorname*{\mathbb{E}}_\sigma \big[ (\sigma^\top K \sigma)^r \big] \leq \big( r \, \mathrm{Tr}[K] \big)^r.$$
     Proof: combinatorial argument.

  21. Proof. For any $1 \leq s \leq r$,
     $$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \big[ \sqrt{\|u_\sigma\|_r} \big] \leq \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \big[ \sqrt{\|u_\sigma\|_s} \big] = \frac{1}{m} \operatorname*{\mathbb{E}}_\sigma \bigg[ \Big( \sum_{k=1}^p (\sigma^\top K_k \sigma)^s \Big)^{\frac{1}{2s}} \bigg]$$
     $$\leq \frac{1}{m} \bigg( \operatorname*{\mathbb{E}}_\sigma \Big[ \sum_{k=1}^p (\sigma^\top K_k \sigma)^s \Big] \bigg)^{\frac{1}{2s}} \quad \text{(Jensen's inequality)}$$
     $$= \frac{1}{m} \bigg( \sum_{k=1}^p \operatorname*{\mathbb{E}}_\sigma \big[ (\sigma^\top K_k \sigma)^s \big] \bigg)^{\frac{1}{2s}} \leq \frac{1}{m} \bigg( \sum_{k=1}^p \big( s \, \mathrm{Tr}[K_k] \big)^s \bigg)^{\frac{1}{2s}} = \frac{\sqrt{s \, \|u\|_s}}{m} \quad \text{(lemma)}.$$

  22. L1 Learning Bound (Cortes, MM, and Rostamizadeh, 2010). Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability $1 - \delta$, the following holds for all $h \in H_1$:
     $$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{\lceil e \log p \rceil \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$
     • Weak dependency on $p$.
     • Bound valid for $p \gg m$.
     • $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$.
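To see how weak the dependency on $p$ is, a small sketch (ours, assuming NumPy) evaluating the slack term as $p$ grows, under the normalization $\mathrm{Tr}[K_k] = m$:

```python
import numpy as np

def l1_bound_slack(traces, rho, delta, m):
    """Slack term of the L1 learning-kernels bound:
    (2 / rho) * sqrt(ceil(e log p) * max_k Tr[K_k]) / m
      + sqrt(log(1 / delta) / (2 m))."""
    c = np.ceil(np.e * np.log(len(traces)))
    return ((2 / rho) * np.sqrt(c * max(traces)) / m
            + np.sqrt(np.log(1 / delta) / (2 * m)))

# The dependency on the number of base kernels p is only logarithmic:
m = 1000
for p in [10, 100, 10_000, 1_000_000]:
    print(p, l1_bound_slack([float(m)] * p, rho=1.0, delta=0.05, m=m))
```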

  23. Proof. For $q = 1$, the bound holds for any integer $s \geq 1$:
     $$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \frac{\sqrt{s \, \|u\|_s}}{m} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
     with $$s \, \|u\|_s = s \bigg[ \sum_{k=1}^p \mathrm{Tr}[K_k]^s \bigg]^{\frac{1}{s}} \leq s \, p^{\frac{1}{s}} \max_{k=1}^p \mathrm{Tr}[K_k].$$
     The function $s \mapsto s \, p^{\frac{1}{s}}$ reaches its minimum at $s = \log p$.
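The choice $s \approx \log p$ can be verified numerically, along with the resulting value $e \log p$ that appears in the corollary; a short sketch (ours, assuming NumPy):

```python
import numpy as np

p = 1000.0
s_grid = np.arange(1, 40)
vals = s_grid * p ** (1.0 / s_grid)

# The continuous minimizer of s -> s * p^(1/s) is s = log p,
# and the minimum value is e * log p, as used in the corollary.
print(s_grid[np.argmin(vals)], np.log(p))  # integer minimizer near log p
print(vals.min(), np.e * np.log(p))        # minimum value near e * log p
```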

  24. Lower Bound. The bound is tight:
     • The $\sqrt{\log p}$ dependency cannot be improved.
     • Argument based on the VC dimension, for example.
     Observations: case $X = \{-1, +1\}^p$.
     • Canonical projection kernels $K_k(x, x') = x_k x'_k$.
     • $H_1$ contains $J_p = \{x \mapsto s x_k : k \in [1, p], \; s \in \{-1, +1\}\}$.
     • $\mathrm{VCdim}(J_p) = \Omega(\log p)$.
     • For $\rho = 1$ and $h \in J_p$, $\widehat{R}_\rho(h) = \widehat{R}(h)$.
     • VC lower bound: $\Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.

  25. Pseudo-Dimension Bound (Srebro and Ben-David, 2006). Assume that $K_k(x, x) \leq R^2$ for all $k \in [1, p]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$:
     $$R(h) \leq \widehat{R}_\rho(h) + \sqrt{\frac{8 \Big[ 2 + p \log \frac{128 e m^3 R^2}{\rho^2 p} + 256 \frac{R^2}{\rho^2} \log \frac{\rho e m}{8 R} \log \frac{128 m R^2}{\rho^2} \Big] + \log \frac{1}{\delta}}{m}}.$$
     • Bound additive in $p$ (modulo log terms).
     • Not informative for $p > m$.
     • Based on the pseudo-dimension of the kernel family.
     • Similar guarantees for other families.

  26. Comparison. [Figure: comparison of the bounds, with $\rho / R = 0.2$.]
