Advanced Machine Learning
Learning Kernels
MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH

Outline
Kernel methods.
Learning kernels
• scenario.
• learning bounds.
• algorithms.
Advanced Machine Learning - Mohri@ page 2

Machine Learning Components
[Diagram: user → features; sample and features → algorithm → h.]

Machine Learning Components
[Diagram: user → features (critical task); sample and features → algorithm (main focus of ML literature) → h.]

Kernel Methods
Features $\Phi\colon X \to H$ implicitly defined via the choice of a PDS kernel $K$:
\[ \forall x, y \in X, \quad \Phi(x) \cdot \Phi(y) = K(x, y). \]
$K$ interpreted as a similarity measure.
Flexibility: PDS kernel can be chosen arbitrarily.
Help extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA.
PDS condition directly related to convexity of optimization problem.

Example - Polynomial Kernels
Definition:
\[ \forall x, y \in \mathbb{R}^N, \quad K(x, y) = (x \cdot y + c)^d, \quad c > 0. \]
Example: for $N = 2$ and $d = 2$,
\[
K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 =
\begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix}
\cdot
\begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.
\]
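
The identity in this example can be checked numerically. The sketch below uses plain Python; the helper names `poly_kernel`, `phi`, and `dot` and the test points are illustrative choices, not part of the slides. It compares the kernel value with the inner product of the explicit six-dimensional features for $N = 2$, $d = 2$.

```python
import math

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c)^d."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for N = 2, d = 2 (six coordinates)."""
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2,
            math.sqrt(2 * c) * x1, math.sqrt(2 * c) * x2, c]

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Arbitrary test points (assumed values, chosen only for illustration).
x, y = (0.5, -1.5), (2.0, 0.25)
lhs = poly_kernel(x, y, c=1.0, d=2)        # kernel evaluation
rhs = dot(phi(x, c=1.0), phi(y, c=1.0))    # explicit feature inner product
print(lhs, rhs)
```

The kernel evaluation costs one inner product in $\mathbb{R}^2$, while the explicit route needs six coordinates; the gap widens quickly with $N$ and $d$.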

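XOR Problem
Use second-degree polynomial kernel with $c = 1$:
[Figure, left panel, axes $x_1$ and $x_2$: the points $(-1, 1)$, $(1, 1)$, $(1, -1)$, $(-1, -1)$. Linearly non-separable.]
[Figure, right panel, axes $\sqrt{2}\, x_1$ and $\sqrt{2}\, x_1 x_2$: the images of the four points in feature space.]
$(1, 1) \mapsto (1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$
$(-1, -1) \mapsto (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$
$(-1, 1) \mapsto (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$
$(1, -1) \mapsto (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$
Linearly separable by $x_1 x_2 = 0$.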
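
A minimal sketch of the XOR construction, assuming the feature ordering $(x_1^2, x_2^2, \sqrt{2}x_1x_2, \sqrt{2}x_1, \sqrt{2}x_2, 1)$ and a label convention in which a point is positive when $x_1 x_2 > 0$ (the slide does not state the sign convention). The weight vector `w` is a hypothetical separator that reads off only the $\sqrt{2}x_1x_2$ coordinate.

```python
import math

def phi(x):
    """Degree-2 polynomial feature map with c = 1."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

points = [(1, 1), (-1, -1), (-1, 1), (1, -1)]
# Assumed XOR-style labels: +1 when both coordinates share a sign.
labels = [1 if x1 * x2 > 0 else -1 for x1, x2 in points]

# Hyperplane in feature space: keep only the sqrt(2) * x1 * x2 coordinate.
w = (0, 0, 1, 0, 0, 0)
preds = [1 if dot(w, phi(p)) > 0 else -1 for p in points]
print(list(zip(points, preds)))
```

No single line in the $(x_1, x_2)$ plane reproduces these labels, while in feature space a hyperplane with one nonzero weight suffices.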

Other Standard PDS Kernels
Gaussian kernels:
\[ K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right), \quad \sigma \neq 0. \]
• Normalized kernel of $(x, x') \mapsto \exp\left( \frac{x \cdot x'}{\sigma^2} \right)$.
Sigmoid Kernels:
\[ K(x, y) = \tanh(a (x \cdot y) + b), \quad a, b \geq 0. \]
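
The normalization claim for Gaussian kernels can be verified in one line. Writing $K'(x, x') = \exp(x \cdot x' / \sigma^2)$ (the symbol $K'$ is introduced here only for this derivation), the normalized kernel is:

```latex
\[
\frac{K'(x, x')}{\sqrt{K'(x, x)\, K'(x', x')}}
= \frac{\exp\left( \frac{x \cdot x'}{\sigma^2} \right)}
       {\exp\left( \frac{\|x\|^2}{2\sigma^2} \right) \exp\left( \frac{\|x'\|^2}{2\sigma^2} \right)}
= \exp\left( -\frac{\|x\|^2 - 2\, x \cdot x' + \|x'\|^2}{2\sigma^2} \right)
= \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right).
\]
```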

SVM (Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992)
Primal:
\[ \min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \big( 1 - y_i (w \cdot \Phi_K(x_i) + b) \big)_+ . \]
Dual:
\[ \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \]
\[ \text{subject to: } 0 \leq \alpha_i \leq C \ \wedge \ \sum_{i=1}^{m} \alpha_i y_i = 0, \quad i \in [1, m]. \]

Kernel Ridge Regression (Hoerl and Kennard, 1970; Sanders et al., 1998)
Primal:
\[ \min_{w} \ \lambda \|w\|^2 + \sum_{i=1}^{m} \big( w \cdot \Phi_K(x_i) + b - y_i \big)^2 . \]
Dual:
\[ \max_{\alpha \in \mathbb{R}^m} \ -\alpha^\top (K + \lambda I) \alpha + 2 \alpha^\top y . \]
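
Setting the gradient of the dual objective to zero gives $-2(K + \lambda I)\alpha + 2y = 0$, that is, $\alpha = (K + \lambda I)^{-1} y$. The sketch below solves this system on a toy one-dimensional sample. It assumes no offset $b$, a Gaussian kernel with $\sigma = 1$, and made-up data; the helper names `solve` and `gauss` are illustrative.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def gauss(a, b, sigma=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-(a - b) ** 2 / (2 * sigma ** 2))

# Toy sample (assumed values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 0.8, 0.9, 0.1]
lam = 0.1

m = len(xs)
K = [[gauss(a, b) for b in xs] for a in xs]
A = [[K[i][j] + (lam if i == j else 0.0) for j in range(m)] for i in range(m)]
alpha = solve(A, ys)  # alpha = (K + lam I)^{-1} y

def predict(x):
    """h(x) = sum_i alpha_i K(x_i, x)."""
    return sum(a * gauss(xi, x) for a, xi in zip(alpha, xs))

# Largest violation of the stationarity condition (K + lam I) alpha = y.
residual = max(abs(sum(A[i][j] * alpha[j] for j in range(m)) - ys[i])
               for i in range(m))
print(residual, predict(1.5))
```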

Questions
How should the user choose the kernel?
• problem similar to that of selecting features for other learning algorithms.
• poor choice: learning made very difficult.
• good choice: even poor learners could succeed.
The requirement from the user is thus critical.
• can this requirement be lessened?
• is a more automatic selection of features possible?

Outline
Kernel methods.
Learning kernels
• scenario.
• learning bounds.
• algorithms.

Standard Learning with Kernels
[Diagram: user → kernel $K$; sample and kernel $K$ → algorithm → $h$.]

Learning Kernel Framework
[Diagram: user → kernel family $\mathcal{K}$; sample and kernel family $\mathcal{K}$ → algorithm → $(K, h)$.]

Kernel Families
Most frequently used kernel families, $q \geq 1$:
\[ \mathcal{K}_q = \Big\{ K_\mu \colon K_\mu = \sum_{k=1}^{p} \mu_k K_k, \ \mu = (\mu_1, \ldots, \mu_p)^\top \in \Delta_q \Big\}, \]
\[ \text{with } \Delta_q = \big\{ \mu \colon \mu \geq 0, \ \|\mu\|_q = 1 \big\}. \]
Hypothesis sets:
\[ H_q = \big\{ h \in \mathbb{H}_K \colon K \in \mathcal{K}_q, \ \|h\|_{\mathbb{H}_K} \leq 1 \big\}. \]
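
As an illustration of the family $\mathcal{K}_q$, the sketch below builds one member $K_\mu$ from two base Gram matrices. The base kernels (linear and Gaussian), the sample, the raw weights, and the helper names `lq_norm`, `rescale_to_delta_q`, and `combine` are all assumptions made for the example.

```python
import math

def lq_norm(v, q):
    """L_q norm of a vector."""
    return sum(abs(t) ** q for t in v) ** (1.0 / q)

def rescale_to_delta_q(weights, q):
    """Rescale nonnegative weights so that ||mu||_q = 1 (a point of Delta_q)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    norm = lq_norm(weights, q)
    return [w / norm for w in weights]

def combine(kernels, mu):
    """Return K_mu = sum_k mu_k K_k for a list of m x m Gram matrices."""
    m = len(kernels[0])
    return [[sum(mu_k * K[i][j] for mu_k, K in zip(mu, kernels))
             for j in range(m)] for i in range(m)]

# Toy sample and two base kernels (assumed choices).
xs = [0.0, 0.5, 2.0]
K_lin = [[a * b for b in xs] for a in xs]
K_rbf = [[math.exp(-(a - b) ** 2 / 2.0) for b in xs] for a in xs]

mu = rescale_to_delta_q([1.0, 3.0], q=2)
K_mu = combine([K_lin, K_rbf], mu)
print(mu, K_mu[0])
```

Since $\mu \geq 0$ and each $K_k$ is PDS, the combination $K_\mu$ is PDS as well, so any kernel algorithm can be run with it unchanged.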

Relation between Norms
Lemma: for $p, q \in (0, +\infty]$, the following holds:
\[ \forall x \in \mathbb{R}^N, \quad p \leq q \ \Rightarrow \ \|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \|x\|_q . \]
Proof: for the left inequality, observe that for $x \neq 0$,
\[ \left[ \frac{\|x\|_p}{\|x\|_q} \right]^p = \sum_{i=1}^{N} \Bigg[ \underbrace{\frac{|x_i|}{\|x\|_q}}_{\leq 1} \Bigg]^p \geq \sum_{i=1}^{N} \left[ \frac{|x_i|}{\|x\|_q} \right]^q = 1 . \]
• The right inequality follows immediately from Hölder's inequality:
\[ \|x\|_p = \left[ \sum_{i=1}^{N} |x_i|^p \right]^{\frac{1}{p}} \leq \left[ \left( \sum_{i=1}^{N} (|x_i|^p)^{\frac{q}{p}} \right)^{\frac{p}{q}} \left( \sum_{i=1}^{N} (1)^{\frac{q}{q - p}} \right)^{1 - \frac{p}{q}} \right]^{\frac{1}{p}} = \|x\|_q \, N^{\frac{1}{p} - \frac{1}{q}} . \]
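
A quick numerical sanity check of the lemma on random vectors, as a sketch only (it does not replace the proof). The sampling ranges, the seed, and the helper names `lp_norm` and `check_lemma` are arbitrary choices.

```python
import random

def lp_norm(x, p):
    """L_p (quasi-)norm for p in (0, +inf]."""
    if p == float("inf"):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)

def check_lemma(x, p, q, tol=1e-9):
    """Check ||x||_q <= ||x||_p <= N^(1/p - 1/q) ||x||_q for p <= q."""
    N = len(x)
    norm_q, norm_p = lp_norm(x, q), lp_norm(x, p)
    upper = N ** (1.0 / p - 1.0 / q) * norm_q
    return norm_q <= norm_p * (1 + tol) and norm_p <= upper * (1 + tol)

rng = random.Random(0)  # fixed seed for reproducibility
all_ok = True
for _ in range(1000):
    N = rng.randint(1, 8)
    x = [rng.uniform(-5, 5) for _ in range(N)]
    p = rng.uniform(0.5, 4.0)
    q = p + rng.uniform(0.0, 4.0)  # guarantees p <= q
    all_ok = all_ok and check_lemma(x, p, q)
print(all_ok)
```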

Single Kernel Guarantee (Koltchinskii and Panchenko, 2002)
Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$,
\[ R(h) \leq \widehat{R}_\rho(h) + 2 \frac{\sqrt{\mathrm{Tr}[K]}}{m \rho} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} . \]

Multiple Kernel Guarantee (Cortes, MM, and Rostamizadeh, 2010)
Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$:
\[ R(h) \leq \widehat{R}_\rho(h) + 2 \frac{\sqrt{s \|u\|_s}}{m \rho} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} , \]
with $u = (\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p])^\top$.

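Proof
Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$.
\begin{align*}
\widehat{\mathfrak{R}}_S(H_q)
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{h \in H_q} \sum_{i=1}^{m} \sigma_i h(x_i) \Big] \\
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \leq 1} \sum_{i,j=1}^{m} \sigma_i \alpha_j K_\mu(x_i, x_j) \Big] \\
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \leq 1} \sigma^\top K_\mu \alpha \Big]
 = \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{\mu \in \Delta_q,\ \|\alpha\|_{K_\mu^{1/2}} \leq 1} \langle \sigma, \alpha \rangle_{K_\mu^{1/2}} \Big] \\
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma} \Big] && \text{(Cauchy-Schwarz)} \\
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u_\sigma} \Big] && \big[ u_\sigma = (\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma)^\top \big] \\
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sqrt{\|u_\sigma\|_r} \Big] && \text{(definition of dual norm)}.
\end{align*}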
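
The last step of the proof uses the fact that, for a vector $u \geq 0$ and conjugate exponents $q, r$, the supremum of $\mu \cdot u$ over $\Delta_q$ equals $\|u\|_r$. The sketch below checks this on a small example: the maximizer $\mu^\star_k = u_k^{r-1} / \|u\|_r^{r-1}$ is the standard equality case of Hölder's inequality, and the vector `u`, the exponent `r`, and the random trials are assumed values.

```python
import random

def lp_norm(v, p):
    return sum(abs(t) ** p for t in v) ** (1.0 / p)

r = 3.0
q = r / (r - 1.0)            # conjugate exponent: 1/q + 1/r = 1
u = [3.0, 1.0, 2.0]          # nonnegative vector (assumed values)

norm_r = lp_norm(u, r)
# Hölder equality case: mu*_k proportional to u_k^(r-1), with ||mu*||_q = 1.
mu_star = [t ** (r - 1.0) / norm_r ** (r - 1.0) for t in u]
value = sum(m_k * t for m_k, t in zip(mu_star, u))

# No other feasible mu should do better than ||u||_r.
rng = random.Random(1)
never_exceeds = True
for _ in range(1000):
    raw = [rng.random() for _ in u]
    scale = lp_norm(raw, q)
    mu = [t / scale for t in raw]          # mu >= 0 and ||mu||_q = 1
    if sum(m_k * t for m_k, t in zip(mu, u)) > norm_r * (1 + 1e-9):
        never_exceeds = False
print(value, norm_r, never_exceeds)
```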

Lemma (Cortes, MM, and Rostamizadeh, 2010)
Lemma: Let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$,
\[ \operatorname*{E}_{\sigma} \big[ (\sigma^\top K \sigma)^r \big] \leq \big( r \, \mathrm{Tr}[K] \big)^r . \]
Proof: combinatorial argument.
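
For a small sample, the expectation over Rademacher vectors $\sigma \in \{-1, +1\}^m$ can be computed exactly by enumeration, which gives a sanity check of the lemma on a toy case. The sketch assumes a Gaussian kernel matrix on five made-up points; `quad_form` and `rademacher_moment` are illustrative helper names.

```python
import itertools
import math

def quad_form(K, s):
    """Compute s^T K s."""
    m = len(s)
    return sum(s[i] * K[i][j] * s[j] for i in range(m) for j in range(m))

def rademacher_moment(K, r):
    """Exact E_sigma[(sigma^T K sigma)^r], enumerating all 2^m sign vectors."""
    m = len(K)
    total = 0.0
    for s in itertools.product((-1, 1), repeat=m):
        total += quad_form(K, s) ** r
    return total / 2 ** m

# Toy sample and Gaussian kernel matrix (assumed values).
xs = [0.0, 0.3, 1.0, 1.5, 4.0]
K = [[math.exp(-(a - b) ** 2 / 2.0) for b in xs] for a in xs]
trace = sum(K[i][i] for i in range(len(xs)))

# For each r: (left-hand side, right-hand side) of the lemma.
results = {r: (rademacher_moment(K, r), (r * trace) ** r) for r in (1, 2, 3)}
for r, (lhs, rhs) in results.items():
    print(r, lhs, rhs)
```

For $r = 1$ the two sides coincide, since $\operatorname{E}_\sigma[\sigma^\top K \sigma] = \mathrm{Tr}[K]$.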

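Proof
For any $1 \leq s \leq r$,
\begin{align*}
\widehat{\mathfrak{R}}_S(H_q)
&= \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sqrt{\|u_\sigma\|_r} \Big] \\
&\leq \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sqrt{\|u_\sigma\|_s} \Big] \\
&= \frac{1}{m} \operatorname*{E}_{\sigma} \bigg[ \Big[ \sum_{k=1}^{p} (\sigma^\top K_k \sigma)^s \Big]^{\frac{1}{2s}} \bigg] \\
&\leq \frac{1}{m} \bigg[ \operatorname*{E}_{\sigma} \Big[ \sum_{k=1}^{p} (\sigma^\top K_k \sigma)^s \Big] \bigg]^{\frac{1}{2s}} && \text{(Jensen's inequality)} \\
&= \frac{1}{m} \bigg[ \sum_{k=1}^{p} \operatorname*{E}_{\sigma} \Big[ (\sigma^\top K_k \sigma)^s \Big] \bigg]^{\frac{1}{2s}} \\
&\leq \frac{1}{m} \bigg[ \sum_{k=1}^{p} \big( s \, \mathrm{Tr}[K_k] \big)^s \bigg]^{\frac{1}{2s}} = \frac{\sqrt{s \|u\|_s}}{m} && \text{(lemma)}.
\end{align*}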

$L_1$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010)
Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability $1 - \delta$, the following holds for all $h \in H_1$:
\[ R(h) \leq \widehat{R}_\rho(h) + 2 \frac{\sqrt{e \lceil \log p \rceil \max_{k=1}^{p} \mathrm{Tr}[K_k]}}{m \rho} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} . \]
• weak dependency on $p$.
• bound valid for $p \gg m$.
• $\mathrm{Tr}[K_k] \leq m \max_{x} K_k(x, x)$.
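
To make the weak dependency on $p$ concrete, the sketch below evaluates the two terms added to $\widehat{R}_\rho(h)$ in the corollary for several values of $p$. It assumes the natural logarithm, $p \geq 2$, and the worst case $\max_k \mathrm{Tr}[K_k] = m R^2$ with $R^2 = \max_x K_k(x, x)$; the numeric values and function names are illustrative.

```python
import math

def l1_complexity_term(p, m, rho, max_trace):
    """2 * sqrt(e * ceil(log p) * max_k Tr[K_k]) / (m * rho)."""
    return 2.0 * math.sqrt(math.e * math.ceil(math.log(p)) * max_trace) / (m * rho)

def l1_bound_gap(p, m, rho, max_trace, delta):
    """Sum of the complexity and confidence terms of the L1 bound."""
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return l1_complexity_term(p, m, rho, max_trace) + confidence

# Assumed setting: m points, kernels with K_k(x, x) <= R^2 = 1.
m, rho, delta, R2 = 1000, 0.5, 0.05, 1.0
max_trace = m * R2  # worst case allowed by Tr[K_k] <= m max_x K_k(x, x)

gaps = {p: l1_bound_gap(p, m, rho, max_trace, delta) for p in (10, 1000, 10 ** 6)}
for p, gap in gaps.items():
    print(p, round(gap, 4))
```

Going from $p = 10$ to $p = 10^6$ base kernels multiplies the complexity term by $\sqrt{14/3}$ only, since $\lceil \log 10 \rceil = 3$ and $\lceil \log 10^6 \rceil = 14$.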

Proof
For $q = 1$, the bound holds for any integer $s \geq 1$:
\[ R(h) \leq \widehat{R}_\rho(h) + 2 \frac{\sqrt{s \|u\|_s}}{m \rho} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} , \]
with
\[ s \|u\|_s = s \Big[ \sum_{k=1}^{p} \mathrm{Tr}[K_k]^s \Big]^{\frac{1}{s}} \leq s \, p^{\frac{1}{s}} \max_{k=1}^{p} \mathrm{Tr}[K_k] . \]
The function $s \mapsto s \, p^{\frac{1}{s}}$ reaches its minimum at $s = \log p$.
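
The choice of $s$ can be examined numerically. Over real $s > 0$, the function $s \mapsto s\,p^{1/s}$ is minimized at $s = \log p$ with value $e \log p$; since $s$ must be an integer, taking $s = \lceil \log p \rceil$ gives $s\,p^{1/s} \leq e \lceil \log p \rceil$, which is the factor in the corollary. The sketch below compares that choice with the best integer found by a brute-force search (the search range and the values of $p$ are arbitrary).

```python
import math

def f(s, p):
    """The factor s * p^(1/s) appearing in the bound."""
    return s * p ** (1.0 / s)

rows = []
for p in (10, 1000, 10 ** 6):
    s_ceil = math.ceil(math.log(p))                    # choice used in the corollary
    s_best = min(range(1, 200), key=lambda s: f(s, p)) # best integer in [1, 199]
    rows.append((p, s_ceil, f(s_ceil, p), s_best, f(s_best, p),
                 math.e * math.log(p)))                # continuous minimum e * log p

for row in rows:
    print(row)
```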

Lower Bound
Tight bound:
• dependency $\sqrt{\log p}$ cannot be improved.
• argument based on VC dimension or example.
Observations: case $X = \{-1, +1\}^p$.
• canonical projection kernels $K_k(x, x') = x_k x'_k$.
• $H_1$ contains $J_p = \{ x \mapsto s x_k \colon k \in [1, p], s \in \{-1, +1\} \}$.
• $\mathrm{VCdim}(J_p) = \Omega(\log p)$.
• for $\rho = 1$ and $h \in J_p$, $\widehat{R}_\rho(h) = \widehat{R}(h)$.
• VC lower bound: $\Omega\big( \sqrt{\mathrm{VCdim}(J_p) / m} \big)$.
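
The claim $\mathrm{VCdim}(J_p) = \Omega(\log p)$ can be illustrated with an explicit shattered set. The sketch below assumes $p = 2^d$ and uses a standard construction (not taken from the slides): point $i$ has $k$-th coordinate $+1$ exactly when bit $i$ of $k$ is set, so every labeling of the $d$ points is matched by some coordinate $k$.

```python
import itertools

d = 3
p = 2 ** d

# points[i][k] = +1 if bit i of k is set, else -1 (d points in {-1,+1}^p).
points = [[1 if (k >> i) & 1 else -1 for k in range(p)] for i in range(d)]

def realizable(labels):
    """Is there h in J_p, h(x) = s * x_k, that outputs these labels on the points?"""
    for k in range(p):
        for s in (-1, 1):
            if all(s * points[i][k] == labels[i] for i in range(d)):
                return True
    return False

# The set is shattered if all 2^d labelings are realizable.
shattered = all(realizable(lab) for lab in itertools.product((-1, 1), repeat=d))
print(shattered)
```

This shows $\mathrm{VCdim}(J_p) \geq \log_2 p$ when $p$ is a power of two, consistent with the $\Omega(\log p)$ statement.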

Pseudo-Dimension Bound (Srebro and Ben-David, 2006)
Assume that for all $k \in [1, p]$, $K_k(x, x) \leq R^2$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
\[ R(h) \leq \widehat{R}_\rho(h) + \sqrt{ 8 \, \frac{ 2 + p \log \frac{128 e m^3 R^2}{\rho^2 p} + 256 \frac{R^2}{\rho^2} \log \frac{\rho e m}{8 R} \log \frac{128 m R^2}{\rho^2} + \log(1/\delta) }{m} } . \]
• bound additive in $p$ (modulo log terms).
• not informative for $p > m$.
• based on pseudo-dimension of kernel family.
• similar guarantees for other families.

Comparison
[Figure: comparison plot, $\rho / R = .2$.]
