Multi-class SVMs: From Tighter Data-Dependent Generalization Bounds to Novel Algorithms
Marius Kloft, joint work with Yunwen Lei (CU Hong Kong), Urun Dogan (Microsoft Research), and Alexander Binder (Singapore)


1. Extreme Classification
Many modern applications involve a huge number of classes.
◮ E.g., image annotation (Deng, Dong, Socher, Li, Li, and Fei-Fei, 2009)
◮ Datasets keep growing.
⇒ Need for theory and algorithms for extreme classification (multi-class classification with a huge number of classes).

2. Discrepancy Between Theory and Algorithms in Extreme Classification
◮ Algorithms can handle huge class sizes, e.g., (stochastic) dual coordinate ascent (Keerthi et al., 2008; Shalev-Shwartz and Zhang, to appear).
◮ Theory is not prepared for extreme classification: data-dependent bounds scale at least linearly with the number of classes (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Kuznetsov et al., 2014).
Questions:
◮ Can we get bounds with a mild dependence on the number of classes?
◮ What would we learn from such bounds? ⇒ Novel algorithms?

Theory

3. Multi-class Classification
Given:
◮ training data $z_1 = (x_1, y_1), \ldots, z_n = (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, drawn i.i.d. from $P$
◮ $\mathcal{Y} := \{1, 2, \ldots, c\}$, where $c$ is the number of classes
[Figure: example classes, the 20 PASCAL VOC categories: aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor]

Formal Problem Setting
Aim:
◮ define a hypothesis class $H$ of functions $h = (h_1, \ldots, h_c)$
◮ find an $h \in H$ that "predicts well" via $\hat{y} := \arg\max_{y \in \mathcal{Y}} h_y(x)$
Multi-class SVMs:
◮ $h_y(x) = \langle w_y, \phi(x) \rangle$
◮ the (multi-class) margin $\rho_h(x, y) := h_y(x) - \max_{y' \neq y} h_{y'}(x)$; the larger the margin, the better
Want: a large expected margin $\mathbb{E}\, \rho_h(X, Y)$ (see the code sketch below).
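To make the setup concrete, here is a minimal NumPy sketch of the prediction rule and the multi-class margin for a linear model with explicit features; the function name and toy data are illustrative, not from the talk:

```python
import numpy as np

def predict_and_margins(W, X, y):
    """Scores h_y(x) = <w_y, phi(x)> for a linear multi-class model.

    W: (c, d) weight matrix, one row per class; X: (n, d) features phi(x_i);
    y: (n,) labels in {0, ..., c-1}. Returns argmax predictions and the margins
    rho_h(x_i, y_i) = h_{y_i}(x_i) - max_{y' != y_i} h_{y'}(x_i).
    """
    n = len(y)
    scores = X @ W.T                      # scores[i, j] = <w_j, phi(x_i)>
    y_hat = scores.argmax(axis=1)         # prediction: argmax over classes
    true_score = scores[np.arange(n), y]  # score of the correct class
    rivals = scores.copy()
    rivals[np.arange(n), y] = -np.inf     # mask out the correct class
    margins = true_score - rivals.max(axis=1)
    return y_hat, margins                 # margin > 0 iff the point is classified correctly

# Toy usage with random data (illustrative only).
rng = np.random.default_rng(0)
W, X = rng.normal(size=(3, 4)), rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)
print(predict_and_margins(W, X, y))
```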

4. Types of Generalization Bounds for Multi-class Classification
Data-independent bounds:
◮ based on covering numbers (Guermeur, 2002; Zhang, 2004a,b; Hill and Doucet, 2007)
− conservative: unable to adapt to the data
Data-dependent bounds:
◮ based on Rademacher complexity (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013; Kuznetsov et al., 2014)
+ tighter: able to capture the real data, computable from the data

Rademacher & Gaussian Complexity
Definition. Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher variables (taking values $\pm 1$ with equal probability). The Rademacher complexity (RC) is
$$R(H) := \mathbb{E}_\sigma \sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i).$$
Definition. Let $g_1, \ldots, g_n \sim N(0, 1)$. The Gaussian complexity (GC) is
$$G(H) := \mathbb{E}_g \sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} g_i h(z_i).$$
Interpretation: RC and GC reflect the ability of the hypothesis class to correlate with random noise; both can be estimated by Monte Carlo (sketch below).
Theorem (Ledoux and Talagrand, 1991). $R(H) \le \sqrt{\pi/2}\, G(H)$, so any bound on the Gaussian complexity yields a bound on the Rademacher complexity.
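A quick Monte Carlo check of these two definitions for the unit-ball linear class $\{x \mapsto \langle w, x \rangle : \|w\|_2 \le 1\}$, where the supremum has the closed form $\|\frac{1}{n}\sum_i s_i x_i\|_2$; the class choice, helper name, and data are illustrative:

```python
import numpy as np

def empirical_complexity(X, n_mc=2000, noise="rademacher", seed=0):
    """Monte Carlo estimate of the empirical RC/GC of H = {x -> <w, x> : ||w||_2 <= 1}.

    For this class, sup_{||w|| <= 1} (1/n) sum_i s_i <w, x_i> = ||(1/n) sum_i s_i x_i||_2.
    """
    n, _ = X.shape
    rng = np.random.default_rng(seed)
    if noise == "rademacher":
        S = rng.choice([-1.0, 1.0], size=(n_mc, n))  # sigma_i in {+1, -1}
    else:
        S = rng.normal(size=(n_mc, n))               # g_i ~ N(0, 1)
    sups = np.linalg.norm(S @ X / n, axis=1)         # closed-form supremum per noise draw
    return sups.mean()                               # average over draws

X = np.random.default_rng(1).normal(size=(100, 5))
rc = empirical_complexity(X, noise="rademacher")
gc = empirical_complexity(X, noise="gaussian")
print(f"RC ~ {rc:.4f}, GC ~ {gc:.4f}, sqrt(pi/2)*GC ~ {np.sqrt(np.pi / 2) * gc:.4f}")
```

The printed values also illustrate the Ledoux-Talagrand comparison $R(H) \le \sqrt{\pi/2}\, G(H)$ numerically.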

5. Existing Data-Dependent Analysis
The key step is estimating $R(\{\rho_h : h \in H\})$, the complexity induced by the margin operator $\rho_h$ and the class $H$. Existing bounds build on the structural result
$$R\big(\{\max\{h_1, \ldots, h_c\} : h_j \in H_j,\ j = 1, \ldots, c\}\big) \le \sum_{j=1}^{c} R(H_j), \qquad (1)$$
which ignores the correlation among the class-wise components. Best known dependence on the number of classes:
◮ quadratic (Koltchinskii and Panchenko, 2002; Mohri et al., 2012; Cortes et al., 2013)
◮ linear (Kuznetsov et al., 2014)
Can we do better?

A New Structural Lemma on Gaussian Complexities
We consider the Gaussian complexity. Let $H$ be a vector-valued function class and $g_{11}, \ldots, g_{nc} \sim N(0, 1)$. We show:
$$G\big(\{\max\{h_1, \ldots, h_c\} : h = (h_1, \ldots, h_c) \in H\}\big) \le \frac{1}{n}\, \mathbb{E}_g \sup_{h = (h_1, \ldots, h_c) \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{ij} h_j(x_i). \qquad (2)$$
Core idea: a comparison inequality for Gaussian processes (Slepian, 1962). For every $h \in H$, define
$$X_h := \sum_{i=1}^{n} g_i \max\{h_1(x_i), \ldots, h_c(x_i)\}, \qquad Y_h := \sum_{i=1}^{n} \sum_{j=1}^{c} g_{ij} h_j(x_i).$$
If $\mathbb{E}[(X_\theta - X_{\bar\theta})^2] \le \mathbb{E}[(Y_\theta - Y_{\bar\theta})^2]$ for all $\theta, \bar\theta \in \Theta$, then $\mathbb{E}[\sup_{\theta \in \Theta} X_\theta] \le \mathbb{E}[\sup_{\theta \in \Theta} Y_\theta]$.
Eq. (2) preserves the coupling among the class-wise components!
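The increment condition can be verified directly; the following derivation is a sketch of the standard argument, assuming the $g_i$ and $g_{ij}$ are i.i.d. standard normal:
$$\begin{aligned} \mathbb{E}\big[(X_h - X_{h'})^2\big] &= \sum_{i=1}^{n} \Big(\max_j h_j(x_i) - \max_j h'_j(x_i)\Big)^2 && \text{(independence of the } g_i\text{)} \\ &\le \sum_{i=1}^{n} \max_j \big(h_j(x_i) - h'_j(x_i)\big)^2 && \text{(max is 1-Lipschitz w.r.t. } \ell_\infty\text{)} \\ &\le \sum_{i=1}^{n} \sum_{j=1}^{c} \big(h_j(x_i) - h'_j(x_i)\big)^2 = \mathbb{E}\big[(Y_h - Y_{h'})^2\big], \end{aligned}$$
so the comparison theorem applies and yields (2).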

6. Example Comparing the Structural Results
◮ Consider $H := \{(x_1, x_2) \mapsto (h_1, h_2)(x_1, x_2) = (w_1 x_1, w_2 x_2) : \|(w_1, w_2)\|_2 \le 1\}$.
◮ For the function class $\{\max\{h_1, h_2\} : h = (h_1, h_2) \in H\}$, the classical result (1) decouples the components,
$$\sup_{(h_1, h_2) \in H} \sum_{i=1}^{n} \sigma_i h_1(x_i) + \sup_{(h_1, h_2) \in H} \sum_{i=1}^{n} \sigma_i h_2(x_i),$$
whereas the new lemma (2) keeps a single joint supremum,
$$\sup_{(h_1, h_2) \in H} \sum_{i=1}^{n} \big[ g_{i1} h_1(x_i) + g_{i2} h_2(x_i) \big].$$
Preserving the coupling means taking the supremum over a smaller space! (A numerical comparison is sketched below.)

Estimating the Multi-class Gaussian Complexity
◮ Consider the vector-valued function class $H := \{h^w = (\langle w_1, \phi(x) \rangle, \ldots, \langle w_c, \phi(x) \rangle) : f(w) \le \Lambda\}$, where $f$ is $\beta$-strongly convex w.r.t. $\|\cdot\|$:
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) - \frac{\beta}{2} \alpha (1 - \alpha) \|x - y\|^2.$$
Theorem.
$$\frac{1}{n}\, \mathbb{E}_g \sup_{h^w \in H} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{ij} h^w_j(x_i) \le \frac{1}{n} \sqrt{\frac{2 \pi \Lambda}{\beta}} \Bigg( \mathbb{E}_g \Big\| \Big( \sum_{i=1}^{n} g_{ij} \phi(x_i) \Big)_{j=1}^{c} \Big\|_*^2 \Bigg)^{1/2}, \qquad (3)$$
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$.
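A quick Monte Carlo illustration of the gap on this toy class. For comparability, Gaussian noise is used on both sides (the classical result is usually stated with Rademacher variables), and all data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_mc = 50, 5000
X = rng.normal(size=(n, 2))          # points (x_{i1}, x_{i2})

dec, coup = 0.0, 0.0
for _ in range(n_mc):
    g = rng.normal(size=(n, 2))      # one independent draw per (i, j)
    a = g[:, 0] @ X[:, 0]            # sum_i g_{i1} x_{i1}
    b = g[:, 1] @ X[:, 1]            # sum_i g_{i2} x_{i2}
    # Decoupled (old structural result): separate suprema per component;
    # sup_{||w||_2 <= 1} w_1 * a = |a|, and likewise |b|.
    dec += abs(a) + abs(b)
    # Coupled (new lemma): one joint supremum over ||(w_1, w_2)||_2 <= 1,
    # attained at w proportional to (a, b), giving the Euclidean norm.
    coup += np.hypot(a, b)

print(f"decoupled ~ {dec / n_mc:.2f}, coupled ~ {coup / n_mc:.2f}")
# The coupled value is never larger, since ||(a, b)||_2 <= |a| + |b|.
```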

7. Features of the Complexity Bound
◮ Applies to a general function class defined through a strongly convex regularizer $f$.
◮ The class-wise components $h_1, \ldots, h_c$ are correlated through the term $\mathbb{E}_g \big\| \big( \sum_{i} g_{ij} \phi(x_i) \big)_{j=1}^{c} \big\|_*^2$.
◮ Consider the class $H_{p, \Lambda} := \{h^w : \|w\|_{2,p} \le \Lambda\}$ with $\frac{1}{p} + \frac{1}{p^*} = 1$; then
$$\frac{1}{n}\, \mathbb{E}_g \sup_{h^w \in H_{p,\Lambda}} \sum_{i=1}^{n} \sum_{j=1}^{c} g_{ij} h^w_j(x_i) \le \frac{\Lambda}{n} \sqrt{\sum_{i=1}^{n} k(x_i, x_i)} \times \begin{cases} \sqrt{e}\, (4 \log c)^{\frac{1}{2} + \frac{1}{2 \log c}}, & \text{if } p^* \ge 2 \log c, \\ \sqrt{2 p^*}\, c^{1/p^*}, & \text{otherwise.} \end{cases}$$
The dependence on the number of classes is sublinear for $1 \le p \le 2$, and even logarithmic as $p$ approaches 1! (The growth of this factor is illustrated below.)
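To see these growth rates concretely, the following script tabulates the class-dependent factor for a few values of $p$ and $c$; the constants follow the reconstruction of the bound above, so treat the absolute numbers as indicative:

```python
import numpy as np

def class_factor(c, p):
    """Dependence on the number of classes c in the GC bound for H_{p, Lambda}
    (two-case factor from the theorem; constants as reconstructed above)."""
    p_star = p / (p - 1) if p > 1 else np.inf   # conjugate exponent of p
    if p_star >= 2 * np.log(c):
        return np.sqrt(np.e) * (4 * np.log(c)) ** (0.5 + 1 / (2 * np.log(c)))
    return np.sqrt(2 * p_star) * c ** (1 / p_star)

for c in [10, 1_000, 100_000]:
    row = ", ".join(f"p={p}: {class_factor(c, p):8.1f}" for p in [1.01, 1.5, 2.0])
    print(f"c={c:>7} -> {row}")
# As p -> 1 the factor grows only logarithmically in c,
# while p = 2 gives roughly sqrt(c) growth.
```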

Algorithms

8. ℓp-norm Multi-class SVM
Motivated by the mild dependence on $c$ as $p \to 1$, we consider the ($\ell_p$-norm) multi-class SVM, $1 \le p \le 2$:
$$\min_w \; \frac{1}{2} \Big( \sum_{j=1}^{c} \|w_j\|_2^p \Big)^{2/p} + C \sum_{i=1}^{n} (1 - t_i)_+ \qquad \text{(P)}$$
$$\text{s.t.} \quad t_i = \langle w_{y_i}, \phi(x_i) \rangle - \max_{y \ne y_i} \langle w_y, \phi(x_i) \rangle.$$
Dual Problem.
$$\sup_{\alpha \in \mathbb{R}^{n \times c}} \; -\frac{1}{2} \Big( \sum_{j=1}^{c} \Big\| \sum_{i=1}^{n} \alpha_{ij} \phi(x_i) \Big\|_2^{\frac{p}{p-1}} \Big)^{\frac{2(p-1)}{p}} + \sum_{i=1}^{n} \alpha_{i y_i} \qquad \text{(D)}$$
$$\text{s.t.} \quad \alpha_i \le e_{y_i} \cdot C \;\wedge\; \alpha_i \cdot \mathbf{1} = 0, \quad \forall i = 1, \ldots, n.$$
(D) is not quadratic if $p \ne 2$; how to optimize?

Equivalent Formulation
We introduce class weights $\beta_1, \ldots, \beta_c$ to get a quadratic dual. The identity
$$\min_{\beta \ge 0,\, \|\beta\|_{\bar p} \le 1} \sum_{j=1}^{c} \frac{\|w_j\|_2^2}{\beta_j} = \Big( \sum_{j=1}^{c} \|w_j\|_2^p \Big)^{2/p}$$
has its optimum at $\beta_j \propto \|w_j\|_2^{2-p}$.
Equivalent Problem.
$$\min_{w, \beta} \; \sum_{j=1}^{c} \frac{\|w_j\|_2^2}{2 \beta_j} + C \sum_{i=1}^{n} (1 - t_i)_+ \qquad \text{(E)}$$
$$\text{s.t.} \quad t_i \le \langle w_{y_i}, \phi(x_i) \rangle - \langle w_y, \phi(x_i) \rangle, \;\; y \ne y_i, \; i = 1, \ldots, n; \quad \|\beta\|_{\bar p} \le 1, \;\; \bar p = p (2 - p)^{-1}, \;\; \beta_j \ge 0.$$
Alternating optimization w.r.t. $\beta$ and w.r.t. $w$ (a sketch follows below).
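A minimal sketch of the alternating scheme for a linear kernel, assuming $1 \le p < 2$ ($p = 2$ reduces to the standard quadratic case). The inner $w$-step here is plain subgradient descent for brevity; the talk's approach would instead use a (stochastic) dual coordinate ascent solver on the now-quadratic dual. All names and toy data are illustrative:

```python
import numpy as np

def lp_mcsvm(X, y, p=1.5, C=1.0, outer=10, inner=200, lr=0.01, seed=0):
    """Alternating optimization for the l_p-norm multi-class SVM (linear kernel).

    Outer step: closed-form update beta_j ~ ||w_j||^(2-p), normalized so that
    ||beta||_{p_bar} = 1 with p_bar = p / (2 - p).
    Inner step: for fixed beta, minimize
        sum_j ||w_j||^2 / (2 beta_j) + C * sum_i max(0, 1 - t_i)
    by subgradient descent.
    """
    n, d = X.shape
    c = y.max() + 1
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(c, d))
    p_bar = p / (2 - p)
    beta = np.full(c, c ** (-1 / p_bar))        # uniform, ||beta||_{p_bar} = 1

    for _ in range(outer):
        for _ in range(inner):                  # inner: subgradient step in W
            scores = X @ W.T
            rivals = scores.copy()
            rivals[np.arange(n), y] = -np.inf
            y_run = rivals.argmax(axis=1)       # strongest competing class
            t = scores[np.arange(n), y] - rivals[np.arange(n), y_run]
            grad = W / beta[:, None]            # gradient of sum_j ||w_j||^2 / (2 beta_j)
            for i in np.where(t < 1)[0]:        # margin violators drive the hinge term
                grad[y[i]] -= C * X[i]
                grad[y_run[i]] += C * X[i]
            W -= lr * grad
        norms = np.linalg.norm(W, axis=1) + 1e-12
        beta = norms ** (2 - p)                 # outer: closed-form beta update
        beta /= np.linalg.norm(beta, ord=p_bar) # normalize onto ||beta||_{p_bar} = 1
    return W

# Toy usage on three random blobs (hypothetical data, just for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in (-2, 0, 2)])
y = np.repeat([0, 1, 2], 30)
W = lp_mcsvm(X, y, p=1.2)
print("train accuracy:", ((X @ W.T).argmax(1) == y).mean())
```

The outer step is exactly the closed-form minimizer from the identity above, so each alternation can only decrease the objective of (E).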
