Multi-class to Binary Reduction of Large-scale Classification Problems


  1. 1/21 Multi-class to Binary Reduction of Large-scale Classification Problems. Bikash Joshi. Joint work with Massih-Reza Amini, Ioannis Partalas, Liva Ralaivola, Nicolas Usunier and Eric Gaussier. BigTargets ECML 2015 workshop, September 11, 2015

  2. 2/21 Outline ❑ Motivation ❑ Learning objective and reduction strategy ❑ Experimental results ❑ Conclusion

  3. 3/21 Multiclass classification: emerging problems ❑ The number of classes, $K$, in newly emerging multiclass problems, for example in text and image classification, may reach $10^5$ to $10^6$ categories. ❑ For example:

  4. 4/21 Large-scale classification: power-law distribution of classes

  Collection   K      d
  DMOZ         7500   594158

  [Figure: histogram of the number of classes per document-count bin (2-5, 6-10, 11-30, 31-100, 101-200, >200) for DMOZ-7500; most classes contain only a handful of documents.]

  5. 5/21 Multiclass classification approaches ❑ Uncombined approaches, e.g. MSVM or MLP: the number of parameters, $M$, is at least $O(K \times d)$. ❑ Combined approaches based on binary classification: One-vs-One, $M \ge O(K^2 \times d)$; One-vs-Rest, $M \ge O(K \times d)$. ❑ For $K \gg 1$ and $d \gg 1$, traditional approaches do not scale (see the sketch below).
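
To make these orders of magnitude concrete, here is a back-of-the-envelope computation (my own illustration, not from the slides), using the DMOZ statistics given later in the experiments and an assumed 10-dimensional joint representation:

```python
# Illustrative parameter counts for a DMOZ-sized problem. K and d are taken
# from the experiments section; the joint-feature dimension p = 10 is an
# assumption based on the feature-representation slide later in the deck.
K, d = 7500, 594158   # number of classes and feature dimension
p = 10                # dimension of the joint representation Phi(x, y)

print(f"One-vs-Rest / MSVM: {K * d:.2e} parameters")                 # O(K x d)
print(f"One-vs-One:         {K * (K - 1) // 2 * d:.2e} parameters")  # O(K^2 x d)
print(f"Binary reduction:   {p:.2e} parameters")                     # independent of K and d
```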

  6. 6/21 Outline ❑ Motivation ❑ Learning objective and reduction strategy ❑ Experimental results ❑ Conclusion

  7. 7/21 Learning objective ❑ Setting: large-scale multiclass classification. ❑ Hypothesis: observations $x^y = (x, y) \in \mathcal{X} \times \mathcal{Y}$ are i.i.d. with respect to a distribution $\mathcal{D}$. ❑ For a class of functions $\mathcal{H} = \{h : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}\}$, define the instantaneous ranking loss of $h \in \mathcal{H}$ over an example $x^y$ by: $$e(h, x^y) = \frac{1}{K-1} \sum_{y' \in \mathcal{Y} \setminus \{y\}} \mathbb{1}_{h(x^y) \le h(x^{y'})}$$ ❑ The aim is to find a function $h \in \mathcal{H}$ that minimizes the generalization error $L(h) = \mathbb{E}_{x^y \sim \mathcal{D}}[e(h, x^y)]$. ❑ The empirical error of a function $h \in \mathcal{H}$ over a training set $S = (x_i^{y_i})_{i=1}^m$ is $$\hat{L}_m(h, S) = \frac{1}{m} \sum_{i=1}^m e(h, x_i^{y_i})$$
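
A minimal sketch of this ranking loss (my own illustration, not the authors' code), where scores[k] plays the role of $h(x^k)$:

```python
import numpy as np

def ranking_loss(scores: np.ndarray, y: int) -> float:
    """Instantaneous ranking loss e(h, x^y): the fraction of wrong classes y'
    whose score h(x^{y'}) is at least the true class's score h(x^y)."""
    K = len(scores)
    wrong = np.delete(scores, y)                 # scores of all classes y' != y
    return float(np.sum(scores[y] <= wrong)) / (K - 1)

scores = np.array([0.9, 1.2, 0.3, 1.2])          # hypothetical scores, K = 4
print(ranking_loss(scores, y=1))                 # 1/3: one wrong class ties the true one
```

The empirical error is then simply the mean of this loss over the training sample.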

  8. 8/21 Reduction strategy ❑ Consider the empirical loss $$\hat{L}_m(h, S) = \frac{1}{m(K-1)} \sum_{i=1}^m \sum_{y' \in \mathcal{Y} \setminus \{y_i\}} \mathbb{1}_{h(x_i^{y_i}) \le h(x_i^{y'})} = \underbrace{\frac{1}{n} \sum_{i=1}^n \mathbb{1}_{\tilde{y}_i g(Z_i) \le 0}}_{\hat{L}_n^T(g, T(S))}$$ where $n = m(K-1)$; $Z_i$ is a pair of couples, constituted by the couple of an example and its class together with the couple of the same example and another class; $\tilde{y}_i = 1$ if the first couple in $Z_i$ is the true couple and $-1$ otherwise; and $g(x^y, x^{y'}) = h(x^y) - h(x^{y'})$. A sketch of this transformation follows.
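
The transformation $T$ is simple to spell out; in the following sketch (my own notation, not the authors' code) couples are kept symbolic rather than mapped through $\Phi$:

```python
def transform(S, K):
    """Map S = [(x_1, y_1), ..., (x_m, y_m)] to the n = m(K-1) binary
    examples (Z_i, y~_i). Each Z_i pairs the true couple (x, y) with a
    wrong couple (x, y'); alternating the order of the two couples
    produces both labels +1 and -1."""
    T = []
    for x, y in S:
        for j, y_prime in enumerate(c for c in range(K) if c != y):
            if j % 2 == 0:
                T.append((((x, y), (x, y_prime)), +1))   # true couple first
            else:
                T.append((((x, y_prime), (x, y)), -1))   # true couple second
    return T

S = [("doc_a", 0), ("doc_b", 2)]                 # toy sample with K = 3 classes
print(len(transform(S, K=3)))                    # 4 = m * (K - 1)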

  9. 9/21 Reduction strategy for the class of linear functions. Problems: ❑ how to define $\Phi(x^y)$, ❑ consistency of the ERM principle with interdependent data.

  10. 10/21 Consistency of the ERM principle with interdependent data ❑ Different statistical tools exist for extending concentration inequalities to the case of interdependent data; ❑ here, tools based on colorable graphs proposed by (Janson, 2004)¹ are used. [Figure: for $m = 3$ examples $S = (x_1^1, x_2^2, x_3^3)$, the transformed set $T(S)$ contains the six pairs $(x_1^1, x_1^2)$, $(x_1^1, x_1^3)$, $(x_2^2, x_2^1)$, $(x_2^2, x_2^3)$, $(x_3^3, x_3^1)$, $(x_3^3, x_3^2)$, which are split into two independent sets: $(C_1, \alpha_1 = 1) = \{(x_1^1, x_1^2), (x_2^2, x_2^1), (x_3^3, x_3^1)\}$ and $(C_2, \alpha_2 = 1) = \{(x_1^1, x_1^3), (x_2^2, x_2^3), (x_3^3, x_3^2)\}$.] 1. S. Janson. Large deviations for sums of partly dependent random variables. Random Structures and Algorithms, 24(3):234-248, 2004.
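
The cover in this example generalizes directly. Here is a sketch of the construction (my own rendering, assuming T lists the $K-1$ pairs of each example consecutively, as the transform sketch above does):

```python
def fractional_cover(T, K):
    """Split the n = m(K-1) binary examples into K-1 sets C_j, the j-th set
    taking the j-th pair generated from each original example. Within a set,
    pairs are built from distinct i.i.d. examples, so each C_j is an
    independent set; all weights are alpha_j = 1."""
    covers = [[] for _ in range(K - 1)]
    for idx, z in enumerate(T):
        covers[idx % (K - 1)].append(z)
    return covers

# With m = 3 and K = 3 this reproduces the slide's two sets
# (C_1, alpha_1 = 1) and (C_2, alpha_2 = 1).
```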

  11. 11/21 Theorem. Let $S = (x_i^{y_i})_{i=1}^m \in (\mathcal{X} \times \mathcal{Y})^m$ be a training set constituted of $m$ examples generated i.i.d. with respect to a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, and $T(S) = ((Z_i, \tilde{y}_i))_{i=1}^n \in (\mathcal{Z} \times \{-1, 1\})^n$ the transformed set obtained with the transformation $T$. Let $\kappa : \mathcal{Z} \to \mathbb{R}$ be a PSD kernel, and $\Phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ the associated mapping function. For all $\delta \in (0, 1)$ and all $g_w \in \mathcal{G}_B = \{x \mapsto \langle w, \Phi(x) \rangle \mid \|w\| \le B\}$, with probability at least $1 - \delta$ over $T(S)$ we then have: $$L^T(g_w) \le \hat{\epsilon}_n^T(g_w, T(S)) + \frac{2B\sqrt{K-1}}{n}\, G(T(S)) + 3\sqrt{\frac{\ln\frac{2}{\delta}}{2m}} \qquad (1)$$ where $\hat{\epsilon}_n^T(g_w, T(S)) = \frac{1}{n} \sum_{i=1}^n \mathcal{L}(\tilde{y}_i g_w(Z_i))$ with the surrogate Hinge loss $\mathcal{L} : t \mapsto \min(1, \max(1 - t, 0))$, $L^T(g_w) = \mathbb{E}_{T(S)}[\hat{L}_n^T(g_w, T(S))]$, and $G(T(S)) = \sqrt{\sum_{i=1}^n d_\kappa(Z_i)}$ with $d_\kappa(x^y, x^{y'}) = \kappa(x^y, x^y) + \kappa(x^{y'}, x^{y'}) - 2\kappa(x^y, x^{y'})$.
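
A small sketch (my own helper code, assuming a linear kernel; not the authors' implementation) of the data-dependent quantities appearing in bound (1):

```python
import numpy as np

# Each Z_i is represented here by the feature vectors of its two couples,
# Phi(x^y) (rows of phi_true) and Phi(x^{y'}) (rows of phi_wrong).

def surrogate(t):
    # truncated hinge loss L(t) = min(1, max(1 - t, 0))
    return np.minimum(1.0, np.maximum(1.0 - t, 0.0))

def empirical_risk(w, phi_true, phi_wrong, y_tilde):
    # epsilon^T_n(g_w, T(S)): mean surrogate loss over the margins
    # y~_i g_w(Z_i), with g_w(Z_i) = <w, Phi(x^y)> - <w, Phi(x^{y'})>
    margins = y_tilde * ((phi_true - phi_wrong) @ w)
    return float(surrogate(margins).mean())

def G(phi_true, phi_wrong):
    # G(T(S)) = sqrt(sum_i d_kappa(Z_i)); for a linear kernel,
    # d_kappa(x^y, x^{y'}) = ||Phi(x^y) - Phi(x^{y'})||^2
    return float(np.sqrt(np.sum((phi_true - phi_wrong) ** 2)))
```

When the features are bounded independently of $d$, $G(T(S))$ grows like $\sqrt{n}$, which is exactly what the next slide exploits.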

  12. 12/21 Key features of the algorithm ❑ Data-dependent bound: if the feature representation of $(x, y)$ pairs is independent of the original dimension, then $G(T(S)) \le \sqrt{n} \times \text{Constant} = \sqrt{m(K-1)} \times \text{Constant}$, so the complexity term of (1) is in $O(1/\sqrt{m})$. ❑ Non-trivial joint feature representation (example-class pairs). ❑ Same complexity for any number of classes. ❑ Same parameter vector for all classes.

  13. 13/21 Outline ❑ Motivation ❑ Learning objective and reduction strategy ❑ Experimental results ❑ Conclusion

  14. 14/21 Feature representation $\Phi(x^y)$. Features: 1. $\sum_{t \in y \cap x} \ln(1 + y_t)$; 2. $\sum_{t \in y \cap x} \ln\left(1 + \frac{l_S}{S_t}\right)$; 3. $\sum_{t \in y \cap x} \ln\left(1 + \frac{y_t}{|y|}\right)$; 4. $\sum_{t \in y \cap x} I_t$; 5. $\sum_{t \in y \cap x} \ln\left(1 + \frac{y_t}{|y|} \cdot I_t\right)$; 6. $\sum_{t \in y \cap x} \ln\left(1 + \frac{y_t}{|y|} \cdot \frac{l_S}{S_t}\right)$; 7. $\sum_{t \in y \cap x} 1$; 8. $\sum_{t \in y \cap x} \frac{y_t}{|y|} \cdot I_t$; 9. $d_1(x^y)$; 10. $d_2(x^y)$. Notation: ❑ $x_t$: number of occurrences of term $t$ in document $x$; ❑ $\mathcal{V}$: number of distinct terms in $S$; ❑ $y_t = \sum_{x \in y} x_t$, $|y| = \sum_{t \in \mathcal{V}} y_t$, $S_t = \sum_{x \in \mathcal{S}} x_t$, $l_S = \sum_{t \in \mathcal{V}} S_t$; ❑ $I_t$: idf of term $t$.
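
A sketch of how features 1-8 of this table can be computed (my own implementation, not the authors' code; features 9-10, $d_1(x^y)$ and $d_2(x^y)$, are omitted):

```python
import math

def phi(x, y_counts, S_t, l_S, idf):
    """Features 1-8 of Phi(x^y). `x` and `y_counts` map terms to counts in
    the document and in class y respectively; `S_t`, `l_S` and `idf` hold
    the corpus statistics defined on the slide."""
    common = set(x) & set(y_counts)              # terms t in y ∩ x
    y_size = sum(y_counts.values())              # |y|
    f = [0.0] * 8
    for t in common:
        r = y_counts[t] / y_size                 # y_t / |y|
        f[0] += math.log1p(y_counts[t])          # feature 1
        f[1] += math.log1p(l_S / S_t[t])         # feature 2
        f[2] += math.log1p(r)                    # feature 3
        f[3] += idf[t]                           # feature 4
        f[4] += math.log1p(r * idf[t])           # feature 5
        f[5] += math.log1p(r * l_S / S_t[t])     # feature 6
        f[6] += 1.0                              # feature 7
        f[7] += r * idf[t]                       # feature 8
    return f
```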

  15. 15/21 Experimental results on text classification

  Collection   K      d        m        Test size
  DMOZ         7500   594158   394756   104263
  WIKIPEDIA    7500   346299   456886   81262

  $K \times d = O(10^9)$. ❑ Random samples of 100, 500, 1000, 3000, 5000 and 7500 classes.

  16. 16/21 Experimental setup. Implementation and comparison: ❑ SVM with a linear kernel as the binary classification algorithm; ❑ value of C chosen by cross-validation; ❑ comparison with OVA, OVO, M-SVM and LogT. Performance evaluation: ❑ Accuracy: proportion of correctly classified examples in the test set; ❑ Macro F-measure: harmonic mean of precision and recall, averaged over classes. A sketch of this protocol is given below.
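
A sketch of the protocol as I read it from the slides (not the authors' code): one binary linear SVM is trained on the transformed pairs, then classes are ranked by $h(x^y) = \langle w, \Phi(x^y) \rangle$. Synthetic data stands in for a real corpus:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 10))                  # difference vectors Phi(x^y) - Phi(x^{y'})
y_tilde = np.where(Z @ rng.normal(size=10) > 0, 1, -1)   # their {-1, +1} labels

clf = LinearSVC(C=1.0)                           # C tuned by cross-validation in practice
clf.fit(Z, y_tilde)                              # one weight vector w, shared by all classes
w = clf.coef_.ravel()

def predict(x, classes, phi):
    # rank classes by h(x^y) = <w, Phi(x^y)> and return the top-scoring one;
    # `phi` is the joint feature map sketched earlier
    return max(classes, key=lambda y: float(np.dot(phi(x, y), w)))
```

Note the design point from slide 12/21: the model size is the dimension of $\Phi$, not $K \times d$.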

  17. 17/21 Experimental results. Results for 7500 classes: ❑ OVO and M-SVM did not scale to 7500 classes; ❑ $N_c$: proportion of classes for which at least one true-positive (TP) document is found; ❑ mRb covers 6-9.5% more classes than OVA (500-700 classes).
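
The coverage statistic $N_c$ translates directly into code; a minimal sketch, with hypothetical array arguments:

```python
import numpy as np

def coverage(y_true, y_pred, K):
    """N_c: proportion of the K classes for which at least one test document
    is correctly predicted (i.e. the class has at least one true positive)."""
    hit = np.unique(y_true[y_true == y_pred])    # classes with >= 1 correct prediction
    return len(hit) / K
```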

  18. 18/21 # of Classes vs. Macro F-Measure (figure)

  19. 19/21 # of Classes vs. Macro F-Measure (figure)

  20. 20/21 Conclusion ❑ A new method for large-scale multiclass classification, based on reducing multiclass classification to binary classification. ❑ The efficiency of the derived algorithm is comparable to or better than state-of-the-art multiclass classification approaches.
