Decision Rule-based Algorithm for Ordinal Classification based on Rank Loss Minimization



  1. Decision Rule-based Algorithm for Ordinal Classification based on Rank Loss Minimization. Krzysztof Dembczyński (1,2), Wojciech Kotłowski (1,3). (1) Institute of Computing Science, Poznań University of Technology; (2) KEBI, Philipps-Universität Marburg; (3) Centrum Wiskunde & Informatica, Amsterdam. PL-09, Bled, September 11, 2009

  2. 1 Ordinal Classification 2 RankRules 3 Conclusions

  3. 1 Ordinal Classification 2 RankRules 3 Conclusions

  4. Ordinal classification consists in predicting a label taken from a finite and ordered set for an object described by some attributes. This problem shares some characteristics of multi-class classification and regression, but: • the order between class labels cannot be neglected, • the scale of the decision attribute is not cardinal.

  5. Recommender system predicting a rating of a movie for a given user.

  6. Email filtering to ordered groups like: important, normal, later, or spam.

  7. Notation: • K – number of classes • y – actual label • x – attributes • ŷ – predicted label • F(x) – prediction function • f(x) – ranking or utility function • θ = (θ_0, …, θ_K) – thresholds • L(·) – loss function • ⟦·⟧ – Boolean test • {y_i, x_i}_{i=1}^N – training examples

  8. Ordinal Classification: • Since y is discrete, it obeys a multinomial distribution for a given x: p_k(x) = Pr(y = k | x), k = 1, …, K. • The optimal prediction is clearly given by: y* = F*(x) = arg min_{F(x)} Σ_{k=1}^{K} p_k(x) L(k, F(x)), where L(y, ŷ) is the loss function defined as a matrix L(y, ŷ) = (l_{y,ŷ})_{K×K} with v-shaped rows and zeros on the diagonal, e.g. for K = 3: L = [[0, 1, 2], [1, 0, 1], [2, 1, 0]].

  9. Ordinal Classification: • A natural choice of the loss matrix is the absolute-error loss, for which l_{y,ŷ} = |y − ŷ|. • The optimal prediction in this case is the median of the class distribution: F*(x) = median_{p_k(x)}(y). • The median does not depend on a distance between class labels, so the scale of the decision attribute does not matter; only the order of labels is taken into account.
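As a quick numeric illustration (toy numbers and numpy, not from the talk), the optimal prediction can be read off directly from the class distribution and the loss matrix; for the absolute-error loss it coincides with the median:

```python
import numpy as np

# V-shaped loss matrix for K = 3 classes (absolute-error loss).
L = np.array([[0, 1, 2],
              [1, 0, 1],
              [2, 1, 0]])

def optimal_prediction(p, L):
    # Expected loss of predicting class j is sum_k p_k * L[k, j]; pick the minimizer.
    expected = p @ L
    return int(np.argmin(expected)) + 1   # classes numbered 1..K

p = np.array([0.1, 0.3, 0.6])   # hypothetical p_k(x)
print(optimal_prediction(p, L))  # -> 3, the median of this class distribution
```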

  10. Two Approaches to Ordinal Classification: • Threshold Loss Minimization (SVOR, ORBoost-All, MMMF), • Rank Loss Minimization (RankSVM, RankBoost). In both approaches, one assumes the existence of: • a ranking (or utility) function f(x), and • consecutive thresholds θ = (θ_0, …, θ_K) on the range of the ranking function, and the final prediction is given by: F(x) = Σ_{k=1}^{K} k · ⟦f(x) ∈ [θ_{k−1}, θ_k)⟧.
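A minimal sketch (illustrative threshold values only) of turning a ranking score into a class label with consecutive thresholds:

```python
import numpy as np

def predict_class(score, thetas):
    """Map f(x) to a label via F(x) = sum_k k * [f(x) in [theta_{k-1}, theta_k)].
    `thetas` holds the finite thresholds theta_1..theta_{K-1}; theta_0 = -inf
    and theta_K = +inf are implicit."""
    # np.searchsorted counts how many thresholds lie at or below the score,
    # which matches the left-closed intervals [theta_{k-1}, theta_k).
    return int(np.searchsorted(thetas, score, side="right")) + 1

thetas = [-3.5, -1.2, 1.2, 3.8]   # hypothetical thresholds for K = 5 classes
print([predict_class(s, thetas) for s in (-4.0, 0.0, 2.0, 5.0)])  # [1, 3, 4, 5]
```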

  11. Threshold Loss Minimization: • The threshold loss function is defined by: L(y, f(x), θ) = Σ_{k=1}^{K−1} ⟦y_k (f(x) − θ_k) ≤ 0⟧, where y_k = 1 if y > k, and y_k = −1 otherwise. [Figure: the range of f(x) divided into K intervals by consecutive thresholds θ_0 = −∞ < θ_1 < … < θ_{K−1} < θ_K = ∞ (example values: −3.5, −1.2, 1.2, 3.8).]
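For concreteness, a small sketch (plain Python, toy values) of evaluating this 0/1 threshold loss for a single example:

```python
def threshold_loss(y, score, thetas):
    # L(y, f(x), theta) = sum_{k=1}^{K-1} [ y_k * (f(x) - theta_k) <= 0 ],
    # with y_k = +1 if y > k and y_k = -1 otherwise.
    # `thetas` holds the finite thresholds theta_1..theta_{K-1}.
    loss = 0
    for k, theta_k in enumerate(thetas, start=1):
        y_k = 1 if y > k else -1
        loss += int(y_k * (score - theta_k) <= 0)
    return loss

# Label 4 of 5 with hypothetical thresholds; a score of 0.0 sits below
# theta_3 = 1.2, so the example is on the wrong side of one threshold.
print(threshold_loss(4, 0.0, [-3.5, -1.2, 1.2, 3.8]))  # 1
```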

  12. Rank Loss Minimization: • The rank loss function is defined over pairs of objects: L(y_{◦•}, f(x_◦), f(x_•)) = ⟦y_{◦•} (f(x_◦) − f(x_•)) ≤ 0⟧, where y_{◦•} = sgn(y_◦ − y_•). • Thresholds are computed afterwards with respect to a given loss matrix. [Illustration: objects sorted by label, y_{i_1} > y_{i_2} > y_{i_3} > … > y_{i_N}, compared with the order induced by the ranking function, f(x_{i_1}) > f(x_{i_3}) > f(x_{i_2}) > … > f(x_{i_N}).]
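And a matching sketch (toy data) of the empirical rank loss summed over all pairs with different labels:

```python
import numpy as np

def rank_loss(labels, scores):
    # Sum of [ (f(x_i) - f(x_j)) <= 0 ] over ordered pairs with y_i > y_j,
    # i.e. each differently-labelled pair is counted once.
    labels, scores = np.asarray(labels), np.asarray(scores)
    loss = 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                loss += int(scores[i] - scores[j] <= 0)
    return loss

# One mis-ranked pair: the label-2 object scores above the label-3 object.
print(rank_loss(labels=[3, 2, 1], scores=[0.4, 0.9, -0.5]))  # 1
```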

  13. Comparison of the two approaches: Threshold loss: • An object is compared to the thresholds instead of to all other training objects. • A weighted threshold loss can approximate any loss matrix. Rank loss: • Minimization of the rank loss on the training set has quadratic complexity with respect to the number of objects; however, in the case of K ordered classes, the algorithm can work in linear time. • Rank loss minimization is closely related to maximization of the AUC criterion.

  14. 1 Ordinal Classification 2 RankRules 3 Conclusions

  15. RankRules: • The ranking function is an ensemble of decision rules: f(x) = Σ_{m=1}^{M} r_m(x), where r_m(x) = α_m Φ_m(x) is a decision rule defined by a response α_m ∈ R and an axis-parallel region in attribute space, Φ_m(x) ∈ {0, 1}. • A decision rule can be seen as a logical pattern: if [condition] then [decision].
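To make the rule-ensemble form concrete, here is a minimal sketch (not the authors' implementation; the conditions and responses are made up) of evaluating a ranking function built from axis-parallel rules:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Rule:
    # Axis-parallel region: attribute index -> (lower, upper) bounds.
    conditions: Dict[int, Tuple[float, float]]
    alpha: float  # response added to f(x) when the rule covers x

    def covers(self, x) -> bool:
        # Phi(x) = 1 iff every condition holds for x.
        return all(lo <= x[j] <= hi for j, (lo, hi) in self.conditions.items())

def ranking_function(x, rules) -> float:
    # f(x) = sum_m alpha_m * Phi_m(x)
    return sum(r.alpha for r in rules if r.covers(x))

# Hypothetical ensemble of two rules over two attributes.
rules = [
    Rule(conditions={0: (0.5, float("inf"))}, alpha=0.8),                   # if x_0 >= 0.5 then +0.8
    Rule(conditions={0: (-float("inf"), 0.5), 1: (2.0, 5.0)}, alpha=-0.4),  # if x_0 <= 0.5 and 2 <= x_1 <= 5 then -0.4
]
print(ranking_function([0.7, 3.0], rules))  # 0.8
```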

  16. RankRules: • RankRules follows the rank loss minimization approach. • We use the boosting approach to learn the ensemble. • The rank loss is upper-bounded by the exponential function L(y, f) = exp(−yf), applied here to y_{◦•}(f(x_◦) − f(x_•)). • This is a convex function, which makes the minimization problem easier to handle. • Due to the modularity of the exponential function, minimization of the rank loss can be performed in a fast way.

  17. RankRules: • In the m-th iteration, the rule is computed by: r_m = arg min_{Φ,α} Σ_{y_{ij}>0} w_{ij} e^{−α(Φ_m(x_i) − Φ_m(x_j))}, where f_{m−1} is the rule ensemble after m − 1 iterations, and w_{ij} = e^{−(f_{m−1}(x_i) − f_{m−1}(x_j))} can be treated as weights associated with pairs of training examples. • The overall loss changes only for pairs in which one example is covered by the rule and the other is not (Φ(x_i) ≠ Φ(x_j)).

  18. RankRules: • Thresholds are computed by: θ = arg min_θ Σ_{i=1}^{N} Σ_{k=1}^{K−1} e^{−y_{ik}(f(x_i) − θ_k)}, subject to θ_0 = −∞ ≤ θ_1 ≤ … ≤ θ_{K−1} ≤ θ_K = ∞. • The problem has a closed-form solution: θ_k = ½ log ( Σ_{i=1}^{N} ⟦y_{ik} < 0⟧ e^{f(x_i)} / Σ_{i=1}^{N} ⟦y_{ik} > 0⟧ e^{−f(x_i)} ), k = 1, …, K − 1. • The monotonicity condition is satisfied by this solution, as proved by Lin and Li (2007).
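A minimal numpy sketch (mine, with toy inputs) of this closed-form threshold computation, assuming classes are numbered 1..K and `scores` holds f(x_i):

```python
import numpy as np

def fit_thresholds(scores, labels, K):
    """Closed-form thresholds theta_1..theta_{K-1} minimizing the exponential
    threshold loss sum_i sum_k exp(-y_ik (f(x_i) - theta_k)),
    where y_ik = +1 if labels[i] > k and -1 otherwise."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    thetas = []
    for k in range(1, K):
        below = labels <= k   # examples with y_ik < 0
        above = labels > k    # examples with y_ik > 0
        num = np.sum(np.exp(scores[below]))    # sum of e^{+f(x_i)} over y_ik < 0
        den = np.sum(np.exp(-scores[above]))   # sum of e^{-f(x_i)} over y_ik > 0
        thetas.append(0.5 * np.log(num / den))  # note: assumes both groups are non-empty
    return np.array(thetas)

# Toy usage: 3 classes, a few scored examples (thresholds come out monotone).
print(fit_thresholds([-2.0, -0.5, 0.3, 1.8], [1, 2, 2, 3], K=3))
```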

  19. Single Rule Generation: • The m-th rule is obtained by solving: r_m = arg min_{Φ,α} Σ_{y_{ij}>0} w_{ij} e^{−α(Φ_m(x_i) − Φ_m(x_j))}. • For a given Φ_m, the problem of finding α_m has a closed-form solution: α_m = ½ ln ( Σ_{y_{ij}>0 ∧ Φ_m(x_i) > Φ_m(x_j)} w_{ij} / Σ_{y_{ij}>0 ∧ Φ_m(x_i) < Φ_m(x_j)} w_{ij} ). • The challenge is to find Φ_m by deriving the impurity measure L(Φ_m) in such a way that the optimization problem no longer depends on α_m.
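To illustrate the closed form (a naive O(N²) sketch with made-up data, not the authors' implementation), the pair weights w_ij and the response α for a fixed coverage indicator Φ could be computed as:

```python
import numpy as np

def rule_response(scores, labels, covered):
    """Closed-form alpha for a fixed rule coverage Phi (boolean array).

    Pairs with y_ij > 0 are those with labels[i] > labels[j]; their weight is
    w_ij = exp(-(f(x_i) - f(x_j))) for the current ensemble scores f."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    covered = np.asarray(covered, dtype=bool)
    w_plus = 0.0   # pairs the rule separates correctly: Phi(x_i) > Phi(x_j)
    w_minus = 0.0  # pairs the rule separates incorrectly: Phi(x_i) < Phi(x_j)
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j] and covered[i] != covered[j]:
                w_ij = np.exp(-(scores[i] - scores[j]))
                if covered[i]:
                    w_plus += w_ij
                else:
                    w_minus += w_ij
    return 0.5 * np.log(w_plus / w_minus)

# Toy usage: the rule mostly covers higher-labelled examples, so alpha > 0.
print(rule_response(scores=[0.0, -0.3, 0.1, 0.2],
                    labels=[3, 1, 2, 3],
                    covered=[True, True, False, True]))
```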

  20. Boosting Approaches and Impurity Measures : • Simultaneous minimization : finds the closed-form solution for Φ (Confidence-rated AdaBoost, SLIPPER, RankBoost). • Gradient descent : relies on approximation of the loss function up to the first order (AdaBoost, AnyBoost). • Gradient boosting : minimizes the squared-error between rule outputs and the negative gradient of the loss function (Gradient Boosting Machine, MART). • Constant-step minimization : restricts α ∈ {− β, β } , with β being a fixed parameter.

  21. Boosting Approaches and Impurity Measures: • Each of the boosting approaches provides a different impurity measure that represents a different trade-off between misclassification and coverage of the rule. • Gradient descent produces the most general rules in comparison to the other techniques. • Gradient descent represents a 1/2 trade-off between misclassification and coverage of the rule. • Constant-step minimization generalizes the gradient descent technique to obtain different trade-offs between misclassification and coverage of the rule, namely ℓ ∈ [0, 0.5), with β = ln((1 − ℓ)/ℓ).

  22. Rule Coverage (artificial data): [Figure: number of covered training examples (0–500) for each of 1000 consecutive rules, comparing RR SM-Exp, RR CS-Exp (β = 0.1, 0.2, 0.5), RR GD-Exp, and RR GB-Exp, all with ν = 0.1 and ζ = 0.25.]

  23. Fast Implementation: • We rewrite the minimization problem of complexity O(N²): r_m = arg min_{Φ,α} Σ_{y_{ij}>0} w_{ij} e^{−α(Φ_m(x_i) − Φ_m(x_j))}, to the problem that can be solved in O(KN). • We use the fact that w_{ij} = e^{−(f_{m−1}(x_i) − f_{m−1}(x_j))} = e^{−f_{m−1}(x_i)} e^{f_{m−1}(x_j)} = w_i w_j^−, and use the notation: W_k = Σ_{y_i = k ∧ Φ(x_i) = 1} w_i^−,  W_k^0 = Σ_{y_i = k ∧ Φ(x_i) = 0} w_i^−.

  24. Fast Implementation: • The minimization problem can be rewritten to: r_m = arg min_{Φ,α} Σ_{i=1}^{N} w_i e^{−α Φ_m(x_i)} Σ_{y_i > y_j} w_j^− e^{α Φ_m(x_j)}, where the inner sum can be given by: Σ_{y_i > y_j} w_j^− e^{α Φ_m(x_j)} = e^α Σ_{y_i > k} W_k + Σ_{y_i > k} W_k^0. • The values W_k and W_k^0, k = 1, …, K, can be easily computed and updated in each iteration.
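A rough numpy sketch (mine, with made-up data) of how the class-wise aggregates W_k and W_k^0 let the pairwise objective be evaluated without looping over all pairs; it checks the aggregated value against the naive O(N²) sum for one candidate (Φ, α):

```python
import numpy as np

def pairwise_loss_naive(scores, labels, covered, alpha):
    # Direct O(N^2) evaluation of sum_{y_i > y_j} w_ij * exp(-alpha*(Phi_i - Phi_j)).
    loss = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                w_ij = np.exp(-(scores[i] - scores[j]))
                loss += w_ij * np.exp(-alpha * (covered[i] - covered[j]))
    return loss

def pairwise_loss_fast(scores, labels, covered, alpha, K):
    # O(KN) evaluation via class-wise aggregates of w_j^- = exp(+f(x_j)).
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    covered = np.asarray(covered, int)
    w = np.exp(-scores)        # w_i
    w_neg = np.exp(scores)     # w_i^-
    W = np.array([w_neg[(labels == k) & (covered == 1)].sum() for k in range(1, K + 1)])
    W0 = np.array([w_neg[(labels == k) & (covered == 0)].sum() for k in range(1, K + 1)])
    loss = 0.0
    for i in range(len(labels)):
        lower = slice(0, labels[i] - 1)  # classes k with k < y_i
        inner = np.exp(alpha) * W[lower].sum() + W0[lower].sum()
        loss += w[i] * np.exp(-alpha * covered[i]) * inner
    return loss

scores = np.array([0.2, -0.1, 0.0, 0.5, -0.4])
labels = np.array([3, 1, 2, 3, 1])
covered = np.array([1, 0, 1, 1, 0])
print(pairwise_loss_naive(scores, labels, covered, alpha=0.3))
print(pairwise_loss_fast(scores, labels, covered, alpha=0.3, K=3))  # same value
```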

  25. Fast Implementation: [Figure: training time versus the number of training instances (up to 10000) for RR SM-Exp with ν = 0.1, comparing ζ = 1 and ζ = 0.5.]

  26. Regularization: • The rule is shrunk (multiplied) by the amount ν ∈ (0, 1] towards the rules already present in the ensemble: f_m(x) = f_{m−1}(x) + ν · r_m(x). • The procedure for finding Φ_m works on a fraction ζ of the original data, drawn without replacement. • The value of α_m is calculated on all training examples – this usually decreases |α_m| and plays the role of regularization.
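Putting these pieces together, a heavily simplified outer loop with shrinkage ν and subsampling ζ might look as follows; `find_rule` is only a placeholder for the actual rule-induction step, and the `rule_response` helper from the sketch after slide 19 is assumed to be in scope:

```python
import numpy as np
rng = np.random.default_rng(0)

def find_rule(X_sub, y_sub, f_sub):
    # Placeholder for the real search for Phi_m: here, a trivial axis-parallel
    # condition on the first attribute, returned as a coverage function.
    threshold = np.median(X_sub[:, 0])
    return lambda X: (X[:, 0] >= threshold).astype(int)

def fit_ensemble(X, y, M=10, nu=0.1, zeta=0.25):
    N = len(y)
    f = np.zeros(N)              # current ensemble scores f_{m-1}(x_i)
    rules = []
    for m in range(M):
        # Phi_m is searched on a zeta-fraction of the data, drawn without replacement...
        idx = rng.choice(N, size=max(1, int(zeta * N)), replace=False)
        phi = find_rule(X[idx], y[idx], f[idx])
        covered = phi(X)
        # ...but alpha_m is computed on all training examples, which tends to shrink |alpha_m|.
        alpha = rule_response(f, y, covered)   # closed-form response from the earlier sketch
        f = f + nu * alpha * covered           # shrinkage: f_m = f_{m-1} + nu * r_m
        rules.append((phi, nu * alpha))
    return rules, f
```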
