Learning Multiple Tasks with Boosted Decision Trees


Jean Baptiste Faddoul, Boris Chidlovskii, Fabien Torre, Rémi Gilleron. CAp'12, May 2012.


  1. Multitask Learning. Multitask learning (MTL) considers learning multiple "related" tasks jointly, in order to improve their predictive performance.

  2. Related Tasks?

  3. Table of Contents. Introduction; Learning MT-DTs; MT-Adaboost; Experiments; Conclusion.
Label Correspondence Assumption. Learning multiple tasks becomes easier under the label correspondence assumption, where either: tasks share the same label sets, or tasks share the same training data points, each point carrying a label for every task.

  4. Global Relatedness Assumption. Related tasks might show different degrees and signs of relatedness across the learning space.
Prior Art. [Quadrianto et al., 2010] formulates MTL as a maximization problem of mutual information among the label sets; their approach assumes a global relatedness pattern between the tasks. In previous work [Faddoul et al., 2010] we proposed MT-Adaboost, an adaptation of Adaboost to MTL; the weak classifier proposed there is called MT-Stump.

  5. Prior Art (Cont'd). No label correspondence or global relatedness assumptions are required. However, the sequential design of the multi-task stump and its greedy learning algorithm can fail to capture task relatedness.
Contribution. We propose a novel multi-task learning technique which addresses the limitations of previous approaches: we adapt decision tree learning to the multi-task setting, we derive an information-theoretic criterion and prove its superiority to the baseline information gain, and we integrate the proposed classifier into a boosting framework.
A Bit of Notation. $N$ classification tasks $T_1, \dots, T_N$ over the instance space $X$, with label sets $Y_1, \dots, Y_N$; a distribution $D$ over $X \times \{1, \dots, N\}$; a training set $S = \{\langle x_i, y_i, j \rangle \mid x_i \in X,\ y_i \in Y_j,\ j \in \{1, \dots, N\},\ 1 \le i \le m\}$. The output is $h : X \to Y_1 \times \dots \times Y_N$ minimizing $\mathrm{error}(h) = \Pr_{\langle x, y, j \rangle \sim D}[h_j(x) \ne y]$, where $h_j(x)$ is the $j$-th component of $h(x)$.
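To make the notation concrete, here is a minimal sketch (not from the paper) of how the training set $S$ of $\langle x_i, y_i, j \rangle$ triples might be represented; the feature values, label names, and type alias are illustrative assumptions.

```python
# A minimal sketch of the multi-task training set S: each element is a triple
# <x_i, y_i, j> pairing an instance with a label for one specific task j.
# Feature values and label names below are made up for illustration.
from typing import List, Tuple

Example = Tuple[List[float], str, int]  # (features x_i, label y_i, task index j)

S: List[Example] = [
    ([0.2, 1.5], "spam",     1),  # labelled for task 1 (e.g. spam detection)
    ([0.9, 0.3], "ham",      1),
    ([0.2, 1.5], "urgent",   2),  # the same instance may reappear with a task-2 label
    ([0.4, 2.1], "low-prio", 2),
]

# The learned hypothesis h maps x to one label per task, h(x) in Y_1 x ... x Y_N.
```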

  6. Multi-Task Decision Tree (MT-DT).
Information Gain. Decision tree learning is based on the Information Gain (IG) criterion. $IG(Y; X)$, the gain on a random variable $Y$ obtained after observing the value of $X$, is the Kullback-Leibler divergence $D_{KL}(p(Y \mid X) \,\|\, p(Y \mid I))$. It is the reduction of $Y$'s entropy obtained by observing the value of $X$: $IG(Y; X) = H(Y) - H(Y \mid X)$. IG defines a preferred sequence of attributes to investigate in order to most rapidly narrow down the state of $Y$.
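As an illustration of the criterion above, here is a minimal Python sketch of entropy and information gain for a categorical attribute; the helper names `entropy` and `information_gain` are mine, not the paper's.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y; X) = H(Y) - H(Y | X) for a categorical attribute X."""
    n = len(ys)
    h_cond = sum(
        len([y for x, y in zip(xs, ys) if x == v]) / n
        * entropy([y for x, y in zip(xs, ys) if x == v])
        for v in set(xs)
    )
    return entropy(ys) - h_cond

# A perfectly informative attribute recovers the full label entropy:
print(information_gain(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0 bit
```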

  7. Information Gain for MTL. As a baseline, we can pool all the tasks into a single multi-class task. The IG in this case is $IG_J = IG(\oplus_{j=1}^{N} Y_j; X)$, where $\oplus$ denotes the pooling of all tasks' label sets. Another baseline is the sum of the individual IGs: $IG_U = \sum_{j=1}^{N} IG(Y_j; X)$.
Information Gain for MTL (Cont'd). Evaluations show that $IG_U$ fails to do better than $IG_J$. We prove that $IG_J$ is equivalent to a weighted sum of the individual task information gains, and then derive $IG_M$, a criterion superior to $IG_J$.
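A sketch of the two baseline criteria, reusing the `entropy`/`information_gain` helpers from the previous block; pooling via task-tagged labels is my assumption about one reasonable way to realise the $\oplus$ operator.

```python
def ig_joint(tasks_xs, tasks_ys):
    """IG_J: pool all tasks into one multi-class problem and compute a single IG."""
    pooled_x, pooled_y = [], []
    for j, (xs, ys) in enumerate(zip(tasks_xs, tasks_ys)):
        pooled_x.extend(xs)
        pooled_y.extend((j, y) for y in ys)   # tag labels by task: Y_1 (+) ... (+) Y_N
    return information_gain(pooled_x, pooled_y)

def ig_union(tasks_xs, tasks_ys):
    """IG_U: sum of the per-task information gains."""
    return sum(information_gain(xs, ys) for xs, ys in zip(tasks_xs, tasks_ys))
```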

  8. Information Gain for MTL (Cont'd).
Theorem. For $N$ tasks with class sets $Y_1, \dots, Y_N$, let $p_j$ denote the fraction of task $j$ in the full dataset, $p_j = |S_j| / \sum_{k=1}^{N} |S_k|$, $j = 1, \dots, N$, with $\sum_{j=1}^{N} p_j = 1$. Then
$$IG(\oplus_{j=1}^{N} Y_j; X) = \sum_{j=1}^{N} p_j\, IG(Y_j; X) \le \max\big(IG(Y_1; X), \dots, IG(Y_N; X)\big).$$
To prove the theorem we use the generalized grouping property of the entropy:
Lemma. For $q_{kj} \ge 0$ such that $\sum_{k=1}^{n} \sum_{j=1}^{m} q_{kj} = 1$ and $p_k = \sum_{j=1}^{m} q_{kj}$ for all $k = 1, \dots, n$, the following holds:
$$H(q_{11}, \dots, q_{1m}, q_{21}, \dots, q_{2m}, \dots, q_{n1}, \dots, q_{nm}) = H(p_1, \dots, p_n) + \sum_{k} p_k\, H\!\left(\frac{q_{k1}}{p_k}, \dots, \frac{q_{km}}{p_k}\right), \quad p_k > 0\ \forall k.$$
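The lemma can be sanity-checked numerically. The following sketch (my own, not from the paper) draws a random joint distribution and verifies the generalized grouping identity.

```python
import math
import random

def H(ps):
    """Entropy of a nonnegative weight vector (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Random joint distribution q[k][j]: n groups of m outcomes each.
n, m = 3, 4
q = [[random.random() for _ in range(m)] for _ in range(n)]
total = sum(sum(row) for row in q)
q = [[v / total for v in row] for row in q]
p = [sum(row) for row in q]                        # group marginals p_k

lhs = H([v for row in q for v in row])             # H(q_11, ..., q_nm)
rhs = H(p) + sum(pk * H([v / pk for v in row]) for pk, row in zip(p, q))
assert abs(lhs - rhs) < 1e-9                       # the grouping identity holds
```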

  9. Information Gain for MTL (Cont'd). First, we use the Lemma to develop the entropy term $H(\oplus_{j=1}^{N} Y_j)$ of the information gain:
$$H(\oplus_{j=1}^{N} Y_j) = H(p_1, \dots, p_N) + \sum_{j=1}^{N} p_j H(Y_j), \qquad \text{where } \sum_{j=1}^{N} p_j = 1.$$
Second, we develop the conditional entropy term. We assume here that the task proportions are independent of the observation, i.e., $H(p_1, \dots, p_N \mid x) = H(p_1, \dots, p_N)$:
$$
\begin{aligned}
H(\oplus_{j=1}^{N} Y_j \mid X) &= \sum_{x} p(x)\, H(\oplus_{j=1}^{N} Y_j \mid X = x) \\
&= \sum_{x} p(x) \Big( H(p_1, \dots, p_N) + \sum_{j=1}^{N} p_j H(Y_j \mid X = x) \Big) \\
&= H(p_1, \dots, p_N) + \sum_{j=1}^{N} p_j \sum_{x} p(x)\, H(Y_j \mid X = x) \\
&= H(p_1, \dots, p_N) + \sum_{j=1}^{N} p_j H(Y_j \mid X).
\end{aligned}
$$
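Under the stated assumption that task proportions are independent of the observation, the conditional-entropy decomposition can also be checked numerically. This sketch reuses the `entropy` helper from the earlier block; the toy two-task data (proportions 2/3 vs 1/3 for every attribute value) is my own construction.

```python
def conditional_entropy(xs, ys):
    """H(Y | X) for a categorical attribute X and labels Y."""
    n = len(ys)
    return sum(
        len([y for x, y in zip(xs, ys) if x == v]) / n
        * entropy([y for x, y in zip(xs, ys) if x == v])
        for v in set(xs)
    )

# Two toy tasks whose proportions (2/3 vs 1/3) are the same for every x value,
# as the proof assumes.
t1_x, t1_y = ["a", "a", "b", "b"], [0, 0, 0, 1]
t2_x, t2_y = ["a", "b"], ["x", "y"]
p = [4 / 6, 2 / 6]

pooled_x = t1_x + t2_x
pooled_y = [(1, y) for y in t1_y] + [(2, y) for y in t2_y]   # (+)-pooled labels

lhs = conditional_entropy(pooled_x, pooled_y)
rhs = entropy([1] * 4 + [2] * 2) \
      + p[0] * conditional_entropy(t1_x, t1_y) \
      + p[1] * conditional_entropy(t2_x, t2_y)
assert abs(lhs - rhs) < 1e-9
```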

  10. Information Gain for MTL (Cont'd). Now we combine the entropy and the conditional entropy terms to evaluate the joint information gain $IG(\oplus_{j=1}^{N} Y_j; X)$. We obtain
$$
\begin{aligned}
IG(\oplus_{j=1}^{N} Y_j; X) &= H(\oplus_{j=1}^{N} Y_j) - H(\oplus_{j=1}^{N} Y_j \mid X) \\
&= \sum_{j=1}^{N} p_j\, IG(Y_j; X) \\
&\le \sum_{j=1}^{N} p_j \max\big(IG(Y_1; X), \dots, IG(Y_N; X)\big) \\
&= \max\big(IG(Y_1; X), \dots, IG(Y_N; X)\big).
\end{aligned}
$$
This completes the proof of the theorem.
IGs Comparison. Figure: Information gain for randomly generated datasets.
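A quick numeric check of the final identity, reusing `information_gain` and `ig_joint` from the earlier sketches on proportion-balanced toy data; the data values are assumptions chosen only for illustration.

```python
# Two toy tasks whose proportions are independent of x, as the proof requires.
tasks_x = [["a", "a", "b", "b"], ["a", "b"]]
tasks_y = [[0, 0, 0, 1], ["x", "y"]]
p = [4 / 6, 2 / 6]                                  # task proportions p_j

per_task = [information_gain(xs, ys) for xs, ys in zip(tasks_x, tasks_y)]
weighted = sum(pj * ig for pj, ig in zip(p, per_task))
joint = ig_joint(tasks_x, tasks_y)

print(joint, weighted, max(per_task))   # joint == weighted (up to float error) <= max
```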

  11. Adaboost. Adaboost is a meta-algorithm that comes from the PAC framework [Valiant, 1984]. A weak learner is only slightly better than random (error $\epsilon < 0.5$); a strong learner allows its error $\epsilon$ to be made as small as desired. Adaboost transforms any weak learner into a strong learner. Outline of the algorithm (a minimal sketch follows this slide): (1) initialize the example weights $D_1$; (2) at each round $t = 1, \dots, T$: call the weak learner on the distribution $D_t$ to learn a hypothesis $h_t$, compute $\epsilon_t$, the error of $h_t$ on the training set, increase the weights of incorrectly classified examples and decrease the weights of correctly classified ones; (3) the final classifier is a weighted vote of the weak classifiers' outputs.
What is Good about the Adaboost Framework? It is not prone to overfitting, can be used with many different classifiers, inherits the features of its weak classifier (semi-supervised, relational, statistical, etc.), and is simple to implement. These properties motivate our choice of Adaboost as the framework for our multi-task algorithm.
Requirements. We have to: define multi-task hypotheses and their learning algorithm (in our case, MT-DT), and modify AdaBoost for multi-task learning.
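A compact sketch of the weight-update loop outlined above, with $\beta_t = \epsilon_t / (1 - \epsilon_t)$ as in the pseudocode on the next slide; `weak_learner` is a placeholder interface of my choosing, not the paper's API.

```python
import numpy as np

def adaboost_m1(X, y, weak_learner, rounds=10):
    """Sketch of the AdaBoost.M1-style loop outlined above.

    `weak_learner(X, y, D)` is a placeholder: it should fit on the weighted
    sample and return a callable h with h(X) -> predicted labels (numpy array).
    """
    y = np.asarray(y)
    D = np.full(len(y), 1.0 / len(y))           # D_1: uniform example weights
    hypotheses, betas = [], []
    for _ in range(rounds):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()                # weighted training error
        if eps >= 0.5:                          # weak-learning assumption violated
            break
        beta = eps / (1.0 - eps)
        D = np.where(pred == y, D * beta, D)    # shrink weights of correct examples
        D /= D.sum()                            # renormalise (the Z_t constant)
        hypotheses.append(h)
        betas.append(beta)
    # Final classifier: weighted vote with weights ln(1/beta_t)
    # (the multi-task version below spells this out per task).
    return hypotheses, betas
```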

  12. MT-Adaboost. We adapt Adaboost.M1, which was introduced in [Schapire and Singer, 1999]; any other boosting algorithm could be used. In the multi-task case, the distribution is defined over (example, task) pairs, i.e., over $X \times \{1, \dots, N\}$. The output of the algorithm is a function which takes an example as input and gives a label per task as output:
$$H_j(x) = \arg\max_{y \in Y_j} \sum_{t\,:\, h_t^j(x) = y} \ln \frac{1}{\beta_t}, \qquad 1 \le j \le N.$$
MT-Adaboost (pseudocode):
Require: $S = \cup_{j=1}^{N} \{ e_i = \langle x_i, y_i, j \rangle \mid x_i \in X,\ y_i \in Y_j \}$
1: $D_1 = \mathrm{init}(S)$ {initialize the distribution}
2: for $t = 1$ to $T$ do
3:   $h_t = \mathrm{WL}(S, D_t)$ {train the weak learner and get a hypothesis, an MT-DT}
4:   Calculate the error of $h_t$: $\epsilon_t = \sum_{j=1}^{N} \sum_{i\,:\, h_t^j(x_i) \ne y_i} D_t(e_i)$
5:   if $\epsilon_t > 1/2$ then
6:     Set $T = t - 1$ and abort loop
7:   end if
8:   $\beta_t = \epsilon_t / (1 - \epsilon_t)$ {Update the distribution:}
9:   if $h_t^j(x_i) = y_i$ then
10:    $D_{t+1}(e_i) = D_t(e_i) \times \beta_t / Z_t$
11:  else
12:    $D_{t+1}(e_i) = D_t(e_i) / Z_t$
13:  end if
14: end for {where $Z_t$ is a normalization constant chosen so that $D_{t+1}$ is a distribution}
15: return the classifier $H$ defined by $H_j(x) = \arg\max_{y \in Y_j} \sum_{t\,:\, h_t^j(x) = y} \ln \frac{1}{\beta_t}$, $1 \le j \le N$.
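Below is a rough, runnable sketch of the MT-Adaboost loop. It keeps the distribution over (example, task) pairs and the $\beta_t$ update from the pseudocode, but substitutes a shallow scikit-learn decision tree on [features, task id] with task-tagged labels for the paper's MT-DT weak learner; that substitution, and all function and variable names, are my assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def mt_adaboost(X, y, task, rounds=20, max_depth=3):
    """Sketch of the MT-Adaboost loop over (example, task) pairs.

    X: (m, d) features, y: labels, task: (m,) task indices. The weak learner
    here is a shallow decision tree on [features, task id] with task-tagged
    labels -- a pooled stand-in for the paper's MT-DT, not the MT-DT itself.
    """
    X, task = np.asarray(X, dtype=float), np.asarray(task)
    y_tag = np.array([f"{j}:{lab}" for j, lab in zip(task, y)])  # keep label sets disjoint
    Xt = np.hstack([X, task.reshape(-1, 1)])      # let the tree also split on the task id
    D = np.full(len(y_tag), 1.0 / len(y_tag))     # distribution over (example, task) pairs
    trees, betas = [], []
    for _ in range(rounds):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(Xt, y_tag, sample_weight=D)
        pred = tree.predict(Xt)
        eps = D[pred != y_tag].sum()              # epsilon_t
        if eps >= 0.5:
            break
        beta = eps / (1.0 - eps)                  # beta_t
        D = np.where(pred == y_tag, D * beta, D)  # correct pairs shrink, then renormalise
        D /= D.sum()                              # Z_t
        trees.append(tree)
        betas.append(beta)

    def predict(x, j):
        """H_j(x): argmax over Y_j of the ln(1/beta_t)-weighted votes."""
        xt = np.append(np.asarray(x, dtype=float), j).reshape(1, -1)
        votes = {}
        for tree, beta in zip(trees, betas):
            lab = tree.predict(xt)[0]
            if lab.startswith(f"{j}:"):           # only votes for task j's label set count
                votes[lab] = votes.get(lab, 0.0) + np.log(1.0 / beta)
        return max(votes, key=votes.get).split(":", 1)[1] if votes else None

    return predict
```

With this sketch, `mt_adaboost(X, y, task)` returns a function `predict(x, j)` giving the predicted label of `x` for task `j`, mirroring $H_j(x)$ above.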
