 
Bandit Multiclass Linear Classification: Efficient Algorithms for the Separable Case
Alina Beygelzimer (Yahoo!), David Pal (Yahoo!), Balazs Szorenyi (Yahoo!), Devanathan Thiruvenkatachari (NYU), Chen-Yu Wei (USC), Chicheng Zhang (Microsoft)
Bandit multiclass classification
For $t = 1, 2, \dots, T$:
1. Example $(x_t, y_t)$ is chosen, where $x_t \in \mathbb{R}^d$ is the feature (shown to the learner) and $y_t \in [K]$ is the label (hidden).
2. Predict class label $\hat{y}_t \in [K]$.
3. Observe feedback $z_t = \mathbb{1}[\hat{y}_t \neq y_t] \in \{0, 1\}$.
Goal: minimize the total number of mistakes $\sum_{t=1}^{T} z_t$.
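The protocol can be sketched as a plain interaction loop. This is a minimal illustration under stated assumptions, not the authors' code: `run_bandit_protocol`, `predict`, and `update` are hypothetical names, and the learner is represented by two callables.

```python
from typing import Callable, Iterable, Sequence, Tuple

# (feature vector x_t, hidden label y_t); the label type is any hashable class id.
Example = Tuple[Sequence[float], int]


def run_bandit_protocol(
    examples: Iterable[Example],
    predict: Callable[[Sequence[float]], int],
    update: Callable[[Sequence[float], int, int], None],
) -> int:
    """Play the bandit multiclass game and return the total number of mistakes.

    The learner sees x_t, predicts y_hat_t, and is told only
    z_t = 1[y_hat_t != y_t]; the true label y_t is never handed to
    `predict` or `update` (hypothetical learner interface).
    """
    total_mistakes = 0
    for x_t, y_t in examples:
        y_hat = predict(x_t)      # step 2: predict a class label
        z_t = int(y_hat != y_t)   # step 3: a single bit of feedback
        update(x_t, y_hat, z_t)   # learner may use (x_t, y_hat_t, z_t) only
        total_mistakes += z_t
    return total_mistakes
```

For instance, a learner that always predicts class 1 and never updates can be run with `run_bandit_protocol(data, lambda x: 1, lambda x, y_hat, z: None)`; it errs on every example whose hidden label is not 1.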
Challenge: efficient algorithms in the separable setting
Definition. A dataset is called $\gamma$-linearly separable if there exist $w_1, \dots, w_K$ with $\sum_{i=1}^{K} \|w_i\|^2 \le 1$ such that, for all $(x, y)$ in the dataset,
$$\forall y' \neq y: \quad \langle w_y, x \rangle \ge \langle w_{y'}, x \rangle + \gamma.$$
[Figure: three regions labeled Class 1, Class 2, and Class 3, separated by the decision boundaries $\langle w_1 - w_2, x \rangle = 0$, $\langle w_1 - w_3, x \rangle = 0$, and $\langle w_2 - w_3, x \rangle = 0$.]
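To make the definition concrete, the sketch below checks whether a given candidate set of weight vectors witnesses $\gamma$-separability of a dataset. It only verifies a supplied witness; it does not search for one. The function names are hypothetical, and labels are 0-based in the code (an assumption for indexing convenience), whereas the slides use $[K] = \{1, \dots, K\}$.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def separation_margin(W: Sequence[Vector], data: Sequence[Tuple[Vector, int]]) -> float:
    """Smallest value of <w_y, x> - <w_y', x> over all (x, y) and all y' != y."""
    worst = float("inf")
    for x, y in data:
        scores = [dot(w, x) for w in W]
        for y_other, score in enumerate(scores):
            if y_other != y:
                worst = min(worst, scores[y] - score)
    return worst


def witnesses_separability(
    W: Sequence[Vector], data: Sequence[Tuple[Vector, int]], gamma: float
) -> bool:
    """True if W satisfies the norm constraint and separates data with margin gamma."""
    norm_sq = sum(dot(w, w) for w in W)  # sum_i ||w_i||^2
    return norm_sq <= 1.0 + 1e-12 and separation_margin(W, data) >= gamma
```

With `W = [(0.5, 0.0), (-0.5, 0.0)]` and the two points `((1.0, 0.0), 0)` and `((-1.0, 0.0), 1)`, the constraint sum is 0.5 and every pairwise score gap equals 1.0, so `W` witnesses separability for any `gamma` up to 1.0.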
Related work
Algorithm                   Mistake bound                                                          Efficient?
Minimax algorithm [DH13]    $O(K/\gamma^2)$                                                        No
Banditron [KSST08] (1)      $O(\sqrt{TK/\gamma^2})$                                                Yes
This work                   $2^{\tilde{O}(\min(K \log^2(1/\gamma),\ \sqrt{1/\gamma}\,\log K))}$    Yes
Contribution: first efficient algorithm that breaks the $\sqrt{T}$ barrier.
(1) See also [HK11, BOZ17, FKL+18, ..] that have similar guarantees.
Algorithm (One-versus-rest approach)
Run $K$ binary classifiers, one per class; on each example, classifier $i$ answers YES or NO to whether the example belongs to class $i$.
If $\ge 1$ of them respond YES: $\hat{y}_t \leftarrow$ any one of those YES labels.
If all of them respond NO: $\hat{y}_t \leftarrow$ uniform from $\{1, \dots, K\}$.
$$\mathbb{E}[\#\text{mistakes}(\text{alg})] \le K \sum_{i} \#\text{mistakes}(i),$$
where $\#\text{mistakes}(i)$ counts the mistakes of binary classifier $i$.
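The prediction rule can be sketched in a few lines. This is an illustration only: `one_versus_rest_predict` is a hypothetical name, labels are 0-based in the code, and "any one of those YES labels" is resolved here by taking the first, which is one arbitrary choice among several. How the bandit feedback is routed back to the individual binary classifiers is not shown on the slide and is left out.

```python
import random
from typing import Optional, Sequence


def one_versus_rest_predict(
    answers: Sequence[bool], rng: Optional[random.Random] = None
) -> int:
    """Combine K YES/NO answers into one predicted label (0-based).

    answers[i] is True when binary classifier i says YES for the current example.
    """
    rng = rng or random.Random()
    yes_labels = [i for i, said_yes in enumerate(answers) if said_yes]
    if yes_labels:
        # At least one YES: pick any of them (here, the first).
        return yes_labels[0]
    # All NO: guess uniformly at random among the K labels.
    return rng.randrange(len(answers))
```

One plausible reading of the factor $K$ in the bound (an interpretation, not stated on the slide) is that a uniform guess in the all-NO case hits the correct label only with probability $1/K$.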
Algorithm
◮ Each non-linear binary classifier learns the support of class $i$, which lies in an intersection of $K - 1$ halfspaces with a margin [KS04].
[Figure: three regions labeled Class 1, Class 2, and Class 3, separated by the decision boundaries $\langle w_1 - w_2, x \rangle = 0$, $\langle w_1 - w_3, x \rangle = 0$, and $\langle w_2 - w_3, x \rangle = 0$.]
◮ Choice: kernel Perceptron with rational kernel [SSSS11]:
$$K(x, x') = \frac{1}{1 - \frac{1}{2} \langle x, x' \rangle}.$$
◮ Thu. Poster #158
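A kernel Perceptron with the rational kernel can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, features are assumed to satisfy $\|x\| \le 1$ so that the kernel's denominator stays at least $1/2$, and the decision of when to call `update` (which depends on the bandit feedback) is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def rational_kernel(x: Vector, x_prime: Vector) -> float:
    """K(x, x') = 1 / (1 - <x, x'> / 2); assumes ||x||, ||x'|| <= 1."""
    inner = sum(a * b for a, b in zip(x, x_prime))
    return 1.0 / (1.0 - 0.5 * inner)


class KernelPerceptron:
    """Binary kernel Perceptron storing its hypothesis as signed support points."""

    def __init__(self, kernel: Callable[[Vector, Vector], float] = rational_kernel):
        self.kernel = kernel
        self.support: List[Tuple[Tuple[float, ...], int]] = []  # (x, sign) pairs

    def score(self, x: Vector) -> float:
        """Signed sum of kernel evaluations against the stored points."""
        return sum(sign * self.kernel(x_s, x) for x_s, sign in self.support)

    def says_yes(self, x: Vector) -> bool:
        """YES when the score is strictly positive; an empty model says NO."""
        return self.score(x) > 0

    def update(self, x: Vector, sign: int) -> None:
        """Perceptron step in kernel form: remember x with its sign (+1 or -1)."""
        self.support.append((tuple(x), sign))
```

As a quick sanity check on the kernel, a unit vector paired with itself gives $1/(1 - 1/2) = 2$, and two orthogonal vectors give $1$.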