 
Selective sampling algorithms for cost-sensitive multiclass prediction
Alekh Agarwal, Microsoft Research

Why active learning?
- Standard setting: receive randomly sampled examples
- Not all data points are equally informative!
- Labelled data points are expensive, unlabelled ones are cheap
  - Object recognition: images need human labelling
  - Protein interaction prediction: lab test for each protein pair
  - Web ranking: human editors to label relevant pages

What is active learning?
- Sequentially query points with label uncertainty
- Like random search vs. binary search
- Example: sampling near the decision boundary for linear separators

Online selective sampling paradigm
[Figure: a stream of examples $x_1, x_2, \dots, x_t$ enters the algorithm, which predicts $\hat{y}_t$. If $Z_t = 0$ the label $y_t$ is not observed; if $Z_t = 1$ the label $y_t$ is observed.]
- Filter examples online, querying only a subset of labels
- Examples are not revisited

Prior work
- Bulk of work in the binary setting
  - Agnostic active learning: Atlas, Balcan, Beygelzimer, Cohn, Dasgupta, Hanneke, Hsu, Ladner, Langford, ...
  - Linear selective sampling: Cesa-Bianchi, Gentile and co-authors
- Empirical work in the multiclass setting: Jain and Kapoor (2009), Joshi et al. (2012), ...
- Relatively little theoretical work

This talk
- Efficient algorithm in a multiclass GLM setting
- Analysis of regret and label complexity
- Sharp rates under Tsybakov-type noise condition
  - Regret ranges between $\tilde{O}(1/\sqrt{N_T})$ (noisy) and $\tilde{O}(\exp(-c_0 N_T))$ (hard-margin)
- Safety guarantee under model mismatch
- Numerical simulations

Multiclass prediction
- $x \in \mathbb{R}^d$, $y \in \{1, 2, \dots, K\}$
- Only one label per example
- Cost matrix $C \in \mathbb{R}^{K \times K}$
- Penalty $C_{ij}$ for predicting label $j$ when the true label is $i$

  True \ Predicted   Cat   Dog   Horse
  Cat                 0     1     10
  Dog                 1     0     10
  Horse              10    10      0
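
To make the role of the cost matrix concrete, here is a minimal sketch that picks the prediction with the smallest expected cost under the Cat/Dog/Horse matrix from the slide. The class probabilities and the helper names `expected_costs` and `min_cost_label` are assumptions made for illustration; they do not come from the talk.

```python
# Cost matrix from the slide: C[i][j] is the penalty for predicting
# label j when the true label is i (order: Cat, Dog, Horse).
LABELS = ["Cat", "Dog", "Horse"]
C = [
    [0, 1, 10],
    [1, 0, 10],
    [10, 10, 0],
]


def expected_costs(probs, cost):
    """Expected cost of each prediction j under class probabilities probs."""
    k = len(probs)
    return [sum(probs[i] * cost[i][j] for i in range(k)) for j in range(k)]


def min_cost_label(probs, cost):
    """Index of the prediction with the smallest expected cost."""
    costs = expected_costs(probs, cost)
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical class probabilities, chosen only for illustration.
probs = [0.35, 0.25, 0.40]
costs = expected_costs(probs, C)
most_likely = LABELS[max(range(len(probs)), key=probs.__getitem__)]
prediction = LABELS[min_cost_label(probs, C)]
print(most_likely, prediction, costs)
```

With these assumed probabilities the most likely label and the minimum-expected-cost label differ, which is the situation where cost-sensitive prediction departs from plain classification.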

Structured cost matrices
- Often have block- or tree-structured cost matrices in applications
[Figure: heat maps of three example cost matrices, with panels titled 0/1, Block, and Tree.]

Multiclass GLM
- Weight matrix $W^* \in \mathbb{R}^{K \times d}$
- Convex function $\Phi : \mathbb{R}^K \to \mathbb{R}$
- Definition (Multiclass GLM): for every $x \in \mathbb{R}^d$, the class conditional probabilities follow the model

$$ P(Y = i \mid W^*, x) = (\nabla\Phi(W^* x))_i $$

[Figure: the $K \times d$ matrix $W^*$ multiplies $x \in \mathbb{R}^d$ to give $W^* x \in \mathbb{R}^K$, which $\nabla\Phi$ maps to $P(Y \mid W^*, x)$.]

Multiclass GLM intuition
- Binary: $K = 2$. $\Phi$ is convex $\iff$ the link function is monotone increasing. E.g.: logistic, linear, ...
[Figure: plot of $P(y = 1 \mid w, x)$ against $w^T x$.]

Example: multiclass logistic
- Define $\Phi(v) = \log\big(\sum_{i=1}^{K} \exp(v_i)\big)$
- Obtain $(\nabla\Phi(v))_i = \exp(v_i) / \big(\sum_{j=1}^{K} \exp(v_j)\big)$
- Yields the multinomial logit noise model

$$ P(Y = i \mid W, x) = \frac{\exp(x^T W_i)}{\sum_{j=1}^{K} \exp(x^T W_j)} $$
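
The claim that the softmax probabilities are the gradient of the log-sum-exp function can be checked numerically. The sketch below compares the closed form with a central finite-difference gradient; the test vector `v` and the step size `h` are arbitrary choices made for illustration.

```python
import math


def log_sum_exp(v):
    """Phi(v) = log(sum_i exp(v_i)), computed stably by shifting by max(v)."""
    m = max(v)
    return m + math.log(sum(math.exp(a - m) for a in v))


def softmax(v):
    """(grad Phi(v))_i = exp(v_i) / sum_j exp(v_j)."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def numerical_gradient(f, v, h=1e-6):
    """Central finite-difference approximation of the gradient of f at v."""
    grad = []
    for i in range(len(v)):
        up = list(v)
        up[i] += h
        down = list(v)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


v = [0.5, -1.0, 2.0]  # arbitrary scores, for illustration only
p = softmax(v)
g = numerical_gradient(log_sum_exp, v)
max_gap = max(abs(a - b) for a, b in zip(p, g))
print(p, max_gap)
```

The entries of `p` are positive and sum to one, so they form a valid distribution over the $K$ classes, and `max_gap` should be at the level of finite-difference error.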

Loss function
- Given $\Phi$, define the loss $\ell(Wx, y) = \Phi(Wx) - y^T W x$, with the label $y$ encoded as an indicator vector in $\mathbb{R}^K$
- Convex since $\Phi$ is convex
- Fisher consistent: $W^*$ minimizes $E[\ell(Wx, y) \mid W^*, x]$ for each $x$, because the expected gradient vanishes at $W = W^*$:

$$
\begin{aligned}
E[\nabla_W \ell(Wx, y) \mid W^*, x] &= E[\nabla_W \Phi(Wx) \mid W^*, x] - E[\nabla_W (y^T W x) \mid W^*, x] \\
&= \nabla\Phi(Wx)\, x^T - E[y \mid W^*, x]\, x^T \\
&= \nabla\Phi(Wx)\, x^T - \nabla\Phi(W^* x)\, x^T
\end{aligned}
$$
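
Fisher consistency can also be seen numerically. The sketch below assumes the multiclass logistic choice of $\Phi$ and computes the exact conditional expected loss $\Phi(Wx) - E[y]^T W x$ at $W^*$ and at single-entry perturbations of $W^*$. The matrix `W_star`, the point `x`, and the perturbation size are hypothetical values picked for illustration.

```python
import math


def log_sum_exp(v):
    """Phi(v) for the multiclass logistic model."""
    m = max(v)
    return m + math.log(sum(math.exp(a - m) for a in v))


def softmax(v):
    """Gradient of log_sum_exp, i.e. the class probabilities."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Matrix-vector product for W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def expected_loss(W, W_star, x):
    """E[loss(Wx, y) | W*, x] = Phi(Wx) - E[y]^T W x, with E[y] = softmax(W* x)."""
    p_star = softmax(matvec(W_star, x))
    u = matvec(W, x)
    return log_sum_exp(u) - sum(p * a for p, a in zip(p_star, u))


# Hypothetical true weights (K = 3, d = 2) and input point.
W_star = [[1.0, -0.5], [0.0, 0.8], [-1.0, 0.2]]
x = [0.7, -1.2]

best = expected_loss(W_star, W_star, x)
gaps = []
for i in range(3):
    for j in range(2):
        for delta in (-0.3, 0.3):
            W = [row[:] for row in W_star]
            W[i][j] += delta  # perturb a single entry of W*
            gaps.append(expected_loss(W, W_star, x) - best)
print(min(gaps))
```

Every entry of `gaps` is the excess expected loss of a perturbed matrix over $W^*$, so all of them should be positive.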

Score function
- Given a cost matrix $C$ and $\Phi$, define

$$ S^x_W(i) = -\sum_{j=1}^{K} \underbrace{C(j, i)}_{\text{cost of } i} \, \underbrace{(\nabla\Phi(Wx))_j}_{\text{probability of } j} $$

- Negative expected cost of predicting $i$, when $W$ is the true weight matrix
- Maximum score $\iff$ minimum expected cost
- Bayes predictor predicts $\arg\max_{i = 1, \dots, K} S^x_{W^*}(i)$
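
As a sanity check on the definition, consider the special case of 0/1 costs. This worked case is an illustration, not a statement from the talk; it assumes $C(j, i) = \mathbf{1}\{j \ne i\}$ and uses only the fact that the class probabilities $(\nabla\Phi(Wx))_j$ sum to one.

```latex
\begin{aligned}
S^x_W(i) &= -\sum_{j=1}^{K} \mathbf{1}\{j \ne i\}\, (\nabla\Phi(Wx))_j \\
         &= -\Big(1 - (\nabla\Phi(Wx))_i\Big) \\
         &= (\nabla\Phi(Wx))_i - 1 .
\end{aligned}
```

Under this assumption, maximizing the score is the same as picking the most probable class, so the cost-sensitive rule reduces to the usual multiclass classifier.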

CS-Selectron algorithm with general query function
Input: query function $Q$, cost matrix $C$, parameter $\gamma > 0$
Initialize: $W_1 = 0$, $M_1 = \gamma I / \gamma_\ell$
For $t = 1, 2, \dots, T$:
- Observe $x_t$, set $H_{t+1} = H_t \cup \{x_t\}$
- Predict $\hat{y}_t = \arg\max_{i = 1, 2, \dots, K} S^{x_t}_{W_t}(i)$
- If $Q(x_t, H_t) = 1$, then
  - Query label $y_t$
  - Update $W_t$, $M_t$ and $H_t$: set $Z_t = 1$, $H_{t+1} = H_t \cup \{y_t\}$, $M_{t+1} = M_t + x_t x_t^T$, and

$$ W_{t+1} = \arg\min_{W \in \mathcal{W}} \Big\{ \sum_{s=1}^{t} Z_s\, \ell(W x_s, y_s) + \gamma \|W\|_F^2 \Big\} $$
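
The loop above can be sketched in plain Python. This is an illustration under several assumptions that are not in the slides: $\Phi$ is the multiclass logistic function, $\gamma_\ell$ is taken to be 1 so that $M_1 = \gamma I$, the constraint set $\mathcal{W}$ is ignored, the regularized minimization is approximated by a fixed number of gradient-descent steps, and the query function receives $M_t$ as its summary of the history. All function names are hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def scores(W, x, C):
    """S_W^x(i) = -sum_j C[j][i] * (grad Phi(W x))_j for each label i."""
    p = softmax(matvec(W, x))
    k = len(p)
    return [-sum(C[j][i] * p[j] for j in range(k)) for i in range(k)]


def fit(queried, k, d, gamma, steps=200, lr=0.5):
    """Approximate argmin_W sum_s loss(W x_s, y_s) + gamma * ||W||_F^2
    by gradient descent (assumed solver; the slides do not specify one)."""
    W = [[0.0] * d for _ in range(k)]
    step = lr / (1 + len(queried))
    for _ in range(steps):
        G = [[2.0 * gamma * W[i][j] for j in range(d)] for i in range(k)]
        for x, y in queried:
            p = softmax(matvec(W, x))
            for i in range(k):
                r = p[i] - (1.0 if i == y else 0.0)  # gradient of the loss in W x
                for j in range(d):
                    G[i][j] += r * x[j]
        for i in range(k):
            for j in range(d):
                W[i][j] -= step * G[i][j]
    return W


def cs_selectron(stream, C, query, gamma=1.0):
    """Predict on every round; read the label only when query(x, M) is true."""
    k, d = len(C), len(stream[0][0])
    W = [[0.0] * d for _ in range(k)]
    M = [[gamma if i == j else 0.0 for j in range(d)] for i in range(d)]
    queried, predictions, flags = [], [], []
    for x, y in stream:
        s = scores(W, x, C)
        predictions.append(max(range(k), key=s.__getitem__))
        z = bool(query(x, M))
        flags.append(z)
        if z:  # the label y is used only on queried rounds
            queried.append((x, y))
            for i in range(d):
                for j in range(d):
                    M[i][j] += x[i] * x[j]
            W = fit(queried, k, d, gamma)
    return predictions, flags, W


# Toy stream with 0-based labels and 0/1 costs, for illustration only.
zero_one = [[0, 1], [1, 0]]
stream = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
always_query = lambda x, M: True  # placeholder query function
predictions, flags, W = cs_selectron(stream, zero_one, always_query)
print(predictions, flags)
```

Passing the query rule as a callable keeps the loop independent of any particular choice of $Q$; the `always_query` placeholder recovers ordinary online learning with every label observed.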

Algorithm intuition
- Low-regret algorithm on queried examples
- Update ensures $\|W_t - W^*\|_{M_t}$ is small
- Query function ensures low regret on rounds with no queries

Query function: BBQ$_\epsilon$ rule
- Variant of Cesa-Bianchi et al. (2009)

$$ Q(x_t, H_t) = \mathbf{1}\Big( \eta_\epsilon \|x_t\|^2_{M_t^{-1}} \ge \epsilon^2 \Big) $$

- Note: $\|W^* x_t - W_t x_t\|_2 \le \|W^* - W_t\|_{M_t} \, \|x_t\|_{M_t^{-1}}$
- Queries points with large confidence intervals on the predictions
[Figure: two sketches of $x_t$ relative to the ellipsoid defined by $M_t$, one labelled $Q(x_t, H_t) = 1$ and one labelled $Q(x_t, H_t) = 0$.]
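
A minimal sketch of this rule follows, tracking $M_t^{-1}$ directly with the Sherman-Morrison identity so that no matrix inversion is needed after a rank-one update $M_{t+1} = M_t + x_t x_t^T$. The values of `eps` and `eta`, the choice $M_1 = I$, and the toy stream are assumptions for illustration; the slides do not give a value for $\eta_\epsilon$.

```python
def quad_form(A, x):
    """x^T A x for a square matrix A given as a list of rows."""
    d = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(d) for j in range(d))


def bbq_query(x, M_inv, eps, eta):
    """Query when eta * ||x||^2 in the M^{-1} norm is at least eps^2."""
    return eta * quad_form(M_inv, x) >= eps ** 2


def sherman_morrison_update(M_inv, x):
    """Return (M + x x^T)^{-1} given M^{-1}, for symmetric M."""
    d = len(x)
    u = [sum(M_inv[i][j] * x[j] for j in range(d)) for i in range(d)]  # M^{-1} x
    denom = 1.0 + sum(x[i] * u[i] for i in range(d))
    return [[M_inv[i][j] - u[i] * u[j] / denom for j in range(d)] for i in range(d)]


# Hypothetical parameters and stream, for illustration only.
eps, eta = 0.6, 1.0
M_inv = [[1.0, 0.0], [0.0, 1.0]]  # inverse of M_1 = I
stream = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

decisions = []
for x in stream:
    z = bbq_query(x, M_inv, eps, eta)
    decisions.append(z)
    if z:  # M_t changes only on queried rounds
        M_inv = sherman_morrison_update(M_inv, x)
print(decisions)
```

Repeated points in the same direction shrink $\|x_t\|_{M_t^{-1}}$, so the rule eventually stops querying them, while a point in an unexplored direction still has a large norm and triggers a query.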