SLIDE 1

Selective sampling algorithms for cost-sensitive multiclass prediction

Alekh Agarwal

Microsoft Research

SLIDE 2

Why active learning?

Standard setting - receive randomly sampled examples

SLIDE 3

Why active learning?

Standard setting - receive randomly sampled examples
Not all data points are equally informative!

SLIDE 4

Why active learning?

Standard setting - receive randomly sampled examples
Not all data points are equally informative!
Labelled data points are expensive, unlabelled cheap

Object recognition - images need human labelling
Protein interaction prediction - lab test for each protein pair
Web ranking - human editors to label relevant pages

SLIDE 5

What is active learning?

Sequentially query points with label uncertainty

SLIDE 6

What is active learning?

Sequentially query points with label uncertainty

Like random search vs. binary search

SLIDE 7

What is active learning?

Sequentially query points with label uncertainty

Like random search vs. binary search
Example: sampling near the decision boundary for linear separators

SLIDE 8

Online selective sampling paradigm

[Diagram: examples $x_1, x_2, \ldots, x_t$ stream into the algorithm, which predicts $\hat y_t$; if $Z_t = 1$ the label $y_t$ is observed, if $Z_t = 0$ it is not.]

Filter examples online, querying only a subset of labels. Examples are not revisited

SLIDE 9

Prior work

Bulk of work in the binary setting
Agnostic active learning

Atlas, Balcan, Beygelzimer, Cohn, Dasgupta, Hanneke, Hsu, Ladner, Langford, . . .

Linear selective sampling: Cesa-Bianchi, Gentile and co-authors

SLIDE 10

Prior work

Bulk of work in the binary setting
Agnostic active learning

Atlas, Balcan, Beygelzimer, Cohn, Dasgupta, Hanneke, Hsu, Ladner, Langford, . . .

Linear selective sampling: Cesa-Bianchi, Gentile and co-authors
Empirical work in the multiclass setting: Jain and Kapoor (2009), Joshi et al. (2012), . . .
Relatively little theoretical work

SLIDE 11

This talk

Efficient algorithm in a multiclass GLM setting
Analysis of regret and label complexity
Sharp rates under a Tsybakov-type noise condition
Regret ranges from $O(1/\sqrt{N_T})$ (noisy) to $O(\exp(-c_0 N_T))$ (hard-margin)

SLIDE 12

This talk

Efficient algorithm in a multiclass GLM setting
Analysis of regret and label complexity
Sharp rates under a Tsybakov-type noise condition
Regret ranges from $O(1/\sqrt{N_T})$ (noisy) to $O(\exp(-c_0 N_T))$ (hard-margin)
Safety guarantee under model mismatch
Numerical simulations

SLIDE 13

Multiclass prediction

$x \in \mathbb{R}^d$, $y \in \{1, 2, \ldots, K\}$
Only one label per example

[Images: horse, cat, dog.]

SLIDE 14

Multiclass prediction

$x \in \mathbb{R}^d$, $y \in \{1, 2, \ldots, K\}$
Only one label per example
Cost matrix $C \in \mathbb{R}^{K \times K}$
Penalty $C_{ij}$ for predicting label $j$ when the true label is $i$

[Images: horse, cat, dog.]

Example cost matrix (rows: true label, columns: predicted label; the blank diagonal is read as 0):

        Cat   Dog   Horse
Cat      0     1     10
Dog      1     0     10
Horse   10    10      0

SLIDE 15

Structured cost matrices

Often have block- or tree-structured cost matrices in applications
[Heatmaps of three example cost matrices: 0/1, block-structured, and tree-structured.]
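A minimal NumPy sketch (an editorial illustration, not from the talk) of how such cost matrices might be constructed; the function names and the within/across cost values are hypothetical:

```python
import numpy as np

def zero_one_cost(K):
    """0/1 cost matrix: unit cost for any misclassification."""
    return np.ones((K, K)) - np.eye(K)

def block_cost(block_sizes, within=0.1, across=1.0):
    """Block-structured costs: confusing two classes in the same block
    (e.g. two animal classes) is cheaper than confusing across blocks."""
    K = sum(block_sizes)
    C = np.full((K, K), across)
    start = 0
    for size in block_sizes:
        C[start:start + size, start:start + size] = within
        start += size
    np.fill_diagonal(C, 0.0)  # correct predictions cost nothing
    return C

C_block = block_cost([5, 5])  # 10 classes in two blocks of 5
```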

SLIDE 16

Multiclass GLM

Weight matrix $W^* \in \mathbb{R}^{K \times d}$
Convex function $\Phi : \mathbb{R}^K \to \mathbb{R}$
Definition (Multiclass GLM). For every $x \in \mathbb{R}^d$, the class-conditional probabilities follow the model $P(Y = i \mid W^*, x) = (\nabla\Phi(W^*x))_i$

[Diagram: $W^*$ ($K \times d$) times $x$ ($d$) gives $W^*x$ ($K$); applying $\nabla\Phi$ yields $P(Y \mid W^*, x)$.]

SLIDE 17

Multiclass GLM intuition

Binary case: $K = 2$. $\Phi$ is convex $\iff$ the link function is monotone increasing. E.g. logistic, linear, . . .

[Plot: a monotone link — $P(y = 1 \mid w, x)$ as a function of $w^\top x$.]

SLIDE 18

Example: multiclass logistic

Define $\Phi(v) = \log\left(\sum_{i=1}^K \exp(v_i)\right)$
Obtain $(\nabla\Phi(v))_i = \exp(v_i) \big/ \sum_{j=1}^K \exp(v_j)$
Yields the multinomial logit noise model
$$P(Y = i \mid W, x) = \frac{\exp(x^\top W_i)}{\sum_{j=1}^K \exp(x^\top W_j)}.$$
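A minimal NumPy sketch of this link: $\nabla\Phi$ is the softmax map, and sampling from the GLM draws a label from it (`grad_phi` and `sample_label` are illustrative names, not from the talk):

```python
import numpy as np

def grad_phi(v):
    """Softmax: gradient of Phi(v) = log(sum_i exp(v_i)).
    Coordinate i is the class-conditional probability P(Y = i)."""
    v = v - v.max()          # shift for numerical stability
    p = np.exp(v)
    return p / p.sum()

def sample_label(W, x, rng):
    """Draw a label from P(Y = i | W, x) = (grad Phi(Wx))_i."""
    probs = grad_phi(W @ x)
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
W_star = rng.normal(size=(5, 20))   # K = 5 classes, d = 20 features
x = rng.normal(size=20)
y = sample_label(W_star, x, rng)
```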

SLIDE 19

Loss function

Given $\Phi$, define the loss $\ell(Wx, y) = \Phi(Wx) - y^\top Wx$ (with $y$ one-hot).
Convex since $\Phi$ is convex
Fisher consistent: $W^*$ minimizes $\mathbb{E}[\ell(Wx, y) \mid W^*, x]$ for each $x$

SLIDE 20

Loss function

Given $\Phi$, define the loss $\ell(Wx, y) = \Phi(Wx) - y^\top Wx$ (with $y$ one-hot).
Convex since $\Phi$ is convex
Fisher consistent: $W^*$ minimizes $\mathbb{E}[\ell(Wx, y) \mid W^*, x]$ for each $x$:
$$\mathbb{E}[\nabla\ell(Wx, y) \mid W^*, x] = \mathbb{E}[\nabla\Phi(Wx) \mid W^*, x] - \mathbb{E}[\nabla(y^\top Wx) \mid W^*, x] = \nabla\Phi(Wx)\,x^\top - \mathbb{E}[y \mid W^*, x]\,x^\top = \nabla\Phi(Wx)\,x^\top - \nabla\Phi(W^*x)\,x^\top,$$
which vanishes at $W = W^*$.
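A sketch of the loss and its gradient for the multiclass-logistic $\Phi$, mirroring the identity above with the one-hot vector $e_y$ in place of $\mathbb{E}[y \mid W^*, x]$ (helper names are illustrative):

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def glm_loss(W, x, y):
    """ell(Wx, y) = Phi(Wx) - y^T Wx with y a one-hot label index;
    for the multiclass-logistic Phi this is the cross-entropy loss."""
    v = W @ x
    return logsumexp(v) - v[y]

def glm_loss_grad(W, x, y):
    """Gradient in W: (grad Phi(Wx) - e_y) x^T."""
    v = W @ x
    p = np.exp(v - v.max()); p /= p.sum()   # grad Phi(Wx)
    e_y = np.zeros(W.shape[0]); e_y[y] = 1.0
    return np.outer(p - e_y, x)
```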

SLIDE 21

Score function

Given a cost matrix $C$ and $\Phi$, define
$$S^x_W(i) = -\sum_{j=1}^K \underbrace{C(j, i)}_{\text{cost of predicting } i}\; \underbrace{(\nabla\Phi(Wx))_j}_{\text{probability of } j}.$$
Negative expected cost of predicting $i$, when $W$ is the true weight matrix
Maximum score $\iff$ minimum expected cost

SLIDE 22

Score function

Given a cost matrix $C$ and $\Phi$, define
$$S^x_W(i) = -\sum_{j=1}^K \underbrace{C(j, i)}_{\text{cost of predicting } i}\; \underbrace{(\nabla\Phi(Wx))_j}_{\text{probability of } j}.$$
Negative expected cost of predicting $i$, when $W$ is the true weight matrix
Maximum score $\iff$ minimum expected cost
The Bayes predictor predicts $\arg\max_{i=1,\ldots,K} S^x_{W^*}(i)$
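A sketch of the score and the induced cost-sensitive prediction, again assuming the multiclass-logistic $\Phi$ (function names are illustrative):

```python
import numpy as np

def scores(W, x, C):
    """S^x_W(i) = -sum_j C(j, i) (grad Phi(Wx))_j: the negative
    expected cost of predicting i under the model W."""
    v = W @ x
    p = np.exp(v - v.max()); p /= p.sum()   # (grad Phi(Wx))_j
    return -C.T @ p                          # entry i: -sum_j C[j, i] p[j]

def bayes_predict(W, x, C):
    """Maximum score <=> minimum expected cost."""
    return int(np.argmax(scores(W, x, C)))
```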

SLIDE 23

CS-Selectron algorithm with general query function

Input: query function $Q$, cost matrix $C$, parameter $\gamma > 0$
Initialize: $W_1 = 0$, $M_1 = (\gamma/\gamma_\ell)\, I$
For $t = 1, 2, \ldots, T$:

SLIDE 24

CS-Selectron algorithm with general query function

Input: query function $Q$, cost matrix $C$, parameter $\gamma > 0$
Initialize: $W_1 = 0$, $M_1 = (\gamma/\gamma_\ell)\, I$
For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$

SLIDE 25

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$

SLIDE 26

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$
If $Q(x_t, \mathcal{H}_t) = 1$, then

SLIDE 27

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$
If $Q(x_t, \mathcal{H}_t) = 1$, then
Query and observe label $y_t$

SLIDE 28

CS-Selectron algorithm with general query function

For $t = 1, 2, \ldots, T$:
Observe $x_t$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{x_t\}$
Predict $\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i)$
If $Q(x_t, \mathcal{H}_t) = 1$, then
Query and observe label $y_t$
Update $W_t$, $M_t$ and $\mathcal{H}_t$: set $Z_t = 1$, $\mathcal{H}_{t+1} = \mathcal{H}_t \cup \{y_t\}$, $M_{t+1} = M_t + x_t x_t^\top$, and
$$W_{t+1} = \arg\min_{W \in \mathcal{W}} \left\{ \sum_{s=1}^{t} Z_s\, \ell(Wx_s, y_s) + \gamma \|W\|_F^2 \right\}.$$
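A minimal sketch of the whole loop, reusing the `bayes_predict` and `glm_loss_grad` helpers sketched earlier. The exact regularized arg-min on this slide is approximated by a few gradient steps, and `stream` / `label_oracle` are hypothetical stand-ins for the data source and the labeller:

```python
import numpy as np

def cs_selectron(stream, K, d, C, query_fn, gamma=1.0, gamma_ell=0.25,
                 fit_steps=50, lr=0.1):
    """CS-Selectron sketch: predict by maximum score, query selectively,
    and refit W on queried examples only (examples are never revisited)."""
    W = np.zeros((K, d))
    M = (gamma / gamma_ell) * np.eye(d)
    queried = []                                  # (x_s, y_s) with Z_s = 1
    for x_t, label_oracle in stream:
        y_hat = bayes_predict(W, x_t, C)          # arg max_i S^{x_t}_{W_t}(i)
        if query_fn(x_t, M):
            y_t = label_oracle()                  # pay for this label
            queried.append((x_t, y_t))
            M = M + np.outer(x_t, x_t)
            for _ in range(fit_steps):            # approximate the arg-min
                g = sum(glm_loss_grad(W, xs, ys) for xs, ys in queried)
                W = W - lr * (g + 2 * gamma * W)  # ridge-regularized step
        yield y_hat
```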

SLIDE 29

Algorithm intuition

Low-regret algorithm on queried examples
Update ensures $\|W_t - W^*\|_{M_t}$ is small
Query function ensures low regret on rounds with no queries

SLIDE 30

Query function: BBQ$_\epsilon$ rule

Variant of Cesa-Bianchi et al. (2009):
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ \eta_\epsilon \|x_t\|^2_{M_t^{-1}} \ge \epsilon^2 \right\}$$

SLIDE 31

Query function: BBQ$_\epsilon$ rule

Variant of Cesa-Bianchi et al. (2009):
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ \eta_\epsilon \|x_t\|^2_{M_t^{-1}} \ge \epsilon^2 \right\}$$
Note: $\|W^*x_t - W_t x_t\|_2 \le \|W^* - W_t\|_{M_t}\, \|x_t\|_{M_t^{-1}}$

Queries points with large confidence intervals on the predictions

[Diagram: confidence ellipsoid of $M_t$ around $x_t$; a wide ellipsoid gives $Q(x_t, \mathcal{H}_t) = 1$, a narrow one gives $Q(x_t, \mathcal{H}_t) = 0$.]
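A sketch of this query rule; `eta_eps` and `eps` correspond to the parameters $\eta_\epsilon$ and $\epsilon$ on the slide:

```python
import numpy as np

def bbq_query(x_t, M_t, eta_eps=1.0, eps=0.1):
    """BBQ_eps rule: query while the squared Mahalanobis norm
    x_t^T M_t^{-1} x_t is large, i.e. while the confidence interval
    along x_t is still wide."""
    conf = x_t @ np.linalg.solve(M_t, x_t)   # ||x_t||^2 in the M_t^{-1} norm
    return eta_eps * conf >= eps ** 2
```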

SLIDE 32

Theoretical results: assumptions

Assumption. The function $\Phi(\cdot)$ is $\gamma_\ell$-strongly convex, that is, for all $u, v \in S \subseteq \mathbb{R}^K$,
$$\Phi(u) \ge \Phi(v) + \langle \nabla\Phi(v), u - v \rangle + \frac{\gamma_\ell}{2} \|u - v\|_2^2.$$

SLIDE 33

Theoretical results: assumptions

Assumption. The function $\Phi(\cdot)$ is $\gamma_\ell$-strongly convex, that is, for all $u, v \in S \subseteq \mathbb{R}^K$,
$$\Phi(u) \ge \Phi(v) + \langle \nabla\Phi(v), u - v \rangle + \frac{\gamma_\ell}{2} \|u - v\|_2^2.$$
Assumption. The function $\Phi(\cdot)$ is $\gamma_u$-smooth, that is, for all $u, v \in S \subseteq \mathbb{R}^K$,
$$\Phi(u) \le \Phi(v) + \langle \nabla\Phi(v), u - v \rangle + \frac{\gamma_u}{2} \|u - v\|_2^2.$$

SLIDE 34

Theoretical results: assumptions

[Figure: illustration of the function $\Phi$.]

SLIDE 35

Theoretical results: assumptions

Assumption. $\forall x \in \mathcal{X}$, we have $\|x\|_2 \le R$, and $\forall W \in \mathcal{W}$, we have $\|W_i\|_2 \le \omega$ for all $i = 1, 2, \ldots, K$.

SLIDE 36

Theoretical results: setup

Bound the label complexity $N_T$ and the regret
$$R_T = \sum_{t=1}^T \left( \mathbb{E}_t[C(Y_t, \hat y_t)] - \mathbb{E}_t[C(Y_t, y^*_t)] \right)$$

SLIDE 37

Theoretical results: setup

Bound the label complexity $N_T$ and the regret
$$R_T = \sum_{t=1}^T \left( \mathbb{E}_t[C(Y_t, \hat y_t)] - \mathbb{E}_t[C(Y_t, y^*_t)] \right)$$
Determined by the fraction of hard examples
$$T_\epsilon = \left\{ 1 \le t \le T : S^{x_t}_{W^*}(y^*_t) - S^{x_t}_{W^*}(y'_t) \le \epsilon \right\}.$$

SLIDE 38

Main results: BBQ$_\epsilon$ rule

Theorem (BBQ$_\epsilon$ query rule). With probability at least $1 - 2\delta$,
$$R_T = O\left( \epsilon T_\epsilon + \psi(C, \Phi)\, \frac{d}{\epsilon} \log\frac{1}{\delta} \right),$$
and the label complexity is at most
$$N_T = O\left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 \epsilon^2} \log\frac{1}{\delta} \right).$$
Result holds for an arbitrary sequence $x_t$

SLIDE 39

Query function: DGS rule

BBQ$_\epsilon$ doesn't use the labels at all for querying!
Variant of Dekel et al. (2010)
Define
$$y^*_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W^*}(i), \qquad y'_t = \arg\max_{i \ne y^*_t} S^{x_t}_{W^*}(i),$$
$$\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i), \qquad y''_t = \arg\max_{i \ne \hat y_t} S^{x_t}_{W_t}(i).$$

SLIDE 40

Query function: DGS rule

BBQ$_\epsilon$ doesn't use the labels at all for querying!
Variant of Dekel et al. (2010)
Define
$$y^*_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W^*}(i), \qquad y'_t = \arg\max_{i \ne y^*_t} S^{x_t}_{W^*}(i),$$
$$\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i), \qquad y''_t = \arg\max_{i \ne \hat y_t} S^{x_t}_{W_t}(i).$$
Set
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ S^{x_t}_{W_t}(\hat y_t) - S^{x_t}_{W_t}(y''_t) \le 2\eta_{\mathrm{DGS}}\, \|x_t\|_{M_t^{-1}} \right\}$$

SLIDE 41

Query function: DGS rule

BBQ$_\epsilon$ doesn't use the labels at all for querying!
Variant of Dekel et al. (2010)
Define
$$y^*_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W^*}(i), \qquad y'_t = \arg\max_{i \ne y^*_t} S^{x_t}_{W^*}(i),$$
$$\hat y_t = \arg\max_{i=1,\ldots,K} S^{x_t}_{W_t}(i), \qquad y''_t = \arg\max_{i \ne \hat y_t} S^{x_t}_{W_t}(i).$$
Set
$$Q(x_t, \mathcal{H}_t) = \mathbb{1}\left\{ S^{x_t}_{W_t}(\hat y_t) - S^{x_t}_{W_t}(y''_t) \le 2\eta_{\mathrm{DGS}}\, \|x_t\|_{M_t^{-1}} \right\}$$
Note: $|S^{x_t}_{W_t}(i) - S^{x_t}_{W^*}(i)| \le \eta_{\mathrm{DGS}}\, \|x_t\|_{M_t^{-1}}$
[Diagram: confidence intervals of width $\eta \|x_t\|_{M_t^{-1}}$ around $S^{x_t}_{W_t}(\hat y_t)$ and $S^{x_t}_{W_t}(y''_t)$; when $S^{x_t}_{W_t}(\hat y_t) - S^{x_t}_{W_t}(y''_t) > 2\eta \|x_t\|_{M_t^{-1}}$, the ordering of $S^{x_t}_{W^*}(\hat y_t)$ and $S^{x_t}_{W^*}(y''_t)$ is preserved, so no query is needed.]
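A sketch of the DGS rule, reusing the `scores` helper sketched earlier; it queries exactly when the empirical gap between the two best scores is within the confidence width:

```python
import numpy as np

def dgs_query(x_t, M_t, W_t, C, eta_dgs=1.0):
    """DGS rule: query when the gap between the top two empirical scores,
    S^{x_t}_{W_t}(yhat_t) - S^{x_t}_{W_t}(y''_t), is at most
    2 * eta * ||x_t||_{M_t^{-1}}."""
    s = np.sort(scores(W_t, x_t, C))          # ascending scores
    gap = s[-1] - s[-2]                       # best minus second best
    width = np.sqrt(x_t @ np.linalg.solve(M_t, x_t))
    return gap <= 2 * eta_dgs * width
```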

SLIDE 42

Main results: DGS rule

Theorem (DGS query rule). With probability at least $1 - 2\delta$,
$$R_T = O\left( \inf_{\epsilon > 0} \left\{ \epsilon T_\epsilon + \frac{\gamma_u^2 d}{\gamma_\ell^2 \epsilon} \log\frac{1}{\delta} \right\} \right),$$
and for any $\epsilon > 0$, the label complexity is at most
$$N_T = O\left( T_\epsilon + \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 \epsilon^2} \right).$$
Can optimize over $\epsilon$ for the best bound

SLIDE 43

Multiclass Tsybakov noise condition

Specialize to 0/1 costs for ease of presentation, and i.i.d. $x_t$
Assumption (Multiclass Tsybakov noise condition). There exist $\epsilon_0 > 0$, $\alpha > 0$ and some $c$ such that the distribution $P$ over $\mathbb{R}^d$ satisfies, for all $0 \le \epsilon \le \epsilon_0$,
$$P\left( (\nabla\Phi(W^*X))_{y^*(X)} - (\nabla\Phi(W^*X))_{y'(X)} \le \epsilon \right) \le c\, \epsilon^\alpha.$$

SLIDE 44

Multiclass Tsybakov noise condition

Specialize to 0/1 costs for ease of presentation, and i.i.d. $x_t$
Assumption (Multiclass Tsybakov noise condition). There exist $\epsilon_0 > 0$, $\alpha > 0$ and some $c$ such that the distribution $P$ over $\mathbb{R}^d$ satisfies, for all $0 \le \epsilon \le \epsilon_0$,
$$P\left( (\nabla\Phi(W^*X))_{y^*(X)} - (\nabla\Phi(W^*X))_{y'(X)} \le \epsilon \right) \le c\, \epsilon^\alpha.$$
Ensures separation between class-conditional probabilities, controls $T_\epsilon$.
Pictorial illustration for the binary case:
[Figure: $P(y \mid x, w^*)$ as a function of $x^\top w^*$, with a region of width $2\epsilon_0$ around the crossover.]

SLIDE 45

Results for low noise

Corollary. Under the multiclass Tsybakov condition, the BBQ$_\epsilon$ rule yields, with probability at least $1 - 2\delta$,
$$\frac{R_T}{T} = O\left( \left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 N_T} \right)^{\frac{1+\alpha}{2}} \right).$$

SLIDE 46

Results for low noise

Corollary. Under the multiclass Tsybakov condition, the BBQ$_\epsilon$ rule yields, with probability at least $1 - 2\delta$,
$$\frac{R_T}{T} = O\left( \left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 N_T} \right)^{\frac{1+\alpha}{2}} \right).$$
Similar result for the DGS rule
$1/\sqrt{N_T}$ when $\alpha = 0$, and $\exp(-c_0 N_T)$ as $\alpha \to \infty$

SLIDE 47

Results for low noise

Corollary. Under the multiclass Tsybakov condition, the BBQ$_\epsilon$ rule yields, with probability at least $1 - 2\delta$,
$$\frac{R_T}{T} = O\left( \left( \frac{\gamma_u^2 d^2 K}{\gamma_\ell^2 N_T} \right)^{\frac{1+\alpha}{2}} \right).$$
Similar result for the DGS rule
$1/\sqrt{N_T}$ when $\alpha = 0$, and $\exp(-c_0 N_T)$ as $\alpha \to \infty$
$R_T = \Omega\left(N_T^{-(1+\alpha)/2}\right)$ under the noise condition $\Rightarrow$ optimality

SLIDE 48

Numerical simulations

Synthetic mixture-of-Gaussians data in $\mathbb{R}^{1000}$
Evaluated BBQ, DGS, Random and Passive
0/1 cost matrix, multiclass logistic loss

[Plots: ratio of active to passive regret versus number of queries, for $K = 5$ and $K = 10$; curves for Passive, Random, BBQ and DGS.]

SLIDE 49

Model mismatch

[Plot: regret ratio versus number of queries under model mismatch; curves for Passive, Random, BBQ and DGS.]

SLIDE 50

Model mismatch

[Plot: regret ratio versus number of queries under model mismatch; curves for Passive, Random, BBQ and DGS.]

An additional safety guarantee in the paper ensures the algorithms are never worse than random sampling under model mismatch

SLIDE 51

Conclusions

Efficient active learning algorithm for cost-sensitive multiclass GLMs
Bounds on regret and label complexity
Generalization of the binary Tsybakov noise condition to the multiclass case
Optimal regret as a function of the number of queries under the noise condition
Applications to communication-efficient distributed learning

SLIDE 52

Para-active learning

Sift for informative examples in parallel
Update the model on selected examples

SLIDE 53

Thank You
